{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "R9APPZy4_RiK" }, "source": [ "# PTT gossip classification\n", "\n", "In this chapter we `finetune` the Chinese pretrained model `bert-base-chinese`." ] }, { "cell_type": "code", "source": [ "!pip install transformers" ], "metadata": { "id": "AA45qLICIUkA" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AVxgvyPR_RiP" }, "outputs": [], "source": [ "import tensorflow as tf\n", "import tensorflow_datasets as tfds\n", "import numpy as np\n", "import pandas as pd\n", "import os\n", "from sklearn.metrics import classification_report, confusion_matrix\n", "\n", "from sklearn.model_selection import train_test_split\n", "# Import only the names this notebook uses instead of a wildcard import\n", "from transformers import (BertTokenizer,\n", "                          TFBertForSequenceClassification,\n", "                          glue_convert_examples_to_features)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TyF7-Sun_RiT" }, "outputs": [], "source": [ "model = TFBertForSequenceClassification.from_pretrained('bert-base-chinese')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "uJ63QJsn_RiU" }, "outputs": [], "source": [ "tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')" ] }, { "cell_type": "markdown", "metadata": { "id": "qjMn2uq0_RiW" }, "source": [ "### Data overview\n", "\n", "We use data crawled and compiled from the PTT Gossiping board. A label of $0$ means the comment received fewer pushes (upvotes) than boos (downvotes), and $1$ means it received more pushes than boos, so this is a `Text classification` task (binary classification)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BS_2msQj_RiW" }, "outputs": [], "source": [ "# Download and unzip the dataset\n", "!wget -q https://github.com/TA-aiacademy/course_3.0/releases/download/v2.5_nlp/NLP_part5.zip\n", "!unzip -q NLP_part5.zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ghO72l_U_RiX" }, "outputs": [], "source": [ "ptt = pd.read_csv('Data/ptt_gossip.csv')\n", "\n", "# Keep at most 512 characters per sentence; BERT accepts at most 512 tokens\n", "bert_max_length = 512\n", "ptt['sentence'] = [t[:bert_max_length] for t in ptt.sentence]" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "id": "eadeVOZ2_RiX" }, "outputs": [], "source": [ "ptt.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HGHvpwiY_RiY" }, "outputs": [], "source": [ "\"\"\"\n", "訓練集80%,測試集20%\n", "\"\"\"\n", "train_size = 0.8\n", "\n", "mask = np.random.rand(len(ptt)) < train_size\n", "train_dataset = ptt[mask]\n", "valid_dataset = ptt[~mask]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Kz-tkrcA_RiY" }, "outputs": [], "source": [ "train_size = len(train_dataset)\n", "valid_size = len(valid_dataset)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iI3oKVJq_RiZ" }, "outputs": [], "source": [ "print('Train size: ', train_size)\n", "print('Valid size: ', valid_size)" ] }, { "cell_type": "markdown", "metadata": { "id": "83_OlhMm_Ria" }, "source": [ "### Convert to tensor\n", "\n", "各種`Transformer`預訓練都支持`tf.tensor`輸入格式,需要將資料集轉為`tf.tensor`格式。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mSo1Rgb5_Ria" }, "outputs": [], "source": [ "train_dataset = tf.data.Dataset.from_tensor_slices(dict(train_dataset))\n", "valid_dataset = tf.data.Dataset.from_tensor_slices(dict(valid_dataset))" ] }, { "cell_type": "markdown", "metadata": { "id": "3l241pCD_Rib" }, "source": [ "### Traing data format\n", "\n", "使用`glue_convert_examples_to_features`將資料集轉為模型可讀取格式,因為是二元分類,所以我們使用的任務為`cola`,`cola`是`bert`在`finetune`時的任務之一,一樣是二元分類任務,我們可以套用他的輸入格式來進行轉換,而在中文部分目前的預訓練模型都是用`chararcter-level`進行斷詞,所以我們將`max_length`提高至$256$,下表為在`Titan X 12G`上`finetune`的參數限制,表示模型以及多少句子長度對應其最大的`batch_size`,需要注意其硬體限制,而`1080ti`為`11G`,可以使用句子長度`256`搭配`batch_size`為16。" ] }, { "cell_type": "markdown", "metadata": { "id": "cYDkd6uM_Rib" }, "source": [ "\"Drawing\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "F_AzIGxn_Rib" }, "outputs": [], "source": [ "max_length = 512\n", "task = 'cola'\n", "\n", "train_dataset = 
glue_convert_examples_to_features(train_dataset,\n", "                                                  tokenizer,\n", "                                                  max_length,\n", "                                                  task)\n", "valid_dataset = glue_convert_examples_to_features(valid_dataset,\n", "                                                  tokenizer,\n", "                                                  max_length,\n", "                                                  task)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "uIwqavVv_Rib" }, "outputs": [], "source": [ "# Pull one converted example to inspect its (features, label) structure\n", "train_temp = next(iter(train_dataset))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5rWk2ZcR_Ric" }, "outputs": [], "source": [ "train_temp" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Jfh8pjAD_Ric" }, "outputs": [], "source": [ "buffer_size = 100\n", "train_bz = 6\n", "epochs = 3\n", "valid_bz = 6\n", "\n", "# Shuffle and batch the training set, repeating it once per epoch\n", "train_gen = train_dataset.shuffle(buffer_size).batch(train_bz).repeat(epochs)\n", "valid_gen = valid_dataset.batch(valid_bz)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "An8ZbS45_Ric" }, "outputs": [], "source": [ "optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5,\n", "                                     epsilon=1e-8,\n", "                                     clipnorm=1.0)\n", "# The model outputs raw logits, so the loss applies softmax itself\n", "loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,\n", "                                                     reduction=tf.keras.losses.Reduction.SUM_OVER_BATCH_SIZE)\n", "model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8pZQPs4D_Ric" }, "outputs": [], "source": [ "history = model.fit(train_gen,\n", "                    epochs=epochs,\n", "                    steps_per_epoch=train_size//train_bz,\n", "                    validation_data=valid_gen,\n", "                    validation_steps=valid_size//valid_bz)" ] }, { "cell_type": "markdown", "metadata": { "id": "8FA3BFzf_Rid" }, "source": [ "## Save model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oPYABXIt_Rid" }, "outputs": [], "source": [ "save_path = 'save_ptt'\n", "# Create the output directory if it does not exist yet\n", "os.makedirs(save_path, exist_ok=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JpT_C3sO_Rid" }, "outputs": [], "source": [ 
"model.save_pretrained('./save_ptt/')" ] }, { "cell_type": "markdown", "metadata": { "id": "UCpaDqzs_Rid" }, "source": [ "## Evaluation\n", "\n", "畫出`precision`, `recall`, `f1-score`以及`confusion matrix`評估模型表現。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "M8JbRMXi_Rid" }, "outputs": [], "source": [ "valid_pred = model.predict(valid_gen)\n", "valid_pred_ids = np.argmax(valid_pred.logits, axis=-1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6u33yCGa_Rid" }, "outputs": [], "source": [ "valid_label = list()\n", "for x in valid_dataset:\n", " valid_label += [x[1].numpy()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qGV-qFmH_Rie" }, "outputs": [], "source": [ "print(classification_report(y_pred=valid_pred_ids, y_true=valid_label))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "k1R1h6Uc_Rie" }, "outputs": [], "source": [ "confm = confusion_matrix(y_pred=valid_pred_ids, y_true=valid_label)\n", "\n", "index = ['Actual_0', 'Actual_1']\n", "columns = ['Pred_0', 'Pred_1']\n", "pd.DataFrame(confm, index=index, columns=columns)" ] }, { "cell_type": "markdown", "metadata": { "id": "0qnYM_Ua_Rie" }, "source": [ "## Load model and predict" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5U3LIDr5_Rie" }, "outputs": [], "source": [ "new_model = TFBertForSequenceClassification.from_pretrained('save_chinese/')\n", "tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "U2M2ZrXz_Rie" }, "outputs": [], "source": [ "sentence = [\"文瑋助教好壯\"]\n", "\n", "test_dataset = pd.DataFrame(dict(idx=list(range(len(sentence))),\n", " label=[0]*len(sentence),\n", " sentence=sentence))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b5t7H-BT_Rif" }, "outputs": [], "source": [ "test_dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": 
{ "id": "1CwG6flP_Rif" }, "outputs": [], "source": [ "test_gen = tf.data.Dataset.from_tensor_slices(dict(test_dataset))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qByb0A0P_Rif" }, "outputs": [], "source": [ "max_length = 512\n", "task = 'cola'\n", "test_gen = glue_convert_examples_to_features(test_gen, tokenizer, max_length, task)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hTYGL4cj_Rif" }, "outputs": [], "source": [ "test_gen = test_gen.batch(1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cPicbk-E_Rif" }, "outputs": [], "source": [ "next(iter(test_gen))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AzVwoaJj_Rig" }, "outputs": [], "source": [ "pred = new_model.predict(test_gen)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ldcEPDwH_Rig" }, "outputs": [], "source": [ "pred_ids = np.argmax(pred.logits, axis=-1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Qu9W4QG__Rim" }, "outputs": [], "source": [ "print(pred_ids[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Q5vF5Sgf_Rin" }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" }, "colab": { "provenance": [], "gpuType": "T4", "include_colab_link": true }, "accelerator": "GPU" }, "nbformat": 4, "nbformat_minor": 0 }