{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/TA-aiacademy/course_3.0/blob/v2-5_nlp/09_v2-5_NLP/Part5/02_Bert_finetune_ptt.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "R9APPZy4_RiK"
      },
      "source": [
        "# PTT gossip classification\n",
        "\n",
        "這章節我們使用中文預訓練模型`bert-base-chinese`來進行`finetune`。"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install transformers"
      ],
      "metadata": {
        "id": "AA45qLICIUkA"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "AVxgvyPR_RiP"
      },
      "outputs": [],
      "source": [
        "import tensorflow as tf\n",
        "import tensorflow_datasets as tfds\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import os\n",
        "from sklearn.metrics import classification_report, confusion_matrix\n",
        "\n",
        "from sklearn.model_selection import train_test_split\n",
        "from transformers import *"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "TyF7-Sun_RiT"
      },
      "outputs": [],
      "source": [
        "model = TFBertForSequenceClassification.from_pretrained('bert-base-chinese')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "uJ63QJsn_RiU"
      },
      "outputs": [],
      "source": [
        "tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qjMn2uq0_RiW"
      },
      "source": [
        "### Data overview\n",
        "\n",
        "我們使用從ptt八卦版進行爬蟲整理，$0$表示該留言的推數小於噓數，$1$表示該留言的推數大於噓數，所以這個任務是屬於`Text classification`任務(二元分類)。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "BS_2msQj_RiW"
      },
      "outputs": [],
      "source": [
        "# 上傳資料\n",
        "!wget -q https://github.com/TA-aiacademy/course_3.0/releases/download/v2.5_nlp/NLP_part5.zip\n",
        "!unzip -q NLP_part5.zip"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ghO72l_U_RiX"
      },
      "outputs": [],
      "source": [
        "ptt = pd.read_csv('Data/ptt_gossip.csv')\n",
        "\n",
        "bert_max_length = 512\n",
        "ptt['sentence'] = [t[:bert_max_length] for t in ptt.sentence]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "eadeVOZ2_RiX"
      },
      "outputs": [],
      "source": [
        "ptt.head()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "HGHvpwiY_RiY"
      },
      "outputs": [],
      "source": [
        "\"\"\"\n",
        "訓練集80%，測試集20%\n",
        "\"\"\"\n",
        "train_size = 0.8\n",
        "\n",
        "mask = np.random.rand(len(ptt)) < train_size\n",
        "train_dataset = ptt[mask]\n",
        "valid_dataset = ptt[~mask]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Kz-tkrcA_RiY"
      },
      "outputs": [],
      "source": [
        "train_size = len(train_dataset)\n",
        "valid_size = len(valid_dataset)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "iI3oKVJq_RiZ"
      },
      "outputs": [],
      "source": [
        "print('Train size: ', train_size)\n",
        "print('Valid size: ', valid_size)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "83_OlhMm_Ria"
      },
      "source": [
        "### Convert to tensor\n",
        "\n",
        "各種`Transformer`預訓練都支持`tf.tensor`輸入格式，需要將資料集轉為`tf.tensor`格式。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "mSo1Rgb5_Ria"
      },
      "outputs": [],
      "source": [
        "train_dataset = tf.data.Dataset.from_tensor_slices(dict(train_dataset))\n",
        "valid_dataset = tf.data.Dataset.from_tensor_slices(dict(valid_dataset))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3l241pCD_Rib"
      },
      "source": [
        "### Traing data format\n",
        "\n",
        "使用`glue_convert_examples_to_features`將資料集轉為模型可讀取格式，因為是二元分類，所以我們使用的任務為`cola`，`cola`是`bert`在`finetune`時的任務之一，一樣是二元分類任務，我們可以套用他的輸入格式來進行轉換，而在中文部分目前的預訓練模型都是用`chararcter-level`進行斷詞，所以我們將`max_length`提高至$256$，下表為在`Titan X 12G`上`finetune`的參數限制，表示模型以及多少句子長度對應其最大的`batch_size`，需要注意其硬體限制，而`1080ti`為`11G`，可以使用句子長度`256`搭配`batch_size`為16。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cYDkd6uM_Rib"
      },
      "source": [
        "<img src=\"https://hackmd.io/_uploads/Hybk-351p.png\" alt=\"Drawing\" style=\"width: 250px;\"/>"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "F_AzIGxn_Rib"
      },
      "outputs": [],
      "source": [
        "max_length = 512\n",
        "task = 'cola'\n",
        "\n",
        "train_dataset = glue_convert_examples_to_features(train_dataset,\n",
        "                                                  tokenizer,\n",
        "                                                  max_length,\n",
        "                                                  task)\n",
        "valid_dataset = glue_convert_examples_to_features(valid_dataset,\n",
        "                                                  tokenizer,\n",
        "                                                  max_length,\n",
        "                                                  task)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "uIwqavVv_Rib"
      },
      "outputs": [],
      "source": [
        "train_temp = next(iter(train_dataset))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5rWk2ZcR_Ric"
      },
      "outputs": [],
      "source": [
        "train_temp"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Jfh8pjAD_Ric"
      },
      "outputs": [],
      "source": [
        "buffer_size = 100\n",
        "train_bz = 6\n",
        "epochs = 3\n",
        "valid_bz = 6\n",
        "\n",
        "train_gen = train_dataset.shuffle(buffer_size).batch(train_bz).repeat(epochs)\n",
        "valid_gen = valid_dataset.batch(valid_bz)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "An8ZbS45_Ric"
      },
      "outputs": [],
      "source": [
        "optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5,\n",
        "                                     epsilon=1e-8,\n",
        "                                     clipnorm=1.0)\n",
        "loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,\n",
        "                                                     reduction=tf.keras.losses.Reduction.SUM_OVER_BATCH_SIZE)\n",
        "model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8pZQPs4D_Ric"
      },
      "outputs": [],
      "source": [
        "history = model.fit(train_gen,\n",
        "                    epochs=epochs,\n",
        "                    steps_per_epoch=train_size//train_bz,\n",
        "                    validation_data=valid_gen,\n",
        "                    validation_steps=valid_size//valid_bz)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8FA3BFzf_Rid"
      },
      "source": [
        "## Save model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oPYABXIt_Rid"
      },
      "outputs": [],
      "source": [
        "save_path = 'save_ptt'\n",
        "if not os.path.exists(save_path):\n",
        "    os.mkdir(save_path)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JpT_C3sO_Rid"
      },
      "outputs": [],
      "source": [
        "model.save_pretrained('./save_ptt/')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UCpaDqzs_Rid"
      },
      "source": [
        "## Evaluation\n",
        "\n",
        "畫出`precision`, `recall`, `f1-score`以及`confusion matrix`評估模型表現。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "M8JbRMXi_Rid"
      },
      "outputs": [],
      "source": [
        "valid_pred = model.predict(valid_gen)\n",
        "valid_pred_ids = np.argmax(valid_pred.logits, axis=-1)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6u33yCGa_Rid"
      },
      "outputs": [],
      "source": [
        "valid_label = list()\n",
        "for x in valid_dataset:\n",
        "    valid_label += [x[1].numpy()]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qGV-qFmH_Rie"
      },
      "outputs": [],
      "source": [
        "print(classification_report(y_pred=valid_pred_ids, y_true=valid_label))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "k1R1h6Uc_Rie"
      },
      "outputs": [],
      "source": [
        "confm = confusion_matrix(y_pred=valid_pred_ids, y_true=valid_label)\n",
        "\n",
        "index = ['Actual_0', 'Actual_1']\n",
        "columns = ['Pred_0', 'Pred_1']\n",
        "pd.DataFrame(confm, index=index, columns=columns)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0qnYM_Ua_Rie"
      },
      "source": [
        "## Load model and predict"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5U3LIDr5_Rie"
      },
      "outputs": [],
      "source": [
        "new_model = TFBertForSequenceClassification.from_pretrained('save_chinese/')\n",
        "tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "U2M2ZrXz_Rie"
      },
      "outputs": [],
      "source": [
        "sentence = [\"文瑋助教好壯\"]\n",
        "\n",
        "test_dataset = pd.DataFrame(dict(idx=list(range(len(sentence))),\n",
        "                                 label=[0]*len(sentence),\n",
        "                                 sentence=sentence))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "b5t7H-BT_Rif"
      },
      "outputs": [],
      "source": [
        "test_dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1CwG6flP_Rif"
      },
      "outputs": [],
      "source": [
        "test_gen = tf.data.Dataset.from_tensor_slices(dict(test_dataset))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qByb0A0P_Rif"
      },
      "outputs": [],
      "source": [
        "max_length = 512\n",
        "task = 'cola'\n",
        "test_gen = glue_convert_examples_to_features(test_gen, tokenizer, max_length, task)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "hTYGL4cj_Rif"
      },
      "outputs": [],
      "source": [
        "test_gen = test_gen.batch(1)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "cPicbk-E_Rif"
      },
      "outputs": [],
      "source": [
        "next(iter(test_gen))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "AzVwoaJj_Rig"
      },
      "outputs": [],
      "source": [
        "pred = new_model.predict(test_gen)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ldcEPDwH_Rig"
      },
      "outputs": [],
      "source": [
        "pred_ids = np.argmax(pred.logits, axis=-1)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Qu9W4QG__Rim"
      },
      "outputs": [],
      "source": [
        "print(pred_ids[0])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Q5vF5Sgf_Rin"
      },
      "outputs": [],
      "source": []
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.12"
    },
    "colab": {
      "provenance": [],
      "gpuType": "T4",
      "include_colab_link": true
    },
    "accelerator": "GPU"
  },
  "nbformat": 4,
  "nbformat_minor": 0
}