{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/TA-aiacademy/course_3.0/blob/v2-5_nlp/09_v2-5_NLP/Part5/01_Bert_finetune_glue.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HXkiMDYvJppY"
      },
      "source": [
        "## Transformer Pre-trained model\n",
        "\n",
        "這一章節介紹目前自然語言處理最強大的模型-`Transformer`，`Transformer`相較於`RNN`系列的模型，`Transformer`在表現(`metrics`)以及計算效率(`parallel`)都有絕對的優勢，著名的`pre-train`模型如下，連結為各個模型的論文路徑，基本上這些模型都是`Transformer`的變形，不同的地方在於預訓練的策略，例如資料量大小、`Masked`的差異以及`Self-attention`矩陣的差異，最特別的是最後一個`ELECTRA`，是在`2019`年`11`月初提出的論文，結合了`transformer`還有`GAN`。\n",
        "\n",
        "* BERT: https://arxiv.org/abs/1810.04805\n",
        " - Masked Language Modeling + Next Sentence Prediction\n",
        "\n",
        "\n",
        "* GPT: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf\n",
        " - AutoRegressive Prediction\n",
        "\n",
        "\n",
        "* Transformer-XL: https://arxiv.org/abs/1901.02860\n",
        " - Learning dependency beyond a fixed length(>512)\n",
        "\n",
        "\n",
        "* XLNet: https://arxiv.org/abs/1906.08237\n",
        " - Permutation Modeling\n",
        "\n",
        "\n",
        "* XLM: https://arxiv.org/abs/1901.07291\n",
        " - Pretrain on cross-lingual language\n",
        "\n",
        "\n",
        "* RoBERTa: https://arxiv.org/abs/1907.11692\n",
        " - Pretrain model longer, more data\n",
        "\n",
        "\n",
        "* DistilBERT: https://arxiv.org/abs/1910.01108\n",
        "* CTRL: https://arxiv.org/abs/1909.05858\n",
        "* ELECTRA: https://openreview.net/pdf?id=r1xMH1BtvB\n",
        " - Transformer + GAN\n",
        "\n",
        "\n",
        "### [GLUE Benchmark](https://gluebenchmark.com/leaderboard)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2iKyfpOjJppi"
      },
      "source": [
        "## [Transformer](https://huggingface.co/transformers/)\n",
        "\n",
        "這邊我們使用`Transformers`套件來進行`finetune`，在進行`finetune`之前，需要了解自然語言處理任務上的差異，最主要分為兩種分類任務：\n",
        "\n",
        "1. `Text classification`: 輸入一個句子，輸出該句子的分類。\n",
        "2. `Sentence-Pair classification`: 輸入兩個句子的pair，輸出兩個句子之間的關係。\n",
        "\n",
        "* PS. 這些預訓練模型除了表現亮眼之外，最重要的貢獻在於預訓練後的`word embedding`，`word embedding`表示在文本中，詞與詞之間的關係，最著名的例子就是: 男性 - 女性 = 國王 - 皇后，像這樣的對應關係，有訓練良好的`word embedding`基本上在其他應用任務表現也會不錯，例如聊天機器人。\n",
        "\n",
        "在這裡我們會使用`BERT`來進行`finetune`。"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install transformers"
      ],
      "metadata": {
        "id": "pIjhM8PKhUqd"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "exyGLLLyJppj"
      },
      "outputs": [],
      "source": [
        "import tensorflow as tf\n",
        "import tensorflow_datasets as tfds\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import os\n",
        "from sklearn.metrics import classification_report, confusion_matrix\n",
        "\n",
        "from transformers import *"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Bwh5vrHAJppm"
      },
      "source": [
        "## 模型名稱解釋\n",
        "\n",
        "* `bert-base-uncased`:\n",
        "  - `bert`: 模型名稱\n",
        "  - `base`: 模型大小，`base`表示層數為$12$層, `word embedding(hidden)`為$768$維, `heads`為$12$，另外有`large`，層數為$24$層，`word embedding(hidden)`為$1024$維，`heads`為$16$。\n",
        "  - `uncased`: 表示對於文本的前處理，`uncased`表示字全部轉小寫，反之`cased`表示維持原樣。\n",
        "\n",
        "另外不只有這些模型，其餘模型可以參考：\n",
        "https://huggingface.co/transformers/pretrained_models.html"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8oXs6YrFJppn"
      },
      "outputs": [],
      "source": [
        "\"\"\"\n",
        "載入預訓練模型\n",
        "\"\"\"\n",
        "model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_84fytQ4Jppo"
      },
      "outputs": [],
      "source": [
        "\"\"\"\n",
        "載入模型斷詞工具\n",
        "\"\"\"\n",
        "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "g3stlbSEJppp"
      },
      "source": [
        "## Finetune\n",
        "\n",
        "![img](https://hackmd.io/_uploads/rJIRRucJp.png)\n",
        "\n",
        "所有預訓練模型都是在[GLUE Benchmark](https://gluebenchmark.com/leaderboard)進行競賽，這個競賽提供多種不同的自然語言處理任務，這些任務都是屬於分類任務，只是差別在於資料集大小以及來源而已，這裡我們使用其中一種分類任務`MRPC`來進行`finetune`。\n",
        "\n",
        "* 資料來源: [tensorflow dataset](https://www.tensorflow.org/datasets/catalog/overview#wmt19_translate)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0rSfVdSKJppr"
      },
      "outputs": [],
      "source": [
        "data, info = tfds.load('glue/mrpc', with_info=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yJt0Oa1sJpps"
      },
      "source": [
        "### Info\n",
        "\n",
        "資料集的介紹，最需要注意的地方就是資料集的樣子，因為`MRPC`是屬於`Sentnece-Pair classification`任務，所以資料集包括了`sentence1`和`sentence2`對應一個`label`，`MRPC`主要是在分類兩個句子之間的語義是否相同，`label`為$1$表示相同，反之$0$表示不同。\n",
        "\n",
        "因為是競賽資料集，所以資料集已經切割好為train, validation以及test。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Tfm-Ku3IJppu"
      },
      "outputs": [],
      "source": [
        "info"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fZfmalSaJppv"
      },
      "outputs": [],
      "source": [
        "for k, v in data.items():\n",
        "    print('key:', k)\n",
        "    print('data shapes:\\n', v)\n",
        "    print('-' * 20)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "v3TEjm_PJppw"
      },
      "source": [
        "### Dataset overview\n",
        "\n",
        "`tensorflow`儲存資料的方式都是以`tf.data.Data`型態來儲存，可以使用`iter`來建立`generator`，並使用`next`來觀看第一筆資料，資料中包含了`idx`、`label`、`sentence1`以及`sentence2`。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Exr6ID71Jppx"
      },
      "outputs": [],
      "source": [
        "assert isinstance(data['train'], tf.data.Dataset)\n",
        "\n",
        "temp = data['train']\n",
        "temp_gen = iter(temp)\n",
        "next(temp_gen)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BeI6KBvQJppx"
      },
      "source": [
        "### Training data format\n",
        "\n",
        "接下來我們需要將資料集轉換成模型可讀取的格式，輸入格式有三個：\n",
        "\n",
        "* `input_ids`: 這表示句子斷完詞之後轉成`token embeddings`，每一個詞有一個`id`，如下圖，其中`101`表示`[CLS]`，`102`表示`[SEP]`，因為`MPRC`是`Sentence-Pair classification`任務，所以下面的範例中會看到兩個`102`。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ebcx3H3EJppy"
      },
      "source": [
        "![](https://hackmd.io/_uploads/Hyl-1F5ka.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EHFOazasJppy"
      },
      "source": [
        "* `attention mask`: 因為`Transformer`會限制輸入句子的長度，最大限制為`512`，而我們選擇`128`，但不是所有的句子長度都是128，所以需要在後面進行`padding`(就是補0)，最主要的目的是不去計算`padding`位置的`loss`。\n",
        "\n",
        "* `token_type_ids`: 用來表示`Segment embedding`，如上圖，表示詞屬於哪一個句子，因為`MRPC`有兩個句子，所以`ids`有2種，`0`和`1`。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "J06oucDkJppy"
      },
      "outputs": [],
      "source": [
        "max_length = 128\n",
        "task = 'mrpc'\n",
        "\n",
        "train_dataset = glue_convert_examples_to_features(data['train'],\n",
        "                                                  tokenizer,\n",
        "                                                  max_length,\n",
        "                                                  task)\n",
        "valid_dataset = glue_convert_examples_to_features(data['validation'],\n",
        "                                                  tokenizer,\n",
        "                                                  max_length,\n",
        "                                                  task)\n",
        "test_dataset = glue_convert_examples_to_features(data['test'],\n",
        "                                                 tokenizer,\n",
        "                                                 max_length,\n",
        "                                                 task)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jRGDHgkLJppz"
      },
      "source": [
        "### Example\n",
        "\n",
        "觀察轉換過後的資料集。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "BjuMh9HUJppz"
      },
      "outputs": [],
      "source": [
        "next(iter(train_dataset))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0zRMmp4TJpp0"
      },
      "source": [
        "### Parameter settings\n",
        "\n",
        "在`tf.data.Dataset`中，通常會在訓練資料集後面接上三個標準的操作：\n",
        "\n",
        "* `.shuffle()`: 打亂資料集的方式，會先從資料集中隨機抽取`buffer_size`筆資料進去`buffer`，然後再`buffer`從中抽取`batch_size`筆資料進行訓練，丟進`buffer`的步驟主要是在處理無法一次將所有資料集丟進記憶體進行訓練的情形。\n",
        "\n",
        "* `.batch()`: 每次迭代使用的資料數量。\n",
        "* `.repeat()`: `epochs`數量。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fzETuRU5Jpp1"
      },
      "outputs": [],
      "source": [
        "buffer_size = 100\n",
        "train_bz = 16\n",
        "epochs = 3\n",
        "valid_bz = 50\n",
        "\n",
        "train_dataset = train_dataset.shuffle(buffer_size).batch(train_bz).repeat(epochs)\n",
        "valid_dataset = valid_dataset.batch(valid_bz)\n",
        "test_dataset = test_dataset.batch(valid_bz)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "KiD0nnEuJpp1"
      },
      "outputs": [],
      "source": [
        "optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5,\n",
        "                                     epsilon=1e-8,\n",
        "                                     clipnorm=1.0)\n",
        "loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,\n",
        "                                                     reduction=tf.keras.losses.Reduction.SUM_OVER_BATCH_SIZE)\n",
        "\n",
        "model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vZF6eevhJpp1"
      },
      "source": [
        "## Training\n",
        "\n",
        "* `.fit()`: 支援`generator`的輸入方式，也可以用`fit_generator`。\n",
        "* `steps_per_epoch`: 每個`epoch`訓練幾次，通常是$\\frac{train\\_size}{batch\\_size}$，遍歷整個訓練集。\n",
        "* `validation_steps`: 與`steps_per_epoch`同義。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vQum-yYqJpp2"
      },
      "outputs": [],
      "source": [
        "history = model.fit(train_dataset,\n",
        "                    epochs=epochs,\n",
        "                    steps_per_epoch=3668//train_bz,\n",
        "                    validation_data=valid_dataset,\n",
        "                    validation_steps=408//valid_bz)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GBKQb1C1Jpp2"
      },
      "source": [
        "## Evaluation"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "yg3h_nlNJpp3"
      },
      "outputs": [],
      "source": [
        "valid_pred = model.predict(valid_dataset)\n",
        "valid_pred_ids = np.argmax(valid_pred.logits, axis=-1)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "J88wbU2TJpp3"
      },
      "outputs": [],
      "source": [
        "valid_pred_ids"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "muhl_YSoJpp3"
      },
      "outputs": [],
      "source": [
        "\"\"\"\n",
        "從tf.data.Dataset中拿取label\n",
        "\"\"\"\n",
        "valid_label = list()\n",
        "for x in valid_dataset:\n",
        "    valid_label += x[1].numpy().tolist()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "LDrIPawYJpp5"
      },
      "outputs": [],
      "source": [
        "print(classification_report(y_pred=valid_pred_ids, y_true=valid_label))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "pIvSR9hiJpp6"
      },
      "outputs": [],
      "source": [
        "confm = confusion_matrix(y_pred=valid_pred_ids, y_true=valid_label)\n",
        "\n",
        "index = ['Actual_0', 'Actual_1']\n",
        "columns = ['Pred_0', 'Pred_1']\n",
        "pd.DataFrame(confm, index=index, columns=columns)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6ZwOmKWPJpp7"
      },
      "source": [
        "## Save model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "39tM6VxoJpp7"
      },
      "outputs": [],
      "source": [
        "save_path = 'save_glue'\n",
        "if not os.path.exists(save_path):\n",
        "    os.mkdir(save_path)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rSuv_pG1Jpp7"
      },
      "outputs": [],
      "source": [
        "model.save_pretrained(save_path)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cVup-611Jpp8"
      },
      "source": [
        "## Load model and predict\n",
        "\n",
        "這邊參考`MRPC`的輸入格式，一樣會使用`glue_convert_examples_to_features`這個函數進行轉換。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "TINVSbLIJpp8"
      },
      "outputs": [],
      "source": [
        "new_model = TFBertForSequenceClassification.from_pretrained(save_path)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "zXNAc_DQJpp8"
      },
      "outputs": [],
      "source": [
        "sentence1 = [\"Anorld Schwarzenegger is my idol.\"]\n",
        "sentence2 = [\"My favorite idol is Anorld Schwarzenegger.\"]\n",
        "\n",
        "test_dataset = pd.DataFrame(dict(idx=list(range(len(sentence1))),\n",
        "                                 label=[0]*len(sentence1),\n",
        "                                 sentence1=sentence1,\n",
        "                                 sentence2=sentence2))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "clDEXnFqJpp9"
      },
      "outputs": [],
      "source": [
        "\"\"\"\n",
        "模仿GLUE的輸入格式: (idx, label, sentence1, sentence2)\n",
        "其中label是假的，是因為輸入需要，不會影響預測值\n",
        "\"\"\"\n",
        "test_dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "43Wzz0tVJpp9"
      },
      "outputs": [],
      "source": [
        "test_gen = tf.data.Dataset.from_tensor_slices(dict(test_dataset))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ztX2pmcuJpp-"
      },
      "outputs": [],
      "source": [
        "test_gen = glue_convert_examples_to_features(test_gen, tokenizer, max_length, task)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vkq941BpJpqF"
      },
      "outputs": [],
      "source": [
        "test_gen = test_gen.batch(1)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jhfuQ1QiJpqG"
      },
      "outputs": [],
      "source": [
        "next(iter(test_gen))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "DP9cdd4wJpqG"
      },
      "outputs": [],
      "source": [
        "pred = new_model.predict(test_gen)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "yLuvF8LLJpqH"
      },
      "outputs": [],
      "source": [
        "pred"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3EMo89OQJpqH"
      },
      "outputs": [],
      "source": [
        "pred_ids = np.argmax(pred.logits, axis=-1)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "EkvlxkAJJpqI"
      },
      "outputs": [],
      "source": [
        "print(pred_ids[0])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "PvXgBKDfJpqI"
      },
      "outputs": [],
      "source": []
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.12"
    },
    "colab": {
      "provenance": [],
      "gpuType": "T4",
      "include_colab_link": true
    },
    "accelerator": "GPU"
  },
  "nbformat": 4,
  "nbformat_minor": 0
}