{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "LGCriYLCjfOA" }, "source": [ "# Transformer" ] }, { "cell_type": "markdown", "metadata": { "id": "IXvK-t8ZjfOI" }, "source": [ "\"Drawing\"" ] }, { "cell_type": "markdown", "metadata": { "id": "9KlqRvfVjfOJ" }, "source": [ "* [Seq2seq](https://arxiv.org/pdf/1409.3215.pdf)\n", "* [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)\n", "* [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025v5)\n", "* [Attention is all you need](https://arxiv.org/abs/1706.03762)\n", "* [GPT](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)\n", "* [BERT](https://arxiv.org/abs/1810.04805)\n", "* [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf)\n", "* [ERNIE](https://arxiv.org/abs/1905.07129)\n", "* [XLNet](https://arxiv.org/abs/1906.08237)\n", "* [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB)" ] }, { "cell_type": "markdown", "metadata": { "id": "4UJYX0AMjfOK" }, "source": [ "# Environment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JjJJyJTZYebt" }, "outputs": [], "source": [ "import tensorflow_datasets as tfds\n", "import tensorflow as tf\n", "import os\n", "from pprint import pprint\n", "\n", "import time\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import matplotlib as mpl\n", "\n", "print(tf.__version__)\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"" ] }, { "cell_type": "markdown", "metadata": { "id": "fd1NWMxjfsDd" }, "source": [ "### 建立資料夾路徑\n", "\n", "`vocab_file`: 儲存中英文字典(vocabulary)路徑\n", "\n", "`checkpoint`: 儲存模型路徑\n", "\n", "`log_dir`: 記錄實驗結果\n", "\n", "`download_dir`: 使用`wmt19`機器翻譯競賽的資料集,資料儲存路徑" ] }, { "cell_type": "code", "source": [ "# 上傳資料\n", "!wget -q https://github.com/TA-aiacademy/course_3.0/releases/download/v2.5_nlp/NLP_part4.zip\n", "!unzip -q NLP_part4.zip" ], "metadata": { "id": "6gjKgBR6midw" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-rqQ9n2fjfOP" }, "outputs": [], "source": [ "output_dir = \"nmt\"\n", "en_vocab_file = os.path.join(output_dir, \"en_vocab\")\n", "zh_vocab_file = os.path.join(output_dir, \"zh_vocab\")\n", "checkpoint_path = os.path.join(output_dir, \"checkpoints\")\n", "log_dir = os.path.join(output_dir, 'logs')\n", "download_dir = \"tensorflow-datasets/downloads\"\n", "\n", "if not os.path.exists(output_dir):\n", " os.makedirs(output_dir)" ] }, { "cell_type": "markdown", "metadata": { "id": "zECDCX1kjfOQ" }, "source": [ "## 查看wmt19 中英文對照資料集\n", "\n", "* `newscommentary_v14`: 新聞評論\n", "* `wikititles_v1`: wiki標題\n", "* `uncorpus_v1`: 聯合國數據" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bxtScXcKjfOR" }, "outputs": [], "source": [ "tmp_builder = tfds.builder(\"wmt19_translate/zh-en\")\n", "pprint(tmp_builder.subsets)" ] }, { "cell_type": "markdown", "metadata": { "id": "dqCxP1f0jfOS" }, "source": [ "## 透過`tf.DatasetBuilder`下載資料集\n", "\n", "https://www.tensorflow.org/datasets/catalog/wmt19_translate\n", "\n", "下載中英文的新聞評論資料集,會在`download_dir`下產生資料集,下次再執行就不需要使用`download_and_prepare`。\n", "\n", "`builder.info`顯示資料集細節" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bxhKra6JjfOT" }, "outputs": [], "source": [ "config = tfds.translate.wmt.WmtConfig(\n", " 
version=\"0.0.1\",\n", " language_pair=(\"zh\", \"en\"),\n", " subsets={\n", " tfds.Split.TRAIN: [\"newscommentary_v14\"]\n", " }\n", ")\n", "builder = tfds.builder(\"wmt_translate\", config=config)\n", "builder.download_and_prepare(download_dir=download_dir)" ] }, { "cell_type": "markdown", "metadata": { "id": "Qi_cPaiwjfOT" }, "source": [ "## 切割資料集\n", "70%訓練集,30%測試集" ] }, { "cell_type": "markdown", "metadata": { "id": "YMT1lxwHjfOT" }, "source": [ "## 透過`tf.DatasetBuilder`載入資料\n", "\n", "`assert`檢查型態" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DP1ugD4UjfOT" }, "outputs": [], "source": [ "train_perc = 70\n", "examples = builder.as_dataset(split=[f'train[:{train_perc}%]', f'train[{train_perc}%:]'], as_supervised=True)\n", "train_examples, val_examples = examples\n", "\n", "assert isinstance(train_examples, tf.data.Dataset)\n", "assert isinstance(val_examples, tf.data.Dataset)" ] }, { "cell_type": "markdown", "metadata": { "id": "58A2uMZUjfOU" }, "source": [ "## 使用`tfds.features.text.SubwordTextEncoder`載入與建立字典\n", "\n", "* `.load_from_file`: 從`.subwords`檔案讀取字典\n", "* `.build_from_corpus`: 建立`.subwords`字典\n", "\n", "中文字典將`max_subword_length`設為1,以字為單位進行斷詞,大幅度減少字典大小,降低複雜度。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KVBg5Q8tBk5z" }, "outputs": [], "source": [ "%%time\n", "try:\n", " tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.load_from_file(en_vocab_file)\n", " print('Load English vocabulary: %s' % en_vocab_file)\n", "except:\n", " print('Build English vocabulary: %s' % en_vocab_file)\n", " tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus((en.numpy() for en, zh in train_examples),\n", " target_vocab_size = 2**13)\n", " tokenizer_en.save_to_file(en_vocab_file)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XemfT4VTjfOU" }, "outputs": [], "source": [ "print('English vocabulary size: ', tokenizer_en.vocab_size)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AVA5AL1fjfOV" }, "outputs": [], "source": [ "%%time\n", "try:\n", " tokenizer_zh = tfds.deprecated.text.SubwordTextEncoder.load_from_file(zh_vocab_file)\n", " print('Load Chinese vocfabulary: %s' % zh_vocab_file)\n", "except:\n", " print('Build Chinese vocabulary: %s' % zh_vocab_file)\n", " tokenizer_zh = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus((zh.numpy() for en, zh in train_examples),\n", " target_vocab_size = 2**13, max_subword_length=1)\n", " tokenizer_zh.save_to_file(zh_vocab_file)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ERJYyfdVjfOV" }, "outputs": [], "source": [ "print('Chinese vocabulary size: ', tokenizer_zh.vocab_size)" ] }, { "cell_type": "markdown", "metadata": { "id": "r_xQLGi7jfOV" }, "source": [ "### Example\n", "\n", "英文的斷詞方式是以`wordpiece`進行斷詞。\n", "\n", "* 原始句子: `Transformer is awesome.`\n", "* 空白斷詞: `[Transformer, is, awesome, .]`\n", "* `Wordpiece`斷詞: `[Trans, former, is, aw, es, ome, .]`\n", "\n", "`Wordpiece`斷詞優點:\n", "* 有些字是由其他的`wordpiece`組成,例如說`Translation`, `Transpose`等等,可以降低字典大小,避免有些字可能在所有句子中只出現過一次。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4DYWukNFkGQN" }, "outputs": [], "source": [ "sample_string = 'Transformer is awesome.'\n", "\n", "tokenized_string_token = tokenizer_en.encode(sample_string)\n", "print('Tokenized string token is {}'.format(tokenized_string_token))\n", "\n", "tokenized_string = [tokenizer_en.decode([ts]) for ts in tokenized_string_token]\n", "print('Tokenized 
srting is {}'.format(tokenized_string))\n", "\n", "original_string = tokenizer_en.decode(tokenized_string_token)\n", "print('The original string: {}'.format(original_string))\n", "\n", "assert original_string == sample_string" ] }, { "cell_type": "markdown", "metadata": { "id": "s8LD3JfZjfOW" }, "source": [ "## 添加``,``在句子頭尾\n", "\n", "\"Drawing\"\n", "\n", "`.vocab_size`視為``, `.vocab_size+1`視為``\n", "\n", "之後所有的訓練資料都需要通過`train_examples`產生,然後再透過`encode`轉成`token_id`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UZwnPr4R055s" }, "outputs": [], "source": [ "def encode(en_t, zh_t):\n", " en_indics = [tokenizer_en.vocab_size] + tokenizer_en.encode(en_t.numpy()) + [tokenizer_en.vocab_size + 1]\n", " zh_indics = [tokenizer_zh.vocab_size] + tokenizer_zh.encode(zh_t.numpy()) + [tokenizer_zh.vocab_size + 1]\n", " return en_indics, zh_indics" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lWpRf-oSjfOX" }, "outputs": [], "source": [ "en_t, zh_t = next(iter(train_examples))\n", "en_indics, zh_indics = encode(en_t, zh_t)\n", "\n", "print('英文: %d' % tokenizer_en.vocab_size)\n", "print('英文: %d' % (tokenizer_en.vocab_size + 1))\n", "print('中文: %d' % tokenizer_zh.vocab_size)\n", "print('中文: %d' % (tokenizer_zh.vocab_size + 1))\n", "\n", "print('-' * 20)\n", "print('Before encode: (two tensor): ')\n", "pprint((en_t, zh_t))\n", "print()\n", "print('After encode: (two array): ')\n", "pprint((en_indics, zh_indics))" ] }, { "cell_type": "markdown", "metadata": { "id": "Tx1sFbR-9fRs" }, "source": [ "## 將`encode`函數的輸出型態轉為計算圖的`Tensor`\n", "\n", "如果直接將`train_examples`接上`encode`,會發生`'Tensor' object has no attribute 'numpy'`\n", "\n", "這是因為`encode`這個自定義函數是透過`tfds`來進行,而`tfds`的`map function`會採用`tf1.0`的`Graph mode`運算,所以無法直接使用`tf2.0`的`Eager mode`中的`attribute.numpy()`,最快的解決方式是透過`tf.py_function`強制讓所有操作都在`python`完成。\n", "\n", "https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/python/data/ops/dataset_ops.py#L1099-L1214" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2HpeIFe-jfOZ" }, "outputs": [], "source": [ "# import traceback\n", "\n", "# try:\n", "# train_examples.map(encode)\n", "# except AttributeError:\n", "# traceback.print_exc()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Mah1cS-P70Iz" }, "outputs": [], "source": [ "def tf_encode(en_t, zh_t):\n", "\n", " return tf.py_function(encode, [en_t, zh_t], [tf.int64, tf.int64])\n", "\n", "tmp_dataset = train_examples.map(tf_encode)\n", "en_indices, zh_indices = next(iter(tmp_dataset))\n", "\n", "print('After tf_encode: (two tensor)')\n", "print(en_indices)\n", "print(zh_indices)" ] }, { "cell_type": "markdown", "metadata": { "id": "xZ70WxDKjfOZ" }, "source": [ "## 限制句子長度\n", "\n", "為了加快訓練速度,使用`tf.logical_and`限制中英文句子長度,並使用`.filter`過濾。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c081xPGv1CPI" }, "outputs": [], "source": [ "max_length = 50\n", "def filter_max_length(en_t, zh_t, max_length = max_length):\n", "\n", " return tf.logical_and(tf.size(en_t) <= max_length,\n", " tf.size(zh_t) <= max_length)\n", "\n", "tmp_dataset = tmp_dataset.filter(filter_max_length)" ] }, { "cell_type": "markdown", "metadata": { "id": "AY2AQ2OojfOa" }, "source": [ "## Padding\n", "\n", "針對每個`batch`都進行中英文的`padding`。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UMV-3WTEjfOa" }, "outputs": [], "source": [ "batch_size = 64\n", "tmp_dataset = tmp_dataset.padded_batch(batch_size=batch_size, padded_shapes=([-1], [-1]))\n", 
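"# 補充說明: padded_batch 會將每個 batch 內較短的句子以 0(即 padding 的 index)補到該 batch 最長句子的長度\n", "# padded_shapes=([-1], [-1]) 表示中英文兩個欄位皆自動取各 batch 內的最大長度\n",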
"\n", "en_batch, zh_batch = next(iter(tmp_dataset))\n", "\n", "print('英文batch: ')\n", "print(en_batch)\n", "print('-' * 15)\n", "print('中文batch: ')\n", "print(zh_batch)" ] }, { "cell_type": "markdown", "metadata": { "id": "8Ql5EG8kjfOa" }, "source": [ "### 將`train_examples`與`val_examples`做同樣處理\n", "\n", "* `train`:\n", "\n", " - `map(tf_encode)`: 將字串轉成`token_id`。\n", " - `filter(filter_max_length)`:過濾最大句子長度。\n", " - `cache()`: 在每次迭代時將訓練資料先放進去`cache`裡面,加速訓練速度。\n", " - `shuffle(buffer_size)`: 從資料集中抽樣`buffer_size`放近`buffer`裡面,然後從`buffer`中抽取一個`batch`進行訓練,同時確保了隨機性與加快訓練速度。\n", " - `padded_batch(batch_size, padded_shapes=([-1],[-1]))`: `padding`長度。\n", "\n", "Tensor-core pipeline: https://www.tensorflow.org/guide/performance/datasets?hl=zh_cn" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9mk9AZdZ5bcS" }, "outputs": [], "source": [ "max_length = 50\n", "batch_size = 128\n", "buffer_size = 15000\n", "\n", "train_dataset = (train_examples\n", " .map(tf_encode)\n", " .filter(filter_max_length)\n", " .cache()\n", " .shuffle(buffer_size)\n", " .padded_batch(batch_size, padded_shapes=([-1],[-1])))\n", "\n", "\n", "val_dataset = (val_examples\n", " .map(tf_encode)\n", " .filter(filter_max_length)\n", " .padded_batch(batch_size, padded_shapes=([-1], [-1])))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_fXvfYVfQr2n" }, "outputs": [], "source": [ "en_batch, zh_batch = next(iter(train_dataset))\n", "\n", "print('英文batch tensor: ')\n", "print(en_batch)\n", "print('-' * 20)\n", "print('中文batch tensor: ')\n", "print(zh_batch)" ] }, { "cell_type": "markdown", "metadata": { "id": "WjZOW6JIjfOb" }, "source": [ "### 假設有新資料時的處理方式\n", "\n", "1. `map(tf_encode)`: 轉成`token_id`。\n", "2. `filter(filter_max_length)`: 過濾最大長度。\n", "3. `padded_batch()`: padding。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7Ba_0X1WjfOc" }, "outputs": [], "source": [ "demo_examples = [\n", " (\"It is important.\", \"這很重要。\"),\n", " (\"The math speaks for themselves.\", \"數學證明一切。\"),\n", "]\n", "\n", "batch_size = 2\n", "demo_examples = tf.data.Dataset.from_tensor_slices((\n", " [en for en, _ in demo_examples], [zh for _, zh in demo_examples]\n", "))\n", "\n", "demo_examples = demo_examples.map(tf_encode).filter(filter_max_length).padded_batch(batch_size, padded_shapes=([-1],[-1]))\n", "\n", "en_sample, zh_sample = next(iter(demo_examples))\n", "print(en_sample)\n", "print('-' * 15)\n", "pprint(zh_sample)" ] }, { "cell_type": "markdown", "metadata": { "id": "nBQuibYA4n0n" }, "source": [ "# Transformer\n", "\n", "\"Drawing\"\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ZV1m4ETejfOc" }, "source": [ "這裡分為`Encoder`與`Decoder`:\n", "\n", "1. `Encoder`: 負責接收`source sentence`,最主要的目的是將`source sentence`作為`q,k,v`進行`self-attention`。\n", "\n", "2. 
`Decoder`: 負責接收`target sentence`,最主要的目的有兩個:\n", " - 使用`target sentence`作為`q,k,v`進行`self-attention`。\n", " - 將`Encoder`的輸出作為`v,k`,然後與`Decoder`的`q`進行`self-attention`。" ] }, { "cell_type": "markdown", "metadata": { "id": "ZVSmOpiZjfOc" }, "source": [ "## Positional Encoding\n", "\n", "Word Embedding所表達的是所有詞向量之間的相似關係,而Transformer的做法是透過內積解決RNN的長距離依賴問題(long-range dependenices),但是Transformer這樣做卻沒有考慮到句子中的詞先後順序關係,透過Positional Encoding,讓詞向量之間不只因為word embedding語義關係而靠近,也可以因為詞之間的位置相互靠近而靠近。\n", "\n", "$$\n", "PE_{(pos,2i)} = \\sin(pos/10000^{\\frac{2i}{d_{model}}}) \\\\\n", "PE_{(pos,2i+1)} = \\cos(pos/10000^{\\frac{2i}{d_{model}}})\n", "$$\n", "\n", "#### 之後可以調整看看三角函數的參數,例如10000 -> 100" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WhIOZjMNKujn" }, "outputs": [], "source": [ "# 先建立角度\n", "def get_angles(pos, i, d_model):\n", " angle_rates = 1 / np.power(10000, 2 * (i//2) / np.float32(d_model))\n", " return pos * angle_rates" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1Rz82wEs5biZ" }, "outputs": [], "source": [ "def positional_encoding(position, d_model):\n", " \"\"\"\n", " 奇數sin\n", " 偶數cos\n", "\n", " 第一個字: [[sin(0),cos(0),sin(1),cos(1),...,sin(d_model-1),cos(d_model)]\n", " 第二個字: ,[sin(0),cos(0),sin(1),cos(1),...,sin(d_model-1),cos(d_model)]\n", " 第三個字: ,[sin(0),cos(0),sin(1),cos(1),...,sin(d_model-1),cos(d_model)]\n", " ...\n", " 第position個字: ,[sin(0),cos(0),sin(1),cos(1),...,sin(d_model-1),cos(d_model)]]\n", "\n", " return:\n", " (batch_size, position, d_model)\n", " \"\"\"\n", " pos = np.arange(position)[:, np.newaxis] # [[0],[1],[2],...,[pos-1]]\n", " i = np.arange(d_model)[np.newaxis, :] # [[0,1,2,3,...,d_model-1]]\n", "\n", " angle_rads = get_angles(pos, i, d_model) # (position, d_model)\n", "\n", "\n", " angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])\n", " angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])\n", "\n", " pos_encoding = angle_rads[np.newaxis, ...]\n", "\n", " return tf.cast(pos_encoding, dtype=tf.float32)" ] }, { "cell_type": "markdown", "metadata": { "id": "hXRfZ3eHjfOk" }, "source": [ "## Positional encoding 理解\n", "\n", "此例拿第25個token的positional encoding來跟其餘50個字(包含自己)的positional encoding計算內積(`np.dot`),能夠發現越靠近token 25的值內積越大,反之,越遠則內積越小。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mL6AOqHQjfOl" }, "outputs": [], "source": [ "position = 50 # 50個字\n", "d_model = 512 # 每個字的positional encoding維度為512\n", "pos_encoding = positional_encoding(position, d_model)\n", "\n", "inp = pos_encoding[0][25].numpy()\n", "\n", "dis_list = list()\n", "for i in range(50):\n", " tar = pos_encoding[0][i].numpy()\n", " dot_prod = np.dot(inp, tar)\n", " dis_list.append(dot_prod)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sSWXaBggjfOm" }, "outputs": [], "source": [ "plt.figure(figsize=(12,10))\n", "plt.plot(dis_list)\n", "plt.xticks(list(range(50)))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "a_b4ou4TYqUN" }, "source": [ "## Masking\n", "在Transformer中有兩個地方需要進行masking,以下兩種masking的方式都是先指定要進行masking的位置,然後將`QK`內積過後的attetion matrix進行masking。\n", "\n", " 1. `Padding_masking`: 句子padding的部分不需要被transformer注意到,透過mask,讓self-attention出來的weight接近0。\n", " 2. 
`Look_ahead_masking`: Decoder中的masked self attention會使用到,不讓當前的字去注意到之後所有的字,一樣是讓self-attention出來的weight接近0。\n", "\n", "\"Drawing\"\n", "\n", "### Padding masking" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "U2i8-e1s8ti9" }, "outputs": [], "source": [ "def create_padding_mask(seq):\n", " \"\"\"\n", " Input:\n", " 在字典中,padding的index為0\n", "\n", " 所以當Input遇到0時就將其變為1,之後當成要進行masking的index\n", "\n", " Return:\n", " 在中間插上兩個維度是為了後面attention時做broadcasting\n", " \"\"\"\n", " seq = tf.cast(tf.math.equal(seq, 0), tf.float32)\n", " return seq[:, tf.newaxis, tf.newaxis, :]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "A7BYeBCNvi7n" }, "outputs": [], "source": [ "x = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])\n", "print(x)\n", "print(create_padding_mask(x))\n", "# 1的位置就是要進行masking的位置" ] }, { "cell_type": "markdown", "metadata": { "id": "Z0hzukDBgVom" }, "source": [ "### Look ahead masking" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dVxS8OPI9uI0" }, "outputs": [], "source": [ "def create_look_ahead_mask(size):\n", " \"\"\"\n", " Input: 方陣size,以transformer來說就是self-attention的weigh matrix,將上三角進行masking\n", "\n", " tf.linalg.band_part(input, num_lower, num_upper)\n", " num_lower, num_upper: 從主對角線開始決定mask的起點,-1表示保留原值\n", " \"\"\"\n", " mask = 1 - tf.linalg.band_part(tf.ones((size,size)), -1, 0)\n", " return mask" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yxKGuXxaBeeE" }, "outputs": [], "source": [ "create_look_ahead_mask(3)" ] }, { "cell_type": "markdown", "metadata": { "id": "vsxEE_-Wa1gF" }, "source": [ "## Scaled dot-product attention(self-attention)\n", "\n", "\"Drawing\"\n", "\n", "1. Q與K進行矩陣相乘的地方就是實現Self-attention的地方,表示Q中的每個字對於K的每個字的attention。\n", "2. 接著進行Scale是為了避免後面通過Softmax之後的attention weight不是1就是0,這樣會造成很小的梯度(hard softmax)。\n", "3. 
通過Softmax之後就產生attention weight matrix,再乘上V,最後得到Context matrix。\n", "\n", "$$\n", "\\mathrm{Attention}(Q,K,V) = \\mathrm{Softmax}(\\frac{QK^\\top}{\\sqrt{d_k}})V\n", "$$\n", "\n", "\"Drawing\"\n", "\n", "\"Drawing\"\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LazzUq3bJ5SH" }, "outputs": [], "source": [ "def scaled_dot_product_attention(q, k, v, mask):\n", " \"\"\"\n", " Args:\n", " q: query shape == (..., seq_len_q, depth_k)\n", " k: key shape == (..., seq_len_k, depth_k)\n", " v: value shape == (..., seq_len_v, depth_v)\n", " mask: Float tensor with shape broadcastable to (..., seq_len_q, seq_len_k)\n", " \"\"\"\n", " # q,k矩陣相乘\n", " matmul_qk = tf.matmul(q, k, transpose_b=True) # (..., q_dim, k_dim)\n", "\n", " # Scaled\n", " dk = tf.cast(tf.shape(k)[-1], tf.float32)\n", " scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)\n", "\n", " # mask\n", " if mask is not None:\n", " scaled_attention_logits += (mask * -1e9)\n", " # Softmax最後一個維度(k_dim),表示每個字對於所有字的attention weights\n", " attention_weights = tf.nn.softmax(scaled_attention_logits, axis = -1)\n", "\n", " output = tf.matmul(attention_weights, v) # (..., q_dim, depth_v)\n", "\n", " return output, attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "n90YjClyInFy" }, "outputs": [], "source": [ "\"\"\"\n", "假設一個字(query)對四個字(key)進行self attention,得到attention weight之後再與value相乘\n", "\"\"\"\n", "\n", "temp_k = tf.constant([[10,0,0],\n", " [0,10,0],\n", " [0,0,10],\n", " [0,0,10]], dtype=tf.float32) # (4, 3)\n", "\n", "temp_v = tf.constant([[10,0,0],\n", " [0,10,0],\n", " [0,0,10],\n", " [0,0,10]], dtype=tf.float32) # (4, 3)\n", "\n", "temp_q = tf.constant([[0, 10, 0]], dtype=tf.float32) # (1, 3)\n", "\n", "output, attention_weights = scaled_dot_product_attention(temp_q, temp_k, temp_v, mask=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zg6k-fGhgXra" }, "outputs": [], "source": [ "print('Attention weights: ')\n", "print(attention_weights)\n", "print()\n", "print('Ouptut: ')\n", "print(output)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UAq3YOzUgXhb" }, "outputs": [], "source": [ "\"\"\"\n", "假設四個字(query)對四個字(key)進行self attention,然後將上三角形進行mask,得到attention weight之後再與value相乘\n", "\n", "將右上角mask掉之後觀察attention weights會發現上三角形的weigh趨近於0\n", "\"\"\"\n", "\n", "# 為了方便觀察weight,將temp_q都設為1\n", "temp_q = tf.constant([[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]], dtype=tf.float32) # (4, 3)\n", "\n", "mask = create_look_ahead_mask(temp_q.shape[0])\n", "\n", "output, attention_weights = scaled_dot_product_attention(temp_q, temp_k, temp_v, mask=mask)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6dlU8Tm-hYrF" }, "outputs": [], "source": [ "print('Attention weights: ')\n", "print(attention_weights)\n", "print()\n", "print('Ouptut: ')\n", "print(output)" ] }, { "cell_type": "markdown", "metadata": { "id": "fz5BMC8Kaoqo" }, "source": [ "## Multi-Head Attention\n", "\n", "\"Drawing\"\n", "\n", "將`q,k,v`分成num_heads份,各自做self-attention,然後再concat,通過dense輸出,分成num_heads的優點最主要是希望讓每個head各自注意到Sequence中不同的地方,而且切分成較小的矩陣還能加速訓練過程。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BSV3PPKsYecw" }, "outputs": [], "source": [ "class MultiHeadAttention(tf.keras.layers.Layer):\n", "\n", " def __init__(self, d_model, num_heads):\n", " super().__init__()\n", " self.num_heads = num_heads\n", " self.d_model = d_model\n", "\n", " # 確保d_model可以被num_heads整除\n", " assert d_model % self.num_heads == 
0\n", "\n", " self.depth = d_model // self.num_heads\n", "\n", " self.wq = tf.keras.layers.Dense(d_model)\n", " self.wk = tf.keras.layers.Dense(d_model)\n", " self.wv = tf.keras.layers.Dense(d_model)\n", "\n", " self.dense = tf.keras.layers.Dense(d_model)\n", "\n", " def split_heads(self, x, batch_size):\n", " \"\"\"\n", " 將d_model切割成(num_heads, depth)\n", " 為了後面做self-attention,transpose成(batch_size, num_heads, seq_len, depth)\n", " \"\"\"\n", " x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))\n", " return tf.transpose(x, perm=[0, 2, 1, 3])\n", "\n", " def __call__(self, v, k, q, mask):\n", " batch_size = tf.shape(q)[0]\n", "\n", " q = self.wq(q) # (batch_size, seq_len, d_model)\n", " k = self.wk(k) # (batch_size, seq_len, d_model)\n", " v = self.wv(v) # (batch_size, seq_len, d_model)\n", "\n", " q = self.split_heads(q, batch_size) # (bat d_model)\n", " k = self.split_heads(k, batch_size) # (batch_size, num_heads, seq_len_k, depth)\n", " v = self.split_heads(v, batch_size) # (batch_size, num_heads, seq_len_v, depth)\n", "\n", " # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)\n", " # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)\n", " scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)\n", "\n", " #為了將num_heads進行concat,transpose成(batch_size, seq_len_q, num_heads, depth)\n", " scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])\n", "\n", " # 合併後面兩維度 (batch_size, seq_len_q, d_model)\n", " concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))\n", "\n", " output = self.dense(concat_attention)\n", "\n", " return output, attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Hu94p-_-2_BX" }, "outputs": [], "source": [ "y = tf.random.uniform((1, 60, 512)) # (batch_size, seq_len, d_model)\n", "\n", "d_model = 512\n", "num_heads = 8\n", "temp_mha = MultiHeadAttention(d_model, num_heads)\n", "output, attention_weights = temp_mha(v=y, k=y, q=y, mask=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8l5p--9ojfOr" }, "outputs": [], "source": [ "# 輸出仍然是 (batch_size, seq_len, d_model)\n", "print('output shape', output.shape)\n", "\n", "# 8個heads各自有一個attention weight matrix\n", "print('attention_weights shape: ', attention_weights.shape)" ] }, { "cell_type": "markdown", "metadata": { "id": "RdDqGayx67vv" }, "source": [ "### Point-wise feed forward network\n", "\n", "$$\n", "FFN(x) = max(0, xW_1 + b_1)W_2+b_2\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ET7xLt0yCT6Z" }, "outputs": [], "source": [ "def point_wise_ffn(d_model, dff):\n", " return tf.keras.Sequential([tf.keras.layers.Dense(dff, activation='relu'), # (batch_size, seq_len, dff)\n", " tf.keras.layers.Dense(d_model)]) # (batch_size, seq_len, d_model)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mytb1lPyOHLB" }, "outputs": [], "source": [ "d_model = 512\n", "dff = 2048\n", "sample_ffn = point_wise_ffn(d_model, dff)\n", "sample_ffn(tf.random.uniform((64, 50, 512))).shape" ] }, { "cell_type": "markdown", "metadata": { "id": "yScbC0MUH8dS" }, "source": [ "## Encoderblock and Decoderblock\n", "\n", "\"Drawing\"\n" ] }, { "cell_type": "markdown", "metadata": { "id": "QFv-FNYUmvpn" }, "source": [ "### EncoderLayer\n", "\n", "這邊我們將以上橘色虛線`Encoderlayer`進行組合,其中主要由兩種`class`組成,分別是`MultiHeadAttention`和`point_wise_ffn`,依照上圖的順序為:\n", "\n", "1. `MultiHeadAttention(padding_mask)`\n", "\n", "2. 
`Residual connection` + `Layer Normalization`\n", "\n", "3. `point_wise_ffn`\n", "\n", "4. `Residual connection` + `Layer Normalization`\n", "\n", "另外`dropout`的部分是在論文中提及的,所以另外加上去。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ncyS-Ms3i2x_" }, "outputs": [], "source": [ "class EncoderLayer(tf.keras.layers.Layer):\n", " def __init__(self, d_model, num_heads, dff, dropout_rate = 0.1):\n", " super().__init__()\n", "\n", " self.mha = MultiHeadAttention(d_model, num_heads)\n", " self.ffn = point_wise_ffn(d_model, dff)\n", "\n", " self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", "\n", " self.dropout1 = tf.keras.layers.Dropout(dropout_rate)\n", " self.dropout2 = tf.keras.layers.Dropout(dropout_rate)\n", "\n", " def __call__(self, x, training, mask):\n", " # 不需要看Encoder的attention weight\n", " attention_output, _ = self.mha(v = x, k = x, q = x, mask=mask) # (batch_size, input_seq_len, d_model)\n", " # Inference時不需要使用dropout\n", " attention_output = self.dropout1(attention_output, training=training) # (batch_size, input_seq_len, d_model)\n", " # Residual + Layer Normalization\n", " out1 = self.layernorm1(x + attention_output) # (batch_size, input_seq_len, d_model)\n", "\n", " ffn_output = self.ffn(out1) # (batch_size, input_seq_len, d_model)\n", " ffn_output = self.dropout2(ffn_output, training=training) # (batch_size, input_seq_len, d_model)\n", " enc_output = self.layernorm2(out1 + ffn_output) # (batch_size, input_seq_len, d_model)\n", "\n", " return enc_output" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AzZRXdO0mI48" }, "outputs": [], "source": [ "d_model = 512\n", "num_heads = 8\n", "dff = 2048\n", "dropout_rate = 0.1\n", "\n", "sample_encooder_layer = EncoderLayer(d_model, num_heads, dff, dropout_rate)\n", "\n", "x = tf.random.uniform((64, 50, 512))\n", "training = False\n", "mask = None\n", "\n", "sample_encooder_layer_output = sample_encooder_layer(x, training, mask)\n", "sample_encooder_layer_output.shape # (batch_size, input_seq_len, d_model)" ] }, { "cell_type": "markdown", "metadata": { "id": "6LO_48Owmx_o" }, "source": [ "### DecoderLayer\n", "\n", "這邊我們將以上橘色虛線`Decoderlayer`進行組合,其中主要由兩種`class`組成,分別是`MultiHeadAttention`和`point_wise_ffn`,依照上圖的順序為:\n", "\n", "1. `MultiHeadAttention(padding_mask + look_ahead_mask)`\n", "\n", "2. `Residual connection` + `Layer Normalization`\n", "\n", "3. `MultiHeadAttention(padding_mask)`\n", "\n", "4. `Residual connection` + `Layer Normalization`\n", "\n", "5. `point_wise_ffn`\n", "\n", "6. 
``Residual connection` + `Layer Normalization``" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9SoX0-vd1hue" }, "outputs": [], "source": [ "class DecoderLayer(tf.keras.layers.Layer):\n", " def __init__(self, d_model, num_heads, dff, dropout_rate = 0.1):\n", " super().__init__()\n", "\n", " self.mha1 = MultiHeadAttention(d_model, num_heads)\n", " self.mha2 = MultiHeadAttention(d_model, num_heads)\n", "\n", " self.ffn = point_wise_ffn(d_model, dff)\n", "\n", " self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", "\n", " self.dropout1 = tf.keras.layers.Dropout(dropout_rate)\n", " self.dropout2 = tf.keras.layers.Dropout(dropout_rate)\n", " self.dropout3 = tf.keras.layers.Dropout(dropout_rate)\n", "\n", " def __call__(self, x, enc_output, training, look_ahead_mask, padding_mask):\n", "\n", " # masked self-attention,後面需要觀察attention weight matrix\n", " # 使用look_ahead_mask,讓decoder輸入只能往前看\n", " attention_output1, masked_attention_weights = self.mha1(v=x, k=x, q=x, mask=look_ahead_mask) # (batch_size, ouptut_seq_len, d_model)\n", " attention_output1 = self.dropout1(attention_output1, training=training) # (batch_size, ouptut_seq_len, d_model)\n", " attention_output1 = self.layernorm1(x + attention_output1) # (batch_size, ouptut_seq_len, d_model)\n", "\n", " # 使用padding_mask,忽略padding的attention weights,不讓任何字去注意到padding的位置\n", " attention_output2, dec_attention_weights = self.mha2(v=enc_output, k=enc_output, q=attention_output1, mask=padding_mask) # (batch_size, ouptut_seq_len, d_model)\n", " attention_output2 = self.dropout2(attention_output2, training=training) # (batch_size, ouptut_seq_len, d_model)\n", " attention_output2 = self.layernorm2(attention_output1 + attention_output2) # (batch_size, ouptut_seq_len, d_model)\n", "\n", " ffn_output = self.ffn(attention_output2)\n", " ffn_output = self.dropout3(ffn_output, training=training)\n", " dec_output = self.layernorm3(attention_output2 + ffn_output)\n", "\n", " return dec_output, masked_attention_weights, dec_attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ne2Bqx8k71l0" }, "outputs": [], "source": [ "d_model = 512\n", "num_heads = 8\n", "dff = 2048\n", "dropout_rate = 0.1\n", "\n", "x = tf.random.uniform((64, 60, 512))\n", "training = False\n", "look_ahead_mask = None\n", "padding_mask = None\n", "\n", "sample_decoder_layer = DecoderLayer(d_model, num_heads, dff, dropout_rate)\n", "sample_dec_output, masked_attention_weights, dec_attention_weights = sample_decoder_layer(x, sample_encooder_layer_output,\n", " training,\n", " look_ahead_mask,\n", " padding_mask) # (batch_size, target_seq_len, d_model)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "STw18LYCjfOt" }, "outputs": [], "source": [ "# (batch_size, output_seq_len, d_model)\n", "print('dec_output shape: ', sample_dec_output.shape)\n", "# (batch_size, num_heads, output_seq_len, output_seq_len)\n", "print('masked_attention_weights shape: ', masked_attention_weights.shape)\n", "# (batch_size, num_heads, output_seq_len, Input_seq_len)\n", "print('dec_attention_weights shape: ', dec_attention_weights.shape)" ] }, { "cell_type": "markdown", "metadata": { "id": "SE1H51Ajm0q1" }, "source": [ "### Encoder\n", "\n", "上面我們已經把`Encoderlayer`的主架構完成了,現在再把兩個輸入放進`Encoderlayer`形成整個`Encoder`。\n", "\n", "1. `Source Word embedding`\n", "2. 
`Positional encoding`\n", "3. `Encoder Layer * num_layers`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jpEox7gJ8FCI" }, "outputs": [], "source": [ "class Encoder(tf.keras.layers.Layer):\n", " def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, dropout_rate=0.1):\n", " super().__init__()\n", "\n", " self.num_layers = num_layers\n", " self.d_model = d_model\n", "\n", " self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)\n", " self.pos_encoding = positional_encoding(input_vocab_size, d_model)\n", "\n", " self.enc_layers = [EncoderLayer(d_model, num_heads, dff, dropout_rate) for _ in range(self.num_layers)]\n", "\n", " self.dropout = tf.keras.layers.Dropout(dropout_rate)\n", "\n", " def __call__(self, x, training, mask):\n", "\n", " seq_len = tf.shape(x)[1]\n", "\n", " x = self.embedding(x) # (batch_size, seq_len, d_model)\n", " x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))\n", " x += self.pos_encoding[:, :seq_len, :] # (batch_size, seq_len, d_model)\n", "\n", " x = self.dropout(x, training=training) # (batch_size, seq_len, d_model)\n", "\n", " for i in range(self.num_layers):\n", " x = self.enc_layers[i](x, training, mask) # (batch_size, seq_len, d_model)\n", "\n", " return x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8QG9nueFQKXx" }, "outputs": [], "source": [ "num_layers = 2\n", "d_model = 512\n", "num_heads = 8\n", "dff = 2048\n", "input_vocab_size = 10000\n", "dropout_rate = 0.1\n", "\n", "sample_encoder = Encoder(num_layers, d_model, num_heads, dff, input_vocab_size, dropout_rate)\n", "\n", "# 模擬輸入64個句子,每個句子padding成50個字\n", "x = tf.random.uniform((64, 50))\n", "training = False\n", "mask = None\n", "\n", "sample_encoder_output = sample_encoder(x, training, mask)\n", "# (batch_size, input_seq_len, d_model)\n", "print('sample_encoder_output shape: ',sample_encoder_output.shape) # (batch_size, input_seq_len, d_model)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZtT7PKzrXkNr" }, "source": [ "## Decoder\n", "\n", "`Decoder`的輸入也是`word embedding`與`positional encoding`。\n", "\n", "1. `Target Word embedding`\n", "2. `Positional encoding`\n", "3. 
`Decoder Layer * num_layers`\n", "\n", "因為要觀察`masked_attention_weights`以及`dec_attention_weight`,所以另外寫一個`attention_weights`儲存。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "d5_d5-PLQXwY" }, "outputs": [], "source": [ "class Decoder(tf.keras.layers.Layer):\n", " def __init__(self, num_layers, d_model, num_heads, dff, output_vocab_size, dropout_rate = 0.1):\n", " super().__init__()\n", "\n", " self.num_layers = num_layers\n", " self.d_model = d_model\n", "\n", " self.embedding = tf.keras.layers.Embedding(output_vocab_size, d_model)\n", " self.pos_encoding = positional_encoding(output_vocab_size, d_model)\n", "\n", " self.dec_layers = [DecoderLayer(d_model, num_heads, dff, dropout_rate) for _ in range(num_layers)]\n", "\n", " self.dropout = tf.keras.layers.Dropout(dropout_rate)\n", "\n", " def __call__(self, x, enc_output, training, look_ahead_mask, padding_mask):\n", "\n", " seq_len = tf.shape(x)[1]\n", " attention_weights = {}\n", "\n", " x = self.embedding(x)\n", " x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))\n", " x += self.pos_encoding[:, :seq_len, :]\n", "\n", " x = self.dropout(x, training=training)\n", "\n", " for i in range(self.num_layers):\n", " # x.shape: (batch_size, output_seq_len, d_model)\n", " x, masked_attention_weights, dec_attention_weights = self.dec_layers[i](x, enc_output, training, look_ahead_mask, padding_mask)\n", "\n", " # masked attention: (batch_size, num_head, output_seq_len, output_seq_len)\n", " # dec attention: (batch_size, num_head, output_seq_len, input_seq_len)\n", " attention_weights['decoder_layer{}_masked_attention_weights'.format(i + 1)] = masked_attention_weights\n", " attention_weights['decoder_layer{}_dec_attention_weights'.format(i + 1)] = dec_attention_weights\n", "\n", " return x, attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "a1jXoAMRZyvu" }, "outputs": [], "source": [ "num_layers = 2\n", "d_model = 512\n", "num_heads = 8\n", "dff = 2048\n", "output_vocab_size = 10000\n", "dropout_rate = 0.1\n", "\n", "sample_decoder = Decoder(num_layers, d_model, num_heads, dff, output_vocab_size, dropout_rate)\n", "\n", "# 模擬輸入64個句子,每個句子padding成20個字\n", "x = tf.random.uniform((64, 20))\n", "training = False\n", "look_ahead_mask = None\n", "padding_mask = None\n", "\n", "sample_decoder_output, attention_weights = sample_decoder(x, sample_encoder_output, training, look_ahead_mask, padding_mask)\n", "\n", "# (batch_size, output_seq_len, d_model)\n", "print('sample_decoder_output shape:', sample_decoder_output.shape)\n", "\n", "# masked attention: (batch_size, num_head, output_seq_len, output_seq_len)\n", "# dec attention: (batch_size, num_head, output_seq_len, input_seq_len)\n", "# dec attention表示 output_seq對input_seq的注意力\n", "for key, value in attention_weights.items():\n", " print(key, ' :', value.shape)" ] }, { "cell_type": "markdown", "metadata": { "id": "uERO1y54cOKq" }, "source": [ "## Transformer\n", "\n", "結合`Encoder`和`Decoder`,接上最後的`Dense`,輸出probability。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PED3bIpOYkBu" }, "outputs": [], "source": [ "class Transformer(tf.keras.Model):\n", " def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, output_vocab_size, dropout_rate = 0.1):\n", " super().__init__()\n", "\n", " self.encoder = Encoder(num_layers, d_model, num_heads, dff, input_vocab_size, dropout_rate)\n", " self.decoder = Decoder(num_layers, d_model, num_heads, dff, output_vocab_size, dropout_rate)\n", "\n", " 
self.final_layer = tf.keras.layers.Dense(output_vocab_size)\n", "\n", " def __call__(self, inp, tar, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):\n", "\n", " # enc_output.shape: (batch_size, inp_seq_len, d_model)\n", " enc_output = self.encoder(inp, training, enc_padding_mask)\n", "\n", " # dec_output.shape: (batch_size, tar_seq_len, d_model)\n", " dec_output, attention_weights = self.decoder(tar, enc_output, training, look_ahead_mask, dec_padding_mask)\n", "\n", " # final_output.shape: (batch_szie, tar_seq_len, output_vocab_size )\n", " final_output = self.final_layer(dec_output)\n", "\n", " return final_output, attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tJ4fbQcIkHW1" }, "outputs": [], "source": [ "num_layers = 2\n", "d_model = 512\n", "num_heads = 8\n", "dff = 2048\n", "input_vocab_size = 10000\n", "output_vocab_size = 10000\n", "dropout_rate = 0.1\n", "\n", "\n", "sample_transformer = Transformer(num_layers, d_model, num_heads, dff, input_vocab_size, output_vocab_size, dropout_rate)\n", "\n", "# Input: 模擬輸入64個句子,每個句子padding成50個字\n", "# Target: 模擬輸入64個句子,每個句子padding成20個字\n", "temp_input = tf.random.uniform((64, 50))\n", "temp_target = tf.random.uniform((64, 20))\n", "training = False\n", "enc_padding_mask = None\n", "look_ahead_mask = None\n", "dec_padding_mask = None\n", "\n", "final_output, attention_weights = sample_transformer(temp_input, temp_target, training, enc_padding_mask, look_ahead_mask, dec_padding_mask)\n", "\n", "print('final_output shape:', final_output.shape)\n", "\n", "# masked attention: (batch_size, num_head, output_seq_len, output_seq_len)\n", "# dec attention: (batch_size, num_head, output_seq_len, input_seq_len)\n", "# dec attention表示 output_seq對input_seq的注意力\n", "for key, value in attention_weights.items():\n", " print(key, ' :', value.shape) # (batch_size, tar_seq_len, target_vocab_size)" ] }, { "cell_type": "markdown", "metadata": { "id": "GOmWW--yP3zx" }, "source": [ "## Optimizer and Customer Learning rate\n", "論文使用`Adam`搭配客製化的`Learning rate`,`Learning rate`在warmup_steps前遞增,在warmup_step後遞減。\n", "\n", "$$\n", "lrate = d^{-0.5}_{model}\\times min(step\\_num^{-0.5},\\;step\\_num \\times warmup\\_steps^{-1.5})\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iYQdOO1axwEI" }, "outputs": [], "source": [ "class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):\n", " def __init__(self, d_model, warmup_steps=4000):\n", " super(CustomSchedule, self).__init__()\n", "\n", " self.d_model = tf.math.rsqrt(tf.cast(d_model, tf.float32))\n", "\n", " self.warmup_steps = warmup_steps\n", "\n", " def __call__(self, step):\n", " step = tf.cast(step, tf.float32)\n", " arg1 = tf.math.rsqrt(step) # step_num^{-0.5}\n", " arg2 = step * (self.warmup_steps ** -1.5) # step_num * warmup_step^{-1.5}\n", "\n", " return self.d_model * tf.math.minimum(arg1, arg2)" ] }, { "cell_type": "markdown", "metadata": { "id": "b3NCEMvvjfOx" }, "source": [ "### 不同`warmup_steps`對於`learning rate`的影響" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7r4scdulztRx" }, "outputs": [], "source": [ "d_models = 512\n", "warmup_steps = [3000 ,4000, 5000, 6000]\n", "\n", "step = tf.range(50000, dtype=tf.float32)\n", "\n", "for warmup_step in warmup_steps:\n", " temp_learning_rate_schedule = CustomSchedule(d_model, warmup_step)\n", " plt.plot(temp_learning_rate_schedule(step), label = str(warmup_step))\n", " plt.ylabel('Learning Rate')\n", " plt.xlabel('Train Step')\n", " 
plt.legend(loc='upper right')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pDbD5QVejfOx" }, "outputs": [], "source": [ "d_model = 512\n", "warmup_steps = 4000\n", "# learning_rate = CustomSchedule(d_model, warmup_steps)\n", "\n", "beta_1 = 0.9\n", "beta_2 = 0.98\n", "epsilon = 1e-9\n", "optimizer = tf.keras.optimizers.Adam(learning_rate=CustomSchedule(d_model, warmup_steps), beta_1=beta_1, beta_2=beta_2, epsilon=epsilon)" ] }, { "cell_type": "markdown", "metadata": { "id": "oxGJtoDuYIHL" }, "source": [ "### Loss and metrics\n", "\n", "不需要計算句子中`padding`位置的`loss`,所以需要進行mask。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Wx9V7lOFjfOy" }, "outputs": [], "source": [ "def loss_function(real, pred):\n", "\n", " mask = tf.math.logical_not(tf.math.equal(real, 0)) # 將sequence中padding(index為0)的部分設為False\n", "\n", " loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')\n", " \"\"\"\n", " from_logits: y_pred is expected to be a logits tensor. By default, we assume that y_pred encodes a probability distribution.\n", " reduction: the reduction schedule of output loss vectors. `https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/losses/Reduction`\n", " \"\"\"\n", "\n", " loss_ = loss_object(real, pred)\n", "\n", " mask = tf.cast(mask, dtype=loss_.dtype)\n", " loss_ *= mask # 只計算非padding的loss\n", "\n", " return tf.reduce_mean(loss_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HaAoEvWljfOy" }, "outputs": [], "source": [ "# Loss sample\n", "cce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')\n", "\n", "y_true = tf.constant([0, 1, 0], dtype=tf.float32)\n", "y_pred = tf.constant([[.95, .05], [.11, .89], [.05, .95]], dtype=tf.float32)\n", "\n", "loss = cce(y_true, y_pred)\n", "print('Loss: ', loss.numpy()) # Loss: 0.6532173" ] }, { "cell_type": "markdown", "metadata": { "id": "DbRAJt_PjfOy" }, "source": [ "### Loss, Accuracy" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "phlyxMnm-Tpx" }, "outputs": [], "source": [ "train_loss = tf.keras.metrics.Mean(name='train_loss')\n", "train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(\n", " name='train_accuracy')" ] }, { "cell_type": "markdown", "metadata": { "id": "aeHumfr7zmMa" }, "source": [ "### Create masking\n", "\n", "建立訓練時`Encoder`和`Decoder`需要用到的masking\n", "\n", "* `Encoder`:\n", " - 第一個Multi-head attention需要Source的`padding_mask`\n", "\n", "\n", "* `Decoder`:\n", " - 第一個Masked Multi-head attention需要Target的`padding_mask` + `look_ahead_mask`\n", " - 第二個Multi-head attention需要Target的`padding_mask`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZOJUSB1T8GjM" }, "outputs": [], "source": [ "def create_masks(inp, tar):\n", "\n", " # Encoder padding mask\n", " enc_padding_mask = create_padding_mask(inp)\n", "\n", " # Decoder 2nd Multi-head attention\n", " dec_padding_mask = create_padding_mask(inp)\n", "\n", " # Decoder 1st Masked Multi-head attention\n", " look_ahead_mask = create_look_ahead_mask(size=tf.shape(tar)[1]) # 建立只能往前看的mask矩陣\n", " dec_target_padding_mask = create_padding_mask(tar) # padding_mask\n", " combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask) # 返回兩者各別最大的值,也就是都是1的位置\n", "\n", " return enc_padding_mask, combined_mask, dec_padding_mask" ] }, { "cell_type": "markdown", "metadata": { "id": "Fzuf06YZp66w" }, "source": [ "### Set Parameters and Transformer" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "id": "2vKR9EcWjfO1" }, "outputs": [], "source": [ "num_layers = 4\n", "d_model = 128\n", "dff = 512\n", "num_heads = 8\n", "input_vocab_size = tokenizer_en.vocab_size + 2\n", "output_vocab_size = tokenizer_zh.vocab_size + 2\n", "dropout_rate = 0.1\n", "\n", "epochs = 5\n", "transformer = Transformer(num_layers, d_model, num_heads, dff, input_vocab_size, output_vocab_size, dropout_rate)" ] }, { "cell_type": "markdown", "metadata": { "id": "_69MCxBfjfO1" }, "source": [ "## Checkpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hNhuYfllndLZ" }, "outputs": [], "source": [ "ckpt = tf.train.Checkpoint(transformer = transformer, optimizer = optimizer)\n", "\n", "record_params = f'{num_layers}layers_{d_model}d_model_{num_heads}heads_{dff}dff'\n", "checkpoint_path = os.path.join(checkpoint_path, record_params)\n", "log_dir = os.path.join(log_dir, record_params)\n", "\n", "# 只保留最近3次訓練結果\n", "ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=3)\n", "\n", "# 檢查在checkpoint_path上是否有已訓練的checkpoint,有就叫ckpt進行讀取\n", "if ckpt_manager.latest_checkpoint:\n", " ckpt.restore(ckpt_manager.latest_checkpoint)\n", " print('Latest checkpoint restored')" ] }, { "cell_type": "markdown", "metadata": { "id": "0Di_Yaa1gf9r" }, "source": [ "## Define training step\n", "\n", "訓練時採用`Teacher forcing`,直接輸入給Decoder正確答案,因為若使用`Recursive`預測方式,預測錯誤則會導致之後面接收到錯誤的資訊。\n", "\n", "預測時則採用`AutoRegressive`方式遞迴預測。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iJwmp9OE29oj" }, "outputs": [], "source": [ "@tf.function(input_signature=(tf.TensorSpec(shape=[None, None], dtype=tf.int64), tf.TensorSpec(shape=[None, None], dtype=tf.int64)))\n", "def train_step(inp, tar):\n", "\n", " # teacher forcing\n", " tar_inp = tar[:, :-1] # Deocder的target輸入不需要\n", " tar_real = tar[:, 1:] # Decdoer的target輸出不需要\n", "\n", " enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)\n", "\n", "\n", " # 記錄梯度,之後做梯度下降\n", " Training = True\n", " with tf.GradientTape() as tape:\n", " predictions, _ = transformer(inp, tar_inp, Training, enc_padding_mask, combined_mask, dec_padding_mask)\n", " loss = loss_function(tar_real, predictions)\n", "\n", " # 拿出所有可訓練參數的gradient\n", " gradients = tape.gradient(loss, transformer.trainable_variables)\n", " # 呼叫Adam透過gradient更新參數\n", " optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))\n", "\n", " # 輸出loss以及acc,之後準備給Tensorboard記錄\n", " train_loss(loss)\n", " train_accuracy(tar_real, predictions)" ] }, { "cell_type": "markdown", "metadata": { "id": "6tDNfGE_jfO2" }, "source": [ "## Training\n", "\n", "使用Tensorboard記錄Loss以及Accuracy" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bbvmaKNiznHZ" }, "outputs": [], "source": [ "# Tensorboard\n", "summary_writer = tf.summary.create_file_writer(logdir=log_dir)\n", "\n", "for epoch in range(epochs):\n", " start = time.time()\n", "\n", " # 每次epoch重置Tensorboard metrics\n", " train_loss.reset_states()\n", " train_accuracy.reset_states()\n", "\n", " # 依序訓練所有batch\n", " for (batch, (inp, tar)) in enumerate(train_dataset):\n", " train_step(inp, tar)\n", "\n", " if batch % 50 == 0:\n", " print ('Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1, batch, train_loss.result(), train_accuracy.result()))\n", "\n", " # 每2個epoch就儲存模型\n", " if (epoch + 1) % 2 == 0:\n", " ckpt_save_path = ckpt_manager.save()\n", " print('Saving checkpoint for epoch {} at {}'.format(epoch+1, ckpt_save_path))\n", "\n", 
" with summary_writer.as_default():\n", " tf.summary.scalar('train_loss', train_loss.result(), step=epoch+1)\n", " tf.summary.scalar('train_acc', train_accuracy.result(), step=epoch+1)\n", "\n", " print('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch+1, train_loss.result(), train_accuracy.result()))\n", " print('Time taken for 1 epoch: {} secs\\n'.format(time.time()-start))\n" ] }, { "cell_type": "markdown", "metadata": { "id": "y6APsFrgImLW" }, "source": [ "## Evaluate\n", "\n", "當有新sentence要預測時,sentence一樣要做與encoder輸入的處理:\n", "\n", "1. Encoder輸入前後需要增加``與``\n", "2. Decoder的預測方式是用AutoRegressive,輸入是從``開始預測,第一次預測完將預測結果concat在``後,之後以此類推。\n", "\n", "`` => ` 我` => ` 我 好 ` => ` 我 好 帥`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5buvMlnvyrFm" }, "outputs": [], "source": [ "def evaluate(inp_sentence):\n", "\n", " start_token = [tokenizer_en.vocab_size]\n", " end_token = [tokenizer_en.vocab_size + 1]\n", "\n", " # Encoder的輸入需要增加,\n", " inp_sentence = start_token + tokenizer_en.encode(inp_sentence) + end_token\n", " encoder_input = tf.expand_dims(inp_sentence, axis=0)\n", "\n", " # Decoder的預測方式是autoregressive,即從開始預測,每次預測完拿取預測結果最後一個字的概率\n", " decoder_input = [tokenizer_zh.vocab_size]\n", " output = tf.expand_dims(decoder_input, axis=0)\n", "\n", " # AutoRegressive\n", " for i in range(max_length):\n", " # create mask\n", " enc_padding_mask, combined_mask, dec_padding_mask = create_masks(encoder_input, output)\n", "\n", " # prediction.shape == (batch_size, seq_len, vocab_size)\n", " predictions, attention_weights = transformer(encoder_input, output, False, enc_padding_mask, combined_mask, dec_padding_mask)\n", "\n", " # 拿取最後一個字作為預測結果\n", " prediction = predictions[:, -1:, :]\n", "\n", " prediction_id = tf.cast(tf.argmax(prediction, axis = -1), tf.int32)\n", "\n", " # 預測結果遇到就停止回傳output\n", " if prediction_id == tokenizer_zh.vocab_size + 1:\n", " return tf.squeeze(output, axis=0), attention_weights\n", "\n", " output = tf.concat([output, prediction_id], axis = -1)\n", "\n", " return tf.squeeze(output, axis=0), attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wNE6NyaFjfO3" }, "outputs": [], "source": [ "def map_from_pred(pred_tokens):\n", "\n", " pred_tokens = [t for t in pred_tokens if t < tokenizer_zh.vocab_size]\n", " pred_sentence = tokenizer_zh.decode(pred_tokens)\n", "\n", " return pred_sentence" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FVi16PuIjfO3" }, "outputs": [], "source": [ "sentence = 'Taiwan is a beautiful country.'\n", "predicted_seq, attention_weights = evaluate(sentence)\n", "predicted_seq = map_from_pred(predicted_seq)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EwQx63bBjfO3" }, "outputs": [], "source": [ "print('Source sentence:\\n',sentence)\n", "print()\n", "print('Predict sentence:\\n', predicted_seq)" ] }, { "cell_type": "markdown", "metadata": { "id": "lj-fNcE-jfO4" }, "source": [ "## Visualization\n", "\n", "我們畫出`Decoder`中的`self-attention`權重矩陣,每個`head`各有一個矩陣,這裏挑最後一層的`decoder_layer4_dec_attention_weights`。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7ouqbiD0jfO4" }, "outputs": [], "source": [ "for key,value in attention_weights.items():\n", " print(key,':',value.shape)\n", "\n", "layer_name = 'decoder_layer4_dec_attention_weights'" ] }, { "cell_type": "code", "source": [ "!wget -O /usr/share/fonts/truetype/liberation/simhei.ttf \"https://www.wfonts.com/download/data/2014/06/01/simhei/chinese.simhei.ttf\"\n", 
"import matplotlib as mpl\n", "zhfont = mpl.font_manager.FontProperties(fname='/usr/share/fonts/truetype/liberation/simhei.ttf')" ], "metadata": { "id": "s1sDxcPM3BwJ" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yxwsRuPHjfO4" }, "outputs": [], "source": [ "def plot_attention_weights(attention_weights, sentence, predicted_seq, layer_name):\n", " fig = plt.figure(figsize=(17, 14))\n", "\n", " sentence = tokenizer_en.encode(sentence)\n", "\n", " attention_weights = tf.squeeze(attention_weights[layer_name], axis=0)\n", " # (num_heads, tar_seq_len, inp_seq_len)\n", "\n", " # 只畫其中4個head\n", " #attention_weights = attention_weights[4:8,:,:]\n", "\n", " # 將每個 head 的注意權重畫出\n", " for head in range(attention_weights.shape[0]):\n", " ax = fig.add_subplot(4, 2, head + 1)\n", "\n", " attn_map = np.transpose(attention_weights[head])\n", " ax.matshow(attn_map, cmap='viridis') # (inp_seq_len, tar_seq_len)\n", "\n", " ax.set_xticks(range(len(predicted_seq)))\n", " ax.set_xticklabels(predicted_seq, fontproperties=zhfont)\n", "\n", " ax.set_yticks(range(len(sentence) + 2))\n", " ax.set_yticklabels([''] + [tokenizer_en.decode([i]) for i in sentence] + [''])\n", "\n", " ax.set_xlabel('Head {}'.format(head + 1), fontsize=13)\n", "\n", " plt.tight_layout()\n", " plt.show()\n", " plt.close(fig)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XXOvfTgDjfO5" }, "outputs": [], "source": [ "import logging\n", "logging.getLogger('matplotlib.font_manager').disabled = True\n", "\n", "plt.figure(figsize=(20,15))\n", "plot_attention_weights(attention_weights, sentence, predicted_seq, layer_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8lrRwW9QjfO5" }, "outputs": [], "source": [] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [ "s_qNSzzyaCbD" ], "name": "transformer.ipynb", "private_outputs": true, "provenance": [], "gpuType": "T4", "include_colab_link": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 0 }