{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "LGCriYLCjfOA" }, "source": [ "# Transformer" ] }, { "cell_type": "markdown", "metadata": { "id": "IXvK-t8ZjfOI" }, "source": [ "\"Drawing\"" ] }, { "cell_type": "markdown", "metadata": { "id": "9KlqRvfVjfOJ" }, "source": [ "* [Seq2seq](https://arxiv.org/pdf/1409.3215.pdf)\n", "* [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)\n", "* [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025v5)\n", "* [Attention is all you need](https://arxiv.org/abs/1706.03762)\n", "* [GPT](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)\n", "* [BERT](https://arxiv.org/abs/1810.04805)\n", "* [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf)\n", "* [ERNIE](https://arxiv.org/abs/1905.07129)\n", "* [XLNet](https://arxiv.org/abs/1906.08237)\n", "* [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB)" ] }, { "cell_type": "markdown", "metadata": { "id": "4UJYX0AMjfOK" }, "source": [ "# Environment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JjJJyJTZYebt" }, "outputs": [], "source": [ "import tensorflow_datasets as tfds\n", "import tensorflow as tf\n", "import os\n", "from pprint import pprint\n", "\n", "import time\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import matplotlib as mpl\n", "\n", "print(tf.__version__)\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"" ] }, { "cell_type": "markdown", "metadata": { "id": "fd1NWMxjfsDd" }, "source": [ "### 建立資料夾路徑\n", "\n", "`vocab_file`: 儲存中英文字典(vocabulary)路徑\n", "\n", "`checkpoint`: 儲存模型路徑\n", "\n", "`log_dir`: 記錄實驗結果\n", "\n", "`download_dir`: 使用`wmt19`機器翻譯競賽的資料集,資料儲存路徑" ] }, { "cell_type": "code", "source": [ "# 上傳資料\n", "!wget -q https://github.com/TA-aiacademy/course_3.0/releases/download/v2.5_nlp/NLP_part4.zip\n", "!unzip -q NLP_part4.zip" ], "metadata": { "id": "6gjKgBR6midw" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-rqQ9n2fjfOP" }, "outputs": [], "source": [ "output_dir = \"nmt\"\n", "en_vocab_file = os.path.join(output_dir, \"en_vocab\")\n", "zh_vocab_file = os.path.join(output_dir, \"zh_vocab\")\n", "checkpoint_path = os.path.join(output_dir, \"checkpoints\")\n", "log_dir = os.path.join(output_dir, 'logs')\n", "download_dir = \"tensorflow-datasets/downloads\"\n", "\n", "if not os.path.exists(output_dir):\n", " os.makedirs(output_dir)" ] }, { "cell_type": "markdown", "metadata": { "id": "zECDCX1kjfOQ" }, "source": [ "## 查看wmt19 中英文對照資料集\n", "\n", "* `newscommentary_v14`: 新聞評論\n", "* `wikititles_v1`: wiki標題\n", "* `uncorpus_v1`: 聯合國數據" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bxtScXcKjfOR" }, "outputs": [], "source": [ "tmp_builder = tfds.builder(\"wmt19_translate/zh-en\")\n", "pprint(tmp_builder.subsets)" ] }, { "cell_type": "markdown", "metadata": { "id": "dqCxP1f0jfOS" }, "source": [ "## 透過`tf.DatasetBuilder`下載資料集\n", "\n", "https://www.tensorflow.org/datasets/catalog/wmt19_translate\n", "\n", "下載中英文的新聞評論資料集,會在`download_dir`下產生資料集,下次再執行就不需要使用`download_and_prepare`。\n", "\n", "`builder.info`顯示資料集細節" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bxhKra6JjfOT" }, "outputs": [], "source": [ "config = tfds.translate.wmt.WmtConfig(\n", " 
version=\"0.0.1\",\n", " language_pair=(\"zh\", \"en\"),\n", " subsets={\n", " tfds.Split.TRAIN: [\"newscommentary_v14\"]\n", " }\n", ")\n", "builder = tfds.builder(\"wmt_translate\", config=config)\n", "builder.download_and_prepare(download_dir=download_dir)" ] }, { "cell_type": "markdown", "metadata": { "id": "Qi_cPaiwjfOT" }, "source": [ "## 切割資料集\n", "70%訓練集,30%測試集" ] }, { "cell_type": "markdown", "metadata": { "id": "YMT1lxwHjfOT" }, "source": [ "## 透過`tf.DatasetBuilder`載入資料\n", "\n", "`assert`檢查型態" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DP1ugD4UjfOT" }, "outputs": [], "source": [ "train_perc = 70\n", "examples = builder.as_dataset(split=[f'train[:{train_perc}%]', f'train[{train_perc}%:]'], as_supervised=True)\n", "train_examples, val_examples = examples\n", "\n", "assert isinstance(train_examples, tf.data.Dataset)\n", "assert isinstance(val_examples, tf.data.Dataset)" ] }, { "cell_type": "markdown", "metadata": { "id": "58A2uMZUjfOU" }, "source": [ "## 使用`tfds.features.text.SubwordTextEncoder`載入與建立字典\n", "\n", "* `.load_from_file`: 從`.subwords`檔案讀取字典\n", "* `.build_from_corpus`: 建立`.subwords`字典\n", "\n", "中文字典將`max_subword_length`設為1,以字為單位進行斷詞,大幅度減少字典大小,降低複雜度。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KVBg5Q8tBk5z" }, "outputs": [], "source": [ "%%time\n", "try:\n", " tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.load_from_file(en_vocab_file)\n", " print('Load English vocabulary: %s' % en_vocab_file)\n", "except:\n", " print('Build English vocabulary: %s' % en_vocab_file)\n", " tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus((en.numpy() for en, zh in train_examples),\n", " target_vocab_size = 2**13)\n", " tokenizer_en.save_to_file(en_vocab_file)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XemfT4VTjfOU" }, "outputs": [], "source": [ "print('English vocabulary size: ', tokenizer_en.vocab_size)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AVA5AL1fjfOV" }, "outputs": [], "source": [ "%%time\n", "try:\n", " tokenizer_zh = tfds.deprecated.text.SubwordTextEncoder.load_from_file(zh_vocab_file)\n", " print('Load Chinese vocfabulary: %s' % zh_vocab_file)\n", "except:\n", " print('Build Chinese vocabulary: %s' % zh_vocab_file)\n", " tokenizer_zh = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus((zh.numpy() for en, zh in train_examples),\n", " target_vocab_size = 2**13, max_subword_length=1)\n", " tokenizer_zh.save_to_file(zh_vocab_file)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ERJYyfdVjfOV" }, "outputs": [], "source": [ "print('Chinese vocabulary size: ', tokenizer_zh.vocab_size)" ] }, { "cell_type": "markdown", "metadata": { "id": "r_xQLGi7jfOV" }, "source": [ "### Example\n", "\n", "英文的斷詞方式是以`wordpiece`進行斷詞。\n", "\n", "* 原始句子: `Transformer is awesome.`\n", "* 空白斷詞: `[Transformer, is, awesome, .]`\n", "* `Wordpiece`斷詞: `[Trans, former, is, aw, es, ome, .]`\n", "\n", "`Wordpiece`斷詞優點:\n", "* 有些字是由其他的`wordpiece`組成,例如說`Translation`, `Transpose`等等,可以降低字典大小,避免有些字可能在所有句子中只出現過一次。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4DYWukNFkGQN" }, "outputs": [], "source": [ "sample_string = 'Transformer is awesome.'\n", "\n", "tokenized_string_token = tokenizer_en.encode(sample_string)\n", "print('Tokenized string token is {}'.format(tokenized_string_token))\n", "\n", "tokenized_string = [tokenizer_en.decode([ts]) for ts in tokenized_string_token]\n", "print('Tokenized 
srting is {}'.format(tokenized_string))\n", "\n", "original_string = tokenizer_en.decode(tokenized_string_token)\n", "print('The original string: {}'.format(original_string))\n", "\n", "assert original_string == sample_string" ] }, { "cell_type": "markdown", "metadata": { "id": "s8LD3JfZjfOW" }, "source": [ "## 添加``,``在句子頭尾\n", "\n", "\"Drawing\"\n", "\n", "`.vocab_size`視為``, `.vocab_size+1`視為``\n", "\n", "之後所有的訓練資料都需要通過`train_examples`產生,然後再透過`encode`轉成`token_id`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UZwnPr4R055s" }, "outputs": [], "source": [ "def encode(en_t, zh_t):\n", " en_indics = [tokenizer_en.vocab_size] + tokenizer_en.encode(en_t.numpy()) + [tokenizer_en.vocab_size + 1]\n", " zh_indics = [tokenizer_zh.vocab_size] + tokenizer_zh.encode(zh_t.numpy()) + [tokenizer_zh.vocab_size + 1]\n", " return en_indics, zh_indics" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lWpRf-oSjfOX" }, "outputs": [], "source": [ "en_t, zh_t = next(iter(train_examples))\n", "en_indics, zh_indics = encode(en_t, zh_t)\n", "\n", "print('英文: %d' % tokenizer_en.vocab_size)\n", "print('英文: %d' % (tokenizer_en.vocab_size + 1))\n", "print('中文: %d' % tokenizer_zh.vocab_size)\n", "print('中文: %d' % (tokenizer_zh.vocab_size + 1))\n", "\n", "print('-' * 20)\n", "print('Before encode: (two tensor): ')\n", "pprint((en_t, zh_t))\n", "print()\n", "print('After encode: (two array): ')\n", "pprint((en_indics, zh_indics))" ] }, { "cell_type": "markdown", "metadata": { "id": "Tx1sFbR-9fRs" }, "source": [ "## 將`encode`函數的輸出型態轉為計算圖的`Tensor`\n", "\n", "如果直接將`train_examples`接上`encode`,會發生`'Tensor' object has no attribute 'numpy'`\n", "\n", "這是因為`encode`這個自定義函數是透過`tfds`來進行,而`tfds`的`map function`會採用`tf1.0`的`Graph mode`運算,所以無法直接使用`tf2.0`的`Eager mode`中的`attribute.numpy()`,最快的解決方式是透過`tf.py_function`強制讓所有操作都在`python`完成。\n", "\n", "https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/python/data/ops/dataset_ops.py#L1099-L1214" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2HpeIFe-jfOZ" }, "outputs": [], "source": [ "# import traceback\n", "\n", "# try:\n", "# train_examples.map(encode)\n", "# except AttributeError:\n", "# traceback.print_exc()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Mah1cS-P70Iz" }, "outputs": [], "source": [ "def tf_encode(en_t, zh_t):\n", "\n", " return tf.py_function(encode, [en_t, zh_t], [tf.int64, tf.int64])\n", "\n", "tmp_dataset = train_examples.map(tf_encode)\n", "en_indices, zh_indices = next(iter(tmp_dataset))\n", "\n", "print('After tf_encode: (two tensor)')\n", "print(en_indices)\n", "print(zh_indices)" ] }, { "cell_type": "markdown", "metadata": { "id": "xZ70WxDKjfOZ" }, "source": [ "## 限制句子長度\n", "\n", "為了加快訓練速度,使用`tf.logical_and`限制中英文句子長度,並使用`.filter`過濾。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c081xPGv1CPI" }, "outputs": [], "source": [ "max_length = 50\n", "def filter_max_length(en_t, zh_t, max_length = max_length):\n", "\n", " return tf.logical_and(tf.size(en_t) <= max_length,\n", " tf.size(zh_t) <= max_length)\n", "\n", "tmp_dataset = tmp_dataset.filter(filter_max_length)" ] }, { "cell_type": "markdown", "metadata": { "id": "AY2AQ2OojfOa" }, "source": [ "## Padding\n", "\n", "針對每個`batch`都進行中英文的`padding`。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UMV-3WTEjfOa" }, "outputs": [], "source": [ "batch_size = 64\n", "tmp_dataset = tmp_dataset.padded_batch(batch_size=batch_size, padded_shapes=([-1], [-1]))\n", 
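"# 補充說明: padded_batch 會將每個 batch 內較短的句子以 0(即 padding 的 index)補到該 batch 最長句子的長度\n", "# padded_shapes=([-1], [-1]) 表示中英文兩個欄位皆自動取各 batch 內的最大長度\n",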
"\n", "en_batch, zh_batch = next(iter(tmp_dataset))\n", "\n", "print('英文batch: ')\n", "print(en_batch)\n", "print('-' * 15)\n", "print('中文batch: ')\n", "print(zh_batch)" ] }, { "cell_type": "markdown", "metadata": { "id": "8Ql5EG8kjfOa" }, "source": [ "### 將`train_examples`與`val_examples`做同樣處理\n", "\n", "* `train`:\n", "\n", " - `map(tf_encode)`: 將字串轉成`token_id`。\n", " - `filter(filter_max_length)`:過濾最大句子長度。\n", " - `cache()`: 在每次迭代時將訓練資料先放進去`cache`裡面,加速訓練速度。\n", " - `shuffle(buffer_size)`: 從資料集中抽樣`buffer_size`放近`buffer`裡面,然後從`buffer`中抽取一個`batch`進行訓練,同時確保了隨機性與加快訓練速度。\n", " - `padded_batch(batch_size, padded_shapes=([-1],[-1]))`: `padding`長度。\n", "\n", "Tensor-core pipeline: https://www.tensorflow.org/guide/performance/datasets?hl=zh_cn" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9mk9AZdZ5bcS" }, "outputs": [], "source": [ "max_length = 50\n", "batch_size = 128\n", "buffer_size = 15000\n", "\n", "train_dataset = (train_examples\n", " .map(tf_encode)\n", " .filter(filter_max_length)\n", " .cache()\n", " .shuffle(buffer_size)\n", " .padded_batch(batch_size, padded_shapes=([-1],[-1])))\n", "\n", "\n", "val_dataset = (val_examples\n", " .map(tf_encode)\n", " .filter(filter_max_length)\n", " .padded_batch(batch_size, padded_shapes=([-1], [-1])))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_fXvfYVfQr2n" }, "outputs": [], "source": [ "en_batch, zh_batch = next(iter(train_dataset))\n", "\n", "print('英文batch tensor: ')\n", "print(en_batch)\n", "print('-' * 20)\n", "print('中文batch tensor: ')\n", "print(zh_batch)" ] }, { "cell_type": "markdown", "metadata": { "id": "WjZOW6JIjfOb" }, "source": [ "### 假設有新資料時的處理方式\n", "\n", "1. `map(tf_encode)`: 轉成`token_id`。\n", "2. `filter(filter_max_length)`: 過濾最大長度。\n", "3. `padded_batch()`: padding。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7Ba_0X1WjfOc" }, "outputs": [], "source": [ "demo_examples = [\n", " (\"It is important.\", \"這很重要。\"),\n", " (\"The math speaks for themselves.\", \"數學證明一切。\"),\n", "]\n", "\n", "batch_size = 2\n", "demo_examples = tf.data.Dataset.from_tensor_slices((\n", " [en for en, _ in demo_examples], [zh for _, zh in demo_examples]\n", "))\n", "\n", "demo_examples = demo_examples.map(tf_encode).filter(filter_max_length).padded_batch(batch_size, padded_shapes=([-1],[-1]))\n", "\n", "en_sample, zh_sample = next(iter(demo_examples))\n", "print(en_sample)\n", "print('-' * 15)\n", "pprint(zh_sample)" ] }, { "cell_type": "markdown", "metadata": { "id": "nBQuibYA4n0n" }, "source": [ "# Transformer\n", "\n", "\"Drawing\"\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ZV1m4ETejfOc" }, "source": [ "這裡分為`Encoder`與`Decoder`:\n", "\n", "1. `Encoder`: 負責接收`source sentence`,最主要的目的是將`source sentence`作為`q,k,v`進行`self-attention`。\n", "\n", "2. 
`Decoder`: 負責接收`target sentence`,最主要的目的有兩個:\n", " - 使用`target sentence`作為`q,k,v`進行`self-attention`。\n", " - 將`Encoder`的輸出作為`v,k`,然後與`Decoder`的`q`進行`self-attention`。" ] }, { "cell_type": "markdown", "metadata": { "id": "ZVSmOpiZjfOc" }, "source": [ "## Positional Encoding\n", "\n", "Word Embedding所表達的是所有詞向量之間的相似關係,而Transformer的做法是透過內積解決RNN的長距離依賴問題(long-range dependenices),但是Transformer這樣做卻沒有考慮到句子中的詞先後順序關係,透過Positional Encoding,讓詞向量之間不只因為word embedding語義關係而靠近,也可以因為詞之間的位置相互靠近而靠近。\n", "\n", "$$\n", "PE_{(pos,2i)} = \\sin(pos/10000^{\\frac{2i}{d_{model}}}) \\\\\n", "PE_{(pos,2i+1)} = \\cos(pos/10000^{\\frac{2i}{d_{model}}})\n", "$$\n", "\n", "#### 之後可以調整看看三角函數的參數,例如10000 -> 100" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WhIOZjMNKujn" }, "outputs": [], "source": [ "# 先建立角度\n", "def get_angles(pos, i, d_model):\n", " angle_rates = 1 / np.power(10000, 2 * (i//2) / np.float32(d_model))\n", " return pos * angle_rates" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1Rz82wEs5biZ" }, "outputs": [], "source": [ "def positional_encoding(position, d_model):\n", " \"\"\"\n", " 奇數sin\n", " 偶數cos\n", "\n", " 第一個字: [[sin(0),cos(0),sin(1),cos(1),...,sin(d_model-1),cos(d_model)]\n", " 第二個字: ,[sin(0),cos(0),sin(1),cos(1),...,sin(d_model-1),cos(d_model)]\n", " 第三個字: ,[sin(0),cos(0),sin(1),cos(1),...,sin(d_model-1),cos(d_model)]\n", " ...\n", " 第position個字: ,[sin(0),cos(0),sin(1),cos(1),...,sin(d_model-1),cos(d_model)]]\n", "\n", " return:\n", " (batch_size, position, d_model)\n", " \"\"\"\n", " pos = np.arange(position)[:, np.newaxis] # [[0],[1],[2],...,[pos-1]]\n", " i = np.arange(d_model)[np.newaxis, :] # [[0,1,2,3,...,d_model-1]]\n", "\n", " angle_rads = get_angles(pos, i, d_model) # (position, d_model)\n", "\n", "\n", " angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])\n", " angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])\n", "\n", " pos_encoding = angle_rads[np.newaxis, ...]\n", "\n", " return tf.cast(pos_encoding, dtype=tf.float32)" ] }, { "cell_type": "markdown", "metadata": { "id": "hXRfZ3eHjfOk" }, "source": [ "## Positional encoding 理解\n", "\n", "此例拿第25個token的positional encoding來跟其餘50個字(包含自己)的positional encoding計算內積(`np.dot`),能夠發現越靠近token 25的值內積越大,反之,越遠則內積越小。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mL6AOqHQjfOl" }, "outputs": [], "source": [ "position = 50 # 50個字\n", "d_model = 512 # 每個字的positional encoding維度為512\n", "pos_encoding = positional_encoding(position, d_model)\n", "\n", "inp = pos_encoding[0][25].numpy()\n", "\n", "dis_list = list()\n", "for i in range(50):\n", " tar = pos_encoding[0][i].numpy()\n", " dot_prod = np.dot(inp, tar)\n", " dis_list.append(dot_prod)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sSWXaBggjfOm" }, "outputs": [], "source": [ "plt.figure(figsize=(12,10))\n", "plt.plot(dis_list)\n", "plt.xticks(list(range(50)))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "a_b4ou4TYqUN" }, "source": [ "## Masking\n", "在Transformer中有兩個地方需要進行masking,以下兩種masking的方式都是先指定要進行masking的位置,然後將`QK`內積過後的attetion matrix進行masking。\n", "\n", " 1. `Padding_masking`: 句子padding的部分不需要被transformer注意到,透過mask,讓self-attention出來的weight接近0。\n", " 2. 
`Look_ahead_masking`: Decoder中的masked self attention會使用到,不讓當前的字去注意到之後所有的字,一樣是讓self-attention出來的weight接近0。\n", "\n", "\"Drawing\"\n", "\n", "### Padding masking" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "U2i8-e1s8ti9" }, "outputs": [], "source": [ "def create_padding_mask(seq):\n", " \"\"\"\n", " Input:\n", " 在字典中,padding的index為0\n", "\n", " 所以當Input遇到0時就將其變為1,之後當成要進行masking的index\n", "\n", " Return:\n", " 在中間插上兩個維度是為了後面attention時做broadcasting\n", " \"\"\"\n", " seq = tf.cast(tf.math.equal(seq, 0), tf.float32)\n", " return seq[:, tf.newaxis, tf.newaxis, :]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "A7BYeBCNvi7n" }, "outputs": [], "source": [ "x = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])\n", "print(x)\n", "print(create_padding_mask(x))\n", "# 1的位置就是要進行masking的位置" ] }, { "cell_type": "markdown", "metadata": { "id": "Z0hzukDBgVom" }, "source": [ "### Look ahead masking" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dVxS8OPI9uI0" }, "outputs": [], "source": [ "def create_look_ahead_mask(size):\n", " \"\"\"\n", " Input: 方陣size,以transformer來說就是self-attention的weigh matrix,將上三角進行masking\n", "\n", " tf.linalg.band_part(input, num_lower, num_upper)\n", " num_lower, num_upper: 從主對角線開始決定mask的起點,-1表示保留原值\n", " \"\"\"\n", " mask = 1 - tf.linalg.band_part(tf.ones((size,size)), -1, 0)\n", " return mask" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yxKGuXxaBeeE" }, "outputs": [], "source": [ "create_look_ahead_mask(3)" ] }, { "cell_type": "markdown", "metadata": { "id": "vsxEE_-Wa1gF" }, "source": [ "## Scaled dot-product attention(self-attention)\n", "\n", "\"Drawing\"\n", "\n", "1. Q與K進行矩陣相乘的地方就是實現Self-attention的地方,表示Q中的每個字對於K的每個字的attention。\n", "2. 接著進行Scale是為了避免後面通過Softmax之後的attention weight不是1就是0,這樣會造成很小的梯度(hard softmax)。\n", "3. 
通過Softmax之後就產生attention weight matrix,再乘上V,最後得到Context matrix。\n", "\n", "$$\n", "\\mathrm{Attention}(Q,K,V) = \\mathrm{Softmax}(\\frac{QK^\\top}{\\sqrt{d_k}})V\n", "$$\n", "\n", "\"Drawing\"\n", "\n", "\"Drawing\"\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LazzUq3bJ5SH" }, "outputs": [], "source": [ "def scaled_dot_product_attention(q, k, v, mask):\n", " \"\"\"\n", " Args:\n", " q: query shape == (..., seq_len_q, depth_k)\n", " k: key shape == (..., seq_len_k, depth_k)\n", " v: value shape == (..., seq_len_v, depth_v)\n", " mask: Float tensor with shape broadcastable to (..., seq_len_q, seq_len_k)\n", " \"\"\"\n", " # q,k矩陣相乘\n", " matmul_qk = tf.matmul(q, k, transpose_b=True) # (..., q_dim, k_dim)\n", "\n", " # Scaled\n", " dk = tf.cast(tf.shape(k)[-1], tf.float32)\n", " scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)\n", "\n", " # mask\n", " if mask is not None:\n", " scaled_attention_logits += (mask * -1e9)\n", " # Softmax最後一個維度(k_dim),表示每個字對於所有字的attention weights\n", " attention_weights = tf.nn.softmax(scaled_attention_logits, axis = -1)\n", "\n", " output = tf.matmul(attention_weights, v) # (..., q_dim, depth_v)\n", "\n", " return output, attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "n90YjClyInFy" }, "outputs": [], "source": [ "\"\"\"\n", "假設一個字(query)對四個字(key)進行self attention,得到attention weight之後再與value相乘\n", "\"\"\"\n", "\n", "temp_k = tf.constant([[10,0,0],\n", " [0,10,0],\n", " [0,0,10],\n", " [0,0,10]], dtype=tf.float32) # (4, 3)\n", "\n", "temp_v = tf.constant([[10,0,0],\n", " [0,10,0],\n", " [0,0,10],\n", " [0,0,10]], dtype=tf.float32) # (4, 3)\n", "\n", "temp_q = tf.constant([[0, 10, 0]], dtype=tf.float32) # (1, 3)\n", "\n", "output, attention_weights = scaled_dot_product_attention(temp_q, temp_k, temp_v, mask=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zg6k-fGhgXra" }, "outputs": [], "source": [ "print('Attention weights: ')\n", "print(attention_weights)\n", "print()\n", "print('Ouptut: ')\n", "print(output)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UAq3YOzUgXhb" }, "outputs": [], "source": [ "\"\"\"\n", "假設四個字(query)對四個字(key)進行self attention,然後將上三角形進行mask,得到attention weight之後再與value相乘\n", "\n", "將右上角mask掉之後觀察attention weights會發現上三角形的weigh趨近於0\n", "\"\"\"\n", "\n", "# 為了方便觀察weight,將temp_q都設為1\n", "temp_q = tf.constant([[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]], dtype=tf.float32) # (4, 3)\n", "\n", "mask = create_look_ahead_mask(temp_q.shape[0])\n", "\n", "output, attention_weights = scaled_dot_product_attention(temp_q, temp_k, temp_v, mask=mask)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6dlU8Tm-hYrF" }, "outputs": [], "source": [ "print('Attention weights: ')\n", "print(attention_weights)\n", "print()\n", "print('Ouptut: ')\n", "print(output)" ] }, { "cell_type": "markdown", "metadata": { "id": "fz5BMC8Kaoqo" }, "source": [ "## Multi-Head Attention\n", "\n", "\"Drawing\"\n", "\n", "將`q,k,v`分成num_heads份,各自做self-attention,然後再concat,通過dense輸出,分成num_heads的優點最主要是希望讓每個head各自注意到Sequence中不同的地方,而且切分成較小的矩陣還能加速訓練過程。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BSV3PPKsYecw" }, "outputs": [], "source": [ "class MultiHeadAttention(tf.keras.layers.Layer):\n", "\n", " def __init__(self, d_model, num_heads):\n", " super().__init__()\n", " self.num_heads = num_heads\n", " self.d_model = d_model\n", "\n", " # 確保d_model可以被num_heads整除\n", " assert d_model % self.num_heads == 
0\n", "\n", " self.depth = d_model // self.num_heads\n", "\n", " self.wq = tf.keras.layers.Dense(d_model)\n", " self.wk = tf.keras.layers.Dense(d_model)\n", " self.wv = tf.keras.layers.Dense(d_model)\n", "\n", " self.dense = tf.keras.layers.Dense(d_model)\n", "\n", " def split_heads(self, x, batch_size):\n", " \"\"\"\n", " 將d_model切割成(num_heads, depth)\n", " 為了後面做self-attention,transpose成(batch_size, num_heads, seq_len, depth)\n", " \"\"\"\n", " x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))\n", " return tf.transpose(x, perm=[0, 2, 1, 3])\n", "\n", " def __call__(self, v, k, q, mask):\n", " batch_size = tf.shape(q)[0]\n", "\n", " q = self.wq(q) # (batch_size, seq_len, d_model)\n", " k = self.wk(k) # (batch_size, seq_len, d_model)\n", " v = self.wv(v) # (batch_size, seq_len, d_model)\n", "\n", " q = self.split_heads(q, batch_size) # (bat d_model)\n", " k = self.split_heads(k, batch_size) # (batch_size, num_heads, seq_len_k, depth)\n", " v = self.split_heads(v, batch_size) # (batch_size, num_heads, seq_len_v, depth)\n", "\n", " # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)\n", " # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)\n", " scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)\n", "\n", " #為了將num_heads進行concat,transpose成(batch_size, seq_len_q, num_heads, depth)\n", " scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])\n", "\n", " # 合併後面兩維度 (batch_size, seq_len_q, d_model)\n", " concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))\n", "\n", " output = self.dense(concat_attention)\n", "\n", " return output, attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Hu94p-_-2_BX" }, "outputs": [], "source": [ "y = tf.random.uniform((1, 60, 512)) # (batch_size, seq_len, d_model)\n", "\n", "d_model = 512\n", "num_heads = 8\n", "temp_mha = MultiHeadAttention(d_model, num_heads)\n", "output, attention_weights = temp_mha(v=y, k=y, q=y, mask=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8l5p--9ojfOr" }, "outputs": [], "source": [ "# 輸出仍然是 (batch_size, seq_len, d_model)\n", "print('output shape', output.shape)\n", "\n", "# 8個heads各自有一個attention weight matrix\n", "print('attention_weights shape: ', attention_weights.shape)" ] }, { "cell_type": "markdown", "metadata": { "id": "RdDqGayx67vv" }, "source": [ "### Point-wise feed forward network\n", "\n", "$$\n", "FFN(x) = max(0, xW_1 + b_1)W_2+b_2\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ET7xLt0yCT6Z" }, "outputs": [], "source": [ "def point_wise_ffn(d_model, dff):\n", " return tf.keras.Sequential([tf.keras.layers.Dense(dff, activation='relu'), # (batch_size, seq_len, dff)\n", " tf.keras.layers.Dense(d_model)]) # (batch_size, seq_len, d_model)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mytb1lPyOHLB" }, "outputs": [], "source": [ "d_model = 512\n", "dff = 2048\n", "sample_ffn = point_wise_ffn(d_model, dff)\n", "sample_ffn(tf.random.uniform((64, 50, 512))).shape" ] }, { "cell_type": "markdown", "metadata": { "id": "yScbC0MUH8dS" }, "source": [ "## Encoderblock and Decoderblock\n", "\n", "\"Drawing\"\n" ] }, { "cell_type": "markdown", "metadata": { "id": "QFv-FNYUmvpn" }, "source": [ "### EncoderLayer\n", "\n", "這邊我們將以上橘色虛線`Encoderlayer`進行組合,其中主要由兩種`class`組成,分別是`MultiHeadAttention`和`point_wise_ffn`,依照上圖的順序為:\n", "\n", "1. `MultiHeadAttention(padding_mask)`\n", "\n", "2. 
`Residual connection` + `Layer Normalization`\n", "\n", "3. `point_wise_ffn`\n", "\n", "4. `Residual connection` + `Layer Normalization`\n", "\n", "另外`dropout`的部分是在論文中提及的,所以另外加上去。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ncyS-Ms3i2x_" }, "outputs": [], "source": [ "class EncoderLayer(tf.keras.layers.Layer):\n", " def __init__(self, d_model, num_heads, dff, dropout_rate = 0.1):\n", " super().__init__()\n", "\n", " self.mha = MultiHeadAttention(d_model, num_heads)\n", " self.ffn = point_wise_ffn(d_model, dff)\n", "\n", " self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", "\n", " self.dropout1 = tf.keras.layers.Dropout(dropout_rate)\n", " self.dropout2 = tf.keras.layers.Dropout(dropout_rate)\n", "\n", " def __call__(self, x, training, mask):\n", " # 不需要看Encoder的attention weight\n", " attention_output, _ = self.mha(v = x, k = x, q = x, mask=mask) # (batch_size, input_seq_len, d_model)\n", " # Inference時不需要使用dropout\n", " attention_output = self.dropout1(attention_output, training=training) # (batch_size, input_seq_len, d_model)\n", " # Residual + Layer Normalization\n", " out1 = self.layernorm1(x + attention_output) # (batch_size, input_seq_len, d_model)\n", "\n", " ffn_output = self.ffn(out1) # (batch_size, input_seq_len, d_model)\n", " ffn_output = self.dropout2(ffn_output, training=training) # (batch_size, input_seq_len, d_model)\n", " enc_output = self.layernorm2(out1 + ffn_output) # (batch_size, input_seq_len, d_model)\n", "\n", " return enc_output" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AzZRXdO0mI48" }, "outputs": [], "source": [ "d_model = 512\n", "num_heads = 8\n", "dff = 2048\n", "dropout_rate = 0.1\n", "\n", "sample_encooder_layer = EncoderLayer(d_model, num_heads, dff, dropout_rate)\n", "\n", "x = tf.random.uniform((64, 50, 512))\n", "training = False\n", "mask = None\n", "\n", "sample_encooder_layer_output = sample_encooder_layer(x, training, mask)\n", "sample_encooder_layer_output.shape # (batch_size, input_seq_len, d_model)" ] }, { "cell_type": "markdown", "metadata": { "id": "6LO_48Owmx_o" }, "source": [ "### DecoderLayer\n", "\n", "這邊我們將以上橘色虛線`Decoderlayer`進行組合,其中主要由兩種`class`組成,分別是`MultiHeadAttention`和`point_wise_ffn`,依照上圖的順序為:\n", "\n", "1. `MultiHeadAttention(padding_mask + look_ahead_mask)`\n", "\n", "2. `Residual connection` + `Layer Normalization`\n", "\n", "3. `MultiHeadAttention(padding_mask)`\n", "\n", "4. `Residual connection` + `Layer Normalization`\n", "\n", "5. `point_wise_ffn`\n", "\n", "6. 
``Residual connection` + `Layer Normalization``" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9SoX0-vd1hue" }, "outputs": [], "source": [ "class DecoderLayer(tf.keras.layers.Layer):\n", " def __init__(self, d_model, num_heads, dff, dropout_rate = 0.1):\n", " super().__init__()\n", "\n", " self.mha1 = MultiHeadAttention(d_model, num_heads)\n", " self.mha2 = MultiHeadAttention(d_model, num_heads)\n", "\n", " self.ffn = point_wise_ffn(d_model, dff)\n", "\n", " self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", " self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)\n", "\n", " self.dropout1 = tf.keras.layers.Dropout(dropout_rate)\n", " self.dropout2 = tf.keras.layers.Dropout(dropout_rate)\n", " self.dropout3 = tf.keras.layers.Dropout(dropout_rate)\n", "\n", " def __call__(self, x, enc_output, training, look_ahead_mask, padding_mask):\n", "\n", " # masked self-attention,後面需要觀察attention weight matrix\n", " # 使用look_ahead_mask,讓decoder輸入只能往前看\n", " attention_output1, masked_attention_weights = self.mha1(v=x, k=x, q=x, mask=look_ahead_mask) # (batch_size, ouptut_seq_len, d_model)\n", " attention_output1 = self.dropout1(attention_output1, training=training) # (batch_size, ouptut_seq_len, d_model)\n", " attention_output1 = self.layernorm1(x + attention_output1) # (batch_size, ouptut_seq_len, d_model)\n", "\n", " # 使用padding_mask,忽略padding的attention weights,不讓任何字去注意到padding的位置\n", " attention_output2, dec_attention_weights = self.mha2(v=enc_output, k=enc_output, q=attention_output1, mask=padding_mask) # (batch_size, ouptut_seq_len, d_model)\n", " attention_output2 = self.dropout2(attention_output2, training=training) # (batch_size, ouptut_seq_len, d_model)\n", " attention_output2 = self.layernorm2(attention_output1 + attention_output2) # (batch_size, ouptut_seq_len, d_model)\n", "\n", " ffn_output = self.ffn(attention_output2)\n", " ffn_output = self.dropout3(ffn_output, training=training)\n", " dec_output = self.layernorm3(attention_output2 + ffn_output)\n", "\n", " return dec_output, masked_attention_weights, dec_attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ne2Bqx8k71l0" }, "outputs": [], "source": [ "d_model = 512\n", "num_heads = 8\n", "dff = 2048\n", "dropout_rate = 0.1\n", "\n", "x = tf.random.uniform((64, 60, 512))\n", "training = False\n", "look_ahead_mask = None\n", "padding_mask = None\n", "\n", "sample_decoder_layer = DecoderLayer(d_model, num_heads, dff, dropout_rate)\n", "sample_dec_output, masked_attention_weights, dec_attention_weights = sample_decoder_layer(x, sample_encooder_layer_output,\n", " training,\n", " look_ahead_mask,\n", " padding_mask) # (batch_size, target_seq_len, d_model)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "STw18LYCjfOt" }, "outputs": [], "source": [ "# (batch_size, output_seq_len, d_model)\n", "print('dec_output shape: ', sample_dec_output.shape)\n", "# (batch_size, num_heads, output_seq_len, output_seq_len)\n", "print('masked_attention_weights shape: ', masked_attention_weights.shape)\n", "# (batch_size, num_heads, output_seq_len, Input_seq_len)\n", "print('dec_attention_weights shape: ', dec_attention_weights.shape)" ] }, { "cell_type": "markdown", "metadata": { "id": "SE1H51Ajm0q1" }, "source": [ "### Encoder\n", "\n", "上面我們已經把`Encoderlayer`的主架構完成了,現在再把兩個輸入放進`Encoderlayer`形成整個`Encoder`。\n", "\n", "1. `Source Word embedding`\n", "2. 
`Positional encoding`\n", "3. `Encoder Layer * num_layers`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jpEox7gJ8FCI" }, "outputs": [], "source": [ "class Encoder(tf.keras.layers.Layer):\n", " def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, dropout_rate=0.1):\n", " super().__init__()\n", "\n", " self.num_layers = num_layers\n", " self.d_model = d_model\n", "\n", " self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)\n", " self.pos_encoding = positional_encoding(input_vocab_size, d_model)\n", "\n", " self.enc_layers = [EncoderLayer(d_model, num_heads, dff, dropout_rate) for _ in range(self.num_layers)]\n", "\n", " self.dropout = tf.keras.layers.Dropout(dropout_rate)\n", "\n", " def __call__(self, x, training, mask):\n", "\n", " seq_len = tf.shape(x)[1]\n", "\n", " x = self.embedding(x) # (batch_size, seq_len, d_model)\n", " x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))\n", " x += self.pos_encoding[:, :seq_len, :] # (batch_size, seq_len, d_model)\n", "\n", " x = self.dropout(x, training=training) # (batch_size, seq_len, d_model)\n", "\n", " for i in range(self.num_layers):\n", " x = self.enc_layers[i](x, training, mask) # (batch_size, seq_len, d_model)\n", "\n", " return x" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8QG9nueFQKXx" }, "outputs": [], "source": [ "num_layers = 2\n", "d_model = 512\n", "num_heads = 8\n", "dff = 2048\n", "input_vocab_size = 10000\n", "dropout_rate = 0.1\n", "\n", "sample_encoder = Encoder(num_layers, d_model, num_heads, dff, input_vocab_size, dropout_rate)\n", "\n", "# 模擬輸入64個句子,每個句子padding成50個字\n", "x = tf.random.uniform((64, 50))\n", "training = False\n", "mask = None\n", "\n", "sample_encoder_output = sample_encoder(x, training, mask)\n", "# (batch_size, input_seq_len, d_model)\n", "print('sample_encoder_output shape: ',sample_encoder_output.shape) # (batch_size, input_seq_len, d_model)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZtT7PKzrXkNr" }, "source": [ "## Decoder\n", "\n", "`Decoder`的輸入也是`word embedding`與`positional encoding`。\n", "\n", "1. `Target Word embedding`\n", "2. `Positional encoding`\n", "3. 
`Decoder Layer * num_layers`\n", "\n", "因為要觀察`masked_attention_weights`以及`dec_attention_weight`,所以另外寫一個`attention_weights`儲存。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "d5_d5-PLQXwY" }, "outputs": [], "source": [ "class Decoder(tf.keras.layers.Layer):\n", " def __init__(self, num_layers, d_model, num_heads, dff, output_vocab_size, dropout_rate = 0.1):\n", " super().__init__()\n", "\n", " self.num_layers = num_layers\n", " self.d_model = d_model\n", "\n", " self.embedding = tf.keras.layers.Embedding(output_vocab_size, d_model)\n", " self.pos_encoding = positional_encoding(output_vocab_size, d_model)\n", "\n", " self.dec_layers = [DecoderLayer(d_model, num_heads, dff, dropout_rate) for _ in range(num_layers)]\n", "\n", " self.dropout = tf.keras.layers.Dropout(dropout_rate)\n", "\n", " def __call__(self, x, enc_output, training, look_ahead_mask, padding_mask):\n", "\n", " seq_len = tf.shape(x)[1]\n", " attention_weights = {}\n", "\n", " x = self.embedding(x)\n", " x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))\n", " x += self.pos_encoding[:, :seq_len, :]\n", "\n", " x = self.dropout(x, training=training)\n", "\n", " for i in range(self.num_layers):\n", " # x.shape: (batch_size, output_seq_len, d_model)\n", " x, masked_attention_weights, dec_attention_weights = self.dec_layers[i](x, enc_output, training, look_ahead_mask, padding_mask)\n", "\n", " # masked attention: (batch_size, num_head, output_seq_len, output_seq_len)\n", " # dec attention: (batch_size, num_head, output_seq_len, input_seq_len)\n", " attention_weights['decoder_layer{}_masked_attention_weights'.format(i + 1)] = masked_attention_weights\n", " attention_weights['decoder_layer{}_dec_attention_weights'.format(i + 1)] = dec_attention_weights\n", "\n", " return x, attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "a1jXoAMRZyvu" }, "outputs": [], "source": [ "num_layers = 2\n", "d_model = 512\n", "num_heads = 8\n", "dff = 2048\n", "output_vocab_size = 10000\n", "dropout_rate = 0.1\n", "\n", "sample_decoder = Decoder(num_layers, d_model, num_heads, dff, output_vocab_size, dropout_rate)\n", "\n", "# 模擬輸入64個句子,每個句子padding成20個字\n", "x = tf.random.uniform((64, 20))\n", "training = False\n", "look_ahead_mask = None\n", "padding_mask = None\n", "\n", "sample_decoder_output, attention_weights = sample_decoder(x, sample_encoder_output, training, look_ahead_mask, padding_mask)\n", "\n", "# (batch_size, output_seq_len, d_model)\n", "print('sample_decoder_output shape:', sample_decoder_output.shape)\n", "\n", "# masked attention: (batch_size, num_head, output_seq_len, output_seq_len)\n", "# dec attention: (batch_size, num_head, output_seq_len, input_seq_len)\n", "# dec attention表示 output_seq對input_seq的注意力\n", "for key, value in attention_weights.items():\n", " print(key, ' :', value.shape)" ] }, { "cell_type": "markdown", "metadata": { "id": "uERO1y54cOKq" }, "source": [ "## Transformer\n", "\n", "結合`Encoder`和`Decoder`,接上最後的`Dense`,輸出probability。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PED3bIpOYkBu" }, "outputs": [], "source": [ "class Transformer(tf.keras.Model):\n", " def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, output_vocab_size, dropout_rate = 0.1):\n", " super().__init__()\n", "\n", " self.encoder = Encoder(num_layers, d_model, num_heads, dff, input_vocab_size, dropout_rate)\n", " self.decoder = Decoder(num_layers, d_model, num_heads, dff, output_vocab_size, dropout_rate)\n", "\n", " 
self.final_layer = tf.keras.layers.Dense(output_vocab_size)\n", "\n", " def __call__(self, inp, tar, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):\n", "\n", " # enc_output.shape: (batch_size, inp_seq_len, d_model)\n", " enc_output = self.encoder(inp, training, enc_padding_mask)\n", "\n", " # dec_output.shape: (batch_size, tar_seq_len, d_model)\n", " dec_output, attention_weights = self.decoder(tar, enc_output, training, look_ahead_mask, dec_padding_mask)\n", "\n", " # final_output.shape: (batch_szie, tar_seq_len, output_vocab_size )\n", " final_output = self.final_layer(dec_output)\n", "\n", " return final_output, attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tJ4fbQcIkHW1" }, "outputs": [], "source": [ "num_layers = 2\n", "d_model = 512\n", "num_heads = 8\n", "dff = 2048\n", "input_vocab_size = 10000\n", "output_vocab_size = 10000\n", "dropout_rate = 0.1\n", "\n", "\n", "sample_transformer = Transformer(num_layers, d_model, num_heads, dff, input_vocab_size, output_vocab_size, dropout_rate)\n", "\n", "# Input: 模擬輸入64個句子,每個句子padding成50個字\n", "# Target: 模擬輸入64個句子,每個句子padding成20個字\n", "temp_input = tf.random.uniform((64, 50))\n", "temp_target = tf.random.uniform((64, 20))\n", "training = False\n", "enc_padding_mask = None\n", "look_ahead_mask = None\n", "dec_padding_mask = None\n", "\n", "final_output, attention_weights = sample_transformer(temp_input, temp_target, training, enc_padding_mask, look_ahead_mask, dec_padding_mask)\n", "\n", "print('final_output shape:', final_output.shape)\n", "\n", "# masked attention: (batch_size, num_head, output_seq_len, output_seq_len)\n", "# dec attention: (batch_size, num_head, output_seq_len, input_seq_len)\n", "# dec attention表示 output_seq對input_seq的注意力\n", "for key, value in attention_weights.items():\n", " print(key, ' :', value.shape) # (batch_size, tar_seq_len, target_vocab_size)" ] }, { "cell_type": "markdown", "metadata": { "id": "GOmWW--yP3zx" }, "source": [ "## Optimizer and Customer Learning rate\n", "論文使用`Adam`搭配客製化的`Learning rate`,`Learning rate`在warmup_steps前遞增,在warmup_step後遞減。\n", "\n", "$$\n", "lrate = d^{-0.5}_{model}\\times min(step\\_num^{-0.5},\\;step\\_num \\times warmup\\_steps^{-1.5})\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iYQdOO1axwEI" }, "outputs": [], "source": [ "class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):\n", " def __init__(self, d_model, warmup_steps=4000):\n", " super(CustomSchedule, self).__init__()\n", "\n", " self.d_model = tf.math.rsqrt(tf.cast(d_model, tf.float32))\n", "\n", " self.warmup_steps = warmup_steps\n", "\n", " def __call__(self, step):\n", " step = tf.cast(step, tf.float32)\n", " arg1 = tf.math.rsqrt(step) # step_num^{-0.5}\n", " arg2 = step * (self.warmup_steps ** -1.5) # step_num * warmup_step^{-1.5}\n", "\n", " return self.d_model * tf.math.minimum(arg1, arg2)" ] }, { "cell_type": "markdown", "metadata": { "id": "b3NCEMvvjfOx" }, "source": [ "### 不同`warmup_steps`對於`learning rate`的影響" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7r4scdulztRx" }, "outputs": [], "source": [ "d_models = 512\n", "warmup_steps = [3000 ,4000, 5000, 6000]\n", "\n", "step = tf.range(50000, dtype=tf.float32)\n", "\n", "for warmup_step in warmup_steps:\n", " temp_learning_rate_schedule = CustomSchedule(d_model, warmup_step)\n", " plt.plot(temp_learning_rate_schedule(step), label = str(warmup_step))\n", " plt.ylabel('Learning Rate')\n", " plt.xlabel('Train Step')\n", " 
plt.legend(loc='upper right')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pDbD5QVejfOx" }, "outputs": [], "source": [ "d_model = 512\n", "warmup_steps = 4000\n", "# learning_rate = CustomSchedule(d_model, warmup_steps)\n", "\n", "beta_1 = 0.9\n", "beta_2 = 0.98\n", "epsilon = 1e-9\n", "optimizer = tf.keras.optimizers.Adam(learning_rate=CustomSchedule(d_model, warmup_steps), beta_1=beta_1, beta_2=beta_2, epsilon=epsilon)" ] }, { "cell_type": "markdown", "metadata": { "id": "oxGJtoDuYIHL" }, "source": [ "### Loss and metrics\n", "\n", "不需要計算句子中`padding`位置的`loss`,所以需要進行mask。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Wx9V7lOFjfOy" }, "outputs": [], "source": [ "def loss_function(real, pred):\n", "\n", " mask = tf.math.logical_not(tf.math.equal(real, 0)) # 將sequence中padding(index為0)的部分設為False\n", "\n", " loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')\n", " \"\"\"\n", " from_logits: y_pred is expected to be a logits tensor. By default, we assume that y_pred encodes a probability distribution.\n", " reduction: the reduction schedule of output loss vectors. `https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/losses/Reduction`\n", " \"\"\"\n", "\n", " loss_ = loss_object(real, pred)\n", "\n", " mask = tf.cast(mask, dtype=loss_.dtype)\n", " loss_ *= mask # 只計算非padding的loss\n", "\n", " return tf.reduce_mean(loss_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HaAoEvWljfOy" }, "outputs": [], "source": [ "# Loss sample\n", "cce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')\n", "\n", "y_true = tf.constant([0, 1, 0], dtype=tf.float32)\n", "y_pred = tf.constant([[.95, .05], [.11, .89], [.05, .95]], dtype=tf.float32)\n", "\n", "loss = cce(y_true, y_pred)\n", "print('Loss: ', loss.numpy()) # Loss: 0.6532173" ] }, { "cell_type": "markdown", "metadata": { "id": "DbRAJt_PjfOy" }, "source": [ "### Loss, Accuracy" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "phlyxMnm-Tpx" }, "outputs": [], "source": [ "train_loss = tf.keras.metrics.Mean(name='train_loss')\n", "train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(\n", " name='train_accuracy')" ] }, { "cell_type": "markdown", "metadata": { "id": "aeHumfr7zmMa" }, "source": [ "### Create masking\n", "\n", "建立訓練時`Encoder`和`Decoder`需要用到的masking\n", "\n", "* `Encoder`:\n", " - 第一個Multi-head attention需要Source的`padding_mask`\n", "\n", "\n", "* `Decoder`:\n", " - 第一個Masked Multi-head attention需要Target的`padding_mask` + `look_ahead_mask`\n", " - 第二個Multi-head attention需要Target的`padding_mask`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZOJUSB1T8GjM" }, "outputs": [], "source": [ "def create_masks(inp, tar):\n", "\n", " # Encoder padding mask\n", " enc_padding_mask = create_padding_mask(inp)\n", "\n", " # Decoder 2nd Multi-head attention\n", " dec_padding_mask = create_padding_mask(inp)\n", "\n", " # Decoder 1st Masked Multi-head attention\n", " look_ahead_mask = create_look_ahead_mask(size=tf.shape(tar)[1]) # 建立只能往前看的mask矩陣\n", " dec_target_padding_mask = create_padding_mask(tar) # padding_mask\n", " combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask) # 返回兩者各別最大的值,也就是都是1的位置\n", "\n", " return enc_padding_mask, combined_mask, dec_padding_mask" ] }, { "cell_type": "markdown", "metadata": { "id": "Fzuf06YZp66w" }, "source": [ "### Set Parameters and Transformer" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "id": "2vKR9EcWjfO1" }, "outputs": [], "source": [ "num_layers = 4\n", "d_model = 128\n", "dff = 512\n", "num_heads = 8\n", "input_vocab_size = tokenizer_en.vocab_size + 2\n", "output_vocab_size = tokenizer_zh.vocab_size + 2\n", "dropout_rate = 0.1\n", "\n", "epochs = 5\n", "transformer = Transformer(num_layers, d_model, num_heads, dff, input_vocab_size, output_vocab_size, dropout_rate)" ] }, { "cell_type": "markdown", "metadata": { "id": "_69MCxBfjfO1" }, "source": [ "## Checkpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hNhuYfllndLZ" }, "outputs": [], "source": [ "ckpt = tf.train.Checkpoint(transformer = transformer, optimizer = optimizer)\n", "\n", "record_params = f'{num_layers}layers_{d_model}d_model_{num_heads}heads_{dff}dff'\n", "checkpoint_path = os.path.join(checkpoint_path, record_params)\n", "log_dir = os.path.join(log_dir, record_params)\n", "\n", "# 只保留最近3次訓練結果\n", "ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=3)\n", "\n", "# 檢查在checkpoint_path上是否有已訓練的checkpoint,有就叫ckpt進行讀取\n", "if ckpt_manager.latest_checkpoint:\n", " ckpt.restore(ckpt_manager.latest_checkpoint)\n", " print('Latest checkpoint restored')" ] }, { "cell_type": "markdown", "metadata": { "id": "0Di_Yaa1gf9r" }, "source": [ "## Define training step\n", "\n", "訓練時採用`Teacher forcing`,直接輸入給Decoder正確答案,因為若使用`Recursive`預測方式,預測錯誤則會導致之後面接收到錯誤的資訊。\n", "\n", "預測時則採用`AutoRegressive`方式遞迴預測。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iJwmp9OE29oj" }, "outputs": [], "source": [ "@tf.function(input_signature=(tf.TensorSpec(shape=[None, None], dtype=tf.int64), tf.TensorSpec(shape=[None, None], dtype=tf.int64)))\n", "def train_step(inp, tar):\n", "\n", " # teacher forcing\n", " tar_inp = tar[:, :-1] # Deocder的target輸入不需要\n", " tar_real = tar[:, 1:] # Decdoer的target輸出不需要\n", "\n", " enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)\n", "\n", "\n", " # 記錄梯度,之後做梯度下降\n", " Training = True\n", " with tf.GradientTape() as tape:\n", " predictions, _ = transformer(inp, tar_inp, Training, enc_padding_mask, combined_mask, dec_padding_mask)\n", " loss = loss_function(tar_real, predictions)\n", "\n", " # 拿出所有可訓練參數的gradient\n", " gradients = tape.gradient(loss, transformer.trainable_variables)\n", " # 呼叫Adam透過gradient更新參數\n", " optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))\n", "\n", " # 輸出loss以及acc,之後準備給Tensorboard記錄\n", " train_loss(loss)\n", " train_accuracy(tar_real, predictions)" ] }, { "cell_type": "markdown", "metadata": { "id": "6tDNfGE_jfO2" }, "source": [ "## Training\n", "\n", "使用Tensorboard記錄Loss以及Accuracy" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bbvmaKNiznHZ" }, "outputs": [], "source": [ "# Tensorboard\n", "summary_writer = tf.summary.create_file_writer(logdir=log_dir)\n", "\n", "for epoch in range(epochs):\n", " start = time.time()\n", "\n", " # 每次epoch重置Tensorboard metrics\n", " train_loss.reset_states()\n", " train_accuracy.reset_states()\n", "\n", " # 依序訓練所有batch\n", " for (batch, (inp, tar)) in enumerate(train_dataset):\n", " train_step(inp, tar)\n", "\n", " if batch % 50 == 0:\n", " print ('Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1, batch, train_loss.result(), train_accuracy.result()))\n", "\n", " # 每2個epoch就儲存模型\n", " if (epoch + 1) % 2 == 0:\n", " ckpt_save_path = ckpt_manager.save()\n", " print('Saving checkpoint for epoch {} at {}'.format(epoch+1, ckpt_save_path))\n", "\n", 
" with summary_writer.as_default():\n", " tf.summary.scalar('train_loss', train_loss.result(), step=epoch+1)\n", " tf.summary.scalar('train_acc', train_accuracy.result(), step=epoch+1)\n", "\n", " print('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch+1, train_loss.result(), train_accuracy.result()))\n", " print('Time taken for 1 epoch: {} secs\\n'.format(time.time()-start))\n" ] }, { "cell_type": "markdown", "metadata": { "id": "y6APsFrgImLW" }, "source": [ "## Evaluate\n", "\n", "當有新sentence要預測時,sentence一樣要做與encoder輸入的處理:\n", "\n", "1. Encoder輸入前後需要增加``與``\n", "2. Decoder的預測方式是用AutoRegressive,輸入是從``開始預測,第一次預測完將預測結果concat在``後,之後以此類推。\n", "\n", "`` => ` 我` => ` 我 好 ` => ` 我 好 帥`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5buvMlnvyrFm" }, "outputs": [], "source": [ "def evaluate(inp_sentence):\n", "\n", " start_token = [tokenizer_en.vocab_size]\n", " end_token = [tokenizer_en.vocab_size + 1]\n", "\n", " # Encoder的輸入需要增加,\n", " inp_sentence = start_token + tokenizer_en.encode(inp_sentence) + end_token\n", " encoder_input = tf.expand_dims(inp_sentence, axis=0)\n", "\n", " # Decoder的預測方式是autoregressive,即從開始預測,每次預測完拿取預測結果最後一個字的概率\n", " decoder_input = [tokenizer_zh.vocab_size]\n", " output = tf.expand_dims(decoder_input, axis=0)\n", "\n", " # AutoRegressive\n", " for i in range(max_length):\n", " # create mask\n", " enc_padding_mask, combined_mask, dec_padding_mask = create_masks(encoder_input, output)\n", "\n", " # prediction.shape == (batch_size, seq_len, vocab_size)\n", " predictions, attention_weights = transformer(encoder_input, output, False, enc_padding_mask, combined_mask, dec_padding_mask)\n", "\n", " # 拿取最後一個字作為預測結果\n", " prediction = predictions[:, -1:, :]\n", "\n", " prediction_id = tf.cast(tf.argmax(prediction, axis = -1), tf.int32)\n", "\n", " # 預測結果遇到就停止回傳output\n", " if prediction_id == tokenizer_zh.vocab_size + 1:\n", " return tf.squeeze(output, axis=0), attention_weights\n", "\n", " output = tf.concat([output, prediction_id], axis = -1)\n", "\n", " return tf.squeeze(output, axis=0), attention_weights" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wNE6NyaFjfO3" }, "outputs": [], "source": [ "def map_from_pred(pred_tokens):\n", "\n", " pred_tokens = [t for t in pred_tokens if t < tokenizer_zh.vocab_size]\n", " pred_sentence = tokenizer_zh.decode(pred_tokens)\n", "\n", " return pred_sentence" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FVi16PuIjfO3" }, "outputs": [], "source": [ "sentence = 'Taiwan is a beautiful country.'\n", "predicted_seq, attention_weights = evaluate(sentence)\n", "predicted_seq = map_from_pred(predicted_seq)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EwQx63bBjfO3" }, "outputs": [], "source": [ "print('Source sentence:\\n',sentence)\n", "print()\n", "print('Predict sentence:\\n', predicted_seq)" ] }, { "cell_type": "markdown", "metadata": { "id": "lj-fNcE-jfO4" }, "source": [ "## Visualization\n", "\n", "我們畫出`Decoder`中的`self-attention`權重矩陣,每個`head`各有一個矩陣,這裏挑最後一層的`decoder_layer4_dec_attention_weights`。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7ouqbiD0jfO4" }, "outputs": [], "source": [ "for key,value in attention_weights.items():\n", " print(key,':',value.shape)\n", "\n", "layer_name = 'decoder_layer4_dec_attention_weights'" ] }, { "cell_type": "code", "source": [ "!wget -O /usr/share/fonts/truetype/liberation/simhei.ttf \"https://www.wfonts.com/download/data/2014/06/01/simhei/chinese.simhei.ttf\"\n", 
"import matplotlib as mpl\n", "zhfont = mpl.font_manager.FontProperties(fname='/usr/share/fonts/truetype/liberation/simhei.ttf')" ], "metadata": { "id": "s1sDxcPM3BwJ" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yxwsRuPHjfO4" }, "outputs": [], "source": [ "def plot_attention_weights(attention_weights, sentence, predicted_seq, layer_name):\n", " fig = plt.figure(figsize=(17, 14))\n", "\n", " sentence = tokenizer_en.encode(sentence)\n", "\n", " attention_weights = tf.squeeze(attention_weights[layer_name], axis=0)\n", " # (num_heads, tar_seq_len, inp_seq_len)\n", "\n", " # 只畫其中4個head\n", " #attention_weights = attention_weights[4:8,:,:]\n", "\n", " # 將每個 head 的注意權重畫出\n", " for head in range(attention_weights.shape[0]):\n", " ax = fig.add_subplot(4, 2, head + 1)\n", "\n", " attn_map = np.transpose(attention_weights[head])\n", " ax.matshow(attn_map, cmap='viridis') # (inp_seq_len, tar_seq_len)\n", "\n", " ax.set_xticks(range(len(predicted_seq)))\n", " ax.set_xticklabels(predicted_seq, fontproperties=zhfont)\n", "\n", " ax.set_yticks(range(len(sentence) + 2))\n", " ax.set_yticklabels([''] + [tokenizer_en.decode([i]) for i in sentence] + [''])\n", "\n", " ax.set_xlabel('Head {}'.format(head + 1), fontsize=13)\n", "\n", " plt.tight_layout()\n", " plt.show()\n", " plt.close(fig)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XXOvfTgDjfO5" }, "outputs": [], "source": [ "import logging\n", "logging.getLogger('matplotlib.font_manager').disabled = True\n", "\n", "plt.figure(figsize=(20,15))\n", "plot_attention_weights(attention_weights, sentence, predicted_seq, layer_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8lrRwW9QjfO5" }, "outputs": [], "source": [] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [ "s_qNSzzyaCbD" ], "name": "transformer.ipynb", "private_outputs": true, "provenance": [], "gpuType": "T4", "include_colab_link": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 0 }