{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "0d8fea4f",
      "metadata": {
        "id": "0d8fea4f"
      },
      "source": [
        "# **模型訓練（迴歸問題）**\n",
        "此份程式碼會講解針對迴歸型任務在模型訓練上需要注意的細節。\n",
        "\n",
        "## 本章節內容大綱\n",
        "* ### [創建資料集／載入資料集（Dataset Creating/ Loading）](#DatasetCreating/Loading)\n",
        "* ### [資料前處理（Data Preprocessing）](#DataPreprocessing)\n",
        "* ### [模型建置（Model Building）](#ModelBuilding)\n",
        "* ### [模型訓練（Model Training）](#ModelTraining)\n",
        "* ### [模型評估（Model Evaluation）](#ModelEvaluation)\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "bba873c7",
      "metadata": {
        "id": "bba873c7"
      },
      "source": [
        "## 匯入套件"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "108ce7ba",
      "metadata": {
        "id": "108ce7ba"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
        "\n",
        "# Tensorflow 相關套件\n",
        "import tensorflow as tf\n",
        "from tensorflow import keras\n",
        "from tensorflow.keras import layers"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "c32bbd73",
      "metadata": {
        "id": "c32bbd73"
      },
      "source": [
        "<a name=\"DatasetCreating/Loading\"></a>\n",
        "## 創建資料集／載入資料集（Dataset Creating / Loading）"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# 上傳資料\n",
        "!wget -q https://github.com/TA-aiacademy/course_3.0/releases/download/DL/Data_part2.zip\n",
        "!unzip -q Data_part2.zip"
      ],
      "metadata": {
        "id": "rfM_DLvfFBMZ"
      },
      "id": "rfM_DLvfFBMZ",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "0d408f0e",
      "metadata": {
        "id": "0d408f0e"
      },
      "outputs": [],
      "source": [
        "train_df = pd.read_csv('./Data/FilmRating_train.csv')\n",
        "test_df = pd.read_csv('./Data/FilmRating_test.csv')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "4c6d4a50",
      "metadata": {
        "id": "4c6d4a50"
      },
      "outputs": [],
      "source": [
        "train_df.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "e7d33563",
      "metadata": {
        "id": "e7d33563"
      },
      "source": [
        "* #### 電影評價資料集\n",
        "資料集總共 2612 筆，\n",
        "欄位包括預算 (budget)、電影類型 (genres)、關鍵字詞 (keywords)、知名度 (popularity)、製作公司 (production_companies)、國家 (production_countries)、收入 (revenue)、時長 (runtime)、卡司 (cast)、導演 (director)、距離發布時間 (n_days)、評分 (score)，多項欄位是以 leave-one-out encoding 方式轉換數值。\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a8b89aa8",
      "metadata": {
        "id": "a8b89aa8"
      },
      "outputs": [],
      "source": [
        "X_df = train_df.iloc[:, :-1].values\n",
        "y_df = train_df.score.values"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "916c3e24",
      "metadata": {
        "id": "916c3e24"
      },
      "outputs": [],
      "source": [
        "X_test = test_df.iloc[:, :-1].values\n",
        "y_test = test_df.score.values"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "1c94fd50",
      "metadata": {
        "id": "1c94fd50"
      },
      "source": [
        "<a name=\"DataPreprocessing\"></a>\n",
        "## 資料前處理（Data Preprocessing）"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "089f1a3a",
      "metadata": {
        "id": "089f1a3a"
      },
      "source": [
        "* ### 資料正規化（Data Normalization）\n",
        "    - 減少過度關注的特徵（由特定數字範圍造成的影響）\n",
        "    - 避免更新方向偏離，較容易收斂"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "6154aada",
      "metadata": {
        "id": "6154aada"
      },
      "source": [
        "對於測試資料，需使用「訓練資料」的統計量去做轉換，避免改變兩組資料間的分布關係\n",
        "![](https://hackmd.io/_uploads/S1m3KtLZp.png)\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f9d0e878",
      "metadata": {
        "id": "f9d0e878"
      },
      "outputs": [],
      "source": [
        "'''Normalize'''\n",
        "X_scale = (X_df-X_df.min(axis=0)) / (X_df.max(axis=0)-X_df.min(axis=0))\n",
        "X_test_scale = (X_test-X_df.min(axis=0)) / (X_df.max(axis=0)-X_df.min(axis=0))\n",
        "\n",
        "# 其他寫法\n",
        "# from sklearn.preprocessing import MinMaxScaler\n",
        "# sc = MinMaxScaler(feature_range=(0, 1))\n",
        "# X_scale = sc.fit_transform(X_df)\n",
        "# X_test_scale = sc.transform(X_test)\n",
        "\n",
        "# '''Standardize'''\n",
        "# X_scale = (X_df-X_df.mean(axis=0)) / (X_df.std(axis=0))\n",
        "# X_test_scale = (X_test-X_df.mean(axis=0)) / (X_df.std(axis=0))\n",
        "\n",
        "# 其他寫法\n",
        "# from sklearn.preprocessing import StandardScaler\n",
        "# sc = StandardScaler()\n",
        "# X_scale = sc.fit_transform(X_df)\n",
        "# X_test_scale = sc.transform(X_test)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "5d972154",
      "metadata": {
        "id": "5d972154"
      },
      "source": [
        "* ### 資料切分（Data Splitting）"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "4a1dee07",
      "metadata": {
        "id": "4a1dee07"
      },
      "outputs": [],
      "source": [
        "# train, valid/test dataset split\n",
        "from sklearn.model_selection import train_test_split\n",
        "X_train, X_valid, y_train, y_valid = \\\n",
        "    train_test_split(X_scale, y_df, test_size=0.1, random_state=17)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e514c344",
      "metadata": {
        "id": "e514c344"
      },
      "outputs": [],
      "source": [
        "print(f'X_train shape: {X_train.shape}')\n",
        "print(f'X_valid shape: {X_valid.shape}')\n",
        "print(f'y_train shape: {y_train.shape}')\n",
        "print(f'y_valid shape: {y_valid.shape}')"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ad193ada",
      "metadata": {
        "id": "ad193ada"
      },
      "source": [
        "<a name=\"ModelBuilding\"></a>\n",
        "## 模型建置（Model Building）"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "db2915ff",
      "metadata": {
        "id": "db2915ff"
      },
      "outputs": [],
      "source": [
        "keras.backend.clear_session()  # 重置 keras 的所有狀態\n",
        "tf.random.set_seed(17)  # 設定 tensorflow 隨機種子\n",
        "\n",
        "model = keras.models.Sequential()\n",
        "model.add(layers.Dense(64,  # 神經元個數\n",
        "                       input_shape=X_train[0].shape,  # 輸入形狀\n",
        "                       activation='sigmoid'))  # 激活函數\n",
        "model.add(layers.Dense(32, activation='sigmoid'))\n",
        "model.add(layers.Dense(1, activation='linear'))\n",
        "\n",
        "model.summary()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "43b1e5cc",
      "metadata": {
        "id": "43b1e5cc"
      },
      "source": [
        "![](https://hackmd.io/_uploads/BJo6YtUZp.png)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "2fca07aa",
      "metadata": {
        "id": "2fca07aa"
      },
      "source": [
        "<a name=\"ModelTraining\"></a>\n",
        "## 模型訓練（Model Training）"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "cb38c731",
      "metadata": {
        "id": "cb38c731"
      },
      "source": [
        "* ### 模型編譯（model compile）\n",
        "設定模型訓練時，所需的優化器 (optimizer)、損失函數 (loss function)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e0a210bf",
      "metadata": {
        "id": "e0a210bf"
      },
      "outputs": [],
      "source": [
        "model.compile(optimizer='rmsprop',  # default: RMSprop(learning_rate=0.001)\n",
        "              loss='mean_squared_error')"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "074eca90",
      "metadata": {
        "id": "074eca90"
      },
      "source": [
        "![](https://hackmd.io/_uploads/ryNy9FLZ6.png)\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f991db68",
      "metadata": {
        "id": "f991db68"
      },
      "outputs": [],
      "source": [
        "history = model.fit(X_train, y_train,\n",
        "                    epochs=20,\n",
        "                    batch_size=8,\n",
        "                    validation_data=(X_valid, y_valid))"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "60eed35b",
      "metadata": {
        "id": "60eed35b"
      },
      "source": [
        "<a name=\"ModelEvaluation\"></a>\n",
        "## 模型評估（Model Evaluation）"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "bfd3fee3",
      "metadata": {
        "id": "bfd3fee3"
      },
      "source": [
        "* ### 視覺化訓練過程的評估指標 （Visualization）"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "7938b5b7",
      "metadata": {
        "id": "7938b5b7"
      },
      "outputs": [],
      "source": [
        "# type(history.history) = dictionary\n",
        "print(history.history.keys())"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "5e8f6770",
      "metadata": {
        "id": "5e8f6770"
      },
      "outputs": [],
      "source": [
        "train_loss = history.history['loss']\n",
        "valid_loss = history.history['val_loss']"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "25cc2846",
      "metadata": {
        "id": "25cc2846"
      },
      "outputs": [],
      "source": [
        "plt.figure(figsize=(15, 4))\n",
        "plt.yscale('log')\n",
        "plt.plot(range(len(train_loss)), train_loss, label='train_loss')\n",
        "plt.plot(range(len(valid_loss)), valid_loss, label='valid_loss')\n",
        "\n",
        "plt.legend()\n",
        "plt.xlabel('Epochs')\n",
        "plt.ylabel('Loss')\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "505ba915",
      "metadata": {
        "id": "505ba915"
      },
      "source": [
        "* ### 模型預測（Model predictions）"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "267aabab",
      "metadata": {
        "id": "267aabab"
      },
      "outputs": [],
      "source": [
        "y_pred = model(X_valid)\n",
        "print(f'預測結果： {y_pred[:5, 0]}')\n",
        "print(f'目標值： {y_valid[:5]}')"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "39dca3ca",
      "metadata": {
        "id": "39dca3ca"
      },
      "source": [
        "* ### 視覺化結果"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "abb92318",
      "metadata": {
        "id": "abb92318"
      },
      "outputs": [],
      "source": [
        "plt.figure(figsize=(15, 4))\n",
        "plt.plot(range(len(y_pred)), y_pred, label='prediction')\n",
        "plt.plot(range(len(y_valid)), y_valid, label='groundtruth')\n",
        "plt.plot(range(len(y_pred)), y_pred[:, 0]-y_valid, label='difference')\n",
        "\n",
        "plt.legend()\n",
        "plt.xlabel('Samples')\n",
        "plt.ylabel('Values')\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "b3e3e6ad",
      "metadata": {
        "id": "b3e3e6ad"
      },
      "outputs": [],
      "source": []
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.12"
    },
    "colab": {
      "provenance": []
    },
    "accelerator": "GPU",
    "gpuClass": "standard"
  },
  "nbformat": 4,
  "nbformat_minor": 5
}