{ "cells": [ { "cell_type": "markdown", "id": "0d8fea4f", "metadata": { "id": "0d8fea4f" }, "source": [ "# **Model Training (Regression)**\n", "This notebook walks through the details to watch for when training a model on a regression task.\n", "\n", "## Chapter outline\n", "* ### [Dataset Creating / Loading](#DatasetCreating/Loading)\n", "* ### [Data Preprocessing](#DataPreprocessing)\n", "* ### [Model Building](#ModelBuilding)\n", "* ### [Model Training](#ModelTraining)\n", "* ### [Model Evaluation](#ModelEvaluation)\n", "---" ] }, { "cell_type": "markdown", "id": "bba873c7", "metadata": { "id": "bba873c7" }, "source": [ "## Import packages" ] }, { "cell_type": "code", "execution_count": null, "id": "108ce7ba", "metadata": { "id": "108ce7ba" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# TensorFlow packages\n", "import tensorflow as tf\n", "from tensorflow import keras\n", "from tensorflow.keras import layers" ] }, { "cell_type": "markdown", "id": "c32bbd73", "metadata": { "id": "c32bbd73" }, "source": [ "\n", "## Dataset Creating / Loading" ] }, { "cell_type": "code", "source": [ "# Download and unzip the data\n", "!wget -q https://github.com/TA-aiacademy/course_3.0/releases/download/DL/Data_part2.zip\n", "!unzip -q Data_part2.zip" ], "metadata": { "id": "rfM_DLvfFBMZ" }, "id": "rfM_DLvfFBMZ", "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "id": "0d408f0e", "metadata": { "id": "0d408f0e" }, "outputs": [], "source": [ "train_df = pd.read_csv('./Data/FilmRating_train.csv')\n", "test_df = pd.read_csv('./Data/FilmRating_test.csv')" ] }, { "cell_type": "code", "execution_count": null, "id": "4c6d4a50", "metadata": { "id": "4c6d4a50" }, "outputs": [], "source": [ "train_df.head()" ] }, { "cell_type": "markdown", "id": "e7d33563", "metadata": { "id": "e7d33563" }, "source": [ "* #### Film rating dataset\n", "The dataset contains 2612 records.\n", "Its fields are budget (budget), genres (genres), keywords (keywords), popularity (popularity), production companies (production_companies), production countries 
(production_countries), revenue (revenue), runtime (runtime), cast (cast), director (director), days since release (n_days), and score (score). Several fields were converted to numeric values with leave-one-out encoding.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a8b89aa8", "metadata": { "id": "a8b89aa8" }, "outputs": [], "source": [ "X_df = train_df.iloc[:, :-1].values\n", "y_df = train_df.score.values" ] }, { "cell_type": "code", "execution_count": null, "id": "916c3e24", "metadata": { "id": "916c3e24" }, "outputs": [], "source": [ "X_test = test_df.iloc[:, :-1].values\n", "y_test = test_df.score.values" ] }, { "cell_type": "markdown", "id": "1c94fd50", "metadata": { "id": "1c94fd50" }, "source": [ "\n", "## Data Preprocessing" ] }, { "cell_type": "markdown", "id": "089f1a3a", "metadata": { "id": "089f1a3a" }, "source": [ "* ### Data Normalization\n", " - Keeps the model from over-weighting a feature merely because of its numeric range\n", " - Keeps gradient updates from drifting off course, so training converges more easily" ] }, { "cell_type": "markdown", "id": "6154aada", "metadata": { "id": "6154aada" }, "source": [ "Test data must be transformed with statistics computed from the training data, so that the relationship between the two distributions is preserved.\n", "![](https://hackmd.io/_uploads/S1m3KtLZp.png)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "f9d0e878", "metadata": { "id": "f9d0e878" }, "outputs": [], "source": [ "'''Normalize'''\n", "X_scale = (X_df-X_df.min(axis=0)) / (X_df.max(axis=0)-X_df.min(axis=0))\n", "X_test_scale = (X_test-X_df.min(axis=0)) / (X_df.max(axis=0)-X_df.min(axis=0))\n", "\n", "# Alternative: scikit-learn\n", "# from sklearn.preprocessing import MinMaxScaler\n", "# sc = MinMaxScaler(feature_range=(0, 1))\n", "# X_scale = sc.fit_transform(X_df)\n", "# X_test_scale = sc.transform(X_test)\n", "\n", "# '''Standardize'''\n", "# X_scale = (X_df-X_df.mean(axis=0)) / (X_df.std(axis=0))\n", "# X_test_scale = (X_test-X_df.mean(axis=0)) / (X_df.std(axis=0))\n", "\n", "# Alternative: scikit-learn\n", "# from sklearn.preprocessing import StandardScaler\n", "# sc = StandardScaler()\n", "# X_scale = sc.fit_transform(X_df)\n", "# X_test_scale = sc.transform(X_test)" ] }, { "cell_type": "markdown", "id": "5d972154", "metadata": { "id": 
"5d972154" }, "source": [ "* ### Data Splitting" ] }, { "cell_type": "code", "execution_count": null, "id": "4a1dee07", "metadata": { "id": "4a1dee07" }, "outputs": [], "source": [ "# train, valid/test dataset split\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_valid, y_train, y_valid = \\\n", " train_test_split(X_scale, y_df, test_size=0.1, random_state=17)" ] }, { "cell_type": "code", "execution_count": null, "id": "e514c344", "metadata": { "id": "e514c344" }, "outputs": [], "source": [ "print(f'X_train shape: {X_train.shape}')\n", "print(f'X_valid shape: {X_valid.shape}')\n", "print(f'y_train shape: {y_train.shape}')\n", "print(f'y_valid shape: {y_valid.shape}')" ] }, { "cell_type": "markdown", "id": "ad193ada", "metadata": { "id": "ad193ada" }, "source": [ "\n", "## Model Building" ] }, { "cell_type": "code", "execution_count": null, "id": "db2915ff", "metadata": { "id": "db2915ff" }, "outputs": [], "source": [ "keras.backend.clear_session() # reset all Keras state\n", "tf.random.set_seed(17) # set the TensorFlow random seed\n", "\n", "model = keras.models.Sequential()\n", "model.add(layers.Dense(64, # number of units\n", " input_shape=X_train[0].shape, # input shape\n", " activation='sigmoid')) # activation function\n", "model.add(layers.Dense(32, activation='sigmoid'))\n", "model.add(layers.Dense(1, activation='linear'))\n", "\n", "model.summary()" ] }, { "cell_type": "markdown", "id": "43b1e5cc", "metadata": { "id": "43b1e5cc" }, "source": [ "![](https://hackmd.io/_uploads/BJo6YtUZp.png)\n" ] }, { "cell_type": "markdown", "id": "2fca07aa", "metadata": { "id": "2fca07aa" }, "source": [ "\n", "## Model Training" ] }, { "cell_type": "markdown", "id": "cb38c731", "metadata": { "id": "cb38c731" }, "source": [ "* ### Model compile\n", "Set the optimizer and the loss function that the model uses during training." ] }, { "cell_type": "code", "execution_count": null, "id": "e0a210bf", "metadata": { "id": "e0a210bf" }, "outputs": [], "source": [ "model.compile(optimizer='rmsprop', # default: 
RMSprop(learning_rate=0.001)\n", " loss='mean_squared_error')" ] }, { "cell_type": "markdown", "id": "074eca90", "metadata": { "id": "074eca90" }, "source": [ "![](https://hackmd.io/_uploads/ryNy9FLZ6.png)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "f991db68", "metadata": { "id": "f991db68" }, "outputs": [], "source": [ "history = model.fit(X_train, y_train,\n", " epochs=20,\n", " batch_size=8,\n", " validation_data=(X_valid, y_valid))" ] }, { "cell_type": "markdown", "id": "60eed35b", "metadata": { "id": "60eed35b" }, "source": [ "\n", "## Model Evaluation" ] }, { "cell_type": "markdown", "id": "bfd3fee3", "metadata": { "id": "bfd3fee3" }, "source": [ "* ### Visualizing the training metrics (Visualization)" ] }, { "cell_type": "code", "execution_count": null, "id": "7938b5b7", "metadata": { "id": "7938b5b7" }, "outputs": [], "source": [ "# history.history is a dict keyed by metric name\n", "print(history.history.keys())" ] }, { "cell_type": "code", "execution_count": null, "id": "5e8f6770", "metadata": { "id": "5e8f6770" }, "outputs": [], "source": [ "train_loss = history.history['loss']\n", "valid_loss = history.history['val_loss']" ] }, { "cell_type": "code", "execution_count": null, "id": "25cc2846", "metadata": { "id": "25cc2846" }, "outputs": [], "source": [ "plt.figure(figsize=(15, 4))\n", "plt.yscale('log')\n", "plt.plot(range(len(train_loss)), train_loss, label='train_loss')\n", "plt.plot(range(len(valid_loss)), valid_loss, label='valid_loss')\n", "\n", "plt.legend()\n", "plt.xlabel('Epochs')\n", "plt.ylabel('Loss')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "505ba915", "metadata": { "id": "505ba915" }, "source": [ "* ### Model predictions" ] }, { "cell_type": "code", "execution_count": null, "id": "267aabab", "metadata": { "id": "267aabab" }, "outputs": [], "source": [ "y_pred = model(X_valid)\n", "print(f'Predictions: {y_pred[:5, 0]}')\n", "print(f'Targets: {y_valid[:5]}')" ] }, { "cell_type": "markdown", "id": "39dca3ca", "metadata": { "id": "39dca3ca" 
}, "source": [ "* ### Visualizing the results" ] }, { "cell_type": "code", "execution_count": null, "id": "abb92318", "metadata": { "id": "abb92318" }, "outputs": [], "source": [ "plt.figure(figsize=(15, 4))\n", "plt.plot(range(len(y_pred)), y_pred, label='prediction')\n", "plt.plot(range(len(y_valid)), y_valid, label='groundtruth')\n", "plt.plot(range(len(y_pred)), y_pred[:, 0]-y_valid, label='difference')\n", "\n", "plt.legend()\n", "plt.xlabel('Samples')\n", "plt.ylabel('Values')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "b3e3e6ad", "metadata": { "id": "b3e3e6ad" }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" }, "colab": { "provenance": [] }, "accelerator": "GPU", "gpuClass": "standard" }, "nbformat": 4, "nbformat_minor": 5 }