{
"cells": [
{
"cell_type": "markdown",
"id": "cfbc8b83",
"metadata": {
"id": "cfbc8b83"
},
"source": [
"# **模型訓練(分類問題)**\n",
"此份程式碼會講解針對分類型任務在模型訓練上需要注意的細節。\n",
"\n",
"## 本章節內容大綱\n",
"* ### 二元分類問題\n",
" * ### [創建資料集/載入資料集(Dataset Creating/ Loading)](#DatasetCreating/Loading)\n",
" * ### [資料前處理(Data Preprocessing)](#DataPreprocessing)\n",
" * ### [模型建置(Model Building)](#ModelBuilding)\n",
" * ### [模型訓練(Model Training)](#ModelTraining)\n",
" * ### [模型評估(Model Evaluation)](#ModelEvaluation)\n",
"* ### 多元分類問題\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "6c1ac997",
"metadata": {
"id": "6c1ac997"
},
"source": [
"## 匯入套件"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e78cf9f",
"metadata": {
"id": "1e78cf9f"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Tensorflow 相關套件\n",
"import tensorflow as tf\n",
"from tensorflow import keras\n",
"from tensorflow.keras import layers"
]
},
{
"cell_type": "markdown",
"id": "979f67f3",
"metadata": {
"id": "979f67f3"
},
"source": [
"\n",
"## 創建資料集/載入資料集(Dataset Creating / Loading)"
]
},
{
"cell_type": "code",
"source": [
"# 上傳資料\n",
"!wget -q https://github.com/TA-aiacademy/course_3.0/releases/download/DL/Data_part2.zip\n",
"!unzip -q Data_part2.zip"
],
"metadata": {
"id": "MuTlaAi9FzqJ"
},
"id": "MuTlaAi9FzqJ",
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "5cbef01a",
"metadata": {
"id": "5cbef01a"
},
"outputs": [],
"source": [
"train_df = pd.read_csv('./Data/FilmComment_train.csv')\n",
"test_df = pd.read_csv('./Data/FilmComment_test.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c5cca18",
"metadata": {
"id": "9c5cca18"
},
"outputs": [],
"source": [
"train_df.head()"
]
},
{
"cell_type": "markdown",
"id": "5d6bfd0d",
"metadata": {
"id": "5d6bfd0d"
},
"source": [
"* #### 電影評論資料集\n",
"訓練集,測試集分別為 6250,2500 筆,9997 種常用字詞,若在同一則評論中出現該字詞為 1,若否則為 0,y_label 標記評價正面與否。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32082d9c",
"metadata": {
"id": "32082d9c"
},
"outputs": [],
"source": [
"X_df = train_df.iloc[:, :-1].values\n",
"y_df = train_df.y_label.values"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55e1bc9b",
"metadata": {
"id": "55e1bc9b"
},
"outputs": [],
"source": [
"X_test = test_df.iloc[:, :-1].values\n",
"y_test = test_df.y_label.values"
]
},
{
"cell_type": "markdown",
"id": "f2133a95",
"metadata": {
"id": "f2133a95"
},
"source": [
"\n",
"## 資料前處理(Data Preprocessing)"
]
},
{
"cell_type": "markdown",
"id": "858fc2ed",
"metadata": {
"id": "858fc2ed"
},
"source": [
"* ### 資料正規化(Data Normalization)\n",
"由於此資料集的數值範圍都介於 0-1,並且皆是以相同意義轉換特徵值,因此也可以使用原始的數值作為訓練資料。"
]
},
{
"cell_type": "markdown",
"id": "f18e67ea",
"metadata": {
"id": "f18e67ea"
},
"source": [
"* ### 資料切分(Data Splitting)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6fe5be4a",
"metadata": {
"id": "6fe5be4a"
},
"outputs": [],
"source": [
"# train, valid/test dataset split\n",
"from sklearn.model_selection import train_test_split\n",
"X_train, X_valid, y_train, y_valid = \\\n",
" train_test_split(X_df, y_df, test_size=0.1, random_state=17)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "74f1ed43",
"metadata": {
"id": "74f1ed43"
},
"outputs": [],
"source": [
"print(f'X_train shape: {X_train.shape}')\n",
"print(f'X_valid shape: {X_valid.shape}')\n",
"print(f'y_train shape: {y_train.shape}')\n",
"print(f'y_valid shape: {y_valid.shape}')"
]
},
{
"cell_type": "markdown",
"id": "88a51105",
"metadata": {
"id": "88a51105"
},
"source": [
"\n",
"## 模型建置(Model Building)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a4345b20",
"metadata": {
"id": "a4345b20"
},
"outputs": [],
"source": [
"keras.backend.clear_session() # 重置 keras 的所有狀態\n",
"tf.random.set_seed(17) # 設定 tensorflow 隨機種子\n",
"\n",
"model = keras.models.Sequential()\n",
"model.add(layers.Dense(16, # 神經元個數\n",
" input_shape=X_train[0].shape, # 輸入形狀\n",
" activation='relu')) # 激活函數\n",
"model.add(layers.Dense(16,\n",
" activation='relu'))\n",
"model.add(layers.Dense(1,\n",
" activation='sigmoid'))\n",
"\n",
"model.summary()"
]
},
{
"cell_type": "markdown",
"id": "5e483262",
"metadata": {
"id": "5e483262"
},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"id": "3a358982",
"metadata": {
"id": "3a358982"
},
"source": [
"## 模型訓練(Model training)"
]
},
{
"cell_type": "markdown",
"id": "2fd7204c",
"metadata": {
"id": "2fd7204c"
},
"source": [
"* ### 模型編譯(model compile)\n",
"設定模型訓練時,所需的優化器 (optimizer)、損失函數 (loss function)、評估指標 (metrics)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bde67281",
"metadata": {
"id": "bde67281"
},
"outputs": [],
"source": [
"model.compile(optimizer='rmsprop', # default: RMSprop(learning_rate=0.001)\n",
" loss='binary_crossentropy', # 針對二元分類問題的損失函數\n",
" metrics='acc') # 評估指標: 準確率"
]
},
{
"cell_type": "markdown",
"id": "4c992b9d",
"metadata": {
"id": "4c992b9d"
},
"source": [
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1072c292",
"metadata": {
"id": "1072c292"
},
"outputs": [],
"source": [
"history = model.fit(X_train, y_train,\n",
" batch_size=512,\n",
" epochs=20,\n",
" validation_data=(X_valid, y_valid))"
]
},
{
"cell_type": "markdown",
"id": "49d3db7d",
"metadata": {
"id": "49d3db7d"
},
"source": [
"\n",
"## 模型評估(Model Evaluation)"
]
},
{
"cell_type": "markdown",
"id": "cb66d982",
"metadata": {
"id": "cb66d982"
},
"source": [
"* ### 視覺化訓練過程的評估指標 (Visualization)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d7ea5696",
"metadata": {
"id": "d7ea5696"
},
"outputs": [],
"source": [
"# type(history.history) = dictionary\n",
"print(history.history.keys())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad38e820",
"metadata": {
"id": "ad38e820"
},
"outputs": [],
"source": [
"train_loss = history.history['loss']\n",
"train_acc = history.history['acc']\n",
"\n",
"valid_loss = history.history['val_loss']\n",
"valid_acc = history.history['val_acc']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3e7050e9",
"metadata": {
"id": "3e7050e9"
},
"outputs": [],
"source": [
"plt.figure(figsize=(15, 4))\n",
"plt.subplot(1, 2, 1)\n",
"plt.plot(range(len(train_loss)), train_loss, label='train_loss')\n",
"plt.plot(range(len(valid_loss)), valid_loss, label='valid_loss')\n",
"plt.xlabel('Epochs')\n",
"plt.ylabel('Binary crossentropy')\n",
"plt.legend()\n",
"\n",
"plt.subplot(1, 2, 2)\n",
"plt.plot(range(len(train_acc)), train_acc, label='train_acc')\n",
"plt.plot(range(len(valid_acc)), valid_acc, label='valid_acc')\n",
"plt.xlabel('Epochs')\n",
"plt.ylabel('Accuracy')\n",
"plt.legend()"
]
},
{
"cell_type": "markdown",
"id": "daaebf99",
"metadata": {
"id": "daaebf99"
},
"source": [
"* ### 模型預測(Model predictions)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "79b4465f",
"metadata": {
"id": "79b4465f"
},
"outputs": [],
"source": [
"# predict all test data\n",
"pred = model(X_test)\n",
"print(pred)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "895f082c",
"metadata": {
"id": "895f082c"
},
"outputs": [],
"source": [
"# use threshold to obtain binary class\n",
"pred_class = model(X_test) > 0.5\n",
"print(tf.cast(pred_class, tf.int32))"
]
},
{
"cell_type": "markdown",
"id": "8273fa68",
"metadata": {
"id": "8273fa68"
},
"source": [
"* ### 視覺化結果"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a8c906c",
"metadata": {
"id": "4a8c906c"
},
"outputs": [],
"source": [
"plt.figure(figsize=(15, 4))\n",
"plt.scatter(range(pred.shape[0]), pred)\n",
"plt.hlines(0.5, 0, pred.shape[0], colors='red', label='y=0.5')\n",
"plt.legend()"
]
},
{
"cell_type": "markdown",
"id": "73309200",
"metadata": {
"id": "73309200"
},
"source": [
"----------------\n",
"此範例是二元分類,y 的表示方式可用一維陣列,分別以 0, 1 表示兩個類別(正面,負面評價)\n",
"\n",
"\n",
"**若是多元分類又該如何表示?**以多個維度的 One-Hot Encoding 方式表示多元分類標籤\n",
"\n",
"\n",
"**對訓練有何影響?**\n",
"跟 y 最直接相關的就是 Loss function,間接影響到模型輸出的維度\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "ccb031e0",
"metadata": {
"id": "ccb031e0"
},
"source": [
"----------------------------"
]
},
{
"cell_type": "markdown",
"id": "23ba626a",
"metadata": {
"id": "23ba626a"
},
"source": [
"## 多元分類(Multi-class classification)"
]
},
{
"cell_type": "markdown",
"id": "e5e7adb5",
"metadata": {
"id": "e5e7adb5"
},
"source": [
"### 創建資料集/載入資料集(Dataset Creating / Loading)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ada5ae4",
"metadata": {
"id": "2ada5ae4"
},
"outputs": [],
"source": [
"train_df = pd.read_csv('./Data/FilmComment_train.csv')\n",
"test_df = pd.read_csv('./Data/FilmComment_test.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5fa1ba27",
"metadata": {
"id": "5fa1ba27"
},
"outputs": [],
"source": [
"X_df = train_df.iloc[:, :-1].values\n",
"y_df = train_df.y_label.values"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1db3602b",
"metadata": {
"id": "1db3602b"
},
"outputs": [],
"source": [
"X_test = test_df.iloc[:, :-1].values\n",
"y_test = test_df.y_label.values"
]
},
{
"cell_type": "markdown",
"id": "47419c59",
"metadata": {
"id": "47419c59"
},
"source": [
"\n",
"## 資料前處理(Data Preprocessing)"
]
},
{
"cell_type": "markdown",
"id": "8ad90b6f",
"metadata": {
"id": "8ad90b6f"
},
"source": [
"* #### One-Hot encoding"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "febf8069",
"metadata": {
"id": "febf8069"
},
"outputs": [],
"source": [
"# Convert to One-Hot encoding\n",
"y_df = keras.utils.to_categorical(y_df)\n",
"y_test = keras.utils.to_categorical(y_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "94cd8466",
"metadata": {
"id": "94cd8466"
},
"outputs": [],
"source": [
"# train, valid/test dataset split\n",
"from sklearn.model_selection import train_test_split\n",
"X_train, X_valid, y_train, y_valid = \\\n",
" train_test_split(X_df, y_df, test_size=0.1, random_state=17)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b85e17bf",
"metadata": {
"id": "b85e17bf"
},
"outputs": [],
"source": [
"print(f'X_train shape: {X_train.shape}')\n",
"print(f'X_valid shape: {X_valid.shape}')\n",
"print(f'y_train shape: {y_train.shape}')\n",
"print(f'y_valid shape: {y_valid.shape}')"
]
},
{
"cell_type": "markdown",
"id": "80924567",
"metadata": {
"id": "80924567"
},
"source": [
"### 模型建置(Model Building)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8391cbd3",
"metadata": {
"id": "8391cbd3"
},
"outputs": [],
"source": [
"keras.backend.clear_session()\n",
"tf.random.set_seed(17)\n",
"\n",
"model = keras.models.Sequential()\n",
"model.add(layers.Dense(16,\n",
" input_shape=X_train[0].shape,\n",
" activation='relu'))\n",
"model.add(layers.Dense(16,\n",
" activation='relu'))\n",
"model.add(layers.Dense(2,\n",
" activation='softmax'))\n",
"\n",
"model.summary()"
]
},
{
"cell_type": "markdown",
"id": "f9ceea97",
"metadata": {
"id": "f9ceea97"
},
"source": [
"### 模型訓練(Model training)"
]
},
{
"cell_type": "markdown",
"id": "bb5925b9",
"metadata": {
"id": "bb5925b9"
},
"source": [
"* #### 模型編譯(model compile)\n",
"設定模型訓練時,所需的優化器 (optimizer)、損失函數 (loss function)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f545b1ad",
"metadata": {
"id": "f545b1ad"
},
"outputs": [],
"source": [
"model.compile(optimizer='rmsprop',\n",
" loss='categorical_crossentropy',\n",
" metrics='acc')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "629d0d4f",
"metadata": {
"id": "629d0d4f"
},
"outputs": [],
"source": [
"history = model.fit(X_train, y_train,\n",
" batch_size=512,\n",
" epochs=20,\n",
" validation_data=(X_valid, y_valid))"
]
},
{
"cell_type": "markdown",
"id": "f7941459",
"metadata": {
"id": "f7941459"
},
"source": [
"### 模型評估(Model evalutation)"
]
},
{
"cell_type": "markdown",
"id": "983598de",
"metadata": {
"id": "983598de"
},
"source": [
"* #### 視覺化訓練過程的評估指標 (Visualization)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9caf93df",
"metadata": {
"id": "9caf93df"
},
"outputs": [],
"source": [
"# type(history.history) = dictionary\n",
"print(history.history.keys())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b7e02e7",
"metadata": {
"id": "1b7e02e7"
},
"outputs": [],
"source": [
"train_loss = history.history['loss']\n",
"train_acc = history.history['acc']\n",
"\n",
"valid_loss = history.history['val_loss']\n",
"valid_acc = history.history['val_acc']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0aa975b6",
"metadata": {
"id": "0aa975b6"
},
"outputs": [],
"source": [
"plt.figure(figsize=(15, 4))\n",
"plt.subplot(1, 2, 1)\n",
"plt.plot(range(len(train_loss)), train_loss, label='train_loss')\n",
"plt.plot(range(len(valid_loss)), valid_loss, label='valid_loss')\n",
"plt.xlabel('Epochs')\n",
"plt.ylabel('Categorical crossentropy')\n",
"plt.legend()\n",
"\n",
"plt.subplot(1, 2, 2)\n",
"plt.plot(range(len(train_acc)), train_acc, label='train_acc')\n",
"plt.plot(range(len(valid_acc)), valid_acc, label='valid_acc')\n",
"plt.xlabel('Epochs')\n",
"plt.ylabel('Accuracy')\n",
"plt.legend()"
]
},
{
"cell_type": "markdown",
"id": "faf85f9f",
"metadata": {
"id": "faf85f9f"
},
"source": [
"* ### 模型預測(Model predictions)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed795618",
"metadata": {
"id": "ed795618"
},
"outputs": [],
"source": [
"# predict all test data\n",
"pred = model(X_test)\n",
"print(pred)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "597071f6",
"metadata": {
"id": "597071f6"
},
"outputs": [],
"source": [
"pred = tf.argmax(model(X_test), axis=-1) # choose maximum probability of index\n",
"print(pred)"
]
},
{
"cell_type": "markdown",
"id": "3f3366f7",
"metadata": {
"id": "3f3366f7"
},
"source": [
"---\n",
"### Remark\n",
"**Classification task**\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "702cec23",
"metadata": {
"id": "702cec23"
},
"source": [
"### Quiz\n",
"請試著利用 Data/pkgo_train.csv 做多元分類問題,預測五個種類的 pokemon,並調整模型(網路層數、神經元數目)得到更高的準確度。\n",
"\n",
"pkgo_train 為 Pokemon go 中 pokemon 出沒狀態描述的資料集,欄位說明如下:\n",
"* latitude, longitude: 位置(經緯度)\n",
"* local.xx: 時間(擷取格式 mm-dd'T'hh-mm-ss.ms'Z')\n",
"* appearedTimeOfDay: night, evening, afternoon, morning 四種時段\n",
"* appearedHour/Minute: 當地小時/分鐘\n",
"* appearedDayOfWeek: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday\n",
"* appearedDay/Month: 當地日期/月份\n",
"* terrainType: 地形種類\n",
"* closeToWater: 是否接近水源(100 公尺內)\n",
"* city: 城市\n",
"* continent: 洲別\n",
"* weather: 天氣種類(Foggy Clear, PartlyCloudy, MostlyCloudy, Overcast, Rain, BreezyandOvercast, LightRain, Drizzle, BreezyandPartlyCloudy, HeavyRain, BreezyandMostlyCloudy, Breezy, Windy, WindyandFoggy, Humid, Dry, WindyandPartlyCloudy, DryandMostlyCloudy, DryandPartlyCloudy, DrizzleandBreezy, LightRainandBreezy, HumidandPartlyCloudy, HumidandOvercast, RainandWindy)\n",
"* temperature: 攝氏溫度\n",
"* windSpeed: 風速(km/h)\n",
"* windBearing: 風向\n",
"* pressure: 氣壓\n",
"* sunrise/sunsetXX: 日出日落相關訊息\n",
"* population_density: 人口密集度\n",
"* urban/suburban/midurban/rural: 出沒過的地點城市程度(人口密集度小於 200 為 rural, 大於等於 200 且小於 400 為 midUrban, 大於等於400 且小於 800 為 subUrban, 大於 800 為 urban)\n",
"* gymDistanceKm: 最近道館的距離\n",
"* gymInxx: 道館是否在指定距離內\n",
"* cooc1-cooc151: 是否有其他 pokemon 在 24 小時內,出現在周圍 100 公尺之內\n",
"* category: 種類"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "523db00d",
"metadata": {
"id": "523db00d"
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
},
"colab": {
"provenance": []
},
"accelerator": "GPU",
"gpuClass": "standard"
},
"nbformat": 4,
"nbformat_minor": 5
}