{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "1mB0In5zYtQw" }, "source": [ "# 5_Regression&Autoregression" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "vm8CtSi-YtQ1" }, "source": [ "以上我們有了data,並組成了data loader,接著就可以用這個loader來訓練一些模型。\n", "\n", "這邊我們用些簡單的迴歸模型來試試看:\n", "- Linear Regression\n", "- Autoregression" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "VmcMo4yAYtQ2" }, "source": [ "**先import一些套件**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "buzO-o1yYtQ3" }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import matplotlib.pyplot as plt\n", "from plotly import express as px\n", "\n", "import numpy as np\n", "import tensorflow.data as tfd" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "T19LAWQOJj63" }, "outputs": [], "source": [ "# Colab使用要更換statsmodels版本,安裝完請重新啟動執行階段\n", "!pip uninstall -y statsmodels\n", "!pip install statsmodels==0.11.1" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "TU9iLRCiYtQ6" }, "source": [ "**給一些必要function: 畫圖、合成資料產生、window data loader產生、評估function**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rKMh4kQyYtQ7" }, "outputs": [], "source": [ "def plot_series(time, series, start=0, end=None, labels=None, title=None):\n", " # Visualizes time series data\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # series (list of array of int) - 時間點對應的資料列表,列表內時間序列數量為D,\n", " # 每筆資料長度為T,若非為列表則轉為列表\n", " # start (int) - 開始的資料序(第幾筆)\n", " # end (int) - 結束繪製的資料序(第幾筆)\n", " # labels (list of strings)- 對於多時間序列或多維度的標註\n", " # title (string)- 圖片標題\n", "\n", " # 若資料只有一筆,則轉為list\n", " if type(series) != list:\n", " series = [series]\n", "\n", " if not end:\n", " end = len(series[0])\n", "\n", " if labels:\n", " # 設立dictionary, 讓plotly畫訊號線時可以標註label\n", " dictionary = {\"time\": time}\n", " for idx, l in enumerate(labels):\n", " # 
截斷資料,保留想看的部分,並分段紀錄於dictionary中\n", " dictionary.update({l: series[idx][start:end]})\n", " # 畫訊號線\n", " fig = px.line(dictionary,\n", " x=\"time\",\n", " y=list(dictionary.keys())[1:],\n", " width=1000,\n", " height=400,\n", " title=title)\n", " else:\n", " # 畫訊號線\n", " fig = px.line(x=time, y=series, width=1000, height=400, title=title)\n", " fig.show()\n", "\n", "\n", "# 合成資料生成\n", "def trend(time, slope=0):\n", " # 產生合成直線趨勢資料,其長度與時間等長,斜率與設定slope相同\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # slope (float) - 設定資料的傾斜程度與正負\n", " # Returns:\n", " # series (array of float) - 產出slope 與設定相同的一條線\n", "\n", " series = slope * time\n", "\n", " return series\n", "\n", "\n", "def seasonal_pattern(season_time, pattern_type='triangle'):\n", " # 產生某個特定pattern,\n", " # Args:\n", " # season_time (array of float) - 周期內的時間點, 長度為T\n", " # pattern_type (str) - 這邊提供triangle與cosine\n", " # Returns:\n", " # data_pattern (array of float) - 根據自訂函式產出特定的pattern\n", "\n", " # 用特定function生成pattern\n", " if pattern_type == 'triangle':\n", " data_pattern = np.where(season_time < 0.5,\n", " season_time*2,\n", " 2-season_time*2)\n", " elif pattern_type == 'cosine':\n", " data_pattern = np.cos(season_time*np.pi*2)\n", "\n", " return data_pattern\n", "\n", "\n", "def seasonality(time, period, amplitude=1, phase=30, pattern_type='triangle'):\n", " # Repeats the same pattern at each period\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # period (int) - 週期長度,必小於T\n", " # amplitude (float) - 序列幅度大小\n", " # phase (int) - 相位,為遞移量,正的向左(提前)、負的向右(延後)\n", " # pattern_type (str) - 這邊提供triangle與cosine\n", " # Returns:\n", " # data_pattern (array of float) - 有指定周期、振幅、相位、pattern後的time series\n", "\n", " # 將時間依週期重置為0\n", " season_time = ((time + phase) % period) / period\n", "\n", " # 產生週期性訊號並乘上幅度\n", " data_pattern = amplitude * seasonal_pattern(season_time, pattern_type)\n", "\n", " return data_pattern\n", "\n", "\n", "def noise(time, noise_level=1, seed=None):\n", " # 
合成雜訊,這邊用高斯雜訊,機率密度為常態分布\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # noise_level (float) - 雜訊大小\n", " # seed (int) - 同樣的seed可以重現同樣的雜訊\n", " # Returns:\n", " # noise (array of float) - 雜訊時間序列\n", "\n", " # 做一個基於某個seed的雜訊生成器\n", " rnd = np.random.RandomState(seed)\n", "\n", " # 生與time同長度的雜訊,並且乘上雜訊大小 (不乘的話,標準差是1)\n", " noise = rnd.randn(len(time)) * noise_level\n", "\n", " return noise\n", "\n", "\n", "def toy_generation(time=np.arange(4 * 365),\n", " bias=500.,\n", " slope=0.1,\n", " period=180,\n", " amplitude=40.,\n", " phase=30,\n", " pattern_type='triangle',\n", " noise_level=5.,\n", " seed=2022):\n", " signal_series = bias\\\n", " + trend(time, slope)\\\n", " + seasonality(time,\n", " period,\n", " amplitude,\n", " phase,\n", " pattern_type)\n", " noise_series = noise(time, noise_level, seed)\n", "\n", " series = signal_series+noise_series\n", " return series\n", "\n", "\n", "# Dataset\n", "def win_ar_ds(series, size, shift=1):\n", " # 輸出Window-wise Forecasting Dataset\n", " # Args:\n", " # series (array of float) - 時序資料, 長度為T\n", " # size (int) - Window大小\n", " # shift (int) - 每個window起始點間距\n", " # Returns:\n", " # (tf.data.Dataset(母類名稱,確切type為MapDataset)) -\n", " # - 一個一次生一個window的生成器\n", "\n", " ds = tfd.Dataset.from_tensor_slices(series)\n", " ds = ds.window(size=size+1, shift=shift, drop_remainder=True)\n", " ds = ds.flat_map(lambda ds: ds.batch(size+1))\n", " return ds.map(lambda x: (x[:-1], x[-1:]))\n", "\n", "\n", "def regressor_ds(*regressors, series):\n", " # 輸出Window-wise Regressor Forecasting Dataset\n", " # Args:\n", " # regressors (arguments of array of float) - 多個迴歸因子,每個長度為T\n", " # series (array of float) - 預測對象,長度為T\n", " # Returns:\n", " # (tf.data.Dataset(母類名稱,確切type為TensorSliceDataset)) -\n", " # - 一次生regressors和time series的dataset\n", "\n", " ds = tfd.Dataset.from_tensor_slices((np.stack(regressors, -1), series))\n", " return ds\n", "\n", "\n", "# 評估function\n", "def MAE(pred, gt):\n", " # 計算Mean Absolute Error\n", " # Args:\n", " # 
pred (array of float) - 預測資料\n", " # gt (array of float) - 答案資料\n", " # Returns:\n", " # 計算結果 (float)\n", " return abs(pred-gt).mean()\n", "\n", "\n", "def MSE(pred, gt):\n", " # 計算Mean Square Error\n", " # Args:\n", " # pred (array of float) - 預測資料\n", " # gt (array of float) - 答案資料\n", " # Returns:\n", " # 計算結果 (float)\n", " return pow(pred-gt, 2).mean()\n", "\n", "\n", "def R2(pred, gt):\n", " # 計算R square score\n", " # Args:\n", " # pred (array of float) - 預測資料\n", " # gt (array of float) - 答案資料\n", " # Returns:\n", " # 計算結果 (float)\n", " return 1-pow(pred-gt, 2).sum()/pow(gt-gt.mean(), 2).sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_hg1yBiFYtQ-" }, "outputs": [], "source": [ "def split(x, train_size):\n", " return x[..., :train_size], x[..., train_size:]\n", "\n", "\n", "# 先合成資料,還有作資料分割\n", "time = np.arange(4*365) # 定義時間點\n", "series_sample = toy_generation(time, pattern_type='cosine') # 這就是我們合成出來的資料\n", "\n", "time_train, time_test = split(time, 365*3)\n", "series_train, series_test = split(series_sample, 365*3)\n", "\n", "# 另外也加上輔助資料\n", "cos_train = seasonality(time_train, 180, 1., 30, 'cosine')\n", "cos_test = seasonality(time_test, 180, 1., 30, 'cosine')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Mf7Hx8kqYtQ-" }, "outputs": [], "source": [ "import tensorflow as tf\n", "from tensorflow.keras import models, layers, losses, optimizers" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "X_mruI5NYtQ_" }, "outputs": [], "source": [ "# 用各種regressor predict資料的training set\n", "train_ds_r = regressor_ds(time_train,\n", " cos_train,\n", " series=series_train) # 切time series\n", "train_loader_r = train_ds_r.cache()\\\n", " .shuffle(1000).batch(32, drop_remainder=True).prefetch(-1)\n", "\n", "# 用各種regressor predict資料的testing set\n", "test_ds_r = regressor_ds(time_test,\n", " cos_test,\n", " series=series_test) # 切time series\n", "test_loader_r = 
test_ds_r.batch(32).prefetch(-1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hsGYkZHSYtRA" }, "outputs": [], "source": [ "for x, y in train_loader_r:\n", " pass\n", "print(x.shape, y.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5u71pbrmoeYv" }, "outputs": [], "source": [ "cos_train = seasonality(time_train, 180, 1., 45, 'cosine')\n", "plt.figure(figsize=(20, 5))\n", "plt.plot(cos_train, 'b', linewidth=3)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "bc7G-a15YtRA" }, "source": [ "## Linear Regression" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "GnJmC2lNYtRA" }, "source": [ "前面ML的課程已經教過Linear Regression,\n", "\n", "其中多變量迴歸是透過權重向量w(轉置後)與偏移b來預測下個時間點的值\n", "\n", "$y=w^Tx+b$\n", "\n", "透過訓練參數w與b,可以使系統更能擬合feature與預測值之間的關係。\n", "\n", "\n", "\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "ULbe4biUYtRB" }, "source": [ "而Linear time regression中我們使用$t_{n-W:n}$代入input x的部分,計算output $y_n$應該要是多少" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "QYTsY090YtRB" }, "source": [ "" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "YsJKnu8vh1TO" }, "source": [ "而在示範的training data中除了time regression我們還加上cosine波形作為另一個regression feature (又稱regressor)。\n", "\n", "我們的t是前面```time_train```的部分,cosine波形是前面```cos_train```,y是前面```series_train```的部分" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "8lUoc8KSYtRB" }, "source": [ "### Build TF2 Model" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "5w8M63wxYtRC" }, "source": [ "而我們這邊可以用一個單層且無activation的Dense層來完成這個功能,並在Dense前面加一個Flatten層,把多個regressor攤平成一個feature向量" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "A4WcRBw6YtRC" }, "outputs": [], "source": [ "model = models.Sequential([\n", " layers.Flatten(input_shape=[2], data_format=\"channels_first\"),\n", " layers.Dense(1, activation=None)\n", "])" ] }, {
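"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "在看Keras model之前,可以先用numpy的最小平方法(closed-form least squares)驗證這種線性迴歸應該學到的參數。以下是一個自包含的小示意(資料在範例內重新合成,設定沿用上面```toy_generation```的預設值,並非直接使用notebook中的變數):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# 重新合成一組與上面設定相似的資料: bias=500, slope=0.1, amplitude=40\n", "rng = np.random.RandomState(2022)\n", "t = np.arange(365 * 3).astype(float)\n", "cos = np.cos(((t + 30) % 180) / 180 * 2 * np.pi)\n", "y = 500. + 0.1 * t + 40. * cos + rng.randn(len(t)) * 5.\n", "\n", "# design matrix: [time, cosine, 1(bias)]\n", "X = np.stack([t, cos, np.ones_like(t)], axis=-1)\n", "\n", "# closed-form least squares\n", "coef, *_ = np.linalg.lstsq(X, y, rcond=None)\n", "print(coef) # 應接近 [0.1, 40, 500]" ] }, {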
"cell_type": "code", "execution_count": null, "metadata": { "id": "Pb9AW_HFYtRC", "scrolled": true }, "outputs": [], "source": [ "model.summary()\n", "# 共有K+1個parameter,K是放進去的regressor數量(這邊是2),最後一個parameter是bias" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bEdTdN7AYtRD" }, "outputs": [], "source": [ "opt = optimizers.Adam(learning_rate=1e-1)\n", "model.compile(loss=\"mse\", optimizer=opt)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "OEuHJz_iYtRD" }, "source": [ "### Training" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dg3CPRfKYtRE", "scrolled": true }, "outputs": [], "source": [ "history = model.fit(\n", " train_loader_r,\n", " epochs=400,\n", " verbose=2,\n", " callbacks=[\n", " tf.keras.callbacks.ReduceLROnPlateau(monitor='loss'),\n", " tf.keras.callbacks.EarlyStopping(\n", " monitor='loss',\n", " patience=20,\n", " verbose=2)])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "2Qb4uMiqYtRE" }, "source": [ "可以看一下model的weight" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wpShItcxYtRE" }, "outputs": [], "source": [ "# 第一個是time trend的weight,第二個是cosine wave的weight\n", "model.weights[0].numpy()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "Bzvxsa4CYtRF" }, "source": [ "weight的順序與input的regressor順序相同:第一個對應linear trend,第二個對應cosine。\n", "\n", "可以看出這兩者的量值跟我們預設的slope與amplitude很接近" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5tp2OMR8YtRF" }, "outputs": [], "source": [ "# bias\n", "model.weights[1].numpy()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "CDjgdYuNYtRF" }, "source": [ "bias則是很逼近我們預設的level" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "SUeaJZYiYtRG" }, "source": [ "### Evaluate" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c7Jr6kjdYtRG" }, "outputs": [], "source": [ 
"model.evaluate(test_loader_r)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "oG62QnufYtRH" }, "source": [ "基本上就是很圓滑的線,因為我們只用了linear trend以及cosine而已\n", "\n", "結果就還好" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2m02RLq1YtRG" }, "outputs": [], "source": [ "forcast = model.predict(test_loader_r)[:, 0]\n", "time_for_view = time_test\n", "\n", "plot_series(time_for_view,\n", " [forcast, series_test],\n", " labels=['prediction', 'ground truth'])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "yi8x49FLgQNa" }, "source": [ "這邊我們把那些訓練好的weight套上個別regressor,並扣除掉訓練好的bias的影響\n", "\n", "畫出來看看:\n", "\n", "$trend[t]=time[t]*weight_{0}$\n", "\n", "$seasonality[t]=cos[t]*weight_{1}$\n", "\n", "$y'[t]=y[t]-bias$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "t4Zj_j4te70x" }, "outputs": [], "source": [ "plot_series(time_for_view,\n", " [time_test*model.weights[0][0].numpy(),\n", " cos_test*model.weights[0][1].numpy(),\n", " series_test-model.weights[1][0].numpy()],\n", " labels=['weighted time', 'weighted cosine', 'ground truth'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Gr7goZPvYtRH" }, "outputs": [], "source": [ "# 算出R2 score來看看\n", "R2(forcast, series_test)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "nkxYNtV3YtRH" }, "source": [ "### Exercise\n", "請試試看不同regressor對regression的影響,可以注意到對weight跟bias的影響\n", "\n", "e.g. 
將```pattern_type```換成```triangle```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "6IEsyU2lYtRH" }, "source": [ "## Autoregressive Model" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "CDr-WShRYtRI" }, "source": [ "Autoregression假設現在的序列與前面有限個時間點的數個序列有關:\n", "\n", "$y_t=w_1 y_{t-1}+w_2 y_{t-2}+...+w_T y_{t-T}$\n", "\n", "其實就是對前面的時間點套用一個固定的加權和,而這個加權的權重是可以訓練的。\n", "\n", "也可以看作是套用某個波形作1D convolution來預測未來序列,這樣在有一定周期性時是有幫助的。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EzJSngVAYtRI" }, "outputs": [], "source": [ "# 這邊要用不一樣的dataset\n", "window_size = 7\n", "\n", "# 用資料predict資料的training set\n", "train_ds = win_ar_ds(series_train, size=window_size) # 切time series\n", "train_loader = train_ds.cache()\\\n", " .shuffle(1000).batch(32, drop_remainder=True).prefetch(-1)\n", "\n", "# 用資料predict資料的testing set\n", "test_ds = win_ar_ds(series_test, size=window_size) # 切time series\n", "test_loader = test_ds.batch(32).prefetch(-1)\n", "for x, y in train_loader:\n", " pass\n", "print(x.shape, y.shape)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "TmKyO7qNYtRI" }, "source": [ "### Build TF2 Model" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "1lYsF7VsYtRJ" }, "source": [ "而我們這邊可以用一個單層且無activation的Dense層來完成一個AR model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zHL6umB3YtRJ" }, "outputs": [], "source": [ "model_ar = models.Sequential([\n", " layers.Dense(1, input_shape=[window_size], activation=None)\n", "])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oblTI6jCYtRJ" }, "outputs": [], "source": [ "model_ar.summary()\n", "# 共有window_size+1個parameter,最後一個parameter是bias" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "uo-8PxgYYtRJ" }, "outputs": [], "source": [ "opt = optimizers.Adam(learning_rate=1e-2)\n", "model_ar.compile(loss=\"mse\", optimizer=opt)" ] }, { 
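"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "上面提到AR model也可以看作對series套用某個波形作1D convolution。下面用一個自包含的小示意(權重是假設的示意值,並非訓練出來的)驗證sliding window加權和與```np.convolve```等價:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "w = np.array([0.1, 0.2, 0.3, 0.4]) # 假設性的AR權重(示意用)\n", "x = np.arange(10, dtype=float) # 任意序列\n", "\n", "# 逐個window做加權和: pred[i] = sum_k x[i+k] * w[k]\n", "pred = np.array([x[i:i+len(w)] @ w for i in range(len(x) - len(w) + 1)])\n", "\n", "# 等價於把權重反轉後做valid模式的1D convolution\n", "conv = np.convolve(x, w[::-1], mode='valid')\n", "\n", "print(np.allclose(pred, conv)) # True" ] }, { 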
"attachments": {}, "cell_type": "markdown", "metadata": { "id": "RiQDt-FFYtRK" }, "source": [ "### Training" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cnQjmYkjYtRK", "scrolled": true }, "outputs": [], "source": [ "history = model_ar.fit(\n", " train_loader,\n", " epochs=400,\n", " verbose=2,\n", " callbacks=[\n", " tf.keras.callbacks.ReduceLROnPlateau(monitor='loss'),\n", " tf.keras.callbacks.EarlyStopping(\n", " monitor='loss',\n", " patience=20,\n", " verbose=2)])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "4D1N7lwiYtRK" }, "source": [ "可以看一下model的weight" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DULfcM7yYtRL" }, "outputs": [], "source": [ "# weight\n", "plt.bar(np.arange(window_size), model_ar.weights[0].numpy().squeeze())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "cMtyzxSrYtRL" }, "source": [ "它的weight與autocorrelation function(自相關函數)正相關,也就是前面講的convolution時用到的波形。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VTIU4AgtYtRL" }, "outputs": [], "source": [ "# bias\n", "model_ar.weights[1]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "7smqERA9YtRL" }, "source": [ "### Evaluate" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Cz-oFgSbYtRM" }, "outputs": [], "source": [ "model_ar.evaluate(test_loader)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "roqXpsadYtRM" }, "outputs": [], "source": [ "forcast = model_ar.predict(test_loader)[:, 0]\n", "ground_truth_for_view = series_test[window_size:]\n", "time_for_view = time_test[window_size:]\n", "\n", "plot_series(time_for_view,\n", " [forcast, ground_truth_for_view],\n", " labels=['prediction', 'ground truth'])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "llKCeRqDYtRM" }, "source": [ "這邊我們使用autoregression的結果,因為noise無法擬合,所以會受到干擾\n", "\n", "而且跟SES很像,如果預估的時間很長,則預估會越來越不準" ] }, { 
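"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "關於「預估時間很長會越來越不準」,下面用一個自包含的小示意來展示:用numpy最小平方法擬合一個假設性的AR(7)(並非上面訓練的```model_ar```),比較one-step forecast與recursive multi-step forecast(把預測餵回去當輸入)的誤差:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "rng = np.random.RandomState(0)\n", "t = np.arange(600)\n", "x = np.cos(2 * np.pi * t / 180) + rng.randn(len(t)) * 0.1\n", "\n", "w = 7\n", "# 以最小平方法擬合一個AR(7): 用前7點的加權和(加bias)預測下一點\n", "X = np.stack([x[i:i+w] for i in range(len(x) - w)])\n", "y = x[w:]\n", "coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)\n", "\n", "horizon = 120\n", "# one-step forecast: 每一步都用真實的歷史值\n", "one_step = X[-horizon:] @ coef[:-1] + coef[-1]\n", "\n", "# recursive multi-step forecast: 把自己的預測餵回去當輸入\n", "hist = list(x[-horizon-w:-horizon])\n", "multi = []\n", "for _ in range(horizon):\n", "    nxt = float(np.array(hist[-w:]) @ coef[:-1] + coef[-1])\n", "    multi.append(nxt)\n", "    hist.append(nxt)\n", "\n", "gt = x[-horizon:]\n", "print(np.abs(one_step - gt).mean(), np.abs(np.array(multi) - gt).mean())" ] }, { 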
"cell_type": "code", "execution_count": null, "metadata": { "id": "OVKvuebUYtRM" }, "outputs": [], "source": [ "R2(forcast, ground_truth_for_view)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "g6dhkhpjYtRN" }, "source": [ "## Autoregressive Data" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "DtqswrubYtRN" }, "source": [ "當資料有短期的autoregressive效應時,目前資料會受過往資料影響較多。\n", "\n", "我們拿一個autoregressive kernel與一組起始的數值來試試看" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "m8zodJ3rYtRN" }, "outputs": [], "source": [ "kernel = np.array([-0.5, 0.4, -0.3, 0.4, 0.5])\n", "kernel = kernel/np.linalg.norm(kernel)\n", "\n", "series = [2, 2, 2, 2, 2]\n", "# series = [1, 2, 3, 4, 5]\n", "# series = [5, 4, 3, 2, 1]\n", "for i in range(80):\n", " last = np.array(series[-5:])\n", " series.append(kernel@last)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WjRU9Py-YtRN" }, "outputs": [], "source": [ "plot_series(np.arange(len(series)), np.array(series))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "4WUmjLCDYtRO" }, "source": [ "其實可以看出來,autoregression的結果跟起始狀態關係不大:只要起始值不全為0,且weight一樣,最後都會converge到一樣的pattern" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "O3F6jwA2oeY6" }, "source": [ "## Autoregressive Integrated Moving Average (ARIMA)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "tXMvJGVaoeY6" }, "source": [ "這邊AR model 可延伸出ARMA, ARIMA或者更後面的SARIMA model。\n", "\n", "ARMA就是分成兩個AR model對資料做擬合:\n", "1. 第一部分是做time series的AR model;\n", "2. 
另一部分是residual 的AR model\n", "\n", "$F_n=\\beta_1 y_{n-1}+\\beta_2 y_{n-2}+...+\\beta_p y_{n-p}\\ (part 1)$\n", "\n", " $+\\epsilon_n+\\theta_1 \\epsilon_{n-1}+\\theta_2 \\epsilon_{n-2}+...+\\theta_q \\epsilon_{n-q}\\ (part 2)$\n", " \n", "這residual term 是從原series扣掉AR model forecast的結果產生的序列\n", "\n", "(較為簡單的版本是用moving average取代AR model,後來比較常用AR model做第一階段forecast)\n", "\n", "$\\epsilon_n=y_n-(\\beta_1 y_{n-1}+\\beta_2 y_{n-2}+...+\\beta_p y_{n-p})$\n", "\n", "\n", "" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "Sb3myl5uoeY7" }, "source": [ "那整體模型則是由原series經過AR模型得到AR forecast,並與原series相減得到residual,\n", "residual 再經由MA模型(part 2)得到MA forecast,\n", "最後這兩個forecast加在一起就是ARMA model的預測。\n", "\n", "至於超參數p與q則是需要一些對data的了解或嘗試取得最佳解。\n", "\n", "" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "15QbppOvoeY7" }, "source": [ "ARIMA就是先做完differencing再做前面的ARMA,differencing之後可以去掉trend,所以要求只需要保持不要有seasonality就好。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vRkZYyLGoeY7" }, "outputs": [], "source": [ "# 用Sequential方式也組成MA model\n", "# 這邊參數數量q使用與AR model一樣\n", "model_ma = models.Sequential([\n", " layers.Dense(1, input_shape=[window_size], activation=None, use_bias=False)\n", "])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HYcw71OuoeY7" }, "outputs": [], "source": [ "model_ma.summary()\n", "# 共有window_size個parameter" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yNFGlFU2oeY7" }, "outputs": [], "source": [ "opt = optimizers.Adam(learning_rate=1e-2)\n", "model_ma.compile(loss=\"mse\", optimizer=opt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zQVM1UlSoeY7" }, "outputs": [], "source": [ "# 這邊要重做一個dataset\n", "\n", "# 取出residual\n", "residual_train = series_train[window_size:]\\\n", " - model_ar.predict(train_ds.batch(32)).squeeze()\n", "residual_test = series_test[window_size:]\\\n", " - model_ar.predict(test_ds.batch(32)).squeeze()\n", "\n", "# 組成dataloader\n", "residual_train_loader = win_ar_ds(residual_train, 
size=window_size)\\\n", " .cache().shuffle(1000).batch(32, drop_remainder=True).prefetch(-1)\n", "\n", "residual_test_loader = win_ar_ds(residual_test, size=window_size)\\\n", " .batch(32).prefetch(-1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2OghrwmBoeY8", "scrolled": true }, "outputs": [], "source": [ "history = model_ma.fit(\n", " residual_train_loader,\n", " epochs=400,\n", " verbose=2,\n", " callbacks=[\n", " tf.keras.callbacks.ReduceLROnPlateau(monitor='loss'),\n", " tf.keras.callbacks.EarlyStopping(\n", " monitor='loss',\n", " patience=20,\n", " verbose=2)])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "t5FBRYJnoeY8" }, "source": [ "### Evaluate" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_k3SlgvRoeY8", "scrolled": true }, "outputs": [], "source": [ "model_ma.evaluate(residual_test_loader)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "N0RqFjrHoeY8" }, "outputs": [], "source": [ "forcast_ar = model_ar.predict(test_loader)[window_size:, 0] # AR 預測\n", "forcast_ma = model_ma.predict(residual_test_loader)[:, 0] # MA 預測\n", "forcast = forcast_ar+forcast_ma # 加起來\n", "\n", "ground_truth_for_view = series_test[window_size*2:]\n", "\n", "time_for_view = time_test[window_size*2:]\n", "\n", "plot_series(time_for_view,\n", " [forcast, ground_truth_for_view],\n", " labels=['prediction', 'ground truth'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "t4XKuQ9woeY8" }, "outputs": [], "source": [ "# 與純粹AR模型相比,可以得到稍好一點的結果\n", "R2(forcast, ground_truth_for_view)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "YXHudvUKoeY8" }, "source": [ "ARMA與ARIMA模型只差differencing而已,可以使用一些現成的套件完成" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "EVW90grHoeY9" }, "source": [ "## Call Function from Module" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "daRPMwhooeY9" }, 
"source": [ "statsmodels有提供ARIMA的model,對training的時間點資料進行擬合,訓練出differenced data的AR及MA模型。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "azQAReFsoeY9" }, "outputs": [], "source": [ "from statsmodels.tsa.arima_model import ARIMA" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wrsJWZhMoeY9" }, "outputs": [], "source": [ "# 組成ARIMA模型\n", "# Args:\n", "# endog (array of float) - 時間序列\n", "# order (tuple of (int, int, int) ) - ARIMA order, 包含p,d,q\n", "# p - AR model的coefficient數量\n", "# d - differencing次數\n", "# q - 擬合residual term的coefficient數量\n", "# Returns:\n", "# ARIMA模型物件 (statsmodels.tsa.arima_model)\n", "\n", "# endog: Input data\n", "arima = ARIMA(endog=series_train, order=(7, 1, 7))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "Nb-9YHFnoeY9" }, "source": [ "用p, d, q 參數組合,除ARIMA model外也可以組成AR Model和ARMA model:\n", "- AR: (p,0,0) 沒有residual也沒有differencing\n", "- ARMA: (p,0,q) 有residual沒有differencing\n", "- ARIMA: (p,d,q) 有residual也有differencing" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "6AwOyE5hoeY9" }, "source": [ "### Training" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4Q8pcXR7oeY9", "scrolled": true }, "outputs": [], "source": [ "mdl = arima.fit()\n", "print(mdl.summary())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "bQk2oRmLoeY9" }, "source": [ "Summary報告中可呈現各參數的數值\n", "- const: bias項\n", "- ar.L*.D.y:\n", " - coef: AR model的參數值\n", " - z: regression 參數的 z 統計值\n", "- ma.L*.D.y:\n", " - coef: MA model的參數值\n", " - z: regression 參數的 z 統計值" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KIf4VTuNoeY-" }, "outputs": [], "source": [ "forcast = mdl.forecast(len(series_test))[0]\n", "\n", "plot_series(time_test,\n", " [forcast, series_test],\n", " labels=['prediction', 'ground truth'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LrStkIQ4oeY-" 
}, "outputs": [], "source": [ "R2(forcast, series_test)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "MsIU-PxKoeY-" }, "source": [ "### References\n", "* statsmodels官網: https://www.statsmodels.org/dev/generated/statsmodels.tsa.arima.model.ARIMA.html\n", "* https://medium.com/analytics-vidhya/arima-model-from-scratch-in-python-489e961603ce\n", "* https://www.nbshare.io/notebook/136553745/Time-Series-Analysis-Using-ARIMA-From-StatsModels/" ] } ], "metadata": { "accelerator": "GPU", "colab": { "provenance": [] }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 0 }