{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "1mB0In5zYtQw"
},
"source": [
"# 5_Regression&Autoregression"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "vm8CtSi-YtQ1"
},
"source": [
"以上我們有了data,並組成data loader,我們使用剛剛loader可以來訓練一些模型。\n",
"\n",
"這邊我們用些簡單的迴歸模型來試試看:\n",
"- Linear Regression\n",
"- Autoregression"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "VmcMo4yAYtQ2"
},
"source": [
"**先import一些套件**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "buzO-o1yYtQ3"
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"import matplotlib.pyplot as plt\n",
"from plotly import express as px\n",
"\n",
"import numpy as np\n",
"import tensorflow.data as tfd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "T19LAWQOJj63"
},
"outputs": [],
"source": [
"# Colab使用要更換statsmodels版本 ,安裝完請重新啟動執行階段\n",
"!pip uninstall -y statsmodels\n",
"!pip install statsmodels==0.11.1"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "TU9iLRCiYtQ6"
},
"source": [
"**給一些必要function: 畫圖、合成資料產生、window data loader產生、評估function**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "rKMh4kQyYtQ7"
},
"outputs": [],
"source": [
"def plot_series(time, series, start=0, end=None, labels=None, title=None):\n",
" # Visualizes time series data\n",
" # Args:\n",
" # time (array of int) - 時間點, 長度為T\n",
" # series (list of array of int) - 時間點對應的資料列表,列表內時間序列數量為D,\n",
" # 每筆資料長度為T,若非為列表則轉為列表\n",
" # start (int) - 開始的資料序(第幾筆)\n",
" # end (int) - 結束繪製的資料序(第幾筆)\n",
" # labels (list of strings)- 對於多時間序列或多維度的標註\n",
" # title (string)- 圖片標題\n",
"\n",
" # 若資料只有一筆,則轉為list\n",
" if type(series) != list:\n",
" series = [series]\n",
"\n",
" if not end:\n",
" end = len(series[0])\n",
"\n",
" if labels:\n",
" # 設立dictionary, 讓plotly畫訊號線時可以標註label\n",
" dictionary = {\"time\": time}\n",
" for idx, l in enumerate(labels):\n",
" # 截斷資料,保留想看的部分,並分段紀錄於dictionary中\n",
" dictionary.update({l: series[idx][start:end]})\n",
" # 畫訊號線\n",
" fig = px.line(dictionary,\n",
" x=\"time\",\n",
" y=list(dictionary.keys())[1:],\n",
" width=1000,\n",
" height=400,\n",
" title=title)\n",
" else:\n",
" # 畫訊號線\n",
" fig = px.line(x=time, y=series, width=1000, height=400, title=title)\n",
" fig.show()\n",
"\n",
"\n",
"# 合成資料生成\n",
"def trend(time, slope=0):\n",
" # 產生合成水平直線資料,其長度與時間等長,直線趨勢與設定slope相同\n",
" # Args:\n",
" # time (array of int) - 時間點, 長度為T\n",
" # slope (float) - 設定資料的傾斜程度與正負\n",
" # Returns:\n",
" # series (array of float) - 產出slope 與設定相同的一條線\n",
"\n",
" series = slope * time\n",
"\n",
" return series\n",
"\n",
"\n",
"def seasonal_pattern(season_time, pattern_type='triangle'):\n",
" # 產生某個特定pattern,\n",
" # Args:\n",
" # season_time (array of float) - 周期內的時間點, 長度為T\n",
" # pattern_type (str) - 這邊提供triangle與cosine\n",
" # Returns:\n",
" # data_pattern (array of float) - 根據自訂函式產出特定的pattern\n",
"\n",
" # 用特定function生成pattern\n",
" if pattern_type == 'triangle':\n",
" data_pattern = np.where(season_time < 0.5,\n",
" season_time*2,\n",
" 2-season_time*2)\n",
" if pattern_type == 'cosine':\n",
" data_pattern = np.cos(season_time*np.pi*2)\n",
"\n",
" return data_pattern\n",
"\n",
"\n",
"def seasonality(time, period, amplitude=1, phase=30, pattern_type='triangle'):\n",
" # Repeats the same pattern at each period\n",
" # Args:\n",
" # time (array of int) - 時間點, 長度為T\n",
" # period (int) - 週期長度,必小於T\n",
" # amplitude (float) - 序列幅度大小\n",
" # phase (int) - 相位,為遞移量,正的向左(提前)、負的向右(延後)\n",
" # pattern_type (str) - 這邊提供triangle與cosine\n",
" # Returns:\n",
" # data_pattern (array of float) - 有指定周期、振幅、相位、pattern後的time series\n",
"\n",
" # 將時間依週期重置為0\n",
" season_time = ((time + phase) % period) / period\n",
"\n",
" # 產生週期性訊號並乘上幅度\n",
" data_pattern = amplitude * seasonal_pattern(season_time, pattern_type)\n",
"\n",
" return data_pattern\n",
"\n",
"\n",
"def noise(time, noise_level=1, seed=None):\n",
" # 合成雜訊,這邊用高斯雜訊,機率密度為常態分布\n",
" # Args:\n",
" # time (array of int) - 時間點, 長度為T\n",
" # noise_level (float) - 雜訊大小\n",
" # seed (int) - 同樣的seed可以重現同樣的雜訊\n",
" # Returns:\n",
" # noise (array of float) - 雜訊時間序列\n",
"\n",
" # 做一個基於某個seed的雜訊生成器\n",
" rnd = np.random.RandomState(seed)\n",
"\n",
" # 生與time同長度的雜訊,並且乘上雜訊大小 (不乘的話,標準差是1)\n",
" noise = rnd.randn(len(time)) * noise_level\n",
"\n",
" return noise\n",
"\n",
"\n",
"def toy_generation(time=np.arange(4 * 365),\n",
" bias=500.,\n",
" slope=0.1,\n",
" period=180,\n",
" amplitude=40.,\n",
" phase=30,\n",
" pattern_type='triangle',\n",
" noise_level=5.,\n",
" seed=2022):\n",
" signal_series = bias\\\n",
" + trend(time, slope)\\\n",
" + seasonality(time,\n",
" period,\n",
" amplitude,\n",
" phase,\n",
" pattern_type)\n",
" noise_series = noise(time, noise_level, seed)\n",
"\n",
" series = signal_series+noise_series\n",
" return series\n",
"\n",
"\n",
"# Dataset\n",
"def win_ar_ds(series, size, shift=1):\n",
" # 輸出Window-wise Forcasting Dataset\n",
" # Args:\n",
" # series (array of float) - 時序資料, 長度為T\n",
" # size (int) - Window大小\n",
" # shift (int) - 每個window起始點間距\n",
" # Returns:\n",
" # (tf.data.Dataset(母類名稱,切確type為MapDataset)) -\n",
" # - 一個一次生一個window的生成器\n",
"\n",
" ds = tfd.Dataset.from_tensor_slices(series)\n",
" ds = ds.window(size=size+1, shift=1, drop_remainder=True)\n",
" ds = ds.flat_map(lambda ds: ds.batch(size+1))\n",
" return ds.map(lambda x: (x[:-1], x[-1:]))\n",
"\n",
"\n",
"def regressor_ds(*regressors, series):\n",
" # 輸出Window-wise Regressor Forcasting Dataset\n",
" # Args:\n",
" # regressors (arguments of array of float) - 多個迴歸因子,每個長度為T\n",
" # series (array of float) - 預測對象,長度\n",
" # Returns:\n",
" # (tf.data.Dataset(母類名稱,切確type為TensorSliceDataset)) -\n",
" # - 一次生regressors和time series的dataset\n",
"\n",
" ds = tfd.Dataset.from_tensor_slices((np.stack(regressors, -1), series))\n",
" return ds\n",
"\n",
"\n",
"# 評估function\n",
"def MAE(pred, gt):\n",
" # 計算Mean Absolute Error\n",
" # Args:\n",
" # pred (array of float) - 預測資料\n",
" # gt (array of float) - 答案資料\n",
" # Returns:\n",
" # 計算結果 (float)\n",
" return abs(pred-gt).mean()\n",
"\n",
"\n",
"def MSE(pred, gt):\n",
" # 計算Mean Square Error\n",
" # Args:\n",
" # pred (array of float) - 預測資料\n",
" # gt (array of float) - 答案資料\n",
" # Returns:\n",
" # 計算結果 (float)\n",
" return pow(pred-gt, 2).mean()\n",
"\n",
"\n",
"def R2(pred, gt):\n",
" # 計算R square score\n",
" # Args:\n",
" # pred (array of float) - 預測資料\n",
" # gt (array of float) - 答案資料\n",
" # Returns:\n",
" # 計算結果 (float)\n",
" return 1-pow(pred-gt, 2).sum()/pow(gt-gt.mean(), 2).sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_hg1yBiFYtQ-"
},
"outputs": [],
"source": [
"def split(x, train_size):\n",
" return x[..., :train_size], x[..., train_size:]\n",
"\n",
"\n",
"# 先合成資料,還有作資料分割\n",
"time = np.arange(4*365) # 定義時間點\n",
"series_sample = toy_generation(time, pattern_type='cosine') # 這就是我們合成出來的資料\n",
"\n",
"time_train, time_test = split(time, 365*3)\n",
"series_train, series_test = split(series_sample, 365*3)\n",
"\n",
"# 另外也加上輔助資料\n",
"cos_train = seasonality(time_train, 180, 1., 30, 'cosine')\n",
"cos_test = seasonality(time_test, 180, 1., 30, 'cosine')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Mf7Hx8kqYtQ-"
},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"from tensorflow.keras import models, layers, losses, optimizers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "X_mruI5NYtQ_"
},
"outputs": [],
"source": [
"# 用各種regressor predict資料的training set\n",
"train_ds_r = regressor_ds(time_train,\n",
" cos_train,\n",
" series=series_train) # 切time series\n",
"train_loader_r = train_ds_r.cache()\\\n",
" .shuffle(1000).batch(32, drop_remainder=True).prefetch(-1)\n",
"\n",
"# 用各種regressor predict資料的testing set\n",
"test_ds_r = regressor_ds(time_test,\n",
" cos_test,\n",
" series=series_test) # 切time series\n",
"test_loader_r = test_ds_r.batch(32).prefetch(-1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hsGYkZHSYtRA"
},
"outputs": [],
"source": [
"for x, y in train_loader_r:\n",
" pass\n",
"print(x.shape, y.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5u71pbrmoeYv"
},
"outputs": [],
"source": [
"cos_train = seasonality(time_train, 180, 1., 45, 'cosine')\n",
"plt.figure(figsize=(20, 5))\n",
"plt.plot(cos_train, 'b', linewidth=3)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "bc7G-a15YtRA"
},
"source": [
"## Linear Regression"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "GnJmC2lNYtRA"
},
"source": [
"前面ML的課程已經教過Linear Regression的課程,\n",
"\n",
"其中多變量迴歸時是透過轉置矩陣w與偏移b可以預測下個時間點的值\n",
"\n",
"$y=w^Tx+b$\n",
"\n",
"透過訓練參數w與b可以使系統更加能擬合feature與預測值之間關係。\n",
"\n",
"
\n",
"\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "ULbe4biUYtRB"
},
"source": [
"而Linear time regression中我們使用$t_{n-W:n}$代入input x的部分,計算output $y_n$應該要是多少"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "QYTsY090YtRB"
},
"source": [
"
"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "YsJKnu8vh1TO"
},
"source": [
"而在示範的training data中除了time regression我們還加上cosine波形作為另一個regression feature (又稱regressor)。\n",
"\n",
"我們的t是前面```time_train```的部分,cosine波形是前面```cos_train```,y是前面```series_train```的部分"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "8lUoc8KSYtRB"
},
"source": [
"### Build TF2 Model"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "5w8M63wxYtRC"
},
"source": [
"而我們這邊可以用一個一層且無activation的Dense層來完成這個功能,而且我們前面有幾個layer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "A4WcRBw6YtRC"
},
"outputs": [],
"source": [
"model = models.Sequential([\n",
" layers.Flatten(input_shape=[2], data_format=\"channels_first\"),\n",
" layers.Dense(1, activation=None)\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Pb9AW_HFYtRC",
"scrolled": true
},
"outputs": [],
"source": [
"model.summary()\n",
"# 共有window_size*K+1個parameter 就看要放幾個regressor,最後一個parameter是bias"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bEdTdN7AYtRD"
},
"outputs": [],
"source": [
"opt = optimizers.Adam(learning_rate=1e-1)\n",
"model.compile(loss=\"mse\", optimizer=opt)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "OEuHJz_iYtRD"
},
"source": [
"### Training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dg3CPRfKYtRE",
"scrolled": true
},
"outputs": [],
"source": [
"history = model.fit(\n",
" train_loader_r,\n",
" epochs=400,\n",
" verbose=2,\n",
" callbacks=[\n",
" tf.keras.callbacks.ReduceLROnPlateau(monitor='loss'),\n",
" tf.keras.callbacks.EarlyStopping(\n",
" monitor='loss',\n",
" patience=20,\n",
" verbose=2)])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "2Qb4uMiqYtRE"
},
"source": [
"可以看一下model的weight"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wpShItcxYtRE"
},
"outputs": [],
"source": [
"# 第一個是time trend的weight,第二個是cosine wave 的trend\n",
"model.weights[0].numpy()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "Bzvxsa4CYtRF"
},
"source": [
"這邊故意調整weight的順序,使得前面一半是linear trend相關的weight,後面一半是cosine相關的weight。\n",
"\n",
"可以看出這兩者的量值跟我們預設的很接近"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5tp2OMR8YtRF"
},
"outputs": [],
"source": [
"# bias\n",
"model.weights[1].numpy()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "CDjgdYuNYtRF"
},
"source": [
"bias則是很逼近我們預設的level"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "SUeaJZYiYtRG"
},
"source": [
"### Evaluate"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "c7Jr6kjdYtRG"
},
"outputs": [],
"source": [
"model.evaluate(test_loader_r)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "oG62QnufYtRH"
},
"source": [
"基本上就是很圓滑的線,因為我們只用了linear trend以及cosine而已\n",
"\n",
"結果就還好"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2m02RLq1YtRG"
},
"outputs": [],
"source": [
"forcast = model.predict(test_loader_r)[:, 0]\n",
"time_for_view = time_test\n",
"\n",
"plot_series(time_for_view,\n",
" [forcast, series_test],\n",
" labels=['prediction', 'ground truth'])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "yi8x49FLgQNa"
},
"source": [
"這邊我們把那些訓練好的weight套上個別regressor,並扣除掉訓練好的bias的影響\n",
"\n",
"畫出來看看:\n",
"\n",
"$trend[t]=time[t]*weight_{0}$\n",
"\n",
"$seasonality[t]=cos[t]*weight_{1}$\n",
"\n",
"$y'[t]=y[t]-bias$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "t4Zj_j4te70x"
},
"outputs": [],
"source": [
"plot_series(time_for_view,\n",
" [time_test*model.weights[0][0].numpy(),\n",
" cos_test*model.weights[0][1].numpy(),\n",
" series_test-model.weights[1][0].numpy()],\n",
" labels=['weighted time', 'weighted cosine', 'ground truth'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Gr7goZPvYtRH"
},
"outputs": [],
"source": [
"# 算出R2 score來看看\n",
"R2(forcast, series_test)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "nkxYNtV3YtRH"
},
"source": [
"### Exercise\n",
"請試試看不同regressor對regression的影響,可以注意到對weight跟bias的影響\n",
"\n",
"e.g. 將```pattern_type```換成```triangle```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "6IEsyU2lYtRH"
},
"source": [
"## Autoregressive Model"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "CDr-WShRYtRI"
},
"source": [
"Autoregression假設現在的序列與前面有限個時間點的數個序列有關:\n",
"\n",
"$y_t=w_0 y_{t-1}+w_1 y_{t-2}+...+ w_T y_{t-T} $\n",
"\n",
"其實就是對前面的時間點套用一個固定的加權和,而這個加權的權重是可以訓練的。\n",
"\n",
"也可以看作是套用某個波形作1D convolution來預測未來序列,這樣在有一定周期性時是有幫助的。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "EzJSngVAYtRI"
},
"outputs": [],
"source": [
"# 這邊要用不一樣的dataset\n",
"window_size = 7\n",
"\n",
"# 用資料predict資料的training set\n",
"train_ds = win_ar_ds(series_train, size=window_size) # 切time series\n",
"train_loader = train_ds.cache()\\\n",
" .shuffle(1000).batch(32, drop_remainder=True).prefetch(-1)\n",
"\n",
"# 用資料predict資料的testing set\n",
"test_ds = win_ar_ds(series_test, size=window_size) # 切time series\n",
"test_loader = test_ds.batch(32).prefetch(-1)\n",
"for x, y in train_loader:\n",
" pass\n",
"print(x.shape, y.shape)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "TmKyO7qNYtRI"
},
"source": [
"### Build TF2 Model"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "1lYsF7VsYtRJ"
},
"source": [
"而我們這邊可以用一個一層且無activation的Dense層來完成一個AR model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "zHL6umB3YtRJ"
},
"outputs": [],
"source": [
"model_ar = models.Sequential([\n",
" layers.Dense(1, input_shape=[window_size], activation=None)\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "oblTI6jCYtRJ"
},
"outputs": [],
"source": [
"model_ar.summary()\n",
"# 共有window_size+1個parameter,最後一個parameter是bias"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "uo-8PxgYYtRJ"
},
"outputs": [],
"source": [
"opt = optimizers.Adam(learning_rate=1e-2)\n",
"model_ar.compile(loss=\"mse\", optimizer=opt)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "RiQDt-FFYtRK"
},
"source": [
"### Training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cnQjmYkjYtRK",
"scrolled": true
},
"outputs": [],
"source": [
"history = model_ar.fit(\n",
" train_loader,\n",
" epochs=400,\n",
" verbose=2,\n",
" callbacks=[\n",
" tf.keras.callbacks.ReduceLROnPlateau(monitor='loss'),\n",
" tf.keras.callbacks.EarlyStopping(\n",
" monitor='loss',\n",
" patience=20,\n",
" verbose=2)])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "4D1N7lwiYtRK"
},
"source": [
"可以看一下model的weight"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "DULfcM7yYtRL"
},
"outputs": [],
"source": [
"# weight\n",
"plt.bar(np.arange(window_size), model_ar.weights[0].numpy().squeeze())"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "cMtyzxSrYtRL"
},
"source": [
"它的weight與AutoCorrelated Function正相關,也就是前面講的convolution時用到的波形。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VTIU4AgtYtRL"
},
"outputs": [],
"source": [
"# bias\n",
"model_ar.weights[1]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "7smqERA9YtRL"
},
"source": [
"### Evaluate"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Cz-oFgSbYtRM"
},
"outputs": [],
"source": [
"model_ar.evaluate(test_loader)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "roqXpsadYtRM"
},
"outputs": [],
"source": [
"forcast = model_ar.predict(test_loader)[:, 0]\n",
"ground_truth_for_view = series_test[window_size:]\n",
"time_for_view = time_test[window_size:]\n",
"\n",
"plot_series(time_for_view,\n",
" [forcast, ground_truth_for_view],\n",
" labels=['prediction', 'ground truth'])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "llKCeRqDYtRM"
},
"source": [
"這邊我們使用autoregression的結果,因為noise無法擬和,所以會受到干擾\n",
"\n",
"而且跟SES很像,如果預估的時間很長,則預估越來越不准"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OVKvuebUYtRM"
},
"outputs": [],
"source": [
"R2(forcast, ground_truth_for_view)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "g6dhkhpjYtRN"
},
"source": [
"## Autoregressive Data"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "DtqswrubYtRN"
},
"source": [
"當資料有短期的autoregressive效應時,目前資料會受過往資料影響較多。\n",
"\n",
"我們拿一個autoregressive kernel與一組起始的數值來試試看"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "m8zodJ3rYtRN"
},
"outputs": [],
"source": [
"kernal = np.array([-0.5, 0.4, -0.3, 0.4, 0.5])\n",
"kernal = kernal/np.linalg.norm(kernal)\n",
"\n",
"series = [2, 2, 2, 2, 2]\n",
"# series = [1, 2, 3, 4, 5]\n",
"# series = [5, 4, 3, 2, 1]\n",
"for i in range(80):\n",
" last = np.array(series[-5:])\n",
" series.append(kernal@last)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "WjRU9Py-YtRN"
},
"outputs": [],
"source": [
"plot_series(np.arange(len(series)), np.array(series))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "4WUmjLCDYtRO"
},
"source": [
"其實可以看出來,autoregression結果在哪種起始狀態都無所謂,只要不是全為0,只要weight一樣最後都會converge到一樣的pattern"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "O3F6jwA2oeY6"
},
"source": [
"## Autoregressive Integrated Moving Average (ARIMA)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "tXMvJGVaoeY6"
},
"source": [
"這邊AR model 可延伸出ARMA, ARIMA或者更後面的SARIMA model。\n",
"\n",
"ARMA就是分成兩個AR model對資料做擬合:\n",
"1. 第一部分是做time series的AR model;\n",
"2. 另一部分是residual 的AR model\n",
"\n",
"$F_𝑛=𝛽_1 𝑦_{𝑛−1}+𝛽_2 𝑦_{𝑛−2}+…+𝛽_𝑝 𝑦_{𝑛−𝑝}\\ (part 1)$\n",
"\n",
" $+𝜖_𝑛+𝜃_1 𝜖_{𝑛−1}+𝜃_2 𝜖_{𝑛−2}+…+𝜃_𝑞 𝜖_{𝑛−𝑞}\\ (part 2)$\n",
" \n",
"這residual term 是從原series扣掉AR model forecast的結果產生的序列\n",
"\n",
"(較為簡單的版本是用moving average取代AR model,後來比較常用AR model做第一階段forecast)\n",
"\n",
"$𝜖_𝑛=𝑦_𝑛− 𝛽_1 𝑦_{𝑛−1}+𝛽_2 𝑦_{𝑛−2}+…+𝛽_𝑝 𝑦_{𝑛−𝑝} $\n",
"\n",
"\n",
"
"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "Sb3myl5uoeY7"
},
"source": [
"那整體模型則是由原series經過AR模型得到AR forecast,並與原series相減得到residual\n",
"residual 再經由MA模型(part2)得到MA forecast\n",
"最後這兩個forecast加在一起就是ARMA model的預測\n",
"\n",
"至於超參數p與q則是需要一些對data的了解或嘗試取得最佳解。\n",
"\n",
"
"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "15QbppOvoeY7"
},
"source": [
"ARIMA就是先做完differencing再做前面的ARMA,Differencing之後可以去掉trend,所以要求只需要保持不要有seasonality就好。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vRkZYyLGoeY7"
},
"outputs": [],
"source": [
"# 用Sequencial方式也組成MA model\n",
"# 這邊參數數量q使用與AR model一樣\n",
"model_ma = models.Sequential([\n",
" layers.Dense(1, input_shape=[window_size], activation=None, use_bias=False)\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HYcw71OuoeY7"
},
"outputs": [],
"source": [
"model_ma.summary()\n",
"# 共有window_size個parameter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "yNFGlFU2oeY7"
},
"outputs": [],
"source": [
"opt = optimizers.Adam(learning_rate=1e-2)\n",
"model_ma.compile(loss=\"mse\", optimizer=opt)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "zQVM1UlSoeY7"
},
"outputs": [],
"source": [
"# 這邊要重做一個dataset\n",
"\n",
"# 取出residual\n",
"residual_train = series_train[window_size:]\\\n",
" - model_ar.predict(train_ds.batch(32)).squeeze()\n",
"residual_test = series_test[window_size:]\\\n",
" - model_ar.predict(test_ds.batch(32)).squeeze()\n",
"\n",
"# 組成dataloader\n",
"residual_train_loader = win_ar_ds(residual_test, size=window_size)\\\n",
" .cache().shuffle(1000).batch(32, drop_remainder=True).prefetch(-1)\n",
"\n",
"residual_test_loader = win_ar_ds(residual_test, size=window_size)\\\n",
" .batch(32).prefetch(-1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2OghrwmBoeY8",
"scrolled": true
},
"outputs": [],
"source": [
"history = model_ma.fit(\n",
" residual_train_loader,\n",
" epochs=400,\n",
" verbose=2,\n",
" callbacks=[\n",
" tf.keras.callbacks.ReduceLROnPlateau(monitor='loss'),\n",
" tf.keras.callbacks.EarlyStopping(\n",
" monitor='loss',\n",
" patience=20,\n",
" verbose=2)])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "t5FBRYJnoeY8"
},
"source": [
"### Evaluate"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_k3SlgvRoeY8",
"scrolled": true
},
"outputs": [],
"source": [
"model_ma.evaluate(residual_test_loader)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "N0RqFjrHoeY8"
},
"outputs": [],
"source": [
"forcast_ar = model_ar.predict(test_loader)[window_size:, 0] # AR 預測\n",
"forcast_ma = model_ma.predict(residual_test_loader)[:, 0] # MA 預測\n",
"forcast = forcast_ar+forcast_ma # 加起來\n",
"\n",
"ground_truth_for_view = series_test[window_size*2:]\n",
"\n",
"time_for_view = time_test[window_size*2:]\n",
"\n",
"plot_series(time_for_view,\n",
" [forcast, ground_truth_for_view],\n",
" labels=['prediction', 'ground truth'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "t4XKuQ9woeY8"
},
"outputs": [],
"source": [
"# 與純粹AR 模型相比 可以得到較好一點點的結果\n",
"R2(forcast, ground_truth_for_view)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "YXHudvUKoeY8"
},
"source": [
"ARMA與ARIMA模型只差differencing而已,可以使用一些現成的套件完成"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "EVW90grHoeY9"
},
"source": [
"## Call Function from Module"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "daRPMwhooeY9"
},
"source": [
"statsmodels有提供ARIMA的model,對training的時間點資料進行擬合,訓練出differeced data的AR及MA模型。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "azQAReFsoeY9"
},
"outputs": [],
"source": [
"from statsmodels.tsa.arima_model import ARIMA"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wrsJWZhMoeY9"
},
"outputs": [],
"source": [
"# 組成ARIMA模型\n",
"# Args:\n",
"# endog (array of float) - 時間序列\n",
"# order (tuple of (int, int, int) ) - ARIMA order, 包含p,d,q\n",
"# p - AR model的coefficient數量\n",
"# d - differencing次數\n",
"# q - 擬合residual term的coefficient數量\n",
"# Returns:\n",
"# ARIMA模型物件 (statsmodels.tsa.arima_model)\n",
"\n",
"# endog: Input data\n",
"arima = ARIMA(endog=series_train, order=(7, 1, 7))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "Nb-9YHFnoeY9"
},
"source": [
"用p, d, q 參數組和,除ARIMA model外也可以組成AR Model和ARMA model:\n",
"- AR: (p,0,0) 沒有residual也沒有differencing\n",
"- ARMA: (p,0,q) 有residual沒有differencing\n",
"- ARIMA: (p,d,q) 有residual也有differencing"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "6AwOyE5hoeY9"
},
"source": [
"### Training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4Q8pcXR7oeY9",
"scrolled": true
},
"outputs": [],
"source": [
"mdl = arima.fit()\n",
"print(mdl.summary())"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "bQk2oRmLoeY9"
},
"source": [
"Summary報告中可呈現各參數的數值\n",
"- const: bias項\n",
"- ar.L*.D.y:\n",
" - coef: AR model的參數值\n",
" - z: regression 參數的 z 統計值\n",
"- ma.L*.D.y:\n",
" - coef: MA model的參數值\n",
" - z: regression 參數的 z 統計值"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "KIf4VTuNoeY-"
},
"outputs": [],
"source": [
"forcast = mdl.forecast(len(series_test))[0]\n",
"\n",
"plot_series(time_test,\n",
" [forcast, series_test],\n",
" labels=['prediction', 'ground truth'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LrStkIQ4oeY-"
},
"outputs": [],
"source": [
"R2(forcast, series_test)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "MsIU-PxKoeY-"
},
"source": [
"### References\n",
"* statsmodes官網: https://www.statsmodels.org/dev/generated/statsmodels.tsa.arima.model.ARIMA.html\n",
"* https://medium.com/analytics-vidhya/arima-model-from-scratch-in-python-489e961603ce\n",
"* https://www.nbshare.io/notebook/136553745/Time-Series-Analysis-Using-ARIMA-From-StatsModels/"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"provenance": []
},
"gpuClass": "standard",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}