{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "c96ce503", "metadata": { "id": "9d8e0fd8" }, "source": [ "# Naive Methods and Metrics of Forcasting" ] }, { "attachments": {}, "cell_type": "markdown", "id": "68f7c42e", "metadata": { "id": "455bccc3" }, "source": [ "這邊我們先給各位看一下使用numpy手刻完成一些最簡單的forcasting\n", "\n", "numpy手刻只是為了讓同學了解這些算法在使用python實現的方式,後面我們會附上一些簡單使用的套件,以後在使用時直接call套件的class或者funciton就可以簡單實現演算法" ] }, { "attachments": {}, "cell_type": "markdown", "id": "05d78867", "metadata": { "id": "ee521e0c" }, "source": [ "課程包含以下內容:\n", "- Data Split\n", "- Naive Forcasting\n", " - K-Step Ahead\n", " - Seasonal K-Step Ahead\n", "- Metrics for Forcasting" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c5c5a0a2", "metadata": { "id": "da34f244" }, "source": [ "#### **開始前請先安裝或import基本套件**\n", "#### **若使用Jupyter Notebook開啟請轉成tree view方便顯示plotly出來的圖**" ] }, { "cell_type": "code", "execution_count": null, "id": "96338ae3", "metadata": { "id": "e3f9e30a", "scrolled": true }, "outputs": [], "source": [ "! pip install --user plotly" ] }, { "cell_type": "code", "execution_count": null, "id": "5bb2d3ce", "metadata": { "id": "3c4f6482" }, "outputs": [], "source": [ "%matplotlib notebook\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from plotly import express as px" ] }, { "attachments": {}, "cell_type": "markdown", "id": "77e8dbf9", "metadata": { "id": "25e0e629" }, "source": [ "**另外我們也先準備一個畫圖的function,我們不會放重點在這邊但後面會用它來看一些time series處理的過程**" ] }, { "cell_type": "code", "execution_count": null, "id": "0b1b4f0f", "metadata": { "id": "357c230d" }, "outputs": [], "source": [ "def plot_series(time, series, start=0, end=None, labels=None, title=None):\n", " # Visualizes time series data\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # series (list of array of int) - 時間點對應的資料列表,列表內時間序列數量為D,\n", " # 每筆資料長度為T,若非為列表則轉為列表\n", " # start (int) - 開始的資料序(第幾筆)\n", " # end (int) - 結束繪製的資料序(第幾筆)\n", " # labels (list of strings)- 對於多時間序列或多維度的標註\n", " # title (string)- 圖片標題\n", "\n", " # 若資料只有一筆,則轉為list\n", " if type(series) != list:\n", " series = [series]\n", "\n", " if not end:\n", " end = len(series[0])\n", "\n", " if labels:\n", " # 設立dictionary, 讓plotly畫訊號線時可以標註label\n", " dictionary = {\"time\": time}\n", " for idx, l in enumerate(labels):\n", " # 截斷資料,保留想看的部分,並分段紀錄於dictionary中\n", " dictionary.update({l: series[idx][start:end]})\n", " # 畫訊號線\n", " fig = px.line(dictionary,\n", " x=\"time\",\n", " y=list(dictionary.keys())[1:],\n", " width=1000,\n", " height=400,\n", " title=title)\n", " else:\n", " # 畫訊號線\n", " fig = px.line(x=time, y=series, width=1000, height=400, title=title)\n", " fig.show()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b2f17492", "metadata": { "id": "68a9cb51" }, "source": [ "這邊也附上我們的toy data產生器" ] }, { "cell_type": "code", "execution_count": null, "id": "79c8a77c", "metadata": { "id": "d7ba655a" }, "outputs": [], "source": [ "def trend(time, slope=0):\n", " # 產生合成水平直線資料,其長度與時間等長,直線趨勢與設定slope相同\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # slope (float) - 設定資料的傾斜程度與正負\n", " # Returns:\n", " # series (array of float) - 產出slope 與設定相同的一條線\n", "\n", " series = slope * time\n", "\n", " return series\n", "\n", "\n", "def seasonal_pattern(season_time, pattern_type='triangle'):\n", " # 產生某個特定pattern,\n", " # Args:\n", " # season_time (array of float) - 周期內的時間點, 長度為T\n", " # pattern_type (str) - 這邊提供triangle與cosine\n", " # Returns:\n", " # data_pattern (array of float) - 根據自訂函式產出特定的pattern\n", "\n", " # 用特定function生成pattern\n", " if pattern_type == 'triangle':\n", " data_pattern = np.where(season_time < 0.5,\n", " season_time*2,\n", " 2-season_time*2)\n", " if pattern_type == 'cosine':\n", " data_pattern = np.cos(season_time*np.pi*2)\n", "\n", " return data_pattern\n", "\n", "\n", "def seasonality(time, period, amplitude=1, phase=30, pattern_type='triangle'):\n", " # Repeats the same pattern at each period\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # period (int) - 週期長度,必小於T\n", " # amplitude (float) - 序列幅度大小\n", " # phase (int) - 相位,為遞移量,正的向左(提前)、負的向右(延後)\n", " # pattern_type (str) - 這邊提供triangle與cosine\n", " # Returns:\n", " # data_pattern (array of float) - 有指定周期、振幅、相位、pattern後的time series\n", "\n", " # 將時間依週期重置為0\n", " season_time = ((time + phase) % period) / period\n", "\n", " # 產生週期性訊號並乘上幅度\n", " data_pattern = amplitude * seasonal_pattern(season_time, pattern_type)\n", "\n", " return data_pattern\n", "\n", "\n", "def noise(time, noise_level=1, seed=None):\n", " # 合成雜訊,這邊用高斯雜訊,機率密度為常態分布\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # noise_level (float) - 雜訊大小\n", " # seed (int) - 同樣的seed可以重現同樣的雜訊\n", " # Returns:\n", " # noise (array of float) - 雜訊時間序列\n", "\n", " # 做一個基於某個seed的雜訊生成器\n", " rnd = np.random.RandomState(seed)\n", "\n", " # 生與time同長度的雜訊,並且乘上雜訊大小 (不乘的話,標準差是1)\n", " noise = rnd.randn(len(time)) * noise_level\n", "\n", " return noise\n", "\n", "\n", "def toy_generation(time=np.arange(4 * 365),\n", " bias=500.,\n", " slope=0.1,\n", " period=180,\n", " amplitude=40.,\n", " phase=30,\n", " pattern_type='triangle',\n", " noise_level=5.,\n", " seed=2022):\n", " signal_series = bias\\\n", " + trend(time, slope)\\\n", " + seasonality(time,\n", " period,\n", " amplitude,\n", " phase,\n", " pattern_type)\n", " noise_series = noise(time, noise_level, seed)\n", "\n", " series = signal_series+noise_series\n", " return series" ] }, { "cell_type": "code", "execution_count": null, "id": "7af4a318", "metadata": { "id": "f2039fb2" }, "outputs": [], "source": [ "# 先合成資料\n", "time = np.arange(4 * 365) # 定義時間點\n", "series_sample = toy_generation(time) # 這就是我們合成出來的資料\n", "\n", "# 畫\n", "plot_series(time, series_sample)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "fd71a5b7", "metadata": { "id": "c7aba289" }, "source": [ "## Data Split" ] }, { "attachments": {}, "cell_type": "markdown", "id": "cac05d0b", "metadata": { "id": "17730024" }, "source": [ "我們做機器學習一定要切訓練與測試集,但時間序列是會要求使用過去資料預測未來資料,所以切的時後須帶有序列性並且testing資料需要取training資料的後面的資料。" ] }, { "cell_type": "code", "execution_count": null, "id": "eed6d07f", "metadata": { "id": "c342eb86" }, "outputs": [], "source": [ "def split(x, train_size):\n", " # 最簡單直接取前後,並且時間點也記得要切,我們直接立個function\n", " return x[..., :train_size], x[..., train_size:]\n", "\n", "\n", "time_train, time_test = split(time, 365*3)\n", "series_train, series_test = split(series_sample, 365*3)" ] }, { "cell_type": "code", "execution_count": null, "id": "8fc15240", "metadata": { "id": "887df120" }, "outputs": [], "source": [ "# 畫一下\n", "plot_series(time_train, series_train, title=\"Training\")\n", "plot_series(time_test, series_test, title=\"Testing\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a5ded3ca", "metadata": { "id": "dc645a08" }, "source": [ "## K-step Ahead" ] }, { "attachments": {}, "cell_type": "markdown", "id": "40236792", "metadata": { "id": "e25ce59a" }, "source": [ "這邊我們用最天真的方式來預測看看我們手上的資料,\n", "\n", "K-step ahead即是每次利用上K個時間資料直接預測下個時間點資料。\n", "\n", "這個前提是我們已經拿到K個時間點前的資料了,例如我們用今天的車流量去預測明天的車流量。" ] }, { "cell_type": "code", "execution_count": null, "id": "89c912a7", "metadata": { "id": "0ef5b443" }, "outputs": [], "source": [ "# 使用前面時間點資料預測下時間點\n", "K = 1\n", "forcast = series_train[:-K]\n", "ground_truth_for_view = series_train[K:]\n", "time_for_view = time_train[K:]\n", "\n", "plot_series(time_for_view,\n", " [forcast, ground_truth_for_view],\n", " labels=['prediction', 'ground truth'])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c66bb897", "metadata": { "id": "1819fd10" }, "source": [ "可以看到巨觀來講會蠻準的,但是把顯示圖片zoom-in一下會看到實際上會有跟noise大小差不多的差距。" ] }, { "attachments": {}, "cell_type": "markdown", "id": "cd4fe9bf", "metadata": { "id": "abe4f900" }, "source": [ "## Seasonal K-step Ahead" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a6596fe3", "metadata": { "id": "3895bf54" }, "source": [ "這個超參數K很吃資料,若不是對資料很清楚,會產生非常大的偏移。因為K-step ahead可看做對資料做時間平移。\n", "\n", "但也可以利用這個平移的特性,因為我們已經知道周期了,所以或許還可以使用一個周期來預測資料。\n", "\n", "但因為我們的case裡面有trend,所以直接apply週期上會不準" ] }, { "cell_type": "code", "execution_count": null, "id": "544c6a7d", "metadata": { "id": "a48d1d6f" }, "outputs": [], "source": [ "K = 5\n", "P = 180\n", "forcast = series_train[:-K-P]\n", "ground_truth_for_view = series_train[K+P:]\n", "time_for_view = time_train[K+P:]\n", "\n", "plot_series(time_for_view,\n", " [forcast, ground_truth_for_view],\n", " labels=['prediction', 'ground truth'])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "61f27111", "metadata": { "id": "0ebac4e7" }, "source": [ "我們完成一個predictor可以把它包起來包成一個function,以供後續使用\n", "\n", "很多time series forcasting method都會讓資料喪失資料點,會喪失K個" ] }, { "cell_type": "code", "execution_count": null, "id": "af6d9ad9", "metadata": { "id": "92ad026f" }, "outputs": [], "source": [ "def k_step_ahead(data, k):\n", " # 產生k-step ahead預測\n", " # Args:\n", " # data (array of float) - 輸入資料\n", " # Returns:\n", " # forcast (array of float) - k-step ahead預測結果\n", "\n", " forcast = data[:-k]\n", " return forcast" ] }, { "cell_type": "code", "execution_count": null, "id": "f403fc89", "metadata": { "id": "0db6e732", "scrolled": true }, "outputs": [], "source": [ "# 試用\n", "K = 3\n", "forcast = k_step_ahead(series_train, K)\n", "ground_truth_for_view = series_train[K:]\n", "time_for_view = time_train[K:]\n", "\n", "plot_series(time_for_view, \n", " [forcast, ground_truth_for_view],\n", " labels=['prediction', 'ground truth'])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b7d33b7f", "metadata": { "id": "358971a7" }, "source": [ "## Metrics for Forcasting" ] }, { "attachments": {}, "cell_type": "markdown", "id": "30c573ea", "metadata": { "id": "5dfb7931" }, "source": [ "當有了預測方式後,我們可以用一些指標衡量這些預測,通常是藉由error來計算的\n", "\n", "\n", "\n", "**一些常見做法:**\n", "- Mean Absolute Error (MAE)\n", "- Mean Square Error (MSE)\n", "- R square Score (R2)\n", "\n", "下面我們自己刻一下" ] }, { "attachments": {}, "cell_type": "markdown", "id": "0f913681", "metadata": { "id": "3e1ec34a" }, "source": [ "### Home Brew Metric Function" ] }, { "attachments": {}, "cell_type": "markdown", "id": "62886e33", "metadata": { "id": "e126474b" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "c54efa49", "metadata": { "id": "091dbe35" }, "outputs": [], "source": [ "def MAE(pred, gt):\n", " # 計算Mean Absolute Error\n", " # Args:\n", " # pred (array of float) - 預測資料\n", " # gt (array of float) - 答案資料\n", " # Returns:\n", " # 計算結果 (float)\n", " return abs(pred-gt).mean()" ] }, { "cell_type": "code", "execution_count": null, "id": "d75a3734", "metadata": { "id": "6646e8dc" }, "outputs": [], "source": [ "MAE(forcast, series_train[K:])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "09531b12", "metadata": { "id": "a78748fa" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "76c05839", "metadata": { "id": "cc543f6b" }, "outputs": [], "source": [ "def MSE(pred, gt):\n", " # 計算Mean Square Error\n", " # Args:\n", " # pred (array of float) - 預測資料\n", " # gt (array of float) - 答案資料\n", " # Returns:\n", " # 計算結果 (float)\n", " return pow(pred-gt, 2).mean()" ] }, { "cell_type": "code", "execution_count": null, "id": "766bd6bd", "metadata": { "id": "a767b2ca" }, "outputs": [], "source": [ "MSE(forcast, series_train[K:])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "eb4b4e26", "metadata": { "id": "8d678eed" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "id": "c0b98f37", "metadata": { "id": "da2f8e1a" }, "outputs": [], "source": [ "def R2(pred, gt):\n", " # 計算R square score\n", " # Args:\n", " # pred (array of float) - 預測資料\n", " # gt (array of float) - 答案資料\n", " # Returns:\n", " # 計算結果 (float)\n", " return 1-pow(pred-gt, 2).sum()/pow(gt-gt.mean(), 2).sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "fc8ea503", "metadata": { "id": "f34a4d75" }, "outputs": [], "source": [ "R2(forcast, series_train[K:])\n", "\n", "# 輸出值會在 [-inf~1],若等於0代表error與原本訊號的" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3dfe3659", "metadata": { "id": "0e5fced8" }, "source": [ "### Call Function from Module" ] }, { "attachments": {}, "cell_type": "markdown", "id": "35a6039f", "metadata": { "id": "bcc37e2a" }, "source": [ "而在tensorflow有直接使用的套件,都在```tensorflow.keras.metrics```裡面。" ] }, { "cell_type": "code", "execution_count": null, "id": "6e2474b7", "metadata": { "id": "09bb778a" }, "outputs": [], "source": [ "import tensorflow.keras.metrics as metrics" ] }, { "cell_type": "code", "execution_count": null, "id": "82afe640", "metadata": { "id": "cb76e4db" }, "outputs": [], "source": [ "# MAE\n", "print(metrics.mean_absolute_error(forcast, series_train[K:]))\n", "# MSE\n", "print(metrics.mean_squared_error(forcast, series_train[K:]))" ] }, { "attachments": {}, "cell_type": "markdown", "id": "0d1e0ac1", "metadata": { "id": "15f813ea" }, "source": [ "比較特別的是它們會用tf.Tensor的方式跑,也會儲存為tf.Tensor。\n", "\n", "如果自己設計training+validation機制,用tensorflow的這幾個算法可以讓Metric計算都在GPU裡面執行" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }