{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "3be909b3", "metadata": { "id": "99485bee" }, "source": [ "# Time Series 性質與資料合成" ] }, { "attachments": {}, "cell_type": "markdown", "id": "d2330d4d", "metadata": { "id": "1e0a6065" }, "source": [ "這邊我們先給各位看一下time series的基本樣貌" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1a64bc1e", "metadata": { "id": "a1215255" }, "source": [ "課程包含以下內容:\n", "- Time Series Data性質與合成\n", " - Level\n", " - Trend\n", " - Seasonality\n", " - Time Series性值疊加\n", " - Noise" ] }, { "attachments": {}, "cell_type": "markdown", "id": "29314268", "metadata": { "id": "b9c09c6c" }, "source": [ "#### **開始前請先安裝或import基本套件**\n", "#### **若使用Jupyter Notebook開啟請轉成tree view方便顯示plotly出來的圖**" ] }, { "cell_type": "code", "execution_count": null, "id": "45bbb260", "metadata": { "id": "7229df3f" }, "outputs": [], "source": [ "! pip install plotly" ] }, { "cell_type": "code", "execution_count": null, "id": "81fc5ef4", "metadata": { "id": "2f40cf91" }, "outputs": [], "source": [ "%matplotlib notebook\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from plotly import express as px" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6b4ac23f", "metadata": { "id": "83f7bb9d" }, "source": [ "**另外我們也先準備一個畫圖的function,我們不會放重點在這邊但後面會用它來看一些time series處理的過程**" ] }, { "cell_type": "code", "execution_count": null, "id": "fba76165", "metadata": { "id": "856babb3" }, "outputs": [], "source": [ "def plot_series(time, series, start=0, end=None, labels=None):\n", " # Visualizes time series data\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # series (list of array of int) - 時間點對應的資料列表,列表內時間序列數量為D,\n", " # 每筆資料長度為T,若非為列表則轉為列表\n", " # start (int) - 開始的資料序(第幾筆)\n", " # end (int) - 結束繪製的資料序(第幾筆)\n", " # labels (list of strings)- 對於多時間序列或多維度的標註\n", "\n", " # 若資料只有一筆,則轉為list\n", " if type(series) != list:\n", " series = [series]\n", " if not end:\n", " end = len(series[0])\n", " \n", " if labels:\n", " # 設立dictionary, 讓plotly畫訊號線時可以標註label\n", " dictionary = {\"time\": time}\n", " for idx, l in enumerate(labels):\n", " # 截斷資料,保留想看的部分,並分段紀錄於dictionary中\n", " dictionary.update({l: series[idx][start:end]})\n", "\n", " # 畫訊號線\n", " fig = px.line(dictionary,\n", " x=\"time\",\n", " y=list(dictionary.keys())[1:],\n", " width=1000,\n", " height=400)\n", " else:\n", " # 畫訊號線\n", " fig = px.line(x=time, y=series, width=1000, height=400)\n", " fig.show()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "65acdad8", "metadata": { "id": "4db8b7f3" }, "source": [ "## Time Series Data性質" ] }, { "attachments": {}, "cell_type": "markdown", "id": "182a8edc", "metadata": { "id": "841359af" }, "source": [ "時間序列資料中有些基本性質是我們預測的目標,以下我們用簡單的合成序列來說明" ] }, { "attachments": {}, "cell_type": "markdown", "id": "bbbd4d7e", "metadata": { "id": "9c682ffb" }, "source": [ "### Level" ] }, { "attachments": {}, "cell_type": "markdown", "id": "7e995994", "metadata": { "id": "02f563eb" }, "source": [ "Level是一個資料最基本的屬性,代表這個訊號主要的大小,會用固定值來model,例如median, mean, mininum之類。\n", "\n", "例如在房價在今年跟去年平均值不一樣,我們就可以說它們的level不同。\n", "\n", "最簡單我們用一條自己合成的水平線來做展示" ] }, { "cell_type": "code", "execution_count": null, "id": "26e06984", "metadata": { "id": "745eb551" }, "outputs": [], "source": [ "def level(time, bias=0.):\n", " # 產生合成水平直線資料,其長度與時間等長,直線的大小為輸入的level\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # level (float) - 設定資料的level\n", " # Returns:\n", " # series (array of float) - 產出level 與設定相同的一條線\n", "\n", " series = np.zeros_like(time)+bias\n", " return series" ] }, { "cell_type": "code", "execution_count": null, "id": "ef2e2573", "metadata": { "id": "1ef2b4ee" }, "outputs": [], "source": [ "# 產生time steps, 假設就365天的資料\n", "time = np.arange(365)\n", "\n", "# 設定不同的level的訊號要偏多少\n", "bais = [-1, 1, 2, 3]\n", "\n", "# 合成訊號\n", "series = [level(time, b) for b in bais]\n", "\n", "# 畫出來\n", "plot_series(time, series,\n", " labels=[f'level={b}' for b in bais])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b6d60192", "metadata": { "id": "b2d8199d" }, "source": [ "不過就numpy的性質,一個array可以直接用+號將全體內容加上一個bias,所以上述funciton後面就不用了" ] }, { "attachments": {}, "cell_type": "markdown", "id": "0e4049de", "metadata": { "id": "04b784a4" }, "source": [ "### Trend" ] }, { "attachments": {}, "cell_type": "markdown", "id": "218b549a", "metadata": { "id": "89c5e841" }, "source": [ "Trend是序列的趨勢,代表這個序列改變的方向,例如傾斜程度大小,傾斜方向是正還負\n", "\n", "最簡單我們用一條傾斜的直線來展示trend" ] }, { "cell_type": "code", "execution_count": null, "id": "24c623ff", "metadata": { "id": "25595a8e" }, "outputs": [], "source": [ "def trend(time, slope=0):\n", " # 產生合成水平直線資料,其長度與時間等長,直線趨勢與設定slope相同\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # slope (float) - 設定資料的傾斜程度與正負\n", " # Returns:\n", " # series (array of float) - 產出slope 與設定相同的一條線\n", "\n", " series = slope * time\n", " return series" ] }, { "cell_type": "code", "execution_count": null, "id": "9eee970f", "metadata": { "id": "de14b7df" }, "outputs": [], "source": [ "# 設定time steps, 假設就365天的資料\n", "time = np.arange(365)\n", "\n", "# 設定不同trend的傾斜程度為多少\n", "slopes = [-0.1, 0.1, 0.2, 0.3]\n", "\n", "# 依照不同傾斜程度合成資料\n", "series = [trend(time, s) for s in slopes]\n", "\n", "# 畫出來\n", "plot_series(time, series,\n", " labels=[f'slope={s}' for s in slopes])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f3144e37", "metadata": { "id": "a3a7a8b4" }, "source": [ "目前資料可以由公式推導知道這些slope, bias,但我們這邊先假設這是一個野生的資料,我們不知道它的function。\n", "\n", "如此會需要一些模型來完成推導" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f4437e68", "metadata": { "id": "25401124" }, "source": [ "### Seasonality" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c11a72e8", "metadata": { "id": "6cea5789" }, "source": [ "季節性這個直譯有點特定性太強,我們姑且翻譯成序列的周期性。\n", "\n", "有時序列會有可能在一定周期重複顯現,例如東亞梅花的開花數量在冬天會比較多,而使得年復一年,梅花開花數量這個時間序列的數值在冬天都會呈現高峰,這樣就是具周期性。而有周期性的序列,會依照周期不斷呈現一定的pattern,例如鐘形分布,三角形,或者sine, cosing等等。\n", "\n", "我們可以拿個簡單的重複三角形來展示這個功能" ] }, { "cell_type": "code", "execution_count": null, "id": "c439cff1", "metadata": { "id": "4d5a1b66" }, "outputs": [], "source": [ "def seasonal_pattern(season_time, pattern_type='triangle'):\n", " # 產生某個特定pattern,\n", " # Args:\n", " # season_time (array of float) - 周期內的時間點, 長度為T\n", " # pattern_type (str) - 這邊提供triangle與cosine\n", " # Returns:\n", " # data_pattern (array of float) - 根據自訂函式產出特定的pattern\n", "\n", " # 用特定function生成pattern\n", " if pattern_type == 'triangle':\n", "\n", " data_pattern = np.where(season_time < 0.5,\n", " season_time*2,\n", " 2-season_time*2)\n", " if pattern_type == 'cosine':\n", " data_pattern = np.cos(season_time*np.pi*2)\n", " return data_pattern" ] }, { "cell_type": "code", "execution_count": null, "id": "8305ecc4", "metadata": { "id": "422aa5f1" }, "outputs": [], "source": [ "# 設定season time steps, 假設就365天的資料\n", "time = np.arange(365)/365\n", "\n", "# 設定不同pattern\n", "pattern = ['triangle', 'cosine']\n", "\n", "# 根據不同pattern設定合成一周期的序列\n", "series = [seasonal_pattern(time, p) for p in pattern]\n", "\n", "# 畫出來\n", "plot_series(time, series, labels=pattern)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "945c24c4", "metadata": { "id": "e368bf00" }, "source": [ "周期性序列的組成,除了有剛剛的pattern外,還會有一些更動,例如序列周期、幅度、相位會影響序列的樣貌。\n", "\n", "這邊我們把剛剛單周期序列進行連續生成,組成好幾個週期的序列" ] }, { "cell_type": "code", "execution_count": null, "id": "935fee70", "metadata": { "id": "eddf108e" }, "outputs": [], "source": [ "def seasonality(time, period, amplitude=1, phase=30, pattern_type='triangle'):\n", " # Repeats the same pattern at each period\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # period (int) - 週期長度,必小於T\n", " # amplitude (float) - 序列幅度大小\n", " # phase (int) - 相位,為遞移量,正的向左(提前)、負的向右(延後)\n", " # pattern_type (str) - 這邊提供triangle與cosine\n", " # Returns:\n", " # data_pattern (array of float) - 有指定周期、振幅、相位、pattern後的time series\n", "\n", " # 將時間依週期重置為0\n", " season_time = ((time + phase) % period) / period\n", "\n", " # 產生週期性訊號並乘上幅度\n", " data_pattern = amplitude * seasonal_pattern(season_time, pattern_type)\n", "\n", " return data_pattern" ] }, { "cell_type": "code", "execution_count": null, "id": "7e5b7197", "metadata": { "id": "f12b5716" }, "outputs": [], "source": [ "# 設定time steps, 假設就4年的資料\n", "time = np.arange(4 * 365)\n", "\n", "# 給定週期參數\n", "periods = [180, 360, 720]\n", "\n", "# 合成訊號\n", "series = [seasonality(time,\n", " period=p,\n", " pattern_type='triangle')\n", " for p in periods]\n", "# 畫出來\n", "plot_series(time, series,\n", " labels=[f'period={p}' for p in periods])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "58e509f3", "metadata": { "id": "2a2e1d40" }, "source": [ "### Time Series性質疊加" ] }, { "attachments": {}, "cell_type": "markdown", "id": "8c3c2f61", "metadata": { "id": "207a5a25" }, "source": [ "剛提到的幾種性質其實都可能同時存在一個時間序列中,我們用合成訊號來看。" ] }, { "cell_type": "code", "execution_count": null, "id": "f984532e", "metadata": { "id": "edb8a786" }, "outputs": [], "source": [ "# 設定time steps, 假設就4年的資料\n", "time = np.arange(4 * 365)\n", "\n", "# 定義各種性質\n", "bias = 500\n", "slope = 0.1\n", "period = 180\n", "amplitude = 40\n", "\n", "# 合成資料\n", "series = level(time, bias)\\\n", " + trend(time, slope)\\\n", " + seasonality(time,\n", " period=period,\n", " amplitude=amplitude,\n", " pattern_type='triangle')\n", "# 畫\n", "plot_series(time, series)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2bcafe21", "metadata": { "id": "88dfd575" }, "source": [ "可以看到它的最低值至少大於500,長期斜率約為0.1,波動周期為180天,並且周期幅度為40,周期pattern是三角形。\n", "\n", "是上述很多性值的疊加狀態。" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3fc69d86", "metadata": { "id": "5f47994f" }, "source": [ "### Noise" ] }, { "attachments": {}, "cell_type": "markdown", "id": "067ce317", "metadata": { "id": "f4d22f3b" }, "source": [ "物理世界的訊號通常帶有雜訊,我們來看加入雜訊後長什麼樣子" ] }, { "cell_type": "code", "execution_count": null, "id": "aece070e", "metadata": { "id": "0ce82372" }, "outputs": [], "source": [ "def noise(time, noise_level=1, seed=None):\n", " # 合成雜訊,這邊用高斯雜訊,機率密度為常態分布\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # noise_level (float) - 雜訊大小\n", " # seed (int) - 同樣的seed可以重現同樣的雜訊\n", " # Returns:\n", " # noise (array of float) - 雜訊時間序列\n", "\n", " # 做一個基於某個seed的雜訊生成器\n", " rnd = np.random.RandomState(seed)\n", "\n", " # 生與time同長度的雜訊,並且乘上雜訊大小 (不乘的話,標準差是1)\n", " noise = rnd.randn(len(time)) * noise_level\n", " return noise" ] }, { "cell_type": "code", "execution_count": null, "id": "78fc97f4", "metadata": { "id": "db0c674b" }, "outputs": [], "source": [ "# 給一個4年的time steps\n", "time = np.arange(4 * 365)\n", "# 設noise 強度\n", "noise_levels = [2., 1.]\n", "# 合成雜訊\n", "series = [noise(time, noise_level=nl, seed=2022) for nl in noise_levels]\n", "# 畫\n", "plot_series(time,\n", " series=series,\n", " labels=[f'noise_levels={nl}' for nl in noise_levels])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "33480efb", "metadata": { "id": "0c68278e" }, "source": [ "來看看跟剛剛的一些性質疊起來長什麼樣子" ] }, { "cell_type": "code", "execution_count": null, "id": "29d3a1be", "metadata": { "id": "4b749091" }, "outputs": [], "source": [ "# 設定time steps, 假設就4年的資料\n", "time = np.arange(4 * 365)\n", "\n", "# 定義各種性質\n", "bias = 500\n", "slope = 0.1\n", "period = 180\n", "amplitude = 40\n", "# 設noise 強度\n", "noise_level = 5.\n", "\n", "# 合成資料\n", "signal_series = level(time, bias)\\\n", " + trend(time, slope)\\\n", " + seasonality(time,\n", " period=period,\n", " amplitude=amplitude,\n", " pattern_type='triangle')\n", "noise_series = noise(time, noise_level=noise_level)\n", "\n", "series = signal_series+noise_series\n", "\n", "# 畫\n", "plot_series(time, [series, signal_series], labels=['noisy', 'noise free'])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3620670b", "metadata": { "id": "0b5d53ef" }, "source": [ "剛剛的時間序列會出現有不規則的毛邊,這些毛邊沒有規則,並且還是看得出來前面序列的周期性、大小、趨勢等等。\n", "\n", "我們現在設定的noise大小還沒有很大,所以加上去沒有多很大毛邊,可自行加大noise_level,試試看加到多少會使得根本認不出原本資料?" ] }, { "cell_type": "code", "execution_count": null, "id": "33588c0e", "metadata": { "id": "b7ad880e" }, "outputs": [], "source": [ "noise_series = noise(time, noise_level=15)\n", "series = signal_series+noise_series\n", "plot_series(time, [series, signal_series], labels=['noisy', 'noise free'])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "d09401b9", "metadata": { "id": "d34351c8" }, "source": [ "我們可以把前述幾個function可以打包在一起,做一個data generator" ] }, { "cell_type": "code", "execution_count": null, "id": "f798b3d8", "metadata": { "id": "be670f19" }, "outputs": [], "source": [ "def trend(time, slope=0):\n", " # 產生合成水平直線資料,其長度與時間等長,直線趨勢與設定slope相同\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # slope (float) - 設定資料的傾斜程度與正負\n", " # Returns:\n", " # series (array of float) - 產出slope 與設定相同的一條線\n", "\n", " series = slope * time\n", "\n", " return series\n", "\n", "\n", "def seasonal_pattern(season_time, pattern_type='triangle'):\n", " # 產生某個特定pattern,\n", " # Args:\n", " # season_time (array of float) - 周期內的時間點, 長度為T\n", " # pattern_type (str) - 這邊提供triangle與cosine\n", " # Returns:\n", " # data_pattern (array of float) - 根據自訂函式產出特定的pattern\n", "\n", " # 用特定function生成pattern\n", " if pattern_type == 'triangle':\n", " data_pattern = np.where(season_time < 0.5,\n", " season_time*2,\n", " 2-season_time*2)\n", " if pattern_type == 'cosine':\n", " data_pattern = np.cos(season_time*np.pi*2)\n", "\n", " return data_pattern\n", "\n", "\n", "def seasonality(time, period, amplitude=1, phase=30, pattern_type='triangle'):\n", " # Repeats the same pattern at each period\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # period (int) - 週期長度,必小於T\n", " # amplitude (float) - 序列幅度大小\n", " # phase (int) - 相位,為遞移量,正的向左(提前)、負的向右(延後)\n", " # pattern_type (str) - 這邊提供triangle與cosine\n", " # Returns:\n", " # data_pattern (array of float) - 有指定周期、振幅、相位、pattern後的time series\n", "\n", " # 將時間依週期重置為0\n", " season_time = ((time + phase) % period) / period\n", "\n", " # 產生週期性訊號並乘上幅度\n", " data_pattern = amplitude * seasonal_pattern(season_time, pattern_type)\n", "\n", " return data_pattern\n", "\n", "\n", "def noise(time, noise_level=1, seed=None):\n", " # 合成雜訊,這邊用高斯雜訊,機率密度為常態分布\n", " # Args:\n", " # time (array of int) - 時間點, 長度為T\n", " # noise_level (float) - 雜訊大小\n", " # seed (int) - 同樣的seed可以重現同樣的雜訊\n", " # Returns:\n", " # noise (array of float) - 雜訊時間序列\n", "\n", " # 做一個基於某個seed的雜訊生成器\n", " rnd = np.random.RandomState(seed)\n", "\n", " # 生與time同長度的雜訊,並且乘上雜訊大小 (不乘的話,標準差是1)\n", " noise = rnd.randn(len(time)) * noise_level\n", "\n", " return noise\n", "\n", "\n", "def toy_generation(time=np.arange(4 * 365),\n", " bias=500.,\n", " slope=0.1,\n", " period=180,\n", " amplitude=40.,\n", " phase=30,\n", " pattern_type='triangle',\n", " noise_level=5.,\n", " seed=2022):\n", " signal_series = bias\\\n", " + trend(time, slope)\\\n", " + seasonality(time,\n", " phase,\n", " period,\n", " amplitude,\n", " pattern_type)\n", " noise_series = noise(time, noise_level, seed)\n", "\n", " series = signal_series+noise_series\n", " return series" ] }, { "cell_type": "code", "execution_count": null, "id": "349d8e3d", "metadata": { "id": "d401a003" }, "outputs": [], "source": [ "# 生成時可快速調整想要的參數\n", "\n", "series = toy_generation(time=np.arange(4 * 365),\n", " bias=500.,\n", " slope=0.1,\n", " period=180,\n", " amplitude=40.,\n", " phase=30,\n", " pattern_type='triangle',\n", " noise_level=5.,\n", " seed=2022)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5c74cd77", "metadata": { "id": "442cb26a" }, "source": [ "## Examples for Other Trends and Seasonality" ] }, { "attachments": {}, "cell_type": "markdown", "id": "cb1b6b61", "metadata": { "id": "OtdtLR3qwKj-" }, "source": [ "可能會遇到各種trend與Seasonality的情況\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "id": "565843ee", "metadata": { "id": "35ce516c" }, "outputs": [], "source": [ "# 首先來看看各種trend的情況\n", "time = np.arange(100)\n", "no_trend = np.zeros_like(time) # 沒trend 就是0\n", "additive_trend = time*0.3 # additive trend就是直線型的trend\n", "\n", "# multiplicative的是對時間的對數,也稱作exponential trend\n", "multiplicative_trend = time**2*0.01\n", "\n", "# polynomial對時間是一個多項式函數\n", "polynomial_trend = -time**3*0.0005+time**2*0.05+time*0.1" ] }, { "cell_type": "code", "execution_count": null, "id": "cbcf8be7", "metadata": { "id": "9c2ba14f" }, "outputs": [], "source": [ "plot_series(time, [no_trend,\n", " additive_trend,\n", " multiplicative_trend,\n", " polynomial_trend],\n", " labels=['no_trend',\n", " 'additive_trend',\n", " 'multiplicative_trend',\n", " 'polynomial_trend'\n", " ])" ] }, { "cell_type": "code", "execution_count": null, "id": "20c6270f", "metadata": { "id": "0ed7e0f9" }, "outputs": [], "source": [ "no_seasonality = np.zeros_like(time)\n", "# additive是對時間沒有波的振幅上變化的seasonality\n", "additive_seasonality = seasonality(time,\n", " phase=0,\n", " period=30,\n", " amplitude=40.,\n", " pattern_type='cosine')\n", "# multiplicative對著時間波動的振福會改變\n", "multiplicative_seasonality = time*additive_seasonality*0.02" ] }, { "cell_type": "code", "execution_count": null, "id": "0547c752", "metadata": { "id": "4aafef5c" }, "outputs": [], "source": [ "plot_series(time, [no_seasonality,\n", " additive_seasonality,\n", " multiplicative_seasonality],\n", " labels=['no_seasonality',\n", " 'additive_seasonality',\n", " 'multiplicative_seasonality'\n", " ])" ] }, { "cell_type": "code", "execution_count": null, "id": "af74c04f", "metadata": { "id": "68466029" }, "outputs": [], "source": [ "# 看看把上述trend與seasonality都加再一起\n", "plt.figure(figsize=[10, 5])\n", "noise_series = noise(time, noise_level=5)\n", "plt.plot(time, polynomial_trend+no_seasonality+noise_series,\n", " '-', label='W=1', linewidth=3)\n", "plt.plot(time, polynomial_trend,\n", " 'k--', label='W=3', linewidth=3)\n", "# plt.legend(fontsize='large')\n", "plt.ylim([-150, 150])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "4eedddd6", "metadata": { "id": "nruhyNR5xowP" }, "source": [ "## Exercise\n", "可試著看看不同的trend與seasonality類型組合畫出來是什麼樣子。" ] }, { "cell_type": "code", "execution_count": null, "id": "9a0f98fd", "metadata": { "id": "d3eee05a" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }