{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "82dhsvRUKofa" }, "source": [ "# Signal Classification" ] }, { "cell_type": "markdown", "metadata": { "id": "5mo7TK6mKrry" }, "source": [ "訊號與圖的分類一樣,在preprocess後可使用神經網路做一些AI任務,例如訊號分類、迴歸還有生成等等。\n", "\n", "這個部分我們用音訊作為訊號的範例,來試著將聲音訊號做分類,包含以下部分:\n", "- Audio Data Loader\n", "- Audio Preprocess (STFT/ MFCC)\n", "- RNN audio classification\n", "- CNN audio classification" ] }, { "cell_type": "markdown", "metadata": { "id": "R-B1OWAKMDtL" }, "source": [ "開始之前我們先準備一些內容。\n", "\n", "我們使用的範例資料集是tensorflow提供的[Mini Speech Commands](https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html)資料集,從官網下載。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jTGJO2yTGgQ_" }, "outputs": [], "source": [ "# 下載檔案並存到data資料夾\n", "!mkdir -p data\n", "!wget -O data/mini_speech_commands.zip http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip\n", "!unzip data/mini_speech_commands.zip -d data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "g1m6erWYMR6M" }, "outputs": [], "source": [ "import librosa\n", "import IPython.display as idp # 播音工具\n", "import librosa.display as ldp # 畫頻譜圖工具\n", "import numpy as np # 輔助運算\n", "import matplotlib.pyplot as plt # 輔助畫圖\n", "\n", "# import model用到的內容\n", "import tensorflow as tf" ] }, { "cell_type": "markdown", "metadata": { "id": "jo_9jVjhRVfL" }, "source": [ "## Audio Data Loader" ] }, { "cell_type": "markdown", "metadata": { "id": "ryzrWNvWRZqU" }, "source": [ "與前面CNN相同,需要有data loader去對資料做讀取,而tensorflow沒有原生讀音訊的data loader (TF 2.11有,但在Colab上要另外灌TF2.11),所以這部分要自己寫。" ] }, { "cell_type": "markdown", "metadata": { "id": "KPMvRiOT5uiQ" }, "source": [ "### 讀取音訊檔及資料切分" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "m3KXFuxQRZD7" }, "outputs": [], "source": [ "from glob import glob # 拿來列資料夾內容的小套件\n", "from sklearn.model_selection import train_test_split # 切分資料集用\n", "\n", "\n", "def find_class(x):\n", " # 根據格式,找到所屬class\n", " return x.strip('/')[-2]\n", "\n", "\n", "def audio_folder_datasets(dataset_path,\n", " class_dictionary,\n", " sr=22050,\n", " duration=-1):\n", " # 輸入:\n", " # dataset_path - 資料夾,內有數個不同class的資料夾,內有.wav檔\n", " # class_dictionary - dictionary物件,對應每個資料夾的class\n", " # sr - 讀取的sampling rate\n", " # duration - 可指定秒數(float32),不指定則為原檔長度\n", " file_names = []\n", " labels = []\n", " for cls, class_id in class_dictionary.items():\n", " f_list = glob(dataset_path+f'{cls}/*.wav') # 找到該class的所有檔案\n", " file_names.extend(f_list) # 加入列表\n", " labels.extend([class_id]*len(f_list)) # 加入相應labels\n", " print(\"total:\",\n", " f\"{len(file_names)} files of {len(class_dictionary)} classes\")\n", "\n", " @tf.function # 套tf.function使其可以對tensorflow tensor做動\n", " def load_wav(fname):\n", " # 使用指定sampling rate, duration讀檔\n", " # 這邊要用到librosa的loading才有re-sample,\n", " # 若已經知道每個檔案sampling rate也可以用tensorflow的tf.audio\n", " # 可參考:\n", " # https://www.kaggle.com/code/lkergalipatak/bird-audio-classification-with-tensorflow\n", "\n", " x = tf.numpy_function(\n", " lambda x: librosa.util.fix_length(\n", " librosa.load(x, sr=sr, duration=duration)[0],\n", " size=int(sr*duration),\n", " mode='edge'),\n", " inp=[fname], Tout=tf.float32)\n", " return x\n", " def get_dataset(paths_, labels_):\n", " # 得到所有資料夾名稱\n", " path_ds = tf.data.Dataset.from_tensor_slices(paths_) # 轉換成檔名的Dataset物件\n", " label_ds = tf.data.Dataset.from_tensor_slices(labels_)\n", "\n", " data_ds = path_ds.map(load_wav)\n", " return tf.data.Dataset.zip((data_ds, label_ds))\n", "\n", " fname_train, fname_val, label_train, label_val = train_test_split(\n", " file_names,\n", " labels,\n", " test_size=0.2)\n", " train_ds = get_dataset(fname_train, label_train)\n", " val_ds = get_dataset(fname_val, label_val)\n", " return train_ds, val_ds" ] }, { "cell_type": "markdown", "metadata": { "id": "DfRAtBnZ2ZIu" }, "source": [ "生成一個dataset拿來用作基礎。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Yow6YjV7eqcY" }, "outputs": [], "source": [ "# 準備一些參數輸入進function生成dataset\n", "DATASET_PATH = 'data/mini_speech_commands/'\n", "class_dict = {\n", " 'down': 0,\n", " 'go': 1,\n", " 'left': 2,\n", " 'no': 3,\n", " 'right': 4,\n", " 'stop': 5,\n", " 'up': 6,\n", " 'yes': 7\n", "}\n", "SR = 22050\n", "DURATION = 0.8\n", "train_ds, val_ds = audio_folder_datasets(\n", " DATASET_PATH,\n", " class_dict,\n", " sr=SR,\n", " duration=DURATION)" ] }, { "cell_type": "markdown", "metadata": { "id": "iRVQN_8r3tZw" }, "source": [ "dataset一次丟出一個signal以及一個label\n", "\n", "觀察data基本性質" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jzc53mLj2Q4W" }, "outputs": [], "source": [ "x, y = next(iter(val_ds))\n", "print('\\n', x.shape, x.numpy().min(), x.numpy().max())\n", "print(y)\n", "\n", "# 畫出來\n", "ldp.waveshow(x.numpy().squeeze(), sr=SR)\n", "\n", "# 聽看看\n", "idp.Audio(x.numpy().squeeze(), rate=SR)" ] }, { "cell_type": "markdown", "metadata": { "id": "SDrINt5U3xh2" }, "source": [ "畫出一個例子" ] }, { "cell_type": "markdown", "metadata": { "id": "hb0sjjxy50he" }, "source": [ "### 輸入NN前做轉換" ] }, { "cell_type": "markdown", "metadata": { "id": "5IZGg4NvV1s-" }, "source": [ "在預處裡時使用librosa會很慢,幸好與```librosa.stft```類似,可以使用```tensorflow.signal.stft```做時頻分析,在操作的各種過程中只能用tf function來操作。\n", "\n", "其axis為[time,frequency],與librosa相反,使用```librosa.specshow```觀察時記得要做transpose。\n", "\n", "為什麼要反過來是為了配合RNN的預設axis [batch,time,...],將time擺在batch後面第一位。\n", "\n", "若希望使用CNN模型,記得在最後多加一個空的axis,因為CNN適用的axis是[batch, hight, width, channels]。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jLi5zn7a3NQW" }, "outputs": [], "source": [ "N = 512\n", "H = 128\n", "\n", "\n", "def get_stft(waveform):\n", " # 做STFT (用tensorflow得比較快)\n", " spectrogram = tf.signal.stft(\n", " waveform, frame_length=N, frame_step=H)\n", " # 這邊frame_length是librosa的n_fft\n", " # frame_step是librosa的hop_length\n", " # 使用tf.signal stft出來時,單位為((timepoints-n_fft)/hop_length, n_fft/2)\n", " # 這是為了配合RNN等模型的\n", "\n", " # 取magnitude\n", " spectrogram = tf.abs(spectrogram)\n", "\n", " # 若是多加一個維度,可以用於CNN,shape (`batch_size`, `height`, `width`, `channels`).\n", " # spectrogram = spectrogram[..., tf.newaxis]\n", " return spectrogram\n", "\n", "\n", "# 使用STFT當作preprocess function\n", "trian_ds_stft = train_ds.map(lambda x, y: (get_stft(x), y))\\\n", " .cache().shuffle(6400).prefetch(tf.data.AUTOTUNE)\n", "val_ds_stft = val_ds.map(lambda x, y: (get_stft(x), y))\\\n", " .cache().prefetch(tf.data.AUTOTUNE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CBWPxPyWW08V" }, "outputs": [], "source": [ "# 可以看一下資料\n", "for x_S, y in val_ds_stft:\n", " print(x_S.shape, x_S.numpy().min(), x_S.numpy().max())\n", " print(y)\n", " break\n", "plt.figure(figsize=(6, 5))\n", "ldp.specshow(x_S.numpy().T, # 記得做transpose\n", " sr=SR,\n", " x_axis=\"s\",\n", " y_axis=\"hz\",\n", " cmap=\"jet\")\n", "plt.colorbar(format=\"%+4.f\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fhfraNwUTo6X" }, "outputs": [], "source": [ "# 跟librosa差不多,librosa有做time padding,會長一些\n", "x_S_ = librosa.stft(x.numpy(), n_fft=N, hop_length=H)\n", "print(x_S_.shape)\n", "plt.figure(figsize=(6, 5))\n", "ldp.specshow(abs(x_S_),\n", " sr=SR,\n", " x_axis=\"s\",\n", " y_axis=\"hz\",\n", " cmap=\"jet\")\n", "plt.colorbar(format=\"%+4.f\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VxHlfYUynEEo" }, "outputs": [], "source": [ "%%time\n", "# 預讀資料,放進GPU\n", "for x_S, y in tran_ds_stft:\n", " pass" ] }, { "cell_type": "markdown", "metadata": { "id": "efrNH2Pwqj_f" }, "source": [ "這邊也可以進一步使用MFCC,但因為```tensorflow.signal```做MFCC的步驟過於複雜且須配合的指令太多,這邊我們使用librosa套tf.function做轉換\n", "\n", "若想嘗試做tensorflow的版本可參考: https://github.com/timsainb/tensorflow2-generative-models/blob/master/7.0-Tensorflow-spectrograms-and-inversion.ipynb" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MvTZ2OFlqGFf" }, "outputs": [], "source": [ "N_MFCC = 39\n", "\n", "\n", "# 立一個方便的function給MFCC,套用想要的arguments,這邊記得,librosa做出來給tensorflow時需要做transpose\n", "def mfcc_function(waveform):\n", " return librosa.feature.mfcc(y=waveform,\n", " sr=SR,\n", " n_mfcc=N_MFCC,\n", " n_fft=N,\n", " hop_length=H).T\n", "\n", "\n", "@tf.function\n", "def get_mfcc(waveform):\n", " # 做MFCC,因為用librosa所以要用numpy_function包起來\n", " mfcc = tf.numpy_function(mfcc_function, [waveform], tf.float32)\n", " # 若是多加一個維度,可以用於CNN,shape (`batch_size`, `height`, `width`, `channels`).\n", " # mfcc = mfcc[..., tf.newaxis]\n", " return mfcc\n", "\n", "\n", "# 使用STFT當作preprocess function\n", "tran_ds_mfcc = tran_ds.map(lambda x, y: (get_mfcc(x), y))\\\n", " .cache().shuffle(6400).prefetch(tf.data.AUTOTUNE)\n", "val_ds_mfcc = val_ds.map(lambda x, y: (get_mfcc(x), y))\\\n", " .cache().prefetch(tf.data.AUTOTUNE)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cLyK1UQhw8sq" }, "outputs": [], "source": [ "# 可以看一下資料\n", "for x_C, y in val_ds_mfcc:\n", " print(x_C.shape, x_C.numpy().min(), x_C.numpy().max())\n", " print(y)\n", " break\n", "plt.figure(figsize=(6, 5))\n", "ldp.specshow(x_C.numpy().T,\n", " sr=SR,\n", " x_axis=\"s\",\n", " cmap=\"jet\") # 記得做transpose\n", "plt.colorbar(format=\"%+4.f\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b0yeRNxYzS2h" }, "outputs": [], "source": [ "%%time\n", "# 預讀資料,放進GPU\n", "for x_C,y in tran_ds_mfcc:\n", " pass" ] }, { "cell_type": "markdown", "metadata": { "id": "4Im2kjcGmFSM" }, "source": [ "## RNN Audio classifcation" ] }, { "cell_type": "markdown", "metadata": { "id": "l938Da7XyWNS" }, "source": [ "### STFT preprocess + RNN" ] }, { "cell_type": "markdown", "metadata": { "id": "UyWP60TbmKoJ" }, "source": [ "我們可使用RNN來做對剛剛的頻譜作classification的訓練" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "je3BLdd3mB31" }, "outputs": [], "source": [ "# 抓一下 data的大小\n", "for example_spectrograms, example_spect_labels in tran_ds_stft.take(1):\n", " break\n", "input_shape = example_spectrograms.shape.as_list()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "krVVzebD5q2Y" }, "outputs": [], "source": [ "from tensorflow.keras import layers\n", "from tensorflow.keras import models" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qfgu4V4FZBuN" }, "outputs": [], "source": [ "inputs = tf.keras.Input(shape=(None, input_shape[1]))\n", "h = layers.LSTM(256, dropout=0.1)(inputs) # 用層LSTM\n", "outputs = layers.Dense(len(class_dict), activation='softmax')(h)\n", "\n", "model = models.Model(inputs=inputs, outputs=outputs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SN25M3uCc-99" }, "outputs": [], "source": [ "model.compile(\n", " optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),\n", " loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),\n", " metrics=['accuracy'],\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XFA7WMUfdXe0" }, "outputs": [], "source": [ "EPOCHS = 20\n", "history = model.fit(\n", " tran_ds_stft.batch(32),\n", " validation_data=val_ds_stft.batch(64),\n", " epochs=EPOCHS,\n", " callbacks=tf.keras.callbacks.EarlyStopping(verbose=1, patience=2),\n", " verbose=1\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iyJf2KFvhF2O" }, "outputs": [], "source": [ "model.evaluate(val_ds_stft.batch(64))" ] }, { "cell_type": "markdown", "metadata": { "id": "8YUFmCceyfyS" }, "source": [ "### MFCC preprocess + RNN" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_cPxEoNkyw1k" }, "outputs": [], "source": [ "# 抓一下 data的大小\n", "for example_cepstrums, example_spect_labels in tran_ds_mfcc.take(1):\n", " break\n", "input_shape = example_cepstrums.shape.as_list()\n", "\n", "inputs = tf.keras.Input(shape=(None, input_shape[1])) # 直接使用剛剛抓的大小\n", "h = layers.LSTM(256, dropout=0.1)(inputs) # 用層LSTM\n", "outputs = layers.Dense(len(class_dict), activation='softmax')(h)\n", "\n", "model = models.Model(inputs=inputs, outputs=outputs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "F8rxfWpeyw1k" }, "outputs": [], "source": [ "model.compile(\n", " optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),\n", " loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),\n", " metrics=['accuracy'],\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "prLl9wsmyw1l" }, "outputs": [], "source": [ "EPOCHS = 20\n", "history = model.fit(\n", " tran_ds_mfcc.batch(32),\n", " validation_data=val_ds_mfcc.batch(64),\n", " epochs=EPOCHS,\n", " callbacks=tf.keras.callbacks.EarlyStopping(verbose=1, patience=2),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Qqq3cgMlyw1l" }, "outputs": [], "source": [ "model.evaluate(val_ds_mfcc.batch(64))" ] }, { "cell_type": "markdown", "metadata": { "id": "ecEwkyJ515pT" }, "source": [ "可以自行嘗試fine-tune這兩種結果。\n", "\n", "近期語音AI論文最常用的preprocess方式可作為訊號處理AI的前處理參考:\n", "1. 直接用一組Filter Bank\n", "2. MFCC\n", "3. Spectrogram\n", "\n", "(參考: 李宏毅老師Deep Learning for Human Language Processing (2020,Spring)課程中助教統計https://youtu.be/AIKu43goh-8?list=PLJV_el3uVTsO07RpBYFsXg-bN5Lu0nhdG&t=1964)" ] }, { "cell_type": "markdown", "metadata": { "id": "abp-h8Y4FeqB" }, "source": [ "## CNN Audio Classification" ] }, { "cell_type": "markdown", "metadata": { "id": "DGxmSwjJOfUL" }, "source": [ "當然,因為我們已經把資料轉換成2D的頻譜了,所以也可以當作一張圖來做2D CNN。\n", "\n", "這邊用的是STFT做時頻轉換的結果。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Od6y17AQG32M" }, "outputs": [], "source": [ "@tf.function\n", "def extend_dims(x, y):\n", " return x[..., np.newaxis], y\n", "\n", "\n", "tran_ds_stft_ = tran_ds_stft.map(extend_dims)\n", "val_ds_stft_ = val_ds_stft.map(extend_dims)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cspH98fnHqQH" }, "outputs": [], "source": [ "inputs = tf.keras.Input(shape=(None, input_shape[1], 1))\n", "h = layers.Conv2D(32, (3, 3), activation='relu')(inputs)\n", "h = layers.Dropout(0.1)(h)\n", "h = layers.MaxPooling2D()(h)\n", "h = layers.Conv2D(64, (3, 3), activation='relu')(h)\n", "h = layers.Dropout(0.1)(h)\n", "h = layers.MaxPooling2D()(h)\n", "h = layers.Conv2D(64, (3, 3), activation='relu')(h)\n", "h = layers.Dropout(0.1)(h)\n", "h = layers.GlobalAveragePooling2D()(h)\n", "h = layers.Flatten()(h)\n", "h = layers.Dense(32)(h)\n", "outputs = layers.Dense(len(class_dict), activation='softmax')(h)\n", "\n", "model = models.Model(inputs=inputs, outputs=outputs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zOBrKssMIzLc" }, "outputs": [], "source": [ "model.compile(\n", " optimizer=tf.keras.optimizers.Adam(),\n", " loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),\n", " metrics=['accuracy'],\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YrtK2_H5IzLe" }, "outputs": [], "source": [ "EPOCHS = 20\n", "history = model.fit(\n", " tran_ds_stft_.batch(32),\n", " validation_data=val_ds_stft_.batch(64),\n", " epochs=EPOCHS,\n", " callbacks=tf.keras.callbacks.EarlyStopping(verbose=1, patience=2),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "okm8WD97OZYL" }, "outputs": [], "source": [ "model.evaluate(val_ds_stft_.batch(64))" ] }, { "cell_type": "markdown", "metadata": { "id": "W2VOtUg8Oq55" }, "source": [ "使用CNN的好處是,已經有很多CNN-based的pre-train模型可以使用來做transfer learning。\n", "\n", "建議也可以拿transfer leraning提及的model來訓練看看!\n", "\n", "亦可嘗試使用MFCC Preprocess後做CNN模型。" ] }, { "cell_type": "markdown", "metadata": { "id": "x1oWIACng2wH" }, "source": [ "## Reference\n", "* TF官網教學: https://www.tensorflow.org/tutorials/audio/simple_audio\n", "* https://towardsdatascience.com/audio-augmentations-in-tensorflow-48483260b169\n", "* https://github.com/timsainb/tensorflow2-generative-models/blob/master/7.0-Tensorflow-spectrograms-and-inversion.ipynb" ] } ], "metadata": { "accelerator": "GPU", "colab": { "provenance": [], "toc_visible": true }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 4 }