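{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dimensionality Reduction: t-SNE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The MNIST digits example shows that although PCA preserves as much of the overall variance of the data as it can, the distance relationships between individual samples can be distorted in the low-dimensional space. PCA is a linear transformation, so nonlinear structure among the features cannot be represented in the low-dimensional space, and points that were far apart may be pulled close together after the reduction. t-SNE approximates the pairwise similarities between points in the high-dimensional space with one probability distribution, approximates the similarities between the low-dimensional points with another, uses the KL divergence to measure the distance between the two distributions, and then finds the best solution by gradient descent (or stochastic gradient descent). The idea is that if two points are far apart in the high-dimensional space they should also be far apart in the low-dimensional space, and vice versa. Here we reuse the digits example from the earlier PCA notebook to see how to perform dimensionality reduction with t-SNE in scikit-learn.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### First, import the required packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn.manifold import TSNE\n", "from sklearn.decomposition import PCA\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the digits dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()\n", "X = digits.data\n", "y = digits.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### As in the earlier example, use PCA to reduce the data to 2 dimensions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca = PCA(2)  # project from 64 to 2 dimensions\n", "Xproj = pca.fit_transform(X)\n", "print(X.shape)\n", "print(Xproj.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plt.get_cmap replaces plt.cm.get_cmap, which newer matplotlib releases removed\n", "plt.scatter(Xproj[:, 0],\n", "            Xproj[:, 1],\n", "            c=y,\n", "            edgecolor='none',\n", "            alpha=0.5,\n", "            cmap=plt.get_cmap('nipy_spectral', 10))\n", "plt.colorbar()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### As the plot above shows, many clusters are not separated and their points are interleaved. Points that were close remain close, but points that were far apart in the original high-dimensional space become mixed together after the reduction." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 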
Next, reduce the dimensionality with t-SNE. The more important parameters of scikit-learn's t-SNE are:\n", "- n_components: the number of dimensions after the reduction\n", "- perplexity: roughly how many neighboring points are considered during optimization; the default is 30, and the original paper suggests 5-50\n", "- n_iter: the number of iterations, default 1000 (named max_iter in scikit-learn 1.5 and later)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tsne = TSNE(n_components=2, random_state=42)\n", "X_reduced = tsne.fit_transform(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(8, 6))\n", "# plt.get_cmap replaces plt.cm.get_cmap, which newer matplotlib releases removed\n", "plt.scatter(X_reduced[:, 0],\n", "            X_reduced[:, 1],\n", "            c=y,\n", "            alpha=0.5,\n", "            cmap=plt.get_cmap('nipy_spectral', 10))\n", "\n", "plt.colorbar()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### As the plot shows, the clusters are clearly pulled apart. Unlike PCA, t-SNE keeps points that are far apart in the high-dimensional space far apart after they are mapped to the low-dimensional space." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Among the parameters, the suggested range for perplexity is 5-50; the more samples there are, the higher perplexity should be set.\n", "### One drawback of t-SNE is its high computational cost. If the dataset has many samples or many dimensions, it is advisable to reduce it with PCA first and then apply t-SNE." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "widgets": { "state": { "ca1be6dfe7514b1d9ebee4859997ba76": { "views": [ { "cell_index": 22 } ] } }, "version": "1.2.0" } }, "nbformat": 4, "nbformat_minor": 4 }