{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dimensionality Reduction: t-SNE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在MNist Digits的範例中可以發現，PCA雖然可以盡量保留資料整體的variance，但各筆樣本之間的距離關係在低維度有可能被破壞。這是因為PCA是線性轉換，使得資料的各特徵的非線性結構在低維度空間無法被呈現，原本是相遠的點，在降維之後有可能被拉近。t-SNE 主要是將高維空間中的資料點，其點與點之間的相似度用機率分布近似，而低維數據的部分使用另一種機率分布的方式來近似，再使用 KL divergence計算兩種機率分布的距離，最後再以梯度下降（或隨機梯度下降）求最佳解。其精神在於，如果兩個點在高維度是相遠的，其在低維度也要是相遠的，反之亦然。在此我們用之前PCA的Digits範例來了解scikit-learn中如何用t-SNE來做降維。\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 首先import所有需要套件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.manifold import TSNE\n",
    "from sklearn.decomposition import PCA\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 載入digits資料集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_digits\n",
    "digits = load_digits()\n",
    "X = digits.data\n",
    "y = digits.target"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 如之前的範例，使用PCA將資料降維，維度為2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pca = PCA(2)  # project from 64 to 2 dimensions\n",
    "Xproj = pca.fit_transform(X)\n",
    "print(X.shape)\n",
    "print(Xproj.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 將資料畫出來"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.scatter(Xproj[:, 0],\n",
    "            Xproj[:, 1],\n",
    "            c=y,\n",
    "            edgecolor='none',\n",
    "            alpha=0.5,\n",
    "            cmap=plt.cm.get_cmap('nipy_spectral', 10))\n",
    "plt.colorbar()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 如上圖所示，可以看得出來有很多群並沒有被區分開來，資料點是交錯的，雖然原本相近的點仍相近，但原本高維空間中相遠的點，在降維之後變得混在一起。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 接下來使用t-SNE做降維，sklearn t-SNE中較重要的參數:\n",
    "- n_components: 降維之後的維度\n",
    "- perpexity: 最佳化過程中考慮鄰近點的多寡，default 30，原始paper建議5-50\n",
    "- n_iter: 迭代次數，預設1000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tsne = TSNE(n_components=2, random_state=42)\n",
    "X_reduced = tsne.fit_transform(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(8, 6))\n",
    "plt.scatter(X_reduced[:, 0],\n",
    "            X_reduced[:, 1],\n",
    "            c=y,\n",
    "            alpha=0.5,\n",
    "            cmap=plt.cm.get_cmap('nipy_spectral', 10))\n",
    "\n",
    "plt.colorbar()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 如圖，各cluster距離被明顯拉開。跟PCA不同，t-SNE能讓高維空間中相遠的點，轉換到低維空間後仍是相遠的。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 在各參數之中，perplexity的建議範圍 5-50，樣本越多，perplexity應設置愈高。\n",
    "### t-SNE的一個缺點是其計算量很大，如果資料量較多或維度較多，建議先用PCA降維之後再使用t-SNE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.2"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "widgets": {
   "state": {
    "ca1be6dfe7514b1d9ebee4859997ba76": {
     "views": [
      {
       "cell_index": 22
      }
     ]
    }
   },
   "version": "1.2.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}