{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hierarchical clustering\n",
    "若是以下的 code 有不清楚的部分，請參考[連結](https://haojunsui.github.io/2016/07/16/scipy-hac/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "variables = ['X', 'Y', 'Z']\n",
    "labels = ['ID_' + str(i) for i in range(5)]\n",
    "print(labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# set seed to remain the same sample numbers\n",
    "np.random.seed(42)\n",
    "X = np.random.random_sample([len(labels), len(variables)])\n",
    "df = pd.DataFrame(X, columns=variables, index=labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. build distance matrix by calculating pairwise distance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. distance matrix\n",
    "from scipy.spatial.distance import pdist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pairwise distance\n",
    "# 我們有五個樣本，每兩兩成對計算距離，會得到 10 個距離 (C 5 取 2 = 10)\n",
    "row_dist = pdist(df, metric='euclidean')\n",
    "row_dist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.spatial.distance import squareform\n",
    "\n",
    "squareform(row_dist)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. build hierarchy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.cluster.hierarchy import linkage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.1 build from the pairwise distance array, row_dist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hc = linkage(row_dist, method='complete')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# row: [idx_first, idx_second, distance, sample count]\n",
    "# 第一步:算法決定合併第 1 群與 第 4 群，\n",
    "# 因為這兩群彼此的距離為 0.24，總共合併了兩個 sample\n",
    "# 第二步:算法決定合併第 0 群與 第 2 群，\n",
    "# 因為這兩群彼此的距離為 0.35，總共合併了兩個 sample\n",
    "# 第三步:算法決定合併第 3 群與 第 5 群\n",
    "# (這邊請注意，原先 data 只有五群資料，第 0 群到 第 4 群。\n",
    "# 所以這邊要合併的第 5 群，指得是第一步合併的那群\n",
    "# 第四步:算法決定合併第 6 群與 第 7 群\n",
    "# (同理，這邊指的是，合併第二步那群 與 第三步的那群)\n",
    "\n",
    "# 以上就是階層分析的步驟，看下方 dendrogram 的圖會更清楚\n",
    "hc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.2 build from the original data\n",
    "you will need to defind the distance metric\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hc = linkage(df.values, method='complete', metric='euclidean')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# same result as feed the row_dist\n",
    "hc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Dendrogram"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.cluster.hierarchy import dendrogram\n",
    "from scipy.cluster.hierarchy import set_link_color_palette"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "?dendrogram"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 繪製樹狀圖，藍色線 代表這兩群的距離超過某個限度，可自己定義 color_threshold\n",
    "set_link_color_palette(['black'])\n",
    "\n",
    "row_dendr = dendrogram(\n",
    "    hc,\n",
    "    labels=labels,\n",
    "    color_threshold=0.3  # 可改動，看看線的顏色變化\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Decide the number of clusters by various criteria\n",
    "決定分群的結果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.cluster.hierarchy import fcluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "?fcluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 不同的　criterion 會有不同的參數，t=3，限制最多分成三群\n",
    "# ID_0 與 ID_2 被分為第一群\n",
    "# ID_1 與 ID_4 被分為第二群\n",
    "# ID_3 則是獨立一群\n",
    "fcluster(hc, criterion='maxclust', t=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fcluster(hc, criterion='distance', t=0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.2"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}