{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# To support both python 2 and python 3\n", "from __future__ import division, print_function, unicode_literals\n", "\n", "# Common imports\n", "import numpy as np\n", "import os\n", "\n", "# to make this notebook's output stable across runs\n", "np.random.seed(42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# To plot pretty figures\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "from sklearn.datasets import load_iris\n", "from sklearn.datasets import make_blobs\n", "from sklearn.cluster import KMeans\n", "from sklearn.metrics import silhouette_score\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mpl.rc('axes', labelsize=14)\n", "mpl.rc('xtick', labelsize=12)\n", "mpl.rc('ytick', labelsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# K-Means" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### K-Means 的演算過程:\n", "* 先隨機從資料中選出K個instance 作為初始的cluster centroid\n", "* 重複以下操作, 直到各centroid的位置不再移動為止:\n", " * 將每筆資料指派給最靠近的centroid\n", " * 更新centroid位置為,新的centroid的位置為指派給該centroid的所有資料之平均\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 以下以iris 分類問題為例,示範如何使用scikit learn中的kmeans模組,對資料進行clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 首先,先讀取sklearn內建的iris 資料集" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = load_iris()\n", "x = data.data\n", "y = data.target\n", "data.target_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 畫出所有資料點,為了視覺化方便,只取第3及第4個特徵(通常在unsupervised learning的情景,不會有target值,這邊只是用來方便比較)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(9, 3.5))\n", "plt.plot(x[y == 0, 2], x[y == 0, 3], \"ro\", label=\"Setosa\")\n", "plt.plot(x[y == 1, 2], x[y == 1, 3], \"bs\", label=\"Versicolor\")\n", "plt.plot(x[y == 2, 2], x[y == 2, 3], \"g^\", label=\"Virginica\")\n", "plt.xlabel(\"Petal length\", fontsize=14)\n", "plt.ylabel(\"Petal width\", fontsize=14)\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## scikit learn 中的KMeans 各參數的意義: (參考 https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html )\n", "* n_clusters: 指定群聚的數量\n", "* max_iter: 最大迭代次數\n", "* n_init: 用不同初始化的centroids運行的次數,最後會從中選出一個最好的結果,default為10次\n", "* init: 初始化的方式,有random 及 kmeans++ 兩種,詳見上述連結說明\n", "* algorithm: \"auto\" \"full\" or \"elkan\",用預設的\"auto\"即可 " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "k = 3\n", "\n", "kmeans = KMeans(n_clusters=k, random_state=42)\n", "clusters_pred = kmeans.fit_predict(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## inertia: within cluster sum of squres, 各sample到各該群的centroid的距離之平方和,用來評估cluster的成效,越大代表越差" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kmeans.inertia_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 查看各cluster的中心,並在圖上畫出" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kmeans.cluster_centers_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(9, 3.5))\n", "plt.plot(x[clusters_pred == 0, 2],\n", " x[clusters_pred == 0, 3],\n", " \"ro\",\n", " label=\"Iris-Setosa\")\n", "plt.plot(x[clusters_pred == 1, 2],\n", " x[clusters_pred == 1, 3],\n", " \"bs\",\n", " label=\"Iris-Versicolor\")\n", "plt.plot(x[clusters_pred == 2, 2],\n", " x[clusters_pred == 2, 3],\n", " \"g^\",\n", " label=\"Iris-Virginica\")\n", "plt.scatter(kmeans.cluster_centers_[:, 2],\n", " kmeans.cluster_centers_[:, 3],\n", " color='k',\n", " marker='X',\n", " zorder=5)\n", "plt.xlabel(\"Petal length\", fontsize=14)\n", "plt.ylabel(\"Petal width\", fontsize=14)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 雖然跟ground truth 相比結果雖然有一點差異,但確實有達到clustering的效果" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 如何決定K? 可用inertia做評估。以下利用scikit learn的make_blobs建立一個有四個中心的共2000筆的假資料作為範例:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 先指定centroids的位置及各cluster的標準差\n", "blob_centers = np.array([[-0.2, 2.3], [1.5, 2.3], [2.8, 2.8], [2.8, 1.3]])\n", "blob_std = np.array([0.4, 0.3, 0.1, 0.1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X, y = make_blobs(n_samples=2000,\n", " centers=blob_centers,\n", " cluster_std=blob_std,\n", " random_state=7)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(X[:, 0], X[:, 1], c='b', s=1)\n", "plt.xlabel(\"$x_1$\", fontsize=14)\n", "plt.ylabel(\"$x_2$\", fontsize=14, rotation=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 想像你拿到這份資料集,沒有label, 也無法一眼看出有多少個中心的情況,想用kmeans clustering, 該怎麼決定K? 可用多個k值做實驗,並用各次實驗結果的inertia做評估" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# k = 1~9 做九次kmeans, 並將每次結果的inertia收集在一個list裡\n", "kmeans_per_k = [\n", " KMeans(n_clusters=k, random_state=42).fit(X) for k in range(1, 10)\n", "]\n", "inertias = [model.inertia_ for model in kmeans_per_k]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(8, 3.5))\n", "plt.plot(range(1, 10), inertias, \"bo-\")\n", "plt.xlabel(\"$k$\", fontsize=14)\n", "plt.ylabel(\"Inertia\", fontsize=14)\n", "plt.annotate('Elbow',\n", " xy=(4, inertias[3]),\n", " xytext=(0.55, 0.55),\n", " textcoords='figure fraction',\n", " fontsize=16,\n", " arrowprops=dict(facecolor='black', shrink=0.1))\n", "plt.axis([1, 8.5, 0, 1300])\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 如圖,當K值越來越大,inertia會隨之越來越小,(但k=n時理論上inertia會減為0, 因此實際上不能選inertia最小的那個k) 一般是取elbow point, 即inertia迅速下降轉為平緩的那個點" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 另外一個方法是用silhouette scores 去評估. \n", "### The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score)\n", "## Silhouette Coefficient 越大代表分群效果越好" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "silhouette_scores = [\n", " silhouette_score(X, model.labels_) for model in kmeans_per_k[1:]\n", "]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(8, 3))\n", "plt.plot(range(2, 10), silhouette_scores, \"bo-\")\n", "plt.xlabel(\"$k$\", fontsize=14)\n", "plt.ylabel(\"Silhouette score\", fontsize=14)\n", "plt.axis([1.8, 8.5, 0.55, 0.75])\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 以圖中的結果來看應選擇k=4, 5也可以試試看" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }