{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dimensionality Reduction: Principal Component Analysis\n", "\n", "Here we'll explore **Principal Component Analysis**, which is an extremely useful linear dimensionality reduction technique.\n", "\n", "We'll start with our standard set of initial imports:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy import stats\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introducing Principal Component Analysis\n", "\n", "Principal Component Analysis is a very powerful unsupervised method for *dimensionality reduction* in data. It's easiest to visualize by looking at a two-dimensional dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(1)\n", "X = np.dot(np.random.random(size=(2, 2)), np.random.normal(size=(2, 200))).T\n", "print(X.shape)\n", "plt.plot(X[:, 0], X[:, 1], 'o')\n", "plt.axis('equal')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that there is a definite trend in the data. 
PCA seeks to find the **principal axes** of the data and to quantify how important each axis is in describing the data distribution:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "pca = PCA(n_components=2)\n", "pca.fit(X)\n", "# explained_variance_: variance of the data along each principal axis\n", "print(pca.explained_variance_)\n", "# components_: unit vectors (one per row) giving the axis directions\n", "print(pca.components_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca.explained_variance_ratio_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see what these numbers mean, let's view them as vectors plotted on top of the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.5)\n", "mean = pca.mean_  # PCA centers the data, so the axes start at the mean\n", "for length, vector in zip(pca.explained_variance_, pca.components_):\n", "    print(length, vector)\n", "    # scale each unit vector to 3 standard deviations along its axis\n", "    v = vector * 3 * np.sqrt(length)\n", "    plt.plot([mean[0], mean[0] + v[0]], [mean[1], mean[1] + v[1]], '-k', lw=3)\n", "plt.axis('equal')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that one vector is longer than the other. In a sense, this tells us that the direction of the longer vector is more \"important\" than the other direction.\n", "The explained variance quantifies this \"importance\": it is the variance of the data along each axis.\n", "\n", "Another way to think of it is that the second principal component could be **completely ignored** without much loss of information! Let's see what our data look like if we keep only enough components to explain 95% of the variance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca = PCA(0.95) # keep at least 95% of the variance\n", "X_trans = pca.fit_transform(X)\n", "print(X.shape)\n", "print(X_trans.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By allowing at most 5% of the variance to be discarded, we have compressed the data by 50% (from two dimensions to one)! 
Let's see what the data look like after this compression:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_new = pca.inverse_transform(X_trans)\n", "plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.2)\n", "plt.plot(X_new[:, 0], X_new[:, 1], 'ob', alpha=0.8)\n", "plt.axis('equal')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The light points are the original data, while the dark points are the projected version. We see that after truncating 5% of the variance of this dataset and then reprojecting it, the \"most important\" features of the data are maintained, and we've compressed the data by 50%!\n", "\n", "This is the sense in which \"dimensionality reduction\" works: if you can approximate a data set in a lower dimension, you can often have an easier time visualizing it or fitting complicated models to the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Application of PCA to Digits\n", "\n", "The dimensionality reduction might seem a bit abstract in two dimensions, but the projection and dimensionality reduction can be extremely useful when visualizing high-dimensional data. 
Let's take a quick look at the application of PCA to the digits data we looked at before:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()\n", "X = digits.data\n", "y = digits.target" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca = PCA(2) # project from 64 to 2 dimensions\n", "Xproj = pca.fit_transform(X)\n", "print(X.shape)\n", "print(Xproj.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plt.get_cmap replaces plt.cm.get_cmap, which recent Matplotlib releases removed\n", "plt.scatter(Xproj[:, 0],\n", "            Xproj[:, 1],\n", "            c=y,\n", "            edgecolor='none',\n", "            alpha=0.5,\n", "            cmap=plt.get_cmap('nipy_spectral', 10))\n", "plt.colorbar()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives us an idea of the relationship between the digits. Essentially, we have found the optimal stretch and rotation in 64-dimensional space that allows us to see the layout of the digits, **without reference** to the labels." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Choosing the Number of Components\n", "\n", "But how much information have we thrown away? We can figure this out by looking at the cumulative **explained variance** as a function of the number of components:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca = PCA().fit(X)\n", "cumulative = np.cumsum(pca.explained_variance_ratio_)\n", "# x starts at 1 so the axis really is the number of components kept\n", "plt.plot(np.arange(1, len(cumulative) + 1), cumulative)\n", "plt.xlabel('number of components')\n", "plt.ylabel('cumulative explained variance')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we see that our two-dimensional projection loses a lot of information (as measured by the explained variance) and that we'd need about 20 components to retain 90% of the variance. Looking at this plot for a high-dimensional dataset can help you understand the level of redundancy present across its features." 
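] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The component count can also be computed rather than read off the plot. The cell below is a minimal sketch, not the only way to do it: it finds the smallest number of components whose cumulative explained variance ratio reaches a threshold. The names `digits_data`, `threshold`, and `n_needed` are illustrative and are not part of scikit-learn." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.datasets import load_digits\n", "from sklearn.decomposition import PCA\n", "\n", "# Fit a full PCA on the digits data (all 64 components)\n", "digits_data = load_digits().data\n", "cumulative = np.cumsum(PCA().fit(digits_data).explained_variance_ratio_)\n", "\n", "# Index of the first cumulative value >= threshold, plus one, is the count\n", "threshold = 0.90  # illustrative choice\n", "n_needed = int(np.searchsorted(cumulative, threshold)) + 1\n", "print(n_needed, cumulative[n_needed - 1])" 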
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other Dimensionality Reduction Routines\n", "\n", "Note that scikit-learn contains many other unsupervised dimensionality reduction routines. Some that are useful to know about:\n", "\n", "- [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html):\n", "  Principal Component Analysis\n", "- [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) with `svd_solver='randomized'` (the successor to the removed `RandomizedPCA` class):\n", "  extremely fast approximate PCA implementation based on a randomized algorithm\n", "- [sklearn.decomposition.SparsePCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html):\n", "  PCA variant including L1 penalty for sparsity\n", "- [sklearn.decomposition.FastICA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.FastICA.html):\n", "  Independent Component Analysis\n", "- [sklearn.decomposition.NMF](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html):\n", "  non-negative matrix factorization\n", "- [sklearn.manifold.LocallyLinearEmbedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.LocallyLinearEmbedding.html):\n", "  nonlinear manifold learning technique based on local neighborhood geometry\n", "- [sklearn.manifold.Isomap](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.Isomap.html):\n", "  nonlinear manifold learning technique based on a sparse graph algorithm\n", "\n", "Each of these has its own strengths and weaknesses, and areas of application. You can read about them on the [scikit-learn website](https://scikit-learn.org)." 
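] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of these estimators share the `fit_transform` interface used above, so trying one is usually a small change. As a rough sketch, the cell below projects the digits data to two dimensions with the randomized PCA solver (`svd_solver='randomized'`, which took over from the removed `RandomizedPCA` class) and with Isomap. The variable names and the default Isomap settings are illustrative choices, not recommendations." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_digits\n", "from sklearn.decomposition import PCA\n", "from sklearn.manifold import Isomap\n", "\n", "digits_data = load_digits().data\n", "\n", "# Approximate PCA using the randomized SVD solver\n", "pca_rand = PCA(n_components=2, svd_solver='randomized', random_state=0)\n", "proj_pca = pca_rand.fit_transform(digits_data)\n", "\n", "# Isomap: a nonlinear alternative with the same fit_transform interface\n", "proj_iso = Isomap(n_components=2).fit_transform(digits_data)\n", "\n", "print(proj_pca.shape, proj_iso.shape)" 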
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "widgets": { "state": { "ca1be6dfe7514b1d9ebee4859997ba76": { "views": [ { "cell_index": 22 } ] } }, "version": "1.2.0" } }, "nbformat": 4, "nbformat_minor": 4 }