{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "WUZm-Jbl3V6n" }, "source": [ "## $\\Large{Data\\; Visualization \\;(part2)}$\n", "\n", "在介紹過matplotlib的功能後,我們大致上已經可以畫出各式各樣的圖形,但你可能會發現使用matplotlib套件時我們需要自己將一些基礎元件組裝起來才能畫出圖形(例如主標題、標籤等),因此需要撰寫較大量的程式碼。為了讓使用者能夠更方便地進行資料視覺化,也有人以matplotlib作為底層開發了較高階的繪圖套件,在這個單元中我們要教的seaborn套件就是這樣的一個存在。" ] }, { "cell_type": "markdown", "metadata": { "id": "BKShugalzsel" }, "source": [ "### 本章節內容大綱\n", "* [載入套件](#載入套件)\n", "* [繪製基本統計圖](#繪製基本統計圖)\n", "* [繪製進階統計圖](#繪製進階統計圖)\n", " - [小提琴圖(Violin plot)](#小提琴圖)\n", " - [多變量圖(Pair plot)](#多變量圖)\n", " - [熱力圖(Heat map)](#熱力圖)\n", "* [使用FacetGrid作分面繪圖](#使用FacetGrid作分面繪圖)\n", "* [使用matplotlib作細部調整](#使用matplotlib做細部調整)" ] }, { "cell_type": "markdown", "metadata": { "id": "_mujnCTHzsem" }, "source": [ "---\n", "\n", "## 載入套件" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Go9s3ktBzsen" }, "outputs": [], "source": [ "# 載入seaborn套件並命名為sns\n", "import seaborn as sns\n", "\n", "# 載入matplotlib中的pypplot模組並且命名為plt\n", "import matplotlib.pyplot as plt\n", "\n", "# 由於繪圖需要資料,在此同時載入numpy套件與pandas套件\n", "import numpy as np\n", "import pandas as pd\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-iJnelwbzseo" }, "outputs": [], "source": [ "# 讀取資料,在此同樣以鐵達尼號資料為範例\n", "df = pd.read_csv('https://github.com/TA-aiacademy/course_3.0/releases/download/Python/titanic_train.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-oISut8JqefV" }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "T3URXzFGzsep" }, "source": [ "---\n", "\n", "## 繪製基本統計圖\n", "\n", "同樣地,我們先來看看在seaborn中如何繪製基本的統計圖形,然而在此我們會試著增加一些進階的設定使圖形有更多的資訊量。" ] }, { "cell_type": "markdown", "metadata": { "id": "3D9A1LG0zseq" }, "source": [ "- ### 直方圖" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Zp57xfH5zseq" }, "outputs": [], "source": [ "# 以histplot繪製Fare的直方圖,同時可使用kde參數決定是否呈現機率密度函數之估計\n", "sns.histplot(df['Fare'], kde=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "Kdr3yfyTzser" }, "source": [ "- ### 盒型圖" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dOHpUpTKzses" }, "outputs": [], "source": [ "# 使用boxplot可繪製盒型圖,另外可以x參數設定盒型圖的分組依據\n", "sns.boxplot(x=df['Pclass'], y=df['Fare'])" ] }, { "cell_type": "markdown", "metadata": { "id": "KBZR8lbfzset" }, "source": [ "- ### 長條圖" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "v7Y549xQzset" }, "outputs": [], "source": [ "# 使用barplot繪製長條圖呈現不同艙等(Pclass)中票價(Fare)的平均\n", "# 在此我們結合了pandas與seaborn套件作繪圖,在參數內只需要指定欄位名稱並給予資料即可,另外可以指定顏色\n", "sns.barplot(x='Pclass', y='Fare', data=df, color='salmon')" ] }, { "cell_type": "markdown", "metadata": { "id": "Vmni9ZYazseu" }, "source": [ "- ### 散佈圖" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yDdyoXqqzseu" }, "outputs": [], "source": [ "# 使用scatterplot繪製散佈圖,另一種類似的函數是regplot,但它會自動作回歸線的估計\n", "sns.scatterplot(x='Fare', y='Age', data=df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9jAESmf8qefZ" }, "outputs": [], "source": [ "# 使用scatterplot繪製散佈圖,另一種類似的函數是regplot,但它會自動作回歸線的估計\n", "sns.regplot(x='Fare', y='Age', data=df)" ] }, { "cell_type": "markdown", "metadata": { "id": "ctGNKasTzsev" }, "source": [ "---\n", "\n", "## 繪製進階統計圖\n", "\n", "除了最基本的統計圖表外,在資料視覺化上為了一次能在圖中呈現較多資訊、或是作多個變項的關聯性呈現,也有許多進階的統計圖陸續被提出。而seaborn套件的強大威力就在於可快速地協助我們繪製這些進階的圖形,以下我們將介紹三個非常常被使用到的圖形,分別為小提琴圖(violin plot)、多變量圖(pair plot)、以及熱點圖(heatmap)。" ] }, { "cell_type": "markdown", "metadata": { "id": "i85BPdgczsev" }, "source": [ "\n", "- ### 小提琴圖\n", "小提琴圖可用來比較不同組別中特定連續變項的分布是否有差異,如下圖我們可以同時考慮不同艙等(Pclass)與性別(Sex)的乘客在年齡上的分布。\n", "\n", " - [官方網頁與範例](https://seaborn.pydata.org/generated/seaborn.violinplot.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Vlh2C-pOqefb" }, "outputs": [], "source": [ "# 使用violinplot繪製小提琴圖\n", "sns.violinplot(x='Pclass', y='Age', data=df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Wsw74GCezsev" }, "outputs": [], "source": [ "# 使用violinplot繪製小提琴圖\n", "sns.violinplot(x='Pclass', y='Age', data=df, hue='Sex')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ySoTf0Dtqefb" }, "outputs": [], "source": [ "# 使用violinplot繪製小提琴圖\n", "sns.violinplot(x='Pclass', y='Age', data=df, hue='Sex', split=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "1H5cnyf5zsew" }, "source": [ "\n", "- ### 多變量圖\n", "如果我們想要一次觀察資料中連續變項的分布和彼此之間的散佈圖,我們可以使用多變量圖(pairplot)來做觀察。如圖所示,對角線部分是各個欄位的直方圖、其餘部分則是兩兩變項的散佈圖。另外一個很類似的圖形則是jointplot,差異在於單次只能看兩個連續變項。\n", "\n", " - [官方網頁與範例](https://seaborn.pydata.org/generated/seaborn.pairplot.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BZ9sRMevzsew" }, "outputs": [], "source": [ "# 使用pairplot繪製資料的多變量圖\n", "\n", "# 只選取連續類型的變項\n", "plot_df = df[['Age', 'SibSp', 'Parch', 'Fare']].dropna()\n", "sns.pairplot(data=plot_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nywA51Nwzsex" }, "outputs": [], "source": [ "# 我們也可以使用kind參數改變非對角線的圖形類型\n", "plot_df = df[['Age', 'SibSp', 'Parch', 'Fare']].dropna()\n", "sns.pairplot(data=plot_df, kind='reg')" ] }, { "cell_type": "markdown", "metadata": { "id": "FGeD1S5ezsex" }, "source": [ "\n", "- ### 熱力圖\n", "熱力圖則可以同時考慮兩個類別變項,並且呈現各個分組下的某個特定數值,例如我們可以使用熱力圖觀察各個艙等(Pclass)和性別(Sex)的乘客平均存活率\n", " - [官方網頁與範例](https://seaborn.pydata.org/generated/seaborn.heatmap.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DJA2pqgmzsex" }, "outputs": [], "source": [ "# 事先將資料整理為需要的樣態,在這邊我們利用了Pandas課程時教到的pivot_table\n", "plot_data = df.pivot_table(values='Survived', index='Pclass', columns='Sex')\n", "\n", "# 使用heatmap繪製不同艙等與性別的平均存活率,並且修改顏色與加上數值標誌\n", "sns.heatmap(plot_data, cmap=\"Blues\", annot=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CEB9LGMgqefd" }, "outputs": [], "source": [ "plot_data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "W6Y9b2E9qefd" }, "outputs": [], "source": [ "# 我們也可以利用熱力圖觀察變項之間的相關係數\n", "plot_data = df.corr()\n", "\n", "# 繪製資料的相關係數矩陣熱力圖\n", "sns.heatmap(plot_data, cmap='Reds', annot=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Jjc5kHFczsex" }, "outputs": [], "source": [ "# 我們也可以利用熱力圖觀察變項之間的相關係數\n", "plot_data = df.corr()\n", "\n", "# 繪製資料的相關係數矩陣熱力圖 (vmin 和 vmax 可按照右邊尺規調整顏色)\n", "sns.heatmap(plot_data, cmap='coolwarm', annot=True, vmin=-1.0, vmax=1.0)" ] }, { "cell_type": "markdown", "metadata": { "id": "0Iv5I-ctzsey" }, "source": [ "---\n", "\n", "## 使用FacetGrid作分面繪圖\n", "在seaborn中,若我們想依據特定欄位的組別每一組畫一張圖,我們可以使用FacetGrid的方式做繪製。在使用上我們大多數會搭配著matplotlib的基礎統計圖做使用。例如在以下範例中,我們希望依照性別(Sex)與艙等(Pclass)的各個組合分別繪製年齡的直方圖。\n", "\n", "- [官方文件與範例](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7EZD3OTOzsey" }, "outputs": [], "source": [ "# 使用FacetGrid 搭配 matplotlib.pyplot 中的直方圖做繪製\n", "\n", "# 建立FacetGrid物件並且指定欄與列的分組變項\n", "g = sns.FacetGrid(data=df, row='Sex', col='Pclass')\n", "\n", "# 在建立好的物件中加入直方圖並且指定要繪製的變項名稱\n", "g = g.map(plt.hist, 'Age')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3xp6d47Pzsey" }, "outputs": [], "source": [ "# 依據性別與艙等觀察年齡與手足/伴侶同在船上數量的關聯\n", "\n", "# 建立FacetGrid物件並且指定欄與列的分組變項\n", "g = sns.FacetGrid(data=df, row='Sex', col='Pclass')\n", "\n", "# 使用plt.scatter做繪製,相關的調整參數只要加在後面即可做調整\n", "g = g.map(plt.scatter, 'Age', 'SibSp', color='green')" ] }, { "cell_type": "markdown", "metadata": { "id": "DzublkSjzsez" }, "source": [ "---\n", "\n", "## 使用matplotlib做細部調整\n", "\n", "雖然seaborn提供了簡潔方便又美觀的繪圖方式,然而在某些時刻我們仍然需要進行些許的調整或使用seaborn未提供的方法以符合需求。由於先前提到seaborn是依據matplotlib為底層進行繪製的高階繪圖套件,因此兩者可相容。因此除了直接全部只用matplotlib的低階繪圖函數之外,我們也可以搭配著seaborn與matplotlib一起使用。" ] }, { "cell_type": "markdown", "metadata": { "id": "szirTGfxzsez" }, "source": [ "- ### 增加文字" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Irmi2Mdqzse0" }, "outputs": [], "source": [ "# 使用高階繪圖套件seaborn繪圖\n", "sns.scatterplot(data=df, x='Fare', y='Age', hue='Pclass', legend='full')\n", "\n", "# 使用低階套件matplotlib.pyplot加上文字\n", "plt.text(460, 40, 'outliers', fontsize=12, color='red')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "F3fDyFpmzse0" }, "source": [ "- ### 儲存圖片" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Jt8NO2nezse0" }, "outputs": [], "source": [ "# 使用高階繪圖套件seaborn繪圖\n", "sns.violinplot(data=df, x='Pclass', y='Age', hue='Sex', split=True)\n", "\n", "# 使用plt.save方法儲存此圖片\n", "plt.savefig('violinplot_example.png')" ] }, { "cell_type": "markdown", "metadata": { "id": "hp6KgUrCzse1" }, "source": [ "## 資料視覺化小結\n", "\n", "在這兩個部分中我們介紹了matplotlib與seaborn套件,以及如何繪製常見的統計圖。然而資料視覺化的精隨其實在於如何決定繪製何種圖表以利於增加我們對資料的認識,這部分就得靠大家的領域知識、經驗累積、甚至是創意發想,若有想法而不知該如何實踐的話,也不妨去參考他人的教學或是創意唷。\n", "\n", "### 一些好用的參考網站\n", "- [The Python Graph Gallery](https://python-graph-gallery.com/)\n", "- [另一個非常完整的視覺化套件: Plotly](https://plot.ly/python/)\n", "- [Top 50 matplotlib Visualizations](https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/)" ] } ], "metadata": { "colab": { "name": "Data_Visualization_part2.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }