{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Four Steps to Build a Model\n", "\n", "Building a machine learning model in Scikit-learn is straightforward. The workflow is roughly the following four steps:\n", "\n", "1. Load the data and check its shape (how many rows and features it has, and what type the label is)\n", "    - pd.read_csv\n", "    - np.loadtxt\n", "    - sklearn.datasets.load_xxx\n", "    - data.shape (data must be a NumPy array or a pandas object)\n", "2. Split the data into training (train) and test (test) sets\n", "    - train_test_split(data)\n", "3. Create the model and fit it to the training data\n", "    - clf = DecisionTreeClassifier()\n", "    - clf.fit(x_train, y_train)\n", "4. Feed the test features into the trained model to get predictions, then evaluate them against the test labels (y_test)\n", "    - clf.predict(x_test)\n", "    - accuracy_score(y_test, y_pred)\n", "    - f1_score(y_test, y_pred) (for multiclass labels, also pass average, e.g. average='macro')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import accuracy_score" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load the iris dataset\n", "iris = load_iris()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check the shapes of the features and the labels\n", "print(iris.data.shape, iris.target.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "random_seed = 5  # fix the seed so the train/test split is reproducible\n", "x_train, x_test, y_train, y_test = train_test_split(\n", "    iris.data, iris.target, random_state=random_seed\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"shape of x_train: \", x_train.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"shape of x_test: \", x_test.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf = DecisionTreeClassifier()" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf.fit(x_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred = clf.predict(x_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# fraction of test samples that were classified correctly\n", "accuracy_score(y_test, y_pred)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(iris.feature_names)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# importance of each feature, in the same order as iris.feature_names\n", "clf.feature_importances_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize our tree" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# installs the Python package only; rendering also needs the Graphviz system binaries (the dot executable)\n", "!pip install graphviz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import graphviz\n", "from sklearn.tree import export_graphviz\n", "\n", "# export the fitted tree as DOT source with default settings (no feature or class names)\n", "dot_data = export_graphviz(clf, out_file=None)\n", "graph = graphviz.Source(dot_data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# export again with readable feature and class names plus styling, then display the tree\n", "dot_data = export_graphviz(\n", "    clf,\n", "    out_file=None,\n", "    feature_names=iris.feature_names,\n", "    class_names=iris.target_names,\n", "    filled=True,\n", "    rounded=True,\n", "    special_characters=True,\n", ")\n", "graph = graphviz.Source(dot_data)\n", "graph" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", 
"title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }