{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"code","source":"# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n# For example, here are several helpful packages to load\n\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n\n# Input data files are available in the read-only \"../input/\" directory\n# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n\nimport os\nfor dirname, _, filenames in os.walk('/kaggle/input'):\n    for filename in filenames:\n        print(os.path.join(dirname, filename))\n\n# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current 
session","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","execution":{"iopub.status.busy":"2022-08-12T16:07:49.750037Z","iopub.execute_input":"2022-08-12T16:07:49.750515Z","iopub.status.idle":"2022-08-12T16:07:49.761989Z","shell.execute_reply.started":"2022-08-12T16:07:49.750473Z","shell.execute_reply":"2022-08-12T16:07:49.760803Z"},"trusted":true},"execution_count":34,"outputs":[{"name":"stdout","text":"/kaggle/input/tabular-playground-series-aug-2022/sample_submission.csv\n/kaggle/input/tabular-playground-series-aug-2022/train.csv\n/kaggle/input/tabular-playground-series-aug-2022/test.csv\n","output_type":"stream"}]},{"cell_type":"markdown","source":"## Using skops to host your models on Hugging Face Hub\nThis notebook shows how to use [skops](https://skops.readthedocs.io/) to improve your data science workflows with scikit-learn. We will walk through an end-to-end example on the Kaggle Tabular Playground Series of August 2022.","metadata":{}},{"cell_type":"markdown","source":"## Install skops","metadata":{}},{"cell_type":"code","source":"#!pip install skops","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:42:20.000537Z","iopub.execute_input":"2022-08-12T16:42:20.000960Z","iopub.status.idle":"2022-08-12T16:42:20.005212Z","shell.execute_reply.started":"2022-08-12T16:42:20.000926Z","shell.execute_reply":"2022-08-12T16:42:20.004298Z"},"trusted":true},"execution_count":58,"outputs":[]},{"cell_type":"markdown","source":"## Import libraries","metadata":{}},{"cell_type":"code","source":"import skops\nimport sklearn\nimport matplotlib.pyplot as 
plt","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.273144Z","iopub.execute_input":"2022-08-12T16:08:01.273524Z","iopub.status.idle":"2022-08-12T16:08:01.279217Z","shell.execute_reply.started":"2022-08-12T16:08:01.273487Z","shell.execute_reply":"2022-08-12T16:08:01.277670Z"},"trusted":true},"execution_count":36,"outputs":[]},{"cell_type":"markdown","source":"## Let's take a look at the dataset\nThe target variable is binary. The features are a mix of numerical and categorical variables.","metadata":{}},{"cell_type":"code","source":"df = pd.read_csv(\"../input/tabular-playground-series-aug-2022/train.csv\")\ndf.head()","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.280555Z","iopub.execute_input":"2022-08-12T16:08:01.280918Z","iopub.status.idle":"2022-08-12T16:08:01.433127Z","shell.execute_reply.started":"2022-08-12T16:08:01.280882Z","shell.execute_reply":"2022-08-12T16:08:01.431902Z"},"trusted":true},"execution_count":37,"outputs":[{"execution_count":37,"output_type":"execute_result","data":{"text/plain":" id product_code loading attribute_0 attribute_1 attribute_2 attribute_3 \\\n0 0 A 80.10 material_7 material_8 9 5 \n1 1 A 84.89 material_7 material_8 9 5 \n2 2 A 82.43 material_7 material_8 9 5 \n3 3 A 101.07 material_7 material_8 9 5 \n4 4 A 188.06 material_7 material_8 9 5 \n\n measurement_0 measurement_1 measurement_2 ... measurement_9 \\\n0 7 8 4 ... 10.672 \n1 14 3 3 ... 12.448 \n2 12 1 5 ... 12.715 \n3 13 2 6 ... 12.471 \n4 9 2 8 ... 
10.337 \n\n measurement_10 measurement_11 measurement_12 measurement_13 \\\n0 15.859 17.594 15.193 15.029 \n1 17.947 17.915 11.755 14.732 \n2 15.607 NaN 13.798 16.711 \n3 16.346 18.377 10.020 15.250 \n4 17.082 19.932 12.428 16.182 \n\n measurement_14 measurement_15 measurement_16 measurement_17 failure \n0 NaN 13.034 14.684 764.100 0 \n1 15.425 14.395 15.631 682.057 0 \n2 18.631 14.094 17.946 663.376 0 \n3 15.562 16.154 17.172 826.282 0 \n4 12.760 13.153 16.412 579.885 0 \n\n[5 rows x 26 columns]","text/html":"<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>id</th>\n <th>product_code</th>\n <th>loading</th>\n <th>attribute_0</th>\n <th>attribute_1</th>\n <th>attribute_2</th>\n <th>attribute_3</th>\n <th>measurement_0</th>\n <th>measurement_1</th>\n <th>measurement_2</th>\n <th>...</th>\n <th>measurement_9</th>\n <th>measurement_10</th>\n <th>measurement_11</th>\n <th>measurement_12</th>\n <th>measurement_13</th>\n <th>measurement_14</th>\n <th>measurement_15</th>\n <th>measurement_16</th>\n <th>measurement_17</th>\n <th>failure</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n <td>A</td>\n <td>80.10</td>\n <td>material_7</td>\n <td>material_8</td>\n <td>9</td>\n <td>5</td>\n <td>7</td>\n <td>8</td>\n <td>4</td>\n <td>...</td>\n <td>10.672</td>\n <td>15.859</td>\n <td>17.594</td>\n <td>15.193</td>\n <td>15.029</td>\n <td>NaN</td>\n <td>13.034</td>\n <td>14.684</td>\n <td>764.100</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n <td>A</td>\n <td>84.89</td>\n <td>material_7</td>\n <td>material_8</td>\n <td>9</td>\n <td>5</td>\n <td>14</td>\n <td>3</td>\n <td>3</td>\n <td>...</td>\n <td>12.448</td>\n <td>17.947</td>\n <td>17.915</td>\n <td>11.755</td>\n <td>14.732</td>\n 
<td>15.425</td>\n <td>14.395</td>\n <td>15.631</td>\n <td>682.057</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>2</td>\n <td>A</td>\n <td>82.43</td>\n <td>material_7</td>\n <td>material_8</td>\n <td>9</td>\n <td>5</td>\n <td>12</td>\n <td>1</td>\n <td>5</td>\n <td>...</td>\n <td>12.715</td>\n <td>15.607</td>\n <td>NaN</td>\n <td>13.798</td>\n <td>16.711</td>\n <td>18.631</td>\n <td>14.094</td>\n <td>17.946</td>\n <td>663.376</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>3</td>\n <td>A</td>\n <td>101.07</td>\n <td>material_7</td>\n <td>material_8</td>\n <td>9</td>\n <td>5</td>\n <td>13</td>\n <td>2</td>\n <td>6</td>\n <td>...</td>\n <td>12.471</td>\n <td>16.346</td>\n <td>18.377</td>\n <td>10.020</td>\n <td>15.250</td>\n <td>15.562</td>\n <td>16.154</td>\n <td>17.172</td>\n <td>826.282</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>4</td>\n <td>A</td>\n <td>188.06</td>\n <td>material_7</td>\n <td>material_8</td>\n <td>9</td>\n <td>5</td>\n <td>9</td>\n <td>2</td>\n <td>8</td>\n <td>...</td>\n <td>10.337</td>\n <td>17.082</td>\n <td>19.932</td>\n <td>12.428</td>\n <td>16.182</td>\n <td>12.760</td>\n <td>13.153</td>\n <td>16.412</td>\n <td>579.885</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 26 columns</p>\n</div>"},"metadata":{}}]},{"cell_type":"code","source":"df[\"failure\"].unique()","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.436722Z","iopub.execute_input":"2022-08-12T16:08:01.437150Z","iopub.status.idle":"2022-08-12T16:08:01.445258Z","shell.execute_reply.started":"2022-08-12T16:08:01.437117Z","shell.execute_reply":"2022-08-12T16:08:01.444066Z"},"trusted":true},"execution_count":38,"outputs":[{"execution_count":38,"output_type":"execute_result","data":{"text/plain":"array([0, 1])"},"metadata":{}}]},{"cell_type":"markdown","source":"# Encode categorical variables, impute missing values\nWe will impute the mean for the numerical columns that have missing values (`loading` and the float-valued measurements). 
","metadata":{}},{"cell_type":"code","source":"df.describe()","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.447099Z","iopub.execute_input":"2022-08-12T16:08:01.447438Z","iopub.status.idle":"2022-08-12T16:08:01.558557Z","shell.execute_reply.started":"2022-08-12T16:08:01.447409Z","shell.execute_reply":"2022-08-12T16:08:01.557437Z"},"trusted":true},"execution_count":39,"outputs":[{"execution_count":39,"output_type":"execute_result","data":{"text/plain":" id loading attribute_2 attribute_3 measurement_0 \\\ncount 26570.000000 26320.000000 26570.000000 26570.000000 26570.000000 \nmean 13284.500000 127.826233 6.754046 7.240459 7.415883 \nstd 7670.242662 39.030020 1.471852 1.456493 4.116690 \nmin 0.000000 33.160000 5.000000 5.000000 0.000000 \n25% 6642.250000 99.987500 6.000000 6.000000 4.000000 \n50% 13284.500000 122.390000 6.000000 8.000000 7.000000 \n75% 19926.750000 149.152500 8.000000 8.000000 10.000000 \nmax 26569.000000 385.860000 9.000000 9.000000 29.000000 \n\n measurement_1 measurement_2 measurement_3 measurement_4 \\\ncount 26570.000000 26570.000000 26189.000000 26032.000000 \nmean 8.232518 6.256568 17.791528 11.731988 \nstd 4.199401 3.309109 1.001200 0.996085 \nmin 0.000000 0.000000 13.968000 8.008000 \n25% 5.000000 4.000000 17.117000 11.051000 \n50% 8.000000 6.000000 17.787000 11.733000 \n75% 11.000000 8.000000 18.469000 12.410000 \nmax 29.000000 24.000000 21.499000 16.484000 \n\n measurement_5 ... measurement_9 measurement_10 measurement_11 \\\ncount 25894.000000 ... 25343.000000 25270.000000 25102.000000 \nmean 17.127804 ... 11.430725 16.117711 19.172085 \nstd 0.996414 ... 0.999137 1.405978 1.520785 \nmin 12.073000 ... 7.537000 9.323000 12.461000 \n25% 16.443000 ... 10.757000 15.209000 18.170000 \n50% 17.132000 ... 11.430000 16.127000 19.211500 \n75% 17.805000 ... 12.102000 17.025000 20.207000 \nmax 21.425000 ... 
15.412000 22.479000 25.640000 \n\n measurement_12 measurement_13 measurement_14 measurement_15 \\\ncount 24969.000000 24796.000000 24696.000000 24561.000000 \nmean 11.702464 15.652904 16.048444 14.995554 \nstd 1.488838 1.155247 1.491923 1.549226 \nmin 5.167000 10.890000 9.140000 9.104000 \n25% 10.703000 14.890000 15.057000 13.957000 \n50% 11.717000 15.628500 16.040000 14.969000 \n75% 12.709000 16.374000 17.082000 16.018000 \nmax 17.663000 22.713000 22.303000 21.626000 \n\n measurement_16 measurement_17 failure \ncount 24460.000000 24286.000000 26570.000000 \nmean 16.460727 701.269059 0.212608 \nstd 1.708935 123.304161 0.409160 \nmin 9.701000 196.787000 0.000000 \n25% 15.268000 618.961500 0.000000 \n50% 16.436000 701.024500 0.000000 \n75% 17.628000 784.090250 0.000000 \nmax 24.094000 1312.794000 1.000000 \n\n[8 rows x 23 columns]","text/html":"<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>id</th>\n <th>loading</th>\n <th>attribute_2</th>\n <th>attribute_3</th>\n <th>measurement_0</th>\n <th>measurement_1</th>\n <th>measurement_2</th>\n <th>measurement_3</th>\n <th>measurement_4</th>\n <th>measurement_5</th>\n <th>...</th>\n <th>measurement_9</th>\n <th>measurement_10</th>\n <th>measurement_11</th>\n <th>measurement_12</th>\n <th>measurement_13</th>\n <th>measurement_14</th>\n <th>measurement_15</th>\n <th>measurement_16</th>\n <th>measurement_17</th>\n <th>failure</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>count</th>\n <td>26570.000000</td>\n <td>26320.000000</td>\n <td>26570.000000</td>\n <td>26570.000000</td>\n <td>26570.000000</td>\n <td>26570.000000</td>\n <td>26570.000000</td>\n <td>26189.000000</td>\n <td>26032.000000</td>\n <td>25894.000000</td>\n <td>...</td>\n <td>25343.000000</td>\n 
<td>25270.000000</td>\n <td>25102.000000</td>\n <td>24969.000000</td>\n <td>24796.000000</td>\n <td>24696.000000</td>\n <td>24561.000000</td>\n <td>24460.000000</td>\n <td>24286.000000</td>\n <td>26570.000000</td>\n </tr>\n <tr>\n <th>mean</th>\n <td>13284.500000</td>\n <td>127.826233</td>\n <td>6.754046</td>\n <td>7.240459</td>\n <td>7.415883</td>\n <td>8.232518</td>\n <td>6.256568</td>\n <td>17.791528</td>\n <td>11.731988</td>\n <td>17.127804</td>\n <td>...</td>\n <td>11.430725</td>\n <td>16.117711</td>\n <td>19.172085</td>\n <td>11.702464</td>\n <td>15.652904</td>\n <td>16.048444</td>\n <td>14.995554</td>\n <td>16.460727</td>\n <td>701.269059</td>\n <td>0.212608</td>\n </tr>\n <tr>\n <th>std</th>\n <td>7670.242662</td>\n <td>39.030020</td>\n <td>1.471852</td>\n <td>1.456493</td>\n <td>4.116690</td>\n <td>4.199401</td>\n <td>3.309109</td>\n <td>1.001200</td>\n <td>0.996085</td>\n <td>0.996414</td>\n <td>...</td>\n <td>0.999137</td>\n <td>1.405978</td>\n <td>1.520785</td>\n <td>1.488838</td>\n <td>1.155247</td>\n <td>1.491923</td>\n <td>1.549226</td>\n <td>1.708935</td>\n <td>123.304161</td>\n <td>0.409160</td>\n </tr>\n <tr>\n <th>min</th>\n <td>0.000000</td>\n <td>33.160000</td>\n <td>5.000000</td>\n <td>5.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>13.968000</td>\n <td>8.008000</td>\n <td>12.073000</td>\n <td>...</td>\n <td>7.537000</td>\n <td>9.323000</td>\n <td>12.461000</td>\n <td>5.167000</td>\n <td>10.890000</td>\n <td>9.140000</td>\n <td>9.104000</td>\n <td>9.701000</td>\n <td>196.787000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>25%</th>\n <td>6642.250000</td>\n <td>99.987500</td>\n <td>6.000000</td>\n <td>6.000000</td>\n <td>4.000000</td>\n <td>5.000000</td>\n <td>4.000000</td>\n <td>17.117000</td>\n <td>11.051000</td>\n <td>16.443000</td>\n <td>...</td>\n <td>10.757000</td>\n <td>15.209000</td>\n <td>18.170000</td>\n <td>10.703000</td>\n <td>14.890000</td>\n <td>15.057000</td>\n <td>13.957000</td>\n 
<td>15.268000</td>\n <td>618.961500</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>50%</th>\n <td>13284.500000</td>\n <td>122.390000</td>\n <td>6.000000</td>\n <td>8.000000</td>\n <td>7.000000</td>\n <td>8.000000</td>\n <td>6.000000</td>\n <td>17.787000</td>\n <td>11.733000</td>\n <td>17.132000</td>\n <td>...</td>\n <td>11.430000</td>\n <td>16.127000</td>\n <td>19.211500</td>\n <td>11.717000</td>\n <td>15.628500</td>\n <td>16.040000</td>\n <td>14.969000</td>\n <td>16.436000</td>\n <td>701.024500</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>75%</th>\n <td>19926.750000</td>\n <td>149.152500</td>\n <td>8.000000</td>\n <td>8.000000</td>\n <td>10.000000</td>\n <td>11.000000</td>\n <td>8.000000</td>\n <td>18.469000</td>\n <td>12.410000</td>\n <td>17.805000</td>\n <td>...</td>\n <td>12.102000</td>\n <td>17.025000</td>\n <td>20.207000</td>\n <td>12.709000</td>\n <td>16.374000</td>\n <td>17.082000</td>\n <td>16.018000</td>\n <td>17.628000</td>\n <td>784.090250</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>max</th>\n <td>26569.000000</td>\n <td>385.860000</td>\n <td>9.000000</td>\n <td>9.000000</td>\n <td>29.000000</td>\n <td>29.000000</td>\n <td>24.000000</td>\n <td>21.499000</td>\n <td>16.484000</td>\n <td>21.425000</td>\n <td>...</td>\n <td>15.412000</td>\n <td>22.479000</td>\n <td>25.640000</td>\n <td>17.663000</td>\n <td>22.713000</td>\n <td>22.303000</td>\n <td>21.626000</td>\n <td>24.094000</td>\n <td>1312.794000</td>\n <td>1.000000</td>\n </tr>\n </tbody>\n</table>\n<p>8 rows × 23 columns</p>\n</div>"},"metadata":{}}]},{"cell_type":"markdown","source":"Take a look at the missing values and data 
types.","metadata":{}},{"cell_type":"code","source":"df.isna().any()","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.560497Z","iopub.execute_input":"2022-08-12T16:08:01.560849Z","iopub.status.idle":"2022-08-12T16:08:01.573857Z","shell.execute_reply.started":"2022-08-12T16:08:01.560796Z","shell.execute_reply":"2022-08-12T16:08:01.572810Z"},"trusted":true},"execution_count":40,"outputs":[{"execution_count":40,"output_type":"execute_result","data":{"text/plain":"id False\nproduct_code False\nloading True\nattribute_0 False\nattribute_1 False\nattribute_2 False\nattribute_3 False\nmeasurement_0 False\nmeasurement_1 False\nmeasurement_2 False\nmeasurement_3 True\nmeasurement_4 True\nmeasurement_5 True\nmeasurement_6 True\nmeasurement_7 True\nmeasurement_8 True\nmeasurement_9 True\nmeasurement_10 True\nmeasurement_11 True\nmeasurement_12 True\nmeasurement_13 True\nmeasurement_14 True\nmeasurement_15 True\nmeasurement_16 True\nmeasurement_17 True\nfailure False\ndtype: bool"},"metadata":{}}]},{"cell_type":"code","source":"df.dtypes","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.575878Z","iopub.execute_input":"2022-08-12T16:08:01.576226Z","iopub.status.idle":"2022-08-12T16:08:01.585351Z","shell.execute_reply.started":"2022-08-12T16:08:01.576194Z","shell.execute_reply":"2022-08-12T16:08:01.584190Z"},"trusted":true},"execution_count":41,"outputs":[{"execution_count":41,"output_type":"execute_result","data":{"text/plain":"id int64\nproduct_code object\nloading float64\nattribute_0 object\nattribute_1 object\nattribute_2 int64\nattribute_3 int64\nmeasurement_0 int64\nmeasurement_1 int64\nmeasurement_2 int64\nmeasurement_3 float64\nmeasurement_4 float64\nmeasurement_5 float64\nmeasurement_6 float64\nmeasurement_7 float64\nmeasurement_8 float64\nmeasurement_9 float64\nmeasurement_10 float64\nmeasurement_11 float64\nmeasurement_12 float64\nmeasurement_13 float64\nmeasurement_14 float64\nmeasurement_15 float64\nmeasurement_16 
float64\nmeasurement_17 float64\nfailure int64\ndtype: object"},"metadata":{}}]},{"cell_type":"markdown","source":"Let's see the cardinality of categorical variables.","metadata":{}},{"cell_type":"code","source":"print(df.product_code.nunique())\nprint(df.attribute_0.nunique())\nprint(df.attribute_0.nunique())\n","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.586538Z","iopub.execute_input":"2022-08-12T16:08:01.587124Z","iopub.status.idle":"2022-08-12T16:08:01.602366Z","shell.execute_reply.started":"2022-08-12T16:08:01.587085Z","shell.execute_reply":"2022-08-12T16:08:01.601468Z"},"trusted":true},"execution_count":42,"outputs":[{"name":"stdout","text":"5\n2\n2\n","output_type":"stream"}]},{"cell_type":"markdown","source":"# Preprocessing \nWe will use OneHotEncoder to encode our categorical variables, SimpleImputer to impute missing values and put them all in a ColumnTransformer. We will then use the transformer in our machine learning pipeline to have an end-to-end object for better reproducibility.","metadata":{}},{"cell_type":"code","source":"from sklearn.preprocessing import OneHotEncoder\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.compose import ColumnTransformer\n\ncolumn_transformer_pipeline = ColumnTransformer([\n (\"loading_missing_value_imputer\", SimpleImputer(strategy=\"mean\"), [\"loading\"]),\n (\"numerical_missing_value_imputer\", SimpleImputer(strategy=\"mean\"), list(df.columns[df.dtypes == 'float64'])),\n (\"attribute_0_encoder\", OneHotEncoder(categories = \"auto\"), [\"attribute_0\"]),\n (\"attribute_1_encoder\", OneHotEncoder(categories = \"auto\"), [\"attribute_1\"]),\n (\"product_code_encoder\", OneHotEncoder(categories = \"auto\"), 
[\"product_code\"])])","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.603692Z","iopub.execute_input":"2022-08-12T16:08:01.604304Z","iopub.status.idle":"2022-08-12T16:08:01.612756Z","shell.execute_reply.started":"2022-08-12T16:08:01.604268Z","shell.execute_reply":"2022-08-12T16:08:01.611678Z"},"trusted":true},"execution_count":43,"outputs":[]},{"cell_type":"code","source":"df = df.drop([\"id\"], axis=1)","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.616244Z","iopub.execute_input":"2022-08-12T16:08:01.616897Z","iopub.status.idle":"2022-08-12T16:08:01.628662Z","shell.execute_reply.started":"2022-08-12T16:08:01.616855Z","shell.execute_reply":"2022-08-12T16:08:01.627386Z"},"trusted":true},"execution_count":44,"outputs":[]},{"cell_type":"code","source":"from sklearn.tree import DecisionTreeClassifier\nfrom sklearn.pipeline import Pipeline\npipeline = Pipeline([\n ('transformation', column_transformer_pipeline),\n ('model', DecisionTreeClassifier(max_depth=4))\n])","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.630096Z","iopub.execute_input":"2022-08-12T16:08:01.631034Z","iopub.status.idle":"2022-08-12T16:08:01.640519Z","shell.execute_reply.started":"2022-08-12T16:08:01.630995Z","shell.execute_reply":"2022-08-12T16:08:01.639257Z"},"trusted":true},"execution_count":45,"outputs":[]},{"cell_type":"code","source":"X = df.drop([\"failure\"], axis = 1)\ny = df.failure","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.642270Z","iopub.execute_input":"2022-08-12T16:08:01.643448Z","iopub.status.idle":"2022-08-12T16:08:01.656699Z","shell.execute_reply.started":"2022-08-12T16:08:01.643404Z","shell.execute_reply":"2022-08-12T16:08:01.655346Z"},"trusted":true},"execution_count":46,"outputs":[]},{"cell_type":"code","source":"from sklearn.model_selection import train_test_split\nX_train, X_test, y_train, y_test = train_test_split(X, 
y)","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.658597Z","iopub.execute_input":"2022-08-12T16:08:01.659907Z","iopub.status.idle":"2022-08-12T16:08:01.680523Z","shell.execute_reply.started":"2022-08-12T16:08:01.659853Z","shell.execute_reply":"2022-08-12T16:08:01.679150Z"},"trusted":true},"execution_count":47,"outputs":[]},{"cell_type":"code","source":"pipeline.fit(X_train, y_train)","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.682078Z","iopub.execute_input":"2022-08-12T16:08:01.682574Z","iopub.status.idle":"2022-08-12T16:08:01.927531Z","shell.execute_reply.started":"2022-08-12T16:08:01.682526Z","shell.execute_reply":"2022-08-12T16:08:01.926319Z"},"trusted":true},"execution_count":48,"outputs":[{"execution_count":48,"output_type":"execute_result","data":{"text/plain":"Pipeline(steps=[('transformation',\n ColumnTransformer(transformers=[('loading_missing_value_imputer',\n SimpleImputer(),\n ['loading']),\n ('numerical_missing_value_imputer',\n SimpleImputer(),\n ['loading', 'measurement_3',\n 'measurement_4',\n 'measurement_5',\n 'measurement_6',\n 'measurement_7',\n 'measurement_8',\n 'measurement_9',\n 'measurement_10',\n 'measurement_11',\n 'measurement_12',\n 'measurement_13',\n 'measurement_14',\n 'measurement_15',\n 'measurement_16',\n 'measurement_17']),\n ('attribute_0_encoder',\n OneHotEncoder(),\n ['attribute_0']),\n ('attribute_1_encoder',\n OneHotEncoder(),\n ['attribute_1']),\n ('product_code_encoder',\n OneHotEncoder(),\n ['product_code'])])),\n ('model', DecisionTreeClassifier(max_depth=4))])"},"metadata":{}}]},{"cell_type":"code","source":"y_pred = 
pipeline.predict(X_test)","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.929273Z","iopub.execute_input":"2022-08-12T16:08:01.929842Z","iopub.status.idle":"2022-08-12T16:08:01.956267Z","shell.execute_reply.started":"2022-08-12T16:08:01.929778Z","shell.execute_reply":"2022-08-12T16:08:01.955125Z"},"trusted":true},"execution_count":49,"outputs":[]},{"cell_type":"markdown","source":"# We will now save the model and create a model card with metrics about our model!","metadata":{}},{"cell_type":"markdown","source":"We will use `hub_utils` for model hosting and `card` to create a model card. First, we will initialize a local repository to hold our model, its configuration, the model card and anything else we want to include (e.g. plots).","metadata":{}},{"cell_type":"code","source":"from skops import card, hub_utils\nimport pickle\n\nmodel_path = \"model.pkl\"\nlocal_repo = \"decision-tree-playground-kaggle\"\n\n# serialize the fitted pipeline so it can be copied into the repository\nwith open(model_path, mode=\"bw\") as f:\n    pickle.dump(pipeline, file=f)\n\nhub_utils.init(\n    model=model_path,\n    requirements=[f\"scikit-learn={sklearn.__version__}\"],\n    dst=local_repo,\n    task=\"tabular-classification\",\n    data=X_test,\n)","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.957800Z","iopub.execute_input":"2022-08-12T16:08:01.958544Z","iopub.status.idle":"2022-08-12T16:08:01.971908Z","shell.execute_reply.started":"2022-08-12T16:08:01.958496Z","shell.execute_reply":"2022-08-12T16:08:01.970902Z"},"trusted":true},"execution_count":50,"outputs":[]},{"cell_type":"markdown","source":"## We will now create our card 🃏 ","metadata":{}},{"cell_type":"markdown","source":"Creating the model card is as simple as instantiating the `Card` class of `skops`. Calling the `metadata_from_config` function creates the metadata section of the model card from the configuration file. 
We will use the `add` method to pass information to our model card.","metadata":{}},{"cell_type":"code","source":"from pathlib import Path\nmodel_card = card.Card(pipeline, metadata=card.metadata_from_config(Path(local_repo)))\n\n## let's fill some information about the model\nlimitations = \"This model is not ready to be used in production.\"\nmodel_description = \"This is a DecisionTreeClassifier model built for Kaggle Tabular Playground Series August 2022, trained on supersoaker production failures dataset.\"\nmodel_card_authors = \"huggingface\"\n# the path is quoted so that the generated snippet is valid Python\nget_started_code = f\"import pickle \\nwith open('{local_repo}/{model_path}', 'rb') as file: \\n clf = pickle.load(file)\"\n\n# pass this information to the card\nmodel_card.add(\n get_started_code=get_started_code,\n model_card_authors=model_card_authors,\n limitations=limitations,\n model_description=model_description,\n)\n# the add methods return the model card itself, which allows easy method chaining","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:01.973796Z","iopub.execute_input":"2022-08-12T16:08:01.974696Z","iopub.status.idle":"2022-08-12T16:08:02.071532Z","shell.execute_reply.started":"2022-08-12T16:08:01.974655Z","shell.execute_reply":"2022-08-12T16:08:02.070310Z"},"trusted":true},"execution_count":51,"outputs":[{"execution_count":51,"output_type":"execute_result","data":{"text/plain":"Card(\n model=Pipeline(steps=[('transformat...cisionTreeClassifier(max_depth=4))]),\n metadata.library_name=sklearn,\n metadata.tags=['sklearn', 'skops', 'tabular-classification'],\n metadata.widget={...},\n get_started_code=\"import pickle \\\\n...s file: \\\\n clf = pickle.load(file)\",\n model_card_authors='huggingface',\n limitations='This model is not ready to be used in production.',\n model_description='This is a Decisi...soaker production failures dataset.',\n)"},"metadata":{}}]},{"cell_type":"markdown","source":"We will now create plots and other insights about our model and write them to the model card. 
\nThe pipeline holds the decision tree as its last step. Each step is a `(name, estimator)` tuple, so the second element of the last tuple is the tree model itself; to plot the tree, we first have to retrieve it from the pipeline (see below).","metadata":{}},{"cell_type":"code","source":"pipeline.steps[-1][1]","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:02.073020Z","iopub.execute_input":"2022-08-12T16:08:02.073400Z","iopub.status.idle":"2022-08-12T16:08:02.082681Z","shell.execute_reply.started":"2022-08-12T16:08:02.073367Z","shell.execute_reply":"2022-08-12T16:08:02.080924Z"},"trusted":true},"execution_count":52,"outputs":[{"execution_count":52,"output_type":"execute_result","data":{"text/plain":"DecisionTreeClassifier(max_depth=4)"},"metadata":{}}]},{"cell_type":"markdown","source":"We can use `add_metrics` to pass metrics to our model card, which skops will render as a table for us. We will use `add_plot` to add our plots. ","metadata":{}},{"cell_type":"code","source":"from sklearn.metrics import accuracy_score, f1_score, ConfusionMatrixDisplay, confusion_matrix\nmodel_card.add(eval_method=\"The model is evaluated using test split, on accuracy and F1 score with micro average.\")\nmodel_card.add_metrics(accuracy=accuracy_score(y_test, y_pred))\nmodel_card.add_metrics(**{\"f1 score\": f1_score(y_test, y_pred, average=\"micro\")})\n\nmodel = pipeline.steps[-1][1]\n# we will plot the tree and add the plot to our card\nfrom sklearn.tree import plot_tree\nplt.figure()\nplot_tree(model, filled=True)\nplt.savefig(f'{local_repo}/tree.png', format='png', bbox_inches=\"tight\")\n\n# recompute the predictions on the test split and plot the confusion matrix\n\ny_pred = pipeline.predict(X_test)\ncm = confusion_matrix(y_test, y_pred, labels=model.classes_)\ndisp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)\ndisp.plot()\n# save the plot\nplt.savefig(Path(local_repo) / \"confusion_matrix.png\")\n# add figures to model card 
with their new sections as keys to the dictionary\nmodel_card.add_plot(**{\"Tree Plot\": f'{local_repo}/tree.png', \"Confusion Matrix\": f\"{local_repo}/confusion_matrix.png\"})","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:02.084852Z","iopub.execute_input":"2022-08-12T16:08:02.085287Z","iopub.status.idle":"2022-08-12T16:08:05.482006Z","shell.execute_reply.started":"2022-08-12T16:08:02.085232Z","shell.execute_reply":"2022-08-12T16:08:05.480747Z"},"trusted":true},"execution_count":53,"outputs":[{"execution_count":53,"output_type":"execute_result","data":{"text/plain":"Card(\n model=Pipeline(steps=[('transformat...cisionTreeClassifier(max_depth=4))]),\n metadata.library_name=sklearn,\n metadata.tags=['sklearn', 'skops', 'tabular-classification'],\n metadata.widget={...},\n get_started_code=\"import pickle \\\\n...s file: \\\\n clf = pickle.load(file)\",\n model_card_authors='huggingface',\n limitations='This model is not ready to be used in production.',\n model_description='This is a Decisi...soaker production failures dataset.',\n eval_method='The model is evaluated...cy and F1 score with micro average.',\n Tree Plot='decision-tree-playground-kaggle/tree.png',\n Confusion Matrix='decision-tree-playground-kaggle/confusion_matrix.png',\n)"},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"<Figure size 432x288 with 1 Axes>","image/png":"\n"},"metadata":{"needs_background":"light"}},{"output_type":"display_data","data":{"text/plain":"<Figure size 432x288 with 2 Axes>","image/png":"\n"},"metadata":{"needs_background":"light"}}]},{"cell_type":"markdown","source":"We will now save our model 
card.","metadata":{}},{"cell_type":"code","source":"model_card.save(f\"{local_repo}/README.md\")","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:05.483575Z","iopub.execute_input":"2022-08-12T16:08:05.484066Z","iopub.status.idle":"2022-08-12T16:08:05.518708Z","shell.execute_reply.started":"2022-08-12T16:08:05.484021Z","shell.execute_reply":"2022-08-12T16:08:05.517522Z"},"trusted":true},"execution_count":54,"outputs":[]},{"cell_type":"markdown","source":"Let's push our model repository to the Hub! \nThe Hugging Face Hub requires us to authenticate, which we can do using `notebook_login`.\n","metadata":{}},{"cell_type":"code","source":"from huggingface_hub import notebook_login\nnotebook_login()","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:04:27.699235Z","iopub.execute_input":"2022-08-12T16:04:27.699722Z","iopub.status.idle":"2022-08-12T16:04:27.744734Z","shell.execute_reply.started":"2022-08-12T16:04:27.699676Z","shell.execute_reply":"2022-08-12T16:04:27.743310Z"},"trusted":true},"execution_count":27,"outputs":[{"output_type":"display_data","data":{"text/plain":"VBox(children=(HTML(value='<center> <img\\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…","application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"c262065390b9467180ad8645dedb582f"}},"metadata":{}}]},{"cell_type":"markdown","source":"We can push our model using `hub_utils.push`.","metadata":{}},{"cell_type":"code","source":"# if the repository doesn't exist remotely on the Hugging Face Hub, it will be created when we set create_remote to True\n# no token is passed explicitly: we rely on the credentials stored by notebook_login() above\nrepo_id = \"scikit-learn/tabular-playground\"\nhub_utils.push(\n repo_id=repo_id,\n source=local_repo,\n commit_message=\"pushing files to the repo from the example!\",\n 
create_remote=True,\n)\n","metadata":{"execution":{"iopub.status.busy":"2022-08-12T16:08:15.078202Z","iopub.execute_input":"2022-08-12T16:08:15.078653Z","iopub.status.idle":"2022-08-12T16:08:18.508240Z","shell.execute_reply.started":"2022-08-12T16:08:15.078614Z","shell.execute_reply":"2022-08-12T16:08:18.506828Z"},"trusted":true},"execution_count":55,"outputs":[]},{"cell_type":"markdown","source":"## After we push it, the widget is enabled as shown below:","metadata":{}},{"cell_type":"markdown","source":"![Widget](https://huggingface.co/scikit-learn/tabular-playground/resolve/main/widget_screenshot.png)","metadata":{}},{"cell_type":"markdown","source":"# See what the repository and our model card look like [here](https://huggingface.co/scikit-learn/tabular-playground) ✨","metadata":{}}]}