|
15 | 15 | "cell_type": "markdown",
|
16 | 16 | "metadata": {},
|
17 | 17 | "source": [
|
18 |
| - "\n# Pipeline Anova SVM\n\nSimple usage of Pipeline that runs successively a univariate\nfeature selection with anova and then a SVM of the selected features.\n\nUsing a sub-pipeline, the fitted coefficients can be mapped back into\nthe original feature space.\n" |
| 18 | + "\n# Pipeline ANOVA SVM\n\nThis example shows how feature selection can be easily integrated within\na machine learning pipeline.\n\nWe also show that you can easily inspect part of the pipeline.\n" |
19 | 19 | ]
|
20 | 20 | },
|
21 | 21 | {
|
|
26 | 26 | },
|
27 | 27 | "outputs": [],
|
28 | 28 | "source": [
|
29 |
| - "from sklearn import svm\nfrom sklearn.datasets import make_classification\nfrom sklearn.feature_selection import SelectKBest, f_classif\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import classification_report\n\nprint(__doc__)\n\n# import some data to play with\nX, y = make_classification(\n n_features=20, n_informative=3, n_redundant=0, n_classes=4,\n n_clusters_per_class=2)\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n\n# ANOVA SVM-C\n# 1) anova filter, take 3 best ranked features\nanova_filter = SelectKBest(f_classif, k=3)\n# 2) svm\nclf = svm.LinearSVC()\n\nanova_svm = make_pipeline(anova_filter, clf)\nanova_svm.fit(X_train, y_train)\ny_pred = anova_svm.predict(X_test)\nprint(classification_report(y_test, y_pred))\n\ncoef = anova_svm[:-1].inverse_transform(anova_svm['linearsvc'].coef_)\nprint(coef)" |
| 29 | + "print(__doc__)\nfrom sklearn import set_config\nset_config(display='diagram')" |
| 30 | + ] |
| 31 | + }, |
| 32 | + { |
| 33 | + "cell_type": "markdown", |
| 34 | + "metadata": {}, |
| 35 | + "source": [ |
| 36 | + "We will start by generating a binary classification dataset. Subsequently, we\nwill divide the dataset into two subsets.\n\n" |
| 37 | + ] |
| 38 | + }, |
| 39 | + { |
| 40 | + "cell_type": "code", |
| 41 | + "execution_count": null, |
| 42 | + "metadata": { |
| 43 | + "collapsed": false |
| 44 | + }, |
| 45 | + "outputs": [], |
| 46 | + "source": [ |
| 47 | + "from sklearn.datasets import make_classification\nfrom sklearn.model_selection import train_test_split\n\nX, y = make_classification(\n n_features=20, n_informative=3, n_redundant=0, n_classes=2,\n n_clusters_per_class=2, random_state=42)\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)" |
| 48 | + ] |
| 49 | + }, |
| 50 | + { |
| 51 | + "cell_type": "markdown", |
| 52 | + "metadata": {}, |
| 53 | + "source": [ |
| 54 | + "A common mistake with feature selection is to search for a subset of\ndiscriminative features on the full dataset instead of using only the\ntraining set. Using a scikit-learn :class:`~sklearn.pipeline.Pipeline`\nprevents this mistake.\n\nHere, we will demonstrate how to build a pipeline where the first step will\nbe the feature selection.\n\nWhen calling `fit` on the training data, a subset of features will be selected\nand the indices of these selected features will be stored. The feature selector\nwill subsequently reduce the number of features and pass this subset to the\nclassifier, which will be trained.\n\n" |
| 55 | + ] |
| 56 | + }, |
| 57 | + { |
| 58 | + "cell_type": "code", |
| 59 | + "execution_count": null, |
| 60 | + "metadata": { |
| 61 | + "collapsed": false |
| 62 | + }, |
| 63 | + "outputs": [], |
| 64 | + "source": [ |
| 65 | + "from sklearn.feature_selection import SelectKBest, f_classif\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.svm import LinearSVC\n\nanova_filter = SelectKBest(f_classif, k=3)\nclf = LinearSVC()\nanova_svm = make_pipeline(anova_filter, clf)\nanova_svm.fit(X_train, y_train)" |
| 66 | + ] |
| 67 | + }, |
| 68 | + { |
| 69 | + "cell_type": "markdown", |
| 70 | + "metadata": {}, |
| 71 | + "source": [ |
| 72 | + "Once training is complete, we can predict on new unseen samples. In this\ncase, the feature selector will only select the most discriminative features\nbased on the information stored during training. Then, the data will be\npassed to the classifier, which will make the prediction.\n\nHere, we report the final metrics via a classification report.\n\n" |
| 73 | + ] |
| 74 | + }, |
| 75 | + { |
| 76 | + "cell_type": "code", |
| 77 | + "execution_count": null, |
| 78 | + "metadata": { |
| 79 | + "collapsed": false |
| 80 | + }, |
| 81 | + "outputs": [], |
| 82 | + "source": [ |
| 83 | + "from sklearn.metrics import classification_report\n\ny_pred = anova_svm.predict(X_test)\nprint(classification_report(y_test, y_pred))" |
| 84 | + ] |
| 85 | + }, |
| 86 | + { |
| 87 | + "cell_type": "markdown", |
| 88 | + "metadata": {}, |
| 89 | + "source": [ |
| 90 | + "Be aware that you can inspect a step in the pipeline. For instance, we might\nbe interested in the parameters of the classifier. Since we selected\nthree features, we expect to have three coefficients.\n\n" |
| 91 | + ] |
| 92 | + }, |
| 93 | + { |
| 94 | + "cell_type": "code", |
| 95 | + "execution_count": null, |
| 96 | + "metadata": { |
| 97 | + "collapsed": false |
| 98 | + }, |
| 99 | + "outputs": [], |
| 100 | + "source": [ |
| 101 | + "anova_svm[-1].coef_" |
| 102 | + ] |
| 103 | + }, |
| 104 | + { |
| 105 | + "cell_type": "markdown", |
| 106 | + "metadata": {}, |
| 107 | + "source": [ |
| 108 | + "However, we do not know which features were selected from the original\ndataset. We could proceed in several ways. Here, we will invert the\ntransformation of these coefficients to get information about the original\nspace.\n\n" |
| 109 | + ] |
| 110 | + }, |
| 111 | + { |
| 112 | + "cell_type": "code", |
| 113 | + "execution_count": null, |
| 114 | + "metadata": { |
| 115 | + "collapsed": false |
| 116 | + }, |
| 117 | + "outputs": [], |
| 118 | + "source": [ |
| 119 | + "anova_svm[:-1].inverse_transform(anova_svm[-1].coef_)" |
| 120 | + ] |
| 121 | + }, |
| 122 | + { |
| 123 | + "cell_type": "markdown", |
| 124 | + "metadata": {}, |
| 125 | + "source": [ |
| 126 | + "We can see that the features with non-zero coefficients are the features\nselected by the first step.\n\n" |
30 | 127 | ]
|
31 | 128 | }
|
32 | 129 | ],
|
|
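The new markdown cells mention that there are several ways to find out which original features were selected. As a minimal sketch of one alternative (not part of this commit), the selector's indices can be read directly with `get_support(indices=True)` and compared against the non-zero columns of the inverse-transformed coefficients. The helper name `selected_feature_indices` is hypothetical and only used for illustration; the data and pipeline mirror the cells in the diff.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def selected_feature_indices(pipeline):
    """Hypothetical helper: indices of the columns kept by the first step."""
    return pipeline[0].get_support(indices=True)


# Same synthetic data and pipeline as in the notebook cells of the diff.
X, y = make_classification(
    n_features=20, n_informative=3, n_redundant=0, n_classes=2,
    n_clusters_per_class=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

anova_svm = make_pipeline(SelectKBest(f_classif, k=3), LinearSVC())
anova_svm.fit(X_train, y_train)

selected = selected_feature_indices(anova_svm)

# Map the three coefficients back to the 20-feature space; columns dropped
# by the selector are filled with zeros.
coef_full = anova_svm[:-1].inverse_transform(anova_svm[-1].coef_)

print("selected feature indices:", selected)
print("non-zero coefficient indices:", np.flatnonzero(coef_full[0]))
```

Both printed index lists should agree, which gives a quick check of where the selected features sit in the original dataset without reading the full coefficient array by eye.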