
Commit 5100644

Pushing the docs to dev/ for branch: master, commit a49752375d5775b1f0e6393a811c937332dccb18
1 parent a114945 commit 5100644

File tree: 1,220 files changed (+4435 −4031 lines)


dev/_downloads/428ded6a78307f7fed5b6ea3b6fde660/plot_column_transformer.ipynb

Lines changed: 110 additions & 2 deletions
@@ -15,7 +15,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n# Column Transformer with Heterogeneous Data Sources\n\n\nDatasets can often contain components of that require different feature\nextraction and processing pipelines. This scenario might occur when:\n\n1. Your dataset consists of heterogeneous data types (e.g. raster images and\n text captions)\n2. Your dataset is stored in a Pandas DataFrame and different columns\n require different processing pipelines.\n\nThis example demonstrates how to use\n:class:`sklearn.compose.ColumnTransformer` on a dataset containing\ndifferent types of features. We use the 20-newsgroups dataset and compute\nstandard bag-of-words features for the subject line and body in separate\npipelines as well as ad hoc features on the body. We combine them (with\nweights) using a ColumnTransformer and finally train a classifier on the\ncombined set of features.\n\nThe choice of features is not particularly helpful, but serves to illustrate\nthe technique.\n"
+"\n# Column Transformer with Heterogeneous Data Sources\n\n\nDatasets can often contain components that require different feature\nextraction and processing pipelines. This scenario might occur when:\n\n1. your dataset consists of heterogeneous data types (e.g. raster images and\n text captions),\n2. your dataset is stored in a :class:`pandas.DataFrame` and different columns\n require different processing pipelines.\n\nThis example demonstrates how to use\n:class:`~sklearn.compose.ColumnTransformer` on a dataset containing\ndifferent types of features. The choice of features is not particularly\nhelpful, but serves to illustrate the technique.\n"
 ]
 },
 {
@@ -26,7 +26,115 @@
 },
 "outputs": [],
 "source": [
-"# Author: Matt Terry <[email protected]>\n#\n# License: BSD 3 clause\n\nimport numpy as np\n\nfrom sklearn.base import BaseEstimator, TransformerMixin\nfrom sklearn.datasets import fetch_20newsgroups\nfrom sklearn.decomposition import TruncatedSVD\nfrom sklearn.feature_extraction import DictVectorizer\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.metrics import classification_report\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.svm import LinearSVC\n\n\nclass TextStats(TransformerMixin, BaseEstimator):\n \"\"\"Extract features from each document for DictVectorizer\"\"\"\n\n def fit(self, x, y=None):\n return self\n\n def transform(self, posts):\n return [{'length': len(text),\n 'num_sentences': text.count('.')}\n for text in posts]\n\n\nclass SubjectBodyExtractor(TransformerMixin, BaseEstimator):\n \"\"\"Extract the subject & body from a usenet post in a single pass.\n\n Takes a sequence of strings and produces a dict of sequences. Keys are\n `subject` and `body`.\n \"\"\"\n def fit(self, x, y=None):\n return self\n\n def transform(self, posts):\n # construct object dtype array with two columns\n # first column = 'subject' and second column = 'body'\n features = np.empty(shape=(len(posts), 2), dtype=object)\n for i, text in enumerate(posts):\n headers, _, bod = text.partition('\\n\\n')\n features[i, 1] = bod\n\n prefix = 'Subject:'\n sub = ''\n for line in headers.split('\\n'):\n if line.startswith(prefix):\n sub = line[len(prefix):]\n break\n features[i, 0] = sub\n\n return features\n\n\npipeline = Pipeline([\n # Extract the subject & body\n ('subjectbody', SubjectBodyExtractor()),\n\n # Use ColumnTransformer to combine the features from subject and body\n ('union', ColumnTransformer(\n [\n # Pulling features from the post's subject line (first column)\n ('subject', TfidfVectorizer(min_df=50), 0),\n\n # Pipeline for standard bag-of-words model for body (second column)\n ('body_bow', Pipeline([\n ('tfidf', TfidfVectorizer()),\n ('best', TruncatedSVD(n_components=50)),\n ]), 1),\n\n # Pipeline for pulling ad hoc features from post's body\n ('body_stats', Pipeline([\n ('stats', TextStats()), # returns a list of dicts\n ('vect', DictVectorizer()), # list of dicts -> feature matrix\n ]), 1),\n ],\n\n # weight components in ColumnTransformer\n transformer_weights={\n 'subject': 0.8,\n 'body_bow': 0.5,\n 'body_stats': 1.0,\n }\n )),\n\n # Use a SVC classifier on the combined features\n ('svc', LinearSVC(dual=False)),\n], verbose=True)\n\n# limit the list of categories to make running this example faster.\ncategories = ['alt.atheism', 'talk.religion.misc']\nX_train, y_train = fetch_20newsgroups(random_state=1,\n subset='train',\n categories=categories,\n remove=('footers', 'quotes'),\n return_X_y=True)\nX_test, y_test = fetch_20newsgroups(random_state=1,\n subset='test',\n categories=categories,\n remove=('footers', 'quotes'),\n return_X_y=True)\n\npipeline.fit(X_train, y_train)\ny_pred = pipeline.predict(X_test)\nprint(classification_report(y_test, y_pred))"
+"# Author: Matt Terry <[email protected]>\n#\n# License: BSD 3 clause\n\nimport numpy as np\n\nfrom sklearn.preprocessing import FunctionTransformer\nfrom sklearn.datasets import fetch_20newsgroups\nfrom sklearn.decomposition import TruncatedSVD\nfrom sklearn.feature_extraction import DictVectorizer\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.metrics import classification_report\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.svm import LinearSVC"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"20 newsgroups dataset\n---------------------\n\nWe will use the `20 newsgroups dataset <20newsgroups_dataset>`, which\ncomprises posts from newsgroups on 20 topics. This dataset is split\ninto train and test subsets based on messages posted before and after\na specific date. We will only use posts from 2 categories to speed up running\ntime.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"categories = ['sci.med', 'sci.space']\nX_train, y_train = fetch_20newsgroups(random_state=1,\n subset='train',\n categories=categories,\n remove=('footers', 'quotes'),\n return_X_y=True)\nX_test, y_test = fetch_20newsgroups(random_state=1,\n subset='test',\n categories=categories,\n remove=('footers', 'quotes'),\n return_X_y=True)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Each feature comprises meta information about that post, such as the subject,\nand the body of the news post.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"print(X_train[0])"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Creating transformers\n---------------------\n\nFirst, we would like a transformer that extracts the subject and\nbody of each post. Since this is a stateless transformation (does not\nrequire state information from training data), we can define a function that\nperforms the data transformation then use\n:class:`~sklearn.preprocessing.FunctionTransformer` to create a scikit-learn\ntransformer.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"def subject_body_extractor(posts):\n # construct object dtype array with two columns\n # first column = 'subject' and second column = 'body'\n features = np.empty(shape=(len(posts), 2), dtype=object)\n for i, text in enumerate(posts):\n # temporary variable `_` stores '\\n\\n'\n headers, _, body = text.partition('\\n\\n')\n # store body text in second column\n features[i, 1] = body\n\n prefix = 'Subject:'\n sub = ''\n # save text after 'Subject:' in first column\n for line in headers.split('\\n'):\n if line.startswith(prefix):\n sub = line[len(prefix):]\n break\n features[i, 0] = sub\n\n return features\n\n\nsubject_body_transformer = FunctionTransformer(subject_body_extractor)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We will also create a transformer that extracts the\nlength of the text and the number of sentences.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"def text_stats(posts):\n return [{'length': len(text),\n 'num_sentences': text.count('.')}\n for text in posts]\n\n\ntext_stats_transformer = FunctionTransformer(text_stats)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Classification pipeline\n-----------------------\n\nThe pipeline below extracts the subject and body from each post using\n``SubjectBodyExtractor``, producing a (n_samples, 2) array. This array is\nthen used to compute standard bag-of-words features for the subject and body\nas well as text length and number of sentences on the body, using\n``ColumnTransformer``. We combine them, with weights, then train a\nclassifier on the combined set of features.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"pipeline = Pipeline([\n # Extract subject & body\n ('subjectbody', subject_body_transformer),\n # Use ColumnTransformer to combine the subject and body features\n ('union', ColumnTransformer(\n [\n # bag-of-words for subject (col 0)\n ('subject', TfidfVectorizer(min_df=50), 0),\n # bag-of-words with decomposition for body (col 1)\n ('body_bow', Pipeline([\n ('tfidf', TfidfVectorizer()),\n ('best', TruncatedSVD(n_components=50)),\n ]), 1),\n # Pipeline for pulling text stats from post's body\n ('body_stats', Pipeline([\n ('stats', text_stats_transformer), # returns a list of dicts\n ('vect', DictVectorizer()), # list of dicts -> feature matrix\n ]), 1),\n ],\n # weight above ColumnTransformer features\n transformer_weights={\n 'subject': 0.8,\n 'body_bow': 0.5,\n 'body_stats': 1.0,\n }\n )),\n # Use a SVC classifier on the combined features\n ('svc', LinearSVC(dual=False)),\n], verbose=True)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Finally, we fit our pipeline on the training data and use it to predict\ntopics for ``X_test``. Performance metrics of our pipeline are then printed.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"pipeline.fit(X_train, y_train)\ny_pred = pipeline.predict(X_test)\nprint('Classification report:\\n\\n{}'.format(\n classification_report(y_test, y_pred))\n)"
 ]
 }
 ],
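
Aside from splitting the example into narrated notebook cells, the substantive change in this diff is that the custom TextStats and SubjectBodyExtractor transformer classes are replaced by plain functions wrapped in sklearn.preprocessing.FunctionTransformer. A minimal sketch of that pattern, with a hypothetical uppercase function standing in for the real extractors (the toy function and data are illustrations, not part of the commit):

import numpy as np

from sklearn.preprocessing import FunctionTransformer


# Hypothetical stateless transformation standing in for the extractors above.
def uppercase(posts):
    return np.array([text.upper() for text in posts], dtype=object)


# validate=False hands the raw object array straight to the function.
transformer = FunctionTransformer(uppercase, validate=False)

X = np.array(['first post', 'second post'], dtype=object)
# fit() is a no-op for a stateless function; transform() just applies it.
print(transformer.fit_transform(X))  # ['FIRST POST' 'SECOND POST']

This behaves like the removed classes' no-op fit plus transform, minus the BaseEstimator/TransformerMixin boilerplate; it is only appropriate because the transformations need no state learned from the training data.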

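The rewritten pipeline also relies on ColumnTransformer being able to feed the same input column (the body, column 1) to several transformers and to scale each transformer's output via transformer_weights. A small self-contained sketch of that behaviour, where the toy data and feature functions are assumptions for illustration only:

import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Toy (n_samples, 2) object array mirroring the (subject, body) layout.
X = np.array([['hi', 'first body.'],
              ['yo', 'second body. more.']], dtype=object)

# Hypothetical features: string length and sentence count, each returned
# as an (n_samples, 1) column.
length = FunctionTransformer(
    lambda col: np.array([[len(t)] for t in col]), validate=False)
sentences = FunctionTransformer(
    lambda col: np.array([[t.count('.')] for t in col]), validate=False)

# A scalar column index selects a 1-D slice; the same column may feed
# several transformers, and each output is multiplied by its weight
# before the results are stacked side by side.
ct = ColumnTransformer(
    [('body_len', length, 1),
     ('body_sentences', sentences, 1)],
    transformer_weights={'body_len': 0.5, 'body_sentences': 1.0})
print(ct.fit_transform(X))  # [[5.5 1. ]
                            #  [9.  2. ]]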