Commit 8042ad0

Pushing the docs for revision for branch: master, commit 6a2b4f7e7b46785ef9b18dcc9410a338ae916b47
1 parent c7cb566 commit 8042ad0

File tree

880 files changed: +2584 -2677 lines changed


dev/_downloads/missing_values.ipynb

Lines changed: 2 additions & 2 deletions
@@ -15,7 +15,7 @@
  },
  {
   "source": [
-   "\n# Imputing missing values before building an estimator\n\n\nThis example shows that imputing the missing values can give better results\nthan discarding the samples containing any missing value.\nImputing does not always improve the predictions, so please check via cross-validation.\nSometimes dropping rows or using marker values is more effective.\n\nIn this example, we artificially mark some of the elements in complete\ndataset as missing. Then we estimate performance using the complete dataset,\ndataset without the missing samples, after imputation without the indicator\nmatrix and imputation with the indicator matrix for the missing values.\n\nMissing values can be replaced by the mean, the median or the most frequent\nvalue using the ``strategy`` hyper-parameter.\nThe median is a more robust estimator for data with high magnitude variables\nwhich could dominate results (otherwise known as a 'long tail').\n\nScript output::\n\n Score with the complete dataset = 0.56\n Score without the samples containing missing values = 0.48\n Score after imputation of the missing values = 0.55\n Score after imputation with indicator features = 0.57\n\nIn this case, imputing helps the classifier get close to the original score.\n \n"
+   "\n# Imputing missing values before building an estimator\n\n\nThis example shows that imputing the missing values can give better results\nthan discarding the samples containing any missing value.\nImputing does not always improve the predictions, so please check via cross-validation.\nSometimes dropping rows or using marker values is more effective.\n\nMissing values can be replaced by the mean, the median or the most frequent\nvalue using the ``strategy`` hyper-parameter.\nThe median is a more robust estimator for data with high magnitude variables\nwhich could dominate results (otherwise known as a 'long tail').\n\nScript output::\n\n Score with the entire dataset = 0.56\n Score without the samples containing missing values = 0.48\n Score after imputation of the missing values = 0.55\n\nIn this case, imputing helps the classifier get close to the original score.\n \n"
  ],
  "cell_type": "markdown",
  "metadata": {}
@@ -24,7 +24,7 @@
  "execution_count": null,
  "cell_type": "code",
  "source": [
-  "import numpy as np\n\nfrom sklearn.datasets import load_boston\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import Imputer\nfrom sklearn.model_selection import cross_val_score\n\nrng = np.random.RandomState(0)\n\ndataset = load_boston()\nX_full, y_full = dataset.data, dataset.target\nn_samples = X_full.shape[0]\nn_features = X_full.shape[1]\n\n# Estimate the score on the entire dataset, with no missing values\nestimator = RandomForestRegressor(random_state=0, n_estimators=100)\nscore = cross_val_score(estimator, X_full, y_full).mean()\nprint(\"Score with the complete dataset = %.2f\" % score)\n\n# Add missing values in 75% of the lines\nmissing_rate = 0.75\nn_missing_samples = int(n_samples * missing_rate)\nmissing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,\n dtype=np.bool),\n np.ones(n_missing_samples,\n dtype=np.bool)))\nrng.shuffle(missing_samples)\nmissing_features = rng.randint(0, n_features, n_missing_samples)\n\n# Estimate the score without the lines containing missing values\nX_filtered = X_full[~missing_samples, :]\ny_filtered = y_full[~missing_samples]\nestimator = RandomForestRegressor(random_state=0, n_estimators=100)\nscore = cross_val_score(estimator, X_filtered, y_filtered).mean()\nprint(\"Score without the samples containing missing values = %.2f\" % score)\n\n# Estimate the score after imputation of the missing values\nX_missing = X_full.copy()\nX_missing[np.where(missing_samples)[0], missing_features] = 0\ny_missing = y_full.copy()\nestimator = Pipeline([(\"imputer\", Imputer(missing_values=0,\n strategy=\"mean\",\n axis=0)),\n (\"forest\", RandomForestRegressor(random_state=0,\n n_estimators=100))])\nscore = cross_val_score(estimator, X_missing, y_missing).mean()\nprint(\"Score after imputation of the missing values = %.2f\" % score)\n\n# Estimate score after imputation of the missing values with indicator matrix\nestimator = Pipeline([(\"imputer\", Imputer(missing_values=0,\n strategy=\"mean\",\n axis=0, add_indicator_features=True)),\n (\"forest\", RandomForestRegressor(random_state=0,\n n_estimators=100))])\nscore = cross_val_score(estimator, X_missing, y_missing).mean()\nprint(\"Score after imputation with indicator features = %.2f\" % score)"
+  "import numpy as np\n\nfrom sklearn.datasets import load_boston\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import Imputer\nfrom sklearn.model_selection import cross_val_score\n\nrng = np.random.RandomState(0)\n\ndataset = load_boston()\nX_full, y_full = dataset.data, dataset.target\nn_samples = X_full.shape[0]\nn_features = X_full.shape[1]\n\n# Estimate the score on the entire dataset, with no missing values\nestimator = RandomForestRegressor(random_state=0, n_estimators=100)\nscore = cross_val_score(estimator, X_full, y_full).mean()\nprint(\"Score with the entire dataset = %.2f\" % score)\n\n# Add missing values in 75% of the lines\nmissing_rate = 0.75\nn_missing_samples = np.floor(n_samples * missing_rate)\nmissing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,\n dtype=np.bool),\n np.ones(n_missing_samples,\n dtype=np.bool)))\nrng.shuffle(missing_samples)\nmissing_features = rng.randint(0, n_features, n_missing_samples)\n\n# Estimate the score without the lines containing missing values\nX_filtered = X_full[~missing_samples, :]\ny_filtered = y_full[~missing_samples]\nestimator = RandomForestRegressor(random_state=0, n_estimators=100)\nscore = cross_val_score(estimator, X_filtered, y_filtered).mean()\nprint(\"Score without the samples containing missing values = %.2f\" % score)\n\n# Estimate the score after imputation of the missing values\nX_missing = X_full.copy()\nX_missing[np.where(missing_samples)[0], missing_features] = 0\ny_missing = y_full.copy()\nestimator = Pipeline([(\"imputer\", Imputer(missing_values=0,\n strategy=\"mean\",\n axis=0)),\n (\"forest\", RandomForestRegressor(random_state=0,\n n_estimators=100))])\nscore = cross_val_score(estimator, X_missing, y_missing).mean()\nprint(\"Score after imputation of the missing values = %.2f\" % score)"
  ],
  "outputs": [],
  "metadata": {

dev/_downloads/missing_values.py

Lines changed: 3 additions & 18 deletions
@@ -8,22 +8,16 @@
 Imputing does not always improve the predictions, so please check via cross-validation.
 Sometimes dropping rows or using marker values is more effective.
 
-In this example, we artificially mark some of the elements in complete
-dataset as missing. Then we estimate performance using the complete dataset,
-dataset without the missing samples, after imputation without the indicator
-matrix and imputation with the indicator matrix for the missing values.
-
 Missing values can be replaced by the mean, the median or the most frequent
 value using the ``strategy`` hyper-parameter.
 The median is a more robust estimator for data with high magnitude variables
 which could dominate results (otherwise known as a 'long tail').
 
 Script output::
 
-    Score with the complete dataset = 0.56
+    Score with the entire dataset = 0.56
     Score without the samples containing missing values = 0.48
     Score after imputation of the missing values = 0.55
-    Score after imputation with indicator features = 0.57
 
 In this case, imputing helps the classifier get close to the original score.
 
@@ -46,11 +40,11 @@
 # Estimate the score on the entire dataset, with no missing values
 estimator = RandomForestRegressor(random_state=0, n_estimators=100)
 score = cross_val_score(estimator, X_full, y_full).mean()
-print("Score with the complete dataset = %.2f" % score)
+print("Score with the entire dataset = %.2f" % score)
 
 # Add missing values in 75% of the lines
 missing_rate = 0.75
-n_missing_samples = int(n_samples * missing_rate)
+n_missing_samples = np.floor(n_samples * missing_rate)
 missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                       dtype=np.bool),
                              np.ones(n_missing_samples,
@@ -76,12 +70,3 @@
                                                        n_estimators=100))])
 score = cross_val_score(estimator, X_missing, y_missing).mean()
 print("Score after imputation of the missing values = %.2f" % score)
-
-# Estimate score after imputation of the missing values with indicator matrix
-estimator = Pipeline([("imputer", Imputer(missing_values=0,
-                                          strategy="mean",
-                                          axis=0, add_indicator_features=True)),
-                      ("forest", RandomForestRegressor(random_state=0,
-                                                       n_estimators=100))])
-score = cross_val_score(estimator, X_missing, y_missing).mean()
-print("Score after imputation with indicator features = %.2f" % score)
