Commit e191e4c

Pushing the docs for revision for branch: master, commit d8379986cd594773a94a2ea9e2d4a5fa77d4843f
1 parent 9250145 commit e191e4c

901 files changed: +3086 / -2991 lines


dev/_downloads/missing_values.py

Lines changed: 18 additions & 3 deletions
@@ -8,16 +8,22 @@
 Imputing does not always improve the predictions, so please check via cross-validation.
 Sometimes dropping rows or using marker values is more effective.
 
+In this example, we artificially mark some elements of the complete dataset
+as missing. We then estimate performance on the complete dataset, on the
+dataset without the missing samples, after imputation without the indicator
+matrix, and after imputation with the indicator matrix for the missing values.
+
 Missing values can be replaced by the mean, the median or the most frequent
 value using the ``strategy`` hyper-parameter.
 The median is a more robust estimator for data with high magnitude variables
 which could dominate results (otherwise known as a 'long tail').
 
 Script output::
 
-  Score with the entire dataset = 0.56
+  Score with the complete dataset = 0.56
   Score without the samples containing missing values = 0.48
   Score after imputation of the missing values = 0.55
+  Score after imputation with indicator features = 0.57
 
 In this case, imputing helps the classifier get close to the original score.
@@ -40,11 +46,11 @@
 # Estimate the score on the entire dataset, with no missing values
 estimator = RandomForestRegressor(random_state=0, n_estimators=100)
 score = cross_val_score(estimator, X_full, y_full).mean()
-print("Score with the entire dataset = %.2f" % score)
+print("Score with the complete dataset = %.2f" % score)
 
 # Add missing values in 75% of the lines
 missing_rate = 0.75
-n_missing_samples = np.floor(n_samples * missing_rate)
+n_missing_samples = int(n_samples * missing_rate)
 missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                       dtype=np.bool),
                              np.ones(n_missing_samples,
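The `np.floor` to `int` change in the hunk above matters because `np.floor` returns a NumPy float, not an integer, and a float cannot be used as an array length. A minimal NumPy-only illustration (note that `np.bool` from the diff is deprecated in modern NumPy, so plain `bool` is used here):

```python
import numpy as np

n_samples = 10
missing_rate = 0.75

# np.floor returns a float (numpy.float64), not an integer
n_float = np.floor(n_samples * missing_rate)
assert isinstance(n_float, np.floating)

# Modern NumPy rejects float sizes, so the explicit int cast is required
n_missing_samples = int(n_samples * missing_rate)
mask = np.hstack((np.zeros(n_samples - n_missing_samples, dtype=bool),
                  np.ones(n_missing_samples, dtype=bool)))
print(mask.sum())  # number of rows marked as missing
```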
@@ -70,3 +76,12 @@
                                n_estimators=100))])
 score = cross_val_score(estimator, X_missing, y_missing).mean()
 print("Score after imputation of the missing values = %.2f" % score)
+
+# Estimate score after imputation of the missing values with indicator matrix
+estimator = Pipeline([("imputer", Imputer(missing_values=0,
+                                          strategy="mean",
+                                          axis=0, add_indicator_features=True)),
+                      ("forest", RandomForestRegressor(random_state=0,
+                                                       n_estimators=100))])
+score = cross_val_score(estimator, X_missing, y_missing).mean()
+print("Score after imputation with indicator features = %.2f" % score)
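The `add_indicator_features` option in the hunk above appends a binary mask to the imputed data so the downstream model can tell which values were filled in (in modern scikit-learn this corresponds to `SimpleImputer(add_indicator=True)`). A NumPy-only sketch of the idea, assuming, as in the diff, that 0 encodes a missing value; the toy matrix `X` is illustrative and not from the original example:

```python
import numpy as np

# Toy matrix where 0 encodes "missing", as in the example above
X = np.array([[1.0, 0.0],
              [3.0, 4.0],
              [0.0, 6.0]])

missing = (X == 0)  # indicator matrix of missing entries

# Mean-impute each column from its observed (non-missing) values
X_imputed = X.copy()
for j in range(X.shape[1]):
    col_mean = X[~missing[:, j], j].mean()
    X_imputed[missing[:, j], j] = col_mean

# Append the indicator columns so the model can see which values were imputed
X_with_indicator = np.hstack((X_imputed, missing.astype(float)))
print(X_with_indicator.shape)
```

With the indicator columns attached, the regressor can learn a different response for imputed entries, which is why the indicator variant scores slightly higher (0.57 vs 0.55) in the script output above.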