Commit 6819889

committed
Pushing the docs to dev/ for branch: main, commit 93533ddaff16b30d759936513162579ccecef502
1 parent 653b558 commit 6819889

1,272 files changed

+6085
-4463
lines changed

Binary file not shown.
Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
1+
"""
2+
===================================================
3+
Failure of Machine Learning to infer causal effects
4+
===================================================
5+
6+
Machine Learning models are great for measuring statistical associations.
7+
Unfortunately, unless we're willing to make strong assumptions about the data,
8+
those models are unable to infer causal effects.
9+
10+
To illustrate this, we will simulate a situation in which we try to answer one
11+
of the most important questions in economics of education: **what is the causal
12+
effect of earning a college degree on hourly wages?** Although the answer to
13+
this question is crucial to policy makers, `Omitted-Variable Biases
14+
<https://en.wikipedia.org/wiki/Omitted-variable_bias>`_ (OVB) prevent us from
15+
identifying that causal effect.
16+
"""
17+
18+
# %%
19+
# The dataset: simulated hourly wages
20+
# -----------------------------------
21+
#
22+
# The data generating process is laid out in the code below. Work experience in
23+
# years and a measure of ability are drawn from Normal distributions; the
24+
# hourly wage of one of the parents is drawn from a Beta distribution. We then
25+
# create an indicator of college degree which is positively impacted by ability
26+
# and parental hourly wage. Finally, we model hourly wages as a linear function
27+
# of all the previous variables and a random component. Note that all variables
28+
# have a positive effect on hourly wages.
29+
import numpy as np
30+
import pandas as pd
31+
32+
n_samples = 10_000
33+
rng = np.random.RandomState(32)
34+
35+
experiences = rng.normal(20, 10, size=n_samples).astype(int)
36+
experiences[experiences < 0] = 0
37+
abilities = rng.normal(0, 0.15, size=n_samples)
38+
parent_hourly_wages = 50 * rng.beta(2, 8, size=n_samples)
39+
parent_hourly_wages[parent_hourly_wages < 0] = 0
40+
college_degrees = (
41+
9 * abilities + 0.02 * parent_hourly_wages + rng.randn(n_samples) > 0.7
42+
).astype(int)
43+
44+
true_coef = pd.Series(
45+
{
46+
"college degree": 2.0,
47+
"ability": 5.0,
48+
"experience": 0.2,
49+
"parent hourly wage": 1.0,
50+
}
51+
)
52+
hourly_wages = (
53+
true_coef["experience"] * experiences
54+
+ true_coef["parent hourly wage"] * parent_hourly_wages
55+
+ true_coef["college degree"] * college_degrees
56+
+ true_coef["ability"] * abilities
57+
+ rng.normal(0, 1, size=n_samples)
58+
)
59+
60+
hourly_wages[hourly_wages < 0] = 0
61+
62+
# %%
63+
# Description of the simulated data
64+
# ---------------------------------
65+
#
66+
# The following plot shows the distribution of each variable, and pairwise
67+
# scatter plots. Key to our OVB story is the positive relationship between
68+
# ability and college degree.
69+
import seaborn as sns
70+
71+
df = pd.DataFrame(
72+
{
73+
"college degree": college_degrees,
74+
"ability": abilities,
75+
"hourly wage": hourly_wages,
76+
"experience": experiences,
77+
"parent hourly wage": parent_hourly_wages,
78+
}
79+
)
80+
81+
grid = sns.pairplot(df, diag_kind="kde", corner=True)
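The positive relationship between ability and college degree can also be checked numerically. The sketch below is an illustration rather than part of the original example: it re-runs the relevant part of the simulation with the same seed and draw order, then compares mean ability across the two degree groups. The names `df_check`, `ability_by_degree` and `ability_degree_corr` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Re-create the relevant draws of the simulation (same seed, same draw order).
n_samples = 10_000
rng = np.random.RandomState(32)
experiences = rng.normal(20, 10, size=n_samples).astype(int)  # drawn first, unused here
abilities = rng.normal(0, 0.15, size=n_samples)
parent_hourly_wages = 50 * rng.beta(2, 8, size=n_samples)
college_degrees = (
    9 * abilities + 0.02 * parent_hourly_wages + rng.randn(n_samples) > 0.7
).astype(int)

# Hypothetical helper frame, used only for this check.
df_check = pd.DataFrame({"college degree": college_degrees, "ability": abilities})

# Mean ability for people without (0) and with (1) a college degree.
ability_by_degree = df_check.groupby("college degree")["ability"].mean()

# Pearson correlation between the binary degree indicator and ability.
ability_degree_corr = df_check["ability"].corr(df_check["college degree"])

print(ability_by_degree)
print(f"Correlation between ability and college degree: {ability_degree_corr:.3f}")
```

A higher mean ability in the degree group, together with a positive correlation, is what makes ability a confounder once it is left out of the model.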
82+
83+
# %%
84+
# In the next section, we train predictive models and we therefore split the
85+
# target column from the features and we split the data into a training and a
86+
# testing set.
87+
from sklearn.model_selection import train_test_split
88+
89+
target_name = "hourly wage"
90+
X, y = df.drop(columns=target_name), df[target_name]
91+
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
92+
93+
# %%
94+
# Income prediction with fully observed variables
95+
# -----------------------------------------------
96+
#
97+
# First, we train a predictive model, a
98+
# :class:`~sklearn.linear_model.LinearRegression` model. In this experiment,
99+
# we assume that all variables used by the true generative model are available.
100+
from sklearn.linear_model import LinearRegression
101+
from sklearn.metrics import r2_score
102+
103+
features_names = ["experience", "parent hourly wage", "college degree", "ability"]
104+
105+
regressor_with_ability = LinearRegression()
106+
regressor_with_ability.fit(X_train[features_names], y_train)
107+
y_pred_with_ability = regressor_with_ability.predict(X_test[features_names])
108+
R2_with_ability = r2_score(y_test, y_pred_with_ability)
109+
110+
print(f"R2 score with ability: {R2_with_ability:.3f}")
111+
112+
# %%
113+
# This model predicts the hourly wages well, as shown by the high R2 score. We
114+
# plot the model coefficients to show that we closely recover the values of
115+
# the true generative model.
116+
import matplotlib.pyplot as plt
117+
118+
model_coef = pd.Series(regressor_with_ability.coef_, index=features_names)
119+
coef = pd.concat(
120+
[true_coef[features_names], model_coef],
121+
keys=["Coefficients of true generative model", "Model coefficients"],
122+
axis=1,
123+
)
124+
ax = coef.plot.barh()
125+
ax.set_xlabel("Coefficient values")
126+
ax.set_title("Coefficients of the linear regression including the ability feature")
127+
plt.tight_layout()
128+
plt.show()
129+
130+
# %%
131+
# Income prediction with partial observations
132+
# -------------------------------------------
133+
#
134+
# In practice, intellectual abilities are not observed or are only estimated
135+
# from proxies that inadvertently measure education as well (e.g. by IQ tests).
136+
# But omitting the "ability" feature from a linear model inflates the estimate
137+
# of the college degree coefficient via a positive OVB.
138+
features_names = ["experience", "parent hourly wage", "college degree"]
139+
140+
regressor_without_ability = LinearRegression()
141+
regressor_without_ability.fit(X_train[features_names], y_train)
142+
y_pred_without_ability = regressor_without_ability.predict(X_test[features_names])
143+
R2_without_ability = r2_score(y_test, y_pred_without_ability)
144+
145+
print(f"R2 score without ability: {R2_without_ability:.3f}")
146+
147+
# %%
148+
# The predictive power of our model is similar when we omit the ability feature
149+
# in terms of R2 score. We now check whether the coefficients of the model are
150+
# different from the true generative model.
151+
152+
model_coef = pd.Series(regressor_without_ability.coef_, index=features_names)
153+
coef = pd.concat(
154+
[true_coef[features_names], model_coef],
155+
keys=["Coefficients of true generative model", "Model coefficients"],
156+
axis=1,
157+
)
158+
ax = coef.plot.barh()
159+
ax.set_xlabel("Coefficient values")
160+
_ = ax.set_title("Coefficients of the linear regression excluding the ability feature")
161+
162+
# %%
163+
# To compensate for the omitted variable, the model inflates the coefficient of
164+
# the college degree feature. Therefore, interpreting this coefficient value
165+
# as a causal effect of the true generative model is incorrect.
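The size of the inflation can be accounted for with the textbook omitted-variable-bias decomposition: the coefficient from the regression without ability equals the coefficient from the full regression plus the ability coefficient times the slope of an auxiliary regression of ability on the observed features. The following is a minimal sketch of that identity on a fresh simulation that mimics the data generating process above; the seed, the variable names and the omission of wage clipping are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fresh simulation mimicking the data generating process of this example
# (assumed seed; wages are not clipped at zero here).
rng = np.random.RandomState(0)
n = 10_000
ability = rng.normal(0, 0.15, size=n)
parent_wage = 50 * rng.beta(2, 8, size=n)
experience = np.clip(rng.normal(20, 10, size=n), 0, None)
degree = (9 * ability + 0.02 * parent_wage + rng.randn(n) > 0.7).astype(float)
wage = (
    0.2 * experience
    + 1.0 * parent_wage
    + 2.0 * degree
    + 5.0 * ability
    + rng.normal(0, 1, size=n)
)

observed = np.column_stack([degree, experience, parent_wage])

# "Long" regression: includes ability. "Short" regression: omits it.
long_model = LinearRegression().fit(np.column_stack([observed, ability]), wage)
short_model = LinearRegression().fit(observed, wage)

# Auxiliary regression of the omitted variable on the observed features.
aux_model = LinearRegression().fit(observed, ability)

# OVB on the degree coefficient = (ability coefficient) x (auxiliary slope on degree).
ovb_on_degree = long_model.coef_[-1] * aux_model.coef_[0]

print(f"Degree coefficient with ability:    {long_model.coef_[0]:.3f}")
print(f"Degree coefficient without ability: {short_model.coef_[0]:.3f}")
print(f"Bias predicted by the decomposition: {ovb_on_degree:.3f}")
```

For ordinary least squares fitted on the same sample, this decomposition holds exactly, so the gap between the two degree coefficients matches the predicted bias up to floating point error.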
166+
#
167+
# Lessons learned
168+
# ---------------
169+
#
170+
# Machine learning models are not designed for the estimation of causal
171+
# effects. While we showed this with a linear model, OVB can affect any type of
172+
# model.
173+
#
174+
# Whenever interpreting a coefficient or a change in predictions brought about
175+
# by a change in one of the features, it is important to keep in mind
176+
# potentially unobserved variables that could be correlated with both the
177+
# feature in question and the target variable. Such variables are called
178+
# `Confounding Variables <https://en.wikipedia.org/wiki/Confounding>`_. In
179+
# order to still estimate causal effects in the presence of confounding,
180+
# researchers usually conduct experiments in which the treatment variable (e.g.
181+
# college degree) is randomized. When an experiment is prohibitively expensive
182+
# or unethical, researchers can sometimes use other causal inference techniques
183+
# such as `Instrumental Variables
184+
# <https://en.wikipedia.org/wiki/Instrumental_variables_estimation>`_ (IV)
185+
# estimations.
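As a rough illustration of the instrumental variable idea, the sketch below adds a hypothetical binary instrument to a simulation of the same kind: a variable that shifts the probability of earning a degree but has no direct effect on wages and is independent of ability. The instrument, its strength (1.5), the seed and the sample size are all assumptions, and the manual two-stage least squares shown here only yields a point estimate (standard errors from the second stage would not be valid).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulation mimicking this example, plus a HYPOTHETICAL instrument.
rng = np.random.RandomState(0)
n = 50_000
ability = rng.normal(0, 0.15, size=n)
parent_wage = 50 * rng.beta(2, 8, size=n)
experience = np.clip(rng.normal(20, 10, size=n), 0, None)

# Assumed instrument: affects degree attainment, not wages, independent of ability.
instrument = rng.binomial(1, 0.5, size=n)
degree = (
    9 * ability + 0.02 * parent_wage + 1.5 * instrument + rng.randn(n) > 0.7
).astype(float)
wage = (
    0.2 * experience
    + 1.0 * parent_wage
    + 2.0 * degree
    + 5.0 * ability
    + rng.normal(0, 1, size=n)
)

controls = np.column_stack([experience, parent_wage])

# Naive regression without ability: biased by the omitted variable.
naive = LinearRegression().fit(np.column_stack([degree, controls]), wage)
naive_effect = naive.coef_[0]

# Stage 1: predict the treatment from the instrument and the controls.
stage1_features = np.column_stack([instrument, controls])
degree_hat = LinearRegression().fit(stage1_features, degree).predict(stage1_features)

# Stage 2: regress the outcome on the predicted treatment and the controls.
second_stage = LinearRegression().fit(np.column_stack([degree_hat, controls]), wage)
iv_effect = second_stage.coef_[0]

print(f"Naive estimate: {naive_effect:.2f}")
print(f"2SLS estimate:  {iv_effect:.2f} (true effect: 2.0)")
```

The estimate is only as good as the instrument: if the assumed variable affected wages directly or were related to ability, the two-stage estimate would be biased as well.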

dev/_downloads/521b554adefca348463adbbe047d7e99/plot_linear_model_coefficient_interpretation.py

Lines changed: 41 additions & 0 deletions
@@ -704,6 +704,43 @@
704704
# We observe that the AGE and EXPERIENCE coefficients are varying a lot
705705
# depending of the fold.
706706
#
707+
# Wrong causal interpretation
708+
# ---------------------------
709+
#
710+
# Policy makers might want to know the effect of education on wage to assess
711+
# whether or not a certain policy designed to entice people to pursue more
712+
# education would make economic sense. While Machine Learning models are great
713+
# for measuring statistical associations, they are generally unable to infer
714+
# causal effects.
715+
#
716+
# It might be tempting to look at the coefficient of education on wage from our
717+
# last model (or any model for that matter) and conclude that it captures the
718+
# true effect of a change in the standardized education variable on wages.
719+
#
720+
# Unfortunately there are likely unobserved confounding variables that either
721+
# inflate or deflate that coefficient. A confounding variable is a variable that
722+
# causes both EDUCATION and WAGE. One example of such a variable is ability.
723+
# Presumably, more able people are more likely to pursue education while at the
724+
# same time being more likely to earn a higher hourly wage at any level of
725+
# education. In this case, ability induces a positive `Omitted Variable Bias
726+
# <https://en.wikipedia.org/wiki/Omitted-variable_bias>`_ (OVB) on the EDUCATION
727+
# coefficient, thereby exaggerating the effect of education on wages.
728+
#
729+
# See the :ref:`sphx_glr_auto_examples_inspection_plot_causal_interpretation.py`
730+
# for a simulated case of ability OVB.
731+
#
732+
# Warning: data and model quality
733+
# -------------------------------
734+
#
735+
# Keep in mind that the outcome `y` and features `X` are the product
736+
# of a data generating process that is hidden from us. Machine
737+
# learning models are trained to approximate the unobserved
738+
# mathematical function that links `X` to `y` from sample data. As a
739+
# result, any interpretation made about a model may not necessarily
740+
# generalize to the true data generating process. This is especially
741+
# true when the model is of bad quality or when the sample data is
742+
# not representative of the population.
743+
#
707744
# Lessons learned
708745
# ---------------
709746
#
@@ -719,3 +756,7 @@
719756
# coefficients could significantly vary from one another.
720757
# * Inspecting coefficients across the folds of a cross-validation loop
721758
# gives an idea of their stability.
759+
# * Coefficients are unlikely to have any causal meaning. They tend
760+
# to be biased by unobserved confounders.
761+
# * Inspection tools may not necessarily provide insights on the true
762+
# data generating process.
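The lesson about inspecting coefficients across cross-validation folds can be sketched as follows. This is a self-contained illustration on synthetic data, not the wage dataset used in the example: the three columns are hypothetical stand-ins for AGE, EXPERIENCE and EDUCATION, with the first two made nearly collinear on purpose.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_validate

# Synthetic stand-ins (assumptions): age and experience are nearly collinear.
rng = np.random.RandomState(0)
n = 200
age = rng.normal(40, 10, size=n)
experience = age - 18 + rng.normal(0, 0.5, size=n)
education = rng.normal(13, 2, size=n)
X = np.column_stack([age, experience, education])
y = 0.1 * experience + 0.5 * education + rng.normal(0, 1, size=n)

# Fit one model per fold and keep the fitted estimators.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
cv_results = cross_validate(LinearRegression(), X, y, cv=cv, return_estimator=True)

# One row of coefficients per fitted model: shape (n_fits, n_features).
coefs = np.array([est.coef_ for est in cv_results["estimator"]])
coef_std = coefs.std(axis=0)

for name, std in zip(["age", "experience", "education"], coef_std):
    print(f"{name:>10}: std of coefficient across folds = {std:.3f}")
```

A larger spread for the collinear pair than for the third feature signals instability. A stable coefficient is still not a causal one, since an unobserved confounder can bias it consistently in every fold.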
Binary file not shown.

dev/_downloads/cf0f90f46eb559facf7f63f124f61e04/plot_linear_model_coefficient_interpretation.ipynb

Lines changed: 1 addition & 1 deletion
@@ -693,7 +693,7 @@
693693
"cell_type": "markdown",
694694
"metadata": {},
695695
"source": [
696-
"We observe that the AGE and EXPERIENCE coefficients are varying a lot\ndepending of the fold.\n\n## Lessons learned\n\n* Coefficients must be scaled to the same unit of measure to retrieve\n feature importance. Scaling them with the standard-deviation of the\n feature is a useful proxy.\n* Coefficients in multivariate linear models represent the dependency\n between a given feature and the target, **conditional** on the other\n features.\n* Correlated features induce instabilities in the coefficients of linear\n models and their effects cannot be well teased apart.\n* Different linear models respond differently to feature correlation and\n coefficients could significantly vary from one another.\n* Inspecting coefficients across the folds of a cross-validation loop\n gives an idea of their stability.\n\n"
696+
"We observe that the AGE and EXPERIENCE coefficients are varying a lot\ndepending of the fold.\n\n## Wrong causal interpretation\n\nPolicy makers might want to know the effect of education on wage to assess\nwhether or not a certain policy designed to entice people to pursue more\neducation would make economic sense. While Machine Learning models are great\nfor measuring statistical associations, they are generally unable to infer\ncausal effects.\n\nIt might be tempting to look at the coefficient of education on wage from our\nlast model (or any model for that matter) and conclude that it captures the\ntrue effect of a change in the standardized education variable on wages.\n\nUnfortunately there are likely unobserved confounding variables that either\ninflate or deflate that coefficient. A confounding variable is a variable that\ncauses both EDUCATION and WAGE. One example of such variable is ability.\nPresumably, more able people are more likely to pursue education while at the\nsame time being more likely to earn a higher hourly wage at any level of\neducation. In this case, ability induces a positive [Omitted Variable Bias](https://en.wikipedia.org/wiki/Omitted-variable_bias) (OVB) on the EDUCATION\ncoefficient, thereby exaggerating the effect of education on wages.\n\nSee the `sphx_glr_auto_examples_inspection_plot_causal_interpretation.py`\nfor a simulated case of ability OVB.\n\n## Warning: data and model quality\n\nKeep in mind that the outcome `y` and features `X` are the product\nof a data generating process that is hidden from us. Machine\nlearning models are trained to approximate the unobserved\nmathematical function that links `X` to `y` from sample data. As a\nresult, any interpretation made about a model may not necessarily\ngeneralize to the true data generating process. 
This is especially\ntrue when the model is of bad quality or when the sample data is\nnot representative of the population.\n\n## Lessons learned\n\n* Coefficients must be scaled to the same unit of measure to retrieve\n feature importance. Scaling them with the standard-deviation of the\n feature is a useful proxy.\n* Coefficients in multivariate linear models represent the dependency\n between a given feature and the target, **conditional** on the other\n features.\n* Correlated features induce instabilities in the coefficients of linear\n models and their effects cannot be well teased apart.\n* Different linear models respond differently to feature correlation and\n coefficients could significantly vary from one another.\n* Inspecting coefficients across the folds of a cross-validation loop\n gives an idea of their stability.\n* Coefficients are unlikely to have any causal meaning. They tend\n to be biased by unobserved confounders.\n* Inspection tools may not necessarily provide insights on the true\n data generating process.\n\n"
697697
]
698698
}
699699
],
