
Commit 764e146

Pushing the docs to dev/ for branch: main, commit bb585438b3123574333c174fc88f9ba81385af19
1 parent fdebedf commit 764e146


1,311 files changed: +6039 / -6022 lines


dev/.buildinfo

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 5740a408078b668f5c1406158801993d
+config: 61f1dcc056c8a9a5e21505ca3e041e92
 tags: 645f666f9bcd5a90fca523b33c5a78b7

dev/_downloads/1ed4d16a866c9fe4d86a05477e6d0664/plot_svm_scale_c.ipynb

Lines changed: 10 additions & 32 deletions
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n# Scaling the regularization parameter for SVCs\n\nThe following example illustrates the effect of scaling the\nregularization parameter when using `svm` for\n`classification <svm_classification>`.\nFor SVC classification, we are interested in a risk minimization for the\nequation:\n\n\n\\begin{align}C \\sum_{i=1, n} \\mathcal{L} (f(x_i), y_i) + \\Omega (w)\\end{align}\n\nwhere\n\n - $C$ is used to set the amount of regularization\n - $\\mathcal{L}$ is a `loss` function of our samples\n and our model parameters.\n - $\\Omega$ is a `penalty` function of our model parameters\n\nIf we consider the loss function to be the individual error per\nsample, then the data-fit term, or the sum of the error for each sample, will\nincrease as we add more samples. The penalization term, however, will not\nincrease.\n\nWhen using, for example, `cross validation <cross_validation>`, to\nset the amount of regularization with `C`, there will be a\ndifferent amount of samples between the main problem and the smaller problems\nwithin the folds of the cross validation.\n\nSince our loss function is dependent on the amount of samples, the latter\nwill influence the selected value of `C`.\nThe question that arises is \"How do we optimally adjust C to\naccount for the different amount of training samples?\"\n\nIn the remainder of this example, we will investigate the effect of scaling\nthe value of the regularization parameter `C` in regards to the number of\nsamples for both L1 and L2 penalty. We will generate some synthetic datasets\nthat are appropriate for each type of regularization.\n"
+"\n# Scaling the regularization parameter for SVCs\n\nThe following example illustrates the effect of scaling the regularization\nparameter when using `svm` for `classification <svm_classification>`.\nFor SVC classification, we are interested in a risk minimization for the\nequation:\n\n\n\\begin{align}C \\sum_{i=1, n} \\mathcal{L} (f(x_i), y_i) + \\Omega (w)\\end{align}\n\nwhere\n\n - $C$ is used to set the amount of regularization\n - $\\mathcal{L}$ is a `loss` function of our samples\n and our model parameters.\n - $\\Omega$ is a `penalty` function of our model parameters\n\nIf we consider the loss function to be the individual error per sample, then the\ndata-fit term, or the sum of the error for each sample, increases as we add more\nsamples. The penalization term, however, does not increase.\n\nWhen using, for example, `cross validation <cross_validation>`, to set the\namount of regularization with `C`, there would be a different amount of samples\nbetween the main problem and the smaller problems within the folds of the cross\nvalidation.\n\nSince the loss function depends on the amount of samples, the latter\ninfluences the selected value of `C`. The question that arises is \"How do we\noptimally adjust C to account for the different amount of training samples?\"\n"
 ]
 },
 {
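Aside, not part of the committed notebook: the rewritten introduction above argues that the data-fit term `C * sum_i L(f(x_i), y_i)` grows with the number of samples while the penalty term does not. A minimal sketch that makes this concrete, under assumed values (a fixed hypothetical weight vector, a squared hinge loss, and an L2 penalty):

```python
import numpy as np

rng = np.random.RandomState(0)
w = np.array([1.0, -2.0, 0.5])  # fixed, hypothetical weight vector


def objective_terms(X, y, w, C):
    # data-fit term: C times the sum of squared hinge losses over all samples
    margins = 1 - y * (X @ w)
    data_fit = C * np.sum(np.maximum(margins, 0) ** 2)
    # penalty term: here an L2 penalty Omega(w) = 0.5 * ||w||^2
    penalty = 0.5 * np.dot(w, w)
    return data_fit, penalty


for n in (30, 300, 3000):
    X = rng.randn(n, 3)
    y = np.sign(rng.randn(n))
    data_fit, penalty = objective_terms(X, y, w, C=1.0)
    print(f"n={n:5d}  data-fit={data_fit:12.1f}  penalty={penalty:.2f}")
```

With `C` fixed, the printed data-fit term scales roughly linearly with `n` while the penalty stays constant, which is why the example asks how `C` should be adjusted as the training-set size changes.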
@@ -22,7 +22,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## L1-penalty case\nIn the L1 case, theory says that prediction consistency (i.e. that under\ngiven hypothesis, the estimator learned predicts as well as a model knowing\nthe true distribution) is not possible because of the bias of the L1. It\ndoes say, however, that model consistency, in terms of finding the right set\nof non-zero parameters as well as their signs, can be achieved by scaling\n`C`.\n\nWe will demonstrate this effect by using a synthetic dataset. This\ndataset will be sparse, meaning that only a few features will be informative\nand useful for the model.\n\n"
+"## Data generation\n\nIn this example we investigate the effect of reparametrizing the regularization\nparameter `C` to account for the number of samples when using either L1 or L2\npenalty. For such purpose we create a synthetic dataset with a large number of\nfeatures, out of which only a few are informative. We therefore expect the\nregularization to shrink the coefficients towards zero (L2 penalty) or exactly\nzero (L1 penalty).\n\n"
 ]
 },
 {
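Aside, not part of the committed notebook: the data-generation cell itself is outside this hunk. A sketch of the kind of dataset described above (many features, only a few informative) could look like this using `make_classification`; the sizes below are placeholder assumptions, not values taken from the diff:

```python
# Hypothetical sketch of the sparse-signal dataset described above; the real
# notebook cell is not shown in this diff, so these sizes are placeholders.
from sklearn.datasets import make_classification

n_samples, n_features = 100, 300  # assumed values, not taken from the diff
X, y = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=5,  # only a few informative features
    n_redundant=0,
    random_state=1,
)
```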
@@ -40,7 +40,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now, we can define a linear SVC with the `l1` penalty.\n\n"
+"## L1-penalty case\nIn the L1 case, theory says that provided a strong regularization, the\nestimator cannot predict as well as a model knowing the true distribution\n(even in the limit where the sample size grows to infinity) as it may set some\nweights of otherwise predictive features to zero, which induces a bias. It does\nsay, however, that it is possible to find the right set of non-zero parameters\nas well as their signs by tuning `C`.\n\nWe define a linear SVC with the L1 penalty.\n\n"
 ]
 },
 {
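Aside, not part of the committed notebook: the cell that actually defines `model_l1` is outside this hunk. A minimal sketch of a linear SVC with the L1 penalty; scikit-learn's `LinearSVC` requires `dual=False` together with `penalty="l1"` and the squared hinge loss, and the `tol` value here is an assumption:

```python
# Sketch of the L1-penalized linear SVC described above; the exact settings
# used by the notebook (e.g. tol) are assumptions.
from sklearn.svm import LinearSVC

model_l1 = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, tol=1e-3)
```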
@@ -58,7 +58,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We will compute the mean test score for different values of `C`.\n\n"
+"We compute the mean test score for different values of `C` via\ncross-validation.\n\n"
 ]
 },
 {
@@ -69,7 +69,7 @@
 },
 "outputs": [],
 "source": [
-"import numpy as np\nimport pandas as pd\n\nfrom sklearn.model_selection import ShuffleSplit, validation_curve\n\nCs = np.logspace(-2.3, -1.3, 10)\ntrain_sizes = np.linspace(0.3, 0.7, 3)\nlabels = [f\"fraction: {train_size}\" for train_size in train_sizes]\n\nresults = {\"C\": Cs}\nfor label, train_size in zip(labels, train_sizes):\n cv = ShuffleSplit(train_size=train_size, test_size=0.3, n_splits=50, random_state=1)\n train_scores, test_scores = validation_curve(\n model_l1, X, y, param_name=\"C\", param_range=Cs, cv=cv\n )\n results[label] = test_scores.mean(axis=1)\nresults = pd.DataFrame(results)"
+"import numpy as np\nimport pandas as pd\n\nfrom sklearn.model_selection import ShuffleSplit, validation_curve\n\nCs = np.logspace(-2.3, -1.3, 10)\ntrain_sizes = np.linspace(0.3, 0.7, 3)\nlabels = [f\"fraction: {train_size}\" for train_size in train_sizes]\nshuffle_params = {\n \"test_size\": 0.3,\n \"n_splits\": 150,\n \"random_state\": 1,\n}\n\nresults = {\"C\": Cs}\nfor label, train_size in zip(labels, train_sizes):\n cv = ShuffleSplit(train_size=train_size, **shuffle_params)\n train_scores, test_scores = validation_curve(\n model_l1,\n X,\n y,\n param_name=\"C\",\n param_range=Cs,\n cv=cv,\n n_jobs=2,\n )\n results[label] = test_scores.mean(axis=1)\nresults = pd.DataFrame(results)"
 ]
 },
 {
@@ -80,14 +80,14 @@
 },
 "outputs": [],
 "source": [
-"import matplotlib.pyplot as plt\n\nfig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(12, 6))\n\n# plot results without scaling C\nresults.plot(x=\"C\", ax=axes[0], logx=True)\naxes[0].set_ylabel(\"CV score\")\naxes[0].set_title(\"No scaling\")\n\n# plot results by scaling C\nfor train_size_idx, label in enumerate(labels):\n results_scaled = results[[label]].assign(\n C_scaled=Cs * float(n_samples * train_sizes[train_size_idx])\n )\n results_scaled.plot(x=\"C_scaled\", ax=axes[1], logx=True, label=label)\naxes[1].set_title(\"Scaling C by 1 / n_samples\")\n\n_ = fig.suptitle(\"Effect of scaling C with L1 penalty\")"
+"import matplotlib.pyplot as plt\n\nfig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(12, 6))\n\n# plot results without scaling C\nresults.plot(x=\"C\", ax=axes[0], logx=True)\naxes[0].set_ylabel(\"CV score\")\naxes[0].set_title(\"No scaling\")\n\nfor label in labels:\n best_C = results.loc[results[label].idxmax(), \"C\"]\n axes[0].axvline(x=best_C, linestyle=\"--\", color=\"grey\", alpha=0.7)\n\n# plot results by scaling C\nfor train_size_idx, label in enumerate(labels):\n train_size = train_sizes[train_size_idx]\n results_scaled = results[[label]].assign(\n C_scaled=Cs * float(n_samples * np.sqrt(train_size))\n )\n results_scaled.plot(x=\"C_scaled\", ax=axes[1], logx=True, label=label)\n best_C_scaled = results_scaled[\"C_scaled\"].loc[results[label].idxmax()]\n axes[1].axvline(x=best_C_scaled, linestyle=\"--\", color=\"grey\", alpha=0.7)\n\naxes[1].set_title(\"Scaling C by sqrt(1 / n_samples)\")\n\n_ = fig.suptitle(\"Effect of scaling C with L1 penalty\")"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Here, we observe that the cross-validation-error correlates best with the\ntest-error, when scaling our `C` with the number of samples, `n`.\n\n## L2-penalty case\nWe can repeat a similar experiment with the `l2` penalty. In this case, we\ndon't need to use a sparse dataset.\n\nIn this case, the theory says that in order to achieve prediction\nconsistency, the penalty parameter should be kept constant as the number of\nsamples grow.\n\nSo we will repeat the same experiment by creating a linear SVC classifier\nwith the `l2` penalty and check the test score via cross-validation and\nplot the results with and without scaling the parameter `C`.\n\n"
+"In the region of small `C` (strong regularization) all the coefficients\nlearned by the models are zero, leading to severe underfitting. Indeed, the\naccuracy in this region is at the chance level.\n\nUsing the default scale results in a somewhat stable optimal value of `C`,\nwhereas the transition out of the underfitting region depends on the number of\ntraining samples. The reparametrization leads to even more stable results.\n\nSee e.g. theorem 3 of :arxiv:`On the prediction performance of the Lasso\n<1402.1700>` or :arxiv:`Simultaneous analysis of Lasso and Dantzig selector\n<0801.1095>` where the regularization parameter is always assumed to be\nproportional to 1 / sqrt(n_samples).\n\n## L2-penalty case\nWe can do a similar experiment with the L2 penalty. In this case, the\ntheory says that in order to achieve prediction consistency, the penalty\nparameter should be kept constant as the number of samples grows.\n\n"
 ]
 },
 {
@@ -98,7 +98,7 @@
 },
 "outputs": [],
 "source": [
-"rng = np.random.RandomState(1)\ny = np.sign(0.5 - rng.rand(n_samples))\nX = rng.randn(n_samples, n_features // 5) + y[:, np.newaxis]\nX += 5 * rng.randn(n_samples, n_features // 5)"
+"model_l2 = LinearSVC(penalty=\"l2\", loss=\"squared_hinge\", dual=True)\nCs = np.logspace(-8, 4, 11)\n\nlabels = [f\"fraction: {train_size}\" for train_size in train_sizes]\nresults = {\"C\": Cs}\nfor label, train_size in zip(labels, train_sizes):\n cv = ShuffleSplit(train_size=train_size, **shuffle_params)\n train_scores, test_scores = validation_curve(\n model_l2,\n X,\n y,\n param_name=\"C\",\n param_range=Cs,\n cv=cv,\n n_jobs=2,\n )\n results[label] = test_scores.mean(axis=1)\nresults = pd.DataFrame(results)"
 ]
 },
 {
@@ -109,36 +109,14 @@
 },
 "outputs": [],
 "source": [
-"model_l2 = LinearSVC(penalty=\"l2\", loss=\"squared_hinge\", dual=True)\nCs = np.logspace(-4.5, -2, 10)\n\nlabels = [f\"fraction: {train_size}\" for train_size in train_sizes]\nresults = {\"C\": Cs}\nfor label, train_size in zip(labels, train_sizes):\n cv = ShuffleSplit(train_size=train_size, test_size=0.3, n_splits=50, random_state=1)\n train_scores, test_scores = validation_curve(\n model_l2, X, y, param_name=\"C\", param_range=Cs, cv=cv\n )\n results[label] = test_scores.mean(axis=1)\nresults = pd.DataFrame(results)"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {
-"collapsed": false
-},
-"outputs": [],
-"source": [
-"import matplotlib.pyplot as plt\n\nfig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(12, 6))\n\n# plot results without scaling C\nresults.plot(x=\"C\", ax=axes[0], logx=True)\naxes[0].set_ylabel(\"CV score\")\naxes[0].set_title(\"No scaling\")\n\n# plot results by scaling C\nfor train_size_idx, label in enumerate(labels):\n results_scaled = results[[label]].assign(\n C_scaled=Cs * float(n_samples * train_sizes[train_size_idx])\n )\n results_scaled.plot(x=\"C_scaled\", ax=axes[1], logx=True, label=label)\naxes[1].set_title(\"Scaling C by 1 / n_samples\")\n\n_ = fig.suptitle(\"Effect of scaling C with L2 penalty\")"
+"import matplotlib.pyplot as plt\n\nfig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(12, 6))\n\n# plot results without scaling C\nresults.plot(x=\"C\", ax=axes[0], logx=True)\naxes[0].set_ylabel(\"CV score\")\naxes[0].set_title(\"No scaling\")\n\nfor label in labels:\n best_C = results.loc[results[label].idxmax(), \"C\"]\n axes[0].axvline(x=best_C, linestyle=\"--\", color=\"grey\", alpha=0.8)\n\n# plot results by scaling C\nfor train_size_idx, label in enumerate(labels):\n results_scaled = results[[label]].assign(\n C_scaled=Cs * float(n_samples * np.sqrt(train_sizes[train_size_idx]))\n )\n results_scaled.plot(x=\"C_scaled\", ax=axes[1], logx=True, label=label)\n best_C_scaled = results_scaled[\"C_scaled\"].loc[results[label].idxmax()]\n axes[1].axvline(x=best_C_scaled, linestyle=\"--\", color=\"grey\", alpha=0.8)\naxes[1].set_title(\"Scaling C by sqrt(1 / n_samples)\")\n\nfig.suptitle(\"Effect of scaling C with L2 penalty\")\nplt.show()"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"So or the L2 penalty case, the best result comes from the case where `C` is\nnot scaled.\n\n"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {
-"collapsed": false
-},
-"outputs": [],
-"source": [
-"plt.show()"
+"For the L2 penalty case, the reparametrization seems to have a smaller impact\non the stability of the optimal value for the regularization. The transition\nout of the overfitting region occurs in a more spread range and the accuracy\ndoes not seem to be degraded up to chance level.\n\nTry increasing the value to `n_splits=1_000` for better results in the L2\ncase, which is not shown here due to the limitations on the documentation\nbuilder.\n\n"
 ]
 }
 ],
