|
15 | 15 | "cell_type": "markdown",
|
16 | 16 | "metadata": {},
|
17 | 17 | "source": [
|
18 |
| - "\n# Adjustment for chance in clustering performance evaluation\n\nThis notebook explores the impact of uniformly-distributed random labeling on\nthe behavior of some clustering evaluation metrics. For such purpose, the\nmetrics are computed with a fixed number of samples and as a function of the number\nof clusters assigned by the estimator. The example is divided into two\nexperiments:\n\n- a first experiment with fixed \"ground truth labels\" (and therefore fixed\n number of classes) and randomly \"predicted labels\";\n- a second experiment with varying \"ground truth labels\", randomly \"predicted\n labels\". The \"predicted labels\" have the same number of classes and clusters\n as the \"ground truth labels\".\n" |
| 18 | + "\n# Adjustment for chance in clustering performance evaluation\nThis notebook explores the impact of uniformly distributed random labeling on\nthe behavior of some clustering evaluation metrics. For this purpose, the\nmetrics are computed with a fixed number of samples and as a function of the number\nof clusters assigned by the estimator. The example is divided into two\nexperiments:\n\n- a first experiment with fixed \"ground truth labels\" (and therefore a fixed\n number of classes) and randomly generated \"predicted labels\";\n- a second experiment with varying \"ground truth labels\" and randomly\n generated \"predicted labels\". The \"predicted labels\" have the same number of\n classes and clusters as the \"ground truth labels\".\n" |
19 | 19 | ]
|
20 | 20 | },
|
21 | 21 | {
|
|
98 | 98 | },
|
99 | 99 | "outputs": [],
|
100 | 100 | "source": [
|
101 |
| - "import matplotlib.pyplot as plt\nimport matplotlib.style as style\n\nn_samples = 1000\nn_classes = 10\nn_clusters_range = np.linspace(2, 100, 10).astype(int)\nplots = []\nnames = []\n\nstyle.use(\"seaborn-colorblind\")\nplt.figure(1)\n\nfor marker, (score_name, score_func) in zip(\"d^vx.,\", score_funcs):\n\n scores = fixed_classes_uniform_labelings_scores(\n score_func, n_samples, n_clusters_range, n_classes=n_classes\n )\n plots.append(\n plt.errorbar(\n n_clusters_range,\n scores.mean(axis=1),\n scores.std(axis=1),\n alpha=0.8,\n linewidth=1,\n marker=marker,\n )[0]\n )\n names.append(score_name)\n\nplt.title(\n \"Clustering measures for random uniform labeling\\n\"\n f\"against reference assignment with {n_classes} classes\"\n)\nplt.xlabel(f\"Number of clusters (Number of samples is fixed to {n_samples})\")\nplt.ylabel(\"Score value\")\nplt.ylim(bottom=-0.05, top=1.05)\nplt.legend(plots, names)\nplt.show()" |
| 101 | + "import matplotlib.pyplot as plt\nimport seaborn as sns\n\nn_samples = 1000\nn_classes = 10\nn_clusters_range = np.linspace(2, 100, 10).astype(int)\nplots = []\nnames = []\n\nsns.set_palette(\"colorblind\")\nplt.figure(1)\n\nfor marker, (score_name, score_func) in zip(\"d^vx.,\", score_funcs):\n scores = fixed_classes_uniform_labelings_scores(\n score_func, n_samples, n_clusters_range, n_classes=n_classes\n )\n plots.append(\n plt.errorbar(\n n_clusters_range,\n scores.mean(axis=1),\n scores.std(axis=1),\n alpha=0.8,\n linewidth=1,\n marker=marker,\n )[0]\n )\n names.append(score_name)\n\nplt.title(\n \"Clustering measures for random uniform labeling\\n\"\n f\"against reference assignment with {n_classes} classes\"\n)\nplt.xlabel(f\"Number of clusters (Number of samples is fixed to {n_samples})\")\nplt.ylabel(\"Score value\")\nplt.ylim(bottom=-0.05, top=1.05)\nplt.legend(plots, names, bbox_to_anchor=(0.5, 0.5))\nplt.show()" |
102 | 102 | ]
|
103 | 103 | },
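For context, `fixed_classes_uniform_labelings_scores` is defined in an earlier cell that this diff does not show. A minimal sketch of what it plausibly computes — uniformly random labelings scored against one fixed ground truth, over several runs per cluster count — could look like the following (the `n_runs` parameter and the seed are assumptions, not taken from the notebook):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def fixed_classes_uniform_labelings_scores(
    score_func, n_samples, n_clusters_range, n_classes, n_runs=5, seed=42
):
    # Score uniformly random labelings (with a varying number of clusters)
    # against one fixed "ground truth" labeling with n_classes classes.
    rng = np.random.RandomState(seed)
    scores = np.zeros((len(n_clusters_range), n_runs))
    labels_a = rng.randint(low=0, high=n_classes, size=n_samples)
    for i, n_clusters in enumerate(n_clusters_range):
        for j in range(n_runs):
            labels_b = rng.randint(low=0, high=n_clusters, size=n_samples)
            scores[i, j] = score_func(labels_a, labels_b)
    return scores

# One row per cluster count, one column per random run.
scores = fixed_classes_uniform_labelings_scores(
    adjusted_rand_score, n_samples=200, n_clusters_range=[2, 10, 50], n_classes=10
)
print(scores.shape)  # (3, 5)
```

With an adjusted metric such as ARI, every row of `scores` averages out near zero, which is exactly the flat curve the plot in this cell produces.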
|
104 | 104 | {
|
|
134 | 134 | },
|
135 | 135 | "outputs": [],
|
136 | 136 | "source": [
|
137 |
| - "n_samples = 100\nn_clusters_range = np.linspace(2, n_samples, 10).astype(int)\n\nplt.figure(2)\n\nplots = []\nnames = []\n\nfor marker, (score_name, score_func) in zip(\"d^vx.,\", score_funcs):\n\n scores = uniform_labelings_scores(score_func, n_samples, n_clusters_range)\n plots.append(\n plt.errorbar(\n n_clusters_range,\n np.median(scores, axis=1),\n scores.std(axis=1),\n alpha=0.8,\n linewidth=2,\n marker=marker,\n )[0]\n )\n names.append(score_name)\n\nplt.title(\n \"Clustering measures for 2 random uniform labelings\\nwith equal number of clusters\"\n)\nplt.xlabel(f\"Number of clusters (Number of samples is fixed to {n_samples})\")\nplt.ylabel(\"Score value\")\nplt.legend(plots, names)\nplt.ylim(bottom=-0.05, top=1.05)\nplt.show()" |
| 137 | + "n_samples = 100\nn_clusters_range = np.linspace(2, n_samples, 10).astype(int)\n\nplt.figure(2)\n\nplots = []\nnames = []\n\nfor marker, (score_name, score_func) in zip(\"d^vx.,\", score_funcs):\n scores = uniform_labelings_scores(score_func, n_samples, n_clusters_range)\n plots.append(\n plt.errorbar(\n n_clusters_range,\n np.median(scores, axis=1),\n scores.std(axis=1),\n alpha=0.8,\n linewidth=2,\n marker=marker,\n )[0]\n )\n names.append(score_name)\n\nplt.title(\n \"Clustering measures for 2 random uniform labelings\\n\"\n \"with equal number of clusters\"\n)\nplt.xlabel(f\"Number of clusters (Number of samples is fixed to {n_samples})\")\nplt.ylabel(\"Score value\")\nplt.legend(plots, names)\nplt.ylim(bottom=-0.05, top=1.05)\nplt.show()" |
138 | 138 | ]
|
139 | 139 | },
|
140 | 140 | {
|
141 | 141 | "cell_type": "markdown",
|
142 | 142 | "metadata": {},
|
143 | 143 | "source": [
|
144 |
| - "We observe similar results as for the first experiment: adjusted for chance\nmetrics stay constantly near zero while other metrics tend to get larger with\nfiner-grained labelings. The mean V-measure of random labeling increases\nsignificantly as the number of clusters is closer to the total number of\nsamples used to compute the measure. Furthermore, raw Mutual Information is\nunbounded from above and its scale depends on the dimensions of the clustering\nproblem and the cardinality of the ground truth classes.\n\nOnly adjusted measures can hence be safely used as a consensus index to\nevaluate the average stability of clustering algorithms for a given value of k\non various overlapping sub-samples of the dataset.\n\nNon-adjusted clustering evaluation metric can therefore be misleading as they\noutput large values for fine-grained labelings, one could be lead to think\nthat the labeling has captured meaningful groups while they can be totally\nrandom. In particular, such non-adjusted metrics should not be used to compare\nthe results of different clustering algorithms that output a different number\nof clusters.\n\n" |
| 144 | + "We observe similar results as for the first experiment: adjusted-for-chance\nmetrics stay consistently near zero while other metrics tend to get larger\nwith finer-grained labelings. The mean V-measure of random labeling increases\nsignificantly as the number of clusters gets closer to the total number of\nsamples used to compute the measure. Furthermore, raw Mutual Information is\nunbounded from above and its scale depends on the dimensions of the clustering\nproblem and the cardinality of the ground truth classes. This is why its\ncurve goes off the chart.\n\nOnly adjusted measures can hence be safely used as a consensus index to\nevaluate the average stability of clustering algorithms for a given value of k\non various overlapping sub-samples of the dataset.\n\nNon-adjusted clustering evaluation metrics can therefore be misleading, as\nthey output large values for fine-grained labelings: one could be led to think\nthat the labeling has captured meaningful groups while it may be totally\nrandom. In particular, such non-adjusted metrics should not be used to compare\nthe results of different clustering algorithms that output different numbers\nof clusters.\n\n" |
145 | 145 | ]
|
146 | 146 | }
|
147 | 147 | ],
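The closing cell's warning about non-adjusted metrics can be checked directly. A small sketch (the Rand index pair and the sample counts here are illustrative choices, not taken from the notebook) shows a raw score rewarding a purely random fine-grained labeling while its chance-adjusted counterpart stays near zero:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, rand_score

rng = np.random.RandomState(0)
n_samples = 1000
labels_true = rng.randint(0, 10, size=n_samples)     # 10 "ground truth" classes
labels_random = rng.randint(0, 100, size=n_samples)  # 100 purely random clusters

# The raw Rand index comes out high even though the labeling is pure noise;
# the adjusted Rand index stays near zero, as a chance-corrected score should.
ri = rand_score(labels_true, labels_random)
ari = adjusted_rand_score(labels_true, labels_random)
print(f"RI = {ri:.3f}, ARI = {ari:.3f}")
```

Comparing two clusterings with different cluster counts by raw RI would favor the finer one for no meaningful reason, which is the failure mode the cell describes.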
|
|