The Gini index

DS[TA]-2-b

January

Data Science Tech. & App.


Big picture

Economists are interested in how quantities such as income are distributed over a population.
They would like a single number to express equality or inequality of the distribution.

Gini axiomatised the requirements with his G index:

\(G \approx 0\): all individuals have exactly the same share of wealth/income

\(G \approx 1\): one individual has it all, everyone else has exactly zero

\(G < 0.3\): a rather egalitarian country (Slovakia = 0.22)

\(G > 0.4\): a rather elitist country wrt. income (S. Africa = 0.62)

Gini considers the sum of pairwise absolute differences between individuals:

\[ G_0 = \sum_i \sum_j | x_i - x_j | \]

then normalised it for scale wrt. the overall average \(\overline{x}\):

\[ G = \frac{\sum_i \sum_j | x_i - x_j |}{2n^2\overline{x}} \]
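As an illustration only (not part of the original slides), a minimal Python sketch of this normalised formula; the function name gini_index is an arbitrary choice:

```python
def gini_index(x):
    """Gini index: sum of pairwise absolute differences, scaled by 2 * n^2 * mean."""
    n = len(x)
    mean = sum(x) / n
    pairwise = sum(abs(xi - xj) for xi in x for xj in x)
    return pairwise / (2 * n * n * mean)

print(gini_index([1, 1, 1, 1]))    # 0.0  : everyone has the same income
print(gini_index([0, 0, 0, 100]))  # 0.75 : one individual holds everything (tends to 1 as n grows)
```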

If individuals are sorted by increasing income (X-axis) and plotted against cumulative income (Y-axis), the Gini index has a straightforward interpretation in terms of the area between the diagonal and the resulting (Lorenz) curve, relative to the whole area under the diagonal.
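Again as a sketch only, the same quantity can be read off the Lorenz curve: sort incomes, accumulate income shares, and take one minus twice the area under the curve (trapezoid rule); for discrete data this agrees with the pairwise formula above.

```python
def gini_lorenz(x):
    """Gini index via the Lorenz curve: 1 - 2 * (area under the curve)."""
    xs = sorted(x)
    n, total = len(xs), sum(xs)
    cum, area = 0.0, 0.0
    for v in xs:
        prev_share = cum / total
        cum += v
        area += (prev_share + cum / total) / (2 * n)  # one trapezoid of width 1/n
    return 1 - 2 * area

print(gini_lorenz([0, 0, 0, 100]))  # 0.75, matching the pairwise computation
```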

The Gini index

  • is a measure of dispersion, not necessarily of egalitarianism

  • measures the present dispersion rather than a trend.

  • often, indirect (proxy) measures are easier to observe, e.g., home computer ownership as a proxy for wealth.

Applications to Data Science

Gini impurity

In classification, Gini impurity measures the quality of a subset of the data that is to be given a single classification/label.

Algorithm: take a set of elements and choose their label by randomly selecting one element and its category.

What is the probability that this simple method leads to misclassification?

It depends on the dispersion in the set.
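To make the question concrete, here is a small simulation (ours, not from the slides) of the random-labelling procedure: draw one element to fix the label, draw another to classify, and count mismatches.

```python
import random

def misclassification_rate(labels, trials=100_000):
    """Estimate the probability that the random-labelling procedure misclassifies."""
    errors = 0
    for _ in range(trials):
        guess = random.choice(labels)  # label taken from a randomly drawn element
        item = random.choice(labels)   # element to be classified
        errors += (guess != item)
    return errors / trials

print(misclassification_rate(["yes"] * 10))              # ~0.0 : a single category
print(misclassification_rate(["yes"] * 5 + ["no"] * 5))  # ~0.5 : two balanced categories
```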

Consider the frequency distribution of n elements over k categories.

Let \(P(i)\) be the corresponding normalised frequency of category \(i\).

What is the prob. of misclassification, when the label is chosen randomly?

\[G = \sum_{i=1}^k P(i)\cdot (1 - P(i))\]

\[G = 1 - \sum_{i=1}^k P(i)^2\]

\(G \approx 0\): all items fall into one category (whatever that is)

\(G \approx 0.5\): items equally scattered over two categories (in general, \(G = 1 - 1/k\) when items are spread equally over \(k\) categories)
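The closed form is straightforward to compute from per-category counts; a minimal sketch (function name ours):

```python
def gini_impurity(counts):
    """G = 1 - sum_i P(i)^2, with P(i) the normalised frequency of category i."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini_impurity([10, 0]))    # 0.0  : all items in one category
print(gini_impurity([5, 5]))     # 0.5  : two balanced categories
print(gini_impurity([5, 5, 5]))  # ~0.67: k balanced categories give 1 - 1/k
```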

See this blog entry for a worked-out exercise

Gini impurity of a dimension

The dataset from the blog entry contains records of golf-playing days with binary classification:

Day   Outlook   Temp.   Hum.   Wind   Play?
1     Sunny     Hot     High   Weak   No
2     …

Consider three subsets, one for each value of the Outlook dimension:

[Figure: gini example (splitting on Outlook)]

where G(Outlook) is the weighted sum of the impurities of the subsets obtained by splitting along the values of Outlook, with each subset labelled by the random-labelling algorithm above.
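Spelled out (our notation, not necessarily the blog's), with \(S_v\) the subset of days having Outlook value \(v\) and \(n_v = |S_v|\):

\[ G(\text{Outlook}) = \sum_{v \,\in\, \operatorname{values}(\text{Outlook})} \frac{n_v}{n}\, G(S_v) \]

A sketch of the computation; the (Yes, No) counts below are illustrative placeholders, not necessarily the figures from the blog's dataset:

```python
def gini_impurity(counts):
    """Impurity of one subset, from its per-label counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def split_impurity(subsets):
    """Weighted Gini impurity of a split, given per-subset label counts."""
    n = sum(sum(counts) for counts in subsets)
    return sum(sum(counts) / n * gini_impurity(counts) for counts in subsets)

# Hypothetical (Yes, No) counts for the three Outlook values:
print(split_impurity([(2, 3), (4, 0), (3, 2)]))  # weighted impurity G(Outlook)
```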

Research question: can we do better?