Colley’s regression

DS[TA]-3-c Rating and Ranking

February

Summary of Massey’s

The Atlantic Coast Conference

Ratings are the solution \(\mathbf{r}\) of \(\overline{M}\mathbf{r} = \mathbf{p}\)

The data that drives ratings is point difference.

Duke, Miami, U of North Carolina, U of Virginia, Virginia Tech (plus Georgia Tech and Pittsburgh now)

Critique

is point difference so clearly a reflection of the skills gap?
what if, in a short season, the final result is the only goal?

Colley’s Method

Idea: rating as winning

For each team, consider the number of historical wins \(w_i\) and losses \(l_i\), with \(t_i = w_i + l_i\) games played in total.

\[ r_i = \frac{w_i}{t_i} = \frac{w_i}{w_i + l_i} \]

simple
captures long-term trend (but not decay)
ignores strength of the opponents
cold-start problem: the rating is undefined before the first game (\(\frac{0}{0}\)), then jumps straight to \(\frac{0}{1}\) or \(\frac{1}{1}\)
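The behaviour of the plain winning ratio, including the cold start, can be sketched in a few lines. The function name `win_ratio` and the choice to return `None` for a team with no games are assumptions made for illustration only.

```python
from typing import Optional


def win_ratio(wins: int, losses: int) -> Optional[float]:
    """Plain winning ratio r = w / (w + l); None when no game has been played."""
    total = wins + losses
    if total == 0:
        return None  # cold start: 0/0 is undefined
    return wins / total


print(win_ratio(0, 0))  # None
print(win_ratio(1, 0))  # 1.0
print(win_ratio(0, 1))  # 0.0
print(win_ratio(3, 1))  # 0.75
```

After a single game the rating is already at one of the two extremes, which is the jump the cold-start bullet points at.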

Laplace correction

Laplace's correction initialises every rating to 0.5, as if each team had one virtual win and one virtual loss:

\[ r_i = \frac{1 + w_i}{2 + t_i} = \frac{1 + w_i}{2 + w_i + l_i} \]

\(\frac{1}{2}\rightarrow \frac{1}{3} | \frac{2}{3} \rightarrow \frac{1}{4} | \frac{2}{4} | \frac{3}{4} \rightarrow \frac{1}{5} | \frac{2}{5} | \frac{3}{5} | \frac{4}{5}\)

Ratings of 0 or 1 are out of reach.
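As a minimal sketch, the corrected rating can be tabulated for the first few games with exact fractions. The helper name `laplace_rating` is hypothetical; note that `Fraction` reduces \(\frac{2}{4}\) to \(\frac{1}{2}\).

```python
from fractions import Fraction


def laplace_rating(wins: int, losses: int) -> Fraction:
    """Laplace-corrected rating r = (1 + w) / (2 + w + l)."""
    return Fraction(1 + wins, 2 + wins + losses)


# All ratings a team can hold after t games, for t = 0..3.
for t in range(4):
    print(t, [str(laplace_rating(w, t - w)) for w in range(t + 1)])
# 0 ['1/2']
# 1 ['1/3', '2/3']
# 2 ['1/4', '1/2', '3/4']
# 3 ['1/5', '2/5', '3/5', '4/5']
```

However long a winning or losing streak gets, the rating stays strictly between 0 and 1.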

Dependence on opponent’s strength

Colley ratings can be rephrased to contain opponents’ strength as one of the factors.

\[ w_i = \frac{w_i}{2} + \frac{w_i}{2} \]

\[ w_i = \frac{w_i}{2} + \frac{w_i}{2} + \frac{l_i}{2} - \frac{l_i}{2} \]

\[ w_i = \frac{w_i-l_i}{2} + \frac{w_i + l_i}{2} \]

\[ = \frac{w_i-l_i}{2} + \frac{t_i}{2} \]

\[ = \frac{w_i-l_i}{2} + \sum_{j=1}^{t_i}\frac{1}{2} \]

At the start, since each \(r_x\) is set to \(1/2\) we have:

\[ \sum_{j=1}^{t_i}\frac{1}{2} = \sum_{j\in O_i}r_j \]

where \(O_i\) is the set of opponents of team \(i\), with each opponent counted once per game played.

As we progress, this becomes a convenient approximation:

\[ w_i \approx \frac{w_i-l_i}{2} + \sum_{j\in O_i}r_j \]

so we can interpret \(w_i\) as winning balance plus sum of strengths of the opponents.

Colley’s in Linear Algebra

Incorporate opponents’ strength

\[ r_i = \frac{1 + w_i}{2 + t_i} \]

\[ r_i = \frac{1 + \frac{w_i-l_i}{2} + \sum_{j\in O_i}r_j}{2 + t_i} \]
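Because each rating now depends on the opponents' ratings, one way to read the formula is as an update rule: start every team at \(1/2\) and re-apply it until the values stop changing. The sketch below does this for five hypothetical teams in a single round robin. The win-loss records, the function name `colley_fixed_point`, and the tolerance are all assumptions for illustration, not part of the method's definition.

```python
import numpy as np


def colley_fixed_point(wins, losses, games, tol=1e-12, max_iter=1000):
    """Repeatedly apply r_i = (1 + (w_i - l_i)/2 + sum of opponents' r_j) / (2 + t_i)."""
    wins = np.asarray(wins, dtype=float)
    losses = np.asarray(losses, dtype=float)
    games = np.asarray(games, dtype=float)  # games[i, j] = matches between i and j
    total = wins + losses
    r = np.full(len(wins), 0.5)             # every team starts at 1/2
    for _ in range(max_iter):
        opponents = games @ r                # sum of r_j, one term per game played
        r_next = (1 + (wins - losses) / 2 + opponents) / (2 + total)
        converged = np.max(np.abs(r_next - r)) < tol
        r = r_next
        if converged:
            break
    return r


# Assumed data: 5 teams, every pair meets exactly once.
wins = [0, 4, 2, 1, 3]
losses = [4, 0, 2, 3, 1]
games = np.ones((5, 5)) - np.eye(5)

ratings = colley_fixed_point(wins, losses, games)
print(np.round(ratings, 3))  # approximately 0.214, 0.786, 0.5, 0.357, 0.643
```

The first pass reproduces the Laplace-corrected ratings, since all opponents are still at \(1/2\); later passes adjust them for the strength of the schedule.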

Let’s compute all \(r_i\)’s at once.

Matrix form

\[ C \mathbf{r} = \mathbf{b} \]

where C and b are defined to reflect Colley’s rating formula

\(c_{ii} = 2 + t_i\)

\(c_{ij} = -n_{ij}\) for \(i \neq j\), where \(n_{ij}\) is the number of direct matches between teams \(i\) and \(j\)

\[ \begin{pmatrix} 6 & -1 & -1 & -1 & -1 \\ -1 & 6 & -1 & -1 & -1 \\ -1 & -1 & 6 & -1 & -1 \\ -1 & -1 & -1 & 6 & -1 \\ -1 & -1 & -1 & -1 & 6 \end{pmatrix} \]

Essentially \(C = 2I + M\), where \(M\) is the (unadjusted) Massey matrix.
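The relation between the two matrices can be checked by building \(C\) from a schedule. This is a sketch under assumptions: the function name `colley_matrix` is hypothetical, the schedule is a single round robin among five teams (matching the \(-1\) off-diagonal entries shown), and \(M\) is taken to be the Massey matrix with \(t_i\) on the diagonal and \(-n_{ij}\) elsewhere, before any row adjustment.

```python
from itertools import combinations

import numpy as np


def colley_matrix(n_teams, schedule):
    """Build C = 2I + M from a list of (i, j) team-index pairs, one per game."""
    n = np.zeros((n_teams, n_teams))
    for i, j in schedule:
        n[i, j] += 1                   # n_ij counts direct matches
        n[j, i] += 1
    t = n.sum(axis=1)                  # t_i = games played by team i
    massey = np.diag(t) - n            # M: t_i on the diagonal, -n_ij elsewhere
    return 2 * np.eye(n_teams) + massey


# Assumed schedule: every pair of the 5 teams meets exactly once.
schedule = list(combinations(range(5), 2))
C = colley_matrix(5, schedule)
print(C)
```

Only the schedule enters \(C\); who won each game affects the right-hand side \(\mathbf{b}\) alone.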

\[ \begin{pmatrix} 6 & -1 & -1 & -1 & -1 \\ -1 & 6 & -1 & -1 & -1 \\ -1 & -1 & 6 & -1 & -1 \\ -1 & -1 & -1 & 6 & -1 \\ -1 & -1 & -1 & -1 & 6 \end{pmatrix} \mathbf{r} = \mathbf{b} \]

Now set \(b_i = 1 + \frac{1}{2}(w_i - l_i)\)

\[ \begin{pmatrix} 6 & -1 & -1 & -1 & -1 \\ -1 & 6 & -1 & -1 & -1 \\ -1 & -1 & 6 & -1 & -1 \\ -1 & -1 & -1 & 6 & -1 \\ -1 & -1 & -1 & -1 & 6 \end{pmatrix} \mathbf{r} = \begin{pmatrix} -1 \\ 3 \\ 1 \\ 0 \\ 2 \end{pmatrix} \]
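The system is small enough to solve directly. A minimal sketch with NumPy follows; the team order (Duke, Miami, UNC, UVA, VT, i.e. alphabetical) is an assumption about how the rows are arranged, and the variable names are illustrative.

```python
import numpy as np

teams = ["Duke", "Miami", "UNC", "UVA", "VT"]   # assumed row order

C = 7 * np.eye(5) - np.ones((5, 5))             # 6 on the diagonal, -1 elsewhere
b = np.array([-1, 3, 1, 0, 2], dtype=float)     # b_i = 1 + (w_i - l_i) / 2

r = np.linalg.solve(C, b)                       # Colley ratings

# Rank teams from highest to lowest rating.
order = np.argsort(-r)
for rank, k in enumerate(order, start=1):
    print(f"{rank}. {teams[k]:<6}{r[k]:.3f}")
```

Under that row order the ratings come out as roughly 0.214, 0.786, 0.500, 0.357 and 0.643, so Miami ranks first and Duke last.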

Results and comparison with M.

Team	Colley rating	Colley rank	Massey rank
Miami	.79	1st	=
VT	.64	2nd	=
UNC	.50	3rd	4th
UVA	.36	4th	3rd
Duke	.21	5th	=

Colley: conclusions

Points to remember

Laplace correction: \(\frac{Pos + 1}{Tot + 2}\)
winning-only: in fact includes the strengths of the opponents
the total strength of the league tends to remain constant
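The last point can be made precise with a short derivation, assuming a league of \(n\) teams in which every game has a winner (no ties). Summing all rows of \(C\mathbf{r} = \mathbf{b}\) gives

\[ \sum_{i}(2 + t_i)\,r_i - \sum_{i}\sum_{j\neq i} n_{ij}\,r_j = \sum_{i}\left(1 + \frac{w_i - l_i}{2}\right) \]

Each team \(j\) appears in the double sum once per game it played, so \(\sum_{i}\sum_{j\neq i} n_{ij}\,r_j = \sum_{j} t_j\,r_j\) and the \(t_i\) terms cancel on the left. On the right, every win is some other team's loss, so \(\sum_i (w_i - l_i) = 0\). Hence

\[ 2\sum_{i} r_i = n \quad\Longrightarrow\quad \frac{1}{n}\sum_{i} r_i = \frac{1}{2} \]

Under these assumptions the average rating stays at \(1/2\) whatever the results; in the five-team example the ratings sum to \(2.5\).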

General Conclusions

Points to focus on

rating and ranking are the fun side of Data Science!

latent variables represent non-measurable skills
they live in a feature space, possibly separated from the traditional data space

yet they may get a numeric estimate, and inform our predictions
Massey's and Colley's methods both regress on the latent variable "strength".