Colley’s regression

DS[TA]-3-c Rating and Ranking

February

Summary of Massey’s

Ratings are the solution \(\mathbf{r}\) of \(\overline{M}\mathbf{r} = \mathbf{p}\).

The data that drives the ratings is the point difference.

The Atlantic Coast Conference: Duke, Miami, University of North Carolina, University of Virginia, Virginia Tech (plus Georgia Tech and Pittsburgh now)

Critique

  • Is the point difference really such a clear reflection of the skill gap?

  • What if, in a short season, the final result is all that matters?

Colley’s Method

Idea: rating as winning

For each team \(i\), consider the number of past wins \(w_i\) and losses \(l_i\), with \(t_i = w_i + l_i\) games played in total.

\[ r_i = \frac{w_i}{t_i} = \frac{w_i}{w_i + l_i} \]

  1. simple

  2. captures long-term trend (but not decay)

  3. ignores strength of the opponents

  4. cold-start problem: before any game the rating is the undefined \(\frac{0}{0}\); after a single game it jumps to \(\frac{0}{1}\) or \(\frac{1}{1}\)

Laplace correction

The Laplace correction initialises every ratio-based probability estimate to 0.5, as if each team started with one win and one loss:

\[ r_i = \frac{1 + w_i}{2 + t_i} = \frac{1 + w_i}{2 + w_i + l_i} \]

Possible ratings after 0, 1, 2 and 3 games:

\(\frac{1}{2}\rightarrow \frac{1}{3} | \frac{2}{3} \rightarrow \frac{1}{4} | \frac{2}{4} | \frac{3}{4} \rightarrow \frac{1}{5} | \frac{2}{5} | \frac{3}{5} | \frac{4}{5}\)

Ratings of 0 or 1 are out of reach.
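A minimal sketch comparing the plain winning ratio with its Laplace-corrected version. The function names `win_ratio` and `laplace_rating` are illustrative choices, not part of the lecture; exact fractions are used so the values match the slide notation.

```python
from fractions import Fraction
from typing import Optional


def win_ratio(wins: int, losses: int) -> Optional[Fraction]:
    """Plain winning ratio w / (w + l); None before the first game."""
    total = wins + losses
    if total == 0:
        return None  # the 0/0 cold-start case
    return Fraction(wins, total)


def laplace_rating(wins: int, losses: int) -> Fraction:
    """Laplace-corrected rating (1 + w) / (2 + w + l)."""
    return Fraction(1 + wins, 2 + wins + losses)


if __name__ == "__main__":
    # (wins, losses) records chosen only for illustration.
    for wins, losses in [(0, 0), (1, 0), (0, 1), (3, 1), (10, 0)]:
        print(wins, losses, win_ratio(wins, losses), laplace_rating(wins, losses))
```

Even a perfect 10-0 record gives 11/12 rather than 1, which is the sense in which the extreme ratings stay out of reach.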

Dependence on opponent’s strength

Colley’s rating can be rewritten so that the opponents’ strength appears as one of its terms.

\[ w_i = \frac{w_i}{2} + \frac{w_i}{2} \]

\[ w_i = \frac{w_i}{2} + \frac{w_i}{2} + \frac{l_i}{2} - \frac{l_i}{2} \]

\[ w_i = \frac{w_i-l_i}{2} + \frac{w_i + l_i}{2} \]

\[ = \frac{w_i-l_i}{2} + \frac{t_i}{2} \]

\[ = \frac{w_i-l_i}{2} + \sum_{j=1}^{t_i}\frac{1}{2} \]

At the start, since every rating \(r_j\) is set to \(1/2\), we have:

\[ \sum_{j=1}^{t_i}\frac{1}{2} = \sum_{j\in O_i}r_j \]

where \(O_i\) is the set of opponents faced by team \(i\), with one entry per game played.

As we progress, this becomes a convenient approximation:

\[ w_i \approx \frac{w_i-l_i}{2} + \sum_{j\in O_i}r_j \]

so \(w_i\) can be read as half the win/loss balance plus the summed strength of the opponents.
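As a quick numeric check, take a hypothetical team with \(w_i = 3\) and \(l_i = 1\) (so \(t_i = 4\)) whose four opponents are all still rated \(1/2\):

\[ \frac{w_i-l_i}{2} + \sum_{j\in O_i}r_j = \frac{3-1}{2} + 4\cdot\frac{1}{2} = 1 + 2 = 3 = w_i \]

Once the opponents’ ratings move away from \(1/2\), the right-hand side no longer equals \(w_i\) exactly, which is why the relation is written with \(\approx\).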

Colley’s in Linear Algebra

Incorporate opponents’ strength

\[ r_i = \frac{1 + w_i}{2 + t_i} \]

\[ r_i = \frac{1 + \frac{w_i-l_i}{2} + \sum_{j\in O_i}r_j}{2 + t_i} \]

Let’s compute all \(r_i\)’s at once.
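One way to see where the linear system comes from, using only the symbols already defined: multiply the rating formula by \(2 + t_i\) and move the opponents’ ratings to the left-hand side.

\[ (2 + t_i)\, r_i = 1 + \frac{w_i-l_i}{2} + \sum_{j\in O_i}r_j \]

\[ (2 + t_i)\, r_i - \sum_{j\in O_i}r_j = 1 + \frac{w_i-l_i}{2} \]

There is one such equation per team, each linear in the unknown ratings, so they can be stacked into a single system.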

Matrix form

\[ C \mathbf{r} = \mathbf{b} \]

where C and b are defined to reflect Colley’s rating formula

\(c_{ii} = 2 + t_i\)

\(c_{ij} = -n_{ij}\) for \(i \neq j\), where \(n_{ij}\) is the number of games played between teams \(i\) and \(j\)

\[ \begin{pmatrix} 6 & -1 & -1 & -1 & -1 \\ -1 & 6 & -1 & -1 & -1 \\ -1 & -1 & 6 & -1 & -1 \\ -1 & -1 & -1 & 6 & -1 \\ -1 & -1 & -1 & -1 & 6 \end{pmatrix} \]

Essentially \(C = 2I + M\), where \(M\) is the Massey matrix.

\[ \begin{pmatrix} 6 & -1 & -1 & -1 & -1 \\ -1 & 6 & -1 & -1 & -1 \\ -1 & -1 & 6 & -1 & -1 \\ -1 & -1 & -1 & 6 & -1 \\ -1 & -1 & -1 & -1 & 6 \end{pmatrix} \mathbf{r} = \mathbf{b} \]

Now set \(b_i = 1 + \frac{1}{2}(w_i - l_i)\)

\[ \begin{pmatrix} 6 & -1 & -1 & -1 & -1 \\ -1 & 6 & -1 & -1 & -1 \\ -1 & -1 & 6 & -1 & -1 \\ -1 & -1 & -1 & 6 & -1 \\ -1 & -1 & -1 & -1 & 6 \end{pmatrix} \mathbf{r} = \begin{pmatrix} -1 \\ 3 \\ 1 \\ 0 \\ 2 \end{pmatrix} \]
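The system above can be built and solved in a few lines. The sketch below uses NumPy and assumes the team order Duke, Miami, UNC, UVA, VT, a single round robin (every pair meets once), and the win/loss records implied by \(\mathbf{b}\) through \(b_i = 1 + \frac{1}{2}(w_i - l_i)\); the variable names are illustrative.

```python
import numpy as np

teams = ["Duke", "Miami", "UNC", "UVA", "VT"]

# Win/loss records implied by the right-hand side b (an assumption:
# b_i = 1 + (w_i - l_i) / 2 with four games per team).
wins = np.array([0, 4, 2, 1, 3])
losses = np.array([4, 0, 2, 3, 1])

n = len(teams)
# Round robin: n_ij = 1 for every pair i != j, 0 on the diagonal.
games = np.ones((n, n)) - np.eye(n)

# Colley matrix: 2 + t_i on the diagonal, -n_ij off the diagonal.
C = np.diag(2 + games.sum(axis=1)) - games
b = 1 + (wins - losses) / 2

# Solve C r = b for the rating vector r.
r = np.linalg.solve(C, b)

# Print the teams from strongest to weakest.
for i in np.argsort(-r):
    print(f"{teams[i]:<6}{r[i]:.2f}")
```

For a different schedule only `games`, `wins` and `losses` need to change; the construction of `C` and `b` stays the same.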

Results and comparison with Massey

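\[ \begin{array}{lccc}
\text{Team} & r_C & \text{Colley rank} & \text{Massey rank} \\
\hline
\text{Miami} & .79 & \text{1st} & \text{1st} \\
\text{VT} & .64 & \text{2nd} & \text{2nd} \\
\text{UNC} & .50 & \text{3rd} & \text{4th} \\
\text{UVA} & .36 & \text{4th} & \text{3rd} \\
\text{Duke} & .21 & \text{5th} & \text{5th}
\end{array} \]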
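The same ratings can also be reached without solving the system explicitly, by repeatedly applying the formula \(r_i = \frac{1 + \frac{w_i-l_i}{2} + \sum_{j\in O_i}r_j}{2 + t_i}\) from the starting point \(r_i = 1/2\). The sketch below is an illustration rather than something the lecture prescribes: it uses a Jacobi-style sweep, and it assumes the round-robin schedule and the win/loss records implied by \(\mathbf{b}\).

```python
teams = ["Duke", "Miami", "UNC", "UVA", "VT"]
wins = [0, 4, 2, 1, 3]      # assumed records, consistent with b
losses = [4, 0, 2, 3, 1]

n = len(teams)
# Round robin: every team plays each of the other teams once.
opponents = [[j for j in range(n) if j != i] for i in range(n)]


def colley_sweep(r):
    """Apply the rating formula once, using the opponents' previous ratings."""
    return [
        (1 + (wins[i] - losses[i]) / 2 + sum(r[j] for j in opponents[i]))
        / (2 + len(opponents[i]))
        for i in range(n)
    ]


r = [0.5] * n  # Laplace starting point: every team rated 1/2
for _ in range(100):  # far more sweeps than this small example needs
    r = colley_sweep(r)

for name, rating in sorted(zip(teams, r), key=lambda pair: -pair[1]):
    print(f"{name:<6}{rating:.2f}")
```

The sweep converges here because the Colley matrix is strictly diagonally dominant: each diagonal entry \(2 + t_i\) exceeds the sum of the absolute off-diagonal entries in its row, which is \(t_i\).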

Colley: conclusions

Points to remember

  1. Laplace correction: \(\frac{Pos + 1}{Tot + 2}\)

  2. winning-only: uses just wins and losses, yet in fact includes the strengths of the opponents

  3. the total strength of the league remains constant: the ratings always average \(\frac{1}{2}\)
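Point 3 can be checked directly from the matrix form. In this short derivation \(n\) denotes the number of teams (a symbol introduced here, not used elsewhere in the slides). Each column of \(C\) sums to 2, because \(c_{ii} = 2 + t_i\) and \(t_i = \sum_{j\neq i} n_{ij}\). Adding up all the rows of \(C\mathbf{r} = \mathbf{b}\) therefore gives

\[ 2\sum_{i=1}^{n} r_i = \sum_{i=1}^{n} b_i = n + \frac{1}{2}\sum_{i=1}^{n}(w_i - l_i) = n \]

since every game adds exactly one win and one loss to the league totals. Hence \(\sum_{i} r_i = \frac{n}{2}\), whatever the results: in the five-team example the ratings add up to \(\frac{5}{2}\).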

General Conclusions

Points to focus on

  • rating and ranking are the fun side of Data Science!
  • latent variables represent non-measurable skills

  • they live in a feature space, possibly separated from the traditional data space

  • yet they may get a numeric estimate, and inform our predictions

  • Massey and Colley both regress on the latent variable “strength”.

Further readings (Ch.3)

Colley can run with Massey’s point balances (and vice versa).

Both methods can be applied to Collaborative filtering.