An Invitation to Support Vector Machines

DS[TA]-5-b

February

Data Science: T & A and A & T

Review of Kernelization

Kernel functions

map datapoints from input space to feature space, to reveal dependencies.

Computation in feature space could be unfeasible, but often the Kernel trick reduces it to computing the Kernel matrix \(K_{ij} = \phi(\vec{x_i})^T \cdot \phi(\vec{x_j})\).

To compute \(K\) we always need \(\Theta(n^2)\) dot products, each costing \(\Theta(d)\) in the \(d\)-dimensional input space.
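A minimal NumPy sketch of this cost (the array names and the toy sizes are made up): the naive double loop below performs exactly the \(\Theta(n^2)\) dot products, and is checked against the vectorised product.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # toy data: n = 100 datapoints, d = 5 features
n = X.shape[0]

# Naive construction: Theta(n^2) dot products, Theta(d) work each.
K = np.empty((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = X[i] @ X[j]        # K_ij = x_i^T . x_j (identity/linear kernel)

# Same matrix in a single vectorised call.
assert np.allclose(K, X @ X.T)
```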

3 model kernels

\(\phi(\cdot): \mathcal{D}_1 \times \dots \times \mathcal{D}_d \rightarrow \mathcal{F}^e\)

  1. identity: \(\phi(\vec{x}) = \vec{x}\)

Meaning: let’s compute directly with \(K\), i.e., with plain dot products in input space.

  2. poly/non-linear, e.g.: \(\phi([x, y]^T) = [x^2, y^2, xy]^T\)
  3. Mercer: \(\phi(\vec{x_i}) = \sqrt{\Lambda}\, \vec{U_i}\) (see the sketch after this list)
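A hedged sketch of the three maps on toy data (the names and the toy matrix are illustrative). For the Mercer map, the kernel matrix is diagonalised as \(K = U \Lambda U^T\) and \(\phi(\vec{x_i}) = \sqrt{\Lambda}\, \vec{U_i}\) is read off row by row; the sketch checks that it reproduces \(K\) exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                      # toy data: n = 6 points in 2-D

# 1. identity map: phi(x) = x, so the kernel is the plain dot product.
K_lin = X @ X.T

# 2. a polynomial (non-linear) map for [x, y] -> [x^2, y^2, xy].
def phi_quad(p):
    x, y = p
    return np.array([x**2, y**2, x * y])

Phi = np.array([phi_quad(p) for p in X])
K_quad = Phi @ Phi.T

# 3. Mercer map: diagonalise a positive semidefinite kernel matrix,
#    K = U Lambda U^T, and set phi(x_i) = sqrt(Lambda) U_i (the i-th row of U).
lam, U = np.linalg.eigh(K_quad)
lam = np.clip(lam, 0.0, None)                    # guard against tiny negative eigenvalues
Phi_mercer = U * np.sqrt(lam)                    # row i is phi(x_i)

assert np.allclose(Phi_mercer @ Phi_mercer.T, K_quad)
```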

Kernel trick

Reduces computation to looking up the Kernel matrix.

However, for a given feature map \(\phi\) we need to show that

  1. the analysis task (min/max/var/mean etc.) can be reduced to a combination of dot products \(\phi(\vec{x_i})^T \cdot \phi(\vec{x_j})\) in the feature space;
  2. dot products can be replaced with look-ups in the Kernel matrix, i.e., \(\phi(\vec{x_i})^T \cdot \phi(\vec{x_j}) = K(\vec{x_i}, \vec{x_j})\) (a worked instance is sketched right after this list).
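For instance (a sketch with made-up data, reusing the quadratic map above): the squared feature-space distance expands into dot products (step 1), each of which becomes a kernel-matrix look-up (step 2).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 2))                      # toy data

def phi_quad(p):                                 # the quadratic map from the list above
    x, y = p
    return np.array([x**2, y**2, x * y])

Phi = np.array([phi_quad(p) for p in X])
K = Phi @ Phi.T                                  # K_ij = phi(x_i)^T . phi(x_j)

# Task: squared distance between phi(x_0) and phi(x_1) in feature space.
# Step 1: reduce it to dot products:
#   ||phi(a) - phi(b)||^2 = phi(a).phi(a) - 2 phi(a).phi(b) + phi(b).phi(b)
# Step 2: replace each dot product with a kernel-matrix look-up.
d2_via_K  = K[0, 0] - 2 * K[0, 1] + K[1, 1]
d2_direct = np.sum((Phi[0] - Phi[1]) ** 2)
assert np.isclose(d2_via_K, d2_direct)
```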

The Kernel trick works for all three model kernels.

Support Vector Machines

Goal

  • Solve classification over not-so-scattered datapoints

Ideas

  • map datapoints into a feature space where they appear to fall farther apart from each other

  • find areas of maximum separation between groups

  • determine the cutting (hyper)plane that cuts across the clear area (in 2D it’s a straight line)

  • use it as a classifier (a minimal sketch follows this list)
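A minimal end-to-end sketch of these ideas with scikit-learn; the toy dataset, the kernel choice and the parameters are illustrative, not prescribed here. `make_blobs` produces two compact groups, and `SVC` maps them (implicitly, via the RBF kernel) to a feature space and fits the separating hyperplane.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two not-so-scattered groups of points in 2-D.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)

# Kernelised SVM: map to a feature space, find the widest separation,
# and keep the resulting cutting (hyper)plane as a classifier.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)

print(clf.predict([[0.0, 0.0]]))          # classify a new point
```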

Underlying assumptions

  • members of a class have individual/family variations that make their values oscillate around a typical class value

  • such classes are somewhat non-overlapping: there must be an emergent difference in features between individuals of class A and B.

  • in the vicinity of class-A datapoints, the observed values could only belong to a class-A object

  • datapoints should not be too close to the border to the next class

  • points lying at the minimum distance from the cutting hyperplane are called support vectors

  • each class will have its own support vectors and its minimum distance \(\delta^{c}\) from the cutting hyperplane (see the sketch after this list)

  • on to the Zaki-Meira presentation…
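As anticipated in the list above, a hedged sketch of support vectors and margins (toy data, illustrative parameters): with a linear kernel, scikit-learn exposes the fitted support vectors directly, and in the separable case the margin is \(1 / \lVert \vec{w} \rVert\).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)

# Linear kernel, so the cutting hyperplane w.x + b = 0 is explicit.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)               # the points closest to the cutting hyperplane
w = clf.coef_[0]
delta = 1.0 / np.linalg.norm(w)           # each class's margin delta from the hyperplane
print(delta)
```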

Coda: assortativity

The underlying concept

As with K-means, we postulate that

class difference should be revealed by a distinctive difference in measured values, i.e., some break of continuity.

Hence, two points that are very close should always end up in the same class.

Assortativity […] is a preference for […] attaching to others that are similar in some way.
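A small hedged illustration with NetworkX (the toy graph and its `cls` attribute are made up): the attribute assortativity coefficient is positive when links tend to join nodes of the same class.

```python
import networkx as nx

# A toy graph where links mostly join nodes of the same class.
G = nx.Graph()
G.add_nodes_from([0, 1, 2], cls="A")
G.add_nodes_from([3, 4, 5], cls="B")
G.add_edges_from([(0, 1), (1, 2), (0, 2),       # within class A
                  (3, 4), (4, 5), (3, 5),       # within class B
                  (2, 3)])                      # a single cross-class link

# Clearly positive: attachment is assortative on the class attribute.
print(nx.attribute_assortativity_coefficient(G, "cls"))
```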

[…] at one point [he] moved from the Mathematics Department at MIT to that at Boston University.

Which singlehandedly raised the average IQ in both departments.