DS[TA]-5-b
February
map datapoints from input space to feature space, to reveal dependencies.
Computation in feature space could be unfeasible, but often the Kernel trick reduces it to computing the Kernel matrix \(K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j)\), obtained directly from the input-space dot product \(x_i^T x_j\).
To compute \(K\) we will always need \(\Theta(n^2)\) dot products, each costing \(\Theta(d)\) in input space (rather than up to \(\Theta(d^2)\) or more in feature space).
\(\phi(\cdot): \mathcal{D}_1 \times \dots \times \mathcal{D}_d \rightarrow \mathcal{F}^e\)
Meaning: let’s compute with \(K\)
Reduces computation to looking up the Kernel matrix.
However, for a given kernel function \(K\) we need to show that it really corresponds to a dot product in some feature space, i.e., that \(K(x_i, x_j) = \phi(x_i)^T \phi(x_j)\) for some map \(\phi\).
The Kernel trick works for the three model kernels.
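As a concrete check (a minimal NumPy sketch of my own, not taken from the course material), the homogeneous quadratic kernel \(K(x_i, x_j) = (x_i^T x_j)^2\) has an explicit feature map \(\phi(x) = (x_a x_b)_{a,b}\): building \(\phi\) explicitly and using the trick give the same Kernel matrix, at very different costs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # n = 5 datapoints in d = 3 dimensions

def phi(x):
    # explicit quadratic feature map: all pairwise products x_a * x_b (d^2 coordinates)
    return np.outer(x, x).ravel()

# feature-space route: materialise phi(x_i), then take dot products (Theta(d^2) per pair)
Phi = np.array([phi(x) for x in X])
K_explicit = Phi @ Phi.T

# Kernel-trick route: K_ij = (x_i^T x_j)^2, computed in input space (Theta(d) per pair)
K_trick = (X @ X.T) ** 2

assert np.allclose(K_explicit, K_trick)  # same Kernel matrix, no trip to feature space
```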
map datapoints into a feature space where they appear to fall farther apart from each other
find the area of maximum separation between the groups
determine the cutting (hyper)plane that cuts across the clear area (in 2D it’s a straight line)
use it as a classifier
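A minimal scikit-learn sketch of these four steps on synthetic 2D data; the toy dataset (make_blobs) and the linear SVC are my illustrative choices, not prescribed by the notes.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two well-separated 2D groups, so a "clear area" exists between them
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=1)

# fit a linear SVM: it finds the hyperplane of maximum separation between the groups
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# in 2D the cutting hyperplane w . x + b = 0 is a straight line
w, b = clf.coef_[0], clf.intercept_[0]
print("cutting line: w =", w, " b =", b)

# use it as a classifier on new points
print("class of the origin:", clf.predict([[0.0, 0.0]])[0])
```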
members of a class have individual/family variations that make their values oscillate around a typical class value
such classes are somewhat non-overlapping: there must be an emergent difference in features between individuals of classes A and B.
in the vicinity of class A datapoints, the observed values should be compatible only with a class A object
datapoints should not be too close to the border with the next class
points lying at the minimum distance from the cutting hyperplane are called support vectors
each class will have its support vectors and its minimum distance \(\delta^{c}\) from the cutting hyperplane
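Continuing with the same toy setup (again my own sketch), the fitted model exposes the support vectors directly, and under the canonical scaling of the separable case each class's margin is \(\delta^{c} = 1/\lVert w \rVert\):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# the points at minimum distance from the cutting hyperplane
support_vectors = clf.support_vectors_

# with the canonical scaling |w . x + b| = 1 on the margin (separable case),
# each class's minimum distance delta^c from the hyperplane is 1 / ||w||
delta = 1.0 / np.linalg.norm(w)

# sanity check: geometric distance of each support vector to the hyperplane
distances = np.abs(support_vectors @ w + b) / np.linalg.norm(w)
print("delta =", delta, " support-vector distances:", np.round(distances, 3))
```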
on to the Zaki-Meira presentation…
As with K-means, we postulate that
class difference should be revealed by a distinctive difference in measured values, i.e., some break of continuity.
hence, two points that are very close should always end up in the same class.
Assortativity […] is a preference for […] attaching to others that are similar in some way.
[…] at one point [he] moved from the Mathematics Department at MIT to that at Boston University.
Which singlehandedly raised the average IQ in both departments.