Regularization perspectives on support vector machines
Regularization perspectives on support vector machines provide a way of interpreting support vector machines (SVMs) in the context of other machine learning algorithms. SVM algorithms categorize multidimensional data, with the goal of fitting the training set data well while also avoiding overfitting, so that the solution generalizes to new data points. Regularization algorithms also aim to fit training set data and avoid overfitting. They do this by choosing a fitting function that has low error on the training set but is not too complicated, where complicated functions are functions with high norms in some function space. Specifically, Tikhonov regularization algorithms choose a function that minimizes the sum of the training set error and the function's norm. The training set error can be calculated with different loss functions. For example, regularized least squares is a special case of Tikhonov regularization that uses the squared-error loss as the loss function.[1]
Regularization perspectives on support vector machines interpret SVM as a special case of Tikhonov regularization, specifically Tikhonov regularization with the hinge loss as the loss function. This provides a theoretical framework with which to analyze SVM algorithms and compare them to other algorithms with the same goal: to generalize without overfitting. SVM was first proposed in 1995 by Corinna Cortes and Vladimir Vapnik, and framed geometrically as a method for finding hyperplanes that can separate multidimensional data into two categories.[2] This traditional geometric interpretation of SVMs provides useful intuition about how SVMs work, but is difficult to relate to other machine learning techniques for avoiding overfitting, such as regularization, early stopping, sparsity and Bayesian inference. However, once it was discovered that SVM is also a special case of Tikhonov regularization, regularization perspectives on SVM provided the theory necessary to fit SVM within a broader class of algorithms.[1][3][4] This has enabled detailed comparisons between SVM and other forms of Tikhonov regularization, and theoretical grounding for why it is beneficial to use SVM's loss function, the hinge loss.[5]
Theoretical background
In the statistical learning theory framework, an algorithm is a strategy for choosing a function $f\colon \mathbf{X} \to \mathbf{Y}$ given a training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of inputs, $x_i$, and their labels, $y_i$ (the labels are usually $\pm 1$). Regularization strategies avoid overfitting by choosing a function that fits the data, but is not too complex. Specifically:

$$f = \text{arg}\min_{f\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n V(y_i,f(x_i))+\lambda||f||^2_\mathcal{H}\right\},$$

where $\mathcal{H}$ is a hypothesis space[6] of functions, $V\colon \mathbf{Y} \times \mathbf{Y} \to \mathbb{R}$ is the loss function, $||\cdot||_\mathcal{H}$ is a norm on the hypothesis space of functions, and $\lambda \in \mathbb{R}$ is the regularization parameter.[7]
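To make the objective concrete, here is a minimal Python sketch (hypothetical names such as regularized_fit; NumPy and SciPy assumed) that minimizes the regularized empirical risk for a user-supplied loss V, using a deliberately simple linear hypothesis space f(x) = w·x so that the norm penalty reduces to the squared Euclidean norm of w:

```python
import numpy as np
from scipy.optimize import minimize

def regularized_fit(X, y, loss, lam):
    """Minimize (1/n) * sum_i V(y_i, f(x_i)) + lam * ||f||^2
    for a linear model f(x) = w . x (a simple stand-in for the hypothesis space)."""
    d = X.shape[1]

    def objective(w):
        preds = X @ w                        # f(x_i) for every training point
        data_term = np.mean(loss(y, preds))  # (1/n) * sum_i V(y_i, f(x_i))
        penalty = lam * np.dot(w, w)         # lam * ||f||^2
        return data_term + penalty

    return minimize(objective, np.zeros(d), method="Nelder-Mead").x

# Squared loss recovers a linear version of regularized least squares;
# the hinge loss discussed below gives an SVM-style objective instead.
squared_loss = lambda y, f: (y - f) ** 2
```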
When $\mathcal{H}$ is a reproducing kernel Hilbert space, there exists a kernel function $K\colon \mathbf{X} \times \mathbf{X} \to \mathbb{R}$ that can be written as an $n \times n$ symmetric positive-definite matrix $\mathbf{K}$. By the representer theorem,[8]

$$f(x_i) = \sum_{j=1}^n c_j \mathbf{K}_{ij}, \quad \text{and} \quad ||f||^2_{\mathcal{H}} = \langle f, f \rangle_\mathcal{H} = \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) = c^T \mathbf{K} c.$$
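As a numerical illustration of the representer theorem, the sketch below (hypothetical names; a Gaussian kernel chosen only for concreteness) builds the kernel matrix K and, for the squared-loss special case mentioned earlier (regularized least squares), recovers the coefficients c from the closed-form linear system (K + λnI)c = y:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """n x n symmetric positive-definite matrix with entries K_ij = K(x_i, x_j)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def rls_coefficients(K, y, lam):
    """Regularized least squares: by the representer theorem f = sum_j c_j K(., x_j),
    and minimizing the squared loss plus lam * ||f||^2 reduces to solving
    (K + lam * n * I) c = y."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), y)
```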
Special properties of the hinge loss
Figure: hinge and misclassification (0–1) loss functions.
The simplest and most intuitive loss function for categorization is the misclassification loss, or 0–1 loss, which is 0 if $f(x_i) = y_i$ and 1 if $f(x_i) \neq y_i$, i.e. the Heaviside step function on $-y_i f(x_i)$. However, this loss function is not convex, which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0–1 loss. The hinge loss, $V(y_i, f(x_i)) = (1 - yf(x))_+$, where $(s)_+ = \max(s, 0)$, provides such a convex relaxation. In fact, the hinge loss is the tightest convex upper bound to the 0–1 misclassification loss function,[4] and with infinite data returns the Bayes-optimal solution:[5][9]

$$f_b(x) = \begin{cases} 1, & \text{if } p(1\mid x) > p(-1\mid x), \\ -1, & \text{if } p(1\mid x) < p(-1\mid x). \end{cases}$$
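The upper-bound property is easy to check numerically. The following sketch (hypothetical names) evaluates both losses as functions of the margin y·f(x) and asserts that the hinge loss lies on or above the 0–1 loss everywhere:

```python
import numpy as np

def zero_one_loss(margin):
    """0-1 misclassification loss as a function of the margin y * f(x)."""
    return (margin <= 0).astype(float)

def hinge_loss(margin):
    """Hinge loss (1 - y * f(x))_+ as a function of the margin y * f(x)."""
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-3, 3, 601)
# The hinge loss dominates the 0-1 loss for every margin value.
assert np.all(hinge_loss(margins) >= zero_one_loss(margins))
```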
Derivation
The Tikhonov regularization problem can be shown to be equivalent to traditional formulations of SVM by expressing it in terms of the hinge loss.[10] With the hinge loss $V(y_i, f(x_i)) = (1 - yf(x))_+$, where $(s)_+ = \max(s, 0)$, the regularization problem becomes

$$f = \text{arg}\min_{f\in\mathcal{H}}\left\{\frac{1}{n}\sum_{i=1}^n (1-yf(x))_+ +\lambda||f||^2_\mathcal{H}\right\}.$$

Multiplying by $1/(2\lambda)$ yields

$$f = \text{arg}\min_{f\in\mathcal{H}}\left\{C\sum_{i=1}^n (1-yf(x))_+ +\frac{1}{2}||f||^2_\mathcal{H}\right\},$$

with $C = \frac{1}{2\lambda n}$, which is equivalent to the standard SVM minimization problem.
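Because the second objective is exactly the first multiplied by the positive constant 1/(2λ), both share the same minimizer. A quick numeric check of that proportionality (hypothetical names; the objectives are simply evaluated for a fixed vector of margins y_i·f(x_i) and a fixed value of ||f||², rather than solved):

```python
import numpy as np

def lambda_objective(margins, norm_sq, lam):
    """(1/n) * sum_i (1 - y_i f(x_i))_+ + lam * ||f||^2 (Tikhonov form)."""
    return np.mean(np.maximum(0.0, 1.0 - margins)) + lam * norm_sq

def C_objective(margins, norm_sq, C):
    """C * sum_i (1 - y_i f(x_i))_+ + (1/2) * ||f||^2 (standard SVM form)."""
    return C * np.sum(np.maximum(0.0, 1.0 - margins)) + 0.5 * norm_sq

rng = np.random.default_rng(0)
margins, norm_sq, lam = rng.normal(size=20), 3.0, 0.1
C = 1.0 / (2 * lam * len(margins))

# The SVM-form objective equals the Tikhonov-form objective scaled by 1/(2*lam),
# so the same function f minimizes both.
assert np.isclose(C_objective(margins, norm_sq, C),
                  lambda_objective(margins, norm_sq, lam) / (2 * lam))
```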
Notes and references
- ↑ 1.0 1.1
- ↑ Cortes, Corinna; Vapnik, Vladimir (1995). "Support-Vector Networks". Machine Learning 20 (3): 273–297.
- ↑
- ↑ 4.0 4.1
- ↑ 5.0 5.1
- ↑ A hypothesis space is the set of functions used to model the data in a machine learning problem. Each function corresponds to a hypothesis about the structure of the data. Typically the functions in a hypothesis space form a Hilbert space of functions with norm formed from the loss function.
- ↑ For insight on choosing the parameter, see, e.g., …
- ↑ See …
- ↑
- ↑ For a detailed derivation, see …