Welcome to the 33rd part of our machine learning tutorial series and the next part in our Support Vector Machine section. In this tutorial, we're going to close out our coverage of the Support Vector Machine by explaining classification with 3 or more classes, as well as walking through the parameters for the SVM in Scikit-Learn, both as a review and to bring you up to speed with the methodologies currently used with the SVM.
To begin, the SVM, as you have learned, is a binary classifier. This means that, at any one time, the SVM optimization is really tasked with separating one group from another. The question, then, is how we might classify a total of 3 or more groups. Typically, the method is what is referred to as "One-vs-Rest" (OVR). The idea here is that you separate each group from the rest. For example, to classify three separate groups (1, 2, and 3), you would start by separating 1 from 2 and 3. Then you would separate 2 from 1 and 3, and finally 3 from 1 and 2. There are some issues with this: confidence may differ per classification boundary, and the separation boundaries themselves may be slightly flawed, since there are almost always going to be more negatives than positives when you're comparing one group to two or more others. Assuming a balanced dataset at the start, this means every classification boundary is actually trained on unbalanced data.
Another method is One-vs-One (or OVO). In this case, consider again that you have three total groups. The way this works is you have a specific boundary that separates 1 from 2, another that separates 1 from 3, and another that separates 2 from 3: one boundary per pair of classes. In this way, each boundary may be more balanced.
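To make this concrete, here is a minimal sketch of both approaches using Scikit-Learn's multiclass wrappers (the iris dataset here is just a convenient 3-class example):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

# OVR: one boundary per class, each class vs everything else
ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)

# OVO: one boundary per pair of classes, n*(n-1)/2 in total
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)

print(len(ovr.estimators_))  # 3
print(len(ovo.estimators_))  # 3, since 3*2/2 = 3
```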
The first parameter is C. This tells you right away that this is a soft-margin classifier. You can adjust C however you like, and you could make C high enough to create a hard-margin classifier. Recall that C weights the slack terms against ||w|| in the soft-margin optimization, like so:

minimize 1/2*||w||^2 + C * sum(ξi), subject to yi(xi.w + b) >= 1 - ξi, with ξi >= 0

The default value for C is just a simple 1, and that really should be fine in most cases.
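As a quick illustration (a sketch only, using hypothetical toy data), raising C pushes the classifier toward a hard margin:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping blobs, purely for illustration
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

soft = SVC(kernel='linear', C=1.0).fit(X, y)  # default C: margin violations are cheap
hard = SVC(kernel='linear', C=1e6).fit(X, y)  # very large C: approaches a hard margin

# Fewer support vectors is typical as the margin hardens
print(sum(soft.n_support_), sum(hard.n_support_))
```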
Next we have a choice of kernel. The default here is the rbf kernel, but you can also have a linear kernel, a poly (for polynomial) kernel, a sigmoid kernel, or even a custom one of your choosing or design.
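If you do want a custom kernel, Scikit-Learn accepts any callable that returns the Gram matrix between two sets of samples. A minimal sketch (the kernel function name here is made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

clf = SVC(kernel='linear')  # built-in choices: 'rbf' (default), 'linear', 'poly', 'sigmoid'

def my_kernel(X, Y):
    # A custom kernel just returns the Gram matrix; this one mimics 'linear'
    return np.dot(X, Y.T)

custom_clf = SVC(kernel=my_kernel)
```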
Next, you have the degree value, defaulting to 3, which is just the degree of the polynomial, if you are using the poly value for the kernel.
gamma is where you can set the gamma value for the rbf kernel. You should leave this as auto.
coef0 allows you to adjust the independent term in your kernel function. Most likely you should leave this alone as well, and it is only used in the polynomial and sigmoid kernels.
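Since degree, gamma, and coef0 all feed into the polynomial kernel, K(x, y) = (gamma * <x, y> + coef0)^degree, here is what setting them explicitly looks like (the values shown are just the defaults spelled out):

```python
from sklearn.svm import SVC

# degree, gamma, and coef0 only take effect for the kernels that use them
poly_clf = SVC(kernel='poly', degree=3, gamma='auto', coef0=0.0)
```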
The probability parameter may prove useful to you. Recall how an algorithm like K Nearest Neighbors not only has a model accuracy, but each prediction also carries a degree of "confidence." The SVM doesn't inherently have an attribute like this, but you can use the probability parameter to enable a form of one. This is a costly functionality, but it may be important enough to you to enable it; otherwise, the default is False.
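When enabled, Scikit-Learn fits an extra cross-validated probability model internally, which is where the cost comes from. A brief sketch:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

clf = SVC(probability=True).fit(X, y)  # enables predict_proba, at extra training cost
print(clf.predict_proba(X[:2]))        # per-class "confidence" for each sample
```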
Next, we have the shrinking boolean, which defaults to True. This controls whether or not a shrinking heuristic is used in the optimization of the SVM, which is used in Sequential Minimal Optimization (SMO). You should leave this True, as it should greatly improve your performance for very little loss in accuracy in most cases.
The tol parameter is a setting for the SVM's tolerance in optimization. Recall that yi(xi.w + b) - 1 >= 0. For an SVM to be valid, all values must be greater than or equal to 0, and at least one value on each side needs to be "equal" to 0, which will be your support vectors. Since it is highly unlikely that you will get values perfectly equal to 0, you set a tolerance to allow a bit of wiggle room. The default tol with Scikit-Learn's SVM is 1e-3, which is 0.001.
The next important parameter is max_iter, which is where you can set a maximum number of iterations for the quadratic programming problem to cycle through while optimizing. The default is -1, which means there is no limit.
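Both of these knobs trade optimization precision for speed; the values below are purely illustrative, not recommendations:

```python
from sklearn.svm import SVC

# Looser tolerance and a hard iteration cap make training stop sooner
fast_clf = SVC(tol=1e-2, max_iter=10000)
```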
The decision_function_shape is one-vs-one (ovo) or one-vs-rest (ovr), which is the concept discussed at the beginning of this tutorial.
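You can see the difference directly in the shape of the decision function. A quick sketch with hypothetical 4-class data:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=4, random_state=0)  # 4 classes

ovr = SVC(decision_function_shape='ovr').fit(X, y)
ovo = SVC(decision_function_shape='ovo').fit(X, y)

print(ovr.decision_function(X[:1]).shape)  # (1, 4): one score per class
print(ovo.decision_function(X[:1]).shape)  # (1, 6): one score per pair, 4*3/2
```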
random_state is used as a seed for the probability estimation, if you want to specify it.
Aside from the parameters, we also have a few attributes:
support_ gives you the index values of the support vectors. support_vectors_ contains the actual support vectors. n_support_ tells you how many support vectors you have per class, which is useful to compare against your dataset size to determine whether you may have some statistical issues. The last three attributes, dual_coef_, coef_, and intercept_, will be useful if you plan to graph the SVM, for example.
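A short sketch of reading these attributes off a fitted linear SVM (toy data again):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=0)
clf = SVC(kernel='linear').fit(X, y)

print(clf.support_)          # indices into X of the support vectors
print(clf.support_vectors_)  # the support vectors themselves
print(clf.n_support_)        # number of support vectors per class

# With a linear kernel, coef_ and intercept_ give the hyperplane w.x + b = 0,
# which is exactly what you need to graph the decision boundary and margins
w, b = clf.coef_[0], clf.intercept_[0]
```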
This wraps up the Support Vector Machine. Our next topic is clustering.