
Xu Cui » SVM (support vector machine) with libsvm

2011-01-16

I have been learning SVM lately and tried libsvm. It's a good package.

Linear kernel example (support vectors are circled):

[Figure: linear kernel]

Nonlinear examples (radial basis kernel):

[Figure: nonlinear, circle]

[Figure: nonlinear, two circles]

[Figure: nonlinear, quadrant]

3-class example:

[Figure: linear, 3 classes]
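
The plots above were generated with the MatLab script linked in the resources at the end of this post. As a rough sketch of the same idea (the random data and variable names here are mine, not from that script), you can train on 2D points and circle the support vectors that libsvm returns in model.SVs:

    % Minimal sketch: linear SVM on random 2D data, support vectors circled.
    % Assumes the libsvm MatLab interface (svmtrain) is on the path.
    n = 100;
    d = [randn(n,2)+1; randn(n,2)-1];      % two Gaussian clouds
    l = [ones(n,1); -ones(n,1)];           % labels +1 / -1
    model = svmtrain(l, d, '-t 0 -c 1');   % -t 0: linear kernel
    plot(d(l==1,1), d(l==1,2), 'r.'); hold on;
    plot(d(l==-1,1), d(l==-1,2), 'b.');
    sv = full(model.SVs);                  % support vectors stored in the model
    plot(sv(:,1), sv(:,2), 'ko', 'MarkerSize', 10);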

Basic procedure to use libsvm:

  1. Preprocess your data. This includes normalization (scaling all values to between 0 and 1) and converting non-numeric values to numeric ones. You can use the following code to normalize each feature column (from the libsvm webpage):
    % scale each column of 'data' to [0, 1]
    (data - repmat(min(data,[],1),size(data,1),1))*spdiags(1./(max(data,[],1)-min(data,[],1))',0,size(data,2),size(data,2))
  2. Find optimal parameter values. For the linear kernel, there is one parameter, C (the penalty parameter). For the commonly used radial kernel, there are two parameters (C and gamma). Different parameter values yield different accuracy rates. To avoid overfitting, use n-fold cross validation. For example, 5-fold cross validation trains the SVM model on 4/5 of the data and tests it on the remaining 1/5, rotating through all five folds. The options -c, -g, and -v control the parameter C, gamma, and n-fold cross validation, respectively. A piece of code from the libsvm website is:
    bestcv = 0;
    for log2c = -1:3
        for log2g = -4:1
            cmd = ['-v 5 -c ', num2str(2^log2c), ' -g ', num2str(2^log2g)];
            cv = svmtrain(heart_scale_label, heart_scale_inst, cmd);  % with -v, returns CV accuracy
            if (cv >= bestcv)
                bestcv = cv; bestc = 2^log2c; bestg = 2^log2g;
            end
            fprintf('%g %g %g (best c=%g, g=%g, rate=%g)\n', log2c, log2g, cv, bestc, bestg, bestcv);
        end
    end
  3. You may have to run the above code several times with different ranges of parameter values to find the optimal ones. For example, start from a wide range at coarse resolution, then fine-tune over smaller regions at higher resolution (a combined sketch follows this list).
  4. After finding the optimal parameter values, use all data to train your model with your optimal parameter values.
    cmd = ['-t 2 -c ', num2str(bestc), ' -g ', num2str(bestg)];   % -t 2: radial kernel
    model = svmtrain(l, d, cmd);   % l: full label vector, d: full data matrix
  5. If you have new data, you may use this model to classify the new data.
    % dd: the new data; its true labels are unknown, so pass zeros as
    % placeholders (the reported accuracy is then meaningless)
    [predicted_label, accuracy, decision_values] = svmpredict(zeros(size(dd,1),1), dd, model);
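
Putting steps 2-4 together, here is a minimal sketch of the coarse-then-fine search mentioned in step 3. The ranges and step sizes are my own illustrative choices, and l and d stand for the label vector and the scaled data matrix:

    % Coarse pass: wide range, big steps.
    bestcv = 0;
    for log2c = -5:2:15
        for log2g = -15:2:3
            cv = svmtrain(l, d, ['-v 5 -c ', num2str(2^log2c), ' -g ', num2str(2^log2g)]);
            if (cv >= bestcv), bestcv = cv; bestc = 2^log2c; bestg = 2^log2g; end
        end
    end
    % Fine pass: zoom in around the best point with smaller steps.
    for log2c = log2(bestc)-1 : 0.25 : log2(bestc)+1
        for log2g = log2(bestg)-1 : 0.25 : log2(bestg)+1
            cv = svmtrain(l, d, ['-v 5 -c ', num2str(2^log2c), ' -g ', num2str(2^log2g)]);
            if (cv >= bestcv), bestcv = cv; bestc = 2^log2c; bestg = 2^log2g; end
        end
    end
    % Step 4: retrain on all data with the best parameters found.
    model = svmtrain(l, d, ['-t 2 -c ', num2str(bestc), ' -g ', num2str(bestg)]);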

Commonly used options

  • -v n: n-fold cross validation
  • -t 0: linear kernel
  • -t 2: radial basis (default)
  • -s 0: SVC type = C-SVC
  • -c: C parameter value, default 1
  • -g: gamma parameter value, default 1/(number of features)

libsvm performance

I tested it on different data sizes and recorded the time spent (in seconds).

Computer: 2×2.66 GHz processor, 12 GB memory; OS: Windows XP running in VMware on Mac OS X 10.5.

data size   # features   svmtrain (s)   svmpredict (s)
100         2            0.00           0.00
100         6            0.00           0.00
100         10           0.00           0.00
100         20           0.00           0.00
100         50           0.01           0.00
100         100          0.02           0.01
500         2            0.02           0.01
500         6            0.03           0.02
500         10           0.05           0.03
500         20           0.08           0.03
500         50           0.46           0.07
500         100          0.56           0.12
1000        2            0.07           0.04
1000        6            0.10           0.06
1000        10           0.15           0.10
1000        20           0.36           0.14
1000        50           1.09           0.30
1000        100          3.07           0.50

It’s fairly fast.
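
Timings like these can be collected with tic/toc around the two calls. Here is a sketch on random data (the sizes come from the table above; the rest is my own scaffolding):

    % Time svmtrain/svmpredict on random data of a given size (sketch;
    % loop nrow and nfeat over the sizes in the table to reproduce it).
    nrow = 1000; nfeat = 100;
    d = rand(nrow, nfeat);                 % random features already in [0, 1]
    l = double(rand(nrow,1) > 0.5);        % random binary labels
    tic; model = svmtrain(l, d, '-t 2'); t_train = toc;
    tic; svmpredict(l, d, model);        t_pred = toc;
    fprintf('svmtrain %.2f s, svmpredict %.2f s\n', t_train, t_pred);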

Resources:

MatLab code to generate the plots above: cuixu_test_svm1

SVM basics: http://en.wikipedia.org/wiki/Support_vector_machine

Download libsvm for matlab at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/#matlab

The meaning of libsvm output is explained at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f804
