The data come from the cats dataset in the MASS package. In this example you will try to predict a cat's sex from its body weight and heart weight. We set aside 20% of the data points for testing the model's accuracy and build the model on the remaining 80%.
data(cats, package="MASS")
mdata <- cats
dim(mdata)
## [1] 144 3
head(mdata)
## Sex Bwt Hwt
## 1 F 2.0 7.0
## 2 F 2.0 7.4
## 3 F 2.0 9.5
## 4 F 2.1 7.2
## 5 F 2.1 7.3
## 6 F 2.1 7.6
library(e1071)
inputData <- data.frame(mdata[,c(2,3)], response=as.factor(mdata$Sex)) # response as factor
The key arguments passed to svm() are kernel, cost, and gamma. kernel selects the type of SVM: linear, polynomial, radial, or sigmoid. cost is the penalty for violating the constraints, and gamma is a parameter used by every kernel except the linear one. There is also a type argument that specifies whether the model is used for regression, classification, or novelty detection, but it rarely needs to be set explicitly: svm() infers it from the class of the response variable, which may be a factor or a continuous variable. So for a classification problem, make sure your response variable is a factor.
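To see this auto-detection at work, here is a minimal sketch (svmExplicit is an illustrative name; the type strings are the standard e1071 ones):
# response is a factor, so svm() defaults to C-classification;
# spelling the type out gives an equivalent model
svmExplicit <- svm(response ~ ., data = inputData,
                   type = "C-classification", kernel = "linear", cost = 10, scale = FALSE)
# with a numeric response, the same call would instead default to
# type = "eps-regression"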
# linear SVM
svmfit <- svm(response~., data=inputData, kernel="linear", cost=10, scale=FALSE)
print(svmfit)
##
## Call:
## svm(formula = response ~ ., data = inputData, kernel = "linear",
## cost = 10, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
## gamma: 0.5
##
## Number of Support Vectors: 79
plot(svmfit, inputData)
compareTable <- table(inputData$response, predict(svmfit))
compareTable
##
## F M
## F 33 14
## M 14 83
# misclassification error
mean(inputData$response!=predict(svmfit))
## [1] 0.1944444
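Because the kernel is linear, the separating line can be recovered from the fitted object. A minimal sketch using the coefs, SV, and rho components that svm() returns (straightforward here because scale = FALSE keeps the support vectors on the original scale):
# weight vector and intercept of the decision function f(x) = w . x + b
w <- t(svmfit$coefs) %*% svmfit$SV
b <- -svmfit$rho
# the decision boundary is the line where w[1]*Bwt + w[2]*Hwt + b = 0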
The radial basis function is a popular kernel, selected by setting the kernel argument to "radial". With a radial kernel the separating surface no longer needs to be linear: the boundary between the classes is typically a curved region, which often yields higher accuracy on the same training data. The radial kernel computes K(u, v) = exp(-gamma * |u - v|^2), which is why gamma matters here but not for the linear kernel.
# radial svm, scaling turned OFF
svmfit <- svm(response ~ ., data = inputData, kernel = "radial", cost = 10, scale = FALSE)
print(svmfit)
##
## Call:
## svm(formula = response ~ ., data = inputData, kernel = "radial",
## cost = 10, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
## gamma: 0.5
##
## Number of Support Vectors: 83
plot(svmfit, inputData)
compareTable <- table(inputData$response, predict(svmfit))
compareTable
##
## F M
## F 36 11
## M 16 81
mean(inputData$response != predict(svmfit))
## [1] 0.1875
You can use the tune.svm() function to search for the optimal parameters for svm().
### Tuning
# Prepare training and test data
# for reproducing results
set.seed(100)
# prepare row indices
rowIndices <- 1:nrow(inputData)
# training sample size
sampleSize <- floor(0.8 * length(rowIndices))
# random sampling
trainingRows <- sample(rowIndices, sampleSize)
# training data
trainingData <- inputData[trainingRows,]
# test data
testData <- inputData[-trainingRows,]
# tune
tuned <- tune.svm(response~., data=trainingData, gamma=10^(-6:-1), cost=10^(1:2))
# to select best gamma and cost
summary(tuned)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.01 100
##
## - best performance: 0.2439394
##
## - Detailed performance results:
## gamma cost error dispersion
## 1 1e-06 10 0.3325758 0.16269565
## 2 1e-05 10 0.3325758 0.16269565
## 3 1e-04 10 0.3325758 0.16269565
## 4 1e-03 10 0.3325758 0.16269565
## 5 1e-02 10 0.2886364 0.12759149
## 6 1e-01 10 0.2530303 0.09517221
## 7 1e-06 100 0.3325758 0.16269565
## 8 1e-05 100 0.3325758 0.16269565
## 9 1e-04 100 0.3325758 0.16269565
## 10 1e-03 100 0.2795455 0.12486346
## 11 1e-02 100 0.2439394 0.09739752
## 12 1e-01 100 0.2530303 0.08668644
The results show that the lowest cross-validation error is obtained with cost = 100 and gamma = 0.001.
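Rather than reading the winning combination off the table, you can pull it out of the tuned object directly, using the components that tune.svm() documents:
# best parameter combination found by the grid search
tuned$best.parameters
# and its cross-validation error
tuned$best.performance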
Train a radial SVM with these parameters.
# radial svm, scaling turned OFF
svmfit <- svm(response ~ ., data = trainingData, kernel = "radial", cost = 100, gamma = 0.001, scale = FALSE)
print(svmfit)
##
## Call:
## svm(formula = response ~ ., data = trainingData, kernel = "radial",
## cost = 100, gamma = 0.001, scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 100
## gamma: 0.001
##
## Number of Support Vectors: 77
plot(svmfit, trainingData)
compareTable <- table(testData$response,predict(svmfit, testData))
compareTable
##
## F M
## F 6 3
## M 1 19
mean(testData$response!=predict(svmfit, testData))
## [1] 0.137931
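Alternatively, tuned$best.model is an svm fit that tune.svm() has already trained with the winning parameters; note that it uses svm()'s default scale = TRUE, unlike the scale = FALSE fit above, so its predictions can differ slightly:
# test error of the model refit by tune.svm() itself
mean(testData$response != predict(tuned$best.model, testData))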
A two-color grid plot makes the result easier to read: every point in the plotting region is colored with the class that the SVM classifier assigns to it. In the example below the grid is filled with many predicted points, the training data are drawn on top, and the support vectors are marked with open diamonds (pch = 5) around the corresponding points. Clearly there are quite a few points that cross the boundary and violate the constraints, but the SVM down-weights them internally.
# Grid Plot
# num grid points in a line
n_points_in_grid <- 60
# range of X axis
x_axis_range <- range(inputData[,2])
# range of Y axis
y_axis_range <- range(inputData[,1])
# grid points along x-axis
X_grid_points <- seq(from = x_axis_range[1], to = x_axis_range[2], length.out = n_points_in_grid)
# grid points along y-axis
Y_grid_points <- seq(from = y_axis_range[1], to = y_axis_range[2], length.out = n_points_in_grid)
# generate all grid points
all_grid_points <- expand.grid(X_grid_points, Y_grid_points)
# rename
names(all_grid_points) <- c('Hwt', 'Bwt')
# predict for all points in grid
all_points_predicted <- predict(svmfit, all_grid_points)
# colors for all points based on predictions
color_array <- c('red', 'blue')[as.numeric(all_points_predicted)]
# plot all grid points
plot(all_grid_points, col=color_array, pch=20, cex=0.25)
# plot data points
points(x = trainingData$Hwt, y = trainingData$Bwt, col = c('red', 'blue')[as.numeric(trainingData$response)], pch = 19)
# plot support vectors
points(trainingData[svmfit$index, c(2, 1)], pch = 5, cex = 2)
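To make the two colors self-explanatory you could also add a legend; a minimal sketch (the levels of response are "F" and "M", matching the red/blue order used above):
# label the classes with the colors used for the data points
legend('topleft', legend = levels(trainingData$response), col = c('red', 'blue'), pch = 19)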