The naive Bayes algorithm remains one of the ten most popular data mining algorithms. It is a supervised learning algorithm for classification, including multi-class problems such as predicting customer churn, judging whether an investment is worthwhile, or assigning credit ratings. Its strengths are that it is simple to understand, learns efficiently, and in some domains classifies as well as decision trees or neural networks. However, because it rests on the assumptions of conditional independence between the features and normality of continuous variables, its accuracy suffers to the degree that these assumptions are violated.
The idea is simple: from the prior probabilities, compute the posterior probability that the target variable Y belongs to each class, and predict the most probable class.
The naiveBayes() function in the e1071 add-on package implements the algorithm. Create a classifier with:
m <- naiveBayes(train, class, laplace = 0)
The function returns a naive Bayes model object, which can then be used for prediction:
p <- predict(m, test, type="class")
Depending on the type argument, predict() returns either a vector of predicted class labels or the raw posterior probabilities.
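A minimal runnable sketch, assuming the e1071 package is installed; the iris train/test split and the names idx and cls are illustrative:
library(e1071)
idx   <- sample(nrow(iris), 100)            # random train/test split
train <- iris[idx, 1:4]                     # feature columns of the training rows
test  <- iris[-idx, 1:4]
cls   <- iris$Species[idx]                  # class labels of the training rows
m <- naiveBayes(train, cls, laplace = 0)    # fit the classifier
p <- predict(m, test, type = "class")       # predicted labels for the test rows
table(p, iris$Species[-idx])                # confusion table against the true labels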
The klaR package in R also provides an implementation of naive Bayes, the function NaiveBayes(). Its usage and the meaning of its arguments:
NaiveBayes(formula, data, …, subset, na.action= na.pass)
NaiveBayes(x, grouping, prior, usekernel= FALSE, fL = 0, …)
formula specifies the variables entering the model, given as a formula such as y ~ x1 + x2 + x3;
data specifies the data object to analyze;
na.action specifies how missing values are handled; by default missing values are not included in the model computation and no error is raised; when set to na.omit, observations containing missing values are dropped;
x specifies the data to process, either a data frame or a matrix;
grouping specifies the class of each observation;
prior assigns a prior probability to each class; by default the sample proportion of each class is used as its prior;
usekernel selects the density-estimation method (density estimation is used when the distribution of the data cannot be determined); by default a normal density is assumed, and when set to TRUE a kernel density estimate is used instead;
fL controls the Laplace correction; by default no correction is applied, but with small samples it can be set to 1 to apply one. A minimal example follows this list.
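A minimal sketch of NaiveBayes(), assuming the klaR package is installed, again using iris for illustration:
library(klaR)
m2 <- NaiveBayes(Species ~ ., data = iris, usekernel = FALSE, fL = 0)
pred <- predict(m2, newdata = iris[, 1:4])   # a list with $class and $posterior
table(pred$class, iris$Species)              # compare predictions with the truth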
Prior probability: the probability obtained from previous data analysis is called the prior probability.
Posterior probability: the probability revised after new information is obtained is called the posterior probability. Bayesian classification works with posterior probabilities.
Step 1: preparation. This stage lays the groundwork for naive Bayes classification. Based on the specific problem, determine the feature attributes and partition each one appropriately, then manually label a portion of the items to form the training sample. The input of this stage is all items to be classified; the output is the feature attributes and the training sample. The quality of the classifier depends heavily on the feature attributes, their partitioning, and the quality of the training sample.
Step 2: classifier training. The main work is computing the frequency of each class in the training sample and the conditional probability of each feature-attribute partition given each class. The input is the feature attributes and the training sample; the output is the classifier.
Step 3: application. This stage uses the classifier to classify new items. The input is the classifier and the items to classify; the output is the mapping from items to classes.
Note in particular: the core of naive Bayes is the assumption that all components of the feature vector are independent given the class, which yields the decision rule below.
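Under that assumption the posterior factorizes: for a class c_i and feature values f_1, ..., f_n,

P{c_i | f_1, ..., f_n} ∝ P{c_i} · P{f_1 | c_i} · P{f_2 | c_i} · ... · P{f_n | c_i}

and the predicted class is the c_i that maximizes this product. The hand computation below evaluates exactly these quantities.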
x1 <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
x2 <- c("S","M","M","S","S","S","M","M","L","L","L","M","M","L","L")
Y <- c(-1,-1,1,1,-1,-1,-1,1,1,1,1,1,1,1,-1)
dat <- data.frame(x1 = x1, x2 = x2, Y = Y)  # x1 and Y stay numeric, x2 stays character
p_y1 <- sum(dat$Y == 1) / length(dat$Y)     # P(Y = 1)  = 9/15
p_y  <- sum(dat$Y == -1) / length(dat$Y)    # P(Y = -1) = 6/15
p_x11_y1 <- nrow(dat[dat$x1 == 1   & dat$Y == 1, ])  / sum(dat$Y == 1)   # P(x1 = 1   | Y = 1)
p_x12_y1 <- nrow(dat[dat$x1 == 2   & dat$Y == 1, ])  / sum(dat$Y == 1)   # P(x1 = 2   | Y = 1)
p_x13_y1 <- nrow(dat[dat$x1 == 3   & dat$Y == 1, ])  / sum(dat$Y == 1)   # P(x1 = 3   | Y = 1)
p_x2s_y1 <- nrow(dat[dat$x2 == "S" & dat$Y == 1, ])  / sum(dat$Y == 1)   # P(x2 = "S" | Y = 1)
p_x2m_y1 <- nrow(dat[dat$x2 == "M" & dat$Y == 1, ])  / sum(dat$Y == 1)   # P(x2 = "M" | Y = 1)
p_x2l_y1 <- nrow(dat[dat$x2 == "L" & dat$Y == 1, ])  / sum(dat$Y == 1)   # P(x2 = "L" | Y = 1)
p_x11_y  <- nrow(dat[dat$x1 == 1   & dat$Y == -1, ]) / sum(dat$Y == -1)  # P(x1 = 1   | Y = -1)
p_x12_y  <- nrow(dat[dat$x1 == 2   & dat$Y == -1, ]) / sum(dat$Y == -1)  # P(x1 = 2   | Y = -1)
p_x13_y  <- nrow(dat[dat$x1 == 3   & dat$Y == -1, ]) / sum(dat$Y == -1)  # P(x1 = 3   | Y = -1)
p_x2s_y  <- nrow(dat[dat$x2 == "S" & dat$Y == -1, ]) / sum(dat$Y == -1)  # P(x2 = "S" | Y = -1)
p_x2m_y  <- nrow(dat[dat$x2 == "M" & dat$Y == -1, ]) / sum(dat$Y == -1)  # P(x2 = "M" | Y = -1)
p_x2l_y  <- nrow(dat[dat$x2 == "L" & dat$Y == -1, ]) / sum(dat$Y == -1)  # P(x2 = "L" | Y = -1)
## given x = (2, "S"), compute the unnormalized posterior
## under Y == 1
(p_x12_x2s_y1 <- p_y1 * p_x12_y1 * p_x2s_y1)
## [1] 0.02222222
## under Y == -1
(p_x12_x2s_y <- p_y * p_x12_y * p_x2s_y)
## [1] 0.06666667
max(c(p_x12_x2s_y1, p_x12_x2s_y))  # the larger score wins, so predict Y = -1
## [1] 0.06666667
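As a quick cross-check, the same prediction can be reproduced with e1071 (a sketch, assuming the package is installed; x1 is converted to a factor so it is treated as categorical rather than Gaussian):
library(e1071)
m <- naiveBayes(x = data.frame(x1 = factor(dat$x1), x2 = dat$x2),
                y = factor(dat$Y))
predict(m, data.frame(x1 = "2", x2 = "S"))   # also predicts -1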
#!/usr/bin/Rscript
# build the training set
data <- matrix(c("sunny","hot","high","weak","no",
"sunny","hot","high","strong","no",
"overcast","hot","high","weak","yes",
"rain","mild","high","weak","yes",
"rain","cool","normal","weak","yes",
"rain","cool","normal","strong","no",
"overcast","cool","normal","strong","yes",
"sunny","mild","high","weak","no",
"sunny","cool","normal","weak","yes",
"rain","mild","normal","weak","yes",
"sunny","mild","normal","strong","yes",
"overcast","mild","high","strong","yes",
"overcast","hot","normal","weak","yes",
"rain","mild","high","strong","no"),
byrow = TRUE,
dimnames = list(day = c(),condition = c("outlook","temperature","humidity","wind","playtennis")),
nrow=14,
ncol=5);
# compute the prior probabilities
prior.yes = sum(data[,5] == "yes") / length(data[,5]);
prior.no = sum(data[,5] == "no") / length(data[,5]);
# the naive Bayes model
naive.bayes.prediction <- function(condition.vec) {
# Calculate unnormalized posterior probability for playtennis = yes.
playtennis.yes <-
sum((data[,1] == condition.vec[1]) & (data[,5] == "yes")) / sum(data[,5] == "yes") * # P(outlook = f_1 | playtennis = yes)
sum((data[,2] == condition.vec[2]) & (data[,5] == "yes")) / sum(data[,5] == "yes") * # P(temperature = f_2 | playtennis = yes)
sum((data[,3] == condition.vec[3]) & (data[,5] == "yes")) / sum(data[,5] == "yes") * # P(humidity = f_3 | playtennis = yes)
sum((data[,4] == condition.vec[4]) & (data[,5] == "yes")) / sum(data[,5] == "yes") * # P(wind = f_4 | playtennis = yes)
prior.yes; # P(playtennis = yes)
# Calculate unnormalized posterior probability for playtennis = no.
playtennis.no <-
sum((data[,1] == condition.vec[1]) & (data[,5] == "no")) / sum(data[,5] == "no") * # P(outlook = f_1 | playtennis = no)
sum((data[,2] == condition.vec[2]) & (data[,5] == "no")) / sum(data[,5] == "no") * # P(temperature = f_2 | playtennis = no)
sum((data[,3] == condition.vec[3]) & (data[,5] == "no")) / sum(data[,5] == "no") * # P(humidity = f_3 | playtennis = no)
sum((data[,4] == condition.vec[4]) & (data[,5] == "no")) / sum(data[,5] == "no") * # P(wind = f_4 | playtennis = no)
prior.no; # P(playtennis = no)
return(list(post.pr.yes = playtennis.yes,
post.pr.no = playtennis.no,
prediction = ifelse(playtennis.yes >= playtennis.no, "yes", "no")));
}
# predictions
naive.bayes.prediction(c("overcast", "mild", "normal", "weak"))
## $post.pr.yes
## [1] 0.05643739
##
## $post.pr.no
## [1] 0
##
## $prediction
## [1] "yes"
naive.bayes.prediction(c("rain","hot","high","strong"))
## $post.pr.yes
## [1] 0.005291005
##
## $post.pr.no
## [1] 0.02742857
##
## $prediction
## [1] "no"
naive.bayes.prediction(c("sunny","mild","normal","weak"))
## $post.pr.yes
## [1] 0.02821869
##
## $post.pr.no
## [1] 0.006857143
##
## $prediction
## [1] "yes"
#!/usr/bin/Rscript
library(plyr)
library(reshape2)
#1. Build a naive Bayes classifier from the training set
#1.1. Class probabilities
## Compute the probability of each class in training set D, i.e. P{c_i}
## Input:  trainData    the training set, as a data frame
##         strClassName the name of the column of trainData holding the class labels
## Output: a data frame of P{c_i}: class name | probability (column name prob)
class_prob <- function(trainData, strClassName){
# number of training observations (nrow returns the row count)
length_train <- nrow(trainData)
dTemp <- ddply(trainData, strClassName, "nrow")                        # rows per class
dTemp <- transform(dTemp, length_train = length_train)                 # attach the total
dTemp <- ddply(dTemp, strClassName, mutate, prob = nrow/length_train)  # relative frequency
dTemp[, -c(2, 3)]                                                      # keep class name and prob
}
##1.2. Conditional probabilities of feature values within each class
## Compute, for training set D, the probability of each feature value given each class, i.e. P{fi|c_i}
## Input:  trainData    the training set, as a data frame
##         strClassName the column holding the class labels; all other columns are treated as features
## Output: a data frame of P{fi|c_i}: class name | feature name | feature value | probability (column name prob)
feature_class_prob <- function(trainData, strClassName){
# reshape from wide to long: one row per (class, feature, value)
data.melt <- melt(trainData,id=c(strClassName))
# count the frequency of each (class, feature, value) combination
aa <- ddply(data.melt, c(strClassName,"variable","value"), "nrow")
# turn the counts into conditional probabilities within each (class, feature)
bb <- ddply(aa, c(strClassName,"variable"), mutate, sum = sum(nrow), prob = nrow/sum)
# set the output column names
colnames(bb) <- c("class.name",
"feature.name",
"feature.value",
"feature.nrow",
"feature.sum",
"prob")
# return class name | feature name | feature value | prob
bb[,c(1,2,3,6)]
}
## This completes the naive Bayes classifier.
## 2. Use the classifier for prediction
## Predict using the fitted quantities P{c_i} and P{fi|c_i}
## Input:  oneObs data frame with the sample to predict, format: feature name | feature value
##         pc     data frame with the class probabilities P{c_i} of the training set: class name | probability
##         pfc    data frame with the probability P{fi|c_i} of each feature value within each class:
##                class name | feature name | feature value | probability
## Output: a data frame with the (unnormalized) posterior of each class: class name | score (column name pre_prob)
pre_class <- function(oneObs, pc,pfc){
colnames(oneObs) <- c("feature.name", "feature.value")
colnames(pc) <- c("class.name","prob")
colnames(pfc) <- c("class.name","feature.name","feature.value","prob")
# look up the conditional probability of each observed feature value
feature.all <- join(oneObs,pfc,by=c("feature.name","feature.value"),type="inner")
# multiply the conditional probabilities within each class (prod() takes the product)
feature.prob <- ddply(feature.all,.(class.name),summarize,prob_fea=prod(prob))
# attach the class probabilities
class.all <- join(feature.prob,pc,by="class.name",type="inner")
# output: class name | unnormalized posterior
ddply(class.all,.(class.name),mutate,pre_prob=prob_fea*prob)[,c(1,4)]
}
##3. Test 1
## test with a small apple data set as the example
# training set
train.apple <- data.frame(
  size   = c("large", "small", "large", "large", "small", "small"),
  weight = c("light", "heavy", "light", "light", "heavy", "light"),
  color  = c("red", "red", "red", "green", "red", "green"),
  taste  = c("good", "good", "bad", "bad", "bad", "good")
)
# sample to predict
oneObs <- data.frame(
  feature.name  = c("size", "weight", "color"),
  feature.value = c("large", "heavy", "red")
)
# predict the class
pc <- class_prob(train.apple, "taste")
pfc <- feature_class_prob(train.apple, "taste")
pre_class(oneObs, pc, pfc)
## class.name pre_prob
## 1 bad 0.07407407
## 2 good 0.03703704
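To turn the scores into a single predicted class, take the row with the largest score:
res <- pre_class(oneObs, pc, pfc)
res$class.name[which.max(res$pre_prob)]   # "bad" for this sample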
##4. Test 2
# data set
mdata <- matrix(c("sunny","hot","high","weak","no",
"sunny","hot","high","strong","no",
"overcast","hot","high","weak","yes",
"rain","mild","high","weak","yes",
"rain","cool","normal","weak","yes",
"rain","cool","normal","strong","no",
"overcast","cool","normal","strong","yes",
"sunny","mild","high","weak","no",
"sunny","cool","normal","weak","yes",
"rain","mild","normal","weak","yes",
"sunny","mild","normal","strong","yes",
"overcast","mild","high","strong","yes",
"overcast","hot","normal","weak","yes",
"rain","mild","high","strong","no"),
byrow = TRUE,
dimnames = list(day = c(),condition = c("outlook","temperature","humidity","wind","playtennis")),
nrow=14,
ncol=5)
mdata <- data.frame(mdata)
# sample to predict
ddata <- data.frame(
feature.name =c("outlook", "temperature","humidity","wind"),
feature.value =c("overcast","mild","normal","weak")
)
# predict the class
pc <- class_prob(mdata, "playtennis")
pfc <- feature_class_prob(mdata,"playtennis")
pre_class(ddata, pc, pfc)
## class.name pre_prob
## 1 no 0.01142857
## 2 yes 0.05643739
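Note the difference from the first script, which returned post.pr.no = 0 for this same query: because pre_class joins with type = "inner", a feature value never observed together with a class (here outlook = "overcast" with playtennis = "no") is silently dropped from the product instead of contributing a zero factor, so that class's score is overstated. A Laplace correction (e.g. the fL argument of klaR's NaiveBayes) is the usual way to handle such unseen combinations.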
# (I) Bayes discriminant analysis for two populations
#1. Load the data
TrnX1<-matrix(
c(24.8, 24.1, 26.6, 23.5, 25.5, 27.4,
-2.0, -2.4, -3.0, -1.9, -2.1, -3.1),
ncol=2)
TrnX2<-matrix(
c(22.1, 21.6, 22.0, 22.8, 22.7, 21.5, 22.1, 21.4,
-0.7, -1.4, -0.8, -1.6, -1.5, -1.0, -1.2, -1.3),
ncol=2)
#2. The Bayes discriminant function for two populations
# Bayes discriminant routine for two populations
# Input: TrnX1, TrnX2  training samples from classes X1 and X2, as data frames
#        rate = p2/p1, defaulting to 1
#        TstX          samples to classify, as a data frame; defaults to the two training samples combined
#        var.equal     logical; TRUE means the two populations are assumed to share a covariance matrix
# Output: a one-row matrix of 1s and 2s, where 1 means the sample is assigned to class X1
discriminiant.bayes <- function(TrnX1, TrnX2, rate = 1, TstX = NULL, var.equal = FALSE){
if (is.null(TstX) == TRUE) TstX<-rbind(TrnX1,TrnX2)
if (is.vector(TstX) == TRUE) TstX <- t(as.matrix(TstX))
else if (is.matrix(TstX) != TRUE)
TstX <- as.matrix(TstX)
if (is.matrix(TrnX1) != TRUE) TrnX1 <- as.matrix(TrnX1)
if (is.matrix(TrnX2) != TRUE) TrnX2 <- as.matrix(TrnX2)
nx <- nrow(TstX)
blong <- matrix(rep(0, nx), nrow=1, byrow=TRUE,
dimnames=list("blong", 1:nx))
mu1 <- colMeans(TrnX1); mu2 <- colMeans(TrnX2)
if (var.equal == TRUE || var.equal == T){
S <- var(rbind(TrnX1,TrnX2)); beta <- 2*log(rate)
w <- mahalanobis(TstX, mu2, S) -
mahalanobis(TstX, mu1, S)
}
else{
S1 <- var(TrnX1); S2 <- var(TrnX2)
beta <- 2*log(rate) + log(det(S1)/det(S2))
w <- mahalanobis(TstX, mu2, S2) -
mahalanobis(TstX, mu1, S1)
}
for (i in 1:nx){
if (w[i] > beta)
blong[i] <- 1
else
blong[i] <- 2
}
blong
}
#3. Discrimination when the covariance matrices are equal
discriminiant.bayes(TrnX1, TrnX2, rate=8/6,var.equal=TRUE)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## blong 1 1 1 1 1 1 1 1 1 2 2 2 2 1
# discrimination when the covariance matrices differ
discriminiant.bayes(TrnX1, TrnX2, rate=8/6)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## blong 1 1 1 1 1 1 1 2 2 1 2 2 2 2
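The rule behind these results: a sample x is assigned to X1 when

W(x) = d²(x, mu2) − d²(x, mu1) > beta

where d² is the squared Mahalanobis distance and the threshold is beta = 2·log(rate) in the equal-covariance case, or beta = 2·log(rate) + log(det(S1)/det(S2)) with each d² computed under its own class covariance otherwise. This is exactly what the w and beta lines of the function compute, which is why the two runs can disagree on borderline samples.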
# (II) Bayes discriminant analysis for several populations
X <- iris[,1:4]
G <- gl(3,50)
# Bayes discriminant routine for several populations
# Input: TrnX      the training sample, as a data frame
#        TrnG      a factor giving the class of each training observation
#        p         the vector of prior probabilities; the default gives every class equal weight
#        TstX      samples to classify, as a data frame; defaults to the training sample
#        var.equal logical; TRUE means all populations are assumed to share a covariance matrix
# Output: a one-row matrix of class indices, where i means the sample is assigned to class i
distinguish.bayes <- function(TrnX, TrnG, p = rep(1, length(levels(TrnG))),
TstX = NULL, var.equal = FALSE){
if ( is.factor(TrnG) == FALSE){
mx <- nrow(TrnX); mg <- nrow(TrnG)
TrnX <- rbind(TrnX, TrnG)
TrnG <- factor(rep(1:2, c(mx, mg)))
}
if (is.null(TstX) == TRUE) TstX <- TrnX
if (is.vector(TstX) == TRUE) TstX <- t(as.matrix(TstX))
else if (is.matrix(TstX) != TRUE)
TstX <- as.matrix(TstX)
if (is.matrix(TrnX) != TRUE) TrnX <- as.matrix(TrnX)
nx <- nrow(TstX)
blong <- matrix(rep(0, nx), nrow=1,
dimnames=list("blong", 1:nx))
g <- length(levels(TrnG))
mu <- matrix(0, nrow=g, ncol=ncol(TrnX))
for (i in 1:g)
mu[i,] <- colMeans(TrnX[TrnG==i,])
D <- matrix(0, nrow=g, ncol=nx)   # discriminant scores; the smallest score wins
if (var.equal == TRUE || var.equal == T){
for (i in 1:g){
d2 <- mahalanobis(TstX, mu[i,], var(TrnX))   # distance under the pooled covariance
D[i,] <- d2 - 2*log(p[i])
}
}
else{
for (i in 1:g){
S <- var(TrnX[TrnG==i,])                     # class-specific covariance
d2 <- mahalanobis(TstX, mu[i,], S)
D[i,] <- d2 - 2*log(p[i])-log(det(S))
}
}
for (j in 1:nx){
dmin <- Inf
for (i in 1:g)
if (D[i,j] < dmin){
dmin <- D[i,j]; blong[j] <- i
}
}
blong
}
distinguish.bayes(X,G)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## blong 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## blong 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## blong 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
## blong 3 2 2 2 2 3 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2
## 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
## blong 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3
## 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130
## blong 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147
## blong 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 148 149 150
## blong 3 3 3
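As a quick sanity check (pred is an illustrative name introduced here), the assignments can be tabulated against the true species labels; the output above shows that the errors sit on the versicolor/virginica boundary:
pred <- as.vector(distinguish.bayes(X, G))
table(true = G, predicted = pred)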
To summarize:
1. Naive Bayes is supervised learning (it requires a training set).
2. It mainly handles discrete data; continuous data can be discretized first (see the sketch after this list).
3. The feature values in the training set should be as complete as possible; missing values need preprocessing.
4. The assumption that the features are mutually independent generally cannot be satisfied in real problems, but predictions made under it are usually acceptable.
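A minimal sketch of point 2, discretizing a continuous feature with cut(); the breaks and labels are illustrative:
# bin a continuous feature into three categories before fitting naive Bayes
iris$sl.bin <- cut(iris$Sepal.Length, breaks = 3, labels = c("low", "mid", "high"))
table(iris$sl.bin)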