多分变量所取的离散值个数多于两个,如果各种结果之间没有自然顺序的话,则称为无序变量。例如:当购买洗衣粉时,你可以在众多可选择的洗衣粉品牌之间进行选择。如果各种结果之间有一个内在的自然顺序,则为有序变量。例如,对债券等级的排序;老师给学生A、B、C、D、E五个等级的成绩。多分变量为因变量的模型称为多元选择模型(Multinomial Model),其中又有有序选择模型(Ordered Model)、条件模型(Conditional Model)和嵌套模型(Nested Model)等分类。
有序选择模型假设因变量是有先后顺序或者等级的,如债券等级、考试成绩等级等。
问题的提出:本科生申请研究生的影响因素
为了解影响本科生申请研究生的因素,向大三学生进行问卷调查,共收集了400份有效问卷。该数据共有四个变量,apply是有序变量,有三个水平,表示申请研究生的愿望程度,分别是Unlikely、Somewhat likely和Very likely,并分别编码为0、1、2;pared是0/1虚拟变量,表示父母是否至少有一个有研究生学位;public是0/1虚拟变量,1表示本科学校是公立的,0表示是私立的;gpa是学生的平均绩点。如何研究pared、public和gpa等因素对本科生申请研究生可能性的影响?以apply为因变量,其他三个变量为自变量,建立有序Logit模型和有序Probit模型。
mdata <- read.csv('http://data.galaxystatistics.com/blog_data/regression/ologit.csv',header=T)
head(mdata)
## id apply pared public gpa
## 1 1 very likely 0 0 3.26
## 2 2 somewhat likely 1 0 3.21
## 3 3 unlikely 1 1 3.94
## 4 4 somewhat likely 0 0 2.81
## 5 5 somewhat likely 0 0 2.53
## 6 6 unlikely 0 1 2.59
modeldata <- mdata[-1]
# 描述性统计计算
summary(modeldata)
## apply pared public gpa
## somewhat likely:140 Min. :0.0000 Min. :0.0000 Min. :1.900
## unlikely :220 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2.720
## very likely : 40 Median :0.0000 Median :0.0000 Median :2.990
## Mean :0.1575 Mean :0.1425 Mean :2.999
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:3.270
## Max. :1.0000 Max. :1.0000 Max. :4.000
#数据集字段名称转换
renames <- function(x) {
ocol <- ncol(x)
if (ocol==2) {
xnam <- paste0("X")
names(x) <- c('Y', xnam)
}else if(ocol>2) {
xnam <- paste0("X", 1:(ocol-1))
names(x) <- c('Y', xnam)
}
return(x)
}
modeldata <- renames(modeldata)
head(modeldata)
## Y X1 X2 X3
## 1 very likely 0 0 3.26
## 2 somewhat likely 1 0 3.21
## 3 unlikely 1 1 3.94
## 4 somewhat likely 0 0 2.81
## 5 somewhat likely 0 0 2.53
## 6 unlikely 0 1 2.59
class(modeldata)
## [1] "data.frame"
dim(modeldata)
## [1] 400 4
regFormula <- as.formula(paste("Y ~ ", paste(names(modeldata)[-1], collapse= "+")))
regFormula
## Y ~ X1 + X2 + X3
# save.image()
# getwd()
# load('.RData')
#有序Logit模型和有序Probit模型可以用MASS包里的polr()函数拟合:
library(MASS)
# method设为logistic
Logit.fit <- polr(regFormula, data=modeldata, method="logistic")
#展示拟合模型的详细结果
summary(Logit.fit)
## Call:
## polr(formula = regFormula, data = modeldata, method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## X1 -0.2635 0.2876 -0.9162
## X2 0.5600 0.2955 1.8953
## X3 -0.0708 0.2533 -0.2795
##
## Intercepts:
## Value Std. Error t value
## somewhat likely|unlikely -0.7959 0.7516 -1.0589
## unlikely|very likely 2.0445 0.7606 2.6882
##
## Residual Deviance: 736.9261
## AIC: 746.9261
#列出拟合模型的模型参数
coefficients(Logit.fit)
## X1 X2 X3
## -0.26352826 0.55999663 -0.07080247
head(predict(Logit.fit, modeldata[-1], type='probs'),10)
## somewhat likely unlikely very likely
## 1 0.3623770 0.5444390 0.09318406
## 2 0.4243176 0.5022656 0.07341681
## 3 0.3071697 0.5764373 0.11639299
## 4 0.3550479 0.5490406 0.09591147
## 5 0.3505214 0.5518342 0.09764435
## 6 0.2364040 0.6048954 0.15870056
## 7 0.3510051 0.5515375 0.09745735
## 8 0.3537519 0.5498443 0.09640376
## 9 0.3581344 0.5471143 0.09475130
## 10 0.4293408 0.4986270 0.07203221
# load('.RData')
#有序Logit模型也可以用rms包中的lrm()函数拟合:
##install.packages('rms')
library(rms)
Logit.fit1 <- lrm(regFormula, data=modeldata)
Logit.fit1
##
## Logistic Regression Model
##
## lrm(formula = regFormula, data = modeldata)
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 400 LR chi2 4.28 R2 0.013 C 0.553
## somewhat likely 140 d.f. 3 g 0.197 Dxy 0.106
## unlikely 220 Pr(> chi2) 0.2329 gr 1.218 gamma 0.111
## very likely 40 gp 0.043 tau-a 0.060
## max |deriv| 1e-07 Brier 0.224
##
## Coef S.E. Wald Z Pr(>|Z|)
## y>=unlikely 0.7959 0.7516 1.06 0.2897
## y>=very likely -2.0445 0.7606 -2.69 0.0072
## X1 -0.2635 0.2876 -0.92 0.3596
## X2 0.5600 0.2955 1.90 0.0581
## X3 -0.0708 0.2533 -0.28 0.7799
# 可见对于有序的Logit模型,lrm()函数拟合的结果与polr()函数拟合的结果相同。并且lrm()函数还给出了似然比检验结果。
# load('.RData')
#method设为probit
library(MASS)
Probit.fit <- polr(regFormula, data=modeldata, method="probit")
#展示拟合模型的详细结果
summary(Probit.fit)
## Call:
## polr(formula = regFormula, data = modeldata, method = "probit")
##
## Coefficients:
## Value Std. Error t value
## X1 -0.068852 0.1607 -0.42837
## X2 0.313734 0.1670 1.87906
## X3 -0.006178 0.1498 -0.04125
##
## Intercepts:
## Value Std. Error t value
## somewhat likely|unlikely -0.3724 0.4449 -0.8370
## unlikely|very likely 1.3045 0.4499 2.8998
##
## Residual Deviance: 737.451
## AIC: 747.451
#列出拟合模型的模型参数
coefficients(Probit.fit)
## X1 X2 X3
## -0.068851511 0.313734276 -0.006178258
head(predict(Probit.fit, modeldata[-1], type='probs'),10)
## somewhat likely unlikely very likely
## 1 0.3623109 0.5450495 0.09263955
## 2 0.3883025 0.5299204 0.08177706
## 3 0.2766006 0.5842155 0.13918394
## 4 0.3612690 0.5456293 0.09310166
## 5 0.3606212 0.5459887 0.09339005
## 6 0.2513772 0.5916076 0.15701523
## 7 0.3606906 0.5459503 0.09335912
## 8 0.3610839 0.5457321 0.09318399
## 9 0.3617088 0.5453849 0.09290634
## 10 0.3889893 0.5295042 0.08150658