本科生申请研究生的影响因素

摘要

多分变量所取的离散值个数多于两个,如果各种结果之间没有自然顺序的话,则称为无序变量。例如:当购买洗衣粉时,你可以在众多可选择的洗衣粉品牌之间进行选择。如果各种结果之间有一个内在的自然顺序,则为有序变量。例如,对债券等级的排序;老师给学生A、B、C、D、E五个等级的成绩。多分变量为因变量的模型称为多元选择模型(Multinomial Model),其中又有有序选择模型(Ordered Model)、条件模型(Conditional Model)和嵌套模型(Nested Model)等分类。

有序选择模型假设因变量是有先后顺序或者等级的,如债券等级、考试成绩等级等。

问题的提出:本科生申请研究生的影响因素

为了解影响本科生申请研究生的因素,向大三学生进行问卷调查,共收集了400份有效问卷。该数据共有四个变量,apply是有序变量,有三个水平,表示申请研究生的愿望程度,分别是Unlikely、Somewhat likely和Very likely,并分别编码为0、1、2;pared是0/1虚拟变量,表示父母是否至少有一个有研究生学位;public是0/1虚拟变量,1表示本科学校是公立的,0表示是私立的;gpa是学生的平均绩点。如何研究pared、public和gpa等因素对本科生申请研究生可能性的影响?以apply为因变量,其他三个变量为自变量,建立有序Logit模型和有序Probit模型。

数据集

mdata <- read.csv('http://data.galaxystatistics.com/blog_data/regression/ologit.csv',header=T)
head(mdata)
##   id           apply pared public  gpa
## 1  1     very likely     0      0 3.26
## 2  2 somewhat likely     1      0 3.21
## 3  3        unlikely     1      1 3.94
## 4  4 somewhat likely     0      0 2.81
## 5  5 somewhat likely     0      0 2.53
## 6  6        unlikely     0      1 2.59
modeldata <- mdata[-1]
# 描述性统计计算
summary(modeldata)
##              apply         pared            public            gpa       
##  somewhat likely:140   Min.   :0.0000   Min.   :0.0000   Min.   :1.900  
##  unlikely       :220   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2.720  
##  very likely    : 40   Median :0.0000   Median :0.0000   Median :2.990  
##                        Mean   :0.1575   Mean   :0.1425   Mean   :2.999  
##                        3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:3.270  
##                        Max.   :1.0000   Max.   :1.0000   Max.   :4.000
#数据集字段名称转换
renames <- function(x) {
  ocol <- ncol(x)
  if (ocol==2) {
    xnam <- paste0("X")
    names(x) <- c('Y', xnam)
  }else if(ocol>2) {
    xnam <- paste0("X", 1:(ocol-1))
    names(x) <- c('Y', xnam)
  }
  return(x)
}
modeldata <- renames(modeldata)
head(modeldata)
##                 Y X1 X2   X3
## 1     very likely  0  0 3.26
## 2 somewhat likely  1  0 3.21
## 3        unlikely  1  1 3.94
## 4 somewhat likely  0  0 2.81
## 5 somewhat likely  0  0 2.53
## 6        unlikely  0  1 2.59
class(modeldata)
## [1] "data.frame"
dim(modeldata)
## [1] 400   4
regFormula <- as.formula(paste("Y ~ ", paste(names(modeldata)[-1], collapse= "+")))
regFormula
## Y ~ X1 + X2 + X3
# save.image()
# getwd()

有序Logit模型1

# load('.RData')
#有序Logit模型和有序Probit模型可以用MASS包里的polr()函数拟合:
library(MASS)
# method设为logistic
Logit.fit <- polr(regFormula, data=modeldata, method="logistic")
#展示拟合模型的详细结果
summary(Logit.fit)
## Call:
## polr(formula = regFormula, data = modeldata, method = "logistic")
## 
## Coefficients:
##      Value Std. Error t value
## X1 -0.2635     0.2876 -0.9162
## X2  0.5600     0.2955  1.8953
## X3 -0.0708     0.2533 -0.2795
## 
## Intercepts:
##                          Value   Std. Error t value
## somewhat likely|unlikely -0.7959  0.7516    -1.0589
## unlikely|very likely      2.0445  0.7606     2.6882
## 
## Residual Deviance: 736.9261 
## AIC: 746.9261
#列出拟合模型的模型参数
coefficients(Logit.fit) 
##          X1          X2          X3 
## -0.26352826  0.55999663 -0.07080247
head(predict(Logit.fit, modeldata[-1], type='probs'),10)
##    somewhat likely  unlikely very likely
## 1        0.3623770 0.5444390  0.09318406
## 2        0.4243176 0.5022656  0.07341681
## 3        0.3071697 0.5764373  0.11639299
## 4        0.3550479 0.5490406  0.09591147
## 5        0.3505214 0.5518342  0.09764435
## 6        0.2364040 0.6048954  0.15870056
## 7        0.3510051 0.5515375  0.09745735
## 8        0.3537519 0.5498443  0.09640376
## 9        0.3581344 0.5471143  0.09475130
## 10       0.4293408 0.4986270  0.07203221

有序Logit模型2

# load('.RData')
#有序Logit模型也可以用rms包中的lrm()函数拟合:
##install.packages('rms')
library(rms)
Logit.fit1 <- lrm(regFormula, data=modeldata)
Logit.fit1
## 
## Logistic Regression Model
## 
## lrm(formula = regFormula, data = modeldata)
##                         Model Likelihood     Discrimination    Rank Discrim.    
##                            Ratio Test            Indexes          Indexes       
## Obs              400    LR chi2      4.28    R2       0.013    C       0.553    
##  somewhat likely 140    d.f.            3    g        0.197    Dxy     0.106    
##  unlikely        220    Pr(> chi2) 0.2329    gr       1.218    gamma   0.111    
##  very likely      40                         gp       0.043    tau-a   0.060    
## max |deriv|    1e-07                         Brier    0.224                     
## 
##                Coef    S.E.   Wald Z Pr(>|Z|)
## y>=unlikely     0.7959 0.7516  1.06  0.2897  
## y>=very likely -2.0445 0.7606 -2.69  0.0072  
## X1             -0.2635 0.2876 -0.92  0.3596  
## X2              0.5600 0.2955  1.90  0.0581  
## X3             -0.0708 0.2533 -0.28  0.7799
# 可见对于有序的Logit模型,lrm()函数拟合的结果与polr()函数拟合的结果相同。并且lrm()函数还给出了似然比检验结果。

有序Probit模型

# load('.RData')
#method设为probit
library(MASS)
Probit.fit <- polr(regFormula, data=modeldata, method="probit")
#展示拟合模型的详细结果
summary(Probit.fit)
## Call:
## polr(formula = regFormula, data = modeldata, method = "probit")
## 
## Coefficients:
##        Value Std. Error  t value
## X1 -0.068852     0.1607 -0.42837
## X2  0.313734     0.1670  1.87906
## X3 -0.006178     0.1498 -0.04125
## 
## Intercepts:
##                          Value   Std. Error t value
## somewhat likely|unlikely -0.3724  0.4449    -0.8370
## unlikely|very likely      1.3045  0.4499     2.8998
## 
## Residual Deviance: 737.451 
## AIC: 747.451
#列出拟合模型的模型参数
coefficients(Probit.fit) 
##           X1           X2           X3 
## -0.068851511  0.313734276 -0.006178258
head(predict(Probit.fit, modeldata[-1], type='probs'),10)
##    somewhat likely  unlikely very likely
## 1        0.3623109 0.5450495  0.09263955
## 2        0.3883025 0.5299204  0.08177706
## 3        0.2766006 0.5842155  0.13918394
## 4        0.3612690 0.5456293  0.09310166
## 5        0.3606212 0.5459887  0.09339005
## 6        0.2513772 0.5916076  0.15701523
## 7        0.3606906 0.5459503  0.09335912
## 8        0.3610839 0.5457321  0.09318399
## 9        0.3617088 0.5453849  0.09290634
## 10       0.3889893 0.5295042  0.08150658