前言

什么是集成学习呢?通俗的讲,就是多算法融合。它的思想相当简单直接,以至于用一句俗语就可以完美概括:三个臭皮匠,顶个诸葛亮。实际操作中,集成学习把大大小小的多种算法融合在一起,共同协作来解决一个问题。这些算法可以是不同的算法,也可以是相同的算法。

用好集成学习有两个关键点:1)怎么训练每个算法?2)怎么融合每个算法?

Bagging 是投票式的算法,首先使用Bootstrap产生不同的训练集,然后再基于这些数据集得到多个基础分类器,最后通过基础分类器的分类结果得到一个相对更优的预测模型。

Boosting与bagging类似,主要差别在于在boosting算法中,基础分类器的学习是顺序进行的,后一轮分类器的学习与之前分类器的分类结果有关,既是在错分样本的基础上学习,boosting算法通过这样一种补尝学习的方式,达到了利用前一轮分类来调整后轮基础分类器的目的以获得更好的分类性能。

Stacking 是一种集成学习技术,通过元分类器或元回归聚合多个分类或回归模型。基础层次模型(level model)基于完整的训练集进行训练,然后元模型基于基础层次模型的输出进行训练。

Bagging

# install.packages("ipred")
library(ipred)
# bagging {ipred}
# ?bagging
# ?errorest

# 使用bagging算法对iris决策树进行分类
model.bagging <- ipred::bagging(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data=iris)
pre.bagging <- predict(model.bagging,iris)
# pre.bagging$confusion
# pre.bagging$error
# pre.bagging$votes

Boosting

# install.packages("adabag")
library(adabag)
## Loading required package: rpart
## Loading required package: mlbench
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'adabag'
## The following object is masked from 'package:ipred':
## 
##     bagging
# bagging {adabag}
# ?bagging
# boosting {adabag}
# ?boosting

# 以iris数据集为例进行分析 
# 应用模型并查看模型的相应参数 
fit_bag <- adabag::bagging(Species~.,data=iris,mfinal=5,control=rpart.control(maxdepth=3)) 
fit_bag[1:length(fit_bag)] 
## $formula
## Species ~ .
## 
## $trees
## $trees[[1]]
## n= 150 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 150 95 virginica (0.32000000 0.31333333 0.36666667)  
##    2) Petal.Length< 2.45 48  0 setosa (1.00000000 0.00000000 0.00000000) *
##    3) Petal.Length>=2.45 102 47 virginica (0.00000000 0.46078431 0.53921569)  
##      6) Petal.Length< 4.95 50  4 versicolor (0.00000000 0.92000000 0.08000000)  
##       12) Petal.Width< 1.55 43  0 versicolor (0.00000000 1.00000000 0.00000000) *
##       13) Petal.Width>=1.55 7  3 virginica (0.00000000 0.42857143 0.57142857) *
##      7) Petal.Length>=4.95 52  1 virginica (0.00000000 0.01923077 0.98076923) *
## 
## $trees[[2]]
## n= 150 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 150 93 versicolor (0.28000000 0.38000000 0.34000000)  
##   2) Petal.Length< 2.45 42  0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.45 108 51 versicolor (0.00000000 0.52777778 0.47222222)  
##     6) Petal.Length< 4.85 56  2 versicolor (0.00000000 0.96428571 0.03571429) *
##     7) Petal.Length>=4.85 52  3 virginica (0.00000000 0.05769231 0.94230769) *
## 
## $trees[[3]]
## n= 150 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 150 90 setosa (0.40000000 0.34666667 0.25333333)  
##   2) Petal.Length< 2.6 60  0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.6 90 38 versicolor (0.00000000 0.57777778 0.42222222)  
##     6) Petal.Width< 1.65 51  3 versicolor (0.00000000 0.94117647 0.05882353) *
##     7) Petal.Width>=1.65 39  4 virginica (0.00000000 0.10256410 0.89743590) *
## 
## $trees[[4]]
## n= 150 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 150 94 versicolor (0.31333333 0.37333333 0.31333333)  
##   2) Petal.Length< 2.45 47  0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.45 103 47 versicolor (0.00000000 0.54368932 0.45631068)  
##     6) Petal.Width< 1.75 59  4 versicolor (0.00000000 0.93220339 0.06779661) *
##     7) Petal.Width>=1.75 44  1 virginica (0.00000000 0.02272727 0.97727273) *
## 
## $trees[[5]]
## n= 150 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 150 97 setosa (0.35333333 0.35333333 0.29333333)  
##   2) Petal.Length< 2.5 53  0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.5 97 44 versicolor (0.00000000 0.54639175 0.45360825)  
##     6) Petal.Length< 4.85 51  1 versicolor (0.00000000 0.98039216 0.01960784) *
##     7) Petal.Length>=4.85 46  3 virginica (0.00000000 0.06521739 0.93478261) *
## 
## 
## $votes
##        [,1] [,2] [,3]
##   [1,]    2    0    3
##   [2,]    2    2    1
##   [3,]    3    0    2
##   [4,]    0    3    2
##   [5,]    2    2    1
##   [6,]    1    2    2
##   [7,]    3    2    0
##   [8,]    0    2    3
##   [9,]    1    2    2
##  [10,]    0    0    5
##  [11,]    1    1    3
##  [12,]    0    2    3
##  [13,]    1    4    0
##  [14,]    2    1    2
##  [15,]    1    3    1
##  [16,]    1    3    1
##  [17,]    2    1    2
##  [18,]    3    1    1
##  [19,]    1    2    2
##  [20,]    2    1    2
##  [21,]    3    1    1
##  [22,]    3    2    0
##  [23,]    1    4    0
##  [24,]    1    1    3
##  [25,]    1    2    2
##  [26,]    2    2    1
##  [27,]    1    3    1
##  [28,]    1    1    3
##  [29,]    1    2    2
##  [30,]    4    0    1
##  [31,]    1    2    2
##  [32,]    1    2    2
##  [33,]    2    2    1
##  [34,]    0    3    2
##  [35,]    1    2    2
##  [36,]    3    1    1
##  [37,]    2    3    0
##  [38,]    1    3    1
##  [39,]    3    1    1
##  [40,]    2    2    1
##  [41,]    2    2    1
##  [42,]    3    2    0
##  [43,]    1    3    1
##  [44,]    0    1    4
##  [45,]    2    0    3
##  [46,]    2    2    1
##  [47,]    3    2    0
##  [48,]    0    2    3
##  [49,]    0    2    3
##  [50,]    2    2    1
##  [51,]    2    2    1
##  [52,]    2    1    2
##  [53,]    3    1    1
##  [54,]    4    0    1
##  [55,]    2    1    2
##  [56,]    2    2    1
##  [57,]    2    2    1
##  [58,]    3    1    1
##  [59,]    2    1    2
##  [60,]    3    1    1
##  [61,]    1    0    4
##  [62,]    1    3    1
##  [63,]    1    3    1
##  [64,]    3    1    1
##  [65,]    3    1    1
##  [66,]    2    1    2
##  [67,]    3    1    1
##  [68,]    0    1    4
##  [69,]    0    2    3
##  [70,]    1    2    2
##  [71,]    2    2    1
##  [72,]    0    4    1
##  [73,]    1    1    3
##  [74,]    3    1    1
##  [75,]    2    2    1
##  [76,]    2    2    1
##  [77,]    0    2    3
##  [78,]    2    1    2
##  [79,]    2    0    3
##  [80,]    0    2    3
##  [81,]    3    1    1
##  [82,]    3    0    2
##  [83,]    1    3    1
##  [84,]    1    1    3
##  [85,]    2    2    1
##  [86,]    2    0    3
##  [87,]    1    1    3
##  [88,]    1    4    0
##  [89,]    3    0    2
##  [90,]    2    3    0
##  [91,]    1    1    3
##  [92,]    3    1    1
##  [93,]    0    2    3
##  [94,]    0    2    3
##  [95,]    1    2    2
##  [96,]    1    4    0
##  [97,]    2    2    1
##  [98,]    2    1    2
##  [99,]    2    0    3
## [100,]    1    1    3
## [101,]    2    2    1
## [102,]    1    3    1
## [103,]    1    3    1
## [104,]    2    2    1
## [105,]    1    4    0
## [106,]    4    0    1
## [107,]    2    3    0
## [108,]    1    1    3
## [109,]    1    0    4
## [110,]    2    2    1
## [111,]    1    2    2
## [112,]    2    2    1
## [113,]    2    2    1
## [114,]    1    4    0
## [115,]    0    3    2
## [116,]    2    2    1
## [117,]    3    2    0
## [118,]    4    0    1
## [119,]    0    2    3
## [120,]    3    2    0
## [121,]    2    2    1
## [122,]    1    1    3
## [123,]    3    2    0
## [124,]    0    4    1
## [125,]    1    3    1
## [126,]    3    1    1
## [127,]    3    0    2
## [128,]    2    2    1
## [129,]    2    2    1
## [130,]    0    3    2
## [131,]    1    2    2
## [132,]    1    1    3
## [133,]    3    0    2
## [134,]    2    2    1
## [135,]    2    1    2
## [136,]    3    1    1
## [137,]    1    3    1
## [138,]    3    1    1
## [139,]    2    1    2
## [140,]    1    1    3
## [141,]    1    4    0
## [142,]    2    1    2
## [143,]    1    3    1
## [144,]    0    5    0
## [145,]    2    1    2
## [146,]    2    2    1
## [147,]    1    0    4
## [148,]    4    0    1
## [149,]    1    3    1
## [150,]    2    1    2
## 
## $prob
##        [,1] [,2] [,3]
##   [1,]  0.4  0.0  0.6
##   [2,]  0.4  0.4  0.2
##   [3,]  0.6  0.0  0.4
##   [4,]  0.0  0.6  0.4
##   [5,]  0.4  0.4  0.2
##   [6,]  0.2  0.4  0.4
##   [7,]  0.6  0.4  0.0
##   [8,]  0.0  0.4  0.6
##   [9,]  0.2  0.4  0.4
##  [10,]  0.0  0.0  1.0
##  [11,]  0.2  0.2  0.6
##  [12,]  0.0  0.4  0.6
##  [13,]  0.2  0.8  0.0
##  [14,]  0.4  0.2  0.4
##  [15,]  0.2  0.6  0.2
##  [16,]  0.2  0.6  0.2
##  [17,]  0.4  0.2  0.4
##  [18,]  0.6  0.2  0.2
##  [19,]  0.2  0.4  0.4
##  [20,]  0.4  0.2  0.4
##  [21,]  0.6  0.2  0.2
##  [22,]  0.6  0.4  0.0
##  [23,]  0.2  0.8  0.0
##  [24,]  0.2  0.2  0.6
##  [25,]  0.2  0.4  0.4
##  [26,]  0.4  0.4  0.2
##  [27,]  0.2  0.6  0.2
##  [28,]  0.2  0.2  0.6
##  [29,]  0.2  0.4  0.4
##  [30,]  0.8  0.0  0.2
##  [31,]  0.2  0.4  0.4
##  [32,]  0.2  0.4  0.4
##  [33,]  0.4  0.4  0.2
##  [34,]  0.0  0.6  0.4
##  [35,]  0.2  0.4  0.4
##  [36,]  0.6  0.2  0.2
##  [37,]  0.4  0.6  0.0
##  [38,]  0.2  0.6  0.2
##  [39,]  0.6  0.2  0.2
##  [40,]  0.4  0.4  0.2
##  [41,]  0.4  0.4  0.2
##  [42,]  0.6  0.4  0.0
##  [43,]  0.2  0.6  0.2
##  [44,]  0.0  0.2  0.8
##  [45,]  0.4  0.0  0.6
##  [46,]  0.4  0.4  0.2
##  [47,]  0.6  0.4  0.0
##  [48,]  0.0  0.4  0.6
##  [49,]  0.0  0.4  0.6
##  [50,]  0.4  0.4  0.2
##  [51,]  0.4  0.4  0.2
##  [52,]  0.4  0.2  0.4
##  [53,]  0.6  0.2  0.2
##  [54,]  0.8  0.0  0.2
##  [55,]  0.4  0.2  0.4
##  [56,]  0.4  0.4  0.2
##  [57,]  0.4  0.4  0.2
##  [58,]  0.6  0.2  0.2
##  [59,]  0.4  0.2  0.4
##  [60,]  0.6  0.2  0.2
##  [61,]  0.2  0.0  0.8
##  [62,]  0.2  0.6  0.2
##  [63,]  0.2  0.6  0.2
##  [64,]  0.6  0.2  0.2
##  [65,]  0.6  0.2  0.2
##  [66,]  0.4  0.2  0.4
##  [67,]  0.6  0.2  0.2
##  [68,]  0.0  0.2  0.8
##  [69,]  0.0  0.4  0.6
##  [70,]  0.2  0.4  0.4
##  [71,]  0.4  0.4  0.2
##  [72,]  0.0  0.8  0.2
##  [73,]  0.2  0.2  0.6
##  [74,]  0.6  0.2  0.2
##  [75,]  0.4  0.4  0.2
##  [76,]  0.4  0.4  0.2
##  [77,]  0.0  0.4  0.6
##  [78,]  0.4  0.2  0.4
##  [79,]  0.4  0.0  0.6
##  [80,]  0.0  0.4  0.6
##  [81,]  0.6  0.2  0.2
##  [82,]  0.6  0.0  0.4
##  [83,]  0.2  0.6  0.2
##  [84,]  0.2  0.2  0.6
##  [85,]  0.4  0.4  0.2
##  [86,]  0.4  0.0  0.6
##  [87,]  0.2  0.2  0.6
##  [88,]  0.2  0.8  0.0
##  [89,]  0.6  0.0  0.4
##  [90,]  0.4  0.6  0.0
##  [91,]  0.2  0.2  0.6
##  [92,]  0.6  0.2  0.2
##  [93,]  0.0  0.4  0.6
##  [94,]  0.0  0.4  0.6
##  [95,]  0.2  0.4  0.4
##  [96,]  0.2  0.8  0.0
##  [97,]  0.4  0.4  0.2
##  [98,]  0.4  0.2  0.4
##  [99,]  0.4  0.0  0.6
## [100,]  0.2  0.2  0.6
## [101,]  0.4  0.4  0.2
## [102,]  0.2  0.6  0.2
## [103,]  0.2  0.6  0.2
## [104,]  0.4  0.4  0.2
## [105,]  0.2  0.8  0.0
## [106,]  0.8  0.0  0.2
## [107,]  0.4  0.6  0.0
## [108,]  0.2  0.2  0.6
## [109,]  0.2  0.0  0.8
## [110,]  0.4  0.4  0.2
## [111,]  0.2  0.4  0.4
## [112,]  0.4  0.4  0.2
## [113,]  0.4  0.4  0.2
## [114,]  0.2  0.8  0.0
## [115,]  0.0  0.6  0.4
## [116,]  0.4  0.4  0.2
## [117,]  0.6  0.4  0.0
## [118,]  0.8  0.0  0.2
## [119,]  0.0  0.4  0.6
## [120,]  0.6  0.4  0.0
## [121,]  0.4  0.4  0.2
## [122,]  0.2  0.2  0.6
## [123,]  0.6  0.4  0.0
## [124,]  0.0  0.8  0.2
## [125,]  0.2  0.6  0.2
## [126,]  0.6  0.2  0.2
## [127,]  0.6  0.0  0.4
## [128,]  0.4  0.4  0.2
## [129,]  0.4  0.4  0.2
## [130,]  0.0  0.6  0.4
## [131,]  0.2  0.4  0.4
## [132,]  0.2  0.2  0.6
## [133,]  0.6  0.0  0.4
## [134,]  0.4  0.4  0.2
## [135,]  0.4  0.2  0.4
## [136,]  0.6  0.2  0.2
## [137,]  0.2  0.6  0.2
## [138,]  0.6  0.2  0.2
## [139,]  0.4  0.2  0.4
## [140,]  0.2  0.2  0.6
## [141,]  0.2  0.8  0.0
## [142,]  0.4  0.2  0.4
## [143,]  0.2  0.6  0.2
## [144,]  0.0  1.0  0.0
## [145,]  0.4  0.2  0.4
## [146,]  0.4  0.4  0.2
## [147,]  0.2  0.0  0.8
## [148,]  0.8  0.0  0.2
## [149,]  0.2  0.6  0.2
## [150,]  0.4  0.2  0.4
## 
## $class
##   [1] "virginica"  "setosa"     "setosa"     "versicolor" "setosa"    
##   [6] "versicolor" "setosa"     "virginica"  "versicolor" "virginica" 
##  [11] "virginica"  "virginica"  "versicolor" "setosa"     "versicolor"
##  [16] "versicolor" "setosa"     "setosa"     "versicolor" "setosa"    
##  [21] "setosa"     "setosa"     "versicolor" "virginica"  "versicolor"
##  [26] "setosa"     "versicolor" "virginica"  "versicolor" "setosa"    
##  [31] "versicolor" "versicolor" "setosa"     "versicolor" "versicolor"
##  [36] "setosa"     "versicolor" "versicolor" "setosa"     "setosa"    
##  [41] "setosa"     "setosa"     "versicolor" "virginica"  "virginica" 
##  [46] "setosa"     "setosa"     "virginica"  "virginica"  "setosa"    
##  [51] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"    
##  [56] "setosa"     "setosa"     "setosa"     "setosa"     "setosa"    
##  [61] "virginica"  "versicolor" "versicolor" "setosa"     "setosa"    
##  [66] "setosa"     "setosa"     "virginica"  "virginica"  "versicolor"
##  [71] "setosa"     "versicolor" "virginica"  "setosa"     "setosa"    
##  [76] "setosa"     "virginica"  "setosa"     "virginica"  "virginica" 
##  [81] "setosa"     "setosa"     "versicolor" "virginica"  "setosa"    
##  [86] "virginica"  "virginica"  "versicolor" "setosa"     "versicolor"
##  [91] "virginica"  "setosa"     "virginica"  "virginica"  "versicolor"
##  [96] "versicolor" "setosa"     "setosa"     "virginica"  "virginica" 
## [101] "setosa"     "versicolor" "versicolor" "setosa"     "versicolor"
## [106] "setosa"     "versicolor" "virginica"  "virginica"  "setosa"    
## [111] "versicolor" "setosa"     "setosa"     "versicolor" "versicolor"
## [116] "setosa"     "setosa"     "setosa"     "virginica"  "setosa"    
## [121] "setosa"     "virginica"  "setosa"     "versicolor" "versicolor"
## [126] "setosa"     "setosa"     "setosa"     "setosa"     "versicolor"
## [131] "versicolor" "virginica"  "setosa"     "setosa"     "setosa"    
## [136] "setosa"     "versicolor" "setosa"     "setosa"     "virginica" 
## [141] "versicolor" "setosa"     "versicolor" "versicolor" "setosa"    
## [146] "setosa"     "virginica"  "setosa"     "versicolor" "setosa"    
## 
## $samples
##        [,1] [,2] [,3] [,4] [,5]
##   [1,]  123  125  107   10   32
##   [2,]    4    5   83  150   86
##   [3,]  137    3   17   42  118
##   [4,]  123  121   77   61   87
##   [5,]   63  147   24   46   89
##   [6,]  135   51   49  134   73
##   [7,]   74   46   36   87   21
##   [8,]  104  120   64   86  143
##   [9,]  148   74  149    5   55
##  [10,]  109  150  150  144  114
##  [11,]   46  148   78   93  145
##  [12,]  128   56   59  140  148
##  [13,]   49   71   95   57  100
##  [14,]   52   17  146  119   12
##  [15,]   97   18   66   96  124
##  [16,]   25  112   59   51   55
##  [17,]  144  142   38   85   23
##  [18,]   18   15   11  109   91
##  [19,]   53  114   42  122   80
##  [20,]  146  136    7   84   40
##  [21,]   37  124   85   40   18
##  [22,]    5   58   38   48   81
##  [23,]   95   62   85  107   31
##  [24,]   20  110  103   57  103
##  [25,]    3  111   56   56  101
##  [26,]   73   18   12   82  101
##  [27,]   94  127  106   36   54
##  [28,]   12  111  114  128   90
##  [29,]   44  120  139   98   63
##  [30,]   21   38    4   17  110
##  [31,]  143   85   72   47  122
##  [32,]    4  146   86   65  128
##  [33,]   45   93   70  137   31
##  [34,]  150   51   55   92  114
##  [35,]   99   16   56  141  103
##  [36,]  107   18   45   95   41
##  [37,]   99    5   61   18   76
##  [38,]    2   79   85   88  135
##  [39,]   15  136   72   10   38
##  [40,]   38   64   42   57  144
##  [41,]   65   60   28  149   46
##  [42,]   24   27   59   19   65
##  [43,]   79   65  105   78    9
##  [44,]  120   70  105  133  117
##  [45,]  145   53   23  112    4
##  [46,]   53   64  121   42   43
##  [47,]    8    3   55   13   74
##  [48,]   79  130   85  126  143
##  [49,]  129   73   59  103   66
##  [50,]  137   23   33   85   71
##  [51,]   55    1   26   84  147
##  [52,]  103  136   15   58   12
##  [53,]   71   71   18    4   18
##  [54,]   39   29    9  101   42
##  [55,]  119  124   10   12   57
##  [56,]   22   44  134  124   82
##  [57,]    8   55    7  123   81
##  [58,]  106   35   25   28   76
##  [59,]  144   16    2   64  124
##  [60,]   82  150   34   44   14
##  [61,]  141  108   30  123   84
##  [62,]  121    6   64   89   57
##  [63,]  113   58   82   79   20
##  [64,]    8   51    6   49  112
##  [65,]   95   22    9  118   37
##  [66,]   86   11   96   13  108
##  [67,]   35  114   29   58   11
##  [68,]  117  124  138   94  148
##  [69,]  133  116  107   67   51
##  [70,]  135  114   29   83   67
##  [71,]   29   86   21   62  110
##  [72,]  143   64   96   59   65
##  [73,]  129   79  138   30  150
##  [74,]   71   12   28  120   16
##  [75,]   69   70   39  136   46
##  [76,]   29  143   44   52   67
##  [77,]  144  116   97  117   91
##  [78,]   51   45  105  112   22
##  [79,]  115  141   21  109   31
##  [80,]  141  100  120  116  112
##  [81,]   28  111   14   78    1
##  [82,]  103  145   39   34   27
##  [83,]   77  128   55   16   81
##  [84,]   83  118  131   26  103
##  [85,]  150   76   70   17   49
##  [86,]    8  126    9  143  123
##  [87,]  127   43   94   71  121
##  [88,]   26   91   75   65  100
##  [89,]   16   11  115   25  122
##  [90,]   99   22   89   27   60
##  [91,]  126   78   78  130    3
##  [92,]   42   33  148   13   52
##  [93,]  140   91  125  145   79
##  [94,]   79  137   95  148  111
##  [95,]   55  138   68   48  106
##  [96,]   68   88   95   82   32
##  [97,]  145   72   19   52   13
##  [98,]   44   99    8  150  114
##  [99,]  111  112  111   11   20
## [100,]  101  118  114   20   87
## [101,]    7  107   34  106   92
## [102,]    8   82  120   94  101
## [103,]   96   90   33   78  106
## [104,]   42  102   72   57   26
## [105,]   35   88   62   93   58
## [106,]   12   46   20   34  135
## [107,]   65   68   59    7   22
## [108,]    3  101  143   54  106
## [109,]   36  124  137  106  124
## [110,]   81   44  125   41  127
## [111,]   56   75    9  131  137
## [112,]   29  121   53   52   26
## [113,]   16  138   27   89  100
## [114,]   85   58   95   54   24
## [115,]  142   74  107   72   58
## [116,]    9   88   13  138   71
## [117,]   35   92    9   91   18
## [118,]  123    8   36   47   50
## [119,]   92  110  107  116   61
## [120,]   32   81   34   99   35
## [121,]   79   72    6   21  143
## [122,]   87  129  142  112   35
## [123,]   21   57   36   83   24
## [124,]   82   69   56  102   97
## [125,]   76   60  132   48   57
## [126,]   74    4  107   23   34
## [127,]  141  126   40   11   49
## [128,]   70   25  147   83    3
## [129,]  130   44   18   55   87
## [130,]   69   61  108  113   83
## [131,]  141   59   78   65    4
## [132,]  113   10   71  145   54
## [133,]  134    7   44    5  122
## [134,]  108   19   88   44   71
## [135,]   54   24  118  129   32
## [136,]  128   20   91   32   24
## [137,]   98  124   47   96   76
## [138,]  147   70   46   48    6
## [139,]  114   69  144   15   19
## [140,]  115  108   60  140   11
## [141,]   96   87   87   37   67
## [142,]   72  150  121   17   41
## [143,]  108   88   38   68   65
## [144,]   99   69   59   69   54
## [145,]   24   98   29  131   78
## [146,]   12   70   40   62  117
## [147,]  144   20  105  108  132
## [148,]  109   29    2   11   10
## [149,]   32   63   65  141   70
## [150,]   84   28   61  118   23
## 
## $importance
## Petal.Length  Petal.Width Sepal.Length  Sepal.Width 
##     83.02502     16.97498      0.00000      0.00000 
## 
## $terms
## Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
## attr(,"variables")
## list(Species, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
## attr(,"factors")
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Species                 0           0            0           0
## Sepal.Length            1           0            0           0
## Sepal.Width             0           1            0           0
## Petal.Length            0           0            1           0
## Petal.Width             0           0            0           1
## attr(,"term.labels")
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## attr(,"order")
## [1] 1 1 1 1
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
## attr(,"predvars")
## list(Species, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
## attr(,"dataClasses")
##      Species Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     "factor"    "numeric"    "numeric"    "numeric"    "numeric" 
## 
## $call
## adabag::bagging(formula = Species ~ ., data = iris, mfinal = 5, 
##     control = rpart.control(maxdepth = 3))
# 预测分析 
pre_bag <- predict(fit_bag,iris) 
pre_bag$error
## [1] 0.02666667
# 使用boosting算法对iris决策树进行分类
model.boosting <- boosting(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris)
pre.boosting <- predict(model.boosting,iris)
pre.boosting$confusion
##                Observed Class
## Predicted Class setosa versicolor virginica
##      setosa         50          0         0
##      versicolor      0         50         0
##      virginica       0          0        50
pre.boosting$error
## [1] 0
# pre.boosting$votes
# adabag包均有函数实现bagging和adaboost的分类建模
# 利用全部数据建模
library(adabag)
a<-boosting(Species~.,data=iris)
z0<-table(iris[,5],predict(a,iris)$class)
#计算误差率
E0<-(sum(z0)-sum(diag(z0)))/sum(z0)
barplot(a$importance)

b<-errorevol(a,iris)#计算全体的误差演变
plot(b$error,type="l",main="AdaBoost error vs number of trees") #对误差演变进行画图

a<-bagging(Species~.,data=iris)
z0<-table(iris[,5],predict(a,iris)$class)
#计算误差率
E0<-(sum(z0)-sum(diag(z0)))/sum(z0)
barplot(a$importance)

b<-errorevol(a,iris)#计算全体的误差演变
plot(b$error,type="l",main="AdaBoost error vs number of trees") #对误差演变进行画图

#5折交叉验证
set.seed(1044)  #设定随机种子
samp<-c(sample(1:50,25),sample(51:100,25),sample(101:150,25)) #进行随机抽样
a<-boosting(Species~.,data=iris[samp,]) #利用训练集建立adaboost分类模
z0<-table(iris[samp,5],predict(a,iris[samp,])$class)#训练集结果
z1<-table(iris[-samp,5],predict(a,iris[-samp,])$class)#测试集结果
E0<-(sum(z0)-sum(diag(z0)))/sum(z0)
E1<-(sum(z0)-sum(diag(z0)))/sum(z1)
E0
## [1] 0
E1
## [1] 0
a<-bagging(Species~.,data=iris[samp,]) #利用训练集建立bagging分类模
z0<-table(iris[samp,5],predict(a,iris[samp,])$class)#训练集结果
z1<-table(iris[-samp,5],predict(a,iris[-samp,])$class)#测试集结果
E0<-(sum(z0)-sum(diag(z0)))/sum(z0)
E1<-(sum(z0)-sum(diag(z0)))/sum(z1)
E0
## [1] 0.02666667
E1
## [1] 0.02666667

梯度提升方法

R语言中gbm包就是用来实现一般提升方法的扩展包。根据基学习器、损失函数和优化方法的不同,提升方法也有各种不同的形式。

自适应提升方法AdaBoost

它是一种传统而重要的Boost算法,在学习时为每一个样本赋上一个权重,初始时各样本权重一样。在每一步训练后,增加错误学习样本的权重,这使得某些样本的重要性凸显出来,在进行了N次迭代后,将会得到N个简单的学习器。最后将它们组合起来得到一个最终的模型。

梯度提升方法Gradient Boosting

梯度提升是先根据初始模型计算伪残差,之后建立一个基学习器来解释伪残差,该基学习器是在梯度方向上减少残差。再将基学习器乘上权重系数(学习速率)和原来的模型进行线性组合形成新的模型。这样反复迭代就可以找到一个使损失函数的期望达到最小的模型。在训练基学习器时可以使用再抽样方法,此时就称之为随机梯度提升算法stochastic gradient boosting。

在gbm包中,采用的是决策树作为基学习器,重要的参数设置如下:

  • 损失函数的形式(distribution)

  • 迭代次数(n.trees)

  • 学习速率(shrinkage)

  • 再抽样比率(bag.fraction)

  • 决策树的深度(interaction.depth)

损失函数的形式容易设定,分类问题一般选择bernoulli分布,而回归问题可以选择gaussian分布。学习速率方面,我们都知道步子迈得太大容易扯着,所以学习速率是越小越好,但是步子太小的话,步数就得增加,也就是训练的迭代次数需要加大才能使模型达到最优,这样训练所需时间和计算资源也相应加大了。gbm作者的经验法则是设置shrinkage参数在0.01-0.001之间,而n.trees参数在3000-10000之间。

我们用mlbench包中的数据集来看一下gbm包的使用。其中响应变量为diabetes,即病人的糖尿病诊断是阳性还是阴性。

library(gbm)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
data(PimaIndiansDiabetes2,package='mlbench')

# 将响应变量转为0-1格式
data <- PimaIndiansDiabetes2
data$diabetes <- as.numeric(data$diabetes)
data <- transform(data,diabetes=diabetes-1)

# 使用gbm函数建模
model <- gbm(diabetes~.,data=data,shrinkage=0.01,
             distribution='bernoulli',cv.folds=5,
             n.trees=3000,verbose=F)

# 用交叉检验确定最佳迭代次数
best.iter <- gbm.perf(model,method='cv')

best.iter
## [1] 1156

Credit Approval【信贷审批】-数据案例

dataset <- read.table("http://data.galaxystatistics.com/blog_data/ensemble_learning/crx.data", sep=",", header=F, na.strings="?")

head(dataset)
##   V1    V2    V3 V4 V5 V6 V7   V8 V9 V10 V11 V12 V13 V14 V15 V16
## 1  b 30.83 0.000  u  g  w  v 1.25  t   t   1   f   g 202   0   +
## 2  a 58.67 4.460  u  g  q  h 3.04  t   t   6   f   g  43 560   +
## 3  a 24.50 0.500  u  g  q  h 1.50  t   f   0   f   g 280 824   +
## 4  b 27.83 1.540  u  g  w  v 3.75  t   t   5   t   g 100   3   +
## 5  b 20.17 5.625  u  g  w  v 1.71  t   f   0   f   s 120   0   +
## 6  b 32.08 4.000  u  g  m  v 2.50  t   f   0   t   g 360   0   +
sapply(dataset, function(x) sum(is.na(x)))
##  V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 
##  12  12   0   6   6   9   9   0   0   0   0   0   0  13   0   0
sapply(dataset, class)
##        V1        V2        V3        V4        V5        V6        V7 
##  "factor" "numeric" "numeric"  "factor"  "factor"  "factor"  "factor" 
##        V8        V9       V10       V11       V12       V13       V14 
## "numeric"  "factor"  "factor" "integer"  "factor"  "factor" "integer" 
##       V15       V16 
## "integer"  "factor"
#分割数据的训练集和测试集
set.seed(123)
dataset <- na.omit(dataset)  # 去掉存在NA的行
n <- dim(dataset)[1]
index <- sample(n, round(0.7*n))
train <- dataset[index,]
test <- dataset[-index,]
dim(train)
## [1] 457  16
####有时候,需要转化变量为哑变量,因为在一些挖掘场合,数据不能直接使用因子型的数据
# knn
# glmnet
# svm
# xgboost
###有些挖掘方法可以使用因子变量,比如:
# logistic regression
# repart
# GBM
# randomForest

dataset2 <- dataset

library(plyr)
into_factor <- function(x){
  
  if(class(x) == "factor"){
    n <- length(x)
    data.fac <- data.frame(x=x, y=1:n)
    output <- model.matrix(y~x, data.fac)[,-1]
    ## Convert factor into dummy variable matrix
  }else{
    output <- x
    ## if x is numeric, output is x
  }
  output
  
}
into_factor(dataset$V4)[1:5,]
##   xu xy
## 1  1  0
## 2  1  0
## 3  1  0
## 4  1  0
## 5  1  0
# colwise !!!
dataset2 <- colwise(into_factor)(dataset2)
# class(dataset2)
dataset2 <- do.call(cbind,dataset2)
dataset2 <- as.data.frame(dataset2)
dim(dataset2)
## [1] 653  38
#####使用logistic回归来进行测试建模和预测,使用glm
logit.model <- glm(V16~.,data=train,family = "binomial")
logit.response <- predict(logit.model, test, type="response")
logit.predict <- ifelse(logit.response>0.5,"+","-")
table(logit.predict,test$V16)
##              
## logit.predict  -  +
##             - 90 24
##             + 13 69
accutancy1 <- mean(logit.predict==test$V16)


####使用GBM方法来预测,这里用的是caret, repeat-cv来选择最优树
library(caret)
ctrl <- trainControl(method = "repeatedcv",number = 5,repeats=5)
set.seed(300)
m_gbm <- train(V16~.,data=train,methed="gbm",metric="Kappa",trControl=ctrl)
gbm.predict <- predict(m_gbm,test)
table(gbm.predict,test$V16)
##            
## gbm.predict  -  +
##           - 96 29
##           +  7 64
accutancy2 <- mean(gbm.predict==test$V16)


####首先测试一个knn模型,不做CV,不做标准化,不做数据类型转化得到的结果,这里,不转换数据类型会把因子类型的变量舍弃,仅保留数据变量
library(caret)
knn.model1 <- knn3(V16~.,data=train,k=5)
knn.response1 <- predict(knn.model1,test,class="response")
knn.predict1 <- ifelse(knn.response1[,1]<0.5,"+","-")
table(knn.predict1,test$V16)
##             
## knn.predict1  -  +
##            - 78 48
##            + 25 45
mean(knn.predict1==test$V16)
## [1] 0.627551
##### 经过标准化和数据转换之后的准确率
knn.dataset <- cbind(colwise(scale)(dataset2[,-38]),V16=as.factor(dataset2$V16))
set.seed(123)
index <- sample(n,round(0.7*n))
train.knn <- knn.dataset[index,]
test.knn <- knn.dataset[-index,]
knn.model1 <- knn3(V16 ~ .,data = train.knn, k = 5)
knn.predict1 <- predict(knn.model1, test.knn, type="class") 
table(knn.predict1, test.knn$V16)
##             
## knn.predict1  0  1
##            0 89 32
##            1 14 61
mean(knn.predict1 == test.knn$V16)
## [1] 0.7653061
##### knn CV for k  
##### 不管是我的这个程序函数caret,总算出来应该是k=2的时候误差最小,但是实际情况不是这样
library(class)
cv.knn <- function(data,n=5,k){
  index <- sample(1:5,nrow(data),replace = T)
  acc=0
  for ( i in 1:5){
    ind <- index == i
    train <- data[-ind,]
    test <- data[ind,]
    knn.model1 <- knn3(train$V16 ~ .,data = train, k = k)  
    knn.predict <- predict(knn.model1,test,type = "class") 
    acc[i] <- mean(knn.predict == test$V16)
  }
  mean(acc)
}
cv.knn(train.knn,3,5)
## [1] 0.8532997
k <- 2:20
set.seed(123)
acc <- sapply(k,function(x) cv.knn(train.knn,3,x))
plot(k, acc, type="b")

k.final <- which.max(acc)
knn.model.f <- knn3(V16 ~ .,data = train.knn, k = k.final) 
knn.predict.f <- predict(knn.model.f,test.knn,type = "class") 
table(knn.predict.f,test.knn$V16)
##              
## knn.predict.f  0  1
##             0 81 31
##             1 22 62
mean(knn.predict.f == test.knn$V16)
## [1] 0.7295918
library(caret)
fitControl <- trainControl(method = "cv", number = 10)
knnTune <- train(x = dataset2[1:37], y = dataset2[,38], method = "knn", preProc = c("center", "scale"),tuneGrid = data.frame(.k = 1:20), trControl = fitControl)

xgboost包使用

https://segmentfault.com/a/1190000004421821

https://zhuanlan.zhihu.com/p/20871600

https://cosx.org/2015/03/xgboost

http://blog.csdn.net/sinat_26917383/article/details/52623754

http://blog.csdn.net/a819825294/article/details/51206410

http://blog.csdn.net/a819825294/article/details/51188740

Credit Approval【信贷审批】数据xgboost案例

dataset <- read.table("http://data.galaxystatistics.com/blog_data/ensemble_learning/crx.data", sep=",", header=F, na.strings="?")

head(dataset)
##   V1    V2    V3 V4 V5 V6 V7   V8 V9 V10 V11 V12 V13 V14 V15 V16
## 1  b 30.83 0.000  u  g  w  v 1.25  t   t   1   f   g 202   0   +
## 2  a 58.67 4.460  u  g  q  h 3.04  t   t   6   f   g  43 560   +
## 3  a 24.50 0.500  u  g  q  h 1.50  t   f   0   f   g 280 824   +
## 4  b 27.83 1.540  u  g  w  v 3.75  t   t   5   t   g 100   3   +
## 5  b 20.17 5.625  u  g  w  v 1.71  t   f   0   f   s 120   0   +
## 6  b 32.08 4.000  u  g  m  v 2.50  t   f   0   t   g 360   0   +
sapply(dataset, function(x) sum(is.na(x)))
##  V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 
##  12  12   0   6   6   9   9   0   0   0   0   0   0  13   0   0
sapply(dataset, class)
##        V1        V2        V3        V4        V5        V6        V7 
##  "factor" "numeric" "numeric"  "factor"  "factor"  "factor"  "factor" 
##        V8        V9       V10       V11       V12       V13       V14 
## "numeric"  "factor"  "factor" "integer"  "factor"  "factor" "integer" 
##       V15       V16 
## "integer"  "factor"
#分割数据的训练集和测试集
set.seed(123)
dataset <- na.omit(dataset)  # 去掉存在NA的行
n <- dim(dataset)[1]
index <- sample(n, round(0.7*n))
train <- dataset[index,]
test <- dataset[-index,]
dim(train)
## [1] 457  16
####有时候,需要转化变量为哑变量,因为在一些挖掘场合,数据不能直接使用因子型的数据
# knn
# glmnet
# svm
# xgboost
###有些挖掘方法可以使用因子变量,比如:
# logistic regression
# repart
# GBM
# randomForest

dataset2 <- dataset

library(plyr)
into_factor <- function(x){
  
  if(class(x) == "factor"){
    n <- length(x)
    data.fac <- data.frame(x=x, y=1:n)
    output <- model.matrix(y~x, data.fac)[,-1]
    ## Convert factor into dummy variable matrix
  }else{
    output <- x
    ## if x is numeric, output is x
  }
  output
  
}
into_factor(dataset$V4)[1:5,]
##   xu xy
## 1  1  0
## 2  1  0
## 3  1  0
## 4  1  0
## 5  1  0
# colwise !!!
dataset2 <- colwise(into_factor)(dataset2)
# class(dataset2)
dataset2 <- do.call(cbind,dataset2)
dataset2 <- as.data.frame(dataset2)
dim(dataset2)
## [1] 653  38
##### xgboost
library(xgboost)
library(methods)
library(plyr)

set.seed(123)

index = sample(n,round(0.7*n))
train.xg = dataset2[index,]
test.xg = dataset2[-index,]

label <- as.matrix(train.xg[,38,drop =F])
data <- as.matrix(train.xg[,-38,drop =F])
label2 <- as.matrix(test.xg[,38,drop =F])
data2 <- as.matrix(test.xg[,-38,drop =F])

# weight <- as.numeric(dtrain[[32]]) * testsize / length(label)
xgmat <- xgb.DMatrix(data, label = label, missing = -10000)
param <- list("objective" = "binary:logistic","bst:eta" = 1,"bst:max_depth" = 2,"eval_metric" = "logloss","silent" = 1,"nthread" = 16 ,"min_child_weight" =1.45)
nround <- 275

bst <- xgb.train(param, xgmat, nround)
res1 <- predict(bst, data2)
pre1 <- ifelse(res1>0.5, 1, 0)
table(pre1, label2)
##     label2
## pre1  0  1
##    0 89 19
##    1 14 74
accurancy4 <- mean(pre1 ==label2)
accurancy4
## [1] 0.8316327

具体案例——官方案例 discoverYourData

案例的主要内容是:服用安慰剂对病情康复的情况,其他指标还有年龄、性别。

library(xgboost)  
library(Matrix)  
library(data.table) 
library(vcd)
## Loading required package: grid
# if (!library('vcd')) install.packages('vcd')   


# 数据导入与包的加载
data(Arthritis)  
df <- data.table(Arthritis, keep.rownames = F)

# 接下来对数据进行一些处理。这个代码写的很棒,比如:ifelse的用法,以及:=用法(直接在[]框中对数据进行一定操作)
head(df[,AgeDiscret := as.factor(round(Age/10,0))])               #:= 新增加一列  
##    ID Treatment  Sex Age Improved AgeDiscret
## 1: 57   Treated Male  27     Some          3
## 2: 46   Treated Male  29     None          3
## 3: 77   Treated Male  30     None          3
## 4: 17   Treated Male  32   Marked          3
## 5: 36   Treated Male  46   Marked          5
## 6: 23   Treated Male  58   Marked          6
head(df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))])   #ifelse  
##    ID Treatment  Sex Age Improved AgeDiscret AgeCat
## 1: 57   Treated Male  27     Some          3  Young
## 2: 46   Treated Male  29     None          3  Young
## 3: 77   Treated Male  30     None          3  Young
## 4: 17   Treated Male  32   Marked          3    Old
## 5: 36   Treated Male  46   Marked          5    Old
## 6: 23   Treated Male  58   Marked          6    Old
df[,ID:=NULL]  

# 生成特定的数据格式。
# 生成了one-hot encode数据,独热编码。Improved是Y变量,-1是将treament变量(名义变量)拆分。
sparse_matrix <- sparse.model.matrix(Improved~.-1, data = df)  #变成稀疏数据,然后0变成.,便于占用内存最小  

# 设置因变量(多分类)
output_vector <- df[,Improved] == "Marked" 

# xgboost建模
# 其中nround是迭代次数,可以用此来调节过拟合问题;
# nthread代表运行线程,如果不指定,则表示线程全开;
# objective代表所使用的方法:binary:logistic是以非线性的方式,分支。reg:linear(默认)、reg:logistic、count:poisson(泊松分布)、multi:softmax
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,  
               eta = 1, nthread = 2, nround = 10,objective = "binary:logistic")  
## [1]  train-error:0.202381 
## [2]  train-error:0.166667 
## [3]  train-error:0.166667 
## [4]  train-error:0.166667 
## [5]  train-error:0.154762 
## [6]  train-error:0.154762 
## [7]  train-error:0.154762 
## [8]  train-error:0.166667 
## [9]  train-error:0.166667 
## [10] train-error:0.166667
# 特征重要性排名
# 会出来比较多的指标,Gain是增益,树分支的主要参考因素;cover是特征观察的相对数值;Frequence是gain的一种简单版,他是在所有生成树中,特征的数量(慎用!)
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst)  
head(importance)
##             Feature        Gain      Cover  Frequency
## 1:              Age 0.622031651 0.67251706 0.67241379
## 2: TreatmentPlacebo 0.285750607 0.11916656 0.10344828
## 3:          SexMale 0.048744054 0.04522027 0.08620690
## 4:      AgeDiscret6 0.016604647 0.04784637 0.05172414
## 5:      AgeDiscret3 0.016373791 0.08028939 0.05172414
## 6:      AgeDiscret4 0.009270558 0.02858801 0.01724138
# 特征筛选与检验
# 知道特征的重要性是一回事儿,现在想知道年龄对最后的治疗的影响。所以需要可以用一些方式来反映出来。以下是官方自带的。
importanceRaw <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst, data = sparse_matrix, label = output_vector)  
## Warning in `[.data.table`(result, , `:=`("RealCover", as.numeric(vec)), :
## with=FALSE ignored, it isn't needed when using :=. See ?':=' for examples.
# Cleaning for better display  
importanceClean <- importanceRaw[,`:=`(Cover=NULL, Frequence=NULL)]  #同时去掉cover frequence  
## Warning in `[.data.table`(importanceRaw, , `:=`(Cover = NULL, Frequence =
## NULL)): Adding new column 'Frequence' then assigning NULL (deleting it).
head(importanceClean)  
##             Feature         Split       Gain  Frequency RealCover
## 1: TreatmentPlacebo -9.53674e-007 0.28575061 0.10344828         7
## 2:              Age          61.5 0.16374034 0.05172414        12
## 3:              Age            39 0.08705750 0.01724138         8
## 4:              Age          57.5 0.06947553 0.03448276        11
## 5:          SexMale -9.53674e-007 0.04874405 0.08620690         4
## 6:              Age          53.5 0.04620627 0.05172414        10
##    RealCover %
## 1:   0.2500000
## 2:   0.4285714
## 3:   0.2857143
## 4:   0.3928571
## 5:   0.1428571
## 6:   0.3571429
# 比第一种方式多了split列,代表此时特征分割的界线,比如特征2: Age  61.5,代表分割在61.5岁以下治疗了就痊愈了。同时,多了RealCover 和RealCover %列,前者代表在这个特征的个数,后者代表个数的比例。

# 绘制重要性图谱:
xgb.plot.importance(importance_matrix = importanceRaw)

# 需要加载install.packages("Ckmeans.1d.dp"),其中输出的是两个特征,这个特征数量是可以自定义的,可以定义为10族。

# 变量之间影响力的检验,官方用的卡方检验:
c2 <- chisq.test(df$Age, output_vector)  
## Warning in chisq.test(df$Age, output_vector): Chi-squared approximation may
## be incorrect
# 检验年龄对最终结果的影响。

# # 疑问?
# #Random Forest™ - 1000 trees  
# bst <- xgboost(data = train$data, label = train$label, max.depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree =0.5, nround = 1, objective = "binary:logistic")  
# #num_parallel_tree这个是什么?  
# 
# #Boosting - 3 rounds  
# bst <- xgboost(data = train$data, label = train$label, max.depth = 4, nround = 3, objective = "binary:logistic")  
# #???代表boosting  
# # 话说最后有一个疑问,这几个代码是可以区分XGBoost、随机森林以及boosting吗?


# # 一些进阶功能的尝试
# # 作为比赛型算法,真的超级好。下面列举一些我比较看中的功能:
# # 1、交叉验证每一折显示预测情况
# # 挑选比较优质的验证集。
# # do cross validation with prediction values for each fold  
# res <- xgb.cv(params = param, data = dtrain, nrounds = nround, nfold = 5, prediction = TRUE)  
# res$evaluation_log  
# length(res$pred)
# 
# # 交叉验证时可以返回模型在每一折作为预测集时的预测结果,方便构建ensemble模型。
# 
# # 2、循环迭代
# # 允许用户先迭代1000次,查看此时模型的预测效果,然后继续迭代1000次,最后模型等价于一次性迭代2000次。
# # do predict with output_margin=TRUE, will always give you margin values before logistic transformation  
# ptrain <- predict(bst, dtrain, outputmargin=TRUE)  
# ptest  <- predict(bst, dtest, outputmargin=TRUE)  
# 
# # 3、每棵树将样本分类到哪片叶子上
# # training the model for two rounds  
# bst = xgb.train(params = param, data = dtrain, nrounds = nround, nthread = 2)  
# 
# # 4、线性模型替代树模型
# # 可以选择使用线性模型替代树模型,从而得到带L1+L2惩罚的线性回归或者logistic回归。
# # you can also set lambda_bias which is L2 regularizer on the bias term  
# param <- list(objective = "binary:logistic", booster = "gblinear",  
#               nthread = 2, alpha = 0.0001, lambda = 1)  

Stacking