What is ensemble learning? Put simply, it is the fusion of multiple algorithms. The idea is so simple and direct that a proverb captures it perfectly: three cobblers with their wits combined equal Zhuge Liang, i.e. many weak heads can beat one strong one. In practice, ensemble learning combines algorithms large and small and lets them work together on a single problem; the constituent algorithms may all be different, or they may be copies of the same algorithm.
Using ensembles well hinges on two questions: 1) how is each base algorithm trained? 2) how are the base algorithms combined?
Bagging is a voting-style method: it first uses the bootstrap to generate different training sets, then fits a base classifier on each of them, and finally combines the base classifiers' predictions into a model that is generally better than any single one.
Boosting is similar to bagging; the main difference is that in boosting the base classifiers are learned sequentially. Each round's classifier depends on the results of the previous rounds, that is, it concentrates on the samples that were misclassified. Through this compensatory learning, boosting uses the earlier rounds to adjust the later base classifiers and thereby achieves better classification performance.
Stacking is an ensemble technique that aggregates multiple classification or regression models through a meta-classifier or meta-regressor. The base-level models are trained on the full training set, and the meta-model is then trained on the outputs of the base-level models.
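Stacking is not demonstrated later in this post, so here is a minimal sketch on iris. The choice of base learners (rpart and MASS::lda) and meta-learner (nnet::multinom) is an assumption for illustration only, and a proper implementation would train the meta-model on out-of-fold predictions to avoid leakage.
# Minimal stacking sketch (illustrative; model choices are assumptions, not from the text)
library(rpart)   # base learner 1: decision tree
library(MASS)    # base learner 2: linear discriminant analysis
library(nnet)    # meta-learner: multinomial logistic regression
set.seed(1)
idx <- sample(nrow(iris), 100)
tr <- iris[idx, ]; te <- iris[-idx, ]
m_tree <- rpart(Species ~ ., data = tr)
m_lda  <- lda(Species ~ ., data = tr)
# level-1 features: class probabilities produced by the base models
meta_tr <- data.frame(tree = predict(m_tree, tr, type = "prob"),
                      lda  = predict(m_lda,  tr)$posterior,
                      Species = tr$Species)
meta_te <- data.frame(tree = predict(m_tree, te, type = "prob"),
                      lda  = predict(m_lda,  te)$posterior)
meta <- multinom(Species ~ ., data = meta_tr, trace = FALSE)  # meta-model on base-model outputs
mean(predict(meta, meta_te) == te$Species)                    # stacked accuracy on the hold-out set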
# install.packages("ipred")
library(ipred)
# bagging {ipred}
# ?bagging
# ?errorest
# Fit a bagged decision-tree classifier to the iris data with ipred
model.bagging <- ipred::bagging(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width, data=iris)
pre.bagging <- predict(model.bagging,iris)
# pre.bagging$confusion
# pre.bagging$error
# pre.bagging$votes
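Note that, unlike adabag, ipred's predict() on a classbagg object returns a plain factor of predicted classes by default, so the commented-out $confusion/$error/$votes components above are not available here. A quick resubstitution check looks like this:
table(pre.bagging, iris$Species)   # confusion matrix on the training data
mean(pre.bagging == iris$Species)  # resubstitution accuracy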
# install.packages("adabag")
library(adabag)
## Loading required package: rpart
## Loading required package: mlbench
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'adabag'
## The following object is masked from 'package:ipred':
##
## bagging
# bagging {adabag}
# ?bagging
# boosting {adabag}
# ?boosting
# Use the iris data set as an example
# Fit the model and inspect its components
fit_bag <- adabag::bagging(Species~.,data=iris,mfinal=5,control=rpart.control(maxdepth=3))
fit_bag[1:length(fit_bag)]
## $formula
## Species ~ .
##
## $trees
## $trees[[1]]
## n= 150
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 150 95 virginica (0.32000000 0.31333333 0.36666667)
## 2) Petal.Length< 2.45 48 0 setosa (1.00000000 0.00000000 0.00000000) *
## 3) Petal.Length>=2.45 102 47 virginica (0.00000000 0.46078431 0.53921569)
## 6) Petal.Length< 4.95 50 4 versicolor (0.00000000 0.92000000 0.08000000)
## 12) Petal.Width< 1.55 43 0 versicolor (0.00000000 1.00000000 0.00000000) *
## 13) Petal.Width>=1.55 7 3 virginica (0.00000000 0.42857143 0.57142857) *
## 7) Petal.Length>=4.95 52 1 virginica (0.00000000 0.01923077 0.98076923) *
##
## $trees[[2]]
## n= 150
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 150 93 versicolor (0.28000000 0.38000000 0.34000000)
## 2) Petal.Length< 2.45 42 0 setosa (1.00000000 0.00000000 0.00000000) *
## 3) Petal.Length>=2.45 108 51 versicolor (0.00000000 0.52777778 0.47222222)
## 6) Petal.Length< 4.85 56 2 versicolor (0.00000000 0.96428571 0.03571429) *
## 7) Petal.Length>=4.85 52 3 virginica (0.00000000 0.05769231 0.94230769) *
##
## $trees[[3]]
## n= 150
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 150 90 setosa (0.40000000 0.34666667 0.25333333)
## 2) Petal.Length< 2.6 60 0 setosa (1.00000000 0.00000000 0.00000000) *
## 3) Petal.Length>=2.6 90 38 versicolor (0.00000000 0.57777778 0.42222222)
## 6) Petal.Width< 1.65 51 3 versicolor (0.00000000 0.94117647 0.05882353) *
## 7) Petal.Width>=1.65 39 4 virginica (0.00000000 0.10256410 0.89743590) *
##
## $trees[[4]]
## n= 150
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 150 94 versicolor (0.31333333 0.37333333 0.31333333)
## 2) Petal.Length< 2.45 47 0 setosa (1.00000000 0.00000000 0.00000000) *
## 3) Petal.Length>=2.45 103 47 versicolor (0.00000000 0.54368932 0.45631068)
## 6) Petal.Width< 1.75 59 4 versicolor (0.00000000 0.93220339 0.06779661) *
## 7) Petal.Width>=1.75 44 1 virginica (0.00000000 0.02272727 0.97727273) *
##
## $trees[[5]]
## n= 150
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 150 97 setosa (0.35333333 0.35333333 0.29333333)
## 2) Petal.Length< 2.5 53 0 setosa (1.00000000 0.00000000 0.00000000) *
## 3) Petal.Length>=2.5 97 44 versicolor (0.00000000 0.54639175 0.45360825)
## 6) Petal.Length< 4.85 51 1 versicolor (0.00000000 0.98039216 0.01960784) *
## 7) Petal.Length>=4.85 46 3 virginica (0.00000000 0.06521739 0.93478261) *
##
##
## $votes
## [,1] [,2] [,3]
## [1,] 2 0 3
## [2,] 2 2 1
## [3,] 3 0 2
## [4,] 0 3 2
## [5,] 2 2 1
## [6,] 1 2 2
## [7,] 3 2 0
## [8,] 0 2 3
## [9,] 1 2 2
## [10,] 0 0 5
## [11,] 1 1 3
## [12,] 0 2 3
## [13,] 1 4 0
## [14,] 2 1 2
## [15,] 1 3 1
## [16,] 1 3 1
## [17,] 2 1 2
## [18,] 3 1 1
## [19,] 1 2 2
## [20,] 2 1 2
## [21,] 3 1 1
## [22,] 3 2 0
## [23,] 1 4 0
## [24,] 1 1 3
## [25,] 1 2 2
## [26,] 2 2 1
## [27,] 1 3 1
## [28,] 1 1 3
## [29,] 1 2 2
## [30,] 4 0 1
## [31,] 1 2 2
## [32,] 1 2 2
## [33,] 2 2 1
## [34,] 0 3 2
## [35,] 1 2 2
## [36,] 3 1 1
## [37,] 2 3 0
## [38,] 1 3 1
## [39,] 3 1 1
## [40,] 2 2 1
## [41,] 2 2 1
## [42,] 3 2 0
## [43,] 1 3 1
## [44,] 0 1 4
## [45,] 2 0 3
## [46,] 2 2 1
## [47,] 3 2 0
## [48,] 0 2 3
## [49,] 0 2 3
## [50,] 2 2 1
## [51,] 2 2 1
## [52,] 2 1 2
## [53,] 3 1 1
## [54,] 4 0 1
## [55,] 2 1 2
## [56,] 2 2 1
## [57,] 2 2 1
## [58,] 3 1 1
## [59,] 2 1 2
## [60,] 3 1 1
## [61,] 1 0 4
## [62,] 1 3 1
## [63,] 1 3 1
## [64,] 3 1 1
## [65,] 3 1 1
## [66,] 2 1 2
## [67,] 3 1 1
## [68,] 0 1 4
## [69,] 0 2 3
## [70,] 1 2 2
## [71,] 2 2 1
## [72,] 0 4 1
## [73,] 1 1 3
## [74,] 3 1 1
## [75,] 2 2 1
## [76,] 2 2 1
## [77,] 0 2 3
## [78,] 2 1 2
## [79,] 2 0 3
## [80,] 0 2 3
## [81,] 3 1 1
## [82,] 3 0 2
## [83,] 1 3 1
## [84,] 1 1 3
## [85,] 2 2 1
## [86,] 2 0 3
## [87,] 1 1 3
## [88,] 1 4 0
## [89,] 3 0 2
## [90,] 2 3 0
## [91,] 1 1 3
## [92,] 3 1 1
## [93,] 0 2 3
## [94,] 0 2 3
## [95,] 1 2 2
## [96,] 1 4 0
## [97,] 2 2 1
## [98,] 2 1 2
## [99,] 2 0 3
## [100,] 1 1 3
## [101,] 2 2 1
## [102,] 1 3 1
## [103,] 1 3 1
## [104,] 2 2 1
## [105,] 1 4 0
## [106,] 4 0 1
## [107,] 2 3 0
## [108,] 1 1 3
## [109,] 1 0 4
## [110,] 2 2 1
## [111,] 1 2 2
## [112,] 2 2 1
## [113,] 2 2 1
## [114,] 1 4 0
## [115,] 0 3 2
## [116,] 2 2 1
## [117,] 3 2 0
## [118,] 4 0 1
## [119,] 0 2 3
## [120,] 3 2 0
## [121,] 2 2 1
## [122,] 1 1 3
## [123,] 3 2 0
## [124,] 0 4 1
## [125,] 1 3 1
## [126,] 3 1 1
## [127,] 3 0 2
## [128,] 2 2 1
## [129,] 2 2 1
## [130,] 0 3 2
## [131,] 1 2 2
## [132,] 1 1 3
## [133,] 3 0 2
## [134,] 2 2 1
## [135,] 2 1 2
## [136,] 3 1 1
## [137,] 1 3 1
## [138,] 3 1 1
## [139,] 2 1 2
## [140,] 1 1 3
## [141,] 1 4 0
## [142,] 2 1 2
## [143,] 1 3 1
## [144,] 0 5 0
## [145,] 2 1 2
## [146,] 2 2 1
## [147,] 1 0 4
## [148,] 4 0 1
## [149,] 1 3 1
## [150,] 2 1 2
##
## $prob
## [,1] [,2] [,3]
## [1,] 0.4 0.0 0.6
## [2,] 0.4 0.4 0.2
## [3,] 0.6 0.0 0.4
## [4,] 0.0 0.6 0.4
## [5,] 0.4 0.4 0.2
## [6,] 0.2 0.4 0.4
## [7,] 0.6 0.4 0.0
## [8,] 0.0 0.4 0.6
## [9,] 0.2 0.4 0.4
## [10,] 0.0 0.0 1.0
## [11,] 0.2 0.2 0.6
## [12,] 0.0 0.4 0.6
## [13,] 0.2 0.8 0.0
## [14,] 0.4 0.2 0.4
## [15,] 0.2 0.6 0.2
## [16,] 0.2 0.6 0.2
## [17,] 0.4 0.2 0.4
## [18,] 0.6 0.2 0.2
## [19,] 0.2 0.4 0.4
## [20,] 0.4 0.2 0.4
## [21,] 0.6 0.2 0.2
## [22,] 0.6 0.4 0.0
## [23,] 0.2 0.8 0.0
## [24,] 0.2 0.2 0.6
## [25,] 0.2 0.4 0.4
## [26,] 0.4 0.4 0.2
## [27,] 0.2 0.6 0.2
## [28,] 0.2 0.2 0.6
## [29,] 0.2 0.4 0.4
## [30,] 0.8 0.0 0.2
## [31,] 0.2 0.4 0.4
## [32,] 0.2 0.4 0.4
## [33,] 0.4 0.4 0.2
## [34,] 0.0 0.6 0.4
## [35,] 0.2 0.4 0.4
## [36,] 0.6 0.2 0.2
## [37,] 0.4 0.6 0.0
## [38,] 0.2 0.6 0.2
## [39,] 0.6 0.2 0.2
## [40,] 0.4 0.4 0.2
## [41,] 0.4 0.4 0.2
## [42,] 0.6 0.4 0.0
## [43,] 0.2 0.6 0.2
## [44,] 0.0 0.2 0.8
## [45,] 0.4 0.0 0.6
## [46,] 0.4 0.4 0.2
## [47,] 0.6 0.4 0.0
## [48,] 0.0 0.4 0.6
## [49,] 0.0 0.4 0.6
## [50,] 0.4 0.4 0.2
## [51,] 0.4 0.4 0.2
## [52,] 0.4 0.2 0.4
## [53,] 0.6 0.2 0.2
## [54,] 0.8 0.0 0.2
## [55,] 0.4 0.2 0.4
## [56,] 0.4 0.4 0.2
## [57,] 0.4 0.4 0.2
## [58,] 0.6 0.2 0.2
## [59,] 0.4 0.2 0.4
## [60,] 0.6 0.2 0.2
## [61,] 0.2 0.0 0.8
## [62,] 0.2 0.6 0.2
## [63,] 0.2 0.6 0.2
## [64,] 0.6 0.2 0.2
## [65,] 0.6 0.2 0.2
## [66,] 0.4 0.2 0.4
## [67,] 0.6 0.2 0.2
## [68,] 0.0 0.2 0.8
## [69,] 0.0 0.4 0.6
## [70,] 0.2 0.4 0.4
## [71,] 0.4 0.4 0.2
## [72,] 0.0 0.8 0.2
## [73,] 0.2 0.2 0.6
## [74,] 0.6 0.2 0.2
## [75,] 0.4 0.4 0.2
## [76,] 0.4 0.4 0.2
## [77,] 0.0 0.4 0.6
## [78,] 0.4 0.2 0.4
## [79,] 0.4 0.0 0.6
## [80,] 0.0 0.4 0.6
## [81,] 0.6 0.2 0.2
## [82,] 0.6 0.0 0.4
## [83,] 0.2 0.6 0.2
## [84,] 0.2 0.2 0.6
## [85,] 0.4 0.4 0.2
## [86,] 0.4 0.0 0.6
## [87,] 0.2 0.2 0.6
## [88,] 0.2 0.8 0.0
## [89,] 0.6 0.0 0.4
## [90,] 0.4 0.6 0.0
## [91,] 0.2 0.2 0.6
## [92,] 0.6 0.2 0.2
## [93,] 0.0 0.4 0.6
## [94,] 0.0 0.4 0.6
## [95,] 0.2 0.4 0.4
## [96,] 0.2 0.8 0.0
## [97,] 0.4 0.4 0.2
## [98,] 0.4 0.2 0.4
## [99,] 0.4 0.0 0.6
## [100,] 0.2 0.2 0.6
## [101,] 0.4 0.4 0.2
## [102,] 0.2 0.6 0.2
## [103,] 0.2 0.6 0.2
## [104,] 0.4 0.4 0.2
## [105,] 0.2 0.8 0.0
## [106,] 0.8 0.0 0.2
## [107,] 0.4 0.6 0.0
## [108,] 0.2 0.2 0.6
## [109,] 0.2 0.0 0.8
## [110,] 0.4 0.4 0.2
## [111,] 0.2 0.4 0.4
## [112,] 0.4 0.4 0.2
## [113,] 0.4 0.4 0.2
## [114,] 0.2 0.8 0.0
## [115,] 0.0 0.6 0.4
## [116,] 0.4 0.4 0.2
## [117,] 0.6 0.4 0.0
## [118,] 0.8 0.0 0.2
## [119,] 0.0 0.4 0.6
## [120,] 0.6 0.4 0.0
## [121,] 0.4 0.4 0.2
## [122,] 0.2 0.2 0.6
## [123,] 0.6 0.4 0.0
## [124,] 0.0 0.8 0.2
## [125,] 0.2 0.6 0.2
## [126,] 0.6 0.2 0.2
## [127,] 0.6 0.0 0.4
## [128,] 0.4 0.4 0.2
## [129,] 0.4 0.4 0.2
## [130,] 0.0 0.6 0.4
## [131,] 0.2 0.4 0.4
## [132,] 0.2 0.2 0.6
## [133,] 0.6 0.0 0.4
## [134,] 0.4 0.4 0.2
## [135,] 0.4 0.2 0.4
## [136,] 0.6 0.2 0.2
## [137,] 0.2 0.6 0.2
## [138,] 0.6 0.2 0.2
## [139,] 0.4 0.2 0.4
## [140,] 0.2 0.2 0.6
## [141,] 0.2 0.8 0.0
## [142,] 0.4 0.2 0.4
## [143,] 0.2 0.6 0.2
## [144,] 0.0 1.0 0.0
## [145,] 0.4 0.2 0.4
## [146,] 0.4 0.4 0.2
## [147,] 0.2 0.0 0.8
## [148,] 0.8 0.0 0.2
## [149,] 0.2 0.6 0.2
## [150,] 0.4 0.2 0.4
##
## $class
## [1] "virginica" "setosa" "setosa" "versicolor" "setosa"
## [6] "versicolor" "setosa" "virginica" "versicolor" "virginica"
## [11] "virginica" "virginica" "versicolor" "setosa" "versicolor"
## [16] "versicolor" "setosa" "setosa" "versicolor" "setosa"
## [21] "setosa" "setosa" "versicolor" "virginica" "versicolor"
## [26] "setosa" "versicolor" "virginica" "versicolor" "setosa"
## [31] "versicolor" "versicolor" "setosa" "versicolor" "versicolor"
## [36] "setosa" "versicolor" "versicolor" "setosa" "setosa"
## [41] "setosa" "setosa" "versicolor" "virginica" "virginica"
## [46] "setosa" "setosa" "virginica" "virginica" "setosa"
## [51] "setosa" "setosa" "setosa" "setosa" "setosa"
## [56] "setosa" "setosa" "setosa" "setosa" "setosa"
## [61] "virginica" "versicolor" "versicolor" "setosa" "setosa"
## [66] "setosa" "setosa" "virginica" "virginica" "versicolor"
## [71] "setosa" "versicolor" "virginica" "setosa" "setosa"
## [76] "setosa" "virginica" "setosa" "virginica" "virginica"
## [81] "setosa" "setosa" "versicolor" "virginica" "setosa"
## [86] "virginica" "virginica" "versicolor" "setosa" "versicolor"
## [91] "virginica" "setosa" "virginica" "virginica" "versicolor"
## [96] "versicolor" "setosa" "setosa" "virginica" "virginica"
## [101] "setosa" "versicolor" "versicolor" "setosa" "versicolor"
## [106] "setosa" "versicolor" "virginica" "virginica" "setosa"
## [111] "versicolor" "setosa" "setosa" "versicolor" "versicolor"
## [116] "setosa" "setosa" "setosa" "virginica" "setosa"
## [121] "setosa" "virginica" "setosa" "versicolor" "versicolor"
## [126] "setosa" "setosa" "setosa" "setosa" "versicolor"
## [131] "versicolor" "virginica" "setosa" "setosa" "setosa"
## [136] "setosa" "versicolor" "setosa" "setosa" "virginica"
## [141] "versicolor" "setosa" "versicolor" "versicolor" "setosa"
## [146] "setosa" "virginica" "setosa" "versicolor" "setosa"
##
## $samples
## [,1] [,2] [,3] [,4] [,5]
## [1,] 123 125 107 10 32
## [2,] 4 5 83 150 86
## [3,] 137 3 17 42 118
## [4,] 123 121 77 61 87
## [5,] 63 147 24 46 89
## [6,] 135 51 49 134 73
## [7,] 74 46 36 87 21
## [8,] 104 120 64 86 143
## [9,] 148 74 149 5 55
## [10,] 109 150 150 144 114
## [11,] 46 148 78 93 145
## [12,] 128 56 59 140 148
## [13,] 49 71 95 57 100
## [14,] 52 17 146 119 12
## [15,] 97 18 66 96 124
## [16,] 25 112 59 51 55
## [17,] 144 142 38 85 23
## [18,] 18 15 11 109 91
## [19,] 53 114 42 122 80
## [20,] 146 136 7 84 40
## [21,] 37 124 85 40 18
## [22,] 5 58 38 48 81
## [23,] 95 62 85 107 31
## [24,] 20 110 103 57 103
## [25,] 3 111 56 56 101
## [26,] 73 18 12 82 101
## [27,] 94 127 106 36 54
## [28,] 12 111 114 128 90
## [29,] 44 120 139 98 63
## [30,] 21 38 4 17 110
## [31,] 143 85 72 47 122
## [32,] 4 146 86 65 128
## [33,] 45 93 70 137 31
## [34,] 150 51 55 92 114
## [35,] 99 16 56 141 103
## [36,] 107 18 45 95 41
## [37,] 99 5 61 18 76
## [38,] 2 79 85 88 135
## [39,] 15 136 72 10 38
## [40,] 38 64 42 57 144
## [41,] 65 60 28 149 46
## [42,] 24 27 59 19 65
## [43,] 79 65 105 78 9
## [44,] 120 70 105 133 117
## [45,] 145 53 23 112 4
## [46,] 53 64 121 42 43
## [47,] 8 3 55 13 74
## [48,] 79 130 85 126 143
## [49,] 129 73 59 103 66
## [50,] 137 23 33 85 71
## [51,] 55 1 26 84 147
## [52,] 103 136 15 58 12
## [53,] 71 71 18 4 18
## [54,] 39 29 9 101 42
## [55,] 119 124 10 12 57
## [56,] 22 44 134 124 82
## [57,] 8 55 7 123 81
## [58,] 106 35 25 28 76
## [59,] 144 16 2 64 124
## [60,] 82 150 34 44 14
## [61,] 141 108 30 123 84
## [62,] 121 6 64 89 57
## [63,] 113 58 82 79 20
## [64,] 8 51 6 49 112
## [65,] 95 22 9 118 37
## [66,] 86 11 96 13 108
## [67,] 35 114 29 58 11
## [68,] 117 124 138 94 148
## [69,] 133 116 107 67 51
## [70,] 135 114 29 83 67
## [71,] 29 86 21 62 110
## [72,] 143 64 96 59 65
## [73,] 129 79 138 30 150
## [74,] 71 12 28 120 16
## [75,] 69 70 39 136 46
## [76,] 29 143 44 52 67
## [77,] 144 116 97 117 91
## [78,] 51 45 105 112 22
## [79,] 115 141 21 109 31
## [80,] 141 100 120 116 112
## [81,] 28 111 14 78 1
## [82,] 103 145 39 34 27
## [83,] 77 128 55 16 81
## [84,] 83 118 131 26 103
## [85,] 150 76 70 17 49
## [86,] 8 126 9 143 123
## [87,] 127 43 94 71 121
## [88,] 26 91 75 65 100
## [89,] 16 11 115 25 122
## [90,] 99 22 89 27 60
## [91,] 126 78 78 130 3
## [92,] 42 33 148 13 52
## [93,] 140 91 125 145 79
## [94,] 79 137 95 148 111
## [95,] 55 138 68 48 106
## [96,] 68 88 95 82 32
## [97,] 145 72 19 52 13
## [98,] 44 99 8 150 114
## [99,] 111 112 111 11 20
## [100,] 101 118 114 20 87
## [101,] 7 107 34 106 92
## [102,] 8 82 120 94 101
## [103,] 96 90 33 78 106
## [104,] 42 102 72 57 26
## [105,] 35 88 62 93 58
## [106,] 12 46 20 34 135
## [107,] 65 68 59 7 22
## [108,] 3 101 143 54 106
## [109,] 36 124 137 106 124
## [110,] 81 44 125 41 127
## [111,] 56 75 9 131 137
## [112,] 29 121 53 52 26
## [113,] 16 138 27 89 100
## [114,] 85 58 95 54 24
## [115,] 142 74 107 72 58
## [116,] 9 88 13 138 71
## [117,] 35 92 9 91 18
## [118,] 123 8 36 47 50
## [119,] 92 110 107 116 61
## [120,] 32 81 34 99 35
## [121,] 79 72 6 21 143
## [122,] 87 129 142 112 35
## [123,] 21 57 36 83 24
## [124,] 82 69 56 102 97
## [125,] 76 60 132 48 57
## [126,] 74 4 107 23 34
## [127,] 141 126 40 11 49
## [128,] 70 25 147 83 3
## [129,] 130 44 18 55 87
## [130,] 69 61 108 113 83
## [131,] 141 59 78 65 4
## [132,] 113 10 71 145 54
## [133,] 134 7 44 5 122
## [134,] 108 19 88 44 71
## [135,] 54 24 118 129 32
## [136,] 128 20 91 32 24
## [137,] 98 124 47 96 76
## [138,] 147 70 46 48 6
## [139,] 114 69 144 15 19
## [140,] 115 108 60 140 11
## [141,] 96 87 87 37 67
## [142,] 72 150 121 17 41
## [143,] 108 88 38 68 65
## [144,] 99 69 59 69 54
## [145,] 24 98 29 131 78
## [146,] 12 70 40 62 117
## [147,] 144 20 105 108 132
## [148,] 109 29 2 11 10
## [149,] 32 63 65 141 70
## [150,] 84 28 61 118 23
##
## $importance
## Petal.Length Petal.Width Sepal.Length Sepal.Width
## 83.02502 16.97498 0.00000 0.00000
##
## $terms
## Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
## attr(,"variables")
## list(Species, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
## attr(,"factors")
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Species 0 0 0 0
## Sepal.Length 1 0 0 0
## Sepal.Width 0 1 0 0
## Petal.Length 0 0 1 0
## Petal.Width 0 0 0 1
## attr(,"term.labels")
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## attr(,"order")
## [1] 1 1 1 1
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
## attr(,"predvars")
## list(Species, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
## attr(,"dataClasses")
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## "factor" "numeric" "numeric" "numeric" "numeric"
##
## $call
## adabag::bagging(formula = Species ~ ., data = iris, mfinal = 5,
## control = rpart.control(maxdepth = 3))
# Prediction
pre_bag <- predict(fit_bag,iris)
pre_bag$error
## [1] 0.02666667
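The adabag prediction object also carries a confusion matrix (and per-observation votes), which is worth inspecting alongside the error rate:
pre_bag$confusion  # confusion matrix of the bagged ensemble on iris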
# Fit a boosted decision-tree classifier to the iris data
model.boosting <- boosting(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris)
pre.boosting <- predict(model.boosting,iris)
pre.boosting$confusion
## Observed Class
## Predicted Class setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 50 0
## virginica 0 0 50
pre.boosting$error
## [1] 0
# pre.boosting$votes
# The adabag package provides functions for both bagging and AdaBoost classification
# Fit a model on the full data set
library(adabag)
a<-boosting(Species~.,data=iris)
z0<-table(iris[,5],predict(a,iris)$class)
# misclassification rate
E0<-(sum(z0)-sum(diag(z0)))/sum(z0)
barplot(a$importance)
b<-errorevol(a,iris) # error evolution of the ensemble over the full data
plot(b$error,type="l",main="AdaBoost error vs number of trees") # plot the error evolution
a<-bagging(Species~.,data=iris)
z0<-table(iris[,5],predict(a,iris)$class)
# misclassification rate
E0<-(sum(z0)-sum(diag(z0)))/sum(z0)
barplot(a$importance)
b<-errorevol(a,iris) # error evolution of the ensemble over the full data
plot(b$error,type="l",main="Bagging error vs number of trees") # plot the error evolution
# Hold-out evaluation: randomly take half of each species as the training set
set.seed(1044) # set the random seed
samp<-c(sample(1:50,25),sample(51:100,25),sample(101:150,25)) # stratified random sample
a<-boosting(Species~.,data=iris[samp,]) # fit an AdaBoost classifier on the training set
z0<-table(iris[samp,5],predict(a,iris[samp,])$class) # training-set results
z1<-table(iris[-samp,5],predict(a,iris[-samp,])$class) # test-set results
E0<-(sum(z0)-sum(diag(z0)))/sum(z0)
E1<-(sum(z1)-sum(diag(z1)))/sum(z1)
E0
## [1] 0
E1
a<-bagging(Species~.,data=iris[samp,]) # fit a bagging classifier on the training set
z0<-table(iris[samp,5],predict(a,iris[samp,])$class) # training-set results
z1<-table(iris[-samp,5],predict(a,iris[-samp,])$class) # test-set results
E0<-(sum(z0)-sum(diag(z0)))/sum(z0)
E1<-(sum(z1)-sum(diag(z1)))/sum(z1)
E0
## [1] 0.02666667
E1
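adabag also ships proper k-fold cross-validation helpers, which give a less optimistic estimate than the hold-out split above. A sketch (v is the number of folds, mfinal the number of trees; the values here are arbitrary):
cv_boost <- boosting.cv(Species~., data=iris, v=5, mfinal=10)  # 5-fold CV for AdaBoost
cv_boost$error
cv_bag <- bagging.cv(Species~., data=iris, v=5, mfinal=10)     # 5-fold CV for bagging
cv_bag$error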
In R, the gbm package implements the general boosting framework. Depending on the base learner, the loss function and the optimization method, boosting comes in many different flavours.
Adaptive boosting (AdaBoost)
AdaBoost is a classic and important boosting algorithm. During learning each sample carries a weight, and initially all weights are equal. After each training round the weights of the misclassified samples are increased, so those samples stand out in the next round. After N iterations we obtain N simple learners, which are finally combined into the overall model.
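For concreteness, with labels $y_i \in \{-1,+1\}$, weak learner $h_m$ and sample weights $w_i$, the standard AdaBoost update is
$$\epsilon_m=\sum_i w_i\,[h_m(x_i)\neq y_i],\qquad \alpha_m=\tfrac12\ln\frac{1-\epsilon_m}{\epsilon_m},\qquad w_i\leftarrow \frac{w_i\exp\!\big(-\alpha_m\,y_i\,h_m(x_i)\big)}{Z_m},$$
and the final classifier is $H(x)=\operatorname{sign}\big(\sum_{m=1}^{N}\alpha_m h_m(x)\big)$.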
Gradient boosting
Gradient boosting first computes pseudo-residuals from the current model and then fits a base learner to explain them; the base learner reduces the residuals along the gradient direction. This base learner, multiplied by a weight (the learning rate), is added to the current model to form a new model, and repeating the iteration yields a model whose expected loss is minimized. If each base learner is trained on a resampled subset of the data, the procedure is called stochastic gradient boosting.
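In symbols, with loss $L$, learning rate $\nu$ and base learner $h_m$ fitted to the pseudo-residuals, each iteration computes
$$r_{im}=-\left[\frac{\partial L\big(y_i,F(x_i)\big)}{\partial F(x_i)}\right]_{F=F_{m-1}},\qquad F_m(x)=F_{m-1}(x)+\nu\,h_m(x).$$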
The gbm package uses decision trees as the base learner; its important parameters are:
the form of the loss function (distribution)
the number of iterations (n.trees)
the learning rate (shrinkage)
the resampling fraction (bag.fraction)
the depth of each tree (interaction.depth)
The loss function is easy to choose: bernoulli is the usual choice for classification and gaussian for regression. As for the learning rate, taking steps that are too large risks overshooting, so smaller is better; but with very small steps more of them are needed, i.e. the number of iterations must grow for the model to reach its optimum, which increases training time and computational cost accordingly. The gbm author's rule of thumb is to set shrinkage between 0.001 and 0.01 and n.trees between 3,000 and 10,000.
We illustrate gbm with a data set from the mlbench package; the response variable is diabetes, i.e. whether a patient's diabetes diagnosis is positive or negative.
library(gbm)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
data(PimaIndiansDiabetes2,package='mlbench')
# recode the response as 0/1
data <- PimaIndiansDiabetes2
data$diabetes <- as.numeric(data$diabetes)
data <- transform(data,diabetes=diabetes-1)
# fit the model with gbm()
model <- gbm(diabetes~.,data=data,shrinkage=0.01,
distribution='bernoulli',cv.folds=5,
n.trees=3000,verbose=F)
# use cross-validation to pick the best number of iterations
best.iter <- gbm.perf(model,method='cv')
best.iter
## [1] 1156
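With the CV-selected number of trees one would then predict as follows (a sketch; for the bernoulli loss, type = "response" returns probabilities):
pred <- predict(model, data, n.trees=best.iter, type="response")
mean(ifelse(pred>0.5, 1, 0) == data$diabetes)  # in-sample accuracy at best.iter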
dataset <- read.table("http://data.galaxystatistics.com/blog_data/ensemble_learning/crx.data", sep=",", header=F, na.strings="?")
head(dataset)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
## 1 b 30.83 0.000 u g w v 1.25 t t 1 f g 202 0 +
## 2 a 58.67 4.460 u g q h 3.04 t t 6 f g 43 560 +
## 3 a 24.50 0.500 u g q h 1.50 t f 0 f g 280 824 +
## 4 b 27.83 1.540 u g w v 3.75 t t 5 t g 100 3 +
## 5 b 20.17 5.625 u g w v 1.71 t f 0 f s 120 0 +
## 6 b 32.08 4.000 u g m v 2.50 t f 0 t g 360 0 +
sapply(dataset, function(x) sum(is.na(x)))
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
## 12 12 0 6 6 9 9 0 0 0 0 0 0 13 0 0
sapply(dataset, class)
## V1 V2 V3 V4 V5 V6 V7
## "factor" "numeric" "numeric" "factor" "factor" "factor" "factor"
## V8 V9 V10 V11 V12 V13 V14
## "numeric" "factor" "factor" "integer" "factor" "factor" "integer"
## V15 V16
## "integer" "factor"
# split the data into training and test sets
set.seed(123)
dataset <- na.omit(dataset) # drop rows containing NAs
n <- dim(dataset)[1]
index <- sample(n, round(0.7*n))
train <- dataset[index,]
test <- dataset[-index,]
dim(train)
## [1] 457 16
#### Sometimes variables must be recoded as dummy variables, because some methods cannot use factor data directly:
# knn
# glmnet
# svm
# xgboost
### Other methods can handle factor variables directly, for example:
# logistic regression
# rpart
# GBM
# randomForest
dataset2 <- dataset
library(plyr)
into_factor <- function(x){
if(class(x) == "factor"){
n <- length(x)
data.fac <- data.frame(x=x, y=1:n)
output <- model.matrix(y~x, data.fac)[,-1]
## Convert factor into dummy variable matrix
}else{
output <- x
## if x is numeric, output is x
}
output
}
into_factor(dataset$V4)[1:5,]
## xu xy
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
# apply into_factor to every column with plyr::colwise
dataset2 <- colwise(into_factor)(dataset2)
# class(dataset2)
dataset2 <- do.call(cbind,dataset2)
dataset2 <- as.data.frame(dataset2)
dim(dataset2)
## [1] 653 38
##### Logistic regression (glm) for modeling and prediction
logit.model <- glm(V16~.,data=train,family = "binomial")
logit.response <- predict(logit.model, test, type="response")
logit.predict <- ifelse(logit.response>0.5,"+","-")
table(logit.predict,test$V16)
##
## logit.predict - +
## - 90 24
## + 13 69
accuracy1 <- mean(logit.predict==test$V16)
#### GBM via caret, using repeated cross-validation to pick the best model
library(caret)
ctrl <- trainControl(method = "repeatedcv",number = 5,repeats=5)
set.seed(300)
m_gbm <- train(V16~.,data=train,method="gbm",metric="Kappa",trControl=ctrl)
gbm.predict <- predict(m_gbm,test)
table(gbm.predict,test$V16)
##
## gbm.predict - +
## - 96 29
## + 7 64
accuracy2 <- mean(gbm.predict==test$V16)
#### First try a knn model with no CV, no standardization and no type conversion; without the conversion the factor variables are simply dropped and only the numeric variables are used
library(caret)
knn.model1 <- knn3(V16~.,data=train,k=5)
knn.response1 <- predict(knn.model1,test,type="prob")
knn.predict1 <- ifelse(knn.response1[,1]<0.5,"+","-")
table(knn.predict1,test$V16)
##
## knn.predict1 - +
## - 78 48
## + 25 45
mean(knn.predict1==test$V16)
## [1] 0.627551
##### Accuracy after standardization and dummy-variable conversion
knn.dataset <- cbind(colwise(scale)(dataset2[,-38]),V16=as.factor(dataset2$V16))
set.seed(123)
index <- sample(n,round(0.7*n))
train.knn <- knn.dataset[index,]
test.knn <- knn.dataset[-index,]
knn.model1 <- knn3(V16 ~ .,data = train.knn, k = 5)
knn.predict1 <- predict(knn.model1, test.knn, type="class")
table(knn.predict1, test.knn$V16)
##
## knn.predict1 0 1
## 0 89 32
## 1 14 61
mean(knn.predict1 == test.knn$V16)
## [1] 0.7653061
##### knn CV for k
##### Both the hand-rolled CV function below and caret suggest the error is smallest around k = 2, which does not match practical experience
library(class)
cv.knn <- function(data, n = 5, k){
  # n-fold cross-validation of knn3 for a given k
  index <- sample(1:n, nrow(data), replace = TRUE)
  acc <- numeric(n)
  for (i in 1:n){
    ind <- index == i
    train <- data[!ind, ]
    test  <- data[ind, ]
    knn.model1  <- knn3(V16 ~ ., data = train, k = k)
    knn.predict <- predict(knn.model1, test, type = "class")
    acc[i] <- mean(knn.predict == test$V16)
  }
  mean(acc)
}
cv.knn(train.knn,3,5)
## [1] 0.8532997
k <- 2:20
set.seed(123)
acc <- sapply(k,function(x) cv.knn(train.knn,3,x))
plot(k, acc, type="b")
k.final <- k[which.max(acc)]  # map the index of the best accuracy back to the corresponding k
knn.model.f <- knn3(V16 ~ .,data = train.knn, k = k.final)
knn.predict.f <- predict(knn.model.f,test.knn,type = "class")
table(knn.predict.f,test.knn$V16)
##
## knn.predict.f 0 1
## 0 81 31
## 1 22 62
mean(knn.predict.f == test.knn$V16)
## [1] 0.7295918
library(caret)
fitControl <- trainControl(method = "cv", number = 10)
knnTune <- train(x = dataset2[1:37], y = dataset2[,38], method = "knn", preProc = c("center", "scale"),tuneGrid = data.frame(.k = 1:20), trControl = fitControl)
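To see which k caret selected and the resampled accuracy profile:
knnTune        # resampling results and the chosen k
plot(knnTune)  # accuracy as a function of k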
dataset <- read.table("http://data.galaxystatistics.com/blog_data/ensemble_learning/crx.data", sep=",", header=F, na.strings="?")
head(dataset)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
## 1 b 30.83 0.000 u g w v 1.25 t t 1 f g 202 0 +
## 2 a 58.67 4.460 u g q h 3.04 t t 6 f g 43 560 +
## 3 a 24.50 0.500 u g q h 1.50 t f 0 f g 280 824 +
## 4 b 27.83 1.540 u g w v 3.75 t t 5 t g 100 3 +
## 5 b 20.17 5.625 u g w v 1.71 t f 0 f s 120 0 +
## 6 b 32.08 4.000 u g m v 2.50 t f 0 t g 360 0 +
sapply(dataset, function(x) sum(is.na(x)))
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
## 12 12 0 6 6 9 9 0 0 0 0 0 0 13 0 0
sapply(dataset, class)
## V1 V2 V3 V4 V5 V6 V7
## "factor" "numeric" "numeric" "factor" "factor" "factor" "factor"
## V8 V9 V10 V11 V12 V13 V14
## "numeric" "factor" "factor" "integer" "factor" "factor" "integer"
## V15 V16
## "integer" "factor"
#分割数据的训练集和测试集
set.seed(123)
dataset <- na.omit(dataset) # 去掉存在NA的行
n <- dim(dataset)[1]
index <- sample(n, round(0.7*n))
train <- dataset[index,]
test <- dataset[-index,]
dim(train)
## [1] 457 16
####有时候,需要转化变量为哑变量,因为在一些挖掘场合,数据不能直接使用因子型的数据
# knn
# glmnet
# svm
# xgboost
###有些挖掘方法可以使用因子变量,比如:
# logistic regression
# repart
# GBM
# randomForest
dataset2 <- dataset
library(plyr)
into_factor <- function(x){
if(class(x) == "factor"){
n <- length(x)
data.fac <- data.frame(x=x, y=1:n)
output <- model.matrix(y~x, data.fac)[,-1]
## Convert factor into dummy variable matrix
}else{
output <- x
## if x is numeric, output is x
}
output
}
into_factor(dataset$V4)[1:5,]
## xu xy
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
# colwise !!!
dataset2 <- colwise(into_factor)(dataset2)
# class(dataset2)
dataset2 <- do.call(cbind,dataset2)
dataset2 <- as.data.frame(dataset2)
dim(dataset2)
## [1] 653 38
##### xgboost
library(xgboost)
library(methods)
library(plyr)
set.seed(123)
index = sample(n,round(0.7*n))
train.xg = dataset2[index,]
test.xg = dataset2[-index,]
label <- as.matrix(train.xg[,38,drop =F])
data <- as.matrix(train.xg[,-38,drop =F])
label2 <- as.matrix(test.xg[,38,drop =F])
data2 <- as.matrix(test.xg[,-38,drop =F])
# weight <- as.numeric(dtrain[[32]]) * testsize / length(label)
xgmat <- xgb.DMatrix(data, label = label, missing = -10000)
param <- list("objective" = "binary:logistic","bst:eta" = 1,"bst:max_depth" = 2,"eval_metric" = "logloss","silent" = 1,"nthread" = 16 ,"min_child_weight" =1.45)
nround <- 275
bst <- xgb.train(param, xgmat, nround)
res1 <- predict(bst, data2)
pre1 <- ifelse(res1>0.5, 1, 0)
table(pre1, label2)
## label2
## pre1 0 1
## 0 89 19
## 1 14 74
accuracy4 <- mean(pre1 ==label2)
accuracy4
## [1] 0.8316327
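The number of rounds above (275) was fixed by hand; a common alternative is to let cross-validation choose it. A sketch reusing the same parameters (the fold count here is arbitrary):
cv <- xgb.cv(params=param, data=xgmat, nrounds=nround, nfold=5)  # 5-fold CV over the same rounds
# inspect cv to see where the held-out logloss stops improving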
The case study asks whether taking a placebo (versus the actual treatment) is associated with recovery; the other variables are age and sex.
library(xgboost)
library(Matrix)
library(data.table)
library(vcd)
## Loading required package: grid
# if (!require('vcd')) install.packages('vcd')
# Load the data
data(Arthritis)
df <- data.table(Arthritis, keep.rownames = F)
# Next, some light preprocessing. Note the handy data.table idioms: ifelse(), and the := operator (which adds or modifies columns in place inside []).
head(df[,AgeDiscret := as.factor(round(Age/10,0))]) # := adds a new column
## ID Treatment Sex Age Improved AgeDiscret
## 1: 57 Treated Male 27 Some 3
## 2: 46 Treated Male 29 None 3
## 3: 77 Treated Male 30 None 3
## 4: 17 Treated Male 32 Marked 3
## 5: 36 Treated Male 46 Marked 5
## 6: 23 Treated Male 58 Marked 6
head(df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))]) #ifelse
## ID Treatment Sex Age Improved AgeDiscret AgeCat
## 1: 57 Treated Male 27 Some 3 Young
## 2: 46 Treated Male 29 None 3 Young
## 3: 77 Treated Male 30 None 3 Young
## 4: 17 Treated Male 32 Marked 3 Old
## 5: 36 Treated Male 46 Marked 5 Old
## 6: 23 Treated Male 58 Marked 6 Old
df[,ID:=NULL]
# Build the model matrix.
# One-hot encode the predictors: Improved is the response, and the -1 drops the intercept so the factor variables (such as Treatment) are expanded into dummy columns.
sparse_matrix <- sparse.model.matrix(Improved~.-1, data = df) # sparse storage: zeros are shown as '.', keeping memory use low
# Define the (binary) response: was the improvement 'Marked' or not
output_vector <- df[,Improved] == "Marked"
# Fit the xgboost model
# nround is the number of boosting iterations and can be used to control overfitting;
# nthread is the number of threads (all available threads are used if unspecified);
# objective specifies the loss: binary:logistic for binary classification; other options include reg:linear (the default), reg:logistic, count:poisson (Poisson regression) and multi:softmax
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
eta = 1, nthread = 2, nround = 10,objective = "binary:logistic")
## [1] train-error:0.202381
## [2] train-error:0.166667
## [3] train-error:0.166667
## [4] train-error:0.166667
## [5] train-error:0.154762
## [6] train-error:0.154762
## [7] train-error:0.154762
## [8] train-error:0.166667
## [9] train-error:0.166667
## [10] train-error:0.166667
# Feature importance ranking
# Several measures are reported: Gain, the improvement in the split criterion, is the main one to look at; Cover is the relative number of observations covered by splits on the feature; Frequency is a cruder version of Gain that simply counts how often the feature appears across all trees (use with caution)
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst)
head(importance)
## Feature Gain Cover Frequency
## 1: Age 0.622031651 0.67251706 0.67241379
## 2: TreatmentPlacebo 0.285750607 0.11916656 0.10344828
## 3: SexMale 0.048744054 0.04522027 0.08620690
## 4: AgeDiscret6 0.016604647 0.04784637 0.05172414
## 5: AgeDiscret3 0.016373791 0.08028939 0.05172414
## 6: AgeDiscret4 0.009270558 0.02858801 0.01724138
# Feature inspection and testing
# Knowing that a feature is important is one thing; here we also want to see how age relates to the outcome. The following uses the package's built-in tools.
importanceRaw <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst, data = sparse_matrix, label = output_vector)
## Warning in `[.data.table`(result, , `:=`("RealCover", as.numeric(vec)), :
## with=FALSE ignored, it isn't needed when using :=. See ?':=' for examples.
# Cleaning for better display
importanceClean <- importanceRaw[,`:=`(Cover=NULL, Frequence=NULL)] # intended to drop the Cover and Frequency columns (the misspelled 'Frequence' triggers the warning below, so Frequency is in fact kept)
## Warning in `[.data.table`(importanceRaw, , `:=`(Cover = NULL, Frequence =
## NULL)): Adding new column 'Frequence' then assigning NULL (deleting it).
head(importanceClean)
## Feature Split Gain Frequency RealCover
## 1: TreatmentPlacebo -9.53674e-007 0.28575061 0.10344828 7
## 2: Age 61.5 0.16374034 0.05172414 12
## 3: Age 39 0.08705750 0.01724138 8
## 4: Age 57.5 0.06947553 0.03448276 11
## 5: SexMale -9.53674e-007 0.04874405 0.08620690 4
## 6: Age 53.5 0.04620627 0.05172414 10
## RealCover %
## 1: 0.2500000
## 2: 0.4285714
## 3: 0.2857143
## 4: 0.3928571
## 5: 0.1428571
## 6: 0.3571429
# Compared with the first importance table this adds a Split column, the threshold used at each split (e.g. row 2, Age 61.5, splits cases at 61.5 years). It also adds RealCover and RealCover %, the count and proportion of positive cases covered by that split.
# Plot the importance:
xgb.plot.importance(importance_matrix = importanceRaw)
# Requires the Ckmeans.1d.dp package (install.packages("Ckmeans.1d.dp")); the plot groups the features into clusters (two here), and the number of clusters can be customized, e.g. 10.
# To test a variable's association with the outcome, the package documentation uses a chi-squared test:
c2 <- chisq.test(df$Age, output_vector)
## Warning in chisq.test(df$Age, output_vector): Chi-squared approximation may
## be incorrect
# tests the association between age and the final outcome
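Printing the test object shows the statistic, the degrees of freedom and the p-value:
c2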
# # Open questions
# #Random Forest™ - 1000 trees
# bst <- xgboost(data = train$data, label = train$label, max.depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree =0.5, nround = 1, objective = "binary:logistic")
# # what does num_parallel_tree do?
#
# #Boosting - 3 rounds
# bst <- xgboost(data = train$data, label = train$label, max.depth = 4, nround = 3, objective = "binary:logistic")
# # ??? is this what makes it boosting?
# # One last question: do these snippets really distinguish xgboost, random forest and boosting?
# # Trying some of the more advanced features
# # As a competition algorithm it is genuinely excellent. Some features I particularly value:
# # 1. Cross-validation that shows the predictions for each fold
# # Useful for picking a good validation set.
# # do cross validation with prediction values for each fold
# res <- xgb.cv(params = param, data = dtrain, nrounds = nround, nfold = 5, prediction = TRUE)
# res$evaluation_log
# length(res$pred)
#
# # Cross-validation can return the model's predictions on each held-out fold, which is convenient for building ensemble models.
#
# # 2. Incremental training
# # You can run 1000 iterations, inspect the model's predictions, then continue for another 1000; the final model is equivalent to running 2000 iterations in one go.
# # do predict with output_margin=TRUE, will always give you margin values before logistic transformation
# ptrain <- predict(bst, dtrain, outputmargin=TRUE)
# ptest <- predict(bst, dtest, outputmargin=TRUE)
#
# # 3. Which leaf each tree assigns a sample to
# # training the model for two rounds
# bst = xgb.train(params = param, data = dtrain, nrounds = nround, nthread = 2)
#
# # 4. Linear booster instead of trees
# # A linear model can be used in place of the tree booster, giving linear or logistic regression with L1+L2 penalties.
# # you can also set lambda_bias which is L2 regularizer on the bias term
# param <- list(objective = "binary:logistic", booster = "gblinear",
# nthread = 2, alpha = 0.0001, lambda = 1)