R Tutorial: Tree-Based Models

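Recursive partitioning is a fundamental tool in data mining. It helps us explore the structure of a set of data while developing easy-to-visualize decision rules for predicting a categorical (classification tree) or continuous (regression tree) outcome. This section briefly describes CART modeling, conditional inference trees, and random forests.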

CART Modeling via rpart

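Classification and regression trees (as described by Breiman, Friedman, Olshen, and Stone) can be generated with the rpart package. Detailed information on rpart is available in An Introduction to Recursive Partitioning Using the RPART Routines. The general steps are given below, followed by two examples.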

1. Grow the Tree

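To grow a tree, use
rpart(formula, data=, method=, control=) where

formula	has the form outcome ~ predictor1 + predictor2 + predictor3 + etc.
data=	specifies the data frame
method=	"class" for a classification tree, "anova" for a regression tree
control=	optional parameters for controlling tree growth. For example, control=rpart.control(minsplit=30, cp=0.001) requires that a node contain at least 30 observations before a split is attempted, and that a split decrease the overall lack of fit by a factor of 0.001 (the cost complexity factor) before it is attempted.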
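
As a minimal sketch of the call described above, the following fits a classification tree with the control settings from the example. It assumes the kyphosis data frame that ships with rpart (the same data used later in this section); the object name fit is only illustrative.

```r
library(rpart)  # provides rpart(), rpart.control() and the kyphosis data

# Grow a classification tree with non-default growth controls
fit <- rpart(Kyphosis ~ Age + Number + Start,
             method = "class", data = kyphosis,
             control = rpart.control(minsplit = 30, cp = 0.001))

# The fitted object keeps a record of the control settings that were used
fit$control$minsplit
fit$control$cp
```

Raising minsplit makes the tree more conservative about splitting small nodes, while lowering cp lets through splits that improve the fit only slightly, so the two settings pull in opposite directions.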

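2. Examine the Results

The following functions help us examine the results.

printcp(fit)	display the cp table
plotcp(fit)	plot cross-validation results
rsq.rpart(fit)	plot approximate R-squared and relative error for different splits (2 plots); the labels are only appropriate for the "anova" method
print(fit)	print results
summary(fit)	detailed results, including surrogate splits
plot(fit)	plot the decision tree
text(fit)	label the decision tree plot
post(fit, file=)	create a PostScript plot of the decision tree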

In trees created by rpart(), move to the LEFT branch when the stated condition is true (see the tree plots produced by the examples below).
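
One rough way to check the left-branch rule is to compare the number of cases that satisfy the root condition with the size of the root's left child. The sketch below assumes the kyphosis data from rpart and assumes that the root split of the default tree is Start>=8.5, which should be confirmed against the printed tree before relying on it.

```r
library(rpart)  # provides rpart() and the kyphosis data

fit <- rpart(Kyphosis ~ Age + Number + Start,
             method = "class", data = kyphosis)
print(fit)  # node 2 is the left child of the root, node 3 the right child

# Assumption: the root split printed above is Start>=8.5
n_condition_true <- sum(kyphosis$Start >= 8.5)

# Rows of fit$frame are named by node number, so "2" is the left child
n_left_child <- fit$frame["2", "n"]

# If cases where the condition is true go left, the two counts agree
c(condition_true = n_condition_true, left_child = n_left_child)
```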

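3. Prune the Tree

Prune back the tree to avoid overfitting the data. Typically, you will want to select a tree size that minimizes the cross-validated error, which is the xerror column printed by printcp().

Prune the tree to the desired size using
prune(fit, cp= )

Specifically, use printcp() to examine the cross-validated error results, select the complexity parameter associated with the minimum error, and place it into the prune() function. Alternatively, you can use the code fragment

fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]

to automatically select the complexity parameter associated with the smallest cross-validated error. Thanks to HSAUR for the idea.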

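Classification Tree Example

Let's use the data frame kyphosis to predict a type of deformation (kyphosis) after surgery from age in months (Age), the number of vertebrae involved (Number), and the topmost vertebra operated on (Start).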

# Classification Tree with rpart
library(rpart)

# grow tree
fit <- rpart(Kyphosis ~ Age + Number + Start,
   method="class", data=kyphosis)

printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits

# plot tree
plot(fit, uniform=TRUE,
   main="Classification Tree for Kyphosis")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

# create attractive postscript plot of tree
post(fit, file = "c:/tree.ps",
   title = "Classification Tree for Kyphosis")


# prune the tree
pfit <- prune(fit, cp=fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])

# plot the pruned tree
plot(pfit, uniform=TRUE,
   main="Pruned Classification Tree for Kyphosis")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
post(pfit, file = "c:/ptree.ps",
   title = "Pruned Classification Tree for Kyphosis")


Regression Tree Example

In this example we will predict car mileage from price, country, reliability, and car type. The data frame is cu.summary.

# Regression Tree Example
library(rpart)

# grow tree
fit <- rpart(Mileage~Price + Country + Reliability + Type,
   method="anova", data=cu.summary)

printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits

# create additional plots
par(mfrow=c(1,2)) # two plots on one page
rsq.rpart(fit) # approximate R-squared and relative error by split

# plot tree
plot(fit, uniform=TRUE,
   main="Regression Tree for Mileage")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

# create attractive postscript plot of tree
post(fit, file = "c:/tree2.ps",
   title = "Regression Tree for Mileage")


# prune the tree
pfit <- prune(fit, cp=0.01160389) # from cptable

# plot the pruned tree
plot(pfit, uniform=TRUE,
   main="Pruned Regression Tree for Mileage")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
post(pfit, file = "c:/ptree2.ps",
   title = "Pruned Regression Tree for Mileage")

It turns out that this produces the same tree as the original.
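
Instead of copying the cp value from the printed table by hand, it could be selected programmatically with the code fragment from the pruning step and the sizes of the two trees compared. This is only a sketch: the seed value and the helper n_leaves() are illustrative choices, not part of rpart, and because the cross-validation folds are random the selected cp can differ from run to run.

```r
library(rpart)  # provides rpart(), prune() and the cu.summary data

set.seed(123)  # arbitrary seed; xerror depends on the random CV folds
fit <- rpart(Mileage ~ Price + Country + Reliability + Type,
             method = "anova", data = cu.summary)

# Pick the cp value with the smallest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pfit <- prune(fit, cp = best_cp)

# Hypothetical helper: count terminal nodes (leaves) of an rpart tree
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

# Equal leaf counts mean that pruning removed nothing
c(original = n_leaves(fit), pruned = n_leaves(pfit))
```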

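Conditional Inference Trees via party

The party package provides nonparametric regression trees for nominal, ordinal, numeric, censored, and multivariate responses. party: A Laboratory for Recursive Partytioning provides details.

You can create a regression or classification tree via the function

ctree(formula, data=)
The type of tree created depends on the outcome variable (nominal factor, ordered factor, numeric, etc.). Tree growth is based on statistical stopping rules, so pruning should not be required.

Both of the previous examples are re-analyzed below.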

# Conditional Inference Tree for Kyphosis
library(rpart) # provides the kyphosis data
library(party)
fit <- ctree(Kyphosis ~ Age + Number + Start,
   data=kyphosis)
plot(fit, main="Conditional Inference Tree for Kyphosis")


# Conditional Inference Tree for Mileage
library(rpart) # provides the cu.summary data
library(party)
fit2 <- ctree(Mileage~Price + Country + Reliability + Type,
   data=na.omit(cu.summary))


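Random Forests

Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new "forest", and deciding the final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification). Breiman and Cutler's random forest approach is implemented in the randomForest package.

Here is an example.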

# Random Forest prediction of Kyphosis data
library(rpart) # provides the kyphosis data
library(randomForest)
fit <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
print(fit) # view results
importance(fit) # importance of each predictor
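
To make the bootstrap-and-vote idea described above more concrete, here is a hand-rolled sketch built from rpart trees. It is not how randomForest is implemented: in particular, this sketch draws one random subset of predictors per tree, whereas the random forest method samples candidate variables at each split. The forest size, the seed, and all object names are assumptions chosen for illustration.

```r
library(rpart)  # used here only to grow the individual trees

set.seed(1)        # arbitrary seed; bootstrap samples are random
n_trees <- 25      # illustrative forest size
n <- nrow(kyphosis)
predictors <- c("Age", "Number", "Start")

# Grow each tree on a bootstrap sample with a random subset of predictors,
# then let it classify every case in the original data
votes <- sapply(seq_len(n_trees), function(i) {
  rows <- sample(n, n, replace = TRUE)      # bootstrap sample of cases
  vars <- sample(predictors, 2)             # simplification: per-tree subset
  f <- reformulate(vars, response = "Kyphosis")
  tree <- rpart(f, data = kyphosis[rows, ], method = "class")
  as.character(predict(tree, newdata = kyphosis, type = "class"))
})

# votes has one row per case and one column per tree;
# the final prediction for each case is the majority vote across trees
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
table(predicted = majority, observed = kyphosis$Kyphosis)
```

Because the predictions here are made on the same cases used to grow the trees, the resulting table is optimistic; the out-of-bag error reported by print(fit) for a randomForest object is a fairer estimate.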

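For more details, see the comprehensive Random Forest website.

Going Further

This section has only touched on the options available. To learn more, see the CRAN Task View on Machine & Statistical Learning.

To Practice

Try the Kaggle R Tutorial on Machine Learning, which includes an exercise with random forests.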
