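R Tutorial: Cluster Analysis

Cluster Analysis

R has a wide variety of functions for cluster analysis. In this section, I will describe three of the many approaches: hierarchical agglomerative, partitioning, and model based. While there is no single best solution to the problem of determining the number of clusters to extract, several approaches are given below.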

Data Preparation

Prior to clustering data, you may want to remove or estimate missing data and rescale variables for comparability.

# Prepare Data
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables
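As a quick illustration of what these two steps do, here is a minimal sketch on a built-in data set. Using airquality as a stand-in for mydata is an assumption made purely for illustration; any numeric data frame with missing values would behave the same way.

```r
# Hypothetical stand-in for mydata: four numeric columns of the built-in
# airquality data set, some of which contain missing values
mydata <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
n_before <- nrow(mydata)

mydata <- na.omit(mydata)   # listwise deletion: drop rows with any NA
mydata <- scale(mydata)     # center each column to mean 0, scale to sd 1
n_after <- nrow(mydata)

c(before = n_before, after = n_after)  # rows lost to listwise deletion
round(colMeans(mydata), 10)            # column means are now 0
apply(mydata, 2, sd)                   # column standard deviations are now 1
```

Note that scale() returns a matrix rather than a data frame, which is what kmeans() and dist() expect anyway.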

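Partitioning

K-means clustering is the most popular partitioning method. It requires the analyst to specify the number of clusters to extract. A plot of the within-groups sum of squares against the number of clusters extracted can help determine the appropriate number of clusters. The analyst looks for a bend in the plot, similar to a scree test in factor analysis. See Everitt & Hothorn (pg. 251).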

# Determine number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var)) # total sum of squares (1 cluster)
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
  ylab="Within groups sum of squares")

# K-Means Cluster Analysis
fit <- kmeans(mydata, 5) # 5 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)

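A robust version of K-means based on medoids can be invoked by using pam() instead of kmeans(). The function pamk() in the fpc package is a wrapper for pam() that also prints the suggested number of clusters based on optimum average silhouette width.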
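The idea of choosing the number of clusters by average silhouette width can be sketched directly with pam() from the cluster package. This is only an illustration of the criterion, not a reproduction of pamk(); the iris data used as mydata and the candidate range 2 to 6 are assumptions made for the example.

```r
library(cluster)  # provides pam()

# Hypothetical stand-in for mydata: standardized numeric columns of iris
mydata <- scale(iris[, 1:4])

# Average silhouette width of the medoid-based solution for each candidate k
ks <- 2:6
asw <- sapply(ks, function(k) pam(mydata, k)$silinfo$avg.width)
names(asw) <- ks
asw

# Keep the k with the widest average silhouette and refit
best_k <- ks[which.max(asw)]
fit <- pam(mydata, best_k)
table(fit$clustering)   # cluster sizes
fit$medoids             # the observations chosen as cluster centers
```

Unlike kmeans(), pam() stores the assignments in fit$clustering and uses actual observations (medoids) as cluster centers, which makes it less sensitive to outliers.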

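Hierarchical Agglomerative

There is a wide range of hierarchical clustering approaches. I have had good luck with Ward's method, described below.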

# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
# "ward" was renamed "ward.D" in R 3.1.0; "ward.D2" applies Ward's
# criterion to unsquared Euclidean distances
fit <- hclust(d, method="ward.D2")
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")


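The pvclust() function in the pvclust package provides p-values for hierarchical clustering based on multiscale bootstrap resampling. Clusters that are highly supported by the data will have large p-values. Interpretation details are provided by Suzuki. Be aware that pvclust clusters columns, not rows. Transpose your data before using it.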

# Ward Hierarchical Clustering with Bootstrapped p values
library(pvclust)
# pvclust clusters columns, so mydata should already be transposed
fit <- pvclust(mydata, method.hclust="ward.D2",
  method.dist="euclidean")
plot(fit) # dendrogram with p values
# add rectangles around groups highly supported by the data
pvrect(fit, alpha=.95)


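Model Based

Model based approaches assume a variety of data models and apply maximum likelihood estimation and Bayes criteria to identify the most likely model and number of clusters. Specifically, the Mclust() function in the mclust package selects the optimal model according to BIC for EM initialized by hierarchical clustering for parameterized Gaussian mixture models. (Phew!) One chooses the model and number of clusters with the largest BIC. See help(mclustModelNames) for details on the model chosen as best.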

# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
plot(fit) # plot results
summary(fit) # display the best model


Plotting Cluster Solutions

It is always a good idea to look at the cluster results.

# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5)

# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
  labels=2, lines=0)

# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mydata, fit$cluster)


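Validating Cluster Solutions

The function cluster.stats() in the fpc package provides a mechanism for comparing the similarity of two cluster solutions using a variety of validation criteria (Hubert's gamma coefficient, the Dunn index, and the corrected Rand index).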

# comparing 2 cluster solutions
library(fpc)
cluster.stats(d, fit1$cluster, fit2$cluster)

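where d is a distance matrix among objects, and fit1$cluster and fit2$cluster are integer vectors containing classification results from two different clusterings of the same data.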
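To make the comparison concrete, the sketch below builds two clusterings of the same data and computes the corrected (adjusted) Rand index by hand in base R. The helper adjusted_rand(), the iris stand-in for mydata, and the choice of three clusters are all assumptions made for illustration; this is not the fpc implementation, though cluster.stats() reports the same kind of quantity in its corrected.rand component.

```r
# Hypothetical stand-in for mydata and its distance matrix
mydata <- scale(iris[, 1:4])
d <- dist(mydata, method = "euclidean")

# Two different clusterings of the same data
set.seed(42)  # kmeans uses random starts
fit1 <- kmeans(mydata, centers = 3, nstart = 25)
fit2 <- list(cluster = cutree(hclust(d, method = "ward.D2"), k = 3))

# Illustrative helper (not part of any package): adjusted Rand index
# computed from the contingency table of two label vectors
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  sum_ij <- sum(choose(tab, 2))            # pairs together in both solutions
  sum_a <- sum(choose(rowSums(tab), 2))    # pairs together in solution a
  sum_b <- sum(choose(colSums(tab), 2))    # pairs together in solution b
  expected <- sum_a * sum_b / choose(length(a), 2)
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

table(kmeans = fit1$cluster, ward = fit2$cluster)  # cross-tabulate labels
ari <- adjusted_rand(fit1$cluster, fit2$cluster)
ari  # 1 means identical partitions; values near 0 mean chance agreement
```

The index depends only on which observations are grouped together, so it is unaffected by the two methods numbering their clusters differently.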

To Practice

Try the clustering exercise in this introduction to machine learning course.
