R语言绘制heatmap热图
先容如何利用 R 绘制 heatmap 的文章。
本日无意间在Flowingdata看到一篇关于如何利用 R 来做 heatmap 的文章(请移步到这里)。固然 heatmap 只是 R 中一个很普通的图形函数,但这个例子利用了2008-2009赛季 NBA 50个较高级球员数据做了一个极佳的演示,结果很是不错。对 R 大抵相识的童鞋可以直接在 R console 上敲
?heatmap
直接查察辅佐即可。
没有打仗过 R 的童鞋继承围观,下面会仔细先容如何利用 R 实现 NBA 50位较高级球员指标表示热图:
关于 heatmap,中文一般翻译为“热图”,其统计意义wiki上表明的很清楚:
A heat map is a graphical representation of data where the values taken by a variable in a two-dimensional map are represented as colors.Heat maps originated in 2D displays of the values in a data matrix. Larger values were represented by small dark gray or black squares (pixels) and smaller values by lighter squares.
下面这个图等于Flowingdata用一些 R 函数对2008-2009 赛季NBA 50名较高级球员指标做的一个热图(点击参看大图):
先表明一下数据:
这里共罗列了50位球员,预计喜好篮球的童鞋对上图右边的每个名字城市耳熟能详。这些球员每小我私家会有19个指标,包罗打了几场球(G)、上场几分钟(MIN)、得分(PTS)……这样就行成了一个50行×19列的矩阵。但问题是,数据有些多,需要利用一种较量好的步伐来展示,So it comes, heatmap!
简朴的说明:
好比从上面的热图上调查得分前3名(Wade、James、Bryant)PTS、FGM、FGA较量高,但Bryant的FTM、FTA和前两者就差一些;Wade在这三人中STL是佼佼者;而James的DRB和TRB又比其他两人好一些……
姚明的3PP(3 Points Percentage)这条数据很有意思,很是精彩!仔细查了一下这个数值,居然是100%。仔细追念一下,好像谁人赛季姚明仿佛投过一个3分,而且中了,然后再也没有3p。这样本可真够小的!
最后是如何做这个热图(做了些许修改):
Step 0. Download R
R 官网:http://www.r-project.org,它是免费的。官网上面提供了Windows,Mac,Linux版本(或源代码)的R措施。
Step 1. Load the data
R 可以支持网络路径,利用读取csv文件的函数read.csv。
读取数据就这么简朴:
nba<- read.csv(“http://datasets.flowingdata.com/ppg2008.csv”, sep=”,”)
Step 2. Sort data
凭据球员得分,将球员从小到大排序:
nba <- nba[order(nba$PTS),]
虽然也可以选择MIN,BLK,STL之类指标
Step 3. Prepare data
把行号换成行名(球员名称):
row.names(nba) <- nba$Name
去掉第一列行号:
nba <- nba[,2:20] # or nba <- nba[,-1]
Step 4. Prepare data, again
把 data frame 转化为我们需要的矩阵名目:
nba_matrix <- data.matrix(nba)
Step 5. Make a heatmap
# R 的默认还会在图的左边和上边绘制 dendrogram,利用Rowv=NA, Colv=NA去掉
heatmap(nba_matrix, Rowv=NA, Colv=NA, col=cm.colors(256), revC=FALSE, scale=’column’)
这样就获得了上面的那张热图。
Step 6. Color selection
可能想把热图中的颜色换一下:
heatmap(nba_matrix, Rowv=NA, Colv=NA, col=heat.colors(256), revC=FALSE, scale=”column”, margins=c(5,10))
Bioinformatics and Computational Biology Solutions Using R and Bioconductor 第10章的
例子:
Heatmaps, or false color images have a reasonably long history, as has the
notion of rearranging the columns and rows to show structure in the data.
They were applied to microarray data by Eisen et al. (1998) and have
become a standard visualization method for this type of data.
A heatmap is a two-dimensional, rectangular, colored grid. It displays
data that themselves come in the form of a rectangular matrix. The color
of each rectangle is determined by the value of the corresponding entry
in the matrix. The rows and columns of the matrix can be rearranged
independently. Usually they are reordered so that similar rows are placed
next to each other, and the same for columns. Among the orderings that
are widely used are those derived from a hierarchical clustering, but many
other orderings are possible. If hierarchical clustering is used, then it is
customary that the dendrograms are provided as well. In many cases the
resulting image has rectangular regions that are relatively homogeneous
and hence the graphic can aid in determining which rows (generally the
genes) have similar expression values within which subgroups of samples
(generally the columns).
The function heatmap is an implementation with many options. In particular,
users can control the ordering of rows and columns independently
from each other. They can use row and column labels of their own choosing
or select their own color scheme.
> library(“ALL”)
> data(“ALL”)
> selSamples <- ALL$mol.biol %in% c(“ALL1/AF4”,
+ “E2A/PBX1”)
> ALLs <- ALL[, selSamples]
> ALLs$mol.biol <- factor(ALLs$mol.biol)
> colnames(exprs(ALLs)) <- paste(ALLs$mol.biol,
+ colnames(exprs(ALLs)))
>library(“genefilter”)
> meanThr <- log2(100)
> g <- ALLs$mol.biol
> s1 <- rowMeans(exprs(ALLs)[, g == levels(g)[1]]) >
+ meanThr
> s2 <- rowMeans(exprs(ALLs)[, g == levels(g)[2]]) >
+ meanThr
> s3 <- rowttests(ALLs, g)$p.value < 2e-04
> selProbes <- (s1 | s2) & s3
> ALLhm <- ALLs[selProbes, ]
>library(RColorBrewer)
> hmcol <- colorRampPalette(brewer.pal(10, “RdBu”))(256)
> spcol <- ifelse(ALLhm$mol.biol == “ALL1/AF4”,
+ “goldenrod”, “skyblue”)
> heatmap(exprs(ALLhm), col = hmcol, ColSideColors = spcol)
最后,可以
>help(heatmap) 查找辅佐,看看辅佐给提供的例子
也可以看这的例子:
http://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/r/heatmap/
Using R to draw a Heatmap from Microarray Data |
|
[c] The first section of this page uses R to analyse an Acute lymphocytic leukemia (ALL) microarray dataset, producing a heatmap (with dendrograms) of genes differentially expressed between two types of leukemia. There is a follow on page dealing with how to do this from Python using RPy. The original citation for the raw data is “Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival” by Chiaretti et al. Blood 2004. (PMID: 14684422) The analysis is a “step by step” recipe based on this paper, Bioconductor: open software development for computational biology and bioinformatics, Gentleman et al. 2004. Their Figure 2 Heatmap, which we recreate, is reproduced here: Heatmaps from R Assuming you have a recent version of R (from The R Project) and BioConductor (see Windows XP installation instructions), the example dataset can be downloaded as the BioConductor ALL package. You should be able to install this from within R as follows: > source(“http://www.bioconductor.org/biocLite.R”) Alternatively, you can download the package by hand from here or here. If you are using Windows, download ALL_1.0.2.zip (or similar) and save it. Then from within the R program, use the menu option “Packages”, “Install package(s) from local zip files…” and select the ZIP file. On Linux, download ALL_1.0.2.tar.gz (or similar) and use sudo R CMD INSTALL ALL_1.0.2.tar.gz at the command prompt. With that out of the way, you should be able to start R and load this package with the library and data commands:> library(“ALL”) If you inspect the resulting ALL variable, it contains 128 samples with 12625 genes, and associated phenotypic data. > ALL We can looks at the results of molecular biology testing for the 128 samples: > ALL$mol.biol Ignoring the samples which came back negative on this test (NEG), most have been classified as having a translocation between chromosomes 9 and 22 (BCR/ABL), or a translocation between chromosomes 4 and 11 (ALL1/AF4). For the purposes of this example, we are only interested in these two subgroups, so we will create a filtered version of the dataset using this as a selection criteria: > eset <- ALL[, ALL$mol.biol %in% c(“BCR/ABL”, “ALL1/AF4”)] The resulting variable, eset, contains just 47 samples – each with the full 12,625 gene expression levels. This is far too much data to draw a heatmap with, but we can do one for the first 100 genes as follows: > heatmap(exprs(eset[1:100,])) According to the BioConductor paper we are following, the next step in the analysis was to use the lmFit function (from the limma package) to look for genes differentially expressed between the two groups. The fitted model object is further processed by the eBayes function to produce empirical Bayes test statistics for each gene, including moderated t-statistics, p-values and log-odds of differential expression. > library(limma) If the limma package isn’t installed, you’ll need to install it first: > source(“http://www.bioconductor.org/biocLite.R”) We can now reproduce Figure 1 from the paper. > topTable(fit, coef=2) The leftmost numbers are row indices, ID is the Affymetrix HGU95av2 accession number, M is the log ratio of expression, A is the log average expression, t the moderated t-statistic, and B is the log odds of differential expression. Next, we select those genes that have adjusted p-values below 0.05, using a very stringent Holm method to select a small number (165) of genes. > selected <- p.adjust(fit$p.value[, 2]) <0.05 The variable esetSel has data on (only) 165 genes for all 47 samples . We can easily produce a heatmap as follows (in R-2.1.1 this defaults to a yellow/red “heat” colour scheme): > heatmap(exprs(esetSel)) If you have the topographical colours installed (yellow-green-blue), you can do this: > heatmap(exprs(esetSel), col=topo.colors(100)) This is getting very close to Gentleman et al.’s Figure 2, except they have added a red/blue banner across the top to really emphasize how the hierarchical clustering has correctly split the data into the two groups (10 and 37 patients). To do that, we can use the heatmap function’s optional argument of ColSideColors. I created a small function to map the eselSet$mol.biol values to red (#FF0000) and blue (#0000FF), which we can apply to each of the molecular biology results to get a matching list of colours for our columns: > color.map <- function(mol.biol) { if (mol.biol==”ALL1/AF4″) “#FF0000” else “#0000FF” } Looks pretty close now, doesn’t it: To recap, this is “all” we needed to type into R to achieve this: library(“ALL”) Heatmaps in R – More Options One subtle point in the previous examples is that the heatmap function has automatically scaled the colours for each row (i.e. each gene has been individually normalised across patients). This can be disabled using scale=”none”, which you might want to do if you have already done your own normalisation (or this may not be appropriate for your data): heatmap(exprs(esetSel), col=topo.colors(75), scale=”none”, ColSideColors=patientcolors, cexRow=0.5) You might also have noticed in the above snippet, that I have shrunk the row captions which were so big they overlapped each other. The relevant options are cexRow and cexCol. So far so good – but what if you wanted a key to the colours shown? The heatmap function doesn’t offer this, but the good news is that heatmap.2 from the gplots library does. In fact, it offers a lot of other features, many of which I deliberately turn off in the following example: library(“gplots”) By default, heatmap.2 will also show a trace on each data point (removed this with trace=”none”). If you ask for a key (using key=TRUE) this function will actually give you a combined “color key and histogram”, but that can be overridden (with density.info=”none”). Don’t like the colour scheme? Try using the functions bluered/redblue for a red-white-blue spread, or redgreen/greenred for the red-black-green colour scheme often used with two-colour microarrays: library(“gplots”) Heatmaps from Python So, how can we do that from within Python? One way is using RPy (R from Python), and this is discussed on this page. P.S. If you want to use heatmap.2 from within python using RPy, use the syntax heatmap_2 due to the differences in how R and Python handle full stops and underscores. What about other microarray data? Well, I have also documented how you can load NCBI GEO SOFT files into R as a BioConductor expression set object. As long as you can get your data into R as a matrix or data frame, converting it into an exprSet shouldn’t be too hard. |