11 Practical Ways to Speed Up R Code

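It is well known that for loops in R run very slowly on large datasets. There are many ways to make your code faster, but you may be more interested in how much faster it can get. This article introduces several methods suited to large data, including simple changes to the logic, parallel processing, and Rcpp. With these methods you can comfortably handle datasets of more than 100 million rows.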
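Let's try to speed up the process of adding a new variable to a data frame, a process that involves a loop and a conditional check. The code below creates the original data frame:

    # Create the data frame
    col1 <- runif(12^5, 0, 2)
    col2 <- rnorm(12^5, 0, 2)
    col3 <- rpois(12^5, 3)
    col4 <- rchisq(12^5, 2)
    df <- data.frame(col1, col2, col3, col4)

For each row of the data frame (df), check whether the sum of the four columns is greater than 4. If it is, the new variable is set to "greater_than_4"; otherwise it is set to "lesser_than_4".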
    # Original R code: before vectorization and pre-allocation
    system.time({
      for (i in 1:nrow(df)) { # for every row
        if ((df[i, "col1"] + df[i, "col2"] + df[i, "col3"] + df[i, "col4"]) > 4) { # check if > 4
          df[i, 5] <- "greater_than_4" # assign 5th column
        } else {
          df[i, 5] <- "lesser_than_4" # assign 5th column
        }
      }
    })

All computations in this article were run on Mac OS X with a 2.6 GHz processor and 8 GB of RAM.
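1. Vectorize and pre-allocate data structures
Before looping, remember to set up the data structures and the length and type of the output variable in advance. Never grow the data incrementally inside the loop. Next, we look at how vectorization speeds up the processing.

    # after vectorization and pre-allocation
    output <- character(nrow(df)) # initialize output vector
    system.time({
      for (i in 1:nrow(df)) {
        if ((df[i, "col1"] + df[i, "col2"] + df[i, "col3"] + df[i, "col4"]) > 4) {
          output[i] <- "greater_than_4"
        } else {
          output[i] <- "lesser_than_4"
        }
      }
      df$output <- output # attach the result as a new column
    })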
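To isolate the effect of pre-allocation, the following minimal sketch compares growing a vector with c() against filling a vector of known length. The function names grow_vector and fill_preallocated and the size n are illustrative assumptions, not part of the original benchmark, and the timings will vary by machine.

```r
# Illustrative helpers (hypothetical names, not from the original article).

# Grows the result by one element per iteration, so R copies it repeatedly.
grow_vector <- function(n) {
  out <- character(0)
  for (i in seq_len(n)) {
    out <- c(out, "x")
  }
  out
}

# Allocates the full length once and only fills in values.
fill_preallocated <- function(n) {
  out <- character(n)
  for (i in seq_len(n)) {
    out[i] <- "x"
  }
  out
}

n <- 20000
system.time(grown <- grow_vector(n))
system.time(filled <- fill_preallocated(n))

# Both approaches produce the same result; only the cost differs.
identical(grown, filled)
```

The gap between the two timings usually widens as n grows, because the growing version copies the whole vector on every iteration.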
2. Move the condition check outside the loop
Moving the condition check outside the loop makes the code faster. The tests that follow use datasets ranging from 100,000 to 1,000,000 rows.

    # after vectorization and pre-allocation, taking the condition checking outside the loop
    output <- character(nrow(df))
    condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4 # condition check outside the loop
    system.time({
      for (i in 1:nrow(df)) {
        if (condition[i]) {
          output[i] <- "greater_than_4"
        } else {
          output[i] <- "lesser_than_4"
        }
      }
      df$output <- output
    })
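3. Run the loop only for TRUE conditions
Another optimization is to initialize the output to the value used when the condition is not met, and then run the loop only for the rows where the condition is TRUE. The speedup here depends on the proportion of TRUE values in the condition. This test is compared against case (2), and as expected the method does improve performance.

    output <- c(rep("lesser_than_4", nrow(df)))
    condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4
    system.time({
      for (i in (1:nrow(df))[condition]) { # run loop only for true conditions
        if (condition[i]) {
          output[i] <- "greater_than_4"
        }
      }
      df$output <- output # attach the result as a new column
    })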
4. Use ifelse() wherever possible
ifelse() makes your code more concise. Its syntax resembles that of if(), but it runs far faster. Even without pre-allocation and without simplifying the condition, it is faster than the two previous methods.

    system.time({
      output <- ifelse((df$col1 + df$col2 + df$col3 + df$col4) > 4, "greater_than_4", "lesser_than_4")
      df$output <- output
    })
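As a small worked illustration of how ifelse() evaluates a whole vector at once, the sketch below labels four hand-picked totals. The vector totals is made up for this example; it includes the boundary value 4 and a missing value to show how each is treated.

```r
# Made-up totals: below the threshold, exactly on it, above it, and missing.
totals <- c(2.5, 4.0, 7.1, NA)

# One vectorized call replaces the explicit loop and if/else.
labels <- ifelse(totals > 4, "greater_than_4", "lesser_than_4")
print(labels)
# "lesser_than_4" "lesser_than_4" "greater_than_4" NA
```

A total of exactly 4 is not greater than 4, so it gets the "lesser_than_4" label, and a missing total stays NA rather than being forced into either group.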
5. Use which()
By using which() to select rows, we can reach about one third of the speed of Rcpp.

    # Thanks to Gabe Becker
    system.time({
      want = which(rowSums(df[, 1:4]) > 4) # sum only the four numeric columns
      output = rep("less than 4", times = nrow(df))
      output[want] = "greater than 4"
    })
    # nrow = 3 million rows (approx)
    #    user  system elapsed
    #   0.396   0.074   0.481
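The which() pattern can be traced by hand on a tiny input. In this sketch the vector totals is invented for illustration; it shows that which() returns the positions of the TRUE values and silently skips missing ones.

```r
# Invented row totals, including one missing value.
totals <- c(2, 5, NA, 7)

want <- which(totals > 4)      # positions where the condition is TRUE
print(want)
# 2 4

output <- rep("less than 4", times = length(totals))  # default label
output[want] <- "greater than 4"                      # overwrite matches only
print(output)
# "less than 4" "greater than 4" "less than 4" "greater than 4"
```

Note the design consequence: the row with a missing total keeps the default label here, whereas ifelse() would return NA for it. Which behavior is right depends on the data.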
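6. Use the apply family instead of for loops
This section uses apply() to compute the same example and compares it with the vectorized loop. It is faster than the original method, but slower than ifelse() and slower than moving the condition check outside the loop. The approach is very useful, but in more complex situations you will need to adapt how you use it.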
    # apply family
    system.time({
      myfunc <- function(x) {
        if ((x["col1"] + x["col2"] + x["col3"] + x["col4"]) > 4) {
          "greater_than_4"
        } else {
          "lesser_than_4"
        }
      }
      output <- apply(df[, c(1:4)], 1, FUN = myfunc) # apply 'myfunc' on every row
      df$output <- output
    })
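7. Use byte-code compilation with cmpfun() from the compiler package
This may not be the best example to show the benefit of byte-code compilation, but for more complex functions it can perform very well, so it is worth knowing about.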
    # byte code compilation
    library(compiler)
    myFuncCmp <- cmpfun(myfunc)
    system.time({
      output <- apply(df[, c(1:4)], 1, FUN = myFuncCmp)
    })
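For a function with an explicit loop, where byte-code compilation has more to work with, a comparison might look like the sketch below. The function loop_sum is a hypothetical example, not from the original article. One caveat to keep in mind: since R 3.4.0 the JIT compiler byte-compiles functions by default, so on current R versions the two timings may be nearly identical.

```r
library(compiler)  # ships with R

# Hypothetical loop-heavy function used only for this comparison.
loop_sum <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value
  }
  total
}

# Explicitly byte-compile the function.
loop_sum_cmp <- cmpfun(loop_sum)

x <- runif(1e6)
system.time(plain_result <- loop_sum(x))
system.time(compiled_result <- loop_sum_cmp(x))

# Compilation changes speed, not results.
all.equal(plain_result, compiled_result)
```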
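8. Use Rcpp
So far we have tested several ways to speed up the code, and the best of them was ifelse(). What happens if we make the data ten times larger? Next we implement the same computation with Rcpp and compare it with ifelse().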
    library(Rcpp)
    sourceCpp("MyFunc.cpp")
    system.time(output <- myFunc(df)) # see Rcpp function below
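Below is the function written in C++. Save it as "MyFunc.cpp" and load it with sourceCpp().

    // Source for MyFunc.cpp
    #include <Rcpp.h>
    using namespace Rcpp;

    // [[Rcpp::export]]
    CharacterVector myFunc(DataFrame x) {
      NumericVector col1 = as<NumericVector>(x["col1"]);
      NumericVector col2 = as<NumericVector>(x["col2"]);
      NumericVector col3 = as<NumericVector>(x["col3"]);
      NumericVector col4 = as<NumericVector>(x["col4"]);
      int n = col1.size();
      CharacterVector out(n);
      for (int i = 0; i < n; i++) {
        // row total, compared against the same threshold as the R versions
        double tempOut = col1[i] + col2[i] + col3[i] + col4[i];
        if (tempOut > 4) {
          out[i] = "greater_than_4";
        } else {
          out[i] = "lesser_than_4";
        }
      }
      return out;
    }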
9. Use parallel processing
The code for parallel processing:

    # parallel processing
    library(foreach)
    library(doSNOW)
    cl <- makeCluster(4, type = "SOCK") # for a 4-core machine
    registerDoSNOW(cl)
    condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4
    # parallelization with vectorization
    system.time({
      output <- foreach(i = 1:nrow(df), .combine = c) %dopar% {
        if (condition[i]) "greater_than_4" else "lesser_than_4"
      }
    })
    df$output <- output
    stopCluster(cl) # release the worker processes when done
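Dispatching one task per row carries a lot of scheduling overhead. A different way to structure the work, sketched here with the base parallel package rather than the foreach and doSNOW setup used in the article, is to hand each worker one block of rows and let it label the block with vectorized code. The worker count, the data size, and the helper name label_chunk are assumptions made for this illustration.

```r
library(parallel)  # ships with R; swapped in here for foreach/doSNOW

set.seed(42)
n <- 10000  # assumed size, kept small for the sketch
df <- data.frame(col1 = runif(n, 0, 2), col2 = rnorm(n, 0, 2),
                 col3 = rpois(n, 3), col4 = rchisq(n, 2))

n_workers <- 2                               # assumed; adjust to your machine
cl <- makeCluster(n_workers)
chunks <- splitIndices(nrow(df), n_workers)  # contiguous blocks of row indices

# Hypothetical helper: labels one block of rows with vectorized code.
# For simplicity the whole data frame is sent to every worker.
label_chunk <- function(idx, data) {
  totals <- rowSums(data[idx, , drop = FALSE])
  unname(ifelse(totals > 4, "greater_than_4", "lesser_than_4"))
}

pieces <- parLapply(cl, chunks, label_chunk, data = df)
stopCluster(cl)

# Blocks come back in order, so concatenating restores the row order.
df$output <- unlist(pieces)
```

For a computation this cheap, the cost of starting workers and shipping data can outweigh the gain, so it is worth timing against the plain vectorized version before adopting it.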
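10. Remove variables and free memory as early as possible
Remove variables you no longer need as early as possible, especially before starting a lengthy loop. Calling gc() at the end of each loop iteration to reclaim memory can sometimes help as well.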
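A minimal sketch of this housekeeping, assuming a large intermediate object that is needed only to derive a small summary. The object names scratch_matrix and col_means are made up for the example.

```r
# Hypothetical large intermediate object, needed only for a summary.
scratch_matrix <- matrix(runif(1e6), ncol = 100)
col_means <- colMeans(scratch_matrix)

rm(scratch_matrix)  # drop the reference as soon as the summary exists
invisible(gc())     # ask R to run garbage collection now

# A long-running loop would start here, using only col_means.
exists("scratch_matrix")
# FALSE
```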
11. Use data structures that consume less memory
data.table() is a good example: it reduces the memory footprint of the data, which helps speed up operations.
    library(data.table)
    dt <- data.table(df) # create the data.table
    system.time({
      for (i in 1:nrow(dt)) {
        if ((dt[i, col1] + dt[i, col2] + dt[i, col3] + dt[i, col4]) > 4) {
          dt[i, col5 := "greater_than_4"] # assign the output as 5th column
        } else {
          dt[i, col5 := "lesser_than_4"] # assign the output as 5th column
        }
      }
    })
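The loop above updates the data.table one row at a time. As a point of comparison, a vectorized sketch that adds the column in a single := assignment could look like the following. This is an alternative formulation, not the version the article benchmarked, and the data here is a small made-up sample.

```r
library(data.table)

set.seed(1)
n <- 1000  # assumed sample size for the sketch
dt <- data.table(col1 = runif(n, 0, 2), col2 = rnorm(n, 0, 2),
                 col3 = rpois(n, 3), col4 = rchisq(n, 2))

# Add col5 by reference for all rows at once, without an explicit loop.
dt[, col5 := ifelse(col1 + col2 + col3 + col4 > 4,
                    "greater_than_4", "lesser_than_4")]

head(dt)
```

Because := modifies the table in place, no copy of dt is made when the new column is added.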
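Summary
Method: speed, where nrow(df)/time_taken = n rows per second
1. Original method: 1X, 856.2255 rows per second (normalized to 1)
2. Vectorized method: 738X, 631578 rows per second
3. Only TRUE conditions: 1002X, 857142.9 rows per second
4. ifelse: 1752X, 1500000 rows per second
5. which: 8806X, 7540364 rows per second
6. Rcpp: 13476X, 11538462 rows per second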
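The two quantities in the summary, rows per second and the speedup multiple, can be computed with a couple of small helpers. The sketch below is one way to do it; the helper names rows_per_second and speedup are made up here, and the small data frame is only a stand-in for the benchmark data.

```r
# Hypothetical helper: times an expression and converts to rows per second.
rows_per_second <- function(n_rows, expr) {
  elapsed <- system.time(expr)[["elapsed"]]
  n_rows / elapsed
}

# Hypothetical helper: speed relative to a baseline rate.
speedup <- function(rate, baseline_rate) {
  rate / baseline_rate
}

# Using two rates from the summary: vectorized method vs. original method.
speedup(631578, 856.2255)  # about 738

# Measuring a rate directly on a small stand-in data frame.
demo <- data.frame(a = runif(1e5), b = runif(1e5))
rate <- rows_per_second(nrow(demo), {
  labels <- ifelse(demo$a + demo$b > 1, "high", "low")
})
```

For very fast expressions the elapsed time can round to zero, so larger inputs or repeated runs give more reliable rates.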