11 Practical Tips for Speeding Up R Code

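It is well known that for loops in R are very slow when processing large datasets. There are many ways to speed up your code, but you probably also want to know how much of a speedup each one delivers. This article covers several approaches that work at large data sizes, including simple changes to the logic, parallel processing, and Rcpp. With these techniques you can comfortably handle datasets of more than 100 million rows.
Let's try to speed up the process of adding a new variable to a data frame, a process that involves a loop and a conditional statement. The code below creates the original data frame: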
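    # Create the data frame
    col1 <- runif(12^5, 0, 2)
    col2 <- rnorm(12^5, 0, 2)
    col3 <- rpois(12^5, 3)
    col4 <- rchisq(12^5, 2)
    df <- data.frame(col1, col2, col3, col4)

For each row of this data frame (df), check whether the sum of the four columns is greater than 4. If it is, the new variable takes the value "greater_than_4"; otherwise it takes the value "lesser_than_4".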
    # Original R code: before vectorization and pre-allocation
    system.time({
      for (i in 1:nrow(df)) {  # for every row
        if ((df[i, "col1"] + df[i, "col2"] + df[i, "col3"] + df[i, "col4"]) > 4) {  # check if > 4
          df[i, 5] <- "greater_than_4"  # assign 5th column
        } else {
          df[i, 5] <- "lesser_than_4"  # assign 5th column
        }
      }
    })

All computations in this article were run on Mac OS X with a 2.6 GHz processor and 8 GB of RAM.
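The baseline loop is slow enough that rerunning it on all 12^5 rows gets tedious: at the roughly 856 rows per second reported in the summary, 248,832 rows take close to five minutes. For quick experiments, a scaled-down, seeded copy of the data can help. The sketch below is an illustration only; the name `df_small`, the seed, and the row count `n_small` are assumptions and not part of the original benchmark.

```r
# Scaled-down, reproducible version of the setup (hypothetical names).
set.seed(42)          # assumed seed, only for reproducibility
n_small <- 1000       # assumed sample size for quick runs

df_small <- data.frame(
  col1 = runif(n_small, 0, 2),
  col2 = rnorm(n_small, 0, 2),
  col3 = rpois(n_small, 3),
  col4 = rchisq(n_small, 2)
)

# Same row-by-row logic as the baseline, timed on the small sample.
timing <- system.time({
  for (i in seq_len(nrow(df_small))) {
    row_total <- df_small[i, "col1"] + df_small[i, "col2"] +
      df_small[i, "col3"] + df_small[i, "col4"]
    if (row_total > 4) {
      df_small[i, 5] <- "greater_than_4"  # creates column V5 on first use
    } else {
      df_small[i, 5] <- "lesser_than_4"
    }
  }
})
print(timing)
```

Timings on a sample this small are noisy, so they are best used to check that the code runs, not to rank the approaches.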
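1. Vectorize and pre-allocate data structures

Initialize the output vector to its final length before the loop, instead of growing the data frame one cell at a time inside it.

    # after vectorization and pre-allocation
    output <- character(nrow(df))  # pre-allocate the output vector
    system.time({
      for (i in 1:nrow(df)) {
        if ((df[i, "col1"] + df[i, "col2"] + df[i, "col3"] + df[i, "col4"]) > 4) {
          output[i] <- "greater_than_4"
        } else {
          output[i] <- "lesser_than_4"
        }
      }
      df$output <- output
    })

2. Move the condition check outside the loop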
Moving the condition check outside the loop speeds up the code. The tests below use datasets ranging from 100,000 to 1,000,000 rows:

    # after vectorization and pre-allocation, taking the condition check outside the loop
    output <- character(nrow(df))
    condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4  # condition check outside the loop
    system.time({
      for (i in 1:nrow(df)) {
        if (condition[i]) {
          output[i] <- "greater_than_4"
        } else {
          output[i] <- "lesser_than_4"
        }
      }
      df$output <- output
    })

3. Run the loop only for rows where the condition is true
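Another optimization is to initialize the output with the value for the case where the condition is false, and then run the loop only over the rows where the condition is true. How much this helps depends on the proportion of TRUE values in the condition. This test is compared against case (2), and as expected, the approach does improve performance.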
    output <- rep("lesser_than_4", nrow(df))  # default to the FALSE-case value
    condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4
    system.time({
      for (i in (1:nrow(df))[condition]) {  # run the loop only for true conditions
        if (condition[i]) {
          output[i] <- "greater_than_4"
        }
      }
      df$output <- output
    })
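Since cases (2) and (3) are meant to produce the same labels, it can be worth checking that on a small sample before timing them. A minimal sketch, assuming a seeded toy data frame (the name `toy`, the seed, and the size are illustrative assumptions):

```r
set.seed(1)
n <- 500
toy <- data.frame(col1 = runif(n, 0, 2), col2 = rnorm(n, 0, 2),
                  col3 = rpois(n, 3), col4 = rchisq(n, 2))

condition <- (toy$col1 + toy$col2 + toy$col3 + toy$col4) > 4

# Case 2: condition computed once, loop over every row.
out_case2 <- character(n)
for (i in seq_len(n)) {
  out_case2[i] <- if (condition[i]) "greater_than_4" else "lesser_than_4"
}

# Case 3: default to the FALSE label, loop only over the TRUE rows.
out_case3 <- rep("lesser_than_4", n)
for (i in seq_len(n)[condition]) {
  out_case3[i] <- "greater_than_4"
}

# The two variants should agree element by element.
print(identical(out_case2, out_case3))
# Share of TRUE rows; case 3 skips more work when this share is small.
print(mean(condition))
```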
4. Use ifelse() wherever possible
Using ifelse() makes your code more concise. Its syntax resembles that of if(), but it is much faster. Even without pre-allocation and without simplifying the condition, it outperforms the two previous approaches.

    system.time({
      output <- ifelse((df$col1 + df$col2 + df$col3 + df$col4) > 4,
                       "greater_than_4", "lesser_than_4")
      df$output <- output
    })

5. Use which()
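By using which() to select rows, we can reach about one third of the speed of Rcpp.

    # Thanks to Gabe Becker
    system.time({
      # restrict rowSums() to the four numeric columns; by this point df
      # also holds character output columns, which rowSums() cannot handle
      want = which(rowSums(df[, 1:4]) > 4)
      output = rep("less than 4", times = nrow(df))
      output[want] = "greater than 4"
    })
    # nrow = 3 million rows (approx)
    #    user  system elapsed
    #   0.396   0.074   0.481

6. Use the apply family instead of for loops

This section uses apply() to compute the example above and compares it with the vectorized loop. The approach is faster than the original method but slower than ifelse() and than the version with the condition check moved outside the loop. It is very useful, but it takes some ingenuity to apply when the logic gets complex.

    # apply family
    system.time({
      myfunc <- function(x) {
        if ((x["col1"] + x["col2"] + x["col3"] + x["col4"]) > 4) {
          "greater_than_4"
        } else {
          "lesser_than_4"
        }
      }
      output <- apply(df[, c(1:4)], 1, FUN = myfunc)  # apply 'myfunc' on every row
      df$output <- output
    })

7. Byte-compile functions with cmpfun() from the compiler package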
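This may not be the best example for showing the benefit of byte-code compilation, but for more complex functions it can perform very well, so it is worth knowing about.

    # byte-code compilation
    library(compiler)
    myFuncCmp <- cmpfun(myfunc)
    system.time({
      output <- apply(df[, c(1:4)], 1, FUN = myFuncCmp)
    })

8. Use Rcpp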
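So far we have tested several ways of speeding up the code, and the best of them has been ifelse(). What happens if we make the data ten times larger? Next we implement the computation with Rcpp and compare it with ifelse().

    library(Rcpp)
    sourceCpp("MyFunc.cpp")
    system.time(output <- myFunc(df))  # see the Rcpp function below

Below is the function written in C++. Save it as "MyFunc.cpp" and load it with sourceCpp().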
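    // Source for MyFunc.cpp
    #include <Rcpp.h>
    using namespace Rcpp;

    // [[Rcpp::export]]
    CharacterVector myFunc(DataFrame x) {
      NumericVector col1 = as<NumericVector>(x["col1"]);
      NumericVector col2 = as<NumericVector>(x["col2"]);
      NumericVector col3 = as<NumericVector>(x["col3"]);
      NumericVector col4 = as<NumericVector>(x["col4"]);
      int n = col1.size();
      CharacterVector out(n);
      for (int i = 0; i < n; i++) {
        // row sum of the four columns
        double tempOut = col1[i] + col2[i] + col3[i] + col4[i];
        if (tempOut > 4) {
          out[i] = "greater_than_4";
        } else {
          out[i] = "lesser_than_4";
        }
      }
      return out;
    }

9. Use parallel processing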
Code for parallel processing:

    # parallel processing
    library(foreach)
    library(doSNOW)
    cl <- makeCluster(4, type = "SOCK")  # for a 4-core machine
    registerDoSNOW(cl)
    condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4
    # parallelization with vectorization
    system.time({
      output <- foreach(i = 1:nrow(df), .combine = c) %dopar% {
        if (condition[i]) {
          "greater_than_4"
        } else {
          "lesser_than_4"
        }
      }
    })
    stopCluster(cl)  # release the worker processes
    df$output <- output

10. Remove variables and free memory as early as possible
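Remove variables you no longer need as early as possible, especially before starting a lengthy loop. Calling gc() at the end of each loop iteration to reclaim memory can sometimes help as well.

11. Use data structures with a smaller memory footprint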
data.table() is a good example, because it reduces the memory footprint of the data, which helps speed up operations.

    library(data.table)
    dt <- data.table(df)  # create the data.table
    system.time({
      for (i in 1:nrow(dt)) {
        if ((dt[i, col1] + dt[i, col2] + dt[i, col3] + dt[i, col4]) > 4) {
          dt[i, col5 := "greater_than_4"]  # assign the output as 5th column
        } else {
          dt[i, col5 := "lesser_than_4"]  # assign the output as 5th column
        }
      }
    })
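To see why the choice of data structure matters for memory, base R's object.size() can compare storage types directly. The sketch below only illustrates the general point and is not the article's benchmark; the vector length and the variable names are assumptions. It also shows the rm() and gc() pattern from tip 10.

```r
n <- 1e6

as_double  <- numeric(n)   # doubles: 8 bytes per element
as_integer <- integer(n)   # integers: 4 bytes per element

size_double  <- as.numeric(object.size(as_double))
size_integer <- as.numeric(object.size(as_integer))
print(size_double / size_integer)   # close to 2

# Tip 10 in practice: drop a large object that is no longer needed,
# then ask R to run a garbage collection.
rm(as_double)
invisible(gc())
```

When the values fit, storing counts as integers instead of doubles roughly halves the memory for that column, which is the kind of saving tip 11 is after.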
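Summary

Method: speed, where nrow(df) / time_taken = n rows per second
Raw: 1X, 856.2255 rows per second (normalized to 1)
Vectorized: 738X, 631578 rows per second
Only true conditions: 1002X, 857142.9 rows per second
ifelse: 1752X, 1500000 rows per second
which: 8806X, 7540364 rows per second
Rcpp: 13476X, 11538462 rows per second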
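The speed figures above are rows divided by elapsed seconds. A small helper along these lines can reproduce that metric for any of the approaches. The function name `rows_per_second` and the sample data are hypothetical, and the numbers will differ from machine to machine.

```r
# Hypothetical helper: time an expression and report rows per second.
rows_per_second <- function(expr, n_rows) {
  elapsed <- system.time(expr)[["elapsed"]]
  # Very fast expressions can report 0 elapsed seconds; the result is then Inf.
  n_rows / elapsed
}

set.seed(7)
n <- 1e5
demo <- data.frame(col1 = runif(n, 0, 2), col2 = rnorm(n, 0, 2),
                   col3 = rpois(n, 3), col4 = rchisq(n, 2))

# Time the ifelse() approach from tip 4 on the sample data.
speed_ifelse <- rows_per_second({
  out <- ifelse((demo$col1 + demo$col2 + demo$col3 + demo$col4) > 4,
                "greater_than_4", "lesser_than_4")
}, nrow(demo))
print(speed_ifelse)
```

Dividing two such results gives the relative speedup, which is how the "X" multipliers in the summary relate to the rows-per-second values.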