Large-data aggregation in R compared with MySQL
Use case:
MySQL structure:
table (user-to-company table)
uid, company
========
1, tianji
2, tianji
3, tianji
4, ganji
5, ganji
6, ganji
7, ganji
8, 58
….
Aggregation:
select company,count(company) as num
from t_company group by company
having num>3 and num<=300
order by num desc;
Result:
company,num
===========
tianji,3
ganji,4
10 million rows, 800 MB; MySQL execution time: 2 minutes.
Processing the data in R
Read in the CSV (user-to-company table)
1, tianji
2, tianji
3, tianji
4, ganji
5, ganji
6, ganji
7, ganji
8, 58
library(plyr)  # provides ddply
file='comapng'
companyData<-read.table(file=file, header = FALSE, sep=",", quote = "\"'",
na.strings="NA",fileEncoding="utf-8",encoding="utf-8")
names(companyData)<-c('uid','company')
print(paste('Total Company =>',nrow(companyData)))
# count rows per company, then keep companies with 3 < count <= 300
nset<-ddply(companyData, .(company), "nrow")
nset<-nset[which(nset$nrow<=300 & nset$nrow>3),]
# collect the row indices of every kept company, one company at a time
include<-c()
for(i in seq_len(nrow(nset))){  # seq_len avoids looping over 1:0 when nset is empty
t<-which(companyData$company==nset$company[i])
include<-c(include,t)
}
print(paste('Available Company =>',length(include)))
companyData<-companyData[include,]
10 million rows, 800 MB, 1.5 GB of memory used; R execution time: 30+ minutes.
====================
Need to find a way to optimize this!!
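One direction that might help, as a sketch only: the for loop scans the whole company column once per kept company and grows the include vector on every pass. Counting with table() and filtering with a single %in% test should avoid both. The helper name filter_by_count and the toy data frame below are made up for illustration, and I have not timed this on the 10 million row file.

```r
# Toy stand-in for the user-to-company table (the real file has ~10 million rows).
companyData <- data.frame(
  uid = 1:8,
  company = c("tianji", "tianji", "tianji",
              "ganji", "ganji", "ganji", "ganji", "58"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose company occurs more than `lower`
# and at most `upper` times (same thresholds as the HAVING clause).
filter_by_count <- function(df, lower = 3, upper = 300) {
  counts <- table(df$company)                    # one pass: rows per company
  keep <- names(counts)[counts > lower & counts <= upper]
  df[df$company %in% keep, , drop = FALSE]       # one vectorized membership test
}

filtered <- filter_by_count(companyData)
print(paste("Available Company =>", nrow(filtered)))
```

Note that this keeps the rows in their original file order, whereas the loop version groups them by company. That should not matter if only the set of remaining rows is needed.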