A Hands-On Guide to Building a Credit Scoring Model in R (Part 3): Building the Logistic Model
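Correlation Analysis & IV (Information Value) Screening

In the previous installment on variable selection, we used WoE to complete the univariate analysis. Next, we use the cleaned data to look at the correlations between variables. Note that this correlation analysis is only a preliminary check. To examine multicollinearity in the model itself, we still need to test it with the VIF (variance inflation factor).

R code:

    require(corrplot)
    cor1 <- cor(train)            # pairwise correlations of the training data
    corrplot(cor1, tl.cex = 0.5)  # tl.cex shrinks the variable labels

Output plot:

The correlation matrix plot shows that CreditAmount and Duration are relatively strongly correlated (0.37), as are NoofCreditatthisBank and PaymentStatusofPreviousCredit (0.42).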
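Reading pairs off a plot gets tedious with many variables. The following is a minimal sketch of listing the correlated pairs programmatically. The data frame `toy`, its columns, and the cutoff of 0.3 are made up for illustration and are not the article's data.

```r
# Synthetic stand-in for the cleaned training data (hypothetical columns)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)   # deliberately correlated with x1
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) when |r| exceeds the chosen cutoff
cutoff <- 0.3
idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(high_pairs)
```

Applied to the article's data, the same pattern would use `cor(train)` in place of `cor(toy)`.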
Next, I compute the Information Value (IV) of each variable. IV is commonly used to gauge the predictive power of an independent variable. Its formula is:
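In its usual textbook form, with notation chosen here for illustration (g_i and b_i are the numbers of good and bad customers in bin i of the variable, G and B are the totals of good and bad customers, and k is the number of bins), the definition reads:

```latex
\mathrm{WoE}_i = \ln\!\left(\frac{g_i / G}{b_i / B}\right),
\qquad
\mathrm{IV} = \sum_{i=1}^{k} \left(\frac{g_i}{G} - \frac{b_i}{B}\right) \mathrm{WoE}_i
```

Some packages define WoE with good and bad swapped, which flips its sign. IV is unaffected by that choice, because both factors in each term change sign together.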
The rule of thumb for judging a variable's predictive power from its IV is:

< 0.02: unpredictive
0.02 to 0.1: weak
0.1 to 0.3: medium
0.3 to 0.5: strong
> 0.5: suspicious
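As a small worked example of these thresholds, here is a sketch that computes WoE and IV for one binned variable by hand and then labels the result. The bin counts are invented, and `iv_label` is a hypothetical helper, not a function from any package used in this series.

```r
# Hypothetical counts of good and bad customers in each bin of one variable
good <- c(bin1 = 100, bin2 = 200, bin3 = 300, bin4 = 100)
bad  <- c(bin1 = 120, bin2 = 90,  bin3 = 60,  bin4 = 30)

dist_good <- good / sum(good)   # share of all good customers in each bin
dist_bad  <- bad / sum(bad)     # share of all bad customers in each bin

woe_bin <- log(dist_good / dist_bad)
iv      <- sum((dist_good - dist_bad) * woe_bin)

# Map an IV value to the rule-of-thumb labels listed above
iv_label <- function(iv) {
  cut(iv,
      breaks = c(-Inf, 0.02, 0.1, 0.3, 0.5, Inf),
      labels = c("unpredictive", "weak", "medium", "strong", "suspicious"),
      right  = FALSE)
}

print(round(woe_bin, 3))
print(iv)
print(iv_label(iv))
```

With `right = FALSE`, a value that sits exactly on a boundary falls into the higher band. That is a choice made for this sketch, since the list above does not say how ties are handled.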
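Because this part involves a lot of code, I will put the more detailed code at the end of the article. Here is the statement that plots the IV of each variable:

    library(ggplot2)
    # infovalue is assumed to hold one row per variable: va (name) and iv (value)
    ggplot(infovalue, aes(x = va, y = iv)) +
      geom_bar(stat = "identity", fill = "blue", colour = "grey60",
               size = 0.2, alpha = 0.2) +
      labs(title = "Information value") +
      theme(axis.text.x = element_text(angle = 90, colour = "black", size = 10))

Output plot: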
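As the plot shows, the IV values of DurationinCurrentaddress, Guarantors, Instalmentpercent, NoofCreditatthisBank, Occupation, Noofdependents, and Telephone are clearly low, so these variables are dropped. Removing NoofCreditatthisBank also resolves the relatively strong correlation (0.42) between NoofCreditatthisBank and PaymentStatusofPreviousCredit found in the correlation analysis. The correlation between CreditAmount and Duration (0.37) is not strong and can be ignored at this stage.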
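StepWise Multivariate Analysis & Building the Logistic Model

Before running the stepwise analysis, we need to convert the selected variables to their WoE values and build a logistic model.

First, remove the factors that were dropped during screening:

    german_credit$DurationinCurrentaddress <- NULL
    german_credit$Guarantors <- NULL
    german_credit$Instalmentpercent <- NULL
    german_credit$NoofCreditatthisBank <- NULL
    german_credit$Occupation <- NULL
    german_credit$Noofdependents <- NULL
    german_credit$Telephone <- NULL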
Then compute the WoE values for each variable:

    # One WoE table per variable; C_Bin is the number of bins
    AccountBalancewoe <- woe(train2, "AccountBalance", Continuous = FALSE, "Creditability", C_Bin = 4, Good = "1", Bad = "0")
    Durationwoe <- woe(train2, "Duration", Continuous = FALSE, "Creditability", C_Bin = 2, Good = "1", Bad = "0")
    PaymentStatusofPreviousCreditwoe <- woe(train2, "PaymentStatusofPreviousCredit", Continuous = FALSE, "Creditability", C_Bin = 4, Good = "1", Bad = "0")
    Purposewoe <- woe(train2, "Purpose", Continuous = FALSE, "Creditability", C_Bin = 11, Good = "1", Bad = "0")
    CreditAmountwoe <- woe(train2, "CreditAmount", Continuous = FALSE, "Creditability", C_Bin = 2, Good = "1", Bad = "0")

(See the end of the article for the full code.)
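Replace each variable's coded values with the corresponding WoE:

    for (i in 1:1000) {
      for (s in 1:4) {
        if (german_credit$AccountBalance[i] == s) {
          german_credit$AccountBalance[i] <- -AccountBalancewoe$WOE[s]
        }
      }
      for (s in 1:3) {
        if (german_credit$Duration[i] == s) {
          german_credit$Duration[i] <- -Durationwoe$WOE[s]
        }
      }
      for (s in 0:4) {
        # Levels start at 0, so the WoE row is offset by one
        if (german_credit$PaymentStatusofPreviousCredit[i] == s) {
          german_credit$PaymentStatusofPreviousCredit[i] <- -PaymentStatusofPreviousCreditwoe$WOE[s + 1]
        }
      }
      # The remaining variables are replaced in the same way
    }

(See the end of the article for the full code.)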
Inspecting the result with View(german_credit) shows that all the values have been replaced successfully.
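The same substitution can also be written without explicit loops. The sketch below uses `match()` to look up every observation at once. The vector `account_balance` and the table `woe_table`, including its WoE numbers, are invented for illustration and only mimic the level/WOE layout assumed above.

```r
# Hypothetical coded variable and a per-level WoE table
account_balance <- c(1, 3, 2, 4, 1, 2)
woe_table <- data.frame(level = 1:4,
                        WOE   = c(-0.82, -0.40, 0.41, 1.18))

# match() returns, for every observation, the row of its level in the table
account_balance_woe <- woe_table$WOE[match(account_balance, woe_table$level)]
print(account_balance_woe)
```

Writing the result to a new vector means an already replaced value is never compared against the level codes again, which is a subtle risk when overwriting a column in place. The article negates the WoE before storing it, so a leading minus sign would be needed to reproduce that convention.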
Fit a logistic model to the WoE-transformed data and use backward stepwise regression to select variables:

    fit <- glm(Creditability ~ AccountBalance + Duration + PaymentStatusofPreviousCredit +
                 Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment +
                 Sex.Marital.Status + Mostvaluableavailableasset + Age +
                 ConcurrentCredits + Typeofapartment + ForeignWorker,
               data = train2, family = "binomial")
    backwards <- step(fit)   # with no scope given, step() searches backward

Output:

The stepwise procedure removed the variables Typeofapartment, Mostvaluableavailableasset, and Sex.Marital.Status.
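To see what backward elimination does in isolation, here is a small sketch on synthetic data. The variables `x1`, `x2`, and `noise` and the coefficients used to simulate `y` are assumptions made up for this example.

```r
set.seed(42)
n     <- 500
x1    <- rnorm(n)
x2    <- rnorm(n)
noise <- rnorm(n)                                   # unrelated to the outcome
y     <- rbinom(n, 1, plogis(-0.5 + 1.5 * x1 + 0.8 * x2))
toy   <- data.frame(y, x1, x2, noise)

fit_full  <- glm(y ~ x1 + x2 + noise, data = toy, family = binomial)

# step() drops one term at a time as long as doing so lowers the AIC
backwards <- step(fit_full, direction = "backward", trace = 0)
print(formula(backwards))
```

Because `step()` compares models by AIC rather than by p-values, a variable that survives the search can still turn out to be insignificant in `summary()`, so the summary is worth checking after the search.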
We then refit the model with the variables that survived the stepwise selection:

    fit2 <- glm(Creditability ~ AccountBalance + Duration + PaymentStatusofPreviousCredit +
                  Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment +
                  Age + ConcurrentCredits + ForeignWorker,
                data = train2, family = "binomial")
    summary(fit2)

Output:
The ConcurrentCredits variable is not significant, so we drop it at this step and refit the logistic model:

    fit3 <- glm(Creditability ~ AccountBalance + Duration + PaymentStatusofPreviousCredit +
                  Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment +
                  Age + ForeignWorker,
                data = train2, family = "binomial")
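To guard against multicollinearity, we run a VIF test on the model:

    library(car)
    round(vif(fit3), 3)   # VIF of each predictor, rounded to three decimals

Output:

The output shows that every variable has a VIF below 4, so we can conclude that the model has no multicollinearity problem.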
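For intuition about what the VIF measures, the sketch below computes it by hand in the linear-regression sense: regress each predictor on the others and take 1 / (1 - R^2). The predictors `x_a`, `x_b`, and `x_c` are synthetic. For a glm, `car::vif` works from the fitted model's coefficient covariance instead, so its numbers can differ somewhat from this hand calculation.

```r
set.seed(7)
n   <- 300
x_a <- rnorm(n)
x_b <- 0.9 * x_a + rnorm(n, sd = 0.3)   # nearly collinear with x_a
x_c <- rnorm(n)                         # independent of the other two
X   <- data.frame(x_a, x_b, x_c)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the rest
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 2))
```

In this toy data the two collinear predictors get a large VIF while the independent one stays close to 1, which is the pattern a cutoff such as 4 is meant to catch.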
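Model Validation

At this point the modeling work is essentially done, and we need to check how well the model predicts. We test it on the 250 records that were held out at the start of modeling:

    library(caret)
    # Without type = "response", predict() returns log-odds, so 0.99 is a
    # cutoff on the log-odds scale (a probability of roughly 0.73)
    prediction <- predict(fit3, newdata = test2)
    prediction <- ifelse(prediction > 0.99, 1, 0)
    # Recent versions of caret require factors with matching levels
    confusionMatrix(factor(prediction, levels = c(0, 1)),
                    factor(test2$Creditability, levels = c(0, 1)))

Output:

The model reaches an accuracy of 0.72, which is a mediocre result. This is related to the limitations of the logistic model itself: traditional regression models are generally less accurate than machine learning algorithms such as decision trees and SVMs.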
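The scale of the predictions matters when choosing a cutoff. The sketch below, on synthetic data with made-up names (`toy`, `fit_toy`), shows that a cutoff c on the default log-odds output of `predict()` is equivalent to a cutoff `plogis(c)` on the probability scale, and builds the confusion table with base R only.

```r
set.seed(123)
n   <- 400
x   <- rnorm(n)
y   <- rbinom(n, 1, plogis(0.8 + 1.2 * x))
toy <- data.frame(x, y)

train_idx <- sample(n, 300)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

fit_toy <- glm(y ~ x, data = toy_train, family = binomial)

log_odds <- predict(fit_toy, newdata = toy_test)                     # link scale (default)
prob     <- predict(fit_toy, newdata = toy_test, type = "response")  # probability scale

# The same decision rule expressed on both scales
pred_link <- as.integer(log_odds > 0.99)
pred_prob <- as.integer(prob > plogis(0.99))

conf_tab <- table(predicted = factor(pred_prob, levels = 0:1),
                  actual    = factor(toy_test$y, levels = 0:1))
accuracy <- sum(diag(conf_tab)) / sum(conf_tab)
print(conf_tab)
print(accuracy)
```

Stating the cutoff as a probability makes it easier to reason about, and different cutoffs trade off the two kinds of misclassification, so accuracy at a single cutoff is only one view of the model.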
Full code:

    # Remove the variables dropped during IV screening
    german_credit$DurationinCurrentaddress <- NULL
    german_credit$Guarantors <- NULL
    german_credit$Instalmentpercent <- NULL
    german_credit$NoofCreditatthisBank <- NULL
    german_credit$Occupation <- NULL
    german_credit$Noofdependents <- NULL
    german_credit$Telephone <- NULL

    # WoE table for each remaining variable
    AccountBalancewoe <- woe(train2, "AccountBalance", Continuous = FALSE, "Creditability", C_Bin = 4, Good = "1", Bad = "0")
    Durationwoe <- woe(train2, "Duration", Continuous = FALSE, "Creditability", C_Bin = 2, Good = "1", Bad = "0")
    PaymentStatusofPreviousCreditwoe <- woe(train2, "PaymentStatusofPreviousCredit", Continuous = FALSE, "Creditability", C_Bin = 4, Good = "1", Bad = "0")
    Purposewoe <- woe(train2, "Purpose", Continuous = FALSE, "Creditability", C_Bin = 11, Good = "1", Bad = "0")
    CreditAmountwoe <- woe(train2, "CreditAmount", Continuous = FALSE, "Creditability", C_Bin = 2, Good = "1", Bad = "0")
    ValueSavingswoe <- woe(train2, "ValueSavings", Continuous = FALSE, "Creditability", C_Bin = 4, Good = "1", Bad = "0")
    Lengthofcurrentemploymentwoe <- woe(train2, "Lengthofcurrentemployment", Continuous = FALSE, "Creditability", C_Bin = 4, Good = "1", Bad = "0")
    Sex.Marital.Statuswoe <- woe(train2, "Sex.Marital.Status", Continuous = FALSE, "Creditability", C_Bin = 4, Good = "1", Bad = "0")
    Mostvaluableavailableassetwoe <- woe(train2, "Mostvaluableavailableasset", Continuous = FALSE, "Creditability", C_Bin = 4, Good = "1", Bad = "0")
    Agewoe <- woe(train2, "Age", Continuous = FALSE, "Creditability", C_Bin = 2, Good = "1", Bad = "0")
    ConcurrentCreditswoe <- woe(train2, "ConcurrentCredits", Continuous = FALSE, "Creditability", C_Bin = 3, Good = "1", Bad = "0")
    Typeofapartmentwoe <- woe(train2, "Typeofapartment", Continuous = FALSE, "Creditability", C_Bin = 3, Good = "1", Bad = "0")
    ForeignWorkerwoe <- woe(train2, "ForeignWorker", Continuous = FALSE, "Creditability", C_Bin = 2, Good = "1", Bad = "0")

    # Replace each coded value with its (negated) WoE
    for (i in 1:1000) {
      for (s in 1:4) {
        if (german_credit$AccountBalance[i] == s) german_credit$AccountBalance[i] <- -AccountBalancewoe$WOE[s]
      }
      for (s in 1:3) {
        if (german_credit$Duration[i] == s) german_credit$Duration[i] <- -Durationwoe$WOE[s]
      }
      for (s in 0:4) {
        if (german_credit$PaymentStatusofPreviousCredit[i] == s) german_credit$PaymentStatusofPreviousCredit[i] <- -PaymentStatusofPreviousCreditwoe$WOE[s + 1]
      }
      for (s in 0:10) {
        # The row offset changes after code 6
        if (s <= 6) {
          if (german_credit$Purpose[i] == s) german_credit$Purpose[i] <- -Purposewoe$WOE[s + 1]
        } else {
          if (german_credit$Purpose[i] == s) german_credit$Purpose[i] <- -Purposewoe$WOE[s]
        }
      }
      for (s in 1:2) {
        if (german_credit$CreditAmount[i] == s) german_credit$CreditAmount[i] <- -CreditAmountwoe$WOE[s]
      }
      for (s in 2:5) {
        if (german_credit$ValueSavings[i] == s) german_credit$ValueSavings[i] <- -ValueSavingswoe$WOE[s - 1]
      }
      for (s in 1:5) {
        if (german_credit$Lengthofcurrentemployment[i] == s) german_credit$Lengthofcurrentemployment[i] <- -Lengthofcurrentemploymentwoe$WOE[s]
      }
      for (s in 1:5) {
        if (german_credit$Sex.Marital.Status[i] == s) german_credit$Sex.Marital.Status[i] <- -Sex.Marital.Statuswoe$WOE[s]
      }
      for (s in 1:4) {
        if (german_credit$Mostvaluableavailableasset[i] == s) german_credit$Mostvaluableavailableasset[i] <- -Mostvaluableavailableassetwoe$WOE[s]
      }
      for (s in 1:2) {
        if (german_credit$Age[i] == s) german_credit$Age[i] <- -Agewoe$WOE[s]
      }
      for (s in 1:5) {
        if (german_credit$ConcurrentCredits[i] == s) german_credit$ConcurrentCredits[i] <- -ConcurrentCreditswoe$WOE[s]
      }
      for (s in 1:5) {
        if (german_credit$Typeofapartment[i] == s) german_credit$Typeofapartment[i] <- -Typeofapartmentwoe$WOE[s]
      }
      for (s in 1:2) {
        if (german_credit$ForeignWorker[i] == s) german_credit$ForeignWorker[i] <- -ForeignWorkerwoe$WOE[s]
      }
    }

    # Full model and backward stepwise selection
    fit <- glm(Creditability ~ AccountBalance + Duration + PaymentStatusofPreviousCredit +
                 Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment +
                 Sex.Marital.Status + Mostvaluableavailableasset + Age +
                 ConcurrentCredits + Typeofapartment + ForeignWorker,
               data = train2, family = "binomial")
    backwards <- step(fit)
    summary(backwards)

    # Model with the variables kept by the stepwise search
    fit2 <- glm(Creditability ~ AccountBalance + Duration + PaymentStatusofPreviousCredit +
                  Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment +
                  Age + ConcurrentCredits + ForeignWorker,
                data = train2, family = "binomial")
    summary(fit2)

    # Final model without the insignificant ConcurrentCredits
    fit3 <- glm(Creditability ~ AccountBalance + Duration + PaymentStatusofPreviousCredit +
                  Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment +
                  Age + ForeignWorker,
                data = train2, family = "binomial")
    summary(fit3)

    # Multicollinearity check
    library(car)
    round(vif(fit3), 3)

    # Validation on the 250 held-out records (cutoff is on the log-odds scale)
    library(caret)
    prediction <- predict(fit3, newdata = test2)
    prediction <- ifelse(prediction > 0.99, 1, 0)
    confusionMatrix(factor(prediction, levels = c(0, 1)),
                    factor(test2$Creditability, levels = c(0, 1)))