FIN Case Study, Financial Engineering (MATLAB): Quantitative Credit Scoring
2018-08-13

PART 1 – Quantitative Credit Scoring (85%)

Section I – Variable Mapping


Go to the following link to download the dataset: https://www.dropbox.com/s/mkoxbevzyvj1j9q/Credit_data_RSM6305.txt?dl=0

Note: The first column contains only index values. Ignore it and do not count it as one of the features.


You have been provided with a credit dataset containing 20 different borrower attributes (7 numerical and 13 categorical). The attributes are described as follows:

Attribute description:

Attribute 1: (qualitative)

Status of existing checking account


Attribute 2: (numerical)

Duration in month


Attribute 3: (qualitative)

Credit history


Attribute 4: (qualitative)

Purpose

o A40 : car (new)

o A41 : car (used)

o A42 : furniture/equipment

o A43 : radio/television

o A44 : domestic appliances

o A45 : repairs

o A46 : education

o A47 : (vacation – does not exist?)

o A48 : retraining

o A49 : business

o A410 : others


Attribute 5: (numerical)

Credit amount


Attribute 6: (qualitative)

Savings account/bonds

o A61 : … < 100 DM

o A62 : 100 <= … < 500 DM

o A63 : 500 <= … < 1000 DM

o A64 : … >= 1000 DM

o A65 : unknown/ no savings account


Attribute 7: (qualitative)

Present employment since

o A71 : unemployed

o A72 : … < 1 year

o A73 : 1 <= … < 4 years

o A74 : 4 <= … < 7 years

o A75 : … >= 7 years


Attribute 8: (numerical)

Installment rate in percentage of disposable income




Attribute 9: (qualitative)

Personal status and sex

o A91 : male : divorced/separated

o A92 : female : divorced/separated/married

o A93 : male : single

o A94 : male : married/widowed

o A95 : female : single


Attribute 10: (qualitative)

Other debtors / guarantors

o A101 : none

o A102 : co-applicant

o A103 : guarantor


Attribute 11: (numerical)

Present residence since


Attribute 12: (qualitative)

Property

o A121 : real estate

o A122 : if not A121 : building society savings agreement/life insurance

o A123 : if not A121/A122 : car or other, not in attribute 6

o A124 : unknown / no property


Attribute 13: (numerical)

Age in years


Attribute 14: (qualitative)

Other installment plans

o A141 : bank

o A142 : stores

o A143 : none


Attribute 15: (qualitative)

Housing

o A151 : rent

o A152 : own

o A153 : for free


Attribute 16: (numerical)

Number of existing credits at this bank


Attribute 17: (qualitative)

Job

o A171 : unemployed/ unskilled – non-resident

o A172 : unskilled – resident

o A173 : skilled employee / official

o A174 : management/ self-employed/highly qualified employee/ officer


Attribute 18: (numerical)

Number of people being liable to provide maintenance for


Attribute 19: (qualitative)

Telephone

o A191 : none

o A192 : yes, registered under the customer's name



Attribute 20: (qualitative)

Foreign worker

o A201 : yes

o A202 : no

The 21st column denotes the credit status of the borrower, with 1 being good and 2 being bad.

Your first task will be to transform the provided dataset into the respective categorical variables based on the attributes mentioned above.
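
One possible way to approach the mapping is a lookup table from attribute codes to labels. The Python sketch below covers only two of the attributes listed above (Housing and Foreign worker); the column names `housing` and `foreign_worker`, the helper names, and the sample record are illustrative assumptions, not part of the dataset.

```python
# Code-to-label lookup for two of the categorical attributes listed above.
# The remaining attributes would follow the same pattern.
CODE_LABELS = {
    "housing": {"A151": "rent", "A152": "own", "A153": "for free"},
    "foreign_worker": {"A201": "yes", "A202": "no"},
}


def decode(column, code):
    """Return the label for a code, or None if the code is missing or unknown."""
    return CODE_LABELS[column].get(code)


def decode_row(row):
    """Decode every mapped column of a row given as a dict of raw values."""
    return {
        col: decode(col, val) if col in CODE_LABELS else val
        for col, val in row.items()
    }


# Hypothetical record (illustrative values, not taken from the dataset).
raw = {"housing": "A152", "foreign_worker": "A201", "age": 35}
decoded = decode_row(raw)
print(decoded)
```

Returning None for unknown codes keeps missing entries visible for the imputation step instead of silently mislabelling them.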
Section II – Exploratory Data Analysis & Wrangling

Your next task will be to conduct exploratory data analysis.


1) Observe the variables by plotting histograms and box plots (for continuous variables) and frequency tables (bar plots) for categorical variables. Conduct proper outlier detection (if any) and use necessary tools taught in the class to treat outliers. Present your analysis.
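
As an illustration of outlier treatment, the following Python sketch flags values outside Tukey's 1.5 x IQR fences and clips them to the nearest fence (winsorizing). Which method was taught in class is not stated here, so the choice of Tukey's rule is an assumption, and the credit amounts are made-up values.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def winsorize(values, k=1.5):
    """Clip values lying outside the Tukey fences to the nearest fence."""
    lo, hi = iqr_fences(values, k)
    return [min(max(v, lo), hi) for v in values]


# Hypothetical credit amounts with one extreme value.
amounts = [1200, 1500, 1800, 2100, 2400, 2700, 3000, 18000]
lo, hi = iqr_fences(amounts)
outliers = [v for v in amounts if v < lo or v > hi]
clipped = winsorize(amounts)
print(outliers, clipped)
```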

2) You may notice that the categorical variables contain missing values in some instances. It is advisable to impute these values rather than drop the rows entirely, which would result in a loss of information. Keep in mind that not all the variables are of the same type (some are continuous and some are categorical), so use your judgement and choose appropriate methods to impute the two types differently. Present your analysis.
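
A simple baseline for treating the two variable types differently is to fill categorical gaps with the mode and numerical gaps with the median. The Python sketch below assumes missing entries have already been read in as `None`; how they are actually encoded in the file is not specified here, and the sample columns are hypothetical.

```python
import statistics


def impute(values, kind):
    """Fill None entries: mode for categorical, median for numerical columns."""
    observed = [v for v in values if v is not None]
    if kind == "categorical":
        fill = statistics.mode(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical columns with one missing entry each.
housing = ["A152", None, "A151", "A152", "A153"]
age = [25, 40, None, 31, 52]
housing_filled = impute(housing, "categorical")
age_filled = impute(age, "numerical")
print(housing_filled, age_filled)
```

More elaborate schemes (for example, imputing within groups of similar borrowers) follow the same interface; this baseline only shows the split by variable type.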

3) Conduct cross tabulation of all the categorical predictors with the credit status of the borrower. For this you need to create a cross contingency table. For example, if we take a categorical variable with 3 categories 1, 2, and 3, the cross contingency table will take the following form,

Credit Status | 1                        | 2                        | 3                        | Row Total
0             | # of 1 in 0 (and % of 0) | # of 2 in 0 (and % of 0) | # of 3 in 0 (and % of 0) | Total 0
1             | # of 1 in 1 (and % of 1) | # of 2 in 1 (and % of 1) | # of 3 in 1 (and % of 1) | Total 1
Column Total  | Total 1                  | Total 2                  | Total 3                  | Sub-total


For each table, present an outline of the analysis.
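
The table layout described above can be sketched in Python as counts with row percentages plus row and column totals. The function name `crosstab` and the ten sample observations are illustrative assumptions.

```python
from collections import Counter


def crosstab(status, category):
    """Count (status, category) pairs and add row percentages and totals."""
    counts = Counter(zip(status, category))
    rows = sorted(set(status))
    cols = sorted(set(category))
    table = {}
    for r in rows:
        row_total = sum(counts[(r, c)] for c in cols)
        # Each cell holds (count, percentage of the row total).
        table[r] = {
            c: (counts[(r, c)], 100.0 * counts[(r, c)] / row_total)
            for c in cols
        }
        table[r]["Row Total"] = row_total
    col_totals = {c: sum(counts[(r, c)] for r in rows) for c in cols}
    col_totals["Row Total"] = len(status)
    return table, col_totals


# Hypothetical sample: credit status (0/1) and a 3-level categorical predictor.
status = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
category = [1, 1, 2, 3, 1, 2, 2, 3, 3, 3]
table, col_totals = crosstab(status, category)
print(table)
print(col_totals)
```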

4) Using the cross contingency table, perform a chi-square test of independence between each categorical variable and the credit status of the borrower. Take note of the variables that show a statistically significant dependence with the response variable.
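
A minimal Python sketch of the Pearson chi-square test of independence follows. It computes expected counts from the row and column totals, then obtains the p-value from a series expansion of the regularized incomplete gamma function so that no external library is needed. The observed counts are hypothetical.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability of the chi-square distribution (series form)."""
    if x <= 0:
        return 1.0
    a, half_x = dof / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-15 * total:
        term *= half_x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(half_x) - half_x - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def chi_square_test(observed):
    """Return (statistic, degrees of freedom, p-value) for a table of counts."""
    row_tot = [sum(r) for r in observed]
    col_tot = [sum(c) for c in zip(*observed)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (obs - expected) ** 2 / expected
    dof = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, dof, chi2_sf(stat, dof)


# Hypothetical 2 x 3 table: rows are credit status, columns are categories.
observed = [[30, 50, 20], [40, 20, 40]]
stat, dof, p_value = chi_square_test(observed)
print(stat, dof, p_value)
```

A predictor would be retained for the estimation stage when its p-value falls below the chosen significance level.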

5) For the continuous variables, present the necessary descriptive statistics. Make sure to standardize the continuous variables before moving to estimation. A correlation matrix of all the variables (categorical and continuous) will also give a good idea of the dependence structure of the dataset.
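
Standardization and a pairwise correlation can be sketched in Python as below. The sketch handles numerical columns only; categorical variables would first need a numerical encoding or a different association measure such as Cramér's V. The two sample columns are made-up values.

```python
import statistics


def standardize(values):
    """Z-score a column: subtract the mean, divide by the sample std. dev."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    """Pearson correlation as the average product of z-scores (n - 1 divisor)."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Hypothetical duration (months) and credit amount columns.
duration = [6, 12, 24, 36, 48]
amount = [1200, 1500, 3900, 5000, 9000]
z_duration = standardize(duration)
r = pearson(duration, amount)
print(z_duration, r)
```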

Section III – Estimation
Before moving to the estimation phase, note that the full dataset should not be used for estimation. Conduct a 70:30 hold-out validation: randomly sample 70% of the data as the training set and keep the remaining 30% as the test set.
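
A random 70:30 split can be sketched in Python as follows. The fixed seed is an assumption added for reproducibility, and the integer rows stand in for actual borrower records.

```python
import random


def train_test_split(rows, train_frac=0.7, seed=0):
    """Shuffle row indices with a fixed seed and cut them at train_frac."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(round(train_frac * len(rows)))
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test


# Placeholder rows; in practice each element would be one borrower record.
rows = list(range(1000))
train, test = train_test_split(rows)
print(len(train), len(test))
```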

1) Start by estimating a logistic regression using all the significant categorical predictors (based on the chi-square tests) and all numeric variables. At every iteration, remove the insignificant variables and re-estimate until all remaining variables are significant at the 5% level. Briefly explain the final chosen variables and the signs of their coefficients.

a. For the logistic regression, you must build your own function, which includes constructing the functions for the logistic distribution, the log likelihood and the optimization process. Refer to Appendix A3 of the textbook to build your own function for Newton's method, since you need the Hessian matrix to calculate the relevant regression statistics.
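
The pieces named above (logistic function, log likelihood, Newton's method with the Hessian) could be assembled roughly as in the Python sketch below. It is a generic textbook-style Newton iteration, not necessarily the algorithm of Appendix A3; the function names and the eight-point toy dataset are assumptions. Standard errors come from the inverse of the negative Hessian.

```python
import math


def sigmoid(z):
    """Logistic CDF, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(X, y, beta):
    """Bernoulli log likelihood of the logistic model."""
    ll = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        ll += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return ll


def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


def fit_logit(X, y, tol=1e-10, max_iter=50):
    """Newton's method; returns (coefficients, standard errors)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]  # negative Hessian, X'WX
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * xi[i]
                for j in range(k):
                    info[i][j] += w * xi[i] * xi[j]
        cov = invert(info)
        step = [sum(cov[i][j] * grad[j] for j in range(k)) for i in range(k)]
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    # cov was evaluated one (negligible) step before the final beta.
    se = [math.sqrt(cov[i][i]) for i in range(k)]
    return beta, se


# Toy data: an intercept column plus one predictor (not linearly separable).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
X = [[1.0, x] for x in xs]
y = [0, 0, 1, 0, 1, 0, 1, 1]
beta, se = fit_logit(X, y)
z_stats = [b / s for b, s in zip(beta, se)]
print(beta, se, z_stats)
```

The ratio of each coefficient to its standard error gives the z-statistic used to decide which variables to drop at the 5% level.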


b. Plot the ROC curve for the in-sample prediction. Use built-in packages/functions to extract the ROC curve. Perform a Kolmogorov-Smirnov (KS) test on the true positive rates and false positive rates obtained at each cutoff value. Pick the appropriate cutoff value based on the KS test. Use this cutoff value for out-of-sample prediction. (This will be important in Part IV for calculating the Brier Score and conducting the HL test.)
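
The KS-based cutoff choice can be illustrated in Python as below: for each candidate cutoff, compute the true and false positive rates and pick the cutoff with the largest gap between them. A built-in ROC routine would normally supply these points; here they are computed by hand to keep the sketch self-contained. Treating label 1 as the positive class and the eight sample scores are assumptions.

```python
def roc_points(y_true, scores):
    """Return (cutoff, TPR, FPR) for every distinct score used as a cutoff."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((t, tp / pos, fp / neg))
    return points


def ks_cutoff(y_true, scores):
    """Cutoff maximising TPR - FPR; that maximum is the KS statistic."""
    t, tpr, fpr = max(roc_points(y_true, scores), key=lambda p: p[1] - p[2])
    return t, tpr - fpr


# Hypothetical in-sample labels and predicted probabilities.
y_true = [0, 0, 0, 1, 1, 0, 1, 1]
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
cutoff, ks = ks_cutoff(y_true, scores)
print(cutoff, ks)
```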

2) Estimate a stepwise logistic regression model. Present your output results. Briefly explain the signs of the coefficients and their significance.

a. Refer to the slides for an idea of how to approach the algorithm for the stepwise functions. Feel free to implement only one method, either backward or forward.
b. Perform the KS test as before to obtain the optimal cutoff value. Use this cutoff value for out-of-sample prediction.
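
A backward stepwise procedure can be outlined in Python as a greedy loop that drops one variable at a time while a model-selection criterion improves. The slides' exact algorithm is not reproduced here; the sketch assumes a lower-is-better criterion such as AIC, and `toy_aic` with its numbers is a hypothetical stand-in for refitting the logistic regression on each subset.

```python
def backward_stepwise(features, criterion):
    """Greedy backward elimination.

    `criterion` maps a tuple of feature names to a score where lower is
    better (for example, the AIC of a logistic regression on those features).
    """
    current = tuple(features)
    best = criterion(current)
    while len(current) > 1:
        candidates = [tuple(f for f in current if f != drop) for drop in current]
        score, subset = min((criterion(c), c) for c in candidates)
        if score >= best:
            break  # no single removal improves the criterion
        current, best = subset, score
    return current, best


# Hypothetical deviance reductions per feature (made-up numbers).
GAIN = {"duration": 30.0, "amount": 12.0, "age": 1.0, "residence": 0.5}


def toy_aic(subset):
    """Stand-in for AIC: baseline deviance minus gains plus 2 per parameter."""
    return 200.0 - sum(GAIN[f] for f in subset) + 2 * len(subset)


selected, best_score = backward_stepwise(list(GAIN), toy_aic)
print(selected, best_score)
```

Passing the criterion as a function keeps the search loop independent of how the model is actually fitted.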

3) Estimate a decision tree on the given dataset. Present your results.

a. For this question, feel free to leverage libraries or built-in functions for estimation.
b. Apply cost complexity pruning to the large tree obtained in order to obtain a sequence of sub-trees. Conduct a K-fold cross validation.
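
The K-fold part of this step can be sketched in Python as a generator of train/validation index sets. Tree fitting and pruning are left to a library, so `fit_and_score` is a hypothetical callback that would fit a pruned tree on the training indices and return its validation score; the seed and fold count are likewise assumptions.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val


def cross_validate(n, k, fit_and_score):
    """Average the score returned by fit_and_score over the k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n, k)]
    return sum(scores) / len(scores)


# Each observation lands in exactly one validation fold.
splits = list(k_fold_indices(10, k=5))
print(splits)
```

Running `cross_validate` once per pruning level and keeping the level with the best average score is one common way to choose among the sub-trees.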
