Case study, FIN (financial engineering, MATLAB, quantitative): Quantitative Credit Scoring

PART 1 – Quantitative Credit Scoring (85%)

Section I – Variable Mapping

Go to the following link to download the dataset: https://www.dropbox.com/s/mkoxbevzyvj1j9q/Credit_data_RSM6305.txt?dl=0

Note: The first column contains only index values. Ignore it and do not count it as one of the features.

You have been provided with a credit dataset containing 20 different borrower attributes (7 numerical and 13 categorical). The attributes are described below.

Attribute description:

Attribute 1: (qualitative)

Status of existing checking account

Attribute 2: (numerical)

Duration in months

Attribute 3: (qualitative)

Credit history

Attribute 4: (qualitative)

Purpose

o A40 : car (new)

o A41 : car (used)

o A42 : furniture/equipment

o A43 : radio/television

o A44 : domestic appliances

o A45 : repairs

o A46 : education

o A47 : (vacation – does not exist?)

o A48 : retraining

o A49 : business

o A410 : others

Attribute 5: (numerical)

Credit amount

Attribute 6: (qualitative)

Savings account/bonds

o A61 : … < 100 DM

o A62 : 100 <= … < 500 DM

o A63 : 500 <= … < 1000 DM

o A64 : … >= 1000 DM

o A65 : unknown/ no savings account

Attribute 7: (qualitative)

Present employment since

o A71 : unemployed

o A72 : … < 1 year

o A73 : 1 <= … < 4 years

o A74 : 4 <= … < 7 years

o A75 : … >= 7 years

Attribute 8: (numerical)

Installment rate in percentage of disposable income

Attribute 9: (qualitative)

Personal status and sex

o A91 : male : divorced/separated

o A92 : female : divorced/separated/married

o A93 : male : single

o A94 : male : married/widowed

o A95 : female : single

Attribute 10: (qualitative)

Other debtors / guarantors

o A101 : none

o A102 : co-applicant

o A103 : guarantor

Attribute 11: (numerical)

Present residence since

Attribute 12: (qualitative)

Property

o A121 : real estate

o A122 : if not A121 : building society savings agreement/life insurance

o A123 : if not A121/A122 : car or other, not in attribute 6

o A124 : unknown / no property

Attribute 13: (numerical)

Age in years

Attribute 14: (qualitative)

Other installment plans

o A141 : bank

o A142 : stores

o A143 : none

Attribute 15: (qualitative)

Housing

o A151 : rent

o A152 : own

o A153 : for free

Attribute 16: (numerical)

Number of existing credits at this bank

Attribute 17: (qualitative)

Job

o A171 : unemployed/ unskilled – non-resident

o A172 : unskilled – resident

o A173 : skilled employee / official

o A174 : management/ self-employed/ highly qualified employee/ officer

Attribute 18: (numerical)

Number of people being liable to provide maintenance for

Attribute 19: (qualitative)

Telephone

o A191 : none

o A192 : yes, registered under the customer's name

Attribute 20: (qualitative)

Foreign worker

o A201 : yes

o A202 : no

The 21st column denotes the credit status of the borrower, with 1 being good and 2 being bad.

Your first task is to transform the coded values in the provided dataset into the respective categorical variables, based on the attributes listed above.
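As a rough illustration of this mapping step, here is a minimal Python sketch using pandas. It assumes the file is whitespace-separated with the index in the first column. The column names (`purpose`, `housing`, and so on) and the three sample rows are made up for the example. Only two of the attribute maps are shown, and they are copied from the list above.

```python
import io
import pandas as pd

# Code-to-label maps copied from the attribute list (two attributes shown).
PURPOSE = {
    "A40": "car (new)", "A41": "car (used)", "A42": "furniture/equipment",
    "A43": "radio/television", "A44": "domestic appliances", "A45": "repairs",
    "A46": "education", "A47": "vacation", "A48": "retraining",
    "A49": "business", "A410": "others",
}
HOUSING = {"A151": "rent", "A152": "own", "A153": "for free"}

def map_categories(df, label_maps):
    """Return a copy of df whose coded columns are pandas categoricals."""
    out = df.copy()
    for col, labels in label_maps.items():
        # Codes that are missing or not in the map become NaN instead of raising.
        out[col] = pd.Categorical(out[col].map(labels),
                                  categories=list(labels.values()))
    return out

# Made-up rows standing in for the real file; column names are hypothetical.
sample = io.StringIO(
    "idx purpose amount housing status\n"
    "1 A43 1169 A152 1\n"
    "2 A40 5951 A151 2\n"
    "3 A46 2096 A153 1\n"
)
# index_col=0 keeps the index column out of the feature set.
raw = pd.read_csv(sample, sep=r"\s+", index_col=0)
mapped = map_categories(raw, {"purpose": PURPOSE, "housing": HOUSING})
print(mapped.dtypes)
```

Keeping the maps as plain dictionaries makes it easy to add the remaining attributes, and unmapped codes show up as missing values that can be inspected during the exploratory analysis.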

Section II – Exploratory Data Analysis & Wrangling

Your next task will be to conduct exploratory data analysis.

1) Observe the variables by plotting histograms and box plots (for continuous variables) and frequency tables (bar plots) for categorical variables. Conduct proper outlier detection (if any) and use necessary tools taught in the class to treat outliers. Present your analysis.

2) You may notice that the categorical variables contain missing values in some instances. It is advisable to impute the missing data rather than drop the rows entirely, since dropping rows results in a loss of information. Keep in mind that the variables are not all of the same type (some are continuous and some are categorical), so use your judgement and choose suitable imputation methods for each type. Present your analysis.

3) Conduct a cross tabulation of all the categorical predictors with the credit status of the borrower. For this you need to create a cross contingency table. For example, for a categorical variable with 3 categories (1, 2, and 3), the cross contingency table takes the following form:

Credit Status | 1 | 2 | 3 | Row Total
0 | # of 1 in 0 (and % of 0) | # of 2 in 0 (and % of 0) | # of 3 in 0 (and % of 0) | Total 0
1 | # of 1 in 1 (and % of 1) | # of 2 in 1 (and % of 1) | # of 3 in 1 (and % of 1) | Total 1
Column Total | Total 1 | Total 2 | Total 3 | Sub-total

For each table, present an outline of the analysis.

4) Using the cross contingency table, perform a chi-square test of the dependence between each categorical variable and the credit status of the borrower. Take note of the variables that have a statistically significant dependence with the response variable.

5) For the continuous variables, present the necessary descriptive statistics. Make sure to standardize the continuous variables before moving to estimation. Also compute a correlation matrix of all the variables (categorical and continuous), which gives a good idea of the dependence structure of the dataset.
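The cross tabulation and chi-square test in tasks 3) and 4) could be sketched as follows in Python. This is an illustration only: the function names and the toy data are assumptions, `pandas.crosstab` builds the table, and `scipy.stats.chi2_contingency` runs the Pearson test of independence on the counts without the margins.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def cross_table(df, predictor, target):
    """Counts with row/column totals, plus each cell as a % of its status row."""
    counts = pd.crosstab(df[target], df[predictor],
                         margins=True, margins_name="Total")
    row_pct = pd.crosstab(df[target], df[predictor], normalize="index") * 100
    return counts, row_pct

def chi_square_test(df, predictor, target):
    """Pearson chi-square test of independence (margins excluded)."""
    observed = pd.crosstab(df[target], df[predictor])
    stat, p_value, dof, _expected = chi2_contingency(observed, correction=False)
    return stat, p_value, dof

# Made-up data for illustration: 100 borrowers, status coded 1 (good) / 2 (bad).
toy = pd.DataFrame({
    "housing": ["rent"] * 30 + ["own"] * 50 + ["for free"] * 20,
    "status": [1] * 10 + [2] * 20 + [1] * 40 + [2] * 10 + [1] * 10 + [2] * 10,
})

counts, row_pct = cross_table(toy, "housing", "status")
stat, p_value, dof = chi_square_test(toy, "housing", "status")
print(counts)
print(row_pct.round(1))
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p_value:.5f}")
```

Looping `chi_square_test` over every categorical column and keeping those with a p-value below the chosen significance level gives the candidate list for the estimation step.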

Section III – Estimation

Before moving to the estimation phase, note that the full dataset should not be used for estimation. Conduct a 70:30 holdout validation: randomly sample 70% of the data as the training set and keep the remaining 30% as the test set.
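One possible way to draw the 70:30 split is to shuffle the row indices once with a fixed seed, as in the Python sketch below. The function name, the seed, and the row count of 1000 are assumptions made for the example.

```python
import numpy as np

def split_70_30(n_rows, train_frac=0.7, seed=0):
    """Randomly partition row indices into training and test sets."""
    rng = np.random.default_rng(seed)      # fixed seed keeps the split reproducible
    order = rng.permutation(n_rows)        # random ordering of 0..n_rows-1
    n_train = int(round(train_frac * n_rows))
    return np.sort(order[:n_train]), np.sort(order[n_train:])

# Hypothetical row count, for illustration only.
train_idx, test_idx = split_70_30(1000)
print(len(train_idx), len(test_idx))
```

The same index arrays can then be used to slice both the predictors and the response, so every model below is trained and tested on identical rows.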

1) Start by estimating a logistic regression using all numeric variables and all the categorical predictors found significant by the chi-square tests. At every iteration, remove the insignificant variables and re-estimate until all remaining variables are significant at the 5% level. Briefly explain the final chosen variables and the signs of their coefficients.

a. For the logistic regression, you must build your own function, which includes constructing the functions for the logistic distribution, the log likelihood and the optimization process. Refer to Appendix A3 of the textbook to build your own function for Newton's method, since you need the Hessian matrix to calculate the relevant regression statistics.

b. Plot the ROC curve for the in-sample prediction. Use built-in packages or functions to extract the ROC curve. Perform a Kolmogorov-Smirnov (KS) test on the true positive rates and false positive rates at each possible cutoff value. Pick the appropriate cutoff value based on the KS test, and use this cutoff value for out-of-sample prediction. (This will be important in Part IV for calculating the Brier Score and conducting the HL test.)
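To make steps a. and b. concrete, here is a hedged Python sketch of a generic Newton-Raphson logistic regression with a KS-based cutoff. It is not the Appendix A3 routine, which is not reproduced here. It assumes the response has been recoded to 0/1 with 1 as the modelled event, and that `X` already contains a column of ones. All function names and the simulated data are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic distribution function."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    """Bernoulli log likelihood of the logit model."""
    p = sigmoid(X @ beta)
    eps = 1e-12                                   # guards against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def hessian_at(beta, X):
    """Hessian of the log likelihood: -X' W X with W = diag(p(1-p))."""
    p = sigmoid(X @ beta)
    return -(X * (p * (1 - p))[:, None]).T @ X

def fit_logit_newton(X, y, tol=1e-8, max_iter=50):
    """Maximise the log likelihood with Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        gradient = X.T @ (y - sigmoid(X @ beta))
        step = np.linalg.solve(hessian_at(beta, X), gradient)
        beta = beta - step                        # Newton update
        if np.max(np.abs(step)) < tol:
            break
    # Standard errors from the inverse of the negative Hessian.
    std_err = np.sqrt(np.diag(np.linalg.inv(-hessian_at(beta, X))))
    return beta, std_err

def ks_cutoff(y, scores):
    """Cutoff that maximises TPR - FPR (the KS statistic)."""
    best_ks, best_cut = -1.0, None
    for cut in np.unique(scores):
        pred = scores >= cut
        tpr = np.mean(pred[y == 1])               # true positive rate
        fpr = np.mean(pred[y == 0])               # false positive rate
        if tpr - fpr > best_ks:
            best_ks, best_cut = tpr - fpr, cut
    return best_cut, best_ks

# Simulated data for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
X = np.column_stack([np.ones_like(x), x])
true_beta = np.array([-0.5, 1.5])
y = (rng.random(2000) < sigmoid(X @ true_beta)).astype(float)

beta, std_err = fit_logit_newton(X, y)
cutoff, ks = ks_cutoff(y, sigmoid(X @ beta))
print("beta:", beta, "z-stats:", beta / std_err)
print("log likelihood:", log_likelihood(beta, X, y))
print("KS cutoff:", cutoff, "KS statistic:", ks)
```

The z-statistics `beta / std_err` are what drive the drop-and-re-estimate loop in task 1). The cutoff returned by `ks_cutoff` on the training scores is the one to carry over to the test set.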

2) Estimate a stepwise logistic regression model. Present your output results. Briefly explain the signs of the coefficients and their significance.

a. Refer to the slides for an idea of how to approach the algorithm for the stepwise functions. Feel free to implement only one method, either backward or forward.

b. Perform the KS test as before to obtain the optimal cutoff value. Use this cutoff value for out-of-sample prediction.

3) Estimate a decision tree on the given dataset. Present your results.

a. For this question, feel free to use libraries or built-in functions for estimation.

b. Apply cost complexity pruning to the large tree obtained in order to obtain a sequence of sub-trees. Conduct a K-fold cross validation to choose among them.
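Since library functions are allowed for the tree, one way to approach pruning is with scikit-learn, as sketched below. The sketch assumes scikit-learn is the chosen library. The function name `prune_with_cv`, the choice of `k=5`, accuracy as the selection metric, and the simulated data are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def prune_with_cv(X, y, k=5, seed=0):
    """Pick the cost-complexity alpha with the best K-fold CV accuracy."""
    # The pruning path of the fully grown tree gives one alpha per sub-tree.
    path = DecisionTreeClassifier(random_state=seed).cost_complexity_pruning_path(X, y)
    alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))  # guard tiny negatives
    if len(alphas) > 1:
        alphas = alphas[:-1]          # the largest alpha leaves only the root node
    # With an integer cv and a classifier, scikit-learn uses stratified K-fold.
    cv_scores = [
        cross_val_score(DecisionTreeClassifier(random_state=seed, ccp_alpha=a),
                        X, y, cv=k).mean()
        for a in alphas
    ]
    best_alpha = float(alphas[int(np.argmax(cv_scores))])
    best_tree = DecisionTreeClassifier(random_state=seed, ccp_alpha=best_alpha).fit(X, y)
    return best_tree, best_alpha

# Simulated data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

best_tree, best_alpha = prune_with_cv(X, y)
print("chosen alpha:", best_alpha, "leaves:", best_tree.get_n_leaves())
```

In the assignment setting the function would be called on the 70% training set only, with the pruned tree then scored on the held-out 30%.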
