CS Case Study in Python – Machine Learning
2017-10-14


COMP 2019 project 2 – Machine Learning

Please submit your solution via LEARNONLINE. Submission instructions are given at the end of this project.

This assessment is worth 20% of the total marks. This assessment consists of 6 questions.

In this project you will aim to predict if it will rain on each day given weather observations from the preceding day. You will perform a number of machine learning tasks, including training a classifier, assessing its output, and optimising its performance. You will document your findings in a written report. Write concise explanations; approximately one paragraph per task will be sufficient.

Download the data file for this project from the course website (file weather.zip). The archive contains the data file in CSV format and some Python code that you may use to visualise a decision tree model.

Before starting this project, ensure that you have a good understanding of the Python programming language, the Jupyter Python notebook environment, and an overall understanding of machine learning training and evaluation methods using the scikit-learn python library (Practical 3). You will need a working Python 3.x system with the Jupyter Notebook environment and the ‘sklearn’ package installed.

Documentation that you may find useful:


· Python: https://www.python.org/doc/

· Jupyter: https://jupyter-notebook.readthedocs.io/en/stable/

· Scikit-learn: http://scikit-learn.org/stable/

· Numpy: https://docs.scipy.org/doc/



Preparation
Create a Jupyter notebook and load the data. Use


import numpy as np

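# skiprows=1 skips the header row; the built-in int replaces np.int, which was removed in NumPy 1.24
data = np.loadtxt('weather.csv', skiprows=1, delimiter=',', dtype=int)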


to load the data. Type this code into the notebook rather than pasting it: copying from a formatted document can introduce typographic quotation marks, which cause syntax errors. (Students familiar with the Pandas library may use that to load and explore the data instead.)

Familiarise yourself with the data. There are 44 columns and 2716 rows. All values are binary (0/1) where 0 indicates false and 1 indicates true.

Categorical variables were encoded using “One Hot” coding, where a separate column is used to indicate the presence or absence of each possible value of the variable. For example, the three binary-valued columns “MinTemp_Low”, “MinTemp_Moderate”, “MinTemp_High” correspond to the three possible values “Low”, “Moderate”, and “High” of variable “MinTemp”. A 1 in column “MinTemp_Low” means that the value of MinTemp was “Low”; the cells for the other two values must be 0 in this case.
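
The encoding described above can be illustrated with a minimal sketch. The observations below are hypothetical toy values, not taken from weather.csv; the point is only that each row of a one-hot group contains exactly one 1.

```python
import numpy as np

# Hypothetical toy observations of the categorical variable MinTemp.
levels = ["Low", "Moderate", "High"]
min_temp = ["Low", "High", "Moderate", "Low"]

# One column per level: 1 where the observation equals that level, else 0.
one_hot = np.array([[int(value == level) for level in levels]
                    for value in min_temp])

# A valid one-hot group has exactly one 1 in every row.
rows_valid = one_hot.sum(axis=1) == 1

print(one_hot)
print(rows_valid.all())
```

The same row-sum check could be applied to each group of three columns in the loaded array to confirm that the file is encoded as described.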

Explore the distribution of data in each column.
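
One way to look at the distribution of 0/1 columns is to compute the share of ones in each column. The sketch below uses a small random array as a stand-in for the loaded data (an assumption made so that the example runs without weather.csv).

```python
import numpy as np

# Synthetic stand-in for the loaded array; the real one comes from weather.csv.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(100, 5))

# For 0/1 columns, the column mean is the fraction of rows equal to 1.
fraction_of_ones = data.mean(axis=0)
count_of_ones = data.sum(axis=0)

for col, (frac, count) in enumerate(zip(fraction_of_ones, count_of_ones)):
    print(f"column {col}: {count} ones ({frac:.1%})")
```

A column whose fraction is far from 0.5 is imbalanced, which matters most for the target column when choosing an evaluation measure.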


The last column contains the prediction target (RainTomorrow). The meaning of the columns is as follows:

· MinTemp_{Low,Moderate,High}: 1 if the minimum temperature on the day was low/moderate/high

· MaxTemp_{Low,Moderate,High}: 1 if the maximum temperature on the day was low/moderate/high

· Evaporation_{Low,Moderate,High}: 1 if the measured evaporation on the day was low/moderate/high

· Sunshine_{Low,Moderate,High}: 1 if the total duration of sunshine on the day was low/moderate/high

· WindSpeed9am_{Low,Moderate,High}: 1 if the measured wind speed at 9am on the day was low/moderate/high

· WindSpeed3pm_{Low,Moderate,High}: 1 if the measured wind speed at 3pm on the day was low/moderate/high

· Humidity9am_{Low,Moderate,High}: 1 if the humidity at 9am on the day was low/moderate/high

· Humidity3pm_{Low,Moderate,High}: 1 if the humidity at 3pm on the day was low/moderate/high

· Pressure9am_{Low,Moderate,High}: 1 if the barometric pressure at 9am on the day was low/moderate/high

· Pressure3pm_{Low,Moderate,High}: 1 if the barometric pressure at 3pm on the day was low/moderate/high

· Cloud9am_{Low,Moderate,High}: 1 if the cloud cover at 9am on the day was low/moderate/high

· Cloud3pm_{Low,Moderate,High}: 1 if the cloud cover at 3pm on the day was low/moderate/high

· Temp9am_{Low,Moderate,High}: 1 if the temperature at 9am on the day was low/moderate/high

· Temp3pm_{Low,Moderate,High}: 1 if the temperature at 3pm on the day was low/moderate/high

· RainToday: 1 if it rained on the day

· RainTomorrow: 1 if it rained on the following day. This is the target we wish to predict.
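
The column names listed above can be generated programmatically, which is convenient when a list of feature labels is needed. This sketch assumes that the columns in the CSV file appear in the same order as in the list above; check the header row of the file before relying on it.

```python
# Variable names as listed in the column description.
variables = [
    "MinTemp", "MaxTemp", "Evaporation", "Sunshine",
    "WindSpeed9am", "WindSpeed3pm", "Humidity9am", "Humidity3pm",
    "Pressure9am", "Pressure3pm", "Cloud9am", "Cloud3pm",
    "Temp9am", "Temp3pm",
]
levels = ["Low", "Moderate", "High"]

# 14 variables x 3 levels = 42 one-hot columns, plus RainToday = 43 features.
# Assumption: this order matches the column order in weather.csv.
flabels = [f"{var}_{lvl}" for var in variables for lvl in levels] + ["RainToday"]
target_label = "RainTomorrow"

print(len(flabels) + 1)  # features plus target: 44
```

With the array loaded as data, data[:, :-1] would then hold the features and data[:, -1] the target, since the target is the last column.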



Question 1: Baseline
A simple model for predicting rain tomorrow is to use today’s weather (RainToday) as an indicator of tomorrow’s weather (RainTomorrow).

What performance can we expect from this simple model?


Choose an appropriate measure to evaluate the classifier. Select among Accuracy, F1-measure, Precision, and Recall.

Use a confusion matrix and/or classification report to support your analysis.
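
To make the four candidate measures concrete, the following sketch computes them by hand from the cells of a confusion matrix. The labels are hypothetical toy values, not results from the weather data. In scikit-learn, confusion_matrix and classification_report in sklearn.metrics report the same quantities.

```python
# Hypothetical toy labels: 1 = rain, 0 = no rain.
rain_today = [1, 0, 0, 1, 0, 0, 1, 0]
rain_tomorrow = [1, 0, 1, 0, 0, 0, 1, 0]

# Baseline: predict that tomorrow repeats today.
predicted = rain_today

# Confusion-matrix cells, with "rain" as the positive class.
tp = sum(a == 1 and p == 1 for a, p in zip(rain_tomorrow, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(rain_tomorrow, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(rain_tomorrow, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(rain_tomorrow, predicted))

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the days predicted rainy, how many were
recall = tp / (tp + fn)      # of the rainy days, how many were predicted
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F1={f1:.3f}")
```

Accuracy counts the true negatives, whereas precision, recall, and F1 do not; on a data set where dry days dominate, that difference can change which measure is informative.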


Question 2: Naïve Bayes
Train a Naïve Bayes classifier to predict RainTomorrow.


As all attributes are binary-valued, use the BernoulliNB classifier provided by scikit-learn. Ensure that you follow correct training and evaluation procedures.

1. Assess how well the classifier performs on the prediction task.

2. What performance can we expect from the trained model if we used next month’s data as input?
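
A minimal sketch of one possible training and evaluation procedure follows. It uses synthetic binary data in place of weather.csv, and the split ratio, the number of folds, and the F1 scoring are illustrative assumptions rather than requirements of the project. The idea is to hold out a test set, estimate performance by cross-validation on the training portion, and report test-set results only once at the end.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary data standing in for weather.csv (assumption).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 6))
# Target depends on the first two features, with about 10% of labels flipped.
noise = (rng.random(400) < 0.1).astype(int)
y = (X[:, 0] & X[:, 1]) ^ noise

# Hold out a stratified test set that is not used during model development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = BernoulliNB()

# Cross-validation on the training portion estimates performance on unseen
# data without touching the held-out test set.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
print(f"CV F1: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Final fit on all training data, then a single evaluation on the test set.
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

The spread of the cross-validation scores gives a rough indication of how much the estimate might vary on new data.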


Question 3: Decision Tree
Train a DecisionTreeClassifier to predict RainTomorrow. Pass the argument class_weight='balanced' when constructing the classifier, as the two classes of the target variable RainTomorrow are not equally frequent in the data set.

Ensure that you follow correct training and evaluation procedures.


1. Assess how well the classifier performs on the prediction task.

2. What performance can we expect from the model on new data?
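
The sketch below shows one way to compare a decision tree's performance on the data it was trained on with its performance on held-out data. The data are synthetic and the F1 measure is an illustrative choice (both assumptions); only the class_weight='balanced' argument comes from the task description.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for weather.csv (assumption).
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, 10))
noise = (rng.random(500) < 0.2).astype(int)
y = (X[:, 0] & X[:, 1]) ^ noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# class_weight='balanced' reweights samples inversely to class frequency.
tree = DecisionTreeClassifier(class_weight="balanced", random_state=1)
tree.fit(X_train, y_train)

# Scores on the training data are optimistic; the held-out score is the
# more realistic estimate of performance on new data.
train_f1 = f1_score(y_train, tree.predict(X_train))
test_f1 = f1_score(y_test, tree.predict(X_test))
print(f"depth={tree.get_depth()} train F1={train_f1:.3f} test F1={test_f1:.3f}")
```

A large gap between the two scores is the kind of evidence that a diagnosis of overfitting would rest on.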


If you wish to visualise the decision tree, you can use the function print_dt from the file dtutils.py, which is included in the Project 2 zip archive:

import dtutils

dtutils.print_dt(tree, feature_names=flabels)


where tree refers to the trained decision tree model, and flabels is a list of feature names (columns) in the data.
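
If dtutils.py is not at hand, scikit-learn's own export_text function (in sklearn.tree) prints a comparable text view of a fitted tree. This is a swapped-in alternative, not the project's print_dt, and the data and feature names below are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny synthetic example in which the target equals the first feature.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)
y = X[:, 0]
flabels = ["RainToday", "Humidity3pm_High"]  # illustrative names only

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the split rules as an indented text outline.
text = export_text(tree, feature_names=flabels)
print(text)
```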

Question 4: Diagnosis
Does the Decision Tree model suffer from overfitting or underfitting? Justify why/why not.


If the model exhibits overfitting or underfitting, revise your training procedure to remedy the problem, and re-evaluate the improved model. The DecisionTreeClassifier has a number of parameters that you can consider for tuning the model:

· max_depth: maximum depth of the tree

· min_samples_leaf: minimum number of samples in each leaf node

· max_leaf_nodes: maximum number of leaf nodes
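
The three parameters above can be tuned together with a cross-validated grid search. The following is a sketch under stated assumptions: the data are synthetic, and the candidate values, the number of folds, and the F1 scoring are illustrative choices, not recommended settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for weather.csv (assumption).
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(500, 10))
noise = (rng.random(500) < 0.2).astype(int)
y = (X[:, 0] & X[:, 1]) ^ noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=2)

# Candidate values are illustrative; None means "no limit".
param_grid = {
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 5, 20],
    "max_leaf_nodes": [4, 16, None],
}

# Each combination is scored by 5-fold cross-validation on the training set.
search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=2),
    param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(f"cross-validated F1: {search.best_score_:.3f}")
# The test set is used once, after the parameters have been chosen.
print(f"held-out F1: {search.score(X_test, y_test):.3f}")
```

Selecting parameters on the training folds and keeping the test set for a final check avoids tuning the model to the test data.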



Question 5: Recommendation
Which of the models you trained should be selected for the prediction task? Assume that all errors made are equally severe. That is, predicting rain if there is actually no rain is just as bad as predicting no rain if it actually rains.

Does your answer change if predicting rain for a day without rain is a negligible error? Justify why/why not.
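
The effect of unequal error costs can be made explicit by weighting false positives and false negatives separately. The error counts below are hypothetical and serve only to show that the preferred model can change with the cost assumption.

```python
def total_cost(fp, fn, fp_cost=1.0, fn_cost=1.0):
    """Weighted error count from false-positive and false-negative counts."""
    return fp * fp_cost + fn * fn_cost


# Hypothetical error counts for two models on the same test set.
errors = {
    "A": {"fp": 30, "fn": 10},  # predicts rain liberally
    "B": {"fp": 8, "fn": 25},   # predicts rain conservatively
}

# Case 1: both kinds of error are equally severe.
equal_cost = {name: total_cost(**e) for name, e in errors.items()}

# Case 2: predicting rain on a dry day is negligible, so it costs nothing.
missed_rain_only = {name: total_cost(**e, fp_cost=0.0)
                    for name, e in errors.items()}

best_equal = min(equal_cost, key=equal_cost.get)
best_missed = min(missed_rain_only, key=missed_rain_only.get)
print(equal_cost, "->", best_equal)
print(missed_rain_only, "->", best_missed)
```

When false positives cost nothing, only missed rain days count, which corresponds to favouring recall on the rain class.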


Question 6: Report
Write a concise report showing your analysis for Questions 1–5.


Demonstrate that you have followed appropriate training and evaluation procedures, and justify your conclusions with relevant evidence from the evaluation output.

Where there are alternatives (e.g. measures, procedures, models, conclusions), demonstrate that you have considered all relevant alternatives and justify why the selected alternative is appropriate.

Do not include the Python code in your report.



Submission Instructions
Submit a single zip archive containing the following:


· weather.ipynb: the Jupyter Notebook file.

· weather.html: the HTML version of weather.ipynb showing the notebook including all output. Create this by selecting File>Download as>HTML after having run all cells in the Jupyter notebook.

· report.pdf: the report as specified in Question 6.



Marking Scheme
· Q1: Baseline (10 marks). Appropriate measure selected and justified. Correct evaluation.

· Q2: Naïve Bayes (20 marks). Correct training procedure applied. Correct evaluation procedure applied. Correct conclusion.

· Q3: Decision Tree (15 marks). Correct training procedure applied. Correct evaluation procedure applied. Correct conclusion.

· Q4: Diagnosis (30 marks). Correct diagnosis. Correct revised training and evaluation procedure applied.

· Q5: Recommendation (15 marks). Correct recommendations. Recommendations justified by evaluation results.

· Q6: Report (10 marks). Well-structured report. Professional presentation.

· Jupyter notebook (deductions apply). Executes correctly when using Run All. Copy saved as HTML format submitted. Matches the contents of the report.
