● Margin sizes (all four!) between 2 and 2.54 cm (strict!)
Further, specific instructions are given below.
Have fun and good luck!
Project Overview
For this project you are going to work on a Kaggle competition. “Kaggle is a platform for data prediction competitions. Companies, organizations and researchers post their data and have it scrutinized by the world's best statisticians and machine learning experts”, i.e., you! You will be working on the competition “Titanic: Machine Learning from Disaster”. This challenge provides a great starting point for those of you without any experience in applied machine learning. The data is highly structured and there are tutorials provided on the Kaggle site to guide you through several different approaches.
The following is Kaggle’s background to the challenge:
“The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.”
For this project you will analyze the Titanic dataset with regard to predicting what categories of passengers were likely to survive the sinking of the ocean liner. You will be using machine learning methods to predict, as accurately as possible and from general information about the passengers, which passengers survived the tragedy.
Datasets
Sign up at kaggle.com for a free account. The competition can be found here:
● https://www.kaggle.com/c/titanic
The data can be found here:
● https://www.kaggle.com/c/titanic/data
The data consists of approx. 1,300 records (divided into training and test subsets) for Titanic passengers. Each record consists of the following 11 attributes:
Attribute | Description | Possible values
Survival | Survival | (0 = No; 1 = Yes)
Pclass | Passenger Class | (1 = 1st; 2 = 2nd; 3 = 3rd)
Name | Name | [string]
Sex | Sex | (male; female)
Age | Age | [integer]
Sibsp | Number of Siblings/Spouses Aboard | [integer]
Parch | Number of Parents/Children Aboard | [integer]
Ticket | Ticket Number | [string]
Fare | Passenger Fare | [float]
Cabin | Cabin | [string]
Embarked | Port of Embarkation | (C = Cherbourg; Q = Queenstown; S = Southampton)
Note that the dataset contains missing values.
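As a rough illustration of how missing values might be inspected and filled, the sketch below uses pandas on a handful of made-up rows. The rows, the choice of columns, and the imputation rules (median for Age, most frequent value for Embarked) are assumptions for illustration only. They are not a prescribed treatment, and you may well prefer a different strategy.

```python
import pandas as pd

# Made-up rows that only mimic the column layout described above;
# they are NOT taken from the Kaggle files.
toy = pd.DataFrame({
    "Pclass": [1, 3, 3, 2, 1, 3],
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Age": [29.0, None, 4.0, None, 54.0, 22.0],
    "Fare": [80.0, 7.25, 16.7, 13.0, 51.9, 7.9],
    "Embarked": ["C", "S", None, "S", "S", "Q"],
})

# Count the missing entries per column before deciding how to treat them.
missing_per_column = toy.isna().sum()

# One simple choice among many: median for the numeric column,
# most frequent value for the categorical column.
cleaned = toy.copy()
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])

print(missing_per_column)
print(cleaned)
```

Whatever rule you pick, it is worth stating it in your report, since the treatment of missing values can change the results.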
Experiments
Develop a complete analysis pipeline in a programming language / environment of your choice. Note that Kaggle provides substantial support for this and you could even run the experiments on the Kaggle site. Demonstrators will provide support for Python (or other languages such as MATLAB) implementations. However, it is strongly recommended that you attempt the challenge using Python. Relevant Python resources are listed at the end of this document.
The objective of the analysis is to predict survivors of the Titanic disaster from the given data as accurately as possible.
There are no limitations with regard to the modelling approach, that is, you are free to explore (and report) as many methods, and their results, as you wish. The minimum requirement, though, is that you analyze the dataset with a random forest classifier.
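To give a feel for what the minimum requirement could look like in Python, here is a minimal sketch using scikit-learn's RandomForestClassifier. The rows are made up, the label column is called `Survived` here (an assumption about the CSV header; the table above lists it as Survival), and the feature selection, the `IsFemale` encoding, and the hyperparameters are illustrative choices, not recommendations.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up rows in the column layout of the brief (not real passenger data).
toy = pd.DataFrame({
    "Pclass": [1, 3, 3, 2, 1, 3, 2, 3],
    "Sex": ["female", "male", "female", "male", "male", "female", "female", "male"],
    "Age": [29, 35, 4, 40, 54, 22, 31, 19],
    "Fare": [80.0, 7.25, 16.7, 13.0, 51.9, 7.9, 26.0, 8.05],
    "Survived": [1, 0, 1, 0, 0, 1, 1, 0],  # assumed label column name
})

# scikit-learn's forests expect numeric inputs, so encode Sex as 0/1.
features = toy[["Pclass", "Age", "Fare"]].copy()
features["IsFemale"] = (toy["Sex"] == "female").astype(int)
labels = toy["Survived"]

# Fixed random_state so repeated runs give the same forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(features, labels)

# Predictions on the training rows only show that the API works;
# they say nothing about how well the model generalises.
predictions = forest.predict(features)
importances = dict(zip(features.columns, forest.feature_importances_))
print(predictions)
print(importances)
```

The `feature_importances_` values can be a useful starting point for the discussion of which passenger attributes the model relies on.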
You will report the results of your experiments as prediction performance, that is, classification accuracy with regard to correctly predicted survival.
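For reference, classification accuracy is simply the fraction of passengers whose survival label is predicted correctly. The sketch below computes it in plain Python, together with the four confusion counts, on hypothetical label lists. The function names and the example labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of passengers whose survival label was predicted correctly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Counts of (true label, predicted label) pairs for 0/1 labels."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for t, p in zip(y_true, y_pred):
        counts[(t, p)] += 1
    return counts


# Hypothetical labels: 1 = survived, 0 = did not survive.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

print(accuracy(y_true, y_pred))          # 6 of 8 correct -> 0.75
print(confusion_counts(y_true, y_pred))
```

Reporting the confusion counts next to the accuracy shows whether errors fall mainly on survivors or on non-survivors, which a single accuracy figure hides.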
Note that you are NOT required to submit your solution to the official Kaggle competition.
Report
Your report will need to provide documentation about your analysis experiments.
In a brief introduction you will set the scene by describing the problem area. You will need to provide an overview of the dataset that you are analysing through an appropriate visualisation of the data. This will also help you explore the dataset during your experiments.
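One possible starting point for such an overview is the survival rate per passenger group, which can then be drawn as a bar chart. The sketch below computes these rates with pandas on made-up rows. The label column name `Survived` and the grouping by Sex and Pclass are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows (not real passenger data); Survived is the assumed label column.
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male", "female", "female", "male"],
    "Pclass": [1, 3, 3, 2, 1, 3, 2, 3],
    "Survived": [1, 0, 1, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the share of survivors within each group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

If matplotlib is installed, `rate_by_sex.plot(kind="bar")` turns either result into a bar chart that can go straight into the report.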
Following the introduction, you will describe the methods you have used. At the very least you will need to provide an explanation of the random forest classifier that you will implement as a benchmark. You are encouraged to explore different classifiers. Summarise every method you have used for your experiments in the methods section. Results should be reported using appropriate evaluation measures, on the test set. Remember to only use the provided training dataset for model estimation.
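The separation between model estimation and evaluation can be sketched as follows, assuming scikit-learn. The data here is synthetic (generated with a fixed seed, not Titanic records), and the hold-out split stands in for a labelled test set. This is an assumption worth checking against the files you download, since a test file without survival labels cannot be scored locally. Wrapping the imputer and the forest in one pipeline means the imputation median is learned from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: 200 invented "passengers".
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "IsFemale": rng.integers(0, 2, size=n),
    "Age": rng.uniform(1, 70, size=n),
})
# Invented rule: the label follows IsFemale, with about 10% of labels flipped.
is_female = X["IsFemale"].to_numpy()
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - is_female, is_female)
# Knock out roughly 20% of the ages to mimic missing values.
X.loc[rng.random(n) < 0.2, "Age"] = np.nan

# Hold out a quarter of the labelled rows; the model never sees them in fit().
X_train, X_held_out, y_train, y_held_out = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The imputer's median is estimated inside fit(), i.e. from X_train only.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X_train, y_train)

held_out_accuracy = accuracy_score(y_held_out, model.predict(X_held_out))
print(f"held-out accuracy: {held_out_accuracy:.3f}")
```

The same pattern carries over to cross-validation (for example with `sklearn.model_selection.cross_val_score`), which gives a less noisy estimate on a dataset of this size.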
In the discussion section you will reflect upon your findings and contextualise them with the original task, thereby linking back to the real-world problem (e.g., would you have embarked on the Titanic if you were part of passenger category X?).
Submission
The following needs to be submitted by 16:00 on 16th December 2017 (through NESS):
o A PDF document containing your report (see format and structure specifications above).
o A zipped folder containing the code for your analysis in a runnable format. This should allow us to run your code and verify your results. The code itself will not be marked.
Marking
As a guide your grade will be broken down as follows:
● Introduction – 10%
● Methods – 25%
● Results – 20%
● Discussion – 25%
● Conclusion – 10%
● Quality of writing – 10%
Additional Information
● Kaggle Titanic challenge: https://www.kaggle.com/c/titanic
Software
● IPython notebook comes highly recommended. To install it on your own computer, these links may be useful to you:
o Anaconda Python distribution: https://www.continuum.io/********** (redacted)
o Once Anaconda has been installed, Python can be run in notebooks (IPython, interactive Python) using Jupyter.
Jupyter install: https://jupyter.readthedocs.org/en/latest/install.html