Naive Bayes classifier
2023-01-03

Overview
In this project, you will implement a Naive Bayes classifier, apply it to several classification data sets, and explore evaluation paradigms as well as the impact of individual features. You will then answer some conceptual questions about the Naive Bayes classifier, based on your observations.

Naive Bayes classifiers

The “Naive Bayes” lecture included some suggestions for implementing your learner. You should implement your Naive Bayes classifier from scratch with an epsilon smoothing strategy (i.e., do not use existing implementations or learning algorithms from libraries like sklearn). Otherwise, you may decide on the specifics of your implementation and may use libraries to help you with data processing, visualization, evaluation, or mathematical operations. For marking purposes, a minimal submission will include the following functions:
• preprocess(), which opens the data file and converts it into a usable format. [0.5 marks]
• train(), where you calculate statistics from the training data to build a Naive Bayes (NB) model. [3 marks]
• predict(), where you use the model from train() to predict a class (or class distribution) for the test data. [1.5 marks]
• evaluate(), where you output your evaluation metric(s). [1 mark]
• main(), where you call the above functions in order to train and test the Naive Bayes classifier on the full data sets provided (i.e., no train-test splitting). [1 mark]
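
As an illustration of the last step before main(), here is a minimal sketch of evaluate(). It assumes accuracy as the metric and two equal-length lists of labels as input; both the metric and the signature are assumptions made for this example, since the specification leaves them open.

```python
def evaluate(predicted, actual):
    """Return accuracy: the fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot evaluate an empty set of labels")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels, purely for illustration.
accuracy = evaluate(["yes", "no", "no", "yes"], ["yes", "no", "yes", "yes"])
print(f"accuracy = {accuracy:.2f}")  # prints "accuracy = 0.75"
```

Other metrics (for example per-class precision and recall) could be returned from the same function if the analysis calls for them.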

The project materials include an IPython notebook, 2022S2-a2.ipynb, that summarises these functions; you should use it as a template. You may define the function inputs and outputs to suit your needs, and you may write other helper functions as required. Please place the Jupyter notebook in the same folder as the input data.

Data Sets
This project includes adapted versions of three data sets from the UCI machine learning repository: Bank Marketing, Obesity, and Student.
These data sets vary in the number of instances, the number of attributes, and the number of distinct class labels. For the purpose of this project, you should assume that all features are nominal. Each data set is provided in .csv format with one instance per line. The first column (ID) contains a unique instance identifier. The last column (Label) specifies the class label. All other columns specify the data-set-specific features. We briefly describe each data set below. The README provided as part of this project lists and explains all features and labels (note that not all original instances or features are necessarily included in our data sets).
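
The file layout described above (ID first, Label last, nominal features in between) can be read with a short routine such as the one below. This is only a sketch: it assumes the first line of each file is a header row, takes an already opened file-like object rather than a path, and uses a tiny made-up table in place of the real data.

```python
import csv
import io


def preprocess(source):
    """Split CSV rows into IDs, feature values, and labels.

    Every value is kept as a string, so all features are treated as nominal.
    Assumes a header row, the ID in the first column, and the label in the last.
    """
    reader = csv.reader(source)
    header = next(reader)
    ids, features, labels = [], [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        ids.append(row[0])
        features.append(row[1:-1])
        labels.append(row[-1])
    return header[1:-1], ids, features, labels


# Made-up rows, purely for illustration; the column names are hypothetical.
sample = io.StringIO("ID,job,housing,Label\n1,admin,yes,no\n2,student,no,yes\n")
feature_names, ids, X, y = preprocess(sample)
print(feature_names, X, y)
```

With a real file, the same function could be called inside `with open(path, newline="") as f:`, where `path` points at one of the provided .csv files.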
Bank Marketing: You predict whether a client will subscribe to a term deposit based on a number of personal and financial features such as job, education level, housing loan, etc.
Obesity: You predict whether a patient is obese or not based on various personal and habitual attributes such as alcohol consumption, exercise level, gender, etc.
Student: You predict a student’s final grade {A+, A, B, C, D, F} based on a number of personal and performance-related attributes, such as school, parents’ education level, number of absences, etc.
Your submission must automatically process every one of these data sets. For Questions 1–4, it is technically possible to answer each question by examining only two of the data sets. However, it is strongly recommended that you examine all of the data available, so that you reduce the likelihood of arriving at faulty conclusions due to a small sample space.
1 Implementation Tips
The “Naive Bayes” lecture included several tips on how to implement the classifier. At training time, you will need to fill in data structures that hold the prior class probabilities P(cj), as well as data structures that hold the parameters of the likelihood for each feature under each class, i.e., P(xi|cj).
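
One possible layout for these data structures is sketched below, using plain dictionaries keyed by class and by (feature index, class). The helper `lookup` and the default value of `eps` are hypothetical, and the sketch assumes that epsilon smoothing means substituting a small constant for any zero or unseen probability; the lecture's definition should take precedence if it differs.

```python
from collections import Counter, defaultdict


def train(X, y):
    """Estimate P(cj) and P(xi|cj) by relative frequency from nominal data."""
    n = len(y)
    class_counts = Counter(y)
    priors = {c: count / n for c, count in class_counts.items()}

    # value_counts[(i, c)][v] = number of class-c instances whose feature i equals v
    value_counts = defaultdict(Counter)
    for row, c in zip(X, y):
        for i, v in enumerate(row):
            value_counts[(i, c)][v] += 1

    likelihoods = {
        (i, c): {v: k / class_counts[c] for v, k in counter.items()}
        for (i, c), counter in value_counts.items()
    }
    return priors, likelihoods


def lookup(likelihoods, i, c, v, eps=1e-9):
    """Return P(xi = v | c), falling back to eps when the pair was never observed."""
    p = likelihoods.get((i, c), {}).get(v, 0.0)
    return p if p > 0 else eps


# Made-up training data, purely for illustration.
X = [["sunny", "hot"], ["sunny", "mild"], ["rainy", "mild"], ["rainy", "hot"]]
y = ["no", "no", "yes", "yes"]
priors, likelihoods = train(X, y)
print(priors)
print(lookup(likelihoods, 0, "no", "sunny"), lookup(likelihoods, 0, "no", "rainy"))
```

Storing only observed values keeps the tables small; the smoothing constant is applied at lookup time rather than written into every cell.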
At prediction time, you will combine the prior and the likelihood terms (one per feature) into a final prediction score for each class cj:
P(cj) × ∏i P(xi|cj)
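
A minimal sketch of this scoring step follows. It sums log-probabilities instead of multiplying raw probabilities, which is a design choice of this example rather than something the specification asks for: the logarithm is monotonic, so the highest-scoring class is unchanged, and long products of small numbers no longer underflow. The dictionary layout, the `eps` default, and the toy numbers are all assumptions.

```python
import math


def predict(row, priors, likelihoods, eps=1e-9):
    """Return the highest-scoring class and the log-score of every class.

    score(cj) = log P(cj) + sum_i log P(xi|cj), with eps substituted for
    any zero or unseen likelihood (assumed form of epsilon smoothing).
    """
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for i, v in enumerate(row):
            p = likelihoods.get((i, c), {}).get(v, 0.0)
            score += math.log(p if p > 0 else eps)
        scores[c] = score
    best = max(scores, key=scores.get)
    return best, scores


# Toy model with made-up probabilities, keyed by (feature index, class).
priors = {"no": 0.5, "yes": 0.5}
likelihoods = {
    (0, "no"): {"sunny": 1.0},
    (0, "yes"): {"rainy": 1.0},
    (1, "no"): {"hot": 0.5, "mild": 0.5},
    (1, "yes"): {"hot": 0.5, "mild": 0.5},
}
label, scores = predict(["sunny", "hot"], priors, likelihoods)
print(label)  # prints "no"
```

If a class distribution is needed rather than a single label, the log-scores can be exponentiated and normalised so that they sum to one.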
