[Advanced Machine Learning Course]
Your job includes:
1) Read the description of the task and download the data set
2) Implement an algorithm and output the prediction
The TA's job is to evaluate your work.
Here are some baselines for your consideration.
CAUTION: DO YOUR WORK ALONE!
The Story: The bank customer churn dataset is used to predict customer churn in the banking industry. It contains information on bank customers who either left the bank or remained customers. The dataset includes the following attributes:
Customer ID: A unique identifier for each customer
Surname: The customer's surname or last name
Credit Score: A numerical value representing the customer's credit score
Geography: The country where the customer resides (France, Spain or Germany)
Gender: The customer's gender (Male or Female)
Age: The customer's age
Tenure: The number of years the customer has been with the bank
Balance: The customer's account balance
NumOfProducts: The number of bank products the customer uses (e.g., savings account, credit card)
HasCrCard: Whether the customer has a credit card (1 = yes, 0 = no)
IsActiveMember: Whether the customer is an active member (1 = yes, 0 = no)
EstimatedSalary: The estimated salary of the customer
The task is to predict:
Exited: Whether the customer has churned (1 = yes, 0 = no)
Download:
train.csv: 9000 records plus 1 header line. The details of the training set are shown in Figure 1.
test.csv: 1000 records plus 1 header line
Figure 1. Training Set
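As a rough sketch of how the attributes above might be turned into numeric features, the snippet below reads rows with Python's standard csv module. The header names (e.g. CreditScore, CustomerId) are assumptions based on the attribute list and must be checked against the first line of train.csv. Leaving the customer ID and surname out of the features is a choice made for this sketch, not a requirement of the assignment.

```python
import csv
import io

# Assumed header names; verify against the first line of train.csv.
NUMERIC_COLUMNS = ["CreditScore", "Age", "Tenure", "Balance",
                   "NumOfProducts", "HasCrCard", "IsActiveMember",
                   "EstimatedSalary"]
COUNTRIES = ["France", "Spain", "Germany"]


def encode_row(row):
    """Turn one CSV row (a dict of strings) into a list of floats."""
    features = [float(row[name]) for name in NUMERIC_COLUMNS]
    # One-hot encode Geography over the three listed countries.
    features += [1.0 if row["Geography"] == c else 0.0 for c in COUNTRIES]
    # Encode Gender as a single 0/1 indicator (1 = Male).
    features.append(1.0 if row["Gender"] == "Male" else 0.0)
    return features


def load_rows(handle, has_label=True):
    """Read rows from an open CSV file; return (features, labels)."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        features.append(encode_row(row))
        if has_label:
            labels.append(int(row["Exited"]))
    return features, labels


# Tiny in-memory stand-in for train.csv (the values are made up).
sample = io.StringIO(
    "CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,"
    "NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited\n"
    "1,Doe,600,Spain,Female,40,3,1000.5,2,1,0,50000,1\n"
)
X, y = load_rows(sample)
```

For the real files, the same function could be called on open("train.csv", newline="") and, with has_label=False, on the test file.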
Task: Predict "Exited" for each record in the test dataset.
Implementation: You are free to implement any learning algorithm in any programming language. You are encouraged to analyze the difficulty of this task in detail, as this will help you choose an appropriate learning algorithm. Problem analysis and innovative ideas will help you earn a higher score.
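One possible starting point for the problem analysis is the class distribution of Exited in the training set, since an imbalanced target influences the choice of algorithm. A minimal sketch, assuming the labels have already been read into a list of 0/1 integers (the list below is made up for illustration):

```python
from collections import Counter


def class_distribution(labels):
    """Return {class: fraction of records} for a list of 0/1 labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}


# Made-up labels for illustration only; use the Exited column of train.csv.
example_labels = [0, 0, 0, 1, 0, 1, 0, 0]
distribution = class_distribution(example_labels)
```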
Output: The output of your learning algorithm should be a text file "yourId.txt" containing 1000 lines without a header, where each line is the prediction for the corresponding test example. Do not change the order of the test examples; otherwise you may get a very low score.
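The output format can be sketched as follows. This assumes that each line holds the predicted label 0 or 1 and that the predictions are already in the same order as the rows of test.csv; the helper name write_predictions is hypothetical.

```python
import os
import tempfile


def write_predictions(predictions, path, expected_count=1000):
    """Write one 0/1 prediction per line, with no header, in the given order."""
    if len(predictions) != expected_count:
        raise ValueError(
            f"expected {expected_count} predictions, got {len(predictions)}")
    if any(p not in (0, 1) for p in predictions):
        raise ValueError("predictions must be 0 or 1")
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for p in predictions:
            handle.write(f"{int(p)}\n")


# Example with placeholder predictions (all zeros), written to a temp folder.
out_path = os.path.join(tempfile.mkdtemp(), "yourId.txt")
write_predictions([0] * 1000, out_path)
```

Checking the line count before writing catches the most common mistake, a missing or extra row, before the file is submitted.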
Your report should include:
1) Your understanding and analysis of the problem;
2) The motivation for your algorithm and an introduction to its background;
3) Full technical details of your algorithm, including its pseudocode;
4) A description or analysis of the performance you obtained;
5) Conclusion and (optional) discussion
CAUTION: DO NOT PLAGIARIZE! OTHERWISE, YOU WILL BE PENALIZED!
Please use the MS Word template or the LaTeX template to write your report in Chinese with an English abstract. Note: convert your source file to a PDF file for submission.
Name your PDF file "report.pdf".
Your submission includes:
1) 'yourId.txt' file: containing 1000 lines of predictions;
2) 'report.pdf' file: your report;
3) the source files of your algorithm (do not submit the whole data set or the trained model to the FTP server)
Please check your submission carefully.
Note that "yourId" should be replaced with your ID, and the files must not be given any other names.
Pack all your files into a single compressed file (compress in ZIP format).
Name the compressed file using your student ID and version number, e.g., "AB12345678_v1.zip". We will take the file with the highest version number as your final homework, e.g., "AB12345678_v2.zip".
Please delete backup files (.bak) from your final ZIP file.
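The packaging rules above can be sketched with Python's standard zipfile module. The helper name build_submission and the source file name main.py are hypothetical; the example uses empty placeholder files in a temporary folder.

```python
import os
import tempfile
import zipfile


def build_submission(student_id, version, file_paths, out_dir):
    """Zip the files as '<student_id>_v<version>.zip', skipping .bak files."""
    zip_path = os.path.join(out_dir, f"{student_id}_v{version}.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in file_paths:
            if path.lower().endswith(".bak"):
                continue  # backup files must not be submitted
            # Store files flat, without their folder path.
            archive.write(path, arcname=os.path.basename(path))
    return zip_path


# Example with empty placeholder files in a temp folder.
work_dir = tempfile.mkdtemp()
names = ["AB12345678.txt", "report.pdf", "main.py", "main.py.bak"]
paths = []
for name in names:
    full = os.path.join(work_dir, name)
    open(full, "w").close()
    paths.append(full)
zip_path = build_submission("AB12345678", 1, paths, work_dir)
```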
Upload your compressed file to FTP:
switch path to: /D:/courses/AML24/assignment2
username: aml24
password: course01234!@#$
Evaluation of your prediction: Based on your "yourId.txt", we will use the macro F1 score to evaluate your prediction. For details on the macro F1 score, see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html.
Evaluation of your report: Novel ideas, sound techniques, and polished writing earn high scores. See also "The evaluation of your report" in Assignment 1.
Evaluation of your source code: Fake or plagiarized source code receives a low score.
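To make the prediction metric concrete, the macro F1 score is the unweighted mean of the per-class F1 scores, so the churned class (1) counts as much as the non-churned class (0) regardless of how many records each has. The pure-Python sketch below is only an illustration with made-up labels; it treats an undefined per-class F1 as 0, which is an assumption of this sketch. The linked scikit-learn function remains the reference for the actual evaluation.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 of one class, computed as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # Assumption of this sketch: an undefined F1 counts as 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels for illustration only.
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0]
score = macro_f1(y_true, y_pred)  # class 1: F1 = 1/2, class 0: F1 = 2/3
```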
Classic algorithms such as Random Forest, Logistic Regression, XGBoost, GBDT, SVM, and MLP serve as baselines. You may use their performance as a reference, and we look forward to your better solutions.
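Comparing your method with such baselines requires labeled data that was not used for training. Assuming test.csv carries no Exited labels, one option is to hold out part of train.csv. The sketch below splits record indices per class so that both parts keep a similar churn rate; the function name, the 20% holdout fraction and the seed are illustrative choices, not requirements.

```python
import random


def stratified_split(labels, holdout_fraction=0.2, seed=0):
    """Return (train_idx, holdout_idx) with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(round(len(indices) * holdout_fraction))
        holdout_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(holdout_idx)


# Made-up labels for illustration only: 80 non-churned, 20 churned.
example_labels = [0] * 80 + [1] * 20
train_idx, holdout_idx = stratified_split(example_labels)
```

Each baseline and your own algorithm can then be trained on the records in train_idx and scored on those in holdout_idx.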
Download:
Deadline: 31 May 2024, 23:59 China Standard Time
Scores: 25%