SIT720 Machine Learning
Assessment Task 2: Problem solving task.
©Deakin University, SIT720
This document supplies detailed information on Assessment Task 2 for this unit.
Key information
• Due: Week 7, Monday 30 August 2021 by 8.00 pm (AEST)
• Weighting: 15%
Learning Outcomes
This assessment assesses the following Unit Learning Outcomes (ULO) and related Graduate Learning
Outcomes (GLO):
Unit Learning Outcome (ULO):
• ULO2 - Perform unsupervised learning of data such as clustering and dimensionality reduction.
Related Graduate Learning Outcomes (GLO):
• GLO1 - through the assessment of student ability to use data acquisition techniques to obtain, manipulate and represent data.
• GLO3 - through student ability to use specific programming language and modules to obtain, pre-process, transform and analyse data.
• GLO4 - through assessment of student ability to make decisions to obtain data, and use appropriate techniques to represent and visualise complex relationships in the data.
• GLO5 - through assessment of student ability to solve problems related to ill-defined data.
Purpose
This assessment task is for students to apply skills in data clustering and dimensionality reduction. Students
will be required to demonstrate ability in data representation, and competency in applying suitable
clustering/dimensionality reduction techniques in a real-world scenario.
Assessment 2: Total marks = 40
Submission Instructions
a) Submit your solution code in a notebook file with the “.ipynb” extension. Write discussions and
explanations, including outputs and figures, in a separate file and submit it as a PDF file.
b) Submissions in file formats other than those mentioned above will not be assessed and will receive zero for the
entire submission.
c) In your submitted “.ipynb” file, place each Python code response after its question, i.e., copy the
question into a cell added before the solution cell. If you need multiple cells for better
presentation of the code, add the question only before the first solution cell.
d) Your submitted code should be executable. If your code does not generate the submitted solution,
then you will get zero for that part of the marks.
e) Answers must be relevant and precise.
f) No hard coding is allowed. Avoid using specific values that can be calculated from the data provided.
g) Use topics covered up to week 6 to answer this assignment.
h) Submit your assignment after running each cell individually.
i) The submitted notebook file name should be of this form “SIT720_A2_studentID.ipynb”. For example, if your
student ID is 1234, then the submitted file name should be “SIT720_A2_1234.ipynb”.
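As an illustration of items f) and i) above, the following minimal sketch contrasts a hard-coded value with one derived from the data, and builds the required file name from a student ID. The DataFrame contents and the helper name notebook_name are hypothetical stand-ins and are not part of the assessment material.

```python
import pandas as pd

# Hypothetical stand-in data; in the assignment the data comes from the provided .csv file.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

# Hard-coded (avoid): n_samples = 4
# Derived from the data (preferred): the values follow the data if it changes.
n_samples, n_features = df.shape
column_means = df.mean()


def notebook_name(student_id: str) -> str:
    """Build the required notebook file name for a given student ID (hypothetical helper)."""
    return f"SIT720_A2_{student_id}.ipynb"


print(n_samples, n_features)
print(notebook_name("1234"))
```

Deriving counts, means and similar quantities from the loaded data keeps the notebook executable and correct even if the data file is updated.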
_____________________________________________________________________________________
Questions
_____________________________________________________________________________________
Datafile: Download the dataset (.csv) from the SCADI page (https://archive.ics.uci.edu/ml/datasets/SCADI).
Data Description: This dataset contains 206 attributes of 70 children with physical and motor disability based
on ICF-CY. For more information click this link.
1. Determine the number of subgroups in the dataset using attributes 3 to 205, i.e., exclude attributes 1,
2 and 206. Is this number the same as the number of classes presented by attribute 206? Explain and justify
your findings. (4 marks)
2. Is this data facing the curse of dimensionality? If so, how can this problem be solved? Explain with a
two-dimensional plot and report the relevant loss of information. (4 marks)
3. After applying principal component analysis (PCA) on a given dataset, it was found that the percentage
of variance for the first N components is X%. How is this percentage of variance computed? (2 marks)
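Question 1 numbers the attributes from 1, while pandas positions start at 0. The sketch below shows one way to select attributes 3 to 205 by position without hard coding the column count. The random DataFrame and the attr_ column names are hypothetical stand-ins with the same shape as the SCADI data (70 rows, 206 attributes); they are not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with the same shape as the SCADI data (70 rows, 206 attributes).
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.integers(0, 2, size=(70, 206)),
    columns=[f"attr_{i}" for i in range(1, 207)],
)

# iloc is 0-based and end-exclusive, so attributes 3..205 are positions 2..204.
# Slicing with 2:-1 drops the first two attributes and the last one.
features = data.iloc[:, 2:-1]
labels = data.iloc[:, -1]  # attribute 206, kept aside for comparison only

print(features.shape)
print(features.columns[0], features.columns[-1])
```

With the real file, data would instead be read from the downloaded .csv; the slicing step stays the same.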
___________________________________________________________________________________
Background
Obesity has become a global epidemic that has doubled since 1980, with serious consequences for health in
children, teenagers, and adults. Obesity levels in individuals may relate to their eating habits and physical
condition. In this assessment, you will be analysing and creating ML models based on a given dataset that
contains attributes of individuals with relation to obesity levels.
Dataset filename: obesity_levels.csv
Dataset description: This dataset includes data for the estimation of obesity levels in individuals based on their
eating habits and physical condition. The data contains 17 attributes and 2111 records.
Features and labels: The attribute names are listed below. The description of the attributes can be found in this
article (web-link).
I. Gender
II. Age
III. Height
IV. Weight
V. family_history_with_overweight (family history of overweight)
VI. FAVC (frequent high caloric food)
VII. FCVC (vegetables per meal)
VIII. NCP (number of main meals per day)
IX. CAEC (any food between meals)
X. SMOKE (smoking)
XI. CH2O (daily water intake)
XII. SCC (daily consumed calories)
XIII. FAF (frequency of physical activity)
XIV. TUE (technology usage)
XV. CALC (consumption of alcohol)
XVI. MTRANS (means of transport)
XVII. NObeyesdad (obesity levels, i.e. Insufficient Weight, Normal Weight, Overweight Level I, Overweight
Level II, Obesity Type I, Obesity Type II and Obesity Type III)
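A first look at a dataset like this usually checks its size and separates text attributes from numeric ones. The sketch below does this on a tiny in-memory sample. The rows are made up for illustration, use only a few of the attribute names listed above, and are not taken from obesity_levels.csv.

```python
import io

import pandas as pd

# Made-up rows using a few of the attribute names listed above; the values are
# illustrative only and are NOT taken from obesity_levels.csv.
sample_csv = io.StringIO(
    "Gender,Age,Height,Weight,NObeyesdad\n"
    "Female,21,1.62,64.0,Normal Weight\n"
    "Male,27,1.80,87.0,Overweight Level I\n"
)
df = pd.read_csv(sample_csv)

# Size of the table, derived from the data rather than typed in by hand.
n_records, n_attributes = df.shape

# Split the attributes by type: text columns need encoding before most models can use them.
text_columns = df.select_dtypes(exclude="number").columns.tolist()
numeric_columns = df.select_dtypes(include="number").columns.tolist()

print(n_records, n_attributes)
print(text_columns)
print(numeric_columns)
```

Reading the real file with pd.read_csv("obesity_levels.csv") instead should, according to the description above, give 2111 records and 17 attributes.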
_____________________________________________________________________________________
Links
https://archive.ics.uci.edu/ml/datasets/SCADI
https://www.mdpi.com/ XXXXXXXXXX/11/1/89/htm
https://doi.org/10.1016/j.dib XXXXXXXXXX
_____________________________________________________________________________________
Questions
_____________________________________________________________________________________
4. Create a machine learning (ML) model for predicting “Weight” using all features except “NObeyesdad”
and report the observed performance. Explain your results based on the following criteria: (10 marks)
a. What model have you selected for solving this problem and why?
b. Have you made any assumptions about the target variable? If so, why?
c. What have you done with text variables? Explain.
d. Have you optimised any model parameters? What is the benefit of this action?
e. Have you applied any steps for handling overfitting or underfitting? What are they?
5. Create an ML model for classifying subjects into two classes, applying the following constraints to the above
dataset. (12 marks)
• Use “NObeyesdad” as the target variable and the rest as predictor variables.
• Drop samples with the value “Insufficient Weight” for “NObeyesdad”.
• Group Normal Weight, Overweight Level I,