MA5810-Assessment 2
Weighting: 30% Total marks: 70. Due date: Week 5 - Sunday,
This assessment focuses on machine learning techniques covered during Weeks 2-5, with
primary focus on topics 3, 4, and 5.
Wherever required, you must show evidence of your work using R code and output, as
part of your R script or R Markdown submission.
The purpose of the assignment is for you to:
• Demonstrate sound knowledge of the basic theory, principles and concepts that
underpin data mining and exemplify the most common tasks and types of data mining
problems.
• Apply classic supervised and/or unsupervised data mining methods to analyse and
evaluate descriptive analytics tasks.
Submission
You will need to submit the following:
• A PDF file that clearly shows the assignment question number and the associated answers,
analyses and discussions. The assignment must be presented in 12pt font on A4 pages
using single line spacing and 2.5cm margins.
• R script or R markdown file to reproduce your work. Please attach a separate file or
copy the code into an Appendix.
• The assignment should not exceed 9 A4 pages. Appendices do not form part of the
page limit.
You have up to three attempts to submit your assessment, and only the last
submission will be graded.
A word on plagiarism:
Plagiarism is the act of using another’s words, works or ideas from any source as
one’s own. Plagiarism has no place in a University. Student work containing
plagiarised material will be subject to formal university processes.
Question 1 - Total Marks 40
Consider the Breast Cancer Wisconsin (Diagnostic) Data Set (wdbc.data). Thirty features are
computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image. A quick recap of the attributes:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three largest values) of these
features were computed for each image, resulting in 30 features. For instance, field 3 is Mean
Radius, field 13 is Radius SE, field 23 is Worst Radius.
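The field layout described above can be turned into column names before importing the file. The following is a minimal sketch only: the names (for example `radius_mean`) are illustrative assumptions, not names taken from the data file, and the commented-out import line assumes `wdbc.data` is a headerless comma-separated file in the working directory.

```r
# Ten base features, in the order listed above (names are illustrative).
base_features <- c("radius", "texture", "perimeter", "area", "smoothness",
                   "compactness", "concavity", "concave_points", "symmetry",
                   "fractal_dimension")

# Three summary statistics per feature: mean, standard error, worst.
stat_blocks <- c("mean", "se", "worst")

# outer() fills column by column, so all "mean" names come first,
# then all "se" names, then all "worst" names.
feature_names <- as.vector(outer(base_features, stat_blocks, paste, sep = "_"))

# Fields 1 and 2 are the ID number and the diagnosis.
wdbc_names <- c("id", "diagnosis", feature_names)

# Check the examples given in the text: fields 3, 13 and 23.
print(wdbc_names[c(3, 13, 23)])

# Assumed import call (not run here because the file may not be present):
# wdbc <- read.csv("wdbc.data", header = FALSE, col.names = wdbc_names)
```

With this ordering, fields 3, 13 and 23 correspond to the mean, standard error and worst value of radius, matching the example in the text.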
Assignment tasks:
Import the data into your session.
1. Partition the data into 90% training and the remaining 10% as test samples. Fit a logistic
regression model for Diagnosis against all numeric features to the training sample.
Marks 6
2. Discuss any difficulties in the model fit and interpretation of the coefficients.
From the summary of the fitted model, interpret the relationship between Diagnosis and
the features Texture and Concavity. Marks 4
3. Return to the unpartitioned data. Use descriptive methods to investigate the
correlation between the 30 numeric features in the BC data. Show relevant output.
Marks 4
4. Suggest and implement an unsupervised learning method to derive secondary
features that address inter-feature correlation. Show R code. Marks 6
5. Select a subset (filter(.)) of secondary features obtained in 4) Marks 12
a. Justify your approach using result(s) obtained in 4).
b. Partition the data containing secondary features into training (90%) versus test
samples. Use the data obtained in 5a) to fit a logistic regression model with Diagnosis
as response on this new training sample.
c. Use the same features to fit a quadratic discriminant analysis to Diagnosis.
6. Implement both models on the test data, along with the logistic regression model with
all features (as in task 1). Provide accuracy measures for each case and discuss your
findings. Marks 8
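The general shape of the workflow in tasks 1 to 6 can be sketched on simulated data. This is an illustration under stated assumptions, not a model answer: the data are a synthetic stand-in for wdbc.data, principal component analysis is used as one possible unsupervised method, and the 90% cumulative-variance cut-off and the 0.5 probability threshold are arbitrary choices. Names such as `sim`, `n_keep` and `glm_accuracy` are hypothetical.

```r
library(MASS)  # provides qda(); MASS ships with standard R installations

set.seed(1)
n <- 300
# Simulated stand-in: two latent factors drive six correlated features.
latent <- matrix(rnorm(n * 2), ncol = 2)
loadings <- matrix(c(1, 0.9, 0.8, 0,   0,   0,
                     0, 0,   0,   1, 0.9, 0.8), nrow = 2, byrow = TRUE)
features <- latent %*% loadings + matrix(rnorm(n * 6, sd = 0.3), ncol = 6)
colnames(features) <- paste0("x", 1:6)
diagnosis <- factor(ifelse(latent[, 1] + 0.5 * latent[, 2] +
                             rnorm(n, sd = 0.7) > 0, "M", "B"))
sim <- data.frame(diagnosis, features)

# 90% / 10% partition.
train_idx <- sample(seq_len(n), size = round(0.9 * n))
train <- sim[train_idx, ]
test <- sim[-train_idx, ]

# Logistic regression on all features (models the probability of "M").
fit_full <- glm(diagnosis ~ ., data = train, family = binomial)

# Descriptive look at inter-feature correlation on the unpartitioned data.
print(round(cor(sim[, -1]), 2))

# PCA on standardised features; keep enough components for 90% of variance.
pca <- prcomp(sim[, -1], center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_keep <- which(cum_var >= 0.90)[1]

# Secondary features (PC scores), partitioned with the same indices.
scores <- data.frame(diagnosis = sim$diagnosis,
                     pca$x[, seq_len(n_keep), drop = FALSE])
train_pc <- scores[train_idx, ]
test_pc <- scores[-train_idx, ]

fit_pc <- glm(diagnosis ~ ., data = train_pc, family = binomial)
fit_qda <- qda(diagnosis ~ ., data = train_pc)

# Test-set accuracy for a fitted logistic model.
glm_accuracy <- function(fit, newdata) {
  prob_m <- predict(fit, newdata = newdata, type = "response")
  pred <- ifelse(prob_m > 0.5, "M", "B")
  mean(pred == newdata$diagnosis)
}

acc <- c(
  logistic_full = glm_accuracy(fit_full, test),
  logistic_pc = glm_accuracy(fit_pc, test_pc),
  qda_pc = mean(predict(fit_qda, newdata = test_pc)$class == test_pc$diagnosis)
)
print(round(acc, 3))
```

Reusing the same row indices for both partitions keeps the three accuracy figures comparable, since every model is then evaluated on the same test observations.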
Question 2 - Total Marks 30
Clustering is a common exploratory technique used in bioinformatics where researchers aim
to identify subgroups within diseases using gene expression. Imagine you are asked to
analyse the gene expression dataset available in the leukemia_dat.Rdata file. This data was
originally generated by [Golub et al., Science, 1999]
https://science.sciencemag.org/content/sci/286/5439/531.full.pdf
and contains the expression level of 1867 selected
genes from 72 patients with different types of leukemia.
The data in each column are summarized as follows:
• Column 1: patient id = a unique identifier for each patient (observation)
• Column 2: type = A factor variable with two subtypes of leukemia; acute lymphoblastic
leukemia (ALL, n = 47) and acute myeloblastic leukemia (AML, n = 25).
• Columns 3 to 1869: gene expression data for 1867 genes, Gene 1, ..., Gene 1867.
Assignment Tasks:
The researchers hypothesized that patient samples will cluster by subtype of leukemia based
on gene expression. Your task is to use a clustering technique to address this scientific
hypothesis and report your results back to the researcher.
(a) Select a clustering technique to apply. Justify your choice. Marks 5
(b) Implement your chosen clustering technique in R. Describe your implementation.
You need to provide details of all steps relating to the implementation of the clustering
algorithm, such as data preparation (including any transformations performed on the data
prior to clustering), training the model, and evaluating the performance of the model. Marks 25
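The steps named in (b) (data preparation, fitting, evaluation against the known subtype) might look as follows on simulated data. This is a sketch under assumptions: the expression matrix is synthetic rather than loaded from leukemia_dat.Rdata, and hierarchical clustering with Euclidean distance and complete linkage is used only as one example technique, not as the expected choice. Object names such as `expr` and `confusion` are hypothetical.

```r
set.seed(2)
n_all <- 47
n_aml <- 25
n_genes <- 50  # toy size; the real data set has 1867 genes

# Simulated subtype labels and expression matrix (rows = patients).
type <- factor(rep(c("ALL", "AML"), times = c(n_all, n_aml)))
expr <- matrix(rnorm((n_all + n_aml) * n_genes), ncol = n_genes)
colnames(expr) <- paste0("Gene_", seq_len(n_genes))

# Shift the first 10 genes for AML samples so the toy data carry a subtype signal.
expr[type == "AML", 1:10] <- expr[type == "AML", 1:10] + 2

# Data preparation: standardise each gene to mean 0 and unit variance.
expr_scaled <- scale(expr)

# Pairwise Euclidean distances between patients, then complete linkage.
d <- dist(expr_scaled, method = "euclidean")
hc <- hclust(d, method = "complete")

# Cut the dendrogram into two clusters and compare with the known subtypes.
clusters <- cutree(hc, k = 2)
confusion <- table(cluster = clusters, type = type)
print(confusion)

# Dendrogram with subtype labels (uncomment in an interactive session):
# plot(hc, labels = as.character(type))
```

The cross-tabulation of cluster membership against subtype is one simple way to report back on the researchers' hypothesis; other linkage methods or distance measures can be substituted in the `hclust()` and `dist()` calls.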
Rubric template
Criteria (HD / P / Fail)

Rmarkdown/R (10%)
HD: Code is reproducible. Demonstrates superior ability to write code in Rmarkdown/R
efficiently and produce accurate results. Code is well organised and very easy to follow.
Code is well commented, so the purpose of each block of code and the question part it
corresponds to are readily understood. Variable names give the purpose of the variable.
P: Code is reproducible. Demonstrates limited ability to use R/Rmarkdown. Some of the
results produced by the code are accurate. The code is readable only by someone who
already knows what it is supposed to be doing. Comments are not sufficient to see what
the code is doing. A significant lack of comments makes it difficult to understand the code.
Fail: A lack of compliance with the factors described under HD and P.

Question 1 (30%)
HD: Demonstrates superior understanding and implementation of logistic regression to
classify breast cancer type. Provides full detail of the implementation. The results and
discussion are explained correctly, clearly, and in sufficient detail.
P: Demonstrates some understanding and implementation of logistic regression to classify
breast cancer type. Provides some steps of the implementation in detail. The results and
discussion are explained clearly and in sufficient detail most of the time. There are some
misunderstandings in interpreting results.
Fail: A lack of compliance with the factors described under HD and P.

Question 2 (30%)
HD: Demonstrates superior understanding of complete and single linkage clustering.
Provides all steps to obtain the dendrograms. Writing is authentic and easy to understand,
with an excellent level of detail.
P: Demonstrates limited understanding of complete and single linkage clustering. Lacks
explanation of some steps in obtaining the dendrograms. Writing is authentic and easy to
understand, with some level of detail.
Fail: A lack of compliance with the factors described under HD and P.
Glenn Fulford
40%
incorporated into Q1 and Q2 marks
Question 3 (30%)
HD: Demonstrates superior understanding and implementation of clustering algorithms.
Provides full detail of the implementation. The results and discussion are explained
correctly, clearly, and in sufficient detail.
P: Demonstrates some understanding and implementation of clustering algorithms. Provides
some steps of the implementation in detail. The results and discussion are explained
correctly, clearly and in sufficient detail most of the time. There are some
misunderstandings in interpreting results.
Fail: A lack of compliance with the factors described under HD and P.
Question 2