Microsoft Word - MA5810Assignemnt 2.docxMA5810-Assessment 2 ...

Question

MA5810-Assessment 2
Weighting: 30% Total marks: 70. Due date: Week 5 - Sunday,
This assessment focuses on machine learning techniques covered during Weeks 2-5 with
primary focus on topics of 3, 4, and 5.
Wherever required you must show evidence of your work using R-code and output, as
part of your Rscript or RMarkdown submission.
The purpose of the assignment is for you to:
• Demonstrate sound knowledge of the basic theory, principles and concepts that
underpin data mining and exemplify the most common tasks and types of data mining
problems.
• Apply classic supervised and/or unsupervised data mining methods to analyse and
evaluate descriptive analytics tasks.
Submission
You will need to submit the following:
• A PDF file that clearly shows the assignment question number, the associated answers,
analyses and discussions. The assignment must be presented in 12pt font on A4 pages
using single line spacing and 2.5cm margins.
• R script or R markdown file to reproduce your work. Please attach a separate file or
copy the code into an Appendix.
• The assignment should not exceed 9 A4 pages. Appendices do not form part of the
page limit.
You have up to three attempts to submit your assessment, and only the last
submission will be graded.
A word on plagiarism:
Plagiarism is the act of using another’s words, works or ideas from any source as
one’s own. Plagiarism has no place in a University. Student work containing
plagiarised material will be subject to formal university processes.



Question 1 - Total Marks 40
Consider the Breast Cancer Wisconsin (Diagnostic) Data Set (wdbc.data). Thirty features are
computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image. A quick recall of the attributes:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these
features were computed for each image, resulting in 30 features. For instance, field 3 is Mean
Radius, field 13 is Radius SE, field 23 is Worst Radius.
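The field layout described above can be sketched in code. This is only an illustration in Python: the column names (for example radius_mean) are hypothetical labels chosen here, since the data file itself carries no header row.

```python
# Base feature names, in the order a) to j) listed above.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]

def wdbc_column_names():
    """Return 32 illustrative column names: id, diagnosis, then 30 features."""
    names = ["id", "diagnosis"]
    # Fields 3-12 hold the means, 13-22 the standard errors, 23-32 the worst values.
    for suffix in ("mean", "se", "worst"):
        names.extend(f"{feature}_{suffix}" for feature in BASE_FEATURES)
    return names

columns = wdbc_column_names()
# Field numbers in the text are 1-based, so field 3 is columns[2].
print(columns[2], columns[12], columns[22])
```

The same list could be passed as the column names when the file is read, whichever tool is used for the import.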
Assignment tasks:
Import the data into your session.
1. Partition the data into 90% training and remaining as test samples. Fit a logistic
regression model for Diagnosis against all numeric features to the training sample.
Marks 6
2. Discuss any difficulties in the model fit and interpretation of the coefficients.
From the summary of fitted model interpret the relationship between Diagnosis and
the features Texture and Concavity. Marks 4
3. Return to the unpartitioned data. Use descriptive methods to investigate the
correlation between the 30 numeric features on the BC data. Show relevant output.
Marks 4
4. Suggest and implement an unsupervised learning method to derive secondary
features that address inter-feature correlation. Show R-code. Marks 6
5. Select a subset (filter(.)) of secondary features obtained in 4) Marks 12
a. Justify your approach using result(s) obtained in 4).
b. Partition the data containing secondary features into training (90%) versus test
samples. Use the data obtained in 5a) to fit a logistic regression model with Diagnosis
as response on this new training sample.
c. Use the same features to fit a quadratic discriminant analysis to Diagnosis.
6. Implement both models on the test data along with the logistic regression model with
all features (as in Q1) Marks 8
Provide accuracy measures for each case and discuss your findings.
Question 2 - Total Marks 30
Clustering is a common exploratory technique used in bioinformatics where researchers aim
to identify subgroups within diseases using gene expression. Imagine you are asked to
analyse the gene expression dataset available in the leukemia_dat.Rdata file. This data was
originally generated by [Golub et al., Science, 1999] https://science.sciencemag.org/content/sci/286/5439/531.full.pdf
and contains the expression level of 1867 selected
genes from 72 patients with different types of leukemia.
The data in each column are summarized as follows:
• Column 1: patient id = a unique identifier for each patient (observation)
• Column 2: type = a factor variable with two subtypes of leukemia: acute lymphoblastic
leukemia (ALL, n = 47) and acute myeloblastic leukemia (AML, n = 25).
• Columns 3 to 1869: gene expression data for 1867 genes, Gene 1, ..., Gene 1867.
Assignment Tasks:
The researchers hypothesized that patient samples will cluster by subtype of leukemia based
on gene expression. Your task is to use a clustering technique to address this scientific
hypothesis and report your results back to the researcher.
(a) Select a clustering technique to apply. Justify your choice. Marks 5
(b) Implement your chosen clustering technique in R. Describe your implementation.
You need to provide details of all steps relating to the implementation of the clustering
algorithms, such as data preparation including any transformations performed on the data
prior to clustering, training the model & evaluating the performance of the model. Marks 25

Rubric template
Criteria HD P Fail
Rmarkdown/R (10%)
HD: Codes are reproducible. Demonstrate superior ability to write code in Rmarkdown/R efficiently and produce accurate results. Code is well organised and very easy to follow. Code is well commented so the purpose of each block of code is readily understood, along with the question part it corresponds to. Variable names give the purpose of the variable.
P: Codes are reproducible. Demonstrate limited ability to use R/Rmarkdown. Some of the results produced by the code are accurate. The code is readable only by someone who already knows what it is supposed to be doing. Comments not sufficient to see what the code is doing. Significant lack of comments makes it difficult to understand code.
Fail: A lack of compliance with the factors described in adjacent columns.

Question 1 (30%)
HD: Demonstrate superior understanding and implementing the logistic regression to classify breast cancer type. Provide full detail of the implementation. The results and discussion are explained correctly, clearly, and in sufficient detail.
P: Demonstrate some understanding and implementing the logistic regression to classify breast cancer type. Provide some steps of the implementation in detail. The results and discussion are explained clearly and in sufficient detail most of the time. There are some misunderstandings in interpreting results.
Fail: A lack of compliance with the factors described in adjacent columns.

Question 2 (30%)
HD: Demonstrate superior understanding of complete and single linkage clustering. Provide all steps to obtain the dendrograms. Writing is authentic, easy to understand with excellent level of detail.
P: Demonstrate limited understanding of complete and single linkage clustering. Lack of explanations on some steps obtaining the dendrograms. Writing is authentic, easy to understand with some level of detail.
Fail: A lack of compliance with the factors described in adjacent columns.
Annotations (Glenn Fulford): 40%; incorporated into Q1 and Q2 marks

Question 3 (30%)
HD: Demonstrate superior understanding and implementing clustering algorithms. Provide full detail of the implementation. The results and discussion are explained correctly, clearly, and in sufficient detail.
P: Demonstrate some understanding and implementing clustering algorithms. Provide some steps of the implementation in detail. The results and discussion are explained correctly, clearly and in sufficient detail most of the time. There are some misunderstandings in interpreting results.
Fail: A lack of compliance with the factors described in adjacent columns.

ma5810assignemnt-2-cgtwhyfg.pdf ma5810assessment2rubric-0nquw0c1.pdf leukemiadat-xf3cvddk.rdata wdbc-z3sbdydo.data

Amar Kumar · Accepted Answer

Q1.
1.
Calculating the accuracy
This function determines how accurate our algorithm is.
Code 1: The function used to determine accuracy.
Implementing logistic regression with two variables
All of the essential functions needed to carry out the logistic regression were built in the preceding sections.
To predict the outcome of malignancy from two of the 20 non-redundant features in our dataset, we will now write the code that wraps these functions. Because they have a correlation value of 0.32, we can select Radius and Texture as one of the feature pairs from the Step 3 discovery process. The following code uses the DataFrame df to create the NumPy output vector Y and the NumPy feature array X:
	Code 2: Create the NumPy arrays for X and Y.
Plotting the two features
	Code 3: Plot the features.
Figure 1 shows the resulting plot:
			Fig 1. Plot of radius and texture
The yellow and dark markers distinguish the malignant and benign cells.
We now scale and normalise the data.
We must also record the mean of the X values in our training set, mu, and their standard deviation, sigma.
Create a new cell in your notebook and write the following:
Code 4: Implement feature scaling and normalisation.
A column of ones must now be added to the array X using a NumPy stack function:
	Code 5: Add a column of ones to the X matrix
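The answer's own Code 4 and Code 5 are not shown, so what follows is only a minimal sketch of the two steps they describe. It assumes NumPy, uses `np.hstack` as one possible stack function, and runs on made-up numbers; the helper names `scale_features` and `add_intercept` are hypothetical.

```python
import numpy as np

def scale_features(X):
    """Standardise each column of X; return the scaled array, mu and sigma."""
    mu = X.mean(axis=0)       # per-feature mean of the training set
    sigma = X.std(axis=0)     # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma

def add_intercept(X):
    """Prepend a column of ones so that theta[0] acts as the intercept."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

# Toy values standing in for (radius, texture) pairs; not taken from the dataset.
X_raw = np.array([[12.0, 15.0], [14.0, 20.0], [20.0, 25.0], [18.0, 10.0]])
X_scaled, mu, sigma = scale_features(X_raw)
X = add_intercept(X_scaled)
print(X.shape)
print(mu, sigma)
```

Returning mu and sigma alongside the scaled array matters because any later query has to be scaled with the same constants.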
Testing
Let's run a few tests. To check our code, we calculate the gradient and the cost function with theta = [0, 0, 0]:
Code 6: Calculate the gradient and cost function for the first test, with an initial value of zero.
The value of J(theta) is 0.69, and the gradient vector is [0.12741652, -0.35265304, -0.20056252].
We can also try a non-zero starting value to see what happens:
Code 7: Calculate the cost function and gradient for the second test, using a non-zero starting value.
The gradient vector is now [-0.37258348, -0.35265304, -0.20056252], with a corresponding J(theta) value of 8.48.
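The first test can be sanity-checked independently of the data. With theta set to zero, every predicted probability is 0.5, so the cross-entropy cost equals ln 2, roughly 0.69, and the first gradient component equals 0.5 minus the fraction of positive labels. The sketch below assumes the standard mean cross-entropy cost and uses toy data; `cost_and_gradient` is a hypothetical name, not the answer's function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y):
    """Return the mean cross-entropy cost J(theta) and its gradient."""
    m = y.size
    h = sigmoid(X @ theta)
    eps = 1e-12  # keeps log() finite when h reaches 0 or 1
    cost = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    grad = X.T @ (h - y) / m
    return cost, grad

# Toy design matrix (intercept column first) and labels; not the WDBC data.
X = np.array([[1.0, -1.2, 0.3], [1.0, 0.4, -0.8],
              [1.0, 1.1, 0.9], [1.0, -0.3, -0.4]])
y = np.array([0.0, 0.0, 1.0, 1.0])

cost, grad = cost_and_gradient(np.zeros(3), X, y)
print(round(cost, 2), grad)
```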
Advanced Optimisation for Gradient Descent
We will use the BFGS optimisation method, the Broyden, Fletcher, Goldfarb, and Shanno quasi-Newton technique [5]. The BFGS method is used internally by the SciPy function minimize, which is called in Code 8.
Code 8: Advanced optimisation for gradient descent
If we do not specify the method we wish to use in the "method" parameter of the minimize function, the BFGS algorithm is used by default for a problem without bounds or constraints. Another method, TNC, uses a truncated Newton algorithm to minimise a function whose variables are subject to bounds. With SciPy's minimize function, users can try out the different optimisation algorithms that are available. More information about minimize and the other optimisation techniques can be found in the SciPy documentation. Code 8 results in the following:
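A call of this kind might look like the sketch below. It is not the answer's Code 8: it assumes `scipy.optimize.minimize` with `method="BFGS"` and `jac=True` (meaning the objective returns the cost and the gradient together), and it runs on a small toy dataset whose classes overlap, so the optimum is finite.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y):
    """Mean cross-entropy cost and its gradient, returned as a pair."""
    h = sigmoid(X @ theta)
    eps = 1e-12  # keeps log() finite when h reaches 0 or 1
    cost = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    grad = X.T @ (h - y) / y.size
    return cost, grad

# Toy, overlapping data (intercept column first); not the WDBC data.
X = np.array([
    [1.0, -1.5, 0.2], [1.0, -0.5, -0.7], [1.0, 0.3, 0.8],
    [1.0, 0.9, -0.2], [1.0, -0.2, 1.1], [1.0, 1.4, 0.4],
])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

theta0 = np.zeros(X.shape[1])
# jac=True tells minimize that the objective returns (cost, gradient).
result = minimize(cost_and_gradient, theta0, args=(X, y), method="BFGS", jac=True)
print(result.x, result.fun)
```

The fitted parameters are read from `result.x` and the final cost from `result.fun`; swapping `method="BFGS"` for `method="TNC"` is one way to compare optimisers on the same objective.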
Decision boundary
Using the BFGS algorithm, the SciPy minimize function returned the parameter vector in the attribute Result.x as theta = [-0.70755981, 3.72528774, 0.93824469]. In Step 3, we stated that the hypothesis h(x) for logistic regression determines the probability that the outcome is 0 or 1. To discretise this probability into the classes "Benign/Malignant", we select a threshold of 0.5, above which we classify values as "1" and below which we classify values as "0". Consequently, we need to look at the decision boundary defined previously. A decision boundary is a property of the hypothesis and its parameters, not of the dataset. We plot the Radius and Texture features again, this time with a red line indicating the decision boundary that was found:
Code 9: Plot the data and the decision boundary
	Fig 2. Radius and texture plotted together with the decision boundary.
Although the logistic regression hypothesis uses a non-linear (sigmoid) function, it is important to remember that the decision boundary is linear.
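To see why the boundary is a straight line: h(x) equals 0.5 exactly where theta0 + theta1*x1 + theta2*x2 = 0, which can be solved for x2. The sketch below uses the theta values reported above and assumes they apply to the scaled radius and texture; `boundary_texture` is a hypothetical helper.

```python
import math

# Parameters reported in the text: intercept, radius weight, texture weight.
theta = [-0.70755981, 3.72528774, 0.93824469]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def boundary_texture(radius_scaled):
    """Scaled texture value on the decision boundary for a scaled radius.

    The boundary is where theta0 + theta1*x1 + theta2*x2 = 0, i.e. h(x) = 0.5.
    """
    return -(theta[0] + theta[1] * radius_scaled) / theta[2]

for x1 in (-1.0, 0.0, 1.0):
    x2 = boundary_texture(x1)
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    # Every point on the line gets a predicted probability of one half.
    print(x1, round(x2, 3), round(sigmoid(z), 3))
```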
Calculate the accuracy
We now want to determine how accurate our algorithm is. This is done with the function CalcAccuracy mentioned earlier:
Code 10: Determine the accuracy
CalculateAccuracy returns 89.1, which is a good accuracy.
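An accuracy function of this kind could be sketched as follows. This is not the answer's CalcAccuracy; `accuracy_percent` is a hypothetical stand-in that thresholds the predicted probabilities at 0.5 and reports the share of matching labels, shown here on toy inputs.

```python
import numpy as np

def accuracy_percent(theta, X, y, threshold=0.5):
    """Percentage of rows whose thresholded prediction matches the label."""
    probabilities = 1.0 / (1.0 + np.exp(-(X @ theta)))
    predictions = (probabilities >= threshold).astype(int)
    return 100.0 * np.mean(predictions == y)

# Toy inputs (intercept column first); not the WDBC data.
X = np.array([[1.0, -1.0, 0.5], [1.0, 0.2, -0.3],
              [1.0, 1.5, 0.1], [1.0, -0.4, -1.0]])
y = np.array([0, 1, 1, 1])
theta = np.array([-0.70755981, 3.72528774, 0.93824469])

print(accuracy_percent(theta, X, y))
```

Accuracy measured on the rows used for fitting tends to be optimistic, which is one reason the assignment asks for a held-out test sample.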
Make a prediction
Now that we have tested our system and measured its accuracy, we want to make predictions. A query may look like this: what is the outcome for radius = 18.00 and texture = 10.12? The code below illustrates this.
Code: Calculate the probability of cancer for a Radius of 18.00 and a Texture of 10.12.
Keep in mind that the query must be normalised using the mu and sigma obtained during scaling and normalisation. With a radius of 18 and a texture of 10.12, the predicted outcome is 0.79, which indicates that the probability of malignancy is closer to 1 than to 0.
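The scaling step for a query can be sketched as below. The theta values are those reported in the text, but the mu and sigma values are placeholder assumptions, because the training-set statistics are not given; the printed probability therefore need not match the 0.79 quoted above. `predict_probability` is a hypothetical helper.

```python
import math

theta = [-0.70755981, 3.72528774, 0.93824469]  # parameters reported in the text

# Placeholder scaling constants for (radius, texture). These are assumptions;
# use the mu and sigma computed from your own training set.
mu = [14.1, 19.3]
sigma = [3.5, 4.3]

def predict_probability(radius, texture):
    """Probability of class 1 for a raw (unscaled) query."""
    scaled = [(radius - mu[0]) / sigma[0], (texture - mu[1]) / sigma[1]]
    z = theta[0] + theta[1] * scaled[0] + theta[2] * scaled[1]
    return 1.0 / (1.0 + math.exp(-z))

p = predict_probability(18.00, 10.12)
print(round(p, 2), "malignant" if p >= 0.5 else "benign")
```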
2. 
Interpreting p-values for regression variables
Regression analysis is a form of inferential statistics. Regression p-values help determine whether the relationships you observe in your sample also hold in the wider population. The p-value of each independent variable in a linear regression tests the null hypothesis that the variable has no correlation with the dependent variable. If there is no correlation, there is no association between changes in the independent variable and changes in the dependent variable. In other words, there is insufficient evidence to conclude that an effect exists at the population level.
If the p-value for a particular variable is below your significance threshold, the sample data provide enough evidence to reject the null hypothesis for the whole population. Your findings support the hypothesis that there is a non-zero correlation. Changes in the independent variable are associated with changes in the dependent variable at the population level. The variable is statistically significant, which suggests it is worth including in your regression model.
On the other hand, a p-value greater than the significance threshold indicates that there is insufficient evidence in your sample to conclude that a non-zero correlation exists.
In the example regression output, the South and North predictor variables are statistically significant because their p-values are reported as 0.000. However, East is not statistically significant, because its p-value (0.092) is higher than the usual significance level of 0.05.
The coefficient p-values are typically used to decide which variables to include in the final model. In light of the results above, we would consider removing East. Keeping variables that are not statistically significant can reduce the model's precision.
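The decision rule above amounts to comparing each p-value with the significance level. A minimal illustration, using only the three p-values quoted in the text and the 0.05 level:

```python
ALPHA = 0.05  # the usual significance level mentioned above

# p-values quoted in the text (0.000 is a rounded display value).
p_values = {"South": 0.000, "North": 0.000, "East": 0.092}

significant = [name for name, p in p_values.items() if p < ALPHA]
candidates_to_drop = [name for name, p in p_values.items() if p >= ALPHA]

print("significant:", significant)
print("consider removing:", candidates_to_drop)
```

For the assignment's full-feature model this rule should be applied with care, since strongly correlated predictors can inflate standard errors and make individual p-values unreliable.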
3.
