SIT743 Multivariate and Categorical Data Analysis Assignment-2 Total Marks = 100, Weighting - 40%...

Question

SIT743 Multivariate and Categorical Data Analysis
Assignment-2
Total Marks = 100, Weighting - 40%
Due date: 25th September 2019 by 11.30 PM
---------------------------------------------------------------------------------------------------------------
INSTRUCTIONS:
• For this assignment, you need to submit the following TWO files.

1. A written document (A single pdf only) covering all of the items described in the
questions. All answers to the questions must be written in this document, i.e, not in
the other files (code files) that you will be submitting. All the relevant results
(outputs, figures) obtained by executing your R code must be included in this
document.
For questions that involve mathematical formulas, you may write the answers
manually (hand written answers), scan it to pdf and combine with your answer
document. Submit a combined single pdf of your answer document.

2. A separate ".R" file or ".txt" file containing your code (R-code script) that you
implemented to produce the results. Name the file "name-StudentID-Ass2-Code.R"
(where "name" is replaced with your name, either your surname or
first name, and "StudentID" with your student ID).

• All the documents and files should be submitted (uploaded) via the SIT743 CloudDeakin
Assignment Dropbox by the due date and time.
• Zip files are NOT accepted. Both files should be uploaded separately to
CloudDeakin.
• E-mail or manual submissions are NOT allowed. Photos of the document are NOT
allowed.
=================================================================

Assignment tasks
Q1) [32 Marks]
A survey has been conducted in Melbourne to study the travel mode choice (M)
behavior of people. The list of factors that influence the travel mode choice, along with
their possible values, is provided below. A Bayesian network that has been created
based on the survey results is shown below, which represents the relationship between
these various factors (variables).
J (Occupation) ∈{Student, Employee, Individual, Others}
A (Age) ∈{<18, 18-35, 36-55, >55}
S (Salary – monthly in dollars) ∈ {<2000, XXXXXXXXXX, XXXXXXXXXX, >10000}
V (owning a private car) ∈ {Yes, No}

W (Trip purpose) ∈ {Commute to work, other}
D (Trip distance in km) ∈ {<1, 1-3, 3-6, >6}
P (Trip time period) ∈ {Peak hour, Off-peak hour}
U (Trip duration in mins) ∈ {<30, 30-60, >60}
M (Travel mode choice) ∈ {Walking, Bicycle, Public transport, Car}
1.1) Write down the joint distribution P(J, W, S, A, P, V, D, U, M) for the above
network.
1.2) Find the minimum number of parameters required to fully specify the
distribution according to the above network.
1.3) How many parameters are required, at a minimum, if no
independencies among the variables are assumed? Compare with the result of
the above question (Q1.2) and comment.
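As a quick check on Q1.3, an unrestricted joint table needs one probability per configuration, minus one for the sum-to-one constraint. The sketch below computes that from the state lists above. The helper `count_bn_params` and the two-node structure used to exercise it are hypothetical, since the network figure is not reproduced in this text.

```r
# Number of states per variable, taken from the value lists in Q1.
card <- c(J = 4, A = 4, S = 4, V = 2, W = 2, D = 4, P = 2, U = 3, M = 4)

# Full joint table with no independence assumptions:
# one probability per configuration, minus one for normalisation.
full_joint_params <- prod(card) - 1

# Hypothetical helper: free parameters of a Bayesian network, given a named
# list mapping each node to its parents (character(0) for root nodes).
# Each node contributes (states - 1) parameters per parent configuration.
count_bn_params <- function(card, parents) {
  per_node <- vapply(names(parents), function(node) {
    (card[[node]] - 1) * prod(card[parents[[node]]])
  }, numeric(1))
  sum(per_node)
}

# Illustrative structure only (NOT the assignment's network): A -> J.
toy_parents <- list(A = character(0), J = "A")
toy_params <- count_bn_params(card, toy_parents)

print(full_joint_params)
print(toy_params)
```

For Q1.2, the same helper could be called with the parent sets read off the assignment's figure.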
1.4) The d-separation method can be used to find two sets of independent or
conditionally independent variables in a Bayesian network. For each of the
statements given below from (a) to (c), perform the following:
• List all the possible paths from the first (set of) node/s to the second (set
of) node/s.
• State if each of those paths is blocking or non-blocking with reasons.

• Hence, mention if the statement is true or false.

a) W ⊥ V | ∅ (W is marginally independent of V)
b) A ⊥ M | {D, W} (A is conditionally independent of M given {D, W})
c) {??,??} ⊥ D | V
1.5) Write an R program to produce the above Bayesian network, and perform the
d-separation tests for all of the above cases mentioned in Q1.4 (a) to (c). Show
the plot of the network you obtained and the output (of d-separation test)
from your program.
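A minimal sketch of a d-separation test in R, assuming the `bnlearn` package is installed. The three-node collider used here is a toy network chosen for illustration, not the Q1 network, whose figure is not reproduced in this text.

```r
library(bnlearn)

# Toy collider X -> Z <- Y (an assumption for illustration, not the Q1 network).
toy_dag <- model2network("[X][Y][Z|X:Y]")

# X and Y are marginally d-separated: the only path is blocked at the collider Z.
marginal <- dsep(toy_dag, x = "X", y = "Y")

# Conditioning on the collider opens the path, so d-separation fails.
given_z <- dsep(toy_dag, x = "X", y = "Y", z = "Z")

print(marginal)
print(given_z)
```

As far as I know, `dsep()` takes a single node label for `x` and for `y`, so a set-valued statement such as Q1.4(c) would be checked one pair at a time. `graphviz.plot(toy_dag)` draws the network if Rgraphviz is available.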

1.6) Show the step by step process to perform variable elimination to
compute ??(?? | ?? = ???????? − ????????, ?? = ??????, ?? = ???????? ????????, ?? = < ??).
Use the following variable ordering for the elimination process:
J, W, A, U.
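The mechanics of variable elimination can be sketched on a toy chain X1 -> X2 -> X3 in base R: multiply the factors that mention the variable being eliminated, sum it out, and carry the resulting factor forward. The variables and all probabilities below are made up for illustration and have no connection to the Q1 network.

```r
# Toy chain X1 -> X2 -> X3 with binary variables; all numbers are made up.
p_x1 <- c(a = 0.3, b = 0.7)                        # P(X1)
p_x2_given_x1 <- matrix(c(0.9, 0.1,                # P(X2 | X1 = a)
                          0.2, 0.8),               # P(X2 | X1 = b)
                        nrow = 2, byrow = TRUE,
                        dimnames = list(X1 = c("a", "b"), X2 = c("c", "d")))
p_x3_given_x2 <- matrix(c(0.6, 0.4,                # P(X3 | X2 = c)
                          0.5, 0.5),               # P(X3 | X2 = d)
                        nrow = 2, byrow = TRUE,
                        dimnames = list(X2 = c("c", "d"), X3 = c("e", "f")))

# Step 1: eliminate X1. Multiply the factors mentioning X1 and sum it out:
# tau1(X2) = sum over x1 of P(x1) * P(X2 | x1).
tau1 <- as.vector(p_x1 %*% p_x2_given_x1)

# Step 2: eliminate X2 in the same way:
# tau2(X3) = sum over x2 of tau1(x2) * P(X3 | x2).
tau2 <- as.vector(tau1 %*% p_x3_given_x2)

# Step 3: normalise the remaining factor to obtain the distribution of X3.
p_x3 <- tau2 / sum(tau2)
print(p_x3)
```

With evidence, the observed variables would first be fixed to their values in every factor, which is why the final normalisation step matters in general even though it is a no-op here.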

[Marks XXXXXXXXXX = 32]
Q2) [16 Marks] Implementing a Bayesian network in R and performing inference
A belief network models the relation between the variables oil; inf; eh; bp; rt, which
stand for the price of oil, inflation rate, economy health, British Petroleum Stock price,
and retailer stock price respectively. Each variable takes different states as given below.
eh (economic health) ∈ {low, high}
oil (price of oil) ∈ {low, high}
inf (inflation rate) ∈ {low, high}
bp (British petrol stock price) ∈ {low, ?????????? ???????????? (????), ?????????? ???????????? (????), high}
rt (retailer stock price) ∈ {low, high}
The belief network that models these variables has (probability) tables as shown below.
2.1) Use the libraries below in R to create this belief network, along with the
probability values shown in the above table.
You may use the following libraries for this:
source("https://bioconductor.org/biocLite.R")
biocLite("RBGL")
library(RBGL)
library(gRbase)
library(gRain)
biocLite("Rgraphviz")
# Define the appropriate network and use the "compileCPT()" function to
# compile the list of conditional probability tables and create the network.
a) Show the obtained belief network for this distribution
b) Show the probability tables obtained from the R output, (and verify with
the above table).

2.2) Use R program to compute the following probabilities:
a) Given that the oil price is low and the retailer stock price is low, what is
the most probable state of the British petroleum stock price?
b) Given that the inflation rate is low, what is the probability that retailer
stock price is high?
c) Find the marginal distribution of price of oil.
d) Find the joint distribution of inflation, and British petroleum stock price.
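A minimal sketch of building and querying a network with `gRain`, assuming the package is installed. The two-node fragment oil -> inf and every probability in it are placeholders for illustration. The real structure and CPT values come from the assignment's tables, which are not reproduced in this text.

```r
library(gRain)

lh <- c("low", "high")

# Hypothetical fragment oil -> inf with placeholder probabilities.
# For a conditional table, values are listed with the child varying fastest:
# P(inf=low|oil=low), P(inf=high|oil=low), P(inf=low|oil=high), P(inf=high|oil=high).
cpt_oil <- cptable(~oil, values = c(0.6, 0.4), levels = lh)
cpt_inf <- cptable(~inf | oil, values = c(0.9, 0.1, 0.3, 0.7), levels = lh)

net <- grain(compileCPT(list(cpt_oil, cpt_inf)))

# Marginal distribution of a single node.
marg_inf <- querygrain(net, nodes = "inf", type = "marginal")$inf

# Posterior after entering evidence, and its most probable state.
net_ev <- setEvidence(net, nodes = "oil", states = "low")
post_inf <- querygrain(net_ev, nodes = "inf", type = "marginal")$inf
best_state <- names(which.max(post_inf))

# Joint distribution of two nodes.
joint <- querygrain(net, nodes = c("oil", "inf"), type = "joint")

print(marg_inf)
print(post_inf)
print(joint)
```

The same three query patterns (marginal, posterior under evidence, joint) map onto parts (a) to (d) once the full five-node network has been entered.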

[Marks: XXXXXXXXXX) = 16]

Q3) [16 Marks]
Consider five binary variables A, B, C, D, E. The Directed Acyclic Graph (DAG)
shown below describes the relationship between these variables along with their
conditional probability tables (CPT).

3.1) In the above network, state why A is independent of B, i.e., A⊥B.

3.2) Hence, find an expression (in a simplified form) for ??(?? = ??|?? = ??,?? = ??) in
terms of ??.

3.3) The table shown below provides 20 simulated data obtained for the above Bayesian
network. Use this data to find the maximum likelihood estimates of ??, ??, ?? and
??.
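For a discrete Bayesian network with complete data, the maximum likelihood estimate of each CPT entry is a relative frequency: the count of the child state within each parent configuration, divided by the count of that parent configuration. A base-R sketch on a made-up two-variable sample (a child X with one parent Y, not the assignment's table of 20 rows):

```r
# Made-up binary sample for a child X with one parent Y.
toy <- data.frame(
  Y = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  X = c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
)

# MLE of a root-node parameter P(Y = 1): the sample proportion.
theta_y <- mean(toy$Y == 1)

# MLE of P(X = 1 | Y = y): the proportion of X = 1 among rows with Y = y.
counts <- table(Y = toy$Y, X = toy$X)
cpt_x_given_y <- prop.table(counts, margin = 1)   # each row sums to 1
theta_x_y1 <- cpt_x_given_y["1", "1"]
theta_x_y0 <- cpt_x_given_y["0", "1"]

print(theta_y)
print(cpt_x_given_y)
```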

3.4) Find the value of ??(?? = ??|?? = ??,?? = ??) using the values obtained for ?? from
the above question Q3.3.
[Marks XXXXXXXXXX = 16]

Q4) Bayesian Structure Learning [30 Marks]

For this question, you will be using a dataset, called “Child”, which contains 20 variables.
This dataset provides information about diagnosing congenital heart disease in a
newborn "blue baby". The csv file (“CHILD10k.csv”) containing the dataset can be
downloaded from CloudDeakin.

Use the following R code to load the Child dataset:

ChildData <- read.csv(file="CHILD10k.csv", header=TRUE, sep=",")

The true network structure of this dataset can be viewed (plot) using the following R
code.

Use R programming, as appropriate, to answer the following questions.

4.1) Use the Child dataset to learn Bayesian network structures using the hill-climbing
(hc) algorithm, utilizing two different scoring methods, namely the Bayesian
Information Criterion score (BIC score) and the Bayesian Dirichlet equivalent
score (BDe score), for each of the following sample sizes of the data:

a) 100 (first 100 data)
b) 500 (first 500 data)
c) 1000 (first 1000 data)
d) 5000 (first 5000 data)
library(bnlearn)
#create and plot the network structure.
modelstring = paste0("[BirthAsphyxia][Disease|BirthAsphyxia][LVH|Disease][DuctFlow|Disease]",
"[CardiacMixing|Disease][LungParench|Disease][LungFlow|Disease][Sick|Disease]",
"[HypDistrib|DuctFlow:CardiacMixing][HypoxiaInO2|CardiacMixing:LungParench]",
"[CO2|LungParench][ChestXray|LungParench:LungFlow][Grunting|LungParench:Sick]",
"[LVHreport|LVH][Age|Disease:Sick][LowerBodyO2|HypDistrib:HypoxiaInO2]",
"[RUQO2|HypoxiaInO2][CO2Report|CO2][XrayReport|ChestXray][GruntingReport|Grunting]")

dag = model2network(modelstring)
par(mfrow = c(1,1))
#source("https://bioconductor.org/biocLite.R")
#biocLite("Rgraphviz")
graphviz.plot(dag)

For each of the above cases,
• provide the scores obtained for BIC and BDe,
• Plot the network structure obtained for the BIC and BDe scores.
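One possible shape for the Q4.1 loop, assuming the `bnlearn` package is installed. So that the sketch runs without the assignment's CSV, it uses `learning.test`, a discrete dataset shipped with bnlearn, as a stand-in. For the real task the data frame would be the Child data, loaded so that every column is a factor (for example with `stringsAsFactors = TRUE` in `read.csv`, which recent R versions no longer apply by default).

```r
library(bnlearn)

# Stand-in data (an assumption): replace with the Child data frame for the task.
data(learning.test)
dat <- learning.test

sizes <- c(100, 500, 1000, 5000)

results <- lapply(sizes, function(n) {
  subset_n <- dat[seq_len(n), ]              # first n rows
  net_bic <- hc(subset_n, score = "bic")     # hill-climbing with the BIC score
  net_bde <- hc(subset_n, score = "bde")     # hill-climbing with the BDe score
  list(
    n = n,
    net_bic = net_bic,
    net_bde = net_bde,
    bic = score(net_bic, subset_n, type = "bic"),
    bde = score(net_bde, subset_n, type = "bde")
  )
})

for (r in results) {
  cat(sprintf("n = %4d  BIC = %.2f  BDe = %.2f\n", r$n, r$bic, r$bde))
}
```

Each learned structure can then be drawn with `graphviz.plot(r$net_bic)` or `graphviz.plot(r$net_bde)` if Rgraphviz is available.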

4.2) Based on the results obtained for the above question (Q4.1), discuss how the BIC
score compares with the BDe score for different sample sizes in terms of the structure
and score of the learned network.
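To put numbers on the structural part of this discussion, two learned networks can be compared arc by arc. The sketch below assumes `bnlearn` is installed and again uses its bundled `learning.test` data as a stand-in for the Child data. Treating the BIC network as the reference is an arbitrary choice made for illustration.

```r
library(bnlearn)

# Stand-in data (an assumption): first 500 rows of a dataset shipped with bnlearn.
data(learning.test)
dat <- learning.test[seq_len(500), ]

net_bic <- hc(dat, score = "bic")
net_bde <- hc(dat, score = "bde")

# compare() treats `target` as the reference and counts true-positive,
# false-positive and false-negative arcs in `current`.
arc_diff <- compare(target = net_bic, current = net_bde)

# Structural Hamming distance between the two learned structures.
distance <- shd(net_bic, net_bde)

print(unlist(arc_diff))
print(distance)
print(c(bic_arcs = narcs(net_bic), bde_arcs = narcs(net_bde)))
```

Repeating this for each sample size gives a compact table of arc counts and distances to support the written comparison.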

4.3)
a) Find the Bayesian network structures utilising the full dataset, using
both BIC and BDe scores. Show the scores and the obtained networks.

b) Compare the networks obtained above (in Q4.3.a) for each of the BIC and BDe
scoring methods with the true network structure and comment. Use the
“compare()” function and “graphviz.compare()” function available in the
“bnlearn” R package to perform the

sit743-assignment-2-cedm4byp.pdf child10k-ajhp2hg0.csv

Abr Writing · Accepted Answer

assignment.R
library(bnlearn)
# source("https://bioconductor.org/biocLite.R")
# biocLite("RBGL")
library(RBGL)
library(gRbase)
library(gRain)
# biocLite("Rgraphviz")
library(Rgraphviz)
library(visNetwork)
## Problem 1.5
bn =
7.5 0.90271377 0.10969977 0.09371954
Parameters of node XrayReport (multinomial distribution)

Conditional probability table:

            ChestXray
XrayReport    Asy/Patch  Grd_Glass     Normal  Oligaemic  Plethoric
  Asy/Patchy 0.70207852 0.17474634 0.06840959 0.05291320 0.06310905
  Grd_Glass  0.09853734 0.63585118 0.01960784 0.02318668 0.02041763
  Normal     0.08160123 0.06877114 0.79433551 0.09601665 0.09559165
  Oligaemic  0.01924557 0.01465614 0.05925926 0.80707491 0.02134571
  Plethoric  0.09853734 0.10597520 0.05838780 0.02080856 0.79953596
Parameters of node Disease (multinomial distribution)

Conditional probability table:

, , LungParench = Abnormal, LungFlow = High

        LVH
Disease            no          yes
  Fallot 0.0435643564 0.0394736842
  Lung   0.1009900990 0.1052631579
  PAIVS  0.0000000000 0.1710526316
  PFC    0.0059405941 0.0000000000
  TAPVD  0.1643564356 0.0657894737
  TGA    0.6851485149 0.6184210526

, , LungParench = Congested, LungFlow = High

        LVH
Disease            no          yes
  Fallot 0.0194174757 0.0434782609
  Lung   0.0711974110 0.0869565217
  PAIVS  0.0000000000 0.0869565217
  PFC    0.0064724919 0.0000000000
  TAPVD  0.5501618123 0.2608695652
  TGA    0.3527508091 0.5217391304

, , LungParench = Normal, LungFlow = High

        LVH
Disease            no          yes
  Fallot 0.0501829587 0.0421686747
  Lung   0.0005227392 0.0000000000
  PAIVS  0.0057501307 0.2771084337
  PFC    0.0052273915 0.0030120482
  TAPVD  0.0167276529 0.0000000000
  TGA    0.9215891270 0.6777108434

, , LungParench = Abnormal, LungFlow = Low

        LVH
Disease            no          yes
  Fallot 0.6254826255 0.0938511327
  Lung   0.0675675676 0.0226537217
  PAIVS  0.0637065637 0.8349514563
  PFC    0.1833976834 0.0226537217
  TAPVD  0.0154440154 0.0064724919
  TGA    0.0444015444 0.0194174757

, , LungParench = Congested, LungFlow = Low

        LVH
Disease            no          yes
  Fallot 0.5212765957 0.0636363636
  Lung   0.0744680851 0.0181818182
  PAIVS  0.0265957447 0.8909090909
  PFC    0.1382978723 0.0181818182
  TAPVD  0.2234042553 0.0090909091
  TGA    0.0159574468 0.0000000000

, , LungParench = Normal, LungFlow = Low

        LVH
Disease            no          yes
  Fallot 0.7859154930 0.1162196679
  Lung   0.0018779343 0.0012771392
  PAIVS  0.0755868545 0.8627075351
  PFC    0.0807511737 0.0127713921
  TAPVD  0.0014084507 0.0000000000
  TGA    0.0544600939 0.0070242656

, , LungParench = Abnormal, LungFlow = Normal

        LVH
Disease            no          yes
  Fallot 0.1108742004 0.0641025641
  Lung   0.4776119403 0.3205128205
  PAIVS  0.0085287846 0.3717948718
  PFC    0.1023454158 0.
