[Title of your report]
Introduction
Provides clear and concise context for the report, introducing the purpose of the analyses that follow.
As a guideline, one paragraph will be sufficient.
[Delete instruction text before submitting]
[Type your introduction here]
Motivation and Methodology
Describe the motivation for the analysis methods and tools that you have used in each section. This section must answer three questions: what you did, why you did it, and how you did it.
As a guideline, a maximum of two paragraphs will be sufficient.
[Delete instruction text before submitting]
[Type your description of methods here]
Results & Discussion
Summarise the main results of your analyses in each section I to IV. You may use subsections, tables etc. as you see fit. Present and discuss results in a clear and simple way:
Present findings of statistical analyses in a logical sequence.
Do not include code or dumps of R output. Results should either be incorporated into sentences or formatted appropriately to be neatly presented.
Interpret your findings by discussing their practical significance.
Discuss shortcomings, if any.
As a guideline, a maximum of three paragraphs will be sufficient.
[Delete instruction text before submitting]
[Type your results and discussion here]
Recommendations & Conclusions
Based on your analysis, provide a brief overall discussion summarising/interpreting the results of the analyses you performed and final conclusions based on the hypothesis tested.
As a guideline, one paragraph will be sufficient. Do not introduce any new information in this section, and do not simply repeat statements made elsewhere in your report!
[Delete instruction text before submitting]
[Type your recommendations and conclusions here]
MATH 1081 UO Mathematical Methods
for Data Analytics 2
Assessment 2.2 : Project Part B
Instructions:
• Structure of the assessment: This assessment is worth 35% of your final grade
and is due no later than 5 pm on Friday, Week 10. This assessment consists
of 20 questions in 4 sections and a written report. Your submission
will be marked out of 100.
• Use of R: This project is a guided case study. It is important that you follow any
instructions or guidance in the questions, such as “Use R” where required. You
must provide your R code to get full marks wherever you use R to answer
the questions. Upload your R script and include screenshots of the R code and
outputs in your answer sheet.
• Save your work: Save your answer sheet as a pdf named “your student
ID Assessment 2.2 MATH1081.pdf”.
• Show your work: Show all necessary steps so that the reader can follow your
solution procedure.
• Submit your work: Create a folder with
1. your answer sheet
2. your R script and
3. the final dataset you used for the analysis in “.csv” format.
Name your folder with your student ID and upload it as a zip file.
• Acknowledgement of work: When submitting online, you acknowledge that
the submitted assignment is your own work unless otherwise stated.
• Academic integrity: The University’s policy on academic misconduct will be
strictly applied. Here are some tips to avoid academic misconduct:
– Do not copy from any printed or electronic source or from any person.
– Write your own solutions. You may discuss your work with others, but
you must write up your solutions yourself. You are not allowed to use some-
one else’s written work when writing up your submission.
– Do not give inappropriate help. Giving inappropriate help is just as
serious as receiving it and will have the same consequences. Do not show
your completed exercise to others. Dispose of drafts so that no one can access
them.
– Acknowledge help and joint work. If you receive any help from another
source (for example, students, tutors, friends, internet), you must make a
note of it on your submission.
• Late submission: Any late submission will attract a penalty of 5 marks avail-
able per day for five days. The cut-off time is 5 pm each day. After five
days from the assessment due date, no submissions will be marked, and zero
marks will be granted.
Assessment Task Overview
Photo by Luke van Zyl on Unsplash
This assessment is based on the data in the Melbourne housing.csv file. It contains
residential building data, including construction cost, sales prices, some project
variables, and some economic variables corresponding to real estate in Melbourne,
Australia. The objective is to understand, analyse and develop a model to predict
the sales price (Price). A brief description of the variables is provided below.
Data dictionary
Variable        Description
Suburb          Suburb
Address         Street address
Rooms           Number of rooms
Type            Type of housing:
                br - bedroom(s); h - house, cottage, villa, semi, terrace;
                u - unit, duplex; t - townhouse; dev site - development site;
                o res - other residential.
Price           Actual sales price (local currency)
Method          S - property sold; SP - property sold prior; PI - property passed in;
                PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid;
                VB - vendor bid; W - withdrawn prior to auction;
                SA - sold after auction; SS - sold after auction price not disclosed;
                N/A - price NA.
SellerG         Real estate agent
Date            Date sold
Distance        Distance from CBD in kilometres
Regionname      General region (West, North West, North, North East, etc.)
Propertycount   Number of properties that exist in the suburb
Bedroom2        Scraped number of bedrooms (from a different source)
Bathroom        Number of bathrooms
Car             Number of car spots
Landsize        Land size in metres
BuildingArea    Building size in metres
YearBuilt       Year the house was built
CouncilArea     Governing council for the area
Lattitude       Self-explanatory
Longtitude      Self-explanatory
Table 1: Data dictionary Melbourne Housing.csv
Assessment Task Details
You have to complete this assessment in two sections.
1. A list of questions to answer, comprising 72% of the total grade (72 marks).
Write your answers clearly in a well-organised manner with accurate notation.
Label the questions and sub-questions.
2. A report summarising your analysis in Section 1, comprising 28% of the
total grade (28 marks). A guide for the project report is provided in learnonline.
Section 1: Questions
[I] Descriptive Statistics & Exploratory Analysis:
The data is not always cleaned and presented in a working manner. There are some
unnecessary columns, and some variables do not have complete entries. In addition,
you might have errors in this dataset, and you have to fix them before you start
analysing. You can do data cleansing in R or Excel.
(a). Choose & filter a single house ‘Type’. Use this for the remainder of the
assignment, as completed in Project Part A. Create a subset dataset of size at least
250 with the continuous variables, ‘Postcode’ and ‘Year = 2018’. Hint: Use the
na.omit function. For full marks, provide a screenshot of the first 30 row entries
of the cleaned dataset in R. [2 marks]
(b). Use R to produce histograms of all the possible continuous variables. [4 marks]
(c). Use R to produce descriptive statistics for all the variables in part (a).
[4 marks]
(d). Use R to produce boxplots describing the continuous variables side by side. This
should be a picture of one plot. [2 marks]
(e). Using your outputs from (a) to (d), comment on the shape of the distribution for
each variable. In particular, briefly describe in a table form:
• Whether there is one peak, or multiple peaks, in the distribution;
• The shape of the distribution (skewed or symmetric);
• Whether there appear to be any outliers. [5 marks]
Example table layout:
Variable    Number of peaks in the distribution    Shape of the distribution               Outliers present
            (One/Multiple)                         (Left-skewed/Right-skewed/Symmetric)    (Yes/No)
(f). Which central tendency (mean/median) and dispersion (standard deviation/
interquartile range) measures are the most appropriate to summarise the variables
numerically? Justify your choice of measures. Provide your answers in a table
form. For full marks, provide the general interpretation for the listed summary
measures. [4 marks]
Example table layout:
Variable    Measure of central tendency    Measure of dispersion    Justification
            (mean/median)                  (SD/IQR)
(g). Use R to test the variables for Normality. Briefly describe whether the data fol-
lows a Normal distribution. Tabulate your answer. [4 marks]
Example table layout:
Variable    P-value    Reject H0    Normally distributed
                       (Yes/No)     (Yes/No)
[25 marks]
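As a rough illustration of the workflow in parts (a), (b) and (g), the sketch below filters one housing type, drops incomplete rows, draws a histogram per variable and runs a normality test on each continuous variable. It uses a small simulated data frame in place of the real file, so the object names (`housing`, `houses`, `normality`), the made-up values and the choice of `shapiro.test()` as the normality test are all assumptions, not the prescribed method. The Postcode and Year steps of part (a) are not shown.

```r
# Simulated stand-in for the real dataset (values are made up for illustration).
set.seed(1)
n <- 400
housing <- data.frame(
  Type         = sample(c("h", "u", "t"), n, replace = TRUE),
  Price        = round(rlnorm(n, meanlog = 13.7, sdlog = 0.4)),
  Landsize     = round(rlnorm(n, meanlog = 6.2, sdlog = 0.5)),
  BuildingArea = round(rnorm(n, mean = 150, sd = 40))
)
housing$BuildingArea[sample(n, 20)] <- NA  # inject some missing entries

# (a) keep one housing type, then drop rows with missing values
houses <- na.omit(subset(housing, Type == "h", select = -Type))
head(houses, 30)

# (b) one histogram per continuous variable
for (v in names(houses)) {
  hist(houses[[v]], main = paste("Histogram of", v), xlab = v)
}

# (g) Shapiro-Wilk test per variable; H0: the variable is normally distributed
p_values <- sapply(houses, function(x) shapiro.test(x)$p.value)
normality <- data.frame(
  Variable  = names(p_values),
  P_value   = signif(unname(p_values), 3),
  Reject_H0 = unname(p_values) < 0.05
)
print(normality)
```

A small p-value (below the chosen significance level, 0.05 in this sketch) leads to rejecting H0, which would be recorded as "not normally distributed" in the table for part (g).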
[II] Normal Distribution & Central Limit Theorem:
(h). Use R to calculate the probability that the average house (unit) Price will be
more than $1,000,000 ($600,000) using the provided data. For full marks, clearly
state the distribution of average sales price and the correct probability statement.
Interpret your final answer. [5 marks]
(i). Use R to calculate the probability that the average house (unit) Price will be
less than $1,000,000 ($600,000). Clearly state the correct probability statement.
Provide an interpretation of your final answer. [2 marks]
(j). What is the cutoff such that the probability of an average Price higher than
the cutoff is 5%? For full marks, provide a correct probability statement.
[3 marks]
(l). Using R, produce a random sample of size 30 for variable BuildingArea by ran-
domly selecting 30 values without replacement from the BuildingArea variable in
the provided data. Repeat the same for Landsize variable. For full marks, provide
a screenshot of your samples in a table format.
Hint: Use data.frame() to tabulate samples [4 marks]
(k). Use R to produce the descriptive statistics for each sample in part (l), and store
the information in another table. Please ensure you state the mean and standard
deviation of each sample. [2 marks]
(m). Determine the sampling distribution of means for BuildingArea and Landsize and
state the parameters based on your samples in part (l). Justify your answer,
quoting any theorems you used. [3 marks]
(n). Calculate the probability that the average Landsize is greater than 650 based on
your sampling distribution of the means from part (m). For full marks, provide
a correct probability statement and interpret the final answer. [3 marks]
[22 marks]
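The calculations in this section follow a common pattern, which can be sketched as below. The vector `landsize` is a simulated stand-in with made-up values, and all object names are hypothetical. The sketch assumes the sampling distribution of the mean is approximated by a Normal distribution with the sample mean as its centre and s / sqrt(n) as its standard deviation, which is one reasonable reading of part (m), not the only acceptable one.

```r
# Simulated stand-in for the Landsize column (made-up values).
set.seed(42)
landsize <- round(rlnorm(300, meanlog = 6.3, sdlog = 0.4))

# (l) random sample of 30 values without replacement
n <- 30
samp <- sample(landsize, size = n, replace = FALSE)

# (k) sample mean and standard deviation
x_bar <- mean(samp)
s     <- sd(samp)

# (m) Central Limit Theorem: for n = 30 the sample mean is treated as
# approximately Normal with mean x_bar and standard error s / sqrt(n)
se <- s / sqrt(n)

# (n) P(sample mean > 650)
p_greater <- pnorm(650, mean = x_bar, sd = se, lower.tail = FALSE)

# Same idea as the cutoff in (j): value c with P(sample mean > c) = 0.05
cutoff <- qnorm(0.95, mean = x_bar, sd = se)

data.frame(mean = x_bar, sd = s, se = se, p_greater = p_greater, cutoff = cutoff)
```

Here `pnorm(..., lower.tail = FALSE)` returns the upper-tail probability directly, and `qnorm(0.95, ...)` returns the value that leaves 5% of the distribution above it.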
[III] Estimating & determining the population mean:
(o). Manually construct a 95% confidence interval for the population mean for
BuildingArea and Landsize based on the sampled data in part (l). Use R to verify the
results. Interpret your confidence interval.
Hint: t_{29, 0.025} = 2.045 [3 marks]
(p). Repeat the previous question for a 99% confidence interval for the population
mean of the same variables based on the sampled data in part (l).
Hint: t_{29, 0.005} = 2.756 [3 marks]
(q). Compare and contrast the 99% confidence intervals for the two variables in part
(p), and comment whether the means of the original dataset Melbourne Housing
for BuildingArea and Landsize are included in these interval estimates. Justify
your answer. [2 marks]
[8 marks]
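One possible way to check a manually constructed interval against R is sketched below. The vector `samp` is a simulated stand-in for a sample of 30 values (made-up numbers), the object names are hypothetical, and using `t.test()` for the verification step is an assumption about what "use R to verify" means here.

```r
# Simulated stand-in for a sample of 30 BuildingArea values (made-up values).
set.seed(7)
samp <- round(rnorm(30, mean = 150, sd = 40))

n     <- length(samp)
x_bar <- mean(samp)
se    <- sd(samp) / sqrt(n)

# Manual interval: x_bar +/- t * se, with t taken from the t distribution
# on n - 1 = 29 degrees of freedom
t_95 <- qt(0.975, df = n - 1)  # about 2.045
t_99 <- qt(0.995, df = n - 1)  # about 2.756
ci_95 <- x_bar + c(-1, 1) * t_95 * se
ci_99 <- x_bar + c(-1, 1) * t_99 * se

# Verification with the built-in one-sample t procedure
ci_95_r <- as.numeric(t.test(samp, conf.level = 0.95)$conf.int)
ci_99_r <- as.numeric(t.test(samp, conf.level = 0.99)$conf.int)

all.equal(ci_95, ci_95_r)
all.equal(ci_99, ci_99_r)
```

The 99% interval is always wider than the 95% interval from the same sample, because the larger critical value multiplies the same standard error.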
[IV] Testing claims & Hypothesis Tests:
Hint: Use the whole dataset to answer the questions in this section.
For full marks, define the parameters of interest appropriately, set up the null
and alternative hypotheses, and clearly state the decision and the conclusion of the
test.
(r). The project management team of these housing projects argues that there is
no difference between the variables BuildingArea and Landsize. Use R to
statistically test at a 5% level of significance if there is a difference in the average of
BuildingArea and Landsize. Give a verdict and conclusion to your analysis. [5 marks]
(s). They further claim that there is a difference between the variables BuildingArea
and Landsize. Use R to statistically test at a 1% level of significance if there is a
difference in the average of BuildingArea and Landsize. [5 marks]
(t). Another claim the project management team is making is that ideally the
average house (unit) Price should be greater than $1,000,000 ($600,000). Using R,
statistically test at a 10% level of significance whether the average house (unit)
price is greater than $1,000,000 ($600,000). Include a diagram for the hypothesis
test. [5 marks]
(u). What does it mean by