You will generate a report presenting a data set and analysis. While you can use code to ask questions...

Question

You will generate a report presenting a data set and analysis. While you can use code to ask questions of data, the answers to the questions are meaningless if you can't share them with others!

By completing this assignment you will practice and master the following skills:

Declaring document rendering using Markdown syntax

Rendering R Markdown files using knitr

Synthesizing skills, tools, and concepts from across the course

Completing the Assignment

Open RStudio in a new tab, then start the assignment called "M6_assignment_data_reporting". You will put your code in the provided analysis.R file, but you will also need to create your own index.Rmd R Markdown file using the built-in wizard. Place this file in the "root" of the project directory.

PICKING A DATA SET

For this assignment you can work with a data set of your choosing. It's perfectly acceptable to work with a data set from one of the previous assignments (though not one from the exercises); in fact, that is the "default" option.

However, if you wish to practice your data programming skills on data from a different domain, you can do that as well. You will need to locate your own data set in this case.

By "data set", we mean a .csv file similar to ones you've worked with previously. It's acceptable to use an alternate data format (like a relational database or web API), but that goes beyond the scope of what we've already covered, so it will require extensive additional outside study. Note that a "web site" (e.g., for web crawling) is not appropriate for this assignment, nor are image databases. This is not a machine learning course; we're just doing the basics! Stick with what you know :)

Open Data Government sites (e.g., for Seattle), Kaggle, and the FiveThirtyEight blog are also generally good places to find easy-to-work-with data.

Your data set need not be "Big Data", but should be of sufficient size to do some interesting analysis. Having at least 100 observations across 3-4 unique features is a good size. Make sure that your .csv is less than 50 MB so that there are no problems sharing it (if it's larger than that, take a subset!).

Upload your data file to your project (using the "Upload" button in the file pane) and save it in the provided data/ folder.
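The size guidelines above can be checked with a few lines of R before committing to a data set. This is only a minimal sketch: it writes R's built-in iris data to a temporary .csv as a stand-in for your own file in data/, and the variable names are illustrative assumptions rather than anything the assignment prescribes.

```r
# Stand-in for your own file (e.g. a .csv saved in the data/ folder).
# Here the built-in iris data is written to a temporary file so the sketch runs anywhere.
csv_path <- tempfile(fileext = ".csv")
write.csv(datasets::iris, csv_path, row.names = FALSE)

# Load the data the same way you would in analysis.R
my_data <- read.csv(csv_path, stringsAsFactors = FALSE)

# Compare against the guidelines: 100+ observations, 3-4+ features, under 50 MB
n_observations <- nrow(my_data)
n_features <- ncol(my_data)
size_mb <- file.size(csv_path) / 1024^2

meets_guidelines <- n_observations >= 100 && n_features >= 3 && size_mb < 50
meets_guidelines
```

If the file turns out to be too large, taking the first rows with head() or a random subset of rows and saving that with write.csv() is one way to shrink it.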

MAKING THE REPORT

You will present your data analysis in a single report created with R Markdown and knitr.

The report will be written in a file called index.Rmd. You will need to create this file (you can do this through the RStudio wizard, as described in Chapter 18). This file will contain your report, including both text in Markdown and instructions to execute R code that dynamically produces the data shown in the report.

Be sure to specify appropriate metadata, including the title, your name as the author, and the date the report was generated. These should automatically be set up through the RStudio wizard.

Your report will include an R code chunk called setup (with include=FALSE as a specified option), as described and shown in the textbook. In this code chunk, use the source() function to run your analysis.R script. Because the chunk has include=FALSE, any printed output from your script will not be shown, but the variables containing your plots will be defined so you can use them later.

Your R Markdown should use a relative path to the analysis.R file, with the assumption that they are in the same folder.
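To illustrate what the source() call in the setup chunk does, the sketch below builds a throwaway project folder, writes a stand-in analysis.R into it, and sources that script by relative path. The folder, the script contents, and the variable names are all hypothetical; in your real index.Rmd only the source("analysis.R") line would go inside the setup chunk.

```r
# Create a temporary "project" folder with a stand-in analysis.R (hypothetical contents)
project_dir <- tempfile("project")
dir.create(project_dir)
writeLines(c(
  "flowers <- datasets::iris",
  "mean_sepal_length <- mean(flowers$Sepal.Length)",
  "print(mean_sepal_length)  # printed output like this is hidden by include=FALSE"
), file.path(project_dir, "analysis.R"))

# By default knitr evaluates chunks with the .Rmd file's folder as the working
# directory, so a bare file name works as the relative path when both files
# sit in the same folder. setwd() imitates that here.
old_wd <- setwd(project_dir)
source("analysis.R")
setwd(old_wd)

# The variables defined by the script are now available to later chunks
exists("mean_sepal_length")
```

The point of the design is that the script runs once, quietly, and everything it defines can then be referenced from the rest of the report.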

Remember that you can't have any calls to View() in any code run by R Markdown! Be sure to remove or comment out any of those calls in your analysis.R.

Because plots may take some time to create, it may take a minute for your R Markdown file to knit. Be patient!

All of your "data wrangling" and analysis work must go in your analysis.R script! Only code related to the "presentation" should go in the R Markdown file. Any code generating data frames or plots should also go in the analysis.R file; save those plots to variables which you can then reference from the R Markdown. Debug your code in the analysis.R file, not in the Markdown!
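As a rough illustration of saving a plot to a variable, the following sketch shows what such a line in analysis.R might look like. It uses the built-in iris data and an invented variable name (petal_plot); neither is required by the assignment.

```r
library(ggplot2)

# Stand-in data; in your own analysis.R this would come from read.csv()
flowers <- datasets::iris

# Build the plot and store it in a variable instead of printing it.
# petal_plot is a hypothetical name chosen for this example.
petal_plot <- ggplot(flowers, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  labs(
    title = "Petal width vs. petal length",
    x = "Petal length (cm)",
    y = "Petal width (cm)"
  )
```

In index.Rmd, a chunk that contains nothing but the variable name (petal_plot) is then enough to display the figure.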

You can use the built-in Knit button in RStudio to render your .Rmd file into a .html file which you can open with a web browser. Simply click the Knit button at the top of RStudio, and your index.html file will be saved in the same directory as your .Rmd file. You will "re-knit" repeatedly as you work through the assignment to make sure everything works!
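If you prefer the console to the Knit button, the rmarkdown package's render() function performs the same kind of rendering. The wrapper below is a hypothetical helper written for this example (knit_report is not part of any package); it simply skips the work when the file or the package is missing.

```r
# Hypothetical helper: render an R Markdown file to HTML from the console.
knit_report <- function(rmd_path = "index.Rmd") {
  if (!file.exists(rmd_path)) {
    message("No file found at ", rmd_path, "; nothing to knit.")
    return(invisible(NULL))
  }
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    message("The rmarkdown package is not installed.")
    return(invisible(NULL))
  }
  # The output file (e.g. index.html) is written next to the .Rmd file
  rmarkdown::render(rmd_path, output_format = "html_document")
}

knit_report("index.Rmd")
```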

REPORT CONTENT

Your report will include a few different sections. Give each section an appropriate heading, and a sentence or so introducing it.

1. Data Description

The first part of your report will be a brief "introduction" to the data set and your analysis. This section will include a paragraph presenting the following information:

A non-technical description of the data sets you will be using (what is the data?). This only needs to be a sentence or two.

An explanation of where the data comes from, who originally collected the data, and any other information we may need to know about how this data set came to be. You must include a hyperlink to the source; we must be able to follow the link and find your data set ourselves. Again, this only needs to be a sentence or two.

For example, if you're working with the A3 World Bank data, you'd include a link to their website.

A sample of the data set, so that we can see what raw data you'll be working with ("the data set looks like this"). This means that you'll need to load the data set into R (e.g., with read.csv()) and present it as a table (or multiple tables) in your report. Do not include the entire table; just a few rows is sufficient. Think about the "user experience" of reading the report!
- Use the kable() function to render a readable data table.
- You don't need to include all columns of your data frames; only including the most important/relevant ones is acceptable. You are not required to do substantive data cleaning (or even rename columns), though it wouldn't hurt to do some of that wrangling now instead of later.

  Remember to do your data wrangling in the .R script file, not in the R Markdown file!

  If any of the column names are not intuitive, also include a brief explanation. For example, if you're working with the A3 World Bank data, you'd include an explanation of which indicators you're working with.
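A sample table along these lines could be prepared as in the sketch below. It assumes the built-in iris data as a stand-in for your own data set, and the chosen columns and display names are purely illustrative.

```r
# In analysis.R: keep only a few relevant columns and the first few rows
flowers <- datasets::iris
sample_table <- head(flowers[, c("Species", "Sepal.Length", "Petal.Length")], 5)

# In index.Rmd: render the sample as a readable table.
# col.names only changes the displayed headers, not the data frame itself.
knitr::kable(
  sample_table,
  col.names = c("Species", "Sepal length (cm)", "Petal length (cm)")
)
```

Renaming the headers at display time is one way to explain unintuitive column names without doing extra wrangling.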

This section must include some text formatting using Markdown (such as making text either bold or italic).

2. Data Analysis

The second part of your report will be the analysis of your data. Your report will include two (2) different "questions" and the analysis that explores those questions.

For example, if you're working with the A3 World Bank Data, you could pick any two visualizations as your questions: "How are CO2 emissions distributed globally?" "How has the share of wealth in the USA changed between groups over time?"

Look back at some of the reflections and analyses that you did in previous assignments to get a sense for what kinds of questions you might ask!

Each question should be presented in its own section (with a second-level heading). For each question, include the following:

A sentence or so presenting the question.

A graphical data representation (a plot, created with ggplot) that explores that question.

A brief evaluation of your exploration stating your conclusions (the "answer" to the questions you asked).

Your evaluation cannot rely purely on visual or anecdotal analysis (no "the line goes up!" or "the measure for one state looks large!"). Instead, it must use some descriptive statistics (e.g., mean/median) or measures of effect strength (e.g., correlations or predictive statistics) to definitively state relationships among your data. You do not need to perform advanced statistical analysis (this is not a stats class!), but your conclusions need to be grounded in the data, not in the representation.
- It's quite likely that the results may not provide the answer you expected, and that's okay! In your evaluation, you can mention that, and offer a guess as to why your assumptions didn't hold up.

  The descriptive statistics you use to answer your questions must be included in your report as inline R expressions. For example, you might have a sentence "The USA is the largest polluter in the world", where "USA" is an inline value drawn directly from the data.
- Again, remember to do your data analysis in the analysis.R file! Save whatever values you want to include in your report inside of specific variables (or lists of values).
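One possible way to collect report values in analysis.R is sketched below. It assumes the built-in iris data and an invented list name (report_values); your own statistics will depend on the questions you ask.

```r
# In analysis.R: compute the statistics once and store them in a named list
flowers <- datasets::iris

report_values <- list(
  # A descriptive statistic
  mean_sepal_length = round(mean(flowers$Sepal.Length), 2),
  # A measure of effect strength (Pearson correlation)
  petal_correlation = round(cor(flowers$Petal.Length, flowers$Petal.Width), 2),
  # A value drawn directly from the data rather than typed by hand
  longest_petal_species = as.character(flowers$Species[which.max(flowers$Petal.Length)])
)

# In index.Rmd, these are referenced with inline R expressions, for example:
#   The correlation between petal length and width is `r report_values$petal_correlation`.
#   The species with the longest petal is `r report_values$longest_petal_species`.
```

Because the sentence pulls its numbers from the list, the text stays correct if the underlying data changes and the report is knit again.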

SUBMITTING YOUR WORK

We will grade your assignment by looking at your work in RStudio Cloud. You can also download the completed .R and .Rmd files from RStudio Cloud and upload them here. Then return to this page and click Next.

GRADING RUBRIC

Each item in the grading rubric below will be scored roughly as follows:

100%: Meets all requirements for grade item. Report is effectively coded, written, and presented.

80%: Meets most requirements for grade item. Report may have a few errors or be missing minor aspects/components.

60%: Meets many (but not all) requirements for grade item. Report may be missing significant aspects/components or have multiple errors.

40%: Meets only a few requirements. Report may be started but incomplete.

0%: Missing or meets no requirements. Report demonstrates no understanding of course material.

Rubric

M6 Data Report Rubric

Making the Report (10 pts)
You have created a report using R Markdown:
- You created an index.Rmd file with appropriate metadata
- Your report includes a setup chunk that sources your analysis file (all data wrangling should be in the analysis file)
- Your report is overall organized and structured effectively so that it is readable
- You have knit the report into an index.html file

Data Set Analysis (5 pts)
You have effectively wrangled and analyzed your chosen data set using R techniques introduced in this course. You will receive full credit for this grade item even if you just use your wrangling work from a previous assignment.

Report Content: Data Description (20 pts)
Your report includes a section describing your data, including:
- a non-technical description of the data set
- information on the source of the data
- a hyperlink to the source of the data
- a sample data table (created using kable())
- Markdown text formatting

Report Content: Data Analysis (30 pts)
You have included two subsections presenting your data analysis. Each subsection includes:
- a statement of the "question" being analyzed
- a graphical representation of the analysis (a well-designed data plot)
- a paragraph evaluating the results of the analysis
- inline R expressions presenting the analysis results

Total Points: 65

Solution

Mukesh answered on Mar 25 2023

Data Analysis
Steph
2023-03-25
About IRIS Dataset
The Iris dataset was used in R.A. Fisher’s classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.
It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other. The columns in this dataset are:
Id, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
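library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(gridExtra)
library(grid)
library(plyr)
## Warning: package 'plyr' was built under R version 4.2.2
# Load the data set from the .csv file
iris <- read.csv("iris.csv")
# View(iris)  # interactive only; View() must not be called in code run by R Markdown
summary(iris)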
##        Id          Sepal.Length    Sepal.Width     Petal.Length
##  Min.   :  1.00   Min.   :4.300   Min.   :2.000   Min.   :1.000
##  1st Qu.: 38.25   1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600
##  Median : 75.50   Median :5.800   Median :3.000   Median :4.350
##  Mean   : 75.50   Mean   :5.843   Mean   :3.054   Mean   :3.759
##  3rd Qu.:112.75   3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100
##  Max.   :150.00   Max.   :7.900   Max.   :4.400   Max.   :6.900
##   Petal.Width      Species
##  Min.   :0.100   Length:150
##  1st Qu.:0.300   Class :character
##  Median :1.300   Mode  :character
##  Mean   :1.199
##  3rd Qu.:1.800
##  Max.   :2.500
Including Plots
Density and frequency analysis with histograms. You can also embed plots, for example:
# Sepal length: histogram filled by species, with a dashed line at the overall mean
HisSl <- ggplot(data = iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.2, color = "black", aes(fill = Species)) +
  xlab("Sepal Length (cm)") +
  ylab("Frequency") +
  theme(legend.position = "none") +
  ggtitle("Histogram of Sepal Length") +
  geom_vline(data = iris, aes(xintercept = mean(Sepal.Length)), linetype = "dashed", color = "grey")
# Sepal width
HistSw <- ggplot(data = iris, aes(x = Sepal.Width)) +
  geom_histogram(binwidth = 0.2, color = "black", aes(fill = Species)) +
  xlab("Sepal Width (cm)") +
  ylab("Frequency") +
  theme(legend.position = "none") +
  ggtitle("Histogram of Sepal Width") +
  geom_vline(data = iris, aes(xintercept = mean(Sepal.Width)), linetype = "dashed", color = "grey")
# Petal length
HistPl <- ggplot(data = iris, aes(x = Petal.Length)) +
  geom_histogram(binwidth = 0.2, color = "black", aes(fill = Species)) +
  xlab("Petal Length (cm)") +
  ylab("Frequency") +
  theme(legend.position = "none") +
  ggtitle("Histogram of Petal Length") +
  geom_vline(data = iris, aes(xintercept = mean(Petal.Length)),
             linetype = "dashed", color = "grey")
# Petal width (this plot keeps the legend, placed on the right)
HistPw <- ggplot(data = iris, aes(x = Petal.Width)) +
  geom_histogram(binwidth = 0.2, color = "black", aes(fill = Species)) +
  xlab("Petal Width (cm)") +
  ylab("Frequency") +
  theme(legend.position = "right") +
ggtitle("Histogram...
