Problem Set 1
Statistics 100
Due June 29, 2020 at 11:59 pm
Problem set policies. Please provide concise, clear answers for each question. Note that only writing the result of
a calculation (e.g., "SD = 3.3") without explanation is not sufficient. For problems involving R, be sure to include
the code in your solution.
Please submit your problem set via Canvas as a PDF, along with the R Markdown source file.
We encourage you to discuss problems with other students (and, of course, with the teaching team), but you must
write your final answer in your own words. Solutions prepared "in committee" are not acceptable. If you do
collaborate with classmates on a problem, please list your collaborators on your solution.
Problem 1.
For each of the following scenarios, discuss (in at most five sentences) the main issue(s) with
respect to sampling or reporting bias.
a) A particular city has 14 architects who own their own firm. To select a survey sample, each
architect was contacted via telephone by order of appearance in the telephone directory,
then the first 8 that agreed to be interviewed formed the sample.
b) The September 1992 issue of Prevention magazine included a women’s health survey;
approximately 16,500 women responded to the survey. The May 1993 issue reported on the
survey results, claiming that “92% of our readers rated their health as excellent, very good,
or good”.
c) Many scholars and policymakers are interested in estimating the prevalence of mental
illness among the homeless population. In one study, the authors sampled homeless persons
who received medical attention from a clinic that was part of the Health Care for the
Homeless project, resulting in an estimated prevalence of 33%.[1] The authors maintain that
selection bias is not a serious problem because the clinics are easily accessible to homeless
people.
Problem 2.
A recently published analysis examined 10 studies that measured optimism and pessimism by
asking participants about their level of agreement with statements like “In uncertain times, I
usually expect the best,” or “I rarely expect good things to happen to me”. Optimistic people
tend to expect that they will encounter favorable outcomes, whereas less optimistic people tend
to expect that they will encounter unfavorable outcomes.[2]
These studies also measured other variables on participants, including factors related to heart
disease. The analysis found that compared with pessimists, people with the most optimistic
outlook had a 35% lower risk for cardiovascular events (e.g., heart attacks). The studies, on average,
observed people over a 14-year period and compared the rate of cardiovascular events between
those classified as optimists versus pessimists.
[1] This project is a federally funded program that brings general health and mental health services to homeless people.
[2] Alan Rozanski, MD, et al. Association of optimism with cardiovascular events and all-cause mortality. JAMA Network Open 2019; 2(9):e1912200.
a) A popular newspaper reports on the analysis with the headline “Thinking Positively
Improves Cardiovascular Health”. Write a short response to the editor explaining clearly why
the headline is potentially misleading. Be sure to use language accessible to a general
audience without a statistics background. Limit your answer to at most five sentences.
b) Briefly describe a plausible study design that has the potential to demonstrate the effect of
thinking positively on cardiovascular health.
c) Suppose someone who is very optimistic reads about the analysis and concludes that the
findings suggest he has a 35% lower risk for cardiovascular events than his friend who is
extremely pessimistic. Explain why this is not necessarily the case.
Problem 3.
The following graphs are based on data from the National Center for Health Certificates.
a) Describe what you see in the two graphs, with particular focus on the differences between
the two distributions.
b) Economists are interested in the possible causes driving the shape of the age distribution in
2016.
i. Discuss a possible reason behind the discrepancy between the 1980 distribution and
the 2016 distribution; i.e., what is a potential factor driving the difference in the
distributions?
ii. Discuss a possible reason behind the shape of the age distribution in 2016.
Problem 4.
The Stanford Open Policing Project is a team of researchers and journalists at Stanford University
working to collect and standardize data on vehicle and pedestrian stops from law enforcement
departments across the country, with the goal of investigating and improving interactions between
the police and the public. In a recently published analysis based on these data, the authors found
that police stops and search decisions suffer from persistent racial bias.[3]
In this problem, you will work with data from the Stanford Open Policing Project and conduct
an exploratory analysis based on approaches used by the study team. The dataset stops.Rdata
contains standardized data on police stops in Philadelphia, Pennsylvania between 2013 and 2017.
Each case represents a single police stop.
The variables are defined as follows:
– date: date of the stop, in YYYY-MM-DD format
– year: year of the stop
– time: 24-hour time for the stop, in HH:MM format
– location: freeform text of the location, e.g. street number and street name
– lat: latitude of the stop
– lng: longitude of the stop
– district: police district
– service_area: police service area
– subject_age: age of the stopped subject
– subject_race: race of the stopped subject, recorded as either white, black, hispanic,
asian/pacific islander, or other/unknown
– subject_sex: the recorded sex of the stopped subject
– type: type of stop, either vehicular or pedestrian
– arrest_made: recorded as TRUE if an arrest was made, and FALSE if otherwise
– outcome: strictest police action taken, either arrest, citation, warning, summons
– contraband_found: recorded as TRUE if contraband was found from a search, and FALSE if
otherwise
– frisk_performed: recorded as TRUE if a frisk was performed, and FALSE if otherwise
– search_conducted: recorded as TRUE if a search was conducted, and FALSE if otherwise
– search_person: recorded as TRUE if search of a person has occurred, and FALSE if otherwise
– search_vehicle: recorded as TRUE if search of a vehicle has occurred, and FALSE if otherwise
Use these data to answer the following questions.
a) Take an initial look at the stops dataset.
i. How many police stops are represented in the data?
ii. What date range does the data cover?
iii. Of the police stops recorded, what proportion of stops occurred in 2017?
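A minimal sketch of how part a) might be approached. It uses a small made-up data frame, `stops_toy`, in place of the real stops data; the values are invented for illustration, and only the column names `date` and `year` come from the variable list above.

```r
# toy stand-in for the stops data; all values are invented for illustration
stops_toy <- data.frame(
  date = as.Date(c("2013-01-05", "2015-07-19", "2017-03-02", "2017-11-30")),
  year = c(2013, 2015, 2017, 2017)
)

# i. each row is one stop, so the number of rows is the number of stops
n_stops <- nrow(stops_toy)

# ii. earliest and latest stop dates
date_range <- range(stops_toy$date)

# iii. the mean of a logical vector is the proportion of TRUE values
prop_2017 <- mean(stops_toy$year == 2017)

n_stops
date_range
prop_2017
```

With the real data, the same three calls would be applied to the object loaded from stops.Rdata.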
b) Describe the distribution of age of stopped subjects, referencing numerical and graphical
summaries as needed.
[3] E. Pierson, C. Simoiu, J. Overgoor, S. Corbett-Davies, D. Jenson, A. Shoemaker, V. Ramachandran, P. Barghouty,
C. Phillips, R. Shroff, and S. Goel. A large-scale analysis of racial disparities in police stops across the United States.
Nature Human Behaviour, Vol. 4, 2020.
c) To narrow the scope of the analysis, we will focus on vehicular police stops that occurred in
2017. Subset the data appropriately and name the subset stops.subset.
i. Using numerical and graphical summaries, describe the distribution of race of stopped
subjects, among vehicular stops in 2017. Does any race appear to be overrepresented?
ii. In a few sentences, briefly explain why it would be helpful to account for racial
demographics in Philadelphia when interpreting the values in part i.
iii. The dataset population_2017.Rdata contains information about racial demographics in
Philadelphia for 2017. Use this information to compute the “stop rate” for each group,
where stop rate is defined as number of police stops per member of the population. For
example, if 10 police stops occur in which the stopped subject is Asian, and there are
100 Asian members of the population, the stop rate for Asians is 10/100 = 0.10.
Report the stop rate for each race group.
iv. Based on the calculations in part iii., relative to white drivers, how much more often are
black drivers stopped by the police? Relative to white drivers, how much more often
are Hispanic drivers stopped by the police?
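The stop rate arithmetic in parts iii. and iv. can be sketched as follows. The counts and population sizes are hypothetical (the Asian entries mirror the 10/100 example above), and the layout of population_2017.Rdata is not shown here, so the named vectors are an assumption about how the inputs might be arranged.

```r
# hypothetical number of stops per group (not real Philadelphia data)
stop_counts <- c(white = 20, black = 60, hispanic = 30, asian = 10)

# hypothetical population size per group
population <- c(white = 400, black = 300, hispanic = 200, asian = 100)

# stop rate = number of stops / number of people in that group;
# indexing by name keeps the two vectors aligned
stop_rate <- stop_counts / population[names(stop_counts)]

# how many times more often each group is stopped, relative to white drivers
relative_to_white <- stop_rate / stop_rate["white"]

stop_rate
relative_to_white
```

With the real data, `stop_counts` could come from a call such as `table(stops.subset$subject_race)`, provided the group labels match those in the population data.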
d) After a driver is stopped, officers may carry out a search of the driver or vehicle if they
suspect more serious criminal activity. One strategy for understanding whether data suggest
biased decision-making is the outcome test, which is based on assessing the proportion of
searches that successfully identify contraband. If searches of minorities are successful less
often than searches of whites, this suggests that officers are searching minorities on the basis
of less evidence.
i. Calculate hit rates by race in Philadelphia in 2017 for vehicular stops, where hit rate
is defined as the proportion of searches in which contraband was found. Describe your
findings.
It may be the case that the bar for stopping people is lower in certain police districts, and
that minorities are more likely to live in neighborhoods in those districts. The dataframe
hit_rates.Rdata contains the hit rate for whites, blacks, and Hispanics in each police district
in Philadelphia (for vehicular stops and searches in XXXXXXXXXX Information about each district is
contained in two rows: one row contains the hit rates of black drivers and one row contains
the hit rates of Hispanic drivers.
ii. Create a plot that summarizes the relationship between the hit rates of black drivers
and the hit rates of white drivers for police districts in Philadelphia.
iii. Add a y = x line to the plot from part ii. Describe what a point on the y = x line would
represent in context of the data.
iv. With reference to the y = x line, describe what you see in the plotted data. Are the
results suggestive of bias against black drivers? Explain your answer.
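A minimal sketch of the outcome test in part d), using invented data. The data frames `stops_toy` and `district_rates`, and the column names `white_hit_rate` and `black_hit_rate`, are hypothetical; the real hit_rates.Rdata is described as having a different (two rows per district) layout, so it would need reshaping first. Recording `contraband_found` as NA when no search took place is also an assumption.

```r
# toy stop records; all values are invented for illustration
stops_toy <- data.frame(
  subject_race     = c("white", "white", "white", "black", "black",
                       "black", "black", "hispanic"),
  search_conducted = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE),
  # assumed to be NA when no search was conducted
  contraband_found = c(TRUE, FALSE, NA, TRUE, FALSE, FALSE, NA, FALSE)
)

# hit rate is defined over searches only, so keep rows where a search occurred
searched <- stops_toy[stops_toy$search_conducted, ]

# hit rate = proportion of searches in which contraband was found, by race
hit_rate <- tapply(searched$contraband_found, searched$subject_race, mean)
hit_rate

# hypothetical district-level hit rates, one row per district
district_rates <- data.frame(
  white_hit_rate = c(0.30, 0.25, 0.40),
  black_hit_rate = c(0.22, 0.25, 0.31)
)

# scatterplot of black versus white hit rates, one point per district
plot(district_rates$white_hit_rate, district_rates$black_hit_rate,
     xlim = c(0, 0.5), ylim = c(0, 0.5),
     xlab = "White hit rate", ylab = "Black hit rate")

# y = x reference line: a district on this line has equal hit rates
abline(a = 0, b = 1, lty = 2)
```

Points below the dashed line correspond to districts where searches of black drivers turn up contraband less often than searches of white drivers.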
Problem 5.
Vitamin D is essential for growth and bone health in children. It can be either obtained from
dietary sources or produced by the body upon exposure of skin to ultraviolet waves (typically via
sun exposure). Natural food sources rich in Vitamin D are scarce. Even in many low latitude
countries where sunshine is plentiful, Vitamin D deficiency is a public health concern.
A study was conducted to evaluate Vitamin D status among schoolchildren in Thailand. The
study drew data from a randomized trial conducted in rural subdistricts of a specific subregion of
the country that assessed the efficacy of a seasoning powder fortified with iron, zinc, iodine, and
Vitamin A for reducing anemia.
Exposure to sunlight allows the body to produce serum 25(OH)D, which is a marker of Vitamin
D status. Serum 25(OH)D is then converted into a biologically active form, serum 1,25(OH)2D.
Data on both serum levels were used to determine the prevalence of Vitamin D deficiency in the
subpopulation under study. Vitamin D deficiency is defined as having a serum 25(OH)D level
below 50 nmol/L.
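As a small illustration of the deficiency definition, the prevalence could be computed as the proportion of children whose serum 25(OH)D falls below the 50 nmol/L cutoff. The measurements below are invented, and the vector name `serum_25ohd` is hypothetical.

```r
# hypothetical serum 25(OH)D measurements in nmol/L (not study data)
serum_25ohd <- c(38.2, 61.5, 47.9, 72.0, 55.3)

# deficiency: serum 25(OH)D strictly below 50 nmol/L
deficient <- serum_25ohd < 50

# prevalence = proportion of deficient children in the sample
prevalence <- mean(deficient)
prevalence
```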
The file vitamin_d

Solution

Bezawada Arun answered on Jun 29 2021
---
title: "Problem Set 1"
author: "Ioannis Lamprou"
date: "26 June 2020 - 08:20"
output:
  pdf_document:
    fig_height: 3.5
    fig_width: 5
  word_document: default
geometry: margin=1in
fontsize: 11pt
---
## Problem 1.
a)
This is sampling bias. Because the architects were contacted in directory order and the sample was closed once the first eight agreed, those whose names come earlier in alphabetical order were more likely to be selected than those whose names come later.
In an unbiased sample, each person in the population has an equal chance of being selected.
b)
The 92% figure describes the women who responded to the survey; it does not mean that 92% of the magazine's readers gave these ratings.
The respondents were not randomly selected from the women who read the magazine. For instance, women in excellent health may be more likely to respond to this type of survey.
In my opinion, the magazine reported the result inaccurately to its readers.
c)
In my opinion, not every homeless person is equally likely to seek medical attention from a clinic. For instance, people who need general health services may be more likely to seek help from a clinic than people with mental illness. So even if the clinics are easily accessible, this sample is probably subject to selection bias.
## Problem 2.
a)
Dear (Editor's name),
The newspaper headline "Thinking Positively Improves Cardiovascular Health" is misleading because the studies do not show that thinking positively improves cardiovascular health.
The studies found that optimistic individuals had a 35% lower risk of cardiovascular events such as heart attacks. They did not show that positive thinking itself would improve your cardiovascular health.
b)
A plausible study design with the potential to demonstrate the effect of thinking positively on cardiovascular health would have the following features:
i. The study should have two parts: a systematic review of the latest research and a meta-analysis.
ii. The study should include a large sample of people of different ages, countries, and climates, followed over a long period of time.
iii. The study should account for other factors that affect cardiovascular health.
c)
I would explain to him that the research does not show that thinking positively improves cardiovascular health. There is a strong correlation between the two, but other factors, such as age and physical activity, need to be considered in the studies.
## Problem 3.
a)
The graph on the left (1980) shows a unimodal distribution with a slight right skew and a peak at about age 20. The second graph (2016), by contrast, shows a bimodal distribution with one peak at almost age 20 and a second at approximately age 29.
b)
i.

Comparing the two graphs suggests that women now have more opportunities than in the past, for instance to find jobs or pursue higher education. As a result, women may delay having children until they have reached their goals (joining the workforce, pursuing an education). This is the main difference between the two graphs and a likely driver of the shift over this period.

ii.
A possible reason for the bimodal shape of the age distribution in 2016 is geography. For example, in some places it is common for women to have children around age 20, while in others women tend to delay having children until their late 20s.
\newpage
## Problem 4.
a)
i.
```{r, warning = FALSE, message = FALSE}
# One-time setup: run these from the console, not while knitting.
# install.packages(c("knitr", "tinytex", "markdown", "rmarkdown"))
# tinytex::install_tinytex()
# rmarkdown::render("pset01summer2020i-frknptbu_updated.rmd", "pdf_document")
library(knitr)
library(tinytex)
# d <- read.table(file = ".txt", header = TRUE, sep = "")  # file name is incomplete in the source
# load the data
load("datasets/stops.Rdata")
# number of...
```