Problem Set 1
Due June 29, 2020 at 11:59 pm
Problem set policies. Please provide concise, clear answers for each question. Note that only writing the result of
a calculation (e.g., "SD = 3.3") without explanation is not sufficient. For problems involving R, be sure to include
the code in your solution.
Please submit your problem set via Canvas as a PDF, along with the R Markdown source file.
We encourage you to discuss problems with other students (and, of course, with the teaching team), but you must
write your final answer in your own words. Solutions prepared "in committee" are not acceptable. If you do
collaborate with classmates on a problem, please list your collaborators on your solution.
For each of the following scenarios, discuss (in at most five sentences) the main issue(s) with
respect to sampling or reporting bias.
a) A particular city has 14 architects who own their own firm. To select a survey sample, each
architect was contacted via telephone by order of appearance in the telephone directory,
then the first 8 that agreed to be interviewed formed the sample.
b) The September 1992 issue of Prevention magazine included a women’s health survey;
approximately 16,500 women responded to the survey. The May 1993 issue reported on the
survey results, claiming that “92% of our readers rated their health as excellent, very good,
c) Many scholars and policymakers are interested in estimating the prevalence of mental ill-
ness among the homeless population. In one study, the authors sampled homeless persons
who received medical attention from a clinic that was part of the Health Care for the Home-
less project, resulting in an estimated prevalence of 33%.1 The authors maintain that se-
lection bias is not a serious problem because the clinics are easily accessible to homeless
A recently published analysis examined 10 studies that measured optimism and pessimism by
asking participants about their level of agreement with statements like “In uncertain times, I
usually expect the best,” or “I rarely expect good things to happen to me”. Optimistic people
tend to expect that they will encounter favorable outcomes, whereas less optimistic people tend
to expect that they will encounter unfavorable outcomes.2
These studies also measured other variables on participants, including factors related to heart
disease. The analysis found that compared with pessimists, people with the most optimistic
outlook had a 35% lower risk for cardiovascular events (e.g., heart attacks). The studies, on average,
observed people over a 14-year period and compared the rate of cardiovascular events between
those classified as optimists versus pessimists.
1This project is a federally funded program that brings general health and mental health services
to homeless people.
2Alan Rozanski, MD, et al. Association of optimism with cardiovascular events and all-cause mortality. JAMA
Network Open 2019; 2(9):e1912200.
a) A popular newspaper reports on the analysis with the headline “Thinking Positively Im-
proves Cardiovascular Health”. Write a short response to the editor explaining clearly why
the headline is potentially misleading. Be sure to use language accessible to a general audi-
ence without a statistics background. Limit your answer to at most five sentences.
b) Briefly describe a plausible study design that has the potential to demonstrate the effect of
thinking positively on cardiovascular health.
c) Suppose someone who is very optimistic reads about the analysis and concludes that the
findings suggest he has a 35% lower risk for cardiovascular events than his friend who is
extremely pessimistic. Explain why this is not necessarily the case.
The following graphs are based on data from the National Center for Health Certificates.
a) Describe what you see in the two graphs, with particular focus on the differences between
the two distributions.
b) Economists are interested in the possible causes driving the shape of the age distribution in
i. Discuss a possible reason behind the discrepancy between the 1980 distribution and
the 2016 distribution; i.e., what is a potential factor driving the difference in the distri-
ii. Discuss a possible reason behind the shape of the age distribution in 2016.
The Stanford Open Policing Project is a team of researchers and journalists at Stanford University
working to collect and standardize data on vehicle and pedestrian stops from law enforcement
departments across the country, with the goal of investigating and improving interactions between
the police and the public. In a recently published analysis based on these data, the authors found
that police stops and search decisions suffer from persistent racial bias.3
In this problem, you will work with data from the Stanford Open Policing Project and conduct
an exploratory analysis based on approaches used by the study team. The dataset stops.Rdata
contains standardized data on police stops in Philadelphia, Pennsylvania between 2013 and 2017.
Each case represents a single police stop.
The variables are defined as follows:
– date: date of the stop, in YYYY-MM-DD format
– year: year of the stop
– time: 24-hour time for the stop, in HH:MM format
– location: freeform text of the location, e.g. street number and street name
– lat: latitude of the stop
– lng: longitude of the stop
– district: police district
– service_area: police service area
– subject_age: age of the stopped subject
– subject_race: race of the stopped subject, recorded as either white, black, hispanic,
asian/pacific islander, or other
– subject_sex: the recorded sex of the stopped subject
– type: type of stop, either vehicular or pedestrian
– arrest_made: recorded as TRUE if an arrest was made, and FALSE if otherwise
– outcome: strictest police action taken, either arrest, citation, warning, summons
– contraband_found: recorded as TRUE if contraband was found from a search, and FALSE if otherwise
– frisk_performed: recorded as TRUE if a frisk was performed, and FALSE if otherwise
– search_conducted: recorded as TRUE if a search was conducted, and FALSE if otherwise
– search_person: recorded as TRUE if search of a person has occurred, and FALSE if otherwise
– search_vehicle: recorded as TRUE if search of a vehicle has occurred, and FALSE if otherwise
Use these data to answer the following questions.
a) Take an initial look at the stops dataset.
i. How many police stops are represented in the data?
ii. What date range does the data cover?
iii. Of the police stops recorded, what proportion of stops occurred in 2017?
b) Describe the distribution of age of stopped subjects, referencing numerical and graphical
summaries as needed.
3E. Pierson, C. Simoiu, J. Overgoor, S. Corbett-Davies, D. Jenson, A. Shoemaker, V. Ramachandran, P. Barghouty,
C. Phillips, R. Shroff, and S. Goel. A large-scale analysis of racial disparities in police stops across the United States.
Nature Human Behaviour, Vol. 4, 2020.
c) To narrow the scope of the analysis, we will focus on vehicular police stops that occurred in
2017. Subset the data appropriately and name the subset stops.subset.
i. Using numerical and graphical summaries, describe the distribution of race of stopped
subjects, among vehicular stops in 2017. Does any race appear to be overrepresented?
ii. In a few sentences, briefly explain why it would be helpful to account for racial demographics
in Philadelphia when interpreting the values in part i.
iii. The dataset population_2017.Rdata contains information about racial demographics in
Philadelphia for 2017. Use this information to compute the “stop rate” for each group,
where stop rate is defined as number of police stops per member of the population. For
example, if 10 police stops occur in which the stopped subject is Asian, and there are
100 Asian members of the population, the stop rate for Asians is 10/100 = 0.10.
Report the stop rate for each race group.
iv. Based on the calculations in part iii., relative to white drivers, how much more often are
black drivers stopped by the police? Relative to white drivers, how much more often
are Hispanic drivers stopped by the police?
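The stop rate arithmetic defined in part iii. can be sketched in R as follows. This is only an illustration on made-up counts: the data frame toy, its column names, and all numbers are hypothetical and do not come from stops.Rdata or population_2017.Rdata (the Asian row simply reuses the 10/100 example above).

```r
# Hypothetical counts for illustration only; not taken from the course datasets.
toy <- data.frame(
  race       = c("white", "black", "hispanic", "asian/pacific islander"),
  n_stops    = c(50, 120, 30, 10),
  population = c(1000, 800, 300, 100)
)

# Stop rate = number of police stops per member of the population.
toy$stop_rate <- toy$n_stops / toy$population

# Ratio of each group's stop rate to the stop rate of the reference group (white).
white_rate <- toy$stop_rate[toy$race == "white"]
toy$relative_to_white <- toy$stop_rate / white_rate

print(toy)
```

Dividing each rate by the reference group's rate gives a "times as often" comparison, which is one way to phrase the comparison asked for in part iv.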
d) After a driver is stopped, officers may carry out a search of the driver or vehicle if they
suspect more serious criminal activity. One strategy for understanding whether data suggest
biased decision-making is the outcome test, which is based on assessing the proportion of
searches that successfully identify contraband. If searches of minorities are successful less
often than searches of whites, this suggests that officers are searching minorities on the basis
of less evidence.
i. Calculate hit rates by race in Philadelphia in 2017 for vehicular stops, where hit rate
is defined as the proportion of searches in which contraband was found. Describe your results.
It may be the case that the bar for stopping people is lower in certain police districts, and
that minorities are more likely to live in neighborhoods in those districts. The dataframe
hit_rates.Rdata contains the hit rate for white, black, and Hispanic drivers in each police district
in Philadelphia (for vehicular stops and searches in 2017). Information about each district is
contained in two rows: one row contains the hit rates of black drivers and one row contains
the hit rates of Hispanic drivers.
ii. Create a plot that summarizes the relationship between the hit rates of black drivers
and the hit rates of white drivers for police districts in Philadelphia.
iii. Add a y = x line to the plot from part ii. Describe what a point on the y = x line would
represent in context of the data.
iv. With reference to the y = x line, describe what you see in the plotted data. Are the
results suggestive of bias against black drivers? Explain your answer.
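The hit rate calculation and the y = x reference line described in part d) can be sketched in R on made-up data. Everything below is hypothetical: the data frames searches and districts, their column layout, and all values are assumptions for illustration and do not reflect the actual structure of stops.Rdata or hit_rates.Rdata.

```r
# Hypothetical search records for illustration only.
searches <- data.frame(
  subject_race     = c("white", "white", "white", "white",
                       "black", "black", "black", "black", "black"),
  contraband_found = c(TRUE, TRUE, FALSE, FALSE,
                       TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Hit rate = proportion of searches in which contraband was found.
# The mean of a logical vector is the proportion of TRUE values.
hit_rate <- tapply(searches$contraband_found, searches$subject_race, mean)
print(hit_rate)

# Hypothetical district-level hit rates, one row per district (assumed layout).
districts <- data.frame(
  white_hit_rate = c(0.30, 0.25, 0.40),
  black_hit_rate = c(0.22, 0.25, 0.31)
)

# Scatterplot with one point per district, using the same scale on both axes.
plot(districts$white_hit_rate, districts$black_hit_rate,
     xlim = c(0, 0.5), ylim = c(0, 0.5),
     xlab = "White hit rate", ylab = "Black hit rate")

# Reference line with intercept 0 and slope 1 (the y = x line).
abline(a = 0, b = 1, lty = 2)
```

Using identical axis limits keeps the y = x line on the diagonal of the plot, which makes it easier to see on which side of the line each district falls.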
Vitamin D is essential for growth and bone health in children. It can be either obtained from
dietary sources or produced by the body upon exposure of skin to ultraviolet waves (typically via
sun exposure). Natural food sources rich in Vitamin D are scarce. Even in many low latitude
countries where sunshine is plentiful, Vitamin D deficiency is a public health concern.
A study was conducted to evaluate Vitamin D status among schoolchildren in Thailand. The
study drew data from a randomized trial conducted in rural subdistricts of a specific su
the country that assessed the efficacy of a seasoning powder fortified with iron, zinc, iodine, and
Vitamin A for reducing anemia.
Exposure to sunlight allows the body to produce serum 25(OH)D, which is a marker of Vitamin
D status. Serum 25(OH)D is then converted into a biologically active form, serum 1,25(OH)2D.
Data on both serum levels were used to determine the prevalence of Vitamin D deficiency in the
subpopulation under study. Vitamin D deficiency is defined as having a serum 25(OH)D level
below 50 nmol/L.
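As a small illustration of the deficiency definition above, the following R sketch classifies made-up serum 25(OH)D values against the 50 nmol/L cutoff. The vector serum_25ohd and its values are hypothetical and are not taken from the study data.

```r
# Hypothetical serum 25(OH)D levels in nmol/L; not from the study data.
serum_25ohd <- c(42.5, 61.0, 48.9, 75.3, 50.0, 39.8)

# Deficiency is defined as a level strictly below 50 nmol/L,
# so a value of exactly 50.0 is not classified as deficient.
deficient <- serum_25ohd < 50

# Prevalence = proportion of observations classified as deficient.
prevalence <- mean(deficient)
print(prevalence)
```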
The file vitamin_d