Motivation
The first step in the statistical process involves asking a research question. Some questions can be answered with a statistical method from Math 361; others require a more advanced statistical technique. Many questions are not suitable for statistical analysis at all, being better suited to philosophy, mathematics, direct experimentation, etc.
Instructions
Can a method from our “methods” table be used to answer the research questions? Think about the number and type of variables that would need to be collected to answer each question. Then see if these match a column of the “methods” table. If yes, decide whether a statistic or graph would suffice or whether you should use a confidence interval or test of significance. If you choose a test, write the null and alternative hypotheses for the question. Your answers for each of the three questions are worth 8 points, for a total of 24 points.
Relevant Class Material
"methods" table, variables, observational units, types of variable
SECTION 1 (24 points)
Fill in the table. Some cells may be left empty depending on your answer to the first question.
Question
Possible answers
Do a higher proportion of people live in rural areas now than before Covid?
What proportion of people in Klamath Falls are morbidly obese?
On average, how much do students spend on textbooks a term?
Can a method from this class be used? If yes, answer the following questions. If not, briefly explain how the question should be answered without statistics.
Yes or no
What variable(s) need to be collected to answer this question?
What is the type of each variable?
Numerical or binary categorical
What observational unit(s) could the variable(s) be collected on?
Assuming a sample is used, state the population
Would a test or a confidence interval be more appropriate for this question?
Confidence Interval or Test?
State the name of the most appropriate approximate method from the "methods" table
t or z?
One sample or two sample?
Paired or independent (if two sample)?
State the name of the type of graph you recommend for the variable(s)
Boxplot(s), barplot, histogram(s), scatterplot, or stacked barplot?
SECTION 2:
The second step in the statistical process is to make a plan for collecting data, analyzing data and making a conclusion from the data. Careful planning can reduce biases in estimation due to data collection or study design.
In this section you will create a plan to answer the research question, "Does serving in student government during high school lead to higher earnings at age 30 for people in the US?"
There are 11 questions here, each worth 2 points with the exception of question 3.
What is the population in the research question?
What variable(s) must be collected to answer the research question?
(4 points) Frame the research question as null and alternative hypotheses with an appropriate parameter. Write both hypotheses using appropriate symbols. (In OneNote, you can "insert" a "symbol" to obtain π or µ)
Briefly explain why it is not feasible to collect data via a simple random sample in order to answer the research question.
Is it feasible to perform a randomized controlled experiment (RCT) to answer this research question? Briefly explain your reasoning.
Which of these three possible explanations is most directly addressed by the computation of a p-value for the null and alternative hypotheses of question 3?
Choose 1: causal effect, chance, or confounding variable
Identify a possible confounding variable in this study and briefly explain how it relates to both student government participation and earnings at age 30.
What is a Type I error in the context of this research question?
Once the data is collected on your response and treatment variables, you will be ready to do the data analysis.
Sketch or insert a table of how you plan to summarize the dataset you collect (means or medians, standard deviation or MAD…). Include pretend numbers and be sure to label the columns and rows.
Which inferential method do you plan to use?
The last step in the plan is to decide how you will form a conclusion based on the data analysis. The p-value from a test tells you the probability of seeing a difference at least as extreme as the difference in your dataset, assuming the null hypothesis is true. Choose one option below and briefly outline your conclusions under the following scenarios:
I choose option ___
Option 1: Ignore the potential for confounding bias and choose between "by chance" and "causal effect" via a p-value as follows:
If the p-value is less than ______, I will conclude that _____________
If the p-value is above _________, I will conclude that ____________
Option 2: Decide the potential for confounding bias is so extreme that it is not worthwhile to do a test using only the response and treatment variables. In this case, briefly explain how the method of subclassification could be used to adjust this plan to account for the confounding variable you are worried about:
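For reference, the p-value definition used in the final question of this section can be illustrated with a small simulation. The sketch below is written in Python rather than R and uses a permutation (shuffling) test, which is a stand-in for illustration and not necessarily a method from the "methods" table. The earnings values, group names, and the function `permutation_p_value` are all made up for this example.

```python
import random
from statistics import mean

# Pretend earnings at age 30, in thousands of dollars.
# These numbers are invented for illustration; they are not real data.
served = [62, 58, 71, 66, 59, 75, 64, 68]
did_not_serve = [55, 61, 57, 63, 52, 60, 58, 66]


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=1):
    """Approximate one-sided p-value for 'mean of group_a > mean of group_b'.

    Under the null hypothesis the group labels do not matter, so we shuffle
    the pooled values, split them into two groups of the original sizes,
    and count how often the shuffled difference in means is at least as
    large as the observed difference.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_shuffles


observed_diff, p_value = permutation_p_value(served, did_not_serve)
print(f"observed difference in means: {observed_diff:.3f}")
print(f"approximate one-sided p-value: {p_value:.4f}")
```

A small p-value here only says that chance alone rarely produces a difference this large. It says nothing about whether a confounding variable explains the difference.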
SECTION 3:
The "parks.csv" dataset in Canvas contains information on all recorded visits to National Parks. This (and much more) is available for public download here: STATS - National Reports (nps.gov). The parks.csv dataset has the annual visits in 2018 and 2019 by type of visit. The types are:
Recreational visits (RV)
Non-recreational visits (NRV)
Concessioner Lodging (CL)
Concessioner Camping (CC)
Tent overnights (TO)
RV overnights (RVO)
Backcountry overnights (BO)
Non-recreational overnights (N)
Misc. overnights (MO)
EXCEL FILE ATTACHED FOR THIS SECTION
Here's a blank code notebook if you're using R:
https://colab.research.google.com/drive/1cFMAOpl1c3HfbhvIDCBxFwitIh495zgt?usp=sharing
You will likely find the R code helpful
Use the dataset "parks.csv" to answer the following questions.
1. (2 points) How many parks are included in the dataset?
2. (4 points) In 2019, what was the largest number of Backcountry overnight visits to a single park? Which park was it?
3. (6 points) Create a subset of the dataset for parks with a non-zero number of Backcountry overnights in 2019. Briefly describe the distribution of Backcountry overnights in 2019 by computing a measure of spread, a measure of center and the number of parks with non-zero Backcountry Overnight visits.
Number:
Center:
Spread:
4. (6 points) Using the full dataset, compute the difference between the number of Backcountry overnights in 2018 and 2019 for each park. Create an appropriate graph of this variable and write a sentence or two describing what you learned about its distribution. Include a screenshot of your graph and comment on center, shape, spread and any outliers, as appropriate.
5. (6 points) Review the list of available park names, e.g. by opening "parks.csv" in Excel. Choose a park and a variable you're interested in and compare the variable's value for your park with the distribution of that variable for all parks. For example, compute the median and MAD for the variable for all parks and say how your park compares to these values. Write a sentence describing what you learned.
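The median-and-MAD comparison suggested in question 5 can be sketched as follows. This is a Python illustration rather than R, and it assumes that MAD means the unscaled median absolute deviation; if the course defines MAD as the mean absolute deviation, the helper would need to change. Note also that R's `mad()` multiplies by a constant of 1.4826 by default, so its output will differ from this unscaled version. The park names and counts are pretend values, not rows from parks.csv, and `median_absolute_deviation` is a helper written for this example.

```python
from statistics import median

# Hypothetical parks and 2019 Backcountry overnight counts.
# Pretend values for illustration only, NOT taken from parks.csv.
backcountry_2019 = {
    "Park A": 300,
    "Park B": 150,
    "Park C": 4500,
    "Park D": 980,
    "Park E": 40,
    "Park F": 15300,
    "Park G": 2100,
}


def median_absolute_deviation(values):
    """Median of the absolute distances from the median (no scaling constant)."""
    values = list(values)
    center = median(values)
    return median(abs(v - center) for v in values)


all_values = list(backcountry_2019.values())
center = median(all_values)
spread = median_absolute_deviation(all_values)

# Compare one chosen park with the distribution for all parks.
chosen_park = "Park C"
chosen_value = backcountry_2019[chosen_park]
distance_in_mads = (chosen_value - center) / spread

print(f"median for all parks: {center}")
print(f"MAD for all parks: {spread}")
print(f"{chosen_park} is {distance_in_mads:.2f} MADs from the median")
```

Expressing the chosen park's distance from the median in units of MAD gives a one-sentence summary of how typical or unusual that park is.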
Include a printout of your R code or Jamovi screenshot or Excel spreadsheet below: