---
title: "Assignment 4"
author: "Your name and ID here"
date: "Fall 2022"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# packages
library(AER)
library(tidyverse)
library(wooldridge)
library(kableExtra)
library(modelsummary)
```
## Instructions
Please read each question and answer where appropriate. The assignment is graded on a scale from 1 to 5. I grade effort as well as content: to obtain a 5, every question must be attempted, and I am a kind grader when the effort was high but the result was not quite right.
After you answer the questions, `knit` the document to HTML and submit it on eclass. I will **only grade** the HTML. If you submit the `rmd` file instead, you will receive a zero. You have been warned, so there will be no exceptions.
Groups of up to four are allowed, but every student must submit their own assignment.
**If an interpretation of output is asked for, but only output or code is given, the question will get zero points.**
# Question 1: polynomials and interactions
This question uses the Wooldridge data set `wage1`. I have loaded it below to a data frame called `wage1`. The purpose of this question is to get used to interpreting coefficients with different functional form assumptions.
```{r}
wage1 <- wooldridge::wage1 %>%
  filter(complete.cases(.))
```
Consider the generic regression:
$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_3 + \beta_5 (x_2\cdot x_3) + u
$$
Here, the variable $x_1$ is entered into the regression twice -- once as a 'main' effect and once as a squared term. Consider the partial effect of increasing $x_1$ by 1 unit.
$$
\frac{\partial y}{\partial x_1} = \beta_1 + \beta_2 x_1
$$
This says that the marginal impact on $y$ of a one-unit increase in $x_1$ is not a constant but a function: the impact depends on the value of $x_1$. For example, suppose that $\beta_1$ is positive and $\beta_2$ is negative. Then the relationship between $y$ and $x_1$ exhibits decreasing marginal returns: $\frac{\partial y}{\partial x_1}$ is larger when $x_1$ increases from 0 to 1 than when it increases from 2 to 3, and so on. If we graph $y$ against $x_1$, it looks like an inverted 'U'. For example, it might look like:
```{r}
fig.data <- data.frame(x = 1:10) %>%
  mutate(y = 15 * x - 1.5 * x^2)  # 15 is an illustrative coefficient (puts the peak at x = 5)
ggplot(fig.data, aes(y = y, x = x)) +
geom_line() +
labs(title = "Typical diminishing marginal returns profile")
```
We can find the point at which the marginal effect of $x_1$ turns from positive to negative (the turning point) by setting $\frac{\partial y}{\partial x_1} = 0$ and solving for $x_1$. This yields
$$
x^*_1 = \left|\frac{\beta_1}{2\cdot \beta_2}\right|.
$$
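To make this concrete, here is a small sketch that recovers the turning point from a fitted quadratic. The data are simulated, and the coefficients 1.2 and $-0.1$ are illustrative choices, not values from the assignment:

```{r}
# Simulate a quadratic relationship and recover the turning point |b1 / (2*b2)|
set.seed(1)
x <- runif(200, 0, 10)
y <- 2 + 1.2 * x - 0.1 * x^2 + rnorm(200, sd = 0.5)
quad <- lm(y ~ x + I(x^2))
b <- coef(quad)
abs(b["x"] / (2 * b["I(x^2)"]))  # close to the true turning point 1.2 / (2 * 0.1) = 6
```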
The variables $x_2$ and $x_3$ also appear twice; each as a main effect and then in an interaction. Consider the partial effect of $x_2$ on $y$:
$$
\frac{\partial y}{\partial x_2} = \beta_3 + \beta_5 x_3
$$
Again, this says that the impact of $x_2$ on $y$ is not a constant: it is allowed to depend on the value of $x_3$. The treatment of $x_3$ is symmetric. We most often use these types of interactions when one term is a dummy variable. Suppose $x_3$ only takes on two values, 0 and 1. Then
$$
\frac{\partial y}{\partial x_2} = \beta_3 + \beta_5 \text{ when } x_3 = 1 \text{ and } \frac{\partial y}{\partial x_2} = \beta_3 \text{ when } x_3 = 0
$$
Since a dummy variable denotes two groups, this allows each group to have its own intercept (the difference is $\beta_4$) and its own slope (the difference is $\beta_5$). Graphically, it looks like:
```{r}
ggplot(wage1 %>% filter(educ>5), aes(y = lwage, x = educ, color = factor(west))) +
geom_smooth(method = 'lm', se = F)
```
where each line is a regression of log wages on education, with an interaction for living in the west. In this data, the return to education for workers in the western United States is higher than the return for those in the rest of the country.
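To read the two slopes off the fitted model rather than the graph, one sketch (using the same `wage1` data from the wooldridge package):

```{r}
# Education slopes implied by the educ*west interaction:
# beta_3 for the rest of the country, beta_3 + beta_5 for the west
m_west <- lm(lwage ~ educ * west, data = wooldridge::wage1)
slope_rest <- unname(coef(m_west)["educ"])
slope_west <- unname(coef(m_west)["educ"] + coef(m_west)["educ:west"])
c(rest = slope_rest, west = slope_west)
```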
In `R`, we can create variables "on the fly" to use in regressions. We use this mostly to create interaction terms and low order polynomial terms. Consider the following code that would estimate the following equation
$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 x_3 + \beta_5 (x_2\cdot x_3) + u
$$
In the code below, each regression is exactly the same, just different ways of expressing it:
```{r, eval = F}
# df below is a placeholder for your data frame
mod <- lm(y ~ x_1 + I(x_1^2) + x_2 * x_3, data = df)
mod <- lm(y ~ poly(x_1, 2, raw = T) + x_2 * x_3, data = df)
mod <- lm(y ~ x_1 + I(x_1^2) + x_2 + x_3 + x_2:x_3, data = df)
mod <- lm(y ~ x_1 + I(x_1^2) + x_2 + x_3 + I(x_2*x_3), data = df)
```
The term `I()` is an "insulator" function: it tells `R` to evaluate the expression inside first, then run the regression. The notation `x_2*x_3` says to include main effects for each variable plus an interaction; the notation `x_2:x_3` includes only the interaction. Finally, `poly()` constructs low-order polynomials. The `raw = T` option is important: it uses raw rather than orthogonal polynomials, so the coefficients match the equation above.
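To convince yourself that these are all the same regression, you can compare two of the notations directly. A quick sketch using `educ` and `female` from the `wage1` data:

```{r}
# The * shorthand expands to main effects plus the interaction,
# so both calls estimate identical coefficients.
m_star  <- lm(lwage ~ educ * female, data = wooldridge::wage1)
m_colon <- lm(lwage ~ educ + female + educ:female, data = wooldridge::wage1)
all.equal(coef(m_star), coef(m_colon))  # TRUE
```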
```{r}
wage1 <- wooldridge::wage1 %>%
  filter(complete.cases(.))
# fit models
models <- list(
lm(lwage ~ educ + exper + I(exper^2) + nonwhite + female , data = wage1),
lm(lwage ~ educ*female + exper + I(exper^2) + nonwhite , data = wage1)
)
# table
modelsummary(models,
             fmt = 5,
             vcov = sandwich,
             stars = T,
             gof_omit = "[^R2|Adj. R2]") %>%
  kable_classic_2()
```
1. In the first column, interpret the return to experience (`exper`). After how many years of experience does the relationship turn negative?
> Answer here
2. In column two, what are the returns to education for men and for women? Are the returns to education significantly different for men and women?
> Answer here
# Question 2: Teaching evaluations
Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching-related characteristics, such as the physical appearance of the instructor. The article "Beauty in the classroom: instructors' pulchritude and putative pedagogical productivity" (Hamermesh and Parker, 2005) found that instructors who are viewed as better looking receive higher instructional ratings.
> Daniel S. Hamermesh, Amy Parker, Beauty in the classroom: instructors pulchritude and putative pedagogical productivity, Economics of Education Review, Volume 24, Issue 4, August 2005, Pages XXXXXXXXXX, ISSN XXXXXXXXXX, XXXXXXXXXX/j.econedurev XXXXXXXXXX.
[Paper link - not required to read](http://www.sciencedirect.com/science/article/pii/S XXXXXXXXXX)
```{r}
data("TeachingRatings") # load ratings data
df <- TeachingRatings # rename as df for convenience
```
1. The data set `df` constructed in the above code chunk contains different types of variables. Use the command `str()` or `glimpse()` on the data frame `df` to answer below:
(a) What type of variable is `credits`? What fraction of the data are single credit courses?
(b) What type of variable is `allstudents`? What is the largest class in the data set?
(c) Construct a variable called `frac` that is the proportion of students in the class that filled out the evaluation. What is the average participation rate?
> Answer here
2. You can see the variable definitions by typing "?TeachingRatings" in the console. Suppose we are interested in estimating a causal effect of `beauty` on `eval`. That is,
$$
eval_i = \beta_0 + \beta_1 beauty_i + \eta_i
$$
Using the strategy discussed in class and in Chapter 7.6, construct a regression table evaluating the causal effect of beauty on teaching evaluations. Your regression table should consider several specifications, starting with the bivariate regression above and then adding more controls, possibly in groups. For each specification, state why it is important to include the controls you add; your answer should relate to the CIA assumption. Interpret your results: do you think beauty has a causal impact on evaluations? If yes, defend your answer. If not, state why not.
> Answer here
```{r}
# regression table here
```
3. Run a regression of `eval` on `beauty`, `gender`, `minority`, `credits`, `division`, `tenure`, `native`. Consider my data: I am a male, non-minority, native English speaker, teaching a multiple-credit, upper-division course, and I have tenure. While I don't have a `beauty` rating, according to [RateMyProfessor.com](https://www.ratemyprofessors.com/ShowRatings.jsp?tid=2033571), I have an evaluation of 2.3. Use your regression and my information to infer what my `beauty` rating would be if I were in this data set.
> Answer here
```{r}
# Regression here
```
4. In the regression you ran in part (3), the coefficient on gender shows that women have, on average, after controlling for other characteristics, lower evaluations than men. This has led to additional research on the topic -- evaluations are important for promotion and tenure decisions. Add an interaction term between `beauty` and `gender`. Interpret your results: is the marginal impact of beauty the same for men and women? Are good-looking men treated differently from good-looking women by students in terms of their evaluations? Can we reject that the return to beauty for women, in terms of evaluations, is zero?
> Answer here
```{r}
# Regression here
```
5. Using the same controls as in part (3), test whether the return to beauty depends on the level of beauty. What do you find?
> Answer here
```{r}
# Regression here
```
6. Using your regression in part (5), allow the beauty profile to depend on gender. Can you reject that men and women have the same beauty profile? Use the `margins` command to estimate the effect of moving from the 25th percentile to the 75th percentile of beauty for men and for women. What do you find?
> Answer here
```{r}
# Regression here
```
# Question 3: Birth weight
Smoking during pregnancy has been shown to have significant adverse health effects for newborn babies. Smoking is thought to be a preventable cause of low birth weight of infants, who, in turn, need more resources at delivery and are more likely to have related health problems in infancy and beyond. Despite these concerns, many women still smoke during pregnancy. In this section, we analyze the relationship between birth weight and smoking behavior, with the emphasis on identifying a _causal_ impact of smoking on the birth weight of newborns.
The relationship we examine is:
$$
\log(\texttt{birth weight})_i = \beta_0 + \beta_1 \texttt{smoking}_i + \eta_i
$$
where $\texttt{smoking}_i$ will be measured by average cigarettes per day. The term $\eta_i$ captures all of the other things that determine birth weight aside from smoking.
### Baseline analysis.
Investigate the birth weight-smoking relationship and present your results in a table format. Your investigation should be structured around the discussion of section 7.6. For control variables, choose the ones you see fit and explain why you choose them. Your explanation should be centered on our class discussion of the conditional independence assumption. You can see the help file for the data set by typing ?bwght in the console. Remember, good controls are related to the treatment or target variable of interest and not affected by the treatment itself.
```{r}
# loading birth weight data from the package wooldridge
bw <- wooldridge::bwght
```
### Robustness of your results
Investigate any potential non-linearity in your results. First, test whether the relationship between smoking and birth weight is linear by including a polynomial in cigarettes per day. Second, examine whether the impact of smoking is the same for girls and boys. Your results should be presented in a table format and structured along the lines of the discussion in Chapter 8.4 of the text. The average number of cigarettes smoked per day for smokers is about 14. Using the `margins` command, estimate the impact of reducing this by half and compare this effect to quitting altogether.
```{r}
# Regressions here
```
### Assessing your results
Estimating the causal relationship between birth weight and smoking is made difficult by the fact that smoking might be correlated with other behaviors that are harmful to newborn outcomes. In other words, there are **threats to internal validity**. There are various types of threats to internal validity. For each one I list below, explain how this might affect the interpretation of your results above and whether or not you can address the concern:
1. Omitted variables bias. Explain how this would affect the interpretation of the estimated coefficient on smoking. Give an example of a potential omitted variable you would control for if it were available in the data.
2. Model misspecification. For example, the relationship between birth weight and smoking is not linear. Should we be worried about this in this particular case?
3. Measurement error or errors-in-variables: If mothers in the survey did not accurately report their smoking behavior, how would this affect the interpretation of your results? Should we be worried about this in this particular case?
4. Simultaneous causality.
### External validity.
Your analysis above provides some evidence on the relationship between birth weight and smoking behavior, but no one study is perfect. In this case, there are two looming concerns: (1) omitted variables bias and (2) whether the results are generalizable. In the first case, there are usually additional things we'd like to control for but can't because of data limitations. In the second case, we worry that our results might depend on a particular sample. For these reasons, it is a good idea to examine other possible data sets. Below, I load a 2nd data set on birth weight outcomes. This data set contains many of the same variables as the first data set and a few other variables.
```{r}
# 2nd data set
bw2 <- wooldridge::bwght2
```
Below, re-estimate your main specification from above on this data _as best as you can_ -- it might not have exactly the same variables, but it should be fairly close. Use ?bwght2 to see a variables list.
Do your results generalize? The average number of cigarettes smoked per day for smokers is about 13 in this data. Using the `margins` command, estimate the impact of reducing this by half and compare this effect to quitting altogether. Are these results similar to the previous data set?
### Additional controls
The 2nd data set contains two types of additional controls that might help alleviate omitted variables concerns. First, it contains a variable called `drink` that measures "drinks per week". To the extent that smoking is related to other types of behavior that are harmful to newborns, this can be a useful additional control. Second, the variable `npvis` documents the "number of prenatal visits to a doctor". This variable might capture at-risk births (lots of prenatal visits might indicate something was wrong) or how attentive the mother was in terms of seeking health advice. Estimate specifications that control for `drink` and `npvis` as well as their squares. Does this affect your main conclusions? Discuss.