Final/finalproblemset-gexy0jl5.html
Final Problem Set: Due Tuesday, May 10 at 5:00 pm EDT
Important: You need basque.RData to complete this set. It is located on eLC.
Answer the questions below. This notebook will be your workspace so add cells as you please. When you are finished, you will submit two objects to eLC:
**Written pdf Report.** This should be short. You should include:
The research question.
The figure from your Exploratory Data Analysis along with discussion. Be sure to note the difference in per capita GDP pre and post 1975.
Difference-in-differences result. Be sure to include the figure comparing the Basque country and the control region. Discuss how the control group satisfies our criteria. Be sure to discuss identification (see "interpretation" of DiD below) and results relative to the research question. Ideally, you should work your estimation result into the figure so the reader can quickly observe the "answer" to the research question.
Synthetic Control result. Be sure to include the figure comparing the Basque country and the synthetic control region. Be sure to discuss identification (see "interpretation" of DiD below) and results relative to the research question. Ideally, you should work your estimation result into the figure so the reader can quickly observe the "answer" to the research question.
Your analysis should be professional: i.e., well-written, clear, and concise. Figures should be incorporated in your analysis. For example, it would be useful to set the title of each figure in this notebook (e.g., "Figure 1: Per Capita GDP in the Basque Region") so in the final report you can reference the figure in the appropriate commentary. Save your report as a pdf ("File/Save As Adobe PDF") with the naming convention 'Final_Report_[insert last name]'. For example, 'Final_Report_Thurk.pdf'.
**Jupyter Notebook.** Print the notebook as a pdf. [To print: From the file menu, choose 'print preview'. A new tab will open with the notebook presented as html. Print as a pdf.] Save your pdf notebook with the naming convention 'Final_Workbook_[insert last name]'. For example, 'Final_Workbook_Thurk.pdf'. Think of the notebook as your opportunity to show your work.
Grading: The problem set is worth 90 points and partial credit is indicated for each exercise. I will grade your report (1) and use your notebook (2) to assign partial credit in the event there are errors.
You may use your notes, books, and the internet.
Do not consult with other people. This work should be entirely your own.
Exercise 1: Exploratory Data Analysis [30 points]
We're going to estimate the economic impact of terrorism in the Basque region -- an autonomous community in Spain. To do this, we'll use information from 17 other Spanish regions where our underlying assumption will be that terrorism affected economic activity in the Basque region but not elsewhere whereas other economic shocks are aggregate and affect all of Spain.
Here are the details:
Coverage from 1955–1997 for 18 Spanish regions. One of the data "regions" is all of Spain which we won't use.
The treatment region is “Basque Country (Pais Vasco)”.
The "treatment" year is 1975 since there were several bombings around that year.
We will measure "economic impact" via GDP per capita (in thousands).
Background: Euskadi Ta Askatasuna (ETA) was an armed leftist Basque nationalist and separatist organization in the Basque Country (in northern Spain and southwestern France). The group was founded in 1959 and later evolved from a group promoting traditional Basque culture to a paramilitary group engaged in a violent campaign of bombing, assassinations and kidnappings in the Southern Basque Country and throughout Spanish territory. Its goal was gaining independence for the Basque Country. ETA was the main group within the Basque National Liberation Movement and was the most important Basque participant in the Basque conflict. While its terrorist activities spanned several decades, the death of Spanish dictator Francisco Franco in 1975 led to a substantial increase in bombings. (Wikipedia)
Part A: Load Data
Thus far we've learned how to load data from csv, excel, and stata. Another popular programming language (and to some degree a rival of python) is the open-source language R. The data we want to access is from the following academic paper:
Abadie, A. and Gardeazabal, J. (2003) "Economic Costs of Conflict: A Case Study of the Basque Country." American Economic Review 93 (1) 113--132.
where the data was saved in the R programming language. We'll do this by leveraging python's own open-source nature and load a user-created package called pyreadr which we install as usual in the terminal:
pip install pyreadr
We then use pyreadr to load basque.RData and access the data as follows:
result = pyreadr.read_r('basque.RData')
data = result['basque'] # extract the pandas dataframe
In [1]:
import pyreadr, pandas as pd
pd.set_option('display.max_columns', 20)
result = pyreadr.read_r('\\basque-pehqaa4x.RData')
result = [pd.DataFrame(i) for i in result.values()][0]
result
Out[1]:
regionno regionname year gdpcap sec.agriculture sec.energy sec.industry sec.construction sec.services.venta sec.services.nonventa school.illit school.prim school.med school.high school.post.high popdens invest
rownames
1 1.0 Spain (Espana) 1955.0 2.354542 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1.0 Spain (Espana) 1956.0 2.480149 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1.0 Spain (Espana) 1957.0 2.603613 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1.0 Spain (Espana) 1958.0 2.637104 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 1.0 Spain (Espana) 1959.0 2.669880 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
770 18.0 Rioja (La) 1993.0 9.132391 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 16.765787
771 18.0 Rioja (La) 1994.0 9.498000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 16.469452
772 18.0 Rioja (La) 1995.0 9.752213 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 20.275650
773 18.0 Rioja (La) 1996.0 10.056413 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
774 18.0 Rioja (La) 1997.0 10.476292 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
774 rows × 17 columns
Part B: Drop Spain
Drop the region "Spain (Espana)".
In [2]:
result = result[result['regionname'] != 'Spain (Espana)'] # Drop Spain
Part C: Plot GDP Per Capita Over Time
Before proceeding into the econometrics, it's useful to graph the data variation we're most interested in. We're interested in evaluating the effect of terrorism on per capita GDP where we're using the bombings in 1975 as a "natural experiment." Plot per capita GDP (i.e., gdpcap) in the Basque region across time. Put a vertical dashed line at 1975 and indicate Pre/Post periods.
In [3]:
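import matplotlib.pyplot as plt
# Keep only the Basque region so the line is a single time series
basque = result[result['regionname'] == 'Basque Country (Pais Vasco)']
plt.plot(basque['year'], basque['gdpcap'])
# Dashed vertical line at the treatment year, with Pre/Post labels
plt.axvline(x=1975, color='k', linestyle='--')
plt.text(1965, basque['gdpcap'].max(), 'Pre')
plt.text(1985, basque['gdpcap'].max(), 'Post')
plt.title('Figure 1: Per Capita GDP in the Basque Region')
plt.xlabel('year')
plt.ylabel('real per-capita GDP (1986 USD, thousand)')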
Out[3]:
Text(0, 0.5, 'real per-capita GDP (1986 USD, thousand)')
In [ ]:
We observe that per capita GDP decreases after the 1975 increase in ETA terrorist bombings and then increases. How much of that decrease is due to terrorism and how much is due to general economic uncertainty from Franco's death?
Exercise 2: Difference-in-Differences [30 points]
Part A: Comparing Before and After
We'll begin by comparing per capita GDP before and after the 1975 treatment year. Solve for average per capita GDP in the Basque region before (and including) 1975 and after 1975 (i.e., two numbers).
In [126]:
year = compare['year']
compare['year'] = compare['year'].astype(int)
<ipython-input-126-262a52e2efa1>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
compare['year'] = compare['year'].astype(int)
In [125]:
compare.plot(x="year", y="regionname", kind="scatter")
Out[125]:
AxesSubplot:xlabel='year', ylabel='regionname'
In [124]:
compare[compare['regionname'] == 'Basque Country (Pais Vasco)']
Out[124]:
regionname year gdpcap
rownames
689 Basque Country (Pais Vasco) 1955.0 3.853185
690 Basque Country (Pais Vasco) 1956.0 3.945658
691 Basque Country (Pais Vasco) 1957.0 4.033562
692 Basque Country (Pais Vasco) 1958.0 4.023422
693 Basque Country (Pais Vasco) 1959.0 4.013782
694 Basque Country (Pais Vasco) 1960.0 4.285918
695 Basque Country (Pais Vasco) 1961.0 4.574336
696 Basque Country (Pais Vasco) 1962.0 4.898957
697 Basque Country (Pais Vasco) 1963.0 5.197015
698 Basque Country (Pais Vasco) 1964.0 5.338903
699 Basque Country (Pais Vasco) 1965.0 5.465153
700 Basque Country (Pais Vasco) 1966.0 5.545916
701 Basque Country (Pais Vasco) 1967.0 5.614896
702 Basque Country (Pais Vasco) 1968.0 5.852185
703 Basque Country (Pais Vasco) 1969.0 6.081405
704 Basque Country (Pais Vasco) 1970.0 6.170094
705 Basque Country (Pais Vasco) 1971.0 6.283633
706 Basque Country (Pais Vasco) 1972.0 6.555555
707 Basque Country (Pais Vasco) 1973.0 6.810769
708 Basque Country (Pais Vasco) 1974.0 7.105184
709 Basque Country (Pais Vasco) 1975.0 7.377892
710 Basque Country (Pais Vasco) 1976.0 7.232934
711 Basque Country (Pais Vasco) 1977.0 7.089831
712 Basque Country (Pais Vasco) 1978.0 6.786704
713 Basque Country (Pais Vasco) 1979.0 6.639817
714 Basque Country (Pais Vasco) 1980.0 6.562839
715 Basque Country (Pais Vasco) 1981.0 6.500785
716 Basque Country (Pais Vasco) 1982.0 6.545059
717 Basque Country (Pais Vasco) 1983.0 6.595330
718 Basque Country (Pais Vasco) 1984.0 6.761497
719 Basque Country (Pais Vasco) 1985.0 6.937161
720 Basque Country (Pais Vasco) 1986.0 7.332191
721 Basque Country (Pais Vasco) 1987.0 7.742788
722 Basque Country (Pais Vasco) 1988.0 8.120537
723 Basque Country (Pais Vasco) 1989.0 8.509711
724 Basque Country (Pais Vasco) 1990.0 8.776778
725 Basque Country (Pais Vasco) 1991.0 9.025279
726 Basque Country (Pais Vasco) 1992.0 8.873893
727 Basque Country (Pais Vasco) 1993.0 8.718224
728 Basque Country (Pais Vasco) 1994.0 9.018138
729 Basque Country (Pais Vasco) 1995.0 9.440874
730 Basque Country (Pais Vasco) 1996.0 9.686518
731 Basque Country (Pais Vasco) 1997.0 10.170666
In [129]:
import matplotlib.pyplot as plt
plt.plot(compare["regionname"], compare["gdpcap"])
plt.show()
In [147]:
compare.groupby("year").apply(lambda s: pd.Series({"gdpSum": s["gdpcap"].sum()})).plot()
Out[147]:
AxesSubplot:xlabel='year'
In [122]:
compare = result[['regionname','year','gdpcap']]
compare_below_1975 = compare[compare['year'] <= 1975]
compare_above_1975 = compare[compare['year'] > 1975]  # strictly after 1975, so 1975 is counted only in the pre-period
compare
Out[122]:
regionname year gdpcap
rownames
44 Andalucia 1955.0 1.688732
45 Andalucia 1956.0 1.758498
46 Andalucia 1957.0 1.827621
47 Andalucia 1958.0 1.852756
48 Andalucia 1959.0 1.878035
... ... ... ...
770 Rioja (La) 1993.0 9.132391
771 Rioja (La) 1994.0 9.498000
772 Rioja (La) 1995.0 9.752213
773 Rioja (La) 1996.0 10.056413
774 Rioja (La) 1997.0 10.476292
731 rows × 3 columns
In [127]:
compare_groupby = compare.groupby("year")["gdpcap"].sum().sort_values()
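The before/after comparison requested in Part A can be sketched as follows. This is a minimal illustration, not the graded answer: it uses only six Basque rows copied from the table above to stay short, and the names `basque`, `post`, `pre_mean`, and `post_mean` are assumptions introduced here.

```python
import pandas as pd

# Six Basque rows copied from the table above (year, gdpcap).
# The full analysis would use every year from 1955 to 1997.
basque = pd.DataFrame({
    "year":   [1973, 1974, 1975, 1976, 1977, 1978],
    "gdpcap": [6.810769, 7.105184, 7.377892, 7.232934, 7.089831, 6.786704],
})

# Post-period indicator: 1 for years strictly after 1975, 0 otherwise,
# so 1975 itself falls in the pre-period as the exercise specifies.
basque["post"] = (basque["year"] > 1975).astype(int)

# The "two numbers": mean gdpcap in the pre and post periods.
means = basque.groupby("post")["gdpcap"].mean()
pre_mean, post_mean = means.loc[0], means.loc[1]

print(f"pre: {pre_mean:.3f}  post: {post_mean:.3f}  difference: {post_mean - pre_mean:.3f}")
```

Building the indicator once and grouping on it avoids maintaining two overlapping slices of the data, which is where a year can accidentally be counted in both periods.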
Part B: First-Difference Regression
We can evaluate the effect of the ETA by constructing a "first difference" equation with gdpcap as the dependent variable and a post indicator as the independent variable; i.e.,
$$\text{Per Capita GDP}_t = \beta_0 + \beta_1 P_t + \epsilon_t$$
where $P_t$ is equal to one for all years after 1975 and zero otherwise. Run the above regression using only information for the Basque region.
In [134]:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # Closed-form simple OLS of y on x: returns (intercept, slope)
    n = np.size(x)
    m_x = np.mean(x)
    m_y = np.mean(y)
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # Scatter the observations and overlay the fitted line
    plt.scatter(x, y, color = "m",
                marker = "o", s = 30)
    y_pred = b[0] + b[1]*x
    plt.plot(x, y_pred, color = "g")
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

def main():
    # As written, this regresses year (y) on gdpcap (x) for every region in compare
    x = np.array(compare['gdpcap'])
    y = np.array(compare['year'])
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
Estimated coefficients:
b_0 = 1952.1932542750592
b_1 = 4.411319205122376
What effect did the ETA bombings have on Basque per capita GDP? Compare your results from (A) and (B).
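For reference, the first-difference regression described in Part B can be sketched with a post-1975 indicator as the only regressor. The sketch below is an assumption-laden illustration rather than the intended solution: it uses `numpy.linalg.lstsq` on six Basque rows copied from the table above, and the names `post`, `X`, `b0`, and `b1` are introduced here. With a single 0/1 regressor, the slope equals the difference between the post and pre means, which is why (A) and (B) should agree.

```python
import numpy as np

# Six Basque observations copied from the table above; illustration only.
years = np.array([1973, 1974, 1975, 1976, 1977, 1978])
gdpcap = np.array([6.810769, 7.105184, 7.377892, 7.232934, 7.089831, 6.786704])

# P_t = 1 for years strictly after 1975, 0 otherwise.
post = (years > 1975).astype(float)

# Design matrix [1, P_t]; solve min ||X b - y||^2 by least squares.
X = np.column_stack([np.ones_like(post), post])
beta, *_ = np.linalg.lstsq(X, gdpcap, rcond=None)
b0, b1 = beta

# b0 is the pre-period mean and b1 is (post mean - pre mean).
print("b0 =", b0, " b1 =", b1)
print("difference in means =", gdpcap[post == 1].mean() - gdpcap[post == 0].mean())
```

Note that the dependent variable is `gdpcap` and the regressor is the indicator, not the calendar year.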
In [ ]:
Part C: Difference-in-Differences (DiD)
We'll establish causality by comparing the trend of a "control" group to the Basque region. This is like conducting an experiment where you give some patients a vaccine, drug, etc. while you give other patients a placebo. In that experimental setting, we design the experiment such that the control and treatment groups are identical but for the treatment applied. We then observe whether or not outcomes-of-interest are significantly different between the two groups.
The same rationale holds here but of course we don't have an experimental setting so we choose a control group after the fact under the assumption this group provides an adequate proxy for the trend that would have been observed in the treatment group in the absence of treatment. If true, the difference in change of slope would be the actual treatment effect. Hence, the "difference in differences" terminology. Note that our "identification strategy" requires the treatment and control groups follow the same pre-period (i.e., pre-treatment) trend which is something we can easily show.
In fact, we'll choose our control group using the following criteria:
High correlation between our control region's per capita GDP and the per capita GDP in the Basque country before 1975.
The control region is not located close to the Basque region. Implicitly, this criterion is based on our assumption that ETA bombings did not affect the control region's per capita GDP.
Solve for the correlation in per capita gdp between the different regions and the Basque country using only the pre-1975 period. Your output should be a table of correlations between per capita GDP of non-Basque regions (e.g., "Asturias") and the Basque region. Choose a control region which satisfies the above two criteria.
In [140]:
compare.groupby('year')[['regionname','gdpcap']].corr()
Out[140]:
gdpcap
year
1955 gdpcap 1.0
1956 gdpcap 1.0
1957 gdpcap 1.0
1958 gdpcap 1.0
1959 gdpcap 1.0
1960 gdpcap 1.0
1961 gdpcap 1.0
1962 gdpcap 1.0
1963 gdpcap 1.0
1964 gdpcap 1.0
1965 gdpcap 1.0
1966 gdpcap 1.0
1967 gdpcap 1.0
1968 gdpcap 1.0
1969 gdpcap 1.0
1970 gdpcap 1.0
1971 gdpcap 1.0
1972 gdpcap 1.0
1973 gdpcap 1.0
1974 gdpcap 1.0
1975 gdpcap 1.0
1976 gdpcap 1.0
1977 gdpcap 1.0
1978 gdpcap 1.0
1979 gdpcap 1.0
1980 gdpcap 1.0
1981 gdpcap 1.0
1982 gdpcap 1.0
1983 gdpcap 1.0
1984 gdpcap 1.0
1985 gdpcap 1.0
1986 gdpcap 1.0
1987 gdpcap 1.0
1988 gdpcap 1.0
1989 gdpcap 1.0
1990 gdpcap 1.0
1991 gdpcap 1.0
1992 gdpcap 1.0
1993 gdpcap 1.0
1994 gdpcap 1.0
1995 gdpcap 1.0
1996 gdpcap 1.0
1997 gdpcap 1.0
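The correlation table the exercise asks for (each non-Basque region against the Basque region, pre-period only) can be sketched by pivoting to one column per region and calling `corrwith`. This is a minimal sketch under stated assumptions: only three comparison regions and the years 1955 to 1960 are included, with values copied from the wide table in this notebook, and the names `series`, `long`, `wide`, and `corr_with_basque` are hypothetical. Whether 1975 itself belongs in the window is a judgment call; the sketch includes it to match Part A.

```python
import pandas as pd

basque_name = "Basque Country (Pais Vasco)"
years = [1955, 1956, 1957, 1958, 1959, 1960]

# gdpcap values copied from the wide table in this notebook (subset only).
series = {
    basque_name:             [3.853185, 3.945658, 4.033562, 4.023422, 4.013782, 4.285918],
    "Cataluna":              [3.546630, 3.690446, 3.826835, 3.875678, 3.921737, 4.241788],
    "Madrid (Comunidad De)": [4.594473, 4.786632, 4.963439, 4.906170, 4.846401, 5.161097],
    "Andalucia":             [1.688732, 1.758498, 1.827621, 1.852756, 1.878035, 2.010140],
}

# Long format, mirroring the regionname / year / gdpcap layout of `compare`.
rows = [(region, year, value)
        for region, values in series.items()
        for year, value in zip(years, values)]
long = pd.DataFrame(rows, columns=["regionname", "year", "gdpcap"])

# One row per year, one column per region.
wide = long.pivot(index="year", columns="regionname", values="gdpcap")

# Restrict to the pre-period, then correlate every other region with Basque.
pre = wide.loc[wide.index <= 1975]
corr_with_basque = (
    pre.drop(columns=basque_name)
       .corrwith(pre[basque_name])
       .sort_values(ascending=False)
       .rename("corr_with_basque")
)
print(corr_with_basque)
```

Grouping by year and correlating `gdpcap` with itself, by contrast, returns 1.0 for every year, because each group contains only one numeric column.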
Plot per capita GDP for the Basque country and your choice of control region (i.e., the region which best satisfies the above criteria). Put a vertical dashed line at 1975 and indicate Pre/Post periods.
In [142]:
def histogram_intersection(a, b):
    v = np.minimum(a, b).sum().round(decimals=1)
    return v
compare.corr(method=histogram_intersection)
Out[142]:
year gdpcap
year 1.0 3945.0
gdpcap 3945.0 1.0
In [152]:
import numpy as np
compare.groupby("year").apply(lambda s: pd.Series({"gdpSum": s["gdpcap"].sum()})).plot()
Out[152]:
AxesSubplot:xlabel='year'
Run the following difference-in-differences regression:
$$\text{Per Capita GDP}_{it} = \beta_0 + \beta_1 P_t + \beta_2 T_i + \beta_3 P_t \times T_i + \epsilon_t$$
where $P_t$ is the "period" indicator equal to one for all years after 1975 and zero otherwise; $T_i$ is the "treatment" group indicator equal to one when the region is the Basque region; and $i$ is the region. Note the regression only uses observations from the treatment (Basque region) and control group.
In [164]:
esult
Out[164]:
regionno regionname year gdpcap sec.agriculture sec.energy sec.industry sec.construction sec.services.venta sec.services.nonventa school.illit school.prim school.med school.high school.post.high popdens invest
rownames
44 2.0 Andalucia 1955.0 1.688732 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
45 2.0 Andalucia 1956.0 1.758498 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
46 2.0 Andalucia 1957.0 1.827621 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
47 2.0 Andalucia 1958.0 1.852756 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
48 2.0 Andalucia 1959.0 1.878035 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
770 18.0 Rioja (La) 1993.0 9.132391 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 16.765787
771 18.0 Rioja (La) 1994.0 9.498000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 16.469452
772 18.0 Rioja (La) 1995.0 9.752213 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 20.275650
773 18.0 Rioja (La) 1996.0 10.056413 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
774 18.0 Rioja (La) 1997.0 10.476292 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
731 rows × 17 columns
In [188]:
pd.set_option('display.max_columns', 30)
df = result[result['regionname'] != 'Spain (Espana)']
pivot = df.pivot_table(values='gdpcap', index='regionname', columns=['year'])
dfProp99 = pd.DataFrame(pivot.to_records())
allColumns = dfProp99.columns.values
dfProp99
Out[188]:
regionname 1955.0 1956.0 1957.0 1958.0 1959.0 1960.0 1961.0 1962.0 1963.0 1964.0 1965.0 1966.0 1967.0 1968.0 ... 1983.0 1984.0 1985.0 1986.0 1987.0 1988.0 1989.0 1990.0 1991.0 1992.0 1993.0 1994.0 1995.0 1996.0 1997.0
0 Andalucia 1.688732 1.758498 1.827621 1.852756 1.878035 2.010140 2.129177 2.280348 2.431020 2.508855 2.584690 2.694444 2.802342 2.987361 ... 4.291631 4.358683 4.426593 4.663239 4.900671 5.159597 5.417738 5.585261 5.749214 5.641245 5.534918 5.638817 5.720723 5.995930 6.300986
1 Aragon 2.288775 2.445159 2.603399 2.639032 2.677092 2.881462 3.099543 3.359183 3.614182 3.680091 3.745287 3.883319 4.016138 4.243645 ... 6.260854 6.372894 6.495501 6.926521 7.358612 7.802770 8.242645 8.458298 8.668238 8.466866 8.256927 8.573979 8.846758 9.096687 9.518709
2 Baleares (Islas) 3.143959 3.347758 3.549629 3.642673 3.734862 4.058841 4.360254 4.646173 4.911525 5.050700 5.184662 5.466795 5.737646 6.161454 ... 8.925307 9.275921 9.652242 10.257783 10.823336 11.120395 11.408169 11.512425 11.679520 11.319623 10.969723 11.419594 11.773779 11.926592 12.350043
3 Basque Country (Pais Vasco) 3.853185 3.945658 4.033562 4.023422 4.013782 4.285918 4.574336 4.898957 5.197015 5.338903 5.465153 5.545916 5.614896 5.852185 ... 6.595330 6.761497 6.937161 7.332191 7.742788 8.120537 8.509711 8.776778 9.025279 8.873893 8.718224 9.018138 9.440874 9.686518 10.170666
4 Canarias 1.914382 2.071837 2.226078 2.220866 2.213439 2.357684 2.445730 2.648243 2.844759 2.951157 3.054199 3.231791 3.403385 3.660312 ... 5.719866 5.801342 5.885604 6.256784 6.612682 6.977007 7.337903 7.345044 7.347187 7.220080 7.092188 7.410740 7.616395 7.817052 8.060554
5 Cantabria 2.559412 2.693873 2.820337 2.879035 2.943730 3.137032 3.327621 3.555341 3.771423 3.839403 3.906098 4.032133 4.155955 4.375893 ... 5.941660 6.028849 6.138318 6.420451 6.713225 7.023422 7.333619 7.450729 7.596401 7.462154 7.327906 7.550700 7.777064 7.907741 8.226935
6 Castilla Y Leon 1.729149 1.838332 1.947658 1.971365 1.995144 2.138817 2.239503 2.454227 2.672237 2.777778 2.882176 2.988075 3.094544 3.302271 ... 4.970722 5.113468 5.261997 5.647315 6.044630 6.354399 6.674022 6.870323 7.063196 7.045487 7.027635 7.074264 7.282919 7.611397 7.888460
7 Castilla-La Mancha 1.327764 1.415096 1.503570 1.531420 1.559340 1.667524 1.752428 1.920451 2.091902 2.182591 2.274707 2.378392 2.482362 2.709083 ... 4.424664 4.550057 4.677664 4.980648 5.295559 5.677878 6.065339 6.279420 6.474507 6.330691 6.188589 6.230934 6.328763 6.614396 6.865396
8 Cataluna 3.546630 3.690446 3.826835 3.875678 3.921737 4.241788 4.575335 4.838046 5.081334 5.158098 5.223651 5.332477 5.429449 5.674379 ... 7.397886 7.484290 7.569980 8.077692 8.583976 9.057412 9.525850 9.785062 10.050700 9.837903 9.625107 10.006427 10.339903 10.576264 11.045416
9 Comunidad Valenciana 2.575978 2.738503 2.899886 2.963510 3.026207 3.219294 3.362468 3.569980 3.765210 3.823693 3.874179 3.978149 4.073408 4.279777 ... 6.139817 6.236861 6.336118 6.739360 7.144387 7.560697 7.969152 8.138389 8.306198 8.080548 7.857041 8.068409 8.289061 8.429734 8.725364
10 Extremadura 1.243430 1.332548 1.422451 1.440231 1.458083 1.535847 1.596258 1.705584 1.817695 1.882819 1.948872 2.032633 2.117609 2.245501 ... 3.648957 3.737146 3.828478 4.161739 4.498572 4.769494 5.051414 5.234076 5.398315 5.365467 5.332619 5.439874 5.501357 5.905813 6.224579
11 Galicia 1.634676 1.725578 1.816481 1.840903 1.865396 1.983290 2.005784 2.185661 2.366395 2.458797 2.549700 2.669666 2.787846 2.978363 ... 4.806341 4.899314 4.997786 5.277921 5.565767 5.909526 6.254499 6.453228 6.641603 6.544487 6.447515 6.556484 6.688660 6.862468 7.138532
12 Madrid (Comunidad De) 4.594473 4.786632 4.963439 4.906170 4.846401 5.161097 5.632605 5.840831 6.024493 6.099329 6.152028 6.110469 6.057341 6.253142 ... 7.558555 7.699228 7.839189 8.347615 8.849615 9.254499 9.657955 9.806484 9.963582 9.840046 9.718652 9.882177 10.098543 10.322765 10.732648
13 Murcia (Region de) 1.679520 1.764282 1.850328 1.887389 1.924093 2.118609 2.305484 2.521422 2.739074 2.851257 2.965938 3.099186 3.227292 3.461154 ... 4.906884 5.031991 5.154599 5.471508 5.788560 6.124893 6.450443 6.578192 6.710368 6.662882 6.616324 6.784847 6.885818 7.045416 7.295058
14 Navarra (Comunidad Foral De) 2.555127 2.698158 2.839831 2.881891 2.930877 3.163525 3.335904 3.623393 3.894816 3.985147 4.072979 4.210011 4.352399 4.556984 ... 6.544702 6.797201 7.047772 7.449300 7.879178 8.349758 8.803913 9.197372 9.591545 9.345187 9.117395 9.365895 9.758640 10.060697 10.522708
15 Principado De Asturias 2.502928 2.615538 2.725793 2.751857 2.777421 2.967295 3.143887 3.373536 3.597258 3.672594 3.743359 3.909383 4.073122 4.308626 ... 5.769066 5.887318 6.011283 6.234790 6.465296 6.688160 6.913525 6.983148 7.040631 6.922808 6.798343 6.954156 7.116467 7.217224 7.475721
16 Rioja (La) 2.390460 2.535204 2.680020 2.726435 2.772851 2.969866 3.153171 3.404384 3.669238 3.803985 3.921808 4.032705 4.160311 4.373036 ... 6.502142 6.626893 6.775564 7.165096 7.580691 8.002713 8.453299 8.858897 9.229506 9.180948 9.132391 9.498000 9.752213 10.056413 10.476292
17 rows × 44 columns
In [ ]:
In [179]:
import numpy as np
states = list(np.unique(dfProp99['regionname']))
years = np.delete(allColumns, [0])
caStateKey = 'Basque Country (Pais Vasco)'
states.remove(caStateKey)
otherStates = states
yearStart = 1955
yearTrainEnd = 1975
yearTestEnd = 1998  # range() excludes the end point, so this covers 1975-1997, the last year in the data
p = 1.0
In [ ]:
import copy
from tslib.src import tsUtils  # assumes the tslib package is installed

# Year labels; assumes dfProp99's year columns are strings such as '1955.0'
trainingYears = []
for i in range(yearStart, yearTrainEnd, 1):
    trainingYears.append(str(float(i)))
testYears = []
for i in range(yearTrainEnd, yearTestEnd, 1):
    testYears.append(str(float(i)))

trainDataMasterDict = {}
trainDataDict = {}
testDataDict = {}
for key in otherStates:
    series = dfProp99.loc[dfProp99['regionname'] == key]
    trainDataMasterDict.update({key: series[trainingYears].values[0]})
    # optionally hide training values (p = 1.0 here)
    (trainData, pObservation) = tsUtils.randomlyHideValues(copy.deepcopy(trainDataMasterDict[key]), p)
    trainDataDict.update({key: trainData})
    testDataDict.update({key: series[testYears].values[0]})

# The treated (Basque) series is added without hiding any values
series = dfProp99[dfProp99['regionname'] == caStateKey]
trainDataMasterDict.update({caStateKey: series[trainingYears].values[0]})
trainDataDict.update({caStateKey: series[trainingYears].values[0]})
testDataDict.update({caStateKey: series[testYears].values[0]})

trainMasterDF = pd.DataFrame(data=trainDataMasterDict)
trainDF = pd.DataFrame(data=trainDataDict)
testDF = pd.DataFrame(data=testDataDict)
In [ ]:
trainDF.head()
Discuss the results. The interpretation of the coefficients is:
$\beta_0$: Average per capita gdp (y) for the control group in the pre-treatment period.
$\beta_1$: Average change in per capita gdp (y) from the first to the second time period that is common to both groups.
$\beta_2$: Average difference in per capita gdp (y) between the two groups that is common to both time periods.
$\beta_3$: Average differential change in per capita gdp (y) from the first to the second time period of the treatment group relative to the control group.
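The coefficient interpretations above can be checked numerically. The sketch below fits the DiD regression by least squares and confirms that the interaction coefficient equals the difference of the two before/after mean changes. It is an illustration under loud assumptions: Cataluna is used as the control purely as an example (it is not a recommendation for the exercise), only the years 1955 to 1957 and 1995 to 1997 are included, with values copied from the wide table in this notebook, and the column names `P`, `T`, and `PT` are introduced here.

```python
import numpy as np
import pandas as pd

basque_name = "Basque Country (Pais Vasco)"

# Subset of gdpcap values copied from the wide table in this notebook.
df = pd.DataFrame({
    "regionname": [basque_name] * 6 + ["Cataluna"] * 6,
    "year": [1955, 1956, 1957, 1995, 1996, 1997] * 2,
    "gdpcap": [3.853185, 3.945658, 4.033562, 9.440874, 9.686518, 10.170666,
               3.546630, 3.690446, 3.826835, 10.339903, 10.576264, 11.045416],
})

# P_t: post-1975 indicator; T_i: treatment (Basque) indicator; PT: interaction.
df["P"] = (df["year"] > 1975).astype(int)
df["T"] = (df["regionname"] == basque_name).astype(int)
df["PT"] = df["P"] * df["T"]

# OLS of gdpcap on [1, P, T, P*T].
X = np.column_stack([np.ones(len(df)), df["P"], df["T"], df["PT"]])
beta, *_ = np.linalg.lstsq(X, df["gdpcap"].to_numpy(), rcond=None)
b0, b1, b2, b3 = beta

# The same quantity from the four group-by-period means.
cell = df.groupby(["T", "P"])["gdpcap"].mean()
did = (cell.loc[(1, 1)] - cell.loc[(1, 0)]) - (cell.loc[(0, 1)] - cell.loc[(0, 0)])

print("b3 =", b3, " difference-in-differences of means =", did)
```

Because the model has one parameter per group-period cell, the fitted values are the cell means, so the intercept is the control group's pre-period mean and the interaction term is exactly the difference in differences.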
Our interest is $\beta_3$. For you visual learners, the following image will give you some intuition as to how to interpret the results (and why this is called difference-in-differences):
Exercise 3: Synthetic Control [30 points]
The "synthetic control" (aka synthetic difference-in-differences) method allows for estimation in settings where a single unit (a state, country, firm, etc.) is exposed to an event or intervention but it's not obvious who or what the control group should be. Specifically, synthetic control provides a data-driven procedure to construct a control group using a convex combination of comparison units. The idea is that this generated (i.e., synthetic) control group approximates the characteristics of the unit of interest prior to treatment. The thought (hope) is that a combination of comparison units provides a better comparison for the unit exposed to the intervention than any single comparison unit.
In our example above, we identified a control region by generating a metric to assess pre-period trends. In many (most?) cases, we will have too much data to identify a control group by "eyeballing" the data and we'll likely not be able to satisfy the parallel pre-period trend assumption. What if instead we could create a control group via a convex combination of potential candidates? That's the idea behind "synthetic controls" -- we create a control group.
Synthetic control is very similar to DiD in how it estimates the true impact of a treatment. Both methods use control groups to construct a counterfactual for the treated group, giving us an idea of what the trend would have been had the treatment not happened. The counterfactual GDP of the treated group is predicted from the GDP of the control groups and possibly other covariates of the control groups.
Whereas DiD relies on a single chosen control for this counterfactual, "synthetic control" predicts it by assigning weights to the regressors in the control groups, which identifies the individual regressors and their influence on the prediction. Ultimately, the causal impact is the difference between actual GDP and the counterfactual GDP had the treatment not happened, which is the same idea as DiD.
As always, let's look at the raw data before proceeding.
Part A: Visualization
Create a figure of per capita GDP over time with different lines for each region. Put a dashed vertical line at 1975 and indicate pre and post periods.
In [ ]:
(U, s, Vh) = np.linalg.svd((trainDF) - np.mean(trainDF))
s2 = np.power(s, 2)
spectrum = np.cumsum(s2)/np.sum(s2)
plt.plot(spectrum)
plt.grid()
plt.title("Per gdp")
plt.figure()
plt.plot(s2)
plt.grid()
plt.xlabel("Ordered Singular Values")
plt.ylabel("Energy")
plt.title("Singular Value Spectrum")
Idea: What we'll be doing is picking the convex combination of regions to best match the evolution of per capita GDP prior to the ETA bombings (i.e., before 1975). We'll then hold these weights fixed and generate per capita GDP for the synthetic post-treatment control group. That's it. Identifying the causal effect then follows our DiD regression where we use the synthetic control rather than a specific control. A nice feature of this approach is that we can be hands-off in selection of our control group. Plus, construction of our synthetic control is not based on the treatment period so there's minimal risk of inadvertently baking-in our results.
Part B: Construct a Synthetic Control Group
Use Lasso to regress per capita GDP of the Basque country (i.e., this is the y) on per capita GDP of the other regions for the years prior to (and including) 1975. Be sure to use cross-validation (leave-one-out) to choose the best penalty parameter. Since there are 17 other regions, there are potentially 17 variables and one constant in the regression. You can use the pivot pandas method to reshape the main data frame. I say "potentially" b/c you should exclude regions which violate criterion two.
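One way to set up the Lasso step described above is scikit-learn's `LassoCV` with a `LeaveOneOut` splitter, fit on the pre-period rows only and then used to predict every year. The sketch below is a hedged illustration, not the intended solution: the data are simulated (five hypothetical donor series sharing a common trend, and a toy treated series), so none of the numbers relate to the Basque data, and names such as `X_all`, `y_pre`, and `synthetic_all` are assumptions. Note also that Lasso weights are not constrained to be non-negative or to sum to one, so this is a regression-based stand-in for a convex combination.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)

# Simulated toy data: 43 years, 5 hypothetical donor regions (NOT the Basque data).
years = np.arange(1955, 1998)
pre = years <= 1975                      # pre-period mask, 1975 included
trend = np.linspace(2.0, 8.0, years.size)
scales = (0.6, 0.8, 1.0, 1.2, 1.4)
X_all = np.column_stack(
    [s * trend + rng.normal(0.0, 0.05, years.size) for s in scales]
)

# Toy treated series, observed here only for the pre-period.
y_pre = 0.5 * X_all[pre, 1] + 0.5 * X_all[pre, 3] + rng.normal(0.0, 0.05, pre.sum())

# Leave-one-out cross-validation chooses the penalty; fit on pre-period rows only.
model = LassoCV(cv=LeaveOneOut(), max_iter=100_000).fit(X_all[pre], y_pre)
weights = model.coef_

# Hold the weights fixed and predict the synthetic control for all years.
synthetic_all = model.predict(X_all)

print("chosen penalty:", model.alpha_)
print("donor weights:", np.round(weights, 3))
```

With the real data, `X_all` would come from pivoting the frame to one column per region (dropping the Basque column and any regions that violate criterion two), and the post-1975 rows of `synthetic_all` would serve as the counterfactual.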
In [ ]:
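# Assumes the tslib package is installed
from tslib.src.synthcontrol.syntheticControl import RobustSyntheticControl
singvals = 4
rscModel = RobustSyntheticControl(caStateKey, singvals, len(trainDF), probObservation=1.0, modelType='svd', svdMethod='numpy', otherSeriesKeysArray=otherStates)
rscModel.fit(trainDF)
denoisedDF = rscModel.model.denoisedDF()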
Use the model to generate (predict) a control group for all the years.
In [15]:
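# Counterfactual for the post-period: donor regions' GDP times the fitted weights
predictions = np.dot(testDF[otherStates], rscModel.model.weights)
# Actual Basque series; dfProp99 is keyed by 'regionname'
actual = dfProp99.loc[dfProp99['regionname'] == caStateKey]
actual = actual.drop('regionname', axis=1)
actual = actual.iloc[0]
# In-sample fit over the training (pre-treatment) years
model_fit = np.dot(trainDF[otherStates][:], rscModel.model.weights)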
Graph the per capita GDP series for the Basque country and your synthetic control group. Put a dashed vertical line at 1975 and indicate pre and post periods.
In [ ]:
import matplotlib.ticker as ticker
fig, ax = plt.subplots(1,1)
tick_spacing = 5
label_markings = np.insert(years[::tick_spacing], 0, 'dummy')
ax.set_xticks(np.arange(len(label_markings)))
ax.set_xticklabels(label_markings, rotation=45)
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.plot(years, actual ,label='actual')
plt.xlabel('Year')
plt.ylabel('Per capita GDP (thousands)')
plt.plot(trainingYears, model_fit, label='fitted model')
plt.plot(testYears, predictions, label='counterfactual')
plt.title(caStateKey+', Singular Values used: '+str(singvals))
xposition = pd.to_datetime(yearTrainEnd, errors='coerce')
# Assumes the x-axis labels in years look like '1975.0'
plt.axvline(x=str(float(yearTrainEnd)), color='k', linestyle='--', linewidth=4)
plt.grid()
plt.legend()
Part C: Synthetic Difference-in-Differences
Run the following synthetic difference-in-differences regression:
$$\text{Per Capita GDP}_{it} = \beta_0 + \beta_1 P_t + \beta_2 T_i + \beta_3 P_t \times T_i + \epsilon_t$$
where $P_t$ is the "period" indicator equal to one for all years after 1975 and zero otherwise; $T_i$ is the "treatment" group indicator equal to one when the region is the Basque region; and $i$ is the region. Note the regression only uses observations from the treatment (Basque region) and synthetic control group.
In [ ]:
Discuss the results. What effect did the ETA bombings have on per capita GDP in the Basque region?
In [ ]:
Final/finalproblemset-gexy0jl5.ipyn
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Final Problem Set: Due Tuesday, May 10 at 5:00 pm EDT\n",
"\n",
"**Important:** You need `basque.RData` to complete this set. All are located on eLC.\n",
"\n",
"Answer the questions below. This notebook will be your workspace so add cells as you please. When you are finished, you will submit two objects to eLC:\n",
"\n",
"1.
**Written pdf Report.**
This should be short. You should include:\n",
" * **The research question.**\n",
" \n",
" * **The figure from your Exploratory Data Analysis along with discussion.** Be sure to note the difference in per capita GDP pre and post 1975.\n",
" \n",
" * **Difference-in-differences result.** Be sure to include the figure comparing the Basque country and the control region. Discuss how the control group satisfies our criteria. Be sure to discuss identification (see \"interpretation\" of DiD below) and results relative to the research question. Ideally, you should work your estimation result into the figure so the reader can quickly observe the \"answer\" to the research question. \n",
" \n",
" * **Synthetic Control result.** Be sure to include the figure comparing the Basque country and the synthetic control region. Be sure to discuss identification (see \"interpretation\" of DiD below) and results relative to the research question. Ideally, you should work your estimation result into the figure so the reader can quickly observe the \"answer\" to the research question.\n",
"\n",
" Your analysis should be professional: i.e., well-written, clear, and concise. Figures should be incorporated in your analysis. For example, it would be useful to set the title of each figure in this notebook (e.g., \"Figure 1: Per Capita GDP in the Basque Region\") so in the final report you can reference the figure in the appropriate commentary. Save your report as a pdf (\"File/Save As Adobe PDF\") with the naming convention **'Final_Report_[insert last name]'**. For example, 'Final_Report_Thurk.pdf'. \n",
"\n",
"\n",
"2. **Jupyter Notebook.** Print the notebook as a pdf. [To print: From the file menu, choose 'print preview'. A new tab will open with the notebook presented as html. Print as a pdf.] Save your pdf notebook with the naming convention **'Final_Workbook_[insert last name]'**. For example, 'Final_Workbook_Thurk.pdf'. Think of the notebook as your opportunity to show your work.\n",
"\n",
"**Grading:** The problem set is worth **90 points** and partial credit is indicated for each exercise. I will grade your report (1) and use your notebook (2) to assign partial credit in the event there are errors.\n",
"\n",
"\n",
"* You may use your notes, books, and the internet.\n",
"* Do not consult with other people. This work should be entirely your own. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 1: Exploratory Data Analysis [30 points]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're going to estimate the economic impact of terrorism in the Basque region -- an autonomous community in Spain. To do this, we'll use information from 17 other Spanish regions. Our underlying assumption is that terrorism affected economic activity in the Basque region but not elsewhere, whereas other economic shocks are aggregate and affect all of Spain.\n",
"\n",
"Here are the details:\n",
"\n",
"* Coverage from 1955–1997 for 18 Spanish regions. One of the data \"regions\" is all of Spain, which we won't use.\n",
"* The treatment region is “Basque Country (Pais Vasco)”.\n",
"* The \"treatment\" year is 1975 since there were several bombings around that year.\n",
"* We will measure \"economic impact\" via GDP per capita (in thousands).\n",
"\n",
"__Background:__ Euskadi Ta Askatasuna (ETA) was an armed leftist Basque nationalist and separatist organization in the Basque Country (in northern Spain and southwestern France). The group was founded in 1959 and later evolved from a group promoting traditional Basque culture to a paramilitary group engaged in a violent campaign of bombing, assassinations and kidnappings in the Southern Basque Country and throughout Spanish territory. Its goal was gaining independence for the Basque Country. ETA was the main group within the Basque National Liberation Movement and was the most important Basque participant in the Basque conflict. While its terrorist activities spanned several decades, the death of Spanish dictator Francisco Franco in 1975 led to a substantial increase in bombings. (Wikipedia)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part A: Load Data\n",
"\n",
"Thus far we've learned how to load data from CSV, Excel, and Stata. Another popular programming language (and to some degree a rival of Python) is the open-source language R. The data we want to access is from the following academic paper \n",
"\n",
"> Abadie, A. and Gardeazabal, J. (2003) \"Economic Costs of Conflict: A Case Study of the Basque Country.\" American Economic Review 93 (1) 113--132.\n",
"\n",
"where the data was saved in R's native data format. We'll load it by leveraging Python's own open-source nature: a user-created package called `pyreadr`, which we install as usual in the terminal:\n",
"\n",
"`pip install pyreadr`\n",
"\n",
"We then use `pyreadr` to load `basque.RData` and access the data as follows:\n",
"\n",
"```python\n",
"import pyreadr\n",
"\n",
"result = pyreadr.read_r('basque.RData')  # dict-like: R object name -> pandas dataframe\n",
"data = result['basque'] # extract the pandas dataframe\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"