DEPARTMENT OF ECONOMICS
ECON 4041H – RESEARCH METHODOLOGY
Winter 2023, Peterborough
Assignment #1
Due date: January 31, 2023
Instructions: You must provide your own unique solution. You may work with others, but each
of you is responsible for submitting your own problem set solution. Question values
are listed for each question. Submit your solution through SafeAssign. Ideally you will
submit your RMarkdown file, preferably in pdf format. Blackboard won’t accept
html files, so if submitting an html file, first zip it and submit the zipped version.
But if you don’t like using RMarkdown, you may submit two files: your command
file and a word-processor file containing results, comments and answers to questions,
as well as graphs. Please bind all output together in one document file rather than
submitting separate files for each question, or for each graph. Your command file
will be a separate file.
For questions 1–5 use the labour force survey file lfs7797.rds. For question 6, use the 2016 Census
PUMF cen16.rds.
1. Some basic data descriptions of datafile lfs7797.rds [15 marks]
a. number of observations in the dataframe
b. number of observations for variable cowmain–class of worker
c. number of missing observations for variable cowmain
d. mean wage (hrlyearn) of workers of variable cowmain category:
i. “Public employee”
ii. “Private employee”
e. mean wage (hrlyearn) of workers of variable union category:
i. “Union + agreement”
ii. “Agreement,no union”
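The kinds of calls this question involves can be sketched on a small made-up data frame. This is only an illustration: the object name lfs and all values are assumptions, while the variable names and category labels are taken from the question. With the real file, the data frame would come from readRDS("lfs7797.rds").

```r
# Toy stand-in for the LFS file; with the real data: lfs <- readRDS("lfs7797.rds")
lfs <- data.frame(
  cowmain  = factor(c("Public employee", "Private employee", NA,
                      "Private employee", "Public employee")),
  union    = factor(c("Union + agreement", "Agreement,no union",
                      "Union + agreement", NA, "Agreement,no union")),
  hrlyearn = c(20, 12, 15, NA, 30)
)

n_rows    <- nrow(lfs)                 # observations in the data frame
n_cowmain <- sum(!is.na(lfs$cowmain))  # non-missing values of cowmain
n_missing <- sum(is.na(lfs$cowmain))   # missing values of cowmain

# Mean wage within each category, ignoring missing wages;
# rows whose category is NA are dropped by tapply()
wage_by_cow   <- tapply(lfs$hrlyearn, lfs$cowmain, mean, na.rm = TRUE)
wage_by_union <- tapply(lfs$hrlyearn, lfs$union,   mean, na.rm = TRUE)

wage_by_cow["Public employee"]
wage_by_union["Union + agreement"]
```

Whether missing wages should be dropped with na.rm = TRUE is a choice to state in your answer; it is assumed here.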
2. Distribution of hrlyearn (wage rate), and uhrsmain (usual weekly hours) [15 marks]
a. summary statistics: find mean, median, maximum, minimum, standard deviation of
wage rate and weekly hours
b. plot the densities of
i. wage rate
ii. log of wage rate
iii. usual weekly hours
iv. log of usual weekly hours
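A minimal sketch of summary statistics and density plots, run on simulated values rather than the survey file. The helper name describe, the seed, and the simulated distributions are assumptions made for illustration; only the variable names come from the question.

```r
set.seed(1)  # arbitrary seed, only so the toy data are reproducible

# Toy stand-in; with the real data: lfs <- readRDS("lfs7797.rds")
lfs <- data.frame(
  hrlyearn = c(rlnorm(200, meanlog = 2.7, sdlog = 0.5), NA),
  uhrsmain = c(round(runif(200, min = 5, max = 60)), NA)
)

# Mean, median, maximum, minimum and standard deviation of one numeric vector
describe <- function(x) {
  c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE),  min    = min(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE))
}
stats <- sapply(lfs[c("hrlyearn", "uhrsmain")], describe)
stats

# Kernel densities in levels and in logs; the log is only defined for
# positive values, so non-positive and missing values are dropped first
wage <- lfs$hrlyearn[!is.na(lfs$hrlyearn) & lfs$hrlyearn > 0]
plot(density(wage), main = "Wage rate")
plot(density(log(wage)), main = "Log of wage rate")
```

The same two plot() calls apply to uhrsmain after the same filtering.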
3. Generate some 2x2 tables of several variables [15 marks]
a. first recode the variables for educational attainment: ed76to89 and educ90. The first is
for years prior to 1990, and the second is for 1990 on. Recode to create one variable for
both years and call it educ
i. ed76to89
• “0 to 8 years” and “9-10 yrs schooling”: code as “less than high school”
• “11-13 years schooling” and “Some post secondary”: “high school”
• “Post secondary certificate of diploma”: “college” (note: keep spelling error)
• “University degree”: “university”
ii. educ90
• “0 to 8 years” and “Some secondary”: “less than high school”
• “Grade 11 to 13,grad” and “Some post secondary”: “high school”
• “College diploma”: “college”
• “Bachelors degree” or “Graduate degree”: “university”
b. now calculate the following conditional means
i. mean hourly earnings by sex
ii. mean hourly earnings by educational attainment
iii. mean weekly hours by sex
iv. mean weekly hours by educational attainment
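One possible way to do the recode is a lookup table from the original labels to the four common categories, sketched here on a made-up data frame. The object names, the wage values and the sex labels are assumptions; the education labels are copied from the lists above. Because the two variables agree wherever a label appears in both, a single table covers them.

```r
# Toy stand-in: each row has a value in only one of the two education variables
lfs <- data.frame(
  ed76to89 = c("0 to 8 years", "Some post secondary",
               "Post secondary certificate of diploma", NA, NA, NA),
  educ90   = c(NA, NA, NA, "Some secondary", "College diploma",
               "Graduate degree"),
  sex      = c("Male", "Female", "Female", "Male", "Female", "Male"),  # assumed labels
  hrlyearn = c(10, 14, 18, 9, 17, 30),
  stringsAsFactors = TRUE
)

# Lookup table: original label -> common category
recode_map <- c(
  "0 to 8 years"                          = "less than high school",
  "9-10 yrs schooling"                    = "less than high school",
  "Some secondary"                        = "less than high school",
  "11-13 years schooling"                 = "high school",
  "Grade 11 to 13,grad"                   = "high school",
  "Some post secondary"                   = "high school",
  "Post secondary certificate of diploma" = "college",
  "College diploma"                       = "college",
  "University degree"                     = "university",
  "Bachelors degree"                      = "university",
  "Graduate degree"                       = "university"
)

# Take whichever of the two variables is non-missing, then look up its category
old_label <- ifelse(is.na(lfs$ed76to89),
                    as.character(lfs$educ90), as.character(lfs$ed76to89))
lfs$educ <- factor(unname(recode_map[old_label]),
                   levels = c("less than high school", "high school",
                              "college", "university"))

# Conditional means of hourly earnings
mean_wage_by_sex  <- tapply(lfs$hrlyearn, lfs$sex,  mean, na.rm = TRUE)
mean_wage_by_educ <- tapply(lfs$hrlyearn, lfs$educ, mean, na.rm = TRUE)
mean_wage_by_educ
```

Any label not present in the lookup table becomes NA in educ, so a quick table(lfs$educ, useNA = "ifany") is a useful check on the real data.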
4. Composition of labour force by year: 1977 and 1997 [15 marks]
a. by sex (sex)
b. by educational attainment (use variable created in previous question)
c. by age (use variable age_12)
Use the variable lfsstat (labour force status) to subset the labour force. Remember from
macro that the labour force is composed of those employed plus those unemployed.
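A sketch of subsetting the labour force and tabulating its composition by year, on made-up data. The column name year and the lfsstat labels "Employed", "Unemployed" and "Not in labour force" are hypothetical; the real file may use a different year variable and may split employment into several categories, so check levels(lfs$lfsstat) first.

```r
# Toy stand-in; `year` and the lfsstat labels are hypothetical
lfs <- data.frame(
  year    = c(1977, 1977, 1977, 1977, 1997, 1997, 1997, 1997, 1987),
  lfsstat = c("Employed", "Unemployed", "Not in labour force", "Employed",
              "Employed", "Employed", "Unemployed", "Not in labour force",
              "Employed"),
  sex     = c("Male", "Male", "Female", "Female",
              "Female", "Male", "Female", "Male", "Male")
)

# Labour force = employed + unemployed; keep only the two years of interest
in_lf <- lfs$lfsstat %in% c("Employed", "Unemployed")
lf    <- lfs[in_lf & lfs$year %in% c(1977, 1997), ]

# Counts by year and sex, then shares within each year (rows sum to 1)
counts <- table(lf$year, lf$sex)
shares <- prop.table(counts, margin = 1)
shares
```

Replacing lf$sex with the education or age variable gives the other two breakdowns.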
5. Test the central limit theorem, as we did in our demo example. You will draw repeated samples
of two variables, hrlyearn (wages) and uhrsmain (usual weekly hours worked), saving the
mean value of each sample. Then compare the means, standard deviations and distributions
of the three sets of sample means to the “population” statistics.
Note, the data are in a dataframe, so you must either extract each variable as a vector, or
make sure you set your command for a dataframe. In order to replicate results, you will
need to set a seed value. The seed value determines a starting point for the random number
generator. To set your seed value, take your sid, drop the leading 0, then take the sum of the
next three digits and the last three. For example, if my sid is XXXXXXXXXX, I would calculate my
seed value as XXXXXXXXXX = 579. Then draw the random sample following the example in the
Sampling Distribution exercise. [20 marks]
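The seed arithmetic can be sketched as follows, reading "the sum of the next three digits and the last three" as adding two three-digit numbers. The sid below is hypothetical and assumed to have seven digits; it was chosen only because this reading then gives 579. Substitute your own sid.

```r
# Hypothetical 7-digit sid with a leading zero; substitute your own
sid <- "0123456"

digits <- sub("^0", "", sid)                       # drop the leading 0
first3 <- as.integer(substr(digits, 1, 3))         # next three digits
last3  <- as.integer(substr(digits, nchar(digits) - 2, nchar(digits)))  # last three
seed_value <- first3 + last3
seed_value

# Setting the same seed before a draw makes the draw repeatable
set.seed(seed_value)
draw_a <- sample(100, 5)
set.seed(seed_value)
draw_b <- sample(100, 5)
identical(draw_a, draw_b)
```

Keeping the sid as a character string preserves the leading zero, which a numeric value would silently drop.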
a. Draw a sample of 1,000 observations of wages (hrlyearn). Save the mean value. Repeat
this for 2,000 repetitions. This yields 2,000 sample means. Then repeat for 5,000
observations, and again for 10,000 observations. This will give you three sets of 2,000
means. Report the mean, standard deviation, and graph the kernel density for each of
these three sets.
b. What do you see as you increase the sample size? Compare your results (mean, standard
deviation, density plot) with those of the aggregate sample.
c. Repeat parts a. and b. above, but use the weekly hours variable (uhrsmain).
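The repeated-sampling procedure in this question can be sketched on a simulated "population". This is not the course's Sampling Distribution exercise, which may be set up differently; the function name sample_means, the lognormal population and its size are assumptions. Note that sample() draws without replacement by default.

```r
set.seed(579)  # the example seed value from the text; use your own

# Toy right-skewed "population"; with the real data this would be the
# non-missing values of lfs$hrlyearn extracted as a vector
population <- rlnorm(50000, meanlog = 2.7, sdlog = 0.5)

# Draw `reps` samples of size `n` and return the mean of each sample
sample_means <- function(x, n, reps = 2000) {
  replicate(reps, mean(sample(x, size = n)))
}

sizes <- c(n1000 = 1000, n5000 = 5000, n10000 = 10000)
means <- lapply(sizes, function(n) sample_means(population, n))

# Mean and standard deviation of each set of 2,000 sample means,
# next to the same statistics for the whole "population"
results <- sapply(means, function(m) c(mean = mean(m), sd = sd(m)))
results
c(mean = mean(population), sd = sd(population))

# Overlay the three kernel densities of the sample means
dens <- lapply(means, density)
plot(dens$n10000, main = "Sample means by sample size", xlab = "Sample mean")
lines(dens$n5000, lty = 2)
lines(dens$n1000, lty = 3)
legend("topright", legend = names(sizes), lty = c(3, 2, 1))
```

Swapping the population vector for the weekly hours variable repeats the exercise for part c.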
6. Use the Census 2016 PUMF (cen16.rds) to test whether the relationship between age (factor
variable agegrp) and employment income (variable empin) is linear. Restrict your analysis to
those in the age range from 20 to 84 years old. The variable agegrp for this range consists of
5-year age groups. Generate a numeric version of this variable and use the numeric variable
rather than the factor variable where appropriate. [20 marks]
a. generate a scatter plot with employment income on the y-axis and (the numeric version
of) age on the x-axis. Use a subset of the census file including only 50,000 observations.
The generated plot will otherwise take up a lot of space in your output file.
b. generate a loess plot of employment income as a function of (the numeric version of)
age. Use a subset of the census file including only 50,000 observations. This command
is otherwise very slow. In specifying the loess plot command, make sure to include the
option “se = FALSE”, otherwise the estimation is very slow, even on the subset.
c. Run a regression of employment income on the numeric version of age. Report the
results and interpret. What do they mean?
d. Run a regression of employment income on the original factor variable version of age.
i. Report the results and interpret. What do they mean? Do they tell you anything
about whether the relationship is linear?
ii. Using the output from the regression above, test the significance of power terms of
the age variable using the contrast() command.
iii. Generate a plot of the predicted values of employment income for each level of the
age factor variable. Interpret.
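The comparison between the numeric and factor versions of age can be sketched on simulated data. Everything about the data here is an assumption: the object name cen, the level labels of the form "20 to 24 years", and the hump-shaped income profile. The polynomial terms are tested with base R orthogonal polynomial contrasts on an ordered factor, which is a stand-in for, not a reproduction of, the contrast() command named in part d.

```r
set.seed(1)  # arbitrary, for reproducible toy data

# Toy stand-in for the census extract; the labels are assumed, so check
# levels(agegrp) in the real file before parsing them
starts <- seq(20, 80, by = 5)
labels <- paste(starts, "to", starts + 4, "years")
n      <- 5000
idx    <- sample(seq_along(starts), n, replace = TRUE)
cen    <- data.frame(agegrp = factor(labels[idx], levels = labels))

# Numeric version: lower bound of each 5-year group, parsed from the label
cen$age_num <- as.numeric(sub(" .*", "", as.character(cen$agegrp)))

# Toy income with a hump-shaped age profile plus noise
cen$empin <- 60000 - 40 * (cen$age_num - 50)^2 + rnorm(n, sd = 15000)

fit_linear <- lm(empin ~ age_num, data = cen)  # a single slope for age
fit_factor <- lm(empin ~ agegrp,  data = cen)  # one mean per age group

# The linear model is nested in the factor model, so an F-test compares them
lin_vs_factor <- anova(fit_linear, fit_factor)
lin_vs_factor

# Orthogonal polynomial contrasts (.L linear, .Q quadratic, .C cubic, ...)
cen$agegrp_ord <- as.ordered(cen$agegrp)
fit_poly <- lm(empin ~ agegrp_ord, data = cen)
poly_terms <- coef(summary(fit_poly))
poly_terms[c("agegrp_ord.L", "agegrp_ord.Q", "agegrp_ord.C"), ]

# Predicted income for each level of the age factor
pred <- predict(fit_factor,
                newdata = data.frame(agegrp = factor(labels, levels = labels)))
plot(starts, pred, type = "b", xlab = "Age group (lower bound)",
     ylab = "Predicted employment income")
```

In this toy setup the quadratic term is significant by construction; with the census data the pattern of significant terms is what the question asks you to interpret.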