

Economics 104: Project 1
Fall 2022, UCLA
Due Date: Oct 12, 2022 by 11:59 PM (PST)
For this project, you will work with any dataset you like; however, it must contain at least 5 different
predictors and one response variable, which you will aim to predict. Your task will be to find a
reasonable model by following the 11 steps outlined below.
As an illustration of a good dataset (you cannot use this dataset), the file diamonds.csv contains the
prices and other attributes of almost 54,000 diamonds. The data description and file can be accessed
directly from Kaggle, and the goal is to predict diamond prices. There are many datasets that are
publicly available on Kaggle, but you can also get data from AER, FRED, BLS, and so on.
1. Provide a descriptive analysis of your variables. This should include histograms and fitted
distributions, correlation plot, boxplots, scatterplots, and statistical summaries (e.g., the
five-number summary). All figures must include comments.
2. Estimate a multiple linear regression model that includes all the main effects only (i.e., no
interactions nor higher order terms). We will use this model as a baseline. Comment on
the statistical and economic significance of your estimates. Also, make sure to provide an
interpretation of your estimates.
3. Identify if there are any outliers, high leverage, and/or influential observations worth removing.
If so, remove them but justify your reason for doing so and re-estimate your model.
4. Use Mallows' Cp for identifying which terms you will keep in the model (based on part 3)
and also use the Boruta algorithm for variable selection. Based on the two results, determine
which subset of predictors you will keep.
5. Test for multicollinearity using VIF on the model from (4). Based on the test, remove any
appropriate variables, and estimate a new regression model based on these findings.
6. For your model in part (5) plot the respective residuals vs. ŷ and comment on your results.
7. For your model in part (5) perform a RESET test and comment on your results.
8. For your model in part (5) test for heteroskedasticity and comment on your results. If you
identify heteroskedasticity, make sure to account for it before moving on to (9).
9. Estimate a model based on all your findings that also includes interaction terms (if
appropriate) and, if needed, any higher power terms. Comment on the performance of this model
compared to your other models. Make sure to use AIC and BIC for model comparison.
10. Evaluate your model performance (from 9) using cross-validation, and also by dividing your
data into the traditional 2/3 training and 1/3 testing samples, to evaluate your out-of-sample
performance. Comment on your results.
11. Provide a short (1 paragraph) summary of your overall conclusions/findings.
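Steps 9 and 10 can be sketched in base R as follows. This is only an illustration under stated assumptions: the built-in mtcars data stands in for a project dataset, the two formulas are arbitrary examples, and the helper names rmse and cv_rmse are hypothetical.

```r
# Illustration only: built-in mtcars stands in for a project dataset.
set.seed(104)  # arbitrary seed, for reproducibility
dat <- mtcars

# Two candidate models: main effects only vs. an added interaction (example formulas).
f_main  <- mpg ~ wt + hp
f_inter <- mpg ~ wt * hp
m_main  <- lm(f_main,  data = dat)
m_inter <- lm(f_inter, data = dat)

# Step 9: compare models by AIC and BIC (lower is preferred).
ic <- data.frame(
  model = c("main", "interaction"),
  AIC   = c(AIC(m_main), AIC(m_inter)),
  BIC   = c(BIC(m_main), BIC(m_inter))
)
print(ic)

# Hypothetical helper: root mean squared prediction error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Step 10a: 2/3 training and 1/3 testing split.
n         <- nrow(dat)
train_idx <- sample(n, size = floor(2 * n / 3))
train     <- dat[train_idx, ]
test      <- dat[-train_idx, ]
fit_train    <- lm(f_inter, data = train)
holdout_rmse <- rmse(test$mpg, predict(fit_train, newdata = test))

# Step 10b: k-fold cross-validation (hypothetical helper).
cv_rmse <- function(formula, data, k = 5) {
  # Randomly assign each row to one of k folds of near-equal size.
  folds    <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  errs <- sapply(seq_len(k), function(i) {
    fit  <- lm(formula, data = data[folds != i, ])
    held <- data[folds == i, ]
    rmse(held[[response]], predict(fit, newdata = held))
  })
  mean(errs)
}

cv_main  <- cv_rmse(f_main,  dat)
cv_inter <- cv_rmse(f_inter, dat)
cat("Hold-out RMSE:", holdout_rmse, "\n")
cat("5-fold CV RMSE (main, interaction):", cv_main, cv_inter, "\n")
```

The same pattern should carry over to a project dataset by swapping in its data frame and formulas; with categorical predictors, a small test fold can miss a factor level seen in training (or the reverse), so fold sizes may need some care.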
https://www.kaggle.com/shivam2503/diamonds
https://www.rdocumentation.org/packages/AER/versions/1.2-10
Answered 1 day after Oct 10, 2022

Solution

Mohd answered on Oct 12 2022
2022-10-12
1. Provide a descriptive analysis of your variables. This should include histograms and fitted distributions, correlation plot, boxplots, scatterplots, and statistical summaries (e.g., the five-number summary). All figures must include comments.
library(readr)
exams <- read_csv("exams.csv")
## Rows: 1000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): gender, race/ethnicity, parent_education_level, lunch, test_prep_co...
## dbl (1): math
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(exams)
First look at the data
# descriptive measures
skimr::skim(exams)
Data summary
    Name                    exams
    Number of rows          1000
    Number of columns       6
    _______________________
    Column type frequency:
    character               5
    numeric                 1
    ________________________
    Group variables         None
Variable type: character
    skim_variable           n_missing  complete_rate  min  max  empty  n_unique  whitespace
    gender                  0          1              4    6    0      2         0
    race/ethnicity          0          1              7    7    0      5         0
    parent_education_level  0          1              11   18   0      6         0
    lunch                   0          1              8    12   0      2         0
    test_prep_course        0          1              4    9    0      2         0
Variable type: numeric
    skim_variable  n_missing  complete_rate  mean   sd     p0  p25  p50  p75  p100  hist
    math           0          1              66.09  15.16  0   57   66   77   100   ▁▁▅▇▃
#histogram of math score
hist(exams$math, main="Histogram of math score")
#Boxplot of math score
boxplot(exams$math, main="Boxplot of Math score")
# Removing Outliers
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
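# The 1.5*IQR rule (IQR = Q3 - Q1) puts the lower fence at Q1 - 1.5*IQR = 57 - 1.5*20 = 27;
# the filter below uses a cutoff of 30 instead
exams <- exams %>%
  filter(math >= 30)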
#After removing outliers
boxplot(exams$math)
hist(exams$math)
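The 1.5*IQR rule mentioned in the comment can also be computed rather than hard-coded. The following is a minimal base R sketch, not the author's code: iqr_fences is a hypothetical helper and the scores vector is made up for illustration. For reference, with the quartiles reported by skim before filtering (p25 = 57, p75 = 77), the rule gives a lower fence of 57 - 1.5*20 = 27 and an upper fence of 77 + 1.5*20 = 107.

```r
# Hypothetical helper: lower and upper fences of the k * IQR rule (k = 1.5 by default).
iqr_fences <- function(x, k = 1.5) {
  q   <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Made-up scores for illustration (not the exams data).
scores <- c(0, 22, 48, 55, 57, 60, 66, 70, 77, 81, 90, 100)

fences <- iqr_fences(scores)

# Keep only observations inside the fences.
kept <- scores[scores >= fences["lower"] & scores <= fences["upper"]]

print(fences)
print(kept)
```

Applied to the exams data, the same helper could feed a filter such as math >= fences["lower"], which keeps the cutoff tied to the data instead of a fixed number.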
Categorical variables distribution
library(ggplot2)
# gender distribution
ggplot(data=exams, aes(x=gender)) +
geom_bar() +
labs (title = "Gender Distribution", x = "Gender", y = "Total Count")+ theme_classic()
#Race/ ethnicity distribution
ggplot(data=exams, aes(x=`race/ethnicity`)) +
geom_bar() +
labs (title = "race/ethnicity Distribution", x = "race/ethnicity", y = "Total Count")+ theme_classic()
#parent Education level distribution
ggplot(data=exams, aes(x=parent_education_level)) +
geom_bar() +
labs (title = "parent_education_level Distribution", x = "parent_education_level", y = "Total Count")+ theme_classic()
#Lunch distribution
ggplot(data=exams, aes(x=lunch)) +
geom_bar() +
labs (title = "lunch Distribution", x = "lunch", y = "Total Count")+ theme_classic()
#test preparation course
ggplot(data=exams, aes(x=test_prep_course)) +
geom_bar() +
labs (title = "test_prep_course Distribution", x = "test_prep_course", y = "Total Count")+ theme_classic()
2. Estimate a multiple linear regression model that includes all the main effects only (i.e., no interactions nor higher order terms). We will use this model as a baseline. Comment on the statistical and economic significance of your estimates. Also, make sure to provide an interpretation of your estimates.
baseline_mod<-lm(math~.,data=exams)
stargazer::stargazer(baseline_mod,type = "text")
##
## ===================================================================
##                                             Dependent variable:
##                                         ---------------------------
##                                                    math
## -------------------------------------------------------------------
## gendermale                                       4.322***
##                                                   (0.808)
##
## `race/ethnicity`group B                            2.685
##                                                   (1.639)
##
## `race/ethnicity`group C                           2.746*
##                                                   (1.532)
##
## `race/ethnicity`group D                          5.413***
##                                                   (1.563)
##
## `race/ethnicity`group E                          10.033***
##                                                   (1.729)
##
## parent_education_levelbachelor's degree            2.074
## ...
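Rows such as gendermale and `race/ethnicity`group B are dummy-variable coefficients: lm() turns each character predictor into indicator variables and measures every level against an omitted reference level (here presumably female and group A, the levels that do not appear in the table), holding the other predictors fixed. A minimal sketch of that reading, using made-up data rather than the exams file:

```r
# Made-up data for illustration (not the exams data).
toy <- data.frame(
  gender = c("female", "female", "female", "male", "male", "male"),
  math   = c(60, 64, 68, 66, 70, 74)
)

# lm() converts the character column to a factor; the first level
# in alphabetical order ("female") becomes the reference level.
fit <- lm(math ~ gender, data = toy)
print(coef(fit))

# With a single categorical predictor, the intercept equals the mean of the
# reference group and the dummy coefficient equals the difference in group means.
group_means <- tapply(toy$math, toy$gender, mean)
diff_means  <- unname(group_means["male"] - group_means["female"])
print(group_means)
print(diff_means)
```

In the multi-predictor baseline model the coefficient is no longer a raw difference in means, but the reading is analogous: the estimated difference in math score relative to the reference level, with the other predictors held constant.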