Washington, D.C. is the capital of the United States. Washington's population is approaching 700,000
people and has been growing since 2000 following a half-century of population decline. The city is highly
segregated and features a high cost of living. In 2017, the average price of a single-family home in the
district was $649,000. The dataset (DC_Property_Train.CSV) provides insight into the housing stock of the
district.
Explanations for Columns
ID: House ID
BATHRM: Number of Full Bathrooms
HF_BATHRM: Number of Half Bathrooms (no bathtub or shower)
HEAT: Heating
AC: Cooling
NUM_UNITS: Number of Units
ROOMS: Number of Rooms
BEDRM: Number of Bedrooms
AYB: The earliest year the main portion of the building was built
YR_RMDL: Year structure was remodeled
EYB: The year an improvement was built, where more recent than the actual year built
STORIES: Number of stories in primary dwelling
PRICE: Price of most recent sale
GBA: Gross building area in square feet
BLDG_NUM: Building Number on Property
STYLE: Style
STRUCT: Structure
LANDAREA: Land area of property in square feet
ASSESSMENT_NBHD: Neighborhood ID
In this problem, you are required to finish the following tasks. All necessary steps need to be clearly
documented in your report. Use the data set DC_Property_Train.CSV for questions 1 to 6.
1. Plot a histogram for EYB. Describe the plotted pattern. (4 marks)
2. Plot a histogram for PRICE. Describe the plotted pattern and analyze the potential reasons for the
high-priced properties. (4 marks)
3. Summarize the average PRICE for each ASSESSMENT_NBHD. Sort the processed data and make a
bar plot of average prices for the top 10 neighborhoods. (6 marks)
4. Plot boxplots of PRICE by ASSESSMENT_NBHD for the top 10 neighborhoods. Explain the pros of using
boxplots instead of average prices. (6 marks)
5. Plot boxplots of PRICE by the categories of STRUCT using the facet approach. Compare these boxplots
and summarize your findings. (6 marks)
6. Visualize the relationship between PRICE and GBA. Identify outliers based on the visualization and list
their IDs. (6 marks)
7. Create a regression model for predicting PRICE through selected variables (you decide which ones to
use) from the data set DC_Property_Train.CSV. You may exclude the identified outliers from the
previous steps. Quantitatively evaluate the model performance using R-squared (R2) and MSE. Fill the
PREDICTED_PRICE column of the data set DC_Property_Test.CSV using the predicted values from your
model. (8 marks)
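
For the evaluation step in question 7, the two metrics can be written out directly. The following is a minimal sketch in plain Python, not a required implementation: the brief does not prescribe a language, the function names mse and r_squared are illustrative, and the prices are made-up values rather than rows from DC_Property_Train.CSV.

```python
from typing import Sequence


def _check(actual: Sequence[float], predicted: Sequence[float]) -> None:
    """Reject empty or mismatched inputs before computing a metric."""
    if len(actual) == 0 or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")


def mse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared error: the average of the squared residuals."""
    _check(actual, predicted)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    _check(actual, predicted)
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical sale prices and model predictions (not taken from the data set).
actual_prices = [400_000.0, 650_000.0, 900_000.0]
predicted_prices = [450_000.0, 600_000.0, 900_000.0]

model_mse = mse(actual_prices, predicted_prices)
model_r2 = r_squared(actual_prices, predicted_prices)
```

Because PRICE is in dollars, MSE is in squared dollars and can look very large; R2 is unitless and is easier to compare across models.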