Dataset: City
The original data from Chicago's data portal contains detailed information for each crime and call to 311. We have split the city up into regions using a simple grid and have aggregated this data by region. Each city data file contains data for different types of complaints (that is, calls to 311) and the total number of crimes on a per-region basis. The first row in the file contains column labels, for example, GRAFFITI or POT_HOLES. Subsequent rows contain data for different regions of the city. A column contains data for a given variable across all the rows. For example, the column with index 1 (the second column) contains the number of calls about pot holes for each region. In addition to information about specific types of complaints, the file also has one column that contains the total number of crimes in each region.
File paths:
data/city
Parameters:
{"name": "City",
"predictor_vars": [0, 1, 2, 3, 4, 5, 6],
"dependent_var": 7,
"training_fraction": 0.55,
"seed": 22992}
City Task 1a:
CRIME_TOTALS ~ XXXXXXXXXX678349 * GRAFFITI
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX375417 * POT_HOLES
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX711958 * RODENTS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX892669 * GARBAGE
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX447459 * STREET_LIGHTS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX324616 * TREE_DEBRIS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX338500 * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
City Task 1b:
CRIME_TOTALS ~ XXXXXXXXXX347343 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * RODENTS XXXXXXXXXX * GARBAGE XXXXXXXXXX * STREET_LIGHTS XXXXXXXXXX * TREE_DEBRIS + XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
City Task 2:
CRIME_TOTALS ~ XXXXXXXXXX300180 * POT_HOLES XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
City Task 3:
CRIME_TOTALS ~ XXXXXXXXXX338500 * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX300180 * POT_HOLES XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX213704 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX386986 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * TREE_DEBRIS XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX337971 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * GARBAGE XXXXXXXXXX * TREE_DEBRIS XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX348926 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * GARBAGE XXXXXXXXXX * STREET_LIGHTS XXXXXXXXXX * TREE_DEBRIS XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX347343 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * RODENTS XXXXXXXXXX * GARBAGE XXXXXXXXXX * STREET_LIGHTS XXXXXXXXXX * TREE_DEBRIS + XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
City Task 4:
CRIME_TOTALS ~ XXXXXXXXXX348926 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * GARBAGE XXXXXXXXXX * STREET_LIGHTS XXXXXXXXXX * TREE_DEBRIS XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
Adjusted R2: XXXXXXXXXX
City Task 5:
CRIME_TOTALS ~ XXXXXXXXXX348926 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * GARBAGE XXXXXXXXXX * STREET_LIGHTS XXXXXXXXXX * TREE_DEBRIS XXXXXXXXXX * ABANDONED_BUILDINGS
Training R2: XXXXXXXXXX
Testing R2: XXXXXXXXXX
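Task 5 reports R2 separately on training and testing data. A minimal sketch of how such a split could be produced from the training_fraction and seed parameters above, using sklearn's train_test_split (the data array here is random filler, purely for illustration):

import numpy as np
from sklearn.model_selection import train_test_split

# Random filler standing in for the per-region table (10 regions, 8 columns).
rng = np.random.default_rng(0)
data = rng.random((10, 8))

training_fraction = 0.55   # from the parameters above
seed = 22992

# About 55% of the regions go to training; fixing random_state makes the
# split reproducible, so training/testing R2 values can be re-checked.
train, test = train_test_split(data, train_size=training_fraction,
                               random_state=seed)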

Linear Regression
Due: Wednesday, Nov 13 at 9pm
In this assignment, you will fit linear regression models and implement a few simple variable selection algorithms. The assignment will give you experience with NumPy and more practice with using classes and functions to support code reuse.
You must work alone on this assignment.
Introduction
At the heart of the assignment is a table, where each column is a variable and each row is a sample unit. As an example, in a health study, each sample unit might be a person, with variables like height, weight, sex, etc. In your analysis, you will build models that, with varying levels of accuracy, can predict the value of one of the variables as a function of the others.
Predictions are only possible if variables are related somehow. As an example, look at this plot of recorded crimes against logged complaint calls about garbage to 311.
Each point describes a sample unit, which in this example represents a geographical region of Chicago. Each region is associated with variables, such as the number of crimes or complaint calls during a fixed time frame. Given this plot, if you were asked how many crimes you think were recorded for a region that had 150 complaint calls about garbage, you would follow the general trend and probably say something like 3000 recorded crimes. To formalize this prediction, we need a model for the data that relates a dependent variable (e.g., crimes) to a set of predictor variables (e.g., complaint calls). Our model will assume a linear dependence.
To make this precise, we will use the following notation:

$N$: the total number of sample units.
$K$: the total number of predictor variables. In the example above, $K = 1$.
$n$: the sample unit that we are currently considering (an integer from $0$ to $N - 1$).
$x_{nk}$: an observation of predictor variable $k$ for sample unit $n$, e.g., the number of complaint calls about garbage.
$y_n$: an observation of the dependent variable for sample unit $n$, e.g., the total number of crimes.
$\hat{y}_n$: our prediction for the dependent variable for sample unit $n$, based on our observation of the predictor variables. This value corresponds to a point on the red line.
$\varepsilon_n = y_n - \hat{y}_n$: the residual or observed error, that is, the difference between the actual observed value of the dependent variable and our prediction for it. Ideally, our predictions would match the observations, so that $\varepsilon_n$ would always be zero. In practice, there will be some discrepancy, for two reasons. For one, when we make predictions on new data, we will not have access to the observations of the dependent variable. But also, our model will assume a linear dependence between the predictor variables and the dependent variable, while in reality the relationship will not be quite linear. So, even when we do have direct access to the observations of the dependent variable, we will not have $\varepsilon_n$ equal to zero.

Our prediction for the dependent variable will be given by a linear equation:

$$\hat{y}_n = \beta_0 + \beta_1 x_{n1} + \cdots + \beta_K x_{nK}, \qquad (1)$$

where the coefficients $\beta_0, \beta_1, \ldots, \beta_K$ are real numbers. We would like to select values for these coefficients that result in small residuals $\varepsilon_n$.

We can rewrite this equation more concisely using vector notation. We define:

$\beta = (\beta_0 \; \beta_1 \; \beta_2 \; \cdots \; \beta_K)^T$: a column vector of the regression coefficients, where $\beta_0$ is the intercept and $\beta_k$ (for $1 \le k \le K$) is the coefficient associated with the $k$th predictor. This vector describes the red line in the figure above. Note that a positive value of a coefficient suggests a positive correlation with the dependent variable. The same is true for a negative value and a negative correlation.
$x_n = (1 \; x_{n1} \; x_{n2} \; \cdots \; x_{nK})^T$: a column vector representation of all the predictors for a given sample unit. Note that a 1 has been prepended to the vector. This will allow us to rewrite equation (1) in vector notation without having to treat $\beta_0$ separately from the other coefficients $\beta_k$.

We can then rewrite equation (1) as:

$$\hat{y}_n = x_n^T \beta. \qquad (2)$$

This equation can be written for all sample units at the same time using matrix notation. We define:

$y = (y_0 \; y_1 \; \cdots \; y_{N-1})^T$: a column vector of observations of the dependent variable.
$\hat{y} = (\hat{y}_0 \; \hat{y}_1 \; \cdots \; \hat{y}_{N-1})^T$: a column vector of predictions for the dependent variable.
$\varepsilon = (\varepsilon_0 \; \varepsilon_1 \; \cdots \; \varepsilon_{N-1})^T$: a column vector of the residuals (observed errors).
$X$: an $N \times (K + 1)$ matrix where each row is one sample unit. The first column of this matrix is all ones.

We can then write equations (1) and (2) for all sample units at once as

$$\hat{y} = X\beta, \qquad (3)$$

and we can express the residuals as

$$\varepsilon = y - \hat{y}. \qquad (4)$$

Matrix multiplication:
Equations (2) and (3) above involve matrix multiplication. If you are unfamiliar with matrix multiplication, you will still be able to do this assignment. Just keep in mind that, to make the calculations less messy, the matrix $X$ contains not just the observations of the predictor variables, but also an initial column of all ones. The data we provide does not yet have this column of ones, so you will need to prepend it.
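As a concrete illustration of equations (3) and (4), here is a short NumPy sketch with made-up numbers (three sample units, two predictor variables; the coefficients are arbitrary, not fitted values):

import numpy as np

# Made-up predictor observations: N = 3 sample units, K = 2 variables.
X_raw = np.array([[2.0, 5.0],
                  [1.0, 3.0],
                  [4.0, 8.0]])

# Prepend the column of ones so the intercept beta_0 is handled uniformly.
X = np.column_stack((np.ones(X_raw.shape[0]), X_raw))

beta = np.array([1.0, 0.5, -0.2])   # arbitrary (beta_0, beta_1, beta_2)
y = np.array([2.0, 2.0, 1.5])       # made-up observations

y_hat = X @ beta    # predictions, equation (3)
eps = y - y_hat     # residuals, equation (4)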
Model fitting
There are many possible candidates for $\beta$, some that fit the data better than others. Finding the best value of $\beta$ is referred to as fitting the model. For our purposes, the "best" value of $\beta$ is the one that minimizes the residuals in the least-squared sense. That is, we want the value for $\beta$ such that the predicted values are as close to the observed values as possible (in a statistically-motivated way using maximum likelihood). We will provide a function that computes this value of $\beta$; see "The linear_regression function" below.
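The provided util.linear_regression handles this step, so you do not need to implement it yourself. Purely for intuition, a least-squares fit can be sketched with NumPy as follows; this is our illustration, not the code that util.py actually contains:

import numpy as np

def least_squares_beta(X, y):
    '''
    Illustrative ordinary least squares: return the beta that minimizes
    the sum of squared residuals ||y - X @ beta||^2. X is assumed to
    already include the leading column of ones.
    '''
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return beta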
Getting started
We have seeded your repository with a directory for this assignment. To pick it up, change to your capp30121-aut-19-username directory (where the string username should be replaced with your username) and then run the command: git pull upstream master. You should also run git pull to make sure your local copy of your repository is in sync with the server.
The pa5 directory contains the following files:
regression.py: Python file where you will write your code.
util.py: Python file with several helper functions, some of which you will need to use in your code.
output.py: This file is described in detail in the "Testing your code" section below.
test_regression.py: Python file with the automated tests for this assignment.
The pa5 directory also contains a data directory which, in turn, contains two sub-directories: city and stock.
Data
In this assignment you will write code that can be used...

Solution

'''
Linear regression
YOUR NAME HERE
Main file for linear regression and model selection.
'''
import numpy as np
from sklearn.model_selection import train_test_split
import util
class DataSet(object):
    '''
    Class for representing a data set.
    '''

    def __init__(self, dir_path):
        '''
        Constructor

        Inputs:
            dir_path: (string) path to the directory that contains the
                data files
        '''
        params = util.load_json_file(dir_path, "parameters.json")
        data = util.load_numpy_array(dir_path, "data.csv")
        self.name = params['name']
        self.dependent_var = params['dependent_var']
        self.pred_vars = params['predictor_vars']
        self.seed = params['seed']
        self.split_fraction = params['training_fraction']
        self.col_names = data[0]
        self.data = data[1]


class Model(object):
    '''
    Class for representing a model.
    '''

    def __init__(self, dataset, pred_vars):
        '''
        Construct a data structure to hold the model.

        Inputs:
            dataset: a DataSet instance
            pred_vars: a list of the indices for the columns (of the
                original data array) used in the model.
        '''
        self.col_names = dataset.col_names
        self.dep_var = dataset.dependent_var
        self.train, self.test = train_test_split(
            dataset.data, test_size=None,
            train_size=dataset.split_fraction, random_state=dataset.seed)
        self.pred_vars = pred_vars
        self.pred_obs = self.train[:, self.pred_vars]
        self.dependent_obs = self.train[:, self.dep_var]
        self.beta = util.linear_regression(self.pred_obs, self.dependent_obs)
        self.R2 = self.rsquared()
        self.adj_R2 = None

    def __repr__(self):
        '''
        Format the model as a string.
        '''
        n = "{} ~ {}".format(self.col_names[self.dep_var], self.beta[0])
        if type(self.pred_vars) == list:
            ...
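The constructor above calls self.rsquared(), which is not included in this excerpt. A minimal sketch of what such a method might compute is shown below; it assumes, as the __repr__ above does, that self.beta[0] is the intercept. This is our guess at the missing piece, not the poster's actual code:

import numpy as np

# Hypothetical reconstruction of the omitted Model.rsquared method.
def rsquared(self):
    '''
    Coefficient of determination on the training data:
    R^2 = 1 - SS_res / SS_tot.
    '''
    y = self.dependent_obs
    # Predictions from equation (1): intercept plus weighted predictors.
    yhat = self.beta[0] + self.pred_obs @ self.beta[1:]
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot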