Dataset: City
The original data from Chicago's data portal contains detailed information for each crime and call to 311. We have split the city up into regions using a simple grid and have aggregated this data by region. Each city data file contains data for different types of complaints (that is, calls to 311) and the total number of crimes on a per-region basis. The first row in the file contains column labels, for example, GRAFFITI or POT_HOLES. Subsequent rows contain data for different regions of the city. A column contains data for a given variable across all the rows. For example, the column with index 1 (the second column) contains the number of calls about pot holes for each region. In addition to information about specific types of complaints, the file also has one column that contains the total number of crimes in each region.
File paths:
data/city
Parameters:
{"name": "City",
"predictor_vars": [0, 1, 2, 3, 4, 5, 6],
"dependent_var": 7,
"training_fraction": 0.55,
"seed": 22992}
City Task 1a:
CRIME_TOTALS ~ XXXXXXXXXX678349 * GRAFFITI
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX375417 * POT_HOLES
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX711958 * RODENTS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX892669 * GARBAGE
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX447459 * STREET_LIGHTS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX324616 * TREE_DEBRIS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX338500 * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
City Task 1b:
CRIME_TOTALS ~ XXXXXXXXXX347343 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * RODENTS XXXXXXXXXX * GARBAGE XXXXXXXXXX * STREET_LIGHTS XXXXXXXXXX * TREE_DEBRIS + XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
City Task 2:
CRIME_TOTALS ~ XXXXXXXXXX300180 * POT_HOLES XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
City Task 3:
CRIME_TOTALS ~ XXXXXXXXXX338500 * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX300180 * POT_HOLES XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX213704 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX386986 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * TREE_DEBRIS XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX337971 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * GARBAGE XXXXXXXXXX * TREE_DEBRIS XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX348926 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * GARBAGE XXXXXXXXXX * STREET_LIGHTS XXXXXXXXXX * TREE_DEBRIS XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
CRIME_TOTALS ~ XXXXXXXXXX347343 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * RODENTS XXXXXXXXXX * GARBAGE XXXXXXXXXX * STREET_LIGHTS XXXXXXXXXX * TREE_DEBRIS + XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
City Task 4:
CRIME_TOTALS ~ XXXXXXXXXX348926 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * GARBAGE XXXXXXXXXX * STREET_LIGHTS XXXXXXXXXX * TREE_DEBRIS XXXXXXXXXX * ABANDONED_BUILDINGS
R2: XXXXXXXXXX
Adjusted R2: XXXXXXXXXX
City Task 5:
CRIME_TOTALS ~ XXXXXXXXXX348926 * GRAFFITI XXXXXXXXXX * POT_HOLES XXXXXXXXXX * GARBAGE XXXXXXXXXX * STREET_LIGHTS XXXXXXXXXX * TREE_DEBRIS XXXXXXXXXX * ABANDONED_BUILDINGS
Training R2: XXXXXXXXXX
Testing R2: XXXXXXXXXX
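Task 5 reports R2 separately on training and testing data. A minimal sketch of how such a split could be produced from the training_fraction and seed parameters above, using sklearn's train_test_split (the data array here is random filler, purely for illustration):

import numpy as np
from sklearn.model_selection import train_test_split

# Random filler standing in for the per-region table (10 regions, 8 columns).
rng = np.random.default_rng(0)
data = rng.random((10, 8))

training_fraction = 0.55   # from the parameters above
seed = 22992

# About 55% of the regions go to training; fixing random_state makes the
# split reproducible, so training/testing R2 values can be re-checked.
train, test = train_test_split(data, train_size=training_fraction,
                               random_state=seed)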

Linear Regression
Due: Wednesday, Nov 13 at 9pm
In this assignment, you will fit linear regression models and implement a few simple variable selection algorithms. The assignment will give you experience with NumPy and more practice with using classes and functions to support code reuse.
You must work alone on this assignment.
Introduction
At the heart of the assignment is a table, where each column is a variable and each row is a sample unit. As an example, in a health study, each sample unit might be a person, with variables like height, weight, sex, etc. In your analysis, you will build models that, with varying levels of accuracy, can predict the value of one of the variables as a function of the others.
Predictions are only possible if variables are related somehow. As an example, look at this plot of recorded crimes against logged complaint calls about garbage to 311.
Each point describes a sample unit, which in this example represents a geographical region of Chicago. Each region is associated with variables, such as the number of crimes or complaint calls during a fixed time frame. Given this plot, if you were asked how many crimes you think were recorded for a region that had 150 complaint calls about garbage, you would follow the general trend and probably say something like 3000 recorded crimes. To formalize this prediction, we need a model for the data that relates a dependent variable (e.g., crimes) to a set of predictor variables (e.g., complaint calls). Our model will assume a linear dependence.
To make this precise, we will use the following notation:

$N$: the total number of sample units.
$K$: the total number of predictor variables. In the example above, $K = 1$.
$n$: the sample unit that we are currently considering (an integer from $0$ to $N - 1$).
$x_{nk}$: an observation of predictor variable $k$ for sample unit $n$, e.g., the number of complaint calls about garbage.
$y_n$: an observation of the dependent variable for sample unit $n$, e.g., the total number of crimes.
$\hat{y}_n$: our prediction for the dependent variable for sample unit $n$, based on our observation of the predictor variables. This value corresponds to a point on the red line.
$\varepsilon_n = y_n - \hat{y}_n$: the residual or observed error, that is, the difference between the actual observed value of the dependent variable and our prediction for it. Ideally, our predictions would match the observations, so that $\varepsilon_n$ would always be zero. In practice, there will be some discrepancy, for two reasons. For one, when we make predictions on new data, we will not have access to the observations of the dependent variable. But also, our model will assume a linear dependence between the predictor variables and the dependent variable, while in reality the relationship will not be quite linear. So, even when we do have direct access to the observations of the dependent variable, we will not have $\varepsilon_n$ equal to zero.

Our prediction for the dependent variable will be given by a linear equation:

$$\hat{y}_n = \beta_0 + \beta_1 x_{n1} + \cdots + \beta_K x_{nK}, \qquad (1)$$

where the coefficients $\beta_0, \beta_1, \ldots, \beta_K$ are real numbers. We would like to select values for these coefficients that result in small residuals $\varepsilon_n$.

We can rewrite this equation more concisely using vector notation. We define:

$\beta = (\beta_0 \; \beta_1 \; \beta_2 \; \cdots \; \beta_K)^T$: a column vector of the regression coefficients, where $\beta_0$ is the intercept and $\beta_k$ (for $1 \le k \le K$) is the coefficient associated with the $k$th predictor. This vector describes the red line in the figure above. Note that a positive value of a coefficient suggests a positive correlation with the dependent variable. The same is true for a negative value and a negative correlation.
$x_n = (1 \; x_{n1} \; x_{n2} \; \cdots \; x_{nK})^T$: a column vector representation of all the predictors for a given sample unit. Note that a 1 has been prepended to the vector. This will allow us to rewrite equation (1) in vector notation without having to treat $\beta_0$ separately from the other coefficients $\beta_k$.

We can then rewrite equation (1) as:

$$\hat{y}_n = x_n^T \beta. \qquad (2)$$

This equation can be written for all sample units at the same time using matrix notation. We define:

$y = (y_0 \; y_1 \; \cdots \; y_{N-1})^T$: a column vector of observations of the dependent variable.
$\hat{y} = (\hat{y}_0 \; \hat{y}_1 \; \cdots \; \hat{y}_{N-1})^T$: a column vector of predictions for the dependent variable.
$\varepsilon = (\varepsilon_0 \; \varepsilon_1 \; \cdots \; \varepsilon_{N-1})^T$: a column vector of the residuals (observed errors).
$X$: an $N \times (K + 1)$ matrix where each row is one sample unit. The first column of this matrix is all ones.

We can then write equations (1) and (2) for all sample units at once as

$$\hat{y} = X\beta, \qquad (3)$$

and we can express the residuals as

$$\varepsilon = y - \hat{y}. \qquad (4)$$

Matrix multiplication:
Equations (2) and (3) above involve matrix multiplication. If you are unfamiliar with matrix multiplication, you will still be able to do this assignment. Just keep in mind that, to make the calculations less messy, the matrix $X$ contains not just the observations of the predictor variables, but also an initial column of all ones. The data we provide does not yet have this column of ones, so you will need to prepend it.
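As a concrete illustration of equations (3) and (4), here is a short NumPy sketch with made-up numbers (three sample units, two predictor variables; the coefficients are arbitrary, not fitted values):

import numpy as np

# Made-up predictor observations: N = 3 sample units, K = 2 variables.
X_raw = np.array([[2.0, 5.0],
                  [1.0, 3.0],
                  [4.0, 8.0]])

# Prepend the column of ones so the intercept beta_0 is handled uniformly.
X = np.column_stack((np.ones(X_raw.shape[0]), X_raw))

beta = np.array([1.0, 0.5, -0.2])   # arbitrary (beta_0, beta_1, beta_2)
y = np.array([2.0, 2.0, 1.5])       # made-up observations

y_hat = X @ beta    # predictions, equation (3)
eps = y - y_hat     # residuals, equation (4)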
Model fitting
There are many possible candidates for $\beta$, some that fit the data better than others. Finding the best value of $\beta$ is referred to as fitting the model. For our purposes, the "best" value of $\beta$ is the one that minimizes the residuals in the least-squared sense. That is, we want the value for $\beta$ such that the predicted values are as close to the observed values as possible (in a statistically-motivated way using maximum likelihood). We will provide a function that computes this value of $\beta$; see "The linear_regression function" below.
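The provided util.linear_regression handles this step, so you do not need to implement it yourself. Purely for intuition, a least-squares fit can be sketched with NumPy as follows; this is our illustration, not the code that util.py actually contains:

import numpy as np

def least_squares_beta(X, y):
    '''
    Illustrative ordinary least squares: return the beta that minimizes
    the sum of squared residuals ||y - X @ beta||^2. X is assumed to
    already include the leading column of ones.
    '''
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return beta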
Getting started
We have seeded your repository with a directory for this assignment. To pick it up, change to your capp30121-aut-19-username directory (where the string username should be replaced with your username) and then run the command: git pull upstream master. You should also run git pull to make sure your local copy of your repository is in sync with the server.
The pa5 directory contains the following files:
regression.py: Python file where you will write your code.
util.py: Python file with several helper functions, some of which you will need to use in your code.
output.py: This file is described in detail in the "Testing your code" section below.
test_regression.py: Python file with the automated tests for this assignment.
The pa5 directory also contains a data directory which, in turn, contains two sub-directories: city and stock.
Data
In this assignment you will write code that can be used...

Solution

'''
Linear regression
YOUR NAME HERE
Main file for linear regression and model selection.
'''
import numpy as np
from sklearn.model_selection import train_test_split
import util
class DataSet(object):
    '''
    Class for representing a data set.
    '''

    def __init__(self, dir_path):
        '''
        Constructor

        Inputs:
            dir_path: (string) path to the directory that contains the
                data files
        '''
        params = util.load_json_file(dir_path, "parameters.json")
        data = util.load_numpy_array(dir_path, "data.csv")
        self.name = params['name']
        self.dependent_var = params['dependent_var']
        self.pred_vars = params['predictor_vars']
        self.seed = params['seed']
        self.split_fraction = params['training_fraction']
        self.col_names = data[0]
        self.data = data[1]


class Model(object):
    '''
    Class for representing a model.
    '''

    def __init__(self, dataset, pred_vars):
        '''
        Construct a data structure to hold the model.

        Inputs:
            dataset: a DataSet instance
            pred_vars: a list of the indices for the columns (of the
                original data array) used in the model.
        '''
        self.col_names = dataset.col_names
        self.dep_var = dataset.dependent_var
        self.train, self.test = train_test_split(
            dataset.data, test_size=None,
            train_size=dataset.split_fraction, random_state=dataset.seed)
        self.pred_vars = pred_vars
        self.pred_obs = self.train[:, self.pred_vars]
        self.dependent_obs = self.train[:, self.dep_var]
        self.beta = util.linear_regression(self.pred_obs, self.dependent_obs)
        self.R2 = self.rsquared()
        self.adj_R2 = None

    def __repr__(self):
        '''
        Format the model as a string.
        '''
        n = "{} ~ {}".format(self.col_names[self.dep_var], self.beta[0])
        if type(self.pred_vars) == list:
            ...
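The constructor above calls self.rsquared(), which is not included in this excerpt. A minimal sketch of what such a method might compute is shown below; it assumes, as the __repr__ above does, that self.beta[0] is the intercept. This is our guess at the missing piece, not the poster's actual code:

import numpy as np

# Hypothetical reconstruction of the omitted Model.rsquared method.
def rsquared(self):
    '''
    Coefficient of determination on the training data:
    R^2 = 1 - SS_res / SS_tot.
    '''
    y = self.dependent_obs
    # Predictions from equation (1): intercept plus weighted predictors.
    yhat = self.beta[0] + self.pred_obs @ self.beta[1:]
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot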