Data Preprocessing (30 points)
Preprocess the data you are given to your liking. This may include dropping some columns you won't use,
addressing noisy or missing data, etc.
Use Pandas as a dataframe abstraction for this task. You can learn about Pandas here:
[Embedded video: "Pandas Tutorial 2021", a YouTube tutorial covering Pandas in one video]
Ultimately, this task has to result in a dataframe that you will use for training and testing the classifier.
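As a minimal sketch of what this might look like (the file name, column names, and cleaning choices below are hypothetical placeholders, not requirements):

import pandas as pd

# load the raw data (file and column names are made up for illustration)
df = pd.read_csv("users.csv")

# drop columns we won't use
df = df.drop(columns=["user_id"])

# address missing data: fill missing salaries with the median salary,
# then drop any rows that are still incomplete
df["salary"] = df["salary"].fillna(df["salary"].median())
df = df.dropna()

# address obviously noisy rows, e.g. non-positive salaries
df = df[df["salary"] > 0]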
Logistic Regression (40 points)
Write, from scratch, a logistic regression solution to the prediction problem that works with
Stochastic Gradient Descent (SGD). You may only use pandas and numpy to do so.
Show clearly all equations of the gradient and include comments, in either markdown or Python (inline with the
code), explaining every stage of processing. Also, highlight any enhancements you may have made to
improve performance, such as regularization.
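For example, the core SGD update with L2 regularization might be sketched like this (the learning rate, regularization strength, and the decision to regularize the bias term are all choices you should make and justify yourself):

import numpy as np

def fit_logistic_sgd(X, y, lr=0.01, lam=0.01, epochs=100, seed=0):
    """X: (n, d) array whose first column is all 1s; y: (n,) array of 0/1 labels."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):            # visit points in random order
            p = 1.0 / (1.0 + np.exp(-X[i] @ beta))   # predicted probability for point i
            # gradient of the regularized log likelihood at a single point:
            #   (y_i - p_i) * x_i  -  lam * beta
            grad = (y[i] - p) * X[i] - lam * beta
            beta += lr * grad                         # ascend, since we maximize likelihood
    return beta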
Performance Results (20 points)
Plot the precision vs recall curve of your classifier. Clearly explain the tradeoff between the two quantities
and the shape of the curve.
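One way to produce such a plot is to sweep the decision threshold over the predicted probabilities and compute precision and recall at each threshold; a sketch (assuming you already have arrays y_true of labels and y_scores of predicted probabilities):

import numpy as np
import matplotlib.pyplot as plt

def pr_curve(y_true, y_scores):
    """precision and recall at every distinct predicted probability used as a threshold"""
    y_true, y_scores = np.asarray(y_true), np.asarray(y_scores)
    precisions, recalls = [], []
    for t in np.sort(np.unique(y_scores)):
        pred = (y_scores >= t).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        precisions.append(tp / (tp + fp) if tp + fp > 0 else 1.0)
        recalls.append(tp / (tp + fn) if tp + fn > 0 else 0.0)
    return recalls, precisions

# recalls, precisions = pr_curve(y_test, predicted_probabilities)
# plt.plot(recalls, precisions)
# plt.xlabel("recall"); plt.ylabel("precision"); plt.show()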
Chapter 16. Logistic Regression
A lot of people say there’s a fine line between genius and insanity. I don’t think there’s a
fine line, I actually think there’s a yawning gulf.
Bill Bailey
In Chapter 1, we briefly looked at the problem of trying to predict which DataSciencester
users paid for premium accounts. Here we'll revisit that problem.
The Problem
We have an anonymized data set of about 200 users, containing each user's salary, her
years of experience as a data scientist, and whether she paid for a premium account
(Figure 16-1). As is usual with categorical variables, we represent the dependent variable
as either 0 (no premium account) or 1 (premium account).
As usual, our data is in a matrix where each row is a list [experience, salary,
paid_account]. Let’s turn it into the format we need:
x = [[1] + row[:2] for row in data] # each element is [1, experience, salary]
y = [row[2] for row in data] # each element is paid_account
An obvious first attempt is to use linear regression and find the best model:

$$\text{paid account} = \beta_0 + \beta_1 \cdot \text{experience} + \beta_2 \cdot \text{salary} + \varepsilon$$
Figure 16-1. Paid and unpaid users
And certainly, there's nothing preventing us from modeling the problem this way. The
results are shown in Figure 16-2:

rescaled_x = rescale(x)
beta = estimate_beta(rescaled_x, y)   # [0.26, 0.43, -0.43]
predictions = [predict(x_i, beta) for x_i in rescaled_x]

plt.scatter(predictions, y)
plt.xlabel("predicted")
plt.ylabel("actual")
plt.show()
Figure 16-2. Using linear regression to predict premium accounts
But this approach leads to a couple of immediate problems:
We'd like for our predicted outputs to be 0 or 1, to indicate class membership. It's fine
if they're between 0 and 1, since we can interpret these as probabilities: an output of
0.25 could mean a 25% chance of being a paid member. But the outputs of the linear
model can be huge positive numbers or even negative numbers, which it's not clear
how to interpret. Indeed, here a lot of our predictions were negative.
The linear regression model assumed that the errors were uncorrelated with the
columns of x. But here, the regression coefficient for experience is 0.43, indicating that
more experience leads to a greater likelihood of a premium account. This means that
our model outputs very large values for people with lots of experience. But we know
that the actual values must be at most 1, which means that necessarily very large
outputs (and therefore very large values of experience) correspond to very large
negative values of the error term. Because this is the case, our estimate of beta is
biased.
What we'd like instead is for large positive values of dot(x_i, beta) to correspond to
probabilities close to 1, and for large negative values to correspond to probabilities close
to 0. We can accomplish this by applying another function to the result.
The Logistic Function
In the case of logistic regression, we use the logistic function, pictured in Figure 16-3:
def logistic(x):
    return 1.0 / (1 + math.exp(-x))
Figure 16-3. The logistic function
As its input gets large and positive, it gets closer and closer to 1. As its input gets large
and negative, it gets closer and closer to 0. Additionally, it has the convenient property
that its derivative is given by:
def logistic_prime(x):
    return logistic(x) * (1 - logistic(x))
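For reference, this identity follows by differentiating directly:

$$\frac{d}{dx}\,\frac{1}{1 + e^{-x}} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \mathrm{logistic}(x)\,\bigl(1 - \mathrm{logistic}(x)\bigr)$$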
which we'll make use of in a bit. We'll use this to fit a model:

$$y_i = f(x_i \beta) + \varepsilon_i$$

where f is the logistic function.
Recall that for linear regression we fit the model by minimizing the sum of squared errors,
which ended up choosing the beta that maximized the likelihood of the data.
Here the two aren’t equivalent, so we’ll use gradient descent to maximize the likelihood
directly. This means we need to calculate the likelihood function and its gradient.
Given some beta, our model says that each y_i should equal 1 with probability f(x_i beta) and 0
with probability 1 - f(x_i beta).

In particular, the pdf for y_i can be written as:

$$p(y_i \mid x_i, \beta) = f(x_i \beta)^{y_i}\,\bigl(1 - f(x_i \beta)\bigr)^{1 - y_i}$$

since if y_i is 0, this equals:

$$1 - f(x_i \beta)$$

and if y_i is 1, it equals:

$$f(x_i \beta)$$
It turns out that it's actually simpler to maximize the log likelihood:

$$\log L(\beta \mid x_i, y_i) = y_i \log f(x_i \beta) + (1 - y_i) \log\bigl(1 - f(x_i \beta)\bigr)$$

Because log is a strictly increasing function, any beta that maximizes the log likelihood also
maximizes the likelihood, and vice versa.
def logistic_log_likelihood_i(x_i, y_i, beta):
    if y_i == 1:
        return math.log(logistic(dot(x_i, beta)))
    else:
        return math.log(1 - logistic(dot(x_i, beta)))
If we assume different data points are independent from one another, the overall likelihood
is just the product of the individual likelihoods. Which means the overall log likelihood is
the sum of the individual log likelihoods:
def logistic_log_likelihood(x, y, beta):
    return sum(logistic_log_likelihood_i(x_i, y_i, beta)
               for x_i, y_i in zip(x, y))
A little bit of calculus gives us the gradient:
def logistic_log_partial_ij(x_i, y_i, beta, j):
    """here i is the index of the data point,
    j the index of the derivative"""
    return (y_i - logistic(dot(x_i, beta))) * x_i[j]

def logistic_log_gradient_i(x_i, y_i, beta):
    """the gradient of the log likelihood
    corresponding to the ith data point"""
    return [logistic_log_partial_ij(x_i, y_i, beta, j)
            for j, _ in enumerate(beta)]

def logistic_log_gradient(x, y, beta):
    return reduce(vector_add,
                  [logistic_log_gradient_i(x_i, y_i, beta)
                   for x_i, y_i in zip(x, y)])
at which point we have all the pieces we need.
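For the record, the calculus behind logistic_log_partial_ij: differentiating the single-point log likelihood with respect to the jth coefficient, and using the derivative property of the logistic function noted earlier, gives

$$\frac{\partial}{\partial \beta_j}\Bigl[\,y_i \log f(x_i \beta) + (1 - y_i) \log\bigl(1 - f(x_i \beta)\bigr)\Bigr] = \bigl(y_i - f(x_i \beta)\bigr)\,x_{ij}$$

which is exactly the quantity returned above.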
Applying the Model
We’ll want to split our data into a training set and a test set:
random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(rescaled_x, y, 0.33)

# want to maximize log likelihood on the training data
fn = partial(logistic_log_likelihood, x_train, y_train)
gradient_fn = partial(logistic_log_gradient, x_train, y_train)

# pick a random starting point
beta_0 = [random.random() for _ in range(3)]

# and maximize using gradient descent
beta_hat = maximize_batch(fn, gradient_fn, beta_0)
Alternatively, you could use stochastic gradient descent:
beta_hat = maximize_stochastic(logistic_log_likelihood_i,
                               logistic_log_gradient_i,
                               x_train, y_train, beta_0)
Either way we find approximately:
beta_hat = [-1.90, 4.05, -3.87]
These are coefficients for the rescaled data, but we can transform them back to the
original data as well:

beta_hat_unscaled = [7.61, 1.42, -0.000249]
Unfortunately, these are not as easy to interpret as linear regression coefficients. All else
being equal, an extra year of experience adds 1.42 to the input of logistic. All else being
equal, an extra $10,000 of salary subtracts 2.49 from the input of logistic.
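(If you want to reproduce that unscaling yourself, and assuming rescale standardized each non-constant column to mean 0 and standard deviation 1, the back-transformation looks roughly like the sketch below; the helper name is hypothetical.)

def unscale_beta(beta, means, stdevs):
    """convert coefficients fit on standardized columns back to original units;
    means[j] and stdevs[j] describe original column j, and column 0 is the
    untouched constant-1s column"""
    unscaled = [beta[j] / stdevs[j] for j in range(1, len(beta))]
    intercept = beta[0] - sum(beta[j] * means[j] / stdevs[j]
                              for j in range(1, len(beta)))
    return [intercept] + unscaled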
The impact on the output, however, depends on the other inputs as well. If dot(beta,
x_i) is already large (corresponding to a probability close to 1), increasing it even by a lot
cannot affect the probability very much. If it's close to 0, increasing it just a little might
increase the probability quite a bit.
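For instance, using the logistic function defined above:

logistic(1) - logistic(0)   # about 0.23: near zero, a unit increase in the input moves the probability a lot
logistic(6) - logistic(5)   # about 0.004: near saturation, the same increase barely moves it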
What we can say is that — all else being equal — people with more experience are more
likely to pay for accounts. And that — all else being equal — people with higher salaries
are less likely to pay for accounts. (This was also somewhat apparent when we plotted the
data.)
Goodness of Fit
We haven’t yet used the test data that we held out. Let’s see what happens if we predict
paid account whenever the probability exceeds 0.5:
true_positives = false_positives = true_negatives = false_negatives = 0

for x_i, y_i in zip(x_test, y_test):
    predict = logistic(dot(beta_hat, x_i))

    if y_i == 1 and predict >= 0.5:   # TP: paid and we predict paid
        true_positives += 1
    elif y_i == 1:                    # FN: paid and we predict unpaid
        false_negatives += 1
    elif predict >= 0.5:              # FP: unpaid and we predict paid
        false_positives += 1
    else:                             # TN: unpaid and we predict unpaid
        true_negatives += 1

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
This gives a precision of 93% (“when we predict paid account we’re right 93% of the
time”) and a recall of 82% (“when a user has a paid account we predict paid account 82%
of the time”), both of which are pretty respectable numbers.
We can also plot the predictions versus the actuals (Figure 16-4), which also shows that
the model performs well:
predictions = [logistic(dot(beta_hat, x_i)) for x_i in x_test]
plt.scatter(predictions, y_test)
plt.xlabel("predicted probability")
plt.ylabel("actual outcome")
plt.title("Logistic Regression Predicted vs. Actual")
plt.show()
Figure 16-4. Logistic regression predicted versus actual
Support Vector Machines
The set of points where dot(beta_hat, x_i) equals 0 is the boundary between our
classes. We can plot this to see exactly what our model is doing (Figure 16-5).

This boundary is a hyperplane that splits the input space into two half-spaces
corresponding to predict paid and predict unpaid. We found it as a side-effect of finding
the most likely logistic model.
An alternative approach to classification is to just look for the hyperplane that "best"
separates the classes in the training data. This is the idea behind the support vector
machine, which finds the hyperplane that maximizes the distance to the nearest point in
each class (Figure 16-6).
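For reference, though this chapter does not develop it further, the standard hard-margin form of that problem, with labels coded as y_i in {-1, +1}, is

$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \ \text{for all } i$$

so that the margin being maximized is 2 / ||w||.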
Figure 16-5. Paid and unpaid users with decision boundary
Finding such a hyperplane is an optimization problem that involves techniques that are too
advanced for us. A different problem is that a separating hyperplane might not exist at all.
In our “who pays?” data set there simply is no line that perfectly separates the paid users
from the unpaid users.
We can (sometimes) get around this by transforming the data into a higher-dimensional
space. For example, consider the simple one-dimensional data set shown in Figure 16-7.
Figure 16-6. A separating hyperplane
It’s clear that there’s no hyperplane that separates the positive examples from the negative
ones. However, look at what happens when we map this data set to two dimensions by
sending the point x to (x, x**2). Suddenly it’s possible to find a hyperplane that splits the
data (Figure 16-8).
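A tiny made-up illustration of that mapping (the data points here are invented purely for demonstration):

points = [-2.0, -0.5, 0.5, 2.0]
labels = [1, 0, 0, 1]                  # positives on the outside, negatives in the middle

mapped = [(x, x ** 2) for x in points]
# [(-2.0, 4.0), (-0.5, 0.25), (0.5, 0.25), (2.0, 4.0)]
# no single threshold on x separates the classes, but in two dimensions the
# horizontal line x**2 = 1 does: both positives lie above it, both negatives below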
This is usually called the kernel trick because rather than actually mapping the points into
the higher-dimensional space (which could be expensive if there are a lot of points and the
mapping is complicated), we can use a “kernel” function to compute dot products in the
higher-dimensional space and use those to find a hyperplane.
Figure 16-7. A nonseparable one-dimensional data set
It’s hard (and probably not a good idea) to use support vector machines without relying on
specialized optimization software written by people with the appropriate expertise, so
we’ll have to leave our treatment here.
Figure 16-8. Data set becomes separable in higher dimensions
For Further Investigation
scikit-learn has modules for both Logistic Regression and Support Vector Machines.
libsvm is the support vector machine implementation that scikit-learn is using behind
the scenes. Its website has a variety of useful documentation about support vector
machines.
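For instance, a minimal sketch of using both on the same training data as above (results will differ somewhat from the from-scratch model, since scikit-learn's LogisticRegression applies L2 regularization by default):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

logit = LogisticRegression().fit(x_train, y_train)
svm = SVC(kernel="rbf").fit(x_train, y_train)

print(logit.score(x_test, y_test))   # accuracy on the held-out test set
print(svm.score(x_test, y_test))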