 Programming codex

# Automatically Learning From Data – Logistic Regression With L2 Regularization in Python Logistic Regression

Logistic regression is used for binary classification problems — where you have some examples that are “on” and other examples that are “off.”  You get as input a training set; which has some examples of each class along with a label saying whether each example is “on” or “off”.  The goal is to learn a model from the training data so that you can predict the label of new examples that you haven’t seen before and don’t know the label of.

For one example, suppose that you have data describing a bunch of buildings and earthquakes (E.g., year the building was constructed, type of material used, strength of earthquake,etc), and you know whether each building collapsed (“on”) or not (“off”) in each past earthquake.  Using this data, you’d like to make predictions about whether a given building is going to collapse in a hypothetical future earthquake.

One of the first models that would be worth trying is logistic regression.

Coding it up

I wasn’t working on this exact problem, but I was working on something close.  Being one to practice what I preach, I started looking for a dead simple Python logistic regression class.  The only requirement is that I wanted it to support L2 regularization (more on this later).  I’m also sharing this code with a bunch of other people on many platforms, so I wanted as few dependencies on external libraries as possible.

I couldn’t find exactly what I wanted, so I decided to take a stroll down memory lane and implement it myself.  I’ve written it in C++ and Matlab before but never in Python.

I won’t do the derivation, but there are plenty of good explanations out there to follow if you’re not afraid of a little calculus.  Just do a little Googling for “logistic regression derivation.”  The big idea is to write down the probability of the data given some setting of internal parameters, then to take the derivative, which will tell you how to change the internal parameters to make the data more likely.  Got it?  Good.

For those of you out there that know logistic regression inside and out, take a look at how short the train() method is.  I really like how easy it is to do in Python.

Regularization

I caught a little indirect flak during March madness season for talking about how I regularized the latent vectors in my matrix-factorization model of team offensive and defensive strengths when predicting outcomes in NCAA basketball.  Apparently people thought I was talking nonsense — crazy, right?

But seriously, guys — regularization is a good idea.

Let me drive home the point.  Take a look at the results of running the code (linked at the bottom).

Take a look at the top row.

On the left side, you have the training set.  There are 25 examples laid out along the x axis, and the y axis tells you if the example is “on” (1) or “off” (0).  For each of these examples, there’s a vector describing its attributes that I’m not showing.  After training the model, I ask the model to ignore the known training set labels and to estimate the probability that each label is “on” based only on the examples’s description vectors and what the model has learned (hopefully things like stronger earthquakes and older buildings increase the likelihood of collapse).  The probabilities are shown by the red X’s.  In the top left, the red X’s are right on top of the blue dots, so it is very sure about the labels of the examples, and it’s always correct.

Now on the right side, we have some new examples that the model hasn’t seen before.  This is called the test set.  This is essentially the same as the left side, but the model knows nothing about the test set class labels (yellow dots).  What you see is that it still does a decent job of predicting the labels, but there are some troubling cases where it is very confident and very wrong.  This is known as overfitting.

This is where regularization comes in.  As you go down the rows, there is stronger L2 regularization — or equivalently, pressure on the internal parameters to be zero.  This has the effect of reducing the model’s certainty.  Just because it can perfectly reconstruct the training set doesn’t mean that it has everything figured out.  You can imagine that if you were relying on this model to make important decisions, it would be desirable to have at least a bit of regularization in there.

And here’s the code.  It looks long, but most of it is to generate the data and plot the results.  The bulk of the work is done in the train() method, which is only three (dense) lines.  It requires numpy, scipy, and pylab.

* For full disclosure, I should admit that I generated my random data in a way such that it is susceptible to overfitting, possibly making logistic-regression-without-regularization look worse than it is.

The Python Code

from scipy.optimize.optimize import fmin_cg, fmin_bfgs, fmin

import numpy as np

def sigmoid(x):

return 1.0 / (1.0 + np.exp(-x))

class SyntheticClassifierData():

def __init__(self, N, d):

“”” Create N instances of d dimensional input vectors and a 1D

class label (-1 or 1). “””

means = .05 * np.random.randn(2, d)

self.X_train = np.zeros((N, d))

self.Y_train = np.zeros(N)

for i in range(N):

if np.random.random() > .5:

y = 1

else:

y = 0

self.X_train[i, :] = np.random.random(d) + means[y, :]

self.Y_train[i] = 2.0 * y – 1

self.X_test = np.zeros((N, d))

self.Y_test = np.zeros(N)

for i in range(N):

if np.random.randn() > .5:

y = 1

else:

y = 0

self.X_test[i, :] = np.random.random(d) + means[y, :]

self.Y_test[i] = 2.0 * y – 1

class LogisticRegression():

“”” A simple logistic regression model with L2 regularization (zero-mean

Gaussian priors on parameters). “””

def __init__(self, x_train=None, y_train=None, x_test=None, y_test=None,

alpha=.1, synthetic=False):

# Set L2 regularization strength

self.alpha = alpha

# Set the data.

self.set_data(x_train, y_train, x_test, y_test)

# Initialize parameters to zero, for lack of a better choice.

self.betas = np.zeros(self.x_train.shape)

def negative_lik(self, betas):

return -1 * self.lik(betas)

def lik(self, betas):

“”” Likelihood of the data under the current settings of parameters. “””

# Data likelihood

l = 0

for i in range(self.n):

l += log(sigmoid(self.y_train[i] *

np.dot(betas, self.x_train[i,:])))

# Prior likelihood

for k in range(1, self.x_train.shape):

l -= (self.alpha / 2.0) * self.betas[k]**2

return l

def train(self):

“”” Define the gradient and hand it off to a scipy gradient-based

optimizer. “””

# Define the derivative of the likelihood with respect to beta_k.

# Need to multiply by -1 because we will be minimizing.

dB_k = lambda B, k : np.sum([-self.alpha * B[k] +

self.y_train[i] * self.x_train[i, k] *

sigmoid(-self.y_train[i] *

np.dot(B, self.x_train[i,:]))

for i in range(self.n)]) * -1

# The full gradient is just an array of componentwise derivatives

dB = lambda B : np.array([dB_k(B, k)

for k in range(self.x_train.shape)])

# Optimize

self.betas = fmin_bfgs(self.negative_lik, self.betas, fprime=dB)

def set_data(self, x_train, y_train, x_test, y_test):

“”” Take data that’s already been generated. “””

self.x_train = x_train

self.y_train = y_train

self.x_test = x_test

self.y_test = y_test

self.n = y_train.shape

def training_reconstruction(self):

p_y1 = np.zeros(self.n)

for i in range(self.n):

p_y1[i] = sigmoid(np.dot(self.betas, self.x_train[i,:]))

return p_y1

def test_predictions(self):

p_y1 = np.zeros(self.n)

for i in range(self.n):

p_y1[i] = sigmoid(np.dot(self.betas, self.x_test[i,:]))

return p_y1

def plot_training_reconstruction(self):

plot(np.arange(self.n), .5 + .5 * self.y_train, ‘bo’)

plot(np.arange(self.n), self.training_reconstruction(), ‘rx’)

ylim([-.1, 1.1])

def plot_test_predictions(self):

plot(np.arange(self.n), .5 + .5 * self.y_test, ‘yo’)

plot(np.arange(self.n), self.test_predictions(), ‘rx’)

ylim([-.1, 1.1])

if __name__ == “__main__”:

from pylab import *

# Create 20 dimensional data set with 25 points — this will be

# susceptible to overfitting.

data = SyntheticClassifierData(25, 20)

# Run for a variety of regularization strengths

alphas = [0, .001, .01, .1]

for j, a in enumerate(alphas):

# Create a new learner, but use the same data for each run

lr = LogisticRegression(x_train=data.X_train, y_train=data.Y_train,

x_test=data.X_test, y_test=data.Y_test,

alpha=a)

print “Initial likelihood:”

print lr.lik(lr.betas)

# Train the model

lr.train()

# Display execution info

print “Final betas:”

print lr.betas

print “Final lik:”

print lr.lik(lr.betas)

# Plot the results

subplot(len(alphas), 2, 2*j + 1)

lr.plot_training_reconstruction()

ylabel(“Alpha=%s” % a)

if j == 0:

title(“Training set reconstructions”)

subplot(len(alphas), 2, 2*j + 2)

lr.plot_test_predictions()

if j == 0:

title(“Test set predictions”)

show()