Wednesday, May 8, 2013

Redux of Coursera ml-003: Machine Learning, first half

Since my academic focus for the past ten years seems to be somewhat of a sinking ship recently, I've decided it's time for a change of career. One of my favorite parts of working in physics has been using computers to find insight into tough problems, which I can fortunately keep doing regardless of the existence of a big expensive experiment. According to at least one McKinsey report and a recent Science article, the field of data science is rapidly growing and demand for data scientists far exceeds supply over the next decade, suggesting that it's probably a good time to get into the field and take a swing at it. In addition to beefing up my pythonjava, and teaching a brief computer science course this Summer for the MIT MEET program, I'm taking a Coursera course on Machine Learning by Stanford's Andrew Ng before starting the Insight data science fellowship this fall, all of which I'm hoping will be good prep for working in data science.

The machine learning course seems like it will be pretty awesome, but in the beginning a lot of the content has been a pretty plain review from a Physics/Engineering-Science PhD background. Specifically, a lot of the content was covered in a graduate level geophysical inverse theory course I took (taught by Robert Parker at Scripps Institution of Oceanography), and in another graduate level applied mathematics course (taught by Stefan Llewelyn Smith at UCSD Mechanical and Aerospace Engineering). Both of these courses were much more mathematically vigorous, ambiguous of machine learning, and less focused on the programming/algorithms side of the content, but this course does a much better job of delivering practical examples of how to apply the techniques to data science type problems.

To keep track of my notes during the course, I'm putting together a more brief redux tutorial version for anyone else interested in skimming through the content. I'm breaking the sections down by week in the course, and the headings generally go by lecture. Also, I tend to italicize key words and concepts where I remember to.

Introduction

In a sentence, 'machine learning is the field of study that gives computers the ability to learn without being explicitly programmed'.

Supervised vs. Unsupervised learning

In supervised learning, an algorithm is given a training set that has been pre-classified to learn how to classify data

  • using a set of e-mails classified as spam/not spam, classify future emails as spam/not spam
  • using profiles of customers given a catalog who then either purchased merchandise or not, determine from a customer profile if they will buy merchandise after receiving a catalog (classification)
  • given credit scores of mortgage holders who defaulted or not, determine from a future buyer's credit score if they will default or not (classification)
  • given a set of house prices, square footage, and number of bedrooms, etc. for an area, estimate how much a house of given parameters will cost (regression)
In unsupervised learning, an algorithm is given an unclassified data set to group into clusters
  • given a set of airline passenger itineraries, determine if there are groupings which could be exploited to optimize terminal operations (classification?)
  • given a set of student essays, determine if any of them are similar (classification)
  • given a set of news articles from various sources, group them into similar topics/headlines (classification)
  • given a set of stage III pharmaceutical data, determine if there are different categories of patient responses (classification)
As far as I can tell, supervised learning may be either a classification or regression problem, but unsupervised learning is only ever a classification problem.

Linear regression with one variable

We used example data of house areas (in square feet) and prices in Portland, OR with m=47 training examples, and applied a linear regression to model the data. The model we fit to the data was represented as a function: $h_\theta(x^{(i)}) = \theta_0 + \theta_1 x$, and the process of linear regression is to find a set of $\theta_i$ which minimizes our desired cost function for the training set. The data set and optimum fit are shown below.


In this case, we used the Euclidean norm (i.e. the squared error function) for the cost function:
$J(\theta) = \frac{1}{2m}\sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$. The cost function is plotted below as a function of $\theta_0$ and $\theta_1$, with a $\color{red} \times$ marking the optimum solution.



The gradient descent algorithm is introduced as one technique for finding solutions $\theta$ which minimize the cost function by repeating the assignment $\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$ until the cost function converges. For the Euclidean norm cost function, it can be shown that the gradient evaluates to $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$, so that the algorithm reads as:

repeat until convergence {
$\theta_j := \theta_j - \frac{\alpha}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$
}

The importance of simultaneous update of the multiple $\theta_j$ values is impressed, so that the gradients are appropriately evaluated.

The learning rate $\alpha$ is briefly discussed as a control on the step size used by the algorithm, with small values yielding slow solutions (0.01 below), larger values yielding more rapid solutions (0.1 and 1.0 below), and too large of values potentially diverging (10 below). The potential for gradient descent finding only a local minimum of the cost function, based on where $\theta$ is initialized was also raised.
'Batch' gradient descent was introduced as an algorithm which uses all the training examples, as opposed to some subset of them.

Linear algebra review

There was a brief review of linear algebra topics including matrix multiplication properties, inverse and transpose of matrices, etc. Parameters and variables are referred to in vector notation generally from here forward, e.g.: $\theta = [\theta_0, \theta_1, \theta_2, \theta_3]$, $x = [x_0, x_1, x_2, x_3]$.

Linear regression with multiple variables/features

Having introduced gradient descent and one variable regression, these topics were extended to cover multi-variable regression. The regression model is still assumed to be linear, so that $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + ... = \theta^Tx$.

The exercises about multi-variable linear regression were to find a model that would fit housing price data as a function of number of bedrooms and house area. We had a discussion in the forums about interpreting the solution $\theta$. $\theta_0$ represents the value of a 0br 0sq-ft house, essentially the typical empty lot: a few hundred thousand dollars. $\theta_1$ represents the additional value per bedroom at fixed house area, minus a few hundred dollars per room! $\theta_2$ represents the additional value per square foot for a fixed number of bedrooms, plus a few thousand dollars per square foot.

Non-linear models

Non-linear models were also introduced, such as the square or square root of data values, by adding additional 'features' to the data which are set as the square or square root of the original values e.g.:
$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$ where $x_2=x_1^2$ or $x_2=\sqrt{x_1}$.

Feature normalization

The concept of feature normalization was introduced as a way to increase the speed and stability of gradient descent by avoiding zig-zagging cost-function trajectories in the convergence path: $ x := \frac{x-\mu}{\sigma}$, where $\mu$ is the mean of x, and $\sigma$ is the standard deviation.

Normal equations

The normal equation: $\theta = (X^TX)^{-1}X^Ty$ was introduced as an algebraic way to find a direct solution instead of an iterative technique like gradient descent. This technique is great for when we don't want to use iteration or we don't want to choose $\alpha$, but is very slow $O(n^3)$, so for large datasets iterative techniques are still recommended. 

Since use of the normal equation requires that the matrix $X^TX$ be invertible, the case of non-invertibile $X^TX$ was also discussed. This can happen when we are trying to solve a model with a larger number of features $n$ than data-points $m$ (i.e. $n>m$). In this case, the matrix $X^TX$ is said to be singular or degenerate. This can be solved either by deleting some features (i.e. redundant features such as size in feet-squared and size in meters-squared), or by regularization which will be discussed more later.

Octave

Octave is essentially a GNU version of Matlab, and was reviewed since it is the recommended programming language to use in this course. It contains most of the features as Matlab, and scripts or functions written for Matlab will generally run in Octave as well. The exercises can generally also be done in python, fortran, java, C/C++, or other languages by importing appropriate libraries and modules, so in this redux we will not review Octave.

Logistic Regression

When you have a data set that you need to classify into a set of groups, you use logistic regression. Instead of a linear unbounded model, the model hypothesis representation generally resembles something like a sigmoid function $g(z) = \frac{1}{1+e^{-z}}$ which is bounded between 0 and 1, so that $h_\theta(x) = g(\theta^Tx)$.
The line drawn between two classes in a data set is referred to as a decision boundary, can be linear or non-linear, and is a property of the hypothesis, not of the training set. The admit/no admit decision boundary found for a learning set of student grades on two exams is shown below for a first and second order model.




We used the logistic regression cost function: $Cost(h_\theta(x)) = -log(h_\theta(x))$ if y=1, $Cost(h_\theta(x)) =-log(h_\theta(1-x))$ if y=0, which ensures that the minimization is 'convex' (i.e. has one local minimum) as opposed to 'non-convex' (i.e. multiple local minima, such as with the Euclidean norm for some non-linear models). This cost function is also guaranteed to be >=0 for all cases.
This function can be derived from the principle of maximum likelihood estimation.

The cost function can be simplified to: $\begin{align*}J(\theta) &= \frac{1}{m}\sum\limits_{i=0}^m Cost( h_\theta(x^{(i)},y^{(i)}) ) \\ &= -\frac{1}{m}\left[\sum\limits_{i=0}^m y^{(i)} log h_\theta(x^{(i)}) + (1-y^{(i)}) log (1-h_\theta(x^{(i)}) )\right] \end{align*}$
It can be shown that the gradient of this cost function has the same form as the gradient of the Euclidean norm used in the linear case: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$.
Derive both of these.

Advanced optimization

Algorithms other than gradient descent were introduced: (briefly review each of these)
  • conjugate gradient - similar to gradient descent except the gradient directions are modified slightly so that the algorithm is guaranteed to converge within n iterations
  • BFGS (Broyden–Fletcher–Goldfarb–Shanno method) - a method for solving unconstrained nonlinear problems
  • L-BFGS - a limited memory version of BFGS, optimum for large data sets
Benefits of these all are: no need to pick $\alpha$, can be faster than gradient descent. Disadvantages are increased complexity compared to gradient descent, but on the bright side there are pre-built implementations we can use instead of writing our own. The Octave/Matlab function 'fminunc' has all of these algorithms implemented, and even tries to automatically select the appropriate method. Similar packages exist for other languages.

All of the classification examples in the class so far have been binary classification (i.e. 0 and 1, or true and false, etc.). The concept of multi-class classification was introduced here, with the possibility of multiple classifications (i.e. 0, 1, 2, 3, etc.). The problem of classifying each of several classes using existing developed techniques was approached via one-vs-all classification, in which each class is sequentially considered on its own against all the other data.

Regularization

Need to understand the concept of 'performance on the data set' and 'training accuracy'.

Overfitting occurs when we have too many features $m$ for the number of data $n$ (i.e. $m>n$), and the resulting model has a high-variance.
Underfitting is also a problem which occurs when we do not include enough features to describe data, and results in high-bias.

There are at least two options available to address overfitting:
  1. reduce the number of features, either manually or using a model selection algorithm (discussed later)
  2. use regularization to limit the values of $\theta_j$
In regularization, we re-define the cost function so that not only the difference between the model and data is minimized, but so are the magnitude of the values $\theta$:
$J(\theta) = \frac{1}{2m} \sum\limits_{i=1}^m ( h_\theta(x)-y )^2 + \lambda \sum\limits_{j=1}^n \theta_j^2$
Note that the sums are from 1 to m as opposed to 0 to m, which excludes the offset $\theta_0$ from the minimization, so that we don't penalize $\theta_0$.
For this new cost function, it can be shown that the gradient evaluates to $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j$
In order to implement gradient descent, we run the same algorithm, which now reads as:

repeat until convergence {
$\theta_0 := \theta_0 \frac{\alpha}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}$
$\theta_j := \theta_j(1-\alpha\frac{\lambda}{m}) - \frac{\alpha}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$
}

Normal equation

The normal equation for regularization is similar to that for normal gradient descent, and can be found by solving the equation $\frac{\partial J(\theta)}{\partial \theta_j} = 0$. It can be shown that the solution has the form: 

$\theta = (X^TX + \lambda \left[ \begin{array}{ccccc} 0 &&&&\\ &1&&& \\ &&1&& \\ &&&\ddots& \\ &&&&1 \end{array} \right] )^{-1}X^Ty$

We return to the potential for non-invertibility of the matrix $X^TX + \lambda \left[ \begin{array}{ccccc} 0 &&&&\\ &1&&& \\ &&1&& \\ &&&\ddots& \\ &&&&1 \end{array} \right]$. It can be shown that as long as $\lambda>0$, the matrix has guaranteed invertibility, so regularization solves this problem.

Regularized logistic regression

To implement regularization of logistic regression, all we need to do is add the regularization term to the cost function: $J(\theta) = -\frac{1}{m}\left[\sum\limits_{i=0}^m y^{(i)} log h_\theta(x^{(i)}) + (1-y^{(i)}) log (1-h_\theta(x^{(i)}) )\right] + \frac{\lambda}{2m}\sum\limits_{j=1}^n \theta_j^2$. This will have the effect of smoothing the decision boundary according to the regularization parameter $\lambda$.

The decision boundary from a sixth order logistic model for two microchip tests is shown below for regularization parameters $\lambda=0.001$ and $\lambda=1.0$, as calculated for the course exercises.
In a later exercise set, we applied one-vs-all logistic regression to a typical computer vision exercise: learning to classify hand-written numbers from a training set of 5000 cases. The logistic regression model correctly characterized 94% of these cases after learning. A sample of the data set and of the incorrectly classified numbers are shown below. Notice that even when the algorithm fails, a human eye can often see how it was misled (i.e. a 3 that looks fairly close to a 7, etc.).

Neural networks

Neural networks are useful when we need to use a model with many parameters. As an example with the housing data we've used, we could consider the size of the house, number of bedrooms, number of floors, number of bathrooms, number of car space in the garage, total lot size, age, and maybe around 100 parameters we might want to consider. Using a second order linear regression model, the number of features we need to fit goes like $\mathcal{O}(n^2)$, more concretely $n^2/2 = 5000$, and using a third order model the number of features goes like $\mathcal{O}(n^3)$, etc. In computer vision problems, even a small 100x100 pixel image has 10,000 parameters (pixels), so a second order model would require 50 million  features and we can see that this approach rapidly loses traction.

Neural networks originated as algorithms trying to mimic the brain, and were widely used in the 80s and 90s, then diminished popularity in the 90s. They've enjoyed a recent resurgence with faster and bigger computers as a state of the art technique for solving machine learning problems. 

Representation



The inputs to each layer $a^{(l)}$ are referred to as activation values, and the input and hidden layers have a bias unit added which is always $a^{(l)}_0=1$.
label the inputs x (i.e. actions) at various layers, the theta parameters, etc. in neural network lingo, and discuss how they all work
bias parameters

In the exercises, we used a neural network technique to solve the logistic regression problem of classifying hand-written numbers to digits. Even though it used only 25 parameters, it classified the numbers with only half the error of the linear logistic regression with 400 parameters used above.

Cost function

In a neural network, we have a new but similar cost function to our prior examples:
$J(\Theta) = -\frac{1}{m} \left[ \sum\limits_{i=1}^m \sum\limits_{k=1}^K y_k^{(i)}log(h_\Theta(x^{(i)})_k) + (1-y_k^{(i)}) log(1-h_\Theta(x^{(i)})_k) \right] + \frac{\lambda}{2m} \sum\limits_{l=1}^{L-1} \sum\limits_{i=1}^{s_l} \sum\limits_{j=1}^{s_{l+1}} (\Theta_{ij}^{(l)})^2$

Contrary to the cases for linear and logistic regression, the cost function for a neural network is not guaranteed to be convex, so it may converge to a local minimum instead of a global minimum.

Gradient computation: Back-propagation algorithm

We need to calculate the error of each node $j$ in each layer $L$: $\delta_j^{(l)}$. For the last layer, the error is calculated as: $\delta_j^{(end)} = a_j^{(end)}-y_j$. For the prior layers, the error is calculated as: $\delta_j^{(l)} = (\Theta^{(l)})^T \delta^{(l+1)} .* g'(z^{(l)})$, where $g'(z^{(l)}) = a^{(l)} .* (1-a^{(l)})$. The first element of each $\delta^{(l<L)}$ must be removed so that the bias units are not counted in the error.

The algorithm is then essentially the following for each i from an initial guess $\Delta_{ij}^{(l)}$:
  • forward propagate $a$: set $a^{(1)} = x$, then compute $a^{(l)}$ for $l=2,3,...L$ (this is done to calculate the cost function)
  • back propagate $\delta$: compute $\delta^{(L)} = a^{(L)} - y$, then compute $\delta^{(L-1)}, \delta^{(L-2)}, ... \delta^{(2)}$
  • compute $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}$
From this, the gradients are then calculated as: 
$\frac{\partial J(\Theta)}{ \partial\Theta_{ij}^{(l)} } = \frac{1}{m} \Delta_{ij}^{(l)} + \frac{\lambda}{m}\Theta_{ij}^{(l)}$,
where the regularization term is omitted for $j=0$ as usual.

Gradient checking

Since calculating the gradients with back-propagation has so many steps, difficult to detect coding errors are likely to occur, so the following gradient checking process is recommended when using back propagation. The algorithm calculates the gradient using one technique, but the user can also calculate the discrete gradient using: $\frac{ \partial J(\theta) }{\partial \theta} = \frac{ J(\theta + \epsilon) - J(\theta - \epsilon) }{2\epsilon}$. While the two calculations of the gradient will not generally be exactly the same, they should have similar values. If the two calculations do not have similar values, there is likely an error which should be tracked down.

IMPORTANT: Remember to turn off the gradient checking portions of the routine after verifying correct operation, since they are computationally expensive compared to the algorithm's calculation. Leaving them turned on will cause your code to run slowly.

Random initialization: symmetry breaking

When initializing a neural network to a set of zeros, it tends to perform poorly if initialized with zeros as we have done in the prior examples. To improve performance, the matrices $\Theta_{ij}^{(l)}$ should be initialized to some random small numbers. A good size for these numbers can be estimated by $\epsilon = \frac{\sqrt{6}}{\sqrt{L_{in}+L_{out}}}$, where $L_{in}$ and $L_{out}$ are the number of input and output nodes.

Picking a network architecture

If there was any intuition explained about what's going on with the layers here, I missed it. I'm thinking of this problem in the context of our gradient descent examples. In gradient descent, the algorithm will only find as many outputs as we request from it, and only exactly the outputs that we request. In a neural network, the algorithm chooses the best combination of hidden units to return into the output units.

Number of input units: dimension of features $x^{(i)}$
Number of output units: number of classifications
Number of hidden layers: Reasonable default of 1 layer
Number of hidden units: The more the merrier (but this is computationally expensive)? Use same number in each layer. But why even have multiple layers?

Training a neural network

After setting up the back-propagation algorithm, it's important to make sure that the neural network doesn't over fit the problem. Given enough iterations and a low enough regularization parameter, the network can correctly fit all of the given examples, but it might have trouble with new cases. If this happens, we can adjust $\lambda$ to increase the regularization somewhat.

Overall impression

The first half of this course started out a bit slow for me, probably since I had already seen much of the early material. After linear regression it picked up speed and interesting new content, especially through the neural networking section. I'm definitely digging it so far!

No comments: