Analysis of Supervised Learning Algorithms

15 Jan 2016

Introduction

Imagine that you are using Decision Trees to fit some data. The hypothesis set here is the set of all decision trees, that is, when you find your best hypothesis it is going to be some decision tree. Here the decision tree is also called the model or representation. A machine learning algorithm, for example ID3, is an algorithm that is used to learn this model. We will first discuss bias and variance as the properties of the model. We will then understand how we can measure performance of these algorithms and accuracy of the model predicted.

Bias and Variance

If you run the ID3 algorithm several times you will find a new decision tree everytime. This error introduced in the prediction because of this behavior is called variance. If you run ID3 algorithm several times you will find a new decision tree everytime. Let us say you could average out the trees and find a new tree that is representative of these trees. It is still going to be slightly different than the best decision tree. This error is called bias. Further there are two types of biases: preference bias are the set of hypothesis that the algorithm prefers to predict and restriction bias are the set of hypothesis that the algorithm is restricted to predict.

Causes of Bias/Variance

The variance is high when these decision trees are more dissimilar to each other. This is caused when there is too much noise in the data. On the other hand a model has high bias when you might have sampled the data from a particular population. Since our prediction algorithm is finding a hypothesis set which is a decision tree, we are biasing our algorithm to only find decision trees.

A popular technique to reduce error due to variance is Bagging and Resampling. In bagging (Bootstrap Aggregating), we recreate our training set by random selection with replacement. Each training set is then used to predict a model. Then we can average the model or use a validation technique to pick the best model.

Measuring Performance

Like any algorithm the performance of a machine learning algorithm can be measured using asymptotic analysis - time and space complexity. For example, most of the supervised learning algorithms like Decision Trees and Neural Networks learn a model. The space required for these algorithms is equivalent to the space required for these models. However in instance based learning methods like KNN, the space required is equivalent to the space required to store the entire sample space.

Measuring Accuracy

To measure accuracy of a hypothesis we need to compare it to the true hypothesis. However we never know the true hypothesis, so we use other techniques to measure model prediction error. For this article, the model prediction error will be dependent on our data: we are trying to find a hypothesis that best fits the data. For a regression problem we use least square error, for a classification problem we use Type I/II errors.

We can find the best hypothesis using model complexity experiments and learning curves.

Model Complexity

Model Complexity curves have an error metric on its Y axis and the complexity on the X axis. The graph has atleast two curves: one for the training set and one for the testing set. To partially remove variance due to noise, cross validation is used instead of a testing set.

The purpose of this graph is to find the best complexity our model should have. Here, the term "best" is loosely used. In Machine Learning best means it can model future unseen data. If it is not a good forecaster it is not the best.

With this graph you can determine the point at which overfitting is the least. This is the point where the training and validation curves diverge. If there are multiple points, we choose the one where the complexity is the least because of Ockham's Razor.

You can change the model complexity of Supervised Learning models in many ways. The complexity of a model comes from its preference and restriction bias. Every model has a preference bias and when you change complexity of your model, you are basically changing the preference bias. For the best results you should change complexity until you touch the hypothesis space that is the entire restriction bias.

Decision Trees - Changing height of trees, different pruning algorithms
Neural Networks - Changing the number of hidden layers/nodes, changing momentum and optimization algorithm
k-NN - Changing the value of k, choosing different subset of the feature space
SVM - Changing the kernel
Boosting - Changing the number of base learners, changing the base learner

So, that's model complexity.

Learning Curves

Learning curves, again, have an error metric on its Y axis and the size of the dataset on its X axis. Like the Model Complexity curves the graph has atleast two curves: one for the training set and the other for the testing or cross-validation set.

Assume that you run a Model Complexity experiment on a dataset for a given model. We find a point where the training and cross-validation curves diverge. Lets say that the gap between training and cross-validation curve at the point of overfitting is relatively large. It can be due to two reasons: our model is not expressive enough or we don't have enough data. To help matters here we use the learning curve.

In a learning curve experiment, we fix the model complexity and vary the size of the training set used to train the data. If our data does not have noise and the model can perfectly fit the data, the training and cross validation curves will meet when the training size on the X axis is all the data. If they don't meet, than it means we don't have enough data or our model needs to be more complex. We change the complexity of our model and rerun the experiment while noting down the gap between the two curves.

In most cases, if the complexity required for the best fit in a model complexity graph and a learning curve graph should be the same. If they are not the same, we need to gather more data.