Suppose you created a model whose predictions or forecasts form a horizontal straight line, while your observations clearly follow a seasonal pattern and don’t resemble a straight line at all. Then you’re most likely dealing with underfitting. The opposite of underfitting, a model that more or less copies the training data, is called overfitting.
Let me explain all this by starting off with a section about model performance metrics. This is critical, since these metrics let you quantify how well a model performs. It’s like having evidence for the occurrence of overfitting and underfitting. Interesting? Yes, indeed!
Now, after we’ve found out what’s happening, it’s also important to understand it. If you don’t understand the problem, it’s less likely that you’ll solve it in a decent way. And if you do, it might be pure luck, which can turn out to be a little embarrassing, especially when you have to explain your project to a client. So we’ll end with an explanation of under- and overfitting, and what we can do to stop them.
In this section we deal with two key questions, applied to regression:
- How do we measure model performance?
- How can we know whether a model will perform well in real life?
First of all, how do we measure model performance?
In regression, the most commonly used performance metrics are the mean absolute error (MAE), the mean squared error (MSE) and the coefficient of determination (also known as the R2 score). The higher the error metrics, the worse the model; the higher the R2 score, the better.
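To make this concrete, here is a minimal sketch of computing all three metrics with scikit-learn. The arrays y_true and y_pred are made-up toy values, purely for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical observations and model predictions, just for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MAE: the average absolute deviation between predictions and observations
mae = mean_absolute_error(y_true, y_pred)

# MSE: the average squared deviation; penalizes large errors more heavily
mse = mean_squared_error(y_true, y_pred)

# R2: the proportion of the variance in y explained by the model (1.0 is perfect)
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.3f}, MSE: {mse:.3f}, R2: {r2:.3f}")
```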
Secondly, how do you know whether the model you created will perform well in real life?
Well, typically you split your dataset into a train set and a test set. The training set is used to train the model; the test set is used to evaluate and validate it. Often 70% of the data is used for training and 30% for testing, though deviations from these numbers are possible depending on the specific problem.
The performance measures we talked about earlier are most interesting when calculated during the test phase of your project. During training, a model is created with both the observations and the features as input. During testing, however, the model only gets the features and has to predict the observations based on the values of the past and current (not future!) features. Obviously, that’s most interesting, because then you can evaluate how your model performs without knowing what the future observations will be.
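Putting these two ideas together, here is a minimal sketch of the whole workflow: split 70/30, train on the training set only, then score on data the model has never seen. The data and the choice of a plain linear model are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Toy regression data, purely illustrative: y is roughly 2*x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=100)

# 70% of the data for training, 30% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train on the training set only
model = LinearRegression().fit(X_train, y_train)

# Evaluate on data the model has never seen
print("train R2:", r2_score(y_train, model.predict(X_train)))
print("test R2: ", r2_score(y_test, model.predict(X_test)))
```

If the test score is close to the training score, that is a good sign the model will generalize to new data.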
Overfitting and Underfitting
Key questions of this section:
- What are overfitting and underfitting?
- Can performance metrics tell us when we are over- or underfitting?
- How can we improve model performance?
Let’s start off with these questions applied to overfitting.
An example of overfitting
Overfitting means that the model performance on the training set is very good, almost perfect, but the performance on the test set is much worse. The model just copies the training data but cannot handle new inputs. Quite often, during testing, your predictions or forecasts just converge to the mean. Not exactly what you were looking for!
How does this happen? Typically, overfitting happens when the model is too complex. Either there are too many parameters, in which case the model is too flexible, or the model parameters are not constrained, which means they can take on extreme values. Always remember that model complexity has to be balanced by a sufficient amount of data: the more complex your model, the more data you will need.
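To see this in action, here is an illustrative sketch. The data and the degree-15 polynomial are assumptions chosen to provoke overfitting: with only 21 training points and 16 coefficients to fit, the model scores almost perfectly on the training set and much worse on the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# A small noisy dataset: few points, so a complex model can memorize it
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, size=(30, 1)), axis=0)
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A degree-15 polynomial has far too many parameters for 21 training points
overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
overfit.fit(X_train, y_train)

# Near-perfect on the training data, much worse on unseen data
print("train R2:", r2_score(y_train, overfit.predict(X_train)))
print("test R2: ", r2_score(y_test, overfit.predict(X_test)))
```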
So how to solve all this?
- Reduce your model complexity, which often means reducing the number of parameters.
- Apply regularization, which basically comes down to putting constraints on the parameters (see the sketch after this list).
- If possible, use more training data.
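As an example of the second option, here is a sketch that reuses the overfitting setup above but swaps in ridge regression, a common form of regularization that penalizes extreme coefficient values. The alpha value is an illustrative assumption, not a recommendation:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# The same kind of small noisy dataset as before
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, size=(30, 1)), axis=0)
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The same degree-15 polynomial, but Ridge penalizes large coefficients,
# which keeps the flexible model from taking on extreme values
regularized = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0))
regularized.fit(X_train, y_train)

# The gap between train and test performance should shrink
print("train R2:", r2_score(y_train, regularized.predict(X_train)))
print("test R2: ", r2_score(y_test, regularized.predict(X_test)))
```

Note that the model still has 16 parameters; the constraint on their values is what reins in the flexibility.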
Now, let’s talk about underfitting.
Underfitting is not exactly the opposite of overfitting, because here the performance on both the training set and test set is horrible.
This can happen if either the model is too simple, or x does not explain y. The latter can have different causes: noise, variables that have an influence but were not observed, …
We can solve underfitting (if the model is too simple) by making the model more complex, for instance by using a model with polynomial variables, but remember that you’ll have to support this with extra data. However, if you simply didn’t measure the variable that influences the variable of interest, then making the model more complex won’t help, nor will adding more data.
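To illustrate the first case, here is a sketch on made-up quadratic data: a straight line scores poorly on both the training and the test set, while adding a squared term makes the model just complex enough. The data and the degrees are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Data with a clear quadratic pattern plus noise
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A plain straight line underfits: poor R2 on train AND test
line = LinearRegression().fit(X_train, y_train)
print("line train R2:", r2_score(y_train, line.predict(X_train)))
print("line test R2: ", r2_score(y_test, line.predict(X_test)))

# Adding a squared term matches the structure in the data
quad = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quad.fit(X_train, y_train)
print("quad train R2:", r2_score(y_train, quad.predict(X_train)))
print("quad test R2: ", r2_score(y_test, quad.predict(X_test)))
```

The telltale sign of underfitting is right there in the output: the straight line is bad everywhere, not just on the test set.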