Analysing overfitting on individual model features
On 10 September 2024 - tagged machine-learning
Overfitting is one of the biggest adversaries of quantitative researchers. We have already discussed it in two articles and covered a few ways to detect and avoid it. Today, we will zoom in on our models and find out which features are useful and which cause the most harm.
We would like to know which features are not consistently useful on both in-sample and out-of-sample data and therefore contribute to overfitting. Those features will show a poor out-of-sample correlation between their SHAP values and the target variable (or label), even when the in-sample correlation looks good.
Here is an example of such an analysis. We have 27 features and a gradient boosting regression model implemented with the lightgbm library. I have anonymized the feature names, but normally they would be shown on the plot. The following plot shows the partial correlation between SHAP values and the target variable, with in-sample data on the x axis and out-of-sample data on the y axis. Features that are high on both axes are the most useful for the model, e.g. features 0 and 5, or even 6, 1 and 18. Features to the left are not very useful and we could simply drop them, which we could also learn from simpler metrics like feature importance. Finally, the features in the bottom right (2, 10, 9, ...) were useful on in-sample data but not on out-of-sample data, which is a sign of overfitting.
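The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the exact code behind the plot: it uses a plain Pearson correlation instead of the partial correlation mentioned above, the function names and thresholds are hypothetical, and the demo data are synthetic stand-ins for real SHAP matrices. For a fitted lightgbm model, SHAP values can be obtained with `predict(..., pred_contrib=True)`, which is what the first helper assumes.

```python
import numpy as np


def lightgbm_shap(model, X):
    """SHAP values from a fitted LightGBM model (dense X assumed)."""
    # predict(..., pred_contrib=True) returns one column per feature plus a
    # final expected-value column, which is dropped here.
    return np.asarray(model.predict(X, pred_contrib=True))[:, :-1]


def shap_target_correlation(shap_values, y):
    """Pearson correlation between each feature's SHAP column and the target.

    Simplification: the article's plot uses partial correlation; plain
    Pearson correlation is swapped in here to keep the sketch short.
    """
    shap_values = np.asarray(shap_values, dtype=float)
    y = np.asarray(y, dtype=float)
    s = shap_values - shap_values.mean(axis=0)
    t = y - y.mean()
    denom = np.sqrt((s ** 2).sum(axis=0) * (t ** 2).sum())
    # Constant SHAP columns (e.g. unused features) get 0 instead of NaN.
    safe = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, (s * t[:, None]).sum(axis=0) / safe, 0.0)


def flag_overfit_features(corr_in, corr_out, min_in=0.1, max_out=0.1):
    """Indices of features that look useful in-sample but not out-of-sample.

    The thresholds are hypothetical and should be tuned per problem.
    """
    corr_in = np.asarray(corr_in)
    corr_out = np.asarray(corr_out)
    return np.flatnonzero((corr_in >= min_in) & (corr_out <= max_out))


# Synthetic demo: feature 0 carries real signal on both datasets, while
# feature 1 has "memorised" the in-sample noise and is unrelated out-of-sample.
rng = np.random.default_rng(0)
n = 2000
signal_in, noise_in = rng.normal(size=n), rng.normal(size=n)
signal_out, noise_out = rng.normal(size=n), rng.normal(size=n)
y_in = signal_in + noise_in
y_out = signal_out + rng.normal(size=n)
shap_in = np.column_stack([signal_in, 0.5 * noise_in])
shap_out = np.column_stack([signal_out, 0.5 * noise_out])

corr_in = shap_target_correlation(shap_in, y_in)
corr_out = shap_target_correlation(shap_out, y_out)
flagged = flag_overfit_features(corr_in, corr_out)
print("in-sample:", corr_in.round(2))
print("out-of-sample:", corr_out.round(2))
print("possibly overfit features:", flagged.tolist())
```

Plotting `corr_in` against `corr_out` as a scatter plot, with one labelled point per feature, gives a picture in the spirit of the one described above.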
What to do with such features? We could simply drop them. However, it is often possible to save a feature, for example by normalizing it, binning it into just a few distinct values (the max_bin_by_feature setting in lightgbm), or rethinking the feature's meaning and computation. We could also engineer and add new features similar to the best-performing ones in the top right.
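As a small sketch of the binning option, the list for max_bin_by_feature could be built like this. The helper name and the choice of 8 bins are hypothetical; 255 is lightgbm's default max_bin, and the feature indices are the ones flagged in the example above.

```python
def max_bins_per_feature(n_features, coarse_features, coarse_bins=8, default_bins=255):
    """Build a max_bin_by_feature list: few bins for suspect features,
    the default number of bins for everything else."""
    coarse = set(coarse_features)
    unknown = sorted(i for i in coarse if not 0 <= i < n_features)
    if unknown:
        raise ValueError(f"feature indices out of range: {unknown}")
    return [coarse_bins if i in coarse else default_bins for i in range(n_features)]


# 27 features, with the suspect features 2, 10 and 9 restricted to 8 bins each.
bins = max_bins_per_feature(27, coarse_features=[2, 10, 9])
dataset_params = {"max_bin_by_feature": bins}

# With lightgbm installed, the list would be passed as a Dataset parameter, e.g.
# lgb.Dataset(X_train, label=y_train, params=dataset_params)
print(bins)
```

Coarser bins limit how finely the trees can split on a feature, so it is worth re-running the in-sample versus out-of-sample comparison afterwards to check whether the feature actually moved towards the top right.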
Note that this feature selection method is based on out-of-sample results, so it tunes the model directly towards better out-of-sample performance and the out-of-sample data stops being a fully independent check. This can lead to 'manually overfitting' your model, as discussed in the overfitting intro post.
Have your own tricks for avoiding overfitting or debugging your models? Comment under this tweet!
Previous articles in overfitting series
New posts are announced on Twitter @jan_skoda or @crypto_lake_com, so follow us!