Label Engineering vs. Feature Engineering

It’s conventional wisdom that feature engineering is an important part of model development. A less appreciated point is that label engineering can be just as important, and in many settings there is more room to improve the quality of your labels than the quality of your features.

When the labels for your model are not the ground truth itself but a noisy proxy for it, you have noisy labels. This is an extremely common scenario when working with human-generated labels. In general, noisy labels are not a problem for your model as long as you have many of them and they are unbiased measurements of the ground truth, because the noise averages out as the number of labels grows.

In practice, noisy labels are rarely unbiased, and training solely on the raw labels will cause your model to learn outputs that are systematically different from the ground truth. In this case, counterintuitively, additional training labels quickly stop helping and can even hurt your ability to predict the ground truth target.
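The difference between unbiased and biased noise can be illustrated with a minimal sketch. It assumes the simplest possible "model", one that predicts the average of its training labels, and a made-up labeling process with Gaussian noise. The names `mean_label_error`, `bias`, `noise_sd`, and `true_value` are illustrative assumptions, not part of any library.

```python
import random
import statistics


def mean_label_error(n, bias, noise_sd=1.0, true_value=5.0, seed=0):
    """Absolute error of the average of n noisy labels relative to true_value.

    Each label is true_value + bias + Gaussian noise (an assumed toy process).
    """
    rng = random.Random(seed)
    labels = [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return abs(statistics.fmean(labels) - true_value)


# Compare error as the number of labels grows, with and without bias.
for n in (100, 10_000, 100_000):
    unbiased = mean_label_error(n, bias=0.0)
    biased = mean_label_error(n, bias=0.5)
    print(f"n={n:>7}  unbiased error={unbiased:.3f}  biased error={biased:.3f}")
```

With unbiased noise the error shrinks toward zero as `n` grows. With biased noise it settles near the size of the bias, so extra labels only make the model more confident in the wrong answer.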

Label engineering is the process of adjusting raw labels to make them more useful for predicting the ground truth target. Dozens of research papers have now proposed solutions to this problem; I will describe one approach that is very useful in practice.

One method for dealing with noisy labels is to use a label correction model. This is a model that sits upstream of your primary task and adjusts your labels to make them more accurate, or at least less biased. The cleaned labels are then passed to the primary model in the same way the raw labels were. This follows the teacher-student paradigm, where a teacher model with access to more information or a richer model architecture produces predictions for a student model to learn from.

The label correction model approach requires collecting a small ground truth set to learn the function that maps noisy labels to clean ones. Collecting this set can be expensive, but in most settings there is an existing process for collecting ground truth labels for the model's evaluation set, and this process can be reused to collect a training set for the label correction model.
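As a rough sketch of the idea, the example below fits a label correction model on a small ground truth set and applies it to a larger set of noisy labels. It rests on strong simplifying assumptions: the labels are numeric, the annotator bias is a made-up linear distortion (`noisy_label`), and the correction depends only on the noisy label itself. A real correction model would typically also use input features or annotator metadata. All names here are hypothetical.

```python
import random
import statistics

rng = random.Random(0)


def noisy_label(clean):
    # Assumed annotator behaviour: inflates scores and adds random noise.
    return 1.2 * clean + 0.8 + rng.gauss(0.0, 0.3)


def fit_linear(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = cov / var
    return slope, my - slope * mx


def mean_error(labels, truth):
    """Average signed error, i.e. the systematic bias of the labels."""
    return statistics.fmean([l - t for l, t in zip(labels, truth)])


# Small ground truth set: both noisy and clean labels are available.
gold_clean = [rng.uniform(0, 10) for _ in range(200)]
gold_noisy = [noisy_label(c) for c in gold_clean]

# Label correction model: learn the map from noisy labels to clean ones.
slope, intercept = fit_linear(gold_noisy, gold_clean)

# Large training set: in practice only the noisy labels would be observed.
# The clean values are kept here solely to check the correction.
train_clean = [rng.uniform(0, 10) for _ in range(5000)]
train_noisy = [noisy_label(c) for c in train_clean]
train_corrected = [slope * y + intercept for y in train_noisy]

raw_bias = mean_error(train_noisy, train_clean)
corrected_bias = mean_error(train_corrected, train_clean)
print(f"bias of raw labels:       {raw_bias:.3f}")
print(f"bias of corrected labels: {corrected_bias:.3f}")
```

The primary model would then be trained on `train_corrected` instead of `train_noisy`. In this toy setup the correction removes most of the systematic bias but not the random noise, which is acceptable because unbiased noise averages out with enough labels.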
