Although feature engineering might remove, or at least minimize, the impact of noise on the accuracy of your models, it is often more difficult than that. As I mentioned before, a complete and exhaustive data exploration is a must. In fact, in the worst-case scenario, less but better data will yield more robust models than those trained with bigger but noisy datasets. The perfect storm, however, is when the dataset is both noisy and imbalanced.
Obviously, the first thing we have to do is deal with the noise. Noise can appear at the feature level or in the labeling. For instance, training a classification model on examples with noisy features could artificially create clusters of one class in areas belonging to another, altering the decision boundaries and leading to an erroneous model. The same can happen with datasets containing contradictory or misclassified labels. On the other hand, noise can come from the measurement tools, or sensors, corrupting the features. How noise affects models and how to deal with it is an important topic, and many papers and books have been written about it (Class Noise vs. Attribute Noise: A Quantitative Study; Mining with noise knowledge: Error-aware data mining; etc.). In addition, running simulations under different scenarios is helpful to decide what can perform best. What I have found useful is the use of filters to reduce noisy data in my training set.
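As an illustration, here is a minimal sketch of one such filter, a majority-vote ensemble filter in the spirit of the techniques those papers discuss (this is my own illustrative choice, not a specific method from them); it assumes scikit-learn and that `X` and `y` are NumPy arrays:

```python
# Minimal sketch of a majority-vote label-noise filter, assuming scikit-learn.
# Points whose label disagrees with the cross-validated predictions of most
# of several diverse classifiers are treated as label noise and dropped.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def majority_vote_filter(X, y, cv=5):
    classifiers = [
        DecisionTreeClassifier(random_state=0),
        LogisticRegression(max_iter=1000),
        KNeighborsClassifier(),
    ]
    # Out-of-fold predictions, so each point is judged by models
    # that never saw it during training.
    votes = np.stack([cross_val_predict(clf, X, y, cv=cv)
                      for clf in classifiers])
    # Flag a point as noisy when most classifiers disagree with its label.
    disagreements = (votes != y).sum(axis=0)
    keep = disagreements < (len(classifiers) / 2)
    return X[keep], y[keep]
```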
Once we have reduced the amount of noisy data, we can tackle the imbalance problem. A common challenge in classification problems is the lack of examples of one class, and many times it is the most interesting class. An imbalanced dataset might not be particularly problematic if you have a very heterogeneous set of training points and the decision boundaries are clear… and unicorns and rainbows… However, real-world data have biased distributions, or the training set is especially poor for items close to one or more decision boundaries, and that can lead to a model with poor performance. Again, this is an old friend, and a lot of literature has been written about it. In summary, there are three approaches:
- Balance the training set. You can either oversample the minority class, undersample the majority class or synthesize new elements of the minority class.
- Tune or modify the algorithm: add class weights or adjust the decision threshold using soft clustering (see the sketch after this list).
- Rethink the problem. Try to get more data, or switch to an anomaly-detection approach.
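For the second approach, here is a minimal sketch with scikit-learn; the library choice, the toy dataset, and the 0.3 threshold are my assumptions, not a prescription:

```python
# Sketch: penalize mistakes on the minority class via class weights, and
# lower the decision threshold instead of using the default 0.5.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset (95% majority, 5% minority).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# 'balanced' reweights each class inversely to its frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Lower the threshold to favor minority-class recall
# (0.3 is an arbitrary value here; tune it on a validation set).
y_pred = (clf.predict_proba(X_test)[:, 1] >= 0.3).astype(int)
```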
I have personally synthesized new examples using the Synthetic Minority Oversampling Technique (SMOTE), in particular SMOTE-IPF. The method is based on generating new points using a sort of interpolation. The algorithm randomly picks an element of the minority class close to the decision boundary and finds its nearest neighbors. That group of points is used to create a new element of the class by bootstrapping (other approaches can be applied, though).
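Here is a minimal sketch using imbalanced-learn's plain SMOTE; note that SMOTE-IPF additionally runs an iterative partitioning filter over the synthesized points to clean them, which this snippet does not show:

```python
# Oversample the minority class with SMOTE from imbalanced-learn.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))  # heavily imbalanced, e.g. ~1900 vs ~100

# k_neighbors controls how many minority neighbors each new
# synthetic point is interpolated from.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))  # classes are now balanced
```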
There are many different approaches, and finding a perfect solution is difficult. What works for one dataset may be a terrible solution for another. As always, it is important to have a good, deep knowledge of the training set and the nature of the potential noise.