A bad scenario: a noisy and imbalanced dataset

Although feature engineering can remove, or at least minimize, the impact of noise on the accuracy of your models, it is often harder than that. As I mentioned before, a complete and exhaustive data exploration is a must. In fact, in the worst-case scenario, less but better data yields more robust models than bigger but noisy datasets. The perfect storm, however, is when the dataset is both noisy and imbalanced.

Obviously, the first thing to do is deal with the noise. Noise can appear at the feature level or in the labeling. For instance, training a classification model on examples with noisy features can artificially create clusters of one class inside regions of another, altering the decision boundaries and leading to an erroneous model. The same can happen with datasets containing contradictory or misassigned labels. Noise can also come from the measurement tools or sensors, corrupting the features. How noise affects models and how to deal with it is an important topic, and many papers and books have been written about it ("Class Noise vs. Attribute Noise: A Quantitative Study", "Mining with noise knowledge: Error-aware data mining", etc.). In addition, running simulations under different scenarios helps to decide what may perform best. What I found useful was applying filters to reduce the noisy data in my training set.

Once we have reduced the amount of noisy data, we can tackle the imbalance problem. A common challenge in classification problems is the lack of examples of one class, often the most interesting one. An imbalanced dataset may not be a big problem if you have a very heterogeneous set of training points and the decision boundaries are clear… and unicorns and rainbows… In reality, however, the data distribution is biased, or the training set is especially poor for items close to one or more decision boundaries, and that can lead to a model with poor performance. Again, this is an old friend, and a lot of literature has been written about it. In summary, there are three approaches:

  • Balance the training set. You can either oversample the minority class, undersample the majority class, or synthesize new elements of the minority class.
  • Tune or modify the algorithm, e.g. by adding class weights or adjusting the decision threshold using soft clustering.
  • Rethink the problem. Try to get more data, or switch to an anomaly-detection formulation.
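The second approach, tuning the algorithm, can be sketched with scikit-learn's class weighting. The data and the choice of classifier here are illustrative, not from my project:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 95 majority-class points around the origin, 5 minority points shifted away
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(3, 1, (5, 2))])
y = np.array([0] * 95 + [1] * 5)

# "balanced" weights each class inversely to its frequency, so the few
# minority examples count as much as the many majority ones in the loss
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

Without the weighting, a classifier on a 95:5 split can score well by almost always predicting the majority class; the weights push the decision boundary back toward the middle.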

Personally, I synthesized new examples using the Synthetic Minority Oversampling TEchnique (SMOTE), in particular SMOTE-IPF. The method is based on generating new points using a kind of interpolation: the algorithm randomly picks an element of the minority class close to the decision boundary and finds its nearest neighbors; this group of points is then used to create a new element of the class by interpolating between them (other approaches can be applied, though).
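The interpolation step can be sketched as follows. This is a minimal SMOTE-style generator using only NumPy and scikit-learn, not the full SMOTE-IPF (which adds an iterative noise filter on top); the function name and parameters are my own:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating between
    a randomly picked minority point and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)      # idx[:, 0] is each point itself
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))   # random minority point
        j = rng.choice(idx[i, 1:])     # one of its k neighbours
        gap = rng.random()             # interpolation factor in [0, 1)
        new_points.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new_points)
```

Because each synthetic point lies on the segment between two real minority points, the new examples stay inside the region the minority class already occupies.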


There are many different approaches, and finding a perfect solution is difficult. What works for one dataset may be a terrible solution for another. As always, it is important to have a deep knowledge of the training set and of the nature of the potential noise.



Simple guide to building a machine for Deep Learning or GPU-accelerated calculations.

We have a small cluster in the lab, around 20 nodes running under a queue system. It is ideal for analysis and small jobs, while big jobs, the final step of our projects, run on one of the massive clusters of Compute Canada. Recently, I extended our lab cluster with a few GPU machines. This extension covers the gap between prototyping and production for all the jobs accelerated by GPUs. However, GPUs are very expensive hardware, so we want to invest our money wisely to maximize our performance-per-dollar ratio. I feel that providing just a table with the configurations I chose would not have much value, because technology has a very fast cycle and any list of hardware will be obsolete in a few months, in both value and performance. Consequently, I decided to share what I learned building these computers. At the end there is a TL;DR and two configurations to use as examples.

If you are planning to focus on GPU-accelerated calculations and your budget is tight, do not invest in a high-end CPU. Obviously it is nice to have a powerful CPU to speed up other day-to-day processes, but for GPU-accelerated calculations, especially training a convolutional neural network, the muscle of your GPU matters much more than the number of cores of your CPU. However, if you plan to build a multi-GPU machine, get a CPU with at least one physical core per GPU.

GPU-wise, if money is, again, an important factor in the final decision and you plan to do mostly deep learning, try to get the model with the largest amount of memory you can afford. GPU memory will be a limiting factor for the size of the network and of the training batches. In terms of performance, multiple GPUs are more efficient for running different parameters or different projects simultaneously than for parallelizing the same job using MPI. However, multiple GPUs may be necessary if your network and mini-batches are too big for the memory of a single GPU; in that case you will need to split the job between two or more.

Currently, Nvidia is the de facto standard; they were one of the first to enter the scientific market, and CUDA, Nvidia's programming platform, has more momentum than OpenCL, the open-source alternative. Although the Radeon family is a good alternative in terms of performance, its support is, at this moment, very poor, and I cannot recommend buying an AMD GPU. Let's see what happens in the next few years, when other players, like Intel, may present alternatives to Nvidia.

Personally, I prefer to have as much RAM as I can afford, but if the budget is a problem, try to have at least the same amount as the GPU. While the speed of the RAM is not important for GPU-accelerated calculations, an important bottleneck is the communication between CPU RAM and GPU RAM, so get a motherboard and components (e.g. GPU) that support the latest PCI Express version.

A bonus, but not critical, is an SSD. It can speed up, for instance, the reading of the mini-batches during training; however, if your budget is tight, you can get a regular disk.

Finally, check the power consumption of your system. GPUs are hungry beasts in terms of electricity, especially high-end models; a Titan Xp can consume almost 250 W at full load. The rule of thumb is to get a PSU capable of providing 20% more than the theoretical usage of your system at full load.
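The rule of thumb works out like this for a multi-GPU node; the wattages below are rough illustrative estimates, not measured values:

```python
# Rough PSU sizing: sum the peak draw of the parts, then add 20% headroom.
components_watts = {
    "3x Titan Xp": 3 * 250,            # ~250 W each at full load
    "CPU": 95,                         # typical quad-core desktop TDP
    "motherboard + RAM + disks": 75,   # rough allowance for the rest
}

peak = sum(components_watts.values())
recommended_psu = peak * 1.2           # 20% headroom over peak load
print(f"peak ~{peak} W, PSU >= {recommended_psu:.0f} W")
```

For this sketch the peak is about 920 W, which is why a node like the one listed below ends up with a 1200 W PSU.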

One final tip: the PCPartPicker website is quite useful for building computers by parts and keeping track of compatibilities and power requirements.


  • Get the GPU or GPUs with the highest amount of memory.
  • Do not invest in a high-end CPU or super-fast memory.
  • Get one CPU core per GPU.
  • Get a modern motherboard with a fast PCIe bus.
  • Try to get an SSD, but there is life without it.
  • 16 GB of RAM is enough, but if you can squeeze the budget and afford a little more, do not think twice; you will not regret it.
  • Keep an eye on the power requirements of the system and get a good PSU.


Best budget machine, late-2017

  • Nvidia GTX 1060 6 GB. Best value, with enough memory to participate in almost all Kaggle challenges.
  • AMD Ryzen CPUs have the best value; however, they are not well supported by kernel versions earlier than 4.10. A few tips about it here.
  • Any gaming motherboard that supports PCIe 3.0, e.g. MSI MORTAR.
  • 16 GB DDR4 2100
  • 500 GB SSD
  • 450 W PSU

GPU cluster Node, late-2017

  • 3x Nvidia Titan Xp. Multiple GPUs to train bigger networks, or to run different models/parameters.
  • Intel Core i7-7700K 4.2 GHz quad-core.
  • Asus STRIX Z270-E GAMING ATX LGA1151. The motherboard supports up to 4 GPUs.
  • 64 GB DDR4 2100
  • 1 TB SSD
  • 1200 W PSU

Feature Engineering

Lately, I had to work with a very limited dataset, both in size and in number of features. And to add some extra challenge, some labels were likely to be wrong. So, to do my best to build a classification model, I had to make a serious effort, particularly on the dataset, because no matter how good your model is, it will not help if the data sucks. On the upside, I learned a few new tricks, so I decided to write a quick post summarizing the different strategies we have for feature engineering. Basically, this means transforming and cleaning the current features, combining them to create new ones, or even revisiting the dataset to try to recycle discarded information.

Variable Transformation

This refers to applying a logarithm or a square, or to transforming a continuous variable into categories (a.k.a. binning). In my project, what worked best was the log transformation, which greatly improved the performance of the model on a subset of data with very few points to train on. This improvement is mainly because the distribution of the data in this particular subset was highly skewed. The square, on the other hand, can be applied to transform negative values into positive ones, and the cube root to do the opposite. Binning can be a good decision on very particular occasions, but I guess it depends a lot on your understanding of the data.
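As an illustration of the log transform on a skewed feature (the data here is synthetic, not from my project, and the skewness helper is a plain third standardized moment):

```python
import numpy as np

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # long right tail
transformed = np.log1p(skewed)                          # log1p also handles zeros

def sample_skewness(a):
    """Third standardized moment, as a rough skewness measure."""
    a = np.asarray(a, dtype=float)
    return float(np.mean(((a - a.mean()) / a.std()) ** 3))
```

The raw feature has a strong positive skew, while the log-transformed version is far closer to symmetric, which is usually easier for a model to exploit.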

Variable Creation

Next, we can create new features based on the existing ones, with the aim of unlocking latent information. For example, decompose features into smaller elements: the classic example is to split a full date into day, month, year, week, weekend, holiday, etc. Also, convert categorical labels into variables using one-hot encoding (a.k.a. creating dummy variables). Or do the complete opposite and aggregate different features into a single, more useful one. In my case, I had four measurements taken under different conditions; I fitted a linear equation to the observed data, a linear regression, and used the slope of the result as a feature.
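The first two ideas, date decomposition and one-hot encoding, can be sketched in pandas; the column names and values here are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2017-12-07", "2017-12-09"]),
    "colour": ["red", "blue"],
})

# decompose the date into potentially informative pieces
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.weekday      # Monday = 0 ... Sunday = 6
df["is_weekend"] = df["weekday"] >= 5

# one-hot encode the categorical feature (a.k.a. creating dummy variables)
df = pd.get_dummies(df, columns=["colour"])
```

Each derived column exposes a pattern (seasonality, weekday effects, category membership) that a model could not easily extract from the raw date string or label.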

Features Housekeeping

Many times less is more, so in my case I focused on eliminating outliers. As I mentioned at the beginning, I could not be sure that my entire training set was accurately labeled. Consequently, using clustering, I decided to eliminate those data points far from the centroid of their class. It is also advisable to keep track of the weight and nature of your features. Something as simple as plotting each feature against the label and measuring whether there is a correlation, or checking the weight of the feature in the model, are good habits. For instance, if you are working on a model to predict the performance of students in your city and the most important feature is the weather, ahead of features such as the hours of study, then something smells funny. In addition, a noisy feature can damage the accuracy of your model, especially if you are using an SVM.
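The centroid-distance filter can be sketched as follows; the function, the 2-standard-deviation threshold, and the data are all illustrative choices, not the exact procedure from my project:

```python
import numpy as np

def drop_far_points(X, y, n_std=2.0):
    """Keep only points within n_std standard deviations of the mean
    distance to their own class centroid."""
    keep = np.zeros(len(X), dtype=bool)
    for label in np.unique(y):
        mask = y == label
        centroid = X[mask].mean(axis=0)                  # class centre
        d = np.linalg.norm(X[mask] - centroid, axis=1)   # distance per point
        keep[mask] = d <= d.mean() + n_std * d.std()     # drop the far tail
    return X[keep], y[keep]

# a tight cluster of 20 points plus one point far away
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), [[10.0, 10.0]]])
y = np.zeros(21, dtype=int)
X_clean, y_clean = drop_far_points(X, y)
```

Points that sit far from every cluster of their own class are the most likely to be mislabeled or noisy, which is why this simple filter helps when you cannot trust the labels.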

Revisit your dataset for discarded data

Often, information in the dataset is discarded because it is considered not relevant; however, with some transformation it can be informative. A good example is the names of the passengers in the famous Titanic disaster dataset. The first instinct is to remove them, but a name can contain important information: nobiliary or medical doctor titles may reflect the chances of survival of the passenger in the sinking. Similarly, data with missing values is usually removed from the dataset. However, in my case, the data was so precious that I decided to use it. One approach to filling the gaps in your data is to simply write in the average value of the group; a more sophisticated one is to cluster the dataset using the other variables and fill the gap with the average of the cluster.
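Both imputation strategies can be sketched in pandas; here a "group" column stands in for the cluster assignment, and the values are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, np.nan, 10.0, 12.0, np.nan],
})

# naive: fill every gap with the global mean
global_fill = df["value"].fillna(df["value"].mean())

# better: fill each gap with the mean of its own group/cluster
group_fill = df.groupby("group")["value"].transform(lambda s: s.fillna(s.mean()))
```

The global mean here is 6.5, a value that belongs to neither group, while the group-wise fill produces 2.0 for group "a" and 11.0 for group "b", values consistent with their neighbours.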



Is Skynet coming?

I cannot recommend strongly enough this Minute Physics video about Artificial Intelligence (AI); in less than 5 minutes it covers the most important concerns about AI and what researchers are doing to overcome them. The final conclusion is not new: technology is not scary or dangerous; the problem is how some humans use it. On the same topic, if you have more curiosity and time, I also recommend a series of posts written by Tim Urban some time ago: Wait But Why on AI.


The Demiurge

We live surrounded by smart machines; indeed, there are several algorithms that mimic a mind and provide a machine with intelligence. This magic software is applied in many fields, from finance to medicine. In fact, those algorithms are not new; the foundations of several of them were established in the early 60s. However, they are far from being a thinking mind, and in fact each one has its own weaknesses that prevent it from completely mimicking a human mind. So, the magic formula has not been written yet. Pedro Domingos talks about all this in his book "The Master Algorithm". I strongly recommend the following video, where the author gives a brilliant summary of the state of the art in the field and introduces the master algorithm, the code capable of conceiving a real mind. No special knowledge is needed to follow the talk, and it might motivate you to read the book.

Spurious Correlations

Charting two variables to measure the degree of correlation between them is probably the most used statistical tool. And like any other analysis, we must be very careful drawing conclusions, because it does not always reflect reality. Two variables can be strongly correlated in many ways. For example, the number of libraries in a city is strongly related to the absolute number of crimes. However, that does not mean that libraries encourage crime. This example is very clear, and it is surely easy to find more of this kind, but I would like to mention spurious correlations. These occur when two variables with no logical connection have a strong correlation coefficient; for instance, the annual number of computer science PhDs in the United States and the annual revenue of American arcades. On Tyler Vigen's website you can find more of those correlations, and if you are really interested, you can buy his book on the subject, Spurious Correlations.
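The mechanism behind many of these coincidences is a shared trend: two series that both grow over time will correlate strongly even with no causal link. The numbers below are synthetic stand-ins for the PhD/arcade example, not Vigen's actual data:

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(2000, 2010)

# two unrelated quantities, both trending upward over the decade
phds = 500 + 30 * (years - 2000) + rng.normal(0, 10, len(years))
arcade_revenue = 1.0e9 + 5.0e7 * (years - 2000) + rng.normal(0, 2e7, len(years))

# Pearson correlation coefficient between the two series
r = np.corrcoef(phds, arcade_revenue)[0, 1]
```

The coefficient comes out close to 1 purely because time drives both series, which is exactly why a high r on its own proves nothing about causation.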