A bad scenario: a noisy and imbalanced dataset

Although feature engineering might remove, or at least minimize, the impact of noise on the accuracy of your models, it is often more difficult than that. As I mentioned before, a complete and exhaustive data exploration is a must. In fact, in the worst-case scenario, less but better data will yield more robust models than bigger but noisy datasets. The perfect storm, however, is when the dataset is both noisy and imbalanced.

Obviously, the first thing to do is deal with the noise. Noise can appear at the feature level or in the labeling. For instance, training a classification model on examples with noisy features could artificially create clusters of one class inside regions of another, altering the decision boundaries and leading to an erroneous model. The same can happen with datasets containing contradictory labels or misclassifications. On the other hand, noise can come from the measurement tools or sensors, corrupting the features. How noise affects models and how to deal with it is an important topic, and many papers and books have been written about it (Class Noise vs. Attribute Noise: A Quantitative Study; Mining with noise knowledge: Error-aware data mining; etc.). In addition, running simulations under different scenarios is helpful for deciding what can perform best. What I found useful was applying filters to reduce noisy data in my training set.

Once we have reduced the amount of noisy data, we can tackle the imbalance problem. One common challenge in classification problems is the lack of examples of one class, and many times it is the most interesting class. An imbalanced dataset might not be particularly problematic if you have a very heterogeneous set of training points and the decision boundaries are clear... and unicorns and rainbows... However, real-world data has a biased distribution, or the training set is especially poor for items close to one or more decision boundaries, and that can lead to a model with poor performance. Again, this is an old friend, and a lot of literature has been written about it. In summary, there are three approaches:

  • Balance the training set. You can either oversample the minority class, undersample the majority class, or synthesize new elements of the minority class.
  • Tune or modify the algorithm, adding weights or adjusting the decision threshold using soft clustering.
  • Rethink the problem. Try to get more data, or switch to an anomaly detection approach.

I particularly like synthesizing with the Synthetic Minority Oversampling TEchnique (SMOTE), in particular SMOTE-IPF. The method is based on generating new points using a sort of interpolation. The algorithm randomly picks an element of the minority class close to the decision boundary and finds its nearest neighbors. This group of data points is then used to create a new element of the class by bootstrapping (other approaches can be applied, though).
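The interpolation at the heart of SMOTE can be sketched in a few lines. This is a toy illustration in plain Python, not the full SMOTE-IPF algorithm; the neighbour count and the random point selection here are simplified assumptions:

```python
import random
import math

def synthesize(minority, k=3, seed=0):
    """Create one synthetic minority point by interpolating
    between a random sample and one of its k nearest neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # sort the remaining minority points by Euclidean distance to the base point
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    # the new point lies somewhere on the segment between base and neighbour
    t = rng.random()
    return tuple(b + t * (n - b) for b, n in zip(base, neighbour))

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_point = synthesize(minority)
```

Because the synthetic point is a convex combination of two real minority points, it always lands inside the region the minority class already occupies.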


There are many different approaches, and finding a perfect solution is difficult. What works for one dataset may be a terrible solution for another. As always, it is important to have a good, deep knowledge of the training set and the nature of the potential noise.



Simple guide to building a machine for Deep Learning or GPU-accelerated calculations

We have a small cluster in the lab, around 20 nodes running under a queue system. It is an ideal system for analysis and small jobs, while big jobs, the final step of our projects, run on one of the massive clusters of Compute Canada. Recently, I extended our lab cluster with a few GPU machines. This extension covers the gap between prototyping and production for all those jobs accelerated by GPUs. However, GPUs are very expensive hardware, so we want to invest our money wisely to maximize our performance-per-dollar ratio. I feel that providing just a table with the configurations I chose would not have much value, because technology has a very fast cycle and any list of hardware is going to be obsolete in a few months, in both value and performance. Consequently, I decided to share what I learned building these computers. At the end there is a TL;DR and two configurations to use as examples.

If you are planning to focus on GPU-accelerated calculations and your budget is tight, do not invest in a high-end CPU. Obviously it is nice to have a powerful CPU to speed up other day-to-day processes, but for GPU-accelerated calculations, especially training a convolutional neural network, the muscle of your GPU matters much more than the number of cores of your CPU. However, if you plan to build a multi-GPU machine, get a CPU with at least one physical core per GPU.

GPU-wise, if money is, again, an important factor in the final decision and you plan to do mostly deep learning, then try to get the model with the maximum amount of memory you can afford. GPU memory will be a limiting factor in the size of your networks and training batches. In terms of performance, multiple GPUs are more efficient for running different parameters or different projects simultaneously than for parallelizing the same job using MPI. However, multiple GPUs may be necessary if your network and mini-batches are too big for the memory of a single GPU; then you will need to split the job between two or more.

Currently, Nvidia is the de facto standard; they were one of the first to get into the scientific market, and CUDA – Nvidia's programming libraries – has more momentum than OpenCL, the open-source alternative. Although the RADEON family is a good alternative in terms of performance, its support is, at this moment, very poor, and I cannot recommend buying an AMD GPU. Let's see what happens in the next few years, when other players, like Intel, may present alternatives to NVIDIA.

Personally, I prefer to have as much memory as I can afford, but if the budget is a problem, try to have at least the same amount of RAM as GPU memory. While the speed of the RAM is not important for GPU-accelerated calculations, an important bottleneck is the communication between CPU-RAM and GPU-RAM; consequently, get a motherboard and components (e.g. GPU) that support the latest PCI Express version.

A bonus, but not critical, is an SSD. It can speed up, for instance, the reading of mini-batches during training; however, if your budget is tight, you can get a regular disk.

Finally, check the power consumption of your system. GPUs are hungry beasts in terms of electricity, especially high-end models; a Titan Xp can consume almost 250W at full load. The rule of thumb is to get a PSU capable of providing 20% more than the theoretical usage of your system at full load.
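The rule of thumb is easy to apply: sum the rated full-load draw of every component and add 20%. The wattages below are ballpark figures for a 3-GPU node, not exact specs:

```python
# approximate full-load draw in watts (illustrative values, not measured)
components = {
    "3x Titan Xp": 3 * 250,
    "CPU (quad-core i7)": 91,
    "motherboard + RAM + disks": 75,
}

theoretical_load = sum(components.values())
recommended_psu = round(theoretical_load * 1.2)  # 20% headroom

print(theoretical_load, recommended_psu)  # 916 1099
```

Which is how a 3-GPU node ends up needing a PSU in the 1100-1200W range.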

Final tip: the PCPartPicker website is quite useful for building computers by parts and keeping track of compatibilities and power requirements.


  • Get the GPU or GPUs with the highest amount of memory.
  • Do not invest in a high-end CPU or super-fast memory.
  • Get one CPU core per GPU.
  • Get a modern motherboard with a fast PCI bus.
  • Try to get an SSD, but there is life without it.
  • 16GB of RAM is enough, but if you can squeeze the budget and afford a little more, do not think twice; you will not regret it.
  • Keep an eye on the power requirements of the system and get a good PSU.


Best budget machine, late-2017

  • 1060 6GB: best value, with enough memory to participate in almost all Kaggle challenges
  • AMD Ryzen CPUs have the best value; however, they are not well supported by kernel versions earlier than 4.10. A few tips about it here
  • Any gamer motherboard that supports PCIe 3.0, e.g. MSI MORTAR
  • 16GB DDR4 2100MHz
  • 500GB SSD
  • 450W PSU

GPU cluster Node, late-2017

  • 3x Nvidia Titan Xp: multiple GPUs to train bigger networks, or to run different models/parameters
  • Intel Core i7-7700K 4.2GHz quad-core
  • Asus STRIX Z270-E GAMING ATX LGA1151: this motherboard supports up to 4 GPUs
  • 64GB DDR4 2100MHz
  • 1TB SSD
  • 1200W PSU

Feature Engineering

Lately, I had to work with a very limited dataset, both in size and number of features. And to add some extra challenge, some labels were likely to be wrong. So, in order to build the best classification model I could, I had to make a serious effort, particularly on the dataset, because no matter how good your model is, if the data sucks the model will too. On the upside, I learned a few new tricks, so I decided to write a quick post summarizing the different strategies we have for feature engineering. Basically, this means transforming and cleaning the current features, combining them to create new ones, or even revisiting the dataset and trying to recycle discarded information.

Variable Transformation

This refers to applying a logarithm or square transform, or converting a continuous variable into categories (a.k.a. binning). In my project, what worked best was the log transformation, which greatly improved the performance of the model over a subset of data with very few points to train on. This improvement is mainly because the distribution of data in that particular subset was highly skewed. The square, on the other hand, can be applied to transform negative values into positive ones, and the cube root to do the opposite. Binning can be a good decision on very particular occasions, but I guess it depends a lot on the understanding of the data.
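As a quick sketch of the log transform on a skewed feature, with made-up values (log1p is used so a zero would not break it):

```python
import math

# a highly skewed feature (hypothetical values spanning 3 orders of magnitude)
values = [1, 2, 3, 5, 10, 100, 1000]

transformed = [math.log1p(v) for v in values]

# the transform is monotonic (order is preserved) but the long tail
# is compressed from a 1000x spread to roughly a 10x spread
print(max(values) / min(values))            # 1000.0
print(max(transformed) / min(transformed))  # ~9.97
```

That compression is exactly why a skew-heavy subset becomes much easier for a model to learn from after the transform.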

Variable Creation

Next, we can create new features based on the existing ones, with the aim of unlocking latent information. For example, decompose features into smaller elements. The classic example is splitting a full date into day, month, year, week, weekend, holiday, etc. Also, convert categorical labels into variables using one-hot encoding (a.k.a. creating dummy variables). Or the complete opposite: aggregate different features into one more useful feature. In my case, I had four measurements taken at different conditions, so I decided to fit a linear equation to the observed data (linear regression) and use the slope of the result as a feature.
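Both tricks, date decomposition and one-hot encoding, fit in a few lines of plain Python; the feature names here are my own choices for illustration:

```python
from datetime import date

# decompose a full date into smaller features
d = date(2017, 12, 7)
date_features = {
    "year": d.year,
    "month": d.month,
    "day": d.day,
    "weekday": d.weekday(),          # 0 = Monday
    "is_weekend": d.weekday() >= 5,
}

# one-hot encode a categorical label over a fixed category list
categories = ["red", "green", "blue"]
def one_hot(label):
    return [1 if c == label else 0 for c in categories]

print(date_features["weekday"], one_hot("green"))  # 3 [0, 1, 0]
```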

Features Housekeeping

Many times less is more, so in my case I focused on eliminating outliers. As I mentioned at the beginning, I could not be sure that my entire training set was accurately labeled. Consequently, using clustering, I decided to eliminate those data points far from the centroid of their class. It is also advisable to keep track of the weight and nature of your features. Something as simple as plotting each feature against the label and measuring whether there is a correlation, or checking the weight of the feature in the model, are good habits. For instance, if you are working on a model to predict the performance of students in your city and the most important feature is the weather, ahead of features such as hours of study, then something smells funny. In addition, a noisy feature can damage the accuracy of your model, especially if you are using an SVM.
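The centroid-distance filter can be sketched like this; the 1.5-standard-deviation cutoff is an arbitrary choice for illustration, not a rule:

```python
import math
import statistics

def filter_outliers(points, n_std=1.5):
    """Drop points farther from the class centroid than
    mean distance + n_std standard deviations."""
    dims = len(points[0])
    centroid = tuple(sum(p[i] for p in points) / len(points) for i in range(dims))
    dists = [math.dist(p, centroid) for p in points]
    cutoff = statistics.mean(dists) + n_std * statistics.pstdev(dists)
    return [p for p, d in zip(points, dists) if d <= cutoff]

# four well-behaved points of one class plus one likely mislabeled point
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (8.0, 8.0)]
kept = filter_outliers(points)
```

In a real setting you would compute one centroid per class (or per cluster) and filter each class separately.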

Revisit your dataset for discarded data

Often, information in the dataset is discarded because it is considered not relevant; however, with some transformation it can be informative. A good example is the names of the passengers in the famous Titanic disaster dataset. The first instinct is to remove them, but a name can contain important information: nobiliary or medical doctor titles may reflect the chances of survival of the passenger in the sinking. On the other hand, data with missing values is usually removed from the dataset. However, in my case, the data was so precious that I decided to use it. One approach to filling the gaps in your data is to simply write in the average value of the group; another, more sophisticated, is to cluster the dataset using the other variables and fill the gap using the average of the cluster.
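A minimal sketch of the group-average imputation, with the clustering step replaced by an existing group column for simplicity:

```python
from collections import defaultdict

# rows: (group, value); None marks a missing value (toy data)
rows = [("a", 10.0), ("a", 14.0), ("a", None), ("b", 100.0), ("b", None)]

# compute the mean of the observed values per group
observed = defaultdict(list)
for group, value in rows:
    if value is not None:
        observed[group].append(value)
group_mean = {g: sum(v) / len(v) for g, v in observed.items()}

# fill each gap with its group's mean
filled = [(g, v if v is not None else group_mean[g]) for g, v in rows]

print(filled)  # [('a', 10.0), ('a', 14.0), ('a', 12.0), ('b', 100.0), ('b', 100.0)]
```

Replacing the group column with cluster assignments from the other variables gives the more sophisticated variant described above.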



Can a computer have imagination?

I needed to get familiar with Generative Adversarial Networks (GANs) for a crazy idea I had at work, so, in order to get a first sense of how they work and how to implement them, I decided to do a small toy project. But first, what am I talking about?

A GAN is a variant of neural networks for unsupervised learning. This method, introduced by Ian Goodfellow, is based on fusing a generative model with a discriminative model. The former tries to generate artificial elements with the ultimate goal of fooling a discriminator; this competition during the training process allows the generator to figure out the best way to generate artificial items, like photorealistic images, from a random seed of numbers.

To accomplish this in an efficient way, both networks are trained simultaneously. The discriminator evaluates a set of real and generated images, labelling them as fake or real. The goal of the discriminator, i.e. its loss function, is to minimize the probability of labelling a fake image as real. Next, the generator is trained using the discriminator as feedback, aiming to maximize the probability of fooling the discriminator. Technically, each neural network tries to minimize its loss by adjusting its weights through back-propagation of the gradients. The trick here is that inside the loss function of each model, the opposite model is embedded.
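The two objectives can be written out directly. Here is a plain-Python sketch of the binary cross-entropy losses; `d_real` and `d_fake` are stand-in discriminator outputs, not the product of a real network:

```python
import math

def bce(label, prob, eps=1e-12):
    """Binary cross-entropy for a single prediction."""
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))

# stand-in discriminator outputs: probability that an image is real
d_real = 0.9   # discriminator's score on a real image
d_fake = 0.2   # discriminator's score on a generated image

# discriminator objective: real images should score 1, fakes should score 0
d_loss = bce(1, d_real) + bce(0, d_fake)

# generator objective: it wants that SAME fake score pushed toward 1
g_loss = bce(1, d_fake)

print(d_loss, g_loss)
```

Note how `d_fake` appears in both losses with opposite targets: that shared term is how each model ends up embedded in the other's loss function.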

Commonly, the discriminator is a convolutional neural network, encoding an image into a small vector, also referred to as a latent vector. On the other side, the generator is a deconvolutional neural network, where a random vector acts as a latent set of variables and is sequentially upsampled to generate a full image. However, other types of nets, such as recurrent neural networks, have also been applied.

I chose the face-generation example because there are multiple ready-to-use datasets on the internet, which means clean and normalized data. Purity in the training dataset, although important in general for any machine learning model, is especially critical for GANs; these little nets are well known to be hard to train.

I downloaded the Labeled Faces in the Wild (LFW) dataset. Although it is not a super big dataset, all the faces in LFW are already aligned and it has low redundancy; consequently, that should reduce the training effort and the complexity of our network. To reduce the model even more, I decided to transform all the images to grayscale.

Also, to quickly develop the whole thing, I decided to use Keras and to start by building a small network, because I wanted something easy to handle and debug. Below you can see the evolution of the generator through one thousand epochs of training.

At 1025 epochs our model reached its best performance. A GAN should be trained until it reaches an equilibrium; in this case, when no matter what, the generator is not able to reduce its loss. Or, alternatively, the best scenario: when the discriminator classifies the generated images almost randomly because it is totally incapable of discriminating between generated and real.

However, sometimes one of the models figures out how to outperform the other during training, and at that point the feedback stops and the whole thing just will not work. Actually, this can happen often, and, more infuriatingly, it can be just one feature responsible for this misbehavior.

Another challenge is finding the right mini-batch size, learning rates, network architecture and training procedure. I experienced many weight explosions during my first attempts. Fortunately, I finally got something that works, or at least moves in the right direction. As you can see in the video, the first images generated are basically noise, but with each epoch the generative network gets better and better. The generator "learned" how to convert this random noise into faces, or at least transform it into spots with hair, eye and mouth shapes. At some point, the model plateaued and seemed unable to generate anything better. Adding more epochs did not help, and finally broke it.

What did I learn?

Probably, to improve the performance of this model, I should increase the complexity of the network to capture more features. Also, I should add some regularization elements to avoid the explosion of my gradients; this would also allow me to increase the number of epochs. And finally, a bigger dataset, for instance CelebA, would help to get better results.

Although the architecture is key to training efficiently and capturing features, this method is highly flexible. Without any modification I trained the same model, this time using the MNIST dataset, and I got nice handwritten digits. However, I believe that a different architecture would reduce the number of epochs needed to get similar or better results.


On the other side, while Keras is amazing for easily deploying a NN model with very little knowledge, it is at the same time a terrible teacher. There are too many things happening under the hood, so for learning purposes I would probably try to focus on pure TensorFlow and, as soon as I get more experience, move back to Keras. You can find my code on my GitHub.

Finally, do not try this if you do not have a GPU; the difference is huge.

PS: About the title: obviously there is a long way from this to human imagination, but you have to admit, this is pretty cool!

Is Skynet coming?

I cannot recommend strongly enough this Minute Physics video about Artificial Intelligence (AI); in less than 5 minutes it covers the most important concerns about AI and what researchers are doing to overcome them. The final conclusion is not new: technology is not scary or dangerous; the problem is how some humans use it. On the same topic, if you have more curiosity and time, I also recommend a series of posts written by Tim Urban some time ago on Wait But Why: AI.


Cooking classes for computers (I)

A year ago, this dataset of recipes was posted on Kaggle. The challenge was to generate a model to predict the cuisine type of a recipe. The dataset was provided by Yummly. Before seeing how well different approaches perform at predicting the cuisine, I had the idea of applying distributed word vectors (word2vec) to this dataset.

Word2vec, at a very high level, is an algorithm capable of learning the relationships between words using the context (neighbouring words) in which they appear in sentences. After training, the model produces a vector for each word which encodes its characteristics. Using these vectors, we can cluster the words in our vocabulary, or even do operations on them. The classic example of the latter is: "king – man + woman = queen."

Word2vec uses a shallow neural network to learn, and it usually works better with huge datasets (billions of words), but we will see how it performs with the cooking dataset, where each recipe will be a sentence. One of the best features of this algorithm, published by Google, is its speed: earlier neural language models were insanely CPU-time consuming. If you want more detailed information about this, I strongly suggest you read about it here, here and here. Also, if you want to check in detail what I have done, please visit the notebook for this post on my GitHub.

First of all, although this dataset is probably quite clean, doing a general exploration of our data is a really good habit, no matter what kind of model we will apply to it. Plotting a few frequencies and means can provide valuable information of all sorts about potential problems: bias, typos, etc.

The training dataset contains almost 40K recipes, of 20 different kinds of cuisine, and around 6K ingredients. Let's see how many recipes there are per cuisine type. As we can see in the first plot, Italian and Mexican recipes represent more than a third of the entire dataset, so it is probable that this will affect how our vectors form. It is good to keep this in mind for this or any further model we apply to this dataset. For now, we will ignore this bias and check next whether there is a bias in the size of the recipes.


Well, in terms of size, all the recipes appear to be more similar. Then, let's focus on the ingredients. As I mentioned above, I guess this dataset is really clean, or at least cleaner than a real-world dataset, and I do not expect to need any pre-processing. Also, full disclaimer: I did not check any of the models submitted to Kaggle, and my intention is to build the word2vec model as a proof of concept.


Finally, we will take a look at the ingredients' format and frequency. If we look at the percentiles, we see that half of the ingredients appear only 4 times in the entire dataset, while 1% of them appear more than a thousand times. That makes sense: some ingredients, such as salt or water, are common in many recipes, while others can be very specific to a cuisine type; something important to keep in mind for further analysis. However, a few ingredients have been counted as unique when in reality they are just variants of others. Then again, better to ask the dataset a few questions. Which ones are the most frequent ingredients?

[(u'salt', 18049),
 (u'olive oil', 7972),
 (u'onions', 7972),
 (u'water', 7457),
 (u'garlic', 7380),
 (u'sugar', 6434),
 (u'garlic cloves', 6237),
 (u'butter', 4848),
 (u'ground black pepper', 4785),
 (u'all-purpose flour', 4632),
 (u'pepper', 4438),
 (u'vegetable oil', 4385),
 (u'eggs', 3388),
 (u'soy sauce', 3296),
 (u'kosher salt', 3113)]
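Counts like the ones above come from flattening all recipes into a single list of ingredient occurrences; with a toy subset, `collections.Counter` reproduces the idea:

```python
from collections import Counter

# toy recipes: each recipe is a list of ingredients (made-up subset)
recipes = [
    ["salt", "olive oil", "garlic", "tomatoes"],
    ["salt", "water", "sugar"],
    ["salt", "olive oil", "onions"],
]

# flatten the recipes and count every ingredient occurrence
counts = Counter(ing for recipe in recipes for ing in recipe)
print(counts.most_common(3))  # [('salt', 3), ('olive oil', 2), ...]
```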

A few of those ingredients make a lot of sense as highly frequent ones, but the presence of olive oil among these omnipresent ingredients makes me think it is an artefact of the dataset's bias toward Italian cooking. On the other hand, which ones are the least frequent ingredients?

[(u'whole wheat seasoned breadcrumbs', 1),
 (u'Foster Farms boneless skinless chicken breasts', 1),
 (u'Doritos Tortilla Chips', 1),
 (u'smoked turkey drumstick', 1),
 (u'Wholesome Sweeteners Organic Sugar', 1),
 (u'stem ginger', 1),
 (u'farfalline', 1),
 (u'lipton green tea bag', 1),
 (u'plain soy yogurt', 1),
 (u'meat-filled tortellini', 1),
 (u'cold-smoked salmon', 1),
 (u'ranch-style seasoning', 1),
 (u'lotus leaves', 1),
 (u'white quinoa', 1),
 (u'high gluten bread flour', 1),
 (u'blueberry pie filling', 1),
 (u'Pillsbury Thin Pizza Crust', 1),
 (u'Greek black olives', 1),
 (u'Amarena cherries', 1),
 (u'black radish', 1),
 (u'candied jalapeno', 1),
 (u'arame', 1),
 (u'chioggia', 1),
 (u'low sodium canned chicken broth', 1),
 (u'cinnamon ice cream', 1)]

Indeed, there are some very specific ingredients among those. However, a deeper search in the dataset starts to bring to light a few of the writing variants.

[(u'garlic', 7380),
 (u'garlic cloves', 6237),
 (u'garlic powder', 1442),
 (u'garlic paste', 282),
 (u'garlic salt', 240),
 (u'garlic chili sauce', 130),
 (u'garlic chives', 25),
 (u'garlic puree', 16),
 (u'garlic bulb', 14),
 (u'garlic sauce', 11),
 (u'garlic oil', 9),
 (u'garlic pepper seasoning', 8),
 (u'garlic herb feta', 4),
 (u'garlic shoots', 4),
 (u'garlic and herb seasoning', 3)]

Also notice that in the dataset the same ingredient can appear in different formats: garlic, and garlic cloves. Well, here we need to decide whether we will consider those as different ingredients; this can probably change in a significant way how the model works, but for now let's train the neural network with an almost raw version of the dataset and see what comes out.

Once the model is trained, we can evaluate it by asking it a few questions, for example: what is similar to feta cheese?

[(u'kalamata', 0.9521325826644897),
 (u'pitted kalamata olives', 0.9163238406181335),
 (u'fresh oregano', 0.9144715666770935),
 (u'roasted red peppers', 0.8977206945419312),
 (u'grape tomatoes', 0.8959800004959106),
 (u'olives', 0.895972728729248),
 (u'pita bread rounds', 0.8829742670059204),
 (u'plum tomatoes', 0.8803691267967224),
 (u'goat cheese', 0.8792314529418945),
 (u'yellow tomato', 0.8785962462425232)]

It looks like all the ingredients belong to Greek cuisine; they are even foods you would expect to find alongside feta cheese. So, although the dataset is small for this algorithm, the model has been able to capture some relationships. Next, we can try to do operations with the words, as I mentioned above, or even a pseudo-logic operation.


PASTA + MEAT – TOMATO SAUCE = White mushrooms



BACON is to CACAO as CHICKEN is to seed

BACON is to BROCCOLI  as CHICKEN is to chili flakes

  • How similar are these ingredients?




I set up the model to build a vector of 300 dimensions for each word, so plotting that directly in a scatter plot is unviable. However, there are techniques to reduce a high-dimensional dataset to two or three dimensions, such as Principal Component Analysis (PCA) or, more suitable for this example, t-Distributed Stochastic Neighbor Embedding. t-SNE is a nonlinear dimensionality reduction technique that is particularly well suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot.

I colored each ingredient by its popularity in a cuisine type; just for educational purposes, that should be fine. Actually, it worked better than I expected; there are some interesting clusters in the plot. For instance, Japanese and Chinese ingredients appear together, as do Spanish, Italian and Greek ingredients. In addition, Indian cuisine ingredients are clustered in a very clear and isolated group. And it looks like cuisines with strong influences from different cuisines, such as Mexican, Brazilian and Southern US, tend to be in the central area.


There is a lot of room to improve this implementation, but for educational purposes I think this quick and straightforward test is more than enough. Since word2vec is an unsupervised method, we can use the test recipes as well to improve the model, but this dataset is still not big enough to get really impressive results. As a rule of thumb, neural networks tend to work better with big training datasets and also with bigger networks (more layers). In the next post, let's try to predict the cuisine name using the ingredients list.


leftover paella photo -> http://www.freefoodphotos.com/imagelibrary/bread/slides/left_over_paella.html

Study of titles in Pubmed, I

After a quick look at the journals in Pubmed with the highest frequency of publication, I wanted to try something more complex with this dataset. Specifically, I compiled and grouped by year all the titles of the papers from 1946 to 2014. Then, I calculated the cosine similarity of all years against all years. This methodology is common for measuring the similarity between texts and, at a very high level, detects changes in the frequency of words. Below you can see a heat map, a representation of the resulting matrix of distances of each year against all the others. However, as you shall see, this analysis does not show any shocking pattern, only an expected chronological clustering.
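The year-against-year measure boils down to cosine similarity between word-frequency vectors; a minimal sketch with two invented "years" of titles:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-frequency vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

# two fabricated bags of title words, just to exercise the function
titles_1950 = "electronarcosis therapy study tuberculosis treatment study"
titles_2010 = "gene expression study cancer treatment mRNA-Seq study"

print(round(cosine_similarity(titles_1950, titles_2010), 3))
```

Shared workhorse words like "study" and "treatment" keep the similarity well above zero even when the era-specific vocabulary differs, which is exactly the overlap effect discussed below.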

The early years are very different from the rest; this is not due to the seniority of the papers, but to the presence of papers in multiple languages (for instance, at the beginning of the century the lingua franca of science was German instead of English). Also, until 1966 not all items were included, only a selection, which could bias the titles a bit. A few clusters appear in the heatmap: for example, one covering from the 50s to the early 70s, another comprising the mid 70s to the mid 80s, and finally a third from there until nowadays. To make an objective clustering I used the simple nearest-neighbour algorithm; as you can see below in the dendrogram, there are no surprises, but at least now we have groups defined more objectively.


In the dendrogram, besides the patterns observed in the heatmap, we can cluster the years at a second level of detail. However, in general, and as I expected, the groups follow the chronological order of publication. This is mainly due to the presence of new terms in recent years, for instance "mRNA-Seq", "RFLP" or "HIV", plus the abandonment and alteration of others, such as electronarcosis or RSFSR. On the other hand, if we compare the most frequent words between clusters, with a few exceptions we generally find a strong overlap between groups. This overlap corresponds to common descriptive terms in scientific, and in particular biomedical, language: treatment, study, patient, case, effect, therapy, illness. Interestingly, from the 90s on, the terms gene and protein burst into this list.


Finally, we can establish a correlation between the frequency of a disease and the interest in it. According to the WHO, the diseases with the highest mortality rates are strokes, heart disease (heart attacks and hypertension) and infectious diseases. However, if we look at the relative frequency of the word "cancer" compared to the sum of terms related to heart disease (i.e. hypertension, ischemia, heart, cardiovascular, etc.), infectious diseases (tuberculosis, malaria, HIV, etc.), diabetes and asthma, as you can see in the plot below, there is a clear prominence of cancer. Before the 80s the term was already important, but afterwards there is a growing trend. During the 80s some important discoveries were reported, some thanks to the large injection of funding during the previous decade. In addition, a significant increase in the rate of lung cancer (more details) triggered a social awareness of the disease and motivated an increase in interest and funding. However, it has been estimated that cancer (all kinds) causes a total of ~8 million deaths in the world, while cardiovascular diseases kill ~17 million.


Diabetes shows the opposite situation: it displays worrying facts, such as a growing mortality index and incidence rate, but its frequency in Pubmed has barely changed. Fortunately, in recent years we have witnessed an increase in public awareness of this disease, and there is a slight increase in the trend over the last decade.

Interestingly, in the former plot we can observe the absolute protagonism of infectious diseases before the 60s, but they lost the leading position due to the success of antibiotics and vaccines, which reduced the severity and incidence of infectious or communicable diseases.