A year ago, this dataset of recipes was posted on Kaggle. The challenge was to build a model that predicts the cuisine type of a recipe. The dataset was provided by Yummly. Before seeing how well different approaches perform at predicting the cuisine, I had the idea of applying distributed word vectors (word2vec) to this dataset.
Word2vec, at a very high level, is an algorithm that learns relationships between words from the context (neighbouring words) in which they appear in sentences. After training, it produces a vector for each word that encodes characteristics of that word. Using these vectors, we can cluster the words in our vocabulary, or even do arithmetic with them. The classic example of the latter is “king – man + woman = queen.”
Word2vec uses a shallow neural network to learn, and it usually works better with huge datasets (billions of words), but we will see how it performs on the cooking dataset, where each recipe will be a sentence. One of the best features of this algorithm, published by Google, is its speed: earlier neural language models had been proposed, but they were insanely CPU-intensive. If you want more detailed information about this, I strongly suggest you read about it here, here and here. Also, if you want to check in detail what I have done, please visit the notebook for this post on my GitHub.
First of all, although this dataset is probably quite clean, doing a general exploration of our data is a really good habit, no matter what kind of model we will apply to it. Plotting a few frequencies and means can provide valuable information of all sorts about potential problems, bias, typos, etc.
The training dataset contains almost 40K recipes, from 20 different kinds of cuisine, and around 6K ingredients. Let's see how many recipes there are per cuisine type. As we can see in the first plot, Italian and Mexican recipes represent more than a third of the entire dataset, so it is likely that this will affect how our vectors form. It is good to keep this in mind for this or any further model we apply to this dataset. For now, we will ignore this bias; next, let's check whether there is a bias in the size of the recipes.
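Counting recipes per cuisine takes only a couple of lines. This is a minimal sketch: the tiny literal list below stands in for the ~40K recipes loaded from the Kaggle train.json file, which stores a "cuisine" and an "ingredients" field per recipe.

```python
from collections import Counter

# Tiny stand-in for the contents of train.json; the real file holds
# almost 40K recipes, each with "cuisine" and "ingredients" fields.
recipes = [
    {"cuisine": "italian", "ingredients": ["olive oil", "garlic cloves", "plum tomatoes"]},
    {"cuisine": "mexican", "ingredients": ["vegetable oil", "corn tortillas", "salt"]},
    {"cuisine": "italian", "ingredients": ["all-purpose flour", "salt", "olive oil"]},
]

# Tally how many recipes each cuisine contributes.
cuisine_counts = Counter(r["cuisine"] for r in recipes)
print(cuisine_counts.most_common())
```

On the full dataset, the first entries of `most_common()` are the Italian and Mexican blocks mentioned above.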
Well, in terms of size, the recipes look much more similar. So, let's focus on the ingredients. As I mentioned above, I assume this dataset is really clean, or at least cleaner than a real-world dataset, and I do not expect to need any pre-processing. Also, full disclaimer: I did not check any of the models submitted to Kaggle, and my intention is to build the word2vec model as a proof of concept.
Finally, we will take a look at the ingredients' format and frequency. If we look at the percentiles, we see that half of the ingredients appear only 4 times in the entire dataset, while 1% of them appear more than a thousand times. That makes sense: some ingredients, such as salt or water, are common to many recipes, while others can be very specific to one cuisine type, something important to keep in mind for further analysis. However, a few ingredients have been counted as unique that in reality are just variants of others. So, again, better to ask the dataset a few questions. Which are the top 10 ingredients?
[(u'olive oil', 7972),
(u'garlic cloves', 6237),
(u'ground black pepper', 4785),
(u'all-purpose flour', 4632),
(u'vegetable oil', 4385),
(u'soy sauce', 3296),
(u'kosher salt', 3113)]
A few of these ingredients make a lot of sense as highly frequent ones, but the presence of olive oil among these omnipresent ingredients makes me think it is an artefact of the dataset's bias towards Italian cooking. On the other hand, which are the least frequent ingredients?
[(u'whole wheat seasoned breadcrumbs', 1),
(u'Foster Farms boneless skinless chicken breasts', 1),
(u'Doritos Tortilla Chips', 1),
(u'smoked turkey drumstick', 1),
(u'Wholesome Sweeteners Organic Sugar', 1),
(u'stem ginger', 1),
(u'lipton green tea bag', 1),
(u'plain soy yogurt', 1),
(u'meat-filled tortellini', 1),
(u'cold-smoked salmon', 1),
(u'ranch-style seasoning', 1),
(u'lotus leaves', 1),
(u'white quinoa', 1),
(u'high gluten bread flour', 1),
(u'blueberry pie filling', 1),
(u'Pillsbury Thin Pizza Crust', 1),
(u'Greek black olives', 1),
(u'Amarena cherries', 1),
(u'black radish', 1),
(u'candied jalapeno', 1),
(u'low sodium canned chicken broth', 1),
(u'cinnamon ice cream', 1)]
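The frequency tables above come from a single pass over the ingredient lists. A minimal sketch, again with a tiny literal sample standing in for the full recipe list:

```python
from collections import Counter
import statistics

# Stand-in sample; on the real dataset this list comes from train.json.
recipes = [
    {"ingredients": ["salt", "olive oil", "garlic cloves"]},
    {"ingredients": ["salt", "water", "stem ginger"]},
    {"ingredients": ["salt", "olive oil", "lotus leaves"]},
]

# Count how often each ingredient appears across all recipes.
freq = Counter(ing for r in recipes for ing in r["ingredients"])

print(freq.most_common(3))               # the "omnipresent" ingredients
print(statistics.median(freq.values()))  # half the ingredients appear this often or less
```

On the full dataset the same median is 4, which is the percentile figure quoted above.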
Indeed, there are some very specific ingredients among those. However, a deeper search of the dataset starts to bring to light a few of the spelling variants.
[(u'garlic cloves', 6237),
(u'garlic powder', 1442),
(u'garlic paste', 282),
(u'garlic salt', 240),
(u'garlic chili sauce', 130),
(u'garlic chives', 25),
(u'garlic puree', 16),
(u'garlic bulb', 14),
(u'garlic sauce', 11),
(u'garlic oil', 9),
(u'garlic pepper seasoning', 8),
(u'garlic herb feta', 4),
(u'garlic shoots', 4),
(u'garlic and herb seasoning', 3)]
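A crude way to surface these variants is to filter the ingredient frequency table by substring. A sketch, using a small literal dictionary (with a few counts from the lists above) in place of the full `Counter`:

```python
# Small stand-in for the full ingredient frequency table built earlier.
freq = {
    "garlic cloves": 6237,
    "garlic powder": 1442,
    "garlic salt": 240,
    "garlic chives": 25,
    "soy sauce": 3296,
}

# Keep every ingredient whose name mentions "garlic", most frequent first.
variants = sorted(
    ((name, n) for name, n in freq.items() if "garlic" in name),
    key=lambda pair: -pair[1],
)
print(variants)
```

Running the same filter over the real table produces the garlic list shown above.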
Also notice that in the dataset the same ingredient can appear in different formats: garlic, and garlic cloves. Here we need to decide whether we will consider those as different ingredients; this could actually change in a significant way how the model works, but for now let's train the neural network on an almost raw version of the dataset and see what comes out.
Once the model is trained, we can evaluate it by asking it a few questions, for example: what is similar to feta cheese?
(u'pitted kalamata olives', 0.9163238406181335),
(u'fresh oregano', 0.9144715666770935),
(u'roasted red peppers', 0.8977206945419312),
(u'grape tomatoes', 0.8959800004959106),
(u'pita bread rounds', 0.8829742670059204),
(u'plum tomatoes', 0.8803691267967224),
(u'goat cheese', 0.8792314529418945),
(u'yellow tomato', 0.8785962462425232)]
It looks like all the ingredients belong to Greek cuisine; they are even foods you would expect to find with feta cheese. So, although the dataset is small for this algorithm, the model has been able to capture some relationships. Next, we can try arithmetic with the words, as I mentioned above, or even a pseudo-logic operation.
PASTA + MEAT – TOMATO SAUCE = White mushrooms
CHILI + MEAT – TOMATO SAUCE = Peanut
BACON is to CACAO as CHICKEN is to seed
BACON is to BROCCOLI as CHICKEN is to chili flakes
- How similar are these ingredients?
BROCCOLI and BACON = 0.33
BROCCOLI and CARROTS = 0.67
BROCCOLI and MUSHROOMS = 0.81
I set up the model to build a vector of 300 dimensions for each word, so plotting that directly in a scatter plot is infeasible. However, there are techniques to reduce a high-dimensional dataset to two or three dimensions, such as Principal Component Analysis (PCA) or, more suitable for this example, t-Distributed Stochastic Neighbor Embedding (t-SNE). t-SNE is a nonlinear dimensionality reduction technique that is particularly well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot.
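The reduction step can be sketched with scikit-learn's TSNE. Random vectors stand in here for the model's 300-dimensional ingredient vectors (`model.wv.vectors` in gensim), so the sketch stays self-contained:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the trained 300-dimensional ingredient vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 300))

# Project down to 2-D; each row of coords is one ingredient's (x, y)
# position, ready to be colored by cuisine in a scatter plot.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
coords = tsne.fit_transform(vectors)
print(coords.shape)  # (100, 2)
```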
I colored each ingredient by its popularity within a cuisine type; just for educational purposes, that should be fine. Actually, it works better than I expected: there are some interesting clusters in the plot. For instance, Japanese and Chinese ingredients appear together, as do Spanish, Italian and Greek ingredients. In addition, Indian cuisine ingredients form a very clear and isolated group. And it looks like cuisines with strong influences from different cuisines, such as Mexican, Brazilian and Southern US, tend to sit in the central area.
There is a lot of room to improve this implementation, but for educational purposes I think this quick and straightforward test is more than enough. Since word2vec is an unsupervised method, we could use the test recipes as well to improve the model, but this dataset is still not big enough to get really impressive results. As a rule of thumb, neural networks tend to work better with big training datasets and also with bigger networks (more layers). In the next post, let's try to predict the cuisine name using the ingredients list.
leftover paella photo -> http://www.freefoodphotos.com/imagelibrary/bread/slides/left_over_paella.html