Cooking classes for computers (I)

A year ago, this dataset of recipes was posted on Kaggle. The challenge was to build a model to predict the cuisine type of a recipe. The dataset was provided by Yummly. Before seeing how well different approaches perform at predicting the cuisine, I had the idea of applying distributed word vectors (word2vec) to this dataset.

Word2vec, at a very high level, is an algorithm capable of learning the relationships between words from the context (the neighbouring words) in which they appear in sentences. After training, the model produces a vector for each word that encodes its characteristics. Using these vectors, we can cluster the words in our vocabulary, or even do arithmetic with them. The classic example of the latter is: “king – man + woman = queen”.

Word2vec uses a shallow neural network to learn, so it usually works better with huge datasets (billions of words), but we will see how it performs on the cooking dataset, where each recipe will be treated as a sentence. One of the best features of this algorithm, published by Google, is its speed: earlier neural language models, such as recurrent networks, had been proposed, but they were insanely CPU-intensive. If you want more detailed information about this, I strongly suggest you read here, here and here. Also, if you want to check in detail what I have done, please visit this post's notebook on my GitHub.

First of all, although this dataset is probably quite clean, doing a general exploration of our data is a really good habit, no matter what kind of model we end up applying to it. Plotting a few frequencies and means can provide valuable information of all sorts about potential problems, biases, typos, etc.

The training dataset contains almost 40K recipes, from 20 different kinds of cuisine, and around 6K ingredients. Let's see how many recipes there are per cuisine type. As we can see in the first plot, Italian and Mexican recipes represent more than a third of the entire dataset, so it is likely that this will affect how our vectors form. It is good to keep this in mind for this or any other model we apply to this dataset. For now we will ignore this bias and check next whether there is also a bias in the size of the recipes.
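A quick sketch of this exploration with pandas, assuming the Kaggle train.json file (the actual plots in this post come from the notebook):

import json
import pandas as pd

with open('train.json') as fh:              # Kaggle "What's Cooking?" training file
    recipes = json.load(fh)

df = pd.DataFrame(recipes)                  # columns: id, cuisine, ingredients
print(df['cuisine'].value_counts())         # recipes per cuisine type

df['n_ingredients'] = df['ingredients'].apply(len)
print(df.groupby('cuisine')['n_ingredients'].mean())   # average recipe size per cuisine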

[Plot: recipe sizes (ingredients per recipe) by cuisine]

Well, in terms of size, all the recipes appear to be fairly similar. So, let's focus on the ingredients. As I mentioned above, I assume this dataset is really clean, or at least cleaner than a real-world dataset, and I do not expect to need any pre-processing. Also, full disclaimer: I did not check any of the models submitted to Kaggle, and my intention is to build the word2vec model only as a proof of concept.

[Plot: ingredient frequency distribution]

Finally, we will take a look on the ingredients format, and frequency. If we take a look to  the percentiles, we would see how  half of the ingredients only appear 4 times in the entire data set, and a 1% of them more than thousand. Make sense, some ingredients as salt, or water are common in many recipes meanwhile some can be very specific of a cuisine type, something important to keep in mind for futher analysis. However,  a few ingredients have been count has unique, a in reality just variants of other. Then, again better to make a few question to the dataset. Which ones are the top 10 ingredients?
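One way to get these numbers (a sketch, reusing the recipes list loaded above):

from collections import Counter
import numpy as np

counts = Counter(ing for recipe in recipes for ing in recipe['ingredients'])

print(np.percentile(list(counts.values()), [50, 90, 99]))  # e.g. the median count is only ~4
print(counts.most_common(15))                              # the output shown below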

[(u'salt', 18049),
 (u'olive oil', 7972),
 (u'onions', 7972),
 (u'water', 7457),
 (u'garlic', 7380),
 (u'sugar', 6434),
 (u'garlic cloves', 6237),
 (u'butter', 4848),
 (u'ground black pepper', 4785),
 (u'all-purpose flour', 4632),
 (u'pepper', 4438),
 (u'vegetable oil', 4385),
 (u'eggs', 3388),
 (u'soy sauce', 3296),
 (u'kosher salt', 3113)]

It makes a lot of sense for a few of these ingredients to be highly frequent, but the presence of olive oil among these omnipresent ingredients makes me think it is an artefact of the dataset's bias towards Italian cooking. On the other hand, which are the least frequent ingredients?

[(u'whole wheat seasoned breadcrumbs', 1),
 (u'Foster Farms boneless skinless chicken breasts', 1),
 (u'Doritos Tortilla Chips', 1),
 (u'smoked turkey drumstick', 1),
 (u'Wholesome Sweeteners Organic Sugar', 1),
 (u'stem ginger', 1),
 (u'farfalline', 1),
 (u'lipton green tea bag', 1),
 (u'plain soy yogurt', 1),
 (u'meat-filled tortellini', 1),
 (u'cold-smoked salmon', 1),
 (u'ranch-style seasoning', 1),
 (u'lotus leaves', 1),
 (u'white quinoa', 1),
 (u'high gluten bread flour', 1),
 (u'blueberry pie filling', 1),
 (u'Pillsbury Thin Pizza Crust', 1),
 (u'Greek black olives', 1),

....

 (u'Amarena cherries', 1),
 (u'black radish', 1),
 (u'candied jalapeno', 1),
 (u'arame', 1),
 (u'chioggia', 1),
 (u'low sodium canned chicken broth', 1),
 (u'cinnamon ice cream', 1)]

Indeed, there are some very specific ingredients among those. However, a deeper search of the dataset starts to bring to light a few of the writing variants.
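For example, a simple substring search over the ingredient counts (a sketch, reusing counts from above) already surfaces the garlic family shown below:

# find all ingredient strings containing a given base word, e.g. "garlic"
garlic_variants = {ing: n for ing, n in counts.items() if 'garlic' in ing.lower()}
print(sorted(garlic_variants.items(), key=lambda kv: kv[1], reverse=True))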

[(u'garlic', 7380),
 (u'garlic cloves', 6237),
 (u'garlic powder', 1442),
 (u'garlic paste', 282),
 (u'garlic salt', 240),
 (u'garlic chili sauce', 130),
 (u'garlic chives', 25),
 (u'garlic puree', 16),
 (u'garlic bulb', 14),
 (u'garlic sauce', 11),
 (u'garlic oil', 9),
 (u'garlic pepper seasoning', 8),
 (u'garlic herb feta', 4),
 (u'garlic shoots', 4),
 (u'garlic and herb seasoning', 3)]

Also notice that in the dataset the same ingredient can appear in different formats, e.g. garlic and garlic cloves. Here we need to decide whether to treat those as different ingredients; this could actually change in a significant way how the model works, but for now let's train the neural network with an almost raw version of the dataset and see what comes out.
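A minimal training sketch with gensim, reusing the recipes loaded earlier; apart from the 300-dimensional vectors mentioned below, the parameters here are illustrative placeholders rather than the exact ones from my notebook:

from gensim.models import Word2Vec

# each recipe is one "sentence"; each ingredient string (even multi-word
# ones such as "olive oil") is kept as a single token
sentences = [recipe['ingredients'] for recipe in recipes]

model = Word2Vec(sentences,
                 size=300,     # dimensionality of the vectors ('vector_size' in gensim >= 4)
                 window=10,    # context window; recipes are short anyway
                 min_count=3,  # drop very rare ingredients
                 workers=4)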

Once the model is trained, we can evaluate it by asking it a few questions, for example: what is similar to feta cheese?
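With gensim this is a one-liner, assuming “feta cheese” made it into the vocabulary as an ingredient string:

# the 10 ingredients whose vectors are closest to "feta cheese"
print(model.wv.most_similar('feta cheese', topn=10))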

[(u'kalamata', 0.9521325826644897),
 (u'pitted kalamata olives', 0.9163238406181335),
 (u'fresh oregano', 0.9144715666770935),
 (u'roasted red peppers', 0.8977206945419312),
 (u'grape tomatoes', 0.8959800004959106),
 (u'olives', 0.895972728729248),
 (u'pita bread rounds', 0.8829742670059204),
 (u'plum tomatoes', 0.8803691267967224),
 (u'goat cheese', 0.8792314529418945),
 (u'yellow tomato', 0.8785962462425232)]

It looks like all the ingredients belong to Greek cuisine, or are at least foods you would expect to find together with feta cheese. So, although the dataset is small for this algorithm, the model has been able to capture some relationships. Next, we can try to do arithmetic with the words, as I mentioned above, or even a pseudo-logic operation.
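The results below were obtained with queries along these lines (a sketch; the exact ingredient strings in the vocabulary may differ slightly):

# vector arithmetic: PASTA + MEAT - TOMATO SAUCE
print(model.wv.most_similar(positive=['pasta', 'meat'], negative=['tomato sauce'], topn=1))

# analogy: BACON is to CACAO as CHICKEN is to ...?
print(model.wv.most_similar(positive=['cacao', 'chicken'], negative=['bacon'], topn=1))

# cosine similarity between two ingredients
print(model.wv.similarity('broccoli', 'carrots'))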

  • OPERATIONS

PASTA + MEAT – TOMATO SAUCE = White mushrooms

CHILI + MEAT – TOMATO SAUCE  = Peanut

  • LOGICS

BACON is to CACAO as CHICKEN is to seed

BACON is to BROCCOLI  as CHICKEN is to chili flakes

  • How similar are these ingredients?

BROCCOLI and BACON = 0.33

BROCCOLI and CARROTS = 0.67

BROCCOLI and MUSHROOMS = 0.81

I set up the model to build a vector of 300 dimensions for each word, so plotting that directly in a scatter plot is not viable. However, there are techniques to reduce a high-dimensional dataset to two or three dimensions, such as Principal Component Analysis (PCA) or, more suitable for this example, t-Distributed Stochastic Neighbor Embedding (t-SNE). t-SNE is a nonlinear dimensionality reduction technique that is particularly well suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot.
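A sketch of the reduction step with scikit-learn; the colouring by cuisine is done separately:

import numpy as np
from sklearn.manifold import TSNE

words = list(model.wv.index2word)        # vocabulary ('index_to_key' in gensim >= 4)
vectors = np.array([model.wv[w] for w in words])

coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)
# coords[:, 0] and coords[:, 1] can now be drawn in a scatter plot,
# colouring each word by the cuisine in which it is most popular.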

I coloured each ingredient by its popularity in a cuisine type; just for educational purposes that should be fine. Actually, it worked better than I expected: there are some interesting clusters in the plot. For instance, Japanese and Chinese ingredients appear together, as do Spanish, Italian and Greek ingredients. In addition, Indian ingredients are clustered in a very clear and isolated group. And it looks like cuisines with strong influences from different cuisines, such as Mexican, Brazilian or Southern US, tend to sit in the central area.

[Plot: t-SNE projection of the ingredient vectors, coloured by cuisine]

There is a lot of room to improve this implementation, but for educational purposes I think this quick and straightforward test is more than enough. Since word2vec is an unsupervised method, we could use the test recipes as well to improve the model, but this dataset is still not big enough to get really impressive results. As a rule of thumb, neural networks tend to work better with bigger training datasets and also bigger networks (more layers). In the next post, let's try to predict the cuisine name using the ingredients list.

——————————————————————-

leftover paella photo -> http://www.freefoodphotos.com/imagelibrary/bread/slides/left_over_paella.html

Study of titles in PubMed, I

After a quick look at the PubMed journals with the highest publication frequency, I wanted to try something more complex with this dataset. Specifically, I compiled and grouped by year all the titles of the papers from 1946 to 2014. Then I calculated the cosine similarity of every year against every other year. This methodology is commonly used to measure the similarity between texts and, at a very high level, detects changes in word frequencies. Below you can see a heat map, a representation of the resulting matrix of distances of each year against all the others. However, as you will see, this analysis does not show any shocking pattern, only the expected chronological clustering.
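A sketch of this computation with scikit-learn, assuming a dict titles_by_year that maps each year to the concatenation of that year's titles (I may well have used raw term counts instead of TF-IDF):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

years = sorted(titles_by_year)
docs = [titles_by_year[y] for y in years]

X = TfidfVectorizer(stop_words='english').fit_transform(docs)  # one row per year
sim_matrix = cosine_similarity(X)   # year-by-year similarity, drawn below as a heat map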

[Heat map: cosine similarity of the titles, year against year]

The early years are very different from the rest. This is not due to the age of the papers, but to the presence of papers in multiple languages (for instance, at the beginning of the century the lingua franca of science was German rather than English); also, until 1966 not all items were indexed, only a selection, which could bias the titles a bit. A few clusters appear in the heat map: for example, one covering from the 50s to the early 70s, another group from the mid 70s to the mid 80s, and finally a third group from there until nowadays. To make the clustering objective I used a simple nearest-neighbour algorithm; as you can see below in the dendrogram, there are no surprises, but at least now we have defined the groups more objectively.
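The clustering itself can be done with SciPy's single-linkage (nearest-neighbour) hierarchical clustering on the similarity matrix above, roughly like this:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

dist = 1.0 - sim_matrix                 # turn similarities into distances
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method='single')  # nearest neighbour
dendrogram(Z, labels=years)             # the dendrogram shown below (matplotlib)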

[Dendrogram: hierarchical clustering of the years by title similarity]

In the dendrogram, besides the patterns observed in the heat map, we can cluster the years at a second level of detail. In general, and as I expected, the groups follow the chronological order of publication. This is mainly due to the appearance of new terms in recent years, for instance “mRNA-Seq”, “RFLP” or “HIV”, plus the abandonment or alteration of others, such as electronarcosis or RSFSR. On the other hand, if we compare the most frequent words between clusters, with a few exceptions we find a strong overlap between groups. This overlap corresponds to common descriptive terms in scientific, and in particular biomedical, language, such as treatment, study, patient, case, effect, therapy, illness. Interestingly, from the 90s on, the terms gene and protein burst into this list.


Finally, we can look at the relationship between the frequency of a disease and the interest in it. According to the WHO, the diseases with the highest mortality rates are strokes, heart disease (heart attacks and hypertension) and infectious diseases. However, let's look at the relative frequency of the word “cancer” compared with the sum of terms related to heart disease (i.e. hypertension, ischemia, heart, cardiovascular, etc.), infectious diseases (tuberculosis, malaria, HIV, etc.), diabetes and asthma. As you can see in the plot below, there is a clear prominence of cancer. Before the 80s the term was already important, but after that there is a growing trend. During the 80s some important discoveries were reported, some thanks to the large injection of funding during the previous decade. In addition, a significant increase in the rate of lung cancer (more details) triggered social awareness of the disease and motivated an increase in interest and funding. Still, it has been estimated that cancer (all kinds) causes ~8 million deaths a year worldwide, while cardiovascular diseases kill ~17 million.
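The plot below is built from relative frequencies of these term groups per year; a minimal sketch, reusing titles_by_year and years from above, where the exact term lists are an assumption:

import re

heart_terms = ['hypertension', 'ischemia', 'heart', 'cardiovascular']

def relative_freq(terms, text):
    # fraction of all title words that belong to the given term group
    words = re.findall(r'[a-z]+', text.lower())
    return sum(words.count(t) for t in terms) / float(len(words))

cancer_trend = [relative_freq(['cancer'], titles_by_year[y]) for y in years]
heart_trend = [relative_freq(heart_terms, titles_by_year[y]) for y in years]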

[Plot: relative frequency of disease-related terms in PubMed titles over time]

Diabetes shows the opposite situation: it displays worrying facts, such as growing mortality and incidence rates, but its frequency in PubMed has barely changed. Fortunately, in recent years we have witnessed an increase in public awareness of this disease, and there is a slight increase in the trend over the last decade.

Interestingly, in the plot above we can observe the absolute protagonism of infectious diseases before the 60s, but they lost the leading position due to the success of antibiotics and vaccines, which reduced the severity and incidence of infectious or communicable diseases.

The Demiurge

We live surrounded by smart machines; indeed, there are several families of algorithms that try to mimic a mind and provide a machine with intelligence. This seemingly magic software is applied in many fields, from finance to medicine. In fact, these algorithms are not new: the foundations of several of them were established in the early 60s. However, they are far from being a thinking mind, and each one has its own weaknesses that prevent it from completely mimicking a human mind. So, the magic formula has not been written yet. Pedro Domingos talks about all this in his book “The Master Algorithm”. I strongly recommend the following video, where the author gives a brilliant summary of the state of the art in the field and introduces the master algorithm, the code capable of conceiving a real mind. No special knowledge is needed to follow the talk, and it might motivate you to read the book.

Parking in Toronto, II

Apart from when parking tickets are issued, another very important aspect is where. In the dataset provided by the city of Toronto there are approximately 400,000 addresses, which would mean about 25 tickets per location over the past 4 years. However, only 100,000 of them accumulate 98% of the total offences, and in fact just 7,000 represent around 60% of the total! Even more, the top ten locations alone accumulate nearly 2% of all the tickets in the city. In the future I will talk about this kind of distribution, where a few elements carry most of the weight of the population, a very common situation. Below you can see the 7,000 most ticketed locations in Toronto during the last 4 years.
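These concentration figures come from a simple cumulative count per address; a pandas sketch, where the file name and the address column ('location2') are assumptions about the raw data:

import pandas as pd

tickets = pd.read_csv('parking_tickets_2011_2014.csv')   # hypothetical combined file
per_address = tickets['location2'].value_counts()        # tickets per address, descending

share = per_address.cumsum() / per_address.sum()
print(share.iloc[9])       # fraction of all tickets covered by the top 10 addresses
print(share.iloc[6999])    # ... by the top 7,000 addresses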

[Map: the 7,000 most ticketed locations in Toronto]

In over 6,000 of those locations, the average is one ticket or less per day. In general, downtown shows a higher density of hot locations but, surprisingly, the 10 most ticketed addresses are not concentrated in the city centre and are in fact quite scattered around the map.
[Map: detail of the 10 most ticketed addresses]
I checked the most ticketed addresses and, with a couple of exceptions, most of them correspond to hospitals, university campuses and shopping centres or malls. In first position, very prominently, is Sunnybrook Hospital. This hospital, one of the largest in Canada, is very busy during the day. I can imagine that people generally underestimate the time they will spend in the hospital, and perhaps how strict the parking control is. The hospital has its own parking control staff, and they take their work very seriously, considering that the average is about 25 tickets a day. Like Sunnybrook Hospital, the universities have their own staff to control parking, so on campus there is no offence without punishment.

Ranking  Address            Total tickets (2011-2014)  Ratio (%)  Description
1        2075 BAYVIEW AVE   37399                      0.3512     Sunnybrook Hospital
2        20 EDWARD ST       25711                      0.2414     World’s Biggest Book Store
3        1750 FINCH AVE E   19036                      0.1787     Seneca College
4        JAMES ST           13108                      0.1231     Eaton Center
5        941 PROGRESS AVE   12816                      0.1203     Centennial College
6        700 LAWRENCE AV W  11497                      0.1079     Lawrence Square Shopping Centre
7        1265 MILITARY TR   11129                      0.1045     University of Toronto Scarborough Campus
8        225 KING ST W      10926                      0.1026     Ticket King
9        60 BLOOR ST W      10300                      0.0967     GAP
10       3401 DUFFERIN ST   9814                       0.0921     Yorkdale Shopping Centre

Among all the top positions, one caught my attention: 20 Edward Street. The number of tickets at this location is no joke; however, currently at this address there is only a condo under construction, that's it. At first I thought the construction site might be the source of all those tickets, but a quick internet search for the address finally gave me a better answer: until 2014 (the last year of the dataset), 20 Edward Street was the World’s Biggest Book Store. Apparently it was quite popular and, judging by the type of fines accumulated (most were for being illegally parked, ~50%, followed by non-payment of parking, ~20%), I can imagine the behaviour of the drivers: they preferred to risk getting a ticket instead of searching for a spot to park, or thought "why pay, if it is only going to be a minute?".

Lately, driven by my small analysis of the Toronto parking dataset, I have been paying more attention to parking behaviour. I got the feeling that only a few people pay at the parking meter, and this impression is corroborated by the dataset, since more than 30% of all fines are for parking in the city without paying.


Stop the presses!

In science, especially in biomedicine, the relevance of a scientist is measured by his or her record of publications. Moreover, not only the number matters but also which journals they have been published in. There is actually a ranking of how important a journal is, based on the number of yearly citations it receives: the impact factor. Also, there are journals with open access and others for which you have to pay a subscription or buy the papers; which is funny when you consider that the authors often have to pay to publish as well. But before we get into that: a scientist usually follows with regularity only a small number of journals, often the most important in their field. However, scientists frequently use a search engine to find, in a comprehensive way, what has been done about a topic or a methodology, no matter the journal. The classic search engine in biomedicine is PubMed. It is a free search engine maintained by the United States National Library of Medicine and has been accessible since 1996. PubMed has indexed papers since 1966, and prior to that date only a selection of the most relevant papers is included. But not all journals are in PubMed, only those that meet PubMed's scientific standards. Today, despite these limitations, nearly 25 million papers are indexed in PubMed, and during the last few years it has grown by about 1 million new papers per year.

[Plot: number of papers indexed in PubMed per year]

This growth, like many other phenomena, can be explained by a combination of factors: i) Money: although I think it is decreasing right now, investment in science has been increasing over the last decades, as has the number of working scientists. In fact, I would guess a high correlation between the number of papers and the number of scientists. There are also two subtle shifts in the growth, one after the 70s (the total war against cancer started in the early 70s) and a second one after 2000, the human genome era. ii) Motivation: this is more complex, but can be summarized as "publish or perish". Most biomedical research is funded at public expense; therefore, those dollars should be given to the best projects and to the people best placed to develop those promising projects. As I mentioned earlier in this post, the publication record of a scientist is the most important yardstick for doing so. Not only that, a good publication record in your early stages as a scientist determines where you can go to work in your next stage, and a better centre usually comes with more money, and thus better chances of publishing more, and more relevant, papers. Papers that you will need in order to apply for more money... Do you see the feedback loop?

Journal                                                                           Total papers (mid 2014)  First paper indexed
The Journal of biological chemistry                                               167622                   1946
Science (New York, N.Y.)                                                          164612                   1946
Lancet                                                                            128337                   1946
Proceedings of the National Academy of Sciences of the United States of America  117711                   1946
Nature                                                                            101682                   1946
British medical journal                                                           97098                    1946
Biochimica et biophysica acta                                                     93736                    1948
PloS one                                                                          85750                    2007
Biochemical and biophysical research communications                               76442                    1960
The New England journal of medicine                                               70721                    1946

 

Well, let's rank PubMed's journals by total number of publications. In the top 10 we find what we expected: very old and prestigious journals with thousands of papers. In fact, the first papers published by “Lancet”, “The New England Journal of Medicine” and “Science” appeared in 1823, 1812 and 1880 respectively (well before their first indexed papers). However, among all these journals, one calls our attention: “PLoS ONE”. This journal, founded in 2006, has already published more papers than some journals founded almost two centuries ago. It is clear that the publication rate 80 years ago was not the same as nowadays, but how do you publish such an amount of papers in 6 years? What is special about “PLoS ONE”?

Firstly, it is completely open: articles are accessible to everyone, entirely online, without paying any subscription. This is a smart move, because those papers can be cited by more people. However, the main reason for this high publication rate is its criterion for accepting or rejecting papers. Unlike most traditional journals, where a piece of research has to prove a certain novelty, impact and scientific rigour, PLoS ONE only verifies whether the experiments and data analysis were conducted rigorously. To provide a frame of reference: PLoS ONE has an acceptance rate of around 70%, while in “Nature” only 8% of papers are accepted. There is a big controversy over this model. I personally think that there should be a space to publish work that might be less relevant, for instance work that has already been scooped, or that turned out less sexy than the author thought before the experiments. Although, when I see papers like “Fellatio by fruit bats prolongs copulation time”, I feel pissed off: not only has that research been funded by public money, but the journal also gets paid to publish often dispensable research. So, is PLoS ONE taking advantage of the necessity to publish? Is this journal a new business model more than an academic model? In addition, I am afraid this model can be exploited to artificially increase the number of papers of some authors, by sending out cancelled projects, or one-week projects with no aim at all: just low-hanging fruit to pad a publication record and increase the chances of getting a new grant. Recently, the impact factor of PLoS ONE has been decreasing, so it looks like this may be a general belief. However, there is no doubt this is a very profitable model, and other journals are planning to start applying the same approach.

Spurious Correlations

Making a chart of two variables to measure the degree of correlation between them is probably one of the most used statistical tools. And, like any other analysis, we must be very careful when drawing conclusions, because it does not always reflect reality. Two variables can be strongly correlated in many ways. For example, the number of libraries in a city is strongly related to the absolute number of crimes; however, that does not mean that libraries encourage crime. This example is very clear, and it is surely easy to find more of this kind, but I would like to mention the spurious correlations. These occur when two variables with no logical connection show a strong correlation coefficient, for instance the annual number of PhDs in computer science in the United States and the annual revenue of American arcades. On Tyler Vigen's website you can find more of those correlations and, if you are really interested, you can buy his book on the subject, Spurious Correlations.

[Chart: annual US computer science PhDs vs. annual arcade revenue]

Parking Tickets in Toronto, I

Toronto is the largest city in Canada and the fifth largest in North America. According to a census carried out in 2011, Toronto has 1.3 cars per household, while this rate is around 0.6 and 1.1 for New York and Chicago, respectively. These are cities with comparable characteristics, but the fact that Toronto is the most car-oriented of them won't surprise anyone who has ever been in Toronto and used its public transport.


So, according to this census, there are approximately 1.1 million cars in Toronto and, on average, 2.8 million parking fines are issued in the city each year. In other words, nearly 3 tickets per vehicle per year for parking after hours or in prohibited places. A fairly high average, but if you have a car, or know someone who has a car in Toronto, that average will not surprise you either. In a way, the system is broken: for instance, there was recently a general amnesty cancelling nearly one million violations, mainly because the amount of claims was impossible to handle. Something that also says a lot.

[Plot: total parking tickets per day in Toronto since 2011]

Fortunately, the city hall of Toronto provides the raw data of all of those parking violations, a dataset I plan to analyse in order to learn more about Toronto. First, I plotted the number of total tickets per day since 2011. As you can see above, it stands out how spiky the plot is; furthermore, almost all of these spikes are drops. If we sort the data by number of fines, we can appreciate how the minima correspond to holidays, especially Christmas, New Year and Thanksgiving.
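A sketch of how the daily counts can be obtained (the file name is a placeholder; date_of_infraction is a column of the raw files):

import pandas as pd

tickets = pd.read_csv('parking_tickets_2011_2014.csv')   # hypothetical combined file
daily = tickets.groupby('date_of_infraction').size()     # tickets per day
print(daily.sort_values().head(20))                      # the quietest days, listed below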

date_of_infraction      N
2013-12-25             235
2011-12-25             322
2014-12-25             425
2014-01-01             536
2013-01-01             577
2013-12-26             587
2012-12-25             673
2013-12-23             943
2012-01-01            1059
2013-12-24            1126
2013-12-22            1225
2014-12-26            1307
2012-12-26            1488
2011-01-01            1527
2013-02-08            1695
2013-10-14            1966
2011-12-26            1996
2014-10-13            2028
2012-10-08            2076
2011-10-10            2172

Actually, in the top 20 there is only one day that breaks this rule: 2013-02-08. That day, a major snowstorm fell on Toronto. We can also notice how the smaller minima are repetitive; those minima correspond to weekends. Since parking prohibitions are generally not lifted on holidays, this behaviour can be explained by a combination of two factors: i) many people do not move around, or they leave the city, during weekends and holidays; ii) there are fewer parking officers patrolling. I will keep thinking about it, because I would like to figure out which of these factors has the higher weight.

On the other hand, in summer many people move around by bike, so it caught my attention how small the difference between summer and winter is. You can appreciate this in the plot of 2011 below, the year where this difference is most pronounced.

[Plot: daily parking tickets in Toronto, 2011]

Let's see what else we can learn from the parking tickets of Toronto. This dataset is provided by the City of Toronto itself, under the Open Government licence.