Study of titles in PubMed, I

After a quick look at the journals in PubMed with the highest publication frequency, I wanted to try something more complex with this dataset. Specifically, I compiled the titles of all papers from 1946 to 2014 and grouped them by year. Then I calculated the cosine similarity of every year against every other year. This is a common way to measure the similarity between texts; at a very high level, it detects changes in word frequencies. Below you can see a heat map of the resulting matrix, which holds the measure for each year against all the others. However, as you will see, this analysis does not reveal any shocking pattern, only the expected chronological clustering.
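For readers who want to try the idea, here is a minimal sketch of an all-against-all cosine similarity between years. It is not the code behind the heat map: the tokenizer, the use of raw word counts (rather than, say, TF-IDF weights) and the names `term_counts`, `cosine_similarity` and `similarity_matrix` are my assumptions for illustration, and the three "years" of titles are invented toy data.

```python
import math
import re
from collections import Counter


def term_counts(titles):
    """Count lowercase word tokens over all titles of one year."""
    counts = Counter()
    for title in titles:
        counts.update(re.findall(r"[a-z0-9]+", title.lower()))
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty year shares nothing with any other year
    return dot / (norm_a * norm_b)


def similarity_matrix(titles_by_year):
    """Return the sorted years and the year-by-year similarity matrix."""
    years = sorted(titles_by_year)
    vectors = {year: term_counts(titles_by_year[year]) for year in years}
    matrix = [
        [cosine_similarity(vectors[row], vectors[col]) for col in years]
        for row in years
    ]
    return years, matrix


# Toy data (invented titles, only to make the sketch runnable).
titles_by_year = {
    1950: ["Treatment of tuberculosis", "Study of tuberculosis therapy"],
    1951: ["Tuberculosis treatment in children"],
    2010: ["Gene expression in cancer", "Protein structure of a cancer gene"],
}

years, matrix = similarity_matrix(titles_by_year)
for year, row in zip(years, matrix):
    print(year, [round(value, 2) for value in row])
```

In this toy example, 1950 comes out closer to 1951 than to 2010, which is the same kind of chronological effect the heat map shows at full scale.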

The early years are very different from the rest. This is not due to the age of the papers but to the presence of papers in multiple languages (for instance, at the beginning of the century the lingua franca of science was German instead of English). Also, until 1966 not all items are included, only a selection, which could bias the titles a bit. A few clusters appear in the heat map: for example, one that covers the 50s to the early 70s, another from the mid 70s to the mid 80s, and finally a third group from there until today. To make the clustering objective, I used the simple nearest-neighbour algorithm. As you can see in the dendrogram below, there are no surprises, but at least the groups are now defined more objectively.
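The clustering step can be sketched as follows. I am assuming here that "nearest neighbour" means single-linkage agglomerative clustering and that the distance between two years is 1 minus their cosine similarity; both are assumptions, as are the function name `single_linkage` and the small similarity matrix, which is invented toy data rather than the real result.

```python
def single_linkage(labels, dist, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    The distance between two clusters is the smallest distance between
    any pair of their members (nearest neighbour / single linkage).
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(labels))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of first cluster, index of second)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(labels[k] for k in cluster) for cluster in clusters]


# Toy similarity matrix (invented values) for four years.
toy_years = [1950, 1951, 2010, 2011]
toy_similarity = [
    [1.00, 0.90, 0.20, 0.15],
    [0.90, 1.00, 0.25, 0.20],
    [0.20, 0.25, 1.00, 0.80],
    [0.15, 0.20, 0.80, 1.00],
]
# Turn similarities into distances: identical years are at distance 0.
toy_distance = [[1.0 - s for s in row] for row in toy_similarity]

groups = single_linkage(toy_years, toy_distance, n_clusters=2)
print(groups)
```

Recording the order and distance of each merge, instead of stopping at a fixed number of clusters, gives the information a dendrogram is drawn from.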


In the dendrogram, besides the patterns observed in the heat map, we can cluster the years at a second level of detail. However, in general, and as I expected, the groups follow the chronological order of publication. This is mainly due to the appearance of new terms in recent years, for instance “mRNA-Seq”, “RFLP” or “HIV”, plus the abandonment or alteration of others, such as “electronarcosis” or “RSFSR”. On the other hand, if we compare the most frequent words between clusters, we find that, with a few exceptions, there is a strong overlap between groups. This overlap corresponds to common descriptive terms in scientific language, and in biomedical language in particular, such as treatment, study, patient, case, effect, therapy and illness. Interestingly, from the 90s on, the terms gene and protein burst onto this list.


Finally, we can look for a correlation between how common a disease is and the interest it attracts. According to the WHO, the diseases with the highest mortality rates are strokes, heart disease (heart attacks and hypertension) and infectious diseases. However, if we compare the relative frequency of the word “cancer” with the sum of terms related to heart disease (i.e. hypertension, ischemia, heart, cardiovascular, etc.), infectious diseases (tuberculosis, malaria, HIV, etc.), diabetes and asthma, there is a clear prominence of cancer, as you can see in the plot below. Before the 80s the term was already important, but since then there has been a growing trend. During the 80s some important discoveries were reported, some of them thanks to the large injection of funding during the previous decade. In addition, a significant increase in the rate of lung cancer triggered social awareness of the disease and motivated an increase in interest and funding. However, cancer (all kinds) has been estimated to cause ~8 million deaths in the world, whereas cardiovascular diseases killed ~17 million.
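A rough sketch of how such relative frequencies could be computed for one year of titles is shown here. The term lists contain only the examples named above (the full lists are not given), the share is taken relative to the total of all five groups, which is my assumption about the normalisation, and `TERM_GROUPS`, `group_shares` and the sample titles are hypothetical.

```python
import re
from collections import Counter

# Partial term lists: only the examples named in the text.
TERM_GROUPS = {
    "cancer": {"cancer"},
    "heart disease": {"hypertension", "ischemia", "heart", "cardiovascular"},
    "infectious diseases": {"tuberculosis", "malaria", "hiv"},
    "diabetes": {"diabetes"},
    "asthma": {"asthma"},
}


def group_shares(titles):
    """Share of each disease group among all disease-term hits in the titles."""
    counts = Counter()
    for title in titles:
        for token in re.findall(r"[a-z0-9]+", title.lower()):
            for group, terms in TERM_GROUPS.items():
                if token in terms:
                    counts[group] += 1
    total = sum(counts.values())
    return {
        group: (counts[group] / total if total else 0.0)
        for group in TERM_GROUPS
    }


# Invented titles standing in for one year of PubMed.
sample_titles = [
    "Cancer gene therapy",
    "Heart failure and hypertension",
    "Lung cancer screening",
    "Malaria in children",
]

shares = group_shares(sample_titles)
for group, share in shares.items():
    print(f"{group}: {share:.2f}")
```

Running this once per year and plotting the shares over time gives one line per disease group.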


Diabetes shows the opposite situation: it displays worrying facts, such as growing mortality and incidence rates, but its frequency in PubMed has barely changed. Fortunately, in recent years we have witnessed an increase in public awareness of this disease, and there is a slight upward trend over the last decade.

Interestingly, in the previous plot we can observe the absolute prominence of infectious diseases before the 60s. They lost the leading position thanks to the success of antibiotics and vaccines, which reduced the severity and incidence of infectious or communicable diseases.