TF-IDF approach by the Genie team
This is the slide from the Genie team that illustrates the TF-IDF approach. No, it does not make it clear to me what TF-IDF is, how they applied it, or what it means to get "class" as an output from a medical text. A class of what? Of genes? Of their mutations? Googling could not tell me, because "class" is such a generic word, even in the context of genetics. But the slide does contain an example of the text they analyzed, with the names of genes and their mutations in red.
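For my own reference, what I did eventually learn: TF-IDF (term frequency-inverse document frequency) scores how characteristic a word is of one document relative to a whole corpus. A word that shows up in every document scores zero; a rarer word scores higher. Here is a minimal pure-Python sketch, my own toy example with made-up medical-ish sentences, not anything from the Genie team's code:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF scores for each term in each tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears at least once
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return scores

docs = [
    "the BRCA1 mutation was detected".split(),
    "the patient showed no mutation".split(),
    "the BRCA1 variants are studied".split(),
]
scores = tf_idf(docs)
# "the" appears in every document, so its score is 0;
# "BRCA1" appears in only two, so it gets a positive score
```

So in a corpus of medical texts, gene names like BRCA1 would stand out with high TF-IDF scores, while filler words would not. I can see why that would be a starting point, even if the slide's "class" output is still a mystery to me.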
"Genie" was the name of the project presented at the Women In Tech Machine Learning product hackathon that applied natural language processing to genetics.
I tried to understand what the Genie team did. Since I know nothing about genetics, it turned into an exercise of "how much can I understand from these slides by googling genetics for an hour?". Understanding slides without a narrative is so much more difficult than understanding the speaker's explanation. Eventually I concluded that the goal of the Genie project was to identify genes and their mutations or variants from the text. (Wikipedia tells me that a gene variant may or may not be a mutation; apparently it's a more general term.)
For example, the strings in red in this picture would be the names of genes and their mutations identified by their machine learning application. In other words, if you feed the application a text like the one above, it would be able to say: here are the genes and mutations this text is talking about.
The Genie team also tried other approaches. In the "Doc2Vec and Neural Networks" method, they converted the text to a 1-dimensional vector using Doc2Vec, then trained a neural network in Keras to find correlations between vectors that correspond to the same gene / variation combination.
Goal: use gene and variation key words to identify the gene from the text
They mentioned how varying some parameters in their model led to better or worse results. I know way too little about machine learning to understand what those parameters represent or why they led to better or worse results; all I caught was that the word "overfitting" was involved. They also found that this approach had a downside: the output vector has to be the same dimension as the labeled input set, so the model cannot account for new genes.
They also tried a Named-Entity Recognizer, or NER, approach. This attracted my attention when I realized that I use something like that in my personal project, SkillClusters. In the Genie team's NER approach, they used a -- to quote their slide -- "custom entity dictionary, which is basically a substring match". I do the same with SkillClusters, where every term is a key in a dictionary. And the downside they observed -- "can only find what you list (so we cannot account for new genes), and it is not context-sensitive" -- sounds familiar to me too. In SkillClusters, I too can only extract from software development job ads the skill keywords that are already in my dictionary. I keep expanding it as I see new-to-me skills in job ads, but there are always more.
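A dictionary-based "NER" of this kind fits in a few lines. This is a simplified stand-in I wrote to show the idea, not the actual code of either Genie or SkillClusters, and both downsides from their slide are visible in it:

```python
# A toy "custom entity dictionary": every known entity maps to its type.
# Substring matching only finds what is listed, and ignores context.
entity_dict = {
    "BRAF": "gene",
    "V600E": "variant",
    "TP53": "gene",
}

def extract_entities(text):
    """Return (entity, type) pairs whose names appear as substrings of text."""
    return [(name, kind) for name, kind in entity_dict.items()
            if name in text]

text = "The BRAF V600E mutation is common in melanoma."
print(extract_entities(text))  # [('BRAF', 'gene'), ('V600E', 'variant')]
# A gene absent from the dictionary, e.g. KRAS, would never be found,
# and a sentence like "BRAF was ruled out" would still match "BRAF".
```

Swap the gene names for skill keywords and this is essentially what SkillClusters does too: the dictionary is the ceiling on what you can ever extract.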
And that's why, despite me knowing nothing about genetics, this project appealed to me more than others at the hackathon. I saw my own struggles in it.