20180729_153259 Genie team demo
"Genie" was the name of the project presented at the Women In Tech Machine Learning product hackathon that applied natural language processing to genetics.
I tried to understand what the Genie team did. Since I know nothing about genetics, it turned into an exercise in "how much can I understand from these slides by googling genetics for an hour?". Understanding slides without a narrative is so much more difficult than understanding the speaker's explanation. Eventually I concluded that the goal of the Genie project was to identify genes and their mutations or variants from text. (Wikipedia tells me that a gene variant may or may not be a mutation; apparently it's the more general term.)
The Genie team attempted it in three ways: TF-IDF, Doc2Vec and NER (named entity recognition). I found their TF-IDF slide completely inscrutable, because it contained an excerpt from a medical text that mentioned many genes and their mutations, and these words:
Input: Medical literature (document)
I don't understand what the "class" they produced as output was. As one might imagine, with a term so generic, Googling didn't help.
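From what I could piece together, TF-IDF itself is simple enough to sketch. This is my own minimal illustration, not the Genie team's code; the gene names in the toy corpus are made up, and real pipelines would use a library like scikit-learn instead:

```python
import math

def tf_idf(docs):
    """Compute TF-IDF scores for each term in each document.

    docs: list of token lists. Returns a list of {term: score} dicts.
    """
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in docs:
        counts = {}
        for term in doc:
            counts[term] = counts.get(term, 0) + 1
        # TF (term count / doc length) times IDF (log of inverse doc frequency).
        scores.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return scores

# Toy corpus; the gene names are illustrative, not from the Genie slides.
docs = [
    "BRCA1 mutation increases risk".lower().split(),
    "BRCA1 variant observed in patients".lower().split(),
    "the patients recovered quickly".lower().split(),
]
scores = tf_idf(docs)
```

The intuition: a gene name that appears often in one document but rarely across the corpus gets a high score, which is presumably why it looked attractive for picking genes out of medical text.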
In the "Doc2Vec and Neural Networks" method, they converted the text to a fixed-length vector using Doc2Vec, then trained a neural network in Keras to find correlations between vectors that correspond to the same gene / variation combination.
Goal: give gene and variation key words to identify the gene from the text
They mentioned how varying some parameters in their model led to better or worse results. I know way too little about machine learning to understand what those parameters represent or how they lead to better or worse results. It's just that the word "overfitting" was involved. They also found that this approach had a downside: the output vector has to have the same dimension as the set of labels it was trained on, so it cannot account for new genes.
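That downside made sense to me once I sketched it out. This is not the Genie team's Keras model; it's a bare NumPy softmax layer of my own, with made-up class labels, just to show why the output dimension is tied to the label set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: assume each document has already been turned into a
# Doc2Vec-style embedding of this length.
EMBEDDING_DIM = 50
# Illustrative gene/variation labels, not from the Genie slides.
known_classes = ["BRCA1_var1", "BRCA1_var2", "TP53_var1"]
n_classes = len(known_classes)

# A single softmax layer mapping embeddings to class probabilities.
# Its weight matrix is shaped by n_classes, fixed at training time.
W = rng.standard_normal((EMBEDDING_DIM, n_classes)) * 0.01
b = np.zeros(n_classes)

def predict(doc_vector):
    """Return a probability per known class for one document embedding."""
    logits = doc_vector @ W + b
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

probs = predict(rng.standard_normal(EMBEDDING_DIM))
# The output always has exactly n_classes entries: a gene/variation pair
# that wasn't in the training labels simply has no slot in the output.
```

Adding a new gene means changing the shape of the output layer and retraining, which is exactly the "cannot account for new genes" problem.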
They also tried a Named-Entity Recognizer, or NER, approach. This attracted my attention when I realized that I use something like that in my personal project, SkillClusters. In the Genie team's NER approach, they used a -- to quote their slide -- "custom entity dictionary, which is basically a substring match". I do the same with SkillClusters, where every term is a key in a dictionary. And the downside they observed -- "can only find what you list (so we cannot account for new genes), and it is not context-sensitive" -- sounds familiar to me too. In SkillClusters I too can only extract from software development job ads the skill keywords that are already in my dictionary. I keep expanding it all the time as I see new-to-me skills in job ads, but there are always more.
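For what it's worth, dictionary-based matching of this kind fits in a few lines. This is a sketch in the spirit of both approaches, not code from Genie or SkillClusters, and the dictionary entries are illustrative:

```python
# Hypothetical hand-built entity dictionary: term -> entity type.
GENE_DICTIONARY = {
    "brca1": "gene",
    "tp53": "gene",
    "v600e": "variant",
}

def find_entities(text):
    """Return (term, label) pairs for every dictionary term found in the text.

    A plain substring match: it can only find terms already listed, and it
    is not context-sensitive, so it can't tell a gene mention apart from an
    incidental match inside unrelated text.
    """
    lowered = text.lower()
    return [(term, label)
            for term, label in GENE_DICTIONARY.items()
            if term in lowered]

hits = find_entities("Patients with a BRCA1 V600E change responded well.")
```

Both limitations from the slide fall straight out of this structure: anything not in `GENE_DICTIONARY` is invisible, and the match ignores the surrounding sentence entirely.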
And that's why, despite me knowing nothing about genetics, this project appealed to me more than others at the hackathon. I saw my own struggles in it.