NER approach by the Genie team
This is the slide from the Genie team that illustrates the NER -- Named-Entity Recognizer -- approach.
Here is what they did:
-- Tokenize the text using spaCy, with custom Gene and Variant entities that are defined using the labeled training data. Use the majority token of the Gene and Variant entity for the text.
-- Goal: Use gene and variation key words to identify the gene from the text
-- Findings: use a custom entity dictionary, which is basically a substring match
-- Downside: can only find what you list (so we cannot account for new genes) and it is not context-sensitive.
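To make the slide's description concrete, here is a minimal sketch of what a dictionary lookup followed by a majority vote might look like. It is my own illustration, not the Genie team's code: it uses plain regular expressions rather than spaCy, and the gene list and the function names (GENE_TERMS, find_entities, majority_entity) are made up for the example. In their approach the dictionary would presumably be built from the labeled training data.

```python
import re
from collections import Counter

# Hypothetical dictionary of known gene names (an assumption for this sketch).
GENE_TERMS = {"BRCA1", "EGFR", "TP53"}


def find_entities(text, terms):
    """Return every occurrence of a dictionary term in the text, in order.

    This is a plain case-sensitive substring match: no context is used.
    """
    # Try longer terms first so a short term cannot shadow a longer one.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in ordered))
    return pattern.findall(text)


def majority_entity(text, terms):
    """Return the most frequently matched term, or None if nothing matched."""
    counts = Counter(find_entities(text, terms))
    return counts.most_common(1)[0][0] if counts else None


sample = "EGFR mutations are common. EGFR and TP53 were studied; KRAS was not."
matches = find_entities(sample, GENE_TERMS)
winner = majority_entity(sample, GENE_TERMS)
```

Here KRAS is never reported, because it is not in the dictionary. That is exactly the downside the slide lists: the matcher can only find what you list.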
I'm not sure I understood correctly what a Named-Entity Recognizer is, but this approach caught my attention when I realized that I use something similar in my personal project, SkillClusters. In their NER approach, the Genie team used -- to quote their slide -- a "custom entity dictionary, which is basically a substring match". I do the same in SkillClusters, where every term is a key in a dictionary. And the downside they observed -- "can only find what you list (so we cannot account for new genes) and it is not context-sensitive" -- sounds familiar to me too. In SkillClusters I, too, can only extract from software development job ads the skill keywords that are already in my dictionary. I keep expanding it as I see new-to-me skills in job ads, but there are always more.
"Genie" was the name of the project presented at the Women In Tech Machine Learning product hackathon that applied natural language processing to genetics.
I'm still a little uncertain whether I understood exactly what the Genie team did. Since I know nothing about genetics, it turned into an exercise of "how much can I understand from these slides by googling genetics for an hour?". Understanding slides without a narrative is so much more difficult than following the speaker's explanation. Eventually I concluded that the goal of the Genie project was to identify genes and their mutations or variants from the text. (Wikipedia tells me that a gene variant may or may not be a mutation; apparently it's a more general term.)