Curiosity’s COVID-19 Research Tool
Inspired by the CORD-19 challenge, we built an open tool to help answer medical questions about COVID-19. It offers search, similarity, and grouping of the research papers on various topics.
What’s it all about?
In response to the rapidly evolving COVID-19 pandemic, the White House and a coalition of leading research groups published the CORD-19 dataset of more than 45,000 medical papers. They challenged the data science community to develop tools to help medical professionals answer several high-priority research questions.
Inspired by this call to action, Curiosity built a CORD-19 research tool that helps find papers related to the research questions and group them together. It aims to support researchers and data scientists in solving this pressing problem.
The tool is not an entry for the Kaggle challenge because it is hosted externally and because it depends on our own tech stack, which has only been partially open-sourced.
How can this help researchers?
The tool is based on our realisation that natural language processing (NLP), much as we wish it were further along, is still far from understanding and synthesising complex technical texts on its own. We therefore see more merit in helping medical professionals answer the research questions: NLP systems should act on their behalf and help them find information in the mass of papers, rather than replace the humans in the process.
The tool centres around four pieces of functionality: Search, Similarity, Topic Mining, and the Knowledge Graph, each of which we’ll briefly describe below.
Search
The first thing we noticed on Kaggle was that other data scientists were struggling just to explore and understand the data. We tend to take search for granted nowadays, but getting search on your own data is still hard work. This was our initial motivation for building the website: to help data scientists and other researchers browse, find relevant information, and filter their results. We also put together custom views so they can browse the linked data (e.g. all the papers by a particular author, co-authors, institutions, and other relationships) based on the links in the knowledge graph.
The search is augmented by NLP and the knowledge graph. It also expands users’ searches with synonyms and with the definitions of abbreviations and acronyms, both of which were learned automatically from this specific dataset using unsupervised models. To see or add synonyms and abbreviations, right-click on the search term. Note that abbreviations are only expanded if they are typed in upper case in the search box.
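To make the expansion behaviour concrete, here is a minimal sketch in Python. It is not the tool's actual implementation: the lookup tables, their entries, and the function name `expand_query` are hypothetical, and the real synonyms and abbreviations are learned from the corpus rather than hard-coded.

```python
from typing import Dict, List

# Hypothetical lookup tables; the tool learns these from the corpus.
SYNONYMS: Dict[str, List[str]] = {"mask": ["respirator", "face covering"]}
ABBREVIATIONS: Dict[str, str] = {"ACE2": "angiotensin-converting enzyme 2"}


def expand_query(query: str) -> List[str]:
    """Return the original terms plus any abbreviation and synonym expansions."""
    expanded: List[str] = []
    for term in query.split():
        expanded.append(term)
        # Abbreviations are expanded only when typed in upper case.
        if term.isupper() and term in ABBREVIATIONS:
            expanded.append(ABBREVIATIONS[term])
        # Synonyms are matched case-insensitively in this sketch (an assumption).
        expanded.extend(SYNONYMS.get(term.lower(), []))
    return expanded
```

With these tables, `expand_query("ACE2 mask")` adds the abbreviation's definition and both synonyms, while `expand_query("ace2")` returns the term unchanged because it was not typed in upper case.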
Similarity
When a user has found an interesting paper, the tool helps them find similar papers under the “similar” tab. Without going into too much technical detail, “similarity” is based on the concepts encountered in the papers, which are linked together in the knowledge graph (here we use special graph embedding models).
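For readers who want a feel for embedding-based similarity, the sketch below ranks papers by cosine similarity between their vectors. Cosine similarity, the three-dimensional vectors, and the paper identifiers are all assumptions made for illustration; the post does not specify the metric or the embedding dimensions used by the tool.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(
    paper_id: str, embeddings: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Rank all other papers by similarity to the given paper, best first."""
    query = embeddings[paper_id]
    scores = [
        (other, cosine(query, vector))
        for other, vector in embeddings.items()
        if other != paper_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Made-up embeddings for three hypothetical papers.
EMBEDDINGS = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
}
```

Here `most_similar("paper_a", EMBEDDINGS)` ranks `paper_b` ahead of `paper_c`, because its vector points in nearly the same direction as that of `paper_a`.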
Topic Mining (beta)
Topic mining helps users group together papers related to a topic, with a combination of several tools:
Searching for keywords and manually adding papers to the topic
Viewing/adding similar documents to the ones already in your topic
Defining rules based on search or similarity to automatically add to the topic or a candidate area
Training classification models to predict which other papers belong to the topic
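The rule-based step in the list above can be sketched roughly as follows. The class names, fields, and keyword predicates are hypothetical and only illustrate the idea of rules that either add a paper to the topic directly or place it in a candidate area for review; the tool's own rules are defined through its interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Topic:
    name: str
    papers: Set[str] = field(default_factory=set)      # confirmed members
    candidates: Set[str] = field(default_factory=set)  # awaiting review


@dataclass
class Rule:
    matches: Callable[[str], bool]  # predicate on the paper text
    auto_add: bool                  # True: add to topic; False: add to candidates


def apply_rules(topic: Topic, rules: List[Rule], corpus: Dict[str, str]) -> None:
    """Apply each rule to every paper that is not yet a confirmed member."""
    for paper_id, text in corpus.items():
        if paper_id in topic.papers:
            continue
        for rule in rules:
            if not rule.matches(text):
                continue
            if rule.auto_add:
                # A direct match wins over any earlier candidate placement.
                topic.papers.add(paper_id)
                topic.candidates.discard(paper_id)
                break
            topic.candidates.add(paper_id)
```

A strict rule (for example, an exact phrase match) can be set to add papers automatically, while a looser rule only proposes candidates that a person then accepts or rejects.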
To demonstrate how topics work, we’ve set up a couple of sets of topics for you to poke around in. You can find them under “Explore Topics (beta)” on the home page. Some of these topics are based on the tasks listed as part of the Kaggle challenge.
Feel free to play around, add papers to existing topics and review them, and define your own topics or new analyses. Please note that the topic mining interfaces are in beta, and there might be some bugs waiting for you there 🐞.
Knowledge Graph
Underlying the tool’s functionality is a knowledge graph that links papers and other interesting entities, which may come from the papers’ metadata or may be captured from the abstract or full text.
Entities currently include:
Authors (from the CORD-19 dataset)
Affiliations (mostly universities, from the CORD-19 dataset)
Abbreviations (unsupervised model trained on this corpus)
Concepts (unsupervised n-gram model trained on this corpus)
Diseases (model trained from external dataset)
Genes (model trained from external dataset, WIP)
Topics (user-defined topics from the topic mining interface)
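As an illustration of how typed entities and links support views such as "all papers by a particular author", here is a small in-memory sketch. The class, method names, and node identifiers are hypothetical and far simpler than a production graph store; they are not the interface of Curiosity's knowledge graph.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Set


class KnowledgeGraph:
    """Toy undirected graph whose nodes carry an entity type."""

    def __init__(self) -> None:
        self.node_types: Dict[str, str] = {}
        self.edges: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_node(self, node_id: str, node_type: str) -> None:
        self.node_types[node_id] = node_type

    def link(self, source: str, target: str) -> None:
        # Store the link in both directions so it can be followed either way.
        self.edges[source].add(target)
        self.edges[target].add(source)

    def neighbours(self, node_id: str, node_type: str) -> List[str]:
        """Return linked nodes of the requested entity type, sorted by id."""
        return sorted(
            n for n in self.edges[node_id] if self.node_types.get(n) == node_type
        )


graph = KnowledgeGraph()
graph.add_node("paper:1", "Paper")
graph.add_node("paper:2", "Paper")
graph.add_node("author:smith", "Author")
graph.add_node("disease:covid-19", "Disease")
graph.link("paper:1", "author:smith")
graph.link("paper:2", "author:smith")
graph.link("paper:1", "disease:covid-19")
```

Calling `graph.neighbours("author:smith", "Paper")` lists both papers by that author, and the same call with a different entity type gives the diseases or authors linked to a paper.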
The data in the knowledge graph can be exported in various formats for further analysis. If you are interested, we’re happy to provide access to this feature — just drop us an email.
Technology: How it works
The tool is built on Curiosity’s Mosaik knowledge engine, and uses four key components.
Data Ingestion and Extraction
Connector code ingests the original dataset into the knowledge graph. In most applications we also use extractors to pull text out of files (e.g. PDF, DOC, etc.), but they were not necessary here because we already have access to the textual data.
A suite of natural language processing models covers tasks such as tokenisation, part-of-speech (POS) tagging, named entity recognition (NER), entity linking, and embedding. The NLP models are combined into pipelines for each document type (here only the papers). Curiosity’s Catalyst library, which includes all of our NLP models, is available as open-source software on GitHub.
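The idea of one pipeline per document type can be sketched in a few lines of Python. This is a generic illustration and does not use Catalyst or reflect its API: the step functions, the `PIPELINES` table, and the crude upper-case heuristic for abbreviations are all hypothetical stand-ins for trained models.

```python
import re
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Step = Callable[[Document], Document]


def tokenise(doc: Document) -> Document:
    """Split the text into simple alphanumeric tokens."""
    doc["tokens"] = re.findall(r"[A-Za-z0-9-]+", doc["text"])
    return doc


def tag_abbreviations(doc: Document) -> Document:
    """Mark upper-case tokens as abbreviation candidates (a crude heuristic)."""
    doc["abbreviations"] = [t for t in doc["tokens"] if len(t) > 1 and t.isupper()]
    return doc


# One ordered list of steps per document type; here only papers exist.
PIPELINES: Dict[str, List[Step]] = {"paper": [tokenise, tag_abbreviations]}


def process(doc_type: str, text: str) -> Document:
    """Run every step of the pipeline registered for this document type."""
    doc: Document = {"text": text}
    for step in PIPELINES[doc_type]:
        doc = step(doc)
    return doc
```

Each step reads what earlier steps produced, so the order within a pipeline matters: `tag_abbreviations` relies on the tokens written by `tokenise`.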
Curiosity systems use a custom knowledge graph technology. Besides natively supporting full-text search, it has been extended to support search by similarity, filtering by relationships, inference, and more. The graph is tightly integrated with the NLP models, which allows fast model training and enables similarity queries over the embedding vectors using the HNSW algorithm.
UIs are built with Curiosity’s open-source Tesserae component library, which is itself built on top of the Bridge.NET compiler. Besides the open-source components, we provide simple-to-use interfaces for integrating search, graph visualisation, and topic mining as part of the tech stack. The UI is tightly integrated with the back-end and can pass user feedback back for model retraining (active machine learning).
Empowering Research Projects
In some projects we implement all of the major components above ourselves. Sometimes, however, people prefer to tackle more of the work on their own, for example writing the data ingestion or preprocessing themselves. While this is currently done using C#, a Python Connector library is under active development.
Some of our partners take it even further and develop custom UIs themselves. To illustrate how much code is(n’t) required to populate and customise this CORD-19 project, we’ve published the entire source code used to build the website on GitHub. If you would like to try it locally, you can get a free developer license and access to our Docker repository. Just get in touch via email or Twitter!
Over the next few weeks we plan to refine the tool with nested topics, updates for topic mining, and additional datasets (e.g. genes and citation networks). Hopefully, as more data gets added and as the data models get extended and refined, more and more useful insights may reveal themselves — so keep an eye out for changes!
Curiosity is an artificial intelligence (AI) startup based in Munich, Germany. Our software helps customers easily build custom, enterprise-grade software for their unstructured text data. Curiosity was founded in 2018 by Rafael Oliveira and Leon Zucchini. It was recognised as one of the most innovative AI companies in Germany in 2019 and 2020. Notable customers include Airbus, Siemens, and Telefonica.