Robert Kirkpatrick at InSTEDD pointed me to a very interesting public health project out of Japan called BioCaster, an ontology-based text mining system that uses linguistic signals on the Web for the early detection and tracking of infectious disease outbreaks.
“The system continuously analyzes documents reported from over 1,700 RSS feeds, classifies them for topical relevance and plots them onto a Google map using geocoded information. The background knowledge for bridging the gap between layman’s terms and formal coding systems is contained in the freely available BioCaster ontology which includes information in eight languages focused on the epidemiological role of pathogens as well as geographical locations with their latitudes/longitudes. The system consists of four main stages: topic classification, named entity recognition (NER), disease/location detection and event recognition. Higher order event analysis is used to detect more precisely specified warning signals that can then be notified to registered users via email alerts. Evaluation of the system for topic recognition and entity identification is conducted on a gold standard corpus of annotated news articles.”
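To make the four stages concrete, here is a minimal Python sketch of how such a pipeline could be wired together. It is not BioCaster's code: the lookup tables, the keyword-based relevance check, the `Event` class and every function name are my own illustrative assumptions, and the coordinates are rough approximations. A real system would use trained classifiers and the full ontology in place of these toy dictionaries.

```python
from dataclasses import dataclass

# Toy lookup tables (assumptions for illustration, not BioCaster data).
DISEASES = {"cholera", "influenza", "dengue"}
LOCATIONS = {"tokyo": (35.68, 139.69), "jakarta": (-6.21, 106.85)}  # approximate lat/lon
OUTBREAK_CUES = {"outbreak", "cases", "infected", "epidemic"}


@dataclass
class Event:
    disease: str
    location: str
    lat: float
    lon: float


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    return [w.strip(".,;:!?\"'()").lower() for w in text.split()]


def is_relevant(tokens):
    """Stage 1, topic classification: a keyword check standing in for a trained classifier."""
    return any(t in OUTBREAK_CUES for t in tokens)


def recognize_entities(tokens):
    """Stage 2, named entity recognition: a dictionary lookup standing in for a learned NER model."""
    entities = []
    for t in tokens:
        if t in DISEASES:
            entities.append((t, "DISEASE"))
        elif t in LOCATIONS:
            entities.append((t, "LOCATION"))
    return entities


def detect_disease_location(entities):
    """Stage 3, disease/location detection: take the first disease and first location mentioned."""
    disease = next((t for t, kind in entities if kind == "DISEASE"), None)
    location = next((t for t, kind in entities if kind == "LOCATION"), None)
    return disease, location


def recognize_event(text):
    """Stage 4, event recognition: return a geocoded Event, or None if the text does not qualify."""
    tokens = tokenize(text)
    if not is_relevant(tokens):
        return None
    disease, location = detect_disease_location(recognize_entities(tokens))
    if disease is None or location is None:
        return None
    lat, lon = LOCATIONS[location]
    return Event(disease, location, lat, lon)
```

Calling `recognize_event("Officials report a cholera outbreak in Jakarta.")` yields an event that could be plotted on a map, while a sentence with no outbreak cue words is dropped at the first stage.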
BioCaster has specific advantages over related initiatives like GPHIN, MedISys, Argus, ProMedMail, EpiSpider and HealthMap. I’ve blogged about these initiatives here and here, but BioCaster combines the following functionalities within a single system:
- Text mining techniques such as entity recognition which aim to generalize to previously unseen terms and expressions;
- Text-level recognition of severity indicators such as international travel or the contamination of blood products;
- Ontology-based inferencing to fill in the gaps, e.g. between a mentioned pathogen and the unmentioned disease it causes, or between symptoms and diseases;
- Direct knowledge of term equivalence within and across languages.
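The last two points, ontology-based inference and term equivalence across languages, can be illustrated with a small Python sketch. The miniature ontology below is a made-up stand-in, not the BioCaster ontology, and the names `CONCEPTS`, `CAUSES`, `TERM_INDEX` and `normalize` are hypothetical.

```python
# Hypothetical miniature ontology: each concept lists its labels per language.
CONCEPTS = {
    "influenza": {"en": ["influenza", "flu"], "fr": ["grippe"], "es": ["gripe"], "ja": ["インフルエンザ"]},
    "cholera": {"en": ["cholera"], "fr": ["choléra"], "es": ["cólera"], "ja": ["コレラ"]},
}

# Pathogen -> disease links, used to infer a disease that the text never names.
CAUSES = {
    "vibrio cholerae": "cholera",
    "influenza a virus": "influenza",
}

# Reverse index from any label in any language to its concept identifier.
TERM_INDEX = {
    label.lower(): concept
    for concept, labels_by_lang in CONCEPTS.items()
    for labels in labels_by_lang.values()
    for label in labels
}


def normalize(term):
    """Map a term to a disease concept, directly or via the pathogen that causes it."""
    key = term.strip().lower()
    if key in TERM_INDEX:
        return TERM_INDEX[key]   # term equivalence within and across languages
    return CAUSES.get(key)       # inference from pathogen to disease, else None
```

With this index, `normalize("Grippe")` and `normalize("flu")` resolve to the same concept, and a report that mentions only "Vibrio cholerae" can still be filed under cholera.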
The system has been operational since 2006 and offers “an intuitive mapping interface [see above] for the general reader as well as an openly available ontology for community re-use.” Future work will focus on extending coverage to new languages and public health threats. A paper on BioCaster is available here.
I’m very interested in this system and would really like to apply the methodology to early detection and tracking of conflict rumors. See this post for more on early warning and natural language parsing.