Speaker: Prof. Mihai Surdeanu, Department of Computer Science (courtesy appointments in Linguistics and Cognitive Science), University of Arizona
Presentation Title: Using Machine Reading to Aid Cancer Understanding and Treatment
Abstract: PubMed, a repository and search engine for biomedical literature, now indexes more than 1 million articles each year. At the same time, a typical large-scale patient profiling effort produces petabytes of data, a volume expected to reach exabytes in the near future. Combining these large profiling data sets with the mechanistic biological information covered by the literature is an exciting opportunity that can yield a causal, predictive understanding of cellular processes. Such understanding can unlock important downstream applications in medicine and biology. Unfortunately, most of the mechanistic knowledge in the literature is not in a computable form and remains largely hidden.
In the first part of the talk I will describe a natural language processing (NLP) approach that captures a system-scale, mechanistic understanding of cellular processes through automated, large-scale reading of scientific literature. At the core of this approach are compact semantic grammars that capture mentions of biological entities (e.g., genes, proteins, protein families, simple chemicals), events that operate over these biochemical entities (e.g., biochemical reactions), and nested events that operate over other events (e.g., catalyses). This grammar-based approach is a departure from recent trends in NLP such as deep learning, but I will argue that it is a better direction for cross-disciplinary projects such as this one. Grammar-based approaches are modular (i.e., errors can be attributed to a specific rule) and are easier for non-NLP users to understand. This means that biologists can actively participate in the debugging and maintenance of the overall system. Additionally, the proposed approach handles other complex language phenomena such as hedging and coreference. I will highlight how these phenomena differ in biomedical texts versus open-domain language.
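As a rough illustration of the layered idea described above (entities, events over entities, and nested events over events), the following minimal Python sketch uses three toy regular-expression rules. It is not the system presented in the talk: the lexicon, the rule names, the labels, and the `Mention` and `read` helpers are all hypothetical and chosen only for this example. Each extracted mention records the rule that produced it, which is one simple way to make errors attributable to a specific rule.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicon and rules, invented for illustration only.
ENTITY = r"(?:MEK|ERK|RAS)"
ENTITY_RULE = re.compile(rf"\b{ENTITY}\b")
EVENT_RULE = re.compile(rf"phosphorylation of (?P<theme>{ENTITY})\b")
NESTED_RULE = re.compile(
    rf"\b(?P<controller>{ENTITY}) catalyzes (?:the )?"
    rf"(?P<event>phosphorylation of {ENTITY})\b"
)


@dataclass(frozen=True)
class Mention:
    label: str        # kind of mention, e.g. "Protein" or "Catalysis"
    text: str         # the matched span of text
    rule: str         # name of the rule that produced this mention
    args: tuple = ()  # argument mentions (entities or other events)


def read(sentence):
    """Apply entity, event, and nested-event rules in that order."""
    entities = [
        Mention("Protein", m.group(0), "protein-lexicon")
        for m in ENTITY_RULE.finditer(sentence)
    ]

    # Events take entities as arguments.
    events = []
    for m in EVENT_RULE.finditer(sentence):
        theme = Mention("Protein", m.group("theme"), "protein-lexicon")
        events.append(
            Mention("Phosphorylation", m.group(0), "phospho-nominal", (theme,))
        )

    # Nested events take another event as an argument.
    nested = []
    for m in NESTED_RULE.finditer(sentence):
        controller = Mention("Protein", m.group("controller"), "protein-lexicon")
        controlled = next(e for e in events if e.text == m.group("event"))
        nested.append(
            Mention("Catalysis", m.group(0), "catalysis-verbal",
                    (controller, controlled))
        )

    return entities + events + nested


if __name__ == "__main__":
    for mention in read("MEK catalyzes the phosphorylation of ERK."):
        print(mention.label, "|", mention.text, "| rule:", mention.rule)
```

Because every mention carries the name of its rule, a domain expert who spots a wrong extraction can trace it to one pattern and edit only that pattern, without retraining anything.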
I will show that the proposed approach performs machine reading with accuracy comparable to that of human domain experts, but at much higher throughput, and, more importantly, that this automatically derived knowledge substantially improves the inference capacity of existing biological data analysis algorithms. Using this knowledge, we were able to identify a large number of previously unidentified but highly statistically significant, mutually exclusively altered signaling modules in several cancers, which led to novel biological hypotheses within the corresponding cancer context.
Bio: Mihai Surdeanu is an Associate Professor in the Department of Computer Science at the University of Arizona. Dr. Surdeanu earned a PhD in Computer Science from Southern Methodist University, Dallas, TX, in 2001. He has 15+ years of experience in building systems driven by natural language processing (NLP) and machine learning. His experience spans both academia (Stanford University, University of Arizona) and industry (Yahoo! Research and two NLP-centric startups). During his career he has published more than 80 peer-reviewed articles, including two articles that were among the top three most cited articles at two different NLP conferences. He has led or been a member of teams that ranked in the top three at seven highly competitive international evaluations of end-user NLP systems such as question answering and information extraction. His work has been funded by several government organizations (DARPA, NIH), as well as private foundations (the Allen Institute for Artificial Intelligence, the Bill & Melinda Gates Foundation). Dr. Surdeanu's current work focuses on using machine reading to extract structure from free text, and using this structure to construct causal models that can be used to understand, explain, and predict hypotheses for precision medicine.