Session 11: Methods Towards A Distant Supervision Paradigm for Clinical Information Extraction: Creating Large Training Datasets for Machine Learning

Presentation Type

Event

Abstract

Background: In the era of big data, a large number of clinical narratives exist in electronic health records. Automatic extraction of key variables from clinical narratives has facilitated many aspects of healthcare and biomedical research. Conventional approaches rely on rule-based natural language processing (NLP) techniques, which require expert knowledge and exhaustive human effort to design rules. Recently, machine learning has shown large performance gains over conventional NLP approaches. Despite these improvements, large manually labeled training datasets remain the crucial building blocks of conventional machine learning methods and key enablers of recent deep learning methods. However, large training datasets are not always readily available and are usually expensive to obtain from human annotators. The problem is more acute in the clinical domain, where the Health Insurance Portability and Accountability Act (HIPAA) rules out methods such as crowdsourcing and annotators must be medical experts.

Method: In this paper, we propose a distant supervision paradigm for clinical information extraction. In this paradigm, rule-based NLP algorithms automatically generate large labeled training datasets. Machine learning models are subsequently trained on these distant labels with word embedding features.

Results: We study the effectiveness of the proposed framework on two clinical information extraction tasks: the i2b2 smoking status extraction shared task and a fracture extraction task at our institution. We tested three prevalent machine learning models, namely Convolutional Neural Networks (CNN), Support Vector Machines, and Random Forest.

Conclusion: The experimental results show that the proposed distant supervision paradigm enables machine learning models to learn rules that approximate the gold standard from distant labels. Moreover, machine learning models trained on the distant labels generated by a rule-based NLP algorithm could outperform that NLP algorithm given sufficient data. Additionally, we showed that the CNN was more sensitive to data size than the conventional machine learning models, and that all the tested machine learning methods were viable options for the distant supervision paradigm.
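The distant supervision idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the keyword rules, the example notes, and the function names are hypothetical, and a bag-of-words naive Bayes classifier stands in for the word embedding features and the CNN, Support Vector Machine, and Random Forest models used in the study. The sketch only shows the data flow: a rule-based labeler produces noisy labels for unlabeled notes, and a statistical model is then trained on those labels.

```python
import math
import re
from collections import Counter

# Hypothetical keyword rules standing in for an expert-written rule-based NLP system.
SMOKING_TERMS = re.compile(r"\b(smokes|smoker|tobacco use|cigarettes)\b", re.IGNORECASE)
NEGATION_TERMS = re.compile(r"\b(denies|never|no|non)\b", re.IGNORECASE)


def rule_label(note):
    """Assign a distant (noisy) label to a note using hand-written rules."""
    if SMOKING_TERMS.search(note) and not NEGATION_TERMS.search(note):
        return "smoker"
    return "non-smoker"


def tokenize(note):
    """Lowercase the note and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", note.lower())


def train_naive_bayes(notes, labels):
    """Count notes per class and words per class (assumed stand-in model)."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for note, label in zip(notes, labels):
        word_counts[label].update(tokenize(note))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return class_counts, word_counts, vocab


def predict(model, note):
    """Return the class with the highest Laplace-smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total_notes = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_notes)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(note):
            if token in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up unlabeled notes; no human annotation is involved at any step.
unlabeled_notes = [
    "Patient smokes one pack of cigarettes daily.",
    "Current smoker with heavy tobacco use.",
    "Patient denies tobacco use.",
    "Never smoker, exercises daily.",
    "No history of cigarettes or alcohol.",
    "Admitted for hip fracture after a fall.",
]

# Step 1: the rule-based labeler generates the distant labels.
distant_labels = [rule_label(note) for note in unlabeled_notes]

# Step 2: a statistical model is trained on the distant labels only.
model = train_naive_bayes(unlabeled_notes, distant_labels)

# Step 3: the trained model labels a new note.
print(predict(model, "Patient smokes a pack daily with heavy use."))
```

In this toy setting the trained model can label notes from word statistics alone, without consulting the rules at prediction time. Whether such a model matches or exceeds the rule-based labeler depends on the amount of data, as the conclusion of the abstract notes.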

Start Date

2-12-2018 3:30 PM

End Date

2-12-2018 5:00 PM

University Student Union: Dakota Room 250 A/C