Leveraging Large Language Models for Extracting Protein-Protein Interactions from Biomedical Corpora
Presentation Type
Poster
Student
Yes
Abstract
The extraction of protein-protein interactions (PPIs) is pivotal to our understanding of genetic mechanisms, disease pathogenesis, and drug development. With the rapid growth of the biomedical literature, automated and accurate PPI extraction has become essential for efficient scientific discovery. This study leverages large language models, specifically generative pre-trained transformers (GPT) and bidirectional encoder representations from transformers (BERT), for the extraction of PPIs. We evaluated the capability of GPT and BERT models for PPI identification using three manually curated gold-standard corpora: Learning Language in Logic (LLL), Human Protein Reference Database (HPRD50), and Interaction Extraction Performance Assessment (IEPA). Notably, BioBERT emerged as the leader, recording the highest recall (91.95%) and an F1-score of 86.84% on the LLL dataset. Interestingly, despite not being trained specifically on biomedical text, GPT-4 achieved commendable performance, with the highest precision (88.37%) and a closely comparable F1-score of 86.49% on the same dataset. On the HPRD50 and IEPA datasets, BERT-based models continued to lead in overall effectiveness; nonetheless, GPT-4 remained closely competitive, demonstrating its potential for accurately detecting PPIs in text. These results suggest promising directions for future work on fine-tuning GPT-4 for specialized tasks in the biomedical domain.
Keywords – PPI, Large Language Model (LLM), GPT, BERT.
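The precision, recall, and F1-score figures reported in the abstract follow the standard definitions computed from true-positive, false-positive, and false-negative counts. A minimal sketch of that arithmetic (the counts below are hypothetical illustrations, not values from the study):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion counts.

    tp: interactions the model correctly extracted (true positives)
    fp: extracted pairs that are not true interactions (false positives)
    fn: true interactions the model missed (false negatives)
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: 9 correct extractions, 2 spurious, 1 missed.
p, r, f1 = precision_recall_f1(9, 2, 1)
print(f"P={p:.2%}  R={r:.2%}  F1={f1:.2%}")
```

The F1-score (harmonic mean) penalizes imbalance between precision and recall, which is why a model can post the best precision yet a slightly lower F1 than a higher-recall competitor, as seen with GPT-4 versus BioBERT on LLL.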
Start Date
2-6-2024 1:00 PM
End Date
2-6-2024 2:00 PM
Location
Volstorff A