Leveraging Large Language Models for Extracting Protein-Protein Interactions from Biomedical Corpora

Presentation Type

Poster

Student

Yes

Abstract

The extraction of protein-protein interactions (PPIs) is pivotal to our understanding of genetic mechanisms, disease pathogenesis, and drug development. With the rapid growth of biomedical literature, automated and accurate PPI extraction is becoming essential for efficient scientific discovery. This study leverages large language models, specifically generative pre-trained transformers (GPT) and bidirectional encoder representations from transformers (BERT), for the extraction of PPIs. We evaluated the capability of GPT and BERT models for PPI identification using three manually curated gold-standard corpora: Learning Language in Logic (LLL), Human Protein Reference Database (HPRD50), and Interaction Extraction Performance Assessment (IEPA). Notably, BioBERT emerged as the leader, recording the highest recall (91.95%) and F1-score (86.84%) on the LLL dataset. Interestingly, despite not being trained specifically on biomedical text, GPT-4 achieved commendable performance, with the highest precision (88.37%) and a closely comparable F1-score (86.49%) on the same dataset. On the HPRD50 and IEPA datasets, BERT-based models continued to lead in overall effectiveness; nonetheless, GPT-4 remained close in performance, demonstrating its potential for accurately detecting PPIs from text. These findings suggest promising directions for future work on fine-tuning GPT-4 for specialized tasks in the biomedical domain.
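As a rough illustration of the GPT-4 evaluation setup described above, the sketch below classifies candidate protein pairs sentence by sentence via the OpenAI chat API and scores the predictions with precision, recall, and F1. The prompt wording, the yes/no label format, and the two toy sentences are assumptions for illustration, not the exact protocol or data used in this study.

```python
# Minimal sketch: sentence-level PPI detection with GPT-4 via the OpenAI API,
# scored with precision/recall/F1. Prompt, labels, and the tiny example corpus
# are illustrative assumptions, not the poster's exact protocol.
from openai import OpenAI
from sklearn.metrics import precision_recall_fscore_support

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are a biomedical information-extraction assistant. "
    "Given a sentence and two protein mentions, answer 'yes' if the sentence "
    "describes an interaction between them, otherwise 'no'.\n\n"
    "Sentence: {sentence}\nProtein 1: {p1}\nProtein 2: {p2}\nAnswer:"
)

def classify_pair(sentence: str, p1: str, p2: str) -> int:
    """Return 1 if the model judges the pair as interacting, else 0."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic answers, so repeated runs score the same
        messages=[{"role": "user",
                   "content": PROMPT.format(sentence=sentence, p1=p1, p2=p2)}],
    )
    return int(resp.choices[0].message.content.strip().lower().startswith("yes"))

# Toy stand-in for a gold-standard corpus such as LLL/HPRD50/IEPA:
# (sentence, protein 1, protein 2, gold label).
examples = [
    ("GerE stimulates cotD transcription by sigma K RNA polymerase.",
     "GerE", "sigma K", 1),
    ("SpoIIID and GerE are both expressed in the mother cell.",
     "SpoIIID", "GerE", 0),
]

gold = [label for *_, label in examples]
pred = [classify_pair(s, p1, p2) for s, p1, p2, _ in examples]
precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, average="binary", zero_division=0)
print(f"P={precision:.2%}  R={recall:.2%}  F1={f1:.2%}")
```

Pinning the temperature to 0 keeps the model's answers stable across runs, which matters when comparing precision, recall, and F1 against the BERT baselines.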

Keywords – PPI, Large Language Model (LLM), GPT, BERT.
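For the BERT-based models named above, a minimal sketch of how BioBERT can be set up for binary PPI classification with HuggingFace transformers is shown below. The dmis-lab/biobert-v1.1 checkpoint is the public BioBERT release; the entity-masking convention and the two-label head are illustrative assumptions, and the classification head is randomly initialized until fine-tuned on a corpus such as LLL.

```python
# Minimal sketch: BioBERT as a binary PPI classifier via HuggingFace
# transformers. Checkpoint name is the public BioBERT release; the
# @PROTEIN$-style masking and hyperparameters are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModelForSequenceClassification.from_pretrained(
    "dmis-lab/biobert-v1.1", num_labels=2)  # labels: no interaction / interaction

# Candidate pairs are commonly marked inline so the encoder can locate them;
# the masking scheme below is one common convention, assumed here.
sentence = "@PROTEIN1$ stimulates cotD transcription by @PROTEIN2$ RNA polymerase."
batch = tokenizer(sentence, return_tensors="pt", truncation=True)

model.eval()
with torch.no_grad():  # inference only; fine-tuning on the corpus comes first
    logits = model(**batch).logits
print("P(interaction) =", logits.softmax(dim=-1)[0, 1].item())
```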

Start Date

2-6-2024 1:00 PM

End Date

2-6-2024 2:00 PM

Location

Volstorff A
