PV24 Speakers

Subject to change.


Ghulam Rasool, PhD

Assistant Member, Moffitt Cancer Center


Ghulam Rasool is an Assistant Member of the Department of Machine Learning, with a secondary clinical appointment in the Department of Neuro-Oncology, at the H. Lee Moffitt Cancer Center & Research Institute, Tampa, FL. His research focuses on building trustworthy multimodal ML/AI models for cancer diagnosis and risk assessment. His research is funded by the National Science Foundation (NSF) and the National Institutes of Health (NIH). He received the 2023 Junior Researcher Award in the Quantitative Sciences at Moffitt.


SESSIONS

Extraction of Discrete Information from Pathology Reports Using Local and Private LLMs
   Mon, Nov 4
   04:20PM - 04:40PM ET

Background: Surgical pathology reports provide detailed descriptions of tumor samples and are the primary communication tool between pathologists and the other clinical specialists involved in a patient's treatment journey. These reports may include critical information such as cancer site, laterality, tumor stage and grade, histology, behavior, and disease codes. Extracting this information as discrete variables has numerous downstream applications, including the maintenance of cancer registries. Cancer registries are databases that capture essential information about cancer patients, and pathology reports are a vital source for creating and maintaining accurate tumor records. Currently, the extraction of relevant information from pathology reports for cancer registries is performed manually, with human experts reviewing the documents and populating the records.

Current Solutions and Their Limitations: Various natural language processing (NLP) methods have been proposed for extracting information from pathology reports. However, these methods often fail to achieve the desired accuracy because of the complex nature of pathology reports, including cancer typing and sub-typing and specialized medical terminology. Additionally, pathology reports may be stored in formats such as PDF or RTF, necessitating pre-processing steps like optical character recognition (OCR), which introduce additional artifacts and noise into the data. Recently, techniques based on large language models (LLMs) have been proposed.
However, using LLMs presents several challenges:

  1. Privacy: most LLMs are accessed via APIs, requiring the transmission of user data over the internet to the LLM server for processing.
  2. Cost: LLMs are billed per token (approximately three-quarters of a word in English), and the cost of processing pathology reports from both historical data and current clinical records can accumulate to substantial amounts.
  3. Computational requirements: running LLMs on-premise necessitates significant investment in computational infrastructure.

Methods: We addressed these challenges by running compressed and quantized LLMs locally, within Moffitt's firewall, to process approximately 7,000 pathology reports from the TCGA project.

Data: The PDF pathology reports from twelve solid cancers (bladder, brain, cervix, colorectal, head and neck, kidney, lung, liver, ovarian, pancreas, prostate, and uterus) were downloaded and stored locally. We developed a software pipeline to load the PDF files, perform OCR, and then use an LLM to extract six discrete variables, along with an explanation for each output selection. The variables were cancer site, laterality, stage, grade, histology, and behavior. Our prompt strategy involved two calls to the LLM using the LangChain library: (1) we prompted the LLM to extract the six variables and provide an explanation for each extraction, and (2) we used Pydantic to force the LLM to output the variables in a JSON dictionary format, using the results of the first call as the input to the second. The pipeline's output was stored in JSON and CSV formats. We experimented with different LLMs, including Mistral, Llama-2, Llama-3, and Mixtral, and found that the Mixtral 8x7B model (quantized at Q4_0, 46.7B parameters) provided the best balance between processing time and accuracy.
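The two-call prompt strategy can be sketched as below. This is a minimal illustration, not the authors' code: `call_local_llm`, the prompt strings, and the JSON key names are assumptions, and a plain dataclass stands in for the Pydantic output model; the pipeline shared on GitHub is authoritative.

```python
import json
from dataclasses import dataclass, fields

# Hypothetical stand-in for a locally hosted, quantized LLM (e.g. Mixtral
# 8x7B served behind the institutional firewall via LangChain).
def call_local_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to a local LLM endpoint")

@dataclass
class PathologyRecord:
    # The six discrete variables extracted from each report.
    site: str
    laterality: str
    stage: str
    grade: str
    histology: str
    behavior: str

# Illustrative prompts; the actual prompts are published with the code.
EXTRACT_PROMPT = (
    "From the pathology report below, extract the cancer site, laterality, "
    "stage, grade, histology, and behavior, with a short explanation for "
    "each choice.\n\nReport:\n{report}"
)
FORMAT_PROMPT = (
    "Rewrite the extraction below as a JSON object with exactly the keys "
    "site, laterality, stage, grade, histology, behavior.\n\n{extraction}"
)

def extract_record(report_text: str, llm=call_local_llm) -> PathologyRecord:
    # Call 1: free-text extraction with an explanation per variable.
    explained = llm(EXTRACT_PROMPT.format(report=report_text))
    # Call 2: coerce call 1's output into a strict JSON dictionary.
    raw = llm(FORMAT_PROMPT.format(extraction=explained))
    data = json.loads(raw)
    # Schema check analogous to the Pydantic model: all six keys required.
    return PathologyRecord(**{f.name: data[f.name] for f in fields(PathologyRecord)})
```

The explanation requested in the first call gives the reviewer a rationale to audit, while the second call isolates formatting from extraction, so a malformed JSON response can be retried without re-running the extraction step.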
Our experiments were conducted on a desktop computer with an NVIDIA RTX A4500 (20GB VRAM) GPU and a data center compute node with an NVIDIA A30 (24GB VRAM) GPU. A pathology expert on our team analyzed the extracted variables. Given the large number of reports (6,944 in total across all cancers), the expert randomly selected 10 to 30 reports from each cancer and manually verified the correctness of the LLM-extracted variables. Reports for which OCR failed to produce correct text were excluded from the experiment. Any variables not present in the original report or not applicable (as determined by the subject matter expert) were excluded from the analysis.

Preliminary Results: The LLM extracted all six variables with an average accuracy of 99.2% across all variables and cancer sites in 145 reports. The extraction accuracy for each variable was as follows: cancer site 100%, laterality 99.3%, stage 99.3%, grade 98.6%, histology 100%, and behavior 97.9%. When analyzing accuracy by cancer type, we observed the following results: bladder 100% (n=22), brain 83.3% (n=12), cervix 90.9% (n=11), colorectal 92.8% (n=14), head and neck 95% (n=18), kidney 100% (n=9), lung 100% (n=22), liver 100% (n=11), prostate 100% (n=15). The PDF files, source code, LLM prompts, and configurations have been publicly shared via GitHub.
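The evaluation protocol above, i.e. per-variable accuracy with not-applicable variables excluded, can be sketched as follows. The function name and the `N/A` marker are illustrative assumptions, not taken from the published code.

```python
from collections import defaultdict

# Assumed marker for variables the expert judged absent or not applicable.
NOT_APPLICABLE = "N/A"

def accuracy_by_variable(extracted: list[dict], gold: list[dict]) -> dict:
    """Per-variable accuracy over paired (LLM output, expert annotation) records.

    Variables the expert marked not applicable are excluded from both the
    numerator and the denominator, matching the protocol in the abstract.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for llm_rec, gold_rec in zip(extracted, gold):
        for var, truth in gold_rec.items():
            if truth == NOT_APPLICABLE:
                continue  # excluded from the analysis
            total[var] += 1
            if llm_rec.get(var) == truth:
                correct[var] += 1
    return {var: correct[var] / total[var] for var in total}
```

The same tallies, grouped by cancer type instead of variable name, yield the per-cancer accuracies reported above.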

 

Learning Objectives

  1. Explain the current challenges and limitations in extracting discrete information from pathology reports for cancer registries.
  2. Describe the innovative approach using compressed, quantized, private, and local LLMs for information extraction.
  3. Present the preliminary results and performance evaluation of the developed system.