Improved statistical benchmarking of digital pathology models predicting cell and tissue types using nested pairwise frames evaluation


Background: As digital pathology expands in both research and clinical settings, the ability to accurately evaluate AI model performance relative to pathologists is critical. However, several challenges limit the utility of standard computer-vision model evaluation techniques for pathology specimens. Specifically, while AI models can easily perform exhaustive detection or segmentation on whole slide images (WSIs), obtaining manual annotations for every pixel in a WSI is prohibitive in cost, time, and scale. Inter-pathologist variation in substance identification in WSIs limits the ability to define “ground truth,” making the relationship between model and pathologist predictions difficult to interpret. Furthermore, pathologist-consensus-based model evaluation methods define a single ground truth by combining annotations, which discards data in regions where consensus cannot be reached. To overcome these challenges, we developed a nested pairwise patch-based analysis method for consensus-free evaluation of tissue and cell model predictions against pathologist annotations.

Methods: Small patches (“frames”) were sampled from hematoxylin and eosin-stained melanoma WSIs as follows: 315 frames were sampled across 83 WSIs for tissue model evaluation, and 200 frames were sampled across 72 WSIs for cell model evaluation. Each cell or tissue frame was exhaustively annotated by four pathologists. To assess inter-pathologist agreement, we performed a series of pairwise analyses: for each pair of pathologists, one “reference pathologist” was treated as the “ground truth,” and the other “evaluation pathologist” was compared against the reference. We repeated this process so that every pathologist served as both reference and evaluation pathologist. To assess the model, the evaluation pathologist annotations were then replaced by AI model predictions for the same set of frames. To evaluate agreement between an evaluation pathologist (or model) and a reference pathologist, we computed pairwise recall, precision, and F1 metrics for each substance (tissue or cell type). The metrics were averaged first across reference pathologists for each evaluation pathologist or model, and then across evaluation pathologists, weighted by the number of frames annotated, to produce a single set of model and pathologist agreement metrics. Confidence intervals for mean metric values were obtained by bootstrapping, with frames as the resampling unit.
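The nested averaging described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes, for illustration only, that each annotator's output for a frame is a list of class labels aligned unit by unit (for example, one label per pixel), that labels are pooled across frames before computing each pairwise metric, and that every annotator labeled every frame. The names `prf`, `pooled`, `evaluator_f1`, `pathologist_f1`, and the toy data are all hypothetical.

```python
def prf(reference, evaluation, substance):
    """Precision, recall, and F1 for one substance, treating `reference` as ground truth."""
    pairs = list(zip(reference, evaluation))
    tp = sum(r == substance and e == substance for r, e in pairs)
    fp = sum(r != substance and e == substance for r, e in pairs)
    fn = sum(r == substance and e != substance for r, e in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def pooled(annotations, annotator, frames):
    """Concatenate one annotator's labels over the given frames (assumed pooling scheme)."""
    return [label for f in frames for label in annotations[annotator][f]]


def evaluator_f1(annotations, evaluator, pathologists, substance, frames):
    """Inner average: mean F1 of one evaluator (pathologist or model)
    against every reference pathologist other than itself."""
    refs = [p for p in pathologists if p != evaluator]
    ev_labels = pooled(annotations, evaluator, frames)
    scores = [prf(pooled(annotations, r, frames), ev_labels, substance)[2] for r in refs]
    return sum(scores) / len(scores)


def pathologist_f1(annotations, pathologists, substance, frames):
    """Outer average: mean over evaluation pathologists, weighted by frames annotated."""
    weights = [sum(f in annotations[p] for f in frames) for p in pathologists]
    scores = [evaluator_f1(annotations, p, pathologists, substance, frames)
              for p in pathologists]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Hypothetical toy data: three pathologists and one model, two frames of four units each.
ANNOTATIONS = {
    "P1": {"f1": ["tumor", "tumor", "stroma", "stroma"],
           "f2": ["tumor", "stroma", "stroma", "necrosis"]},
    "P2": {"f1": ["tumor", "tumor", "stroma", "stroma"],
           "f2": ["tumor", "stroma", "stroma", "necrosis"]},
    "P3": {"f1": ["tumor", "stroma", "stroma", "stroma"],
           "f2": ["tumor", "stroma", "stroma", "necrosis"]},
    "model": {"f1": ["tumor", "tumor", "stroma", "stroma"],
              "f2": ["tumor", "tumor", "stroma", "necrosis"]},
}
PATHOLOGISTS = ["P1", "P2", "P3"]
FRAMES = ["f1", "f2"]

model_f1 = evaluator_f1(ANNOTATIONS, "model", PATHOLOGISTS, "tumor", FRAMES)
path_f1 = pathologist_f1(ANNOTATIONS, PATHOLOGISTS, "tumor", FRAMES)
print(f"model F1 = {model_f1:.3f}, pathologist F1 = {path_f1:.3f}")
```

Because the model passes through the same `evaluator_f1` function as each pathologist, the two resulting numbers are directly comparable without ever forming a consensus label. Cell detection would additionally require matching detected cells between annotators before labels can be compared, which this sketch omits.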

Results: We applied our method to evaluate the performance of a melanoma model that segments tissue regions (“tissue model”) and detects and classifies individual cells (“cell model”) in the tumor microenvironment. For the tissue model, average differences in performance between model and pathologists tended to be small, with the exception of cancer stroma and necrosis, where model recall was significantly better than that of pathologists. For the cell model, average differences in performance between model and pathologists also tended to be small, with the exception of plasma cells, where model precision was significantly worse than that of pathologists. Quantitative nested pairwise evaluation results were consistent with qualitative pathologist evaluation.

Conclusions: Nested pairwise evaluation provides a benchmarking framework in which digital pathology models are treated equivalently to pathologists. Our method benchmarks a model relative to pathologists and enables rigorous non-inferiority and superiority hypothesis testing. These results suggest that nested pairwise evaluation may be useful for quantitatively guiding the development and improvement of digital pathology models to be deployed on clinical specimens.
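One way to turn frame-level bootstrapping into a non-inferiority or superiority check is sketched below in Python. It is an illustration under stated assumptions, not the procedure used in this work: it takes hypothetical per-frame agreement scores for the model and for pathologists as given, resamples frames with replacement, and forms a percentile confidence interval for the mean model-minus-pathologist difference. A full nested analysis would more likely recompute the pairwise metrics on each resample. The function names and the non-inferiority `margin` are illustrative choices.

```python
import random


def bootstrap_diff_ci(model_scores, pathologist_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-frame difference (model minus pathologists),
    obtained by resampling frames with replacement."""
    rng = random.Random(seed)
    n = len(model_scores)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # one bootstrap sample of frames
        diffs.append(sum(model_scores[i] - pathologist_scores[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def is_non_inferior(ci, margin):
    """Non-inferior if the whole interval lies above -margin (margin is an assumed choice)."""
    return ci[0] > -margin


def is_superior(ci):
    """Superior if the whole interval lies above zero."""
    return ci[0] > 0.0


# Hypothetical per-frame F1 values for ten frames.
pathologist_scores = [0.80, 0.75, 0.90, 0.85, 0.70, 0.88, 0.79, 0.83, 0.77, 0.81]
model_scores = [s + 0.10 for s in pathologist_scores]  # model uniformly 0.10 higher

ci = bootstrap_diff_ci(model_scores, pathologist_scores)
print(ci, is_non_inferior(ci, margin=0.05), is_superior(ci))
```

Resampling whole frames, rather than individual pixels or cells, keeps the correlated annotations within a frame together, which matches the choice of frames as the resampling unit described in the Methods.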



Learning Objectives:

  1. Recognize the challenges of evaluating AI models for digital pathology.
  2. Describe the nested pairwise frames evaluation approach.
  3. Interpret nested pairwise frames evaluation results.


Presented by:


Ylaine Gerardin, PhD

Principal Biomedical Data Scientist



Ylaine Gerardin, PhD, Principal Biomedical Data Scientist at PathAI, leads efforts to apply and interpret AI histopathology models across a range of indications, including oncology and chronic liver disease. She has worked on defining novel model evaluation strategies and leveraging human-interpretable features as biomarkers in clinical research. Prior to joining PathAI, she worked as a discovery and translational computational biologist at a clinical-stage startup. Ylaine received her PhD in Systems Biology from Harvard in 2016 and dual BS degrees in Electrical Engineering and Biology from MIT.