Introduction: Recently, vision transformers (ViTs) were shown to outperform convolutional neural networks (CNNs) when pre-trained on sufficient amounts of data. Compared to CNNs, ViTs have a weaker inductive bias, allowing for more flexible feature detection. ViTs are a promising alternative due to their reported accuracy on large-scale datasets and their advancements in self-supervised learning and multimodal training, where the flexibility of the transformer architecture is beneficial.

Methods: Our algorithm was written in Python using the Torch, Torchvision, and scikit-learn (sklearn.metrics) libraries. Parallel processing was performed on an NVIDIA GPU with Compute Unified Device Architecture (CUDA). We obtained whole slide images (WSIs) of H&E-stained slides from 20 cases: 10 of anaplastic large cell lymphoma (ALCL) and 10 of classical Hodgkin lymphoma (cHL), scanned on a Philips SG60 scanner at 20x magnification. From each WSI, 60 image patches of 100 × 100 pixels were extracted for feature extraction, yielding 1,200 image patches; of these, 1,080 were used for training and 120 for testing. The cases were divided into two cohorts of 10 cases per diagnostic category. For the ViT model, the training set included 540 image patches from ALCL cases and 540 from cHL cases, and the test set consisted of 60 image patches from ALCL cases and 60 from cHL cases. Our study presents the first direct comparison of predictive performance between a CNN and a ViT model on the same dataset of ALCL and cHL cases.

Results: The ViT model achieved a diagnostic accuracy of 100% and an F1 score of 1.0 on the independent test set, matching the performance of our previously developed CNN model on the same dataset. The confusion matrix demonstrated perfect classification, with no false positives or false negatives for either the ALCL or cHL class. The F1 score remained at 1.0 across all 200 training epochs, indicating consistent, high-quality performance without degradation or overfitting and supporting the model's interpretability and robustness.

Conclusion: This study presents the first direct, head-to-head comparison between a ViT and a CNN model on the same digital pathology dataset of ALCL and cHL. Our findings demonstrate that the ViT model can achieve diagnostic performance equivalent to that of a CNN, with both models achieving 100% accuracy and perfect F1 scores in distinguishing between these two lymphoma subtypes.
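For illustration, a patch-level ViT classification pipeline of the kind described in Methods could be set up roughly as below. This is a minimal sketch, not the authors' exact implementation: the vit_b_16 backbone, the patches/train/{ALCL,cHL} directory layout, the resizing of 100 × 100 patches to 224 × 224, and the optimizer and learning-rate settings are all assumptions, since the abstract does not specify them.

```python
# Minimal sketch of a patch-level ViT classifier for ALCL vs. cHL (assumptions noted above).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.models import vit_b_16, ViT_B_16_Weights

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 100x100 H&E patches resized to the 224x224 input expected by vit_b_16 (assumption).
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Hypothetical directory layout: one subfolder per class (ALCL, cHL).
train_ds = datasets.ImageFolder("patches/train", transform=tfm)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=4)

# Pretrained ViT backbone with its classification head replaced for the 2-class task.
model = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
model.heads.head = nn.Linear(model.heads.head.in_features, 2)
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for epoch in range(200):  # 200 training epochs, as reported in Results
    for images, labels in train_dl:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```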
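Similarly, the accuracy, F1 score, and confusion matrix reported in Results could be computed on the 120-patch test set with sklearn.metrics along the following lines. The sketch assumes the fine-tuned model, the tfm transform, and the device defined in the training sketch, plus a hypothetical patches/test directory with one subfolder per class.

```python
# Evaluation sketch: assumes `model`, `tfm`, and `device` from the training sketch,
# and a hypothetical patches/test/{ALCL,cHL} directory layout.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

test_ds = datasets.ImageFolder("patches/test", transform=tfm)
test_dl = DataLoader(test_ds, batch_size=32, shuffle=False)

model.eval()
y_true, y_pred = [], []
with torch.no_grad():
    for images, labels in test_dl:
        logits = model(images.to(device))
        y_pred.extend(logits.argmax(dim=1).cpu().tolist())
        y_true.extend(labels.tolist())

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))            # binary F1 over the two classes
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```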
Learning Objectives: