by Simon Thomas, PhD Student, University of Queensland - Institute of Molecular Bioscience
The burgeoning development of machine learning systems for digital pathology places us in an exciting time. One often reads that the complexities of anatomical pathology are now, or soon will be, unravelled by the latest machine learning technologies. Such incredible claims are bolstered by the experience of seeing a system classify histology images (or, better, training one’s own). It truly is remarkable that this is even possible. Yet, as this becomes a more common experience for the pathology community, it is likely that our current expectations and ambitions will be tempered by the constraints of reality.
I remember being awe-struck at how realistic computer graphics were in the late ’80s and early ’90s. Flight simulators were the closest thing to actual flying! Caught up in how impressive the achievement was for its time, I was too uncritical to realise how far away close actually was. A modern example is the controversial phenomenon of Deepfakes: the ability to generate fake image, video, and audio content. The technology has recently progressed[1] to being near-indistinguishable from the real thing. But as before, how near depends on the degree of criticism we apply. Once the awe subsides, we notice abounding artefacts and discontinuities. In the midst of impressive technological advancements we have a proclivity to leap across chasms as if they were not there.
Similarly, we have begun to hear phrases such as “clinical-grade” and “pathologist-level” in the digital and computational pathology literature[2-5]. Impressively, the technology has been scaled to enormous datasets across many cancer types, with performance that ostensibly rivals pathologists. However, reality raises a question of great significance: does the machine learning system really understand the problem? Merely contemplating this question makes it clear that such systems are far from “clinical-grade” and “pathologist-level” in a very real sense, just as 90s computer graphics and Deepfakes are far from what they purport to be.
The most common approaches, binary classification and segmentation, fall far short of the real-world problem. Pathologists only appear to classify tissue as cancer and non-cancer. In reality they contextualise the tissue, combining macroscopic and microscopic features, all of which depend on the site of the specimen, the type of biopsy, the grossing and tissue-embedding steps, and the quality of sectioning and staining. Further, they can deconstruct an image into the respective tissue types and subtypes, e.g. muscle (cardiac, skeletal, smooth), independent of their relevance to a particular disease state. All this information is then integrated across multiple slides to produce a single diagnosis (or a request for deeper levels) and recommend a clinical intervention. Needless to say, their knowledge spans multiple cancer types as well as myriad other diseases. This is the clinical-grade, pathologist-level problem. Let us not confuse it for something much less. Indeed, let us not sell pathologists short.
Grounding ourselves in the reality of what pathologists do enables us to meaningfully evaluate progress. It also forces us to consider more deeply the role of artificially intelligent systems in current and future practice. In the short term, are we simply looking for tools which can improve workflow efficiency and diagnostic accuracy? Can current machine learning realistically contribute to this end, or is it as yet only of academic appeal? If there is clinical utility, do available machine learning systems provide meaningful value, given that the technology is improving so rapidly? Is the regulatory framework capable and flexible enough to translate this and future technology into practice? How can we better evaluate such systems to discern their knowledge and limitations? Is the technology really there? What do we actually require of these systems, and what will it take to get there? The digital pathology community needs to continue asking and refining such questions as the technology improves, and be sure to evaluate them critically in the context of the real-world problem.
My own area of research is in interpretability techniques for machine learning systems. It is common to refer to machine learning systems as “black-boxes” because we don’t know how a decision has been made. We infer that a system understands the problem by way of its performance on unseen test data, referred to as its generalisability. If a system can correctly and reliably classify cancer images it has never seen before, we reasonably conjecture that it has learned the underlying pattern. But should we have reservations about using it in practice? For extra assurance, perhaps we may ask: what part of the image is informing the classification? We can overlay a probability heat map, revealing that the likely-cancerous regions do indeed correspond to our own identifications of cancer. It may appear that the system does indeed understand the problem. So then, are such systems really black boxes?
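To make the heat-map idea concrete, here is a minimal sketch of patch-level probability mapping. It assumes a hypothetical trained binary classifier exposing a `predict_proba`-style method and a slide region already loaded as a NumPy array; it is an illustration of the general technique, not a reference to any particular published implementation.

```python
# Sketch: tile a slide region, score each tile with a (hypothetical) binary
# classifier, and overlay the resulting probability map on the tissue.
import numpy as np
import matplotlib.pyplot as plt


def tumour_heatmap(image: np.ndarray, model, patch: int = 256) -> np.ndarray:
    """Slide a non-overlapping window over the image and record the
    predicted probability of cancer for each patch."""
    rows, cols = image.shape[0] // patch, image.shape[1] // patch
    heat = np.zeros((rows, cols), dtype=np.float32)
    for r in range(rows):
        for c in range(cols):
            tile = image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
            # Assumed interface: returns a single probability of cancer.
            heat[r, c] = model.predict_proba(tile)
    return heat


def overlay(image: np.ndarray, heat: np.ndarray) -> None:
    """Display the coarse heat map semi-transparently on top of the tissue."""
    plt.imshow(image)
    plt.imshow(heat, cmap="inferno", alpha=0.4,
               extent=(0, image.shape[1], image.shape[0], 0))
    plt.axis("off")
    plt.show()
```

An overlay like this is reassuring when the hot regions match our own identification of tumour, but it still only tells us where the score is high, not why.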
Recently, experts[6] have come to criticise the use of the term black-box, arguing that the human brain is equally a black box and pointing out that we have nevertheless made progress in understanding it. However, for me the label black-box denotes something much deeper. Brains can produce explanations of their output (albeit fallible) by making explicit their internal “model of the world”. Conversely, by virtue of their implementation, machine learning systems have a limited ability to explain themselves, leaving much of their knowledge implicit, e.g. in the hidden layers of deep learning systems. Such a system requires us to make inferences about its model of the world through simple cancer or non-cancer classifications. In this sense, the black-box label denotes the extent to which we cannot tell how wrong the system’s model of the world really is. That is, such systems have a limited ability to explain themselves because they lack the rich outputs that a brain provides.
In pursuit of richer outputs, we can augment machine learning systems using post-hoc analysis techniques. The most insightful are those involving visualisation[7], which reveal these networks to have a practical understanding of the world, but also one that is extraordinarily naive. It has also been discovered that systems are vulnerable to adversarial examples[8], where imperceptible changes to the input can lead to drastically different outputs. Fortunately, efforts have been directed towards making systems robust against such examples, which consequently improves[9] the learned model. It is clear that such robust systems are beginning to represent, and hence see, the world as we do.
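As a hedged illustration of how little it takes to construct such an example, below is a minimal sketch of the fast gradient sign method (FGSM), one of the simplest adversarial attacks. It assumes an arbitrary differentiable PyTorch classifier over batched image tensors; the function names and the epsilon value are illustrative, not taken from the referenced work.

```python
# Sketch: fast gradient sign method (FGSM) against a generic PyTorch classifier.
import torch
import torch.nn.functional as F


def fgsm_attack(model: torch.nn.Module, image: torch.Tensor,
                label: torch.Tensor, epsilon: float = 0.01) -> torch.Tensor:
    """Perturb a batched image (N, C, H, W) by a small step in the direction
    that most increases the classification loss for the true labels."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # For small epsilon the perturbation is imperceptible to a human,
    # yet it can be enough to flip the predicted class.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```

Training against perturbations of this kind is precisely the robustness work cited above, and it tends to push the learned representations towards features that humans also find meaningful.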
Following from this we can ask, what model of the world should a computational pathology system have? What human-meaningful relationships exist for us that the model should capture? Thinking back to the real problem described earlier, what model of the world would a pathologist-level system actually need? Approaching the problem from this perspective is qualitatively different to the traditional paradigm of machine learning. We should learn models of the world directly, rather than classifiers that incidentally have them. This would allow for robust and naturally interpretable systems, because they would represent the world the same way a pathologist does. The hope is that we can then use the pattern-recognition skills of machine learning systems to make explicit the subtleties of difficult cases, and further our understanding of disease. It may well be that, once pointed out to us, the pattern is obvious. In this way, machine learning can contribute both clinically and scientifically to pathology.
Holding pathologists in high regard secures an even more exciting future for machine learning systems. The world is explicable, and the fact that machine learning systems can make sense of data that we apparently cannot only further emphasises this. My objection is to building systems that function as oracles: accumulating authoritative dogmas to which we turn, on which we depend, and which we ultimately fail to criticise. Instead, we should build interpretable systems which can improve their knowledge as well as ours.
References:
[1] Shamook. (2019, December 19). Deep Fake Comparison - Deeper metrics of Christmas by Jim Meskimen. Retrieved January 14, 2020, from https://www.youtube.com/watch?v=78L6I6vsfrU.
[2] Hekler, A., Utikal, J. S., Enk, A. H., Berking, C., Klode, J., Schadendorf, D., ... & von Kalle, C. (2019). Pathologist-level classification of histopathological melanoma images with deep neural networks. European Journal of Cancer, 115, 79-83.
[3] Campanella, G., Hanna, M. G., Geneslaw, L., Miraflor, A., Silva, V. W. K., Busam, K. J., ... & Fuchs, T. J. (2019). Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine, 25(8), 1301-1309.
[4] Ström, P., Kartasalo, K., Olsson, H., Solorzano, L., Delahunt, B., Berney, D. M., ... & Iczkowski, K. A. (2019). Pathologist-level grading of prostate biopsies with artificial intelligence. arXiv preprint arXiv:1907.01368.
[5] Wei, J. W., Tafe, L. J., Linnik, Y. A., Vaickus, L. J., Tomita, N., & Hassanpour, S. (2019). Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks. Scientific Reports, 9(1), 3358.
[6] Johnson, K. (2020, January 2). Top minds in machine learning predict where AI is going in 2020. Retrieved January 14, 2020, from https://venturebeat.com/2020/01/02/top-minds-in-machine-learning-predict-where-ai-is-going-in-2020/.
[7] Olah, C., Mordvintsev, A., & Schubert, L. (2017). Feature visualization. Distill, 2(11), e7.
[8] Goodfellow, I., Papernot, N., Huang, S., Duan, R., Abbeel, P., & Clark, J. (2019, March 7). Attacking Machine Learning with Adversarial Examples. Retrieved January 14, 2020, from https://openai.com/blog/adversarial-example-research/.
[9] Engstrom, L., Ilyas, A., Mądry, A., Santurkar, S., Tran, B., & Tsipras, D. (2019, July 3). Robustness beyond Security: Representation Learning. Retrieved January 14, 2020, from http://gradientscience.org/robust_reps/.
Disclaimer: In seeking to foster discourse on a wide array of ideas, the Digital Pathology Association believes that it is important to share a range of prominent industry viewpoints. This article does not necessarily express the viewpoints of the DPA; however, we view it as a valuable perspective with which to facilitate discussion.
Comments on "Pathologist Versus Artificial Pathologist: What Do We Really Want (Need) From Machine Learning"
02/12/2020 at 07:00 PM
Shirley Huang says:
I enjoyed your post. This question of interpretability is very interesting. Machine learning may pick up information that the pathologist did not see, for example stromal features reported to assist in the classification of epithelial breast lesions, and this is not unreasonable since ML is different from human learning. If, as I hope, the pathologist is to benefit from this learning, some way to interpret or visualize the ML is needed. It is great to hear that interpretability is your focus. I don’t think pathologists should embrace ML blindly, but on a more practical note, how should pathologists evaluate algorithms? Algorithms that work on the train/test data may not perform as well in the real world. The described vulnerability to adversarial alteration raises concerns about reliability. What standard information should be provided about data sets, and are there ways to look for bias or other issues? What should the pathologist look at when choosing one prostate cancer algorithm over another, for example? I hope to see more questions and answers in these areas.
Thanks for the intriguing post
02/18/2020 at 07:00 PM
Simon Thomas says:
In reply to Shirley Huang. Hi Shirley, thanks for commenting and for your contribution. The questions you have raised are interesting and deserving of a blog post themselves. Although machines learn differently, it is unlikely that a biologically meaningful pattern, if it exists, would be one that we ourselves could not understand. This is most obvious from the vast majority of work showing cancerous tissue contributing to cancer classification. The difficulty is that machine learning systems often leverage spurious correlations to improve performance on average, and this is something we have previously had no direct control over. A good example is the heat maps for misclassified BCC in reference 3 above. This is an excellent precedent for the transparency of such algorithms.
How AI could be used in practice is wide ranging, and so pathologists are best placed to evaluate the suitability for a given task. This will be made easier as we develop systems with richer outputs, providing more avenues for criticism. I personally don’t find arguments for high performance on large unseen test datasets as compelling as others do. But for an appropriate use case, the technology is arguably already there.