Mark D. Zarella, PhD, and David S. McClintock, MD, Mayo Clinic
Among the many costs of digital pathology, data storage is often one of the first brought up. It’s true, whole-slide images are large, sometimes exceeding 1 GB each, and high-throughput slide scanning has enabled departments to digitize thousands of slides per day. However, a recurring theme in digital pathology is that every use case is different and therefore a user’s needs and strategy will also be different.
Image retention is optional
The first thing to note is that you usually don’t have to store your images. Unlike the glass slide, there is no specific requirement for storing whole-slide images, and the practical argument for long-term image storage may in fact be quite niche. Academic medical centers have often led the way in large-scale digital pathology deployments, and have published extensively on them, but they also usually have other considerations (such as research and education) that may warrant a strict long-term data retention strategy. Other, more targeted use cases for digital pathology may not benefit from retaining all images (or digitizing them in the first place). Some centers may instead wish to digitize on demand and retain only those images that are interesting or diagnostically important. Others may choose to retain all images for the period when immediate access is most useful and discard them once they reach a certain age. Pathology departments can benefit from weighing the costs of storage against the costs of retrieving old material. For example, some departments may need access to prior slides so infrequently that it makes very little sense to retain one million slides from three years ago just to rapidly retrieve the handful of images they may need.
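The tradeoff between storing and retrieving can be roughed out with simple arithmetic. The sketch below is a minimal illustration, not a costing model: the function names, the per-GB storage price, and the per-retrieval cost (pulling the glass slide and rescanning it) are hypothetical placeholders that a department would replace with its own figures.

```python
def annual_retention_cost(num_slides, gb_per_slide, usd_per_gb_year):
    """Yearly cost of keeping every image online (all inputs are assumptions)."""
    return num_slides * gb_per_slide * usd_per_gb_year


def annual_retrieval_cost(retrievals_per_year, usd_per_retrieval):
    """Yearly cost of pulling and rescanning glass slides on demand instead."""
    return retrievals_per_year * usd_per_retrieval


def cheaper_option(retention_usd, retrieval_usd):
    """Name the cheaper approach; a tie favors retention for convenience."""
    return "retain" if retention_usd <= retrieval_usd else "rescan on demand"


if __name__ == "__main__":
    # Hypothetical inputs: one million 1 GB images, a made-up storage price,
    # and a made-up labor/scanner cost per retrieved slide.
    retain = annual_retention_cost(1_000_000, 1.0, 0.25)
    rescan = annual_retrieval_cost(50, 20.0)
    print(f"retain: ${retain:,.0f}/yr, rescan: ${rescan:,.0f}/yr")
    print("cheaper:", cheaper_option(retain, rescan))
```

A comparison like this leaves out things that are hard to price, such as turnaround time when a prior slide is needed urgently, so it is best treated as a starting point for discussion.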
Not all whole-slide images are created equal
There are a number of factors that contribute to a whole-slide image’s file size, which usually include the amount of tissue on the slide, the image compression strategy, the scanning strategy, and other more esoteric factors like the number of channels (for multichannel images) or z-stack planes. Departments should factor data storage into their digitization strategy at the earliest stage, as it can affect even their choice of whole-slide scanner. For example, a scanner that can discard white space by only capturing and encoding tissue can potentially store the same slide at half the size of scanners that require rectangular bounding boxes. Likewise, a scanner that utilizes more sophisticated compression algorithms may be able to encode the same slide without visually discernible loss at half the size or less. If storage costs are a major impediment to digital adoption in a department, these factors should be weighed more heavily prior to initial purchase.
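To see how scanned area and compression drive file size, a back-of-the-envelope estimate can help. The sketch below assumes 24-bit RGB pixels and ignores pyramid levels, container overhead, extra channels, and z-stacks; the function name, the 0.25 µm/pixel resolution, and the 20:1 compression ratio are illustrative assumptions, not properties of any particular scanner.

```python
def estimated_file_size_gb(scanned_area_mm2, microns_per_pixel,
                           compression_ratio, bytes_per_pixel=3):
    """Rough size of the full-resolution layer of a whole-slide image.

    Assumes uncompressed RGB at `bytes_per_pixel` and a uniform
    compression ratio; real files will differ.
    """
    pixels_per_mm = 1000.0 / microns_per_pixel
    pixel_count = scanned_area_mm2 * pixels_per_mm ** 2
    raw_bytes = pixel_count * bytes_per_pixel
    return raw_bytes / compression_ratio / 1e9  # decimal gigabytes


if __name__ == "__main__":
    # Hypothetical slide: a 15 mm x 15 mm bounding box, half of it tissue.
    bounding_box = estimated_file_size_gb(225.0, 0.25, 20.0)
    tissue_only = estimated_file_size_gb(112.5, 0.25, 20.0)
    print(f"bounding box: {bounding_box:.2f} GB")
    print(f"tissue only:  {tissue_only:.2f} GB")
```

Under these assumptions, size scales linearly with scanned area and inversely with compression ratio, which is why both levers can plausibly halve the footprint.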
Storage levels can be strategic
There are a multitude of storage options, including on-prem storage residing in servers or network-attached devices, cloud storage, archival storage (which may not be immediately accessible), and other methods that can be dynamically allocated according to simple rulesets. The storage strategy should be developed at the earliest stages of planning and should reflect intended use cases, in addition to considering institutional data policy, backup, performance, and maintenance. Some may prefer a strategy where: 1) all data are immediately accessible at low latency no matter how old, 2) hot spares can be deployed with very little impact on the end user in case of drive failure, 3) the likelihood of drive failure is reduced by frequent drive replacement, and 4) with the aid of extensive redundancy, disaster recovery is fast and seamless. These features, however, come at a price, and a more cost-conscious institution may be willing to accept some tradeoffs. For example, it may elect to move older data to lower-tier storage with the expectation that the data will be infrequently accessed; such data may take longer to retrieve and incur additional costs only when accessed (but may still represent an overall cost savings). Its disaster recovery plan may be slower or more manual, potentially even leading to some downtime. Or it may have institutional resources that can help manage and maintain on-prem storage with little direct cost to the department, which in some cases may help reduce overall costs. Although these considerations have the potential for huge cost savings, some institutions have a rigid data policy that may constrain the available storage options. Finally, it is important to note that modern storage architectures can be designed so that performance is maintained as capacity increases (i.e., “scale-out” as opposed to just “scale-up”).
Overall, proper planning will ensure that whichever solution you choose maintains the proper performance for your use cases as your storage needs grow.
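A “simple ruleset” for allocating images to storage levels might be nothing more than an age threshold. The sketch below is one possible form, assuming three hypothetical tiers named "hot", "warm", and "archive" and made-up cutoffs of one and three years; real deployments would take their tiers and thresholds from institutional policy and the storage platform in use.

```python
from datetime import date


def choose_tier(scan_date, today, hot_days=365, warm_days=3 * 365):
    """Pick a storage tier from image age alone.

    Tier names and day thresholds are illustrative assumptions.
    """
    age_days = (today - scan_date).days
    if age_days < hot_days:
        return "hot"      # low-latency storage for recent cases
    if age_days < warm_days:
        return "warm"     # cheaper, slower storage
    return "archive"      # lowest cost, slow and possibly billed on access


if __name__ == "__main__":
    today = date(2024, 6, 1)
    for scanned in (date(2024, 1, 1), date(2022, 1, 1), date(2020, 1, 1)):
        print(scanned, "->", choose_tier(scanned, today))
```

Age is only one possible criterion; a ruleset could equally key on case type, research flags, or last access date.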
A sample storage strategy for digital pathology - on a budget
To provide a sample storage deployment is foolhardy because it is so easy to poke holes in any strategy, as there is certainly not a one-size-fits-all method to storage. But we’ll do it anyway, and will then discuss some of its pros and cons.
Imagine an on-prem deployment built around a large physical server whose dominant cost is the hard drives that populate it. Perhaps they are configured for medium redundancy such that the server can withstand the loss of approximately 10% of its drives at any one time without loss of data. The server is directly connected to the local network, providing high-speed access to the data from other machines in the network, including the slide scanners and a locally hosted image management system. Off-site backup is achieved by uploading images (at the time of acquisition) to a cloud service that stores them in a low-cost archival format, similar in concept to “glacier” storage. Images older than three years are purged from on-prem storage but are retained in the cloud archive. Roughly 20% of the drives in this server are randomly replaced each year, and the drives were not sourced from the same batch (or even the same manufacturer) in the first place.
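The scale of such a deployment can be sketched with a few lines of arithmetic. In the example below, the three-year on-prem window and the 20% yearly drive replacement come from the scenario above; the function names, scanning volume, average file size, and drive count are hypothetical, and redundancy overhead is deliberately left out because it depends on how the server is configured.

```python
def on_prem_steady_state_tb(slides_per_day, scan_days_per_year,
                            gb_per_slide, retention_years=3):
    """Usable on-prem capacity once purging of older images has begun."""
    yearly_gb = slides_per_day * scan_days_per_year * gb_per_slide
    return yearly_gb * retention_years / 1000.0


def cloud_archive_tb(slides_per_day, scan_days_per_year,
                     gb_per_slide, years_in_operation):
    """Archive size when every image is uploaded and none are purged."""
    yearly_gb = slides_per_day * scan_days_per_year * gb_per_slide
    return yearly_gb * years_in_operation / 1000.0


def drives_replaced_per_year(num_drives, replacement_percent=20):
    """Drives swapped each year, rounded up to a whole drive."""
    return -(-num_drives * replacement_percent // 100)  # integer ceiling


if __name__ == "__main__":
    # Hypothetical department: 1,000 slides/day, 250 scan days/year, 1 GB each.
    print("on-prem (TB):", on_prem_steady_state_tb(1000, 250, 1.0))
    print("cloud after 5 years (TB):", cloud_archive_tb(1000, 250, 1.0, 5))
    print("drives replaced/year:", drives_replaced_per_year(64))
```

Note that the on-prem footprint levels off after three years while the cloud archive keeps growing, which is where the low per-GB cost of archival storage matters most.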
The cost of this example scenario is relatively modest and the performance is quite high. However, there are several potential challenges:
- Disaster recovery from off-site backup in case of complete failure would be slow and expensive.
- Image management or AI that may reside in the cloud but needs access to these images stored on-prem could potentially suffer from performance issues.
- Images older than three years would not be immediately accessible, would not be backed up in a secondary location, and would also be costly to access; under normal circumstances this may be acceptable, but it would be more problematic if, for example, a large retrospective research study were undertaken.
- The strategy may not be consistent with institutional IT policy for data storage.
- The on-prem strategy may require more resources and local expertise for maintenance.
Navigating these challenges may be easy for some deployments and impossible for others. In general, we recommend that anyone interested in storage solutions not only do some online research but also talk to their institution’s storage team and/or storage providers. Storage providers today have a variety of different programs for procuring storage and its related infrastructure, including options for treating storage as an operational expense as opposed to a capital purchase. Depending on your organization, having a flexible storage solution, especially one not dependent on capital budget cycles, might allow projects to start sooner.
Conclusion
Storage should not automatically be considered a major burden for digital pathology deployment. All options are potentially on the table – from no long-term data retention at all, to sophisticated retention designed to minimize the risk of data loss, downtime, and user impact, and everything in between. Circumstance and institutional policy will ultimately dictate many facets of the storage strategy, but in cases where there is more flexibility, there are a number of options that can help achieve the right balance.
Disclaimer: In seeking to foster discourse on a wide array of ideas, the Digital Pathology Association believes that it is important to share a range of prominent industry viewpoints. This article does not necessarily express the viewpoints of the DPA; however, we view this as a valuable point with which to facilitate discussion.