A Detailed Analysis of Using Hammerspace Tier 0 for Checkpointing

In high-performance computing (HPC) and AI applications, checkpointing is the process of periodically saving the state of an application to persistent storage, allowing recovery from failures without restarting computations from the beginning.
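
As a minimal illustration of the idea (a generic sketch, not taken from the whitepaper), the Python below saves a small state dictionary every few steps and resumes from the last saved step after a failure. The names save_checkpoint, load_checkpoint, and run, the JSON format, and the toy workload are all assumptions made for the example; real training jobs would normally use their framework's own checkpoint format.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(state: dict, path: str) -> None:
    """Write state atomically: temp file in the same directory, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to persistent storage
        os.replace(tmp_path, path)  # readers never see a half-written file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path: str, num_steps: int, checkpoint_every: int,
        fail_at: Optional[int] = None) -> dict:
    """Toy workload: sum the step indices, checkpointing periodically."""
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    for step in range(state["step"], num_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += step  # stand-in for real computation
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state
```

If the job fails between checkpoints, only the work done since the last save is repeated; the more frequent the checkpoints, the less work is lost, but the more time is spent writing.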

However, checkpointing can introduce significant overhead, particularly when large amounts of data are written over the network to shared storage. The main challenges are outlined below.

Challenges Associated with Checkpointing to External Networked Storage

GPU Idle Time: During checkpointing, GPUs often remain idle until all data is written to shared storage, leading to inefficient utilization of expensive compute resources.
Shared Storage Bottlenecks: Simultaneous writes from multiple nodes to a shared storage system can overwhelm the network and storage bandwidth, increasing checkpointing time.
Risk of Data Loss: Writing checkpoints only to local storage, without redundancy, risks losing data if a node fails before the checkpoint is safely stored elsewhere.
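
The trade-off in the list above can be sketched in generic Python. This is an illustration only and does not represent Hammerspace's API or how Tier 0 is implemented: the class TwoStageCheckpointer, its methods, and the two directory arguments are hypothetical names. The sketch writes each checkpoint to a local directory so the caller can return to compute quickly, then copies it to a shared directory on a background thread.

```python
import os
import shutil
from concurrent.futures import Future, ThreadPoolExecutor


class TwoStageCheckpointer:
    """Hypothetical helper: fast local write, background copy to shared storage."""

    def __init__(self, local_dir: str, shared_dir: str) -> None:
        self.local_dir = local_dir
        self.shared_dir = shared_dir
        os.makedirs(local_dir, exist_ok=True)
        os.makedirs(shared_dir, exist_ok=True)
        # One worker keeps copies ordered and limits load on shared storage.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def save(self, name: str, payload: bytes) -> Future:
        """Write locally (blocking), then schedule the shared copy (non-blocking)."""
        local_path = os.path.join(self.local_dir, name)
        tmp_path = local_path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, local_path)
        # Until this future completes, the checkpoint exists on one node only.
        return self._pool.submit(self._replicate, name)

    def _replicate(self, name: str) -> str:
        """Copy a finished local checkpoint to shared storage via a temp file."""
        src = os.path.join(self.local_dir, name)
        dst = os.path.join(self.shared_dir, name)
        tmp_path = dst + ".tmp"
        shutil.copyfile(src, tmp_path)
        os.replace(tmp_path, dst)
        return dst

    def close(self) -> None:
        """Wait for outstanding copies before shutting down."""
        self._pool.shutdown(wait=True)
```

The caller blocks only for the local write. The returned future marks the window in which a node failure would lose the checkpoint, which is the data-loss risk described above; waiting on it before discarding an older checkpoint is one way to keep a safe copy at all times.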

This whitepaper defines Tier 0, explains how to use it in a checkpointing workflow, and quantifies the benefits of using Hammerspace Tier 0 storage for checkpointing.
