Karlsruhe 2024 – scientific programme
T: Fachverband Teilchenphysik
T 68: Data, AI, Computing 6 (ML utilities)
T 68.2: Talk
Wednesday, March 6, 2024, 16:15–16:30, Geb. 30.34: LTI
Checkpointing of long running machine learning trainings on GPUs — •Jonas Eppelt, Matthias Schnepf, Giacomo De Pietro, and Günter Quast — Institute of Experimental Particle Physics (ETP), Karlsruhe Institute of Technology (KIT)
The rise of Machine Learning (ML) applications in High Energy Physics (HEP) analysis and reconstruction pushes for the use of GPUs in such workflows. Training neural networks can have long runtimes, making the trainings more susceptible to runtime constraints and failures. Checkpoints capture the current state of a training and therefore allow it to be resumed from that state at another time and in another place. This provides resilience against failures and allows for long runtimes while complying with time constraints. Additionally, checkpointing enables efforts in sustainable computing: for example, trainings can be run at times when renewable energy is available and halted during times of limited energy supply. This talk presents a Python interface that bundles common tools from the HEP community to store, transfer, and restore checkpoints for ML trainings.
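The checkpoint-and-resume pattern described above can be sketched in plain Python. This is a minimal illustration under assumed names (`save_checkpoint`, `load_checkpoint`, `train` are hypothetical and not the interface presented in the talk); real ML trainings would additionally serialize model weights and optimizer state, e.g. via their framework's own save/load utilities.

```python
# Minimal sketch of a resumable training loop via checkpoints.
# All names here are illustrative, not the talk's actual interface.
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    # Write to a temporary file and rename atomically, so an
    # interrupted save never leaves a corrupt checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    # Return the stored training state, or None if no checkpoint exists.
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

def train(path, total_epochs=6, stop_after=None):
    # Resume from an existing checkpoint if present, else start fresh.
    state = load_checkpoint(path) or {"epoch": 0, "loss_history": []}
    for epoch in range(state["epoch"], total_epochs):
        loss = 1.0 / (epoch + 1)          # stand-in for a real training step
        state["epoch"] = epoch + 1
        state["loss_history"].append(loss)
        save_checkpoint(path, state)      # persist state after every epoch
        if stop_after is not None and state["epoch"] >= stop_after:
            return state                  # simulate hitting a runtime limit
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "train.ckpt")
partial = train(ckpt, total_epochs=6, stop_after=3)  # interrupted run
resumed = train(ckpt, total_epochs=6)                # picks up at epoch 3
```

The second `train` call finds the checkpoint from the interrupted run and continues from epoch 3 instead of restarting, which is what allows a training to migrate across time slots or batch jobs.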
Keywords: Calorimeter; Machine Learning; Graph Neural Networks; Clustering; Reconstruction