T: Particle Physics Division (Fachverband Teilchenphysik)
T 44: Data, AI, Computing 4 (workflow)
T 44.8: Talk
Tuesday, 5 March 2024, 17:45–18:00, Bldg. 30.34: LTI
Search for Hidden Job Failure Risk Factors in ATLAS Job Meta Data
Arnulf Quadt, Sebastian Wozniewski, and •Kia-Jüng Yang (all: II. Physikalisches Institut, Georg-August-Universität Göttingen)
The ATLAS detector records more than 10,000 TB of data per year, and this volume will grow further with the upcoming upgrades. The Worldwide LHC Computing Grid (WLCG) provides the distributed computing infrastructure to store and process these data. It is crucial that the WLCG is also reliable, meaning that the failure rate of user-submitted jobs is low. While some job failures can be clearly traced back to known temporary issues, others appear to occur more randomly, for less obvious reasons. Investigating how job failure rates depend on job attributes may reveal correlations that allow for mitigating actions to further reduce the number of job failures. This task is supported by training a neural network, which helps to identify and investigate correlations in the multi-dimensional space of job attributes.
Keywords: Grid Computing; Machine Learning; ATLAS; Computing Jobs; Data Analysis
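
As a rough illustration of what "failure rates depending on job attributes" could mean in practice, the following minimal sketch groups job records by a single attribute and compares each group's failure rate with the overall rate. It is not the authors' method: the records, the attribute names (`site`, `input_files`, `failed`) and the helper `failure_rate_by` are hypothetical and invented for this example, and the neural network mentioned in the abstract is not modelled here.

```python
from collections import defaultdict

# Hypothetical job records. Attribute names and values are invented for
# illustration and do not reflect the real ATLAS job metadata schema.
jobs = [
    {"site": "SITE_A", "input_files": 5, "failed": False},
    {"site": "SITE_A", "input_files": 50, "failed": True},
    {"site": "SITE_A", "input_files": 5, "failed": False},
    {"site": "SITE_A", "input_files": 5, "failed": False},
    {"site": "SITE_B", "input_files": 50, "failed": True},
    {"site": "SITE_B", "input_files": 50, "failed": True},
    {"site": "SITE_B", "input_files": 5, "failed": False},
    {"site": "SITE_B", "input_files": 50, "failed": True},
]


def failure_rate_by(records, attribute):
    """Return {attribute value: (failed count, total count, failure rate)}."""
    counts = defaultdict(lambda: [0, 0])  # value -> [failed, total]
    for record in records:
        entry = counts[record[attribute]]
        entry[0] += int(record["failed"])
        entry[1] += 1
    return {value: (f, n, f / n) for value, (f, n) in counts.items()}


# Overall failure rate across all records, used as the baseline.
overall = sum(job["failed"] for job in jobs) / len(jobs)

# Per-site failure rates compared with the baseline.
rates = failure_rate_by(jobs, "site")
for value, (failed, total, rate) in sorted(rates.items()):
    print(f"{value}: {failed}/{total} failed, rate {rate:.2f}, overall {overall:.2f}")
```

A one-attribute comparison like this can only surface simple dependencies. Correlations that involve several attributes at once are the case for which the abstract proposes a neural network.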