Darmstadt 2008 – scientific programme
HK: Fachverband Physik der Hadronen und Kerne (Hadronic and Nuclear Physics Division)
HK 19: Instrumentation and Applications II
HK 19.7: Talk
Tuesday, March 11, 2008, 10:00–10:15, 2D
The SysMES Framework: System Management for networked Embedded Systems and Clusters — •Camilo Lara for the ALICE-HLT collaboration — Kirchhoff Institute for Physics, Heidelberg, Germany
The ALICE heavy-ion particle physics experiment is currently being built at CERN near Geneva. It will use a PC cluster of 900 dual-processor machines for the last stages of the data readout process and a network of 400 microcomputers for the configuration and control of the cluster nodes. A key objective in such experiments is to guarantee that the devices in use run correctly throughout the lifetime of the experiment. A second is to meet the extremely high availability and reliability requirements of the application being run, the so-called High Level Trigger (HLT). The SysMES Framework is a scalable, decentralized, fault-tolerant, dynamic, rule-based tool set for monitoring networks of target systems and applications. Its management algorithm consists of the following steps: system and application monitoring, recognition of undesirable states, event (message) generation, local event handling on the target, event forwarding to the management framework, event handling on the management side, rule checking, and automatic reaction. The framework will be used to recognize undesirable states in the analysis chain, such as process termination or cluster node overload, and to react automatically by starting an HLT reconfiguration.
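The management steps listed in the abstract can be pictured with a minimal Python sketch. It is an illustration only, not the SysMES implementation: every name in it (`Event`, `Rule`, `detect`, `handle_locally`, `Manager`, `run_cycle`), the metric keys `process_alive` and `load`, and the 0.9 load limit are hypothetical and chosen for this example. The abstract does not describe the actual interfaces or thresholds.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Event:
    """A message generated on a target when an undesirable state is seen."""
    node: str
    kind: str    # hypothetical kinds: "process_terminated", "node_overload"
    detail: str


@dataclass
class Rule:
    """Maps an event kind to an automatic reaction (hypothetical structure)."""
    kind: str
    action: Callable[[Event], str]


def detect(node: str, sample: Dict[str, float],
           load_limit: float = 0.9) -> List[Event]:
    """Monitoring, state recognition and event generation for one sample."""
    events: List[Event] = []
    if sample.get("process_alive", 1.0) == 0.0:
        events.append(Event(node, "process_terminated",
                            "analysis process not running"))
    if sample.get("load", 0.0) > load_limit:
        events.append(Event(node, "node_overload",
                            f"load={sample['load']:.2f}"))
    return events


def handle_locally(event: Event, local_kinds: Set[str]) -> Optional[Event]:
    """Local handling on the target: return None if the event was dealt
    with locally, otherwise return it so that it can be forwarded."""
    if event.kind in local_kinds:
        return None
    return event


class Manager:
    """Management side: receives forwarded events and checks rules."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = {rule.kind: rule for rule in rules}
        self.log: List[str] = []

    def receive(self, event: Event) -> None:
        rule = self.rules.get(event.kind)
        if rule is None:
            # No rule matches: record the event without reacting.
            self.log.append(f"unmatched:{event.kind}@{event.node}")
            return
        # Automatic reaction, recorded here as a string for inspection.
        self.log.append(rule.action(event))


def run_cycle(samples: Dict[str, Dict[str, float]], manager: Manager,
              local_kinds: Set[str]) -> None:
    """One pass over all targets: detect, handle locally, forward."""
    for node, sample in samples.items():
        for event in detect(node, sample):
            forwarded = handle_locally(event, local_kinds)
            if forwarded is not None:
                manager.receive(forwarded)
```

In this sketch a reaction is just a string appended to `Manager.log`. A rule such as `Rule("process_terminated", lambda e: f"reconfigure:{e.node}")` stands in for the HLT reconfiguration that the abstract names as the automatic reaction. Splitting the work between `handle_locally` and `Manager.receive` mirrors the distinction the abstract draws between handling on the target and handling on the management side.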