Köln 2004 – scientific programme
HK: Physik der Hadronen und Kerne
HK 18: Instrumentation and Applications II
HK 18.6: Talk
Tuesday, March 9, 2004, 16:45–17:00, C
A Fault Tolerant Data Flow Framework for Clusters — •Timm M. Steinbeck — Kirchhoff Institute of Physics, Ruprecht-Karls-University Heidelberg, Im Neuenheimer Feld 227, D-69120 Heidelberg, http://www.ti.uni-hd.de/HLT/
The High Level Trigger (HLT) of the ALICE experiment has to reduce a data rate of up to 25 GB/s to at most 1.25 GB/s before permanent storage. To cope with these rates, a PC cluster of several hundred nodes connected by a fast network is being designed. For this system, an efficient, flexible, and fault-tolerant data transport software framework is being developed. It consists of components that are connected via a common interface, so that different configurations can be constructed and even changed at runtime. To ensure fault-tolerant operation, the framework includes fail-over mechanisms that replace whole nodes and that restart and reconnect components while the system is running. The latter mechanism relies on the runtime reconnection feature of the component interface. Components on different cluster nodes are connected through a communication class library that abstracts from the underlying network, which retains flexibility in the choice of hardware. The library contains two working prototype implementations, one for TCP and one for the SCI SAN. Extensions can be added to this library without modifying other parts of the framework. Performance tests show very promising results, indicating that ALICE’s data transport requirements can be fulfilled. In a test with simulated proton-proton data for a part of the TPC, an event rate of more than 430 Hz was achieved with full tracking being performed.
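The combination of a network abstraction and runtime reconnection described above can be illustrated with a minimal sketch. It is not the framework's actual code or API: the names `Transport`, `LoopbackTransport`, `Publisher`, and `run_demo` are hypothetical, and the in-memory transport merely stands in for a real backend such as TCP or SCI.

```python
from abc import ABC, abstractmethod


class Transport(ABC):
    """Hypothetical network abstraction; components only see this interface."""

    @abstractmethod
    def send(self, payload: bytes) -> None:
        ...


class LoopbackTransport(Transport):
    """In-memory stand-in for a real backend (e.g. TCP or SCI)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.delivered = []   # payloads received by the simulated peer
        self.alive = True     # set to False to simulate a node failure

    def send(self, payload: bytes) -> None:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.delivered.append(payload)


class Publisher:
    """Component that forwards events and can be reconnected at runtime."""

    def __init__(self, primary: Transport, spare: Transport) -> None:
        self._link = primary
        self._spare = spare

    def reconnect(self, link: Transport) -> None:
        # Swap the outgoing connection without restarting the component.
        self._link = link

    def publish(self, event: bytes) -> None:
        try:
            self._link.send(event)
        except ConnectionError:
            # Fail over: switch to the spare link and resend the same event.
            self.reconnect(self._spare)
            self._link.send(event)


def run_demo():
    primary = LoopbackTransport("node-a")
    spare = LoopbackTransport("node-b")
    publisher = Publisher(primary, spare)
    publisher.publish(b"event-1")
    primary.alive = False          # simulate the loss of node-a
    publisher.publish(b"event-2")  # transparently rerouted to node-b
    return primary.delivered, spare.delivered
```

Because the publisher depends only on the abstract `Transport` interface, a further backend could be added in this sketch without touching the component code, which mirrors the extensibility claim made for the communication library.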