Berlin 2008 – scientific programme

BP: Fachverband Biologische Physik

BP 25: Protein Structure and Folding

BP 25.9: Talk

Thursday, February 28, 2008, 16:30–16:45, PC 203

Accurate sequence alignment statistics for different protein models — •Stefan Wolfsheimer¹, Inke Herms², Sven Rahmann³, and Alexander K Hartmann¹ — ¹Institut für Physik, Universität Oldenburg, Germany — ²AG Genominformatik/COMET, Technische Fakultät, Universität Bielefeld, Germany — ³Fachbereich Informatik, TU Dortmund, Germany

Searching for homologous sequences and identifying proteins are well-studied problems in bioinformatics. For these purposes, a large sequence database is searched with a query sequence using sequence alignment algorithms, of which the Smith-Waterman algorithm is a well-known representative. A meaningful interpretation of the resulting alignment score is given by its p-value, the probability of obtaining a score at least as high under a chosen null model.
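As an illustration of how such an alignment score arises, the following minimal Python sketch computes the optimal Smith-Waterman local alignment score of two sequences. The match/mismatch scores and the linear gap penalty are simplifying assumptions chosen for readability; they are not the scoring system used in this work, and practical protein searches typically rely on a substitution matrix and affine gap costs.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the optimal local alignment score of sequences a and b.

    Assumed (illustrative) scoring: fixed match/mismatch scores and a
    linear gap penalty. Only two rows of the dynamic programming table
    are kept, since only the score (not the alignment) is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                 # start a new local alignment
                prev[j - 1] + s,   # align a[i-1] with b[j-1]
                prev[j] + gap,     # gap in b
                curr[j - 1] + gap, # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

The score returned for a query against a database sequence is the quantity whose distribution under the null model determines the p-value.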

Exact results are known only for gapless alignment of infinitely long sequences drawn from an uncorrelated model, in which the amino acids are independent and identically distributed (i.i.d.). For this case a Gumbel distribution of the scores is expected. Real proteins do not satisfy these assumptions: first, they are finite, and second, the i.i.d. assumption might not be the best description. We therefore study more complex models that incorporate information from secondary structure annotation to obtain a more plausible null model.
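For orientation, the Gumbel form for the gapless i.i.d. case is commonly written in the Karlin-Altschul parametrisation, P(S ≥ s) = 1 − exp(−K m n e^(−λ s)), where m and n are the sequence lengths. The sketch below evaluates this expression in Python. The function name is hypothetical, and λ and K depend on the scoring system and letter frequencies, so any values passed in are placeholders rather than parameters from this work.

```python
import math

def gumbel_p_value(score, m, n, lam, K):
    """P(S >= score) under the Gumbel form 1 - exp(-K*m*n*exp(-lam*score)).

    lam and K are scoring-system dependent; callers must supply them.
    """
    expected_hits = K * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x,
    # which matters in the rare-event tail.
    return -math.expm1(-expected_hits)
```

In the far tail the p-value is approximately K m n e^(−λ s), so the logarithm of the p-value falls linearly with the score. This is the extrapolation against which simulated score distributions can be compared.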

Using generalized ensemble Monte Carlo simulations, we obtain the score distributions down to very small probabilities (p ∼ 10⁻¹⁰⁰). We find strong deviations from the expected Gumbel form in the rare-event tail. Our results indicate that extrapolating the Gumbel distribution overestimates p-values in the high-scoring regime.