Aligning distantly related protein sequences is a long-standing problem in bioinformatics and is key to successful protein structure prediction. Its importance has grown in the context of structural genomics projects, because more and more experimentally solved structures are becoming available as templates for protein structure modeling. To this end, recent structure prediction methods employ profile-profile alignments, and various ways of aligning two profiles have been developed. More fundamentally, a better amino acid similarity matrix can improve the profile itself, thereby yielding more accurate profile-profile alignments. Here we have developed novel amino acid similarity matrices from knowledge-based amino acid contact potentials. Contact potentials are used because the contact propensity of a residue toward the other amino acids is expected to be one of the most conserved features of each position in a protein structure. The derived amino acid similarity matrices are tested on benchmark alignments at three different levels: the family, the superfamily, and the fold level. Compared with BLOSUM45 and the other existing matrices, the contact potential-based matrices perform comparably in family-level alignments but clearly outperform them in fold-level alignments. The contact potential-based matrices perform even better when suboptimal alignments are considered. Comparing the matrices with each other revealed that the contact potential-based matrices are very different from BLOSUM45 and the other matrices, indicating that they are located in a different basin of the amino acid similarity matrix space.
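To make the idea of turning a contact potential into a similarity matrix concrete, here is a minimal sketch. It assumes that two amino acids are scored as similar when their contact profiles (their rows of the contact potential) are correlated. The Pearson correlation, the function names, and the toy 4-residue potential are illustrative assumptions, not the derivation confirmed by this page.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def similarity_from_contacts(residues, potential):
    """Build a symmetric similarity matrix from a contact potential.

    potential[a][b] is the contact energy between residue types a and b.
    Assumption (hypothetical): similarity(a, b) is the correlation between
    the contact profiles of a and b.
    """
    profiles = {a: [potential[a][b] for b in residues] for a in residues}
    return {
        a: {b: pearson(profiles[a], profiles[b]) for b in residues}
        for a in residues
    }


# Toy potential with made-up values; this is NOT a published contact potential.
residues = ["L", "I", "K", "E"]
toy_potential = {
    "L": {"L": -1.0, "I": -0.9, "K": 0.3, "E": 0.4},
    "I": {"L": -0.9, "I": -0.8, "K": 0.2, "E": 0.5},
    "K": {"L": 0.3, "I": 0.2, "K": 0.6, "E": -0.5},
    "E": {"L": 0.4, "I": 0.5, "K": -0.5, "E": 0.7},
}

sim = similarity_from_contacts(residues, toy_potential)
```

In this toy example the two hydrophobic residues have nearly parallel contact profiles, so `sim["L"]["I"]` is larger than `sim["L"]["K"]`. A real matrix would use all 20 amino acids and would likely need rescaling before use as an alignment scoring matrix.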
Cite this work
Researchers should cite this work as follows:
- Tan, Y., Huang, H., Kihara, D. (2013). Statistical Potential based Amino Acid Similarity Matrices for Aligning Distantly Related Protein Sequences. Purdue University Research Repository. doi:10.4231/D3DR2P83H
Lindahl and Elofsson's Dataset:
- Website - http://www.sbc.su.se/~arne/
- Sequence dataset in FASTA format (http://dragon.bio.purdue.edu/aamatrices/fasta_format.htm) - seq.tar.gz
- Sequence dataset in PDB format (http://www.rcsb.org/pdb/) - http://dragon.bio.purdue.edu/aamatrices/pdb.tar.gz

Matrices:
- Matrix built from structural superposition data for identifying potential remote homologues (Blake-Cohen, 2001) - BLAJ010101.txt
- BLOSUM45 substitution matrix (Henikoff-Henikoff, 1992) - BLOSUM45.txt
- Structure-based amino acid scoring table (Johnson-Overington, 1993) - JOHM930101.txt
- Conformational similarity weight matrix (Kolaskar-Kulkarni-Kale, 1992) - KOLA920101.txt
- Context-dependent optimal substitution matrices for all residues (Koshi-Goldstein, 1995) - KOSJ950115.txt
- Base-substitution-protein-stability matrix (Miyazawa-Jernigan, 1993) - MIYS930101.txt
- STR matrix from structure-based alignments (Overington et al., 1992) - OVEJ920101.txt
- Structure derived matrix (SDM) for alignment of distantly related sequences (Prlic et al., 2000) - PRLA000101.txt
- Homologous structure derived matrix (HSDM) for alignment of distantly related sequences (Prlic et al., 2000) - PRLA000102.txt
- Cross-correlation coefficients of preference factors, main chain (Qu et al., 1993) - QU_C930101.txt
- Cross-correlation coefficients of preference factors, side chain (Qu et al., 1993) - QU_C930102.txt
- STROMA score matrix for the alignment of known distant homologs (Qian-Goldstein, 2002) - QUIB020101.txt
- Additional matrix files - CCPC.txt, CCPG.txt, CCPQ.txt, CCREFPG.txt, CC6PC.txt, CC6PG.txt, CC6PQ.txt
- 10% CCPC + 90% KOLA - CC10.txt
- 20% CCPC + 80% KOLA - CC20.txt
- 30% CCPC + 70% KOLA - CC30.txt
- 40% CCPC + 60% KOLA - CC40.txt
- 50% CCPC + 50% KOLA - CC50.txt
- 60% CCPC + 40% KOLA - CC60.txt
- 70% CCPC + 30% KOLA - CC70.txt
- 80% CCPC + 20% KOLA - CC80.txt
- 90% CCPC + 10% KOLA - CC90.txt
- Downhill simplex optimization of matrix CCPC using family level data - Fam01.txt
- Downhill simplex optimization of matrix CCPC using superfamily level data - SFam01.txt
- Downhill simplex optimization of matrix CCPC using fold level data - Fold01.txt
- Matrix containing random values between -5 and 15 - RANDMATRIX.txt

Golden Alignments:
- Family level alignments computed using CE (http://cl.sdsc.edu/ce.html) - ce.fam.tar.gz
- Superfamily level alignments computed using CE (http://cl.sdsc.edu/ce.html) - ce.sfam.tar.gz
- Fold level alignments computed using CE (http://cl.sdsc.edu/ce.html) - ce.fold.tar.gz

Benchmarking Datasets:
- http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/
- http://www.drive5.com/muscle/prefab.htm
- http://www-cryst.bioc.cam.ac.uk/~homstrad/
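The blended files in the matrix list (CC10.txt through CC90.txt) are described as percentage mixes of CCPC and KOLA, for example 30% CCPC + 70% KOLA for CC30.txt. The sketch below shows one straightforward reading of that description: an entry-by-entry weighted sum. Whether the two matrices were rescaled before mixing is not stated here, so the `blend` helper and the toy 2x2 values are assumptions for illustration only.

```python
def blend(primary, secondary, weight):
    """Return weight * primary + (1 - weight) * secondary, entry by entry.

    Both matrices are nested dicts indexed as matrix[a][b] over the same
    residue alphabet. Assumption: no rescaling is applied before mixing.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return {
        a: {
            b: weight * primary[a][b] + (1.0 - weight) * secondary[a][b]
            for b in row
        }
        for a, row in primary.items()
    }


# Toy stand-ins with made-up scores; these are NOT the real CCPC or KOLA values.
ccpc_toy = {"A": {"A": 4.0, "G": 0.0}, "G": {"A": 0.0, "G": 6.0}}
kola_toy = {"A": {"A": 2.0, "G": 1.0}, "G": {"A": 1.0, "G": 2.0}}

# Analogue of CC30.txt: 30% of the first matrix plus 70% of the second.
cc30_toy = blend(ccpc_toy, kola_toy, 0.30)
```

With these toy inputs the A-A entry works out to 0.3 x 4.0 + 0.7 x 2.0 = 2.6. Sweeping `weight` from 0.1 to 0.9 in steps of 0.1 would produce analogues of the nine blended files.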