Computational methods for de novo identification of gene regulation elements, such as transcription factor binding sites, have proved to be useful for deciphering genetic regulatory networks. However, despite the availability of a large number of algorithms, their strengths and weaknesses are not sufficiently understood. Here, we designed a comprehensive set of performance measures and benchmarked five modern sequence-based motif discovery algorithms using large datasets generated from Escherichia coli RegulonDB. Factors that affect the prediction accuracy, scalability and reliability are characterized. It is revealed that the nucleotide and the binding site level accuracy are very low, while the motif level accuracy is relatively high, which indicates that the algorithms can usually capture at least one correct motif in an input sequence. To exploit diverse predictions from multiple runs of one or more algorithms, a consensus ensemble algorithm has been developed, which achieved 6-45% improvement over the base algorithms by increasing both the sensitivity and specificity. Our study illustrates limitations and potentials of existing sequence-based motif discovery algorithms. Taking advantage of the revealed potentials, several promising directions for further improvements are discussed. Since the sequence-based algorithms are the baseline of most of the modern motif discovery algorithms, this paper suggests substantial improvements would be possible for them.
Cite this work
Researchers should cite this work as follows:
- Hu, J., Li, B., Kihara, D. (2013). Limitations and Potentials of Current Motif Discovery Algorithms. Purdue University Research Repository. doi:10.4231/D33F4KN2K
E. coli genome data sets:
- The regulonDB.txt file was obtained from http://www.cifn.unam.mx/Computational_Genomics/regulondb/.
- The ecoli.genes file contains the gene information of E. coli.
- The ecoli.genome file is the complete E. coli genome sequence.
- The ecoli.motifs.zip file contains a separate file for each motif group compiled from RegulonDB (uncompress with unzip under Linux).

ECRDB70 data sets:
- The ECRDB70.txt file contains 70 motif groups screened out of RegulonDB. Some of the records will be skipped when generating input sequence data sets.
- The ECRDB70.list file is a list of the motif groups in ECRDB70 with their motif widths and other information.
- The ECRDB70.stat file contains some statistics of the ECRDB70 motifs.

Input sequence data sets with different margins generated from ECRDB70 (refer to the paper for the procedures used to generate them):
- The intergenic.zip file contains input sequences extracted from the intergenic regions in which the motifs in ECRDB70 are located.
- The margin20.zip file contains training sequences with a margin size of 20 on both sides of motifs.
- The margin50.zip file contains training sequences with a margin size of 50 on both sides of motifs.
- The margin100.zip file contains training sequences with a margin size of 100 on both sides of motifs.
- The margin200.zip file contains training sequences with a margin size of 200 on both sides of motifs.
- The margin300.zip file contains training sequences with a margin size of 300 on both sides of motifs.
- The margin400.zip file contains training sequences with a margin size of 400 on both sides of motifs.
- The margin500.zip file contains training sequences with a margin size of 500 on both sides of motifs.
- The margin800.zip file contains training sequences with a margin size of 800 on both sides of motifs.
- The ECRDB61B.tar.gz file contains training sequences with margin sizes of 20, 50, 100, 200, 300, 400, 500 and 800 on both sides of motifs (8 data sets).
NOTE: Redundant input sequences were removed, and motif groups left with just one input sequence after this processing were removed too, so just 61 motif groups remain in each data set.
- The resampling.zip file contains sequence files of motif groups with at least 40 sequences, used for benchmarking how the number of sequences affects prediction performance.

Background sequences:
Two types of background models are generated, based on:
1. The whole E. coli genome sequence: ecoli.genome.
2. All the sequence segments located in the intergenic regions of the E. coli genome: ecoli.intergenic.fasta. This file is generated from ecoli.genome and the gene information in ecoli.genes. It includes intergenic segments from both strands of the E. coli genome.

Parameter settings for benchmark experiments and the minimal-parameter-tuning guideline:
Following our minimal-parameter-tuning guideline, we list all the major running parameters of the five motif discovery programs used in our experiments: AlignACE, BioProspector, MDScan, MEME, and MotifSampler. Most of the parameters are left unset or use the default settings. The parameters are listed in the file parametersetting.pdf.
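
For readers who want a feel for how the margin data sets, the redundancy filtering in the NOTE, and a background model relate to the genome sequence, the following Python sketch may help. It is illustrative only and is not the procedure used to build the files; refer to the paper for that. Everything in it is an assumption: the function names are hypothetical, coordinates are taken to be 0-based and half-open, "redundant" is read as "exactly identical sequence within a motif group", and the background model is a simple zero-order nucleotide frequency table. The actual formats of ecoli.genome and ecoli.genes are not assumed, so the functions work on plain in-memory strings.

```python
from collections import Counter


def extract_with_margin(genome, start, end, margin):
    """Return the site genome[start:end] plus up to `margin` bases on each side.

    Assumption: 0-based, half-open coordinates. The window is clipped
    at the two ends of the genome string.
    """
    left = max(0, start - margin)
    right = min(len(genome), end + margin)
    return genome[left:right]


def drop_redundant(groups):
    """Filter a {motif_group_name: [sequence, ...]} mapping.

    Assumption: "redundant" means an exact duplicate within a group.
    Groups left with a single sequence after de-duplication are dropped.
    """
    cleaned = {}
    for name, seqs in groups.items():
        unique = list(dict.fromkeys(seqs))  # de-duplicate, keep first-seen order
        if len(unique) > 1:
            cleaned[name] = unique
    return cleaned


def background_frequencies(sequence):
    """Zero-order background model: relative frequency of A, C, G and T.

    Characters other than A, C, G, T (for example N) are ignored.
    """
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T characters")
    return {base: counts[base] / total for base in "ACGT"}
```

Under these assumptions, calling `extract_with_margin` with margins of 20, 50, 100 and so on for the same site would yield one window per margin size, mirroring the margin20.zip through margin800.zip series, and `background_frequencies` could be applied either to the whole genome or to the concatenated intergenic segments to mimic the two background types.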