Localization of potential regulatory signals in the human genome

By Minou Bina, Phillip J. Wyss

Purdue University

Discovering sequences with potential regulatory characteristics

Version 1.0 - published on 13 Aug 2014 - doi:10.4231/R7V40S43 - Archived on 25 Oct 2016

Licensed under CC0 1.0 Universal


Supplementary materials for the publication entitled "Discovering sequences with potential regulatory characteristics." Bina M, Wyss P, Lazarus SA, Shah SR, Ren W, Szpankowski W, Crawford GE, Park SP, Song XC.

Access dataset in genome browser.

In the human genome, the expression of protein-coding genes is regulated in part by specific DNA sequence elements located upstream of transcription start sites (TSSs), in regions known as proximal and basal promoters. These elements are relatively short and usually correspond to transcription factor binding sites (TFBSs). Numerous studies have sought to determine the genomic DNA sequences that control the expression of human genes. However, it has been difficult to pinpoint protein binding sites along the human chromosomes with experimental strategies, including chromatin immunoprecipitation assays (ChIPs).

To develop a complementary approach, we reasoned that promoter regions should provide a rich source of data for designing tools that predict the position of regulatory signals de novo, based on DNA sequence alone. Towards this goal, we designed a scheme to create density plots of sequences that appear frequently upstream of TSSs. The scheme involved the following steps.

First, we created a database (RF_06_data) to collect all possible 9-mers between positions -500 and -1 with respect to the TSSs of protein-coding genes (the data included nearly 16,000 genes). To reduce sequence redundancy, we selected the promoter of one mRNA isoform per gene. Since the human genome may contain multiple copies of a given gene, we selected one promoter to represent redundant genes.

Second, we applied statistical criteria to rank the collected 9-mers with respect to their relative abundance in total human genomic DNA.

Third, we wrote programs to create density plots that localize the occurrences of statistically significant 9-mers along the human chromosomes.
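As a rough illustration of the first two steps, the Python sketch below counts overlapping 9-mers in a set of promoter sequences and scores each one against its abundance in a background sequence. The scoring formula (promoter frequency divided by a smoothed background frequency) is an assumption made for illustration only; the statistical criteria actually used are given in the primary publication, not on this page. The function names, the pseudocount, and the toy inputs are likewise hypothetical.

```python
from collections import Counter

K = 9  # 9-mer length, as stated in the description above


def count_kmers(seq, k=K):
    """Count every overlapping k-mer in seq, skipping k-mers with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def enrichment_scores(promoters, background, k=K, pseudocount=1.0):
    """Score each promoter k-mer by promoter frequency / background frequency.

    ASSUMPTION: this ratio is a stand-in for the unspecified statistical
    criteria; it is not the authors' published method.
    """
    prom = Counter()
    for p in promoters:
        prom.update(count_kmers(p, k))
    bg = count_kmers(background, k)
    prom_total = sum(prom.values())
    # Additive smoothing over all 4**k possible k-mers avoids division by zero.
    bg_total = sum(bg.values()) + pseudocount * 4 ** k
    scores = {}
    for kmer, n in prom.items():
        f_prom = n / prom_total
        f_bg = (bg[kmer] + pseudocount) / bg_total
        scores[kmer] = f_prom / f_bg
    return scores


# Toy usage: one promoter (in practice, positions -500 to -1 of each TSS)
# and an A-rich stand-in for total genomic DNA.
toy_promoters = ["AAAAAAAAAGCGCGCGCG"]
toy_background = "A" * 50
toy_scores = enrichment_scores(toy_promoters, toy_background)
ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

In this toy example the all-A 9-mer is common in the background and therefore ends up at the bottom of `ranked`, which mirrors the note that 9-mers occurring frequently in the genome receive a low weight.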
We selected a 30-base window and applied filtering criteria to scan each chromosome. The program examined all possible 9-mers in the 30-base window to identify their statistically assigned ranks, then used the ranks to compute a weighted sum. As the window moved along a chromosome (one base at a time), the weighted sums produced intensity values at each nucleotide position. Finally, a program was developed to create a file for displaying the density plots on a track on the genome browser at UCSC.

This publication offers a link for viewing the density plots directly on the genome browser (hg19) and provides the corresponding file for download. The file is in bigWig format. To obtain a copy of the file, click on download (displayed at the top right of this page). To use the downloaded file to create a custom track on the human genome browser (hg19), include the instructions below:

    db=hg19
    track type=bigWig name="Reg. Signal" description="Purdue University Regulatory Signal Prediction" visibility=full autoScale=off viewLimits=0.0:12.0 color=0,0,200

Alternatively, click on Regulatory Signals to view the density plots directly on a track on the genome browser.

Examples of applications include using the genome browser to examine the DNA sequences upstream of your favorite gene to determine whether they include 9-mers that also occur in the promoters of other genes. The results may localize regions that you could evaluate in DNA binding assays, in functional assays, or both. Note that the computational strategy assigns a low weight to 9-mers that occur frequently in the human genome. Such 9-mers often correspond to AT-rich sequences.
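The window scan described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' program: the `weights` dictionary (9-mer to weight) is hypothetical, 9-mers missing from it contribute zero, each sum is reported at the window's first base, and the unspecified filtering criteria are omitted. The second function writes the values as UCSC fixedStep wiggle text, one common intermediate that can be converted to bigWig; whether the original pipeline used that route is not stated.

```python
def window_density(seq, weights, window=30, k=9):
    """Return one weighted sum per window position, moving one base at a time.

    ASSUMPTIONS: `weights` maps 9-mer -> weight (hypothetical); unknown
    9-mers count as 0.0; the value is assigned to the window's start base.
    """
    seq = seq.upper()
    # Weight of the k-mer starting at each position of the sequence.
    per_pos = [weights.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1)]
    kmers_per_window = window - k + 1  # 22 overlapping 9-mers in 30 bases
    n_windows = len(seq) - window + 1  # zero or negative for short sequences
    return [sum(per_pos[s:s + kmers_per_window]) for s in range(n_windows)]


def to_fixed_step(chrom, values, start=1):
    """Format values as fixedStep wiggle text (1-based start, step of 1)."""
    lines = [f"fixedStep chrom={chrom} start={start} step=1"]
    lines.extend(f"{v:.3f}" for v in values)
    return "\n".join(lines) + "\n"


# Toy usage with a single hypothetical weight.
toy_weights = {"AAAAAAAAA": 1.0}
density = window_density("A" * 40, toy_weights)
wiggle_text = to_fixed_step("chr1", density)
```

Summing the 22 per-position weights directly keeps the sketch easy to check; a running sum that adds the entering 9-mer and subtracts the leaving one would be the natural optimization for whole chromosomes.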

Cite this work

Researchers should cite this work as follows:



If you use the data and the plots in your research, please cite the primary publication: Minou Bina, Phillip Wyss, Sheryl A. Lazarus, Syed R. Shah, Wenhui Ren, Wojciech Szpankowski, Gregory E. Crawford, Sang P. Park, Xiaohui C. Song (2009), "Discovering sequences with potential regulatory characteristics," Genomics, 93, 4: pg. 314-322, April. (DOI: 10.1016/j.ygeno.2008.11.008).
