The Limits of De Novo DNA Motif Discovery

Overview of attention for article published in PLOS ONE, November 2012

Mentioned by

2 X users
1 Facebook page

Citations

24 Dimensions

Readers on

107 Mendeley
4 CiteULike

Title
The Limits of De Novo DNA Motif Discovery
Published in
PLOS ONE, November 2012
DOI 10.1371/journal.pone.0047836
Pubmed ID
Authors

David Simcha, Nathan D. Price, Donald Geman

Abstract

A major challenge in molecular biology is reverse-engineering the cis-regulatory logic that plays a major role in the control of gene expression. This program includes searching through DNA sequences to identify "motifs" that serve as the binding sites for transcription factors or, more generally, are predictive of gene expression across cellular conditions. Several approaches have been proposed for de novo motif discovery, that is, searching sequences without prior knowledge of binding sites or nucleotide patterns. However, unbiased validation is not straightforward. We consider two approaches to unbiased validation of discovered motifs: testing the statistical significance of a motif using a DNA "background" sequence model to represent the null hypothesis and measuring performance in predicting membership in gene clusters. We demonstrate that the background models typically used are "too null," resulting in overly optimistic assessments of significance, and argue that performance in predicting TF binding or expression patterns from DNA motifs should be assessed by held-out data, as in predictive learning. Applying this criterion to common motif discovery methods resulted in universally poor performance, although there is a marked improvement when motifs are statistically significant against real background sequences. Moreover, on synthetic data where "ground truth" is known, discriminative performance of all algorithms is far below the theoretical upper bound, with pronounced "over-fitting" in training. A key conclusion from this work is that the failure of de novo discovery approaches to accurately identify motifs is basically due to statistical intractability resulting from the fixed size of co-regulated gene clusters, and thus such failures do not necessarily provide evidence that unfound motifs are not active biologically. Consequently, the use of prior knowledge to enhance motif discovery is not just advantageous but necessary.
An implementation of the LR and ALR algorithms is available at http://code.google.com/p/likelihood-ratio-motifs/.

X Demographics

The data shown below were collected from the profiles of 2 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 107 Mendeley readers of this research output.

Geographical breakdown

| Country | Count | As % |
|---|---|---|
| United Kingdom | 2 | 2% |
| Finland | 1 | <1% |
| Brazil | 1 | <1% |
| Spain | 1 | <1% |
| United States | 1 | <1% |
| Unknown | 101 | 94% |

Demographic breakdown

| Readers by professional status | Count | As % |
|---|---|---|
| Student > Ph.D. Student | 30 | 28% |
| Researcher | 23 | 21% |
| Student > Master | 16 | 15% |
| Student > Bachelor | 8 | 7% |
| Professor > Associate Professor | 6 | 6% |
| Other | 15 | 14% |
| Unknown | 9 | 8% |

| Readers by discipline | Count | As % |
|---|---|---|
| Agricultural and Biological Sciences | 49 | 46% |
| Biochemistry, Genetics and Molecular Biology | 26 | 24% |
| Computer Science | 9 | 8% |
| Neuroscience | 3 | 3% |
| Engineering | 3 | 3% |
| Other | 6 | 6% |
| Unknown | 11 | 10% |