PICS: Probabilistic Inference for ChIP-seq. (English) Zbl 1216.62184

Summary: ChIP-seq combines chromatin immunoprecipitation with massively parallel short-read sequencing. While it can profile genome-wide in vivo transcription factor-DNA association with higher sensitivity, specificity, and spatial resolution than ChIP-chip, it poses new challenges for statistical analysis that derive from the complexity of the biological systems characterized and from variability and biases in its sequence data. We propose a method called PICS (Probabilistic Inference for ChIP-seq) for identifying regions bound by transcription factors from aligned reads. PICS identifies binding event locations by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical \(t\)-mixture model. It uses precalculated, whole-genome read mappability profiles and a truncated \(t\)-distribution to adjust binding event models for reads that are missing due to local genome repetitiveness. It estimates uncertainties in model parameters that can be used to define confidence regions on binding event locations and to filter estimates. Finally, PICS calculates a per-event enrichment score relative to a control sample, and can use a control sample to estimate a false discovery rate. Using published GABP and FOXA1 data from human cell lines, we show that PICS’ predicted binding sites were more consistent with computationally predicted binding motifs than the alternative methods MACS, QuEST, CisGenome, and USeq. We then use a simulation study to confirm that PICS compares favorably to these methods and is robust to model misspecification.
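The idea of locating a binding event from directional reads and scoring it against a control can be illustrated with a minimal sketch. This is not the PICS algorithm: it replaces the Bayesian hierarchical t-mixture model with plain averages, ignores mappability, and handles only a single, already isolated event. The function names, the pseudocount, and the library-size scaling are assumptions introduced here for illustration only.

```python
from statistics import mean


def estimate_binding_event(forward_starts, reverse_starts):
    """Toy point estimate for one isolated binding event.

    Forward-strand read starts pile up upstream of the bound site and
    reverse-strand read starts pile up downstream, so the site is taken
    as the midpoint of the two strand centres and the gap between the
    centres serves as a rough proxy for the DNA fragment length.
    (Assumption: simple means stand in for the mixture-model fit.)
    """
    forward_centre = mean(forward_starts)
    reverse_centre = mean(reverse_starts)
    location = (forward_centre + reverse_centre) / 2.0
    fragment_length = reverse_centre - forward_centre
    return location, fragment_length


def enrichment_score(chip_count, control_count, chip_total, control_total,
                     pseudocount=1.0):
    """Hypothetical enrichment ratio of ChIP reads over control reads.

    The control count is rescaled to the ChIP library size, and a
    pseudocount keeps the ratio finite when the control region is empty.
    """
    scale = chip_total / control_total
    return (chip_count + pseudocount) / (scale * control_count + pseudocount)


if __name__ == "__main__":
    location, fragment_length = estimate_binding_event(
        [90, 100, 110], [290, 300, 310]
    )
    print(location, fragment_length)
    print(enrichment_score(19, 4, 1000, 1000))
```

A mixture model earns its keep where this sketch fails: when two events lie closer together than one fragment length, the strand-wise means blur them into one, which is the case the fragment length prior in the summary is meant to resolve.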

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62F15 Bayesian inference
92C40 Biochemistry, molecular biology
65C60 Computational problems in statistics (MSC2010)

Software:

PICS; Bioconductor; R
