Research & Scholarship
Current Research and Scholarly Interests
Current interest centers on the application of statistics to problems arising from biology. We are particularly interested in questions concerning gene regulation and signal transduction.
- A Course in Bayesian Statistics
STATS 270, STATS 370 (Spr)
- Literature of Statistics
STATS 319 (Aut)
Independent Studies (11)
- Directed Reading in Health Research and Policy
HRP 299 (Aut, Win, Spr, Sum)
- Graduate Research
HRP 399 (Aut, Win, Spr, Sum)
- Independent Study
STATS 199 (Aut, Win, Spr, Sum)
- Independent Study
STATS 299 (Aut, Win, Spr, Sum)
- Industrial Research for Statisticians
STATS 298 (Aut, Win, Spr, Sum)
- Industrial Research for Statisticians
STATS 398 (Aut, Win, Spr, Sum)
- Master's Research
CME 291 (Aut, Win, Spr, Sum)
- Medical Scholars Research
HRP 370 (Aut, Win, Spr, Sum)
- Ph.D. Research
CME 400 (Aut, Win, Spr, Sum)
STATS 399 (Aut, Win, Spr, Sum)
- Undergraduate Research
HRP 199 (Aut, Win, Spr, Sum)
Graduate and Fellowship Programs
Biology (School of Humanities and Sciences) (PhD Program)
Early role for IL-6 signalling during generation of induced pluripotent stem cells revealed by heterokaryon RNA-Seq.
Nature cell biology
2013; 15 (10): 1244-1252
Molecular insights into somatic cell reprogramming to induced pluripotent stem cells (iPS) would aid regenerative medicine, but are difficult to elucidate in iPS because of their heterogeneity, as relatively few cells undergo reprogramming (0.1-1%). To identify early acting regulators, we capitalized on non-dividing heterokaryons (mouse embryonic stem cells fused to human fibroblasts), in which reprogramming towards pluripotency is efficient and rapid, enabling the identification of transient regulators required at the onset. We used bi-species transcriptome-wide RNA-seq to quantify transcriptional changes in the human somatic nucleus during reprogramming towards pluripotency in heterokaryons. During heterokaryon reprogramming, the cytokine interleukin 6 (IL6), which is not detectable at significant levels in embryonic stem cells, was induced 50-fold. A 4-day culture with IL6 at the onset of iPS reprogramming replaced stably transduced oncogenic c-Myc such that transduction of only Oct4, Klf4 and Sox2 was required. IL6 also activated another Jak/Stat target, the serine/threonine kinase gene Pim1, which accounted for the IL6-mediated twofold increase in iPS frequency. In contrast, LIF, another induced GP130 ligand, failed to increase iPS frequency or activate c-Myc or Pim1, thereby revealing a differential role for the two Jak/Stat inducers in iPS generation. These findings demonstrate the power of heterokaryon bi-species global RNA-seq to identify early acting regulators of reprogramming, for example, extrinsic replacements for stably transduced transcription factors such as the potent oncogene c-Myc.
View details for DOI 10.1038/ncb2835
View details for PubMedID 23995732
Personalized prediction of first-cycle in vitro fertilization success
FERTILITY AND STERILITY
2013; 99 (7): 1905-1911
Objective: To test whether the probability of having a live birth (LB) with the first IVF cycle (C1) can be predicted and personalized for patients in diverse environments. Design: Retrospective validation of multicenter prediction model. Setting: Three university-affiliated outpatient IVF clinics located in different countries. Patient(s): Using primary models aggregated from >13,000 C1s, we applied the boosted tree method to train a preIVF-diversity model (PreIVF-D) with 1,061 C1s from 2008 to 2009, and validated predicted LB probabilities with an independent dataset comprising 1,058 C1s from 2008 to 2009. Intervention(s): None. Main Outcome Measure(s): Predictive power, reclassification, receiver operator characteristic analysis, calibration, dynamic range. Result(s): Overall, with PreIVF-D, 86% of cases had significantly different LB probabilities compared with age control, and more than one-half had higher LB probabilities. Specifically, 42% of patients could have been identified by PreIVF-D to have a personalized predicted success rate >45%, whereas an age-control model could not differentiate them from others. Furthermore, PreIVF-D showed improved predictive power, with 36% improved log-likelihood (or 9.0-fold by log-scale; >1,000-fold linear scale), and prediction errors for subgroups ranged from 0.9% to 3.7%. Conclusion(s): Validated prediction of personalized LB probabilities from diverse multiple sources identifies excellent prognoses in more than one-half of patients.
View details for DOI 10.1016/j.fertnstert.2013.02.016
View details for Web of Science ID 000320505900028
View details for PubMedID 23522806
- Detecting DNA Modifications from SMRT Sequencing Data by Modeling Sequence Context Dependence of Polymerase Kinetic PLOS COMPUTATIONAL BIOLOGY 2013; 9 (3)
RNA sequencing reveals a diverse and dynamic repertoire of the Xenopus tropicalis transcriptome over development
2013; 23 (1): 201-216
The Xenopus embryo has provided key insights into fate specification, the cell cycle, and other fundamental developmental and cellular processes, yet a comprehensive understanding of its transcriptome is lacking. Here, we used paired end RNA sequencing (RNA-seq) to explore the transcriptome of Xenopus tropicalis in 23 distinct developmental stages. We determined expression levels of all genes annotated in RefSeq and Ensembl and showed for the first time on a genome-wide scale that, despite a general state of transcriptional silence in the earliest stages of development, approximately 150 genes are transcribed prior to the midblastula transition. In addition, our splicing analysis uncovered more than 10,000 novel splice junctions at each stage and revealed that many known genes have additional unannotated isoforms. Furthermore, we used Cufflinks to reconstruct transcripts from our RNA-seq data and found that ∼13.5% of the final contigs are derived from novel transcribed regions, both within introns and in intergenic regions. We then developed a filtering pipeline to separate protein-coding transcripts from noncoding RNAs and identified a confident set of 6686 noncoding transcripts in 3859 genomic loci. Since the current reference genome, XenTro3, consists of hundreds of scaffolds instead of full chromosomes, we also performed de novo reconstruction of the transcriptome using Trinity and uncovered hundreds of transcripts that are missing from the genome. Collectively, our data will not only aid in completing the assembly of the Xenopus tropicalis genome but will also serve as a valuable resource for gene discovery and for unraveling the fundamental mechanisms of vertebrate embryogenesis.
View details for DOI 10.1101/gr.141424.112
View details for Web of Science ID 000312963400019
View details for PubMedID 22960373
Modeling kinetic rate variation in third generation DNA sequencing data to detect putative modifications to DNA bases
2013; 23 (1): 129-141
Current generation DNA sequencing instruments are moving closer to seamlessly sequencing genomes of entire populations as a routine part of scientific investigation. However, while significant inroads have been made identifying small nucleotide variation and structural variations in DNA that impact phenotypes of interest, progress has not been as dramatic regarding epigenetic changes and base-level damage to DNA, largely due to technological limitations in assaying all known and unknown types of modifications at genome scale. Recently, single-molecule real time (SMRT) sequencing has been reported to identify kinetic variation (KV) events that have been demonstrated to reflect epigenetic changes of every known type, providing a path forward for detecting base modifications as a routine part of sequencing. However, to date no statistical framework has been proposed to enhance the power to detect these events while also controlling for false-positive events. By modeling enzyme kinetics in the neighborhood of an arbitrary location in a genomic region of interest as a conditional random field, we provide a statistical framework for incorporating kinetic information at a test position of interest as well as at neighboring sites that help enhance the power to detect KV events. The performance of this and related models is explored, with the best-performing model applied to plasmid DNA isolated from Escherichia coli and mitochondrial DNA isolated from human brain tissue. We highlight widespread kinetic variation events, some of which strongly associate with known modification events, while others represent putative chemically modified sites of unknown types.
View details for DOI 10.1101/gr.136739.111
View details for Web of Science ID 000312963400012
View details for PubMedID 23093720
An Oct4-Sall4-Nanog network controls developmental progression in the pre-implantation mouse embryo
MOLECULAR SYSTEMS BIOLOGY
Landmark events occur in a coordinated manner during pre-implantation development of the mammalian embryo, yet the regulatory network that orchestrates these events remains largely unknown. Here, we present the first systematic investigation of the network in pre-implantation mouse embryos using morpholino-mediated gene knockdowns of key embryonic stem cell (ESC) factors followed by detailed transcriptome analysis of pooled embryos, single embryos, and individual blastomeres. We delineated the regulons of Oct4, Sall4, and Nanog and identified a set of metabolism- and transport-related genes that were controlled by these transcription factors in embryos but not in ESCs. Strikingly, the knockdown embryos arrested at a range of developmental stages. We provided evidence that the DNA methyltransferase Dnmt3b has a role in determining the extent to which a knockdown embryo can develop. We further showed that the feed-forward loop comprising Dnmt3b, the pluripotency factors, and the miR-290-295 cluster exemplifies a network motif that buffers embryos against gene expression noise. Our findings indicate that Oct4, Sall4, and Nanog form a robust and integrated network to govern mammalian pre-implantation development.
View details for DOI 10.1038/msb.2012.65
View details for Web of Science ID 000314415800002
View details for PubMedID 23295861
Neural-specific Sox2 input and differential Gli-binding affinity provide context and positional information in Shh-directed neural patterning
GENES & DEVELOPMENT
2012; 26 (24): 2802-2816
In the vertebrate neural tube, regional Sonic hedgehog (Shh) signaling invokes a time- and concentration-dependent induction of six different cell populations mediated through Gli transcriptional regulators. Elsewhere in the embryo, Shh/Gli responses invoke different tissue-appropriate regulatory programs. A genome-scale analysis of DNA binding by Gli1 and Sox2, a pan-neural determinant, identified a set of shared regulatory regions associated with key factors central to cell fate determination and neural tube patterning. Functional analysis in transgenic mice validates core enhancers for each of these factors and demonstrates the dual requirement for Gli1 and Sox2 inputs for neural enhancer activity. Furthermore, through an unbiased determination of Gli-binding site preferences and analysis of binding site variants in the developing mammalian CNS, we demonstrate that differential Gli-binding affinity underlies threshold-level activator responses to Shh input. In summary, our results highlight Sox2 input as a context-specific determinant of the neural-specific Shh response and differential Gli-binding site affinity as an important cis-regulatory property critical for interpreting Shh morphogen action in the mammalian neural tube.
View details for DOI 10.1101/gad.207142.112
View details for Web of Science ID 000312775700012
View details for PubMedID 23249739
Activation of Innate Immunity Is Required for Efficient Nuclear Reprogramming
2012; 151 (3): 547-558
Retroviral overexpression of reprogramming factors (Oct4, Sox2, Klf4, c-Myc) generates induced pluripotent stem cells (iPSCs). However, the integration of foreign DNA could induce genomic dysregulation. Cell-permeant proteins (CPPs) could overcome this limitation. To date, this approach has proved exceedingly inefficient. We discovered a striking difference in the pattern of gene expression induced by viral versus CPP-based delivery of the reprogramming factors, suggesting that a signaling pathway required for efficient nuclear reprogramming was activated by the retroviral, but not CPP approach. In gain- and loss-of-function studies, we find that the toll-like receptor 3 (TLR3) pathway enables efficient induction of pluripotency by viral or mmRNA approaches. Stimulation of TLR3 causes rapid and global changes in the expression of epigenetic modifiers to enhance chromatin remodeling and nuclear reprogramming. Activation of inflammatory pathways is required for efficient nuclear reprogramming in the induction of pluripotency.
View details for DOI 10.1016/j.cell.2012.09.034
View details for Web of Science ID 000310529300012
View details for PubMedID 23101625
Improving PacBio Long Read Accuracy by Short Read Alignment
2012; 7 (10)
The recent development of third generation sequencing (TGS) generates much longer reads than second generation sequencing (SGS) and thus provides a chance to solve problems that are difficult to study through SGS alone. However, higher raw read error rates are an intrinsic drawback in most TGS technologies. Here we present a computational method, LSC, to perform error correction of TGS long reads (LR) by SGS short reads (SR). Aiming to reduce the error rate in homopolymer runs in the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without sacrificing alignment accuracy. We applied LSC to 100,000 PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain RNA-seq data. The results show LSC can correct PacBio long reads to reduce the error rate by more than threefold. The improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq study. Compared with another hybrid correction tool, LSC can achieve over double the sensitivity and similar specificity.
View details for DOI 10.1371/journal.pone.0046679
View details for Web of Science ID 000309580800039
View details for PubMedID 23056399
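Illustrative note: the homopolymer compression (HC) idea mentioned in the abstract above can be sketched in a few lines. This is only a minimal illustration under our own assumptions, not the LSC implementation; the function names and the example reads are hypothetical. Each run of identical bases is collapsed to a single base, so a long read and a short read that differ only in homopolymer run length become identical after compression.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical bases to one base.

    Returns the compressed sequence and the list of run lengths,
    so the original sequence can be restored later.
    (Hypothetical helper, not part of LSC.)
    """
    bases = []
    run_lengths = []
    for base, run in groupby(seq):
        bases.append(base)
        run_lengths.append(len(list(run)))
    return "".join(bases), run_lengths


def homopolymer_expand(compressed, run_lengths):
    """Inverse of homopolymer_compress."""
    return "".join(base * n for base, n in zip(compressed, run_lengths))


# Hypothetical reads: the long read over-calls the T run and the A run.
long_read = "AAACCGTTTTTA"
short_read = "AACCGTTTA"

long_hc, long_runs = homopolymer_compress(long_read)
short_hc, short_runs = homopolymer_compress(short_read)

print(long_hc, long_runs)    # ACGTA [3, 2, 1, 5, 1]
print(short_hc, short_runs)  # ACGTA [2, 2, 1, 3, 1]
print(long_hc == short_hc)   # True: the reads match in compressed space
```

Keeping the run lengths alongside the compressed string is a design choice made here so that the original read can be reconstructed after alignment in compressed space.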
Fast and accurate read alignment for resequencing
2012; 28 (18): 2366-2373
Next-generation sequence analysis has become an important task both in laboratory and clinical settings. A key stage in the majority of sequence analysis workflows, such as resequencing, is the alignment of genomic reads to a reference genome. The accurate alignment of reads with large indels is a computationally challenging task for researchers. We introduce SeqAlto as a new algorithm for read alignment. For reads longer than or equal to 100 bp, SeqAlto is up to 10× faster than existing algorithms, while retaining high accuracy and the ability to align reads with large (up to 50 bp) indels. This improvement in efficiency is particularly important in the analysis of future sequencing data where the number of reads approaches many billions. Furthermore, SeqAlto uses less than 8 GB of memory to align against the human genome. SeqAlto is benchmarked against several existing tools with both real and simulated data. Linux and Mac OS X binaries free for academic use are available at http://firstname.lastname@example.org.
View details for DOI 10.1093/bioinformatics/bts450
View details for Web of Science ID 000308532300059
View details for PubMedID 22811546
Six2 and Wnt Regulate Self-Renewal and Commitment of Nephron Progenitors through Shared Gene Regulatory Networks
2012; 23 (3): 637-651
A balance between Six2-dependent self-renewal and canonical Wnt signaling-directed commitment regulates mammalian nephrogenesis. Intersectional studies using chromatin immunoprecipitation and transcriptional profiling identified direct target genes shared by each pathway within nephron progenitors. Wnt4 and Fgf8 are essential for progenitor commitment; cis-regulatory modules flanking each gene are cobound by Six2 and β-catenin and are dependent on conserved Lef/Tcf binding sites for activity. In vitro and in vivo analyses suggest that Six2 and Lef/Tcf factors form a regulatory complex that promotes progenitor maintenance while entry of β-catenin into this complex promotes nephrogenesis. Alternative transcriptional responses associated with Six2 and β-catenin cobinding events occur through non-Lef/Tcf DNA binding mechanisms, highlighting the regulatory complexity downstream of Wnt signaling in the developing mammalian kidney.
View details for DOI 10.1016/j.devcel.2012.07.008
View details for Web of Science ID 000308776400019
View details for PubMedID 22902740
Predicting personalized multiple birth risks after in vitro fertilization-double embryo transfer
FERTILITY AND STERILITY
2012; 98 (1)
Objective: To report and evaluate the performance and utility of an approach to predicting IVF-double embryo transfer (DET) multiple birth risks that is evidence-based, clinic-specific, and considers each patient's clinical profile. Design: Retrospective prediction modeling. Setting: An outpatient university-affiliated IVF clinic. Patient(s): We used boosted tree methods to analyze 2,413 independent IVF-DET treatment cycles that resulted in live births. The IVF cycles were retrieved from a database that comprised more than 33,000 IVF cycles. Intervention(s): None. Main Outcome Measure(s): The performance of this prediction model, MBP-BIVF, was validated by an independent data set, to evaluate predictive power, discrimination, dynamic range, and reclassification. Result(s): Multiple birth probabilities ranging from 11.8% to 54.8% were predicted by the model and were significantly different from control predictions in more than half of the patients. The prediction model showed an improvement of 146% in predictive power and 16.0% in discrimination over control. The population standard error was 1.8%. Conclusion(s): We showed that IVF patients have inherently different risks of multiple birth, even when DET is specified, and this risk can be predicted before ET. The use of clinic-specific prediction models provides an evidence-based and personalized method to counsel patients.
View details for DOI 10.1016/j.fertnstert.2012.04.011
View details for Web of Science ID 000305950200020
View details for PubMedID 22673597
A Sparse Transmission Disequilibrium Test for Haplotypes Based on Bradley-Terry Graphs
2012; 73 (1): 52-61
Linkage and association analysis based on haplotype transmission disequilibrium can be more informative than single marker analysis. Several approaches have been proposed in recent years to extend the transmission disequilibrium test (TDT) to haplotypes. Among them, a powerful approach called the evolutionary tree TDT (ET-TDT) incorporates information about the evolutionary relationship among haplotypes using the cladogram of the locus. In this work we extend this approach by taking into consideration the sparsity of causal mutations in the evolutionary history. We first introduce the notion of a Bradley-Terry (BT) graph representation of a haplotype locus. The most important property of the BT graph is that sparsity of the edge set of the graph corresponds to a small number of causal mutations in the evolution of the haplotypes. We then propose a method to test the null hypothesis of no linkage and association against sparse alternatives under which a small number of edges on the BT graph have non-nil effects. We compare the performance of our approach to that of the ET-TDT through a power study, and show that incorporating sparsity of causal mutations can significantly improve the power of a haplotype-based TDT.
View details for DOI 10.1159/000335937
View details for Web of Science ID 000302111100008
View details for PubMedID 22398955
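Illustrative note: for readers unfamiliar with the TDT that the abstract above extends, the classical single-marker, two-allele version can be sketched as follows. This is not the sparse BT-graph test or the ET-TDT from the paper; it is only the basic McNemar-type statistic, and the function name and counts are hypothetical. Among heterozygous parents, let b be the number of times the allele of interest is transmitted to an affected child and c the number of times it is not; under the null hypothesis, (b - c)^2 / (b + c) is approximately chi-square distributed with 1 degree of freedom.

```python
import math


def tdt_statistic(transmitted, untransmitted):
    """Classical two-allele TDT (McNemar-type) statistic and p-value.

    transmitted / untransmitted: counts, taken over heterozygous parents,
    of how often the allele of interest was or was not passed to an
    affected child. (Hypothetical helper for illustration only.)
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Upper tail of a chi-square with 1 degree of freedom:
    # P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 60 transmissions versus 40 non-transmissions.
chi2, p_value = tdt_statistic(60, 40)
print(chi2)               # 4.0
print(round(p_value, 4))  # 0.0455
```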
- Coupling Optional Polya Trees and the Two Sample Problem JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 2011; 106 (496): 1553-1565
A BOOTSTRAP-BASED NON-PARAMETRIC ANOVA METHOD WITH APPLICATIONS TO FACTORIAL MICROARRAY DATA
2011; 21 (2): 495-514
View details for Web of Science ID 000290459900002
A New FACS Approach Isolates hESC Derived Endoderm Using Transcription Factors
2011; 6 (3)
We show that high quality microarray gene expression profiles can be obtained following FACS sorting of cells using combinations of transcription factors. We use this transcription factor FACS (tfFACS) methodology to perform a genomic analysis of hESC-derived endodermal lineages marked by combinations of SOX17, GATA4, and CXCR4, and find that triple positive cells have a much stronger definitive endoderm signature than other combinations of these markers. Additionally, SOX17(+) GATA4(+) cells can be obtained at a much earlier stage of differentiation, prior to expression of CXCR4(+) cells, providing an important new tool to isolate this earlier definitive endoderm subtype. Overall, tfFACS represents an advancement in FACS technology which broadly crosses multiple disciplines, most notably in regenerative medicine to redefine cellular populations.
View details for DOI 10.1371/journal.pone.0017536
View details for Web of Science ID 000288170900026
View details for PubMedID 21408072
Human transcriptome array for high-throughput clinical studies
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2011; 108 (9): 3707-3712
A 6.9 million-feature oligonucleotide array of the human transcriptome [Glue Grant human transcriptome (GG-H array)] has been developed for high-throughput and cost-effective analyses in clinical studies. This array allows comprehensive examination of gene expression and genome-wide identification of alternative splicing as well as detection of coding SNPs and noncoding transcripts. The performance of the array was examined and compared with mRNA sequencing (RNA-Seq) results over multiple independent replicates of liver and muscle samples. Compared with RNA-Seq of 46 million uniquely mappable reads per replicate, the GG-H array is highly reproducible in estimating gene and exon abundance. Although both platforms detect similar expression changes at the gene level, the GG-H array is more sensitive at the exon level. Deeper sequencing is required to adequately cover low-abundance transcripts. The array has been implemented in a multicenter clinical program and has generated high-quality, reproducible data. Considering the clinical trial requirements of cost, sample availability, and throughput, the GG-H array has a wide range of applications. An emerging approach for large-scale clinical genomic studies is to first use RNA-Seq at sufficient depth to discover transcriptome elements relevant to the disease process, followed by high-throughput and reliable screening of these elements on thousands of patient samples using custom-designed arrays.
View details for DOI 10.1073/pnas.1019753108
View details for Web of Science ID 000287844400051
View details for PubMedID 21317363
- Statistical Modeling of RNA-Seq Data STATISTICAL SCIENCE 2011; 26 (1): 62-83
Completely phased genome sequencing through chromosome sorting
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2011; 108 (1): 12-17
The two haploid genome sequences that a person inherits from the two parents represent the most fundamentally useful type of genetic information for the study of heritable diseases and the development of personalized medicine. Because of the difficulty in obtaining long-range phase information, current sequencing methods are unable to provide this information. Here, we introduce and show feasibility of a scalable approach capable of generating genomic sequences completely phased across the entire chromosome.
View details for DOI 10.1073/pnas.1016725108
View details for Web of Science ID 000285915000007
View details for PubMedID 21169219
THE ANALYSIS OF CHIP-SEQ DATA
METHODS IN ENZYMOLOGY, VOL 497: SYNTHETIC BIOLOGY, METHODS FOR PART/DEVICE CHARACTERIZATION AND CHASSIS ENGINEERING, PT A
2011; 497: 51-73
Chromatin immunoprecipitation coupled with ultra-high-throughput parallel DNA sequencing (ChIP-seq) is an effective technology for the investigation of genome-wide protein-DNA interactions. Examples of applications include the studies of RNA polymerases transcription, transcriptional regulation, and histone modifications. The technology provides accurate and high-resolution mapping of the protein-DNA binding loci that are important in the understanding of many processes in development and diseases. Since the introduction of ChIP-seq experiments in 2007, many statistical and computational methods have been developed to support the analysis of the massive datasets from these experiments. However, because of the complex, multistaged analysis workflow, it is still difficult for an experimental investigator to conduct the analysis of his or her own ChIP-seq data. In this chapter, we review the basic design of ChIP-seq experiments and provide an in-depth tutorial on how to prepare, to preprocess, and to analyze ChIP-seq datasets. The tutorial is based on a revised version of our software package CisGenome, which was designed to encompass most standard tasks in ChIP-seq data analysis. Relevant statistical and computational issues will be highlighted, discussed, and illustrated by means of real data examples.
View details for DOI 10.1016/B978-0-12-385075-1.00003-2
View details for Web of Science ID 000291321200003
View details for PubMedID 21601082
Integration of Brassinosteroid Signal Transduction with the Transcription Network for Plant Growth Regulation in Arabidopsis
2010; 19 (5): 765-777
Brassinosteroids (BRs) regulate a wide range of developmental and physiological processes in plants through a receptor-kinase signaling pathway that controls the BZR transcription factors. Here, we use transcript profiling and chromatin-immunoprecipitation microarray (ChIP-chip) experiments to identify 953 BR-regulated BZR1 target (BRBT) genes. Functional studies of selected BRBTs further demonstrate roles in BR promotion of cell elongation. The BRBT genes reveal numerous molecular links between the BR-signaling pathway and downstream components involved in developmental and physiological processes. Furthermore, the results reveal extensive crosstalk between BR and other hormonal and light-signaling pathways at multiple levels. For example, BZR1 not only controls the expression of many signaling components of other hormonal and light pathways but also coregulates common target genes with light-signaling transcription factors. Our results provide a genomic map of steroid hormone actions in plants that reveals a regulatory network that integrates hormonal and light-signaling pathways for plant growth regulation.
View details for DOI 10.1016/j.devcel.2010.10.010
View details for Web of Science ID 000284516300016
View details for PubMedID 21074725
- From EM to Data Augmentation: The Emergence of MCMC Bayesian Computation in the 1980s STATISTICAL SCIENCE 2010; 25 (4): 506-516
Deep phenotyping to predict live birth outcomes in in vitro fertilization
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2010; 107 (31): 13570-13575
Nearly 75% of in vitro fertilization (IVF) treatments do not result in live births, and patients are largely guided by a generalized age-based prognostic stratification. We sought to provide personalized and validated prognosis by using available clinical and embryo data from prior, failed treatments to predict live birth probabilities in the subsequent treatment. We generated a boosted tree model, IVFBT, by training it with IVF outcomes data from 1,676 first cycles (C1s) from 2003-2006, followed by external validation with 634 cycles from 2007-2008. We tested whether this model could predict the probability of having a live birth in the subsequent treatment (C2). By using nondeterministic methods to identify prognostic factors and their relative nonredundant contribution, we generated a prediction model, IVFBT, that was superior to the age-based control by providing over 1,000-fold improvement to fit new data (p<0.05), and increased discrimination by receiver operating characteristic analysis (area-under-the-curve, 0.80 vs. 0.68 for C1, 0.68 vs. 0.58 for C2). IVFBT provided predictions that were more accurate for approximately 83% of C1 and approximately 60% of C2 cycles that were out of the range predicted by age. Over half of those patients were reclassified to have higher live birth probabilities. We showed that data from a prior cycle could be used effectively to provide personalized and validated live birth probabilities in a subsequent cycle. Our approach may be replicated and further validated in other IVF clinics.
View details for DOI 10.1073/pnas.1002296107
View details for Web of Science ID 000280605900006
View details for PubMedID 20643955
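Illustrative note: the area-under-the-curve (AUC) figures quoted in the abstract above measure discrimination, that is, the probability that a randomly chosen cycle with a live birth receives a higher predicted probability than a randomly chosen cycle without one. The sketch below computes that quantity directly from pairwise comparisons. It is a generic illustration, not the authors' analysis code; the function name and the example data are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a positive outcome (e.g. live birth), 0 otherwise.
    scores: predicted probabilities, one per label.
    Ties between a positive and a negative score count as half a win.
    (Hypothetical helper for illustration only.)
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes and predicted probabilities for four cycles.
outcomes = [1, 0, 1, 0]
predicted = [0.9, 0.8, 0.4, 0.3]
print(roc_auc(outcomes, predicted))  # 0.75
```

The quadratic pairwise loop is chosen for clarity; a rank-based formula gives the same value more efficiently on large data sets.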
Detection of splice junctions from paired-end RNA-seq data by SpliceMap
NUCLEIC ACIDS RESEARCH
2010; 38 (14): 4570-4578
Alternative splicing is a prevalent post-transcriptional process, which is not only important to normal cellular function but is also involved in human diseases. The newly developed second generation sequencing technique provides high-throughput data (RNA-seq data) to study alternative splicing events in different types of cells. Here, we present a computational method, SpliceMap, to detect splice junctions from RNA-seq data. This method does not depend on any existing annotation of gene structures and is capable of finding novel splice junctions with high sensitivity and specificity. It can handle long reads (50-100 nt) and can exploit paired-read information to improve mapping accuracy. Several parameters are included in the output to indicate the reliability of the predicted junction and help filter out false predictions. We applied SpliceMap to analyze 23 million paired 50-nt reads from human brain tissue. The results show at this depth of sequencing, RNA-seq can support reliable detection of splice junctions except for those that are present at very low level. Compared to current methods, SpliceMap can achieve 12% higher sensitivity without sacrificing specificity.
View details for DOI 10.1093/nar/gkq211
View details for Web of Science ID 000280922400010
View details for PubMedID 20371516
CisGenome Browser: a flexible tool for genomic data visualization
2010; 26 (14): 1781-1782
We present an open source, platform independent tool, called CisGenome Browser, which can work together with any other data analysis program to serve as a flexible component for genomic data visualization. It can also work by itself as a standalone genome browser. By working as a light-weight web server, CisGenome Browser is a convenient tool for data sharing between labs. It has features that are specifically designed for ultra high-throughput sequencing data visualization. http://biogibbs.stanford.edu/~jiangh/browser/
View details for DOI 10.1093/bioinformatics/btq286
View details for Web of Science ID 000279474400017
View details for PubMedID 20513664
An "Almost Exhaustive" Search-Based Sequential Permutation Method for Detecting Epistasis in Disease Association Studies
2010; 34 (5): 434-443
Due to the complex nature of common diseases, their etiology is likely to involve "uncommon but strong" (UBS) interactive effects, i.e., allelic combinations that are each present in only a small fraction of the patients but associated with high disease risk. However, the identification of such effects using standard methods for testing association can be difficult. In this work, we introduce a method for testing interactions that is particularly powerful in detecting UBS effects. The method consists of two modules: one is a pattern counting algorithm designed for efficiently evaluating the risk significance of each marker combination, and the other is a sequential permutation scheme for multiple testing correction. We demonstrate our method using a candidate gene data set for cardiovascular and coronary diseases with an injected UBS three-locus interaction. In addition, we investigate the power and false rejection properties of our method using data sets simulated from a joint dominance three-locus model that gives rise to UBS interactive effects. The results show that our method can be much more powerful than standard approaches such as trend test and multifactor dimensionality reduction for detecting UBS interactions.
View details for DOI 10.1002/gepi.20496
View details for Web of Science ID 000280349600007
View details for PubMedID 20583286
Analysis of factorial time-course microarrays with application to a clinical study of burn injury.
Proceedings of the National Academy of Sciences of the United States of America
2010; 107 (22): 9923-9928
Time-course microarray experiments are capable of capturing dynamic gene expression profiles. It is important to study how these dynamic profiles depend on the multiple factors that characterize the experimental condition under which the time course is observed. Analytic methods are needed to simultaneously handle the time course and factorial structure in the data. We developed a method to evaluate factor effects by pooling information across the time course while accounting for multiple testing and nonnormality of the microarray data. The method effectively extracts gene-specific response features and models their dependency on the experimental factors. Both longitudinal and cross-sectional time-course data can be handled by our approach. The method was used to analyze the impact of age on the temporal gene response to burn injury in a large-scale clinical study. Our analysis reveals that 21% of the genes responsive to burn are age-specific, among which expressions of mitochondria and immunoglobulin genes are differentially perturbed in pediatric and adult patients by burn injury. These new findings in the body's response to burn injury between children and adults support further investigations of therapeutic options targeting specific age groups. The methodology proposed here has been implemented in R package "TANOVA" and submitted to the Comprehensive R Archive Network at http://www.r-project.org/. It is also available for download at http://gluegrant1.stanford.edu/TANOVA/.
View details for DOI 10.1073/pnas.1002757107
View details for PubMedID 20479259
- OPTIONAL POLYA TREE AND BAYESIAN INFERENCE ANNALS OF STATISTICS 2010; 38 (3): 1433-1459
Hedgehog pathway-regulated gene networks in cerebellum development and tumorigenesis
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2010; 107 (21): 9736-9741
Many genes initially identified for their roles in cell fate determination or signaling during development can have a significant impact on tumorigenesis. In the developing cerebellum, Sonic hedgehog (Shh) stimulates the proliferation of granule neuron precursor cells (GNPs) by activating the Gli transcription factors. Inappropriate activation of Shh target genes results in unrestrained cell division and eventually medulloblastoma, the most common pediatric brain malignancy. We find dramatic differences in the gene networks that are directly driven by the Gli1 transcription factor in GNPs and medulloblastoma. Gli1 binding location analysis revealed hundreds of genomic loci bound by Gli1 in normal and cancer cells. Only one third of the genes bound by Gli1 in GNPs were also bound in tumor cells. Correlation with gene expression levels indicated that 116 genes were preferentially transcribed in tumors, whereas 132 genes were target genes in both GNPs and medulloblastoma. Quantitative PCR and in situ hybridization for some putative target genes support their direct regulation by Gli. The results indicate that transformation of normal GNPs into deadly tumor cells is accompanied by a distinct set of Gli-regulated genes and may provide candidates for targeted therapies.
View details for DOI 10.1073/pnas.1004602107
View details for Web of Science ID 000278054700048
View details for PubMedID 20460306
Modeling Co-Expression across Species for Complex Traits: Insights to the Difference of Human and Mouse Embryonic Stem Cells
PLOS COMPUTATIONAL BIOLOGY
2010; 6 (3)
Complex interactions between genes or proteins contribute substantially to phenotypic evolution. We present a probabilistic model and a maximum likelihood approach for cross-species clustering analysis and for identification of conserved as well as species-specific co-expression modules. This model enables a "soft" cross-species clustering (SCSC) approach by encouraging but not enforcing orthologous genes to be grouped into the same cluster. SCSC is therefore robust to obscure orthologous relationships and can reflect different functional roles of orthologous genes in different species. We generated a time-course gene expression dataset for differentiating mouse embryonic stem (ES) cells, and compiled a dataset of published gene expression data on differentiating human ES cells. Applying SCSC to analyze these datasets, we identified conserved and species-specific gene regulatory modules. Together with protein-DNA binding data, an SCSC cluster specifically induced in murine ES cells indicated that the KLF2/4/5 transcription factors, although critical to maintaining the pluripotent phenotype in mouse ES cells, were decoupled from the OCT4/SOX2/NANOG regulatory module in human ES cells. Two of the target genes of murine KLF2/4/5, LIN28 and NODAL, were rewired to be targets of OCT4/SOX2/NANOG in human ES cells. Moreover, by mapping SCSC clusters onto KEGG signaling pathways, we identified the signal transduction components that were induced in pluripotent ES cells in either a conserved or a species-specific manner. These results suggest that the pluripotent cell identity can be established and maintained through more than one gene regulatory network.
View details for DOI 10.1371/journal.pcbi.1000707
View details for Web of Science ID 000278125200015
View details for PubMedID 20300647
Modeling non-uniformity in short-read rates in RNA-Seq data
2010; 11 (5)
After mapping, RNA-Seq data can be summarized by a sequence of read counts commonly modeled as Poisson variables with constant rates along each transcript, which actually fit data poorly. We suggest using variable rates for different positions, and propose two models to predict these rates based on local sequences. These models explain more than 50% of the variations and can lead to improved estimates of gene and isoform expressions for both Illumina and Applied Biosystems data.
View details for DOI 10.1186/gb-2010-11-5-r50
View details for Web of Science ID 000279631000015
View details for PubMedID 20459815
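The variable-rate idea in the RNA-Seq abstract above ("Modeling non-uniformity in short-read rates in RNA-Seq data") can be illustrated with a minimal sketch. It assumes the position-specific relative weights are already known; in the paper they are predicted from local sequence, whereas the counts and weights below are made-up illustrative values and the function names are hypothetical. If the count at position j is Poisson with rate theta * w_j, the maximum-likelihood estimate of the expression level theta is the total count divided by the total weight, and the fit can be compared with the constant-rate model through the log-likelihood.

```python
import math

def poisson_loglik(counts, rates):
    """Log-likelihood of independent Poisson counts with the given rates."""
    return sum(n * math.log(r) - r - math.lgamma(n + 1)
               for n, r in zip(counts, rates))

def fit_expression(counts, weights):
    """MLE of theta when count_j ~ Poisson(theta * weight_j)."""
    return sum(counts) / sum(weights)

# Hypothetical read counts at six positions of one transcript.
counts = [2, 30, 4, 25, 3, 32]
# Hypothetical relative sequencing preferences (assumed to come from a fitted model).
weights = [0.2, 2.0, 0.3, 1.6, 0.2, 1.7]
# Constant-rate model: every position has the same weight.
uniform = [1.0] * len(counts)

theta_uniform = fit_expression(counts, uniform)
theta_variable = fit_expression(counts, weights)

ll_uniform = poisson_loglik(counts, [theta_uniform * w for w in uniform])
ll_variable = poisson_loglik(counts, [theta_variable * w for w in weights])

print(f"constant rate: theta={theta_uniform:.2f}, loglik={ll_uniform:.1f}")
print(f"variable rate: theta={theta_variable:.2f}, loglik={ll_variable:.1f}")
```

With these illustrative numbers both models give nearly the same expression estimate, but the variable-rate model attains a much higher log-likelihood, which is the sense in which constant rates "fit data poorly".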
ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2009; 106 (51): 21521-21526
Next-generation sequencing has greatly increased the scope and the resolution of transcriptional regulation study. RNA sequencing (RNA-Seq) and ChIP-Seq experiments are now generating comprehensive data on transcript abundance and on regulator-DNA interactions. We propose an approach for an integrated analysis of these data based on feature extraction of ChIP-Seq signals, principal component analysis, and regression-based component selection. Compared with traditional methods, our approach not only offers higher power in predicting gene expression from ChIP-Seq data but also provides a way to capture cooperation among regulators. In mouse embryonic stem cells (ESCs), we find that a remarkably high proportion of variation in gene expression (65%) can be explained by the binding signals of 12 transcription factors (TFs). Two groups of TFs are identified. Whereas the first group (E2f1, Myc, Mycn, and Zfx) act as activators in general, the second group (Oct4, Nanog, Sox2, Smad1, Stat3, Tcfcp2l1, and Esrrb) may serve as either activator or repressor depending on the target. The two groups of TFs cooperate tightly to activate genes that are differentially up-regulated in ESCs. In the absence of binding by the first group, the binding of the second group is associated with genes that are repressed in ESCs and derepressed upon early differentiation.
View details for DOI 10.1073/pnas.0904863106
View details for Web of Science ID 000272994200013
View details for PubMedID 19995984
Identifiability of isoform deconvolution from junction arrays and RNA-Seq
2009; 25 (23): 3056-3059
Splice junction microarrays and RNA-seq are two popular ways of quantifying splice variants within a cell. Unfortunately, isoform expressions cannot always be determined from the expressions of individual exons and splice junctions. While this issue has been noted before, the extent of the problem on various platforms has not yet been explored, nor have potential remedies been presented. We propose criteria that will guarantee identifiability of an isoform deconvolution model on exon and splice junction arrays and in RNA-Seq. We show that up to 97% of 2256 alternatively spliced human genes selected from the RefSeq database lead to identifiable gene models in RNA-seq, with similar results in mouse. However, in the Human Exon array only 26% of these genes lead to identifiable models, and even in the most comprehensive splice junction array only 69% lead to identifiable models. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btp544
View details for Web of Science ID 000272080800002
View details for PubMedID 19762346
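The identifiability question in the abstract above ("Identifiability of isoform deconvolution from junction arrays and RNA-Seq") can be sketched under one simple assumption: each measured feature (exon or junction) has an expected signal equal to the sum of the expressions of the isoforms that contain it. Under that linear model, isoform expressions are recoverable only if the isoform indicator vectors are linearly independent. This is a linear-algebra reading for illustration and not necessarily the paper's exact criteria; the toy gene and the helper names are hypothetical.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix via exact Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def is_identifiable(isoform_rows):
    """True if the isoform indicator vectors are linearly independent."""
    return matrix_rank(isoform_rows) == len(isoform_rows)

# Toy gene with exons E1..E4 and four isoforms (columns: E1, E2, E3, E4).
exons_only = {
    "A": [1, 1, 1, 1],  # all exons
    "B": [1, 0, 1, 1],  # skips E2
    "C": [1, 1, 0, 1],  # skips E3
    "D": [1, 0, 0, 1],  # skips E2 and E3
}
# Junction columns: E1-E2, E2-E3, E3-E4, E1-E3, E2-E4, E1-E4.
junctions = {
    "A": [1, 1, 1, 0, 0, 0],
    "B": [0, 0, 1, 1, 0, 0],
    "C": [1, 0, 0, 0, 1, 0],
    "D": [0, 0, 0, 0, 0, 1],
}
with_junctions = [exons_only[k] + junctions[k] for k in exons_only]

print(is_identifiable(list(exons_only.values())))  # exon features alone
print(is_identifiable(with_junctions))             # exon plus junction features
```

In this toy gene, A + D and B + C produce the same exon signal, so exon features alone cannot separate the four isoforms; adding junction features gives each isoform a distinguishing column.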
Dissecting Early Differentially Expressed Genes in a Mixture of Differentiating Embryonic Stem Cells
PLOS COMPUTATIONAL BIOLOGY
2009; 5 (12)
The differentiation of embryonic stem cells is initiated by a gradual loss of pluripotency-associated transcripts and induction of differentiation genes. Accordingly, the detection of differentially expressed genes at the early stages of differentiation could assist the identification of the causal genes that either promote or inhibit differentiation. The previous methods of identifying differentially expressed genes by comparing different cell types would inevitably include a large portion of genes that respond to, rather than regulate, the differentiation process. We demonstrate through the use of biological replicates and a novel statistical approach that the gene expression data obtained without prior separation of cell types are informative for detecting differentially expressed genes at the early stages of differentiation. Applying the proposed method to analyze the differentiation of murine embryonic stem cells, we identified and then experimentally verified Smarcad1 as a novel regulator of pluripotency and self-renewal. We formalized this statistical approach as a statistical test that is generally applicable to analyze other differentiation processes.
View details for DOI 10.1371/journal.pcbi.1000607
View details for Web of Science ID 000274229000025
View details for PubMedID 20019792
FoxOs Cooperatively Regulate Diverse Pathways Governing Neural Stem Cell Homeostasis
CELL STEM CELL
2009; 5 (5): 540-553
The PI3K-AKT-FoxO pathway is integral to lifespan regulation in lower organisms and essential for the stability of long-lived cells in mammals. Here, we report the impact of combined FoxO1, 3, and 4 deficiencies on mammalian brain physiology with a particular emphasis on the study of the neural stem/progenitor cell (NSC) pool. We show that the FoxO family plays a prominent role in NSC proliferation and renewal. FoxO-deficient mice show initial increased brain size and proliferation of neural progenitor cells during early postnatal life, followed by precocious significant decline in the NSC pool and accompanying neurogenesis in adult brains. Mechanistically, integrated transcriptomic, promoter, and functional analyses of FoxO-deficient NSC cultures identified direct gene targets with known links to the regulation of human brain size and the control of cellular proliferation, differentiation, and oxidative defense. Thus, the FoxO family coordinately regulates diverse genes and pathways to govern key aspects of NSC homeostasis in the mammalian brain.
View details for DOI 10.1016/j.stem.2009.09.013
View details for Web of Science ID 000272019500015
View details for PubMedID 19896444
Energy landscape of a spin-glass model: Exploration and characterization
PHYSICAL REVIEW E
2009; 79 (5)
The disconnectivity graph (DG) is widely used to represent energy landscapes. Although powerful numerical methods have been developed to construct DGs for continuous potential-energy surfaces, they have difficulties in applications to discrete Hamiltonians as the case of spin-glass models. When the configuration space is large, brute force enumeration of all configurations to build a DG is not practical. We propose an alternative approach to construct DGs based on recursive partition of Monte Carlo samples from microcanonical ensembles. To characterize energy landscapes, we define the local density of states (LDOS) on a DG, with which one can compute many thermodynamic properties over local energy basins for any temperature. Estimation of LDOS is developed with DG construction. We further propose the concepts of tree entropy and local escape probability, both of which are functions of local density of states, to capture the symmetry and the roughness of a Boltzmann distribution, respectively. Our approach is applied to a study of the Sherrington-Kirkpatrick spin-glass model with N varying between 20 and 100 spins. We observe that the energy landscape is extremely asymmetric and there exists a sharp increase in local escape probability preceding the transition from spin glass to paramagnetic phase.
View details for DOI 10.1103/PhysRevE.79.051117
View details for Web of Science ID 000266500700031
View details for PubMedID 19518426
Modeling the spatio-temporal network that drives patterning in the vertebrate central nervous system
BIOCHIMICA ET BIOPHYSICA ACTA-GENE REGULATORY MECHANISMS
2009; 1789 (4): 299-305
In this review, we discuss the gene regulatory network underlying the patterning of the ventral neural tube during vertebrate embryogenesis. The neural tube is partitioned into domains of distinct cell fates by inductive signals along both anterior-posterior and dorsal-ventral axes. A defining feature of the dorsal-ventral patterning is the graded distribution of Sonic hedgehog (Shh), which acts as a morphogen to specify several classes of ventral neurons in a concentration-dependent fashion. These inductive signals translate into patterned expressions of transcription factors that define different neural progenitor subtypes. Progenitor boundaries are sharpened by repressive interactions between these transcription factors. The progenitor-expressed transcription factors induce another set of transcription factors that are thought to contribute to neural identities in post-mitotic neural precursors. Thus, the gene regulatory network of the ventral neural tube patterning is characterized by hierarchical expression [inductive signal-->progenitor specifying factors (mitotic)--> precursor specifying factors (post mitotic)--> differentiated neural markers] and cross-repression between progenitor-expressed regulatory factors. Although a number of transcriptional regulators have been identified at each hierarchical level, their precise regulatory relationships are not clear. Here we discuss approaches aimed at clarifying and extending our understanding of the formation and propagation of this network.
View details for DOI 10.1016/j.bbagrm.2009.01.002
View details for Web of Science ID 000265729800008
View details for PubMedID 19445894
Cross-hybridization modeling on Affymetrix exon arrays
2008; 24 (24): 2887-2893
Microarray designs have become increasingly probe-rich, enabling targeting of specific features, such as individual exons or single nucleotide polymorphisms. These arrays have the potential to achieve quantitative high-throughput estimates of transcript abundances, but currently these estimates are affected by biases due to cross-hybridization, in which probes hybridize to off-target transcripts. To study cross-hybridization, we map Affymetrix exon array probes to a set of annotated mRNA transcripts, allowing a small number of mismatches or insertion/deletions between the two sequences. Based on a systematic study of the degree to which probes with a given match type to a transcript are affected by cross-hybridization, we developed a strategy to correct for cross-hybridization biases of gene-level expression estimates. Comparison with Solexa ultra high-throughput sequencing data demonstrates that correction for cross-hybridization leads to a significant improvement of gene expression estimates. We provide mappings between human and mouse exon array probes and off-target transcripts and provide software extending the GeneBASE program for generating gene-level expression estimates including the cross-hybridization correction http://biogibbs.stanford.edu/~kkapur/GeneBase/.
View details for DOI 10.1093/bioinformatics/btn571
View details for Web of Science ID 000261456700012
View details for PubMedID 18984598
An integrated software system for analyzing ChIP-chip and ChIP-seq data
2008; 26 (11): 1293-1300
We present CisGenome, a software system for analyzing genome-wide chromatin immunoprecipitation (ChIP) data. CisGenome is designed to meet all basic needs of ChIP data analyses, including visualization, data normalization, peak detection, false discovery rate computation, gene-peak association, and sequence and motif analysis. In addition to implementing previously published ChIP-microarray (ChIP-chip) analysis methods, the software contains statistical methods designed specifically for ChIP sequencing (ChIP-seq) data obtained by coupling ChIP with massively parallel sequencing. The modular design of CisGenome enables it to support interactive analyses through a graphic user interface as well as customized batch-mode computation for advanced data mining. A built-in browser allows visualization of array images, signals, gene structure, conservation, and DNA sequence and motif information. We demonstrate the use of these tools by a comparative analysis of ChIP-chip and ChIP-seq data for the transcription factor NRSF/REST, a study of ChIP-seq analysis with or without a negative control sample, and an analysis of a new motif in Nanog- and Sox2-binding regions.
View details for DOI 10.1038/nbt.1505
View details for Web of Science ID 000260832200025
View details for PubMedID 18978777
SeqMap: mapping massive amount of oligonucleotides to the genome
2008; 24 (20): 2395-2396
SeqMap is a tool for mapping large numbers of short sequences to the genome. It is designed for finding all the places in a reference genome where each sequence may come from. This task is essential to the analysis of data from ultra high-throughput sequencing machines. With a carefully designed index-filtering algorithm and an efficient implementation, SeqMap can map tens of millions of short sequences to a genome of several billion nucleotides. Multiple substitutions and insertions/deletions of the nucleotide bases in the sequences can be tolerated and therefore detected. SeqMap supports FASTA input format and various output formats, and provides command line options for tuning almost every aspect of the mapping process. A typical mapping can be done in a few hours on a desktop PC. Parallel use of SeqMap on a cluster is also very straightforward.
View details for DOI 10.1093/bioinformatics/btn429
View details for Web of Science ID 000259973500020
View details for PubMedID 18697769
A genome-scale analysis of the cis-regulatory circuitry underlying sonic hedgehog-mediated patterning of the mammalian limb
GENES & DEVELOPMENT
2008; 22 (19): 2651-2663
Sonic hedgehog (Shh) signals via Gli transcription factors to direct digit number and identity in the vertebrate limb. We characterized the Gli-dependent cis-regulatory network through a combination of whole-genome chromatin immunoprecipitation (ChIP)-on-chip and transcriptional profiling of the developing mouse limb. These analyses identified approximately 5000 high-quality Gli3-binding sites, including all known Gli-dependent enhancers. Discrete binding regions exhibit a higher-order clustering, highlighting the complexity of cis-regulatory interactions. Further, Gli3 binds inertly to previously identified neural-specific Gli enhancers, demonstrating the accessibility of their cis-regulatory elements. Intersection of DNA binding data with gene expression profiles predicted 205 putative limb target genes. A subset of putative cis-regulatory regions were analyzed in transgenic embryos, establishing Blimp1 as a direct Gli target and identifying Gli activator signaling in a direct, long-range regulation of the BMP antagonist Gremlin. In contrast, a long-range silencer cassette downstream from Hand2 likely mediates Gli3 repression in the anterior limb. These studies provide the first comprehensive characterization of the transcriptional output of a Shh-patterning process in the mammalian embryo and a framework for elaborating regulatory networks in the developing limb.
View details for DOI 10.1101/gad.1693008
View details for Web of Science ID 000259700900010
View details for PubMedID 18832070
Isolation and transcriptional profiling of purified hepatic cells derived from human embryonic stem cells
2008; 26 (8): 2032-2041
The differentiation of human embryonic stem cells (hESCs) into functional hepatocytes provides a powerful in vitro model system for studying the molecular mechanisms governing liver development. Furthermore, a well-characterized renewable supply of hepatocytes differentiated from hESCs could be used for in vitro assays of drug metabolism and toxicology, screening of potential antiviral agents, and cell-based therapies to treat liver disease. In this study, we describe a protocol for the differentiation of hESCs toward hepatic cells with complex cellular morphologies. Putative hepatic cells were identified and isolated using a lentiviral vector, containing the alpha-fetoprotein promoter driving enhanced green fluorescent protein expression (AFP:eGFP). Whole-genome transcriptional profiling was performed on triplicate samples of AFP:eGFP+ and AFP:eGFP- cell populations using the recently released Affymetrix Exon Array ST 1.0 (Santa Clara, CA, http://www.affymetrix.com). Statistical analysis of the transcriptional profiles demonstrated that the AFP:eGFP+ population is highly enriched for genes characteristic of hepatic cells. These data provide a unique insight into the complex process of hepatocyte differentiation, point to signaling pathways that may be manipulated to more efficiently direct the differentiation of hESCs toward mature hepatocytes, and identify molecular markers that may be used for further dissection of hepatic cell differentiation from hESCs. Disclosure of potential conflicts of interest is found at the end of this article.
View details for DOI 10.1634/stemcells.2007-0964
View details for Web of Science ID 000258297500011
View details for PubMedID 18535157
Defining Human Embryo Phenotypes by Cohort-Specific Prognostic Factors
2008; 3 (7)
Hundreds of thousands of human embryos are cultured yearly at in vitro fertilization (IVF) centers worldwide, yet the vast majority fail to develop in culture or following transfer to the uterus. However, human embryo phenotypes have not been formally defined, and current criteria for embryo transfer largely focus on characteristics of individual embryos. We hypothesized that embryo cohort-specific variables describing sibling embryos as a group may predict developmental competence as measured by IVF cycle outcomes and serve to define human embryo phenotypes. We retrieved data for all 1117 IVF cycles performed in 2005 at Stanford University Medical Center, and further analyzed clinical data from the 665 fresh IVF, non-donor cycles and their associated 4144 embryos. Thirty variables representing patient characteristics, clinical diagnoses, treatment protocol, and embryo parameters were analyzed in an unbiased manner by regression tree models, based on dichotomous pregnancy outcomes defined by positive serum beta-human chorionic gonadotropin (beta-hCG). IVF cycle outcomes were most accurately predicted at approximately 70% by four non-redundant, embryo cohort-specific variables that, remarkably, were more informative than any measures of individual, transferred embryos: total number of embryos, number of 8-cell embryos, rate (percentage) of cleavage arrest in the cohort and day 3 follicle stimulating hormone (FSH) level. While three of these variables captured the effects of other significant variables, only the rate of cleavage arrest was independent of any known variables. Our findings support defining human embryo phenotypes by non-redundant, prognostic variables that are specific to sibling embryos in a cohort.
View details for DOI 10.1371/journal.pone.0002562
View details for Web of Science ID 000263288200029
View details for PubMedID 18596962
- Learning causal Bayesian network structures from experimental data JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 2008; 103 (482): 778-789
Exon arrays provide accurate assessments of gene expression
2007; 8 (5)
We have developed a strategy for estimating gene expression on Affymetrix Exon arrays. The method includes a probe-specific background correction and a probe selection strategy in which a subset of probes with highly correlated intensities across multiple samples are chosen to summarize gene expression. Our results demonstrate that the proposed background model offers improvements over the default Affymetrix background correction and that Exon arrays may provide more accurate measurements of gene expression than traditional 3' arrays.
View details for DOI 10.1186/gb-2007-8-5-r82
View details for Web of Science ID 000246983100029
View details for PubMedID 17504534
Probe Selection and Expression Index Computation of Affymetrix Exon Arrays
2006; 1 (1)
There is great current interest in developing microarray platforms for measuring mRNA abundance at both gene level and exon level. The Affymetrix Exon Array is a new high-density gene expression microarray platform, with over six million probes targeting all annotated and predicted exons in a genome. An important question for the analysis of exon array data is how to compute overall gene expression indexes. Because of the complexity of the design of exon array probes, this problem is different in nature from summarizing gene-level expression from traditional 3' expression arrays. In this manuscript, we use exon array data from 11 human tissues to study methods for computing gene-level expression. We showed that for most genes there is a subset of exon array probes having highly correlated intensities across multiple samples. We suggest that these probes could be used as reliable indicators of overall gene expression levels. We developed a probe selection algorithm to select such a subset of highly correlated probes for each gene, and computed gene expression indexes using the selected probes. Our results demonstrate that probe selection improves gene expression estimates from exon arrays. The selected probes can be used in future analyses of other exon array datasets to compute gene expression indexes.
View details for DOI 10.1371/journal.pone.0000088
View details for Web of Science ID 000207443600087
View details for PubMedID 17183719
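The probe selection idea in the abstract above ("Probe Selection and Expression Index Computation of Affymetrix Exon Arrays") can be illustrated with a small sketch. The abstract does not spell out the published algorithm, so the greedy rule below (repeatedly drop the probe with the lowest mean correlation to the remaining probes) is a stand-in assumption, as are the threshold, the function names, and the toy intensities.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_probes(intensities, min_mean_corr=0.8):
    """Greedy illustration: drop the least-correlated probe until all remaining
    probes have mean pairwise correlation of at least min_mean_corr."""
    kept = list(intensities)
    while len(kept) > 2:
        mean_corr = {
            p: sum(pearson(intensities[p], intensities[q])
                   for q in kept if q != p) / (len(kept) - 1)
            for p in kept
        }
        worst = min(kept, key=mean_corr.get)
        if mean_corr[worst] >= min_mean_corr:
            break
        kept.remove(worst)
    return kept

# Hypothetical intensities of four probes of one gene across five samples.
intensities = {
    "p1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "p2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "p3": [0.5, 1.1, 1.4, 2.1, 2.4],
    "p4": [3.0, 1.0, 3.2, 0.9, 3.1],  # does not track the other probes
}

selected = select_probes(intensities)
print(selected)
```

A gene-level expression index could then be summarized from the selected probes only, for example by averaging them within each sample; that summary step is likewise an illustrative choice here.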
Reliable prediction of transcription factor binding sites by phylogenetic verification
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2005; 102 (47): 16945-16950
We present a statistical methodology that largely improves the accuracy in computational predictions of transcription factor (TF) binding sites in eukaryote genomes. This method models the cross-species conservation of binding sites without relying on accurate sequence alignment. It can be coupled with any motif-finding algorithm that searches for overrepresented sequence motifs in individual species and can increase the accuracy of the coupled motif-finding algorithm. Because this method is capable of accurately detecting TF binding sites, it also enhances our ability to predict the cis-regulatory modules. We applied this method on the published chromatin immunoprecipitation (ChIP)-chip data in Saccharomyces cerevisiae and found that its sensitivity and specificity are 9% and 14% higher than those of two recent methods. We also recovered almost all of the previously verified TF binding sites and made predictions on the cis-regulatory elements that govern the tight regulation of ribosomal protein genes in 13 eukaryote species (2 plants, 4 yeasts, 2 worms, 2 insects, and 3 mammals). These results give insights to the transcriptional regulation in eukaryotic organisms.
View details for DOI 10.1073/pnas.0504201102
View details for Web of Science ID 000233463200009
View details for PubMedID 16286651
De novo discovery of a tissue-specific gene regulatory module in a chordate
2005; 15 (10): 1315-1324
We engage the experimental and computational challenges of de novo regulatory module discovery in a complex and largely unstudied metazoan genome. Our analysis is based on the comprehensive characterization of regulatory elements of 20 muscle genes in the chordate, Ciona savignyi. Three independent types of data we generate contribute to the characterization of a muscle-specific regulatory module: (1) Positive elements (PEs), short sequences sufficient for strong muscle expression that are identified in a high-resolution in vivo analysis; (2) CisModules (CMs), candidate regulatory modules defined by clusters of overrepresented motifs predicted de novo; and (3) Conserved elements (CEs), short noncoding sequences of strong conservation between C. savignyi and C. intestinalis. We estimate the accuracy of the computational predictions by an analysis of the intersection of these data. As final biological validation of the discovered muscle regulatory module, we implement a novel algorithm to search the genome for instances of the module and identify seven novel enhancers.
View details for DOI 10.1101/gr.4062605
View details for Web of Science ID 000232436800001
View details for PubMedID 16169925
Sampling motifs on phylogenetic trees
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2005; 102 (27): 9481-9486
We present a method to find motifs by simultaneously using the overrepresentation property and the evolutionary conservation property of motifs. This method is applicable to divergent species where alignment is unreliable, which overcomes a major limitation of the current methods. The method has been applied to search regulatory motifs in four yeast species based on ChIP-chip data in Saccharomyces cerevisiae and obtained 20% higher accuracy than the best current methods. We also discovered cis-regulatory elements that govern the tight regulation of ribosomal protein genes in two distantly related insects by using this method. These results demonstrate that our method will be useful for the extraction of regulatory signals in multiple genomes.
View details for DOI 10.1073/pnas.0501620102
View details for Web of Science ID 000230406000010
View details for PubMedID 15983378
GeneNotes - A novel information management software for biologists
Collecting and managing information is a challenging task in a genome-wide profiling research project. Most databases and online computational tools require a direct human involvement. Information and computational results are presented in various multimedia formats (e.g., text, image, PDF, word files, etc.), many of which cannot be automatically processed by computers in biologically meaningful ways. In addition, the quality of computational results is far from perfect and requires nontrivial manual examination. The timely selection, integration and interpretation of heterogeneous biological information still heavily rely on the sensibility of biologists. Biologists often feel overwhelmed by the huge amount of and the great diversity of distributed heterogeneous biological information. We developed an information management application called GeneNotes. GeneNotes is the first application that allows users to collect and manage multimedia biological information about genes/ESTs. GeneNotes provides an integrated environment for users to surf the Internet, collect notes for genes/ESTs, and retrieve notes. GeneNotes is supported by a server that integrates gene annotations from many major databases (e.g., HGNC, MGI, etc.). GeneNotes uses the integrated gene annotations to (a) identify genes given various types of gene IDs (e.g., RefSeq ID, GenBank ID, etc.), and (b) provide quick views of genes. GeneNotes is free for academic usage. The program and the tutorials are available at: http://bayes.fas.harvard.edu/genenotes/. GeneNotes provides a novel human-computer interface to assist researchers to collect and manage biological information. It also provides a platform for studying how users behave when they manipulate biological information. The results of such study can lead to innovation of more intelligent human-computer interfaces that greatly shorten the cycle of biology research.
View details for DOI 10.1186/1471-2105-6-20
View details for Web of Science ID 000227451700001
View details for PubMedID 15686593
GoSurfer: a graphical interactive tool for comparative analysis of large gene sets in Gene Ontology space.
2004; 3 (4): 261-264
The analysis of complex patterns of gene regulation is central to understanding the biology of cells, tissues and organisms. Patterns of gene regulation pertaining to specific biological processes can be revealed by a variety of experimental strategies, particularly microarrays and other highly parallel methods, which generate large datasets linking many genes. Although methods for detecting gene expression have improved substantially in recent years, understanding the physiological implications of complex patterns in gene expression data is a major challenge. This article presents GoSurfer, an easy-to-use graphical exploration tool with built-in statistical features that allow a rapid assessment of the biological functions represented in large gene sets. GoSurfer takes one or two list(s) of gene identifiers (Affymetrix probe set ID) as input and retrieves all the Gene Ontology (GO) terms associated with the input genes. GoSurfer visualises these GO terms in a hierarchical tree format. With GoSurfer, users can perform statistical tests to search for the GO terms that are enriched in the annotations of the input genes. These GO terms can be highlighted on the GO tree. Users can manipulate the GO tree in various ways and interactively query the genes associated with any GO term. The user-generated graphics can be saved as graphics files, and all the GO information related to the input genes can be exported as text files. GoSurfer is a Windows-based program freely available for noncommercial use and can be downloaded at http://www.gosurfer.org. Datasets used to construct the trees shown in the figures in this article are available at http://www.gosurfer.org/download/GoSurfer.zip.
View details for PubMedID 15702958
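The GoSurfer abstract above mentions statistical tests for GO terms enriched in an input gene list without naming the test. A common choice for this kind of question is the one-sided hypergeometric test, sketched below as an assumption rather than as GoSurfer's actual implementation; the function name and the gene counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_pvalue(total_genes, annotated, list_size, hits):
    """P(X >= hits), where X is the number of annotated genes in a random
    list of list_size genes drawn without replacement from total_genes,
    of which `annotated` carry the GO term."""
    upper = min(annotated, list_size)
    numerator = sum(
        comb(annotated, i) * comb(total_genes - annotated, list_size - i)
        for i in range(hits, upper + 1)
    )
    return numerator / comb(total_genes, list_size)

# Hypothetical numbers: 1000 genes on the array, 50 annotated with a GO term,
# 40 genes in the input list, 8 of which carry the term (2 expected by chance).
p_value = hypergeom_enrichment_pvalue(1000, 50, 40, 8)
print(f"enrichment p-value: {p_value:.2e}")
```

When many GO terms are tested at once, the resulting p-values would normally need a multiple-testing correction before terms are declared enriched.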