Fall 2019 Seminar Series

Double Hierarchical Generalized Linear Models for RNAseq Data: DHGLMseq

Wednesday, October 2, 2019

Time: 11:00 a.m.

Speaker: Dongseok Choi - Professor of Biostatistics at Oregon Health & Science University

Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road

Abstract: RNAseq has become the standard technology in gene expression studies in the past few years. It is considered superior to microarrays, which were the technology of choice in the 2000s. Since RNAseq data are typically summarized as counts per gene for downstream statistical analyses, statistical models based on negative binomial (NB) regression have been actively developed. To overcome the shortfalls of current NB-based models, we extended double hierarchical generalized linear models (DHGLMs) to high-dimensional count data such as RNAseq data and developed an R package for model fitting (DHGLMseq). In addition, we extended Lee and Bjørnstad’s false discovery rate (FDR) control for linear mixed models to high-dimensional DHGLMs. In this presentation, we will briefly review the history of statistical methods for RNAseq data and compare their power and false discovery rates by simulation.


Communication-Efficient Accurate Statistical Estimation

Wednesday, October 16, 2019

SPECIAL TIME: 3:00 p.m.

Speaker: Jianqing Fan - Professor of Finance, Professor of Statistics, and Professor of Operations Research and Financial Engineering, Princeton University

Place: Annenberg Hall G15, 2120 Campus Drive

Abstract: When the data are stored in a distributed manner, direct application of traditional statistical inference procedures is often prohibitive due to communication cost and privacy concerns. This paper develops and investigates two Communication-Efficient Accurate Statistical Estimators (CEASE), implemented through iterative algorithms for distributed optimization. In each iteration, node machines carry out computation in parallel and communicate with the central processor, which then broadcasts the aggregated gradient vector to node machines for new updates. The algorithms adapt to the similarity among loss functions on node machines, and converge rapidly when each node machine has a large enough sample size. Moreover, they do not require good initialization and enjoy linear convergence guarantees under general conditions. The contraction rate of optimization errors is derived explicitly, with dependence on the local sample size unveiled. In addition, the improved statistical accuracy per iteration is derived. By regarding the proposed method as a multi-step statistical estimator, we show that statistical efficiency can be achieved in finite steps in typical statistical applications. In addition, we give the conditions under which the one-step CEASE estimator is statistically efficient. Extensive numerical experiments on both synthetic and real data validate the theoretical results and demonstrate the superior performance of our algorithms.

(Joint work with Yongyi Guo and Kaizheng Wang)
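
Illustration: the communication pattern described in the abstract (nodes compute locally, a central processor aggregates gradients and broadcasts an update) can be sketched with plain averaged-gradient descent. This is not the CEASE estimator itself, which is not specified in the abstract. The squared-error loss, the step size, the function names, and the toy data below are all assumptions chosen for illustration.

```python
def local_gradient(theta, data):
    """Gradient of the local loss 0.5 * mean((theta - x)^2) on one node."""
    return sum(theta - x for x in data) / len(data)


def distributed_gradient_descent(node_data, theta0=0.0, step=0.5, iterations=50):
    """Plain gradient descent where only gradients travel between machines."""
    total = sum(len(d) for d in node_data)
    theta = theta0
    for _ in range(iterations):
        # Each node computes its own gradient (in parallel in a real system).
        grads = [local_gradient(theta, d) for d in node_data]
        # The central processor aggregates, weighting by local sample size,
        # and the updated parameter is broadcast back to the nodes.
        aggregated = sum(len(d) * g for d, g in zip(node_data, grads)) / total
        theta -= step * aggregated
    return theta


# Hypothetical toy data split across three node machines.
node_data = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
theta_hat = distributed_gradient_descent(node_data)
print(theta_hat)
```

For this loss the minimizer is the pooled sample mean, so the result can be checked against a centralized computation without ever moving the raw data off the nodes.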


A Parallel-Oriented Method to Understand Gene Transcriptional Network Architectures 

Wednesday, October 30, 2019

Time: 11:00 a.m.

Speaker: Dabao Zhang - Associate Professor of Statistics, Purdue University

Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road

Abstract: Genetical genomics couples genomic and transcriptomic techniques, and holds great promise in translational and clinical research by revealing gene interaction architectures and understanding functional properties of the genomic control program in biological systems. However, its utility has been compromised by the lack of effective methodologies to integratively analyze a variety of omics data. We propose a two-stage penalized least squares (2SPLS) method to build a large system of structural equations which models the entire gene regulatory network in an organism. This method is computationally fast because it fits linear models in parallel at each stage, one for each gene. The whole system can be constructed with bounded errors via consistent estimation of a set of conditional expectations at the first stage, and consistent selection of regulatory effects at the second stage. We also demonstrate its effectiveness via simulation studies and real data analysis.


Nonparametric Interaction Selection

Wednesday, November 13, 2019

Time: 11:00 a.m.

Speaker: Yichao Wu - Professor, Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago

Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road

Abstract: Interaction selection has gained much interest recently. Yet most of the existing interaction selection methods are parametric. In this talk, I will present a new method to perform nonparametric interaction selection. Our method is based on an estimation framework that couples the backfitting algorithm with local constant smoothing for the additive interaction model. In this framework, it is observed that an interaction term is unimportant if it favors an infinite smoothing bandwidth. Based on this observation, we propose to solve an optimization problem to estimate which interaction terms favor an infinite smoothing bandwidth, thus achieving nonparametric interaction selection. We will provide both numerical evidence and theoretical justification for the proposed nonparametric interaction selection method.
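
Illustration: the link between an infinite bandwidth and an unimportant term can be seen in a one-dimensional local constant (Nadaraya-Watson) smoother. As the bandwidth grows, every observation receives the same weight and the fit collapses to the sample mean, so it no longer varies with the predictor. The sketch below shows only this observation, not the speaker's selection method. The Gaussian kernel, the function name, and the toy data are assumptions.

```python
import math


def local_constant_fit(x_train, y_train, x0, bandwidth):
    """Nadaraya-Watson (local constant) estimate at x0 with a Gaussian kernel."""
    if math.isinf(bandwidth):
        # Infinite bandwidth: all observations are weighted equally.
        weights = [1.0] * len(x_train)
    else:
        weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in x_train]
    return sum(w * y for w, y in zip(weights, y_train)) / sum(weights)


# Hypothetical toy data.
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [1.0, 3.0, 2.0, 6.0]

narrow = local_constant_fit(x_train, y_train, 0.0, bandwidth=0.1)      # follows the nearest point
wide = local_constant_fit(x_train, y_train, 0.0, bandwidth=1e6)        # nearly flat
flat = local_constant_fit(x_train, y_train, 0.0, bandwidth=math.inf)   # exactly the sample mean
print(narrow, wide, flat)
```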


A Bayesian Hidden Markov Model for Detecting Differentially Methylated Regions

Wednesday, November 20, 2019

Time: 11:00 a.m.

Speaker: Tieming Ji - Assistant Professor, University of Missouri

Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road

Abstract: Alterations in DNA methylation have been linked to the development and progression of many diseases. The bisulfite sequencing technique presents methylation profiles at base resolution. Count data on methylated and unmethylated reads provide information on the methylation level at each CpG site. As more bisulfite sequencing data become available, these data are increasingly needed to infer methylation aberrations in diseases. Automated and powerful algorithms also need to be developed to accurately identify differentially methylated regions between treatment groups. This study adopts a Bayesian approach using the hidden Markov model to account for inherent dependence in read count data. Given the expense of sequencing experiments, few replicates are available for each treatment group. A Bayesian approach that borrows information across an entire chromosome improves the reliability of statistical inferences. The proposed hidden Markov model considers location dependence among genomic loci by incorporating correlation structures as a function of genomic distance. An iterative algorithm based on expectation-maximization is designed for parameter estimation. Methylation states are inferred by identifying the optimal sequence of latent states from observations. Real datasets and simulation studies that mimic the real datasets are used to illustrate the reliability and success of the proposed method.
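
Illustration: inferring the optimal sequence of latent states from observations is commonly done with the Viterbi algorithm. The abstract does not name the decoding algorithm, so its use here is an assumption. The sketch uses a plain two-state hidden Markov model with constant transition probabilities, whereas the model in the talk lets dependence vary with genomic distance. The state names, observation symbols, and all probabilities are invented for illustration.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for a discrete HMM."""
    # best[t][s]: log-probability of the best path that ends in state s at step t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# Hypothetical toy model: "u"/"m" mark sites with mostly unmethylated/methylated reads.
states = ["low", "high"]
log_start = {"low": math.log(0.5), "high": math.log(0.5)}
log_trans = logs({"low": {"low": 0.9, "high": 0.1}, "high": {"low": 0.1, "high": 0.9}})
log_emit = logs({"low": {"u": 0.8, "m": 0.2}, "high": {"u": 0.2, "m": 0.8}})

path = viterbi(["u", "u", "m", "m", "m"], states, log_start, log_trans, log_emit)
print(path)  # ['low', 'low', 'high', 'high', 'high']
```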