2012: Department of Statistics and Data Science

Fall 2012

Wednesday Nov. 14th, 11am

Speaker: Yuan Ji, Director of Cancer Informatics, Center for Clinical and Research Informatics, NorthShore University HealthSystem

Title: Bayesian Models for Next Generation Sequencing Data on Epigenetics

Abstract: In this talk, I will describe how Bayesian models are successfully applied to the field of epigenetics, which is concerned with the regulatory mechanisms of gene expression. Epigenetics, one of the most heavily researched and challenging fields in biology, increasingly draws attention from statisticians due to breakthroughs in bioengineering and biotechnology that allow large-scale, high-throughput experiments to be routinely conducted at affordable cost. A central topic of epigenetics is to understand the chromatin state, that is, the modifications to histones and other proteins that package the DNA. A complex mechanism called the "histone code" is believed to dictate the dynamics of DNA expression. As a step towards deciphering the histone code, we develop Bayesian models based on genome-wide mapping of histone modifications. Such models are only initial attempts to decipher the complex histone code, but they highlight the need for Bayesian inference in research on gene regulation, an area that has received a relatively small amount of attention from statisticians. I will summarize our recent work and results using a comprehensive ChIP-Seq data set.

Wednesday Nov. 7th, 11am
Speaker: Hongyuan Cao, Assistant Professor of Biostatistics, University of Chicago

Title: Analysis of sparse asynchronous longitudinal data

Abstract: We consider estimation of regression models for sparse asynchronous longitudinal observations, where time-dependent response and covariates are observed intermittently within subjects. Unlike synchronous data, where response and covariates are observed at the same time points, asynchronous data have mismatched observation times. Simple kernel-weighted estimating equations are proposed for generalized linear models with either time-invariant or time-dependent coefficients. The time-dependent covariates are assumed to be smooth in time but sparsely observed, while the time-varying response may be continuous, categorical, or count data. For models with either time-invariant or time-dependent coefficients, the estimators are consistent and asymptotically normal. However, they converge at rates slower than those achievable with synchronous longitudinal data, where response and covariates are measured at the same time points. Simulation studies show that the methods perform well with realistic sample sizes and may be superior to methods for synchronous data based on an ad hoc last-value-carried-forward approach. The practical utility of the methods is illustrated on data from an HIV study.
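
Editorial illustration (not taken from the talk): the idea of a kernel-weighted estimating equation for asynchronous data can be sketched for the simplest case, a linear model with an intercept, one covariate, and time-invariant coefficients. Every within-subject pair of a response time t and a covariate time s is weighted by a kernel in (t - s) / h, so closely timed pairs count most. The function names, the Epanechnikov kernel, the data layout, and the bandwidth h are all assumptions made for this sketch; the speaker's estimator may differ in its details.

```python
def epanechnikov(u):
    """Epanechnikov kernel (an assumed choice; any smoothing kernel would do)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def async_linear_fit(subjects, h):
    """Solve the kernel-weighted normal equations for y = b0 + b1 * x.

    subjects: list of (responses, covariates) per subject, where
      responses  = [(t, y), ...]  response y observed at time t
      covariates = [(s, x), ...]  covariate x observed at time s
    h: bandwidth (hypothetical tuning parameter).
    """
    s_w = s_x = s_xx = s_y = s_xy = 0.0
    for responses, covariates in subjects:
        for t, y in responses:
            for s, x in covariates:
                # Pairs observed far apart in time get little or no weight.
                w = epanechnikov((t - s) / h) / h
                s_w += w
                s_x += w * x
                s_xx += w * x * x
                s_y += w * y
                s_xy += w * x * y
    det = s_w * s_xx - s_x * s_x
    if det == 0.0:
        raise ValueError("no usable pairs within the bandwidth, or no covariate variation")
    b1 = (s_w * s_xy - s_x * s_y) / det
    b0 = (s_y - b1 * s_x) / s_w
    return b0, b1


# Toy data: each subject's covariate is constant over time and y = 1 + 2x,
# so the mismatched observation times do not distort the fit.
toy = [
    ([(0.0, 3.0), (1.0, 3.0)], [(0.2, 1.0), (0.9, 1.0)]),
    ([(0.5, 5.0)], [(0.4, 2.0)]),
    ([(0.3, 1.0)], [(0.1, 0.0)]),
]
b0, b1 = async_linear_fit(toy, h=0.5)
```

With an identity link the estimating equation reduces to weighted least squares over all within-subject pairs, which is why a closed-form solution is available here; other link functions would need an iterative solver.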

Wednesday October 17, 11am
Place: Classroom, Department of Statistics, 2006 Sheridan Road
Speaker: Yuguo Chen, Associate Professor of Statistics, UIUC

Title: Sampling for Conditional Inference on Network Data

Abstract: Random graphs with given vertex degrees have been widely used as a model for many real-world complex networks. We describe a sequential sampling method for sampling networks with a given degree sequence. These samples can be used to approximate closely the null distributions of a number of test statistics involved in such networks, and provide an accurate estimate of the total number of networks with given vertex degrees. We apply our method to a range of examples to demonstrate its efficiency in real problems.
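
Editorial illustration (not taken from the talk): one published way to build a simple graph with a given degree sequence one edge at a time is the sequential scheme of Blitzstein and Diaconis, sketched below. It is shown only to make the idea of sequential sampling concrete and may differ from the speaker's method. The sketch omits the importance weights that would be needed to turn the samples into estimates of null distributions or graph counts. The function names are hypothetical.

```python
import random


def is_graphical(degrees):
    """Erdos-Gallai test: can these degrees be realized by a simple graph?"""
    d = sorted(degrees, reverse=True)
    if any(x < 0 for x in d) or sum(d) % 2 != 0:
        return False
    for k in range(1, len(d) + 1):
        if sum(d[:k]) > k * (k - 1) + sum(min(x, k) for x in d[k:]):
            return False
    return True


def sample_graph(degrees, rng):
    """Sequentially sample one simple graph with the given degree sequence."""
    d = list(degrees)
    if not is_graphical(d):
        raise ValueError("degree sequence is not graphical")
    edges = set()
    while any(d):
        # Work on the lowest-index vertex of minimal positive residual degree.
        i = min((v for v in range(len(d)) if d[v] > 0), key=lambda v: (d[v], v))
        while d[i] > 0:
            candidates = []
            for j in range(len(d)):
                if j == i or d[j] == 0 or frozenset((i, j)) in edges:
                    continue
                # Keep j only if the residual degrees stay graphical.
                d[i] -= 1
                d[j] -= 1
                ok = is_graphical(d)
                d[i] += 1
                d[j] += 1
                if ok:
                    candidates.append(j)
            if not candidates:
                raise RuntimeError("no admissible neighbor found")
            # Choose a neighbor with probability proportional to residual degree.
            j = rng.choices(candidates, weights=[d[c] for c in candidates])[0]
            edges.add(frozenset((i, j)))
            d[i] -= 1
            d[j] -= 1
    return edges


example_edges = sample_graph([3, 2, 2, 2, 1], random.Random(0))
```

Checking graphicality before every edge is what keeps the construction from reaching a dead end; the cost is one Erdos-Gallai test per candidate neighbor.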

Wednesday October 10, 11am
Speaker: Michael Zhu, Associate Professor of Statistics, Purdue University (http://www.stat.purdue.edu/~yuzhu/)

Title: Statistical Model-Based Methods for Transcript Expression Level Quantification and Their Comparison

Abstract: Further advancement and application of RNA-Seq technology call for the development of effective normalization methods for RNA-Seq data. In this talk, we propose to use finite Poisson mixture models to characterize the generating mechanism of RNA-Seq read counts and develop a procedure called MP-Seq to quantify transcript expression level.

Furthermore, we propose to use a system of measurement error models based on qRT-PCR, microarray and RNA-Seq gene expression data to compare and validate RNA-Seq normalization methods. As an application, we use the system to show that MP-Seq outperforms other existing quantification methods in the literature.

Winter 2012

Wednesday March 7, 11am

Speaker: Bruce Lindsay, Department of Statistics, Pennsylvania State University

Wednesday February 15, 11am

Speaker: Zhengyuan Zhu, Department of Statistics, Iowa State University

Title: Spatial Sampling Design and Wireless Sensor Networks

Abstract: Spatial sampling design problems have been studied by statisticians for many different application areas such as agriculture, soil science, ecology, and environmental science. Though many of the methodologies in spatial sampling design can be used to help design the sampling plan of wireless sensor networks (WSN), a WSN has characteristics, such as energy and communication constraints, that are not present in a traditional sampling network and that pose new challenges to statisticians. In this talk we will give an overview of spatial sampling design and discuss its relationship to the sampling design for WSN. An example of maximum-information predictive designs for model-based geostatistics and some preliminary results on the optimal sampling design of a WSN for parameter estimation under energy and communication constraints will be presented.

Wednesday February 8, 11am

Speaker: Wenxuan Zhong, Department of Statistics, University of Illinois, Urbana-Champaign

Title: Variable selection using dimension reduction model

Abstract: In this talk, a stepwise procedure will be discussed for variable selection under the sufficient dimension reduction framework, in which the response variable is influenced by a subset of predictors through an unknown function of a few linear combinations of them. Unlike linear stepwise regression, our proposed method does not impose a special form of relationship (such as linear) between the response variable and the predictor variables. Our method selects variables that attain the maximum correlation between the transformed response and the linear combination of the variables. Various asymptotic properties of the COP procedure are established, and in particular, its variable selection performance with a diverging number of predictors and sample size is investigated. The empirical performance of the COP procedure will be demonstrated in functional genomic analysis.

Wednesday January 11, 11am

Speaker: Yinxiao Huang, Department of Statistics, University of Illinois, Urbana-Champaign

Title: Nonparametric Online Inference for Time Series

Abstract: Online learning is concerned with the task of making real-time updates as new observations become available. This is especially relevant in time series contexts such as speech recognition and image processing, where large volumes of data arrive sequentially. However, classical nonparametric methods do not accommodate real-time updates.

In the literature, online learning with kernels has been extensively studied in both statistics and computer science. To my knowledge, most of these methods are studied under either i.i.d. conditions, which are not realistic for time series, or mixing conditions, which are hard to verify. In this talk, we consider online kernel estimation for time series data. The asymptotic behavior of our online kernel estimators, for both the density and the regression function, is explored for a general class of stationary time series under the dependence framework developed by Wu (2005). We establish asymptotic normality and almost sure convergence for the online kernel estimators and, in particular, a law of the iterated logarithm (LIL) for the online kernel density estimator; one generally does not have such a sharp convergence rate for traditional estimators. Our approach can be extended further to nonstationary processes.
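
Editorial illustration (not taken from the talk): a standard recursive kernel density estimator shows what a real-time update looks like. The estimate on a fixed grid is refreshed from each new observation alone, with no need to revisit past data. The class name, the Gaussian kernel, the grid, and the bandwidth rule h_n = n^(-1/5) are assumptions made for this sketch; the speaker's estimator and tuning may differ.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel (assumed choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


class OnlineKDE:
    """Recursive density estimate on a fixed grid of evaluation points."""

    def __init__(self, grid, bandwidth=lambda n: n ** -0.2):
        self.grid = list(grid)
        self.bandwidth = bandwidth  # maps the observation count n to h_n
        self.n = 0
        self.density = [0.0] * len(self.grid)

    def update(self, x_new):
        """Fold in one observation: f_n = (1 - 1/n) f_{n-1} + (1/n) K_{h_n}."""
        self.n += 1
        h = self.bandwidth(self.n)
        w = 1.0 / self.n
        for j, x in enumerate(self.grid):
            k = gaussian_kernel((x - x_new) / h) / h
            self.density[j] = (1.0 - w) * self.density[j] + w * k


stream = [0.1, -0.3, 0.5, 0.0]
est = OnlineKDE(grid=[-1.0, 0.0, 1.0])
for obs in stream:
    est.update(obs)
```

Each update costs time proportional to the grid size and uses constant memory, whereas a classical kernel estimator with a single sample-size-dependent bandwidth must be recomputed from all past observations when new data arrive.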