2008: Department of Statistics and Data Science

2008

Fall 2008

Friday, October 3 at 1 pm

Speaker: Professor Terry Speed, Statistics, UC Berkeley and Bioinformatics, WEHI

Title: Some statistical issues arising with next-generation DNA sequencing data

Abstract: Next-generation sequencing machines are now producing tens of millions of short sequencing reads. These need to be mapped back to a reference genome, if there is one, and then further processed in a way which varies with the task. For mRNA-seq, these need to be assigned to genes, exons, or other transcriptional units, and counted. For ChIP-seq, we need to find putative binding sites. What sort of statistical issues arise, and how should we proceed with the analyses? Some initial ideas will be presented.

Thursday, October 23 at 4 pm

Speaker: Professor Ingram Olkin, Department of Statistics, Stanford University

Title: Life Distributions in Survival Analysis and Reliability: Structure of Semiparametric Families

Abstract: Semiparametric families are families that have both a real parameter and a parameter that is itself a distribution. A number of semiparametric families suitable for lifetime data in survival or reliability are introduced: scale, power, frailty (proportional hazards), age, moment, and others. Interesting results on stochastic orderings are obtained for these families. The coincidence of two families provides a characterization of the underlying distribution. Some of the characterization results provide a rationale for the use of certain families. In this talk we provide an overview of these semiparametric families, and present several characterizations.

This work is joint with Albert W. Marshall.
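As a rough illustration of the terminology in the abstract above (not taken from the talk itself), three of the named families can be written in terms of an underlying survival function. The symbols $\bar F$, $\lambda$, $\alpha$, $\theta$ and $k$ are notation assumed here for the sketch. The second display works through one familiar case in which two families coincide, a Weibull baseline with fixed shape $k$; it is offered only as an example of what "coincidence" means, not as one of the characterization results presented.

```latex
% Assumed notation: \bar F is the underlying (baseline) survival function.
\begin{align*}
\text{scale:}   &\quad \bar F(x \mid \lambda) = \bar F(\lambda x), & \lambda &> 0,\\
\text{power:}   &\quad \bar F(x \mid \alpha)  = \bar F(x^{\alpha}), & \alpha &> 0,\\
\text{frailty:} &\quad \bar F(x \mid \theta)  = [\bar F(x)]^{\theta}, & \theta &> 0.
\end{align*}

% Worked example: Weibull baseline \bar F(x) = \exp(-x^{k}), x \ge 0, with k > 0 fixed.
\[
\bar F(\lambda x) = \exp\{-(\lambda x)^{k}\}
                  = [\exp(-x^{k})]^{\lambda^{k}}
                  = [\bar F(x)]^{\lambda^{k}} .
\]
```

In this example the scale family with parameter $\lambda$ is the same set of distributions as the frailty family with parameter $\theta = \lambda^{k}$.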

Spring 2008

Wednesday, April 9 at 12 pm

Speaker: Professor Heping Zhang, Biostatistics, Director of Collaborative Center for Statistics in Science, Yale University

Title: Joint Modeling of Time Series Measures and Recurrent Events and Analysis of the Effects of Air Quality on Respiratory Symptoms

Abstract: Exposure to ambient pollutants at concentrations above defined standards is a risk factor for respiratory symptoms, especially in sensitive children. Many studies have been undertaken to monitor air quality and to assess its association with respiratory symptoms. We propose a joint mixed effects regression model of time series measures and recurrent events to analyze the air quality and respiratory symptom data from the Yale Mothers and Infants Health Study.

Three mothers' symptoms (runny nose, cough, and sore throat) and three infants' symptoms (runny nose, cough, and general sickness) were investigated. To alleviate the computational complexity, a two-stage maximum likelihood based estimation procedure is introduced to estimate the parameters, and simulation studies are conducted to assess the validity of this estimation procedure.

Our analysis reveals differences in the etiology of respiratory symptoms between mothers and infants. Most notably, coarse particles of mass between 2.5 and 10 microns in diameter increased the risks of mothers' runny nose and cough symptoms, but had no significant impact on any of the three infants' symptoms. The sulfate level was negatively associated with the risk of infants' runny nose and cough symptoms, but had no significant effects on any of the three mothers' symptoms. A high level of humidity was negatively associated with the mothers' cough incidence, but had no significant association with any of the three infants' symptoms. Such differences not only reveal the sensitivity of the mothers and infants to the air quality, but also call for further understanding of the differences. It is possible that actions taken to overcome humidity by mothers may inadvertently affect the infants.

This is joint work with Yuanqing Ye, Peter Diggle, and Jian Shi.

Wednesday, April 23 at 12 pm

Speaker: Professor Dan Nordman, Department of Statistics, Iowa State University

Title: Tapered empirical likelihood for time series data

Abstract: This talk aims to motivate and describe a formulation of empirical likelihood for time series inference based on tapered data blocks. Data blocks are a device for capturing the time dependence, and the proposed method involves tapering these blocks in a special way. The resulting empirical likelihood has chi-squared limits for nonparametrically calibrating confidence intervals for time series parameters, such as means and correlations. Tapering is shown to improve the chi-squared approximation and enhance the coverage accuracy of intervals compared to untapered empirical likelihood versions. Simulation evidence is provided and block choices are considered as well.

Wednesday, May 7 at 12 pm

Speaker: Professor Ginger Davis, Department of Systems and Information Engineering, University of Virginia

Title: Hierarchical Bayesian Markov Switching Models with Application to Predicting Spawning Success of Shovelnose Sturgeon

Abstract: The timing of spawning in fish is tightly linked to environmental factors; however, these factors are not very well understood for many species. Specifically, little information is available to guide recruitment efforts for endangered species such as the sturgeon. Therefore, we propose a Bayesian hierarchical model for predicting spawning success of the shovelnose sturgeon which uses both biological and behavioral (longitudinal) data. In particular, we use data produced from a tracking study conducted in the Lower Missouri River. The data produced from this study consist of biological variables associated with readiness to spawn along with longitudinal behavioral data collected using telemetry and data storage device sensors. These high-frequency data are complex both biologically and in the underlying behavioral process. To accommodate such complexity, the model we developed uses an eigenvalue predictor, derived from the transition probability matrix of a two-state Markov switching model with GARCH dynamics, as a generated regressor in a hierarchical linear regression model. Finally, in order to minimize the computational burden associated with estimation of this model, a parallel computing approach is proposed.

Wednesday, May 14 at 12 pm

Speaker: Professor Xiaofeng Shao, Department of Statistics, University of Illinois at Urbana-Champaign

Title: Portmanteau tests in time series

Abstract: This talk consists of two parts. In the first part, we will talk about testing for white noise and its applications to goodness-of-fit of long memory time series models. The limitation of the current asymptotic theory for portmanteau tests will be pointed out and new theoretical results will be discussed. In the second part, we will introduce generalized portmanteau type test statistics in the frequency domain to test independence between two stationary time series. Unlike the existing tests, each time series is allowed to possess short memory, long memory or anti-persistence. Under the null hypothesis of independence, the asymptotic null distributions of the proposed statistics are standard normal. The results from a simulation study will also be presented.
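For readers unfamiliar with the term, the sketch below computes the classical Ljung-Box portmanteau statistic for testing white noise. It is the textbook time-domain statistic, not the generalized frequency-domain statistics proposed in the talk, and the function names are hypothetical ones chosen for this illustration.

```python
def sample_autocorrelation(x, lag):
    """Lag-`lag` sample autocorrelation of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return num / denom


def ljung_box_q(x, max_lag):
    """Classical Ljung-Box statistic Q = n(n+2) * sum_k r_k^2 / (n - k)."""
    n = len(x)
    return n * (n + 2) * sum(
        sample_autocorrelation(x, k) ** 2 / (n - k)
        for k in range(1, max_lag + 1)
    )


# A strongly autocorrelated (alternating) series gives a large Q.
alternating = [1.0, -1.0] * 50
q_stat = ljung_box_q(alternating, max_lag=1)
```

Under the classical theory for a short-memory white noise null, Q is compared with a chi-squared distribution with `max_lag` degrees of freedom (5% critical value 3.841 for one lag). The abstract's point is that this standard asymptotic theory has limitations, for example when assessing the fit of long memory models.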

Winter 2008

Wednesday, February 13 at 12 pm

Speaker: Lu Tian, Assistant Professor, Department of Preventive Medicine, Northwestern University

Title: Lasso Regularization for the Accelerated Failure Time Model

Abstract: It is challenging to develop a stable regression model for predicting failure time outcomes when the dimension of the covariates is large relative to the sample size. A further complication arises because failure time responses are often not completely observed due to right censoring. In this paper, we propose to couple LASSO-type regularization methods with Gehan's rank-based estimator in the setting of the accelerated failure time model to construct a stable and parsimonious prediction model. Unlike the inverse probability weighting approach, the proposed estimators are valid under the general noninformative censoring assumption. We also propose an efficient numerical algorithm for obtaining the entire regularization path to facilitate the adaptive selection of the tuning parameter. We illustrate the proposed methods with an application to predict the survival time of breast cancer patients based on a set of clinical prognostic factors and collected gene signatures, and evaluate their finite sample performance through a simulation study.

Wednesday, February 27 at 12 pm

Speaker: Peter McCullagh, John D. MacArthur Distinguished Service Professor, Department of Statistics, University of Chicago

Title: Sampling bias and logistic models

Abstract: In a regression model, the joint distribution for each finite sample of units is determined by a function p_x(y) depending only on the list of covariate values x = (x(u_1), ..., x(u_n)) on the sampled units. No random sampling of units is involved. In biological work, random sampling is frequently unavoidable, in which case the joint distribution p(y, x) depends on the sampling scheme. Regression models can be used for the study of dependence provided that the conditional distribution p(y | x) for random samples agrees with p_x(y) as determined by the regression model for a fixed sample having a non-random configuration x. This paper develops a model that avoids the concept of a fixed population of units, thereby forcing the sampling plan to be incorporated into the sampling distribution. For a quota sample having a predetermined covariate configuration x, the sampling distribution agrees with the standard logistic regression model with correlated components. For most natural sampling plans such as sequential or simple random sampling, the conditional distribution p(y | x) is not the same as the regression distribution unless p_x(y) has independent components. In this sense, most natural sampling schemes involving binary random-effects models are biased. The implications of this formulation for subject-specific and population-averaged procedures are explored.

Wednesday, March 5 at 12 pm

Speaker: Sandy L. Zabell, Professor, Department of Statistics and Department of Mathematics, Northwestern University

Title: On Student’s 1908 paper “The probable error of a mean”

Abstract: This month marks the one-hundredth anniversary of the appearance of William Sealy Gosset’s celebrated paper “The probable error of a mean”. Gosset’s elegant contribution represented the first in a series of exact, “small-sample” results that were developed by Gosset, Fisher, and others to form a central component of the modern theory of statistical inference. This talk celebrates the centenary of Gosset’s paper by discussing both its background and impact on modern statistical theory and practice.

Wednesday, March 12 at 12 pm

Speaker: Rong Chen, Professor, Department of Statistics, Rutgers University

Title: Constrained Sequential Monte Carlo (CSMC)

Abstract: The sequential Monte Carlo (SMC) methodologies have been shown to have great promise in solving very high dimensional and complex problems often encountered in applications such as communication, bioinformatics and financial data analysis. The key to a successful SMC implementation is efficiency, not only in terms of statistical inference accuracy, but also in terms of computational complexity. Efficiency is directly related to the design of the key components of SMC, including the intermediate distributions, the trial 'growth' distribution, and the resampling method. Many problems in application share a common feature: the target distribution is highly constrained. That is, the target distribution is a truncated distribution on an ill-shaped subspace of a high dimensional space. The constraints, without careful treatment, are a main source of obstacles in successful implementations of SMC. In this talk, we develop a set of algorithms categorized as Constrained Sequential Monte Carlo (CSMC) for solving such problems, including strategies in designing the intermediate distributions, the trial distributions, the resampling steps and Markov moves with CSMC.
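To make the idea of a constrained target and a trial 'growth' distribution concrete, here is a minimal sketch of plain sequential importance sampling on a toy problem chosen for this illustration: counting self-avoiding walks on the square lattice. It is not the CSMC algorithm from the talk, it omits resampling and Markov moves, and the function names are hypothetical.

```python
import random


def grow_walk_weight(n_steps, rng):
    """Grow one self-avoiding walk step by step; return its importance weight.

    The trial distribution picks uniformly among unvisited neighbours, so the
    weight is the product of the number of choices available at each step.
    A walk that traps itself violates the constraint and gets weight 0.
    """
    pos = (0, 0)
    visited = {pos}
    weight = 1.0
    for _ in range(n_steps):
        x, y = pos
        neighbours = ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
        free = [p for p in neighbours if p not in visited]
        if not free:
            return 0.0  # dead end: the constraint cannot be satisfied
        weight *= len(free)
        pos = rng.choice(free)
        visited.add(pos)
    return weight


def estimate_walk_count(n_steps, n_samples, seed=0):
    """Average weight over samples; estimates the number of n-step walks."""
    rng = random.Random(seed)
    total = sum(grow_walk_weight(n_steps, rng) for _ in range(n_samples))
    return total / n_samples


estimate = estimate_walk_count(n_steps=4, n_samples=20000)
```

There are exactly 100 self-avoiding walks of length 4 on the square lattice, so `estimate` should be close to 100. For long walks most weights collapse toward zero, which is the kind of inefficiency that careful design of trial distributions and resampling steps is meant to address.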