
Student Projects

Jake Schauer

My name is Jake Schauer, and I'm a 5th year grad student in the stats department. I work on a few projects that may be of interest to undergrads. 


Replication is fundamental to scientific progress; philosophically, science has been defined in terms of how well we can verify findings, and some fields even contend that a finding is not really a finding until it has been replicated. However, a recent crisis in science has seen empirical research call into question the replicability of findings in various fields (medicine, psychology, economics, etc.).

On its face, replication seems like a simple problem: just repeat an experiment and check that you get the same results. But it is not clear what we mean when we say "the same results." Recent analyses of replication studies suggest that it is not even clear what we mean by "results." Amid this ambiguity, analyses in high-profile papers have used different, conflicting definitions of replication, often within the same analysis. Moreover, since the definition of replication is not a settled matter, it seems unlikely that empirical evaluations of replication have been designed to ensure a sufficiently sensitive analysis.

This project provides a framework for defining and assessing replication based on meta-analysis. The framework brings clarity to subjective notions like "What are a study's results?" or "What do we mean by 'the same results'?" and provides analysis methods with known properties. Given these analyses, we can determine the types of program designs that might support unambiguous conclusions. The methodology developed in this program suggests that replication is far from simple, and that ideas about it that we may take for granted are less useful than we might think.
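As a hedged illustration of what a meta-analytic view of replication can look like, the sketch below computes Cochran's Q, a standard heterogeneity statistic, for a set of study estimates. It is not necessarily the analysis used in this project; the function name `cochran_q` and the example numbers are made up for illustration.

```python
def cochran_q(estimates, variances):
    """Return (Q, degrees of freedom) for k study estimates.

    Each study is weighted by its inverse sampling variance. If all studies
    estimate the same underlying effect, Q is approximately chi-square
    distributed with k - 1 degrees of freedom.
    """
    if len(estimates) != len(variances) or len(estimates) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]
    pooled = sum(w * t for w, t in zip(weights, estimates)) / sum(weights)
    q = sum(w * (t - pooled) ** 2 for w, t in zip(weights, estimates))
    return q, len(estimates) - 1


# Hypothetical effect estimates (and sampling variances) from an original
# study and one replication.
q_stat, df = cochran_q([0.2, 0.4], [0.01, 0.01])
```

A Q that is large relative to its degrees of freedom suggests the studies disagree by more than sampling error alone. How large counts as "large" depends on how replication is defined, which is the ambiguity described above.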



In various fields, data collected by researchers is subject to privacy restrictions. Ensuring the privacy of individuals' data is an important part of the research process. However, the tradeoff is that such protection can hinder the reproducibility of research findings, or prevent research from even occurring. This is particularly true in education, where laws protecting student privacy have made it difficult to obtain data from state data systems.

This project seeks to empirically test various data privacy methods for open source data. In education, masking student data involves either suppressing certain records or stochastically perturbing data before releasing it. However, the exact nature of the suppression or perturbation can affect the veracity of conclusions based on the released data. If many records are suppressed, the resulting open source data will likely preserve more privacy but have limited utility to researchers; likewise, if more stochastic noise is added, the data are better protected, but the noise can swamp any analysis run on the released dataset. The results of this project demonstrate the feasibility and properties of masking procedures in education. Using multiple imputation methods to mask data, we show that it is possible to preserve key relationships in student data while limiting the risk of disclosure. Moreover, we highlight the limitations of various micro-suppression strategies in generating useful open source datasets.
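To make the two masking ideas concrete, here is a minimal sketch of record suppression and stochastic perturbation. It does not reproduce the project's procedures: the helper names, the cell-size threshold, the noise level, and the student records are all hypothetical.

```python
import random
from collections import Counter


def suppress_small_cells(records, key, min_cell_size):
    """Drop records whose group (by `key`) has fewer than `min_cell_size` members."""
    counts = Counter(r[key] for r in records)
    return [r for r in records if counts[r[key]] >= min_cell_size]


def perturb(records, field, noise_sd, seed=0):
    """Return copies of the records with Gaussian noise added to `field`."""
    rng = random.Random(seed)
    return [{**r, field: r[field] + rng.gauss(0.0, noise_sd)} for r in records]


# Hypothetical student records.
students = [
    {"school": "A", "score": 71.0},
    {"school": "A", "score": 64.0},
    {"school": "A", "score": 80.0},
    {"school": "B", "score": 55.0},  # the only student in school B
]

# Suppression removes the easily identified student in school B ...
released = suppress_small_cells(students, key="school", min_cell_size=3)
# ... and perturbation blurs the remaining scores.
noisy = perturb(released, field="score", noise_sd=5.0)
```

Raising `min_cell_size` or `noise_sd` protects more students but leaves less usable signal, which is the tradeoff described above.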


Abby Smith and Katie Fitzgerald:

Katie and I have been collaborating on a side project with an anthropology professor (Dr. Sera Young) to help analyze some of her study data. We were connected with her through IPR (the Institute for Policy Research), which hosts many faculty members and graduate student affiliates. Here's an example of statistical collaboration in the academic setting :-)

Sera's focus is primarily on food and water insecurity as determinants of child/maternal health. She has also done a lot of work on pica: the craving and consumption of non-food items such as earth, charcoal, and ice. Sera designed one of the first longitudinal studies focused on pica behaviors among pregnant women in Kenya (pre/post pregnancy months). The study captured many sociodemographic factors (such as tribe, household size, and asset index) and also measured biological indicators, such as hemoglobin and gastrointestinal problems, at 9 different time points pre and post pregnancy. Much of the existing literature has focused on hemoglobin as an important biological determinant of pica, but few studies have ever quantitatively examined women longitudinally. Note that there are 2 main types of pica: amylophagy (eating raw starch) and geophagy (eating earth/clay). Our analysis ended up focusing on building a model with geophagy and hemoglobin as the responses.

We fit a generalized multilevel model for a binary response (engaging in pica). Multilevel models are statistical models of parameters that vary at more than one level. The classic example comes from education: a model of student test scores that contains measures for individual students as well as measures for the classrooms within which the students are grouped. We chose this model structure to distinguish the variation within each woman who participated (her behavior over the course of her 9 visits) from the variation between women. We ask: do the time-variant (longitudinal) aspects of the study actually matter for a woman, or is it simply the differences between women's average hemoglobin that matter more?

To quote: "If a basic Ordinary Least Squares regression is used on longitudinal data, it ignores the nesting of time points within people; in addition to concerns regarding bias and artificially small standard errors, the coefficients in this type of analysis are estimating an often uninterpretable combination of the within and between person trends. If the two trends are different, as they often are, this can lead to misleading results that give no real indication of the trends the researcher may be interested in."
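One common way to separate the two trends is person-mean centering: each time-varying predictor is split into a woman's own average (the between-person part) and her visit-by-visit deviation from that average (the within-person part). The sketch below shows only that step, under the assumption that this is a reasonable illustration of the idea; it is not taken from our analysis code, and the hemoglobin values are invented.

```python
from statistics import mean


def split_within_between(records):
    """Split each (person_id, value) into between- and within-person parts.

    Returns a list of (person_id, person_mean, deviation_from_person_mean).
    """
    by_person = {}
    for person_id, value in records:
        by_person.setdefault(person_id, []).append(value)
    person_mean = {pid: mean(vals) for pid, vals in by_person.items()}
    return [
        (pid, person_mean[pid], value - person_mean[pid])
        for pid, value in records
    ]


# Hypothetical hemoglobin readings (g/dL) for two women across visits.
visits = [("w1", 11.0), ("w1", 12.0), ("w1", 13.0), ("w2", 9.0), ("w2", 10.0)]
decomposed = split_within_between(visits)
```

Entering both parts as separate predictors lets a multilevel model estimate a within-person coefficient and a between-person coefficient instead of a blend of the two.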

A huge challenge in this project was the amount of missing data: both missingness from attrition in the study (35% of women dropped out before visit 9) and systematic missingness from the way the study was designed. For example, hemoglobin was measured at time points 1, 3, 5, and 6, while diarrhea/nausea were measured at time points 1-9. Complete case analysis, as the name suggests, drops all cases that have missing values and performs the analysis only on the complete cases. Although convenient, estimates based on this method can be severely biased if the MCAR (Missing Completely at Random) assumption does not hold. Any model that includes hemo and diar, for example, two variables we believe to be important predictors of geophagy, would automatically drop all data for 5 of the 9 visits, plus any other cases that were not complete for the remaining 4 visits. It has been shown that bias from complete case analysis increases with the proportion of missing data (Little and Rubin, 2002). Because this data set has high rates of missingness, we looked for an alternative.
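The effect of the measurement schedule on a complete case analysis can be seen in a small sketch. The visit schedule comes from the description above; the variable names, the placeholder values, and the `complete_cases` helper are hypothetical.

```python
HEMO_VISITS = {1, 3, 5, 6}   # visits at which hemoglobin was measured
ALL_VISITS = range(1, 10)    # diarrhea/nausea were measured at visits 1-9


def build_rows():
    """One woman's records; None marks a value missing by design."""
    return [
        {
            "visit": v,
            "hemo": 11.5 if v in HEMO_VISITS else None,  # placeholder value
            "diar": 0,                                   # placeholder value
        }
        for v in ALL_VISITS
    ]


def complete_cases(rows, variables):
    """Keep only rows where every listed variable is observed."""
    return [r for r in rows if all(r[var] is not None for var in variables)]


rows = build_rows()
kept = complete_cases(rows, ["hemo", "diar"])
dropped_visits = len(rows) - len(kept)  # 5 of the 9 visits are discarded
```

This is before any attrition or item-level missingness, which would remove still more rows.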

We used multiple imputation to generate 10 complete datasets, and then pooled the model coefficients (by averaging) and their standard errors across imputations using Rubin's rules. To quote the attached PDF: "[Multiple imputation] creates 2 or more complete (imputed) data sets based on an imputation model determined by the analyst. Often 5 imputed data sets is the default, but more may be preferred if the proportion of missingness is high. Analyses are run on each of the complete data sets, and results are pooled to compute final point estimates and corresponding standard errors by a method now known as “Rubin’s rules” (Rubin, 1987). Many missing data methods produce artificially small standard errors because they do not account for the additional uncertainty caused by missingness. Multiple imputation protects against this problem; the standard errors computed via Rubin’s rules account for both the conventional sampling variance (within-imputation variance) and the variance across imputations (between-imputation variance) (van Buuren, 17). The between-imputation variance serves as a quantification of the uncertainty due to missingness. Final pooled results from multiple imputation".
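For a single coefficient, the pooling step can be sketched as follows. This is a minimal illustration of Rubin's rules for a scalar estimate, not the code used in the project (the R package mice handles pooling itself); the function name `pool_rubin` and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, standard_errors):
    """Pool one coefficient across m imputed data sets with Rubin's rules.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(standard_errors):
        raise ValueError("need matching inputs from at least 2 imputations")
    pooled_estimate = mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = mean([se ** 2 for se in standard_errors])
    # Between-imputation variance: sample variance of the estimates (m - 1).
    between = variance(estimates)
    # Total variance adds a (1 + 1/m) correction for using finitely many imputations.
    total = within + (1 + 1 / m) * between
    return pooled_estimate, sqrt(total)


# Hypothetical coefficient estimates and standard errors from 3 imputations.
coef, se = pool_rubin([0.42, 0.38, 0.45], [0.10, 0.11, 0.09])
```

Note that the pooled standard error is not a simple average of the per-imputation standard errors: the between-imputation term makes it larger whenever the imputed data sets disagree.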

We used both R packages (mice and hmi [hierarchical multiple imputation]) and HLM software to perform multiple imputation and run models. The collaborative paper is currently in development and will hopefully be published in the American Journal of Clinical Nutrition.
