Colloquia

When: Thursday, September 4, 2025 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Anru Zhang, Department of Biostatistics & Bioinformatics and Department of Computer Science, Duke University

Abstract: The increasing availability of electronic health records (EHRs) and other biomedical data calls for methodologies that can generate high-quality synthetic data while preserving privacy, correcting bias, and addressing complex data structures. In this talk, I will present a series of recent advances in generative modeling for synthetic health data. First, using denoising diffusion probabilistic models, we develop a framework for generating realistic, privacy-preserving EHR time series that achieve superior fidelity and lower privacy risk than existing methods. Second, to address irregularly observed functional data, we introduce Smooth Flow Matching (SFM), a semiparametric copula flow framework capable of generating smooth, infinite-dimensional trajectories under irregular sampling and non-Gaussian structures. Finally, we propose a bias-corrected data synthesis strategy for imbalanced learning, which mitigates distortions introduced by synthetic samples and enhances predictive performance in rare-event classification. Collectively, these methods provide a principled foundation for generative modeling of synthetic health data, enabling privacy-preserving, bias-reduced analysis and broader utilization of sensitive biomedical datasets.

When: Thursday, September 11, 2025 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Cong Ma, Department of Statistics, University of Chicago

Abstract: Integrative data analysis often requires separating shared from individual variations across multiple datasets, typically using the Joint and Individual Variation Explained (JIVE) model. Despite its popularity, theoretical insights into JIVE methods remain limited, particularly in the context of multiple matrices and varying degrees of subspace misalignment. In this talk, I will present new theoretical results on the Angle-based JIVE (AJIVE) method—a two-stage spectral algorithm. Specifically, we establish that AJIVE achieves decreasing estimation error with an increasing number of matrices in high signal-to-noise ratio (SNR) regimes. In contrast, AJIVE faces inherent limitations in low-SNR conditions, where estimation error remains persistently high. Complementary minimax lower bounds confirm AJIVE’s optimal performance at high SNR, while analysis of an oracle estimator highlights fundamental limitations of spectral methods at low SNR.

When: Thursday, September 18, 2025 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Christopher Wikle, Department of Statistics, University of Missouri

Abstract: The world is full of extreme events. For example, a central question in public health planning might be to assess the likelihood of extreme exposures (meteorological conditions, air pollution, social stress, etc.). Such extreme events typically occur in spatial and/or temporal clusters. Yet, the principal methodologies that statisticians use to deal with spatially dependent processes (Gaussian processes and Markov random fields) are not suitable for complex tail dependence structures. This is particularly true of simulation model emulation. More flexible spatial extremes models exhibit appealing extremal dependence properties but are often prohibitively expensive to fit and simulate from in high dimensions. Here I present recent work where we develop a new spatial extremes model that has flexible and non-stationary dependence properties, and we integrate it in the encoding-decoding structure of a variational autoencoder (XVAE), whose parameters are estimated via variational Bayes combined with deep learning. The XVAE can be used to analyze high-dimensional data or as a spatio-temporal emulator that characterizes the distribution of potential mechanistic model output states and produces outputs that have the same statistical properties as the inputs, especially in the tail. Through extensive simulation studies, we show that our XVAE is substantially more time-efficient than traditional Bayesian inference while also outperforming many spatial extremes models with a stationary dependence structure. We demonstrate our method applied to a high-resolution satellite-derived dataset of sea surface temperature in the Red Sea and to a high-resolution simulation model of a turbulent plume, such as one would find in a wildfire. We note, however, that these methods can be applied to any data set or simulation model that exhibits extremes.

When: Thursday, September 25, 2025 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Seungchul Baek, Department of Mathematics and Statistics, University of Maryland, Baltimore County

Abstract: I introduce two projects related to high-dimensional classification. The first project focuses on developing a classifier using random partitioning. Specifically, we split the original high-dimensional data ($p>n$) into multiple low-dimensional subsets, making sure the number of selected covariates is less than the sample size. Using these partitioned datasets, we apply linear discriminant analysis (LDA) to each subset and propose a method to aggregate the results. We provide theoretical justification for our approach by comparing its misclassification rates to those of LDA in high dimensions. The second project concerns variable selection in high-dimensional classification. By utilizing the recently proposed mirror statistic, we first identify significant variables and then develop a new classifier based on a modified version of the $\epsilon$-greedy algorithm.
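The random-partitioning idea in the first project can be illustrated with a minimal sketch. It assumes two classes coded 0/1, a pooled-covariance LDA per block, and a simple majority vote as the aggregation rule. The vote is a stand-in chosen for illustration, not the aggregation method proposed in the talk, and the function names (`fit_lda`, `random_partition_lda`, `predict`) are hypothetical.

```python
import numpy as np


def fit_lda(X, y):
    """Two-class LDA with pooled covariance; returns (weights, intercept)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    centered = np.vstack([X0 - mu0, X1 - mu1])
    pooled = centered.T @ centered / (len(X) - 2)
    w = np.linalg.solve(pooled, mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1) + np.log(len(X1) / len(X0))
    return w, b


def random_partition_lda(X, y, block_size, rng):
    """Split the p covariates into random blocks and fit one LDA per block."""
    n, p = X.shape
    # Each block must have fewer covariates than samples so that the
    # pooled covariance matrix is invertible.
    assert block_size < n - 2, "each block must be low-dimensional"
    perm = rng.permutation(p)
    blocks = [perm[i:i + block_size] for i in range(0, p, block_size)]
    return [(cols, *fit_lda(X[:, cols], y)) for cols in blocks]


def predict(models, X):
    """Aggregate per-block decisions by majority vote (an assumption here)."""
    votes = np.stack([(X[:, cols] @ w + b > 0) for cols, w, b in models])
    return (votes.mean(axis=0) > 0.5).astype(int)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 60, 200  # p > n, as in the abstract
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, p)) + 0.6 * y[:, None]  # synthetic mean shift
    models = random_partition_lda(X, y, block_size=10, rng=rng)
    print("training accuracy:", (predict(models, X) == y).mean())
```

The synthetic data in the usage block are only there to exercise the code; a full-dimensional LDA would fail on them because the 200 x 200 pooled covariance is singular when n = 60.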

When: Tuesday, October 14, 2025 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Philip Ernst, Department of Mathematics, Imperial College London

Abstract: In 1926, G. Udny Yule considered the following problem: given two i.i.d. random walks independent from each other, what is the distribution of their empirical correlation coefficient? Yule empirically observed the distribution of this statistic to be heavily dispersed and frequently large in absolute value, leading him to call it “nonsense correlation.” This unexpected finding led to his formulation of two concrete questions, each of which would remain open for more than ninety years: (i) Find (analytically) the variance of the empirical correlation coefficient and (ii) Find (analytically) the higher order moments and the density of the empirical correlation coefficient. Ernst, Shepp, and Wyner (Annals of Statistics, 2017) considered the empirical correlation coefficient of two independent Wiener processes, the limit to which the empirical correlation for two independent random walks converges weakly. Using tools from integral equation theory, we closed question (i) by explicitly calculating the second moment of the empirical correlation coefficient to be 0.240522. This talk begins where Ernst et al. (2017) leaves off. I shall explain how we finally succeeded in closing question (ii) by explicitly calculating all moments of the empirical correlation coefficient (up to order 16). This leads, for the first time, to an approximation to the density of Yule's nonsense correlation. I shall then proceed to explain how we were able to explicitly compute higher moments of the empirical correlation coefficient when the two independent Wiener processes are replaced by two correlated Wiener processes, two independent Ornstein-Uhlenbeck processes, and two independent Brownian bridges. I will conclude by stating a Central Limit Theorem for the case of two independent Ornstein-Uhlenbeck processes. This result shows that Yule's “nonsense correlation” is indeed not “nonsense” for stochastic processes which admit stationary distributions. This work is joint with L.C.G. Rogers (Cambridge) and Quan Zhou (Texas A&M) and appeared in Bernoulli in February 2025. We shall conclude with a discussion of some concrete applications of our work to the study of weather and climate extremes. The latter is part of our ongoing collaboration with the U.S. Office of Naval Research (2018-present).
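Yule's observation is easy to reproduce numerically. The sketch below is a Monte Carlo illustration only, not the analytic method of the talk: it simulates pairs of independent Gaussian random walks and estimates the second moment of their empirical correlation, which should land near the value 0.240522 quoted above for the Wiener-process limit. The walk length, replicate count, seed, and function names are arbitrary choices made for this example.

```python
import random
import statistics


def empirical_correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def random_walk(n_steps, rng):
    """Partial sums of i.i.d. standard normal increments."""
    position = 0.0
    path = []
    for _ in range(n_steps):
        position += rng.gauss(0.0, 1.0)
        path.append(position)
    return path


def simulate_nonsense_correlation(n_walks=2000, n_steps=500, seed=0):
    """Empirical correlations for n_walks pairs of independent walks."""
    rng = random.Random(seed)
    return [
        empirical_correlation(random_walk(n_steps, rng), random_walk(n_steps, rng))
        for _ in range(n_walks)
    ]


if __name__ == "__main__":
    rhos = simulate_nonsense_correlation()
    second_moment = statistics.fmean(r * r for r in rhos)
    share_large = sum(abs(r) > 0.5 for r in rhos) / len(rhos)
    print(f"estimated second moment: {second_moment:.3f}")
    print(f"share with |correlation| > 0.5: {share_large:.2f}")
```

Because the two walks are independent, a naive reading would expect correlations near zero; the wide spread of the simulated values is the "nonsense correlation" phenomenon.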

When: Thursday, October 16, 2025 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Jason Klusowski, Department of Operations Research and Financial Engineering, Princeton University

Abstract: Statisticians often work in settings with limited labeled data and abundant unlabeled data. During training, they may even have access to extra side information (some labeled, some not) that won’t be available once the model is deployed. When can this side information actually improve performance? I’ll present a simple framework where a rich-view model that sees the extra features generates pseudo-labels on the large unlabeled data, and a deployment model that only sees the standard features is trained on both real and pseudo-labels. The two are trained iteratively: each deployment model update calibrates the next round of pseudo-labels, and those refined pseudo-labels in turn guide the deployment model. Our theory shows that side information helps precisely when the rich-view and deployment models make different kinds of errors. We formalize this with a decorrelation score that quantifies how independent those errors are; the more independent, the greater the performance gains.

When: Thursday, October 30, 2025 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Tingting Zhang, Department of Statistics, University of Pittsburgh

Abstract: The human brain is a high-dimensional directed network system of brain regions involving directed connectivity. Seizures are a directed network phenomenon, as abnormal neuronal activities start from a seizure onset zone (SOZ) and propagate to otherwise healthy regions. To localize the SOZ of an epileptic patient, clinicians use intracranial EEG (iEEG) to record the patient’s brain activity in many small regions. iEEG data are high-dimensional multivariate time series. To model the underlying directed brain network, we build a state-space multivariate autoregression (SSMAR) model for iEEG data. To produce scientifically meaningful network results, we incorporate prior knowledge that brain networks tend to exhibit modular organization. Specifically, we assign a stochastic-blockmodel-motivated prior to the SSMAR parameters, which encourages modularity in the estimated networks. We develop a Bayesian framework to estimate the SSMAR model, infer directed connections, and identify network modules. The method is robust to violations of model assumptions and outperforms existing network approaches. When applied to iEEG data from an epileptic patient, the model reveals patterns of seizure initiation and propagation and uncovers a distinct connectivity profile of the SOZ. We also extend this Bayesian approach to fMRI data, identifying functionally specialized modules and directed interactions between them.

When: Thursday, November 6, 2025 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Nathaniel Josephs, Department of Statistics, North Carolina State University

Abstract: The graph-matching problem is a classic task that involves finding the correspondence between the vertices of two graphs. A new class of nonparametric priors is introduced for permutations by borrowing ideas from the extensive literature on partition structures. This enables a Bayesian approach to graph matching that combines the position-aware Chinese restaurant process with a correlated stochastic block model likelihood. A node-wise blocked Gibbs sampler is proposed for posterior inference, as well as an efficient posterior summary technique that leverages variation of information (VI) summaries for partitions.

When: Thursday, November 13, 2025 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Yichao Wu, Department of Mathematics, Statistics, and Computer Science, University of Illinois Chicago

Abstract: The first part of the talk will focus on the general partially linear model without any structure assumption on the nonparametric component. For such a model with both linear and nonlinear predictors being multivariate, we propose a new variable selection method. Our new method is a unified approach in the sense that it can select both linear and nonlinear predictors simultaneously by solving a single optimization problem. We prove that the proposed method achieves consistency. The second part of the talk will be based on an ongoing research project. In this project, we are extending the above variable selection method to partially global Fréchet regression (Tucker and Wu, 2025 Statistica Sinica).

When: Thursday, November 20, 2025 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Gemma Moran, Department of Statistics, Rutgers University

Abstract: High-dimensional data often exhibit variation that can be captured by lower dimensional factors. For high-dimensional data from multiple studies or environments, one goal is to understand which underlying factors are common to all studies, and which factors are study or environment-specific. As a particular example, we consider platelet gene expression data from patients in different disease groups. In this data, factors correspond to clusters of genes which are co-expressed; we may expect some clusters (or biological pathways) to be active for all diseases, while some clusters are only active for a specific disease. To learn these factors, we consider a nonlinear multi-study factor model, which allows for both shared and specific factors. To fit this model, we propose a multi-study sparse variational autoencoder. The underlying model is sparse in that each observed feature (i.e. each dimension of the data) depends on a small subset of the latent factors. In the genomics example, this means each gene is active in only a few biological processes. Further, the model implicitly induces a penalty on the number of latent factors, which helps separate the shared factors from the group-specific factors. We prove that the shared factors are identified, and demonstrate our method recovers meaningful factors in the platelet gene expression data.

When: Tuesday, January 20, 2026 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: TBD

Abstract: TBD

When: Friday, January 23, 2026 — 2:00 p.m. to 3:00 p.m.
Where: LeConte 224

Speaker: TBD

Abstract: TBD

When: Tuesday, January 27, 2026 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: TBD

Abstract: TBD

When: Thursday, January 29, 2026 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: TBD

Abstract: TBD

When: Tuesday, February 3, 2026 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: TBD

Abstract: TBD

When: Tuesday, February 10, 2026 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: TBD

Abstract: TBD

When: Thursday, February 19, 2026 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Jay Bartroff, Department of Statistics and Data Sciences, University of Texas at Austin

Abstract: A novel method for fixed-width confidence intervals for the binomial success probability, called the Push Algorithm, appeared in Asparaouhov's PhD thesis, which cited an unknown manuscript by Lorden. In this talk I'll discuss the little-known method, and our extension of it to any bounded parameter in a monotone likelihood ratio family. The method produces the shortest possible fixed-width confidence interval for a given confidence level, and if the Push interval does not exist for a given width and level then no such interval exists. We demonstrate it on the binomial, hypergeometric, and normal distributions with our available R package, where it outperforms the standard intervals, including the venerable z-interval in the normal case. This is joint work with undergraduate student Asmit Chakraborty.

When: Thursday, February 26, 2026 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Ciprian Crainiceanu, Department of Biostatistics, Johns Hopkins University

Abstract: Wearable devices, such as accelerometers and heart monitors, are used in health research because they provide objective, continuous, unbiased, and detailed information about human activity either in the laboratory or the free-living environment. In this talk I will explore the different resolutions of the data, ways to summarize it, and inferential methods for exploring the associations with health outcomes. We will illustrate these methods using large, publicly available datasets, including the NHANES and UK Biobank. We will also show that objectively measured physical activity is the strongest predictor of mortality and cardiovascular mortality and the strongest modifiable risk factor of Multiple Sclerosis, Parkinson's Disease, and Alzheimer's Disease.

When: Thursday, March 5, 2026 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Minjeong Jeon, Department of Education, University of California, Los Angeles

Abstract: In this talk, I will introduce a novel approach for longitudinal assessments that involve item responses from individuals at two or more time points. A key limitation of existing longitudinal models is their inability to capture item-by-person interactions that can change over time. To address this, I propose an interaction map approach that can capture and visualize time-varying person-by-item interactions, offering valuable insights into individuals’ progress over time. Furthermore, I will present a more structured version of the interaction map approach, which focuses on tracking individuals’ progress toward a measurement target directly within the map. Real-world examples will be shared to illustrate the practical applications of the proposed methodologies.

When: Friday, March 27, 2026 — 3:00 p.m. to 4:00 p.m.
Where: Bates West Social Room

Speaker: Dr. Bin Yu, Department of Statistics, University of California, Berkeley

Note: Palmetto Lecture I, Palmetto Symposium

Abstract: Data science underpins modern AI and many advances in healthcare, yet human judgment permeates every stage of the data science life cycle. These judgment calls introduce hidden uncertainties that go well beyond sampling variability and drive many of the risks associated with AI. We introduce veridical data science, grounded in three fundamental principles—Predictability, Computability, and Stability (PCS)—to make such uncertainties explicit and assessable and to aggregate reality-checked algorithms for better results. The PCS framework unifies and extends best practices in statistics and machine learning and is illustrated through healthcare applications, including identifying genetic drivers of heart disease, reducing cost of prostate cancer detection, improving uncertainty quantification beyond standard conformal prediction, and proposing Green Shielding, a new user-centric framework for safeguarding users of AI.

When: Saturday, March 28, 2026 — 11:00 a.m. to 12:00 p.m.
Where: Science and Technology Building, Room 351

Speaker: Dr. Bin Yu, Department of Statistics, University of California, Berkeley

Note: Palmetto Lecture II, Department 40th Anniversary

Abstract: The rapid advancement of AI relies heavily on the foundation of data science, yet its education significantly lags its demand in practice. The upcoming book Veridical Data Science: The Practice of Responsible Data Analysis and Decision Making (Yu and Barter, MIT Press, 2024; free online at www.vdsbook.com) tackles this gap by promoting Predictability, Computability, and Stability (PCS) as core principles for trustworthy data insights. It thoroughly integrates these principles into the Data Science Life Cycle (DSLC), from problem formulation to data cleansing and the communication of results, fostering a new standard for responsible data analysis. This talk explores the book's motivations, comparing its approach with traditional ones. Using examples from chapters on data cleansing and clustering analysis, I will demonstrate PCS's practical applications and describe four types of homework assignments—True/False, conceptual, mathematical, and coding—to solidify learners' grasp. Time permitting, I will discuss a prostate cancer research case study, illustrating PCS's effectiveness in real-world data analysis.

When: Thursday, April 2, 2026 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Daniel Bolt, Department of Educational Psychology, University of Wisconsin–Madison

Abstract: Both research and practice involving standardized tests increasingly rely, in both direct and indirect ways, on latent variable models. However, the models often prove too flexible, with various forms of systematic model misspecification easily being absorbed into, and hence distorting, the latent metrics despite good statistical fit. In this talk, I consider several educational measurement applications that have relied on the traditional models but return questionable results likely due to such misspecification. Several anticipated sources of metric distortion are identified and explored through sensitivity analyses. A common theme is the relevance and need for practical measurement models that can accommodate asymmetry in the measurement link function.

When: Thursday, April 9, 2026 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Xinyi Li, School of Mathematical and Statistical Sciences, Clemson University

Abstract: Modern studies increasingly pair clinical features with high-dimensional imaging, where each scan can be viewed as a function living in a Hilbert space. This talk introduces a unified approach that incorporates imaging data as interpretable features via functional principal component analysis (FPCA). First, we discuss a framework for linear regression with Hilbert-space-valued covariates that provides asymptotic normal inference and bootstrap uncertainty quantification, explicitly accounting for the fact that FPCA bases are estimated from data. Second, we use the proposed multi-dimensional FPCA features from imaging to estimate individualized treatment regimes under standard causal assumptions, enabling treatment decisions informed by patient-specific imaging patterns along with risk factors. The proposed methods are applied to Alzheimer's Disease Neuroimaging Initiative (ADNI) data, where PET scans and genetic and demographic covariates are used to model cognitive outcomes and guide personalized treatment strategies.

When: Thursday, April 16, 2026 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Sebastian Kurtek, Department of Statistics, The Ohio State University

Abstract: Intra-tumor heterogeneity driving disease progression is characterized by distinct growth and spatial proliferation patterns of cells and their nuclei within tumor and non-tumor tissues. A widely accepted hypothesis is that these spatial patterns are correlated with morphology of the cells and their nuclei. Nevertheless, tools to quantify the correlation, with uncertainty, are scarce, and the state-of-the-art is based on low-dimensional numerical summaries of the shapes that are inadequate to fully encode shape information. To this end, we propose a marked point process framework to assess spatial correlation among shapes of planar closed curves, which represent cell or nuclei outlines. With shapes of curves as marks, the framework is based on a mark-weighted K function, a second-order spatial statistic that accounts for the marks' variation by using test functions that capture only the shapes of cells and their nuclei. We then develop local and global hypothesis tests for spatial dependence between the marks using the K function. The framework is brought to bear on the cell nuclei extracted from histopathology images of breast cancer, where we uncover distinct correlation patterns that are consistent with clinical expectations. This is joint work with Ye Jin Choi (former Ph.D. student in Statistics at The Ohio State University), Simeng Zhu (James Cancer Center, The Ohio State University) and Karthik Bharath (School of Mathematical Sciences, University of Nottingham).

When: Thursday, April 23, 2026 — 2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Raymond Wong, Department of Statistics, Texas A&M University

Abstract: In reinforcement learning, distributional off-policy evaluation (OPE) aims to estimate the return distribution of a target policy using offline data collected under a potentially different behavior policy. In this talk, I will focus on an approach called fitted distributional evaluation (FDE), which extends the widely used fitted Q-evaluation -- developed for expectation-based reinforcement learning -- to the distributional OPE setting. Although a few related methods exist, there is currently no unified framework for designing FDE algorithms. To address this, I will present a set of guiding principles for constructing theoretically sound FDE methods. Building on these principles, we can develop several new FDE algorithms with convergence guarantees. Moreover, this framework provides a theoretical foundation for existing methods, even in complex, non-tabular settings.

Department of Statistics

2025 – 2026 Department of Statistics Colloquium Speaker
