UConn Stat Anniversary : Poster Abstracts

Poster Abstracts

Presenters, please keep your poster size 24 inches tall by 36 inches wide or smaller.

1. Tree-Guided Rare Feature Selection and Logic Aggregation with Electronic Health Records Data

Presenter: Jianmin Chen, University of Connecticut

Statistical learning with a large number of rare binary features is commonly encountered in analyzing electronic health records (EHR) data, especially in the modeling of disease onset with prior medical diagnoses and procedures. Dealing with the resulting highly sparse and large-scale binary feature matrix is notoriously challenging, as conventional methods may suffer from a lack of power in testing and inconsistency in model fitting, while machine learning methods may be unable to produce interpretable results or clinically meaningful risk factors. To improve EHR-based modeling and utilize the natural hierarchical structure of disease classification, we propose a tree-guided feature selection and logic aggregation approach for large-scale regression with rare binary features, in which dimension reduction is achieved through not only a sparsity pursuit but also an aggregation promoter with the logic operator of “or”. We convert the combinatorial problem into a convex linearly-constrained regularized estimation, which enables scalable computation with theoretical guarantees. In a suicide risk study with EHR data, our approach is able to select and aggregate prior mental health diagnoses as guided by the diagnosis hierarchy of the International Classification of Diseases. By balancing the rarity and specificity of the EHR diagnosis records, our strategy improves both prediction and model interpretation. We identify important higher-level categories and subcategories of mental health conditions and simultaneously determine the level of specificity needed for each of them in predicting suicide risk.

2. Local Bayesian Modeling Approach for Simultaneous Estimation and Feature Extraction with Application to Sparse, High Dimensional Spatio-Temporal Data

Presenter: Garrett Frady, University of Connecticut

As a result of the vast advancements in technology, we frequently come across data in high dimensions. We propose to extend the utility of the Gaussian and Diffused-gamma (GD) prior for feature extraction when dealing with sparse, high dimensional spatio-temporal data. To bypass the computational complexity, we build local binary classification models of subject-level responses at each time point using logistic regression and incorporate the temporal structure through our subject-level prediction process. The effectiveness of our method will be demonstrated through a simulation study. We will also conduct a case study with multi-subject electroencephalography (EEG) data to identify active regions of the brain and predict the risk of early-onset alcoholism. One goal of EEG analysis is to extract information from the brain in a spatiotemporal pattern and analyze the functional connectivity between different areas of the brain as a response to a certain stimulus. Selecting active regions of the brain can be viewed as a feature extraction process.

3. A Novel Approach for Data Fusion

Presenter: Lucas Godoy, University of Connecticut

The problem of making inference about a variable observed at different scales is termed data fusion. Most methods for this problem assume that areal data are aggregations of an underlying random field. Despite providing interpretable spatial parameters and predictions, these models depend on a discrete approximation of the study region. Such approximations are hard to apply even to moderately sized datasets because of their computational cost, and they may often lead to biased results. We propose using the Hausdorff-Gaussian process (HGP) to analyze data fusion problems. The method deals with the different scales seamlessly by construction while accounting for the geometries, shapes, and sizes of the sample units. Air pollution data are available in California both as direct measurements at measuring sites (point-referenced) and as satellite estimates at a 10 km by 10 km resolution. We reproduced the analysis from Moraga et al. (2017) using both our proposed methodology and their model. Our model provided narrower credible intervals for the model parameters and fit the data better according to the WAIC.

4. Distributed Statistical Learning on the Tweedie Distribution

Presenter: Zijian Huang, University of Connecticut

The normal, gamma, inverse Gaussian, and Poisson distributions, together with the class of compound Poisson–gamma distributions, which have positive mass at zero but are otherwise continuous, are all members of the Tweedie family. In the insurance industry, because of the large proportion of zero claims, Tweedie's compound Poisson model is more suitable for such highly right-skewed data. However, as data volumes grow quickly, it is no longer practical to store such large amounts of data in a single data center. Thus, recent years have seen a surge of interest in distributed statistical learning. First, we propose a one-step approach for Tweedie's compound Poisson model that improves on a simple-averaging distributed estimator. Because it requires only a single round of communication, the communication cost is relatively low. Moreover, for randomly initialized coefficients, we offer a multiple-step method for Tweedie's compound Poisson model to achieve decent estimation. It iteratively updates the coefficients by communicating information between machines.
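For readers unfamiliar with one-step distributed estimation, the generic form of the update can be written as below. This is a sketch of the standard construction from the distributed learning literature, not the presenter's own formulation. The notation is assumed: K machines, local estimates \hat{\theta}_k, and local negative log-likelihoods L_k.

```latex
\bar{\theta} = \frac{1}{K}\sum_{k=1}^{K}\hat{\theta}_k,
\qquad
\hat{\theta}^{(1)} = \bar{\theta}
  - \Big[\sum_{k=1}^{K}\nabla^{2} L_k(\bar{\theta})\Big]^{-1}
    \sum_{k=1}^{K}\nabla L_k(\bar{\theta})
```

Here \bar{\theta} is the simple-averaging estimator, and \hat{\theta}^{(1)} applies a single Newton-type correction using gradients and Hessians that each machine evaluates at \bar{\theta} and sends back once.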

5. Consumer Credit Risk Convergence: The Case for Performance-Based Interest Rate Reductions in Consumer Automobile Loans

Presenter: Jackson Lautier, University of Connecticut

We use survival analysis techniques and consumer automobile loan data spanning a wide range of credit risk profiles from securitization pools to estimate that the conditional monthly probability of default converges for borrowers in disparate credit risk bands after 15-51 months. We call this phenomenon consumer credit risk convergence. Using these probabilistic estimates, we use a risk-based pricing framework and actuarial techniques to derive the market-implied conditional expected rate of return over the lifespan of these loans by credit risk band. We estimate that deep subprime and subprime consumers eventually overpay by annual percentage rates of 285-1,391 basis points.

6. Disagreements in the Number of Clusters Between Training and Testing Data

Presenter: Jung Wun Lee, University of Connecticut

This paper introduces a disagreement problem that may be an issue when classification is implemented on estimated classes of the population. A disagreement problem denotes the case in which a sample fails to cover a specific class to which a testing observation belongs. In other words, the training and testing data do not agree on the number of classes in the population. These disagreement problems may occur for various reasons, such as sampling errors, selection bias, or emerging classes of the population. Once the disagreement problem occurs, a testing observation will be misclassified, because a classification rule based on the sample cannot capture a class not observed in the training data (sample). To overcome such issues, we suggest a two-stage classification method that can ameliorate a disagreement problem in classification. Our proposed method tests whether a testing observation is sampled from an observed or a possibly unobserved class, then classifies it based on the test result. We suggest a test for identification of the disagreement problem and demonstrate the performance of the two-stage classification via numerical studies.

7. Lasso Regularization for Censored Regression and High Dimensional Predictors

Presenter: Dashun Liu, University of Connecticut

We propose a heuristic EM-like algorithm to handle censored regression models with Lasso regularization to accommodate high dimensional predictors under a normality assumption. We specify how this technique can be easily implemented using available R packages.

8. Pursuing Sources of Heterogeneity in Microbiome Community Structure via a Regularized Dirichlet-Multinomial Mixture Model

Presenter: Zhongmao Liu, University of Connecticut

In microbiome studies, it is often of great interest to identify natural clusters or partitions in the samples and characterize the unique compositional profile of each microbial community. Different metacommunities can differ in taxonomic, functional, ecological and medical properties, and can thus help with the evaluation of human health and diseases. Among the applicable methods of microbial clustering analysis, the Dirichlet multinomial mixture (DMM) model is one of the most popular. In this research we propose a novel sparse DMM method with a group L1 penalty, which can simultaneously detect sample clusters and identify sources of heterogeneity, i.e., the critical taxa that differentiate metacommunities. A comprehensive simulation study shows that the sparse DMM method achieves good feature selection performance in identifying heterogeneous taxa. An application to a study of upper-airway microbiota and asthma in children shows that different initial nasal microbiome groups defined by the sparse DMM method relate to different risks of future loss of asthma control.

9. Peer of International Trade Between European Countries

Presenter: Brisilda Ndreka, University of Connecticut

Peer effects, sometimes also referred to as contagion effects, are actions or characteristics of a reference group that impact an individual’s behavior or outcomes, and they are very important in social network analysis. In this work, we “borrow” the contagion effect concept and propose a novel approach to analyze the influence of international trade on the economy of a country. In a longitudinal data analysis, we consider 36 European countries as nodes of a graph, with edges defined between countries that share bilateral trade in agriculture and manufacturing. Because peer influence effects are often confounded with latent homophily caused by unobserved similar characteristics, which may lead to biased results, we first estimate the latent variable using the latent-space adjusted approach (Xu, 2018). Then, we use the estimates as a proxy variable for unobserved factors in the dynamic linear model of Friedkin and Johnsen (1990), concluding that peer influence exists in international trade networks.

10. Outage Prediction Models Generalize Well to New Domains

Presenter: Aaron Spaulding, Graduate Student at the UConn Department of Civil and Environmental Engineering

State-of-the-art outage prediction models can predict weather impacts on the electric grid, saving lives and improving utility response; however, most current operational models are built by hand and custom-tuned for each study region. This process requires a consistent dataset of outages and weather, which may be very time-consuming or impossible to curate for utilities with limited datasets and computational resources. Outage prediction models tuned for transmission lines generalize well to the distribution system despite new territories, sparse data, and limited failure examples.

11. A New Formulation of Minimum Risk Fixed-Width Confidence Interval (MRFWCI) for a Normal Mean

Presenter: Swathi Venkatesan, University of Connecticut

Fixed-width confidence interval (FWCI) estimation problems for a normal mean when the variance is unknown have developed under a zero-one loss without taking sampling cost into account. Meanwhile, minimum risk point estimation (MRPE) problems have grown largely under squared error loss (SEL) plus sampling cost. Here, a new formulation combining both MRPE and FWCI methodologies is introduced with desired asymptotic second-order characteristics under a unified structure to develop a minimum risk fixed-width confidence interval (MRFWCI) strategy.

12. Unweighted Estimation Based on Optimal Sample under Measurement Constraints

Presenter: Jing Wang, University of Connecticut

To tackle massive data, subsampling is a practical approach to sift out more informative data points. However, when responses are expensive to measure, developing efficient subsampling schemes is challenging, and the optimal sampling approach under measurement constraints was developed to meet this challenge. This method uses the inverses of optimal sampling probabilities to reweight the objective function, which assigns smaller weights to more important data points. Thus, there is room to improve the estimation efficiency of the resulting estimator. In this paper, we propose an unweighted estimating procedure based on optimal subsamples to obtain a more efficient estimator. We obtain the unconditional asymptotic distribution of the estimator via martingale techniques without conditioning on the pilot estimate, which has been less investigated in the existing subsampling literature. Both asymptotic results and numerical results show that the unweighted estimator is more efficient in parameter estimation.

13. Recurrent Events Modeling Based on a Reflected Brownian Motion with Application to Hypoglycemia

Presenter: Yingfa Xie, University of Connecticut

Patients with diabetes need to closely monitor their blood sugar levels. Hypoglycemic events are easily observed due to obvious symptoms, but hyperglycemic events are not. We propose to model an observed hypoglycemic event as a lower-boundary crossing event for a reflected Brownian motion with an upper reflection line. The boundary is set by a clinical standard. Covariates are incorporated into the volatility and upper reflection barrier of the Brownian motion. To further capture the heterogeneity among patients and the dependence within each patient, a frailty is introduced on the log scale of the volatility and of the upper reflection line, respectively. Inferences are facilitated by a Bayesian framework using Markov chain Monte Carlo. Two model comparison criteria, the Deviance Information Criterion and the Logarithm of the Pseudo-Marginal Likelihood, are used for model selection. The methodology is validated in a simulation study. In an application to a dataset of hypoglycemic events from diabetic patients, the model provides adequate fit and can be used to generate data that are similar to the observed data.
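As a rough illustration of the event mechanism, the sketch below simulates the time until a Brownian path, reflected at an upper barrier, first crosses a lower boundary. It is a minimal Euler-type simulation under assumed values, not the presenter's model. The function name, the step size, and all parameter values are hypothetical, and covariates and frailties are omitted.

```python
import math
import random

def first_crossing_time(x0, lower, upper, sigma, dt=0.01, t_max=1000.0, rng=None):
    """Simulate driftless Brownian motion reflected at `upper`;
    return the first time the path is at or below `lower` (None if never by t_max)."""
    rng = rng or random.Random()
    x, t = x0, 0.0
    step_sd = sigma * math.sqrt(dt)  # standard deviation of one Euler increment
    while t < t_max:
        x += rng.gauss(0.0, step_sd)
        if x > upper:
            x = 2.0 * upper - x  # mirror the overshoot back below the barrier
        t += dt
        if x <= lower:
            return t
    return None

# Hypothetical settings: start at 1, event boundary at 0, reflection line at 2.
rng = random.Random(1)
slow = [first_crossing_time(1.0, 0.0, 2.0, sigma=0.5, rng=rng) for _ in range(100)]
fast = [first_crossing_time(1.0, 0.0, 2.0, sigma=2.0, rng=rng) for _ in range(100)]
print(sum(slow) / len(slow), sum(fast) / len(fast))
```

In this toy setting a larger volatility shortens the average time to an event, which is one way covariates acting on the volatility could change event rates.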

14. Generalized Estimating Equations for Normal Clustered Data and the 'Sandwich' Variance Formula

Presenter: Zhenyu Xu, University of Connecticut

Generalized estimating equations (GEE) are of great interest not only for their adaptability and interpretability in the analysis of clustered data but also for their ability to model the association structure. One recently proposed version of GEE uses three estimating equations for the mean, variance, and correlation, respectively (Luo and Pan, 2022). However, simulations show that their second equation for variance can be inappropriate, and their ‘sandwich’ variance estimator can be deficient, especially for data with non-constant variance functions. Following Yan and Fine (2004), this paper proposes a model similar to that of Luo and Pan (2022), but with scale parameters instead of variance parameters and a more reliable ‘sandwich’ estimator. The model is also an extension of the R package "geepack" with the flexibility to apply different working covariance matrices for the variance and correlation structures. The proposed model is more accurate and efficient for data with heterogeneous variance.

15. Quantifying Correspondence Uncertainty Using Sinkhorn Algorithm

Presenter: Shiwen Yang, Boston University

The problem of estimating an unobserved correspondence is crucial for many tasks involving data fusion and entity resolution. Substantial work has been done to characterize when such a correspondence can be either exactly or approximately recovered in numerous settings. However, there has been much less work on quantifying uncertainty about an estimated correspondence. One approach is to compute the posterior probability of a correspondence based on prior probabilities for the data and a uniform prior on the set of permutations. A natural summary statistic for this posterior distribution is the matrix of marginal posterior probabilities that one observation matches to another. The difficulty of calculating these marginals comes from a summation over a subset of the space of permutations. As an approximation method, we consider applying the Sinkhorn algorithm, an algorithm that scales a square matrix to a doubly stochastic matrix, to a matrix of pairwise exponentiated negative squared distances. We investigate how this approximation relates to the true posterior marginals both in simulation and in preliminary theory.
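The Sinkhorn step described above can be sketched in a few lines. The example below builds a matrix of exponentiated negative squared distances for two small one-dimensional samples and alternately normalizes rows and columns until the matrix is approximately doubly stochastic. The data, the `scale` parameter, and the function names are assumptions for illustration; this is not the presenter's implementation.

```python
import math

def sinkhorn(matrix, n_iter=500, tol=1e-10):
    """Alternately normalize rows and columns of a positive square matrix."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    for _ in range(n_iter):
        # Scale each row to sum to one.
        for i in range(n):
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        # Scale each column to sum to one.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        for i in range(n):
            m[i] = [m[i][j] / col_sums[j] for j in range(n)]
        # Columns now sum to one; stop once the rows do as well.
        if max(abs(sum(row) - 1.0) for row in m) < tol:
            break
    return m

def match_scores(x, y, scale=1.0):
    """Sinkhorn-scaled kernel with entries exp(-(x_i - y_j)^2 / scale) before scaling."""
    kernel = [[math.exp(-((xi - yj) ** 2) / scale) for yj in y] for xi in x]
    return sinkhorn(kernel)

# Hypothetical data: y is a noisy, shuffled copy of x.
x = [0.0, 1.0, 2.0]
y = [2.1, 0.1, 0.9]
scores = match_scores(x, y)
for row in scores:
    print([round(v, 3) for v in row])
```

Entry (i, j) of `scores` can be read as an approximate weight on observation i matching observation j. How closely such weights track the true posterior marginals is the question the abstract investigates.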

16. Optimal Subsampling for Parametric Accelerated Failure Time Models with Massive Survival Data

Presenter: Zehan Yang, University of Connecticut

With the increasing availability of massive survival data, researchers need valid statistical inferences for survival modeling whose computation is not limited by computer memory. Existing works focus on relative risk models using online updating and divide-and-conquer strategies. The subsampling strategy has not been available due to challenges in developing the asymptotic properties of the estimator under semiparametric models with censored data. This article tackles optimal subsampling algorithms to quickly approximate the maximum likelihood estimator for parametric accelerated failure time models with massive survival data. We derive the asymptotic distributions of the subsampling estimator and the optimal sampling probabilities that minimize the asymptotic mean squared error of the estimator. A feasible two-step algorithm is proposed where the optimal sampling probabilities in the second step are estimated based on a pilot sample in the first step. The asymptotic properties of the two-step estimator are established. The performance of the estimator is validated in a simulation study. A real data analysis illustrates the usefulness of the methods.

17. Generating Directed Networks with Predetermined Assortativity Measures

Presenter: Yelie Yuan, University of Connecticut

Assortativity coefficients are important metrics for analyzing both directed and undirected networks. In general, it is not guaranteed that a fitted model will agree with the assortativity coefficients of the given network, and the structure of directed networks is more complicated than that of undirected ones. Therefore, we provide a remedy by proposing a degree-preserving rewiring algorithm, called DiDPR, for generating directed networks with given directed assortativity coefficients. We construct the joint edge distribution of the target network by accounting for the four directed assortativity coefficients simultaneously, provided that they are attainable, and obtain the desired network by solving a convex optimization problem. Our algorithm also helps check the attainability of the given assortativity coefficients. We assess the performance of the proposed algorithm by simulation studies with a focus on two different network models, namely Erdős–Rényi and preferential attachment random networks. We then apply the algorithm to a Facebook wall post network as a real data example. The code for implementing our algorithm is publicly available in the R package wdnet.
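To show what "degree-preserving rewiring" means for a directed network, the sketch below swaps the targets of two randomly chosen edges, which leaves every node's in-degree and out-degree unchanged. It is a generic illustration under simplifying assumptions and is neither the DiDPR algorithm nor the wdnet API. In particular, it accepts every valid swap, whereas a method that targets given assortativity coefficients would accept swaps selectively. The function names and the toy edge list are hypothetical.

```python
import random

def rewire_step(edges, rng):
    """Try one swap (a->b, c->d) => (a->d, c->b); return True if applied."""
    i, j = rng.sample(range(len(edges)), 2)
    (a, b), (c, d) = edges[i], edges[j]
    existing = set(edges)
    # Reject swaps that would create self-loops or duplicate edges.
    if a == d or c == b or (a, d) in existing or (c, b) in existing:
        return False
    edges[i], edges[j] = (a, d), (c, b)
    return True

def degrees(edges):
    """Return (out-degree, in-degree) dictionaries for a directed edge list."""
    out_deg, in_deg = {}, {}
    for s, t in edges:
        out_deg[s] = out_deg.get(s, 0) + 1
        in_deg[t] = in_deg.get(t, 0) + 1
    return out_deg, in_deg

# Hypothetical toy network on four nodes.
rng = random.Random(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (1, 3)]
before = degrees(edges)
for _ in range(200):
    rewire_step(edges, rng)
after = degrees(edges)
print(before == after, sorted(edges))
```

Because each swap keeps the source of one edge and the target of the other, the degree sequences are fixed while the pairing of source and target degrees, which is what assortativity measures, can change.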

18. Unified Approach for Analysis of Two-Stage Seamless Adaptive Design with Different Endpoints

Presenter: Baoshan Zhang, Department of Biostatistics and Bioinformatics, School of Medicine, Duke University

In clinical trials, a two-stage seamless adaptive trial design that combines two separate studies into a single study is commonly considered (Chow and Chang, 2006). Chow and Lin (2015) classified clinical trials into four different types depending upon whether their study objectives and study endpoints are the same at different stages. Lu et al. (2009, 2014) studied the case where the study objectives are the same but study endpoints are different at different stages. On the other hand, Chow and Lin (2015) and Lu et al. (2012) considered the cases where the data types of study endpoints in different stages are both binary responses, continuous responses, or time-to-event responses, respectively. In practice, however, study endpoints at different stages may have different data types. The focus of this research is the case where the study endpoints at different stages are of different data types. We also provide a sample size adaptation method for the interim analysis based on the observed information in stage 1. A unified approach is developed for the analysis of the combined data collected from the two stages, with different data types in different stages, for testing equality, superiority, non-inferiority, and equivalence between the two treatments. In addition, power calculations for sample size requirements, including sample size allocation at different stages, are derived under each hypothesis test. A clinical trial concerning non-alcoholic steatohepatitis (NASH) utilizing a two-stage seamless adaptive trial design is discussed to illustrate the use of the proposed unified approach for data analysis.

Keywords: Seamless adaptive design; Sample size adaptation; Weibull distribution; Time-to-event data; NASH