Statistical Inferences for Two-Component Semiparametric Location-Scale Mixture Models

Date
2024-11-25
Abstract
Mixture models are a powerful statistical tool for capturing heterogeneous populations by representing them as a mixture of several distributions. They are especially useful in fields such as genomics, economics, and the social sciences, where data often arise from a combination of distinct subpopulations. Among the various mixture models, the two-component location-scale mixture model is of special interest because of its simplicity and its flexibility in modeling diverse data structures. Traditional methods of parameter estimation in mixture models, such as Maximum Likelihood Estimation (MLE), are widely used because of their desirable asymptotic properties, such as asymptotic efficiency. However, MLE can be highly sensitive to model misspecification and to the presence of outliers, often leading to biased or inefficient estimates. Recognizing these limitations, this thesis explores Minimum Hellinger Distance Estimation (MHDE), a robust alternative that balances efficiency and robustness. Although MHDE may not be as efficient as MLE when the model is perfectly specified (i.e., when the model exactly fits the data), it remains sufficiently efficient while being far more robust to data that do not align with the model assumptions. The choice of MHDE is motivated by its robustness to outliers and its ability to provide more reliable estimates when the underlying distribution deviates from the assumed model. Focusing on semiparametric mixtures adds further flexibility by allowing for an unspecified distribution, which enables the model to capture complex data structures without imposing strict parametric assumptions. In these semiparametric models, the emphasis is on estimating the unknown parameter vector, while the form of the mixing distribution remains unspecified.
This approach strikes a balance between parametric precision and nonparametric flexibility, making it particularly useful when the true distribution is unknown or deviates from common parametric forms. This thesis has three primary objectives: minimum Hellinger distance estimation for parametric location-scale mixture models, minimum Hellinger distance estimation for semiparametric location-scale mixture models, and estimation for location-scale mixtures when the data are right-censored. The first objective is the estimation of the unknown parameters by minimum Hellinger distance in parametric location-scale mixtures. The parametric Minimum Hellinger Distance Estimation (MHDEP) method is explored in depth. The Hellinger distance is defined in terms of the Hellinger integral, which was introduced by Ernst Hellinger in 1909. Chapter 2 presents the methodology, asymptotic normality results, simulation studies, and real data analysis for MHDEP. This approach is chosen for its strong robustness and its competitive efficiency relative to classical parametric likelihood-based estimators, especially in scenarios where traditional estimation methods may perform poorly because of model misspecification or data irregularities. The second objective is semiparametric Minimum Hellinger Distance Estimation (SEMIMHDE) for mixture models with unknown component distributions. While the full derivation of identifiability conditions is ongoing, identifiability remains crucial for reliable parameter estimation: it ensures that the parameters of interest, such as the mixing proportions and the location and scale parameters, can be uniquely determined from the data. We first constructed the SEMIMHDE by deriving a custom base function for each component in the semiparametric mixture model.
Subsequently, we adapted the Hellinger Minimization for Mixtures (HMIX) algorithm, which was originally designed for parametric mixtures using MHDE, to the semiparametric setting. The modified HMIX algorithm allows the estimation of unknown component distributions. To assess the performance of SEMIMHDE, we conducted a series of simulation studies evaluating its efficiency, robustness, and sensitivity to model misspecification in comparison with parametric estimation methods. Finally, we applied SEMIMHDE to the Old Faithful Geyser dataset to demonstrate its practical applicability to real-world data. The third objective is to extend these methodologies to right-censored data: Kaplan-Meier-weighted versions of MHDEP and SEMIMHDE are constructed to provide a robust and comprehensive toolkit for analyzing finite location-scale mixture models when the data are right-censored. Right-censoring is a common issue in many practical applications, such as survival analysis, where complete data are not available for some observations. Chapter 4 applies MHDEP and SEMIMHDE under censoring rates ranging from low to high to evaluate their finite-sample performance. Simulation results show that both methods maintain good finite-sample performance even at high levels of censoring, demonstrating their robustness and reliability under different degrees of right-censoring. This study is anticipated to provide more reliable and versatile solutions for complex statistical modeling challenges, broadening the applicability of these methods to a wider range of practical problems. Additionally, both estimation methods were applied to a real right-censored dataset to assess their performance when complete observations are unavailable.
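As an illustration of the Kaplan-Meier weighting mentioned above, the following is a minimal sketch of how Kaplan-Meier jump weights can be computed from right-censored observations. The abstract does not give the thesis's exact construction, so the function name `kaplan_meier_weights`, the tie-breaking rule, and the toy data are assumptions made for this example only. The weights are the jumps of the Kaplan-Meier estimator at the ordered observations: censored observations receive weight zero, and their mass is redistributed to later observations.

```python
def kaplan_meier_weights(times, events):
    """Return sorted times and their Kaplan-Meier jump weights.

    times  : observed times (event time or censoring time)
    events : 1 if the event was observed, 0 if right-censored
    """
    n = len(times)
    # Sort by time; at tied times, place observed events before censored ones
    # (a common convention, assumed here).
    order = sorted(range(n), key=lambda i: (times[i], -events[i]))
    weights = []
    survival = 1.0  # Kaplan-Meier survival just before the current observation
    for rank, idx in enumerate(order, start=1):
        at_risk = n - rank + 1
        # The jump is zero for censored observations.
        weights.append(survival * events[idx] / at_risk)
        if events[idx] == 1:
            survival *= (at_risk - 1) / at_risk
    sorted_times = [times[i] for i in order]
    return sorted_times, weights


# Toy example (hypothetical data): the observation at time 2.0 is censored.
sorted_times, weights = kaplan_meier_weights([3.0, 1.0, 2.0], [1, 1, 0])
print(sorted_times, weights)
```

One common way to use such weights, again stated as an assumption rather than as the thesis's method, is to replace the equal weights 1/n in a kernel density estimate with these jump weights before computing a Hellinger distance. With no censoring, every weight reduces to 1/n.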
In summary, this thesis contributes to the study of both parametric and semiparametric mixture models by addressing fundamental issues of efficiency, robustness, and model misspecification through minimum Hellinger distance estimation. The proposed MHDEP and SEMIMHDE methods not only advance the theoretical understanding of these models but also provide practical tools for more accurate and reliable statistical analysis in a variety of applied settings.
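The basic idea of minimum Hellinger distance estimation can be sketched in a few lines: estimate the data density nonparametrically, then choose the model parameters that minimize the Hellinger distance between that estimate and the model density. The sketch below is not the thesis's MHDEP or HMIX implementation. It assumes normal components, treats the location and scale parameters as known, uses a hand-picked kernel bandwidth, and estimates only the mixing proportion by grid search. It uses the unnormalized squared distance, the integral of (sqrt(f) - sqrt(g))^2; some authors include a factor of 1/2. All names and simulation settings are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def kde(grid, data, bandwidth):
    """Gaussian kernel density estimate evaluated at each grid point."""
    n = len(data)
    return [sum(normal_pdf(x, xi, bandwidth) for xi in data) / n for x in grid]


def squared_hellinger(f_vals, g_vals, step):
    """Riemann-sum approximation of the integral of (sqrt(f) - sqrt(g))^2."""
    return step * sum(
        (math.sqrt(f) - math.sqrt(g)) ** 2 for f, g in zip(f_vals, g_vals)
    )


# Simulated data (hypothetical settings): 0.3 * N(0, 1) + 0.7 * N(4, 1).
random.seed(0)
TRUE_PI, MU1, S1, MU2, S2 = 0.3, 0.0, 1.0, 4.0, 1.0
data = [
    random.gauss(MU1, S1) if random.random() < TRUE_PI else random.gauss(MU2, S2)
    for _ in range(400)
]

# Integration grid and nonparametric density estimate.
step = 0.05
grid = [-6.0 + step * i for i in range(321)]  # covers [-6, 10]
f_hat = kde(grid, data, bandwidth=0.4)  # bandwidth chosen by hand (assumption)

# Component densities on the grid (location and scale treated as known here).
comp1 = [normal_pdf(x, MU1, S1) for x in grid]
comp2 = [normal_pdf(x, MU2, S2) for x in grid]


def distance_at(pi):
    """Squared Hellinger distance between the KDE and the mixture with weight pi."""
    model = [pi * a + (1.0 - pi) * b for a, b in zip(comp1, comp2)]
    return squared_hellinger(f_hat, model, step)


# Grid search over the mixing proportion only; a full MHDE would optimize
# the proportion, locations, and scales jointly.
candidates = [k / 100.0 for k in range(1, 100)]
pi_hat = min(candidates, key=distance_at)
print(f"estimated mixing proportion: {pi_hat:.2f}")
```

A grid search keeps the example dependency-free. In practice a numerical optimizer over all parameters, together with a data-driven bandwidth, would be the more natural choice.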
Keywords
Mixture Model
Citation
Zhang, N. (2024). Statistical inferences for two-component semiparametric location-scale mixture models (Doctoral thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca.