Some investigations in discriminant analysis with mixed variables

  1. Mahat, Nor Idayu
Supervised by:
  1. Adolfo Hernández Estrada (doctoral supervisor)

Defending university: University of Exeter

Year of defence: 2006

Type: Doctoral thesis

Abstract

The location model is a potential basis for discriminating between groups of objects with mixed types of variables. The model specifies a parametric form for the conditional distribution of the continuous variables given each pattern of values of the categorical variables, thus leading to a theoretical discriminant function between the groups. To conduct a practical discriminant analysis, the objects must first be sorted into the cells of a multinomial table generated from the categorical values, and the model parameters must then be estimated from the data. However, in many practical situations some of the cells are empty, which prevents simple implementation of maximum likelihood estimation and restricts the feasibility of linear model estimators to cases with relatively few categorical variables. This deficiency was overcome by the non-parametric smoothing estimation proposed by Asparoukhov and Krzanowski (2000). Its usual implementation uses exponential and piecewise smoothing functions for the continuous variables, and adaptive weighted nearest neighbour for the categorical variables. Although this approach increases the range of applicability, the smoothing parameters chosen by maximising the leave-one-out pseudo-likelihood depend on distributional assumptions, and the smoothing method for the categorical variables produces erratic values if the number of variables is large. This thesis rectifies these shortcomings, and extends location model methodology to situations where there are large numbers of mixed categorical and continuous variables. Chapter 2 uses the simplest form of the exponential smoothing function for the continuous variables and describes how the smoothing parameters can instead be chosen by minimising either the leave-one-out error rate or the leave-one-out Brier score, neither of which makes distributional assumptions.
Alternative smoothing methods, namely a kernel and a weighted form of the maximum likelihood, are also investigated for the categorical variables. Numerical evidence in Chapter 3 shows that there is little to choose among the strategies for estimating smoothing parameters and among the smoothing methods for the categorical variables. However, some of the proposed smoothing methods are more feasible when the number of parameters to be estimated is reduced. Chapter 4 reviews previous work on problems of high-dimensional feature variables, and focuses on selecting variables on the basis of the distance between groups. In particular, the Kullback-Leibler divergence is considered for the location model, but existing theory based on maximum likelihood estimators is not applicable for general cases. Chapter 5 therefore describes the implementation of this distance for smoothed estimators, and investigates its asymptotic distribution. The estimated distance and its asymptotic distribution provide a stopping rule in a sequence of searching processes, by forward, backward or stepwise selection, following the test for no additional information. Simulation results in Chapter 6 demonstrate the feasibility of the proposed variable selection strategies for large numbers of variables, but limitations in several circumstances are identified. Applications to real data sets in Chapter 7 show that the proposed methods are competitive with, and sometimes better than, other existing classification methods. Possible future work is outlined in Chapter 8.
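
The following Python sketch is a rough illustration of two ideas mentioned in the abstract: borrowing information from neighbouring cells so that an empty cell still receives an estimated mean, and scoring a classifier with the Brier score. It is not taken from the thesis. The function names, the restriction to binary categorical variables and a single continuous variable, and the weight form `lam ** d` (with `d` the number of categorical variables on which two cells differ) are all assumptions made for illustration only.

```python
from itertools import product


def hamming(a, b):
    """Number of categorical variables on which two cells differ."""
    return sum(x != y for x, y in zip(a, b))


def smoothed_cell_means(cells, values, n_binary, lam):
    """Estimate a mean for every cell of the multinomial table.

    cells    : list of 0/1 tuples, the categorical pattern of each object
    values   : list of floats, one continuous measurement per object
    n_binary : number of binary categorical variables
    lam      : smoothing parameter in (0, 1); an object whose cell lies at
               distance d from the target cell gets weight lam ** d
               (assumed exponential form, hypothetical here)
    """
    means = {}
    for target in product((0, 1), repeat=n_binary):
        num = 0.0
        den = 0.0
        for cell, y in zip(cells, values):
            w = lam ** hamming(target, cell)
            num += w * y
            den += w
        means[target] = num / den  # defined even if the target cell is empty
    return means


def brier_score(posteriors, labels):
    """Average squared distance between posterior vectors and 0/1 indicators."""
    total = 0.0
    for probs, g in zip(posteriors, labels):
        total += sum((p - (1.0 if k == g else 0.0)) ** 2
                     for k, p in enumerate(probs))
    return total / len(labels)


# Three objects occupy three of the four cells; cell (1, 1) is empty.
cells = [(0, 0), (0, 1), (1, 0)]
values = [1.0, 2.0, 3.0]
means = smoothed_cell_means(cells, values, n_binary=2, lam=0.5)
print(means[(1, 1)])
```

In a setting like the one the abstract describes, a value of `lam` could be chosen by recomputing such estimates with each object left out in turn and keeping the value that gives the smallest leave-one-out error rate or Brier score. The sketch omits that loop and the discriminant function itself.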