Column subset selection in practice: efficient heuristics and regularization

Ordozgoiti Rubio, Bruno

Column subset selection in practiceefficient heuristics and regularization

Ordozgoiti Rubio, Bruno

Dirigida por:

Alberto Mozo Velasco Director/a
Jesús García López de Lacalle Codirector/a

Universidad de defensa: Universidad Politécnica de Madrid

Fecha de defensa: 22 de noviembre de 2018

Tribunal:

Ernestina Menasalvas Presidente/a
L. M. Pozo Coronado Secretario
Daniel Gómez González Vocal
Rafael Caballero Roldán Vocal
Roberto González Sánchez Vocal

Tipo: Tesis

Teseo: 571860 DIALNET Archivo Digital UPM editor

Resumen

Today, data are available at an unprecedented scale. An overwhelming quantity of Internet-connected devices generate a constant trickle of pieces of information all over the world, much of which are processed in real time or stored for later use. Making sense of these enormous data sets is often an challenging endeavour. Their size demands the use of massive computational resources, which motivates the design of efficient algorithms. Additionally, these data usually contain measurements of a large number of variables, which poses a wide variety of problems. To address the latter, a family of techniques commonly referred to as dimensionality reduction is studied. In this thesis we address the problem of feature selection, a subset of dimensionality reduction methods that preserve the semantic meaning of the original data variables. To do so, we analyze a problem formulation known as column subset selection. A significant advantage of column subset selection is that the models it produces are simple and in some cases easy to interpret. In an age where notable advances in applied computer science are met with growing concerns about ethics and transparency, model simplicity can become a key requirement in many scenarios. The column subset selection problem has received significant attention in the computer science literature over the last few years, mainly from a theoretical perspective. Here we analyze the problem from a more practical standpoint. Our contributions can be summarized as follows. First, we propose the use of a local search heuristic. We show empirically that it outperforms existing algorithms and derive elementary approximation guarantees. Furthermore, we take advantage of the nature of the problem formulation to derive an efficient implementation suitable for practical use. Second, we introduce regularized formulations of the problem. We derive a greedy algorithm for these new objectives and demonstrate empirically that it produces improved subsets with respect to multiple criteria.