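Métodos de validación de identificaciones a gran escala de proteínas y desarrollo e implementación de estándares en Proteómica [Methods for validating large-scale protein identifications and development and implementation of standards in proteomics]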

  1. Martínez de Bartolomé Izquierdo, Salvador
Supervised by:
  1. Juan Pablo Albar Supervisor
  2. Jesús María Vázquez Cobos Supervisor

Defending university: Universidad Autónoma de Madrid

Defense date: 12 October 2012

Examination committee:
  1. José María Carazo García Chair
  2. Esteban Montejo de Garcini Guedas Secretary
  3. Juan Antonio Vizcaíno González Member
  4. Concepcion Gil Garcia Member
  5. Joaquín Abián Member

Type: Thesis

Abstract

High-throughput identification of peptides in databases from tandem mass spectrometry data is a key technique in modern proteomics. Common approaches to interpreting large-scale peptide identification results are based on the statistical analysis of average score distributions, which are constructed from the set of best scores that search engines such as SEQUEST produce for large collections of MS/MS spectra. Other approaches calculate individual peptide identification probabilities on the basis of theoretical models or from single-spectrum score distributions, which are constructed from the set of scores produced by each MS/MS spectrum. In this work, we study the mathematical properties of average SEQUEST score distributions by introducing the concept of spectrum quality and expressing these average distributions as compositions of single-spectrum distributions. Our analysis leads to a novel indicator, the probability ratio, which is non-parametric and robust, makes it unnecessary to classify spectra according to parameters such as charge state, and, judged by false discovery rates, yields better peptide identification performance than other empirical statistical approaches. We also developed another method based on the construction of single-spectrum SEQUEST score distributions. Taken together, these results show that its robustness, conceptual simplicity, and ease of automation make the probability ratio algorithm a very attractive alternative for determining peptide identification confidences and error rates in high-throughput experiments.

On the other hand, the recent development of HUPO-PSI (Proteomics Standards Initiative) standard data formats and MIAPE (Minimum Information About a Proteomics Experiment) guidelines is contributing to proteomics data sharing within the scientific community. In addition, specialized journals have emphasized the use of these standards and guidelines to facilitate the evaluation and publication of new articles. However, there is an evident lack of bioinformatics tools specifically designed to manage these standards, the information they require, and their connection to the proteomics pipeline. In this work we describe the development of a set of tools based on PSI standards and MIAPE guidelines: semantic and MIAPE validators for proteomics standard data files; a proteomics experiment repository based on MIAPE guidelines; a Java library for managing and extracting MIAPE information from standard data files; and a tool for a complete proteomics data analysis workflow, which allows the aggregation, filtering, and inspection of large amounts of data as well as their dissemination by preparing a complete ProteomeXchange submission. Additionally, we present our contribution to the definition of the MIAPE guidelines for quantitative proteomics experiments, recently accepted as a new global standard for the proteomics community.

Key words: proteomics, bioinformatics, data analysis, MIAPE guidelines, mass spectrometry, data repository, tool development, HUPO-PSI, statistical model