Introduction

Proteogenomics

Proteogenomics, the integrative analysis of proteomic and genomic data, has emerged as a new research field. To learn more about it, please read the following review papers:
1. Ruggles, Kelly V., et al. "Methods, tools and current perspectives in proteogenomics." Molecular & Cellular Proteomics 16.6 (2017): 959-981.
2. Nesvizhskii, Alexey I. "Proteogenomics: concepts, applications and computational strategies." Nature Methods 11.11 (2014): 1114-1125.
3. Alfaro, Javier A., et al. "Onco-proteogenomics: cancer proteomics joins forces with genomics." Nature Methods 11.11 (2014): 1107-1113.

PepQuery

PepQuery is a peptide-centric search engine for novel peptide identification and validation. It differs from spectrum-centric search engines such as Mascot, MaxQuant and MS-GF+.

Installation

PepQuery is available as a standalone application as well as a web application. To run the standalone version of PepQuery, you must have Java 1.8 installed. To check your Java version, open a terminal and run the following command:

java -version
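If you prefer to check this from a script, the sketch below parses the banner printed by the command above and succeeds when the major version is 8 (reported as 1.8) or higher. The function name `java_is_8_or_newer` is made up for this example, and treating releases newer than 1.8 as acceptable is an assumption, since this manual only names Java 1.8.

```shell
# Succeeds if a `java -version` banner reports Java 8 (1.8) or newer.
# Hypothetical helper; accepting versions above 1.8 is an assumption.
java_is_8_or_newer() {
  # The version is the first double-quoted token in the banner,
  # for example "1.8.0_281" or "17.0.2".
  v=$(printf '%s\n' "$1" | sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' | head -n 1)
  major=${v%%.*}
  # Old-style versions look like 1.8.x, so the real major is the second field.
  if [ "$major" = "1" ]; then
    rest=${v#1.}
    major=${rest%%.*}
  fi
  [ -n "$major" ] && [ "$major" -ge 8 ] 2>/dev/null
}

# Usage (java -version writes its banner to stderr, hence 2>&1):
#   java_is_8_or_newer "$(java -version 2>&1)" && echo "Java version is OK"
```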

Please go to the Download page to download the standalone version of PepQuery. It is written in Java and is platform independent. The package is a zip file. After downloading it, unzip it and you will find a jar file in the package folder.

To use a Variant Call Format (VCF), Browser Extensible Data (BED) or Gene Transfer Format (GTF) file as input, you need to install R and the R package PGA, and to prepare annotation data according to the PGA user manual before running PepQuery. PepQuery uses PGA to translate the events in the VCF, BED or GTF file into protein sequences. To learn more about PGA, please read the PGA paper (doi: 10.1186/s12859-016-1133-3). If your input is only a peptide, protein or DNA sequence, you do not need to install R or PGA.

Web application

You can use PepQuery through its web server. With the web interface, you do not need to prepare the MS/MS data or the protein reference database, and you do not need to install PepQuery on your computer. All you need to provide is the target sequence you want to identify.

Input data

Currently, the web interface accepts peptide, protein and DNA sequences as input. Each search supports only one sequence. If you have multiple sequences, please run multiple searches.

Parameters

Only a few parameters need to be set. Below is a screenshot of the web interface parameters:

[Screenshot: PepQuery web interface parameters]

Please find the details about the parameters below:

  • MS/MS dataset: select the MS/MS dataset you want to search. Currently, five large-scale public cancer datasets are available in the PepQuery web application. Please see below for details.
  • Target event: select the class of sequence you provide. The web interface supports peptide, protein and DNA sequences.
  • Input sequence: a peptide, protein or DNA sequence, depending on your selection for target event. Only one sequence is accepted per search.
  • Reference database: a protein reference database;
  • Scoring algorithm: the peptide-spectrum scoring algorithm. Hyperscore and MVH are available;
  • Unrestricted modification filtering: whether or not to perform unrestricted modification search filtering.

For each dataset in this web application, we have selected a set of optimized MS/MS search parameters so that users do not need to set these parameters themselves. On the result page, you can find the detailed MS/MS search parameters for the selected MS/MS dataset.

Datasets

We have prepared several large-scale MS/MS datasets from CPTAC and other resources in the web interface. For each dataset, please find the details below:

Result page

The result page of a search shows the detailed identification result for your input sequence. The page is divided into four panels:

Identification overview: the first panel contains the identification parameters for the selected dataset.
Identification result: this panel contains the detailed identification result for the input sequence. A green row indicates a confident identification (pvalue ≤ 0.01). Click a row to display its spectrum annotation figure in the "Spectrum annotation" panel. This figure can help you further evaluate the quality of the peptide-spectrum match. To download the identification table, click the "Download" button at the bottom of this panel. The "Search" function lets you quickly filter the rows you want.
Spectrum annotation: this panel displays the spectrum annotation figure for the identification selected in the "Identification result" panel. Matched ions are displayed in different colors. The figure can be zoomed in and out. You can download the figure and the MS/MS data for this identification through the functions in this panel.
Sample information: this panel shows the sample information for the row selected in the "Identification result" panel.

Standalone version

Input data

The standalone version of PepQuery usually requires MS/MS data in MGF format, a reference protein database in FASTA format, and a peptide, protein or DNA sequence. If you want to use a VCF, BED or GTF file as input, you also need to provide a folder containing annotation data. This annotation data is used to translate the input events into protein sequences.

Parameters

Please find the command line parameters of PepQuery below:

  • -pep: a peptide sequence which you want to search;
  • -db: a reference protein database, such as the protein sequence database from RefSeq or Ensembl;
  • -ms: MS/MS data in MGF format. If your MS/MS data is in raw, mzML or mzXML format, you can use a tool such as ProteoWizard to convert it to MGF files. This parameter accepts a single MGF file or a folder containing multiple MGF files.
  • -fixMod: fixed modifications, given as a comma-separated list of numbers such as 1,2,3. Each modification is represented by a different number. Use the following command to list all the available modifications:
    java -jar pepquery.jar -printPTM
  • -varMod: variable modifications. The format is the same as for -fixMod;
  • -maxVar: the maximum number of allowed variable modifications for a peptide;
  • -printPTM: print all the available modifications in PepQuery;
  • -um: set this parameter to perform unrestricted modification filtering;
  • -o: output folder;
  • -prefix: the prefix of output files;
  • -tol: the error window on experimental peptide mass values. This parameter is usually set according to the mass spectrometer which was used to generate the MS/MS data;
  • -tolu: the unit of -tol, ppm or Da;
  • -itol: the error window for MS/MS fragment ion mass values, in Da;
  • -e: enzyme used for protein digestion. Default is trypsin;
  • -c: the number of allowed missed cleavage sites. Default is 2;
  • -cpu: the number of CPUs used for analysis;
  • -fragmentMethod: peptide fragmentation method, 1=CID/HCD (default), 2=ETD;
  • -m: scoring algorithm for peptide spectrum matching, 1=Hyperscore (default), 2=MVH;
  • -minCharge: the minimum charge to consider if the charge state is not available, default is 2;
  • -maxCharge: the maximum charge to consider if the charge state is not available, default is 3;
  • -minScore: minimum score to consider for peptide searching;
  • -n: the number of random peptides to be generated for pvalue calculation, default is 1000;
  • -t: if the target input is not a peptide, use this parameter to specify the type of input event: 1=protein, 2=DNA, 3=VCF, 4=BED, 5=GTF;
  • -i: if the target input is not a peptide, use this parameter to specify the input event. The accepted input is a protein sequence, a DNA sequence, or a VCF, BED or GTF file;
  • -f: the frame(s) used to translate a DNA sequence into protein, given as a comma-separated list such as "1,2,3,4,5,6", "1,2,3" or "1". "0" means to keep the longest frame. By default, only the longest protein is used for each frame;
  • -anno: the folder of annotation files for VCF/BED/GTF input. Please follow the instructions of PGA to prepare the annotation data;
  • -h: print all the command line options.
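As an illustration of how the options above combine, the following sketch assembles two command lines, one for a single target peptide and one for a VCF input. It only prints the commands so they can be reviewed before being run. The function names, the peptide, all file and folder names, and the tolerance values are placeholders chosen for this example, not recommended settings. Modifications are left out on purpose, because the numeric IDs for -fixMod and -varMod have to be looked up with -printPTM.

```shell
# Print (not run) a PepQuery command for a single target peptide.
# Arguments: peptide, reference FASTA, MGF file or folder, output folder.
# Tolerances (10 ppm, 0.05 Da) and the "demo" prefix are placeholder values.
pepquery_peptide_cmd() {
  echo java -jar pepquery.jar \
    -pep "$1" -db "$2" -ms "$3" \
    -tol 10 -tolu ppm -itol 0.05 \
    -m 1 -n 1000 -um \
    -o "$4" -prefix demo
}

# Print a command for VCF input: -t 3 selects VCF, -i names the file,
# and -anno points to the annotation folder prepared with PGA.
# Arguments: VCF file, annotation folder, reference FASTA, MGF, output folder.
pepquery_vcf_cmd() {
  echo java -jar pepquery.jar \
    -t 3 -i "$1" -anno "$2" -db "$3" -ms "$4" \
    -tol 10 -tolu ppm -itol 0.05 \
    -o "$5"
}

# Example calls with hypothetical inputs.
pepquery_peptide_cmd LVVVGADGVGK reference.fasta spectra.mgf pepquery_out
pepquery_vcf_cmd variants.vcf pga_annotation reference.fasta spectra.mgf pepquery_out
```

Once the printed command looks right, it can be copied into the terminal and run from the folder that contains the jar file.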

Output data

PepQuery outputs results as several tab-delimited text files.

psm_rank.txt: this is the main result file of PepQuery. It contains the detailed identification results for the input target peptide.

Column          Description
peptide         The target peptide sequence.
modification    Modification information of the target peptide. For example, "Deamidation of N@6[0.9840]". "-" means there is no modification.
n               The number of candidate spectra for the target peptide.
spectrum_title  Spectrum title of the matched spectrum.
charge          The charge state of the precursor ion.
exp_mass        The experimental mass of the matched spectrum.
ppm             Mass error in ppm.
pep_mass        The theoretical mass of the peptide.
mz              The mass-to-charge value of the precursor ion.
score           The score of the identification, where larger is better.
n_db            The number of peptides from the reference protein database that match the spectrum better than the target peptide.
total_db        The total number of peptides from the reference protein database matched to the spectrum, regardless of score.
n_random        The number of random peptides that match the spectrum better than the target peptide.
total_random    The total number of random peptides.
pvalue          The pvalue of the identification, where smaller is better.
rank            The rank of the identification.
n_ptm           The number of better-matching modified peptides found by the unrestricted modification search.
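Because psm_rank.txt is tab-delimited with a header row, it can be filtered with standard tools. The sketch below keeps the header and the rows whose pvalue is at or below a cutoff, finding the column by its name. The function name `filter_confident` is made up for this example, and reusing the 0.01 threshold that the web result page uses for its green rows is an assumption about what counts as confident in the standalone output.

```shell
# Keep the header plus rows with pvalue <= cutoff (default 0.01).
# Arguments: path to psm_rank.txt, optional cutoff.
# Hypothetical helper; the 0.01 default mirrors the web page's green rows.
filter_confident() {
  awk -F '\t' -v cutoff="${2:-0.01}" '
    NR == 1 {
      # Locate the pvalue column by name rather than by position.
      for (i = 1; i <= NF; i++) if ($i == "pvalue") col = i
      if (!col) { print "no pvalue column found" > "/dev/stderr"; exit 1 }
      print
      next
    }
    # "+ 0" forces a numeric comparison, including values such as 9.99E-4.
    NF >= col && ($col + 0) <= (cutoff + 0)
  ' "$1"
}

# Usage:
#   filter_confident psm_rank.txt > psm_rank.confident.txt
#   filter_confident psm_rank.txt 0.001
```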

psm_rank.mgf: this is an MGF file containing all the MS/MS spectra listed in psm_rank.txt. Users can extract MS/MS spectra from this file for downstream analysis.
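One way to pull a single spectrum out of this file is sketched below. It assumes the usual MGF layout in which each spectrum sits between a BEGIN IONS line and an END IONS line and carries a TITLE= line, and that the title matches the spectrum_title value in psm_rank.txt. The function name `extract_spectrum` is made up for this example.

```shell
# Print the MGF block whose TITLE equals the given spectrum title.
# Arguments: spectrum title, MGF file.
# Hypothetical helper; assumes BEGIN IONS / TITLE= / END IONS records.
extract_spectrum() {
  awk -v title="$1" '
    /^BEGIN IONS/ { block = ""; keep = 0; inblock = 1 }   # start a new record
    inblock       { block = block $0 "\n" }               # buffer its lines
    inblock && index($0, "TITLE=") == 1 && substr($0, 7) == title { keep = 1 }
    /^END IONS/   { if (keep) printf "%s", block; inblock = 0 }
  ' "$2"
}

# Usage:
#   extract_spectrum "<spectrum_title from psm_rank.txt>" psm_rank.mgf > one_spectrum.mgf
```

The title is compared as an exact string, so it should be copied from psm_rank.txt unchanged.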

ptm_detail.txt: this file contains the identifications of better-matching modified peptides from the unrestricted modification search. It may be empty when there is no better match. In addition to all the columns in psm_rank.txt, it contains the following columns:

Column              Description
ptm_spectrum_title  Spectrum title of the matched spectrum.
ptm_peptide         Modified peptide sequence.
ptm_charge          The charge state of the precursor ion.
ptm_exp_mass        The experimental mass of the matched spectrum.
ptm_pep_mass        The theoretical mass of the peptide.
ptm_modification    Modification information of the modified peptide. For example, "Carbamidomethylation of C@2[57.0215];Methyl of C@2[14.0157]".
ptm_score           The score of the identification, where larger is better.

ptm.txt: this file contains all the identifications from the unrestricted modification search. All the modified peptides from the unrestricted modification search that appear in ptm_detail.txt are also included in this file.

Column          Description
spectrum_title  Spectrum title of the matched spectrum.
peptide         Modified peptide sequence.
charge          The charge state of the precursor ion.
exp_mass        The experimental mass of the matched spectrum.
pep_mass        The theoretical mass of the peptide.
modification    Modification information of the modified peptide. For example, "Carbamidomethylation of C@2[57.0215];Methyl of C@2[14.0157]".
score           The score of the identification, where larger is better.
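The modification strings in these files pack several fields into one value. The sketch below splits such a string into one tab-separated line per modification: name, residue, position and mass shift. The pattern "name of residue@position[mass]" with ";" between entries is inferred from the examples in the tables above and may not cover every case. The function name `parse_mods` is made up for this example.

```shell
# Split a modification string into: name, residue, position, mass shift.
# Hypothetical helper; the format is inferred from the documented examples.
parse_mods() {
  printf '%s\n' "$1" | awk '
    BEGIN { OFS = "\t" }
    $0 == "-" { next }                        # "-" means no modification
    {
      n = split($0, mods, ";")                # one entry per modification
      for (i = 1; i <= n; i++) {
        m   = mods[i]
        sep = index(m, " of ")                # end of the modification name
        at  = index(m, "@")                   # residue@position separator
        lb  = index(m, "[")                   # start of the mass shift
        if (!sep || !at || !lb) continue      # skip entries that do not fit
        print substr(m, 1, sep - 1),
              substr(m, sep + 4, at - sep - 4),
              substr(m, at + 1, lb - at - 1),
              substr(m, lb + 1, length(m) - lb - 1)
      }
    }'
}

parse_mods 'Carbamidomethylation of C@2[57.0215];Methyl of C@2[14.0157]'
```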

Result visualization

The result files of PepQuery can be visualized in PSMviewer, a tool developed by our lab. Currently, psm_rank.txt and ptm.txt can be imported into PSMviewer. With PSMviewer, you can inspect the spectrum annotation figure for each identification one by one and export the figure in different formats. This can help you further evaluate the quality of an identification manually.


Example dataset

You can go to the Download page to download some example datasets.