Format of the tab output

This format is the main purpose of the whole program. It is designed as the preliminary step for a bulk data uploader of a database.

Note that the output contains the default value wherever a line is empty or missing in the original input files.

Structure

The file is line oriented.

Empty lines may be in the file and should be ignored.

Every other line has a character describing its purpose in column 1. A number may follow immediately. The true content follows after a blank.

Roughly, a line has this format:


  cnum data...         
where c is a purpose code, num is the optional number and data is the content of the line.

Characters in lower case indicate a description of a line. The corresponding upper case character contains data without a description. E.g. the fictional lines

  m "ID";"MESSAGE"
  M7 42;"This example is rather stupid."
show that a line pair exists for the code M. The first line, with the lower case m, acts as a pattern for parsing the next line. M7 indicates that this is the 7th item of M. The counting itself is independent of data-specific ID numbers, as shown in this case. 42 is the value of the ID entry, "This example is rather stupid." is the value of the MESSAGE entry.
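
The pairing of pattern lines and data lines can be sketched in Python as follows. This is only a minimal sketch, not the reader used by any real uploader: the function names are hypothetical, and it assumes that fields are separated by semicolons and quoted with double quotes exactly as in the fictional example above.

```python
import csv
import re

# One purpose code, an optional number, a blank, then the content.
LINE_RE = re.compile(r"^([A-Za-z])(\d*) (.*)$")


def split_line(line):
    """Split 'cnum data...' into (code, number or None, data)."""
    match = LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a valid line: {line!r}")
    code, num, data = match.groups()
    return code, (int(num) if num else None), data


def split_fields(data):
    """Split the content into fields (assumption: ';' separated, '"' quoted)."""
    return next(csv.reader([data], delimiter=";", quotechar='"'))


def read_records(lines):
    """Yield (code, number, {name: value}) for every upper case data line."""
    patterns = {}
    for line in lines:
        if not line.strip():
            continue  # empty lines are ignored
        code, num, data = split_line(line)
        if code.islower():
            # A lower case line describes the fields of its upper case twin.
            patterns[code.upper()] = split_fields(data)
        else:
            yield code, num, dict(zip(patterns[code], split_fields(data)))


example = ['m "ID";"MESSAGE"', "", 'M7 42;"This example is rather stupid."']
records = list(read_records(example))
```

All values are returned as strings; converting them to numbers is left to the caller, because the pattern line carries names only.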

Some constraints of the structure:

See Code H, The Header line for some other important information.

Code H, The Header line

This line is given once directly after the pattern lines.

Fields of the header line
name content
PROGRAM Name of the producer. This is "MRES2X" in all cases.
CODEPAGE The used codepage while processing the input files. This value may not be of any interest, because the input data was generated in another codepage.
PRG_VERSION This value is a string and represents the version of this program in our source repository. Even small changes should change this number.
DATA_VERSION This value is a number with two fractional digits. The fractional part is increased if new fields have been added to the various content lines. The integer part is increased on structural changes, e.g. new line codes or removal of fields or lines.
MAX_FILES This value contains the number of input files that are about to be processed. The final number of processed files may differ from the number of input files mentioned here, because mres2x processes the files in a single step and continues its work even when erroneous input files have been detected. The erroneous files are included in this number.

Code T, The Termination line

This line is given once as the last line in the output file. A severe error must be assumed if this line doesn't appear. A rollback is suggested for the last data set.

Fields of the termination line
name content
RETURNCODE Code that will be returned to the caller of the program. This is either 0 or 1 currently. 0 indicates full success, 1 indicates at least one error.
LISTED In contrast to Code H, field MAX_FILES, this entry contains the number of input files actually listed in the output file. Note that even erroneous data sets are counted if they are shown at least partially.
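
A consumer can test for the termination line before committing anything to the database. The following Python sketch illustrates the idea; the function name and the sample lines are hypothetical, and the field contents shown are placeholders rather than real mres2x output.

```python
def check_termination(lines):
    """Return the content of the Code T line, or None if it is missing."""
    content = [line for line in lines if line.strip()]  # ignore empty lines
    if not content or not content[-1].startswith("T"):
        return None  # severe error: a rollback is suggested
    # Everything after the first blank is the content of the line.
    return content[-1].rstrip("\r\n").partition(" ")[2]


# Hypothetical sample data: a complete and a truncated output file.
complete = ["H ...", "", "T 0;3", ""]
truncated = ["H ...", "I1 ..."]

complete_result = check_termination(complete)
truncated_result = check_termination(truncated)
```

The returned content can then be split into RETURNCODE and LISTED with the help of the corresponding pattern line.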

Code I, An Input file description line

This line introduces the processing of a new input file. All following lines up to EOF (in case of an error) or a Code O-line are related to this individual file.

Fields of the input file description line
name content
TYPE Processed data type. "MIS" indicates an MS/MS run. "PMF" indicates a peptide mass fingerprint. Other elements are not planned to be supported.
AVERAGE This is the kind of processing that has been made. 0 indicates a monoisotopic computation, 1 a computation with average mass values.
See parameters:MASS
CLEAVAGE This is the chemical digest used in this experiment. A typical value is "Trypsin".
See parameters:CLE
DB1 This is the description of the database used. There is no list or nomenclature to use, so expect differences where there are none and vice versa. This name is hopefully comparable with the DB1 field of other experiments. An example is "yeastsgd".
See parameters:DB
DB2 This is a more exact description of the database used. There is no list or nomenclature to use, so expect differences where there are none and vice versa. An example is "yeast_all_sgd.fasta 3018992". The filename used by Mascot is listed, followed by the number of residues. The intention of this field is to make it possible to distinguish between several experiments with the same database but with a different dataset, which may result in incomparable values.
See header:release and header:residues
FILENAME This field contains the filename of the experiment. The path components are stripped off as well as the suffix (as far as it is well known). This is not the input file name. It is the name of the file that has been listed in the input file.
See parameters:FILE
PROGRAM This field contains the program name that produced the input file. A typical value is "MASCOT 2.0.04". The text "MASCOT" is constant, the number part is the version string passed in the input file.
See header:version
ICAT This field contains either 0 or 1. ICAT has been enabled if 1 is given. ICAT is a dangerous field. It changes the results drastically but is hard to detect if activated accidentally.
See parameters:ICAT
INSTRUMENT This field contains a string describing the instrument used. The big differences between instruments become manifest in the following parameter SEARCHES. Note that "Default" is a common value and that it is wrong in most cases, unless you use your microwave oven as the instrument.
See parameters:INSTRUMENT
SEARCHES This field contains a list of numbers. Each number selects a different ion series. The overall selection is done by choosing the instrument, which is translated to this ion series list in the file fragmentation_rules. Currently used rules (at RVZ):
  1. singly charged
  2. doubly charged if CHARGE >= 2
    (not internal or immonium)
  3. doubly charged if CHARGE >= 3
    (not internal or immonium)
  4. immonium
  5. a series
  6. a - NH3 if a significant and fragment includes RKNQ
  7. a - H2O if a significant and fragment includes STED
  8. b series
  9. b - NH3 if b significant and fragment includes RKNQ
  10. b - H2O if b significant and fragment includes STED
  11. c series
  12. x series
  13. y series
  14. y - NH3 if y significant and fragment includes RKNQ
  15. y - H2O if y significant and fragment includes STED
  16. z series
  17. internal yb < 700 Da
  18. internal ya < 700 Da
  19. y or y++ must be significant
  20. y or y++ must be highest scoring series
  21. z+1 series
  22. d and d' series
  23. v series
  24. w and w' series
See parameters:RULES
FRAGMENT_TOL This field contains the fragment mass tolerance value. This is the radius of the window around the measured points that must be hit to let a fragment fulfill its "hit" criteria.
See parameters:ITOL
FRAGMENT_TOLU This field contains the fragment mass tolerance unit. This is either "Da" or "mmu".
See parameters:ITOLU
PEPTIDE_TOL This field contains the peptide mass tolerance value. This is the radius of the window around the computed peptide masses that must be hit by the precursor mass to let a peptide fulfill its "hit" criteria.
This value has an active influence on the intensity threshold(s), because the count of matching theoretical peptides in the window defines the threshold.
See parameters:TOL
PEPTIDE_TOLU This field contains the peptide mass tolerance unit. This is either "Da", "mmu", "%" or "ppm".
See parameters:TOLU
VARIABLE_MODS This field contains a comma-separated list of modifications. Each modification has this form:
      special=diff=description
or
      special=diff[neutral]=description
special is a special character selected by mres2x to be appended to the modified amino acid character, as described later. The special character is chosen from the list "@~#§!^°:;`'/={}[]()/" from left to right.
diff is the mass difference between the used value and the standard value (u - s). Note that Mascot uses the last amino acid in the mod_file to compute the value. Many things may go wrong if more than one mass difference has been applied to the various residues of one modification.
[neutral] is given only if a neutral loss exists. neutral is a signed value describing the gain to the modification mass. E.g. @=79.978699[-97.995200]=Phospho (T) shows a modification gain of roughly 80 Da, but in case of a neutral loss you will have more or less 80-98 Da, which is an overall loss of 18 Da.
description is a text freely chosen by the modifier of mod_file, hopefully describing the modification sufficiently.
An empty string is possible for this variable.
See masses:deltai and masses:NeutralLossi
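
A single modification entry of the form described above can be taken apart as in the following Python sketch. The function name and the dictionary keys are hypothetical. The sketch handles one entry only; splitting the comma-separated list is left out on purpose, because it is not stated whether a description may itself contain a comma.

```python
import re

# special=diff=description  or  special=diff[neutral]=description
MOD_RE = re.compile(
    r"^(?P<special>.)="                          # one special character
    r"(?P<diff>[-+]?\d+(?:\.\d+)?)"              # mass difference
    r"(?:\[(?P<neutral>[-+]?\d+(?:\.\d+)?)\])?"  # optional signed neutral loss
    r"=(?P<description>.*)$"                     # free text
)


def parse_variable_mod(entry):
    """Parse one VARIABLE_MODS entry into its parts."""
    match = MOD_RE.match(entry)
    if match is None:
        raise ValueError(f"not a modification entry: {entry!r}")
    neutral = match["neutral"]
    return {
        "special": match["special"],
        "diff": float(match["diff"]),
        "neutral": float(neutral) if neutral is not None else None,
        "description": match["description"],
    }


mod = parse_variable_mod("@=79.978699[-97.995200]=Phospho (T)")
# Overall mass change in case of a neutral loss: roughly 80 - 98 = -18 Da.
net = mod["diff"] + mod["neutral"]
```

Because the special character is matched before the first "=", an entry whose special character is "=" itself is still parsed correctly.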
FIXED_MODS This field contains a comma-separated list of modifications. Each modification has this form:
      AA=diff
AA is one of the characters used for amino acids, one of the atoms Hydrogen, Carbon, Nitrogen, Oxygen, the electron mass (electron) or one of the two terminus placeholders C_term or N_term.
diff is the mass difference between the used value and the standard value (u - s). Default values are the weight of the molecules H and OH for the N terminus and the C terminus.
An empty string is possible for this variable.
See the section masses
PFA This field contains a whole number >= 0, which is the partials factor.
This is the maximum number of missed cleavages Mascot will compute with.
The default value is 0 despite the documentation.
See parameters:PFA
USER This field contains the user name associated with the experiment. Note that mres2x has the opportunity to overwrite this field.
An empty string is possible for this variable.
See parameters:USER and the flags -u and -U
TIMESTAMP This field contains the Unix time stamp of the run of the analyzer program, which is Mascot. Unix time stamps are seconds since January 1st, 1970 (UTC).
See header:date
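
For display or for a database column of a date type, the time stamp can be converted as in this small Python sketch. The function name is hypothetical; the value is assumed to arrive as a string or an integer.

```python
from datetime import datetime, timezone


def timestamp_to_utc(timestamp):
    """Convert a Unix time stamp (seconds since 1970-01-01 UTC) to a datetime."""
    return datetime.fromtimestamp(int(timestamp), tz=timezone.utc)


epoch = timestamp_to_utc("0")
```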
IDENTITY_THRES This field contains the identity threshold shown by Mascot. For those who want to know the details, it is computed as follows.

Let m be the average value of all qmatchi in the summary block.

This value has to be divided by 20*p, where p is usually the well-known p value of 0.05, so that 20*p = 1. Keep this in mind for the following computation:

IDENTITY_THRES = 10 * log10(m)

This value is shown in Mascot result presentations.

See summary:qmatchi
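
The computation can be written out in Python as follows. This is a sketch of the formula as described here, with the division by 20*p made explicit; the function name is hypothetical, and the list of qmatch values is assumed to be non-empty with a positive average.

```python
import math


def identity_threshold(qmatch_values, p=0.05):
    """Identity threshold for a whole input file.

    m is the average of all qmatch values; it is divided by 20*p, which
    equals 1 for the usual p of 0.05, so the result is 10 * log10(m).
    """
    m = sum(qmatch_values) / len(qmatch_values)
    return 10 * math.log10(m / (20 * p))


threshold = identity_threshold([100, 100])
```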
QUERIES This field contains the number of queries (series of measurement) contained in the input file.
See header:queries
COMMENT This field contains the comment associated with the experiment. This is the content of Mascot's TITLE entry. If this field isn't set or is set to the empty string, the COM field is used.
An empty string is possible for this variable.
See parameters:TITLE and parameters:COM
CHARGE This field contains the content of the charge search field of Mascot.
This field is not the charge Mascot actually uses. In fact, Mascot ignores this field if the experiment provides a value. See the CHARGE field of Code B for the values used during evaluation.
An empty string is possible for this variable.
See parameters:CHARGE
SEG This field contains the content of the protein mass search field of Mascot.
This field changes all possible results significantly. Every non-empty value should be treated as a sign that this computation has been done for experimental reasons. Never use results of this input file in a comparison or grouping with other results.
An empty string is possible and expected for this variable.
See parameters:SEG

Code O, An input file ending line

This line is given once for each occurrence of the Code I input file description line. A severe error must be assumed if this line doesn't appear after a Code I line and before the next occurrence of that line. A rollback is suggested for the last data set.

The number directly following the O will match the number following the I in the corresponding input file description line.

Fields of the input file ending line
name content
SUCCESS Code that indicates either success by a value of 1 or failure by a value of 0. In the latter case it is advisable to consider a rollback.
Note that one failure in a containing query results in a failure of the input file. Nothing is said about other query results. They may be usable.

Code B, The Beginning of a new query processing

This line introduces the beginning of a new query processing. An input file usually contains at least one query. All following lines up to EOF (in case of an error) or a Code E-line are related to this individual query.

A query is characterised by a list of ions representing a peak list with some additional information. Most of this information is extracted by programs from the raw data file of the mass spectrometer.

Fields of the beginning of a new query processing
name content
QUERY This is the number of the query (1-based) in the current input file. The number doesn't need to be consecutive.
This number can be used for direct references into the source file. The numbering is identical.
CHARGE This is the charge of the precursor found in the current query.
See summary:qexpi's second value
There is a relation between CHARGE, MASS and PRECURSOR, see the PRECURSOR field.
MASS This is the uncharged mass of the precursor molecule.
See summary:qmassi
There is a relation between CHARGE, MASS and PRECURSOR, see the PRECURSOR field.
PRECURSOR This is the m/z value of the charged precursor ion.
See summary:qexpi's first value
There is a relation between CHARGE, MASS and PRECURSOR. With H being the mass of a hydrogen (either monoisotopic or average, depending on AVERAGE), it is:

MASS = PRECURSOR * CHARGE - H * CHARGE
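
The relation can be checked with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, and the two hydrogen masses are standard textbook values rounded to a few digits, not values taken from any particular input file.

```python
# Assumed hydrogen masses in Da (rounded standard values).
HYDROGEN_MONOISOTOPIC = 1.007825
HYDROGEN_AVERAGE = 1.00794


def hydrogen_mass(average=0):
    """Select H according to the AVERAGE field (0 = monoisotopic, 1 = average)."""
    return HYDROGEN_AVERAGE if average else HYDROGEN_MONOISOTOPIC


def uncharged_mass(precursor, charge, average=0):
    """MASS = PRECURSOR * CHARGE - H * CHARGE"""
    return precursor * charge - hydrogen_mass(average) * charge


def precursor_mz(mass, charge, average=0):
    """The same relation solved for PRECURSOR (charge must not be 0)."""
    return (mass + hydrogen_mass(average) * charge) / charge


mass = uncharged_mass(500.0, 2)
```

Comparing the MASS field of a Code B line with uncharged_mass(PRECURSOR, CHARGE, AVERAGE) is a cheap plausibility check during an upload.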

MATCH This field contains the number of matching peptides at different sites of different proteins with their mass matching the range spanned by the PEPTIDE_TOL around the MASS value.
See summary:qmatchi
IDENTITY_THRES This field contains the identity threshold, known as the MOWSE score threshold (MOWSE = More Of Weird Statistical Errors). It is computed from MATCH as follows.

IDENTITY_THRES = 10 * log10(MATCH)

Note that this value usually isn't shown by Mascot. Mascot uses the overall value for the complete file, explained under the IDENTITY_THRES field of Code I.

See summary:qmatchi
HOMOLOGY_THRES This field contains the homology threshold computed by Mascot. The homology threshold is shown by Mascot in its overviews as the threshold of significant homology with p < 0.05 if this value is less than IDENTITY_THRES.
The author currently suggests max(IDENTITY_THRES, HOMOLOGY_THRES) as a good threshold for convincing results.
See summary:qplugholei
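
The per-query thresholds can be combined as in the following Python sketch. The function names are hypothetical, MATCH is assumed to be greater than 0, and treating a score strictly above the threshold as convincing is an assumption of this sketch, not a rule stated by Mascot.

```python
import math


def query_identity_threshold(match):
    """IDENTITY_THRES = 10 * log10(MATCH) for a single query."""
    return 10 * math.log10(match)


def suggested_threshold(match, homology_thres):
    """max(IDENTITY_THRES, HOMOLOGY_THRES), as suggested by the author."""
    return max(query_identity_threshold(match), homology_thres)


def is_convincing(peptide_score, match, homology_thres):
    """Assumption: a score strictly above the threshold counts as convincing."""
    return peptide_score > suggested_threshold(match, homology_thres)


limit = suggested_threshold(1000, 25.0)
```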
TITLE This field contains a string describing the title of the peak series.
An empty string is possible for this variable.
See queryi:title
PEAKLIST This field contains the list of peaks measured by the instrument. Each peak is a pair of value and intensity (in this order) delimited by a colon. The peaks themselves are delimited by commas.
See queryi:Ions1
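
Splitting the PEAKLIST field into numbers is straightforward. The following Python sketch uses a hypothetical function name and made-up sample values; it assumes that both parts of a peak are plain decimal numbers.

```python
def parse_peaklist(peaklist):
    """Turn 'value:intensity,value:intensity,...' into a list of float pairs."""
    peaks = []
    for item in peaklist.split(","):
        if not item:
            continue  # tolerate an empty field or a trailing comma
        value, intensity = item.split(":")
        peaks.append((float(value), float(intensity)))
    return peaks


# Made-up sample data, not taken from a real query.
peaks = parse_peaklist("100.5:20,200.25:7.5")
```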

Code E, The Ending of a query processing

This line is given once for each occurrence of the Code B beginning of a new query processing line. A severe error must be assumed if this line doesn't appear after a Code B line and before the next occurrence of that line. A rollback is strongly suggested for the last data set.

The number directly following the E will match the number following the B of the beginning of the new query processing.

Fields of the ending of a query processing
name content
SUCCESS Code that indicates either success by a value of 1 or failure by a value of 0. In the latter case a data rollback is strongly suggested.
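
The pairing rules for Code I/O lines and Code B/E lines can be checked before an upload. The following Python sketch reports every block that was opened but never closed by a line carrying the same number. The function name and the sample lines are hypothetical, and the line contents are placeholders.

```python
import re

OPEN_TO_CLOSE = {"I": "O", "B": "E"}
CLOSE_TO_OPEN = {close: opener for opener, close in OPEN_TO_CLOSE.items()}
# Upper case purpose code, optional number, then a blank or end of line.
CODE_RE = re.compile(r"^([A-Z])(\d*)(?: |$)")


def find_unclosed(lines):
    """Return (code, number) for each I or B block without its O or E line."""
    open_blocks = {}  # opening code -> number of the block still open
    unclosed = []
    for line in lines:
        match = CODE_RE.match(line)
        if match is None:
            continue  # empty line or lower case pattern line
        code, num = match.groups()
        if code in OPEN_TO_CLOSE:
            if code in open_blocks:
                # The previous block of this kind never got its closing line.
                unclosed.append((code, open_blocks[code]))
            open_blocks[code] = num
        elif code in CLOSE_TO_OPEN:
            opener = CLOSE_TO_OPEN[code]
            if open_blocks.get(opener) == num:
                del open_blocks[opener]
    unclosed.extend(open_blocks.items())
    return unclosed


# Hypothetical sample: file 2 and its query 1 are cut off.
sample = ["I1 ...", "B1 ...", "E1 1", "O1 1", "I2 ...", "B1 ..."]
problems = find_unclosed(sample)
```

Any entry in the result marks a data set for which a rollback should be considered.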

Code P, A Protein data line

This line shows data of the summary section relating to a distinct query.

Some lines in the summary section may be invalidated, which is normal, because the summary section contains protein choices of Mascot for the "best hit". This doesn't contain all different peak lists if more than one peak list is given at all. Thus, the HITNUMBER may have non-consecutive numbers if more than one query is used in an input file.

Fields of a protein data line
name content
PROTEIN This is the name of the protein Mascot assigned to a specific hit. The kind and specification of the name depend on the database.
A string containing a comma is possible for this variable.
The PROTEIN field with the QUERY field should be unique in one input file.
See summary:hi's first element
HITNUMBER This is the number under which the PROTEIN is positioned in the hit list. The smaller the number, the better the hit of the protein.
The HITNUMBER field with the QUERY field are unique in one input file.
See the i in summary:hi
TOTAL_SCORE This is the total score of the protein. It is the result of a complex formula known by Matrix Science. In general, it is the sum of the scores of each individual peptide in the input file that matches this protein. Even low-scored peptides contribute their score to the sum, maybe partially.

One of the things not mentioned very well in the documentation is the fact that even different peptides generated by one peak list will add their score to the total score.
This is the reason why even with only one peak list in the input file the protein hit list and the peptide hit list differ.

See summary:hi's second element
TOTAL_MASS This is the computed mass of the protein.
See summary:hi's fourth element
MISSED_CLEAVAGE This is the number of missed cleavages detected by Mascot for the PEPTIDE.
See summary:hi_qj's first element
QUERY This is the j in summary:hi_qj and is equal to the QUERY of a Code B line.
PEPTIDE This field contains the modified peptide sequence. Every ambiguous amino acid code (B, X, Z) has been replaced by a valid amino acid code. Every variable modification is annotated by a modification code. Even the termini may be modified. In this case the modification of a terminus is delimited from the peptide's sequence by a period.
An example is "@.HMIIM~KKM" which has two modifications, one at the N-terminus, the other at the M in the middle.
See summary:hi_qj's seventh element
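
The annotated sequence can be split into its parts as in the following Python sketch. The function name is hypothetical. The sketch assumes that every non-letter inside the sequence is a modification code appended to the residue before it, and that a part separated by a period and free of letters is a terminus modification.

```python
def parse_modified_peptide(peptide):
    """Return (n_term_mod, plain_sequence, {position: codes}, c_term_mod).

    Positions are 1-based and refer to the plain sequence.
    """
    parts = peptide.split(".")
    n_term = c_term = ""
    # A leading part without letters is taken as the N-terminus modification.
    if len(parts) > 1 and parts[0] and not any(c.isalpha() for c in parts[0]):
        n_term = parts.pop(0)
    # A trailing part without letters is taken as the C-terminus modification.
    if len(parts) > 1 and parts[-1] and not any(c.isalpha() for c in parts[-1]):
        c_term = parts.pop()
    sequence = []
    mods = {}
    for char in ".".join(parts):
        if char.isalpha():
            sequence.append(char)
        elif sequence:
            # The code is appended to the modified residue, so it belongs
            # to the residue read last.
            position = len(sequence)
            mods[position] = mods.get(position, "") + char
    return n_term, "".join(sequence), mods, c_term


parsed = parse_modified_peptide("@.HMIIM~KKM")
```

For the example, the result names "@" as the N-terminus modification and "~" at position 5, which is the second M of the plain sequence HMIIMKKM.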
PEPTIDE_MASS This is the computed mass of the peptide without charge.
See summary:hi_qj's second element
PEPTIDE_START This is the position of the peptide in the protein (1-based).
See summary:hi_qj's fourth element
PEPTIDE_SCORE This is the score of the PEPTIDE Mascot has computed.
The value is more or less useful depending on the thresholds.
See summary:hi_qj's tenth element
OCCURANCES This is the number of occurrences of the PEPTIDE's mass in the pool of the masses of each possible peptide in the protein. The information may be useful for PMF searches.
See summary:hi_qj's eleventh element
MATCHING_FRAGMENTS This is the number of matching ions.
We still need to know which ions are counted as "found" and which ion series are possible.
See summary:hi_qj's sixth element
MATCHING_PEAKS This is the number of matching peaks in the list of peaks for this peptide.
See summary:hi_qj's eighth element
SERIES_FOUND This is a list of ion series found in the peak list matching the theoretical spectrum of the peptide.

This string should have 17 characters (the length is known to differ in some Mascot versions), each being either 0 (not found), 1 (more than a random peak) or 2 (scored peak).

Elements of the SERIES_FOUND string
position series
1 a
2 reserved, should be zero
3 a++
4 b
5 reserved, should be zero
6 b++
7 y
8 reserved, should be zero
9 y++
10 c
11 c++
12 x
13 x++
14 z
15 z++
16 z+H
17 z++H++

See summary:hi_qj's twelfth element

SERIES_FOUND_STR This is a list of ion series found in the peak list matching the theoretical spectrum of the peptide in a user readable form.

This value is the representation of SERIES_FOUND. Only known series with at least more than random matches are displayed. Unscored values are displayed in parentheses, scored values are displayed directly. The entries are comma-separated.
Example: SERIES_FOUND="00010020000000000" leads to SERIES_FOUND_STR="(b),y"
See summary:hi_qj's twelfth element
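
The translation from SERIES_FOUND to SERIES_FOUND_STR follows directly from the position table. A minimal Python sketch, with a hypothetical function name, could look like this. Reserved positions and characters beyond position 17 are skipped, on the assumption that they do not name a known series.

```python
# Series names by position (1 to 17); None marks a reserved position.
SERIES_NAMES = [
    "a", None, "a++", "b", None, "b++", "y", None, "y++",
    "c", "c++", "x", "x++", "z", "z++", "z+H", "z++H++",
]


def series_found_str(series_found):
    """Build the readable form: '(name)' for flag 1, 'name' for flag 2."""
    parts = []
    # zip() stops at the shorter input, so extra characters are ignored.
    for name, flag in zip(SERIES_NAMES, series_found):
        if name is None:
            continue
        if flag == "1":
            parts.append(f"({name})")  # more than a random peak, unscored
        elif flag == "2":
            parts.append(name)  # scored peak
    return ",".join(parts)


readable = series_found_str("00010020000000000")
```

For the example string above, this yields "(b),y".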

Code F, A peptide data line

This line shows data of the peptides section relating to a distinct query. AG Sickmann of RVZ prefers to use this data.

The HITNUMBER field with the PROTEIN_NUMBER field and the QUERY field are unique in one input file.
Fields of a peptide data line
name content
PROTEIN This is the name of the protein Mascot assigned to a specific hit. The kind and specification of the name depend on the database.
A string containing a comma is possible for this variable.
See peptides:qi_pj's twelfth element
PROTEIN_NUMBER This is the running number of the protein in the list of matching proteins for a particular PEPTIDE.
The HITNUMBER field with the PROTEIN_NUMBER field and the QUERY field are unique in one input file.
See peptides:qi_pj's twelfth element
HITNUMBER This is the number under which the PEPTIDE is positioned in the hit list. The smaller the number, the better the hit of the peptide for one particular query.
The HITNUMBER field with the PROTEIN_NUMBER field and the QUERY field are unique in one input file.
See the j in peptides:qi_pj
TOTAL_MASS This is the computed mass of the protein.
This field may not be set due to Mascot's format. The value is 0.0 in this case.
The value is extracted out of the summary section or the proteins section.
MISSED_CLEAVAGE This is the number of missed cleavages detected by Mascot for the PEPTIDE.
See peptides:qi_pj's first element
QUERY This is the i in peptides:qi_pj and is equal to the QUERY of a Code B line.
PEPTIDE This field contains the modified peptide sequence. Every ambiguous amino acid code (B, X, Z) has been replaced by a valid amino acid code. Every variable modification is annotated by a modification code. Even the termini may be modified. In this case the modification of a terminus is delimited from the peptide's sequence by a period.
An example is "@.HMIIM~KKM" which has two modifications, one at the N-terminus, the other at the M in the middle.
See peptides:qi_pj's fifth element
PEPTIDE_MASS This is the computed mass of the peptide without charge.
See summary:hi_qj's second element
PEPTIDE_START This is the position of the peptide in the protein (1-based).
See peptides:qi_pj's twelfth element
PEPTIDE_SCORE This is the score of the PEPTIDE Mascot has computed.
The value is more or less useful depending on the thresholds.
See peptides:qi_pj's eighth element
OCCURANCES This is the number of occurrences of the PEPTIDE's mass in the pool of the masses of each possible peptide in the protein. The information may be useful for PMF searches.
See peptides:qi_pj's twelfth element
MATCHING_FRAGMENTS This is the number of matching ions.
We still need to know which ions are counted as "found" and which ion series are possible.
See peptides:qi_pj's fourth element
MATCHING_PEAKS This is the number of matching peaks in the list of peaks for this peptide.
See peptides:qi_pj's sixth element
SERIES_FOUND This is a list of ion series found in the peak list matching the theoretical spectrum of the peptide.

This string should have 17 characters (the length is known to differ in some Mascot versions), each being either 0 (not found), 1 (more than a random peak) or 2 (scored peak).

Elements of the SERIES_FOUND string
position series
1 a
2 reserved, should be zero
3 a++
4 b
5 reserved, should be zero
6 b++
7 y
8 reserved, should be zero
9 y++
10 c
11 c++
12 x
13 x++
14 z
15 z++
16 z+H
17 z++H++

See peptides:qi_pj's ninth element

SERIES_FOUND_STR This is a list of ion series found in the peak list matching the theoretical spectrum of the peptide in a user readable form.

This value is the representation of SERIES_FOUND. Only known series with at least more than random matches are displayed. Unscored values are displayed in parentheses, scored values are displayed directly. The entries are comma-separated.
Example: SERIES_FOUND="00010020000000000" leads to SERIES_FOUND_STR="(b),y"
See peptides:qi_pj's ninth element