emowse

Function

Protein identification by mass spectrometry

Description

Peptide mass information can provide a 'fingerprint' signature sufficiently discriminating to allow for the unique and rapid identification of unknown sample proteins, independent of other analytical methods such as protein sequence analysis. Practical experience has shown that sample proteins can be uniquely identified using as few as 3-4 experimentally determined peptide masses when screened against a fragment database derived from over 50,000 proteins.

Given a one-per-line file of molecular weights of fragments cut by enzymes/reagents, emowse will search a protein database for matches with the mass spectrometry data. One of eight cutting enzymes/reagents can be specified, along with an optional whole sequence molecular weight.

Determination of molecular weight has always been an important aspect of the characterization of biological molecules. Protein molecular weight data, historically obtained by SDS gel electrophoresis or gel permeation chromatography, has been used to establish purity, detect post-translational modification (such as phosphorylation or glycosylation) and aid identification. Until just over a decade ago, mass spectrometric techniques were typically limited to relatively small biomolecules, as proteins and nucleic acids were too large and fragile to withstand the harsh physical processes required to induce ionization. This began to change with the development of 'soft' ionization methods such as fast atom bombardment (FAB) [1], electrospray ionisation (ESI) [2,3] and matrix-assisted laser desorption ionisation (MALDI) [4], which can effect the efficient transition of large macromolecules from solution or solid crystalline state into intact, naked molecular ions in the gas phase. As an added bonus to the protein chemist, sample handling requirements are minimal and the amounts required for MS analysis are in the same range as, or less than, those required by existing analytical methods.
As well as providing accurate mass information for intact proteins, such techniques have been routinely used to produce accurate peptide molecular weight 'fingerprint' maps following digestion of known proteins with specific proteases. Such maps have been used to confirm protein sequences (allowing the detection of errors of translation, mutation or insertion), characterise post-translational modifications or processing events and assign disulphide bonds [5,6].

Less well appreciated, however, is the extent to which such peptide mass information can provide a 'fingerprint' signature sufficiently discriminating to allow for the unique and rapid identification of unknown sample proteins, independent of other analytical methods such as protein sequence analysis. Practical experience has shown that sample proteins can be uniquely identified using as few as 3-4 experimentally determined peptide masses when screened against a fragment database derived from over 50,000 proteins. Experimental errors of a few Daltons are tolerated by the scoring algorithms, permitting the use of inexpensive time-of-flight mass spectrometers.

As with other types of physical data, such as amino acid composition or linear sequence, peptide masses can clearly provide a set of determinants sufficiently unique to identify or match unknown sample proteins. Peptide mass fingerprints can prove as discriminating as linear peptide sequence, but can be obtained in a fraction of the time using less material. In many cases, this allows for a rapid identification of a sample protein before committing to protein sequence analysis. Fragment masses also provide structural information, at the protein level, fully complementary to large-scale DNA sequencing or mapping projects [7,8,9].
For each entry in the specified set of sequences to search, emowse derives both whole sequence molecular weight and calculated peptide molecular weights for complete digests using the range of cleavage reagents and rules detailed in Table 1. Cleavage is disallowed if the target residue is followed by proline (except for CNBr or Asp N). Glu C (S. aureus V8 protease) cleavages are also inhibited if the adjacent residue is glutamic acid. Peptide mass calculations are based entirely on the linear sequence and use the average isotopic masses of amide-bonded amino acid residues (IUPAC 1987 relative atomic masses). To allow for N-terminal hydrogen and C-terminal hydroxyl the final calculated molecular weight of a peptide of N residues is given by the equation:

    Mw = \sum_{n=1}^{N} (Residue mass)_n + 18.0153

Molecular weights are rounded to the nearest integer value before being used. Cysteine residues are calculated as the free thiol, anticipating that samples are reduced prior to mass analysis. CNBr fragments are calculated as the homoserine lactone form. Information relating to post-translational modification (phosphorylation, glycosylation etc.) is not incorporated into calculation of peptide masses.

Table 1: Cleavage reagents modelled by emowse.

    Reagent no.  Reagent       Cleavage rule
    1            Trypsin       C-term to K/R
    2            Lys-C         C-term to K
    3            Arg-C         C-term to R
    4            Asp-N         N-term to D
    5            V8-bicarb     C-term to E
    6            V8-phosph     C-term to E/D
    7            Chymotrypsin  C-term to F/W/Y/L/M
    8            CNBr          C-term to M

Current versions of emowse also incorporate calculated peptide Mw's resulting from incomplete or partial cleavages. At present, this is achieved by computing all nearest-neighbour pairs for each enzyme or reagent detailed in Table 1.

Tolerance

The supplied number specifies the error allowed for mass accuracy of experimental mass determination. If no figure is specified, a default tolerance of 2 Daltons will be assumed.
If you wish to specify a different tolerance then follow the qualifier '-tolerance' with the required number of Daltons, e.g. '-tolerance 1'. In this case, supplied peptide masses will be matched to +/- 1 Dalton. Values of 2-4 are suggested for data obtained by laser-desorption TOF instruments. Accuracies of +/- 2 Daltons or better are generally only possible using an appropriate internal standard (e.g. oxidised insulin B chain) with TOF instruments. For electrospray or FAB data, a value of 1 can be selected in most cases. If you have real confidence in mass determination, specify '0' (zero) to limit matches to the nearest integer value (effectively +/- 0.5 Daltons). Discrimination is significantly improved by the selection of a small error tolerance.

Whole sequence molecular weight

This option allows you to give the molwt of the whole protein (if known), which limits the search to proteins of this molwt plus/minus a 'limit' (see below). If unspecified, a whole protein molwt of 0 is assumed, which emowse interprets as "search the whole database". This will include all proteins up to the maximum size of just under 700,000 Daltons. You can specify any molwt in Daltons with this command, e.g. '-weight 90000'.

Allowed whole sequence weight variability

This option is used in conjunction with the '-weight' option and is meaningless without it. It specifies a percentage. Only proteins of the given sequence molecular weight +/- this percentage will be searched. If a sequence molecular weight is specified but '-pcrange' is unspecified then '-pcrange' will default to 25%. To specify a percentage of 30% use: '-pcrange 30'. For example, with '-weight 90000' the selection of 30 for the filter restricts the search to those proteins with masses from 63,000 to 117,000 Daltons. A value of 25 is suggested for initial searches, which can be progressively widened for subsequent search attempts if no matches are found.
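The arithmetic behind the '-weight' and '-pcrange' filter can be illustrated with a short sketch. The function names are hypothetical and the strict inequalities are an assumption about how the bounds are compared; only the percentage calculation itself comes from the description above.

```python
def weight_window(weight, pcrange=25.0):
    """Return (low, high) bounds in Daltons for the whole-sequence filter."""
    spread = weight * pcrange / 100.0
    return weight - spread, weight + spread


def passes_weight_filter(entry_weight, weight=0.0, pcrange=25.0):
    """True if a database entry should be searched.

    A weight of 0 means "search the whole database", so nothing is excluded.
    """
    if weight == 0:
        return True
    low, high = weight_window(weight, pcrange)
    return low < entry_weight < high


if __name__ == "__main__":
    print(weight_window(90000, 30))
```

With '-weight 90000 -pcrange 30' the window works out to 63,000 to 117,000 Daltons, matching the figures quoted above.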
Discrimination is best when the filter percentage is narrow, but some Mw estimates (particularly from SDS gels) should be given considerable allowance for error.

Partials factor

This specifies the weighting given to partially-cleaved peptide fragments, with a range from 0.1 to 1.0. If not specified, the default value is 0.4. The factor effectively down-weights the score awarded to a partial fragment by the specified amount. For example, a '-partials' of 0.25 will reduce the score of partial fragments to 25% (one quarter) of the score of a complete ('perfect') peptide cleavage fragment of equal mass.

Computing all possible nearest-neighbour partial fragments adds significantly to the number of peptides entered in the database (by a factor of two). The major effect of this is to increase the background score by increasing the number of random Mw matches, which can significantly reduce discrimination. The use of a low '-partials' factor (e.g. 0.1 - 0.3) is a useful way of limiting this effect; partial peptide matches will add a little to the cumulative frequency score, but without compromising discrimination.

More experienced users can utilise the '-partials' factor to optimize searches where the peptide Mw data contain a significant proportion of partial cleavage fragments (e.g. > 30%). In such cases, setting the '-partials' factor within the range 0.4 - 0.6 can help to improve discrimination. Conversely, if the digestion is perfect, with no partial fragments present, the lowest '-partials' factor of 0.1 will give maximum discrimination.

Program requirements

The emowse search program accepts a single text file containing a list of experimentally-determined masses, generally selected from the range 700-4,000 Daltons to reduce the influence of partial cleavage products.
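The nearest-neighbour partial fragments discussed under the partials factor can be pictured with the sketch below. It only generates the partial peptides and tags each with the weighting it would receive; how emowse applies that weighting inside its score is not shown, and the function name is hypothetical.

```python
def with_partials(fragments, partials=0.4):
    """Return (peptide, weight) pairs for a digest.

    Complete fragments get weight 1.0. Each pair of adjacent fragments,
    joined as if the cut between them had been missed, gets the partials
    factor as its weight.
    """
    if not 0.1 <= partials <= 1.0:
        raise ValueError("partials factor must lie between 0.1 and 1.0")
    complete = [(frag, 1.0) for frag in fragments]
    partial = [(a + b, partials) for a, b in zip(fragments, fragments[1:])]
    return complete + partial


if __name__ == "__main__":
    for peptide, weight in with_partials(["GAKPR", "LLK", "DE"], 0.25):
        print(peptide, weight)
```

A digest of n complete fragments gains n - 1 nearest-neighbour partials, which is why including them roughly doubles the number of peptides per entry.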
The program outputs a ranked hit list comprising the top 30 scores, with information including the protein entry name, text identifiers, final accumulated scores, matching peptide sequences and hit versus miss tallies. User-selectable search parameters include an error tolerance (default +/- 2 Daltons), selection of the enzyme or reagent used and an intact protein Mw (optional, if known).

For each peptide Mw entry in the data file, emowse matches individual fragment molecular weights (FMWs) with database entry molecular weights (DBMWs). A 'hit' is scored when the following criterion is met:

    DBMW-tolerance-1 < FMW < DBMW+tolerance+1

If an intact protein Mw is specified (SMW) then the program prompts for a molecular weight filter percentage (MWFP). emowse then restricts the search to those entries which match the following criteria:

    R = SMW x MWFP / 100
    0 < SMW-R < emowse entry Mol.wt. < SMW+R

Default search parameters are a tolerance of +/- 2 Daltons, intact Mw specified and the MWFP set to 25.

emowse Scoring scheme

The final scoring scheme is based on the frequency of a fragment molecular weight being found in a protein of a given range of molecular weight. OWL database sequence entries were initially grouped into 10 kDalton intact molecular weight intervals. For each 10 kDalton protein interval, peptide fragment molecular weights were assigned to cells of 100 Dalton intervals. The cells therefore contained the number of times a particular fragment molecular weight occurred in a protein of any given size. This operation was performed for each enzyme. Cell frequency values were calculated by dividing each cell value by the total number of peptides in each 10 kDalton protein interval. Cell frequency values for each 10 kDalton interval were then normalised to the largest cell value (Fmax), with all the cell values recalculated as:

    Cell value = Old value / Fmax

to yield floating point numbers between 0 and 1.
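The hit criterion and the construction of the normalised frequency cells described above can be sketched as follows. This is an illustration under stated assumptions, not the emowse implementation: the function names and the input layout (pairs of protein Mw and peptide Mw lists) are hypothetical, and the table is built for a single cleavage reagent.

```python
from collections import defaultdict


def is_hit(fmw, dbmw, tolerance=2):
    """Hit criterion: DBMW - tolerance - 1 < FMW < DBMW + tolerance + 1."""
    return dbmw - tolerance - 1 < fmw < dbmw + tolerance + 1


def frequency_table(proteins):
    """Build normalised cell frequencies for one cleavage reagent.

    `proteins` is an iterable of (protein_mw, [peptide_mw, ...]) pairs.
    Returns {protein_interval: {cell: value}} with values between 0 and 1.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for protein_mw, peptide_mws in proteins:
        interval = int(protein_mw // 10_000)       # 10 kDalton intervals
        for mw in peptide_mws:
            counts[interval][int(mw // 100)] += 1  # 100 Dalton cells

    table = {}
    for interval, cells in counts.items():
        total = sum(cells.values())                # peptides in this interval
        freqs = {cell: n / total for cell, n in cells.items()}
        fmax = max(freqs.values())                 # largest cell frequency
        table[interval] = {cell: f / fmax for cell, f in freqs.items()}
    return table


if __name__ == "__main__":
    demo = [(25000, [850, 870, 1520]), (27000, [830, 2210])]
    print(frequency_table(demo))
    print(is_hit(1002, 1000), is_hit(1003, 1000))
```

With integer masses and a tolerance of 0 the criterion accepts only an exact match, which agrees with the "nearest integer" behaviour described for '-tolerance 0'.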
These distribution frequency values, calculated for each cleavage reagent, were then built into the emowse search program. For every database entry scanned, all matching fragments contribute to the final score. In the current implementation, non-matching fragments are ignored (neutral). For each matching peptide Mw a score is assigned by looking up the appropriate normalised distribution frequency value. In the case of multiple 'hits' in any one target protein (i.e. more than one matching peptide Mw), the distribution frequency scores are multiplied. The final product score is inverted and then normalised to an 'average'