CN113571129B - Complex cross-linked peptide identification method based on mass spectrum - Google Patents

Info

Publication number
CN113571129B
Authority
CN
China
Prior art keywords
peptide
cross
fragment
linked
mass
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111117873.7A
Other languages
Chinese (zh)
Other versions
CN113571129A (en
Inventor
张永谦
韦秋实
邓玉林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202111117873.7A priority Critical patent/CN113571129B/en
Publication of CN113571129A publication Critical patent/CN113571129A/en
Application granted granted Critical
Publication of CN113571129B publication Critical patent/CN113571129B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/68Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G01N33/6803General methods of protein analysis not limited to specific proteins or families of proteins
    • G01N33/6848Methods of protein analysis involving mass spectrometry
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Medical Informatics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Urology & Nephrology (AREA)
  • Hematology (AREA)
  • Biomedical Technology (AREA)
  • Immunology (AREA)
  • Analytical Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Genetics & Genomics (AREA)
  • Medicinal Chemistry (AREA)
  • Food Science & Technology (AREA)
  • Microbiology (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Cell Biology (AREA)
  • Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Biochemistry (AREA)

Abstract

The invention relates to a mass spectrum-based complex cross-linked peptide identification method, and belongs to the technical field of biological analysis. The method first performs theoretical enzyme digestion on the proteins in a protein library and establishes a peptide fragment index; it then assumes the cross-linking sites on each peptide fragment and establishes an ion index; it then applies an ion complementary hypothesis to each spectrogram and performs peptide spectrum matching using the ion index; finally, rough scoring, fine scoring and re-scoring give the final result. The method directly performs peptide spectrum matching between a spectrogram and all cross-linked peptide fragments that meet the conditions. Because the correct peptide fragments for a spectrogram are almost always ranked near the top, the correct result can be recalled with a single round of fine scoring in a small retrieval space.

Description

Complex cross-linked peptide identification method based on mass spectrum
Technical Field
The invention relates to a mass spectrum-based complex cross-linked peptide identification method, and belongs to the technical field of biological analysis.
Background
Cross-linking proteomics aims to elucidate the structural features of intracellular protein networks, to understand life mechanisms at the molecular level, and ultimately to support applications such as drug design. Cross-linking mass spectrometry offers high throughput and high sensitivity, and can acquire protein cross-links in complex samples on a large scale. Identifying the cross-linked peptide fragments is the key to obtaining the cross-linking information of the proteins. Cross-linked peptide fragments come in three basic forms, mono-link, loop-link and cross-link, which are also known as simple cross-linked forms. In practice, however, because enzyme cleavage sites and cross-linking sites are unevenly distributed, a peptide fragment may carry several cross-links. Such a peptide fragment is said to be in a complex cross-linked form, which is a combination of the three simple forms. These peptide fragments also play an important role in understanding protein structure and function. Many methods exist for identifying peptide fragments in a simple cross-linked form in mass spectrum data, but few studies address peptide fragments in a complex cross-linked form. Because a complex cross-linked peptide fragment contains several cross-links, its fragmentation rule and fragment ions are not the same as those of a simple cross-linked peptide fragment, so identification methods for simple cross-linked peptide fragments cannot be used for complex cross-linked peptide fragments.
Because several cross-links are present, even when peptide bonds between cross-linking sites in the peptide fragment are broken, the resulting fragment ions do not differ in mass and do not help in the peptide spectrum matching process. If these fragment ions are generated in the usual way and used for peptide spectrum matching, they lead to erroneous identification results.
Disclosure of Invention
In view of the above, the present invention provides a method for identifying a cross-linked peptide fragment based on mass spectrometry, which can identify a complex cross-linked peptide fragment with high accuracy.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a method for mass spectrometry-based cross-linked peptide fragment identification, the method comprising the steps of:
(1) performing theoretical enzyme digestion on all protein sequences in a protein library, and storing detailed information of the obtained peptide fragments in order of mass; establishing a corresponding relation between the mass of the peptide fragment and the source peptide fragment, and constructing a peptide fragment index;
(2) traversing all the peptide fragments, assuming all the crosslinking sites existing on each peptide fragment, calculating the mass of the uncharged fragment ions under each crosslinking site, establishing the corresponding relation between the mass of the fragment ions and the source peptide fragment, and constructing an ion index;
(3) fragment ions of each peak in a mass spectrum of the peptide fragment to be identified are assumed, and are respectively assumed to be non-cross-linked fragment ions and cross-linked fragment ions; when the fragment ions are non-cross-linked, calculating to obtain the mass of the fragment ions without charges; when the fragment ions are cross-linked fragment ions, calculating to obtain the mass of the fragment ions which are complementary with the cross-linked fragment ions and are not charged;
(4) respectively storing the non-cross-linked fragment ions and the cross-linked fragment ions obtained from one spectrogram into a non-cross-linked spectrogram and a cross-linked spectrogram, traversing the fragment ions in the non-cross-linked spectrogram and the cross-linked spectrogram, and searching a source peptide segment corresponding to the mass of the fragment ions obtained by calculation in an ion index to obtain a matched peptide segment;
(5) scoring the correlation degree of the matched peptide fragments, respectively calculating scores in a non-cross-linked spectrogram and a cross-linked spectrogram, and summing the scores to serve as the scores of the matched peptide fragments; according to the relation that the total mass of the cross-linked peptide fragments is the sum of the masses of two complementary peptide fragments (assuming that one peptide fragment in the cross-linked peptide fragments is an alpha peptide fragment, and the other peptide fragment is a beta peptide fragment which is complementary with the alpha peptide fragment, the total mass of the cross-linked peptide fragments = the mass of the alpha peptide fragment + the mass of the beta peptide fragment), respectively taking the peptide fragments with higher fractions as candidate results of the two complementary peptide fragments, thereby obtaining a candidate cross-linked peptide fragment combination result in a spectrogram;
(6) repeating the steps (4) to (5) to obtain candidate cross-linked peptide fragment combination results in all spectrograms, merging all the results, and establishing indexes from the peptide fragments to the spectrograms;
(7) reading the detailed information of the peptide fragments in the step (1) and calculating the matching score of fragment ions of the cross-linked peptide fragment combination and the spectrogram to obtain all candidate cross-linked peptide fragment combination scores of each spectrogram; taking the combination of the cross-linked peptide segments with the highest fraction in each spectrogram as the result of the spectrogram; and sequencing the results of each spectrogram from high to low, and performing fine scoring and re-scoring to obtain the final result of the peptide fragment to be identified.
Further, in the step (1), the detailed information of the peptide fragment comprises the sequence of the peptide fragment, the cross-linking site and the source protein.
Further, in the step (1), when constructing the peptide fragment index: the peptide fragment sequences are numbered, and a mass array is constructed in external memory, the array comprising the peptide fragment mass range and the number of the source peptide fragment.
Further, in the step (2), when constructing the ion index: a mass array is constructed in main memory, the array comprising the fragment ion mass and the number of the source peptide fragment.
Further, the numbering in the step (1) and the step (2) uses 64-bit binary codes.
Further, in the step (2), when the mass of the uncharged fragment ions is calculated: the corresponding non-cross-linked fragment ions b and y are generated according to the cross-linking sites on the peptide fragment, and their uncharged masses are calculated, where the mass of an uncharged fragment ion b is defined as the sum of the corresponding amino acid residue masses, and the mass of an uncharged fragment ion y is defined as the sum of one water molecule and the corresponding amino acid residue masses.
Further, in the step (3), when the mass of the uncharged fragment ions is calculated: the charge number of the fragment ion is assumed, and the calculation is then performed as mass of the uncharged fragment ion = peak mass-to-charge ratio × charge number − proton mass × charge number; when calculating the mass of the uncharged fragment ion complementary to a cross-linked fragment ion: the calculation is performed as mass of the uncharged fragment ion complementary to the cross-linked fragment ion = uncharged parent ion mass − (peak mass-to-charge ratio × charge number − proton mass × charge number).
Further, in the step (1), the peptide fragment mass range is the actual mass of the peptide fragment multiplied by one hundred and then rounded to an integer; in the step (2), the fragment ion mass is the actual mass of the fragment ion rounded to two decimal places and then multiplied by one hundred; and the masses calculated in the step (3) are multiplied by one hundred and then rounded to integers.
Further, in the step (5), according to the relation that the total mass of the cross-linked peptide fragments is the sum of the masses of two complementary peptide fragments, firstly, the matched peptide fragment with a higher fraction is taken as a candidate result of one peptide fragment in the cross-linked peptide fragments of the spectrogram; and then calculating the mass of the other complementary peptide fragment, and taking the peptide fragment with higher fraction as a candidate result of the other peptide fragment in the cross-linked peptide fragments of the spectrogram according to the ranking result of the scores.
Further, when fine scoring and re-scoring are performed in the step (7): calculating the FDR of the PSM level; carrying out SVM training according to the obtained FDR to obtain a model; and re-scoring the result of each spectrogram by using the model, sorting the result from high to low according to the score, and calculating the FDR of the PSM level to obtain the final result.
Advantageous effects
When the method of the invention constructs the ion index, it fully considers the combinations of cross-linking sites on each peptide fragment in the peptide fragment index, so that one peptide fragment sequence gives rise to several entries that differ in their combination of cross-linking sites. For complex cross-linked peptide fragments, the fragment ions between cross-linking sites carry no distinguishing mass information, and they are excluded when the ion index is constructed to ensure correct identification of peptide fragments in a complex cross-linked form. By applying the ion complementary hypothesis to the spectrograms, the method can use all fragment ions of a candidate peptide fragment when matching peptide spectra with the ion index. The method directly performs peptide spectrum matching between a spectrogram and all cross-linked peptide fragments that meet the conditions; the retrieval space has a complexity of only O(n); the correct peptide fragments for a spectrogram are almost always ranked near the top; and the correct result can be recalled with a single round of fine scoring in a very small retrieval space.
The method uses a coded ion index technology: when the ion index of the candidate peptide fragments in the database is constructed, each candidate peptide fragment is uniquely identified by a code, and the codes also reflect the mass relation between peptide fragments. High-scoring candidate cross-linked peptide fragment pairs that satisfy the combination conditions can therefore be recalled quickly from the peptide spectrum matching results in O(n) space, without excessive memory overhead.
The method can identify peptide fragments in a complex cross-linked form with high accuracy. When the method is used to search a simulated spectrogram data set containing peptide fragments in a complex cross-linked form, the accuracy exceeds 95 percent.
When a standard peptide fragment database is searched against a spectrogram data set acquired after cross-linking of standard peptide fragments, the method identifies peptide fragments in a simple cross-linked form with an accuracy of 99%; the accuracy is 98% when an E. coli library is added as an interference library, and 91% when a Human library is added as interference. Although the identification of peptide fragments in a simple cross-linked form is on the same level as existing identification methods, the invention provides a technical scheme based on a different concept from the prior art.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 shows the coding scheme used in the method of the present invention.
FIG. 3 is a process of searching by using ion indexing in the method of the present invention.
FIG. 4 is a diagram showing the storage of candidate cross-linked peptide fragments in combination according to the method of the present invention.
FIG. 5 shows the results of the identification of mass spectral data containing D-dimer in example 1.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
As shown in fig. 1, a method for identifying a cross-linked peptide fragment based on mass spectrometry comprises the following steps:
(1) Performing theoretical enzyme digestion on all protein sequences in a protein library, and storing detailed information of the obtained peptide fragments in order of mass; establishing a corresponding relation between the mass of the peptide fragment and the source peptide fragment, and constructing a peptide fragment index;
Specifically, the detailed information of the peptide fragment includes the sequence of the peptide fragment, the cross-linking site and the source protein.
Specifically, when constructing the peptide fragment index: a mass array is constructed in which the row and column of a position represent the mass and the source peptide fragment number respectively, the mass being the integer obtained by multiplying the actual mass by one hundred. Each peptide fragment uses a 64-bit binary code to store its specific position in external memory; the peptide fragments are numbered in order of mass, and the numbers correspond one-to-one with the binary codes, as shown in fig. 2.
(2) Traversing all the peptide fragments, assuming all the crosslinking sites existing on each peptide fragment, calculating the mass of the uncharged fragment ions under each crosslinking site, establishing the corresponding relation between the mass of the fragment ions and the source peptide fragment, and constructing an ion index;
Specifically, when calculating the mass of the uncharged fragment ions: the corresponding non-cross-linked fragment ions b and y are generated according to the cross-linking sites on the peptide fragment, and their uncharged masses are calculated, where the mass of an uncharged fragment ion b is defined as the sum of the corresponding amino acid residue masses, and the mass of an uncharged fragment ion y is defined as the sum of one water molecule and the corresponding amino acid residue masses.
Specifically, when the ion index is constructed: each calculated fragment ion mass is rounded to two decimal places and multiplied by 100; each fragment ion is then placed into a newly constructed mass array according to its mass and the number of its source peptide fragment, and the peptide fragment number is stored at that array position.
(3) Fragment ions of each peak in a mass spectrum of the peptide fragment to be identified are assumed, and are respectively assumed to be non-cross-linked fragment ions and cross-linked fragment ions; when the fragment ions are non-cross-linked, calculating to obtain the mass of the fragment ions without charges; when the fragment ions are cross-linked fragment ions, calculating to obtain the mass of the fragment ions which are complementary with the cross-linked fragment ions and are not charged;
Specifically, when calculating the mass of the uncharged fragment ions: the charge number of the fragment ion is assumed, and the calculation is then performed as mass of the uncharged fragment ion = peak mass-to-charge ratio × charge number − proton mass × charge number; when calculating the mass of the uncharged fragment ion complementary to a cross-linked fragment ion: the calculation is performed as mass of the uncharged fragment ion complementary to the cross-linked fragment ion = uncharged parent ion mass − (peak mass-to-charge ratio × charge number − proton mass × charge number).
(4) Respectively storing the non-cross-linked fragment ions and the cross-linked fragment ions obtained from one spectrogram into a non-cross-linked spectrogram and a cross-linked spectrogram, traversing the fragment ions in the non-cross-linked spectrogram and the cross-linked spectrogram, and searching the ion index for the source peptide fragments corresponding to each calculated fragment ion mass to obtain the matched peptide fragments; specifically, the fragment ions in the two spectrograms are traversed separately, the corresponding peptide fragments are looked up in the ion index by the mass of each fragment ion, and the number of fragment ions in the spectrogram that each peptide fragment matches is counted. To allow for ion mass deviation, peptide fragments within a range of ±2 of the spectrogram fragment ion are also counted during this process. For example, if the spectrogram has a fragment ion at 100000, the counts of the peptide fragments stored at 99998, 99999, 100000, 100001 and 100002 are each increased by 1, and these counts are then used in step (5).
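The counting in step (4) can be illustrated with a minimal sketch. It assumes the ion index is a plain dictionary from scaled integer mass to a list of peptide fragment numbers; the function name `count_matches` and the toy index contents are hypothetical and only serve to show the ±2 window.

```python
from collections import Counter

def count_matches(spectrum_keys, ion_index, tolerance=2):
    """Count, per peptide number, how many index hits fall within +/- tolerance."""
    counts = Counter()
    for key in spectrum_keys:
        # Visit every scaled mass in the window [key - tolerance, key + tolerance].
        for neighbour in range(key - tolerance, key + tolerance + 1):
            for peptide_number in ion_index.get(neighbour, ()):
                counts[peptide_number] += 1
    return counts

# Toy index (made-up entries): scaled mass -> peptide numbers.
ion_index = {99998: [1], 100001: [2], 100003: [3]}
counts = count_matches([100000], ion_index)
```

Here peptide fragments 1 and 2 each receive one count, while peptide fragment 3 at 100003 lies outside the window and receives none.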
(5) Rough scoring: scoring the correlation degree of the matched peptide fragments, respectively calculating scores in a non-cross-linked spectrogram and a cross-linked spectrogram, and summing the scores to serve as the scores of the matched peptide fragments; according to the relation that the total mass of the cross-linked peptide fragments is the sum of the masses of two complementary peptide fragments (assuming that one peptide fragment in the cross-linked peptide fragments is an alpha peptide fragment, and the other peptide fragment is a beta peptide fragment which is complementary with the alpha peptide fragment, the total mass of the cross-linked peptide fragments = the mass of the alpha peptide fragment + the mass of the beta peptide fragment), respectively taking the peptide fragments with higher scores as candidate results of the two complementary peptide fragments, thereby obtaining a candidate cross-linked peptide fragment combination result in a spectrogram;
specifically, according to the relation that the total mass of the cross-linked peptide fragments is the sum of the masses of two complementary peptide fragments, firstly, the matched peptide fragment with a higher fraction is taken as a candidate result of one peptide fragment in the cross-linked peptide fragments of a spectrogram; and then calculating the mass of the other complementary peptide fragment, and taking the peptide fragment with higher fraction as a candidate result of the other peptide fragment in the cross-linked peptide fragments of the spectrogram according to the ranking result of the scores.
Specifically, the correlation scoring may be performed using BM25, as shown in fig. 3.
In step (5), the top 15 results can be taken as candidates for the alpha peptide fragment, and the mass of the beta peptide fragment complementary to each of these 15 results can be calculated. The scores are then sorted, and the 10 top-scoring candidate results for the beta peptide fragment can be obtained directly. This determines the candidate cross-linked peptide fragment combination results for the spectrogram. When the results are stored, only the 64-bit binary codes of these peptide fragments are stored, as shown in fig. 4.
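Step (5) names BM25 as one option for the correlation score but does not spell out how it is applied. The sketch below is one possible reading, not the patented scoring: each peptide fragment is treated as a document whose terms are its scaled fragment ion masses, and the spectrogram's scaled masses form the query. The function name, the parameters `k1` and `b` (common BM25 defaults) and the toy data are all assumptions.

```python
import math

def bm25_scores(query_keys, peptide_fragments, k1=1.2, b=0.75):
    """Okapi BM25 with peptides as documents and scaled ion masses as terms."""
    n_docs = len(peptide_fragments)
    avg_len = sum(len(f) for f in peptide_fragments.values()) / n_docs
    # Document frequency: in how many peptides does each scaled mass occur?
    doc_freq = {}
    for fragments in peptide_fragments.values():
        for key in set(fragments):
            doc_freq[key] = doc_freq.get(key, 0) + 1
    scores = {}
    for number, fragments in peptide_fragments.items():
        score = 0.0
        for key in set(query_keys):
            tf = fragments.count(key)
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[key] + 0.5) / (doc_freq[key] + 0.5) + 1.0)
            norm = tf + k1 * (1 - b + b * len(fragments) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[number] = score
    return scores

# Toy data (made-up scaled masses): peptide number -> fragment ion keys.
peptide_fragments = {
    1: [5000, 10000, 15000],
    2: [5000, 20000, 30000],
    3: [40000, 50000, 60000],
}
scores = bm25_scores([5000, 10000, 15000], peptide_fragments)
```

With this toy input, peptide fragment 1 matches all three query masses and scores highest, peptide fragment 2 matches one shared mass, and peptide fragment 3 scores zero.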
(6) Repeating the steps (4) to (5) to obtain candidate cross-linked peptide fragment combination results in all spectrograms, merging all the results, and establishing indexes from the peptide fragments to the spectrograms; that is, under the condition that one peptide fragment can correspond to the candidate results of a plurality of spectrograms, a list is used for recording the result distribution, the list correspondingly represents the related peptide fragments according to the number sequence in the step (1), and the position in the list stores the spectrogram corresponding to the peptide fragment.
(7) Reading the detailed information of the peptide fragments in the step (1) and calculating the matching score of fragment ions of the cross-linked peptide fragment combination and the spectrogram to obtain all candidate cross-linked peptide fragment combination scores of each spectrogram; taking the combination of the cross-linked peptide segments with the highest fraction in each spectrogram as the result of the spectrogram; and sequencing the results of each spectrogram from high to low, and performing fine scoring and re-scoring to obtain the final result of the peptide fragment to be identified.
Specifically, when fine scoring and re-scoring are performed: the FDR at the PSM level is calculated from the distribution of positive (Target) and negative (Decoy) results in the sorted list; Target-Target results with FDR < 5% are taken as positive examples, all Target-Decoy, Decoy-Target and Decoy-Decoy results are taken as negative examples, and an SVM is trained on them to obtain a model; the model is then used to re-score the result of each spectrogram, the results are sorted from high to low by the new score, and the PSM-level FDR is calculated again to report the final result.
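The selection of training examples described above can be sketched as follows. The text does not give the FDR formula, so this sketch assumes a simple running estimate (number of results containing a decoy divided by number of Target-Target results at each rank); the function names and the toy PSM list are hypothetical, and the SVM training itself is left out.

```python
def psm_fdr(psms):
    """Attach a running FDR estimate to PSMs sorted by descending score.

    psms: list of (score, label), label in {"TT", "TD", "DT", "DD"}.
    Assumption: FDR at a rank = decoy-containing hits / Target-Target hits so far.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    rows = []
    for score, label in ranked:
        if label == "TT":
            targets += 1
        else:
            decoys += 1
        rows.append((score, label, decoys / max(targets, 1)))
    return rows

def training_sets(psms, threshold=0.05):
    """Split PSMs into positive and negative examples for re-scoring."""
    rows = psm_fdr(psms)
    positives = [r for r in rows if r[1] == "TT" and r[2] < threshold]
    negatives = [r for r in rows if r[1] != "TT"]
    return positives, negatives

# Toy PSM list with made-up scores.
psms = [(10, "TT"), (9, "TT"), (8, "TD"), (7, "TT"), (6, "DD")]
positives, negatives = training_sets(psms)
```

The two top-scoring Target-Target results become positive examples; the Target-Target result ranked below the first decoy exceeds the 5% threshold and is used in neither set.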
Because a peptide fragment is identified by matching the fragment ions predicted from its sequence against the peaks in its mass spectrogram, the sequences in the protein database must first undergo theoretical enzyme digestion, and the resulting peptide fragment sequences are used for peptide spectrum matching. Take the commonly used specific protease trypsin as an example: it cleaves the peptide bond on the C-terminal side of lysine (K) and arginine (R), which are called the cleavage sites of trypsin. For a protein with the sequence MKKTTMKIIPFNRLTIGEGQQHHLGGAKQAGDV, theoretical cleavage occurs immediately to the right of each letter K and R, so under normal conditions the sequence yields six peptide fragment sequences: MK, K, TTMK, IIPFNR, LTIGEGQQHHLGGAK and QAGDV. In practice some cleavage sites are missed. With one missed cleavage allowed, theoretical digestion of the protein sequence yields five further peptide fragments in addition to those above: MKK, KTTMK, TTMKIIPFNR, IIPFNRLTIGEGQQHHLGGAK and LTIGEGQQHHLGGAKQAGDV. The mass of each peptide fragment and the masses of its theoretical fragment ions can then be calculated. Modifications on a peptide fragment, which add or subtract a fixed mass at the corresponding amino acid letter, further increase the number of peptide fragment sequences actually used for matching.
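The digestion example above can be reproduced with a short sketch. The function name `digest` and its parameters are illustrative; the sketch cleaves after every K and R without further exceptions, which is all the example in the text requires.

```python
def digest(protein, sites="KR", missed_cleavages=0):
    """Theoretical digestion: cleave after each residue in `sites`."""
    # First cut the protein into fully cleaved pieces.
    pieces, start = [], 0
    for i, residue in enumerate(protein):
        if residue in sites:
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    # Then join up to (missed_cleavages + 1) adjacent pieces.
    peptides = []
    for n_joined in range(1, missed_cleavages + 2):
        for j in range(len(pieces) - n_joined + 1):
            peptides.append("".join(pieces[j:j + n_joined]))
    return peptides

protein = "MKKTTMKIIPFNRLTIGEGQQHHLGGAKQAGDV"
fully_cleaved = digest(protein)
with_one_missed = digest(protein, missed_cleavages=1)
```

`fully_cleaved` holds the six peptide fragments listed in the text, and `with_one_missed` holds those six followed by the five peptide fragments with one missed cleavage.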
In addition, a sequence can form a cross-linked peptide fragment only if it contains specific amino acids, namely cross-linking sites; isopeptide bond cross-links, for example, occur on lysine (K) and glutamine (Q). An amino acid that has been cross-linked generally can no longer serve as a cleavage site, so of the above-mentioned peptide fragments only LTIGEGQQHHLGGAK, QAGDV, MKK, KTTMK, TTMKIIPFNR, IIPFNRLTIGEGQQHHLGGAK and LTIGEGQQHHLGGAKQAGDV are capable of forming isopeptide bond cross-links.
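The filtering rule implied by this example can be sketched as below. It assumes that any Q is a possible site and that a K is a possible site unless it is the C-terminal residue (where trypsin cut, so it was not cross-linked); the function name `crosslink_sites` is hypothetical, and a protein C-terminal K is not treated specially.

```python
def crosslink_sites(peptide):
    """0-based positions of K or Q that could carry an isopeptide cross-link."""
    last = len(peptide) - 1
    sites = []
    for i, residue in enumerate(peptide):
        if residue == "Q":
            sites.append(i)
        elif residue == "K" and i != last:
            # Assumption: a C-terminal K was cleaved, hence not cross-linked.
            sites.append(i)
    return sites

peptides = [
    "MK", "K", "TTMK", "IIPFNR", "LTIGEGQQHHLGGAK", "QAGDV",
    "MKK", "KTTMK", "TTMKIIPFNR", "IIPFNRLTIGEGQQHHLGGAK",
    "LTIGEGQQHHLGGAKQAGDV",
]
linkable = [p for p in peptides if crosslink_sites(p)]
```

Applied to the eleven peptide fragments of the digestion example, this keeps exactly the seven peptide fragments named in the text.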
Studies of protein cross-linking are mainly concerned with cross-linked peptide fragments composed of two peptide fragments, which means that peptide fragments have to be combined. To make this combination fast, the peptide fragments are sorted by mass before being stored. In practical biological experiments, theoretical cleavage of the searched protein database produces a very large number of peptide fragment sequences, and different proteins can yield identical sequences after theoretical cleavage, so identical sequences must be deduplicated. Reading all sequences into memory for processing would require a large amount of space. The invention therefore first divides the peptide fragment mass range into a number of small blocks and creates, in external memory, a peptide fragment file for each mass range; it then calculates the mass of each peptide fragment obtained by theoretical enzyme digestion and writes the peptide fragment to the corresponding file in external memory. After all candidate peptide fragments have been distributed, the peptide fragment sequences in each external file are sorted by mass, which sorts all peptide fragment sequences by mass overall. Because each organized peptide fragment sequence has a fixed and unique position in external memory, the peptide fragment can be coded by its position, so that a single number uniquely represents a peptide fragment during processing in main memory; this occupies far less memory than a character string representation.
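The block-wise sorting and numbering can be sketched in a few lines. This is a simplified illustration, not the patented storage layout: an in-memory dictionary stands in for the per-range files in external memory, the block width of 500 is an arbitrary assumption, and the peptide names and masses in the usage lines are placeholders rather than real values.

```python
from collections import defaultdict

def build_peptide_index(peptides_with_mass, block_width=500.0):
    """Number distinct peptides in ascending mass order, block by block."""
    blocks = defaultdict(set)  # stand-in for one file per mass range
    for sequence, mass in peptides_with_mass:
        # Masses are scaled by 100 and rounded; the set removes duplicates.
        blocks[int(mass // block_width)].add((round(mass * 100), sequence))
    ordered = []
    for block_id in sorted(blocks):
        # Sorting inside each block and visiting blocks in order
        # gives a global ordering by mass.
        ordered.extend(sorted(blocks[block_id]))
    return {number: entry for number, entry in enumerate(ordered)}

# Placeholder entries: (sequence label, made-up mass).
entries = [("PEPA", 1200.50), ("PEPB", 300.25), ("PEPA", 1200.50), ("PEPC", 700.75)]
peptide_index = build_peptide_index(entries)
```

Each number in `peptide_index` then plays the role of the code that represents a peptide fragment in main memory, and larger numbers always correspond to equal or larger masses.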
Theoretical fragment ions are generated for all peptide fragment sequences capable of forming cross-linked peptide fragments, and an ion index is established from them for rapid peptide spectrum matching. Although a cross-linked peptide fragment is composed of two peptide fragments, from the perspective of one of them it can be regarded as a single peptide fragment carrying a modification of very large mass. Different cross-linking sites give different fragment ions for the same peptide fragment. Take the peptide fragments LTIGEGQQHHLGGAKQAGDV and LTIGEGQQHHLGGAKQAGDV, where underlines indicate the cross-linking sites: although the two sequences are the same, their fragment ions are not identical, and the difference lies mainly in the masses of the fragment ions between the two cross-linking sites. In addition, for a peptide fragment in a complex cross-linked form, the fragment ions between two cross-linking sites all have the same mass and bring no help to peptide spectrum matching, so these fragment ions must not be generated, to avoid errors. Furthermore, generating a fragment ion that spans a cross-linking site requires the mass of the other peptide fragment, which is constrained by the parent ion mass of the spectrogram and cannot be known in advance, so such fragment ions cannot be generated either. The ion index therefore contains only fragment ions that do not span a cross-link.
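A minimal sketch of generating only the fragment ions that do not span a cross-link follows, using the definitions given earlier (uncharged b = sum of residue masses, uncharged y = water + sum of residue masses). The residue values are standard monoisotopic masses; the function name and the site positions in the usage lines are hypothetical, since the original underlining of the sites is not available here.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565

def noncrosslinked_fragments(peptide, sites):
    """Uncharged b and y masses of fragments containing none of `sites`."""
    first, last = min(sites), max(sites)
    b_ions, running = [], 0.0
    for residue in peptide[:first]:  # prefixes ending before the first site
        running += RESIDUE_MASS[residue]
        b_ions.append(running)
    y_ions, running = [], WATER
    for residue in reversed(peptide[last + 1:]):  # suffixes after the last site
        running += RESIDUE_MASS[residue]
        y_ions.append(running)
    return b_ions, y_ions

peptide = "LTIGEGQQHHLGGAKQAGDV"
b_a, y_a = noncrosslinked_fragments(peptide, [6, 15])  # hypothetical sites
b_b, y_b = noncrosslinked_fragments(peptide, [7, 14])  # hypothetical sites
```

The same sequence gives different ion lists for the two site choices, and no ion between the first and last site is produced in either case.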
The ion index is a dictionary data structure with mass as the key and peptides as the value. Each peptide is assigned to the ion index according to the masses of its fragment ions; for example, a peptide whose fragment ions have masses of 50 Da and 100 Da is added to the values stored under the keys 50 and 100. In the index, the peptide is represented by the code assigned to it earlier.
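A minimal sketch of such an index is shown below. The key scaling (mass multiplied by one hundred and rounded) follows the claims; the function names and the simple hit counting are illustrative assumptions and are not the scoring function of the invention.

```python
from collections import Counter, defaultdict


def mass_key(mass):
    """Integer key: mass scaled by one hundred and rounded."""
    return round(mass * 100)


def build_ion_index(fragments_by_code):
    """Map each fragment-ion mass key to the codes of the peptides producing it.

    `fragments_by_code` maps a numeric peptide code to its fragment masses.
    """
    index = defaultdict(list)
    for code, masses in fragments_by_code.items():
        for mass in masses:
            index[mass_key(mass)].append(code)
    return index


def count_matches(index, query_masses):
    """Count, per peptide code, how many query masses hit the index."""
    hits = Counter()
    for mass in query_masses:
        for code in index.get(mass_key(mass), ()):
            hits[code] += 1
    return hits


# Peptide 7 has fragment ions at 50 and 100 Da; peptide 9 only at 100 Da.
index = build_ion_index({7: [50.0, 100.0], 9: [100.0]})
hits = count_matches(index, [50.0, 100.0, 75.0])
```

Note that this lookup is an exact match on the rounded key; a real search would probably also probe neighbouring keys to allow for a mass tolerance.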
Because the ion index contains only non-cross-linked fragment ions, cross-linked fragment ions cannot be matched against it directly. Cross-linked fragment ion peaks are nevertheless present in the spectrum. So that they can also be matched, the peaks in the spectrum are converted using the correspondence between cross-linked and non-cross-linked fragment ions: each cross-linked fragment ion peak is converted into the corresponding non-cross-linked fragment ion peak. In this way the cross-linked fragment ion information is used in the ion-index matching process, which makes accurate identification possible.
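The peak conversion can be illustrated with the formulas stated in the claims, m1 = A × B - C × B and m2 = D - m1, where A is the peak mass-to-charge ratio, B the assumed charge number, C the proton mass and D the uncharged parent ion mass. The sketch below is an assumption-laden illustration: the function names and the `max_charge` parameter are hypothetical, and every peak is simply tried at every charge state.

```python
PROTON = 1.007276  # proton mass in Da


def neutral_mass(mz, charge):
    """m1 = A * B - C * B: uncharged mass of a peak at an assumed charge."""
    return mz * charge - PROTON * charge


def complementary_mass(mz, charge, parent_neutral_mass):
    """m2 = D - m1: uncharged mass of the complementary, non-cross-linked ion."""
    return parent_neutral_mass - neutral_mass(mz, charge)


def split_spectrum(peaks_mz, parent_neutral_mass, max_charge=2):
    """Build the 'non-cross-linked' and 'cross-linked' views of one spectrum.

    Each peak is assumed in turn to be a non-cross-linked ion (kept as m1)
    and a cross-linked ion (converted to its complement m2).
    """
    non_cross_linked, cross_linked = [], []
    for mz in peaks_mz:
        for charge in range(1, max_charge + 1):
            m1 = neutral_mass(mz, charge)
            non_cross_linked.append(m1)
            cross_linked.append(parent_neutral_mass - m1)
    return non_cross_linked, cross_linked
```

Both returned lists contain masses of the kind stored in the ion index, so the same lookup can be used for each of them.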
Peptide-spectrum matching with the ion index yields what is in effect a scored list of single peptides, which can be sorted by score. For example, each of the Top-5 results is taken as one peptide of the cross-linked pair, and the mass of the other peptide is then derived from the parent ion mass of the spectrum. Among the peptides within that mass range, the Top-5 results (for example) are again taken from the single-peptide score list as candidates. This gives 5 × 5 = 25 candidate cross-linked peptides for subsequent fine scoring. This step must also check whether the two peptides can actually combine to form a cross-linked peptide. For example, an isopeptide bond requires one peptide to be linked at K and the other at Q, so the peptides MKK and KTMK cannot form a candidate cross-linked peptide. The same applies to complex cross-linked peptides.
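The candidate pairing step might look roughly like the sketch below. The `Candidate` record, the mass tolerance `tol`, and the `link_delta` parameter (any mass change caused by the link itself, left at zero here so that the pair mass is the plain sum stated in the claims) are assumptions introduced for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str     # peptide sequence or code
    mass: float   # uncharged peptide mass in Da
    score: float  # single-peptide score from ion-index matching
    site: str     # residue at the assumed cross-linking site, "K" or "Q"


def can_link(a, b):
    """An isopeptide bond needs K on one peptide and Q on the other."""
    return {a.site, b.site} == {"K", "Q"}


def pair_candidates(scored, parent_mass, top_n=5, tol=0.02, link_delta=0.0):
    """Combine top-scoring peptides into candidate cross-linked pairs."""
    ranked = sorted(scored, key=lambda c: c.score, reverse=True)
    pairs, seen = [], set()
    for alpha in ranked[:top_n]:
        # Mass the partner must have, given the parent ion mass.
        target = parent_mass - link_delta - alpha.mass
        in_range = [c for c in ranked if abs(c.mass - target) <= tol]
        for beta in in_range[:top_n]:
            key = frozenset((alpha, beta))  # (a, b) and (b, a) are one pair
            if can_link(alpha, beta) and key not in seen:
                seen.add(key)
                pairs.append((alpha, beta))
    return pairs
```

With `top_n=5` this yields at most 25 pairs per spectrum before the compatibility filter, matching the 5 × 5 count in the text; two peptides that offer only K sites, such as MKK and KTMK, are rejected by `can_link`.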
Example 1
This example uses a protein database to identify acquired spectra of complex cross-linked peptides from D-dimer. When the method of the invention was tested on mass spectrometry data containing D-dimer, it accurately reported the spectrum of the peptide in complex cross-linked form in D-dimer, LTIGEGQQHHLGGAKQAGDV-LTIGEGQQHHLGGAKQAGDV, as shown in FIG. 5; the underlines mark the cross-linking sites, and the isopeptide bonds are cross-links between Q and K.
In summary, the invention includes but is not limited to the above embodiments; any equivalent replacement or partial modification made within the spirit and principles of the invention shall be considered to fall within its scope of protection.

Claims (9)

1. A mass spectrometry-based method for identifying cross-linked peptides, characterized in that the method comprises the following steps:
(1) performing theoretical enzymatic digestion on all protein sequences in a protein database, and storing detailed information on the resulting peptides in order of mass; establishing a correspondence between peptide mass and source peptide, and constructing a peptide index;
(2) traversing all peptides, assuming every possible cross-linking site on each peptide, calculating the masses of the uncharged fragment ions for each cross-linking site, establishing a correspondence between fragment ion mass and source peptide, and constructing an ion index;
(3) for each peak in the mass spectrum of the peptide to be identified, assuming the fragment ion to be, in turn, a non-cross-linked fragment ion and a cross-linked fragment ion; when it is assumed to be a non-cross-linked fragment ion, calculating its uncharged mass; when it is assumed to be a cross-linked fragment ion, calculating the uncharged mass of the fragment ion complementary to it;
(4) storing the non-cross-linked fragment ions and the cross-linked fragment ions obtained from one spectrum in a non-cross-linked spectrum and a cross-linked spectrum respectively, traversing the fragment ions in both spectra, and looking up in the ion index the source peptides corresponding to the calculated fragment ion masses, to obtain matched peptides;
(5) scoring the relevance of each matched peptide: calculating a score in the non-cross-linked spectrum and in the cross-linked spectrum separately and summing the two as the score of the matched peptide; based on the relation that the total mass of a cross-linked peptide is the sum of the masses of its two complementary peptides, taking higher-scoring peptides as candidates for each of the two complementary peptides, thereby obtaining the candidate cross-linked peptide combinations for the spectrum;
(6) repeating steps (4) to (5) to obtain the candidate cross-linked peptide combinations for all spectra, merging all results, and building an index from peptides to spectra;
(7) reading the detailed peptide information from step (1) and calculating the match score between the fragment ions of each cross-linked peptide combination and the spectrum, to obtain the scores of all candidate cross-linked peptide combinations for each spectrum; taking the highest-scoring cross-linked peptide combination of each spectrum as the result for that spectrum; sorting the results of the spectra from high to low and performing fine scoring and re-scoring to obtain the final result for the peptide to be identified;
wherein in step (5), based on the relation that the total mass of a cross-linked peptide is the sum of the masses of its two complementary peptides, the 15 highest-scoring matched peptides are first taken as candidates for one peptide of the cross-linked peptide of the spectrum; the mass of the other, complementary peptide is then calculated, and, according to the score ranking, the 15 highest-scoring peptides are taken as candidates for the other peptide of the cross-linked peptide of the spectrum.
2. The mass spectrometry-based method for identifying cross-linked peptides according to claim 1, wherein: in step (1), the detailed information on a peptide comprises the peptide sequence, the cross-linking sites and the source protein.
3. The mass spectrometry-based method for identifying cross-linked peptides according to claim 1, wherein: in step (1), when the peptide index is constructed, the peptide sequences are numbered and a mass array is constructed in external storage, the array comprising the peptide mass range and the number of the source peptide.
4. The mass spectrometry-based method for identifying cross-linked peptides according to claim 1, wherein: in step (2), when the ion index is constructed, a mass array is constructed in memory, the array comprising the fragment ion mass and the number of the source peptide.
5. The mass spectrometry-based method for identifying cross-linked peptides according to claim 3 or 4, wherein: the numbering uses 64-bit binary coding.
6. The mass spectrometry-based method for identifying cross-linked peptides according to claim 1, wherein: in step (2), when the masses of the uncharged fragment ions for each cross-linking site are calculated, the corresponding non-cross-linked b and y fragment ions are generated according to the cross-linking sites on the peptide and their uncharged masses are calculated, the mass of an uncharged b fragment ion being defined as the sum of the corresponding amino acid residues and the mass of an uncharged y fragment ion being defined as the sum of one water molecule and the corresponding amino acid residues.
7. The mass spectrometry-based method for identifying cross-linked peptides according to claim 1, wherein: in step (3), when the uncharged mass of a fragment ion is calculated, a charge number is assumed for the fragment ion and the uncharged mass is calculated as m1 = A × B - C × B; when the uncharged mass of the fragment ion complementary to a cross-linked fragment ion is calculated, it is calculated as m2 = D - m1; wherein A represents the peak mass-to-charge ratio, B represents the charge number, C represents the proton mass, and D represents the uncharged parent ion mass.
8. The mass spectrometry-based method for identifying cross-linked peptides according to claim 3, wherein: in step (1), the peptide mass range is the actual peptide mass multiplied by one hundred and then rounded to an integer; in step (2), the fragment ion mass is the actual fragment ion mass retained to two decimal places and then multiplied by one hundred; and the mass calculated in step (3) is multiplied by one hundred and then rounded to an integer.
9. The mass spectrometry-based method for identifying cross-linked peptides according to claim 1, wherein: when fine scoring and re-scoring are performed in step (7), the FDR at the PSM level is calculated; an SVM is trained according to the obtained FDR to obtain a model; the result of each spectrum is re-scored using the model, the results are sorted from high to low by score, and the FDR at the PSM level is calculated to obtain the final result.
CN202111117873.7A 2021-09-24 2021-09-24 Complex cross-linked peptide identification method based on mass spectrum Active CN113571129B (en)


Publications (2)

Publication Number Publication Date
CN113571129A CN113571129A (en) 2021-10-29
CN113571129B true CN113571129B (en) 2022-02-11

Family

ID=78174198


Country Status (1)

Country Link
CN (1) CN113571129B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105823883A (en) * 2015-11-19 2016-08-03 云南民族大学 Tandem mass spectrometry identification method for protein based on Poisson distribution model
CN106033501A (en) * 2015-03-16 2016-10-19 中国科学院计算技术研究所 Crosslinking dipeptide rapid identification method
CN106529204A (en) * 2016-10-18 2017-03-22 中国科学院计算技术研究所 Semi-supervised learning-based multi-cross-linked-mass-spectrum sorting method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102495127B (en) * 2011-11-11 2013-09-04 暨南大学 Protein secondary mass spectrometric identification method based on probability statistic model
CN106018535B (en) * 2016-05-11 2018-11-09 中国科学院计算技术研究所 A kind of method and system of intact glycopeptide identification
GB201621927D0 (en) * 2016-12-22 2017-02-08 Micromass Ltd Mass spectrometric analysis of lipids


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Identification of cross-linked peptides from complex samples;Bing Yang 等;《Nature Methods》;20120930;第9卷(第9期);第904-909页 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant