CN106770605B - De novo sequencing method and device - Google Patents


Info

Publication number
CN106770605B
Authority
CN
China
Prior art keywords
path
node
quality
peptide fragment
modification
Prior art date
Legal status
Active
Application number
CN201611019740.5A
Other languages
Chinese (zh)
Other versions
CN106770605A (en)
Inventor
杨皓
迟浩
周文婧
何昆
曾文锋
刘超
孙瑞祥
贺思敏
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS
Priority to CN201611019740.5A
Publication of CN106770605A
Application granted
Publication of CN106770605B
Legal status: Active

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N27/00 Investigating or analysing materials by the use of electric, electrochemical, or magnetic means
    • G01N27/62 Investigating or analysing materials by the use of electric, electrochemical, or magnetic means by investigating the ionisation of gases, e.g. aerosols; by investigating electric discharges, e.g. emission of cathode

Landscapes

  • Chemical & Material Sciences (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Electrochemistry (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Immunology (AREA)
  • Pathology (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The present invention provides a de novo sequencing method, which includes: converting a spectrum to be resolved into a mass spectrum connection graph; computing the score of each path in the mass spectrum connection graph; extracting the several highest-scoring ordinary paths and modification paths as candidate peptides, wherein an ordinary path is a path composed only of ordinary edges, and a modification path is a path composed of ordinary edges and modification edges that contains exactly one modification edge; and performing peptide-spectrum match scoring on each candidate peptide and taking the candidate peptide with the highest peptide-spectrum match score as the peptide corresponding to the spectrum. The method supports the discovery of thousands of kinds of unexpected modifications without greatly affecting the speed of peptide identification. In addition, it can distinguish similar peptide sequences at a finer granularity, improving the accuracy of peptide identification.

Description

De novo sequencing method and device
Technical field
The invention belongs to the field of bioinformatics, and in particular relates to the identification of peptides by de novo sequencing.
Background art
Proteomics has developed rapidly over the past decade or more. When researchers use mass spectrometry to analyze biological samples, peptide and protein identification is a critical step. At present, peptide identification methods based on tandem mass spectrometry data fall into two main classes: database search methods and de novo sequencing methods. For each spectrum, a database search method retrieves all candidate peptides from a database and scores their match against the spectrum. However, if the correct peptide is not in the database, for example for a new species or a species whose genome has not been sequenced, a de novo sequencing method is generally used. A de novo sequencing method does not depend on a database and derives the peptide sequence directly from the spectrum.
Compared with database search methods, however, de novo sequencing methods have a large number of candidate peptides and a slow identification speed. In general, the number of candidate peptides in a de novo sequencing method is 15 orders of magnitude larger than that in a database search method. Moreover, as the precision of mass spectrometers becomes higher and higher, mutations and unexpected modifications have become key reasons why spectra remain unresolved. Thousands of kinds of unexpected modifications are currently known. If a de novo sequencing algorithm is to support the discovery of these thousands of kinds of unexpected modifications, the number of candidate peptides to be processed increases by at least a further two orders of magnitude, which again greatly reduces the speed of peptide identification. Furthermore, once thousands of kinds of unexpected modifications are considered, candidate peptides have very similar sequences or even identical elemental compositions, so similar peptide sequences are difficult to distinguish.
Summary of the invention
Therefore, the object of the present invention is to overcome the above defects of the prior art and to provide a new de novo sequencing method.
The object of the present invention is achieved through the following technical solutions:
In one aspect, the present invention provides a de novo sequencing method, comprising:
Converting a spectrum to be resolved into a mass spectrum connection graph, wherein each peak in the spectrum is converted into a node of the mass spectrum connection graph; if the mass difference between two nodes in the graph is the mass of an amino acid or of a common modification, the two nodes are connected by an ordinary edge, the score of which is determined based on the intensity of the peak corresponding to the heavier node; and if the mass difference between two nodes is the mass of an unexpected modification, the two nodes are connected by a modification edge, the score of which is determined based on the intensity of the peak corresponding to the heavier node;
Computing the score of each path in the mass spectrum connection graph, and extracting the several highest-scoring ordinary paths and modification paths as candidate peptides, wherein an ordinary path is a path composed only of ordinary edges, and a modification path is a path composed of ordinary edges and modification edges that contains exactly one modification edge;
Performing peptide-spectrum match scoring on each candidate peptide, and taking the candidate peptide with the highest peptide-spectrum match score as the peptide corresponding to the spectrum.
In another aspect, the present invention provides a de novo sequencing method, comprising:
Converting a spectrum to be resolved into a mass spectrum connection graph, wherein each peak in the spectrum is converted into a node of the mass spectrum connection graph; if the mass difference between two nodes in the graph is the mass of an amino acid or of a common modification, the two nodes are connected by an ordinary edge, the score of which is determined based on the intensity of the peak corresponding to the heavier node; and if the mass difference between two nodes is the mass of an unexpected modification, the two nodes are connected by a modification edge, the score of which is determined based on the intensity of the peak corresponding to the heavier node;
Computing the score of each path in the mass spectrum connection graph, extracting the several highest-scoring ordinary paths and modification paths as candidate peptides, and recording the path rank of each candidate peptide, wherein an ordinary path is a path composed only of ordinary edges, and a modification path is a path composed of ordinary edges and modification edges that contains exactly one modification edge;
Performing peptide-spectrum match scoring on each candidate peptide;
Providing the peptide-spectrum match score, the path rank, and the modification abundance of each candidate peptide as features to a pre-trained ranking classifier to score the candidate peptide, and taking the candidate peptide with the highest score as the peptide corresponding to the spectrum.
In the above method, the score of an ordinary edge may be the natural logarithm of the intensity of the peak corresponding to the heavier of the two nodes, and the score of a modification edge may likewise be the natural logarithm of the intensity of the peak corresponding to the heavier of the two nodes.
In the above method, the score of a modification edge between two nodes may be the intensity of the peak corresponding to the heavier node multiplied by the abundance of the unexpected modification corresponding to the modification edge, wherein the abundance of the unexpected modification is the probability or frequency with which the unexpected modification corresponding to the mass difference between the two nodes connected by the modification edge is likely to occur.
In the above method, the abundance of the unexpected modification corresponding to a modification edge may be equal to the number of times the unexpected-modification mass separating the two nodes connected by that edge occurs between all nodes of the mass spectrum connection graph, divided by the total number of modification edges in the graph.
In the above method, the abundance of an unexpected modification may be preset.
In the above method, the abundance of each unexpected modification may also be set as follows:
Randomly sampling a plurality of existing spectra and counting the number of times each unexpected modification occurs in them;
Taking, as the abundance of each unexpected modification, the ratio of the number of times that modification occurs in the plurality of spectra to the total number of occurrences of all unexpected modifications in those spectra.
In the above method, each peak in the spectrum may be converted into two nodes in the mass spectrum connection graph, one corresponding to a b ion and the other to a y ion.
In the above method, for each peak, the mass of the node corresponding to the b ion is the peak mass minus 1, and the mass of the node corresponding to the y ion is the mass of the precursor ion of the spectrum minus the peak mass and the mass of one water molecule.
In the above method, for each candidate peptide, if the candidate peptide comes from an ordinary path, its modification abundance is 1; if the candidate peptide comes from a modification path, its modification abundance is the abundance of the unexpected modification corresponding to the modification edge in that path.
In the above method, the ranking classifier may be trained on a sample set consisting of a group of spectra whose corresponding peptides and modifications are known, using as features the peptide-spectrum match score, path rank, and modification abundance of the peptide extracted from each sample.
In yet another aspect, the present invention provides a de novo sequencing device, comprising:
A conversion unit, for converting a spectrum to be resolved into a mass spectrum connection graph, wherein each peak in the spectrum is converted into a node of the mass spectrum connection graph; if the mass difference between two nodes in the graph is the mass of an amino acid or of a common modification, the two nodes are connected by an ordinary edge, the score of which is determined based on the intensity of the peak corresponding to the heavier node; and if the mass difference between two nodes is the mass of an unexpected modification, the two nodes are connected by a modification edge, the score of which is determined based on the intensity of the peak corresponding to the heavier node;
A path extraction unit, for computing the score of each path in the mass spectrum connection graph and extracting the several highest-scoring ordinary paths and modification paths as candidate peptides, wherein an ordinary path is a path composed only of ordinary edges, and a modification path is a path composed of ordinary edges and modification edges that contains exactly one modification edge;
A match scoring unit, for performing peptide-spectrum match scoring on each candidate peptide and taking the candidate peptide with the highest peptide-spectrum match score as the peptide corresponding to the spectrum.
Compared with the prior art, the advantages of the present invention are as follows:
It supports the discovery of thousands of kinds of unexpected modifications without greatly affecting the speed of peptide identification. In addition, it can distinguish similar peptide sequences at a finer granularity, improving the accuracy of peptide identification.
Brief description of the drawings
Embodiments of the present invention are further described below with reference to the drawings, in which:
Fig. 1 is a flow diagram of a de novo sequencing method according to one embodiment of the invention;
Fig. 2 is a flow diagram of a de novo sequencing method according to another embodiment of the invention;
Figs. 3a), 3b) and 3c) are schematic diagrams of an example spectrum and mass spectrum connection graph according to one embodiment of the invention;
Fig. 4 is a schematic diagram of the path scoring results for candidate peptides according to one embodiment of the invention;
Fig. 5 is a schematic diagram of candidate peptide scoring according to one embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below through specific embodiments with reference to the drawings. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
Fig. 1 shows a flow diagram of a de novo sequencing method according to one embodiment of the invention. The method mainly includes: converting a spectrum into a mass spectrum connection graph (step 101); computing the score of each path in the graph and, after sorting by path score, taking the top several ordinary paths and modification paths as candidate peptides (step 102); and performing peptide-spectrum match scoring on the candidate peptides (step 103). Preferably, before these steps are executed, the received spectrum may be preprocessed to remove as much noise and as many impurities as possible. For example, the spectrum may be deconvoluted, and the precursor ion peak and the water-loss and ammonia-loss peaks of the precursor ion may be removed.
More specifically, in one embodiment, preprocessing of the spectrum proceeds as follows. First, isotopic peak clusters are determined from the pairwise mass differences between peaks: if the pairwise mass differences between peaks in a set of peaks are about 1/n Da (Da, the dalton, is a unit for measuring atomic or molecular mass; n = 1, 2, ..., c, where c is the charge of the precursor ion of the spectrum), the set is regarded as an isotopic peak cluster, and all peaks in an isotopic peak cluster carry the same charge. Next, the charge of the cluster is determined from the pairwise mass difference of its peaks: for example, if the mass difference is about 1/n Da, all peaks in the cluster carry charge +n. Then, according to the charge of the cluster, the monoisotopic peak (the peak with the smallest mass in the cluster) is converted to its singly charged mass, the conversion being the mass of the original peak multiplied by the charge, and all other peaks in the cluster are removed. A spectrum typically provides the precursor ion mass, charge, and retention time, together with the mass, mass-to-charge ratio, and intensity of every peak.
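The isotopic-cluster handling described above can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name `deisotope`, the tolerance value, the greedy left-to-right clustering, and the isotope spacing of roughly 1/n Da for charge n are all assumptions. The conversion to a singly charged mass follows the text literally (peak mass multiplied by charge).

```python
def deisotope(peaks, max_charge, tol=0.01):
    """Collapse isotopic clusters to their monoisotopic peak.

    peaks: list of (mass, intensity) sorted by mass (hypothetical input format).
    max_charge: precursor charge c; cluster charges 1..c are tried.
    Returns a list of (singly_charged_mass, intensity).
    """
    used = [False] * len(peaks)
    result = []
    for i, (mass, intensity) in enumerate(peaks):
        if used[i]:
            continue
        used[i] = True
        charge = 1  # default when no isotopic partner is found
        # Try the highest charge first so that a 0.5 Da ladder is not
        # mistaken for a 1 Da ladder.
        for n in range(max_charge, 0, -1):
            spacing = 1.0 / n  # assumed isotope spacing for charge n
            last = mass
            members = []
            for j in range(i + 1, len(peaks)):
                if used[j]:
                    continue
                diff = peaks[j][0] - last
                if abs(diff - spacing) <= tol:
                    members.append(j)
                    last = peaks[j][0]
                elif diff > spacing + tol:
                    break
            if members:
                charge = n
                for j in members:
                    used[j] = True  # drop the non-monoisotopic peaks
                break
        # Conversion as stated in the text: original mass times charge.
        result.append((mass * charge, intensity))
    return result
```

Trying the highest charge first is a design choice of this sketch; a production tool would also check that intensities within a cluster follow a plausible isotope pattern.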
Referring now to Fig. 1, in step 101 each peak in the spectrum is converted into a node of the mass spectrum connection graph, and the nodes are connected to one another to form the graph. If the mass difference between two peaks is the mass of an amino acid or of a common modification, the edge connecting the two corresponding nodes in the graph is an ordinary edge, and the score of the ordinary edge is the intensity of the heavier peak. In bioinformatics, each amino acid, common modification, and unexpected modification has a specific mass; the names and masses of all modifications can be found in Unimod (http://www.unimod.org/), which provides accurate masses for each amino acid and modification. If the mass difference between two peaks is the mass of an unexpected modification, the edge connecting the two corresponding nodes in the graph is a modification edge. In one embodiment, the score of the modification edge is the intensity of the heavier peak. In another embodiment, the score of the modification edge may be the intensity of the heavier peak multiplied by the abundance of the unexpected modification corresponding to the edge. The abundance of the unexpected modification corresponding to a modification edge is the probability or frequency with which the unexpected modification corresponding to the mass difference between the two nodes connected by the edge is likely to occur. The abundance may be preset, or it may be set dynamically for the spectrum at hand by computing the relevant statistics in real time. In one example, the abundance of the unexpected modification corresponding to a modification edge equals the number of times the unexpected-modification mass separating the two nodes connected by the edge occurs between all nodes of the graph, divided by the total number of modification edges in the graph. In other embodiments, the abundance of the unexpected modification corresponding to a given unexpected-modification mass may be preset in the following ways. For example, database search software or other de novo sequencing software is used to identify the current spectra, and the number of identifications of a given unexpected modification divided by the number of identifications of all unexpected modifications is taken as the abundance of that modification. As another example, 10%-20% of the spectra are randomly sampled from existing spectra whose peptides and unexpected modifications are known, and the ratio of the counted occurrences of a given unexpected modification in these spectra to the occurrences of all unexpected modifications in them is taken as the abundance of that modification. Multiplying the score of a modification edge by the abundance increases the discrimination between high-abundance and low-abundance unexpected modifications.
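The real-time abundance estimate described above (occurrences of one unexpected-modification mass among all node pairs, divided by the total number of modification edges) might be sketched as below. The function name `modification_abundances`, the `mod_masses` table, and the tolerance are hypothetical; real modification masses would come from a source such as Unimod.

```python
from itertools import combinations


def modification_abundances(node_masses, mod_masses, tol=0.02):
    """Estimate per-modification abundance from one spectrum graph.

    node_masses: iterable of node masses.
    mod_masses: dict mapping a modification label to its mass (assumed input).
    Returns a dict label -> fraction of all modification edges.
    """
    counts = {name: 0 for name in mod_masses}
    for light, heavy in combinations(sorted(node_masses), 2):
        diff = heavy - light
        for name, mass in mod_masses.items():
            if abs(diff - mass) <= tol:
                counts[name] += 1  # this node pair would get a modification edge
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in mod_masses}
    return {name: count / total for name, count in counts.items()}
```

With this normalization the abundances of all modifications found in one graph sum to 1, so frequent mass differences are favored over rare ones when modification edges are scored.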
In yet another embodiment, constructing the mass spectrum connection graph from the spectrum includes the following steps:
1) Each peak is converted into corresponding nodes. Each peak can usually be converted into two nodes, one whose ion type is the b ion and one whose ion type is the y ion. Assuming the precursor ion mass of the spectrum is M and the mass of a given peak is m, the mass of the b-ion node derived from this peak is m-1, and the mass of the y-ion node is M-m-H2O (where H2O denotes the mass of a water molecule, about 18 Da). The score of each node may be set to the intensity of the corresponding peak, or to the natural logarithm of that intensity.
2) The pairwise mass differences between nodes are computed. If the mass difference between two nodes is the mass of an amino acid or of a common modification, the edge connecting the two nodes is an ordinary edge, and the score of the ordinary edge is the score of the heavier node. If the mass difference between two nodes is the mass of an unexpected modification, the edge connecting the two nodes is a modification edge, and the score of the modification edge may be the score of the heavier node. Preferably, the score of the modification edge is set to the score of the heavier node multiplied by the abundance of the unexpected modification.
3) A start node and an end node are added, with masses 0 and M-H2O-1 respectively and scores of 0.
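Steps 1) to 3) can be sketched as follows. This is a minimal illustration, assuming the caller supplies the mass tables: `residue_masses` (amino acids and common modifications), `mod_masses` (mass differences that should yield a modification edge), and `abundances` are hypothetical parameters, as are the function name, the tolerance, and the choice to discard nodes that fall outside the start/end range.

```python
import math

WATER = 18.01056  # monoisotopic mass of H2O in Da


def build_spectrum_graph(peaks, precursor_mass, residue_masses, mod_masses,
                         abundances, tol=0.02):
    """Build nodes and edges of a mass spectrum connection graph.

    peaks: list of (mass, intensity) with intensity > 0.
    Returns (nodes, edges): nodes is a mass-sorted list of (mass, score);
    edges is a list of (u, v, kind, label, score) with node u lighter than v.
    """
    end_mass = precursor_mass - WATER - 1.0
    nodes = [(0.0, 0.0), (end_mass, 0.0)]  # step 3: start and end, score 0
    for mass, intensity in peaks:
        score = math.log(intensity)  # node score: natural log of intensity
        # Step 1: one b-ion node and one y-ion node per peak.
        for node_mass in (mass - 1.0, precursor_mass - mass - WATER):
            if 0.0 < node_mass < end_mass:  # assumption: drop out-of-range nodes
                nodes.append((node_mass, score))
    nodes.sort()
    edges = []
    # Step 2: compare all node pairs; edge score comes from the heavier node.
    for u in range(len(nodes)):
        for v in range(u + 1, len(nodes)):
            diff = nodes[v][0] - nodes[u][0]
            for name, mass in residue_masses.items():
                if abs(diff - mass) <= tol:
                    edges.append((u, v, "ordinary", name, nodes[v][1]))
            for name, mass in mod_masses.items():
                if abs(diff - mass) <= tol:
                    weight = nodes[v][1] * abundances.get(name, 0.0)
                    edges.append((u, v, "modification", name, weight))
    return nodes, edges
```

Because the nodes are sorted by mass and every edge runs from a lighter to a heavier node, the node index order is already a topological order, which later path scoring can rely on.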
Continuing with Fig. 1, in step 102 the score of each path in the constructed mass spectrum connection graph is computed. In this embodiment, the paths in the graph are divided into ordinary paths and modification paths. An ordinary path is a path composed only of ordinary edges and contains no modification edge. A modification path is a path composed of ordinary edges and modification edges that contains one and only one modification edge. In one example, the score of each path may be computed by dynamic programming, for example using the following recurrence:

$$d_i(v) = \max\nolimits^{(i)} \{\, d_j(u) + w_{(u,v)} : u \in \mathrm{InvE1}(v) \,\}$$

$$d'_i(v) = \max\nolimits^{(i)} \Big( \{\, d'_j(u) + w_{(u,v)} : u \in \mathrm{InvE1}(v) \,\} \cup \{\, d_j(u) + w_{(u,v)} : u \in \mathrm{InvE2}(v) \,\} \Big)$$

Here $\max^{(i)}$ denotes the i-th largest value of the set. $d_i(v)$ and $d'_i(v)$ are the scores of the i-th ordinary path and the i-th modification path reaching node v, respectively. $d_j(u)$ and $d'_j(u)$ are the scores of the j-th ordinary path and the j-th modification path reaching node u, respectively. $u \in \mathrm{InvE1}(v)$ ranges over all ordinary predecessors of v, that is, the edge (u, v) between any node u in this set and v is an ordinary edge; $u \in \mathrm{InvE2}(v)$ ranges over all modification predecessors of v, that is, the edge (u, v) between any node u in this set and v is a modification edge. $w_{(u,v)}$ denotes the score of edge (u, v).
In a preferred embodiment, the path scores may be computed as follows. Starting from the start node and proceeding in order of node mass, for each node the scores of every ordinary path and modification path from the start node to that node are computed and sorted in descending order, and the top k ordinary paths and top k modification paths from the start node to that node are retained. For example, the start node is processed first, then the lightest node other than the start node, and so on, which guarantees that nodes are processed in topological order. For each node v, if the edge (u, v) between v and one of its predecessors u is an ordinary edge, edge (u, v) is appended to each of the top k ordinary paths and top k modification paths of u, and the resulting path sets serve respectively as the top k ordinary paths and top k modification paths reaching v through u. Because v generally has more than one ordinary predecessor, the paths through all ordinary predecessors are pooled and sorted by score in descending order, and the top k ordinary paths and top k modification paths are retained. If edge (u, v) is a modification edge, it is appended to each of the top k ordinary paths of u, and the resulting path set serves as the top k modification paths reaching v through u. Similarly, because v generally has more than one predecessor, the modification paths through all ordinary predecessors and the ordinary paths through all modification predecessors are pooled and sorted by score in descending order, and the top k modification paths are retained. The top k ordinary paths and top k modification paths finally obtained at the end node are the candidate peptides. Here k is a natural number that can be set according to actual needs; for example, k may take a value in [100, 200], such as about 150.
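The top-k procedure just described can be sketched as below. It is an illustration under assumptions rather than the patented implementation: the path score is taken to be the sum of edge scores, node indices are assumed to be in increasing mass order with node 0 the start and the last node the end, and the edge tuple format `(u, v, kind, label, score)` and the function name `top_k_paths` are hypothetical.

```python
import heapq


def top_k_paths(num_nodes, edges, k):
    """Keep the k best ordinary and k best modification paths per node.

    edges: iterable of (u, v, kind, label, score) with u < v and kind either
    "ordinary" or "modification".
    Returns (ordinary, modified) for the end node, each a list of
    (path_score, labels) sorted from best to worst.
    """
    incoming = [[] for _ in range(num_nodes)]
    for u, v, kind, label, score in edges:
        incoming[v].append((u, kind, label, score))

    ordinary = [[] for _ in range(num_nodes)]
    modified = [[] for _ in range(num_nodes)]
    ordinary[0] = [(0.0, ())]  # the empty path at the start node

    for v in range(1, num_nodes):  # index order is a topological order
        cand_ord, cand_mod = [], []
        for u, kind, label, w in incoming[v]:
            if kind == "ordinary":
                # An ordinary edge extends both kinds of path without
                # changing how many modification edges they contain.
                cand_ord += [(s + w, p + (label,)) for s, p in ordinary[u]]
                cand_mod += [(s + w, p + (label,)) for s, p in modified[u]]
            else:
                # A modification edge may only extend ordinary paths, so the
                # result contains exactly one modification edge.
                cand_mod += [(s + w, p + (label,)) for s, p in ordinary[u]]
        ordinary[v] = heapq.nlargest(k, cand_ord)
        modified[v] = heapq.nlargest(k, cand_mod)
    return ordinary[-1], modified[-1]
```

Keeping only k paths of each kind per node bounds the work at each node by the number of incoming edges times k, which is why allowing one modification edge adds a second table rather than multiplying the search space.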
After the candidate peptides are obtained, in step 103 peptide-spectrum match scoring is performed on all candidate peptides and the candidates are sorted. First, a theoretical spectrum is generated for each candidate peptide. For example, singly and doubly charged b and y ions, neutral-loss ions, internal ions, and immonium ions are considered; the masses of these ions are calculated and peaks are generated at the corresponding positions of the theoretical spectrum. The peak intensities can be arbitrary, because the subsequent peptide-spectrum match scoring does not use the intensity information of the theoretical spectrum. Then, the theoretical spectrum of each candidate peptide is matched against the original spectrum received in step 101 to score the candidate. For example, for each peak I in the theoretical spectrum of a candidate peptide, with mass m, the original spectrum is searched for peaks with mass in the range [m-Δm, m+Δm] (where Δm is the mass tolerance); if any exist, the intensity of the most intense peak in this range is taken as the score of peak I. Finally, the scores of all peaks in the theoretical spectrum of the candidate are summed to give its peptide-spectrum match score. The candidate peptide with the highest peptide-spectrum match score can then usually be selected as the final peptide identified for the spectrum.
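The matching step can be sketched as follows, assuming the theoretical ion masses have already been generated by the caller. The function name `psm_score` and the input formats are hypothetical; the scoring rule (most intense observed peak within the tolerance window, summed over theoretical peaks) follows the text.

```python
from bisect import bisect_left, bisect_right


def psm_score(theoretical_masses, observed_peaks, tol):
    """Sum, over theoretical peaks, the strongest observed intensity in range.

    theoretical_masses: iterable of ion masses for one candidate peptide.
    observed_peaks: list of (mass, intensity) sorted by mass.
    tol: mass tolerance (the Δm of the text).
    """
    masses = [mass for mass, _ in observed_peaks]
    total = 0.0
    for m in theoretical_masses:
        lo = bisect_left(masses, m - tol)    # first peak with mass >= m - tol
        hi = bisect_right(masses, m + tol)   # one past last peak <= m + tol
        if lo < hi:
            total += max(intensity for _, intensity in observed_peaks[lo:hi])
    return total
```

Binary search over the sorted observed masses keeps each lookup logarithmic, which matters when several hundred candidates are scored per spectrum.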
Fig. 2 shows a flow diagram of a de novo sequencing method according to another embodiment of the invention. The method includes: converting a spectrum into a mass spectrum connection graph (step 201, identical to step 101 above); computing the score of each path in the graph, taking the top several ordinary paths and modification paths sorted by path score as candidate peptides, and recording the path rank of each candidate peptide (step 202, which apart from recording the path rank follows step 102 above); and performing peptide-spectrum match scoring on the candidate peptides (step 203, see step 103 above). After the peptide-spectrum match score and the path rank of each candidate peptide have been obtained through these steps, the method further includes scoring each candidate peptide at a finer granularity (step 204), so as to identify peptide sequences and unexpected modifications more accurately.
The flow of the method shown in Fig. 2 is described in more detail below with reference to Figs. 3-5. In step 201, the spectrum is converted into a mass spectrum connection graph. Fig. 3 shows an example spectrum and the corresponding mass spectrum connection graph. The upper right corner of Fig. 3a) gives the known peptide ANHVR (the amino acid H is shown in bold italics to indicate that an unexpected modification, in this example a Methyl modification, has occurred on it). A spectrum of a known peptide is used here mainly to verify the effectiveness of the method of this embodiment. Fig. 3a) is the spectrum corresponding to this peptide. For ease of description, only 8 peaks and some of the nodes are shown, where b2, b3, and b4 mark the peaks corresponding to the 2nd, 3rd, and 4th b ions (to simplify the description, y-ion nodes are not shown), and the remaining peaks represent noise. In step 201, each peak is first converted into corresponding nodes, and a start node and an end node are added. Figs. 3b) and 3c) show some of the nodes (for simplicity, only one node is shown for each peak in Fig. 3a)), namely 10 nodes from node 0 (the start) to node 9 (the end). As described above, the mass of each node is determined from the mass of its peak. The score of each node is the natural logarithm of the intensity of the corresponding peak. The pairwise mass differences between nodes are then computed: if a mass difference is the mass of an amino acid or of a common modification, the edge connecting the two nodes is an ordinary edge; if it is the mass of an unexpected modification, the edge connecting the two nodes is a modification edge. As shown in Fig. 3b), the mass difference between node 0 and node 2 corresponds to the mass of amino acid G, so an ordinary edge can connect node 0 and node 2. Similarly, the mass difference between node 2 and node 3 may correspond to the mass of amino acid Q, or of AG or GA, so an ordinary edge can connect node 2 and node 3. The mass difference between node 3 and node 5 may correspond to a Methyl modification on amino acid H, denoted Met [H], so a modification edge can connect node 3 and node 5, and so on for the other nodes; dashed lines denote modification edges and solid lines denote ordinary edges. Scores are then assigned to the ordinary edges and the modification edges. The score of an ordinary edge is the score of the heavier node; the score of a modification edge is the score of the heavier node multiplied by the abundance corresponding to the unexpected-modification mass. The constructed mass spectrum connection graph is shown in Fig. 3c), in which, for example, the score 26 of the modification edge between node 3 and node 5 is the score of the heavier of the two nodes multiplied by the abundance of the unexpected modification Methyl.
Then, in step 202, the dynamic programming method described above in connection with step 102 is used to compute the scores of all common paths and modification paths from the start point to the end point in the constructed mass spectrum connection graph. The result is shown in Fig. 4: common paths correspond to candidate common peptides, and modification paths correspond to candidate modified peptides. The candidate peptides are sorted by path score from high to low. In one example, common paths and modification paths are sorted together, the peptides corresponding to the top N paths by score are taken as candidate peptides, and the path rank of each candidate peptide is recorded. In another example, common paths and modification paths are sorted separately, the peptides corresponding to the top k common paths and the top k modification paths are taken as candidate peptides, and the path rank of each candidate peptide is recorded. The higher the path score, the higher the path ranks. Preferably, k is about 150 and, correspondingly, N is about 300.
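As a rough illustration of step 202, the following sketch keeps the k best partial paths per node, separately for paths with zero and with one modification edge. It is a generic k-best dynamic program written under assumptions, not the exact procedure of step 102: the edge tuple layout and the name `top_k_paths` are hypothetical, and nodes are assumed to be indexed in ascending mass order so that every edge goes from a lower to a higher index.

```python
import heapq

def top_k_paths(n_nodes, edges, k):
    """edges: (src, dst, label, score, is_modification) with src < dst.

    Returns (common, modified): up to k best paths from node 0 to node
    n_nodes - 1, each as (path_score, [edge labels]), best first.
    """
    incoming = [[] for _ in range(n_nodes)]
    for src, dst, label, score, is_mod in edges:
        incoming[dst].append((src, label, score, is_mod))

    # best[v][m]: k best partial paths ending at v that use m modification edges.
    best = [[[], []] for _ in range(n_nodes)]
    best[0][0] = [(0.0, [])]
    for v in range(1, n_nodes):
        candidates = [[], []]
        for src, label, score, is_mod in incoming[v]:
            for m in (0, 1):
                m_new = m + int(is_mod)
                if m_new > 1:
                    continue  # a modification path contains only one such edge
                for s, labels in best[src][m]:
                    candidates[m_new].append((s + score, labels + [label]))
        for m in (0, 1):
            best[v][m] = heapq.nlargest(k, candidates[m], key=lambda p: p[0])
    return best[-1][0], best[-1][1]
```

Keeping the two lists separate corresponds to the second example in the text (top k common paths and top k modification paths); merging them and taking the top N would correspond to the first.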
In step 203, a theoretical spectrum is generated for each candidate peptide and matched against the original spectrum shown in Fig. 3 a), yielding the peptide-spectrum matching score of that candidate peptide (for details, see the description of step 103 above).
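To make step 203 concrete, the sketch below generates singly charged b and y ions for a candidate peptide and counts how many of them are found in an observed peak list. The shared-peak count is only a stand-in chosen for illustration; the actual scoring function is the one described for step 103. The function names, the tolerance and the `mod_shifts` parameter are assumptions.

```python
PROTON = 1.00728   # mass of a proton in Da
WATER = 18.01056   # monoisotopic mass of H2O in Da
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "V": 99.06841,
                "N": 114.04293, "H": 137.05891, "R": 156.10111}

def theoretical_ions(peptide, mod_shifts=None):
    """Singly charged b/y m/z values of a peptide.

    mod_shifts maps a 0-based residue position to a modification mass shift.
    """
    mod_shifts = mod_shifts or {}
    masses = [RESIDUE_MASS[aa] + mod_shifts.get(i, 0.0)
              for i, aa in enumerate(peptide)]
    total = sum(masses)
    ions = []
    prefix = 0.0
    for m in masses[:-1]:
        prefix += m
        ions.append(prefix + PROTON)                  # b ion
        ions.append(total - prefix + WATER + PROTON)  # complementary y ion
    return sorted(ions)

def shared_peak_count(theoretical, observed_mz, tol=0.02):
    """Number of theoretical ions with an observed peak within tol Da."""
    return sum(any(abs(t - o) <= tol for o in observed_mz)
               for t in theoretical)
```

For the example peptide, `theoretical_ions("ANHVR", {2: 14.01565})` places the Methyl shift on H, so the b3, b4, y3 and y4 ions carry the extra mass.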
In step 204, for each candidate peptide obtained in step 202, its peptide-spectrum matching score, path rank and modification abundance are supplied to a trained classifier, which scores the candidate and yields its score value; the candidate peptide with the highest score value is selected as the peptide identified for the current spectrum. For each candidate peptide, if it comes from a common path, its modification abundance is 1; if it comes from a modification path, its modification abundance is the abundance of the unexpected modification corresponding to the modification edge in that path. The method for calculating the abundance of the unexpected modification corresponding to a modification edge has been described above.
The classifier is a ranking classifier trained on three features of a peptide: its peptide-spectrum matching score, path rank and modification abundance. The ranking classifier can be trained in advance. For example, the sample set for training it is first constructed as follows: for a group of spectra whose corresponding peptides and unexpected modifications are known, each spectrum is identified using an existing de novo sequencing method, a database search method, or the steps described above; if the peptide sequence identified for a spectrum is consistent with the peptide known to correspond to that spectrum, the result is labeled as a positive sample, otherwise as a negative sample. Features are then extracted for each sample: peptide-spectrum matching score, path rank and modification abundance. The peptide-spectrum matching score and path rank can be obtained by the steps described above in this application, and the method for calculating the modification abundance of each unexpected modification in each sample has been described above. For example, multiple existing spectra are drawn at random and the number of occurrences of each unexpected modification in them is counted; the ratio of the number of times a given unexpected modification occurs in these spectra to the total number of occurrences of all unexpected modifications in these spectra is taken as the abundance of that unexpected modification. As another example, for each sample peptide, the ratio of the number of times each modification occurs in the peptide to the total number of times all modifications occur in the peptide can be taken as the modification abundance of that modification, usually a floating-point number between 0 and 1. For a sample peptide without modification, the extracted modification abundance feature is 1. For a candidate peptide, the higher the peptide-spectrum matching score, the more credible the result; the larger the modification abundance, the more credible the result; and the smaller the path rank (equivalently, the higher the path score), the more credible the result.
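The first way of obtaining modification abundances described above (counting occurrences over a set of existing spectra) amounts to a simple frequency ratio. A minimal sketch follows, in which the function names and the modification names used in the usage note are purely illustrative:

```python
from collections import Counter

def modification_abundances(observed_mods):
    """observed_mods: iterable of modification names, one entry per
    occurrence across a set of previously identified spectra."""
    counts = Counter(observed_mods)
    total = sum(counts.values())
    # Abundance = occurrences of this modification / occurrences of all.
    return {mod: n / total for mod, n in counts.items()}

def candidate_abundance(path_modification, abundances):
    """Modification-abundance feature of a candidate peptide: 1 for a
    common path (no modification), else the abundance of the single
    unexpected modification on its modification edge."""
    if path_modification is None:
        return 1.0
    return abundances[path_modification]
```

For instance, three observed Methyl occurrences and one Acetyl occurrence give abundances of 0.75 and 0.25, so an unmodified candidate (feature value 1) is favored over a modified one unless the other features compensate.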
Then, based on the resulting sample set, the ranking classifier is trained by a machine learning method; the classifier is used to compute the score of each peptide. In one example, the RankBoost method is used, and the classifier model is $F = \sum_{i=1}^{n} f_i(s_i)$, where n is the number of features (here 3), s_i is the value of the i-th feature, and f_i is the i-th feature function, a monotonically increasing function whose specific parameters are obtained by training. The classifier can identify the differences between positive and negative samples well and use them to effectively distinguish correct from incorrect results. In other embodiments, SVM Rank, RankNet, FRank and the like can also be used as the classifier model for training.
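The scoring side of such a ranking classifier can be sketched as a sum of one monotone function per feature. The sketch below assumes that each feature function is a non-decreasing step function (as produced by threshold-based weak rankers) and that the path rank is negated before scoring so that larger values are better; both choices, the class name `StepFunction`, and all thresholds and weights are illustrative assumptions rather than trained parameters.

```python
from bisect import bisect_right

class StepFunction:
    """Non-decreasing step function: f(x) is the sum of the weights of all
    thresholds that are <= x (weights are assumed non-negative)."""

    def __init__(self, thresholds, weights):
        self.thresholds = list(thresholds)  # must be sorted ascending
        self.cumulative = [0.0]
        for w in weights:
            self.cumulative.append(self.cumulative[-1] + w)

    def __call__(self, x):
        return self.cumulative[bisect_right(self.thresholds, x)]

def rank_score(features, functions):
    """Additive score over features: sum of f_i(s_i)."""
    return sum(f(s) for f, s in zip(functions, features))

def best_candidate(candidates, functions):
    """candidates: list of (peptide, feature_tuple); returns the peptide
    with the highest additive score."""
    return max(candidates, key=lambda c: rank_score(c[1], functions))[0]
```

With features ordered as (matching score, negated path rank, modification abundance), `best_candidate` returns the candidate that would be reported for the spectrum.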
Returning to Fig. 2, in step 204 the peptide-spectrum matching score, path rank and modification abundance of each candidate peptide are supplied to the trained classifier for scoring. Fig. 5 shows the final scoring results for the candidate peptides shown in Fig. 3 c). As can be seen from Fig. 5, the peptide with the highest score, 58.1, is ANHVR; the peptide identified by the above steps is thus exactly the peptide corresponding to the spectrum shown in Fig. 3 a).
Although the present invention has been described by way of preferred embodiments, it is not limited to the embodiments described here, and also includes various changes and variations made without departing from the scope of the present invention.

Claims (12)

1. A de novo sequencing method, comprising:
converting a spectrum to be resolved into a mass spectrum connection graph, wherein each peak in the spectrum is converted into a node of the mass spectrum connection graph; if the mass difference between two nodes in the mass spectrum connection graph is an amino acid mass or a common modification mass, the two nodes are connected by a common edge, the score of the common edge being determined based on the intensity of the peak corresponding to the larger-mass node; and if the mass difference between two nodes is an unexpected modification mass, the two nodes are connected by a modification edge, the score of the modification edge being determined based on the intensity of the peak corresponding to the larger-mass node;
computing the score of each path in the mass spectrum connection graph, and extracting the several highest-scoring common paths and modification paths as candidate peptides, wherein a common path is a path composed only of common edges, and a modification path is a path composed of common edges and modification edges that contains only one modification edge;
performing peptide-spectrum match scoring for each candidate peptide, and taking the candidate peptide with the highest peptide-spectrum matching score as the peptide corresponding to the spectrum.
2. A de novo sequencing method, comprising:
converting a spectrum to be resolved into a mass spectrum connection graph, wherein each peak in the spectrum is converted into a node of the mass spectrum connection graph; if the mass difference between two nodes in the mass spectrum connection graph is an amino acid mass or a common modification mass, the two nodes are connected by a common edge, the score of the common edge being determined based on the intensity of the peak corresponding to the larger-mass node; and if the mass difference between two nodes is an unexpected modification mass, the two nodes are connected by a modification edge, the score of the modification edge being determined based on the intensity of the peak corresponding to the larger-mass node;
computing the score of each path in the mass spectrum connection graph, extracting the several highest-scoring common paths and modification paths as candidate peptides, and recording the path rank of each candidate peptide, wherein a common path is a path composed only of common edges, and a modification path is a path composed of common edges and modification edges that contains only one modification edge;
performing peptide-spectrum match scoring for each candidate peptide;
supplying the peptide-spectrum matching score, path rank and modification abundance of each candidate peptide as features to a ranking classifier trained in advance so as to score the candidate peptide, and taking the highest-scoring candidate peptide as the peptide corresponding to the spectrum.
3. The method according to claim 1 or 2, wherein the score of a common edge between two nodes is the natural logarithm of the intensity of the peak corresponding to the larger-mass node, and the score of a modification edge between two nodes is the natural logarithm of the intensity of the peak corresponding to the larger-mass node.
4. The method according to claim 1 or 2, wherein the score of a modification edge between two nodes is the intensity of the peak corresponding to the larger-mass node multiplied by the abundance of the unexpected modification corresponding to the modification edge, the abundance of the unexpected modification being the probability or frequency with which the unexpected modification corresponding to the unexpected modification mass by which the two nodes connected by the modification edge differ may occur.
5. The method according to claim 4, wherein the abundance of the unexpected modification corresponding to a modification edge equals the number of times the unexpected modification mass by which the two nodes connected by the modification edge differ occurs among all nodes of the mass spectrum connection graph, divided by the total number of modification edges in the mass spectrum connection graph.
6. The method according to claim 4, wherein the abundance of an unexpected modification is preset.
7. The method according to claim 6, further comprising setting the abundance of a given unexpected modification as follows:
drawing multiple existing spectra at random and counting the number of times the given unexpected modification occurs in them;
taking the ratio of the number of times the given unexpected modification occurs in the multiple spectra to the total number of occurrences of all unexpected modifications in those spectra as the abundance of that unexpected modification.
8. The method according to claim 1 or 2, wherein each peak in the spectrum is converted into two nodes in the mass spectrum connection graph, one node corresponding to a b ion and the other node corresponding to a y ion.
9. The method according to claim 8, wherein, for each peak, the mass of the node corresponding to the b ion is the peak mass minus 1, and the mass of the node corresponding to the y ion is the mass of the precursor ion in the spectrum minus the peak mass and the mass of one water molecule.
10. The method according to claim 4, wherein, for each candidate peptide, if the candidate peptide comes from a common path, its modification abundance is 1; if the candidate peptide comes from a modification path, its modification abundance is the abundance of the unexpected modification corresponding to the modification edge in that modification path.
11. The method according to claim 10, wherein the ranking classifier is trained using, as the sample set, a group of spectra whose corresponding peptides and modifications are known, with the peptide-spectrum matching score, path rank and modification abundance of the peptide extracted from each sample as features.
12. A de novo sequencing device, comprising:
a conversion unit for converting a spectrum to be resolved into a mass spectrum connection graph, wherein each peak in the spectrum is converted into a node of the mass spectrum connection graph; if the mass difference between two nodes in the mass spectrum connection graph is an amino acid mass or a common modification mass, the two nodes are connected by a common edge, the score of the common edge being determined based on the intensity of the peak corresponding to the larger-mass node; and if the mass difference between two nodes is an unexpected modification mass, the two nodes are connected by a modification edge, the score of the modification edge being determined based on the intensity of the peak corresponding to the larger-mass node;
a path extraction unit for computing the score of each path in the mass spectrum connection graph and extracting the several highest-scoring common paths and modification paths as candidate peptides, wherein a common path is a path composed only of common edges, and a modification path is a path composed of common edges and modification edges that contains only one modification edge;
a match scoring unit for performing peptide-spectrum match scoring for each candidate peptide and taking the candidate peptide with the highest peptide-spectrum matching score as the peptide corresponding to the spectrum.
CN201611019740.5A 2016-11-14 2016-11-14 De novo sequencing method and device Active CN106770605B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611019740.5A CN106770605B (en) 2016-11-14 2016-11-14 De novo sequencing method and device

Publications (2)

Publication Number Publication Date
CN106770605A CN106770605A (en) 2017-05-31
CN106770605B true CN106770605B (en) 2019-03-26

Family

ID=58968971

Country Status (1)

Country Link
CN (1) CN106770605B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622184B (en) * 2017-09-29 2020-01-21 中国科学院计算技术研究所 Evaluation method for amino acid reliability and modification site positioning
CN116825198B (en) * 2023-07-14 2024-05-10 湖南工商大学 Peptide sequence tag identification method based on graph annotation mechanism

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1551984A (en) * 2000-07-25 2004-12-01 Methods and kits for sequencing polypeptides
CN1749269A (en) * 2004-07-16 2006-03-22 安捷伦科技有限公司 Serial derivatization of peptides for de novo sequencing using tandem mass spectrometry
JP2015230262A (en) * 2014-06-05 2015-12-21 株式会社島津製作所 Mass analysis data analysis method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120232805A1 (en) * 2011-02-14 2012-09-13 Board Of Regents, The University Of Texas System Computerized Amino Acid Composition Enumeration

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
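pNovo+: De Novo Peptide Sequencing Using Complementary HCD and ETD Tandem Mass Spectra; Hao Chi et al.; Journal of Proteome Research; 20131231; Vol. 12; pp. 615-625 *
pNovo: De novo Peptide Sequencing and Identification Using HCD Spectra; Hao Chi et al.; Journal of Proteome Research; 20101231; Vol. 9; pp. 2713-2724 *
pNovo: Research on a de novo sequencing method based on tandem mass spectrometry; Yang Hao (杨皓) et al.; Abstracts of the 2nd National Conference on Mass Spectrometry Analysis of the Chinese Chemical Society; 20151016; p. 359 *
Research on genome de novo sequencing algorithms based on probabilistic models; Han Dongtao (韩东涛); China Master's Theses Full-text Database, Basic Sciences; 20140515 (No. 5); full text *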

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant