CN115862742A

CN115862742A - Bidirectional peptide fragment sequencing method based on self-attention mechanism and application

Info

Publication number: CN115862742A
Application number: CN202211615090.6A
Authority: CN
Inventors: 栾钟治; 郭天南; 吴思宇; 王群莹; 何林璇
Original assignee: Beihang University; Westlake University
Current assignee: Beihang University; Westlake University
Priority date: 2022-12-15
Filing date: 2022-12-15
Publication date: 2023-03-28

Abstract

The invention provides a bidirectional peptide fragment sequencing method based on a self-attention mechanism and an application thereof. The method comprises the following steps: acquiring original mass spectrum data and processing it to obtain a peptide fragment characteristic and a secondary fragment ion spectrogram associated with the peptide fragment characteristic; inputting the peptide fragment characteristic and the secondary fragment ion spectrogram into a bidirectional independent sequencing model of a bidirectional peptide fragment sequencing model to output bidirectional independent prediction candidate sequences, inputting the peptide fragment characteristic and the secondary fragment ion spectrogram into a bidirectional interactive sequencing model of the bidirectional peptide fragment sequencing model to output bidirectional interactive prediction candidate sequences, and taking the union of the bidirectional independent prediction candidate sequences and the bidirectional interactive prediction candidate sequences as the final candidate sequences; and inputting the final candidate sequences into the bidirectional independent sequencing model again for scoring, and selecting the peptide fragment sequence with the highest score as the prediction result. The peptide fragment sequence can thus be deduced, without relying on a database, from the experimentally generated secondary spectrogram and the accurate mass from the primary spectrogram, which facilitates the discovery of new peptide fragments and proteins.

Description

Bidirectional peptide fragment sequencing method based on self-attention mechanism and application

Technical Field

The application relates to the field of protein sequencing, and in particular to a bidirectional peptide fragment sequencing method based on a self-attention mechanism and an application thereof.

Background

Proteomics is an emerging discipline that studies protein expression and activity patterns at the organ, tissue, cellular and subcellular levels. Since the launch of the Human Proteome Project, proteomics based on mass spectrometry has developed rapidly and has gradually been applied to life science research, revealing the rules of life activity on the basis of protein sequences combined with various quantitative techniques.

In proteomics research, protein identification is the most critical step. A reliable peptide fragment sequencing result makes it possible to better deduce the protein sequence and to explain the expression and interaction relationships of proteins. Analyzing mass spectrum data and reducing it to a peptide fragment sequence is called peptide fragment sequencing. Traditional peptide fragment sequencing methods based on mass spectrum data mainly comprise identification based on sequence database searching and identification based on spectrogram database searching. The sequence database search method theoretically cleaves the protein sequences in a known sequence database into peptide fragments and selects candidate peptide fragments within a mass error range according to the parent ion mass measured by the mass spectrometer. The candidate peptide fragments are then theoretically fragmented to form theoretical spectrograms, the experimental spectrogram to be identified is matched against the theoretical spectrograms and scored, and the peptide fragment identification result is formed from the scores. The spectrogram database search method is similar in that it also compares the experimental spectrogram to be identified with reference spectrograms; the difference is that its reference spectrograms are derived from actual spectrograms measured in experiments.

The above peptide fragment sequencing methods are easy to implement, but their dependence on existing databases gives rise to several defects:

(1) Poor discoverability: only peptides or proteins present in the reference database can be identified, and new peptides or proteins cannot be found.

(2) Poor universality: the methods are not applicable to protein samples from non-model species that lack reliable protein reference sequences, and the accuracy of identification is limited by the quality and completeness of the reference library.

(3) Low reference library utilization: the search space of the reference database keeps growing, yet a large proportion of spectrograms is never used in database-based search identification, so the utilization rate is low.

Therefore, a peptide fragment sequencing method that is independent of sequence databases and spectrogram libraries is of great importance, especially for finding new peptide fragments and new proteins that do not exist in the libraries.

In addition, existing de novo peptide fragment sequencing has been studied relatively little and has low accuracy, and it suffers from prediction imbalance: in forward prediction, the accuracy of the first few amino acids is higher than that of the last few; in backward prediction, likewise, the amino acids predicted first are more accurate than those predicted last. Meanwhile, because of the mass constraint on the peptide fragment sequence, an error in the current amino acid prediction affects all predictions after the current position, so deviation accumulates. Furthermore, the peptide sequencing task requires evaluating how well the predicted peptide matches the reference answer. The existing evaluation of de novo sequencing algorithms only counts the completely matched peptide fragments and lacks more detailed and comprehensive evaluation indexes. It is therefore necessary to design a de novo sequencing algorithm that addresses output imbalance and deviation accumulation in data-independent acquisition mode, and to evaluate it with more comprehensive and detailed indexes.

Disclosure of Invention

The embodiments of the application provide a bidirectional peptide fragment sequencing method based on a self-attention mechanism and an application thereof. By optimizing the model with bidirectional prediction and a self-attention mechanism module, the peptide fragment sequence is deduced directly from the experimentally generated secondary spectrogram and the accurate mass from the primary spectrogram without a database, which facilitates the discovery of new peptide fragments and proteins.

In a first aspect, the embodiments of the present application provide a bidirectional peptide fragment sequencing method based on a self-attention mechanism, including the following steps:

feature extraction and pretreatment: acquiring mass spectrum original data, and processing the mass spectrum original data to obtain a peptide fragment characteristic and a secondary fragment ion spectrogram related to the peptide fragment characteristic;

bidirectional prediction: inputting the peptide fragment characteristics and the secondary fragment ion spectrogram into a bidirectional independent sequencing model in a bidirectional peptide fragment sequencing model to output bidirectional independent prediction candidate sequences, inputting the peptide fragment characteristics and the secondary fragment ion spectrogram into the bidirectional interactive sequencing model in the bidirectional peptide fragment sequencing model to output bidirectional interactive prediction candidate sequences, and taking a union set of the bidirectional independent prediction candidate sequences and the bidirectional interactive prediction candidate sequences as a final candidate sequence;

reordering: inputting the final candidate sequence into the bidirectional independent sequencing model again for scoring, and selecting the peptide fragment sequence with the highest score as the prediction result.

In a second aspect, the embodiments of the present application provide a bidirectional peptide fragment sequencing apparatus based on a self-attention mechanism, including the following:

a feature extraction and preprocessing unit, used for acquiring mass spectrum original data and processing the mass spectrum original data to obtain a peptide fragment characteristic and a secondary fragment ion spectrogram related to the peptide fragment characteristic;

the bidirectional prediction unit is used for inputting the peptide fragment characteristics and the secondary fragment ion spectrogram into a bidirectional independent sequencing model in a bidirectional peptide fragment sequencing model to output bidirectional independent prediction candidate sequences, inputting the peptide fragment characteristics and the secondary fragment ion spectrogram into a bidirectional interactive sequencing model in the bidirectional peptide fragment sequencing model to output bidirectional interactive prediction candidate sequences, and taking a union set of the bidirectional independent prediction candidate sequences and the bidirectional interactive prediction candidate sequences as final candidate sequences;

and the reordering unit is used for inputting the final candidate sequence into the bidirectional independent sequencing model again for scoring, and selecting the peptide segment sequence with the highest score as a prediction result.

In a third aspect, embodiments of the present application provide an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the computer program to perform the method for bidirectional peptide fragment sequencing based on the self-attention mechanism.

In a fourth aspect, embodiments of the present application provide a readable storage medium storing a computer program, the computer program comprising program code for controlling a process to execute the bidirectional peptide fragment sequencing method based on the self-attention mechanism according to any one of the above.

The main contributions and innovation points of the invention are as follows:

The embodiments of the application differ from the prior-art sequence database search and spectrogram database search identification methods: bidirectional peptide fragment sequencing is realized with a bidirectional peptide fragment sequencing model optimized by a self-attention mechanism module and bidirectional prediction, so that the peptide fragment sequence can be inferred directly from the experimentally generated secondary spectrogram and the accurate mass of the primary parent ions without depending on a database, which facilitates the discovery of new peptide fragments and proteins. In addition, the self-attention mechanism module introduced in the scheme can better learn the internal rules of peptide fragment sequences, the fragment ion patterns of tandem mass spectra and other important characteristics, bringing a new solution to the analysis and inference of secondary spectrograms. Furthermore, the scheme designs a structure of bidirectional prediction and re-scoring on top of the model to address the problems of output imbalance and deviation accumulation in peptide fragment sequence prediction.

In addition, to address the problem that the existing evaluation of de novo sequencing algorithms only counts the completely matched peptide fragments and lacks more detailed and comprehensive evaluation indexes, the scheme designs evaluation indexes for de novo peptide fragment sequencing that more comprehensively account for missing fragment ions in the secondary spectrogram and amino acid misplacement in the peptide fragment sequence.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a logic flow diagram of a bidirectional peptide fragment sequencing method based on a self-attention mechanism according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a bidirectional independent sequencing model according to one aspect of the present application;

FIG. 3 is a schematic diagram of the spectrogram-peptide fragment attention mechanism;

FIG. 4 is a schematic diagram of bidirectional independent prediction;

FIG. 5 is a schematic diagram of bidirectional interactive prediction;

FIG. 6 is a schematic diagram of the computation of the bidirectional synchronous self-attention mechanism module;

FIG. 7 is a schematic diagram of the beam search in bidirectional interactive prediction;

FIG. 8 is a block diagram of a bidirectional peptide fragment sequencing apparatus based on a self-attention mechanism according to an embodiment of the present application;

FIG. 9 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.

It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.

Example one

The scheme provides a bidirectional peptide fragment sequencing method based on a self-attention mechanism, the method can directly deduce the amino acid sequence of a peptide fragment from a secondary spectrogram generated by an experiment and the accurate mass of primary parent ions without depending on an existing protein sequence reference database and a spectrogram library, and is used for finding new peptide fragments and new proteins with unknown sequences.

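Specifically, the bidirectional peptide fragment sequencing method based on the self-attention mechanism provided by the scheme comprises the following steps:

feature extraction and pretreatment: acquiring mass spectrum original data, and processing the mass spectrum original data to obtain a peptide fragment characteristic and a secondary fragment ion spectrogram related to the peptide fragment characteristic;

bidirectional prediction: inputting the peptide fragment characteristics and the secondary fragment ion spectrogram into a bidirectional independent sequencing model in a bidirectional peptide fragment sequencing model to output bidirectional independent prediction candidate sequences, inputting the peptide fragment characteristics and the secondary fragment ion spectrogram into the bidirectional interactive sequencing model in the bidirectional peptide fragment sequencing model to output bidirectional interactive prediction candidate sequences, and taking the union of the bidirectional independent prediction candidate sequences and the bidirectional interactive prediction candidate sequences as the final candidate sequence;

reordering: inputting the final candidate sequence into the bidirectional independent sequencing model again for scoring, and selecting the peptide fragment sequence with the highest score as the prediction result.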

In the step of 'feature extraction and pretreatment', the scheme extracts from the original mass spectrum data the peptide fragment characteristics and the secondary fragment ion spectrogram associated with them, wherein a peptide fragment characteristic refers to the trace of a single peptide fragment, and the secondary fragment ion spectrogram is extracted from the original mass spectrum data and associated with the peptide fragment characteristic.

In the mass spectrometry experiment, SILAC stable isotope labeling is adopted. Stable isotope labeling introduces a mass difference between copies of the same peptide fragment while their charge remains equal, so when the peptide fragments are drawn on a heat map, the precursor ions of the same peptide fragment produce traces that are parallel and equally spaced in the mass-to-charge ratio dimension. For example, peptide fragments isotopically labeled with carbon-12 (¹²C) and carbon-13 (¹³C) differ in mass, and if the charge of the peptide fragment is z = 2, the distance between its traces in the mass-to-charge ratio dimension is 0.5 m/z. The scheme therefore takes the trace of a single peptide fragment as the peptide fragment characteristic; correspondingly, traces that are parallel on the primary mass spectrum and equally spaced in the mass-to-charge ratio dimension are obtained as the peptide fragment characteristics.
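The spacing arithmetic in the example above can be checked with a short sketch. It assumes a nominal mass difference of 1 Da between the two isotopic forms, which is what the 0.5 m/z figure implies; the function name is hypothetical.

```python
def trace_spacing_mz(mass_difference_da: float, charge: int) -> float:
    """Spacing in m/z between two traces of one peptide that differ in mass."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    # m/z divides the mass by the charge, so a mass shift shrinks by 1/z.
    return mass_difference_da / charge


# Assumed nominal 1 Da difference at charge z = 2, as in the text.
print(trace_spacing_mz(1.0, 2))  # 0.5
```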

However, the peptide fragment characteristics are extracted from the primary mass spectrum data, whereas it is the secondary mass spectrum data that mainly serves the qualitative analysis of peptide fragments and proteins, so the scheme also needs to acquire the secondary fragment ion spectrogram associated with the peptide fragment characteristics. Because the original mass spectrum data acquired by different mass spectrometers have different formats (for example, the output data of a Thermo mass spectrometer is usually in .raw format), the original mass spectrum data must be converted into a uniform spectrogram format. During conversion, a CWT method is adopted and MS levels 1-2 are selected, so that the secondary fragment ion spectrograms in the original mass spectrum data are extracted and associated with the peptide fragment characteristics.

In an embodiment of the scheme, the more intuitive MGF file is selected as the uniform spectrogram format, and the original data (.raw) file is first converted into a spectrogram format (.mgf) file with the msconvert software. The secondary fragment ion spectrogram associated with a peptide fragment characteristic is obtained as follows: the original mass spectrum data in the uniform spectrogram format are read, and a dictionary sequence storing the keywords 'Scans' and 'Pepmass' is established; given a peptide fragment characteristic and its retention time range, the dictionary sequence is traversed to screen out the secondary fragment ion entries whose 'Rtins' field lies between RTBegin and RTEnd of the peptide fragment characteristic and whose 'Pepmass' field covers the mass-to-charge ratio of the peptide fragment characteristic; the secondary mass spectra of these entries are taken as the secondary fragment ion spectrogram.
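The screening step described above might look like the following minimal sketch. It assumes the MGF file has already been parsed into a list of dictionaries with the keys 'Scans', 'Pepmass' and 'Rtins'. The text does not say how "covering" the mass-to-charge ratio is tested, so the `half_window_mz` parameter (half the width of an isolation window centered on 'Pepmass') is a hypothetical stand-in.

```python
from typing import Dict, List

Spectrum = Dict[str, float]


def select_ms2_spectra(
    spectra: List[Spectrum],
    feature_mz: float,
    rt_begin: float,
    rt_end: float,
    half_window_mz: float,  # hypothetical: half-width of the isolation window
) -> List[Spectrum]:
    """Keep spectra inside the feature's retention time range whose
    precursor window covers the feature's mass-to-charge ratio."""
    selected = []
    for spectrum in spectra:
        in_rt_range = rt_begin <= spectrum["Rtins"] <= rt_end
        covers_mz = abs(spectrum["Pepmass"] - feature_mz) <= half_window_mz
        if in_rt_range and covers_mz:
            selected.append(spectrum)
    return selected


example = [
    {"Scans": 1, "Pepmass": 500.0, "Rtins": 100.0},  # kept
    {"Scans": 2, "Pepmass": 500.0, "Rtins": 300.0},  # outside the RT range
    {"Scans": 3, "Pepmass": 700.0, "Rtins": 110.0},  # window misses the feature
]
kept = select_ms2_spectra(example, feature_mz=505.2, rt_begin=90.0,
                          rt_end=120.0, half_window_mz=12.5)
print([s["Scans"] for s in kept])  # [1]
```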

In the scheme, a peptide fragment characteristic and the secondary fragment ion spectrogram associated with it are used as the input features of the bidirectional peptide fragment sequencing model, which comprises a bidirectional independent sequencing model and a bidirectional interactive sequencing model in parallel. The bidirectional independent sequencing model performs bidirectional independent prediction on the input features to obtain bidirectional independent prediction candidate sequences, the bidirectional interactive sequencing model performs bidirectional interactive prediction on the input features to obtain bidirectional interactive prediction candidate sequences, and the union of the two sets of candidate sequences is taken as the final candidate sequences.
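The union of the two candidate sets and the later re-scoring step can be sketched as follows. The scoring function is passed in as a parameter and is purely a placeholder for the bidirectional independent sequencing model; the toy score table is invented for illustration and carries no meaning.

```python
from typing import Callable, Iterable, Optional


def rerank_candidates(
    independent_candidates: Iterable[str],
    interactive_candidates: Iterable[str],
    score_fn: Callable[[str], float],  # placeholder for the model's score
) -> Optional[str]:
    """Union both candidate sets, score each, return the best candidate."""
    final_candidates = set(independent_candidates) | set(interactive_candidates)
    if not final_candidates:
        return None
    # Sorting first makes the choice deterministic when scores tie.
    return max(sorted(final_candidates), key=score_fn)


# Toy scores standing in for model output (illustrative values only).
toy_scores = {"AB": 0.2, "CD": 0.9, "EF": 0.5}
print(rerank_candidates(["AB", "CD"], ["CD", "EF"], toy_scores.get))  # CD
```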

Specifically, the bidirectional independent sequencing model is introduced first. As shown in FIG. 2, the bidirectional independent sequencing model comprises an encoder, an attention decoder and an intensity decoder. The secondary fragment ion spectrogram is input into the encoder and encoded to obtain the secondary spectrogram global features. The predicted forward and backward peptide fragment sequences are each independently input into the position coding module of the attention decoder to obtain sequence position codes; the sequence position codes pass through the peptide fragment sequence self-attention mechanism module of the attention decoder to output peptide fragment vector features; and the peptide fragment vector features and the secondary spectrogram global features are jointly input into the spectrogram-sequence attention mechanism to obtain associated features. The corresponding forward and backward secondary fragment ion spectrograms are input into the intensity decoder to output candidate intensity features. The candidate intensity features and the associated features are spliced to obtain splicing features, and the splicing features are mapped into the bidirectional independent prediction candidate sequences.

The bidirectional independent sequencing model adopts an encoder-decoder architecture. The encoder contains a global CNN network and takes the secondary fragment ion spectrogram as its input feature; the global CNN network encodes the global representation of the secondary fragment ion spectrogram to obtain the secondary spectrogram global features. More specifically, the global CNN network comprises a CNN layer and a fully connected layer: in the CNN layer, the global representation of the secondary spectrogram passes through a maximum pooling layer, two layers of convolution with three-dimensional convolution kernels, and another maximum pooling layer, and is then input into the fully connected layer to obtain secondary spectrogram global features of a specific vector size.

In an example of the scheme, the global representation of the secondary spectrogram is a vector of size (5, 150000), where 5 is the retention time dimension and 150000 is the mass-to-charge ratio dimension; after passing through the encoder, it yields secondary spectrogram global features of size (16, 256).

The attention decoder comprises a position coding module, a peptide fragment sequence self-attention mechanism module and a spectrogram-sequence attention mechanism connected in sequence. The position coding module comprises an embedding layer and a position coding layer; the peptide fragment sequence is first input into the position coding module, which encodes the sequence positions and outputs the sequence position codes.

The peptide fragment sequence self-attention mechanism module in the attention decoder is used to mine relationships within the peptide fragment sequence. It adopts a scaled dot-product attention mechanism, which maps a query and a set of key-value pairs to an output. Because the peptide fragment sequencing task is similar to the text generation task in natural language processing, the amino acid at the current position can only be predicted from the amino acids already predicted. Let $Q_j$ be the Query vector at the j-th position, $K_{\le j}$ the Key vectors up to the j-th position, and $V_{\le j}$ the Value vectors up to the j-th position. When the attention mechanism is calculated, $Q_j$ computes an attention weight matrix only with $K_{\le j}$, and the resulting attention weight matrix is multiplied by $V_{\le j}$ to obtain the vector representation of the amino acid at position j, as in formula (1), where $d_k$ is the dimension of the Key vectors and prevents the scores from becoming too large:

$$\mathrm{Attention}(Q_j, K_{\le j}, V_{\le j}) = \mathrm{softmax}\left(\frac{Q_j K_{\le j}^{\top}}{\sqrt{d_k}}\right) V_{\le j} \qquad (1)$$

That is, the query vector of the sequence position code at the current position and the key vectors of the positions up to it are used to calculate the attention weights, which are then multiplied by the value vectors of those positions to obtain the vector representation of the current position; traversing every position yields the peptide fragment vector features.
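The position-by-position calculation described above can be illustrated with a small pure-Python sketch of scaled dot-product attention restricted to positions up to j. It is a toy under stated assumptions (a single head, and Query, Key and Value supplied directly as lists with no learned projections), not the model's implementation; all names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(Q: List[Vector], K: List[Vector],
                          V: List[Vector]) -> List[Vector]:
    """For each position j, attend only to keys and values at positions <= j."""
    d_k = len(K[0])
    outputs = []
    for j, q_j in enumerate(Q):
        # Scaled dot products of Q_j with the keys up to position j.
        scores = [sum(a * b for a, b in zip(q_j, K[i])) / math.sqrt(d_k)
                  for i in range(j + 1)]
        weights = softmax(scores)
        # Weighted sum of the value vectors up to position j.
        outputs.append([sum(weights[i] * V[i][c] for i in range(j + 1))
                        for c in range(len(V[0]))])
    return outputs


Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(causal_self_attention(Q, K, V)[0])  # [1.0, 2.0]
```

The first position can only see itself, so its output equals its own value vector; later value vectors never influence earlier outputs.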

The spectrogram-peptide fragment attention mechanism in the attention decoder calculates the associated features between the secondary spectrogram global features and the peptide fragment vector features. Specifically, the secondary spectrogram global features contain the information of all peptide fragments; by means of them, the spectrogram-peptide fragment attention mechanism attends to different regions of the secondary spectrogram and assigns different weights to those regions.

Specifically, as shown in FIG. 3, the query vector in the spectrogram-peptide fragment attention mechanism is the peptide fragment vector feature, while the key vector and the value vector are both the secondary spectrogram global features. The calculation is given by formula (2):

$$\mathrm{Attention}(Q_j, K_{enc}, V_{enc}) = \mathrm{softmax}\left(\frac{Q_j K_{enc}^{\top}}{\sqrt{d_k}}\right) V_{enc} \qquad (2)$$

where $Q_j$ is the Query vector at the j-th position, $K_{enc}$ and $V_{enc}$ are both the secondary spectrogram global features, and $d_k$ is the dimension of the Key vectors, which prevents the scores from becoming too large.

Illustratively, when amino acid G is being predicted, the Query vector is the output of the already predicted amino acid sequence ELSGSSPVLE through the self-attention mechanism module, and both Key and Value are the secondary spectrogram global features.
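How one query assigns different weights to different regions of the spectrogram can be sketched for a single decoding step. The sketch assumes that each row of the encoder output represents one region and that the query has the same dimension as those rows; the patent does not state either point, and the two-region example values are invented.

```python
import math
from typing import List, Tuple

Vector = List[float]


def spectrum_peptide_attention(
    q_j: Vector, K_enc: List[Vector], V_enc: List[Vector]
) -> Tuple[Vector, Vector]:
    """Return (weights over encoder rows, attended context vector)."""
    d_k = len(K_enc[0])
    scores = [sum(a * b for a, b in zip(q_j, k)) / math.sqrt(d_k)
              for k in K_enc]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Unlike self-attention, every encoder row is visible to the query.
    context = [sum(w * v[c] for w, v in zip(weights, V_enc))
               for c in range(len(V_enc[0]))]
    return weights, context


# Two toy "regions"; the query is aligned with the first one.
weights, context = spectrum_peptide_attention(
    [1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]], [[1.0, 0.0], [0.0, 1.0]])
print(weights[0] > weights[1])  # True
```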

The intensity decoder comprises a local CNN, which models the association between the theoretically possible intensity value features at each prediction step and the secondary fragment ion spectrogram. Specifically, the candidate intensity values formed from the secondary spectrogram and the candidate ions of the already predicted partial sequence are input into the local CNN of the intensity decoder for convolutional feature extraction: the local CNN applies 3 three-dimensional convolution kernels, followed by a maximum pooling layer and a fully connected layer for dimension mapping, to obtain the candidate intensity features. The theoretically possible candidate intensity values corresponding to a peptide fragment characteristic are expressed as a vector of size (26, 8, 5, 10), where 26 covers the 23 candidate amino acid masses and 3 sequence identifiers (PAD: padding identifier; BOS: start identifier; EOS: end identifier), 8 is the number of fragment ion species that can theoretically be produced, 5 is the retention time dimension, and 10 is the error tolerance window. In the scheme, the resulting candidate intensity features have size (1, 512).

It is worth mentioning that in conventional peptide fragment sequencing methods, the decoding stage is usually unidirectional, i.e., from left to right (forward) or from right to left (backward), and the peptide fragment with the highest score among the forward-predicted and backward-predicted peptide fragments is finally selected as the output. Unidirectional decoding is consistent with the logic of generation: like writing, predicting the amino acid at the t-th position uses the information of the amino acids already generated up to position t-1, proceeding in order from front to back or from back to front. However, such decoding suffers from output imbalance and cannot make full use of the information available from the opposite decoding direction: the first few amino acids predicted by the model are highly accurate, and the prediction accuracy drops as the sequence grows. Meanwhile, because of the mass constraint on the peptide fragment sequence, an error in the current amino acid prediction affects all predictions after the current position, so deviation accumulates. The scheme therefore adopts bidirectional independent prediction and bidirectional interactive prediction to alleviate the problems of deviation accumulation and output imbalance.

As shown in FIG. 4, the bidirectional independent sequencing model predicts forward and backward independently and traverses all intermediate positions as junction points to form the bidirectional independent prediction candidate sequences, each of which must ensure that the sum of its amino acid masses equals the mass of the precursor ion. Bidirectional independent prediction is equivalent to adding candidates before the peptide fragment with the highest score is finally selected; these candidates have high confidence in both the first and the last amino acids, so the probability of predicting the peptide fragment sequence correctly also increases.

In the scheme, the sequence obtained by forward prediction is $\mathrm{forward\_peptide} = \{a_1, a_2, a_3, \ldots, a_m\}$ and the sequence obtained by backward prediction is $\mathrm{backward\_peptide} = \{b_1, b_2, b_3, \ldots, b_n\}$. The combined bidirectional independent prediction candidate sequence is then given by formula (3):

$$\mathrm{Bi\_independent\_candidate} = \{a_1, a_2, \ldots, a_i, b_j, \ldots, b_2, b_1\}, \quad i, j \in (1, \max(m, n)) \qquad (3)$$

However, the bidirectional independent prediction candidate set cannot be formed arbitrarily and must satisfy a constraint. During step-by-step decoding in peptide fragment sequencing, the sum of the amino acid masses obtained by decoding must equal the corresponding peptide fragment sequence mass in the peptide fragment characteristics. Therefore, when the candidate set is formed, each candidate sequence must lie within the error range of the peptide fragment sequence mass, as shown in formula (4):

$$\left|\mathrm{mass}\left(\mathrm{Bi\_independent\_candidate}^{k}\right) - \mathrm{mass}(\mathrm{precursor})\right| \le \delta \qquad (4)$$

where $\mathrm{Bi\_independent\_candidate}^{k}$ is the k-th bidirectional independent prediction candidate sequence, mass(precursor) is the mass of the peptide fragment sequence in the peptide fragment characteristics, and $\delta$ is the allowed absolute value of the mass deviation.
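The junction traversal and mass filter can be sketched as below. This is a toy under stated assumptions: only six residues with their standard monoisotopic masses are included, the peptide mass is taken as the residue sum plus one water, the tolerance `delta` of 0.01 Da is an arbitrary illustrative value, and the backward prediction is written starting from the last amino acid (b_1 first). Function names are hypothetical.

```python
from typing import List

# Monoisotopic residue masses in Da for a small illustrative subset.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "P": 97.05276, "V": 99.06841, "L": 113.08406}
WATER = 18.01056


def peptide_mass(sequence: str) -> float:
    """Neutral peptide mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def combine_bidirectional(forward: str, backward: str,
                          precursor_mass: float,
                          delta: float = 0.01) -> List[str]:
    """Join a forward prefix a_1..a_i with a reversed backward prefix
    b_j..b_1 at every junction and keep the mass-consistent candidates."""
    candidates = set()
    for i in range(1, len(forward) + 1):
        for j in range(1, len(backward) + 1):
            candidate = forward[:i] + backward[:j][::-1]
            if abs(peptide_mass(candidate) - precursor_mass) <= delta:
                candidates.add(candidate)
    return sorted(candidates)


# Forward is right for its first three residues, backward for its first
# three (counted from the C-terminus); joining the good halves wins.
target = peptide_mass("GASPVL")
print(combine_bidirectional("GASLLL", "LVPAAA", target))  # ['GASPVL']
```

In the example neither single-direction prediction is fully correct, yet the combination recovers the target sequence because each direction contributes the end it predicts most reliably.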

Then, introducing a bidirectional interactive sequencing model, wherein the bidirectional interactive sequencing model comprises an encoder, an attention decoder and an intensity decoder, and the secondary fragment ion spectrogram is input into the encoder to be encoded to obtain the global characteristics of the secondary spectrogram; the predicted forward and backward peptide fragment sequences are synchronously input into a position coding module in the attention decoder to obtain sequence position codes, the sequence position codes are input into a bidirectional synchronous self-attention mechanism module in the attention decoder to output peptide fragment vector characteristics, the peptide fragment vector characteristics and the secondary spectrogram global characteristics are jointly input into a spectrogram-sequence attention mechanism to obtain associated characteristics, candidate intensity characteristics are output after the corresponding forward and backward secondary fragment ion spectrograms are input into the intensity decoder, the candidate intensity characteristics and the associated characteristics are spliced to obtain splicing characteristics, and the splicing characteristics are mapped into bidirectional interactive prediction candidate sequences.

The bidirectional interactive sequencing model differs from the bidirectional independent sequencing model as follows. In the bidirectional independent sequencing model, the forward and backward predictions each infer the next amino acid only from the amino acids that have already been predicted (historical information). Bidirectional interactive prediction considers not only historical information but also future information, so that the forward and backward prediction sequences converge better at some position in the middle and the candidate sequences are dynamically expanded. In bidirectional independent prediction, some combined candidate peptide fragments are discarded because they cannot meet the mass requirement of the peptide fragment sequence; bidirectional interactive prediction gives the opposite direction a guiding role in peptide fragment sequence prediction, so that more candidate peptide fragments meeting the mass requirement can be generated.

On the basis of the bidirectional independent sequencing model, the scheme replaces the peptide fragment self-attention mechanism module with a bidirectional synchronous self-attention mechanism module, so that the bidirectional interactive sequencing model uses a decoder in which forward and backward predictions are decoded simultaneously and interactively. Specifically, the operation of the self-attention mechanism itself is unchanged, but the query vector (Query), the key vector (Key) and the value vector (Value) are represented by bidirectional matrices. The calculation of the bidirectional synchronous attention mechanism then includes not only the calculation among the Query, Key and Value of the same direction, but also the calculation between the Query and the Key and Value of the opposite direction. Finally, the hidden layer vectors obtained from the same-direction and opposite-direction calculations are linearly transformed and then spliced to obtain the peptide fragment vector characteristics, which are used as the input of the next layer of the decoder.

Specifically, as shown in fig. 6, the Query, Key and Value represented by bidirectional matrices are input into the bidirectional synchronous self-attention mechanism module. The Query is calculated with the same-direction Key and Value to obtain the same-direction hidden layer vector, the Query is calculated with the opposite-direction Key and Value to obtain the opposite-direction hidden layer vector, and the two hidden layer vectors are spliced to obtain the peptide fragment vector characteristics.

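The specific calculation formulas are shown in formulas (5) to (8):

$$\overrightarrow{H}^{history} = \mathrm{Attention}(\overrightarrow{Q}, \overrightarrow{K}, \overrightarrow{V}) \tag{5}$$

$$\overrightarrow{H}^{future} = \mathrm{Attention}(\overrightarrow{Q}, \overleftarrow{K}, \overleftarrow{V}) \tag{6}$$

$$\overleftarrow{H}^{history} = \mathrm{Attention}(\overleftarrow{Q}, \overleftarrow{K}, \overleftarrow{V}) \tag{7}$$

$$\overleftarrow{H}^{future} = \mathrm{Attention}(\overleftarrow{Q}, \overrightarrow{K}, \overrightarrow{V}) \tag{8}$$

wherein $\overrightarrow{Q}$ is the Query vector in forward prediction, and $\overrightarrow{V}$ and $\overrightarrow{K}$ are the Value vector and Key vector of forward prediction, respectively. $\overleftarrow{Q}$ is the Query vector in backward prediction, and $\overleftarrow{V}$ and $\overleftarrow{K}$ are the Value vector and Key vector of backward prediction, respectively. $\overrightarrow{H}$ is the hidden layer vector output in forward prediction, and $\overleftarrow{H}$ is the hidden layer vector output in backward prediction. $\overrightarrow{H}^{history}$ is the hidden layer vector obtained in forward prediction using only historical prediction information; $\overrightarrow{H}^{future}$ is the hidden layer vector obtained in forward prediction using only the future information of backward prediction; $\overleftarrow{H}^{history}$ is the hidden layer vector obtained in backward prediction using only historical prediction information; and $\overleftarrow{H}^{future}$ is the hidden layer vector obtained in backward prediction using only future information.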

The hidden layer vector output in forward prediction and the hidden layer vector output in backward prediction are given by formulas (9) to (11):

$$\mathrm{ReLU}(x) = \max(0, x) \tag{11}$$

wherein the total hidden layer vector output in forward prediction is a combination of the two forward hidden layer vectors, namely the one obtained from historical prediction information and the one obtained from future information: one of them is passed through the activation function and then linearly combined with the other. The total hidden layer vector output in backward prediction is the corresponding combination of the two backward hidden layer vectors: one of them is passed through the activation function and then linearly combined with the other.
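The same-direction and opposite-direction attention calculations described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, scaled dot-product attention is used without any causal mask, and the combination rule (history vector plus a weight `lam` times the ReLU of the future vector) is an assumed form, since the exact combination formulas are not reproduced in the text.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors (no masking)."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out


def relu(x):
    return max(0.0, x)


def combine(hist, fut, lam):
    # Assumed combination: history + lam * ReLU(future), element-wise.
    return [[h + lam * relu(u) for h, u in zip(hr, fr)]
            for hr, fr in zip(hist, fut)]


def sync_bidirectional_attention(Qf, Kf, Vf, Qb, Kb, Vb, lam=0.5):
    h_f_hist = attention(Qf, Kf, Vf)  # forward query, forward keys/values
    h_f_fut = attention(Qf, Kb, Vb)   # forward query, backward keys/values
    h_b_hist = attention(Qb, Kb, Vb)  # backward query, backward keys/values
    h_b_fut = attention(Qb, Kf, Vf)   # backward query, forward keys/values
    return combine(h_f_hist, h_f_fut, lam), combine(h_b_hist, h_b_fut, lam)


Qf, Kf, Vf = [[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
Qb, Kb, Vb = [[0.0, 1.0]], [[0.0, 1.0]], [[-1.0, 5.0]]
h_forward, h_backward = sync_bidirectional_attention(Qf, Kf, Vf, Qb, Kb, Vb, lam=1.0)
print(h_forward, h_backward)
```

Setting `lam` to zero in this sketch reduces each direction to ordinary self-attention over its own history, which corresponds to the behaviour of the bidirectional independent model.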

In addition, as shown in fig. 7, since dynamic beam search is involved in model prediction, the calculation of the bidirectional synchronous self-attention mechanism module needs to be dynamically adjusted. In addition, the original unidirectional prediction is that the size of the beam search is k, and the size of the bidirectional interactive prediction beam search is k because the forward prediction and the backward prediction need to be decoded simultaneously

While the beam search is not finished, there is also the problem of how finished candidate sequences and unfinished candidate sequences correspond to each other when the bidirectional synchronous self-attention is calculated. The invention adopts a strict correspondence by score ranking: the finished and unfinished candidate sequences in the forward prediction and in the backward prediction are each arranged in order of score from high to low, and the arranged peptide fragment sequences are placed in one-to-one correspondence.
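The score-ranked one-to-one correspondence described above can be sketched as follows. The function name and the representation of a beam as a list of (sequence, score) pairs are assumptions made for illustration.

```python
def pair_beams(forward_beam, backward_beam):
    """Pair forward and backward candidates by score rank (highest first).

    Each beam is a list of (partial_sequence, score) tuples; finished and
    unfinished candidates are ranked together within their own direction.
    """
    fwd = sorted(forward_beam, key=lambda c: c[1], reverse=True)
    bwd = sorted(backward_beam, key=lambda c: c[1], reverse=True)
    return list(zip(fwd, bwd))


forward = [("PE", -0.9), ("PK", -0.2)]
backward = [("ED", -0.5), ("EE", -0.1)]
pairs = pair_beams(forward, backward)
print(pairs)
```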

In the "reordering" step, the candidate with the highest score is selected from the final candidate sequences as the prediction result. Specifically, the scheme inputs the final candidate sequences into the bidirectional independent sequencing model again for scoring, as shown in formula (12):

$$\mathrm{score} = \mathrm{score}_{forward} + \mathrm{score}_{backward} \tag{12}$$

wherein $\mathrm{score}_{forward}$ is the forward prediction score and $\mathrm{score}_{backward}$ is the backward prediction score; the peptide fragment sequence with the highest sum of the two is used as the final prediction result.

The forward prediction score and the backward prediction score are each obtained by adding the prediction scores of all positions of the peptide fragment sequence, and the prediction score of each position is the logsoftmax probability of the prediction result at that position. The calculation formulas are shown as (13) and (14):

$$\mathrm{score}_{forward} = \sum_{i} \log \mathrm{softmax}_{forward}(x_i) \tag{13}$$

$$\mathrm{score}_{backward} = \sum_{i} \log \mathrm{softmax}_{backward}(x_i) \tag{14}$$

wherein $x_i$ is the amino acid at the i-th position of the peptide fragment sequence.
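The scoring used in the reordering step can be sketched as follows. The function names and the toy logits are hypothetical; in the scheme the per-position values would come from the forward and backward passes of the bidirectional independent sequencing model.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def sequence_score(step_logits, token_ids):
    """Sum of per-position log-softmax values of the chosen amino acids."""
    return sum(log_softmax(logits)[t]
               for logits, t in zip(step_logits, token_ids))


def rerank(candidates):
    """candidates maps a sequence to (forward_score, backward_score)."""
    return max(candidates, key=lambda s: candidates[s][0] + candidates[s][1])


uniform_score = sequence_score([[0.0, 0.0], [0.0, 0.0]], [0, 1])
best = rerank({"PEPTIDE": (-1.0, -1.5), "PEPTLDE": (-0.9, -2.0)})
print(uniform_score, best)
```

In this toy input the second candidate has the better forward score, but the first candidate has the better sum, which is why both directions are scored before the final choice.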

Example two

Of course, the bidirectional peptide fragment sequencing model mentioned in the scheme is obtained by training a training sample of the labeled peptide fragment sequence and spectrogram. Correspondingly, the scheme provides a construction method of a bidirectional peptide fragment sequencing model, which comprises the following steps:

obtaining, as training samples, mass spectrum data consisting of secondary fragment ion spectrograms labeled with peptide fragment sequences;

Inputting mass spectrum data into a bidirectional peptide fragment sequencing model for training, wherein the bidirectional peptide fragment sequencing model comprises a parallel bidirectional independent sequencing model and a bidirectional interactive sequencing model,

the bidirectional independent sequencing model comprises an encoder, an attention decoder and an intensity decoder, and the secondary fragment ion spectrogram is input into the encoder to be encoded to obtain the global features of the secondary spectrogram; the predicted forward and backward peptide fragment sequences are independently input into a position coding module in the attention decoder to obtain sequence position codes, the peptide fragment sequences input into the attention decoder output peptide fragment vector characteristics from an attention mechanism module, the peptide fragment vector characteristics and the secondary spectrogram global characteristics are jointly input into a spectrogram-sequence attention mechanism to obtain associated characteristics, candidate intensity characteristics are output after the corresponding forward and backward secondary fragment ion spectrograms are input into the intensity decoder, the candidate intensity characteristics and the associated characteristics are spliced to obtain splicing characteristics, and the splicing characteristics are mapped into bidirectional independent prediction candidate sequences;

the bidirectional interactive sequencing model comprises an encoder, an attention decoder and an intensity decoder, and the secondary fragment ion spectrogram is input into the encoder to be encoded to obtain secondary spectrogram global features; the predicted forward and backward peptide fragment sequences are synchronously input into a position coding module in the attention decoder to obtain sequence position codes, the sequence position codes are input into a bidirectional synchronous self-attention mechanism module in the attention decoder to output peptide fragment vector characteristics, the peptide fragment vector characteristics and the secondary spectrogram global characteristics are jointly input into a spectrogram-sequence attention mechanism to obtain associated characteristics, the secondary fragment ion spectrograms corresponding to the forward direction and the backward direction are input into the intensity decoder to output candidate intensity characteristics, the candidate intensity characteristics and the associated characteristics are spliced to obtain splicing characteristics, and the splicing characteristics are mapped into bidirectional interactive prediction candidate sequences;

taking the union of the bidirectional independent prediction candidate sequences and the bidirectional interactive prediction candidate sequences as the final candidate sequences; and inputting the final candidate sequences into the bidirectional independent sequencing model again for scoring, and selecting the peptide fragment sequence with the highest score as the prediction result.

The structure and content of the bidirectional peptide fragment sequencing model are the same as those of the first embodiment, and the description is not repeated.

Example three

The scheme not only designs a set of bidirectional peptide fragment sequencing method based on a self-attention mechanism, but also provides a corresponding peptide fragment de novo sequencing evaluation index, and compared with the existing evaluation index, the scheme more comprehensively considers the conditions of fragment ion deletion and peptide fragment sequence amino acid dislocation of a secondary fragment ion spectrogram.

In other words, the conventional evaluation indexes are the proportion of correctly predicted amino acids and the proportion of correctly predicted peptide fragment sequences among all prediction results of the model. This leads to the following two problems: (1) the misalignment between the predicted sequence and the reference sequence is not reasonably measured; (2) because some fragment ions are missing from the secondary spectrum, the predicted order of some amino acids is reversed, and such reversal of amino acid order cannot be reasonably evaluated.

Therefore, the scheme proposes two new evaluation indexes to evaluate the quality of the model, namely a position BLEU index and an alignment score index. The position BLEU index borrows the idea of BLEU, a common index in neural machine translation, and adds a distance weight; the alignment score index introduces a pairwise sequence alignment algorithm to calculate the similarity between the predicted sequence and the target sequence.

Regarding the location BLEU indicator: the index is correspondingly improved on the idea of BLEU-1gram, a distance weight is added for matching of each position to solve the problem that the length of a prediction sequence is different from that of a reference sequence, and specific calculation formulas are shown as formulas (15) to (18):

$$\mathrm{dist\_forward}(aaid) = |\mathrm{trg\ index}(aaid) - \mathrm{pre\ index}(aaid)|, \quad index = (0, 1, 2, \ldots, n) \tag{17}$$

$$\mathrm{dist\_backward}(aaid) = |\mathrm{trg\ index}(aaid) - \mathrm{pre\ index}(aaid)|, \quad index = (n, n-1, \ldots, 0) \tag{18}$$

wherein Position-BLEU is the position BLEU index; aaid is the amino acid at position j of the i-th predicted sequence; dist_forward(aaid) is computed by counting from front to back as the absolute value of the difference between the position of the amino acid in the target sequence and its position in the predicted sequence, and is recorded as the forward distance of the amino acid match; dist_backward(aaid) is computed in the same way by counting from back to front, and is recorded as the backward distance of the amino acid match; trg index(aaid) is the position of the amino acid in the target sequence, and pre index(aaid) is the position of the amino acid in the predicted sequence.

The scheme calculates the sum of the forward distance and the backward distance of the amino acid match in order to prevent the score from being lowered by misalignment when the lengths of the predicted sequence and the target sequence differ. After the sum of the forward and backward distances is calculated, the minimum value is taken, so that only the distance to the nearest identical amino acid in the reference sequence is considered.
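The forward and backward match distances can be illustrated with a minimal sketch. The function name is hypothetical, positions are zero-based, and only the distance term is shown; how the distance is turned into a weight inside the position BLEU score is not reproduced here.

```python
def match_distance(pred, target, j):
    """Smallest forward-plus-backward distance for the amino acid pred[j].

    Returns None when the amino acid does not occur in the target sequence.
    """
    m, n = len(pred), len(target)
    best = None
    for t, aa in enumerate(target):
        if aa != pred[j]:
            continue
        dist_forward = abs(t - j)                       # counted from the front
        dist_backward = abs((n - 1 - t) - (m - 1 - j))  # counted from the back
        total = dist_forward + dist_backward
        if best is None or total < best:
            best = total
    return best


print([match_distance("PEPTIDE", "PEPTIDE", j) for j in range(7)])
print(match_distance("EPTIDE", "PEPTIDE", 0))
```

In the second call the predicted sequence is one residue shorter than the target, so the first amino acid is shifted by one position when counted from the front but is aligned when counted from the back, giving a total distance of 1.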

Regarding the alignment score index: the index draws on the Smith-Waterman algorithm for pairwise local sequence alignment and uses dynamic programming to search for locally similar regions. The basic idea of the algorithm is as follows: the similarity scores of the two sequences are calculated iteratively and stored in a score matrix, and the optimal alignment is then found by backtracking through the score matrix. The specific calculation formula is shown as formula (19):

In the above formula, s(x_i, y_j) is the substitution score; BLOSUM62, which is derived from the co-occurrence frequencies of amino acids, is typically used as the amino acid substitution matrix; and d is the gap penalty of the sequence alignment. Since the candidate amino acids are deduced from the amino acid masses in the sequencing step, the present invention generates an amino acid mass substitution matrix (mass_matrix) to replace s(x_i, y_j) in formula (19) during alignment. F(i, j) represents the alignment score of the target sequence and the predicted sequence.

Specifically, the amino acid masses are arranged from small to large, and the difference between the maximum mass and the minimum mass is divided into ten equal parts to obtain the amino acid mass interval corresponding to each score, as shown in formula (20):

$$\mathrm{per\_mass} = \frac{\max(\mathrm{mass\_AAid}) - \min(\mathrm{mass\_AAid})}{10} \tag{20}$$

wherein mass_AAid represents the mass of an amino acid.

When the mass of the amino acid pre_i at the i-th position of the predicted sequence equals the mass of the amino acid trg_j at the j-th position of the target sequence, the score of that position is the full score of 10. When the masses are not equal, the alignment score is negative and its specific value is determined by the mass difference. The match score is calculated as shown in formula (21):

The greater the difference in mass, the greater the penalty. The gap penalty is set to -1. The amino acid mass substitution matrix is a lower triangular matrix whose diagonal elements are all 10 and whose remaining elements are negative scores.
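The alignment score can be sketched with the standard Smith-Waterman recurrence and a mass-based substitution score. Several points are assumptions made for illustration: the function names are hypothetical, the residue masses are standard monoisotopic values, the gap penalty of -1 is added to the neighbouring cell, and the mismatch rule (minus the number of mass intervals spanned, rounded up) is an assumed form because the match score formula is not reproduced in the text.

```python
import math

# Standard monoisotopic residue masses (Da) of the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
# Mass interval: one tenth of the range between the largest and smallest mass.
PER_MASS = (max(RESIDUE_MASS.values()) - min(RESIDUE_MASS.values())) / 10
GAP = -1


def mass_score(a, b):
    """10 for equal masses; otherwise an assumed negative interval count."""
    diff = abs(RESIDUE_MASS[a] - RESIDUE_MASS[b])
    if diff < 1e-6:
        return 10
    return -math.ceil(diff / PER_MASS)


def alignment_score(pred, target):
    """Best local alignment score between the predicted and target sequences."""
    rows, cols = len(pred) + 1, len(target) + 1
    F = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                0,
                F[i - 1][j - 1] + mass_score(pred[i - 1], target[j - 1]),
                F[i - 1][j] + GAP,
                F[i][j - 1] + GAP,
            )
            best = max(best, F[i][j])
    return best


print(alignment_score("PEPTIDE", "PEPTIDE"))
print(alignment_score("PEPIDE", "PEPTIDE"))
```

Because leucine and isoleucine have the same mass, a prediction that swaps them receives the full score under this substitution rule, which matches the fact that the two residues cannot be told apart by mass alone.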

Example four

Based on the same concept, referring to fig. 3, the present application also proposes a bidirectional peptide fragment testing device based on the self-attention mechanism, which comprises the following components:

a feature extraction and preprocessing unit: the system is used for acquiring mass spectrum original data and processing the mass spectrum original data to obtain a peptide fragment characteristic and a secondary fragment ion spectrogram related to the peptide fragment characteristic;

The technical content of this embodiment that is the same as that of the first embodiment is not described again.

Example five

The present embodiment further provides an electronic device, referring to fig. 4, comprising a memory 404 and a processor 402, wherein the memory 404 stores a computer program, and the processor 402 is configured to execute the computer program to perform the steps of any of the above embodiments of the bidirectional peptide sequencing method based on the self-attention mechanism.

Specifically, the processor 402 may include a central processing unit (CPU) or an application-specific integrated circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.

The memory 404 may include mass storage for data or instructions. By way of example and not limitation, the memory 404 may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 404 may include removable or non-removable (fixed) media, where appropriate. The memory 404 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 404 is a non-volatile memory. In particular embodiments, the memory 404 includes read-only memory (ROM) and random-access memory (RAM). Where appropriate, the ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), flash memory, or a combination of two or more of these. The RAM may be static random-access memory (SRAM) or dynamic random-access memory (DRAM), where the DRAM may be fast page mode DRAM (FPMDRAM), extended data output DRAM (EDODRAM), synchronous DRAM (SDRAM), or the like.

Memory 404 may be used to store or cache various data files needed for processing and/or communication purposes, as well as possibly computer program instructions executed by processor 402.

Processor 402 reads and executes computer program instructions stored in memory 404 to implement any of the above-described embodiments of the bidirectional peptide sequencing method.

Optionally, the electronic apparatus may further include a transmission device 406 and an input/output device 408, where the transmission device 406 is connected to the processor 402, and the input/output device 408 is connected to the processor 402.

The transmission device 406 may be used to receive or transmit data via a network. Specific examples of the network may include a wired or wireless network provided by a communication provider of the electronic device. In one example, the transmission device 406 includes a network interface controller (NIC) that can be connected to other network devices through a base station so as to communicate with the internet. In another example, the transmission device 406 may be a radio frequency (RF) module used to communicate with the internet wirelessly.

The input and output devices 408 are used to input or output information. In this embodiment, the input information may be mass spectrum data, etc., and the output information may be a peptide fragment sequence, etc.

Optionally, in this embodiment, the processor 402 may be configured to execute the following steps by a computer program:

acquiring mass spectrum original data, and processing the mass spectrum original data to obtain a peptide fragment characteristic and a secondary fragment ion spectrogram related to the peptide fragment characteristic;

inputting the peptide fragment characteristics and the secondary fragment ion spectrogram into a bidirectional independent sequencing model in a bidirectional peptide fragment sequencing model to output bidirectional independent prediction candidate sequences, inputting the peptide fragment characteristics and the secondary fragment ion spectrogram into the bidirectional interactive sequencing model in the bidirectional peptide fragment sequencing model to output bidirectional interactive prediction candidate sequences, and taking a union set of the bidirectional independent prediction candidate sequences and the bidirectional interactive prediction candidate sequences as a final candidate sequence;

and inputting the final candidate sequence into the bidirectional independent sequencing model again for scoring, and selecting the peptide fragment sequence with the highest score as a prediction result.

It should be noted that, for specific examples in this embodiment, reference may be made to examples described in the foregoing embodiments and optional implementations, and details of this embodiment are not described herein again.

In general, the various embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

Embodiments of the invention may be implemented by computer software executable by a data processor of the mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products) including software routines, applets and/or macros can be stored in any device-readable data storage medium and they include program instructions for performing particular tasks. The computer program product may comprise one or more computer-executable components configured to perform embodiments when the program is run. The one or more computer-executable components may be at least one software code or a portion thereof. Further in this regard it should be noted that any block of the logic flow as in the figures may represent a program step, or an interconnected logic circuit, block and function, or a combination of a program step and a logic circuit, block and function. The software may be stored on physical media such as memory chips or memory blocks implemented within the processor, magnetic media such as hard or floppy disks, and optical media such as, for example, DVDs and data variants thereof, CDs. The physical medium is a non-transitory medium.

It should be understood by those skilled in the art that the technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features contains no contradiction, it should be considered to fall within the scope of this specification.

The above examples are merely illustrative of several embodiments of the present application, and the description is more specific and detailed, but not to be construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims

1. A bidirectional peptide fragment sequencing method based on a self-attention mechanism is characterized by comprising the following steps: feature extraction and pretreatment: acquiring mass spectrum original data, and processing the mass spectrum original data to obtain a peptide fragment characteristic and a secondary fragment ion spectrogram related to the peptide fragment characteristic;

reordering: inputting the final candidate sequence into the bidirectional peptide fragment sequencing model again for scoring, and selecting the peptide fragment sequence with the highest score as a prediction result.

2. The self-attention mechanism-based bidirectional peptide fragment sequencing method of claim 1, wherein the bidirectional independent sequencing model comprises an encoder, an attention decoder and an intensity decoder, and the secondary fragment ion spectrogram is input into the encoder to be encoded to obtain secondary spectrogram global features; the predicted forward and backward peptide fragment sequences are independently input into a position coding module in the attention decoder to obtain sequence position codes, the peptide fragment sequences input into the attention decoder output peptide fragment vector characteristics from an attention mechanism module, the peptide fragment vector characteristics and the secondary spectrogram global characteristics are jointly input into a spectrogram-sequence attention mechanism to obtain associated characteristics, candidate intensity characteristics are output after the corresponding forward and backward secondary fragment ion spectrograms are input into the intensity decoder, the candidate intensity characteristics and the associated characteristics are spliced to obtain splicing characteristics, and the splicing characteristics are mapped into bidirectional independent prediction candidate sequences.

3. The bidirectional peptide fragment sequencing method based on self-attention mechanism of claim 2, wherein the peptide fragment sequence self-attention mechanism module employs a scaled dot-product attention mechanism that maps a query and a set of key-value pairs to an output.

4. The bidirectional peptide fragment sequencing method based on self-attention mechanism as claimed in claim 2, wherein the query vector in the spectrogram-sequence attention mechanism is the peptide fragment vector feature, and the key vector and the value vector are the secondary spectrogram global features.

5. The self-attention mechanism-based bidirectional peptide fragment sequencing method of claim 1, wherein the predicted forward peptide fragment sequence is forward predicted in a bidirectional independent sequencing model to obtain bidirectional independent prediction candidate sequences, the backward peptide fragment sequence is backward predicted in the bidirectional independent sequencing model to obtain bidirectional independent prediction candidate sequences, the bidirectional independent prediction candidate sequences are obtained by summarizing, and the sum of the amino acid masses of the bidirectional independent prediction candidate sequences is ensured to be equal to the mass of the precursor ions.

6. The self-attention mechanism-based bidirectional peptide fragment sequencing method of claim 1, wherein the bidirectional interactive sequencing model comprises an encoder, an attention decoder and an intensity decoder, and the secondary fragment ion spectrogram is input into the encoder to be encoded to obtain a secondary spectrogram global feature; the predicted forward and backward peptide fragment sequences are synchronously input into a position coding module in the attention decoder to obtain sequence position codes, the sequence position codes are input into a bidirectional synchronous self-attention mechanism module in the attention decoder to output peptide fragment vector characteristics, the peptide fragment vector characteristics and the secondary spectrogram global characteristics are jointly input into a spectrogram-sequence attention mechanism to obtain associated characteristics, the secondary fragment ion spectrograms corresponding to the forward direction and the backward direction are input into the intensity decoder to output candidate intensity characteristics, the candidate intensity characteristics and the associated characteristics are spliced to obtain splicing characteristics, and the splicing characteristics are mapped into bidirectional interactive prediction candidate sequences.

7. The bidirectional peptide fragment sequencing method based on the self-attention mechanism is characterized in that a query vector, a key vector and a value vector are represented by bidirectional matrices, the query vector, the key vector and the value vector represented by the bidirectional matrices are input into the bidirectional synchronous self-attention mechanism module, the same-direction hidden layer vector is obtained by calculating the query vector with the key vector and the value vector of the same direction, the opposite-direction hidden layer vector is obtained by calculating the query vector with the key vector and the value vector of the opposite direction, and the peptide fragment vector characteristics are obtained by splicing the two hidden layer vectors.

8. The self-attention mechanism-based bidirectional peptide fragment sequencing method of claim 1, wherein the calculation of the bidirectional synchronous self-attention mechanism module is dynamically adjusted based on dynamic beam search, and when the beam search is not finished, the finished and unfinished candidate sequences in the forward prediction and the backward prediction are arranged in order, the arranged peptide fragment sequences being ranked from high to low by score and placed in one-to-one correspondence.

9. The bidirectional peptide fragment sequencing method based on the self-attention mechanism as claimed in claim 1, wherein in the reordering step, the peptide fragment sequence with the highest sum of the forward prediction score and the backward prediction score is taken as the prediction result, and the forward prediction score and the backward prediction score are each obtained by adding the prediction scores of all positions of the peptide fragment sequence, wherein the prediction score of each position of the peptide fragment sequence is the logsoftmax probability of the prediction result of that position.

10. The bidirectional peptide fragment sequencing method based on the self-attention mechanism is characterized in that the position BLEU index and the alignment score index are used for evaluating the quality of a predicted result.

11. A method for constructing a bidirectional peptide fragment sequencing model is characterized by comprising the following steps:

acquiring mass spectrum data of a spectrogram marked with a peptide fragment sequence and secondary fragment ions as a training sample, inputting the mass spectrum data into a bidirectional peptide fragment sequencing model for training, wherein the bidirectional peptide fragment sequencing model comprises a parallel bidirectional independent sequencing model and a bidirectional interactive sequencing model,

the bidirectional independent sequencing model comprises an encoder, an attention decoder and an intensity decoder, and the secondary fragment ion spectrogram is input into the encoder to be encoded to obtain secondary spectrogram global features; the predicted forward and backward peptide fragment sequences are respectively and independently input into a position coding module in the attention decoder to obtain sequence position codes, the peptide fragment sequences input into the attention decoder are output with peptide fragment vector characteristics from an attention mechanism module, the peptide fragment vector characteristics and the secondary spectrogram global characteristics are jointly input into a spectrogram-sequence attention mechanism to obtain associated characteristics, candidate intensity characteristics are output after the corresponding forward and backward secondary fragment ion spectrograms are input into the intensity decoder, the candidate intensity characteristics and the associated characteristics are spliced to obtain splicing characteristics, and the splicing characteristics are mapped into bidirectional independent prediction candidate sequences;

and taking the union of the bidirectional independent prediction candidate sequences and the bidirectional interactive prediction candidate sequences as the final candidate sequences, inputting the final candidate sequences into the bidirectional independent sequencing model again for scoring, and selecting the peptide fragment sequence with the highest score as the prediction result.

12. A bidirectional peptide fragment sequencing model, which is constructed by the construction method of the bidirectional peptide fragment sequencing model of claim 11.

13. A bidirectional peptide fragment testing device based on a self-attention mechanism is characterized by comprising the following components:

14. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the bidirectional peptide fragment sequencing method based on the self-attention mechanism of any one of claims 1 to 10 or the method for constructing the bidirectional peptide fragment sequencing model of claim 11.

15. A readable storage medium having stored therein a computer program comprising program code for controlling a processor to execute a process, the process comprising the bidirectional peptide fragment sequencing method based on the self-attention mechanism of any one of claims 1 to 10 or the method for constructing the bidirectional peptide fragment sequencing model of claim 11.