CN115458049B

CN115458049B - Method and device for predicting universal anti-citrullinated peptide antibody epitopes based on a bidirectional recurrent neural network

Info

Publication number: CN115458049B
Application number: CN202210752384.7A
Authority: CN
Inventors: 赵毅; 朱晨曦; 戴伦治; 陈桃; 孙蕊; 周珍; 许佳艺; 张梦玥
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2022-06-29
Filing date: 2022-06-29
Publication date: 2023-07-25
Anticipated expiration: 2042-06-29
Also published as: CN115458049A; WO2024001226A1

Abstract

The invention discloses a method and device for predicting universal anti-citrullinated peptide antibody (ACPA) epitopes based on a bidirectional recurrent neural network. The prediction method comprises the following steps: the peptide sequence to be tested is converted into a coded data set by encoding the physicochemical properties of the natural amino acids and citrulline, and feature extraction and result prediction are then carried out on the coded data set by a bidirectional recurrent neural network prediction model containing a self-attention mechanism. The method offers high prediction efficiency and high prediction accuracy, and can be further applied to antigen screening research for rheumatoid arthritis (RA) patients, thereby improving the diagnosis rate of RA patients and the capacity to prevent and treat RA.

Description

Method and device for predicting universal anti-citrullinated peptide antibody epitopes based on a bidirectional recurrent neural network

Technical Field

The invention belongs to the technical field of artificial-intelligence-based prediction of antigenic polypeptides, and particularly relates to a method for predicting universal anti-citrullinated peptide antibody (ACPA) epitopes.

Background

Rheumatoid arthritis (RA) is an autoimmune disease, and a variety of autoantibodies are often present in patients. Some of these, such as anti-cyclic citrullinated peptide (CCP) antibodies, have become common in the prior art for the early diagnosis of RA. In practice, however, CCP antibodies have been found to be only 50-75% sensitive for the diagnosis of RA, and in some special cases they may also be detected in other diseases such as chronic lung disease. In contrast, anti-citrullinated peptide antibodies (ACPA) have higher diagnostic specificity, typically over 90%.

Citrullination is a post-translational protein modification in which peptidylarginine deiminase (PAD), in the presence of calcium ions, catalyzes the conversion of the ketimine group of arginine to a ketone group, so that the residue loses its positive charge and becomes a neutral citrulline residue. The modification changes the molecular weight, charge and three-dimensional structure of the protein and imparts antigenicity to it, so that the polypeptide can be recognized by self-reactive CD4⁺ T cells, which in turn produce cytokines and induce B cells to differentiate, mature and secrete ACPA. During RA pathogenesis, ACPA eventually promotes bone destruction by stimulating the production of pro-inflammatory cytokines, inducing osteoclastogenesis and mediating osteoblast apoptosis. ACPA and the citrullinated polypeptide antigens that induce its production therefore play key roles in the aberrant immune activation of RA. Accordingly, accurately judging or predicting ACPA production is expected to enable more accurate RA diagnosis and prediction.

Under the combined action of genetic and environmental risk factors, the level of citrullination in RA patients is markedly higher than in healthy people, and a large number of citrullinated polypeptides that are up-regulated relative to healthy people can be detected in body fluids. Among these citrullinated polypeptides, only a very small fraction is truly antigenic, that is, able to induce the production of ACPA and to be bound by ACPA; these special polypeptides are called ACPA epitopes. ACPA epitopes differ according to the type of ACPA. Most are universal ACPA epitopes, which play a main role in the early stages of disease development and in the systemic immune reaction; the remaining small fraction are specific (private) ACPA epitopes, which act only in specific tissues. Screening for universal ACPA epitopes is therefore of great significance for accurately judging or predicting ACPA production, for further studying the early pathogenesis of RA, and for disease prediction. However, a computational method for predicting ACPA epitopes is still lacking in the prior art.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide a method for predicting universal ACPA epitopes using a bidirectional recurrent neural network and a central attention mechanism. The method predicts accurately and efficiently, and can be further applied to antigen screening research for RA patients.

The technical scheme of the invention is as follows:

A method for universal ACPA epitope prediction based on a bidirectional recurrent neural network, comprising:

S0, encoding a plurality of physicochemical properties of the 20 natural amino acids and citrulline as data, and normalizing the resulting coded data separately for each physicochemical property to obtain normalized coded data;

S1, converting a peptide sequence to be tested into a normalized coded data set according to the correspondence between the amino acids it contains and the normalized coded data;

S2, performing dimension reduction on the normalized coded data set to obtain a feature data set;

S3, inputting the feature data set into a trained prediction model to obtain a prediction result;

wherein the predictive model is constructed based on a bi-directional recurrent neural network containing a self-attention mechanism.
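As a minimal sketch of step S1, the snippet below maps each residue of a peptide to its normalized property vector. The one-letter symbol `Z` for citrulline, the two-property table and every numeric value are placeholders invented for illustration; they are not the encoding data of the invention.

```python
from typing import Dict, List

# Hypothetical normalized codes: residue -> [property_1, property_2].
# A real table would hold 10 normalized properties for each of 21 residues.
NORMALIZED_CODES: Dict[str, List[float]] = {
    "G": [0.00, 0.10],
    "A": [0.10, 0.20],
    "R": [0.95, 1.00],
    "Z": [1.00, 0.40],  # "Z" is an assumed symbol for citrulline
}


def encode_peptide(sequence: str, codes: Dict[str, List[float]]) -> List[List[float]]:
    """Return one normalized property vector per residue (step S1)."""
    unknown = sorted(set(sequence) - set(codes))
    if unknown:
        raise ValueError(f"no code for residues: {unknown}")
    return [list(codes[residue]) for residue in sequence]


encoded = encode_peptide("GAZRA", NORMALIZED_CODES)
print(len(encoded), len(encoded[0]))  # residues in the peptide, properties per residue
```

The result is a sequence-length by property-count matrix, which is the form a dimension-reduction layer and a recurrent network would consume.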

In this technical scheme, the physicochemical properties of citrulline and the 20 natural amino acids are normalized and encoded together, which better reflects how the interaction between citrulline and the adjacent amino acids on both sides affects the antigenicity of the final polypeptide. Moreover, antigen-antibody interaction arises from intermolecular forces such as hydrogen bonds and van der Waals forces, and these forces are strongly influenced by the physicochemical properties of each amino acid, such as its charge, mass and hydrogen bond donor count. Normalized encoding of the properties of all amino acids constituting the polypeptide can therefore reflect the characteristics of the polypeptide more realistically.

According to some preferred embodiments of the invention, the physicochemical properties include: the molar volume of the amino acid, isoelectric point (pI), molecular weight (Molecular Weight), n-octanol-water partition coefficient, number of hydrogen bond donors (Hydrogen Bond Donor Count), number of hydrogen bond acceptors (Hydrogen Bond Acceptor Count), number of rotatable bonds (Rotatable Bond Count), topological polar surface area (Topological Polar Surface Area), number of heavy atoms (Heavy Atom Count), and the structural complexity of the amino acid (Complexity) calculated from the Bertz/Hendrickson/Ihlenfeldt formula.

The inventors have unexpectedly found that encoding amino acids according to the above 10 physicochemical properties yields markedly better prediction accuracy for universal ACPA epitopes.

This preferred embodiment not only converts the peptide fragment into data, but also ensures that the resulting coded data set contains feature information sufficient to improve prediction accuracy. The selected physicochemical properties shorten the distance between similar amino acids and increase the distance between dissimilar amino acids, so that a more accurate prediction result is obtained.

According to some preferred embodiments of the invention, each normalization scales the data to between 0 and 1.

According to some preferred embodiments of the invention, the training of the predictive model comprises:

S31, acquiring, as candidate polypeptide fragments, citrullinated polypeptide fragments that show a positive synovial fluid reaction in the plasma of a plurality of RA patients;

S32, extracting the characteristic peptide fragment of each candidate polypeptide fragment;

S33, taking all the characteristic peptide fragments obtained as a positive training set, taking a plurality of other citrullinated polypeptide fragments, after de-duplication against the positive training set, as a negative training set, and combining the positive training set and the negative training set into the training set used to train the prediction model;

wherein the characteristic peptide fragment is a peptide fragment 9 amino acids in length, consisting of citrulline at the center and 4 amino acids on each side of it.

The candidate polypeptide fragments and the other citrullinated polypeptide fragments can be obtained from an existing database, such as the Immune Epitope Database (IEDB). A positive synovial fluid reaction means that, in an enzyme-linked immunosorbent assay (ELISA), the OD450 value is significantly increased relative to the normal level and exceeds the positive threshold.
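The construction of the training set can be sketched as follows, under stated assumptions: citrulline is written with the hypothetical one-letter symbol `Z`, citrulline residues closer than 4 positions to either end of a sequence are simply skipped (the text does not say how such cases are handled), and the function names are illustrative.

```python
from typing import Iterable, List, Set, Tuple

CIT = "Z"   # assumed one-letter symbol for citrulline
FLANK = 4   # residues kept on each side of the central citrulline


def characteristic_peptides(sequence: str) -> List[str]:
    """Return every 9-mer of `sequence` that has citrulline at its center."""
    peptides = []
    for i, residue in enumerate(sequence):
        has_full_flanks = i >= FLANK and i + FLANK < len(sequence)
        if residue == CIT and has_full_flanks:
            peptides.append(sequence[i - FLANK:i + FLANK + 1])
    return peptides


def build_training_set(candidates: Iterable[str],
                       background: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Positive set: 9-mers from candidate fragments.
    Negative set: background 9-mers that are not in the positive set."""
    positives: Set[str] = set()
    for seq in candidates:
        positives.update(characteristic_peptides(seq))
    negatives: Set[str] = set()
    for seq in background:
        negatives.update(characteristic_peptides(seq))
    negatives -= positives  # de-duplicate against the positive set
    return sorted(positives), sorted(negatives)


pos, neg = build_training_set(["AAAAZGGGGK"], ["AAAAZGGGGK", "KKKKZRRRR"])
print(pos, neg)
```

Using sets keeps each 9-mer once and makes the de-duplication against the positive set a single set difference.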

According to some preferred embodiments of the invention, the training further comprises:

S34, training the prediction model by ten-fold cross validation based on the obtained training set.

According to some preferred embodiments of the invention, the predictive model is further provided with a masking layer between the self-attention mechanism and the output of the bi-directional recurrent neural network.

According to some preferred embodiments of the present invention, the mask processing layer uses a mask matrix with fixed weights, and the weight distribution is as follows: the middle weight is highest, and the surrounding weights are gradually decreased.

According to some preferred embodiments of the invention, the mask matrix is a 1×9 matrix.

With the above preferred embodiments of the mask processing layer, mask processing during prediction further increases the attention paid to citrulline and to the common amino acids closest to it, reinforces the positional importance of citrulline within the peptide fragment, and gives the model better specificity in prediction.

According to some preferred embodiments of the invention, the prediction model uses a sliding window prediction method to obtain the prediction confidence.

According to the above prediction method, an apparatus for performing universal ACPA epitope prediction may be further obtained, which contains a storage medium storing a program and/or a model for implementing the above prediction method.

The invention takes the chemical mechanism of antigen-antibody interaction and the structural characteristics of the ACPA-citrullinated antigen complex resolved by X-ray analysis as modeling parameters. It reflects the characteristics of the polypeptide by encoding the physicochemical properties of 21 amino acids, reflects the interactions among the 9 amino acids through a bidirectional recurrent neural network and a self-attention mechanism, and emphasizes the special role of citrulline in antigenicity scoring, thereby obtaining a high-accuracy ACPA epitope prediction model.

The invention can greatly reduce the cost of research on screening ACPA epitopes. Antigenic sites in a protein of interest can be picked out directly for subsequent verification, which is far superior to screening by synthesizing large numbers of citrullinated polypeptides and carrying out binding experiments. This saves manpower and material resources and improves screening accuracy.

Drawings

FIG. 1 is a flow chart of prediction in an embodiment of the present invention.

FIG. 2 is a block diagram of a Bi-GRU in accordance with an embodiment of the present invention.

FIG. 3 is a diagram of the self-attention mechanism in an embodiment of the present invention.

FIG. 4 shows the accuracy curves of the training set and the validation set during training in Example 1 of the present invention.

FIG. 5 shows the loss function curves of the training set and the validation set during training in Example 1 of the present invention.

FIG. 6 is a graph showing the prediction accuracy of the universal ACPA epitope (A) and the non-ACPA epitope (B) in example 1 of the present invention.

Detailed Description

Specific embodiments of the present invention are shown in more detail below with reference to the accompanying drawings. It should be understood that the technical solutions of the present invention should not be limited by the specific embodiments or examples set forth, but are merely for the purpose of facilitating a clearer and accurate understanding of the present invention by those skilled in the art. All other embodiments, which can be made by a person skilled in the art without any inventive effort, are intended to be within the scope of the present invention, based on the embodiments of the present invention.

Referring to FIG. 1, according to the technical scheme of the present invention, some specific embodiments of the universal ACPA epitope prediction method based on a bidirectional recurrent neural network include the following procedure:

S0, encoding a plurality of physicochemical properties of citrulline and the 20 natural amino acids as data, and normalizing the resulting coded data separately for each physicochemical property to obtain normalized coded data;

S1, converting a peptide sequence to be tested into a normalized coded data set according to the correspondence between the amino acids it contains and the normalized coded data;

S2, performing dimension reduction on the normalized coded data set to obtain a feature data set;

S3, inputting the feature data set into a trained prediction model to obtain a prediction result.

Further, in some preferred embodiments, the inventors have unexpectedly found that markedly better ACPA prediction accuracy can be obtained by encoding amino acids according to the following 10 physicochemical properties: molar volume, isoelectric point (pI), molecular weight (Molecular Weight), n-octanol-water partition coefficient (XLogP3), number of hydrogen bond donors (Hydrogen Bond Donor Count), number of hydrogen bond acceptors (Hydrogen Bond Acceptor Count), number of rotatable bonds (Rotatable Bond Count), topological polar surface area (Topological Polar Surface Area), number of heavy atoms (Heavy Atom Count), and amino acid structural complexity (Complexity) calculated from the Bertz/Hendrickson/Ihlenfeldt formula.

This encoding process converts peptide fragments into data and ensures that the resulting coded data set contains feature information sufficient to improve prediction accuracy. The selected physicochemical properties and their coded data shorten the distance between amino acids with similar physicochemical properties and lengthen the distance between amino acids with dissimilar properties.

Since the values of the different physicochemical properties are not of the same order of magnitude, each property must be normalized separately, preferably to between 0 and 1.
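A minimal sketch of per-property normalization follows. It assumes min-max scaling, which is one common way to reach the 0 to 1 range; the text itself only states the target range. The raw values are toy numbers chosen to show two properties on very different scales and are not the encoding data of the invention.

```python
from typing import Dict, List


def min_max_by_property(raw: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Scale each property (column) to [0, 1] independently across residues."""
    residues = list(raw)
    n_props = len(raw[residues[0]])
    lows = [min(raw[r][j] for r in residues) for j in range(n_props)]
    highs = [max(raw[r][j] for r in residues) for j in range(n_props)]
    scaled = {}
    for r in residues:
        row = []
        for j in range(n_props):
            span = highs[j] - lows[j]
            # A constant column carries no information; map it to 0.0.
            row.append((raw[r][j] - lows[j]) / span if span else 0.0)
        scaled[r] = row
    return scaled


# Toy values only: property 1 is in the tens to hundreds, property 2 in single digits.
toy_raw = {"G": [75.0, 1.0], "A": [89.0, 2.0], "R": [174.0, 5.0]}
toy_scaled = min_max_by_property(toy_raw)
print(toy_scaled)
```

Normalizing column by column prevents a property with large raw values from dominating the distance between two residues.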

Further, in some specific embodiments, the dimension reduction is performed by an embedding layer, which maps the high-dimensional amino acid coding sequence obtained by directly encoding the physicochemical properties into a low-dimensional space. This makes the features of each sequence easier to distinguish and easier for the subsequent deep neural network to process.

Further, in some preferred embodiments, the training of the predictive model includes:

S31, acquiring from the Immune Epitope Database (IEDB), as candidate polypeptide fragments, citrullinated polypeptide fragments that show a positive synovial fluid reaction in the plasma of a plurality of RA patients, wherein a positive synovial fluid reaction means that, in an enzyme-linked immunosorbent assay (ELISA), the OD450 value is significantly increased relative to that of healthy people and exceeds the positive threshold;

S32, extracting the characteristic peptide fragment of each candidate polypeptide fragment. According to the X-ray crystal structure of the ACPA-citrullinated polypeptide complex, the interaction between antigen and antibody is limited to a length of 8-9 amino acids. Preferably, therefore, the characteristic peptide fragment is a peptide fragment 9 amino acids in length, consisting of citrulline at the center and 4 amino acids on each side of it;

S33, taking all the characteristic peptide fragments obtained as a positive training set, taking a plurality of peptide fragments from a background citrullinated polypeptide library provided in the literature, after de-duplication against the positive training set, as a negative training set, and combining the positive training set and the negative training set into the training set used to train the prediction model.

Further, in some preferred embodiments, the training further comprises:

S34, training the prediction model by ten-fold cross validation based on the obtained training set.

In ten-fold cross validation, the training set data are randomly divided into 10 parts; 9 parts are used as the actual training set and the remaining part as the validation set, and training and validation are repeated multiple times. Because different choices of training and validation sets give different model performance, the accuracy obtained under each training/validation split is recorded and averaged. Ten-fold cross validation thus allows the design parameters of the prediction model to be cross-validated, yielding the optimal number of recurrent neural network layers and the optimal embedding dimension.
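The splitting and averaging described above can be sketched in plain Python. The callback `train_and_score` is a hypothetical placeholder for whatever routine trains a model on the training indices and returns its accuracy on the validation indices; the fixed seed is likewise an illustrative choice.

```python
import random
from typing import Callable, Iterator, List, Tuple


def ten_fold_splits(n_samples: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k parts of near-equal size
    for held_out in range(k):
        validation = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, validation


def mean_cv_accuracy(n_samples: int,
                     train_and_score: Callable[[List[int], List[int]], float]) -> float:
    """Average the validation accuracy over all ten folds."""
    scores = [train_and_score(tr, va) for tr, va in ten_fold_splits(n_samples)]
    return sum(scores) / len(scores)


splits = list(ten_fold_splits(25))
print(len(splits), [len(va) for _, va in splits])
```

Repeating `mean_cv_accuracy` for each candidate setting, for example each number of recurrent layers or each embedding dimension, and keeping the setting with the best average is one way to carry out the parameter selection described.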

Further, in some preferred embodiments, the prediction model is an NLP-style model built on a bidirectional recurrent neural network (BiGRU) containing a self-attention mechanism.

In a unidirectional recurrent network, the state always propagates from front to back, so when a sequence whose elements depend on both preceding and following context (such as a peptide fragment composed of amino acids) is processed, the output cannot take the following context into account. The bidirectional recurrent neural network used in the model connects the output at the current time step with both the state at the previous time step and the state at the next time step, which helps extract deep features of the sequence.

For peptide feature extraction, the technical scheme of the invention aims to capture not only unidirectional sequence information but also bidirectional information. A BiGRU is therefore used, whose output is determined jointly by the states of two unidirectional GRUs running in opposite directions.

In some embodiments, referring to FIG. 2, the BiGRU network used comprises the following structure:

an input layer; a forward hidden layer connected to each channel of the input layer; a backward hidden layer connected to the forward hidden layer; and an output layer connected to the backward hidden layer. In addition, the input layer is connected to the backward hidden layer, the output layer is connected to the forward hidden layer, the forward hidden layers of different channels are connected to one another, and the backward hidden layers of different channels are connected to one another.

In model training, the weight of the connection from the input layer to the forward hidden layer is denoted W₁, the weight of the self-connection of the forward hidden layer W₂, the weight from the input layer to the backward hidden layer W₃, the weight of the self-connection of the backward hidden layer W₅, and the weight from the backward hidden layer to the output layer W₆. The sequence is first processed in the forward direction, giving the output features of the forward sequence at the forward hidden layer; the sequence is then processed in reverse, giving the output features at the backward hidden layer. The forward and backward computations are independent of each other. Finally, the forward and backward outputs are concatenated to give the output result, which contains sequence information from both directions. The weight matrices are initialized orthogonally to avoid vanishing and exploding gradients.
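The forward pass, the independent backward pass and the final concatenation can be sketched as below. To keep the example short, a plain tanh recurrence stands in for the GRU cell, so this illustrates only the bidirectional wiring and not the gating of a real GRU; the scalar weights and function names are illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
# A cell maps (input at one position, previous state) to the new state.
Cell = Callable[[Sequence[float], Vector], Vector]


def toy_cell(weight_in: float, weight_rec: float) -> Cell:
    """Plain tanh recurrence: a stand-in for a GRU cell, not a GRU."""
    def cell(x: Sequence[float], state: Vector) -> Vector:
        drive = weight_in * sum(x)
        return [math.tanh(drive + weight_rec * h) for h in state]
    return cell


def run_direction(inputs: Sequence[Sequence[float]], cell: Cell, hidden_size: int) -> List[Vector]:
    """Run one recurrent pass and collect the state at every position."""
    state = [0.0] * hidden_size
    outputs = []
    for x in inputs:
        state = cell(x, state)
        outputs.append(state)
    return outputs


def bidirectional(inputs: Sequence[Sequence[float]], forward_cell: Cell,
                  backward_cell: Cell, hidden_size: int) -> List[Vector]:
    """Concatenate independent forward and backward states per position."""
    forward = run_direction(inputs, forward_cell, hidden_size)
    backward = run_direction(list(reversed(inputs)), backward_cell, hidden_size)
    backward.reverse()  # realign backward states with the original positions
    return [f + b for f, b in zip(forward, backward)]


toy_inputs = [[0.1], [0.2], [0.3]]
toy_out = bidirectional(toy_inputs, toy_cell(1.0, 0.5), toy_cell(1.0, 0.5), hidden_size=2)
print(len(toy_out), len(toy_out[0]))  # positions, concatenated state size
```

Each output vector is twice the hidden size: its first half has seen only the residues up to that position, and its second half only the residues from that position onward.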

Further, the inventors have unexpectedly found that citrulline is the key factor determining antigenicity in polypeptide epitopes, and that weighting its corresponding features gives better prediction results. A central attention improvement based on the self-attention mechanism is therefore introduced into the model. The self-attention mechanism lets the inputs of the model interact with one another and outputs the aggregate of these interactions together with attention scores, thereby identifying the inputs that deserve more attention.

In some embodiments, referring to FIG. 3, the present invention uses the following self-attention mechanism:

First, the similarity between the query q and each key k is computed as a dot product to obtain the weights; these weights are then normalized with a softmax function; finally, the weights and the corresponding values v are combined in a weighted sum to obtain the final attention value.

In some preferred embodiments, the present invention sets the value v equal to the key k. Of the two outputs of the BiGRU model, the output features of the recurrent network serve as k (and v), and the temporal information representing the context serves as q. Both are fed into the self-attention layer, which adaptively generates effective sequence weights.
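The three steps (dot-product similarity, softmax normalization, weighted sum) can be sketched in plain Python. The function names are illustrative, and the query and keys are toy vectors rather than real BiGRU outputs.

```python
import math
from typing import List, Optional, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize scores into weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention(query: Sequence[float],
              keys: Sequence[Sequence[float]],
              values: Optional[Sequence[Sequence[float]]] = None
              ) -> Tuple[List[float], List[float]]:
    """Return (weights, attention value) for one query."""
    if values is None:
        values = keys  # value v set equal to key k, as in the preferred embodiment
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, context


toy_keys = [[1.0, 0.0], [0.0, 1.0]]
toy_weights, toy_context = attention([2.0, 0.0], toy_keys)
print(toy_weights, toy_context)
```

With the query aligned to the first key, the first weight is the larger of the two, so the attention value is pulled toward the first key.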

In some preferred embodiments, the invention adds a 1×9 mask matrix with fixed weights on top of the self-attention mechanism. The weights are highest in the middle and decrease gradually toward both sides, matching the characteristic peptide fragment, in which citrulline lies at the center with 4 common amino acids on each side. As a result, the central citrulline and the common amino acids closest to it receive more attention.

The mask matrix is multiplied by the weights obtained from self-attention, and the product is normalized by softmax to obtain the final weights. In this embodiment, the model not only pays more attention to important fragments, but the positional importance of citrulline within the 9-amino-acid characteristic peptide fragment is also emphasized during training, allowing the model to learn better specificity.
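A minimal sketch of the masking step follows. The text specifies only that the 1×9 mask is fixed, peaks in the middle and decreases toward both sides; the particular values in `CENTER_MASK` are hypothetical.

```python
import math
from typing import List, Sequence

# Hypothetical 1x9 mask: highest at the central citrulline, decreasing outward.
CENTER_MASK = [0.2, 0.4, 0.6, 0.8, 1.0, 0.8, 0.6, 0.4, 0.2]


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize scores into weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(attention_weights: Sequence[float],
                             mask: Sequence[float] = CENTER_MASK) -> List[float]:
    """Multiply attention weights by the fixed mask, then renormalize with softmax."""
    if len(attention_weights) != len(mask):
        raise ValueError("attention weights and mask must have the same length")
    return softmax([w * m for w, m in zip(attention_weights, mask)])


uniform = [1.0 / 9.0] * 9
final_weights = masked_attention_weights(uniform)
print(final_weights.index(max(final_weights)))  # position of the largest final weight
```

Even when self-attention treats all nine positions equally, the masked weights peak at the center. Because the products lie in a narrow range, the second softmax shifts the weights gently rather than sharply.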

In some preferred embodiments, the prediction model adopts a sliding window prediction method, with the window length set equal to the modeling sequence length (9 amino acids). When the input sequence is at least as long as the window, a fixed-length sliding window scans the input sequence to form a plurality of subsequences, and the maximum prediction confidence over all subsequences is taken as the final prediction confidence. When the input sequence is shorter than the window, the missing positions are zero-padded and do not participate in the forward pass of the network.
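The sliding window procedure can be sketched as follows. The scoring function `score_window` is a hypothetical stand-in for the trained model, the `-` pad symbol is an assumed marker for zero-padded positions that the scorer is expected to ignore, and padding on the right is an illustrative choice since the text does not state where padding is placed.

```python
from typing import Callable, List

WINDOW = 9
PAD = "-"  # assumed marker for zero-padded positions


def windows(sequence: str, size: int = WINDOW) -> List[str]:
    """Split a sequence into all fixed-length subsequences, padding if too short."""
    if len(sequence) < size:
        return [sequence + PAD * (size - len(sequence))]
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def predict_confidence(sequence: str, score_window: Callable[[str], float]) -> float:
    """Final confidence is the maximum confidence over all windows."""
    return max(score_window(w) for w in windows(sequence))


def toy_scorer(window: str) -> float:
    """Toy scorer for demonstration only: fraction of 'Z' residues in the window."""
    return window.count("Z") / WINDOW


print(windows("ABC"))
print(predict_confidence("AAAAZAAAAZZ", toy_scorer))
```

Taking the maximum means a long sequence is called positive as soon as any one of its 9-residue windows scores highly.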

The following is a further illustration of the invention in conjunction with examples.

Example 1

A plurality of physicochemical properties of citrulline and the 20 natural amino acids are encoded according to the encoding scheme of Table 1, wherein N represents the amino acid name, MV the molar volume, pI the isoelectric point, MW the molecular weight, X the n-octanol-water partition coefficient given by XLogP3 software, HBD the number of hydrogen bond donors, HBA the number of hydrogen bond acceptors, RB the number of rotatable bonds, TPSA the topological polar surface area, HA the number of heavy atoms, and C the structural complexity of the amino acid:

TABLE 1 amino acid encoding data

In practice, a total of 10898 data were obtained for the positive training set, and 95916 data were obtained for the negative training set.

During iterative training of the prediction model by ten-fold cross validation, the accuracy curves of the training set and the validation set and the curves of the BCE loss function are shown in FIG. 4 and FIG. 5, respectively. The final ACC is 98.9% on the training set and 94.7% on the validation set.

The model was used to predict the universal ACPA epitopes and the non-ACPA epitopes among the citrullinated polypeptides reported in Steen, J. et al., Recognition of Amino Acid Motifs, Rather Than Specific Proteins, by Human Plasma Cell-Derived Monoclonal Antibodies to Posttranslationally Modified Proteins in Rheumatoid Arthritis. Arthritis Rheumatol, 2019. 71(2): p. 196-209, giving recalls of 89.9% and 88%, respectively, as shown in FIG. 6 (A) and (B). It can be seen that the prediction method of the invention has high accuracy.
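For reference, the three quantities reported in this example (ACC, recall and the BCE loss) can be computed as in the sketch below, which uses the standard definitions. The labels and probabilities are toy values and have no relation to the reported results.

```python
import math
from typing import Sequence


def accuracy(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def recall(labels: Sequence[int], predictions: Sequence[int], target: int = 1) -> float:
    """Fraction of samples of class `target` that were predicted as `target`."""
    relevant = [p for y, p in zip(labels, predictions) if y == target]
    if not relevant:
        raise ValueError("no samples of the target class")
    return sum(1 for p in relevant if p == target) / len(relevant)


def bce_loss(labels: Sequence[int], probabilities: Sequence[float], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


toy_labels = [1, 1, 0, 0]
toy_predictions = [1, 0, 0, 0]
print(accuracy(toy_labels, toy_predictions), recall(toy_labels, toy_predictions))
```

Calling `recall` with `target=0` gives the recall on the negative class, which is how a separate figure for non-epitopes could be obtained.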

It should be understood that the scope of the present invention is not limited to the above embodiments. All technical schemes belonging to the concept of the invention belong to the protection scope of the invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims

1. A universal ACPA epitope prediction method based on a bidirectional recurrent neural network, characterized by comprising the following steps:

wherein the predictive model is constructed based on a bi-directional recurrent neural network containing a self-attention mechanism;

the training of the prediction model comprises the following steps:

2. The prediction method according to claim 1, wherein in S0, the physicochemical property includes: molar volume of amino acids, isoelectric point, molecular weight, n-octanol-water partition coefficient, number of hydrogen bond donors, number of hydrogen bond acceptors, number of rotatable bonds, topological polar surface area, number of heavy atoms, and structural complexity of amino acids calculated by Bertz/Hendrickson/Ihlenfeldt formula.

3. The prediction method according to claim 1, wherein the training further comprises:

4. A prediction method according to claim 3, wherein the prediction model is further provided with a masking layer between the self-attention mechanism and the output of the bi-directional recurrent neural network.

5. The prediction method according to claim 4, wherein the mask processing layer uses a mask matrix with fixed weights, and the weight distribution is as follows: the middle weight is highest, and the surrounding weights are gradually decreased.

6. The prediction method according to claim 5, wherein the mask matrix is a 1 x 9 matrix.

7. The prediction method according to claim 1, wherein the prediction model obtains the prediction confidence by a sliding window prediction method.

8. An apparatus for performing universal ACPA epitope prediction, comprising a storage medium storing a program and/or model for implementing the prediction method according to any one of claims 1 to 7.