CN115458049B - Method and device for predicting universal anti-citrullinated polypeptide antibody epitope based on bidirectional recurrent neural network

Publication number
CN115458049B
Authority
CN
China
Prior art keywords
prediction
training
neural network
prediction method
amino acids
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210752384.7A
Other languages
Chinese (zh)
Other versions
CN115458049A (en
Inventor
赵毅
朱晨曦
戴伦治
陈桃
孙蕊
周珍
许佳艺
张梦玥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202210752384.7A priority Critical patent/CN115458049B/en
Publication of CN115458049A publication Critical patent/CN115458049A/en
Priority to PCT/CN2023/077121 priority patent/WO2024001226A1/en
Application granted granted Critical
Publication of CN115458049B publication Critical patent/CN115458049B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Epidemiology (AREA)
  • Chemical & Material Sciences (AREA)
  • Public Health (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Peptides Or Proteins (AREA)
  • Medicines Containing Antibodies Or Antigens For Use As Internal Diagnostic Agents (AREA)

Abstract

The invention discloses a method and device for predicting universal anti-citrullinated polypeptide antibody (ACPA) epitopes based on a bidirectional recurrent neural network. In the prediction method, the peptide sequence to be tested is converted into an encoded data set by encoding the physicochemical properties of the natural amino acids and citrulline, and feature extraction and result prediction are then performed on the encoded data set by a bidirectional recurrent neural network prediction model containing a self-attention mechanism. The method offers high prediction efficiency and accuracy and can be further applied to antigen screening research in rheumatoid arthritis (RA) patients, thereby improving the diagnosis rate of RA and the ability to prevent and treat it.

Description

Method and device for predicting universal anti-citrullinated polypeptide antibody epitope based on bidirectional recurrent neural network
Technical Field
The invention belongs to the technical field of artificial-intelligence-based antigenic polypeptide prediction, and particularly relates to a method for predicting universal anti-citrullinated polypeptide antibody (ACPA) epitopes.
Background
Rheumatoid arthritis (RA) is an autoimmune disease in which a variety of autoantibodies are often present in patients. Some of these, such as anti-cyclic citrullinated peptide (CCP) antibodies, are commonly used in the prior art for early diagnosis of RA. In practice, however, CCP antibodies have been found to be only 50-75% sensitive for the diagnosis of RA, and in some special cases they are also detected in other diseases such as chronic lung disease. In contrast, anti-citrullinated polypeptide antibodies (ACPA) have higher diagnostic specificity, typically over 90%.
Citrullination is a post-translational protein modification in which peptidylarginine deiminase (PAD), in the presence of calcium ions, catalyzes the conversion of the ketimine group of arginine to a ketone group, so that the residue loses its positive charge and becomes a neutral citrulline residue. The modification changes the molecular weight, charge and three-dimensional structure of the protein and imparts antigenicity to it, so that it can be recognized by autoreactive CD4+ T cells, which then produce cytokines and induce B cells to differentiate, mature and secrete ACPA. During RA pathogenesis, ACPA ultimately promotes bone destruction by stimulating the production of pro-inflammatory cytokines, inducing osteoclastogenesis and mediating osteoblast apoptosis; the citrullinated polypeptides that induce ACPA production therefore play a key role in the aberrant immune activation of RA. Accurately determining or predicting ACPA production is thus expected to enable more accurate RA diagnosis and prediction.
Under the combined action of genetic and environmental risk factors, the level of citrullination in RA patients is markedly elevated relative to healthy people, and a large number of up-regulated citrullinated polypeptides can be detected in their body fluids. Only a very small fraction of these citrullinated polypeptides are truly antigenic, that is, able to induce the production of ACPA and to be bound by it; these special polypeptides are called ACPA epitopes. Depending on the type of ACPA, most ACPA epitopes are universal ACPA epitopes, which play the main role in the early stages of disease development and in the systemic immune reaction, while the remaining small fraction are specific (private) ACPA epitopes, which act only in specific tissues. Screening for universal ACPA epitopes is therefore important for accurately determining or predicting ACPA production, for further studying the early pathogenesis of RA, and for disease prediction. However, a computational method for predicting ACPA epitopes is still lacking in the prior art.
Disclosure of Invention
In view of the deficiencies of the prior art, the invention aims to provide a method for predicting universal ACPA epitopes using a bidirectional recurrent neural network and a central attention mechanism. The method is accurate and efficient and can be further applied to research on antigen screening in RA patients.
The technical scheme of the invention is as follows:
A method for universal ACPA epitope prediction based on a bidirectional recurrent neural network, comprising:
S0, encoding a plurality of physicochemical properties of the 20 natural amino acids and citrulline as data, and normalizing the encoded data separately for each physicochemical property to obtain normalized encoded data;
S1, converting the peptide sequence to be tested into a normalized encoded data set according to the correspondence between the amino acids it contains and the normalized encoded data;
S2, performing dimension reduction on the normalized encoded data set to obtain a feature data set;
S3, inputting the feature data set into a trained prediction model to obtain a prediction result;
wherein the prediction model is constructed based on a bidirectional recurrent neural network containing a self-attention mechanism.
In this technical scheme, the physicochemical characteristics of citrulline and the 20 natural amino acids are encoded and normalized together, which better reflects how the interaction between citrulline and the adjacent amino acids on both sides contributes to the antigenicity of the final polypeptide. Moreover, antigen-antibody interaction is based on intermolecular forces such as hydrogen bonds and van der Waals forces, which are strongly influenced by the physicochemical properties of each amino acid, such as its charge, mass and number of hydrogen bond donors. Normalized encoding of the properties of all amino acids constituting the polypeptide can therefore reflect the characteristics of the polypeptide more realistically.
According to some preferred embodiments of the invention, the physicochemical properties include: molar volume of the amino acid, isoelectric point (pI), molecular weight (Molecular Weight), n-octanol-water partition coefficient, number of hydrogen bond donors (Hydrogen Bond Donor Count), number of hydrogen bond acceptors (Hydrogen Bond Acceptor Count), number of rotatable bonds (Rotatable Bond Count), topological polar surface area (Topological Polar Surface Area), number of heavy atoms (Heavy Atom Count), and structural complexity of the amino acid (Complexity) calculated by the Bertz/Hendrickson/Ihlenfeldt formula.
The inventors have unexpectedly found that encoding according to the above 10 physicochemical properties of amino acids yields markedly higher prediction accuracy for universal ACPA epitopes.
This preferred embodiment not only converts the peptide into data but also ensures that the resulting encoded data set contains feature information sufficient to improve prediction accuracy. The selected physicochemical properties shorten the distance between similar amino acids and lengthen the distance between dissimilar amino acids, which leads to a more accurate prediction result.
According to some preferred embodiments of the invention, each normalization scales the data to between 0 and 1.
According to some preferred embodiments of the invention, the training of the predictive model comprises:
S31, obtaining, as candidate polypeptide fragments, citrullinated polypeptide fragments that show a positive synovial fluid reaction in the plasma of a plurality of RA patients;
S32, extracting a characteristic peptide fragment from each candidate polypeptide fragment;
S33, taking all the characteristic peptide fragments obtained as a positive training set, taking a plurality of other citrullinated polypeptide fragments that remain after de-duplication against the positive training set as a negative training set, and combining the positive and negative training sets into the training set used to train the prediction model;
wherein the characteristic peptide fragment is a peptide 9 amino acids in length, consisting of a citrulline at the center and the 4 amino acids on each side of it.
Wherein the candidate polypeptide fragments and the additional citrullinated polypeptide fragments are obtainable from an existing database, such as the Immune Epitope Database (IEDB). The synovial fluid positive reaction refers to the situation that the OD450 value is significantly increased and exceeds the positive threshold value in enzyme-linked immunosorbent assay (ELISA) relative to the normal level.
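The extraction of characteristic peptide fragments and the assembly of the training set can be sketched as follows. This is a minimal illustration, not the patented implementation: the one-letter symbol "Z" for citrulline, the function names, and the choice to skip citrulline sites that lack 4 residues on either side are all assumptions made here for the example.

```python
def characteristic_peptides(sequence, cit="Z", flank=4):
    """Return every 9-residue window with citrulline at its center.

    Assumptions: citrulline is written as `cit` ("Z" is a hypothetical
    symbol), and sites without `flank` residues on both sides are skipped.
    """
    peptides = []
    for i, residue in enumerate(sequence):
        if residue != cit:
            continue
        start, end = i - flank, i + flank + 1
        if start < 0 or end > len(sequence):
            continue  # not enough flanking residues for a full window
        peptides.append(sequence[start:end])
    return peptides


def build_training_set(positive_peptides, background_peptides):
    """Label positives 1 and background peptides 0 after de-duplication."""
    positives = set(positive_peptides)
    # Remove anything that also occurs in the positive set.
    negatives = set(background_peptides) - positives
    return [(p, 1) for p in sorted(positives)] + [(p, 0) for p in sorted(negatives)]
```

Sorting is used only to make the output order deterministic; it has no role in the method itself.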
According to some preferred embodiments of the invention, the training further comprises:
S34, training the prediction model by ten-fold cross-validation on the obtained training set.
According to some preferred embodiments of the invention, the prediction model is further provided with a masking layer between the self-attention mechanism and the output of the bidirectional recurrent neural network.
According to some preferred embodiments of the invention, the masking layer uses a mask matrix with fixed weights, distributed so that the weight is highest in the middle and decreases gradually toward both sides.
According to some preferred embodiments of the invention, the mask matrix is a 1×9 matrix.
With the masking layer of these preferred embodiments, the prediction process further increases the attention paid to citrulline and to the ordinary amino acids closest to it, reinforces the positional importance of citrulline within the peptide, and gives the model better specificity.
According to some preferred embodiments of the invention, the prediction model uses a sliding window prediction method to obtain the prediction confidence.
According to the above prediction method, an apparatus for performing universal ACPA epitope prediction may be further obtained, which contains a storage medium storing a program and/or a model for implementing the above prediction method.
According to the invention, the chemical mechanism of antigen-antibody interaction and the structural characteristics of the ACPA-citrullinated antigen complex resolved by X-ray analysis serve as the basis for modeling. Polypeptide characteristics are represented by encoding the physicochemical properties of 21 amino acids, the interactions among the 9 amino acids are captured by a bidirectional recurrent neural network and a self-attention mechanism, and the special role of citrulline in antigenicity scoring is emphasized, yielding a high-accuracy ACPA epitope prediction model.
The invention can greatly reduce the cost of screening for ACPA epitopes: antigenic sites in the protein under study can be picked out directly for subsequent verification. This is far superior to screening by synthesizing large numbers of citrullinated polypeptides and performing binding experiments, as it saves manpower and material resources and improves screening accuracy.
Drawings
FIG. 1 is a flow chart of prediction in an embodiment of the present invention.
FIG. 2 is a block diagram of a Bi-GRU in accordance with an embodiment of the present invention.
FIG. 3 is a diagram of the self-attention mechanism in an embodiment of the present invention.
FIG. 4 shows the accuracy curves of the training set and validation set during training in Example 1 of the present invention.
FIG. 5 shows the loss function curves of the training set and validation set during training in Example 1 of the present invention.
FIG. 6 is a graph showing the prediction accuracy of the universal ACPA epitope (A) and the non-ACPA epitope (B) in example 1 of the present invention.
Detailed Description
Specific embodiments of the present invention are shown in more detail below with reference to the accompanying drawings. It should be understood that the technical solutions of the present invention should not be limited by the specific embodiments or examples set forth, but are merely for the purpose of facilitating a clearer and accurate understanding of the present invention by those skilled in the art. All other embodiments, which can be made by a person skilled in the art without any inventive effort, are intended to be within the scope of the present invention, based on the embodiments of the present invention.
Referring to FIG. 1, according to the technical scheme of the present invention, some specific embodiments of a universal ACPA epitope prediction method based on a bidirectional recurrent neural network include the following procedure:
S0, encoding a plurality of physicochemical properties of citrulline and the 20 natural amino acids as data, and normalizing the encoded data separately for each physicochemical property to obtain normalized encoded data;
S1, converting the peptide sequence to be tested into a normalized encoded data set according to the correspondence between the amino acids it contains and the normalized encoded data;
S2, performing dimension reduction on the normalized encoded data set to obtain a feature data set;
S3, inputting the feature data set into a trained prediction model to obtain a prediction result.
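Step S1 amounts to a table lookup that replaces each residue with its normalized property vector. The sketch below illustrates this under stated assumptions: the symbol "Z" for citrulline, the three-property table, and every numeric value in it are placeholders invented for the example and are not the patent's encoding data.

```python
# Hypothetical normalized encodings: residue -> values in PROPERTY_ORDER.
# "Z" is an assumed symbol for citrulline; all numbers are placeholders.
PROPERTY_ORDER = ("MV", "pI", "MW")
NORMALIZED = {
    "A": (0.10, 0.45, 0.11),
    "G": (0.00, 0.44, 0.00),
    "R": (0.80, 1.00, 0.77),
    "Z": (0.82, 0.43, 0.78),
}


def encode_peptide(peptide, table=NORMALIZED):
    """Step S1: map each residue to its normalized property vector."""
    try:
        return [list(table[residue]) for residue in peptide]
    except KeyError as err:
        raise ValueError(f"no encoding for residue {err.args[0]!r}") from None


# A 9-residue peptide becomes a 9 x len(PROPERTY_ORDER) matrix.
matrix = encode_peptide("AAGRZRGAA")
```

With the 10 properties named in the text, each row would have 10 entries instead of 3.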
Further, in some preferred embodiments, the inventors have unexpectedly found that markedly higher ACPA prediction accuracy can be obtained by encoding according to the following 10 physicochemical properties of amino acids: molar volume, isoelectric point (pI), molecular weight (Molecular Weight), n-octanol-water partition coefficient (XLogP3), number of hydrogen bond donors (Hydrogen Bond Donor Count), number of hydrogen bond acceptors (Hydrogen Bond Acceptor Count), number of rotatable bonds (Rotatable Bond Count), topological polar surface area (Topological Polar Surface Area), number of heavy atoms (Heavy Atom Count), and amino acid structural complexity (Complexity) calculated by the Bertz/Hendrickson/Ihlenfeldt formula.
This encoding not only converts the peptide into data but also ensures that the resulting encoded data set contains feature information sufficient to improve prediction accuracy. The selected physicochemical properties and their encoded data shorten the distance between amino acids with similar physicochemical properties and lengthen the distance between amino acids with dissimilar properties.
Since the physicochemical properties differ in order of magnitude, each must be normalized separately, preferably to between 0 and 1.
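Per-property normalization to the range 0 to 1 can be illustrated with min-max scaling, which is one common way to achieve it; the text does not state which scaling formula is used, so min-max is an assumption here, as are the function name and the toy values in the usage note.

```python
def min_max_normalize(table):
    """Scale each property column to [0, 1] independently of the others.

    `table` maps residue -> {property name: raw value}. Min-max scaling is
    an assumed choice; the source only requires values between 0 and 1.
    """
    properties = sorted({p for row in table.values() for p in row})
    normalized = {residue: {} for residue in table}
    for prop in properties:
        column = [table[residue][prop] for residue in table]
        low, high = min(column), max(column)
        span = high - low
        for residue in table:
            # A constant column carries no information; map it to 0.
            value = 0.0 if span == 0 else (table[residue][prop] - low) / span
            normalized[residue][prop] = value
    return normalized
```

For example, with placeholder values `{"A": {"MW": 100.0}, "B": {"MW": 200.0}, "C": {"MW": 150.0}}`, residue "C" is mapped to 0.5 for MW regardless of the scale of any other property.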
Further, in some specific embodiments, the dimension reduction is performed by an embedding layer, which maps the high-dimensional amino acid encoding sequence obtained directly from the physicochemical properties to a low-dimensional space, making the features of each sequence easier to distinguish and easier for the subsequent deep neural network to process.
Further, in some preferred embodiments, the training of the predictive model includes:
S31, obtaining from the Immune Epitope Database (IEDB), as candidate polypeptide fragments, citrullinated polypeptide fragments that show a positive synovial fluid reaction in the plasma of a plurality of RA patients; wherein a positive synovial fluid reaction means that the OD450 value in an enzyme-linked immunosorbent assay (ELISA) is significantly increased relative to a healthy person and exceeds the positive threshold;
S32, extracting a characteristic peptide fragment from each candidate polypeptide fragment;
According to the X-ray crystal structure of the ACPA-citrullinated polypeptide complex, the interaction between antigen and antibody is limited to a length of 8-9 amino acids. Preferably, therefore, the characteristic peptide fragment is a peptide 9 amino acids in length, consisting of a citrulline at the center and the 4 amino acids on each side of it;
S33, taking all the characteristic peptide fragments obtained as a positive training set, taking as a negative training set a plurality of peptide fragments from a background citrullinated polypeptide library, provided according to literature data, that remain after de-duplication against the positive training set, and combining the positive and negative training sets into the training set used to train the prediction model.
Further, in some preferred embodiments, the training further comprises:
S34, training the prediction model by ten-fold cross-validation on the obtained training set.
In ten-fold cross-validation, the training data are randomly divided into 10 parts; in each round 9 parts serve as the actual training set and the remaining part as the validation set, and training and validation are repeated. Because model performance varies with the choice of training and validation sets, the accuracy obtained on each split is recorded and averaged. Ten-fold cross-validation allows the design parameters of the prediction model to be cross-validated, yielding the optimal number of recurrent layers and the optimal embedding dimension.
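The splitting and averaging logic of ten-fold cross-validation can be sketched as below. The function names and the `train_and_score` callback are hypothetical; the callback stands in for whatever routine trains the model on the training indices and returns its accuracy on the validation indices.

```python
import random


def ten_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition, reproducible
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        validation = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield train, validation


def cross_validate(n_samples, train_and_score, n_folds=10, seed=0):
    """Average the score returned by `train_and_score` over all folds."""
    scores = [train_and_score(train, validation)
              for train, validation in ten_fold_splits(n_samples, n_folds, seed)]
    return sum(scores) / len(scores)
```

To compare design parameters such as the number of recurrent layers or the embedding dimension, `cross_validate` would be called once per candidate setting and the setting with the best average retained.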
Further, in some preferred embodiments, the prediction model is an NLP model based on a bidirectional recurrent neural network (BiGRU) containing a self-attention mechanism (self-attention).
The model uses a bidirectional recurrent neural network. In a unidirectional network the state always propagates from front to back, so when processing a sequence whose elements depend on both earlier and later positions (such as a peptide composed of amino acids), the output cannot capture correlations with later positions. A bidirectional recurrent network connects the output at the current time step to both the preceding and the following state, which facilitates extraction of deep sequence features.
Meanwhile, for peptide feature extraction, the technical scheme of the invention aims to capture information in both directions rather than in one direction only. A BiGRU is therefore used, whose output is determined jointly by the states of two unidirectional GRUs running in opposite directions.
In some embodiments, referring to FIG. 2, the BiGRU network used comprises the following structure:
an input layer; a forward hidden layer connected to each channel of the input layer; a backward hidden layer connected to the forward hidden layer; and an output layer connected to the backward hidden layer; wherein the input layer and the backward hidden layer, the output layer and the forward hidden layer, the forward hidden layers of different channels, and the backward hidden layers of different channels are also respectively connected.
In model training, the weight from the input layer to the forward hidden layer is denoted W_1, the self-connection weight of the forward hidden layer W_2, the weight from the input layer to the backward hidden layer W_3, the self-connection weight of the backward hidden layer W_5, and the weight from the backward hidden layer to the output layer W_6. The sequence is first processed in the forward direction to obtain its output features at the forward hidden layer, and then in the reverse direction to obtain the output features at the backward hidden layer; the forward and backward computations are independent of each other. Finally, the forward and backward outputs are concatenated to give the output, which contains sequence information from both directions. The weight matrices are orthogonally initialized to avoid vanishing and exploding gradients.
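The independent forward and backward passes and their concatenation can be illustrated with a small pure-Python sketch. It is deliberately simplified and rests on several assumptions not taken from the text: a standard GRU gate formulation without bias terms, uniform random initialization in place of orthogonal initialization, and no output layer. The class and function names are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rows, cols, rng):
    # Assumption: uniform init for brevity; the text uses orthogonal init.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Minimal GRU cell (no biases): update gate z, reset gate r, candidate c."""

    def __init__(self, n_in, n_hidden, rng):
        self.n_hidden = n_hidden
        self.W = {g: rand_matrix(n_hidden, n_in, rng) for g in "zrc"}
        self.U = {g: rand_matrix(n_hidden, n_hidden, rng) for g in "zrc"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(a + b) for a, b in zip(matvec(self.W["c"], x), matvec(self.U["c"], rh))]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]


def run(cell, xs):
    """Run one direction over the sequence, returning the state at each step."""
    h = [0.0] * cell.n_hidden
    states = []
    for x in xs:
        h = cell.step(x, h)
        states.append(h)
    return states


def bigru(xs, forward_cell, backward_cell):
    """Concatenate forward and backward states position by position."""
    forward = run(forward_cell, xs)
    backward = run(backward_cell, xs[::-1])[::-1]  # re-align to input order
    return [f + b for f, b in zip(forward, backward)]
```

Each output vector has twice the hidden size: the first half summarizes the residues up to that position, the second half the residues from that position to the end.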
Further, the inventors have unexpectedly found that citrulline is the key factor determining antigenicity in polypeptide epitopes, and that weighting its corresponding features gives better prediction results. A central-attention improvement based on the self-attention mechanism is therefore introduced into the model. The self-attention mechanism lets the inputs of the model interact with one another and outputs an aggregate of these interactions together with attention scores, identifying the inputs that deserve more attention.
In some embodiments, referring to FIG. 3, the present invention uses the following self-attention mechanism:
First, the similarity between the query q and each key k is calculated as a dot product to obtain weights; the weights are then normalized with a softmax function; finally, the corresponding values v are summed with these weights to obtain the final attention value.
In some preferred embodiments, the invention sets the value v equal to the key k. Of the two outputs of the BiGRU model, the output features of the recurrent network are used as k and v, and the timing information representing the context is used as q; these are input to the self-attention layer, which adaptively generates effective sequence weights.
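The three steps of the attention computation (dot-product similarity, softmax normalization, weighted sum) can be written out directly. The sketch below is a plain illustration with hypothetical function names; it omits any scaling factor or learned projection, since the text does not mention them.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(query, keys, values=None):
    """Dot-product attention; values default to the keys (v = k)."""
    if values is None:
        values = keys
    scores = [dot(query, key) for key in keys]   # similarity of q with each k
    weights = softmax(scores)                    # normalize the weights
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context
```

In the setting described above, `keys` would hold the 9 per-position output vectors of the recurrent network and `query` the vector carrying the context information.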
In some preferred embodiments, the invention adds a 1×9 mask matrix with fixed weights on top of the self-attention mechanism. The weights are highest in the middle and decrease gradually toward both sides, corresponding to a characteristic peptide with citrulline at the center and 4 ordinary amino acids on each side, so that the central citrulline and the ordinary amino acids closer to it receive more attention.
The mask matrix is multiplied by the weights obtained from self-attention, and the result is normalized by softmax to give the final weights. In this way the model not only pays more attention to important fragments but also learns in training to emphasize the position of citrulline within the 9-amino-acid characteristic peptide, which improves its specificity.
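The masking step can be sketched as an element-wise product followed by a softmax. The text specifies only that the 1×9 mask peaks in the middle and decreases toward both sides; the particular values in `CENTER_MASK` below are hypothetical and chosen purely for illustration.

```python
import math

# Hypothetical fixed mask: highest at the central (citrulline) position and
# decreasing toward both ends. The actual values are not given in the text.
CENTER_MASK = [0.2, 0.4, 0.6, 0.8, 1.0, 0.8, 0.6, 0.4, 0.2]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_center_mask(attention_weights, mask=CENTER_MASK):
    """Multiply attention weights by the fixed mask, then renormalize."""
    if len(attention_weights) != len(mask):
        raise ValueError("attention weights and mask must have equal length")
    masked = [w * m for w, m in zip(attention_weights, mask)]
    return softmax(masked)
```

Given uniform attention over the 9 positions, the output is symmetric about the center and largest at index 4, the citrulline position.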
In some preferred embodiments, the prediction model uses sliding-window prediction, with the window length set equal to the modeling sequence length (9 amino acids). When the input sequence is at least as long as the window, a fixed-length window is slid along it to form a number of subsequences, and the maximum prediction confidence over all subsequences is taken as the final prediction confidence. When the input sequence is shorter than the window, the missing positions are zero-padded in the encoding and do not participate in the forward pass of the network.
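The sliding-window procedure can be sketched as follows. The scoring function is passed in as a parameter; `toy_score` is a hypothetical stand-in for the trained model, "Z" is an assumed symbol for citrulline, and padding on the right with a "-" placeholder is an assumption, since the text does not say on which side short inputs are padded.

```python
def sliding_window_confidence(residues, score_window, window=9, pad="-"):
    """Return the maximum window score over a sequence of residues.

    Sequences shorter than `window` are right-padded with `pad`
    (an assumed convention) and scored once.
    """
    residues = list(residues)
    if len(residues) < window:
        padded = residues + [pad] * (window - len(residues))
        return score_window(padded)
    windows = [residues[i:i + window]
               for i in range(len(residues) - window + 1)]
    return max(score_window(w) for w in windows)


def toy_score(window):
    """Hypothetical scorer: high confidence when the center residue is 'Z'."""
    return 1.0 if window[len(window) // 2] == "Z" else 0.1
```

For an 11-residue input the function scores 3 windows; a real scorer would encode each window, run the network, and return the predicted confidence.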
The following is a further illustration of the invention in conjunction with examples.
Example 1
Citrulline and the 20 natural amino acids were encoded by a plurality of physicochemical properties according to the encoding scheme of Table 1, where N denotes the amino acid name, MV the molar volume, pI the isoelectric point, MW the molecular weight, X the n-octanol-water partition coefficient given by the XLogP3 software, HBD the number of hydrogen bond donors, HBA the number of hydrogen bond acceptors, RB the number of rotatable bonds, TPSA the topological polar surface area, HA the number of heavy atoms, and C the structural complexity of the amino acid:
TABLE 1 amino acid encoding data
In practice, the positive training set contained 10898 entries and the negative training set 95916 entries.
During iterative training of the prediction model by ten-fold cross-validation, the accuracy of the training and validation sets and the BCE loss curves are shown in FIGS. 4 and 5, respectively; the final accuracy (ACC) was 98.9% on the training set and 94.7% on the validation set.
The model was used to predict the citrullinated polypeptides of the universal ACPA epitopes and of the non-ACPA epitopes reported in Steen, J., et al., Recognition of Amino Acid Motifs, Rather Than Specific Proteins, by Human Plasma Cell-Derived Monoclonal Antibodies to Posttranslationally Modified Proteins in Rheumatoid Arthritis. Arthritis Rheumatol, 2019. 71(2): p. 196-209. As shown in FIG. 6 (A) and (B), the recall was 89.9% and 88%, respectively, which shows that the prediction method of the invention has high accuracy.
It should be understood that the scope of the present invention is not limited to the above embodiments. All technical schemes belonging to the concept of the invention belong to the protection scope of the invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (8)

1. A universal ACPA epitope prediction method based on a bidirectional recurrent neural network, characterized by comprising the following steps:
S0, encoding a plurality of physicochemical properties of the 20 natural amino acids and citrulline as data, and normalizing the encoded data separately for each physicochemical property to obtain normalized encoded data;
S1, converting the peptide sequence to be tested into a normalized encoded data set according to the correspondence between the amino acids it contains and the normalized encoded data;
S2, performing dimension reduction on the normalized encoded data set to obtain a feature data set;
S3, inputting the feature data set into a trained prediction model to obtain a prediction result;
wherein the prediction model is constructed based on a bidirectional recurrent neural network containing a self-attention mechanism;
the training of the prediction model comprises the following steps:
S31, obtaining, as candidate polypeptide fragments, citrullinated polypeptide fragments that show a positive synovial fluid reaction in the plasma of a plurality of RA patients;
S32, extracting a characteristic peptide fragment from each candidate polypeptide fragment;
S33, taking all the characteristic peptide fragments obtained as a positive training set, taking a plurality of other citrullinated polypeptide fragments that remain after de-duplication against the positive training set as a negative training set, and combining the positive and negative training sets into the training set used to train the prediction model;
wherein the characteristic peptide fragment is a peptide 9 amino acids in length, consisting of a citrulline at the center and the 4 amino acids on each side of it.
2. The prediction method according to claim 1, wherein in S0, the physicochemical properties include: molar volume of the amino acid, isoelectric point, molecular weight, n-octanol-water partition coefficient, number of hydrogen bond donors, number of hydrogen bond acceptors, number of rotatable bonds, topological polar surface area, number of heavy atoms, and structural complexity of the amino acid calculated by the Bertz/Hendrickson/Ihlenfeldt formula.
3. The prediction method according to claim 1, wherein the training further comprises:
S34, training the prediction model by a ten-fold cross-validation method based on the obtained training set.
4. The prediction method according to claim 3, wherein the prediction model is further provided with a mask processing layer between the self-attention mechanism and the output of the bidirectional recurrent neural network.
5. The prediction method according to claim 4, wherein the mask processing layer uses a mask matrix with fixed weights, the weights being distributed as follows: the middle weight is the highest, and the weights decrease gradually toward both ends.
6. The prediction method according to claim 5, wherein the mask matrix is a 1 x 9 matrix.
7. The prediction method according to claim 1, wherein the prediction model obtains the prediction confidence by a sliding window prediction method.
8. An apparatus for performing universal ACPA epitope prediction, comprising a storage medium storing a program and/or model for implementing the prediction method according to any one of claims 1 to 7.
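By way of illustration only, the characteristic peptide segment of claim 1 (citrulline with 4 residues on each side) and the fixed 1 x 9 mask of claims 5 and 6 (highest weight in the middle, decreasing outward) can be sketched as follows. The one-letter placeholder `Z` for citrulline, the skipping of citrullines too close to a sequence end, and the specific weight values are assumptions made for this example; the claims do not fix them.

```python
CIT = "Z"   # hypothetical one-letter placeholder for citrulline
FLANK = 4   # residues kept on each side of citrulline


def characteristic_peptides(sequence, cit=CIT, flank=FLANK):
    """Return every 9-residue window centered on a citrulline.

    Citrullines with fewer than `flank` residues on either side are
    skipped here; how such cases are handled is an assumption.
    """
    windows = []
    for i, residue in enumerate(sequence):
        if residue == cit and i >= flank and i + flank < len(sequence):
            windows.append(sequence[i - flank : i + flank + 1])
    return windows


def center_weighted_mask(length=9, step=0.2):
    """Fixed 1 x length weights: largest at the center, decreasing outward.

    The linear decay and the step size are illustrative values only.
    """
    mid = length // 2
    return [1.0 - step * abs(i - mid) for i in range(length)]


def apply_mask(position_scores, mask):
    """Element-wise weighting of per-position scores by the fixed mask."""
    if len(position_scores) != len(mask):
        raise ValueError("scores and mask must have the same length")
    return [s * w for s, w in zip(position_scores, mask)]


if __name__ == "__main__":
    print(characteristic_peptides("AAAAZGGGGZKKKK"))
    print(center_weighted_mask())
```

A center-weighted mask of this kind emphasizes the citrulline position and its nearest neighbors when per-position outputs are combined.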
CN202210752384.7A 2022-06-29 2022-06-29 Method and device for predicting universal anti-citrullinated polypeptide antibody epitope based on bidirectional circulating neural network Active CN115458049B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210752384.7A CN115458049B (en) 2022-06-29 2022-06-29 Method and device for predicting universal anti-citrullinated polypeptide antibody epitope based on bidirectional circulating neural network
PCT/CN2023/077121 WO2024001226A1 (en) 2022-06-29 2023-02-20 Bidirectional recurrent neural network-based promiscuous acpa epitope prediction method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210752384.7A CN115458049B (en) 2022-06-29 2022-06-29 Method and device for predicting universal anti-citrullinated polypeptide antibody epitope based on bidirectional circulating neural network

Publications (2)

Publication Number Publication Date
CN115458049A (en) 2022-12-09
CN115458049B (en) 2023-07-25

Family

ID=84297587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210752384.7A Active CN115458049B (en) 2022-06-29 2022-06-29 Method and device for predicting universal anti-citrullinated polypeptide antibody epitope based on bidirectional circulating neural network

Country Status (2)

Country Link
CN (1) CN115458049B (en)
WO (1) WO2024001226A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115458049B (en) * 2022-06-29 2023-07-25 四川大学 Method and device for predicting universal anti-citrullinated polypeptide antibody epitope based on bidirectional circulating neural network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112071361A (en) * 2020-04-11 2020-12-11 信华生物药业(广州)有限公司 Polypeptide TCR immunogenicity prediction method based on Bi-LSTM and Self-Attention

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017040832A1 (en) * 2015-09-01 2017-03-09 The Administrators Of The Tulane Educational Fund A method for cd4+ t-cell epitope prediction using antigen structure
CN106950365B (en) * 2017-02-15 2018-11-27 中国医学科学院北京协和医院 A kind of RA diagnosis marker of ACPA feminine gender and its application
US11557375B2 (en) * 2018-08-20 2023-01-17 Nantomics, Llc Methods and systems for improved major histocompatibility complex (MHC)-peptide binding prediction of neoepitopes using a recurrent neural network encoder and attention weighting
CN111868073B (en) * 2019-05-31 2021-04-30 广州市雷德生物科技有限公司 Specific polypeptide related to rheumatoid arthritis and application thereof
CN110991625B (en) * 2020-03-02 2020-06-16 南京邮电大学 Surface anomaly remote sensing monitoring method and device based on recurrent neural network
CN112699960B (en) * 2021-01-11 2023-06-09 华侨大学 Semi-supervised classification method, equipment and storage medium based on deep learning
CN114280304A (en) * 2021-04-30 2022-04-05 北京纳么美科技有限公司 Poly antigen and application thereof in detection of autoimmune disease
CN114386528B (en) * 2022-01-18 2024-05-14 平安科技(深圳)有限公司 Model training method and device, computer equipment and storage medium
CN114496064A (en) * 2022-01-18 2022-05-13 武汉大学 CCS prediction model construction method, device, equipment and readable storage medium
CN115458049B (en) * 2022-06-29 2023-07-25 四川大学 Method and device for predicting universal anti-citrullinated polypeptide antibody epitope based on bidirectional circulating neural network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112071361A (en) * 2020-04-11 2020-12-11 信华生物药业(广州)有限公司 Polypeptide TCR immunogenicity prediction method based on Bi-LSTM and Self-Attention

Also Published As

Publication number Publication date
CN115458049A (en) 2022-12-09
WO2024001226A1 (en) 2024-01-04

Similar Documents

Publication Publication Date Title
CN109671469B (en) Method for predicting binding relationship and binding affinity between polypeptide and HLA type I molecule based on circulating neural network
US8694265B2 (en) Bioinformatic approach to disease diagnosis
AU2022221568A1 (en) GAN-CNN for MHC peptide binding prediction
CN115458049B (en) Method and device for predicting universal anti-citrullinated polypeptide antibody epitope based on bidirectional circulating neural network
Lo et al. Prediction of conformational epitopes with the use of a knowledge-based energy function and geometrically related neighboring residue characteristics
WO2010017559A1 (en) Methods and systems for predicting proteins that can be secreted into bodily fluids
CN112912960A (en) Methods and systems for improving Major Histocompatibility Complex (MHC) -peptide binding prediction for neoepitopes using a recurrent neural network encoder and attention weighting
CN108009405A (en) A kind of method based on machine learning techniques prediction Bacterial outer membrane proteins matter
Golugula et al. Evaluating feature selection strategies for high dimensional, small sample size datasets
Berghofer et al. Gauge symmetries, symmetry breaking, and gauge-invariant approaches
He et al. A multimodal deep architecture for large-scale protein ubiquitylation site prediction
US20240120022A1 (en) Predicting protein amino acid sequences using generative models conditioned on protein structure embeddings
CN114242159B (en) Method for constructing antigen peptide presentation prediction model, and antigen peptide prediction method and device
CN115148277A (en) Affinity prediction method, device, equipment and storage medium
Huang et al. Using random forest to classify linear B-cell epitopes based on amino acid properties and molecular features
CN117497058A (en) Antibody antigen neutralization prediction method and device based on graphic neural network
CN117037897A (en) Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding
CN103778182B (en) A kind of Fast Graphics similarity method of discrimination
CN116343922A (en) Method for predicting polypeptide based on machine learning
Sun et al. B-cell epitope prediction method based on deep ensemble architecture and sequences
CN101609486B (en) Identification method of superclass of G-protein-coupled receptors and Web service system thereof
Arango-Argoty et al. Feature extraction by statistical contact potentials and wavelet transform for predicting subcellular localizations in gram negative bacterial proteins
Xu et al. DTL-NeddSite: A Deep-Transfer Learning Architecture for Prediction of Lysine Neddylation Sites
CN116959577A (en) Composite structure prediction method, device, computer equipment and storage medium
Chen et al. Predicting contact map using radial basis function neural network with conformational energy function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant