WO2021106706A1 - Amino acid sequence search device, vaccine, amino acid sequence search method, and amino acid sequence search program


Info

Publication number
WO2021106706A1
Authority
WO
WIPO (PCT)
Prior art keywords
amino acid
acid sequence
sequence data
data
binding
Prior art date
Application number
PCT/JP2020/042958
Other languages
English (en)
Japanese (ja)
Inventor
春佳 藤田
九月 貞光
坂口 誠
昭子 天満
啓徳 中神
森下 竜一
Original Assignee
フューチャー株式会社
株式会社ファンペップ
国立大学法人大阪大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by フューチャー株式会社, 株式会社ファンペップ, 国立大学法人大阪大学 filed Critical フューチャー株式会社
Priority to JP2021561342A priority Critical patent/JPWO2021106706A1/ja
Publication of WO2021106706A1 publication Critical patent/WO2021106706A1/fr


Classifications

    • C — CHEMISTRY; METALLURGY
    • C07 — ORGANIC CHEMISTRY
    • C07K — PEPTIDES
    • C07K14/00 — Peptides having more than 20 amino acids; Gastrins; Somatostatins; Melanotropins; Derivatives thereof
    • C07K2/00 — Peptides of undefined number of amino acids; Derivatives thereof
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B — BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 — ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 — Drug targeting using structural data; Docking or binding prediction
    • G16B40/00 — ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 — Supervised data analysis

Definitions

  • The present invention relates to an amino acid sequence search device, a vaccine, an amino acid sequence search method, and an amino acid sequence search program.
  • FIG. 37 is a diagram schematically showing the interaction between peptides, antibodies, and epitopes.
  • When an antigen that causes a disease invades the body, B cells are activated, and the B cells produce and release antibodies.
  • An antibody released from a B cell recognizes and binds to an epitope (a partial sequence of the amino acid sequence constituting the antigen) in the antigen that reacts with it. The binding of the antibody to the epitope reduces the function of the antigen.
  • Therefore, a peptide (a short chain of amino acids) corresponding to the epitope is administered as a vaccine from outside the body.
  • When the peptide binds to a B cell, the B cell more actively produces antibodies suited to the epitope in the antigen.
  • As a result, the type of antibody is optimized and the number of antibodies increases, so that the function of the antigen is effectively reduced.
  • Non-Patent Document 1 discloses a technique for searching for an epitope corresponding to a peptide from within an antigen by using an RNN (Recurrent Neural Network), which is a kind of DNN (Deep Neural Network). Specifically, Non-Patent Document 1 uses an RNN to calculate the probability that a partial sequence of the amino acid sequence constituting an antigen becomes a peptide.
  • T. N. Kipf and M. Welling, "Semi-Supervised Classification with Graph Convolutional Networks", February 22, 2017, arXiv:1609.02907, [online], [searched July 9, 2019], Internet <https://arxiv.org/abs/1609.02907>
  • K. Xu and 3 others, "How Powerful are Graph Neural Networks?", February 22, 2019, arXiv:1810.00826, [online], [searched July 9, 2019], Internet <https://arxiv.org/abs/1810.00826>
  • However, Non-Patent Document 1 utilizes only a basic DNN for deep learning and focuses only on the partial sequence corresponding to the amino acid sequence length of the peptide within the entire amino acid sequence constituting the antigen; consequently, there is a problem that the epitope search accuracy is low.
  • The present invention has been made in view of the above circumstances, and an object of the present invention is to improve the accuracy of searching for a target amino acid sequence.
  • The amino acid sequence search device of one aspect of the present invention is characterized by including: a storage unit that stores a trained model for amino acid sequence binding prediction in which the sequence data structure of amino acid sequence data has been learned for a plurality of amino acid sequence data using a deep learning model with an attention mechanism; an input unit for inputting first amino acid sequence data and second amino acid sequence data; and a search unit that uses the trained model read from the storage unit to output binding prediction information regarding whether or not the first amino acid sequence data binds as a part of the second amino acid sequence data.
  • The amino acid sequence search device is characterized by further including a learning unit that uses the deep learning model to learn the sequence data structure of the amino acid sequence data and the degree of binding between the plurality of amino acid sequence data, and stores the learned result in the storage unit as the trained model.
  • The trained model is characterized in that it processes the first amino acid sequence data and the second amino acid sequence data with two deep learning models, respectively, and outputs the binding prediction information based on the combined processing results.
  • The trained model is characterized in that it processes the first amino acid sequence data and the second amino acid sequence data with one deep learning model and outputs the binding prediction information based on the result of that processing.
  • The deep learning model is characterized by being one of a statistical learning model with an attention mechanism that handles sequence data, a learning model with a self-attention mechanism, and a learning model with a source-target attention mechanism.
  • The trained model is characterized in that it outputs the binding prediction information based on the result of processing using augmented first amino acid sequence data in which amino acid sequence data is added at a position before, after, or both before and after the first amino acid sequence data.
  • The trained model is characterized in that it outputs the binding prediction information based on the result of processing using the physicochemical feature amounts or biochemical feature amounts of the amino acids contained in the first amino acid sequence data or the second amino acid sequence data.
  • The amino acid sequence search device is characterized in that a machine learning model is used instead of the deep learning model with the attention mechanism.
  • The amino acid sequence search device of one aspect of the present invention is characterized by including: a storage unit that stores a trained model for amino acid sequence binding prediction in which the three-dimensional structure of the amino acids contained in amino acid sequence data has been learned for a plurality of amino acid sequence data using a deep learning model that handles graphs; an input unit for inputting first amino acid sequence data and second amino acid sequence data; and a search unit that uses the trained model read from the storage unit to output binding prediction information regarding whether or not the first amino acid sequence data binds as a part of the second amino acid sequence data. The trained model processes the first amino acid sequence data and the second amino acid sequence data with one deep learning model and outputs the binding prediction information based on the combined result of the processing.
  • The amino acid sequence search device is characterized by further including a learning unit that uses the deep learning model to learn the three-dimensional structure of the amino acids contained in the amino acid sequence data and the degree of binding between the plurality of amino acid sequence data, and stores the learned result in the storage unit as the trained model.
  • The amino acid sequence search device is characterized in that a machine learning model is used instead of the deep learning model that handles graphs.
  • The vaccine of one aspect of the present invention is characterized by containing the amino acids of first amino acid sequence data that is determined, using a trained model for amino acid sequence binding prediction in which the sequence data structure of amino acid sequence data has been learned for a plurality of amino acid sequence data using a deep learning model with an attention mechanism, to bind as a part of second amino acid sequence data.
  • The vaccine of one aspect of the present invention is characterized by containing the amino acids of amino acid sequence data that is determined, using a trained model for amino acid sequence binding prediction in which the three-dimensional structure of the amino acids contained in amino acid sequence data has been learned for a plurality of amino acid sequence data using a deep learning model that handles graphs, to bind to predetermined amino acid sequence data.
  • The amino acid sequence search method of one aspect of the present invention is an amino acid sequence search method performed by an amino acid sequence search device, in which the amino acid sequence search device performs: a step of storing in a storage unit a trained model for amino acid sequence binding prediction in which the sequence data structure of amino acid sequence data has been learned for a plurality of amino acid sequence data using a deep learning model with an attention mechanism; a step of inputting first amino acid sequence data and second amino acid sequence data; and a step of outputting, using the trained model read from the storage unit, binding prediction information regarding whether or not the first amino acid sequence data binds as a part of the second amino acid sequence data.
  • The amino acid sequence search method is characterized in that a machine learning model is used instead of the deep learning model with the attention mechanism.
  • The amino acid sequence search method of one aspect of the present invention is an amino acid sequence search method performed by an amino acid sequence search device, in which the amino acid sequence search device performs: a step of storing in a storage unit a trained model for amino acid sequence binding prediction in which the three-dimensional structure of the amino acids contained in amino acid sequence data has been learned for a plurality of amino acid sequence data using a deep learning model that handles graphs; a step of inputting first amino acid sequence data and second amino acid sequence data; and a step of outputting binding prediction information regarding whether or not the first amino acid sequence data binds as a part of the second amino acid sequence data. The trained model processes the first amino acid sequence data and the second amino acid sequence data with one deep learning model and outputs the binding prediction information based on the result of combining the processed results.
  • The amino acid sequence search method of one aspect of the present invention is an amino acid sequence search method performed by an amino acid sequence search device, in which the amino acid sequence search device performs: a step of storing in a storage unit a trained model for amino acid sequence binding prediction in which the three-dimensional structure of the amino acids contained in amino acid sequence data has been learned for a plurality of amino acid sequence data using a deep learning model that handles graphs; a step of inputting predetermined amino acid sequence data; and a step of outputting, using the trained model read from the storage unit, amino acid sequence data that binds to the predetermined amino acid sequence data.
  • The amino acid sequence search method is characterized in that a machine learning model is used instead of the deep learning model that handles graphs.
  • The amino acid sequence search program according to one aspect of the present invention is characterized by causing a computer to function as the above amino acid sequence search device.
  • According to the present invention, the accuracy of searching for a target amino acid sequence can be improved.
  • Brief description of the drawings (excerpt): a diagram showing the detailed operation flow of step S105 according to the third example; a diagram showing the functional block configuration of the information processing device according to the second embodiment; a diagram schematically showing the functions of the information processing device during learning; a diagram schematically showing the functions of the information processing device during search; a diagram showing the operation flow during learning of the information processing device according to the second embodiment; a diagram showing the detailed operation flow of step S505; and a diagram showing the operation flow during search of the information processing device according to the second embodiment.
  • Vaccines (peptides), B cells (antibodies), and antigens (epitopes) interact as described above. They are all proteins, each consisting of amino acid sequence data in which a plurality of amino acids constituting the protein are arranged serially and three-dimensionally.
  • The present invention discloses a technique (an amino acid sequence search algorithm) for searching for a target amino acid sequence from one or two amino acid sequences.
  • Specifically, a first amino acid sequence and a second amino acid sequence are input, and binding prediction information (for example, a degree of binding, a binding probability (%), or a label indicating binding or non-binding) regarding whether or not the first amino acid sequence binds as a part of the second amino acid sequence is calculated and output using the trained model described later.
  • In the first embodiment, a trained model for amino acid sequence binding prediction is used in which the sequence data structure of the amino acid sequence has been learned for a plurality of amino acid sequences using a DNN with Attention (a deep learning model with an attention mechanism).
  • Attention is a mechanism that determines which part of the DNN to pay attention to, for example focusing on a specific word, among the multiple layers (for example, hidden layers) that make up the DNN.
  • The part to pay attention to can be set as appropriate.
  • The degree of attention is expressed as a weight.
  • In the present invention, whether or not two amino acid sequences bind is treated as the target of Attention, and the importance of the binding is weighted.
  • Examples of the deep learning model included in the DNN with Attention include the LSTM with Attention, BERT, and Transformer described later; a deep learning model other than these may also be used.
  • Further, not only deep learning models such as DNNs but also various machine learning models (for example, LightGBM and SVM) can be used instead of the deep learning model, as sketched below.
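As an illustration of this substitution, here is a minimal sketch that trains LightGBM as the machine learning model in place of a DNN. The fixed-length feature vectors, their dimensionality, and the synthetic data are assumptions made purely for illustration, not the design specified by the present invention.

```python
# A hedged sketch: LightGBM substituted for the deep learning model.
# In practice, fixed-length features derived from the two amino acid sequences
# (e.g. the physicochemical/biochemical feature amounts) would replace the
# random placeholders below.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.random((200, 24))        # feature vectors for 200 sequence pairs (assumed)
y = rng.integers(0, 2, 200)      # 1 = binding, 0 = non-binding

model = lgb.LGBMClassifier(n_estimators=100)
model.fit(X, y)
print(model.predict_proba(X[:1])[0, 1])  # predicted binding probability
```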
  • In the second embodiment, an amino acid sequence of arbitrary length is added before and after the amino acid sequence, and the physicochemical and biochemical feature amounts of the amino acids are further used. That is, in the second embodiment, instead of focusing on "only the partial sequence corresponding to the peptide in the entire sequence of the amino acid sequences constituting the antigen" as in Non-Patent Document 1, an amino acid sequence of arbitrary length (a window) before and after the partial sequence (before and after the peptide) is also used. Specifically, in the second embodiment, the binding prediction information described above is output based on the result of processing using augmented first amino acid sequence data in which an amino acid sequence of predetermined length is added at a position before, after, or both before and after the first amino acid sequence.
  • In this way, the amino acid sequence added before, after, or both before and after the amino acid sequence is further used (a sketch of the window construction follows below).
  • Therefore, the search accuracy can be further improved.
  • Further, since the physicochemical feature amounts and the biochemical feature amounts of the amino acids are also used, the accuracy of searching for the target amino acid sequence can be further improved.
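The following is a minimal sketch of how such a windowed sequence could be constructed; the 0-based indexing convention and the window length of 5 are assumptions for illustration.

```python
# Build the "windowed" amino acid sequence: the peptide's partial sequence
# within the antigen plus up to `window` flanking residues on each side,
# clipped at the antigen boundaries.
def add_window(antigen: str, start: int, end: int, window: int = 5) -> str:
    return antigen[max(0, start - window):min(len(antigen), end + window)]

antigen = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy antigen sequence (assumed)
print(add_window(antigen, 10, 20))  # the peptide plus its flanking window
```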
  • In the third embodiment, an amino acid sequence having a three-dimensional graph structure is used, in which each amino acid constituting the amino acid sequence is a node and the relationship between amino acids is an edge.
  • In the third embodiment, the three-dimensional structure of the amino acid sequence is modeled using a GNN (Graph Neural Network), a deep learning model that handles graphs. More specifically, in the third embodiment, the three-dimensional structure of the amino acids contained in the amino acid sequence data (for example, the positions of the amino acids) is learned for a plurality of amino acid sequences using the GNN.
  • As the GNN, for example, a GCN (Graph Convolutional Network) can be used.
  • Since a trained model for amino acid sequence binding prediction in which the three-dimensional structure of the amino acids has been learned using the GNN is used, the accuracy of searching for the target amino acid sequence can be further improved.
  • Various machine learning models can also be used instead of a deep learning model such as a GNN.
  • The present invention is applicable not only to amino acids such as vaccines (peptides), B cells (antibodies), and antigens (epitopes), but also to other types of amino acids.
  • FIG. 1 shows the combinations of amino acid sequences to be bound and the technical features of the present invention used in each embodiment.
  • First, the first embodiment will be described. In the first embodiment, a peptide and an antibody are used as the amino acid sequences to be bound.
  • Hereinafter, the antibody is referred to as the "binding target protein".
  • In the first embodiment, the amino acid sequence of the peptide and the amino acid sequence of the binding target protein are used, and a trained model for amino acid sequence binding prediction is generated using a DNN with Attention that targets those two amino acid sequences (two molecules in total). Then, in the first embodiment, the binding probability between the peptide and the binding target protein is calculated using the trained model. The Attention referred to here weights the importance of the binding, that is, whether or not the two amino acid sequences bind. Further, in the first embodiment, the physicochemical and biochemical feature amounts of the peptide and the binding target protein are also used.
  • In the first example, the peptide and the binding target protein are each processed by two LSTMs (or BERTs) with Attention (two deep learning models), and the two processing results are then combined.
  • In the second example, the binding relationship between the peptide and the binding target protein is learned with a single BERT (one deep learning model).
  • In the third example, the binding is weighted with a Transformer (one deep learning model), which applies Source-Target Attention using both the peptide and the binding target protein.
  • FIG. 2 is a diagram showing the functional block configuration of the information processing device 1 according to the first embodiment (first example).
  • The information processing device 1 is a device that generates a trained model, and is an amino acid sequence search device that searches for a target amino acid sequence (calculates a binding probability) using the generated trained model.
  • As shown in FIG. 2, the information processing device 1 according to the first example includes a first input unit 11, a first calculation unit 12, a second input unit 13, a second calculation unit 14, a learning unit 15, a search unit 16, a storage unit 17, and a display unit 18.
  • FIGS. 3 and 4 are diagrams schematically showing the functions of the information processing device 1 during learning and during search, respectively.
  • The first input unit (input unit) 11 has a function of inputting data on the amino acid sequence of the peptide (first amino acid sequence data).
  • At the time of learning, the first input unit 11 inputs an amino acid sequence labeled as correct (a binding amino acid sequence) for a peptide that binds to a predetermined binding target protein, and an amino acid sequence labeled as incorrect (a non-binding amino acid sequence) for a peptide that does not bind.
  • The first input unit 11 inputs a plurality of amino acid sequences of such combinations as learning data.
  • At the time of search, the first input unit 11 inputs the amino acid sequence of a peptide having no label (correct label or incorrect label) as prediction target data.
  • The first calculation unit 12 has a function of calculating the physicochemical and biochemical feature amounts of the peptide input by the first input unit 11. For example, the first calculation unit 12 calculates the physicochemical feature amounts of the peptide.
  • The second input unit (input unit) 13 has a function of inputting data on the amino acid sequence (whole or part) of the binding target protein (second amino acid sequence data).
  • At the time of learning, the second input unit 13 inputs the amino acid sequence of the predetermined binding target protein corresponding to the amino acid sequence of the peptide labeled as correct and the amino acid sequence of the peptide labeled as incorrect.
  • The second input unit 13 inputs a plurality of amino acid sequences of such binding target proteins as learning data.
  • At the time of search, the second input unit 13 inputs the amino acid sequence of the binding target protein as prediction target data.
  • The second calculation unit 14 has a function of calculating the physicochemical and biochemical feature amounts of the binding target protein input by the second input unit 13. For example, the second calculation unit 14 calculates the biochemical feature amounts of the binding target protein.
  • The learning unit 15 has a function of inputting the amino acid sequences (plural) of each peptide labeled as correct or incorrect, the physicochemical and biochemical feature amounts of each peptide, the amino acid sequences (plural) of the predetermined binding target protein corresponding to each peptide, and the physicochemical and biochemical feature amounts of the predetermined binding target protein into a learning model having LSTMs or BERTs with Attention, learning the degree of binding and the degree of relationship between the amino acid sequences, thereby generating a trained model for amino acid sequence binding prediction that calculates the degree of binding (binding probability) of two amino acid sequences to be bound, and storing the data of the generated trained model in the storage unit 17.
  • Specifically, the learning unit 15 has a function of converting the amino acid sequence of the peptide into one-hot-vector values, inputting the converted one-hot-vector values into the LSTM or BERT with Attention to execute the learning processing, and concatenating the physicochemical and biochemical feature amounts of the peptide with the result of the learning processing.
  • Likewise, the learning unit 15 has a function of converting the amino acid sequence of the binding target protein into one-hot-vector values, inputting the converted one-hot-vector values into the LSTM or BERT with Attention to execute the learning processing, and concatenating the physicochemical and biochemical feature amounts of the binding target protein with the result of the learning processing.
  • Further, the learning unit 15 has a function of fully connecting the concatenated data of the peptide and the concatenated data of the binding target protein, and converting the single fully connected data into a probability value using an activation function such as a non-linear function.
  • By executing the above functions, the learning unit 15 generates a trained model for amino acid sequence binding prediction in which the sequence data structure of the amino acid sequence has been learned for a plurality of amino acid sequences using the LSTM or BERT with Attention.
  • The search unit (output unit) 16 has a function of inputting the amino acid sequence of the peptide that is the prediction target data, the physicochemical and biochemical feature amounts of the peptide, the amino acid sequence of the binding target protein that is the prediction target data, and the physicochemical and biochemical feature amounts of the binding target protein into the trained model for amino acid sequence binding prediction, and calculating and outputting the degree of binding (binding probability) at which the peptide binds as a part of the binding target protein.
  • The storage unit 17 has a function of readably storing the data of the learning model trained using the LSTM or BERT with Attention and the data of the trained model for amino acid sequence binding prediction obtained by that learning.
  • The display unit 18 has a function of displaying, on a display, various data required for searching for a target amino acid sequence. For example, the display unit 18 displays input fields for the amino acid sequence of the peptide and the amino acid sequence of the binding target protein, and displays the binding probability calculated using the trained model.
  • FIG. 5 is a diagram showing the operation flow of the information processing device 1 during learning according to the first embodiment.
  • Step S101: First, the first input unit 11 inputs the data of the amino acid sequence of a peptide labeled as correct with respect to a predetermined binding target protein, and also inputs the data of the amino acid sequence of a peptide labeled as incorrect.
  • At this time, the first input unit 11 inputs a plurality of amino acid sequences of such combinations as learning data.
  • Step S102: Next, the first calculation unit 12 calculates the physicochemical feature amounts of the plurality of peptides input in step S101 for each peptide.
  • For example, the first calculation unit 12 performs the calculation using an existing tool such as the IEDB-API, with the amino acid sequence of the peptide (for example, "LCEGAVLPRSAKELR") as input.
  • Step S103: Subsequently, the second input unit 13 inputs the amino acid sequence of the predetermined binding target protein corresponding to the peptide amino acid sequences of the combinations input in step S101.
  • At this time, the second input unit 13 inputs a plurality of amino acid sequences of such binding target proteins as learning data.
  • Step S104: Next, the second calculation unit 14 calculates the biochemical feature amounts of the plurality of binding target proteins input in step S103 for each binding target protein.
  • For example, the second calculation unit 14 performs the calculation using the Biopython library or the like.
  • For example, scores related to the isoelectric point, the ratio of amino acids having an aromatic ring, hydrophobicity, stability, surface accessibility, and the like can be calculated from the binding target protein, as illustrated in the drawings.
  • The Biopython library is available at "https://biopython.org/"; a minimal sketch of such a feature calculation follows.
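As an illustration, here is a minimal sketch of such a feature calculation with the Biopython library; the exact feature set used in the embodiments is not specified here, so this selection is an assumption based on the examples above.

```python
# Compute example biochemical feature amounts with Biopython's ProtParam module.
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def biochemical_features(sequence: str) -> list[float]:
    analysis = ProteinAnalysis(sequence)
    return [
        analysis.isoelectric_point(),  # isoelectric point
        analysis.aromaticity(),        # ratio of amino acids with an aromatic ring
        analysis.gravy(),              # hydrophobicity (grand average of hydropathy)
        analysis.instability_index(),  # stability
    ]

print(biochemical_features("LCEGAVLPRSAKELR"))
```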
  • Step S105: After that, the learning unit 15 inputs the amino acid sequences (plural) of each peptide labeled as correct or incorrect, which were input or calculated in steps S101 to S104, the physicochemical feature amounts of each peptide, the amino acid sequences (plural) of the predetermined binding target protein corresponding to each peptide, and the biochemical feature amounts of the predetermined binding target protein into a learning model having two LSTMs (or BERTs) with Attention, and performs learning.
  • At this time, the learning unit 15 optimizes the weights of each layer in the learning model based on the amino acid sequences of the peptides labeled as correct.
  • The weight optimization method using correct labels can be carried out by a known technique.
  • Step S106: Finally, the learning unit 15 stores the learning model generated in step S105 in the storage unit 17 as the trained model for amino acid sequence binding prediction.
  • Note that steps S103 and S104 may be executed before steps S101 and S102, or at the same time as steps S101 and S102.
  • FIG. 8 is a diagram showing the detailed operation flow of step S105 shown in FIG. 5.
  • FIG. 9 is a diagram schematically showing an example of a learning model having an LSTM with Attention.
  • However, the number of dimensions (number of units) of each layer is merely an example.
  • The learning model is composed of, for example, an input layer, an intermediate layer, a connecting layer, a fully connected layer, and an output layer.
  • The learning unit 15 outputs a trained model (a model in which the weight values of each layer have been optimized) by inputting learning data with correct labels into the learning model and performing learning.
  • Step S105a: First, the learning unit 15 converts the amino acid sequence of the peptide into one-hot-vector values in the input layer.
  • The number of dimensions of the one-hot-vector is the total number of amino acid types (20 types).
  • The method of converting into one-hot-vector values can be carried out by a known technique; a minimal sketch follows.
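A minimal sketch of step S105a follows; the ordering of the 20-letter amino acid alphabet is an assumption.

```python
# Map each amino acid to a 20-dimensional one-hot vector (one dimension per type).
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence: str) -> np.ndarray:
    vectors = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for position, aa in enumerate(sequence):
        vectors[position, INDEX[aa]] = 1.0
    return vectors

print(one_hot("LCE").shape)  # (3, 20): three amino acids, 20 dimensions each
```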
  • Step S105b: Next, the learning unit 15 inputs the one-hot-vector values of the peptide converted in step S105a into the intermediate layer composed of an LSTM layer and an Attention layer, and executes the learning processing.
  • The LSTM layer is a layer that models the order of the amino acids. Each LSTM connected in a row receives the one-hot-vector value of one amino acid, such as "L", "C", or "E", together with the output of the immediately preceding LSTM, and thereby memorizes the order of the amino acids.
  • The Attention layer is a layer that assigns a weight indicating the importance of the binding between the peptide and the binding target protein; an importance can be given to each amino acid.
  • The optimization of the entire network adjusts the weights of each layer so as to minimize the error between the predicted probability calculated by the network and the correct label.
  • The learning processing using the LSTM model can be executed by a known technique. The technique related to Attention is disclosed in Non-Patent Document 2 described above. A sketch of such an LSTM-with-Attention encoder follows.
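The following PyTorch sketch shows one way the LSTM layer and Attention layer of step S105b could be realized; the class name, layer sizes, and the simple additive attention are assumptions, not the architecture specified by the present invention.

```python
# LSTM models the order of the amino acids; the attention layer assigns an
# importance weight to each position and pools the weighted states.
import torch
import torch.nn as nn

class AttentionLSTMEncoder(nn.Module):
    def __init__(self, n_amino_acids: int = 20, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_amino_acids, hidden, batch_first=True)
        self.attention = nn.Linear(hidden, 1)  # one importance score per position

    def forward(self, one_hot_seq: torch.Tensor) -> torch.Tensor:
        # one_hot_seq: (batch, sequence_length, 20)
        states, _ = self.lstm(one_hot_seq)                      # (batch, L, hidden)
        weights = torch.softmax(self.attention(states), dim=1)  # (batch, L, 1)
        return (weights * states).sum(dim=1)                    # (batch, hidden)

encoder = AttentionLSTMEncoder()
print(encoder(torch.zeros(2, 15, 20)).shape)  # torch.Size([2, 64])
```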
  • Step S105c: Next, in the connecting layer, the learning unit 15 concatenates the physicochemical feature amounts of the peptide calculated in step S102 with the result of the learning processing performed in step S105b.
  • For example, the learning unit 15 appends the four-dimensional feature amounts related to "Chou-Fasman, Emini, Kolaskar-Tongaonkar, Parker" to the dimensions of the final layer of the preceding stage.
  • Step S105d: Subsequently, the learning unit 15 converts the amino acid sequence of the binding target protein into one-hot-vector values in the input layer, in the same manner as in step S105a.
  • Step S105e: Further, in the same manner as in step S105b, the learning unit 15 inputs the one-hot-vector values of the binding target protein converted in step S105d into the intermediate layer composed of an LSTM layer and an Attention layer, and executes the learning processing.
  • Step S105f: Further, in the same manner as in step S105c, the learning unit 15 concatenates the feature amounts of the binding target protein calculated in step S104 with the result of the learning processing performed in step S105e in the connecting layer.
  • For example, the learning unit 15 appends the four-dimensional feature amounts related to "isoelectric point, ratio of amino acids having an aromatic ring, hydrophobicity, stability" to the dimensions of the preceding layer.
  • Step S105g: After that, in the fully connected layer, the learning unit 15 fully connects the concatenated data of the peptide from step S105c and the concatenated data of the binding target protein from step S105f, thereby modeling the whole and obtaining a one-dimensional output.
  • Step S105h: Finally, the learning unit 15 converts the one-dimensional fully connected data from step S105g into a probability value using an activation function such as Sigmoid or another non-linear function.
  • By executing the above processing, the learning unit 15 generates a trained model for amino acid sequence binding prediction in which the sequence data structure of the amino acid sequence has been learned for a plurality of amino acid sequences using the LSTM or BERT with Attention.
  • Note that steps S105d to S105f may be executed before steps S105a to S105c, or at the same time as steps S105a to S105c. Further, since BERT includes an Attention layer internally, the same processing can be executed for a learning model having BERT. The overall flow of steps S105a to S105h is sketched below.
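The sketch below strings the pieces together under the same assumptions (it reuses the AttentionLSTMEncoder from the previous sketch): two encoder towers, the connecting layers appending the four-dimensional feature amounts, the fully connected layer, and the Sigmoid output. Training against the correct labels would then use a standard binary cross-entropy loss.

```python
# Overall two-tower flow of steps S105a-S105h (all sizes are assumptions).
import torch
import torch.nn as nn

class TwoTowerBindingModel(nn.Module):
    def __init__(self, hidden: int = 64, n_features: int = 4):
        super().__init__()
        self.peptide_encoder = AttentionLSTMEncoder(hidden=hidden)  # defined above
        self.protein_encoder = AttentionLSTMEncoder(hidden=hidden)
        self.classify = nn.Linear(2 * (hidden + n_features), 1)    # fully connected layer

    def forward(self, peptide, protein, peptide_features, protein_features):
        p = torch.cat([self.peptide_encoder(peptide), peptide_features], dim=1)  # S105a-S105c
        q = torch.cat([self.protein_encoder(protein), protein_features], dim=1)  # S105d-S105f
        return torch.sigmoid(self.classify(torch.cat([p, q], dim=1)))            # S105g-S105h

model = TwoTowerBindingModel()
probability = model(torch.zeros(1, 15, 20), torch.zeros(1, 50, 20),
                    torch.zeros(1, 4), torch.zeros(1, 4))
print(probability)  # a binding probability in (0, 1)
```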
  • FIG. 10 is a diagram showing the operation flow of the information processing device 1 during search according to the first embodiment.
  • Step S201: First, the first input unit 11 inputs the amino acid sequence of a peptide having no label (correct label or incorrect label), which is the prediction target data.
  • Step S202: Next, the first calculation unit 12 calculates the physicochemical feature amounts of the peptide input in step S201.
  • Step S203: Subsequently, the second input unit 13 inputs the amino acid sequence of the binding target protein, which is the prediction target data.
  • Step S204: Next, the second calculation unit 14 calculates the biochemical feature amounts of the binding target protein input in step S203.
  • Step S205: Finally, the search unit 16 inputs the amino acid sequence of the peptide, the physicochemical feature amounts of the peptide, the amino acid sequence of the binding target protein, and the biochemical feature amounts of the binding target protein, which were input or calculated in steps S201 to S204, into the trained model for amino acid sequence binding prediction, and calculates and outputs the binding probability at which the peptide binds as a part of the binding target protein.
  • At this time, the search unit 16 may output a label indicating binding when the binding probability is equal to or greater than a threshold value, and a label indicating non-binding when the binding probability is less than the threshold value.
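A minimal sketch of this thresholding follows; the 0.5 threshold is an assumption.

```python
# Convert a predicted binding probability into a binding / non-binding label.
def binding_label(probability: float, threshold: float = 0.5) -> str:
    return "binding" if probability >= threshold else "non-binding"

print(binding_label(0.87))  # binding
```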
  • Next, the second example will be described. The information processing device 1 according to the second example has the same configuration as that of the first example shown in FIG. 2. FIGS. 11 and 12 are diagrams schematically showing the functions of the information processing device 1 during learning and during search, respectively.
  • In the second example, the learning unit 15 has a function of converting the amino acid sequence of the peptide and the amino acid sequence of the binding target protein each into one-hot-vector values, inputting those two sets of one-hot-vector values into BERT to execute the learning processing, and concatenating the physicochemical and biochemical feature amounts of the peptide and the physicochemical and biochemical feature amounts of the binding target protein with the result of the learning processing.
  • FIG. 13 is a diagram showing the detailed operation flow, in the second example, of step S105 shown in FIG. 5.
  • Step S301: First, the learning unit 15 converts the amino acid sequence of the peptide into one-hot-vector values in the input layer.
  • Step S302: Next, the learning unit 15 converts the amino acid sequence of the binding target protein into one-hot-vector values in the input layer.
  • Step S303: Next, the learning unit 15 inputs the one-hot-vector values of the peptide converted in step S301 and the one-hot-vector values of the binding target protein converted in step S302 into the intermediate layer composed of a BERT layer, and executes the learning processing.
  • The learning processing using the BERT model can be executed by a known technique.
  • Step S304: Next, in the connecting layer, the learning unit 15 concatenates the physicochemical feature amounts of the peptide and the physicochemical feature amounts of the binding target protein with the result of the learning processing performed in step S303.
  • Step S305: After that, in the fully connected layer, the learning unit 15 fully connects all the data concatenated in step S304, thereby modeling the whole and obtaining a one-dimensional output.
  • Step S306: Finally, the learning unit 15 converts the one-dimensional fully connected data from step S305 into a probability value using an activation function such as Sigmoid or another non-linear function.
  • By executing the above processing, the learning unit 15 generates a trained model for amino acid sequence binding prediction in which the sequence data structure of the amino acid sequence has been learned for a plurality of amino acid sequences using BERT.
  • Note that step S302 may be executed before step S301, or at the same time as step S301. A sketch of this one-model pair processing follows.
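The sketch below illustrates steps S301 to S306 with a small Transformer encoder standing in for BERT; all sizes are assumptions, and in practice a pretrained BERT-style model would typically be used instead.

```python
# One model jointly processes the two one-hot sequences (S301-S303), the
# feature amounts are concatenated (S304), and a fully connected layer plus
# Sigmoid yields the probability value (S305-S306).
import torch
import torch.nn as nn

class PairBindingModel(nn.Module):
    def __init__(self, n_amino_acids: int = 20, d_model: int = 32, n_features: int = 8):
        super().__init__()
        self.project = nn.Linear(n_amino_acids, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classify = nn.Linear(d_model + n_features, 1)

    def forward(self, peptide, protein, features):
        pair = torch.cat([peptide, protein], dim=1)            # one joint sequence
        pooled = self.encoder(self.project(pair)).mean(dim=1)  # joint encoding
        joined = torch.cat([pooled, features], dim=1)          # connecting layer
        return torch.sigmoid(self.classify(joined))            # probability value

model = PairBindingModel()
print(model(torch.zeros(1, 15, 20), torch.zeros(1, 50, 20), torch.zeros(1, 8)))
```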
  • At the time of search, the search unit 16 calculates and outputs the binding probability at which the peptide binds as a part of the binding target protein, using the trained model for amino acid sequence binding prediction generated at the time of learning.
  • Next, the third example will be described. In the third example, the binding is weighted with a Transformer (one deep learning model), which applies Source-Target Attention using both the peptide and the binding target protein.
  • The information processing device 1 according to the third example has the same configuration as that of the first example shown in FIG. 2. FIGS. 14 and 15 are diagrams schematically showing the functions of the information processing device 1 during learning and during search, respectively.
  • In the third example, the learning unit 15 has a function of converting the amino acid sequence of the peptide and the amino acid sequence of the binding target protein each into one-hot-vector values. Further, the learning unit 15 has a function of inputting the one-hot-vector values of the peptide into the encoder of the Transformer to execute the learning processing, and inputting the output from the encoder together with the one-hot-vector values of the binding target protein into the decoder of the Transformer to execute the learning processing. Further, the learning unit 15 has a function of concatenating the physicochemical and biochemical feature amounts of the peptide and the physicochemical and biochemical feature amounts of the binding target protein with the output from the decoder.
  • FIG. 16 is a diagram showing the detailed operation flow, in the third example, of step S105 shown in FIG. 5.
  • Step S401: First, the learning unit 15 converts the amino acid sequence of the peptide into one-hot-vector values in the input layer.
  • Step S402: Next, the learning unit 15 converts the amino acid sequence of the binding target protein into one-hot-vector values in the input layer.
  • Step S403: Next, the learning unit 15 inputs the one-hot-vector values of the peptide converted in step S401 into the encoder of the Transformer layer, which is the intermediate layer, and executes the learning processing.
  • The learning processing using the Transformer model (the encoder and the decoder described later) can be executed by a known technique.
  • Step S404: Next, the learning unit 15 inputs the learning result of the encoder executed in step S403 and the one-hot-vector values of the binding target protein converted in step S402 into the decoder of the Transformer layer, and executes the learning processing.
  • Step S405: Next, in the connecting layer, the learning unit 15 concatenates the physicochemical feature amounts of the peptide and the physicochemical feature amounts of the binding target protein with the result of the learning processing performed in step S404.
  • Step S406: After that, in the fully connected layer, the learning unit 15 fully connects all the data concatenated in step S405, thereby modeling the whole and obtaining a one-dimensional output.
  • Step S407: Finally, the learning unit 15 converts the one-dimensional fully connected data from step S406 into a probability value using an activation function such as Sigmoid or another non-linear function.
  • By executing the above processing, the learning unit 15 generates a trained model for amino acid sequence binding prediction in which the sequence data structure of the amino acid sequence has been learned for a plurality of amino acid sequences using the Transformer, which employs Source-Target Attention.
  • Note that step S402 may be executed before step S401, or at the same time as step S401. A sketch of this encoder-decoder setup follows.
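A minimal sketch of the Source-Target Attention setup of steps S401 to S407 follows, using torch.nn.Transformer; the sizes and the mean-pooling of the decoder output are assumptions.

```python
# The peptide feeds the encoder (S403); the binding target protein feeds the
# decoder (S404), whose cross-attention attends to the encoder output.
import torch
import torch.nn as nn

class SourceTargetBindingModel(nn.Module):
    def __init__(self, n_amino_acids: int = 20, d_model: int = 32, n_features: int = 8):
        super().__init__()
        self.project = nn.Linear(n_amino_acids, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.classify = nn.Linear(d_model + n_features, 1)

    def forward(self, peptide, protein, features):
        decoded = self.transformer(self.project(peptide),   # encoder input (S403)
                                   self.project(protein))   # decoder input (S404)
        joined = torch.cat([decoded.mean(dim=1), features], dim=1)  # S405
        return torch.sigmoid(self.classify(joined))                 # S406-S407

model = SourceTargetBindingModel()
print(model(torch.zeros(1, 15, 20), torch.zeros(1, 50, 20), torch.zeros(1, 8)))
```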
  • At the time of search, the search unit 16 calculates and outputs the binding probability at which the peptide binds as a part of the binding target protein, using the trained model for amino acid sequence binding prediction generated at the time of learning.
  • As described above, in the first embodiment, a trained model for amino acid sequence binding prediction is generated using a DNN with Attention for the amino acid sequence of the peptide and the amino acid sequence of the binding target protein, and the generated trained model is used to calculate and output the binding probability at which the peptide binds as a part of the binding target protein; therefore, the accuracy of searching for the target amino acid sequence can be improved.
  • Further, since the physicochemical and biochemical feature amounts are also used, the accuracy of searching for the target amino acid sequence can be further improved.
  • Next, the second embodiment will be described. In the second embodiment, an amino acid sequence (window) of arbitrary length added at positions before and after the peptide's amino acid sequence is used.
  • Further, the physicochemical and biochemical feature amounts of the amino acids constituting the entire antigen (antigen protein) are used.
  • In the second embodiment, a trained model is generated by a DNN with Attention (Self-Attention), namely an LSTM with Self-Attention, targeting one amino acid sequence (one molecule in total).
  • Then, the probability of the antigenicity (the property of inducing an antibody) of the peptide in the antigen protein is calculated using the trained model.
  • The Attention referred to here is a weight indicating the importance, for antigenicity, of each amino acid in the antigen protein.
  • FIG. 17 is a diagram showing the functional block configuration of the information processing device 1 according to the second embodiment.
  • As shown in FIG. 17, the information processing device 1 according to the present embodiment includes a third input unit 19, a third calculation unit 20, a learning unit 15, a search unit 16, a storage unit 17, and a display unit 18.
  • FIGS. 18 and 19 are diagrams schematically showing the functions of the information processing device 1 during learning and during search, respectively.
  • Hereinafter, the differences from the first embodiment will be mainly described.
  • The third input unit 19 has a function of inputting data on the amino acid sequence of a peptide that is a part of the antigen, together with the amino acid sequences (windows) of arbitrary length added before and after the peptide.
  • At the time of learning, the third input unit 19 inputs a windowed amino acid sequence labeled as correct for a peptide that binds to a predetermined antigen protein, and a windowed amino acid sequence labeled as incorrect for a peptide that does not bind.
  • The first input unit 11 inputs a plurality of windowed amino acid sequences of such combinations as learning data.
  • At the time of search, the first input unit 11 inputs the windowed amino acid sequence of a peptide having no label (correct label or incorrect label) as prediction target data.
  • The third input unit 19 also has a function of inputting data on the amino acid sequence (whole) of the antigen protein.
  • At the time of learning, the third input unit 19 inputs the amino acid sequence of the predetermined antigen protein corresponding to the windowed amino acid sequence of the peptide labeled as correct and the windowed amino acid sequence of the peptide labeled as incorrect.
  • The third input unit 19 inputs a plurality of amino acid sequences of such antigen proteins as learning data.
  • At the time of search, the third input unit 19 inputs the amino acid sequence of the antigen protein as prediction target data.
  • The third calculation unit 20 has a function of calculating the physicochemical and biochemical feature amounts of the antigen protein (the amino acids constituting the entire antigen) input by the third input unit 19. For example, the third calculation unit 20 calculates the biochemical feature amounts of the antigen protein.
  • The learning unit 15 has a function of inputting the windowed amino acid sequences (plural) of each peptide labeled as correct or incorrect, the physicochemical and biochemical feature amounts of each peptide, and the physicochemical and biochemical feature amounts of the predetermined antigen protein corresponding to each peptide into a learning model having an LSTM with Attention, performing learning to generate a trained model for amino acid sequence binding prediction, and storing the generated trained model in the storage unit 17.
  • Specifically, the learning unit 15 has a function of converting the windowed amino acid sequence of the peptide into one-hot-vector values and inputting the converted one-hot-vector values into the LSTM with Attention to execute the learning processing. Further, the learning unit 15 has a function of concatenating the physicochemical and biochemical feature amounts of the peptide and the physicochemical and biochemical feature amounts of the predetermined binding target protein (antigen protein) corresponding to the peptide with the result of the learning processing. Further, the learning unit 15 has a function of fully connecting all the concatenated data and converting the single fully connected data into a probability value using an activation function such as a non-linear function.
  • The search unit 16 has a function of inputting the windowed amino acid sequence of the peptide that is the prediction target data, the physicochemical and biochemical feature amounts of the peptide, and the physicochemical and biochemical feature amounts of the antigen protein that is the prediction target data into the trained model for amino acid sequence binding prediction, and calculating and outputting the probability value at which the peptide binds as a part of the antigen protein.
  • FIG. 20 is a diagram showing the operation flow of the information processing device 1 during learning according to the second embodiment.
  • Step S501: First, the third input unit 19 inputs the data of the windowed amino acid sequence of a peptide labeled as correct with respect to a predetermined antigen protein, and the data of the windowed amino acid sequence of a peptide labeled as incorrect.
  • At this time, the third input unit 19 inputs a plurality of windowed amino acid sequences of such combinations as learning data.
  • Step S502: Next, the first calculation unit 12 calculates the physicochemical feature amounts of the plurality of peptides input in step S501 for each peptide.
  • Step S503: Subsequently, the second input unit 13 inputs the amino acid sequence of the predetermined antigen protein corresponding to the peptide amino acid sequences of the combinations input in step S501.
  • At this time, the second input unit 13 inputs a plurality of amino acid sequences of such antigen proteins as learning data.
  • Step S504: Next, the third calculation unit 20 calculates the biochemical feature amounts of the plurality of antigen proteins input in step S503 for each antigen protein.
  • Step S505: After that, the learning unit 15 inputs the windowed amino acid sequences (plural) of each peptide labeled as correct or incorrect, which were input or calculated in steps S501, S502, and S504, the physicochemical feature amounts of each peptide, and the biochemical feature amounts of the predetermined antigen protein into a learning model having an LSTM with Self-Attention, and performs learning. At this time, the learning unit 15 optimizes the weights of each layer in the learning model based on the windowed amino acid sequences of the peptides labeled as correct.
  • The weight optimization method using correct labels can be carried out by a known technique.
  • Step S506: Finally, the learning unit 15 stores the learning model generated in step S505 in the storage unit 17 as the trained model for amino acid sequence binding prediction.
  • Note that steps S503 and S504 may be executed before steps S501 and S502, or at the same time as steps S501 and S502.
  • Next, step S505 will be described with a specific example.
  • FIG. 21 is a diagram showing the detailed operation flow of step S505 shown in FIG. 20.
  • Step S505a: First, the learning unit 15 converts the windowed amino acid sequence of the peptide into one-hot-vector values in the input layer.
  • Step S505b: Next, the learning unit 15 inputs the one-hot-vector values of the peptide converted in step S505a into the intermediate layer composed of an LSTM layer including Self-Attention, and executes the learning processing.
  • Through the Self-Attention, the LSTM layer assigns to each amino acid constituting the amino acid sequence a weight indicating its importance for antigenicity.
  • Step S505c: Next, in the connecting layer, the learning unit 15 concatenates the physicochemical feature amounts of the peptide calculated in step S502 and the biochemical feature amounts of the antigen protein calculated in step S504 with the result of the learning processing performed in step S505b.
  • Step S505d: After that, in the fully connected layer, the learning unit 15 fully connects all the data concatenated in step S505c, thereby modeling the whole and obtaining a one-dimensional output.
  • Step S505e: Finally, the learning unit 15 converts the one-dimensional fully connected data from step S505d into a probability value using an activation function such as Sigmoid or another non-linear function.
  • By executing the above processing, the learning unit 15 generates a trained model for amino acid sequence binding prediction trained using the LSTM with Self-Attention.
  • FIG. 22 is a diagram showing the operation flow of the information processing device 1 during search according to the second embodiment.
  • Step S601: First, the first input unit 11 inputs the windowed amino acid sequence of a peptide having no label (correct label or incorrect label), which is the prediction target data.
  • Step S602: Next, the first calculation unit 12 calculates the physicochemical feature amounts of the peptide input in step S601.
  • Step S603: Subsequently, the second input unit 13 inputs the amino acid sequence of the antigen protein (binding target protein), which is the prediction target data.
  • Step S604: Next, the third calculation unit 20 calculates the biochemical feature amounts of the antigen protein input in step S603.
  • Step S605: Finally, the search unit 16 inputs the windowed amino acid sequence of the peptide, the physicochemical feature amounts of the peptide, and the biochemical feature amounts of the antigen protein, which were input or calculated in steps S601, S602, and S604, into the trained model for amino acid sequence binding prediction, and calculates the probability value at which the peptide binds as a part of the antigen protein.
  • As described above, according to the second embodiment, the accuracy of searching for the target amino acid sequence can be further improved.
  • In the above embodiments, the case where the amino acid sequence is converted into one-hot-vectors and input has been described as an example; however, the amino acid sequence may instead be converted into an embedding-vector and input, for example.
  • For the embedding, word2vec, which is often used in natural language processing, may be used. That is, any method can be used as long as the amino acid sequence can be vectorized; a minimal sketch of the embedding alternative follows.
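A minimal sketch of the embedding-vector alternative follows; the embedding dimension of 16 is an assumption.

```python
# Map each amino acid index to a learned dense vector instead of a one-hot vector.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
embedding = nn.Embedding(num_embeddings=len(AMINO_ACIDS), embedding_dim=16)

indices = torch.tensor([[AMINO_ACIDS.index(aa) for aa in "LCE"]])  # shape (1, 3)
print(embedding(indices).shape)  # torch.Size([1, 3, 16])
```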
  • Next, the third embodiment will be described. In the third embodiment, an antibody and an antigen are used in the first example, and a part of the antigen and the whole antigen are used in the second example.
  • Hereinafter, the antibody is referred to as the "antibody protein" and the antigen as the "antigen protein".
  • In the third embodiment, the three-dimensional structure (amino acid positions, etc.) of the amino acid sequence constituting the antibody protein or the antigen protein is modeled by a GNN (GCN, GIN, etc.). Further, in the third embodiment, the biochemical feature amounts of the antibody protein and the antigen protein and the types of the amino acids are also used.
  • In the first example, the three-dimensional structures of the antigen protein and the antibody protein are modeled by a GCN, and the amino acid sequence (epitope) in the antigen protein that binds to the antibody protein is predicted.
  • FIG. 23 is a diagram showing the functional block configuration of the information processing device 1 according to the third embodiment (first example).
  • As shown in FIG. 23, the information processing device 1 according to the first example includes a fourth input unit 21, a fourth calculation unit 22, a fifth input unit 23, a fifth calculation unit 24, a learning unit 15, a search unit 16, a storage unit 17, and a display unit 18.
  • FIGS. 24 and 25 are diagrams schematically showing the functions of the information processing device 1 during learning and during search, respectively.
  • Hereinafter, the differences from the first embodiment and the second embodiment will be mainly described.
  • the fourth input unit 21 has a function of inputting peptide amino acid data (first amino acid sequence data) in the three-dimensional structure of the antigen protein.
  • the fourth input unit 21 inputs peptide amino acid data with a correct label attached to a peptide that binds to a predetermined antibody protein, and an incorrect label is attached to a peptide that does not bind. Enter the peptide amino acid data obtained.
  • the fourth input unit 21 inputs a plurality of peptide amino acid data of such a combination as learning data.
  • the fourth input unit 21 inputs peptide amino acid data without a label (correct answer label, incorrect answer label) as prediction target data.
  • the fourth calculation unit 22 has a function of calculating the biochemical feature amount of the peptide amino acid of the antigen protein input by the fourth input unit 21. For example, the fourth calculation unit 22 calculates the type of peptide amino acid and the biochemical feature amount.
  • The fifth input unit 23 has a function of inputting amino acid sequence data (second amino acid sequence data) in the three-dimensional structure of the antibody protein.
  • At the time of learning, the fifth input unit 23 inputs the amino acid sequences in the three-dimensional structure of the predetermined antibody protein corresponding to the correct-labeled and incorrect-labeled peptide amino acids. The fifth input unit 23 inputs a plurality of such amino acid sequences as learning data.
  • At the time of searching, the fifth input unit 23 inputs the amino acid sequence in the three-dimensional structure of the antibody protein as prediction target data.
  • The fifth calculation unit 24 has a function of calculating the biochemical features of the amino acids of the antibody protein input by the fifth input unit 23. For example, the fifth calculation unit 24 calculates the amino acid types and their biochemical features.
  • The learning unit 15 inputs the correct-labeled and incorrect-labeled peptide amino acid data (plural) in the three-dimensional structure of the antigen protein, the biochemical features of the peptide amino acids of the antigen protein, the amino acid sequences (plural) in the three-dimensional structure of the predetermined antibody protein corresponding to each peptide amino acid, and the biochemical features of the amino acids of the predetermined antibody protein into a learning model having a GCN, and, by learning the degree of binding and the degree of relationship between amino acids, generates a trained model for amino acid sequence binding prediction and stores the generated trained model in the storage unit 17.
  • Specifically, the learning unit 15 has a function of generating, from the peptide amino acid data in the three-dimensional structure of the antigen protein, an adjacency matrix showing the adjacency relationships between amino acids based on the relationships between amino acids (distance, etc.); a function of associating each peptide amino acid with its biochemical features and the like; and a function of embedding the adjacency matrix of the peptide amino acids, the biochemical features of the peptide amino acids, and the like into a graph model showing the three-dimensional structure of the antigen protein.
  • Likewise, the learning unit 15 has a function of generating, from the amino acid sequence in the three-dimensional structure of the antibody protein, an adjacency matrix showing the adjacency relationships between amino acids based on the relationships between amino acids (distance, etc.); a function of associating each amino acid in the amino acid sequence with its biochemical features; and a function of embedding the adjacency matrix of the amino acids, the biochemical features of the amino acids, and the like into a graph model showing the three-dimensional structure of the antibody protein.
  • The learning unit 15 further has a function of inputting the vectors of the two embedded graph models (two molecules) into a learning model having a GCN (one deep learning model) for learning, and of generating a trained model by that learning. By executing the above functions, the learning unit 15 generates, for a plurality of amino acid sequence data, a trained model for amino acid sequence binding prediction in which the three-dimensional structures of the amino acids contained in the amino acid sequence data are learned using the GCN.
  • The search unit 16 has a function of inputting the peptide amino acids in the three-dimensional structure of the antigen protein, which are the prediction target data, the biochemical features of the peptide amino acids of the antigen protein, the amino acid sequence in the three-dimensional structure of the antibody protein, which is the prediction target data, and the biochemical features of the amino acids of the antibody protein into the trained model for amino acid sequence binding prediction, and of calculating and outputting the binding probability that a peptide amino acid of the antigen protein binds as a part of the amino acids of the antibody protein.
  • FIG. 26 is a diagram showing the operation flow at the time of learning of the information processing device 1 according to the first example.
  • Step S701: First, the fourth input unit 21 inputs peptide amino acid data in the three-dimensional structure of the antigen protein with a correct label for a predetermined antibody protein, and peptide amino acid data in the three-dimensional structure of the antigen protein with an incorrect label. The fourth input unit 21 inputs a plurality of such combinations of peptide amino acid data as learning data.
  • Step S702: Next, the fourth calculation unit 22 calculates, for each peptide, the types and biochemical features of the plurality of peptide amino acids input in step S701. For example, the fourth calculation unit 22 calculates a value corresponding to the type of a peptide amino acid by converting the peptide amino acid into a categorical variable. In addition, the fourth calculation unit 22 calculates the biochemical features of the peptide amino acids using an existing tool such as DAS (Diffusion Accessibility Server). By inputting the three-dimensional structure data of the entire protein (Protein Data Bank data, https://www.rcsb.org/) into DAS, a score is calculated for each atom; the scores are aggregated for each amino acid type and used as biochemical features. DAS is disclosed at http://services.mbi.ucla.edu/DiffAcc/.
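  • As an illustrative sketch (not part of the original disclosure) of the aggregation described above, the following Python fragment averages per-atom scores into one feature per amino acid type. The tuple layout of the rows is an assumption, since the exact DAS output format is not specified here.

```python
from collections import defaultdict

# Assumed export format: (amino_acid_type, atom_name, per_atom_score) rows;
# the real DAS output layout may differ.
rows = [
    ("ALA", "N", 0.20), ("ALA", "CA", 0.12), ("ALA", "CB", 0.34),
    ("GLY", "CA", 0.55),
]

sums = defaultdict(float)
counts = defaultdict(int)
for aa_type, _atom, score in rows:
    sums[aa_type] += score
    counts[aa_type] += 1

# Mean per-atom score per amino acid type, used as a biochemical feature.
features = {aa: sums[aa] / counts[aa] for aa in sums}
print(features)   # approximately {'ALA': 0.22, 'GLY': 0.55}
```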
  • Step S703: Subsequently, the fifth input unit 23 inputs the amino acid sequences in the three-dimensional structure of the predetermined antibody protein corresponding to the combinations of peptide amino acid data input in step S701. The fifth input unit 23 inputs a plurality of such amino acid sequences in the three-dimensional structure of the antibody protein as learning data.
  • Step S704: Next, the fifth calculation unit 24 calculates, for each antibody protein, the type and biochemical features of each amino acid of the plurality of amino acid sequences input in step S703.
  • Step S705: After that, the learning unit 15 inputs the correct-labeled and incorrect-labeled peptide amino acid data (plural) in the three-dimensional structure of the antigen protein, the biochemical features of the peptide amino acids of the antigen protein, the corresponding amino acid sequences of the antibody protein, and the biochemical features of the amino acids of the antibody protein, which were input or calculated in steps S701 to S704, into a learning model having a GCN for learning. At this time, the learning unit 15 optimizes the weight of each layer in the learning model based on the correct-labeled peptide amino acid data. The weight optimization using correct labels can be carried out by a known technique.
  • Step S706: Finally, the learning unit 15 stores the learning model generated in step S705 in the storage unit 17 as a trained model for amino acid sequence binding prediction.
  • Note that steps S703 and S704 may be executed before steps S701 and S702, or at the same time as steps S701 and S702.
  • Next, step S705 will be described with a specific example. FIG. 28 is a diagram showing the detailed operation flow of step S705 shown in FIG. 26.
  • Step S705a: First, the learning unit 15 generates, from the peptide amino acid data in the three-dimensional structure of the antigen protein input in step S701, an adjacency matrix showing the adjacency relationships between amino acids based on the relationships between amino acids (distance, etc.). For example, the learning unit 15 generates the adjacency matrix shown in FIG. 29(b) from the graph model showing the three-dimensional structure of the antigen protein shown in FIG. 29(a), based on the presence or absence of an adjacency relationship between amino acids.
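  • As an illustrative sketch (not part of the original disclosure), the following Python fragment builds a binary adjacency matrix from residue coordinates using a distance cutoff. The 8 Å cutoff and the use of one coordinate per residue are assumptions; the text only says the adjacency is based on "the relationship between amino acids (distance, etc.)".

```python
import numpy as np

def adjacency_from_coords(coords: np.ndarray, cutoff: float = 8.0) -> np.ndarray:
    """Residues whose coordinates (e.g. C-alpha positions) lie within
    `cutoff` of each other are marked adjacent; 8.0 is an assumed value."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    adj = (dist < cutoff).astype(np.float32)
    np.fill_diagonal(adj, 0.0)            # drop self-adjacency
    return adj

coords = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
print(adjacency_from_coords(coords))      # only the first two residues are adjacent
```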
  • Step S705b: Next, the learning unit 15 associates each peptide amino acid of the antigen protein with the peptide amino acid type and biochemical features calculated in step S702. For example, the learning unit 15 associates the peptide amino acid types and biochemical features illustrated in FIG. 29(c).
  • Step S705c: Subsequently, the learning unit 15 embeds, into the graph model showing the three-dimensional structure of the antigen protein, the adjacency matrix of the peptide amino acids generated in step S705a and the peptide amino acid types and biochemical features associated in step S705b. The embedding can be executed using, for example, node2vec, Poincaré embedding, or the like. As a result, the adjacency matrix of FIG. 29(b) and the node features of FIG. 29(c) are associated with each amino acid constituting the graph model of the three-dimensional structure of the antigen protein shown in FIG. 29(a).
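  • As an illustrative sketch (not part of the original disclosure) of a node2vec-style embedding, the following Python fragment runs uniform random walks on a residue graph and trains a gensim Word2Vec model on them. Real node2vec adds p/q-biased walks; the walk counts, lengths, and vector size here are assumptions.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(G: nx.Graph, num_walks: int = 10, walk_len: int = 8, seed: int = 0):
    """Uniform random walks over the residue graph (node2vec without p/q biasing)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for node in G.nodes:
            walk = [node]
            for _ in range(walk_len - 1):
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(n) for n in walk])   # Word2Vec expects string tokens
    return walks

G = nx.path_graph(6)                               # stand-in residue graph
model = Word2Vec(random_walks(G), vector_size=16, window=3, min_count=0, sg=1, seed=0)
print(model.wv["0"].shape)                         # (16,) vector per amino acid node
```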
  • Step S705d: Subsequently, the learning unit 15 generates, from the amino acid sequence in the three-dimensional structure of the antibody protein input in step S703, an adjacency matrix showing the adjacency relationships between amino acids based on the relationships between amino acids (distance, etc.).
  • Step S705e: Next, the learning unit 15 associates each amino acid in the amino acid sequence of the antibody protein with the amino acid type and biochemical features calculated in step S704.
  • Step S705f: Next, the learning unit 15 embeds, into the graph model showing the three-dimensional structure of the antibody protein, the adjacency matrix of the amino acids generated in step S705d and the amino acid types and biochemical features associated in step S705e.
  • Step S705g: After that, the learning unit 15 inputs the vectors of the two graph models (two molecules), that is, the graph model of step S705c and the graph model of step S705f, into a learning model having a GCN for learning, and generates a trained model for amino acid sequence binding prediction by the learning; a model sketch follows this list.
  • Note that steps S705d to S705f may be executed before steps S705a to S705c, or at the same time as steps S705a to S705c.
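  • As an illustrative sketch (not part of the original disclosure) of a learning model that takes two embedded graphs, the following PyTorch fragment encodes the antigen graph and the antibody graph with one GCN layer each, mean-pools them, and scores binding with a sigmoid head. The layer sizes, depth, pooling, and normalization are assumptions; the text specifies only "a learning model having a GCN".

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One propagation step: H' = ReLU(A_norm @ H @ W)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out, bias=False)

    def forward(self, A: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        return torch.relu(A @ self.lin(H))   # A: normalized adjacency with self loops

class TwoGraphBindingModel(nn.Module):
    """Encodes the antigen and antibody graphs separately, mean-pools each
    into a graph vector, and scores binding from their concatenation."""
    def __init__(self, d_feat: int, d_hid: int = 32):
        super().__init__()
        self.enc_ag = GCNLayer(d_feat, d_hid)
        self.enc_ab = GCNLayer(d_feat, d_hid)
        self.head = nn.Linear(2 * d_hid, 1)

    def forward(self, A_ag, X_ag, A_ab, X_ab):
        h_ag = self.enc_ag(A_ag, X_ag).mean(dim=0)   # antigen graph vector
        h_ab = self.enc_ab(A_ab, X_ab).mean(dim=0)   # antibody graph vector
        return torch.sigmoid(self.head(torch.cat([h_ag, h_ab])))

def normalize(A: torch.Tensor) -> torch.Tensor:
    A = A + torch.eye(A.shape[0])            # add self loops
    return A / A.sum(dim=1, keepdim=True)    # simple row normalization

A = normalize(torch.tensor([[0.0, 1.0], [1.0, 0.0]]))
X = torch.randn(2, 5)                        # 2 residues, 5 node features each
model = TwoGraphBindingModel(d_feat=5)
print(model(A, X, A, X))                     # probability that the pair binds
```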
  • FIG. 30 is a diagram showing the operation flow at the time of searching of the information processing device 1 according to the first example.
  • Step S801: First, the fourth input unit 21 inputs unlabeled peptide amino acid data (neither correct nor incorrect label) in the three-dimensional structure of the antigen protein, which are the prediction target data.
  • Step S802: Next, the fourth calculation unit 22 calculates the peptide amino acid types and biochemical features of the antigen protein input in step S801.
  • Step S803: Subsequently, the fifth input unit 23 inputs the amino acid sequence in the three-dimensional structure of the antibody protein, which is the prediction target data.
  • Step S804: Next, the fifth calculation unit 24 calculates the amino acid types and biochemical features of the antibody protein input in step S803.
  • Step S805: Finally, the search unit 16 calculates the vectors of the peptide amino acid data in the three-dimensional structure of the antigen protein, the peptide amino acid types and biochemical features of the antigen protein, the amino acid sequence in the three-dimensional structure of the antibody protein, and the types and biochemical features of each amino acid of the antibody protein, which were input or calculated in steps S801 to S804, inputs each of the calculated vectors into the trained model for amino acid sequence binding prediction, and calculates and outputs the binding probability that a peptide amino acid binds to the antibody protein (that is, whether or not the peptide is an epitope).
  • FIG. 31 is a diagram showing the functional block configuration of the information processing device 1 according to the third embodiment (second example). The information processing device 1 according to the second example includes a fourth input unit 21, a fourth calculation unit 22, a learning unit 15, a search unit 16, a storage unit 17, and a display unit 18.
  • FIGS. 32 and 33 are diagrams schematically showing the functions of the information processing device 1 during learning and during searching, respectively. Hereinafter, the differences from the first example will be mainly described.
  • The fourth input unit 21 has a function of inputting correct-labeled and incorrect-labeled peptide amino acid data (predetermined amino acid sequence data) in the three-dimensional structure of the antigen protein (a part of the antigen).
  • The fourth calculation unit 22 has a function of calculating the physicochemical and biochemical features of each amino acid constituting the three-dimensional structure of the entire target antigen protein (the whole antigen).
  • The learning unit 15 inputs the correct-labeled and incorrect-labeled peptide amino acid data (plural) in the three-dimensional structure of the antigen protein (a part of the antigen) and the types and biochemical features of the peptide amino acids of the entire antigen protein (the whole antigen) into a learning model having a GCN, generates a trained model for amino acid sequence binding prediction, and stores the generated trained model in the storage unit 17.
  • By executing the above functions, the learning unit 15 generates, for a plurality of amino acid sequence data, a trained model for amino acid sequence binding prediction in which the three-dimensional structures of the amino acids contained in the amino acid sequence data are learned using the GCN.
  • The search unit 16 has a function of inputting the peptide amino acids in the three-dimensional structure of the antigen protein (a part of the antigen), which are the prediction target data, together with the biochemical features of the peptide amino acids of the antigen protein (the whole antigen) and the like, into the trained model for amino acid sequence binding prediction, and of calculating and outputting the amino acids of the antigen protein (the whole antigen; a long chain of antigen amino acids) that bind to the peptide amino acids of that antigen protein (a part of the antigen; a short chain of antigen amino acids).
  • FIG. 34 is a diagram showing the operation flow at the time of learning of the information processing device 1 according to the second example.
  • Step S901: First, the fourth input unit 21 inputs peptide amino acid data in the three-dimensional structure of the antigen protein with a correct label for a predetermined antigen protein, and peptide amino acid data in the three-dimensional structure of the antigen protein with an incorrect label. The fourth input unit 21 inputs a plurality of such combinations of peptide amino acid data as learning data.
  • Step S902: Next, the fourth calculation unit 22 calculates, for each peptide, the peptide amino acid types and biochemical features using the entire amino acid sequence in the plurality of peptide amino acid data input in step S901.
  • Step S903: After that, the learning unit 15 inputs the correct-labeled and incorrect-labeled peptide amino acid data (plural) in the three-dimensional structure of the antigen protein and the types and biochemical features of the peptide amino acids of the entire antigen protein, which were input or calculated in steps S901 and S902, into a learning model having a GCN for learning. At this time, the learning unit 15 optimizes the weight of each layer in the learning model based on the correct-labeled peptide amino acid data. The weight optimization using correct labels can be carried out by a known technique.
  • Step S904: Finally, the learning unit 15 stores the learning model generated in step S903 in the storage unit 17 as a trained model for amino acid sequence binding prediction.
  • Next, step S903 will be described with a specific example. FIG. 35 is a diagram showing the detailed operation flow of step S903 shown in FIG. 34.
  • Step S903a: First, the learning unit 15 generates, from the peptide amino acid data in the three-dimensional structure of the antigen protein input in step S901, an adjacency matrix (see FIG. 29(b)) showing the adjacency relationships between amino acids based on the relationships between amino acids (distance, etc.).
  • Step S903b: Next, the learning unit 15 associates each peptide amino acid of the antigen protein with the peptide amino acid type and biochemical features calculated in step S902 (see FIG. 29(c)).
  • Step S903c: Subsequently, the learning unit 15 embeds, into the graph model showing the three-dimensional structure of the antigen protein (see FIG. 29(a)), the adjacency matrix of the peptide amino acids generated in step S903a and the amino acid types and biochemical features associated in step S903b.
  • Step S903d: After that, the learning unit 15 inputs the vector of the single graph model (one molecule) of step S903c into a learning model having a GCN for learning, and generates a trained model for amino acid sequence binding prediction by the learning.
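  • As an illustrative sketch (not part of the original disclosure) of the single-graph variant used in this second example, the following PyTorch fragment applies one GCN-style propagation step to the lone antigen graph and outputs a binding probability; the hidden size and mean pooling are assumptions.

```python
import torch
import torch.nn as nn

class SingleGraphBindingModel(nn.Module):
    """One antigen graph in, one binding probability out; sizes are assumed."""
    def __init__(self, d_feat: int, d_hid: int = 32):
        super().__init__()
        self.lin = nn.Linear(d_feat, d_hid, bias=False)
        self.head = nn.Linear(d_hid, 1)

    def forward(self, A: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
        h = torch.relu(A @ self.lin(X)).mean(dim=0)   # one GCN step + mean pool
        return torch.sigmoid(self.head(h))

A = torch.tensor([[0.5, 0.5], [0.5, 0.5]])       # pre-normalized adjacency
X = torch.randn(2, 5)                            # 2 residues, 5 node features
print(SingleGraphBindingModel(d_feat=5)(A, X))   # binding probability
```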
  • FIG. 36 is a diagram showing the operation flow at the time of searching of the information processing device 1 according to the second example.
  • Step S1001: First, the fourth input unit 21 inputs unlabeled peptide amino acid data (neither correct nor incorrect label) in the three-dimensional structure of the antigen protein, which are the prediction target data.
  • Step S1002: Next, the fourth calculation unit 22 calculates the peptide amino acid types and the biochemical features of the entire amino acid sequence using the entire amino acid sequence in the peptide amino acid data input in step S1001.
  • Step S1003: Finally, the search unit 16 calculates the vectors of the peptide amino acid sequence data in the three-dimensional structure of the antigen protein and the amino acid types and biochemical features of the entire antigen protein, which were input or calculated in steps S1001 and S1002, inputs each of the calculated vectors into the trained model for amino acid sequence binding prediction, and calculates and outputs the antigen protein amino acids that bind to the peptide amino acids.
  • As described above, according to the third embodiment as well, the accuracy of searching for the target amino acid sequence can be further improved.
  • The techniques of the learning models described in the embodiments are mutually applicable across the embodiments.
  • The information processing device 1 described in each embodiment can be realized by a computer provided with a CPU, a memory, a hard disk, and the like. It is also possible to create an information processing program for operating a computer as the information processing device 1, and a storage medium storing that program. Furthermore, it is also possible to prepare or produce a vaccine (pharmaceutical product) containing the amino acids of the amino acid sequence data found in each embodiment.
  • Specifically, using a trained model for amino acid sequence binding prediction in which the sequence data structure of amino acid sequence data is learned for a plurality of pieces of amino acid sequence data using a deep learning model with an attention mechanism, a vaccine containing the amino acids of the first amino acid sequence data determined to bind as a part of the second amino acid sequence data may be produced. For example, using a trained model for amino acid sequence binding prediction in which the sequence data structure of amino acid sequences is learned for a plurality of amino acid sequences using a DNN with attention, a peptide that binds as a part of an antigen is determined, and a vaccine containing the peptide determined to bind is produced.
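  • As an illustrative sketch (not part of the original disclosure) of a deep learning model with an attention mechanism for this purpose, the following PyTorch fragment lets a candidate peptide attend over an antigen sequence and outputs a binding probability; the model sizes and the cross-attention arrangement are assumptions.

```python
import torch
import torch.nn as nn

class AttentionBindingModel(nn.Module):
    """The candidate peptide attends over the antigen sequence; the pooled
    attention output is mapped to a binding probability."""
    def __init__(self, n_tokens: int = 20, d_model: int = 32, n_heads: int = 4):
        super().__init__()
        self.emb = nn.Embedding(n_tokens, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, peptide_ids: torch.Tensor, antigen_ids: torch.Tensor):
        q = self.emb(peptide_ids)          # peptide residues as queries
        kv = self.emb(antigen_ids)         # antigen residues as keys/values
        ctx, _ = self.attn(q, kv, kv)      # cross-attention
        return torch.sigmoid(self.head(ctx.mean(dim=1)))

pep = torch.randint(0, 20, (1, 8))         # batch of one 8-residue peptide
ant = torch.randint(0, 20, (1, 50))        # one 50-residue antigen
print(AttentionBindingModel()(pep, ant))   # probability the peptide binds
```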
  • Further, using a trained model for amino acid sequence binding prediction in which the three-dimensional structures of the amino acids contained in the amino acid sequence data are learned for a plurality of amino acid sequence data using a deep learning model that handles graphs, a vaccine containing the amino acids of the first amino acid sequence data determined to bind as a part of the second amino acid sequence data may be produced.
  • Likewise, using such a trained model, a vaccine containing the amino acids of the amino acid sequence data determined to bind to the predetermined amino acid sequence data may be produced.
  • The present invention is not limited to the above embodiments and can be modified in various ways within the scope of the gist of the present invention.

Abstract

The present invention improves the accuracy of searching for an amino acid sequence of interest. An information processing device (amino acid sequence search device) 1 comprises: a storage unit 17 that stores a trained model for amino acid sequence binding prediction in which the sequence data structure of amino acid sequence data has been learned for a plurality of pieces of amino acid sequence data using a deep learning model with an attention mechanism; a first input unit 11 and a second input unit 13 that input first amino acid sequence data and second amino acid sequence data, respectively; and a search unit 16 that uses the trained model read from the storage unit 17 to output binding prediction information indicating whether or not the first amino acid sequence data binds as a part of the second amino acid sequence data.
PCT/JP2020/042958 2019-11-28 2020-11-18 Amino acid sequence search device, vaccine, amino acid sequence search method, and amino acid sequence search program WO2021106706A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2021561342A JPWO2021106706A1 (fr) 2019-11-28 2020-11-18

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2019215453 2019-11-28
JP2019-215453 2019-11-28

Publications (1)

Publication Number Publication Date
WO2021106706A1 true WO2021106706A1 (fr) 2021-06-03

Family

ID=76129394

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/042958 WO2021106706A1 (fr) Amino acid sequence search device, vaccine, amino acid sequence search method, and amino acid sequence search program

Country Status (2)

Country Link
JP (1) JPWO2021106706A1 (fr)
WO (1) WO2021106706A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113838523A (zh) * 2021-09-17 2021-12-24 深圳太力生物技术有限责任公司 一种抗体蛋白cdr区域氨基酸序列预测方法及系统
EP4220644A1 (fr) * 2022-01-26 2023-08-02 Fujitsu Limited Programme de calcul de quantité de caractéristiques, procédé de calcul de quantité de caractéristiques et dispositif de calcul de quantité de caractéristiques
WO2023230077A1 (fr) * 2022-05-23 2023-11-30 Palepu Kalyan Apprentissage contrastif pour conception de dégradeur à base de peptides et ses utilisations

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190303535A1 (en) * 2018-04-03 2019-10-03 International Business Machines Corporation Interpretable bio-medical link prediction using deep neural representation

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
SINGH, H., ANSARI, H. R., RAGHAVA, G. P. S.: "Improved Method for Linear B-Cell Epitope Prediction Using Antigen's Primary Sequence", PLOS ONE, vol. 8, no. 5, 2013, pages 1-8, DOI: 10.1371/journal.pone.0062216 *
JESPERSEN, M. C., PETERS, B., NIELSEN, M., MARCATILI, P.: "BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes", Nucleic Acids Research, vol. 45, no. W1, 2017, pages W24-W29, DOI: 10.1093/nar/gkx346 *
JIN, J., LIU, Z., NASIRI, A., CUI, Y., LOUIS, S., ZHANG, A., ZHAO, Y., HU, J.: "Attention mechanism-based deep learning pan-specific model for interpretable MHC-I peptide binding prediction", bioRxiv, 7 November 2019, DOI: 10.1101/830737 *
LIU, Z., JIN, J., CUI, Y., XIONG, Z., NASIRI, A., ZHAO, Y., HU, J.: "DeepSeqPanII: an interpretable recurrent neural network model with attention mechanism for peptide-HLA class II binding prediction", IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2021, DOI: 10.1109/TCBB.2021.3074927 *
TSUBAKI, M., TOMII, K., SESE, J.: "Compound-protein interaction prediction with end-to-end learning of neural networks for graphs and sequences", Bioinformatics, vol. 35, no. 2, 2019, pages 309-318, DOI: 10.1093/bioinformatics/bty535 *

Also Published As

Publication number Publication date
JPWO2021106706A1 (fr) 2021-06-03

Similar Documents

Publication Publication Date Title
WO2021106706A1 (fr) Amino acid sequence search device, vaccine, amino acid sequence search method, and amino acid sequence search program
Levy et al. Diverse demonstrations improve in-context compositional generalization
CN110852116B (zh) 非自回归神经机器翻译方法、装置、计算机设备和介质
US20190065677A1 (en) Machine learning based antibody design
BR112020022270A2 (pt) sistemas e métodos para unificar modelos estatísticos para diferentes modalidades de dados
JP2020523699A (ja) 関心点コピーの生成
Csordás et al. The neural data router: Adaptive control flow in transformers improves systematic generalization
WO2020093071A1 (fr) Co-évolution multiobjectif d&#39;architectures de réseau neuronal profond
CN114503203A (zh) 使用自注意力神经网络的由氨基酸序列的蛋白质结构预测
CN116049459A (zh) 跨模态互检索的方法、装置、服务器及存储介质
Klein et al. Timewarp: Transferable acceleration of molecular dynamics by learning time-coarsened dynamics
CN111488460B (zh) 数据处理方法、装置和计算机可读存储介质
Verma et al. AbODE: Ab initio antibody design using conjoined ODEs
Tamposis et al. Extending hidden Markov models to allow conditioning on previous observations
Zhang et al. Predicting binding affinities of emerging variants of SARS-CoV-2 using spike protein sequencing data: observations, caveats and recommendations
US20240079098A1 (en) Device for predicting drug-target interaction by using self-attention-based deep neural network model, and method therefor
Torge et al. Diffhopp: A graph diffusion model for novel drug design via scaffold hopping
Yang et al. GCNfold: A novel lightweight model with valid extractors for RNA secondary structure prediction
CN117441209A (zh) 内坐标中用于分子构象空间建模的对抗框架
CN116964678A (zh) 使用以蛋白质结构嵌入为条件的生成模型预测蛋白质氨基酸序列
EP4182928A1 (fr) Procédé, système et produit de programme d&#39;ordinateur permettant de déterminer des probabilités de présentation de néoantigènes
Gao et al. Pre-training with a rational approach for antibody
Spinks et al. Generating text from images in a smooth representation space
JP2013105377A (ja) 学習装置、学習方法、プログラム
Fang Applications of deep neural networks to protein structure prediction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20892456; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2021561342; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20892456; Country of ref document: EP; Kind code of ref document: A1)