CN110428870B - Method for predicting antibody heavy chain and light chain pairing probability and application thereof - Google Patents

Publication number
CN110428870B
Authority
CN
China
Prior art keywords
antibody
amino acid
light chain
heavy chain
acid sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910730394.9A
Other languages
Chinese (zh)
Other versions
CN110428870A (en)
Inventor
吴婷婷
侯强波
蔡晓辉
杨平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Synbio Technologies
Original Assignee
Synbio Technologies
Application filed by Synbio Technologies
Priority to CN201910730394.9A
Publication of CN110428870A
Application granted
Publication of CN110428870B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Peptides Or Proteins (AREA)

Abstract

The invention relates to a method for predicting the pairing probability of an antibody heavy chain and light chain based on a convolutional neural network. Specifically, the amino acid sequence features of the heavy chains and light chains of antibody samples with known pairing information are converted into digital signals and input into a convolutional neural network for training to obtain the final model parameters, which are then used to predict the pairing probability of the heavy chain and light chain of an antibody to be predicted. The method obtains antibody pairing probabilities through machine learning and is simple to operate, rapid, efficient, low in cost, highly repeatable and accurate, with an accuracy of up to 67.4%. It can pair the heavy chains and light chains of large numbers of antibodies and is of significance for clinical research, antibody discovery, antibody library research and the like.

Description

Method for predicting antibody heavy chain and light chain pairing probability and application thereof
Technical Field
The invention belongs to the technical field of biology, and particularly relates to a method for predicting the pairing probability of heavy chains and light chains of an antibody and application thereof.
Background
An antibody is an immunoglobulin capable of binding specifically to an antigen. It consists of two identical heavy chains (H chains) and two identical light chains (L chains); the heavy chains are connected to each other through disulfide bonds, and each heavy chain is connected to a light chain through a disulfide bond, forming a symmetrical molecule of paired light and heavy chains. The heavy chain is divided into a variable region (V region), a constant region (C region), a transmembrane region and a cytoplasmic region; the light chain has only V and C regions. In the variable region of an antibody, the heavy chain variable region (VH) and the light chain variable region (VL) each contain three complementarity determining regions (CDR regions), i.e., CDR1, CDR2 and CDR3. The amino acid/gene composition and arrangement of the CDR regions are highly diverse, reaching 10^9-10^12 variants within the same individual and constituting a vast B cell antigen receptor (BCR) pool. In addition, the antibody heavy chain variable region is encoded by the V, D, J gene clusters, and the antibody light chain variable region is encoded by the V, J gene clusters. The heavy and light chains are produced from two separate mRNAs and assemble together into a full-length immunoglobulin molecule in the endoplasmic reticulum of the B cell. Therefore, the study of the pairing of natural antibodies (VH-VL pairing) is of great importance for the correct folding, stability and expression of antibodies, for antibody-antigen binding, and the like.
The framework regions (FR) of antibodies are highly conserved, whereas diversity is concentrated in the CDR regions, with the CDR3 region of the heavy chain being the most susceptible to mutation. The diversity of antibodies upon stimulation by antigens comes primarily from two sources. First, the germline genes encoding the B cell receptor undergo random rearrangements to adapt to the antigenic structure; second, when a pathogen initiates an immune response, the antibody V region undergoes proliferation, death and mutation, and in the acute phase of the immune response the mutation rate is as high as 1/10^3 bp.
In humans and mice, heavy and light chain pairing is crucial for the folding, stability and antigen binding of natural antibodies. Furthermore, information about the heavy/light chain dimer is necessary to model the exact three-dimensional conformation of the antibody variable region and antigen binding domain, which in turn is necessary for rational antibody engineering. Thus, the natural pairing of VH/VL provides important information for understanding antibody biology and design.
At present, immune repertoire sequencing, monoclonal antibody sequencing and immunoglobulin single-cell sequencing are important experimental means for studying the heavy chains and light chains of antibodies. Immune repertoire sequencing usually yields only the sequences of the heavy and light chains and their relative abundances, not the pairing information between them. Compared with sequencing the amino acid sequence of a monoclonal antibody, monoclonal antibody gene sequencing based on PCR amplification provides more accurate and reliable results and can distinguish amino acids such as leucine and isoleucine that are difficult to tell apart in mass spectrometric identification. However, its sequencing efficiency and single-cell throughput are low (<200-500 cells), it requires complicated experimental procedures and consumes a lot of time and materials, and even many experimental studies yield only a relatively small number of VH-VL pairs, for example 10^4-10^5 pairs. This falls far short of the data sets of millions of pairs or more (e.g., 10^9-10^12) that research requires, which limits clinical research, antibody discovery, antibody library research and the like. Immunoglobulin single-cell sequencing can produce paired VH/VL at high throughput, but it is expensive, complex to operate, and has low reproducibility.
Therefore, it is necessary to develop a simple, fast, efficient, low-cost and accurate method for pairing the heavy chain and the light chain of an antibody, which is of great significance for clinical research, antibody discovery, antibody library research and the like.
Disclosure of Invention
In view of the deficiencies of the prior art, the present invention aims to provide a method for predicting the pairing probability of heavy chains and light chains of antibodies and application thereof.
In order to achieve the purpose, the invention adopts the following technical scheme:
In one aspect, the invention provides a method for predicting the pairing probability of an antibody heavy chain and a light chain, wherein the method predicts the pairing probability of the antibody heavy chain and the light chain based on a convolutional neural network.
Preferably, the method comprises the steps of: converting the amino acid sequence characteristics of the heavy chain and the light chain of antibody samples with known pairing information into digital signals, inputting the digital signals into a convolutional neural network for training to obtain final model parameters, and predicting the pairing probability of the heavy chain and the light chain of the antibody to be predicted by using the model parameters.
The method can obtain the pairing information probability of the antibody through machine learning, is simple and convenient to operate, rapid, efficient, low in cost, high in repeatability and high in accuracy, can realize the pairing of heavy chains and light chains of a large number of antibodies, and has important significance for clinical research, antibody discovery, antibody libraries and other researches.
Preferably, the method for obtaining the final model parameters comprises the following steps:
(1) Acquiring the pairing information of the heavy chain amino acid sequence and the light chain amino acid sequence of the antibody sample as a basic sample;
(2) Dividing the basic sample into a first part and a second part, extracting the amino acid sequence characteristics, the site-specific scoring matrix characteristics, the length characteristics of the antibody CDR3 region, the amino acid secondary structure characteristics and the amino acid solution accessibility characteristics (i.e. solvent accessibility) of the first part of the basic sample, and summarizing the information of these characteristics to obtain a data set containing the antibody sample information as a training set;
(3) Extracting the amino acid sequence characteristics, site-specific scoring matrix characteristics, antibody CDR3 region length characteristics, amino acid secondary structure characteristics and amino acid solution accessibility characteristics of the second part of basic samples, and summarizing the characteristics to obtain a data set containing antibody sample information as a test set;
(4) Acquiring a positive data set and a negative data set;
(5) Inputting the training set, the test set, the positive data set and the negative data set into a convolutional neural network for convolutional neural network training;
(6) Outputting pairing information and optimizing the objective function of the convolutional neural network training to obtain the weights and biases of the whole convolutional neural network, thereby obtaining the final model parameters.
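Step (2) above can be sketched as a seeded random split of the paired records. The source does not state how the basic sample is divided, so the 80/20 ratio, the seed and the function name `split_base_sample` are assumptions made only for illustration.

```python
import random

def split_base_sample(pairs, first_fraction=0.8, seed=0):
    """Shuffle paired (heavy, light) records and split them into two parts.

    first_fraction and seed are illustrative assumptions; the source does
    not specify the proportions of the first and second parts.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy records standing in for paired heavy/light amino acid sequences.
pairs = [(f"H{i}", f"L{i}") for i in range(10)]
first_part, second_part = split_base_sample(pairs)
print(len(first_part), len(second_part))
```

Fixing the seed keeps the training and test sets reproducible between runs.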
In the present invention, the basic sample is obtained by: acquiring the whole gene sequence information of the unpaired heavy chain and light chain variable regions, acquiring the antibody sample gene pairing information, and processing the information to obtain the heavy chain and light chain amino acid sequence pairing information of the antibody sample.
The whole gene sequence information of the unpaired heavy chain and light chain variable regions is downloaded from the NCBI Short Read Archive (SRA) database under SRA accession number SRP047462, which provides the VH:VL_analysis information of Donors 1-3. Each donor has 2 samples; the information of the 6 samples is downloaded separately and collectively referred to as "data 1".
The antibody sample gene pairing information is downloaded from the Nature website and comprises the V, J germline gene information and CDR3 region sequence information of the antibody light chain variable region, and the V, D, J germline gene information and CDR3 region sequence information of the antibody heavy chain variable region. In this information the light chains and heavy chains are already paired, but it does not contain the whole gene sequences of the antibody heavy chain and light chain variable regions. Six data files are downloaded: Supplementary Data Set 1-6, collectively referred to as "data 2".
"Data 1" obtained above is the antibody sequence result of immune repertoire sequencing, whereas "data 2" contains the V, J, D germline gene information and the gene sequence information of the CDR3 region. The information of "data 1" and "data 2" therefore needs to be preprocessed and matched, so that each antibody heavy chain variable region in "data 1" is paired with the light chain of the corresponding number, which gives the sample information. The specific operation is as follows:
(1) Sample information extraction: the germline gene information and CDR3 region sequence information of the antibody samples in data 1 are extracted with the IgBLAST software. For the heavy chain sequences, IgBLAST yields the V, J, D germline gene information and the CDR3 region gene sequence; for the light chain sequences, it yields the V, J germline gene information and the CDR3 region gene sequence.
(2) Heavy chain and light chain sequence matching: the corresponding antibody sequence numbers in data 1 are found according to the germline gene and CDR3 sequence information in data 2; the DNA sequences of the paired heavy chains and light chains in data 1 are then converted into amino acid sequences to obtain the pairing information of the heavy chain and light chain amino acid sequences of the antibody samples, namely the sample information.
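The matching and translation steps can be sketched as a dictionary lookup keyed on germline genes plus CDR3 sequence, followed by translation with the standard genetic code. The record layout (keys `v`, `d`, `j`, `cdr3`, `seq`), the function names and the toy sequences are hypothetical; real IgBLAST output and the supplementary data files would need their own parsers.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' marks a stop codon).
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(dna):
    """Translate an in-frame DNA sequence, ignoring a trailing partial codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def key(record):
    """Matching key: germline gene names plus CDR3 nucleotide sequence."""
    return (record["v"], record.get("d"), record["j"], record["cdr3"])

def pair_chains(heavy_reads, light_reads, known_pairs):
    """Look up each known pair ("data 2") among the annotated reads ("data 1")."""
    heavy_by_key = {key(r): r for r in heavy_reads}
    light_by_key = {key(r): r for r in light_reads}
    paired = []
    for pair in known_pairs:
        heavy = heavy_by_key.get(key(pair["heavy"]))
        light = light_by_key.get(key(pair["light"]))
        if heavy and light:
            paired.append((translate(heavy["seq"]), translate(light["seq"])))
    return paired

# Toy records; the nucleotide sequences are shortened placeholders.
heavy_reads = [{"v": "IGHV3-23", "d": "IGHD3-10", "j": "IGHJ4",
                "cdr3": "GCGAAA", "seq": "GAGGTGCAG"}]
light_reads = [{"v": "IGKV1-39", "j": "IGKJ1",
                "cdr3": "CAGCAG", "seq": "GACATCCAG"}]
known_pairs = [{"heavy": {"v": "IGHV3-23", "d": "IGHD3-10", "j": "IGHJ4", "cdr3": "GCGAAA"},
                "light": {"v": "IGKV1-39", "j": "IGKJ1", "cdr3": "CAGCAG"}}]
print(pair_chains(heavy_reads, light_reads, known_pairs))
```

In this sketch, reads that share the same key overwrite each other in the lookup tables; how such ties are resolved is not described in the source.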
Preferably, the amino acid sequence features are extracted in a manner of:
Each amino acid is defined as a twenty-digit binary number, and an amino acid sequence is defined as an [x × 20] matrix, where x represents the number of amino acids in the antibody heavy or light chain amino acid sequence.
Here, one kind of amino acid is represented by a binary number of twenty bits, and then 20 kinds of amino acids correspond to binary numbers of twenty bits different from each other, for example, the correspondence relationship shown in the following table may be used:
Amino acid       Abbreviation   Twenty-digit binary number
Alanine          A              00000000000000000001
Arginine         R              00000000000000000010
Asparagine       N              00000000000000000100
Aspartic acid    D              00000000000000001000
Cysteine         C              00000000000000010000
Glutamine        Q              00000000000000100000
Glutamic acid    E              00000000000001000000
Histidine        H              00000000000010000000
Isoleucine       I              00000000000100000000
Leucine          L              00000000001000000000
Lysine           K              00000000010000000000
Methionine       M              00000000100000000000
Phenylalanine    F              00000001000000000000
Proline          P              00000010000000000000
Serine           S              00000100000000000000
Threonine        T              00001000000000000000
Tryptophan       W              00010000000000000000
Tyrosine         Y              00100000000000000000
Valine           V              01000000000000000000
Glycine          G              10000000000000000000
It should be noted that only one correspondence is listed in the above table, and other correspondences between 20 amino acids and the twenty-digit binary number are also within the scope of the present invention.
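A minimal sketch of this encoding, using the correspondence listed in the table (alanine sets the rightmost digit, glycine the leftmost). The function names are illustrative and not taken from the source.

```python
AMINO_ACIDS = "ARNDCQEHILKMFPSTWYVG"  # table order; A maps to the rightmost digit

def one_hot(residue):
    """Return the twenty-digit binary code of one amino acid as a list of 0/1."""
    index = AMINO_ACIDS.index(residue)  # raises ValueError for unknown letters
    code = [0] * 20
    code[19 - index] = 1
    return code

def encode_sequence(sequence):
    """Encode an amino acid sequence as an [x × 20] matrix (list of rows)."""
    return [one_hot(residue) for residue in sequence]

matrix = encode_sequence("QPAS")
for row in matrix:
    print("".join(map(str, row)))
```

Each row contains exactly one 1, so the matrix carries residue identity without implying any ordering or similarity between amino acids.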
In the invention, the extraction mode of the characteristics of the site-specific scoring matrix (PSSM matrix) is as follows:
The site-specific scoring matrix is defined as the [L × 20] matrix shown in the formula below, where L represents the length of the antibody heavy or light chain amino acid sequence, i.e. the number of amino acids. A site-specific scoring matrix is calculated for the base sample using the PSI-BLAST tool; each value x in the rows and columns of the PSSM matrix is then transformed to the 0-1 interval by the sigmoid function
$$f(x) = \frac{1}{1 + e^{-x}}$$
The matrix has the form
$$\mathrm{PSSM} = \begin{bmatrix} E_{1 \to 1} & E_{1 \to 2} & \cdots & E_{1 \to 20} \\ E_{2 \to 1} & E_{2 \to 2} & \cdots & E_{2 \to 20} \\ \vdots & \vdots & \ddots & \vdots \\ E_{L \to 1} & E_{L \to 2} & \cdots & E_{L \to 20} \end{bmatrix}$$
where $E_{i \to j}$ represents the logarithm of the probability that the i-th amino acid of the amino acid sequence is mutated to amino acid j during evolution, and j = 1-20 indexes the 20 natural amino acids arranged alphabetically.
The PSSM matrix is an [L × 20] matrix, where L represents the length of the antibody heavy or light chain amino acid sequence, i.e. the number of amino acids. Each amino acid residue has 20 possible substitutions, corresponding to the 20 amino acids, so the 20 values in each row of the PSSM matrix measure how frequently the residue at that position of the amino acid sequence is mutated to each corresponding residue. The PSSM matrix gives the conservation of the amino acid at each position, expressed as the logarithm of the frequency with which a given amino acid occurs at that position, and can be used to extract the evolutionary information of the antibody sequence.
The extraction method of the characteristics of the site-specific scoring matrix (PSSM matrix) specifically comprises the following steps:
(1) The PSI-BLAST tool is used to search the germline gene information of antibody heavy chains and light chains obtained from the IMGT database, with the E value (expectation value) set to 0.001. After 3 search iterations, a set of sequences homologous to the query sequence is obtained, from which the PSSM matrix of the target antibody is derived. The resulting sequences, including the query sequence, are then subjected to multiple sequence alignment, so that the evolutionary variation of each residue in the query sequence is known.
(2) The PSSM values are transformed to the 0-1 interval by the sigmoid function
$$f(x) = \frac{1}{1 + e^{-x}}$$
where x is a value of the PSSM matrix.
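The sigmoid rescaling of step (2) can be sketched as follows. The toy scores are made up for illustration; in practice the matrix would be parsed from PSI-BLAST output, which is not shown here.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x)), mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def normalize_pssm(pssm):
    """Apply the sigmoid to every entry of an [L × 20] PSSM (list of rows)."""
    return [[sigmoid(score) for score in row] for row in pssm]

# Toy 2 × 20 matrix of log-odds scores (values are invented for the example).
toy_pssm = [[-3] * 19 + [5], [0] * 20]
scaled = normalize_pssm(toy_pssm)
print(scaled[0][0], scaled[0][19], scaled[1][0])
```

A score of 0 maps to 0.5, strongly negative scores approach 0 and strongly positive scores approach 1, which puts the PSSM columns on the same 0-1 scale as the binary features.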
Preferably, the length characteristics of the CDR3 region of the antibody are extracted in the following manner: the length of the antibody CDR3 region was characterized by defining the number of amino acids in the antibody CDR3 region as a different seven-digit binary number.
Here, the seven-digit binary number represents a length of the CDR3 region, i.e., the number of amino acids, and different length of the CDR3 region correspond to seven-digit binary numbers different from each other, for example, the correspondence relationship shown in the following table can be used:
Length of CDR3 region   Seven-digit binary number
1                       0000000
2                       0000001
3                       0000010
4                       0000011
5                       0000100
6                       0000101
7                       0000110
8                       0000111
9                       0001000
10                      0001001
11                      0001010
12                      0001011
……
128                     1111111
It should be noted that, only one correspondence relationship is listed in the above table, and other correspondence relationships between the length characteristic of the CDR3 region and the seven-bit binary number are also within the scope of the present invention.
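In the correspondence listed in the table, a CDR3 region of n amino acids is written as the seven-digit binary representation of n − 1, which covers lengths 1 to 128. A short sketch, with an illustrative function name:

```python
def encode_cdr3_length(length):
    """Map a CDR3 length of 1..128 residues to the seven-digit code of the table."""
    if not 1 <= length <= 128:
        raise ValueError("CDR3 length must be between 1 and 128")
    # Length n is stored as the binary form of n - 1, padded to seven digits.
    return [int(bit) for bit in format(length - 1, "07b")]

for n in (1, 12, 128):
    print(n, "".join(map(str, encode_cdr3_length(n))))
```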
Preferably, the secondary structural features of the amino acids are extracted in a manner that: the secondary structure of each amino acid in the amino acid sequence was predicted using the SCRATCH tool, and three secondary structures were defined as three different three-digit binary numbers, thereby extracting the amino acid secondary structure characteristics.
The secondary structure formed by the arrangement of adjacent residues of the antibody sequence affects the interaction between residue pairs and thus the tertiary structure of the antibody. The SCRATCH tool is used here to predict the secondary structure of antibody sequences (three types of structure: helix, strand and coil), and the secondary structure at each position is represented by a three-digit binary number, for example as in the following table:
Secondary structure   Three-digit binary number
Helix (H)             001
Strand (E)            010
Coil (C)              100
It should be noted that only one correspondence is listed in the above table, and other correspondences of the secondary structural features of the amino acids with three-digit binary numbers are also within the scope of the present invention.
In the present invention, the solution accessibility feature of the amino acid is extracted in the following manner:
The SCRATCH tool is used to predict the solution accessibility of each amino acid in an amino acid sequence, and the two states are defined as two different two-digit binary numbers, thereby extracting the solution accessibility characteristics of the amino acids.
Antibody folding takes place inside the cell. During folding in the aqueous cellular environment, hydrophobic residues of the antibody gradually move to the center of the antibody's conformational space, while hydrophilic residues form the protein surface. Extracting the relative solution accessibility therefore indicates the approximate position of each amino acid residue of the antibody sequence within the conformational space. Here, the SCRATCH tool is used to predict the solution accessibility of the amino acid sites of the antibody, and the solution accessibility at each position is represented by a two-digit binary number indicating whether the site is buried or exposed, for example as in the following table:
Solution accessibility   Two-digit binary number
Buried                   01
Exposed                  10
It should be noted that only one correspondence is listed in the above table, and other correspondences of the solution accessibility characteristics of the amino acids with two-digit binary numbers are also within the scope of the invention.
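The two per-residue structural encodings can be sketched together, using the codes given in this document (helix 001, strand 010, coil 100; buried 01, exposed 10). The one-letter input labels `b` and `e` for buried and exposed are assumptions about how the predictions are stored, not a statement about the SCRATCH output format.

```python
SECONDARY_STRUCTURE = {"H": [0, 0, 1], "E": [0, 1, 0], "C": [1, 0, 0]}
ACCESSIBILITY = {"b": [0, 1], "e": [1, 0]}  # b = buried, e = exposed (assumed labels)

def encode_structure(ss_string, acc_string):
    """Return one 5-element row (3 structure digits + 2 accessibility digits) per residue."""
    if len(ss_string) != len(acc_string):
        raise ValueError("predictions must cover the same residues")
    return [SECONDARY_STRUCTURE[s] + ACCESSIBILITY[a]
            for s, a in zip(ss_string, acc_string)]

# Toy predictions for a four-residue fragment.
rows = encode_structure("CEEH", "ebbe")
for row in rows:
    print(row)
```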
Preferably, the method for summarizing the information of the features comprises the following steps:
The extracted amino acid sequence characteristics, site-specific scoring matrix characteristics, antibody CDR3 region length characteristics, amino acid secondary structure characteristics and amino acid solution accessibility characteristics are summarized; the summarized information for each amino acid is defined as a fifty-two-digit number, and an amino acid sequence is thus defined as [y × 52] matrix data, wherein y represents the number of amino acids in the antibody heavy chain or light chain amino acid sequence.
In the matrix data of [ y × 52] above, each amino acid contains the following characteristic information:
(1) PSSM matrix characterization: a 20-dimensional vector;
(2) Amino acid sequence characteristics: a 20-dimensional vector;
(3) Length characteristics of the CDR3 region: a 7-dimensional vector;
(4) Secondary structure characteristics: a 3-dimensional vector;
(5) Solution accessibility characteristics: a 2-dimensional vector.
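The five feature groups above add up to 20 + 20 + 7 + 3 + 2 = 52 columns per residue. A minimal sketch of the concatenation, in the order listed; repeating the chain-level CDR3 length code on every row is an assumption, since the source gives only the per-row widths.

```python
def build_feature_matrix(pssm_rows, one_hot_rows, cdr3_code, ss_rows, acc_rows):
    """Concatenate per-residue features into a [y × 52] matrix (list of rows).

    The CDR3 length code describes the whole chain, so it is repeated on
    every row here (an assumption made for this sketch).
    """
    n = len(one_hot_rows)
    if not (len(pssm_rows) == len(ss_rows) == len(acc_rows) == n):
        raise ValueError("all per-residue features must have the same length")
    matrix = []
    for pssm, onehot, ss, acc in zip(pssm_rows, one_hot_rows, ss_rows, acc_rows):
        row = list(pssm) + list(onehot) + list(cdr3_code) + list(ss) + list(acc)
        if len(row) != 52:  # 20 + 20 + 7 + 3 + 2
            raise ValueError("each row must have 52 columns")
        matrix.append(row)
    return matrix

# Toy inputs for a three-residue fragment (all values invented for the example).
y = 3
pssm_rows = [[0.5] * 20 for _ in range(y)]
one_hot_rows = [[1] + [0] * 19 for _ in range(y)]
cdr3_code = [0, 0, 0, 1, 0, 1, 1]
ss_rows = [[1, 0, 0]] * y
acc_rows = [[1, 0]] * y
features = build_feature_matrix(pssm_rows, one_hot_rows, cdr3_code, ss_rows, acc_rows)
print(len(features), len(features[0]))
```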
Finally, all the antibody sample heavy chain and light chain sequences in the basic sample are subjected to characteristic extraction in the steps to obtain a data set containing the antibody sample heavy chain and light chain information. For example: an antibody light chain sequence comprises 105 amino acids and has the sequence:
QPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLLIYDVTNRPSGVSNRFSGSKSGNTASLTISGLQADDEADYYCSSHTRSGTVVFGGGTKLTVL, which goes through the above-mentioned series of feature extraction steps to finally obtain a matrix of [105 × 52 ]. For the sake of brevity, the correspondence between the first 12 amino acids and the vectors in the matrix is taken as follows:
[Table not reproduced: the 52-dimensional feature vectors of the first 12 amino acids (Q, P, A, S, V, S, G, S, P, G, Q, S).]
In the present invention, the positive data set and the negative data set are acquired as follows: antibody sequences with known pairings are selected as the positive data set; the sequences are ranked by read count from largest to smallest, and heavy chains ranked in the top 20% by count combined with light chains ranked in the bottom 20%, together with light chains ranked in the top 20% combined with heavy chains ranked in the bottom 20%, are selected as the negative data set.
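The negative-set rule can be sketched as follows. Forming all cross combinations of the selected chains and dropping any combination that is a known pair are assumptions of this sketch; the source states only the top 20% and bottom 20% selection by read count. Function and variable names are illustrative.

```python
def top_and_bottom(counts, fraction=0.2):
    """Return (top, bottom) sequence IDs ranked by read count, largest first."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]

def negative_pairs(heavy_counts, light_counts, known_pairs, fraction=0.2):
    """Combine abundant heavy with rare light chains, and rare heavy with abundant light."""
    heavy_top, heavy_bottom = top_and_bottom(heavy_counts, fraction)
    light_top, light_bottom = top_and_bottom(light_counts, fraction)
    candidates = [(h, l) for h in heavy_top for l in light_bottom]
    candidates += [(h, l) for h in heavy_bottom for l in light_top]
    known = set(known_pairs)
    return [pair for pair in candidates if pair not in known]

# Toy read counts keyed by sequence ID.
heavy_counts = {"H1": 100, "H2": 50, "H3": 20, "H4": 5, "H5": 1}
light_counts = {"L1": 90, "L2": 40, "L3": 10, "L4": 3, "L5": 2}
negatives = negative_pairs(heavy_counts, light_counts, known_pairs=[("H1", "L1")])
print(negatives)
```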
The invention performs convolutional neural network training using Keras, an open-source deep learning framework for the Python language, with a personal computer or a high-performance computer as hardware. In the convolutional neural network, the neurons of a convolutional layer are connected only to some of the neuron nodes of the previous layer, i.e. the connections between neurons are not fully connected, and the connection weights w and the biases b of some neurons in the same layer are shared (i.e. identical), which greatly reduces the number of parameters to be trained.
Preferably, the structure of the convolutional neural network used for training includes: an input layer, a first convolutional layer, a first excitation layer, a first pooling layer, a first fully connected layer, a second convolutional layer, a second excitation layer, a second pooling layer, a second fully connected layer, a third convolutional layer, a third excitation layer, a third pooling layer, a third fully connected layer, a fourth convolutional layer, a fourth excitation layer, a fourth pooling layer, a fourth fully connected layer and an output layer.
First convolutional layer: 48 convolution kernels of size 3 × 3.
First excitation layer activation function: ReLU.
First pooling layer: 2 × 2 kernel.
First fully connected layer: dropout ratio 0.35.
Second convolutional layer: 48 convolution kernels of size 3 × 3.
Second excitation layer activation function: ReLU.
Second pooling layer: 2 × 2 kernel.
Second fully connected layer: dropout ratio 0.35.
Third convolutional layer: 96 convolution kernels of size 3 × 3.
Third excitation layer activation function: ReLU.
Third pooling layer: 2 × 2 kernel.
Third fully connected layer: dropout ratio 0.35.
Fourth convolutional layer: 96 convolution kernels of size 3 × 3.
Fourth excitation layer activation function: ReLU.
Fourth pooling layer: 2 × 2 kernel.
Fourth fully connected layer: dropout ratio 0.35.
The fourth pooling layer is connected to the output layer.
The input layer is used for data input, i.e. for input of the [L × 52] matrix elements; the convolutional layers use convolution kernels for feature extraction and feature mapping; the excitation layers add nonlinear mapping; the pooling layers perform down-sampling; the fully connected layers refit the features at the tail of the convolutional neural network to reduce the loss of feature information; and the output layer outputs the result. The convolutional neural network is trained with the training set and the test set to obtain a trained convolutional neural network.
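To make the effect of weight sharing concrete, the following sketch tracks the feature-map shape and the number of trainable parameters through the four convolution and pooling blocks. It is plain arithmetic, not the training code. The fixed input length of 128 rows, the single input channel and the 'same' padding are assumptions, none of which is stated in the source. In Keras, such blocks would typically be built from `Conv2D`, `MaxPooling2D` and `Dropout` layers.

```python
def conv_block(shape, filters, kernel=3, pool=2):
    """Shape and parameter count after one 3×3 convolution + 2×2 pooling block.

    Assumes 'same' padding for the convolution and non-overlapping pooling
    with floor division; neither is stated in the source.
    """
    height, width, channels = shape
    # Each filter shares its kernel weights across all positions, plus one bias.
    params = kernel * kernel * channels * filters + filters
    return (height // pool, width // pool, filters), params

def describe_network(length, filter_counts=(48, 48, 96, 96)):
    """Propagate an [L × 52] single-channel input through the four blocks."""
    shape = (length, 52, 1)
    report = []
    for filters in filter_counts:
        shape, params = conv_block(shape, filters)
        report.append((shape, params))
    return report

report = describe_network(length=128)  # 128 rows is an assumed padded length
for shape, params in report:
    print(shape, params)
```

Under these assumptions the first block needs only 480 trainable values regardless of the sequence length, because the same 3 × 3 kernels are reused at every position of the input matrix.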
Preferably, the parameters for performing the convolutional neural network training are set as follows:
NB_EPOCH=20
BATCH_SIZE=100
VERBOSE=1
NB_CLASSES=2
OPTIMIZER=SGD
VALIDATION_SPLIT=0.2
In another aspect, the present invention provides an application of the method for predicting the pairing probability of the heavy chain and the light chain of the antibody, wherein the application method comprises the following steps:
The positive and negative data sets of paired antibody heavy chains and light chains are taken as a test set; the amino acid sequences are converted, according to the characteristics, into 0s and 1s that a computer can recognize and are input into the neural network; the weight values and other parameters are obtained through the convolutional neural network; and the heavy chains and light chains of antibodies with unknown pairing information are then taken as input to obtain the probability that the heavy chain and the light chain pair successfully or fail to pair.
Compared with the prior art, the invention has the following beneficial effects:
The method for predicting the pairing probability of the heavy chain and the light chain of the antibody obtains antibody pairing probabilities through machine learning and is simple to operate, rapid, efficient, low in cost, highly repeatable and accurate, with an accuracy of up to 67.4%. It is of significance for clinical research, antibody discovery, antibody library research and the like.
Detailed Description
To further illustrate the technical means and effects of the present invention, the following further describes the technical solution of the present invention with reference to the preferred embodiments of the present invention, but the present invention is not limited to the scope of the embodiments.
Example 1
In this embodiment, the convolutional neural network training is performed on the antibody sample to obtain the final model parameters, and the specific operation method is as follows:
(1) Acquiring the whole gene sequence information of the unpaired heavy chain and light chain variable regions, acquiring the gene pairing information of an antibody sample, and processing the information to obtain the amino acid sequence pairing information of the heavy chain and the light chain of the antibody sample as a basic sample;
(2) Dividing the basic sample into a first part and a second part, and extracting the amino acid sequence characteristics of the basic sample of the first part: an amino acid is defined as a binary number of twenty digits, and 20 amino acids correspond to the binary numbers of twenty digits as follows:
Amino acid       Abbreviation   Twenty-digit binary number
Alanine          A              00000000000000000001
Arginine         R              00000000000000000010
Asparagine       N              00000000000000000100
Aspartic acid    D              00000000000000001000
Cysteine         C              00000000000000010000
Glutamine        Q              00000000000000100000
Glutamic acid    E              00000000000001000000
Histidine        H              00000000000010000000
Isoleucine       I              00000000000100000000
Leucine          L              00000000001000000000
Lysine           K              00000000010000000000
Methionine       M              00000000100000000000
Phenylalanine    F              00000001000000000000
Proline          P              00000010000000000000
Serine           S              00000100000000000000
Threonine        T              00001000000000000000
Tryptophan       W              00010000000000000000
Tyrosine         Y              00100000000000000000
Valine           V              01000000000000000000
Glycine          G              10000000000000000000
An amino acid sequence is further defined as a matrix of [ x × 20], where x represents the number of amino acids in the amino acid sequence of the heavy or light chain of the antibody.
(3) Extracting the site-specific scoring matrix (PSSM) characteristics of the first part of the basic sample: the site-specific scoring matrix is defined as an [L × 20] matrix, as shown in the formula below, where L represents the length of the antibody heavy chain or light chain amino acid sequence; the site-specific scoring matrix of the basic sample is calculated using the PSI-BLAST tool; the sigmoid function
$$f(x) = \frac{1}{1 + e^{-x}}$$
is then used to transform the values of the site-specific scoring matrix to the range 0-1;
$$\mathrm{PSSM} = \begin{bmatrix} E_{1 \to 1} & E_{1 \to 2} & \cdots & E_{1 \to 20} \\ E_{2 \to 1} & E_{2 \to 2} & \cdots & E_{2 \to 20} \\ \vdots & \vdots & \ddots & \vdots \\ E_{L \to 1} & E_{L \to 2} & \cdots & E_{L \to 20} \end{bmatrix}$$
where $E_{i \to j}$ represents the logarithm of the probability that the i-th amino acid of the amino acid sequence mutates to amino acid j during evolution, and j = 1-20 indexes the 20 natural amino acids arranged alphabetically.
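The rescaling in step (3) can be sketched as below. The sketch assumes the standard logistic form of the sigmoid, 1 / (1 + e^(-x)), and assumes the PSSM has already been parsed from PSI-BLAST output into a list of rows; `scale_pssm` and the example values are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic sigmoid; maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def scale_pssm(pssm: list[list[float]]) -> list[list[float]]:
    """Apply the sigmoid element-wise to an [L, 20] scoring matrix."""
    return [[sigmoid(score) for score in row] for row in pssm]


if __name__ == "__main__":
    # Two made-up PSSM rows (L = 2), for illustration only.
    example = [[0, 2, -3] + [0] * 17, [5, -1, 0] + [0] * 17]
    for row in scale_pssm(example):
        print([round(v, 3) for v in row[:3]])
```

A score of 0 maps to 0.5, positive log-odds scores map above 0.5 and negative scores below it, so the matrix stays on the same 0-1 scale as the binary features.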
(4) Extracting the antibody CDR3 region length characteristics of the first part of the basic sample: the number of amino acids in the antibody CDR3 region is defined as a seven-digit binary number, with each length corresponding to a seven-digit binary number as follows:
CDR3 region length Binary number
1 0000000
2 0000001
3 0000010
4 0000011
5 0000100
6 0000101
7 0000110
8 0000111
9 0001000
10 0001001
11 0001010
12 0001011
……
128 1111111
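Reading the table in step (4), a length n maps to the seven-digit binary representation of n − 1 (1 → 0000000, 12 → 0001011, 128 → 1111111). A minimal sketch of that mapping follows; the function name is hypothetical and the range check reflects only the lengths the table covers.

```python
def encode_cdr3_length(length: int) -> str:
    """Return the seven-digit binary code for a CDR3 length of 1 to 128."""
    if not 1 <= length <= 128:
        raise ValueError("the table only defines codes for lengths 1-128")
    # The table starts at 0000000 for length 1, so encode (length - 1).
    return format(length - 1, "07b")


if __name__ == "__main__":
    for n in (1, 2, 12, 128):
        print(n, encode_cdr3_length(n))
```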
(5) Extracting the amino acid secondary structure characteristics of the first part of the basic sample: the SCRATCH tool is used to predict the secondary structure of each amino acid in the amino acid sequence, and the three secondary structures are defined as three different three-digit binary numbers as follows:
Secondary structure Binary number
Helix (H) 001
Strand (E) 010
Coil (C) 100
(6) Extracting the solution accessibility (i.e., solvent accessibility) characteristics of the amino acids of the first part of the basic sample: the SCRATCH tool is used to predict the solvent accessibility of each amino acid in the amino acid sequence, and the two states are defined as two different two-digit binary numbers as follows:
Solvent accessibility Binary number
Buried 01
Exposed 10
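Steps (5) and (6) amount to two small lookup tables applied per residue. The sketch below assumes the predictions are already available as strings of per-residue labels; the H/E/C letters follow the table in step (5), while the lowercase `b`/`e` labels for the two accessibility states are hypothetical and do not describe SCRATCH's actual output format.

```python
# Codes copied from the tables in steps (5) and (6).
SECONDARY_STRUCTURE_CODES = {"H": "001", "E": "010", "C": "100"}
# "b" (buried) and "e" (exposed) are illustrative labels only.
ACCESSIBILITY_CODES = {"b": "01", "e": "10"}


def encode_structure(secondary: str, accessibility: str) -> list[str]:
    """Return one five-digit code (3 structure + 2 accessibility) per residue."""
    if len(secondary) != len(accessibility):
        raise ValueError("both predictions must cover the same residues")
    return [
        SECONDARY_STRUCTURE_CODES[s] + ACCESSIBILITY_CODES[a]
        for s, a in zip(secondary, accessibility)
    ]


if __name__ == "__main__":
    print(encode_structure("HEC", "bee"))
```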
(7) Summarizing the extracted amino acid sequence characteristics, site-specific scoring matrix characteristics, antibody CDR3 region length characteristics, amino acid secondary structure characteristics and amino acid solution accessibility characteristics, defining summarized information as a fifty-two-digit binary number, further defining an amino acid sequence as [ y x 52] matrix data, wherein y represents the number of amino acids in the antibody heavy chain or light chain amino acid sequence, and finally obtaining a data set containing antibody sample information as a training set;
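The width of 52 in step (7) follows from the individual feature widths: 20 (sequence) + 20 (scoring matrix) + 7 (CDR3 length) + 3 (secondary structure) + 2 (accessibility). A minimal sketch of the concatenation is given below. It assumes that the CDR3 length code, which is a property of the whole chain, is repeated on every residue row, and that the features are concatenated in the order listed; the source fixes neither detail, and `combine_features` is a hypothetical name.

```python
FEATURE_WIDTHS = {"one_hot": 20, "pssm": 20, "cdr3": 7, "structure": 3, "accessibility": 2}


def combine_features(one_hot_rows, pssm_rows, cdr3_bits, structure_rows, accessibility_rows):
    """Concatenate per-residue features into a [y, 52] matrix (list of rows)."""
    y = len(one_hot_rows)
    if not (len(pssm_rows) == len(structure_rows) == len(accessibility_rows) == y):
        raise ValueError("all per-residue features must have y rows")
    matrix = []
    for one_hot, pssm, structure, accessibility in zip(
        one_hot_rows, pssm_rows, structure_rows, accessibility_rows
    ):
        # Assumption: the chain-level CDR3 code is repeated on every row.
        row = list(one_hot) + list(pssm) + list(cdr3_bits) + list(structure) + list(accessibility)
        if len(row) != sum(FEATURE_WIDTHS.values()):
            raise ValueError("each row must contain exactly 52 values")
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    rows = combine_features(
        [[0] * 19 + [1]],        # alanine, per the table in step (2)
        [[0.5] * 20],            # made-up sigmoid-scaled scores
        [0, 0, 0, 1, 0, 1, 1],   # CDR3 length 12
        [[0, 0, 1]],             # helix
        [[1, 0]],                # exposed
    )
    print(len(rows), len(rows[0]))
```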
(8) Extracting the amino acid sequence characteristics, site-specific scoring matrix characteristics, antibody CDR3 region length characteristics, amino acid secondary structure characteristics and amino acid solution accessibility characteristics of a second part of basic samples according to the methods in the steps (2) to (7), and summarizing the information of the characteristics to obtain a data set containing antibody sample information as a test set;
(9) Acquiring a positive data set and a negative data set: antibody sequences with known pairings are selected as the positive data set; the sequences are sorted by read count in descending order, and heavy chains ranked in the top 20% by count paired with light chains ranked in the bottom 20%, together with light chains ranked in the top 20% paired with heavy chains ranked in the bottom 20%, are selected as the negative data set.
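The negative-set rule in step (9) can be sketched as follows. The sketch assumes heavy and light chains are ranked separately by their own read counts and that every top-ranked chain is combined with every bottom-ranked chain of the other type; the source does not say whether all combinations or only some are used. All names and the example counts are hypothetical.

```python
from itertools import product


def split_top_bottom(read_counts: dict[str, int], fraction: float = 0.2):
    """Rank chain IDs by read count (descending); return (top, bottom) slices."""
    ranked = sorted(read_counts, key=read_counts.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


def build_negative_pairs(heavy_counts, light_counts, fraction: float = 0.2):
    """Return (heavy, light) ID pairs mixing top-ranked with bottom-ranked chains."""
    heavy_top, heavy_bottom = split_top_bottom(heavy_counts, fraction)
    light_top, light_bottom = split_top_bottom(light_counts, fraction)
    # Assumption: all combinations are taken (Cartesian product).
    return list(product(heavy_top, light_bottom)) + list(product(heavy_bottom, light_top))


if __name__ == "__main__":
    heavy = {"H1": 50, "H2": 40, "H3": 30, "H4": 20, "H5": 10}
    light = {"L1": 5, "L2": 4, "L3": 3, "L4": 2, "L5": 1}
    print(build_negative_pairs(heavy, light))
```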
(10) Inputting the training set, the test set, the positive data set and the negative data set into a convolutional neural network for training, wherein the structure of the convolutional neural network comprises: an input layer, a first convolutional layer, a first excitation layer, a first pooling layer, a first fully connected layer, a second convolutional layer, a second excitation layer, a second pooling layer, a second fully connected layer, a third convolutional layer, a third excitation layer, a third pooling layer, a third fully connected layer, a fourth convolutional layer, a fourth excitation layer, a fourth pooling layer, a fourth fully connected layer and an output layer.
First convolutional layer: 48 convolution kernels of size 3 × 3.
First excitation layer activation function: ReLU.
First pooling layer: 2 × 2 kernel.
First fully connected layer: dropout ratio of 0.35.
Second convolutional layer: 48 convolution kernels of size 3 × 3.
Second excitation layer activation function: ReLU.
Second pooling layer: 2 × 2 kernel.
Second fully connected layer: dropout ratio of 0.35.
Third convolutional layer: 96 convolution kernels of size 3 × 3.
Third excitation layer activation function: ReLU.
Third pooling layer: 2 × 2 kernel.
Third fully connected layer: dropout ratio of 0.35.
Fourth convolutional layer: 96 convolution kernels of size 3 × 3.
Fourth excitation layer activation function: ReLU.
Fourth pooling layer: 2 × 2 kernel.
Fourth fully connected layer: dropout ratio of 0.35.
The fourth pooling layer is connected to the output layer.
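As a rough worked example of the model size implied by the layer list, the sketch below counts the trainable parameters of the four convolutional layers only. It assumes standard 2D convolutions in which each kernel spans all input channels and has one bias, and a single-channel input; neither is stated in the source. The fully connected layers are left out because their sizes are not given, and pooling and dropout add no parameters.

```python
# (number of kernels, kernel height, kernel width) for the four
# convolutional layers listed in step (10).
CONV_LAYERS = [(48, 3, 3), (48, 3, 3), (96, 3, 3), (96, 3, 3)]


def conv_parameter_counts(input_channels: int = 1) -> list[int]:
    """Return the weight-plus-bias count of each convolutional layer."""
    counts = []
    channels = input_channels  # assumption: single-channel input by default
    for kernels, height, width in CONV_LAYERS:
        # Each kernel has height * width * channels weights and one bias.
        counts.append(kernels * (height * width * channels + 1))
        channels = kernels  # the next layer sees one channel per kernel
    return counts


if __name__ == "__main__":
    counts = conv_parameter_counts()
    print(counts, sum(counts))
```

Under these assumptions the convolutional layers hold 480, 20784, 41568 and 83040 parameters, 145872 in total.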
The parameter setting conditions are as follows:
NB_EPOCH=20
BATCH_SIZE=100
VERBOSE=1
NB_CLASSES=2
OPTIMIZER=SGD
VALIDATION_SPLIT=0.2
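To make the parameter settings concrete, the sketch below works out how many samples are held out for validation and how many optimizer updates the settings imply. The sample count of 10000 is made up for illustration (the source does not give the data set size), and reading VALIDATION_SPLIT as the fraction of the training data held out for validation is an assumption based on the parameter name.

```python
import math

# Settings copied from the parameter list above.
PARAMS = {
    "NB_EPOCH": 20,
    "BATCH_SIZE": 100,
    "VERBOSE": 1,
    "NB_CLASSES": 2,
    "OPTIMIZER": "SGD",
    "VALIDATION_SPLIT": 0.2,
}


def training_schedule(n_samples: int, params: dict = PARAMS) -> dict:
    """Return the train/validation sizes and update counts for n_samples."""
    n_validation = round(n_samples * params["VALIDATION_SPLIT"])
    n_train = n_samples - n_validation
    steps_per_epoch = math.ceil(n_train / params["BATCH_SIZE"])
    return {
        "n_train": n_train,
        "n_validation": n_validation,
        "steps_per_epoch": steps_per_epoch,
        "total_updates": steps_per_epoch * params["NB_EPOCH"],
    }


if __name__ == "__main__":
    # 10000 is a hypothetical training-set size.
    print(training_schedule(10000))
```

With 10000 samples, 8000 would be used for training and 2000 for validation, giving 80 batches per epoch and 1600 updates over 20 epochs.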
(11) Outputting the pairing information and optimizing the objective function of the convolutional neural network training to obtain the weights and biases of the entire convolutional neural network, thereby obtaining the final model parameters.
Example 2
In this embodiment, the method according to the present invention is used to predict the pairing probability of the following query sequences, and the specific operation method is as follows:
wherein the heavy chain is:
QVHLQESGPELVRPGASVKISCKTSGYVFSSSWMNWVKQRPGQGLKWIGRIYPGNGNTNYNEKFKGKATLTADKSSNTAYMQLSSLTSVDSAVYFCATSSAYWGQGTLLTVSAAKTTPPSVYPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVPSSPRPSETVTCNVAHPASSTKVDKKIVPR;
the light chain is:
DIQMTQTTSSLSASLGDRVTFSCSASQDISNYLNWYQQKPDGTIKLLIYYTSSLRSGVPSRFSGSGSGTDYSLTINNLEPEDIATYFCQQYSRLPFTFGSGTKLEIKRADAAPTVSIFPPSSEQLTSGGASVVCFLNNFYPKDINVKWKIDGSERQNGVLNSWTDQDSKDSTYSMSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNRNEC;
the resulting pairing probability was 83.1%.
The applicant states that the present invention is illustrated by the above examples to describe a method for predicting the pairing probability of heavy chains and light chains of an antibody and the application thereof, but the present invention is not limited to the above examples, i.e. it does not mean that the present invention must rely on the above examples to be implemented. It should be understood by those skilled in the art that any modification of the present invention, equivalent substitutions of the raw materials of the product of the present invention, addition of auxiliary components, selection of specific modes, etc., are within the scope and disclosure of the present invention.
The preferred embodiments of the present invention have been described in detail, however, the present invention is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present invention within the technical idea of the present invention, and these simple modifications are within the protective scope of the present invention.
It should be noted that the technical features described in the above embodiments can be combined in any suitable manner provided there is no contradiction; to avoid unnecessary repetition, the possible combinations are not described separately.

Claims (11)

1. A method for predicting the probability of pairing a heavy chain and a light chain of an antibody, the method comprising the steps of: converting the amino acid sequence characteristics of the heavy chain and the light chain of the antibody sample with known pairing information into digital signals, inputting the digital signals into a convolutional neural network for training to obtain final model parameters, and predicting the pairing probability of the heavy chain and the light chain of the antibody to be predicted by using the model parameters;
the method for obtaining the final model parameters specifically comprises the following steps:
(1) Acquiring the pairing information of the heavy chain amino acid sequence and the light chain amino acid sequence of the antibody sample as a basic sample;
(2) Dividing a basic sample into a first part and a second part, extracting the amino acid sequence characteristics, the site specificity scoring matrix characteristics, the length characteristics of a CDR3 region of an antibody, the secondary structure characteristics of amino acids and the solution accessibility characteristics of the amino acids of the basic sample of the first part, and summarizing the information of the characteristics to obtain a data set containing the information of the antibody sample as a training set;
(3) Extracting the amino acid sequence characteristics, site specificity scoring matrix characteristics, antibody CDR3 region length characteristics, amino acid secondary structure characteristics and amino acid solution accessibility characteristics of a second part of basic samples, and summarizing the characteristics to obtain a data set containing antibody sample information as a test set;
(4) Acquiring a positive data set and a negative data set;
(5) Inputting the training set into a convolutional neural network for convolutional neural network training;
(6) Outputting the pairing information and optimizing the objective function of the convolutional neural network training to obtain the weights and biases of the entire convolutional neural network, thereby obtaining the final model parameters.
2. The method of claim 1, wherein the base sample is obtained by: acquiring the whole gene sequence information of the unpaired heavy chain and light chain variable regions, acquiring the antibody sample gene pairing information, and processing the information to obtain the heavy chain and light chain amino acid sequence pairing information of the antibody sample.
3. The method of predicting the probability of pairing a heavy chain and a light chain of an antibody of claim 1, wherein the amino acid sequence features are extracted by:
an amino acid is defined as a binary number of twenty digits, and an amino acid sequence is defined as a matrix of [ x × 20], where x represents the number of amino acids in the antibody heavy or light chain amino acid sequence.
4. The method of predicting antibody heavy chain and light chain pairing probability of claim 1, wherein the site-specific scoring matrix features are extracted by:
the site-specific scoring matrix is defined as an [L × 20] matrix, as shown in the formula below, wherein L represents the length of an antibody heavy chain or light chain amino acid sequence; a site-specific scoring matrix for the basic sample is calculated using the PSI-BLAST tool; then the values in the rows and columns of the PSSM matrix are transformed by the sigmoid function
$$f(x) = \frac{1}{1 + e^{-x}}$$
to the interval 0-1;
$$\mathrm{PSSM} = \begin{bmatrix} E_{1 \to 1} & E_{1 \to 2} & \cdots & E_{1 \to 20} \\ E_{2 \to 1} & E_{2 \to 2} & \cdots & E_{2 \to 20} \\ \vdots & \vdots & \ddots & \vdots \\ E_{L \to 1} & E_{L \to 2} & \cdots & E_{L \to 20} \end{bmatrix}$$
wherein $E_{i \to j}$ represents the logarithm of the probability that the i-th amino acid of the amino acid sequence mutates to amino acid j during evolution, and j = 1-20 indexes the 20 natural amino acids arranged alphabetically.
5. The method for predicting antibody heavy chain and light chain pairing probability of claim 1, wherein the antibody CDR3 region length characteristics are extracted by: defining each number of amino acids in the antibody CDR3 region as a different seven-digit binary number, thereby extracting the antibody CDR3 region length characteristics.
6. The method of predicting the probability of pairing a heavy chain and a light chain of an antibody of claim 1, wherein the secondary structural features of the amino acids are extracted by: the secondary structure of each amino acid in the amino acid sequence was predicted using the SCRATCH tool, and three secondary structures were defined as three different three-digit binary numbers, thereby extracting the amino acid secondary structure characteristics.
7. The method of predicting antibody heavy and light chain pairing probability of claim 1 wherein the solution accessibility features of the amino acids are extracted by:
solution accessibility characteristics of amino acids are extracted using the SCRATCH tool to predict solution accessibility for each amino acid in an amino acid sequence and define two states as two different binary digits.
8. The method of claim 1, wherein the method for predicting the probability of pairing heavy chains and light chains of the antibody summarizes the information on the characteristics by:
summarizing the extracted amino acid sequence characteristics, site-specific scoring matrix characteristics, antibody CDR3 region length characteristics, amino acid secondary structure characteristics and amino acid solution accessibility characteristics, defining the summarized information as a fifty-two-digit binary number, and further defining an amino acid sequence as [ y × 52] matrix data, wherein y represents the number of amino acids in the antibody heavy chain or light chain amino acid sequence.
9. The method of predicting antibody heavy chain and light chain pairing probability of claim 1, wherein the positive data set and the negative data set are obtained by: selecting antibody sequences with known pairings as the positive data set; and sorting the sequences by read count in descending order, and selecting heavy chains ranked in the top 20% by count paired with light chains ranked in the bottom 20%, together with light chains ranked in the top 20% paired with heavy chains ranked in the bottom 20%, as the negative data set.
10. The method for predicting antibody heavy chain and light chain pairing probability of claim 1, wherein the structure of the convolutional neural network used in the convolutional neural network training comprises: an input layer, a first convolutional layer, a first excitation layer, a first pooling layer, a first fully connected layer, a second convolutional layer, a second excitation layer, a second pooling layer, a second fully connected layer, a third convolutional layer, a third excitation layer, a third pooling layer, a third fully connected layer, a fourth convolutional layer, a fourth excitation layer, a fourth pooling layer, a fourth fully connected layer and an output layer.
11. The method of claim 1, wherein the parameters for performing the convolutional neural network training are set as follows:
NB_EPOCH=20
BATCH_SIZE=100
VERBOSE=1
NB_CLASSES=2
OPTIMIZER=SGD
VALIDATION_SPLIT=0.2
CN201910730394.9A 2019-08-08 2019-08-08 Method for predicting antibody heavy chain and light chain pairing probability and application thereof Active CN110428870B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910730394.9A CN110428870B (en) 2019-08-08 2019-08-08 Method for predicting antibody heavy chain and light chain pairing probability and application thereof


Publications (2)

Publication Number Publication Date
CN110428870A CN110428870A (en) 2019-11-08
CN110428870B true CN110428870B (en) 2023-03-21

Family

ID=68413287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910730394.9A Active CN110428870B (en) 2019-08-08 2019-08-08 Method for predicting antibody heavy chain and light chain pairing probability and application thereof

Country Status (1)

Country Link
CN (1) CN110428870B (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant