CN110428870A - Method for predicting antibody heavy chain-light chain pairing probability and application thereof - Google Patents

Method for predicting antibody heavy chain-light chain pairing probability and application thereof

Info

Publication number
CN110428870A
Authority
CN
China
Prior art keywords
amino acid
antibody
heavy chain
light chain
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910730394.9A
Other languages
Chinese (zh)
Other versions
CN110428870B (en)
Inventor
吴婷婷
侯强波
蔡晓辉
杨平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Wang Xun Biological Polytron Technologies Inc
Original Assignee
Suzhou Wang Xun Biological Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Wang Xun Biological Polytron Technologies Inc
Priority to CN201910730394.9A
Publication of CN110428870A
Application granted
Publication of CN110428870B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Abstract

The present invention relates to a method for predicting the pairing probability of antibody heavy and light chains. The method is based on a convolutional neural network: the amino acid sequence features of the heavy and light chains of antibody samples with known pairing information are converted into digital signals and input into the convolutional neural network for training, yielding the final model parameters, which are then used to predict the pairing probability of the antibody heavy and light chains to be predicted. The method obtains antibody pairing probabilities by machine learning. It is simple to operate, fast, efficient, low in cost and highly repeatable, reaches an accuracy of up to 67.4%, and can pair a large number of antibody heavy and light chains, which is of great significance for clinical research, antibody discovery, antibody library research and the like.

Description

Method for predicting antibody heavy chain-light chain pairing probability and application thereof
Technical field
The invention belongs to the field of biotechnology, and in particular relates to a method for predicting the pairing probability of antibody heavy and light chains and an application thereof.
Background
An antibody is an immunoglobulin that binds specifically to an antigen. It consists of two identical heavy chains (H chains) and two identical light chains (L chains); disulfide bonds link heavy chain to heavy chain and heavy chain to light chain, forming a symmetric molecule of paired heavy and light chains. The heavy chain is divided into a variable region (V region), a constant region (C region), a transmembrane region and a cytoplasmic region, whereas the light chain has only a V region and a C region. In the variable region of an antibody, each heavy chain variable region (VH) and light chain variable region (VL) contains three complementarity-determining regions (CDRs), namely CDR1, CDR2 and CDR3. The amino acid/gene composition and ordering of the CDRs are highly diverse; within a single individual this diversity reaches 10^9 to 10^12, constituting a B cell antigen receptor (BCR) repertoire of enormous capacity. In addition, the antibody heavy chain variable region is encoded by the V, D and J gene clusters, and the antibody light chain variable region by the V and J gene clusters. Heavy and light chains are the products of two separate mRNA transcripts and are assembled together into the full-length immunoglobulin molecule in the endoplasmic reticulum of the B cell. Research on the pairing of natural antibodies (VH-VL pairing) is therefore of vital significance for correct antibody folding, antibody stability, antibody expression, antigen binding and so on.
The framework regions (FRs) of an antibody are highly conserved, whereas the CDRs are diverse, and the heavy chain CDR3 is the region most prone to mutation. Under antigen stimulation, antibody diversity arises mainly from two sources. First, the B cell receptor undergoes random rearrangement of its germline-encoded genes in order to adapt to the antigen structure. Second, once a pathogen triggers an immune response, the antibody V regions undergo proliferation, death and mutation; in the acute phase of the immune response the mutation rate reaches 1/10^3 bp.
In humans and mice, heavy chain-light chain pairing is critical for the folding, stability and antigen binding of natural antibodies. In addition, information about the heavy chain/light chain dimer is required to model accurately the three-dimensional conformation of the antibody variable region and antigen-binding region, which is essential for rational antibody engineering. Natively paired VH/VL antibodies therefore provide important information for our understanding of antibody biology and for antibody design.
At present, immune repertoire sequencing, monoclonal antibody sequencing and immunoglobulin single-cell sequencing are the main experimental approaches for studying antibody heavy and light chains. Immune repertoire sequencing usually yields only the sequences and relative abundances of heavy and light chains and cannot provide their pairing information. Compared with sequencing the amino acid sequence of a monoclonal antibody, PCR-amplification-based monoclonal antibody gene sequencing gives more accurate and reliable results and distinguishes more precisely amino acids that are hard to tell apart by mass spectrometry, such as leucine and isoleucine. However, this method has low sequencing efficiency and low single-cell throughput (< 200-500 cells), requires cumbersome experimental steps, and consumes a great deal of time and material. Even after many experiments it can resolve only a relatively small number of VH-VL pairs, for example 10^4 to 10^5, far short of what research on data sets of millions of pairs or more (for example 10^9 to 10^12) requires, which limits clinical research, antibody discovery, antibody library research and the like. Immunoglobulin single-cell sequencing can produce high-throughput paired VH/VL data, but the experiments are expensive, complicated to perform and poorly repeatable.
It is therefore very important to develop a simple, fast, efficient, low-cost and highly accurate method for predicting antibody heavy chain-light chain pairing probability, which is of great significance for clinical research, antibody discovery, antibody library research and the like.
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide a method for predicting antibody heavy chain-light chain pairing probability and an application thereof.
To achieve this object, the invention adopts the following technical solutions:
In one aspect, the present invention provides a method for predicting antibody heavy chain-light chain pairing probability, in which the pairing probability of antibody heavy and light chains is predicted on the basis of a convolutional neural network.
Preferably, the method comprises the following steps: the amino acid sequence features of the heavy and light chains of antibody samples with known pairing information are converted into digital signals and input into a convolutional neural network for training to obtain the final model parameters; the model parameters are then used to predict the pairing probability of the antibody heavy and light chains to be predicted.
The method obtains antibody pairing probabilities by machine learning. It is simple to operate, fast, efficient, low in cost, highly repeatable and accurate, can pair a large number of antibody heavy and light chains, and is of great significance for clinical research, antibody discovery, antibody library research and the like.
Preferably, the method of obtaining the final model parameters comprises the following steps:
(1) obtaining the heavy chain and light chain amino acid sequence pairing information of antibody samples as the basic samples;
(2) dividing the basic samples into a first part and a second part; extracting the amino acid sequence feature, position-specific scoring matrix feature, antibody CDR3 region length feature, amino acid secondary structure feature and amino acid solvent accessibility feature of the first part of the basic samples; and aggregating these features to obtain a data set containing the antibody sample information, which serves as the training set;
(3) extracting the amino acid sequence feature, position-specific scoring matrix feature, antibody CDR3 region length feature, amino acid secondary structure feature and amino acid solvent accessibility feature of the second part of the basic samples, and aggregating these features to obtain a data set containing the antibody sample information, which serves as the test set;
(4) obtaining a positive data set and a negative data set;
(5) inputting the training set, test set, positive data set and negative data set into the convolutional neural network and training the convolutional neural network;
(6) outputting the pairing information and optimizing the objective function of the convolutional neural network training to obtain the weights and biases of the entire convolutional neural network, that is, the final model parameters.
In the present invention, the basic samples are obtained as follows: the full gene sequence information of unpaired heavy chain and light chain variable regions is obtained first, then the gene pairing information of the antibody samples is obtained, and the two are processed to yield the heavy chain and light chain amino acid sequence pairing information of the antibody samples.
The full gene sequence data of the unpaired heavy chain and light chain variable regions were downloaded from the NCBI Short Read Archive (SRA) database under SRA accession SRP047462. The VH:VL_Analysis information of Donors 1-3 was taken; each donor has 2 samples, so the information of 6 samples in total was downloaded, collectively referred to as "data 1".
The antibody sample gene pairing data were downloaded from the Nature web page and comprise the V and J germline gene information and CDR3 region sequence information of the antibody light chain variable region, and the V, D and J germline gene information and CDR3 region sequence information of the antibody heavy chain variable region. Heavy-light chain pairing has already been completed in this information, but it does not include the full gene sequences of the antibody heavy chain and light chain variable regions. Six data files were downloaded, Supplementary Data Set 1-6, collectively referred to as "data 2".
"Data 1" is the antibody sequence output of immune repertoire sequencing, whereas "data 2" contains the V, J and D germline gene information and the CDR3 region gene sequences. The information in "data 1" and "data 2" therefore has to be preprocessed and matched so that each heavy chain variable region in "data 1" is paired with the light chain of the corresponding number, giving the sample information. The specific procedure is as follows:
(1) Sample information extraction: the germline gene information and CDR3 region sequence information of the antibody samples in "data 1" are extracted with the IgBLAST software. For heavy chain sequences IgBLAST yields the V, J and D germline gene information and the CDR3 region gene sequence; for light chain sequences it yields the V and J germline gene information and the CDR3 region gene sequence.
(2) Heavy chain and light chain sequence matching: the corresponding antibody sequence numbers in "data 1" are located using the germline gene and CDR3 sequence information in "data 2"; the DNA sequences of the matched heavy and light chains in "data 1" are then translated into amino acid sequences, giving the heavy chain and light chain amino acid sequence pairing information of the antibody samples, that is, the sample information.
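The DNA-to-amino-acid conversion in step (2) can be sketched as follows. This is a minimal illustration, not the procedure used in the patent: it assumes the standard genetic code, a reading frame that starts at the first base, and a hypothetical helper name `translate`. In practice the reading frame would come from the IgBLAST annotation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate a DNA sequence in frame 0 (assumption), stopping at a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    protein = []
    for i in range(0, usable, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # "X" marks an ambiguous codon
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


print(translate("CAGCCTGCC"))  # QPA
```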
Preferably, the amino acid sequence feature is extracted as follows:
Each amino acid is defined as a 20-bit binary number, so that an amino acid sequence is defined as matrix data of size [x × 20], where x is the number of amino acids in the antibody heavy chain or light chain amino acid sequence.
Here each amino acid is represented by a 20-bit binary number, so the 20 amino acids correspond to 20 mutually distinct 20-bit binary numbers, for example according to the correspondence in the following table:
Amino acid  Abbreviation  20-bit binary number
Alanine A 00000000000000000001
Arginine R 00000000000000000010
Asparagine N 00000000000000000100
Aspartic acid D 00000000000000001000
Cysteine C 00000000000000010000
Glutamine Q 00000000000000100000
Glutamic acid E 00000000000001000000
Histidine H 00000000000010000000
Isoleucine I 00000000000100000000
Leucine L 00000000001000000000
Lysine K 00000000010000000000
Methionine M 00000000100000000000
Phenylalanine F 00000001000000000000
Proline P 00000010000000000000
Serine S 00000100000000000000
Threonine T 00001000000000000000
Tryptophan W 00010000000000000000
Tyrosine Y 00100000000000000000
Valine V 01000000000000000000
Glycine G 10000000000000000000
It should be noted that the table above lists only one possible correspondence; other correspondences between the 20 amino acids and 20-bit binary numbers also fall within the scope of protection of the present invention.
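The encoding in the table above amounts to a one-hot code. The sketch below reproduces that particular correspondence in plain Python; the function name `one_hot_encode` and the list-of-lists representation are illustrative assumptions.

```python
# Amino acids in the order of the table above: A maps to the rightmost bit, G to the leftmost.
AMINO_ACIDS = "ARNDCQEHILKMFPSTWYVG"


def one_hot_encode(sequence):
    """Return an [x, 20] list of rows, one 20-bit one-hot row per residue."""
    rows = []
    for residue in sequence.upper():
        k = AMINO_ACIDS.index(residue)  # raises ValueError for non-standard residues
        row = [0] * 20
        row[19 - k] = 1  # the k-th amino acid sets the k-th bit counted from the right
        rows.append(row)
    return rows


for residue, row in zip("ARG", one_hot_encode("ARG")):
    print(residue, "".join(map(str, row)))
```

For a sequence of x residues the result has x rows of 20 bits, matching the [x × 20] matrix described above.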
In the present invention, the position-specific scoring matrix (PSSM) feature is extracted as follows:
The position-specific scoring matrix is defined as an [L × 20] matrix as shown below, where L is the length of the antibody heavy chain or light chain amino acid sequence, that is, its number of amino acids. The PSSM of the basic samples is calculated with the PSI-BLAST tool, and the values of the PSSM are then transformed into the interval 0-1 with the sigmoid function $f(x) = \frac{1}{1 + e^{-x}}$;
$$\mathrm{PSSM} = \begin{bmatrix} E_{1\to 1} & E_{1\to 2} & \cdots & E_{1\to 20} \\ E_{2\to 1} & E_{2\to 2} & \cdots & E_{2\to 20} \\ \vdots & \vdots & \ddots & \vdots \\ E_{L\to 1} & E_{L\to 2} & \cdots & E_{L\to 20} \end{bmatrix}$$
where $E_{i\to j}$ is the log value of the probability that the i-th amino acid of the amino acid sequence mutates to amino acid j during evolution, and j = 1-20 indexes the 20 natural amino acids in alphabetical order.
The PSSM is an [L × 20] matrix, where L is the length of the antibody heavy chain or light chain amino acid sequence, that is, its number of amino acids. Each amino acid residue can vary in 20 ways, corresponding to the 20 amino acids, so the 20 values in each row of the PSSM measure how frequently the residue at that position of the sequence mutates to each of the corresponding residues. The PSSM thus gives the conservation of the amino acid at each position, expressed as the log value of the frequency with which a given amino acid occurs at that position, and can be used to extract the evolutionary information of an antibody sequence.
Specifically, the PSSM feature is extracted as follows:
(1) The germline gene information of the heavy and light chains of the antibody samples is searched against the IMGT database with the PSI-BLAST tool, with the E value (expectation value) set to 0.001. Three rounds of iterative searching yield a set of sequences homologous to the query sequence, from which the PSSM of the target antibody is derived. The sequences obtained, together with the query sequence, are then subjected to multiple sequence alignment, which shows how each residue of the query sequence has varied during evolution.
(2) The values of the PSSM are transformed into the interval 0-1 with the sigmoid function $f(x) = \frac{1}{1 + e^{-x}}$.
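Step (2) can be sketched as an element-wise rescaling. The snippet assumes the standard logistic form of the sigmoid, 1 / (1 + e^(-x)), and uses a made-up two-row toy matrix in place of real PSI-BLAST output; `scale_pssm` is a hypothetical helper name.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function (assumed form), mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def scale_pssm(pssm):
    """Apply the sigmoid to every entry of an [L, 20] score matrix."""
    return [[sigmoid(value) for value in row] for row in pssm]


# Toy scores for illustration only, not real PSI-BLAST output.
toy_pssm = [[0, 2, -2] + [0] * 17,
            [5, -5, 1] + [0] * 17]
scaled = scale_pssm(toy_pssm)
print([round(v, 3) for v in scaled[0][:3]])
```

A score of 0 maps to 0.5, positive scores (substitutions seen more often than expected) map above 0.5, and negative scores map below it.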
Preferably, the antibody CDR3 region length feature is extracted as follows: the number of amino acids in the antibody CDR3 region is defined as a seven-bit binary number, a different number for each length, thereby giving the antibody CDR3 region length feature.
Here each CDR3 region length, that is, each amino acid count, is represented by a seven-bit binary number, so different CDR3 region lengths correspond to mutually distinct seven-bit binary numbers, for example according to the correspondence in the following table:
CDR3 region length  Binary number
1 0000000
2 0000001
3 0000010
4 0000011
5 0000100
6 0000101
7 0000110
8 0000111
9 0001000
10 0001001
11 0001010
12 0001011
……
128 1111111
It should be noted that the table above lists only one possible correspondence; other correspondences between the CDR3 region length feature and seven-bit binary numbers also fall within the scope of protection of the present invention.
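In the table above, a CDR3 length n is written as the seven-bit binary representation of n - 1, which covers lengths 1 to 128. A minimal sketch of that particular correspondence, with a hypothetical function name:

```python
def encode_cdr3_length(length: int) -> str:
    """Encode a CDR3 length (1..128) as the 7-bit binary form of length - 1."""
    if not 1 <= length <= 128:
        raise ValueError("CDR3 length must be between 1 and 128 for a 7-bit code")
    return format(length - 1, "07b")


for n in (1, 2, 12, 128):
    print(n, encode_cdr3_length(n))
```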
Preferably, the amino acid secondary structure feature is extracted as follows: the secondary structure of each amino acid in the amino acid sequence is predicted with the SCRATCH tool, and the three secondary structure types are defined as three different three-bit binary numbers, thereby giving the amino acid secondary structure feature.
The secondary structure formed by the arrangement of neighboring residues in an antibody sequence affects the interactions between residue pairs and hence the tertiary structure of the antibody. Here the SCRATCH tool is used to predict the secondary structure of the antibody sequence (three classes in total: helix, strand and coil). The secondary structure at each position is represented by a three-bit binary number, for example according to the correspondence in the following table:
Structure  Binary number
Helix (H)  001
Strand (E)  010
Coil (C)  100
It should be noted that the table above lists only one possible correspondence; other correspondences between the amino acid secondary structure feature and three-bit binary numbers also fall within the scope of protection of the present invention.
In the present invention, the solvent accessibility feature of the amino acids is extracted as follows:
The solvent accessibility of each amino acid in the amino acid sequence is predicted with the SCRATCH tool, and the two states are defined as two different two-bit binary numbers, thereby giving the amino acid solvent accessibility feature.
Antibody folding takes place inside the cell. As the antibody folds in the aqueous environment of the cell, its hydrophobic residues move gradually toward the center of the antibody conformation, while its hydrophilic residues form the surface of the protein. Extracting the relative solvent accessibility feature therefore indicates the approximate position of each amino acid residue within the antibody conformation. Here the SCRATCH tool is used to predict the solvent accessibility of each amino acid position; the solvent accessibility at each position is represented by a two-bit binary number indicating whether the position is buried or exposed, for example according to the correspondence in the following table:
Solvent accessibility  Binary number
Buried  01
Exposed  10
It should be noted that the table above lists only one possible correspondence; other correspondences between the amino acid solvent accessibility feature and two-bit binary numbers also fall within the scope of protection of the present invention.
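Both per-residue structural features reduce to a lookup from a predicted state to a short bit pattern. The sketch below uses the secondary structure codes given in Embodiment 1 (H = 001, E = 010, C = 100) and the buried/exposed codes above. The single-letter labels "b" and "e" for buried and exposed are assumptions made for illustration and are not necessarily the labels that SCRATCH emits.

```python
# Secondary structure codes as listed in Embodiment 1.
SS_CODES = {"H": "001", "E": "010", "C": "100"}
# Hypothetical state labels: "b" = buried, "e" = exposed.
ACC_CODES = {"b": "01", "e": "10"}


def encode_states(states, codes):
    """Map a string of per-residue states to rows of bits using the given code table."""
    return [[int(bit) for bit in codes[state]] for state in states]


print(encode_states("HEC", SS_CODES))  # [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(encode_states("be", ACC_CODES))  # [[0, 1], [1, 0]]
```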
Preferably, the features are aggregated as follows:
The extracted amino acid sequence feature, position-specific scoring matrix feature, antibody CDR3 region length feature, amino acid secondary structure feature and amino acid solvent accessibility feature are aggregated, and the aggregated information is defined as a 52-bit binary number, so that an amino acid sequence is defined as matrix data of size [y × 52], where y is the number of amino acids in the antibody heavy chain or light chain amino acid sequence.
In the [y × 52] matrix data, each amino acid carries the following feature information:
(1) PSSM feature: 20-dimensional vector;
(2) amino acid sequence feature: 20-dimensional vector;
(3) CDR3 region length feature: 7-dimensional vector;
(4) secondary structure feature: 3-dimensional vector;
(5) solvent accessibility feature: 2-dimensional vector.
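The aggregation can be sketched as a row-wise concatenation in the order listed above (20 + 20 + 7 + 3 + 2 = 52). The text does not say how the sequence-level CDR3 length code is attached to each residue, so the sketch assumes it is simply repeated in every row; `build_feature_matrix` and the toy inputs are hypothetical.

```python
def build_feature_matrix(pssm_rows, one_hot_rows, cdr3_bits, ss_rows, acc_rows):
    """Concatenate per-residue features into a [y, 52] matrix.

    Column order follows the list above: PSSM (20), one-hot (20),
    CDR3 length (7, repeated on every row by assumption),
    secondary structure (3), solvent accessibility (2).
    """
    y = len(one_hot_rows)
    if not (len(pssm_rows) == len(ss_rows) == len(acc_rows) == y):
        raise ValueError("all per-residue features must cover the same residues")
    matrix = []
    for i in range(y):
        row = (list(pssm_rows[i]) + list(one_hot_rows[i]) + list(cdr3_bits)
               + list(ss_rows[i]) + list(acc_rows[i]))
        if len(row) != 52:
            raise ValueError("expected 52 features per residue")
        matrix.append(row)
    return matrix


# Toy two-residue input for illustration only.
features = build_feature_matrix(
    pssm_rows=[[0.5] * 20, [0.5] * 20],
    one_hot_rows=[[0] * 19 + [1], [1] + [0] * 19],
    cdr3_bits=[0, 0, 0, 1, 0, 1, 1],
    ss_rows=[[0, 0, 1], [1, 0, 0]],
    acc_rows=[[0, 1], [1, 0]],
)
print(len(features), len(features[0]))  # 2 52
```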
Finally, feature extraction is applied as described above to the heavy chain and light chain sequences of all antibody samples in the basic samples, giving a data set containing the heavy chain and light chain information of the antibody samples. For example, a certain antibody light chain sequence contains 105 amino acids, with the sequence:
QPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLLIYDVTNRPSGVSNRFSGSKSGNTASLTISGLQADDEADYYCSSHTRSGTVVFGGGTKLTVL
Applying the above series of feature extraction steps finally gives a [105 × 52] matrix. For reasons of length, the correspondence between the first 12 amino acids and the vectors in the matrix is given here as an example:
In the present invention, the positive data set and negative data set are obtained as follows: antibody sequences with known pairing are selected as the positive data set. For the negative data set, sequences are sorted by read count in descending order; heavy chains ranked in the top 20% by read count are combined with light chains ranked in the bottom 20%, and light chains ranked in the top 20% are combined with heavy chains ranked in the bottom 20%.
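One possible reading of the negative-set construction is sketched below. The text does not state how the selected heavy and light chains are combined, so the sketch assumes every top-ranked chain is paired with every bottom-ranked chain of the other type; the function names and the toy read counts are hypothetical, and no check against the known positive pairs is made.

```python
from itertools import product


def top_and_bottom(read_counts, fraction=0.2):
    """Rank sequence IDs by read count (descending) and return the top and bottom fraction."""
    ranked = sorted(read_counts, key=read_counts.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


def build_negative_pairs(heavy_counts, light_counts, fraction=0.2):
    """Return (heavy, light) pairs: top heavy x bottom light, then bottom heavy x top light."""
    heavy_top, heavy_bottom = top_and_bottom(heavy_counts, fraction)
    light_top, light_bottom = top_and_bottom(light_counts, fraction)
    return list(product(heavy_top, light_bottom)) + list(product(heavy_bottom, light_top))


# Toy read counts for illustration only.
heavy = {"H1": 100, "H2": 50, "H3": 20, "H4": 10, "H5": 1}
light = {"L1": 90, "L2": 40, "L3": 30, "L4": 5, "L5": 2}
print(build_negative_pairs(heavy, light))  # [('H1', 'L5'), ('H5', 'L1')]
```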
The present invention is implemented with the open-source deep learning framework Keras in the Python language, and the convolutional neural network is trained on a personal computer or a high-performance computer. In a convolutional neural network, the neurons of a convolutional layer are connected to only some of the neuron nodes of the previous layer, that is, the connections between neurons are not fully connected, and the weights w and biases b of the connections of certain neurons in the same layer are shared (identical), which greatly reduces the number of parameters to be trained.
Preferably, the structure of the convolutional neural network during training comprises: an input layer, a first convolutional layer, a first activation layer, a first pooling layer, a first fully connected layer, a second convolutional layer, a second activation layer, a second pooling layer, a second fully connected layer, a third convolutional layer, a third activation layer, a third pooling layer, a third fully connected layer, a fourth convolutional layer, a fourth activation layer, a fourth pooling layer, a fourth fully connected layer and an output layer.
First convolutional layer: 48 convolution kernels of size 3 × 3.
First activation layer, activation function: ReLU.
First pooling layer: 2 × 2 kernel.
First fully connected layer: dropout rate 0.35.
Second convolutional layer: 48 convolution kernels of size 3 × 3.
Second activation layer, activation function: ReLU.
Second pooling layer: 2 × 2 kernel.
Second fully connected layer: dropout rate 0.35.
Third convolutional layer: 96 convolution kernels of size 3 × 3.
Third activation layer, activation function: ReLU.
Third pooling layer: 2 × 2 kernel.
Third fully connected layer: dropout rate 0.35.
Fourth convolutional layer: 96 convolution kernels of size 3 × 3.
Fourth activation layer, activation function: ReLU.
Fourth pooling layer: 2 × 2 kernel.
Fourth fully connected layer: dropout rate 0.35.
The fourth pooling layer is connected to the output layer.
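To see what four rounds of 3 × 3 convolution and 2 × 2 pooling do to an input matrix, the feature-map size can be traced by hand. The sketch below assumes unpadded ("valid") convolutions with stride 1 and non-overlapping pooling, which the text does not specify, and uses the [105 × 52] light chain matrix from the example above as a single input. How heavy and light chain matrices are combined before entering the network is not stated, so that step is left out.

```python
def trace_shapes(height, width, blocks=4, kernel=3, pool=2):
    """Return the (height, width) of the feature map after each conv + pool block."""
    shapes = [(height, width)]
    for _ in range(blocks):
        # Unpadded convolution with stride 1 (assumption) trims kernel - 1 from each side length.
        height, width = height - kernel + 1, width - kernel + 1
        # Non-overlapping pooling (assumption) divides each side length, rounding down.
        height, width = height // pool, width // pool
        if height < 1 or width < 1:
            raise ValueError("input too small for this many blocks")
        shapes.append((height, width))
    return shapes


print(trace_shapes(105, 52))
# [(105, 52), (51, 25), (24, 11), (11, 4), (4, 1)]
```

Under these assumptions the 52-column feature axis shrinks to a single column after the fourth block, so a sequence matrix much narrower than 52 columns could not pass through all four blocks without padding.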
The input layer is used for data input, that is, for input of the elements of the [L × 52] matrix; the convolutional layers use convolution kernels for feature extraction and feature mapping; the activation layers add a nonlinear mapping; the pooling layers perform downsampling; the fully connected layers refit the features at the tail of the convolutional neural network to reduce the loss of feature information; and the output layer outputs the result. The convolutional neural network is trained on the training set and test set to obtain the trained convolutional neural network.
Preferably, the parameters for convolutional neural network training are set as follows:
NB_EPOCH=20
BATCH_SIZE=100
VERBOSE=1
NB_CLASSES=2
OPTIMIZER=SGD
VALIDATION_SPLIT=0.2.
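As a rough illustration of what these settings imply, the sketch below computes how many samples are held out for validation and how many mini-batch updates the SGD optimizer performs. The sample count of 10,000 is a made-up figure, and the interpretation of VALIDATION_SPLIT as the held-out fraction of the training data is an assumption based on common Keras usage.

```python
import math

NB_EPOCH = 20           # passes over the training data
BATCH_SIZE = 100        # samples per gradient update
VALIDATION_SPLIT = 0.2  # fraction held out for validation (assumed meaning)


def training_schedule(n_samples):
    """Return (n_train, n_val, updates_per_epoch, total_updates) for a given sample count."""
    n_val = int(n_samples * VALIDATION_SPLIT)
    n_train = n_samples - n_val
    updates_per_epoch = math.ceil(n_train / BATCH_SIZE)
    return n_train, n_val, updates_per_epoch, updates_per_epoch * NB_EPOCH


print(training_schedule(10_000))  # hypothetical sample count
# (8000, 2000, 80, 1600)
```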
In another aspect, the present invention provides an application of the above method for predicting antibody heavy chain-light chain pairing probability in predicting antibody heavy chain-light chain pairing probability, the application being carried out as follows:
The positive and negative data sets of antibody heavy chain-light chain pairing are used as the test set. The amino acid sequences are converted, according to the above features, into the 0s and 1s that a computer can recognize and input into the neural network; the convolutional neural network then yields the weight values and other parameters. With the heavy chain and light chain of an antibody of unknown pairing information as input, the probability that the two pair successfully is obtained.
Compared with the prior art, the invention has the following beneficial effects:
The method of the present invention for predicting antibody heavy chain-light chain pairing probability obtains antibody pairing probabilities by machine learning. It is simple to operate, fast, efficient, low in cost and highly repeatable, reaches an accuracy of up to 67.4%, and is of great significance for clinical research, antibody discovery, antibody library research and the like.
Specific embodiments
To further illustrate the technical means adopted by the present invention and their effects, the technical solutions of the invention are further described below with reference to preferred embodiments, but the invention is not limited to the scope of the embodiments.
Embodiment 1
In this embodiment, convolutional neural network training is performed on antibody samples to obtain the final model parameters. The specific procedure is as follows:
(1) The full gene sequence information of unpaired heavy chain and light chain variable regions is obtained first, then the gene pairing information of the antibody samples is obtained, and the two are processed to yield the heavy chain and light chain amino acid sequence pairing information of the antibody samples, which serves as the basic samples;
(2) The basic samples are divided into a first part and a second part, and the amino acid sequence feature of the first part of the basic samples is extracted: each amino acid is defined as a 20-bit binary number, and the 20 amino acids correspond to the following 20-bit binary numbers:
Amino acid  Abbreviation  20-bit binary number
Alanine A 00000000000000000001
Arginine R 00000000000000000010
Asparagine N 00000000000000000100
Aspartic acid D 00000000000000001000
Cysteine C 00000000000000010000
Glutamine Q 00000000000000100000
Glutamic acid E 00000000000001000000
Histidine H 00000000000010000000
Isoleucine I 00000000000100000000
Leucine L 00000000001000000000
Lysine K 00000000010000000000
Methionine M 00000000100000000000
Phenylalanine F 00000001000000000000
Proline P 00000010000000000000
Serine S 00000100000000000000
Threonine T 00001000000000000000
Tryptophan W 00010000000000000000
Tyrosine Y 00100000000000000000
Valine V 01000000000000000000
Glycine G 10000000000000000000
An amino acid sequence is thereby defined as matrix data of size [x × 20], where x is the number of amino acids in the antibody heavy chain or light chain amino acid sequence.
(3) The position-specific scoring matrix (PSSM) feature of the first part of the basic samples is extracted: the PSSM is defined as an [L × 20] matrix as shown below, where L is the length of the antibody heavy chain or light chain amino acid sequence. The PSSM of the basic samples is calculated with the PSI-BLAST tool, and the values of the PSSM are then transformed into the interval 0-1 with the sigmoid function $f(x) = \frac{1}{1 + e^{-x}}$;
$$\mathrm{PSSM} = \begin{bmatrix} E_{1\to 1} & E_{1\to 2} & \cdots & E_{1\to 20} \\ E_{2\to 1} & E_{2\to 2} & \cdots & E_{2\to 20} \\ \vdots & \vdots & \ddots & \vdots \\ E_{L\to 1} & E_{L\to 2} & \cdots & E_{L\to 20} \end{bmatrix}$$
where $E_{i\to j}$ is the log value of the probability that the i-th amino acid of the amino acid sequence mutates to amino acid j during evolution, and j = 1-20 indexes the 20 natural amino acids in alphabetical order.
(4) The antibody CDR3 region length feature of the first part of the basic samples is extracted: each amino acid count of the antibody CDR3 region is defined as a seven-bit binary number, with the following correspondence:
CDR3 region length  Binary number
1 0000000
2 0000001
3 0000010
4 0000011
5 0000100
6 0000101
7 0000110
8 0000111
9 0001000
10 0001001
11 0001010
12 0001011
……
128 1111111
(5) it extracts the amino acid second structure characteristic of first part basis sample: using SCRATCH tool predicted amino acid The secondary structure of each amino acid in sequence, and three kinds of secondary structures are defined as three kinds of different triad numbers, specifically It is as follows:
Structure Binary number
Spiral (H) 001
Chain (E) 010
Ring (C) 100
(6) it extracts the solution accessibility feature of the amino acid of first part basis sample: predicting ammonia using SCRATCH tool The solution accessibility of each amino acid in base acid sequence, and two states are defined as two different dibits, have Body is as follows:
Solution accessibility Binary number
It buries 01
Exposure 10
(7) the amino acid sequence feature extracted, locus specificity scoring matrix feature, antibody CDR3 section length is special The solution accessibility feature of sign, amino acid second structure characteristic and amino acid is summarized, and summary information is defined as one five Ten dibits, and then a kind of amino acid sequence is defined as to the matrix data of one [y × 52], wherein y indicates antibody The number of amino acid in heavy chain or light-chain amino acid sequence finally obtains the data set comprising antibody sample information, as Training set;
(8) Following the method of steps (2)-(7), extract the amino acid sequence feature, position-specific scoring matrix (PSSM) feature, antibody CDR3 region length feature, amino acid secondary structure feature and amino acid solvent accessibility feature of the second part of the basic samples, and aggregate these features to obtain a data set containing the antibody sample information, which serves as the test set;
(9) Obtain the positive data set and the negative data set: antibody sequences known to be paired are selected as the positive data set. For the negative data set, the chains are sorted by read count in descending order; heavy chains ranked in the top 20% by count are combined with light chains ranked in the bottom 20%, and light chains ranked in the top 20% are combined with heavy chains ranked in the bottom 20%.
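One possible reading of the negative-set construction in step (9) is sketched below. It assumes that read counts are available as a mapping from chain identifier to count and that a negative example is any combination of a top-ranked chain of one type with a bottom-ranked chain of the other type. The helper names, the toy identifiers and the rule of keeping at least one chain per group are assumptions made for illustration.

```python
from itertools import product


def split_by_rank(counts, fraction=0.2):
    """Return (top, bottom) chain identifiers after sorting by count, descending."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))  # keep at least one chain (assumption)
    return ranked[:k], ranked[-k:]


def negative_pairs(heavy_counts, light_counts, fraction=0.2):
    """Pair top-ranked heavy with bottom-ranked light chains, and vice versa."""
    top_h, bottom_h = split_by_rank(heavy_counts, fraction)
    top_l, bottom_l = split_by_rank(light_counts, fraction)
    return list(product(top_h, bottom_l)) + list(product(bottom_h, top_l))


# Toy read counts with hypothetical identifiers.
heavy = {"H1": 100, "H2": 50, "H3": 20, "H4": 5, "H5": 1}
light = {"L1": 90, "L2": 40, "L3": 30, "L4": 8, "L5": 2}
negatives = negative_pairs(heavy, light)
```

With five chains of each type, the 20% cut-off selects one chain per group, so the toy example yields two negative pairs.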
(10) Input the training set, test set, positive data set and negative data set into the convolutional neural network and train it. The network structure comprises: an input layer, a first convolutional layer, a first activation layer, a first pooling layer, a first fully connected layer, a second convolutional layer, a second activation layer, a second pooling layer, a second fully connected layer, a third convolutional layer, a third activation layer, a third pooling layer, a third fully connected layer, a fourth convolutional layer, a fourth activation layer, a fourth pooling layer, a fourth fully connected layer and an output layer.
First convolutional layer: 48 convolution kernels of size 3 × 3.
First activation layer activation function: ReLU.
First pooling layer: 2 × 2 kernel.
First fully connected layer: dropout rate of 0.35.
Second convolutional layer: 48 convolution kernels of size 3 × 3.
Second activation layer activation function: ReLU.
Second pooling layer: 2 × 2 kernel.
Second fully connected layer: dropout rate of 0.35.
Third convolutional layer: 96 convolution kernels of size 3 × 3.
Third activation layer activation function: ReLU.
Third pooling layer: 2 × 2 kernel.
Third fully connected layer: dropout rate of 0.35.
Fourth convolutional layer: 96 convolution kernels of size 3 × 3.
Fourth activation layer activation function: ReLU.
Fourth pooling layer: 2 × 2 kernel.
Fourth fully connected layer: dropout rate of 0.35.
The fourth pooling layer is connected to the output layer.
The parameter settings are as follows:
NB_EPOCH=20
BATCH_SIZE=100
VERBOSE=1
NB_CLASSES=2
OPTIMIZER=SGD
VALIDATION_SPLIT=0.2.
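To make the layer stack and settings above concrete, the sketch below records them as plain Python data and traces how the feature-map shape would change through the four blocks. The text does not specify the padding mode, pooling stride or input height, so the trace assumes "same" padding for the convolutions, non-overlapping 2 × 2 pooling and a hypothetical input padded to 128 rows. It is a shape calculation only, not a trainable model.

```python
# Four blocks of convolution -> ReLU -> pooling -> dropout, as listed in the text.
BLOCKS = [
    {"filters": 48, "kernel": 3, "pool": 2, "dropout": 0.35},
    {"filters": 48, "kernel": 3, "pool": 2, "dropout": 0.35},
    {"filters": 96, "kernel": 3, "pool": 2, "dropout": 0.35},
    {"filters": 96, "kernel": 3, "pool": 2, "dropout": 0.35},
]

# Training settings as given in the text.
SETTINGS = {
    "NB_EPOCH": 20,
    "BATCH_SIZE": 100,
    "VERBOSE": 1,
    "NB_CLASSES": 2,
    "OPTIMIZER": "SGD",
    "VALIDATION_SPLIT": 0.2,
}


def trace_shapes(height, width, blocks=BLOCKS):
    """Return the (height, width, channels) shape after the input and each block.

    Assumes 'same' padding, so a convolution keeps height and width unchanged,
    and non-overlapping pooling, so each pooling step floor-divides both by the
    pool size.
    """
    shapes = [(height, width, 1)]
    h, w = height, width
    for block in blocks:
        h, w = h // block["pool"], w // block["pool"]
        shapes.append((h, w, block["filters"]))
    return shapes


# Hypothetical input: a chain padded to 128 residues with 52 features each.
shapes = trace_shapes(128, 52)
```

Under these assumptions the 52-column feature axis shrinks to 3 columns after the fourth pooling layer, which is one reason the padding and stride choices matter in practice.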
(11) Output the pairing information and optimize the objective function of the convolutional neural network training to obtain the weights and biases of the full convolutional neural network, which constitute the final model parameters.
Embodiment 2
This embodiment uses the method of the present invention to predict the pairing probability of the following query sequences. The specific procedure is as follows:
The heavy chain is:
QVHLQESGPELVRPGASVKISCKTSGYVFSSSWMNWVKQRPGQGLKWIGRIYPGNGNTNYNEKFKGKATLTADKSSNTAYMQLSSLTSVDSAVYFCATSSAYWGQGTLLTVSAAKTTPPSVYPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVPSSPRPSETVTCNVAHPASSTKVDKKIVPR;
The light chain is:
DIQMTQTTSSLSASLGDRVTFSCSASQDISNYLNWYQQKPDGTIKLLIYYTSSLRSGVPSRFSGSGSGTDYSLTINNLEPEDIATYFCQQYSRLPFTFGSGTKLEIKRADAAPTVSIFPPSSEQLTSGGASVVCFLNNFYPKDINVKWKIDGSERQNGVLNSWTDQDSKDSTYSMSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNRNEC;
The resulting pairing probability is 83.1%.
The applicant declares that the above embodiments illustrate the method of the present invention for predicting antibody heavy chain and light chain pairing probability and its application, but the present invention is not limited to the above embodiments; that is, the present invention does not have to rely on the above embodiments in order to be implemented. It should be clear to those skilled in the art that any improvement of the present invention, equivalent replacement of the raw materials of the product of the present invention, addition of auxiliary components, selection of specific modes and the like all fall within the scope of protection and disclosure of the present invention.
The preferred embodiments of the present invention have been described in detail above. However, the present invention is not limited to the specific details of the above embodiments. Within the scope of the technical concept of the present invention, various simple variations may be made to the technical solution of the present invention, and these simple variations all fall within the scope of protection of the present invention.
It should further be noted that the specific technical features described in the above embodiments may, where they do not contradict one another, be combined in any suitable manner. To avoid unnecessary repetition, the various possible combinations are not described further.

Claims (10)

1. A method for predicting antibody heavy chain and light chain pairing probability, characterized in that the method predicts the pairing probability of antibody heavy chains and light chains based on a convolutional neural network.
2. the method for prediction heavy chain of antibody light chain pairing probability as described in claim 1, which is characterized in that the method includes Following steps: by the amino acid sequence Feature Conversion of the antibody sample heavy chain of known unpaired message, light chain be digital signal after, it is defeated Enter in convolutional neural networks and be trained, obtain final model parameter, recycles the model parameter to antibody weight to be predicted The pairing probability of chain light chain is predicted.
3. The method for predicting antibody heavy chain and light chain pairing probability according to claim 1 or 2, characterized in that the method of obtaining the final model parameters specifically comprises the following steps:
(1) obtaining the heavy chain and light chain amino acid sequence pairing information of antibody samples as basic samples;
(2) dividing the basic samples into a first part and a second part, extracting the amino acid sequence feature, position-specific scoring matrix (PSSM) feature, antibody CDR3 region length feature, amino acid secondary structure feature and amino acid solvent accessibility feature of the first part of the basic samples, and aggregating these features to obtain a data set containing the antibody sample information as the training set;
(3) extracting the amino acid sequence feature, PSSM feature, antibody CDR3 region length feature, amino acid secondary structure feature and amino acid solvent accessibility feature of the second part of the basic samples, and aggregating these features to obtain a data set containing the antibody sample information as the test set;
(4) obtaining a positive data set and a negative data set;
(5) inputting the training set into the convolutional neural network and training the convolutional neural network;
(6) outputting the pairing information and optimizing the objective function of the convolutional neural network training to obtain the weights and biases of the full convolutional neural network, thereby obtaining the final model parameters.
4. The method for predicting antibody heavy chain and light chain pairing probability according to claim 3, characterized in that the basic samples are obtained as follows: first obtaining the complete gene sequence information of the unpaired heavy chain and light chain variable regions, then obtaining the gene pairing information of the antibody samples, and processing the above information to obtain the heavy chain and light chain amino acid sequence pairing information of the antibody samples;
Preferably, the amino acid sequence feature is extracted as follows:
Each amino acid is defined as a 20-bit binary number, so that an amino acid sequence is defined as an [x × 20] matrix, where x is the number of amino acids in the antibody heavy chain or light chain amino acid sequence.
5. The method for predicting antibody heavy chain and light chain pairing probability according to claim 3, characterized in that the position-specific scoring matrix (PSSM) feature is extracted as follows:
The PSSM is defined as an [L × 20] matrix with entries $E_{i\to j}$ (i = 1, ..., L; j = 1, ..., 20), where L is the length of the antibody heavy chain or light chain amino acid sequence; the PSSM of the basic samples is calculated with the PSI-BLAST tool; the values in the rows and columns of the resulting PSSM are then transformed into the interval 0-1 using the sigmoid function $f(x) = \frac{1}{1 + e^{-x}}$;
Here, $E_{i\to j}$ denotes the log-probability score that the i-th amino acid of the amino acid sequence mutates to amino acid j during evolution, and j = 1-20 indexes the 20 natural amino acids in alphabetical order;
Preferably, the antibody CDR3 region length feature is extracted as follows: the number of amino acids in the antibody CDR3 region is defined as one of a set of different seven-bit binary numbers, thereby extracting the antibody CDR3 region length feature;
Preferably, the amino acid secondary structure feature is extracted as follows: the SCRATCH tool is used to predict the secondary structure of each amino acid in the amino acid sequence, and the three secondary structure types are defined as three different three-bit binary numbers, thereby extracting the amino acid secondary structure feature.
6. The method for predicting antibody heavy chain and light chain pairing probability according to claim 3, characterized in that the amino acid solvent accessibility feature is extracted as follows:
The SCRATCH tool is used to predict the solvent accessibility of each amino acid in the amino acid sequence, and the two states are defined as two different two-bit binary numbers, thereby extracting the amino acid solvent accessibility feature;
Preferably, the above features are aggregated as follows:
The extracted amino acid sequence feature, position-specific scoring matrix (PSSM) feature, antibody CDR3 region length feature, amino acid secondary structure feature and amino acid solvent accessibility feature are aggregated; the aggregated information is defined as a 52-bit binary number, so that an amino acid sequence is defined as a [y × 52] matrix, where y is the number of amino acids in the antibody heavy chain or light chain amino acid sequence.
7. The method for predicting antibody heavy chain and light chain pairing probability according to claim 3, characterized in that the positive data set and the negative data set are obtained as follows: antibody sequences known to be paired are selected as the positive data set; the chains are sorted by read count in descending order, and heavy chains ranked in the top 20% by count combined with light chains ranked in the bottom 20%, together with light chains ranked in the top 20% combined with heavy chains ranked in the bottom 20%, are selected as the negative data set.
8. The method for predicting antibody heavy chain and light chain pairing probability according to claim 3, characterized in that, during convolutional neural network training, the structure of the convolutional neural network comprises: an input layer, a first convolutional layer, a first activation layer, a first pooling layer, a first fully connected layer, a second convolutional layer, a second activation layer, a second pooling layer, a second fully connected layer, a third convolutional layer, a third activation layer, a third pooling layer, a third fully connected layer, a fourth convolutional layer, a fourth activation layer, a fourth pooling layer, a fourth fully connected layer and an output layer.
9. The method for predicting antibody heavy chain and light chain pairing probability according to claim 3, characterized in that the parameters for convolutional neural network training are set as follows:
NB_EPOCH=20
BATCH_SIZE=100
VERBOSE=1
NB_CLASSES=2
OPTIMIZER=SGD
VALIDATION_SPLIT=0.2.
10. Use of the method for predicting antibody heavy chain and light chain pairing probability according to any one of claims 1-9 in predicting antibody heavy chain and light chain pairing probability.
CN201910730394.9A 2019-08-08 2019-08-08 Method for predicting antibody heavy chain and light chain pairing probability and application thereof Active CN110428870B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910730394.9A CN110428870B (en) 2019-08-08 2019-08-08 Method for predicting antibody heavy chain and light chain pairing probability and application thereof

Publications (2)

Publication Number Publication Date
CN110428870A true CN110428870A (en) 2019-11-08
CN110428870B CN110428870B (en) 2023-03-21

Family

ID=68413287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910730394.9A Active CN110428870B (en) 2019-08-08 2019-08-08 Method for predicting antibody heavy chain and light chain pairing probability and application thereof

Country Status (1)

Country Link
CN (1) CN110428870B (en)

Also Published As

Publication number Publication date
CN110428870B (en) 2023-03-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant