CN110428870B

CN110428870B - Method for predicting antibody heavy chain and light chain pairing probability and application thereof

Info

Publication number: CN110428870B
Application number: CN201910730394.9A
Authority: CN
Inventors: 吴婷婷; 侯强波; 蔡晓辉; 杨平
Original assignee: Synbio Technologies
Current assignee: Synbio Technologies
Priority date: 2019-08-08
Filing date: 2019-08-08
Publication date: 2023-03-21
Anticipated expiration: 2039-08-08
Also published as: CN110428870A

Abstract

The invention relates to a method for predicting the pairing probability of an antibody heavy chain and light chain based on a convolutional neural network. Specifically, the amino acid sequence features of the heavy and light chains of antibody samples with known pairing information are converted into digital signals and input into the convolutional neural network for training to obtain the final model parameters, which are then used to predict the pairing probability of the heavy chain and light chain of an antibody to be predicted. The method obtains antibody pairing probabilities through machine learning; it is simple to operate, rapid, efficient, low in cost, highly repeatable and accurate, with an accuracy of up to 67.4%. It can pair the heavy and light chains of large numbers of antibodies and is of significance for clinical research, antibody discovery, antibody library research and the like.

Description

Method for predicting antibody heavy chain and light chain pairing probability and application thereof

Technical Field

The invention belongs to the technical field of biology, and particularly relates to a method for predicting the pairing probability of heavy chains and light chains of an antibody and application thereof.

Background

An antibody is an immunoglobulin capable of specifically binding an antigen. It consists of two identical heavy chains (H chains) and two identical light chains (L chains); the heavy chains are linked to each other by disulfide bonds, and each heavy chain is linked to a light chain by disulfide bonds, forming a symmetrical molecule of paired light and heavy chains. The heavy chain is divided into a variable region (V region), a constant region (C region), a transmembrane region and a cytoplasmic region; the light chain has only V and C regions. In the variable region of an antibody, the heavy chain variable region (VH) and the light chain variable region (VL) each contain three complementarity determining regions (CDRs), namely CDR1, CDR2 and CDR3. The amino acid/gene composition and arrangement of the CDRs are highly diverse, reaching 10⁹-10¹² variants within the same individual and constituting a vast B cell antigen receptor (BCR) repertoire. In addition, the antibody heavy chain variable region is encoded by the V, D and J gene clusters, whereas the antibody light chain variable region is encoded by the V and J gene clusters. The heavy and light chains are the products of two separate mRNAs and assemble into a full-length immunoglobulin molecule in the endoplasmic reticulum of the B cell. Therefore, the study of the pairing of natural antibodies (VH-VL pairing) is of great importance for the correct folding, stability and expression of antibodies, for antibody-antigen binding, and the like.

The framework regions (FR) of antibodies are highly conserved, and diversity arises mainly from changes in the CDRs, with the CDR3 region of the heavy chain being most susceptible to mutation. The diversity of antibodies upon stimulation by antigens comes primarily from two sources. First, the germline genes encoding the B cell receptor undergo random rearrangement to adapt to antigenic structures; second, when a pathogen triggers an immune response, the antibody V region undergoes proliferation, death and mutation, and during the acute phase of the immune response the mutation rate is as high as 1/10³ bp.

In humans and mice, heavy and light chain pairing is crucial for the folding, stability and antigen binding of natural antibodies. Furthermore, information about the heavy/light chain dimer is necessary to mimic the exact three-dimensional conformation of the antibody variable region and antigen binding domain, which is necessary for reasonable antibody engineering. Thus, the natural pairing of VH/VL antibodies provides important information for our understanding of antibody biology and design.

At present, immune repertoire sequencing, monoclonal antibody sequencing and immunoglobulin single-cell sequencing are important experimental means for studying antibody heavy and light chains. Immune repertoire sequencing can usually obtain only the sequences and relative abundances of the heavy and light chains, and cannot obtain their pairing information. Compared with sequencing the amino acid sequence of a monoclonal antibody, monoclonal antibody gene sequencing based on PCR amplification provides more accurate and reliable results and can better distinguish amino acids, such as leucine and isoleucine, that are difficult to distinguish by mass spectrometry. However, its sequencing efficiency is low, its single-cell throughput is low (<200-500 cells), and it requires complicated experimental procedures and consumes a great deal of time and material. Even with many experiments only a relatively small number of VH-VL pairs, for example 10⁴-10⁵, can be obtained, which is far from the research requirements of data sets of millions of pairs or more (e.g., 10⁹-10¹²) and limits clinical research, antibody discovery, antibody library research and the like. Immunoglobulin single-cell sequencing can produce paired VH/VL at high throughput, but it is expensive, complex to operate and poorly reproducible.

Therefore, it is necessary to develop a simple, fast, efficient, low-cost, and high-accuracy method for matching the heavy chain and the light chain of an antibody, which is of great significance for clinical research, antibody discovery, antibody library, and the like.

Disclosure of Invention

In view of the deficiencies of the prior art, the present invention aims to provide a method for predicting the pairing probability of heavy chains and light chains of antibodies and application thereof.

In order to achieve the purpose, the invention adopts the following technical scheme:

in one aspect, the invention provides a method for predicting the pairing probability of an antibody heavy chain and a light chain, wherein the method predicts the pairing probability of the antibody heavy chain and the light chain based on a convolutional neural network.

Preferably, the method comprises the steps of: converting the amino acid sequence features of the heavy chain and the light chain of antibody samples with known pairing information into digital signals, inputting the digital signals into a convolutional neural network for training to obtain the final model parameters, and using the model parameters to predict the pairing probability of the heavy chain and the light chain of the antibody to be predicted.

The method can obtain the pairing information probability of the antibody through machine learning, is simple and convenient to operate, rapid, efficient, low in cost, high in repeatability and high in accuracy, can realize the pairing of heavy chains and light chains of a large number of antibodies, and has important significance for clinical research, antibody discovery, antibody libraries and other researches.

Preferably, the method for obtaining the final model parameters comprises the following steps:

(1) Acquiring the pairing information of the heavy chain amino acid sequence and the light chain amino acid sequence of the antibody sample as a basic sample;

(2) Dividing a basic sample into a first part and a second part, extracting the amino acid sequence characteristics, the site specificity scoring matrix characteristics, the length characteristics of a CDR3 region of an antibody, the secondary structure characteristics of amino acids and the solution accessibility characteristics of the amino acids of the basic sample of the first part, and summarizing the information of the characteristics to obtain a data set containing the information of the antibody sample as a training set;

(3) Extracting the amino acid sequence characteristics, site-specific scoring matrix characteristics, antibody CDR3 region length characteristics, amino acid secondary structure characteristics and amino acid solution accessibility characteristics of the second part of basic samples, and summarizing the characteristics to obtain a data set containing antibody sample information as a test set;

(4) Acquiring a positive data set and a negative data set;

(5) Inputting the training set, the test set, the positive data set and the negative data set into a convolutional neural network for convolutional neural network training;

(6) Outputting the pairing information and optimizing the objective function of the convolutional neural network training to obtain the weights and biases of the whole convolutional neural network, i.e. the final model parameters.
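Step (2) divides the basic sample into two parts without stating the proportion. A minimal sketch of such a split is given below; the 80/20 ratio, the fixed random seed and the function name `split_base_sample` are illustrative assumptions, not values taken from the invention.

```python
import random


def split_base_sample(pairs, first_fraction=0.8, seed=0):
    """Shuffle paired records and split them into a first and a second part.

    first_fraction and seed are hypothetical choices; the text does not
    specify how the basic sample is divided.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]


# Example: ten placeholder (heavy, light) records.
records = [("H%d" % i, "L%d" % i) for i in range(10)]
first_part, second_part = split_base_sample(records)
```

The first part would then feed the training-set feature extraction and the second part the test-set feature extraction.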

In the present invention, the basic sample is obtained by: acquiring the whole gene sequence information of the unpaired heavy chain and light chain variable regions, acquiring the antibody sample gene pairing information, and processing the information to obtain the heavy chain and light chain amino acid sequence pairing information of the antibody sample.

The whole gene sequence data of the unpaired heavy chain and light chain variable regions are downloaded from the NCBI Short Read Archive (SRA) database under SRA accession number SRP047462. The VH:VL_analysis information of Donors 1-3 is obtained; each donor has 2 samples, and the data of the 6 samples are downloaded separately and collectively referred to as "data 1".

The antibody sample gene pairing data are downloaded from the Nature website and comprise the V and J germline gene information and CDR3 region sequence information of the antibody light chain variable region, and the V, D and J germline gene information and CDR3 region sequence information of the antibody heavy chain variable region. In these data the light and heavy chains are already paired, but the whole gene sequences of the heavy chain and light chain variable regions are not included. Six data files are downloaded, Supplementary Data Set 1-6, collectively referred to as "data 2".

The "data 1" obtained above are antibody sequences from immune repertoire sequencing, whereas "data 2" contain the V, J and D germline gene information and the gene sequence information of the CDR3 region. The two data sets therefore need to be preprocessed and matched to each other in order to establish, for each antibody heavy chain variable region in "data 1", the light chain with which it pairs, i.e. the sample information. The specific operation is as follows:

(1) Sample information extraction: germline gene information and CDR3 region sequence information of the antibody samples in data 1 are extracted with the IgBLAST software. For heavy chain sequences, IgBLAST yields the V, D and J germline gene information and the CDR3 region gene sequence; for light chain sequences, it yields the V and J germline gene information and the CDR3 region gene sequence.

(2) Heavy chain and light chain sequence matching: finding the corresponding antibody sequence number in the data 1 according to the sequence information of the germline gene and the CDR3 in the data 2; and converting the DNA sequences of the paired heavy chain and light chain in the data 1 into amino acid sequences to obtain the pairing information of the amino acid sequences of the heavy chain and the light chain of the antibody sample, namely the sample information.
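The matching and translation step can be sketched as follows. This is only an illustration under stated assumptions: the record layout (dictionaries with `v`, `j`, `cdr3` and `dna` keys), the function names, and the use of the (V gene, J gene, CDR3 sequence) triple as the lookup key are hypothetical, and the heavy-chain D gene is omitted from the key for brevity. Translation uses the standard genetic code.

```python
# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame DNA sequence; a trailing partial codon is dropped.

    Codons containing ambiguous bases are rendered as 'X'.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[p:p + 3], "X") for p in range(0, len(dna) - 2, 3)
    )


def match_pairs(data1_heavy, data1_light, data2_pairs):
    """Look up each 'data 2' pair in 'data 1' and return amino acid pairs.

    data1_heavy / data1_light: lists of dicts with keys v, j, cdr3, dna
    (hypothetical layout of the IgBLAST-annotated reads).
    data2_pairs: list of (heavy_key, light_key), each key = (v, j, cdr3).
    """
    heavy_index = {(r["v"], r["j"], r["cdr3"]): r for r in data1_heavy}
    light_index = {(r["v"], r["j"], r["cdr3"]): r for r in data1_light}
    paired = []
    for heavy_key, light_key in data2_pairs:
        heavy = heavy_index.get(heavy_key)
        light = light_index.get(light_key)
        if heavy is not None and light is not None:
            paired.append((translate(heavy["dna"]), translate(light["dna"])))
    return paired
```

Pairs whose heavy or light chain cannot be found in "data 1" are simply skipped in this sketch.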

Preferably, the amino acid sequence features are extracted in a manner of:

an amino acid is defined as a binary number of twenty digits, and an amino acid sequence is defined as a matrix of [ x × 20], where x represents the number of amino acids in the antibody heavy or light chain amino acid sequence.

Here, one kind of amino acid is represented by a binary number of twenty bits, and then 20 kinds of amino acids correspond to binary numbers of twenty bits different from each other, for example, the correspondence relationship shown in the following table may be used:

Amino acid	Abbreviation	Twenty-digit binary number
Alanine	A	00000000000000000001
Arginine	R	00000000000000000010
Asparagine	N	00000000000000000100
Aspartic acid	D	00000000000000001000
Cysteine	C	00000000000000010000
Glutamine	Q	00000000000000100000
Glutamic acid	E	00000000000001000000
Histidine	H	00000000000010000000
Isoleucine	I	00000000000100000000
Leucine	L	00000000001000000000
Lysine	K	00000000010000000000
Methionine	M	00000000100000000000
Phenylalanine	F	00000001000000000000
Proline	P	00000010000000000000
Serine	S	00000100000000000000
Threonine	T	00001000000000000000
Tryptophan	W	00010000000000000000
Tyrosine	Y	00100000000000000000
Valine	V	01000000000000000000
Glycine	G	10000000000000000000

It should be noted that only one correspondence is listed in the above table, and other correspondences between 20 amino acids and the twenty-digit binary number are also within the scope of the present invention.
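The encoding in the table is a one-hot scheme: each amino acid sets exactly one of twenty positions. A minimal sketch that reproduces the table's correspondence is shown below; the function names are illustrative, and plain Python lists stand in for whatever array type an implementation would use.

```python
# Amino acids in the order of the table: A sets the rightmost digit, G the leftmost.
AA_ORDER = "ARNDCQEHILKMFPSTWYVG"


def one_hot(residue):
    """Return the 20-element 0/1 vector for one amino acid, read left to right."""
    k = AA_ORDER.index(residue)  # raises ValueError for non-standard residues
    vector = [0] * 20
    vector[19 - k] = 1
    return vector


def encode_sequence(sequence):
    """Encode a sequence of x amino acids as an [x × 20] list of lists."""
    return [one_hot(residue) for residue in sequence]


matrix = encode_sequence("QPASVSGSPGQS")  # 12 residues give a 12 x 20 matrix
```

Any other one-to-one assignment of residues to positions would serve equally well, as long as the same assignment is used for training and prediction.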

In the invention, the extraction mode of the characteristics of the site-specific scoring matrix (PSSM matrix) is as follows:

The site-specific scoring matrix is defined as an [L × 20] matrix, as shown in the following formula:

$$
\mathrm{PSSM} =
\begin{bmatrix}
E_{1 \to 1} & E_{1 \to 2} & \cdots & E_{1 \to 20} \\
E_{2 \to 1} & E_{2 \to 2} & \cdots & E_{2 \to 20} \\
\vdots & \vdots & \ddots & \vdots \\
E_{L \to 1} & E_{L \to 2} & \cdots & E_{L \to 20}
\end{bmatrix}
$$

where L represents the length of the antibody heavy or light chain amino acid sequence, i.e. the number of amino acids. The site-specific scoring matrix of the basic sample is calculated using the PSI-BLAST tool; each value in the rows and columns of the PSSM matrix is then transformed by the sigmoid function

$$
f(x) = \frac{1}{1 + e^{-x}}
$$

to the 0-1 interval;

where $E_{i \to j}$ represents the log of the probability that the i-th amino acid of the amino acid sequence is mutated to amino acid j during evolution, and j = 1, ..., 20 indexes the 20 natural amino acids arranged alphabetically.

The PSSM matrix is a [ L × 20] matrix, where L represents the length of the antibody heavy or light chain amino acid sequence, i.e. the number of amino acids, and each amino acid residue has 20 changes, corresponding to 20 amino acids, so that 20 messages in each row of the PSSM matrix represent a measure of the frequency with which the residue at this position of the amino acid sequence is mutated to the corresponding residue. The PSSM matrix gives the conservation of amino acid at each position, and the conservation is expressed by the log value of the occurrence frequency of certain amino acid at the current position, and the PSSM matrix can be used for extracting the evolution information of the antibody sequence.

The extraction method of the characteristics of the site-specific scoring matrix (PSSM matrix) specifically comprises the following steps:

(1) Using the PSI-BLAST tool, the heavy chain and light chain sequences of the antibody samples are searched against germline gene information obtained from the IMGT database, with the E value (expectation value) set to 0.001 and 3 search iterations. This yields a set of sequences homologous to the query sequence. The resulting sequences, together with the query sequence, are then subjected to multiple sequence alignment, from which the PSSM matrix of the target antibody is obtained. In this way the evolutionary behaviour of each residue in the query sequence is known.

(2) The sigmoid function

$$
f(x) = \frac{1}{1 + e^{-x}}
$$

is used to transform the PSSM values to the 0-1 interval.
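The sigmoid rescaling of step (2) can be sketched as below. The sketch assumes the PSSM has already been parsed into a list of L rows of 20 numeric scores; the parsing of PSI-BLAST output is not shown, and the function names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x)), mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def normalize_pssm(pssm):
    """Apply the sigmoid element-wise to an [L × 20] PSSM given as nested lists."""
    return [[sigmoid(score) for score in row] for row in pssm]


# Two made-up rows of scores, purely for illustration.
toy_pssm = [[-3, 0, 5] + [0] * 17, [2, -1, 0] + [0] * 17]
scaled = normalize_pssm(toy_pssm)
```

A score of 0 maps to 0.5, strongly negative scores approach 0, and strongly positive scores approach 1.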

Preferably, the length characteristics of the CDR3 region of the antibody are extracted in the following manner: the length of the antibody CDR3 region was characterized by defining the number of amino acids in the antibody CDR3 region as a different seven-digit binary number.

Here, the seven-digit binary number represents a length of the CDR3 region, i.e., the number of amino acids, and different length of the CDR3 region correspond to seven-digit binary numbers different from each other, for example, the correspondence relationship shown in the following table can be used:

Length of CDR3 region	Binary number
1	0000000
2	0000001
3	0000010
4	0000011
5	0000100
6	0000101
7	0000110
8	0000111
9	0001000
10	0001001
11	0001010
12	0001011
...	...
128	1111111

It should be noted that, only one correspondence relationship is listed in the above table, and other correspondence relationships between the length characteristic of the CDR3 region and the seven-bit binary number are also within the scope of the present invention.
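In the table, a CDR3 region of n amino acids is written as the seven-digit binary form of n - 1, so lengths 1 to 128 are covered. A minimal sketch of that correspondence follows; the function name is illustrative.

```python
def encode_cdr3_length(n_residues):
    """Return the 7-element 0/1 vector for a CDR3 length between 1 and 128.

    Follows the table above: length n is written as the binary form of n - 1.
    """
    if not 1 <= n_residues <= 128:
        raise ValueError("CDR3 length must be between 1 and 128")
    return [int(bit) for bit in format(n_residues - 1, "07b")]


bits = encode_cdr3_length(12)  # corresponds to the table row 12 -> 0001011
```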

Preferably, the secondary structural features of the amino acids are extracted in a manner that: the secondary structure of each amino acid in the amino acid sequence was predicted using the SCRATCH tool, and three secondary structures were defined as three different three-digit binary numbers, thereby extracting the amino acid secondary structure characteristics.

The secondary structure formed by the arrangement of adjacent residues of the antibody sequence affects the interactions between residue pairs and thus the tertiary structure of the antibody. Here the SCRATCH tool is used to predict the secondary structure of antibody sequences (three types of structure: helix, strand and coil), and the secondary structure at each position is represented by a three-digit binary number, for example using the correspondence shown in the following table (the same correspondence is used in Example 1):

Secondary structure	Binary number
Helix (H)	001
Strand (E)	010
Coil (C)	100

It should be noted that only one correspondence is listed in the above table, and other correspondences between the secondary structure features of the amino acids and three-digit binary numbers are also within the scope of the present invention.

In the present invention, the solution accessibility feature of the amino acid is extracted in the following manner:

The SCRATCH tool is used to predict the solution accessibility (i.e. solvent accessibility) of each amino acid in the amino acid sequence, and the two states are defined as two different two-digit binary numbers, thereby extracting the solution accessibility characteristics of the amino acids.

Antibody folding takes place inside the cell. As the antibody folds in the aqueous cellular environment, its hydrophobic residues gradually move towards the center of the antibody's conformational space, while hydrophilic residues form the protein surface. Extracting the relative solution accessibility therefore indicates the approximate position of each amino acid residue of the antibody sequence within the conformational space. Here the SCRATCH tool is used to predict the solution accessibility of the amino acid sites of the antibody, and the solution accessibility at each position is represented by a two-digit binary number indicating whether the site is buried or exposed, for example using the correspondence shown in the following table:

Solution accessibility	Binary number
Buried	01
Exposed	10

It should be noted that only one correspondence is listed in the above table, and other correspondences between the solution accessibility characteristics of the amino acids and two-digit binary numbers are also within the scope of the invention.
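The two per-residue categorical features can be encoded together as sketched below. The three-digit codes are those listed in Example 1 (helix 001, strand 010, coil 100) and the two-digit codes are those in the table above. The input format is an assumption: one letter per residue, `H`/`E`/`C` for secondary structure and the hypothetical labels `b` (buried) and `e` (exposed) for accessibility; the actual SCRATCH output would need to be mapped onto these labels first.

```python
# Codes from the document's tables.
SS_CODE = {"H": [0, 0, 1], "E": [0, 1, 0], "C": [1, 0, 0]}
# 'b' and 'e' are hypothetical labels chosen for this sketch.
ACC_CODE = {"b": [0, 1], "e": [1, 0]}


def encode_structure(ss_string, acc_string):
    """Return one 5-element vector (3 structure digits + 2 accessibility digits)
    per residue, given two equally long per-residue label strings."""
    if len(ss_string) != len(acc_string):
        raise ValueError("label strings must have the same length")
    return [SS_CODE[s] + ACC_CODE[a] for s, a in zip(ss_string, acc_string)]


rows = encode_structure("HEC", "bee")
```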

Preferably, the method for summarizing the information of the features comprises the following steps:

summarizing the extracted amino acid sequence characteristics, site-specific scoring matrix characteristics, antibody CDR3 region length characteristics, amino acid secondary structure characteristics and amino acid solution accessibility characteristics, defining the summarized information as a fifty-two-digit binary number, and further defining an amino acid sequence as [ y × 52] matrix data, wherein y represents the number of amino acids in the antibody heavy chain or light chain amino acid sequence.

In the matrix data of [ y × 52] above, each amino acid contains the following characteristic information:

(1) PSSM matrix characterization: a 20-dimensional vector;

(2) Amino acid sequence characteristics: a 20-dimensional vector;

(3) Length characteristics of the CDR3 region: a 7-dimensional vector;

(4) Secondary structure characteristics: a 3-dimensional vector;

(5) Solution accessibility characteristics: a 2-dimensional vector.
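How the five feature groups combine into one 52-element row per residue can be sketched as follows, using the 105-residue light chain quoted in this section. Several inputs are placeholders and should not be read as real data: the PSSM scores are all zero, every residue is labelled coil and exposed, and the CDR3 length of 10 is an assumed value. The column order follows the list above, and repeating the CDR3 length vector in every row is one reading of that list.

```python
import math

AA_ORDER = "ARNDCQEHILKMFPSTWYVG"
SS_CODE = {"H": [0, 0, 1], "E": [0, 1, 0], "C": [1, 0, 0]}
ACC_CODE = {"b": [0, 1], "e": [1, 0]}  # hypothetical labels: buried / exposed


def build_features(seq, pssm, cdr3_len, ss, acc):
    """Return a [y × 52] list of rows: PSSM (20) + one-hot (20) + CDR3 length (7)
    + secondary structure (3) + solution accessibility (2)."""
    length_bits = [int(b) for b in format(cdr3_len - 1, "07b")]
    rows = []
    for i, residue in enumerate(seq):
        one_hot = [0] * 20
        one_hot[19 - AA_ORDER.index(residue)] = 1
        scaled = [1.0 / (1.0 + math.exp(-v)) for v in pssm[i]]  # sigmoid
        rows.append(scaled + one_hot + length_bits + SS_CODE[ss[i]] + ACC_CODE[acc[i]])
    return rows


SEQ = ("QPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLLIYDVTNRPSGVSNRFS"
       "GSKSGNTASLTISGLQADDEADYYCSSHTRSGTVVFGGGTKLTVL")
y = len(SEQ)
features = build_features(
    SEQ,
    [[0.0] * 20 for _ in range(y)],  # placeholder PSSM scores
    10,                              # placeholder CDR3 length
    "C" * y,                         # placeholder secondary structure
    "e" * y,                         # placeholder accessibility
)
```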

Finally, all the antibody sample heavy chain and light chain sequences in the basic sample are subjected to characteristic extraction in the steps to obtain a data set containing the antibody sample heavy chain and light chain information. For example: an antibody light chain sequence comprises 105 amino acids and has the sequence:

QPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLLIYDVTNRPSGVSNRFSGSKSGNTASLTISGLQADDEADYYCSSHTRSGTVVFGGGTKLTVL, which, after the series of feature extraction steps described above, finally yields a [105 × 52] matrix. For brevity, the original shows the correspondence between only the first 12 amino acids and their vectors in the matrix; that table is not reproduced in this text.

In the present invention, the positive data set and the negative data set are acquired as follows: antibody sequences with known pairing are selected as the positive data set; for the negative data set, the chains are ranked by read count from largest to smallest, and heavy chains ranked in the top 20% together with light chains ranked in the bottom 20%, and light chains ranked in the top 20% together with heavy chains ranked in the bottom 20%, are selected as the negative data set.

The invention uses Keras, an open-source deep learning framework for the Python language, with a personal computer or a high-performance computer as hardware, to carry out the convolutional neural network training. In a convolutional neural network, the neurons of a convolutional layer are connected only to some of the neuron nodes of the previous layer, i.e. the connections between neurons are not fully connected, and the connection weights w and biases b of certain neurons in the same layer are shared (i.e. identical), which greatly reduces the number of parameters that need to be trained.

Preferably, the structure of the convolutional neural network used for training comprises: an input layer, a first convolutional layer, a first excitation layer, a first pooling layer, a first fully connected layer, a second convolutional layer, a second excitation layer, a second pooling layer, a second fully connected layer, a third convolutional layer, a third excitation layer, a third pooling layer, a third fully connected layer, a fourth convolutional layer, a fourth excitation layer, a fourth pooling layer, a fourth fully connected layer and an output layer.
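The construction of the negative data set from read counts can be sketched as below. The text does not say how the selected chains are combined, so forming every cross combination of the selected heavy and light chains is an assumption of this sketch, as are the function names, the `(sequence, read_count)` input layout and the removal of any combination that is a known positive pair.

```python
from itertools import product


def top_and_bottom(chains, fraction=0.2):
    """chains: list of (sequence, read_count). Returns (top, bottom) sequences
    after ranking by read count from largest to smallest."""
    ranked = sorted(chains, key=lambda chain: chain[1], reverse=True)
    n = max(1, int(len(ranked) * fraction))
    return [s for s, _ in ranked[:n]], [s for s, _ in ranked[-n:]]


def negative_pairs(heavy, light, positives=frozenset(), fraction=0.2):
    """Combine top-ranked heavy with bottom-ranked light chains and vice versa.

    Every returned pair is ordered (heavy, light); known positives are dropped.
    """
    heavy_top, heavy_bottom = top_and_bottom(heavy, fraction)
    light_top, light_bottom = top_and_bottom(light, fraction)
    candidates = list(product(heavy_top, light_bottom))
    candidates += list(product(heavy_bottom, light_top))
    return [pair for pair in candidates if pair not in positives]


heavy = [("H%d" % i, 100 - i) for i in range(10)]  # toy read counts
light = [("L%d" % i, 100 - i) for i in range(10)]
negatives = negative_pairs(heavy, light)
```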

First convolutional layer: 48 convolution kernels of size 3 × 3.

First excitation layer activation function: ReLU.

First pooling layer: 2 × 2 kernel.

First fully connected layer: dropout ratio of 0.35.

Second convolutional layer: 48 convolution kernels of size 3 × 3.

Second excitation layer activation function: ReLU.

Second pooling layer: 2 × 2 kernel.

Second fully connected layer: dropout ratio of 0.35.

Third convolutional layer: 96 convolution kernels of size 3 × 3.

Third excitation layer activation function: ReLU.

Third pooling layer: 2 × 2 kernel.

Third fully connected layer: dropout ratio of 0.35.

Fourth convolutional layer: 96 convolution kernels of size 3 × 3.

Fourth excitation layer activation function: ReLU.

Fourth pooling layer: 2 × 2 kernel.

Fourth fully connected layer: dropout ratio of 0.35.

The fourth pooling layer is connected to the output layer.

The input layer is used for the input of data, i.e. of the [L × 52] matrix elements; the convolutional layers use convolution kernels to carry out feature extraction and feature mapping; the excitation layers add nonlinear mapping; the pooling layers perform down-sampling; the fully connected layers re-fit the features at the tail of the convolutional neural network, reducing the loss of feature information; and the output layer outputs the results. The network is trained on the training set, using the test set for evaluation, to obtain the trained convolutional neural network.
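To make the listed architecture and the effect of weight sharing concrete, the sketch below traces feature-map shapes and trainable-parameter counts through the four blocks in plain Python. Several points are assumptions rather than statements from the text: the input is taken as a single-channel [128 × 52] matrix (128 is a hypothetical padded sequence length), convolutions are assumed to preserve height and width, and each per-block "fully connected layer" is treated as a dropout step that does not change the shape.

```python
# (filters, kernel size, pool size, dropout ratio) per block, values from the text.
BLOCKS = [
    (48, 3, 2, 0.35),
    (48, 3, 2, 0.35),
    (96, 3, 2, 0.35),
    (96, 3, 2, 0.35),
]


def walk(height, width, channels=1):
    """Return [((h, w, c), conv_params), ...] after each block.

    Assumes shape-preserving convolutions followed by non-overlapping pooling.
    """
    report = []
    for filters, kernel, pool, _dropout in BLOCKS:
        # Each filter shares one kernel across all positions, plus one bias.
        params = filters * (kernel * kernel * channels + 1)
        height, width, channels = height // pool, width // pool, filters
        report.append(((height, width, channels), params))
    return report


for shape, params in walk(128, 52):
    print(shape, params)
```

Under these assumptions the first convolutional layer needs only 480 trainable parameters regardless of the input size, which is the parameter saving that weight sharing provides.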

Preferably, the parameters for performing the convolutional neural network training are set as follows:

NB_EPOCH = 20

BATCH_SIZE = 100

VERBOSE = 1

NB_CLASSES = 2

OPTIMIZER = SGD

VALIDATION_SPLIT = 0.2

In another aspect, the present invention provides an application of the above method for predicting the pairing probability of the heavy chain and the light chain of an antibody, wherein the application comprises the following steps:

The positive and negative data sets of paired antibody heavy and light chains are taken as the test set; the amino acid sequences are converted, according to the features described above, into 0s and 1s that a computer can process and are input into the neural network; the weights and other parameters are obtained through the convolutional neural network; and the heavy and light chains of antibodies with unknown pairing information are then taken as input to obtain the probability that the heavy chain and the light chain pair successfully or fail to pair.
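What these settings imply for a training run can be worked out with a few lines of arithmetic. In the sketch below the constants are those listed above; the sample count of 1003 and the function name are made up for illustration, and the way the validation fraction is rounded is one simple convention, so a given framework may differ by a sample.

```python
import math

NB_EPOCH = 20           # passes over the training data
BATCH_SIZE = 100        # samples per gradient update
VERBOSE = 1             # progress output level
NB_CLASSES = 2          # paired / not paired
OPTIMIZER = "sgd"       # stochastic gradient descent
VALIDATION_SPLIT = 0.2  # fraction of samples held out for validation


def training_plan(n_samples):
    """Return how many samples are trained on and how many updates are made."""
    n_val = int(n_samples * VALIDATION_SPLIT)
    n_train = n_samples - n_val
    steps_per_epoch = math.ceil(n_train / BATCH_SIZE)
    return {
        "n_train": n_train,
        "n_val": n_val,
        "steps_per_epoch": steps_per_epoch,
        "total_updates": steps_per_epoch * NB_EPOCH,
    }


plan = training_plan(1003)  # hypothetical number of labelled pairs
```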
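With two output classes, one common way to turn the network's two raw output scores into a pairing probability is a softmax. The text does not state which output activation is used, so the softmax, the class ordering (index 1 meaning "paired") and the function names below are assumptions of this sketch.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1 (numerically stable)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pairing_probability(scores, paired_index=1):
    """Probability assigned to the 'paired' class; paired_index is hypothetical."""
    return softmax(scores)[paired_index]


p = pairing_probability([0.3, 1.7])  # made-up output scores for one candidate pair
```

The complementary value 1 - p is then the probability that the pairing fails.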

Compared with the prior art, the invention has the following beneficial effects:

The method for predicting the pairing probability of the heavy chain and the light chain of an antibody obtains antibody pairing probabilities through machine learning. It is simple to operate, quick, efficient, low in cost, highly repeatable and accurate, with an accuracy of up to 67.4%, and is of significance for clinical research, antibody discovery, antibody library research and the like.

Detailed Description

To further illustrate the technical means and effects of the present invention, the following further describes the technical solution of the present invention with reference to the preferred embodiments of the present invention, but the present invention is not limited to the scope of the embodiments.

Example 1

In this embodiment, the convolutional neural network training is performed on the antibody sample to obtain the final model parameters, and the specific operation method is as follows:

(1) Acquiring the whole gene sequence information of the unpaired heavy chain and light chain variable regions, acquiring the gene pairing information of an antibody sample, and processing the information to obtain the amino acid sequence pairing information of the heavy chain and the light chain of the antibody sample as a basic sample;

(2) Dividing the basic sample into a first part and a second part, and extracting the amino acid sequence characteristics of the basic sample of the first part: an amino acid is defined as a binary number of twenty digits, and 20 amino acids correspond to the binary numbers of twenty digits as follows:

Amino acid	Abbreviation	Twenty-digit binary number
Alanine	A	00000000000000000001
Arginine	R	00000000000000000010
Asparagine	N	00000000000000000100
Aspartic acid	D	00000000000000001000
Cysteine	C	00000000000000010000
Glutamine	Q	00000000000000100000
Glutamic acid	E	00000000000001000000
Histidine	H	00000000000010000000
Isoleucine	I	00000000000100000000
Leucine	L	00000000001000000000
Lysine	K	00000000010000000000
Methionine	M	00000000100000000000
Phenylalanine	F	00000001000000000000
Proline	P	00000010000000000000
Serine	S	00000100000000000000
Threonine	T	00001000000000000000
Tryptophan	W	00010000000000000000
Tyrosine	Y	00100000000000000000
Valine	V	01000000000000000000
Glycine	G	10000000000000000000

An amino acid sequence is further defined as a matrix of [ x × 20], where x represents the number of amino acids in the amino acid sequence of the heavy or light chain of the antibody.

(3) Extracting the site-specific scoring matrix characteristics of the first part of basic samples: the site-specific scoring matrix is defined as an [L × 20] matrix, as shown in the following formula:

$$
\mathrm{PSSM} =
\begin{bmatrix}
E_{1 \to 1} & E_{1 \to 2} & \cdots & E_{1 \to 20} \\
E_{2 \to 1} & E_{2 \to 2} & \cdots & E_{2 \to 20} \\
\vdots & \vdots & \ddots & \vdots \\
E_{L \to 1} & E_{L \to 2} & \cdots & E_{L \to 20}
\end{bmatrix}
$$

where L represents the length of the antibody heavy or light chain amino acid sequence. The site-specific scoring matrix of the basic sample is calculated using the PSI-BLAST tool; the sigmoid function

$$
f(x) = \frac{1}{1 + e^{-x}}
$$

is then used to transform the values of the site-specific scoring matrix to the range 0-1;

where $E_{i \to j}$ represents the log of the probability that the i-th amino acid of the amino acid sequence is mutated to amino acid j during evolution, and j = 1, ..., 20 indexes the 20 natural amino acids arranged alphabetically.

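(4) Extracting the length characteristics of the antibody CDR3 region of the first part of the basic sample: the number of amino acids in the CDR3 region of an antibody is defined as a seven-digit binary number, with the following correspondence:

Length of CDR3 region	Binary number
1	0000000
2	0000001
3	0000010
4	0000011
5	0000100
6	0000101
7	0000110
8	0000111
9	0001000
10	0001001
11	0001010
12	0001011
...	...
128	1111111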

(5) Extracting the amino acid secondary structure characteristics of the first part of basic samples: the SCRATCH tool was used to predict the secondary structure of each amino acid in an amino acid sequence and define the three secondary structures as three different three-digit binary numbers as follows:

Secondary structure	Binary number
Helix (H)	001
Strand (E)	010
Coil (C)	100

(6) Extracting the solution accessibility characteristics of the amino acids of the first part of the basic sample: the solution accessibility of each amino acid in the amino acid sequence is predicted using the SCRATCH tool, and the two states are defined as two different two-digit binary numbers as follows:

Solution accessibility	Binary number
Buried	01
Exposed	10

(7) Summarizing the extracted amino acid sequence characteristics, site-specific scoring matrix characteristics, antibody CDR3 region length characteristics, amino acid secondary structure characteristics and amino acid solution accessibility characteristics, defining summarized information as a fifty-two-digit binary number, further defining an amino acid sequence as [ y x 52] matrix data, wherein y represents the number of amino acids in the antibody heavy chain or light chain amino acid sequence, and finally obtaining a data set containing antibody sample information as a training set;

(8) Extracting the amino acid sequence characteristics, site-specific scoring matrix characteristics, antibody CDR3 region length characteristics, amino acid secondary structure characteristics and amino acid solution accessibility characteristics of a second part of basic samples according to the methods in the steps (2) to (7), and summarizing the information of the characteristics to obtain a data set containing antibody sample information as a test set;

(9) Acquiring a positive data set and a negative data set: selecting antibody sequences with known pairings as a positive data set; and (4) according to the size of the Read count, arranging the Read count from large to small, and selecting the heavy chain with the count ranking 20% first, the light chain with the count ranking 20% later, the light chain with the count ranking 20% first and the heavy chain with the count ranking 20% later as a negative data set.

(10) Inputting the training set, the test set, the positive data set and the negative data set into a convolutional neural network for training, wherein the structure of the convolutional neural network comprises: an input layer, a first convolutional layer, a first activation layer, a first pooling layer, a first fully connected layer, a second convolutional layer, a second activation layer, a second pooling layer, a second fully connected layer, a third convolutional layer, a third activation layer, a third pooling layer, a third fully connected layer, a fourth convolutional layer, a fourth activation layer, a fourth pooling layer, a fourth fully connected layer and an output layer.

First convolutional layer: 48 convolution kernels of size 3 × 3.

First activation layer, activation function: ReLU.

First pooling layer: 2 × 2 kernel.

First fully connected layer: dropout ratio of 0.35.

Second convolutional layer: 48 convolution kernels of size 3 × 3.

Second activation layer, activation function: ReLU.

Second pooling layer: 2 × 2 kernel.

Second fully connected layer: dropout ratio of 0.35.

Third convolutional layer: 96 convolution kernels of size 3 × 3.

Third activation layer, activation function: ReLU.

Third pooling layer: 2 × 2 kernel.

Third fully connected layer: dropout ratio of 0.35.

Fourth convolutional layer: 96 convolution kernels of size 3 × 3.

Fourth activation layer, activation function: ReLU.

Fourth pooling layer: 2 × 2 kernel.

Fourth fully connected layer: dropout ratio of 0.35.

The fourth pooling layer is connected to the output layer.
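To make the effect of the four convolution and pooling stages concrete, the following sketch traces how the size of a [y × 52] input would change through the network. It is an illustration only: the padding mode, the pooling stride (taken equal to the 2 × 2 kernel), the single input channel and the example height of 128 are assumptions not stated in the text, as is the way heavy chain and light chain matrices are combined before input. The dropout stages do not change the shape and are omitted.

```python
# (name, number of 3 x 3 kernels) for the four stages listed above.
BLOCKS = [("block1", 48), ("block2", 48), ("block3", 96), ("block4", 96)]


def conv_out(size, kernel=3, padding="same"):
    """Output length of a stride-1 convolution along one axis."""
    return size if padding == "same" else size - kernel + 1


def pool_out(size, pool=2):
    """Output length of a pooling step with stride equal to the pool size."""
    return size // pool


def trace_shapes(height, width=52, padding="same"):
    """Return [(layer_name, (height, width, channels)), ...] for the whole stack."""
    shapes = [("input", (height, width, 1))]
    h, w = height, width
    for name, filters in BLOCKS:
        h, w = conv_out(h, padding=padding), conv_out(w, padding=padding)
        shapes.append((f"{name}_conv", (h, w, filters)))
        h, w = pool_out(h), pool_out(w)
        if h < 1 or w < 1:
            raise ValueError("input too small for four conv/pool stages")
        shapes.append((f"{name}_pool", (h, w, filters)))
    return shapes


# Hypothetical input height of 128 residues with "same" padding.
for layer_name, shape in trace_shapes(128):
    print(layer_name, shape)
```

Under these assumptions the feature axis shrinks from 52 to 26, 13, 6 and finally 3 columns, which shows that four pooling stages are about as many as a 52-wide input can support.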

The training parameters are set as follows:

NB_EPOCH = 20

BATCH_SIZE = 100

VERBOSE = 1

NB_CLASSES = 2

OPTIMIZER = SGD

VALIDATION_SPLIT = 0.2
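As a rough guide to what these settings imply, the sketch below works out how many samples are held out for validation and how many weight updates are performed. It assumes that the validation split is taken from the training set and that stochastic gradient descent performs one update per mini-batch; the sample count used in the usage line is hypothetical, and the training framework is not named in the text.

```python
import math

NB_EPOCH = 20           # passes over the training data
BATCH_SIZE = 100        # samples per gradient update
VALIDATION_SPLIT = 0.2  # fraction held out for validation


def training_schedule(n_samples):
    """Return the sample split and update counts implied by the settings."""
    n_validation = int(n_samples * VALIDATION_SPLIT)
    n_train = n_samples - n_validation
    steps_per_epoch = math.ceil(n_train / BATCH_SIZE)
    return {
        "train": n_train,
        "validation": n_validation,
        "steps_per_epoch": steps_per_epoch,
        "total_updates": steps_per_epoch * NB_EPOCH,
    }


# Hypothetical training-set size, not taken from the patent.
schedule = training_schedule(10_000)
```

For 10,000 samples this gives 8,000 training samples, 2,000 validation samples, 80 updates per epoch and 1,600 updates in total.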

(11) Outputting the pairing information and optimizing the objective function of the convolutional neural network training to obtain the weights and biases of the entire convolutional neural network, thereby obtaining the final model parameters.
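The text does not name the objective function or the output activation. Purely as an assumption, a two-class output (NB_CLASSES = 2) is often realized with a softmax output and a cross-entropy objective; the sketch below shows how a pairing probability and a loss value would be obtained in that case. The class index for "paired" and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw output scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the true class."""
    return -math.log(probabilities[true_class])


def pairing_probability(logits, paired_class=1):
    """Probability of the 'paired' class (index 1 is an assumed convention)."""
    return softmax(logits)[paired_class]
```

Minimizing the average cross-entropy over the positive and negative data sets with SGD would then adjust the weights and biases so that known pairs receive a high pairing probability.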

Example 2

In this example, the method of the present invention is used to predict the pairing probability of the following query sequences:

wherein the heavy chain is:

QVHLQESGPELVRPGASVKISCKTSGYVFSSSWMNWVKQRPGQGLKWIGRIYPGNGNTNYNEKFKGKATLTADKSSNTAYMQLSSLTSVDSAVYFCATSSAYWGQGTLLTVSAAKTTPPSVYPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVPSSPRPSETVTCNVAHPASSTKVDKKIVPR；

the light chain is:

DIQMTQTTSSLSASLGDRVTFSCSASQDISNYLNWYQQKPDGTIKLLIYYTSSLRSGVPSRFSGSGSGTDYSLTINNLEPEDIATYFCQQYSRLPFTFGSGTKLEIKRADAAPTVSIFPPSSEQLTSGGASVVCFLNNFYPKDINVKWKIDGSERQNGVLNSWTDQDSKDSTYSMSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNRNEC；

the resulting pairing probability was 83.1%.

The applicant states that the present invention is illustrated by the above examples to describe a method for predicting the pairing probability of heavy chains and light chains of an antibody and the application thereof, but the present invention is not limited to the above examples, i.e. it does not mean that the present invention must rely on the above examples to be implemented. It should be understood by those skilled in the art that any modification of the present invention, equivalent substitutions of the raw materials of the product of the present invention, addition of auxiliary components, selection of specific modes, etc., are within the scope and disclosure of the present invention.

The preferred embodiments of the present invention have been described in detail, however, the present invention is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present invention within the technical idea of the present invention, and these simple modifications are within the protective scope of the present invention.

It should be noted that the various technical features described in the above embodiments can be combined in any suitable manner provided there is no contradiction; to avoid unnecessary repetition, the possible combinations are not described separately.

Claims

1. A method for predicting the probability of pairing a heavy chain and a light chain of an antibody, the method comprising the steps of: converting the amino acid sequence characteristics of the heavy chain and the light chain of the antibody sample with known pairing information into digital signals, inputting the digital signals into a convolutional neural network for training to obtain final model parameters, and predicting the pairing probability of the heavy chain and the light chain of the antibody to be predicted by using the model parameters;

the method for obtaining the final model parameters specifically comprises the following steps:

(3) Extracting the amino acid sequence characteristics, site-specific scoring matrix characteristics, antibody CDR3 region length characteristics, amino acid secondary structure characteristics and amino acid solvent accessibility characteristics of a second part of basic samples, and summarizing the characteristics to obtain a data set containing antibody sample information as a test set;

(4) Acquiring a positive data set and a negative data set;

(5) Inputting the training set into a convolutional neural network for convolutional neural network training;

(6) Outputting the pairing information and optimizing the objective function of the convolutional neural network training to obtain the weights and biases of the entire convolutional neural network, thereby obtaining the final model parameters.

2. The method of claim 1, wherein the base sample is obtained by: acquiring the whole gene sequence information of the unpaired heavy chain and light chain variable regions, acquiring the antibody sample gene pairing information, and processing the information to obtain the heavy chain and light chain amino acid sequence pairing information of the antibody sample.

3. The method of predicting the probability of pairing a heavy chain and a light chain of an antibody of claim 1, wherein the amino acid sequence features are extracted by:

4. The method of predicting antibody heavy chain and light chain pairing probability of claim 1, wherein the site-specific scoring matrix features are extracted by:

the site-specific scoring matrix (PSSM) is defined as an [L × 20] matrix of elements E_i-j, wherein L represents the length of the antibody heavy chain or light chain amino acid sequence; the site-specific scoring matrix of the base sample is calculated using the PSI-BLAST tool; the values in the rows and columns of the PSSM are then transformed into the interval 0-1 using a sigmoid function;

wherein E_i-j represents the log probability value of the i-th amino acid of the amino acid sequence mutating to amino acid j during evolution, and j = 1-20 denotes the 20 natural amino acids arranged in alphabetical order.
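For illustration, the transformation of the scoring matrix values into the interval 0-1 can be sketched as follows, assuming the standard logistic sigmoid 1 / (1 + e^(-x)) is applied to each element independently. The function names are hypothetical, and the matrix is assumed to be given as a list of L rows of 20 scores each.

```python
import math


def sigmoid(x):
    """Standard logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def normalize_pssm(pssm):
    """Apply the sigmoid element-wise to an L x 20 scoring matrix."""
    for row in pssm:
        if len(row) != 20:
            raise ValueError("each PSSM row must hold 20 scores")
    return [[sigmoid(score) for score in row] for row in pssm]
```

A score of 0 maps to 0.5, strongly positive scores approach 1 and strongly negative scores approach 0, so every element ends up strictly between 0 and 1.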

5. The method of claim 1, wherein the antibody CDR3 region length characteristic is extracted by: defining the number of amino acids in the antibody CDR3 region as a distinct seven-digit binary number.

6. The method of predicting the probability of pairing a heavy chain and a light chain of an antibody of claim 1, wherein the secondary structural features of the amino acids are extracted by: the secondary structure of each amino acid in the amino acid sequence was predicted using the SCRATCH tool, and three secondary structures were defined as three different three-digit binary numbers, thereby extracting the amino acid secondary structure characteristics.

7. The method of predicting antibody heavy and light chain pairing probability of claim 1, wherein the solvent accessibility features of the amino acids are extracted by:

8. The method of claim 1, wherein the method for predicting the probability of pairing heavy chains and light chains of the antibody summarizes the information on the characteristics by:

9. The method of predicting antibody heavy chain and light chain pairing probability of claim 1, wherein the positive data set and the negative data set are obtained by: selecting antibody sequences with known pairings as the positive data set; ranking the sequences by read count in descending order, and selecting heavy chains ranked in the top 20% together with light chains ranked in the bottom 20%, and light chains ranked in the top 20% together with heavy chains ranked in the bottom 20%, as the negative data set.

10. The method for predicting antibody heavy chain and light chain pairing probability of claim 1, wherein the structure of the convolutional neural network used in the convolutional neural network training comprises: an input layer, a first convolutional layer, a first activation layer, a first pooling layer, a first fully connected layer, a second convolutional layer, a second activation layer, a second pooling layer, a second fully connected layer, a third convolutional layer, a third activation layer, a third pooling layer, a third fully connected layer, a fourth convolutional layer, a fourth activation layer, a fourth pooling layer, a fourth fully connected layer and an output layer.

11. The method of claim 1, wherein the parameters for performing the convolutional neural network training are set as follows:

NB_EPOCH = 20

BATCH_SIZE = 100

VERBOSE = 1

NB_CLASSES = 2

OPTIMIZER = SGD

VALIDATION_SPLIT = 0.2