CN115458048A - Antibody humanization method based on sequence encoding and decoding - Google Patents
Antibody humanization method based on sequence encoding and decoding
- Publication number
- CN115458048A (CN202211128757.XA)
- Authority
- CN
- China
- Prior art keywords
- gene
- training
- antigen molecule
- antibody
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The application relates to the field of biological research and discloses an antibody humanization method based on sequence encoding and decoding. The method uses an artificial intelligence model based on natural-language semantic understanding to treat a gene sequence as a text sequence. It represents the feature distribution of the gene sequence of an antigen molecule in the human body, and of the gene sequence of the modified rabbit-derived antibody, by fusing the global implicit features of each gene sequence with its multi-scale neighborhood correlation features over different gene spans. The transfer matrix of the gene features of the modified rabbit-derived antibody relative to the gene features of the antigen molecule is then used to evaluate the homology between the modified rabbit-derived antibody and the antigen molecule in the human body, and thereby to check the homology between the gene sequence of the modified rabbit-derived antibody and the gene sequence of antibody molecules in the human body.
Description
Technical Field
The present application relates to the field of biological research, and more particularly, to a method for humanizing an antibody based on a sequence encoding/decoding scheme.
Background
Antibody humanization is an important part of experimental research in the production of recombinant antibodies (monoclonal antibodies). In the present application, antibody humanization is the process of converting a rabbit-derived antibody into a human-derived antibody. Most clinically used monoclonal antibodies are mouse-derived, and because of the species specificity between humans and mice, the use of mouse-derived antibodies is limited and they give rise to anti-drug antibodies.
When a mouse-derived or rabbit-derived antibody enters the human body as a foreign protein, the human immune system responds and produces specific antibodies that treat it as an antigen; for a mouse antibody these are human anti-mouse antibodies (HAMA). Such a heterologous protein is usually cleared quickly from the human body and has a very short half-life. Rabbit-derived antibodies share this drawback and must be humanized by design to reduce immunogenicity. Because of these limitations in clinical applications, recombinant DNA technology is used to humanize rabbit-derived antibodies.
Traditional humanization of a mouse-derived or rabbit-derived antibody modifies its genes so that the antibody closely resembles antibody molecules in the human body, thereby evading recognition by the human immune system and avoiding induction of the HAMA response. Antibody humanization should follow two basic principles: maintain or increase the affinity and specificity of the antibody, and greatly reduce or essentially eliminate its immunogenicity.
In existing technical schemes, rabbit-derived antibodies are humanized in much the same way as mouse-derived ones. A homology modeling method is used to mutate the rabbit-derived amino acid sequence of the framework region into a human-derived one, the affinity of the antibody is then determined by ELISA (enzyme-linked immunosorbent assay), SPR (surface plasmon resonance), or a similar method, and a humanized version is selected. Because rabbit antibodies and human antibodies have low homology, the structures produced by homology modeling are not very reliable, so the affinity of the mutated humanized version is generally lower than that of the rabbit antibody.
Therefore, an optimized antibody humanization scheme is desired to achieve a higher degree of homology.
Disclosure of Invention
The present application is proposed to solve the above-mentioned technical problems. Embodiments of the application provide an antibody humanization method based on sequence encoding and decoding. The method uses an artificial intelligence model based on natural-language semantic understanding to treat a gene sequence as a text sequence, and represents the feature distribution of the gene sequence of an antigen molecule in the human body, and of the gene sequence of the modified rabbit-derived antibody, by fusing the global implicit features of each gene sequence with its multi-scale neighborhood correlation features over different gene spans. The transfer matrix of the gene features of the modified rabbit-derived antibody relative to the gene features of the antigen molecule is then used to evaluate the homology between the modified rabbit-derived antibody and the antigen molecule in the human body, and thereby to check the homology between the gene sequence of the modified rabbit-derived antibody and the gene sequence of antibody molecules in the human body.
According to one aspect of the present application, there is provided a method for humanizing an antibody based on a sequence coding, comprising:
a training phase comprising:
acquiring training data, wherein the training data comprise a gene sequence of a training antigen molecule, a gene sequence of a training modified rabbit-derived antibody and a true value of the homology between the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody;
passing the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody, respectively, through the converter-based context encoder and the multi-scale neighborhood feature extraction module to obtain a training antigen molecule gene feature vector and a training modified rabbit-derived antibody gene feature vector;
calculating a transfer matrix of the training modified rabbit-derived antibody gene feature vector relative to the training antigen molecule gene feature vector as a training classification feature matrix;
passing the training classification feature matrix through the classifier to obtain a classification loss function value;
calculating a classification mode digestion inhibition loss function value of the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector, wherein the classification mode digestion inhibition loss function value is related to the square of the two-norm of the difference feature vector between the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector; and
training the converter-based context encoder, the multi-scale neighborhood feature extraction module, and the classifier with a weighted sum of the classification loss function value and the classification mode digestion inhibition loss function value as the loss function value; and
an inference phase comprising:
obtaining a gene sequence of an antigen molecule in a human body;
passing the gene sequence of the antigen molecule through the trained converter-based context encoder to obtain a plurality of gene expression feature vectors, and cascading the plurality of gene expression feature vectors to obtain an antigen molecule global gene feature vector;
passing the gene sequence of the antigen molecule through the trained multi-scale neighborhood feature extraction module to obtain an antigen molecule multi-neighborhood scale feature vector;
cascading the antigen molecule global gene feature vector and the antigen molecule multi-neighborhood scale feature vector to obtain an antigen molecule gene feature vector;
obtaining a gene sequence of the modified rabbit-derived antibody;
processing the gene sequence of the modified rabbit-derived antibody through the converter-based context encoder and the multi-scale neighborhood feature extraction module to obtain a modified rabbit-derived antibody gene feature vector;
calculating a transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector as a classification feature matrix; and
passing the classification feature matrix through the trained classifier to obtain a class probability value, wherein the class probability value represents the degree of homology between the modified rabbit-derived antibody and the antigen molecule in the human body.
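The data flow of the inference phase above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the trained components are passed in as callables, and the names `cascade`, `gene_feature_vector`, `infer_homology`, `context_encoder`, `multi_scale_module`, `transfer_matrix`, and `classifier` are hypothetical. "Cascading" is assumed to mean end-to-end concatenation of vectors.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def cascade(*parts: Sequence[float]) -> Vector:
    """Cascade (concatenate) feature vectors end to end."""
    out: Vector = []
    for part in parts:
        out.extend(part)
    return out


def gene_feature_vector(
    sequence: str,
    context_encoder: Callable[[str], List[Vector]],
    multi_scale_module: Callable[[str], Vector],
) -> Vector:
    """Build the gene feature vector of one gene sequence."""
    # The context encoder yields several gene expression feature vectors,
    # which are cascaded into one global gene feature vector.
    global_vector = cascade(*context_encoder(sequence))
    # The multi-scale module yields the multi-neighborhood scale feature vector.
    neighborhood_vector = multi_scale_module(sequence)
    return cascade(global_vector, neighborhood_vector)


def infer_homology(
    antigen_sequence: str,
    antibody_sequence: str,
    context_encoder: Callable[[str], List[Vector]],
    multi_scale_module: Callable[[str], Vector],
    transfer_matrix: Callable[[Vector, Vector], Matrix],
    classifier: Callable[[Matrix], float],
) -> float:
    """Return the class probability value for one antigen/antibody pair."""
    v2 = gene_feature_vector(antigen_sequence, context_encoder, multi_scale_module)
    v1 = gene_feature_vector(antibody_sequence, context_encoder, multi_scale_module)
    # Transfer matrix of the antibody features relative to the antigen features.
    classification_feature_matrix = transfer_matrix(v1, v2)
    return classifier(classification_feature_matrix)
```

Both sequences go through the same two feature extractors, so the two gene feature vectors live in the same feature space before the transfer matrix is computed.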
In the above antibody humanization method based on sequence encoding and decoding, passing the training classification feature matrix through the classifier to obtain a classification loss function value includes: processing the training classification feature matrix using the classifier with the following formula to generate a training classification result, wherein the formula is:
$\mathrm{softmax}\{(M_c, B_c) \mid \mathrm{Project}(F)\}$, where $\mathrm{Project}(F)$ represents the projection of the training classification feature matrix as a vector, $M_c$ is the weight matrix of the fully connected layer, and $B_c$ is the bias matrix of the fully connected layer; and calculating the cross-entropy value between the training classification result and the true value of the homology between the gene sequence of the training antigen molecule in the training data and the gene sequence of the training modified rabbit-derived antibody as the classification loss function value.
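The classifier step above can be illustrated with a short sketch. It assumes that Project(F) flattens the feature matrix row by row, that the fully connected layer computes a weighted sum plus bias, and that the true value is given as a class index; the function names are hypothetical and the weights in the usage line are placeholders rather than trained values.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(feature_matrix: Matrix) -> Vector:
    """Project(F): flatten the classification feature matrix into a vector."""
    return [value for row in feature_matrix for value in row]


def fully_connected(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Fully connected layer: one weighted sum plus bias per output class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over the class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature_matrix: Matrix, weights: Matrix, bias: Vector) -> Vector:
    """Class probabilities for one classification feature matrix."""
    return softmax(fully_connected(weights, bias, project(feature_matrix)))


def cross_entropy(probabilities: Vector, true_class: int) -> float:
    """Cross-entropy between predicted probabilities and the true class index."""
    return -math.log(probabilities[true_class])


# Placeholder weights: with all-zero weights both classes are equally likely.
probs = classify([[1.0, 2.0], [3.0, 4.0]], [[0.0] * 4, [0.0] * 4], [0.0, 0.0])
loss = cross_entropy(probs, 0)
```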
In the above antibody humanization method based on sequence encoding and decoding, calculating the classification mode digestion inhibition loss function value of the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector includes: calculating the classification mode digestion inhibition loss function value of the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector according to the following formula;
wherein the formula is:
wherein $V_1$ and $V_2$ respectively represent the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector, $M_1$ and $M_2$ respectively represent the weight matrices of the classifier for the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector, $\|\cdot\|_2^2$ represents the square of the two-norm of a vector, $\|\cdot\|_F$ represents the Frobenius norm of a matrix, and $\exp(\cdot)$ represents the exponential operation on a matrix or a vector: the exponential operation on a matrix computes the natural exponential function value with the feature value at each position of the matrix as the exponent, and the exponential operation on a vector computes the natural exponential function value with the feature value at each position of the vector as the exponent.
In the above antibody humanization method based on sequence encoding and decoding, passing the gene sequence of the antigen molecule through the trained multi-scale neighborhood feature extraction module to obtain the antigen molecule multi-neighborhood scale feature vector includes: inputting the gene sequence of the antigen molecule into a first convolution layer of the multi-scale neighborhood feature extraction module to obtain a first neighborhood scale antigen molecule feature vector, wherein the first convolution layer has a first one-dimensional convolution kernel with a first length; inputting the gene sequence of the antigen molecule into a second convolution layer of the multi-scale neighborhood feature extraction module to obtain a second neighborhood scale antigen molecule feature vector, wherein the second convolution layer has a second one-dimensional convolution kernel with a second length, and the first length is different from the second length; and cascading the first neighborhood scale antigen molecule feature vector and the second neighborhood scale antigen molecule feature vector to obtain the antigen molecule multi-neighborhood scale feature vector.
In the above antibody humanization method based on sequence encoding and decoding, inputting the gene sequence of the antigen molecule into the first convolution layer of the multi-scale neighborhood feature extraction module to obtain a first neighborhood scale antigen molecule feature vector includes: performing one-dimensional convolution encoding on the gene sequence of the antigen molecule by using the first convolution layer of the multi-scale neighborhood feature extraction module according to the following formula to obtain the first neighborhood scale antigen molecule feature vector;
wherein the formula is:
wherein $a$ is the width of the first convolution kernel in the $X$ direction, $F(a)$ is the parameter vector of the first convolution kernel, $G(X-a)$ is the local vector matrix operated on with the convolution kernel function, $w$ is the size of the first convolution kernel, and $X$ represents the gene sequence of the antigen molecule;
In the above antibody humanization method based on sequence encoding and decoding, inputting the gene sequence of the antigen molecule into the second convolution layer of the multi-scale neighborhood feature extraction module to obtain a second neighborhood scale antigen molecule feature vector includes: performing one-dimensional convolution encoding on the gene sequence of the antigen molecule by using the second convolution layer of the multi-scale neighborhood feature extraction module according to the following formula to obtain the second neighborhood scale antigen molecule feature vector;
wherein the formula is:
wherein $b$ is the width of the second convolution kernel in the $X$ direction, $F(b)$ is the parameter vector of the second convolution kernel, $G(X-b)$ is the local vector matrix operated on with the convolution kernel function, $m$ is the size of the second convolution kernel, and $X$ represents the gene sequence of the antigen molecule.
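The two-branch convolution described above can be sketched as follows. The formulas themselves are not reproduced in this text; the symbol definitions are consistent with a sum over kernel offsets of the form F(a)·G(X−a), and that reading is an assumption. The sketch further assumes that the gene sequence has already been encoded as a list of numbers, that no padding is used, and that the kernel slides without being flipped, as is common in neural-network convolution layers. The names `conv1d` and `multi_scale_features` and the kernel values are hypothetical.

```python
from typing import List

Vector = List[float]


def conv1d(encoded_sequence: Vector, kernel: Vector) -> Vector:
    """Slide a one-dimensional kernel over the sequence (no padding, stride 1)."""
    width = len(kernel)
    if width == 0 or width > len(encoded_sequence):
        raise ValueError("kernel must be non-empty and no longer than the sequence")
    positions = len(encoded_sequence) - width + 1
    return [
        sum(kernel[a] * encoded_sequence[i + a] for a in range(width))
        for i in range(positions)
    ]


def multi_scale_features(
    encoded_sequence: Vector, first_kernel: Vector, second_kernel: Vector
) -> Vector:
    """Cascade the outputs of two convolution branches with different kernel lengths."""
    if len(first_kernel) == len(second_kernel):
        raise ValueError("the two kernels must have different lengths")
    first_scale = conv1d(encoded_sequence, first_kernel)
    second_scale = conv1d(encoded_sequence, second_kernel)
    return first_scale + second_scale


# Illustrative numeric encoding and kernels (placeholders, not trained values).
features = multi_scale_features([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], [1.0, 0.0, 1.0])
```

The shorter kernel responds to correlations between adjacent positions, while the longer one covers a wider gene span; cascading the two keeps both scales in one feature vector.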
In the above antibody humanization method based on sequence encoding and decoding, calculating a transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector as a classification feature matrix includes: calculating the transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector as the classification feature matrix according to the following formula;
wherein the formula is:
wherein $V_1$ represents the modified rabbit-derived antibody gene feature vector, $V_2$ represents the antigen molecule gene feature vector, $M$ represents the classification feature matrix, and the product operator in the formula denotes matrix multiplication.
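One way to picture a transfer matrix between two feature vectors is sketched below. The formula is not reproduced in this text, so the sketch rests on two explicit assumptions: that the relation has the form V1 = M · V2, and that among the many matrices satisfying it the one with the smallest Frobenius norm, M = V1 V2ᵀ / ‖V2‖², is chosen. Neither assumption is confirmed by the source, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def transfer_matrix(v1: Vector, v2: Vector) -> Matrix:
    """Minimum-norm M with M @ v2 == v1 (assumed relation V1 = M V2)."""
    norm_sq = sum(x * x for x in v2)
    if norm_sq == 0.0:
        raise ValueError("v2 must not be the zero vector")
    # Outer product of v1 and v2, scaled by the squared two-norm of v2.
    return [[a * b / norm_sq for b in v2] for a in v1]


def apply_matrix(matrix: Matrix, vector: Vector) -> Vector:
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


# Illustrative vectors: M maps the antigen features onto the antibody features.
m = transfer_matrix([2.0, 4.0], [1.0, 0.0])
recovered = apply_matrix(m, [1.0, 0.0])
```

Under these assumptions the matrix has one row per antibody feature and one column per antigen feature, so the two vectors need not have the same length.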
According to another aspect of the present application, there is provided an antibody humanization system based on sequence coding, comprising:
a training module comprising:
a training data acquisition unit, configured to acquire training data, wherein the training data comprises a gene sequence of a training antigen molecule, a gene sequence of a training modified rabbit-derived antibody, and a true value of the homology between the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody;
a feature vector extraction unit, configured to pass the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody, respectively, through the converter-based context encoder and the multi-scale neighborhood feature extraction module to obtain a training antigen molecule gene feature vector and a training modified rabbit-derived antibody gene feature vector;
a training classification feature matrix generating unit, configured to calculate a transfer matrix of the training modified rabbit-derived antibody gene feature vector relative to the training antigen molecule gene feature vector as a training classification feature matrix;
a classification loss function value calculation unit, configured to pass the training classification feature matrix through the classifier to obtain a classification loss function value;
a classification mode digestion inhibition loss function value calculation unit, configured to calculate a classification mode digestion inhibition loss function value of the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector, wherein the classification mode digestion inhibition loss function value is related to the square of the two-norm of the difference feature vector between the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector; and
a training unit, configured to train the converter-based context encoder, the multi-scale neighborhood feature extraction module, and the classifier with a weighted sum of the classification loss function value and the classification mode digestion inhibition loss function value as the loss function value; and
an inference module comprising:
a physiological information acquisition unit for acquiring a gene sequence of an antigen molecule in a human body;
an antigen molecule global gene feature vector generating unit, configured to pass the gene sequence of the antigen molecule through the trained converter-based context encoder to obtain a plurality of gene expression feature vectors, and to cascade the plurality of gene expression feature vectors to obtain an antigen molecule global gene feature vector;
a multi-scale neighborhood feature extraction unit, configured to pass the gene sequence of the antigen molecule through the trained multi-scale neighborhood feature extraction module to obtain an antigen molecule multi-neighborhood scale feature vector;
a cascade unit, configured to cascade the antigen molecule global gene feature vector and the antigen molecule multi-neighborhood scale feature vector to obtain an antigen molecule gene feature vector;
a rabbit-derived antibody gene sequence acquisition unit, configured to acquire the gene sequence of the modified rabbit-derived antibody;
a modified rabbit-derived antibody gene feature extraction unit, configured to process the gene sequence of the modified rabbit-derived antibody through the converter-based context encoder and the multi-scale neighborhood feature extraction module to obtain a modified rabbit-derived antibody gene feature vector;
a classification feature matrix generating unit, configured to calculate a transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector as a classification feature matrix; and
a class probability value generating unit, configured to pass the classification feature matrix through the trained classifier to obtain a class probability value, wherein the class probability value represents the degree of homology between the modified rabbit-derived antibody and the antigen molecule in the human body.
According to still another aspect of the present application, there is provided an electronic apparatus including: a processor; and a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform a method of humanizing an antibody based on a sequence codec as described above.
According to yet another aspect of the present application, there is provided a computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform a method of humanizing an antibody based on a sequence codec as described above.
Compared with the prior art, the antibody humanization method based on sequence encoding and decoding provided by the present application uses an artificial intelligence model based on natural-language semantic understanding to treat a gene sequence as a text sequence, and represents the feature distribution of the gene sequence of an antigen molecule in the human body, and of the gene sequence of the modified rabbit-derived antibody, by fusing the global implicit features of each gene sequence with its multi-scale neighborhood correlation features over different gene spans. The transfer matrix of the gene features of the modified rabbit-derived antibody relative to the gene features of the antigen molecule is then used to evaluate the homology between the modified rabbit-derived antibody and the antigen molecule in the human body, and thereby to check the homology between the gene sequence of the modified rabbit-derived antibody and the gene sequence of antibody molecules in the human body.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 illustrates a schematic diagram of a heavy chain of a rabbit antibody according to embodiments of the present application.
Fig. 2 illustrates another schematic diagram of a heavy chain of a rabbit antibody according to embodiments of the present application.
FIG. 3 illustrates a schematic diagram of a structural model built by homology modeling according to an embodiment of the application.
FIG. 4 illustrates a schematic diagram of a structural model built by co-evolutionary modeling according to an embodiment of the application.
Fig. 5 illustrates a flow diagram of a training phase in a method of sequence codec based antibody humanization according to an embodiment of the present application.
FIG. 6 illustrates a flow chart of the inference stage in a sequence codec based antibody humanization method according to an embodiment of the present application.
Fig. 7 illustrates an architecture diagram of a training phase in a sequence codec based antibody humanization method according to an embodiment of the present application.
Fig. 8 illustrates an architectural diagram of an inference stage in a sequence codec based antibody humanization method according to an embodiment of the present application.
Fig. 9 illustrates a flowchart of an antigen molecule multi-scale neighborhood feature extraction process in an antibody humanization method based on sequence coding according to an embodiment of the present application.
FIG. 10 illustrates a block diagram of an antibody humanization system based on sequence coding according to an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Scenario Overview
Conventional humanization methods are prone to causing loss or reduction of affinity. In one example, kinetic measurements made with a FORTEBIO instrument and its software in Kinetics mode showed that the affinity of each humanized antibody designed by the traditional method was reduced.
Therefore, the technical solution of the present application uses a non-homology modeling method, which can keep the affinity unchanged or even increase it. The specific steps are as follows:
Step 1: Construct a structural model of the rabbit antibody by a non-homology modeling method: build the structural model by co-evolutionary modeling using AlphaFold 2, perform sequence encoding and decoding, and finally design the amino acid sequence to obtain a humanized antibody with consistent affinity.
Step 2: Humanize progressively: first select a humanized antibody with 80-90% homology as the structural model, then select a fully human antibody with 90-99% homology as the structural model, thereby achieving a higher degree of humanization.
Step 3: Verify the affinity of the designed humanized antibody.
On this basis, although a humanized antibody with 80-90% homology is first selected as the structural model and a fully human antibody with 90-99% homology is then selected to achieve a higher degree of humanization, the homology between the gene sequence of the modified rabbit antibody and the gene sequence of the antibody molecule in the human body still needs to be checked. Therefore, in the technical solution of the present application, an artificial intelligence model based on natural semantic understanding treats a gene sequence as a text sequence. The feature distribution of the gene sequence of the antigen molecule in the human body and that of the gene sequence of the modified rabbit-derived antibody are each represented by fusing the global implicit features of the gene sequence with its multi-scale neighborhood correlation features under different gene spans. The transfer matrix of the gene features of the modified rabbit-derived antibody relative to the gene features of the antigen molecule is then used to evaluate the homology between the modified rabbit-derived antibody and the antigen molecule in the human body, and thereby to check the homology between the gene sequence of the modified rabbit-derived antibody and the gene sequence of the antibody molecule in the human body.
Specifically, in the technical solution of the present application, the gene sequence of an antigen molecule in the human body is first obtained. Because each gene in this gene sequence carries contextual semantic feature information, a converter-based (that is, Transformer-based) context encoder is used to process the gene sequence of the antigen molecule. The encoder extracts, for each gene, essential features based on global high-dimensional semantics, which are better suited to characterizing the antigen molecule in the human body, and outputs a plurality of gene expression feature vectors. These gene expression feature vectors are then cascaded to integrate the global implicit feature information of the genes, giving the antigen molecule global gene feature vector.
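The context-encoding step above can be illustrated with a minimal sketch. It assumes that the converter-based context encoder is a Transformer-style encoder and shows only its core operation, single-head scaled dot-product self-attention, followed by the cascading (concatenation) of the per-position vectors. The function names, sizes and random weights are hypothetical and are not taken from the application.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x holds one embedding per sequence position; the result holds one
    context-aware vector per position.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)


def cascade(vectors):
    """Concatenate per-position vectors into one global feature vector."""
    return [value for vector in vectors for value in vector]


def random_matrix(rng, rows, cols):
    """Hypothetical stand-in for learned weights or embedded inputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, d_k = 6, 4, 3  # illustrative sizes only
x = random_matrix(rng, seq_len, d_model)
w_q, w_k, w_v = (random_matrix(rng, d_model, d_k) for _ in range(3))
global_vector = cascade(self_attention(x, w_q, w_k, w_v))
```

A trained encoder would stack several such layers with learned weights; the sketch only shows how every position attends to every other position before the outputs are concatenated.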
In particular, a gene is composed of many bases, each base being a site, and DNA contains four bases: A, T, C and G. The gene sequence of the human antigen molecule is therefore a base sequence composed of A, T, C and G. Accordingly, before the gene sequence is encoded by the context encoder, the gene sequence of the antigen molecule in the human body is first one-hot encoded to convert it into an input vector.
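A minimal sketch of this encoding step is given below. The base order and the function name `one_hot_encode` are arbitrary choices made for illustration; the application does not specify them.

```python
BASES = "ATCG"  # base order is an assumption of this sketch


def one_hot_encode(sequence):
    """Return one row per base, with a single 1 marking which base it is."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(BASES)
        row[index[base]] = 1  # raises KeyError for symbols other than A, T, C, G
        encoded.append(row)
    return encoded


example = one_hot_encode("ATCGGA")  # toy sequence, not from the application
```

Each row can be used directly as the input vector for one position of the sequence.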
It should be understood that gene segments at different spans in the gene sequence of the antigen molecule carry different implicit features, and multi-scale neighborhood feature extraction can capture the correlation features at those different spans. Therefore, in the technical solution of the present application, a multi-scale neighborhood feature extraction module is further used to encode the gene sequence of the antigen molecule and extract its multi-scale neighborhood correlation features under different gene segment spans, giving the antigen molecule multi-neighborhood scale feature vector.
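One common way to realize multi-scale neighborhood extraction is to apply one-dimensional convolutions with different kernel widths and concatenate the responses. The sketch below assumes that design; the kernel widths (3 and 5), the ReLU activation and the all-ones weights are illustrative assumptions, not values stated in this passage.

```python
def conv1d_valid(x, kernel):
    """Slide a (width x channels) kernel over a (length x channels) input
    without padding and return one response per window."""
    width = len(kernel)
    responses = []
    for start in range(len(x) - width + 1):
        window = x[start:start + width]
        responses.append(sum(
            x_value * k_value
            for x_row, k_row in zip(window, kernel)
            for x_value, k_value in zip(x_row, k_row)
        ))
    return responses


def multi_scale_features(x, kernels):
    """Apply each kernel (one per scale), pass through ReLU, and concatenate."""
    features = []
    for kernel in kernels:
        features.extend(max(value, 0.0) for value in conv1d_valid(x, kernel))
    return features


# Toy input: 8 positions, 4 channels (for example, one-hot encoded bases).
x = [[1.0] * 4 for _ in range(8)]
kernels = [
    [[1.0] * 4 for _ in range(3)],  # hypothetical short-span kernel
    [[1.0] * 4 for _ in range(5)],  # hypothetical long-span kernel
]
features = multi_scale_features(x, kernels)
```

The short kernel responds to local patterns over three adjacent positions and the long kernel to patterns over five, so the concatenated vector mixes both spans.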
The antigen molecule global gene feature vector and the antigen molecule multi-neighborhood scale feature vector are then cascaded for feature fusion to obtain the antigen molecule gene feature vector.
Furthermore, to accurately evaluate the homology between the modified rabbit-derived antibody and the antigen molecule in the human body, the gene sequence of the modified rabbit-derived antibody must also be obtained. This gene sequence is likewise processed by the converter-based context encoder and the multi-scale neighborhood feature extraction module, giving a modified rabbit-derived antibody gene feature vector that carries both global features and multi-scale neighborhood correlation features under different gene segment spans.
The gene features of the modified rabbit-derived antibody and those of the antigen molecule in the human body have different feature scales in the high-dimensional feature space, and the humanized antibody requires high affinity. Therefore, to judge the homology accurately, the transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector is calculated and used for classification. The homology between the modified rabbit-derived antibody and the antigen molecule in the human body is thereby evaluated so that a higher degree of humanization can be obtained.
In particular, in the technical solution of the present application, the classification feature matrix is the transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector. During training of the classifier, gradient back propagation therefore passes separately through the feature extraction models of the two feature vectors, that is, the converter-based context encoder together with the multi-scale neighborhood feature extraction module, and abnormal gradient branching may cause digestion of the feature patterns expressed by the two feature vectors. A classification pattern digestion inhibition loss function is therefore introduced:
where $V_1$ and $V_2$ are the modified rabbit-derived antibody gene feature vector and the antigen molecule gene feature vector respectively, $M_1$ and $M_2$ are the weight matrices of the classifier for $V_1$ and $V_2$ respectively, and $\|\cdot\|_2^2$ denotes the square of the two-norm of a vector.
By introducing the classification pattern digestion inhibition loss function, the pseudo-difference between the classifier weights is pushed toward the real difference in feature distribution between the modified rabbit-derived antibody gene feature vector and the antigen molecule gene feature vector. This ensures that the directional derivative is regularized near the gradient branch point during back propagation, that is, the gradient is over-weighted between the feature extraction patterns of the two feature vectors, so that digestion of the classification patterns of the features is inhibited and classification accuracy is improved. As a result, the homology between the modified rabbit-derived antibody and the antigen molecule in the human body can be evaluated accurately, and the homology between the gene sequence of the modified rabbit-derived antibody and the gene sequence of the antibody molecule in the human body can be checked accurately.
Based on this, the present application provides an antibody humanization method based on sequence encoding and decoding, comprising a training phase and an inference phase.
The training phase comprises: acquiring training data, wherein the training data comprises a gene sequence of a training antigen molecule, a gene sequence of a training modified rabbit-derived antibody, and a true value of the homology between the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody; passing the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody respectively through the converter-based context encoder and the multi-scale neighborhood feature extraction module to obtain a training antigen molecule gene feature vector and a training modified rabbit-derived antibody gene feature vector; calculating a transfer matrix of the training modified rabbit-derived antibody gene feature vector relative to the training antigen molecule gene feature vector as a training classification feature matrix; passing the training classification feature matrix through the classifier to obtain a classification loss function value; calculating a classification pattern digestion inhibition loss function value of the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector, wherein the classification pattern digestion inhibition loss function value is related to the square of the two-norm of the difference feature vector between the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector; and training the converter-based context encoder, the multi-scale neighborhood feature extraction module, and the classifier with a weighted sum of the classification loss function value and the classification pattern digestion inhibition loss function value as the loss function value.
The inference phase comprises: obtaining a gene sequence of an antigen molecule in a human body; passing the gene sequence of the antigen molecule through the trained converter-based context encoder to obtain a plurality of gene expression feature vectors, and cascading the gene expression feature vectors to obtain an antigen molecule global gene feature vector; passing the gene sequence of the antigen molecule through the trained multi-scale neighborhood feature extraction module to obtain an antigen molecule multi-neighborhood scale feature vector; cascading the antigen molecule global gene feature vector and the antigen molecule multi-neighborhood scale feature vector to obtain an antigen molecule gene feature vector; obtaining a gene sequence of the modified rabbit-derived antibody; processing the gene sequence of the modified rabbit-derived antibody through the converter-based context encoder and the multi-scale neighborhood feature extraction module to obtain a modified rabbit-derived antibody gene feature vector; calculating a transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector as a classification feature matrix; and passing the classification feature matrix through the trained classifier to obtain a class probability value, wherein the class probability value represents the homology between the modified rabbit-derived antibody and the antigen molecule in the human body.
Having described the basic principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary methods of humanization
Example one: rabbit antibody sequence analysis
The rabbit antibody sequences were selected as follows:
heavy chain > VH
QSVKESEGGLFKPTDTLTLTCTVSGFSLSSYAISWVRQAPGNGLEWIGIINSYGSTYYASWAKSRSTITRNTNENTVTLKMTSLTAADTATYFCARGYAGSSGGYIWGPGTLVTVSS
Light chain > VL
AAVLTQTPSPVSAAVGGTVTIKCQSSQSVYNNNLLSWYQQKPGQPPKLLIYDASNLPSGVPDRFSGSGSGTQFTLTISGVQCDDAATYYCLGGYYGSDAGGNTFGGGTEVVVK
The rabbit antibody heavy chain sequence was compared with human germline antibody sequences. The comparison shows that the homology between the rabbit antibody sequence and the human antibody sequence is lower than 70%. Heavy chain sites such as VK/EG/F/T are all sites to be humanized, as shown in FIG. 1.
The rabbit antibody light chain sequence was compared with human germline antibody sequences. The comparison shows that the homology between the rabbit antibody sequence and the human antibody sequence is lower than 70%. Light chain sites such as AVL/PV/A/GT are all sites to be humanized, as shown in FIG. 2.
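How such a percentage can be computed is sketched below, using percent identity over aligned positions as a simple stand-in for the homology figure. This is only one of several possible definitions, and the application does not say which one was used. The human fragment is hypothetical and serves only to exercise the function; the rabbit fragment is the first ten residues of the VH sequence listed above.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over aligned columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


rabbit_fragment = "QSVKESEGGL"  # first ten residues of the VH sequence
human_fragment = "QVQLVESGGG"   # hypothetical human fragment, illustration only
identity = percent_identity(rabbit_fragment, human_fragment)
```

In practice the two sequences would first be aligned with a dedicated alignment tool; the function only scores an alignment that already exists.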
Example two: building a rabbit antibody structural model by homology modeling
Homology modeling: the SWISS-MODEL homology modeling method is used to select 5-10 optimal structural solutions. Loop regions are generally modeled by homology modeling; if the CDR amino acid sequence alignment shows less than 50% identity, the CDR3 structural model is built by de novo modeling. PDB BLAST is used to retrieve the 10 antibody crystal structure models closest to the sequence (resolution better than 2.5 Å), these are compared with the automatically built model, and the optimal structural model is selected. The structural model built by homology modeling is shown in FIG. 3.
Example three:
Co-evolutionary modeling: the AlphaFold 2 co-evolutionary modeling method is used to select 2 optimal structural solutions. A humanized antibody with 80-90% homology is first selected as the structural model, and a fully human antibody with 90-99% homology is then selected as the structural model. The structural model built by co-evolutionary modeling is shown in FIG. 4.
Example four:
Traditional humanization design scheme: the original murine sequence is mutated into a human sequence by database comparison. The original sequence of the rabbit-derived antibody was designed into a plurality of humanized amino acid sequences (huVH1, huVH2, huVH3, huVL1, huVL2, huVL3), the designed sequences were combined into humanized antibodies, and the antibodies were expressed in the Expi293 mammalian expression system.
The sequencing results for the above humanized amino acid sequences are shown in the following table.
Purification of the humanized antibodies: a series of humanized antibodies expressed in Expi293 cells were collected from the cell supernatant and purified according to a standard protein purification protocol. Characterization showed that the purity of the purified humanized antibodies was over 90%.
Example five:
Activity detection of the humanized antibodies designed by homology modeling:
The binding activity of the humanized antibodies to the antigen was detected by ELISA:
A plate was coated with the target antigen at 0.5 μg/mL, the humanized antibody samples purified in Example four were applied in a concentration gradient of 0.00004-1.2 μg/mL, and the OD value was measured to determine the binding affinity of the humanized antibodies for the target antigen; the test results are shown in the figure. As can be seen from the figure, the OD value rises markedly with increasing sample concentration, clear upper and lower plateaus appear quickly, and the window is large. The antigen binding affinity of the humanized antibodies is weaker than that of the parent antibody: the degree of humanization is high, but the affinity is markedly reduced.
The calculated EC50 values are given in the following table:
Sample ID | EC50(μg/mL) |
mVH+mVL | 0.02054 |
H1L1 | 1.4298 |
H2L2 | 3.6758 |
H3L3 | / |
positive control | 0.01683 |
Example six:
Affinity kinetics assay
The FORTEBIO instrument and its software were started and the Kinetics experiment mode was selected. Analysis showed that the affinity of each humanized antibody designed by the traditional method was reduced.
Sample | KD (M) | ka (1/Ms) | kd (1/s) | R² | Rmax (nm) | Ratio: WT/Variant |
Parent | 1.12E-09 | 1.46E+05 | 1.64E-04 | 0.997 | 0.476 | 1.62 |
H1L1 | 2.20E-07 | 1.73E+04 | 3.81E-03 | 0.997 | 0.464 | 1.67 |
H2L2 | 3.13E-08 | 1.78E+05 | 5.57E-03 | 0.996 | 0.489 | 1.69 |
H3L3 | 4.36E-07 | 1.96E+05 | 8.55E-03 | 0.996 | 0.435 | 1.80 |
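For a simple 1:1 binding model, the equilibrium dissociation constant is the ratio of the dissociation rate to the association rate, KD = kd / ka. The short sketch below recomputes KD from the ka and kd values listed for the parent antibody, H1L1 and H2L2, and expresses each KD as a multiple of the parent value. The variable and function names are illustrative only.

```python
# (ka in 1/Ms, kd in 1/s), copied from the table rows for three samples.
rates = {
    "Parent": (1.46e5, 1.64e-4),
    "H1L1": (1.73e4, 3.81e-3),
    "H2L2": (1.78e5, 5.57e-3),
}


def dissociation_constant(ka, kd):
    """KD in mol/L for a 1:1 binding model: KD = kd / ka."""
    return kd / ka


kd_values = {name: dissociation_constant(ka, kd) for name, (ka, kd) in rates.items()}

# A fold change above 1 means weaker binding than the parent antibody.
fold_change = {name: value / kd_values["Parent"] for name, value in kd_values.items()}
```

A larger KD means weaker binding, so the fold changes quantify the affinity loss described in this example.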
Example seven:
Improved humanization design scheme: the original murine sequence is mutated into a human sequence by database comparison. The original sequence of the rabbit-derived antibody was designed into a plurality of humanized amino acid sequences (huVH4, huVH5, huVH6, huVL4, huVL5, huVL6), the designed sequences were combined into humanized antibodies, and the antibodies were expressed in the Expi293 mammalian expression system. The results for the above humanized amino acid sequences are shown in the following table.
Purification of the humanized antibodies: a series of humanized antibodies expressed in Expi293 cells were collected from the cell supernatant and purified according to a standard protein purification protocol. Characterization showed that the purity of the purified humanized antibodies was over 90%.
Example seven:
Activity detection of the humanized antibodies designed by modeling with the present method:
The binding activity of the humanized antibodies to the antigen was detected by ELISA:
A plate was coated with the target antigen at 0.5 μg/mL, the humanized antibody samples purified in Example seven were applied in a concentration gradient of 0.00004-1.2 μg/mL, and the OD value was measured to determine the binding affinity of the humanized antibodies for the target antigen; the test results are shown in the figure. As can be seen from the figure, the OD value rises markedly with increasing sample concentration, clear upper and lower plateaus appear quickly, and the window is large. The antigen binding affinity of the humanized antibodies is equivalent to that of the parent antibody: the degree of humanization is high, the affinity is essentially unchanged, and the humanization is successful.
The calculated EC50 values are given in the following table:
Sample ID | EC50(μg/mL) |
mVH+mVL | 0.01649 |
H4L4 | 0.01156 |
H5L5 | 0.02173 |
H6L6 | 0.01124 |
positive control | 0.01019 |
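The two EC50 tables can be compared by dividing each EC50 by the value of the mVH+mVL row of the same table. The sketch below does this with the listed values and assumes that the mVH+mVL row is the parental reference; H3L3 is omitted because no EC50 is listed for it.

```python
# EC50 values in ug/mL, copied from the two ELISA tables.
traditional = {"mVH+mVL": 0.02054, "H1L1": 1.4298, "H2L2": 3.6758}
improved = {"mVH+mVL": 0.01649, "H4L4": 0.01156, "H5L5": 0.02173, "H6L6": 0.01124}


def relative_ec50(table, reference="mVH+mVL"):
    """EC50 of each sample divided by the EC50 of the reference row."""
    return {name: value / table[reference] for name, value in table.items()}


relative_traditional = relative_ec50(traditional)
relative_improved = relative_ec50(improved)
```

Ratios near 1 indicate binding comparable to the reference, while large ratios indicate that much more antibody is needed to reach half-maximal signal.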
Example eight:
Affinity kinetics assay
The FORTEBIO instrument and its software were started and the Kinetics experiment mode was selected. Analysis showed that the affinity of each humanized antibody designed by the improved method is consistent with that of the parent antibody, and the humanization design is successful.
Example nine: determination of the rabbit antibody crystal structure
The rabbit antibody light and heavy chain sequences were synthesized separately and expressed and purified in 293F cells to obtain 20 mg of high-purity protein. Crystals were screened with a crystallization robot, X-ray diffraction data were collected, and the structure was solved by methods such as molecular replacement. The structures predicted by the different methods were analyzed and compared in PyMOL: the RMSD of the structure obtained by the traditional homology modeling method was 1.42 Å, whereas the RMSD of the structure obtained by the present method was 0.41 Å. The antibody structure predicted by the present method is therefore more accurate, and the resulting humanized antibodies have a higher synthesis success rate and higher activity than those obtained by the traditional method.
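The RMSD figures quoted above measure the average distance between corresponding atoms of two superposed structures. The sketch below shows the calculation, assuming the two coordinate sets already list the same atoms in the same order and have already been superposed (for example, by a structure viewer). The coordinates are toy values, not data from the application.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy example: every atom of the "predicted" model is displaced by 0.5 units.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
predicted = [(x + 0.3, y, z + 0.4) for x, y, z in reference]
value = rmsd(reference, predicted)
```

A lower RMSD means the predicted model lies closer to the reference structure.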
In particular, although a humanized antibody with 80-90% homology is first selected as the structural model and a fully human antibody with 100% homology is then selected to achieve a higher degree of humanization, the homology between the gene sequence of the modified rabbit antibody and the gene sequence of the antibody molecule in the human body still needs to be checked. Therefore, in the technical solution of the present application, an artificial intelligence model based on natural semantic understanding treats a gene sequence as a text sequence. The feature distribution of the gene sequence of the antigen molecule in the human body and that of the gene sequence of the modified rabbit-derived antibody are each represented by fusing the global implicit features of the gene sequence with its multi-scale neighborhood correlation features under different gene spans. The transfer matrix of the gene features of the modified rabbit-derived antibody relative to the gene features of the antigen molecule is then used to evaluate the homology between the modified rabbit-derived antibody and the antigen molecule in the human body, and thereby to check the homology between the gene sequence of the modified rabbit-derived antibody and the gene sequence of the antibody molecule in the human body.
Exemplary homology check method
FIG. 5 illustrates a flowchart of the training phase of the antibody humanization method based on sequence encoding and decoding according to an embodiment of the present application. As shown in FIG. 5, the method includes a training phase comprising the following steps:
S110: obtaining training data, wherein the training data comprises a gene sequence of a training antigen molecule, a gene sequence of a training modified rabbit-derived antibody, and a true value of the homology between the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody;
S120: passing the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody respectively through the converter-based context encoder and the multi-scale neighborhood feature extraction module to obtain a training antigen molecule gene feature vector and a training modified rabbit-derived antibody gene feature vector;
S130: calculating a transfer matrix of the training modified rabbit-derived antibody gene feature vector relative to the training antigen molecule gene feature vector as a training classification feature matrix;
S140: passing the training classification feature matrix through the classifier to obtain a classification loss function value;
S150: calculating a classification pattern digestion inhibition loss function value of the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector, wherein the classification pattern digestion inhibition loss function value is related to the square of the two-norm of the difference feature vector between the two feature vectors; and
S160: training the converter-based context encoder, the multi-scale neighborhood feature extraction module, and the classifier with a weighted sum of the classification loss function value and the classification pattern digestion inhibition loss function value as the loss function value.
FIG. 6 illustrates a flowchart of the inference phase of the antibody humanization method based on sequence encoding and decoding according to an embodiment of the present application. As shown in FIG. 6, the method further includes an inference phase comprising the following steps:
S210: obtaining a gene sequence of an antigen molecule in a human body;
S220: passing the gene sequence of the antigen molecule through the trained converter-based context encoder to obtain a plurality of gene expression feature vectors, and cascading the gene expression feature vectors to obtain an antigen molecule global gene feature vector;
S230: passing the gene sequence of the antigen molecule through the trained multi-scale neighborhood feature extraction module to obtain an antigen molecule multi-neighborhood scale feature vector;
S240: cascading the antigen molecule global gene feature vector and the antigen molecule multi-neighborhood scale feature vector to obtain an antigen molecule gene feature vector;
S250: obtaining a gene sequence of the modified rabbit-derived antibody;
S260: processing the gene sequence of the modified rabbit-derived antibody through the converter-based context encoder and the multi-scale neighborhood feature extraction module to obtain a modified rabbit-derived antibody gene feature vector;
S270: calculating a transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector as a classification feature matrix; and
S280: passing the classification feature matrix through the trained classifier to obtain a class probability value, wherein the class probability value represents the homology between the modified rabbit-derived antibody and the antigen molecule in the human body.
FIG. 7 illustrates an architecture diagram of the training phase of the antibody humanization method based on sequence encoding and decoding according to an embodiment of the present application. As shown in FIG. 7, in the network structure of the training phase, training data is first obtained, wherein the training data comprises a gene sequence of a training antigen molecule, a gene sequence of a training modified rabbit-derived antibody, and a true value of the homology between the two gene sequences. The gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody are then passed respectively through the converter-based context encoder and the multi-scale neighborhood feature extraction module to obtain a training antigen molecule gene feature vector and a training modified rabbit-derived antibody gene feature vector. Next, a transfer matrix of the training modified rabbit-derived antibody gene feature vector relative to the training antigen molecule gene feature vector is calculated as a training classification feature matrix, and this matrix is passed through the classifier to obtain a classification loss function value. A classification pattern digestion inhibition loss function value of the two feature vectors is then calculated, wherein this value is related to the square of the two-norm of the difference feature vector between the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector. Finally, the converter-based context encoder, the multi-scale neighborhood feature extraction module, and the classifier are trained with a weighted sum of the classification loss function value and the classification pattern digestion inhibition loss function value as the loss function value.
More specifically, in the training phase, in step S110, training data is obtained, wherein the training data comprises a gene sequence of a training antigen molecule, a gene sequence of a training modified rabbit-derived antibody, and a true value of the homology between the two gene sequences. Although a humanized antibody with 80-90% homology is first selected as the structural model and a fully human antibody with 100% homology is then selected to achieve a higher degree of humanization, the homology between the gene sequence of the modified rabbit antibody and the gene sequence of the antibody molecule in the human body still needs to be checked. Therefore, in the technical solution of the present application, the gene sequence of the training antigen molecule, the gene sequence of the training modified rabbit-derived antibody, and the true value of the homology between them can be obtained with a gene sequence analyzer.
More specifically, in the training phase, in step S120, the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody are passed respectively through the converter-based context encoder and the multi-scale neighborhood feature extraction module to obtain a training antigen molecule gene feature vector and a training modified rabbit-derived antibody gene feature vector. Because each gene in the gene sequence of the antigen molecule in the human body carries contextual semantic feature information, the converter-based context encoder is used to process the gene sequence of the training antigen molecule. In particular, a gene is composed of many bases, each base being a site, and DNA contains four bases: A, T, C and G. The gene sequence of the human antigen molecule is therefore a base sequence composed of A, T, C and G.
Accordingly, before the gene sequence is encoded by the context encoder, the gene sequence of the antigen molecule in the human body is one-hot encoded to convert it into an input vector. The input vector is further passed through the multi-scale neighborhood feature extraction module to obtain the training antigen molecule gene feature vector. In addition, to accurately evaluate the homology between the modified rabbit-derived antibody and the antigen molecule in the human body, the gene sequence of the training modified rabbit-derived antibody must also be obtained. This gene sequence is processed by the converter-based context encoder and the multi-scale neighborhood feature extraction module to obtain the training modified rabbit-derived antibody gene feature vector, which carries both global features and multi-scale neighborhood correlation features under different gene segment spans.
More specifically, in the training phase, in step S130, a transfer matrix of the training modified rabbit-derived antibody gene feature vector relative to the training antigen molecule gene feature vector is calculated as a training classification feature matrix. The gene features of the training modified rabbit-derived antibody and those of the training antigen molecule in the human body have different feature scales in the high-dimensional feature space, and the humanized antibody requires high affinity. Therefore, to judge the homology accurately, the transfer matrix of the training modified rabbit-derived antibody gene feature vector relative to the training antigen molecule gene feature vector is calculated and used for classification, and the homology between the training modified rabbit-derived antibody and the training antigen molecule in the human body is thereby evaluated so that a higher degree of humanization can be obtained. In a specific example of the present application, the transfer matrix of the training modified rabbit-derived antibody gene feature vector relative to the training antigen molecule gene feature vector is calculated as the training classification feature matrix according to the following formula;
wherein the formula is:
wherein $V_1$ represents the training modified rabbit-derived antibody gene feature vector, $V_2$ represents the training antigen molecule gene feature vector, $M$ represents the training classification feature matrix, and the operator between $M$ and the feature vector represents matrix multiplication.
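For illustration only, suppose the transfer matrix $M$ is defined by $V_1 = M V_2$. This direction is an assumed reading of the definitions above; the opposite direction would simply swap the two arguments. A single vector equation does not determine $M$ uniquely, so the sketch below picks the minimum-Frobenius-norm solution $M = V_1 V_2^{\top} / (V_2^{\top} V_2)$. The function name `transfer_matrix` and this choice of solution are hypothetical and are not taken from the application.

```python
import numpy as np

def transfer_matrix(v1, v2, eps=1e-12):
    """Minimum-norm M with M @ v2 == v1 (assumed reading: V1 = M V2)."""
    v1 = np.asarray(v1, dtype=np.float64)
    v2 = np.asarray(v2, dtype=np.float64)
    # The outer product has shape (len(v1), len(v2)); dividing by ||v2||^2
    # makes M @ v2 reproduce v1. eps guards against an all-zero v2.
    return np.outer(v1, v2) / (v2 @ v2 + eps)

v_antibody = np.array([1.0, -2.0, 0.5])       # stands in for V1
v_antigen = np.array([0.3, 0.1, -0.7, 2.0])   # stands in for V2
M = transfer_matrix(v_antibody, v_antigen)    # shape (3, 4)
```

The resulting matrix can then be handed to the classifier as the classification feature matrix.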
More specifically, in the training phase, in step S140, the training classification feature matrix is passed through the classifier to obtain a classification loss function value. In a specific example of the present application, passing the training classification feature matrix through the classifier to obtain a classification loss function value includes: processing the training classification feature matrix using the classifier with the following formula to generate a training classification result, wherein the formula is:
$\operatorname{softmax}\{(M_c, B_c) \mid \operatorname{Project}(F)\}$, where $\operatorname{Project}(F)$ represents projecting the training classification feature matrix as a vector, $M_c$ is the weight matrix of the fully connected layer, and $B_c$ is the bias matrix of the fully connected layer; and calculating the cross-entropy value between the training classification result and the true value of the homology between the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody in the training data as the classification loss function value.
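The classifier step can be sketched as follows, under stated assumptions: the projection is a plain row-major flattening, there is a single fully connected layer, and the true value of the homology is given as a class index. The names `classify` and `cross_entropy` and the two-class setup are hypothetical.

```python
import numpy as np

def classify(feature_matrix, weight, bias):
    """softmax of a fully connected layer applied to the flattened matrix."""
    projected = np.asarray(feature_matrix, dtype=np.float64).reshape(-1)  # Project(F)
    logits = weight @ projected + bias   # fully connected layer (M_c, B_c)
    logits = logits - logits.max()       # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the true class."""
    return float(-np.log(probabilities[true_class] + 1e-12))

rng = np.random.default_rng(0)
F = rng.normal(size=(3, 4))              # stands in for the classification feature matrix
M_c = rng.normal(size=(2, 12))           # two classes, 12 = 3 * 4 flattened features
B_c = np.zeros(2)
probs = classify(F, M_c, B_c)
loss = cross_entropy(probs, true_class=1)
```

During training, `loss` would play the role of the classification loss function value.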
More specifically, in the training phase, in step S150, a classification mode digestion inhibition loss function value of the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector is calculated, wherein the classification mode digestion inhibition loss function value is related to the square of the two-norm of the difference feature vector between the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector. In particular, in the technical solution of the present application, the classification feature matrix is a transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector. During training of the classifier, the back-propagated gradient therefore passes separately through the feature extraction models of the two feature vectors, namely the converter-based context encoder together with the multi-scale neighborhood feature extraction module, and abnormal gradient branching may cause digestion of the feature modes expressed by the modified rabbit-derived antibody gene feature vector and the antigen molecule gene feature vector. A classification mode digestion inhibition loss function is therefore introduced. In a specific example of the present application, calculating the classification mode digestion inhibition loss function value of the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector includes: calculating the classification mode digestion inhibition loss function value of the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector according to the following formula;
wherein the formula is:
wherein $V_1$ and $V_2$ respectively represent the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector, $M_1$ and $M_2$ respectively represent the weight matrices of the classifier for the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector, $\|\cdot\|_2^2$ represents the square of the two-norm of a vector, $\|\cdot\|_F$ represents the F norm of a matrix, and $\exp(\cdot)$ represents the exponential operation of a matrix or of a vector: the exponential operation of a matrix means calculating the natural exponential function value raised to the power of the feature value at each position in the matrix, and the exponential operation of a vector means calculating the natural exponential function value raised to the power of the feature value at each position in the vector. Here, by introducing the classification mode digestion inhibition loss function, the pseudo-difference of the classifier weights is pushed toward the real feature distribution difference between the modified rabbit-derived antibody gene feature vector and the antigen molecule gene feature vector. This ensures that the directional derivative is regularized near the gradient branch point during back propagation, that is, the gradient is weighted between the feature extraction modes of the modified rabbit-derived antibody gene feature vector and the antigen molecule gene feature vector, so that digestion of the classification mode is inhibited and the classification accuracy is improved. In this way, the homology between the modified rabbit-derived antibody and the antigen molecule in the human body can be accurately evaluated so as to obtain a higher degree of humanization, while the affinity is kept unchanged or even improved.
More specifically, in the training phase, in step S160, the converter-based context encoder, the multi-scale neighborhood feature extraction module, and the classifier are trained with a weighted sum of the classification loss function value and the classification mode digestion inhibition loss function value as the loss function value. That is, the weighted sum of the classification loss function value and the classification mode digestion inhibition loss function value is used to update the parameters of the context encoder, the parameters of the multi-scale neighborhood feature extraction module, and the parameters of the classifier.
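The quantities named in steps S150 and S160 can be sketched as follows. The exact form of the classification mode digestion inhibition loss is not reconstructed here: the sketch only computes the two ingredients named in the text (the squared two-norm of the feature difference and the F norm of a weight difference) and the weighted sum of two loss values. The function names and the default weights `alpha` and `beta` are hypothetical.

```python
import numpy as np

def squared_two_norm(v1, v2):
    """||V1 - V2||_2^2 for two feature vectors of equal length."""
    diff = np.asarray(v1, dtype=np.float64) - np.asarray(v2, dtype=np.float64)
    return float(diff @ diff)

def frobenius_norm(m1, m2):
    """||M1 - M2||_F for two weight matrices of equal shape."""
    diff = np.asarray(m1, dtype=np.float64) - np.asarray(m2, dtype=np.float64)
    return float(np.linalg.norm(diff, ord="fro"))

def total_loss(classification_loss, inhibition_loss, alpha=1.0, beta=0.1):
    """Weighted sum used as the training objective (weights are assumptions)."""
    return alpha * classification_loss + beta * inhibition_loss
```

In a training loop, the value returned by `total_loss` would be back-propagated through the classifier, the multi-scale neighborhood feature extraction module, and the context encoder.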
After training is completed, the inference phase is entered. That is, the converter-based context encoder, the multi-scale neighborhood feature extraction module, and the classifier are first trained as described above, and the trained converter-based context encoder, multi-scale neighborhood feature extraction module, and classifier are then used in actual inference to obtain a more accurate classification result for the homology between the modified rabbit-derived antibody and the antigen molecule in the human body.
Fig. 8 illustrates an architectural diagram of the inference stage in the antibody humanization method based on sequence encoding and decoding according to an embodiment of the present application. As shown in fig. 8, in the network structure of the inference stage, first, the gene sequence of the antigen molecule in the human body and the gene sequence of the modified rabbit-derived antibody are obtained; then, the gene sequence of the antigen molecule is passed through the trained converter-based context encoder to obtain a plurality of gene expression feature vectors, and the plurality of gene expression feature vectors are cascaded to obtain an antigen molecule global gene feature vector; meanwhile, the gene sequence of the modified rabbit-derived antibody is processed by the converter-based context encoder and the multi-scale neighborhood feature extraction module to obtain a modified rabbit-derived antibody gene feature vector; next, the gene sequence of the antigen molecule is passed through the trained multi-scale neighborhood feature extraction module to obtain an antigen molecule multi-neighborhood scale feature vector; the antigen molecule global gene feature vector and the antigen molecule multi-neighborhood scale feature vector are cascaded to obtain an antigen molecule gene feature vector; a transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector is then calculated as a classification feature matrix; and finally, the classification feature matrix is passed through the trained classifier to obtain a class probability value, wherein the class probability value represents the homology between the modified rabbit-derived antibody and the antigen molecule in the human body.
More specifically, in the inference phase, in steps S210 and S220, the gene sequence of the antigen molecule in the human body is acquired, and the gene sequence of the antigen molecule is passed through the trained converter-based context encoder to obtain a plurality of gene expression feature vectors, which are cascaded to obtain an antigen molecule global gene feature vector. Considering that each gene in the gene sequence of the antigen molecule in the human body carries contextual semantic feature information, the converter-based context encoder is used to process the gene sequence of the antigen molecule so as to extract essential features of the gene sequence based on global high-dimensional semantic features, which are better suited to characterizing the antigen molecule in the human body. Then, the plurality of gene expression feature vectors are cascaded to integrate the global implicit feature information of each gene of the antigen molecule in the human body, so that the antigen molecule global gene feature vector is obtained.
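To make the encode-then-cascade step more concrete, the following is a deliberately reduced sketch: a single self-attention layer with random, untrained weights stands in for the converter-based context encoder, and cascading is implemented as concatenation of the per-position output vectors. A real encoder would also include positional information, multiple heads, feed-forward layers, and trained parameters; all names and sizes here are hypothetical.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token matrix."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v   # one context-aware vector per sequence position

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))          # 6 positions, 4 features each (assumed)
w_q, w_k, w_v = (rng.normal(size=(4, 8)) for _ in range(3))

expression_vectors = self_attention(tokens, w_q, w_k, w_v)   # shape (6, 8)
global_vector = expression_vectors.reshape(-1)               # cascade: shape (48,)
```

Each row of `expression_vectors` corresponds to one gene expression feature vector, and `global_vector` corresponds to the antigen molecule global gene feature vector.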
More specifically, in the inference phase, in step S230 and step S240, the gene sequence of the antigen molecule is passed through the trained multi-scale neighborhood feature extraction module to obtain an antigen molecule multi-neighborhood scale feature vector, and the antigen molecule global gene feature vector and the antigen molecule multi-neighborhood scale feature vector are cascaded to obtain the antigen molecule gene feature vector. It should be understood that, in the gene sequence of the antigen molecule in the human body, each gene segment has different implicit features under different gene segment spans, and the multi-scale neighborhood feature extraction module can extract the associated features under these different spans. Therefore, in the technical solution of the present application, the multi-scale neighborhood feature extraction module is further used to encode the gene sequence of the antigen molecule so as to extract the multi-scale neighborhood associated features of the gene sequence of the antigen molecule in the human body under different gene segment spans, thereby obtaining the antigen molecule multi-neighborhood scale feature vector. The antigen molecule global gene feature vector and the antigen molecule multi-neighborhood scale feature vector are then cascaded for feature fusion to obtain the antigen molecule gene feature vector.
Fig. 9 illustrates a flowchart of the extraction process of the antigen molecule multi-scale neighborhood features in the antibody humanization method based on sequence encoding and decoding according to an embodiment of the present application. As shown in fig. 9, the process of extracting the multi-scale neighborhood features of the antigen molecule includes: S231, inputting the gene sequence of the antigen molecule into a first convolution layer of the multi-scale neighborhood feature extraction module to obtain a first neighborhood scale antigen molecule feature vector, wherein the first convolution layer has a first one-dimensional convolution kernel with a first length; S232, inputting the gene sequence of the antigen molecule into a second convolution layer of the multi-scale neighborhood feature extraction module to obtain a second neighborhood scale antigen molecule feature vector, wherein the second convolution layer has a second one-dimensional convolution kernel with a second length, and the first length is different from the second length; and S233, cascading the first neighborhood scale antigen molecule feature vector and the second neighborhood scale antigen molecule feature vector to obtain the multi-scale neighborhood antigen molecule feature vector. More specifically, the first convolution layer of the multi-scale neighborhood feature extraction module is used to perform one-dimensional convolution coding on the gene sequence of the antigen molecule according to the following formula so as to obtain the first neighborhood scale antigen molecule feature vector;
wherein the formula is:
wherein a is the width of the first convolution kernel in the X direction, F(a) is the parameter vector of the first convolution kernel, G(X-a) is the local vector matrix operated on with the convolution kernel function, w is the size of the first convolution kernel, and X represents the gene sequence of the antigen molecule; further, the second convolution layer of the multi-scale neighborhood feature extraction module is used to perform one-dimensional convolution coding on the gene sequence of the antigen molecule according to the following formula so as to obtain the second neighborhood scale antigen molecule feature vector;
wherein the formula is:
wherein b is the width of the second convolution kernel in the X direction, F(b) is the parameter vector of the second convolution kernel, G(X-b) is the local vector matrix operated on with the convolution kernel function, m is the size of the second convolution kernel, and X represents the gene sequence of the antigen molecule.
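One reading consistent with the symbol definitions above is the discrete convolution $\sum_{a=1}^{w} F(a)\,G(X-a)$ for the first layer and $\sum_{b=1}^{m} F(b)\,G(X-b)$ for the second; this form is an assumption, not a quotation of the formula. The sketch below applies two one-dimensional kernels of different lengths to the same numeric sequence and cascades the results. The scalar encoding of the sequence, the kernel values, and the function names are hypothetical.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Discrete 1-D convolution evaluated only where the kernel fully overlaps."""
    return np.convolve(signal, kernel, mode="valid")

def multi_scale_features(signal, kernel_a, kernel_b):
    """Apply two kernels of different lengths and concatenate the outputs."""
    first_scale = conv1d_valid(signal, kernel_a)    # first neighborhood scale
    second_scale = conv1d_valid(signal, kernel_b)   # second neighborhood scale
    return np.concatenate([first_scale, second_scale])

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # stands in for an encoded gene sequence
kernel_a = np.ones(3)                          # first kernel, length w = 3
kernel_b = np.ones(2)                          # second kernel, length m = 2
features = multi_scale_features(signal, kernel_a, kernel_b)
# features is [6, 9, 12, 3, 5, 7, 9]: three length-3 window sums, then four length-2 sums
```

Using kernels of different lengths lets each output position summarize a different span of neighboring positions, which is the point of the multi-scale design.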
More specifically, in the inference phase, in step S250, the gene sequence of the modified rabbit-derived antibody is obtained. It should be understood that, in order to accurately evaluate the homology between the modified rabbit-derived antibody and the antigen molecule in the human body, the gene sequence of the modified rabbit-derived antibody needs to be obtained. In the technical solution of the present application, the gene sequence of the modified rabbit-derived antibody can be obtained by a gene sequence analyzer.
More specifically, in the inference phase, in step S260, the gene sequence of the modified rabbit-derived antibody is processed by the context encoder based on converter and the multi-scale neighborhood feature extraction module to obtain a modified rabbit-derived antibody gene feature vector. Similarly, the gene sequence of the modified rabbit-derived antibody is processed through the context encoder based on the converter and the multi-scale neighborhood feature extraction module, so that a modified rabbit-derived antibody gene feature vector with global multi-scale neighborhood correlation features under different gene segment spans is obtained.
More specifically, in the inference phase, in steps S270 and S280, a transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector is calculated as a classification feature matrix, and the classification feature matrix is passed through the trained classifier to obtain a class probability value, wherein the class probability value represents the homology between the modified rabbit-derived antibody and the antigen molecule in the human body. In a specific example of the present application, calculating the transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector as the classification feature matrix includes: calculating the transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector as the classification feature matrix according to the following formula;
wherein the formula is:
wherein $V_1$ represents the modified rabbit-derived antibody gene feature vector, $V_2$ represents the antigen molecule gene feature vector, $M$ represents the classification feature matrix, and the operator between $M$ and the feature vector represents matrix multiplication.
In summary, the antibody humanization method based on sequence encoding and decoding has been described. It adopts an artificial intelligence model based on natural semantic understanding, treats a gene sequence as a text sequence, and fuses the global implicit features of the gene sequence with the multi-scale neighborhood correlation features under different gene segment spans to represent the feature distribution information of the gene sequence of the antigen molecule in the human body and of the gene sequence of the modified rabbit-derived antibody, respectively. The transfer matrix of the gene features of the modified rabbit-derived antibody relative to the gene features of the antigen molecule is then used to evaluate the homology between the modified rabbit-derived antibody and the antigen molecule in the human body, and thereby to check the homology between the gene sequence of the modified rabbit-derived antibody and the gene sequence of the antigen molecule in the human body.
Exemplary System
FIG. 10 illustrates a block diagram of an antibody humanization system based on sequence coding, according to an embodiment of the present application. As shown in fig. 10, the antibody humanization system 500 based on sequence coding according to an embodiment of the present application includes: a training module 510 and an inference module 520.
As shown in fig. 10, the training module 510 includes: a training data obtaining unit 511, configured to obtain training data, where the training data includes a gene sequence of a training antigen molecule, a gene sequence of a training modified rabbit-derived antibody, and a true value of a degree of homology between the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody; a feature vector extraction unit 512, configured to pass the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody through the converter-based context encoder and the multi-scale neighborhood feature extraction module, respectively, to obtain a training antigen molecule gene feature vector and a training modified rabbit-derived antibody gene feature vector; a training classification feature matrix generating unit 513, configured to calculate a transfer matrix of the rabbit-derived antibody gene feature vector after the training modification relative to the training antigen molecule gene feature vector as a training classification feature matrix;
a classification loss function value calculating unit 514, configured to pass the training classification feature matrix through the classifier to obtain a classification loss function value; a classification mode digestion inhibition loss function value calculation unit 515, configured to calculate a classification mode digestion inhibition loss function value of the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector, where the classification mode digestion inhibition loss function value is related to the square of the two-norm of the difference feature vector between the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector; and a training unit 516, configured to train the converter-based context encoder, the multi-scale neighborhood feature extraction module, and the classifier with a weighted sum of the classification loss function value and the classification mode digestion inhibition loss function value as the loss function value.
As shown in fig. 10, the inference module 520 includes: a physiological information acquisition unit 521, configured to acquire a gene sequence of an antigen molecule in a human body; an antigen molecule global gene feature vector generating unit 522, configured to pass the gene sequence of the antigen molecule through a trained converter-based context encoder to obtain a plurality of gene expression feature vectors, and cascade the plurality of gene expression feature vectors to obtain an antigen molecule global gene feature vector; a multi-scale neighborhood feature extraction unit 523, configured to pass the gene sequence of the antigen molecule through a trained multi-scale neighborhood feature extraction module to obtain an antigen molecule multi-neighborhood scale feature vector; a cascading unit 524, configured to cascade the antigen molecule global gene feature vector and the antigen molecule multi-neighborhood scale feature vector to obtain an antigen molecule gene feature vector; a rabbit-derived antibody gene sequence acquisition unit 525, configured to acquire the gene sequence of the modified rabbit-derived antibody; a modified rabbit-derived antibody gene feature extraction unit 526, configured to process the gene sequence of the modified rabbit-derived antibody through the converter-based context encoder and the multi-scale neighborhood feature extraction module to obtain a modified rabbit-derived antibody gene feature vector; a classification feature matrix generating unit 527, configured to calculate a transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector as a classification feature matrix; and a class probability value generating unit 528, configured to pass the classification feature matrix through the trained classifier to obtain a class probability value, where the class probability value represents the degree of homology between the modified rabbit-derived antibody and the antigen molecule in the human body.
In one example, in the antibody humanization system 500 based on the sequence codec described above, the classification loss function value calculating unit 514 includes: processing the training classification feature matrix using the classifier with the following formula to generate a training classification result, wherein the formula is:
$\operatorname{softmax}\{(M_c, B_c) \mid \operatorname{Project}(F)\}$, where $\operatorname{Project}(F)$ represents projecting the training classification feature matrix as a vector, $M_c$ is the weight matrix of the fully connected layer, and $B_c$ is the bias matrix of the fully connected layer; and calculating the cross-entropy value between the training classification result and the true value of the homology between the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody in the training data as the classification loss function value.
In one example, in the antibody humanization system 500 based on the sequence codec described above, the classification pattern digestion inhibition loss function value calculation unit 515 includes: calculating the classification mode digestion inhibition loss function values of the rabbit source antibody gene characteristic vector after the training transformation and the training antigen molecule gene characteristic vector according to the following formula;
wherein the formula is:
wherein $V_1$ and $V_2$ respectively represent the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector, $M_1$ and $M_2$ respectively represent the weight matrices of the classifier for the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector, $\|\cdot\|_2^2$ represents the square of the two-norm of a vector, $\|\cdot\|_F$ represents the F norm of a matrix, and $\exp(\cdot)$ represents the exponential operation of a matrix or of a vector: the exponential operation of a matrix means calculating the natural exponential function value raised to the power of the feature value at each position in the matrix, and the exponential operation of a vector means calculating the natural exponential function value raised to the power of the feature value at each position in the vector.
In one example, in the antibody humanization system 500 based on sequence coding, the multi-scale neighborhood feature extraction unit 523 is further configured to: inputting the gene sequence of the antigen molecule into a first convolution layer of the multi-scale neighborhood characteristic extraction module to obtain a first neighborhood scale antigen molecule characteristic vector, wherein the first convolution layer has a first one-dimensional convolution kernel with a first length; inputting the gene sequence of the antigen molecule into a second convolution layer of the multi-scale neighborhood characteristic extraction module to obtain a second neighborhood scale antigen molecule characteristic vector, wherein the second convolution layer has a second one-dimensional convolution kernel with a second length, and the first length is different from the second length; and cascading the first neighborhood scale antigen molecule feature vector and the second neighborhood scale antigen molecule feature vector to obtain the multi-scale neighborhood antigen molecule feature vector.
In one example, in the antibody humanization system 500 based on the sequence codec, the classification feature matrix generating unit 527 includes: calculating a transfer matrix of the modified rabbit source antibody gene characteristic vector relative to the antigen molecule gene characteristic vector by using the following formula as the classification characteristic matrix;
wherein the formula is:
wherein $V_1$ represents the modified rabbit-derived antibody gene feature vector, $V_2$ represents the antigen molecule gene feature vector, $M$ represents the classification feature matrix, and the operator between $M$ and the feature vector represents matrix multiplication.
In summary, the antibody humanization system based on sequence encoding and decoding has been described. It adopts an artificial intelligence model based on natural semantic understanding, treats a gene sequence as a text sequence, and fuses the global implicit features of the gene sequence with the multi-scale neighborhood correlation features under different gene segment spans to represent the feature distribution information of the gene sequence of the antigen molecule in the human body and of the gene sequence of the modified rabbit-derived antibody, respectively. The transfer matrix of the gene features of the modified rabbit-derived antibody relative to the gene features of the antigen molecule is then used to evaluate the homology between the modified rabbit-derived antibody and the antigen molecule in the human body, and thereby to check the homology between the gene sequence of the modified rabbit-derived antibody and the gene sequence of the antigen molecule in the human body.
Claims (8)
1. A method for humanizing an antibody based on a sequence encoding/decoding, comprising:
obtaining a gene sequence of an antigen molecule in a human body;
passing the gene sequence of the antigen molecule through a trained converter-based context encoder to obtain a plurality of gene expression feature vectors, and cascading the plurality of gene expression feature vectors to obtain an antigen molecule global gene feature vector;
passing the gene sequence of the antigen molecule through a trained multi-scale neighborhood feature extraction module to obtain an antigen molecule multi-neighborhood scale feature vector;
cascading the antigen molecule global gene feature vector and the antigen molecule multi-neighborhood scale feature vector to obtain an antigen molecule gene feature vector;
obtaining a gene sequence of a modified rabbit-derived antibody;
processing the gene sequence of the modified rabbit-derived antibody through the context encoder based on the converter and the multi-scale neighborhood feature extraction module to obtain a modified rabbit-derived antibody gene feature vector;
calculating a transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector as a classification feature matrix; and
passing the classification feature matrix through a trained classifier to obtain a class probability value, wherein the class probability value represents the degree of homology between the modified rabbit-derived antibody and the antigen molecule in the human body.
2. The antibody humanization method based on sequence coding and decoding according to claim 1, wherein the passing the gene sequence of the antigen molecule through a trained multi-scale neighborhood feature extraction module to obtain an antigen molecule multi-neighborhood feature vector comprises:
inputting the gene sequence of the antigen molecule into a first convolution layer of the multi-scale neighborhood characteristic extraction module to obtain a first neighborhood scale antigen molecule characteristic vector, wherein the first convolution layer has a first one-dimensional convolution kernel with a first length;
inputting the gene sequence of the antigen molecule into a second convolution layer of the multi-scale neighborhood characteristic extraction module to obtain a second neighborhood scale antigen molecule characteristic vector, wherein the second convolution layer has a second one-dimensional convolution kernel with a second length, and the first length is different from the second length; and
cascading the first neighborhood scale antigen molecule feature vector and the second neighborhood scale antigen molecule feature vector to obtain the multi-scale neighborhood antigen molecule feature vector.
3. The method of claim 2, wherein inputting the gene sequence of the antigen molecule into the first convolution layer of the multi-scale neighborhood feature extraction module to obtain a first neighborhood scale antigen molecule feature vector comprises:
performing one-dimensional convolution coding on the gene sequence of the antigen molecule by using the first convolution layer of the multi-scale neighborhood feature extraction module according to the following formula to obtain the first neighborhood scale antigen molecule feature vector;
wherein the formula is:
wherein a is the width of the first convolution kernel in the X direction, F(a) is the parameter vector of the first convolution kernel, G(X-a) is the local vector matrix operated on with the convolution kernel function, w is the size of the first convolution kernel, and X represents the gene sequence of the antigen molecule.
4. The method of claim 3, wherein inputting the gene sequence of the antigen molecule into a second convolution layer of the multi-scale neighborhood feature extraction module to obtain a second neighborhood scale antigen molecule feature vector comprises:
performing one-dimensional convolution coding on the gene sequence of the antigen molecule by using the second convolution layer of the multi-scale neighborhood feature extraction module according to the following formula to obtain the second neighborhood scale antigen molecule feature vector;
wherein the formula is:
wherein b is the width of the second convolution kernel in the X direction, F(b) is the parameter vector of the second convolution kernel, G(X-b) is the local vector matrix operated on with the convolution kernel function, m is the size of the second convolution kernel, and X represents the gene sequence of the antigen molecule.
5. The method of claim 4, wherein the step of calculating a transfer matrix of the modified rabbit-derived antibody gene feature vector relative to the antigen molecule gene feature vector as a classification feature matrix comprises:
calculating a transfer matrix of the modified rabbit source antibody gene characteristic vector relative to the antigen molecule gene characteristic vector by using the following formula as the classification characteristic matrix;
wherein the formula is:
6. The sequence codec-based antibody humanization method according to claim 1, further comprising training the converter-based context encoder, the multi-scale neighborhood feature extraction module, and the classifier;
the training the converter-based context encoder, the multi-scale neighborhood feature extraction module, and the classifier includes:
acquiring training data, wherein the training data comprise a gene sequence of a training antigen molecule, a gene sequence of a training modified rabbit-derived antibody and a true value of the homology between the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody;
passing the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody, respectively, through the converter-based context encoder and the multi-scale neighborhood feature extraction module to obtain a training antigen molecule gene feature vector and a training modified rabbit-derived antibody gene feature vector;
calculating a transfer matrix of the training modified rabbit-derived antibody gene feature vector relative to the training antigen molecule gene feature vector as a training classification feature matrix;
passing the training classification feature matrix through the classifier to obtain a classification loss function value;
calculating a classification mode digestion inhibition loss function value of the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector, wherein the classification mode digestion inhibition loss function value is related to the square of the two-norm of the difference feature vector between the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector; and
training the converter-based context encoder, the multi-scale neighborhood feature extraction module, and the classifier with a weighted sum of the classification loss function value and the classification mode digestion inhibition loss function value as the loss function value.
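The final step of claim 6 combines two scalar losses into one training objective. A minimal sketch, assuming hypothetical weights `alpha` and `beta` (the claim gives no values) and using only the squared two-norm of the difference vector as a stand-in for the second term, since the claim states only that the term is related to that quantity:

```python
import numpy as np


def squared_l2_difference(v1: np.ndarray, v2: np.ndarray) -> float:
    """Square of the two-norm of the difference vector v1 - v2."""
    diff = v1 - v2
    return float(np.dot(diff, diff))


def total_loss(classification_loss: float, second_term: float,
               alpha: float = 1.0, beta: float = 0.1) -> float:
    """Weighted sum of the two loss values; alpha and beta are assumed weights."""
    return alpha * classification_loss + beta * second_term


# Toy feature vectors standing in for the two training feature vectors.
v_antibody = np.array([0.2, 0.7, 0.1])
v_antigen = np.array([0.1, 0.5, 0.4])

term = squared_l2_difference(v_antibody, v_antigen)
loss = total_loss(classification_loss=0.9, second_term=term)
```

In practice the weights would be tuned as hyperparameters, and the second term would follow the full formula of claim 7 rather than this simplified stand-in.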
7. The method of claim 6, wherein calculating the classification mode digestion inhibition loss function value of the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector comprises:
calculating the classification mode digestion inhibition loss function value of the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector according to the following formula;
wherein the formula is:
wherein $V_1$ and $V_2$ respectively represent the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector, $M_1$ and $M_2$ respectively represent the weight matrices of the classifier for the training modified rabbit-derived antibody gene feature vector and the training antigen molecule gene feature vector, $\|\cdot\|_2^2$ represents the square of the two-norm of a vector, $\|\cdot\|_F$ represents the F norm (Frobenius norm) of a matrix, and $\exp(\cdot)$ represents the exponential operation of a matrix or a vector, the exponential operation of a matrix being the calculation of the natural exponential function value raised to the power of the feature value at each position in the matrix, and the exponential operation of a vector being the calculation of the natural exponential function value raised to the power of the feature value at each position in the vector.
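The operations named in claim 7 are all standard and can be illustrated separately from the loss formula itself, which is not reproduced in this text. The NumPy sketch below uses toy values chosen for illustration; note that the exponential described in the claim is position-wise (`np.exp`), not the matrix exponential of linear algebra.

```python
import numpy as np

v = np.array([3.0, 4.0])         # toy vector
M = np.array([[1.0, 2.0],
              [2.0, 4.0]])       # toy matrix

# Square of the two-norm of a vector: 3^2 + 4^2 = 25
sq_two_norm = float(np.sum(v ** 2))

# Frobenius norm of a matrix: sqrt(1 + 4 + 4 + 16) = 5
fro_norm = float(np.linalg.norm(M, "fro"))

# Position-wise exponentials: e raised to the value at each position
exp_v = np.exp(v)
exp_M = np.exp(M)
```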
8. The method of claim 7, wherein passing the training classification feature matrix through the classifier to obtain a classification loss function value comprises:
processing the training classification feature matrix using the classifier according to the following formula to generate a training classification result, wherein the formula is:
$\mathrm{softmax}\{(M_c, B_c) \mid \mathrm{Project}(F)\}$, where $\mathrm{Project}(F)$ represents the projection of the training classification feature matrix as a vector, $M_c$ is the weight matrix of the fully connected layer, and $B_c$ represents the bias matrix of the fully connected layer; and
calculating a cross entropy value between the training classification result and the true value of the homology between the gene sequence of the training antigen molecule and the gene sequence of the training modified rabbit-derived antibody in the training data, as the classification loss function value.
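The steps of claim 8 can be sketched as: flatten the classification feature matrix, apply one fully connected layer, take a softmax, and compute the cross entropy against the true label. This is a minimal NumPy sketch; the 4x4 matrix size, the two-class label encoding, and the random weights are assumptions, and `project`, `softmax`, and `cross_entropy` are illustrative names rather than anything defined in the patent. The bias is represented here as a vector.

```python
import numpy as np


def project(feature_matrix: np.ndarray) -> np.ndarray:
    """Flatten the classification feature matrix into a vector."""
    return feature_matrix.reshape(-1)


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()


def cross_entropy(probs: np.ndarray, true_class: int) -> float:
    """Negative log-probability assigned to the true class."""
    return float(-np.log(probs[true_class] + 1e-12))


rng = np.random.default_rng(0)
F = rng.normal(size=(4, 4))      # toy training classification feature matrix
M_c = rng.normal(size=(2, 16))   # fully connected weights (2 assumed classes)
B_c = np.zeros(2)                # fully connected bias

probs = softmax(M_c @ project(F) + B_c)   # training classification result
loss = cross_entropy(probs, true_class=1) # classification loss function value
```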
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211128757.XA CN115458048B (en) | 2022-09-16 | 2022-09-16 | Antibody humanization method based on sequence coding and decoding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115458048A (en) | 2022-12-09 |
CN115458048B (en) | 2023-05-26 |
Family
ID=84304404
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202211128757.XA | Active | CN115458048B (en) | 2022-09-16 | 2022-09-16 | Antibody humanization method based on sequence coding and decoding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115458048B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2005057486A2 (en) * | 2003-12-08 | 2005-06-23 | Xencor, Inc. | Protein engineering with analogous contact environments |
US20120141486A1 (en) * | 2010-12-06 | 2012-06-07 | Dainippon Sumitomo Pharma Co., Ltd. | Human monoclonal antibody |
CN103145834A (en) * | 2013-01-17 | 2013-06-12 | 广州泰诺迪生物科技有限公司 | Antibody humanization transformation method |
US20190065677A1 (en) * | 2017-01-13 | 2019-02-28 | Massachusetts Institute Of Technology | Machine learning based antibody design |
US20200342955A1 (en) * | 2017-10-27 | 2020-10-29 | Apostle, Inc. | Predicting cancer-related pathogenic impact of somatic mutations using deep learning-based methods |
US20200087395A1 (en) * | 2018-09-14 | 2020-03-19 | Eli Lilly And Company | Cd200r agonist antibodies and uses thereof |
CN114664376A (en) * | 2022-03-31 | 2022-06-24 | 重庆邮电大学 | miRNA-mRNA target prediction method based on sequence statistical characterization learning |
Non-Patent Citations (3)
Title |
---|
YAGHOUB SAFDARI et al.: "Antibody humanization methods–a review and update" *
YI-FAN ZHANG et al.: "Humanization of rabbit monoclonal antibodies via grafting combined Kabat/IMGT/Paratome complementarity-determining regions: Rationale and examples" *
MA Wei (马威): "Construction of machine-learning-based virtual screening models, and homology modeling and structural validation of the PAR4 protein" *
Also Published As
Publication number | Publication date |
---|---|
CN115458048B (en) | 2023-05-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||