CN115116543A

CN115116543A - Antigen-antibody binding site determination method, device, equipment and storage medium

Info

Publication number: CN115116543A
Application number: CN202210407121.2A
Authority: CN
Inventors: 黄宁桥; 蒋彪彬; 吴家祥; 于洋; 刘伟
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-04-18
Filing date: 2022-04-18
Publication date: 2022-09-27

Abstract

Embodiments of the present disclosure provide a method, apparatus, device and computer-readable storage medium for antigen-antibody binding site determination. The method combines the sequence information and the structural information of antibodies: based on key regions of the antibody, it determines the structural similarity and the sequence similarity between an antibody to be predicted and other known antibodies, and takes the antigen binding site of the known antibody most similar to the antibody to be predicted as the binding site of the antibody to be predicted, thereby achieving accurate prediction of the antibody binding site. The method can assist the design of antibody drugs targeting a specific binding site and can quickly and accurately predict changes in the binding site caused by antigen mutation, thereby accelerating the research and development of vaccines and antibody drugs.

Description

Antigen-antibody binding site determination method, device, equipment and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence and biology, and more particularly, to a method, an apparatus, a device, and a storage medium for determining an antigen-antibody binding site.

Background

Antibody binding site prediction is an important unsolved problem in immunology and is a prerequisite for artificial intelligence to assist vaccine and synthetic antibody design. Currently, the most accurate way to determine binding sites is to identify residues that are spatially close in the three-dimensional structure of the binding region, which can be obtained by experimental techniques such as X-ray crystallography. However, these experimental methods are time-consuming and expensive, and it is desirable to develop computational methods that overcome these problems so that treatments can be developed more quickly.

The prior art has demonstrated that sequence-based clustering can identify antibodies with similar binding sites, for example by clonotyping, in which sequences are clustered by lineage. In addition, appropriate mining of the structural information of antibodies also helps to classify the binding sites of antibodies to antigens efficiently. However, lineage clustering based on antibody sequences performs poorly for antibodies that belong to different lineages but share the same binding site, and for antibodies whose sequence lengths differ greatly. Moreover, only a very small fraction of the antibodies in an antibody database have experimentally determined structures, so in most cases reliable structural information is not directly available and must instead be obtained by structure prediction, and the existing methods based on antibody structural information cannot effectively classify antibody sequences for which no template is available.

Therefore, there is a need for an efficient and accurate method for antigen-antibody binding site prediction, such that antigen-antibody binding site determination can be achieved for template-free antibodies with varying lengths between antibody sequences.

Disclosure of Invention

In order to solve the above problems, the present disclosure combines the sequence information and the structural information of antibodies, and uses both the structural similarity and the sequence similarity between an antibody to be predicted and other known antibodies to determine the known antibody most similar to the antibody to be predicted, thereby achieving accurate prediction of the antibody binding site.

Embodiments of the present disclosure provide a method, apparatus, device and computer-readable storage medium for antigen-antibody binding site determination.

Embodiments of the present disclosure provide a method for determining an antigen-antibody binding site, comprising: obtaining an antibody to be predicted and a plurality of known antibodies corresponding to the same antigen as the antibody to be predicted; determining the heavy chain sequence similarity and the light chain sequence similarity of each of the plurality of known antibodies and the antibody to be predicted respectively based on each of the plurality of known antibodies and at least a part of the respective heavy chain and at least a part of the respective light chain of the antibody to be predicted; determining a structural similarity of each of the plurality of known antibodies to the antibody to be predicted based on the plurality of known antibodies and the heavy chain of the antibody to be predicted; and determining one known antibody which is most similar to the antibody to be predicted in the plurality of known antibodies based on the structural similarity, the heavy chain sequence similarity and the light chain sequence similarity of each of the plurality of known antibodies and the antibody to be predicted, and taking the binding site of the known antibody and the antigen as the binding site of the antibody to be predicted and the antigen.

Embodiments of the present disclosure provide an antigen-antibody binding site determination device, including: an antibody obtaining module configured to obtain an antibody to be predicted and a plurality of known antibodies corresponding to the same antigen as the antibody to be predicted; a sequence alignment module configured to determine a heavy chain sequence similarity and a light chain sequence similarity of each of the plurality of known antibodies to the antibody to be predicted based on each of the plurality of known antibodies and at least a portion of the respective heavy chain and at least a portion of the respective light chain of the antibody to be predicted, respectively; a structural alignment module configured to determine a structural similarity of each of the plurality of known antibodies to the antibody to be predicted based on the plurality of known antibodies and the heavy chains of the antibody to be predicted; and a site determination module configured to determine one of the plurality of known antibodies that is most similar to the antibody to be predicted based on the structural similarity, the heavy chain sequence similarity and the light chain sequence similarity of each of the plurality of known antibodies to the antibody to be predicted, and to use the binding site of the known antibody to the antigen as the binding site of the antibody to be predicted to the antigen.

Embodiments of the present disclosure provide an antigen-antibody binding site determining apparatus including: one or more processors; and one or more memories, wherein the one or more memories have stored therein a computer-executable program that, when executed by the one or more processors, performs the antigen-antibody binding site determination method as described above.

Embodiments of the present disclosure provide a computer-readable storage medium having stored thereon computer-executable instructions for implementing the antigen-antibody binding site determination method as described above when executed by a processor.

Embodiments of the present disclosure provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the antigen-antibody binding site determination method according to an embodiment of the present disclosure.

Compared with existing techniques that determine binding sites by lineage clustering of antibody sequences or by mining antibody structural information, the method provided by the embodiments of the present disclosure can accurately predict antigen-antibody binding sites without being affected by differences in sequence length. It models antibody structures in a uniform manner and uses de novo folding for all regions that are difficult to predict, thereby avoiding the problem of template-free antibodies.

The method provided by the embodiments of the present disclosure combines the sequence information and the structural information of antibodies: based on key regions of the antibody, it determines the structural similarity and the sequence similarity between the antibody to be predicted and other known antibodies, and takes the antigen binding site of the known antibody most similar to the antibody to be predicted as the binding site of the antibody to be predicted, thereby achieving accurate prediction of the antibody binding site. The method can assist the design of antibody drugs targeting a specific binding site and can quickly and accurately predict changes in the binding site caused by antigen mutation, thereby accelerating the research and development of vaccines and antibody drugs.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only exemplary embodiments of the disclosure, and that other drawings may be derived from those drawings by a person of ordinary skill in the art without inventive effort.

Fig. 1 is a schematic diagram illustrating determining structural similarity between antibody sequences according to an embodiment of the present disclosure;

Fig. 2 is a flow diagram illustrating an antigen-antibody binding site determination method according to an embodiment of the present disclosure;

Fig. 3 is a schematic flow diagram illustrating the determination of sequence similarity between a known antibody and an antibody to be predicted according to an embodiment of the present disclosure;

Fig. 4A is a schematic diagram showing partial numbering results of IMGT numbering of multiple antibody sequences according to an embodiment of the present disclosure;

Fig. 4B is a schematic diagram illustrating extraction of CDR region sequences from an antibody sequence based on IMGT numbering according to an embodiment of the present disclosure;

Fig. 4C is a schematic diagram illustrating a multiple sequence alignment for CDRH3 region sequences according to an embodiment of the present disclosure;

Fig. 5 is a flow diagram illustrating determining structural similarity of a known antibody and an antibody to be predicted according to an embodiment of the present disclosure;

Fig. 6 is a schematic flow diagram illustrating determining structural similarity of a known antibody and an antibody to be predicted according to an embodiment of the present disclosure;

Fig. 7 is a schematic diagram illustrating an antigen-antibody binding site determining apparatus according to an embodiment of the present disclosure;

Fig. 8 shows a schematic diagram of an antigen-antibody binding site determining apparatus according to an embodiment of the present disclosure;

Fig. 9 shows a schematic diagram of an architecture of an exemplary computing device according to an embodiment of the present disclosure; and

Fig. 10 shows a schematic diagram of a storage medium according to an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the present disclosure more apparent, example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.

In the present specification and the drawings, steps and elements having substantially the same or similar characteristics are denoted by the same or similar reference numerals, and repeated description of the steps and elements will be omitted. Meanwhile, in the description of the present disclosure, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance or order.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.

For the purpose of describing the present disclosure, concepts related to the present disclosure are introduced below.

The antigen-antibody binding site determination method of the present disclosure may be based on artificial intelligence (AI). Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. For example, the artificial intelligence-based antigen-antibody binding site determination method can identify the known antibody whose structure and sequence are most similar to those of the antibody to be predicted, in a manner similar to how a human would visually judge the similarity between the structures and sequences of the two antibodies, and thereby determine the binding site of the antibody to be predicted to the antigen. By studying the design principles and implementation methods of various intelligent machines, artificial intelligence enables the antigen-antibody binding site determination method of the present disclosure to quickly and accurately predict changes in binding sites caused by antigen mutations.

The antigen-antibody binding site determination methods of the present disclosure may be based on antigen-antibody interaction studies. The surface of an antigen molecule presents unique groups, called antigenic determinants or epitopes, that are specifically recognized and bound by antibody molecules and cell receptors. An epitope is a structure inherent to a protein molecule, and its function is exhibited when it binds to a reactive substance. Epitopes are the basis of the antigenicity of virus molecules; determining their sequence and conformation is the basis for revealing the mechanism of antigen-antibody interaction and provides important guidance for the design of bioactive drugs and vaccines. Antibody molecules may be composed of light chains and heavy chains, each comprising a variable region, which determines the specificity of binding to an antigen, and a constant region, which is associated with the immune effector function of the antibody. The complementarity-determining regions (CDRs) in the variable region are the potential regions where antibodies bind antigens. They are structurally highly flexible and are the key structural regions determining the diversity and specificity of antigen recognition by antibodies. B cells producing different antibodies can change the amino acids of these regions through V(D)J gene recombination and somatic hypermutation, thereby enhancing the binding capacity to antigens.

Optionally, the antigen-antibody binding site determination methods of the present disclosure may be based on antibody sequence numbering methods. Standardized antibody numbering methods can be used to precisely define the complementarity determining regions and the residues of the light and heavy chains that affect the binding affinity and specificity of antibody-antigen interactions. Antibody numbering may use any of a variety of common numbering schemes, such as the Kabat numbering scheme, the Chothia numbering scheme, and the international ImMunoGeneTics information system (IMGT) numbering scheme employed in the present disclosure. IMGT is the main reference numbering scheme for immunogenetics and immunoinformatics. It numbers the amino acid units of antibodies so that the amino acid types at highly conserved positions are fixed after numbering, which facilitates alignment of the numbered antibody sequences. Of course, the use of the IMGT numbering scheme for antibody sequence numbering in the present disclosure is an example only and not a limitation, and other numbering schemes that achieve similar effects are equally applicable to the antigen-antibody binding site determination methods of the present disclosure.

Optionally, the antigen-antibody binding site determination methods of the present disclosure may be based on multiple sequence alignment (MSA). Multiple sequence alignment aligns multiple (three or more) phylogenetically related amino acid sequences of protein molecules, or nucleic acid sequences, so that as many identical bases or amino acid residues as possible are arranged in the same column and the aligned bases or residues are evolutionarily homologous. The primary aim of multiple sequence alignment is to find similar sequences, which often originate from a common ancestral sequence and are likely to have similar spatial structures and biological functions. Thus, for a protein of known sequence but unknown structure and function, the structure and function can be inferred if the structure and function of some proteins with similar sequences are known. Therefore, in the antigen-antibody binding site determination method of the present disclosure, multiple sequence alignment can be used to search for known antibodies whose sequences are similar to that of the antibody to be predicted, which in turn can be used to infer the unknown structure of the antibody to be predicted.

In summary, the embodiments of the present disclosure provide solutions relating to artificial intelligence, antibody numbering, multiple sequence alignment, and the like, and will be further described with reference to the accompanying drawings.

Proteins play an indispensable role in the body, and are not only the material basis of cells and tissues, but also the components involved in various functional reactions. The recognition process of proteins and ligand molecules, including the recognition of antigens and antibodies, enzymes and substrates, hormones and receptors, is involved in almost all important life activities such as immune reaction, biochemical reaction, signal transduction, etc. in the body. The analysis of protein-ligand interactions is the basis for understanding the molecular mechanisms and regulatory processes of various biological functions in life activities. The research on the recognition mechanism of the antigen antibody is helpful for molecular design and mechanism explanation of vaccines, and guiding the research on therapeutic antibodies with mature affinity, and has very important significance in the aspects of disease prevention and clinical treatment and diagnosis application.

As a key step in elucidating antibody function and discovering the potential of antibodies as therapeutic agents, the identification of the binding sites of antibodies to antigens has been the subject of several studies. The prior art has demonstrated that sequence-based clustering can identify antibodies with similar binding sites. A commonly used method is clonotyping, in which sequences are clustered by lineage. This can be done in a number of ways, such as by mapping the heavy and light chains of the antibody to their nearest immunoglobulin V and J genes, and then finding antibodies of the same lineage by comparing the sequence similarity (e.g., including sequence length) of CDRH3 and CDRL3, the third of the three CDR regions of the heavy (H) and light (L) chains of the antibody, respectively. Other approaches ignore the J gene segment information or consider only part of the heavy chain sequence. Recent studies have also shown that appropriate mining of the structural information of antibodies helps to classify the binding sites of antibodies to antigens efficiently. This method first performs homology modeling on the complete Fv sequence of the antibody and, to ensure the overall modeling quality, retains only structural models whose variable regions have available templates and do not require de novo folding. After modeling, structural clustering is performed separately according to the lengths of the CDR regions of the antibody (i.e., the three CDRH regions of the heavy chain (H) of the antibody, CDRH1, CDRH2 and CDRH3, and the three CDRL regions of the light chain (L) of the antibody, CDRL1, CDRL2 and CDRL3, for a total of six CDR regions).

Fig. 1 is a schematic diagram illustrating determining structural similarity between antibody sequences according to an embodiment of the present disclosure. Each time structural similarity is evaluated for clustering, the two structures are first superposed with a spatial alignment algorithm, as shown in Fig. 1, and the root mean square deviation (RMSD) of the distances between the corresponding α-carbon atoms of the two sequences is then calculated as the basis for judging the structural similarity of the two sequences. The lower right part of Fig. 1 gives a number of exemplary sequences for the third CDR regions (CDRH3 and CDRL3), which have the highest variability among the CDR regions. These amino acid sequences are separated into classes by horizontal lines; for example, for the CDRH3 region, the 11 sequences shown are divided into four classes based on their amino acid types, and the amino acid sequences within each class can be considered likely to have similar spatial structures and biological functions.
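The RMSD calculation described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the disclosure: it assumes the two structures have already been superposed by a spatial alignment step (which is not shown), that both contain the same number of α-carbon atoms in corresponding order, and the function name `ca_rmsd` and the toy coordinates are hypothetical.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) C-alpha coordinates.

    Assumption: the structures are already superposed; no spatial
    alignment is performed inside this function.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of C-alpha atoms")
    if not coords_a:
        raise ValueError("at least one atom pair is required")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the corresponding C-alpha atoms.
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


# Toy coordinates in angstroms (hypothetical, for illustration only).
loop_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
loop_b = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
print(ca_rmsd(loop_a, loop_b))  # 1.0
```

The length check in the sketch reflects the equal-length requirement of structural spatial alignment: two loops with different numbers of residues cannot be compared this way without further processing.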

However, the above technical solution of clustering by antibody sequence has limitations. Antibody databases, such as the coronavirus antibody database (CoV-AbDab), often contain antibodies that clearly belong to different lineages but whose binding sites lie in the same region, and the presence of such antibodies directly reduces the accuracy of binding site prediction. Furthermore, since this method is generally applicable only to antibodies with dense lineage clusters and significant associations, it performs poorly for antibodies with large differences in sequence length. The existing technique of determining binding sites by mining antibody structural information also has significant drawbacks. First, only a very small fraction of the antibodies in an antibody database have experimentally determined structures, so in most cases reliable structural information is not directly available and must be obtained by structure prediction; to limit modeling errors, template-free antibody sequences are filtered out, and as a result such sequences cannot be classified. Second, in the structural clustering step, spatial alignment of structures requires antibody sequences of equal length, which further limits the number of antibodies that can be compared for similarity.

The present disclosure provides an antigen-antibody binding site determination method that combines the sequence information and the structural information of antibodies, and uses both the structural similarity and the sequence similarity between the antibody to be predicted and other known antibodies to determine the known antibody most similar to the antibody to be predicted, thereby achieving accurate prediction of the antibody binding site.
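The selection step can be illustrated with a short sketch. The text above does not state how the three similarities are combined, so the weighted average used here is purely an assumption, as are the names `Similarity`, `combined_score`, `predict_binding_site`, the antibody identifiers and the site labels. The sketch also assumes each similarity has been expressed as a score in which a higher value means more similar.

```python
from dataclasses import dataclass


@dataclass
class Similarity:
    """Similarities of one known antibody to the antibody to be predicted."""
    name: str
    structural: float  # assumed: higher means more similar
    heavy_seq: float
    light_seq: float


def combined_score(s, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three similarities (an assumed combination rule)."""
    w_struct, w_heavy, w_light = weights
    total = w_struct + w_heavy + w_light
    return (w_struct * s.structural + w_heavy * s.heavy_seq + w_light * s.light_seq) / total


def predict_binding_site(candidates, known_sites, weights=(1.0, 1.0, 1.0)):
    """Return the most similar known antibody and its binding site."""
    if not candidates:
        raise ValueError("at least one known antibody is required")
    best = max(candidates, key=lambda s: combined_score(s, weights))
    return best.name, known_sites[best.name]


# Hypothetical known antibodies and binding-site labels.
candidates = [
    Similarity("Ab1", structural=0.90, heavy_seq=0.80, light_seq=0.70),
    Similarity("Ab2", structural=0.60, heavy_seq=0.95, light_seq=0.90),
    Similarity("Ab3", structural=0.20, heavy_seq=0.30, light_seq=0.40),
]
known_sites = {"Ab1": "site A", "Ab2": "site B", "Ab3": "site C"}
print(predict_binding_site(candidates, known_sites))
```

With equal weights the sketch selects Ab2; giving structural similarity twice the weight of each sequence similarity selects Ab1 instead, which shows how sensitive the outcome is to the assumed combination rule.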

The method provided by the embodiments of the present disclosure combines the sequence information and the structural information of antibodies: based on key regions of the antibody, it determines the structural similarity and the sequence similarity between the antibody to be predicted and other known antibodies, and takes the antigen binding site of the known antibody most similar to the antibody to be predicted as the binding site of the antibody to be predicted, thereby achieving accurate prediction of the antibody binding site. The method can assist the design of antibody drugs targeting a specific binding site and can quickly and accurately predict changes in the binding site caused by antigen mutation, thereby accelerating the research and development of vaccines and antibody drugs.

Fig. 2 is a flow diagram illustrating an antigen-antibody binding site determination method 200 according to an embodiment of the present disclosure.

In step 201, an antibody to be predicted and a plurality of known antibodies corresponding to the same antigen as the antibody to be predicted may be obtained.

As described above, the obtained plurality of known antibodies and the above-mentioned antibody to be predicted act on the same antigen via the corresponding binding sites, and the sequences and structures of the known antibodies and the binding sites with the antigen can be predetermined, so that the binding sites of the antibody to be predicted and the antigen can be determined based on the similarity between the antibody to be predicted and the known antibodies. Furthermore, for the prediction of binding sites of different antibodies on the same antigen, the two partial sequences of the light and heavy chains of the antibody can be first resolved.

According to an embodiment of the present disclosure, the at least a portion of the respective heavy chain of each of the plurality of known antibodies and of the antibody to be predicted may include the three complementarity determining regions of that heavy chain. Likewise, according to an embodiment of the present disclosure, the at least a portion of the respective light chain of each of the plurality of known antibodies and of the antibody to be predicted may include the three complementarity determining regions of that light chain.

Optionally, taking one heavy chain and one light chain of an antibody as an example, the variable regions of the heavy and light chains may each have three complementarity determining regions: the three CDR regions of the heavy chain (i.e., the CDRH regions) are CDRH1, CDRH2 and CDRH3, and the three CDR regions of the light chain (i.e., the CDRL regions) are CDRL1, CDRL2 and CDRL3. The first two complementarity determining regions, CDR1 and CDR2 (i.e., CDRH1 and CDRH2, or CDRL1 and CDRL2), have lower variability than the third complementarity determining region, CDR3 (CDRH3 or CDRL3); that is, the CDRH3 region of the heavy chain and the CDRL3 region of the light chain have the highest variability within their respective peptide chains. Therefore, considering that the CDR regions are the potential regions where an antibody binds and that similar antibody sequences have approximately the same amino acid types in the highly conserved regions, the key regions of the antibody sequences (e.g., the CDR1, CDR2 and CDR3 regions, or only the CDR3 region) can be extracted and aligned in the subsequent similarity determination, without aligning the whole antibody sequences. This avoids the problem of length differences between antibody sequences causing large differences in the alignment results.

In step 202, the similarity of the heavy chain sequence and the similarity of the light chain sequence of each of the plurality of known antibodies to the antibody to be predicted can be determined based on each of the plurality of known antibodies and at least a portion of the respective heavy chain and at least a portion of the respective light chain of the antibody to be predicted.

Optionally, the heavy chain sequence similarity and the light chain sequence similarity between the antibody to be predicted and the known antibody can be determined separately for the heavy chain and the light chain of the antibody. Because the heavy chain and the light chain have similar region structures, the two similarities can be determined in a similar manner.

Fig. 3 is a schematic flow diagram illustrating determining sequence similarity between a known antibody and an antibody to be predicted according to an embodiment of the present disclosure. The flow shown in Fig. 3 may be performed separately for antibody heavy chains and for antibody light chains.

According to an embodiment of the present disclosure, determining the heavy chain sequence similarity of each of the plurality of known antibodies to the antibody to be predicted, based on at least a portion of the respective heavy chain of each of the plurality of known antibodies and of the antibody to be predicted, may comprise performing a sequence similarity alignment on the respective heavy chains of each of the plurality of known antibodies and of the antibody to be predicted. Optionally, the sequence similarity alignment may be represented as part of the boxes of the schematic flow chart shown in Fig. 3.

According to embodiments of the disclosure, the sequence similarity alignment may comprise: for each known antibody of the plurality of known antibodies, extracting three complementarity determining regions of the respective heavy chain from the respective heavy chains of the known antibody and the antibody to be predicted, respectively; and performing multiple sequence alignment on the three complementarity determining regions of the heavy chain of the known antibody and the three complementarity determining regions of the heavy chain of the antibody to be predicted so as to determine the similarity of the sequences of the heavy chains of the known antibody and the antibody to be predicted.

Optionally, sequence similarity may be determined by extracting the key regions of the sequence as described above. For example, for the heavy chain, the sequence similarity between the antibody to be predicted and a known antibody may be determined by extracting the three CDRH region sequences of the heavy chain and performing multiple sequence alignment with the CDRH region sequences of the heavy chains of the other antibodies.
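As a rough illustration of how a similarity value could be read off aligned CDRH sequences, the sketch below computes the fraction of identical columns between two rows of an alignment. The alignment itself is assumed to have been produced beforehand by a multiple sequence alignment tool. The identity measure, the averaging over the three regions, the function names and the toy sequences are all assumptions and are not taken from the disclosure.

```python
def column_identity(aligned_a, aligned_b, gap="-"):
    """Fraction of identical columns between two rows of an alignment.

    Columns that are gaps in both rows are skipped; a gap facing a
    residue counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # column carries no information for this pair
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


def heavy_chain_similarity(query_cdrs, known_cdrs):
    """Average identity over the aligned CDRH1, CDRH2 and CDRH3 rows (assumed rule)."""
    regions = ("CDRH1", "CDRH2", "CDRH3")
    return sum(column_identity(query_cdrs[r], known_cdrs[r]) for r in regions) / len(regions)


# Toy aligned CDRH rows (hypothetical sequences, gaps written as "-").
query = {"CDRH1": "GFTFSSYA", "CDRH2": "ISGSGGST", "CDRH3": "AKDRG--YFDY"}
known = {"CDRH1": "GFTFSSYG", "CDRH2": "ISGSGGST", "CDRH3": "AKDRGSSYFDY"}
print(round(heavy_chain_similarity(query, known), 3))  # 0.898
```

Because only the CDRH regions are aligned, the two heavy chains may differ in overall length without affecting the comparison; the light chain could be handled the same way with the CDRL regions.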

According to an embodiment of the present disclosure, extracting three complementarity determining regions of respective heavy chains from the respective heavy chains of the known antibody and the antibody to be predicted, respectively, may include: antibody sequence numbering is performed on the respective heavy chains of the known antibody and the antibody to be predicted, and antibody sequences within specific numbering ranges are extracted from the respective antibody sequence-numbered heavy chains of the known antibody and the antibody to be predicted, wherein the specific numbering ranges correspond to the three complementarity determining regions.

Optionally, the antibody sequence may first be numbered so that the number of each position in the sequence is determined. Because the numbering range of each key region (CDR region) is fixed under the numbering scheme, the amino acid sequence of a key region can be extracted from the antibody sequence by taking the positions whose numbers fall within that fixed range.

As an example, in embodiments of the present disclosure, the key regions are extracted from antibody sequences using the IMGT numbering method, which numbers residues consecutively from 1 to 128 based on alignment to germline V sequences (germ-line V).

Fig. 4A is a schematic diagram illustrating partial numbering results of IMGT numbering of multiple antibody sequences according to an embodiment of the disclosure. Fig. 4B is a schematic diagram illustrating extraction of CDR region sequences from antibody sequences based on IMGT numbering according to an embodiment of the disclosure.

Fig. 4A shows exemplary partial sequences of four antibody sequences SH1, SH2, SH3 and SH4 around CDRH1 (fig. 4A (a)), CDRH2 (fig. 4A (b)) and CDRH3 (fig. 4A (c)), respectively. The light grey regions in fig. 4A correspond to conserved regions of the antibody sequences and the dark grey regions correspond to CDR regions; for example, the dark grey regions in fig. 4A (a) correspond to the respective CDRH1 regions of the four antibody sequences, those in fig. 4A (b) to the CDRH2 regions, and those in fig. 4A (c) to the CDRH3 regions. As shown in fig. 4A, amino acid types vary less in the conserved regions than in the CDR regions, and CDRH1 and CDRH2 contain fewer amino acids and show less amino acid variation than CDRH3, the most variable region.

As shown in fig. 4A, under the IMGT numbering method the amino acids of each CDR region are numbered within a fixed numbering range, for example [27, 38] for the CDR1 region, [56, 65] for the CDR2 region, and [105, 117] for the CDR3 region, so that the sequence of a key region can subsequently be extracted directly, quickly and conveniently, from the numbered sequence. For example, to extract the sequence of the CDR3 region, the amino acids numbered within the range [105, 117] can be taken directly from the numbered antibody sequence.
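The range-based extraction described above can be illustrated with a minimal sketch. The input format (a list of label/residue pairs in chain order), the function names and the toy residues are assumptions made only for illustration and are not part of the disclosure; in practice the numbering itself would come from an antibody numbering tool.

```python
import re

# IMGT CDR numbering ranges quoted in the text (inclusive bounds).
IMGT_CDR_RANGES = {"CDR1": (27, 38), "CDR2": (56, 65), "CDR3": (105, 117)}


def base_number(label: str) -> int:
    """Return the integer part of a numbering label such as '111' or '111A'."""
    match = re.match(r"\d+", label)
    if match is None:
        raise ValueError(f"not a numbering label: {label!r}")
    return int(match.group())


def extract_cdr(numbered_chain, cdr_name):
    """Concatenate the residues whose number lies inside the CDR's fixed range.

    `numbered_chain` is a hypothetical input format: a list of
    (label, residue) pairs in chain order, where '-' marks an empty position.
    """
    low, high = IMGT_CDR_RANGES[cdr_name]
    return "".join(
        residue
        for label, residue in numbered_chain
        if low <= base_number(label) <= high
    )


# Toy fragment (made-up residues) covering positions 103 to 118.
toy_chain = [
    ("103", "Y"), ("104", "C"), ("105", "A"), ("106", "R"), ("107", "D"),
    ("108", "G"), ("109", "-"), ("110", "-"), ("111", "-"), ("112", "-"),
    ("113", "-"), ("114", "Y"), ("115", "F"), ("116", "D"), ("117", "V"),
    ("118", "W"),
]
print(extract_cdr(toy_chain, "CDR3"))  # prints ARDG-----YFDV
```

Because labels with insertion letters (e.g., 111A) share the integer part of their base position, they fall inside the same range without special handling.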

Optionally, for an antibody sequence whose CDR region comprises more amino acids than there are numbers in the numbering range corresponding to that CDR region, a corresponding number of additional numbers may be inserted into the numbering range, so that the number of numbers in the range equals the number of amino acids comprised in the CDR region. For example, as shown in fig. 4A (c), the antibody sequences SH1 and SH2 comprise more amino acids in the CDRH3 region than there are numbers in the corresponding numbering range (e.g., 13 in fig. 4A (c)), so additional numbers (e.g., 111A, 111B, 111C, 112D, 112C, 112B and 112A) are inserted at the middle portion of the CDRH3 region (e.g., at positions 111 and 112), so that every amino acid within the CDRH3 region is numbered within the numbering range corresponding to that region.

Optionally, for an antibody sequence whose CDR region comprises fewer amino acids than there are numbers in the numbering range corresponding to that CDR region, a corresponding number of gaps (blank positions) may be inserted into the numbering range, each gap occupying one number, so that the number of positions in the CDR region (amino acids plus gaps) equals the number of numbers in the range. For the same purpose, in fig. 4A, the gaps for such an antibody sequence may be inserted in the middle portion of the numbering range, each occupying one number, to better match the available structural data.

As described above, the number of each position in an antibody sequence can be determined by antibody sequence numbering, and the numbered antibody sequences can be positionally aligned by inserting additional numbers or gaps as described above, so that the antibody sequences (including gaps) have the same length within the numbering range corresponding to each CDR region. Antibody sequences of unequal length can thus be converted into antibody sequences of equal length and, since the numbering range of each key region (CDR region) is fixed, the sequence of a key region can be extracted from an antibody sequence by taking the positions whose numbers fall within that fixed range.
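As a rough sketch of how numbered sequences of unequal length may be brought to equal length, the following example collects every numbering label used by any sequence and fills the labels a sequence lacks with gaps. The dictionary input format, the function names and the toy residues are hypothetical. The ordering of insertion labels (111, 111A, 111B, ... followed by ..., 112B, 112A, 112) is an assumption inferred from the order in which the labels are listed above.

```python
import re


def label_key(label: str):
    """Sort key for numbering labels with optional insertion letters.

    Assumed ordering: insertions after 111 ascend (111, 111A, 111B, ...),
    insertions at 112 descend and precede it (..., 112B, 112A, 112).
    """
    number_text, letter = re.fullmatch(r"(\d+)([A-Z]?)", label).groups()
    number = int(number_text)
    if number == 112:
        rank = -ord(letter) if letter else 0
    else:
        rank = ord(letter) if letter else 0
    return (number, rank)


def to_equal_length(numbered_seqs):
    """Pad each sequence with '-' so that all sequences share the same columns.

    `numbered_seqs` maps a sequence name to a {label: residue} dictionary.
    Returns the ordered column labels and the padded sequence strings.
    """
    columns = sorted(
        {label for seq in numbered_seqs.values() for label in seq},
        key=label_key,
    )
    rows = {
        name: "".join(seq.get(label, "-") for label in columns)
        for name, seq in numbered_seqs.items()
    }
    return columns, rows


# Toy CDRH3 fragments (made-up residues): one long loop, one short loop.
toy = {
    "long": {"110": "G", "111": "S", "111A": "Y", "112A": "T", "112": "D", "113": "Y"},
    "short": {"110": "G", "113": "Y"},
}
columns, rows = to_equal_length(toy)
print(columns)
print(rows)
```

After padding, every row has one character per column label, so the same column index refers to the same numbered position in every sequence.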

As shown in fig. 4B, the IMGT-numbered, positionally aligned sequences of the four antibody sequences SH1, SH2, SH3 and SH4 of fig. 4A, which differ in length, are processed to extract the antibody sequence portions of the three CDR regions of each antibody sequence. Among the extracted portions, those corresponding to the same CDR region have the same sequence length (including gaps). Multiple sequence alignment can therefore be performed on the extracted key-region sequences to determine the sequence similarity between antibody sequences.

Taking the CDRH3 region sequence as an example, fig. 4C is a schematic diagram illustrating multiple sequence alignments for CDRH3 region sequences according to embodiments of the present disclosure.

Optionally, the antibody sequence portions extracted for the CDRH3 region in fig. 4B may be subjected to multiple sequence alignment so that as many identical (or similar) amino acids as possible fall in the same columns, that is, so that identical (or similar) residues occupy the same column. In fig. 4C, the black portions correspond to columns containing no gap, i.e., columns in which every antibody sequence participating in the alignment has an amino acid. This result is obtained by shifting the gaps in the antibody sequences so that the aligned antibody sequences have the highest similarity to one another.
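The idea of placing gaps so that identical residues share a column can be illustrated with a pairwise global alignment (Needleman-Wunsch). This is a deliberately simplified stand-in: the disclosure performs multiple sequence alignment and does not name an alignment algorithm, and the match, mismatch and gap scores below are arbitrary illustrative values.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two ungapped sequences; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Toy CDRH3-like sequences (made-up residues).
aligned_a, aligned_b, best = needleman_wunsch("ARDGYFDV", "ARGYDV")
print(aligned_a)
print(aligned_b)
print(best)
```

The two returned strings have equal length, and removing the gaps from either one recovers the corresponding input sequence.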

As described above, for the heavy chain, the heavy chain sequence similarity between the plurality of known antibodies and the antibody to be predicted can be obtained by performing multiple sequence alignment on the extracted antibody sequence portions of the three CDR regions. For example, the three aligned CDR regions of an antibody sequence may be concatenated, and the heavy chain sequence similarity may be computed with a specific scoring method (e.g., the BLOSUM62 protein sequence alignment scoring matrix).
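A minimal sketch of scoring concatenated, aligned CDR regions follows. To stay self-contained it uses a simple identity score per column instead of the BLOSUM62 matrix mentioned above; the `score` parameter is a hypothetical hook through which a substitution-matrix lookup could be supplied. The treatment of gap columns (shared gaps ignored, residue-gap columns counted as zero) and the normalization are assumptions for illustration, as are the toy CDR strings.

```python
def column_similarity(aligned_a, aligned_b, score=None):
    """Average per-column score of two aligned sequences of equal length.

    Columns where both sequences have a gap are ignored; columns pairing a
    residue with a gap contribute zero. By default a column scores 1.0 for
    identical residues and 0.0 otherwise (a stand-in for a substitution matrix).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if score is None:
        score = lambda x, y: 1.0 if x == y else 0.0
    total = 0.0
    counted = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            continue  # a shared gap carries no information
        counted += 1
        if x != "-" and y != "-":
            total += score(x, y)
    return total / counted if counted else 0.0


def chain_similarity(cdrs_a, cdrs_b):
    """Concatenate the three aligned CDR strings of each chain and score them."""
    return column_similarity("".join(cdrs_a), "".join(cdrs_b))


# Toy aligned CDR1, CDR2 and CDR3 strings (made-up residues).
query = ["GFTF", "ISGS", "ARDG-YFDV"]
known = ["GFTF", "ISGT", "ARDGSYFDV"]
similarity = chain_similarity(query, known)
print(round(similarity, 3))
```

The same function can be applied unchanged to the three CDRL regions to obtain a light chain similarity.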

According to an embodiment of the present disclosure, determining the sequence similarity of each of the plurality of known antibodies to the light chain of the antibody to be predicted based on each of the plurality of known antibodies and at least a portion of the respective light chain of the antibody to be predicted comprises performing the sequence similarity alignment on each of the plurality of known antibodies and the respective light chain of the antibody to be predicted.

For the light chain, the light chain sequence similarity can be determined in the same manner as the heavy chain sequence similarity. For example, the sequence similarity alignment described with reference to fig. 3 and figs. 4A-4C can be applied to the light chain sequences to obtain the light chain sequence similarity between the plurality of known antibodies and the antibody to be predicted, which is not repeated here.

Based on the above description, by performing the operations as shown in fig. 3 on the respective heavy chains and light chains of the plurality of known antibodies and the antibody to be predicted, respectively, the heavy chain sequence similarity and the light chain sequence similarity between the plurality of known antibodies and the antibody to be predicted can be determined.

Next, in step 203, a structural similarity of each of the plurality of known antibodies to the antibody to be predicted may be determined based on the plurality of known antibodies and the heavy chains of the antibody to be predicted.

As described above, structural information of antibodies also contributes to effective classification of binding sites of antibodies to antigens, and therefore, it is also possible to find a known antibody most similar to an antibody to be predicted based on structural similarities between a plurality of known antibodies and the antibody to be predicted.

Fig. 5 is a flow diagram illustrating determining structural similarity of a known antibody and an antibody to be predicted according to an embodiment of the present disclosure. Fig. 6 is a schematic flow diagram illustrating determining structural similarity of a known antibody and an antibody to be predicted according to an embodiment of the present disclosure.

According to the embodiment of the present disclosure, step 203 may include steps 2031-2034 as shown in fig. 5.

In embodiments of the present disclosure, since antibodies are composed of a heavy chain and a light chain, and the heavy chain has a greater number of amino acid residues than the light chain, and thus has a more complex protein structure, in determining structural similarity between antibodies, only structural similarity determinations for the heavy chain portion in the antibody may be considered, as shown in fig. 6.

In step 2031, at least one estimated structural model of the antibody to be predicted may be determined based on the heavy chain of the antibody to be predicted.

Optionally, only the heavy chain sequence of the antibody to be predicted may be structurally modeled to obtain the at least one estimated structural model. Each of the estimated structural models may be obtained with a unified protein structure modeling approach, in which a de novo folding approach is used for the key regions (e.g., CDR regions).

Optionally, in structurally modeling the heavy chain sequence of the antibody to be predicted, an existing protein structure prediction tool may be employed to predict the three-dimensional structure of the protein from its amino acid sequence using a "de novo folding" protein structure prediction method. For example, the AI tool "tFold" developed by Tencent may be used to generate at least one estimated structural model of the antibody to be predicted from its heavy chain sequence. In this tool, co-evolution information in multiple sets of multiple sequence alignments may first be mined using a multi-data-source fusion technique; the prediction accuracy of important two-dimensional structural information of the protein (e.g., residue-residue distance and orientation matrices) may then be improved using a deep cross-attention residual network (DCARN); and finally the structural information in the 3D models generated by Free Modeling (FM) and Template-based Modeling (TBM) may be fused using a Template-assisted Free Modeling (TBFM) method, thereby greatly improving the accuracy of the final three-dimensional model.

Of course, besides the prediction of the three-dimensional spatial structure of the heavy chain sequence from the heavy chain sequence of the antibody to be predicted by the existing protein structure prediction tool such as tFold, other protein structure prediction methods that can achieve the same effect can be applied to the antigen-antibody binding site prediction method of the present disclosure, and the present disclosure is described only by the tFold tool as an example and not as a limitation.

In step 2032, an estimated structure model may be selected from the at least one estimated structure model as the structure model of the antibody to be predicted based on at least a portion of each of the at least one estimated structure model.

After generating at least one estimated structural model of the antibody to be predicted as described above, these estimated structural models may be subjected to quality evaluation to select an estimated structural model having the best quality therefrom as the structural model of the antibody to be predicted.

According to an embodiment of the present disclosure, step 2032 may comprise: for each of the at least one estimated structure model, separately extracting from the estimated structure model the partial structure corresponding to the third complementarity determining region, and performing a protein structure quality assessment on that partial structure; and selecting, from the at least one estimated structure model, the estimated structure model with the highest protein structure quality as the structure model of the antibody to be predicted.

Optionally, the protein structure quality assessment may be performed on the structural parts of the generated estimated structure models that correspond to the key regions. For example, in embodiments of the present disclosure, the assessment may be performed only on the partial structure corresponding to the CDRH3 region (hereinafter referred to as the CDRH3 heavy chain predicted structure). The conserved regions of the plurality of known antibodies and of the antibody to be predicted generally have similar amino acid sequences, or even similar protein structures, whereas among the non-conserved regions the CDRH3 region has the highest variability and the other CDR regions have lower variability. The quality of the entire antibody structure can therefore be estimated from the quality of the partial structure corresponding to the most variable CDRH3 region.

Optionally, the characteristics of single-model evaluation methods and consensus evaluation methods may be combined; that is, information on a single predicted structure and features of its relationship to other predicted conformations of the same sequence are collected at the same time. The predicted two-dimensional structure information of the protein, together with the retrieved template information and the conformation of the CDRH3 heavy chain predicted structure to be evaluated, is then used to generate a graph with amino acid residues as vertices and residue-residue distance relationships as edges. Finally, the features of the single-model and consensus evaluation methods may be integrated into the graph, and a message-passing network may be used to predict the accuracy of the CDRH3 heavy chain predicted structure. The accuracy of each of the at least one estimated structure model can thus be determined, and the most accurate estimated structure model is selected as the structure model of the antibody to be predicted.
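Only the graph construction step lends itself to a short illustration; the network that consumes the graph is not sketched here. The example below builds a graph with one vertex per residue and an edge, carrying the residue-residue distance, for every residue pair closer than a cutoff. The one-coordinate-per-residue input, the 8.0 cutoff and the toy coordinates are assumptions chosen for illustration and are not specified in the disclosure.

```python
import math


def build_residue_graph(coords, cutoff=8.0):
    """Build a residue graph from one 3D coordinate per residue.

    Returns (vertices, edges), where vertices are residue indices and edges
    maps an index pair (i, j) with i < j to the distance between residues.
    Only pairs no farther apart than `cutoff` receive an edge.
    """
    vertices = list(range(len(coords)))
    edges = {}
    for i in vertices:
        for j in range(i + 1, len(coords)):
            distance = math.dist(coords[i], coords[j])
            if distance <= cutoff:
                edges[(i, j)] = distance
    return vertices, edges


# Toy coordinates (made-up): three residues in a row and one far away.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
vertices, edges = build_residue_graph(toy_coords)
print(vertices)
print(sorted(edges))
```

Per-residue features from the single-model and consensus evaluations could then be attached to the vertices before the graph is passed to a message-passing network.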

Also, in addition to the above-described protein structure quality assessment method, other protein structure quality assessment methods that can achieve the same effect can be equally applied to the antigen-antibody binding site prediction method of the present disclosure, which is described only by way of example and not limitation.

After the structural model of the antibody to be predicted is determined in step 2032, the similarity between the structural model of the antibody to be predicted and the structural model of each known antibody may be analyzed. This similarity analysis requires that the structural models first be structurally aligned in three-dimensional space, and the structural alignment in turn requires that the antibody sequence portions corresponding to the structural models first be sequence-aligned.

Thus, in step 2033, a multiple sequence alignment of at least a portion of the heavy chain of the antibody to be predicted with at least a portion of the heavy chain of each of the plurality of known antibodies may be performed.

According to an embodiment of the present disclosure, the at least a portion of the heavy chain of the antibody to be predicted comprises the third of the three complementarity determining regions in the heavy chain of the antibody to be predicted, the at least a portion of the heavy chain of each of the plurality of known antibodies comprises the third of the three complementarity determining regions in the heavy chain of that known antibody, and the at least a portion of each of the at least one estimated structure model corresponds to the third complementarity determining region.

Optionally, the at least a portion of the heavy chain of the antibody to be predicted and the at least a portion of the heavy chain of each known antibody may each correspond to the CDRH3 region, i.e., the CDRH3 region sequence of the antibody to be predicted is subjected to multiple sequence alignment with the CDRH3 region sequences of the known antibodies. The CDRH3 region sequences of the antibody to be predicted and of the known antibodies may likewise be extracted by IMGT numbering, as described above with reference to fig. 3.

In step 2034, a structural alignment may be performed on the structural model of the antibody to be predicted and the structural model of each of the plurality of known antibodies based on the result of the multiple sequence alignment, and the structural similarity of each of the plurality of known antibodies and the antibody to be predicted is determined.

Through multiple sequence alignment, the sequence elements (including gaps) of the CDRH3 region sequence of the antibody to be predicted and of the CDRH3 region sequence of a known antibody are aligned one to one, and the two sequences have the same length. The structural models can thus be structurally aligned based on the multiple-sequence-aligned antibody sequences.

According to an embodiment of the present disclosure, the performing the structural alignment on the structural model of the antibody to be predicted and the structural model of each of the plurality of known antibodies in step 2034 based on the result of the multiple sequence alignment may include: and performing structural alignment on a part corresponding to the third complementarity determining region in the structural model of the antibody to be predicted and a part corresponding to the third complementarity determining region in the structural model of each of the plurality of known antibodies based on the result of the multiple sequence alignment.

Optionally, since the sequence elements (including gaps) of the CDRH3 region sequence of the antibody to be predicted and of the CDRH3 region sequence of the known antibody are aligned one to one, the structural alignment may be based on each alignment element pair, where an alignment element pair may be a gap-amino acid residue pair, a gap-gap pair, or an amino acid residue-amino acid residue pair.

Optionally, the absolute positions in space of the structures of the antibody to be predicted and of the known antibody may differ; for example, the structure of the antibody to be predicted may lie at the origin while the structure of the known antibody lies far from it. The structural alignment therefore brings the two structures into coincidence as far as possible before their structural similarity is calculated. For example, the structure of one of the two antibodies may be held fixed while the structure of the other is subjected to arbitrary rotation and/or translation, and the rotational and translational degrees of freedom are searched so that the sum of the distances of the alignment element pairs between the two antibodies is minimized.

According to an embodiment of the present disclosure, determining the structural similarity of each of the plurality of known antibodies to the antibody to be predicted may include: for each known antibody of the plurality of known antibodies, determining the structural similarity of the structural model of the known antibody and the structural model of the antibody to be predicted according to the distances between the carbon atoms at aligned positions in the structurally aligned structural model of the antibody to be predicted and the structural model of the known antibody.

As described above, the structural alignment may be based on the distance in space of each alignment element pair, where the distance is the distance in space of the carbon atoms of the two elements of the alignment element pair.

Optionally, the structural similarity of the structural model of a known antibody to the structural model of the antibody to be predicted may be determined from the spatial distances of all alignment element pairs between the CDRH3 region sequence of the antibody to be predicted and the CDRH3 region sequence of that known antibody (e.g., as the RMSD, which is a function of the spatial distances of all alignment element pairs). For an alignment element pair that contains at least one gap element, the spatial distance need not be considered.
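The RMSD over alignment element pairs, with gap-containing pairs left out, can be sketched as follows. The sketch assumes that the two structures have already been superposed (the rotation and translation search is not shown) and that each non-gap residue contributes a single representative atom coordinate; the function name and the toy data are hypothetical.

```python
import math


def aligned_rmsd(seq_a, coords_a, seq_b, coords_b):
    """RMSD over residue-residue pairs of two aligned, superposed structures.

    seq_a and seq_b are aligned strings of equal length with '-' for gaps.
    coords_a and coords_b hold one (x, y, z) tuple per non-gap residue, in
    sequence order. Pairs that contain at least one gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    index_a = index_b = 0
    squared = []
    for x, y in zip(seq_a, seq_b):
        if x != "-" and y != "-":
            squared.append(math.dist(coords_a[index_a], coords_b[index_b]) ** 2)
        # Advance each coordinate index only past real residues.
        if x != "-":
            index_a += 1
        if y != "-":
            index_b += 1
    if not squared:
        raise ValueError("no residue-residue pairs to compare")
    return math.sqrt(sum(squared) / len(squared))


# Toy data (made-up): every matched pair is exactly 1.0 apart along z.
seq_a = "AR-GY"
coords_a = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
seq_b = "ARDG-"
coords_b = [(0, 0, 1), (1, 0, 1), (9, 9, 9), (2, 0, 1)]
print(aligned_rmsd(seq_a, coords_a, seq_b, coords_b))  # prints 1.0
```

A lower RMSD indicates a closer structural match, so a similarity score would need to decrease as the RMSD grows; how that conversion is done is not specified above.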

In step 204, a known antibody most similar to the antibody to be predicted in the plurality of known antibodies may be determined based on the structural similarity, the heavy chain sequence similarity and the light chain sequence similarity of each of the plurality of known antibodies and the antibody to be predicted, and the binding site of the known antibody and the antigen may be used as the binding site of the antibody to be predicted and the antigen.

In embodiments of the present disclosure, the structural similarity, the heavy chain sequence similarity and the light chain sequence similarity between the antibody to be predicted and each known antibody may be considered together, so that the known antibody most similar to the antibody to be predicted is determined jointly from structural similarity and sequence similarity, thereby achieving accurate prediction of the antigen-antibody binding site.

According to an embodiment of the present disclosure, determining, in step 204, one known antibody of the plurality of known antibodies that is most similar to the antibody to be predicted based on the structural similarity, the heavy chain sequence similarity, and the light chain sequence similarity of each of the plurality of known antibodies to the antibody to be predicted may include: for each of the plurality of known antibodies, determining an antibody similarity of the known antibody to the antibody to be predicted based on a weighted sum of the structural similarity, the heavy chain sequence similarity, and the light chain sequence similarity of the known antibody to the antibody to be predicted; and selecting a known antibody with the highest antibody similarity with the antibody to be predicted from the plurality of known antibodies as the known antibody with the most similarity with the antibody to be predicted.
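The weighted-sum selection described above can be sketched in a few lines. The sketch assumes that all three inputs are similarity scores on a comparable scale, with higher values meaning more similar; the antibody names, scores and weights are made up for illustration.

```python
def antibody_similarity(structural, heavy, light, weights):
    """Weighted sum of the three similarity terms."""
    w_structural, w_heavy, w_light = weights
    return w_structural * structural + w_heavy * heavy + w_light * light


def most_similar(known, weights):
    """Return the name of the known antibody with the highest weighted similarity.

    `known` maps an antibody name to its (structural, heavy, light)
    similarity to the antibody to be predicted.
    """
    return max(known, key=lambda name: antibody_similarity(*known[name], weights))


# Toy similarities (made-up) for two known antibodies.
known = {"Ab1": (0.9, 0.5, 0.6), "Ab2": (0.4, 0.8, 0.9)}
print(most_similar(known, (0.5, 0.3, 0.2)))  # prints Ab1
print(most_similar(known, (0.1, 0.5, 0.4)))  # prints Ab2
```

The binding site recorded for the selected known antibody would then be taken as the predicted binding site. As the two calls show, the choice of weights can change which known antibody is selected.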

Optionally, determining the antibody similarity of the known antibody and the antibody to be predicted based on the weighted sum of the structural similarity, the heavy chain sequence similarity and the light chain sequence similarity may comprise determining how strongly each of the structural similarity, the heavy chain sequence similarity and the light chain sequence similarity influences the similarity between antibodies (i.e., the weight of each).

Optionally, the weights may be obtained by training on data. For example, a training data set may be built from antibodies of known structure in the Structural Antibody Database (SAbDab), such as antibodies of known structure that correspond to the same antigen, clustered according to their binding sites on the antigen (the binding site serving as the label of each antibody). Using the algorithm flow described above, the structural similarity, the heavy chain sequence similarity and the light chain sequence similarity can then be used as features for training, and the respective weights of these features are searched so as to maximize the classification accuracy over all antibody data, thereby determining the weights of the linear weighting.
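One way such a weight search could look is a coarse grid search scored by leave-one-out nearest-neighbor accuracy, sketched below. The search procedure, the grid step, the constraint that the weights sum to one, and the toy data are all assumptions for illustration; the disclosure states only that the weights are searched to maximize classification accuracy.

```python
from itertools import product


def loo_accuracy(features, labels, weights):
    """Leave-one-out accuracy of nearest-neighbor binding-site classification.

    features[i][j] is the (structural, heavy, light) similarity between
    training antibodies i and j; labels[i] is the binding-site label of i.
    """
    n = len(labels)
    correct = 0
    for i in range(n):
        best = max(
            (j for j in range(n) if j != i),
            key=lambda j: sum(w * f for w, f in zip(weights, features[i][j])),
        )
        correct += labels[best] == labels[i]
    return correct / n


def search_weights(features, labels, step=0.25):
    """Grid-search non-negative weights summing to one; keep the most accurate."""
    grid = [k * step for k in range(int(round(1 / step)) + 1)]
    best_weights, best_accuracy = None, -1.0
    for w_structural, w_heavy in product(grid, repeat=2):
        w_light = 1.0 - w_structural - w_heavy
        if w_light < -1e-9:
            continue  # outside the simplex of valid weights
        weights = (w_structural, w_heavy, max(w_light, 0.0))
        accuracy = loo_accuracy(features, labels, weights)
        if accuracy > best_accuracy:
            best_weights, best_accuracy = weights, accuracy
    return best_weights, best_accuracy


# Toy training set (made-up): structure separates the two sites, while the
# heavy chain term is misleading and the light chain term is uninformative.
labels = ["siteA", "siteA", "siteB", "siteB"]
features = [
    [(0.9, 0.2, 0.5) if labels[i] == labels[j] else (0.1, 0.8, 0.5)
     for j in range(4)]
    for i in range(4)
]
best_weights, best_accuracy = search_weights(features, labels)
print(best_weights, best_accuracy)
```

On this toy set the search settles on weights that favor the structural term, which is the only informative feature in the made-up data.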

As described above, the antigen-antibody binding site determining method of the present disclosure fuses sequence information and structural information, and obtains the key regions of antibodies by antibody sequence numbering. With regard to sequence similarity, the IMGT numbering is used as reference alignment information and the key regions are extracted at the positions that the numbering maps to in the multiple sequence alignment, so that the sequence similarity between the antibody to be predicted and the known antibodies is determined on the basis of this reference alignment information, which avoids large differences in sequence alignment results caused by differing lengths of antibody sequences. With regard to structural similarity, the key regions of the antibody are modeled and screened and are then subjected to multiple sequence alignment and structural alignment with the key regions of the known antibodies to determine the structural similarity between the antibody and the known antibodies. The structural modeling uses a unified modeling approach in which template information serves only as an optional feature and all key regions are folded de novo, which addresses the problems of differing lengths among antibody sequences and of antibodies without templates. To ensure the quality of antibody modeling, the CDRH3 region, which varies most in antibody structure, is extracted separately and its quality is assessed with a protein structure model evaluation technique, and the best structure model is selected as the modeling result.

Fig. 7 is a schematic diagram illustrating an antigen-antibody binding site determining apparatus 700 according to an embodiment of the present disclosure.

According to an embodiment of the present disclosure, the antigen-antibody binding site determining apparatus 700 may include an antibody obtaining module 701, a sequence alignment module 702, a structure alignment module 703 and a site determining module 704.

The antibody obtaining module 701 may be configured to obtain an antibody to be predicted, and a plurality of known antibodies corresponding to the same antigen as the antibody to be predicted.

The obtained plurality of known antibodies and the antibody to be predicted act on the same antigen via their corresponding binding sites, and the sequences and structures of the known antibodies and their binding sites with the antigen can be predetermined. The binding site of the antibody to be predicted and the antigen can therefore be determined based on the similarity between the antibody to be predicted and the known antibodies. Optionally, the antibody obtaining module 701 may perform the operations described above with reference to step 201.

Optionally, the at least a portion of the respective heavy chain (or light chain) of each of the plurality of known antibodies and of the antibody to be predicted may comprise the three complementarity determining regions of that heavy chain (or light chain). Taking one heavy chain and one light chain of an antibody as an example, the variable regions of the heavy and light chains may each have three complementarity determining regions: the three CDR regions of the heavy chain (i.e., CDRH regions) are CDRH1, CDRH2 and CDRH3, and the three CDR regions of the light chain (i.e., CDRL regions) are CDRL1, CDRL2 and CDRL3.

The sequence alignment module 702 can be configured to determine a heavy chain sequence similarity and a light chain sequence similarity of each of the plurality of known antibodies to the antibody to be predicted based on at least a portion of the respective heavy chain and at least a portion of the respective light chain of each of the plurality of known antibodies and of the antibody to be predicted, respectively. Optionally, the sequence alignment module 702 can perform the operations described above with reference to step 202.

For example, the heavy chain sequence similarity and the light chain sequence similarity between the antibody to be predicted and a known antibody can be determined for the heavy chain and the light chain in the antibody, respectively, wherein the determination of the heavy chain sequence similarity and the light chain sequence similarity can be similar due to the fact that the heavy chain and the light chain have similar region structures. For example, for the heavy and light chains of an antibody sequence, the heavy chain and light chain sequence similarity of the antibody sequence can be determined by sequence similarity alignment operations as described with reference to fig. 3 and fig. 4A-4C, respectively.

Optionally, sequence similarity may be determined by extracting the key regions of the sequence. For example, for the heavy chain, the sequence similarity between the antibody to be predicted and a known antibody may be determined by extracting the three CDRH region sequences of the heavy chain and performing multiple sequence alignment with the CDRH region sequences of the heavy chains of the other antibodies. For example, the antibody sequence may first be numbered (e.g., by IMGT numbering) so that the number of each position in the sequence is determined; because the numbering range of each key region (CDR region) is fixed under the numbering scheme, the amino acid sequence of a key region can be extracted from the antibody sequence by taking the positions whose numbers fall within that fixed range.

Optionally, the extracted antibody sequence portions may be subjected to multiple sequence alignment so that as many identical (or similar) amino acids as possible fall in the same columns, that is, so that identical (or similar) residues occupy the same column, thereby obtaining the heavy chain (or light chain) sequence similarity between the plurality of known antibodies and the antibody to be predicted.

As described above, structural information of an antibody also contributes to effective classification of the binding site of the antibody to an antigen, and therefore, a known antibody most similar to an antibody to be predicted can also be found based on structural similarities between a plurality of known antibodies and the antibody to be predicted. Thus, the structure alignment module 703 may be configured to determine the structural similarity of each of the plurality of known antibodies to the antibody to be predicted based on the heavy chains of the plurality of known antibodies and the antibody to be predicted. Optionally, structure alignment module 703 may perform the operations described above with reference to step 203.

For example, only the heavy chain partial sequence of the antibody to be predicted may be structurally modeled to obtain at least one estimated structural model. Each of the obtained estimated structural models may employ a unified protein structural modeling approach, and a de novo folding approach may be employed for critical regions (e.g., CDR regions) therein.

Optionally, after at least one estimated structure model of the antibody to be predicted has been generated as described above, the estimated structure models may be subjected to quality evaluation, and the model with the best quality may be selected as the structure model of the antibody to be predicted. For example, protein structure quality assessment may be performed on the portions of each estimated structure model that correspond to the critical regions. Once the structure model of the antibody to be predicted has been determined, its similarity to the structure model of each known antibody may be analyzed. This similarity analysis requires that the structure models first be structurally aligned in three-dimensional space, and the structural alignment in turn requires that the antibody sequence portions corresponding to the structure models first be sequence-aligned, as described above with reference to steps 2033 and 2034.
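As a minimal sketch of the selection and comparison steps described above, the following Python code picks the estimated model with the highest quality score and computes a root-mean-square deviation (RMSD) over the residue positions paired by a sequence alignment. The function names, the quality scores, and the use of RMSD as the distance measure are assumptions made for illustration. The sketch also assumes that the two coordinate sets have already been superposed in a common frame; the superposition step itself is omitted.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def select_best_model(models: Sequence[Tuple[str, float]]) -> str:
    """Return the name of the (name, quality score) pair with the top score."""
    return max(models, key=lambda model: model[1])[0]


def aligned_pairs(aligned_a: str, aligned_b: str) -> List[Tuple[int, int]]:
    """Map alignment columns to residue index pairs, skipping gapped columns."""
    pairs: List[Tuple[int, int]] = []
    i = j = 0  # running residue indices in the ungapped sequences
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a != "-" and res_b != "-":
            pairs.append((i, j))
        if res_a != "-":
            i += 1
        if res_b != "-":
            j += 1
    return pairs


def rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord],
         pairs: Sequence[Tuple[int, int]]) -> float:
    """RMSD between paired atoms of two already-superposed coordinate sets."""
    if not pairs:
        raise ValueError("no aligned positions to compare")
    total = sum(math.dist(coords_a[i], coords_b[j]) ** 2 for i, j in pairs)
    return math.sqrt(total / len(pairs))


# Toy data: three atoms versus two atoms, paired through a gapped alignment.
best = select_best_model([("model_1", 0.6), ("model_2", 0.8)])
pairs = aligned_pairs("ABC", "A-C")
deviation = rmsd([(0, 0, 0), (1, 0, 0), (2, 0, 0)],
                 [(0, 0, 1), (2, 0, 1)], pairs)
```

A smaller deviation indicates a closer structural match, so a similarity value would need to decrease as the deviation grows; how that conversion is made is left open here.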

In embodiments of the present disclosure, the structural similarity, the heavy chain sequence similarity, and the light chain sequence similarity between the antibody to be predicted and each known antibody may be considered together, so that the known antibody most similar to the antibody to be predicted is determined jointly by structural similarity and sequence similarity. The site determination module 704 may be configured to determine one of the plurality of known antibodies that is most similar to the antibody to be predicted based on the structural similarity, the heavy chain sequence similarity, and the light chain sequence similarity of each of the plurality of known antibodies to the antibody to be predicted, and to use the binding site of that known antibody to the antigen as the binding site of the antibody to be predicted to the antigen. Optionally, the site determination module 704 may perform the operations described above with reference to step 204.

For example, before the weighted sum of the structural similarity, the heavy chain sequence similarity, and the light chain sequence similarity between a known antibody and the antibody to be predicted is computed, the magnitude of the influence (i.e., the weight) of each of these three similarities on the overall similarity between antibodies may be determined; the weights may be obtained by, for example, data training.

Therefore, by determining one known antibody which is most similar to the antibody to be predicted in a plurality of known antibodies, the binding site of the known antibody and the antigen can be used as the binding site of the antibody to be predicted, so that the accurate prediction of the binding site of the antigen and the antibody can be realized.
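The weighted combination and selection described above can be sketched in Python as follows. The antibody names, the similarity values, and the weights are made-up illustrative numbers, and the names `Similarity` and `most_similar` are hypothetical; in practice the weights would be obtained separately, for example by training.

```python
from typing import Dict, NamedTuple


class Similarity(NamedTuple):
    structure: float  # structural similarity
    heavy: float      # heavy chain sequence similarity
    light: float      # light chain sequence similarity


def combined_score(score: Similarity, weights: Similarity) -> float:
    """Weighted sum of the three similarity components."""
    return sum(w * s for w, s in zip(weights, score))


def most_similar(scores: Dict[str, Similarity], weights: Similarity) -> str:
    """Return the known antibody whose weighted similarity is highest."""
    return max(scores, key=lambda name: combined_score(scores[name], weights))


# Illustrative values only.
weights = Similarity(structure=0.5, heavy=0.3, light=0.2)
scores = {
    "known_ab_1": Similarity(structure=0.9, heavy=0.5, light=0.5),
    "known_ab_2": Similarity(structure=0.6, heavy=0.9, light=0.9),
}
closest = most_similar(scores, weights)
```

The binding site recorded for the selected known antibody would then be reported as the predicted binding site of the antibody to be predicted.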

According to still another aspect of the present disclosure, there is also provided an antigen-antibody binding site determining apparatus. Fig. 8 shows a schematic diagram of an antigen-antibody binding site determining apparatus 2000 according to an embodiment of the present disclosure.

As shown in fig. 8, the antigen-antibody binding site determining apparatus 2000 may include one or more processors 2010 and one or more memories 2020. The memory 2020 stores computer-readable code that, when executed by the one or more processors 2010, performs the antigen-antibody binding site determination method described above.

The processor in the embodiments of the present disclosure may be an integrated circuit chip having signal processing capabilities. The processor may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present disclosure. The general purpose processor may be a microprocessor or any conventional processor or the like, and may be of the X86 or ARM architecture.

In general, the various example embodiments of this disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic or any combination thereof. Certain aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of embodiments of the disclosure have been illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

For example, a method or apparatus in accordance with embodiments of the present disclosure may also be implemented by way of the architecture of computing device 3000 shown in fig. 9. As shown in fig. 9, computing device 3000 may include a bus 3010, one or more CPUs 3020, a Read Only Memory (ROM) 3030, a Random Access Memory (RAM) 3040, a communication port 3050 to connect to a network, input/output components 3060, a hard disk 3070, and the like. A storage device in the computing device 3000, such as the ROM 3030 or the hard disk 3070, may store various data or files used in the processing and/or communication of the antigen-antibody binding site determination method provided by the present disclosure, as well as program instructions executed by the CPU. Computing device 3000 may also include a user interface 3080. Of course, the architecture shown in fig. 9 is merely exemplary, and one or more components of the computing device shown in fig. 9 may be omitted as needed in implementing different devices.

According to yet another aspect of the present disclosure, there is also provided a computer-readable storage medium. Fig. 10 shows a schematic diagram 4000 of a storage medium according to the present disclosure.

As shown in fig. 10, the computer storage medium 4020 has computer-readable instructions 4010 stored thereon. The computer-readable instructions 4010, when executed by a processor, can perform the antigen-antibody binding site determination method according to embodiments of the present disclosure described with reference to the above figures. The computer-readable storage medium in embodiments of the present disclosure may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory may be Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), Synchlink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DR RAM). It should be noted that the memories described herein are intended to comprise, without being limited to, these and any other suitable types of memory.

Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the antigen-antibody binding site determination method according to an embodiment of the present disclosure.

It should be noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The exemplary embodiments of the present disclosure, which are described in detail above, are merely illustrative, and not restrictive. It will be appreciated by those skilled in the art that various modifications and combinations of these embodiments or features thereof may be made without departing from the principles and spirit of the disclosure, and that such modifications are intended to be within the scope of the disclosure.

Claims

1. A method of determining an antigen-antibody binding site comprising:

obtaining an antibody to be predicted and a plurality of known antibodies corresponding to the same antigen as the antibody to be predicted;

determining the heavy chain sequence similarity and the light chain sequence similarity of each of the plurality of known antibodies and the antibody to be predicted respectively based on each of the plurality of known antibodies and at least a portion of the respective heavy chain and at least a portion of the respective light chain of the antibody to be predicted;

determining a structural similarity of each of the plurality of known antibodies to the antibody to be predicted based on the plurality of known antibodies and the heavy chain of the antibody to be predicted; and

determining, among the plurality of known antibodies, one known antibody that is most similar to the antibody to be predicted based on the structural similarity, the heavy chain sequence similarity and the light chain sequence similarity of each of the plurality of known antibodies to the antibody to be predicted, and taking the binding site of that known antibody to the antigen as the binding site of the antibody to be predicted to the antigen.

2. The method of claim 1, wherein each of the plurality of known antibodies and at least a portion of the respective heavy chain of the antibody to be predicted comprise three complementarity determining regions in each of the plurality of known antibodies and the respective heavy chain of the antibody to be predicted, respectively;

each of the plurality of known antibodies and at least a portion of the respective light chain of the antibody to be predicted comprise three complementarity determining regions in each of the plurality of known antibodies and the respective light chain of the antibody to be predicted, respectively;

wherein determining the sequence similarity of each of the plurality of known antibodies to the heavy chain of the antibody to be predicted based on each of the plurality of known antibodies and at least a portion of the respective heavy chain of the antibody to be predicted comprises performing a sequence similarity alignment of each of the plurality of known antibodies to the respective heavy chain of the antibody to be predicted, the sequence similarity alignment comprising:

for each known antibody of the plurality of known antibodies, extracting three complementarity determining regions of the respective heavy chain from the respective heavy chains of the known antibody and the antibody to be predicted, respectively; and

performing multiple sequence alignment on the three complementarity determining regions of the heavy chain of the known antibody and the three complementarity determining regions of the heavy chain of the antibody to be predicted so as to determine the similarity of the sequences of the heavy chains of the known antibody and the antibody to be predicted;

wherein determining the sequence similarity of each of the plurality of known antibodies to the light chain of the antibody to be predicted based on each of the plurality of known antibodies and at least a portion of the respective light chain of the antibody to be predicted comprises performing the sequence similarity alignment on each of the plurality of known antibodies and the respective light chain of the antibody to be predicted.

3. The method of claim 2, wherein extracting three complementarity determining regions of respective heavy chains from the respective heavy chains of the known antibody and the antibody to be predicted comprises:

antibody sequence numbering is performed on the respective heavy chains of the known antibody and the antibody to be predicted, and antibody sequences within specific numbering ranges are extracted from the respective antibody sequence-numbered heavy chains of the known antibody and the antibody to be predicted, wherein the specific numbering ranges correspond to the three complementarity determining regions.

4. The method of claim 2, wherein determining the structural similarity of each of the plurality of known antibodies to the antibody to be predicted based on the plurality of known antibodies and the heavy chain of the antibody to be predicted comprises:

determining at least one estimated structural model of the antibody to be predicted based on the heavy chain of the antibody to be predicted;

selecting one estimated structure model from the at least one estimated structure model as a structure model of the antibody to be predicted based on at least a portion of each of the at least one estimated structure model;

performing a multiple sequence alignment of at least a portion of the heavy chain of the antibody to be predicted with at least a portion of the heavy chain of each of the plurality of known antibodies; and

performing structural alignment on the structural model of the antibody to be predicted and the structural model of each of the plurality of known antibodies based on the result of the multiple sequence alignment, and determining the structural similarity of each of the plurality of known antibodies to the antibody to be predicted.

5. The method of claim 4, wherein the at least a portion of the heavy chain of the antibody to be predicted comprises a third of the three complementarity determining regions in the heavy chain of the antibody to be predicted, the at least a portion of each of the plurality of known antibodies comprises a third of the three complementarity determining regions in the heavy chain of each of the plurality of known antibodies, and the at least a portion of each of the at least one estimation structure model corresponds to the third complementarity determining region;

wherein structurally aligning the structural model of the antibody to be predicted and the structural model of each of the plurality of known antibodies based on the results of the multiple sequence alignments comprises:

performing structural alignment on a part corresponding to the third complementarity determining region in the structural model of the antibody to be predicted and a part corresponding to the third complementarity determining region in the structural model of each of the plurality of known antibodies based on the result of the multiple sequence alignment.

6. The method of claim 5, wherein determining the structural similarity of each of the plurality of known antibodies to the antibody to be predicted comprises:

for each known antibody of the plurality of known antibodies, determining the structural similarity of the structural model of the known antibody and the structural model of the antibody to be predicted according to the distances between the carbon atoms at the aligned positions in the structurally aligned structural model of the antibody to be predicted and the structural model of the known antibody.

7. The method of claim 5, wherein selecting one estimated structure model from the at least one estimated structure model as the structure model of the antibody to be predicted based on at least a portion of each of the at least one estimated structure model comprises:

for each of the at least one estimated structure model, separately extracting from the estimated structure model a partial structure corresponding to the third complementarity determining region, and performing a protein structure quality assessment on the partial structure of the estimated structure model; and

selecting the estimated structure model having the highest assessed protein structure quality from the at least one estimated structure model as the structure model of the antibody to be predicted.

8. The method of claim 1, wherein determining the one of the plurality of known antibodies that is most similar to the antibody to be predicted based on the structural similarity, the heavy chain sequence similarity, and the light chain sequence similarity of each of the plurality of known antibodies to the antibody to be predicted comprises:

for each of the plurality of known antibodies, determining an antibody similarity of the known antibody to the antibody to be predicted based on a weighted sum of the structural similarity, the heavy chain sequence similarity, and the light chain sequence similarity of the known antibody to the antibody to be predicted; and

selecting a known antibody having the highest similarity to the antibody to be predicted from the plurality of known antibodies as the known antibody most similar to the antibody to be predicted.

9. An antigen-antibody binding site determining device comprising:

an antibody obtaining module configured to obtain an antibody to be predicted and a plurality of known antibodies corresponding to the same antigen as the antibody to be predicted;

a sequence alignment module configured to determine a heavy chain sequence similarity and a light chain sequence similarity of each of the plurality of known antibodies to the antibody to be predicted based on each of the plurality of known antibodies and at least a portion of the respective heavy chain and at least a portion of the respective light chain of the antibody to be predicted, respectively;

a structural alignment module configured to determine a structural similarity of each of the plurality of known antibodies to the antibody to be predicted based on the plurality of known antibodies and the heavy chains of the antibody to be predicted; and

a site determination module configured to determine one of the plurality of known antibodies that is most similar to the antibody to be predicted based on the structural similarity, the heavy chain sequence similarity and the light chain sequence similarity of each of the plurality of known antibodies to the antibody to be predicted, and to use the binding site of the known antibody to the antigen as the binding site of the antibody to be predicted to the antigen.

10. The device of claim 9, wherein each of the plurality of known antibodies and at least a portion of the respective heavy chain of the antibody to be predicted comprise three complementarity determining regions in each of the plurality of known antibodies and the respective heavy chain of the antibody to be predicted, respectively;

11. The device of claim 10, wherein determining the structural similarity of each of the plurality of known antibodies to the antibody to be predicted based on the plurality of known antibodies and the heavy chain of the antibody to be predicted comprises:

12. The device of claim 11, wherein determining the one of the plurality of known antibodies that is most similar to the antibody to be predicted based on the structural similarity, the heavy chain sequence similarity, and the light chain sequence similarity of each of the plurality of known antibodies to the antibody to be predicted comprises:

13. An antigen-antibody binding site determining apparatus comprising:

one or more processors; and

one or more memories having stored therein a computer-executable program that, when executed by the processor, performs the method of any of claims 1-8.

14. A computer program product stored on a computer readable storage medium and comprising computer instructions which, when executed by a processor, cause a computer device to perform the method of any one of claims 1-8.

15. A computer-readable storage medium having stored thereon computer-executable instructions for implementing the method of any one of claims 1-8 when executed by a processor.