WO2024025963A1 - Systems and methods for determining protein interactions - Google Patents

Systems and methods for determining protein interactions

Info

Publication number
WO2024025963A1
Authority
WO
WIPO (PCT)
Prior art keywords
protein
proteins
binding
ace2
antibody
Prior art date
Application number
PCT/US2023/028727
Other languages
English (en)
Inventor
Johnson Yiu-Nam Lau
Manson FOK
Original Assignee
Lau Johnson Yiu Nam
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lau Johnson Yiu Nam
Publication of WO2024025963A1

Classifications

    • C CHEMISTRY; METALLURGY
    • C40 COMBINATORIAL TECHNOLOGY
    • C40B COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B30/00 Methods of screening libraries
    • C40B30/04 Methods of screening libraries by measuring the ability to specifically bind a target molecule, e.g. antibody-antigen binding, receptor-ligand binding
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61P SPECIFIC THERAPEUTIC ACTIVITY OF CHEMICAL COMPOUNDS OR MEDICINAL PREPARATIONS
    • A61P31/00 Antiinfectives, i.e. antibiotics, antiseptics, chemotherapeutics
    • A61P31/12 Antivirals
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53 Immunoassay; Biospecific binding assay; Materials therefor
    • G01N33/569 Immunoassay; Biospecific binding assay; Materials therefor for microorganisms, e.g. protozoa, bacteria, viruses
    • G01N33/56983 Viruses
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/20 Screening of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/40 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for data related to laboratory analysis, e.g. patient specimen analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H20/00 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H20/10 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to drugs or medications, e.g. for ensuring correct administration to patients
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N2333/00 Assays involving biological materials from specific organisms or of a specific nature
    • G01N2333/005 Assays involving biological materials from specific organisms or of a specific nature from viruses
    • G01N2333/08 RNA viruses
    • G01N2333/165 Coronaviridae, e.g. avian infectious bronchitis virus
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N2500/00 Screening for compounds of potential therapeutic value
    • G01N2500/04 Screening involving studying the effect of compounds C directly on molecule A (e.g. C are potential ligands for a receptor A, or potential substrates for an enzyme A)

Definitions

  • the functional S-protein is a trimer, with each protomer consisting of two domains, S1 and S2.
  • S1 contains the receptor binding domain (RBD), which directly interacts with ACE2.
  • Conformational changes in S-protein from closed (or “down”) to open (or “up”) are required for its interaction with ACE2.
  • Viral mutational fitness is an essential element of viral adaptation. It is one of the key drivers of the various waves of COVID-19 infections in the current pandemic. Indeed, these waves of infections correspond to the evolution and emergence of new SARS-CoV-2 variants of concern (VOCs). For instance, the D614G and N439K non-synonymous mutations in the S-protein drove the initial wave of infections. Additional non-synonymous mutations accumulated in the Alpha and Delta variants drove the second and third waves of infections, respectively.
  • the Omicron variant with its sub-lineages contains a hypermutated S-protein consisting of previously observed as well as novel mutations.
  • Many non-synonymous mutations are located within the RBD.
  • Some substitutions, for example Q493R, G496S, and Q498R, also form novel hydrogen bonds with ACE2. These new substitutions may enhance or reduce binding affinity towards ACE2. As a result, it is important to be able to predict the impact of these substitutions, and of other as yet unknown substitutions, on ACE2 affinity, immune escape, and general mutational fitness of the virus.
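The substitution labels above follow the common convention of wild-type residue, sequence position, and mutant residue (for example, Q493R replaces glutamine at position 493 with arginine). As a minimal sketch, assuming one-letter amino acid codes and a caller-supplied numbering offset, such labels could be parsed and applied to a sequence as follows. The helper names (`parse_substitution`, `apply_substitutions`) and the toy fragment in the usage note are illustrative assumptions, not part of the disclosure.

```python
import re
from typing import Iterable, NamedTuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
_LABEL = re.compile(rf"^([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])$")


class Substitution(NamedTuple):
    wild_type: str   # residue expected in the reference sequence
    position: int    # position in the reference numbering
    mutant: str      # replacement residue


def parse_substitution(label: str) -> Substitution:
    """Parse a label such as 'Q493R' into its three parts."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a single-residue substitution: {label!r}")
    wild_type, position, mutant = match.groups()
    return Substitution(wild_type, int(position), mutant)


def apply_substitutions(sequence: str, labels: Iterable[str],
                        first_position: int = 1) -> str:
    """Return a copy of `sequence` with every substitution applied.

    `first_position` is the reference number of sequence[0], so that a
    fragment (e.g. a domain) can keep full-length protein numbering.
    """
    residues = list(sequence)
    for label in labels:
        sub = parse_substitution(label)
        index = sub.position - first_position
        if not 0 <= index < len(residues):
            raise IndexError(f"{label}: position outside the sequence")
        if residues[index] != sub.wild_type:
            raise ValueError(f"{label}: sequence has {residues[index]!r} there")
        residues[index] = sub.mutant
    return "".join(residues)
```

For instance, with a made-up six-residue fragment `"QAAGAQ"` numbered from 493 (a toy string, not the real RBD sequence), `apply_substitutions("QAAGAQ", ["Q493R", "G496S", "Q498R"], first_position=493)` yields `"RAASAR"`. Checking the wild-type residue before substituting catches numbering mismatches early.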
  • the inventive subject matter provides apparatus, systems and methods for predicting effects of mutations in the amino acid sequence of a member of a ligand-receptor and/or antigen-antibody pair, thereby permitting generation of mutated proteins with improved binding affinities for therapeutic and/or diagnostic use. Such methods can also be used to predict the ability of pathogen strains to escape conventional therapy (e.g., through reduced interaction with a therapeutic ligand or receptor analog), and to identify pathogen strains with increased infectivity (e.g., in having increased binding affinity for a host receptor).
  • One embodiment of the inventive concept is a method of modulating (e.g., increasing or decreasing relative to values for naturally occurring or known proteins) interaction between a binding protein and a ligand protein, by providing a heterogeneous database comprising a plurality of datasets related to interactions between a first set of proteins and a second set of proteins, wherein the plurality of datasets comprises experimental data from a plurality of experimental techniques.
  • a structure dataset is prepared utilizing the heterogeneous database, where the structure dataset includes a plurality of graphical unified protein structure models.
  • Each graphical unified protein structure model incorporates both sequence and experimental data of members of a protein binding pair, and each graphical unified protein structure model includes a representation (e.g., a graphical representation) of binding strength between members of the protein binding pair.
  • An artificial intelligence (AI) system is trained using the structure dataset to generate a trained AI that includes a protein interaction algorithm derived from correlation between sequence and binding strength elements of the plurality of graphical unified structure models.
  • the trained AI is provided with a primary library comprising a plurality of initial candidate binding proteins and generates, using the protein interaction algorithm, a secondary library that includes a plurality of secondary candidate binding proteins, where the secondary binding proteins are selected by the protein interaction algorithm as including a modulated (e.g., increased or decreased, depending upon the desired end result) interaction with the ligand protein.
  • At least a portion of the set of tertiary candidate binding proteins can be synthesized for use in, for example, an in vitro biomedical assay, an in vivo biomedical assay, or as a therapeutic protein formulation.
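The library-narrowing step described above (a primary library of candidates reduced to a secondary library of candidates predicted to have modulated interaction) can be sketched as a simple filter-and-rank routine. This is only an illustration under stated assumptions: `predict_affinity` stands in for the trained protein interaction algorithm, whose real interface is not given in this text, and the function name, `baseline` argument, and `top_k` cutoff are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def select_secondary_library(
    primary: List[str],
    predict_affinity: Callable[[str], float],  # stand-in for the trained model
    baseline: float,                           # score of the reference protein
    increase: bool = True,                     # True: keep stronger binders
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Keep candidates whose predicted score is modulated relative to baseline.

    Returns (candidate, score) pairs, best first: highest scores first when
    an increase is wanted, lowest scores first when a decrease is wanted.
    """
    scored = [(candidate, predict_affinity(candidate)) for candidate in primary]
    if increase:
        kept = [pair for pair in scored if pair[1] > baseline]
    else:
        kept = [pair for pair in scored if pair[1] < baseline]
    kept.sort(key=lambda pair: pair[1], reverse=increase)
    return kept if top_k is None else kept[:top_k]


def toy_predictor(sequence: str) -> float:
    """Placeholder scorer (counts tyrosines); NOT a real affinity model."""
    return float(sequence.count("Y"))
```

With the placeholder scorer, `select_secondary_library(["AAY", "AYY", "AAA"], toy_predictor, baseline=0.0)` keeps `"AYY"` and `"AAY"` in that order. In the described workflow the surviving candidates would then go on to in vitro or in vivo screening rather than being accepted on the predicted score alone.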
  • the plurality of experimental techniques can include collecting experimental data from an enzyme linked immunosorbent assay, surface plasmon resonance, fluorescence spectroscopy, flow cytometry, or a neutralization assay.
  • Suitable binding proteins include, but are not limited to, an antibody, a fragment of an antibody, a single-chain antibody, and a fragment of a single chain antibody. Such an antibody can be a therapeutic antibody or a result of immunization.
  • the ligand can be an immune checkpoint protein, a tumor marker, or a component of a pathogen.
  • the pathogen can be a virus, such as a coronavirus and the ligand can be a spike protein of the coronavirus.
  • compositions that include a mutated ligand generated by the method described above.
  • Embodiments of the inventive concept include a composition for use in treating a viral infection, and that includes a mutated ligand generated as described above, which can act to compete for a host receptor of the virus causing the viral infection.
  • the viral infection can be a coronavirus infection
  • the mutated ligand shows increased affinity for ACE2 relative to a wild type SARS coronavirus spike protein or relative to SARS spike proteins of a plurality of SARS coronavirus variants.
  • Embodiments of the inventive concept include a method of identifying a mutated pathogen with increased infectivity or escape from immunotherapy relative to that of a wild type or known pathogen, by providing a heterogeneous database comprising data related to interactions between a first set of proteins and a second set of proteins, where the heterogeneous database comprises experimental data from a plurality of experimental techniques.
  • a structure database is generated utilizing the heterogeneous database, wherein the structure dataset includes a plurality of graphical unified protein structure models, wherein each graphical unified protein structure model incorporates both sequence and experimental data of members of a protein binding pair, and wherein each graphical unified protein structure model comprises a representation of binding strength between members of the protein binding pair.
  • the artificial intelligence (AI) system is trained using the structure database to generate a trained AI that includes a protein interaction algorithm derived from correlation between sequence and binding strength elements of the plurality of graphical unified structure models.
  • the trained AI is provided with a primary library comprising sequence information for an immunotherapy protein directed to the pathogen or a host receptor for the pathogen and a plurality of initial candidate pathogen ligand proteins originated from mutant pathogens, where the immunotherapy protein or the host receptor interacts with a ligand protein of wild type pathogen.
  • the protein interaction algorithm is used to generate a secondary library that includes a plurality of secondary pathogen ligand proteins, where the secondary pathogen ligand proteins are selected by the protein interaction algorithm as having a reduced interaction with the immunotherapy protein or an increased interaction with the host receptor.
  • These secondary pathogen ligand proteins are screened for reduced interaction with the immunotherapy protein or increased interaction with the host receptor using an in vivo or in vitro assay.
  • Pathogen that includes a secondary pathogen ligand protein with reduced interaction with the immunotherapy protein relative to the wild type pathogen is reported to a practitioner as likely to escape treatment with the immunotherapy protein.
  • Pathogen that includes a secondary pathogen ligand protein with increased interaction with the host receptor relative to the wild type pathogen is reported to a practitioner as likely to be highly infective.
  • the experimental techniques can include collecting experimental data from enzyme linked immunosorbent assay, surface plasmon resonance, fluorescence spectroscopy, flow cytometry, and/or a neutralization assay.
  • the immunotherapy protein can be selected from the group consisting of an antibody, a fragment of an antibody, a single-chain antibody, and a fragment of a single-chain antibody.
  • the mutated pathogen can be a virus, such as a coronavirus.
  • the plurality of initial candidate pathogen proteins can include a coronavirus spike protein.
  • Embodiments of the inventive concept include methods of improving accuracy of prediction of binding affinities between a first protein and a second protein, by providing a heterogeneous database comprising data related to interactions between a first set of proteins and a second set of proteins, where the heterogeneous database comprises experimental data from a plurality of experimental techniques.
  • a structure database is prepared utilizing the heterogeneous database, where the structure dataset comprises a plurality of graphical unified protein structure models, wherein each graphical unified protein structure model incorporates both sequence and experimental data of members of a protein binding pair, and wherein each graphical unified protein structure model comprises a representation of binding strength between members of the protein binding pair.
  • An artificial intelligence (AI) system is trained using the structure database to generate a trained AI comprising a protein interaction algorithm derived from correlation between sequence and binding strength elements of the plurality of graphical unified structure models.
  • the trained AI is provided with a primary library comprising sequence information for the first protein and the second protein. Using the protein interaction algorithm, a predicted binding affinity between the first protein and the second protein selected from the primary library by a user is generated and then reported.
  • the experimental techniques can include collecting experimental data from enzyme linked immunosorbent assay, surface plasmon resonance, fluorescence spectroscopy, flow cytometry, and/or a neutralization assay.
  • the first protein is an antibody and the second protein is a ligand.
  • the first protein is a coronavirus spike protein and the second protein is ACE2.
  • Embodiments of the inventive concept include a method of generating a high affinity antibody directed to an antigen, by providing a heterogeneous database including data related to interactions between the antigen and a set of initial candidate antibody proteins that can form a protein binding pair, wherein the heterogeneous database comprises experimental data from a plurality of experimental techniques.
  • a structure database is prepared using the heterogeneous database, where the structure dataset includes a plurality of graphical unified protein structure models, where each graphical unified protein structure model incorporates both sequence and experimental data of members of the protein binding pair, and where each graphical unified protein structure model includes a representation of binding strength between members of the protein binding pair.
  • An artificial intelligence (AI) system is trained using the structure database to generate a trained AI that includes a protein interaction algorithm derived from correlation between sequence and binding strength elements of the plurality of graphical unified structure models.
  • the trained AI is provided with a primary library that includes sequence information for the antigen and a plurality of initial candidate antibody proteins originated from an initial antibody directed to the antigen.
  • the protein interaction algorithm generates a secondary library that includes a plurality of secondary antibody proteins, wherein the secondary antibody proteins are selected by the protein interaction algorithm as comprising an increased interaction with the antigen relative to the initial antibody.
  • the secondary antibody proteins are screened for increased interaction with the antigen relative to the initial candidate antibody using an in vivo or in vitro assay to identify a plurality of tertiary antibody proteins having increased affinity for the antigen relative to the initial antibody.
  • One or more of such screened secondary antibody proteins can be synthesized to generate an antibody or antibodies having improved affinity for the antigen from among the plurality of tertiary antibody proteins.
  • the high affinity antibody can be a divalent antibody, a fragment of a divalent antibody, a single-chain antibody, or a fragment of a single-chain antibody.
  • the antigen can be derived from a pathogen, an immunotherapy target, or a cancer marker.
  • Embodiments of the inventive concept include a system for deriving protein binding characteristics, which includes: (a) a database module, including heterogeneous biologic data, wherein heterogeneous biologic data comprises protein sequence data and biologic data originating from a plurality of experimental techniques or forms of expression for the biologic data; (b) a protein representation module, in which heterogeneous data from the database module is used to construct a plurality of graphical hierarchal protein structures for proteins represented in the database module; and (c) an AI module, including encoded instructions to utilize the plurality of graphical hierarchal protein structures as a training set to derive a protein interaction algorithm and to apply the protein interaction algorithm to evaluate or estimate binding characteristics of wild type and/or mutated proteins provided to the AI module.
  • the plurality of experimental techniques can include enzyme linked immunosorbent assay, surface plasmon resonance, fluorescence spectroscopy, flow cytometry, and/or a neutralization assay.
  • the database module can include binding energy estimates derived from experimental data, and/or can include biological data directly related to a protein or proteins being characterized. Alternatively, the database module can omit biological data directly related to a protein or proteins being characterized.
  • the system can include an effector, such as a liquid handling device, a handler for a disposable component, and an incubator.
  • the system can include a controller, which can include the AI module, that is communicatively coupled to the effector.
  • the system includes a sensor, such as a colorimeter, a spectrophotometer, a fluorometer, a luminometer, and/or an imaging system.
  • the sensor can be communicatively coupled to the AI module.
  • FIG.2A shows a typical regression correlation between calculated and experimental values of changes in binding affinity for all mutations in SKEMPI V2.0.
  • FIG.2B shows an exemplary study of the type shown in FIG.2A, where the analysis is stratified into protein complexes containing single amino acid substitution in the S-protein.
  • FIG.2C shows an exemplary study of the type shown in FIG.2A, where the analysis is stratified into protein complexes containing multiple amino acid substitutions in the S-protein.
  • FIG.2D shows typical mean absolute errors (MAEs) of the AI predictions from a study as shown in FIG.2A with one or more amino acid substitutions.
  • FIG.3A shows typical regression performance of affinity change prediction between ACE2 and RBD variants measured by ΔKD,app.
  • FIG.3B shows predicted performance on RBD mutation effects on SARS-CoV-2 variant Alpha (N501Y).
  • FIG.3C shows predicted performance on RBD mutation effects on SARS-CoV-2 variant Beta (K417N + E484K + N501Y).
  • FIG.3D shows predicted performance on RBD mutation effects on SARS-CoV-2 variant Delta (L452R+T478K).
  • FIG.3E shows predicted performance on RBD mutation effects on SARS-CoV-2 variant Eta (E484K).
  • FIG.4A shows typical regression performance of affinity change prediction between RBD and ACE2 variants measured by log2 enrichment ratio.
  • FIG.4B shows a typical correlation between predicted and experimental RBD-ACE2 affinities for single point mutations.
  • FIG.4C shows a typical correlation between predicted and experimental RBD-ACE2 affinities for multi-point mutations.
  • FIG.4D shows a distribution map of log2 enrichment scores of newly designed ACE2 variants.
  • FIG.4E shows typical results of an experimental test of ACE2 variants binding to RBD using ELISA analysis.
  • FIG.4F shows proposed interactions between wild-type and genetically modified ACE2 and SARS-CoV-2 RBD, where genetically modified ACE2 has an N330Y mutation.
  • FIG.4G shows proposed interactions between wild-type and genetically modified ACE2 and SARS-CoV-2 RBD, where genetically modified ACE2 has a Q42L mutation.
  • FIG.4H shows a heatmap of S-protein–ACE2 binding affinities across species.
  • FIG.4I shows a regression analysis of predicted versus experimental affinity change between S-proteins of sarbecoviruses and human ACE2 orthologues.
  • FIG.4J shows a heatmap of predicted affinity values for S-protein–ACE2 binding between SARS-CoV-2 variants and ACE2 proteins from 24 animal species.
  • FIG.5A shows regression performance of RBD-antibody affinity (escape score) prediction for the effects of different RBD mutations on RBD-Ab binding.
  • FIG.5B shows a typical heatmap of an experimental escape score matrix upon mutations of RBD to different antibodies.
  • FIG.5C shows a typical heatmap of a predicted escape score matrix upon mutations of RBD to different antibodies.
  • FIG.5D shows stratified analysis of regression performance on Class 1 neutralization antibodies.
  • FIG.5E shows stratified analysis of regression performance on Class 2 neutralization antibodies.
  • FIG.5F shows stratified analysis of regression performance on Class 3 neutralization antibodies.
  • FIG.5G shows stratified analysis of regression performance on Class 4 neutralization antibodies.
  • FIG.5H shows average escape scores of each site on the RBD.
  • FIG.5I shows an escape score matrix of S protein variants to different antibodies calculated by UniBind™.
  • FIG.5J shows a typical receiver operating characteristic (ROC) plot of predicted escape scores of different S protein variants.
  • FIG.5K shows predicted escape scores of antibodies as individual VOC boxplots.
  • FIG.6A shows predicted ACE2 binding affinity and antibody escape versus dates of variant emergence in the course of the COVID-19 pandemic.
  • FIG.6B shows a typical correlation between AI-generated measurements of S-protein trimer-ACE2 affinity with experimental results.
  • FIG.6C shows a typical correlation between AI-generated measurement on RBD-ACE2 affinity with experimental results.
  • FIG.6D shows a heat map illustrating the effect of mutations in the RBD segment on ACE2 binding affinity changes.
  • FIG.6E shows a heat map showing the effect of mutations in the RBD segment on antibody escape scores.
  • FIG.6F shows characteristics of a viral lineage evolutionary path.
  • FIG.6G shows characteristics of the AI-predicted evolution of new variants based on five Omicron lineages.
  • FIG.6H shows distribution maps of ACE2 binding affinity of variants from subsampled GISAID data.
  • FIG.6I shows distribution maps of AI-predicted ACE2 binding affinity values of potential variants.
  • FIG.6J shows a typical correlation analysis between reported fitness and affinity-based evolutionary score (evo-score).
  • FIG.6K shows characteristics of single mutations’ effects on ACE2 binding affinity, antibody escape, and evo-score.
  • FIG.6L shows effects of essential mutations on evo-scores of five recent Omicron lineages.
  • FIG. 7 schematically depicts exemplary architectural details of a geometry and energy attention (GEA) module.
  • the inventive subject matter provides apparatus, systems and methods in which an AI-based UniBind™ framework is provided that includes at least three major components: protein representation as a graph at the residue and atom levels, BindFormer blocks with geometry and energy attention, and multi-task learning for heterogeneous biological data integration. Trained on more than 70,000 curated protein structure-to-function data points, UniBind™ accurately predicted binding affinities of SARS-CoV-2 spike protein mutants to the human ACE2 receptor, or to neutralizing monoclonal antibodies. Systematic tests on major benchmark datasets and experimental validation demonstrated that UniBind™ is accurate, robust, and scalable.
  • the numbers expressing quantities of ingredients, properties such as concentration, reaction conditions, and so forth, used to describe and claim certain embodiments of the invention are to be understood as being modified in some instances by the term “about.” Accordingly, in some embodiments, the numerical parameters set forth in the written description and attached claims are approximations that can vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques.
  • Each group member can be referred to and claimed individually or in any combination with other members of the group or other elements found herein.
  • One or more members of a group can be included in, or deleted from, a group for reasons of convenience and/or patentability. When any such inclusion or deletion occurs, the specification is herein deemed to contain the group as modified thus fulfilling the written description of all Markush groups used in the appended claims.
  • It should be noted that any language directed to a computer should be read to include any suitable combination of computing devices, including servers, interfaces, systems, databases, agents, peers, engines, controllers, or other types of computing devices operating individually or collectively.
  • the computing devices comprise a processor configured to execute software instructions stored on a tangible, non-transitory computer readable storage medium (e.g., hard drive, solid state drive, RAM, flash, ROM, etc.).
  • the software instructions preferably configure the computing device to provide the roles, responsibilities, or other functionality as discussed below with respect to the disclosed apparatus.
  • the various servers, systems, databases, or interfaces exchange data using standardized protocols or algorithms, possibly based on HTTP, HTTPS, AES, public-private key exchanges, web service APIs, known financial transaction protocols, or other electronic information exchanging methods.
  • Data exchanges preferably are conducted over a packet-switched network, the Internet, LAN, WAN, VPN, or other type of packet-switched network.
  • inventive subject matter provides many advantageous technical effects including provision of rapid and accurate prediction of the effects of specified changes in protein amino acid sequence on interaction of the protein with elements of its environment. This can improve effectiveness of therapeutic proteins and reduce the time required for their development, as well as improving the accuracy of prediction of likely pathogen variants.
  • inventive subject matter is considered to include all possible combinations of the disclosed elements. Thus, if one embodiment comprises elements A, B, and C, and a second embodiment comprises elements B and D, then the inventive subject matter is also considered to include other remaining combinations of A, B, C, or D, even if not explicitly disclosed.
  • The term “coupled to” is intended to include both direct coupling (in which two elements that are coupled to each other contact each other) and indirect coupling (in which at least one additional element is located between the two elements). Therefore, the terms “coupled to” and “coupled with” are used synonymously.
  • Rapid progress on high throughput approaches in experimental biology has generated an unprecedented amount of sequence data with corresponding binding affinity information.
  • There are different experimental methods for measuring binding affinities including surface plasmon resonance, fluorescence spectroscopy, flow cytometry, and neutralization assay.
  • the SKEMPI V2.0 database contains the affinity changes arising from amino acid substitutions of over 7,000 structurally-solved protein complexes.
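An affinity change upon mutation is often expressed as a change in binding free energy. Using the standard thermodynamic relation between the dissociation constant and the binding free energy, ΔG = RT ln(KD), the change is ΔΔG = ΔG(mutant) − ΔG(wild type) = RT ln(KD,mut / KD,wt). The sketch below assumes KD values in molar units, a temperature of 298.15 K, and the sign convention in which a negative ΔΔG means tighter binding; the function names are illustrative, and individual datasets may use a different temperature or sign convention.

```python
import math

GAS_CONSTANT_KCAL = 1.98720425864083e-3  # kcal / (mol * K)


def binding_free_energy(kd_molar: float,
                        temperature_kelvin: float = 298.15) -> float:
    """Binding free energy in kcal/mol from a dissociation constant (molar)."""
    if kd_molar <= 0:
        raise ValueError("dissociation constant must be positive")
    return GAS_CONSTANT_KCAL * temperature_kelvin * math.log(kd_molar)


def affinity_change(kd_wild_type: float, kd_mutant: float,
                    temperature_kelvin: float = 298.15) -> float:
    """ddG = dG(mutant) - dG(wild type); negative means tighter binding."""
    return (binding_free_energy(kd_mutant, temperature_kelvin)
            - binding_free_energy(kd_wild_type, temperature_kelvin))
```

Under these assumptions, a tenfold drop in KD (for example from 10 nM to 1 nM) corresponds to a ΔΔG of roughly −1.36 kcal/mol at 298.15 K.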
  • ‘deep mutational scanning’ can be used to generate large-scale mutational data for any protein. These data can then be organized into a sequence-function map to reveal the functional consequences of all possible single mutations. These data, together with other large proteomic and biochemical databases such as SKEMPI V2.0, provide an opportunity to model changes in SARS-CoV-2 mutations and evaluate their functional impact.
  • Although protein-protein interaction data were generated at an unprecedented scale, these data show enormous heterogeneity in terms of the measurement method (e.g., binding energies vs log2 enrichment ratios vs EC50s), as well as the experimental conditions. Therefore, integration of these heterogeneous datasets remains a challenge for any AI system.
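One simple way to see the heterogeneity problem is that binding energies, log2 enrichment ratios, and EC50 values live on different scales and cannot be pooled as raw numbers. The sketch below shows one generic preprocessing option, standardizing each measurement type separately so that every type becomes a separate prediction task with comparable label ranges. This is an assumption made for illustration, not the integration method of the disclosure; the `Measurement` record and the task names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Measurement:
    complex_id: str   # identifier of the protein pair (hypothetical)
    task: str         # measurement type, e.g. "ddg" or "log2_enrichment"
    value: float      # raw label on that task's own scale


def standardize_per_task(records: List[Measurement]) -> List[Measurement]:
    """Z-score the labels within each measurement type.

    Labels from different tasks are never mixed when computing the mean and
    standard deviation. A task with zero spread maps to 0.0.
    """
    values_by_task: Dict[str, List[float]] = defaultdict(list)
    for record in records:
        values_by_task[record.task].append(record.value)

    stats: Dict[str, Tuple[float, float]] = {
        task: (mean(values), pstdev(values))
        for task, values in values_by_task.items()
    }

    standardized = []
    for record in records:
        mu, sigma = stats[record.task]
        scaled = 0.0 if sigma == 0 else (record.value - mu) / sigma
        standardized.append(Measurement(record.complex_id, record.task, scaled))
    return standardized
```

Keeping the task name on each record lets a multi-task model route every example to the output head and loss term that match its measurement type.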
  • UniBind™ is a generalizable modular framework which includes a hierarchical representation of protein structure as a graph at both atom- and amino acid residue-level and a dual-path neural network named BindFormer, with novel attention mechanisms of geometry and energy attention (GEA).
  • UniBind™ uses multi-task learning and model ensembling to make full use of the unprecedented large-scale experimental data (for example, deep mutational scanning and high throughput genomic sequencing). Trained using over 70,000 mutations, which is 10-fold more data than previously available, UniBind™ predicted the binding affinity between SARS-CoV-2 S-protein mutants and the human ACE2 receptor, or neutralizing monoclonal antibodies. Systematic tests and validations on major benchmark datasets for affinity prediction demonstrated that UniBind™ is accurate, robust, and scalable. The GISAID initiative has generated a sharing platform providing over 10 million SARS-CoV-2 genetic sequences.
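Agreement between predicted and experimental affinity changes, as in the regression and mean absolute error (MAE) analyses listed for FIGS. 2A to 2D, is commonly summarized with the Pearson correlation coefficient and the MAE. The following is a minimal sketch of both metrics in plain Python; the function names are illustrative and the text does not state which software the inventors used.

```python
import math
from typing import Sequence


def _check_paired(x: Sequence[float], y: Sequence[float]) -> None:
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def mean_absolute_error(experimental: Sequence[float],
                        predicted: Sequence[float]) -> float:
    """Average absolute difference between paired values."""
    _check_paired(experimental, predicted)
    total = sum(abs(e - p) for e, p in zip(experimental, predicted))
    return total / len(experimental)


def pearson_r(experimental: Sequence[float],
              predicted: Sequence[float]) -> float:
    """Pearson correlation coefficient between paired values."""
    _check_paired(experimental, predicted)
    n = len(experimental)
    mean_e = sum(experimental) / n
    mean_p = sum(predicted) / n
    covariance = sum((e - mean_e) * (p - mean_p)
                     for e, p in zip(experimental, predicted))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_e == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_e * var_p)
```

The two metrics answer different questions: predictions shifted by a constant offset still give a correlation of 1 while the MAE equals the offset, so reporting both guards against a systematic bias hidden behind a high correlation.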
  • UniBindTM is a powerful tool for prospective biology analysis: high-throughput deep mutational scanning in silico, lineage analysis (including for the new lineage trends), and affinity-based viral evolution. Furthermore, Inventors have applied UniBindTM to address infectivity and immune escapes of different variants of SARS-CoV-2, by predicting properties of S-protein in different variants and their sub-lineages, including its binding affinity to ACE2 and escape from antibody/vaccine therapies.
  • PADBs contain information on affinity changes of the RBD-ACE2 binding upon mutations of the S protein or ACE2, as well as changes in RBD-antibody binding affinities upon RBD mutations (Table 1). Different evaluation metrics were employed as labels for multi-task learning. More specifically, the PADB-SA dataset comprised 45,363 affinity data of SARS-CoV-2 S-protein variants binding to human ACE2, with apparent dissociation constants (KD,app) determined using deep mutational scanning approaches (Starr et al., 2022a; Starr et al., 2020).
  • PADB-AS dataset included 2,230 affinity data of human ACE2 variants binding to wild type SARS-CoV-2 S-protein, with measurements of log2 enrichment ratios or fluorescence intensity change ( ⁇ MFU ⁇ 1000) (Chan et al., 2020).
  • PADB-SAb comprised 16,971 affinity data of different antibodies binding to SARS-CoV-2 S-protein variants, with escape score, IC50, and fold change of IC50 (Cao et al., 2022; Liu et al., 2022; Wang et al., 2021).
  • Overview of the AI model [0079] Inventors combined and integrated multi-source and heterogeneous biological data with UniBindTM for protein-protein interaction prediction tasks.
  • UniBindTM includes three major components: protein representation as graph data structure, BindFormer blocks with geometry and energy attention (GEA), and multi-task learning for heterogeneous biological data integration.
  • As shown in FIG.1A, the datasets described above, together with corresponding protein structural and amino acid substitution information, were expressed as a graph data structure and fed into UniBindTM.
  • UniBindTM can then systematically identify and quantify ACE2 or S-protein variants with altered affinity and antibody binding affinity/escape.
  • Inventors further conducted prospective studies including AI-based lineage analysis, AI-based deep mutational scanning, fitness landscape depiction, and model-guided evolution (FIG .1A).
  • FIG .1A provides an exemplary workflow of the UniBindTM framework.
  • A heterogeneous affinity dataset was constructed using SKEMPI V2.0 and multiple sets of binding affinity data which were collected and curated from different experimental methods. Then UniBindTM was trained using heterogeneous multi-task learning methods for multiple regression predictions including affinity, log2 enrichment ratio, escape score, etc. Based on these predictions on variants, multiple prospective analyses were performed, including lineage analysis, AI-based deep mutational scanning, and model-guided evolution, and were verified with experiments. For protein structure representation, Inventors developed a hierarchical representation of the protein as a graph at both the atom level and the amino acid residue level, to better capture the functional characteristics of protein-protein interactions; this representation served as the input for the BindFormer module (FIG. 1B).
  • FIG.1B schematically depicts an exemplary architecture of a deep neural network utilized in systems and methods of the inventive concept.
  • the residue-level and atom-level features of protein were extracted and aggregated using a unified protein graph representation; for the BindFormer module, Inventors applied a dual-path graph neural network based on GEA (geometry and energy attention) mechanisms; a final multi-task learning method was used to integrate heterogeneous biology experiments and measurements for robust and scalable affinity tasks.
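A minimal sketch of a two-level protein graph is given below, assuming that atoms within a distance cutoff are connected and that two residues are connected whenever any of their atoms are. The cutoff value, the coordinates, and the function name are illustrative assumptions; the disclosure does not specify how edges are defined.

```python
from math import dist


def build_graph(atoms, cutoff=4.0):
    """Build atom-level and residue-level edges from 3D coordinates.

    atoms: list of (residue_id, atom_name, (x, y, z)) tuples.
    Returns (atom_edges, residue_edges). The cutoff (in angstroms) is an
    assumed value chosen for illustration.
    """
    atom_edges = [
        (i, j)
        for i in range(len(atoms))
        for j in range(i + 1, len(atoms))
        if dist(atoms[i][2], atoms[j][2]) <= cutoff
    ]
    # Residue-level edges aggregate atom contacts between different residues.
    residue_edges = {
        tuple(sorted((atoms[i][0], atoms[j][0])))
        for i, j in atom_edges
        if atoms[i][0] != atoms[j][0]
    }
    return atom_edges, sorted(residue_edges)


# Hypothetical coordinates; chain A stands for ACE2 and chain B for the RBD.
atoms = [
    ("A:Q42", "CA", (0.0, 0.0, 0.0)),
    ("A:Q42", "CB", (1.5, 0.0, 0.0)),
    ("B:Y449", "OH", (4.5, 0.0, 0.0)),
    ("B:Q498", "NE2", (20.0, 0.0, 0.0)),
]
atom_edges, residue_edges = build_graph(atoms)
print(atom_edges)
print(residue_edges)
```

Keeping both edge sets lets a model pass messages at fine (atom) and coarse (residue) resolution over the same structure.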
  • The BindFormer module is a dual-path network with novel GEA attention mechanisms that enable message passing in the network.
  • GEA is a geometrically invariant multi-head attention layer that aggregates geometric and energy terms.
  • Inventors applied a multitask learning and model ensemble to increase the tolerance of variations in biological experimental measurements (such as affinity change, apparent affinity change, escape score, and log2 enrichment ratio) simultaneously.
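One common way to train on several measurement types at once is a masked multi-task loss, in which each sample contributes only to the tasks for which it has a label. The sketch below illustrates that general technique; the task names, weights, and squared-error form are assumptions and do not describe the actual UniBindTM training objective.

```python
def multitask_loss(predictions, labels, weights=None):
    """Weighted average of per-task mean squared errors.

    predictions: {task: [predicted values]}
    labels:      {task: [measured values or None when not measured]}
    Samples with a missing label (None) are skipped for that task.
    """
    weights = weights or {}
    total, weight_sum = 0.0, 0.0
    for task, preds in predictions.items():
        pairs = [(p, y) for p, y in zip(preds, labels[task]) if y is not None]
        if not pairs:
            continue  # no labels for this task in the batch
        mse = sum((p - y) ** 2 for p, y in pairs) / len(pairs)
        w = weights.get(task, 1.0)
        total += w * mse
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Hypothetical batch of three samples, each measured by only one assay.
predictions = {"affinity_change": [1.0, 0.0, 2.0], "escape_score": [0.5, 0.2, 0.9]}
labels = {"affinity_change": [1.0, None, 1.0], "escape_score": [None, 0.2, None]}
loss = multitask_loss(predictions, labels)
print(loss)
```

Masking in this way allows datasets with different read-outs to share one model without requiring every sample to carry every label.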
  • the UniBindTM framework is accurate, robust, and scalable when applied to heterogeneous multi-source datasets.
  • systems and methods of the inventive concept can be considered as a series of interconnected functional modules, through which information related to protein structure and interactions flows.
  • One module can be a database module, which includes heterogeneous biologic data (i.e., biologic data originating from more than one source, more than one testing modality, and/or more than one form of expression for the biologic data).
  • A database module can include protein sequence information and binding energy estimates derived from experimental results as provided by literature sources, as well as a range of experimental results (such as dose/response or EC50 curves for protein binding assays, surface plasmon resonance measurements, results of immunofluorescent microscopy studies, results from flow cytometry studies, etc.).
  • Such a database module might or might not include biological data directly related to the protein or proteins being studied.
  • Another module can be a protein representation module, in which heterogeneous data from the database module is used to construct hierarchical protein structures for proteins in the heterogeneous database.
  • unified protein structures which can be envisaged as a graph
  • Another module can be an AI module, which utilizes information from such unified protein structure models as a training set to derive a protein interaction algorithm.
  • This AI module can also be used to apply the protein interaction algorithm so derived to evaluate or estimate binding characteristics of wild type and/or mutated proteins provided to the AI module. Such binding characteristics can be reported to provide a library of one or more mutant proteins having a desired increased or decreased binding relative to a second protein.
  • Such a library of candidate mutant proteins can be further screened (e.g., using in vitro and/or in vivo methods) to determine functionality.
  • AI prediction of affinities of protein-protein interactions [0080] Inventors validated UniBindTM for estimating the impact of single as well as multiple amino acid substitutions on protein-protein interactions using the SKEMPI V2.0 dataset (Figures 2a to 2d). Changes in dissociation constant (ΔKd, kcal mol-1) were employed to measure the effects of mutations on binding affinity. The AI-predicted versus experimentally measured affinity changes following single or multiple amino acid substitutions were plotted. Inventors applied 10-fold cross-validation to calculate the Pearson’s correlation coefficient (PCC) between experimental and calculated ΔKd.
  • FIG.2A shows a typical regression correlation between calculated and experimental values of changes in binding affinity for all mutations in SKEMPI V2.0.
  • Inventors stratified the analysis into protein complexes containing a single amino acid substitution in the S-protein (FIG.2B) or those with multiple amino acid substitutions (FIG.2C), and accurate results were generated with PCCs of 0.78 and 0.91, respectively.
  • Inventors also evaluated the mean absolute error (MAE) of the AI predictions with one or more amino acid substitutions (Figure 2d). Inventors found an MAE of no more than 1.5 kcal mol-1 in all amino acid substitution groups.
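The two evaluation metrics named above, together with a simple 10-fold split, can be sketched as follows. The numeric values are made up for illustration and the fold assignment scheme (every k-th item) is an assumption; the source does not state how folds were drawn.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient (PCC) between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mean_absolute_error(xs, ys):
    """Mean absolute error (MAE), in the same units as the inputs."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def k_folds(n_items, k=10):
    """Assign item indices to k held-out folds (every k-th item per fold)."""
    return [list(range(i, n_items, k)) for i in range(k)]


# Hypothetical experimental and predicted affinity changes (kcal/mol).
experimental = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 5.0]
print(pearson(experimental, predicted))
print(mean_absolute_error(experimental, predicted))
print(k_folds(25)[0])
```

In a cross-validated evaluation, each fold is held out once, predictions for held-out items are pooled, and the metrics are computed on the pooled predictions.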
  • FIG.3A shows typical regression performance of affinity change prediction between ACE2 and RBD variants measured by ΔKD,app.
  • FIGs.3A to 3E show prediction performance for RBD mutation effects on different SARS-CoV-2 variants, including wild type, Alpha (N501Y), Beta (K417N+E484K+N501Y), Delta (L452R+T478K), and Eta (E484K).
  • UniBindTM demonstrated a reliable performance in predicting ΔKD,app with a PCC of 0.90.
  • Inventors stratified the binding affinity prediction into several important SARS-CoV-2 VOC subgroups, including the Alpha (N501Y; FIG.3B), Beta (K417N + E484K + N501Y; FIG.3C), Delta (L452R+T478K, FIG.3D) and Eta (E484K; FIG.3E), using wild-type SARS-CoV-2 as the baseline.
  • FIG.4A shows typical regression performance of affinity change prediction between RBD and ACE2 variants measured by log2 enrichment ratio.
  • FIGs.4B and 4C show typical results of an experimental test of ACE2 variants with single point mutations (FIG.4B) and multi-point mutations (FIG.4C) binding to RBD using ELISA analysis.
  • PCC Pearson’s correlation coefficient
  • SCC Spearman’s correlation coefficient.
  • For the effects on ACE2 binding of single-point mutations to S-protein, the predicted log2 enrichment ratio of protein complexes versus actual experimental measurements had a PCC of 0.73.
  • methods of the inventive concept can be used to generate ACE2-derived proteins and peptides that are ACE2 analogs having an increased affinity for coronavirus spike proteins relative to native ACE2. Such methods can, for example, be used to identify such ACE2 analogs that have high affinity binding to spike proteins of two or more strains of pathogenic or potentially pathogenic coronavirus strains.
  • Such ACE2 analogs or mixtures of such ACE2 analogs can be used as ACE2 receptor traps that are effective against a wide range of coronavirus strains.
  • FIG. 4C shows a typical correlation between predicted and experimental RBD-ACE2 affinities, measured by log2 enrichment score and ⁇ MFU ⁇ 1000, respectively.
  • Experimental data were collected from a literature source, in which RBD-ACE2 affinity was tested using Flow Cytometry.
  • Soluble ACE2 can act as a decoy to neutralize SARS-CoV-2 infection. Accordingly, UniBindTM can be used to design high affinity ACE2 receptor decoy molecules as a general strategy to target all current and future variants. Based on predictions, UniBindTM identified 111 single amino acid substitutions on ACE2 (SEQ ID NO.1) which could potentially increase the binding affinity towards S-protein. Inventors further applied an in silico evolution method to generate 13,913 ACE2 variants containing between 1 and 4 amino acid changes that lead to increased affinity.
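Generating variants with between 1 and 4 amino acid changes can be pictured as combining beneficial single substitutions at distinct positions. The sketch below does this exhaustively and scores each combination by simple addition. The additive scoring, the gain values, and the N330F entry are illustrative assumptions; the in silico evolution method actually used is not detailed in this passage.

```python
from itertools import combinations


def combine_substitutions(single_scores, max_changes=4, min_gain=0.0):
    """Yield (combination, summed score) for sets of 1..max_changes substitutions.

    single_scores: {(position, new_residue): predicted gain for that change}.
    Only substitutions with gain above min_gain are combined, and at most one
    substitution per position is allowed. Additive scoring is an assumption.
    """
    beneficial = sorted(k for k, v in single_scores.items() if v > min_gain)
    for n in range(1, max_changes + 1):
        for combo in combinations(beneficial, n):
            positions = {pos for pos, _ in combo}
            if len(positions) == n:
                yield combo, sum(single_scores[k] for k in combo)


# Hypothetical predicted gains; Q42L and N330Y are named in the text,
# the numbers and the other entries are invented for illustration.
single_scores = {(42, "L"): 0.8, (330, "Y"): 0.6, (330, "F"): 0.3, (27, "A"): -0.2}
candidates = list(combine_substitutions(single_scores))
best_combo, best_score = max(candidates, key=lambda item: item[1])
print(len(candidates), best_combo)
```

In practice each combination would be re-scored by the trained model rather than by summing single-substitution gains, since mutations can interact non-additively.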
  • FIG.4D shows a distribution map of log2 enrichment scores of newly designed ACE2 variants.
  • the orange line shows the reference ACE2 variant (sACE2.v2.4) collected from literature, and the green lines show the ACE2 variants selected for experimental validation.
  • human ACE2 ACE2-WT, SEQ ID NO.1
  • ACE2 of other species, such as Mus musculus (SEQ ID NO. 2) or Rattus norvegicus (SEQ ID NO.3) can be similarly modified. Mutations to human ACE2 (SEQ ID NO.1) utilized in these studies are summarized below in Table 2.
  • Five ACE2 variants were selected for experimental validation (ACE2-1, ACE2-2, ACE2-7, ACE2-8 and ACE2-9) and compared to sACE2.v2.4, known to have the highest affinity for RBD binding. ELISA experiments showed that the EC50s of these five variants (ranging from 0.54 μg/ml to 1.36 μg/ml) were lower than those of both wild-type ACE2 (5.2 μg/ml) and sACE2.v2.4 (1.83 μg/ml), indicating high binding affinity as predicted by UniBindTM (as shown in FIG.4E). The results highlight the potential application of UniBindTM in therapeutic protein engineering.
  • FIG.4F shows interactions between wild-type and genetically modified ACE2 and SARS-CoV-2 RBD, where genetically modified ACE2 has an N330Y mutation and interaction with P499 of SARS-CoV-2 RBD.
  • FIG. 4G shows interactions between wild-type and genetically modified ACE2 and SARS-CoV-2 RBD, where genetically modified ACE2 has a Q42L mutation and interaction with Y449 of SARS-CoV-2 RBD.
  • Q42 is situated in a highly negatively charged area which may prevent interaction with the RBD (refs. 35, 36). Therefore, the Q42L mutation may increase the hydrophobic area of ACE2 and improve binding to Q498 and Y449 on the RBD. Furthermore, the N330Y substitution may provide additional van der Waals contacts and H-bonds with the RBD. [0091] Inventors have found that methods of the inventive concept provide greatly improved accuracy in predicting effects of single- and multi-point mutations on the strength of inter-protein interactions relative to conventional methods. Table 4 shows typical results of a comparison of methods of the inventive concept and various conventional methods on prediction performance in the SKEMPI 2.0 set with mutation-level validation. Values are Pearson correlations between AI model-predicted ΔΔG data and reported experimental ΔΔG data.
  • S1131: a subset of 1,131 non-redundant interface single-point mutations.
  • S4169: a subset of 4,169 single-point mutations compiled from the SKEMPI 2.0 dataset.
  • S8338: a subset of 4,169 single-point mutations and all the corresponding reverse mutations.
  • M1707: a subset of 1,707 non-redundant interface multi-point mutations.
  • Table 4
  • While this exemplary data is generated using the SKEMPI 2.0 data set, Inventors believe that similar improvements are realized on application of methods of the inventive concept to other protein-related data sets.
  • FIG.5A shows regression performance of RBD-antibody affinity (escape score) prediction for the effects of different RBD mutations on RBD-Ab binding.
  • FIG.5B shows a heatmap of an experimental escape score matrix upon mutations of RBD to different antibodies. Brightness represents the escape score. A brighter dot indicates that the mutation on site position of x-axis is more likely to lead to higher immune escape for antibody of y-axis.
  • MAE mean absolute error
  • R 2 coefficient of determination
  • PCC Pearson’s correlation coefficient.
  • FIG.5C shows a heatmap of a predicted escape score matrix upon mutations of RBD to different antibodies. Brightness represents the escape score.
  • a brighter dot indicates that the mutation on site position of x-axis is more likely to lead to higher immune escape for antibody of y-axis.
  • The predicted escape score showed a good correlation with neutralization experiment data in the PADB-SAb dataset (FIG.5D, FIG.5E, FIG.5F, FIG.5G), with all PCCs above 0.8 in four classes of antibodies, validating its utility in predicting antibody escape.
  • Antibody escape scores range from 0 to 1, with increasing scores denoting higher levels of escape. The average escape scores of each site on the RBD are shown in FIG.5H.
  • FIGs.5D to 5G show stratified analysis of regression performance on 4 classes of neutralization antibodies.
  • FIG.5H shows the average mutational effects on escape score at each RBD site. The blue line represents experimental data; the orange line represents predicted results; shadows of each color indicate the standard error of each line.
  • FIG.5I shows an escape score matrix of S protein variants to different antibodies calculated by UniBindTM.
  • the x-axis is antibodies, and the y-axis is common single point mutations and variants of concern.
  • the results are consistent with the current consensus that Omicron and its derivative variants display the strongest immune escape ability.
  • Inventors set several thresholds for neutralization data according to the original literature, and divided the predicted data into two groups: Escape or Non-Escape.
  • the receiver operating characteristic (ROC) plot shows that our predicted escape score could accurately identify the ability of different variants to escape neutralization by different antibodies, with an area under the curve (AUC) of 0.944 (FIG.5J).
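The area under the ROC curve can be computed directly as the probability that a randomly chosen escaping pair receives a higher predicted score than a randomly chosen non-escaping pair. The sketch below uses that rank-based definition; the scores and labels are invented for illustration and are unrelated to the reported AUC of 0.944.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted escape scores.
    labels: truthy for 'Escape', falsy for 'Non-Escape'.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted escape scores and literature-derived escape labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
auc = roc_auc(scores, labels)
print(round(auc, 3))
```

The pairwise form is quadratic in the number of samples, which is acceptable for small panels of variants and antibodies; sorting-based implementations scale better.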
  • FIG.5K shows predicted escape scores of antibodies as individual VOC boxplots. For each analysis, antibodies were separated into two groups that can be escaped or not escaped by SARS-CoV-2 variants according to relevant literature.
  • The center line indicates the median; box limits indicate upper and lower quartiles; whiskers indicate 1.5x interquartile range; points indicate outliers; P values less than 0.05, 0.01, 0.001, 0.0001 are summarized with one to four asterisks, respectively.
  • AI longitudinal prediction on viral evolution and antibody escape [0097] Inventors have found that UniBindTM can accurately predict the S-protein ACE2 binding affinity and potential for antibody escape. Based on these findings, Inventors believe that UniBindTM can be used to perform a prospective analysis on potential SARS-CoV-2 variants, including AI-based lineage analysis, AI-based deep mutational scanning, and model-guided evolution.
  • FIG.6A shows predicted ACE2 binding affinity and antibody escape versus dates of variant emergence in the course of the COVID-19 pandemic.
  • Circles represent reported SARS-CoV-2 variants from GISAID data; circles annotated with common VOC names and their PANGO lineage represent variants examined in a number of experimental ways.
  • Dimensions shown: time from January 2020 to September 2022 (x-axis), antibody escape score (y-axis), and ACE2 affinity (color and circle size; redder and larger circles indicate increased affinity).
  • AI prediction showed an overall trend for the newer variants to have a higher antibody escape score.
  • the Omicron sub-lineages BA.1 and BA.2 which emerged late in 2021 showed a reduction in ACE2 affinity but enhanced antibody escape.
  • the Omicron sub-lineages 22A (BA.4), 22B (BA.5), and 22C (BA.2.12.1) were predicted to show enhanced S-protein ACE2 binding affinity as well as an increase in antibody escape.
  • UniBindTM’s predicted S- protein ACE2 affinities on all VOCs are consistent with that reported in the literature, with a PCC score of 0.74 in the RBD-ACE2 affinity prediction (FIG.6B), and a PCC score of 0.89 in the S-protein trimer-ACE2 affinity prediction (FIG.6C).
  • FIG.6B shows a correlation between AI-generated measurements of S-protein trimer-ACE2 affinity with experimental results.
  • FIG. 6C shows a correlation between AI-generated measurements of RBD-ACE2 affinity with experimental results.
  • FIG. 6D shows a heat map illustrating the effect of mutations in the RBD segment on ACE2 binding affinity changes, red color means increased affinity and blue color means decreased affinity.
  • FIG.6E shows a heat map showing the effect of mutations in the RBD segment on antibody escape scores, blue means decreased antibody binding affinity.
  • This approach is highly consistent with previous studies using an experimental deep mutational scanning method for measurement of mutations in RBD which affects ACE2 affinity ( Figure 6d).
  • UniBindTM can simultaneously predict affinity changes on multiple mutations such as that from all 16 VOCs, which addresses the problem of a heterogeneous batch effect in experimental biology.
  • escape scores were calculated by surveying 80 neutralizing antibodies and averaging the escape score of AI-based antibodies to generate a sequence-to-escape heatmap ( Figure 6e), which can better reflect variant's overall immune escape ability.
  • the Omicron BA.2.12.1 variant displays a near one log-unit improvement in ACE2 affinity, without large changes in antibody escape.
  • UniBindTM predicted a stronger overall evo-score for variants evolving in the direction of a higher antibody escape ability.
  • Inventors applied the evo-score to predict the evolution of the Omicron variants (FIG. 6G).
  • FIG.6G shows characteristics of the evolution of AI-predicted new variants based on five Omicron lineages.
  • Blue dots represent new variants, green dots represent the original Omicron lineages, and orange dots represent variants of interest with the five highest evo-scores for each Omicron lineage.
  • UniBindTM can predict the evolution of the Omicron variants to even higher evo-scores driven by several key non-synonymous mutations, particularly A475E which occurs most frequently. The main determinant of higher evo-scores in these predicted variants is enhanced antibody escape, with their S-protein ACE2 affinity values remaining around the same.
  • UniBindTM also predicted that there is a possibility for future variants evolving to high S-protein ACE2 affinity values, underscoring a risk for potentially more virulent strains (FIGs. 6H and 6I).
  • FIG.6H shows distribution maps of ACE2 binding affinity of variants from subsampled GISAID data.
  • FIG.6I shows distribution maps of AI-predicted ACE2 binding affinity values of potential variants.
  • FIG.6J shows correlation analysis between reported fitness and affinity-based evolutionary score (evo-score).
  • FIG.6K shows characteristics of single mutations’ effects on ACE2 binding affinity, antibody escape, and evo-score. Dashed lines represent contour lines of evo-score; blue dots represent single mutations; orange dots show several well-known mutations which could significantly improve virus evolution.
  • FIG.6L shows effects of essential mutations on evo-scores of five recent Omicron lineages. Circle size and color represent appearance frequency in top score variants that were derived from a relevant original lineage.
  • Inventors have combined heterogeneous sets of protein-protein binding affinity data with AI-based protein sequence-to-function modeling to systematically identify and determine various affinity related tasks.
  • protein representation needs to take into consideration the entirety of the binding interface and the residues which form chemical bonds with each other in the setting of protein-protein interactions.
  • prior learning approaches have limitations due to poor scalability to large datasets, and predictions limited to single mutation variants.
  • UniBindTM is better attuned to protein-protein interaction prediction, as both structural changes and energetic effects are crucial for protein-protein binding affinity prediction. Furthermore, UniBindTM integrates several heterogeneous datasets and performs multi-task learning and model ensembling to predict various task-specific affinity changes (for example, S-protein ACE2 interaction and antibody escape scores). Inventors have validated UniBindTM on major publicly available datasets for affinity prediction, and this has demonstrated that UniBindTM is accurate, robust, and scalable. Another advantage of UniBindTM prediction of S-protein ACE2 affinities is that it uses a full-length S-protein, which is desirable and feasible with AI but infeasible or impractical in biological experiment designs.
  • the first interaction provides a measure of infectivity and the second for the immune escape potential. These can be assessed using in vitro approaches in the laboratory, but this is potentially hazardous, time-consuming, costly, and error-prone (e.g., cross contamination, human error, etc.). Such prior art approaches also do not provide for the evaluation of the large numbers of variants which are currently being generated, and can only increasingly lag real-world needs as new variants emerge.
  • Using UniBindTM deep mutational scanning, Inventors have developed an affinity-based evolutionary score (evo-score) system that takes into consideration both S-protein ACE2 binding affinity and S-protein antibody binding affinity.
  • UniBindTM predicted that the S-proteins of the BA.4 and BA.5 variants have an increased antibody escape score, but that their affinity towards ACE2 remains similar to that of the BA.2 variant. This means that the infectivity and severity of the BA.4 and BA.5 variants are expected to be comparably low, like those of the BA.2 variant. Inventors believe that the therapeutic efficacy of current neutralizing antibodies will be further compromised against the BA.4 and BA.5 variants. Importantly, UniBindTM predicted that additional mutations in the Omicron BA.4 background will result in variants with reduced S-protein affinity towards ACE2 (Figure 6g and Figure 6h).
  • Systems and methods of the inventive concept can include codes and datasets available through governmental Infectious Diseases Control Units as well as the entire scientific and medical communities in order to take full advantage of these resources.
  • A decrease in the S-protein ACE2 binding affinity may be paralleled or compensated by an increase in the viral replication efficiency (e.g., more efficient viral polymerase or other viral replication-related protein mutants, more efficient packaging mutants, or an improved host-viral interaction), all of which may facilitate viral survival or fitness. It is known that, for a virus to evolve, the various parts of the genome evolve in such a way that they cluster into the same genotype or subtypes.
  • UniBindTM can incorporate and utilize data representing other parts of the viral genome (e.g., those related to reproductive efficiency) and utilize such data within functional models as described above for use in evaluating current and projected mutations for virulence, escape from immune protection, etc.
  • AI-based methods for predicting protein-protein interactions have a variety of practical applications.
  • such an AI-based approach can be used in methods to predict the effects of specific mutations in a protein’s amino acid sequence on interactions with one or more binding ligands (such as the same or a different protein, a nucleic acid, a carbohydrate polymer, etc.). If such a binding ligand is involved in a disease process, accurate prediction of mutations that provide for enhanced binding (e.g., providing a lower binding constant) relative to an initial therapeutic protein selected for optimization can be used to generate a library of one or more mutated proteins with improved binding characteristics. Such mutated proteins can then be utilized in screening studies to identify those with binding characteristics that can provide an improved therapeutic protein.
  • Such screening studies can be performed in vitro (e.g., microplate or microbead-based binding studies using labeled proteins) and/or in vivo (e.g., using animal models of disease).
  • a method using an AI-based approach as described above can begin with a therapeutic antibody selected to bind to an immune checkpoint protein (such as PD-1 or PD-L1) to generate a library of one or more mutated antibodies with increased binding affinity for the immune checkpoint protein.
  • Elements of such a library can then be screened for increased binding, for example using an ELISA directed to the immune checkpoint protein.
  • such elements of such a library can be screened using an animal model for a PD-1 bearing cancer.
  • the target ligand can be associated with an infectious disease.
  • the binding ligand can be a component of the pathogen, such as a surface protein or glycoprotein.
  • Such a binding ligand can be directly involved in the disease process (e.g., a viral protein utilized in host cell recognition and entry) or can simply be characteristic of the pathogen.
  • a method using an AI-based approach as described above can begin with a therapeutic antibody selected to bind to a target ligand of the pathogen to generate a library of one or more mutated antibodies with increased binding affinity for the target ligand. Elements of such a library can then be screened for increased binding, for example using an ELISA directed to the target ligand. Alternatively, or in addition, such elements of such a library can be screened using an animal model for the infectious disease. Screening methods utilizing cells grown in culture and/or artificial organ systems can also be used.
  • Data from such screening experiments can be provided to the AI-based method as an experimental database, which can in turn be used to refine results from the AI-based method. [00111]
  • methods using an AI-based approach as described above can be used to identify strains or mutations of a pathogen with reduced affinity for binding interactions with a therapeutic protein. For example, mutated proteins encoded by emergent or potential pathogenic virus strains can be scored for binding to therapeutic proteins that interact with the corresponding wild-type protein. Strains expressing reduced binding to the therapeutic protein can be aggregated in a library of pathogen strains that may escape treatment with the therapeutic protein. Elements of such a library can then be screened for reduced binding, for example using an ELISA directed to the therapeutic protein.
  • such elements of such a library can be screened using an animal model for the infectious disease. Screening methods utilizing cells grown in culture and/or artificial organ systems can also be used.
  • an AI-based approach as described above can be used to develop a library of mutated therapeutic proteins with enhanced interaction with elements of the library of mutated pathogen proteins, permitting identification of potential therapies as the mutated pathogen becomes prevalent.
  • data from such screening experiments can be provided to the AI-based method as an experimental database, which can in turn be used to refine results from the AI-based method.
  • [00112] Wildlife species are known reservoirs for coronaviruses.
  • FIG.4H shows a heatmap of S- protein–ACE2 binding affinities across species.
  • the left panel of FIG.4H shows AI-predicted values generated by a method of the inventive concept.
  • the right panel shows corresponding experimental data.
  • Sarbecoviruses are colored by clade.
  • FIG.4I shows a typical regression analysis of predicted versus experimental affinity change between S-proteins of sarbecoviruses and ACE2 orthologues of humans.
  • FIG.4J shows a heatmap of predicted affinity values for S- protein–ACE2 binding between SARS-CoV-2 variants and ACE2 proteins from 24 animal species.
  • Tiles with labels represent the affinities between related ACE2 orthologues and SARS-CoV-2 spike variants. Circles indicate that the variants reported could bind to relevant ACE2 orthologues; dots indicate that the variants reported could not bind to relevant ACE2 orthologues.
  • methods as described above can be performed using an automated or partially automated system.
  • Such a system can include a computer that encodes elements of the AI and is in communication with suitable databases, and can include effectors and sensors suitable for performing physical screening studies. Suitable effectors include liquid handling devices, handlers for disposable components (e.g., test plates), and incubators.
  • Suitable sensors include colorimeters, spectrophotometers, fluorometers, luminometers, imaging systems, etc. Such systems can include a controller for directing effector functions. Such a controller can include encoded instructions for the performance of screening assays of candidate proteins identified by the AI-based methods.
  • data generated by sensor systems can be provided as an experimental database that is in communication with the computer encoding elements of the AI.
  • Recombinant plasmids were extracted using an endotoxin removal plasmid extraction kit (TIANGENTM).50 ml HEK 293F cells were transfected with 25 ng recombinant plasmids using FectoPRO(ployPlus)TM transfection reagent to express target proteins. The culture medium was collected after 5 days incubation. Recombinant ACE2 proteins were extracted using protein A dextran and purified using SDS-PAGE electrophoresis. The obtained recombinant ACE2 proteins were in the natively dimeric form.
  • ELISA: EC50 values of ACE2 mutants binding to RBD were measured by indirect ELISA as previously described (ref. 3).
  • Wells of a 96-well plate were coated with 200 ng recombinant RBD protein (ABLINK Biotech) at 4°C overnight. After removing the supernatant, the wells were blocked using 1% BSA at room temperature for 2 hours and then washed 3 times using wash buffer PT (Abcam). The 10 mg/ml ACE2, ACE2-1, ACE2-2, ACE2-7, ACE2-8, ACE2-9, and sACE2.v2.4 recombinant proteins were diluted at a ratio of 1:3 into 7 concentrations and added to the blocked plate at 100 µl per well.
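As a rough illustration of the dilution scheme described above, the sketch below lists the seven concentrations obtained from a 10 mg/ml stock. It assumes that "a ratio of 1:3" means each step is one third of the previous concentration (a three-fold series) and that the first point is the undiluted stock; if the ratio instead denotes one part sample plus three parts diluent, the factor would be 4. The function name `dilution_series` is hypothetical and not part of the described method.

```python
def dilution_series(start, fold, n_points):
    """Return n_points concentrations: start, start/fold, start/fold**2, ..."""
    return [start / fold**i for i in range(n_points)]

# Assumptions: 10 mg/ml stock, three-fold steps, 7 concentrations.
series = dilution_series(10.0, 3.0, 7)
for point, conc in enumerate(series, start=1):
    print(f"point {point}: {conc:.4f} mg/ml")
```

Changing `fold` to 4.0 gives the alternative reading of the ratio without touching the rest of the calculation.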
  • SKEMPI V2.0 is a manually curated database of affinity changes upon mutation for structurally solved protein-protein interactions; it currently contains 7,085 mutations in total. The SKEMPI V2.0 database reports many kinetic and thermodynamic parameters; here Inventors used dissociation constants (Kd) to represent affinity. The structures of mutant and wild-type protein complexes were downloaded from the SKEMPI V2.0 website (https://life.bsc.es/pid/skempi2).
  • [00116] The effects of SARS-CoV-2 RBD mutations on RBD-ACE2 binding affinity were collected from literature sources, in which they were measured as apparent dissociation constants (KD,app) using a deep mutational scanning approach.
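When affinity is represented by dissociation constants, as described above, a change in affinity upon mutation is commonly expressed as a change in binding free energy, ΔΔG = RT ln(Kd,mut / Kd,wt). The text does not state that this conversion was used, so the sketch below is only the standard thermodynamic relation; the function name `ddg_from_kd` and the default temperature of 298.15 K are assumptions.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def ddg_from_kd(kd_wt, kd_mut, temperature=298.15):
    """Binding free-energy change (kcal/mol) from wild-type and mutant Kd.

    Positive values mean the mutant binds more weakly than the wild type.
    The temperature is an assumed default, not a value from the source.
    """
    return R_KCAL * temperature * math.log(kd_mut / kd_wt)

# Hypothetical example: a mutation that weakens binding ten-fold (1 nM -> 10 nM).
example_ddg = ddg_from_kd(1e-9, 1e-8)
print(f"ddG = {example_ddg:.2f} kcal/mol")
```

A ten-fold weaker Kd corresponds to roughly +1.36 kcal/mol at 25 °C, and swapping the wild type and the mutant flips the sign.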
  • The effects of ACE2 mutations on RBD-ACE2 binding affinity were collected from literature sources and estimated by the log2 enrichment ratio, which was calculated by comparing transcript frequencies between enriched cell populations and the naïve plasmid library. Flow cytometry analysis data were used for validation.
  • The effects of RBD mutations on RBD-antibody binding affinity were collected from literature sources and estimated by escape scores, which were calculated by comparing barcode frequencies of variants between immune escape cell populations and reference populations, followed by normalization within each antibody. Neutralization assay data from recent literature were included as validation.
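The two frequency-based scores described above can be sketched as follows. This is a minimal illustration, not the procedure used in the cited literature: the pseudocount, the use of a simple frequency ratio for escape, and the choice to normalize by the per-antibody maximum are all assumptions, and every function, antibody, and variant name is hypothetical.

```python
import math
from collections import defaultdict

def frequency_ratio(count_sel, total_sel, count_ref, total_ref, pseudocount=0.5):
    """Ratio of a variant's frequency in the selected vs. reference population."""
    f_sel = (count_sel + pseudocount) / total_sel
    f_ref = (count_ref + pseudocount) / total_ref
    return f_sel / f_ref

def log2_enrichment(count_sel, total_sel, count_ref, total_ref, pseudocount=0.5):
    """log2 of the frequency ratio (enriched population vs. naive library)."""
    return math.log2(
        frequency_ratio(count_sel, total_sel, count_ref, total_ref, pseudocount)
    )

def normalize_within_antibody(raw_scores):
    """Scale raw scores keyed by (antibody, variant) by each antibody's maximum."""
    max_per_antibody = defaultdict(float)
    for (antibody, _), value in raw_scores.items():
        max_per_antibody[antibody] = max(max_per_antibody[antibody], value)
    return {
        key: (value / max_per_antibody[key[0]] if max_per_antibody[key[0]] > 0 else 0.0)
        for key, value in raw_scores.items()
    }

# Toy counts (hypothetical): variant seen 200/1000 times after selection, 100/1000 before.
enrichment = log2_enrichment(200, 1000, 100, 1000, pseudocount=0.0)
escape = normalize_within_antibody(
    {("ab_A", "variant_1"): 2.0, ("ab_A", "variant_2"): 1.0, ("ab_B", "variant_1"): 0.5}
)
print(enrichment, escape)
```

Normalizing within each antibody puts scores from different antibodies on a comparable scale before they are averaged or compared.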
  • The structure of the wild-type RBD-ACE2 complex was obtained from the Protein Data Bank (PDB) with accession number 7df4.
  • The structures of Spike-antibody complexes were collected and curated from the Protein Data Bank (PDB).
  • Inventors developed an affinity-based prospective analysis module for comprehensive analysis including lineage analysis, AI-generated deep mutational scan, fitness landscape depiction, and variant evolution.
  • Protein representation as a graph
  • [00118] Given the wild-type structure and its corresponding mutant structure as input, Inventors represented them as an attributed graph encoded with sequence, geometry, and energy information at both the residue level and the atom level.
  • h is the embedding vector based on the amino acid or atom types, residue sequence indices, chain ids, and mutant types.
  • The translation and rotation vectors are calculated from the coordinates of three specific atoms using the Gram–Schmidt process, where the atoms N–Cα–C are applied for a residue and a corresponding set of three atoms is applied for an individual atom in a residue.
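The construction of a local frame from three atoms by the Gram–Schmidt process can be sketched as below. The text does not specify which atom serves as the origin or the order in which the bond vectors are orthogonalized, so placing the origin at Cα and taking the first axis along Cα→C are assumptions; the function names and the example coordinates (in ångströms) are illustrative only.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scale(a, s):
    return [x * s for x in a]

def normalize(a):
    return scale(a, 1.0 / math.sqrt(dot(a, a)))

def cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]

def local_frame(n_xyz, ca_xyz, c_xyz):
    """Return (translation, rotation) for a residue from its N, CA, C coordinates.

    Assumed convention: translation is the CA position; the rotation rows are
    an orthonormal basis obtained by Gram-Schmidt on (C - CA) and (N - CA).
    """
    e1 = normalize(sub(c_xyz, ca_xyz))
    v2 = sub(n_xyz, ca_xyz)
    e2 = normalize(sub(v2, scale(e1, dot(e1, v2))))  # remove the component along e1
    e3 = cross(e1, e2)                               # completes a right-handed basis
    return ca_xyz, [e1, e2, e3]

# Illustrative backbone coordinates for one residue.
translation, rotation = local_frame(
    [-0.525, 1.363, 0.0], [0.0, 0.0, 0.0], [1.526, 0.0, 0.0]
)
print(translation, rotation)
```

Expressing neighbor positions in such a frame makes the resulting geometric features independent of the global orientation of the complex.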
  • Systems and methods of the inventive concept utilize edge features to capture the energy of a biomolecule conformation.
  • Z_{r,ij} is an energy term between residue i and residue j, and the corresponding atom-level terms are energy terms between pairs of atoms.
  • BindFormer with geometry and energy attention
  • [00119] Based on the unified protein representation, Inventors developed BindFormer, a dual-path neural network that extracts and combines residue- and atom-level information around mutant sites to predict changes in protein–protein binding affinity upon mutation. As geometric and energy features are two key determinants of protein-protein interactions, Inventors implemented geometry and energy attention (GEA) to incorporate them into the message passing in the network.
  • BindFormer block
  • Given the input of residue feature h_r and atom feature h_a, Inventors derived transformed features. In the dual-path process with GEA layers, the i-th residue feature h_{r,i} is first transformed to the atom level and combined, using a multilayer perceptron (MLP), with the feature of each atom in the atom set A_i of that residue. The atomic GEA layer then aggregates atom-level messages from neighbor residues, where x_{a,e} and x_{a,g} are the energy and geometry terms at the atom level.
  • Multi-task learning for heterogeneous biological data integration
  • [00121] Inventors developed a framework with an affinity-consistent constraint loss, which bridges the gap between heterogeneous experiments by explicitly modeling affinity across experiments and training the model on the joint datasets.
  • FIG. 7 schematically depicts exemplary architectural details of a geometry and energy attention (GEA) module. Arrows show the information flow.
  • L: length of the amino acid sequence
  • L_N: number of nearest-neighbor residues in the graph
  • N_h: number of heads in the multi-head attention
  • GEA is a geometric invariant multi-head attention layer aggregating sequence features from neighbor nodes weighted by pairwise geometric and energy terms.
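To make the idea of attention weighted by pairwise geometric and energy terms concrete, the following is a heavily simplified, single-head sketch. It is not the GEA layer itself: the learned projections, the multi-head structure, and the exact way the geometric and energy terms enter the scores are not recoverable from the text, so the additive distance and energy biases, their signs, and all names (`gea_like_attention`, `w_geo`, `w_energy`) are assumptions.

```python
import math

def gea_like_attention(h, dist, energy, neighbors, w_geo=1.0, w_energy=1.0):
    """Aggregate neighbor features with attention biased by distance and energy.

    h         : list of node feature vectors
    dist      : pairwise distances (invariant to rotation and translation)
    energy    : pairwise energy terms (lower is assumed more favorable)
    neighbors : neighbors[i] lists the node indices that node i attends to
    """
    dim = len(h[0])
    out = []
    for i, nbrs in enumerate(neighbors):
        scores = []
        for j in nbrs:
            similarity = sum(a * b for a, b in zip(h[i], h[j])) / math.sqrt(dim)
            scores.append(similarity - w_geo * dist[i][j] - w_energy * energy[i][j])
        top = max(scores)                      # subtract the max for a stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append(
            [sum(w * h[j][k] for w, j in zip(weights, nbrs)) for k in range(dim)]
        )
    return out

# Toy graph with three nodes and one-dimensional features (hypothetical values).
features = [[0.0], [1.0], [2.0]]
distances = [[0.0, 1.0, 4.0], [1.0, 0.0, 1.0], [4.0, 1.0, 0.0]]
energies = [[0.0, -0.5, 0.2], [-0.5, 0.0, -0.1], [0.2, -0.1, 0.0]]
neighbor_lists = [[1, 2], [0, 2], [0, 1]]
updated = gea_like_attention(features, distances, energies, neighbor_lists)
print(updated)
```

Because only pairwise distances and energies enter the scores, rigidly rotating or translating the whole complex leaves the output unchanged, which is the sense of invariance this sketch is meant to show.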
  • Model training and ensemble
  • [00123] Training was performed for 200 epochs using the Adam optimizer with a learning rate of 10^-3 and a weight decay of 10^-6. Mutant and wild-type inversion was applied to each complex pair during training as data augmentation to improve generalization of the network. The models were implemented using PyTorch. Inventors performed 10-fold cross-validation by leaving one fold of mutations out as a test set and using the remaining mutations to train and tune the model, repeating this process for each fold.
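The training setup described above can be outlined in plain Python. The hyperparameters are taken from the text; everything else is an assumption for illustration: the dictionary layout of a sample, the negation of the label under wild-type/mutant inversion (the usual antisymmetry convention for affinity changes), the random fold assignment, and the names `TRAINING_CONFIG`, `invert_pair`, and `k_fold_indices`.

```python
import random

# Values stated in the text. In PyTorch these would map to
# torch.optim.Adam(params, lr=1e-3, weight_decay=1e-6).
TRAINING_CONFIG = {"epochs": 200, "optimizer": "Adam", "lr": 1e-3, "weight_decay": 1e-6}

def invert_pair(sample):
    """Swap wild type and mutant and negate the affinity change (assumed antisymmetry)."""
    return {
        "wild_type": sample["mutant"],
        "mutant": sample["wild_type"],
        "ddg": -sample["ddg"],
    }

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

# Toy usage with a hypothetical sample and 25 mutations.
sample = {"wild_type": "wt_structure", "mutant": "mut_structure", "ddg": 0.8}
augmented = [sample, invert_pair(sample)]
splits = list(k_fold_indices(25, k=10))
print(len(augmented), len(splits))
```

Holding out whole folds of mutations, rather than random rows of the augmented data, keeps an inverted copy of a test mutation from leaking into the training set.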
  • Inventors applied a model ensemble. The reported predictions were obtained by aggregating the outputs of 10-fold cross-validation.
  • AI affinity-based prospective analysis
  • [00124] Inventors conducted prospective analysis on SARS-CoV-2 variants, including AI-based lineage analysis, AI-based deep mutational scanning, fitness landscape depiction, and model-guided evolution.
  • [00125] Based on the AI’s ability to assess affinity properties of variants, Inventors characterized SARS-CoV-2 variants by the changes in S-ACE2 binding affinity and by antibody escape score. Inventors used 918 reported SARS-CoV-2 variants from GISAID data, from the wild type to the latest Omicron variants.
  • The deep mutational scanning (DMS) approach is a high-throughput method that uses next-generation sequencing technology to measure the properties of more than 10^5 variants of a protein in a single experiment. However, the cost of wet-lab experiments increases dramatically as the number of desired variants and properties increases. Inventors therefore conducted AI-based deep mutational scanning by predicting the ACE2-binding affinity changes and averaged antibody escape scores of all single point mutations of the spike protein of SARS-CoV-2.
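An in silico deep mutational scan of the kind described above amounts to enumerating every single point mutation and scoring each with a trained predictor. The sketch below shows only the enumeration and bookkeeping; `toy_predictor` is a placeholder standing in for a trained model, the four-residue sequence is arbitrary, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def single_point_mutants(sequence, offset=1):
    """Yield every single substitution as a label such as 'A1C' (wild type, position, mutant)."""
    for position, wild_type in enumerate(sequence, start=offset):
        for residue in AMINO_ACIDS:
            if residue != wild_type:
                yield f"{wild_type}{position}{residue}"

def scan(sequence, predict, offset=1):
    """Score all single point mutants with the supplied predictor."""
    return {label: predict(label) for label in single_point_mutants(sequence, offset)}

def toy_predictor(label):
    """Placeholder score; a real scan would call the trained affinity or escape model."""
    return (ord(label[-1]) - ord(label[0])) / 100.0

# Arbitrary toy sequence; `offset` lets positions follow the numbering of the full protein.
scores = scan("ACDK", toy_predictor)
top_hits = sorted(scores, key=scores.get, reverse=True)[:3]
print(len(scores), top_hits)
```

A sequence of length n yields 19 × n single point mutants, so the cost of the scan grows linearly with sequence length rather than with wet-lab throughput.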
  • Inventors constructed an evolutionary score (evo-score) using two main determinant factors, ACE2-S affinity and immune escape score, and further depicted the landscape to characterize the fitness of SARS-CoV-2 variants. Specifically, Inventors adopted a support vector machine (SVM) with a radial basis function (RBF) kernel to fit the fitness of each variant and visualized the topography of the fitness landscape to demonstrate the mutation effects. Variants belonging to the same variant of interest (VOI) or clade were highlighted, and clusters in the fitness landscape were grouped together using 2D Gaussian kernel density estimation.
  • Systems and methods of the inventive concept can use a hill-climbing algorithm to search for variants, with a set number of mutations from wild-type ACE2 or SARS-CoV-2, that maximize the minimum predicted functional score from an ensemble of 10-fold models: $x^{*} = \arg\max_{x \in S_L} \min_{m \in M} f_m(x)$, where x is the input with wild-type structure and mutant structure, S_L is the mutant space with mutant edit distance no larger than L, M is the set of trained models, and f_m(x) is a model’s predicted score for the input x. This evolution objective ensures that all models predict that the sequence will have a high functional score.
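The max-min objective described above can be illustrated with a small steepest-ascent hill climb. This is a sketch under stated assumptions rather than the search actually used: variants are plain strings over a two-letter toy alphabet, edit distance is Hamming distance to the wild type, the two lambda functions stand in for the ensemble of trained fold models, and all names are hypothetical.

```python
ALPHABET = "AK"      # toy alphabet; a real search would use all 20 amino acids
WILD_TYPE = "AAAA"   # toy wild-type sequence

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def single_substitutions(sequence):
    """Yield every sequence that differs from `sequence` at exactly one position."""
    for i, old in enumerate(sequence):
        for new in ALPHABET:
            if new != old:
                yield sequence[:i] + new + sequence[i + 1:]

def ensemble_min(sequence, models):
    """Worst-case score across the ensemble: min over m of f_m(x)."""
    return min(model(sequence) for model in models)

def hill_climb(start, models, max_edits, max_steps=100):
    """Move to the best neighbor within the edit budget until no neighbor improves."""
    best, best_score = start, ensemble_min(start, models)
    for _ in range(max_steps):
        candidates = [
            s for s in single_substitutions(best) if hamming(s, WILD_TYPE) <= max_edits
        ]
        if not candidates:
            break
        challenger = max(candidates, key=lambda s: ensemble_min(s, models))
        challenger_score = ensemble_min(challenger, models)
        if challenger_score <= best_score:
            break  # local optimum under the max-min objective
        best, best_score = challenger, challenger_score
    return best, best_score

# Two toy "models" that mostly agree but disagree about position 0.
toy_models = [
    lambda s: s.count("K"),
    lambda s: s.count("K") - 0.5 * (s[0] == "K"),
]
best_variant, best_score = hill_climb(WILD_TYPE, toy_models, max_edits=2)
print(best_variant, best_score)
```

Scoring each candidate by its worst model output steers the search away from variants that only some models favor, here the substitution at position 0 that the second toy model penalizes.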
  • Systems and methods of the inventive concept can initialize a hill-climbing run with selected variants of interest and potential single-point mutations based on AI-based DMS.
  • the terms “comprises” and “comprising” should be interpreted as referring to elements, components, or steps in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, or utilized, or combined with other elements, components, or steps that are not expressly referenced.
  • Where the specification or claims refer to at least one of something selected from the group consisting of A, B, C ... and N, the text should be interpreted as requiring only one element from the group, not A plus N, or B plus N, etc.

Abstract

The invention relates to systems, methods, and compositions that incorporate, utilize, or are generated by an AI-based UniBind™ framework. The AI-based UniBind™ framework includes three major components: a protein representation as a graph at the residue and atom levels, BindFormer™ blocks with geometry and energy attention, and multi-task learning for heterogeneous biological data integration. Trained on more than 70,000 curated protein structure-to-function data, UniBind™ accurately predicted the binding affinities of SARS-CoV-2 spike protein mutants to the human ACE2 receptor or to neutralizing monoclonal antibodies. Systematic tests on major benchmark datasets and experimental validation show that the UniBind™ framework is accurate, robust, and scalable.
PCT/US2023/028727 2022-07-26 2023-07-26 Systems and methods for determining protein interactions WO2024025963A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263392420P 2022-07-26 2022-07-26
US63/392,420 2022-07-26

Publications (1)

Publication Number Publication Date
WO2024025963A1 true WO2024025963A1 (fr) 2024-02-01

Family

ID=89707157

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/028727 WO2024025963A1 (fr) 2022-07-26 2023-07-26 Systems and methods for determining protein interactions

Country Status (1)

Country Link
WO (1) WO2024025963A1 (fr)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111210871A (zh) * 2020-01-09 2020-05-29 Qingdao University of Science and Technology Protein-protein interaction prediction method based on deep forest
US20210371841A1 (en) * 2020-02-26 2021-12-02 Northwestern University Soluble ace2 variants and uses therefor

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BELL ERIC W., SCHWARTZ JACOB H., FREDDOLINO PETER L., ZHANG YANG: "PEPPI: Whole-proteome Protein-protein Interaction Prediction through Structure and Sequence Similarity, Functional Association, and Machine Learning", JOURNAL OF MOLECULAR BIOLOGY, ACADEMIC PRESS, UNITED KINGDOM, vol. 434, no. 11, 1 June 2022 (2022-06-01), United Kingdom , pages 167530, XP093134364, ISSN: 0022-2836, DOI: 10.1016/j.jmb.2022.167530 *
TRAGNI VINCENZO, PREZIUSI FRANCESCA, LAERA LUNA, ONOFRIO ANGELO, MERCURIO IVAN, TODISCO SIMONA, VOLPICELLA MARIATERESA, DE GRASSI : "Modeling SARS-CoV-2 spike/ACE2 protein–protein interactions for predicting the binding affinity of new spike variants for ACE2, and novel ACE2 structurally related human protein targets, for COVID-19 handling in the 3PM context", THE EPMA JOURNAL, SPRINGER, NL, vol. 13, no. 1, 1 March 2022 (2022-03-01), NL , pages 149 - 175, XP093134368, ISSN: 1878-5077, DOI: 10.1007/s13167-021-00267-w *
YAZDANI-JAHROMI MEHDI, YOUSEFI NILOOFAR, TAYEBI AIDA, GARIBAY OZLEM OZMEN, SEAL SUDIPTA, KOLANTHAI ELAYARAJA, NEAL CRAIG J.: "Interpretable and Generalizable Attention-Based Model for Predicting Drug-Target Interaction Using 3D Structure of Protein Binding Sites: SARS-CoV-2 Case Study and in-Lab Validation", BIORXIV, 18 February 2022 (2022-02-18), pages 1 - 11, XP093134375, DOI: 10.1101/2021.12.07.471693 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23847313

Country of ref document: EP

Kind code of ref document: A1