WO2024040031A1 - Computational-based methods for improving protein purification - Google Patents
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/30—Drug targeting using structural data; Docking or binding prediction
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0985—Hyperparameter optimisation; Meta-learning; Learning-to-learn
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
Definitions
- This application relates generally to protein purification, and, more particularly, to computational-based methods for improving protein purification.
- Cell cultures utilizing engineered mammalian or bacterial cell lines can be used to produce a target protein of interest by, for example, insertion of a recombinant plasmid containing the gene for the target protein.
- Because the cell lines themselves are living organisms, they produce proteins other than the target protein and may require a complex growth medium including, for example, various sugars, amino acids, and growth factors. It is often desired, if not required, to obtain a high-purity composition of the target protein, especially when the target protein is going to be used as a therapeutic active agent, such as when the target protein is a therapeutic antibody.
- the produced target protein thus needs to be purified from these other components in the cell culture, which may involve a complex sequence of processes, each involving many variables, such as chromatography stationary phases, mobile phases, salt concentrations, pHs, and other operating conditions, such as temperature.
- a sequence of protein purification processes can include: (a) obtaining a cell culture sample containing the target protein; (b) one or more capture steps, such as an affinity capture step using, for example, protein A; (c) one or more conditioning steps; (d) one or more depth filtration steps; (e) one or more ion exchange chromatography steps, such as cation exchange or anion exchange chromatography, or a mixed mode thereof optionally combined with hydrophobic interaction chromatography; (f) one or more hydrophobic interaction chromatography steps, or a mixed mode thereof; (g) a virus filtration step; and (h) one or more ultra-filtration steps.
- purification techniques involve many variables critical to efficiently producing a high-purity composition of the target protein: in addition to considerations regarding the target protein itself, one must consider, for example, the chromatography stationary phase, the mobile phases, salt concentrations, pHs, and other operating conditions, such as temperature.
- Embodiments of the present disclosure are directed toward one or more computing devices, methods, and non-transitory computer-readable media that may utilize a machine learning model iteratively trained to generate a prediction of a molecular binding property of one or more target proteins as part of a streamlined process of protein purification for identifying target proteins (e.g., antibodies) and accelerating the selection process of therapeutic antibody candidates or other immunotherapy candidates by way of early identification of the most promising therapeutic antibody candidates.
- the machine learning model comprises an ensemble machine learning model comprising a plurality of models.
- the machine learning model (e.g., a “boosting” ensemble-learning model) may be utilized to generate a prediction of a molecular binding property (e.g., a prediction of a percent protein bound at one or more specific pH values and specific salt concentrations and/or specific salt species and chromatographic resin) of one or more proteins by utilizing optimized hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters) and learnable parameters (e.g., regression model weights, decision variables) learned during the training of the machine learning model and a selected k-best matrix of feature vectors of a molecular descriptor matrix representing a set of amino acid sequences corresponding to one or more proteins of interest.
- the machine learning model may utilize the optimized hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters) and learnable parameters (e.g., regression model weights, decision variables) learned during training to predict a percent protein bound (e.g., a percentage of a set of proteins predicted to bind to a ligand within a solution for a given pH value and salt concentration) for one or more target proteins based only on, as input, the selected k-best matrix of feature vectors of a molecular descriptor matrix representing a set of amino acid sequences corresponding to one or more proteins of interest and one or more sets of pH values and salt concentrations associated with the binding properties of the one or more proteins of interest.
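- As a hedged illustration of the prediction step just described (not the patented implementation), the following sketch fits one of the boosting regressors contemplated by the disclosure on a k-best feature matrix augmented with pH and salt-concentration columns; the data, shapes, and hyper-parameter values are illustrative placeholders.
```python
# Minimal sketch: predicting percent protein bound from a k-best feature matrix
# plus pH and salt concentration using a boosting regressor. All values are
# synthetic placeholders, not data from the disclosure.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
n_proteins, k_best = 48, 8

X_kbest = rng.normal(size=(n_proteins, k_best))       # selected k-best descriptors
ph = rng.uniform(4.5, 8.0, size=(n_proteins, 1))       # pH of the batch-binding condition
salt_mM = rng.uniform(0, 500, size=(n_proteins, 1))    # salt concentration (mM)
X = np.hstack([X_kbest, ph, salt_mM])

y_percent_bound = rng.uniform(0, 100, size=n_proteins) # experimentally-determined targets

model = XGBRegressor(
    n_estimators=200,     # booster parameter (illustrative)
    max_depth=4,          # booster parameter (illustrative)
    learning_rate=0.05,   # learning-rate hyper-parameter (illustrative)
)
model.fit(X, y_percent_bound)               # learnable parameters are fit here
predicted_percent_bound = model.predict(X)  # predicted molecular binding property
```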
- the molecular binding property and elution property of the one or more proteins of interest may be determined without considerable upstream experimentation. That is, desirable proteins of the one or more proteins of interest may be identified and distinguished from undesirable proteins of the one or more proteins of interest in-silico, and those desirable proteins identified in-silico may be further utilized to facilitate and accelerate the downstream development and manufacturing of one or more therapeutic mAbs, bsAbs, tsAbs, or other similar immunotherapies that may be utilized to treat various patient diseases (e.g., by reducing upstream experimental duration and experimentation inefficiency and providing in-silico feedback on which candidate proteins may be difficult to purify, and, by extension, ultimately difficult to manufacture).
- the iterations may include 1) reducing a molecular descriptor matrix representing the set of amino acid sequences by clustering similar feature vectors of the molecular descriptor matrix based on a distance metric.
- the distance metric may be calculated based on a Pearson’s correlation, mutual information, or maximum information coefficient (MIC), or other distance metrics.
- the iterations may next include determining the k-best most-predictive feature vectors of the reduced molecular descriptor matrix based on a k-best process and a maximum information coefficient (MIC) for determining a correlation between the feature vectors of the reduced molecular descriptor matrix and an experimentally-determined percent protein bound and/or first principal component (PC) value for one or more specific pH values and salt concentrations.
- the iterations may next include calculating an n-number of cross-validation losses based on the k-best most-predictive feature vectors and the experimentally-determined percent protein bound and/or the first PC value.
- the iterations may include updating the hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters) based on the n-number of cross-validation losses.
- reducing the molecular descriptor matrix, which may include a large set of amino acid sequence-based descriptors, by way of the foregoing feature dimensionality reduction and feature selection techniques may help ensure that the regression model successfully converges to an accurately trained regression model rather than overfitting to superfluous or noisy descriptors.
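- A minimal, non-authoritative sketch of one such iteration follows, under simplifying assumptions: the correlation-distance clustering reduction is assumed to have already been applied, mutual information stands in for the MIC scoring, the hyper-parameter candidates come from a small hand-written grid rather than a real optimizer, and all data are synthetic placeholders.
```python
# Illustrative sketch of one refinement iteration: k-best feature selection,
# n cross-validation losses per hyper-parameter candidate, and retention of the
# candidate with the lowest mean loss.
import numpy as np
from itertools import product
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

rng = np.random.default_rng(1)
X_reduced = rng.normal(size=(48, 40))   # reduced descriptor matrix (proteins x clusters)
y = rng.uniform(0, 100, size=48)        # experimentally-determined percent bound

# k-best most-predictive feature vectors (mutual information as the nonlinear score)
X_kbest = SelectKBest(mutual_info_regression, k=8).fit_transform(X_reduced, y)

candidate_grid = {"max_depth": [3, 5], "learning_rate": [0.05, 0.1], "n_estimators": [100, 300]}
best_loss, best_params = np.inf, None
for max_depth, lr, n_est in product(*candidate_grid.values()):
    params = {"max_depth": max_depth, "learning_rate": lr, "n_estimators": n_est}
    losses = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_kbest):
        model = XGBRegressor(**params)                 # hyper-parameters held fixed
        model.fit(X_kbest[train_idx], y[train_idx])    # learnable parameters vary
        losses.append(mean_squared_error(y[test_idx], model.predict(X_kbest[test_idx])))
    mean_loss = float(np.mean(losses))                 # n-number of cross-validation losses
    if mean_loss < best_loss:
        best_loss, best_params = mean_loss, params     # update the hyper-parameters
print(best_params, best_loss)
```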
- one or more computing devices, methods, and non-transitory computer-readable media may access a molecular descriptor matrix representing a set of amino acid sequences corresponding to one or more proteins.
- the molecular descriptor matrix may be generated by a first machine learning model (e.g., a matrix generation machine learning model) distinct from a machine learning model (e.g., an ensemble-learning model).
- the first machine learning model was trained to generate the molecular descriptor matrix based on the set of amino acid sequences.
- the first machine learning model may include a neural network trained to generate the M-by-N descriptor matrix representing the set of amino acid sequences, in which N includes a number of the set of amino acid sequences and M includes a number of nodes in an output layer of the neural network.
- the one or more computing devices may then refine a set of hyper-parameters associated with a machine learning model trained to generate a prediction of a molecular binding property of the one or more proteins.
- the machine learning model may include one or more of a gradient boosting model, an adaptive boosting (AdaBoost) model, an extreme gradient boosting (XGBoost) model, a light gradient boosted machine (LightGBM) model, or a categorical boosting (CatBoost) model.
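- The boosting variants listed above all expose a scikit-learn-style regressor interface, so one hedged way to organize them is behind a single factory function; the sketch below assumes the xgboost, lightgbm, and catboost packages are available, which the disclosure itself does not require.
```python
# Sketch only: swapping the boosting implementations named above behind one factory.
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

def make_booster(kind: str):
    boosters = {
        "gradient_boosting": GradientBoostingRegressor(),
        "adaboost": AdaBoostRegressor(),
        "xgboost": XGBRegressor(),
        "lightgbm": LGBMRegressor(),
        "catboost": CatBoostRegressor(verbose=0),
    }
    return boosters[kind]
```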
- the prediction of the molecular binding property of the one or more proteins may be generated by a computational model-based column process.
- the computational model-based chromatography process may include one or more of a computational model-based affinity chromatography process, an ion exchange chromatography (IEC) process, a hydrophobic interaction chromatography (HIC) process, or a mixed-mode chromatography (MMC) process.
- chromatography techniques involve a stationary phase and a mobile phase.
- the stationary phase may include moieties designed to interact with a target protein (such as in a bind and elute mode style of chromatography) or to not interact with the target protein (such as in a flow through style of chromatography).
- the mobile phase(s) used in a chromatography technique may have many variables, including a concentration of one or more salts, pH, and solvent gradients.
- chromatography techniques can be performed in various conditions, such as at elevated temperatures.
- the computational model-based chromatography process may include an affinity chromatography process.
- the affinity chromatography process may include an affinity ligand, such as according to any of a protein A chromatography, a protein G chromatography, a protein A/G chromatography, a protein L chromatography, and a kappa chromatography.
- the affinity chromatography process may include an elution mobile phase, such as a mobile phase having a set pH.
- the computational model-based chromatography process may include an ion exchange chromatography process.
- Ion exchange chromatography allows for separation based on electrostatic interactions (anion and cation) between a ligand of the ion exchange stationary phase and a component of a sample, for example, a target or non-target protein.
- the ion exchange chromatography process may include a cation exchange (CEX) stationary phase.
- the ion exchange chromatography may include a strong CEX stationary phase.
- the ion exchange chromatography may include a weak CEX stationary phase.
- the ion exchange chromatography resin may be functionalized with ligands containing anionic functional group(s) such as a carboxyl group or a sulfonate group.
- the ion exchange chromatography stationary phase may include an anion exchange (AEX) stationary phase.
- the ion exchange chromatography may include a strong AEX stationary phase.
- the ion exchange chromatography may include a weak AEX stationary phase.
- the ion exchange chromatography resin may be functionalized with ligands containing cationic functional group(s) such as a quaternary amine.
- the ion exchange chromatography may include a multimodal ion exchange (MMIEX) stationary phase.
- MMIEX chromatography stationary phases may include both cation exchange and anion exchange components and/or features.
- the MMIEX stationary phase may include a multimodal anion/ cation exchange (MM-AEX/ CEX) stationary phase.
- the ion exchange chromatography may include a ceramic hydroxyapatite chromatography stationary phase.
- the ion exchange chromatography stationary phase may be selected from the group consisting of: sulphopropyl (SP) Sepharose® Fast Flow (SPSFF), quaternary ammonium (Q) Sepharose® Fast Flow (QSFF), SP Sepharose® XL (SPXL), Streamline™ SPXL, ABx™ (MM-AEX/CEX medium), Poros™ XS, Poros™ 50HS, diethylaminoethyl (DEAE), dimethylaminoethyl (DMAE), trimethylaminoethyl (TMAE), quaternary aminoethyl (QAE), mercaptoethylpyridine (MEP)-Hypercel™, HiPrep™ Q XL, Q Sepharose® XL, and HiPrep™ SP XL.
- the ion exchange chromatography process may include an elution step mobile phase including increased salt concentrations, such as increased relative to binding or washing mobile phases.
- the computational model-based chromatography process may include a mixed mode chromatography process.
- Mixed mode chromatography processes may include stationary phases that combine charge-based (i.e., ion exchange chromatography features) and hydrophobic-based elements.
- the mixed mode chromatography process may include a bind and elute mode of operation.
- the mixed mode chromatography process may include a flow-through mode of operation.
- the mixed mode chromatography process may include a stationary phase selected from the group consisting of Capto MMC and Capto Adhere.
- the computational model-based chromatography process may include a hydrophobic interaction chromatography (HIC) process.
- Hydrophobic interaction chromatography processes may include hydrophobic stationary phases.
- the hydrophobic interaction chromatography process may include a bind and elute mode of operation.
- the hydrophobic interaction chromatography process may include a flow-through mode of operation.
- the hydrophobic interaction chromatography process may include a stationary phase including a substrate, such as an inert matrix, for example, a cross-linked agarose, sepharose, or resin matrix.
- at least a portion of the substrate of a hydrophobic interaction chromatography stationary phase may include a surface modification including the hydrophobic ligand.
- the hydrophobic interaction chromatography ligand is a ligand including between about 1 and 18 carbons.
- the hydrophobic interaction chromatography ligand may include 1 or more carbons, such as any of 2 or more carbons, 3 or more carbons, 4 or more carbons, 5 or more carbons, 6 or more carbons, 7 or more carbons, 8 or more carbons, 9 or more carbons, 10 or more carbons, 11 or more carbons, 12 or more carbons, 13 or more carbons, 14 or more carbons, 15 or more carbons, 16 or more carbons, 17 or more carbons, or 18 or more carbons.
- the hydrophobic interaction chromatography ligand may include any of 1 carbon, 2 carbons, 3 carbons, 4 carbons, 5 carbons, 6 carbons, 7 carbons, 8 carbons, 9 carbons, 10 carbons, 11 carbons, 12 carbons, 13 carbons, 14 carbons, 15 carbons, 16 carbons, 17 carbons, or 18 carbons.
- the hydrophobic ligand is selected from the group consisting of an ether group, a methyl group, an ethyl group, a propyl group, an isopropyl group, a butyl group, a t-butyl group, a hexyl group, an octyl group, a phenyl group, and a polypropylene glycol group.
- the HIC medium is a hydrophobic charge induction chromatography medium.
- the hydrophobic interaction chromatography process may include a mobile phase including a high salt condition.
- a high salt condition may be used to reduce the solvation of the target thereby exposing hydrophobic regions which can then interact with the hydrophobic interaction chromatography stationary phase.
- the hydrophobic interaction chromatography process may include a mobile phase including a low salt condition, for example, with no salt or no added salt.
- the hydrophobic interaction chromatography stationary phase is selected from the group consisting of Bakerbond WP Hl-PropylTM, Phenyl Sepharose® Fast Flow (Phenyl-SFF), Phenyl Sepharose® Fast Flow Hi-sub (Phenyl-SFF HS), Toyopearl® Hexyl-650, PorosTM Benzyl Ultra, and Sartobind® phenyl
- the Toyopearl® Hexyl-650 is Toyopearl® Hexyl-650M.
- the Toyopearl® Hexyl-650 is Toyopearl® Hexyl-650C.
- the Toyopearl® Hexyl-650 is Toyopearl® Hexyl-650S.
- the prediction of the molecular binding property of the one or more proteins may include an identification of a target protein of the one or more proteins.
- the prediction of the molecular binding property of the one or more proteins may use quantitative structure property relationship (QSPR) or a quantitative structure activity relationship (QSAR) modeling of the one or more proteins.
- the prediction of the molecular binding property of the one or more proteins may include a prediction of a molecular binding property for each amino acid sequence of the set of amino acid sequences corresponding to the one or more proteins.
- the prediction of the molecular binding property for each amino acid sequence may include a computational model-based isolation of desirable amino acid molecules from undesirable amino acid molecules.
- the machine learning model (e.g., an ensemble-learning model) may be further trained to generate a prediction of a molecular elution property of the one or more proteins. In another embodiment, the machine learning model may be further trained to generate a prediction of a flow-through property of the one or more proteins.
- the one or more computing devices may refine the set of hyper-parameters iteratively by executing a process until a desired precision is reached. For example, in certain embodiments, the one or more computing devices may execute the process by first reducing the molecular descriptor matrix by selecting one representative feature vector for each of a plurality of feature vector clusters. In one embodiment, each of the feature vector clusters includes similar feature vectors. For example, in some embodiments, reducing the molecular descriptor matrix may include performing clustering using a correlation distance metric, for example, calculated based on a Pearson’s correlation of feature vectors of the molecular descriptor matrix, to generate the plurality of feature vector clusters.
- the clustering of the sets of descriptors may be based on the correlation distance between the descriptors, which may be calculated from the Pearson’s correlation (e.g., 1 - abs(Pearson’s correlation)).
- the selected one representative feature vector for each of the plurality of feature vector clusters may include a centroid feature vector for each of the plurality of feature vector clusters utilized to represent two or more of the similar feature vectors.
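- A hedged sketch of this reduction step is shown below: feature vectors are clustered on a 1 - abs(Pearson’s correlation) distance using agglomerative (hierarchical) clustering, and the member closest to each cluster mean is kept as the representative; the matrix shapes and the cluster count are illustrative assumptions, not values from the disclosure.
```python
# Cluster descriptor feature vectors on a correlation distance and keep one
# representative per cluster (the member closest to the cluster mean).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(2)
descriptors = rng.normal(size=(1024, 30))   # M feature vectors, each 1-by-P (P = 30 proteins)

corr = np.corrcoef(descriptors)             # M x M Pearson correlation between feature vectors
dist = 1.0 - np.abs(corr)                   # correlation distance, 1 - abs(Pearson)
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist, checks=False), method="average")  # agglomerative clustering
labels = fcluster(Z, t=40, criterion="maxclust")                # C = 40 clusters (illustrative)

representatives = []
for c in np.unique(labels):
    members = np.where(labels == c)[0]
    centroid = descriptors[members].mean(axis=0)
    closest = members[np.argmin(np.linalg.norm(descriptors[members] - centroid, axis=1))]
    representatives.append(closest)

reduced = descriptors[np.array(representatives)]  # C-by-P reduced descriptor matrix
```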
- the one or more computing devices may execute the process by then determining one or more most-predictive feature vectors of the selected representative feature vectors for each of the plurality of feature vector clusters based on a correlation between the selected representative feature vectors and predetermined batch binding data associated with the one or more proteins. For example, in some embodiments, determining the one or more most-predictive feature vectors of the selected representative feature vectors for each of the plurality of feature vector clusters may include selecting a k-best matrix of feature vectors of the selected representative feature vectors for each of the plurality of feature vector clusters. In one embodiment, the k-best matrix of feature vectors of the selected representative feature vectors is determined based on a predetermined k-best process.
- the correlation between the selected representative feature vectors and the predetermined batch binding data is determined based on a Pearson’s correlation, mutual information, maximal information coefficient (MIC), or other metric, between the selected representative feature vectors and the predetermined batch binding data.
- a distance correlation, mutual information, or other similar nonlinear correlation metric and/or linear correlation metrics may be utilized.
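- As a minimal illustration of this ranking step, the sketch below scores each representative feature vector against the predetermined batch binding data with mutual information (standing in for the MIC scoring described above, which would typically require an external package such as minepy) and keeps the top K features; all data are placeholders.
```python
# Rank representative feature vectors by a nonlinear association with the
# batch-binding data and keep the top K (the k-best matrix).
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(3)
reduced = rng.normal(size=(30, 40))           # P proteins x C representative features
percent_bound = rng.uniform(0, 100, size=30)  # predetermined batch binding data

scores = mutual_info_regression(reduced, percent_bound)
K = 8
k_best_idx = np.argsort(scores)[::-1][:K]     # indices of the K most-predictive features
X_kbest = reduced[:, k_best_idx]              # k-best features matrix
```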
- the one or more computing devices may execute the process by then calculating one or more cross-validation losses based at least in part on the one or more most-predictive feature vectors and the predetermined batch binding data.
- calculating the one or more cross-validation losses further may include evaluating a cross-validation loss function based on the one or more most-predictive feature vectors, the predetermined batch binding data, the set of hyper-parameters, and a set of learnable parameters associated with the machine learning model, and further minimizing the cross-validation loss function by varying the set of learnable parameters while the one or more most-predictive feature vectors, the predetermined batch binding data, and the set of hyper-parameters remain constant.
- minimizing the cross-validation loss function may include optimizing the set of hyper-parameters.
- the set of hyper-parameters may include one or more of a set of general parameters, a set of booster parameters, or a set of learning-task parameters.
- minimizing the cross-validation loss function may further include minimizing a loss between a prediction of a percent protein bound for the one or more proteins and an experimentally-determined percent protein bound for the one or more proteins.
- the predetermined batch binding data may include an experimentally-determined percent protein bound for one or more pH values and salt concentrations associated with the molecular binding property of the one or more proteins.
- the set of learnable parameters may include one or more weights or decision variables determined by the machine learning model based at least in part on the one or more most-predictive feature vectors and the predetermined batch binding data.
- calculating the one or more cross-validation losses may include calculating an n number of cross-validation losses, in which n includes an integer from 1-n. In some embodiments, calculating the one or more cross-validation losses may include determining an n number of individual train-test splits based on the one or more most-predictive feature vectors and the predetermined batch binding data, in which n includes an integer from 1-n. In some embodiments, calculating the one or more cross-validation losses may include calculating an n number of cross-validation losses and generating the prediction of the molecular binding property of the one or more proteins based on an averaging of the n number of cross-validation losses.
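- A hedged sketch of this cross-validation step follows: for each of n train-test splits the learnable parameters are fit while the hyper-parameters stay fixed, yielding n cross-validation losses whose mean is taken, and the per-fold model predictions are averaged as one reading of the averaging described above; the data and hyper-parameter values are placeholders.
```python
# Calculate n cross-validation losses over n train-test splits and average them.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(48, 8))
y = rng.uniform(0, 100, size=48)

n_splits = 5
losses, fold_predictions = [], []
for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
    model = XGBRegressor(max_depth=4, n_estimators=200)  # hyper-parameters fixed
    model.fit(X[train_idx], y[train_idx])                 # learnable parameters fit per split
    losses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    fold_predictions.append(model.predict(X))             # prediction from each fold model

mean_cv_loss = float(np.mean(losses))                     # averaged cross-validation loss
averaged_prediction = np.mean(fold_predictions, axis=0)   # averaged per-fold predictions
```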
- the one or more computing devices may execute the process by then updating the set of hyper-parameters based on the one or more cross-validation losses.
- the updated set of hyper-parameters may include one or more of an updated set of general parameters, an updated set of booster parameters, or an updated set of learning-task parameters.
- the one or more computing devices may output, by the machine learning model, the prediction of the molecular binding property of the one or more proteins based at least in part on the updated set of hyper-parameters.
- the one or more computing devices may further access a second molecular descriptor matrix representing a second set of amino acid sequences corresponding to one or more second proteins, reduce the second molecular descriptor matrix by selecting one representative feature vector for each of a second plurality of feature vector clusters of the second molecular descriptor matrix, determine one or more second most-predictive feature vectors of the selected representative feature vectors for each feature vector cluster based on a second correlation between the selected representative feature vectors and predetermined batch binding data associated with the one or more second proteins, input the one or more second most-predictive feature vectors into the machine learning model trained to generate a prediction of a molecular binding property of the one or more second proteins, and output, by the machine learning model, the prediction of the molecular binding property of the one or more second proteins based at least in part on the updated set of hyper-parameters.
- the one or more computing devices may further optimize the machine learning model based on a Bayesian model-optimization process. In some embodiments, the one or more computing devices may then utilize Group K-Fold cross-validation to train and evaluate the optimized machine learning model based on the one or more most-predictive feature vectors, the predetermined batch binding data, the set of hyper-parameters, and the set of learnable parameters. In some embodiments, the Group K-Fold cross-validation may be stratified in order to ensure that the cross-validation training and evaluation splits include a diverse range of regression target values. In some embodiments, the stratification might be accomplished using labels generated by binning the regression target values into a number of quantiles.
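- The stratified Group K-Fold idea can be sketched as follows, assuming one group per protein and scikit-learn >= 1.0 for StratifiedGroupKFold: regression targets are binned into quantiles to form stratification labels, and whole groups are kept together across splits; the grouping scheme and quantile count are illustrative assumptions.
```python
# Stratified Group K-Fold splits using quantile-binned regression targets as labels.
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

rng = np.random.default_rng(5)
n_rows = 120                               # e.g., proteins x (pH, salt) conditions
y = rng.uniform(0, 100, size=n_rows)       # regression target (percent bound)
groups = rng.integers(0, 30, size=n_rows)  # protein identity for each row (assumed grouping)
X = rng.normal(size=(n_rows, 10))

strata = pd.qcut(y, q=5, labels=False)     # quantile-binned labels for stratification

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, strata, groups=groups):
    pass  # train and evaluate the optimized model on each split here
```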
- one or more computing devices, methods, and non-transitory computer-readable media may access a molecular descriptor matrix representing a set of amino acid sequences corresponding to one or more proteins; and obtain, by a machine learning model, a prediction of a molecular binding property of the one or more proteins based at least in part on the molecular descriptor matrix, wherein the machine learning model is trained by: accessing a training molecular descriptor matrix representing a training set of amino acid sequences corresponding to one or more empirically-evaluated proteins; and iteratively executing a process to refine a set of hyper-parameters associated with the machine learning model until a desired precision is reached, the process comprising: reducing the training molecular descriptor matrix by selecting one representative feature vector for each of a plurality of feature vector clusters, wherein each feature vector cluster includes similar feature vectors; determining one or more most-predictive feature vectors of the selected representative feature vectors for each feature vector cluster based on a correlation between the selected representative feature vectors and predetermined batch binding data associated with the one or more empirically-evaluated proteins; calculating one or more cross-validation losses based at least in part on the one or more most-predictive feature vectors and the predetermined batch binding data; and updating the set of hyper-parameters based on the one or more cross-validation losses.
- FIG. 1 illustrates a diagram illustrating an experimental example for performing one or more protein purification processes as compared to a computational model-based example for performing one or more protein purification processes, in accordance with various embodiments.
- FIG. 2 illustrates a high-level workflow diagram for performing feature generation, feature dimensionality reduction, regression model optimization, and model output-based feature selection, in accordance with various embodiments.
- FIG. 3A illustrates a workflow diagram for optimizing hyper-parameters and learnable parameters of a machine learning model for performing one or more computational model-based protein purification processes, in accordance with various embodiments.
- FIG. 3B illustrates a workflow diagram for optimizing the machine learning model for performing one or more computational model-based protein purification processes, in accordance with various embodiments.
- FIG. 4 illustrates a flow diagram of a method for generating a prediction of a molecular binding property of one or more target proteins as part of a streamlined process of protein purification for identifying target proteins, in accordance with various embodiments.
- FIG. 5 illustrates an example computing system, in accordance with various embodiments.
- FIG. 6 illustrates a diagram of an example artificial intelligence (Al) architecture included as part of the example computing system of FIG. 5, in accordance with various embodiments
- FIG. 7 illustrates another high-level workflow diagram for performing feature generation, feature dimensionality reduction, regression model optimization, and model output-based feature selection, in accordance with various embodiments.
- FIG. 8 illustrates another workflow diagram for optimizing hyper-parameters and learnable parameters of a machine learning model for performing one or more computational model-based protein purification processes, in accordance with various embodiments.
- FIG. 9 illustrates a process for training a machine learning model to predict a molecular binding property, in accordance with various embodiments.
- FIGS. 10A-10D illustrate example plots illustrating how a principal component analysis can be used to predict a molecular binding property, in accordance with various embodiments.
- FIGS. 11A-11F illustrate example heat maps illustrating a relationship between experimental conditions and experimental Kp values, and experimental conditions and modeled Kp values, respectively, in accordance with various embodiments.
- FIG. 12 illustrates a flow diagram of a method for generating a prediction of a molecular binding property of one or more target proteins as part of another streamlined process of protein purification for identifying target proteins, in accordance with various embodiments.
- Embodiments of the present disclosure are directed toward one or more computing devices, methods, and non-transitory computer-readable media that may utilize a machine learning model iteratively trained to generate a prediction of a molecular binding property of one or more target proteins as part of a streamlined process of protein purification for identifying target proteins (e.g., antibodies) and accelerating the selection process of therapeutic antibody candidates or other immunotherapy candidates by way of early identification of the most promising therapeutic antibody candidates.
- This streamlined process of identifying target proteins (e.g., antibodies) in-silico may facilitate and accelerate the downstream development and manufacturing of one or more therapeutic monoclonal antibodies (mAbs), bispecific antibodies (bsAbs), trispecific antibodies (tsAbs), or other similar immunotherapies that may be utilized to treat various diseases.
- the machine learning model (e.g., ensemble-learning model or a “boosting” ensemble-learning model) may be utilized to generate a prediction of a molecular binding property (e.g., a prediction of a percent protein bound at one or more specific pH values and specific salt concentrations and/or specific salt species and chromatographic resin) of one or more proteins by utilizing optimized hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters) and learnable parameters (e.g., regression model weights, decision variables) learned during the training of the machine learning model and a selected k-best matrix of feature vectors of a molecular descriptor matrix representing a set of amino acid sequences corresponding to one or more proteins of interest.
- the machine learning model may utilize the optimized hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters) and learnable parameters (e.g., regression model weights, decision variables) learned during training to predict (i) a percent protein bound (e.g., a percentage of a set of proteins predicted to bind to a ligand within a solution) for a given pH value and salt concentration or a plurality of different combinations of pH values and salt concentrations, (ii) a percent protein bound (e.g., a percentage of a set of proteins predicted to bind to a ligand within a solution) for a set of pH values and salt concentrations, and/or (iii) a principal component (PC) representing a set of pH values and salt concentrations, for one or more target proteins based only on, as input, the selected k-best matrix of feature vectors of a molecular descriptor matrix representing a set of amino acid sequences corresponding to one or more proteins of interest and one or more sets of pH values and salt concentrations associated with the binding properties of the one or more proteins of interest.
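- As a hedged illustration of option (iii), the sketch below compresses each protein’s batch-binding profile across a grid of (pH, salt) conditions into a first principal-component score that could serve as a single regression target; the condition grid and data are placeholders, not values from the disclosure.
```python
# Principal component analysis over the batch-binding profiles: the first PC score
# per protein summarizes its binding behavior across all (pH, salt) conditions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
percent_bound = rng.uniform(0, 100, size=(30, 24))  # proteins x (pH, salt) conditions

pca = PCA(n_components=2)
scores = pca.fit_transform(percent_bound)           # per-protein principal-component scores
first_pc = scores[:, 0]                             # first PC value per protein (regression target)
explained = pca.explained_variance_ratio_[0]        # fraction of binding variance captured
```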
- desirable proteins of the one or more proteins of interest may be identified and distinguished from undesirable proteins of the one or more proteins of interest in-silico, and those desirable proteins identified in-silico may be further utilized to expedite and facilitate the downstream development of one or more therapeutic mAbs, bsAbs, tsAbs, or other similar immunotherapies that may be utilized to treat various diseases (e.g., by reducing upstream experimental duration and experimentation inefficiency and providing in-silico feedback on which candidate proteins may be difficult to purify, and, by extension, ultimately difficult to manufacture).
- reducing the molecular descriptor matrix, which may include a large set of amino acid sequence-based descriptors, by way of the foregoing feature dimensionality reduction and feature selection techniques may help ensure that the regression model successfully converges to an accurately trained regression model rather than overfitting to superfluous or noisy descriptors.
- The terms “polypeptide” and “protein” may interchangeably refer to a polymer of amino acid residues, and are not limited to a minimum length.
- such polymers of amino acid residues may contain natural or non-natural amino acid residues, and include, but are not limited to, peptides, oligopeptides, dimers, trimers, and multimers of amino acid residues. Both full-length proteins and fragments thereof are encompassed by the definition, for example.
- the terms “polypeptide” and “protein” may also include post-translational modifications of the polypeptide, for example, glycosylation, sialylation, acetylation, phosphorylation, and the like.
- FIG. 1 illustrates a diagram 100 illustrating an experimental example 102 for performing one or more protein purification processes as compared to a computational model-based example 104 for performing one or more protein purification processes, in accordance with the disclosed embodiments.
- the experimental duration for the experimental example 102 for performing one or more protein purification processes may span a number of weeks.
- the execution time for the computational model-based example 104 for performing one or more protein purification processes may be only minutes.
- the experimental example 102 for performing one or more protein purification processes may include receiving amino acid sequences at block 106, selecting plasmids at block 108, engineering proteins by way of cell lines and cell cultures at blocks 110 and 112, respectively, performing one or more chromatography processes (e.g., an affinity chromatography process, ion exchange chromatography (IEX) process, a hydrophobic interaction chromatography (HIC) process, or a mixed-mode chromatography (MMC) process) at block 114, and performing a high throughput screening (HTS) and computing a partition coefficient (Kp) to quantify protein binding at block 116, all as part of a cumbersome and time-consuming protein purification process.
- a molecular assessment of one or more target proteins may be then performed at block 118.
- the computational model-based example 104 for performing one or more protein purification processes may include accessing amino acid sequences corresponding to one or more proteins of interest at block 106, generating a molecular descriptor matrix based on the amino acid sequences and reducing the molecular descriptor matrix at block 120, and utilizing a machine learning model (e.g., an ensemble-learning model) to generate a prediction of a molecular binding property of one or more target proteins at block 122, as part of an optimized and streamlined process of protein purification for identifying target proteins (e.g., antibodies) and accelerating the selection process of therapeutic antibody candidates or other immunotherapy candidates, in accordance with the presently disclosed embodiments.
- the machine learning model may utilize optimized hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters) and learnable parameters (e.g., regression model weights, decision variables) learned during training to predict a percent protein bound (e.g., a percentage of a set of proteins predicted to bind to a ligand within a solution for a given pH value and salt concentration) for one or more target proteins based only on, as input, a selected k-best matrix of feature vectors of the molecular descriptor matrix generated at block 120 and one or more sets of pH values and salt concentrations associated with the binding properties of the one or more proteins of interest.
- the molecular assessment of the one or more target proteins may be then performed at block 118 without considerable upstream experimentation (e.g., as compared to the experimental example 102 for performing one or more protein purification processes). That is, desirable proteins of the one or more proteins of interest may be identified and distinguished from undesirable proteins of the one or more proteins of interest in-silico, and those desirable proteins identified in-silico may be further utilized to expedite and facilitate the downstream development of one or more therapeutic mAbs, bsAbs, tsAbs, or other similar immunotherapies that may be utilized to treat various diseases (e.g., by reducing upstream experimental duration and experimentation inefficiency and providing in-silico feedback on which candidate proteins may be difficult to purify, and, by extension, ultimately difficult to manufacture).
- the machine learning model may be configured to obtain a prediction of a molecular binding property of the one or more proteins. From the molecular binding property, desirable proteins may be identified.
- FIG. 2 illustrates a high-level workflow diagram 200 for performing feature generation 202, feature dimensionality reduction 204, model-output based feature selection 206, and regression model optimization 208, in accordance with the disclosed embodiments.
- the high-level examples for performing feature generation 202, feature dimensionality reduction 204, model-output based feature selection 206, and regression model optimization 208 may be discussed in greater detail below with respect to FIGS. 3A and 3B, and may be performed by a machine learning model (e.g., a matrix generation machine learning model) in conjunction with another machine learning model (e.g., an ensemble-learning model) in accordance with the presently-disclosed embodiments.
- feature generation 202 may be performed by a machine learning model 301
- feature dimensionality reduction 204 may be performed by a feature dimensionality reduction model 307A, 307B of the machine learning models 302A, 302B
- model-output-based feature selection 206 may be performed by a feature selection model 309A, 309B of the machine learning models 302A, 302B
- regression model optimization 208 may be performed by a regression model 311A, 311B of the machine learning models 302A, 302B.
- performing feature generation 202 may include generating, for example, 1024 molecular descriptors (e.g., amino acid sequence-based descriptors).
- performing feature dimensionality reduction 204 may include, for example, clustering and reducing the 1024 molecular descriptors (e.g., amino acid sequence-based descriptors) to remove redundant features or other features determined to be exceedingly similar.
- performing model-output-based feature selection 206 may include generating a k-best feature matrix to reduce the molecular descriptors to only the k-best most-predictive features of those molecular descriptors.
- the number of molecular descriptors may be 1024 based on the particular model used to generate the descriptors. As another example, the number of molecular descriptors may be greater or smaller, for instance, 2048 descriptors, 320 descriptors, etc.
- performing regression model optimization 208 may include, for example, optimizing hyper-parameters and learnable parameters associated with the regression model 311A, 311B of the machine learning models 302A, 302B.
- the feature dimensionality reduction 204 and model-output-based feature selection 206 may, in some embodiments, be provided to filter the large set of amino acid sequence-based descriptors that may be generated as part of the feature generation 202. In this way, reducing the large set of amino acid sequence-based descriptors by way of feature dimensionality reduction 204 and model-output-based feature selection 206 may ensure that the regression model successfully converges to an accurately trained regression model as opposed to suffering overfitting due to superfluous or noisy descriptors.
- FIG. 3A illustrates a detailed workflow diagram 300A for optimizing hyper-parameters and learnable parameters of a machine learning model 302A (e.g., an ensemble-learning model) and utilizing the machine learning model 302A to generate a prediction of a molecular binding property of one or more target proteins as part of a streamlined process of protein purification for identifying target proteins (e.g., antibodies) and accelerating the selection process of therapeutic antibody or other immunotherapy candidates by way of early identification of the most promising therapeutic antibody candidates, in accordance with the disclosed embodiments.
- the workflow diagram 300A may be performed in conjunction by a machine learning model 301 (e.g., a matrix generation machine learning model) and a machine learning model 302A (e.g., as illustrated by the dashed line) executed utilizing one or more processing devices (e.g., computing device(s) 500 and artificial intelligence architecture 600 to be discussed below with respect to FIGS. 5 and 6).
- the one or more processing devices may include, for example, hardware (e.g., a general purpose processor, a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a system-on-chip (SoC), a microcontroller, a field-programmable gate array (FPGA), a central processing unit (CPU), an application processor (AP), a visual processing unit (VPU), a neural processing unit (NPU), a neural decision processor (NDP), a deep learning processor (DLP), a tensor processing unit (TPU), a neuromorphic processing unit (NPU), or any other processing device(s) that may be suitable for processing genomics data, proteomics data, metabolomics data, metagenomics data, transcriptomics data, or other omics data and making one or more decisions based thereon), software (e.g., instructions running/executing on one or more processors), firmware (e.g., microcode), or some combination thereof.
- the machine learning model 302A may include, for example, any number of individual machine learning models or other predictive models (e.g., a feature dimensionality reduction model 307A, a feature selection model 309A, and a regression model 311A) that may be trained and executed in conjunction (e.g., trained and/or executed serially, in parallel, or end-to-end) to perform one or more predictions in sequence, such that the output of one or more initial models in the pipeline serves as the input to one or more succeeding models in the ensemble until a final overall prediction is outputted (e.g., “boosting”).
- the machine learning model 302A may include a gradient boosting model, an adaptive boosting (AdaBoost) model, an extreme gradient boosting (XGBoost) model, a light gradient boosted machine (LightGBM) model, or a categorical boosting (CatBoost) model.
- the machine learning model 301 may perform one or more feature generation and data importing tasks 303, while the machine learning model 302A may include a feature dimensionality reduction model 307A, a feature selection model 309A, and a regression model 311A.
- One or more hyper-parameter optimization tasks 314 may further be performed to refine a set of hyper-parameters associated with the machine learning model 302A.
- the workflow diagram 300A may begin at functional block 304 with the machine learning model 301 importing amino acid sequences for a set of one or more P proteins.
- the machine learning model 301 may include one or more pre-trained artificial neural networks (ANNs), convolutional neural networks (CNNs), or other neural networks that may be suitable for generating a large set of amino acid sequence-based descriptors in, for example, a supervised, weakly-supervised, semi-supervised, or unsupervised manner.
- the amino acid sequence-based descriptors may be utilized (e.g., as opposed to structure-based descriptors), as the amino acid sequence-based descriptors may be more effective for training the machine learning model 302A to generate predictions of the molecular binding property of one or more target proteins (e.g., as compared to utilizing structure-based descriptors).
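- As a heavily hedged illustration (not the patented machine learning model 301), one publicly available way to obtain per-residue, sequence-based descriptors is a pre-trained protein language model; the specific checkpoint, library, 320-dimensional embedding, and toy sequence below are assumptions for illustration only.
```python
# Generate per-residue descriptors for an amino acid sequence with a pre-trained
# protein language model (ESM-2 via the transformers library).
import torch
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

sequence = "EVQLVESGGGLVQPGGSLRLSCAAS"  # toy antibody-like fragment (placeholder)
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    residue_descriptors = model(**inputs).last_hidden_state  # 1 x (L + 2 special tokens) x M
```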
- the feature dimensionality reduction model 307A, 307B and the feature selection model 309A, 309B may, in some embodiments, be provided to filter the large set of amino acid sequence-based descriptors that may be outputted by the machine learning model 301.
- reducing the large set of amino acid sequence-based descriptors by way of the feature dimensionality reduction model 307A, 307B and the feature selection model 309A, 309B may ensure that the regression model 311A, 311B successfully converges to an accurately trained regression model as opposed to suffering overfitting due to superfluous or noisy descriptors.
- predetermined batch binding data for the set of one or more P proteins may also be imported for use by the machine learning model 302A.
- the predetermined batch binding data may include an experimentally-determined percent protein bound for one or more specific pH values and salt concentrations (e.g., a sodium chloride (NaCl) concentration, a phosphate (PO₄³⁻) concentration) and/or salt species (e.g., a sodium acetate (CH3COONa) species, a sodium phosphate (Na3PO4) species) and chromatographic resin.
- the workflow diagram 300A may then continue at functional block 306 with the machine learning model 301 generating a molecular descriptor matrix of size M-by-N.
- the workflow diagram 300A may then continue at functional block 308 with generating a weighted average of the descriptors (M) in the molecular descriptor matrix across all amino acids (N). For example, in certain embodiments, a weighted average of the descriptors (M) in the molecular descriptor matrix across all amino acids (N) may be calculated, resulting in a descriptor vector of size M-by-1 for each protein of the set of one or more P proteins. For example, in some embodiments, the machine learning model 301 may generate one or more M-by-1 vectors of descriptors for each protein of the set of one or more P proteins. In certain embodiments, the workflow diagram 300A may then continue at functional block 310 with representing descriptor vectors for all proteins (P) as a protein descriptor matrix of size M-by-P.
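- A minimal sketch of this pooling step is shown below: each protein’s M-by-N per-residue descriptor matrix is averaged across its N amino acids (uniform weights are assumed here, although the text allows any weighted average), giving an M-by-1 vector per protein and an M-by-P protein descriptor matrix overall; the shapes are placeholders.
```python
# Pool per-residue descriptors into one descriptor vector per protein, then stack
# all proteins into an M-by-P protein descriptor matrix.
import numpy as np

rng = np.random.default_rng(7)
M = 1024                                                           # descriptors per residue
sequences = [rng.normal(size=(M, n)) for n in (118, 121, 130)]     # M-by-N matrix per protein

weights = [np.full(seq.shape[1], 1.0 / seq.shape[1]) for seq in sequences]  # uniform weights
protein_vectors = [seq @ w for seq, w in zip(sequences, weights)]           # M-by-1 per protein
protein_descriptor_matrix = np.column_stack(protein_vectors)                # M-by-P
```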
- functional block 312 of the workflow diagram 300A may illustrate an iteration of the machine learning model 302A having already been trained, in which a set of hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters) and a set of learnable parameters (e.g., regression model weights, decision variables) were identified during the training of the machine learning model 302A.
- a baseline set of hyper-parameters may be selected and then updated iteratively so as to minimize the average score of the 10-cycle regression-based model of the machine learning model 302A.
- the machine learning model 302A may be iteratively trained until a desired precision is reached, refining a set of hyper-parameters by updating the selected hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters) with each successive iteration.
- the selected hyper-parameters may be updated based on one or more cross-validation losses.
- the desired precision is reached when a given set of hyper-parameters selected minimizes (e.g., reaches lowest possible value or error on a scale of 0.0 to 1.0) the one or more cross-validation losses.
- minimizing the one or more cross-validation losses may include minimizing a loss between a predicted percent protein bound and an experimentally-determined percent protein bound.
- the desired precision of the machine learning model 302A is reached when a given set of hyper-parameters selected minimizes the loss between the predicted percent protein bound and the experimentally-determined percent protein bound.
- the hyper-parameters may be optimized by evaluating a cross-validation loss function based on the k-best feature vectors most-predictive of the predetermined batch binding data, the predetermined batch binding data (e.g., experimentally-determined percent protein bound for one or more specific pH values and salt concentrations and/or salt species and chromatographic resin), the baseline set of hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters), and a set of learnable parameters (e.g., regression model weights, decision variables) associated with, and determined by, the machine learning model 302A.
- the machine learning model 302A may then minimize the cross-validation loss function by varying the set of learnable parameters while the k-best most-predictive feature vectors, the predetermined batch binding data, and the set of hyper-parameters remain constant.
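- A hedged sketch of Bayesian-style hyper-parameter optimization of the boosting model is given below: each trial proposes booster and learning-task parameters, the cross-validation loss is evaluated with those hyper-parameters held fixed while the learnable parameters are fit, and the best-scoring set is retained; the use of Optuna and the particular search ranges are assumptions, not part of the disclosure.
```python
# Bayesian-style hyper-parameter search minimizing a mean cross-validation loss.
import numpy as np
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

rng = np.random.default_rng(8)
X_kbest = rng.normal(size=(48, 8))     # k-best most-predictive features (placeholder)
y = rng.uniform(0, 100, size=48)       # predetermined batch binding data (placeholder)

def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
    }
    scores = cross_val_score(XGBRegressor(**params), X_kbest, y,
                             scoring="neg_mean_squared_error", cv=5)
    return -scores.mean()              # mean cross-validation loss to minimize

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=25)
best_hyper_parameters = study.best_params
```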
- the machine learning model 302A may include a feature dimensionality reduction model 307A, a feature selection model 309A, and a regression model 311A.
- a feature dimensionality reduction task may reduce the molecular descriptor matrix by selecting one representative feature vector for each of a plurality of feature vector clusters.
- the workflow diagram 300A may continue at functional block 322 with the machine learning model 302A evaluating a similarity of different descriptors by comparing the set of M feature vectors of size 1-by-P.
- the similarity of different descriptors may be evaluated by comparing the set of M feature vectors of size 1-by-P.
- the workflow diagram 300A may then continue at functional block 324 with the machine learning model 302A calculating a correlation between the feature vectors (size 1-by-P).
- the machine learning model 302A may calculate a correlation distance metric, which may, for example, be calculated using a Pearson’s correlation, between each of the feature vectors (size 1-by-P).
- clustering of the descriptors may be based on the correlation distance between the descriptors calculated from the Pearson’s correlation (e.g., 1 - abs(Pearson’s correlation)).
- the workflow diagram 300A may then continue at functional block 326 with the machine learning model 302A clustering feature vectors in order to group together redundant features that capture similar information. For example, in certain embodiments, utilizing an agglomerative-clustering process and the calculated distance correlation metric, which may be calculated based on the Pearson’s correlation, the machine learning model 302A may cluster feature vectors in order to group together any and all redundant features that include similar information (similar feature vectors). In certain embodiments, the workflow diagram 300A may then continue at functional block 328 with the machine learning model 302A determining a centroid of each cluster as representative of the cluster, which is valuable for feature selection.
- the selection of the centroid of each cluster can enable a set of orthogonal features to be selected, which can reduce multicollinearity.
- the workflow diagram 300A may then continue at functional block 330 with the machine learning model 302A iteratively evaluating the number of clusters (C) to determine how many result in optimal performance of the machine learning model 302A.
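The following is a minimal sketch of the correlation-distance clustering and centroid-based representative selection described for functional blocks 322-330, assuming descriptors are stored column-wise in a NumPy array; the function name reduce_by_clustering and the rule for picking the representative column are illustrative assumptions, not part of the disclosure.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def reduce_by_clustering(X, n_clusters):
    """X: descriptor matrix of shape (P proteins, M descriptors)."""
    corr = np.corrcoef(X, rowvar=False)            # M x M Pearson correlation between descriptors
    dist = 1.0 - np.abs(corr)                      # correlation distance: 1 - abs(Pearson's correlation)
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    keep = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        # representative = member with the smallest mean distance to the rest of its cluster
        mean_dist = dist[members][:, members].mean(axis=1)
        keep.append(members[np.argmin(mean_dist)])
    keep = sorted(keep)
    return X[:, keep], keep                        # reduced matrix and kept descriptor indices
```

The number of clusters (C) passed to this helper would then be swept iteratively, as described in functional block 330, to find the value giving the best downstream model performance.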
- the machine learning model 302A may also include the feature selection model 309A, which may determine one or more most-predictive feature vectors of the selected representative feature vectors for each feature vector cluster.
- the workflow diagram 300A may continue at functional block 332 with the machine learning model 302A, starting with the reduced descriptor matrix (size C-by-P), calculating a correlation between the feature vectors (1-by-P) in the reduced descriptor matrix (C-by-P) and the predetermined batch binding data at functional block 334.
- the machine learning model 302A may calculate the correlation between the selected representative feature vectors (1-by-P) in the reduced descriptor matrix (C-by-P) and the predetermined batch binding data (associated with the one or more proteins) in order to rank which features and/or descriptors capture information that is suitable for predicting the outputs.
- a nonlinear correlation metric (e.g., maximal information coefficient (MIC), distance correlation, mutual information, or other similar nonlinear correlation metric) or a linear correlation metric (e.g., a Pearson's correlation) may be utilized to calculate the correlation between the selected representative feature vectors and the predetermined batch binding data.
- the workflow diagram 300A may then continue at functional block 336 with the machine learning model 302A determining the top K feature vectors (1-by-P) that are most predictive of the predetermined batch binding data to generate the k-best features matrix (K-by-P). For example, in certain embodiments, utilizing a k-best process, the machine learning model 302A may select the top K feature vectors (1-by-P) that are most predictive of the predetermined batch binding data (e.g., as scored by the MIC, distance correlation, mutual information, or other similar nonlinear correlation metric) to generate a k-best features matrix (K-by-P).
- the k-best features matrix may maintain the top K feature vectors (1-by-P), where K is an integer value indicating a number of the feature vectors that are maintained.
- the k-best features matrix may maintain the top K feature vectors, where K is a percentage value indicating a percentage of the feature vectors that are maintained.
- the workflow diagram 300A may then continue at functional block 338 with the machine learning model 302A iteratively evaluating the K feature vectors to determine how many result in optimal performance of the machine learning model 302A.
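A minimal sketch of the k-best selection step, assuming the reduced descriptor matrix is arranged with proteins (or protein/condition pairs) as rows; mutual information stands in here for the nonlinear score, and the helper name select_k_best is illustrative.

```python
from sklearn.feature_selection import SelectKBest, mutual_info_regression

def select_k_best(X_reduced, percent_bound, k):
    """X_reduced: (n samples, C descriptors); percent_bound: (n samples,) targets."""
    selector = SelectKBest(score_func=mutual_info_regression, k=k)
    X_k = selector.fit_transform(X_reduced, percent_bound)   # keep only the top-K columns
    return X_k, selector.get_support(indices=True)           # k-best matrix and kept indices
```

As with the number of clusters, the value of K would be evaluated iteratively (functional block 338) to find the count that yields the best model performance.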
- the machine learning model 302A may also include the regression model 311A.
- the workflow diagram 300A may continue at functional block 340 with the machine learning model 302A, starting with the baseline hyper-parameters selected and updated as part of the hyper-parameter optimization tasks 314, performing cross-validation utilizing n unique train-test splits (e.g., Group K-Fold cross-validation, stratified K-Fold cross-validation).
- the cross-validation may include calculating one or more cross-validation losses based at least in part on the one or more most-predictive feature vectors and the predetermined batch binding data.
- the machine learning model 302A may perform cross-validation utilizing 10 unique train-test splits of the k-best features matrix and the predetermined batch binding data (e.g., training data set).
- the machine learning model 302A may perform cross-validation utilizing 2 or more, 5 or more, 10 or more, or other quantities of, unique train-test splits of the k-best features matrix and the predetermined batch binding data (e.g., training data set) in order to, for example, reduce a possibility of overfitting or miscalculating the accuracy of the machine learning model 302A due to the train-test split.
- the machine learning model 302A may perform cross-validation utilizing any n integer number of unique train-test splits, so long as the integer number n is less than or equal to a number of data points corresponding, for example, to the training dataset.
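A minimal sketch of the cross-validation loss computation, assuming group-aware splits keyed on a protein identifier so that measurements from the same molecule do not leak between folds; the function name cross_validation_loss is illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupKFold

def cross_validation_loss(model, X, y, protein_ids, n_splits=10):
    """X, y: NumPy arrays; protein_ids: one group label per row."""
    losses = []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=protein_ids):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        losses.append(mean_squared_error(y[test_idx], pred))
    return float(np.mean(losses))   # average loss over the n unique train-test splits
```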
- the workflow diagram 300A may then continue at functional block 342 with the machine learning model 302A adjusting the weight given to the data of the predetermined batch binding data (e.g., percent protein bound at various pH values and salt concentrations and/or salt species and chromatographic resin) to weight data in the transition region with greater importance.
- the machine learning model 302A may adjust the weight given to each point in the predetermined batch binding data to weight data in the transition region (e.g., partially-bound proteins) with more importance than data for fully-bound or fully-unbound proteins.
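A minimal sketch of the re-weighting at functional block 342; the percent-bound thresholds and the boost factor below are assumptions chosen only for illustration.

```python
import numpy as np

def transition_weights(percent_bound, low=5.0, high=95.0, boost=5.0):
    """percent_bound in [0, 100]; boost is an assumed relative weight for the transition region."""
    w = np.ones_like(percent_bound, dtype=float)
    in_transition = (percent_bound > low) & (percent_bound < high)   # partially bound points
    w[in_transition] = boost
    return w   # pass as sample_weight when fitting the regressor
```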
- the workflow diagram 300A may then continue at functional block 344 with the machine learning model 302A predicting a percent protein bound for the set of proteins P and optimizing the machine learning model 302A by minimizing a loss between the predicted percent protein bound and an experimentally-determined percent protein bound.
- the workflow diagram 300A may then continue at functional block 346 with the machine learning model 302A repeating model optimization n times with unique train-test splits and reporting the average score.
- the regression tasks of the machine learning model 302A may include receiving the predetermined batch binding data and the k-best features matrix and predicting (at functional block 346) a percent protein bound for the set of proteins P based on the predetermined batch binding data and the k-best features matrix.
- the machine learning model 302A may then be optimized by minimizing (at functional block 346) a loss (e.g., sum of squared error (SSE)) between the predicted percent protein bound and the experimentally-determined percent protein bound for one or more specific pH values and salt concentrations (e.g., a sodium-chloride (NaCl) concentration, a phosphate (PO4 3-) concentration) and/or salt species (e.g., a sodium acetate (CH3COONa) species, a sodium phosphate (Na3PO4) species) and chromatographic resin.
- pH value and salt concentration and/or salt species and chromatographic resin may be associated with the molecular binding property of the one or more proteins.
- a machine learning model 302A may be iteratively trained to generate a prediction of a molecular binding property of one or more target proteins as part of a streamlined process of protein purification for identifying target proteins (e.g., antibodies) and accelerating the selection process of therapeutic antibody candidates or other immunotherapy candidates by way of early identification of the most promising therapeutic antibody candidates.
- the streamlined process of identifying target proteins (e.g., antibodies) in-silico may facilitate and accelerate the downstream development and manufacturing of one or more therapeutic mAbs, bsAbs, tsAbs, 2+1 Abs, or other similar immunotherapies that may be utilized to treat various diseases.
- the machine learning model 302A (e.g., “boosting” machine learning model) may be utilized to generate a prediction of a molecular binding property (e.g., a prediction of a percent protein bound at one or more specific pH values and specific salt concentrations and/or specific salt species and chromatographic resin) of one or more proteins by utilizing optimized hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters) and learnable parameters (e.g., regression model weights, decision variables) learned during the training of the machine learning model 302A and a selected k-best matrix of feature vectors of a molecular descriptor matrix representing a set of amino acid sequences corresponding to one or more proteins of interest.
- the machine learning model 302A may utilize the optimized hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters) and learnable parameters (e.g., regression model weights, decision variables) learned during training to predict a percent protein bound (e.g., a percentage of a set of proteins predicted to bind to a ligand within a solution for a given pH value and salt concentration) and/or a first principal component (PC1) of the Log(Kp) values (logit transform of percent bound) for one or more target proteins based only on, as input, the selected k-best matrix of feature vectors of a molecular descriptor matrix representing a set of amino acid sequences corresponding to one or more proteins of interest and one or more sets of pH values and salt concentrations and/or salt species and chromatographic resin associated with the binding properties of the one or more proteins of interest.
- a first principal component (PC1) of the Log(Kp) values may be predicted from data across the design space (some set of datapoints covering a range of pH/salt concentrations) for a given resin.
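A minimal sketch of the Log(Kp) (logit) transform and the per-resin first principal component summary; the clipping epsilon and the use of a base-10 logarithm are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

def log_kp(percent_bound, eps=1e-3):
    """Logit transform of percent bound (0-100) into Log(Kp)."""
    f = np.clip(percent_bound / 100.0, eps, 1.0 - eps)   # fraction bound, clipped away from 0 and 1
    return np.log10(f / (1.0 - f))

def first_principal_component(log_kp_matrix):
    """log_kp_matrix: (proteins, conditions) Log(Kp) values across the design space for one resin."""
    return PCA(n_components=1).fit_transform(log_kp_matrix).ravel()   # one PC1 score per protein
```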
- the molecular binding property and elution property of the one or more proteins of interest may be determined without considerable upstream experimentation. That is, desirable proteins of the one or more proteins of interest may be identified and distinguished from undesirable proteins of the one or more proteins of interest in-silico, and those desirable proteins identified in-silico may be further utilized to expedite and facilitate the downstream development of one or more therapeutic mAbs, bsAbs, tsAbs, or other similar immunotherapies that may be utilized to treat various diseases (e.g., by reducing upstream experimental duration and experimentation inefficiency and providing in-silico feedback on which candidate proteins may be difficult to purify, and, by extension, ultimately difficult to manufacture).
- the machine learning model may be configured to obtain a prediction of a molecular binding property of the one or more proteins. From the molecular binding property, desirable proteins may be identified. While the present embodiments are discussed herein primarily with respect to the machine learning model 302A generating a prediction of a molecular binding property of one or more target proteins, it should be appreciated that the machine learning model 302A as trained may also generate a prediction of an elution property of the one or more proteins or generate a prediction of a flow-through property of the one or more proteins, in accordance with the presently disclosed embodiments.
- FIG. 3B illustrates a detailed workflow diagram 300B for optimizing the machine learning model 302A as discussed above with respect to FIG. 3A and utilizing the optimized machine learning model 302B to generate a prediction of a molecular binding property of one or more target proteins as part of a streamlined process of protein purification for identifying target proteins (e.g., antibodies) and accelerating the selection process of therapeutic antibody candidates or other immunotherapy candidates by way of early identification of the most promising therapeutic antibody candidates, in accordance with the disclosed embodiments.
- the workflow diagram 300B may represent an improvement over the workflow diagram 300A as discussed above with respect to FIG. 3A.
- the workflow diagram 300B may include performing one or more Bayesian optimization processes (e.g., sequential model-based optimization (SMBO), expected improvement (EI)) to iteratively optimize and evaluate the machine learning model 302B by, for example, selectively determining which of the functional blocks of feature dimensionality reduction model 307B, feature selection model 309B, and regression model 311B to execute, as well as the order in which the determined functional blocks of feature dimensionality reduction model 307B, feature selection model 309B, and regression model 311B are to be executed.
- the workflow diagram 300B may be performed utilizing one or more processing devices (e.g., computing device(s) 500 and artificial intelligence architecture 600 to be discussed below with respect to FIGS. 5 and 6) that may include hardware (e.g., a general purpose processor, a graphic processing unit (GPU), an application-specific integrated circuit (ASIC), a system-on-chip (SoC), a microcontroller, a field-programmable gate array (FPGA), a central processing unit (CPU), an application processor (AP), a visual processing unit (VPU), a neural processing unit (NPU), a neural decision processor (NDP), a deep learning processor (DLP), a tensor processing unit (TPU), a neuromorphic processing unit (NPU), or any other processing device(s) that may be suitable for processing genomics data, proteomics data, metabolomics data, metagenomics data, transcriptomics data, or other omics data and making one or more decisions based thereon), software (e.g., instructions running/executing on one or more processing devices), firmware (e.g., microcode), or some combination thereof.
- the workflow diagram 300B may begin at functional block 348 with importing amino acid sequences for a set of one or more P proteins.
- one or more partition coefficient (Kp) screens of experimental amino acid sequences for a set of one or more P proteins and/or molecular amino acid sequences for a set of one or more P proteins may be imported.
- the workflow diagram 300B may then continue at functional block 350 with formatting the amino acid sequences for the set of one or more P proteins and generating a molecular descriptor matrix of size M-by-A.
- the workflow diagram 300B may also include generating a weighted average of the descriptors (M) in the molecular descriptor matrix across all amino acids (A).
- a weighted average of the descriptors (M) in the molecular descriptor matrix across all amino acids (A) may be calculated, resulting in a descriptor vector of size M-by-1 for each protein of the set of one or more P proteins.
- the machine learning model 301 (as described above with respect to FIG. 3A) may generate one or more M-by-1 vectors of descriptors for each protein of the set of one or more P proteins.
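A minimal sketch of collapsing the per-residue descriptors into a single M-by-1 vector per protein; the uniform default weights are an assumption (any per-residue weighting could be substituted).

```python
import numpy as np

def average_descriptors(per_residue, weights=None):
    """per_residue: array of shape (A residues, M descriptors) for one protein."""
    if weights is None:
        weights = np.ones(per_residue.shape[0])   # assumed uniform weighting over residues
    weights = weights / weights.sum()
    return per_residue.T @ weights                # weighted average -> vector of length M
```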
- the workflow diagram 300B may then continue at functional block 352 with preprocessing the descriptor vector by removing amino acid sequence data with precipitation at high salt concentrations and weighting experimental data to prioritize the binding transition region (e.g., -2 < Log[Kp] < +2, or -0.5 < Log[Kp] < +2).
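A minimal sketch of the preprocessing at functional block 352, assuming the experimental screen is tabulated with hypothetical column names log_kp and precipitated (a boolean flag); the Log(Kp) window follows the example bounds above.

```python
import pandas as pd

def preprocess(df, log_kp_col="log_kp", precip_col="precipitated", low=-2.0, high=2.0):
    """df: one row per protein/condition; column names are assumptions for illustration."""
    df = df[~df[precip_col]].copy()                      # drop points with precipitation at high salt
    df["in_transition"] = df[log_kp_col].between(low, high)   # flag the binding transition region
    return df
```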
- the workflow diagram 300B may be provided for optimizing the machine learning model 302A as discussed above with respect to FIG. 3A.
- the optimized machine learning model 302B may be utilized to generate a prediction of a molecular binding property of one or more target proteins in accordance with the presently-disclosed embodiments.
- the workflow diagram 300B may continue at functional block 354 with selectively determining which of the functional blocks of feature dimensionality reduction model 307B, feature selection model 309B, and regression model 311B to execute, as well as the order in which the functional blocks of feature dimensionality reduction model 307B, feature selection model 309B, and regression model 311B are to be executed.
- the workflow diagram 300B at functional block 354 may perform one or more Bayesian optimization processes (e.g., sequential model-based optimization (SMBO), expected improvement (EI)) to optimize and evaluate the machine learning model 302B.
- the Bayesian optimization processes may include, for example, one or more probability-based objective functions that may be constructed and utilized to select the most predictive or the most promising of the functional blocks of feature dimensionality reduction model 307B, feature selection model 309B, and regression model 311B to execute and/or the order in which to execute these functional blocks.
- the workflow diagram 300B at functional block 354 may further proceed in estimating the accuracy of the machine learning model 302B utilizing, for example, nested cross-validation with Group K-Fold cross-validation.
- the workflow diagram 300B may optimize the machine learning model 302B to more efficiently (e.g., decreasing the execution time of the machine learning model 302B and database capacity suitable for storing the machine learning model 302B) generate a prediction of a molecular binding property of one or more target proteins as compared to, for example, the machine learning model 302A as discussed above with respect to FIG. 3A.
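A minimal, self-contained sketch of sequential model-based optimization over the pipeline configuration, using Optuna as a stand-in optimizer with random placeholder data; the block choices and hyper-parameter ranges shown are assumptions for illustration only, not the disclosed search space.

```python
import numpy as np
import optuna
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = np.random.rand(120, 60), np.random.rand(120)   # placeholder descriptors and targets

def objective(trial):
    steps = []
    # selectively include the feature-selection block and choose its K
    if trial.suggest_categorical("use_kbest", [True, False]):
        steps.append(("kbest", SelectKBest(mutual_info_regression,
                                           k=trial.suggest_int("k", 5, 40))))
    # boosted-tree regressor with a small assumed hyper-parameter range
    steps.append(("reg", GradientBoostingRegressor(
        n_estimators=trial.suggest_int("n_estimators", 50, 400),
        max_depth=trial.suggest_int("max_depth", 2, 6))))
    score = cross_val_score(Pipeline(steps), X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()
    return -score   # minimize cross-validated MSE

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=25)
best_configuration = study.best_params
```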
- the workflow diagram 300B may then continue at functional block 356 with training and evaluating the optimized machine learning model 302B.
- the optimized machine learning model 302B (e.g., as optimized at functional block 354) may be trained and evaluated based on the descriptor vector representing the amino acid sequences for the set of one or more proteins (e.g., as computed at functional block 352) and the functional blocks of feature dimensionality reduction model 307B, feature selection model 309B, and regression model 311B selected for execution.
- the workflow diagram 300B at functional block 356 may further include applying the optimized set of hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters) and optimized set of learnable parameters (e.g., regression model weights, decision variables) (e.g., as iteratively optimized and discussed above with respect to the workflow diagram 300A of FIG. 3A) to the optimized machine learning model 302B and utilizing the optimized machine learning model 302B to generate a prediction of a molecular binding property of one or more target proteins in accordance with the presently-disclosed embodiments.
- the workflow diagram 300B may then conclude at functional block 358 with storing the optimized machine learning model 302B, the optimized set of hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters), and the optimized set of learnable parameters (e.g., regression model weights, decision variables) to be utilized for subsequent predictions of the molecular binding property of one or more target proteins.
- the feature dimensionality reduction model 307B of the machine learning model 302B may receive or import a molecular descriptor matrix and scale and normalize one or more sets of the descriptors of the descriptor matrix.
- the molecular descriptor matrix may represent a set of amino acid sequences corresponding to a set of P proteins.
- the feature dimensionality reduction model 307B may then perform a clustering of the one or more sets of descriptors by determining a correlation distance between descriptors (e.g., 1 - abs(Pearson’s correlation)), and then only the descriptors closest to the centroid may be stored. For example, in some embodiments, utilizing the calculated correlation distance metric, which may be calculated based on the Pearson’s correlation, the feature dimensionality reduction model 307B may cluster feature vectors in order to group together any and all redundant features that include similar information (similar feature vectors) and determine a centroid of each cluster as representative of the cluster. In certain embodiments, the feature dimensionality reduction model 307B may then optimize the number of descriptors selected.
- the feature selection model 309B may then calculate a nonlinear correlation between the descriptors and the output (e.g., percent protein bound) utilizing, for example, the maximal information coefficient (MIC). In one or more other embodiments, the feature selection model 309B may calculate the nonlinear correlation between the descriptors and the output utilizing distance correlation, mutual information, or another similar nonlinear correlation metric.
- the feature selection model 309B may determine the k-best most-predictive feature vectors of the reduced molecular descriptor matrix based on a k-best process and the MIC for determining a correlation between the feature vectors of the reduced molecular descriptor matrix and an experimentally-determined percent protein bound for one or more specific pH values and salt concentrations and/or salt species and chromatographic resin.
- a distance correlation, mutual information, or other similar nonlinear correlation metric may be utilized.
- the feature selection model 309B may then select the highly correlated descriptors and optimize the selected descriptors.
- the feature selection model 309B may then select a set of descriptors based on impact to the overall performance (e.g., processing speed, storage capacity) of the machine learning model 302B. For example, in some embodiments, the feature selection model 309B may iteratively evaluate the K descriptors to determine how many result in optimal performance of the machine learning model 302B. In some embodiments, the feature selection model 309B may perform the selection of the set of descriptors based on impact to the overall performance entirely selectively.
- the feature selection model 309B may perform, for example, one or more Boruta feature selection algorithms, one or more SHapley Additive exPlanations (SHAP) feature selection algorithms, or other similar recursive feature elimination algorithm to select the K descriptors and to optimize the percentage of the number of selected K descriptors.
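A minimal sketch of recursive feature elimination for this step, using scikit-learn's RFE with a tree-based estimator as a plainly named stand-in for the Boruta/SHAP selectors referenced above; the helper name recursive_elimination is illustrative.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

def recursive_elimination(X, y, n_keep):
    """Iteratively drop the weakest descriptors until n_keep remain."""
    selector = RFE(RandomForestRegressor(n_estimators=200, random_state=0),
                   n_features_to_select=n_keep)
    selector.fit(X, y)
    return selector.transform(X), selector.get_support(indices=True)
```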
- the regression model 311B of the machine learning model 302B may then receive as inputs a pH value, a salt concentration, and the sequence-based descriptors, and may then output a prediction of a percent protein bound for the set of proteins P, with the machine learning model 302B optimized by minimizing a loss between the predicted percent protein bound and an experimentally-determined percent protein bound.
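A minimal sketch of the regression step, assuming pH, salt concentration, and the selected descriptors are concatenated column-wise; GradientBoostingRegressor stands in for the disclosed boosting regressor, and the SSE computation mirrors the loss described above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_binding_regressor(descriptors, ph, salt, percent_bound, sample_weight=None):
    """descriptors: (n, k) selected features; ph, salt, percent_bound: length-n arrays."""
    X = np.column_stack([ph, salt, descriptors])          # condition inputs + sequence descriptors
    model = GradientBoostingRegressor().fit(X, percent_bound, sample_weight=sample_weight)
    sse = float(np.sum((model.predict(X) - percent_bound) ** 2))   # sum of squared error
    return model, sse
```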
- a machine learning model 302B may be iteratively trained to generate a prediction of a molecular binding property of one or more target proteins as part of a streamlined process of protein purification for identifying target proteins and accelerating the selection process of therapeutic antibody candidates or other immunotherapy candidates by way of early identification of the most promising therapeutic antibody candidates.
- the streamlined process of identifying target proteins (e.g., antibodies) in-silico may facilitate and accelerate the downstream development and manufacturing of one or more therapeutic mAbs, bsAbs, tsAbs, or other similar immunotherapies that may be utilized to treat various patient diseases.
- the machine learning model 302B may be utilized to generate a prediction of a molecular binding property (e.g., a prediction of a percent protein bound at one or more specific pH values and specific salt concentrations and/or specific salt species and chromatographic resin) of one or more proteins by utilizing optimized hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters) and learnable parameters (e.g., regression model weights, decision variables) learned during the training of the machine learning model 302B and a selected k-best matrix of feature vectors of a molecular descriptor matrix representing a set of amino acid sequences corresponding to one or more proteins of interest.
- the machine learning model 302B may utilize the optimized hyper-parameters (e.g., general parameters, booster parameters, learning-task parameters) and learnable parameters (e.g., regression model weights, decision variables) learned during training to predict a percent protein bound (e.g., a percentage of a set of proteins predicted to bind to a ligand within a solution for a given pH value and salt concentration) for one or more target proteins based only on, as input, the selected k-best matrix of feature vectors of a molecular descriptor matrix representing a set of amino acid sequences corresponding to one or more proteins of interest and one or more sets of pH values and salt concentrations and/or salt species and chromatographic resin associated with the binding properties of the one or more proteins of interest.
- the machine learning model 302B (e.g., an ensemble-learning model) may be further optimized utilizing one or more Bayesian optimization processes to more efficiently generate the prediction of the molecular binding property (e.g., a prediction of a percent protein bound at one or more specific pH values and specific salt concentrations and/or specific salt species and chromatographic resin).
- the molecular binding property and elution property of the one or more proteins of interest may be determined without considerable upstream experimentation. That is, desirable proteins of the one or more proteins of interest may be identified and distinguished from undesirable proteins of the one or more proteins of interest in-silico, and those desirable proteins identified in-silico may be further utilized to expedite and facilitate the downstream development of one or more therapeutic mAbs, bsAbs, tsAbs, or other similar immunotherapies that may be utilized to treat various diseases (e.g., by reducing upstream experimental duration and experimentation inefficiency and providing in-silico feedback on which candidate proteins may be difficult to purify, and, by extension, ultimately difficult to manufacture).
- the machine learning model may be configured to obtain a prediction of a molecular binding property of the one or more proteins. From the molecular binding property, desirable proteins may be identified. While the present embodiments are discussed herein primarily with respect to the machine learning model 302B generating a prediction of a molecular binding property of one or more target proteins, it should be appreciated that the machine learning model 302B as trained may also generate a prediction of an elution property of the one or more proteins or generate a prediction of a flow-through property of the one or more proteins, in accordance with the presently disclosed embodiments.
- FIG. 4 illustrates a flow diagram of a method 400 for generating a prediction of a molecular binding property of one or more target proteins as part of a streamlined process of protein purification for identifying target proteins (e.g., antibodies) and accelerating the selection process of therapeutic antibody candidates or other immunotherapy candidates by way of early identification of the most promising therapeutic antibody candidates, in accordance with the disclosed embodiments.
- the method 400 may be performed utilizing one or more processing devices (e.g., computing device(s) and artificial intelligence architecture to be discussed below with respect to FIGS. 5 and 6) that may include hardware (e.g., a general purpose processor, a graphic processing unit (GPU), an application-specific integrated circuit (ASIC), a system-on-chip (SoC), a microcontroller, a field-programmable gate array (FPGA), a central processing unit (CPU), an application processor (AP), a visual processing unit (VPU), a neural processing unit (NPU), a neural decision processor (NDP), a deep learning processor (DLP), a tensor processing unit (TPU), a neuromorphic processing unit (NPU), or any other processing device(s) that may be suitable for processing genomics data, proteomics data, metabolomics data, metagenomics data, transcriptomics data, or other omics data and making one or more decisions based thereon), software (e.g., instructions running/executing on one or more processing devices), firmware (e.g., microcode), or some combination thereof.
- the method 400 may begin at block 402 with one or more processing devices accessing a molecular descriptor matrix representing a set of amino acid sequences corresponding to one or more proteins.
- the method 400 may then continue at block 404 with one or more processing devices refining a set of hyper-parameters associated with a machine learning model trained to generate a prediction of a molecular binding property of the one or more proteins.
- the method 400 may then proceed with an iterative sub-process of optimizing the set of hyper-parameters by iteratively executing the sub-process (e.g., illustrated by the dashed lines around a portion of the method 400 of FIG. 4) until a desired precision is reached for the machine learning model.
- the method 400 may continue at block 406 with one or more processing devices reducing the molecular descriptor matrix by selecting one representative feature vector for each of a plurality of feature vector clusters, wherein each of the feature vector clusters includes similar feature vectors.
- the method 400 may then continue at block 408 with one or more processing devices determining one or more most-predictive feature vectors of the selected representative feature vectors for each of the plurality of feature vector clusters based on a correlation between the selected representative feature vectors and predetermined batch binding data associated with the one or more proteins.
- the method 400 may then continue at block 410 with one or more processing devices calculating one or more cross-validation losses based at least in part on the one or more most-predictive feature vectors and the predetermined batch binding data.
- the method 400 may then conclude at block 412 with one or more processing devices updating the set of hyper-parameters based on the one or more cross-validation losses.
- FIG. 5 illustrates an example of one or more computing device(s) 500 that may be utilized to generate a prediction of a molecular binding property of one or more target proteins as part of a streamlined process of protein purification for identifying target proteins (e.g., antibodies) and accelerating the selection process of therapeutic antibody candidates or other immunotherapy candidates by way of early identification of promising therapeutic antibody candidates, in accordance with the disclosed embodiments.
- the one or more computing device(s) 500 may perform one or more steps of one or more methods described or illustrated herein.
- the one or more computing device(s) 500 provide functionality described or illustrated herein.
- software running on the one or more computing device(s) 500 performs one or more steps of one or more methods described or illustrated herein, or provides functionality described or illustrated herein. Certain embodiments include one or more portions of the one or more computing device(s) 500.
- This disclosure contemplates any suitable number of computing systems 500.
- This disclosure contemplates one or more computing device(s) 500 taking any suitable physical form.
- one or more computing device(s) 500 may be an embedded computer system, a system-on-chip (SOC), a single-board computer system (SBC) (e.g., a computer-on-module (COM) or system-on-module (SOM)), a desktop computer system, a laptop or notebook computer system, an interactive kiosk, a mainframe, a mesh of computer systems, a mobile telephone, a personal digital assistant (PDA), a server, a tablet computer system, an augmented/virtual reality device, or a combination of two or more of these.
- the one or more computing device(s) 500 may be unitary or distributed; span multiple locations; span multiple machines; span multiple data centers; or reside in a cloud, which may include one or more cloud components in one or more networks.
- the one or more computing device(s) 500 may perform without substantial spatial or temporal limitation one or more steps of one or more methods described or illustrated herein.
- the one or more computing device(s) 500 may perform, in real-time or in batch mode, one or more steps of one or more methods described or illustrated herein.
- the one or more computing device(s) 500 may perform, at different times or at different locations, one or more steps of one or more methods described or illustrated herein, where appropriate.
- the one or more computing device(s) 500 includes a processor 502, memory 504, database 506, an input/output (I/O) interface 508, a communication interface 510, and a bus 512.
- processor 502 includes hardware for executing instructions, such as those making up a computer program.
- processor 502 may retrieve (or fetch) the instructions from an internal register, an internal cache, memory 504, or database 506; decode and execute them; and then write one or more results to an internal register, an internal cache, memory 504, or database 506.
- processor 502 may include one or more internal caches for data, instructions, or addresses. This disclosure contemplates processor 502 including any suitable number of any suitable internal caches, where appropriate.
- processor 502 may include one or more instruction caches, one or more data caches, and one or more translation lookaside buffers (TLBs). Instructions in the instruction caches may be copies of instructions in memory 504 or database 506, and the instruction caches may speed up retrieval of those instructions by processor 502.
- Data in the data caches may be copies of data in memory 504 or database 506 for instructions executing at processor 502 to operate on; the results of previous instructions executed at processor 502 for access by subsequent instructions executing at processor 502 or for writing to memory 504 or database 506; or other suitable data.
- the data caches may speed up read or write operations by processor 502.
- the TLBs may speed up virtual-address translation for processor 502.
- processor 502 may include one or more internal registers for data, instructions, or addresses. This disclosure contemplates processor 502 including any suitable number of any suitable internal registers, where appropriate.
- processor 502 may include one or more arithmetic logic units (ALUs); be a multicore processor; or include one or more processors 502.
- memory 504 includes main memory for storing instructions for processor 502 to execute or data for processor 502 to operate on.
- the one or more computing device(s) 500 may load instructions from database 506 or another source (such as, for example, another one or more computing device(s) 500) to memory 504.
- Processor 502 may then load the instructions from memory 504 to an internal register or internal cache.
- processor 502 may retrieve the instructions from the internal register or internal cache and decode them.
- processor 502 may write one or more results (which may be intermediate or final results) to the internal register or internal cache. Processor 502 may then write one or more of those results to memory 504.
- processor 502 executes only instructions in one or more internal registers, internal caches, or memory 504 (as opposed to database 506 or elsewhere) and operates only on data in one or more internal registers, internal caches, or memory 504 (as opposed to database 506 or elsewhere).
- One or more memory buses (which may each include an address bus and a data bus) may couple processor 502 to memory 504.
- Bus 512 may include one or more memory buses, as described below.
- one or more memory management units reside between processor 502 and memory 504 and facilitate accesses to memory 504 requested by processor 502.
- memory 504 includes random access memory (RAM). This RAM may be volatile memory, where appropriate.
- this RAM may be dynamic RAM (DRAM) or static RAM (SRAM). Moreover, where appropriate, this RAM may be single-ported or multi-ported RAM. This disclosure contemplates any suitable RAM.
- Memory 504 may include one or more memory devices 504, where appropriate. Although this disclosure describes and illustrates particular memory, this disclosure contemplates any suitable memory.
- database 506 includes mass storage for data or instructions.
- database 506 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disc, a magneto-optical disc, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these.
- Database 506 may include removable or non-removable (or fixed) media, where appropriate.
- Database 506 may be internal or external to the one or more computing device(s) 500, where appropriate.
- database 506 is non-volatile, solid-state memory.
- database 506 includes read-only memory (ROM).
- this ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), flash memory, or a combination of two or more of these.
- This disclosure contemplates mass database 506 taking any suitable physical form.
- Database 506 may include one or more storage control units facilitating communication between processor 502 and database 506, where appropriate. Where appropriate, database 506 may include one or more databases 506. Although this disclosure describes and illustrates particular storage, this disclosure contemplates any suitable storage.
- I/O interface 508 includes hardware, software, or both, providing one or more interfaces for communication between the one or more computing device(s) 500 and one or more I/O devices.
- the one or more computing device(s) 500 may include one or more of these I/O devices, where appropriate.
- One or more of these I/O devices may enable communication between a person and the one or more computing device(s) 500.
- an I/O device may include a keyboard, keypad, microphone, monitor, mouse, printer, scanner, speaker, still camera, stylus, tablet, touch screen, trackball, video camera, another suitable I/O device, or a combination of two or more of these.
- An I/O device may include one or more sensors.
- I/O interface 508 may include one or more device or software drivers enabling processor 502 to drive one or more of these I/O devices.
- I/O interface 508 may include one or more I/O interfaces 508, where appropriate.
- communication interface 510 includes hardware, software, or both providing one or more interfaces for communication (such as, for example, packet-based communication) between the one or more computing device(s) 500 and one or more other computing device(s) 500 or one or more networks.
- communication interface 510 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network.
- the one or more computing device(s) 500 may communicate with an ad hoc network, a personal area network (PAN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), one or more portions of the Internet, or a combination of two or more of these.
- One or more portions of one or more of these networks may be wired or wireless.
- the one or more computing device(s) 500 may communicate with a wireless PAN (WPAN) (such as, for example, a BLUETOOTH WPAN), a WI-FI network, a WIMAX network, a cellular telephone network (such as, for example, a Global System for Mobile Communications (GSM) network), other suitable wireless network, or a combination of two or more of these.
- the one or more computing device(s) 500 may include any suitable communication interface 510 for any of these networks, where appropriate.
- Communication interface 510 may include one or more communication interfaces 510, where appropriate.
- bus 512 includes hardware, software, or both coupling components of the one or more computing device(s) 500 to each other.
- bus 512 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HYPERTRANSPORT (HT) interconnect, an Industry Standard Architecture (ISA) bus, an INFINIBAND interconnect, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a serial advanced technology attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, another suitable bus, or a combination of two or more of these.
- Bus 512 may include one or more buses 512, where appropriate.
- a computer-readable non-transitory storage medium or media may include one or more semiconductor-based or other integrated circuits (ICs) (such, as for example, field- programmable gate arrays (FPGAs) or application-specific ICs (ASICs)), hard disk drives (HDDs), hybrid hard drives (HHDs), optical discs, optical disc drives (ODDs), magneto-optical discs, magneto-optical drives, floppy diskettes, floppy disk drives (FDDs), magnetic tapes, solid-state drives (SSDs), RAM-drives, SECURE DIGITAL cards or drives, any other suitable computer-readable non-transitory storage media, or any suitable combination of two or more of these, where appropriate.
- FIG. 6 illustrates a diagram 600 of an example artificial intelligence (AI) architecture 602 (which may be included as part of the one or more computing device(s) 500 as discussed above with respect to FIG. 5) that may be utilized to generate a prediction of a molecular binding property of one or more target proteins as part of a streamlined process of protein purification for identifying target proteins (e.g., antibodies) and accelerating the selection process of therapeutic antibody candidates or other immunotherapy candidates by way of early identification of the most promising therapeutic antibody candidates, in accordance with the disclosed embodiments.
- the AI architecture 602 may be implemented utilizing, for example, one or more processing devices that may include hardware (e.g., a general purpose processor, a graphic processing unit (GPU), an application-specific integrated circuit (ASIC), a system-on-chip (SoC), a microcontroller, a field-programmable gate array (FPGA), a central processing unit (CPU), an application processor (AP), a visual processing unit (VPU), a neural processing unit (NPU), a neural decision processor (NDP), a deep learning processor (DLP), a tensor processing unit (TPU), a neuromorphic processing unit (NPU), and/or other processing device(s) that may be suitable for processing various molecular data and making one or more decisions based thereon), software (e.g., instructions running/executing on one or more processing devices), firmware (e.g., microcode), or some combination thereof.
- the AI architecture 602 may include machine learning (ML) algorithms and functions 604, natural language processing (NLP) algorithms and functions 606, expert systems 608, computer-based vision algorithms and functions 610, speech recognition algorithms and functions 612, planning algorithms and functions 614, and robotics algorithms and functions 616.
- the ML algorithms and functions 604 may include any statistics-based algorithms that may be suitable for finding patterns across large amounts of data (e.g., “Big Data” such as genomics data, proteomics data, metabolomics data, metagenomics data, transcriptomics data, or other omics data).
- the ML algorithms and functions 604 may include deep learning algorithms 618, supervised learning algorithms 620, and unsupervised learning algorithms 622.
- the deep learning algorithms 618 may include any artificial neural networks (ANNs) that may be utilized to learn deep levels of representations and abstractions from large amounts of data.
- the deep learning algorithms 618 may include ANNs, such as a perceptron, a multilayer perceptron (MLP), an autoencoder (AE), a convolutional neural network (CNN), a recurrent neural network (RNN), long short term memory (LSTM), a gated recurrent unit (GRU), a restricted Boltzmann Machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), a generative adversarial network (GAN), deep Q-networks, a neural autoregressive distribution estimation (NADE), an adversarial network (AN), attentional models (AM), a spiking neural network (SNN), deep reinforcement learning, and so forth.
- the supervised learning algorithms 620 may include any algorithms that may be utilized to apply, for example, what has been learned in the past to new data using labeled examples for predicting future events. For example, starting from the analysis of a known training data set, the supervised learning algorithms 620 may produce an inferred function to make predictions about the output values. The supervised learning algorithms 620 may also compare its output with the correct and intended output and find errors in order to modify the supervised learning algorithms 620 accordingly.
- the unsupervised learning algorithms 622 may include any algorithms that may be applied, for example, when the data used to train the unsupervised learning algorithms 622 are neither classified nor labeled. For example, the unsupervised learning algorithms 622 may study and analyze how systems may infer a function to describe a hidden structure from unlabeled data.
- the NLP algorithms and functions 606 may include any algorithms or functions that may be suitable for automatically manipulating natural language, such as speech and/or text.
- the NLP algorithms and functions 606 may include content extraction algorithms or functions 624, classification algorithms or functions 626, machine translation algorithms or functions 628, question answering (QA) algorithms or functions 630, and text generation algorithms or functions 632.
- the content extraction algorithms or functions 624 may include a means for extracting text or images from electronic documents (e.g., webpages, text editor documents, and so forth) to be utilized, for example, in other applications.
- the classification algorithms or functions 626 may include any algorithms that may utilize a supervised learning model (e.g., logistic regression, naive Bayes, stochastic gradient descent (SGD), k-nearest neighbors, decision trees, random forests, support vector machine (SVM), and so forth) to learn from the data input to the supervised learning model and to make new observations or classifications based thereon.
- the machine translation algorithms or functions 628 may include any algorithms or functions that may be suitable for automatically converting source text in one language, for example, into text in another language.
- the QA algorithms or functions 630 may include any algorithms or functions that may be suitable for automatically answering questions posed by humans in, for example, a natural language, such as that performed by voice-controlled personal assistant devices.
- the text generation algorithms or functions 632 may include any algorithms or functions that may be suitable for automatically generating natural language texts.
- the expert systems 608 may include any algorithms or functions that may be suitable for simulating the judgment and behavior of a human or an organization that has expert knowledge and experience in a particular field (e.g., stock trading, medicine, sports statistics, and so forth).
- the computer-based vision algorithms and functions 610 may include any algorithms or functions that may be suitable for automatically extracting information from images (e.g., photo images, video images).
- the computer-based vision algorithms and functions 610 may include image recognition algorithms 634 and machine vision algorithms 636.
- the image recognition algorithms 634 may include any algorithms that may be suitable for automatically identifying and/or classifying objects, places, people, and so forth that may be included in, for example, one or more image frames or other displayed data.
- the machine vision algorithms 636 may include any algorithms that may be suitable for allowing computers to “see”, or, for example, to rely on image sensors or cameras with specialized optics to acquire images for processing, analyzing, and/or measuring various data characteristics for decision-making purposes.
- the speech recognition algorithms and functions 612 may include any algorithms or functions that may be suitable for recognizing and translating spoken language into text, such as through automatic speech recognition (ASR), computer speech recognition, speech-to-text (STT) 638, or text-to-speech (TTS) 640, in order for the computing device to communicate via speech with one or more users, for example.
- the planning algorithms and functions 614 may include any algorithms or functions that may be suitable for generating a sequence of actions, in which each action may include its own set of preconditions to be satisfied before performing the action. Examples of Al planning may include classical planning, reduction to other problems, temporal planning, probabilistic planning, preference-based planning, conditional planning, and so forth.
- the robotics algorithms and functions 616 may include any algorithms, functions, or systems that may enable one or more devices to replicate human behavior through, for example, motions, gestures, performance tasks, decision-making, emotions, and so forth.
- Described herein are processes associated with predicting a molecular binding property of one or more proteins, as described above. This may include importing amino acid sequences of proteins and generating a molecular descriptor matrix based on the amino acid sequences. Protein molecules are formed of amino acid sequences. An amino acid sequence may be represented by a string of characters (e.g., a string of letters). In one or more examples, the amino acid sequences may be input to a machine learning model (e.g., a neural network) to generate the molecular descriptor matrix. In one or more examples, the machine learning model may be pre-trained using amino acid sequences. For example, the machine learning model may comprise a protein language model. In another example, the machine learning model may be pre-trained in an unsupervised manner. In some embodiments, the machine learning model may be configured to generate structure-based descriptors representing the sequences used to generate a protein structure.
- the molecular feature matrix that is generated may be used to predict a molecular binding property of the corresponding protein.
- the molecular descriptor matrix may be a multi-dimensional matrix (i.e., a tensor) comprised of a plurality of feature vectors representing the descriptors for each amino acid in the sequence of each protein.
- the multi-dimensional molecular descriptors matrix (with per-amino-acid feature vectors for each molecule) may be reduced to a 2-dimensional molecular feature matrix (with molecular feature vectors for each molecule) by averaging the feature vectors across all amino acids in each molecule.
- a feature dimensionality reduction technique used to reduce the number of feature vectors of the molecular descriptor matrix may include, in particular, removing redundant feature vectors subsequent to the averaging. For instance, because some feature vectors (and/or the features included therein) may be highly correlated, a single representative feature vector may be identified to represent the collection of highly-correlated feature vectors.
- a clustering technique (e.g., a hierarchical/agglomerative clustering technique) may be used to identify feature vectors that are similar (e.g., whose corresponding embeddings are less than a threshold distance away from one another in an embedding space).
- one or more representative feature vectors may be selected from each cluster of similar feature vectors as being “representative” of that cluster.
- the representative feature vectors may be input to a machine learning model to obtain the prediction of the molecular binding property of the proteins.
- These proteins may be proteins of interest for potential drug discovery assays.
- the machine learning model may be trained to receive, as input, one or more representative feature vectors describing one or more proteins and output the prediction of the molecular binding property of the proteins based on the representative feature vectors.
- the machine learning model may be trained by aligning the molecular descriptors (from a training molecular descriptor matrix generated by machine learning model 301 of FIG. 3A based on the imported amino acid sequences of one or more empirically-evaluated proteins) and predetermined batch binding data associated with the empirically-evaluated proteins. After being aligned, a supervised regression may be performed to train the machine learning model.
- the regressor used may comprise a bagged decision tree, a bagged linear model, a non-bagged linear model, a random forest, a linear forest, or another type of regressor, or combination thereof.
- part of the training step comprises optimizing a set of hyperparameters of the machine learning model.
- the hyper-parameters may include regularization parameters, a number of estimators, a maximum tree depth, and the like.
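For illustration, a hypothetical hyper-parameter search space of this kind for a boosted-tree regressor might look like the following; the names and values are assumptions, not taken from the disclosure.

```python
# Hypothetical hyper-parameter search space for a boosted-tree regressor.
hyper_parameter_space = {
    "n_estimators": [100, 300, 500],     # number of estimators
    "max_depth": [2, 4, 6],              # maximum tree depth
    "learning_rate": [0.01, 0.05, 0.1],
    "reg_lambda": [0.0, 1.0, 10.0],      # regularization strength
}
```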
- the pipelines e.g., feature-dimensionality reduction model 307A, feature selection model 309A, 309B, and regression model 311A, 311B
- the feature-dimensionality reduction model may be configured to use correlation clustering, recursive feature elimination, and/or other techniques to reduce a number of feature vectors of the molecular descriptor matrix.
- the training step may also include a cross-validation step in which cross-validation tests are performed iteratively until an optimized set of learnable parameters of the machine learning model is determined.
- if the machine learning model includes a decision tree structure (e.g., a random forest), the learnable parameters may include the number of trees and/or a depth of the trees.
- the optimized set of learnable parameters are selected such that they optimize the performance of the machine learning model.
- the machine learning model may be trained to generate predictions of molecular binding properties of new amino acid sequences that are not part of the training sets.
- one or more additional steps may be performed to predict a molecular binding property of one or more protein molecules based on amino acid sequences.
- One of the goals of the disclosed techniques comprises predicting a property of a molecule to-be-assessed. In particular, how well a protein molecule binds to a resin provides valuable clinical information and/or valuable manufacturing process developability information that can be used in the development of new therapeutics.
- the foregoing describes an additional/alternative set of steps to the aforementioned steps that can be performed to predict the molecular binding properties based on amino acid sequences.
- testing binding properties of molecules is a complex and time-consuming process.
- the experimental duration of experimental example 102 of FIG. 1 for performing one or more protein purification processes may span a number of weeks.
- the execution time for the computational model-based example 104 (e.g., the machine learning models described herein), by contrast, may be substantially shorter.
- experimental example 102 is therefore not an ideal process for testing every potential molecule.
- the machine learning models described herein can reduce the amount of time expended on testing by increasing the number of molecules that can be screened in a given amount of time, or that can be screened by a given researcher.
- molecular descriptor matrices can be generated using various existing protein language models (e.g., molecular descriptors 120 of FIG. 1).
- existing techniques can be harnessed to generate the machine learning models’ inputs, thereby reducing the amount of additional data that needs to be collected and reducing the amount of additional model training needed.
- the machine learning models described herein can be trained using less data while maintaining or increasing the models’ accuracy.
- the molecular descriptor matrix can be reduced by determining (and using as input to the machine learning models) the most-predictive feature vectors.
- This descriptor reduction process can further optimize the training processes for the machine learning models. For example, each training molecular descriptor matrix may be reduced by determining the most-predictive feature vectors, and the model may be trained based on the most-predictive feature vectors.
- FIG. 7 illustrates another high-level workflow diagram 700 for performing feature generation 202, feature dimensionality reduction 204, feature filtering 206, recursive model-based feature elimination 207, and regression model optimization 208, in accordance with various embodiments.
- the descriptions of feature generation 202, feature dimensionality reduction 204, feature filtering 206, and regression model optimization 208 may apply equally here.
- diagram 700 may further include recursive model-based feature elimination 207.
- Recursive model-based feature elimination 207 may include an additional model for further reducing the number of features in the feature set.
- recursive model-based feature elimination 207 may assist in preventing or reducing the likelihood of overfitting.
- recursive model-based feature elimination 207 may implement a machine learning model 820 of FIG. 8.
- FIG. 8 includes similar components as those of FIG. 3A, and similar labels are used to refer to those components.
- workflow 800 may include model 301 and machine learning model 820.
- Machine learning model 820 may include feature dimensionality reduction model 307A, feature filtering model 309A, recursive feature elimination model 801, and regression model 311 A.
- Workflow 800 may follow a similar path as that of workflow 300 A, with the exception that the most-predictive feature vectors may include those that have been reduced via recursive feature elimination model 801.
- determining the one or more most-predictive feature vectors may further comprise implementing recursive feature elimination model 801 to further reduce the number of feature vectors.
- in some embodiments, the number of feature vectors included in the further reduced set of feature vectors is equal to or less than the number of training items.
- recursive feature elimination model 801 may be configured to fit a model to the representative feature vectors.
- the model may be a regression model, for example.
- a feature importance score may be calculated based on the fit model.
- the feature importance score may indicate an importance of each representative feature vector.
- one or more feature vectors of the representative feature vectors may be removed based on the feature importance score of each of the representative feature vectors to obtain a subset of representative feature vectors. For example, a least-important feature or feature vector may be removed from the representative feature vectors.
- the most-predictive feature vectors may comprise one or more feature vectors from the subset of representative feature vectors.
- recursive feature elimination model 801 may iteratively perform blocks 802-806 until a number of feature vectors included in the subset satisfies a feature quantity criterion.
- the feature quantity criterion being satisfied comprises the number of feature vectors included in the subset of representative feature vectors being less than or equal to a threshold number of feature vectors.
- the threshold number of feature vectors may be the same as or similar to the number of features in the training data used to train machine learning model 820.
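A minimal sketch of such a recursive elimination loop is shown below; it uses a random-forest regressor as the fitted model for scoring feature importance, which is an assumption for illustration rather than the disclosed configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def recursive_feature_elimination(X: np.ndarray, y: np.ndarray,
                                  max_features: int) -> list:
    """Iteratively (i) fit a model, (ii) score feature importance, and
    (iii) drop the least-important feature, repeating until at most
    `max_features` remain (the feature quantity criterion)."""
    kept = list(range(X.shape[1]))
    while len(kept) > max_features:
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(X[:, kept], y)                    # (i) fit the model
        importances = model.feature_importances_    # (ii) importance scores
        kept.pop(int(np.argmin(importances)))       # (iii) remove weakest feature
    return kept  # indices of the surviving, most-predictive feature vectors
```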
- the number of feature vectors included in the subset of representative feature vectors may include one of the set of hyper-parameters.
- the number of feature vector clusters included in the plurality of feature vector clusters comprises one of the set of hyper-parameters.
- FIG. 9 illustrates a process for training a machine learning model to predict a molecular binding property, in accordance with various embodiments.
- process 900 of FIG. 9, as described herein may organize the data used to train a regression model (e.g., at step 930) in a different manner.
- the data used to train the machine learning model(s) include a predefined quantity of experimental conditions.
- the experimental conditions may specify a molecular binding property of a protein for a given set of experimental conditions.
- the data may comprise a measured molecular binding level of a protein at a first salt concentration and a first pH level, a measured molecular binding level of the protein at a second salt concentration and the first pH level, a measured molecular binding level of the protein at the first salt concentration and a second pH level, and the like.
- the predefined quantity of experimental conditions for the predetermined batch binding may include 12 or more experimental conditions (e.g., 4 salt concentrations, 3 pH levels), 24 or more experimental conditions (e.g., 6 salt concentrations, 4 pH levels), and the like.
- the trained machine learning model as described above, may use experimental conditions (e.g., pH levels and salt concentrations) as inputs in addition to the molecular descriptor matrix to predict a molecular binding property of the one or more proteins.
- the experimental conditions may not need to be input to the machine learning model and instead a predicted molecular binding property may be determined for a continuum of experimental conditions. To do so, however, the training data and training process may be adjusted, as illustrated in FIG. 9.
- FIG. 9 illustrates a workflow diagram of a process 900 for optimizing hyper-parameters and learnable parameters of a machine learning model for performing one or more computational model-based protein purification processes, in accordance with various embodiments.
- Process 900 differs from that described above with respect to FIGS. 3A-4 in that a transformed representation of a molecular binding property of the training empirically- evaluated proteins may be used to train the machine learning model.
- the trained machine learning model may output a value corresponding to the transformed representation of the molecular binding property which in turn can be used to predict all binding conditions for all experimental conditions for a given protein molecule.
- the amount of training data needed to train the machine learning model may be reduced from N empirically-derived binding measures for N different experimental conditions (e.g., salt concentration levels and pH levels) to a single transformed binding measure that can be used to resolve the N empirically-derived binding measures.
- sequence data 902 corresponding to one or more amino acid sequences of proteins P may be provided to a matrix generation machine learning (ML) model 904.
- machine learning model 904 may be the same or similar to machine learning model 301 of FIG. 3A, and the previous description may apply.
- matrix generation ML model 904 may be trained to generate a molecular descriptor matrix 906 from sequence data 902 representing the amino acid sequences of the P proteins.
- Matrix generation ML model 904 may comprise a neural network, which may generate features X structured as molecular descriptor matrix 906.
- Molecular descriptor matrix 906 may be the same or similar to the molecular descriptor matrix generated at functional block 306 of FIG. 3A.
- molecular descriptor matrix 906 may include 100 or more features, 500 or more features, 1,000 or more features, 2,000 or more features, 10,000 or more features, or other amounts of features. The features of molecular descriptor matrix 906 may then be analyzed to determine which (if any) correlate with a molecular binding property of the corresponding protein molecule.
- Molecular descriptor matrix 906 may have dimensions of a number of molecules M by a number of descriptors (e.g., features) N.
- the amino acid sequence can be represented using a string of characters (e.g., the alphabet) that form the proteins being tested.
- sequences 902 may also be analyzed experimentally.
- the experiments may produce empirically-derived protein binding data 912.
- Empirically-derived protein binding data 912 may comprise molecular binding property values for a set of experimental conditions 914.
- empirically-derived protein binding data 912 may indicate that for a given sequence (e.g., Sequence A) and a first experimental condition (e.g., a first salt concentration level and a first pH level), the molecular binding property is Yl.
- empirically-derived protein binding data 912 may indicate that for the sequence (e.g., Sequence A) and a second experimental condition (e.g., a second salt concentration level and the first pH level), the molecular binding property is Y2.
- empirically-derived protein binding data 912 may indicate that for the sequence (e.g., Sequence A) and a third experimental condition (e.g., the first salt concentration level and a second pH level), the molecular binding property is Y3.
- predetermined batch binding data may be formulated as a matrix with the molecules as rows and experimental conditions 914 as columns.
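For illustration, the batch binding data could be laid out as a molecules-by-conditions table like the following pandas sketch; the salt concentrations, pH levels, and percent-bound values are placeholders rather than disclosed measurements:

```python
import pandas as pd

conditions = pd.MultiIndex.from_product(
    [[0, 50, 150, 300],   # salt concentration levels (mM), assumed values
     [5.0, 6.0, 7.0]],    # pH levels, assumed values
    names=["salt_mM", "pH"],
)
batch_binding = pd.DataFrame(
    [[0.99, 0.97, 0.95, 0.90, 0.80, 0.70, 0.55, 0.40, 0.25, 0.12, 0.05, 0.02]],
    index=["Sequence A"],   # one row per molecule
    columns=conditions,     # 4 salt levels x 3 pH levels = 12 conditions
)
```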
- Process 900 may be configured to train a machine learning model (e.g., machine learning model 820) to predict a molecular binding property of a protein for a set of experimental conditions.
- Testing binding properties of molecules is a complex and time-consuming process (e.g., it takes 2-6 weeks to grow a molecule, purify it, and then test it, so it can take several weeks to fully evaluate each molecule). It is not ideal to test every potential molecule. Therefore, one goal of the model is to increase the number of molecules that can be screened in a given amount of time, or that can be screened by a given researcher. Another goal is increasing the number of molecules that can be screened without incurring timeline delays or additional experimental burden.
- Process 900 may be trained using a small number of training examples (e.g., few molecules) and a large number of descriptors (e.g., 100 or more features, 500 or more features, 1,000 or more features, 2,000 or more features, 10,000 or more features, etc.). Process 900 may sort the descriptors in a systematic way to train the machine learning model to predict molecular binding property 910. Additionally, process 900 may leverage the descriptors which have a relationship to one or more physical attributes of the protein. Machine learning pipeline 908 may thereby be configured to find the descriptors (e.g., features) that best predict the molecular binding property of a protein based on the molecular descriptor matrix. The ML model may then try to determine which descriptors are the most predictive.
- predetermined batch binding data 912 comprises empirically-measured binding properties of each analyzed protein for the set of experimental conditions.
- process 900 may include a step of applying, for example using computing system 500 of FIG. 5, a linearizing transformation 916 to the empirically-measured binding properties.
- the empirically-measured binding properties may comprise percent-bound measures (e.g., a protein is Y% bound to a resin).
- Process 900 may transform the percent-bound empirically-measured binding properties stored in predetermined batch binding data 912 into a linearized or pseudo-linear representation of that empirically-measured binding property. For example, a logit transformation operation may be performed.
- the logit transformation includes calculating the log of the ratio of the bound/not-bound protein concentrations.
- the bounds transform from 0.0-1.0 (i.e., 0% bound to 100% bound) to negative infinity to positive infinity (in log-odds space).
- linear models such as PCA models, which converge better, can be used.
- process 900 may include applying one or more dimensionality reduction techniques (e.g., a principal component analysis (PCA) 918) to the linear representations of the empirically-measured binding properties of each analyzed protein.
- PCA 918 may be configured to derive a first principal component (PC), a second principal component, and so on, of the linearizing transformation (e.g., logit transform) of the empirically-measured binding properties.
- the performed PCA 918 may reduce the linear representations of the empirically-measured protein binding properties to a more succinct representation.
- the number of experimental conditions C defines a number of data points in predetermined batch binding data 912.
- PCA 918 may reduce the number of data points from C to less than or equal to C.
- PCA 918 may be configured to output transformed representations 920 representing the transformed versions of the empirically-measured molecular binding property.
- the number of molecules that are tested may be 1 or more, 5 or more, 10 or more, 20 or more, 50 or more, or other values.
- the PCA model can decompose the data (e.g., predetermined batch binding data 912) into a set of lower-dimensionality vectors. For example, for 24 experimental conditions (e.g., 24 experimental data points), the PCA model can identify the first eigenvector of the data, which may capture a plurality of the variance of the data set. Thus, PCA enables a lower dimensional projection to be used to describe the behavior of the binding data. In one or more examples, if an average binding efficiency of a molecule is to be predicted, the PCA provides a more representative and valuable result than any of the experimental conditions individually. Additionally, PCA's ability to succinctly (in a low-dimensional representation) summarize trends in noisy multidimensional data can be useful to scientists. Persons of ordinary skill in the art will recognize that any number of principal components can be identified by PCA 918 including, but not limited to, a first principal component and/or a second principal component.
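The logit transformation and PCA steps could be sketched as follows (a non-authoritative illustration; the clipping constant guarding against exactly 0% or 100% bound is an added assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

def logit_transform(percent_bound: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """Map percent-bound values in [0, 1] to log-odds (log of bound / not-bound),
    clipping so the transform stays finite."""
    p = np.clip(percent_bound, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

# binding: molecules x experimental-conditions matrix of percent-bound values
# linearized = logit_transform(binding)
# pca = PCA(n_components=2).fit(linearized)
# transformed = pca.transform(linearized)   # first/second PC per molecule
```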
- predicted molecular binding property 910 may be compared to transformed representations 920 of the empirically-measured molecular binding property.
- a cross-validation loss may be calculated to determine how well machine learning model 908 predicted the empirically-measured molecular binding property of a given protein.
- the prediction indicates how well machine learning pipeline 908 predicts a transformed representation of the empirically-measured molecular binding property.
- a cross-validation loss may be computed. As described previously, one or more examples may use a k-fold cross-validation technique. Additionally, or alternatively, at 930, a stratified k-fold cross-validation may be computed.
- the stratified k-fold cross-validation comprises taking the molecules of the training set and ranking them into bins based on their molecular binding property.
- the bins may comprise a first bin corresponding to weakly-binding proteins, a second bin corresponding to moderately-binding proteins, a third bin corresponding to tightly-binding proteins, and the like.
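A minimal sketch of such a stratified split, assuming the continuous binding property is discretized into quantile bins (the bin count, seed, and helper name are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_binding_splits(y: np.ndarray, n_bins: int = 3,
                              n_splits: int = 5, seed: int = 0) -> list:
    """Rank molecules into bins (e.g., weak / moderate / tight binders) and use
    the bin labels to build stratified k-fold train/test index splits."""
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(y, edges)            # 0 = weakest bin, n_bins-1 = tightest
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    placeholder_X = np.zeros((len(y), 1))   # features are not needed for splitting
    return list(skf.split(placeholder_X, bins))
```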
- FIGS. 10A-10D illustrate example plots showing how a principal component analysis can be used to predict a molecular binding property, in accordance with various embodiments.
- FIG. 10A illustrates a plot 1000 of a principal component analysis result of a set of molecules.
- the X-axis corresponds to a first principal component value of each molecule of the set and the Y-axis corresponds to a second principal component of each molecule.
- the red oval and the green oval represent a first and second standard deviation from a centroid of the cluster of data points.
- the molecules (e.g., data points) are fairly well-distributed about the x-axis.
- FIG. 10B illustrates a plot 1020 of isotherm curves for a given molecule for various values of a first principal component, in accordance with various embodiments.
- the x-axis represents a salt concentration level used during a corresponding experiment to determine a protein binding property of a molecule and the y-axis represents a protein binding level.
- Isotherm curves 1022-1030 correspond to different principal component (PC) values.
- Isotherm curves 1022-1030 of plot 1020 may be computed using a fixed pH level. As seen from plot 1020, as the value of the first PC increases from very small (e.g., -6 in curve 1022) to very large (e.g., +6 in curve 1030), the binding behavior changes. In the example of plot 1020, the percent bound is approximately 100% for low salt concentrations and approximately 0% for high salt concentration values.
- one or more protein purification steps may be performed to filter out molecules that are not a protein of interest.
- the protein purification step includes causing or otherwise facilitating the protein of interest to bind to a resin (e.g., a chromatography column). Ideally, the resin will bind all of the proteins of interest.
- a wash may be applied to deposit the proteins of interest into a solution.
- the wash may include salt at a particular salt concentration level (and/or pH level).
- the salt concentration level may influence whether the protein un-binds from the resin. For example, at lower salt concentration levels, a protein may remain bound to a resin, whereas higher salt concentration levels may cause the protein to detach from the resin.
- assays or other studies may be performed on the solution/protein.
- FIGS. 10A-10B describe the transition of the protein from a bound to unbound state (e.g., as seen by isotherm curves 1022-1030) using a single value (e.g., the principal component) instead of the set of experimental conditions (e.g., 24 salt/pH combinations).
- the first principal component as illustrated in plot 1020 of FIG. 10B, can visually describe the average binding, as a percent bound.
- for isotherm curve 1022, corresponding to a first principal component of -6, the protein may be tightly bound to the resin. Isotherm curve 1022 may be flagged as problematic because, regardless of the salt concentration level, for the particular pH level and first principal component value, the protein under analysis is unlikely to unbind from the resin.
- isotherm curves 1042-1050 illustrate how the binding percentage varies as the salt concentration level of the wash changes for different values of the first principal component.
- the percent bound of the protein does not change much as the salt concentration level is varied. Isotherm curve 1042 may then also be flagged as problematic because the protein is bound to the resin and cannot be removed.
- isotherm curve 1050 may have a substantially static percent bound regardless of salt concentration level. However, differing from isotherm curve 1042, the protein in this example may not be able to bind to the resin. Isotherm curve 1050 may therefore also be flagged as problematic because no purification can be performed, as all of the protein washes away. Isotherm curves 1044-1048 represent a more desirable state, where the percent bound transitions from bound to unbound as the salt concentration level is varied.
- predicting the first principal component can enable the percent bound to be determined for an infinite number of salt concentrations (and/or pHs).
- a percent bound prediction for all experimental conditions (e.g., points along an isotherm curve) can thereby be generated.
- use of PCA to predict a first principal component vastly simplifies the process of predicting a molecular binding property of a protein without sacrificing accuracy.
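If a PCA was fitted on logit-transformed batch binding data (as in the earlier sketch), a predicted principal-component value can be mapped back to percent bound for every experimental condition; the sketch below assumes such a fitted PCA object and is illustrative only:

```python
import numpy as np

def percent_bound_from_pcs(pc_values, fitted_pca) -> np.ndarray:
    """Recover per-condition percent-bound values (points along an isotherm
    curve) from predicted principal-component value(s) for one molecule."""
    pcs = np.atleast_2d(pc_values)                         # (1, n_components)
    logit_binding = fitted_pca.inverse_transform(pcs)      # back to condition space
    percent_bound = 1.0 / (1.0 + np.exp(-logit_binding))   # inverse logit
    return percent_bound.ravel()                           # one value per condition
```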
- the PCA may output more than the first principal component.
- the second principal component may also be determined and may be used to guide decision making steps.
- plot 1060 depicts isotherm curves 1062-1070 of a second principal component for a protein. Isotherm curves 1062-1070 illustrate how the percent bound of the protein changes as the salt concentration level is varied for a set of second principal component values.
- the first principal component can shift where the transition is from bound to unbound.
- isotherm curve 1062 may correspond to a 2nd PC value of -6 and, as illustrated, is very steep as compared with isotherm curve 1070, which has a 2nd PC value of +2 and is less steep (and does not reach a percent bound of ~0%).
- other principal components may be used as well.
- isotherm curve 1066 may represent an “ideal” curve.
- the first principal component may be set at 0 while the second principal component is varied.
- machine learning pipeline 908 may be trained to output the first principal component, the second principal component, other principal components, or combinations thereof.
- Machine learning pipeline 908 may output the principal components together or serially.
- process 900 can reduce a number of data points needed to train the machine learning model.
- the number of principal components may be limited by the number of data points of the empirically -measured proteins.
- the number of principal components may be less than or equal to the number of experimental conditions.
- process 900 of FIG. 9 may reduce that number to 1 data point.
- FIGS. 11A-11F illustrate example heat maps 1100-1150 illustrating a relationship between experimental conditions and experimental Kp values, and between experimental conditions and modeled Kp values, respectively, in accordance with various embodiments.
- Heat maps 1100-1150 include a color gradient representing how tightly bound a protein is (in units of percent bound).
- the x-axis of maps 1100-1150 describes a salt concentration level and the y-axis represents a pH level.
- the portions of heat maps 1100-1150 that are “red” represent a higher log(Kp) value (e.g., molecular binding property) and the “green” represents a lower log(Kp) value.
- Heat maps 1100-1150 may be generated based on the one or more empirically-evaluated proteins.
- FIGS. 11A-11B may depict heat maps 1100-1110 depicting an experimental Kp screen and a model predicted Kp screen for an ion exchange resin.
- FIGS. 11C-11D may depict heat maps 1120-1130 depicting an experimental Kp screen and a model predicted Kp screen for a hydrophobic resin.
- FIGS. 11E-11F may depict heat maps 1140-1150 depicting an experimental Kp screen and a model predicted Kp screen for a mixed mode resin.
- the protein of interest may be bound until the salt concentration level used reaches approximately 250 mM in the experimental data.
- FIG. 12 illustrates a flow diagram of a method 1200 for generating a prediction of a molecular binding property of one or more target proteins as part of another streamlined process of protein purification for identifying target proteins, in accordance with various embodiments.
- Method 1200 may accelerate the selection process of therapeutic antibody candidates or other immunotherapy candidates by way of early identification of the most promising therapeutic antibody candidates, in accordance with the disclosed embodiments.
- Method 1200 may be performed utilizing one or more processing devices (e.g., computing device(s) and artificial intelligence architecture to be discussed below with respect to FIGS.
- the one or more processing devices may include, for example, hardware (e.g., a general purpose processor, a graphic processing unit (GPU), an application-specific integrated circuit (ASIC), a system-on-chip (SoC), a microcontroller, a field-programmable gate array (FPGA), a central processing unit (CPU), an application processor (AP), a visual processing unit (VPU), a neural processing unit (NPU), a neural decision processor (NDP), a deep learning processor (DLP), a tensor processing unit (TPU), a neuromorphic processing unit (NPU), or any other processing device(s) that may be suitable for processing genomics data, proteomics data, metabolomics data, metagenomics data, transcriptomics data, or other omics data), firmware (e.g., microcode), or some combination thereof.
- method 1200 may begin at block 1210.
- Block 1210 may form part of the steps performed to train machine learning pipeline 908.
- a training molecular descriptor matrix representing a training set of amino acid sequences corresponding to one or more empirically-evaluated proteins may be accessed.
- the training molecular descriptor matrix may be generated for proteins that have been evaluated experimentally under one or more experimental conditions (e.g., salt concentration levels, pH levels, etc.).
- an iterative process may be executed to refine a set of hyper-parameters associated with the ensemble-learning model until a desired precision is reached. For example, the process may repeat until machine learning pipeline 908 predicts molecular binding properties with a threshold level of accuracy.
- Block 1220 may include steps that are performed during each iteration of block 1220.
- the training molecular descriptor matrix may be reduced by selecting one representative feature vector for each of a plurality of feature vector clusters.
- Each feature vector cluster may comprise similar feature vectors. For example, two feature vectors having a distance less than a threshold distance (e.g., in an embedding space) may be classified as being “similar.”
- the selected representative feature vector may represent all the feature vectors included within a given feature vector cluster.
- one or more most-predictive feature vectors of the selected representative feature vectors for each feature vector cluster may be determined based on a correlation between the selected representative feature vectors and predetermined batch binding data associated with the empirically-evaluated proteins.
- the most-predictive feature vectors may be determined based on a principal component analysis identifying a first principal component.
- at step 1226, one or more cross-validation losses may be calculated based at least in part on the most-predictive feature vectors and the predetermined batch binding data.
- the set of hyper-parameters of machine learning pipeline 908 may be updated based on the cross- validation losses.
- the set of hyper-parameters may be updated based on the one or more cross-validation losses.
- Blocks 1210-1220 may comprise a “training” portion.
- the result of blocks 1210-1220 may include the trained machine learning model (e.g., machine learning model 908), which can be used during inferencing.
- a molecular descriptor matrix representing a set of amino acid sequences corresponding to one or more proteins may be accessed.
- a prediction of a molecular binding property of the one or more proteins may be obtained by a trained ML model based at least in part on the molecular descriptor matrix.
- the proteins may be proteins of interest.
- a machine learning model (e.g., a protein language model implemented using a neural network) may be used to generate the molecular descriptor matrix.
- the molecular descriptor matrix may comprise a plurality of descriptors (e.g., features). The descriptors may be structured as feature vectors.
- machine learning pipeline 908 may be trained to analyze the molecular descriptor matrix and perform a dimensionality reduction.
- the dimensionality reduction may reduce the molecular descriptor matrix by selecting a representative feature vector.
- the selected representative feature vector may be selected from a cluster of similar feature vectors of the molecular descriptor matrix.
- each cluster may have a representative feature vector.
- the most-predictive feature vectors of the representative feature vectors may be determined.
- the most-predictive feature vectors may then be used to generate a predicted molecular binding property.
- the predicted molecular binding property may represent a first principal component.
- references in the appended claims to an apparatus or system or a component of an apparatus or system being adapted to, arranged to, capable of, configured to, enabled to, operable to, or operative to perform a particular function encompasses that apparatus, system, component, whether or not it or that particular function is activated, turned on, or unlocked, as long as that apparatus, system, or component is so adapted, arranged, capable, configured, enabled, operable, or operative. Additionally, although this disclosure describes or illustrates certain embodiments as providing particular advantages, certain embodiments may provide none, some, or all of these advantages.
- Embodiments disclosed herein may include:
- a method for predicting a molecular binding property of one or more proteins comprising, by one or more computing devices: accessing a molecular descriptor matrix representing a set of amino acid sequences corresponding to one or more proteins; and refining a set of hyper-parameters associated with a machine learning model trained to generate a prediction of a molecular binding property of the one or more proteins, wherein refining the set of hyper-parameters comprises iteratively executing a process until a desired precision is reached, the process comprising: reducing the molecular descriptor matrix by selecting one representative feature vector for each of a plurality of feature vector clusters, wherein each feature vector cluster includes similar feature vectors; determining one or more most-predictive feature vectors of the selected representative feature vectors for each feature vector cluster based on a correlation between the selected representative feature vectors and predetermined batch binding data associated with the one or more proteins; calculating one or more cross-validation losses based at least in part on the one or more most-predictive feature vectors
- calculating the one or more cross-validation losses further comprises: evaluating a cross-validation loss function based on the one or more most-predictive feature vectors, the predetermined batch binding data, the set of hyperparameters, and a set of learnable parameters associated with the machine learning model; and minimizing the cross-validation loss function by varying the set of learnable parameters while the one or more most-predictive feature vectors, the predetermined batch binding data, and the set of hyper-parameters remain constant.
- minimizing the cross-validation loss function comprises optimizing the set of hyper-parameters, and wherein the set of hyper-parameters comprises one or more of a set of general parameters, a set of booster parameters, or a set of learning-task parameters.
- minimizing the cross-validation loss function comprises minimizing a loss between a prediction of a percent protein bound for the one or more proteins and an experimentally-determined percent protein bound for the one or more proteins.
- the predetermined batch binding data comprises an experimentally-determined percent protein bound for one or more pH values and salt concentrations associated with the molecular binding property of the one or more proteins.
- the set of learnable parameters comprises one or more weights or decision variables determined by the machine learning model based at least in part on the one or more most-predictive feature vectors and the predetermined batch binding data.
- the updated set of hyperparameters comprises one or more of an updated set of general parameters, an updated set of booster parameters, or an updated set of learning-task parameters.
- calculating the one or more cross-validation losses comprises calculating an n number of cross-validation losses, and wherein n comprises an integer from 1-n.
- calculating the one or more cross-validation losses comprises determining an n number of individual train-test splits based on the one or more most-predictive feature vectors and the predetermined batch binding data, and wherein n comprises an integer from 1-n.
- calculating the one or more cross-validation losses comprises calculating an n number of cross-validation losses, the method further comprising: generating the prediction of the molecular binding property of the one or more proteins based on an averaging of the n number of cross-validation losses.
- the first machine learning model comprises a neural network trained to generate an M x N descriptor matrix representing the set of amino acid sequences, and wherein N comprises a number of the set of amino acid sequences and M comprises a number of nodes in an output layer of the neural network.
- the machine learning model comprises one or more of a gradient boosting model, an adaptive boosting (AdaBoost) model, an extreme gradient boosting (XGBoost) model, a light gradient boosted machine (LightGBM) model, or a categorical boosting (CatBoost) model.
- the machine learning model is further trained to generate a prediction of a molecular elution property of the one or more proteins.
- reducing the molecular descriptor matrix comprises performing a Pearson’s correlation of feature vectors of the molecular descriptor matrix to generate the plurality of feature vector clusters.
- determining the one or more representative feature vectors of the selected representative feature vectors for each of the plurality of feature vector clusters comprises selecting a k-best matrix of feature vectors of the selected representative feature vectors for each of the plurality of feature vector clusters.
- the computational model-based chromatography process comprises one or more of a computational model-based affinity chromatography process, ion exchange chromatography (IEX) process, a hydrophobic interaction chromatography (HIC) process, or a mixed-mode chromatography (MMC) process.
- a method for predicting a molecular binding property of one or more proteins comprising, by one or more computing devices: accessing a molecular descriptor matrix representing a set of amino acid sequences corresponding to one or more proteins; and obtaining, by a machine learning model, a prediction of a molecular binding property of the one or more proteins based at least in part on the molecular descriptor matrix, wherein the machine learning model is trained by: accessing a training molecular descriptor matrix representing a training set of amino acid sequences corresponding to one or more empirically- evaluated proteins; and iteratively executing a process to refine a set of hyper-parameters associated with the machine learning model until a desired precision is reached, the process comprising: reducing the training molecular descriptor matrix by selecting one representative feature vector for each of a plurality of feature vector clusters, wherein each feature vector cluster includes similar feature vectors; determining one or more most-predictive feature vectors of the selected representative feature vectors for each feature vector cluster based
- obtaining the prediction comprises: reducing the molecular descriptor matrix by selecting one representative feature vector for each of a plurality of feature vector clusters of the molecular descriptor matrix; determining one or more most-predictive feature vectors of the selected representative feature vectors for each feature vector cluster based on a correlation between the selected representative feature vectors and predetermined batch binding data associated with the one or more proteins; inputting the one or more most-predictive feature vectors into the machine learning model to obtain the prediction of the molecular binding property of the one or more proteins.
- calculating the one or more cross-validation losses further comprises: evaluating a cross-validation loss function based on the one or more most-predictive feature vectors, the predetermined batch binding data, the set of hyper-parameters, and a set of learnable parameters associated with the machine learning model; and minimizing the cross-validation loss function by varying the set of learnable parameters while the one or more most-predictive feature vectors, the predetermined batch binding data, and the set of hyper-parameters remain constant.
- minimizing the cross-validation loss function comprises optimizing the set of hyper-parameters, and wherein the set of hyper-parameters comprises one or more of a set of general parameters, a set of booster parameters, or a set of learning-task parameters.
- minimizing the cross-validation loss function comprises minimizing a loss between a prediction of a percent protein bound for the one or more proteins and an experimentally-determined percent protein bound for the one or more proteins.
- the predetermined batch binding data comprises an experimentally-determined percent protein bound for one or more pH values and salt concentrations associated with the molecular binding property of the one or more proteins.
- the set of learnable parameters comprises one or more weights or decision variables determined by the machine learning model based at least in part on the one or more most-predictive feature vectors and the predetermined batch binding data.
- the method further comprises: accessing a second molecular descriptor matrix representing a second set of amino acid sequences corresponding to one or more second proteins; and obtaining, by the machine learning model, a second prediction of a molecular binding property of the one or more second proteins based at least in part on the second molecular descriptor matrix.
- the machine learning model is trained to: reduce the second molecular descriptor matrix by selecting one representative feature vector for each of a second plurality of feature vector clusters of the second molecular descriptor matrix; determine one or more second most-predictive feature vectors of the selected representative feature vectors for each feature vector cluster based on a second correlation between the selected representative feature vectors and predetermined batch binding data associated with the one or more second proteins; and input the one or more second most-predictive feature vectors into the machine learning model trained to generate the second prediction.
- calculating the one or more cross-validation losses comprises calculating an n number of cross-validation losses, and wherein n comprises an integer from 1-n.
- calculating the one or more cross-validation losses comprises determining an n number of individual train-test splits based on the one or more most-predictive feature vectors and the predetermined batch binding data, and wherein n comprises an integer from 1-n.
- calculating the one or more cross-validation losses comprises calculating an n number of cross-validation losses, the method further comprising: generating the prediction of the molecular binding property of the one or more proteins based on an averaging of the n number of cross-validation losses.
- the first machine learning model comprises a neural network trained to generate an M x N descriptor matrix representing the set of amino acid sequences.
- N comprises a number of the set of amino acid sequences and M comprises a number of nodes in an output layer of the neural network.
- the machine learning model comprises one or more of a gradient boosting model, an adaptive boosting (AdaBoost) model, an extreme gradient boosting (XGBoost) model, a light gradient boosted machine (LightGBM) model, or a categorical boosting (CatBoost) model.
- reducing the molecular descriptor matrix comprises clustering the similar feature vectors into the plurality of feature vector clusters based on a correlation distance.
- the selected one representative feature vector for each of the plurality of feature vector clusters comprises a centroid feature vector for each of the plurality of feature vector clusters utilized to represent two or more of the similar feature vectors.
- determining the one or more predictive feature vectors of the selected representative feature vectors for each of the plurality of feature vector clusters comprises selecting a k-best matrix of feature vectors of the selected representative feature vectors for each of the plurality of feature vector clusters.
- the computational model-based chromatography process comprises one or more of a computational model-based affinity chromatography process, ion exchange chromatography (IEX) process, a hydrophobic interaction chromatography (HIC) process, or a mixed-mode chromatography (MMC) process.
- the predetermined batch binding data associated with the one or more empirically-evaluated proteins comprises, for each of the one or more empirically-evaluated proteins, an experimentally-determined binding value measured for each of a set of experimental conditions.
- the correlation between the selected representative feature vectors and the predetermined batch binding data comprises: for each of the one or more empirically-evaluated proteins and for each of the set of experimental conditions: generating a linear representation of the experimentally-determined binding value of the empirically-evaluated protein based on a logit transformation applied to the experimentally-determined binding value of the empirically-evaluated protein; and performing a principal component analysis (PCA) to the linear representations of the experimentally-determined binding values of the one or more empirically-evaluated proteins to obtain at least a first principal component.
- any one of embodiments 32-79 further comprising: generating, based on the prediction, a set of functions representing a behavior of the one or more proteins for a set of experimental conditions; and selecting at least one of the one or more proteins for one or more drug discovery assays based on the behavior of the one or more proteins for the set of experimental conditions.
- the correlation between the selected representative feature vectors and the predetermined batch binding data associated with the one or more empirically-evaluated proteins comprises: a correlation between the representative feature vectors and a principal component calculated based on the predetermined batch binding data.
- determining the one or more most-predictive feature vectors further comprises: (i) fitting a model to the representative feature vectors; (ii) calculating, based on the model, a feature importance score for each of the representative feature vectors; and (iii) removing one or more feature vectors of the representative feature vectors based on the feature importance score of each of the representative feature vectors to obtain a subset of representative feature vectors, wherein the one or more most-predictive feature vectors comprise one or more feature vectors from the subset of representative feature vectors.
- the threshold number of feature vectors comprises a same or similar number of features from the training data used to train the machine learning model.
- a system including one or more computing devices, the system further comprising: one or more non-transitory computer-readable storage media including instructions; and one or more processors coupled to the one or more storage media, the one or more processors configured to execute the instructions to effectuate the method of any one of embodiments 1- 89.
- a non-transitory computer-readable medium comprising instructions that, when executed by one or more processors of one or more computing devices, cause the one or more processors to effectuate operations comprising the method of any one of embodiments 1-89.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202380059325.4A CN119698660A (en) | 2022-08-15 | 2023-08-14 | Computational-based approaches for improving protein purification |
KR1020257005193A KR20250053066A (en) | 2022-08-15 | 2023-08-14 | Computational Methods for Improving Protein Purification |
EP23768085.5A EP4573552A1 (en) | 2022-08-15 | 2023-08-14 | Computational-based methods for improving protein purification |
US19/053,054 US20250191676A1 (en) | 2022-08-15 | 2025-02-13 | Computational-based methods for improving protein purification |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202263398168P | 2022-08-15 | 2022-08-15 | |
US63/398,168 | 2022-08-15 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US19/053,054 Continuation US20250191676A1 (en) | 2022-08-15 | 2025-02-13 | Computational-based methods for improving protein purification |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024040031A1 true WO2024040031A1 (en) | 2024-02-22 |
Family
ID=87974547
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2023/072176 WO2024040031A1 (en) | 2022-08-15 | 2023-08-14 | Computational-based methods for improving protein purification |
Country Status (5)
Country | Link |
---|---|
US (1) | US20250191676A1 (en) |
EP (1) | EP4573552A1 (en) |
KR (1) | KR20250053066A (en) |
CN (1) | CN119698660A (en) |
WO (1) | WO2024040031A1 (en) |
-
2023
- 2023-08-14 CN CN202380059325.4A patent/CN119698660A/en active Pending
- 2023-08-14 KR KR1020257005193A patent/KR20250053066A/en active Pending
- 2023-08-14 EP EP23768085.5A patent/EP4573552A1/en active Pending
- 2023-08-14 WO PCT/US2023/072176 patent/WO2024040031A1/en active Application Filing
-
2025
- 2025-02-13 US US19/053,054 patent/US20250191676A1/en active Pending
Non-Patent Citations (3)
- Lin, Hung-Yi, et al., "Assessing Information Quality and Distinguishing Feature Subsets for Molecular Classification", Proceedings of the 2020 10th International Conference on Bioscience, Biochemistry and Bioinformatics, ACM, New York, NY, USA, 19 January 2020, pages 96-100, XP058459989, ISBN: 978-1-4503-7676-1, DOI: 10.1145/3386052.3386061
- Feng, Qingyuan, et al., "PADME: A Deep Learning-based Framework for Drug-Target Interaction Prediction", arXiv.org, Cornell University Library, Ithaca, NY, 25 July 2018, XP081119001
- Xu, Yuting, et al., "Deep Dive into Machine Learning Models for Protein Engineering", Journal of Chemical Information and Modeling, vol. 60, no. 6, 22 June 2020, pages 2773-2790, XP055908760, ISSN: 1549-9596, DOI: 10.1021/acs.jcim.0c00073, retrieved from https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.0c00073
Also Published As
Publication number | Publication date |
---|---|
CN119698660A (en) | 2025-03-25 |
US20250191676A1 (en) | 2025-06-12 |
KR20250053066A (en) | 2025-04-21 |
EP4573552A1 (en) | 2025-06-25 |
Legal Events
Code | Title | Details |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23768085; Country of ref document: EP; Kind code of ref document: A1 |
WWE | Wipo information: entry into national phase | Ref document number: 2025507723; Country of ref document: JP |
WWE | Wipo information: entry into national phase | Ref document number: 2023768085; Country of ref document: EP |
NENP | Non-entry into the national phase | Ref country code: DE |
ENP | Entry into the national phase | Ref document number: 2023768085; Country of ref document: EP; Effective date: 20250317 |
WWP | Wipo information: published in national office | Ref document number: 1020257005193; Country of ref document: KR |
WWP | Wipo information: published in national office | Ref document number: 2023768085; Country of ref document: EP |