WO2022206320A1 - Prediction model training, data prediction method, apparatus and storage medium - Google Patents

Prediction model training, data prediction method, apparatus and storage medium

Info

Publication number
WO2022206320A1
Authority
WO
WIPO (PCT)
Prior art keywords
training sample
training
information
energy feature
current
Prior art date
Application number
PCT/CN2022/079885
Other languages
English (en)
French (fr)
Inventor
杨子翊
叶兆丰
廖奔犇
张胜誉
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP22778504.5A (EP4318478A1)
Priority to JP2023534153A (JP2023552416A)
Publication of WO2022206320A1
Priority to US18/075,643 (US20230097667A1)

Classifications

    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G16B20/50 Mutagenesis
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G06N5/022 Knowledge engineering; Knowledge acquisition

Definitions

  • The present application relates to the field of computer technology, and in particular, to a prediction model training method, a data prediction method, an apparatus, computer equipment and a storage medium.
  • the use of machine learning algorithms to predict the affinity between compounds and target proteins has emerged.
  • A model established by a machine learning algorithm is used to predict the affinity change between a target protein and a compound after mutation, and then to determine whether the target protein is resistant to the compound, so as to provide a reference for physicians when prescribing drugs.
  • However, current prediction models established by machine learning algorithms suffer from low accuracy and poor generalization ability.
  • A prediction model training method, comprising:
  • obtaining a training sample set, where the training sample set includes the training samples, the training sample weight corresponding to each training sample, and the target energy feature corresponding to each training sample; each training sample includes wild-type protein information, mutant protein information and compound information; the target energy feature is obtained based on the wild-type energy feature and the mutant energy feature, the wild-type energy feature is obtained by binding energy feature extraction based on the wild-type protein information and the compound information, and the mutant energy feature is obtained by binding energy feature extraction based on the mutant protein information and the compound information;
  • determining the current training sample from the training sample set based on the training sample weights; inputting the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and obtaining the basic prediction model when the basic training is completed; updating the training sample weight corresponding to each training sample based on the basic prediction model, and returning to the step of determining the current training sample from the training sample set based on the training sample weights, until the model training is completed and the target prediction model is obtained;
  • the target prediction model is used to predict the interaction state information corresponding to the input protein information and the input compound information.
  • a prediction model training device includes:
  • the sample acquisition module is used to obtain a training sample set, where the training sample set includes the training samples, the training sample weight corresponding to each training sample, and the target energy feature corresponding to each training sample; each training sample includes wild-type protein information, mutant protein information and compound information; the target energy feature is obtained based on the wild-type energy feature and the mutant energy feature, the wild-type energy feature is obtained by binding energy feature extraction based on the wild-type protein information and the compound information, and the mutant energy feature is obtained by binding energy feature extraction based on the mutant protein information and the compound information;
  • the sample determination module is used to determine the current training sample from the training sample set based on the training sample weights;
  • the training module is used to input the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and to obtain the basic prediction model when the basic training is completed;
  • the iterative module is used to update the training sample weight corresponding to each training sample based on the basic prediction model, and to return to the step of determining the current training sample from the training sample set based on the training sample weights, until the model training is completed and the target prediction model is obtained; the target prediction model is used to predict the interaction state information corresponding to the input protein information and the input compound information.
  • a computer device includes a memory and a processor, the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
  • obtaining a training sample set, where the training sample set includes the training samples, the training sample weight corresponding to each training sample, and the target energy feature corresponding to each training sample; each training sample includes wild-type protein information, mutant protein information and compound information; the target energy feature is obtained based on the wild-type energy feature and the mutant energy feature, the wild-type energy feature is obtained by binding energy feature extraction based on the wild-type protein information and the compound information, and the mutant energy feature is obtained by binding energy feature extraction based on the mutant protein information and the compound information;
  • determining the current training sample from the training sample set based on the training sample weights; inputting the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and obtaining the basic prediction model when the basic training is completed; updating the training sample weight corresponding to each training sample based on the basic prediction model, and returning to the step of determining the current training sample from the training sample set based on the training sample weights, until the model training is completed and the target prediction model is obtained;
  • the target prediction model is used to predict the interaction state information corresponding to the input protein information and the input compound information.
  • a computer-readable storage medium having computer-readable instructions stored thereon, the computer-readable instructions realizing the following steps when executed by a processor:
  • obtaining a training sample set, where the training sample set includes the training samples, the training sample weight corresponding to each training sample, and the target energy feature corresponding to each training sample; each training sample includes wild-type protein information, mutant protein information and compound information; the target energy feature is obtained based on the wild-type energy feature and the mutant energy feature, the wild-type energy feature is obtained by binding energy feature extraction based on the wild-type protein information and the compound information, and the mutant energy feature is obtained by binding energy feature extraction based on the mutant protein information and the compound information;
  • determining the current training sample from the training sample set based on the training sample weights; inputting the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and obtaining the basic prediction model when the basic training is completed; updating the training sample weight corresponding to each training sample based on the basic prediction model, and returning to the step of determining the current training sample from the training sample set based on the training sample weights, until the model training is completed and the target prediction model is obtained;
  • the target prediction model is used to predict the interaction state information corresponding to the input protein information and the input compound information.
  • In the above prediction model training method, apparatus, computer device and storage medium, a training sample set is acquired; the training sample set includes the training samples, the training sample weight corresponding to each training sample, and the target energy feature corresponding to each training sample, and each training sample includes wild-type protein information, mutant protein information and compound information. The current training sample is determined from the training sample set based on the training sample weights; the current target energy feature corresponding to the current training sample is input into the pre-training prediction model for basic training, and the basic prediction model is obtained when the basic training is completed; the training sample weight corresponding to each training sample is updated based on the basic prediction model, and the step of determining the current training sample from the training sample set based on the training sample weights is returned to, until the model training is completed and the target prediction model is obtained.
  • The target prediction model is used to predict the interaction state information corresponding to the input protein information and the input compound information.
  • By continuously updating the training sample weights during iteration and using them to determine the current training samples from the training sample set, the quality of the training samples can be guaranteed; training the prediction model with these current training samples then improves the accuracy and generalization of the resulting target prediction model.
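  • As a non-limiting illustration only, the iterative weighted-training scheme described above can be sketched in Python. The use of NumPy arrays, a scikit-learn ExtraTreesRegressor as a stand-in for the pre-training prediction model, an all-ones initial weight vector, and the median-loss re-weighting rule are assumptions of this sketch, not details fixed by the embodiment.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor  # random-forest-family regressor (assumed stand-in)


def train_target_model(features, labels, weights, rounds=10, weight_threshold=0.5):
    """Iteratively pick high-weight samples, fit a base model, then refresh the weights."""
    model = None
    for _ in range(rounds):                               # until "model training is completed"
        current = weights > weight_threshold              # determine current training samples by weight
        model = ExtraTreesRegressor(n_estimators=200, random_state=0)
        model.fit(features[current], labels[current])     # basic training on the current samples
        per_sample_loss = np.abs(model.predict(features) - labels)
        # assumed re-weighting rule: samples whose loss exceeds the median loss get weight 1
        weights = (per_sample_loss > np.median(per_sample_loss)).astype(float)
    return model                                          # target prediction model
```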
  • a data prediction method comprising:
  • acquiring data to be predicted, where the data to be predicted includes the information of the wild-type protein to be predicted, the information of the mutant protein to be predicted and the information of the compound to be predicted;
  • performing binding energy feature extraction based on the information of the wild-type protein to be predicted and the information of the compound to be predicted to obtain the wild-type energy feature to be predicted, and performing binding energy feature extraction based on the information of the mutant protein to be predicted and the information of the compound to be predicted to obtain the mutant energy feature to be predicted;
  • determining the target energy feature to be predicted based on the wild-type energy feature to be predicted and the mutant energy feature to be predicted, and inputting the target energy feature to be predicted into the target prediction model for prediction to obtain the interaction state information;
  • the target prediction model is obtained by acquiring a training sample set and determining the current training sample from the training sample set based on the training sample weights; inputting the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and obtaining the basic prediction model when the basic training is completed; and updating the training sample weight corresponding to each training sample based on the basic prediction model and returning to the step of determining the current training sample from the training sample set based on the training sample weights, until the model training is completed.
  • a data prediction device includes:
  • the data acquisition module is used to acquire the data to be predicted, and the data to be predicted includes the information of the wild-type protein to be predicted, the information of the mutant protein to be predicted and the information of the compound to be predicted;
  • the feature extraction module is used to perform binding energy feature extraction based on the information of the wild-type protein to be predicted and the information of the compound to be predicted to obtain the wild-type energy feature to be predicted, and to perform binding energy feature extraction based on the information of the mutant protein to be predicted and the information of the compound to be predicted to obtain the mutant energy feature to be predicted;
  • a target feature determination module configured to determine the target energy feature to be predicted based on the to-be-predicted wild-type energy feature and the to-be-predicted mutant energy feature;
  • the prediction module is used to input the energy characteristics of the target to be predicted into the target prediction model for prediction, and obtain the interaction state information.
  • the target prediction model is obtained by acquiring a training sample set and determining the current training sample from the training sample set based on the training sample weights; inputting the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and obtaining the basic prediction model when the basic training is completed; and updating the training sample weight corresponding to each training sample based on the basic prediction model and returning to the step of determining the current training sample from the training sample set based on the training sample weights, until the model training is completed.
  • a computer device includes a memory and a processor, the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
  • acquiring data to be predicted, where the data to be predicted includes the information of the wild-type protein to be predicted, the information of the mutant protein to be predicted and the information of the compound to be predicted;
  • performing binding energy feature extraction based on the information of the wild-type protein to be predicted and the information of the compound to be predicted to obtain the wild-type energy feature to be predicted, and performing binding energy feature extraction based on the information of the mutant protein to be predicted and the information of the compound to be predicted to obtain the mutant energy feature to be predicted;
  • determining the target energy feature to be predicted based on the wild-type energy feature to be predicted and the mutant energy feature to be predicted, and inputting the target energy feature to be predicted into the target prediction model for prediction to obtain the interaction state information;
  • the target prediction model is obtained by acquiring a training sample set and determining the current training sample from the training sample set based on the training sample weights; inputting the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and obtaining the basic prediction model when the basic training is completed; and updating the training sample weight corresponding to each training sample based on the basic prediction model and returning to the step of determining the current training sample from the training sample set based on the training sample weights, until the model training is completed.
  • a computer-readable storage medium having computer-readable instructions stored thereon, the computer-readable instructions realizing the following steps when executed by a processor:
  • acquiring data to be predicted, where the data to be predicted includes the information of the wild-type protein to be predicted, the information of the mutant protein to be predicted and the information of the compound to be predicted;
  • performing binding energy feature extraction based on the information of the wild-type protein to be predicted and the information of the compound to be predicted to obtain the wild-type energy feature to be predicted, and performing binding energy feature extraction based on the information of the mutant protein to be predicted and the information of the compound to be predicted to obtain the mutant energy feature to be predicted;
  • determining the target energy feature to be predicted based on the wild-type energy feature to be predicted and the mutant energy feature to be predicted, and inputting the target energy feature to be predicted into the target prediction model for prediction to obtain the interaction state information;
  • the target prediction model is obtained by acquiring a training sample set and determining the current training sample from the training sample set based on the training sample weights; inputting the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and obtaining the basic prediction model when the basic training is completed; and updating the training sample weight corresponding to each training sample based on the basic prediction model and returning to the step of determining the current training sample from the training sample set based on the training sample weights, until the model training is completed.
  • In the above data prediction method, apparatus, computer device and storage medium, the data to be predicted is acquired, the target energy feature to be predicted is then determined, and the target energy feature to be predicted is input into the target prediction model for prediction to obtain the interaction state information.
  • The target prediction model is obtained by acquiring a training sample set and determining the current training sample from the training sample set based on the training sample weights; inputting the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and obtaining the basic prediction model when the basic training is completed; and updating the training sample weight corresponding to each training sample based on the basic prediction model and returning to the step of determining the current training sample from the training sample set based on the training sample weights, until the model training is completed.
  • Because the target prediction model obtained in this way improves prediction accuracy, the interaction state information obtained by predicting through the target prediction model is also more accurate.
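  • As a non-limiting illustration, the prediction flow above can be sketched as follows; the feature-extraction helper passed in, the mutant-minus-wild-type sign convention, and the NumPy/scikit-learn interfaces are assumptions of this sketch rather than requirements of the embodiment.

```python
import numpy as np


def predict_interaction_state(model, wt_protein, mut_protein, compound, extract_features):
    """Sketch of the prediction flow: extract wild-type and mutant binding energy
    features, take their difference as the target energy feature, then predict."""
    wt_feature = np.asarray(extract_features(wt_protein, compound), dtype=float)    # wild-type energy feature
    mut_feature = np.asarray(extract_features(mut_protein, compound), dtype=float)  # mutant energy feature
    target_feature = mut_feature - wt_feature                # assumed sign convention for the difference
    return model.predict(target_feature.reshape(1, -1))[0]   # interaction state information
```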
  • FIG. 1 is an application environment diagram of a predictive model training method in one embodiment
  • FIG. 2 is a schematic flowchart of a predictive model training method in one embodiment
  • FIG. 3 is a schematic flowchart of pre-training an initial prediction model in one embodiment
  • FIG. 5 is a schematic flowchart of obtaining target energy characteristics in one embodiment
  • FIG. 6 is a schematic flowchart of obtaining a wild-type energy feature in one embodiment
  • FIG. 8 is a schematic flowchart of obtaining a target basic prediction model in one embodiment
  • FIG. 9 is a schematic flowchart of obtaining a basic prediction model in one embodiment
  • FIG. 10 is a schematic flowchart of obtaining updated sample weights in one embodiment
  • FIG. 11 is a schematic flowchart of a data prediction method in one embodiment
  • FIG. 12 is a schematic flowchart of an application scenario of a data prediction method in a specific embodiment
  • FIG. 13 is a schematic flowchart of a predictive model training method in a specific embodiment
  • FIG. 14 is a schematic flowchart of a predictive model training method in a specific embodiment
  • FIG. 15 is a schematic diagram of comparative test results in a specific embodiment
  • FIG. 16 is a schematic diagram of the accuracy rate and recall rate curve indicators in the specific embodiment of FIG. 15;
  • FIG. 17 is a structural block diagram of a prediction model training apparatus in one embodiment
  • FIG. 18 is a structural block diagram of a data prediction apparatus in one embodiment
  • Figure 19 is an internal structure diagram of a computer device in one embodiment
  • FIG. 20 is an internal structure diagram of a computer device in another embodiment.
  • the prediction model training method provided in this application can be applied to the application environment shown in FIG. 1 .
  • the terminal 102 communicates with the server 104 through the network.
  • the server 104 receives the model training instruction sent by the terminal 102, and the server 104 obtains a training sample set from the database 106 according to the model training instruction.
  • The training sample set includes the training samples, the training sample weight corresponding to each training sample, and the target energy feature corresponding to each training sample. The training samples include wild-type protein information, mutant protein information and compound information; the target energy feature is obtained based on the wild-type energy feature and the mutant energy feature, the wild-type energy feature is obtained by binding energy feature extraction based on the wild-type protein information and the compound information, and the mutant energy feature is obtained by binding energy feature extraction based on the mutant protein information and the compound information.
  • The server 104 determines the current training sample from the training sample set based on the training sample weights; the server 104 inputs the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and obtains the basic prediction model when the basic training is completed; the server 104 updates the training sample weight corresponding to each training sample based on the basic prediction model, and returns to the step of determining the current training sample from the training sample set based on the training sample weights, until the model training is completed and the target prediction model is obtained. The target prediction model is used to predict the interaction state information corresponding to the input protein information and the input compound information.
  • the terminal 102 can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices, and the server 104 can be implemented by an independent server or a server cluster composed of multiple servers.
  • A prediction model training method is provided, and the method is described here by taking its application to the server in FIG. 1 as an example. It can be understood that the method can also be applied to a terminal, or to a system including a terminal and a server, where it is realized through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
  • Step 202 obtaining a training sample set
  • the training sample set includes each training sample, the training sample weight corresponding to each training sample, and the target energy feature corresponding to each training sample
  • the training sample includes wild-type protein information, mutant protein information and compound information
  • the target energy feature is obtained based on the wild-type energy feature and the mutant energy feature.
  • the wild-type energy feature is obtained by binding energy feature extraction based on the wild-type protein information and the compound information, and the mutant energy feature is obtained by binding energy feature extraction based on the mutant protein information and the compound information.
  • A protein here refers to a target protein, such as a protein kinase.
  • A compound refers to a drug capable of interacting with the target protein, such as a tyrosine kinase inhibitor.
  • Protein information is used to characterize the specific information of the target protein, which can include the protein structure, the physicochemical properties of the protein, and so on. Wild-type protein information refers to the information of a protein as obtained from nature, that is, a protein that has not been artificially mutagenized; mutant protein information refers to the information of a mutated protein, for example, a drug target protein whose structure has mutated.
  • the compound information refers to the information of the compound that can interact with the protein, which can include the structure of the compound, the physical and chemical properties of the compound, and so on.
  • the training sample weight refers to the weight corresponding to the training sample, which is used to characterize the quality of the corresponding training sample. High-quality training samples can improve the training quality when training the machine learning model.
  • Binding energy features refer to features of the interaction between a protein and a compound, which are used to characterize the interaction energy information between the target protein and the compound molecule, and can include structural features, physicochemical property features, energy features, and so on.
  • The binding energy features are features obtained after feature selection.
  • the wild-type energy signature refers to the binding energy signature extracted when the wild-type protein interacts with the compound.
  • the mutant energy feature refers to the binding energy feature extracted when the mutant protein interacts with the compound.
  • the target energy signature is used to characterize the difference between the mutant energy signature and the wild-type energy signature.
  • the server can obtain the training sample set directly from the database.
  • The training sample set includes the training samples, the training sample weight corresponding to each training sample, and the target energy feature corresponding to each training sample.
  • Each training sample includes wild-type protein information, mutant protein information and compound information. The target energy feature is obtained based on the wild-type energy feature and the mutant energy feature; the wild-type energy feature is obtained by binding energy feature extraction based on the wild-type protein information and the compound information, and the mutant energy feature is obtained by binding energy feature extraction based on the mutant protein information and the compound information.
  • the server may also collect each training sample from the Internet, then extract the target energy feature corresponding to each training sample and initialize the training sample weight corresponding to each training sample.
  • the server can also obtain the training sample set from a third-party server that provides data services, for example, can obtain the training sample set from a third-party cloud server.
  • The server may obtain wild-type protein information, mutant protein information and compound information, perform binding energy feature extraction based on the wild-type protein information and the compound information to obtain the wild-type energy feature, perform binding energy feature extraction based on the mutant protein information and the compound information to obtain the mutant energy feature, and calculate the difference between the wild-type energy feature and the mutant energy feature to obtain the target energy feature.
  • The training sample weight corresponding to each training sample is then initialized, for example by random initialization, zero initialization, Gaussian initialization, and so on.
  • Step 204 Determine the current training sample from the training sample set based on the weight of the training sample.
  • the current training sample refers to the training sample used in the current training.
  • The server selects training samples from the training sample set according to the training sample weight corresponding to each training sample to obtain the current training sample. For example, a training sample whose training sample weight is greater than a preset weight threshold may be used as the current training sample.
  • The training sample weights may take values of 0 and 1, that is, the training sample weight corresponding to each training sample is initialized to 0 or 1; when the training sample weight is 1, the corresponding training sample is a current training sample.
  • the server may select multiple training samples from the training sample set according to the weight of the training samples to obtain a current training sample set, where the current training sample set includes multiple training samples. Use the current training sample set to train the base prediction model.
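  • As a non-limiting illustration, selecting the current training samples by weight can be sketched as follows; the helper name and the default threshold of 0.5 are illustrative assumptions.

```python
import numpy as np


def select_current_samples(weights, weight_threshold=0.5):
    """Indices of training samples whose weight exceeds the preset weight threshold;
    with 0/1 weights this keeps exactly the samples whose weight is 1."""
    return np.flatnonzero(np.asarray(weights, dtype=float) > weight_threshold)


# usage: positions 0, 2 and 3 form the current training sample set
current_idx = select_current_samples([1, 0, 1, 1, 0])
```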
  • Step 206 Input the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and obtain the basic prediction model when the basic training is completed.
  • the current target energy feature refers to the target energy feature corresponding to the current training sample.
  • The pre-training prediction model refers to a prediction model obtained by pre-training; it is built using the random forest algorithm and can be used to predict the affinity change between a protein and a compound before and after mutation.
  • the basic prediction model is obtained by training the corresponding current training samples while keeping the weights of the training samples unchanged.
  • The server can input the current target energy feature corresponding to the current training sample into the pre-training prediction model for prediction to obtain a prediction result, calculate the loss according to the prediction result, reversely update the pre-training prediction model according to the loss, and return to the step of inputting the current target energy feature corresponding to the current training sample into the pre-training prediction model for prediction; this is performed iteratively until the basic training completion condition is reached, and the prediction model that reaches the basic training completion condition is used as the basic prediction model.
  • the basic training completion conditions refer to the conditions for obtaining the basic prediction model, including the training reaching a preset upper limit of the number of iterations or the loss reaching a preset threshold, or the parameters of the model no longer changing, etc.
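  • As a non-limiting illustration of basic training, the sketch below refits a scikit-learn ExtraTreesRegressor (an assumed stand-in for the pre-training prediction model; a tree ensemble is refit rather than reversely updated) until a simple loss threshold or a round cap is reached. The mean-absolute-error loss, the growing ensemble size, and the thresholds are assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor


def basic_training(current_features, current_labels, max_rounds=5, loss_threshold=0.1):
    """Refit the stand-in prediction model on the current training samples until the
    loss reaches a preset threshold or the round cap is hit."""
    model = None
    for n_round in range(1, max_rounds + 1):
        # refitting with a growing ensemble stands in for the "reverse update" step
        model = ExtraTreesRegressor(n_estimators=100 * n_round, random_state=0)
        model.fit(current_features, current_labels)
        loss = np.mean(np.abs(model.predict(current_features) - current_labels))
        if loss <= loss_threshold:      # basic training completion condition
            break
    return model                        # basic prediction model
```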
  • Step 208 determine whether the model training is completed, when the model training is completed, perform step 208a, when the model training is not completed, perform step 208b, and return to step 204 for execution.
  • step 208a a target prediction model is obtained, and the target prediction model is used to predict the interaction state information corresponding to the input protein information and the input compound information.
  • Step 208b update the training sample weight corresponding to each training sample based on the basic prediction model, and return to the step of determining the current training sample from the training sample set based on the training sample weight.
  • the completion of model training refers to the conditions for obtaining the target prediction model
  • the target prediction model refers to the model finally trained to predict the interaction state information corresponding to the input protein information and the input compound information.
  • the interaction state information is used to characterize the change in the binding free energy between the protein and the compound before and after mutation.
  • the binding free energy refers to the interaction that exists between the ligand and the receptor.
  • When the server obtains the basic prediction model, it further determines whether the model training is completed; the model training completion condition may include the number of iterations reaching a preset upper limit on the number of model training iterations.
  • When the model training completion condition is not reached, the parameters of the basic prediction model are kept unchanged, and the basic prediction model is then used to update the training sample weight corresponding to each training sample: the target energy feature corresponding to each training sample is input into the basic prediction model to obtain the loss corresponding to each training sample, and the training sample weight corresponding to each training sample is updated according to that loss.
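  • As a non-limiting sketch of this weight update, the code below sets each sample's weight from its loss under the basic prediction model, following the threshold rule described later (loss above a threshold gives weight 1, otherwise 0); the absolute-error loss and the median default threshold are assumptions.

```python
import numpy as np


def update_sample_weights(basic_model, features, labels, weight_loss_threshold=None):
    """Derive each training sample's weight from its loss under the basic prediction model."""
    per_sample_loss = np.abs(basic_model.predict(features) - labels)   # assumed absolute-error loss
    if weight_loss_threshold is None:
        weight_loss_threshold = float(np.median(per_sample_loss))      # assumed default threshold
    # samples whose loss exceeds the threshold get weight 1, the others weight 0
    return (per_sample_loss > weight_loss_threshold).astype(float)
```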
  • the target prediction model is used to predict the interaction state information corresponding to the input protein information and the input compound information.
  • A training sample set is obtained; the training sample set includes the training samples, the training sample weight corresponding to each training sample, and the target energy feature corresponding to each training sample, and each training sample includes wild-type protein information, mutant protein information and compound information. The current training sample is determined from the training sample set based on the training sample weights; the current target energy feature corresponding to the current training sample is input into the pre-training prediction model for basic training, and the basic prediction model is obtained when the basic training is completed; the training sample weight corresponding to each training sample is updated based on the basic prediction model, and the step of determining the current training sample from the training sample set based on the training sample weights is returned to, until the model training is completed and the target prediction model is obtained. The target prediction model is used to predict the interaction state information corresponding to the input protein information and the input compound information.
  • By continuously updating the training sample weights in the iterative process and using them to determine the current training samples from the training sample set, the quality of the training samples can be guaranteed; training the prediction model with these current training samples improves the accuracy and generalization of the resulting target prediction model.
  • Before step 202, that is, before acquiring the training sample set, the method further includes:
  • Step 302 Obtain each training sample, where the training sample includes wild-type protein information, mutant protein information and compound information.
  • Step 304 based on the wild-type protein information and the compound information, perform combined initial energy feature extraction to obtain the wild-type initial energy feature.
  • The binding initial energy features refer to features that have been extracted but not yet screened, which may include non-physical model features, physics- and empirical-potential-based energy features, and so on.
  • Non-physical model features include crystal protein-compound structural features, physicochemical properties of ligands and residues, energy features calculated by empirical or descriptor-based scoring functions, and so on.
  • Physical and empirical potential energy-based features refer to energy features computed by modeling programs based on mixed physical and empirical potential energy.
  • the wild-type initial energy feature refers to the binding initial energy feature extracted when the wild-type protein information and compound information interact.
  • the server may obtain each training sample from the database, and each training sample may be a sample used in pre-training.
  • the respective training samples may be the same as or different from the training samples in the training sample set.
  • the server can also collect each training sample from the Internet, and the server can also obtain each training sample from a server that provides data services. Wild-type protein information, mutant protein information, and compound information are included in each training sample.
  • The server performs feature extraction on each training sample, that is, binding initial energy feature extraction is performed using the wild-type protein information and the compound information to obtain the wild-type initial energy feature corresponding to each training sample.
  • Step 306 based on the mutant protein information and compound information, perform combined initial energy feature extraction to obtain mutant initial energy features, and determine target initial energy features corresponding to each training sample based on the wild-type initial energy features and mutant initial energy features.
  • The mutant initial energy feature refers to the binding initial energy feature extracted when the mutant protein information and the compound information interact, and the target initial energy feature is used to characterize the difference between the wild-type initial energy feature and the mutant initial energy feature.
  • The server performs binding initial energy feature extraction on the mutant protein information and the compound information to obtain the mutant initial energy feature, calculates the difference between the wild-type initial energy feature and the mutant initial energy feature, and uses the difference as the target initial energy feature.
  • For example, the difference between structural features can be calculated and used as the target structural feature, and the difference between physicochemical properties can be calculated and used as the target physicochemical property feature.
  • Step 308 input the target initial energy feature corresponding to each training sample into the initial prediction model for prediction, and obtain initial interaction state information corresponding to each training sample.
  • the initial prediction model is established by using the random forest algorithm.
  • The initial prediction model refers to a prediction model whose model parameters have been initialized; the model parameters may be randomly initialized, zero-initialized, and so on.
  • the initial prediction model is built using the random forest algorithm, which refers to a classifier that uses multiple trees to train and predict samples.
  • the ExtraTree (extreme random tree) algorithm can be used to establish an initial prediction model.
  • the initial interaction state information refers to the interaction state information predicted by using the initial prediction model.
  • The server uses the random forest algorithm to establish, in advance, an initial prediction model with initialized model parameters, then inputs the target initial energy feature corresponding to each training sample into the initial prediction model for prediction, and obtains the output initial interaction state information corresponding to each training sample.
  • Step 310 Perform loss calculation based on the initial interaction state information corresponding to each training sample and the interaction state label corresponding to each training sample, and obtain initial loss information corresponding to each training sample.
  • the interaction state label refers to the real interaction state information
  • each training sample has a corresponding interaction state label.
  • the initial loss information is used to characterize the error between the initial interaction state information and the interaction state labels.
  • the server uses a preset loss function to calculate the loss between the initial interaction state information corresponding to each training sample and the interaction state label, and obtains the initial loss information corresponding to each training sample.
  • the loss function may be a mean square error loss function, a mean absolute value error loss function, and the like.
  • Step 312 Update the initial prediction model based on the initial loss information, and return to the step of inputting the target initial energy features corresponding to each training sample into the initial prediction model for prediction, until the pre-training is completed, obtaining the pre-training prediction model and the feature importance corresponding to the target initial energy features.
  • the completion of pre-training refers to the conditions for obtaining the pre-training prediction model, which means that the number of pre-training reaches the preset number of iterations, or the loss of pre-training reaches the preset threshold, or the parameters of the pre-training prediction model no longer change.
  • the feature importance is used to characterize the importance of the initial energy feature of the target. The higher the feature importance, the more important the corresponding feature is, and the more it contributes to the model training.
  • The server uses the initial loss information to calculate a gradient, uses the gradient to reversely update the initial prediction model to obtain an updated prediction model, and judges whether the pre-training is completed.
  • When the pre-training is not completed, the updated prediction model is used as the initial prediction model, and the step of inputting the target initial energy features corresponding to each training sample into the initial prediction model for prediction is returned to and executed iteratively until the pre-training is completed; the updated prediction model obtained in the last iteration is used as the pre-training prediction model. Since the pre-training prediction model is established using the random forest algorithm, the feature importance corresponding to the target initial energy features can be obtained directly when the pre-training prediction model is trained; each feature in the target initial energy features has a corresponding feature importance.
  • Step 316 Determine the training sample weight corresponding to each training sample based on the loss information corresponding to each training sample when the pre-training is completed, and select the target energy feature from the target initial energy features based on the feature importance.
  • The server can use the loss information corresponding to each training sample when the pre-training is completed to determine the training sample weight corresponding to each training sample. For example, the loss information corresponding to each training sample can be compared with a weight loss threshold; when the loss information is greater than the weight loss threshold, the corresponding training sample is a good-quality sample and its training sample weight can be set to 1, and when the loss information is not greater than the weight loss threshold, the corresponding training sample is a poor-quality sample and its training sample weight can be set to 0.
  • the feature selection is performed from the target initial energy features by feature importance to obtain the target energy feature, which is the feature to be extracted when the pre-training prediction model is further trained.
  • In this embodiment, the pre-training prediction model is obtained by pre-training on each training sample; the training sample weight corresponding to each training sample is then determined based on the loss information corresponding to each training sample when the pre-training is completed, and feature selection is performed on the target initial energy features based on feature importance to obtain the target energy feature, so that training efficiency can be improved during further training while training accuracy is ensured.
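  • As a non-limiting illustration of the two outputs of pre-training, the sketch below fits a random-forest-family model, reads its feature importances to select feature columns, and derives 0/1 sample weights from the final per-sample losses; the top-k selection rule, the ExtraTreesRegressor stand-in, and the median threshold are assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor


def pretrain_and_select(initial_features, labels, top_k=20):
    """Pre-train on the target initial energy features, keep the top-k most important
    feature columns, and derive 0/1 training sample weights from the final losses."""
    model = ExtraTreesRegressor(n_estimators=300, random_state=0)
    model.fit(initial_features, labels)
    importance = model.feature_importances_                      # one importance score per feature column
    selected_columns = np.argsort(importance)[::-1][:top_k]      # assumed top-k selection rule

    loss = np.abs(model.predict(initial_features) - labels)      # loss at pre-training completion
    weights = (loss > np.median(loss)).astype(float)             # weight rule described above (assumed threshold)
    return model, selected_columns, weights
```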
  • Step 308, namely inputting the target initial energy feature corresponding to each training sample into the initial prediction model for prediction to obtain the initial interaction state information corresponding to each training sample, where the initial prediction model is built using the random forest algorithm, includes:
  • Step 402 input the target initial energy feature corresponding to each training sample into the initial prediction model
  • Step 404 The initial prediction model takes the target initial energy features corresponding to each training sample as the current set to be divided, calculates the initial feature importance corresponding to the target initial energy features, determines the initial division feature from the target initial energy features based on the initial feature importance, and divides the target initial energy features corresponding to each training sample based on the initial division feature to obtain division results, where a division result includes the target initial energy features corresponding to each divided sample. Each division result is then used as the current set to be divided, and the step of calculating the initial feature importance corresponding to the target initial energy features is returned to and iterated until the division is completed, obtaining the initial interaction state information corresponding to each training sample.
  • the initial feature importance refers to the feature importance corresponding to the initial energy feature of the target
  • the initial division feature refers to the feature for dividing the decision tree.
  • the division result refers to what is obtained after dividing the target initial energy feature
  • the divided sample refers to the training sample corresponding to the target initial energy feature in the division result.
  • the server inputs the target initial energy feature corresponding to each training sample into the initial prediction model, and the initial prediction model scores the input feature to obtain the initial feature importance corresponding to the target initial energy feature.
  • information gain, information gain rate, Gini coefficient, mean square error, etc. can be used to calculate the initial feature importance.
  • The initial division feature is determined from the target initial energy features based on the initial feature importance, and the target initial energy features corresponding to each training sample are divided based on the initial division feature; for example, the target initial energy features that satisfy the initial division feature are used as one part and the target initial energy features that do not satisfy the initial division feature are used as another part, and a division result is obtained.
  • The division result includes the target initial energy features corresponding to each divided sample; each division result is used as the current set to be divided, and the step of calculating the initial feature importance corresponding to the target initial energy features is returned to and iterated until the division is completed, obtaining the initial interaction state information corresponding to each training sample.
  • the division is completed means that each tree node cannot be divided, that is, the leaf node corresponds to only the unique target initial energy feature.
  • the initial interaction state information refers to the interaction state information predicted by the initial prediction model.
  • In this embodiment, the initial prediction model calculates the initial feature importance corresponding to the target initial energy features, determines the initial division feature from the target initial energy features based on the initial feature importance, and divides the target initial energy features corresponding to each training sample based on the initial division feature to obtain the division results, where a division result includes the target initial energy features corresponding to each divided sample. Each division result is used as the current set to be divided, and the step of calculating the initial feature importance corresponding to the target initial energy features is returned to and iterated until the division is completed, obtaining the initial interaction state information corresponding to each training sample, which improves the accuracy of the obtained initial interaction state information.
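  • As a simplified, non-limiting illustration of choosing a division feature by an impurity-style score (variance, i.e. mean-squared-error, reduction is used here; information gain or the Gini coefficient would be analogous), the sketch below scores each feature column of a NumPy array with a single median-threshold split; all names and the split rule are illustrative.

```python
import numpy as np


def best_division_feature(features, labels):
    """Score each feature column by the variance reduction of a median-threshold split
    and return the index of the best-scoring column."""
    labels = np.asarray(labels, dtype=float)
    base_impurity = np.var(labels)
    scores = []
    for column in range(features.shape[1]):
        threshold = np.median(features[:, column])        # simple candidate split point
        left = labels[features[:, column] <= threshold]
        right = labels[features[:, column] > threshold]
        if len(left) == 0 or len(right) == 0:
            scores.append(0.0)
            continue
        child_impurity = (len(left) * np.var(left) + len(right) * np.var(right)) / len(labels)
        scores.append(base_impurity - child_impurity)      # impurity reduction of this split
    return int(np.argmax(scores))
```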
  • Step 202, namely acquiring a training sample set where the training sample set includes the training sample weight corresponding to each training sample, includes the step:
  • a confidence level corresponding to each training sample is obtained, and a training sample weight corresponding to each training sample is determined based on the confidence level.
  • the confidence is used to represent the quality of the corresponding training samples.
  • The server may also acquire the confidence level corresponding to each training sample at the same time, and the confidence level can then be directly used as the training sample weight corresponding to each training sample. The confidence may be set manually, or may be obtained by pre-assessing the confidence of each training sample. In one embodiment, the confidence corresponding to each training sample can also be compared with a preset confidence threshold; when the confidence threshold is exceeded, the weight of the corresponding training sample is set to 1 and that training sample is a current training sample, and when the confidence threshold is not exceeded, the corresponding training sample weight is set to 0.
  • the training sample weight corresponding to each training sample is determined according to the confidence degree, so as to improve the efficiency of obtaining the training sample weight.
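  • A one-line, non-limiting sketch of turning per-sample confidences into 0/1 training sample weights with a preset confidence threshold follows; the threshold value and the example confidences are illustrative.

```python
import numpy as np


def weights_from_confidence(confidences, confidence_threshold=0.8):
    """Weight 1 when a sample's confidence exceeds the preset threshold, otherwise 0."""
    return (np.asarray(confidences, dtype=float) > confidence_threshold).astype(float)


# illustrative usage: the first and third samples become current training samples
print(weights_from_confidence([0.95, 0.40, 0.85]))   # -> [1. 0. 1.]
```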
  • Acquiring a training sample set, where the training sample set includes the target energy feature corresponding to each training sample, includes:
  • Step 502 extracting the combined energy feature based on the wild-type protein information and the compound information to obtain the wild-type energy feature.
  • Step 504 extracting the combined energy feature based on the mutant protein information and the compound information to obtain the mutant energy feature.
  • the wild-type energy characteristics include but are not limited to wild-type protein characteristics, compound characteristics, and energy characteristics when wild-type protein information and compound information interact.
  • Wild-type protein features are used to characterize features corresponding to wild-type protein information, including but not limited to wild-type protein structural features and wild-type protein physicochemical properties.
  • Compound features include, but are not limited to, compound structural features, and compound physicochemical properties.
  • Mutant energy features include, but are not limited to, mutant protein features, compound features, and energy features when mutant protein information and compound information interact.
  • Mutant protein features are used to characterize features corresponding to mutant protein information, including but not limited to mutant protein structural features and mutant protein physicochemical properties.
  • The server performs feature extraction using the wild-type protein information and the compound information, extracts the wild-type protein features and the compound features, extracts the energy features of the interaction between the wild-type protein and the compound, and uses the extracted wild-type protein features, compound features and energy features as the wild-type energy feature.
  • The server performs feature extraction using the mutant protein information to obtain the mutant protein features, then extracts the energy features of the interaction between the mutant protein and the compound, and uses the extracted mutant protein features, compound features and energy features as the mutant energy feature.
  • Step 506 Calculate the difference between the wild-type energy feature and the mutant energy feature to obtain the target energy feature.
  • The server calculates the difference between the wild-type energy feature and the mutant energy feature, for example, calculating the difference between the wild-type protein features and the mutant protein features, and calculating the difference between the energy features of the wild-type protein interacting with the compound and the energy features of the mutant protein interacting with the compound, to obtain the target energy feature.
  • the feature difference between the wild-type energy feature and the mutant energy feature can be calculated to obtain the target energy feature.
  • In this embodiment, the target energy feature is obtained by calculating the difference between the wild-type energy feature and the mutant energy feature, which improves the accuracy of the obtained target energy feature.
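  • As a non-limiting illustration of taking per-group differences between aligned wild-type and mutant feature groups (protein features, interaction energy features, and so on), a small sketch follows; the group names, values, and the mutant-minus-wild-type direction are hypothetical.

```python
def target_energy_feature(wild_type_features, mutant_features):
    """Per-group difference between mutant and wild-type feature groups; the group
    names used below are illustrative, not taken from the patent."""
    return {name: mutant_features[name] - wild_type_features[name]
            for name in wild_type_features}


# hypothetical feature groups for one protein-compound pair
wild_type = {"protein_structure": 1.2, "interaction_energy": -8.5}
mutant = {"protein_structure": 1.5, "interaction_energy": -7.9}
print(target_energy_feature(wild_type, mutant))
```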
  • the wild-type energy signature includes a first wild-type energy signature and a second wild-type energy signature
  • step 502 based on the wild-type protein information and compound information, perform binding energy feature extraction to obtain wild-type energy features, including:
  • Step 602 based on wild-type protein information and compound information, use a non-physical type scoring function to extract binding energy features to obtain a first wild-type energy feature.
  • The non-physical scoring function refers to a scoring function based on experience or descriptors.
  • Such a scoring function is based on some a priori assumptions or on fitting experimental data, and the energy features it produces do not have an obvious, interpretable physical meaning.
  • the first wild-type energy feature refers to the first part of the energy feature obtained by extraction.
  • The server can use a preset non-physical scoring function to perform binding energy feature extraction: the wild-type protein information and the compound information are evaluated by the non-physical scoring function to obtain a calculation result, and the calculation result is used as the first wild-type energy feature.
  • energy features can be extracted using a scoring function (a function used to evaluate the plausibility of theoretically derived receptor-ligand binding modes).
  • Step 604 Based on the wild-type protein information and the compound information, use the physical type function to perform binding energy feature extraction to obtain a second wild-type energy feature.
  • the physical function refers to the energy function based on mixed physical and empirical potential energy, which has a clear physical meaning.
  • The energy function family is composed of force field functions fitted to experimental data, quantum calculation functions based on first principles, continuum-based solvent models, and so on.
  • the server uses the pre-set physical type function to extract the binding energy feature of the wild-type protein information and the compound information to obtain the second wild-type energy feature.
  • energy features can be calculated using the energy function in the mixed physical and empirical potential-based modeling program Rosetta (a polymer modeling software library with Monte Carlo simulated annealing at the core of the algorithm).
  • Step 606 Fuse the first wild-type energy feature and the second wild-type energy feature to obtain the wild-type energy feature.
  • the server calculates the feature difference between the first wild-type energy feature and the second wild-type energy feature to obtain the wild-type energy feature.
  • In this embodiment, the first wild-type energy feature and the second wild-type energy feature are extracted and then fused to obtain the wild-type energy feature; the first wild-type energy feature and the second wild-type energy feature together can better characterize the interaction energy information between the wild-type target protein and the compound molecule, so that the obtained wild-type energy feature is more accurate.
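  • As a non-limiting sketch of fusing the scoring-function-derived block and the physics-derived block, the code below concatenates the two feature vectors; concatenation is only an assumed fusion choice, and the embodiment text also mentions taking a feature difference, which would require blocks of equal length.

```python
import numpy as np


def fuse_energy_features(first_feature, second_feature):
    """Fuse the non-physical (scoring-function) block and the physical (energy-function)
    block into one wild-type energy feature vector; concatenation is an assumption."""
    return np.concatenate([np.asarray(first_feature, dtype=float),
                           np.asarray(second_feature, dtype=float)])
```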
  • the mutant energy signature includes a first mutant energy signature and a second mutant energy signature
  • step 504 the combined energy feature extraction is performed based on the mutant protein information and the compound information to obtain mutant energy features, including:
  • Step 702 based on the mutant protein information and the compound information, use a non-physical function to perform binding energy feature extraction to obtain a first mutant energy feature.
  • Step 704 based on the mutant protein information and the compound information, use the physical function to perform binding energy feature extraction to obtain a second mutant energy feature.
  • the server uses a preset non-physical function to perform binding energy feature extraction on the mutant protein information and the compound information to obtain the first mutant energy feature, and then uses a preset physical function to perform binding energy feature extraction on the mutant protein information and the compound information to obtain the second mutant energy feature.
  • Step 706 fuse based on the first mutant energy feature and the second mutant energy feature to obtain a mutant energy feature.
  • the server calculates the feature difference between the first mutant energy feature and the second mutant energy feature to obtain the mutant energy feature.
  • the first mutant energy feature and the second mutant energy feature are extracted and then fused to obtain the mutant energy feature.
  • because the first mutant energy feature and the second mutant energy feature can better characterize the interaction energy information between the mutant target protein and the compound molecule, the obtained mutant energy feature is more accurate.
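  • As a minimal illustrative sketch of the feature construction described above: a non-physical score vector and a physics-based energy vector are combined for each protein-compound pair, and the wild-type and mutant vectors are then differenced to give the target energy feature. The callables `nonphysical_score_fn` and `physical_energy_fn` are hypothetical stand-ins for the scoring function and the physics-based energy function, and concatenation is only one possible fusion choice (the embodiments above also describe fusing by taking a feature difference).

```python
import numpy as np

def extract_energy_features(protein, compound,
                            nonphysical_score_fn, physical_energy_fn):
    """Combine a non-physical (empirical/descriptor) score vector with a
    physics-based energy-term vector for one protein-compound pair."""
    first = np.asarray(nonphysical_score_fn(protein, compound))   # e.g. docking-style scores
    second = np.asarray(physical_energy_fn(protein, compound))    # e.g. Rosetta-style energy terms
    return np.concatenate([first, second])

def target_energy_feature(wild_protein, mutant_protein, compound,
                          nonphysical_score_fn, physical_energy_fn):
    """Target energy feature = element-wise difference between the wild-type
    and mutant binding-energy feature vectors."""
    wild = extract_energy_features(wild_protein, compound,
                                   nonphysical_score_fn, physical_energy_fn)
    mutant = extract_energy_features(mutant_protein, compound,
                                     nonphysical_score_fn, physical_energy_fn)
    return wild - mutant
```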
  • step 204 determining the current training sample from the training sample set based on the training sample weight, includes:
  • Step 802 Obtain protein family information, and divide the training sample set based on the protein family information to obtain each training sample group.
  • proteins with similar amino acid sequences and very similar structures and functions in vivo constitute a "protein family", and members of the same protein family are called “homologous proteins”.
  • the protein family information refers to the information of the protein family, and the training sample group is obtained by dividing the training samples corresponding to the same protein family together.
  • the server may obtain the protein family information directly from a database, from the Internet, or from a third-party server that provides data services. In one embodiment, the server may also divide training samples whose protein information has similar structures or sequences into the same training sample group, thereby obtaining each training sample group.
  • Step 804 Select current training samples from each training sample group based on the weights of the training samples to obtain a current training sample set.
  • the server selects the current training samples from each training sample group by using the training sample weights, that is, the current training samples are selected in turn according to the training sample weights within each training sample group, and the selections from all training sample groups together form the current training sample set.
  • Step 206 input the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and when the basic training is completed, obtain the basic prediction model, including:
  • Step 806 Input the current target energy feature corresponding to each current training sample in the current training sample set into the pre-training prediction model for basic training, and obtain the target basic prediction model when the basic training is completed.
  • the server inputs the current target energy features corresponding to each current training sample in the current training sample set into the pre-training prediction model for basic training, and when the basic training is completed, the target basic prediction model is obtained.
  • each training sample group is obtained by dividing the training sample set according to the protein family information. Then, based on the weight of the training samples, the current training samples are selected from each training sample group to obtain the current training sample set, so as to use the current training sample set to perform basic training on the pre-trained prediction model to obtain the basic prediction model. That is, by selecting the current training samples from each training sample group, the selected training samples are distributed throughout the space rather than concentrated in a local area, so as to ensure that the global information contained in the training samples can be learned when the model is trained. In this way, the comprehensiveness of the knowledge learned by the model during the training process is ensured, the convergence speed during the model training process is further improved, and the generalization ability of the model obtained by training is improved.
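  • A minimal sketch of the grouping-and-selection step described above, assuming 0/1 training sample weights; `family_of` and `weight_of` are hypothetical helpers mapping a sample to its protein family identifier and to its current weight.

```python
from collections import defaultdict

def group_by_family(samples, family_of):
    """Divide the training sample set into groups according to protein family
    information; samples of the same family end up in the same group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[family_of(sample)].append(sample)
    return list(groups.values())

def select_current_samples(groups, weight_of):
    """Pick the current training samples from every group so that the chosen
    set is spread over the whole sample space rather than one local region.
    Here a sample is kept when its (0/1) training sample weight equals 1."""
    current = []
    for group in groups:
        current.extend(s for s in group if weight_of(s) == 1)
    return current
```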
  • the basic form of the pre-trained prediction model is shown in the following formula (1), which is minimized jointly over the model parameters w and the training sample weights v:
  • E(w, v; λ, γ) = Σ_{i=1}^{n} v_i · L(y_i, g(x_i, w)) − λ‖v‖_1 − γ‖v‖_{2,1}    (1)
  • n is the total number of training samples; X = (x_1, …, x_n) ∈ R^{n×m} is the training sample set, where R represents the real number set and m is the number of energy features; x_i is the i-th training sample and y_i is the interaction state label corresponding to the i-th training sample.
  • g is the pre-trained prediction model, w represents the model parameters, and L is the loss function.
  • v = (v^(1), …, v^(b)) is the vector of training sample weights, where b represents the number of training sample groups, that is, the training sample set is divided into b groups X^(1), …, X^(b); x_i^(j) represents the i-th training sample of the j-th training sample group, n_j represents the number of training samples in the j-th training sample group, v_i^(j) represents the training sample weight corresponding to the i-th training sample in the j-th training sample group, and v_i represents the weight of the i-th training sample.
  • λ is the parameter controlling the difficulty of the training samples, which means that training samples are selected in order from easy samples (high confidence) to hard samples (low confidence); γ is the parameter controlling sample diversity, which means that samples are selected from multiple training sample groups.
  • ‖·‖_1 denotes the L1 norm and ‖·‖_{2,1} denotes the L2,1 norm, where ‖v‖_{2,1} = Σ_{j=1}^{b} ‖v^(j)‖_2 and v^(j) is the training sample weight vector of the j-th training sample group. The negative L1 norm tends to select samples with high confidence, that is, samples with smaller error during training, and the negative L2,1 norm is beneficial for selecting training samples from multiple training sample groups, embedding diversity information into the prediction model.
  • the current training samples are selected from each training sample group based on the training sample weights to obtain the current training sample set, including:
  • Obtain the current learning parameters, determine the number of selected samples and the sample distribution based on the current learning parameters, select the current training samples from each training sample group according to the training sample weights based on the number of selected samples and the sample distribution, and obtain the target current training sample set.
  • the current learning parameter refers to the learning parameter used in the current training, and the current learning parameter is used to control the selection of the current training sample.
  • the number of selected samples refers to the current number of training samples to be selected.
  • the sample distribution refers to the distribution of the selected current training samples in each training sample group.
  • the target current training sample set refers to a set of current training samples selected by using the current learning parameters.
  • the server obtains the current learning parameter, and the initial value of the current learning parameter may be preset.
  • the server uses the current learning parameters to calculate the number and distribution of samples to be selected during training. Then, based on the number of selected samples and the distribution of samples, the current training samples are selected from each training sample group according to the weight of the training samples, and the target current training sample set is obtained.
  • in this way, the selected training samples can be more accurate, which further makes the prediction model obtained by training more accurate and improves the generalization ability of the prediction model.
  • step 206 is to input the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and when the basic training is completed, the basic prediction model is obtained, including:
  • Step 902 Input the current target energy feature corresponding to the current training sample into the pre-training prediction model for prediction, and obtain current interaction state information.
  • the current interaction state information is used to characterize the change of the protein-compound interaction before and after the mutation in the predicted current training sample.
  • the server directly uses the current target energy feature corresponding to the current training sample as the input of the pre-training prediction model; the pre-training prediction model makes a prediction according to the input current target energy feature and outputs the prediction result, that is, the current interaction state information.
  • Step 904 Calculate the error between the current interaction state information and the interaction state label corresponding to the current training sample to obtain current loss information.
  • the current loss information refers to the error between the prediction result corresponding to the current training sample and the actual result.
  • the server acquires the interaction state label corresponding to the current training sample, and the interaction state label may be preset.
  • the interaction state label can be an experimentally measured change in the interaction of the protein with the compound before and after the mutation. Then the server uses a preset loss function to calculate the error between the current interaction state information and the interaction state label corresponding to the current training sample, and obtains the current loss information.
  • Step 906 Update the pre-training prediction model based on the current loss information, return to the step of inputting the current target energy feature corresponding to the current training sample into the pre-training prediction model for prediction to obtain the current interaction state information, and iterate until the basic training completion condition is met, at which point the basic prediction model is obtained.
  • the server uses the current loss information to reversely update the parameters in the pre-training prediction model through a gradient descent algorithm, returns to the step of inputting the current target energy feature corresponding to the current training sample into the pre-training prediction model for prediction to obtain the current interaction state information, and iterates until the preset number of basic training iterations is reached or the model parameters no longer change, and the pre-training prediction model of the last iteration is used as the basic prediction model.
  • the optimization function corresponding to the pre-trained prediction model is a regression optimization function, shown in the following formula (2), in which the training sample weights v are held fixed: min_w Σ_{i=1}^{n} v_i · L(y_i, g(x_i, w))    (2)
  • v_i indicates that training samples whose training sample weight exceeds the weight threshold are selected for training. For example, when the training sample weights only take the values 0 and 1, the training samples whose training sample weight is 1 may be selected for training.
  • the weight of the training sample is kept unchanged, and then the current training sample is selected to train the pre-trained prediction model to obtain the basic prediction model, thereby making the trained basic prediction model more accurate.
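  • A hedged sketch of this basic training step: with the sample weights fixed, a simple linear predictor is fitted by gradient descent on the selected samples. The linear model, learning rate and stopping rule are illustrative assumptions standing in for the patent's pre-training prediction model, not its exact implementation.

```python
import numpy as np

def basic_training(features, labels, weights, lr=0.01, max_iter=1000, tol=1e-8):
    """Fit a simple linear predictor on the currently selected samples
    (training sample weight == 1) by gradient descent on the squared loss;
    this stands in for the basic training of the pre-training prediction
    model with the sample weights v held fixed."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float)
    v = np.asarray(weights)
    X, y = X[v == 1], y[v == 1]
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
        new_w = w - lr * grad
        if np.max(np.abs(new_w - w)) < tol:        # stop when the parameters no longer change
            return new_w
        w = new_w
    return w
```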
  • step 208b updating the training sample weights corresponding to each training sample based on the basic prediction model, including:
  • Step 1002 Input the target energy feature corresponding to each training sample into the basic prediction model to obtain basic interaction state information corresponding to each training sample.
  • each training sample refers to each training sample in the training sample set.
  • the basic interaction state information refers to the interaction state information corresponding to each training sample predicted by the basic prediction model.
  • the interaction state information may be the relative difference between the binding free energy of the wild-type protein and compound and the binding free energy of the mutant protein and compound.
  • the server obtains the basic prediction model through training, the parameters in the basic prediction model are kept unchanged, and the training sample weight corresponding to each training sample in the training sample set is updated. That is, the server inputs the target energy feature corresponding to each training sample into the basic prediction model, and obtains basic interaction state information corresponding to each output training sample.
  • Step 1004 Calculate the error between the basic interaction state information corresponding to each training sample and the interaction state label corresponding to each training sample to obtain basic loss information.
  • the basic loss information refers to the error between the predicted results of the basic prediction model and the actual results.
  • the server uses a preset loss function to calculate the error of each training sample, that is, calculates the error between the basic interaction state information and the interaction state label, and obtains the basic loss information corresponding to each training sample.
  • Step 1006 Update the weights of the training samples based on the basic loss information to obtain the updated sample weights corresponding to each training sample.
  • the server uses the basic loss information corresponding to each training sample to update the weight of each training sample, and the server may directly use the basic loss information corresponding to each training sample as the updated sample weight corresponding to each training sample.
  • step 1006 that is, updating the training sample weights based on the basic loss information to obtain the updated sample weights corresponding to each training sample, including the steps:
  • the update threshold refers to the threshold for updating the weights of training samples.
  • the server obtains the current learning parameters, uses the current learning parameters to determine the update threshold, and compares the update threshold with the basic loss information corresponding to each training sample. When the basic loss information exceeds the update threshold, the prediction error corresponding to that training sample is relatively large, and the corresponding training sample weight is updated to the first training sample weight. When the basic loss information does not exceed the update threshold, the error is small, and the corresponding training sample weight is updated to the second training sample weight. Then, when the current training samples are selected, the training samples corresponding to the second training sample weight are selected as the current training samples.
  • the current learning parameters include diversity learning parameters and difficulty learning parameters; and calculating an update threshold based on the current learning parameters includes the steps of:
  • obtain each training sample group, determine the current training sample group from each training sample group, and calculate the sample rank corresponding to the current training sample group.
  • a weighted value is calculated based on the sample rank, the weighted value is used to weight the diversity learning parameter to obtain the target weighted value, and the sum of the target weighted value and the difficulty learning parameter is calculated to obtain the update threshold.
  • the difficulty learning parameter refers to a learning parameter that measures how easy or hard the training samples are, and it is used to determine the confidence level of the training samples selected during training.
  • the diversity learning parameter is a learning parameter that measures diversity. The diversity learning parameter is used to determine the distribution of the training samples selected during training in the training sample group.
  • the sample rank refers to the rank of a training sample within its training sample group, that is, the position of the sample after the training samples in the group have been sorted (for example, in ascending order of loss).
  • the current training sample group refers to the training sample group for which the weight of the training sample needs to be updated currently.
  • the server obtains each training sample group, determines the current training sample group from each training sample group, and calculates the sample rank corresponding to the current training sample group.
  • the weighted value is calculated based on the sample rank, and the weighted value is used to weight the diversity learning parameters to obtain the target weighted value. Calculate the sum of the target weighted value and the difficulty learning parameter to obtain the update threshold corresponding to the current training sample group.
  • the training samples in each training sample group may be sorted in ascending order according to the basic loss information. Each sorted training sample group is obtained, a current training sample group is determined from the sorted training sample group, and an update threshold corresponding to the current training sample group is calculated.
  • the following formula (3) can be used to update the training sample weight corresponding to each training sample: v_i^(j) = 1 if L(y_i^(j), g(x_i^(j), w)) < λ + γ / (√a + √(a − 1)), and v_i^(j) = 0 otherwise    (3)
  • a represents the rank of the training sample in the j-th training sample group.
  • g(x_i^(j), w) represents the predicted interaction state information corresponding to the i-th training sample of the j-th training sample group, and y_i^(j) represents the true interaction state label corresponding to the i-th training sample of the j-th training sample group.
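  • A minimal sketch of this rank-based weight update, assuming 0/1 weights and a per-group list of basic loss values; `lam` and `gamma` correspond to the difficulty and diversity learning parameters described above.

```python
import math

def update_sample_weights(losses_by_group, lam, gamma):
    """Formula (3)-style update: inside each training sample group, rank the
    samples by their basic loss in ascending order and keep a sample
    (weight 1) only when its loss is below lam + gamma / (sqrt(a) + sqrt(a-1)),
    where a is the sample's 1-based rank in its group; otherwise the weight is 0."""
    weights_by_group = []
    for losses in losses_by_group:
        order = sorted(range(len(losses)), key=lambda i: losses[i])  # ascending loss
        weights = [0] * len(losses)
        for rank, idx in enumerate(order, start=1):
            threshold = lam + gamma / (math.sqrt(rank) + math.sqrt(rank - 1))
            weights[idx] = 1 if losses[idx] < threshold else 0
        weights_by_group.append(weights)
    return weights_by_group
```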
  • the method further includes the following steps:
  • the server may preset the update conditions of the current learning parameters, for example, preset the increment of the current learning parameters after each weight update. Then, the current learning parameter is updated according to the preset increment to obtain the updated learning parameter, and the updated learning parameter is used as the current learning parameter.
  • the server may also obtain a preset number of samples to be added, update the current learning parameter according to the preset number of samples to be added, obtain the updated learning parameter, and use the updated learning parameter as the current learning parameter. When increasing the number of samples causes the loss information obtained by training to go from decreasing to increasing, the training is completed, and the prediction model trained before the number of samples was increased is used as the final target prediction model.
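  • The pieces above can be tied together in an alternating loop: train the model with the weights fixed, update the weights with the model fixed, then enlarge the learning parameters so that harder samples are gradually admitted. The sketch below assumes the hypothetical `basic_training()` and `update_sample_weights()` helpers sketched earlier are in scope; the initial weights, increment and round count are illustrative.

```python
import numpy as np

def train_target_model(grouped_features, grouped_labels, lam=0.1, gamma=0.1,
                       increment=0.05, rounds=10):
    """Alternating optimization over model parameters and sample weights,
    with the learning parameters increased after each round."""
    weights = [[1] * len(g) for g in grouped_labels]          # start with every sample selected
    model_w = None
    for _ in range(rounds):
        X = np.concatenate([np.asarray(f, dtype=float) for f in grouped_features])
        y = np.concatenate([np.asarray(l, dtype=float) for l in grouped_labels])
        v = np.concatenate([np.asarray(w) for w in weights])
        model_w = basic_training(X, y, v)                     # basic training with v fixed
        losses_by_group = [
            ((np.asarray(f, dtype=float) @ model_w - np.asarray(l, dtype=float)) ** 2).tolist()
            for f, l in zip(grouped_features, grouped_labels)
        ]
        weights = update_sample_weights(losses_by_group, lam, gamma)  # model fixed, v updated
        lam, gamma = lam + increment, gamma + increment       # admit more samples next round
    return model_w
```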
  • a data prediction method is provided, and the method being applied to the server in FIG. 1 is taken as an example for description. It can be understood that the method can also be applied to a terminal, or to a system including a terminal and a server and realized through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
  • Step 1102 Acquire data to be predicted, and the data to be predicted includes information of wild-type protein to be predicted, information of mutant protein to be predicted, and information of compound to be predicted.
  • the wild-type protein information to be predicted refers to the wild-type protein information for which interaction state information needs to be predicted.
  • the mutant protein information to be predicted refers to the mutant protein information that needs to predict the interaction state information.
  • the compound information to be predicted refers to the compound information that needs to predict the interaction state information.
  • the server may collect the data to be predicted from the Internet, and may also obtain the data to be predicted from the terminal.
  • the server can also directly obtain the data to be predicted from the database.
  • the server may also obtain the data to be predicted sent by the third-party server.
  • the third-party server may be a server that provides business services.
  • the data to be predicted includes the information of the wild-type protein to be predicted, the information of the mutant protein to be predicted, and the information of the compound to be predicted.
  • the server can obtain the information of the mutant protein to be predicted and the information of the compound to be predicted from the terminal, and then obtain the information of the wild-type protein to be predicted corresponding to the information of the mutant protein to be predicted from the database, so as to obtain the data to be predicted.
  • Step 1104 Extract the binding energy feature based on the information of the wild-type protein to be predicted and the information of the compound to be predicted to obtain the wild-type energy feature to be predicted, and extract the binding energy feature based on the information of the mutant protein to be predicted and the information of the compound to be predicted to obtain the mutant energy feature to be predicted.
  • the energy feature of the wild-type to be predicted refers to the energy feature when the extracted information of the wild-type protein to be predicted and the information of the compound to be predicted interact.
  • the energy feature of the mutant to be predicted refers to the energy feature obtained when the information of the mutant protein to be predicted and the information of the compound to be predicted interact.
  • the server performs combined energy feature extraction based on the information of the wild-type protein to be predicted and the information of the compound to be predicted, and obtains the energy feature of the wild-type to be predicted.
  • physicochemical property features can be extracted according to the physicochemical properties in the information of the wild-type protein to be predicted and the physicochemical properties in the information of the compound to be predicted.
  • physicochemical properties are indicators that measure the properties of a chemical substance, including physical properties and chemical properties; physical properties include melting point and boiling point, the state at room temperature and color, and chemical properties include pH and so on.
  • the energy features of the interaction between the information of the wild-type protein to be predicted and the information of the compound to be predicted are calculated by the scoring function, and energy features are also obtained by using the energy function based on mixed physical and empirical potential energy, so as to obtain the wild-type energy feature to be predicted. Then, binding energy feature extraction is performed based on the information of the mutant protein to be predicted and the information of the compound to be predicted to obtain the mutant energy feature to be predicted.
  • structural features can be extracted from the protein structure in the information of the mutant protein to be predicted and the compound structure in the information of the compound to be predicted, physicochemical property features can then be extracted according to the physicochemical properties in the information of the mutant protein to be predicted and the physicochemical properties in the information of the compound to be predicted, and energy features can be extracted using the scoring function and the energy function based on mixed physical and empirical potential energy, so as to obtain the mutant energy feature to be predicted.
  • Step 1106 Determine the target energy characteristic to be predicted based on the to-be-predicted wild-type energy characteristic and the to-be-predicted mutant energy characteristic.
  • the server calculates the difference between each feature value in the wild-type energy feature to be predicted and the feature value corresponding to the mutant energy feature to be predicted, and obtains the target energy feature to be predicted.
  • Step 1108 Input the energy feature of the target to be predicted into the target prediction model for prediction, and obtain the interaction state information.
  • the target prediction model is obtained by obtaining a training sample set and determining the current training sample from the training sample set based on the training sample weights; the current target energy feature corresponding to the current training sample is input into the pre-training prediction model for basic training; when the basic training is completed, the basic prediction model is obtained; the training sample weight corresponding to each training sample is updated based on the basic prediction model, and the step of determining the current training sample from the training sample set based on the training sample weights is returned to and executed until the model training is completed.
  • the target prediction model may be a model obtained by training in any embodiment of the above-mentioned prediction model training method. That is, the target prediction model can be obtained by obtaining a training sample set and determining the current training sample from the training sample set based on the training sample weights; inputting the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and obtaining the basic prediction model when the basic training is completed; and updating the training sample weight corresponding to each training sample based on the basic prediction model and returning to the step of determining the current training sample from the training sample set based on the training sample weights, until the model training is completed.
  • the server inputs the energy feature of the target to be predicted into the target prediction model for prediction, and obtains the output interaction state information.
  • the interaction state information refers to the relative difference between the binding free energy of the mutant protein to be predicted with the compound to be predicted and the binding free energy of the wild-type protein to be predicted with the compound to be predicted. The relative difference in binding free energy is then compared with the drug resistance threshold. When the relative difference in binding free energy exceeds the drug resistance threshold, it indicates that the mutant protein to be predicted has developed drug resistance and the compound cannot continue to be used. When the relative difference in binding free energy does not exceed the drug resistance threshold, it means that the mutant protein to be predicted has not developed drug resistance and the compound can still be used normally.
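  • A hedged sketch of this prediction step, assuming a trained linear model and a precomputed target energy feature; the 1.36 kcal/mol cutoff mirrors the threshold mentioned in the comparison test below and is illustrative only.

```python
import numpy as np

def predict_resistance(target_energy_feature, model_w, resistance_threshold=1.36):
    """Predict the relative binding free energy difference for one
    (wild-type protein, mutant protein, compound) triple and classify the
    mutation as drug resistant when the prediction exceeds the threshold."""
    ddg = float(np.dot(np.asarray(target_energy_feature, dtype=float),
                       np.asarray(model_w, dtype=float)))
    is_resistant = ddg > resistance_threshold   # exceeding the threshold means predicted resistance
    return ddg, is_resistant
```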
  • with the above data prediction method, apparatus, computer device and storage medium, the data to be predicted is obtained, the target energy feature to be predicted is then determined, and the target energy feature to be predicted is input into the target prediction model for prediction to obtain the interaction state information.
  • the target prediction model is obtained by obtaining a training sample set and determining the current training sample from the training sample set based on the training sample weights; inputting the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and obtaining the basic prediction model when the basic training is completed; and updating the training sample weight corresponding to each training sample based on the basic prediction model and returning to the step of determining the current training sample from the training sample set based on the training sample weights, until the model training is completed. That is, the interaction state information is obtained by predicting with the target prediction model; because the target prediction model obtained in this way has improved prediction accuracy, the accuracy of the obtained interaction state information is also improved.
  • the present application also provides an application scenario where the above-mentioned data prediction method is applied.
  • Figure 12 is a schematic flowchart of the application scenario of the data prediction method.
  • the server obtains the data to be predicted sent by the terminal, and the data to be predicted includes two different types of target protein information, namely wild-type protein information and mutant protein information, together with compound information. The wild-type protein information, the mutant protein information and the compound information are then used to extract features that have reference value for predicting the change in affinity after the protein mutates, including features from non-physical models and features based on physical and empirical potential energy.
  • non-physical-model features include, for example, features of crystal protein-ligand structures, physicochemical properties of ligands and residues, and some energy features calculated with empirical or descriptor-based scoring functions.
  • the physics-based features are energy features calculated by Rosetta, a modeling program based on mixed physical and empirical potential energy.
  • feature selection is then performed, that is, the features corresponding to the target energy features selected during training are picked out from the extracted features to obtain the target energy feature to be predicted, and the target energy feature to be predicted is input into the target prediction model.
  • a prediction is made to obtain the difference in the predicted binding free energy.
  • the difference of the binding free energy is compared with the drug resistance threshold. When the difference of the binding free energy exceeds the drug resistance threshold, it is indicated that the protein mutation is a protein mutation that can cause drug resistance. When the difference in binding free energy does not exceed the drug resistance threshold, it indicates that the protein mutation is a protein mutation that does not cause drug resistance.
  • the prediction result is sent to the terminal for display.
  • a training method of a prediction model is provided, which specifically includes the following steps:
  • Step 1302 Obtain a training sample set, the training sample set includes each training sample, the training sample weight corresponding to each training sample, and the target energy feature corresponding to each training sample, and the training sample includes wild-type protein information, mutant protein information and compound information,
  • the target energy feature is obtained based on the wild-type energy feature and the mutant energy feature.
  • the wild-type energy feature is obtained by combining the energy feature extraction based on the wild-type protein information and compound information, and the mutant energy feature is based on the combination of the mutant protein information and compound information. energy feature extraction.
  • Step 1304 Obtain protein family information, divide the training sample set based on the protein family information, obtain each training sample group, obtain current learning parameters, and determine the number of selected samples and sample distribution based on the current learning parameters. Based on the number of samples and the distribution of samples, the current training samples are selected from each training sample group according to the weight of the training samples, and the target current training sample set is obtained.
  • Step 1306 Input the target energy feature corresponding to each training sample in the target current training sample set into the basic prediction model, obtain the basic interaction state information corresponding to each training sample, and calculate the basic interaction state information corresponding to each training sample and each training sample. The error between the interaction state labels corresponding to the samples is used to obtain the basic loss information.
  • Step 1302 Calculate the sample ranks corresponding to each training sample group. Calculate the weighted value based on the sample rank, use the weighted value to weight the diversity learning parameters, obtain the target weighted value, calculate the sum of the target weighted value and the difficulty learning parameter, and obtain the update threshold of each training sample group.
  • Step 1308 Compare the update threshold with the basic loss information corresponding to the training samples in each training sample group, obtain the comparison result corresponding to the training sample, and determine the updated sample weight corresponding to the training sample in each training sample group according to the comparison result corresponding to the training sample .
  • Step 1310 Update the current learning parameters according to the preset increment, obtain the updated learning parameters, use the updated learning parameters as the current learning parameters, and return to the steps of determining the number of selected samples and sample distribution based on the current learning parameters, and execute until the model training is completed. , the target prediction model is obtained.
  • the present application further provides an application scenario, where the above-mentioned prediction model training method is applied to the application scenario. specifically:
  • the input data and training sample group information are obtained.
  • the input data includes each training sample and the corresponding training sample weight is 0 or 1.
  • the training sample group information indicates the training sample group to which the training samples in the input data belong.
  • the model parameters and learning parameters of the prediction model are initialized.
  • the training sample weights corresponding to the training samples are fixed and kept unchanged, and the current training samples are selected according to the initialized learning parameters to train the model parameters.
  • the basic prediction model is obtained.
  • the parameters of the basic prediction model are fixed, and the sample weights are updated, that is, formula (3) is used to update the training sample weights corresponding to each training sample to obtain the updated sample weights.
  • the target prediction model obtained by training is compared and tested.
  • the drug resistance standard datasets Platinum (a database that extensively collects drug resistance information and was developed to study and understand the impact of missense mutations on protein-ligand interactions) and TKI were used to perform training and testing, in which the target prediction model is obtained by training with the dataset Platinum and is then tested with the dataset TKI.
  • RDKit is an open-source cheminformatics toolkit that supports 2D and 3D molecular operations on compounds and uses machine learning methods for compound descriptor generation, fingerprint generation, compound structure similarity calculation, 2D and 3D molecular display, and so on.
  • Biopython provides an online resource library for developers who use and study bioinformatics.
  • FoldX is used to calculate protein binding free energy.
  • PLIP is an analysis tool for non-covalent protein-ligand interactions.
  • AutoDock is an open-source molecular simulation software package, mainly used to perform ligand-protein molecular docking.
  • these and other non-physical modeling tools are used to generate the features corresponding to the predicted protein affinity changes after mutation.
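  • A minimal sketch of generating a few non-physical compound descriptor features with RDKit; the descriptor choice is illustrative and is not the patent's exact feature list.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def compound_descriptor_features(smiles):
    """Compute a handful of RDKit descriptors for a compound given as a SMILES
    string; the selected descriptors are illustrative only."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    return [
        Descriptors.MolWt(mol),          # molecular weight
        Descriptors.MolLogP(mol),        # estimated octanol-water partition coefficient
        Descriptors.NumHDonors(mol),     # hydrogen bond donors
        Descriptors.NumHAcceptors(mol),  # hydrogen bond acceptors
        Descriptors.TPSA(mol),           # topological polar surface area
    ]

# example usage with a simple molecule (aspirin):
# features = compound_descriptor_features("CC(=O)OC1=CC=CC=C1C(=O)O")
```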
  • energy features were calculated using Rosetta, a modeling program based on a mixture of physical and empirical potential energies. Feature selection is then performed to obtain the final selected features.
  • Table 1 is the final selected feature number table.
  • Table 1 Feature Number Table
  • Figure 15 shows the scatter plots of the experimentally measured and predicted ΔΔG values.
  • ΔΔG refers to the relative difference between the binding free energies of the ligand and the receptor, that is, the difference between the corresponding binding free energies when the protein before and after mutation binds to the respective compound.
  • the first row in Figure 15 is a schematic diagram of the results of predicting the relative difference in binding free energy using only non-physical-model features.
  • the second row in Figure 15 is a schematic diagram of the results of predicting the relative difference in binding free energy using non-physical-model features together with physical and empirical potential energy features.
  • the first column is a scatter plot of the relative difference in binding free energies obtained by testing using prior art 1 .
  • the second column is a scatter plot of the relative difference in binding free energies obtained by testing using prior art 2.
  • the third column is a scatter plot of the relative difference in binding free energy obtained by testing using the technical solution of the present application.
  • RMSE denotes the root mean square error, Pearson denotes the Pearson correlation coefficient, and AUPRC denotes the area under the precision-recall curve.
  • Table 2 the mean, minimum and maximum values of RMSE, Pearson and AUPRC indicators are calculated respectively, and the obtained results are shown in Table 2 below.
  • the average value of the RMSE indicator (smaller is better) of the present application over all features is 0.73, the minimum value is 0.72, and the maximum value is 0.74, which is clearly smaller than the averages of the other prior art.
  • the Pearson indicator (larger is better) of the present application is also clearly better than that of the other prior art.
  • the AUPRC indicator of the present application is also better than that of the other prior art. Therefore, compared with the prior art, the prediction accuracy of the present application is significantly improved. Further, FIG. 16 is a schematic diagram of the AUPRC indicator in the comparison test results.
  • the first circle from left to right on each curve indicates the precision and recall corresponding to the predicted drug resistance results when ΔΔG > 1.36 kcal/mol is used as the threshold for dividing the test samples into drug-resistant and non-drug-resistant.
  • the second circle from left to right on each curve indicates the precision and recall corresponding to the predicted drug resistance results when the test samples with the top 15% of ΔΔG values are taken as the drug-resistant class. It can be clearly seen from this that the technical solution of the present application obviously improves the performance of classifying whether there is drug resistance.
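  • A hedged sketch of computing the three reported indicators from measured and predicted ΔΔG values with standard SciPy/scikit-learn calls; the 1.36 kcal/mol cutoff used to derive resistance labels mirrors the threshold above and is illustrative.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error, precision_recall_curve, auc

def evaluate_ddg_predictions(ddg_true, ddg_pred, resistance_cutoff=1.36):
    """Compute RMSE, Pearson correlation and AUPRC for predicted vs measured
    ddG values; resistance labels are derived with an illustrative cutoff."""
    ddg_true = np.asarray(ddg_true, dtype=float)
    ddg_pred = np.asarray(ddg_pred, dtype=float)
    rmse = mean_squared_error(ddg_true, ddg_pred) ** 0.5
    pearson, _ = pearsonr(ddg_true, ddg_pred)
    labels = (ddg_true > resistance_cutoff).astype(int)        # 1 = resistant
    precision, recall, _ = precision_recall_curve(labels, ddg_pred)
    auprc = auc(recall, precision)
    return {"RMSE": rmse, "Pearson": pearson, "AUPRC": auprc}
```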
  • although the steps in the flowcharts of FIGS. 2-14 are shown in sequence according to the arrows, these steps are not necessarily executed in the sequence indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and these steps may be performed in other orders. Moreover, at least a part of the steps in FIGS. 2-14 may include multiple steps or multiple stages; these steps or stages are not necessarily executed at the same moment, but may be executed at different moments, and their execution order is not necessarily sequential, as they may be performed in turn or alternately with other steps or with at least a part of the steps or stages within other steps.
  • a predictive model training apparatus 1700 is provided.
  • the apparatus may use software modules or hardware modules, or a combination of the two to become a part of computer equipment.
  • the apparatus specifically includes: a sample acquisition module 1702, a sample determination module 1704, a training module 1706, and an iteration module 1708, wherein:
  • the sample acquisition module 1702 is used to acquire a training sample set, the training sample set includes each training sample, the training sample weight corresponding to each training sample, and the target energy feature corresponding to each training sample, and the training sample includes wild-type protein information and mutant protein information and compound information, the target energy feature is obtained based on the wild-type energy feature and the mutant energy feature, the wild-type energy feature is obtained by combining the energy feature extraction based on the wild-type protein information and the compound information, and the mutant energy feature is based on the mutant protein information and Compound information is obtained by combining energy feature extraction;
  • a sample determination module 1704 configured to determine the current training sample from the training sample set based on the training sample weight
  • the training module 1706 is used to input the current target energy feature corresponding to the current training sample into the pre-training prediction model for basic training, and when the basic training is completed, obtain the basic prediction model;
  • the iteration module 1708 is used to update the training sample weight corresponding to each training sample based on the basic prediction model, and return to the step of determining the current training sample from the training sample set based on the training sample weight.
  • the prediction model is used to predict the interaction state information corresponding to the input protein information and the input compound information.
  • the predictive model training apparatus 1700 further includes:
  • the pre-training module is used to obtain each training sample, the training samples including wild-type protein information, mutant protein information and compound information; perform initial binding energy feature extraction based on the wild-type protein information and the compound information to obtain the wild-type initial energy feature; perform initial binding energy feature extraction based on the mutant protein information and the compound information to obtain the mutant initial energy feature, and determine the target initial energy feature corresponding to each training sample based on the wild-type initial energy feature and the mutant initial energy feature; and input the target initial energy feature corresponding to each training sample into the initial prediction model for prediction to obtain the initial interaction state information corresponding to each training sample.
  • the initial prediction model is established using the random forest algorithm; loss calculation is performed based on the initial interaction state information corresponding to each training sample and the interaction state label corresponding to each training sample to obtain the initial loss information corresponding to each training sample; the initial prediction model is updated based on the initial loss information, and the step of inputting the target initial energy feature corresponding to each training sample into the initial prediction model for prediction is returned to.
  • these steps are executed until the pre-training is completed, and the pre-training prediction model and the feature importance corresponding to the target initial energy features are obtained; the training sample weight corresponding to each training sample is determined based on the loss information corresponding to each training sample when the pre-training is completed, and the target energy features are selected from the target initial energy features based on the feature importance.
  • the pre-training module is further configured to input the target initial energy feature corresponding to each training sample into the initial prediction model; the initial prediction model takes the target initial energy features corresponding to the training samples as the current set to be divided, calculates the initial feature importance corresponding to the target initial energy features, determines the initial division feature from the target initial energy features based on the initial feature importance, and divides the target initial energy features corresponding to the training samples based on the initial division feature to obtain each division result.
  • each division result includes the target initial energy features corresponding to the divided samples; each division result is taken as the current set to be divided, and the step of calculating the initial feature importance corresponding to the target initial energy features is returned to and executed until the division is completed, so as to obtain the initial interaction state information.
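  • A hedged sketch of this pre-training step using scikit-learn's random forest as a stand-in for the initial prediction model: the feature importances are used to keep the most informative target initial energy features, and the per-sample residuals at the end of pre-training are used to seed 0/1 sample weights. The top-k count and the median-based weighting rule are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def pretrain_and_select_features(initial_features, labels, keep_top_k=10):
    """Fit a random-forest regressor on the target initial energy features,
    keep the top-k most important feature columns, and derive initial 0/1
    training sample weights from the per-sample loss after pre-training."""
    X = np.asarray(initial_features, dtype=float)
    y = np.asarray(labels, dtype=float)
    forest = RandomForestRegressor(n_estimators=200, random_state=0)
    forest.fit(X, y)
    importances = forest.feature_importances_
    selected = np.argsort(importances)[::-1][:keep_top_k]       # indices of the kept features
    residuals = np.abs(forest.predict(X) - y)                    # per-sample loss at the end of pre-training
    initial_weights = (residuals <= np.median(residuals)).astype(int)  # 1 for lower-loss (easier) samples
    return selected, initial_weights
```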
  • the sample obtaining module 1702 is further configured to obtain a confidence level corresponding to each training sample, and determine a training sample weight corresponding to each training sample based on the confidence level.
  • the sample acquisition module 1702 is further configured to extract binding energy features based on wild-type protein information and compound information to obtain wild-type energy features; perform binding energy feature extraction based on mutant protein information and compound information to obtain mutant Energy signature; calculate the difference between the wild-type energy signature and the mutant energy signature to obtain the target energy signature.
  • the wild-type energy signature includes a first wild-type energy signature and a second wild-type energy signature; the sample acquisition module 1702 is further configured to perform binding energy signatures based on wild-type protein information and compound information using a non-physical-type scoring function Extraction to obtain the first wild-type energy feature; based on the wild-type protein information and compound information, use the physical type function to perform binding energy feature extraction to obtain the second wild-type energy feature; Based on the first wild-type energy feature and the second wild-type energy feature Fusion was performed to obtain the wild-type energy signature.
  • the mutant energy features include a first mutant energy feature and a second mutant energy feature; the sample acquisition module 1702 is further configured to perform binding energy feature extraction using a non-physical function based on mutant protein information and compound information , obtain the first mutant energy feature; use the physical function to extract the combined energy feature based on the mutant protein information and compound information to obtain the second mutant energy feature; based on the first mutant energy feature and the second mutant energy feature. Fusion to obtain mutant energy features.
  • the sample determination module 1704 is further configured to obtain protein family information, divide the training sample set based on the protein family information, and obtain each training sample group; select the current training sample from each training sample group based on the training sample weight, Get the current training sample set.
  • the training module 1706 is further configured to input the current target energy feature corresponding to each current training sample in the current training sample set into the pre-training prediction model for basic training, and obtain the target basic prediction model when the basic training is completed.
  • the sample determination module 1704 is further configured to obtain the current learning parameters, and determine the number of selected samples and the distribution of samples based on the current learning parameters; based on the number of selected samples and the distribution of samples, select the current training samples from each training sample group according to the weight of the training samples sample to get the target current training sample set.
  • the training module 1706 is further configured to input the current target energy feature corresponding to the current training sample into the pre-training prediction model for prediction to obtain the current interaction state information; calculate the error between the current interaction state information and the interaction state label corresponding to the current training sample to obtain the current loss information; update the pre-training prediction model based on the current loss information, and return to the step of inputting the current target energy feature corresponding to the current training sample into the pre-training prediction model for prediction to obtain the current interaction state information.
  • these steps are executed until the basic training completion condition is reached, and the basic prediction model is obtained.
  • the iterative module 1708 is further configured to input the target energy feature corresponding to each training sample into the basic prediction model to obtain basic interaction state information corresponding to each training sample; calculate the basic interaction state corresponding to each training sample The error between the information and the interaction state label corresponding to each training sample is obtained to obtain the basic loss information; the weight of the training sample is updated based on the basic loss information to obtain the updated sample weight corresponding to each training sample.
  • the iteration module 1708 is further configured to obtain the current learning parameter, and calculate the update threshold based on the current learning parameter; compare the update threshold with the basic loss information corresponding to each training sample to obtain the comparison result corresponding to each training sample; The comparison result corresponding to each training sample determines the updated sample weight corresponding to each training sample.
  • the current learning parameters include a diversity learning parameter and a difficulty learning parameter; the iteration module 1708 is further configured to acquire each training sample group, determine the current training sample group from each training sample group, and calculate the current training sample The sample rank corresponding to the group; the weighted value is calculated based on the sample rank, and the weighted value is used to weight the diversity learning parameters to obtain the target weighted value; the sum of the target weighted value and the difficulty learning parameter is calculated to obtain the update threshold.
  • the iteration module 1708 obtains the current learning parameter, updates the current learning parameter according to the preset increment, obtains the updated learning parameter, and uses the updated learning parameter as the current learning parameter.
  • a data prediction apparatus 1800 is provided.
  • the apparatus can adopt software modules or hardware modules, or a combination of the two to become a part of computer equipment.
  • the apparatus specifically includes: data acquisition module 1802, feature extraction module 1804, target feature determination module 1806 and prediction module 1808, wherein:
  • the feature extraction module 1804 is used to extract the combined energy feature based on the information of the wild-type protein to be predicted and the information of the compound to be predicted, obtain the wild-type energy feature to be predicted, and extract the combined energy feature based on the information of the mutant protein to be predicted and the information of the compound to be predicted , obtain the mutant energy characteristics to be predicted;
  • a target feature determination module 1806, configured to determine the target energy feature to be predicted based on the to-be-predicted wild-type energy feature and the to-be-predicted mutant energy feature;
  • the prediction module 1808 is used to input the energy feature of the target to be predicted into the target prediction model for prediction, and obtain the interaction state information.
  • the target prediction model obtains the training sample set and determines the current training sample from the training sample set based on the weight of the training sample;
  • the current target energy feature corresponding to the current training sample is input into the pre-training prediction model for basic training.
  • when the basic training is completed, the basic prediction model is obtained; the training sample weight corresponding to each training sample is updated based on the basic prediction model, and the step of determining the current training sample from the training sample set based on the training sample weights is returned to and executed until the model training is completed.
  • Each module in the above data prediction apparatus can be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 19 .
  • the computer device includes a processor, memory, and a network interface connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium, an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions and a database.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the database of the computer device is used to store training sample data and data to be predicted.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer readable instructions when executed by the processor, implement a predictive model training method or a data prediction method.
  • a computer device is provided, and the computer device may be a terminal, and its internal structure diagram may be as shown in FIG. 20 .
  • the computer equipment includes a processor, memory, a communication interface, a display screen, and an input device connected by a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium, an internal memory.
  • the non-volatile storage medium stores an operating system and computer readable instructions.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized by WIFI, operator network, NFC (Near Field Communication) or other technologies.
  • the computer readable instructions when executed by the processor, implement a predictive model training method and a data prediction method.
  • the display screen of the computer equipment may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment may be a touch layer covered on the display screen, or a button, a trackball or a touchpad set on the shell of the computer equipment , or an external keyboard, trackpad, or mouse.
  • FIG. 19 and FIG. 20 are only block diagrams of partial structures related to the solution of the present application, and do not constitute a limitation on the computer equipment to which the solution of the present application is applied.
  • a computer device may include more or fewer components than those shown in the figures, or combine certain components, or have a different arrangement of components.
  • a computer device including a memory and a processor, where computer-readable instructions are stored in the memory, and when the processor executes the computer-readable instructions, the steps in the foregoing method embodiments are implemented.
  • a computer-readable storage medium which stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, implements the steps in the foregoing method embodiments.
  • a computer program product or computer program comprising computer instructions stored in a computer readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the steps in the foregoing method embodiments.
  • Non-volatile memory may include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, or optical memory, and the like.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • the RAM may be in various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computational Linguistics (AREA)
  • Medicinal Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

本申请涉及一种预测模型训练方法、装置、计算机设备和存储介质。该方法包括:获取训练样本集,训练样本集包括各个训练样本、各个训练样本对应的训练样本权重和各个训练样本对应的目标能量特征;基于训练样本权重从训练样本集中确定当前训练样本;将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;基于基础预测模型更新各个训练样本对应的训练样本权重并迭代执行,直到模型训练完成时,得到目标预测模型,目标预测模型用于预测输入的蛋白质信息与输入的化合物信息对应的相互作用状态信息。采用本方法能够提高训练得到的目标预测模型的预测准确性。

Description

预测模型训练、数据预测方法、装置和存储介质
本申请要求于2021年04月01日提交中国专利局,申请号为2021103559296,申请名称为“预测模型训练、数据预测方法、装置和存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,特别是涉及一种预测模型训练、数据预测方法、装置、计算机设备和存储介质。
背景技术
随着人工智能技术的发展,出现了使用机器学习算法来预测化合物与靶向蛋白质之间的亲和力。目前,通过使用机器学习算法建立的模型来预测靶向蛋白质发生突变后与化合物之间的亲和力变化,进而确定靶向蛋白质对化合物是否产生耐药性,从而为医生用药提供参考。然而,目前通过机器学习算法建立的预测模型存在准确率低,模型泛化能力差的问题。
发明内容
基于此,有必要针对上述技术问题,提供一种能够提高预测模型训练准确性,进而提高预测准确性的预测模型训练、数据预测方法、装置、计算机设备和存储介质。
一种预测模型训练方法,所述方法包括:
获取训练样本集,训练样本集包括各个训练样本、各个训练样本对应的训练样本权重和各个训练样本对应的目标能量特征,训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息,目标能量特征基于野生型能量特征和突变型能量特征得到,野生型能量特征是基于野生型蛋白质信息和化合物信息进行结合能量特征提取得到,突变型能量特征是基于突变型蛋白质信息和化合物信息进行结合能量特征提取得到的;
基于训练样本权重从训练样本集中确定当前训练样本;
将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;
基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时,得到目标预测模型,目标预测模型用于预测输入的蛋白质信息与输入的化合物信息对应的相互作用状态信息。
一种预测模型训练装置,所述装置包括:
样本获取模块,用于获取训练样本集,训练样本集包括各个训练样本、各个训练样本对应的训练样本权重和各个训练样本对应的目标能量特征,训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息,目标能量特征基于野生型能量特征和突变型能量特征得到,野生型能量特征是基于野生型蛋白质信息和化合物信息进行结合能量特征提取得到,突变型能量特征是基于突变型蛋白质信息和化合物信息进行结合能量特征提取得到的;
样本确定模块,用于基于训练样本权重从训练样本集中确定当前训练样本;
训练模块,用于将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;
迭代模块,用于基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时,得到目标预测模型,目标预测模型用于预测输入的蛋白质信息与输入的化合物信息对应的相互作用 状态信息。
一种计算机设备,包括存储器和处理器,所述存储器存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现以下步骤:
获取训练样本集,训练样本集包括各个训练样本、各个训练样本对应的训练样本权重和各个训练样本对应的目标能量特征,训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息,目标能量特征基于野生型能量特征和突变型能量特征得到,野生型能量特征是基于野生型蛋白质信息和化合物信息进行结合能量特征提取得到,突变型能量特征是基于突变型蛋白质信息和化合物信息进行结合能量特征提取得到的;
基于训练样本权重从训练样本集中确定当前训练样本;
将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;
基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时,得到目标预测模型,目标预测模型用于预测输入的蛋白质信息与输入的化合物信息对应的相互作用状态信息。
一种计算机可读存储介质,其上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现以下步骤:
获取训练样本集,训练样本集包括各个训练样本、各个训练样本对应的训练样本权重和各个训练样本对应的目标能量特征,训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息,目标能量特征基于野生型能量特征和突变型能量特征得到,野生型能量特征是基于野生型蛋白质信息和化合物信息进行结合能量特征提取得到,突变型能量特征是基于突变型蛋白质信息和化合物信息进行结合能量特征提取得到的;
基于训练样本权重从训练样本集中确定当前训练样本;
将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;
基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时,得到目标预测模型,目标预测模型用于预测输入的蛋白质信息与输入的化合物信息对应的相互作用状态信息。
上述预测模型训练方法、装置、计算机设备和存储介质,通过获取训练样本集,训练样本集包括各个训练样本、各个训练样本对应的训练样本权重和各个训练样本对应的目标能量特征,训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息,基于训练样本权重从训练样本集中确定当前训练样本;将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时,得到目标预测模型,目标预测模型用于预测输入的蛋白质信息与输入的化合物信息对应的相互作用状态信息。即通过在迭代过程中不断更新训练样本权重,并且使用训练样本权重从训练样本集中确定当前训练样本,能够保证训练样本的质量,然后使用当前训练样本训练预测模型,从而使训练得到的目标预测模型能够提高预测的准确性和泛化性。
一种数据预测方法,所述方法包括:
获取待预测数据,待预测数据包括待预测野生型蛋白质信息、待预测突变型蛋白质信息 和待预测化合物信息;
基于待预测野生型蛋白质信息和待预测化合物信息进行结合能量特征提取,得到待预测野生型能量特征,基于待预测突变型蛋白质信息和待预测化合物信息进行结合能量特征提取,得到待预测突变型能量特征;
基于待预测野生型能量特征和待预测突变型能量特征确定待预测目标能量特征;
将待预测目标能量特征输入目标预测模型中进行预测,得到相互作用状态信息,目标预测模型是通过获取训练样本集,基于训练样本权重从训练样本集中确定当前训练样本;将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时得到的。
一种数据预测装置,所述装置包括:
数据获取模块,用于获取待预测数据,待预测数据包括待预测野生型蛋白质信息、待预测突变型蛋白质信息和待预测化合物信息;
特征提取模块,用于基于待预测野生型蛋白质信息和待预测化合物信息进行结合能量特征提取,得到待预测野生型能量特征,基于待预测突变型蛋白质信息和待预测化合物信息进行结合能量特征提取,得到待预测突变型能量特征;
目标特征确定模块,用于基于待预测野生型能量特征和待预测突变型能量特征确定待预测目标能量特征;
预测模块,用于将待预测目标能量特征输入目标预测模型中进行预测,得到相互作用状态信息,目标预测模型是通过获取训练样本集,基于训练样本权重从训练样本集中确定当前训练样本;将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时得到的。
一种计算机设备,包括存储器和处理器,所述存储器存储有计算机可读指令,所述处理器执行所述计算机可读指令时实现以下步骤:
获取待预测数据,待预测数据包括待预测野生型蛋白质信息、待预测突变型蛋白质信息和待预测化合物信息;
基于待预测野生型蛋白质信息和待预测化合物信息进行结合能量特征提取,得到待预测野生型能量特征,基于待预测突变型蛋白质信息和待预测化合物信息进行结合能量特征提取,得到待预测突变型能量特征;
基于待预测野生型能量特征和待预测突变型能量特征确定待预测目标能量特征;
将待预测目标能量特征输入目标预测模型中进行预测,得到相互作用状态信息,目标预测模型是通过获取训练样本集,基于训练样本权重从训练样本集中确定当前训练样本;将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时得到的。
一种计算机可读存储介质,其上存储有计算机可读指令,所述计算机可读指令被处理器 执行时实现以下步骤:
获取待预测数据,待预测数据包括待预测野生型蛋白质信息、待预测突变型蛋白质信息和待预测化合物信息;
基于待预测野生型蛋白质信息和待预测化合物信息进行结合能量特征提取,得到待预测野生型能量特征,基于待预测突变型蛋白质信息和待预测化合物信息进行结合能量特征提取,得到待预测突变型能量特征;
基于待预测野生型能量特征和待预测突变型能量特征确定待预测目标能量特征;
将待预测目标能量特征输入目标预测模型中进行预测,得到相互作用状态信息,目标预测模型是通过获取训练样本集,基于训练样本权重从训练样本集中确定当前训练样本;将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时得到的。
上述数据预测方法、装置、计算机设备和存储介质,通过获取待预测数据,然后确定待预测目标能量特征,将待预测目标能量特征输入目标预测模型中进行预测,得到相互作用状态信息,由于目标预测模型是通过获取训练样本集,基于训练样本权重从训练样本集中确定当前训练样本;将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时得到的,即通过目标预测模型来预测得到相互作用状态信息,由于训练得到的目标预测模型能够提高预测的准确性,进而使得到的相互作用状态信息提高了准确性。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为一个实施例中预测模型训练方法的应用环境图;
图2为一个实施例中预测模型训练方法的流程示意图;
图3为一个实施例中预训练初始预测模型的流程示意图;
图4为一个实施例中得到初始相互作用状态信息的流程示意图;
图5为一个实施例中得到目标能量特征的流程示意图;
图6为一个实施例中得到野生型能量特征的流程示意图；
图7为一个实施例中得到突变型能量特征的流程示意图;
图8为一个实施例中得到目标基础预测模型的流程示意图;
图9为一个实施例中得到基础预测模型的流程示意图;
图10为一个实施例中得到更新样本权重的流程示意图;
图11为一个实施例中数据预测方法的流程示意图;
图12为一个具体实施例中数据预测方法应用场景的流程示意图；
图13为一个具体实施例中预测模型训练方法的流程示意图;
图14为一个具体实施例中预测模型训练方法的流程示意图;
图15为一个具体实施例中对比测试结果的示意图;
图16为图15具体实施例中准确率和召回率曲线指标的示意图;
图17为一个实施例中预测模型训练装置的结构框图;
图18为一个实施例中数据预测装置的结构框图;
图19为一个实施例中计算机设备的内部结构图;
图20为另一个实施例中计算机设备的内部结构图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
本申请提供的预测模型训练方法,可以应用于如图1所示的应用环境中。其中,终端102通过网络与服务器104进行通信。服务器104接收到终端102发送的模型训练指令,服务器104根据模型训练指令从数据库106中获取训练样本集,训练样本集包括各个训练样本、各个训练样本对应的训练样本权重和各个训练样本对应的目标能量特征,训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息,目标能量特征基于野生型能量特征和突变型能量特征得到,野生型能量特征是基于野生型蛋白质信息和化合物信息进行结合能量特征提取得到,突变型能量特征是基于突变型蛋白质信息和化合物信息进行结合能量特征提取得到的;服务器104基于训练样本权重从训练样本集中确定当前训练样本;服务器104将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;服务器104基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时,得到目标预测模型,目标预测模型用于预测输入的蛋白质信息与输入的化合物信息对应的相互作用状态信息。其中,终端102可以但不限于是各种个人计算机、笔记本电脑、智能手机、平板电脑和便携式可穿戴设备,服务器104可以用独立的服务器或者是多个服务器组成的服务器集群来实现。
在一个实施例中,如图2所示,提供了一种预测模型训练方法,以该方法应用于图1中的服务器为例进行说明,可以理解的是,该方法也可以应用在终端中,还可以应用于包括终端和服务器的系统,并通过终端和服务器的交互实现,在本实施例中,包括以下步骤:
步骤202,获取训练样本集,训练样本集包括各个训练样本、各个训练样本对应的训练样本权重和各个训练样本对应的目标能量特征,训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息,目标能量特征基于野生型能量特征和突变型能量特征得到,野生型能量特征是基于野生型蛋白质信息和化合物信息进行结合能量特征提取得到,突变型能量特征是基于突变型蛋白质信息和化合物信息进行结合能量特征提取得到的。
其中,蛋白质是指靶向蛋白质,比如,蛋白激酶。化合物是指与靶向蛋白质能够相互作用的药物。比如酪氨酸激酶抑制剂。蛋白质信息用于表征靶向蛋白质具体的信息,可以包括蛋白质结构,蛋白质理化性质等等,野生型蛋白质信息是指从大自然中获得的个体,也就是非人工诱变的蛋白质的信息,突变型蛋白质信息是指发生了突变的蛋白质信息,比如,可以是药物结构发生了突变。化合物信息是指与蛋白质能够相互作用的化合物的信息,可以包括化合物的结构,化合物的理化性质等等。训练样本权重是指训练样本对应的权重,用于表征 对应训练样本的质量,高质量的训练样本可以在训练机器学习模型时提升训练的质量。结合能量特征是指蛋白质和化合物相互作用时的特征,用于表征靶点蛋白质与化合物分子之间的相互作用能量信息,可以包括结构特征,理化性质特征以及能量特征等等,该结合能量特征是通过特征选择后得到的特征。野生型能量特征是指野生型蛋白质与化合物相互作用时提取得到的结合能量特征。突变型能量特征是指突变型蛋白质和化合物相互作用时提取得到的结合能量特征。目标能量特征用于表征突变型能量特征与野生型能量特征之间的差异。
具体地,服务器可以直接从数据库中获取到获取训练样本集,训练样本集包括各个训练样本、各个训练样本对应的训练样本权重和各个训练样本对应的目标能量特征,训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息,目标能量特征基于野生型能量特征和突变型能量特征得到,野生型能量特征是基于野生型蛋白质信息和化合物信息进行结合能量特征提取得到,突变型能量特征是基于突变型蛋白质信息和化合物信息进行结合能量特征提取得到的。服务器还可以从互联网采集到各个训练样本,然后提取各个训练样本对应的目标能量特征并初始化各个训练样本对应的训练样本权重。服务器也可以从提供数据服务的第三方服务器中获取到训练样本集,比如可以从第三方云服务器中获取到训练样本集。
在一个实施例中,服务器可以获取到蛋白质信息、突变型蛋白质信息和化合物信息,基于野生型蛋白质信息和化合物信息进行结合能量特征提取得到野生型能量特征,基于突变型蛋白质信息和化合物信息进行结合能量特征提取得到突变型能量特征,计算野生型能量特征和突变型能量特征之间的差异,得到目标能量特征。同时,初始化对应的训练样本权重,比如,可以是随机初始化、为零初始化、高斯分布初始化等等。
步骤204,基于训练样本权重从训练样本集中确定当前训练样本。
其中,当前训练样本是指当前训练时使用的训练样本。
具体地,服务器根据各个训练样本对应的训练样本权重从训练样本集中进行训练样本的选取,得到当前训练样本。比如,可以将训练样本权重大于预设权重阈值的训练样本作为当前训练样本,预设权重阈值是预先设置好的权重阈值。在一个具体的实施例中,可以将训练样本权重可以设置为0和1,即将各个训练样本对应的训练样本权重初始化为0或者1。当训练样本权重为1时,对应的训练样本为当前训练样本。在一个实施例中,服务器可以根据训练样本权重从训练样本集选取多个训练样本,得到当前训练样本集,该当前训练样本集中包括有多个训练样本。使用当前训练样本集进行基础预测模型的训练。
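为便于理解按训练样本权重选取当前训练样本的过程，下面给出一个示意性的Python代码草图（非专利原文内容，仅为辅助说明的假设性实现，其中weight_threshold等参数名与取值均为示例性假设）：

```python
import numpy as np

def select_current_samples(features, weights, weight_threshold=0.5):
    """按训练样本权重选取当前训练样本：权重高于阈值的样本入选。"""
    weights = np.asarray(weights, dtype=float)
    mask = weights > weight_threshold      # 当权重仅取0/1时，等价于选取权重为1的样本
    return features[mask], mask

# 用法示例（数据为随机构造的占位数据，特征维数仅为示例）
X = np.random.rand(6, 148)                 # 6个训练样本对应的目标能量特征
v = np.array([1, 0, 1, 1, 0, 1])           # 各个训练样本对应的训练样本权重
X_current, selected_mask = select_current_samples(X, v)
```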
步骤206,将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型。
其中,当前目标能量特征是指当前训练样本对应的目标能量特征。预训练预测模型是指预先经过初步训练的预测模型,该预测模型是使用随机森林算法建立的,该预测模型可以用于预测突变前后蛋白质和化合物之间的亲和力变化。基础预测模型是保持训练样本权重不变的情况下使用对应的当前训练样本进行训练得到。
具体地,服务器可以将前训练样本对应的当前目标能量特征输入到预训练预测模型中进行预测,得到预测结果,根据该预测结果计算损失,根据损失反向更新预训练预测模型并返回将前训练样本对应的当前目标能量特征输入到预训练预测模型中进行预测的步骤迭代执行,直到当达到基础训练完成条件时,将达到基础训练完成条件的预测模型作为基础预测模型。其中,基础训练完成条件是指得到基础预测模型的条件,包括训练达到预先设置好的迭代次数上限或者损失达到预先设置好的阈值,或者模型的参数不再发生变化等等。
步骤208,判断模型是否训练完成,当模型训练完成时,执行步骤208a,当模型训练未完成时,执行步骤208b,并返回步骤204执行。
步骤208a,得到目标预测模型,目标预测模型用于预测输入的蛋白质信息与输入的化合物信息对应的相互作用状态信息。
步骤208b,基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行。
其中,模型训练完成是指得到目标预测模型的条件,目标预测模型是指最终训练得到的用于预测输入的蛋白质信息与输入的化合物信息对应的相互作用状态信息的模型。相互作用状态信息用于表征蛋白质突变前后与化合物之间的结合自由能的变化。结合自由能是指存在于配体与受体之间的相互作用。
具体地,服务器当得到基础预测模型时,进一步判断是否达到模型训练完成,该模型训练完成条件可以包括迭代次数达到预先设置好的模型训练迭代次数上限。当未达到模型训练完成条件,此时保持基础预测模型的参数不变,然后使用基础预测模型更新各个训练样本对应的训练样本权重,可以将各个训练样本对应的目标能量特征输入到基础预测模型中,得到各个训练样本对应的损失,根据各个训练样本对应的损失来更新各个训练样本对应的训练样本权重。当训练样本权重更新后,返回基于训练样本权重从训练样本集中确定当前训练样本的步骤继续迭代执行,直到达到模型训练完成条件时,将达到模型训练完成条件时的基础预测模型作为目标预测模型,该目标预测模型用于预测输入的蛋白质信息与输入的化合物信息对应的相互作用状态信息。
上述预测模型训练方法,通过获取训练样本集,训练样本集包括各个训练样本、各个训练样本对应的训练样本权重和各个训练样本对应的目标能量特征,训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息,基于训练样本权重从训练样本集中确定当前训练样本;将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时,得到目标预测模型,目标预测模型用于预测输入的蛋白质信息与输入的化合物信息对应的相互作用状态信息。即通过在迭代过程中不断更新训练样本权重,并且使用训练样本权重从训练样本集中确定当前训练样本,能够保证训练样本的质量,然后使用当前训练样本训练预测模型,从而使训练得到的目标预测模型能够提高预测的准确性和泛化性。
在一个实施例中,如图3所示,在步骤202之前,即在获取训练样本集之前,还包括:
步骤302,获取各个训练样本,训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息。
步骤304,基于野生型蛋白质信息和化合物信息进行结合初始能量特征提取,得到野生型初始能量特征。
其中,结合初始能量特征是指提取得到的未经筛选的特征,可以包括非物理模型特征、基于物理和经验势能特征等等。其中,非物理模型特征包括晶体蛋白-化合物结构特征、配体和残基的理化性质特征、基于经验或描述符打分函数计算得到的能量特征等等。基于物理和经验势能特征是指基于混合的物理和经验势能的建模程序计算得到的能量特征。野生型初始能量特征是指对野生型蛋白质信息和化合物信息相互作用时提取的结合初始能量特征。
具体地,服务器可以从数据库中获取到各个训练样本,该各个训练样本可以是预训练时 使用的样本。该各个训练样本可以和训练样本集中的训练样本可以相同,也可以不同。服务器也可以从互联网采集到各个训练样本,服务器还可以从提供数据服务的服务器中获取到各个训练样本。每个训练样本中都包括野生型蛋白质信息、突变型蛋白质信息和化合物信息。此时,服务器对每个训练样本都进行特征提取,即使用野生型蛋白质信息和化合物信息进行结合初始能量特征提取,得到每个训练样本对应的野生型初始能量特征。
步骤306,基于突变型蛋白质信息和化合物信息进行结合初始能量特征提取,得到突变型初始能量特征,并基于野生型初始能量特征和突变型初始能量特征确定各个训练样本对应的目标初始能量特征。
其中,突变型初始能量特征是指对突变型蛋白质信息和化合物信息相互作用时提取的结合初始能量特征,目标初始能量特征用于表征野生型初始能量特征和突变型初始能量特征之间的差异。
具体地,服务器对突变型蛋白质信息和化合物信息进行结合初始能量特征提取,得到突变型初始能量特征,并计算基于野生型初始能量特征和突变型初始能量特征之间的差异,将该差异作为得到目标初始能量特征。比如,可以计算结构特征之间的差异,将该差异作为目标结构特征。也可以计算理化性质之间的差异,将理化性质之间的差异作为目标结构特征。
步骤308,将各个训练样本对应的目标初始能量特征输入到初始预测模型中进行预测,得到各个训练样本对应的初始相互作用状态信息,初始预测模型是使用随机森林算法建立的。
其中,初始预测模型是指模型参数初始化的预测模型,该模型参数初始化可以是随时初始化,也可以是为零初始化等等。初始预测模型是使用随机森林算法建立的,随机森林指的是利用多棵树对样本进行训练并预测的一种分类器。比如可以使用ExtraTree(极端随机树)算法来建立初始预测模型。初始相互作用状态信息是指使用初始预测模型进行预测得到的相互作用状态信息。
具体地,服务器预先使用随机森林算法建立模型参数初始化的初始预测模型,然后将各个训练样本对应的目标初始能量特征输入到初始预测模型中进行预测,得到输出的各个训练样本对应的初始相互作用状态信息。
步骤310,基于各个训练样本对应的初始相互作用状态信息和各个训练样本对应的相互作用状态标签进行损失计算,得到各个训练样本对应的初始损失信息。
其中,相互作用状态标签是指真实的相互作用状态信息,每个训练样本都有对应的相互作用状态标签。初始损失信息用于表征初始相互作用状态信息与相互作用状态标签之间的误差。
具体地,服务器使用预先设置好的损失函数计算每个训练样本对应的初始相互作用状态信息与相互作用状态标签之间的损失,得到各个训练样本对应的初始损失信息。其中,损失函数可以是均方误差损失函数,平均绝对值误差损失函数等等。
步骤312,基于初始损失信息更新初始预测模型,并返回将各个训练样本对应的目标能量特征输入到初始预测模型中进行预测的步骤执行,直到预训练完成时,得到预训练预测模型和目标初始能量特征对应的特征重要性。
其中,预训练完成是指得到预训练预测模型的条件,是指预训练次数达到预先设置好的迭代次数,或者预训练的损失达到预先设置好的阈值或者预训练的预测模型参数不再发生变化。特征重要性用于表征目标初始能量特征的重要程度,特征重要性越高其对应的特征就越重要,对模型训练时的贡献就越多。
具体地,服务器使用初始损失信息计算梯度,然后使用梯度反向更新初始预测模型,得到更新后的预测模型,判断预训练是否完成,当预训练未完成时,将更新后的预测模型作为初始预测模型,并返回将各个训练样本对应的目标能量特征输入到初始预测模型中进行预测的步骤迭代执行,直到预训练完成时,将最后一次迭代得到的更新后的预测模型作为预训练预测模型,并由于预训练预测模型是使用随机森林算法建立的,当训练完成预训练预测模型时,可以直接得到目标初始能量特征对应的特征重要性。目标初始能量特征中的每个特征都有对应的特征重要性。
步骤316,基于预训练完成时各个训练样本对应的损失信息确定各个训练样本对应的训练样本权重,并基于特征重要性从目标初始能量特征中选取目标能量特征。
具体地，服务器可以使用预训练完成时各个训练样本对应的损失信息确定各个训练样本对应的训练样本权重，比如，可以将各个训练样本对应的损失信息与权重损失阈值进行比较，当损失信息小于权重损失阈值时，对应的训练样本就为质量好的样本，可以设置对应的训练样本权重为1；当损失信息不小于权重损失阈值时，对应的训练样本就为质量差的样本，可以设置对应的训练样本权重为0。通过特征重要性从目标初始能量特征中进行特征选择，得到目标能量特征，目标能量特征即是在预训练预测模型进一步训练时要提取得到的特征。
在上述实施例中,通过使用各个训练样本预先训练得到预训练模型,然后基于预训练完成时各个训练样本对应的损失信息确定各个训练样本对应的训练样本权重,并且基于特征重要性从目标初始能量特征中进行特征选择,得到目标能量特征,从而能够在进一步训练时提高训练效率,并且保证训练的准确性。
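结合上述预训练与特征选择的描述，下面给出一个示意性的Python代码草图（非专利原文内容，仅为理解参考；其中ExtraTreesRegressor、loss_threshold、top_k等均为示例性选择，实际实现可采用其他随机森林算法与超参数）：

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def pretrain_select(X, y, loss_threshold=1.0, top_k=64):
    """预训练初始预测模型，并据此初始化训练样本权重、按特征重要性选取目标能量特征。"""
    model = ExtraTreesRegressor(n_estimators=200, random_state=0)
    model.fit(X, y)                                   # 使用随机森林（极端随机树）进行预训练
    loss = np.abs(model.predict(X) - y)               # 各个训练样本对应的损失信息（此处以绝对误差为例）
    weights = (loss < loss_threshold).astype(int)     # 损失小于阈值的样本视为高质量样本，权重置1
    importance = model.feature_importances_           # 目标初始能量特征对应的特征重要性
    selected_features = np.argsort(importance)[::-1][:top_k]  # 按重要性从高到低选取目标能量特征
    return model, weights, selected_features
```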
在一个实施例中,如图4所示,步骤308,将各个训练样本对应的目标初始能量特征输入到初始预测模型中进行预测,得到各个训练样本对应的初始相互作用状态信息,初始预测模型是使用随机森林算法建立的,包括:
步骤402,将各个训练样本对应的目标初始能量特征输入到初始预测模型中;
步骤404,初始预测模型将各个训练样本对应的目标初始能量特征作为当前待划分集,并计算目标初始能量特征对应的初始特征重要性,基于初始特征重要性从目标初始能量特征中确定初始划分特征,基于初始划分特征将各个训练样本对应的目标初始能量特征进行划分,得到各个划分结果,划分结果中包括各个划分样本对应的目标初始能量特征,将各个划分结果作为当前待划分集,并返回计算目标初始能量特征对应的初始特征重要性的步骤迭代,直到划分完成时,得到各个训练样本对应的初始相互作用状态信息。
其中,初始特征重要性是指目标初始能量特征对应的特征重要性,初始划分特征是指进行决策树划分的特征。划分结果是指对目标初始能量特征进行划分后的得到的,划分样本是指划分结果中的目标初始能量特征对应的训练样本。
具体地,服务器将各个训练样本对应的目标初始能量特征输入到初始预测模型中,初始预测模型对输入特征进行评分,得到目标初始能量特征对应的初始特征重要性。其中,可以使用信息增益、信息增益率、基尼系数、均方差等计算初始特征重要性。基于初始特征重要性从目标初始能量特征中确定初始划分特征,基于初始划分特征将各个训练样本对应的目标初始能量特征进行划分,即将超过该初始划分特征的目标初始能量特征作为一部分,将未超过该初始划分特征的目标初始能量特征作为另一部分,得到划分结果,划分结果中包括各个划分样本对应的目标初始能量特征,将各个划分结果作为当前待划分集,并返回计算目标初始能量特征对应的初始特征重要性的步骤迭代,直到划分完成时,得到各个训练样本对应的 初始相互作用状态信息,其中,划分完成是指每个树节点都无法进行划分,即叶子节点对应只有唯一的目标初始能量特征。初始相互作用状态信息是指初始预测模型预测得到的相互作用状态信息。
在上述实施例中,通过将各个训练样本对应的目标初始能量特征输入到初始预测模型中,初始预测模型通过计算目标初始能量特征对应的初始特征重要性,基于初始特征重要性从目标初始能量特征中确定初始划分特征,基于初始划分特征将各个训练样本对应的目标初始能量特征进行划分,得到各个划分结果,划分结果中包括各个划分样本对应的目标初始能量特征,将各个划分结果作为当前待划分集,并返回计算目标初始能量特征对应的初始特征重要性的步骤迭代,直到划分完成时,得到各个训练样本对应的初始相互作用状态信息,提高了得到初始相互作用状态信息的准确性。
在一个实施例中,步骤202,即获取训练样本集,训练样本集包括各个训练样本对应的训练样本权重,包括步骤:
获取各个训练样本对应的置信度,基于置信度确定各个训练样本对应的训练样本权重。
其中,置信度用于表征对应训练样本质量好坏的程度。置信度越高说明训练样本对应的质量就越高,使用置信度高的训练样本训练得到的模型性能越好。
具体地,服务器在获取各个训练样本时,也可以同时获取到各个训练样本对应的置信度。然后可以直接将置信度之间作为各个训练样本对应的训练样本权重。其中,该置信度可以是人为设置的,也可以是预先对各个训练样本进行置信度评估得到的。在一个实施例中,也可以将各个训练样本对应的置信度与预先设置好的置信度阈值进行比较,当超过置信度阈值时,设置对应的训练样本权重为1,该训练样本为当前训练样本。当未超过置信度阈值时,设置对应的训练样本权重为0。
在上述实施例中,通过获取到置信度,根据置信度确定各个训练样本对应的训练样本权重,提高得到训练样本权重的效率。
在一个实施例中,如图5所示,步骤202,获取训练样本集,训练样本集包括各个训练样本对应的目标能量特征,包括:
步骤502,基于野生型蛋白质信息和化合物信息进行结合能量特征提取,得到野生型能量特征。
步骤504,基于突变型蛋白质信息和化合物信息进行结合能量特征提取,得到突变型能量特征。
其中,野生型能量特征包括但不限于野生型蛋白质特征,化合物特征以及野生型蛋白质信息和化合物信息相互作用时的能量特征。野生型蛋白质特征用于表征野生型蛋白质信息对应的特征,包括但不限于野生型蛋白质结构特征、野生型蛋白质理化性质特征。化合物特征包括但不限于化合物结构特征,化合物理化性质特征。突变型能量特征包括但不限于突变型蛋白质特征,化合物特征以及突变型蛋白质信息和化合物信息相互作用时的能量特征。突变型蛋白质特征用于表征突变型蛋白质信息对应的特征,包括但不限于突变型蛋白质结构特征、突变型蛋白质理化性质特征。
具体地,服务器使用野生型蛋白质信息和化合物信息进行特征提取,提取到野生型蛋白质特征和化合物特征,同时对野生型蛋白质和化合物相互作用时的能量特征进行提取,将野生型蛋白质特征、化合物特征以及能量特征作为野生型能量特征。服务器使用突变型蛋白质信息进行特征提取,得到突变型蛋白质特征,然后对突变型蛋白质和化合物相互作用时的能 量特征进行提取,将提取得到的突变型蛋白质特征化合物特征以及能量特征作为突变型能量特征。
步骤506,计算野生型能量特征和突变型能量特征之间的差异,得到目标能量特征。
具体地,服务器计算野生型能量特征和突变型能量特征之间的差异,比如,计算野生型蛋白质特征与突变型蛋白质特征之间的差异,计算野生型蛋白质和化合物相互作用时的能量特征与突变型蛋白质和化合物相互作用时的能量特征之间的差异,得到目标能量特征。在一个具体的实施例中,可以计算野生型能量特征和突变型能量特征的特征差值,得到目标能量特征。
在上述实施例中,通过提取到野生型能量特征和突变型能量特征,然后计算野生型能量特征和突变型能量特征之间的差异,得到目标能量特征,能够提高得到目标能量特征的准确性。
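下面以一个很小的Python片段示意“计算野生型能量特征与突变型能量特征之间的差异”这一步（非专利原文内容；差值方向仅为示例性约定，实际实现也可取相反方向或其他差异度量）：

```python
import numpy as np

def target_energy_feature(wild_feature, mutant_feature):
    """目标能量特征：突变型能量特征与野生型能量特征的逐维差值（方向约定仅为示例）。"""
    return np.asarray(mutant_feature, dtype=float) - np.asarray(wild_feature, dtype=float)
```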
在一个实施例中,野生型能量特征包括第一野生型能量特征和第二野生型能量特征;
如图6所示,步骤502,基于野生型蛋白质信息和化合物信息进行结合能量特征提取,得到野生型能量特征,包括:
步骤602,基于野生型蛋白质信息和化合物信息使用非物理型打分函数进行结合能量特征提取,得到第一野生型能量特征。
其中,非物理型打分函数是指基于经验或描述符打分函数,该打分函数会基于一些先验假设或对实验数据进行拟合,从而得到能量特征,该得到的能量特征不具有明显可解释的物理意义。第一野生型能量特征是指提取得到的第一部分能量特征。
具体地,服务器可以使用预先设置好的非物理型打分函数进行结合能量特征提取,将野生型蛋白质信息和化合物信息通过非物理型打分函数进行计算,得到计算结果,将计算结果作为第一野生型能量特征。其中,可以使用打分函数(用于评价理论获得的受体–配体结合模式合理性的函数)提取能量特征。
步骤604，基于野生型蛋白质信息和化合物信息使用物理型函数进行结合能量特征提取，得到第二野生型能量特征。
其中,物理型函数是指基于混合的物理和经验势能的能量函数,是有明确物理意义的,能量函数家族由基于实验数据拟合的力场函数,基于第一性原理的量化计算函数,基于连续介质的溶剂模型等组成。
具体地,服务器使用预先设置好的物理型函数对野生型蛋白质信息和化合物信进行结合能量特征提取,得到第二野生型能量特征。例如,可以使用基于混合的物理和经验势能的建模程序Rosetta(基于蒙特卡罗模拟退火为算法核心的高分子建模软件库)中的能量函数计算能量特征。
步骤606，基于第一野生型能量特征和第二野生型能量特征进行融合，得到野生型能量特征。
具体地,服务器计算第一野生型能量特征和第二野生型能量特征之间的特征差值,得到野生型能量特征。
在上述实施例中,通过提取第一野生型能量特征和第二野生型能量特征,基于第一野生型能量特征和第二野生型能量特征进行融合,得到野生型能量特征,由于第一野生型能量特征和第二野生型能量特征能够更好地表征野生型靶点蛋白质与化合物分子之间的相互作用能量信息,从而使得到的野生型能量特征更加的准确。
在一个实施例中,突变型能量特征包括第一突变型能量特征和第二突变型能量特征;
如图7所述,步骤504,基于突变型蛋白质信息和化合物信息进行结合能量特征提取,得到突变型能量特征,包括:
步骤702,基于突变型蛋白质信息和化合物信息使用非物理型函数进行结合能量特征提取,得到第一突变型能量特征。
步骤704,基于突变型蛋白质信息和化合物信息使用物理型函数进行结合能量特征提取,得到第二突变型能量特征。
具体地,服务器使用预先设置好的非物理型函数对突变型蛋白质信息和化合物信息进行结合能量特征提取,得到第一突变型能量特征,然后使用预先设置好的物理型函数对突变型蛋白质信息和化合物信息进行结合能量特征提取,得到第二突变型能量特征。
步骤706,基于第一突变型能量特征和第二突变型能量特征进行融合,得到突变型能量特征。
具体地,服务器计算第一突变型能量特征和第二突变型能量特征之间的特征差值,得到突变型能量特征。
在上述实施例中,通过提取第一突变型能量特征和第二突变型能量特征,基于第一突变型能量特征和第二突变型能量特征进行融合,得到突变型能量特征,由于第一突变型能量特征和第二突变型能量特征能够更好地表征突变型靶点蛋白质与化合物分子之间的相互作用能量信息,从而使得到的突变型能量特征更加的准确。
在一个实施例中,如图8所示,步骤204,基于训练样本权重从训练样本集中确定当前训练样本,包括:
步骤802,获取蛋白质家族信息,基于蛋白质家族信息将训练样本集进行划分,得到各个训练样本组。
其中,体内氨基酸序列相似并且结构与功能十分相近的蛋白质构成“蛋白质家族”(protein family),同一蛋白质家族的成员称为“同源蛋白质”。蛋白质家族信息是指蛋白质家族的信息,训练样本组是将同一蛋白质家族对应的各个训练样本划分到一起得到的。
具体地，服务器可以直接从数据库中获取到蛋白质家族信息，该蛋白质家族信息也可以是从互联网中获取到的，也可以是从提供数据服务的第三方服务器中获取到的。在一个实施例中，服务器也可以将训练样本中蛋白质信息的结构或者序列相似的蛋白质家族划分为同一个训练样本组，得到各个训练样本组。
步骤804,基于训练样本权重从各个训练样本组中选取当前训练样本,得到当前训练样本集。
具体地,服务器使用训练样本权重从各个训练样本组中选取当前训练样本,即按照训练样本组中训练样本权重依次选取当前训练样本,并且从每个训练样本组都进行选取,得到当前训练样本集。
步骤206,将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型,包括:
步骤806,将当前训练样本集中各个当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到目标基础预测模型。
具体地,服务器将当前训练样本集中各个当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到目标基础预测模型。
在上述实施例中,通过将训练样本集按照蛋白质家族信息进行划分,得到各个训练样本组。然后基于训练样本权重从各个训练样本组中选取当前训练样本,得到当前训练样本集,从而使用当前训练样本集对预训练预测模型中进行基础训练,得到基础预测模型。即通过从各个训练样本组中选取当前训练样本,从而使选取的训练样本分布于空间各处而非集中在一个局部区域,从而保证在训练模型时,能够学习到训练样本中蕴含的全局信息,从而保证模型在训练过程中学习知识的全面性,进一步提高模型训练过程中的收敛速度,并提升训练得到模型的泛化能力。
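下面给出一个按蛋白质家族分组并在各组内按权重选取当前训练样本的示意性Python草图（非专利原文内容，per_group为示例性参数，实际的选取样本数与样本分布由当前学习参数确定）：

```python
import numpy as np

def select_from_groups(weights, groups, per_group=2):
    """按训练样本组（如按蛋白质家族划分）在每组内按权重从高到低选取当前训练样本。"""
    weights = np.asarray(weights, dtype=float)
    groups = np.asarray(groups)
    selected = []
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]                          # 当前训练样本组中的样本下标
        top = idx[np.argsort(weights[idx])[::-1][:per_group]]   # 组内按权重降序选取若干样本
        selected.extend(top.tolist())
    return np.sort(np.array(selected))
```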
在一个具体的实施例中,预训练预测模型的基本形式如下公式(1)所示。
$$\min_{\mathbf{w},\,\mathbf{v}}\;\mathbb{E}(\mathbf{w},\mathbf{v};\lambda,\gamma)=\sum_{i=1}^{n} v_i\,L\big(y_i,\,g(x_i,\mathbf{w})\big)-\lambda\|\mathbf{v}\|_{1}-\gamma\|\mathbf{v}\|_{2,1}\qquad(1)$$
其中，$n$表示训练样本的总数，$X$表示训练样本集，$X=(x_1,\ldots,x_n)\in R^{n\times m}$，$R$表示实数集，$m$表示能量特征的数目。$x_i$表示第$i$个训练样本，$y_i$表示第$i$个训练样本对应的相互作用状态标签。$g$表示预训练预测模型，$\mathbf{w}$表示模型参数，$L$表示损失函数，$\mathbf{v}$表示训练样本权重。$\mathbf{v}=(v^{(1)},\ldots,v^{(b)})$，$b$表示训练样本组的组数，即将训练样本集划分为$b$组：$x^{(1)},\ldots,x^{(b)}$，其中，$x^{(j)}=\big(x_1^{(j)},\ldots,x_{n_j}^{(j)}\big)$表示第$j$个训练样本组的训练样本，$v^{(j)}=\big(v_1^{(j)},\ldots,v_{n_j}^{(j)}\big)$，$n_j$表示第$j$个训练样本组中训练样本数量，且$v_1^{(j)}$表示第$j$个训练样本组中第1个训练样本对应的训练样本权重。$v_i$表示第$i$个训练样本权重。$\lambda$表示训练样本难易度的参数，即表示训练样本在选取时是从容易选取（置信度高）的样本到难选取（置信度低）的样本依次进行选取。$\gamma$表示样本多样性的参数，即表示从多个训练样本组中选取样本。$\|\cdot\|_{1}$表示$L_1$范数，$\|\cdot\|_{2,1}$表示$L_{2,1}$范数。其中，
$$\|\mathbf{v}\|_{2,1}=\sum_{j=1}^{b}\big\|v^{(j)}\big\|_{2}$$
$b$表示训练样本组的组数，$v^{(j)}$表示第$j$个训练样本组对应的训练样本权重。即负$L_1$范数倾向于选取置信度高的样本，即训练时结果误差较小的样本；负$L_{2,1}$范数有利于在多个训练样本组中选取训练样本，将多样性信息嵌入预测模型中。
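为说明公式(1)的含义，下面给出一个按该式计算目标函数值的示意性Python草图（非专利原文内容，仅用于在交替优化（固定v训练模型、固定模型更新v）的过程中监控目标值，函数名与参数均为示例性假设）：

```python
import numpy as np

def spld_objective(loss, v, groups, lam, gamma):
    """按公式(1)计算目标函数值：加权损失 - λ·||v||_1 - γ·||v||_{2,1}。"""
    loss = np.asarray(loss, dtype=float)
    v = np.asarray(v, dtype=float)
    groups = np.asarray(groups)
    l21 = sum(np.linalg.norm(v[groups == g]) for g in np.unique(groups))  # L2,1范数
    return float(np.dot(v, loss) - lam * np.abs(v).sum() - gamma * l21)
```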
在一个实施例中,基于训练样本权重从各个训练样本组中选取当前训练样本,得到当前训练样本集,包括:
获取当前学习参数，基于当前学习参数确定选取样本数和样本分布，基于选取样本数和样本分布按照训练样本权重从各个训练样本组中选取当前训练样本，得到目标当前训练样本集。
其中,当前学习参数是指当前训练时使用的学习参数,该当前学习参数用于控制当前训练样本的选取。选取样本数是指当前要选取的训练样本数量。样本分布是指选取的当前训练样本在各个训练样本组中的分布。目标当前训练样本集是指使用当前学习参数选取得到的当前训练样本的集合。
具体地,服务器获取到当前训练样本参数,该当前训练样本参数的初始值可以是预先设置好的。服务器使用当前学习参数来计算当前在训练时所要选取的样本数和样本分布。然后基于选取样本数和样本分布按照训练样本权重从各个训练样本组中选取当前训练样本,得到目标当前训练样本集。
在上述实施例中,通过使用当前学习参数来进一步控制训练样本的选取,从而得到目标当前训练样本集,能够使选取的训练样本更加准确,从而进一步使训练得到的预测模型更加的准确,并且提高预测模型的泛化能力。
在一个实施例中,如图9所示,步骤206,即将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型,包括:
步骤902,将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行预测,得到当前相互作用状态信息。
其中,当前相互作用状态信息用于表征预测得到的当前训练样本中突变前后的蛋白质与化合物相互作用的变化。
具体地,服务器直接将当前训练样本对应的当前目标能量特征作为预训练预测模型的输入,预训练预测模型根据输入到当前目标能量特征进行预测,并输出预测结果,即当前相互作用状态信息。
步骤904,计算当前相互作用状态信息与当前训练样本对应的相互作用状态标签之间的误差,得到当前损失信息。
其中,当前损失信息是指当前训练样本对应的预测结果和真实结果之间的误差。
具体地,服务器获取到当前训练样本对应的相互作用状态标签,该相互作用状态标签可以是预先设置好的。相互作用状态标签可以是通过实验测得的突变前后蛋白质与化合物相互作用的变化。然后服务器使用预设损失函数计算当前相互作用状态信息与当前训练样本对应的相互作用状态标签之间的误差,得到当前损失信息。
步骤906,基于当前损失信息更新预训练预测模型,并返回将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行预测,得到当前相互作用状态信息的步骤执行,直到达到基础训练完成条件时,得到基础预测模型。
具体地,服务器使用当前损失信息通过梯度下降算法来反向更新预训练预测模型中的参数,并返回将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行预测,得到当前相互作用状态信息的步骤迭代执行,直到达到预先设置好的基础训练迭代次数或者模型参数不再发生变化时,将最后一次迭代的预训练预测模作为基础预测模型。
在一个具体的实施例中,预训练预测模型对应的优化函数如下公式(2)所示,该优化函数是一个回归优化函数。
$$\min_{\mathbf{w}}\;\sum_{i=1}^{n} v_i\,L\big(y_i,\,g(x_i,\mathbf{w})\big)\qquad(2)$$
其中，$v_i$表示选取训练样本权重超过权重阈值的训练样本进行训练。比如，当训练样本权重仅包括0和1时，可以选取训练样本权重为1的训练样本进行训练。
在上述实施例中，通过保持训练样本权重不变，然后选取当前训练样本对预训练预测模型进行训练，得到基础预测模型，从而使训练得到的基础预测模型更加准确。
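对应公式(2)，下面给出一个固定训练样本权重、只用被选中样本训练基础预测模型的示意性Python草图（非专利原文内容；这里沿用随机森林回归器作为示例，实际可替换为其他满足公式(2)的模型与优化方式）：

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def train_base_model(X, y, v):
    """固定训练样本权重v不参与更新，仅用权重为1的当前训练样本训练基础预测模型（对应公式(2)）。"""
    mask = np.asarray(v) == 1
    model = ExtraTreesRegressor(n_estimators=200, random_state=0)
    model.fit(X[mask], y[mask])
    return model
```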
在一个实施例中,如图10所示,步骤208b,基于基础预测模型更新各个训练样本对应的训练样本权重,包括:
步骤1002,将各个训练样本对应的目标能量特征输入到基础预测模型中,得到各个训练样本对应的基础相互作用状态信息。
其中,各个训练样本是指训练样本集中的每个训练样本。基础相互作用状态信息是指基础预测模型预测得到的每个训练样本对应的相互作用状态信息。该相互作用状态信息可以是野生型蛋白质和化合物的结合自由能与突变型蛋白质和化合物的结合自由能之间的相对差值。
具体的,服务器训练得到基础预测模型时,保持基础预测模型中的参数不变,更新训练样本集中每个训练样本对应的训练样本权重。即服务器将各个训练样本对应的目标能量特征输入到基础预测模型中,得到输出的各个训练样本对应的基础相互作用状态信息。
步骤1004,计算各个训练样本对应的基础相互作用状态信息与各个训练样本对应的相互作用状态标签之间的误差,得到基础损失信息。
其中,基础损失信息是指基础预测模型预测结果和真实结果之间的误差。
具体地,服务器使用预设损失函数来计算每个训练样本的误差,即计算基础相互作用状态信息与相互作用状态标签之间的误差,得到每个训练样本对应的基础损失信息。
步骤1006,基于基础损失信息对训练样本权重进行更新,得到各个训练样本对应的更新样本权重。
具体,服务器使用每个训练样本对应的基础损失信息对每个训练样本权重进行更新,服务器可以直接将每个训练样本对应的基础损失信息作为每个训练样本对应的更新样本权重。
在一个实施例中,步骤1006,即基于基础损失信息对训练样本权重进行更新,得到各个训练样本对应的更新样本权重,包括步骤:
获取当前学习参数,基于当前学习参数计算更新阈值;将更新阈值与各个训练样本对应的基础损失信息进行比较,得到各个训练样本对应的比较结果;根据各个训练样本对应的比较结果确定各个训练样本对应的更新样本权重。
其中,更新阈值是指更新训练样本权重的阈值。
具体地,服务器获取到当前学习参数,使用当前学习参数确定更新阈值。将更新阈值与各个训练样本对应的基础损失信息进行比较,当基础损失信息超过更新阈值时,说明该训练样本对应的预测误差较大,此时,将对应的训练样本权重更新为第一训练样本权重。当基础损失信息未超过更新阈值时,说明误差较小,此时,将对应的训练样本权重更新为第二训练样本权重。然后,在选取当前训练样本时,选取第二训练样本权重对应的训练样本为当前训练样本。
在一个实施例中,当前学习参数包括多样性学习参数和难易度学习参数;基于当前学习参数计算更新阈值,包括步骤:
获取各个训练样本组,从各个训练样本组中确定当前训练样本组,并计算当前训练样本组对应的样本秩。基于样本秩计算加权值,使用加权值对多样性学习参数进行加权,得到目标加权值。计算目标加权值与难易度学习参数的和,得到更新阈值。
其中,难易度学习参数是指衡量容易度的学习参数,难易度学习参数用于确定训练时选取的训练样本的置信程度。多样性学习参数是衡量多样性的学习参数。多样性学习参数用于确定训练时选取得到的训练样本在训练样本组中的分布。样本秩是指前训练样本组中训练样本的秩,一个向量组的秩是其最大无关组所含的向量个数。当前训练样本组是指当前需要更新训练样本权重的训练样本组。
具体地,服务器获取各个训练样本组,从各个训练样本组中确定当前训练样本组,并计算当前训练样本组对应的样本秩。基于样本秩计算加权值,使用加权值对多样性学习参数进行加权,得到目标加权值。计算目标加权值与难易度学习参数的和,得到当前训练样本组对应的更新阈值。在一个具体的实施例中,可以按照基础损失信息对各个训练样本组中的训练样本按照升序排序。得到各个排序后的训练样本组,对排序后的训练样本组中确定当前训练样本组,并计算得到当前训练样本组对应的更新阈值。
在一个具体的实施例中,可以使用如下所示的公式(3)来更新训练样本对应的训练样本权重。
$$v_i^{(j)}=\begin{cases}1, & L\big(y_i^{(j)},\,\hat{y}_i^{(j)}\big)<\lambda+\gamma\dfrac{1}{\sqrt{a}+\sqrt{a-1}}\\[4pt]0, & \text{其他}\end{cases}\qquad(3)$$
其中，$a$表示第$j$个训练样本组中的秩，$\hat{y}_i^{(j)}$表示第$j$个训练样本组第$i$个训练样本对应的预测出的相互作用状态信息，$y_i^{(j)}$表示第$j$个训练样本组第$i$个训练样本对应的真实的相互作用状态标签，$\lambda+\gamma\frac{1}{\sqrt{a}+\sqrt{a-1}}$表示计算得到的更新阈值。当第$j$个训练样本组第$i$个训练样本对应的误差小于更新阈值时，将对应的训练样本权重更新为1；当第$j$个训练样本组第$i$个训练样本对应的误差大于等于更新阈值时，将对应的训练样本权重更新为0。
在上述实施例中，通过不断地更新样本权重，重新选取当前训练样本进行训练，能够使得在训练过程中避免使用误差较大的训练样本进行训练，从而避免误差较大的训练样本对训练过程的负面影响，进而提高训练得到的目标预测模型的准确性。
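对应公式(3)的权重更新规则，下面给出一个示意性的Python草图（非专利原文内容；其中lam、gamma分别对应难易度学习参数λ和多样性学习参数γ，组内秩a按损失升序从1开始计，函数名为示例性假设）：

```python
import numpy as np

def update_sample_weights(loss, groups, lam, gamma):
    """按公式(3)更新各个训练样本对应的训练样本权重。"""
    loss = np.asarray(loss, dtype=float)
    groups = np.asarray(groups)
    v = np.zeros(len(loss), dtype=int)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        order = idx[np.argsort(loss[idx])]                           # 组内按基础损失信息升序排序
        for a, i in enumerate(order, start=1):                       # a为组内秩，从1开始
            threshold = lam + gamma / (np.sqrt(a) + np.sqrt(a - 1))  # 更新阈值
            v[i] = 1 if loss[i] < threshold else 0                   # 误差小于阈值则权重置1，否则置0
    return v
```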
在一个实施例中,在基于基础预测模型更新各个训练样本对应的训练样本权重之后,还包括步骤:
获取当前学习参数,按照预设增加量对当前学习参数进行更新,得到更新学习参数,将更新学习参数作为当前学习参数。
具体地,服务器可以预先设置当前学习参数的更新条件,比如,预先设置好当前学习参数在每次权重更新后的增加量。然后按照预设增加量对当前学习参数进行更新,得到更新学习参数,将更新学习参数作为当前学习参数。在一个实施例中,服务器也可以获取到预先设置好的要增加的样本个数,通过预先设置好的要增加的样本个数来更新当前学习参数,得到 更新学习参数,将更新学习参数作为当前学习参数。并且在当样本个数增加后,训练得到的损失信息从小变大时,训练完成,并将未增加样本个数时训练得到的预测模型作为最终得到的目标预测模型。
在一个实施例中,如图11所示,提供了一种数据预测方法,以该方法应用于图1中的服务器为例进行说明,可以理解的是,该方法也可以应用在终端中,还可以应用于包括终端和服务器的系统,并通过终端和服务器的交互实现,在本实施例中,包括以下步骤:
步骤1102,获取待预测数据,待预测数据包括待预测野生型蛋白质信息、待预测突变型蛋白质信息和待预测化合物信息。
其中,待预测野生型蛋白质信息是指需要预测相互作用状态信息的野生型蛋白质信息。待预测突变型蛋白质信息是指需要预测相互作用状态信息的突变型蛋白质信息。待预测化合物信息是指需要预测相互作用状态信息的化合物信息。
具体地,服务器可以从互联网采集到待预测数据,也可以从终端中获取到待预测数据。服务器还可以直接从数据库中获取到待预测数据。在一个实施例中,服务器还可以获取到第三方服务器发送的待预测数据。第三方服务器可以是提供业务服务的服务器。待预测数据包括待预测野生型蛋白质信息、待预测突变型蛋白质信息和待预测化合物信息。在一个实施例中,服务器可以从终端中获取到待预测突变型蛋白质信息和待预测化合物信息,然后可以从数据库中获取到待预测突变型蛋白质信息对应的待预测野生型蛋白质信息,从而得到待预测数据。
步骤1104,基于待预测野生型蛋白质信息和待预测化合物信息进行结合能量特征提取,得到待预测野生型能量特征,基于待预测突变型蛋白质信息和待预测化合物信息进行结合能量特征提取,得到待预测突变型能量特征。
其中,待预测野生型能量特征是指提取得到的待预测野生型蛋白质信息和待预测化合物信息相互作用时的能量特征。待预测突变型能量特征是指提取得到的待预测突变型蛋白质信息和待预测化合物信息相互作用时的能量特征。
具体地,服务器基于待预测野生型蛋白质信息和待预测化合物信息进行结合能量特征提取,得到待预测野生型能量特征,比如,可以根据待预测野生型蛋白质信息中的蛋白质结构和待预测化合物信息中的化合物结构来提取结构特征,然后根据待预测野生型蛋白质信息中的理化性质和待预测化合物信息中的理化性质来提取理化性质特征。理化性质是衡量化学物质特性的指标,包括物理性质和化学性质,物理性质包括熔沸点,常温下的状态,颜色,化学性质包括酸碱度等等。同时使用打分函数计算待预测野生型蛋白质信息和待预测化合物信息相互作用的能量特征以及使用基于混合的物理和经验势能的能量函数计算得到能量特征,从而得到了待预测野生型能量特征。然后基于待预测突变型蛋白质信息和待预测化合物信息进行结合能量特征提取,得到待预测突变型能量特征,比如,可以根据待预测突变型蛋白质信息中的蛋白质结构和待预测化合物信息中的化合物结构来提取结构特征,然后根据待预测突变型蛋白质信息中的理化性质和待预测化合物信息中的理化性质来提取理化性质特征,同时使用打分函数提取能量特征并使用基于物理和经验势能的能量函数提取能量特征,从而得到待预测突变型能量特征。
步骤1106,基于待预测野生型能量特征和待预测突变型能量特征确定待预测目标能量特征。
具体地,服务器计算待预测野生型能量特征中每个特征值与待预测突变型能量特征对应 的特征值之间的差异,得到待预测目标能量特征。
步骤1108,将待预测目标能量特征输入目标预测模型中进行预测,得到相互作用状态信息,目标预测模型是通过获取训练样本集,基于训练样本权重从训练样本集中确定当前训练样本;将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时得到的。
其中,目标预测模型可以是上述预测模型训练方法中任意一实施例中训练得到的模型。即目标预测模型可以是通过获取训练样本集,基于训练样本权重从训练样本集中确定当前训练样本;将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时得到的。
具体地,服务器将待预测目标能量特征输入目标预测模型中进行预测,得到输出的相互作用状态信息。在一个具体的实施例中,该相互作用状态信息是指待预测突变型蛋白质和待预测野生型蛋白质分别与待预测化合物的结合自由能的相对差值。然后将结合自由能的相对差值与耐药性阈值进行比较,当结合自由能的相对差值超过耐药性阈值,说明待预测突变型蛋白质已产生了耐药性,无法继续使用。当结合自由能的相对差值未超过耐药性阈值,说明待预测突变型蛋白质未产生耐药性,仍然能够正常使用。
上述数据预测方法、装置、计算机设备和存储介质,通过获取待预测数据,然后确定待预测目标能量特征,将待预测目标能量特征输入目标预测模型中进行预测,得到相互作用状态信息,由于目标预测模型是通过获取训练样本集,基于训练样本权重从训练样本集中确定当前训练样本;将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时得到的,即通过目标预测模型来预测得到相互作用状态信息,由于训练得到的目标预测模型能够提高预测的准确性,进而使得到的相互作用状态信息提高了准确性。
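下面给出一个预测阶段的示意性Python草图（非专利原文内容），示意如何用训练好的目标预测模型预测结合自由能相对差值并与耐药性阈值比较；其中1.36 kcal/mol仅沿用下文对比实验中使用的阈值，实际阈值可按需设定：

```python
import numpy as np

def predict_resistance(model, wild_feature, mutant_feature, resistance_threshold=1.36):
    """预测突变前后结合自由能的相对差值(△△G)，并判断是否超过耐药性阈值。"""
    delta = np.asarray(mutant_feature, dtype=float) - np.asarray(wild_feature, dtype=float)
    ddg = float(model.predict(delta.reshape(1, -1))[0])   # 将待预测目标能量特征输入目标预测模型
    return ddg, bool(ddg > resistance_threshold)          # True表示预测该突变会引起耐药性
```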
本申请还提供一种应用场景,该应用场景应用上述的数据预测方法。如图12所示,为数据预测方法应用场景的流程示意图,具体来说:在预测靶向蛋白质突变引起耐药性的应用场景中,服务器获取到终端发送的待预测数据,该待预测数据包括两种不同类型的靶点蛋白质信息,包括野生型蛋白质信息和突变型蛋白质信息,以及化合物信息。然后使用野生型蛋白质信息和突变型蛋白质信息,以及化合物信息提取预测蛋白质突变后的亲和力具有参考价值的特征,包括非物理模型的特征和基于物理和经验势能的特征。非物理模型的特征如晶体蛋白-配体结构,配体和残基的理化性质,以及一些基于经验或描述符打分函数计算得到的能量特征等等,然后基于物理和经验势能的特征是使用基于混合的物理和经验势能的建模程序Rosetta计算得到的能量特征。然后进行特征选择,即通过在训练时的经过特征选择得到的目标能量特征从提取得到的特征中选取对应的特征,选取得到待预测目标能量特征,将待预测目标能量特征输入到目标预测模型中进行预测,得到预测出的结合自由能的差值。将该结合自由能的差值与耐药性阈值进行比较,当结合自由能的差值超过耐药性阈值时,说明该蛋白质突变是会引起耐药性的蛋白质突变。当结合自由能的差值未超过耐药性阈值时,说明该 蛋白质突变是并不会引起耐药性的蛋白质突变。此时将预测结果发送到终端进行显示、
在一个具体地实施例中,如图13所示,提供一种预测模型的训练方法,具体包括以下步骤:
步骤1302,获取训练样本集,训练样本集包括各个训练样本、各个训练样本对应的训练样本权重和各个训练样本对应的目标能量特征,训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息,目标能量特征基于野生型能量特征和突变型能量特征得到,野生型能量特征是基于野生型蛋白质信息和化合物信息进行结合能量特征提取得到,突变型能量特征是基于突变型蛋白质信息和化合物信息进行结合能量特征提取得到的。
步骤1304，获取蛋白质家族信息，基于蛋白质家族信息将训练样本集进行划分，得到各个训练样本组，获取当前学习参数，基于当前学习参数确定选取样本数和样本分布。基于选取样本数和样本分布，按照训练样本权重从各个训练样本组中选取当前训练样本，得到目标当前训练样本集。
步骤1306,将目标当前训练样本集中各个训练样本对应的目标能量特征输入到基础预测模型中,得到各个训练样本对应的基础相互作用状态信息,计算各个训练样本对应的基础相互作用状态信息与各个训练样本对应的相互作用状态标签之间的误差,得到基础损失信息。
步骤1302,计算各个训练样本组对应的样本秩。基于样本秩计算加权值,使用加权值对多样性学习参数进行加权,得到目标加权值,计算目标加权值与难易度学习参数的和,得到各个训练样本组的更新阈值。
步骤1308,将更新阈值与各个训练样本组中训练样本对应的基础损失信息进行比较,得到训练样本对应的比较结果,根据训练样本对应的比较结果确定各个训练样本组中训练样本对应的更新样本权重。
步骤1310,按照预设增加量对当前学习参数进行更新,得到更新学习参数,将更新学习参数作为当前学习参数,并返回基于当前学习参数确定选取样本数和样本分布的步骤执行,直到模型训练完成时,得到目标预测模型。
本申请还另外提供一种应用场景,该应用场景应用上述的预测模型训练方法。具体地:
如图14所示,为预测模型训练方法的流程示意图,具体来说:
获取到输入数据和训练样本组信息,该输入数据包括各个训练样本和对应的训练样本权重即为0或者为1,该训练样本组信息表明输入数据中的训练样本属于的训练样本组。此时初始化预测模型的模型参数和学习参数。
然后固定训练样本对应的训练样本权重不变,训练模型的参数,即根据初始化的学习参数选取训练样本权重为1的训练样本,得到当前训练样本,并提取当前训练样本对应的当前目标能量特征,将当前目标能量特征输入到初始化的预测模型中进行基础训练,当基础训练完成时,得到基础预测模型。
然后固定基础预测模型的参数不变,更新样本权重,即使用公式(3)来更新每个训练样本对应的训练样本权重,得到更新样本权重。
此时进一步更新初始化的学习参数,然后返回到固定训练样本对应的训练样本权重不变,来训练模型的参数的步骤继续迭代执行,直到模型训练完成时,输出训练完成时预测模型的模型参数以及训练样本权重,即得到目标预测模型。
在该实施例中,对训练得到的目标预测模型进行对比测试。具体来说,使用耐药性标准数据集Platinum(Platinum是一个广泛收集耐药性信息的数据库,是为了研究和理解错义突 变对配体与蛋白质组相互作用的影响而开发的)和TKI来进行训练和测试,其中,使用数据集Platinum训练得到目标预测模型,然后使用数据集TKI进行测试。通过采用RDKit(RDKit是一个用于化学信息学的开源工具包,基于对化合物2D和3D分子操作,利用机器学习方法进行化合物描述符生成,fingerprint生成,化合物结构相似性计算,2D和3D分子展示等),Biopython(Biopython为使用和研究生物信息学的开发者提供了一个在线的资源库),FoldX(计算蛋白结合自由能),PLIP(是一个蛋白配体非共价相互作用的分析工具),AutoDock(开源的分子模拟软件,最主要应用于执行配体—蛋白分子对接)等非物理模型工具生成对预测蛋白质突变后的亲和力变化对应的特征。并且使用基于混合的物理和经验势能的建模程序Rosetta计算能量特征。然后进行特征选取,得到最终选取的特征。具体如下表1所示,为最终选取的特征数表。
表1特征数表
数据集 | 样本数 | 非物理模型特征 | 物理和经验势能特征 | 特征总数
Platinum | 484 | 129 | 19 | 148
TKI | 144 | 129 | 19 | 148
此时,对训练得到的目标预测模型进行对比测试,测试结果如图15所示,该图15中展示了实验测得的和预测得到的△△G值的散点图,△△G是指配体与受体的结合自由能的相对差值,即突变前后的蛋白质与分别化合物进行结合时对应的结合自由能的差值。其中,图15中第一行是只使用非物理模型特征来预测结合自由能的相对差值的结果示意图,图15中第二行是使用非物理模型特征以及物理和经验势能特征共同来预测结合自由能的相对差值的结果示意图。第一列为使用现有技术1进行测试得到的结合自由能的相对差值的散点图。第二列为使用现有技术2进行测试得到的结合自由能的相对差值的散点图。第三列为使用本申请技术方案进行测试得到的结合自由能的相对差值的散点图。其中,使用RMSE(均方根误差),Pearson(Pearson Correlation Coefficient是用来衡量两个数据集合是否在一条线上面,它用来衡量定距变量间的线性关系)和AUPRC(曲线下面积递减的精度召回曲线)作为评价指标。其中,分别计算RMSE,Pearson和AUPRC指标的均值,最小值和最大值,得到的结果如下表2所示。
表2评价指标表
（表2以图像形式给出RMSE、Pearson和AUPRC三项指标的均值、最小值和最大值的对比结果，具体数值见下文说明。）
其中，在使用全部特征时，本申请的RMSE（越小越好）指标平均值为0.73，最小值为0.72，最大值为0.74，相对于其他现有技术，均方根误差明显较小。本申请中的Pearson（越大越好）指标也明显优于其他现有技术。本申请中AUPRC指标也优于其他现有技术。因此，本申请相对于现有技术，预测的准确性明显提升。进一步，如图16所示，为对比测试结果中AUPRC指标的示意图。其中，每条曲线中从左往右第一个圆圈表示当以△△G>1.36kcal/mol为阈值划分耐药性结果时，预测耐药性结果对应的精度和召回率；每条曲线中从左往右第二个圆圈表示将前15%△△G的测试样本作为耐药性划分结果时，预测耐药性结果对应的精度和召回率。从中可以明显看出，本申请的技术方案可以明显提升划分是否有耐药性的性能。
应该理解的是,虽然图2-14的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图2-14中的至少一部分步骤可以包括多个步骤或者多个阶段,这些步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤中的步骤或者阶段的至少一部分轮流或者交替地执行。
在一个实施例中,如图17所示,提供了一种预测模型训练装置1700,该装置可以采用软件模块或硬件模块,或者是二者的结合成为计算机设备的一部分,该装置具体包括:样本获取模块1702、样本确定模块1704、训练模块1706和迭代模块1708,其中:
样本获取模块1702,用于获取训练样本集,训练样本集包括各个训练样本、各个训练样本对应的训练样本权重和各个训练样本对应的目标能量特征,训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息,目标能量特征基于野生型能量特征和突变型能量特征得到,野生型能量特征是基于野生型蛋白质信息和化合物信息进行结合能量特征提取得到,突变型能量特征是基于突变型蛋白质信息和化合物信息进行结合能量特征提取得到的;
样本确定模块1704,用于基于训练样本权重从训练样本集中确定当前训练样本;
训练模块1706,用于将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;
迭代模块1708,用于基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时,得到目标预测模型,目标预测模型用于预测输入的蛋白质信息与输入的化合物信息对应的相互作用状态信息。
在其中一个实施例中,预测模型训练装置1700,还包括:
预训练模块,用于获取各个训练样本,训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息;基于野生型蛋白质信息和化合物信息进行结合初始能量特征提取,得到野生型初始能量特征;基于突变型蛋白质信息和化合物信息进行结合初始能量特征提取,得到突变型初始能量特征,并基于野生型初始能量特征和突变型初始能量特征确定各个训练样本对应的目标初始能量特征;将各个训练样本对应的目标初始能量特征输入到初始预测模型中进行预测,得到各个训练样本对应的初始相互作用状态信息,初始预测模型是使用随机森林算法建立的;基于各个训练样本对应的初始相互作用状态信息和各个训练样本对应的相互作用状态标签进行损失计算,得到各个训练样本对应的初始损失信息;基于初始损失信息更新初始预测模型,并返回将各个训练样本对应的目标能量特征输入到初始预测模型中进行预测的步骤执行,直到预训练完成时,得到预训练预测模型和目标初始能量特征对应的特征重要性;基于预训练完成时各个训练样本对应的损失信息确定各个训练样本对应的训练样本权重,并基于特征重要性从目标初始能量特征中选取目标能量特征。
在一个实施例中,预训练模块还用于将各个训练样本对应的目标初始能量特征输入到初始预测模型中;初始预测模型将各个训练样本对应的目标初始能量特征作为当前待划分集,并计算目标初始能量特征对应的初始特征重要性,基于初始特征重要性从目标初始能量特征 中确定初始划分特征,基于初始划分特征将各个训练样本对应的目标初始能量特征进行划分,得到各个划分结果,划分结果中包括各个划分样本对应的目标初始能量特征,将各个划分结果作为当前待划分集,并返回计算目标初始能量特征对应的初始特征重要性的步骤迭代,直到划分完成时,得到各个训练样本对应的初始相互作用状态信息。
在一个实施例中,样本获取模块1702还用于获取各个训练样本对应的置信度,基于置信度确定各个训练样本对应的训练样本权重。
在一个实施例中,样本获取模块1702还用于基于野生型蛋白质信息和化合物信息进行结合能量特征提取,得到野生型能量特征;基于突变型蛋白质信息和化合物信息进行结合能量特征提取,得到突变型能量特征;计算野生型能量特征和突变型能量特征之间的差异,得到目标能量特征。
在一个实施例中,野生型能量特征包括第一野生型能量特征和第二野生型能量特征;样本获取模块1702还用于基于野生型蛋白质信息和化合物信息使用非物理型打分函数进行结合能量特征提取,得到第一野生型能量特征;基于野生型蛋白质信息和化合物信息使用物理型函数进行结合能量特征提取,得到第二野生型能量特征;基于第一野生型能量特征和第二野生型能量特征进行融合,得到野生型能量特征。
在一个实施例中,突变型能量特征包括第一突变型能量特征和第二突变型能量特征;样本获取模块1702还用于基于突变型蛋白质信息和化合物信息使用非物理型函数进行结合能量特征提取,得到第一突变型能量特征;基于突变型蛋白质信息和化合物信息使用物理型函数进行结合能量特征提取,得到第二突变型能量特征;基于第一突变型能量特征和第二突变型能量特征进行融合,得到突变型能量特征。
在一个实施例中,样本确定模块1704还用于获取蛋白质家族信息,基于蛋白质家族信息将训练样本集进行划分,得到各个训练样本组;基于训练样本权重从各个训练样本组中选取当前训练样本,得到当前训练样本集。
训练模块1706还用于将当前训练样本集中各个当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到目标基础预测模型。
在一个实施例中,样本确定模块1704还用于获取当前学习参数,基于当前学习参数确定选取样本数和样本分布;基于选取样本数和样本分布按照训练样本权重从各个训练样本组中选取当前训练样本,得到目标当前训练样本集。
在一个实施例中,训练模块1706还用于将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行预测,得到当前相互作用状态信息;计算当前相互作用状态信息与当前训练样本对应的相互作用状态标签之间的误差,得到当前损失信息;基于当前损失信息更新预训练预测模型,并返回将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行预测,得到当前相互作用状态信息的步骤执行,直到达到基础训练完成条件时,得到基础预测模型。
在一个实施例中,迭代模块1708还用于将各个训练样本对应的目标能量特征输入到基础预测模型中,得到各个训练样本对应的基础相互作用状态信息;计算各个训练样本对应的基础相互作用状态信息与各个训练样本对应的相互作用状态标签之间的误差,得到基础损失信息;基于基础损失信息对训练样本权重进行更新,得到各个训练样本对应的更新样本权重。
在一个实施例中,迭代模块1708还用于获取当前学习参数,基于当前学习参数计算更新阈值;将更新阈值与各个训练样本对应的基础损失信息进行比较,得到各个训练样本对应的 比较结果;根据各个训练样本对应的比较结果确定各个训练样本对应的更新样本权重。
在一个实施例中,当前学习参数包括多样性学习参数和难易度学习参数;迭代模块1708还用于获取各个训练样本组,从各个训练样本组中确定当前训练样本组,并计算当前训练样本组对应的样本秩;基于样本秩计算加权值,使用加权值对多样性学习参数进行加权,得到目标加权值;计算目标加权值与难易度学习参数的和,得到更新阈值。
在一个实施例中,迭代模块1708获取当前学习参数,按照预设增加量对当前学习参数进行更新,得到更新学习参数,将更新学习参数作为当前学习参数。
在一个实施例中,如图18所示,提供了一种数据预测装置1800,该装置可以采用软件模块或硬件模块,或者是二者的结合成为计算机设备的一部分,该装置具体包括:数据获取模块1802、特征提取模块1804、目标特征确定模块1806和预测模块1808,其中:
数据获取模块1802,用于获取待预测数据,待预测数据包括待预测野生型蛋白质信息、待预测突变型蛋白质信息和待预测化合物信息;
特征提取模块1804,用于基于待预测野生型蛋白质信息和待预测化合物信息进行结合能量特征提取,得到待预测野生型能量特征,基于待预测突变型蛋白质信息和待预测化合物信息进行结合能量特征提取,得到待预测突变型能量特征;
目标特征确定模块1806,用于基于待预测野生型能量特征和待预测突变型能量特征确定待预测目标能量特征;
预测模块1808,用于将待预测目标能量特征输入目标预测模型中进行预测,得到相互作用状态信息,目标预测模型是通过获取训练样本集,基于训练样本权重从训练样本集中确定当前训练样本;将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;基于基础预测模型更新各个训练样本对应的训练样本权重,并返回基于训练样本权重从训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时得到的。
关于预测模型训练装置和数据预测装置的具体限定可以参见上文中对于预测模型训练方法和数据预测方法的限定,在此不再赘述。上述数据预测装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构图可以如图19所示。该计算机设备包括通过系统总线连接的处理器、存储器和网络接口。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的数据库用于存储训练样本数据和待预测数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时以实现一种预测模型训练方法或者数据预测方法。
在一个实施例中,提供了一种计算机设备,该计算机设备可以是终端,其内部结构图可以如图20所示。该计算机设备包括通过系统总线连接的处理器、存储器、通信接口、显示屏和输入装置。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机可读指 令。该内存储器为非易失性存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的通信接口用于与外部的终端进行有线或无线方式的通信,无线方式可通过WIFI、运营商网络、NFC(近场通信)或其他技术实现。该计算机可读指令被处理器执行时以实现一种预测模型训练方法和数据预测方法。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏,该计算机设备的输入装置可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。
本领域技术人员可以理解,图19和图20中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,还提供了一种计算机设备,包括存储器和处理器,存储器中存储有计算机可读指令,该处理器执行计算机可读指令时实现上述各方法实施例中的步骤。
在一个实施例中,提供了一种计算机可读存储介质,存储有计算机可读指令,该计算机可读指令被处理器执行时实现上述各方法实施例中的步骤。
在一个实施例中,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述各方法实施例中的步骤。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一非易失性计算机可读取存储介质中,该计算机程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和易失性存储器中的至少一种。非易失性存储器可包括只读存储器(Read-Only Memory,ROM)、磁带、软盘、闪存或光存储器等。易失性存储器可包括随机存取存储器(Random Access Memory,RAM)或外部高速缓冲存储器。作为说明而非局限,RAM可以是多种形式,比如静态随机存取存储器(Static Random Access Memory,SRAM)或动态随机存取存储器(Dynamic Random Access Memory,DRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对发明专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (33)

  1. 一种预测模型训练方法,由计算机设备执行,其特征在于,所述方法包括:
    获取训练样本集,所述训练样本集包括各个训练样本、所述各个训练样本对应的训练样本权重和所述各个训练样本对应的目标能量特征,所述训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息,所述目标能量特征基于野生型能量特征和突变型能量特征得到,所述野生型能量特征是基于所述野生型蛋白质信息和所述化合物信息进行结合能量特征提取得到,所述突变型能量特征是基于所述突变型蛋白质信息和所述化合物信息进行结合能量特征提取得到的;
    基于所述训练样本权重从所述训练样本集中确定当前训练样本;
    将所述当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;
    基于所述基础预测模型更新所述各个训练样本对应的训练样本权重,并返回基于训练样本权重从所述训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时,得到目标预测模型,所述目标预测模型用于预测输入的蛋白质信息与输入的化合物信息对应的相互作用状态信息。
  2. 根据权利要求1所述的方法,其特征在于,在所述获取训练样本集之前,还包括:
    获取所述各个训练样本和所述各个训练样本对应的相互作用状态标签,所述训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息;
    基于所述野生型蛋白质信息和所述化合物信息进行结合初始能量特征提取,得到野生型初始能量特征;
    基于所述突变型蛋白质信息和所述化合物信息进行结合初始能量特征提取,得到突变型初始能量特征,并基于所述野生型初始能量特征和突变型初始能量特征确定所述各个训练样本对应的目标初始能量特征;
    将所述各个训练样本对应的目标初始能量特征输入到初始预测模型中进行预测,得到所述各个训练样本对应的初始相互作用状态信息,所述初始预测模型是使用随机森林算法建立的;
    基于所述各个训练样本对应的初始相互作用状态信息和所述各个训练样本对应的相互作用状态标签进行损失计算,得到所述各个训练样本对应的初始损失信息;
    基于所述初始损失信息更新所述初始预测模型,并返回将所述各个训练样本对应的目标能量特征输入到初始预测模型中进行预测的步骤执行,直到预训练完成时,得到预训练预测模型和所述目标初始能量特征对应的特征重要性;
    基于预训练完成时所述各个训练样本对应的损失信息确定所述各个训练样本对应的训练样本权重,并基于所述特征重要性从所述目标初始能量特征中选取目标能量特征。
  3. 根据权利要求2所述的方法,其特征在于,所述将所述各个训练样本对应的目标初始能量特征输入到初始预测模型中进行预测,得到所述各个训练样本对应的初始相互作用状态信息,所述初始预测模型是使用随机森林算法建立的,包括:
    将所述各个训练样本对应的目标初始能量特征输入到初始预测模型中;
    所述初始预测模型将所述各个训练样本对应的目标初始能量特征作为当前待划分集,并计算所述目标初始能量特征对应的初始特征重要性,基于所述初始特征重要性从所述目标初始能量特征中确定初始划分特征,基于所述初始划分特征将所述各个训练样本对应的目标初 始能量特征进行划分,得到各个划分结果,所述划分结果中包括各个划分样本对应的目标初始能量特征,将所述各个划分结果作为当前待划分集,并返回计算所述目标初始能量特征对应的初始特征重要性的步骤迭代,直到划分完成时,得到所述各个训练样本对应的初始相互作用状态信息。
  4. 根据权利要求1所述的方法,其特征在于,所述获取训练样本集,所述训练样本集包括所述各个训练样本对应的训练样本权重,包括:
    获取所述各个训练样本对应的置信度,基于所述置信度确定所述各个训练样本对应的训练样本权重。
  5. 根据权利要求1所述的方法,其特征在于,所述获取训练样本集,所述训练样本集包括所述各个训练样本对应的目标能量特征,包括:
    基于所述野生型蛋白质信息和所述化合物信息进行结合能量特征提取,得到所述野生型能量特征;
    基于所述突变型蛋白质信息和所述化合物信息进行结合能量特征提取,得到所述突变型能量特征;
    计算所述野生型能量特征和所述突变型能量特征之间的差异,得到目标能量特征。
  6. 根据权利要求5所述的方法,其特征在于,所述野生型能量特征包括第一野生型能量特征和第二野生型能量特征;
    所述基于所述野生型蛋白质信息和所述化合物信息进行结合能量特征提取,得到所述野生型能量特征,包括:
    基于所述野生型蛋白质信息和所述化合物信息使用非物理型打分函数进行结合能量特征提取,得到第一野生型能量特征;
    基于所述野生型蛋白质信息和所述化合物信息使用物理型函数进行结合能量特征提取,得到第二野生型能量特征;
    基于所述第一野生型能量特征和所述第二野生型能量特征进行融合,得到所述野生型能量特征。
  7. 根据权利要求5所述的方法,其特征在于,所述突变型能量特征包括第一突变型能量特征和第二突变型能量特征;
    所述基于所述突变型蛋白质信息和所述化合物信息进行结合能量特征提取,得到所述突变型能量特征,包括:
    基于所述突变型蛋白质信息和所述化合物信息使用非物理型函数进行结合能量特征提取,得到第一突变型能量特征;
    基于所述突变型蛋白质信息和所述化合物信息使用物理型函数进行结合能量特征提取,得到第二突变型能量特征;
    基于所述第一突变型能量特征和所述第二突变型能量特征进行融合,得到所述突变型能量特征。
  8. 根据权利要求1所述的方法,其特征在于,所述基于所述训练样本权重从所述训练样本集中确定当前训练样本,包括:
    获取蛋白质家族信息,基于所述蛋白质家族信息将所述训练样本集进行划分,得到各个训练样本组;
    基于所述训练样本权重从所述各个训练样本组中选取当前训练样本,得到当前训练样本 集;
    所述将所述当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型,包括:
    将所述当前训练样本集中各个当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到目标基础预测模型。
  9. 根据权利要求8所述的方法,其特征在于,所述基于所述训练样本权重从所述各个训练样本组中选取当前训练样本,得到当前训练样本集,包括:
    获取当前学习参数,基于所述当前学习参数确定选取样本数和样本分布;
    基于所述选取样本数和所述样本分布按照所述训练样本权重从所述各个训练样本组中选取当前训练样本,得到目标当前训练样本集。
  10. 根据权利要求1所述的方法,其特征在于,所述将所述当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型,包括:
    将所述当前训练样本对应的当前目标能量特征输入到所述预训练预测模型中进行预测,得到当前相互作用状态信息;
    计算所述当前相互作用状态信息与所述当前训练样本对应的相互作用状态标签之间的误差,得到当前损失信息;
    基于所述当前损失信息更新所述预训练预测模型,并返回将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行预测,得到当前相互作用状态信息的步骤执行,直到达到基础训练完成条件时,得到所述基础预测模型。
  11. 根据权利要求1所述的方法,其特征在于,所述基于所述基础预测模型更新所述各个训练样本对应的训练样本权重,包括:
    将所述各个训练样本对应的目标能量特征输入到所述基础预测模型中,得到所述各个训练样本对应的基础相互作用状态信息;
    计算所述各个训练样本对应的基础相互作用状态信息与所述各个训练样本对应的相互作用状态标签之间的误差,得到基础损失信息;
    基于所述基础损失信息对所述训练样本权重进行更新,得到所述各个训练样本对应的更新样本权重。
  12. 根据权利要求11所述的方法,其特征在于,所述基于所述基础损失信息对所述训练样本权重进行更新,得到所述各个训练样本对应的更新样本权重,包括:
    获取当前学习参数,基于所述当前学习参数计算更新阈值;
    将所述更新阈值与所述各个训练样本对应的基础损失信息进行比较,得到所述各个训练样本对应的比较结果;
    根据所述各个训练样本对应的比较结果确定所述各个训练样本对应的更新样本权重。
  13. 根据权利要求12所述的方法,其特征在于,所述当前学习参数包括多样性学习参数和难易度学习参数;
    所述基于所述当前学习参数计算更新阈值,包括:
    获取各个训练样本组,从所述各个训练样本组中确定当前训练样本组,并计算所述当前训练样本组对应的样本秩;
    基于所述样本秩计算加权值,使用所述加权值对所述多样性学习参数进行加权,得到目 标加权值;
    计算所述目标加权值与所述难易度学习参数的和,得到所述更新阈值。
  14. 根据权利要求1所述的方法,其特征在于,在所述基于所述基础预测模型更新所述各个训练样本对应的训练样本权重之后,还包括:
    获取当前学习参数,按照预设增加量对所述当前学习参数进行更新,得到更新学习参数,将所述更新学习参数作为当前学习参数。
  15. 一种数据预测方法,其特征在于,所述方法包括:
    获取待预测数据,所述待预测数据包括待预测野生型蛋白质信息、待预测突变型蛋白质信息和待预测化合物信息;
    基于所述待预测野生型蛋白质信息和所述待预测化合物信息进行结合能量特征提取,得到待预测野生型能量特征,基于所述待预测突变型蛋白质信息和所述待预测化合物信息进行结合能量特征提取,得到待预测突变型能量特征;
    基于所述待预测野生型能量特征和所述待预测突变型能量特征确定待预测目标能量特征;
    将所述待预测目标能量特征输入目标预测模型中进行预测,得到相互作用状态信息,所述目标预测模型是通过获取训练样本集,基于训练样本权重从所述训练样本集中确定当前训练样本;将所述当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;基于所述基础预测模型更新所述训练样本权重,并返回基于训练样本权重从所述训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时得到的。
  16. 一种预测模型训练装置,其特征在于,所述装置包括:
    样本获取模块,用于获取训练样本集,所述训练样本集包括各个训练样本、所述各个训练样本对应的训练样本权重和所述各个训练样本对应的目标能量特征,所述训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息,所述目标能量特征基于野生型能量特征和突变型能量特征得到,所述野生型能量特征是基于所述野生型蛋白质信息和所述化合物信息进行结合能量特征提取得到,所述突变型能量特征是基于所述突变型蛋白质信息和所述化合物信息进行结合能量特征提取得到的;
    样本确定模块,用于基于所述训练样本权重从所述训练样本集中确定当前训练样本;
    训练模块,用于将所述当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;
    迭代模块,用于基于所述基础预测模型更新所述各个训练样本对应的训练样本权重,并返回基于训练样本权重从所述训练样本集中确定当前训练样本的步骤执行,直到模型训练完成时,得到目标预测模型,所述目标预测模型用于预测输入的蛋白质信息与输入的化合物信息对应的相互作用状态信息。
  17. 根据权利要求16所述的装置,其特征在于,所述装置,还包括:
    预训练模块,用于获取所述各个训练样本和所述各个训练样本对应的相互作用状态标签,所述训练样本包括野生型蛋白质信息、突变型蛋白质信息和化合物信息;基于所述野生型蛋白质信息和所述化合物信息进行结合初始能量特征提取,得到野生型初始能量特征;基于所述突变型蛋白质信息和所述化合物信息进行结合初始能量特征提取,得到突变型初始能量特征,并基于所述野生型初始能量特征和突变型初始能量特征确定所述各个训练样本对应的目 标初始能量特征;将所述各个训练样本对应的目标初始能量特征输入到初始预测模型中进行预测,得到所述各个训练样本对应的初始相互作用状态信息,所述初始预测模型是使用随机森林算法建立的;基于所述各个训练样本对应的初始相互作用状态信息和所述各个训练样本对应的相互作用状态标签进行损失计算,得到所述各个训练样本对应的初始损失信息;基于所述初始损失信息更新所述初始预测模型,并返回将所述各个训练样本对应的目标能量特征输入到初始预测模型中进行预测的步骤执行,直到预训练完成时,得到预训练预测模型和所述目标初始能量特征对应的特征重要性;基于预训练完成时所述各个训练样本对应的损失信息确定所述各个训练样本对应的训练样本权重,并基于所述特征重要性从所述目标初始能量特征中选取目标能量特征。
  18. 根据权利要求17所述的装置,其特征在于,所述预训练模块还用于将所述各个训练样本对应的目标初始能量特征输入到初始预测模型中;所述初始预测模型将所述各个训练样本对应的目标初始能量特征作为当前待划分集,并计算所述目标初始能量特征对应的初始特征重要性,基于所述初始特征重要性从所述目标初始能量特征中确定初始划分特征,基于所述初始划分特征将所述各个训练样本对应的目标初始能量特征进行划分,得到各个划分结果,所述划分结果中包括各个划分样本对应的目标初始能量特征,将所述各个划分结果作为当前待划分集,并返回计算所述目标初始能量特征对应的初始特征重要性的步骤迭代,直到划分完成时,得到所述各个训练样本对应的初始相互作用状态信息。
  19. 根据权利要求16所述的装置,其特征在于,所述样本获取模块还用于获取所述各个训练样本对应的置信度,基于所述置信度确定所述各个训练样本对应的训练样本权重。
  20. 根据权利要求16所述的装置,其特征在于,所述样本获取模块还用于基于所述野生型蛋白质信息和所述化合物信息进行结合能量特征提取,得到所述野生型能量特征;基于所述突变型蛋白质信息和所述化合物信息进行结合能量特征提取,得到所述突变型能量特征;计算所述野生型能量特征和所述突变型能量特征之间的差异,得到目标能量特征。
  21. 根据权利要求20所述的装置,其特征在于,所述野生型能量特征包括第一野生型能量特征和第二野生型能量特征;所述样本获取模块还用于基于所述野生型蛋白质信息和所述化合物信息使用非物理型打分函数进行结合能量特征提取,得到第一野生型能量特征;基于所述野生型蛋白质信息和所述化合物信息使用物理型函数进行结合能量特征提取,得到第二野生型能量特征;基于所述第一野生型能量特征和所述第二野生型能量特征进行融合,得到所述野生型能量特征。
  22. 根据权利要求20所述的装置,其特征在于,所述突变型能量特征包括第一突变型能量特征和第二突变型能量特征;所述样本获取模块还用于基于所述突变型蛋白质信息和所述化合物信息使用非物理型函数进行结合能量特征提取,得到第一突变型能量特征;基于所述突变型蛋白质信息和所述化合物信息使用物理型函数进行结合能量特征提取,得到第二突变型能量特征;基于所述第一突变型能量特征和所述第二突变型能量特征进行融合,得到所述突变型能量特征。
  23. 根据权利要求16所述的装置,其特征在于,所述样本确定模块还用于获取蛋白质家族信息,基于所述蛋白质家族信息将所述训练样本集进行划分,得到各个训练样本组;基于所述训练样本权重从所述各个训练样本组中选取当前训练样本,得到当前训练样本集;
    所述训练模块还用于将所述当前训练样本集中各个当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到目标基础预测模型。
  24. 根据权利要求23所述的装置,其特征在于,所述样本确定模块还用于获取当前学习参数,基于所述当前学习参数确定选取样本数和样本分布;基于所述选取样本数和所述样本分布按照所述训练样本权重从所述各个训练样本组中选取当前训练样本,得到目标当前训练样本集。
  25. 根据权利要求16所述的装置,其特征在于,所述训练模块还用于将所述当前训练样本对应的当前目标能量特征输入到所述预训练预测模型中进行预测,得到当前相互作用状态信息;计算所述当前相互作用状态信息与所述当前训练样本对应的相互作用状态标签之间的误差,得到当前损失信息;基于所述当前损失信息更新所述预训练预测模型,并返回将当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行预测,得到当前相互作用状态信息的步骤执行,直到达到基础训练完成条件时,得到所述基础预测模型。
  26. 根据权利要求16所述的装置,其特征在于,所述迭代模块还用于将所述各个训练样本对应的目标能量特征输入到所述基础预测模型中,得到所述各个训练样本对应的基础相互作用状态信息;计算所述各个训练样本对应的基础相互作用状态信息与所述各个训练样本对应的相互作用状态标签之间的误差,得到基础损失信息;基于所述基础损失信息对所述训练样本权重进行更新,得到所述各个训练样本对应的更新样本权重。
  27. 根据权利要求26所述的装置,其特征在于,所述迭代模块还用于获取当前学习参数,基于所述当前学习参数计算更新阈值;将所述更新阈值与所述各个训练样本对应的基础损失信息进行比较,得到所述各个训练样本对应的比较结果;根据所述各个训练样本对应的比较结果确定所述各个训练样本对应的更新样本权重。
  28. 根据权利要求27所述的装置,其特征在于,所述当前学习参数包括多样性学习参数和难易度学习参数;所述迭代模块还用于获取各个训练样本组,从所述各个训练样本组中确定当前训练样本组,并计算所述当前训练样本组对应的样本秩;基于所述样本秩计算加权值,使用所述加权值对所述多样性学习参数进行加权,得到目标加权值;计算所述目标加权值与所述难易度学习参数的和,得到所述更新阈值。
  29. 根据权利要求16所述的装置,其特征在于,所述迭代模块还用于获取当前学习参数,按照预设增加量对所述当前学习参数进行更新,得到更新学习参数,将所述更新学习参数作为当前学习参数。
  30. 一种数据预测装置,其特征在于,所述装置包括:
    数据获取模块,用于获取待预测数据,所述待预测数据包括待预测野生型蛋白质信息、待预测突变型蛋白质信息和待预测化合物信息;
    特征提取模块,用于基于所述待预测野生型蛋白质信息和所述待预测化合物信息进行结合能量特征提取,得到待预测野生型能量特征,基于所述待预测突变型蛋白质信息和所述待预测化合物信息进行结合能量特征提取,得到待预测突变型能量特征;
    目标特征确定模块,用于基于所述待预测野生型能量特征和所述待预测突变型能量特征确定待预测目标能量特征;
    预测模块,用于将所述待预测目标能量特征输入目标预测模型中进行预测,得到相互作用状态信息,所述目标预测模型是通过获取训练样本集,基于训练样本权重从所述训练样本集中确定当前训练样本;将所述当前训练样本对应的当前目标能量特征输入到预训练预测模型中进行基础训练,当基础训练完成时,得到基础预测模型;基于所述基础预测模型更新所述训练样本权重,并返回基于训练样本权重从所述训练样本集中确定当前训练样本的步骤执 行,直到模型训练完成时得到的。
  31. 一种计算机设备，包括存储器和处理器，所述存储器存储有计算机可读指令，其特征在于，所述处理器执行所述计算机可读指令时实现权利要求1至15中任一项所述的方法的步骤。
  32. 一种计算机可读存储介质,存储有计算机可读指令,其特征在于,所述计算机可读指令被处理器执行时实现权利要求1至15中任一项所述的方法的步骤。
  33. 一种计算机程序产品,包括计算机可读指令,其特征在于,所述计算机可读指令被处理器执行时实现权利要求1至15中任一项所述的方法的步骤。
PCT/CN2022/079885 2021-04-01 2022-03-09 预测模型训练、数据预测方法、装置和存储介质 WO2022206320A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP22778504.5A EP4318478A1 (en) 2021-04-01 2022-03-09 Prediction model training and data prediction methods and apparatuses, and storage medium
JP2023534153A JP2023552416A (ja) 2021-04-01 2022-03-09 予測モデルの訓練方法、データ予測方法、装置及びコンピュータプログラム
US18/075,643 US20230097667A1 (en) 2021-04-01 2022-12-06 Methods and apparatuses for training prediction model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110355929.6A CN112735535B (zh) 2021-04-01 2021-04-01 预测模型训练、数据预测方法、装置和存储介质
CN202110355929.6 2021-04-01

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/075,643 Continuation US20230097667A1 (en) 2021-04-01 2022-12-06 Methods and apparatuses for training prediction model

Publications (1)

Publication Number Publication Date
WO2022206320A1 true WO2022206320A1 (zh) 2022-10-06

Family

ID=75596362

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/079885 WO2022206320A1 (zh) 2021-04-01 2022-03-09 预测模型训练、数据预测方法、装置和存储介质

Country Status (5)

Country Link
US (1) US20230097667A1 (zh)
EP (1) EP4318478A1 (zh)
JP (1) JP2023552416A (zh)
CN (1) CN112735535B (zh)
WO (1) WO2022206320A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600511A (zh) * 2022-12-01 2023-01-13 北京金羽新材科技有限公司(Cn) 电解质材料预测模型训练方法、装置和计算机设备
CN116994698A (zh) * 2023-03-31 2023-11-03 河北医科大学第一医院 基于深度学习的舍曲林剂量个体化推荐方法及装置

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735535B (zh) * 2021-04-01 2021-06-25 腾讯科技(深圳)有限公司 预测模型训练、数据预测方法、装置和存储介质
CN113284577B (zh) * 2021-05-24 2023-08-11 康键信息技术(深圳)有限公司 药品预测方法、装置、设备及存储介质
CN113255770B (zh) * 2021-05-26 2023-10-27 北京百度网讯科技有限公司 化合物属性预测模型训练方法和化合物属性预测方法
CN113409884B (zh) * 2021-06-30 2022-07-22 北京百度网讯科技有限公司 排序学习模型的训练方法及排序方法、装置、设备及介质
CN113889179B (zh) * 2021-10-13 2024-06-11 山东大学 基于多视图深度学习的化合物-蛋白质相互作用预测方法
CN114187979A (zh) * 2022-02-15 2022-03-15 北京晶泰科技有限公司 数据处理、模型训练、分子预测和筛选方法及其装置
CN114708931B (zh) * 2022-04-22 2023-01-24 中国海洋大学 结合机器学习和构象计算提高药-靶活性预测精度的方法
CN116913393B (zh) * 2023-09-12 2023-12-01 浙江大学杭州国际科创中心 一种基于强化学习的蛋白质进化方法及装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020019706A1 (en) * 1995-06-07 2002-02-14 Paul Braun Method and apparatus for predicting the presence of an abnormal level of one or more proteins in the clotting cascade
CN109147866A (zh) * 2018-06-28 2019-01-04 南京理工大学 基于采样与集成学习的蛋白质-dna绑定残基预测方法
CN110008984A (zh) * 2019-01-22 2019-07-12 阿里巴巴集团控股有限公司 一种基于多任务样本的目标模型训练方法和装置
CN110443419A (zh) * 2019-08-01 2019-11-12 太原理工大学 基于iceemdan与极限学习机的中长期径流预测方法
CN111667884A (zh) * 2020-06-12 2020-09-15 天津大学 基于注意力机制使用蛋白质一级序列预测蛋白质相互作用的卷积神经网络模型
CN111985274A (zh) * 2019-05-23 2020-11-24 中国科学院沈阳自动化研究所 一种基于卷积神经网络的遥感图像分割算法
CN112735535A (zh) * 2021-04-01 2021-04-30 腾讯科技(深圳)有限公司 预测模型训练、数据预测方法、装置和存储介质

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020733B (zh) * 2012-11-27 2017-04-12 南京航空航天大学 一种基于权重的机场单航班噪声预测方法及其系统
CN103116713B (zh) * 2013-02-25 2015-09-16 浙江大学 基于随机森林的化合物和蛋白质相互作用预测方法
CN106650926B (zh) * 2016-09-14 2019-04-16 天津工业大学 一种稳健的boosting极限学习机集成建模方法
CN106548210B (zh) * 2016-10-31 2021-02-05 腾讯科技(深圳)有限公司 基于机器学习模型训练的信贷用户分类方法及装置
CN107679455A (zh) * 2017-08-29 2018-02-09 平安科技(深圳)有限公司 目标跟踪装置、方法及计算机可读存储介质
CN110689965B (zh) * 2019-10-10 2023-03-24 电子科技大学 一种基于深度学习的药物靶点亲和力预测方法
CN112530514A (zh) * 2020-12-18 2021-03-19 中国石油大学(华东) 基于深度学习方法预测化合物蛋白质相互作用的新型深度模型、计算机设备、存储介质

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020019706A1 (en) * 1995-06-07 2002-02-14 Paul Braun Method and apparatus for predicting the presence of an abnormal level of one or more proteins in the clotting cascade
CN109147866A (zh) * 2018-06-28 2019-01-04 南京理工大学 基于采样与集成学习的蛋白质-dna绑定残基预测方法
CN110008984A (zh) * 2019-01-22 2019-07-12 阿里巴巴集团控股有限公司 一种基于多任务样本的目标模型训练方法和装置
CN111985274A (zh) * 2019-05-23 2020-11-24 中国科学院沈阳自动化研究所 一种基于卷积神经网络的遥感图像分割算法
CN110443419A (zh) * 2019-08-01 2019-11-12 太原理工大学 基于iceemdan与极限学习机的中长期径流预测方法
CN111667884A (zh) * 2020-06-12 2020-09-15 天津大学 基于注意力机制使用蛋白质一级序列预测蛋白质相互作用的卷积神经网络模型
CN112735535A (zh) * 2021-04-01 2021-04-30 腾讯科技(深圳)有限公司 预测模型训练、数据预测方法、装置和存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600511A (zh) * 2022-12-01 2023-01-13 北京金羽新材科技有限公司(Cn) 电解质材料预测模型训练方法、装置和计算机设备
CN116994698A (zh) * 2023-03-31 2023-11-03 河北医科大学第一医院 基于深度学习的舍曲林剂量个体化推荐方法及装置

Also Published As

Publication number Publication date
JP2023552416A (ja) 2023-12-15
EP4318478A1 (en) 2024-02-07
CN112735535B (zh) 2021-06-25
US20230097667A1 (en) 2023-03-30
CN112735535A (zh) 2021-04-30

Similar Documents

Publication Publication Date Title
WO2022206320A1 (zh) 预测模型训练、数据预测方法、装置和存储介质
CA3110200C (en) Iterative protein structure prediction using gradients of quality scores
Ancien et al. Prediction and interpretation of deleterious coding variants in terms of protein structural stability
Vlasblom et al. Markov clustering versus affinity propagation for the partitioning of protein interaction graphs
Stekhoven et al. MissForest—non-parametric missing value imputation for mixed-type data
WO2022206604A1 (zh) 分类模型训练和分类方法、装置、计算机设备和存储介质
He et al. Evolutionary graph clustering for protein complex identification
Sriwastava et al. Predicting protein-protein interaction sites with a novel membership based fuzzy SVM classifier
Ma et al. Layer-specific modules detection in cancer multi-layer networks
CN115116539A (zh) 对象确定方法、装置、计算机设备和存储介质
CN113838541B (zh) 设计配体分子的方法和装置
Wang et al. A novel stochastic block model for network-based prediction of protein-protein interactions
Chen et al. Domain-based predictive models for protein-protein interaction prediction
Meseguer et al. Prediction of protein–protein binding affinities from unbound protein structures
Wilson et al. The electrostatic landscape of MHC-peptide binding revealed using inception networks
Fang et al. The intrinsic geometric structure of protein-protein interaction networks for protein interaction prediction
CN110728289B (zh) 一种家庭宽带用户的挖掘方法及设备
CN114694744A (zh) 蛋白质结构预测
CN110599377A (zh) 在线学习的知识点排序方法和装置
Jadamba et al. NetRanker: a network-based gene ranking tool using protein-protein interaction and gene expression data
Blanchet et al. A model-based approach to gene clustering with missing observation reconstruction in a Markov random field framework
Radu et al. Node handprinting: a scalable and accurate algorithm for aligning multiple biological networks
CN113780445B (zh) 癌症亚型分类预测模型的生成方法及装置、存储介质
Pozzati et al. Improved protein docking by predicted interface residues
Fayyaz Movaghar et al. Statistical significance of threading scores

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22778504

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023534153

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 2022778504

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2022778504

Country of ref document: EP

Effective date: 20231101