WO2024114154A1 - Training of a noise data determination model, and method and apparatus for determining noise data

Training of a noise data determination model, and method and apparatus for determining noise data

Info

Publication number
WO2024114154A1
Authority
WO
WIPO (PCT)
Prior art keywords: sample, data, noise data, noise, small molecule
Application number: PCT/CN2023/125347
Other languages: English (en), French (fr)
Inventor
黄磊
张恒通
徐挺洋
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2024114154A1

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C - COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00 - Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/50 - Molecular design, e.g. of drugs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the embodiments of the present application relate to the field of biotechnology, and in particular to the training of noise data determination models and noise data determination technology.
  • the noisy small molecule data can be denoised based on the noise data to obtain the denoised small molecule data. If the denoised small molecule data meets the conditions, the denoised small molecule data is used as the target small molecule data without noise; if the denoised small molecule data does not meet the conditions, the denoised small molecule data is used as the noisy small molecule data, and the noisy small molecule data is denoised again until the conditions are met to obtain the target small molecule data.
  • the target small molecule data can be used for drug research and development to accelerate drug research and development. Based on this, how to determine the noise data in noisy small molecule data has become an urgent problem to be solved.
  • the present application provides a noise data determination model training, noise data determination method and device, which can be used to solve the problems in the related technology.
  • the technical solution includes the following contents.
  • a method for training a noise data determination model comprising:
  • acquiring sample noisy small molecule data and annotated noise data, wherein the sample noisy small molecule data is small molecule data with noise data, the sample noisy small molecule data includes data of multiple sample atoms, and the annotated noise data is obtained by annotation and is the noise data in the sample noisy small molecule data;
  • based on the data of the multiple sample atoms, a sample graph structure is output through a neural network model, wherein the sample graph structure includes multiple sample nodes and multiple sample edges, any sample node represents the data of one sample atom, and any sample edge represents the distance between the sample atoms corresponding to the two sample nodes at its ends;
  • the sample graph structure is predicted through the neural network model to obtain predicted noise data, wherein the predicted noise data is obtained through prediction and is the noise data in the sample noisy small molecule data; and based on the predicted noise data and the annotated noise data, the neural network model is trained to obtain a noise data determination model, and the noise data determination model is used to determine the final noise data in the data of the noisy small molecule to be processed.
  • a method for determining noise data is provided, the method being performed by an electronic device, the method comprising:
  • acquiring data of a noisy small molecule to be processed, wherein the data of the noisy small molecule to be processed is small molecule data with noise data, and the data of the noisy small molecule to be processed includes data of a plurality of atoms to be processed;
  • based on the data of the plurality of atoms to be processed, a graph structure to be processed is determined by a noise data determination model, wherein the graph structure to be processed includes multiple nodes and multiple edges, any node represents an atom to be processed, any edge represents the distance between the atoms to be processed corresponding to the two nodes at its ends, and the noise data determination model is obtained by training according to the method described in any one of the first aspects;
  • based on the graph structure to be processed, final noise data is determined by the noise data determination model, and the final noise data is the noise data in the data of the noisy small molecule to be processed.
  • a training device for a noise data determination model wherein the device is deployed on an electronic device, and the device includes:
  • An acquisition module used to acquire sample noisy small molecule data and annotated noise data, wherein the sample noisy small molecule data is small molecule data with noise data and the sample noisy small molecule data includes data of multiple sample atoms, and the annotated noise data is obtained by annotation and is noise data in the sample noisy small molecule data;
  • a determination module configured to output a sample graph structure through a neural network model based on the data of the plurality of sample atoms, wherein the sample graph structure includes a plurality of sample nodes and a plurality of sample edges, wherein any sample node represents the data of a sample atom, and any sample edge represents the distance between sample atoms corresponding to two sample nodes at both ends;
  • the determination module is further used to predict the sample graph structure through the neural network model to obtain predicted noise data, wherein the predicted noise data is obtained through prediction and is noise data in the data of the sample noisy small molecule;
  • a training module, used to train the neural network model based on the predicted noise data and the annotated noise data to obtain a noise data determination model, and the noise data determination model is used to determine the final noise data in the data of the noisy small molecule to be processed.
  • a device for determining noise data is provided, the device being deployed on an electronic device, the device comprising:
  • An acquisition module used for acquiring data of small molecules with noise to be processed, wherein the data of small molecules with noise to be processed is small molecule data with noise data and the data of small molecules with noise to be processed includes data of multiple atoms to be processed;
  • a determination module configured to determine a graph structure to be processed by a noise data determination model based on the data of the multiple atoms to be processed, wherein the graph structure to be processed includes multiple nodes and multiple edges, wherein any node represents an atom to be processed, and any edge represents a distance between atoms to be processed corresponding to two nodes at both ends, and the noise data determination model is obtained by training according to the training method for the noise data determination model described in any one of the first aspects;
  • the determination module is further used to determine final noise data based on the graph structure to be processed through the noise data determination model, and the final noise data is the noise data in the data of the noisy small molecule to be processed.
  • an electronic device comprising a processor and a memory, wherein at least one computer program is stored in the memory, and the at least one computer program is loaded and executed by the processor so that the electronic device implements the training method of the noise data determination model described in any one of the first aspects or implements the noise data determination method described in any one of the second aspects.
  • a computer-readable storage medium is provided, wherein the storage medium stores at least one computer program, and the at least one computer program is loaded and executed by a processor to enable the electronic device to implement the training method of the noise data determination model described in any one of the first aspects, or to implement the noise data determination method described in any one of the second aspects.
  • a computer program product is also provided, wherein at least one computer program is stored in the computer program product, and the at least one computer program is loaded and executed by a processor so that the electronic device implements the training method of the noise data determination model described in any one of the first aspect or implements the noise data determination method described in any one of the second aspect.
  • the technical solution provided by the present application is to determine the sample graph structure based on the data of multiple sample atoms in the data of sample noisy small molecules, predict the sample graph structure through a neural network model to determine the predicted noise data, and obtain a noise data determination model based on the predicted noise data and the labeled noise data.
  • the noise data determination model can be used to determine the final noise data in the data of the noisy small molecules to be processed, so that the data of the noisy small molecules to be processed can be denoised based on the final noise data to obtain the denoised small molecule data, and then drug research and development can be carried out based on the denoised small molecule data to improve the efficiency of drug research and development.
  • FIG1 is a schematic diagram of an implementation environment of a method for training a noise data determination model or a method for determining noise data provided in an embodiment of the present application;
  • FIG2 is a flow chart of a method for training a noise data determination model provided in an embodiment of the present application
  • FIG3 is a schematic diagram of a noise addition process and a noise removal process provided by an embodiment of the present application.
  • FIG4 is a flow chart of a method for determining noise data provided by an embodiment of the present application.
  • FIG5 is a schematic diagram of a training process of a noise data determination model provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of a target small molecule provided in an embodiment of the present application.
  • FIG7 is a schematic diagram of the structure of a training device for a noise data determination model provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of the structure of a device for determining noise data provided by an embodiment of the present application.
  • FIG9 is a schematic diagram of the structure of a terminal device provided in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the structure of a server provided in an embodiment of the present application.
  • FIG1 is a schematic diagram of an implementation environment of a method for training a noise data determination model or a method for determining noise data provided in an embodiment of the present application.
  • the implementation environment includes a terminal device 101 and a server 102.
  • the method for training a noise data determination model or a method for determining noise data in the embodiment of the present application can be executed by the terminal device 101, can be executed by the server 102, or can be executed by the terminal device 101 and the server 102 together.
  • the terminal device 101 may be a smart phone, a game console, a desktop computer, a tablet computer, a laptop computer, a smart TV, a smart car device, a smart voice interaction device, a smart home appliance, etc.
  • the server 102 may be a server cluster composed of multiple servers, or any one of a cloud computing platform and a virtualization center, which is not limited in the embodiments of the present application.
  • the server 102 may be connected to the terminal device 101 via a wired network or a wireless network.
  • the server 102 may have functions such as data processing, data storage, and data transmission and reception, which are not limited in the embodiments of the present application.
  • the number of the terminal device 101 and the server 102 is not limited, and may be one or more.
  • the embodiments of the present application can automatically perform a training method for a noise data determination model or a method for determining noise data based on artificial intelligence technology.
  • noise data can be used as the initial noisy small molecule data.
  • the noisy small molecule data can be denoised based on the noise data.
  • if the denoised small molecule data does not meet the conditions, the denoised small molecule data needs to be used as noisy small molecule data, and the noisy small molecule data is denoised again until the conditions are met to obtain the target small molecule data.
  • the target small molecule data can be used as drug data to achieve drug research and development. Therefore, the generation of small molecule data is closely related to drug research and development. Based on this, how to determine the noise data in the noisy small molecule data has become a problem that needs to be solved urgently.
  • the embodiment of the present application provides a training method for a noise data determination model, which can be applied to the above-mentioned implementation environment, and can determine the final noise data in the data of the noisy small molecules to be processed, so that the data of the noisy small molecules to be processed can be denoised based on the final noise data, laying the foundation for the generation of target small molecule data.
  • Taking the training method for a noise data determination model provided by the embodiment of the present application and shown in Figure 2 as an example, for ease of description, the terminal device 101 or the server 102 that executes the training method is referred to as an electronic device, and the method can be performed by the electronic device. As shown in Figure 2, the method includes steps 201 to 204.
  • Step 201 obtaining sample noisy small molecule data and annotated noise data.
  • the sample noisy small molecule is a small molecule with noise
  • the sample noisy small molecule includes multiple sample atoms
  • any sample atom can be an atom with noise or an atom without noise.
  • for any sample atom with noise, there is an error in the relevant information of the atom, and the error is outside the error range.
  • for any atom without noise, there is no error in the relevant information of the atom, or there is an error in the relevant information of the atom but the error is within the error range.
  • Any sample atom has its corresponding data, and the relevant information of the sample atom can be described by the data of the sample atom.
  • the data of the sample atom may include data for describing the type of the sample atom, that is, the data of the sample atom includes type data of the sample atom.
  • for example, if the sample atom is an oxygen atom, the type data of the sample atom is the element symbol O; if the sample atom is a carbon atom, the type data of the sample atom is the element symbol C; if the sample atom is a nitrogen atom, the type data of the sample atom is the element symbol N.
  • the data of the sample atoms may also include data for describing the positions of the sample atoms, that is, the data of the sample atoms include the position data of the sample atoms.
  • the position data of a sample atom may be the three-dimensional coordinates of the sample atom, including the horizontal coordinate (usually represented by x), the vertical coordinate (usually represented by y) and the depth coordinate (usually represented by z), and the position of the sample atom in the three-dimensional space coordinate system is described by these three coordinates.
  • the sample noisy small molecule includes multiple sample atoms, and the data of each sample atom constitutes the data of the sample noisy small molecule, that is, the data of the sample noisy small molecule is small molecule data with noise data and the data of the sample noisy small molecule includes the data of multiple sample atoms.
  • the data of the sample noisy small molecule may include other data in addition to the data of each sample atom, for example, the other data includes data for characterizing the type of small molecule to which the sample noisy small molecule belongs.
  • sample noisy small molecule data can be expressed as $G = \{(a_i, r_i)\}_{i=1}^{N}$, where N is the number of sample atoms, $a_i$ is the type data of the i-th sample atom, and $r_i$ is the position data of the i-th sample atom.
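  • As an illustration only (not part of the original disclosure), the data of a sample noisy small molecule could be represented in code as a list of (type data, position data) pairs; all names below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SampleAtom:
    """Hypothetical container for the data of one sample atom."""
    atom_type: str                          # type data a_i, e.g. "C", "N", "O"
    position: Tuple[float, float, float]    # position data r_i (3D coordinates)

# The data of a sample noisy small molecule is the data of its N sample atoms,
# i.e. the set {(a_i, r_i)} for i = 1..N.
SampleMolecule = List[SampleAtom]

molecule: SampleMolecule = [
    SampleAtom("C", (0.00, 0.00, 0.00)),
    SampleAtom("O", (1.21, 0.03, -0.02)),
    SampleAtom("N", (-1.35, 0.10, 0.04)),
]
```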
  • an actual small molecule can be used as a sample small molecule, or a designed small molecule can be used as a sample small molecule, and the sample small molecule is an effective small molecule that can bind to the sample protein.
  • the data of each atom in the sample small molecule is obtained, thereby obtaining the data of the sample small molecule.
  • the data of the sample small molecule is data about the small molecule without noise.
  • the data of the sample small molecule can be subjected to a first noise addition process to obtain the small molecule data after the first noise addition process; the small molecule data after the first noise addition process is subjected to a second noise addition process to obtain the small molecule data after the second noise addition process, and so on.
  • the data of the sample small molecule is subjected to T noise addition processes, and the small molecule data after the 1st to Tth noise addition processes can be obtained, and T is a positive integer.
  • the small molecule data after the Tth noise addition process is the initial noise data mentioned below, and the data of the sample small molecule can be understood as the small molecule data after the 0th noise addition process, that is, the data of the effective small molecule mentioned below.
  • Figure 3 is a schematic diagram of a noise addition and denoising process provided in an embodiment of the present application.
  • the small molecule data after the t-th noise addition process can be used as the data of the sample noisy small molecule, and the noise data when the small molecule data after the t-1-th noise addition process is subjected to the t-th noise addition process, that is, the noise data after the t-th noise addition process, can be used as the labeled noise data, where t is a positive integer greater than or equal to 1 and less than or equal to T.
  • the labeled noise data can be understood as the noise data in the sample noisy small molecule data obtained through labeling.
  • the small molecule data after the 0th noise addition process is subjected to T noise addition processes
  • the value of T is relatively large.
  • the small molecule data after the mth noise addition process is subjected to n-m noise addition processes, and the small molecule data after the nth noise addition process can be obtained, where m and n are both positive integers and are less than or equal to T, and m is less than n.
  • the small molecule data after the nth noise addition process can be used as the data of the sample noisy small molecule, and the noise data of each noise addition process in the n-m noise addition processes can be used to determine the labeled noise data.
  • for example, assuming T is 1000, the small molecule data after the 1000th noise addition process is obtained.
  • the small molecule data after the 1000th noise addition process, the small molecule data after the 990th noise addition process, ..., the small molecule data after the 10th noise addition process, and the small molecule data after the 1st noise addition process can be respectively used as the data of sample noisy small molecules.
  • when the small molecule data after the 1000th noise addition process is used as the data of the sample noisy small molecule, the annotated noise data is the sum of the noise data of each noise addition process from the 991st to the 1000th noise addition process, and so on.
  • the small molecule data after the 10th noise addition process is the data of sample noisy small molecules.
  • the annotated noise data is the sum of the noise data of each noise addition process from the 1st to the 10th noise addition process.
  • that is, in the process of obtaining the data of the sample noisy small molecule, the small molecule data after a certain noise addition process is used, and the sum of the noise data of each noise addition process during at least one noise addition process is used as the labeled noise data, so that the predicted noise data determined by the neural network model is likewise the sum of the noise data of each noise addition process during at least one noise addition process.
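  • A minimal sketch of this labeling scheme is shown below, assuming for simplicity that Gaussian noise is added to the position data only and that the labeled noise is the accumulated noise over the chosen span of noise addition processes; the function and variable names are hypothetical, and the simple additive schedule here is not the one defined by formula (2) later in the description.

```python
import numpy as np

def make_training_pair(positions: np.ndarray, n_steps: int, noise_scale: float = 0.1):
    """Add `n_steps` rounds of Gaussian noise to clean atom positions.

    Returns the positions of the sample noisy small molecule (after the last
    noise addition process) and the labeled noise, i.e. the sum of the noise
    added in each of the `n_steps` noise addition processes.
    """
    noisy = positions.copy()
    labeled_noise = np.zeros_like(positions)
    for _ in range(n_steps):
        step_noise = noise_scale * np.random.randn(*positions.shape)
        noisy += step_noise            # one noise addition process
        labeled_noise += step_noise    # accumulate the annotated noise
    return noisy, labeled_noise

clean_positions = np.random.rand(5, 3)   # 5 sample atoms, 3D coordinates
noisy_positions, annotated_noise = make_training_pair(clean_positions, n_steps=10)
```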
  • Step 202 Based on the data of multiple sample atoms, output the sample graph structure through a neural network model.
  • the sample graph structure includes multiple sample nodes and multiple sample edges. Any sample node represents the data of a sample atom, and any sample edge represents the distance between sample atoms corresponding to two sample nodes at both ends.
  • the data of sample noisy small molecules can be input into the neural network model, and the neural network model constructs a sample graph structure based on the data of each sample atom.
  • the sample graph structure includes multiple sample nodes, and any sample node represents the data of a sample atom. There may or may not be an edge between the sample nodes corresponding to any two sample atoms. When there is an edge between the sample nodes corresponding to two sample atoms, the edge can be called a sample edge, and the sample edge represents the distance between the two sample atoms.
  • the neural network model is an initial network model.
  • the model structure, model parameters, etc. of the neural network model are exactly the same as the model structure, model parameters, etc. of the initial network model.
  • the initial network model includes at least one of a small molecule encoder, a protein encoder, a frequency encoder, a graph structure generator, and a noise generator, and the functions of each are described below, which will not be repeated here.
  • the neural network model is a model obtained by training the initial network model at least once in the manner of steps 201 to 204. In this case, the neural network model and the initial network model only differ in model parameters, and the model structures of the two are the same.
  • step 202 includes steps 2021 to 2023.
  • Step 2021 extracting features from data of multiple sample atoms through a neural network model to obtain initial atomic features of each sample atom.
  • the neural network model includes a small molecule encoder, which can input data of sample noisy small molecules into the small molecule encoder, and perform feature extraction on the data of each sample atom through the small molecule encoder to obtain the initial atomic features of each sample atom.
  • the embodiment of the present application does not limit the model structure, model parameters, etc. of the small molecule encoder.
  • the small molecule encoder is an autoencoder (AE) or a variational autoencoder (VAE).
  • the data of any sample atom includes at least one of the type data of the sample atom and the position data of the sample atom.
  • the type data of the sample atom is encoded by a small molecule encoder to obtain the type characteristics of the sample atom.
  • the type data of the sample atom is an element symbol that can characterize the type of the sample atom.
  • the element symbol is encoded by the small molecule encoder using one-hot encoding, multi-hot encoding, etc., to obtain the type features of the sample atom.
  • the position characteristics of the sample atom are determined by the small molecule encoder based on the position data of the sample atom.
  • for example, the position data is the three-dimensional coordinates of the sample atoms, and the three-dimensional coordinates of the sample atoms are used as the position features of the sample atoms by the small molecule encoder, or the three-dimensional coordinates of the sample atoms are normalized by the small molecule encoder to obtain the position features of the sample atoms.
  • the type features of the sample atoms or the position features of the sample atoms can be used as the initial atomic features of the sample atoms, or the type features of the sample atoms and the position features of the sample atoms can be spliced to obtain the initial atomic features of the sample atoms.
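  • For illustration, the splicing of type features and position features described in step 2021 might look like the following sketch; the element vocabulary and the helper names are assumptions and are not part of the original disclosure.

```python
import numpy as np

ELEMENTS = ["H", "C", "N", "O", "S"]   # assumed element vocabulary for the example

def one_hot_type(element: str) -> np.ndarray:
    """One-hot encoding of the type data of a sample atom."""
    vec = np.zeros(len(ELEMENTS))
    vec[ELEMENTS.index(element)] = 1.0
    return vec

def initial_atomic_feature(element: str, position) -> np.ndarray:
    """Splice the type feature and the position feature of a sample atom."""
    type_feature = one_hot_type(element)
    position_feature = np.asarray(position, dtype=float)   # or a normalized version
    return np.concatenate([type_feature, position_feature])

feature = initial_atomic_feature("O", (1.21, 0.03, -0.02))   # length 5 + 3 = 8
```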
  • Step 2022 obtaining data of sample proteins, performing feature extraction on the data of sample proteins through a neural network model, and obtaining features of the sample proteins.
  • the sample protein includes a plurality of atoms, and any atom has its corresponding data, and the relevant information of the atom is described by the data of the atom.
  • the data of any atom includes at least one of the position data of the atom and the type data of the atom.
  • the data of each atom constitutes the data of the sample protein. It is understandable that the data of the sample protein may include other data in addition to the data of each atom, for example, the other data includes data for characterizing the type of protein to which the sample protein belongs.
  • the neural network model may further include a protein encoder, and the data of the sample protein may be input into the protein encoder, and the protein encoder may be used to extract features of the data of each atom in the sample protein to obtain features of each atom in the sample protein.
  • the embodiments of the present application do not limit the model structure, model parameters, etc. of the protein encoder.
  • the protein encoder is a VAE or SchNet, wherein SchNet is a variant of the Deep Tensor Neural Network (DTNN).
  • the type data of each atom in the sample protein is encoded by a protein encoder to obtain the type characteristics of each atom in the sample protein.
  • the position characteristics of each atom in the sample protein are determined by the protein encoder based on the position data of each atom in the sample protein.
  • the type characteristics of any atom in the sample protein or the position characteristics of the atom are used as the characteristics of the atom, or the type characteristics of the atom and the position characteristics of the atom are spliced to obtain the characteristics of the atom.
  • after the characteristics of each atom in the sample protein are obtained, it is equivalent to obtaining the characteristics of the sample protein.
  • the characteristics of each atom in the sample protein can be used as the characteristics of the sample protein.
  • the characteristics of each atom in the sample protein can be subjected to convolution processing, normalization processing, regularization processing, etc. to obtain the characteristics of the sample protein.
  • Step 2023 based on the initial atomic features of each sample atom and the features of the sample protein, the sample graph structure is determined through a neural network model.
  • the neural network model may further include a graph structure generator.
  • the initial atomic features of each sample atom and the features of the sample protein may be input into the graph structure generator, and the graph structure generator generates a sample graph structure.
  • a sample graph structure is generated based on the characteristics of the sample protein, so that when the predicted noise data is determined based on the sample graph structure, the predicted noise data is the noise data in the sample noisy small molecule data determined by the neural network model with reference to the sample protein.
  • when the sample noisy small molecule data is denoised based on the predicted noise data, the small molecule corresponding to the denoised small molecule data is more likely to bind to the sample protein, and the higher the probability of a small molecule binding to the protein, the more likely the small molecule is to become a drug. Therefore, determining the sample graph structure based on the characteristics of the sample protein and determining the predicted noise data through the sample graph structure is conducive to determining the data of small molecules that can bind to the sample protein based on the predicted noise data, thereby obtaining effective small molecules that can bind to the sample protein and improving the efficiency of drug research and development.
  • the data of the sample protein can be understood as the constraint conditions for determining the noise data in the data of the sample noisy small molecule.
  • the sample protein data is used as a constraint condition to perform noise addition on any small molecule data after noise addition.
  • the noise addition process is shown in the following formula (1):
  • $q(G_{1:T} \mid G_0, p_{ctx}) = \prod_{t=1}^{T} q(G_t \mid G_{t-1}, p_{ctx})$ (1)
  • wherein $G_0$ represents the small molecule data after the 0th noise addition process, $G_t$ represents the small molecule data after the t-th noise addition process, $G_{t-1}$ represents the small molecule data after the (t-1)-th noise addition process, and $G_{1:T}$ represents the small molecule data after the 1st to the T-th noise addition processes; $p_{ctx}$ represents the data of the sample protein; $q(\cdot)$ is the function symbol of the noise addition process, and $\prod$ is the cumulative multiplication symbol.
  • $q(G_{1:T} \mid G_0, p_{ctx})$ represents that, with the data of the sample protein as a condition, the small molecule data after the 0th noise addition process is subjected to T noise addition processes, and the small molecule data after the 1st to the T-th noise addition processes are obtained in sequence.
  • $q(G_t \mid G_{t-1}, p_{ctx})$ represents that, with the data of the sample protein as a condition, the small molecule data after the (t-1)-th noise addition process is subjected to the t-th noise addition process, and the small molecule data after the t-th noise addition process is obtained.
  • the small molecule data after the t-th noise addition process satisfies formula (2) shown below:
  • $q(G_t \mid G_{t-1}, p_{ctx}) = \mathcal{N}(G_t; \sqrt{1-\beta_t}\,G_{t-1}, \beta_t I)$ (2)
  • wherein $\mathcal{N}(\cdot)$ is the function symbol of the normal distribution function and $I$ is the parameter of the normal distribution function; $\beta_1, \dots, \beta_T$ are fixed variance parameters, $\beta_t$ is the t-th variance parameter, and the t-th variance parameter satisfies $\beta_t \in (0, 1)$.
  • formula (2) indicates that the small molecule data after the t-th noise addition process conforms to the normal distribution function $\mathcal{N}(G_t; \sqrt{1-\beta_t}\,G_{t-1}, \beta_t I)$.
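  • For illustration, one noise addition step consistent with formula (2) as reconstructed above can be sketched as follows; the √(1-β_t) scaling is the standard diffusion form assumed in that reconstruction, and all names are hypothetical.

```python
import numpy as np

def add_noise_step(G_prev: np.ndarray, beta_t: float):
    """Sample G_t ~ N(sqrt(1 - beta_t) * G_{t-1}, beta_t * I).

    Returns the small molecule data after the t-th noise addition process
    together with the noise that was added (usable as labeled noise).
    """
    eps = np.random.randn(*G_prev.shape)
    G_t = np.sqrt(1.0 - beta_t) * G_prev + np.sqrt(beta_t) * eps
    return G_t, eps

G_0 = np.random.rand(5, 3)               # position data of 5 sample atoms
betas = np.linspace(1e-4, 0.02, 1000)    # fixed variance parameters beta_1 .. beta_T
G_t = G_0
for beta in betas[:10]:                  # the first 10 noise addition processes
    G_t, _ = add_noise_step(G_t, beta)
```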
  • since the sample protein data is used as a constraint condition when performing noise addition on the small molecule data after any noise addition process, the sample protein data is also needed as a constraint condition when denoising the small molecule data after any noise addition process.
  • the denoising process is shown in the following formula (3):
  • $p_\theta(G_{0:T-1} \mid G_T, p_{ctx}) = \prod_{t=1}^{T} p_\theta(G_{t-1} \mid G_t, p_{ctx})$ (3)
  • wherein $G_0$ represents the small molecule data after the 0th noise addition process, $G_t$ represents the small molecule data after the t-th noise addition process, $G_{t-1}$ represents the small molecule data after the (t-1)-th noise addition process, and $G_{0:T-1}$ represents the small molecule data after the 0th to the (T-1)-th noise addition processes; $p_{ctx}$ represents the data of the sample protein; $p_\theta(\cdot)$ is the function symbol of the denoising process, and $\prod$ is the cumulative multiplication symbol.
  • $p_\theta(G_{0:T-1} \mid G_T, p_{ctx})$ represents that, with the data of the sample protein as a condition, the small molecule data after the T-th noise addition process is subjected to T denoising processes, and the small molecule data after the (T-1)-th to the 0th noise addition processes are obtained in sequence.
  • $p_\theta(G_{t-1} \mid G_t, p_{ctx})$ represents that, with the data of the sample protein as a condition, the small molecule data after the t-th noise addition process is subjected to one denoising process, and the small molecule data after the (t-1)-th noise addition process is obtained.
  • the small molecule data after the (t-1)-th noise addition process satisfies formula (4) shown below:
  • $p_\theta(G_{t-1} \mid G_t, p_{ctx}) = \mathcal{N}(G_{t-1}; \mu_\theta(G_t, t, p_{ctx}), \sigma_t^2 I)$ (4)
  • wherein $\mathcal{N}(\cdot)$ is the function symbol of the normal distribution function and $I$ is the parameter of the normal distribution function; $\mu_\theta$ is the mean and $\sigma_t$ is the variance, whose value can be any preset value. In the embodiment of the present application, formula (4) represents that the small molecule data after the (t-1)-th noise addition process conforms to the normal distribution function $\mathcal{N}(G_{t-1}; \mu_\theta(G_t, t, p_{ctx}), \sigma_t^2 I)$, wherein $\mu_\theta$ is the parameter that the neural network model in the embodiment of the present application needs to learn; in the process of training the neural network model, maximum likelihood estimation needs to be performed on $\mu_\theta$.
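  • Correspondingly, one denoising step according to formula (4) can be sketched as follows, assuming the learned mean μ_θ is supplied by a trained model (replaced here by a placeholder); this is illustrative only and not the patented implementation.

```python
import numpy as np

def denoise_step(G_t: np.ndarray, mu_theta: np.ndarray, sigma_t: float) -> np.ndarray:
    """Sample G_{t-1} ~ N(mu_theta(G_t, t, p_ctx), sigma_t^2 * I).

    `mu_theta` is the mean predicted by the trained model (the parameter the
    neural network learns); `sigma_t` can be any preset value.
    """
    return mu_theta + sigma_t * np.random.randn(*G_t.shape)

# Illustrative usage with a dummy "model" that just shrinks the data slightly.
G_t = np.random.rand(5, 3)
predicted_mean = 0.99 * G_t                        # placeholder for mu_theta(G_t, t, p_ctx)
G_t_minus_1 = denoise_step(G_t, predicted_mean, sigma_t=0.05)
```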
  • step 2023 includes: for any sample atom, fusing the initial atomic features of any sample atom and the features of the sample protein through a neural network model to obtain the first atomic features of any sample atom; based on the first atomic features of each sample atom, determining the first distance between every two sample atoms; based on the first atomic features of each sample atom and the first distance between every two sample atoms, determining the sample graph structure.
  • the initial atomic features of the sample atoms and the features of the sample protein are fused to obtain the first atomic features of the sample atoms, so that the first atomic features of the sample atoms can be expressed with reference to the features of the sample protein, and the sample graph structure is then determined based on the features of the sample protein.
  • in this way, the predicted noise data is the noise data in the data of the sample noisy small molecule determined with reference to the sample protein, which is beneficial to determining the data of small molecules that can bind to the sample protein based on the predicted noise data, thereby obtaining effective small molecules that can bind to the sample protein and improving the efficiency of drug research and development.
  • the initial atomic features of any sample atom and the features of the sample protein are spliced, added, multiplied, or arbitrarily fused to obtain the first atomic features of the sample atom.
  • based on the first atomic features of any two sample atoms and a distance formula, the first distance between the two sample atoms is determined.
  • the embodiment of the present application does not limit the distance formula.
  • the distance formula is a cosine distance formula, a cross entropy distance formula, or a relative entropy distance formula. In this way, the first distance between every two sample atoms can be determined.
  • the first atomic feature of any sample atom is used as a sample node, and the first distance between any two sample atoms is used as an edge between sample nodes corresponding to the two sample atoms.
  • each sample node and the edge between every two sample nodes can be determined to obtain a sample graph structure.
  • the first atomic feature of any sample atom is used as a sample node. For any two sample nodes, if the first distance between the sample atoms corresponding to any two sample nodes is greater than the distance threshold, it is determined that there is no edge between the two sample nodes. If the first distance between the sample atoms corresponding to any two sample nodes is not greater than the distance threshold, the first distance between the sample atoms corresponding to the two sample nodes is determined as the edge between the two sample nodes. In this way, the edges existing between each sample node and any two sample nodes can be determined, thereby obtaining a sample graph structure.
  • the embodiment of the present application does not limit the distance threshold.
  • the distance threshold is a value set according to artificial experience, or the distance threshold is the farthest distance at which an interaction force can exist between two sample atoms.
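  • A sketch of the thresholded graph construction described above is given below; the Euclidean distance is used as one possible choice of distance formula, and all names are hypothetical rather than part of the original disclosure.

```python
import numpy as np

def build_graph(atom_features: np.ndarray, positions: np.ndarray, threshold: float) -> dict:
    """Build a sample graph structure.

    Nodes are the per-atom feature vectors; an edge between atoms i and j stores
    their distance and exists only if that distance is not greater than `threshold`.
    """
    n = len(atom_features)
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            dist = float(np.linalg.norm(positions[i] - positions[j]))
            if dist <= threshold:
                edges[(i, j)] = dist
    return {"nodes": atom_features, "edges": edges}

features = np.random.rand(5, 8)    # e.g. the first atomic features of 5 sample atoms
coords = np.random.rand(5, 3)      # positions of the 5 sample atoms
sample_graph = build_graph(features, coords, threshold=2.0)
```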
  • the sample graph structure is determined according to the initial atomic features of each sample atom and the features of the sample protein.
  • the present application also provides an implementation A2, in which the sample graph structure can be determined only according to the initial atomic features of each sample atom.
  • based on the initial atomic features of each sample atom, the initial distance between every two sample atoms is determined; based on the initial atomic features of each sample atom and the initial distance between every two sample atoms, the sample graph structure is determined.
  • the method for determining the initial distance between two sample atoms is similar to the method for determining the first distance between two sample atoms, and the method for determining the sample graph structure based on the initial atomic features of each sample atom and the initial distance between every two sample atoms is similar to the method for “determining the sample graph structure based on the first atomic features of each sample atom and the first distance between every two sample atoms”, which will not be described in detail here.
  • the embodiment of the present application also provides a method for determining a sample graph structure that is different from implementation methods A1 and A2, as shown in implementation method A3 below.
  • the data of the noisy small molecules in the sample is the initial noise data or is obtained by performing at least one denoising process on the initial noise data.
  • the initial noise data can be regarded as the small molecule data after the T-th noise processing.
  • the small molecule data after the T-th noise processing is subjected to the first denoising process
  • the small molecule data after the T-1-th noise processing is obtained
  • the small molecule data after the T-1-th noise processing is subjected to the second denoising process
  • the small molecule data after the T-2-th noise processing is obtained; and so on.
  • the small molecule data after the T-th noise processing is subjected to T denoising processes, and the small molecule data after the 0-th noise processing is obtained, that is, the data of the sample small molecules mentioned above.
  • that is, each denoising process converts the small molecule data after the t-th noise addition process into the small molecule data after the (t-1)-th noise addition process.
  • the small molecule data after the T-th noise addition process can be continuously denoised, and after T denoising processes, the small molecule data after the 0th noise addition process is obtained.
  • the small molecule data after the t-th noise addition process may be used as the sample noisy small molecule data, where t is a positive integer greater than or equal to 1 and less than or equal to T.
  • step 202 includes steps 2024 to 2025.
  • Step 2024 obtaining sample denoising times information, where the sample denoising times information represents the number of denoising processes performed from the initial noise data to the sample noisy small molecule data.
  • the small molecule data after the 0th noise addition process is subjected to T noise addition processes, and the small molecule data after the 1st to the Tth noise addition processes are obtained in sequence.
  • the small molecule data after the Tth noise addition process is subjected to T denoising processes, and the small molecule data after the T-1th to the 0th noise addition processes are obtained in sequence. Therefore, the tth noise addition process and the T-tth denoising process are two opposite processes.
  • the sample denoising times information is T-t.
  • Step 2025 based on the sample denoising times information and the data of multiple sample atoms, determine the sample graph structure through a neural network model.
  • the sample graph structure can be determined based on the sample denoising times information and the data of multiple sample atoms at least according to the implementation method B1 or implementation method B2 mentioned below.
  • the sample denoising times information can reflect the number of denoising processes performed from the initial noise data to the sample noisy small molecule data.
  • the sample denoising times information and the data of multiple sample atoms are combined to determine the sample graph structure, thereby predicting noise data based on the sample graph structure more quickly, thereby improving the efficiency of drug research and development.
  • step 2025 includes steps C1 to C3.
  • Step C1 extracting features of sample denoising times information through a neural network model to obtain sample denoising times features.
  • the neural network model may further include a frequency encoder.
  • the sample denoising frequency information is input into the frequency encoder, and the frequency encoder performs feature extraction on the sample denoising frequency information to obtain the sample denoising frequency feature.
  • the embodiment of the present application does not limit the model structure, model parameters, etc. of the frequency encoder.
  • the frequency encoder is a multi-layer perceptron or AE, etc.
  • the sample denoising times information is encoded by the frequency encoder using one-hot encoding, multi-hot encoding, etc., to obtain the sample denoising times feature.
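  • For illustration, a one-hot encoding of the sample denoising times information, which is one simple form the frequency encoder could take, might look like the sketch below; the maximum number of steps is an assumption.

```python
import numpy as np

def denoising_times_feature(n_denoising_steps: int, max_steps: int = 1000) -> np.ndarray:
    """One-hot encode the sample denoising times information T - t."""
    feature = np.zeros(max_steps + 1)
    feature[n_denoising_steps] = 1.0
    return feature

times_feature = denoising_times_feature(n_denoising_steps=3)   # T - t = 3
```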
  • Step C2 extracting features from the data of multiple sample atoms respectively through a neural network model to obtain initial atomic features of each sample atom.
  • the small molecule encoder in the neural network model can be used to extract features from the data of each sample atom to obtain the initial atomic features of each sample atom.
  • the content of step C2 can be found in the description of step 2021 and will not be repeated here.
  • Step C3 based on the sample denoising times feature and the initial atomic features of each sample atom, the sample graph structure is determined through a neural network model.
  • the sample graph structure can be determined based on the sample denoising times feature and the initial atomic features of each sample atom at least according to implementation method D1 or implementation method D2.
  • step C3 includes: for any sample atom, the initial atomic feature and the sample denoising frequency feature of any sample atom are merged through a neural network model to obtain the second atomic feature of any sample atom; based on the second atomic feature of each sample atom, the second distance between every two sample atoms is determined; based on the second atomic feature of each sample atom and the second distance between every two sample atoms, the sample graph structure is determined.
  • the initial atomic features of any sample atom and the sample denoising times features are spliced, added, multiplied, or arbitrarily fused to obtain the second atomic features of the sample atom.
  • based on the second atomic features of any two sample atoms and the distance formula, the second distance between the two sample atoms is determined. In this way, the second distance between every two sample atoms can be determined.
  • the second atomic feature of any sample atom is used as a sample node, and the second distance between any two sample atoms is used as an edge between sample nodes corresponding to the two sample atoms.
  • each sample node and the edge between every two sample nodes can be determined to obtain a sample graph structure.
  • the second atomic feature of any sample atom is used as a sample node. For any two sample nodes, if the second distance between the sample atoms corresponding to any two sample nodes is greater than the distance threshold, it is determined that there is no edge between the two sample nodes. If the second distance between the sample atoms corresponding to any two sample nodes is not greater than the distance threshold, the second distance between the sample atoms corresponding to the two sample nodes is determined as the edge between the two sample nodes. In this way, the edges existing between each sample node and any two sample nodes can be determined, thereby obtaining a sample graph structure.
  • step C3 includes: for any sample atom, the initial atomic feature, sample denoising frequency feature and sample protein feature of any sample atom are integrated through a neural network model to obtain the third atomic feature of any sample atom; based on the third atomic feature of each sample atom, the third distance between every two sample atoms is determined; based on the third atomic feature of each sample atom and the third distance between every two sample atoms, the sample graph structure is determined.
  • the initial atomic features of any sample atom, the sample denoising times features and the sample protein features are spliced, added or multiplied to obtain the third atomic features of the sample atom.
  • based on the third atomic features of any two sample atoms and the distance formula, the third distance between the two sample atoms is determined. In this way, the third distance between every two sample atoms can be determined.
  • the third atomic feature of any sample atom is used as a sample node, and the third distance between any two sample atoms is used as an edge between sample nodes corresponding to the two sample atoms.
  • each sample node and the edge between every two sample nodes can be determined to obtain a sample graph structure.
  • the third atomic feature of any sample atom is used as a sample node. For any two sample nodes, if the third distance between the sample atoms corresponding to any two sample nodes is greater than the distance threshold, it is determined that there is no edge between the two sample nodes. If the third distance between the sample atoms corresponding to any two sample nodes is not greater than the distance threshold, the third distance between the sample atoms corresponding to the two sample nodes is determined as the edge between the two sample nodes. In this way, the edges existing between each sample node and any two sample nodes can be determined, thereby obtaining a sample graph structure.
  • step 2025 includes: determining sample noise addition times information based on the sample denoising times information; performing feature extraction on the sample noise addition times information through the neural network model to obtain a sample noise addition times feature; performing feature extraction on the data of the multiple sample atoms through the neural network model to obtain the initial atomic features of each sample atom; and determining the sample graph structure through the neural network model based on the sample noise addition times feature and the initial atomic features of each sample atom.
  • for example, when the sample denoising times information is T-t, the sample noise addition times information can be determined to be t.
  • the sample noise addition times information is input into the frequency encoder of the neural network model, and the frequency encoder performs feature extraction on the sample noise addition times information to obtain the sample noise addition times feature.
  • for example, the frequency encoder performs encoding processing such as one-hot encoding or multi-hot encoding on the sample noise addition times information to obtain the sample noise addition times feature.
  • the data of each sample atom can be feature extracted to obtain the initial atomic features of each sample atom.
  • the method for determining the initial atomic features of the sample atoms can be seen in the description of step 2021, which will not be repeated here.
  • the sample graph structure is determined through a neural network model based on the number of sample noise addition characteristics and the initial atomic characteristics of each sample atom.
  • the initial atomic feature of any sample atom and the sample noise times feature are fused through a neural network model to obtain the fourth atomic feature of any sample atom; based on the fourth atomic feature of each sample atom, the fourth distance between every two sample atoms is determined; based on the fourth atomic feature of each sample atom and the fourth distance between every two sample atoms, the sample graph structure is determined.
  • the implementation method of the above process can be seen in the description of implementation method D1. The implementation principles of the two are similar and will not be repeated here.
  • the initial atomic feature, sample noise times feature and sample protein feature of any sample atom are fused through a neural network model to obtain the fifth atomic feature of any sample atom; based on the fifth atomic feature of each sample atom, the fifth distance between every two sample atoms is determined; based on the fifth atomic feature of each sample atom and the fifth distance between every two sample atoms, the sample graph structure is determined.
  • the implementation of the above process can be seen in the description of implementation D2, and the implementation principles of the two are similar, which will not be repeated here.
  • in the above description, the sample denoising times information is directly obtained and the sample graph structure is constructed using the sample denoising times information, or the sample noise addition times information is determined based on the sample denoising times information and the sample graph structure is constructed based on the sample noise addition times information.
  • alternatively, the sample noise addition times information is directly obtained and used to construct the sample graph structure, or the sample denoising times information is determined based on the sample noise addition times information and the sample graph structure is constructed based on the sample denoising times information, which will not be described in detail here.
  • Step 203 predict the sample graph structure through a neural network model to obtain predicted noise data.
  • the neural network model may further include a noise generator.
  • the sample graph structure is input into the noise generator, and the noise generator determines predicted noise data based on the sample graph structure, where the predicted noise data is noise data obtained through prediction.
  • the noise generator includes a graph encoder and an activation layer, wherein the functions of the graph encoder and the activation layer are described below and will not be repeated here.
  • the embodiment of the present application does not limit the network structure, network parameters, etc. of the graph encoder.
  • the graph encoder may be a graph auto-encoder (GAE), a graph variational auto-encoder (GVAE), etc.
  • the embodiment of the present application also does not limit the network structure, network parameters, etc. of the activation layer.
  • the activation layer may be a rectified linear unit (ReLU), an S-shaped growth curve (i.e., a sigmoid function), etc.
  • step 203 includes steps 2031 to 2033 .
  • Step 2031 extracting features from the sample graph structure through a neural network model to obtain the atomic features to be processed of each sample atom.
  • the sample graph structure can be input into a graph encoder, and the sample graph structure can be updated at least once by the graph encoder to obtain an updated sample graph structure, wherein each sample node in the updated sample graph structure is an atomic feature to be processed of each sample atom.
  • the sample graph structure includes multiple sample nodes and multiple sample edges, any sample node is the initial atomic feature of a sample atom, or any sample node is any one of the first atomic feature to the fifth atomic feature of a sample atom, and any sample node is connected to another sample node through a sample edge.
  • sample node and other sample nodes connected to the sample node through sample edges can be used to update the sample node, or the sample node, each sample edge with one end being the sample node, and other sample nodes connected to the sample node through sample edges can be used to update the sample node. In this way, each sample node in the sample graph structure can be updated to obtain an updated sample graph structure.
  • Each sample node in the updated sample graph structure is used as the atomic feature to be processed of each sample atom.
  • the updated sample graph structure is used as the sample graph structure, and each sample node in the sample graph structure is updated again to obtain an updated sample graph structure, and each sample node in the updated sample graph structure is used as the atomic feature to be processed of each sample atom.
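  • A simplified sketch of one round of the node update described above is given below; each sample node aggregates its neighboring nodes through the connecting sample edges. The plain distance-weighted mean aggregation is purely illustrative and stands in for whatever graph encoder is actually used.

```python
import numpy as np

def update_nodes(nodes: np.ndarray, edges: dict) -> np.ndarray:
    """One update round: each node is updated from its neighbors and the edges."""
    updated = nodes.copy()
    for i in range(len(nodes)):
        messages = []
        for (a, b), dist in edges.items():
            if a == i:
                messages.append(nodes[b] / (1.0 + dist))   # neighbor weighted by edge distance
            elif b == i:
                messages.append(nodes[a] / (1.0 + dist))
        if messages:
            updated[i] = nodes[i] + np.mean(messages, axis=0)
    return updated

nodes = np.random.rand(5, 8)                       # sample node features
edges = {(0, 1): 1.2, (1, 2): 1.8, (2, 4): 1.5}    # sample edges (distances)
nodes_after_one_round = update_nodes(nodes, edges)
```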
  • Step 2032 based on the atomic features to be processed of each sample atom, at least one of the predicted type noise data and the predicted position noise data is determined through a neural network model, wherein the predicted type noise data is noise data related to the type of the sample atom obtained through prediction, and the predicted position noise data is noise data related to the position of the sample atom obtained through prediction.
  • the predicted type noise data is noise data related to the type of the sample atom obtained through prediction
  • the predicted position noise data is noise data related to the position of the sample atom obtained through prediction.
  • the atomic features to be processed of each sample atom can be input into the activation layer in the noise generator, and the atomic features to be processed of each sample atom can be activated through the activation layer to obtain predicted type noise data and/or predicted position noise data.
  • the atomic features to be processed of each sample atom are activated through the activation layer to obtain type noise data of each sample atom, and the type noise data of any sample atom is noise data related to the type of the sample atom obtained through prediction.
  • the predicted type noise data includes the type noise data of each sample atom.
  • the atomic features to be processed of each sample atom are activated through the activation layer to obtain the position noise data of each sample atom, and the position noise data of any sample atom is the noise data related to the position of the sample atom obtained through prediction.
  • the predicted position noise data includes the position noise data of each sample atom.
  • Step 2033 Use at least one of the predicted type noise data and the predicted position noise data as predicted noise data.
  • the prediction type noise data or the prediction position noise data can be used as the prediction noise data.
  • the prediction type noise data and the prediction position noise data can be used as the prediction noise data, or the prediction type noise data or the prediction position noise data can be used as the prediction noise data.
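  • As an illustration of steps 2032 to 2033, each atom's feature to be processed can be projected to type noise data and position noise data; the linear projections below are hypothetical stand-ins for the activation layer of the noise generator.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_dim, n_types = 8, 5
W_type = rng.standard_normal((feature_dim, n_types))   # hypothetical projection to type noise
W_pos = rng.standard_normal((feature_dim, 3))          # hypothetical projection to position noise

def predict_noise(features_to_process: np.ndarray):
    """Map per-atom features to predicted type noise data and position noise data."""
    relu = lambda x: np.maximum(x, 0.0)                 # ReLU activation layer
    type_noise = relu(features_to_process @ W_type)     # predicted type noise data
    position_noise = features_to_process @ W_pos        # predicted position noise data
    return type_noise, position_noise

atom_features = rng.standard_normal((5, feature_dim))         # 5 sample atoms
pred_type_noise, pred_position_noise = predict_noise(atom_features)
predicted_noise = (pred_type_noise, pred_position_noise)      # step 2033
```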
  • step 203 includes steps 2034 to 2036.
  • Step 2034 for any sample edge included in the sample graph structure, if the distance represented by any sample edge is not greater than the reference distance, then any sample edge is determined as the first edge; the first edge is deleted from the multiple sample edges included in the sample graph structure through the neural network model to obtain the first graph structure.
  • the present application embodiment does not limit the reference distance.
  • the reference distance is a value set according to artificial experience, or the reference distance is not less than the distance of chemical bond action and not greater than the distance of van der Waals force action.
  • the distance of chemical bond action is less than the distance of van der Waals force action.
  • for example, the distance of chemical bond action is less than 2 angstroms (Å), and the distance of van der Waals force action is greater than 2 angstroms; then the reference distance can be determined to be 2 angstroms.
  • any sample edge is the initial distance between sample atoms corresponding to two sample nodes at both ends of the sample edge, or any sample edge is any one of the first distance to the fifth distance between sample atoms corresponding to two sample nodes at both ends of the sample edge.
  • any sample edge represents the distance between sample atoms corresponding to two sample nodes at both ends of the sample edge.
  • the sample edge is determined as the first edge, and the first edge is deleted from the sample graph structure by the graph structure generator of the neural network model. In this way, at least one first edge can be deleted from the sample graph structure to obtain a first graph structure.
  • the above-mentioned first graph structure is obtained by deleting at least one first edge from the sample graph structure.
  • there may be other ways to generate the first graph structure For example, when constructing the sample graph structure, if the distance between the sample atoms corresponding to any two sample nodes (the distance may be any one of the first distance to the fifth distance) is greater than the reference distance, the distance between the sample atoms corresponding to the two sample nodes is determined as the edge between the two sample nodes. If the distance between the sample atoms corresponding to any two sample nodes is not greater than the reference distance, it is determined that there is no edge between the two sample nodes. The sample graph structure constructed in this way is the first graph structure.
  • the sample edges related to chemical bonds in the sample graph structure can be deleted by deleting at least one first edge from the sample graph structure, so that the first graph structure includes sample edges related to van der Waals force.
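  • a minimal sketch of this distance-based edge filtering is given below; the (node i, node j, distance) edge representation, the function names and the 2 Å reference distance are assumptions. The complementary filter yields the second graph structure described in step 2037 below.

```python
# Minimal sketch of deleting "first edges" (distance not greater than the
# reference distance) so that the remaining first graph structure only keeps
# edges in the van der Waals range; the complementary filter keeps the
# chemical-bond-range edges for the second graph structure. The (i, j, d)
# edge tuples and the 2-angstrom reference distance are assumptions.
from typing import List, Tuple

Edge = Tuple[int, int, float]  # (sample node i, sample node j, distance)

def build_first_graph(sample_edges: List[Edge], reference_distance: float = 2.0) -> List[Edge]:
    """Delete edges whose distance is not greater than the reference distance."""
    return [(i, j, d) for (i, j, d) in sample_edges if d > reference_distance]

def build_second_graph(sample_edges: List[Edge], reference_distance: float = 2.0) -> List[Edge]:
    """Delete edges whose distance is greater than the reference distance."""
    return [(i, j, d) for (i, j, d) in sample_edges if d <= reference_distance]

edges = [(0, 1, 1.1), (0, 2, 3.4), (1, 2, 0.9), (2, 3, 4.8)]
print(build_first_graph(edges))   # [(0, 2, 3.4), (2, 3, 4.8)]  van der Waals range
print(build_second_graph(edges))  # [(0, 1, 1.1), (1, 2, 0.9)]  chemical bond range
```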
  • Step 2035 Based on the first graph structure, determine the first noise data through the neural network model.
  • the first noise data is determined based on the first graph structure through the neural network model, thereby achieving the first noise data determined based on sample edges related to van der Waals forces. Since the first noise data is obtained by analyzing the van der Waals forces in the sample noisy small molecules, the first noise data is related to a single factor, namely, the van der Waals forces, so that the model can focus on learning the mapping relationship between the noise data and the van der Waals forces, thereby improving the accuracy of the model in determining the noise data, that is, the accuracy of the first noise data is high.
  • the first graph structure may be input into a noise generator, and the noise generator determines first noise data based on the first graph structure, where the first noise data is noise data obtained through prediction.
  • step 2035 includes: performing feature extraction on the first graph structure through a neural network model to obtain the sixth atomic feature of each sample atom; determining at least one of first type noise data and first position noise data based on the sixth atomic feature of each sample atom through a neural network model, the first type noise data being noise data related to the type of the sample atom obtained through prediction, and the first position noise data being noise data related to the position of the sample atom obtained through prediction; and using at least one of the first type noise data and the first position noise data as the first noise data.
  • the first graph structure can be input into a graph encoder, and the first graph structure can be updated at least once by the graph encoder to obtain an updated first graph structure, wherein each sample node in the updated first graph structure is a sixth atomic feature of each sample atom.
  • the method of updating the first graph structure is similar to the method of updating the sample graph structure, as described in step 2031, and will not be repeated here.
  • the sixth atomic feature of each sample atom can be input into the activation layer in the noise generator, and the sixth atomic feature of each sample atom can be activated by the activation layer to obtain the first type of noise data and/or the first position noise data.
  • the determination method of the first type of noise data is similar to the determination method of the predicted type of noise data
  • the determination method of the first position noise data is similar to the determination method of the predicted position noise data, which will not be repeated here.
  • the first type noise data includes type noise data of each sample atom, and the type noise data of any sample atom is noise data related to the type of the sample atom obtained by prediction.
  • the first position noise data includes position noise data of each sample atom, and the position noise data of any sample atom is noise data related to the position of the sample atom obtained by prediction.
  • the first type of noise data or the first position noise data can be used as the first noise data.
  • the first type of noise data and the first position noise data can be used as the first noise data, or the first type of noise data or the first position noise data can be used as the first noise data.
  • Step 2036 determining predicted noise data based on the first noise data.
  • the first noise data may be used as the predicted noise data, or the first noise data may be multiplied by a corresponding weight to obtain the predicted noise data.
  • step 203 includes steps 2037 to 2039.
  • Step 2037 for any sample edge included in the sample graph structure, if the distance represented by any sample edge is greater than the reference distance, then any sample edge is determined as the second edge; the second edge is deleted from the multiple sample edges included in the sample graph structure through the neural network model to obtain a second graph structure.
  • the sample edge is determined as the second edge, and the second edge is deleted from the sample graph structure by the graph structure generator of the neural network model. In this way, at least one second edge can be deleted from the sample graph structure to obtain a second graph structure.
  • the above-mentioned second graph structure is obtained by deleting at least one second edge from the sample graph structure.
  • there may be other ways to generate the second graph structure For example, when constructing the sample graph structure, if the distance between the sample atoms corresponding to any two sample nodes (the distance may be any one of the first distance to the fifth distance) is greater than the reference distance, it is determined that there is no edge between the two sample nodes. If the distance between the sample atoms corresponding to any two sample nodes is not greater than the reference distance, the distance between the sample atoms corresponding to the two sample nodes is determined as the edge between the two sample nodes.
  • the sample graph structure constructed in this way is the second graph structure.
  • the sample edges related to van der Waals force in the sample graph structure can be deleted by deleting at least one second edge from the sample graph structure, so that the second graph structure includes sample edges related to chemical bonds.
  • Step 2038 Determine second noise data through a neural network model based on the second graph structure.
  • the second noise data is determined based on the second graph structure through the neural network model, thereby achieving the second noise data based on sample edges related to chemical bonds. Since the second noise data is obtained by analyzing the chemical bonds in the sample noisy small molecules, the second noise data is related to a single factor, namely the chemical bonds, so that the model can focus on learning the mapping relationship between noise data and chemical bonds, thereby improving the accuracy of the model in determining the noise data, that is, the accuracy of the second noise data is higher.
  • the second graph structure may be input into a noise generator, and the noise generator determines second noise data based on the second graph structure, where the second noise data is noise data obtained through prediction.
  • step 2038 includes: performing feature extraction on the second graph structure through a neural network model to obtain the seventh atomic feature of each sample atom; determining at least one of the second type noise data and the second position noise data based on the seventh atomic feature of each sample atom through the neural network model, the second type noise data being noise data related to the type of the sample atom obtained by prediction, and the second position noise data being noise data related to the position of the sample atom obtained by prediction; and using at least one of the second type noise data and the second position noise data as the second noise data.
  • the second graph structure can be input into a graph encoder, and the graph encoder updates the second graph structure at least once to obtain an updated second graph structure, wherein each sample node in the updated second graph structure is a seventh atomic feature of each sample atom.
  • the method of updating the second graph structure is similar to the method of updating the sample graph structure, as described in step 2031, and will not be repeated here.
  • the seventh atomic feature of each sample atom can be input into the activation layer in the noise generator, and the seventh atomic feature of each sample atom can be activated by the activation layer to obtain the second type of noise data and/or the second position noise data.
  • the second type of noise data is determined in a similar manner to the predicted type of noise data
  • the second position noise data is determined in a similar manner to the predicted position noise data, which will not be described in detail here.
  • the second type of noise data includes type noise data of each sample atom, and the type noise data of any sample atom is noise data related to the type of the sample atom obtained by prediction.
  • the second position noise data includes position noise data of each sample atom.
  • the position noise data of any sample atom is noise data related to the position of the sample atom obtained through prediction.
  • the second type noise data or the second position noise data may be used as the second noise data.
  • the second type noise data and the second position noise data may be used as the second noise data, or the second type noise data or the second position noise data may be used as the second noise data.
  • Step 2039 determining predicted noise data based on the second noise data.
  • the second noise data may be used as the predicted noise data, or the predicted noise data may be obtained by multiplying the second noise data by a corresponding weight.
  • step 203 includes: for any sample edge included in the sample graph structure, if the distance represented by any sample edge is not greater than the reference distance, then any sample edge is determined as the first edge; the first edge is deleted from the multiple sample edges included in the sample graph structure through the neural network model to obtain the first graph structure; the first noise data is determined based on the first graph structure through the neural network model.
  • any sample edge included in the sample graph structure if the distance represented by any sample edge is greater than the reference distance, then any sample edge is determined as the second edge; the second edge is deleted from the multiple sample edges included in the sample graph structure through the neural network model to obtain the second graph structure; the second noise data is determined based on the second graph structure through the neural network model.
  • the predicted noise data is determined based on the first noise data and the second noise data.
  • the first noise data may be determined according to the content of implementation mode E2, and the second noise data may be determined according to the content of implementation mode E3.
  • the first noise data and the second noise data may be determined as predicted noise data, or the first noise data and the second noise data may be subjected to weighted averaging, weighted summing, and other operations to obtain predicted noise data.
  • the first noise data includes first type noise data and first position noise data
  • the second noise data includes second type noise data and second position noise data.
  • the first type noise data and the second type noise data are subjected to weighted summation, weighted averaging and other operations to obtain predicted type noise data.
  • the first position noise data and the second position noise data are subjected to weighted summation, weighted averaging and other operations to obtain predicted position noise data.
  • the predicted noise data includes predicted type noise data and predicted position noise data.
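  • the weighted combination described above could be realized as in the sketch below; the equal 0.5/0.5 weights are an arbitrary assumption, since only some form of weighted summation or weighted averaging is required.

```python
# Sketch of combining the first noise data and the second noise data into the
# predicted noise data by weighted averaging; the 0.5/0.5 weights are assumed.
import torch

def combine_noise(first_type, first_pos, second_type, second_pos, w1=0.5, w2=0.5):
    predicted_type_noise = w1 * first_type + w2 * second_type
    predicted_position_noise = w1 * first_pos + w2 * second_pos
    return predicted_type_noise, predicted_position_noise

first_type, first_pos = torch.randn(5, 10), torch.randn(5, 3)
second_type, second_pos = torch.randn(5, 10), torch.randn(5, 3)
predicted_type, predicted_position = combine_noise(first_type, first_pos,
                                                   second_type, second_pos)
```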
  • Step 204 Based on the predicted noise data and the labeled noise data, the neural network model is trained to obtain a noise data determination model, where the noise data determination model is used to determine the final noise data based on the data of the noisy small molecule to be processed.
  • the loss of the neural network model can be determined based on the predicted noise data and the labeled noise data.
  • the neural network model is trained based on the loss of the neural network model to obtain a trained neural network model.
  • the trained neural network model meets the training end condition, the trained neural network model is used as the noise data determination model. If the trained neural network model does not meet the training end condition, the trained neural network model is used as the neural network model, and the neural network model is trained in the manner of steps 201 to 204 until the training end condition is met to obtain the noise data determination model.
  • the training termination condition is met when the number of training iterations reaches a set number of times, for example, 500 or 1000 times.
  • the training termination condition is met when the difference between the loss of the neural network model obtained in this training and the loss of the neural network model obtained in the previous training is within a set range.
  • the training termination condition is met when the gradient of the loss of the neural network model obtained in this training is within a set range.
  • the loss of the neural network model can be determined based on the predicted noise data and the labeled noise data according to formula (5) shown below:

        L = E_{G0~q(G0), ε~N(0, I), t} [ β_t · ||ε - ε_θ(G_t, p_ctx, t)||² ]        (5)

  • T represents the total number of denoising processes, and t is sampled from 1 to T.
  • β_t is a hyperparameter.
  • ε represents the labeled noise data, and the labeled noise data can be designed to obey the normal distribution N(0, I), where I is the parameter of the normal distribution function.
  • ε_θ(G_t, p_ctx, t) represents the predicted noise data, where G_t represents the small molecule data after the t-th noise addition process, and p_ctx represents the sample protein data.
  • ||ε - ε_θ(G_t, p_ctx, t)||² is the mean square error between the labeled noise data and the predicted noise data, and E is the symbol for averaging.
  • G0~q(G0) represents that the small molecule data after the 0th noise addition during the noise addition process is the same as the small molecule data after the 0th noise addition during the denoising process.
  • formula (5) corresponds to the variational bound on -log p_θ(G0 | p_ctx), where log is the symbol for logarithm; p_θ(G0 | p_ctx) represents the small molecule data after the 0th noise processing obtained by denoising with the data of the sample protein as a constraint; q(G_{1:T} | G0, p_ctx) represents the small molecule data after the 1st to the T-th noise processing obtained by performing T noise addition processes on the small molecule data after the 0th noise processing; and p_θ(G_{0:T} | p_ctx) represents the small molecule data after the 0th to the T-th noise processing obtained by performing T denoising processes with the data of the sample protein as a constraint.
  • the bound is expressed through relative entropy terms D_KL(q(G_{t-1} | G_t, G0, p_ctx) || p_θ(G_{t-1} | G_t, p_ctx)), where D_KL represents the function symbol of the relative entropy function; q(G_{t-1} | G_t, G0, p_ctx) represents, with the sample protein data as the constraint condition, the small molecule data after the (t-1)-th noise processing given the small molecule data after the t-th noise processing and the small molecule data after the 0th noise processing; and p_θ(G_{t-1} | G_t, p_ctx) represents, with the sample protein data as the constraint condition, the process of denoising the small molecule data after the t-th noise processing to obtain the small molecule data after the (t-1)-th noise processing.
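  • as a sketch only, the loss of formula (5) could be computed for a single training sample as follows, assuming a linear β_t schedule and Gaussian labeled noise; the schedule, the tensor shapes and the stand-in predicted noise are assumptions.

```python
# Illustrative computation of a formula-(5)-style loss for one training
# sample: a beta_t-weighted mean square error between the labeled noise and
# the predicted noise. The linear beta schedule and the stand-in predicted
# noise are assumptions made only for this sketch.
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)              # hyperparameters beta_t

def formula5_loss(predicted_noise: torch.Tensor,
                  labeled_noise: torch.Tensor,
                  t: int) -> torch.Tensor:
    mse = ((labeled_noise - predicted_noise) ** 2).mean()
    return betas[t] * mse

labeled_noise = torch.randn(5, 3)                  # labeled position noise, 5 atoms
predicted_noise = torch.randn(5, 3)                # stand-in for the model output
loss = formula5_loss(predicted_noise, labeled_noise, t=10)
```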
  • the predicted noise data includes predicted type noise data and predicted position noise data
  • the annotated noise data includes annotated type noise data and annotated position noise data.
  • the predicted type noise data is noise data related to the type of sample atoms obtained by prediction
  • the predicted position noise data is noise data related to the position of the sample atoms obtained by prediction.
  • the annotated type noise data is noise data related to the type of sample atoms obtained by annotation
  • the annotated position noise data is noise data related to the position of the sample atoms obtained by annotation.
  • Step 204 includes steps 2041 to 2043 .
  • Step 2041 determine a first loss based on the prediction type noise data and the annotation type noise data.
  • the first loss can be determined based on the prediction type noise data and the annotation type noise data according to the first loss function.
  • the embodiment of the present application does not limit the first loss function.
  • the first loss function is a relative entropy loss function, a mean absolute error (MAE) loss function, or a mean square error (MSE) loss function.
  • the MAE loss function is also called the L1 loss function
  • the MSE loss function is also called the L2 loss function.
  • the first loss function can also be a loss function after smoothing the L1 loss function using the L2 loss function, that is, the L1 loss function after smoothing.
  • Step 2042 Determine a second loss based on the predicted position noise data and the annotated position noise data.
  • the second loss can be determined based on the predicted position noise data and the annotated position noise data according to the second loss function.
  • the embodiment of the present application does not limit the second loss function.
  • the second loss function is a relative entropy loss function, an L1 loss function, an L2 loss function, or an L1 loss function after smoothing.
  • Step 2043 Based on the first loss and the second loss, the neural network model is trained to obtain a noise data determination model.
  • the first loss and the second loss may be subjected to weighted summation, weighted averaging and other operations to obtain the loss of the neural network model.
  • the neural network model is trained based on the loss of the neural network model to obtain a trained neural network model. If the trained neural network model meets the training end condition, the trained neural network model is used as the noise data determination model. If the trained neural network model does not meet the training end condition, the trained neural network model is used as the neural network model for the next training, and the neural network model is trained for the next time in the manner of steps 201 to 204 until the training end condition is met to obtain the noise data determination model.
  • the type of each sample atom is constrained by the first loss, and the position of each sample atom is constrained by the second loss, thereby improving the accuracy of the noise data determination model.
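  • a possible realization of steps 2041 to 2043 is sketched below, assuming an MSE first loss on the type noise, a smoothed L1 second loss on the position noise, and equal weights; these choices are among those allowed above but are not mandated.

```python
# Sketch of steps 2041 to 2043: a first loss between predicted and annotated
# type noise, a second loss between predicted and annotated position noise,
# and a weighted sum as the loss of the neural network model. The specific
# loss functions and weights are assumed choices.
import torch
import torch.nn.functional as F

def combined_loss(pred_type, anno_type, pred_pos, anno_pos, w_type=1.0, w_pos=1.0):
    first_loss = F.mse_loss(pred_type, anno_type)        # e.g. MSE (L2) loss
    second_loss = F.smooth_l1_loss(pred_pos, anno_pos)   # e.g. smoothed L1 loss
    return w_type * first_loss + w_pos * second_loss

loss = combined_loss(torch.randn(5, 10), torch.randn(5, 10),
                     torch.randn(5, 3), torch.randn(5, 3))
```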
  • the loss of the neural network model can also be determined based on other losses.
  • the sample noisy small molecule data can be denoised based on the predicted noise data to obtain denoised small molecule data.
  • the denoised small molecule data includes first data of multiple sample atoms. Based on the first data of any sample atom and the data of the sample protein, the distance between the sample atom and the surface of the sample protein is determined.
  • if the distance between the sample atom and the surface of the sample protein is smaller than the size of the sample atom, the distance is subtracted from the size of the sample atom, and the difference is taken as the third loss of the sample atom; if the distance between the sample atom and the surface of the sample protein is not smaller than the size of the sample atom, the sample atom does not have a third loss.
  • the loss of the neural network model is then determined with the third losses of the sample atoms taken into account, and the noise data determination model is trained using the loss of the neural network model.
  • in other words, when the distance between a sample atom and the surface of the sample protein is smaller than the size of the sample atom, the sample atom has a third loss, and the third loss of the sample atom is the difference between the size of the sample atom and the distance between the sample atom and the surface of the sample protein.
  • the third loss of the sample atom is used to constrain the position of each atom in the small molecule, avoid overlapping between the atoms in the small molecule and the protein, and improve the accuracy of the noise data determination model.
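  • a sketch of this third loss is given below, assuming the distance from each sample atom to the protein surface and each atom's size are already available as tensors; the hinge-style formulation and the summation over atoms are assumptions.

```python
# Sketch of the third loss: a sample atom is penalised only when its distance
# to the sample protein surface is smaller than the atom's size, and the
# penalty is the difference between the two. Precomputed inputs and the sum
# over atoms are assumptions.
import torch

def third_loss(dist_to_surface: torch.Tensor, atom_size: torch.Tensor) -> torch.Tensor:
    # dist_to_surface, atom_size: [num_sample_atoms]
    per_atom = torch.clamp(atom_size - dist_to_surface, min=0.0)  # 0 when no overlap
    return per_atom.sum()

# Only the first atom lies closer to the protein surface than its size.
loss3 = third_loss(torch.tensor([0.5, 2.0, 1.1]), torch.tensor([1.2, 1.2, 1.0]))
```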
  • it should be noted that the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data comply with relevant laws, regulations and standards of the relevant regions. For example, the data of sample noisy small molecules and the annotated noise data involved in this application are all obtained with full authorization.
  • the above method determines the sample graph structure based on the data of multiple sample atoms in the sample noisy small molecule data, predicts the sample graph structure through a neural network model to determine the predicted noise data, and obtains the noise data determination model based on the predicted noise data and the labeled noise data.
  • the noise data determination model can be used to determine the final noise data in the data of the noisy small molecules to be processed, so that the data of the noisy small molecules to be processed can be denoised based on the final noise data to obtain the denoised small molecule data, and then drug research and development can be carried out based on the denoised small molecule data to improve the efficiency of drug research and development.
  • the embodiment of the present application also provides a method for determining noise data, which can be applied in the above-mentioned implementation environment, and the noise data determination model can be used to determine the final noise data in the data of the noisy small molecules to be processed.
  • the terminal device 101 or server 102 that executes the method for determining noise data in the embodiment of the present application is referred to as an electronic device, and the method can be performed by an electronic device. As shown in Figure 4, the method includes the following steps.
  • Step 401 obtaining data of noisy small molecules to be processed.
  • the data of the noisy small molecules to be processed is small molecule data carrying noise data, and the data of the noisy small molecules to be processed includes data of multiple atoms to be processed.
  • the description of step 401 can be seen in the description of "sample noisy small molecules data" in step 201. The implementation principles of the two are similar and will not be repeated here.
  • Step 402 Based on the data of the multiple atoms to be processed, determine the graph structure to be processed by using the noise data determination model.
  • the graph structure to be processed includes multiple nodes and multiple edges, any node represents an atom to be processed, and any edge represents the distance between the atoms to be processed corresponding to the two nodes at both ends, and the noise data determination model is trained according to the training method of the noise data determination model related to Figure 2.
  • the description of step 402 can be found in the description of step 202, and the implementation principles of the two are similar, which will not be repeated here.
  • step 402 includes: extracting features from data of multiple atoms to be processed by a noise data determination model to obtain initial atomic features of each atom to be processed; obtaining data of a protein to be processed, extracting features from data of the protein to be processed by a noise data determination model to obtain features of the protein to be processed; determining a graph structure to be processed based on initial atomic features of each atom to be processed and features of the protein to be processed by a noise data determination model.
  • the data of the noisy small molecule to be processed is the initial noise data, or is obtained by performing at least one denoising process on the initial noise data; step 402 includes: obtaining denoising times information, the denoising times information characterizing the number of denoising processes performed from the initial noise data to the data of the noisy small molecule to be processed; and, based on the denoising times information and the data of the multiple atoms to be processed, determining the graph structure to be processed through the noise data determination model.
  • Step 403 Based on the graph structure to be processed, final noise data is determined by a noise data determination model.
  • the final noise data is the noise data in the data of the noisy small molecules to be processed.
  • the description of step 403 can be found in the description of step 203. The implementation principles of the two are similar and will not be repeated here.
  • step 403 further includes: denoising the noisy small molecule data to be processed based on the final noise data to obtain first small molecule data; and in response to the first small molecule data satisfying the data condition, taking the first small molecule data as target small molecule data.
  • the final noise data can be removed from the data of the noisy small molecules to be processed, so as to perform denoising on the data of the noisy small molecules to be processed, obtain the first small molecule data, and when the first small molecule data meets the data condition, use the first small molecule data as the target small molecule data.
  • the embodiment of the present application does not limit whether the first small molecule data satisfies the data condition.
  • the first small molecule data is obtained by performing at least one denoising process on the initial noise data, so when the number of denoising processes corresponding to the first small molecule data reaches a set number of times, the first small molecule data satisfies the data condition.
  • the error between the noisy small molecule to be processed and the first small molecule can be determined based on the data of the noisy small molecule to be processed and the first small molecule data. If the error is within a set range, it is determined that the first small molecule data meets the data condition; if the error is outside the set range, it is determined that the first small molecule data does not meet the data condition.
  • the method further includes: in response to the first small molecule data not satisfying the data condition, based on the first small molecule data, determining a reference graph structure through a noise data determination model; based on the reference graph structure, determining reference noise data through a noise data determination model; denoising the first small molecule data based on the reference noise data to obtain second small molecule data; in response to the second small molecule data satisfying the data condition, using the second small molecule data as the target small molecule data.
  • the first small molecule data can be regarded as the data of the noisy small molecule to be processed, and the reference graph structure is determined based on the first small molecule data through the noise data determination model in the manner of step 402.
  • the reference graph structure can be regarded as the graph structure to be processed.
  • the reference noise data is determined based on the reference graph structure by the noise data determination model, wherein the reference noise data can be regarded as the final noise data. Therefore, the content of determining the reference noise data based on the first small molecule data is similar to the content of steps 401 to 403, and will not be repeated here.
  • the reference noise data is removed from the first small molecule data to perform denoising on the first small molecule data to obtain the second small molecule data, and when the second small molecule data meets the data condition, the second small molecule data is used as the target small molecule data.
  • if the second small molecule data does not meet the data condition, the second small molecule data can be used as the first small molecule data, and the target small molecule data is obtained by determining the reference noise data in the first small molecule data and removing the reference noise data from the first small molecule data, repeating until the data condition is met.
  • it should be noted that the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data comply with relevant laws, regulations and standards of the relevant regions. For example, the data of noisy small molecules to be processed involved in this application are all obtained with full authorization.
  • the above method determines the graph structure to be processed based on the data of the noisy small molecules to be processed, and determines the final noise data based on the graph structure to be processed through the noise data determination model, so that the data of the noisy small molecules to be processed can be denoised based on the final noise data to obtain the denoised small molecule data, and then drug research and development can be carried out based on the denoised small molecule data to improve the efficiency of drug research and development.
  • a schematic diagram of a training process of a noise data determination model is to obtain a noise data determination model by training a neural network model at least once, wherein the neural network model includes a small molecule encoder, a protein encoder, a frequency encoder, and an equivariant neural network, and the equivariant neural network includes a graph structure generator and a noise generator.
  • the noise data determination model also includes the above-mentioned network blocks.
  • the small molecule data after the 0th noise processing includes the type data of multiple atoms and the position data of multiple atoms.
  • A can be used to represent the type data of multiple atoms
  • R can be used to represent the position data of multiple atoms.
  • the small molecule data after the 0th noise processing includes the type data of five atoms, and the type data of these five atoms are H, C, H, H, and O respectively.
  • the position data of an atom includes the horizontal coordinate (represented by x), the longitudinal coordinate (represented by y), and the vertical coordinate (represented by z), which can be abbreviated as [x, y, z].
  • the small molecule data after the 0th noise addition process includes position data of five atoms, and the position data of the five atoms are [1, 3, 1], [0, 2, 0], [1, 0, 1], [4, 3, 5], and [2, 0, 1], respectively.
  • the small molecule data after the 0th noise addition process can be subjected to the first noise addition process to obtain the small molecule data after the first noise addition process; based on the noise data of the second noise addition process, the small molecule data after the first noise addition process can be subjected to the second noise addition process to obtain the small molecule data after the second noise addition process; and so on. That is to say, based on the noise data of each noise addition process, the small molecule data after the 0th noise addition process can be subjected to T noise addition processes to obtain the small molecule data after the 1st to Tth noise addition processes, where T is a positive integer.
  • a small molecule data is randomly sampled from the small molecule data after the 1st to Tth noise addition processes to obtain the small molecule data after the tth noise addition process, where t is a positive integer greater than or equal to 1 and less than or equal to T.
  • the small molecule data after the t-th noise processing is the data of the sample noisy small molecules mentioned above. Since the small molecule data after the 0-th noise processing has been subjected to t-time noise processing, the type data of each atom included in the small molecule data after the t-th noise processing carries certain noise data, and the type data of each atom carrying certain noise data can be recorded as the type data of each sample atom. Similarly, the position data of each atom included in the small molecule data after the t-th noise processing also carries certain noise data, and the position data of each atom carrying certain noise data is recorded as the position data of each sample atom.
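  • the noise addition process could, for example, follow the standard Gaussian forward process sketched below for the position data; the β_t schedule and the closed-form jump to step t are assumptions, since the description above only requires t successive noise addition processes.

```python
# Sketch of the noise addition process on position data, assuming a standard
# Gaussian forward process: the data after the 0th noise addition is pushed
# to step t in closed form. The beta schedule and the closed-form jump are
# assumptions.
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(positions_0: torch.Tensor, t: int):
    """positions_0: [num_atoms, 3] position data after the 0th noise addition."""
    noise = torch.randn_like(positions_0)          # noise data of the t-th process
    a_bar = alphas_bar[t - 1]
    positions_t = a_bar.sqrt() * positions_0 + (1.0 - a_bar).sqrt() * noise
    return positions_t, noise

R0 = torch.tensor([[1., 3., 1.], [0., 2., 0.], [1., 0., 1.],
                   [4., 3., 5.], [2., 0., 1.]])    # the five example atoms above
t = torch.randint(1, T + 1, (1,)).item()           # randomly sampled t
Rt, noise_t = add_noise(R0, t)                     # sample noisy position data
```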
  • the small molecule data after the t-th noise processing is input into the small molecule encoder, and the type data of each sample atom is encoded by the small molecule encoder to obtain the type characteristics of each sample atom, and the position data of each sample atom is encoded by the small molecule encoder to obtain the position characteristics of each sample atom.
  • the initial atomic characteristics of any sample atom include the type characteristics of the sample atom and the position characteristics of the sample atom. At can be used to characterize the type characteristics of multiple sample atoms, and Rt can be used to characterize the position characteristics of multiple sample atoms.
  • the data of the sample protein can be input into the protein encoder, and the protein encoder can extract the features of the sample protein data to obtain the features of the sample protein.
  • the features of the sample protein, the type features of multiple sample atoms, and the position features of multiple sample atoms are spliced to obtain the first splicing feature.
  • the first splicing feature can be characterized by [At, Cp], Rt.
  • the first splicing feature includes the first atomic features of each sample atom mentioned above.
  • the number of times the sample has been noised can also be obtained, that is, t is obtained.
  • t is input into the frequency encoder, and t is encoded by the frequency encoder to obtain the sample noise-times feature, which can be represented by te.
  • the first splicing feature and the sample noise-times feature are spliced to obtain the second splicing feature, which can be represented by [At, Cp, te], Rt.
  • the second splicing feature includes the fifth atomic features of each sample atom mentioned above.
  • the second splicing feature is input into the equivariant neural network, and the sample graph structure is constructed based on the second splicing feature by the graph structure generator included in the equivariant neural network.
  • the sample graph structure includes multiple sample nodes and multiple sample edges, any sample node is the fifth atomic feature of a sample atom, and any sample edge is the fifth distance between two sample atoms determined based on the fifth atomic features of two sample atoms at both ends.
  • the graph structure generator deletes each sample edge whose fifth distance is not greater than the reference distance from the sample graph structure to obtain the first graph structure; the graph structure generator deletes each sample edge whose fifth distance is greater than the reference distance from the sample graph structure to obtain the second graph structure.
  • the dotted circle shown in Figure 5 represents the region in which the sample atoms whose fifth distance from the sample atom at the center of the circle is not greater than the reference distance are located.
  • the first graph structure is input into the first graph encoder to obtain the first graph feature, and the first graph feature includes the sixth atomic feature of each sample atom mentioned above.
  • the first graph feature is input into the first activation layer to obtain the first noise data, wherein the first noise data includes the first type noise data and the first position noise data.
  • the second graph structure is input into the second graph encoder to obtain the second graph feature, and the second graph feature includes the seventh atomic feature of each sample atom mentioned above.
  • the second graph feature is input into the second activation layer to obtain the second noise data, wherein the second noise data includes the second type noise data and the second position noise data.
  • the noise data of the t-th noise addition processing is obtained, wherein the noise data of the t-th noise addition processing is used as the labeled noise data.
  • the first noise data and the second noise data are used as the predicted noise data.
  • the predicted noise data and the labeled noise data can be used to determine the loss of the neural network model, and the neural network model is trained once based on the loss of the neural network model to obtain a trained neural network model.
  • when the trained neural network model meets the training end condition, the trained neural network model is the noise data determination model; when the trained neural network model does not meet the training end condition, the trained neural network model is used as the neural network model for the next training, and the neural network model can be trained again in the manner shown in FIG5 until the training end condition is met to obtain the noise data determination model.
  • randomly generated Gaussian noise data can be obtained.
  • the Gaussian noise data is used as the small molecule data of the Tth noise processing.
  • the noise data in the small molecule data after the Tth noise processing is determined by the noise data determination model, and the noise data can be recorded as the noise data of the Tth noise processing.
  • the small molecule data after the Tth noise processing is denoised to obtain the small molecule data after the T-1th noise processing; the noise data in the small molecule data after the T-1th noise processing is determined by the noise data determination model, and the noise data can be recorded as the noise data of the T-1th noise processing.
  • the small molecule data after the T-1th noise processing is denoised to obtain the small molecule data after the T-2th noise processing; and so on. That is to say, based on the noise data of each noise processing, the small molecule data after the Tth noise processing is denoised T times, and the small molecule data after the 0th noise processing can be obtained, that is, the target small molecule data mentioned above.
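  • the T-step denoising loop could be sketched as below, with the noise data determination model abstracted as a callable and the per-step update reduced to subtracting the predicted noise; the update rule and the function names are assumptions.

```python
# Sketch of the T-step denoising loop that turns randomly generated Gaussian
# noise data into target small molecule position data. `noise_model` stands in
# for the trained noise data determination model and `denoise_step` for the
# update rule that removes the predicted noise; both are assumptions.
import torch

def denoise_step(mol_t: torch.Tensor, noise_t: torch.Tensor, t: int) -> torch.Tensor:
    # Placeholder update: remove the noise data of the t-th noise processing.
    return mol_t - noise_t

def sample_small_molecule(noise_model, protein_data, num_atoms: int, T: int = 1000):
    mol = torch.randn(num_atoms, 3)                 # data after the T-th noise processing
    for t in range(T, 0, -1):
        noise_t = noise_model(mol, protein_data, t) # noise of the t-th noise processing
        mol = denoise_step(mol, noise_t, t)         # data after the (t-1)-th processing
    return mol                                      # data after the 0th noise processing

# Example with a dummy model that predicts zero noise everywhere.
mol_0 = sample_small_molecule(lambda m, p, t: torch.zeros_like(m),
                              protein_data=None, num_atoms=5)
```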
  • the target small molecule data is data describing the target small molecule.
  • Figure 6 is a schematic diagram of a target small molecule provided in an embodiment of the present application. Among them, (1) to (6) in Figure 6 respectively show 6 target small molecules.
  • the process of determining the noise data in the small molecule data after any noise addition process by the noise data determination model can be seen in FIG5 for the process of determining the predicted noise data based on the small molecule data after the t-th noise addition process, wherein the predicted noise data includes the first noise data and the second noise data.
  • the implementation principles of the two are similar and will not be repeated here.
  • the Gaussian noise data is denoised T times by the noise data determination model, which, based on the thermal diffusion theory, realizes the process in which the atoms in the small molecule data continuously approach a stable state from an unstable state.
  • through the data of the sample protein, the data of the target small molecule that can bind to the sample protein is generated, which is conducive to accelerating the rate of drug development.
  • the denoised small molecule data can be generated at one time, that is, the type data of each atom and the coordinate data of each atom included in the small molecule data are generated at one time, which speeds up the generation rate, shortens the generation time, and the one-time generation method can avoid cumulative errors.
  • customizing the number of atoms included in the small molecule can also be realized.
  • neural network models can be used to generate target small molecule data.
  • the model that can generate target small molecule data in the related art can be recorded as a small molecule generation model.
  • the same data set is used to train small molecule generation model 1, small molecule generation model 2 and noise data determination model, and the performance of these three models in generating target small molecule data is tested. Scoring index 1 to scoring index 6 can be used to evaluate model performance, and the results obtained are shown in Table 1 below.
  • the symbol “↓” indicates that the smaller the value of the scoring index is, the better the model performance is; the symbol “↑” indicates that the larger the value of the scoring index is, the better the model performance is. It can also be seen from Table 1 that the performance of the noise data determination model is better than that of small molecule generation model 1 and small molecule generation model 2.
  • FIG. 7 is a schematic diagram of the structure of a training device for a noise data determination model provided in an embodiment of the present application. As shown in FIG. 7 , the device includes:
  • An acquisition module 701 is used to acquire sample noisy small molecule data and annotated noise data, wherein the sample noisy small molecule data is small molecule data with noise data and the sample noisy small molecule data includes data of multiple sample atoms, and the annotated noise data is obtained by annotation and is noise data in the sample noisy small molecule data;
  • a determination module 702 is used to output a sample graph structure through a neural network model based on the data of the plurality of sample atoms, wherein the sample graph structure includes a plurality of sample nodes and a plurality of sample edges, wherein any sample node represents the data of a sample atom, and any sample edge represents the distance between the sample atoms corresponding to the two sample nodes at both ends;
  • a determination module 702 is used to predict the sample graph structure through a neural network model to obtain predicted noise data, where the predicted noise data is noise data in the data of the sample noisy small molecules obtained through prediction;
  • the training module 703 is used to train the neural network model based on the predicted noise data and the labeled noise data to obtain a noise data determination model, and the noise data determination model is used to determine the final noise data in the data of the noisy small molecule to be processed.
  • the determination module 702 is used to perform feature extraction on the data of multiple sample atoms through a neural network model to obtain the initial atomic features of each sample atom; obtain the data of the sample protein, perform feature extraction on the data of the sample protein through a neural network model to obtain the features of the sample protein; and determine the sample graph structure through a neural network model based on the initial atomic features of each sample atom and the features of the sample protein.
  • the determination module 702 is used to fuse the initial atomic features of any sample atom and the features of the sample protein for any sample atom through a neural network model to obtain the first atomic features of any sample atom; determine the first distance between every two sample atoms based on the first atomic features of each sample atom; and determine the sample graph structure based on the first atomic features of each sample atom and the first distance between every two sample atoms.
  • the sample noisy small molecule data is initial noise data or is obtained by performing at least one denoising process on the initial noise data;
  • Determination module 702 is used to obtain sample denoising times information, which represents the number of denoising processes performed from initial noise data to sample noisy small molecule data; based on the sample denoising times information and multiple sample atomic data, the sample graph structure is determined through a neural network model.
  • the determination module 702 is used to perform feature extraction on the sample denoising frequency information through a neural network model to obtain the sample denoising frequency feature; perform feature extraction on the data of multiple sample atoms through the neural network model to obtain the initial atomic feature of each sample atom; and determine the sample graph structure through the neural network model based on the sample denoising frequency feature and the initial atomic feature of each sample atom.
  • the determination module 702 is used to fuse the initial atomic feature and the sample denoising frequency feature of any sample atom through a neural network model to obtain the second atomic feature of any sample atom; determine the second distance between every two sample atoms based on the second atomic feature of each sample atom; and determine the sample graph structure based on the second atomic feature of each sample atom and the second distance between every two sample atoms.
  • the determination module 702 is used to fuse the initial atomic features, sample denoising times features and sample protein features of any sample atom through a neural network model to obtain the third atomic features of any sample atom; determine the third distance between every two sample atoms based on the third atomic features of each sample atom; and determine the sample graph structure based on the third atomic features of each sample atom and the third distance between every two sample atoms.
  • the determination module 702 is used to extract features of the sample graph structure through a neural network model to obtain the atomic features to be processed of each sample atom; based on the atomic features to be processed of each sample atom, at least one of the predicted type noise data and the predicted position noise data is determined through the neural network model, the predicted type noise data is noise data related to the type of the sample atom obtained through prediction, and the predicted position noise data is noise data related to the position of the sample atom obtained through prediction; at least one of the predicted type noise data and the predicted position noise data is used as the predicted noise data.
  • the determination module 702 is used to delete a first edge from multiple sample edges included in the sample graph structure through a neural network model to obtain a first graph structure, and the distance represented by the first edge is not greater than a reference distance; based on the first graph structure, determine first noise data through a neural network model; and determine predicted noise data based on the first noise data.
  • the determination module 702 is used to delete a second edge from multiple sample edges included in the sample graph structure through a neural network model to obtain a second graph structure, and the distance represented by the second edge is greater than the reference distance; based on the second graph structure, determine second noise data through the neural network model; and determine predicted noise data based on the second noise data.
  • the predicted noise data includes predicted type noise data and predicted position noise data
  • the annotated noise data includes annotated type noise data and annotated position noise data
  • the training module 703 is used to determine the first loss based on the predicted type noise data and the labeled type noise data; determine the second loss based on the predicted position noise data and the labeled position noise data; and train the neural network model based on the first loss and the second loss to obtain a noise data determination model.
  • the above device determines the sample graph structure based on the data of multiple sample atoms in the sample noisy small molecule data, predicts the sample graph structure through a neural network model to determine the predicted noise data, and obtains a noise data determination model based on the predicted noise data and the labeled noise data.
  • the noise data determination model can be used to determine the final noise data in the data of the noisy small molecule to be processed, so that the data of the noisy small molecule to be processed can be denoised based on the final noise data to obtain the denoised small molecule data, and then drug research and development can be carried out based on the denoised small molecule data to improve the efficiency of drug research and development.
  • the device provided in FIG. 7 above only uses the division of the above functional modules as an example to implement its functions.
  • the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the device and method embodiments provided in the above embodiments belong to the same concept, and their specific implementation process is detailed in the method embodiment, which will not be repeated here.
  • FIG8 is a schematic diagram of the structure of a device for determining noise data provided by an embodiment of the present application. As shown in FIG8 , the device includes:
  • An acquisition module 801 is used to acquire data of small molecules with noise to be processed, where the data of small molecules with noise to be processed is small molecule data with noise data and the data of small molecules with noise to be processed includes data of multiple atoms to be processed;
  • a determination module 802 is used to determine a graph structure to be processed based on data of multiple atoms to be processed by a noise data determination model, wherein the graph structure to be processed includes multiple nodes and multiple edges, wherein any node represents an atom to be processed, and any edge represents a distance between atoms to be processed corresponding to two nodes at both ends, and the noise data determination model is trained according to the training method for the noise data determination model of any one of the first aspects;
  • the determination module 802 is further used to determine the final noise data based on the graph structure to be processed and through the noise data determination model.
  • the final noise data is the noise data in the data of the noisy small molecule to be processed.
  • the determination module 802 is used to perform feature extraction on data of multiple atoms to be processed through a noise data determination model to obtain initial atomic features of each atom to be processed; obtain data of a protein to be processed, perform feature extraction on the data of the protein to be processed through a noise data determination model to obtain features of the protein to be processed; and determine a graph structure to be processed through a noise data determination model based on the initial atomic features of each atom to be processed and the features of the protein to be processed.
  • the noisy small molecule data to be processed is initial noise data or is obtained by performing at least one denoising process on the initial noise data;
  • Determination module 802 is used to obtain denoising times information, which represents the number of denoising processes performed from initial noise data to the data of noisy small molecules to be processed; based on the denoising times information and the data of multiple atoms to be processed, the graph structure to be processed is determined by the noise data determination model.
  • the device further includes:
  • a denoising module used for denoising the data of the noisy small molecule to be processed based on the final noise data to obtain first small molecule data
  • the determination module 802 is further configured to use the first small molecule data as target small molecule data in response to the first small molecule data satisfying the data condition.
  • the device further includes:
  • the determination module 802 is further configured to determine the reference graph structure through the noise data determination model based on the first small molecule data in response to the first small molecule data not satisfying the data condition; and determine the reference noise data through the noise data determination model based on the reference graph structure;
  • the denoising module is further used to perform denoising processing on the first small molecule data based on the reference noise data to obtain second small molecule data;
  • the determination module 802 is further configured to use the second small molecule data as target small molecule data in response to the second small molecule data satisfying the data condition.
  • the above-mentioned device determines the graph structure to be processed based on the data of the noisy small molecules to be processed, and determines the final noise data based on the graph structure to be processed through the noise data determination model, so that the data of the noisy small molecules to be processed can be denoised based on the final noise data to obtain the denoised small molecule data, and then drug research and development can be carried out based on the denoised small molecule data to improve the efficiency of drug research and development.
  • the device provided in FIG. 8 above only uses the division of the above functional modules as an example when implementing its functions.
  • the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the device and method embodiments provided in the above embodiments belong to the same concept, and their specific implementation process is detailed in the method embodiment, which will not be repeated here.
  • FIG9 shows a block diagram of a terminal device 900 provided by an exemplary embodiment of the present application.
  • the terminal device 900 includes: a processor 901 and a memory 902 .
  • the processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc.
  • the processor 901 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
  • the processor 901 may also include a main processor and a coprocessor.
  • the main processor is a processor for processing data in the awake state, also known as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state.
  • the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen.
  • the processor 901 may also include an AI (Artificial Intelligence) processor, which is used to process computing operations related to machine learning.
  • the memory 902 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 902 may also include a high-speed random access memory, and a non-volatile memory, such as one or more disk storage devices, flash memory storage devices.
  • the non-transitory computer-readable storage medium in the memory 902 is used to store at least one computer program, which is used to be executed by the processor 901 to implement the training method of the noise data determination model or the noise data determination method provided in the method embodiment of the present application.
  • the terminal device 900 may further include: a peripheral device interface 903 and at least one peripheral device.
  • the peripheral device includes: at least one of a radio frequency circuit 904 , a display screen 905 , a camera assembly 906 , an audio circuit 907 and a power supply 908 .
  • the terminal device 900 further includes one or more sensors 909 .
  • the one or more sensors 909 include but are not limited to: an acceleration sensor 911 , a gyroscope sensor 912 , a pressure sensor 913 , an optical sensor 914 , and a proximity sensor 915 .
  • The structure shown in FIG. 9 does not constitute a limitation on the terminal device 900, which may include more or fewer components than shown in the figure, or combine certain components, or adopt a different component arrangement.
  • FIG10 is a schematic diagram of the structure of the server provided in the embodiment of the present application.
  • the server 1000 may have relatively large differences due to different configurations or performances, and may include one or more processors 1001 and one or more memories 1002, wherein the one or more memories 1002 store at least one computer program, and the at least one computer program is loaded and executed by the one or more processors 1001 to implement the training method of the noise data determination model or the noise data determination method provided in the above-mentioned various method embodiments.
  • the processor 1001 is a CPU.
  • the server 1000 may also have components such as a wired or wireless network interface, a keyboard, and an input and output interface for input and output.
  • the server 1000 may also include other components for implementing device functions, which will not be described in detail here.
  • a computer-readable storage medium in which at least one computer program is stored.
  • the at least one computer program is loaded and executed by a processor to enable an electronic device to implement any of the above-mentioned noise data determination model training methods or noise data determination methods.
  • the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, etc.
  • a computer program product in which at least one computer program is stored, and the at least one computer program is loaded and executed by a processor to enable an electronic device to implement any of the above-mentioned training methods for noise data determination models or methods for determining noise data.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Epidemiology (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Public Health (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Biotechnology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

本申请公开了一种噪声数据确定模型的训练、噪声数据的确定方法及装置,属于生物技术领域。方法包括:获取样本带噪小分子的数据和标注噪声数据,样本带噪小分子的数据包括多个样本原子的数据;基于多个样本原子的数据,通过神经网络模型输出样本图结构;通过神经网络模型对样本图结构进行预测得到预测噪声数据;基于预测噪声数据和标注噪声数据,对神经网络模型进行训练,得到噪声数据确定模型。通过噪声数据确定模型确定待处理带噪小分子的数据中的最终噪声数据,从而可以实现基于最终噪声数据对待处理带噪小分子的数据进行去噪处理,得到去噪后的小分子数据,进而可以基于去噪后的小分子数据进行药物研发,提高药物研发效率。

Description

噪声数据确定模型的训练、噪声数据的确定方法及装置
本申请要求于2022年11月30日提交中国专利局、申请号202211525333.7、申请名称为“噪声数据确定模型的训练、噪声数据的确定方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及生物技术领域,特别涉及噪声数据确定模型的训练、噪声数据的确定技术。
背景技术
在生物技术领域中,小分子与药物研发息息相关。一些情况下,可以将随机生成的噪声数据作为带噪小分子数据,通过确定带噪小分子数据中的噪声数据,以基于噪声数据对带噪小分子数据进行去噪处理,得到去噪后的小分子数据。若去噪后的小分子数据满足条件,则将去噪后的小分子数据作为不带有噪声的目标小分子数据;若去噪后的小分子数据不满足条件,则将去噪后的小分子数据作为带噪小分子数据,再次对带噪小分子数据进行去噪处理,直至满足条件,得到目标小分子数据。可以利用目标小分子数据进行药物研发,以加快药物研发速率。基于此,如何确定带噪小分子数据中的噪声数据成为一个亟需解决的问题。
发明内容
本申请提供了一种噪声数据确定模型的训练、噪声数据的确定方法及装置,可用于解决相关技术中的问题,所述技术方案包括如下内容。
第一方面,提供了一种噪声数据确定模型的训练方法,所述方法由电子设备执行,所述方法包括:
获取样本带噪小分子的数据和标注噪声数据,所述样本带噪小分子的数据为带有噪声数据的小分子数据且所述样本带噪小分子的数据包括多个样本原子的数据,所述标注噪声数据是通过标注得到的且是所述样本带噪小分子的数据中的噪声数据;
基于所述多个样本原子的数据,通过神经网络模型输出样本图结构,所述样本图结构包括多个样本节点和多个样本边,任一个样本节点表征一个样本原子的数据,任一个样本边表征两端两个样本节点对应的样本原子之间的距离;
通过所述神经网络模型对所述样本图结构进行预测得到预测噪声数据,所述预测噪声数据是通过预测得到的且是所述样本带噪小分子的数据中的噪声数据;
基于所述预测噪声数据和所述标注噪声数据,对所述神经网络模型进行训练,得到噪声数据确定模型,所述噪声数据确定模型用于确定待处理带噪小分子的数据中的最终噪声数据。
第二方面,提供了一种噪声数据的确定方法,所述方法由电子设备执行,所述方法包括:
获取待处理带噪小分子的数据,所述待处理带噪小分子的数据为带有噪声数据的小分子数据且所述待处理带噪小分子的数据包括多个待处理原子的数据;
基于所述多个待处理原子的数据,通过噪声数据确定模型确定待处理图结构,所述待处理图结构包括多个节点和多个边,任一个节点表征一个待处理原子,任一个边表征两端两个节点对应的待处理原子之间的距离,所述噪声数据确定模型是按照第一方面任一项所述的方法训练得到的;
基于所述待处理图结构,通过所述噪声数据确定模型确定最终噪声数据,所述最终噪声数据是所述待处理带噪小分子的数据中的噪声数据。
第三方面,提供了一种噪声数据确定模型的训练装置,所述装置部署在电子设备上,所述装置包括:
获取模块,用于获取样本带噪小分子的数据和标注噪声数据,所述样本带噪小分子的数据为带有噪声数据的小分子数据且所述样本带噪小分子的数据包括多个样本原子的数据,所述标注噪声数据是通过标注得到的且是所述样本带噪小分子的数据中的噪声数据;
确定模块,用于基于所述多个样本原子的数据,通过神经网络模型输出样本图结构,所述样本图结构包括多个样本节点和多个样本边,任一个样本节点表征一个样本原子的数据,任一个样本边表征两端两个样本节点对应的样本原子之间的距离;
所述确定模块,还用于通过所述神经网络模型对所述样本图结构进行预测得到预测噪声数据,所述预测噪声数据是通过预测得到的且是所述样本带噪小分子的数据中的噪声数据;
训练模块,还用于基于所述预测噪声数据和所述标注噪声数据,对所述神经网络模型进行训练,得到噪声数据确定模型,所述噪声数据确定模型用于确定待处理带噪小分子的数据中的最终噪声数据。
第四方面,提供了一种噪声数据的确定装置,所述装置部署在电子设备上,所述装置包括:
获取模块,用于获取待处理带噪小分子的数据,所述待处理带噪小分子的数据为带有噪声数据的小分子数据且所述待处理带噪小分子的数据包括多个待处理原子的数据;
确定模块,用于基于所述多个待处理原子的数据,通过噪声数据确定模型确定待处理图结构,所述待处理图结构包括多个节点和多个边,任一个节点表征一个待处理原子,任一个边表征两端两个节点对应的待处理原子之间的距离,所述噪声数据确定模型是按照第一方面任一项所述的噪声数据确定模型的训练方法训练得到的;
所述确定模块,还用于基于所述待处理图结构,通过所述噪声数据确定模型确定最终噪声数据,所述最终噪声数据是所述待处理带噪小分子的数据中的噪声数据。
第五方面,提供了一种电子设备,所述电子设备包括处理器和存储器,所述存储器中存储有至少一条计算机程序,所述至少一条计算机程序由所述处理器加载并执行,以使所述电子设备实现上述第一方面任一所述的噪声数据确定模型的训练方法或者实现上述第二方面任一所述的噪声数据的确定方法。
第六方面,还提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条计算机程序,所述至少一条计算机程序由处理器加载并执行,以使所述电子设备实 现上述第一方面任一所述的噪声数据确定模型的训练方法或者实现上述第二方面任一所述的噪声数据的确定方法。
第七方面,还提供了一种计算机程序产品,所述计算机程序产品中存储有至少一条计算机程序,所述至少一条计算机程序由处理器加载并执行,以使所述电子设备实现上述第一方面任一所述的噪声数据确定模型的训练方法或者实现上述第二方面任一所述的噪声数据的确定方法。
本申请提供的技术方案至少带来如下有益效果:
本申请提供的技术方案是基于样本带噪小分子的数据中多个样本原子的数据确定样本图结构,通过神经网络模型对样本图结构进行预测来确定预测噪声数据,基于预测噪声数据和标注噪声数据训练得到噪声数据确定模型。通过噪声数据确定模型可以确定待处理带噪小分子的数据中的最终噪声数据,从而可以实现基于最终噪声数据对待处理带噪小分子的数据进行去噪处理,得到去噪后的小分子数据,进而可以基于去噪后的小分子数据进行药物研发,提高药物研发效率。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种噪声数据确定模型的训练方法或者噪声数据的确定方法的实施环境示意图;
图2是本申请实施例提供的一种噪声数据确定模型的训练方法的流程图;
图3是本申请实施例提供的一种加噪处理和去噪处理的示意图;
图4是本申请实施例提供的一种噪声数据的确定方法的流程图;
图5是本申请实施例提供的一种噪声数据确定模型的训练过程的示意图;
图6是本申请实施例提供的一种目标小分子的示意图;
图7是本申请实施例提供的一种噪声数据确定模型的训练装置的结构示意图;
图8是本申请实施例提供的一种噪声数据的确定装置的结构示意图;
图9是本申请实施例提供的一种终端设备的结构示意图;
图10是本申请实施例提供的一种服务器的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
图1是本申请实施例提供的一种噪声数据确定模型的训练方法或者噪声数据的确定方法的实施环境示意图,如图1所示,该实施环境包括终端设备101和服务器102。其中,本申请实施例中的噪声数据确定模型的训练方法或者噪声数据的确定方法可以由终端设备101执行,也可以由服务器102执行,或者由终端设备101和服务器102共同执行。
终端设备101可以是智能手机、游戏主机、台式计算机、平板电脑、膝上型便携计算机、智能电视、智能车载设备、智能语音交互设备、智能家电等。服务器102可以为一台 服务器,或者为多台服务器组成的服务器集群,或者为云计算平台和虚拟化中心中的任意一种,本申请实施例对此不加以限定。服务器102可以与终端设备101通过有线网络或无线网络进行通信连接。服务器102可以具有数据处理、数据存储以及数据收发等功能,在本申请实施例中不加以限定。终端设备101和服务器102的数量不受限制,可以是一个或多个。
本申请实施例可以基于人工智能技术自动化进行噪声数据确定模型的训练方法或者噪声数据的确定方法。
在生物技术领域中,可以将随机生成的噪声数据作为带噪小分子数据,通过确定带噪小分子数据中的噪声数据,以基于该噪声数据对带噪小分子数据进行去噪处理。当去噪后的小分子数据不满足条件时,需要将去噪后的小分子数据作为带噪小分子数据,再次对带噪小分子数据进行去噪处理,直至满足条件,得到目标小分子数据。在对目标小分子数据进行实验测试并测试通过后,可以将目标小分子数据作为药物数据,从而实现药物研发。因此,小分子数据的生成与药物研发息息相关,基于此,如何确定带噪小分子数据中的噪声数据成为一个亟需解决的问题。
本申请实施例提供了一种噪声数据确定模型的训练方法,该方法可应用于上述实施环境中,可以确定出待处理带噪小分子的数据中的最终噪声数据,从而可以基于最终噪声数据对待处理带噪小分子的数据进行去噪处理,为目标小分子数据的生成奠定了基础。以图2所示的本申请实施例提供的一种噪声数据确定模型的训练方法的流程图为例,为便于描述,将执行本申请实施例中的噪声数据确定模型的训练方法的终端设备101或者服务器102称为电子设备,该方法可以由电子设备来执行。如图2所示,该方法包括步骤201至步骤204。
步骤201,获取样本带噪小分子的数据和标注噪声数据。
本申请实施例中,样本带噪小分子是带有噪声的小分子,样本带噪小分子包括多个样本原子,任一个样本原子可以为带有噪声的原子,也可以为不带有噪声的原子。对于任一个带有噪声的原子,该原子的相关信息存在误差且误差在误差范围之外。对于任一个不带有噪声的原子,该原子的相关信息不存在误差,或者该原子的相关信息存在误差但误差在误差范围之内。任一个样本原子存在其对应的数据,通过样本原子的数据可以描述样本原子的相关信息。
可选地,样本原子的数据可以包括用于描述样本原子的类型的数据,即样本原子的数据包括样本原子的类型数据。例如,样本原子为氧原子,则样本原子的类型数据为元素符号O。样本原子为碳原子,则样本原子的类型数据为元素符号C。样本原子为氮原子,则样本原子的类型数据为元素符号N。
样本原子的数据还可以包括用于描述样本原子的位置的数据,即样本原子的数据包括样本原子的位置数据。样本原子的位置数据可以为样本原子的三维坐标,包括横坐标(常用x来表征)、纵坐标(常用y来表征)和竖直坐标(常用z来表征),通过三个坐标来描述样本原子在三维空间坐标系中的位置。
样本带噪小分子包括多个样本原子,各个样本原子的数据组成样本带噪小分子的数据,即,样本带噪小分子的数据为带有噪声数据的小分子数据且样本带噪小分子的数据包括多个样本原子的数据。可以理解的是,样本带噪小分子的数据可以包括除各个样本原子的数据之外的其他数据,例如,其他数据包括用于表征样本带噪小分子所属的小分子类型的数据。
可选地,样本带噪小分子的数据可以表示为 $\{(a_i, r_i)\}_{i=1}^{N}$,其中,N 是样本原子的数量,$a_i$ 是第 i 个样本原子的类型数据,$r_i$ 是第 i 个样本原子的位置数据。
可选地,可以将实际存在的小分子作为样本小分子,或者,将设计的小分子作为样本小分子,样本小分子是能与样本蛋白质结合的有效小分子。通过分析样本小分子中各个原子的类型、位置等信息,得到样本小分子中各个原子的数据,从而获取到样本小分子的数据。样本小分子的数据是有关小分子且不带有噪声的数据。可以对样本小分子的数据进行第一次加噪处理,得到第一次加噪处理后的小分子数据;对第一次加噪处理后的小分子数据进行第二次加噪处理,得到第二次加噪处理后的小分子数据,以此类推。也就是说,对样本小分子的数据进行T次加噪处理,可以得到第1次至第T次加噪处理后的小分子数据,T为正整数。其中,第T次加噪处理后的小分子数据为下文提及的初始噪声数据,样本小分子的数据可以理解为第0次加噪处理后的小分子数据,也就是下文提及的有效小分子的数据。
请参见图3,图3是本申请实施例提供的一种加噪处理和去噪处理的示意图。通过对第t-1次加噪处理后的小分子数据进行一次加噪处理,得到第t次加噪处理后的小分子数据。通过这个原理,可以对第0次加噪处理后的小分子数据不断地进行加噪处理,在经过T次加噪处理后,得到第T次加噪处理后的小分子数据。
可以将第t次加噪处理后的小分子数据作为样本带噪小分子的数据,将对第t-1次加噪处理后的小分子数据进行第t次加噪处理时的噪声数据,也就是第t次加噪处理的噪声数据作为标注噪声数据,t为大于或者等于1且小于或者等于T的正整数。其中,标注噪声数据可以理解为通过标注得到的且是样本带噪小分子的数据中的噪声数据。
可选地,在对第0次加噪处理后的小分子数据进行T次加噪处理时,通常情况下,T的数值较大。其中,对第m次加噪处理后的小分子数据进行n-m次加噪处理,可以得到第n次加噪处理后的小分子数据,m、n均为正整数且小于或者等于T,m小于n。这种情况下,可以将第n次加噪处理后的小分子数据作为样本带噪小分子的数据,利用n-m次加噪处理中各次加噪处理的噪声数据,确定标注噪声数据。
例如,对第0次加噪处理后的小分子数据进行1000次加噪处理时得到第1000次加噪处理后的小分子数据。可以采用间隔采样的方式,将第1000次加噪处理后的小分子数据、第990次加噪处理后的小分子数据、……、第10次加噪处理后的小分子数据,第1次加噪处理后的小分子数据分别作为样本带噪小分子的数据。当第1000次加噪处理后的小分子数据为样本带噪小分子的数据时,标注噪声数据是从第991次至第1000次加噪处理中每次加噪处理的噪声数据的总和。以此类推。当第10次加噪处理后的小分子数据为样本带噪小分子的数据时,标注噪声数据是从第1次至第10次加噪处理中每次加噪处理的噪声数据的总和。
通过将任一次加噪处理后的小分子数据作为样本带噪小分子的数据,在得到样本带噪小分子的数据的过程中需要经过至少一次加噪处理,将至少一次加噪处理时各次加噪处理的噪声数据的总和作为标注噪声数据,使得神经网络模型确定的预测噪声数据为至少一次加噪处理时各次加噪处理的噪声数据的总和。通过从样本带噪小分子的数据中去除预测噪声数据,可以得到对样本带噪小分子的数据进行至少一次去噪处理后的小分子数据,加快了有效小分子的生成效率,从而提高了药物研发的效率。
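为便于理解上述加噪处理与标注噪声数据的获取过程,下面给出一个最小化的示意性代码草图(基于 NumPy;其中方差参数 beta 的取值与加噪公式的具体形式均为假设,并非本申请方案的限定性实现):

```python
import numpy as np

def forward_noising(r0, T=1000, beta_start=1e-4, beta_end=0.02, seed=0):
    """对第0次加噪处理后的原子位置数据 r0 (形状 [N, 3]) 进行 T 次加噪处理,
    返回各次加噪处理后的位置数据列表以及各次加噪处理的噪声数据列表。"""
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_start, beta_end, T)   # 假设的固定方差参数 beta_1 ... beta_T
    r_t = np.asarray(r0, dtype=float).copy()
    noisy_list, noise_list = [], []
    for t in range(T):
        eps = rng.standard_normal(r_t.shape)       # 第 t+1 次加噪处理的噪声数据, 服从正态分布
        r_t = np.sqrt(1.0 - betas[t]) * r_t + np.sqrt(betas[t]) * eps
        noisy_list.append(r_t.copy())
        noise_list.append(eps)
    return noisy_list, noise_list

# 用法示例: 取第 t 次加噪处理后的小分子数据作为样本带噪小分子的数据, 对应的噪声作为标注噪声数据
r0 = np.zeros((5, 3))                               # 5 个样本原子的位置数据 (示例值)
noisy_list, noise_list = forward_noising(r0, T=10)
sample_noisy, labeled_noise = noisy_list[9], noise_list[9]
```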
步骤202,基于多个样本原子的数据,通过神经网络模型输出样本图结构。
其中,样本图结构包括多个样本节点和多个样本边,任一个样本节点表征一个样本原子的数据,任一个样本边表征两端两个样本节点对应的样本原子之间的距离。
可以将样本带噪小分子的数据输入神经网络模型,由神经网络模型基于各个样本原子的数据构建样本图结构。样本图结构包括多个样本节点,任一个样本节点表征一个样本原子的数据。任两个样本原子对应的样本节点之间可以有边,也可以没有边。当两个样本原子对应的样本节点之间有边时,该边可以称为样本边,该样本边表征这两个样本原子之间的距离。
本申请实施例不对神经网络模型的模型结构、模型参数等做限定。示例性地,神经网络模型是初始网络模型,这种情况下,神经网络模型的模型结构、模型参数等与初始网络模型的模型结构、模型参数等完全相同。可选地,初始网络模型包括小分子编码器、蛋白质编码器、次数编码器、图结构生成器和噪声生成器等中的至少一项,各项的功能在下文有对应描述,在此暂不赘述。或者,神经网络模型是按照步骤201至步骤204的方式对初始网络模型进行至少一次的训练后得到的模型。这种情况下,神经网络模型与初始网络模型仅存在模型参数上的区别,二者的模型结构相同。
本申请实施例提供一种实现方式A1,在实现方式A1中,步骤202包括步骤2021至步骤2023。
步骤2021,通过神经网络模型对多个样本原子的数据进行特征提取,得到各个样本原子的初始原子特征。
在一种可能的实现方式中,神经网络模型包括小分子编码器,可以将样本带噪小分子的数据输入小分子编码器,通过小分子编码器对各个样本原子的数据进行特征提取,得到各个样本原子的初始原子特征。
本申请实施例不对小分子编码器的模型结构、模型参数等做限定。示例性地,小分子编码器为自编码器(Auto-Encoder,AE)或者变分自编码器(Variational Auto-Encoder,VAE)。
可选地,任一个样本原子的数据包括该样本原子的类型数据和该样本原子的位置数据中的至少一项。通过小分子编码器对该样本原子的类型数据进行编码处理,得到样本原子的类型特征。例如,样本原子的类型数据是能表征样本原子的类型的元素符号,通过小分子编码器对元素符号进行独热编码、多热编码等编码处理,得到样本原子的类型特征。通过小分子编码器基于样本原子的位置数据确定该样本原子的位置特征。例如,样本原子的 位置数据为样本原子的三维坐标,通过小分子编码器将样本原子的三维坐标作为样本原子的位置特征,或者,通过小分子编码器对样本原子的三维坐标进行归一化处理,得到样本原子的位置特征。可以将样本原子的类型特征或者样本原子的位置特征作为样本原子的初始原子特征,或者,将样本原子的类型特征和样本原子的位置特征进行拼接,得到样本原子的初始原子特征。
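作为步骤2021的一个示意性草图(假设类型特征采用独热编码、位置特征采用简单的坐标归一化,元素类型表亦为假设),小分子编码器的特征提取可以写成如下形式:

```python
import numpy as np

ATOM_TYPES = ["H", "C", "N", "O"]                  # 假设的元素类型表

def encode_atom(atom_type, position, coord_scale=10.0):
    """将单个样本原子的类型数据和位置数据编码为初始原子特征。"""
    type_feat = np.zeros(len(ATOM_TYPES))
    type_feat[ATOM_TYPES.index(atom_type)] = 1.0   # 类型特征: 对元素符号做独热编码
    pos_feat = np.asarray(position, dtype=float) / coord_scale  # 位置特征: 对三维坐标做简单归一化
    return np.concatenate([type_feat, pos_feat])   # 拼接得到初始原子特征

feat = encode_atom("O", [2.0, 0.0, 1.0])           # 一个氧原子的初始原子特征, 维度为 4 + 3
```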
步骤2022,获取样本蛋白质的数据,通过神经网络模型对样本蛋白质的数据进行特征提取,得到样本蛋白质的特征。
可以将实际存在的蛋白质作为样本蛋白质,或者,将设计的蛋白质作为样本蛋白质。样本蛋白质包括多个原子,任一个原子存在其对应的数据,通过原子的数据来描述原子的相关信息。可选地,任一个原子的数据包括该原子的位置数据和该原子的类型数据中的至少一项。各个原子的数据组成样本蛋白质的数据。可以理解的是,样本蛋白质的数据可以包括除各个原子的数据之外的其他数据,例如,其他数据包括用于表征样本蛋白质所属的蛋白质类型的数据。
在一种可能的实现方式中,神经网络模型还可以包括蛋白质编码器,可以将样本蛋白质的数据输入蛋白质编码器,通过蛋白质编码器对样本蛋白质中各个原子的数据进行特征提取,得到样本蛋白质中各个原子的特征。
本申请实施例不对蛋白质编码器的模型结构、模型参数等做限定。示例性地,蛋白质编码器为VAE或者Schnet,其中,Schnet是深度张量神经网络(Deep Tensor Neural Network,DTNN)的变体。
可选地,通过蛋白质编码器对样本蛋白质中各个原子的类型数据进行编码处理,得到样本蛋白质中各个原子的类型特征。通过蛋白质编码器基于样本蛋白质中各个原子的位置数据确定样本蛋白质中各个原子的位置特征。将样本蛋白质中任一个原子的类型特征或者该原子的位置特征作为该原子的特征,或者,将该原子的类型特征和该原子的位置特征进行拼接,得到该原子的特征。
在得到样本蛋白质中各个原子的特征之后,相当于得到了样本蛋白质的特征。也就是说,可以将样本蛋白质中各个原子的特征作为样本蛋白质的特征。或者,可以对样本蛋白质中各个原子的特征进行卷积处理、归一化处理、正则处理等,得到样本蛋白质的特征。
步骤2023,基于各个样本原子的初始原子特征和样本蛋白质的特征,通过神经网络模型确定样本图结构。
在一种可能的实现方式中,神经网络模型还可以包括图结构生成器。可以将各个样本原子的初始原子特征和样本蛋白质的特征输入图结构生成器,由图结构生成器生成样本图结构。
本申请实施例中,基于样本蛋白质的特征生成样本图结构,使得在基于样本图结构确定预测噪声数据时,预测噪声数据是神经网络模型参考样本蛋白质来确定的样本带噪小分子的数据中的噪声数据。在基于预测噪声数据对样本带噪小分子的数据进行去噪处理后,得到的去噪后的小分子数据所对应的小分子更有可能与样本蛋白质进行结合。小分子与蛋白质结合的概率越高,小分子越有可能成为药物。因此,基于样本蛋白质的特征确定样本 图结构,以通过样本图结构确定预测噪声数据,有利于基于预测噪声数据确定出能与样本蛋白质结合的小分子的数据,从而得到能与样本蛋白质结合的有效小分子,提高药物的研发效率。在本申请实施例中,样本蛋白质的数据可以理解为确定样本带噪小分子的数据中的噪声数据的约束条件。
可选地,将样本蛋白质的数据作为约束条件,对任一次加噪处理后的小分子数据进行加噪处理。该加噪过程如下公式(1)所示。
$q(G_{1:T} \mid G_0, p_{ctx}) = \prod_{t=1}^{T} q(G_t \mid G_{t-1}, p_{ctx})$　　公式(1)
其中,$G_0$ 表征第0次加噪处理后的小分子数据,$G_t$ 表征第t次加噪处理后的小分子数据,$G_{t-1}$ 表征第t-1次加噪处理后的小分子数据,$G_{1:T}$ 表征第1次至第T次加噪处理后的小分子数据。$p_{ctx}$ 表征样本蛋白质的数据。$q(x)$ 表征加噪处理函数的函数符号,x为变量。$\prod$ 表征累乘符号。
q(G1:T∣G0,pctx)表征将样本蛋白质的数据作为条件,对第0次加噪处理后的小分子数据进行T次加噪处理后,依次得到第1次至第T次加噪处理后的小分子数据。q(Gt∣Gt-1,pctx)表征将样本蛋白质的数据作为条件,对第t-1次加噪处理后的小分子数据进行第t次加噪处理后,得到第t次加噪处理后的小分子数据。
可选地,第t次加噪处理后的小分子数据满足如下所示的公式(2)。
$q(G_t \mid G_{t-1}, p_{ctx}) = \mathcal{N}\big(G_t;\ \sqrt{1-\beta_t}\,G_{t-1},\ \beta_t I\big)$　　公式(2)
其中,$\mathcal{N}$ 表征正态分布函数的函数符号。一般情况下,正态分布函数为 $\mathcal{N}(\mu, \sigma^2 I)$,I 为正态分布函数的参数。$\beta_1 \ldots \beta_T$ 是固定的方差参数,$\beta_t$ 是第t个方差参数。可选地,第t个方差参数满足 $\beta_t \in (0, 1)$。在本申请实施例中,公式(2)表征第t次加噪处理后的小分子数据符合正态分布函数 $\mathcal{N}\big(\sqrt{1-\beta_t}\,G_{t-1},\ \beta_t I\big)$。
由于对任一次加噪处理后的小分子数据进行加噪处理时是将样本蛋白质的数据作为约束条件,因此,也需要样本蛋白质的数据作为约束条件,对任一次加噪处理后的小分子数据进行去噪处理。该去噪过程如下公式(3)所示。
$p_\theta(G_{0:T-1} \mid G_T, p_{ctx}) = \prod_{t=1}^{T} p_\theta(G_{t-1} \mid G_t, p_{ctx})$　　公式(3)
其中,$G_0$ 表征第0次加噪处理后的小分子数据,$G_t$ 表征第t次加噪处理后的小分子数据,$G_{t-1}$ 表征第t-1次加噪处理后的小分子数据,$G_{0:T-1}$ 表征第0次至第T-1次加噪处理后的小分子数据。$p_{ctx}$ 表征样本蛋白质的数据。$p_\theta(\cdot)$ 表征去噪处理函数的函数符号,$\theta$ 为变量。$\prod$ 表征累乘符号。
pθ(G0:T-1∣GT,pctx)表征将样本蛋白质的数据作为条件,对第T次加噪处理后的小分子数据进行T次去噪处理后,依次得到第0次至第T-1次加噪处理后的小分子数据。pθ(Gt-1∣Gt,pctx)表征将样本蛋白质的数据作为条件,对第t次加噪处理后的小分子数据进行第t次去噪处理后,得到第t-1次加噪处理后的小分子数据。
可选地,第t-1次加噪处理后的小分子数据满足如下所示的公式(4)。
$p_\theta(G_{t-1} \mid G_t, p_{ctx}) = \mathcal{N}\big(G_{t-1};\ \mu_\theta(G_t, p_{ctx}, t),\ \sigma_t^2 I\big)$　　公式(4)
其中,$\mathcal{N}$ 表征正态分布函数的函数符号。一般情况下,正态分布函数为 $\mathcal{N}(\mu, \sigma^2 I)$,I 为正态分布函数的参数。$\mu_\theta$ 为第t-1次加噪处理后的小分子数据所符合分布的平均值。$\sigma_t$ 是方差值,可以为任意设定的数据。在本申请实施例中,公式(4)表征第t-1次加噪处理后的小分子数据符合正态分布函数 $\mathcal{N}\big(\mu_\theta(G_t, p_{ctx}, t),\ \sigma_t^2 I\big)$。其中,$\mu_\theta$ 是本申请实施例中的神经网络模型需要学习的参数,在训练神经网络模型的过程中,需要对 $p_\theta(G_0 \mid p_{ctx})$ 进行极大似然化。
可选地,步骤2023包括:对于任一个样本原子,通过神经网络模型将任一个样本原子的初始原子特征和样本蛋白质的特征进行融合,得到任一个样本原子的第一原子特征;基于各个样本原子的第一原子特征,确定每两个样本原子之间的第一距离;基于各个样本原子的第一原子特征和每两个样本原子之间的第一距离,确定样本图结构。
将样本原子的初始原子特征和样本蛋白质的特征进行融合得到样本原子的第一原子特征,从而可以基于样本蛋白质的特征表达样本原子的第一原子特征,进而基于样本蛋白质的特征确定样本图结构,以参考样本蛋白质来确定的样本带噪小分子的数据中的噪声数据,有利于基于预测噪声数据确定出能与样本蛋白质结合的小分子的数据,从而得到能与样本蛋白质结合的有效小分子,提高药物的研发效率。
通过神经网络模型,将任一个样本原子的初始原子特征和样本蛋白质的特征进行拼接或者相加或者相乘等任意的融合处理,得到该样本原子的第一原子特征。按照距离公式,基于任两个样本原子的第一原子特征,确定这两个样本原子之间的第一距离。本申请实施例不对距离公式做限定,示例性地,距离公式为余弦距离公式、交叉熵距离公式或者相对熵距离公式等。通过这种方式,可以确定出每两个样本原子之间的第一距离。
可选地,将任一个样本原子的第一原子特征作为样本节点,将任两个样本原子之间的第一距离作为这两个样本原子对应的样本节点之间的边。通过这种方式,可以确定各个样本节点和每两个样本节点之间的边,得到样本图结构。
或者,将任一个样本原子的第一原子特征作为样本节点。对于任两个样本节点,若任两个样本节点对应的样本原子之间的第一距离大于距离阈值,则确定这两个样本节点之间不存在边,若任两个样本节点对应的样本原子之间的第一距离不大于距离阈值,则将这两个样本节点对应的样本原子之间的第一距离确定为这两个样本节点之间的边。通过这种方式,可以确定各个样本节点以及任两个样本节点之间存在的边,从而得到样本图结构。本申请实施例不对距离阈值做限定,可选地,距离阈值是根据人工经验设定的数值,或者,距离阈值是两个样本原子之间能存在相互作用力的最远距离。
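按距离阈值构建样本图结构的过程可以示意性地写成如下形式(距离阈值的取值仅为假设,并非本申请的限定):

```python
import numpy as np

def build_sample_graph(atom_feats, positions, dist_threshold=5.0):
    """atom_feats: [N, D] 各样本原子的第一原子特征 (作为样本节点);
    positions: [N, 3] 各样本原子的位置数据。
    仅在两个样本原子之间的距离不大于距离阈值时建立样本边, 边上记录该距离。"""
    n = len(positions)
    nodes = list(atom_feats)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d = float(np.linalg.norm(np.asarray(positions[i]) - np.asarray(positions[j])))
            if d <= dist_threshold:
                edges.append((i, j, d))
    return nodes, edges
```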
在实现方式A1中,根据各个样本原子的初始原子特征和样本蛋白质的特征确定样本图结构。本申请实施例还提供一种实现方式A2,在实现方式A2中,可以仅根据各个样本原子的初始原子特征确定样本图结构。
示例性地,基于各个样本原子的初始原子特征,确定每两个样本原子之间的初始距离;基于各个样本原子的初始原子特征和每两个样本原子之间的初始距离,确定样本图结构。其中,两个样本原子之间的初始距离的确定方式与两个样本原子之间的第一距离的确定方式相类似,基于各个样本原子的初始原子特征和每两个样本原子之间的初始距离确定样本图结构的方式,与“基于各个样本原子的第一原子特征和每两个样本原子之间的第一距离,确定样本图结构”的方式相类似,在此不再赘述。
本申请实施例还提供了一种与实现方式A1、实现方式A2不同的样本图结构的确定方式,如下文实现方式A3所示。
在实现方式A3中,样本带噪小分子的数据是初始噪声数据或者对初始噪声数据进行至少一次的去噪处理得到的。
上文已提及,可以将初始噪声数据视作第T次加噪处理后的小分子数据。对第T次加噪处理后的小分子数据进行第一次去噪处理后,得到第T-1次加噪处理后的小分子数据;对第T-1次加噪处理后的小分子数据进行第二次去噪处理后,得到第T-2次加噪处理后的小分子数据;以此类推。也就是说,对第T次加噪处理后的小分子数据进行T次去噪处理,得到第0次加噪处理后的小分子数据,即上文提及的样本小分子的数据。
请参见图3,通过对第t次加噪处理后的小分子数据进行一次去噪处理,得到第t-1次加噪处理后的小分子数据。通过这个原理,可以对第T次加噪处理后的小分子数据不断地进行去噪处理,在经过T次去噪处理后,得到第0次加噪处理后的小分子数据。
可以将第t次加噪处理后的小分子数据作为样本带噪小分子的数据,t为大于或者等于1且小于或者等于T的正整数。
在实现方式A3中,步骤202包括步骤2024至步骤2025。
步骤2024,获取样本去噪次数信息,样本去噪次数信息表征从初始噪声数据变至样本带噪小分子的数据所进行的去噪处理的次数。
本申请实施例中,对第0次加噪处理后的小分子数据进行T次加噪处理,依次得到第1次至第T次加噪处理后的小分子数据。对第T次加噪处理后的小分子数据进行T次去噪处理,依次得到第T-1次至第0次加噪处理后的小分子数据。因此,第t次加噪处理与第T-t次去噪处理是相逆的两个过程。
当样本带噪小分子的数据为第t次加噪处理后的小分子数据时,表明从第0次加噪处理后的小分子数据变至第t次加噪处理后的小分子数据所需要t次加噪处理。基于此可以得出:从第T次加噪处理后的小分子数据变至第t次加噪处理后的小分子数据所需要T-t次去噪处理,而初始噪声数据可以视为第T次加噪处理后的小分子数据。因此,样本去噪次数信息为T-t。
步骤2025,基于样本去噪次数信息和多个样本原子的数据,通过神经网络模型确定样本图结构。
本申请实施例中,通过神经网络模型,至少可以按照下文提及的实现方式B1或者实现方式B2,基于样本去噪次数信息和多个样本原子的数据确定样本图结构。
样本去噪次数信息可以体现从初始噪声数据变至样本带噪小分子的数据所进行的去噪处理的次数,结合样本去噪次数信息和多个样本原子的数据确定样本图结构,由此可以更加更快地基于样本图结构得到预测噪声数据,由此提高药物的研发效率。
实现方式B1,步骤2025包括步骤C1至步骤C3。
步骤C1,通过神经网络模型对样本去噪次数信息进行特征提取,得到样本去噪次数特征。
在一种可能的实现方式中,神经网络模型还可以包括次数编码器。将样本去噪次数信息输入次数编码器,由次数编码器对样本去噪次数信息进行特征提取,得到样本去噪次数特征。本申请实施例不对次数编码器的模型结构、模型参数等做限定。示例性地,次数编码器为多层感知机或者AE等。可选地,通过次数编码器对样本去噪次数信息进行独热编码、多热编码等编码处理,得到样本去噪次数特征。
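次数编码器的一种示意性写法如下(基于 PyTorch;嵌入维度、网络结构均为假设,并非本申请限定的实现):

```python
import torch
import torch.nn as nn

class StepEncoder(nn.Module):
    """次数编码器: 先对去噪次数做嵌入, 再经多层感知机得到样本去噪次数特征。"""
    def __init__(self, max_steps=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(max_steps + 1, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, t):
        return self.mlp(self.embed(t))

step_feat = StepEncoder()(torch.tensor([10]))      # 样本去噪次数为 10 时得到的样本去噪次数特征
```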
步骤C2,通过神经网络模型对多个样本原子的数据分别进行特征提取,得到各个样本原子的初始原子特征。
可以通过神经网络模型中的小分子编码器对各个样本原子的数据进行特征提取,得到各个样本原子的初始原子特征。步骤C2的内容可以见步骤2021的描述,在此不再赘述。
步骤C3,基于样本去噪次数特征和各个样本原子的初始原子特征,通过神经网络模型确定样本图结构。
通过神经网络模型的图结构生成器,至少可以按照实现方式D1或者实现方式D2,基于样本去噪次数特征和各个样本原子的初始原子特征确定样本图结构。
在实现方式D1中,步骤C3包括:对于任一个样本原子,通过神经网络模型将任一个样本原子的初始原子特征和样本去噪次数特征进行融合,得到任一个样本原子的第二原子特征;基于各个样本原子的第二原子特征,确定每两个样本原子之间的第二距离;基于各个样本原子的第二原子特征和每两个样本原子之间的第二距离,确定样本图结构。
通过神经网络模型,将任一个样本原子的初始原子特征和样本去噪次数特征进行拼接或者相加或者相乘等任意的融合处理,得到该样本原子的第二原子特征。按照距离公式,基于任两个样本原子的第二原子特征,确定这两个样本原子之间的第二距离。通过这种方式,可以确定出每两个样本原子之间的第二距离。
可选地,将任一个样本原子的第二原子特征作为样本节点,将任两个样本原子之间的第二距离作为这两个样本原子对应的样本节点之间的边。通过这种方式,可以确定各个样本节点和每两个样本节点之间的边,得到样本图结构。
或者,将任一个样本原子的第二原子特征作为样本节点。对于任两个样本节点,若任两个样本节点对应的样本原子之间的第二距离大于距离阈值,则确定这两个样本节点之间不存在边,若任两个样本节点对应的样本原子之间的第二距离不大于距离阈值,则将这两个样本节点对应的样本原子之间的第二距离确定为这两个样本节点之间的边。通过这种方式,可以确定各个样本节点以及任两个样本节点之间存在的边,从而得到样本图结构。
在实现方式D2中,步骤C3包括:对于任一个样本原子,通过神经网络模型将任一个样本原子的初始原子特征、样本去噪次数特征和样本蛋白质的特征进行融合,得到任一个样本原子的第三原子特征;基于各个样本原子的第三原子特征,确定每两个样本原子之间的第三距离;基于各个样本原子的第三原子特征和每两个样本原子之间的第三距离,确定样本图结构。
通过神经网络模型,将任一个样本原子的初始原子特征、样本去噪次数特征和样本蛋白质的特征进行拼接或者相加或者相乘等任意的融合处理,得到该样本原子的第三原子特征。按照距离公式,基于任两个样本原子的第三原子特征,确定这两个样本原子之间的第三距离。通过这种方式,可以确定出每两个样本原子之间的第三距离。
可选地,将任一个样本原子的第三原子特征作为样本节点,将任两个样本原子之间的第三距离作为这两个样本原子对应的样本节点之间的边。通过这种方式,可以确定各个样本节点和每两个样本节点之间的边,得到样本图结构。
或者,将任一个样本原子的第三原子特征作为样本节点。对于任两个样本节点,若任两个样本节点对应的样本原子之间的第三距离大于距离阈值,则确定这两个样本节点之间不存在边,若任两个样本节点对应的样本原子之间的第三距离不大于距离阈值,则将这两个样本节点对应的样本原子之间的第三距离确定为这两个样本节点之间的边。通过这种方式,可以确定各个样本节点以及任两个样本节点之间存在的边,从而得到样本图结构。
实现方式B2,步骤2025包括:基于样本去噪次数信息确定样本加噪次数信息,通过神经网络模型对样本加噪次数信息进行特征提取,得到样本加噪次数特征;通过神经网络模型对多个样本原子数据进行特征提取,得到各个样本原子的初始原子特征;基于样本加噪次数特征和各个样本原子的初始原子特征,通过神经网络模型确定样本图结构。
由于第t次加噪处理与第T-t次去噪处理是相逆的两个过程。因此,当确定出样本去噪次数信息为T-t时,可以基于样本去噪次数信息确定样本加噪次数信息为t。
将样本加噪次数信息输入神经网络模型的次数编码器,由次数编码器对样本加噪次数信息进行特征提取,得到样本加噪次数特征。可选地,通过次数编码器对样本加噪次数信息进行独热编码、多热编码等编码处理,得到样本加噪次数特征。
此外,通过神经网络模型的小分子编码器,可以对各个样本原子的数据进行特征提取,得到各个样本原子的初始原子特征。样本原子的初始原子特征的确定方式可以见步骤2021的描述,在此不再赘述。
接着,通过神经网络模型,基于样本加噪次数特征和各个样本原子的初始原子特征确定样本图结构。
可选地,对于任一个样本原子,通过神经网络模型将任一个样本原子的初始原子特征和样本加噪次数特征进行融合,得到任一个样本原子的第四原子特征;基于各个样本原子的第四原子特征,确定每两个样本原子之间的第四距离;基于各个样本原子的第四原子特征和每两个样本原子之间的第四距离,确定样本图结构。上述过程的实现方式可以见实现方式D1的描述,二者实现原理相类似,在此不再赘述。
或者,对于任一个样本原子,通过神经网络模型将任一个样本原子的初始原子特征、样本加噪次数特征和样本蛋白质的特征进行融合,得到任一个样本原子的第五原子特征;基于各个样本原子的第五原子特征,确定每两个样本原子之间的第五距离;基于各个样本原子的第五原子特征和每两个样本原子之间的第五距离,确定样本图结构。上述过程的实现方式可以见实现方式D2的描述,二者实现原理相类似,在此不再赘述。
可以理解的是,在实现方式A3中是直接获取样本去噪次数信息,利用样本去噪次数信息构建样本图结构,或者,基于样本去噪次数信息确定样本加噪次数信息,并基于样本加噪次数信息构建样本图结构。在应用时,可以基于实现方式A3的原理,直接获取样本 加噪次数信息,利用样本加噪次数信息构建样本图结构,或者,基于样本加噪次数信息确定样本去噪次数信息,并基于样本去噪次数信息构建样本图结构,在此不再赘述。
步骤203,通过神经网络模型对样本图结构进行预测得到预测噪声数据。
在一种可能的实现方式中,神经网络模型还可以包括噪声生成器。将样本图结构输入噪声生成器,由噪声生成器基于样本图结构确定预测噪声数据,预测噪声数据是通过预测得到的噪声数据。
可选地,噪声生成器包括图编码器和激活层,其中,图编码器和激活层的功能在下文有对应描述,在此不再赘述。本申请实施例不对图编码器的网络结构、网络参数等做限定,示例性地,图编码器可以为图自编码器(Graph Auto-Encoders,GAE)、图变分自编码器(Graph Variational Auto-Encoder,GVAE)等。本申请实施例也不对激活层的网络结构、网络参数等做限定,示例性地,激活层可以为线性修正单元(Rectified Linear Unit,ReLU)、S型生长曲线(即Sigmoid函数)等。
在实现方式E1中,步骤203包括步骤2031至步骤2033。
步骤2031,通过神经网络模型对样本图结构进行特征提取,得到各个样本原子的待处理原子特征。
在本申请实施例中,可以将样本图结构输入图编码器,通过图编码器对样本图结构进行至少一次的更新处理,得到更新后的样本图结构,更新后的样本图结构中的各个样本节点为各个样本原子的待处理原子特征。
上文已提及,样本图结构包括多个样本节点和多个样本边,任一个样本节点为一个样本原子的初始原子特征,或者,任一个样本节点为一个样本原子的第一原子特征至第五原子特征中的任一项,任一个样本节点通过一个样本边与另一个样本节点连接。
在对样本图结构进行一次更新时,对于任一个样本节点,可以利用该样本节点和与该样本节点通过样本边连接的其他的样本节点,更新该样本节点,或者,可以利用该样本节点、一端为该样本节点的各个样本边、以及与该样本节点通过样本边连接的其他的样本节点,更新该样本节点。通过这种方式,可以更新样本图结构中的各个样本节点,得到更新后的样本图结构。
将更新后的样本图结构中的各个样本节点作为各个样本原子的待处理原子特征,或者,将更新后的样本图结构作为样本图结构,通过再次更新样本图结构中的各个样本节点,得到更新后的样本图结构,并将更新后的样本图结构中的各个样本节点作为各个样本原子的待处理原子特征。
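图编码器对样本图结构进行一次更新处理的过程可以示意性地写成如下形式(此处采用按距离加权的近邻聚合,仅为假设性实现,不代表本申请采用的具体等变网络结构):

```python
import torch

def update_nodes_once(node_feats, edges, edge_dists):
    """对样本图结构做一次更新处理:
    node_feats: [N, D] 样本节点; edges: [(i, j), ...] 样本边两端节点下标; edge_dists: [E] 样本边表征的距离。
    用相邻样本节点特征按距离加权聚合后更新每个样本节点。"""
    agg = torch.zeros_like(node_feats)
    deg = torch.zeros(node_feats.shape[0], 1)
    for (i, j), dist in zip(edges, edge_dists):
        w = 1.0 / (1.0 + float(dist))              # 距离越近, 相互作用越强, 权重越大
        agg[i] += w * node_feats[j]
        agg[j] += w * node_feats[i]
        deg[i] += 1
        deg[j] += 1
    return node_feats + agg / deg.clamp(min=1)     # 更新后的样本节点即各样本原子的待处理原子特征
```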
步骤2032,基于各个样本原子的待处理原子特征,通过神经网络模型确定预测类型噪声数据和预测位置噪声数据中的至少一项,预测类型噪声数据是通过预测得到的与样本原子的类型相关的噪声数据,预测位置噪声数据是通过预测得到的与样本原子的位置相关的噪声数据。
在本申请实施例中,可以将各个样本原子的待处理原子特征输入噪声生成器中的激活层,通过激活层对各个样本原子的待处理原子特征进行激活处理,得到预测类型噪声数据和/或预测位置噪声数据。
可选地,通过激活层对各个样本原子的待处理原子特征进行激活处理,得到各个样本原子的类型噪声数据,任一个样本原子的类型噪声数据是通过预测得到的与该样本原子的类型相关的噪声数据。其中,预测类型噪声数据包括各个样本原子的类型噪声数据。
同样地,通过激活层对各个样本原子的待处理原子特征进行激活处理,得到各个样本原子的位置噪声数据,任一个样本原子的位置噪声数据是通过预测得到的与该样本原子的位置相关的噪声数据。其中,预测位置噪声数据包括各个样本原子的位置噪声数据。
步骤2033,将预测类型噪声数据和预测位置噪声数据中的至少一项,作为预测噪声数据。
在本申请实施例中,可以将预测类型噪声数据或者预测位置噪声数据作为预测噪声数据。或者,可以将预测类型噪声数据和预测位置噪声数据作为预测噪声数据,或者,可以将预测类型噪声数据或者预测位置噪声数据作为预测噪声数据。
在实现方式E2中,步骤203包括步骤2034至步骤2036。
步骤2034,对于样本图结构包括的任一个样本边,若任一个样本边表征的距离不大于参考距离,则将任一个样本边确定为第一边;通过神经网络模型从样本图结构包括的多个样本边中删除第一边,得到第一图结构。
本申请实施例不对参考距离做限定,示例性地,参考距离是根据人工经验设定的数值,或者,参考距离不小于化学键作用的距离且不大于范德华力作用的距离。例如,化学键作用的距离小于范德华力作用的距离,通常情况下,化学键作用的距离小于2埃(单位:Å),而范德华力作用的距离大于2埃,则可以确定参考距离为2埃。
上文已提及,任一个样本边为该样本边两端两个样本节点对应的样本原子之间的初始距离,或者,任一个样本边为该样本边两端两个样本节点对应的样本原子之间的第一距离至第五距离中的任一项。也就是说,任一个样本边表征该样本边两端两个样本节点对应的样本原子之间的距离。
若样本图结构包括的任一个样本边表征的距离不大于参考距离,则将该样本边确定为第一边,通过神经网络模型的图结构生成器,从样本图结构中删除该第一边。通过这种方式,可以从样本图结构中删除至少一个第一边,得到第一图结构。
需要说明的是,上述第一图结构是通过从样本图结构中删除至少一个第一边得到的。在应用时,可以有其他第一图结构的生成方式。例如,在构建样本图结构时,若任两个样本节点对应的样本原子之间的距离(该距离可以为第一距离至第五距离中的任一个)大于参考距离,则将这两个样本节点对应的样本原子之间的距离确定为这两个样本节点之间的边,若任两个样本节点对应的样本原子之间的距离不大于参考距离,确定这两个样本节点之间不存在边。通过这种方式构建的样本图结构即为第一图结构。
当参考距离不小于化学键作用的距离且不大于范德华力作用的距离时,通过从样本图结构中删除至少一个第一边,可以将样本图结构中与化学键相关的样本边删除,使得第一图结构中包括与范德华力相关的样本边。
步骤2035,基于第一图结构,通过神经网络模型确定第一噪声数据。
由于第一图结构中包括与范德华力相关的样本边,通过神经网络模型基于第一图结构确定第一噪声数据,实现了基于与范德华力相关的样本边确定第一噪声数据。由于是通过分析样本带噪小分子中的范德华力得到的第一噪声数据,使得第一噪声数据与单一因素也就是范德华力相关,使得模型可以专注于学习噪声数据与范德华力之间的映射关系,提高模型确定噪声数据的准确性,也就是说,第一噪声数据的准确性较高。
在本申请实施例中,可以将第一图结构输入噪声生成器,由噪声生成器基于第一图结构确定第一噪声数据,第一噪声数据是通过预测得到的噪声数据。
可选地,步骤2035包括:通过神经网络模型对第一图结构进行特征提取,得到各个样本原子的第六原子特征;通过神经网络模型基于各个样本原子的第六原子特征,确定第一类型噪声数据和第一位置噪声数据中的至少一项,第一类型噪声数据是通过预测得到的与样本原子的类型相关的噪声数据,第一位置噪声数据是通过预测得到的与样本原子的位置相关的噪声数据;将第一类型噪声数据和第一位置噪声数据中的至少一项,作为第一噪声数据。
在本申请实施例中,可以将第一图结构输入图编码器,通过图编码器对第一图结构进行至少一次的更新处理,得到更新后的第一图结构,更新后的第一图结构中的各个样本节点为各个样本原子的第六原子特征。其中,对第一图结构进行更新处理的方式与对样本图结构进行更新处理的方式相类似,可以见步骤2031的描述,在此不再赘述。
在本申请实施例中,可以将各个样本原子的第六原子特征输入噪声生成器中的激活层,通过激活层对各个样本原子的第六原子特征进行激活处理,得到第一类型噪声数据和/或第一位置噪声数据。第一类型噪声数据的确定方式与预测类型噪声数据的确定方式相类似,第一位置噪声数据的确定方式与预测位置噪声数据的确定方式相类似,在此不再赘述。
其中,第一类型噪声数据包括各个样本原子的类型噪声数据,任一个样本原子的类型噪声数据是通过预测得到的与该样本原子的类型相关的噪声数据。同样地,第一位置噪声数据包括各个样本原子的位置噪声数据,任一个样本原子的位置噪声数据是通过预测得到的与该样本原子的位置相关的噪声数据。
在本申请实施例中,可以将第一类型噪声数据或者第一位置噪声数据作为第一噪声数据。或者,可以将第一类型噪声数据和第一位置噪声数据作为第一噪声数据,或者,可以将第一类型噪声数据或者第一位置噪声数据作为第一噪声数据。
步骤2036,基于第一噪声数据确定预测噪声数据。
在本申请实施例中,可以将第一噪声数据作为预测噪声数据,或者,将第一噪声数据乘以对应的权重得到预测噪声数据。
在实现方式E3中,步骤203包括步骤2037至步骤2039。
步骤2037,对于样本图结构包括的任一个样本边,若任一个样本边表征的距离大于参考距离,则将任一个样本边确定为第二边;通过神经网络模型从样本图结构包括的多个样本边中删除第二边,得到第二图结构。
若样本图结构包括的任一个样本边表征的距离大于参考距离,则将该样本边确定为第二边,通过神经网络模型的图结构生成器,从样本图结构中删除该第二边。通过这种方式,可以从样本图结构中删除至少一个第二边,得到第二图结构。
需要说明的是,上述第二图结构是通过从样本图结构中删除至少一个第二边得到的。在应用时,可以有其他第二图结构的生成方式。例如,在构建样本图结构时,若任两个样本节点对应的样本原子之间的距离(该距离可以为第一距离至第五距离中的任一个)大于参考距离,则确定这两个样本节点之间不存在边,若任两个样本节点对应的样本原子之间的距离不大于参考距离,将这两个样本节点对应的样本原子之间的距离确定为这两个样本节点之间的边。通过这种方式构建的样本图结构即为第二图结构。
当参考距离不小于化学键作用的距离且不大于范德华力作用的距离时,通过从样本图结构中删除至少一个第二边,可以将样本图结构中与范德华力相关的样本边删除,使得第二图结构中包括与化学键相关的样本边。
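按参考距离将样本边划分、分别得到第一图结构与第二图结构所保留的样本边,可以示意性地写成如下形式(参考距离取 2 埃仅为示例):

```python
def split_edges_by_reference_distance(edges, ref_dist=2.0):
    """edges: [(i, j, dist), ...] 样本图结构中的样本边; ref_dist: 参考距离。
    删除距离不大于参考距离的第一边, 得到第一图结构的样本边(与范德华力相关);
    删除距离大于参考距离的第二边, 得到第二图结构的样本边(与化学键相关)。"""
    graph1_edges = [e for e in edges if e[2] > ref_dist]    # 第一图结构保留的样本边
    graph2_edges = [e for e in edges if e[2] <= ref_dist]   # 第二图结构保留的样本边
    return graph1_edges, graph2_edges
```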
步骤2038,基于第二图结构,通过神经网络模型确定第二噪声数据。
由于第二图结构中包括与化学键相关的样本边,通过神经网络模型基于第二图结构确定第二噪声数据,实现了基于与化学键相关的样本边确定第二噪声数据。由于是通过分析样本带噪小分子中的化学键得到的第二噪声数据,使得第二噪声数据与单一因素也就是化学键相关,使得模型可以专注于学习噪声数据与化学键之间的映射关系,提高模型确定噪声数据的准确性,也就是说,第二噪声数据的准确性较高。
在本申请实施例中,可以将第二图结构输入噪声生成器,由噪声生成器基于第二图结构确定第二噪声数据,第二噪声数据是通过预测得到的噪声数据。
可选地,步骤2038包括:通过神经网络模型对第二图结构进行特征提取,得到各个样本原子的第七原子特征;通过神经网络模型基于各个样本原子的第七原子特征,确定第二类型噪声数据和第二位置噪声数据中的至少一项,第二类型噪声数据是通过预测得到的与样本原子的类型相关的噪声数据,第二位置噪声数据是通过预测得到的与样本原子的位置相关的噪声数据;将第二类型噪声数据和第二位置噪声数据中的至少一项,作为第二噪声数据。
在本申请实施例中,可以将第二图结构输入图编码器,通过图编码器对第二图结构进行至少一次的更新处理,得到更新后的第二图结构,更新后的第二图结构中的各个样本节点为各个样本原子的第七原子特征。其中,对第二图结构进行更新处理的方式与对样本图结构进行更新处理的方式相类似,可以见步骤2031的描述,在此不再赘述。
在本申请实施例中,可以将各个样本原子的第七原子特征输入噪声生成器中的激活层,通过激活层对各个样本原子的第七原子特征进行激活处理,得到第二类型噪声数据和/或第二位置噪声数据。第二类型噪声数据的确定方式与预测类型噪声数据的确定方式相类似,第二位置噪声数据的确定方式与预测位置噪声数据的确定方式相类似,在此不再赘述。
其中,第二类型噪声数据包括各个样本原子的类型噪声数据,任一个样本原子的类型噪声数据是通过预测得到的与该样本原子的类型相关的噪声数据。同样地,第二位置噪声 数据包括各个样本原子的位置噪声数据,任一个样本原子的位置噪声数据是通过预测得到的与该样本原子的位置相关的噪声数据。
可以将第二类型噪声数据或者第二位置噪声数据作为第二噪声数据。或者,可以将第二类型噪声数据和第二位置噪声数据作为第二噪声数据,或者,可以将第二类型噪声数据或者第二位置噪声数据作为第二噪声数据。
步骤2039,基于第二噪声数据确定预测噪声数据。
在本申请实施例中,可以将第二噪声数据作为预测噪声数据,或者,将第二噪声数据乘以对应的权重之后得到预测噪声数据。
在实现方式E4中,步骤203包括:对于样本图结构包括的任一个样本边,若任一个样本边表征的距离不大于参考距离,则将任一个样本边确定为第一边;通过神经网络模型从样本图结构包括的多个样本边中删除第一边,得到第一图结构;通过神经网络模型基于第一图结构确定第一噪声数据。对于样本图结构包括的任一个样本边,若任一个样本边表征的距离大于参考距离,则将任一个样本边确定为第二边;通过神经网络模型从样本图结构包括的多个样本边中删除第二边,得到第二图结构;通过神经网络模型基于第二图结构确定第二噪声数据。基于第一噪声数据和第二噪声数据确定预测噪声数据。
在本申请实施例中,可以按照实现方式E2的内容确定第一噪声数据,按照实现方式E3的内容确定第二噪声数据。将第一噪声数据和第二噪声数据确定为预测噪声数据,或者,将第一噪声数据和第二噪声数据进行加权平均、加权求和等运算处理,得到预测噪声数据。
可选地,第一噪声数据包括第一类型噪声数据和第一位置噪声数据,第二噪声数据包括第二类型噪声数据和第二位置噪声数据。将第一类型噪声数据和第二类型噪声数据进行加权求和、加权求平均等运算处理,得到预测类型噪声数据。将第一位置噪声数据和第二位置噪声数据进行加权求和、加权求平均等运算处理,得到预测位置噪声数据。预测噪声数据包括预测类型噪声数据和预测位置噪声数据。
步骤204,基于预测噪声数据和标注噪声数据,对神经网络模型进行训练,得到噪声数据确定模型,噪声数据确定模型用于基于待处理带噪小分子的数据确定最终噪声数据。
在本申请实施例中,可以基于预测噪声数据和标注噪声数据确定神经网络模型的损失。基于神经网络模型的损失对神经网络模型进行训练,得到训练后的神经网络模型。
若训练后的神经网络模型满足训练结束条件,则将训练后的神经网络模型作为噪声数据确定模型。若训练后的神经网络模型不满足训练结束条件,则将训练后的神经网络模型作为神经网络模型,并按照步骤201至步骤204的方式对神经网络模型进行训练,直至满足训练结束条件,得到噪声数据确定模型。
本申请实施例不对训练结束条件做限定。示例性地,满足训练结束条件为达到训练次数,例如,训练次数达到500次或者1000次。或者,满足训练结束条件为本次训练得到的神经网络模型的损失与上一次训练得到的神经网络模型的损失之间的差值在设定范围内。或者,满足训练结束条件为本次训练得到的神经网络模型的损失的梯度在设定范围内。
可选地,可以按照如下所示的公式(5),基于预测噪声数据和标注噪声数据确定神经网络模型的损失。
$\mathcal{L} = \sum_{t=1}^{T} \gamma_t\, \mathbb{E}_{\{G_0\}\sim q(G_0),\ \epsilon\sim\mathcal{N}(0, I)}\big[\,\|\epsilon - \epsilon_\theta(G_t, p_{ctx}, t)\|^2\,\big]$　　公式(5)
其中,$\mathcal{L}$ 表征神经网络模型的损失。T表征去噪处理的总次数。$\gamma_t$ 为超参数。$\epsilon$ 表征标注噪声数据,且标注噪声数据可以设计为服从正态分布。$\epsilon_\theta(G_t, p_{ctx}, t)$ 表征预测噪声数据,其中,$G_t$ 表征第t次加噪处理后的小分子数据,$p_{ctx}$ 表征样本蛋白质的数据。$\|\epsilon - \epsilon_\theta(G_t, p_{ctx}, t)\|^2$ 表征计算标注噪声数据和预测噪声数据之间的均方误差。E为求平均符号。$\{G_0\}\sim q(G_0)$ 表征加噪处理时第0次加噪处理后的小分子数据与去噪处理时第0次加噪处理后的小分子数据相同。$\epsilon\sim\mathcal{N}(0, I)$ 表征标注噪声数据符合正态分布函数 $\mathcal{N}(0, I)$,I 为正态分布函数的参数。
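公式(5)所示损失的一个示意性实现如下(基于 PyTorch;$\gamma_t$ 的取值、张量形状均为假设):

```python
import torch
import torch.nn.functional as F

def noise_loss(pred_noise, labeled_noise, t, gamma):
    """pred_noise / labeled_noise: [N, 3] 预测噪声数据与标注噪声数据;
    t: 当前采样到的加噪次数; gamma: [T+1] 各次数对应的超参数权重。
    返回公式(5)中单个样本、单个 t 对应的加权均方误差项。"""
    return gamma[t] * F.mse_loss(pred_noise, labeled_noise)

# 用法示例 (各张量形状与数值仅作演示)
gamma = torch.ones(1001)
loss = noise_loss(torch.randn(5, 3), torch.randn(5, 3), t=10, gamma=gamma)
```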
其中,公式(5)的推导过程如公式(6)所示。
$\mathbb{E}\big[-\log p_\theta(G_0 \mid p_{ctx})\big] \le \mathbb{E}_q\Big[-\log \dfrac{p_\theta(G_{0:T} \mid p_{ctx})}{q(G_{1:T} \mid G_0, p_{ctx})}\Big] = \mathbb{E}_q\Big[\sum_{t>1} D_{KL}\big(q(G_{t-1} \mid G_t, G_0, p_{ctx})\,\|\,p_\theta(G_{t-1} \mid G_t, p_{ctx})\big) - \log p_\theta(G_0 \mid G_1, p_{ctx})\Big] + C$　　公式(6)
其中,E为求平均符号。log为对数符号。$p_\theta(G_0 \mid p_{ctx})$ 表征将样本蛋白质的数据作为约束条件,通过去噪处理得到的第0次加噪处理后的小分子数据。$q(G_{1:T} \mid G_0, p_{ctx})$ 表征将样本蛋白质的数据作为约束条件,对第0次加噪处理后的小分子数据进行T次加噪处理后,依次得到第1次至第T次加噪处理后的小分子数据。$p_\theta(G_{0:T} \mid p_{ctx})$ 表征将样本蛋白质的数据作为约束条件,通过T次去噪处理后依次得到的第0次至第T次加噪处理后的小分子数据。$D_{KL}$ 表征相对熵函数的函数符号。$q(G_{t-1} \mid G_t, G_0, p_{ctx})$ 表征将样本蛋白质的数据作为约束条件,对第0次加噪处理后的小分子数据进行加噪处理得到第t次加噪处理后的小分子数据的过程中得到的第t-1次加噪处理后的小分子数据。$p_\theta(G_{t-1} \mid G_t, p_{ctx})$ 表征将样本蛋白质的数据作为约束条件,通过对第t次加噪处理后的小分子数据进行去噪处理,得到第t-1次加噪处理后的小分子数据。采用重参数技巧,基于标注噪声数据可以得到 $G_t = \sqrt{\bar\alpha_t}\,G_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$,其中,$\bar\alpha_t$ 为设定的参数。
在一种可能的实现方式中,预测噪声数据包括预测类型噪声数据和预测位置噪声数据,标注噪声数据包括标注类型噪声数据和标注位置噪声数据。其中,预测类型噪声数据是通过预测得到的与样本原子的类型相关的噪声数据,预测位置噪声数据是通过预测得到的与样本原子的位置相关的噪声数据。标注类型噪声数据是通过标注得到的与样本原子的类型相关的噪声数据,标注位置噪声数据是通过标注得到的与样本原子的位置相关的噪声数据。
步骤204包括步骤2041至步骤2043。
步骤2041,基于预测类型噪声数据和标注类型噪声数据,确定第一损失。
在本申请实施例中,可以按照第一损失函数,基于预测类型噪声数据和标注类型噪声数据确定第一损失。本申请实施例不对第一损失函数做限定,示例性地,第一损失函数为相对熵损失函数、平均绝对误差(Mean Absolute Error,MAE)损失函数或者均方误差(Mean Square Error,MSE)损失函数等。其中,MAE损失函数也叫L1损失函数,MSE损失函数也叫L2损失函数。可选地,第一损失函数还可以是利用L2损失函数对L1损失函数进行平滑处理后的损失函数,即经过平滑处理后的L1损失函数。
步骤2042,基于预测位置噪声数据和标注位置噪声数据,确定第二损失。
在本申请实施例中,可以按照第二损失函数,基于预测位置噪声数据和标注位置噪声数据确定第二损失。本申请实施例不对第二损失函数做限定,示例性地,第二损失函数为相对熵损失函数、L1损失函数、L2损失函数或者经过平滑处理后的L1损失函数。
步骤2043,基于第一损失和第二损失,对神经网络模型进行训练,得到噪声数据确定模型。
在本申请实施例中,可以对第一损失和第二损失进行加权求和、加权求平均等运算处理,得到神经网络模型的损失。基于神经网络模型的损失对神经网络模型进行训练,得到训练后的神经网络模型。若训练后的神经网络模型满足训练结束条件,将训练后的神经网络模型作为噪声数据确定模型。若训练后的神经网络模型不满足训练结束条件,则将训练后的神经网络模型作为下一次训练的神经网络模型,并按照步骤201至步骤204的方式对神经网络模型进行下一次训练,直至满足训练结束条件,得到噪声数据确定模型。
通过样本原子的第一损失对各原子的类型进行约束,通过样本原子的第二损失对各原子的位置进行约束,由此提高噪声数据确定模型的准确性。
可以理解的是,除了基于第一损失、第二损失确定神经网络模型的损失之外,还可以基于其他损失来确定神经网络模型的损失。
示例性地,在确定出预测噪声数据之后,可以基于预测噪声数据,对样本带噪小分子数据进行去噪处理,得到去噪后的小分子数据。去噪后的小分子数据包括多个样本原子的第一数据。基于任一个样本原子的第一数据和样本蛋白质的数据,确定样本原子与样本蛋白质的表面之间的距离。
若样本原子与样本蛋白质的表面之间的距离小于样本原子的尺寸,则将样本原子的尺寸减去样本原子与样本蛋白质的表面之间的距离,得到的差值作为该样本原子的第三损失。若样本原子与样本蛋白质的表面之间的距离不小于样本原子的尺寸,则该样本原子不存在第三损失。
基于第一损失和/或第二损失和/或至少一个样本原子的第三损失,确定神经网络模型的损失,利用神经网络模型的损失训练得到噪声数据确定模型。
由于当样本原子与样本蛋白质的表面之间的距离小于样本原子的尺寸时,该样本原子存在第三损失,且该样本原子的第三损失为样本原子的尺寸减去样本原子与样本蛋白质的表面之间的距离得到的差值。通过样本原子的第三损失来对小分子中各原子的位置进行约束,避免小分子中的原子与蛋白质重叠,提高噪声数据确定模型的准确性。
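第三损失的一个示意性实现如下(其中原子尺寸与到蛋白质表面距离的计算方式均为假设):

```python
import torch

def clash_loss(atom_protein_dists, atom_radii):
    """第三损失: atom_protein_dists: [N] 各样本原子到样本蛋白质表面的距离;
    atom_radii: [N] 各样本原子的尺寸。
    当距离小于原子尺寸时, 以 (尺寸 - 距离) 作为该原子的损失, 否则为 0。"""
    return torch.clamp(atom_radii - atom_protein_dists, min=0.0).sum()

loss3 = clash_loss(torch.tensor([0.5, 2.0]), torch.tensor([1.2, 1.2]))  # 仅第一个原子产生损失 0.7
```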
需要说明的是,本申请所涉及的信息(包括但不限于用户设备信息、用户个人信息等)、数据(包括但不限于用于分析的数据、存储的数据、展示的数据等)以及信号,均为经用户授权或者经过各方充分授权的,且相关数据的收集、使用和处理需要遵守相关地区的相关法律法规和标准。例如,本申请中涉及到的样本带噪小分子的数据、标注噪声数据都是在充分授权的情况下获取的。
上述方法是基于样本带噪小分子的数据中多个样本原子的数据确定样本图结构,通过神经网络模型对样本图结构进行预测来确定预测噪声数据,基于预测噪声数据和标注噪声数据训练得到噪声数据确定模型。通过噪声数据确定模型可以确定待处理带噪小分子的数 据中的最终噪声数据,从而可以实现基于最终噪声数据对待处理带噪小分子的数据进行去噪处理,得到去噪后的小分子数据,进而可以基于去噪后的小分子数据进行药物研发,提高药物研发效率。
本申请实施例还提供了一种噪声数据的确定方法,该方法可应用于上述实施环境中,可以利用噪声数据确定模型来确定出待处理带噪小分子的数据中的最终噪声数据。以图4所示的本申请实施例提供的一种噪声数据的确定方法的流程图为例,为便于描述,将执行本申请实施例中的噪声数据的确定方法的终端设备101或者服务器102称为电子设备,该方法可以由电子设备来执行。如图4所示,该方法包括如下步骤。
步骤401,获取待处理带噪小分子的数据。
其中,待处理带噪小分子的数据为带有噪声数据的小分子数据且待处理带噪小分子的数据包括多个待处理原子的数据。有关步骤401的描述可以见步骤201中有关“样本带噪小分子的数据”的描述,二者实现原理相类似,在此不再赘述。
步骤402,基于多个待处理原子的数据,通过噪声数据确定模型确定待处理图结构。
其中,待处理图结构包括多个节点和多个边,任一个节点表征一个待处理原子,任一个边表征两端两个节点对应的待处理原子之间的距离,噪声数据确定模型是按照与图2相关的噪声数据确定模型的训练方法训练得到的。有关步骤402的描述可以见步骤202的描述,二者实现原理相类似,在此不再赘述。
在一种可能的实现方式中,步骤402包括:通过噪声数据确定模型对多个待处理原子的数据进行特征提取,得到各个待处理原子的初始原子特征;获取待处理蛋白质的数据,通过噪声数据确定模型对待处理蛋白质的数据进行特征提取,得到待处理蛋白质的特征;通过噪声数据确定模型,基于各个待处理原子的初始原子特征和待处理蛋白质的特征确定待处理图结构。其中,上述内容可以见实现方式A1的描述,二者实现原理相类似,在此不再赘述。
在另一种可能的实现方式中,待处理带噪小分子的数据是初始噪声数据或者对初始噪声数据进行至少一次的去噪处理得到的;步骤402包括:获取去噪次数信息,去噪次数信息表征从初始噪声数据变至待处理带噪小分子的数据所需要的去噪处理的次数;基于去噪次数信息和多个待处理原子的数据,通过噪声数据确定模型确定目标图结构。其中,上述内容可以见实现方式A3的描述,二者实现原理相类似,在此不再赘述。
步骤403,基于待处理图结构,通过噪声数据确定模型确定最终噪声数据。
其中,最终噪声数据是待处理带噪小分子的数据中的噪声数据。有关步骤403的描述可以见步骤203的描述,二者实现原理相类似,在此不再赘述。
在一种可能的实现方式中,步骤403之后还包括:基于最终噪声数据对待处理带噪小分子的数据进行去噪处理,得到第一小分子数据;响应于第一小分子数据满足数据条件,则将第一小分子数据作为目标小分子数据。
本申请实施例中,可以从待处理带噪小分子的数据中去除掉最终噪声数据,以对待处理带噪小分子的数据进行去噪处理,得到第一小分子数据,并在第一小分子数据满足数据条件时,将第一小分子数据作为目标小分子数据。
本申请实施例不对第一小分子数据满足数据条件做限定。示例性地,第一小分子数据是通过对初始噪声数据进行至少一次的去噪处理得到的,因此,当第一小分子数据对应的去噪处理次数达到设定次数时,第一小分子数据满足数据条件。
例如,对初始噪声数据进行t次去噪处理后得到的第一小分子数据,则第一小分子数据对应的去噪处理次数为t。若t=T,则第一小分子数据满足数据条件;若t<T,则第一小分子数据不满足数据条件。
或者,可以基于待处理带噪小分子的数据和第一小分子数据确定目标带噪小分子和第一小分子之间的误差。若该误差在设定范围内,则确定第一小分子数据满足数据条件;若该误差在设定范围外,则确定第一小分子数据不满足数据条件。
在一种可能的实现方式中,基于最终噪声数据对待处理带噪小分子的数据进行去噪处理,得到第一小分子数据之后,还包括:响应于第一小分子数据不满足数据条件,则基于第一小分子数据,通过噪声数据确定模型确定参考图结构;基于参考图结构,通过噪声数据确定模型确定参考噪声数据;基于参考噪声数据对第一小分子数据进行去噪处理,得到第二小分子数据;响应于第二小分子数据满足数据条件,则将第二小分子数据作为目标小分子数据。
当第一小分子数据不满足数据条件时,可以将第一小分子数据可以视作待处理带噪小分子的数据,按照步骤402的方式,通过噪声数据确定模型,基于第一小分子数据确定参考图结构。其中,可以将参考图结构视作待处理图结构。按照步骤403的方式,通过噪声数据确定模型基于参考图结构确定参考噪声数据,其中,参考噪声数据可以视作最终噪声数据。因此,基于第一小分子数据确定参考噪声数据的内容与步骤401至步骤403的内容相类似,在此不再赘述。
接着,从第一小分子数据中去除掉参考噪声数据,以对第一小分子数据进行去噪处理,得到第二小分子数据,并在第二小分子数据满足数据条件时,将第二小分子数据作为目标小分子数据。在第二小分子数据不满足数据条件时,可以将第二小分子数据作为第一小分子数据,通过确定第一小分子数据中的参考噪声数据,并从第一小分子数据中去除掉参考噪声数据,直至满足数据条件,得到目标小分子数据。
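上述应用阶段的迭代去噪流程可以示意性地写成如下形式(其中 model 的接口名称、去噪处理的具体公式以及数据条件的判断方式均为假设):

```python
def generate_target_molecule(model, init_noise, max_steps=1000):
    """从初始噪声数据出发反复去噪, 直至满足数据条件, 得到目标小分子数据。
    model.build_graph / model.predict_noise 为假设的接口名, 仅示意流程。"""
    data = init_noise
    for step in range(max_steps):
        graph = model.build_graph(data, denoise_step=step)   # 确定待处理图结构
        noise = model.predict_noise(graph)                    # 确定最终噪声数据
        data = data - noise                                   # 去噪处理的一种简化形式
        if step == max_steps - 1:                             # 数据条件: 去噪处理次数达到设定次数
            break
    return data
```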
需要说明的是,本申请所涉及的信息(包括但不限于用户设备信息、用户个人信息等)、数据(包括但不限于用于分析的数据、存储的数据、展示的数据等)以及信号,均为经用户授权或者经过各方充分授权的,且相关数据的收集、使用和处理需要遵守相关地区的相关法律法规和标准。例如,本申请中涉及到的待处理带噪小分子的数据等都是在充分授权的情况下获取的。
上述方法是基于待处理带噪小分子的数据确定待处理图结构,基于待处理图结构,通过噪声数据确定模型来确定最终噪声数据,从而可以实现基于最终噪声数据对待处理带噪小分子的数据进行去噪处理,得到去噪后的小分子数据,进而可以基于去噪后的小分子数据进行药物研发,提高药物研发效率。
上述从方法步骤的角度阐述了噪声数据确定模型的训练方法以及噪声数据的确定方法,下面来系统地描述噪声数据确定模型的训练过程。请参见图5,图5是本申请实施例提供 的一种噪声数据确定模型的训练过程的示意图。本申请实施例是通过对神经网络模型进行至少一次的训练后得到噪声数据确定模型,神经网络模型包括小分子编码器、蛋白质编码器、次数编码器和等变神经网络,等变神经网络包括图结构生成器和噪声生成器。在对神经网络模型进行训练的过程中,仅改变模型参数而不改变模型结构,因此,噪声数据确定模型也包括上述各网络块。
首先,获取样本小分子的数据,样本小分子的数据记为第0次加噪处理后的小分子数据。第0次加噪处理后的小分子数据包括多个原子的类型数据和多个原子的位置数据,可以用A来表征多个原子的类型数据,用R来表征多个原子的位置数据。当一个原子为氢原子时,该原子的类型数据为元素符号H;当一个原子为碳原子时,该原子的类型数据为元素符号C;当一个原子为氧原子时,该原子的类型数据为元素符号O。在图5中,第0次加噪处理后的小分子数据包括五个原子的类型数据,这五个原子的类型数据依次为H、C、H、H、O。一个原子的位置数据包括横坐标(用x来表示)、纵坐标(用y来表示)和竖直坐标(用z来表示),可以简写成[x,y,z]。在图5中,第0次加噪处理后的小分子数据包括五个原子的位置数据,这五个原子的位置数据依次为[1,3,1]、[0,2,0]、[1,0,1]、[4,3,5]、[2,0,1]。
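图5中第0次加噪处理后的小分子数据可以示意性地表示为如下的数组形式(与上文给出的五个原子的类型数据和位置数据一一对应):

```python
import numpy as np

A = ["H", "C", "H", "H", "O"]           # 五个原子的类型数据
R = np.array([[1, 3, 1],                # 五个原子的位置数据, 每行为 [x, y, z]
              [0, 2, 0],
              [1, 0, 1],
              [4, 3, 5],
              [2, 0, 1]], dtype=float)
```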
可以基于第1次加噪处理的噪声数据,对第0次加噪处理后的小分子数据进行第1次加噪处理,得到第1次加噪处理后的小分子数据;基于第2次加噪处理的噪声数据,对第1次加噪处理后的小分子数据进行第2次加噪处理,得到第2次加噪处理后的小分子数据;以此类推。也就是说,基于每一次加噪处理时的噪声数据,对第0次加噪处理后的小分子数据进行T次加噪处理,可以得到第1次至第T次加噪处理后的小分子数据,T为正整数。从第1次至第T次加噪处理后的小分子数据中随机采样出一个小分子数据,得到第t次加噪处理后的小分子数据,t为大于等于1且小于等于T的正整数。
其中,第t次加噪处理后的小分子数据即为上文提及的样本带噪小分子的数据。由于对第0次加噪处理后的小分子数据进行了t次加噪处理,因此,第t次加噪处理后的小分子数据包括的各个原子的类型数据携带了一定的噪声数据,可以将携带有一定噪声数据的各个原子的类型数据记为各个样本原子的类型数据。同样地,第t次加噪处理后的小分子数据包括的各个原子的位置数据也携带了一定的噪声数据,将携带有一定噪声数据的各个原子的位置数据记为各个样本原子的位置数据。
将第t次加噪处理后的小分子数据输入小分子编码器,通过小分子编码器对各个样本原子的类型数据进行编码处理,得到各个样本原子的类型特征,通过小分子编码器对各个样本原子的位置数据进行编码处理,得到各个样本原子的位置特征。任一个样本原子的初始原子特征包括该样本原子的类型特征和该样本原子的位置特征。可以用At来表征多个样本原子的类型特征,用Rt来表征多个样本原子的位置特征。
可以将样本蛋白质的数据输入蛋白质编码器,由蛋白质编码器对样本蛋白质的数据进行特征提取,得到样本蛋白质的特征,可以用Cp来表征样本蛋白质的特征。将样本蛋白质的特征、多个样本原子的类型特征和多个样本原子的位置特征进行拼接,得到第一拼接特 征,可以用[At,Cp],Rt来表征第一拼接特征。其中,第一拼接特征包括上文提及的各个样本原子的第一原子特征。
还可以获取样本加噪次数信息,即获取t。将t输入次数编码器,通过次数编码器对t进行编码处理,得到样本加噪次数特征,可以用te来表征样本加噪次数特征。将第一拼接特征和样本加噪次数特征进行拼接,得到第二拼接特征,可以用[At,Cp,te],Rt来表征第二拼接特征。其中,第二拼接特征包括上文提及的各个样本原子的第五原子特征。
接着,将第二拼接特征输入等变神经网络,通过等变神经网络包括的图结构生成器基于第二拼接特征构建样本图结构。样本图结构包括多个样本节点和多个样本边,任一个样本节点为一个样本原子的第五原子特征,任一个样本边为基于两端两个样本原子的第五原子特征确定的这两个样本原子之间的第五距离。通过图结构生成器从样本图结构中删除第五距离不大于参考距离的各个样本边,得到第一图结构;通过图结构生成器从样本图结构中删除第五距离大于参考距离的各个样本边,得到第二图结构。其中,在图5中示出的虚线圆表征与位于圆心的样本原子之间的第五距离不大于参考距离的样本原子所在的范围区域。
将第一图结构输入第一图编码器,得到第一图特征,第一图特征包括上文提及的各个样本原子的第六原子特征。将第一图特征输入第一激活层,得到第一噪声数据,其中,第一噪声数据包括第一类型噪声数据和第一位置噪声数据。同样地,将第二图结构输入第二图编码器,得到第二图特征,第二图特征包括上文提及的各个样本原子的第七原子特征。将第二图特征输入第二激活层,得到第二噪声数据,其中,第二噪声数据包括第二类型噪声数据和第二位置噪声数据。
之后,获取第t次加噪处理的噪声数据,其中,第t次加噪处理的噪声数据作为标注噪声数据。第一噪声数据和第二噪声数据作为预测噪声数据。可以利用预测噪声数据和标注噪声数据确定神经网络模型的损失,以基于神经网络模型的损失对神经网络模型进行一次训练,得到训练后的神经网络模型。当训练后的神经网络模型满足训练结束条件时,则训练后的神经网络模型为噪声数据确定模型;当训练后的神经网络模型不满足训练结束条件时,则训练后的神经网络模型为下一次训练的神经网络模型,可以按照图5所示的方式对神经网络模型进行下一次的训练,直至满足训练结束条件,得到噪声数据确定模型。
本申请实施例中,可以获取随机生成的高斯噪声数据。将高斯噪声数据作为第T次加噪处理的小分子数据。通过噪声数据确定模型确定第T次加噪处理后的小分子数据中的噪声数据,该噪声数据可以记为第T次加噪处理的噪声数据,基于第T次加噪处理的噪声数据对第T次加噪处理后的小分子数据进行去噪处理,得到第T-1次加噪处理后的小分子数据;通过噪声数据确定模型确定第T-1次加噪处理后的小分子数据中的噪声数据,该噪声数据可以记为第T-1次加噪处理的噪声数据,基于第T-1次加噪处理的噪声数据对第T-1次加噪处理后的小分子数据进行去噪处理,得到第T-2次加噪处理后的小分子数据;以此类推。也就是说,基于每一次加噪处理的噪声数据,对第T次加噪处理后的小分子数据进行T次去噪处理,可以得到第0次加噪处理后的小分子数据,也就是上文提及的目标小分子数据。
目标小分子数据是描述目标小分子的数据。请参见图6,图6是本申请实施例提供的一种目标小分子的示意图。其中,图6中的(1)至(6)分别示出了6个目标小分子。
其中,通过噪声数据确定模型确定任一次加噪处理后的小分子数据中的噪声数据的过程,可以见图5中有关基于第t次加噪处理后的小分子数据确定预测噪声数据的过程,其中,预测噪声数据包括第一噪声数据和第二噪声数据。二者实现原理相类似,在此不再赘述。
本申请实施例中,通过噪声数据确定模型对高斯噪声数据进行T次去噪处理,实现了基于热学扩散理论,还原小分子数据中的原子由不稳定状态不断地向稳定状态靠近的过程。通过样本蛋白质的数据,实现了生成能够与样本蛋白质结合的目标小分子的数据,有利于加快药物研发速率。通过确定任一次加噪处理后的小分子数据中的噪声数据,并基于该噪声数据对该加噪处理后的小分子数据进行去噪处理,可以一次性生成去噪后的小分子数据,也就是说,一次性生成小分子数据包括的各个原子的类型数据和各个原子的坐标数据,加快了生成速率,缩短了生成时间,且一次性生成的方式可以避免累计错误。除此之外,通过随机生成高斯噪声,可以实现自定义小分子包括的原子的数量。
相关技术中,可以利用其他的神经网络模型生成目标小分子数据。可以将相关技术中能够生成目标小分子数据的模型记为小分子生成模型。本申请实施例中,使用相同的数据集训练得到小分子生成模型1、小分子生成模型2和噪声数据确定模型,并测试这三个模型在生成目标小分子数据方面的性能。可以采用评分指标1至评分指标6来评价模型性能,得到的结果如下表1所示。
表1
其中,符号“↓”表征评分指标的数值越小,模型性能越好;符号“↑”表征评分指标的数值越大,模型性能越好。由表1也可以看出,噪声数据确定模型的性能优于小分子生成模型1和小分子生成模型2。
图7所示为本申请实施例提供的一种噪声数据确定模型的训练装置的结构示意图,如图7所示,该装置包括:
获取模块701,用于获取样本带噪小分子的数据和标注噪声数据,样本带噪小分子的数据为带有噪声数据的小分子数据且样本带噪小分子的数据包括多个样本原子的数据,标注噪声数据是通过标注得到的且是样本带噪小分子的数据中的噪声数据;
确定模块702,用于基于多个样本原子的数据,通过神经网络模型输出样本图结构,样本图结构包括多个样本节点和多个样本边,任一个样本节点表征一个样本原子的数据,任一个样本边表征两端两个样本节点对应的样本原子之间的距离;
确定模块702,用于通过神经网络模型对样本图结构进行预测得到预测噪声数据,预测噪声数据是通过预测得到的且是样本带噪小分子的数据中的噪声数据;
训练模块703,用于基于预测噪声数据和标注噪声数据,对神经网络模型进行训练,得到噪声数据确定模型,噪声数据确定模型用于确定待处理带噪小分子的数据中的最终噪声数据。
在一种可能的实现方式中,确定模块702,用于通过神经网络模型对多个样本原子的数据分别进行特征提取,得到各个样本原子的初始原子特征;获取样本蛋白质的数据,通过神经网络模型对样本蛋白质的数据进行特征提取,得到样本蛋白质的特征;基于各个样本原子的初始原子特征和样本蛋白质的特征,通过神经网络模型确定样本图结构。
在一种可能的实现方式中,确定模块702,用于对于任一个样本原子,通过神经网络模型将任一个样本原子的初始原子特征和样本蛋白质的特征进行融合,得到任一个样本原子的第一原子特征;基于各个样本原子的第一原子特征,确定每两个样本原子之间的第一距离;基于各个样本原子的第一原子特征和每两个样本原子之间的第一距离,确定样本图结构。
在一种可能的实现方式中,样本带噪小分子数据是初始噪声数据或者对初始噪声数据进行至少一次的去噪处理得到的;
确定模块702,用于获取样本去噪次数信息,样本去噪次数信息表征从初始噪声数据变至样本带噪小分子数据所进行的去噪处理的次数;基于样本去噪次数信息和多个样本原子数据,通过神经网络模型确定样本图结构。
在一种可能的实现方式中,确定模块702,用于通过神经网络模型对样本去噪次数信息进行特征提取,得到样本去噪次数特征;通过神经网络模型对多个样本原子的数据分别进行特征提取,得到各个样本原子的初始原子特征;基于样本去噪次数特征和各个样本原子的初始原子特征,通过神经网络模型确定样本图结构。
在一种可能的实现方式中,确定模块702,用于对于任一个样本原子,通过神经网络模型将任一个样本原子的初始原子特征和样本去噪次数特征进行融合,得到任一个样本原子的第二原子特征;基于各个样本原子的第二原子特征,确定每两个样本原子之间的第二距离;基于各个样本原子的第二原子特征和每两个样本原子之间的第二距离,确定样本图结构。
在一种可能的实现方式中,确定模块702,用于对于任一个样本原子,通过神经网络模型将任一个样本原子的初始原子特征、样本去噪次数特征和样本蛋白质的特征进行融合,得到任一个样本原子的第三原子特征;基于各个样本原子的第三原子特征,确定每两个样本原子之间的第三距离;基于各个样本原子的第三原子特征和每两个样本原子之间的第三距离,确定样本图结构。
在一种可能的实现方式中,确定模块702,用于通过神经网络模型对样本图结构进行特征提取,得到各个样本原子的待处理原子特征;基于各个样本原子的待处理原子特征,通过神经网络模型确定预测类型噪声数据和预测位置噪声数据中的至少一项,预测类型噪声数据是通过预测得到的与样本原子的类型相关的噪声数据,预测位置噪声数据是通过预测得到的与样本原子的位置相关的噪声数据;将预测类型噪声数据和预测位置噪声数据中的至少一项,作为预测噪声数据。
在一种可能的实现方式中,确定模块702,用于通过神经网络模型从样本图结构包括的多个样本边中删除第一边,得到第一图结构,第一边表征的距离不大于参考距离;基于第一图结构,通过神经网络模型确定第一噪声数据;基于第一噪声数据确定预测噪声数据。
在一种可能的实现方式中,确定模块702,用于通过神经网络模型从样本图结构包括的多个样本边中删除第二边,得到第二图结构,第二边表征的距离大于参考距离;基于第二图结构,通过神经网络模型确定第二噪声数据;基于第二噪声数据确定预测噪声数据。
在一种可能的实现方式中,预测噪声数据包括预测类型噪声数据和预测位置噪声数据,标注噪声数据包括标注类型噪声数据和标注位置噪声数据;
训练模块703,用于基于预测类型噪声数据和标注类型噪声数据,确定第一损失;基于预测位置噪声数据和标注位置噪声数据,确定第二损失;基于第一损失和第二损失,对神经网络模型进行训练,得到噪声数据确定模型。
上述装置是基于样本带噪小分子的数据中多个样本原子的数据确定样本图结构,通过神经网络模型对样本图结构进行预测来确定预测噪声数据,基于预测噪声数据和标注噪声数据训练得到噪声数据确定模型。通过噪声数据确定模型可以确定待处理带噪小分子的数据中的最终噪声数据,从而可以实现基于最终噪声数据对待处理带噪小分子的数据进行去噪处理,得到去噪后的小分子数据,进而可以基于去噪后的小分子数据进行药物研发,提高药物研发效率。
应理解的是,上述图7提供的装置在实现其功能时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的装置与方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
图8所示为本申请实施例提供的一种噪声数据的确定装置的结构示意图,如图8所示,该装置包括:
获取模块801,用于获取待处理带噪小分子的数据,待处理带噪小分子的数据为带有噪声数据的小分子数据且待处理带噪小分子的数据包括多个待处理原子的数据;
确定模块802,用于基于多个待处理原子的数据,通过噪声数据确定模型确定待处理图结构,待处理图结构包括多个节点和多个边,任一个节点表征一个待处理原子,任一个边表征两端两个节点对应的待处理原子之间的距离,噪声数据确定模型是按照第一方面任一项的噪声数据确定模型的训练方法训练得到的;
确定模块802,还用于基于待处理图结构,通过噪声数据确定模型确定最终噪声数据,最终噪声数据是待处理带噪小分子的数据中的噪声数据。
在一种可能的实现方式中,确定模块802,用于通过噪声数据确定模型对多个待处理原子的数据进行特征提取,得到各个待处理原子的初始原子特征;获取待处理蛋白质的数据,通过噪声数据确定模型对待处理蛋白质的数据进行特征提取,得到待处理蛋白质的特征;基于各个待处理原子的初始原子特征和待处理蛋白质的特征,通过噪声数据确定模型确定待处理图结构。
在一种可能的实现方式中,待处理带噪小分子数据是初始噪声数据或者对初始噪声数据进行至少一次的去噪处理得到的;
确定模块802,用于获取去噪次数信息,去噪次数信息表征从初始噪声数据变至待处理带噪小分子的数据所进行的去噪处理的次数;基于去噪次数信息和多个待处理原子的数据,通过噪声数据确定模型确定待处理图结构。
在一种可能的实现方式中,装置还包括:
去噪模块,用于基于最终噪声数据对待处理带噪小分子的数据进行去噪处理,得到第一小分子数据;
确定模块802,还用于响应于第一小分子数据满足数据条件,则将第一小分子数据作为目标小分子数据。
在一种可能的实现方式中,装置还包括:
确定模块802,还用于响应于第一小分子数据不满足数据条件,则基于第一小分子数据,通过噪声数据确定模型确定参考图结构;基于参考图结构,通过噪声数据确定模型确定参考噪声数据;
去噪模块,还用于基于参考噪声数据对第一小分子数据进行去噪处理,得到第二小分子数据;
确定模块802,还用于响应于第二小分子数据满足数据条件,则将第二小分子数据作为目标小分子数据。
上述装置是基于待处理带噪小分子的数据确定待处理图结构,基于待处理图结构,通过噪声数据确定模型来确定最终噪声数据,从而可以实现基于最终噪声数据对待处理带噪小分子的数据进行去噪处理,得到去噪后的小分子数据,进而可以基于去噪后的小分子数据进行药物研发,提高药物研发效率。
应理解的是,上述图8提供的装置在实现其功能时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的装置与方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
图9示出了本申请一个示例性实施例提供的终端设备900的结构框图。该终端设备900包括有:处理器901和存储器902。
处理器901可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器901可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器901也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器901可以集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器901还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器902可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器902还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器902中的非暂态的计算机可读存储介质用于存储至少一个计算机程序,该至少一个计算机程序用于被处理器901所执行以实现本申请中方法实施例提供的噪声数据确定模型的训练方法或者噪声数据的确定方法。
在一些实施例中,终端设备900还可选包括有:外围设备接口903和至少一个外围设备。外围设备包括:射频电路904、显示屏905、摄像头组件906、音频电路907和电源908中的至少一种。
在一些实施例中,终端设备900还包括有一个或多个传感器909。该一个或多个传感器909包括但不限于:加速度传感器911、陀螺仪传感器912、压力传感器913、光学传感器914以及接近传感器915。
本领域技术人员可以理解,图9中示出的结构并不构成对终端设备900的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
图10为本申请实施例提供的服务器的结构示意图,该服务器1000可因配置或性能不同而产生比较大的差异,可以包括一个或多个处理器1001和一个或多个的存储器1002,其中,该一个或多个存储器1002中存储有至少一条计算机程序,该至少一条计算机程序由该一个或多个处理器1001加载并执行以实现上述各个方法实施例提供的噪声数据确定模型的训练方法或者噪声数据的确定方法,示例性的,处理器1001为CPU。当然,该服务器1000还可以具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该服务器1000还可以包括其他用于实现设备功能的部件,在此不做赘述。
在示例性实施例中,还提供了一种计算机可读存储介质,该存储介质中存储有至少一条计算机程序,该至少一条计算机程序由处理器加载并执行,以使电子设备实现上述任一种噪声数据确定模型的训练方法或者噪声数据的确定方法。
可选地,上述计算机可读存储介质可以是只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、只读光盘(Compact Disc Read-Only Memory,CD-ROM)、磁带、软盘和光数据存储设备等。
在示例性实施例中,还提供了一种计算机程序产品,该计算机程序产品中存储有至少一条计算机程序,该至少一条计算机程序由处理器加载并执行,以使电子设备实现上述任一种噪声数据确定模型的训练方法或者噪声数据的确定方法。
应当理解的是,在本文中提及的“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。字符“/”一般表示前后关联对象是一种“或”的关系。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
以上所述仅为本申请的示例性实施例,并不用以限制本申请,凡在本申请的原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (21)

  1. 一种噪声数据确定模型的训练方法,所述方法由电子设备执行,所述方法包括:
    获取样本带噪小分子的数据和标注噪声数据,所述样本带噪小分子的数据为带有噪声数据的小分子数据且所述样本带噪小分子的数据包括多个样本原子的数据,所述标注噪声数据是通过标注得到的且是所述样本带噪小分子的数据中的噪声数据;
    基于所述多个样本原子的数据,通过神经网络模型输出样本图结构,所述样本图结构包括多个样本节点和多个样本边,任一个样本节点表征一个样本原子的数据,任一个样本边表征两端两个样本节点对应的样本原子之间的距离;
    通过所述神经网络模型对所述样本图结构进行预测得到预测噪声数据,所述预测噪声数据是通过预测得到的且是所述样本带噪小分子的数据中的噪声数据;
    基于所述预测噪声数据和所述标注噪声数据,对所述神经网络模型进行训练,得到噪声数据确定模型,所述噪声数据确定模型用于确定待处理带噪小分子的数据中的最终噪声数据。
  2. 根据权利要求1所述的方法,所述基于所述多个样本原子的数据,通过神经网络模型输出样本图结构,包括:
    通过所述神经网络模型对所述多个样本原子的数据分别进行特征提取,得到各个样本原子的初始原子特征;
    获取样本蛋白质的数据,通过所述神经网络模型对所述样本蛋白质的数据进行特征提取,得到所述样本蛋白质的特征;
    基于所述各个样本原子的初始原子特征和所述样本蛋白质的特征,通过所述神经网络模型确定所述样本图结构。
  3. 根据权利要求2所述的方法,所述基于所述各个样本原子的初始原子特征和所述样本蛋白质的特征,通过所述神经网络模型确定所述样本图结构,包括:
    对于任一个样本原子,通过所述神经网络模型将所述任一个样本原子的初始原子特征和所述样本蛋白质的特征进行融合,得到所述任一个样本原子的第一原子特征;
    基于所述各个样本原子的第一原子特征,确定每两个样本原子之间的第一距离;
    基于所述各个样本原子的第一原子特征和所述每两个样本原子之间的第一距离,确定所述样本图结构。
  4. 根据权利要求1所述的方法,所述样本带噪小分子的数据是初始噪声数据或者对所述初始噪声数据进行至少一次的去噪处理得到的;
    所述基于所述多个样本原子的数据,通过神经网络模型输出样本图结构,包括:
    获取样本去噪次数信息,所述样本去噪次数信息表征从所述初始噪声数据变至所述样本带噪小分子的数据所进行的去噪处理的次数;
    基于所述样本去噪次数信息和所述多个样本原子的数据,通过所述神经网络模型确定样本图结构。
  5. 根据权利要求4所述的方法,所述基于所述样本去噪次数信息和所述多个样本原子的数据,通过所述神经网络模型确定样本图结构,包括:
    通过所述神经网络模型对所述样本去噪次数信息进行特征提取,得到样本去噪次数特征;
    通过所述神经网络模型对所述多个样本原子的数据分别进行特征提取,得到各个样本原子的初始原子特征;
    基于所述样本去噪次数特征和所述各个样本原子的初始原子特征,通过所述神经网络模型确定样本图结构。
  6. 根据权利要求5所述的方法,所述基于所述样本去噪次数特征和所述各个样本原子的初始原子特征,通过所述神经网络模型确定样本图结构,包括:
    对于任一个样本原子,通过所述神经网络模型将所述任一个样本原子的初始原子特征和所述样本去噪次数特征进行融合,得到所述任一个样本原子的第二原子特征;
    基于所述各个样本原子的第二原子特征,确定每两个样本原子之间的第二距离;
    基于所述各个样本原子的第二原子特征和所述每两个样本原子之间的第二距离,确定所述样本图结构。
  7. 根据权利要求5所述的方法,所述基于所述样本去噪次数特征和所述各个样本原子的初始原子特征,通过所述神经网络模型确定样本图结构,包括:
    对于任一个样本原子,通过所述神经网络模型将所述任一个样本原子的初始原子特征、所述样本去噪次数特征和样本蛋白质的特征进行融合,得到所述任一个样本原子的第三原子特征;
    基于所述各个样本原子的第三原子特征,确定每两个样本原子之间的第三距离;
    基于所述各个样本原子的第三原子特征和所述每两个样本原子之间的第三距离,确定所述样本图结构。
  8. 根据权利要求1-7任一项所述的方法,所述通过所述神经网络模型对所述样本图结构进行预测得到预测噪声数据,包括:
    通过所述神经网络模型对所述样本图结构进行特征提取,得到各个样本原子的待处理原子特征;
    基于所述各个样本原子的待处理原子特征,通过所述神经网络模型确定预测类型噪声数据和预测位置噪声数据中的至少一项,所述预测类型噪声数据是通过预测得到的与所述样本原子的类型相关的噪声数据,所述预测位置噪声数据是通过预测得到的与所述样本原子的位置相关的噪声数据;
    将所述预测类型噪声数据和所述预测位置噪声数据中的至少一项,作为所述预测噪声数据。
  9. 根据权利要求1-7任一项所述的方法,所述通过所述神经网络模型对所述样本图结构进行预测得到预测噪声数据,包括:
    通过所述神经网络模型从所述样本图结构包括的多个样本边中删除第一边,得到第一图结构,所述第一边表征的距离不大于参考距离;
    基于所述第一图结构,通过所述神经网络模型确定第一噪声数据;
    基于所述第一噪声数据确定所述预测噪声数据。
  10. 根据权利要求1-7任一项所述的方法,所述通过所述神经网络模型对所述样本图结构进行预测得到预测噪声数据,包括:
    通过所述神经网络模型从所述样本图结构包括的多个样本边中删除第二边,得到第二图结构,所述第二边表征的距离大于参考距离;
    基于所述第二图结构,通过所述神经网络模型确定第二噪声数据;
    基于所述第二噪声数据确定所述预测噪声数据。
  11. 根据权利要求1-7任一项所述的方法,所述预测噪声数据包括预测类型噪声数据和预测位置噪声数据,所述标注噪声数据包括标注类型噪声数据和标注位置噪声数据;
    所述基于所述预测噪声数据和所述标注噪声数据,对所述神经网络模型进行训练,得到噪声数据确定模型,包括:
    基于所述预测类型噪声数据和所述标注类型噪声数据,确定第一损失;
    基于所述预测位置噪声数据和所述标注位置噪声数据,确定第二损失;
    基于所述第一损失和所述第二损失,对所述神经网络模型进行训练,得到噪声数据确定模型。
  12. 一种噪声数据的确定方法,所述方法由电子设备执行,所述方法包括:
    获取待处理带噪小分子的数据,所述待处理带噪小分子的数据为带有噪声数据的小分子数据且所述待处理带噪小分子的数据包括多个待处理原子的数据;
    基于所述多个待处理原子的数据,通过噪声数据确定模型确定待处理图结构,所述待处理图结构包括多个节点和多个边,任一个节点表征一个待处理原子,任一个边表征两端两个节点对应的待处理原子之间的距离,所述噪声数据确定模型是按照权利要求1至11任一项所述的方法训练得到的;
    基于所述待处理图结构,通过所述噪声数据确定模型确定最终噪声数据,所述最终噪声数据是所述待处理带噪小分子的数据中的噪声数据。
  13. 根据权利要求12所述的方法,所述基于所述多个待处理原子的数据,通过噪声数据确定模型确定待处理图结构,包括:
    通过噪声数据确定模型对所述多个待处理原子的数据进行特征提取,得到各个待处理原子的初始原子特征;
    获取待处理蛋白质的数据,通过所述噪声数据确定模型对所述待处理蛋白质的数据进行特征提取,得到所述待处理蛋白质的特征;
    基于所述各个待处理原子的初始原子特征和所述待处理蛋白质的特征,通过所述噪声数据确定模型确定所述待处理图结构。
  14. 根据权利要求12所述的方法,所述待处理带噪小分子的数据是初始噪声数据或者对所述初始噪声数据进行至少一次的去噪处理得到的;
    所述基于所述多个待处理原子的数据,通过噪声数据确定模型确定待处理图结构,包括:
    获取去噪次数信息,所述去噪次数信息表征从所述初始噪声数据变至所述待处理带噪小分子的数据所进行的去噪处理的次数;
    基于所述去噪次数信息和所述多个待处理原子的数据,通过噪声数据确定模型确定所述待处理图结构。
  15. 根据权利要求12所述的方法,所述方法还包括:
    基于所述最终噪声数据对所述待处理带噪小分子的数据进行去噪处理,得到第一小分子数据;
    响应于所述第一小分子数据满足数据条件,则将所述第一小分子数据作为目标小分子数据。
  16. 根据权利要求15所述的方法,所述方法还包括:
    响应于所述第一小分子数据不满足所述数据条件,则基于所述第一小分子数据,通过所述噪声数据确定模型确定参考图结构;
    基于所述参考图结构,通过所述噪声数据确定模型确定参考噪声数据;
    基于所述参考噪声数据对所述第一小分子数据进行去噪处理,得到第二小分子数据;
    响应于所述第二小分子数据满足所述数据条件,则将所述第二小分子数据作为所述目标小分子数据。
  17. 一种噪声数据确定模型的训练装置,所述装置部署在电子设备上,所述装置包括:
    获取模块,用于获取样本带噪小分子的数据和标注噪声数据,所述样本带噪小分子的数据为带有噪声数据的小分子数据且所述样本带噪小分子的数据包括多个样本原子的数据,所述标注噪声数据是通过标注得到的且是所述样本带噪小分子的数据中的噪声数据;
    确定模块,用于基于所述多个样本原子的数据,通过神经网络模型输出样本图结构,所述样本图结构包括多个样本节点和多个样本边,任一个样本节点表征一个样本原子的数据,任一个样本边表征两端两个样本节点对应的样本原子之间的距离;
    所述确定模块,还用于通过所述神经网络模型对所述样本图结构进行预测得到预测噪声数据,所述预测噪声数据是通过预测得到的且是所述样本带噪小分子的数据中的噪声数据;
    训练模块,还用于基于所述预测噪声数据和所述标注噪声数据,对所述神经网络模型进行训练,得到噪声数据确定模型,所述噪声数据确定模型用于确定待处理带噪小分子的数据中的最终噪声数据。
  18. 一种噪声数据的确定装置,所述装置部署在电子设备上,所述装置包括:
    获取模块,用于获取待处理带噪小分子的数据,所述待处理带噪小分子的数据为带有噪声数据的小分子数据且所述待处理带噪小分子的数据包括多个待处理原子的数据;
    确定模块,用于基于所述多个待处理原子的数据,通过噪声数据确定模型确定待处理图结构,所述待处理图结构包括多个节点和多个边,任一个节点表征一个待处理原子,任一个边表征两端两个节点对应的待处理原子之间的距离,所述噪声数据确定模型是按照权利要求1至11任一项所述的方法训练得到的;
    所述确定模块,还用于基于所述待处理图结构,通过所述噪声数据确定模型确定最终噪声数据,所述最终噪声数据是所述待处理带噪小分子的数据中的噪声数据。
  19. 一种电子设备,所述电子设备包括处理器和存储器,所述存储器中存储有至少一条计算机程序,所述至少一条计算机程序由所述处理器加载并执行,以使所述电子设备实现如权利要求1至11任一所述的噪声数据确定模型的训练方法或者实现如权利要求12至16任一所述的噪声数据的确定方法。
  20. 一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条计算机程序,所述至少一条计算机程序由处理器加载并执行,以使电子设备实现如权利要求1至11任一所述的噪声数据确定模型的训练方法或者实现如权利要求12至16任一所述的噪声数据的确定方法。
  21. 一种计算机程序产品,所述计算机程序产品中存储有至少一条计算机程序,所述至少一条计算机程序由处理器加载并执行,以使所述电子设备实现如权利要求1至11任一所述的噪声数据确定模型的训练方法或者实现如权利要求12至16任一所述的噪声数据的确定方法。
PCT/CN2023/125347 2022-11-30 2023-10-19 噪声数据确定模型的训练、噪声数据的确定方法及装置 WO2024114154A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211525333.7A CN116959616A (zh) 2022-11-30 2022-11-30 噪声数据确定模型的训练、噪声数据的确定方法及装置
CN202211525333.7 2022-11-30

Publications (1)

Publication Number Publication Date
WO2024114154A1 true WO2024114154A1 (zh) 2024-06-06

Family

ID=88455406

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/125347 WO2024114154A1 (zh) 2022-11-30 2023-10-19 噪声数据确定模型的训练、噪声数据的确定方法及装置

Country Status (2)

Country Link
CN (1) CN116959616A (zh)
WO (1) WO2024114154A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257855A (zh) * 2020-11-26 2021-01-22 Oppo(重庆)智能科技有限公司 一种神经网络的训练方法及装置、电子设备及存储介质
CN112651467A (zh) * 2021-01-18 2021-04-13 第四范式(北京)技术有限公司 卷积神经网络的训练方法和系统以及预测方法和系统
US20210349954A1 (en) * 2020-04-14 2021-11-11 Naver Corporation System and method for performing cross-modal information retrieval using a neural network using learned rank images
CN113707235A (zh) * 2021-08-30 2021-11-26 平安科技(深圳)有限公司 基于自监督学习的药物小分子性质预测方法、装置及设备

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210349954A1 (en) * 2020-04-14 2021-11-11 Naver Corporation System and method for performing cross-modal information retrieval using a neural network using learned rank images
CN112257855A (zh) * 2020-11-26 2021-01-22 Oppo(重庆)智能科技有限公司 一种神经网络的训练方法及装置、电子设备及存储介质
CN112651467A (zh) * 2021-01-18 2021-04-13 第四范式(北京)技术有限公司 卷积神经网络的训练方法和系统以及预测方法和系统
CN113707235A (zh) * 2021-08-30 2021-11-26 平安科技(深圳)有限公司 基于自监督学习的药物小分子性质预测方法、装置及设备

Also Published As

Publication number Publication date
CN116959616A (zh) 2023-10-27

Similar Documents

Publication Publication Date Title
CN109214343B (zh) 用于生成人脸关键点检测模型的方法和装置
CN109902186B (zh) 用于生成神经网络的方法和装置
CN112257578B (zh) 人脸关键点检测方法、装置、电子设备及存储介质
CN109800730B (zh) 用于生成头像生成模型的方法和装置
CN115271071A (zh) 基于图神经网络的知识图谱实体对齐方法、系统及设备
CN115907970A (zh) 信贷风险识别方法、装置、电子设备及存储介质
CN113409307A (zh) 基于异质噪声特性的图像去噪方法、设备及介质
CN108509179B (zh) 用于检测人脸的方法、用于生成模型的装置
CN114420135A (zh) 基于注意力机制的声纹识别方法及装置
CN114065915A (zh) 网络模型的构建方法、数据处理方法、装置、介质及设备
CN113468344A (zh) 实体关系抽取方法、装置、电子设备和计算机可读介质
CN116912923B (zh) 一种图像识别模型训练方法和装置
WO2024114154A1 (zh) 噪声数据确定模型的训练、噪声数据的确定方法及装置
CN116957006A (zh) 预测模型的训练方法、装置、设备、介质及程序产品
CN114241411B (zh) 基于目标检测的计数模型处理方法、装置及计算机设备
CN111626044B (zh) 文本生成方法、装置、电子设备及计算机可读存储介质
CN111709784B (zh) 用于生成用户留存时间的方法、装置、设备和介质
CN111949860B (zh) 用于生成相关度确定模型的方法和装置
CN114037772A (zh) 一种图像生成器的训练方法、图像生成方法及装置
CN113642510A (zh) 目标检测方法、装置、设备和计算机可读介质
CN112613544A (zh) 目标检测方法、装置、电子设备和计算机可读介质
CN111666449A (zh) 视频检索方法、装置、电子设备和计算机可读介质
CN111582456A (zh) 用于生成网络模型信息的方法、装置、设备和介质
Vu et al. On the initial value problem for random fuzzy differential equations with Riemann-Liouville fractional derivative: Existence theory and analytical solution
CN117456562B (zh) 姿态估计方法及装置