WO2023226310A1 - A molecular optimization method and apparatus - Google Patents

A molecular optimization method and apparatus

Info

Publication number
WO2023226310A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
molecular
data set
objective function
attribute
Prior art date
Application number
PCT/CN2022/130492
Other languages
English (en)
French (fr)
Inventor
熊招平
崔晓鹏
乔楠
翁文康
林歆远
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202211019436.6A (external priority: CN117174185A)
Application filed by 华为云计算技术有限公司
Publication of WO2023226310A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C10/00: Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50: Molecular design, e.g. of drugs

Definitions

  • the present application relates to the field of artificial intelligence, and in particular to a molecular optimization method and device.
  • Some commonly used molecular optimization methods, such as molecular optimization based on Bayesian optimization, reinforcement learning, or conditional generation, usually require a large amount of training data, have very long optimization cycles, and produce very unstable output. Therefore, how to perform molecular optimization with efficient and stable output has become an urgent problem to be solved.
  • This application provides a molecular optimization method and device, which constructs an objective function based on the Ising model and uses a quantum annealing algorithm to solve it, so that the optimal molecular structure can be efficiently and accurately solved.
  • this application provides a molecular optimization method, including: first, obtaining a first data set and an attribute set.
  • the first data set includes multiple sets of data, which can be used to represent multiple molecular structures; each set of data can be used to represent at least one molecular structure.
  • the attribute set includes multiple sets of attribute information.
  • the multiple sets of attribute information correspond one-to-one to the multiple sets of data, and each set of attribute information includes the value of at least one attribute of the corresponding molecular structure, such as the toughness, toxicity, or solubility of the molecule.
  • the objective function is constructed from the first data set and the attribute set, where the attribute information in the attribute set can be used to fit the parameters of the objective function; then a quantum annealing algorithm is used to solve the objective function to obtain a molecular sequence, which represents the solved molecular structure, whose properties are better than those of the molecular structures represented in the first data set.
  • molecular structures with known properties can thus be used to construct the objective function, and a quantum annealing algorithm can be used to solve it, so that an efficient and accurate solution can be achieved and a molecular structure with better properties can be obtained.
  • the first data set and attribute set may be obtained by receiving input data from the client.
  • users can input known molecular structures and attribute information of each molecular structure through the client, such as the heat resistance, hardness and other attribute information of the molecules.
  • the aforementioned constructing the objective function based on the first data set and the attribute set may include: performing binary encoding on each group of data in the first data set to obtain the second data set.
  • the second data set includes multiple sets of sequences, which correspond one-to-one to the multiple sets of data and are all binary sequences; then, based on the second data set and the attribute set, the objective function is constructed based on the structure of the Ising model.
  • each group of data in the first data set can be binary-encoded separately, which is equivalent to converting each set of data in the first data set into a binary sequence representation, so that the objective function can be constructed based on the structure of the Ising model.
  • the aforementioned construction of the objective function based on the structure of the Ising model from the second data set may include: constructing the objective function, based on the structure of the Ising model and the attribute set, via matrix factorization of the sequences in the second data set.
  • when constructing the objective function, it can be built on the structure of the Ising model using matrix factorization, so that the quantum annealing algorithm can be used to obtain the optimal solution of the objective function.
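As an illustrative sketch (not the patent's actual fitting procedure), an objective of the Ising/QUBO form f(x) = Σᵢ hᵢxᵢ + Σᵢ<ⱼ Jᵢⱼxᵢxⱼ can be fitted to binary sequences and their property values by least squares. All function names and the toy data below are assumptions:

```python
import numpy as np

def pairwise_features(X):
    """Map each binary vector to [x_i] + [x_i * x_j for i < j]."""
    n = X.shape[1]
    iu = np.triu_indices(n, k=1)
    quad = (X[:, :, None] * X[:, None, :])[:, iu[0], iu[1]]
    return np.hstack([X, quad])

def fit_qubo(X, y):
    """Least-squares fit of linear (h) and pairwise (J) coefficients."""
    Phi = pairwise_features(X)
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    n = X.shape[1]
    h, j_flat = coef[:n], coef[n:]
    J = np.zeros((n, n))
    J[np.triu_indices(n, k=1)] = j_flat  # upper-triangular couplings
    return h, J

def qubo_value(x, h, J):
    """Evaluate f(x) = h.x + x^T J x for binary x."""
    return float(h @ x + x @ J @ x)

# toy data: property = x0 + 2 * x0 * x1 (illustrative, not from the patent)
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 0.0, 3.0])
h, J = fit_qubo(X, y)
```

With the fitted h and J in hand, minimizing (or maximizing) `qubo_value` over binary vectors is exactly the kind of combinatorial problem an annealer accepts.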
  • the aforementioned binary encoding of the multiple sets of sequences in the first data set to obtain the second data set may include: using the prior distribution as a constraint, encoding the multiple sets of sequences in the first data set through the encoder of a variational autoencoder (VAE) to obtain latent-variable encoded data, where the prior distribution is obtained by sampling from the Bernoulli distribution corresponding to the sequences in the first data set.
  • the prior distribution can be sampled from the Bernoulli distribution and used as a constraint, so that each element of the sequence produced by the encoder is 0 or 1, thus yielding a binary sequence.
  • the method provided by this application may also include: based on the restricted Boltzmann machine, using Gibbs sampling to sample from the Bernoulli distribution to obtain the prior distribution.
  • Gibbs sampling can be used to sample from the Bernoulli distribution to obtain the prior distribution based on the pre-trained restricted Boltzmann machine, so as to facilitate subsequent binary encoding.
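The prior-sampling step described above can be sketched, under assumptions, as block Gibbs sampling in a restricted Boltzmann machine. The weights below are random placeholders rather than a pre-trained model, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sample(W, b, c, n_steps=100, n_chains=8):
    """Alternate h|v and v|h Bernoulli updates; return final visible samples."""
    n_vis = W.shape[0]
    v = rng.integers(0, 2, size=(n_chains, n_vis)).astype(float)
    for _ in range(n_steps):
        p_h = sigmoid(v @ W + c)                  # P(h = 1 | v)
        h = (rng.random(p_h.shape) < p_h) * 1.0
        p_v = sigmoid(h @ W.T + b)                # P(v = 1 | h)
        v = (rng.random(p_v.shape) < p_v) * 1.0
    return v

# placeholder RBM parameters (a pre-trained RBM would supply these)
n_vis, n_hid = 6, 4
W = rng.normal(scale=0.1, size=(n_vis, n_hid))
b = np.zeros(n_vis)
c = np.zeros(n_hid)
samples = gibbs_sample(W, b, c)
```

Each returned row is a binary vector drawn approximately from the RBM's marginal over visible units, which is the kind of Bernoulli-valued prior the encoder is constrained toward.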
  • the aforementioned decoding of the target sequence to obtain the molecular sequence includes: decoding the target sequence through a decoder in VAE to obtain the molecular sequence.
  • a binary sequence is usually used for calculation, while the representation of the molecular structure may be non-binary. Therefore, after the binary sequence is solved, it can be decoded by the decoder to construct an identifiable molecular structure.
  • the aforementioned solving the objective function through a quantum annealing algorithm to obtain the target sequence may include: solving the objective function through a quantum annealing machine to obtain the target sequence.
  • a quantum annealing machine can be used directly for solving. Compared with simulating quantum annealing on a classical device, using a quantum annealing machine can further improve solving efficiency.
  • the data in the first data set includes one or more of the following: one-dimensional character strings, two-dimensional molecular graphs, or three-dimensional structure data.
  • the molecular structure can be represented in a variety of ways, and can be applied to a variety of scenarios.
  • the solved sequence can also be decoded into one or more of the aforementioned data types, so that users can identify the specific structure of the molecule from the output molecular sequence.
  • this application provides a molecular optimization device, including:
  • the acquisition module is used to acquire a first data set and an attribute set.
  • the first data set includes multiple sets of data, each set of data is used to represent at least one molecular structure.
  • the attribute set includes multiple sets of attribute information, which correspond one-to-one to the multiple sets of data, and each set of attribute information includes the value of at least one attribute of the corresponding molecular structure;
  • a construction module used to construct the objective function based on the first data set and the attribute set, and the attribute information in the attribute set is used to fit the parameters in the objective function;
  • the solving module is used to solve the objective function through the quantum annealing algorithm to obtain the molecular sequence.
  • the molecular sequence is used to represent the molecular structure obtained by solving.
  • the device further includes: an encoding module
  • the encoding module is used to perform binary encoding on each set of data in the first data set to obtain a second data set.
  • the second data set includes multiple sets of sequences, and the multiple sets of sequences correspond to multiple sets of data;
  • the construction module is specifically used to construct the objective function based on the structure of the Ising model according to the second data set and the attribute set.
  • the construction module is specifically configured to construct the objective function based on the structure of the Ising model and the attribute set, via matrix factorization of the sequences in the second data set.
  • the encoding module is specifically used to use the prior distribution as a constraint to encode multiple sets of sequences in the first data set through the encoder in the variational autoencoder VAE to obtain latent variable encoding.
  • the prior distribution is sampled based on the Bernoulli distribution corresponding to the sequence in the first data set.
  • the device further includes: a sampling module, configured to use Gibbs sampling, based on the restricted Boltzmann machine, to sample from the Bernoulli distribution to obtain the prior distribution.
  • the device further includes: a decoding module
  • the solving module is specifically used to solve the objective function through the quantum annealing algorithm to obtain the target sequence
  • This decoding module is used to decode the target sequence through the decoder in VAE to obtain the molecular sequence.
  • the solving module is specifically configured to solve the objective function through a quantum annealing machine to obtain the target sequence.
  • the data in the first data set includes one or more of the following: one-dimensional character strings, two-dimensional molecular graphs, or three-dimensional structure data.
  • embodiments of the present application provide a molecular optimization device, which has the function of implementing the molecular optimization method in the first aspect.
  • This function can be implemented by hardware, or it can be implemented by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • embodiments of the present application provide a molecular optimization device, including: a processor and a memory, wherein the processor and the memory are interconnected through lines, and the processor calls the program code in the memory to execute the processing-related functions of the molecular optimization method shown in any one of the above first aspects.
  • the molecular optimization device may be a chip.
  • embodiments of the present application provide a molecular optimization device.
  • the molecular optimization device can also be called a digital processing chip or chip.
  • the chip includes a processing unit and a communication interface.
  • the processing unit obtains program instructions through the communication interface, the program instructions are executed by the processing unit, and the processing unit is configured to perform the processing-related functions in the above-mentioned first aspect or any optional implementation of the first aspect.
  • embodiments of the present application provide a computer-readable storage medium that includes instructions that, when run on a computer, cause the computer to execute the method in any optional implementation manner in the first aspect.
  • embodiments of the present application provide a computer program product containing instructions that, when run on a computer, cause the computer to execute the method in any optional implementation manner in the first aspect.
  • Figure 1 is a schematic framework diagram of a cloud platform applied in this application
  • FIG. 2 is a schematic diagram of a system architecture provided by this application.
  • Figure 3 is a schematic flow chart of a molecular optimization method provided by this application.
  • Figure 4 is a schematic flow chart of another molecular optimization method provided by this application.
  • Figure 5 is a schematic flow chart of another molecular optimization method provided by this application.
  • Figure 6 is a schematic flow chart of another molecular optimization method provided by this application.
  • Figure 7 is a schematic flow chart of another molecular optimization method provided by this application.
  • Figure 8 is a schematic structural diagram of a molecular optimization device provided by the present application.
  • Figure 9 is a schematic structural diagram of another molecular optimization device provided by this application.
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has gone through the condensation process of "data-information-knowledge-wisdom".
  • the "IT value chain” reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of human intelligence and information (providing and processing technology implementation) to the systematic industrial ecological process.
  • Infrastructure provides computing power support for artificial intelligence systems, enables communication with the external world, and supports it through basic platforms.
  • computing power is provided by smart chips, such as central processing units (CPU), neural-network processing units (NPU), graphics processing units (GPU), and hardware acceleration chips such as application-specific integrated circuits (ASIC) or field-programmable gate arrays (FPGA);
  • the basic platform includes distributed computing frameworks, networks, and other related platform guarantees and support, which can include cloud storage and computing, interconnection networks, etc.
  • sensors communicate with the outside world to obtain data, which are provided to smart chips in the distributed computing system provided by the basic platform for calculation.
  • Data at the layer above the infrastructure represents the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formal information to perform machine thinking and problem solving based on reasoning control strategies. Typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of further data processing, such as algorithms or a general system, e.g. translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. Their application fields mainly include: intelligent terminals, intelligent transportation, smart healthcare, autonomous driving, smart cities, etc.
  • the method provided by this application can be applied in a variety of scenarios, such as optimizing the structures of material or drug molecules.
  • for chemical molecules such as materials or drugs to have better properties, such as stronger toughness, lower toxicity, or better solubility, the structure of the molecule needs to be optimized.
  • the process of changing the molecular structure to achieve better performance is molecular optimization.
  • AI services and products in the cloud field not only reflect the on-demand use and purchase characteristics of cloud services, but also have the abstract, diverse, and widely used characteristics of AI technology.
  • One is the Platform-as-a-Service (PaaS) AI basic development platform service, and the other is the Software-as-a-Service (SaaS) AI application cloud service.
  • For the first type, the AI basic development platform service, public cloud service providers rely on their sufficient underlying resource support and upper-layer AI algorithm capabilities to provide users with an AI basic development platform.
  • the built-in AI development framework and various AI algorithms in the AI basic development platform allow users to quickly build and develop AI models or AI applications that meet personalized needs on the AI basic development platform.
  • public cloud service providers provide general AI application cloud services through cloud platforms, allowing users to use AI capabilities in various application scenarios with zero threshold.
  • the public cloud AI basic development platform is a PaaS cloud service in the cloud platform: a software platform, provided to users (also called tenants, AI developers, etc.) based on the large amount of basic resources and software capabilities owned by the public cloud service provider, that assists in the construction, training, and deployment of AI models, as well as the development and deployment of AI applications.
  • the method provided by this application can be applied to a cloud platform; for example, a drug molecule design platform can be deployed on the cloud as a cloud service and, as a means of molecular optimization, exposed through an application program interface (API).
  • the method provided in this application can be deployed in a cloud platform as a service for users, and provide users with an API that can call the service.
  • the user can call the service through the API and input a molecular structure with known properties.
  • the service outputs molecular structures with excellent properties required by the user, thereby screening out the required molecular structures for the user.
  • the interaction form between users and the AI basic development platform mainly includes: users log in to the cloud platform through the client web page, select and purchase the cloud service of the AI basic development platform in the cloud platform, and the user can then obtain full-process AI services based on the functions provided by the AI basic development platform.
  • the basic resources that support any process in the AI platform may be distributed on different physical devices; that is, the hardware devices that actually execute a process are usually server clusters in the same data center, or server clusters distributed across different data centers.
  • These data centers can be central cloud data centers of cloud service providers or edge data centers provided by cloud service providers to users.
  • the resources in the public cloud are used to run the model training and model management functions provided in the AI basic development platform
  • the resources in the private cloud are used to run the data storage and data preprocessing functions provided in the AI basic development platform, which can provide stronger security for user data.
  • public cloud resources can come from the central cloud data center
  • private cloud resources can come from edge data centers.
  • the AI platform can be independently deployed on a server or virtual machine in a data center in a cloud environment.
  • the AI platform can also be deployed in a distributed manner on multiple servers, or on multiple virtual machines, in a data center.
  • the AI platform provided by this application can also be deployed in a distributed manner in different environments.
  • the AI platform provided by this application can be logically divided into multiple parts, each part having different functions.
  • part of the AI platform 100 may be deployed in computing devices in an edge environment (also called edge computing devices), and another part may be deployed in devices in a cloud environment.
  • the edge environment is an environment that is geographically close to the user's terminal computing device.
  • the edge environment includes edge computing devices, such as edge servers, edge stations with computing capabilities, etc.
  • Various parts of the AI platform 100 deployed in different environments or devices collaborate to provide users with functions such as training AI models.
  • this application provides a system architecture, as shown in Figure 2.
  • data collection device 160 is used to collect training data.
  • the training data may include a large number of molecular structures with known properties.
  • After collecting the training data, the data collection device 160 stores the training data in the database 130, and the training device 120 trains on the training data maintained in the database 130 to obtain the target model/rules 101.
  • the training set mentioned in the following embodiments of this application may be obtained from the database 130 or may be obtained through user input data.
  • the target model/rule 101 may be a neural network trained in the embodiment of the present application, and the neural network may include one or more networks, such as an autoencoding model.
  • the above target model/rule 101 can be used to implement the neural network mentioned in the molecular optimization method in the embodiment of the present application; that is, the data to be processed is input into the target model/rule 101 to obtain the processing result.
  • the target model/rule 101 in the embodiment of this application may specifically be the neural network mentioned below in this application, and the neural network may be the aforementioned CNN, DNN or RNN type of neural network.
  • the training data maintained in the database 130 may not necessarily be collected by the data collection device 160, but may also be received from other devices.
  • the training device 120 may not necessarily train the target model/rules 101 based entirely on the training data maintained by the database 130; it may also obtain training data from the cloud or other places for model training, which is not limited in this application.
  • the target model/rules 101 trained according to the training device 120 can be applied to different systems or devices, such as to the execution device 110 shown in Figure 2, which is a server or a cloud device.
  • the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices.
  • the user can input data to the I/O interface 112 through the client device 140.
  • the input data may include: data to be processed input by the client device.
  • the client can be other hardware devices, such as terminals or servers, etc.
  • the client can also be software deployed on the terminal, such as APPs, web pages, etc.
  • the preprocessing module 113 and the preprocessing module 114 are used to perform preprocessing according to the input data (such as data to be processed) received by the I/O interface 112.
  • alternatively, the preprocessing module 113 and the preprocessing module 114 may not be present, or there may be only one preprocessing module, and the calculation module 111 is directly used to process the input data.
  • When the execution device 110 preprocesses input data, or when the calculation module 111 of the execution device 110 performs calculations and other related processes, the execution device 110 can call data, code, etc. in the data storage system 150 for corresponding processing, and the data, instructions, etc. obtained by the processing can also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result to the client device 140 to provide it to the user. For example, if the first neural network is used for image classification and the processing result is a classification result, the I/O interface 112 returns the obtained classification result to the client device 140 to provide it to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired results.
  • the execution device 110 and the training device 120 may be the same device, or located within the same computing device. To facilitate understanding, this application will introduce the execution device and the training device separately, which is not a limitation.
  • the user can manually set the input data, and the manual setting can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send input data to the I/O interface 112. If the client device 140's automatic sending of input data requires the user's authorization, the user can set corresponding permissions in the client device 140.
  • the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form may be display, sound, action, etc.
  • the client device 140 can also serve as a data collection end, collecting the input data input to the I/O interface 112 as shown in the figure and the predicted tags output from the I/O interface 112 as new sample data, and stored in the database 130 .
  • alternatively, the I/O interface 112 may directly store the input data input to the I/O interface 112 and the predicted labels output from the I/O interface 112, as shown in the figure, as new sample data in the database 130.
  • Figure 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application.
  • the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 can also be placed in the execution device 110.
  • the target model/rule 101 is obtained through training by the training device 120.
  • the target model/rule 101 in the embodiment of the present application can be the model in the present application.
  • the neural network provided in the embodiments of the present application can include CNNs, deep convolutional neural networks (DCNN), recurrent neural networks (RNN), or other constructed neural networks, etc.
  • the molecular optimization method provided by this application can be deployed in the above-mentioned system architecture, and molecular optimization is achieved through that architecture.
  • the autoencoding model is a neural network that uses the backpropagation algorithm to make the output value equal to the input value. It first compresses the input data into a latent space representation, and then reconstructs the output through this representation.
  • Autoencoding models usually include an encoder model and a decoder model.
  • the trained encoding model is used to extract features from the input to obtain latent variables.
  • the latent variables are input to the trained decoding model to output the reconstruction corresponding to the input.
  • Variational autoencoder (VAE)
  • the variational autoencoder is similar to the autoencoder: it is composed of an encoder, a set of latent variables, and a decoder. The difference from the autoencoder is that when training the variational autoencoder, in addition to minimizing the reconstruction loss of the decoded molecule, the latent variables must also approximate the normal distribution as closely as possible. In this way, latent variables randomly sampled from the normal distribution can also be decoded into valid samples, achieving the effect of sample generation.
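The two training terms described above can be sketched as follows. This is a minimal NumPy illustration of the standard VAE loss (reconstruction plus KL divergence toward the standard normal prior) and the reparameterization step, not the patent's implementation; all names are illustrative:

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction error + KL(q(z|x) || N(0, I)) for diagonal Gaussian q."""
    recon = np.sum((x - x_recon) ** 2)
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I), keeping sampling differentiable."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(1)
mu = np.zeros(3)
log_var = np.zeros(3)          # sigma = 1, i.e. exactly the prior
z = reparameterize(mu, log_var, rng)
loss = vae_loss(np.ones(3), np.ones(3), mu, log_var)
```

When the posterior equals the prior (mu = 0, log_var = 0) and reconstruction is perfect, both terms vanish, which is the optimum the KL term pulls the latent space toward.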
  • The Boltzmann machine originates from statistical physics and is an energy-function-based model that can describe high-order interactions between variables.
  • Restricted Boltzmann machine can be understood as a neural network, usually consisting of a visible neuron layer and a hidden neuron layer, because there are no interconnections between hidden layer neurons and the hidden layer neurons are independent of the given training samples. , which makes it easy to directly calculate the data-dependent expected value. There are no interconnections between the visible layer neurons.
  • the data-independent expectation is estimated by running a Markov chain sampling process starting from the hidden-layer neuron states obtained from the training samples, alternately updating all visible-layer and hidden-layer neurons in parallel.
  • the restricted Boltzmann machine mentioned below in this application may be a pre-trained neural network.
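The alternating parallel updates described above can be sketched as follows. This is a minimal illustration with random (untrained) weights; the layer sizes and parameters are assumptions, not the model used in this application.

```python
import numpy as np

# One alternating-Gibbs chain in a restricted Boltzmann machine:
# hidden units are sampled in parallel given the visible units, then
# visible units in parallel given the hidden units.
rng = np.random.default_rng(1)
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # visible-hidden couplings
b_v = np.zeros(n_visible)                              # visible biases
b_h = np.zeros(n_hidden)                               # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v):
    # No hidden-hidden connections, so each hidden unit is conditionally
    # independent given v and can be sampled in parallel.
    p = sigmoid(v @ W + b_h)
    return (rng.random(p.shape) < p).astype(int)

def sample_visible(h):
    # Likewise, visible units are conditionally independent given h.
    p = sigmoid(h @ W.T + b_v)
    return (rng.random(p.shape) < p).astype(int)

v = rng.integers(0, 2, size=n_visible)   # start from a binary "training sample"
for _ in range(10):                      # alternate v -> h -> v (Gibbs chain)
    h = sample_hidden(v)
    v = sample_visible(h)
print(sorted(set(v.tolist())))           # states stay binary, e.g. a subset of [0, 1]
```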
  • ECFP Extended Connectivity Fingerprints
  • QSAR quantitative structure-activity relationship
  • the implementation is to partition the molecule into substructures centered at each atom, with different step sizes as the radius, and to compute a hash value for each substructure; identical substructures share the same hash value. The hash value is taken modulo the fingerprint length, and the fingerprint is set to 1 at the dimension given by the remainder, indicating that the substructure is present in the molecule; otherwise the fingerprint is 0 in that dimension.
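The hash-and-set-bit step can be sketched as below. Real ECFP derives substructure identifiers from atom-centered environments (e.g., with a cheminformatics toolkit); here the substructures are stand-in strings, and the fingerprint length is an arbitrary choice.

```python
import hashlib

# Illustrative sketch of the hashing step behind circular fingerprints:
# each substructure is hashed, and the hash modulo the fingerprint length
# selects the dimension that is set to 1.
def fingerprint(substructures, n_bits=16):
    bits = [0] * n_bits
    for sub in substructures:
        h = int(hashlib.sha256(sub.encode()).hexdigest(), 16)
        bits[h % n_bits] = 1   # remainder picks the dimension set to 1
    return bits

# Hypothetical substructure strings for illustration only.
fp = fingerprint(["C-O", "C=O", "c1ccccc1"])
print(len(fp), sum(fp) >= 1)
```

Identical substructures map to the same bit, so repeating a substructure does not change the fingerprint.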
  • Quantum annealing is an optimization process based on quantum fluctuation characteristics, which can find the global optimal solution when the objective function has multiple candidate solutions. It is mainly used to solve problems with many local minima over discrete spaces (combinatorial optimization problems), such as finding the ground state of a spin glass. Quantum annealing starts from a quantum superposition of all possible (candidate) states with equal weights, and the physical system then undergoes quantum evolution according to the Schrödinger equation. Depending on the time-dependent strength of the transverse field, quantum tunneling occurs between states, causing the probability amplitudes of all candidate states to change continuously and achieving quantum parallelism. The transverse field is finally turned off, and the system is expected to have reached the solution of the original optimization problem, that is, the ground state of the corresponding classical Ising model.
  • Quantum annealing algorithm models usually include two parts: the first part is quantum potential energy, whose purpose is to map the quantum optimization problem to the quantum system, and map the optimized objective function into a potential field imposed on the quantum system; the second part is quantum kinetic energy, by introducing a kinetic energy term (with controllable amplitude) as a penetrating field for controlling quantum fluctuations.
  • quantum-mechanical effects such as quantum fluctuations, quantum tunneling, and adiabatic quantum evolution.
  • quantum annealing heuristic algorithm converts the adiabatic quantum process into its corresponding classical dynamic process, which retains the characteristics of the adiabatic quantum evolution.
  • the Ising model is a type of stochastic process model that describes the phase transition of matter. When matter undergoes a phase transition, new structures and physical properties appear. Systems that undergo phase transitions are generally systems with strong interactions between molecules, also known as cooperative systems.
  • the system studied by the Ising model consists of a multi-dimensional periodic lattice.
  • the geometric structure of the lattice can be cubic or hexagonal.
  • Each lattice point is assigned a value representing a spin variable, that is, spin up or spin down.
  • the Ising model assumes that only nearest neighbor spins interact, and the configuration of the lattice is determined by a set of spin variables.
  • a common two-dimensional Ising model diagram uses the direction of the arrow to indicate the spin direction.
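For illustration, the nearest-neighbor energy of a two-dimensional Ising configuration (periodic boundaries, uniform coupling J, no external field; these choices are assumptions for the sketch) can be computed as:

```python
import numpy as np

# Nearest-neighbor energy of a 2D Ising configuration with periodic
# boundaries: H = -J * sum over neighboring pairs of s_i * s_j, where
# each spin s_i is +1 (up) or -1 (down).
def ising_energy(spins, J=1.0):
    right = np.roll(spins, -1, axis=1)   # neighbor to the right
    down = np.roll(spins, -1, axis=0)    # neighbor below
    # Counting only right and down neighbors visits each bond exactly once.
    return -J * np.sum(spins * right + spins * down)

all_up = np.ones((4, 4), dtype=int)      # fully aligned configuration
print(ising_energy(all_up))              # -32.0: 32 bonds, each contributing -J
```

The fully aligned lattice minimizes the energy for J > 0, matching the intuition that nearest-neighbor interactions favor aligned spins.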
  • SMILES strings can be imported by most molecule-editing software and converted into 2D graphics or 3D models of molecules. Conversion to a two-dimensional graphic can use Helson's Structure Diagram Generation algorithm.
  • SELFIES has also been proposed. SELFIES is an improved form of SMILES based on natural language and is a string-based representation; each SELFIES string corresponds to a valid molecule.
  • molecular optimization based on expert experience refers to experienced experts deleting or adding certain chemical groups on molecules based on professional knowledge and industry experience, so that material or drug molecules have better properties; this is also the main way molecules are optimized in traditional materials science and drug discovery. However, it requires a great deal of trial and error, and experiments are expensive: it demands experience and expert knowledge, has high cost and long cycles, is subjective, has poor stability, and has low throughput.
  • the molecular optimization method based on Bayesian optimization is a molecular optimization method that has emerged in recent years.
  • the method is to use the representation learning (Representation Learning) method in deep learning to encode the molecule into a vector representation.
  • Bayesian optimization: using this optimization strategy, the vector representation of the molecule is adjusted and decoded to obtain new chemical molecules.
  • However, the computational cost is high, the cycle is long, and parallelism is poor; the effect depends on the choice of surrogate function and acquisition function.
  • the molecular optimization method based on reinforcement learning is also a molecular optimization method that has emerged in recent years. It follows the approach of the Bayesian-optimization method: it first uses representation learning to encode the molecule into a vector representation, then adjusts that vector representation through a reinforcement-learning scoring function and decodes new chemical molecules. However, the computational cost is high, the cycle is long, parallelism is poor, and effective results may not be produced.
  • the molecular optimization method based on conditional generation uses the conditional generation model to generate molecules that tend to have certain properties, which is also a current method for molecular optimization.
  • Molecular optimization based on conditional generation uses generative models such as VAE or GAN as the framework, specifying certain dimensions of random sampling vectors as fixed eigenvalues, so that the generative model tends to generate molecules with specified properties.
  • VAE: variational autoencoder; GAN: generative adversarial network.
  • However, the computational cost is high, more data is required, and the generation does not depend strongly on the optimization conditions, so the optimization capability is weak.
  • this application provides a molecular optimization method, a molecular structure optimization and transformation system based on quantum annealing and its heuristic algorithms, to achieve efficient and rapid molecular optimization.
  • FIG. 3 is a schematic flow chart of a molecular optimization method provided by this application.
  • the first data set includes multiple sets of data.
  • the multiple sets of data can be used to represent multiple molecular structures.
  • Each molecular structure corresponds to at least one attribute.
  • the attribute set includes multiple sets of attribute information.
  • Each set of data corresponds to a set of attribute information, and each set of attribute information includes the value of at least one attribute of the corresponding molecular structure.
  • That is, each molecular structure has values for one or more corresponding attributes, such as toughness, toxicity, catalytic efficiency, druggability, or solubility. Different attributes may have different representations or units, which can be determined according to the actual application scenario.
  • the molecular structure can be represented in multiple ways.
  • each molecular structure can be represented by sequences in multiple dimensions.
  • For example, one-dimensional SMILES/SELFIES strings, two-dimensional molecular diagrams, or three-dimensional structures such as 3D point clouds, or a combination of these representations, can be used to represent molecular structures to form a data set.
  • the molecular optimization method provided by this application can be deployed in a server.
  • the server can receive the first data set and attribute set sent by the client.
  • the user can input multiple molecular structures and the attribute information corresponding to each molecular structure through the client.
  • the client can send multiple molecular structures input by the user and the attribute information corresponding to each molecular structure to the server through a wireless or wired network.
  • the attributes corresponding to the attribute information in the attribute set may be the attributes for which the user needs to solve for the optimal molecular structure. For example, if the user needs the molecular structure with optimal corrosion resistance, the user can input molecular structures with known corrosion-resistance values; likewise, if the user needs the molecular structure with optimal heat resistance, the user can input molecular structures with known heat-resistance values.
  • the multiple sets of sequences in the first data set can be binary-encoded to obtain a second data set, which includes multiple binary sequences. That is, the multiple sets of data in the first data set are converted into binary representations.
  • the encoder in the pre-trained autoencoder can be used, and the data to be encoded is used as the input of the encoder to extract features from the input data.
  • the prior distribution is used as a constraint to encode, and the latent variable data is output, that is, a binary sequence.
  • the prior distribution may be obtained by sampling from the Bernoulli distribution corresponding to the multiple sets of sequences in the first data set, that is, the prior distribution is also a binary sequence. Therefore, in the embodiment of the present application, the prior distribution can be used as a constraint, so that the output latent variable obeys the prior distribution as much as possible, thereby realizing binary encoding.
  • when collecting the prior distribution, Gibbs sampling can be used to sample from the Bernoulli distribution based on a restricted Boltzmann machine, so that the collected prior distribution is a binary sequence; under the constraint of this binary sequence, the output latent-variable data is also a binary sequence, which facilitates the subsequent construction of the objective function.
  • the characteristics of the molecular attributes can be extracted from the first data set through the pre-trained encoder, and represented by a binary sequence, so as to facilitate subsequent efficient solution through the quantum annealing algorithm.
  • the target model can be constructed based on the sequences and the attribute set in the first data set; if the molecular-structure sequences in the first data set are in a non-binary representation, the target model can be constructed based on the sequences in the second data set and the attribute set.
  • the objective function can be used to predict the molecular attributes, and the attribute information in the attribute set can be used to fit the parameters in the objective function.
  • when constructing the objective function, the sequences representing molecular structures in the second data set can be used to construct a matrix, and the objective function is then built from this matrix. Some parameters in the objective function, such as the coupling coefficients, can be fitted using the attribute information in the attribute set, yielding a solvable objective function.
  • the embodiment of the present application takes constructing an objective function based on the second data set as an example for illustrative description.
  • the objective function when constructing the objective function, it can be constructed based on the structure of the Ising model, so that the constructed objective function conforms to the structure of the Ising model, so that it can be subsequently solved by the quantum annealing algorithm.
  • the Ising model can be expressed as: H(s) = -∑_{⟨i,j⟩} J_ij s_i s_j - ∑_i h_i s_i, where s_i ∈ {-1, +1} is the spin variable at lattice point i, J_ij is the coupling coefficient between neighboring spins, and h_i is the external field acting on spin i.
  • the objective function can be constructed according to the structure of the Ising model, so that the structure of the objective function is consistent with the structure of the Ising model.
  • the objective function can be constructed using matrix factorization for the second data set.
  • Each molecular structure in the second data set may include one or more attributes.
  • The multiple sets of sequences in the second data set can form a matrix.
  • This matrix is decomposed using matrix factorization: it is typically decomposed into multiple matrices whose product approximates the initial matrix. Matrix factorization thereby reduces the dimensionality of the data in the second data set, which is equivalent to separating the various attributes of the molecules, so that an objective function can be constructed based on each attribute.
  • Usually the extremum of the objective function corresponds to the molecular structure with optimal properties.
  • the objective function can be solved through the quantum annealing algorithm to obtain a target sequence that meets the requirements.
  • the target sequence represents a sequence of molecular structures that meets the requirements.
  • the matching method can be selected according to the actual application scenario, and this application does not limit this.
  • the quantum environment can be simulated by a computing device and solved with an annealing algorithm. If a quantum annealing machine is used, the objective function can be used as the input of the quantum annealing machine; after the machine's internal computation, the solution of the objective function is output to obtain the target sequence.
  • the objective function can be constructed based on the binary sequence, and then can be solved by the quantum annealing algorithm, so that efficient solving can be achieved.
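As a rough classical stand-in for the annealing step, a simulated-annealing loop over binary vectors can minimize such an objective. This sketch is not the quantum annealing algorithm or machine of this application; the toy coupling matrix, schedule, and step count are assumptions for illustration.

```python
import numpy as np

# Classical simulated-annealing stand-in for solving a quadratic binary
# objective f(q) = q^T Q q over q in {0, 1}^n: repeatedly flip single bits,
# always accepting improvements and sometimes accepting worse states while
# the "temperature" decreases.
def anneal_qubo(Q, n_steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    q = rng.integers(0, 2, size=n)
    energy = q @ Q @ q
    for step in range(n_steps):
        T = max(1e-3, 1.0 - step / n_steps)   # linearly decreasing temperature
        i = rng.integers(n)
        q_new = q.copy()
        q_new[i] ^= 1                          # flip one bit
        e_new = q_new @ Q @ q_new
        if e_new < energy or rng.random() < np.exp((energy - e_new) / T):
            q, energy = q_new, e_new
    return q, energy

# Toy upper-triangular coupling matrix (assumption); by enumeration its
# minimum is f(q) = -3 at q = [1, 1, 0].
Q = np.array([[-1.0, -1.0, 0.0],
              [0.0, -1.0, 2.0],
              [0.0, 0.0, 1.0]])
q_best, e_best = anneal_qubo(Q)
print(q_best.tolist(), e_best)
```

A real deployment would hand the same coefficient matrix to a quantum annealer or a quantum-heuristic solver instead of this classical loop.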
  • This is more efficient than solving algorithms such as reinforcement learning, Bayesian optimization, etc.
  • the target sequence obtained by solving the problem is a binary sequence.
  • the target sequence can be decoded to obtain the sequence of the molecular structure with optimal attributes.
  • The properties of the molecular structure corresponding to the obtained molecular sequence are better than the properties of the molecular structures corresponding to the multiple sets of data in the first data set.
  • decoding can be performed through the decoder in the autoencoder.
  • This decoding process can be understood as the inverse of the aforementioned binary encoding process: it is equivalent to restoring the binary sequence to a sequence representing the molecular structure, thereby obtaining a sequence that represents the molecular structure.
  • A VAE can be used to extract the features in the data set and represent them as binary sequences; an objective function for predicting molecular properties can then be constructed based on the structure of the Ising model, and a molecular structure with better properties is obtained by solving this objective function.
  • the quantum annealing algorithm can be used to solve the problem, which can efficiently and accurately solve the molecular structure with better properties.
  • even molecules with attributes in multiple dimensions can be encoded into binary sequences, thereby achieving efficient solving, adapting to scenarios with multiple molecular attributes, and yielding a variety of molecular structures with excellent properties.
  • the method provided in this application can be applied to a molecular optimization scenario, as shown in Figure 4.
  • this application can be deployed on a cloud platform or in a user's device.
  • a pre-trained quantum annealing molecular optimization system can be deployed on the cloud platform for molecular optimization.
  • users need to solve the optimal molecular structure, they can input a batch of molecular structures with known properties to the cloud platform, and then run the quantum annealing molecular optimization system deployed in the cloud platform to output molecular structures with optimal properties.
  • the quantum annealing molecular optimization system can encode the input molecular structures through the encoder in the VAE to output binary-encoded data, construct an objective function based on the binary-encoded data, and solve the objective function through the quantum annealing algorithm.
  • The binary sequence of molecules with better properties obtained by solving is then decoded by the decoder in the VAE to output a sequence representing a molecular structure with better properties.
  • the method provided by this application can be divided into multiple parts, such as multi-dimensional representation of molecules, binary encoding, objective function construction, quantum annealing optimization, and molecular encoding restoration, as shown in Figure 5.
  • the molecular structure can be represented by a sequence of multiple dimensions.
  • A VAE can be used to binary-encode the molecular structure sequences, with a prior obtained by Gibbs sampling from the Bernoulli distribution based on a restricted Boltzmann machine.
  • The objective function is then constructed based on the structure of the Ising model using the matrix factorization method, and solved with the quantum annealing algorithm to obtain a binary sequence of the molecular structure with optimal properties.
  • This binary sequence is then decoded and restored to obtain a sequence representing the molecular structure. Each step is introduced below in conjunction with Figure 6.
  • the molecules can be expressed in a variety of ways, such as one-dimensional SMILES or SELFIES strings, two-dimensional molecular diagrams, three-dimensional structures, or various combinations of the above.
  • the molecular structure can be represented by one-dimensional SMILES or SELFIES strings, two-dimensional molecular diagrams, and three-dimensional structures.
  • Molecules with different properties may have different structures, and molecules with different structures may also have different properties, so molecular optimization can be achieved by changing the molecular structure.
  • the representation of the molecule can be converted into a binary representation.
  • binary sequences can also be used directly to represent molecular structures.
  • This application takes the case where binary encoding is required as an example for illustrative introduction; the application is not limited to this.
  • the encoder can be trained in advance. After pre-training of the binary autoencoder, the molecule representation can be encoded into a vector composed of 0s and 1s as the encoding of the molecule.
  • the molecules need to be encoded into vectors composed of 0/1 first.
  • Some commonly used binary encoding methods can encode molecules into vectors composed of 0s and 1s through hashing algorithms, but such vectors cannot be restored to molecular structures. Therefore, this application provides a binary encoding method whose result can be restored to a molecular structure after optimization with the quantum annealing algorithm, thereby screening out better molecular structures.
  • VAE can be used for encoding.
  • it can also be replaced by other types of autoencoders, which this application is not limited to.
  • the latent variables output by the encoder in commonly used VAE usually approach the normal distribution and cannot achieve 0/1 binary encoding.
  • Therefore, a constraint condition is added: the constraint can be a prior collected from the Bernoulli distribution, so that under this constraint the encoder outputs latent variables that obey the Bernoulli distribution.
  • Gibbs sampling can be used to collect the prior distribution p from the Bernoulli distribution based on the restricted-Boltzmann-machine principle, so that when training the VAE, the collected prior distribution p is used as a constraint to make the latent variables output by the encoder obey p as closely as possible.
  • its convergence condition is that the reconstruction rate is as large as possible and the KL divergence is as small as possible.
  • VAE: uses the normal distribution as the prior distribution of the variational autoencoder.
  • Bernoulli VAE: uses the Bernoulli (binomial) distribution as the prior distribution of the VAE.
  • Quantum VAE: based on the restricted Boltzmann machine, uses the distribution obtained by Gibbs sampling from the Bernoulli distribution as the prior distribution of the VAE.
  • z: the dimension of the latent variable.
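The KL-divergence part of the convergence condition above can be illustrated per latent dimension for Bernoulli distributions (a simplified sketch; the actual training objective of this application may be formulated differently):

```python
import math

# Per-dimension KL divergence between two Bernoulli distributions,
# KL(q || p) = q*log(q/p) + (1-q)*log((1-q)/(1-p)).
# During training this term is pushed toward zero so that the encoder's
# latent activation probability q follows the prior probability p.
def bernoulli_kl(q, p, eps=1e-9):
    q = min(max(q, eps), 1 - eps)   # clamp away from 0/1 for numerical safety
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

print(bernoulli_kl(0.5, 0.5))          # 0.0: identical distributions
print(bernoulli_kl(0.9, 0.5) > 0.0)    # True: mismatch is penalized
```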
  • the solution provided by this application uses 2.5 million drug-like molecules from the ZINC compound library for molecular autoencoder training. In this way, molecules and codes correspond as well as possible, and even randomly sampled codes can be effectively decoded into molecules.
  • The reconstruction rate of the molecular binary encoding method provided by this application (i.e., Quantum VAE) improves significantly when the dimension of the latent variable z is increased to 2048, and is comparable to the effect of a VAE sampled from the normal distribution.
  • The validity, uniqueness, and novelty metrics of molecules decoded after sampling from the prior distribution are even slightly better than those of a conventional VAE.
  • matrix factorization can be used to construct the prediction function f(q) of the attribute.
  • the matrix factorization method can be used to construct the prediction function of molecular attributes, and f(q) can be expressed as:
  • q_i and q_j represent the values of the i-th and j-th dimensions of the molecule's binary encoding vector, v_ik and v_jk are the coefficients of the k-th factor, and f(q) is the attribute value of the molecule predicted by the model. Since q_i and q_j can only take the values 0 or 1, the functional form of f(q) is close to that of the Hamiltonian of the Ising model.
  • f(q) can be understood as a quadratic unconstrained binary optimization (QUBO) problem.
  • q i and q j represent the spin states of the i-th element and j-th element respectively
  • Q ij is the coupling coefficient of the i-th element and the j-th element, which can be calculated by fitting the attribute information.
  • the objective function (Formula 4.1)
  • Summing v_ik·v_jk over the factor dimension k yields Q_ij. Therefore, this application can use quantum annealing to solve for the ground-state Hamiltonian of the Ising model and find the extremum of the objective function f(q).
  • the point where the objective function takes the extreme value is a binary code, and the corresponding molecule after decoding the binary code is the optimized molecule.
  • the objective function f(q) constructed in the embodiment of this application has the same or a similar functional form as the Hamiltonian of the Ising model, so it can be solved through the quantum annealing algorithm to find the extremum of the objective function H_problem, which is also the extremum of the original objective function.
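Assuming the fitted factor coefficients are stored in a matrix V with V[i, k] = v_ik (a hypothetical stand-in here, filled with random values rather than fitted from attribute data), the couplings Q_ij and the prediction f(q) can be sketched as:

```python
import numpy as np

# Sketch: turning factorization coefficients v_ik into pairwise couplings
# Q_ij = sum_k v_ik * v_jk, and evaluating f(q) = sum_{i<j} Q_ij q_i q_j
# for a binary encoding q.
rng = np.random.default_rng(2)
n_dims, n_factors = 5, 3
V = rng.normal(size=(n_dims, n_factors))   # V[i, k] = v_ik (stand-in values)

Q = V @ V.T                                # Q[i, j] = sum over k of v_ik * v_jk

def f(q, Q):
    q = np.asarray(q)
    total = 0.0
    for i in range(len(q)):
        for j in range(i + 1, len(q)):     # pairwise terms only, i < j
            total += Q[i, j] * q[i] * q[j]
    return total

q = [1, 0, 1, 1, 0]                        # example binary encoding
# Only the pairs of dimensions that are both 1 contribute: (0,2), (0,3), (2,3).
print(abs(f(q, Q) - (Q[0, 2] + Q[0, 3] + Q[2, 3])) < 1e-12)
```

Because every Q_ij is a sum of products of factor coefficients, fitting V against the attribute information fixes all coupling coefficients of the QUBO at once.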
  • the quantum annealing algorithm can use a quantum heuristic annealing algorithm or a quantum annealing machine for calculation.
  • a matching quantum annealing method can be selected according to the actual application scenario, and this application does not limit this.
  • This adiabatic quantum process is a quantum evolution process parameterized by t: by gradually adjusting the parameters, it evolves from a simple initial quantum Hamiltonian H_0 to the complex target quantum Ising Hamiltonian, whose ground state is then obtained through measurement. The spin values corresponding to the ground state are the optimal solution to the target problem. The operators involved are the quantum Pauli operators for the z and x directions of the spin angular momentum, respectively.
  • The heuristic algorithm carries out noisy classical dynamical evolution according to the adiabatic classical Hamiltonian.
  • the evolution process proceeds according to the classic Hamiltonian canonical equation.
  • Taking the sign of the final coordinates yields the solution to the target problem.
  • the extreme value obtained by solving f(q) is also a vector composed of 0/1.
  • the molecular properties encoded by this vector are optimal, and the optimized molecular structure can be restored through the decoder.
  • the VAE can include an encoder and a decoder.
  • the encoder can be used to encode in the aforementioned binary encoding process.
  • the decoder can be used to decode the sequence obtained by solving the problem and output the molecular structure with optimal attributes.
  • the molecular optimization algorithm based on quantum annealing was applied to four molecular optimization objectives: QED (drug-likeness), EGFR (protein binding activity), BACE1 (protein binding activity), and CB1 (protein binding activity).
  • QED drug-like properties
  • EGFR protein binding activity
  • BACE1 protein binding activity
  • CB1 protein binding activity
  • the molecular optimization method based on quantum annealing can perform binary encoding on the sequence representing the molecular structure with known properties, thereby converting it into a binary sequence, and can construct a structure close to the Ising model based on the binary sequence.
  • the objective function can be solved using the quantum annealing algorithm to obtain its extremum, that is, the molecular structure that is optimal with respect to the known properties. This achieves efficient operation and yields molecules with better properties.
  • the molecular optimization device includes:
  • Acquisition module 801 is used to acquire a first data set and an attribute set.
  • the first data set includes multiple sets of data.
  • the multiple sets of data are used to represent multiple molecular structures.
  • Each set of data can be used to represent at least one molecular structure.
  • The attribute set includes multiple sets of attribute information, which can correspond one-to-one with the multiple sets of data; each set of attribute information includes the value of at least one attribute of the corresponding molecular structure;
  • the construction module 802 is used to construct the objective function according to the first data set and the attribute set;
  • Solving module 803 is used to solve the objective function through the quantum annealing algorithm to obtain a molecular sequence.
  • The molecular sequence is used to represent the molecular structure obtained by solving, where the properties of the molecular structure obtained by solving are better than the properties of the molecular structures represented in the first data set.
  • the device further includes: an encoding module 804;
  • the encoding module 804 is used to perform binary encoding on each set of data in the first data set to obtain a second data set.
  • the second data set includes multiple sets of sequences, and the multiple sets of sequences correspond to multiple sets of data;
  • the construction module 802 is specifically configured to construct an objective function based on the structure of the Ising model according to the second data set and the attribute set.
  • the construction module 802 is specifically configured to construct an objective function based on the structure of the Ising model and the attribute set, using the matrix factorization corresponding to the sequences in the second data set.
  • the encoding module 804 is specifically used to use the prior distribution as a constraint to encode multiple sets of sequences in the first data set through the encoder in the variational autoencoder VAE to obtain latent variables.
  • the prior distribution is sampled based on the Bernoulli distribution corresponding to the sequence in the first data set.
  • the device further includes: a sampling module 805, configured to use Gibbs sampling to sample from the Bernoulli distribution based on the restricted Boltzmann machine to obtain the prior distribution.
  • the device further includes: a decoding module 806;
  • the solving module 803 is specifically used to solve the objective function through the quantum annealing algorithm to obtain the target sequence.
  • the decoding module 806 is used to decode the target sequence through the decoder in the VAE to obtain the molecular sequence.
  • the solving module 803 is specifically configured to solve the target function through a quantum annealing machine to obtain the target sequence.
  • the data in the first data set includes one or more of the following: one-dimensional character strings, two-dimensional molecular maps, or three-dimensional structure data.
  • Figure 9 is a schematic structural diagram of another molecular optimization device provided by this application, as described below.
  • the molecular optimization device may include a processor 901 and a memory 902.
  • the processor 901 and the memory 902 are interconnected through lines.
  • the memory 902 stores program instructions and data.
  • the memory 902 stores program instructions and data corresponding to the steps in FIGS. 3 to 7 .
  • the processor 901 is configured to execute the method steps performed by the molecular optimization device shown in any of the embodiments shown in FIGS. 3 to 7 .
  • the molecule optimization device may also include a transceiver 903 for receiving or transmitting data.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores a program for molecular optimization.
  • When the program runs on a computer, the computer is caused to execute the steps shown in Figures 3 to 7.
  • the illustrated embodiments describe steps in a method.
  • the aforementioned molecular optimization device shown in Figure 9 is a chip.
  • Embodiments of the present application also provide a molecular optimization device.
  • the molecular optimization device can also be called a digital processing chip or chip.
  • the chip includes a processing unit and a communication interface.
  • the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit.
  • the processing unit is used to perform the method steps performed by the molecular optimization device shown in any of the embodiments in FIGS. 3 to 7 .
  • An embodiment of the present application also provides a digital processing chip.
  • the digital processing chip integrates circuits and one or more interfaces for realizing the functions of the above-mentioned processor 901.
  • the digital processing chip can complete the method steps of any one or more embodiments in the foregoing embodiments.
  • the digital processing chip does not have an integrated memory, it can be connected to an external memory through a communication interface.
  • the digital processing chip implements the actions performed by the molecular optimization device in the above embodiment according to the program code stored in the external memory.
  • the molecular optimization device can be a chip.
  • the chip includes: a processing unit and a communication unit.
  • the processing unit can be, for example, a processor.
  • the communication unit can be, for example, an input/output interface, a pin, or a circuit.
  • the processing unit can execute computer execution instructions stored in the storage unit, so that the chip in the server executes the molecular optimization method described in the embodiments shown in FIGS. 3 to 7 .
  • the storage unit is a storage unit within the chip, such as a register, cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
  • Embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to perform the steps performed by the molecular optimization device in the methods described in the embodiments shown in FIGS. 3 to 7.
  • the aforementioned processing unit or processor may be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor or any conventional processor, etc.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units.
  • the physical unit can be located in one place, or it can be distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
  • the present application can be implemented by software plus the necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits, or special-purpose circuits. For this application, however, a software implementation is in most cases the better choice. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product.
  • the computer software product is stored in a readable storage medium, such as a floppy disk, USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the various embodiments of this application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center integrating one or more available media.
  • the available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., solid-state drive (SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Chemical & Material Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Medicinal Chemistry (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A molecular optimization method and apparatus that construct an objective function based on the Ising model and solve it with a quantum annealing algorithm, so that the optimal molecular structure can be obtained efficiently and accurately. The method includes: first, obtaining a first data set and an attribute set (301), where the first data set includes multiple groups of data that can represent multiple molecular structures, each group of data representing at least one molecular structure, and the attribute set includes values of attributes of the multiple molecular structures, each group of data having at least one corresponding molecular attribute, such as toughness, toxicity, or solubility; constructing an objective function according to the first data set and the attribute set; and then solving the objective function with a quantum annealing algorithm to obtain a molecular sequence, which can represent the solved molecular structure.

Description

Molecular optimization method and apparatus
This application claims priority to Chinese patent application No. 202210564370.2, entitled "A quantum-annealing-based molecular optimization framework", filed with the China National Intellectual Property Administration on May 23, 2022, and to Chinese patent application No. 202211019436.6, entitled "Molecular optimization method and apparatus", filed on August 24, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a molecular optimization method and apparatus.
Background
For chemical molecules such as materials or drugs to have better properties, such as higher toughness, lower toxicity, or better solubility, their molecular structures need to be optimized and modified. The process of changing a molecular structure to achieve better performance is molecular optimization.
Some common molecular optimization approaches, such as Bayesian-optimization-based, reinforcement-learning-based, or conditional-generation-based molecular optimization, usually require a large amount of training data, have very long optimization cycles, and produce very unstable outputs. How to perform molecular optimization efficiently and with stable output has therefore become an urgent problem.
Summary
This application provides a molecular optimization method and apparatus that construct an objective function based on the Ising model and solve it with a quantum annealing algorithm, so that the optimal molecular structure can be obtained efficiently and accurately.
In a first aspect, this application provides a molecular optimization method, including: first, obtaining a first data set and an attribute set, where the first data set includes multiple groups of data that can represent multiple molecular structures, each group of data representing at least one molecular structure; the attribute set includes multiple groups of attribute information in one-to-one correspondence with the groups of data, each group of attribute information including the value of at least one attribute of the corresponding molecular structure, such as toughness, toxicity, or solubility; constructing an objective function according to the first data set and the attribute set, where the attribute information in the attribute set can be used to fit the parameters of the objective function; and then solving the objective function with a quantum annealing algorithm to obtain a molecular sequence representing the solved molecular structure, whose attributes are better than those of the molecular structures represented in the first data set.
In the embodiments of this application, molecular structures with known attributes can be used to construct the objective function, which is then solved with a quantum annealing algorithm, enabling efficient and accurate solving to obtain molecular structures with better attributes.
In one possible implementation, the first data set and the attribute set may be obtained from input data received from a client. For example, a user may input known molecular structures and the attribute information of each structure, such as heat resistance or hardness, through the client.
In one possible implementation, constructing the objective function according to the first data set and the attribute set may include: performing binary encoding on each group of data in the first data set to obtain a second data set, where the second data set includes multiple groups of sequences corresponding to the groups of data, all of which are binary sequences; and then constructing the objective function according to the second data set and the attribute set, based on the structure of the Ising model.
In the embodiments of this application, when the data in the first data set are not binary sequences, each group of data in the first data set can be binary-encoded to facilitate subsequent objective-function construction and solving. This is equivalent to converting each group of data in the first data set into a binary-sequence representation, so that the objective function can subsequently be constructed based on the structure of the Ising model.
In one possible implementation, constructing the objective function according to the second data set based on the structure of the Ising model may include: constructing the objective function by matrix factorization of the matrix corresponding to the sequences in the second data set, based on the structure of the Ising model and the attribute set.
In the embodiments of this application, the objective function can be constructed by matrix factorization based on the structure of the Ising model, so that a quantum annealing algorithm can be used to find the optimal solution of the objective function.
In one possible implementation, performing binary encoding on the groups of sequences in the first data set to obtain the second data set may include: using a prior distribution as a constraint, encoding the groups of sequences in the first data set with the encoder of a variational autoencoder (VAE) to obtain latent-variable encoded data, where the prior distribution is sampled from the Bernoulli distribution corresponding to the sequences in the first data set.
Thus, in the embodiments of this application, a prior distribution sampled from a Bernoulli distribution can be used as a constraint during binary encoding, so that each element of the sequence produced by the encoder is 0 or 1, yielding a binary sequence.
In one possible implementation, the method provided by this application may further include: sampling the prior distribution from the Bernoulli distribution by Gibbs sampling, based on a restricted Boltzmann machine.
Thus, in the embodiments of this application, the prior distribution can be obtained from the Bernoulli distribution by Gibbs sampling based on a pre-trained restricted Boltzmann machine, facilitating subsequent binary encoding.
In one possible implementation, decoding the target sequence to obtain the molecular sequence includes: decoding the target sequence with the decoder of the VAE to obtain the molecular sequence.
In the embodiments of this application, binary sequences are typically used when constructing the objective function and solving it with the quantum annealing algorithm, while the molecular structure may be represented in a non-binary form. Therefore, after a binary sequence is obtained by solving, it can be decoded by the decoder to reconstruct a recognizable molecular structure.
In one possible implementation, solving the objective function with a quantum annealing algorithm to obtain the target sequence may include: solving the objective function with a quantum annealing machine to obtain the target sequence.
Thus, in the embodiments of this application, a quantum annealing machine can be used directly for solving; compared with simulating quantum annealing on the same device, using a quantum annealing machine can further improve solving efficiency.
In one possible implementation, the data in the first data set includes one or more of the following: one-dimensional character strings, two-dimensional molecular graphs, or three-dimensional structure data.
Thus, in the embodiments of this application, molecular structures can be represented in multiple ways, making the method applicable to multiple scenarios; decoding can likewise produce one or more of the aforementioned data types, so that the user can identify the specific structure of the molecule from the output molecular sequence.
In a second aspect, this application provides a molecular optimization apparatus, including:
an obtaining module, configured to obtain a first data set and an attribute set, where the first data set includes multiple groups of data, each group of data representing at least one molecular structure, and the attribute set includes multiple groups of attribute information in one-to-one correspondence with the groups of data, each group of attribute information including the value of at least one attribute of the corresponding molecular structure;
a construction module, configured to construct an objective function according to the first data set and the attribute set, where the attribute information in the attribute set is used to fit the parameters of the objective function;
a solving module, configured to solve the objective function with a quantum annealing algorithm to obtain a molecular sequence representing the solved molecular structure.
In one possible implementation, the apparatus further includes: an encoding module;
the encoding module is configured to perform binary encoding on each group of data in the first data set to obtain a second data set, where the second data set includes multiple groups of sequences corresponding to the groups of data;
the construction module is specifically configured to construct the objective function according to the second data set and the attribute set, based on the structure of the Ising model.
In one possible implementation, the construction module is specifically configured to construct the objective function by matrix factorization of the matrix corresponding to the sequences in the second data set, based on the structure of the Ising model and the attribute set.
In one possible implementation, the encoding module is specifically configured to use a prior distribution as a constraint and encode the groups of sequences in the first data set with the encoder of a variational autoencoder (VAE) to obtain latent-variable encoded data, where the prior distribution is sampled from the Bernoulli distribution corresponding to the sequences in the first data set.
In one possible implementation, the apparatus further includes: a sampling module, configured to sample the prior distribution from the Bernoulli distribution by Gibbs sampling, based on a restricted Boltzmann machine.
In one possible implementation, the apparatus further includes: a decoding module;
the solving module is specifically configured to solve the objective function with a quantum annealing algorithm to obtain a target sequence;
the decoding module is configured to decode the target sequence with the decoder of the VAE to obtain the molecular sequence.
In one possible implementation, the solving module is specifically configured to solve the objective function with a quantum annealing machine to obtain the target sequence.
In one possible implementation, the data in the first data set includes one or more of the following: one-dimensional character strings, two-dimensional molecular graphs, or three-dimensional structure data.
In a third aspect, an embodiment of this application provides a molecular optimization apparatus having the function of implementing the molecular optimization method of the first aspect. This function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function.
In a fourth aspect, an embodiment of this application provides a molecular optimization apparatus, including: a processor and a memory interconnected by a line, where the processor calls program code in the memory to perform the processing-related functions of the molecular optimization method shown in any implementation of the first aspect. Optionally, the molecular optimization apparatus may be a chip.
In a fifth aspect, an embodiment of this application provides a molecular optimization apparatus, which may also be called a digital processing chip or chip; the chip includes a processing unit and a communication interface, the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit to perform the processing-related functions of the first aspect or any optional implementation of the first aspect.
In a sixth aspect, an embodiment of this application provides a computer-readable storage medium including instructions which, when run on a computer, cause the computer to execute the method of any optional implementation of the first aspect.
In a seventh aspect, an embodiment of this application provides a computer program product containing instructions which, when run on a computer, causes the computer to execute the method of any optional implementation of the first aspect.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a framework of a cloud platform to which this application is applied;
FIG. 2 is a schematic diagram of a system architecture provided by this application;
FIG. 3 is a schematic flowchart of a molecular optimization method provided by this application;
FIG. 4 is a schematic flowchart of another molecular optimization method provided by this application;
FIG. 5 is a schematic flowchart of another molecular optimization method provided by this application;
FIG. 6 is a schematic flowchart of another molecular optimization method provided by this application;
FIG. 7 is a schematic flowchart of another molecular optimization method provided by this application;
FIG. 8 is a schematic structural diagram of a molecular optimization apparatus provided by this application;
FIG. 9 is a schematic structural diagram of another molecular optimization apparatus provided by this application.
Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings. Clearly, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
The overall workflow of an artificial intelligence system is described first, along two dimensions: the "intelligent information chain" and the "IT value chain". The intelligent information chain reflects the process from data acquisition to data processing, for example the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output, in which data undergoes a refinement process of "data - information - knowledge - wisdom". The IT value chain, from the underlying AI infrastructure and information (technologies for provision and processing) to the industrial ecology of the system, reflects the value that artificial intelligence brings to the information technology industry.
(1) Infrastructure
The infrastructure provides computing-power support for the AI system, enables communication with the external world, and is supported by a base platform. Communication with the outside is through sensors; computing power is provided by smart chips, i.e., hardware acceleration chips such as central processing units (CPU), neural-network processing units (NPU), graphics processing units (GPU), application-specific integrated circuits (ASIC), or field-programmable gate arrays (FPGA); the base platform includes platform guarantees and support such as distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to obtain data, which is provided to the smart chips in the distributed computing system offered by the base platform for computation.
(2) Data
The data at the layer above the infrastructure indicates the data sources of the AI field. The data involves graphics, images, speech, and text, as well as Internet-of-Things data from traditional devices, including business data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, search, reasoning, decision-making, and the like.
Machine learning and deep learning can perform symbolic and formalized intelligent-information modeling, extraction, preprocessing, training, and so on of data.
Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, carrying out machine thinking and problem solving on formalized information according to reasoning control strategies; typical functions are search and matching.
Decision-making refers to the process of making decisions from intelligent information after reasoning, typically providing functions such as classification, sorting, and prediction.
(4) General capabilities
After the data is processed as described above, some general capabilities can be formed based on the processing results, such as an algorithm or a general system, for example translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
(5) Intelligent products and industry applications
Intelligent products and industry applications refer to AI system products and applications in various fields; they encapsulate the overall AI solution, productize intelligent-information decision-making, and realize practical applications. The application fields mainly include smart terminals, smart transportation, smart healthcare, autonomous driving, smart cities, and the like.
The method provided by this application can be applied in multiple scenarios, such as molecular optimization of materials or drugs toward better structures. For chemical molecules such as materials or drugs to have better properties, such as higher toughness, lower toxicity, or better solubility, their molecular structures need to be optimized and modified. The process of changing a molecular structure to achieve better performance is molecular optimization.
AI services and products in the cloud domain embody both the on-demand use and purchase characteristics of cloud services and the abstract, diverse, and widely applicable characteristics of AI technology. There are two mainstream types of cloud AI services: Platform-as-a-Service (PaaS) AI basic development platform services, and Software-as-a-Service (SaaS) AI application cloud services.
For the first type, the AI basic development platform service, the public cloud service provider, backed by abundant underlying resources and upper-layer AI algorithm capabilities, offers users an AI basic development platform. The AI development frameworks and AI algorithms built into the platform allow users to quickly build and develop AI models or AI applications that meet individual needs.
For the second type, the AI application cloud service, the public cloud service provider offers general AI application cloud services through the cloud platform, allowing users to use AI capabilities in various application scenarios with zero barriers.
For example, a public-cloud AI basic development platform is a PaaS cloud service within the cloud platform: a software platform, based on the provider's large amount of basic resources and software capabilities, that assists users (also called tenants, AI developers, etc.) in building, training, and deploying AI models and in developing and deploying AI applications.
Illustratively, the method provided by this application can be applied in a cloud platform; for example, it can be deployed as a cloud service in a drug-molecule design platform of a cloud medical agent, as one molecular optimization approach, invoked by users for a fee in the form of an application programming interface (API). Specifically, the method provided by this application can be deployed in the cloud platform as a user-facing service, with an API through which users can call the service: the user inputs molecular structures with known attributes, and the service outputs molecular structures that are better in all the attributes the user requires, thereby screening out the molecular structures the user needs.
As shown in FIG. 1, the interaction between users and the AI basic development platform mainly takes the following form: the user logs in to the cloud platform through a client web page, selects and purchases the cloud service of the AI basic development platform there, and can then carry out full-process AI services based on the functions provided by the platform.
When users develop and train AI models on the AI basic development platform, this is done based on the basic resources (mainly computing resources such as CPU, GPU, and NPU) in the cloud service provider's data center.
Generally, the basic resources supporting any process in the AI platform may be distributed across different physical devices; that is, the hardware actually executing a process is usually a server cluster in the same data center or server clusters distributed across different data centers.
These data centers may be the cloud service provider's central cloud data centers, or edge data centers the provider offers to users. For example, in a scenario combining public and private clouds, the model training and model management functions of the AI basic development platform run on public-cloud resources, while the data storage and data preprocessing functions run on private-cloud resources, providing stronger security for the user's data. In this scenario, the public-cloud resources may come from central cloud data centers and the private-cloud resources from edge data centers.
It can be understood that the AI platform may be deployed independently on servers or virtual machines in a data center of the cloud environment, or distributed across multiple servers or multiple virtual machines in the data center.
In another embodiment, the AI platform provided by this application may also be deployed distributively in different environments. It can be logically divided into multiple parts with different functions. For example, one part of the AI platform 100 may be deployed in computing devices in an edge environment (also called edge computing devices), and another part in devices in the cloud environment. The edge environment is geographically close to the user's terminal computing devices and includes edge computing devices such as edge servers and edge stations with computing capability. The parts of the AI platform 100 deployed in different environments or devices cooperate to provide users with functions such as training AI models.
Based on the above description, this application provides a system architecture, as shown in FIG. 2. In FIG. 2, a data collection device 160 is used to collect training data. In some optional implementations of this application, for the encoding model, the training data may include a large number of molecular structures with known attributes.
After collecting the training data, the data collection device 160 stores it in a database 130, and a training device 120 trains a target model/rule 101 based on the training data maintained in the database 130. Optionally, the training sets mentioned in the following embodiments of this application may be obtained from the database 130 or from user input data.
The target model/rule 101 may be the neural network trained in the embodiments of this application, which may include one or more networks, such as an autoencoder model.
The target model/rule 101 can be used to implement the neural network mentioned in the molecular optimization method of the embodiments of this application; that is, data to be processed (such as an image to be compressed), after relevant preprocessing, is input into the target model/rule 101 to obtain the processing result. The target model/rule 101 in the embodiments of this application may specifically be the neural network mentioned below, which may be a CNN, DNN, RNN, or another type of neural network. It should be noted that, in practical applications, the training data maintained in the database 130 does not necessarily all come from the data collection device 160; it may also be received from other devices. It should also be noted that the training device 120 does not necessarily train the target model/rule 101 entirely on the training data maintained in the database 130; it may also obtain training data from the cloud or elsewhere for model training, which this application does not limit.
The target model/rule 101 trained by the training device 120 can be applied to different systems or devices, for example the execution device 110 shown in FIG. 2, which is a server, a cloud device, or the like. In FIG. 2, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices; the user can input data to the I/O interface 112 through a client device 140, and the input data in the embodiments of this application may include the data to be processed input by the client device. The client may be another hardware device, such as a terminal or server, or software deployed on a terminal, such as an app or a web client.
A preprocessing module 113 and a preprocessing module 114 are used to preprocess the input data (such as the data to be processed) received by the I/O interface 112. In the embodiments of this application, the preprocessing modules 113 and 114 may be absent (or only one of them may be present), and a computing module 111 may process the input data directly.
When the execution device 110 preprocesses the input data, or when the computing module 111 of the execution device 110 performs computation or other related processing, the execution device 110 may call data, code, and the like in a data storage system 150 for the corresponding processing, and may also store the data, instructions, and the like obtained by that processing into the data storage system 150.
Finally, the I/O interface 112 returns the processing result to the client device 140 and thereby provides it to the user; for example, if a first neural network is used for image classification and the processing result is a classification result, the I/O interface 112 returns the obtained classification result to the client device 140 for the user.
It should be noted that the training device 120 can generate, for different goals or tasks, corresponding target models/rules 101 based on different training data; the corresponding target model/rule 101 can then be used to achieve the goal or complete the task, providing the user with the desired result. In some scenarios, the execution device 110 and the training device 120 may be the same device or located within the same computing device; for ease of understanding, this application introduces the execution device and the training device separately, which is not a limitation.
In the case shown in FIG. 2, the user can manually specify the input data, operating through the interface provided by the I/O interface 112. Alternatively, the client device 140 can automatically send input data to the I/O interface 112; if automatic sending requires the user's authorization, the user can set the corresponding permission in the client device 140. The user can view the result output by the execution device 110 on the client device 140, presented, for example, as display, sound, or action. The client device 140 can also serve as a data-collection terminal, collecting the input data of the I/O interface 112 and the predicted labels output by the I/O interface 112 as new sample data, as shown, and storing them in the database 130. Of course, collection may also bypass the client device 140, with the I/O interface 112 directly storing the input data and the output predicted labels as new sample data into the database 130.
It should be noted that FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of this application; the positional relationships among the devices, components, modules, etc. shown there do not constitute any limitation. For example, in FIG. 2 the data storage system 150 is external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed within the execution device 110.
As shown in FIG. 2, the target model/rule 101 is obtained by training with the training device 120 and may, in the embodiments of this application, be the model of this application; specifically, the neural network provided by the embodiments of this application may include a CNN, deep convolutional neural networks (DCNN), a recurrent neural network (RNN), a constructed neural network, and so on.
The molecular optimization provided by this application can be deployed in the above system architecture, which is used to implement molecular optimization.
First, for ease of understanding, some terms involved in this application are explained.
(1) Autoencoder model
An autoencoder is a neural network that uses the backpropagation algorithm to make the output value equal the input value: it first compresses the input data into a latent-space representation and then reconstructs the output from this representation.
An autoencoder usually includes an encoder model and a decoder model. In this application, a trained encoder model is used to extract features from an input image to obtain latent variables; feeding the latent variables into the trained decoder model outputs the predicted residual corresponding to the input image.
(2) Variational autoencoder (VAE)
A variational autoencoder is similar to an autoencoder in that both consist of an encoder, a set of latent variables, and a decoder. The difference is that, during training, besides reducing the reconstruction loss of the decoded molecules, the VAE also requires the latent variables to approximate a normal distribution as closely as possible, so that latent variables randomly sampled from the normal distribution can also be decoded into valid samples, achieving sample generation.
(3) Restricted Boltzmann machine (RBM)
The Boltzmann machine originated in statistical physics; it is an energy-function-based model that can describe high-order interactions among variables. A restricted Boltzmann machine can be understood as a neural network, usually consisting of one visible layer and one hidden layer. Because there are no connections among hidden neurons and the hidden neurons are independent given a training sample, computing data-dependent expectations directly becomes easy; there are likewise no connections among visible neurons. Data-independent expectations are estimated by running a Markov-chain sampling process starting from the hidden states obtained from training samples, alternately updating all visible and hidden neurons in parallel. The restricted Boltzmann machine mentioned below may be a pre-trained neural network.
(4) Extended-Connectivity Fingerprints (ECFP)
An ECFP molecular fingerprint converts a chemical structure into a vector of 0s and 1s; ECFPs are commonly used to build quantitative structure-activity relationship (QSAR) models. The method takes each atom as a center and partitions molecular substructures using different step lengths as radii; each substructure is given a hash value, identical substructures receiving the same hash. Taking the hash value modulo the fingerprint length gives a remainder; the fingerprint dimension at that remainder is set to 1, indicating that the substructure exists in the molecule, and is otherwise 0.
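The hash-and-fold scheme just described can be sketched in a few lines. This is a toy illustration only, under loud assumptions: real ECFP enumerates atom-centered circular substructures (typically via a cheminformatics toolkit), whereas here character n-grams of a SMILES string merely stand in for substructures, and `n_bits` and `max_radius` are made-up parameters.

```python
def toy_fingerprint(smiles: str, n_bits: int = 64, max_radius: int = 3) -> list:
    """Fold hashed 'substructures' into a fixed-length 0/1 vector."""
    bits = [0] * n_bits
    for radius in range(1, max_radius + 1):          # growing neighborhood sizes
        for start in range(len(smiles) - radius + 1):
            fragment = smiles[start:start + radius]  # stand-in for a substructure
            # identical fragments hash identically, so they set the same bit
            bits[hash(fragment) % n_bits] = 1
    return bits

fp = toy_fingerprint("CCO")   # "CCO" is the SMILES string for ethanol
print(len(fp), set(fp))       # fixed length, entries are only 0 or 1
```

As the later discussion notes, such fingerprints are not invertible: hash collisions lose information, and there is no decoding from bits back to a structure, which is what motivates a reversible learned encoding instead.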
(5) Quantum annealing (QA)
Quantum annealing is an optimization process based on quantum-fluctuation characteristics, which can find the global optimum when the objective function has many candidate solutions. It is mainly used for problems with multiple local minima over discrete spaces (combinatorial optimization), such as finding the ground state of a spin glass. Quantum annealing starts from a quantum superposition of all possible (candidate) states with equal weights, after which the physical system evolves according to the Schrödinger equation. Depending on the time-dependent strength of the transverse field, quantum tunneling occurs between states, continuously changing the probability amplitudes of all candidate states and achieving quantum parallelism. The transverse field is finally turned off, and the system is expected to have reached the solution of the original optimization problem, i.e., the ground state of the corresponding classical Ising model.
A quantum annealing algorithm model usually includes two parts: the first is the quantum potential energy, whose purpose is to map the optimization problem onto a quantum system, mapping the objective function to a potential field imposed on that quantum system; the second is the quantum kinetic energy, introduced as a kinetic term (of controllable amplitude) serving as the penetrating field that controls quantum fluctuations.
(6) Quantum-inspired algorithms
Traditional optimization algorithms are often constrained by local extrema, which degrades the optimization result. Quantum-inspired algorithms introduce ideas from quantum mechanics (e.g., quantum fluctuation, quantum tunneling, adiabatic quantum evolution) to improve existing algorithms so that they escape local extrema, increasing convergence speed and accuracy; the most representative are algorithms inspired by quantum annealing. A quantum-annealing-inspired algorithm transforms the adiabatic quantum process into its corresponding classical dynamical process while preserving the characteristics of the adiabatic quantum evolution; simulating this adiabatic classical dynamics yields the ground-state configuration of the complex target Hamiltonian (i.e., the global optimum of the complex objective function).
(7) Ising model
The Ising model is a class of stochastic-process models describing phase transitions in matter. Through a phase transition, matter acquires new structures and properties. Systems undergoing phase transitions are generally systems with strong inter-molecular interactions, also called cooperative systems.
The system studied by the Ising model consists of a multidimensional periodic lattice whose geometry may be cubic, hexagonal, etc.; each lattice site is assigned a spin variable taking the value spin-up or spin-down. The Ising model assumes interactions only between nearest-neighbor spins, and the lattice configuration is determined by a set of spin variables. Common schematic diagrams of the two-dimensional Ising model use arrow directions to indicate spin directions.
(8) Bernoulli distribution
The Bernoulli distribution, also called the 0-1 or two-point distribution, is a discrete probability distribution. If a Bernoulli trial succeeds, the Bernoulli random variable takes the value 1; if it fails, the variable takes the value 0. Its success probability is p (0 ≤ p ≤ 1) and its failure probability is q = 1 - p.
(9) Simplified Molecular Input Line Entry System (SMILES)
SMILES is a specification for unambiguously describing molecular structures with ASCII strings. SMILES strings can be imported by most molecule-editing software and converted into two-dimensional drawings or three-dimensional molecular models; conversion to two-dimensional drawings can use Helson's Structure Diagram Generation algorithms.
This amounts to converting the graph-structured data of a molecular structure into text and using the text (an encoded string) as input in a machine learning input pipeline. After conversion, relevant algorithms can process drugs, for example predicting their properties, side effects, and even interactions between compounds.
(10) SELFIES (SELF-referencIng Embedded Strings)
To address the fact that SMILES representations sometimes do not correspond to valid molecules, SELFIES was proposed. SELFIES is a natural-language-based improvement of SMILES and is a string-based representation; every SELFIES string corresponds to a valid molecule.
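As a small illustration of feeding SMILES text into a machine learning input pipeline as described in term (9) above, the sketch below one-hot encodes a SMILES string character by character. The vocabulary and maximum length are made-up assumptions; a real pipeline would derive the vocabulary from its training corpus.

```python
VOCAB = ["C", "N", "O", "c", "1", "(", ")", "=", "PAD"]   # toy vocabulary

def one_hot_smiles(smiles: str, max_len: int = 12) -> list:
    """Pad/truncate to max_len, then one-hot each character over VOCAB."""
    chars = list(smiles[:max_len]) + ["PAD"] * max(0, max_len - len(smiles))
    return [[1 if ch == v else 0 for v in VOCAB] for ch in chars]

mat = one_hot_smiles("CCO")      # "CCO": SMILES for ethanol
print(len(mat), len(mat[0]))     # max_len rows, one column per vocabulary entry
```

A matrix like this (or an embedding of it) is what an encoder network would consume as input.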
Usually, for chemical molecules such as materials or drugs to have better properties, such as higher toughness, lower toxicity, or better solubility, their molecular structures need to be optimized and modified. Some common approaches require human experience, or suffer from poor optimization results, long computation times, and high computation cost.
For example, among common approaches, molecular optimization based on expert experience means that experienced experts, drawing on professional knowledge and industry experience, remove or add certain chemical groups to a molecule so that the material or drug molecule has better properties; this is also the main way traditional materials science and drug discovery perform molecular optimization. But it requires much trial and error and costly experiments; it needs experience and expert knowledge, and has high cost, long cycles, subjectivity, poor stability, and low throughput.
As another example, Bayesian-optimization-based molecular optimization is a method that has arisen in recent years: using representation learning from deep learning, a molecule is encoded into a vector representation, which is then adjusted through a Bayesian-optimization strategy and decoded into a new chemical molecule. But computation cost is high, cycles are long, parallelism is poor, and the result depends on the choice of surrogate function and acquisition function.
As another example, reinforcement-learning-based molecular optimization, also recent, follows the Bayesian-optimization approach: representation learning first encodes a molecule into a vector representation, which is then adjusted via a reinforcement-learning scoring function and decoded into a new chemical molecule. But computation cost is high, cycles are long, parallelism is poor, and it may fail to output valid results.
As another example, conditional-generation-based molecular optimization uses a conditional generative model to generate molecules inclined to have certain properties and is also a current means of molecular optimization. It takes a generative model such as a VAE or GAN as the framework and fixes certain dimensions of the randomly sampled vector to specified feature values, so that the generative model tends to generate molecules with the specified properties. But computation cost is high, more data is needed, and the optimization condition is not strongly tied to generation, so the optimization capability is weak.
This application therefore provides a molecular optimization method: a molecular-structure optimization and modification system based on quantum annealing and its inspired algorithms, achieving efficient and fast molecular optimization.
The molecular optimization method provided by this application is introduced below.
Referring to FIG. 3, a schematic flowchart of a molecular optimization method provided by this application.
301. Obtain a first data set and an attribute set.
The first data set includes multiple groups of data that can represent multiple molecular structures, each molecular structure corresponding to at least one attribute. The attribute set includes multiple groups of attribute information, each group of data corresponding to one group of attribute information, and each group of attribute information including the value of at least one attribute of the corresponding molecular structure. That is, each molecular structure has one or more corresponding attributes, such as toughness, toxicity, catalytic efficiency, druggability, or solubility. Different attributes may usually have different representations or units, which can be determined according to the actual application scenario.
Optionally, molecular structures can be represented in multiple ways; when each molecular structure has multiple attributes, it can be represented by sequences of multiple dimensions. For example, one-dimensional SMILES/SELFIES strings, two-dimensional molecular graphs, three-dimensional structures such as 3D point clouds, or combinations of these representations can be used to represent molecular structures and form the data set.
Optionally, the molecular optimization method provided by this application can be deployed on a server, which can receive the first data set and the attribute set sent by a client. For example, the user can input multiple molecular structures and the attribute information of each structure through the client, which sends them to the server over a wireless or wired network.
The attributes corresponding to the attribute information in the attribute set may be the attributes for which the user needs to solve for the optimal molecular structure. For example, if the user needs the molecular structure with the best corrosion resistance, the user can input molecular structures with known corrosion-resistance values when inputting structures with known attributes; likewise, if the user needs the molecular structure with the best heat resistance, the user can input molecular structures with known heat-resistance values.
302. Perform binary encoding on the groups of sequences in the first data set to obtain a second data set.
Optionally, if the groups of sequences in the first data set are not binary sequences, they can be binary-encoded to obtain a second data set including multiple groups of binary sequences. This can be understood as binary-converting the groups of data in the first data set into a binary representation.
Specifically, during binary encoding, the encoder of a pre-trained autoencoder can be used, taking the data to be encoded as its input and extracting features from it. During encoding, a prior distribution is used as a constraint, and latent-variable data, i.e., a binary sequence, is output. The prior distribution may be sampled from the Bernoulli distribution corresponding to the groups of sequences in the first data set; that is, the prior distribution is itself a binary sequence. Thus, in the embodiments of this application, the prior distribution can serve as a constraint so that the output latent variables follow the prior distribution as closely as possible, achieving binary encoding.
Optionally, when collecting the prior distribution, Gibbs sampling based on a restricted Boltzmann machine can be used to sample the prior distribution from the Bernoulli distribution, so that the sampled prior is a binary sequence; the output latent-variable data, under this binary-sequence constraint, is then also a binary sequence, facilitating subsequent objective-function construction.
It can be understood that a pre-trained encoder can extract the features of molecular attributes from the first data set and represent them as binary sequences, facilitating efficient subsequent solving with the quantum annealing algorithm.
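A minimal sketch of the sampling step just described: block Gibbs sampling in a restricted Boltzmann machine produces binary vectors usable as draws from the Bernoulli prior that constrains the encoder. The weights below are random placeholders and biases are omitted for brevity; in the described system the RBM would be pre-trained.

```python
import math
import random

random.seed(0)
N_VISIBLE, N_HIDDEN = 16, 8
# placeholder weights; a pre-trained RBM would supply these
W = [[random.gauss(0.0, 0.1) for _ in range(N_HIDDEN)] for _ in range(N_VISIBLE)]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gibbs_sample(n_steps: int = 200) -> list:
    """Alternate hidden/visible updates; return one binary visible sample."""
    v = [random.random() < 0.5 for _ in range(N_VISIBLE)]
    for _ in range(n_steps):
        # sample hidden units from P(h_j = 1 | v)
        h = [random.random() < sigmoid(sum(W[i][j] for i in range(N_VISIBLE) if v[i]))
             for j in range(N_HIDDEN)]
        # sample visible units from P(v_i = 1 | h)
        v = [random.random() < sigmoid(sum(W[i][j] for j in range(N_HIDDEN) if h[j]))
             for i in range(N_VISIBLE)]
    return [int(x) for x in v]

prior_draw = gibbs_sample()
print(prior_draw)   # a 0/1 vector: one draw from the RBM's Bernoulli-like prior
```

Such draws are what the encoder's latent output is pushed to match during VAE training.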
303. Construct an objective function according to the second data set and the attribute set.
If the first data set already includes binary representations of molecular-structure sequences, the target model can be constructed based on the sequences in the first data set and the attribute set; if the molecular-structure sequences in the first data set are non-binary, the target model can be constructed based on the sequences in the second data set and the attribute set. The objective function can be used to predict molecular attributes, and the attribute information in the attribute set can be used to fit the parameters of the objective function.
For example, when constructing the objective function, a matrix can be built from the sequences in the second data set that represent molecular structures, and the objective function constructed based on this matrix; some parameters of the objective function, such as coupling coefficients, can be fitted using the attribute information in the attribute set, yielding a solvable objective function.
For ease of understanding, the embodiments of this application take constructing the objective function from the second data set as an illustrative example.
Specifically, the objective function can be constructed based on the structure of the Ising model, so that the constructed objective function conforms to the Ising-model structure and can subsequently be solved with a quantum annealing algorithm. For example, the Ising model can be expressed as:
H(s) = -Σ_⟨i,j⟩ J_ij·s_i·s_j - Σ_i h_i·s_i,  with s_i ∈ {-1, +1}
When constructing the objective function, it is constructed according to this Ising-model structure, so that the structure of the objective function matches that of the Ising model.
Specifically, matrix factorization can be applied to the second data set to construct the objective function. For example, each molecular structure in the second data set may include one or more attributes; the groups of sequences in the second data set can form a matrix, which matrix factorization decomposes, usually into multiple matrices whose product matches the initial matrix. Matrix factorization thereby reduces the dimensionality of the data included in the second data set, which amounts to separating out the molecule's various attributes; the objective function is then constructed based on each attribute, and usually the extremum of this objective function corresponds to the molecular structure with the best attributes.
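The "attribute information fits the parameters" step above can be pictured as ordinary regression. The sketch below fits pairwise coefficients Q_ij of a toy quadratic model y ≈ Σ_{i<j} Q_ij·q_i·q_j to made-up (binary vector, attribute value) pairs by stochastic gradient descent; the data, pair set, and learning rate are assumptions for illustration only.

```python
# made-up training pairs: (binary code of a molecule, known attribute value)
data = [([1, 1, 0], 0.9), ([1, 0, 1], -0.2), ([0, 1, 1], 0.4), ([1, 1, 1], 1.1)]
pairs = [(0, 1), (0, 2), (1, 2)]
Q = {p: 0.0 for p in pairs}          # coupling coefficients to be fitted

def predict(q):
    """Toy quadratic attribute model: sum of Q_ij over active bit pairs."""
    return sum(Q[(i, j)] * q[i] * q[j] for (i, j) in pairs)

for _ in range(2000):                # plain SGD on squared prediction error
    for q, y in data:
        err = predict(q) - y
        for (i, j) in pairs:
            Q[(i, j)] -= 0.1 * err * q[i] * q[j]

print({p: round(c, 2) for p, c in Q.items()})
```

Once fitted, these Q_ij are exactly the couplings handed to the annealing step.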
304. Solve the objective function with a quantum annealing algorithm to obtain a target sequence.
After the objective function is constructed, it can be solved with a quantum annealing algorithm, yielding a target sequence that meets the requirements, i.e., a sequence representing a molecular structure that meets the requirements.
Specifically, a quantum-inspired algorithm can be used for solving, or a quantum annealing machine can solve directly; the matching approach can be chosen according to the actual application scenario, which this application does not limit. When a quantum-inspired algorithm is used, a computing device can simulate the quantum environment and solve with an annealing algorithm. When a quantum annealing machine is used, the objective function is taken as the input of the machine, which, after internal computation, outputs the solution of the objective function, giving the target sequence.
Therefore, in the method provided by this application, the objective function can be constructed from binary sequences and then solved with a quantum annealing algorithm, achieving efficient solving; compared with other solving approaches, such as reinforcement learning or Bayesian optimization, the solution can be obtained with a shorter runtime.
305. Decode the target sequence to obtain a molecular sequence.
Usually, the solved target sequence is a binary sequence. So that the user can more easily identify the molecular structure represented by this binary sequence, after the target sequence is obtained it can be decoded to obtain a sequence representing the molecular structure with better attributes. The attributes of the molecular structure corresponding to the solved molecular sequence are better than those of the molecular structures corresponding to the groups of data in the first data set.
Specifically, decoding can be done by the decoder of the autoencoder; this decoding process can be understood as the inverse of the aforementioned binary encoding, restoring the binary sequence into a sequence representing a molecular structure, thereby obtaining a sequence used to represent the molecular structure.
Therefore, in the embodiments of this application, after the data set representing molecular structures is obtained, a VAE can extract the features of the data set and represent them as binary sequences; an objective function for predicting molecular attributes can then be constructed based on the structure of the Ising model, and molecular structures with better attributes obtained by solving the objective function. During solving, a quantum annealing algorithm enables efficient and accurate solving for molecular structures with better attributes. Moreover, with the method provided by this application, even a molecule with attributes in multiple dimensions can be encoded into a binary sequence during binary encoding, enabling efficient solving; the method adapts to scenarios with multiple molecular attributes and can solve for molecular structures that are better in all of them.
The flow of the molecular optimization method provided by this application has been outlined above; for ease of understanding, it is described in more detail below in connection with specific application scenarios.
Illustratively, the method provided by this application can be applied to a molecular optimization scenario, as shown in FIG. 4.
For example, this application can be deployed on a cloud platform or on a user's device; for instance, a pre-trained quantum-annealing molecular optimization system can be deployed on the cloud platform to perform molecular optimization. When a user needs to solve for an optimal molecular structure, the user can input a batch of molecular structures with known attributes to the cloud platform and then, by running the quantum-annealing molecular optimization system deployed there, output the molecular structure with the best attributes.
The method can be applied in drug-molecule optimization, material-molecule optimization, or molecular-structure research scenarios; a batch of molecules with known attributes can therefore be obtained from pharmaceutical manufacturers, materials or chemical plants, research institutions, and so on, as input to the quantum-annealing molecular optimization system, which outputs molecular structures with better attributes.
The quantum-annealing molecular optimization system can encode the input molecular structures with the encoder of a VAE to output binary encoded data, construct an objective function based on the binary encoded data, solve it with a quantum annealing algorithm to obtain the binary sequence of a molecule with better attributes, and decode that sequence with the decoder of the VAE, outputting a sequence representing the molecular structure with better attributes.
The specific molecular optimization process is illustrated below.
Referring to FIG. 5, a schematic flowchart of another molecular optimization method provided by this application.
The method provided by this application can be divided into multiple parts, as shown in FIG. 5: multidimensional molecular representation, binary encoding, objective-function construction, quantum-annealing optimization, and molecular-encoding restoration.
First, a molecular structure can be represented by sequences of multiple dimensions. Then, a VAE, based on a restricted Boltzmann machine combined with Gibbs sampling from a Bernoulli distribution, binary-encodes the molecular-structure sequences; matrix factorization is then used to construct the objective function based on the structure of the Ising model, which is solved with a quantum annealing algorithm to obtain the binary sequence of the molecular structure with the best attributes; finally, this binary sequence is decoded and restored to obtain the sequence representing the molecular structure. With reference to FIG. 6, each step is introduced below.
1. Multidimensional molecular representation
A molecule can be represented in multiple ways, such as a one-dimensional SMILES or SELFIES string, a two-dimensional molecular graph, a three-dimensional structure, or combinations of the above. For example, as shown in FIG. 7, a molecular structure can be represented by a one-dimensional SMILES or SELFIES string, a two-dimensional molecular graph, a three-dimensional structure, and so on. Usually, molecules with different attributes may have different structures, and molecules with different structures may have different attributes; optimization can therefore be achieved by changing the molecular structure.
2. Binary encoding
To facilitate subsequent construction of the objective function based on the structure of the Ising model, the molecular representation can be converted to a binary representation. Of course, in some scenarios binary sequences can also be used directly to represent molecular structures; the embodiments of this application take the case where binary encoding is needed as an illustrative example, which is not a limitation.
During binary encoding, the encoder can be trained in advance; after pre-training of the binary autoencoder, a molecular representation can be encoded into a vector of 0s and 1s as the molecule's code.
It can be understood that, when optimizing a certain attribute, a batch of representation sequences of molecules with that attribute known can be received and encoded into binary vectors, and matrix factorization used to construct the prediction function f(q) for that attribute, i.e., the objective function.
In the embodiments of this application, for molecules to be optimizable with the quantum annealing algorithm, the molecules must first be encoded into 0/1 vectors.
Some common binary encoding schemes, such as ECFP fingerprints, can encode a molecule into a 0/1 vector through a hash algorithm, but the molecular structure cannot be restored from the 0/1 vector. This application therefore provides a binary encoding scheme that, while allowing optimization with the quantum annealing algorithm, can restore the obtained molecular structure, thereby screening out better molecular structures.
Illustratively, to achieve reversible encoding, a VAE can be used for encoding; of course, it may also be replaced by another type of autoencoder, which this application does not limit. In a common VAE, the latent variables output by the encoder usually approach a normal distribution, which cannot achieve 0/1 binary encoding. In the embodiments of this application, a constraint is added when training the VAE; the constraint may be the Bernoulli distribution of the collected data, so that under the Bernoulli-distribution constraint the encoder outputs latent variables obeying the Bernoulli distribution. Specifically, to further achieve reversible encoding, a prior distribution p can be collected from the Bernoulli distribution by Gibbs sampling based on the restricted-Boltzmann-machine principle; when training the VAE, the collected prior distribution serves as a constraint so that the latent variables output by the VAE encoder obey the prior p as closely as possible. During model training, the convergence condition is that the reconstruction rate is as high as possible and the KL divergence as small as possible.
For example, the effects of binary encoding achieved by various approaches can be seen in Table 1:
[Table 1 (image in the original): reconstruction rate and the validity, uniqueness, and novelty of decoded molecules for VAE, Bernoulli VAE, and Quantum VAE at different latent-variable dimensions z]
Table 1
Here, VAE: a variational autoencoder with the normal distribution as its prior; Bernoulli VAE: a VAE with the binomial distribution as its prior; Quantum VAE: a VAE whose prior is the distribution sampled from the Bernoulli distribution by Gibbs sampling based on a restricted Boltzmann machine; z: the latent-variable dimension.
The solution provided by this application trains the molecular autoencoder on 2.5 million drug-like molecules from the ZINC compound library, so that molecules and codes correspond as well as possible and even randomly sampled codes can be effectively decoded into molecules. As Table 1 shows, the binary molecular encoding provided by this application (i.e., Quantum VAE) achieves a significantly improved reconstruction rate when the latent dimension z is increased to 2048, comparable to a VAE with normal-distribution sampling. For molecules decoded from samples of the prior distribution, the validity, uniqueness, and novelty metrics are even slightly better than those of a conventional VAE.
3. Objective-function construction
After the molecules are binary-encoded, to obtain an optimized molecular structure when optimizing a known attribute, matrix factorization can be used to construct the prediction function f(q) for that attribute.
Specifically, the prediction function of the molecular attribute can be built by matrix factorization; f(q) can be expressed as:
f(q) = Σ_{i<j} ( Σ_k v_ik·v_jk ) q_i·q_j    (formula 4.1)
Here, q_i and q_j are the values of the i-th and j-th dimensions of the molecule's binary code vector, v_ik and v_jk are the coefficients of the k-th factor, and f(q) is the attribute value of the molecule predicted by the model. Since q_i and q_j can only take the values 0 or 1, the functional form of f(q) is close to that of the Ising-model Hamiltonian; f(q) can be understood as a quadratic optimization problem in quadratic unconstrained binary optimization (QUBO) form, which can be converted into an Ising-form optimization problem through a change of variables such as s_i = 2q_i - 1.
For example, it can be expressed as:
f(q) = Σ_{i<j} Q_ij·q_i·q_j
Here, q_i and q_j represent the spin states of the i-th and j-th elements, and Q_ij is the coupling coefficient between the i-th and j-th elements, which can be obtained, for example, by fitting the attribute information. In the objective function (formula 4.1), summing v_ik·v_jk over the dimension k gives Q_ij; this application can therefore find the extremum of the objective function f(q) in the way quantum annealing solves for the ground-state Hamiltonian of the Ising model. The point where the objective function attains its extremum is a binary code, and the molecule corresponding to that binary code after decoding is the optimized molecule.
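A compact sketch of the objective just described, with a made-up factor matrix V: the coupling Q_ij is the inner product of factor rows, f(q) evaluates the predicted attribute on a 0/1 vector, and s_i = 2·q_i − 1 performs the QUBO-to-Ising change of variables.

```python
# made-up factor matrix: one row of K = 2 factor coefficients per bit
V = [[0.5, -0.2],
     [0.1,  0.4],
     [-0.3, 0.3]]
n = len(V)
Q = [[sum(V[i][k] * V[j][k] for k in range(len(V[0]))) for j in range(n)]
     for i in range(n)]      # Q_ij = sum_k v_ik * v_jk

def f(q):
    """Predicted attribute value of a 0/1 vector q (sum over pairs i < j)."""
    return sum(Q[i][j] * q[i] * q[j] for i in range(n) for j in range(i + 1, n))

def to_ising(q):
    """Change of variables s_i = 2*q_i - 1 taking QUBO bits to +/-1 spins."""
    return [2 * qi - 1 for qi in q]

print(f([1, 0, 1]), to_ising([1, 0, 1]))
```

The spin form is what a quantum annealer (or an annealing-inspired solver) consumes.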
4. Quantum-annealing optimization
The objective function f(q) constructed in the embodiments of this application has the same or a similar functional form to the Ising-model Hamiltonian, so it can be solved with a quantum annealing algorithm: finding the extremum of the target Hamiltonian H_problem also yields the optimum of the original objective function f(q).
Moreover, in this application the quantum annealing algorithm may be a quantum-inspired annealing algorithm, or the computation may be performed with a quantum annealing machine; the matching quantum-annealing approach can be chosen according to the actual application scenario, which this application does not limit.
Illustratively, taking conversion to the Hamiltonian form of the Ising model as an example:
H_problem = Σ_{i<j} J_ij·s_i·s_j + Σ_i h_i·s_i,  with s_i ∈ {-1, +1}
The solving process is, for example, as follows.
Construct the quantum Ising Hamiltonian of the above Ising target problem H_problem:
Ĥ_problem = Σ_{i<j} J_ij·σ̂z_i·σ̂z_j + Σ_i h_i·σ̂z_i
Construct the adiabatic quantum-evolution Hamiltonian used for quantum annealing:
Ĥ(t) = A(t)·Ĥ_0 + B(t)·Ĥ_problem,  with the initial Hamiltonian Ĥ_0 = -Σ_i σ̂x_i
This adiabatic quantum process is a quantum evolution with parameter t: starting from a simple initial quantum Hamiltonian Ĥ_0, the parameters are gradually adjusted to evolve to the complex target quantum Ising Hamiltonian Ĥ_problem, and its ground state is obtained by measurement; the spin values corresponding to that ground state are the optimal solution of the target problem. Here σ̂z_i and σ̂x_i are the quantum Pauli operators of the spin angular momentum in the z and x directions, respectively.
This adiabatic quantum Hamiltonian is then transformed into its corresponding adiabatic classical Hamiltonian H_c(x, p, t), where x_i and p_i are the generalized coordinates and momenta of the corresponding classical system.
A noisy classical dynamical evolution of the adiabatic classical Hamiltonian H_c is then performed, the evolution proceeding according to the classical Hamiltonian canonical equations; finally, taking the sign of each coordinate x_i, i.e., s_i = sign(x_i), gives the solution of the final target problem.
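Running the adiabatic evolution above requires quantum hardware or a full dynamics simulation. As a purely classical stand-in, the sketch below minimizes an Ising-form objective of the same shape with plain simulated annealing over ±1 spins and maps the result back to 0/1 bits. The coefficients J and h are made up; in the described method they would come from the fitted objective.

```python
import math
import random

random.seed(1)
# made-up Ising coefficients; the method would obtain them from the fitted QUBO
J = {(0, 1): 1.0, (1, 2): -0.5, (0, 2): 0.8}
h = [0.2, -0.1, 0.0]
N = len(h)

def energy(s):
    return (sum(c * s[i] * s[j] for (i, j), c in J.items())
            + sum(h[i] * s[i] for i in range(N)))

def anneal(steps=5000, t_start=2.0, t_end=0.01):
    """Single-spin-flip Metropolis sampling under a geometric cooling schedule."""
    s = [random.choice([-1, 1]) for _ in range(N)]
    e = energy(s)
    best_s, best_e = list(s), e
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / steps)   # cooling
        i = random.randrange(N)
        s[i] *= -1                                          # propose a flip
        e_new = energy(s)
        if e_new <= e or random.random() < math.exp((e - e_new) / t):
            e = e_new                                       # accept
            if e < best_e:
                best_s, best_e = list(s), e
        else:
            s[i] *= -1                                      # reject, undo
    return best_s, best_e

spins, e = anneal()
bits = [(si + 1) // 2 for si in spins]   # inverse of s = 2q - 1
print(bits, e)
```

The returned bit vector plays the role of the "target sequence" that the decoder then maps back to a molecular structure.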
5. Molecular-encoding restoration
The extremum obtained by solving f(q) is also a vector of 0s and 1s; the molecule encoded by this vector has the best properties, and the optimized molecular structure can be restored through the decoder.
Specifically, the VAE can include an encoder and a decoder: the encoder can be used for encoding in the aforementioned binary-encoding process, and the decoder can be used to decode the solved sequence, outputting the molecular structure with the best attributes.
Illustratively, taking some specific molecular optimization approaches as examples, as shown in Table 2:
[Table 2 (image in the original): molecules found and average running time of the quantum-annealing-based molecular optimization compared with reinforcement-learning-, Bayesian-optimization-, and conditional-generation-based methods on the QED, EGFR, BACE1, and CB1 tasks]
Table 2
Clearly, as shown in Table 2, on the four molecular tasks QED (drug-likeness), EGFR (protein binding activity), BACE1 (protein binding activity), and CB1 (protein binding activity), the quantum-annealing-based molecular optimization algorithm provided by this application finds molecules with better properties than common approaches such as reinforcement learning, Bayesian optimization, or conditional generation; its average running time is also shorter than that of existing iterative molecular optimization algorithms, giving higher efficiency and greater application prospects.
Therefore, the quantum-annealing-based molecular optimization approach provided by this application can binary-encode sequences representing molecular structures with known attributes, converting them into binary sequences; from these binary sequences an objective function close to the Ising-model structure can be constructed and solved with a quantum annealing algorithm, yielding the extremum of the objective function, i.e., the molecular structure optimal in the known attribute. This achieves efficient computation with better molecular attributes.
The method flow provided by this application has been described in detail above; the apparatus that performs the method provided by this application is introduced below.
Referring to FIG. 8, a schematic structural diagram of a molecular optimization apparatus provided by this application, as described below.
The molecular optimization apparatus includes:
an obtaining module 801, configured to obtain a first data set and an attribute set, where the first data set includes multiple groups of data that can represent multiple molecular structures, each group of data representing at least one molecular structure, and the attribute set includes multiple groups of attribute information in one-to-one correspondence with the groups of data, each group of attribute information including the value of at least one attribute of the corresponding molecular structure;
a construction module 802, configured to construct an objective function according to the first data set and the attribute set;
a solving module 803, configured to solve the objective function with a quantum annealing algorithm to obtain a molecular sequence representing the solved molecular structure, whose attributes are better than those of the molecular structures represented in the first data set.
In one possible implementation, the apparatus further includes: an encoding module 804;
the encoding module 804 is configured to perform binary encoding on each group of data in the first data set to obtain a second data set, where the second data set includes multiple groups of sequences corresponding to the groups of data;
the construction module 802 is specifically configured to construct the objective function according to the second data set and the attribute set, based on the structure of the Ising model.
In one possible implementation, the construction module 802 is specifically configured to construct the objective function by matrix factorization of the matrix corresponding to the sequences in the second data set, based on the structure of the Ising model and the attribute set.
In one possible implementation, the encoding module 804 is specifically configured to use a prior distribution as a constraint and encode the groups of sequences in the first data set with the encoder of a variational autoencoder (VAE) to obtain latent-variable encoded data, where the prior distribution is sampled from the Bernoulli distribution corresponding to the sequences in the first data set.
In one possible implementation, the apparatus further includes: a sampling module 805, configured to sample the prior distribution from the Bernoulli distribution by Gibbs sampling, based on a restricted Boltzmann machine.
In one possible implementation, the apparatus further includes: a decoding module 806;
the solving module 803 is specifically configured to solve the objective function with a quantum annealing algorithm to obtain a target sequence;
the decoding module 806 is configured to decode the target sequence with the decoder of the VAE to obtain the molecular sequence.
In one possible implementation, the solving module 803 is specifically configured to solve the objective function with a quantum annealing machine to obtain the target sequence.
In one possible implementation, the data in the first data set includes one or more of the following: one-dimensional character strings, two-dimensional molecular graphs, or three-dimensional structure data.
Referring to FIG. 9, a schematic structural diagram of another molecular optimization apparatus provided by this application, as described below.
The molecular optimization apparatus may include a processor 901 and a memory 902, interconnected by a line, where the memory 902 stores program instructions and data.
The memory 902 stores the program instructions and data corresponding to the steps in FIGS. 3 to 7.
The processor 901 is configured to execute the method steps performed by the molecular optimization apparatus shown in any of the embodiments of FIGS. 3 to 7.
Optionally, the molecular optimization apparatus may further include a transceiver 903 for receiving or sending data.
An embodiment of this application further provides a computer-readable storage medium storing a program which, when run on a computer, causes the computer to execute the steps of the methods described in the embodiments shown in FIGS. 3 to 7.
Optionally, the aforementioned molecular optimization apparatus shown in FIG. 9 is a chip.
An embodiment of this application further provides a molecular optimization apparatus, which may also be called a digital processing chip or chip. The chip includes a processing unit and a communication interface; the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit, which is configured to perform the method steps performed by the molecular optimization apparatus shown in any of the embodiments of FIGS. 3 to 7.
An embodiment of this application further provides a digital processing chip, which integrates circuits and one or more interfaces for implementing the functions of the above processor 901. When memory is integrated in the digital processing chip, it can complete the method steps of any one or more of the foregoing embodiments; when no memory is integrated in the digital processing chip, it can be connected to an external memory through a communication interface and, according to the program code stored in the external memory, implement the actions performed by the molecular optimization apparatus in the above embodiments.
The molecular optimization apparatus provided by the embodiments of this application may be a chip including a processing unit and a communication unit, where the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute computer-executable instructions stored in a storage unit, so that the chip in the server executes the molecular optimization method described in the embodiments shown in FIGS. 3 to 7. Optionally, the storage unit is a storage unit within the chip, such as a register or cache; the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
An embodiment of this application further provides a computer program product which, when run on a computer, causes the computer to perform the steps performed by the molecular optimization apparatus in the methods described in the embodiments shown in FIGS. 3 to 7.
Specifically, the aforementioned processing unit or processor may be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
It should also be noted that the apparatus embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiments. In addition, in the drawings of the apparatus embodiments provided by this application, the connection relationships between modules indicate communication connections between them, which may specifically be implemented as one or more communication buses or signal lines.
From the description of the above implementations, those skilled in the art can clearly understand that this application can be implemented by software plus the necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and so on. In general, any function performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures implementing the same function can be diverse, such as analog circuits, digital circuits, or special-purpose circuits. For this application, however, a software implementation is in most cases the better choice. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied as a software product stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disc, including a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the various embodiments of this application.
In the above embodiments, implementation may be wholly or partly by software, hardware, firmware, or any combination thereof. When software is used, implementation may be wholly or partly in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are wholly or partly produced. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center integrating one or more available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), semiconductor media (e.g., solid-state drive (SSD)), or the like.
The terms "first", "second", "third", "fourth", etc. (if any) in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. Moreover, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device including a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units not clearly listed or inherent to the process, method, product, or device.
Finally, it should be noted that the above are only specific implementations of this application, but the protection scope of this application is not limited thereto; any changes or substitutions that a person skilled in the art can easily think of within the technical scope disclosed by this application shall be covered by the protection scope of this application.

Claims (19)

  1. A molecular optimization method, characterized by comprising:
    obtaining a first data set and an attribute set, wherein the first data set comprises multiple groups of data, each group of data representing at least one molecular structure, the attribute set comprises multiple groups of attribute information in one-to-one correspondence with the multiple groups of data, and each group of attribute information comprises the value of at least one attribute of the corresponding molecular structure;
    constructing an objective function according to the first data set and the attribute set, wherein the attribute information in the attribute set is used to fit parameters in the objective function;
    solving the objective function with a quantum annealing algorithm to obtain a molecular sequence, wherein the molecular sequence represents the solved molecular structure.
  2. The method according to claim 1, characterized in that constructing the objective function according to the first data set and the attribute set comprises:
    performing binary encoding on each group of data in the first data set to obtain a second data set, wherein the second data set comprises multiple groups of sequences corresponding to the multiple groups of data;
    constructing the objective function according to the second data set and the attribute set, based on the structure of the Ising model.
  3. The method according to claim 2, characterized in that constructing the objective function according to the second data set and the attribute set based on the structure of the Ising model comprises:
    constructing the objective function by matrix factorization corresponding to the sequences in the second data set, based on the structure of the Ising model and the attribute set.
  4. The method according to claim 2 or 3, characterized in that performing binary encoding on the multiple groups of sequences in the first data set to obtain the second data set comprises:
    using a prior distribution as a constraint, encoding the multiple groups of sequences in the first data set with an encoder in a variational autoencoder (VAE) to obtain latent-variable encoded data, wherein the prior distribution is sampled from the Bernoulli distribution corresponding to the sequences in the first data set.
  5. The method according to claim 4, characterized in that the method further comprises:
    sampling the prior distribution from the Bernoulli distribution by Gibbs sampling, based on a restricted Boltzmann machine.
  6. The method according to any one of claims 2-5, characterized in that solving the objective function with a quantum annealing algorithm to obtain the molecular sequence comprises:
    solving the objective function with a quantum annealing algorithm to obtain a target sequence;
    decoding the target sequence with a decoder in the VAE to obtain the molecular sequence.
  7. The method according to claim 6, characterized in that solving the objective function with a quantum annealing algorithm to obtain the target sequence comprises:
    solving the objective function with a quantum annealing machine to obtain the target sequence.
  8. The method according to any one of claims 1-7, characterized in that the data in the first data set comprises one or more of the following: one-dimensional character strings, two-dimensional molecular graphs, or three-dimensional structure data.
  9. A molecular optimization apparatus, characterized by comprising:
    an obtaining module, configured to obtain a first data set and an attribute set, wherein the first data set comprises multiple groups of data, each group of data representing at least one molecular structure, the attribute set comprises multiple groups of attribute information in one-to-one correspondence with the multiple groups of data, and each group of attribute information comprises the value of at least one attribute of the corresponding molecular structure;
    a construction module, configured to construct an objective function according to the first data set and the attribute set, wherein the attribute information in the attribute set is used to fit parameters in the objective function;
    a solving module, configured to solve the objective function with a quantum annealing algorithm to obtain a molecular sequence, wherein the molecular sequence represents the solved molecular structure.
  10. The apparatus according to claim 9, characterized in that the apparatus further comprises:
    an encoding module, configured to perform binary encoding on each group of data in the first data set to obtain a second data set, wherein the second data set comprises multiple groups of sequences corresponding to the multiple groups of data;
    the construction module is specifically configured to construct the objective function according to the second data set and the attribute set, based on the structure of the Ising model.
  11. The apparatus according to claim 10, characterized in that
    the construction module is specifically configured to construct the objective function by matrix factorization corresponding to the sequences in the second data set, based on the structure of the Ising model and the attribute set.
  12. The apparatus according to claim 10 or 11, characterized in that
    the encoding module is specifically configured to use a prior distribution as a constraint and encode the multiple groups of sequences in the first data set with an encoder in a variational autoencoder (VAE) to obtain latent-variable encoded data, wherein the prior distribution is sampled from the Bernoulli distribution corresponding to the sequences in the first data set.
  13. The apparatus according to claim 12, characterized in that the apparatus further comprises:
    a sampling module, configured to sample the prior distribution from the Bernoulli distribution by Gibbs sampling, based on a restricted Boltzmann machine.
  14. The apparatus according to any one of claims 10-13, characterized in that the apparatus further comprises: a decoding module;
    the solving module is specifically configured to solve the objective function with a quantum annealing algorithm to obtain a target sequence;
    the decoding module is configured to decode the target sequence with a decoder in the VAE to obtain the molecular sequence.
  15. The apparatus according to claim 14, characterized in that
    the solving module is specifically configured to solve the objective function with a quantum annealing machine to obtain the target sequence.
  16. The apparatus according to any one of claims 9-15, characterized in that the data in the first data set comprises one or more of the following: one-dimensional character strings, two-dimensional molecular graphs, or three-dimensional structure data.
  17. A molecular optimization apparatus, characterized by comprising a processor, the processor being coupled to a memory storing a program, wherein the steps of the method according to any one of claims 1-8 are implemented when the program instructions stored in the memory are executed by the processor.
  18. A computer-readable storage medium, characterized by comprising computer program instructions which, when executed by a processor, cause the processor to execute the method according to any one of claims 1-8.
  19. A computer program product, characterized in that the computer program product comprises software code for executing the steps of the method according to any one of claims 1 to 8.
PCT/CN2022/130492 2022-05-23 2022-11-08 一种分子优化方法以及装置 WO2023226310A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210564370 2022-05-23
CN202210564370.2 2022-05-23
CN202211019436.6 2022-08-24
CN202211019436.6A CN117174185A (zh) 2022-05-23 2022-08-24 一种分子优化方法以及装置

Publications (1)

Publication Number Publication Date
WO2023226310A1 true WO2023226310A1 (zh) 2023-11-30

Family

ID=88918306

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/130492 WO2023226310A1 (zh) 2022-05-23 2022-11-08 一种分子优化方法以及装置

Country Status (1)

Country Link
WO (1) WO2023226310A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200394547A1 (en) * 2018-08-17 2020-12-17 Zapata Computing, Inc. Hybrid Quantum-Classical Computer System and Method for Performing Function Inversion
WO2021226461A1 (en) * 2020-05-07 2021-11-11 Translate Bio, Inc. Generation of optimized nucleotide sequences
CN114334018A (zh) * 2021-12-29 2022-04-12 Shenzhen Jingtai Technology Co., Ltd. Method, apparatus, and storage medium for obtaining molecular feature descriptions
CN114420217A (zh) * 2021-12-22 2022-04-29 Suzhou Mingshi Quantum Information Technology Co., Ltd. A novel method and system for quantum-chemical molecular property prediction
CN114444016A (zh) * 2022-02-02 2022-05-06 Shanghai Turing Intelligent Computing Quantum Technology Co., Ltd. Method for implementing an Ising model
CN114446391A (zh) * 2022-02-07 2022-05-06 Shanghai Turing Intelligent Computing Quantum Technology Co., Ltd. A quantum annealing-based protein folding method
CN114464250A (zh) * 2022-02-25 2022-05-10 Shanghai Turing Intelligent Computing Quantum Technology Co., Ltd. Gene stability screening method and system based on Ising-machine quantum annealing
CN114512178A (zh) * 2022-02-02 2022-05-17 Shanghai Turing Intelligent Computing Quantum Technology Co., Ltd. Codon optimization method based on Ising-machine quantum annealing

Similar Documents

Publication Publication Date Title
WO2022083624A1 A model acquisition method and device
WO2022042002A1 Training method for a semi-supervised learning model, image processing method, and device
WO2021159714A1 A data processing method and related device
JP2023082017A Computer system
EP3924893A1 Incremental training of machine learning tools
JP2018521382A Quanton representation for emulating quantum-like computation on classical processors
JP2021524099A Systems and methods for integrating statistical models of different data modalities
Wilson et al. Quantum kitchen sinks: An algorithm for machine learning on near-term quantum computers
WO2023029352A1 Graph neural network-based method, apparatus, and device for predicting properties of drug small molecules
US20230075100A1 Adversarial autoencoder architecture for methods of graph to sequence models
WO2023236977A1 A data processing method and related device
WO2024041479A1 A data processing method and apparatus
WO2023284716A1 A neural network search method and related device
WO2023231954A1 A data denoising method and related device
WO2024001806A1 A federated learning-based data value assessment method and related device
CN113571125A Drug-target interaction prediction method based on multi-layer networks and graph encoding
Chen et al. Binarized neural architecture search for efficient object recognition
CN112749791A A link prediction method based on graph neural networks and capsule networks
CN112199884A Item molecule generation method, apparatus, device, and storage medium
CN115526246A A self-supervised molecular classification method based on a deep learning model
WO2022100607A1 A neural network structure determination method and apparatus
Liu et al. Efficient neural networks for edge devices
Bhardwaj et al. Computational biology in the lens of CNN
WO2023174064A1 Automatic search method, and training method and apparatus for a performance prediction model for automatic search
WO2023185541A1 A model training method and related device

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22943497

Country of ref document: EP

Kind code of ref document: A1