WO2023077522A1 - Compound design method and apparatus, device, and computer readable storage medium - Google Patents

Compound design method and apparatus, device, and computer readable storage medium

Info

Publication number
WO2023077522A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
seed
compound
matrix
sample
Prior art date
Application number
PCT/CN2021/129381
Other languages
French (fr)
Chinese (zh)
Inventor
杨立君
Original Assignee
深圳晶泰科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳晶泰科技有限公司 filed Critical 深圳晶泰科技有限公司
Priority to PCT/CN2021/129381 priority Critical patent/WO2023077522A1/en
Publication of WO2023077522A1 publication Critical patent/WO2023077522A1/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C10/00 Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like

Definitions

  • the present application relates to the technical field of computational chemistry, in particular to a compound design method, device, equipment and computer-readable storage medium.
  • the technical problem mainly solved by this application is to provide a compound design method, device, equipment and computer-readable storage medium, which can increase the diversity of designed compounds.
  • A technical solution adopted by the present application is to provide a compound design method. The method includes: obtaining a seed vector, the seed vector being a feature-vector representation of a seed compound; performing a crossover operation and/or a mutation operation on the seed vector based on a genetic algorithm to obtain a derivative vector; and processing the derivative vector to obtain a derivative compound.
  • the method further includes measuring the fitness of the derivative compound based on the fitness function; selecting candidate compounds from the derivative compound according to the degree of fitness.
  • Selecting candidate compounds from the derivative compounds includes:
  • Step S1: according to the fitness, select from the derivative compounds the target compounds whose fitness meets a preset condition;
  • Step S2: use the derivative vectors corresponding to the target compounds as the seed vectors, and continue to perform the crossover operation and/or mutation operation on the seed vectors based on the genetic algorithm to obtain derivative vectors and to measure the fitness of the resulting derivative compounds based on the fitness function;
  • Steps S1 and S2 are iterated until an iteration termination condition is satisfied; all derivative compounds obtained are then sorted in descending order of fitness, and a predetermined ratio or a predetermined number of derivative compounds with better fitness are selected as candidate compounds.
  • The crossover operation on the seed vector based on the genetic algorithm includes: selecting two seed vectors from the seed vector set, selecting an exchange position in one of the seed vectors, and exchanging the value at the exchange position of that seed vector with the value at the corresponding position of the other seed vector.
  • the mutation operation on the seed vector based on the genetic algorithm includes: selecting a seed vector from the seed vector set, selecting a mutation position from the selected seed vector, and replacing the value at the mutation position with a new value.
  • Processing the derivative vector to obtain the derivative compound includes: inputting the derivative vector into a molecular structure decoding model, which is a neural network model; decoding the derivative vector to obtain a derivative molecular structure; and obtaining the derivative compound from the derivative molecular structure.
  • The method also includes: obtaining a sample matrix, which is a matrix representation of a sample compound; inputting the sample matrix into the encoding layer of an autoencoder and encoding it to obtain a sample vector, which is a feature-vector representation of the sample compound; inputting the sample vector into the decoding layer of the autoencoder and decoding it to obtain a prediction matrix; calculating the loss between the prediction matrix and the sample matrix; iteratively updating the parameters of the autoencoder based on the loss until the loss is stable; and using the decoding layer and output layer of the trained autoencoder as the molecular structure decoding model, where the output layer converts a compound in matrix representation into a molecular structure representation.
  • obtaining the seed vector includes: obtaining the SMILES string of the seed compound; performing one-hot encoding on the SMILES string of the seed compound to obtain a seed matrix, which is a matrix representation of the seed compound; encoding the seed matrix to obtain a seed vector.
  • encoding the seed matrix to obtain the seed vector includes: inputting the seed matrix into the molecular structure encoding model, and encoding the seed matrix to obtain the seed vector.
  • The method also includes: obtaining a sample matrix, which is a matrix representation of a sample compound; inputting the sample matrix into the encoding layer of an autoencoder and encoding it to obtain a sample vector, which is a feature-vector representation of the sample compound; inputting the sample vector into the decoding layer of the autoencoder and decoding it to obtain a prediction matrix; calculating the loss between the prediction matrix and the sample matrix; iteratively updating the parameters of the autoencoder based on the loss until the loss is stable; and using the encoding layer of the trained autoencoder as the molecular structure encoding model.
  • The compound design device includes an acquisition module, an operation module, and a decoding module. The acquisition module is used to obtain a seed vector, which is a feature-vector representation of a seed compound; the operation module is used to perform a crossover operation and/or a mutation operation on the seed vector based on a genetic algorithm to obtain a derivative vector; the decoding module is used to process the derivative vector to obtain a derivative compound.
  • the compound design device also includes a selection module, which is used to respectively measure the fitness of the derivative compounds based on the fitness function; and select candidate compounds from the derivative compounds according to the size of the fitness.
  • The selection module selects candidate compounds from the derivative compounds according to the fitness as follows. Step S1: according to the fitness, select from the derivative compounds the target compounds whose fitness meets the preset condition. Step S2: use the derivative vectors corresponding to the target compounds as the seed vectors, and continue to perform the crossover operation and/or mutation operation on the seed vectors based on the genetic algorithm to obtain derivative vectors and to measure the fitness of the derivative compounds based on the fitness function. Steps S1 and S2 are iterated until the iteration termination condition is satisfied. All derivative compounds obtained are then sorted in descending order of fitness, and a predetermined proportion or a predetermined number of derivative compounds with better fitness are selected as candidate compounds.
  • The operation module includes a crossover operation submodule, which is used to select two seed vectors from the seed vector set, select an exchange position in one of the seed vectors, and exchange the value at the exchange position of that seed vector with the value at the corresponding position of the other seed vector.
  • The operation module includes a mutation operation submodule, which is used to select a seed vector from the seed vector set, select a mutation position in the selected seed vector, and replace the value at the mutation position with a new value.
  • the decoding module is specifically used to input the derivative vector into the molecular structure decoding model, decode the derivative vector to obtain the derivative molecular structure, and obtain the derivative compound according to the derived molecular structure;
  • the molecular structure decoding model is a neural network model.
  • The compound design device also includes a model training module, which is used to: obtain a sample matrix, which is a matrix representation of a sample compound; input the sample matrix into the encoding layer of an autoencoder and encode it to obtain a sample vector, which is a feature-vector representation of the sample compound; input the sample vector into the decoding layer of the autoencoder and decode it to obtain a prediction matrix; calculate the loss between the prediction matrix and the sample matrix; iteratively update the parameters of the autoencoder based on the loss until the loss is stable; and use the decoding layer and output layer of the trained autoencoder as the molecular structure decoding model, where the output layer converts a compound in matrix representation into a molecular structure representation.
  • The compound design device also includes an encoding module, which is used to obtain the SMILES string of the seed compound; perform one-hot encoding on the SMILES string to obtain a seed matrix, which is a matrix representation of the seed compound; and encode the seed matrix to obtain the seed vector.
  • the encoding module encodes the seed matrix to obtain the seed vector, including: inputting the seed matrix into the molecular structure encoding model, and encoding the seed matrix to obtain the seed vector.
  • The compound design device also includes a model training module, which is used to: obtain a sample matrix, which is a matrix representation of a sample compound; input the sample matrix into the encoding layer of an autoencoder and encode it to obtain a sample vector, which is a feature-vector representation of the sample compound; input the sample vector into the decoding layer of the autoencoder and decode it to obtain a prediction matrix; calculate the loss between the prediction matrix and the sample matrix; iteratively update the parameters of the autoencoder based on the loss until the loss is stable; and use the encoding layer of the trained autoencoder as the molecular structure encoding model.
  • Another technical solution is a compound design device including a processor and a memory, where instructions are stored in the memory and the processor executes the instructions to implement any of the above compound design methods.
  • Another technical solution adopted by the present application is to provide a computer-readable storage medium for storing instructions/program data that can be executed to implement any one of the above compound design methods.
  • The beneficial effects of the present application are as follows: unlike the prior art, the compound design method provided by the present application develops and designs compounds based on a genetic algorithm, which enlarges the explorable compound space, yields diverse compounds, and widens the range of choices. Furthermore, the complex chemical space is reduced to one-dimensional vectors during operation, which enables the design algorithm to search the chemical space conveniently and efficiently.
  • Figure 1 is a schematic flow diagram of a compound design method in the embodiment of the present application.
  • Fig. 2 is a schematic diagram of the training process of a molecular structure model in the embodiment of the present application.
  • Figure 3 is a schematic flow diagram of another compound design method in the embodiment of the present application.
  • Fig. 4 is a schematic structural diagram of a compound design device in an embodiment of the present application.
  • Fig. 5 is a schematic structural diagram of the compound design equipment in the embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a computer-readable storage medium in an embodiment of the present application.
  • The inventors of the present application found that a molecular generation model based on deep learning can use a large-scale compound database to learn the writing rules of compounds by itself, express compounds as dense vectors of continuous values, learn the structural features of compounds, generate compounds with new skeletons, and expand the searchable chemical space.
  • Transfer learning or reinforcement learning can be used to guide model training, so that the chemical space of the generated molecules shrinks to a specific region and sampling in that region yields molecules that meet the conditions. For example, molecules with special functional groups can be generated.
  • the present application provides a compound design method.
  • In this method, new compounds are learned, developed, and designed based on a genetic algorithm. Following the genetic algorithm's principle of simulating natural evolution, a certain number of seed compounds are selected to play the role of chromosomes and form an initial population.
  • the fitness of the entire population is evaluated, and several individuals are selected based on the fitness to simulate natural selection, inheritance, and mutation to produce the next generation of population (ie, derivative compounds). Each generation repeats this cycle to search for an optimal solution.
  • FIG. 1 is a schematic flowchart of a compound design method in an embodiment of the present application. It should be noted that this embodiment is not limited to the flow sequence shown in FIG. 1 if substantially the same result is obtained. As shown in Figure 1, this embodiment includes:
  • the seed vector is the feature vector representation of the seed compound.
  • The first-generation population of the genetic evolution is constructed first; that is, the seed compounds that serve as the basis of the compound design must be obtained first.
  • the seed compound can be any compound randomly selected in the compound database, and it can be one or more. According to different design requirements, specific screening of seed compounds can also be carried out, which is not limited here.
  • Dimensionality reduction is also performed on the seed compound, reducing the complex chemical space to a one-dimensional vector. Specifically, the compound is converted from a molecular structural formula representation to a vector representation.
  • In this way, the design algorithm based on the genetic algorithm is simplified to operations between vectors, which makes searching the chemical space more convenient and efficient.
  • S130 Perform a crossover operation and/or a mutation operation on the seed vector based on the genetic algorithm to obtain a derivative vector.
  • Dimension-raising processing is performed on the operation result. Specifically, the vector representation of the compound is converted back to a molecular structure representation to obtain the specific structural formula of the compound, from which the derivative compound is determined.
  • The development and design of compounds is carried out based on the genetic algorithm, which enlarges the compound space that can be explored, yields diverse compounds, and widens the range of choices. Furthermore, the complex chemical space is reduced to one-dimensional vectors during operation, which enables the design algorithm to search the chemical space conveniently and efficiently.
  • the present application may use a neural network, which takes the chemical structure as input and output, and extracts the vector output by the intermediate layer as a one-dimensional representation of the chemical structure. That is, the neural network model can be used to reduce and increase the dimension of the compound.
  • the autoencoder can be used to train the molecular structure encoding model and the molecular structure decoding model.
  • the molecular structure encoding model can be used to reduce the dimension of the chemical structure, and encode the chemical structure into a vector; the molecular structure decoding model can be used to increase the dimension of the vector, and decode the vector into a chemical structure.
  • An autoencoder is a deep learning neural network that is trained so that the input and output values are the same. It first compresses the input vector into a hidden space, and then reconstructs and decodes the output so that the output is the same as the input.
  • the autoencoder mainly includes an encoding layer, a hidden vector layer and a decoding layer.
  • the encoding layer contains several neurons, which can convert a large and sparse matrix into a dense one-dimensional vector composed of floating point numbers (the vector in the hidden vector layer).
  • the decoding layer also contains several neurons, which can decode a dense one-dimensional vector into a large and sparse matrix.
  • In the training phase, a neural network is first built that can receive large, sparse matrices. The input is first converted into vectors of continuous values by the embedding layer. These vectors are combined through various linear and nonlinear transformations to finally obtain a hidden (latent) vector. The hidden vector is then decoded into a large, sparse matrix through further linear and nonlinear transformations. Because the parameters of these transformations are initially random or inaccurate, the decoded matrix is likely to differ greatly from the original matrix.
  • A chemical structure can be one-hot encoded and converted into a matrix representation. The above neural network can therefore be used to reduce and raise the dimension of a compound, and the above training method can be used to train the molecular structure encoding model and the molecular structure decoding model.
  • FIG. 2 is a schematic diagram of a training process of a molecular structure model in an embodiment of the present application. It should be noted that, if there are substantially the same results, this embodiment is not limited to the flow sequence shown in FIG. 2 . As shown in Figure 2, this embodiment includes:
  • sample matrix is a matrix representation of the sample compound.
  • the compound library can be downloaded from the Internet, and effective compounds can be extracted from the compound library as sample compounds.
  • the sample compounds can be screened to a certain extent, for example, chiral compounds, salt compounds, uncommon molecules, molecules with too many heavy atoms, inorganic substances, etc. can be removed when screening sample compounds. Different screening rules can be set according to different requirements, which are not limited here.
  • SMILES (Simplified Molecular Input Line Entry System) is a specification for writing a molecular structure as a linear string.
  • the chemical structure can be written in the form of a SMILES string according to an existing set of rules.
  • pyrimidine can be written as SMILES string "c1ccncn1".
  • a string can be thought of as a sentence consisting of several words.
  • The above pyrimidine string can be regarded as composed of three words: c, 1, and n. Using one-hot encoding, each word can be converted into a vector consisting only of 0s and 1s, and the string can then be converted into a matrix representation to obtain a sample matrix.
  • Taking c1ccncn1 as an example, its three words c, 1, and n are unordered and discrete. Treat these three words as three states, each represented by a vector of 0s and 1s. If the first position stands for c, the second for 1, and the third for n, the three words are expressed as [1,0,0], [0,1,0], and [0,0,1], where 1 means the word is present and 0 means it is absent.
  • The structure of pyrimidine is then represented as the two-dimensional matrix [[1,0,0],[0,1,0],[1,0,0],[1,0,0],[0,0,1],[1,0,0],[0,0,1],[0,1,0]].
  • In this two-dimensional matrix, one dimension represents the vector length of each word and the other represents the length of the string.
  • Here the vector length of each word is 3, and the length of the entire pyrimidine string is 8. Encoded in this way, the pyrimidine structure is transformed into a form that a computer can process.
  • The SMILES strings in the sample compound set can be uniformly encoded into m*n matrices (m words, each with a word vector of length n). Here m is the length of the longest SMILES string in the set; a SMILES string shorter than m words is still expressed as an m*n matrix, with the missing elements filled with 0. Similarly, n is taken as the longest word vector length.
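  • The one-hot encoding and zero padding described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `one_hot_encode` and the `max_len` parameter are hypothetical, and splitting the string one character per word is an assumption that holds for the pyrimidine example but not for multi-character SMILES tokens such as Cl or Br.

```python
def one_hot_encode(smiles, vocabulary, max_len):
    """Encode a SMILES string as a max_len x len(vocabulary) matrix of 0s and 1s.

    Assumption: every character of the string is one "word".
    """
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for char in smiles:
        row = [0] * len(vocabulary)
        row[index[char]] = 1  # raises KeyError for a word outside the vocabulary
        matrix.append(row)
    # Strings shorter than max_len are padded with all-zero rows.
    matrix.extend([0] * len(vocabulary) for _ in range(max_len - len(smiles)))
    return matrix


vocabulary = ["c", "1", "n"]  # word order follows the example in the text
pyrimidine = one_hot_encode("c1ccncn1", vocabulary, max_len=8)
padded = one_hot_encode("c1ccncn1", vocabulary, max_len=10)
```

  • With `max_len=8` the result reproduces the 8*3 pyrimidine matrix given above; with `max_len=10` two all-zero rows are appended.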
  • S230 Input the sample matrix into the encoding layer of the autoencoder, and encode to obtain a sample vector, wherein the sample vector is a representation of the feature vector of the sample compound.
  • S250 Input the sample vector into the decoding layer of the self-encoder, and decode to obtain a prediction matrix.
  • S290 Iteratively updating the parameters of the self-encoder based on the loss until the loss is stable, and obtaining a molecular structure encoding model and a molecular structure decoding model.
  • After training, the encoding layer of the updated autoencoder can be used as the molecular structure encoding model, and the decoding layer of the updated autoencoder can be used as the molecular structure decoding model.
  • The autoencoder may further include an input layer, which converts a compound given as a chemical structural formula into its matrix representation.
  • The input layer and the encoding layer are then used together as the molecular structure encoding model.
  • The molecular structure encoding model takes a compound in molecular structural formula form as input and outputs the encoded vector representation of the compound.
  • The autoencoder may further include an output layer, which converts a compound in matrix representation into its chemical structural formula.
  • The conversion is the reverse of converting a chemical structural formula into a matrix representation; refer to the description above for details, which are not repeated here.
  • The output layer and the decoding layer are used together as the molecular structure decoding model.
  • The molecular structure decoding model takes the vector representation of a compound as input and outputs the decoded molecular structural formula.
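  • The application does not define the loss function or what "stable" means, so the following is only a sketch under stated assumptions: the loss is taken to be the mean squared error between the prediction matrix and the sample matrix, and the loss is treated as stable once its change stays below a tolerance for several consecutive updates. The names `reconstruction_loss`, `train_until_stable`, `tolerance`, and `patience` are hypothetical, and a one-parameter toy model stands in for the autoencoder.

```python
def reconstruction_loss(prediction, sample):
    """Mean squared error between two equally shaped matrices (an assumed loss)."""
    diffs = [(p - s) ** 2
             for p_row, s_row in zip(prediction, sample)
             for p, s in zip(p_row, s_row)]
    return sum(diffs) / len(diffs)


def train_until_stable(step, tolerance=1e-6, patience=3, max_steps=10_000):
    """Call step() until the loss changes by at most `tolerance` for `patience` updates.

    step() performs one parameter update and returns the current loss.
    """
    previous = step()
    history = [previous]
    calm = 0  # number of consecutive updates with a negligible loss change
    for _ in range(max_steps - 1):
        loss = step()
        history.append(loss)
        calm = calm + 1 if abs(previous - loss) <= tolerance else 0
        if calm >= patience:
            break
        previous = loss
    return history


# Toy stand-in for the autoencoder: the "prediction" is weight * sample.
sample = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
weight = [0.0]


def toy_step(learning_rate=0.5):
    w = weight[0]
    # Gradient of the mean squared error of (w * sample) against sample w.r.t. w.
    grad = sum(2 * (w * s - s) * s for row in sample for s in row) / 9
    weight[0] = w - learning_rate * grad
    prediction = [[weight[0] * s for s in row] for row in sample]
    return reconstruction_loss(prediction, sample)


losses = train_until_stable(toy_step)
```

  • In a real setting `step` would run one optimizer update of the autoencoder over the sample matrices; only the stopping logic is meant to carry over.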
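  • As an illustration of the output layer's reverse conversion, the sketch below turns a decoded matrix back into a string by taking, for each row, the word with the largest value. The function name `matrix_to_smiles` is hypothetical, and two assumptions are made that the application does not state: the largest value in a row selects the word, and the first all-zero row marks the start of padding.

```python
def matrix_to_smiles(matrix, vocabulary):
    """Convert a (possibly non-binary) matrix back into a SMILES-like string."""
    chars = []
    for row in matrix:
        if not any(row):
            break  # assumption: an all-zero row is padding, so the string ends here
        best = max(range(len(row)), key=lambda i: row[i])
        chars.append(vocabulary[best])
    return "".join(chars)


vocabulary = ["c", "1", "n"]
one_hot = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0],
           [0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
decoded = matrix_to_smiles(one_hot, vocabulary)
soft = matrix_to_smiles([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0, 0, 0]], vocabulary)
```

  • A string produced this way is not guaranteed to be a valid molecule; a validity check with a cheminformatics toolkit would normally follow.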
  • FIG. 3 is a schematic flowchart of another compound design method in the embodiment of the present application. It should be noted that this embodiment is not limited to the flow sequence shown in FIG. 3 if substantially the same result is achieved. As shown in Figure 3, this embodiment can combine the molecular structure coding model, molecular structure decoding model and genetic algorithm for compound design, specifically including:
  • S310 Acquire a seed vector.
  • A seed compound can be selected from the compound database and its SMILES string obtained; one-hot encoding is performed on the SMILES string of the seed compound to obtain a seed matrix, which is the matrix representation of the seed compound; the seed matrix is input into the molecular structure encoding model and encoded to obtain the seed vector. Refer to the description above for details, which are not repeated here.
  • S330 Perform a crossover operation on the seed vector based on the genetic algorithm to obtain a derivative vector.
  • The crossover operation selects two seed vectors from the seed vector set and selects an exchange position (one or more positions) in one of them.
  • The seed vectors and the exchange position can be selected at random, or according to preset selection rules.
  • The value at the selected exchange position of this seed vector is exchanged with the value at the corresponding position of the other seed vector. For example, given two vectors [0.1,0.2,0.3] and [0.4,0.5,0.6], exchanging the first position yields two new vectors, [0.4,0.2,0.3] and [0.1,0.5,0.6]. As another example, exchanging the first and third positions of the same two vectors yields [0.4,0.2,0.6] and [0.1,0.5,0.3].
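  • The crossover operation can be sketched in a few lines. The function name `crossover` is hypothetical, and positions are 0-based in the code, so the "first position" in the text is index 0.

```python
def crossover(parent_a, parent_b, positions):
    """Exchange the values of two equal-length vectors at the given positions."""
    if len(parent_a) != len(parent_b):
        raise ValueError("vectors must have the same length")
    child_a, child_b = list(parent_a), list(parent_b)
    for i in set(positions):  # a set, so a repeated position is swapped only once
        child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b


first = crossover([0.1, 0.2, 0.3], [0.4, 0.5, 0.6], positions=[0])
first_and_third = crossover([0.1, 0.2, 0.3], [0.4, 0.5, 0.6], positions=[0, 2])
```

  • The two calls reproduce the worked examples above: `first` is ([0.4, 0.2, 0.3], [0.1, 0.5, 0.6]) and `first_and_third` is ([0.4, 0.2, 0.6], [0.1, 0.5, 0.3]).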
  • S350 Perform a mutation operation on the seed vector based on the genetic algorithm to obtain a derived vector.
  • The mutation operation selects several seed vectors from the seed vector set (the proportion of vectors to be mutated can be specified in advance) and selects a mutation position (one or more positions) in each of them. The seed vectors and the mutation positions can be selected at random, or according to preset selection rules. The values at the mutation positions are replaced with new values, which can be arbitrary random values or preset values. For example, given the vector [0.1,0.2,0.3], selecting the first position and replacing its value at random yields a new vector such as [0.5,0.2,0.3]. As another example, selecting the first and second positions and replacing the corresponding values at random yields a new vector such as [0.2,0.4,0.3].
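  • A matching sketch of the mutation operation follows. The function name `mutate` is hypothetical, positions are 0-based, and the sampling range `low`/`high` for random replacement values is an assumption, since the text only says that any value may be used.

```python
import random


def mutate(vector, positions, new_values=None, rng=None, low=-1.0, high=1.0):
    """Return a copy of `vector` with the values at `positions` replaced.

    If `new_values` is given, its entries are used as the preset replacement
    values; otherwise each replacement is drawn uniformly from [low, high].
    """
    rng = rng or random.Random()
    child = list(vector)
    for k, i in enumerate(positions):
        child[i] = new_values[k] if new_values is not None else rng.uniform(low, high)
    return child


preset_one = mutate([0.1, 0.2, 0.3], positions=[0], new_values=[0.5])
preset_two = mutate([0.1, 0.2, 0.3], positions=[0, 1], new_values=[0.2, 0.4])
randomized = mutate([0.1, 0.2, 0.3], positions=[2], rng=random.Random(0))
```

  • The preset calls reproduce the two example vectors in the text; in the randomized call only the third value changes.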
  • Both the crossover operation and the mutation operation are for generating new vectors (ie derived vectors), deriving more vectors, and further deriving more compounds.
  • Crossover operation and mutation operation can simulate genetic evolution and improve the diversity of compounds.
  • The crossover operation and the mutation operation can be performed together, in either order, or only one of them can be performed; that is, steps S330 and S350 are illustrative only, either can be performed selectively, and their order can be reversed, which is not limited here.
  • the derivative vector is input into the molecular structure decoding model, and the derivative vector is decoded to obtain the derivative matrix, and then the derivative matrix is converted to obtain the derived molecular structure, and then the derivative compound can be determined according to the derived molecular structure. Please refer to the above description for details, and will not repeat them here.
  • S390 Measure the fitness of the derived compounds based on the fitness function, and select candidate compounds from the derived compounds according to the fitness.
  • Fitness is a scale used to evaluate derivative compounds, such as whether the structure has good solubility, good activity, etc. In this way, the derivative compounds are associated with the criteria for judging the quality, that is, the fitness function is constructed.
  • each derivative compound has an evaluation value, which represents the adaptability of the compound in the evolution process. For example, molecules with poor solubility and poor activity tend to be eliminated.
  • This evaluation standard depends on the definition of the user, and the user can adaptively set the evaluation standard (fitness function) according to the characteristic requirements of the compound to be designed. For example, the user wants to get a compound with a large enough molecular weight. Then, thousands of derived vectors are randomly generated, and these derived vectors are transformed according to the above to obtain a compound respectively, and then the molecular weights of these compounds are calculated. This molecular weight is the user-defined fitness. We arrange these compounds in descending order according to molecular weight, and select a top-ranked candidate compound or a batch of candidate compounds according to the user-defined parameters (the ratio or quantity selected each time).
  • multiple rounds of crossover operations and mutation operations can be iteratively performed to obtain more derivative compounds, and then desired candidate compounds are selected from these derivative compounds.
  • The specific implementation of selecting candidate compounds from the derivative compounds may include: step S1: according to the fitness, select from the derivative compounds the target compounds whose fitness meets the preset condition; step S2: using the derivative vectors corresponding to the target compounds as the seed vectors, continue to execute steps S330 and/or S350 through step S390 to measure the fitness of the derivative compounds based on the fitness function.
  • The target compounds that satisfy the preset condition can be a fixed number of compounds (such as 10, 30, or 50) selected from the derivative compounds; alternatively, the derivative compounds can be sorted by fitness and a fixed ratio (such as 1/10, 1/5, or 1/3) selected from front to back as the target compounds; or the derivative compounds whose fitness is greater than a fixed threshold can be selected as the target compounds.
  • the number, ratio, and conditions of target compounds selected can be set according to needs, and will not be repeated here.
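  • The three preset conditions above (fixed number, fixed ratio, fitness threshold) can be sketched as one helper. The function name `select_targets` and its parameters are hypothetical, and the compound labels and fitness values are made-up toy data.

```python
def select_targets(scored, count=None, ratio=None, threshold=None):
    """Select target compounds from (compound, fitness) pairs.

    Exactly one of `count`, `ratio`, or `threshold` is expected.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if count is not None:
        return ranked[:count]                            # fixed number
    if ratio is not None:
        return ranked[:max(1, int(len(ranked) * ratio))]  # fixed ratio, at least one
    if threshold is not None:
        return [pair for pair in ranked if pair[1] > threshold]
    raise ValueError("specify count, ratio, or threshold")


scored = [("A", 0.2), ("B", 0.9), ("C", 0.5), ("D", 0.7), ("E", 0.1)]
by_count = select_targets(scored, count=2)
by_ratio = select_targets(scored, ratio=1 / 5)
by_threshold = select_targets(scored, threshold=0.5)
```

  • Here `by_count` and `by_threshold` both keep B and D, while `by_ratio` keeps only B.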
  • The derivative compounds can be sorted in descending order of fitness and the top-ranked target compounds selected; crossover and mutation operations are then performed on the derivative vectors of these target compounds to generate a new batch of one-dimensional derivative vectors. These new derivative vectors are input into the molecular structure decoding model, decoded into new matrices, and converted into new derivative compounds, whose fitness is then calculated.
  • These derivative compounds are arranged in descending order of fitness, and the top-ranked target compounds are selected from them, and then crossed and mutated to generate a new one-dimensional derivative vector. This loop is iterated and all derived compounds generated are recorded. Candidate compounds with better fitness are selected from these generated derivative compounds as the final result.
  • the number of iterations can depend on the set parameters and the characteristics of the data set itself.
  • the iteration termination condition can be the number of iterations set in advance, and the number of iterations can be dozens to hundreds of times, such as 200 to 400 rounds.
  • the iteration termination condition can be the iteration duration set in advance, such as 8 hours, 12 hours, 24 hours, 48 hours, etc.
  • The complex chemical space is reduced to one-dimensional vectors, which lets the design algorithm search the chemical space conveniently and efficiently; combining this chemical space with the genetic algorithm overcomes the problem that compounds generated by molecular generative models become progressively less diverse after reinforcement learning or transfer learning.
  • The latest ChEMBL28 database can be downloaded from the Internet, and the SMILES strings of its compounds extracted.
  • The sample compound structures must contain only hydrogen, carbon, nitrogen, oxygen, fluorine, sulfur, chlorine, and bromine atoms. Chiral compounds, inorganic substances, and salt ions are excluded, the number of heavy atoms is restricted to within 70, and the SMILES strings are converted into canonical form.
  • About 1.8 million SMILES strings are obtained after deduplication and are used to train a neural network. The embodiments are developed based on this neural network.
  • Protein kinase B, also known as AKT, is a serine/threonine-specific protein kinase. It plays an important regulatory role in apoptosis, proliferation, migration, and other cellular processes. AKT1 takes part in cell survival pathways by inhibiting apoptotic processes, thereby promoting cell survival. Clinical studies have found that AKT is overexpressed in various human tumors such as gastric cancer and pancreatic cancer. AKT inhibitors can inhibit the activity of AKT and promote the apoptosis of cancer cells.
  • Compound 1 is an AKT inhibitor in clinical research. A pharmacophore model is established by analyzing its interaction mode, and the degree to which a molecule matches the pharmacophore is used as the fitness criterion in the search for new molecules.
  • IDH1 human isocitrate dehydrogenase 1
  • glioma a variety of malignant tumors, such as glioma.
  • Mutated IDH1 can convert α-ketoglutarate to 2-hydroxyglutarate.
  • the latter is a carcinogen that accumulates in the body and promotes the further progression of cancer.
  • drugs that inhibit the activity of mutant IDH1 can effectively reduce the concentration of 2-hydroxyglutarate in the body and relieve cancer symptoms.
  • Compound 2 is the most promising inhibitor of mutant IDH1 currently studied. Taking it as the template molecule, the similarity (measured by molecular fingerprints) between each generated molecule and the template is calculated as that molecule's fitness, and a batch of similar molecules is searched for in the latent space.
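A fingerprint-based similarity fitness of this kind might look like the sketch below. The text does not say which similarity measure is used; the Tanimoto coefficient is assumed here because it is a common choice for molecular fingerprints. The fingerprints are made-up sets of on-bit indices rather than output from a real cheminformatics toolkit, and all names are hypothetical.

```python
from typing import Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Made-up bit sets standing in for real molecular fingerprints.
template_fp = {1, 4, 7, 9, 15}
generated_fps = {
    "mol_1": {1, 4, 7, 9, 15},
    "mol_2": {1, 4, 8, 15},
    "mol_3": {2, 3, 5},
}

# Fitness of each generated molecule is its similarity to the template.
fitness = {name: tanimoto(fp, template_fp) for name, fp in generated_fps.items()}
ranked = sorted(fitness, key=fitness.get, reverse=True)
```

With these toy inputs the identical fingerprint scores highest and the disjoint one lowest, which is the ordering a similarity-driven search would rely on.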
  • A third predetermined number of seed compounds from the compound database are input into the molecular structure encoding model to obtain seed vectors; crossover and mutation operations are performed on the seed vectors based on the genetic algorithm to obtain multiple derivative vectors; the derivative vectors are input into the molecular structure decoding model to obtain multiple derivative compounds; the similarity between each derivative compound and the template molecule is calculated to obtain the fitness of that derivative compound; a fourth predetermined number of derivative vectors are then selected according to fitness as seed vectors for further crossover and mutation operations. Iterating this cycle for 380 rounds yields a batch of new compounds, as follows:
  • FIG. 4 is a schematic structural diagram of a compound design device in an embodiment of the present application.
  • The compound design device 40 includes an acquisition module 41, an operation module 42, and a decoding module 43.
  • The acquisition module 41 is used to obtain seed vectors, a seed vector being the feature vector representation of a seed compound; the operation module 42 is used to perform crossover and/or mutation operations on the seed vectors based on the genetic algorithm to obtain derivative vectors; the decoding module 43 is used to process the derivative vectors to obtain derivative compounds.
  • The device designs compounds based on the genetic algorithm, which enlarges the explorable compound space, yields diverse compounds, and widens the range of choices.
  • The complex chemical space is reduced to one-dimensional vectors during operation, which enables the design algorithm to search the chemical space conveniently and efficiently. For the specific execution process, please refer to the description of the above embodiments, which is not repeated here.
  • The compound design device 40 also includes a selection module (not shown in the figure), which is used to measure the fitness of the derivative compounds based on the fitness function and to select candidate compounds from the derivative compounds according to fitness.
  • The selection module selects candidate compounds from the derivative compounds according to fitness as follows: step S1: according to fitness, select from the derivative compounds the target compounds whose fitness meets a preset condition; step S2: use the derivative vectors corresponding to the target compounds as seed vectors, and continue executing the steps from performing crossover and/or mutation operations on the seed vectors based on the genetic algorithm to obtain derivative vectors, through measuring the fitness of each derivative compound based on the fitness function; iterate steps S1 and S2 until the iteration termination condition is met; sort all derivative compounds obtained in descending order of fitness; and select a predetermined proportion or a predetermined number of the derivative compounds with the best fitness as candidate compounds. In this way, more candidate compounds can be obtained, and better compounds can be screened more easily.
  • The operation module 42 includes a crossover operation submodule (not shown in the figure), which is used to select two seed vectors from the seed vector set, select an exchange position in one of them, and exchange the value at that position with the value at the corresponding position of the other seed vector to obtain new derivative vectors. Derivative compounds can then be obtained from them, which enriches the derivative vectors and increases the diversity of the derivative compounds.
  • The operation module 42 includes a mutation operation submodule (not shown in the figure), which is used to select a seed vector from the seed vector set, select a mutation position in the selected seed vector, and replace the value at the mutation position with a new value to obtain a new derivative vector. A derivative compound can then be obtained from it, which enriches the derivative vectors and increases the diversity of the derivative compounds.
  • The decoding module 43 is used to input the derivative vector into the molecular structure decoding model, which is a neural network model, decode the derivative vector to obtain a derivative molecular structure, and obtain the derivative compound from the derivative molecular structure.
  • The compound design device 40 also includes an encoding module (not shown in the figure), which is used to obtain the SMILES string of the seed compound; one-hot encode the SMILES string to obtain a seed matrix, the seed matrix being the matrix representation of the seed compound; and input the seed matrix into the molecular structure encoding model, which encodes it to obtain the seed vector.
  • The compound design device 40 also includes a model training module (not shown in the figure), which is used to: obtain a sample matrix, the sample matrix being a matrix representation of a sample compound; input the sample matrix into the encoding layer of an autoencoder and encode it to obtain a sample vector, the sample vector being the feature vector representation of the sample compound; input the sample vector into the decoding layer of the autoencoder and decode it to obtain a prediction matrix; calculate the loss between the prediction matrix and the sample matrix; and iteratively update the parameters of the autoencoder based on the loss until the loss stabilizes. The decoding layer and output layer of the trained autoencoder then serve as the molecular structure decoding model, the output layer being used to convert a compound represented as a matrix into a molecular structure representation, and the encoding layer of the trained autoencoder serves as the molecular structure encoding model.
  • The compound design device can be an independent server, a server cluster, or a module of a server. It can be used for model training and for running the genetic algorithm in order to design compounds.
  • FIG. 5 is a schematic structural diagram of compound design equipment in an embodiment of the present application.
  • The compound design equipment 10 includes a processor 11 and a memory 12.
  • the processor 11 may also be called a CPU (Central Processing Unit, central processing unit).
  • the processor 11 may be an integrated circuit chip with signal processing capabilities.
  • The processor 11 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • The general-purpose processor can be a microprocessor, or the processor 11 can be any conventional processor or the like.
  • The compound design equipment 10 may further include a memory 12 for storing the instructions and data required for the operation of the processor 11.
  • the processor 11 is configured to execute instructions to implement the methods provided in any embodiment of the compound design method of the present application and any non-conflicting combination.
  • The compound design equipment can be a server, a desktop computer, a laptop, or the like. It can be used for model training and for running the genetic algorithm in order to design compounds.
  • FIG. 6 is a schematic structural diagram of a computer-readable storage medium in an embodiment of the present application.
  • The computer-readable storage medium 20 of the embodiment of the present application stores instructions/program data 21.
  • When the instructions/program data 21 are executed, the methods provided by any embodiment of the compound design method of the present application, and any non-conflicting combination of them, are implemented.
  • The instructions/program data 21 can form a program file and be stored in the above-mentioned storage medium 20 in the form of a software product, so that a computer device (which can be a personal computer, a server, a network device, or the like) or a processor executes all or part of the steps of the methods in the various embodiments of the present application.
  • The aforementioned storage medium 20 includes media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc, or terminal devices such as computers, servers, mobile phones, and tablets.
  • the disclosed system, device and method can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of units is only a logical function division. In actual implementation, there may be other division methods.
  • For example, multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

Abstract

The present application discloses a compound design method and apparatus, a device, and a computer readable storage medium. The method comprises acquiring a seed vector that is a feature vector representation mode of a seed compound; performing crossover operation and/or mutation operation on the seed vector on the basis of a genetic algorithm to obtain a derivative vector; and processing the derivative vector to obtain a derivative compound. In this way, the present application can improve the diversity of a designed compound.

Description

Compound design method, device, equipment and computer-readable storage medium
【Technical field】
The present application relates to the technical field of computational chemistry, and in particular to a compound design method, device, equipment, and computer-readable storage medium.
【Background art】
In traditional drug research, scientists screen compound libraries, test compounds against targets one by one, and finally identify hit compounds. Because this process is costly and has a high failure rate, computational chemists have tried to predict compound activity with computational models, simulating the binding of drugs in protein cavities by computer and then recommending and testing a batch of potentially active molecules. This approach, however, is severely limited by the quality of the virtual screening library. Existing compound libraries generally contain a few hundred thousand molecules, and their scaffolds have already been extensively studied and screened, so it is difficult to find candidate compounds with new scaffolds. Screening libraries that are too small and lack structural novelty can no longer meet growing research and development needs.
【Summary of the invention】
The main technical problem addressed by the present application is to provide a compound design method, device, equipment, and computer-readable storage medium that can increase the diversity of the designed compounds.
To solve the above technical problem, one technical solution adopted by the present application is to provide a compound design method. The method includes: obtaining a seed vector, the seed vector being a feature vector representation of a seed compound; performing a crossover operation and/or a mutation operation on the seed vector based on a genetic algorithm to obtain a derivative vector; and processing the derivative vector to obtain a derivative compound.
After the derivative vector is processed to obtain the derivative compound, the method further includes: measuring the fitness of each derivative compound based on a fitness function; and selecting candidate compounds from the derivative compounds according to fitness.
Selecting candidate compounds from the derivative compounds according to fitness includes:
Step S1: according to fitness, selecting from the derivative compounds the target compounds whose fitness meets a preset condition;
Step S2: using the derivative vectors corresponding to the target compounds as seed vectors, and continuing to execute the steps from performing the crossover operation and/or mutation operation on the seed vectors based on the genetic algorithm to obtain derivative vectors, through measuring the fitness of each derivative compound based on the fitness function;
iterating steps S1 and S2 until an iteration termination condition is met, at which point the iteration ends;
sorting all derivative compounds obtained in descending order of fitness;
selecting a predetermined proportion or a predetermined number of the derivative compounds with the best fitness as candidate compounds.
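The iteration of steps S1 and S2 can be sketched roughly as below. This is a minimal illustration, not the applicant's implementation: `evolve` and the `toy_*` helpers are hypothetical names, the termination condition is assumed to be a fixed round count, and the decoder and fitness function are trivial stand-ins for the neural decoding model and a real fitness measure.

```python
import random
from typing import Callable, Dict, List, Tuple

Vector = Tuple[float, ...]


def evolve(
    seeds: List[Vector],
    make_children: Callable[[List[Vector], random.Random], List[Vector]],
    decode: Callable[[Vector], str],
    fitness: Callable[[str], float],
    rounds: int,
    keep: int,
    final_count: int,
    rng: random.Random,
) -> List[Tuple[str, float]]:
    """Iterate S1/S2 for a fixed number of rounds and return the best compounds."""
    recorded: Dict[str, float] = {}  # every derivative compound seen so far
    current = list(seeds)
    for _ in range(rounds):  # assumed termination condition: preset round count
        children = make_children(current, rng)  # crossover and/or mutation
        scored = []
        for vec in children:
            compound = decode(vec)
            score = fitness(compound)
            recorded[compound] = score
            scored.append((score, vec))
        # S1: keep the fittest; S2: their vectors seed the next round.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        current = [vec for _, vec in scored[:keep]]
    ranking = sorted(recorded.items(), key=lambda kv: kv[1], reverse=True)
    return ranking[:final_count]


def toy_children(parents: List[Vector], rng: random.Random) -> List[Vector]:
    # Stand-in for crossover/mutation: jitter one coordinate of a random parent.
    out = []
    for _ in range(20):
        child = list(rng.choice(parents))
        i = rng.randrange(len(child))
        child[i] += rng.uniform(-0.5, 0.5)
        out.append(tuple(child))
    return out


def toy_decode(vec: Vector) -> str:
    # Stand-in for the decoding model: the "compound" is just the rounded vector.
    return ",".join(f"{x:.3f}" for x in vec)


def toy_fitness(compound: str) -> float:
    # Stand-in fitness: closer to the origin is better.
    return -sum(float(x) ** 2 for x in compound.split(","))


best = evolve([(2.0, -1.0, 0.5)], toy_children, toy_decode, toy_fitness,
              rounds=30, keep=5, final_count=3, rng=random.Random(0))
```

Passing the operators, decoder, and fitness function in as callables keeps the loop itself independent of how vectors are bred or scored; a time-based termination condition could replace the round count without changing the rest.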
Performing the crossover operation on the seed vectors based on the genetic algorithm includes: selecting two seed vectors from the seed vector set, selecting an exchange position in one of them, and exchanging the value at that exchange position with the value at the corresponding position of the other seed vector.
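A crossover of this kind might be sketched as follows. The sketch assumes a single, randomly chosen exchange position and plain Python lists as seed vectors; the function name and sample values are hypothetical, and the text does not state how the position is chosen.

```python
import random
from typing import List, Tuple


def crossover(
    parent_a: List[float], parent_b: List[float], rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Swap the values at one randomly chosen position of two equal-length vectors."""
    if len(parent_a) != len(parent_b):
        raise ValueError("seed vectors must have the same length")
    position = rng.randrange(len(parent_a))
    child_a, child_b = list(parent_a), list(parent_b)  # leave the parents untouched
    child_a[position], child_b[position] = parent_b[position], parent_a[position]
    return child_a, child_b


rng = random.Random(42)
seed_set = [[0.1, 0.2, 0.3, 0.4], [0.9, 0.8, 0.7, 0.6], [0.5, 0.5, 0.5, 0.5]]
first, second = rng.sample(seed_set, 2)  # pick two seed vectors from the set
child_1, child_2 = crossover(first, second, rng)
```

Each child differs from its parent at exactly one position here, so a single crossover makes a small move in the latent space; swapping several positions or a whole segment would be a natural variation.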
Performing the mutation operation on the seed vectors based on the genetic algorithm includes: selecting a seed vector from the seed vector set, selecting a mutation position in the selected seed vector, and replacing the value at the mutation position with a new value.
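The mutation step can be sketched in the same spirit. How the new value is drawn is not specified in the text; sampling it uniformly from a fixed range is an assumption made for this example, as are the function name and its `low`/`high` parameters.

```python
import random
from typing import List


def mutate(
    seed: List[float], rng: random.Random, low: float = -1.0, high: float = 1.0
) -> List[float]:
    """Return a copy of `seed` with one randomly chosen position set to a new value."""
    position = rng.randrange(len(seed))
    child = list(seed)  # copy so the seed vector itself is not modified
    child[position] = rng.uniform(low, high)  # assumed sampling rule for the new value
    return child


rng = random.Random(7)
seed_set = [[0.1, 0.2, 0.3, 0.4], [0.9, 0.8, 0.7, 0.6]]
parent = rng.choice(seed_set)  # pick one seed vector from the set
child = mutate(parent, rng)
```

In practice the sampling range would presumably be matched to the range of values the encoder actually produces, so that mutated vectors stay in a region the decoder can handle.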
Processing the derivative vector to obtain the derivative compound includes: inputting the derivative vector into a molecular structure decoding model and decoding it to obtain a derivative molecular structure, the molecular structure decoding model being a neural network model; and obtaining the derivative compound from the derivative molecular structure.
Before the derivative vector is input into the molecular structure decoding model and decoded to obtain the molecular structure of the derivative compound, the method further includes: obtaining a sample matrix, the sample matrix being a matrix representation of a sample compound; inputting the sample matrix into the encoding layer of an autoencoder and encoding it to obtain a sample vector, the sample vector being a feature vector representation of the sample compound; inputting the sample vector into the decoding layer of the autoencoder and decoding it to obtain a prediction matrix; calculating the loss between the prediction matrix and the sample matrix; and iteratively updating the parameters of the autoencoder based on the loss until the loss stabilizes, then using the decoding layer and output layer of the trained autoencoder as the molecular structure decoding model, the output layer being used to convert a compound represented as a matrix into a molecular structure representation.
Obtaining the seed vector includes: obtaining the SMILES string of the seed compound; one-hot encoding the SMILES string to obtain a seed matrix, the seed matrix being a matrix representation of the seed compound; and encoding the seed matrix to obtain the seed vector.
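One-hot encoding a SMILES string into a seed matrix could look like the sketch below. The vocabulary, the padding scheme, and the character-level tokenization are assumptions chosen for illustration; in particular, a character-level scheme would split two-letter atoms such as Cl and Br, so a real vocabulary would need to treat those as single tokens.

```python
from typing import List

# Hypothetical character vocabulary; index 0 (a space) is reserved for padding.
VOCAB = [" ", "C", "N", "O", "c", "n", "(", ")", "=", "1"]
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(VOCAB)}


def one_hot_smiles(smiles: str, max_len: int) -> List[List[int]]:
    """Encode a SMILES string as a max_len x len(VOCAB) one-hot matrix."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string longer than max_len")
    padded = smiles.ljust(max_len)  # pad with spaces up to max_len
    matrix = []
    for ch in padded:
        row = [0] * len(VOCAB)
        row[CHAR_TO_INDEX[ch]] = 1  # raises KeyError for characters outside VOCAB
        matrix.append(row)
    return matrix


seed_matrix = one_hot_smiles("CC(=O)N", max_len=10)  # acetamide
```

Each row of the matrix marks which vocabulary entry occupies that position of the string, so every compound maps to a matrix of the same fixed shape that an encoder can consume.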
Encoding the seed matrix to obtain the seed vector includes: inputting the seed matrix into a molecular structure encoding model and encoding it to obtain the seed vector.
Before the seed matrix is input into the molecular structure encoding model and encoded to obtain the seed vector, the method further includes: obtaining a sample matrix, the sample matrix being a matrix representation of a sample compound; inputting the sample matrix into the encoding layer of an autoencoder and encoding it to obtain a sample vector, the sample vector being a feature vector representation of the sample compound; inputting the sample vector into the decoding layer of the autoencoder and decoding it to obtain a prediction matrix; calculating the loss between the prediction matrix and the sample matrix; and iteratively updating the parameters of the autoencoder based on the loss until the loss stabilizes, then using the encoding layer of the trained autoencoder as the molecular structure encoding model.
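The loss between the prediction matrix and the sample matrix is not specified further in the text. As one plausible reading, assuming the sample matrix is one-hot and each row of the prediction matrix is a probability distribution over the vocabulary, a mean categorical cross-entropy could be computed as in this sketch; the function name, the `eps` guard, and the numbers are illustrative only.

```python
import math
from typing import List

Matrix = List[List[float]]


def reconstruction_loss(sample: Matrix, predicted: Matrix, eps: float = 1e-12) -> float:
    """Mean categorical cross-entropy between a one-hot matrix and predicted row probabilities."""
    total = 0.0
    for target_row, prob_row in zip(sample, predicted):
        # Only the position where the target is 1 contributes to the sum.
        total -= sum(t * math.log(p + eps) for t, p in zip(target_row, prob_row))
    return total / len(sample)


sample = [[1, 0, 0], [0, 1, 0]]
good = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]  # confident, correct reconstruction
poor = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]    # nearly uniform reconstruction
loss_good = reconstruction_loss(sample, good)
loss_poor = reconstruction_loss(sample, poor)
```

The better reconstruction yields the lower loss, which is the quantity a training loop would keep reducing until it stops changing appreciably from one iteration to the next.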
To solve the above technical problem, another technical solution adopted by the present application is to provide a compound design device. The compound design device includes an acquisition module, an operation module, and a decoding module. The acquisition module is used to obtain a seed vector, the seed vector being a feature vector representation of a seed compound; the operation module is used to perform a crossover operation and/or a mutation operation on the seed vector based on a genetic algorithm to obtain a derivative vector; the decoding module is used to process the derivative vector to obtain a derivative compound.
The compound design device further includes a selection module, which is used to measure the fitness of each derivative compound based on a fitness function and to select candidate compounds from the derivative compounds according to fitness.
The selection module selects candidate compounds from the derivative compounds according to fitness as follows: step S1: according to fitness, selecting from the derivative compounds the target compounds whose fitness meets a preset condition; step S2: using the derivative vectors corresponding to the target compounds as seed vectors, and continuing to execute the steps from performing the crossover operation and/or mutation operation on the seed vectors based on the genetic algorithm to obtain derivative vectors, through measuring the fitness of each derivative compound based on the fitness function; iterating steps S1 and S2 until an iteration termination condition is met, at which point the iteration ends; sorting all derivative compounds obtained in descending order of fitness; and selecting a predetermined proportion or a predetermined number of the derivative compounds with the best fitness as candidate compounds.
The operation module includes a crossover operation submodule, which is used to select two seed vectors from the seed vector set, select an exchange position in one of them, and exchange the value at that exchange position with the value at the corresponding position of the other seed vector.
The operation module includes a mutation operation submodule, which is used to select a seed vector from the seed vector set, select a mutation position in the selected seed vector, and replace the value at the mutation position with a new value.
The decoding module is specifically used to input the derivative vector into a molecular structure decoding model, decode the derivative vector to obtain a derivative molecular structure, and obtain the derivative compound from the derivative molecular structure; the molecular structure decoding model is a neural network model.
The compound design device further includes a model training module, which is used to: obtain a sample matrix, the sample matrix being a matrix representation of a sample compound; input the sample matrix into the encoding layer of an autoencoder and encode it to obtain a sample vector, the sample vector being a feature vector representation of the sample compound; input the sample vector into the decoding layer of the autoencoder and decode it to obtain a prediction matrix; calculate the loss between the prediction matrix and the sample matrix; and iteratively update the parameters of the autoencoder based on the loss until the loss stabilizes, then use the decoding layer and output layer of the trained autoencoder as the molecular structure decoding model, the output layer being used to convert a compound represented as a matrix into a molecular structure representation.
The compound design device further includes an encoding module, which is used to obtain the SMILES string of the seed compound; one-hot encode the SMILES string to obtain a seed matrix, the seed matrix being a matrix representation of the seed compound; and encode the seed matrix to obtain the seed vector.
The encoding module encodes the seed matrix to obtain the seed vector by inputting the seed matrix into a molecular structure encoding model and encoding it to obtain the seed vector.
The compound design device further includes a model training module, which is used to: obtain a sample matrix, the sample matrix being a matrix representation of a sample compound; input the sample matrix into the encoding layer of an autoencoder and encode it to obtain a sample vector, the sample vector being a feature vector representation of the sample compound; input the sample vector into the decoding layer of the autoencoder and decode it to obtain a prediction matrix; calculate the loss between the prediction matrix and the sample matrix; and iteratively update the parameters of the autoencoder based on the loss until the loss stabilizes, then use the encoding layer of the trained autoencoder as the molecular structure encoding model.
To solve the above technical problem, another technical solution adopted by the present application is to provide compound design equipment including a processor and a memory; instructions are stored in the memory, and the processor is used to execute the instructions to implement any of the compound design methods described above.
To solve the above technical problem, another technical solution adopted by the present application is to provide a computer-readable storage medium for storing instructions/program data, which can be executed to implement any of the compound design methods described above.
The beneficial effects of the present application are as follows. In contrast to the prior art, the compound design method provided by the present application designs compounds based on a genetic algorithm, which enlarges the explorable compound space, yields diverse compounds, and widens the range of choices. Furthermore, the complex chemical space is reduced to one-dimensional vectors during operation, which enables the design algorithm to search the chemical space conveniently and efficiently.
【Brief description of the drawings】
FIG. 1 is a schematic flowchart of a compound design method in an embodiment of the present application;
FIG. 2 is a schematic diagram of the training process of a molecular structure model in an embodiment of the present application;
FIG. 3 is a schematic flowchart of another compound design method in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a compound design device in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of compound design equipment in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a computer-readable storage medium in an embodiment of the present application.
【Detailed description】
To make the purpose, technical solutions, and effects of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments.
To meet the demand of drug research and development for compound screening libraries, the inventors of the present application found that a molecular generation model based on deep learning can draw on a large-scale compound database to learn by itself the rules by which compounds are written, and can represent a compound as a dense vector of continuous values. The model thereby learns the structural features of compounds, generates compounds with new scaffolds, and expands the searchable chemical space. On this basis, to generate molecules with particular characteristics, transfer learning or reinforcement learning can be used to guide model training so that the chemical space from which molecules are generated shrinks to a specific region, and sampling within that region produces molecules that meet the conditions. For example, molecules with particular functional groups can be generated. However, with both transfer learning and reinforcement learning, the diversity of the generated molecules gradually decreases as training proceeds, and the generated molecular scaffolds gradually become uniform. Transfer learning depends heavily on the quality of a small dataset; too few samples and low diversity cause the model to converge too early, and the generated compounds lack diversity. In reinforcement learning, overly complex combinations of functions make model training unstable and hard to converge. If the scoring function uses a single scoring criterion, the model still converges prematurely, and the resulting molecules lack diversity.
In view of this, the present application provides a compound design method that designs new compounds based on a genetic algorithm. Using the genetic algorithm's principle of simulating natural evolution, a certain number of seed compounds are selected to simulate chromosomes in nature and form the initial population. In each generation of evolution, the fitness of the whole population is evaluated, and several individuals are selected based on fitness; simulating natural selection, inheritance, and mutation produces the next-generation population (that is, the derivative compounds). This cycle is repeated for each generation to search for an optimal solution.
请参阅图1,图1是本申请实施方式中一化合物设计方法的流程示意图。需注意的是,若有实质上相同的结果,本实施例并不以图1所示的流程顺序为限。如图1所示,本实施方式包括:Please refer to FIG. 1 . FIG. 1 is a schematic flowchart of a compound design method in an embodiment of the present application. It should be noted that this embodiment is not limited to the flow sequence shown in FIG. 1 if substantially the same result is obtained. As shown in Figure 1, this embodiment includes:
S110: Acquire a seed vector.
The seed vector is the feature-vector representation of a seed compound.
Based on the genetic algorithm, the first-generation population for genetic evolution is constructed first; that is, the base compounds for compound design, namely the seed compounds, must be obtained first. A seed compound may be any compound selected at random from a compound database, and there may be one or several of them. Seed compounds may also be screened specifically according to different design requirements, which is not limited here.
In the embodiments provided in this application, dimensionality reduction is also applied to the seed compounds, reducing the complex chemical space to one-dimensional vectors. Specifically, the representation of a compound by its molecular structural formula is converted into a representation by a vector. With this reduction, the design algorithm based on the genetic algorithm simplifies to operations between vectors, so the chemical space can be searched more conveniently and efficiently.
S130: Perform a crossover operation and/or a mutation operation on the seed vector based on the genetic algorithm to obtain a derivative vector.
Crossover and mutation operations between vectors can simulate natural selection, inheritance, mutation, and evolution in nature and produce new vectors (that is, derivative vectors), and hence new compounds, thereby diversifying the compounds.
S150: Process the derivative vector to obtain a derivative compound.
After the operations between vectors are complete, the dimensionality of the result is raised again. Specifically, the representation of a compound by a vector is converted into a representation by molecular structure, giving the specific structural formula of the compound and thereby determining the derivative compound.
In this embodiment, compounds are developed and designed based on a genetic algorithm, which enlarges the explorable compound space, yields diverse compounds, and widens the range of choices. Furthermore, the complex chemical space is reduced to one-dimensional vectors for the computation, so the design algorithm can search the chemical space conveniently and efficiently.
In one embodiment, the present application may use a neural network that takes a chemical structure as both input and output, and extract the vector output by an intermediate layer as a one-dimensional representation of the chemical structure. In other words, a neural network model can be used to reduce and restore the dimensionality of compounds.
An autoencoder can be trained to obtain a molecular structure encoding model and a molecular structure decoding model. The encoding model reduces the dimensionality of a chemical structure by encoding it into a vector; the decoding model raises the dimensionality of a vector by decoding it into a chemical structure.
An autoencoder is a deep-learning neural network that is trained so that its output equals its input. It first compresses the input into a latent space and then reconstructs the output by decoding, so that the output is the same as the input. Specifically, an autoencoder mainly comprises an encoding layer, a latent-vector layer, and a decoding layer. The encoding layer contains a number of neurons and can convert a large, sparse matrix into a dense one-dimensional vector of floating-point numbers (the vector in the latent-vector layer). The decoding layer also contains a number of neurons and can decode the dense one-dimensional vector into a large, sparse matrix.
In the training phase, a neural network that accepts large, sparse matrices is built first. The input first passes through an embedding layer and is converted into continuous-valued vectors. These vectors pass through a combination of linear and nonlinear transformations to give a latent vector. The latent vector then passes through further linear and nonlinear transformations and is decoded into a large, sparse matrix. Because the parameters of these transformations are initially random or inaccurate, the decoded matrix is very likely to differ greatly from the original matrix. A metric is therefore used to measure the difference between the decoded matrix and the original matrix, the network parameters are updated by backpropagation according to that difference, the updated network is used to generate a new large, sparse matrix, the difference from the original matrix is computed again, and the parameters are updated again. This is repeated for many rounds until the difference has decreased and stabilized (that is, further rounds no longer reduce it). After such training, a large, sparse matrix fed into the network is restored as an almost identical large, sparse matrix.
In one embodiment, a chemical structure can be one-hot encoded and thereby represented as a matrix. The neural network described above can therefore be used to reduce and restore the dimensionality of compounds, and the training method described above can be used to train the molecular structure encoding model and the molecular structure decoding model.
Please refer to FIG. 2, which is a schematic flowchart of training a molecular structure model in an embodiment of the present application. Note that, provided substantially the same result is obtained, this embodiment is not limited to the order of steps shown in FIG. 2. As shown in FIG. 2, this embodiment includes:
S210: Acquire a sample matrix.
The sample matrix is the matrix representation of a sample compound.
A compound library can be downloaded from the Internet, and valid compounds can be extracted from it as sample compounds. The sample compounds can be screened; for example, chiral compounds, salts, uncommon molecules, molecules with too many heavy atoms, and inorganic substances can be removed. Different screening rules can be set for different requirements, which is not limited here.
Once the sample compounds have been selected, they are converted into SMILES strings. SMILES (Simplified Molecular Input Line Entry System) is a specification for describing molecular structures unambiguously with ASCII strings. A chemical structure can be written as a SMILES string according to an established set of rules. For example, pyrimidine can be written as the SMILES string "c1ccncn1". The string can be regarded as a sentence made up of several words; the pyrimidine string above can be regarded as made up of the three words c, 1, and n. These words can be converted by one-hot encoding into vectors consisting only of 0s and 1s, so that the string is converted into a matrix representation, giving the sample matrix.
Taking pyrimidine as an example, its SMILES string "c1ccncn1" can be regarded as composed of the three words c, 1, and n. These words are unordered and discrete. They are treated as three states, each represented by a vector of 0s and 1s. For example, if the first position stands for c, the second for 1, and the third for n, the three words are represented as [1,0,0], [0,1,0], and [0,0,1], where 1 means the word is present and 0 means it is absent. The pyrimidine structure is then represented as the two-dimensional matrix [[1,0,0],[0,1,0],[1,0,0],[1,0,0],[0,0,1],[1,0,0],[0,0,1],[0,1,0]]. Here, one dimension of the matrix corresponds to the length of each word vector and the other to the length of the string: for pyrimidine, each word vector has length 3 and the whole string has length 8. Encoded in this way, the pyrimidine structure is converted into a form a computer can process.
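The pyrimidine encoding above can be reproduced with a short Python sketch. The function name `one_hot_encode` and the constant `VOCAB` are illustrative, not taken from the application, and the sketch assumes every word is a single character (two-letter atoms such as Cl or Br would need a tokenizer first).

```python
def one_hot_encode(smiles, vocab):
    """Return one row per character, with a single 1 at that character's vocabulary index."""
    index = {token: i for i, token in enumerate(vocab)}
    matrix = []
    for ch in smiles:
        row = [0] * len(vocab)
        row[index[ch]] = 1
        matrix.append(row)
    return matrix


# Word order used in the pyrimidine example: position 1 is c, 2 is "1", 3 is n.
VOCAB = ["c", "1", "n"]
pyrimidine = one_hot_encode("c1ccncn1", VOCAB)
```

The result has 8 rows (the string length) of 3 entries each (the word-vector length), matching the matrix given in the text.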
After all compounds in the sample compound set have been converted into SMILES strings, special characters such as "$" and "#" can be added at the beginning and end of each SMILES string to mark its start and end and to separate different strings; the strings can also be deduplicated. The SMILES strings in the sample set can be uniformly encoded as m*n matrices (m words, each with a word vector of length n). Here m is the length of the longest SMILES string in the set; a SMILES string with fewer than m words is still represented as an m*n matrix, with the missing elements all filled with 0. Similarly, n is determined by the longest word (word vector) in the set.
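A minimal sketch of this batching step is given below, under some assumptions that are not in the application: words are single characters, the vocabulary is built from the data with the two markers placed first, and the names `encode_batch`, `START`, and `END` are hypothetical.

```python
START, END = "$", "#"  # start and end markers, as in the text


def encode_batch(smiles_list):
    """Deduplicate, add markers, and one-hot encode every string to the same m*n shape."""
    unique = list(dict.fromkeys(smiles_list))  # removes duplicates, keeps first-seen order
    for s in unique:
        if START in s or END in s:
            raise ValueError("marker character already occurs in: " + s)
    vocab = [START, END] + sorted({ch for s in unique for ch in s})
    index = {token: i for i, token in enumerate(vocab)}
    wrapped = [START + s + END for s in unique]
    m = max(len(s) for s in wrapped)  # length of the longest marked string
    n = len(vocab)                    # word-vector length
    batch = []
    for s in wrapped:
        rows = [[1 if index[ch] == j else 0 for j in range(n)] for ch in s]
        rows += [[0] * n for _ in range(m - len(s))]  # zero padding up to m rows
        batch.append(rows)
    return batch, vocab
```

Note that "#" is also the SMILES symbol for a triple bond, so this sketch rejects inputs that contain it; a real pipeline would probably pick an end marker outside the SMILES alphabet.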
S230: Input the sample matrix into the encoding layer of the autoencoder and encode it to obtain a sample vector, where the sample vector is the feature-vector representation of the sample compound.
S250: Input the sample vector into the decoding layer of the autoencoder and decode it to obtain a prediction matrix.
S270: Calculate the loss between the prediction matrix and the sample matrix.
S290: Iteratively update the parameters of the autoencoder based on the loss until the loss stabilizes, obtaining the molecular structure encoding model and the molecular structure decoding model.
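Steps S230 to S290 can be written compactly as follows. The application does not name the loss or the optimizer, so two assumptions are made here: the loss is the cross-entropy between the one-hot sample matrix $X$ and a prediction matrix $\hat{X}$ whose rows are probability distributions, and the update is plain gradient descent with learning rate $\eta$. The symbols $f_\theta$ (encoding layer), $g_\phi$ (decoding layer), and $z$ (sample vector) are introduced only for this sketch.

```latex
\begin{aligned}
z &= f_\theta(X), \qquad \hat{X} = g_\phi(z), \\
\mathcal{L}(\theta, \phi) &= -\sum_{i=1}^{m} \sum_{j=1}^{n} X_{ij} \log \hat{X}_{ij}, \\
(\theta, \phi) &\leftarrow (\theta, \phi) - \eta \, \nabla_{(\theta, \phi)} \mathcal{L}(\theta, \phi).
\end{aligned}
```

Training stops when $\mathcal{L}$ no longer decreases; $f_\theta$ is then used as the encoding model and $g_\phi$ as the decoding model.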
The encoding layer of the trained autoencoder can serve as the molecular structure encoding model, and the decoding layer of the trained autoencoder can serve as the molecular structure decoding model.
In one embodiment, the autoencoder may further include an input layer that converts a compound in chemical-structure form into matrix form. The input layer and the encoding layer together then serve as the molecular structure encoding model, which takes a compound in molecular-structure form as input and outputs the encoded compound in vector form.
In one embodiment, the autoencoder may further include an output layer that converts a compound in matrix form into chemical-structure form. This conversion is the inverse of converting a chemical structure into a matrix; see the description above, which is not repeated here. The output layer and the decoding layer together then serve as the molecular structure decoding model, which takes a compound in vector form as input and outputs the decoded compound in molecular-structure form.
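The inverse conversion performed by the output layer might look like the sketch below. It assumes, beyond what the application states, that each row is turned into a word by taking the position of its largest entry, that all-zero rows are padding, and that "$" and "#" mark start and end; `matrix_to_smiles` is a hypothetical name.

```python
def matrix_to_smiles(matrix, vocab, start="$", end="#"):
    """Turn a one-hot (or probability) matrix back into a SMILES string."""
    chars = []
    for row in matrix:
        if not any(row):  # an all-zero row is padding: nothing more to read
            break
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest entry
        token = vocab[best]
        if token == start:
            continue
        if token == end:
            break
        chars.append(token)
    return "".join(chars)
```

Because the largest entry is used rather than an exact 1, the same function accepts both an exact one-hot matrix and the approximate matrix a decoder would produce.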
Please refer to FIG. 3, which is a schematic flowchart of another compound design method in an embodiment of the present application. Note that, provided substantially the same result is obtained, this embodiment is not limited to the order of steps shown in FIG. 3. As shown in FIG. 3, this embodiment can combine the molecular structure encoding model, the molecular structure decoding model, and the genetic algorithm to design compounds, and specifically includes:
S310: Acquire a seed vector.
A seed compound can be selected from a compound database and its SMILES string obtained; the SMILES string is one-hot encoded to obtain a seed matrix, which is the matrix representation of the seed compound; and the seed matrix is input into the molecular structure encoding model and encoded to obtain the seed vector. See the description above for details, which are not repeated here.
S330: Perform a crossover operation on the seed vectors based on the genetic algorithm to obtain derivative vectors.
In the crossover operation, two seed vectors are selected from the seed vector set, and exchange positions (one or more) are selected in one of them. The seed vectors and the exchange positions may be selected at random or according to a set selection rule. The values at the selected exchange positions of this seed vector are swapped with the values at the corresponding positions of the other seed vector. For example, given the two vectors [0.1,0.2,0.3] and [0.4,0.5,0.6], swapping the first position gives the two new vectors [0.4,0.2,0.3] and [0.1,0.5,0.6]. As another example, swapping the first and third positions of the two vectors gives the two new vectors [0.4,0.2,0.6] and [0.1,0.5,0.3].
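The crossover described above can be sketched in a few lines of Python. The function names are illustrative; positions are 0-based here, so the "first position" of the text is index 0, and the random variant is just one possible selection rule.

```python
import random


def crossover(parent_a, parent_b, positions):
    """Swap the values at the given 0-based positions and return two new vectors."""
    child_a, child_b = list(parent_a), list(parent_b)
    for p in positions:
        child_a[p], child_b[p] = child_b[p], child_a[p]
    return child_a, child_b


def random_crossover(parent_a, parent_b, n_positions, rng):
    """Pick n_positions distinct exchange positions at random, then swap them."""
    positions = rng.sample(range(len(parent_a)), n_positions)
    return crossover(parent_a, parent_b, positions)
```

Calling `crossover([0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0])` reproduces the first example in the text, and passing `[0, 2]` reproduces the second.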
S350: Perform a mutation operation on the seed vectors based on the genetic algorithm to obtain derivative vectors.
In the mutation operation, several seed vectors are selected from the seed vector set (the proportion of vectors to mutate can be specified in advance), and mutation positions (one or more) are selected in these seed vectors. The seed vectors and the mutation positions may be selected at random or according to a set selection rule. The values at the mutation positions are replaced with new values, which may be arbitrary random values or preset values. For example, given the vector [0.1,0.2,0.3], selecting the first position and replacing its value with a random value gives the new vector [0.5,0.2,0.3]. As another example, selecting the first and second positions and replacing their values with new random values gives the new vector [0.2,0.4,0.3].
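A corresponding sketch of the mutation operation follows. The function names, the 0-based positions, and the sampling range `low` to `high` for replacement values are assumptions made for illustration; the application only says the new value may be random or preset.

```python
import random


def mutate(vector, positions, new_values):
    """Return a copy of vector with the values at the given 0-based positions replaced."""
    child = list(vector)
    for p, value in zip(positions, new_values):
        child[p] = value
    return child


def random_mutation(population, rate, n_positions, rng, low=-1.0, high=1.0):
    """Mutate a pre-specified proportion of the population at random positions."""
    n_mutants = max(1, round(rate * len(population)))
    chosen = rng.sample(range(len(population)), n_mutants)
    mutants = []
    for i in chosen:
        positions = rng.sample(range(len(population[i])), n_positions)
        values = [rng.uniform(low, high) for _ in positions]  # assumed value range
        mutants.append(mutate(population[i], positions, values))
    return mutants
```

With explicit values, `mutate([0.1, 0.2, 0.3], [0], [0.5])` and `mutate([0.1, 0.2, 0.3], [0, 1], [0.2, 0.4])` give the two example vectors from the text.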
Both crossover and mutation serve to generate new vectors (that is, derivative vectors), derive more vectors, and in turn derive more compounds. They simulate genetic evolution and can increase the diversity of compounds. Crossover and mutation can be performed simultaneously or in reverse order, or only one of them can be performed; that is, steps S330 and S350 are illustrative only, either one can be performed selectively, and their order can be swapped, which is not limited here.
S370: Process the derivative vector to obtain a derivative compound.
The derivative vector is input into the molecular structure decoding model and decoded to obtain a derivative matrix; the derivative matrix is then converted to obtain a derivative molecular structure, from which the derivative compound can be determined. See the description above for details, which are not repeated here.
S390: Measure the fitness of the derivative compounds with a fitness function and, according to the fitness values, select candidate compounds from the derivative compounds.
Fitness is the measure used to evaluate a derivative compound, for example whether the structure has good solubility or good activity. In this way the derivative compounds are linked to the criteria for judging quality; that is, a fitness function is constructed.
The genetic algorithm simulates the process of evolution. As described above, each derivative compound has an evaluation value that represents its ability to adapt during evolution. For example, molecules with poor solubility or poor activity are more likely to be eliminated.
The evaluation criterion is defined by the user, who can set it (the fitness function) according to the properties required of the compound to be designed. Suppose the user wants a compound with a sufficiently large molecular weight. Several thousand derivative vectors are generated at random, each is converted into a compound as described above, and the molecular weight of each compound is calculated. This molecular weight is the user-defined fitness. The compounds are sorted by molecular weight in descending order, and the top-ranked candidate compound or batch of candidate compounds is selected according to parameters predefined by the user (the proportion or number selected each time).
In one embodiment, multiple rounds of crossover and mutation can be performed iteratively to obtain more derivative compounds, from which the desired candidate compounds are then selected. Specifically, selecting candidate compounds from the derivative compounds according to fitness may include: step S1, selecting from the derivative compounds, according to fitness, target compounds whose fitness satisfies a preset condition; and step S2, taking the derivative vectors corresponding to the target compounds as seed vectors and again performing step S330 and/or S350 through the fitness measurement of step S390. Steps S1 and S2 are iterated until an iteration termination condition is met, at which point the loop ends; all derivative compounds obtained are sorted by fitness in descending order; and a predetermined proportion or predetermined number of derivative compounds with the best fitness are selected as candidate compounds.
The target compounds satisfying the preset condition may be a fixed number of compounds (for example 10, 30, or 50) with the highest fitness among the derivative compounds; a fixed proportion (for example 1/10, 1/5, or 1/3) of the derivative compounds taken from the top of the fitness ranking; or the derivative compounds whose fitness exceeds a fixed threshold. The number, proportion, and condition used to select target compounds can be set as required and are not described further here.
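The three preset conditions can each be expressed as a small selection function. In this sketch, `scored` is assumed to be a list of `(compound, fitness)` pairs; the function names are hypothetical.

```python
def select_top_k(scored, k):
    """Keep the k pairs with the highest fitness."""
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


def select_top_fraction(scored, fraction):
    """Keep a fixed fraction from the top of the fitness ranking (at least one pair)."""
    k = max(1, int(len(scored) * fraction))
    return select_top_k(scored, k)


def select_above_threshold(scored, threshold):
    """Keep every pair whose fitness exceeds a fixed threshold, in the original order."""
    return [item for item in scored if item[1] > threshold]
```

Which rule is appropriate depends on the fitness function: a fixed count keeps the population size constant from round to round, whereas a threshold can return very few or very many target compounds.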
For example, after one pass has been completed and the fitness of the derivative compounds obtained, the derivative compounds can be sorted by fitness in descending order, the top-ranked target compounds selected, and crossover and mutation applied to the derivative vectors of these target compounds to generate a new batch of one-dimensional derivative vectors. These new derivative vectors are input into the molecular structure decoding model, decoded into new matrices, and converted into new derivative compounds, whose fitness is then calculated. These derivative compounds are again sorted by fitness in descending order, the top-ranked target compounds are selected, and crossover and mutation again generate new one-dimensional derivative vectors. The loop iterates in this way, and all derivative compounds generated are recorded. The candidate compounds with the best fitness among all generated derivative compounds are taken as the final result.
The number of iteration rounds can depend on the chosen parameters and the characteristics of the data set. The iteration termination condition may be a preset number of iterations, ranging from tens to hundreds, for example 200 to 400 rounds. It may also be a preset running time, such as 8, 12, 24, or 48 hours.
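The iterative loop, with both termination conditions, might be organized as in the sketch below. It is not the applicant's implementation: `decode` and `fitness` are caller-supplied stand-ins for the decoding model and the fitness function, and `evolve`, `mutation_rate`, `low`, `high`, and the one-position crossover and mutation are simplifying assumptions. At least two parent vectors are needed.

```python
import random
import time


def evolve(seed_vectors, decode, fitness, n_keep, max_rounds, rng,
           max_seconds=None, mutation_rate=0.2, low=-1.0, high=1.0):
    """Run crossover, mutation, decoding, and selection; return every result, best first."""
    history = []  # all (compound, fitness, vector) triples generated so far
    parents = [list(v) for v in seed_vectors]
    started = time.monotonic()
    for _ in range(max_rounds):  # termination by round count
        if max_seconds is not None and time.monotonic() - started > max_seconds:
            break  # termination by running time
        children = []
        for _ in range(len(parents)):  # crossover: swap one random position
            a, b = (list(v) for v in rng.sample(parents, 2))
            p = rng.randrange(len(a))
            a[p], b[p] = b[p], a[p]
            children += [a, b]
        for child in children:  # mutation: replace one random position
            if rng.random() < mutation_rate:
                child[rng.randrange(len(child))] = rng.uniform(low, high)
        scored = []
        for vector in children:
            compound = decode(vector)
            scored.append((compound, fitness(compound), vector))
        history.extend(scored)
        scored.sort(key=lambda item: item[1], reverse=True)
        parents = [vector for _, _, vector in scored[:n_keep]]  # next round's seeds
    history.sort(key=lambda item: item[1], reverse=True)
    return history


# Toy stand-ins (hypothetical): the "compound" is the rounded vector, fitness is its sum.
toy_decode = lambda v: tuple(round(x, 3) for x in v)
toy_fitness = sum
```

The final candidate compounds are then simply the first entries of the returned list, taken by count or by proportion.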
Of course, iteration is optional; the desired candidate compounds may be obtained after a single pass.
In the solutions provided by the above embodiments, a neural network reduces the complex chemical space to one-dimensional vectors, so the design algorithm can search the chemical space conveniently and efficiently; and combining the chemical space with a genetic algorithm overcomes the problem that the compounds generated by a molecular generation model become progressively uniform after reinforcement learning or transfer learning.
The solution of the present application is described below through several specific experimental examples, which should not be taken as unduly limiting the application.
The latest ChEMBL28 database can be downloaded from the Internet and the SMILES strings of its compounds extracted. Sample compound structures must contain only hydrogen, carbon, nitrogen, oxygen, fluorine, sulfur, chlorine, and bromine atoms; must not be chiral compounds, inorganic substances, or salt ions; and are restricted to at most 70 heavy atoms. These SMILES strings are converted to canonical form. After deduplication, about 1.8 million SMILES strings are obtained. A neural network is trained on these SMILES strings, and the experimental examples are built on this neural network.
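A rough, string-level approximation of these filters is sketched below using only the Python standard library. It is a heuristic, not the applicant's procedure: chirality is detected by the "@" marker, salts by the "." separator, and "inorganic" is approximated as "contains no carbon". Canonicalization is not attempted, since it needs a cheminformatics toolkit. All names are hypothetical.

```python
import re

ALLOWED = {"H", "C", "N", "O", "F", "S", "Cl", "Br"}
MAX_HEAVY_ATOMS = 70
# Element symbol at the start of a bracket atom, after an optional isotope number.
BRACKET_ATOM = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]")
# Atoms that SMILES allows outside brackets (organic subset and aromatic forms).
ORGANIC_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def element_symbols(smiles):
    """List the element symbol of every atom written in the string."""
    symbols = [m.group(1).capitalize() for m in BRACKET_ATOM.finditer(smiles)]
    outside = BRACKET_ATOM.sub("", smiles)
    symbols += [t.capitalize() for t in ORGANIC_ATOM.findall(outside)]
    return symbols


def keep(smiles):
    """Apply the element, chirality, salt, carbon, and heavy-atom filters."""
    if "@" in smiles or "." in smiles:  # chiral centre, or disconnected parts (salts)
        return False
    symbols = element_symbols(smiles)
    if not set(symbols) <= ALLOWED or "C" not in symbols:
        return False
    heavy = sum(1 for s in symbols if s != "H")
    return heavy <= MAX_HEAVY_ATOMS


def clean(smiles_list):
    """Filter and deduplicate while keeping first-seen order."""
    return list(dict.fromkeys(s for s in smiles_list if keep(s)))
```

In practice a toolkit that parses SMILES into a molecular graph would give more reliable atom counts and stereochemistry checks than this string matching.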
Experimental Example 1
Protein kinase B, also known as AKT, is a serine/threonine-specific protein kinase. It plays an important regulatory role in cellular processes such as apoptosis, proliferation, and migration. AKT1 takes part in the cell survival pathway through the apoptosis process, blocking apoptosis and promoting cell survival. Clinical studies have found that AKT is overexpressed in many human tumors, such as gastric cancer and pancreatic cancer. AKT inhibitors can inhibit AKT activity and promote apoptosis of cancer cells.
Compound 1 is an AKT inhibitor in clinical research. Its interaction mode was analyzed and a pharmacophore model was built to evaluate how well a molecule matches the pharmacophore; this match was used as the fitness criterion in the search for new molecules.
Specifically, a first predetermined number of seed compounds were picked at random from the compound database and input into the molecular structure encoding model to obtain seed vectors; crossover and mutation operations were performed on the seed vectors based on the genetic algorithm to obtain multiple derivative vectors; the derivative vectors were input into the molecular structure decoding model to obtain multiple derivative compounds; the pharmacophore model described above was used to evaluate how well each derivative compound matches the pharmacophore, giving its fitness; and a second predetermined number of derivative vectors were then selected by fitness to serve as seed vectors for further crossover and mutation. After 300 rounds of this iterative loop, a batch of new compounds was obtained, as follows:
Figure PCTCN2021129381-appb-000001
Experimental Example 2
Clinical studies have found that human isocitrate dehydrogenase 1 (IDH1) is mutated in a variety of malignant tumors, such as glioma. Mutated IDH1 can convert α-ketoglutarate into 2-hydroxyglutarate. The latter is a carcinogen that accumulates in the body and promotes further progression of the cancer. Clinical trials have shown that drugs that inhibit the activity of mutant IDH1 can effectively lower the concentration of 2-hydroxyglutarate in the body and relieve cancer symptoms.
Compound 2 is currently the most promising inhibitor of mutant IDH1 under study. It was used as the template molecule: for each generated molecule, the similarity to the template molecule (measured with molecular fingerprints) was calculated as the molecule's fitness, and a batch of similar molecules was searched out of the latent space.
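The application says only that similarity is measured with molecular fingerprints. One widely used measure for fingerprints is the Tanimoto coefficient, and the sketch below assumes it; fingerprints are represented as sets of "on" bit indices, and generating them from a structure is left to a cheminformatics toolkit. The names `tanimoto` and `similarity_fitness` are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:  # two empty fingerprints: treated as dissimilar by convention here
        return 0.0
    return len(a & b) / len(union)


def similarity_fitness(template_fp):
    """Build a fitness function that scores a fingerprint against the template."""
    return lambda fp: tanimoto(fp, template_fp)
```

A fitness of 1.0 then means the generated molecule has the same fingerprint as the template, and values near 0 mean the two share almost no features.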
Specifically, a third predetermined number of seed compounds were picked at random from the compound database and input into the molecular structure encoding model to obtain seed vectors; crossover and mutation operations were performed on the seed vectors based on the genetic algorithm to obtain multiple derivative vectors; the derivative vectors were input into the molecular structure decoding model to obtain multiple derivative compounds; the similarity between each derivative compound and the template molecule was calculated, giving the fitness of the derivative compound; and a fourth predetermined number of derivative vectors were then selected by fitness to serve as seed vectors for further crossover and mutation. After 380 rounds of this iterative loop, a batch of new compounds was obtained, as follows:
Figure PCTCN2021129381-appb-000002
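The iteration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: `decode` and `fingerprint` are hypothetical stand-ins for the molecular structure decoding model and a molecular fingerprint function, Tanimoto similarity is assumed as the fingerprint-based similarity measure, and the Gaussian mutation step size of 0.1 is arbitrary.

```python
import random


def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def evolve(seeds, decode, fingerprint, template_fp, rounds, keep, offspring, rng):
    """Run crossover + mutation for `rounds` iterations (keep must be >= 2).

    `decode` and `fingerprint` are hypothetical callables supplied by the caller.
    Returns all scored derivative vectors, best fitness first.
    """
    population = list(seeds)
    scored = {}
    for _ in range(rounds):
        children = []
        for _ in range(offspring):
            p1, p2 = rng.sample(population, 2)
            child = list(p1)
            i = rng.randrange(len(child))
            child[i] = p2[i]                  # crossover: copy one position from p2
            j = rng.randrange(len(child))
            child[j] += rng.gauss(0.0, 0.1)   # mutation: perturb one position (assumed step)
            children.append(tuple(child))
        for vec in children:
            # Fitness = similarity between the decoded molecule and the template.
            scored[vec] = tanimoto(fingerprint(decode(vec)), template_fp)
        # The fittest derivative vectors become the next round's seed vectors.
        population = sorted(children, key=scored.get, reverse=True)[:keep]
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

In the embodiment, `rounds` would be 380, and `len(seeds)` and `keep` would correspond to the third and fourth predetermined numbers respectively.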
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of the compound design apparatus in an embodiment of the present application. In this embodiment, the compound design apparatus 40 includes an acquisition module 41, an operation module 42 and a decoding module 43.
The acquisition module 41 is used to obtain a seed vector, the seed vector being a feature vector representation of a seed compound; the operation module 42 is used to perform a crossover operation and/or a mutation operation on the seed vector based on a genetic algorithm to obtain a derivative vector; the decoding module 43 is used to process the derivative vector to obtain a derivative compound. In this way, the apparatus designs compounds based on a genetic algorithm, which enlarges the compound space that can be explored, yields diversified compounds, and widens the range of choices. Further, the complex chemical space is reduced to one-dimensional vectors during the operation, so that the design algorithm can search the chemical space conveniently and efficiently. For the specific execution process, refer to the description of the above embodiments, which is not repeated here.
Further, the compound design apparatus 40 also includes a selection module (not shown in the figure), which is used to measure the fitness of each derivative compound based on a fitness function and to select candidate compounds from the derivative compounds according to the fitness.
Specifically, the selection module selects candidate compounds from the derivative compounds according to the fitness as follows. Step S1: according to the fitness, select from the derivative compounds a target compound whose fitness satisfies a preset condition. Step S2: take the derivative vector corresponding to the target compound as the seed vector, and continue to perform the steps from performing the crossover operation and/or mutation operation on the seed vector based on the genetic algorithm to obtain a derivative vector, through measuring the fitness of the derivative compounds based on the fitness function. Steps S1 and S2 are iterated until an iteration termination condition is satisfied, at which point the iteration ends. All of the derivative compounds obtained are then sorted in descending order of fitness, and a predetermined proportion or a predetermined number of the derivative compounds with the best fitness are selected as candidate compounds. In this way, more candidate compounds can be obtained, and better compounds are easier to screen out. For the specific execution process, refer to the description of the above embodiments, which is not repeated here.
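The final ranking step can be illustrated with a short sketch. The function name `select_candidates` and its parameters are hypothetical; the only behaviour taken from the text is sorting in descending order of fitness and keeping either a predetermined number or a predetermined proportion.

```python
def select_candidates(fitness_by_compound, count=None, fraction=None):
    """Return the best compounds by fitness, highest first.

    Exactly one of `count` (predetermined number) or `fraction`
    (predetermined proportion, 0..1) is expected.
    """
    ranked = sorted(fitness_by_compound.items(), key=lambda kv: kv[1], reverse=True)
    if count is None:
        if fraction is None:
            raise ValueError("give either count or fraction")
        count = int(len(ranked) * fraction)  # truncates toward zero
    return [compound for compound, _ in ranked[:count]]
```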
Further, the operation module 42 includes a crossover operation submodule (not shown in the figure), which is used to select two seed vectors from the seed vector set, select an exchange position in one of the seed vectors, and exchange the value at that exchange position with the value at the corresponding position of the other seed vector. This yields a new derivative vector, from which a derivative compound can be obtained, thereby enriching the derivative vectors and increasing the diversity of the derivative compounds. For the specific execution process, refer to the description of the above embodiments, which is not repeated here.
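A minimal sketch of this single-position exchange is shown below. The function name and the choice to return both resulting vectors are assumptions made for illustration; the text only specifies that the values at one position are swapped between two seed vectors.

```python
import random


def crossover(seed_a, seed_b, position=None, rng=random):
    """Swap the value at one position between two equal-length seed vectors.

    If `position` is None, an exchange position is drawn at random.
    The inputs are left unmodified; two new vectors are returned.
    """
    if len(seed_a) != len(seed_b):
        raise ValueError("seed vectors must have the same length")
    if position is None:
        position = rng.randrange(len(seed_a))
    child_a, child_b = list(seed_a), list(seed_b)
    child_a[position], child_b[position] = seed_b[position], seed_a[position]
    return child_a, child_b
```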
Further, the operation module 42 includes a mutation operation submodule (not shown in the figure), which is used to select a seed vector from the seed vector set, select a mutation position in the selected seed vector, and replace the value at the mutation position with a new value. This yields a new derivative vector, from which a derivative compound can be obtained, thereby enriching the derivative vectors and increasing the diversity of the derivative compounds. For the specific execution process, refer to the description of the above embodiments, which is not repeated here.
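The mutation step can be sketched in the same style. How the new value is chosen is not stated in this passage, so the Gaussian perturbation used as the default here is purely an assumption, as are the function and parameter names.

```python
import random


def mutate(seed, position=None, new_value=None, rng=random):
    """Replace the value at one position of a seed vector with a new value.

    If `position` is None, a mutation position is drawn at random. If
    `new_value` is None, the old value plus Gaussian noise is used (an
    assumed choice; the source does not specify how the new value is drawn).
    """
    if position is None:
        position = rng.randrange(len(seed))
    if new_value is None:
        new_value = seed[position] + rng.gauss(0.0, 0.1)
    child = list(seed)
    child[position] = new_value
    return child
```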
Further, the decoding module 43 is used to input the derivative vector into a molecular structure decoding model, which decodes the derivative vector to obtain a derivative molecular structure, the molecular structure decoding model being a neural network model; and to obtain the derivative compound from the derivative molecular structure.
Further, the compound design apparatus 40 also includes an encoding module (not shown in the figure), which is used to obtain the SMILES string of a seed compound; one-hot encode the SMILES string of the seed compound to obtain a seed matrix, the seed matrix being a matrix representation of the seed compound; and input the seed matrix into the molecular structure encoding model, which encodes the seed matrix to obtain the seed vector.
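One-hot encoding of a SMILES string into a seed matrix might look like the sketch below. It is deliberately simplified: tokenization is character by character (so two-letter atoms such as Cl or Br would be split), and the alphabet, the space padding character and the fixed length `max_len` are all assumptions for illustration.

```python
def one_hot_smiles(smiles, alphabet, max_len):
    """Encode a SMILES string as a max_len x len(alphabet) one-hot matrix.

    `alphabet` must contain every character of `smiles` plus the space
    character, which is used here as padding (an assumed convention).
    """
    if len(smiles) > max_len:
        raise ValueError("SMILES string longer than max_len")
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = []
    for ch in smiles.ljust(max_len):  # right-pad with spaces
        row = [0] * len(alphabet)
        row[index[ch]] = 1            # exactly one 1 per row
        matrix.append(row)
    return matrix
```

For example, with the alphabet `" CO"` and `max_len=4`, ethanol (`"CCO"`) becomes a 4 x 3 matrix whose last row encodes the padding character.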
Further, the compound design apparatus 40 also includes a model training module (not shown in the figure), which is used to obtain a sample matrix, the sample matrix being a matrix representation of a sample compound; input the sample matrix into the encoding layer of an autoencoder, which encodes it to obtain a sample vector, the sample vector being a feature vector representation of the sample compound; input the sample vector into the decoding layer of the autoencoder, which decodes it to obtain a prediction matrix; calculate the loss between the prediction matrix and the sample matrix; and iteratively update the parameters of the autoencoder based on the loss until the loss is stable. The decoding layer and output layer of the trained autoencoder are then used as the molecular structure decoding model, where the output layer converts a compound represented as a matrix into a molecular structure representation, and the encoding layer of the trained autoencoder is used as the molecular structure encoding model.
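The text does not say how "until the loss is stable" is decided. One possible stopping rule, given only as an assumed example, is to stop once the loss has varied by no more than a small tolerance over the most recent training iterations; the window size and tolerance below are arbitrary.

```python
def loss_is_stable(losses, window=5, tolerance=1e-4):
    """Return True when the last `window` losses differ by at most `tolerance`.

    `losses` is the history of reconstruction losses, oldest first.
    Both `window` and `tolerance` are assumed, illustrative defaults.
    """
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return max(recent) - min(recent) <= tolerance
```

A training loop would append each new loss to the history and stop updating the autoencoder parameters once this check returns True.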
The compound design apparatus may be a standalone server, a server cluster, or a module of a server. It can be used for model training and for running the genetic algorithm, and thus for designing compounds.
Referring to FIG. 5, FIG. 5 is a schematic structural diagram of the compound design device in an embodiment of the present application. In this embodiment, the compound design device 10 includes a processor 11 and a memory 12.
The processor 11 may also be referred to as a CPU (Central Processing Unit). The processor 11 may be an integrated circuit chip with signal processing capability. The processor 11 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor 11 may be any conventional processor.
The compound design device 10 may further include the memory 12, which stores the instructions and data required for the operation of the processor 11.
The processor 11 is configured to execute the instructions to implement the method provided by any embodiment of the compound design method of the present application, or by any non-conflicting combination of those embodiments.
The compound design device may be a server, a desktop computer, a laptop computer, or the like. It can be used for model training and for running the genetic algorithm, and thus for designing compounds.
Referring to FIG. 6, FIG. 6 is a schematic structural diagram of the computer-readable storage medium in an embodiment of the present application. The computer-readable storage medium 20 of the embodiment of the present application stores instructions/program data 21 which, when executed, implement the method provided by any embodiment of the compound design method of the present application, or by any non-conflicting combination of those embodiments. The instructions/program data 21 may form a program file stored in the storage medium 20 in the form of a software product, so that a computer device (which may be a personal computer, a server, a network device, or the like) or a processor executes all or part of the steps of the methods of the embodiments of the present application. The storage medium 20 includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc, or a terminal device such as a computer, a server, a mobile phone or a tablet.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical, mechanical or of another form.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, may each exist physically on their own, or two or more of them may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above are only embodiments of the present application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present application.

Claims (22)

  1. A compound design method, comprising:
    obtaining a seed vector, the seed vector being a feature vector representation of a seed compound;
    performing a crossover operation and/or a mutation operation on the seed vector based on a genetic algorithm to obtain a derivative vector; and
    processing the derivative vector to obtain a derivative compound.
  2. The compound design method according to claim 1, wherein, after processing the derivative vector to obtain the derivative compound, the method further comprises:
    measuring the fitness of each derivative compound based on a fitness function; and
    selecting candidate compounds from the derivative compounds according to the fitness.
  3. The compound design method according to claim 2, wherein selecting candidate compounds from the derivative compounds according to the fitness comprises:
    step S1: according to the fitness, selecting from the derivative compounds a target compound whose fitness satisfies a preset condition;
    step S2: taking the derivative vector corresponding to the target compound as the seed vector, and continuing to perform the steps from performing the crossover operation and/or mutation operation on the seed vector based on the genetic algorithm to obtain a derivative vector, through measuring the fitness of each derivative compound based on the fitness function;
    iterating steps S1 and S2 until an iteration termination condition is satisfied, and then ending the iteration;
    sorting all of the derivative compounds obtained in descending order of fitness; and
    selecting a predetermined proportion or a predetermined number of the derivative compounds with the best fitness as the candidate compounds.
  4. The compound design method according to claim 1, wherein performing the crossover operation on the seed vector based on the genetic algorithm comprises:
    selecting two seed vectors from a seed vector set, selecting an exchange position in one of the seed vectors, and exchanging the value at the exchange position of that seed vector with the value at the corresponding position of the other seed vector.
  5. The compound design method according to claim 1, wherein performing the mutation operation on the seed vector based on the genetic algorithm comprises:
    selecting a seed vector from a seed vector set, selecting a mutation position in the selected seed vector, and replacing the value at the mutation position with a new value.
  6. The compound design method according to any one of claims 1-5, wherein processing the derivative vector to obtain the derivative compound comprises:
    inputting the derivative vector into a molecular structure decoding model and decoding the derivative vector to obtain a derivative molecular structure, the molecular structure decoding model being a neural network model; and
    obtaining the derivative compound from the derivative molecular structure.
  7. The compound design method according to claim 6, wherein, before inputting the derivative vector into the molecular structure decoding model and decoding the derivative vector to obtain the derivative molecular structure, the method further comprises:
    obtaining a sample matrix, the sample matrix being a matrix representation of a sample compound;
    inputting the sample matrix into an encoding layer of an autoencoder and encoding it to obtain a sample vector, the sample vector being a feature vector representation of the sample compound;
    inputting the sample vector into a decoding layer of the autoencoder and decoding it to obtain a prediction matrix;
    calculating a loss between the prediction matrix and the sample matrix; and
    iteratively updating parameters of the autoencoder based on the loss until the loss is stable, and using the decoding layer and an output layer of the trained autoencoder as the molecular structure decoding model, the output layer being used to convert a compound represented as a matrix into a molecular structure representation.
  8. The compound design method according to any one of claims 1-5, wherein obtaining the seed vector comprises:
    obtaining a SMILES string of the seed compound;
    one-hot encoding the SMILES string of the seed compound to obtain a seed matrix, the seed matrix being a matrix representation of the seed compound; and
    encoding the seed matrix to obtain the seed vector.
  9. The compound design method according to claim 8, wherein encoding the seed matrix to obtain the seed vector comprises:
    inputting the seed matrix into a molecular structure encoding model and encoding the seed matrix to obtain the seed vector.
  10. The compound design method according to claim 9, wherein, before inputting the seed matrix into the molecular structure encoding model and encoding the seed matrix to obtain the seed vector, the method further comprises:
    obtaining a sample matrix, the sample matrix being a matrix representation of a sample compound;
    inputting the sample matrix into an encoding layer of an autoencoder and encoding it to obtain a sample vector, the sample vector being a feature vector representation of the sample compound;
    inputting the sample vector into a decoding layer of the autoencoder and decoding it to obtain a prediction matrix;
    calculating a loss between the prediction matrix and the sample matrix; and
    iteratively updating parameters of the autoencoder based on the loss until the loss is stable, and using the encoding layer of the trained autoencoder as the molecular structure encoding model.
  11. A compound design apparatus, comprising:
    an acquisition module, configured to obtain a seed vector, the seed vector being a feature vector representation of a seed compound;
    an operation module, configured to perform a crossover operation and/or a mutation operation on the seed vector based on a genetic algorithm to obtain a derivative vector; and
    a decoding module, configured to process the derivative vector to obtain a derivative compound.
  12. The compound design apparatus according to claim 11, further comprising:
    a selection module, configured to measure the fitness of each derivative compound based on a fitness function, and to select candidate compounds from the derivative compounds according to the fitness.
  13. The compound design apparatus according to claim 12, wherein the selection module selecting candidate compounds from the derivative compounds according to the fitness comprises:
    step S1: according to the fitness, selecting from the derivative compounds a target compound whose fitness satisfies a preset condition;
    step S2: taking the derivative vector corresponding to the target compound as the seed vector, and continuing to perform the steps from performing the crossover operation and/or mutation operation on the seed vector based on the genetic algorithm to obtain a derivative vector, through measuring the fitness of each derivative compound based on the fitness function;
    iterating steps S1 and S2 until an iteration termination condition is satisfied, and then ending the iteration;
    sorting all of the derivative compounds obtained in descending order of fitness; and
    selecting a predetermined proportion or a predetermined number of the derivative compounds with the best fitness as the candidate compounds.
  14. The compound design apparatus according to claim 11, wherein the operation module comprises a crossover operation submodule, the crossover operation submodule being configured to select two seed vectors from a seed vector set, select an exchange position in one of the seed vectors, and exchange the value at the exchange position of that seed vector with the value at the corresponding position of the other seed vector.
  15. The compound design apparatus according to claim 11, wherein the operation module comprises a mutation operation submodule, the mutation operation submodule being configured to select a seed vector from a seed vector set, select a mutation position in the selected seed vector, and replace the value at the mutation position with a new value.
  16. The compound design apparatus according to any one of claims 11-15, wherein the decoding module is specifically configured to input the derivative vector into a molecular structure decoding model, decode the derivative vector to obtain a derivative molecular structure, and obtain the derivative compound from the derivative molecular structure; the molecular structure decoding model being a neural network model.
  17. The compound design apparatus according to claim 16, further comprising a model training module configured to:
    obtain a sample matrix, the sample matrix being a matrix representation of a sample compound;
    input the sample matrix into an encoding layer of an autoencoder and encode it to obtain a sample vector, the sample vector being a feature vector representation of the sample compound;
    input the sample vector into a decoding layer of the autoencoder and decode it to obtain a prediction matrix;
    calculate a loss between the prediction matrix and the sample matrix; and
    iteratively update parameters of the autoencoder based on the loss until the loss is stable, and use the decoding layer and an output layer of the trained autoencoder as the molecular structure decoding model, the output layer being used to convert a compound represented as a matrix into a molecular structure representation.
  18. The compound design apparatus according to any one of claims 11-15, further comprising an encoding module configured to:
    obtain a SMILES string of the seed compound;
    one-hot encode the SMILES string of the seed compound to obtain a seed matrix, the seed matrix being a matrix representation of the seed compound; and
    encode the seed matrix to obtain the seed vector.
  19. The compound design apparatus according to claim 18, wherein the encoding module encoding the seed matrix to obtain the seed vector comprises: inputting the seed matrix into a molecular structure encoding model and encoding the seed matrix to obtain the seed vector.
  20. The compound design apparatus according to claim 19, further comprising a model training module configured to:
    obtain a sample matrix, the sample matrix being a matrix representation of a sample compound;
    input the sample matrix into an encoding layer of an autoencoder and encode it to obtain a sample vector, the sample vector being a feature vector representation of the sample compound;
    input the sample vector into a decoding layer of the autoencoder and decode it to obtain a prediction matrix;
    calculate a loss between the prediction matrix and the sample matrix; and
    iteratively update parameters of the autoencoder based on the loss until the loss is stable, and use the encoding layer of the trained autoencoder as the molecular structure encoding model.
  21. A compound design device, comprising a memory and a processor, wherein the memory stores instructions and the processor is configured to execute the instructions to implement the compound design method according to any one of claims 1-10.
  22. A computer-readable storage medium, wherein the computer-readable storage medium is used to store instructions/program data, and the instructions/program data can be executed to implement the compound design method according to any one of claims 1-10.
PCT/CN2021/129381 2021-11-08 2021-11-08 Compound design method and apparatus, device, and computer readable storage medium WO2023077522A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/129381 WO2023077522A1 (en) 2021-11-08 2021-11-08 Compound design method and apparatus, device, and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2023077522A1 (en) 2023-05-11

Family

ID=86240609

Country Status (1)

Country Link
WO (1) WO2023077522A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046692A (en) * 2018-01-17 2019-07-23 三星电子株式会社 Generate method, neural network equipment and the computer readable recording medium of chemical structure
US20200168302A1 (en) * 2017-07-20 2020-05-28 The University Of North Carolina At Chapel Hill Methods, systems and non-transitory computer readable media for automated design of molecules with desired properties using artificial intelligence
CN112071373A (en) * 2020-09-02 2020-12-11 深圳晶泰科技有限公司 Drug molecule screening method and system
CN113409898A (en) * 2021-06-30 2021-09-17 北京百度网讯科技有限公司 Molecular structure acquisition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21963031

Country of ref document: EP

Kind code of ref document: A1