WO2023077522A1 - Compound design method and apparatus, device, and computer-readable storage medium - Google Patents

Compound design method and apparatus, device, and computer-readable storage medium

Info

Publication number
WO2023077522A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
seed
compound
matrix
sample
Prior art date
Application number
PCT/CN2021/129381
Other languages
English (en)
Chinese (zh)
Inventor
杨立君
Original Assignee
深圳晶泰科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳晶泰科技有限公司
Priority to PCT/CN2021/129381
Publication of WO2023077522A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C10/00: Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like

Definitions

  • the present application relates to the technical field of computational chemistry, in particular to a compound design method, device, equipment and computer-readable storage medium.
  • the technical problem mainly solved by this application is to provide a compound design method, device, equipment and computer-readable storage medium, which can increase the diversity of designed compounds.
  • a technical solution adopted by the present application is to provide a compound design method. The method includes: obtaining a seed vector, the seed vector being the feature vector representation of a seed compound; performing a crossover operation and/or a mutation operation on the seed vector based on a genetic algorithm to obtain a derivative vector; and processing the derivative vector to obtain a derivative compound.
  • the method further includes measuring the fitness of the derivative compound based on the fitness function; selecting candidate compounds from the derivative compound according to the degree of fitness.
  • selecting candidate compounds from the derivative compounds includes:
  • Step S1: according to the fitness, select from the derivative compounds the target compounds whose fitness meets a preset condition;
  • Step S2: using the derivative vectors corresponding to the target compounds as seed vectors, continue to perform the crossover operation and/or mutation operation on the seed vectors based on the genetic algorithm, obtain derivative vectors, and measure the fitness of the resulting derivative compounds based on the fitness function;
  • steps S1 and S2 are iterated until an iteration termination condition is met; all derivative compounds obtained are then sorted in descending order of fitness, and a predetermined ratio or a predetermined number of the derivative compounds with the best fitness are selected as candidate compounds.
  • the crossover operation on the seed vector based on the genetic algorithm includes: selecting two seed vectors from the seed vector set, selecting an exchange position in one of the seed vectors, and exchanging the value at that position with the value at the corresponding position of the other seed vector.
  • the mutation operation on the seed vector based on the genetic algorithm includes: selecting a seed vector from the seed vector set, selecting a mutation position from the selected seed vector, and replacing the value at the mutation position with a new value.
  • processing the derivative vector to obtain the derivative compound includes: inputting the derivative vector into a molecular structure decoding model, which is a neural network model; decoding the derivative vector to obtain a derivative molecular structure; and obtaining the derivative compound from the derivative molecular structure.
  • the method also includes: obtaining a sample matrix, which is a matrix representation of a sample compound; inputting the sample matrix into the encoding layer of an autoencoder and encoding it to obtain a sample vector, which is the feature vector representation of the sample compound; inputting the sample vector into the decoding layer of the autoencoder and decoding it to obtain a prediction matrix; calculating the loss between the prediction matrix and the sample matrix; and iteratively updating the parameters of the autoencoder based on the loss until the loss is stable. The decoding layer and output layer of the trained autoencoder are used as the molecular structure decoding model, where the output layer converts a compound represented as a matrix into a compound represented as a molecular structure.
  • obtaining the seed vector includes: obtaining the SMILES string of the seed compound; performing one-hot encoding on the SMILES string of the seed compound to obtain a seed matrix, which is a matrix representation of the seed compound; encoding the seed matrix to obtain a seed vector.
  • encoding the seed matrix to obtain the seed vector includes: inputting the seed matrix into the molecular structure encoding model, and encoding the seed matrix to obtain the seed vector.
  • the method also includes: obtaining a sample matrix, which is a matrix representation of a sample compound; inputting the sample matrix into the encoding layer of an autoencoder and encoding it to obtain a sample vector, which is the feature vector representation of the sample compound; inputting the sample vector into the decoding layer of the autoencoder and decoding it to obtain a prediction matrix; calculating the loss between the prediction matrix and the sample matrix; and iteratively updating the parameters of the autoencoder based on the loss until the loss is stable. The encoding layer of the trained autoencoder is used as the molecular structure encoding model.
  • the compound design device includes an acquisition module, an operation module and a decoding module. The acquisition module is used to obtain a seed vector, the seed vector being the feature vector representation of a seed compound; the operation module is used to perform a crossover operation and/or a mutation operation on the seed vector based on the genetic algorithm to obtain a derivative vector; the decoding module is used to process the derivative vector to obtain a derivative compound.
  • the compound design device also includes a selection module, which is used to respectively measure the fitness of the derivative compounds based on the fitness function; and select candidate compounds from the derivative compounds according to the size of the fitness.
  • the selection module selects candidate compounds from the derivative compounds according to the fitness as follows. Step S1: according to the fitness, select from the derivative compounds the target compounds whose fitness meets the preset condition. Step S2: use the derivative vectors corresponding to the target compounds as seed vectors, and continue from the step of performing the crossover operation and/or mutation operation on the seed vectors based on the genetic algorithm through the step of measuring the fitness of the derivative compounds based on the fitness function. Steps S1 and S2 are iterated until the iteration termination condition is satisfied. All derivative compounds obtained are then sorted in descending order of fitness, and a predetermined proportion or a predetermined number of the derivative compounds with the best fitness are selected as candidate compounds.
  • the operation module includes a crossover operation submodule, which is used to select two seed vectors from the seed vector set, select an exchange position in one of the seed vectors, and exchange the value at that position with the value at the corresponding position of the other seed vector.
  • the operation module includes a mutation operation submodule, which is used to select a seed vector from the seed vector set, select a mutation position in the selected seed vector, and replace the value at the mutation position with a new value.
  • the decoding module is specifically used to input the derivative vector into the molecular structure decoding model, decode the derivative vector to obtain the derivative molecular structure, and obtain the derivative compound according to the derived molecular structure;
  • the molecular structure decoding model is a neural network model.
  • the compound design device also includes a model training module, which is used to: obtain a sample matrix, which is the matrix representation of a sample compound; input the sample matrix into the encoding layer of the autoencoder and encode it to obtain a sample vector, which is the feature vector representation of the sample compound; input the sample vector into the decoding layer of the autoencoder and decode it to obtain a prediction matrix; calculate the loss between the prediction matrix and the sample matrix; and iteratively update the parameters of the autoencoder based on the loss until the loss is stable. The decoding layer and output layer of the trained autoencoder are used as the molecular structure decoding model, where the output layer converts a compound represented as a matrix into a compound represented as a molecular structure.
  • the compound design device also includes an encoding module, which is used to obtain the SMILES string of the seed compound; perform one-hot encoding on the SMILES string to obtain a seed matrix, which is a matrix representation of the seed compound; and encode the seed matrix to obtain the seed vector.
  • the encoding module encodes the seed matrix to obtain the seed vector, including: inputting the seed matrix into the molecular structure encoding model, and encoding the seed matrix to obtain the seed vector.
  • the compound design device also includes a model training module, which is used to: obtain a sample matrix, which is the matrix representation of a sample compound; input the sample matrix into the encoding layer of the autoencoder and encode it to obtain a sample vector, which is the feature vector representation of the sample compound; input the sample vector into the decoding layer of the autoencoder and decode it to obtain a prediction matrix; calculate the loss between the prediction matrix and the sample matrix; and iteratively update the parameters of the autoencoder based on the loss until the loss is stable. The encoding layer of the trained autoencoder is used as the molecular structure encoding model.
  • another technical solution adopted by the present application is to provide a compound design device including a processor and a memory, where instructions are stored in the memory and the processor executes the instructions to implement any of the above compound design methods.
  • another technical solution adopted by the present application is to provide a computer-readable storage medium, which stores instructions/program data that can be executed to implement any of the above compound design methods.
  • the beneficial effects of the present application are as follows. Unlike the prior art, the compound design method provided by the present application develops and designs compounds based on a genetic algorithm, which enlarges the compound space that can be explored, yields more diverse compounds, and widens the range of choices. Furthermore, the complex chemical space is reduced to one-dimensional vectors during operation, which enables the design algorithm to search the chemical space conveniently and efficiently.
  • Figure 1 is a schematic flow diagram of a compound design method in the embodiment of the present application.
  • Fig. 2 is a schematic diagram of the training process of a molecular structure model in the embodiment of the present application
  • Figure 3 is a schematic flow diagram of another compound design method in the embodiment of the present application.
  • Fig. 4 is a schematic structural diagram of a compound design device in an embodiment of the present application.
  • Fig. 5 is a schematic structural diagram of the compound design equipment in the embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a computer-readable storage medium in an embodiment of the present application.
  • the inventors of the present application found that a molecular generation model based on deep learning can use a large-scale compound database to learn the writing rules of compounds by itself, express compounds as dense vectors of continuous values, learn the structural features of compounds, generate compounds with new skeletons, and expand the searchable chemical space.
  • transfer learning or reinforcement learning can be used to guide model training so that the chemical space of the generated molecules shrinks to a specific region, and molecules that meet the conditions are sampled from that region. For example, molecules with special functional groups can be generated.
  • the present application provides a compound design method.
  • new compounds are developed and designed based on a genetic algorithm. Following the genetic algorithm's principle of simulating natural evolution, a certain number of seed compounds are selected to play the role of chromosomes and form an initial population.
  • the fitness of the entire population is evaluated, and several individuals are selected based on fitness; natural selection, inheritance, and mutation are then simulated to produce the next generation of the population (i.e., the derivative compounds). Each generation repeats this cycle to search for an optimal solution.
  • FIG. 1 is a schematic flowchart of a compound design method in an embodiment of the present application. It should be noted that this embodiment is not limited to the flow sequence shown in FIG. 1 if substantially the same result is obtained. As shown in Figure 1, this embodiment includes:
  • the seed vector is the feature vector representation of the seed compound.
  • the first-generation population of the genetic evolution is constructed first; that is, the base compounds for compound design, the seed compounds, must be obtained first.
  • the seed compound can be any compound randomly selected from a compound database, and there can be one or more of them. Depending on the design requirements, seed compounds can also be screened specifically, which is not limited here.
  • dimensionality reduction is also performed on the seed compound, so that the complex chemical space is reduced to one-dimensional vectors. Specifically, the compound is converted from its molecular structural formula representation into a vector representation.
  • the design algorithm based on the genetic algorithm is thereby simplified to operations between vectors, which makes searching the chemical space more convenient and efficient.
  • S130 Perform a crossover operation and/or a mutation operation on the seed vector based on the genetic algorithm to obtain a derivative vector.
  • the dimension of the operation result is then raised again. Specifically, the vector representation of the compound is converted back into a molecular structure representation to obtain the specific structural formula of the compound, from which the derivative compound is determined.
  • the development and design of compounds is carried out based on the genetic algorithm, which enlarges the compound space that can be explored, yields more diverse compounds, and widens the range of choices. Furthermore, the complex chemical space is reduced to one-dimensional vectors during operation, which enables the design algorithm to search the chemical space conveniently and efficiently.
  • the present application may use a neural network, which takes the chemical structure as input and output, and extracts the vector output by the intermediate layer as a one-dimensional representation of the chemical structure. That is, the neural network model can be used to reduce and increase the dimension of the compound.
  • the autoencoder can be used to train the molecular structure encoding model and the molecular structure decoding model.
  • the molecular structure encoding model can be used to reduce the dimension of the chemical structure, and encode the chemical structure into a vector; the molecular structure decoding model can be used to increase the dimension of the vector, and decode the vector into a chemical structure.
  • An autoencoder is a deep learning neural network that is trained so that the input and output values are the same. It first compresses the input vector into a hidden space, and then reconstructs and decodes the output so that the output is the same as the input.
  • the autoencoder mainly includes an encoding layer, a hidden vector layer and a decoding layer.
  • the encoding layer contains several neurons, which can convert a large and sparse matrix into a dense one-dimensional vector composed of floating point numbers (the vector in the hidden vector layer).
  • the decoding layer also contains several neurons, which can decode a dense one-dimensional vector into a large and sparse matrix.
  • in the training phase, a neural network is first built that can receive large, sparse matrices. The input is first converted into vectors of continuous values by an embedding layer. These vectors are combined through several linear and nonlinear transformations to produce a latent (hidden) vector, which is then decoded back into a large, sparse matrix through further linear and nonlinear transformations. Because the parameters of these transformations are initially random or inaccurate, the decoded matrix is likely to differ substantially from the original matrix; training adjusts the parameters to reduce this difference.
  • a chemical structure can be one-hot encoded and converted into a matrix representation. The neural network above can therefore be used to reduce and increase the dimension of a compound, and the training method above can be used to train the molecular structure encoding model and the molecular structure decoding model.
  • FIG. 2 is a schematic diagram of a training process of a molecular structure model in an embodiment of the present application. It should be noted that, if there are substantially the same results, this embodiment is not limited to the flow sequence shown in FIG. 2 . As shown in Figure 2, this embodiment includes:
  • sample matrix is a matrix representation of the sample compound.
  • the compound library can be downloaded from the Internet, and effective compounds can be extracted from the compound library as sample compounds.
  • the sample compounds can be screened to a certain extent, for example, chiral compounds, salt compounds, uncommon molecules, molecules with too many heavy atoms, inorganic substances, etc. can be removed when screening sample compounds. Different screening rules can be set according to different requirements, which are not limited here.
  • SMILES (simplified molecular-input line-entry system) is a linear string notation for molecular structures.
  • the chemical structure can be written in the form of a SMILES string according to an existing set of rules.
  • pyrimidine can be written as SMILES string "c1ccncn1".
  • a string can be thought of as a sentence consisting of several words.
  • the pyrimidine string above, c1ccncn1, can be regarded as composed of three distinct words: c, 1, and n. These words are unordered and discrete, so each is treated as a separate state and represented with one-hot encoding as a vector consisting only of 0s and 1s. For example, if the first position of the vector stands for c, the second for 1, and the third for n, the three words are expressed as [1,0,0], [0,1,0], and [0,0,1], where 1 means the word is present and 0 means it is absent.
  • the structure of pyrimidine is then represented as the two-dimensional matrix [[1,0,0],[0,1,0],[1,0,0],[1,0,0],[0,0,1],[1,0,0],[0,0,1],[0,1,0]].
  • in this two-dimensional matrix, one dimension is the vector length of each word and the other is the length of the string. Here the vector length of each word is 3 and the length of the pyrimidine string is 8. Encoded in this way, the pyrimidine structure becomes something a computer can process.
  • the SMILES strings in the sample compound set can be uniformly encoded into m*n matrices (m words per string, each word vector of length n). Here m is the length of the longest SMILES string in the set; a SMILES string shorter than m words is still expressed as an m*n matrix, with the missing elements filled with 0. Similarly, n is the longest word vector length found in the set.
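  • The one-hot encoding and zero padding described above can be sketched as follows. This is a minimal illustration, not the application's implementation: it assumes one character per word (so multi-character atoms such as Cl or Br would need a real tokenizer), and the helper names `build_vocab` and `one_hot` are hypothetical.

```python
def build_vocab(smiles_list):
    """Collect the distinct characters ("words") in first-seen order."""
    vocab = []
    for smiles in smiles_list:
        for ch in smiles:
            if ch not in vocab:
                vocab.append(ch)
    return vocab


def one_hot(smiles, vocab, max_len):
    """Encode a SMILES string as a max_len x len(vocab) matrix.

    Rows beyond the end of the string stay all-zero (padding).
    Assumption: character-level words, as in the pyrimidine example.
    """
    index = {ch: i for i, ch in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for row, ch in enumerate(smiles):
        matrix[row][index[ch]] = 1
    return matrix


vocab = build_vocab(["c1ccncn1"])            # words of pyrimidine
matrix = one_hot("c1ccncn1", vocab, max_len=8)
padded = one_hot("c1ccncn1", vocab, max_len=10)  # two all-zero padding rows
```

  • With this vocabulary order the 8 x 3 matrix matches the pyrimidine matrix given in the text, and a larger `max_len` simply appends all-zero rows.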
  • S230 Input the sample matrix into the encoding layer of the autoencoder, and encode to obtain a sample vector, wherein the sample vector is a representation of the feature vector of the sample compound.
  • S250 Input the sample vector into the decoding layer of the autoencoder, and decode to obtain a prediction matrix.
  • S290 Iteratively update the parameters of the autoencoder based on the loss until the loss is stable, obtaining a molecular structure encoding model and a molecular structure decoding model.
  • the encoding layer of the trained autoencoder can be used as the molecular structure encoding model, and the decoding layer of the trained autoencoder can be used as the molecular structure decoding model.
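  • The training loop (encode, decode, compute the loss, update until the loss is stable) can be sketched in plain Python. This is only a toy under stated assumptions: the encoder and decoder are single linear layers without the embedding layer or nonlinear transformations mentioned above, the loss is mean squared error, the update is plain gradient descent, and "stable" is taken to mean that the loss changes by less than `tol` between steps. All names, the data, and the hyperparameters are hypothetical.

```python
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def loss_and_grads(W_enc, W_dec, data):
    """Mean squared reconstruction error and its gradients."""
    k, d, n = len(W_enc), len(W_dec), len(data)
    g_enc = [[0.0] * d for _ in range(k)]
    g_dec = [[0.0] * k for _ in range(d)]
    loss = 0.0
    for x in data:
        z = matvec(W_enc, x)          # encoding layer: sample vector
        y = matvec(W_dec, z)          # decoding layer: prediction
        err = [yi - xi for yi, xi in zip(y, x)]
        loss += sum(e * e for e in err) / n
        # Error propagated back through the decoder to the latent vector.
        back = [sum(W_dec[i][j] * err[i] for i in range(d)) for j in range(k)]
        for i in range(d):
            for j in range(k):
                g_dec[i][j] += 2 * err[i] * z[j] / n
                g_enc[j][i] += 2 * back[j] * x[i] / n
    return loss, g_enc, g_dec


def train(data, latent_dim=2, lr=0.1, tol=1e-8, max_steps=5000, seed=0):
    rng = random.Random(seed)
    d = len(data[0])
    W_enc = [[rng.uniform(-0.3, 0.3) for _ in range(d)] for _ in range(latent_dim)]
    W_dec = [[rng.uniform(-0.3, 0.3) for _ in range(latent_dim)] for _ in range(d)]
    history = []
    for _ in range(max_steps):
        loss, g_enc, g_dec = loss_and_grads(W_enc, W_dec, data)
        history.append(loss)
        if len(history) > 1 and abs(history[-2] - loss) < tol:
            break                     # loss is stable: stop training
        for j in range(latent_dim):
            for i in range(d):
                W_enc[j][i] -= lr * g_enc[j][i]
                W_dec[i][j] -= lr * g_dec[i][j]
    return W_enc, W_dec, history


# Toy sample matrices (2 words x 3-word vocabulary), flattened to length 6.
data = [[1, 0, 0, 0, 1, 0], [0, 1, 0, 1, 0, 0],
        [1, 0, 0, 0, 0, 1], [0, 0, 1, 0, 1, 0]]
W_enc, W_dec, history = train(data)
```

  • After training, `W_enc` plays the role of the molecular structure encoding model and `W_dec` that of the decoding model; a practical model would use a deep learning framework and nonlinear layers.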
  • the autoencoder may further include an input layer, which can be used to convert a compound represented as a chemical structural formula into a compound represented as a matrix.
  • the input layer and the encoding layer are then used together as the molecular structure encoding model.
  • the molecular structure encoding model can take a compound represented as a molecular structural formula as input, and output the encoded compound represented as a vector.
  • the autoencoder may further include an output layer, which can be used to convert a compound represented as a matrix into a compound represented as a chemical structural formula.
  • the specific conversion is the reverse of converting a chemical structural formula into a matrix. See the description above for details, which are not repeated here.
  • the output layer and the decoding layer are used together as the molecular structure decoding model.
  • the molecular structure decoding model can take a compound represented as a vector as input, and output the decoded compound represented as a molecular structural formula.
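  • The output-layer conversion from a matrix back to a SMILES string can be sketched as the reverse of one-hot encoding. The details here are assumptions rather than the application's method: each row is mapped to the word with the largest value, and the first row whose largest value falls below `pad_threshold` is treated as the start of the padding. The function name and the threshold value are hypothetical.

```python
def matrix_to_smiles(matrix, vocab, pad_threshold=0.5):
    """Convert a (possibly real-valued) prediction matrix to a SMILES string.

    Assumption: padding is trailing, so decoding stops at the first row
    whose maximum is below pad_threshold.
    """
    chars = []
    for row in matrix:
        best = max(row)
        if best < pad_threshold:
            break                      # remaining rows are padding
        chars.append(vocab[row.index(best)])
    return "".join(chars)


vocab = ["c", "1", "n"]
predicted = [
    [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.8, 0.1, 0.1],
    [0.0, 0.2, 0.8], [1.0, 0.0, 0.0], [0.1, 0.0, 0.9], [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.0],   # padding rows
]
smiles = matrix_to_smiles(predicted, vocab)
```

  • A decoded string is not guaranteed to be a valid molecule, so a real pipeline would normally validate it with a cheminformatics toolkit before scoring.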
  • FIG. 3 is a schematic flowchart of another compound design method in the embodiment of the present application. It should be noted that this embodiment is not limited to the flow sequence shown in FIG. 3 if substantially the same result is achieved. As shown in Figure 3, this embodiment can combine the molecular structure coding model, molecular structure decoding model and genetic algorithm for compound design, specifically including:
  • S310 Acquire a seed vector.
  • a seed compound can be selected from a compound database and its SMILES string obtained; the SMILES string is one-hot encoded to obtain a seed matrix, which is the matrix representation of the seed compound; the seed matrix is then input into the molecular structure encoding model and encoded to obtain the seed vector. See the description above for details, which are not repeated here.
  • S330 Perform a crossover operation on the seed vector based on the genetic algorithm to obtain a derived vector.
  • the crossover operation selects two seed vectors from the seed vector set and selects the exchange position (one or more positions) in one of them.
  • the seed vectors and the exchange positions can be selected at random, or according to a selection rule set in advance.
  • the value at each selected exchange position of this seed vector is exchanged with the value at the corresponding position of the other seed vector. For example, given the two vectors [0.1,0.2,0.3] and [0.4,0.5,0.6], exchanging the first position yields the two new vectors [0.4,0.2,0.3] and [0.1,0.5,0.6]. As another example, exchanging the first and third positions of the same two vectors yields [0.4,0.2,0.6] and [0.1,0.5,0.3].
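  • The crossover operation can be sketched in a few lines. The function names are hypothetical, positions are zero-based (so the "first position" in the text is index 0), and the random variant shows only one possible selection rule.

```python
import random


def crossover(a, b, positions):
    """Exchange the values of a and b at the given positions."""
    child_a, child_b = list(a), list(b)
    for p in positions:
        child_a[p], child_b[p] = b[p], a[p]
    return child_a, child_b


def random_crossover(a, b, n_positions, rng):
    """Pick n_positions distinct exchange positions at random."""
    positions = rng.sample(range(len(a)), n_positions)
    return crossover(a, b, positions)


first = crossover([0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0])
both = crossover([0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0, 2])
rand = random_crossover([0.1, 0.2, 0.3], [0.4, 0.5, 0.6], 1, random.Random(0))
```

  • `first` and `both` reproduce the two worked examples in the text; the inputs are left unmodified because the function copies them.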
  • S350 Perform a mutation operation on the seed vector based on the genetic algorithm to obtain a derived vector.
  • the mutation operation selects several seed vectors from the seed vector set (the proportion of vectors to be mutated can be specified in advance) and selects the mutation position (one or more positions) in each of them. The seed vectors and the mutation positions can be selected at random, or according to a selection rule set in advance.
  • the values at these mutation positions are replaced with new values, which can be arbitrary random values or preset values. For example, given the vector [0.1,0.2,0.3], selecting the first position and replacing its value with a random value yields a new vector such as [0.5,0.2,0.3]. As another example, selecting the first and second positions and replacing the corresponding values with random new values yields a new vector such as [0.2,0.4,0.3].
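  • The mutation operation can be sketched similarly. The names are hypothetical, positions are zero-based, and the sampling range for random replacement values (`low`, `high`) is an assumption, since the text does not specify how new values are drawn.

```python
import random


def mutate(vector, replacements):
    """Return a copy of vector with values replaced at given positions.

    replacements maps a position to its new value.
    """
    mutated = list(vector)
    for position, value in replacements.items():
        mutated[position] = value
    return mutated


def random_mutate(vector, n_positions, rng, low=0.0, high=1.0):
    """Replace n_positions randomly chosen values with random new values."""
    positions = rng.sample(range(len(vector)), n_positions)
    return mutate(vector, {p: rng.uniform(low, high) for p in positions})


one = mutate([0.1, 0.2, 0.3], {0: 0.5})
two = mutate([0.1, 0.2, 0.3], {0: 0.2, 1: 0.4})
rand = random_mutate([0.1, 0.2, 0.3], 2, random.Random(0))
```

  • `one` and `two` correspond to the two worked examples in the text, with the "random" values fixed so that the result is reproducible.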
  • Both the crossover operation and the mutation operation are for generating new vectors (ie derived vectors), deriving more vectors, and further deriving more compounds.
  • Crossover operation and mutation operation can simulate genetic evolution and improve the diversity of compounds.
  • the crossover operation and the mutation operation can be performed simultaneously or in reverse order, or only one of them can be performed. That is, steps S330 and S350 are shown only for illustration; either one can be performed selectively, or their order can be reversed, which is not limited here.
  • the derivative vector is input into the molecular structure decoding model, and the derivative vector is decoded to obtain the derivative matrix, and then the derivative matrix is converted to obtain the derived molecular structure, and then the derivative compound can be determined according to the derived molecular structure. Please refer to the above description for details, and will not repeat them here.
  • S390 Measure the fitness of the derived compounds based on the fitness function, and select candidate compounds from the derived compounds according to the fitness.
  • Fitness is a scale used to evaluate derivative compounds, such as whether the structure has good solubility, good activity, etc. In this way, the derivative compounds are associated with the criteria for judging the quality, that is, the fitness function is constructed.
  • each derivative compound has an evaluation value, which represents the adaptability of the compound in the evolution process. For example, molecules with poor solubility and poor activity tend to be eliminated.
  • This evaluation standard depends on the definition of the user, and the user can adaptively set the evaluation standard (fitness function) according to the characteristic requirements of the compound to be designed. For example, the user wants to get a compound with a large enough molecular weight. Then, thousands of derived vectors are randomly generated, and these derived vectors are transformed according to the above to obtain a compound respectively, and then the molecular weights of these compounds are calculated. This molecular weight is the user-defined fitness. We arrange these compounds in descending order according to molecular weight, and select a top-ranked candidate compound or a batch of candidate compounds according to the user-defined parameters (the ratio or quantity selected each time).
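  • The molecular-weight example can be sketched as a fitness function followed by a descending sort. This is an illustration only: the three compounds and their formulas are hypothetical inputs chosen for the sketch (not data from the application), compounds are given directly as element counts rather than decoded from vectors, and standard atomic weights rounded to three decimals are used.

```python
# Standard atomic weights, rounded to three decimals.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(formula):
    """Fitness: molecular weight of a compound given as element counts."""
    return sum(ATOMIC_WEIGHT[element] * count for element, count in formula.items())


# Hypothetical derivative compounds, keyed by name.
compounds = {
    "benzene": {"C": 6, "H": 6},
    "ethanol": {"C": 2, "H": 6, "O": 1},
    "pyrimidine": {"C": 4, "H": 4, "N": 2},
}

# Sort in descending order of fitness and keep a user-defined number.
ranked = sorted(compounds, key=lambda name: molecular_weight(compounds[name]),
                reverse=True)
candidates = ranked[:2]
```

  • Any other property predictor (solubility, activity, pharmacophore match) could be substituted for `molecular_weight` without changing the ranking step.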
  • multiple rounds of crossover operations and mutation operations can be iteratively performed to obtain more derivative compounds, and then desired candidate compounds are selected from these derivative compounds.
  • the specific implementation of selecting the candidate compounds from the derivative compounds may include: step S1: according to the fitness, select from the derivative compounds the target compounds whose fitness meets the preset condition; step S2: using the derivative vectors corresponding to the target compounds as seed vectors, continue to execute the steps from S330 and/or S350 through the step in S390 of measuring the fitness of the derivative compounds based on the fitness function.
  • target compounds that satisfy the preset condition can be chosen in several ways: the derivative compounds can be sorted by fitness and a fixed number of compounds (such as 10, 30, or 50) selected from the front as target compounds; the derivative compounds can be sorted by fitness and a fixed ratio (such as 1/10, 1/5, or 1/3) selected from the front as target compounds; or the derivative compounds whose fitness is greater than a fixed threshold can be selected as target compounds.
  • the number, ratio, and conditions of target compounds selected can be set according to needs, and will not be repeated here.
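  • The three ways of choosing target compounds (fixed number, fixed ratio, fitness threshold) can be sketched in one helper. The function name, its parameters, and the scores are hypothetical; rounding the ratio down with a minimum of one compound is an assumption.

```python
def select_targets(scored, count=None, ratio=None, threshold=None):
    """Select target compounds from (compound, fitness) pairs.

    Exactly one of count, ratio, or threshold should be given.
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    if count is not None:
        return ranked[:count]                       # fixed number
    if ratio is not None:
        keep = max(1, int(len(ranked) * ratio))     # fixed ratio, at least one
        return ranked[:keep]
    if threshold is not None:
        return [item for item in ranked if item[1] > threshold]
    raise ValueError("specify count, ratio, or threshold")


# Hypothetical derivative compounds with made-up fitness values.
scored = [("A", 0.2), ("B", 0.9), ("C", 0.5), ("D", 0.7), ("E", 0.1)]
by_count = select_targets(scored, count=2)
by_ratio = select_targets(scored, ratio=0.4)
by_threshold = select_targets(scored, threshold=0.5)
```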
  • the derivative compounds can be sorted in descending order of fitness and the top-ranked target compounds selected; crossover and mutation operations are then performed on the derivative vectors of these target compounds to generate a new batch of one-dimensional derivative vectors. These new derivative vectors are input into the molecular structure decoding model, decoded into new matrices, and converted into new derivative compounds, whose fitness is then calculated.
  • these derivative compounds are again arranged in descending order of fitness, the top-ranked target compounds are selected from them, and crossover and mutation are applied to generate new one-dimensional derivative vectors. This loop is iterated and all derivative compounds generated are recorded. Candidate compounds with better fitness are selected from the generated derivative compounds as the final result.
  • the number of iterations can depend on the set parameters and the characteristics of the data set itself.
  • the iteration termination condition can be the number of iterations set in advance, and the number of iterations can be dozens to hundreds of times, such as 200 to 400 rounds.
  • the iteration termination condition can be the iteration duration set in advance, such as 8 hours, 12 hours, 24 hours, 48 hours, etc.
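The iteration with its two termination conditions (a preset number of rounds or a preset duration) can be sketched as below. The names `run_iterations`, `derive`, `fitness`, and `keep` are hypothetical. As a simplifying assumption, fitness is evaluated directly on vectors here, whereas the described method first decodes each derivative vector into a compound.

```python
import time
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def run_iterations(
    seeds: List[Vector],
    derive: Callable[[List[Vector]], List[Vector]],
    fitness: Callable[[Vector], float],
    keep: int,
    max_rounds: Optional[int] = None,
    max_seconds: Optional[float] = None,
) -> List[Tuple[Vector, float]]:
    """Iterate select -> derive -> score and record every derivative."""
    if max_rounds is None and max_seconds is None:
        raise ValueError("set max_rounds and/or max_seconds")
    start = time.monotonic()
    history: List[Tuple[Vector, float]] = []
    rounds = 0
    while True:
        # Termination by preset number of rounds.
        if max_rounds is not None and rounds >= max_rounds:
            break
        # Termination by preset duration.
        if max_seconds is not None and time.monotonic() - start >= max_seconds:
            break
        derived = derive(seeds)  # crossover and/or mutation
        scored = [(vec, fitness(vec)) for vec in derived]
        history.extend(scored)  # record all derivatives
        scored.sort(key=lambda pair: pair[1], reverse=True)
        seeds = [vec for vec, _ in scored[:keep]]  # next seed vectors
        rounds += 1
    # Final ranking over everything generated, best first.
    history.sort(key=lambda pair: pair[1], reverse=True)
    return history
```

Candidate compounds would then be taken from the front of the returned list, by a predetermined number or proportion.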
  • The complex chemical space is reduced to one-dimensional vectors, which lets the design algorithm search the chemical space conveniently and efficiently; combining this chemical space with a genetic algorithm overcomes the problem that molecular generative models, after reinforcement learning or transfer learning, generate increasingly uniform compounds.
  • The ChEMBL28 database can be downloaded from the Internet, and the SMILES strings of its compounds extracted.
  • The sample compound structures must contain only hydrogen, carbon, nitrogen, oxygen, fluorine, sulfur, chlorine, and bromine atoms. Chiral compounds, inorganic substances, and salt ions are excluded, and the number of heavy atoms is limited to 70 or fewer. The remaining SMILES strings are converted into canonical form.
  • About 1.8 million SMILES strings remain after deduplication and are used to train a neural network. The embodiments are developed based on this neural network.
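The filtering rules above can be approximated with a purely textual pre-filter, sketched here under several assumptions. It is a heuristic on the SMILES string rather than a chemically rigorous check. It treats `@` as the chirality marker and `.` as the sign of salts or multiple fragments, it does not detect inorganic substances as such, and it only removes exact duplicate strings. Canonicalization itself requires a cheminformatics toolkit and is not shown. All function names are hypothetical.

```python
import re
from typing import List

# Allowed elements, including aromatic lowercase forms used in SMILES.
ALLOWED = {"H", "C", "N", "O", "F", "S", "Cl", "Br", "c", "n", "o", "s"}
# Either a bracket atom like [NH3+] or a bare atom symbol.
ATOM_TOKEN = re.compile(r"\[([^\]]+)\]|(Cl|Br|[A-Za-z])")
# Element symbol inside a bracket atom, after an optional isotope number.
BRACKET_SYMBOL = re.compile(r"\d*([A-Z][a-z]?|[a-z]{1,2})")


def atom_symbols(smiles: str) -> List[str]:
    """Return the element symbol of every atom token in the string."""
    symbols = []
    for bracket, bare in ATOM_TOKEN.findall(smiles):
        if bracket:
            match = BRACKET_SYMBOL.match(bracket)
            symbols.append(match.group(1) if match else "?")
        else:
            symbols.append(bare)
    return symbols


def passes_filter(smiles: str, max_heavy_atoms: int = 70) -> bool:
    """Apply the element, chirality, fragment and size rules."""
    if "." in smiles or "@" in smiles:
        return False  # salts / multiple fragments, or chiral centres
    symbols = atom_symbols(smiles)
    if not symbols or any(sym not in ALLOWED for sym in symbols):
        return False
    heavy = sum(1 for sym in symbols if sym != "H")
    return heavy <= max_heavy_atoms


def filter_and_dedupe(smiles_list: List[str]) -> List[str]:
    """Keep passing strings once each, preserving input order."""
    seen, kept = set(), []
    for smi in smiles_list:
        if smi not in seen and passes_filter(smi):
            seen.add(smi)
            kept.append(smi)
    return kept
```

In practice the same rules would more reliably be applied to parsed molecules, with deduplication done on canonical SMILES.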
  • Protein kinase B, also known as AKT, is a serine/threonine-specific protein kinase. It plays an important regulatory role in apoptosis, proliferation, migration, and other cellular processes. AKT1 participates in the cell survival pathway by blocking apoptosis and thereby promoting cell survival. Clinical studies have found that AKT is overexpressed in various human tumors such as gastric cancer and pancreatic cancer. AKT inhibitors can inhibit the activity of AKT and promote the apoptosis of cancer cells.
  • Compound 1 is an AKT inhibitor in clinical research. Its interaction mode is analyzed and a pharmacophore model is established; the degree to which a molecule matches the pharmacophore is then used as the fitness criterion in the search for new molecules.
  • Mutations of human isocitrate dehydrogenase 1 (IDH1) occur in a variety of malignant tumors, such as glioma.
  • Mutated IDH1 can convert α-ketoglutarate to 2-hydroxyglutarate.
  • The latter is a carcinogen that accumulates in the body and promotes the further progression of cancer.
  • Drugs that inhibit the activity of mutant IDH1 can effectively reduce the concentration of 2-hydroxyglutarate in the body and relieve cancer symptoms.
  • Compound 2 is the most promising inhibitor of mutant IDH1 currently under study. It is taken as the template molecule: for each generated molecule, the similarity to the template molecule (measured by molecular fingerprint) is calculated as the molecule's fitness, and a batch of similar molecules is searched for in the latent space.
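A fingerprint-based similarity used as fitness can be sketched as follows. The text does not name the similarity metric, so the Tanimoto coefficient is an assumption chosen because it is a common choice for molecular fingerprints. Fingerprints are represented here as sets of "on" bit indices, and the names `tanimoto` and `similarity_fitness` are hypothetical.

```python
from typing import Dict, Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # two empty fingerprints: define similarity as 0
    return len(fp_a & fp_b) / union


def similarity_fitness(
    template_fp: Set[int], generated: Dict[str, Set[int]]
) -> Dict[str, float]:
    """Fitness of each generated molecule = similarity to the template."""
    return {name: tanimoto(template_fp, fp) for name, fp in generated.items()}
```

A value of 1.0 means the two fingerprints are identical, and values near 0 mean the molecules share almost no fingerprint features.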
  • A third predetermined number of seed compounds from the compound database are input into the molecular structure encoding model to obtain seed vectors; crossover and mutation operations are performed on the seed vectors based on the genetic algorithm to obtain multiple derivative vectors; the derivative vectors are input into the molecular structure decoding model to obtain multiple derivative compounds; the similarity between each derivative compound and the template molecule is calculated to obtain the fitness of that derivative compound; a fourth predetermined number of derivative vectors are then selected according to fitness as seed vectors for further crossover and mutation. This cycle is iterated for 380 rounds, yielding a batch of new compounds.
  • FIG. 4 is a schematic structural diagram of a compound design device in an embodiment of the present application.
  • the compound design device 40 includes an acquisition module 41 , an operation module 42 and a decoding module 43 .
  • The acquisition module 41 is used to obtain the seed vector, which is the feature vector representation of the seed compound; the operation module 42 is used to perform a crossover operation and/or a mutation operation on the seed vector based on the genetic algorithm to obtain a derivative vector; the decoding module 43 is used to process the derivative vector to obtain the derivative compound.
  • The device designs compounds based on the genetic algorithm, which enlarges the explorable compound space, yields more diverse compounds, and widens the range of choices.
  • During operation, the complex chemical space is reduced to one-dimensional vectors, which enables the design algorithm to search the chemical space conveniently and efficiently. For the specific execution process, refer to the description of the above embodiments; it is not repeated here.
  • the compound design device 40 also includes a selection module (not shown in the figure), which is used to measure the fitness of the derived compounds based on the fitness function; and select candidate compounds from the derived compounds according to the fitness.
  • The selection module selects candidate compounds from the derivative compounds according to fitness as follows: step S1: according to fitness, select from the derivative compounds the target compounds whose fitness meets the preset condition; step S2: using the derivative vectors corresponding to the target compounds as seed vectors, repeat the steps from performing the crossover operation and/or mutation operation on the seed vectors based on the genetic algorithm to obtain derivative vectors, through measuring the fitness of the derivative compounds based on the fitness function. Steps S1 and S2 are iterated until the iteration termination condition is met; all derivative compounds obtained are then sorted in descending order of fitness, and a predetermined proportion or a predetermined number of the derivative compounds with the best fitness are selected as candidate compounds. In this way, more candidate compounds can be obtained, and better compounds can be screened more easily.
  • The operation module 42 includes a crossover operation submodule (not shown in the figure), which is used to select two seed vectors from the seed vector set, select an exchange position in one of the seed vectors, and exchange the value at that position with the value at the corresponding position of the other seed vector to obtain new derivative vectors, from which derivative compounds can then be obtained. This enriches the derivative vectors and increases the diversity of the derivative compounds.
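The crossover operation described above, swapping the value at one chosen position between two seed vectors, can be sketched as below. The function name `crossover` and the use of a uniformly random exchange position are assumptions, since the text only says that an exchange position is selected.

```python
import random
from typing import List, Tuple

Vector = List[float]


def crossover(
    parent_a: Vector, parent_b: Vector, rng: random.Random
) -> Tuple[Vector, Vector]:
    """Swap the value at one position between two seed vectors."""
    if len(parent_a) != len(parent_b):
        raise ValueError("seed vectors must have the same length")
    position = rng.randrange(len(parent_a))  # assumed: uniform choice
    # Copy so the original seed vectors stay unchanged.
    child_a, child_b = list(parent_a), list(parent_b)
    child_a[position] = parent_b[position]
    child_b[position] = parent_a[position]
    return child_a, child_b
```

Passing an explicit `random.Random` instance keeps runs reproducible when a fixed seed is used.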
  • The operation module 42 includes a mutation operation submodule (not shown in the figure), which is used to select a seed vector from the seed vector set, select a mutation position in the selected seed vector, and replace the value at the mutation position with a new value to obtain a new derivative vector, from which a derivative compound can then be obtained. This enriches the derivative vectors and increases the diversity of the derivative compounds.
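The mutation operation can be sketched in the same style. How the new value is chosen is not stated in the text, so drawing it uniformly from a fixed range `[low, high]` is an assumption made for illustration, and the name `mutate` and its parameters are hypothetical.

```python
import random
from typing import List

Vector = List[float]


def mutate(
    seed: Vector, rng: random.Random, low: float = -1.0, high: float = 1.0
) -> Vector:
    """Replace the value at one position with a newly drawn value."""
    position = rng.randrange(len(seed))  # the mutation position
    child = list(seed)  # copy so the seed vector stays unchanged
    child[position] = rng.uniform(low, high)  # assumed sampling rule
    return child
```

A sensible range for the new value would depend on the value distribution of the latent vectors produced by the encoding model.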
  • The decoding module 43 is used to input the derivative vector into the molecular structure decoding model, which is a neural network model, decode the derivative vector to obtain a derivative molecular structure, and obtain the derivative compound from the derivative molecular structure.
  • The compound design device 40 also includes an encoding module (not shown in the figure), which is used to obtain the SMILES string of the seed compound; one-hot encode the SMILES string to obtain the seed matrix, which is the matrix representation of the seed compound; and input the seed matrix into the molecular structure encoding model, which encodes it to obtain the seed vector.
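One-hot encoding of a SMILES string into a matrix can be sketched as follows. The character-level alphabet, the space used as padding symbol, and the fixed row count `max_len` are assumptions. The actual vocabulary and matrix size used by the encoding model are not given in this section.

```python
from typing import List

PAD = " "  # assumed padding symbol; must be part of the alphabet


def one_hot_encode(smiles: str, alphabet: str, max_len: int) -> List[List[int]]:
    """Return a max_len x len(alphabet) matrix, one row per character."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string longer than max_len")
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = []
    for ch in smiles.ljust(max_len, PAD):
        row = [0] * len(alphabet)
        row[index[ch]] = 1  # exactly one 1 per row
        matrix.append(row)
    return matrix


def one_hot_decode(matrix: List[List[float]], alphabet: str) -> str:
    """Take the highest-scoring character per row and strip padding."""
    chars = [alphabet[row.index(max(row))] for row in matrix]
    return "".join(chars).rstrip(PAD)
```

The decode helper also works on a decoder's output matrix of per-character scores, which is how a predicted matrix could be turned back into a SMILES string.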
  • The compound design device 40 also includes a model training module (not shown in the figure), which is used to: obtain a sample matrix, which is a matrix representation of a sample compound; input the sample matrix into the encoding layer of an autoencoder and encode it to obtain a sample vector, which is the feature vector representation of the sample compound; input the sample vector into the decoding layer of the autoencoder and decode it to obtain a prediction matrix; calculate the loss between the prediction matrix and the sample matrix; and iteratively update the parameters of the autoencoder based on the loss until the loss stabilizes. The decoding layer and output layer of the trained autoencoder then serve as the molecular structure decoding model, where the output layer converts the compound represented by the matrix into a molecular structure representation, and the encoding layer of the trained autoencoder serves as the molecular structure encoding model.
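The training objective can be written compactly as below. The text only states that a loss between the prediction matrix and the sample matrix is computed, so the per-character cross-entropy shown here is an assumption, chosen because it is a common choice for one-hot targets. The symbols are introduced for illustration: $X \in \{0,1\}^{L \times K}$ is the one-hot sample matrix with $L$ positions and $K$ symbols, $E_\theta$ is the encoding layer, $D_\phi$ is the decoding layer, and each row of $\hat{X}$ is assumed to be a probability distribution over the $K$ symbols.

```latex
z = E_\theta(X), \qquad \hat{X} = D_\phi(z), \qquad
\mathcal{L}(\theta, \phi) = -\frac{1}{L} \sum_{i=1}^{L} \sum_{k=1}^{K} X_{ik} \log \hat{X}_{ik}
```

Training updates $\theta$ and $\phi$ to reduce $\mathcal{L}$ until it stabilizes. Afterwards $E_\theta$ is kept as the encoding model and $D_\phi$, together with the output layer, as the decoding model.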
  • The compound design device can be an independent server, a server cluster, or a module of a server. It can be used for model training and for running the genetic algorithm, and thereby for designing compounds.
  • FIG. 5 is a schematic structural diagram of a compound design device in an embodiment of the present application.
  • the compound design device 10 includes a processor 11 and a memory 12 .
  • The processor 11 may also be called a CPU (Central Processing Unit).
  • the processor 11 may be an integrated circuit chip with signal processing capabilities.
  • the processor 11 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components .
  • The general-purpose processor can be a microprocessor, or the processor 11 can be any conventional processor or the like.
  • the compound design device 10 may further include a memory 12 for storing instructions and data required for the operation of the processor 11 .
  • the processor 11 is configured to execute instructions to implement the methods provided in any embodiment of the compound design method of the present application and any non-conflicting combination.
  • The compound design equipment can be a server, a desktop computer, a laptop, etc. It can be used for model training and for running the genetic algorithm, and thereby for designing compounds.
  • FIG. 6 is a schematic structural diagram of a computer-readable storage medium in an embodiment of the present application.
  • the computer-readable storage medium 20 of the embodiment of the present application stores instructions/program data 21.
  • When the instructions/program data 21 are executed, the methods provided by any embodiment of the compound design method of the present application, and any non-conflicting combination of them, are implemented.
  • The instructions/program data 21 can form a program file stored in the above-mentioned storage medium 20 in the form of a software product, so that a computer device (which can be a personal computer, a server, a network device, etc.) or a processor executes all or part of the steps of the methods in the various implementations of the present application.
  • The aforementioned storage medium 20 includes media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc, or terminal devices such as computers, servers, mobile phones, and tablets.
  • the disclosed system, device and method can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of units is only a logical function division. In actual implementation, there may be other division methods.
  • Multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a compound design method and apparatus, a device, and a computer-readable storage medium. The method comprises: acquiring a seed vector, which is a feature vector representation of a seed compound; performing a crossover operation and/or a mutation operation on the seed vector based on a genetic algorithm to obtain a derivative vector; and processing the derivative vector to obtain a derivative compound. In this way, the present invention can increase the diversity of the designed compounds.
PCT/CN2021/129381 2021-11-08 2021-11-08 Compound design method and apparatus, device, and computer-readable storage medium WO2023077522A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/129381 WO2023077522A1 (fr) 2021-11-08 2021-11-08 Compound design method and apparatus, device, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/129381 WO2023077522A1 (fr) 2021-11-08 2021-11-08 Compound design method and apparatus, device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2023077522A1 true WO2023077522A1 (fr) 2023-05-11

Family

ID=86240609

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/129381 WO2023077522A1 (fr) 2021-11-08 2021-11-08 Compound design method and apparatus, device, and computer-readable storage medium

Country Status (1)

Country Link
WO (1) WO2023077522A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
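US20200168302A1 (en) * 2017-07-20 2020-05-28 The University Of North Carolina At Chapel Hill Methods, systems and non-transitory computer readable media for automated design of molecules with desired properties using artificial intelligence
CN110046692A (zh) * 2018-01-17 2019-07-23 Samsung Electronics Co., Ltd. Method for generating a chemical structure, neural network device, and computer-readable recording medium
CN112071373A (zh) * 2020-09-02 2020-12-11 Shenzhen Jingtai Technology Co., Ltd. Drug molecule screening method and system
CN113409898A (zh) * 2021-06-30 2021-09-17 Beijing Baidu Netcom Science and Technology Co., Ltd. Molecular structure acquisition method, apparatus, electronic device, and storage medium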

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 21963031; Country of ref document: EP; Kind code of ref document: A1