WO2023077522A1

WO2023077522A1 - Compound design method and apparatus, device, and computer readable storage medium

Info

Publication number: WO2023077522A1
Application number: PCT/CN2021/129381
Authority: WO
Inventors: 杨立君
Original assignee: 深圳晶泰科技有限公司
Priority date: 2021-11-08
Filing date: 2021-11-08
Publication date: 2023-05-11

Abstract

The present application discloses a compound design method and apparatus, a device, and a computer-readable storage medium. The method comprises: acquiring a seed vector, the seed vector being a feature vector representation of a seed compound; performing a crossover operation and/or a mutation operation on the seed vector on the basis of a genetic algorithm to obtain a derivative vector; and processing the derivative vector to obtain a derivative compound. In this way, the present application can improve the diversity of the designed compounds.

Description

Compound design method, device, equipment and computer-readable storage medium

【Technical field】

The present application relates to the technical field of computational chemistry, in particular to a compound design method, device, equipment and computer-readable storage medium.

【Background art】

In traditional drug research, scientists screen compound libraries, test compounds against targets one by one, and finally identify hit compounds. Because this process is costly and has a high failure rate, computational chemists use computational models to predict the activity of compounds, simulate the binding of drugs in protein cavities on a computer, and recommend a batch of potentially active molecules for testing. However, this approach is severely limited by the quality of the virtual screening library. Existing compound libraries generally contain hundreds of thousands of molecules, and their skeletons have already been extensively studied and screened by earlier researchers, so it is difficult to find candidate compounds with a new skeleton. Screening libraries that contain too few molecules and offer poor structural novelty can no longer meet the growing demand for research and development.

【Summary of the invention】

The technical problem mainly solved by this application is to provide a compound design method, device, equipment and computer-readable storage medium, which can increase the diversity of designed compounds.

In order to solve the above technical problem, one technical solution adopted by the present application is to provide a compound design method. The method includes: obtaining a seed vector, the seed vector being a feature vector representation of a seed compound; performing a crossover operation and/or a mutation operation on the seed vector based on a genetic algorithm to obtain a derivative vector; and processing the derivative vector to obtain a derivative compound.

Wherein, after the derivative vector is processed to obtain the derivative compound, the method further includes: measuring the fitness of each derivative compound based on a fitness function; and selecting candidate compounds from the derivative compounds according to their fitness.

Wherein, selecting candidate compounds from the derivative compounds according to their fitness includes:

Step S1: according to the fitness, select from the derivative compounds the target compounds whose fitness meets a preset condition;

Step S2: using the derivative vectors corresponding to the target compounds as seed vectors, continue to perform the crossover operation and/or mutation operation on the seed vectors based on the genetic algorithm to obtain derivative vectors, and measure the fitness of the resulting derivative compounds based on the fitness function;

Repeat steps S1 and S2 until the iteration termination condition is satisfied, then end the iteration;

Sort all derivative compounds obtained in descending order of fitness;

Select a predetermined proportion or a predetermined number of the derivative compounds with the best fitness as candidate compounds.

Wherein, performing the crossover operation on the seed vector based on the genetic algorithm includes: selecting two seed vectors from the seed vector set, selecting an exchange position in one of the seed vectors, and exchanging the value at that exchange position with the value at the corresponding position of the other seed vector.

Wherein, the mutation operation on the seed vector based on the genetic algorithm includes: selecting a seed vector from the seed vector set, selecting a mutation position from the selected seed vector, and replacing the value at the mutation position with a new value.

Wherein, processing the derivative vector to obtain the derivative compound includes: inputting the derivative vector into a molecular structure decoding model, decoding the derivative vector to obtain a derivative molecular structure, and obtaining the derivative compound according to the derivative molecular structure; the molecular structure decoding model is a neural network model.

Wherein, before the derivative vector is input into the molecular structure decoding model and decoded to obtain the molecular structure of the derivative compound, the method further includes: obtaining a sample matrix, which is a matrix representation of a sample compound; inputting the sample matrix into the encoding layer of an autoencoder and encoding it to obtain a sample vector, which is a feature vector representation of the sample compound; inputting the sample vector into the decoding layer of the autoencoder and decoding it to obtain a prediction matrix; calculating the loss between the prediction matrix and the sample matrix; and iteratively updating the parameters of the autoencoder based on the loss until the loss is stable, then using the decoding layer and output layer of the trained autoencoder as the molecular structure decoding model, where the output layer is used to convert a compound represented as a matrix into a compound represented as a molecular structure.

Wherein, obtaining the seed vector includes: obtaining the SMILES string of the seed compound; performing one-hot encoding on the SMILES string of the seed compound to obtain a seed matrix, which is a matrix representation of the seed compound; encoding the seed matrix to obtain a seed vector.

Wherein, encoding the seed matrix to obtain the seed vector includes: inputting the seed matrix into the molecular structure encoding model, and encoding the seed matrix to obtain the seed vector.

Wherein, before the seed matrix is input into the molecular structure encoding model and encoded to obtain the seed vector, the method further includes: obtaining a sample matrix, which is a matrix representation of a sample compound; inputting the sample matrix into the encoding layer of an autoencoder and encoding it to obtain a sample vector, which is a feature vector representation of the sample compound; inputting the sample vector into the decoding layer of the autoencoder and decoding it to obtain a prediction matrix; calculating the loss between the prediction matrix and the sample matrix; and iteratively updating the parameters of the autoencoder based on the loss until the loss is stable, then using the encoding layer of the trained autoencoder as the molecular structure encoding model.

In order to solve the above technical problem, another technical solution adopted by the present application is to provide a compound design device. The compound design device includes an acquisition module, an operation module, and a decoding module. The acquisition module is used to obtain a seed vector, which is a feature vector representation of a seed compound; the operation module is used to perform a crossover operation and/or a mutation operation on the seed vector based on a genetic algorithm to obtain a derivative vector; and the decoding module is used to process the derivative vector to obtain a derivative compound.

Wherein, the compound design device also includes a selection module, which is used to respectively measure the fitness of the derivative compounds based on the fitness function; and select candidate compounds from the derivative compounds according to the size of the fitness.

Wherein, the selection module selecting candidate compounds from the derivative compounds according to their fitness includes: step S1: according to the fitness, select from the derivative compounds the target compounds whose fitness meets a preset condition; step S2: using the derivative vectors corresponding to the target compounds as seed vectors, continue to perform the crossover operation and/or mutation operation on the seed vectors based on the genetic algorithm to obtain derivative vectors, and measure the fitness of the resulting derivative compounds based on the fitness function; repeat steps S1 and S2 until the iteration termination condition is satisfied, then end the iteration; sort all derivative compounds obtained in descending order of fitness; and select a predetermined proportion or a predetermined number of the derivative compounds with the best fitness as candidate compounds.

Wherein, the operation module includes a crossover operation submodule, which is used to select two seed vectors from the seed vector set, select an exchange position in one of the seed vectors, and exchange the value at that exchange position with the value at the corresponding position of the other seed vector.

Wherein, the operation module includes a mutation operation submodule, which is used to select a seed vector from the seed vector set, select a mutation position in the selected seed vector, and replace the value at the mutation position with a new value.

Wherein, the decoding module is specifically used to input the derivative vector into the molecular structure decoding model, decode the derivative vector to obtain the derivative molecular structure, and obtain the derivative compound according to the derived molecular structure; the molecular structure decoding model is a neural network model.

Wherein, the compound design device also includes a model training module, which is used to: obtain a sample matrix, which is a matrix representation of a sample compound; input the sample matrix into the encoding layer of an autoencoder and encode it to obtain a sample vector, which is a feature vector representation of the sample compound; input the sample vector into the decoding layer of the autoencoder and decode it to obtain a prediction matrix; calculate the loss between the prediction matrix and the sample matrix; and iteratively update the parameters of the autoencoder based on the loss until the loss is stable, then use the decoding layer and output layer of the trained autoencoder as the molecular structure decoding model, where the output layer is used to convert a compound represented as a matrix into a compound represented as a molecular structure.

Wherein, the compound design device also includes an encoding module, which is used to: obtain the SMILES string of the seed compound; perform one-hot encoding on the SMILES string of the seed compound to obtain a seed matrix, which is a matrix representation of the seed compound; and encode the seed matrix to obtain the seed vector.

Wherein, the encoding module encodes the seed matrix to obtain the seed vector, including: inputting the seed matrix into the molecular structure encoding model, and encoding the seed matrix to obtain the seed vector.

Wherein, the compound design device also includes a model training module, which is used to: obtain a sample matrix, which is a matrix representation of a sample compound; input the sample matrix into the encoding layer of an autoencoder and encode it to obtain a sample vector, which is a feature vector representation of the sample compound; input the sample vector into the decoding layer of the autoencoder and decode it to obtain a prediction matrix; calculate the loss between the prediction matrix and the sample matrix; and iteratively update the parameters of the autoencoder based on the loss until the loss is stable, then use the encoding layer of the trained autoencoder as the molecular structure encoding model.

In order to solve the above technical problem, another technical solution adopted by the present application is to provide compound design equipment, including a processor and a memory, where instructions are stored in the memory and the processor is used to execute the instructions to implement any of the above compound design methods.

In order to solve the above technical problem, another technical solution adopted by the present application is to provide a computer-readable storage medium, which is used to store instructions/program data that can be executed to implement the compound design method of any of the above items.

The beneficial effects of the present application are as follows: unlike the prior art, the compound design method provided by the present application develops and designs compounds based on a genetic algorithm, which enlarges the explorable compound space, yields diverse compounds, and widens the range of choices. Furthermore, the complex chemical space is reduced to a one-dimensional vector during operation, which enables the design algorithm to search the chemical space conveniently and efficiently.

【Description of drawings】

FIG. 1 is a schematic flowchart of a compound design method in an embodiment of the present application;

FIG. 2 is a schematic diagram of the training process of a molecular structure model in an embodiment of the present application;

FIG. 3 is a schematic flowchart of another compound design method in an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a compound design device in an embodiment of the present application;

FIG. 5 is a schematic structural diagram of the compound design equipment in an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a computer-readable storage medium in an embodiment of the present application.

【Detailed description】

In order to make the purpose, technical solution and effect of the present application more clear and definite, the present application will be further described in detail below with reference to the accompanying drawings and examples.

To meet the demand for compound screening libraries in drug research and development, the inventors of the present application found that a deep-learning-based molecular generation model can use a large-scale compound database to learn the writing rules of compounds by itself and express compounds as dense vectors of continuous values. The model thereby learns the structural features of compounds, generates compounds with new skeletons, and expands the searchable chemical space. On this basis, to generate molecules with particular characteristics, transfer learning or reinforcement learning can be used to guide model training, so that the chemical space of the generated molecules shrinks to a specific region, and molecules that meet the conditions are generated by sampling in that region. For example, molecules with special functional groups can be generated. However, both transfer learning and reinforcement learning share a problem: as training progresses, the diversity of the generated molecules gradually decreases and the generated molecular skeletons gradually become simpler. Transfer learning relies heavily on the quality of a small dataset; too few samples and low diversity lead to premature convergence of the model and poor diversity of the generated compounds. In reinforcement learning, overly complex combinations of scoring functions make model training unstable and difficult to converge, while a scoring function with a single criterion still causes the model to converge prematurely, so the resulting molecules lack diversity.

Based on this, the present application provides a compound design method in which new compounds are developed and designed based on a genetic algorithm. Following the genetic algorithm's principle of simulating natural evolution, a certain number of seed compounds are selected to play the role of chromosomes and form an initial population. In each generation, the fitness of the entire population is evaluated, and several individuals are selected based on fitness to undergo simulated natural selection, inheritance, and mutation, producing the next-generation population (i.e., the derivative compounds). This cycle is repeated in each generation to search for an optimal solution.

Please refer to FIG. 1. FIG. 1 is a schematic flowchart of a compound design method in an embodiment of the present application. It should be noted that, provided substantially the same result is obtained, this embodiment is not limited to the flow sequence shown in FIG. 1. As shown in FIG. 1, this embodiment includes:

S110: Acquire a seed vector.

Among them, the seed vector is the feature vector representation of the seed compound.

Under the genetic algorithm, the first-generation population for genetic evolution is constructed first; that is, the base compounds for compound design, namely the seed compounds, must first be obtained. A seed compound can be any compound randomly selected from a compound database, and there can be one or more seed compounds. Seed compounds can also be specifically screened according to different design requirements, which is not limited here.

In the embodiment provided in the present application, dimensionality reduction is also performed on the seed compound, reducing the complex chemical space to a one-dimensional vector. Specifically, the representation of the compound is changed from a molecular structural formula to a vector. Through this dimensionality reduction, the design algorithm based on the genetic algorithm is simplified to operations between vectors, which makes searching the chemical space more convenient and efficient.

S130: Perform a crossover operation and/or a mutation operation on the seed vector based on the genetic algorithm to obtain a derivative vector.

Crossover and mutation operations between vectors simulate natural selection, inheritance, mutation, and evolution, and generate new vectors (i.e., derivative vectors). These correspond to new compounds, which diversifies the compounds obtained.

S150: Process the derivative vector to obtain a derivative compound.

After the operations between vectors are completed, the dimensionality of the result is increased. Specifically, the vector representation of the compound is converted back into a molecular structure representation to obtain the specific structural formula of the compound, which determines the derivative compound.

In this embodiment, compounds are developed and designed based on the genetic algorithm, which enlarges the explorable compound space, yields diverse compounds, and widens the range of choices. Furthermore, the complex chemical space is reduced to a one-dimensional vector during operation, which enables the design algorithm to search the chemical space conveniently and efficiently.

In one embodiment, the present application may use a neural network, which takes the chemical structure as input and output, and extracts the vector output by the intermediate layer as a one-dimensional representation of the chemical structure. That is, the neural network model can be used to reduce and increase the dimension of the compound.

Among them, the autoencoder can be used to train the molecular structure encoding model and the molecular structure decoding model. The molecular structure encoding model can be used to reduce the dimension of the chemical structure, and encode the chemical structure into a vector; the molecular structure decoding model can be used to increase the dimension of the vector, and decode the vector into a chemical structure.

An autoencoder is a deep learning neural network that is trained so that the input and output values are the same. It first compresses the input vector into a hidden space, and then reconstructs and decodes the output so that the output is the same as the input. Specifically, the autoencoder mainly includes an encoding layer, a hidden vector layer and a decoding layer. The encoding layer contains several neurons, which can convert a large and sparse matrix into a dense one-dimensional vector composed of floating point numbers (the vector in the hidden vector layer). The decoding layer also contains several neurons, which can decode a dense one-dimensional vector into a large and sparse matrix.

In the training phase, a neural network that accepts large, sparse matrices is first built. The input matrix is first converted into vectors of continuous values by an embedding layer. These vectors are combined through various linear and nonlinear transformations to finally obtain a latent (hidden) vector. The latent vector is then decoded back into a large, sparse matrix through further linear and nonlinear transformations. Because the parameters of these transformations are initially random or inaccurate, the decoded matrix is likely to differ greatly from the original matrix. Therefore, a chosen metric is used to measure the difference between the decoded matrix and the original matrix, and the parameters of the neural network are updated by backpropagation according to that difference. The updated network is then used to generate a new large, sparse matrix, the difference between the decoded matrix and the original matrix is computed again, and the parameters are updated again. This is repeated for multiple rounds until the difference gradually decreases and stabilizes (that is, further rounds no longer reduce the difference). After such training, when a large, sparse matrix is input, an almost identical large, sparse matrix can be restored.
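The training loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the network of the application: it uses a purely linear encoder and decoder (no embedding layer and no nonlinear transformations), mean squared error as the difference metric, and plain gradient descent. The names `train_autoencoder`, `latent_dim`, `lr`, `tol`, and `patience` are hypothetical, and the two flattened one-hot samples are toy data.

```python
import random

def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def train_autoencoder(samples, latent_dim=2, lr=0.1, tol=1e-6,
                      patience=10, max_epochs=5000, seed=0):
    """Train a linear autoencoder until the loss stops improving (assumed criterion)."""
    rng = random.Random(seed)
    d = len(samples[0])
    # Parameters start out random, so early reconstructions are poor.
    encoder = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(latent_dim)]
    decoder = [[rng.uniform(-0.1, 0.1) for _ in range(latent_dim)] for _ in range(d)]
    history, stalled = [], 0
    for _ in range(max_epochs):
        total = 0.0
        for x in samples:
            z = matvec(encoder, x)   # encode: flattened sparse matrix -> dense latent vector
            y = matvec(decoder, z)   # decode: latent vector -> predicted matrix
            error = [yi - xi for yi, xi in zip(y, x)]
            total += sum(e * e for e in error) / d   # mean squared error
            # Backpropagate the difference through decoder and encoder.
            grad_y = [2.0 * e / d for e in error]
            grad_z = [sum(grad_y[i] * decoder[i][j] for i in range(d))
                      for j in range(latent_dim)]
            for i in range(d):
                for j in range(latent_dim):
                    decoder[i][j] -= lr * grad_y[i] * z[j]
            for j in range(latent_dim):
                for k in range(d):
                    encoder[j][k] -= lr * grad_z[j] * x[k]
        history.append(total / len(samples))
        # "Stable" is taken to mean: no improvement above tol for `patience` epochs.
        if len(history) > 1 and history[-2] - history[-1] < tol:
            stalled += 1
        else:
            stalled = 0
        if stalled >= patience:
            break
    return encoder, decoder, history

# Toy data: two flattened 2x2 one-hot matrices.
SAMPLES = [[1, 0, 0, 1], [0, 1, 1, 0]]
encoder, decoder, history = train_autoencoder(SAMPLES)
latent = matvec(encoder, SAMPLES[0])  # dense vector representation of the first sample
```

After training, `encoder` alone plays the role of the encoding model and `decoder` alone the role of the decoding model, which mirrors how the trained layers are reused in the description.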

In one embodiment, the chemical structure can be one-hot encoded and converted into a matrix representation. The above-mentioned neural network can therefore be used to reduce and increase the dimensionality of the compound, and the above-mentioned training method can be used to train the molecular structure encoding model and the molecular structure decoding model.

Please refer to FIG. 2. FIG. 2 is a schematic diagram of a training process of a molecular structure model in an embodiment of the present application. It should be noted that, provided substantially the same result is obtained, this embodiment is not limited to the flow sequence shown in FIG. 2. As shown in FIG. 2, this embodiment includes:

S210: Acquire a sample matrix.

Wherein, the sample matrix is a matrix representation of the sample compound.

Among them, the compound library can be downloaded from the Internet, and effective compounds can be extracted from the compound library as sample compounds. The sample compounds can be screened to a certain extent, for example, chiral compounds, salt compounds, uncommon molecules, molecules with too many heavy atoms, inorganic substances, etc. can be removed when screening sample compounds. Different screening rules can be set according to different requirements, which are not limited here.

After the sample compounds are selected, each is converted into a SMILES string. SMILES (Simplified Molecular Input Line Entry System) is a specification that unambiguously describes molecular structures with ASCII strings. A chemical structure can be written as a SMILES string according to an existing set of rules. For example, pyrimidine can be written as the SMILES string "c1ccncn1". A string can be thought of as a sentence consisting of several words; the pyrimidine string above can be regarded as composed of the three words c, 1, and n. Each word can be converted by one-hot encoding into a vector consisting only of 0s and 1s, so the string can be converted into a matrix representation to obtain the sample matrix.

Taking pyrimidine as an example, its SMILES string is "c1ccncn1", which can be regarded as consisting of the three words c, 1, and n. These three words are unordered and discrete. Treat them as three states, each represented by a vector of 0s and 1s. For example, if the first position stands for c, the second for 1, and the third for n, the three words can be expressed as [1,0,0], [0,1,0], and [0,0,1], where 1 means the word is present and 0 means it is absent. The structure of pyrimidine is then represented as the two-dimensional matrix [[1,0,0],[0,1,0],[1,0,0],[1,0,0],[0,0,1],[1,0,0],[0,0,1],[0,1,0]]. In this two-dimensional matrix, one dimension represents the vector length of each word and the other represents the length of the string. For pyrimidine, the vector length of each word is 3 and the length of the whole string is 8. Encoded in this way, the pyrimidine structure is transformed into a form that a computer can process.
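The pyrimidine example can be reproduced with a short sketch. It assumes each character of the SMILES string is one word, which holds for "c1ccncn1" but not for multi-character tokens such as "Cl" or "Br"; the function name `one_hot_encode` and the constant `VOCABULARY` are hypothetical.

```python
def one_hot_encode(smiles, vocabulary):
    """Return one row per character, with a single 1 at that character's vocabulary index."""
    index = {token: i for i, token in enumerate(vocabulary)}
    width = len(vocabulary)
    return [[1 if index[ch] == j else 0 for j in range(width)] for ch in smiles]

# Word order as in the text: first position c, second position 1, third position n.
VOCABULARY = ["c", "1", "n"]
pyrimidine_matrix = one_hot_encode("c1ccncn1", VOCABULARY)
```

The resulting matrix has 8 rows (the string length) and 3 columns (the word vector length), and its rows match the matrix given in the text.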

After all the compounds in the sample compound set are converted into SMILES strings, special characters such as "$" and "#" can be added at the beginning and end of each SMILES string to mark its start and end and to distinguish different strings; the strings can also be deduplicated. The SMILES strings in the sample compound set can be uniformly encoded as m*n matrices (m words, each with a word vector of length n). To do so, find the longest SMILES string, whose length is, for example, m; a SMILES string shorter than m words is still expressed as an m*n matrix, with the missing elements filled with 0. Similarly, find the longest word vector length, for example n.
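A minimal sketch of this batch encoding follows, again assuming character-level words. The function name `encode_batch` is hypothetical, and here m counts the two marker characters as well. Note that "#" also denotes a triple bond in SMILES, so a practical implementation would need markers that cannot collide with the strings being encoded; the toy inputs below contain no "#".

```python
def encode_batch(smiles_list, start="$", end="#"):
    """Deduplicate, add start/end markers, and one-hot encode to equally sized m*n matrices."""
    unique = list(dict.fromkeys(smiles_list))            # deduplicate, keep first occurrences
    tokens = sorted({ch for s in unique for ch in s} - {start, end})
    vocabulary = [start, end] + tokens                   # n = number of distinct words
    index = {token: i for i, token in enumerate(vocabulary)}
    wrapped = [start + s + end for s in unique]
    m = max(len(s) for s in wrapped)                     # length of the longest string
    n = len(vocabulary)
    batch = []
    for s in wrapped:
        rows = [[1 if index[ch] == j else 0 for j in range(n)] for ch in s]
        rows.extend([0] * n for _ in range(m - len(s)))  # zero-fill the missing rows
        batch.append(rows)
    return batch, vocabulary

batch, vocabulary = encode_batch(["c1ccncn1", "CCO", "CCO"])
```

In this toy run the duplicate "CCO" is dropped, and both remaining strings become 10*7 matrices; the shorter one ends in five all-zero rows.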

S230: Input the sample matrix into the encoding layer of the autoencoder and encode it to obtain a sample vector, where the sample vector is a feature vector representation of the sample compound.

S250: Input the sample vector into the decoding layer of the autoencoder and decode it to obtain a prediction matrix.

S270: Calculate the loss between the prediction matrix and the sample matrix.

S290: Iteratively update the parameters of the autoencoder based on the loss until the loss is stable, obtaining a molecular structure encoding model and a molecular structure decoding model.

The encoding layer of the trained autoencoder can be used as the molecular structure encoding model, and the decoding layer of the trained autoencoder can be used as the molecular structure decoding model.

In one embodiment, the autoencoder may further include an input layer, which converts a compound represented by a chemical structural formula into a matrix representation. The input layer and the encoding layer are then used together as the molecular structure encoding model, which takes a compound in molecular structural formula form as input and outputs the encoded vector representation of the compound.

In one embodiment, the autoencoder may further include an output layer, which converts a compound in matrix form into chemical structural formula form. This conversion is the reverse of converting a chemical structural formula into a matrix; see the description above for details, which are not repeated here. The output layer and the decoding layer are then used together as the molecular structure decoding model, which takes a compound in vector form as input and outputs the decoded molecular structural formula.
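The reverse conversion performed by such an output layer can be sketched as follows. This is an illustration under assumptions: each row is mapped to the word with the largest value (so it also accepts soft, non-binary predictions), and all-zero rows are treated as padding and skipped. The function name `decode_matrix` is hypothetical.

```python
def decode_matrix(matrix, vocabulary):
    """Map each row back to its most likely word and join the words into a string."""
    words = []
    for row in matrix:
        if not any(row):              # all-zero row: padding, no word present
            continue
        best = max(range(len(row)), key=lambda j: row[j])
        words.append(vocabulary[best])
    return "".join(words)

VOCABULARY = ["c", "1", "n"]
PYRIMIDINE_MATRIX = [[1,0,0],[0,1,0],[1,0,0],[1,0,0],[0,0,1],[1,0,0],[0,0,1],[0,1,0]]
smiles = decode_matrix(PYRIMIDINE_MATRIX, VOCABULARY)
```

Applied to the pyrimidine matrix from the earlier example, this recovers the original SMILES string.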

Please refer to FIG. 3. FIG. 3 is a schematic flowchart of another compound design method in an embodiment of the present application. It should be noted that, provided substantially the same result is achieved, this embodiment is not limited to the flow sequence shown in FIG. 3. As shown in FIG. 3, this embodiment combines the molecular structure encoding model, the molecular structure decoding model, and the genetic algorithm for compound design, and specifically includes:

S310: Acquire a seed vector.

Seed compounds can be selected from a compound database and their SMILES strings obtained. One-hot encoding is performed on the SMILES string of each seed compound to obtain a seed matrix, which is the matrix representation of the seed compound. The seed matrix is input into the molecular structure encoding model and encoded to obtain the seed vector. See the description above for details, which are not repeated here.

S330: Perform a crossover operation on the seed vector based on the genetic algorithm to obtain a derivative vector.

The crossover operation selects two seed vectors from the seed vector set and selects an exchange position (one or more positions) in one of them. The seed vectors and exchange positions can be selected randomly or according to set selection rules. The value at the selected exchange position of this seed vector is exchanged with the value at the corresponding position of the other seed vector. For example, given the two vectors [0.1,0.2,0.3] and [0.4,0.5,0.6], exchanging the first position yields two new vectors, [0.4,0.2,0.3] and [0.1,0.5,0.6]. As another example, exchanging the first and third positions of the same two vectors yields the new vectors [0.4,0.2,0.6] and [0.1,0.5,0.3].
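The crossover operation can be sketched as below. The function name `crossover` and its parameters are hypothetical; positions are 0-based in the code, so the "first position" of the text is index 0, and a single random position is used when none is given (one possible selection rule among many).

```python
import random

def crossover(parent_a, parent_b, positions=None, rng=random):
    """Swap the values at the given positions between two equal-length vectors."""
    if len(parent_a) != len(parent_b):
        raise ValueError("seed vectors must have the same length")
    if positions is None:
        positions = [rng.randrange(len(parent_a))]   # assumed default: one random position
    child_a, child_b = list(parent_a), list(parent_b)
    for p in set(positions):                         # set() avoids swapping a position twice
        child_a[p], child_b[p] = child_b[p], child_a[p]
    return child_a, child_b

first_swap = crossover([0.1, 0.2, 0.3], [0.4, 0.5, 0.6], positions=[0])
first_and_third_swap = crossover([0.1, 0.2, 0.3], [0.4, 0.5, 0.6], positions=[0, 2])
```

The two calls reproduce the two worked examples in the text; the parent vectors are copied, so the seed vectors themselves are left unchanged.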

S350: Perform a mutation operation on the seed vector based on the genetic algorithm to obtain a derivative vector.

The mutation operation selects several seed vectors from the seed vector set (the proportion of vectors to be mutated can be specified in advance) and selects a mutation position (one or more positions) in each of them. The seed vectors and mutation positions can be selected randomly or according to set selection rules. The values at these mutation positions are replaced with new values, which can be random values or set values. For example, given the vector [0.1,0.2,0.3], selecting the first position and replacing its value with a random value yields the new vector [0.5,0.2,0.3]. As another example, selecting the first and second positions and randomly replacing the corresponding values yields the new vector [0.2,0.4,0.3].
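A minimal sketch of the mutation operation follows. The names `mutate` and `mutate_population` are hypothetical, positions are 0-based, and the sampling range `low`..`high` for random replacement values is an assumption, since the text does not specify one.

```python
import random

def mutate(vector, positions=None, new_values=None, low=-1.0, high=1.0, rng=random):
    """Return a copy of the vector with the values at the given positions replaced."""
    child = list(vector)
    if positions is None:
        positions = [rng.randrange(len(child))]      # assumed default: one random position
    for k, p in enumerate(positions):
        # Use the set value if one is supplied, otherwise draw a random value.
        child[p] = new_values[k] if new_values is not None else rng.uniform(low, high)
    return child

def mutate_population(population, fraction, rng=random):
    """Mutate a pre-specified fraction of the seed vectors, chosen at random."""
    count = max(1, round(fraction * len(population)))
    chosen = rng.sample(range(len(population)), count)
    return [mutate(population[i], rng=rng) for i in chosen]

one_position = mutate([0.1, 0.2, 0.3], positions=[0], new_values=[0.5])
two_positions = mutate([0.1, 0.2, 0.3], positions=[0, 1], new_values=[0.2, 0.4])
```

Passing `new_values` reproduces the two worked examples in the text deterministically; omitting it gives the random replacement described there.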

Both the crossover operation and the mutation operation generate new vectors (i.e., derivative vectors), and thus more compounds. They simulate genetic evolution and improve the diversity of the compounds. The crossover operation and the mutation operation can be performed simultaneously or in reverse order, or only one of them can be performed. That is, steps S330 and S350 are only illustrative: either one can be performed selectively, or their order can be reversed, which is not limited here.

S370: Process the derived vector to obtain a derivative compound.

The derived vector is input into the molecular structure decoding model and decoded to obtain a derived matrix. The derived matrix is then converted into a derived molecular structure, from which the derivative compound is determined. For details, refer to the description above, which is not repeated here.
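The conversion from a decoded matrix to a molecular structure is not detailed at this point. One plausible reading, sketched below purely as an assumption, is that each row of the matrix holds scores over a character vocabulary, that the highest-scoring character is taken per row, and that trailing padding is removed. The vocabulary `VOCAB`, the padding character and the function name `matrix_to_smiles` are hypothetical.

```python
from typing import List, Sequence

# Hypothetical vocabulary; index 0 is the padding character.
VOCAB = [" ", "C", "O"]


def matrix_to_smiles(
    matrix: Sequence[Sequence[float]], vocabulary: Sequence[str], pad: str = " "
) -> str:
    """Pick the highest-scoring character in each row and strip trailing padding."""
    chars: List[str] = []
    for row in matrix:
        best = max(range(len(row)), key=row.__getitem__)  # argmax of the row
        chars.append(vocabulary[best])
    return "".join(chars).rstrip(pad)


decoded = [
    [0.1, 0.8, 0.1],    # C
    [0.2, 0.7, 0.1],    # C
    [0.1, 0.2, 0.7],    # O
    [0.9, 0.05, 0.05],  # padding
]
print(matrix_to_smiles(decoded, VOCAB))
```

A string produced this way is not guaranteed to be a valid SMILES string, so a practical pipeline would normally validate it with a cheminformatics toolkit before computing fitness.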

S390: Measure the fitness of each derivative compound based on a fitness function, and select candidate compounds from the derivative compounds according to the fitness.

Fitness is a measure used to evaluate the derivative compounds, for example whether a structure has good solubility or good activity. Associating the derivative compounds with such quality criteria is what constructs the fitness function.

The genetic algorithm simulates the process of evolution. Each derivative compound is thus assigned an evaluation value that represents its adaptability during evolution. For example, molecules with poor solubility and poor activity tend to be eliminated.

The evaluation criterion is defined by the user, who can set the fitness function according to the properties required of the compound to be designed. For example, suppose the user wants compounds with a sufficiently large molecular weight. Thousands of derived vectors are generated at random, each derived vector is converted into a compound as described above, and the molecular weight of each compound is calculated. The molecular weight is then the user-defined fitness. The compounds are sorted in descending order of molecular weight, and the top-ranked candidate compound, or a batch of top-ranked candidate compounds, is selected according to user-defined parameters (the proportion or number selected each time).
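The role of a user-defined fitness function can be illustrated with the short Python sketch below. It is not the implementation of the application: the helper `rank_by_fitness` is hypothetical, and the molecular weights are taken from a small lookup table (ethanol, benzene, aspirin) that stands in for a real molecular weight calculation.

```python
from typing import Callable, List, Sequence, Tuple


def rank_by_fitness(
    compounds: Sequence[str], fitness: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Score every compound and sort the (compound, fitness) pairs, best first."""
    scored = [(compound, fitness(compound)) for compound in compounds]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stand-in for a molecular weight calculation (values in g/mol).
MOLECULAR_WEIGHT = {
    "CCO": 46.07,                      # ethanol
    "c1ccccc1": 78.11,                 # benzene
    "CC(=O)Oc1ccccc1C(=O)O": 180.16,   # aspirin
}


def molecular_weight(smiles: str) -> float:
    """User-defined fitness: here simply a table lookup."""
    return MOLECULAR_WEIGHT[smiles]


ranked = rank_by_fitness(list(MOLECULAR_WEIGHT), molecular_weight)
candidates = [smiles for smiles, _ in ranked[:2]]  # keep the two heaviest
print(candidates)
```

Swapping `molecular_weight` for any other callable (a solubility estimate, an activity score) changes the selection criterion without touching the ranking code.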

In one embodiment, multiple rounds of crossover and mutation operations can be performed iteratively to obtain more derivative compounds, from which the desired candidate compounds are then selected. Specifically, selecting candidate compounds from the derivative compounds according to the fitness may include:

Step S1: according to the fitness, select from the derivative compounds the target compounds whose fitness meets a preset condition.

Step S2: using the derived vectors corresponding to the target compounds as seed vectors, execute again the steps from S330 and/or S350 through the fitness measurement of S390.

Steps S1 and S2 are repeated until the iteration termination condition is satisfied. All derivative compounds obtained are then sorted in descending order of fitness, and a predetermined proportion or a predetermined number of the derivative compounds with the best fitness are selected as candidate compounds.

The target compounds that satisfy the preset condition can be chosen in several ways: a fixed number of compounds (such as 10, 30 or 50) can be selected from the derivative compounds; the derivative compounds can be sorted by fitness and a fixed proportion (such as 1/10, 1/5 or 1/3) taken from the top of the ranking; or the compounds whose fitness exceeds a fixed threshold can be selected. The number, proportion and condition used to select target compounds can be set as needed and are not detailed further here.
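The three kinds of preset condition can be sketched in Python as follows. The function `select_targets` and its keyword arguments are hypothetical names, and the sketch assumes that the fixed-number option also takes compounds from the top of the fitness ranking, which the text does not state explicitly.

```python
from typing import List, Optional, Sequence, Tuple

Scored = Tuple[str, float]  # (compound, fitness)


def select_targets(
    scored: Sequence[Scored],
    count: Optional[int] = None,
    ratio: Optional[float] = None,
    threshold: Optional[float] = None,
) -> List[Scored]:
    """Select target compounds by fixed number, fixed proportion, or threshold."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if count is not None:
        return ranked[:count]
    if ratio is not None:
        # Keep at least one compound even for very small proportions.
        return ranked[: max(1, round(len(ranked) * ratio))]
    if threshold is not None:
        return [pair for pair in ranked if pair[1] > threshold]
    raise ValueError("give one of count, ratio or threshold")


scored = [("A", 0.9), ("B", 0.4), ("C", 0.7), ("D", 0.1), ("E", 0.6)]
print(select_targets(scored, count=2))
print(select_targets(scored, ratio=0.4))
print(select_targets(scored, threshold=0.5))
```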

For example, after one round is completed and the fitness of the derivative compounds has been obtained, the derivative compounds can be sorted in descending order of fitness and the top-ranked target compounds selected. Crossover and mutation operations are performed on the derived vectors of these target compounds to generate a new batch of one-dimensional derived vectors. These new derived vectors are input into the molecular structure decoding model, decoded into new matrices and converted into new derivative compounds, whose fitness is then calculated. The new derivative compounds are again sorted in descending order of fitness, the top-ranked target compounds are selected, and crossover and mutation are applied to generate further one-dimensional derived vectors. This loop is iterated, and all derivative compounds generated are recorded. The candidate compounds with the best fitness are selected from all generated derivative compounds as the final result.
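The loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation of the application: `decode` and `fitness` are placeholders supplied by the caller (here `tuple` and `sum`, so that the "compound" is just the vector itself), each child is produced by a single-position crossover followed by an optional single-position mutation, and the parameter names `keep`, `offspring` and `mutation_rate` are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def evolve(
    seeds: Sequence[Vector],
    decode: Callable[[Vector], object],
    fitness: Callable[[object], float],
    rounds: int,
    keep: int,
    offspring: int,
    rng: random.Random,
    mutation_rate: float = 0.2,
) -> List[Tuple[object, float]]:
    """Iterate steps S1/S2 and return every derivative compound, best first."""
    population = [list(v) for v in seeds]
    history: List[Tuple[object, float]] = []  # all compounds ever generated
    for _ in range(rounds):
        scored = []
        for _ in range(offspring):
            parent, other = rng.sample(population, 2)
            child = list(parent)
            i = rng.randrange(len(child))
            child[i] = other[i]  # crossover at one position
            if rng.random() < mutation_rate:
                child[rng.randrange(len(child))] = rng.random()  # mutation
            compound = decode(child)
            scored.append((child, compound, fitness(compound)))
        history.extend((c, f) for _, c, f in scored)
        # S1: keep the fittest; S2: their vectors become the next seeds.
        scored.sort(key=lambda item: item[2], reverse=True)
        population = [v for v, _, _ in scored[:keep]]
    history.sort(key=lambda pair: pair[1], reverse=True)
    return history


rng = random.Random(0)
seeds = [[rng.random() for _ in range(4)] for _ in range(6)]
results = evolve(seeds, decode=tuple, fitness=sum,
                 rounds=5, keep=3, offspring=8, rng=rng)
print(len(results), results[0])
```

In a real setting `decode` would call the molecular structure decoding model and `fitness` would be the user-defined evaluation; `keep` must be at least 2 so that two parents can be drawn in each round.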

The number of iterations depends on the configured parameters and on the characteristics of the data set. The iteration termination condition can be a preset number of iterations, ranging from dozens to hundreds of rounds, such as 200 to 400 rounds. Alternatively, it can be a preset iteration duration, such as 8, 12, 24 or 48 hours.
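Both kinds of termination condition can be checked with a small helper such as the one below. The name `should_stop` and its parameters are assumptions made for illustration; the `now` argument exists only so that the time-based branch can be exercised without waiting.

```python
import time
from typing import Optional


def should_stop(
    rounds_done: int,
    started_at: float,
    max_rounds: Optional[int] = None,
    max_seconds: Optional[float] = None,
    now: Optional[float] = None,
) -> bool:
    """Return True once the round budget or the time budget is used up."""
    if max_rounds is not None and rounds_done >= max_rounds:
        return True
    if max_seconds is not None:
        current = time.monotonic() if now is None else now
        if current - started_at >= max_seconds:
            return True
    return False


start = time.monotonic()
print(should_stop(300, start, max_rounds=300))            # round budget reached
print(should_stop(10, start, max_seconds=8 * 3600))       # 8-hour budget not yet reached
```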

Of course, it is not necessary to perform iterations, and the desired candidate compounds can be obtained after one execution.

In the solution provided by the above embodiments, a neural network reduces the complex chemical space to one-dimensional vectors, which lets the design algorithm search the chemical space conveniently and efficiently. Combining this chemical space representation with a genetic algorithm overcomes the problem that the compounds produced by molecular generative models become progressively less diverse after reinforcement learning and transfer learning.

In the following, the solution of the present application is described through several specific experimental examples, which should not be construed as limiting the present application.

The latest ChEMBL28 database can be downloaded from the Internet and the SMILES strings of its compounds extracted. The sample compound structures may contain only hydrogen, carbon, nitrogen, oxygen, fluorine, sulfur, chlorine and bromine atoms, and chiral compounds, inorganic substances and salt ions are excluded. The number of heavy atoms is limited to 70 or fewer, and the SMILES strings are converted into canonical form. About 1.8 million SMILES strings remain after deduplication. These are used to train a neural network, on which the embodiments are based.

Experimental example 1

Protein kinase B, also known as AKT, is a serine/threonine-specific protein kinase. It plays an important regulatory role in cell apoptosis, proliferation, migration and other cellular processes. AKT1 participates in the cell survival pathway through the process of apoptosis, blocks apoptosis and promotes cell survival. Clinical studies have found that AKT is overexpressed in various human tumors such as gastric cancer and pancreatic cancer. AKT inhibitors can inhibit the activity of AKT and promote the apoptosis of cancer cells.

Compound 1 is an AKT inhibitor in clinical research. Its interaction mode is analyzed and a pharmacophore model is established; the degree to which a molecule matches the pharmacophore is then used as the fitness evaluation criterion for finding new molecules.

Specifically, a first predetermined number of seed compounds are randomly selected from the compound database and input into the molecular structure encoding model to obtain seed vectors. Crossover and mutation operations are performed on the seed vectors based on the genetic algorithm to obtain multiple derived vectors. The derived vectors are input into the molecular structure decoding model to obtain multiple derivative compounds. The pharmacophore model described above is used to evaluate how well each derivative compound matches the pharmacophore, which gives the fitness of the derivative compound. A second predetermined number of derived vectors are then selected according to the fitness and used as seed vectors for further crossover and mutation operations. Iterating in this way for 300 rounds yields a batch of new compounds, as follows:

Experimental example 2

Clinical studies have found that human isocitrate dehydrogenase 1 (IDH1) is mutated in a variety of malignant tumors, such as glioma. Mutated IDH1 can convert α-ketoglutarate to 2-hydroxyglutarate. The latter is a carcinogen that accumulates in the body and promotes the further progression of cancer. Clinical trials have shown that drugs that inhibit the activity of mutant IDH1 can effectively reduce the concentration of 2-hydroxyglutarate in the body and relieve cancer symptoms.

Compound 2 is the most promising inhibitor of mutant IDH1 currently under study. It is taken as the template molecule; for each generated molecule, the similarity to the template molecule (measured by molecular fingerprint) is calculated and used as the fitness of that molecule, and a batch of similar molecules is searched for in the latent space.
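The text does not name the similarity measure. A common choice for fingerprint-based similarity is the Tanimoto coefficient, which is assumed in the sketch below; the fingerprints are represented simply as sets of "on" bit indices, and the example bit sets are invented for illustration rather than computed from real molecules.

```python
from typing import AbstractSet


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Hypothetical fingerprints given as sets of on-bit indices.
template_bits = {1, 4, 7, 9}
generated_bits = {1, 4, 8, 9, 12}

# 3 shared bits out of 6 distinct bits.
print(tanimoto(template_bits, generated_bits))
```

Used as a fitness function, this value is highest for generated molecules whose fingerprint is closest to that of the template molecule.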

Specifically, a third predetermined number of seed compounds are randomly selected from the compound database and input into the molecular structure encoding model to obtain seed vectors. Crossover and mutation operations are performed on the seed vectors based on the genetic algorithm to obtain multiple derived vectors. The derived vectors are input into the molecular structure decoding model to obtain multiple derivative compounds. The similarity between each derivative compound and the template molecule is calculated to obtain the fitness of the derivative compound. A fourth predetermined number of derived vectors are then selected according to the fitness and used as seed vectors for further crossover and mutation operations. Iterating in this way for 380 rounds yields a batch of new compounds, as follows:

Please refer to FIG. 4. FIG. 4 is a schematic structural diagram of a compound design device in an embodiment of the present application. In this embodiment, the compound design device 40 includes an acquisition module 41, an operation module 42 and a decoding module 43.

The acquisition module 41 is used to obtain a seed vector, the seed vector being a feature vector representation of a seed compound; the operation module 42 is used to perform a crossover operation and/or a mutation operation on the seed vector based on a genetic algorithm to obtain a derived vector; and the decoding module 43 is used to process the derived vector to obtain a derivative compound. In this way, the device designs compounds based on the genetic algorithm, which enlarges the explorable compound space, yields more diverse compounds, and widens the range of choices. Furthermore, the complex chemical space is reduced to one-dimensional vectors during operation, which enables the design algorithm to search the chemical space conveniently and efficiently. For the specific execution process, refer to the description of the above embodiments, which is not repeated here.

Further, the compound design device 40 also includes a selection module (not shown in the figure), which is used to measure the fitness of the derived compounds based on the fitness function; and select candidate compounds from the derived compounds according to the fitness.

Specifically, the selection module selects candidate compounds from the derivative compounds according to the fitness as follows. Step S1: according to the fitness, select from the derivative compounds the target compounds whose fitness meets the preset condition. Step S2: using the derived vectors corresponding to the target compounds as seed vectors, execute again the steps from performing the crossover operation and/or mutation operation on the seed vectors based on the genetic algorithm to obtain derived vectors, through measuring the fitness of the derivative compounds based on the fitness function. Steps S1 and S2 are repeated until the iteration termination condition is met. All derivative compounds obtained are then sorted in descending order of fitness, and a predetermined proportion or a predetermined number of the derivative compounds with the best fitness are selected as candidate compounds. In this way, more candidate compounds are obtained and better compounds can be screened out more easily. For the specific execution process, refer to the description of the above embodiments, which is not repeated here.

Further, the operation module 42 includes a crossover operation submodule (not shown in the figure), which is used to select two seed vectors from the seed vector set, select an exchange position in one of the seed vectors, and exchange the value at that position with the value at the corresponding position of the other seed vector to obtain new derived vectors. Derivative compounds can then be obtained from them, which enriches the derived vectors and increases the diversity of the derivative compounds. For the specific execution process, refer to the description of the above embodiments, which is not repeated here.

Further, the operation module 42 includes a mutation operation submodule (not shown in the figure), which is used to select a seed vector from the seed vector set, select a mutation position in the selected seed vector, and replace the value at the mutation position with a new value to obtain a new derived vector. A derivative compound can then be obtained from it, which enriches the derived vectors and increases the diversity of the derivative compounds. For the specific execution process, refer to the description of the above embodiments, which is not repeated here.

Further, the decoding module 43 is used to input the derived vector into the molecular structure decoding model, which is a neural network model, decode the derived vector to obtain a derived molecular structure, and obtain the derivative compound according to the derived molecular structure.

Further, the compound design device 40 also includes an encoding module (not shown in the figure), which is used to obtain the SMILES string of the seed compound; one-hot encode the SMILES string to obtain a seed matrix, the seed matrix being a matrix representation of the seed compound; and input the seed matrix into the molecular structure encoding model, which encodes it to obtain the seed vector.
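One-hot encoding of a SMILES string can be sketched as follows. The sketch is character-level and pads every string to a fixed length; the vocabulary `VOCAB`, the padding character and the function name `one_hot_smiles` are assumptions, and a practical encoder would need to treat two-letter element symbols such as Cl and Br as single tokens.

```python
from typing import List, Sequence

# Hypothetical character vocabulary; index 0 is the padding character.
VOCAB = [" ", "C", "O", "N", "(", ")", "=", "1", "c"]


def one_hot_smiles(
    smiles: str, vocabulary: Sequence[str], length: int, pad: str = " "
) -> List[List[int]]:
    """Return a length x len(vocabulary) matrix with a single 1 per row."""
    if len(smiles) > length:
        raise ValueError("SMILES string is longer than the fixed length")
    index = {ch: i for i, ch in enumerate(vocabulary)}
    matrix = []
    for ch in smiles.ljust(length, pad):
        row = [0] * len(vocabulary)
        row[index[ch]] = 1  # raises KeyError for characters outside the vocabulary
        matrix.append(row)
    return matrix


seed_matrix = one_hot_smiles("CCO", VOCAB, length=5)
for row in seed_matrix:
    print(row)
```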

Further, the compound design device 40 also includes a model training module (not shown in the figure), which is used to obtain a sample matrix, the sample matrix being a matrix representation of a sample compound; input the sample matrix into the encoding layer of an autoencoder and encode it to obtain a sample vector, the sample vector being a feature vector representation of the sample compound; input the sample vector into the decoding layer of the autoencoder and decode it to obtain a prediction matrix; calculate the loss between the prediction matrix and the sample matrix; and iteratively update the parameters of the autoencoder based on the loss until the loss is stable. The decoding layer and output layer of the trained autoencoder are then used as the molecular structure decoding model, where the output layer converts a compound represented as a matrix into a molecular structure representation, and the encoding layer of the trained autoencoder is used as the molecular structure encoding model.

The compound design device can be an independent server, a server cluster, or a module of a server. It can be used for model training and for running the genetic algorithm, and thus for designing compounds.

Please refer to FIG. 5. FIG. 5 is a schematic structural diagram of compound design equipment in an embodiment of the present application. In this embodiment, the compound design equipment 10 includes a processor 11 and a memory 12.

The processor 11 may also be called a CPU (Central Processing Unit). The processor 11 may be an integrated circuit chip with signal processing capabilities. The processor 11 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor can be a microprocessor, or the processor 11 can be any conventional processor or the like.

The memory 12 stores the instructions and data required for the operation of the processor 11.

The processor 11 is configured to execute the instructions to implement the method provided by any embodiment of the compound design method of the present application, or by any non-conflicting combination of embodiments.

The compound design equipment can be a server, a desktop computer, a laptop, etc. It can be used for model training and for running the genetic algorithm, and thus for designing compounds.

Please refer to FIG. 6, which is a schematic structural diagram of a computer-readable storage medium in an embodiment of the present application. The computer-readable storage medium 20 of this embodiment stores instructions/program data 21. When executed, the instructions/program data 21 implement the method provided by any embodiment of the compound design method of the present application, or by any non-conflicting combination of embodiments. The instructions/program data 21 can form a program file stored in the storage medium 20 in the form of a software product, so that a computer device (which can be a personal computer, a server, a network device, etc.) or a processor executes all or part of the steps of the methods in the various implementations of the present application. The storage medium 20 includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc, or a terminal device such as a computer, server, mobile phone or tablet.

In the several embodiments provided in this application, it should be understood that the disclosed system, device and method can be implemented in other ways. The device embodiments described above are only illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical or in other forms.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

The above is only an implementation of the present application and does not limit its patent scope. Any equivalent structure or equivalent process transformation made using the specification and drawings of the present application, or any direct or indirect use in other related technical fields, is likewise included in the scope of patent protection of the present application.

Claims

A compound design method, comprising:

obtaining a seed vector, the seed vector being a feature vector representation of a seed compound;

performing a crossover operation and/or a mutation operation on the seed vector based on a genetic algorithm to obtain a derived vector; and

processing the derived vector to obtain a derivative compound.
The compound design method according to claim 1, wherein, after the derived vector is processed to obtain the derivative compound, the method further comprises:

measuring the fitness of each derivative compound based on a fitness function; and

selecting candidate compounds from the derivative compounds according to the fitness.
The compound design method according to claim 2, wherein selecting candidate compounds from the derivative compounds according to the fitness comprises:

step S1: according to the fitness, selecting from the derivative compounds the target compounds whose fitness meets a preset condition;

step S2: using the derived vectors corresponding to the target compounds as seed vectors, executing again the steps from performing the crossover operation and/or mutation operation on the seed vectors based on the genetic algorithm to obtain derived vectors, through measuring the fitness of the derivative compounds based on the fitness function;

repeating steps S1 and S2 until an iteration termination condition is satisfied;

sorting all derivative compounds obtained in descending order of fitness; and

selecting a predetermined proportion or a predetermined number of the derivative compounds with the best fitness as candidate compounds.
The compound design method according to claim 1, wherein performing a crossover operation on the seed vector based on a genetic algorithm comprises:

selecting two seed vectors from the seed vector set, selecting an exchange position in one of the seed vectors, and exchanging the value at the exchange position of that seed vector with the value at the corresponding position of the other seed vector.
The compound design method according to claim 1, wherein, performing a mutation operation on the seed vector based on a genetic algorithm, comprising:

A seed vector is selected from the seed vector set, a mutation position is selected from the selected seed vector, and a value at the mutation position is replaced with a new value.
The compound design method according to any one of claims 1-5, wherein processing the derived vector to obtain a derivative compound comprises:

inputting the derived vector into a molecular structure decoding model and decoding the derived vector to obtain a derived molecular structure, the molecular structure decoding model being a neural network model; and

obtaining the derivative compound according to the derived molecular structure.
The compound design method according to claim 6, wherein, before inputting the derived vector into the molecular structure decoding model and decoding the derived vector to obtain the derived molecular structure, the method further comprises:

acquiring a sample matrix, the sample matrix being a matrix representation of a sample compound;

inputting the sample matrix into the encoding layer of an autoencoder and encoding it to obtain a sample vector, the sample vector being a feature vector representation of the sample compound;

inputting the sample vector into the decoding layer of the autoencoder and decoding it to obtain a prediction matrix;

calculating a loss between the prediction matrix and the sample matrix; and

iteratively updating the parameters of the autoencoder based on the loss until the loss is stable, and using the decoding layer and the output layer of the trained autoencoder as the molecular structure decoding model, the output layer being used to convert a compound represented as a matrix into a molecular structure representation.
The compound design method according to any one of claims 1-5, wherein obtaining the seed vector comprises:

obtaining the SMILES string of the seed compound;

one-hot encoding the SMILES string of the seed compound to obtain a seed matrix, the seed matrix being a matrix representation of the seed compound; and

encoding the seed matrix to obtain the seed vector.
The compound design method according to claim 8, wherein said encoding said seed matrix to obtain said seed vector comprises:

The seed matrix is input into the molecular structure encoding model, and the seed matrix is encoded to obtain the seed vector.
The compound design method according to claim 9, wherein, before inputting the seed matrix into the molecular structure encoding model and encoding the seed matrix to obtain the seed vector, the method further comprises:

acquiring a sample matrix, the sample matrix being a matrix representation of a sample compound;

inputting the sample matrix into the encoding layer of an autoencoder and encoding it to obtain a sample vector, the sample vector being a feature vector representation of the sample compound;

inputting the sample vector into the decoding layer of the autoencoder and decoding it to obtain a prediction matrix;

calculating a loss between the prediction matrix and the sample matrix; and

iteratively updating the parameters of the autoencoder based on the loss until the loss is stable, and using the encoding layer of the trained autoencoder as the molecular structure encoding model.
A compound design device, including:

An acquisition module, configured to acquire a seed vector, wherein the seed vector is a representation of a feature vector of a seed compound;

An operation module, configured to perform a crossover operation and/or a mutation operation on the seed vector based on a genetic algorithm to obtain a derived vector;

The decoding module is configured to process the derivation vector to obtain a derivation compound.
The compound design device according to claim 11, further comprising:

The selection module is used to respectively measure the fitness of the derivative compounds based on the fitness function; and select candidate compounds from the derivative compounds according to the size of the fitness.
The compound design device according to claim 12, wherein the selection module selecting candidate compounds from the derivative compounds according to the fitness comprises:

step S1: according to the fitness, selecting from the derivative compounds the target compounds whose fitness meets a preset condition;

step S2: using the derived vectors corresponding to the target compounds as seed vectors, executing again the steps from performing the crossover operation and/or mutation operation on the seed vectors based on the genetic algorithm to obtain derived vectors, through measuring the fitness of the derivative compounds based on the fitness function;

repeating steps S1 and S2 until an iteration termination condition is satisfied;

sorting all derivative compounds obtained in descending order of fitness; and

selecting a predetermined proportion or a predetermined number of the derivative compounds with the best fitness as candidate compounds.
The compound design device according to claim 11, wherein the operation module includes a crossover operation submodule, and the crossover operation submodule is used to select two seed vectors from the seed vector set, select an exchange position in one of the seed vectors, and exchange the value at the exchange position of that seed vector with the value at the corresponding position of the other seed vector.
The compound design device according to claim 11, wherein the operation module includes a mutation operation submodule, and the mutation operation submodule is used to select a seed vector from the seed vector set, select a mutation position in the selected seed vector, and replace the value at the mutation position with a new value.
The compound design device according to any one of claims 11-15, wherein the decoding module is specifically configured to input the derived vector into a molecular structure decoding model, decode the derived vector to obtain a derived molecular structure, and obtain a derivative compound according to the derived molecular structure; the molecular structure decoding model is a neural network model.
The compound design device according to claim 16, further comprising a model training module for:

acquiring a sample matrix, the sample matrix being a matrix representation of a sample compound;

inputting the sample matrix into the encoding layer of an autoencoder and encoding it to obtain a sample vector, the sample vector being a feature vector representation of the sample compound;

inputting the sample vector into the decoding layer of the autoencoder and decoding it to obtain a prediction matrix;

calculating a loss between the prediction matrix and the sample matrix; and

iteratively updating the parameters of the autoencoder based on the loss until the loss is stable, and using the decoding layer and the output layer of the trained autoencoder as the molecular structure decoding model, the output layer being used to convert a compound represented as a matrix into a molecular structure representation.
The compound design device according to any one of claims 11-15, further comprising an encoding module for:

obtaining the SMILES string of the seed compound;

one-hot encoding the SMILES string of the seed compound to obtain a seed matrix, the seed matrix being a matrix representation of the seed compound; and

encoding the seed matrix to obtain the seed vector.
The compound design device according to claim 18, wherein the encoding module encoding the seed matrix to obtain the seed vector comprises: inputting the seed matrix into a molecular structure encoding model, and encoding the seed matrix to obtain the seed vector.
The compound design device according to claim 19, further comprising a model training module for:

acquiring a sample matrix, the sample matrix being a matrix representation of a sample compound;

inputting the sample matrix into the encoding layer of an autoencoder and encoding it to obtain a sample vector, the sample vector being a feature vector representation of the sample compound;

inputting the sample vector into the decoding layer of the autoencoder and decoding it to obtain a prediction matrix;

calculating a loss between the prediction matrix and the sample matrix; and

iteratively updating the parameters of the autoencoder based on the loss until the loss is stable, and using the encoding layer of the trained autoencoder as the molecular structure encoding model.
Compound design equipment, comprising a memory and a processor, wherein the memory stores instructions, and the processor is used to execute the instructions to implement the compound design method according to any one of claims 1-10.
A computer-readable storage medium, wherein the computer-readable storage medium is used to store instructions/program data, and the instructions/program data can be executed to implement the compound design method according to any one of claims 1-10.