WO2021220774A1 - System for generating compound structure representations - Google Patents
- Publication number
- WO2021220774A1 (application PCT/JP2021/015042)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- structural
- model
- vector
- encoder
- structural formula
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/50—Molecular design, e.g. of drugs
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/90—Programming languages; Computing architectures; Database systems; Data warehousing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/80—Data visualisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/30—Prediction of properties of chemical compounds, compositions or mixtures
Definitions
- the present invention relates to a system for generating candidates for compound structural representations that are expected to have desired physical characteristics.
- a virtual screening method is used for new material search tasks.
- An example of a virtual screening method is disclosed in, for example, Non-Patent Document 1.
- a machine learning model is applied to the data of a known compound, and a physical characteristic value estimation model is constructed by inputting a chemical structural formula expressed in a predetermined expression format.
- the above-mentioned physical property value estimation model is applied to the randomly generated chemical structural formula. Screening is performed based on the predicted value calculated in this way, and a chemical structural formula expected to have a physical property value exceeding the threshold value is presented as a candidate.
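- The screening step described above can be sketched as follows. This is a minimal illustration, not the method of Non-Patent Document 1: `virtual_screen` and `toy_predict` are hypothetical names, the predictor is a toy stand-in for a trained estimation model, and a fixed candidate list replaces random structure generation.

```python
def virtual_screen(candidates, predict, threshold):
    """Keep candidates whose predicted property value exceeds the threshold."""
    scored = [(c, predict(c)) for c in candidates]
    hits = [(c, v) for c, v in scored if v > threshold]
    # Present the best-scoring candidates first
    return sorted(hits, key=lambda cv: cv[1], reverse=True)


def toy_predict(smiles):
    """Hypothetical stand-in for a trained model: counts 'C' characters."""
    return float(smiles.count("C"))


# Fixed list standing in for randomly generated structural formulas
candidates = ["CCO", "CCCCO", "O", "CCCCCC"]
hits = virtual_screen(candidates, toy_predict, threshold=3.0)
```

- Because the predictor is fitted to known compounds, a screen of this kind can only rank candidates within the range the model has seen.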
- Non-Patent Document 2 discloses a stacked semi-supervised learning model that performs image classification tasks.
- Non-Patent Document 2 discloses that the outer model in the stacked model is trained with the unlabeled training data, and the inner model is trained with the labeled training data.
- Because the search method using a conventional physical property value estimation model is an interpolative search that can estimate only within the range of the training data, it is not suitable for an extrapolative search whose purpose is to discover a new material with physical property values exceeding the performance of known materials.
- the conventional virtual screening method acquires the relationship between the expression form of the chemical structural formula such as SMILES (Simplified Molecular Input Line Entry System) and the physical property value by a model such as a neural network.
- The purpose of this is to generate a chemical structural formula having a desired physical property value.
- For this purpose, a large amount of data pairing chemical structural formulas with physical property values is needed.
- However, it is difficult to prepare a large amount of data, such as experimental data and simulation results, in which chemical structural formulas are paired with physical property values.
- One aspect of the present invention is a system for generating a compound structural representation, which includes one or more processors and one or more storage devices.
- the one or more storage devices store a structural model, a structural property relationship model, a compound structural representation of one or more known substances, and one or more target values for each of one or more physical property values.
- the structural model includes a first encoder that converts a compound structure representation into a real number vector and a first decoder that estimates the compound structure representation from the real number vector converted by the first encoder.
- the structural property relationship model converts an input vector including the real number vector generated by the first encoder and a target value vector including a target value of one or more kinds of physical property values as a component into a real number vector.
- The one or more processors generate one or more structure generation vectors using the first encoder of the structural model, based on the compound structure representations of the one or more known substances and the one or more target values for each of the one or more physical property values.
- Each of the one or more structure generation vectors includes, as components, a real number vector generated by the first encoder from the compound structure representation of one known substance, and a target value vector including a target value for each of the one or more physical property values.
- the one or more processors input each structure generation vector of the one or more structure generation vectors into the structure property relation model.
- the one or more processors extract a real number vector corresponding to the compound structure representation from the output of the second decoder of the structural property relation model.
- the one or more processors input the extracted real number vector to the first decoder of the structural model to generate a new compound structural representation.
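- The processing steps above can be sketched as a single function. This is a rough sketch under stated assumptions: `generate_candidates` is a hypothetical name, the encoder, structural property relationship model, and decoder are passed in as plain callables, and pairing every known substance with every target value vector is one possible reading of the text. The toy callables in the usage part are placeholders, not trained networks.

```python
def generate_candidates(known, targets, encode, relation_model, decode, latent_dim):
    """Encode, append the target values, pass through the relation model, decode."""
    results = []
    for representation in known:
        z = encode(representation)            # real number vector of length latent_dim
        for target in targets:
            gen_vec = list(z) + list(target)  # structure generation vector
            out = relation_model(gen_vec)     # output of the second decoder
            z_new = out[:latent_dim]          # part corresponding to the structure
            results.append(decode(z_new))     # new compound structure representation
    return results


# Toy placeholders (assumptions for illustration only)
encode = lambda s: [float(ord(ch)) for ch in s]
relation_model = lambda v: [x + 1.0 for x in v]
decode = lambda z: "".join(chr(int(round(x))) for x in z)

new_reps = generate_candidates(["CC"], [[0.5]], encode, relation_model, decode, latent_dim=2)
```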
- A learning model capable of presenting candidate compounds with physical property values better than those contained in the training data can be generated using a small amount of data in which chemical structural formulas are paired with physical property values.
- a configuration example of the chemical structural formula generation model according to the examples of the present specification is schematically shown.
- An example of the configuration of the chemical structural formula generation system according to the first embodiment is shown.
- An example of the hardware configuration of the structural formula generator is shown.
- An example of the structure of the catalog data is shown.
- An example of the configuration of experimental data is shown.
- An example of the data contained in the structural matrix database is shown.
- An example of the data contained in the learning database is shown.
- An example of the information contained in the initial parameters is shown.
- An example of a network structure confirmation screen displayed by the display unit for the user in the display device is shown.
- a configuration example of the model table included in the model data is shown.
- a flowchart of a processing example of the structural formula conversion unit is shown.
- a flowchart of a processing example of the learning data generation unit is shown.
- a flowchart of a processing example of the structure generation vector group generation unit is shown.
- a flowchart of a processing example of the network structure determination unit is shown.
- a flowchart of a processing example of the structural model learning unit is shown.
- a flowchart of a processing example of the structural model additional learning unit is shown.
- a flowchart of a processing example of the structural property relationship model learning unit is shown.
- a flowchart of a processing example of the new structural formula generator is shown.
- a flowchart of a processing example of the structural formula inverse conversion unit is shown.
- a flowchart of a processing example of the structural formula shaping unit is shown.
- An example of a network structure confirmation screen that the display unit displays for the user on the display device according to the second embodiment is shown.
- An example of the data included in the learning database according to the second embodiment is shown.
- a configuration example of the model table included in the model data according to the second embodiment is shown.
- An example of a network structure confirmation screen that the display unit displays for the user on the display device according to the third embodiment is shown.
- An example of the data included in the learning database DB according to the third embodiment is shown.
- a configuration example of the model table included in the model data DB according to the third embodiment is shown.
- The conditions to be satisfied by the types of physical property values included in the records of the experimental data are schematically shown.
- An example of training data for a structural property relationship model composed of two types of experimental data is schematically shown.
- An example of a network structure confirmation screen that the display unit displays for the user on the display device according to the fourth embodiment is shown.
- a configuration example of the model table included in the model data DB according to the third embodiment is shown.
- An example is shown in which the training data generation unit generates training data for a structural property relationship model from experimental data.
- This system may be a physical computer system (one or more physical computers) or a system built on a group of computer resources (multiple computer resources) such as a cloud platform.
- A computer system or computational resource group includes one or more interface devices (including, for example, communication devices and input/output devices), one or more storage devices (including, for example, memory (main storage) and auxiliary storage devices), and one or more processors.
- the process described with the function as the subject may be a process performed by a processor or a system having the processor.
- the program may be installed from the program source.
- the program source may be, for example, a program distribution computer or a computer-readable storage medium (eg, a computer-readable non-transient storage medium).
- the description of each function is an example, and a plurality of functions may be combined into one function, or one function may be divided into a plurality of functions.
- FIG. 1 schematically shows a configuration example of a chemical structural formula generation model according to an embodiment of the present specification.
- the chemical structural formula generation model 10 accepts an expression representing a known chemical structural formula and a target physical characteristic value as inputs, and outputs a new expression of a chemical structural formula expected to have a physical property value close to the target physical property value.
- The chemical structural formula generation model 10 combines two types of models: a structural model 100 for learning the chemical structural formula, and a structural property relationship model 104 for learning the relationship between the feature quantity of the chemical structural formula and the physical property value.
- the structural model 100 is composed of one variational auto-encoder (VAE), and the structural property relationship model 104 is composed of one or more VAEs.
- In this configuration example, the structural property relationship model 104 is composed of a single VAE.
- VAE is a kind of autoencoder, and is a deep generative model composed of two neural networks, an encoder and a decoder.
- the encoder converts the input (vector) into a real vector.
- the space to which the real vector belongs is called the latent space, and it is assumed that it follows a predetermined distribution, for example, a normal distribution.
- the decoder inversely transforms the real vector and outputs a vector of the same dimension as the input.
- Encoders and decoders are trained (learned) so that the inputs and outputs are equal.
- the ability to reconstruct an input from an intermediate output real vector means that the real vector reflects the full characteristics of the input.
- the dimension of the latent space is set to be smaller than the dimension of the input. Therefore, the encoder can extract the features of the input and compress the dimensions of the input.
- the intermediate output vector is called a latent variable or latent expression, and is an abstract expression that represents the features extracted from the structural formula matrix that represents the chemical structural formula.
- the structural formula matrix can be converted from, for example, a character string representing the chemical structural formula of the material.
- Latent variables are assumed to follow a given distribution, such as a Gaussian distribution. Therefore, the decoder can restore the input structural formula matrix with high accuracy even when it receives a noisy vector. As described above, VAE has high robustness as a generative model.
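- How a noisy latent vector arises can be illustrated with the sampling step commonly used in VAEs. The text does not specify how sampling is done, so the reparameterization below (mean plus scaled Gaussian noise) is an assumption taken from the standard VAE formulation, and `sample_latent` is a hypothetical name.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), where sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
mu = [0.2, -1.0, 0.5]             # encoder mean (illustrative values)
log_var = [-20.0, -20.0, -20.0]   # very small variance: samples stay close to mu
z = sample_latent(mu, log_var, rng)
```

- With a larger `log_var` the samples spread further from `mu`, which is the kind of noise the decoder is trained to tolerate.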
- the chemical structural formula generation model 10 has a nested structure.
- the structural property relationship model 104 is arranged between the encoder 101 of the structural model (outer VAE) 100 and the decoder 102.
- the structural property relationship model 104 is composed of an encoder 105 and a decoder 106.
- the structural property relationship model 104 can include a plurality of VAEs (inner VAEs), and each inner VAE is arranged between an encoder and a decoder of another VAE.
- the encoder 101 of the structural model 100 can be composed of, for example, a plurality of one-dimensional convolution layers and a plurality of fully connected layers.
- The encoder 101 receives an M × N-dimensional structural formula matrix (structural representation) as an input and converts it into an L-dimensional vector.
- The decoder 102 can be composed of, for example, a plurality of fully connected layers and an RNN (Recurrent Neural Network).
- The decoder 102 receives the L-dimensional vector as an input and inversely transforms it into an M × N-dimensional structural formula matrix.
- the encoder 105 and the decoder 106 of the inner VAE of the structural property relationship model 104 can be composed of, for example, a plurality of fully connected layers.
- The encoder 105 receives as an input a vector 107 that includes, as components, an L-dimensional vector which is a latent variable (conversion result) of the structural model 100 (outer VAE) and a P-dimensional vector (target value vector) composed of an array of P physical property values.
- the encoder 105 outputs an intermediate vector (latent representation) 108 having a dimension smaller than that of the vector 107.
- the latent space of the structural property relational model 104 gives a latent expression that is abstracted by the combination of the structural feature and the physical property value feature.
- the decoder 106 receives the intermediate vector 108 as an input and outputs the L + P dimensional vector 109.
- the P elements extracted from the L + P dimension vector 109 are an array of physical property values.
- The L-dimensional vector extracted from the output of the decoder 106 is input to the decoder 102 of the structural model 100, and an M × N-dimensional structural formula matrix is output.
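- The dimensions stated above can be summarized as a chain of mappings. The symbol K for the dimension of the intermediate vector 108 is introduced here for illustration only; the text states only that it is smaller than the dimension of the vector 107.

```latex
\begin{aligned}
\text{encoder 101} &: \mathbb{R}^{M \times N} \to \mathbb{R}^{L} \\
\text{encoder 105} &: \mathbb{R}^{L+P} \to \mathbb{R}^{K}, \qquad K < L + P \\
\text{decoder 106} &: \mathbb{R}^{K} \to \mathbb{R}^{L+P} \\
\text{decoder 102} &: \mathbb{R}^{L} \to \mathbb{R}^{M \times N}
\end{aligned}
```

- Only the L structural components of the output of the decoder 106 are passed on to the decoder 102; the remaining P components are the array of physical property values.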
- the system sequentially inputs the chemical structural formula (structural expression) of the known compound and the target physical property value into the learned chemical structural formula generation model 10.
- This makes it possible to generate a new chemical structural formula that is expected to show a value close to the target physical property value.
- The system may sequentially input, into the learned chemical structural formula generation model 10, combinations of the chemical structural formulas of the highest-performing known compounds with a predetermined physical property value and values in its vicinity (these are also called target values). This can increase the probability of generating a new chemical structural formula that shows a value close to the target physical property value.
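- One way to assemble such inputs is sketched below. The helper names `top_compounds` and `target_grid`, the evenly spaced grid, and the assumption that a higher measured value means better performance are all illustrative choices, not details given in the text.

```python
def top_compounds(records, k):
    """records: list of (smiles, measured_value); assumes higher is better."""
    ranked = sorted(records, key=lambda r: r[1], reverse=True)
    return [smiles for smiles, _ in ranked[:k]]


def target_grid(center, step, n_steps):
    """Target value plus evenly spaced values in its vicinity."""
    return [center + step * k for k in range(-n_steps, n_steps + 1)]


records = [("CCO", 1.2), ("CCN", 3.4), ("CCC", 2.0)]   # illustrative values
best = top_compounds(records, k=2)
targets = target_grid(center=5.0, step=0.5, n_steps=1)
# Every (structure, target value) pair is fed to the model in turn
inputs = [(s, t) for s in best for t in targets]
```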
- One learning target is the grammatical rules of the representation format expressing a chemical structural formula.
- The other learning target is the relationship between a chemical structure (chemical structural formula) and a physical property value.
- In VAE learning, a loss function is given so that the input of the encoder and the output of the decoder are equal, and the parameters of the encoder and the decoder are updated (optimized).
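- A loss of this kind can be sketched as follows. The text only requires that the input and output be made equal; the squared-error reconstruction term, the KL term against a standard normal distribution, and the weight `beta` are assumptions taken from the usual VAE formulation, and `vae_loss` is a hypothetical name.

```python
import math


def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction error plus KL divergence from N(mu, sigma^2) to N(0, 1)."""
    # Squared error between the encoder input and the decoder output
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # Closed-form KL term for a diagonal Gaussian posterior
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))
    return recon + beta * kl


perfect = vae_loss([1.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0])
mismatch = vae_loss([1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

- The loss is zero only when the reconstruction is exact and the latent distribution matches the assumed prior; minimizing it drives both the encoder and the decoder parameters.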
- the physical characteristic value is necessary only for learning the relationship between the feature and the physical characteristic value.
- The types of physical property values include a type representing a physical property and a type representing a chemical property. Both types of physical property values are strongly influenced by local structural features such as main chain structure, terminal structure, and partial structure. Therefore, in learning the relationship between the chemical structure and the physical property value, the relationship between the feature quantity extracted from the chemical structural formula and the physical property value can be used as learning data instead of the chemical structural formula itself.
- the examples of the present specification mainly perform the following steps.
- the system receives user settings and data to determine a learning model (network structure).
- The system performs learning (training) of the chemical structural formula generation model.
- Learning of the chemical structural formula generation model includes learning of the structural model (outer VAE) 100 with the catalog data, learning of the structural model 100 with the chemical structural formulas of the experimental data, and learning of the structural property relationship model 104 (one or more inner VAEs) with the experimental data.
- the system builds a chemical structural formula generation model from the trained VAE and generates a new structural formula.
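- The order of the learning stages can be sketched as follows. This only illustrates the data flow: `train_generation_model` is a hypothetical name, the fitting routines and the encoder are passed in as callables, and SMILES strings stand in for structural formula matrices. The property values in the usage part are illustrative.

```python
def train_generation_model(catalog, experimental, fit_outer, encode, fit_inner):
    """catalog: list of structures; experimental: list of (structure, property values)."""
    # 1) Outer VAE learns from catalog structures (no property values needed)
    fit_outer(list(catalog))
    # 2) Additional learning of the outer VAE on the experimental structures
    fit_outer([structure for structure, _ in experimental])
    # 3) Inner VAE learns from latent vector + property values
    fit_inner([list(encode(structure)) + list(props) for structure, props in experimental])


# Toy placeholders that only record what they were called with
log = []
fit_outer = lambda data: log.append(("outer", len(data)))
fit_inner = lambda rows: log.append(("inner", rows))
encode = lambda s: [float(len(s))]

train_generation_model(["CCO", "CCN", "CCC"], [("CCO", [46.07, -0.31])],
                       fit_outer, encode, fit_inner)
```

- Only the third stage consumes property values, which is why a small experimental data set can be combined with a much larger catalog.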
- FIG. 2 shows an example of the configuration of the chemical structural formula generation system according to the first embodiment.
- the system includes a parameter setting device M01, a data storage device M02, a model learning device M03, a structural formula generator M04, and a display device M05 that can communicate with each other via a network.
- the parameter setting device M01 sets or generates various data including parameters for generating (including learning) a chemical structural formula generation system.
- the parameter setting device M01 includes a structural formula conversion unit P01, a learning data generation unit P02, a structure generation vector group generation unit P03, and a network structure determination unit P04. These are programs.
- the parameter setting device M01 further stores the catalog data DB 10, the experimental data DB 11, the structural formula vocabulary data DB 12, and the initial parameter DB 13.
- the data storage device M02 can store various types of data including data (information) generated by other devices.
- the data storage device M02 stores the structural formula database DB14, the learning database DB15, the structure generation vector database DB16, the model data DB17, and the candidate structural formula database DB18.
- the model learning device M03 learns the learning model included in the chemical structural formula generation system.
- the model learning device M03 includes a structural model learning unit P05, a structural model additional learning unit P06, and a structural property relationship model learning unit P07. These are programs.
- the structural formula generator M04 uses the learned chemical structural formula generation model to generate (estimate) the chemical structural formula of a new substance that is expected to have a desired physical property value.
- the structural formula generation device M04 includes a new structural formula generation unit P08, a structural formula inverse conversion unit P09, and a structural formula shaping unit P10. These are programs.
- the display device M05 can present the information acquired from the other device to the user, receive the input data from the user, and transmit the information to the other device.
- the display device M05 includes a display unit P11 which is a program.
- FIG. 3 shows an example of the hardware configuration of the structural formula generator M04.
- the structural formula generator M04 includes a processor U111 having arithmetic performance, and a DRAM U112 that provides a volatile temporary storage area for storing programs and data executed by the processor U111.
- The structural formula generator M04 further includes a communication device U113 that performs data communication with other devices, including other devices in this system, and an auxiliary storage device U114 that provides a permanent information storage area using an HDD (Hard Disk Drive), a flash memory, or the like. Further, the structural formula generator M04 includes an input device U115 that accepts an operation from the user, and a monitor U116 (an example of the output device) that presents the output result of each process to the user.
- the auxiliary storage device U114 stores programs such as a new structural formula generation unit P08, a structural formula inverse conversion unit P09, and a structural formula shaping unit P10.
- the program executed by the processor U111 and the data to be processed are loaded from the auxiliary storage device U114 into the DRAM U112.
- The hardware elements constituting each of the parameter setting device M01, the data storage device M02, the model learning device M03, and the display device M05 may be the same as those of the structural formula generator M04. Further, the functions divided into a plurality of devices may be integrated into one device, or the functions of the plurality of devices may be distributed to a larger number of devices.
- the chemical structural formula generation system includes one or more storage devices and one or more processors.
- FIG. 4 shows a configuration example of the catalog data DB 10.
- the catalog data DB 10 is a database of compound structural formulas and includes a large number of records. Each record stores information on one chemical structural formula.
- the data of the catalog data DB 10 can include, for example, easily available open data published in a state where it can be freely used for secondary use.
- the chemical structural formula is represented by a character string (expression) according to the SMILES notation.
- the TableID column T0C1 indicates an identifier of the table (table shown in FIG. 4).
- the Table Type column T0C2 indicates the type of data stored in the table.
- the Table Type column T0C2 indicates that this table is a table of catalog data.
- the ID column T0C3 indicates an identifier of the chemical structural formula.
- the SMILES column T0C4 represents the SMILES representation of the chemical structural formula.
- FIG. 5 shows a configuration example of the experimental data DB 11.
- the experimental data DB 11 stores experimental data showing one or more physical property values of interest in the chemical structural formula.
- Each record contains a set of experimental results of one or more physical characteristic values of interest and information of one chemical structural formula. It is assumed that the number of records in the experimental data DB 11 is smaller than the number of records in the catalog data DB 10.
- the Table ID column T1C1 indicates an identifier of the table (table shown in FIG. 5).
- the Table Type column T1C2 indicates the type of data stored in the table.
- the Table Type column T1C2 indicates that this table is a table of experimental data.
- the ID column T1C3 indicates an identifier of the chemical structural formula.
- the SMILES column T1C4 represents the SMILES representation of the chemical structural formula.
- the MWt column T1C5 indicates the molecular weight of the compound represented by the chemical structural formula.
- the logP column T1C6 shows the partition coefficient of the compound represented by the chemical structural formula.
- the molecular weight and partition coefficient are examples of physical characteristic values of the chemical structural formula, and the experimental data can include arbitrary physical characteristic values.
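- The two record layouts described above could be represented as follows. The class and field names are hypothetical and simply mirror the ID, SMILES, MWt, and logP columns; the values are illustrative and not taken from the tables in the figures.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogRecord:
    """One row of catalog data: a structure without property values."""
    id: int
    smiles: str


@dataclass
class ExperimentalRecord:
    """One row of experimental data: a structure with measured property values."""
    id: int
    smiles: str
    properties: dict = field(default_factory=dict)  # e.g. {"MWt": ..., "logP": ...}


catalog_row = CatalogRecord(id=1, smiles="CCO")
experiment_row = ExperimentalRecord(id=1, smiles="CCO",
                                    properties={"MWt": 46.07, "logP": -0.31})
```

- Keeping the property values in a mapping allows arbitrary physical property values to be added without changing the record type.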
- FIG. 6 shows an example of data included in the structural formula matrix database DB14.
- the structural formula matrix database DB 14 includes a plurality of tables.
- The structural formula matrix database DB 14 is a collection of tables in which an assigned ID of the original data and a column of the structural formula matrix obtained by converting the structural formula (SMILES) are added to each of the tables of the catalog data DB 10 and the experimental data DB 11. Therefore, in the first embodiment, the structural formula matrix database DB 14 includes two tables.
- the structural formula matrix database DB 14 stores a matrix of chemical structural formulas converted from the SMILES representation by the structural formula conversion unit P01. As described above, in the embodiment of the present specification, the character string representing the chemical structural formula is converted into a matrix.
- the vertical axis of the matrix indicates the symbol type such as the element symbol, and the horizontal axis indicates the appearance position.
- this matrix is referred to as a structural formula matrix.
- Assuming that the number of symbol types is M and the length of the character string representing the chemical structure is N, the structural formula matrix has M × N dimensions. The length of the string can vary depending on the structural formula. Therefore, padding with negative or zero values is performed to generate a fixed-size matrix. Since the structural formula matrix records which symbol appears at which position, the structural formula is uniquely determined, and it can be regenerated by the inverse transformation of the structural formula matrix.
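- The conversion and its inverse can be sketched as follows. This is a simplified illustration: `to_matrix` and `from_matrix` are hypothetical names, the vocabulary is a small example, and each character is treated as one symbol, whereas real SMILES strings also contain multi-character symbols such as `Cl` and `Br`. How the padding value is applied is also an assumption.

```python
def to_matrix(smiles, vocab, length, pad=0):
    """Build an M x N matrix: rows are symbol types, columns are positions."""
    if len(smiles) > length:
        raise ValueError("string is longer than the fixed length")
    matrix = [[0] * length for _ in vocab]
    for pos, ch in enumerate(smiles):
        matrix[vocab.index(ch)][pos] = 1      # mark which symbol appears where
    for pos in range(len(smiles), length):    # fill unused positions with padding
        for row in matrix:
            row[pos] = pad
    return matrix


def from_matrix(matrix, vocab):
    """Inverse transformation: read the marked symbol in each column."""
    chars = []
    for pos in range(len(matrix[0])):
        column = [row[pos] for row in matrix]
        if max(column) <= 0:                  # padding column: end of the string
            break
        chars.append(vocab[column.index(max(column))])
    return "".join(chars)


vocab = ["C", "O", "N", "(", ")", "="]
matrix = to_matrix("CC(=O)O", vocab, length=10, pad=-1)
restored = from_matrix(matrix, vocab)
```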
- the structural formula database DB 14 includes a catalog data structural formula matrix table 141 and an experimental data structural formula matrix table 142.
- the catalog data structural formula matrix table 141 is generated from the catalog data DB 10, and a structural formula matrix is further added.
- the experimental data structural formula matrix table 142 is generated from the experimental data DB 11, and a structural formula matrix is further added.
- the TableID column T3C1 indicates an identifier of the table (table shown in FIG. 6).
- the Table Type column T3C2 indicates the type of data stored in the table.
- the Table Type column T3C2 indicates that this table is a table generated from the catalog data.
- the ID column T3C3 indicates an identifier of the chemical structural formula.
- the SMILES column T3C4 represents the SMILES representation of the chemical structural formula.
- the structural formula matrix column T3C5 shows the structural formula matrix of the chemical structural formula converted from the SMILES representation by the structural formula conversion unit P01.
- the TableID column T4C1 indicates an identifier of the table (table shown in FIG. 6).
- the Table Type column T4C2 indicates the type of data stored in the table.
- the Table Type column T4C2 indicates that this table is a table generated from experimental data.
- the ID column T4C3 indicates an identifier of the chemical structural formula.
- the SMILES column T4C4 represents the SMILES representation of the chemical structural formula.
- the MWt column T4C5 indicates the molecular weight of the compound represented by the chemical structural formula.
- the logP column T4C6 shows the partition coefficient of the compound represented by the chemical structural formula.
- the structural formula matrix column T4C7 shows the structural formula matrix of the chemical structural formula converted from the SMILES representation by the structural formula conversion unit P01.
- FIG. 7 shows an example of data included in the learning database DB15.
- the learning database DB 15 stores data used for learning the chemical structural formula generation model generated by the learning data generation unit P02 from the structural formula database DB 14.
- the chemical structural formula generation model in the examples of the present specification includes a structural model and a structural property relationship model.
- the learning database DB 15 includes a structural model table 151 and a structural property relationship model table 152.
- the structural model table 151 stores a structural formula matrix group (compound structural expression group) of a compound that is not associated with the measured value of the physical property value.
- the structural property relationship model table 152 stores a structural formula matrix group (compound structural expression group) of the compounds associated with measured physical characteristic values.
- the example of the structural model table 151 shown in FIG. 7 stores the same information as the catalog data structural formula matrix table 141 except for the Table ID column T5C1.
- the Table ID column T5C1 shows an identifier of the structural model table 151 shown in FIG. 7.
- the columns T5C2 to T5C5 are the same as the columns T3C2 to T3C5 having the same names in the catalog data structural formula matrix table 141, respectively.
- the example of the structural property relationship model table 152 shown in FIG. 7 stores the same information as the experimental data structural formula matrix table 142 except for the Table ID column T6C1.
- the Table ID column T6C1 shows an identifier of the structural property relationship model table 152 shown in FIG.
- the columns T6C2 to T6C7 are the same as the columns T4C2 to T4C7 having the same names in the experimental data structural formula matrix table 142, respectively.
- FIG. 8 shows an example of information included in the initial parameter DB 13.
- the initial parameter DB 13 stores all the initial values of the parameters required for defining the network structure. For example, the initial value of the structural parameter of the neural network, the initial value of the learning parameter, and the initial value of other user-set data are stored.
- the updated parameters are stored in the model data DB 17.
- parameters generally required for the network definition of the neural network such as the type, number, order, number of dimensions, neuron weight, and weight update rate of the layers constituting the neural network, may be omitted.
- the user can set the initial parameter DB 13 via the input device of any of the devices.
- the initial parameter DB 13 contains information necessary for constructing a chemical structural formula generative model.
- Catalog Data Tables indicates the catalog data to be used.
- Experimental Data Tables indicates the experimental data to be used.
- Target Properties indicates the type of physical characteristic value of interest.
- Target Property Values indicates the target value of the type of physical characteristic value of interest.
- “Number_of_vae_relation” indicates the number of stages of VAE (inner VAE) of the structural property relationship model.
- “VAE_Initial_Params” indicates the initial value of each parameter of VAE of the chemical structural formula generation model. More specifically, “grammar_layer” indicates the configuration such as the number of VAE (outer VAE) layers and the number of dimensions of the structural model.
- “Vae_relation_layers” indicates the configuration such as the number of layers and the number of dimensions of each VAE of the structural property value relational model.
- “Middle_dims” indicates a list of dimensions in the intermediate output from the encoder or decoder.
- FIG. 9 shows an example of the network structure confirmation screen 201 displayed by the display unit P11 for the user in the display device M05.
- the display unit P11 generates a configuration diagram of the chemical structural formula generation model from the configuration information of the chemical structural formula generation model received from the network structure determination unit P04, and displays it on the monitor.
- the structural model that is the outer VAE is composed of the outer encoder # enc_01 and the outer decoder # dec_01.
- the structural property relational model, which is the inner VAE, is composed of the inner encoder # enc_02 and the inner decoder # dec_02.
- the inner VAE is sandwiched between the outer encoder # enc_01 and the outer decoder # dec_01.
- the outer encoder # enc_01 accepts the structural formula matrix generated from the SMILES representation as an input and outputs a 9-dimensional intermediate vector (latent representation).
- the inner encoder # enc_02 accepts as an input an 11-dimensional vector obtained by combining two physical property values (MWt and logP) with the output of the outer encoder # enc_01, and outputs a 7-dimensional intermediate vector (latent representation).
- the inner decoder # dec_02 accepts the output of the inner encoder # enc_02 as an input and outputs an 11-dimensional vector.
- This vector is a combination of a 9-dimensional vector corresponding to a chemical structural formula and a 2-dimensional vector showing two physical property values.
- the outer decoder # dec_01 accepts a 9-dimensional vector extracted from the output of the inner decoder # dec_02 as an input, and outputs a vector indicating a chemical structure matrix. By inversely transforming the chemical structural matrix, the SMILES representation of the chemical structural formula can be obtained.
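
The dimension arithmetic of the FIG. 9 example can be checked with a few lines of Python. This is only a bookkeeping sketch: the function name `fig9_dimensions` and the dictionary keys are hypothetical, and only the values 9, 2, 11, and 7 come from the description above.

```python
def fig9_dimensions(outer_latent=9, n_properties=2, inner_latent=7):
    """Return the input/output sizes of each network after the outer encoder.

    The defaults are the values stated for FIG. 9; the key names are
    illustrative and do not appear in the specification.
    """
    combined = outer_latent + n_properties  # latent vector plus MWt and logP
    return {
        "enc_01_out": outer_latent,
        "enc_02_in": combined,
        "enc_02_out": inner_latent,
        "dec_02_in": inner_latent,
        "dec_02_out": combined,
        # The two property values are removed before the outer decoder.
        "dec_01_in": combined - n_properties,
    }


dims = fig9_dimensions()
print(dims["enc_02_in"], dims["dec_01_in"])
```

Changing `n_properties` shows how the inner encoder input grows with the number of physical property values while the outer decoder input stays fixed.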
- the user can confirm whether the chemical structural formula generation model to be constructed has a desired configuration by referring to the network structure confirmation screen 201.
- the user can input data for correction from the input device of the display device M05.
- the display unit P11 displays the chemical structural formula newly generated by the chemical structural formula generation model and information related thereto. The user can select the chemical structural formula to be actually tested from the displayed chemical structural formula.
- FIG. 10 shows a configuration example of the model table 171 included in the model data DB 17.
- the model data DB 17 stores parameters necessary for defining the network structure of the chemical structural formula generation model. All parameters are included, including default values that are not included in the initial parameters. For example, it includes structural parameters of a neural network, learning parameters, vocabulary data necessary for inverse transformation of a structural formula matrix, and other user-configured data. The parameters are updated sequentially according to learning.
- the model data DB 17 is read / written at a timing such as at the start of learning, during learning, or at the end of learning.
- the model table 171 is generated in the network structure determination unit P04 and included in the model data DB 17.
- the model table 171 of FIG. 10 corresponds to the configuration diagram of the chemical structural formula generation model shown in FIG.
- the Network ID column T7C1 indicates an identifier of an encoder or a decoder of the chemical structural formula generation model.
- the Network Order column T7C2 indicates the order of the encoder or decoder from the input.
- the Nest Order column T7C3 shows the order from the input of the VAE having a nested structure.
- the Target column T7C4 indicates an identifier of data used for learning VAE.
- FIG. 11 shows a flowchart of a processing example of the structural formula conversion unit P01.
- the structural formula conversion unit P01 converts the character string of the chemical structural formula included in the catalog data and the experimental data into a structural formula matrix.
- the structural formula conversion unit P01 reads necessary initial parameters from the initial parameter DB 13 (S101).
- the structural formula conversion unit P01 further reads the structural formula vocabulary data DB 12 (S102).
- the structural formula vocabulary data DB 12 associates the types of elements arranged vertically in the structural formula matrix with the symbols in the SMILES representation. The number of rows of the structural formula matrix matches the number of vocabulary entries.
- the structural formula conversion unit P01 reads the original data from the catalog data DB 10 and the experimental data DB 11 indicated by the initial parameters (S103).
- the structural formula conversion unit P01 adds an end token to the end of all structural formulas in the read data (S104).
- the structural formula conversion unit P01 refers to the structural formula vocabulary data DB 12 and converts each of all the structural formulas into a structural formula matrix (S105).
- the structural formula conversion unit P01 adds columns to each of the original data tables and stores the converted structural formula matrix (S106). As a result, the catalog data structural formula matrix table 141 and the experimental data structural formula matrix table 142 shown in FIG. 6 are generated.
- the structural formula conversion unit P01 adds the generated tables 141 and 142 to the structural formula matrix database DB 14 (S107). Further, the structural formula conversion unit P01 writes out the structural formula vocabulary data used for the conversion as a part (structural formula vocabulary dictionary) of the model data DB 17 (S108).
- the structural formula vocabulary data is referenced to inversely transform the structural formula matrix to obtain a SMILES representation of the chemical structural formula.
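
The conversion in steps S104 and S105 and its inverse can be sketched as a one-hot encoding. The following is a minimal illustration under stated assumptions, not the implementation of the structural formula conversion unit P01: the vocabulary `VOCAB`, the end token `"E"`, and the one-character-per-token split are all hypothetical (real SMILES contains multi-character symbols such as `Cl`).

```python
# Hypothetical vocabulary; each entry corresponds to one row of the matrix.
VOCAB = ["C", "N", "O", "(", ")", "=", "1", "E"]
END = "E"  # hypothetical end token appended in S104


def to_matrix(smiles, vocab=VOCAB, end=END):
    """Convert a SMILES string to a one-hot matrix (rows: vocabulary, columns: positions)."""
    tokens = list(smiles) + [end]
    for tok in tokens:
        if tok not in vocab:
            raise KeyError(f"symbol {tok!r} is not in the vocabulary")
    return [[1 if tok == sym else 0 for tok in tokens] for sym in vocab]


def to_smiles(matrix, vocab=VOCAB, end=END):
    """Inverse conversion: take the strongest row in each column, then drop the end token."""
    n_cols = len(matrix[0])
    tokens = []
    for col in range(n_cols):
        column = [matrix[row][col] for row in range(len(vocab))]
        tokens.append(vocab[column.index(max(column))])
    text = "".join(tokens)
    return text[: text.index(end)] if end in text else text


matrix = to_matrix("CC(=O)O")
print(len(matrix), len(matrix[0]))
print(to_smiles(matrix))
```

Taking the maximum per column in `to_smiles` means the same function also works on a decoder output whose entries are scores rather than exact zeros and ones.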
- FIG. 12 shows a flowchart of a processing example of the learning data generation unit P02.
- the learning data generation unit P02 generates training data for each VAE of the chemical structural formula generation model.
- the learning data generation unit P02 reads the necessary initial parameters from the initial parameter DB 13 (S151).
- the learning data generation unit P02 reads the structural formula matrix database DB 14 (S152).
- the learning data generation unit P02 determines the Table Type of each record of the read data (S153).
- the learning data generation unit P02 executes different processing according to the Table Type.
- for records whose Table Type is "Catalog" (S153: Catalog), the learning data generation unit P02 extracts the corresponding records and aggregates them into one table (S154).
- for records whose Table Type is "Experiment", the learning data generation unit P02 extracts the corresponding records (S155) and aggregates them into one table (S156).
- when a record lacks a physical characteristic value, the learning data generation unit P02 fills the corresponding field with Null.
- a missing physical characteristic value is a physical characteristic value that is included in some other record but not in the record in question.
- the learning data generation unit P02 generates as many tables as the number of stages of the structural property relation model indicated by the initial parameters (S157). Each table stores the training data of one inner VAE.
- the learning data generation unit P02 deletes the column containing Null of the generated table (S158). The set of physical property values of the generated table satisfies the inclusion relationship described later (see, for example, Example 4).
- the learning data generation unit P02 assigns a new Table ID to the generated table and overwrites and updates the Table ID column (S159).
- the learning data generation unit P02 writes the generated table to the learning database (S160).
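
The branching on Table Type and the Null complement can be sketched as follows. This is a hedged illustration of steps S153 to S156 only; the record layout (dictionaries with `TableType`, `ID`, `SMILES`, and `Matrix` keys) and both function names are assumptions made for the example, and `None` stands in for Null.

```python
FIXED_KEYS = ("TableType", "ID", "SMILES", "Matrix")  # hypothetical non-property columns


def split_by_table_type(records):
    """Separate catalog records from experimental records (S153 to S156)."""
    catalog = [r for r in records if r["TableType"] == "Catalog"]
    experiment = [r for r in records if r["TableType"] == "Experiment"]
    return catalog, experiment


def fill_missing_properties(records, fixed_keys=FIXED_KEYS):
    """Give every record a field for every property seen in any record, using None for Null."""
    names = sorted({k for r in records for k in r if k not in fixed_keys})
    return [{**{name: None for name in names}, **r} for r in records]


records = [
    {"TableType": "Catalog", "ID": "c1", "SMILES": "CCO", "Matrix": []},
    {"TableType": "Experiment", "ID": "e1", "SMILES": "CCO", "Matrix": [], "MWt": 46.07, "logP": -0.31},
    {"TableType": "Experiment", "ID": "e2", "SMILES": "CCN", "Matrix": [], "MWt": 45.08},
]
catalog, experiment = split_by_table_type(records)
experiment = fill_missing_properties(experiment)
print(experiment[1]["logP"])
```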
- FIG. 13 shows a flowchart of a processing example of the structure generation vector group generation unit P03.
- the structure generation vector group generation unit P03 generates input data for generating (estimating) a new chemical structural formula expected to have a target physical property value after learning the chemical structural formula generation model.
- the structure generation vector group generation unit P03 reads the necessary initial parameters from the initial parameter DB 13 (S201). Next, the structure generation vector group generation unit P03 reads the learning database DB15 (S202). The structure generation vector group generation unit P03 extracts a table whose Table Type is "Experiment" from the learning database (S203). Each table shows the training data of one corresponding inner VAE.
- the structure generation vector group generation unit P03 sorts the records by each physical characteristic value in each extracted table, and extracts the top S cases of each physical characteristic value of each table.
- S is a natural number indicated by the initial parameter.
- when the table contains a plurality of types of physical property values, the top S cases of each type are extracted.
- the structure generation vector group generation unit P03 aggregates the extracted records into an upper compound table containing only the ID column and the structural formula matrix column (S204). When a plurality of records having the same ID are extracted, only one of the records is stored in the upper compound table.
- the generation of the upper compound table is not limited to the above method.
- records may be extracted only from some tables, for example, a table containing the most types of physical characteristic values, or higher-level records containing only specified types of physical characteristic values may be extracted.
- the number of high-ranking records to be extracted may differ depending on the type of physical characteristic value.
- each target value list shows a plurality of target values of the corresponding physical characteristic value types.
- the initial parameters indicate information for generating the plurality of target values. For example, they may list the target values directly, or they may indicate a reference target value, the number of target values to be generated, and a formula for generating the other target values from the reference target value.
- the structure generation vector group generation unit P03 generates a target value matrix by the direct product of the target value list for each physical characteristic value type (S206). Further, the structure generation vector group generation unit P03 generates a structure generation vector group by the direct product of the upper compound table and the target value matrix (S207). The structure generation vector group generation unit P03 writes the generated structure generation vector group to the structure generation vector database DB16 (S208).
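
Steps S204 to S207 amount to a top-S selection followed by two direct (Cartesian) products. The sketch below uses `itertools.product` for both. The function names, the record keys, and the reading of "top" as "largest value" are assumptions for illustration; the description does not state the sort direction.

```python
from itertools import product


def top_s_compounds(table, property_names, s):
    """Collect the top S records per property, keeping one entry per ID (S204)."""
    selected = {}
    for name in property_names:
        # Assumption: "top" means the largest measured value.
        ranked = sorted(table, key=lambda r: r[name], reverse=True)
        for rec in ranked[:s]:
            selected.setdefault(rec["ID"], {"ID": rec["ID"], "Matrix": rec["Matrix"]})
    return list(selected.values())


def structure_generation_vectors(upper_compounds, target_lists):
    """Direct product of target lists (S206), then of compounds and targets (S207)."""
    target_matrix = list(product(*target_lists))
    return [
        (c["ID"], c["Matrix"], targets)
        for c, targets in product(upper_compounds, target_matrix)
    ]


table = [
    {"ID": "a", "Matrix": "Ma", "MWt": 300.0, "logP": 1.0},
    {"ID": "b", "Matrix": "Mb", "MWt": 200.0, "logP": 3.0},
    {"ID": "c", "Matrix": "Mc", "MWt": 100.0, "logP": 2.0},
]
upper = top_s_compounds(table, ["MWt", "logP"], s=1)
vectors = structure_generation_vectors(upper, [[350.0, 400.0], [3.5, 4.0, 4.5]])
print(len(upper), len(vectors))
```

With 2 upper compounds, 2 MWt targets, and 3 logP targets, the group contains 2 × 2 × 3 = 12 entries, which shows how quickly the input set grows with the number of target values.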
- FIG. 14 shows a flowchart of a processing example of the network structure determination unit P04.
- the network structure determination unit P04 determines the structure of the chemical structural formula generation model from the initial parameters and the physical characteristic values to be considered that are included in the experimental data (model data generation).
- the network structure determination unit P04 reads the necessary initial parameters from the initial parameter DB 13 (S251).
- the initial parameters to be read include a catalog data identifier, an experimental data identifier, a column name of the object physical characteristics, a dimension number list, and the like.
- the network structure determination unit P04 constructs a structural model and initializes it with initial parameters (S252).
- the network structure determination unit P04 reads the structure property relationship model table of the learning database DB15 (S253).
- the network structure determination unit P04 constructs, as the structural property relation model, as many encoder / decoder pairs (inner VAEs) as there are structural property relationship model tables, and initializes them with the initial parameters (S254).
- the network structure determination unit P04 arranges the encoder of the structural model, the encoder group of the structural property relationship model, the decoder group of the structural property relationship model, and the decoder of the structural model in this order, and assigns a serial number (Network Order) to each network from the input side (S255).
- the network structure determination unit P04 aggregates the physical characteristic value column names (physical characteristic value types) of each of the structural property relationship model tables and makes an inclusion determination (S256).
- An inclusion relationship of the physical characteristic value column names is established between any two structural property relationship model tables. Specifically, a table having more physical characteristic value columns includes all the physical characteristic value column names of a table having fewer physical characteristic value columns.
- the table for the structural property relationship model is prepared by the learning data generation unit P02 so that such an inclusion relationship is established.
- the network structure determination unit P04 sorts the Table IDs so that tables with fewer physical characteristic value columns come first (S258).
- the network structure determination unit P04 associates each encoder / decoder pair with the learning table so that the higher-level Table ID corresponds to the outer encoder / decoder pair in the structural property value relational model.
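
The inclusion determination and the ordering of the learning tables can be sketched as below. This is an illustrative reading, not the patented procedure: the function name `order_tables_by_inclusion`, the table identifiers, and the choice of raising an error when the relationship fails are assumptions. It places the table with the fewest physical characteristic value columns first, which is assumed here to correspond to the outermost encoder / decoder pair.

```python
def order_tables_by_inclusion(tables):
    """Return Table IDs ordered from fewest to most property columns.

    `tables` maps a Table ID to its list of property column names.
    Raises ValueError if the column sets do not form an inclusion chain.
    """
    ordered = sorted(tables.items(), key=lambda item: len(item[1]))
    # Inclusion is transitive, so checking neighbours covers every pair.
    for (small_id, small), (large_id, large) in zip(ordered, ordered[1:]):
        if not set(small) <= set(large):
            raise ValueError(f"{small_id} is not included in {large_id}")
    return [table_id for table_id, _ in ordered]


print(order_tables_by_inclusion({"T9": ["MWt", "logP"], "T8": ["MWt"]}))
```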
- the network structure determination unit P04 determines the number of input / output dimensions of each encoder / decoder pair according to the initial parameters (S260).
- the network structure determination unit P04 displays the model structure (S261). Specifically, the network structure determination unit P04 transmits the model structure information to the display unit P11. The display unit P11 generates and displays a structural image of the chemical structural formula generation model according to the received information.
- the network structure determination unit P04 receives user input regarding the structure of the chemical structural formula generation model via the display unit P11, and determines whether or not the network structure has been modified (S262).
- the network structure determination unit P04 modifies the network structure according to the user input (S263), and displays the modified network structure using the display unit P11.
- the network structure determination unit P04 pairs each encoder with the decoder whose output matches the encoder's input, and assigns a serial number (Nest Order) to each pair in order from the outside (S264). Each pair constitutes a VAE.
- the network structure determination unit P04 outputs all the parameters of all the encoders and decoders to the data storage device M02 as a part of the model data DB 17 (S265). Further, the network structure determination unit P04 outputs the model table 171 to the data storage device M02 as a part of the model data DB 17 (S266).
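
The numbering in steps S255 and S264 can be sketched as follows. The sketch simplifies the pairing rule: instead of comparing encoder inputs with decoder outputs, it pairs networks by their symmetric position in the serial arrangement, which gives the same result for a strictly nested model. The function name and the row keys are hypothetical, and the Target column is omitted.

```python
def build_model_table(n_inner):
    """Assign Network Order and Nest Order for one outer VAE and `n_inner` inner VAEs."""
    n_vae = n_inner + 1
    # Encoders from the outside in, then decoders from the inside out.
    networks = [f"enc_{i:02d}" for i in range(1, n_vae + 1)]
    networks += [f"dec_{i:02d}" for i in range(n_vae, 0, -1)]
    total = len(networks)
    rows = []
    for position, network_id in enumerate(networks):
        # Simplification: the network at position p pairs with the one at total - 1 - p.
        nest_order = min(position, total - 1 - position) + 1
        rows.append(
            {"NetworkID": network_id, "NetworkOrder": position + 1, "NestOrder": nest_order}
        )
    return rows


for row in build_model_table(n_inner=1):
    print(row)
```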
- FIG. 15 shows a flowchart of a processing example of the structural model learning unit P05.
- the structural model learning unit P05 executes learning (also referred to as training) of the structural model (outer VAE) using the structural model table generated from the catalog data.
- the catalog data DB 10 stores more records (data) than the experimental data DB 11. Since more data can be prepared as training data if only the chemical structural formula is used, effective training of the chemical structural formula generation model as a whole becomes possible.
- the structural model learning unit P05 reads the model data DB 17 (S301).
- the structural model learning unit P05 refers to the model table and identifies a model having the Nest Order of 1 (S302).
- the model with Nest Order 1 is the outermost structural model.
- the structural model learning unit P05 constructs the specified model (S303).
- the structural model learning unit P05 refers to the learning database DB15 and reads the structural model table (S304). The structural model learning unit P05 sequentially inputs the structural formula matrix into the structural model to learn the neural network. The structural model learning unit P05 updates and optimizes network parameters (S305). The structural model learning unit P05 writes out the parameters after learning and updates the model data DB 17 (S306).
- FIG. 16 shows a flowchart of a processing example of the structural model additional learning unit P06.
- the structural model additional learning unit P06 performs additional learning of the structural model by using the structural formula matrix of the structural property relational model table.
- the structural model additional learning unit P06 reads the model data DB 17 (S351).
- the structural model additional learning unit P06 refers to the model table and reconstructs the trained structural model in which Nest Order is 1 (S352).
- the structural model additional learning unit P06 refers to the learning database DB15 and reads all the structural property relational model tables (S353).
- the structural model additional learning unit P06 sequentially inputs the structural formula matrix of the structural property relational model table into the structural model, and performs additional learning of the learned structural model.
- the structural model additional learning unit P06 updates and optimizes the network parameters (S354).
- the structural model additional learning unit P06 writes out the structural model parameters after the additional learning and updates the model data DB 17 (S355).
- FIG. 17 shows a flowchart of a processing example of the structural property relationship model learning unit P07.
- the structural property relationship model learning unit P07 executes learning of each VAE (hereinafter, also referred to as a model) in the structural property relationship model. For the learning of the inner VAE, all encoders outside the VAE are reconstructed and connected.
- the structural property relationship model learning unit P07 reads the model data DB 17 (S401).
- the structural property relationship model learning unit P07 initializes N and sets the value to 2 (S402).
- the structural property relation model learning unit P07 refers to the Network ID column of the row in which the value of Nest Order of the model table is equal to N, and constructs the VAE of the model to be learned (S403).
- the structural property relational model learning unit P07 refers to the Target column of the row in which the Nest Order value of the model table is equal to N, and reads the corresponding learning table (structural property relational model table) from the learning database DB15 (S404).
- the structural property relationship model learning unit P07 reconstructs only the encoder (without constructing a decoder) of the additionally trained structural model (S405). Further, the structural property relation model learning unit P07 refers to the Network ID column in which the value of Nest Order in the model table is smaller than N, and reconstructs only the trained encoder (without constructing a decoder) (S406). The structural property relationship model learning unit P07 refers to the Network Order column of the model table, and connects the constructed trained encoders in order (S407).
- the structural property relationship model learning unit P07 sequentially inputs the structural formula matrices and the corresponding physical property values into the connected encoders and converts them into learning target vectors (S408). Only the structural formula matrix is input to the structural model.
- the structural property relationship model learning unit P07 inputs the learning target vector into the learning target model VAE, learns the model, and optimizes the network parameters (S409).
- when N = 2, the learning target vector is a vector obtained by combining the structural model's conversion result for the structural formula matrix with the physical characteristic value vector.
- the structural property relationship model learning unit P07 writes out the parameters of the model after learning and updates the model data DB 17 (S410).
- the structural property relationship model learning unit P07 determines whether the learning of all the models (VAE) of the structural property relationship model has been completed (S411). When an unlearned model remains (S411: NO), the structural property relationship model learning unit P07 increments the value N of the Nest Order (S412) and returns to step S403. When the learning of all the models of the structural property relational model is completed (S411: YES), this flow ends.
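
The construction of the learning target vector for the VAE with Nest Order N (steps S405 to S408) can be sketched as a chain of already trained encoders. This is a minimal sketch assuming each encoder is a plain callable that maps a list to a list; `learning_target_vector`, `stage_properties`, and the toy encoders are hypothetical stand-ins for the trained networks, and the property values are assumed to be appended at the end of the vector.

```python
def learning_target_vector(matrix, properties, encoders, stage_properties, n):
    """Build the input vector for the VAE whose Nest Order is `n` (n >= 2).

    encoders: maps Nest Order -> trained encoder callable, for orders below n.
    stage_properties: maps Nest Order -> property names combined at that stage.
    """
    # Only the structural formula matrix is input to the structural model.
    vector = encoders[1](matrix)
    # Pass through every trained inner encoder that sits outside the target VAE.
    for nest in range(2, n):
        stage_values = [properties[name] for name in stage_properties[nest]]
        vector = encoders[nest](vector + stage_values)
    # Combine the property values belonging to the target stage.
    return vector + [properties[name] for name in stage_properties[n]]


# Toy encoders used only to make the data flow visible.
toy_encoders = {1: lambda m: [float(sum(row)) for row in m], 2: lambda v: [sum(v)]}
stages = {2: ["MWt"], 3: ["logP"]}
props = {"MWt": 46.0, "logP": -0.5}
print(learning_target_vector([[1, 0], [0, 1]], props, toy_encoders, stages, n=2))
print(learning_target_vector([[1, 0], [0, 1]], props, toy_encoders, stages, n=3))
```

Incrementing `n` in a loop from 2 up to the innermost Nest Order mirrors the S403 to S412 flow, with each stage trained on vectors produced by the stages outside it.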
- FIG. 18 shows a flowchart of a processing example of the new structural formula generation unit P08.
- the new structural formula generation unit P08 uses the learned chemical structural formula generation model to generate (estimate) new chemical structural formula candidates that are expected to have desired physical property values.
- the new structural formula generation unit P08 reads the model data DB 17 (S451).
- the new structural formula generation unit P08 reconstructs the learned structural model and the structural property value relational model to form a chemical structural formula generation model (generator) (S452).
- the new structural formula generation unit P08 reads the structure generation vector group from the structure generation vector database DB16 (S453).
- the new structural formula generation unit P08 inputs the structural generation vector group into the chemical structural formula generation model and generates a structural formula matrix (S454).
- the new structural formula generation unit P08 collectively writes the structural formula matrix as a candidate structural formula to the candidate structural formula database DB18 (S455).
- FIG. 19 shows a flowchart of a processing example of the structural formula inverse conversion unit P09.
- the structural formula inverse conversion unit P09 converts the structural formula matrix output by the chemical structural formula generation model into a SMILES representation (character string) of the structural formula.
- the structural formula inverse conversion unit P09 reads the structural formula vocabulary dictionary from the model data DB 17 (S501).
- the structural formula inverse conversion unit P09 reads the candidate structural formula database DB18 (S502).
- the structural formula inverse conversion unit P09 converts the structural formula matrix into a structural formula (SMILES representation) (S503).
- the structural formula inverse conversion unit P09 deletes the end token at the end (S504).
- the structural formula inverse conversion unit P09 overwrites the structural formula in the candidate structural formula database DB18 (S505).
- FIG. 20 shows a flowchart of a processing example of the structural formula shaping unit P10.
- the chemical structural formula generated by the chemical structural formula generation model may include a chemical structural formula that does not conform to the SMILES grammar.
- the structural formula shaping unit P10 corrects a chemical structural formula that does not conform to the SMILES grammar, and further removes a chemical structural formula that cannot be corrected.
- the structural formula shaping unit P10 reads the candidate structural formula database DB18 (S551).
- the structural formula shaping unit P10 determines grammatical consistency for each chemical structural formula (S552).
- when a chemical structural formula is not grammatically consistent, the structural formula shaping unit P10 corrects the chemical structural formula (S553).
- the structural formula shaping unit P10 redetermines the grammatical consistency of the corrected chemical structural formula (S554).
- when the corrected chemical structural formula is still not grammatically consistent, the structural formula shaping unit P10 rejects the candidate structural formula (S555).
- the structural formula shaping unit P10 overwrites the corrected chemical structural formula in the candidate structural formula database DB18 (S556).
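
The check, correct, recheck, and reject flow of S552 to S556 can be sketched with a deliberately small stand-in for the grammar check. The specification does not say how consistency is determined or how formulas are corrected, so everything here is an assumption: `is_balanced` only tests parenthesis balance, which is far weaker than the SMILES grammar, and `try_correct` only appends missing closing parentheses.

```python
def is_balanced(smiles):
    """Toy consistency check: branch parentheses must be balanced."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def try_correct(smiles):
    """Toy correction: append any closing parentheses that are missing."""
    missing = max(smiles.count("(") - smiles.count(")"), 0)
    return smiles + ")" * missing


def shape(candidates):
    """Keep consistent formulas, correct what can be corrected, reject the rest."""
    kept = []
    for smiles in candidates:
        if not is_balanced(smiles):          # S552
            smiles = try_correct(smiles)     # S553
            if not is_balanced(smiles):      # S554
                continue                     # S555: reject
        kept.append(smiles)                  # S556: keep (possibly corrected)
    return kept


print(shape(["CC(=O)O", "CC(=O", "CC)O"]))
```

In practice the two toy functions would be replaced by a real SMILES parser; the control flow in `shape` is the part that corresponds to the flowchart.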
- the chemical structure generation model having a nested structure can present candidates for compounds having better physical property values than those contained in the training data, and the model can be generated using a small amount of experimental data.
- the accuracy of feature extraction of the structural model can be further improved by additional learning of the structural model using experimental data.
- the structural property value relational model according to the second embodiment has a nested structure composed of a plurality of stages of VAE. Further, the input to each encoder (each VAE) of the structural property value relational model is a vector obtained by combining an intermediate vector from the previous encoder and a single physical property value.
- by configuring the structural property value relational model with a plurality of stages of VAEs, physical property values that are preferably handled separately can, for example, be included in the inputs of different VAEs.
- in Example 2, it is assumed that all the chemical structural formulas (records) of the experimental data used for learning have experimental data of common physical characteristic value types (physical characteristic value names); that is, no experimental data is missing.
- an example in which experimental data of two types of physical property values, MWt and logP, are associated with each chemical structural formula will be described.
- FIG. 21 shows an example of the network structure confirmation screen 202, which is displayed for the user on the display device M05 by the display unit P11 according to the second embodiment.
- the display unit P11 generates a configuration diagram of the chemical structural formula generation model from the configuration information of the chemical structural formula generation model received from the network structure determination unit P04, and displays it on the monitor.
- the structural model that is the outer VAE is composed of the outer encoder # enc_01 and the outer decoder # dec_01.
- the structural property relationship model consists of two inner VAEs.
- One inner VAE is composed of an inner encoder # enc_02 and an inner decoder # dec_02.
- the other inner VAE is composed of an inner encoder # enc_03 and an inner decoder # dec_03.
- the inner encoder # enc_03 and the inner decoder # dec_03 are sandwiched between the inner encoder # enc_02 and the inner decoder # dec_02.
- the inner encoder # enc_03 and the inner decoder # dec_03 and the inner encoder # enc_02 and the inner decoder # dec_02 are sandwiched between the outer encoder # enc_01 and the outer decoder # dec_01.
- the outer encoder # enc_01 accepts the structural formula matrix generated from the SMILES representation as an input and outputs an intermediate vector (latent representation).
- the inner encoder # enc_02 accepts as an input a vector obtained by combining the output of the outer encoder # enc_01 with a one-dimensional vector indicating one physical property value (MWt), and outputs an intermediate vector (latent representation).
- the inner encoder # enc_03 accepts as an input a vector obtained by combining the output of the inner encoder # enc_02 with a one-dimensional vector indicating one physical property value (logP), and outputs an intermediate vector (latent representation).
- the inner decoder # dec_03 accepts the output of the inner encoder # enc_03 as an input and outputs a vector. Part of the vector corresponds to the input to the inner encoder # enc_03, and the other part corresponds to the physical characteristic value (logP).
- the vector obtained by removing the physical characteristic value vector from the output vector of the inner decoder # dec_03 is input to the inner decoder # dec_02.
- Part of the vector output from the inner decoder # dec_02 corresponds to the input to the inner encoder # enc_02, that is, the feature vector of the chemical structural formula (structural formula matrix), and the other part is a physical characteristic value vector indicating the physical property value (MWt).
- the vector obtained by removing the physical characteristic value vector from the output vector of the inner decoder # dec_02 is input to the outer decoder # dec_01.
- the outer decoder # dec_01 outputs a vector indicating the chemical structure matrix.
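
The decoder side of the Example 2 model, where each inner decoder output is split and only the non-property part is passed outward, can be sketched as follows. The function name `decode_chain`, the toy decoders, and the assumption that the property value occupies the last element of each decoder output are all illustrative; the description does not specify the element order.

```python
def decode_chain(latent, inner_decoders, outer_decoder, n_props_per_stage=1):
    """Run the inner decoders from the innermost outward, then the outer decoder.

    Returns the outer decoder output and the property vectors removed at each stage.
    """
    removed = []
    vector = latent
    for decoder in inner_decoders:  # e.g. [dec_03, dec_02]
        out = decoder(vector)
        # Assumption: the property values are the trailing elements.
        vector, props = out[:-n_props_per_stage], out[-n_props_per_stage:]
        removed.append(props)
    return outer_decoder(vector), removed


# Toy decoders that only append a fixed property value.
dec_03 = lambda v: v + [0.5]    # stands in for logP
dec_02 = lambda v: v + [100.0]  # stands in for MWt
dec_01 = lambda v: v            # stands in for the structural model decoder

structure, removed = decode_chain([1.0, 2.0], [dec_03, dec_02], dec_01)
print(structure, removed)
```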
- FIG. 22 shows an example of data included in the learning database DB 15 according to the second embodiment.
- FIG. 22 shows learning data of the structural property relational model.
- The training data of the structural property relationship model stores the learning table 153 for the outer first VAE (# enc_02 and # dec_02) in the structural property relationship model and the learning table 154 for the inner second VAE (# enc_03 and # dec_03) in the structural property relationship model.
- the first VAE learning table 153 has a structure in which the logP column T6C6 is removed from the structural property relationship model table 152 shown in FIG.
- the Table ID column T8C1 indicates an identifier of the learning table 153 for the first VAE.
- the columns T8C2 to T8C5 and T8C7 of the learning table 153 for the first VAE store the same information as the columns T6C2 to T6C5 and T6C7 of the table 152 for the structural property relationship model.
- the physical characteristic values required for learning of the first VAE (# enc_02 and # dec_02) are only MWt included in the input / output of the first VAE.
- the learning table 154 for the second VAE (# enc_03 and # dec_03) has the same structure as the structural property relationship model table 152 shown in FIG.
- the Table ID column T9C1 indicates an identifier of the learning table 154 for the second VAE.
- the columns T9C2 to T9C7 of the second VAE learning table 154 store the same information as the columns T6C2 to T6C7 of the structural property relationship model table 152.
- the learning data includes MWt, which is the input physical property value of the encoder # enc_02, in addition to logP, which is the input / output physical property value of the second VAE.
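As a rough sketch of how the two learning tables relate to the structural property relationship model table 152, the first VAE's table can be produced by dropping the logP column, while the second VAE's table keeps every column. The column names `table_id` and `smiles`, the helper `project`, and the sample rows are illustrative assumptions, not the contents of the actual tables.

```python
def project(records, drop=()):
    """Copy each record, omitting the columns named in `drop`."""
    return [{k: v for k, v in r.items() if k not in drop} for r in records]

# Illustrative rows only; the real table 152 has further columns.
table_152 = [
    {"table_id": "T152", "smiles": "CCO", "MWt": 46.07, "logP": -0.31},
    {"table_id": "T152", "smiles": "c1ccccc1", "MWt": 78.11, "logP": 2.13},
]

# First VAE (# enc_02 / # dec_02): only MWt is needed, so logP is removed.
table_153 = project(table_152, drop=("logP",))

# Second VAE (# enc_03 / # dec_03): logP plus MWt, the input of # enc_02.
table_154 = project(table_152)
```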
- the learning data for the first VAE is larger than the learning data for the second VAE. Since the number of dimensions of the first VAE arranged outside the second VAE is larger, the learning of the structural characteristic relationship model can be effectively performed. This point is the same in Examples 3 and 4 below.
- FIG. 23 shows a configuration example of the model table 172 included in the model data DB 17 according to the second embodiment.
- the model table 172 of FIG. 23 corresponds to the configuration diagram of the chemical structural formula generation model shown in FIG.
- the Network ID column T10C1 shows the identifiers of each of the three encoders and the three decoders of the chemical structural formula generation model.
- the Network Order column T10C2 indicates the order from the input of each of the three encoders and the three decoders.
- the Nest Order column T10C3 shows the order (from the outside) from the input of each of the three VAEs including the encoder and decoder.
- the Target column T10C4 indicates an identifier of data and an object physical characteristic value used for learning VAE.
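One way to picture the four columns of the model table is to generate its rows from an ordered list of stages. The sketch below is an assumed reading of the column meanings: the cell values of FIG. 23 are not reproduced in the text, so the numbering scheme, the field names, and the target labels are all hypothetical.

```python
def build_model_table(targets):
    """Build one row per encoder and decoder.

    `targets` lists a hypothetical target label per VAE, outermost first.
    """
    n = len(targets)
    rows = []
    # Encoders are visited from the outermost to the innermost.
    for i, target in enumerate(targets, start=1):
        rows.append({
            "network_id": f"enc_{i:02d}",
            "network_order": i,          # position counted from the input
            "nest_order": i,             # nesting depth counted from the outside
            "target": target,
        })
    # Decoders are visited from the innermost back to the outermost.
    for i in range(n, 0, -1):
        rows.append({
            "network_id": f"dec_{i:02d}",
            "network_order": 2 * n - i + 1,
            "nest_order": i,
            "target": targets[i - 1],
        })
    return rows

model_table = build_model_table(["structure", "MWt", "logP"])
```

An encoder and its matching decoder share the same nest order, which is what identifies them as one VAE.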
- the structural property value relational model according to the third embodiment has a nested structure composed of a plurality of stages of VAE. Further, the input to each encoder (each VAE) of the structural property value relational model is a vector in which an intermediate vector from the previous encoder and a single or multiple physical property values are combined.
- In a structural property value relational model with multiple stages of VAE, for example, physical property values that are preferably combined can be included in the input of the same VAE, and physical property values that are preferably separated can be included in the inputs of different VAEs.
- In Example 3, it is assumed that all the chemical structural formulas (records) of the experimental data used for learning have experimental data for a common set of physical property value types (physical property value names), that is, no experimental data is missing.
- An example in which experimental data for three types of physical property values, Prop1, Prop2, and Prop3, are associated with each chemical structural formula will be described.
- FIG. 24 shows an example of the network structure confirmation screen 203 that the display unit P11 displays for the user on the display device M05 according to the third embodiment.
- the display unit P11 generates a configuration diagram of the chemical structural formula generation model from the configuration information of the chemical structural formula generation model received from the network structure determination unit P04, and displays it on the monitor.
- the network structure shown in FIG. 24 is different from the network structure shown in FIG. 21 of Example 2 in that the number of physical property values of the innermost VAE (# enc_03 and # dec_03) is 2.
- The input / output physical property value of the outer VAE (# enc_02 and # dec_02) in the structural property relationship model is Prop1, and the input / output physical property values of the innermost VAE (# enc_03 and # dec_03) are Prop2 and Prop3.
- FIG. 25 shows an example of data included in the learning database DB 15 according to the third embodiment.
- FIG. 25 shows learning data of the structural property relational model.
- The training data of the structural property relationship model stores the learning table 155 for the first VAE (# enc_02 and # dec_02) in the structural property relationship model and the learning table 156 for the second VAE (# enc_03 and # dec_03) in the structural property relationship model.
- the first VAE learning table 155 has the same structure as the first VAE learning table 153 shown in FIG.
- the column names of the columns T11C1 to T11C4 and T11C7 are the same as those of the learning table 153 for the first VAE.
- the Prop1 column T11C5 shows the measured value of Prop1 of each chemical structural formula.
- the physical characteristic value required for learning of the first VAE (# enc_02 and # dec_02) is only Prop1 included in the input / output of the first VAE.
- the learning table 156 for the second VAE (# enc_03 and # dec_03) has a structure in which two columns of physical characteristic values are added to the learning table 155 for the first VAE shown in FIG.
- the information of the columns T12C1 to T12C5 and T12C8 is the same as that of the columns T11C1 to T11C5 and T11C7 of the learning table 155 for the first VAE.
- the added Prop2 column T12C6 and Prop3 column T12C7 show the experimentally measured values of Prop2 and Prop3 of the respective chemical structural formulas, respectively.
- the learning data includes Prop1 which is the input physical property value of the encoder # enc_02 in addition to Prop2 and Prop3 which are the input / output physical property values of the second VAE.
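The rule that each VAE's learning data carries its own input / output physical property values plus those of every encoder outside it can be sketched as a cumulative list. The function name `required_properties` and the list-of-lists input format are assumptions made for illustration.

```python
def required_properties(stage_properties):
    """Return, per VAE, the property columns its learning table needs.

    `stage_properties` lists the input / output property names of each VAE
    in the structural property relationship model, outermost first.
    """
    required = []
    seen = []
    for props in stage_properties:
        seen = seen + list(props)      # add this stage's own properties
        required.append(list(seen))    # outer properties are carried along
    return required

# Example 3: the first VAE handles Prop1, the second handles Prop2 and Prop3.
columns = required_properties([["Prop1"], ["Prop2", "Prop3"]])
# columns[0] is ["Prop1"]; columns[1] is ["Prop1", "Prop2", "Prop3"].
```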
- FIG. 26 shows a configuration example of the model table 173 included in the model data DB 17 according to the third embodiment.
- the model table 173 of FIG. 26 corresponds to the configuration diagram of the chemical structural formula generation model shown in FIG. 24.
- the information of the columns T13C1 to T13C3 is the same as the information of the columns T10C1 to T10C3 of the model table 172 according to the second embodiment.
- the Target column T13C5 indicates an identifier of data and an object physical property value used for learning VAE in this example.
- the fourth embodiment will be described below. Mainly, the differences from the above other examples will be described.
- the experimental data of this example includes a chemical structural formula having a deficiency in the physical property values obtained by the experiment. More appropriate learning becomes possible by constructing the learning data to be applied to each VAE in the nested structure from the experimental data according to the combination of the associated physical property values.
- FIG. 27 schematically shows the condition that the physical property values (types) included in the records of the experimental data should satisfy.
- The combination of physical property value types of the records is required to satisfy an inclusion relationship.
- A record containing more physical property value types includes all the physical property value types of any record containing fewer. For example, suppose there are three types of experimental data.
- the first experimental data includes the experimental results of one type of physical property value
- the second experimental data includes the experimental results of two types of physical property values
- the third experimental data includes the experimental results of three types of physical property values.
- the three types of physical characteristic values of the third experimental data are composed of the physical characteristic value types of the first experimental data and the two physical characteristic value types of the second experimental data.
- the physical characteristic value type of the second experimental data is composed of the physical characteristic value type of the first experimental data and other physical characteristic value types.
- The physical property value set of the first experimental data (physical property value column set or physical property value type set) is included in the physical property value sets of the second and third experimental data, and the physical property value set of the second experimental data is included in the physical property value set of the third experimental data.
- the first experimental data lacks data on two types of physical property values, and the second experimental data lacks data on one type of physical property value.
- the training data for the structural property relationship model is preprocessed from the experimental data so as to satisfy the inclusion relationship as described above.
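The inclusion relationship can be checked mechanically: when the property value sets are ordered by size, each one must be a subset of the next. The following is a minimal sketch under that reading; the function name `satisfies_inclusion` is hypothetical.

```python
def satisfies_inclusion(property_sets):
    """Return True if the property sets form a nested chain of subsets."""
    ordered = sorted((frozenset(s) for s in property_sets), key=len)
    # Subset-of is transitive, so checking neighbouring pairs is sufficient.
    return all(small <= large for small, large in zip(ordered, ordered[1:]))

nested = satisfies_inclusion([
    {"Prop1"},
    {"Prop1", "Prop2"},
    {"Prop1", "Prop2", "Prop3"},
])

not_nested = satisfies_inclusion([
    {"Prop1"},
    {"Prop2", "Prop3"},   # does not contain Prop1
])
```

Here `nested` is `True` and `not_nested` is `False`; experimental data failing the check would need preprocessing before it is used as training data.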
- FIG. 28 schematically shows an example of training data for a structural property relationship model composed of two types of experimental data.
- the first experimental data 311 has only the measured value of the physical property value 1 (Prop1).
- the second experimental data 312 has measured values of physical property value 1 (Prop1) and physical property value 2 (Prop2). That is, the physical characteristic value set of the first experimental data is included in the physical characteristic value set of the second experimental data.
- FIG. 29 shows an example of the network structure confirmation screen 204 that the display unit P11 displays for the user on the display device M05 according to the fourth embodiment.
- FIG. 30 shows a configuration example of the model table 174 included in the model data DB 17 according to the fourth embodiment.
- the model table 174 of FIG. 30 corresponds to the configuration diagram of the chemical structural formula generation model shown in FIG. 29.
- The network structure shown in FIG. 29 is the same as the network structure shown in FIG. 21 of the second embodiment, except that the physical property value MWt is replaced with the physical property value Prop1 and the physical property value logP is replaced with the physical property value Prop2.
- the information of the columns T14C1 to T14C3 of the model table 174 of FIG. 30 is the same as the information of the columns T10C1 to T10C3 of the model table 172 shown in FIG. 23 of the second embodiment.
- the Target column T14C4 indicates the table name and the physical property value name (Prop1, Prop2) of this example.
- FIG. 31 shows an example in which the learning data generation unit P02 generates training data for a structural property relationship model from experimental data.
- the initial table 150 stores the training data for the structural property relationship model generated from the experimental data and preprocessed so that the physical property value set satisfies the inclusion relationship.
- the learning data generation unit P02 generates learning data for each VAE from the initial table 150.
- the columns T15C1 to T15C4 and T15C7 of the initial table 150 show the same kind of information as the columns of the same name in the learning table 154 shown in FIG. 22 of Example 2.
- Columns T15C5 and T15C6 show the measured values of Prop1 and Prop2, respectively.
- The initial table 150 includes records having different sets of physical property values. Each record whose Table ID is "Tbl_Exp_011" includes the measured values of Prop1 and Prop2. Each record whose Table ID is "Tbl_Exp_012" includes only the measured value of Prop1. The records whose Table ID is "Tbl_Exp_013" include both records containing the measured values of Prop1 and Prop2 and records containing only the measured value of Prop1.
- the learning data generation unit P02 extracts a record containing Prop1 and Prop2 and a record containing only Prop1 from the experimental data, and stores Null in the field of Prop2 for the record containing only Prop1.
- the learning data generation unit P02 stores these records in the initial table 150 and sorts the records according to the number of Nulls (for example, in ascending order).
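The extraction, Null filling, and sorting steps can be sketched as below. The function name `build_initial_table`, the `smiles` column, and the sample values are illustrative assumptions; Python's `None` stands in for Null.

```python
PROPERTY_COLUMNS = ("Prop1", "Prop2")

def build_initial_table(experimental_records):
    """Keep records with Prop1, fill a missing Prop2 with None, sort by Null count."""
    rows = []
    for record in experimental_records:
        if record.get("Prop1") is None:
            continue                     # neither "Prop1 and Prop2" nor "Prop1 only"
        row = dict(record)
        row.setdefault("Prop2", None)    # store Null for records with only Prop1
        rows.append(row)
    # Ascending number of Nulls: complete records come first.
    rows.sort(key=lambda r: sum(r[c] is None for c in PROPERTY_COLUMNS))
    return rows

initial_table = build_initial_table([
    {"smiles": "CCO", "Prop1": 1.0},
    {"smiles": "CCN", "Prop1": 2.0, "Prop2": 3.0},
    {"smiles": "CCC", "Prop2": 5.0},
])
```

With this input, the record for `CCN` is placed before the record for `CCO`, and the record for `CCC` is dropped because it lacks Prop1.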
- the learning data generation unit P02 generates the learning table 157 for the first VAE in the structural property relationship model and the learning table 158 for the second VAE in the structural property relationship model from the initial table 150.
- The learning table 157 for the first VAE in the structural property relationship model is the training data of the outer VAE (# enc_02 and # dec_02) of the structural property relationship model.
- The learning table 157 for the first VAE includes every record of the initial table 150 that contains a measured value of Prop1, that is, all the records.
- the columns T16C1 to T16C5 and T16C7 show the same kind of information as the column of the same name in the initial table.
- the Prop2 column of the initial table 150 is deleted.
- the physical characteristic value required for learning of the first VAE (# enc_02 and # dec_02) is only Prop1 included in the input / output of the first VAE.
- the learning table 158 for the second VAE in the structural property relationship model is the training data of the VAE (# enc_03 and # dec_03) on the inner side of the structural property relationship model.
- the learning table 158 for the second VAE is composed of records including the measured values of Prop1 and Prop2 in the initial table 150.
- the columns T17C1 to T17C7 show the same kind of information as the column of the same name in the initial table.
- the learning data includes Prop1 which is an input physical property value of the encoder # enc_02 in addition to Prop2 which is an input / output physical property value of the second VAE.
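Deriving the two learning tables from the initial table can be sketched as a pair of filters. This is an assumed implementation for illustration: `split_learning_tables`, the `smiles` column, and the sample rows are hypothetical, and `None` represents Null.

```python
def split_learning_tables(initial_table):
    """Return (table_157, table_158) built from the initial table."""
    # First VAE: every record with Prop1, with the Prop2 column deleted.
    table_157 = [
        {k: v for k, v in row.items() if k != "Prop2"}
        for row in initial_table
        if row["Prop1"] is not None
    ]
    # Second VAE: only records holding both Prop1 and Prop2.
    table_158 = [
        dict(row)
        for row in initial_table
        if row["Prop1"] is not None and row["Prop2"] is not None
    ]
    return table_157, table_158

initial_table = [
    {"table_id": "Tbl_Exp_011", "smiles": "CCN", "Prop1": 2.0, "Prop2": 3.0},
    {"table_id": "Tbl_Exp_012", "smiles": "CCO", "Prop1": 1.0, "Prop2": None},
]
table_157, table_158 = split_learning_tables(initial_table)
```

Because the outer VAE is trained on more records than the inner one, records with a missing Prop2 still contribute to learning the outer stage.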
- the learning data of the structural property relationship model is composed of a plurality of learning tables (groups) used for learning each of the plurality of VAEs.
- Each learning table associates each of the compound structural representations with one or more measurements of a given physical property value type.
- the learning table with a large number of physical characteristic value types includes all the physical characteristic value types and all compound structural representations of the learning table with a small number of physical characteristic value types.
- a training table with a larger number of physical property value types is used for training the inner VAE in the structural property relational model.
- The training data is constructed from experimental data that includes records with missing physical property values, using records whose physical property value sets satisfy the inclusion relationship described above.
- The training data for each VAE in the structural property relationship model can consist of records that include, in addition to the input / output physical property values of that VAE, all the physical property values input to the encoders outside it. As a result, appropriate learning of each VAE becomes possible.
- the present invention is not limited to the above-described embodiment, and includes various modifications.
- the above-described embodiment has been described in detail in order to explain the present invention in an easy-to-understand manner, and is not necessarily limited to the one including all the configurations described.
- it is possible to replace a part of the configuration of one embodiment with the configuration of another embodiment and it is also possible to add the configuration of another embodiment to the configuration of one embodiment.
- each of the above-mentioned configurations, functions, processing units, etc. may be realized by hardware, for example, by designing a part or all of them with an integrated circuit.
- each of the above configurations, functions, and the like may be realized by software by the processor interpreting and executing a program that realizes each function.
- Information such as programs, tables, and files that realize each function can be placed in a memory, a hard disk, a recording device such as an SSD (Solid State Drive), or a recording medium such as an IC card or an SD card.
- control lines and information lines indicate those that are considered necessary for explanation, and not all control lines and information lines are necessarily indicated on the product. In practice, it can be considered that almost all configurations are interconnected.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Chemical & Material Sciences (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Crystallography & Structural Chemistry (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Molecular Biology (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Pharmacology & Pharmacy (AREA)
- Medicinal Chemistry (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Complex Calculations (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP21796144.0A EP4145453A4 (en) | 2020-04-28 | 2021-04-09 | SYSTEM FOR GENERATING A COMPOSITE STRUCTURE REPRESENTATION |
| US17/919,804 US12406752B2 (en) | 2020-04-28 | 2021-04-09 | System for generating compound structure representation |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2020-079790 | 2020-04-28 | ||
| JP2020079790A JP7390250B2 (ja) | 2020-04-28 | 2020-04-28 | 化合物構造表現を生成するシステム |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021220774A1 true WO2021220774A1 (ja) | 2021-11-04 |
Family
ID=78279918
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2021/015042 Ceased WO2021220774A1 (ja) | 2020-04-28 | 2021-04-09 | 化合物構造表現を生成するシステム |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US12406752B2 |
| EP (1) | EP4145453A4 |
| JP (1) | JP7390250B2 |
| WO (1) | WO2021220774A1 |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2023198927A1 (en) * | 2022-04-14 | 2023-10-19 | Basf Se | Methods and apparatuses for characterizing chemical substances, measuring physicochemical properties and generating control data for synthesizing chemical substances |
| EP4394780A1 (en) * | 2022-12-27 | 2024-07-03 | Basf Se | Methods and apparatuses for generating a digital representation of chemical substances, measuring physicochemical properties and generating control data for synthesizing chemical substances |
| WO2025088437A1 (ja) * | 2023-10-24 | 2025-05-01 | 株式会社半導体エネルギー研究所 | 分子構造の生成方法 |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP7088399B1 (ja) * | 2021-12-17 | 2022-06-21 | Dic株式会社 | ノボラック型フェノール樹脂の探索方法、情報処理装置、及びプログラム |
| JP7547423B2 (ja) | 2022-09-02 | 2024-09-09 | キヤノン株式会社 | 情報処理装置、情報処理方法およびプログラム |
| US12368503B2 (en) | 2023-12-27 | 2025-07-22 | Quantum Generative Materials Llc | Intent-based satellite transmit management based on preexisting historical location and machine learning |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2020009203A (ja) * | 2018-07-09 | 2020-01-16 | 学校法人関西学院 | 人工化合物データを用いた化合物特性予測の深層学習方法および装置、並びに、化合物特性予測方法および装置 |
| US20200050737A1 (en) * | 2018-08-10 | 2020-02-13 | International Business Machines Corporation | Molecular representation |
| JP2020079790A (ja) | 2018-11-12 | 2020-05-28 | 学校法人近畿大学 | 配管磁化方法、配管磁化装置、配管検査方法及び配管検査装置 |
Family Cites Families (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5566083A (en) * | 1994-10-18 | 1996-10-15 | The Research Foundation Of State University Of New York | Method for analyzing voltage fluctuations in multilayered electronic packaging structures |
| EP1167969A2 (en) * | 2000-06-14 | 2002-01-02 | Pfizer Inc. | Method and system for predicting pharmacokinetic properties |
| US7487078B1 (en) * | 2002-12-20 | 2009-02-03 | Cadence Design Systems, Inc. | Method and system for modeling distributed time invariant systems |
| US8374827B2 (en) * | 2006-09-12 | 2013-02-12 | Osaka University | Numerical simulation apparatus for time dependent schrödinger equation |
| US8706427B2 (en) * | 2010-02-26 | 2014-04-22 | The Board Of Trustees Of The Leland Stanford Junior University | Method for rapidly approximating similarities |
| US8886497B1 (en) * | 2010-07-19 | 2014-11-11 | Terje Graham Vold | Computer simulation of electromagnetic fields |
| FR2982050B1 (fr) * | 2011-11-01 | 2014-06-20 | Nantes Ecole Centrale | Procede et dispositif pour la simulation en temps reel de systemes et de processus complexes |
| EP3339846B1 (en) * | 2016-12-22 | 2020-12-09 | Malvern Panalytical B.V. | Method of measuring properties of a thin film stack |
| JP6829385B2 (ja) * | 2017-02-22 | 2021-02-10 | 富士通株式会社 | 磁性材料シミュレーションプログラム、磁性材料シミュレーション方法および磁性材料シミュレーション装置 |
2020
- 2020-04-28 JP JP2020079790A patent/JP7390250B2/ja active Active
2021
- 2021-04-09 EP EP21796144.0A patent/EP4145453A4/en active Pending
- 2021-04-09 US US17/919,804 patent/US12406752B2/en active Active
- 2021-04-09 WO PCT/JP2021/015042 patent/WO2021220774A1/ja not_active Ceased
Non-Patent Citations (3)
| Title |
|---|
| D. P. KINGMAD. J. REZENDES. MOHAMEDM. WELLING: "Semi-supervised Learning with Deep Generative Models", NIPS, 2014 |
| R. GOMEZ-BOMBARELLI ET AL.: "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules", ACS CENT. SCI., vol. 4, no. 2, February 2018 (2018-02-01), pages 268 - 276, XP055589835, DOI: 10.1021/acscentsci.7b00572 |
| See also references of EP4145453A4 |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2023198927A1 (en) * | 2022-04-14 | 2023-10-19 | Basf Se | Methods and apparatuses for characterizing chemical substances, measuring physicochemical properties and generating control data for synthesizing chemical substances |
| EP4394780A1 (en) * | 2022-12-27 | 2024-07-03 | Basf Se | Methods and apparatuses for generating a digital representation of chemical substances, measuring physicochemical properties and generating control data for synthesizing chemical substances |
| WO2024141949A3 (en) * | 2022-12-27 | 2024-10-10 | Basf Se | Methods and apparatuses for characterizing chemical substances, measuring physicochemical properties and generating control data for synthesizing chemical substances |
| WO2025088437A1 (ja) * | 2023-10-24 | 2025-05-01 | 株式会社半導体エネルギー研究所 | 分子構造の生成方法 |
Also Published As
| Publication number | Publication date |
|---|---|
| JP7390250B2 (ja) | 2023-12-01 |
| US12406752B2 (en) | 2025-09-02 |
| EP4145453A1 (en) | 2023-03-08 |
| JP2021174401A (ja) | 2021-11-01 |
| EP4145453A4 (en) | 2024-06-12 |
| US20230117325A1 (en) | 2023-04-20 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21796144; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2021796144; Country of ref document: EP; Effective date: 20221128 |
| | WWG | Wipo information: grant in national office | Ref document number: 17919804; Country of ref document: US |