CN115691695A - Material component generation method and evaluation method based on GAN and VAE - Google Patents

Material component generation method and evaluation method based on GAN and VAE

Info

Publication number
CN115691695A
Authority
CN
China
Prior art keywords
network, vector, condition, gan, vae
Prior art date
Legal status
Granted
Application number
CN202211412749.8A
Other languages
Chinese (zh)
Other versions
CN115691695B (en)
Inventor
鲁鸣鸣
姚艺峰
王超
Current Assignee
Central South University
Original Assignee
Central South University
Priority date
Filing date
Publication date
Application filed by Central South University
Priority to CN202211412749.8A
Publication of CN115691695A
Application granted
Publication of CN115691695B
Legal status: Active

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 - Computing systems specially adapted for manufacturing

Abstract

The invention discloses a material composition generation method based on GAN and VAE. The method encodes the composition of each material in an original data set to obtain real data samples; one-hot encodes specific attributes of each material in the original data set to express them as condition vectors; constructs a preliminary conditional adversarial autoencoder model; trains the preliminary conditional adversarial autoencoder model with the real data samples, the condition vectors, randomly sampled latent variables and the corresponding condition information to obtain a conditional adversarial autoencoder model; and generates material compositions with the conditional adversarial autoencoder model. The invention also discloses an evaluation method comprising the GAN- and VAE-based material composition generation method. The invention can not only generate chemically valid material molecules, but also generate materials with specific attributes from controllable condition information, while maintaining high novelty and high uniqueness; it therefore offers high reliability, good accuracy and a wide application range.

Description

Material component generation method and evaluation method based on GAN and VAE
Technical Field
The invention belongs to the technical field of material informatics, and particularly relates to a material composition generation method and an evaluation method based on GAN and VAE.
Background
With the development of the economy and technology and the improvement of living standards, various new materials are widely used in production and daily life and bring great convenience. The search for new materials has therefore become one of the focuses of researchers.
However, because the space of possible material compositions is huge, efficiently exploring useful materials within it is a very challenging task. In most traditional approaches, scientists design theoretically feasible candidate materials within a given chemical system according to their intuition and experience, and then verify the feasibility of the designed materials through experiments. This approach is not only inefficient and highly dependent on the scientists' level of expertise, but also very costly.
Currently, with the rapid development of deep learning, more and more methods incorporate it, and deep learning has made breakthrough progress in the field of material informatics. Schemes that use deep generative models for material generation and conditional material generation outperform traditional material-space search schemes. However, in the current field of material generation, the application of deep generative models still suffers from three problems: (1) although deep generative models such as GAN or VAE have been applied to material generation, these works target only specific material systems, such as alloy materials, fixed chemical systems, or hydrides, so their range of application is narrow; (2) although some technical solutions are not limited to a particular material system and can generate candidate materials across systems, the materials they generate are either insufficiently novel or chemically invalid, so their reliability is low; (3) some solutions cannot generate candidate materials conditioned on specific target material properties, i.e., they do not support conditional generation.
Disclosure of Invention
One of the purposes of the present invention is to provide a method for generating a material component based on GAN and VAE, which has high reliability, good accuracy and a wide application range.
Another object of the present invention is to provide an evaluation method including the method for producing a material composition based on GAN and VAE.
The invention provides a method for generating a material component based on GAN and VAE, which comprises the following steps:
S1, encoding the composition of each material in an original data set to obtain real data samples;
S2, one-hot encoding the specific attribute of each material in the original data set and expressing it as a condition vector;
S3, constructing a preliminary conditional adversarial autoencoder model based on a GAN network and a VAE network;
S4, training the preliminary conditional adversarial autoencoder model constructed in step S3 with the real data samples obtained in step S1, the condition vectors obtained in step S2, randomly sampled latent variables and the condition information corresponding to the latent variables, to obtain a conditional adversarial autoencoder model;
S5, generating the final material compositions with the conditional adversarial autoencoder model obtained in step S4.
The step S1 of encoding the composition of each material in the original data set to obtain real data samples specifically comprises the following steps:
counting and analyzing the data in the open OQMD and MP data sets, and selecting e chemical elements as the atom types; stipulating that the number of elements in the composition of each material in the real data samples does not exceed n;
finally, expressing the composition of each material as a matrix M, where M ∈ R^(e×n), e is the total number of chemical elements and n is the maximum number of elements in a composition.
The step S2 of one-hot encoding the specific attribute of each material in the original data set and expressing it as a condition vector specifically comprises the following steps:
selecting three material characteristics, namely chemical validity, per-atom formation energy and band gap, as the generation condition information;
encoding the generation condition information as follows:
if the material satisfies charge neutrality and electronegativity balance, setting the chemical validity flag Vflag to 1; otherwise, setting the chemical validity flag Vflag to 0;
if the per-atom formation energy of the material is not greater than 0, setting the formation energy flag Fflag to 1; otherwise, setting the formation energy flag Fflag to 0;
if the band gap of the material is not less than 0, setting the band gap flag Bflag to 1; otherwise, setting the band gap flag Bflag to 0;
combining the three flags according to the rules of permutation and combination gives the code of the condition information of the material;
converting the code of the condition information of the material into a one-hot vector gives the condition vector of the material.
The step S3 of constructing a preliminary conditional adversarial autoencoder model based on the GAN network and the VAE network specifically comprises the following steps:
the constructed preliminary conditional adversarial autoencoder model comprises a mapping network F, a generation network G, an encoding network E and a discriminator network D;
the mapping network F is used to map data samples into the embedding space; the mapping network F comprises eight fully connected layers, the first fully connected layer mapping (128 + 8) dimensions to 512 dimensions and the second to eighth fully connected layers maintaining a 512-dimensional mapping transformation;
the generation network G is used to generate candidate materials from the embedding space; the generation network G comprises a fully connected layer and four deconvolution layers: the fully connected layer maps (512 + 8) dimensions to 32 + 8 dimensions, and the four deconvolution layers all have 3 × 3 convolution kernels with strides of (2,2), (2,2), (2,2) and (2,1), respectively;
the structure of the encoding network E mirrors that of the generation network G, and the encoding network E is used to encode candidate materials into the embedding space; the encoding network E comprises four convolutional layers and a fully connected layer, each layer mirroring the corresponding layer of the generation network G; the four convolutional layers all have 3 × 3 convolution kernels with strides of (2,1), (2,2), (2,2) and (2,2), respectively, and the fully connected layer performs a (512 + 8)-dimensional mapping transformation;
the discriminator network D is used to obtain a real/fake decision for the corresponding material from the embedding-space input; the discriminator network D comprises six fully connected layers, the first to fifth fully connected layers converting (512 + 8) dimensions to 512 dimensions and the sixth fully connected layer outputting the real/fake decision for the candidate material;
the input distribution of the generation network G matches the output distribution of the encoding network E.
The step S4 of training the preliminary conditional adversarial autoencoder model constructed in step S3 with the real data samples obtained in step S1, the condition vectors obtained in step S2, the randomly sampled latent variables and the condition information corresponding to the latent variables specifically comprises the following steps:
splicing the randomly sampled latent variable z with its randomly generated condition information c̃ to obtain a first spliced vector (z, c̃);
inputting the first spliced vector (z, c̃) into the mapping network F to obtain the mapping network output w_z;
splicing the mapping network output w_z with the condition vector c̃ to obtain a second spliced vector (w_z, c̃);
inputting the second spliced vector (w_z, c̃) into the generation network G to obtain a candidate material x̃ that satisfies the condition c̃;
inputting the candidate material x̃ and the real data sample x into the encoding network E to obtain the latent variable w_x̃ of the candidate material x̃ in the embedding space and the latent variable w_x of the real data sample x in the embedding space;
combining the latent variable w_x̃ of the candidate material x̃ in the embedding space with the condition vector c̃ to obtain a third spliced vector (w_x̃, c̃);
combining the latent variable w_x of the real data sample x in the embedding space with the condition vector c to obtain a fourth spliced vector (w_x, c);
inputting the third spliced vector (w_x̃, c̃) and the fourth spliced vector (w_x, c) into the discriminator network D to finally obtain the real/fake decision and the classification result of the corresponding material;
during training, the encoding network E and the discriminator network D are first updated with the following loss function Loss_ED:
[formula for Loss_ED provided as an image in the original publication]
in the formula, the connection symbol denotes the concatenation (splicing) of vectors; D is the discriminator network; E is the encoding network; P_x is the prior distribution of x; P_x̃ is the prior distribution of x̃; one term is the matching loss between the probability distribution of x and the prior distribution of x; another term is the matching loss between the probability distribution of x̃ and the prior distribution of x̃; λ is a constant; || · || is the 1-norm; ∇ is the gradient operator; BCE( ) is the binary cross-entropy loss; cls_x is the classification output obtained by passing x through the D network; cls_x̃ is the classification output obtained by passing x̃ through the D network; c is the condition vector corresponding to x; c̃ is the condition vector corresponding to x̃;
then, the mapping network F and the generation network G are updated with the following loss function Loss_FG:
[formula for Loss_FG provided as an image in the original publication]
finally, the encoding network E and the generation network G are updated with the following matching loss function Loss_EG:
[formula for Loss_EG provided as an image in the original publication]
where F is the mapping network; G is the generation network; E is the encoding network; z is the latent variable; P_z is the prior distribution of z; the matching-loss term is the matching loss between the probability distribution of the encoded latent variable and the prior distribution of z; || · ||² is the square of the 2-norm.
The step S5 of generating the final material compositions with the conditional adversarial autoencoder model obtained in step S4 specifically comprises: randomly sampling a latent variable z from the prior distribution p(z) and using the conditional adversarial autoencoder model obtained in step S4 to generate the final material compositions.
The invention also provides an evaluation method comprising the GAN- and VAE-based material composition generation method, which further comprises the following step:
S6, evaluating the material compositions obtained in step S5 in terms of chemical validity, condition generation, novelty, per-atom formation energy and band gap.
The invention provides a material composition generation model based on GAN and VAE models that can generate chemically valid material molecules, can generate materials with specific properties from controllable condition information, and maintains high novelty and high uniqueness during generation; the invention therefore offers high reliability, good accuracy and a wide application range.
Drawings
FIG. 1 is a schematic method flow diagram of the generation method of the present invention.
Fig. 2 is a schematic structural diagram of a generative model in the generation method of the present invention.
FIG. 3 is a schematic diagram of the comparison result analysis of the novelty of material generation by the generation method of the present invention and other methods.
FIG. 4 is a schematic method flow diagram of the evaluation method of the present invention.
Detailed Description
Fig. 1 is a schematic flow chart of the generation method of the present invention. The invention provides a GAN- and VAE-based material composition generation method comprising the following steps:
S1, encoding the composition of each material in an original data set to obtain real data samples; this specifically comprises the following steps:
counting and analyzing the data in the open OQMD and MP data sets, and selecting the e (preferably 86) chemical elements involved in most of the materials in the data sets as the atom types; stipulating that the number of elements in the composition of each material in the real data samples does not exceed n; n is preferably 8, because most material compositions contain no more than 8 elements;
finally, expressing the composition of each material as a matrix M, where M ∈ R^(e×n), e is the total number of chemical elements and n is the maximum number of elements in a composition; each row of the matrix M represents one element (the elements are arranged in the order of the periodic table, 86 elements in total), and each column of the matrix M represents the number of atoms of the element in the composition;
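For illustration, a minimal sketch of this encoding is given below. It assumes the convention that column j of M marks a count of j + 1 atoms of the element in row i (the text only states that columns represent the atom count); the element-index lookup and the function name are hypothetical, and only a few elements are listed.

```python
import numpy as np

E_TOTAL = 86   # number of chemical elements selected from the OQMD/MP data (atom types)
N_MAX = 8      # maximum number of atoms of one element considered in a composition

# Hypothetical periodic-table-ordered index; only a few entries are shown.
ELEMENT_INDEX = {"H": 0, "Li": 2, "O": 7, "Fe": 25}

def encode_composition(atom_counts):
    """Return the e x n matrix M for a composition given as {element symbol: atom count}.

    Row i corresponds to element i; M[i, j] = 1 records that element i occurs
    with (j + 1) atoms (an assumed convention, see the note above).
    """
    M = np.zeros((E_TOTAL, N_MAX), dtype=np.float32)
    for symbol, count in atom_counts.items():
        if not 1 <= count <= N_MAX:
            raise ValueError(f"atom count {count} is outside the supported range 1..{N_MAX}")
        M[ELEMENT_INDEX[symbol], count - 1] = 1.0
    return M

# Example: Fe2O3 -> two non-zero entries, one in the Fe row and one in the O row.
M = encode_composition({"Fe": 2, "O": 3})
print(M.shape, int(M.sum()))   # (86, 8) 2
```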
S2, one-hot encoding the specific attribute of each material in the original data set and expressing it as a condition vector; this specifically comprises the following steps:
selecting three material characteristics, namely chemical validity, per-atom formation energy and band gap, as the generation condition information;
encoding the generation condition information as follows:
if the material satisfies charge neutrality and electronegativity balance, setting the chemical validity flag Vflag to 1; otherwise, setting the chemical validity flag Vflag to 0;
if the per-atom formation energy of the material is not greater than 0, setting the formation energy flag Fflag to 1; otherwise, setting the formation energy flag Fflag to 0;
if the band gap of the material is not less than 0, setting the band gap flag Bflag to 1; otherwise, setting the band gap flag Bflag to 0;
according to the rules of permutation and combination, any material belongs to one of 8 (2 × 2 × 2) categories, which gives the code of the condition information of the material;
converting the code of the condition information of the material into a one-hot vector gives the condition vector of the material;
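A minimal sketch of this condition encoding follows; the ordering of the three flag bits within the 8-way code is not specified in the text and is an assumption here.

```python
import numpy as np

def condition_vector(vflag: int, fflag: int, bflag: int) -> np.ndarray:
    """Map the three binary flags to one of 2 x 2 x 2 = 8 categories and one-hot encode it."""
    category = (vflag << 2) | (fflag << 1) | bflag   # assumed bit order: Vflag, Fflag, Bflag
    c = np.zeros(8, dtype=np.float32)
    c[category] = 1.0
    return c

# A chemically valid material with non-positive per-atom formation energy and a
# non-negative band gap falls into the last category under this assumed ordering.
print(condition_vector(1, 1, 1))   # [0. 0. 0. 0. 0. 0. 0. 1.]
```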
S3, constructing a preliminary conditional adversarial autoencoder model (the ConALAE preliminary model, whose structure is shown in Fig. 2) based on the GAN network and the VAE network; this specifically comprises the following steps:
the constructed preliminary conditional adversarial autoencoder model comprises a mapping network F, a generation network G, an encoding network E and a discriminator network D; the mapping network F and the generation network G together play the role of the generator in the GAN, and the encoding network E and the discriminator network D together play the role of the discriminator in the GAN;
the mapping network F is used to map data samples into the embedding space; the mapping network F comprises eight fully connected layers, the first fully connected layer mapping (128 + 8) dimensions to 512 dimensions and the second to eighth fully connected layers maintaining a 512-dimensional mapping transformation;
the generation network G is used to generate candidate materials from the embedding space; the generation network G comprises a fully connected layer and four deconvolution layers: the fully connected layer maps (512 + 8) dimensions to 32 + 8 dimensions, and the four deconvolution layers all have 3 × 3 convolution kernels with strides of (2,2), (2,2), (2,2) and (2,1), respectively;
the structure of the encoding network E mirrors that of the generation network G, and the encoding network E is used to encode candidate materials into the embedding space; the encoding network E comprises four convolutional layers and a fully connected layer, each layer mirroring the corresponding layer of the generation network G; the four convolutional layers all have 3 × 3 convolution kernels with strides of (2,1), (2,2), (2,2) and (2,2), respectively, and the fully connected layer performs a (512 + 8)-dimensional mapping transformation;
the discriminator network D is used to obtain a real/fake decision for the corresponding material from the embedding-space input; the discriminator network D comprises six fully connected layers, the first to fifth fully connected layers converting (512 + 8) dimensions to 512 dimensions and the sixth fully connected layer outputting the real/fake decision for the candidate material;
the input distribution of the generation network G is matched with the output distribution of the encoding network E, i.e., the latent-space distribution produced by the encoding network E is constrained by a prior distribution, which embodies the VAE idea;
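The two fully specified sub-networks can be sketched as follows; this is an illustrative PyTorch sketch rather than the patent's implementation. The 128-dimensional latent, 512-dimensional embedding and 8-dimensional condition vector are read off the (128 + 8) and (512 + 8) dimensions quoted above, while the activation functions and the combined real/fake-plus-classification head of D are assumptions. The generation network G and the encoding network E form a mirrored deconvolution/convolution pair whose intermediate feature-map sizes are not fully specified in the text, so they are omitted here.

```python
import torch
import torch.nn as nn

LATENT, EMBED, COND = 128, 512, 8

class MappingNetwork(nn.Module):
    """F: eight fully connected layers, (128 + 8) -> 512, then seven 512 -> 512 layers."""
    def __init__(self):
        super().__init__()
        layers = [nn.Linear(LATENT + COND, EMBED), nn.LeakyReLU(0.2)]
        for _ in range(7):
            layers += [nn.Linear(EMBED, EMBED), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z_and_cond):          # expects the spliced vector (z, c)
        return self.net(z_and_cond)         # w_z

class Discriminator(nn.Module):
    """D: five fully connected layers (512 + 8) -> 512, plus a final head that
    outputs a real/fake score together with an 8-way classification output (assumed)."""
    def __init__(self):
        super().__init__()
        layers = [nn.Linear(EMBED + COND, EMBED), nn.LeakyReLU(0.2)]
        for _ in range(4):
            layers += [nn.Linear(EMBED, EMBED), nn.LeakyReLU(0.2)]
        self.trunk = nn.Sequential(*layers)
        self.head = nn.Linear(EMBED, 1 + COND)   # sixth fully connected layer

    def forward(self, w_and_cond):               # expects the spliced vector (w, c)
        out = self.head(self.trunk(w_and_cond))
        return out[:, :1], out[:, 1:]            # real/fake score, classification logits

z = torch.randn(4, LATENT)
c = torch.zeros(4, COND); c[:, 3] = 1.0
w_z = MappingNetwork()(torch.cat([z, c], dim=1))
score, cls = Discriminator()(torch.cat([w_z, c], dim=1))
print(w_z.shape, score.shape, cls.shape)          # (4, 512) (4, 1) (4, 8)
```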
S4, training the preliminary conditional adversarial autoencoder model constructed in step S3 with the real data samples obtained in step S1, the condition vectors obtained in step S2, the randomly sampled latent variables and the condition information corresponding to the latent variables to obtain a conditional adversarial autoencoder model; this specifically comprises the following steps:
splicing the randomly sampled latent variable z with its corresponding condition information c̃ to obtain a first spliced vector (z, c̃);
inputting the first spliced vector (z, c̃) into the mapping network F to obtain the mapping network output w_z;
splicing the mapping network output w_z with the condition vector c̃ to obtain a second spliced vector (w_z, c̃);
inputting the second spliced vector (w_z, c̃) into the generation network G to obtain a candidate material x̃ that satisfies the condition c̃;
inputting the candidate material x̃ and the real data sample x into the encoding network E to obtain the latent variable w_x̃ of the candidate material x̃ in the embedding space and the latent variable w_x of the real data sample x in the embedding space;
combining the latent variable w_x̃ of the candidate material x̃ in the embedding space with the condition vector c̃ to obtain a third spliced vector (w_x̃, c̃);
combining the latent variable w_x of the real data sample x in the embedding space with the condition vector c to obtain a fourth spliced vector (w_x, c);
inputting the third spliced vector (w_x̃, c̃) and the fourth spliced vector (w_x, c) into the discriminator network D to obtain the real/fake decision and the classification result of the corresponding material;
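The data flow just described can be summarized in the following sketch, written against generic callables F, G, E and D (for instance, modules like the ones sketched earlier) so that it does not depend on any particular layer implementation; the variable names mirror the symbols used above and are assumptions.

```python
import torch

def conalae_forward(F, G, E, D, x, c, z, c_tilde):
    """One forward pass of the conditional adversarial autoencoder described above."""
    w_z = F(torch.cat([z, c_tilde], dim=1))                            # first spliced vector -> F
    x_tilde = G(torch.cat([w_z, c_tilde], dim=1))                      # second spliced vector -> G
    w_x_tilde, w_x = E(x_tilde), E(x)                                  # encode candidate and real samples
    fake_score, fake_cls = D(torch.cat([w_x_tilde, c_tilde], dim=1))   # third spliced vector -> D
    real_score, real_cls = D(torch.cat([w_x, c], dim=1))               # fourth spliced vector -> D
    return (real_score, real_cls), (fake_score, fake_cls), (w_z, w_x_tilde)
```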
during training, the encoding network E and the discriminator network D are first updated with the following loss function Loss_ED:
[formula for Loss_ED provided as an image in the original publication]
in the formula, the connection symbol denotes the concatenation (splicing) of vectors; D is the discriminator network; E is the encoding network; P_x is the prior distribution of x; P_x̃ is the prior distribution of x̃; one term is the matching loss between the probability distribution of x and the prior distribution of x; another term is the matching loss between the probability distribution of x̃ and the prior distribution of x̃; λ is a constant; || · || is the 1-norm; ∇ is the gradient operator; BCE( ) is the binary cross-entropy loss; cls_x is the classification output obtained by passing x through the D network; cls_x̃ is the classification output obtained by passing x̃ through the D network; c is the condition vector corresponding to x; c̃ is the condition vector corresponding to x̃;
then, the mapping network F and the generation network G are updated with the following loss function Loss_FG:
[formula for Loss_FG provided as an image in the original publication]
finally, the encoding network E and the generation network G are updated with the following matching loss function Loss_EG:
[formula for Loss_EG provided as an image in the original publication]
where F is the mapping network; G is the generation network; E is the encoding network; z is the latent variable; P_z is the prior distribution of z; the matching-loss term is the matching loss between the probability distribution of the encoded latent variable and the prior distribution of z; || · ||² is the square of the 2-norm.
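Since the three loss expressions above are reproduced only as images in the original publication, the following is a hedged reconstruction, assuming the standard three-phase ALAE objectives extended with the condition concatenation (written here as ⊕) and the BCE classification terms listed in the definitions; the exact form of the matching-loss terms (written abstractly as L_match) and the placement of the constant λ are assumptions and may differ from the patent's formulas.

```latex
% Hedged reconstruction (not verbatim from the patent), with \oplus denoting concatenation
% and \tilde{x} = G(F(z \oplus \tilde{c}) \oplus \tilde{c}) the generated candidate material.
\begin{aligned}
\mathrm{Loss}_{ED} \approx{}& \mathbb{E}_{z \sim P_z}\!\left[\operatorname{softplus}\!\big(D(E(\tilde{x}) \oplus \tilde{c})\big)\right]
  + \mathbb{E}_{x \sim P_x}\!\left[\operatorname{softplus}\!\big(-D(E(x) \oplus c)\big)\right]
  + \lambda\, \mathbb{E}_{x \sim P_x}\!\left[\big\|\nabla D(E(x) \oplus c)\big\|_1\right] \\
 &+ \mathcal{L}_{\mathrm{match}}(x)
  + \mathcal{L}_{\mathrm{match}}(\tilde{x})
  + \mathrm{BCE}(cls_x, c)
  + \mathrm{BCE}(cls_{\tilde{x}}, \tilde{c}), \\[4pt]
\mathrm{Loss}_{FG} \approx{}& \mathbb{E}_{z \sim P_z}\!\left[\operatorname{softplus}\!\big(-D(E(\tilde{x}) \oplus \tilde{c})\big)\right]
  + \mathrm{BCE}(cls_{\tilde{x}}, \tilde{c}), \\[4pt]
\mathrm{Loss}_{EG} \approx{}& \mathcal{L}_{\mathrm{match}}(z)
  + \big\|F(z \oplus \tilde{c}) - E(\tilde{x})\big\|_2^2 .
\end{aligned}
```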
S5, generating the final material compositions with the conditional adversarial autoencoder model obtained in step S4; specifically, randomly sampling latent variables z from the prior distribution p(z) and using the conditional adversarial autoencoder model obtained in step S4 to generate the final material compositions.
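A minimal sketch of this generation step, under the same assumptions as the earlier snippets (F and G stand for the trained mapping and generation networks, and a 128-dimensional standard-normal prior is assumed):

```python
import torch

def generate_candidates(F, G, cond_vec, n_samples=16, latent_dim=128):
    """Sample z ~ p(z), splice it with the target condition vector, and decode via F and G."""
    z = torch.randn(n_samples, latent_dim)                      # latent variables from the prior
    c = torch.as_tensor(cond_vec, dtype=torch.float32).repeat(n_samples, 1)
    w_z = F(torch.cat([z, c], dim=1))
    return G(torch.cat([w_z, c], dim=1))                        # candidate material matrices
```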
The effects of the process of the present invention will be further described with reference to examples.
The examples were carried out on two large public data sets, OQMD and MP. The effect of a generative model is generally difficult to evaluate, and evaluation indicators usually need to be proposed for the specific application field. In this application, several important characteristics in the materials field and several standard generative-model indicators are used to evaluate the generation quality of the proposed ConALAE model and of the baseline model MatGAN. On the one hand, three material characteristics are selected: chemical validity (charge neutrality and electronegativity balance), per-atom formation energy and band gap, where the per-atom formation energy is related to the thermal stability of the material and the band gap is an important characteristic of materials such as solar cells. On the other hand, the uniqueness rate and the novelty rate are used to evaluate the quality of the samples generated by the ConALAE model and the baseline model MatGAN.
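The uniqueness and novelty rates are not given explicit formulas in the text; the sketch below uses the common definitions (fraction of distinct generated compositions, and fraction of distinct generated compositions absent from the training set), which are assumptions.

```python
def uniqueness_rate(generated):
    """Fraction of generated compositions that are distinct."""
    return len(set(generated)) / len(generated)

def novelty_rate(generated, training_set):
    """Fraction of distinct generated compositions that do not appear in the training set."""
    unique = set(generated)
    training = set(training_set)
    return sum(1 for g in unique if g not in training) / len(unique)

print(uniqueness_rate(["Fe2O3", "Fe2O3", "LiCoO2"]))   # 0.666...
print(novelty_rate(["Fe2O3", "LiCoO2"], ["Fe2O3"]))    # 0.5
```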
Evaluation of the chemical validity of the generated materials: the evaluation data are shown in Table 1.
Table 1. Chemical validity evaluation data of the generated materials
[Table 1 is provided as an image in the original publication and is not reproduced here.]
On both data sets, the ConALAE method (the method of the invention) achieves higher chemical validity than the MatGAN method, and is only slightly lower in uniqueness rate. In particular, on the OQMD data set the ConALAE model of the invention reaches a chemical validity of 76.7% and is not limited by the chemical validity of the original material data set (whose chemical validity is 41.6%). The MatGAN model cannot break through this limit and reaches only 38.4% chemical validity in the experiments.
Analysis of the material condition-generation results: the analysis data are shown in Table 2.
Table 2. Analysis data of the material condition-generation results
[Table 2 is provided as an image in the original publication and is not reproduced here.]
Experimental analysis of material-generation novelty: the analysis data are shown in Table 3. Fig. 3 compares the novelty of materials generated by the method of the invention and by other methods; Fig. 3(a) shows the comparison on the MP data set and Fig. 3(b) the comparison on the OQMD data set.
Table 3. Summary of the material-generation novelty experiment data

Method          OQMD      MP
MatGAN          70.10%    97.50%
The invention   96.10%    99.20%
As can be seen from Table 2, Table 3 and Fig. 3, the novelty of the materials generated by the proposed ConALAE model is higher than that of MatGAN, the current best-performing method, on both data sets. The method of the invention reaches a novelty rate of 99.2% on the MP data set and 96.1% on the OQMD data set. This demonstrates that the method of the invention keeps producing highly novel candidate materials even when a large number of material compositions are generated.
Fig. 4 is a schematic flow chart of the evaluation method of the present invention. The evaluation method comprising the GAN- and VAE-based material composition generation method provided by the invention comprises the following steps:
S1, encoding the composition of each material in an original data set to obtain real data samples;
S2, one-hot encoding the specific attribute of each material in the original data set and expressing it as a condition vector;
S3, constructing a preliminary conditional adversarial autoencoder model based on the GAN network and the VAE network;
S4, training the preliminary conditional adversarial autoencoder model constructed in step S3 with the real data samples obtained in step S1, the condition vectors obtained in step S2, the randomly sampled latent variables and the condition information corresponding to the latent variables, to obtain a conditional adversarial autoencoder model;
S5, generating the final material compositions with the conditional adversarial autoencoder model obtained in step S4;
S6, evaluating the material compositions obtained in step S5 in terms of chemical validity, condition generation, novelty, per-atom formation energy and band gap.

Claims (7)

1. A GAN- and VAE-based material composition generation method, comprising the following steps:
S1, encoding the composition of each material in an original data set to obtain real data samples;
S2, one-hot encoding the specific attribute of each material in the original data set and expressing it as a condition vector;
S3, constructing a preliminary conditional adversarial autoencoder model based on a GAN network and a VAE network;
S4, training the preliminary conditional adversarial autoencoder model constructed in step S3 with the real data samples obtained in step S1, the condition vectors obtained in step S2, randomly sampled latent variables and the condition information corresponding to the latent variables, to obtain a conditional adversarial autoencoder model;
S5, generating the final material compositions with the conditional adversarial autoencoder model obtained in step S4.
2. The GAN- and VAE-based material composition generation method according to claim 1, wherein the step S1 of encoding the composition of each material in the original data set to obtain real data samples comprises the following steps:
counting and analyzing the data in the open OQMD and MP data sets, and selecting e chemical elements as the atom types; stipulating that the number of elements in the composition of each material in the real data samples does not exceed n;
finally, expressing the composition of each material as a matrix M, where M ∈ R^(e×n), e is the total number of chemical elements and n is the maximum number of elements in a composition.
3. The GAN- and VAE-based material composition generation method according to claim 2, wherein the step S2 of one-hot encoding the specific attribute of each material in the original data set and expressing it as a condition vector comprises the following steps:
selecting three material characteristics, namely chemical validity, per-atom formation energy and band gap, as the generation condition information;
encoding the generation condition information as follows:
if the material satisfies charge neutrality and electronegativity balance, setting the chemical validity flag Vflag to 1; otherwise, setting the chemical validity flag Vflag to 0;
if the per-atom formation energy of the material is not greater than 0, setting the formation energy flag Fflag to 1; otherwise, setting the formation energy flag Fflag to 0;
if the band gap of the material is not less than 0, setting the band gap flag Bflag to 1; otherwise, setting the band gap flag Bflag to 0;
combining the three flags according to the rules of permutation and combination gives the code of the condition information of the material;
converting the code of the condition information of the material into a one-hot vector gives the condition vector of the material.
4. The GAN- and VAE-based material composition generation method according to claim 3, wherein the step S3 of constructing a preliminary conditional adversarial autoencoder model based on the GAN network and the VAE network comprises the following steps:
the constructed preliminary conditional adversarial autoencoder model comprises a mapping network F, a generation network G, an encoding network E and a discriminator network D;
the mapping network F is used to map data samples into the embedding space; the mapping network F comprises eight fully connected layers, the first fully connected layer mapping (128 + 8) dimensions to 512 dimensions and the second to eighth fully connected layers maintaining a 512-dimensional mapping transformation;
the generation network G is used to generate candidate materials from the embedding space; the generation network G comprises a fully connected layer and four deconvolution layers: the fully connected layer maps (512 + 8) dimensions to 32 + 8 dimensions, and the four deconvolution layers all have 3 × 3 convolution kernels with strides of (2,2), (2,2), (2,2) and (2,1), respectively;
the encoding network E is used to encode candidate materials into the embedding space; the encoding network E comprises four convolutional layers and a fully connected layer, the four convolutional layers all having 3 × 3 convolution kernels with strides of (2,1), (2,2), (2,2) and (2,2), respectively, and the fully connected layer performing a (512 + 8)-dimensional mapping transformation;
the discriminator network D is used to obtain a real/fake decision for the corresponding material from the embedding-space input; the discriminator network D comprises six fully connected layers, the first to fifth fully connected layers converting (512 + 8) dimensions to 512 dimensions and the sixth fully connected layer outputting the real/fake decision for the candidate material;
the input distribution of the generation network G matches the output distribution of the encoding network E.
5. The GAN- and VAE-based material composition generation method according to claim 4, wherein the step S4 of training the preliminary conditional adversarial autoencoder model constructed in step S3 with the real data samples obtained in step S1, the condition vectors obtained in step S2, the randomly sampled latent variables and the condition information corresponding to the latent variables comprises the following steps:
splicing the randomly sampled latent variable z with its corresponding condition information c̃ to obtain a first spliced vector (z, c̃);
inputting the first spliced vector (z, c̃) into the mapping network F to obtain the mapping network output w_z;
splicing the mapping network output w_z with the condition vector c̃ to obtain a second spliced vector (w_z, c̃);
inputting the second spliced vector (w_z, c̃) into the generation network G to obtain a candidate material x̃ that satisfies the condition c̃;
inputting the candidate material x̃ and the real data sample x into the encoding network E to obtain the latent variable w_x̃ of the candidate material x̃ in the embedding space and the latent variable w_x of the real data sample x in the embedding space;
combining the latent variable w_x̃ of the candidate material x̃ in the embedding space with the condition vector c̃ to obtain a third spliced vector (w_x̃, c̃);
combining the latent variable w_x of the real data sample x in the embedding space with the condition vector c to obtain a fourth spliced vector (w_x, c);
inputting the third spliced vector (w_x̃, c̃) and the fourth spliced vector (w_x, c) into the discriminator network D to obtain the real/fake decision and the classification result of the corresponding material;
during training, the encoding network E and the discriminator network D are first updated with the following loss function Loss_ED:
[formula for Loss_ED provided as an image in the original publication]
in the formula, the connection symbol denotes the concatenation (splicing) of vectors; D is the discriminator network; E is the encoding network; P_x is the prior distribution of x; P_x̃ is the prior distribution of x̃; one term is the matching loss between the probability distribution of x and the prior distribution of x; another term is the matching loss between the probability distribution of x̃ and the prior distribution of x̃; λ is a constant; || · || is the 1-norm; ∇ is the gradient operator; BCE( ) is the binary cross-entropy loss; cls_x is the classification output obtained by passing x through the D network; cls_x̃ is the classification output obtained by passing x̃ through the D network; c is the condition vector corresponding to x; c̃ is the condition vector corresponding to x̃;
then, the mapping network F and the generation network G are updated with the following loss function Loss_FG:
[formula for Loss_FG provided as an image in the original publication]
finally, the encoding network E and the generation network G are updated with the following matching loss function Loss_EG:
[formula for Loss_EG provided as an image in the original publication]
where F is the mapping network; G is the generation network; E is the encoding network; z is the latent variable; P_z is the prior distribution of z; the matching-loss term is the matching loss between the probability distribution of the encoded latent variable and the prior distribution of z; || · ||² is the square of the 2-norm.
6. The GAN- and VAE-based material composition generation method according to claim 5, wherein the step S5 of generating the final material compositions with the conditional adversarial autoencoder model obtained in step S4 specifically comprises randomly sampling the latent variable z from the prior distribution p(z) and generating the final material compositions with the conditional adversarial autoencoder model obtained in step S4.
7. An evaluation method comprising the GAN- and VAE-based material composition generation method according to any one of claims 1 to 6, further comprising the following step:
S6, evaluating the material compositions obtained in step S5 in terms of chemical validity, condition generation, novelty, per-atom formation energy and band gap.
CN202211412749.8A 2022-11-11 2022-11-11 Material component generation method and evaluation method based on GAN and VAE Active CN115691695B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211412749.8A CN115691695B (en) 2022-11-11 2022-11-11 Material component generation method and evaluation method based on GAN and VAE

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211412749.8A CN115691695B (en) 2022-11-11 2022-11-11 Material component generation method and evaluation method based on GAN and VAE

Publications (2)

Publication Number Publication Date
CN115691695A (en) 2023-02-03
CN115691695B (en) 2023-06-30

Family

ID=85052744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211412749.8A Active CN115691695B (en) 2022-11-11 2022-11-11 Material component generation method and evaluation method based on GAN and VAE

Country Status (1)

Country Link
CN (1) CN115691695B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190279075A1 (en) * 2018-03-09 2019-09-12 Nvidia Corporation Multi-modal image translation using neural networks
CN109543745A (en) * 2018-11-20 2019-03-29 江南大学 Feature learning method and image-recognizing method based on condition confrontation autoencoder network
US20200294630A1 (en) * 2019-03-12 2020-09-17 California Institute Of Technology Systems and Methods for Determining Molecular Structures with Molecular-Orbital-Based Features
CN112599208A (en) * 2019-10-02 2021-04-02 三星电子株式会社 Machine learning system and method for generating material structure of target material attributes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LITAO CHEN et al.: "Generative models for inverse design of inorganic solid materials", JOURNAL OF MATERIALS INFORMATICS *

Also Published As

Publication number Publication date
CN115691695B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
Kodi Ramanah et al. Super-resolution emulator of cosmological simulations using deep physical models
JP2021060992A (en) Machine learning system and method
Vieira et al. Improved efficient, nearly orthogonal, nearly balanced mixed designs
CN111782768B (en) Fine-grained entity identification method based on hyperbolic space representation and label text interaction
CN111428848B (en) Molecular intelligent design method based on self-encoder and 3-order graph convolution
Luck et al. Number Theory and Physics: Proceedings of the Winter School, Les Houches, France, March 7–16, 1989
CN112560966B (en) Polarized SAR image classification method, medium and equipment based on scattering map convolution network
US11455440B2 (en) Graphic user interface assisted chemical structure generation
CN114359582A (en) Small sample feature extraction method based on neural network and related equipment
CN112598039A (en) Method for acquiring positive sample in NLP classification field and related equipment
Flaut et al. Models and Theories in Social Systems
Maekawa et al. General generator for attributed graphs with community structure
Nousi et al. Autoencoder-driven spiral representation learning for gravitational wave surrogate modelling
CN115691695A (en) Material component generation method and evaluation method based on GAN and VAE
Cui et al. On robustness of neural odes image classifiers
Zheng et al. Variant map construction to detect symmetric properties of genomes on 2D distributions
CN114722920A (en) Deep map convolution model phishing account identification method based on map classification
Duan et al. Pre-trained bidirectional temporal representation for crowd flows prediction in regular region
Kekre et al. Discrete Sine Transform Sectorization for Feature Vector Generation in CBIR
Siregar Learning human insight by cooperative AI: Shannon-Neumann measure
Basu et al. Guest editors' introduction to the special section on syntactic and structural pattern recognition
Fabregat-Hernández et al. Exploring explainable AI: category theory insights into machine learning algorithms
Liu et al. A Click-through Rate Prediction Method Based on Interaction Features Extraction for High-dimensional Sparse Data
Knaute Tensor Networks: From Holography to Quantum Field Theory
CN111177557B (en) Interpretable nerve factor recommendation system and method based on inter-domain explicit interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant