CN113360485A

CN113360485A - Engineering data enhancement algorithm based on a generative adversarial network

Info

Publication number: CN113360485A
Application number: CN202110528930.4A
Authority: CN
Inventors: 刘洋; 申迎港; 王浩成; 张茜; 蔡宗熙
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-05-14
Filing date: 2021-05-14
Publication date: 2021-09-07

Abstract

The invention discloses an engineering data enhancement algorithm based on a generative adversarial network (GAN), which aims to provide researchers with sufficient data so that they can carry out more accurate research. The algorithm comprises the following steps: acquiring original data and preprocessing it, including "shutdown" data removal, noise reduction and normalization, to obtain a set of smooth and stable construction data; feeding the processed data into the GAN data enhancement algorithm, which enhances the data through the adversarial interplay of a generator and a discriminator; and outputting engineering data whose distribution is similar to that of the original data. The method combines data preprocessing with data enhancement, removes several kinds of useless data including noise, and uses the adversarial principle of the generator and the discriminator in a generative adversarial network to enhance on-board construction data, solving the problem of data shortage in research.

Description

Engineering data enhancement algorithm based on a generative adversarial network

Technical Field

The invention relates to an engineering data enhancement algorithm, in particular to an engineering data enhancement algorithm based on a generative adversarial network.

Background

With the development of deep learning in recent years, deep neural networks have revolutionized classification tasks. A classifier based on a deep neural network can achieve high accuracy provided that sufficient labeled samples are available as training data. In some situations, however, labeled data are difficult to collect, or are expensive, time-consuming and labor-intensive to obtain. When data are insufficient, a neural network is difficult to train stably and generalizes poorly.

To address this problem, Goodfellow of the University of Montreal proposed a deep learning method based on the generative adversarial network (GAN), which has been applied to the problem that neural networks are difficult to train when data are scarce. Since its introduction, the GAN has attracted a great deal of attention, and researchers have made various improvements to the model. The deep convolutional generative adversarial network proposed by Radford combines the GAN with deep convolution. The model removes the fully connected layers and pooling layers of the original network and applies batch normalization in both the generator and the discriminator, which shortens training time and greatly improves both the stability of the GAN and the quality of the generated images. Compared with other generative models, the GAN does not need to make assumptions about the distribution of the original data, which makes the model relatively flexible; without such prior modeling, however, the model can also become uncontrollable. Mirza therefore introduced condition variables into the network construction, yielding the conditional generative adversarial network. The condition variable may be any information; when it is the corresponding label, the model changes from unsupervised to supervised.

Data enhancement, also referred to as data augmentation, expands a data set of limited structure and size in order to extend the value of the limited data. Data enhancement comprises supervised and unsupervised data enhancement. Supervised data enhancement can be divided into single-sample and multi-sample data enhancement, and unsupervised data enhancement comprises two directions: generating new data and learning enhancement strategies.

However, GAN models are mostly used to enhance image data or language data, and the enhancement of engineering data has not yet been realized. Because field construction data are difficult to obtain, researchers cannot obtain enough engineering data, which makes model verification difficult.

Building on the data enhancement capability of the generative adversarial network, the method improves on the original algorithm and applies it to the enhancement of construction geological data and TBM operation data. The original data are enhanced according to the generation principle of the GAN, so that the data are expanded in preparation for subsequent big-data analysis and for mining the coupling relationships between data distributions.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides an engineering data enhancement algorithm based on a generative adversarial network, which solves the problem of research stagnating because of insufficient engineering data.

The technical scheme of the invention is as follows:

An engineering data enhancement algorithm based on a generative adversarial network, comprising the following steps:

Step 1: preprocessing the original data; abnormal data and noise data are removed and shutdown data are screened out, so that the engineering data become smoother; noise in the data is reduced by noise reduction processing.

Step 2: normalizing the preprocessed data to avoid errors caused by differences in units and magnitudes in the original data.

Step 3: building and training the GAN model; a GAN data enhancement algorithm is established, and the preprocessed data are fed into the model for training.

Step 4: testing the model; several groups of engineering data are fed into the model for training, and the results are compared with the original data and analyzed to obtain accurate results, completing the data enhancement of the engineering data.

Step 5: optimizing the model; the model is optimized by changing parameters such as the learning rate and the form of the loss function, so that more accurate results can be generated.

Further, the cleaning of abnormal data during the data preprocessing in step 1 is implemented as follows:

Box plot processing is performed on the original data; abnormal data values are identified by the box plot and deleted.

Further, regarding the noise reduction in step 1: the original data contain a large amount of noise and must be smoothed by noise reduction processing. The main noise reduction methods are wavelet transformation and moving average denoising; the invention adopts the moving average method. The processing of "shutdown" data concerns the values of each engineering parameter in the shutdown state and is implemented as follows:

The shutdown data in the original data are screened based on the following formula, which judges whether a single construction record is "shutdown" data: when any one of the four main working parameters of the TBM is zero, the current construction record is determined to be "shutdown" data.

P＝f(RSP)f(T)f(F)f(V)

In the above formula, RSP is the cutter rotation speed, T is the cutter torque, F is the propulsion force, and V is the propulsion speed.

The function f is defined as follows:

f(x)＝0 when x＝0, and f(x)＝1 when x≠0, so that P＝0 identifies a "shutdown" record.
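The shutdown screening can be illustrated with a minimal Python sketch. It assumes, as the description implies, that f maps zero to 0 and any non-zero value to 1, so that a record with P＝0 is shutdown data. The dictionary keys and sample values are hypothetical and are not taken from the patent.

```python
def f(x):
    # Assumed indicator: 0 for a zero reading, 1 otherwise.
    return 0 if x == 0 else 1


def is_shutdown(record):
    # P = f(RSP) f(T) f(F) f(V); P == 0 means at least one parameter is zero.
    p = f(record["RSP"]) * f(record["T"]) * f(record["F"]) * f(record["V"])
    return p == 0


# Hypothetical TBM records (illustrative values only).
records = [
    {"RSP": 6.0, "T": 1800.0, "F": 9500.0, "V": 45.0},
    {"RSP": 0.0, "T": 0.0, "F": 3000.0, "V": 0.0},
    {"RSP": 5.8, "T": 1750.0, "F": 9300.0, "V": 0.0},
]

# Keep only the records taken while the machine is working.
working = [r for r in records if not is_shutdown(r)]
```

Only the first record survives, because each of the other two has at least one zero-valued working parameter.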

Further, in step 2, the preprocessed data are normalized so that all values are converted to the range 0 to 1, which allows higher accuracy to be achieved. The specific implementation is as follows:

Suppose a set of data v₀＝{v₁,v₂,...,v_n} has maximum value v_max and minimum value v_min; every value v_i in the set is converted according to the following formula:

v_i′＝(v_i − v_min)/(v_max − v_min)

The converted result is the normalized data, and data enhancement performs better on normalized data.
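A minimal sketch of this normalization in Python is given below, assuming the usual min-max form in which the minimum maps to 0 and the maximum maps to 1. The function name and the guard against a constant series are illustrative additions.

```python
def min_max_normalize(values):
    """Scale values linearly so that min(values) -> 0 and max(values) -> 1."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Assumption: a constant series cannot be scaled; the text does not cover this case.
        raise ValueError("all values are identical; min-max scaling is undefined")
    return [(v - v_min) / (v_max - v_min) for v in values]


normalized = min_max_normalize([2.0, 4.0, 6.0])
```

Keeping v_min and v_max would allow generated values to be mapped back to engineering units afterwards, although the text does not describe that step.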

Further, in step 3, the GAN model mainly comprises a generator G and a discriminator D.

The generator G takes a group of randomly distributed noise as input and generates a group of engineering data that conforms to the distribution of the original data.

The discriminator D is a binary classifier whose inputs include the original data and the data generated by the generator G; it outputs a probability value.

The adversarial principle of the GAN data enhancement algorithm is as follows: the goal of the generator is for the generated data to fool the discriminator as far as possible, so that the probability with which the discriminator judges the generated data to be real is as high as possible; the goal of the discriminator is to identify the data generated by the generator as fake as accurately as possible, so that the probability it outputs for such data is as small as possible. The two compete and thereby improve each other until the data generated by the generator successfully fool the discriminator, at which point the generator outputs data that conform to the original data distribution. Through adversarial training, the samples generated by the generator move closer to the original data samples, which effectively improves the quality of the data enhancement.

The network structure of the GAN data enhancement model is divided into a generator G and a discriminator D, each a three-layer neural network whose input layer, hidden layer and output layer are all linear layers. Through linear transformations and the nonlinear transformation of an activation function, original data meeting all conditions can be enhanced.

Further, the generation principle of the GAN data enhancement model is a minimax game, whose objective function is as follows:

min_G max_D V(D,G)＝E_{x~p_data(x)}[log D(x)]＋E_{z~p_z(z)}[log(1 − D(G(z)))]

where E(·) denotes the expectation, x is drawn from the original sample data, and z is the noise input to the generator G.
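As an illustration, the value of such an objective can be estimated from discriminator outputs. The sketch below assumes the standard GAN form V(D,G)＝E[log D(x)]＋E[log(1 − D(G(z)))], with E read as an expectation estimated by a sample mean. The function name and the probability values are hypothetical.

```python
import math


def value_function(d_real, d_fake):
    """Sample estimate of V(D, G) from discriminator outputs in (0, 1).

    d_real: D(x) for original samples; d_fake: D(G(z)) for generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from generated data outputs 0.5 everywhere.
v_undecided = value_function([0.5, 0.5], [0.5, 0.5])
# A confident discriminator scores real data high and generated data low.
v_confident = value_function([0.9, 0.9], [0.1, 0.1])
```

The discriminator tries to raise this value and the generator tries to lower it. When the discriminator outputs 0.5 for every input, the estimate equals −2·ln 2.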

Further, step 3 comprises the following steps:

Step 3.1: initializing the parameters of the two neural networks, the generator G and the discriminator D.

Step 3.2: n samples are drawn from the original data, and the generator produces n samples from the defined noise distribution. The generator G is fixed and the discriminator D is trained so that it can distinguish real data from fake.

Step 3.3: training iterates in a loop in which the discriminator D and the generator G compete against each other. In the ideal state, the final discriminator D cannot tell whether data come from the original data or from the generator G; the discrimination probability is then 0.5 and training is complete.
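The alternating scheme of steps 3.1 to 3.3 can be sketched with a deliberately tiny model. This is not the three-layer network of the invention: it assumes a scalar linear generator G(z)＝a·z＋c, a logistic discriminator D(x)＝sigmoid(w·x＋b), the standard log-likelihood GAN objective, and hand-derived gradients. All names, hyperparameters and data values are hypothetical.

```python
import math
import random


def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(real_data, steps=200, n=16, lr=0.05, seed=0):
    rng = random.Random(seed)
    # Step 3.1: initialise the parameters of G (a, c) and D (w, b).
    a, c = rng.uniform(-1, 1), rng.uniform(-1, 1)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(steps):
        # Step 3.2: draw n real samples and n generated samples, fix G, train D.
        xs = [rng.choice(real_data) for _ in range(n)]
        zs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        gs = [a * z + c for z in zs]
        grad_w = grad_b = 0.0
        for x, g in zip(xs, gs):
            d_real, d_fake = sigmoid(w * x + b), sigmoid(w * g + b)
            # Gradient of log D(x) + log(1 - D(G(z))) with respect to w and b.
            grad_w += (1.0 - d_real) * x - d_fake * g
            grad_b += (1.0 - d_real) - d_fake
        w += lr * grad_w / n  # ascent: D maximises the objective
        b += lr * grad_b / n
        # One round of step 3.3: fix D, update G to reduce log(1 - D(G(z))).
        grad_a = grad_c = 0.0
        for z in zs:
            d_fake = sigmoid(w * (a * z + c) + b)
            grad_a += -d_fake * w * z
            grad_c += -d_fake * w
        a -= lr * grad_a / n  # descent: G minimises the objective
        c -= lr * grad_c / n
    return (a, c), (w, b)


# Hypothetical normalized samples standing in for the target original data.
generator, discriminator = train_toy_gan([0.4, 0.5, 0.6])
```

The sketch only shows the order of the updates. It makes no claim about convergence speed or about the quality of the generated data.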

Noise data are input into the trained generator; after transformation by the three-layer neural network of the generator G, the model produces generated samples whose distribution approximates that of the original data, thereby expanding the original data.

Furthermore, the data dimension used in training can be set as required. Because engineering data are one-dimensional, the original data are treated as a one-dimensional matrix; such a matrix is fed into the generator, which outputs a one-dimensional matrix.

Further, an activation function is added after each linear layer in the generator G and discriminator D neural networks. Without activation functions, the stacked linear layers would be equivalent to a single linear function. The sigmoid activation function is therefore introduced into the algorithm, so that the final output values lie in the range 0 to 1. The formula is as follows:

sigmoid(x)＝1/(1＋e^(−x))
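A forward pass through three linear layers, each followed by a sigmoid, might look like the following sketch. The layer sizes (8, 4, 4, 8), the initialization range and the helper names are hypothetical and chosen only to keep the example small.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(inputs, weights, biases):
    """Linear layer: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j]
            for j in range(len(biases))]


def init_layer(n_in, n_out, rng):
    # Assumed initialisation: small uniform weights and zero biases.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


def forward(inputs, layers):
    out = inputs
    for weights, biases in layers:
        # The sigmoid after every linear layer keeps the stack non-linear.
        out = [sigmoid(v) for v in linear(out, weights, biases)]
    return out


rng = random.Random(0)
sizes = [8, 4, 4, 8]  # hypothetical sizes giving three linear layers
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
noise = [rng.gauss(0.0, 1.0) for _ in range(sizes[0])]
generated = forward(noise, layers)
```

Because the last layer is also followed by a sigmoid, every generated value lies strictly between 0 and 1, which matches the range of the normalized data.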

Furthermore, whereas other GAN models feed data into the model in mini-batches, for engineering data it is preferable to set the number of batches to 1, that is, to feed all the data into training at once, which gives a better training effect.

Further, the loss function used in the algorithm is the mean square error (MSE), also called quadratic loss or L2 loss. It is based on the sum of the squared distances between the original variables and the output values; the smaller the loss the better, and the loss eventually settles to a stable value. The formula is as follows:

MSE＝(1/n)Σ_{i=1}^{n}(y_i − x_i)²

where x_i is a data variable generated by the generator G and y_i is an original data variable.

Further, the GAN data enhancement algorithm uses stochastic gradient descent to optimize the model. The optimized parameters are the weights ω and biases b of the linear layers of the generator G and the discriminator D. The principle is to take the partial derivative of the loss function so that the weights move in the direction in which the loss function decreases fastest, thereby optimizing the network. In the above formula x_i＝ω·x_i＋b, and ω is updated for each sample. The specific implementation formula is as follows:

The updated ω is finally obtained, where α is the learning rate, x_i is a data variable generated by the generator G, and y_i is an original data variable.
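A per-sample gradient descent update for a single linear unit can be sketched as follows. This is an assumed generic form, not the formula of the invention: the layer input is written z to avoid reusing x_i on both sides, the per-sample loss is taken as (x − y)², and the training pairs are hypothetical.

```python
def sgd_step(omega, b, z, y, alpha):
    """One per-sample update for x = omega * z + b under the loss (x - y) ** 2."""
    x = omega * z + b
    grad_omega = 2.0 * (x - y) * z  # partial derivative of the loss w.r.t. omega
    grad_b = 2.0 * (x - y)          # partial derivative of the loss w.r.t. b
    return omega - alpha * grad_omega, b - alpha * grad_b


omega, b = 0.0, 0.0
for _ in range(500):
    # Hypothetical (input, target) pairs lying on the line y = 2z + 1.
    for z, y in [(1.0, 3.0), (2.0, 5.0)]:
        omega, b = sgd_step(omega, b, z, y, alpha=0.1)
```

With these pairs the parameters approach ω＝2 and b＝1. The learning rate α sets the step size along the negative gradient.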

Compared with the prior art, the invention has the following beneficial effects:

1. The method uses a generative adversarial network to enhance engineering data, so that research no longer depends solely on obtaining data from the construction party; a single group of data can be expanded into multiple groups of data with the same distribution.

2. Compared with expanding data by interpolation or by the least squares method, the data generated by the GAN data enhancement model reflect the characteristics of the original data distribution, the generation is more convenient and simpler, and the scale and form of the generated data can be chosen freely.

3. The GAN model is divided into a generator G and a discriminator D, and the network structure is relatively simple: the whole network is built from linear layers, which greatly reduces the number of network parameters and the training difficulty.

4. With adversarial training, the data produced when noise data pass through the generator G neural network are closer to the original data variables, which effectively improves the quality of the enhancement.

Drawings

FIG. 1 is a general flow chart of the data enhancement algorithm;

FIG. 2 is a schematic diagram of the GAN data enhancement algorithm.

Detailed Description

The invention is described in further detail below with reference to the following figures and specific examples:

Example:

An engineering data enhancement method based on a generative adversarial network, as shown in FIG. 1, includes the following steps:

Step 1: preprocessing the original data; the "shutdown" data and noise data are removed so that the engineering data become smoother; noise in the data is reduced by noise reduction processing.

The acquired original data are preprocessed: abnormal data and noise data are removed, and the useless data recorded while the machine is stopped are screened out and deleted. The resulting data are engineering data in the working state, with more distinct characteristics and a smoother profile.

Box plot processing is performed on the original data; abnormal data values are identified by the box plot and deleted. Box plot outlier handling mainly screens out and removes data whose values lie beyond the upper and lower quartile lines.
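Box plot outlier screening can be sketched in Python as below. The sketch assumes the conventional rule that values outside Q1 − 1.5·IQR and Q3 + 1.5·IQR are outliers; the text itself only refers to the upper and lower lines of the box plot, so the factor 1.5 is an assumption. The torque readings are hypothetical.

```python
import statistics


def remove_boxplot_outliers(values, whisker=1.5):
    """Drop values outside [Q1 - whisker * IQR, Q3 + whisker * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if lower <= v <= upper]


# Hypothetical cutter torque readings with one obvious spike.
torque = [10.1, 10.3, 9.8, 10.0, 10.2, 9.9, 55.0, 10.1]
cleaned = remove_boxplot_outliers(torque)
```

The spike of 55.0 is removed while the remaining seven readings are kept.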

The data remaining after outlier screening are then denoised. The purpose of noise reduction is to eliminate high-frequency noise in the original data and further improve data quality. Two noise reduction methods are in common use: the wavelet transform method and the moving average method. The moving average method smooths the original data with a sliding window, and data processed in this way are better suited to data enhancement, so the moving average method is adopted for noise reduction.
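A sliding-window moving average can be written in a few lines of Python. The window length of 3 is a hypothetical choice, since the text does not specify one, and this variant returns only the positions where the window fits entirely inside the series.

```python
def moving_average(values, window=3):
    """Smooth a series with a sliding-window mean (output length n - window + 1)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    smoothed = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value and drop the oldest one.
        running += values[i] - values[i - window]
        smoothed.append(running / window)
    return smoothed


smoothed = moving_average([1, 2, 3, 4, 5], window=3)
```

A larger window suppresses more high-frequency noise at the cost of flattening genuine short-term changes in the signal.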

The data that have undergone outlier cleaning and noise reduction still contain a large amount of shutdown data. Since the data to be studied are those recorded while the machine is working, the shutdown data must be removed; this concerns the values of the engineering parameters in the shutdown state, and is implemented as follows:

The shutdown data in the original data are screened based on the following formula, which judges whether a single construction record is "shutdown" data: when any one of the four main working parameters of the TBM is zero, the current construction record is determined to be "shutdown" data.

P＝f(RSP)f(T)f(F)f(V)

The function f is defined as follows:

f(x)＝0 when x＝0, and f(x)＝1 when x≠0

That is, the engineering data consist mainly of the main operation parameters in the working state, and here the shutdown data are removed from the TBM operation data: if any one of the four values (cutter rotation speed, cutter torque, total thrust and propulsion speed) is 0, the record is shutdown data.

The values in the original data differ in magnitude and unit, and feeding data of different ranges into the data enhancement algorithm would require the model to be modified and optimized. Normalization, by contrast, eliminates the influence of units and maps the values into the range 0 to 1 without affecting the distribution characteristics of the data. The specific implementation is as follows:

Let the original data to be enhanced be v₀＝{v₁,v₂,...,v_n}, with maximum value v_max and minimum value v_min; every value v_i in the data is converted according to the following formula:

v_i′＝(v_i − v_min)/(v_max − v_min)

The converted result is the normalized data, and the GAN data enhancement algorithm performs better when given normalized data.

The processed data set is called the target original data and is fed into the generative adversarial network for data enhancement.

Step 3: building and training the GAN model; a GAN data enhancement algorithm is established, and the preprocessed data are fed into the model for training.

In this embodiment, a data set of size [1,4000], i.e. 1 × 4000 ＝ 4000 values, is selected from the target original data. The input noise of the generator G in the GAN is set to size 4000, and the input of the discriminator D is also of size 4000; the target original data and the sample data generated by the generator are input into the discriminator in turn. The output of the generator G has size 4000, and the output of the discriminator D has size 1: a probability value that the generated data conform to the real data. The generation principle of the GAN data enhancement model is a minimax game, whose objective function is as follows:

min_G max_D V(D,G)＝E_{x~p_data(x)}[log D(x)]＋E_{z~p_z(z)}[log(1 − D(G(z)))]

the objective function of discriminator D is:

the objective function of generator G is:

Initialize ω and b in the neural networks of the generator G and the discriminator D, that is, assign initial values to the weights and biases: each of the three layers of each network is given an initial weight and bias, yielding {(ω₁,b₁),(ω₂,b₂),(ω₃,b₃)}.

4000 samples are drawn from the original data, and the hidden layer size of both the discriminator and the generator is set to 2000; that is, the input is a [1,4000] matrix and, after parameter initialization, the weight is a [4000,2000] matrix. The hidden layers of the generator G and the discriminator D have the same size.

During each iteration:

1. 4000 sampling points {y₁,y₂,...,y₄₀₀₀} are selected from the target original data set; the number of sampling points can be adjusted as needed, and 4000 is used in this example;

2. A group of random variables is drawn from a random distribution (Gaussian, normal, etc.), with dimension set to 4000, i.e. {z₁,z₂,...,z₄₀₀₀};

3. Taking z from step 2 as input, it is fed into the generator G neural network to obtain a group of generated data with dimension 4000, i.e. {x₁,x₂,...,x₄₀₀₀}, where x_i＝G(z_i);

4. Update the parameters ω_D of the discriminator D to maximize V_D. The goal is to make V_D as large as possible; according to the following formula, a larger V_D requires a smaller D(G(z)), which is the score obtained when a sample produced by the generator is input to the discriminator D, a binary classifier:

Steps 1 to 4 train and update the parameters of the discriminator D, which generally need to be trained several times.

5. Update the generator parameters ω_G to minimize V_G:

In each iteration:

1. Fix the generator G and update only the parameters of the discriminator D. The samples generated by the generator and samples from the target original data are each fed into the discriminator D. The goal of the discriminator D is to output a larger value when the input comes from the real data set, so that a larger output means the input is more likely real and a smaller output means it is more likely fake.

2. Fix the parameters of the discriminator D and update the parameters of the generator G. A group of noise vectors is input into the generator G to obtain a group of outputs, which are then input into the discriminator D to obtain a value. The discriminator parameters are fixed at this stage, and the generator G updates its own parameters to make this value as large as possible.

After the training of step 3, the GAN has been trained and the algorithm is able to generate engineering data. The target original data to be enhanced are fed into the algorithm, and the output is the enhanced data. The target original data and the data generated by the generator G are substituted into the mean square error (MSE) function; if the resulting MSE is greater than 0.2, the result is unsatisfactory and the GAN must be adjusted again to achieve the best quality.
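The acceptance test described above can be sketched as follows, assuming the generated and original series have equal length and are compared element by element. The threshold of 0.2 comes from the text; the function names and sample values are hypothetical.

```python
def mean_squared_error(generated, original):
    """Mean of the squared differences between paired values."""
    if len(generated) != len(original):
        raise ValueError("both series must have the same length")
    return sum((x - y) ** 2 for x, y in zip(generated, original)) / len(original)


def passes_quality_check(generated, original, threshold=0.2):
    # An MSE above the threshold means the GAN should be adjusted and retrained.
    return mean_squared_error(generated, original) <= threshold


ok = passes_quality_check([0.5, 0.5], [0.4, 0.6])   # small error
bad = passes_quality_check([0.0, 1.0], [1.0, 0.0])  # large error
```

Because the data are normalized to the range 0 to 1, the MSE also lies between 0 and 1, which is what makes a fixed threshold such as 0.2 meaningful.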

Step 5: optimizing the model; the model is optimized by changing the parameters and the form of the loss function, so that more accurate results can be generated.

The model is optimized by changing the learning rate, the hidden layer size and the form of the activation function. Because each group of data has its own optimal parameters, the algorithm as given is suitable for enhancing most engineering data, while for a small portion of data the parameters of the GAN need to be adjusted to achieve the best effect.

In this embodiment, the distribution of the enhanced engineering data almost matches that of the target original data, and substituting the two sets of data into the mean square error function gives a result below 0.2. This indicates that the output of the algorithm is of sufficient quality for research and analysis, which effectively resolves the difficulty of continuing data analysis when data are scarce.

The method preprocesses the engineering data. The preprocessing is not complex and mainly comprises cleaning abnormal data, noise reduction and cleaning shutdown data; the data obtained after the final normalization are the target original data, which can be fed into the GAN. The method adopts a GAN data enhancement algorithm divided into a generator G and a discriminator D. The network structure is simple, consisting mainly of three linear layers, each followed by an activation function, which greatly reduces the number of network parameters and the training difficulty. Adversarial training is used, so that the noise data, once reconstructed by the generator G, more closely approximate the target original data, which effectively improves the quality of the engineering data enhancement. Different data distributions require fine-tuning of the GAN model, but the disclosed algorithm enhances most engineering data with good results.

Obviously, the described embodiment is only one possible embodiment of the invention, not all embodiments; all other embodiments obtained by a person skilled in the art without inventive effort on the basis of the embodiments of the invention fall within the protection scope of the invention.

Claims

1. An engineering data enhancement algorithm based on a generative adversarial network, characterized by comprising the following steps:

step 1: preprocessing the original data; abnormal data and noise data are removed and shutdown data are screened out, so that the engineering data become smoother; noise in the data is reduced by noise reduction processing;

step 2: normalizing the preprocessed data to avoid errors caused by differences in units and magnitudes in the original data;

step 4: testing the model; several groups of engineering data are fed into the model for training, and the results are compared with the original data and analyzed to obtain accurate results, completing the data enhancement of the engineering data;

2. The engineering data enhancement algorithm based on a generative adversarial network according to claim 1, wherein the screening of "shutdown" data in step 1 removes the "shutdown" data from the original data;

whether a single construction record is "shutdown" data is judged by the following formula: when any one of the four main working parameters of the TBM is zero, the current construction record is determined to be "shutdown" data.

P＝f(RSP)f(T)f(F)f(V)

In the above formula, RSP is the cutter rotation speed, T is the cutter torque, F is the propulsion force, and V is the propulsion speed; the function f is defined as follows:

f(x)＝0 when x＝0, and f(x)＝1 when x≠0

3. The engineering data enhancement algorithm based on a generative adversarial network according to claim 1, wherein the normalization of the data in step 2 comprises:

let the data set of a parameter be {V₀} and V_X be a variable in the data set; the result after normalization is V:

4. The engineering data enhancement algorithm based on a generative adversarial network according to claim 1, wherein the GAN model of step 3 comprises two parts, a generator model G and a discriminator model D; each is composed of a three-layer neural network whose input layer, hidden layer and output layer are all linear layers, and the enhancement of original data meeting all conditions is realized through linear transformations and the nonlinear transformation of an activation function;

the generator G takes a group of random noise as input and finally generates a group of engineering data that conforms to the distribution of the original data;

the discriminator D is a binary classifier whose inputs include the original data and the data generated by the generator G; it outputs a scalar, and the larger the value, the more closely the data generated by the generator match the original data.

5. The engineering data enhancement algorithm based on a generative adversarial network according to claim 4, wherein the principle of the GAN is a minimax game between the generator G and the discriminator D, whose objective function is:

min_G max_D V(D,G)＝E_{x~p_data(x)}[log D(x)]＋E_{z~p_z(z)}[log(1 − D(G(z)))]

where E(·) denotes the expectation, x is the original sample data, and z is the noise input to the generator G.

6. The engineering data enhancement algorithm based on a generative adversarial network according to claim 4, wherein an activation function is added after each linear layer in the generator G neural network and the discriminator D neural network; the sigmoid activation function is introduced into the algorithm, so that the final output values lie in the range 0 to 1, and the formula is as follows:

sigmoid(x)＝1/(1＋e^(−x))

7. The engineering data enhancement algorithm based on a generative adversarial network according to claim 1, wherein step 3 comprises the following steps:

step 3.2: drawing n samples from the original data, the generator producing n samples from the defined noise distribution, fixing the generator G and training the discriminator D so that the discriminator can distinguish real data from fake;

8. The engineering data enhancement algorithm based on a generative adversarial network according to claim 1, wherein in step 5 the optimization process is the process of modifying the necessary parameters, including but not limited to the learning rate, the number of training steps and the form of the loss function.

9. The engineering data enhancement algorithm based on a generative adversarial network according to claim 1, wherein the loss function used in step 5 is the mean square error (MSE), also called quadratic loss or L2 loss; it is based on the sum of the squared distances between the original variables and the output values, the smaller the loss the better, and the loss eventually settles to a stable value; the formula is as follows:

MSE＝(1/n)Σ_{i=1}^{n}(y_i − x_i)²

where x_i is a data variable generated by the generator G and y_i is an original data variable.

10. The engineering data enhancement algorithm based on a generative adversarial network according to claim 1, wherein the model optimization in step 5 uses stochastic gradient descent; the optimized parameters are the weights ω and biases b of the linear layers of the neural networks of the generator G and the discriminator D; the principle is to take the partial derivative of the loss function so that the weights move in the direction in which the loss function decreases fastest, thereby optimizing the network; in the above formula x_i＝ω·x_i＋b, and ω is updated for each sample; the specific implementation formula is as follows:

The updated ω is finally obtained, where α is the learning rate, x_i is a data variable generated by the generator G, and y_i is an original data variable.