CN110490320B - Deep neural network structure optimization method based on fusion of prediction mechanism and genetic algorithm - Google Patents

Deep neural network structure optimization method based on fusion of prediction mechanism and genetic algorithm

Info

Publication number
CN110490320B
CN110490320B CN201910696239.XA
Authority
CN
China
Prior art keywords
network
individual
training
code
population
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910696239.XA
Other languages
Chinese (zh)
Other versions
CN110490320A (en)
Inventor
魏巍
徐松正
李威
王聪
张艳宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN201910696239.XA priority Critical patent/CN110490320B/en
Publication of CN110490320A publication Critical patent/CN110490320A/en
Application granted granted Critical
Publication of CN110490320B publication Critical patent/CN110490320B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Abstract

The invention discloses a deep neural network structure optimization method based on the fusion of a prediction mechanism and a genetic algorithm, which solves the technical problem of the low search efficiency of existing network structure search methods. In this scheme, the deep network structure is first encoded to form a network structure code, and network structure codes are randomly generated as the initial generation of a genetic algorithm. Selection, crossover, mutation and prediction are then applied to the individuals of the population, and only the networks corresponding to individuals with higher expected performance are actually trained. Finally, all individuals are evaluated and the next round of selection begins. After the algorithm finishes, the individual with the best fitness is selected as the optimal network structure for the specific task. By predicting network performance before a network is actually trained, the time the search algorithm spends training poorly performing networks is reduced, which greatly accelerates the search process.

Description

Deep neural network structure optimization method based on fusion of prediction mechanism and genetic algorithm
Technical Field
The invention relates to a network structure searching method, in particular to a deep neural network structure optimization method based on the fusion of a prediction mechanism and a genetic algorithm.
Background
Document 1 "Lingxi Xie, Alan Yuille, Genetic CNN. computer Vision and Pattern Recognition (2017)" proposes a network structure searching method based on Genetic algorithm, which introduces Darwinian theory of evolution, considers the network structure as an individual in a population, and continuously updates the network structure through the processes of selection, intersection, variation and evaluation. However, the network structure search method requires complete training of the network before evaluating the network performance, which consumes a lot of time and computing resources.
Document 2, "Bowen Baker, Otkrist Gupta1, Ramesh rake: additive Neural Architecture Search Performance prediction, international Conference on Learning recovery (2018)", predicts the final Performance of the network by using the time sequence information of the network training earlier stage, and introduces an "Early Stop" mechanism to terminate the training process of the network with poor effect in advance. Although the method has a certain acceleration effect on the network search algorithm, the method still needs to carry out partial training on the network, thereby limiting the acceleration effect on the structure search algorithm.
Disclosure of Invention
In order to overcome the low search efficiency of existing network structure search methods, the invention provides a deep neural network structure optimization method based on the fusion of a prediction mechanism and a genetic algorithm. The method first randomly generates neural networks with different structures, trains them completely, and uses the information collected during their training to train a network performance prediction model. In the network structure search stage, the deep network structure is encoded to form a network structure code, and network structure codes are randomly generated as the initial generation of a genetic algorithm; selection, crossover, mutation and prediction are then applied to the individuals of each generation, and only the networks corresponding to individuals with higher expected performance are actually trained; finally, all individuals are evaluated and the next round of selection begins. After the algorithm finishes, the individual with the best fitness is selected as the optimal network structure for the specific task. By predicting network performance before a network is actually trained, the time the search algorithm spends training poorly performing networks is reduced, which greatly accelerates the search process.
The technical scheme adopted by the invention for solving the technical problems is as follows: a deep neural network structure optimization method based on the fusion of a prediction mechanism and a genetic algorithm is characterized by comprising the following steps:
step one, data preprocessing:
Firstly, define the image classification database X = {x_1, x_2, ..., x_n}^T ∈ R^{n×b}, where x_n ∈ R^{1×b} represents the n-th sample; the class label matrix is Y = {y_1, y_2, ..., y_n}^T ∈ R^{n×l}, where y_n ∈ R^{1×l} is the one-hot label of the n-th sample, n = {1, 2, ..., N}, N is the total number of samples, l is the total number of classes and b is the spectral dimension. Each sample in the image classification database X is then normalized to the range 0-1, and N_train samples and their class labels are randomly selected from it to form the training data X_train and the corresponding class labels Y_train, where N_train < N. The remaining data and labels in the data set form the test set, denoted X_test and Y_test respectively.
Step two, determining a coding rule of a network structure:
Firstly, M different network structures are generated, where the structure code of the m-th neural network is C_m. The code consists of S stages, i.e.

C_m = {c_m^1, c_m^2, ..., c_m^S}

where c_m^s is the coding segment of the s-th stage. The s-th stage contains K_s ordered nodes, each node representing a composite operation of convolution, batch normalization and ReLU activation, denoted v_{s,1}, v_{s,2}, ..., v_{s,K_s}. Nodes with smaller numbers in a stage may connect to nodes with larger numbers, and the connections between the nodes are represented with a binary code of K_s(K_s - 1)/2 bits. The 1st bit of the binary code represents the connection (v_{s,1}, v_{s,2}): the bit is 1 if the connection exists and 0 otherwise; the next two bits represent the connections (v_{s,1}, v_{s,3}) and (v_{s,2}, v_{s,3}) between the three nodes, and so on. Setting S = 3, K_1 = 3, K_2 = 4 and K_3 = 5, the network structure code length is 19 bits, i.e.

len(C_m) = Σ_{s=1}^{S} K_s(K_s - 1)/2 = 3 + 6 + 10 = 19    (1)
Step three, collecting training data of the network performance prediction model:
m mutually different structure codes C_1, C_2, ..., C_m are randomly generated, and after automatic compilation the deep network corresponding to each code is fully trained on the specified data set. An Adam optimizer is used to learn the network parameters, and training is iterated T times in total. Each time the network is trained on one batch, the current number of iterations t and the classification accuracy Ag_t on the validation set are recorded as the data required for training the prediction model: Data = [C_m, t, Ag_t], t = {1, 2, ..., T}.
Step four, constructing and training a network performance prediction model:
A network performance prediction model f is defined. The model applies a mapping μ to a structure code C and predicts the accuracy Ap_t of the corresponding neural network on the test set after t iterations of training, i.e.:

Ap_t = f(μ(C_m), t)    (2)
In the mapping phase, the model maps the structure code C into a network structure code group consisting of S structure codes {p_1, p_2, ..., p_S}, where the bits of p_s from bit Σ_{i=1}^{s-1} K_i(K_i - 1)/2 + 1 to bit Σ_{i=1}^{s} K_i(K_i - 1)/2 (the positions belonging to the s-th stage segment) take the values of the corresponding positions of the original structure code, and the remaining positions are filled with zeros, i.e.:

p_s[idx] = C[idx] if the idx-th bit belongs to the s-th stage segment, and p_s[idx] = 0 otherwise    (3)

where p_s[idx] and C[idx] are the values at the idx-th bit of the structure codes p_s and C.
After the structure code has been mapped, p_1, p_2, ..., p_S are fed in sequence into a single-layer long short-term memory network with a hidden size of 128, and the final hidden state h of the LSTM unit is obtained; h is called the network structure feature. Meanwhile, the iteration number t is fed into a multilayer perceptron consisting of a fully connected layer of size (1, 64), a ReLU activation layer, a fully connected layer of size (64, 32) and a fully connected layer of size (32, 1), which outputs the contribution D_t of the iteration number to the final classification accuracy of the network.
The contribution D_t is multiplied element-wise with the network structure feature h:

h[id] = D_t × h[id], id = {1, 2, ..., len(h)}    (4)

The result is fed into a small fully connected module consisting of a fully connected layer of size (128, 128), a dropout layer with drop probability 0.5, a ReLU activation layer, a fully connected layer of size (128, 32), a ReLU activation layer and a fully connected layer of size (32, 1). The output of this module is the predicted value Ap_t of the final classification accuracy of the current network.
Before training the performance prediction network, randomly initializing network parameters, and solving the following optimization problem by using a back propagation algorithm to learn the network parameters to obtain the optimal parameters theta of the network:
Figure GDA0003744175040000041
wherein | · | purple sweet 2 Is the norm of L2.
Step five, initializing a genetic algorithm:
The parameters of the genetic algorithm are set, including the population size G_N, the number of iteration rounds G_T, the mutation probability G_M, the crossover probability G_C, the mutation parameter q_M, the crossover parameter q_C and the threshold fit_mgn. G_N structure codes C_1^0, C_2^0, ..., C_{G_N}^0 are randomly generated as the initial population Ge^0, recorded as generation 0, and the i-th individual in the population is denoted C_i^0. The score of each individual in the population is then evaluated to obtain the individual scores fit_i^0, and the current highest accuracy is recorded as fit_max.
Step six, selecting the individuals:
The selection operation is applied to each individual of the previous generation population Ge^{j-1}, j = 1, 2, ..., G_T: following the roulette-wheel rule, individuals are selected according to their scores fit_i^{j-1} to form the new generation population Ge^j. The higher an individual's score, the greater its probability of being selected and retained in the next generation.
Step seven, performing cross operation on the individuals:
The crossover operation acts on the coding segment of each stage of an individual in the population. Every two individuals in the population are crossed with probability G_C; the operation is that the code strings of the three stages of the two individuals are exchanged, each with probability q_C.
Step eight, performing mutation operation on individuals:
The mutation operation acts on each bit of the individual code: each binary bit of the code is inverted with probability q_M, i.e., changed from 0 to 1 or from 1 to 0.
Step nine, predicting the performance of the network corresponding to the individual:
The network structure codes and the number of iterations at the end of training are input into the network performance prediction model to obtain the expected score fit̂_i^j of each individual in the population, i.e., the expected classification accuracy after the network has been fully trained:

fit̂_i^j = f(μ(C_i^j), T)    (6)
Step ten, evaluating the individual:
The expected score fit̂_i^j is compared with the current best score fit_max. If the expected score is high enough relative to fit_max (as judged with the threshold fit_mgn), the algorithm fully trains the network, tests it on the test set, and takes the actual performance on the test set as the actual score fit_i^j of the individual. Otherwise, the network is not actually trained, and the lower expected performance is taken directly as the individual's score fit_i^j = fit̂_i^j. After the evaluation, the current best individual score fit_max is updated and the procedure returns to step six, until the total number of iterations exceeds G_T. The optimal network structure is obtained after the algorithm terminates.
The invention has the following beneficial effects. The method randomly generates neural networks with different structures, trains them completely, and uses the information collected during their training to train a network performance prediction model. In the network structure search stage, the deep network structure is encoded to form a network structure code, and network structure codes are randomly generated as the initial generation of a genetic algorithm; selection, crossover, mutation and prediction are then applied to the individuals of each generation, and only the networks corresponding to individuals with higher expected performance are actually trained; finally, all individuals are evaluated and the next round of selection begins. After the algorithm finishes, the individual with the best fitness is selected as the optimal network structure for the specific task. By predicting network performance before a network is actually trained, the time the search algorithm spends training poorly performing networks is reduced, which greatly accelerates the search process.
Because a network performance prediction model is introduced into the genetic-algorithm-based deep neural network structure optimization method, the algorithm can predict network performance before a network is actually trained and cancel the actual training of networks whose expected performance is poor, which greatly reduces the time consumed by the structure optimization algorithm. Compared with the genetic-algorithm-based network structure search algorithm of the background art, the method improves the search speed by about 55% while the performance of the searched networks remains similar.
The present invention will be described in detail with reference to the following embodiments.
Detailed Description
The deep neural network structure optimization method based on the fusion of the prediction mechanism and the genetic algorithm specifically comprises the following steps:
1. and (4) preprocessing data.
Defining an image classification database X ═ { X ═ X 1 ,x 2 ...x n } T ∈R n×b The class label vector is Y ═ Y 1 ,y 2 ...y n } T ∈R n×l Wherein x is n ∈R 1×b Represents the nth sample data, y n ∈R 1×l Is a one-hot label of the nth sample data, where N ═ 1,2.. N }, N is the total number of samples, l represents the total number of classes of the samples, and b represents the spectral dimension; normalizing each sample in the hyperspectral image data X to be in the range of 0-1, and randomly selecting N from the samples train Obtaining training data X by individual sample data and class labels thereof train And its corresponding category label Y train Wherein N is train < N. In addition, the rest data and labels in the data set are all classified into a test set, and the data and labels are respectively marked as X test And Y test
2. Determining the deep network structure coding rule.
To optimize the deep network structure, its topology must be represented by a code. During coding the network is divided into several stages; the parameters of the convolution operations within a stage (number of channels, convolution kernel size, etc.) remain unchanged, and different stages are connected through pooling operations. Each stage of the deep network contains several ordered, numbered nodes, and each node represents a composite operation of convolution, batch normalization and ReLU activation. Nodes with smaller numbers in a stage may connect to nodes with larger numbers, and the connection pattern between nodes describes how data flows through the network within that stage.
M different network structures are generated during the network structure optimization process, and the structure of the m-th (m = {1, 2, ..., M}) neural network is coded as C_m. The code consists of S stages, i.e.

C_m = {c_m^1, c_m^2, ..., c_m^S}

where c_m^s is the coding segment of the s-th (s = {1, 2, ..., S}) stage. The s-th stage contains K_s nodes, denoted v_{s,1}, v_{s,2}, ..., v_{s,K_s}, so this stage requires a K_s(K_s - 1)/2-bit binary code (hereinafter one bit of binary code is referred to as a bit) to represent the connections between its nodes. The 1st bit represents the connection (v_{s,1}, v_{s,2}): the bit is 1 if the connection exists and 0 otherwise; the next two bits represent the connections (v_{s,1}, v_{s,3}) and (v_{s,2}, v_{s,3}) between the three nodes, and so on. In the experiments S = 3, K_1 = 3, K_2 = 4 and K_3 = 5, so the total length of the network structure code is 19 bits, that is:

len(C_m) = Σ_{s=1}^{S} K_s(K_s - 1)/2 = 3 + 6 + 10 = 19    (1)

where len() represents the length of the code (i.e., the number of bits in the binary code).
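As an illustrative, non-limiting sketch of this encoding (the helper names are chosen here for clarity and do not appear in the patent text), the following Python snippet builds a random 19-bit structure code for S = 3 stages with K_1 = 3, K_2 = 4, K_3 = 5 and decodes one stage segment back into its node connections:

```python
import random

K = [3, 4, 5]  # number of nodes per stage: K_1, K_2, K_3

def stage_code_length(k):
    # one bit for every ordered pair (v_i, v_j) with i < j
    return k * (k - 1) // 2

def random_structure_code(K):
    # concatenation of the per-stage segments; total length is 3 + 6 + 10 = 19 bits
    return [random.randint(0, 1) for _ in range(sum(stage_code_length(k) for k in K))]

def decode_stage(segment, k):
    # map one stage segment back to its connections (v_i, v_j); bit order follows
    # the description: (v1,v2), (v1,v3), (v2,v3), (v1,v4), ...
    connections, idx = [], 0
    for j in range(2, k + 1):
        for i in range(1, j):
            if segment[idx]:
                connections.append((i, j))
            idx += 1
    return connections

C = random_structure_code(K)
print(len(C))                      # 19
print(decode_stage(C[:3], K[0]))   # connections within stage 1
```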
3. Collecting training data for the network performance prediction model.
m mutually different structure codes C_1, C_2, ..., C_m are randomly generated. After the codes are generated, they are automatically compiled into computation graphs, and the deep networks corresponding to these graphs are fully trained on the specified data set. The network parameters are learned with an Adam optimizer whose parameters are set to a learning rate α = 0.001 and exponential decay factors β_1 = 0.9, β_2 = 0.999. The training process is iterated T times. During training, every time the network is trained on one batch, the number of iterations t the current network has experienced and its classification accuracy Ag_t on the validation set are recorded, giving the data required for training the prediction model: Data = [C_m, t, Ag_t], t = {1, 2, ..., T}.
4. Constructing and training the network performance prediction model.
Denote the network performance prediction model as f. The model first applies a mapping μ to a structure code C_m, and from the mapping result μ(C_m) it predicts the accuracy Ap_t of the corresponding neural network on the test set after t iterations of training, i.e.:

Ap_t = f(μ(C_m), t)    (2)
the specific structure of the prediction model is as follows:
(a) structure code mapping
In the mapping phase, the model maps a single structure code C into a network structure code group consisting of S structure codes {p_1, p_2, ..., p_S}. Denoting the mapping process as μ, the mapping from a structure code to a structure code group can be expressed as:

μ(C) = {p_1, p_2, ..., p_S}

where the bits of p_s from bit Σ_{i=1}^{s-1} K_i(K_i - 1)/2 + 1 to bit Σ_{i=1}^{s} K_i(K_i - 1)/2 (the positions belonging to the s-th stage segment) take the values of the corresponding positions of the original structure code, and all remaining positions are filled with zeros. Denoting the values at the idx-th bit of the structure codes p_s and C as p_s[idx] and C[idx], the mapping can be expressed as:

p_s[idx] = C[idx] if the idx-th bit belongs to the s-th stage segment, and p_s[idx] = 0 otherwise    (3)
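A minimal Python sketch of this mapping, reusing the K values assumed in the encoding sketch above (illustrative only):

```python
def map_structure_code(C, K):
    # mu(C): produce S zero-padded codes p_1..p_S, each of the same length as C,
    # keeping only the bits that belong to its own stage segment
    groups, start = [], 0
    for k in K:
        seg_len = k * (k - 1) // 2
        p = [0] * len(C)
        p[start:start + seg_len] = C[start:start + seg_len]
        groups.append(p)
        start += seg_len
    return groups  # [p_1, p_2, ..., p_S]
```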
(b) Network performance prediction model f:
After the structure code has been mapped into the structure code group {p_1, p_2, ..., p_S}, the codes p_1, p_2, ..., p_S are fed in sequence into a single-layer long short-term memory network (LSTM) with a hidden size of 128, finally yielding a one-dimensional array h of length 128, which is called the network structure feature of the predicted network.
While the network structure feature h is being obtained, the iteration number t is fed into a multilayer perceptron consisting of a fully connected layer of size (1, 64), a ReLU activation layer, a fully connected layer of size (64, 32) and a fully connected layer of size (32, 1). The multilayer perceptron outputs a scalar, which gives the contribution D_t of the iteration number to the final classification accuracy of the network.
Then the contribution degree D t Element-by-element multiplication is performed with the structural feature h of the network, and the operation can be expressed as:
h[id]=D t ×h[id],id={1,2,...,len(h)} (4)
and passing the operation result through a small-sized full-connection module. The full-link module is composed of a full-link module with the size of (128 ), a random deactivation layer with the deactivation probability of 0.5, a ReLU activation function layer, a full-link layer with the size of (128,32), a ReLU activation function layer and a full-link layer with the size of (32,1) which are sequentially connected. All-purposeThe output result of the connection module is the predicted value Ap of the final classification accuracy of the current network t
Before using a network performance prediction model to guide the network optimization process, random initialization needs to be performed on network parameters, and a back propagation algorithm is used to solve the following optimization problem for network training, so as to obtain the optimal parameter theta of the network:
Figure GDA0003744175040000091
wherein r is the number of samples contained in a single training batch, | · | | computationally 2 Is the norm of L2.
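A corresponding sketch of one training step, assuming the collected Data = [C_m, t, Ag_t] records have already been converted into tensors (the tensor shapes and the use of mean squared error over a batch are assumptions):

```python
model = PerformancePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # squared L2 distance between Ap_t and Ag_t, averaged over the batch

def train_step(code_groups, iters, accs):
    # code_groups: (r, S, 19); iters: (r, 1); accs: (r, 1) for one batch of r samples
    optimizer.zero_grad()
    loss = loss_fn(model(code_groups, iters), accs)
    loss.backward()
    optimizer.step()
    return loss.item()
```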
5. Initializing the genetic algorithm.
First, the parameters of the genetic algorithm are determined, namely the population size G_N, the number of iteration rounds G_T, the mutation probability G_M, the crossover probability G_C, the mutation parameter q_M, the crossover parameter q_C and the threshold fit_mgn. G_N structure codes C_1^0, C_2^0, ..., C_{G_N}^0 are randomly generated as the generation-0 initial population Ge^0, and the i-th individual in the population (i.e., the i-th structure code) is denoted C_i^0. The deep network corresponding to each individual in the population is then fully trained, and after testing on the test set the classification accuracy of the network is taken as the score fit_i^0 of the individual; the current highest accuracy is recorded as fit_max.
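Continuing the sketches above, the initialization might look as follows; the parameter values are purely illustrative, and fully_train_and_score is a hypothetical stand-in for compiling, fully training and testing the network of an individual:

```python
G_N, G_T = 20, 50        # population size and number of iteration rounds
G_C, G_M = 0.8, 0.2      # crossover and mutation probabilities
q_C, q_M = 0.3, 0.05     # crossover and mutation parameters
fit_mgn = 0.02           # margin used when judging predicted scores

def fully_train_and_score(code):
    # hypothetical stand-in: compile `code` into a network, train it completely,
    # and return its classification accuracy on the test set
    return random.random()  # dummy value for the sketch

population = [random_structure_code(K) for _ in range(G_N)]   # generation 0
scores = [fully_train_and_score(ind) for ind in population]   # fit_i^0
fit_max = max(scores)                                          # current best score
```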
6. Performing the selection operation on individuals.
The selection operation O_s is then applied to the individuals in the population. From the (j-1)-th generation population Ge^{j-1}, j = 1, 2, ..., G_T, the j-th generation population Ge^j is selected according to the roulette-wheel rule; the selection is based on the score fit_i^{j-1} of each individual in the current population. With roulette-wheel selection, individuals with higher scores have a greater probability of remaining in the next generation, and the process is iterated.
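A sketch of roulette-wheel selection over the current scores (this is the standard fitness-proportionate form; the exact sampling details are not specified in the text):

```python
import random

def roulette_select(population, scores):
    # fitness-proportionate (roulette-wheel) selection: the probability of an
    # individual being copied into the next generation is proportional to its score
    chosen = random.choices(population, weights=scores, k=len(population))
    return [list(ind) for ind in chosen]
```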
7. Performing the crossover operation on individuals.
Individuals in the population are crossed with probability G_C using the crossover parameter q_C. The crossover acts on the code string segment of each stage in an individual: every two individuals in the population are crossed with probability G_C, and the specific operation is that the code strings of the three stages of the two individuals are exchanged, each with probability q_C.
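A sketch of the stage-wise crossover; pairing adjacent individuals is an assumption, and the stage boundaries are computed from the K values assumed above:

```python
import random

def crossover(pop, K, G_C, q_C):
    # pair up adjacent individuals; with probability G_C a pair is crossed, and each
    # of the three stage segments is then exchanged independently with probability q_C
    bounds, start = [], 0
    for k in K:
        seg = k * (k - 1) // 2
        bounds.append((start, start + seg))
        start += seg
    for a, b in zip(pop[0::2], pop[1::2]):
        if random.random() < G_C:
            for lo, hi in bounds:
                if random.random() < q_C:
                    a[lo:hi], b[lo:hi] = b[lo:hi], a[lo:hi]
    return pop
```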
8. Performing the mutation operation on individuals.
Individuals that did not undergo crossover mutate with probability G_M: each binary bit of the individual's code string is inverted with probability q_M, i.e., changed from 0 to 1 or from 1 to 0. The mutation process acts on single binary bits.
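And a corresponding sketch of the bit-flip mutation, applied here to every individual with probability G_M rather than only to individuals that did not cross, which is a simplification:

```python
import random

def mutate(pop, G_M, q_M):
    # with probability G_M an individual mutates; each bit of a mutating individual
    # is then flipped (0 <-> 1) independently with probability q_M
    for individual in pop:
        if random.random() < G_M:
            for i in range(len(individual)):
                if random.random() < q_M:
                    individual[i] = 1 - individual[i]
    return pop
```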
9. Predicting the performance of the network corresponding to an individual.
The network structure codes and the number of iterations at the end of training are input into the network performance prediction model to obtain the expected score fit̂_i^j of each individual in the population, i.e., the expected classification accuracy after the network has been fully trained:

fit̂_i^j = f(μ(C_i^j), T)    (6)
10. Performing the evaluation operation on individuals.
After the expected score of an individual has been obtained in step 9, the expected score fit̂_i^j is compared with the current best score fit_max. If the expected score is high enough relative to fit_max (as judged with the threshold fit_mgn), the expected performance of the individual is good: the algorithm fully trains the individual, tests it on the test set, and takes the actual performance on the test set as the individual's actual score. If the expected score falls below this threshold, the expected performance of the individual is poor: such an individual is not actually trained, and the lower expected performance is taken directly as the individual's score fit_i^j. After the evaluation, the current best individual score fit_max is updated and the procedure returns to step 6, until the total number of iterations of the algorithm exceeds G_T. After the algorithm terminates, the optimal network structure is given.
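Putting steps 6 to 10 together, one generation of the prediction-gated loop might look as follows, continuing the sketches above; the acceptance test expected >= fit_max - fit_mgn is one plausible reading of how the threshold fit_mgn is used, and predict_score wraps the trained performance predictor:

```python
model.eval()  # disable dropout when the predictor is used for inference
T = 100       # illustrative value for the total number of training iterations

def predict_score(code):
    # stand-in for equation (6): feed mu(C) and the final iteration count T
    # into the trained performance predictor
    group = torch.tensor([map_structure_code(code, K)], dtype=torch.float32)
    with torch.no_grad():
        return model(group, torch.tensor([[float(T)]])).item()

for generation in range(1, G_T + 1):
    population = roulette_select(population, scores)           # step 6
    population = crossover(population, K, G_C, q_C)            # step 7
    population = mutate(population, G_M, q_M)                  # step 8
    scores = []
    for individual in population:
        expected = predict_score(individual)                   # step 9: expected score
        if expected >= fit_max - fit_mgn:                      # step 10: promising individual
            scores.append(fully_train_and_score(individual))   # train and test for the actual score
        else:
            scores.append(expected)                            # keep the lower expected score
    fit_max = max(fit_max, max(scores))                        # update the current best score

best = population[scores.index(max(scores))]  # optimal structure code after G_T rounds
```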
The method has a good acceleration effect on the optimization of various image classification network structures. Taking the optimization of a classification network structure on the Pavia University data set as an example, the traditional genetic-algorithm-based network structure optimization method needs 0.99 hours to give an optimal deep network structure with a classification accuracy of 89.1%, whereas the proposed method gives an optimal deep network structure with a classification accuracy of 88.6% in only 0.635 hours. Therefore, the deep neural network structure optimization method based on the fusion of a prediction mechanism and a genetic algorithm greatly accelerates the structure optimization process, while the classification accuracy of the finally searched optimal network structure on the specified data set is almost the same as that obtained by the traditional genetic-algorithm-based method.

Claims (1)

1. A deep neural network structure optimization method based on the fusion of a prediction mechanism and a genetic algorithm is characterized by comprising the following steps:
step one, data preprocessing:
first, define the image classification database X = {x_1, x_2, ..., x_n}^T ∈ R^{n×b}, where x_n ∈ R^{1×b} represents the n-th sample; the class label matrix is Y = {y_1, y_2, ..., y_n}^T ∈ R^{n×l}, where y_n ∈ R^{1×l} is the one-hot label of the n-th sample, n = {1, 2, ..., N}, N is the total number of samples, l is the total number of classes and b is the spectral dimension; each sample in the image classification database X is then normalized to the range 0-1, and N_train samples and their class labels are randomly selected from it to form the training data X_train and the corresponding class labels Y_train, where N_train < N; the remaining data and labels in the data set form the test set, denoted X_test and Y_test respectively;
Step two, determining a coding rule of a network structure:
firstly, M different network structures are generated, where the structure code of the m-th neural network is C_m; the code consists of S stages, i.e.

C_m = {c_m^1, c_m^2, ..., c_m^S}

where c_m^s is the coding segment of the s-th stage; the s-th stage contains K_s ordered nodes, each node representing a composite operation of convolution, batch normalization and ReLU activation, denoted v_{s,1}, v_{s,2}, ..., v_{s,K_s}; nodes with smaller numbers in a stage may connect to nodes with larger numbers, and the connections between the nodes are represented with a binary code of K_s(K_s - 1)/2 bits; the 1st bit of the binary code represents the connection (v_{s,1}, v_{s,2}): the bit is 1 if the connection exists and 0 otherwise; the next two bits represent the connections (v_{s,1}, v_{s,3}) and (v_{s,2}, v_{s,3}) between the three nodes; setting S = 3, K_1 = 3, K_2 = 4 and K_3 = 5, the network structure code length is 19 bits, i.e.

len(C_m) = Σ_{s=1}^{S} K_s(K_s - 1)/2 = 3 + 6 + 10 = 19    (1)

where len() represents the length of the structure code in the parentheses;
step three, collecting training data of the network performance prediction model:
m mutually different structure codes C_1, C_2, ..., C_m are randomly generated, and after automatic compilation the deep network corresponding to each code is fully trained on the specified data set; an Adam optimizer is used to learn the network parameters, and training is iterated T times in total; each time the network is trained on one batch, the current number of iterations t and the classification accuracy Ag_t on the validation set are recorded as the data required for training the prediction model: Data = [C_m, t, Ag_t], t = {1, 2, ..., T};
Step four, constructing and training a network performance prediction model:
a network performance prediction model f is defined; the model applies a mapping μ to a structure code C and predicts the accuracy Ap_t of the corresponding neural network on the test set after t iterations of training, i.e.:

Ap_t = f(μ(C_m), t)    (2)

in the mapping phase, the model maps the structure code C into a network structure code group consisting of S structure codes {p_1, p_2, ..., p_S}, where the bits of p_s from bit Σ_{i=1}^{s-1} K_i(K_i - 1)/2 + 1 to bit Σ_{i=1}^{s} K_i(K_i - 1)/2 (the positions belonging to the s-th stage segment) take the values of the corresponding positions of the original structure code, and the remaining positions are filled with zeros, i.e.:

p_s[idx] = C[idx] if the idx-th bit belongs to the s-th stage segment, and p_s[idx] = 0 otherwise    (3)

where p_s[idx] and C[idx] are the values at the idx-th bit of the structure codes p_s and C;
after the structure code is mapped, p is mapped 1 ,p 2 ...p s Sequentially inputting a single-layer long and short term memory network with a hidden layer size of 128 and finally obtaining a hidden state h of a long and short term memory network unit, wherein the hidden state h is called a network structure characteristic; meanwhile, inputting the iteration times t into a multilayer perceptron consisting of a full-link layer with the size of (1,64), a ReLU activation function layer, a full-link layer with the size of (64,32) and a full-link layer with the size of (32,1), and obtaining the contribution D of the iteration times to the final classification accuracy of the network t
Degree of contribution D t Element-by-element multiplication is carried out with the structural feature h of the network:
h[id]=D t ×h[id],id={1,2,...,len(h)} (4)
inputting the calculation result into a small-sized full-connection module; it contains a full-link layer of size (128 ), a random deactivation layer with deactivation probability of 0.5, a ReLU activation function layer, a full-link layer of size (128,32), a ReLU activation function layer and a full-link layer of size (32, 1); the output result of the full connection module is the predicted value Ap of the final classification accuracy of the current network t
Before training the performance prediction network, randomly initializing network parameters, and solving the following optimization problem by using a back propagation algorithm to learn the network parameters to obtain the optimal parameters theta of the network:
Figure FDA0003744175030000025
wherein | · | purple sweet 2 Is the norm of L2;
step five, initializing a genetic algorithm:
the parameters of the genetic algorithm are set, including the population size G_N, the number of iteration rounds G_T, the mutation probability G_M, the crossover probability G_C, the mutation parameter q_M, the crossover parameter q_C and the threshold fit_mgn; G_N structure codes C_1^0, C_2^0, ..., C_{G_N}^0 are randomly generated as the initial population Ge^0, recorded as generation 0, and the i-th individual in the population is denoted C_i^0; the score of each individual in the population is then evaluated to obtain the individual scores fit_i^0, and the current highest accuracy is recorded as fit_max;
Step six, selecting the individuals:
the selection operation is applied to each individual of the previous generation population Ge^{j-1}, j = 1, 2, ..., G_T: following the roulette-wheel rule, individuals are selected according to their scores fit_i^{j-1} to form the new generation population Ge^j; the higher an individual's score, the greater its probability of being selected and retained in the next generation;
step seven, performing cross operation on the individuals:
the crossover operation acts on the coding segment of each stage of the individuals in the population; every two individuals in the population are crossed with probability G_C, the operation being that the code strings of the three stages of the two individuals are exchanged, each with probability q_C;
step eight, performing mutation operation on individuals:
the mutation operation acts on each bit of the individual code: each binary bit of the code is inverted with probability q_M, i.e., changed from 0 to 1 or from 1 to 0;
step nine, predicting the performance of the network corresponding to the individual:
the network structure codes and the number of iterations at the end of training are input into the network performance prediction model to obtain the expected score fit̂_i^j of each individual in the population, i.e., the expected classification accuracy after the network has been fully trained:

fit̂_i^j = f(μ(C_i^j), T)    (6)
step ten, evaluating the individual:
the expected score fit̂_i^j is compared with the current best score fit_max; if the expected score is high enough relative to fit_max (as judged with the threshold fit_mgn), the algorithm fully trains the network, tests it on the test set, and takes the actual performance on the test set as the actual score fit_i^j of the individual; otherwise, the network is not actually trained, and the lower expected performance is taken directly as the individual's score fit_i^j = fit̂_i^j; after the evaluation, the current best individual score fit_max is updated and the procedure returns to step six, until the total number of iterations exceeds G_T; the optimal network structure is obtained after the algorithm terminates.
CN201910696239.XA 2019-07-30 2019-07-30 Deep neural network structure optimization method based on fusion of prediction mechanism and genetic algorithm Active CN110490320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910696239.XA CN110490320B (en) 2019-07-30 2019-07-30 Deep neural network structure optimization method based on fusion of prediction mechanism and genetic algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910696239.XA CN110490320B (en) 2019-07-30 2019-07-30 Deep neural network structure optimization method based on fusion of prediction mechanism and genetic algorithm

Publications (2)

Publication Number Publication Date
CN110490320A CN110490320A (en) 2019-11-22
CN110490320B true CN110490320B (en) 2022-08-23

Family

ID=68548791

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910696239.XA Active CN110490320B (en) 2019-07-30 2019-07-30 Deep neural network structure optimization method based on fusion of prediction mechanism and genetic algorithm

Country Status (1)

Country Link
CN (1) CN110490320B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111415009B (en) * 2020-03-19 2021-02-09 四川大学 Convolutional variational self-encoder network structure searching method based on genetic algorithm
CN112084877B (en) * 2020-08-13 2023-08-18 西安理工大学 NSGA-NET-based remote sensing image recognition method
CN112001485B (en) * 2020-08-24 2024-04-09 平安科技(深圳)有限公司 Group convolution number searching method and device
CN112183749B (en) * 2020-10-26 2023-04-18 天津大学 Deep learning library test method based on directed model variation
CN114842328B (en) * 2022-03-22 2024-03-22 西北工业大学 Hyperspectral change detection method based on collaborative analysis autonomous perception network structure
CN114943866B (en) * 2022-06-17 2024-04-02 之江实验室 Image classification method based on evolutionary neural network structure search
CN115994575B (en) * 2023-03-22 2023-06-02 方心科技股份有限公司 Power failure diagnosis neural network architecture design method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915445A (en) * 2012-09-17 2013-02-06 杭州电子科技大学 Method for classifying hyperspectral remote sensing images of improved neural network
CN103971162A (en) * 2014-04-04 2014-08-06 华南理工大学 Method for improving BP (back propagation) neutral network and based on genetic algorithm
CN105303252A (en) * 2015-10-12 2016-02-03 国家计算机网络与信息安全管理中心 Multi-stage nerve network model training method based on genetic algorithm
CN106503802A (en) * 2016-10-20 2017-03-15 上海电机学院 A kind of method of utilization genetic algorithm optimization BP neural network system
US9785886B1 (en) * 2017-04-17 2017-10-10 SparkCognition, Inc. Cooperative execution of a genetic algorithm with an efficient training algorithm for data-driven model creation
CN108021983A (en) * 2016-10-28 2018-05-11 谷歌有限责任公司 Neural framework search
CN108229657A (en) * 2017-12-25 2018-06-29 杭州健培科技有限公司 A kind of deep neural network training and optimization algorithm based on evolution algorithmic
CN109243172A (en) * 2018-07-25 2019-01-18 华南理工大学 Traffic flow forecasting method based on genetic algorithm optimization LSTM neural network
CN110020667A (en) * 2019-02-21 2019-07-16 广州视源电子科技股份有限公司 Searching method, system, storage medium and the equipment of neural network structure

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915445A (en) * 2012-09-17 2013-02-06 杭州电子科技大学 Method for classifying hyperspectral remote sensing images of improved neural network
CN103971162A (en) * 2014-04-04 2014-08-06 华南理工大学 Method for improving BP (back propagation) neutral network and based on genetic algorithm
CN105303252A (en) * 2015-10-12 2016-02-03 国家计算机网络与信息安全管理中心 Multi-stage nerve network model training method based on genetic algorithm
CN106503802A (en) * 2016-10-20 2017-03-15 上海电机学院 A kind of method of utilization genetic algorithm optimization BP neural network system
CN108021983A (en) * 2016-10-28 2018-05-11 谷歌有限责任公司 Neural framework search
US9785886B1 (en) * 2017-04-17 2017-10-10 SparkCognition, Inc. Cooperative execution of a genetic algorithm with an efficient training algorithm for data-driven model creation
CN108229657A (en) * 2017-12-25 2018-06-29 杭州健培科技有限公司 A kind of deep neural network training and optimization algorithm based on evolution algorithmic
CN109243172A (en) * 2018-07-25 2019-01-18 华南理工大学 Traffic flow forecasting method based on genetic algorithm optimization LSTM neural network
CN110020667A (en) * 2019-02-21 2019-07-16 广州视源电子科技股份有限公司 Searching method, system, storage medium and the equipment of neural network structure

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
ACCELERATING NEURAL ARCHITECTURE SEARCH USING PERFORMANCE PREDICTION; Bowen Baker et al.; ICLR 2018; 20181231; 1-19 *
Genetic CNN; Lingxi Xie et al.; 2017 IEEE International Conference on Computer Vision; 20171231; 1388-1397 *
Hyperspectral Image Classification Based on Convolutional Neural Networks With Adaptive Network Structure; Chen Ding et al.; 2018 International Conference on Orange Technologies; 20190506; 1-5 *
NSGA-Net: Neural Architecture Search using Multi-Objective Genetic Algorithm; Zhichao Lu et al.; arXiv; 20190418; 1-13 *
Identifying connections in biological neural networks with a dynamic Bayesian network structure search method (in Chinese); Chen Xiaoyan et al.; Life Science Research; 20171231; vol. 21, no. 6; 527-533 *
A variable-structure convolutional neural network method for extracting features from remote sensing images (in Chinese); Wang Huabin et al.; Acta Geodaetica et Cartographica Sinica; 20190531; vol. 48, no. 5; 583-596 *

Also Published As

Publication number Publication date
CN110490320A (en) 2019-11-22

Similar Documents

Publication Publication Date Title
CN110490320B (en) Deep neural network structure optimization method based on fusion of prediction mechanism and genetic algorithm
CN108984724B (en) Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation
WO2022083624A1 (en) Model acquisition method, and device
WO2023024412A1 (en) Visual question answering method and apparatus based on deep learning model, and medium and device
US11087086B2 (en) Named-entity recognition through sequence of classification using a deep learning neural network
CN109753571B (en) Scene map low-dimensional space embedding method based on secondary theme space projection
CN111898689A (en) Image classification method based on neural network architecture search
CN112465120A (en) Fast attention neural network architecture searching method based on evolution method
CN111882042B (en) Neural network architecture automatic search method, system and medium for liquid state machine
Tirumala Evolving deep neural networks using coevolutionary algorithms with multi-population strategy
CN114528835A (en) Semi-supervised specialized term extraction method, medium and equipment based on interval discrimination
CN114625882B (en) Network construction method for improving unique diversity of image text description
CN113239897A (en) Human body action evaluation method based on space-time feature combination regression
CN112084877A (en) NSGA-NET-based remote sensing image identification method
Jastrzebska et al. Fuzzy cognitive map-driven comprehensive time-series classification
CN112651499A (en) Structural model pruning method based on ant colony optimization algorithm and interlayer information
CN111461229A (en) Deep neural network optimization and image classification method based on target transfer and line search
CN116167353A (en) Text semantic similarity measurement method based on twin long-term memory network
CN115422945A (en) Rumor detection method and system integrating emotion mining
CN116208399A (en) Network malicious behavior detection method and device based on metagraph
CN114863508A (en) Expression recognition model generation method, medium and device of adaptive attention mechanism
CN115063374A (en) Model training method, face image quality scoring method, electronic device and storage medium
CN112416358B (en) Intelligent contract code defect detection method based on structured word embedded network
CN111259860A (en) Multi-order characteristic dynamic fusion sign language translation method based on data self-driving
Qu et al. Two-stage coevolution method for deep CNN: A case study in smart manufacturing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant