CN113344174A - Efficient neural network structure searching method based on probability distribution - Google Patents

Efficient neural network structure searching method based on probability distribution

Info

Publication number
CN113344174A
CN113344174A CN202110421335.0A
Authority
CN
China
Prior art keywords
probability
training
neural network
network
operations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110421335.0A
Other languages
Chinese (zh)
Inventor
王涛
周达
刘星宇
徐航
王易
李明光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202110421335.0A priority Critical patent/CN113344174A/en
Publication of CN113344174A publication Critical patent/CN113344174A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to an efficient neural network structure searching method based on probability distribution. Neural network structures obtained by neural architecture search are currently highly competitive in a wide range of computer vision and language tasks. Improving the efficiency of the search strategy and reducing the cost of evaluating candidate architectures remain the key to finding better network structures in less time. The invention provides a probability-distribution algorithm that greatly reduces the number of sub-networks that must be trained, thereby accelerating the neural architecture search, and uses a parameter-sharing mode that trains while searching, which lowers the cost of evaluating sub-networks and ensures that better operations receive more training, further speeding up the search. On CIFAR-10, the method searches out an optimal neural network structure in only 2 GPU hours on a GTX 1080 Ti, achieving a 2.69% test error with only 2.8M network parameters. On the ImageNet dataset, the network achieves 76% top-1 accuracy.

Description

Efficient neural network structure searching method based on probability distribution
Technical Field
The invention relates to a method for designing deep neural network structures in the field of artificial intelligence, and in particular to an efficient neural network structure searching method.
Background
Automatic search for neural networks within a given architecture space has attracted considerable attention over the past few years. To this end, many excellent search algorithms and evaluation strategies have been proposed to find the best neural architecture; this task is known as Neural Architecture Search (NAS). In general, a NAS framework is divided into three parts: the search space, the search strategy, and the evaluation strategy, as shown in Fig. 1.
The search space defines the variables of the optimization problem. The variables describing the neural network structure differ from those describing the hyper-parameters, and different variable scales pose different levels of difficulty for the algorithm. Once a set of architecture parameters and corresponding hyper-parameters is found, the performance of the deep learning model is in fact controlled and determined by that set of parameters, so only the architecture parameters and hyper-parameters of the model need to be optimized. In the early days of NAS, the commonly used network architecture was a chain structure, as shown in Fig. 2.
Such a structure is equivalent to a sequence of N layers; each layer has several optional operators, such as convolution or pooling, and each operator has its own hyper-parameters, such as kernel size and stride.
Some recent work, inspired by manually designed network architectures, has studied networks with multiple branches, as in Fig. 3.
Many deep networks have similar structures: although they are deep, they contain many repeated cells. Once the cell is abstracted, the complex structure becomes simple, which reduces the number of variables to optimize; on the other hand, the same cells can be transferred between different tasks, as shown in Fig. 4.
Because the neural architecture search problem is high-dimensional and mixes continuous and discrete variables, reducing the dimensionality of the search space can greatly improve the results; in this way, Zoph's 2018 work was roughly 7 times faster than the 2017 work.
The search strategy defines which algorithm is used to find the optimal network structure and parameter configuration quickly and accurately. Common search strategies include reinforcement learning, evolutionary algorithms, random search, Bayesian optimization, and gradient-based algorithms. Work such as NAS uses reinforcement learning as a meta-controller: based on the performance of the sampled networks, a recurrent neural network (RNN) controller is trained iteratively to sample, token by token, strings that encode specific neural architectures, producing new sub-networks. The general framework of evolutionary algorithms is broadly similar: a population (N sets of solutions) is randomly generated, and the algorithm then cycles through selection, crossover, and mutation until a termination condition is met. Evolutionary algorithms are gradient-free optimization methods; their advantage is that a globally optimal solution can be reached, and their drawback is relatively low efficiency. DARTS relaxes the discrete search space into a continuous one so that it can be optimized efficiently with gradient methods, turning the problem of searching for a network architecture into one of optimizing continuous variables. After the search is complete, the most likely operation is selected and the other operations are discarded. DARTS therefore amounts to solving a bi-level optimization problem, in which the mixed operations must be optimized jointly with the network weights. However, this causes high GPU memory consumption during the search.
The evaluation strategy plays a role similar to surrogate models in engineering optimization. Because the performance of a deep learning model depends heavily on the scale of the training data, training each candidate on large-scale data is very time-consuming, so evaluating the search results requires approximate means: for example, training the model on a low-fidelity training set, or treating all architectures as subgraphs of a hypergraph so that the subgraphs share weights directly through the edges of the hypergraph. MDENAS makes the interesting proposal of an accuracy-ranking hypothesis, which assumes that the accuracy ranking of sub-networks is consistent across training epochs, so that the performance of a sub-network can be estimated after training it for only a few epochs, accelerating the convergence of the network structure search. However, this hypothesis has been verified to hold with only about 70% accuracy, and a neural network structure that is excellent at the beginning of training does not necessarily perform best when training converges. The evaluation process is accelerated, but the final network search result is also affected.
Disclosure of Invention
The invention provides a novel probability-distribution algorithm that greatly reduces the number of sub-networks that must be trained, thereby accelerating the neural architecture search, and uses a parameter-sharing mode that trains while searching, which lowers the cost of evaluating sub-networks and ensures that better operations receive more training, further accelerating the neural architecture search.
In a first aspect, the present invention provides a probability-distribution algorithm for use in the neural network structure search strategy. The algorithm comprises initialization and sampling. As shown in Fig. 5, structural diversity is obtained by selecting one of the M candidate operations on the edge between every two nodes (in this work, M = 8). The operations on the edges in Fig. 5 are initially unknown: a probability is first initialized for each operation, the probabilities are updated iteratively, and the operation that performs best is finally selected. Therefore, at the start of the search, the probability parameter of every operation in the search space is initialized to 1/M, i.e. the probabilities of the M operations between any two nodes sum to 1. In the sampling phase, the operation used between every two nodes in the current round is selected according to the operation probabilities; the higher the probability value, the more likely the operation is to be selected. The result of this selection is the network sampled in the current round. Compared with previous NAS sampling methods, only one operation has to be selected between each pair of nodes during the search, which effectively reduces GPU memory consumption.

P_m = 1/M, (1 ≤ m ≤ M) (1)

Σ_{m=1}^{M} P_m = 1 (2)

cell = {o^(i,j) | 0 ≤ i ≤ N, i < j ≤ N} (3)
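For illustration, the initialization and sampling step can be sketched as follows (a minimal sketch assuming M = 8 candidate operations, N = 4 intermediate nodes and 2 input nodes per cell; the names init_probabilities and sample_cell are illustrative and not taken from the original disclosure):

    import random

    M = 8           # candidate operations per edge
    N = 4           # intermediate nodes per cell
    NUM_INPUTS = 2  # input nodes per cell

    def cell_edges():
        # every intermediate node j receives one edge from each earlier node i
        return [(i, j) for j in range(NUM_INPUTS, NUM_INPUTS + N)
                       for i in range(j)]

    def init_probabilities(edges):
        # formulas (1)-(2): each of the M operations on an edge starts at 1/M
        return {e: [1.0 / M] * M for e in edges}

    def sample_cell(probs):
        # pick exactly one operation per edge according to the current probabilities
        return {e: random.choices(range(M), weights=p, k=1)[0]
                for e, p in probs.items()}

    edges = cell_edges()
    probs = init_probabilities(edges)
    arch = sample_cell(probs)   # the network (cell) sampled in the current round
    print(len(edges), arch)     # 14 edges for N = 4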
In a second aspect, the present invention provides the probability updating method of the probability-distribution algorithm used in the neural network structure search strategy. Previous NAS methods are time-consuming and memory-consuming, mainly because a large number of networks are sampled during the search, network performance evaluation is slow, and every sampled network has to be trained to convergence. Therefore, for the network performance evaluation strategy, a forced parameter-sharing mode is also adopted: after a network is sampled and its cells are stacked into a complete CNN, the shared parameters are assigned directly, and performance is then evaluated on a data set. Once the accuracy of the network on the data set is obtained, the accuracy is fed back and the probability of each operation is updated. As shown in Fig. 6, operations are selected according to the probabilities; after the cells are determined, they are stacked into a complete network, the shared parameters are assigned, performance is evaluated, the result is fed back to the controller, and the information and probabilities are updated, which completes one round of iteration. We define the probability of each operation as P_m, the number of training iterations of each operation as E_m, and the average accuracy of each operation as A_m, where m (1 ≤ m ≤ M) indexes the candidate operations. The following rule is applied: among the M operations between a pair of nodes x_i and x_j, if an operation o_m has fewer training iterations and higher average accuracy than the other operations, that operation is superior to the others. The updating formula for the average accuracy of each operation is:
A_m = (A_m × E_m + a) / (E_m + 1), (a is the performance of the current round of network evaluation) (4)
The comparison between operations is:
Z = Σ_{k=1}^{M} [F(A_m > A_k) × F(E_m < E_k) - F(A_m < A_k) × F(E_m > E_k)], (the function F returns 1 if its argument is true and 0 if it is false) (5)
The probability update formula for the operation is:
P_m = P_m + α × Z, (1 ≤ m ≤ M) (6)
where α is a hyper-parameter representing the magnitude of the operation probability update, which also affects the convergence speed and convergence effect of the search process.
As the formulas show, operations with fewer training iterations but higher average accuracy are favoured, and their probabilities are enhanced. Conversely, operations with more training iterations but lower average accuracy are regarded as poorly performing, and their probabilities are reduced. After a certain number of iterations, the probabilities of the operations in the search space converge and stabilize effectively.
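A compact sketch of this update rule is given below (it follows formulas (4)-(6) as reconstructed above; the renormalization at the end is an added safeguard and not part of the patent's formulas, and the function name update_edge is illustrative):

    def update_edge(P, E, A, m_sel, a, alpha=0.005):
        # P: probabilities, E: training counts, A: average accuracies of the M
        # operations on one edge; m_sel is the operation sampled this round,
        # a is the accuracy of the evaluated network
        A[m_sel] = (A[m_sel] * E[m_sel] + a) / (E[m_sel] + 1)   # formula (4)
        E[m_sel] += 1
        M = len(P)
        for m in range(M):
            # formula (5): +1 for each rival beaten with fewer iterations and
            # higher accuracy, -1 for each rival that beats this operation
            z = sum(int(A[m] > A[k] and E[m] < E[k]) -
                    int(A[m] < A[k] and E[m] > E[k])
                    for k in range(M) if k != m)
            P[m] += alpha * z                                   # formula (6)
        # added safeguard (not in the patent): keep a valid probability distribution
        total = sum(max(p, 1e-8) for p in P)
        for m in range(M):
            P[m] = max(P[m], 1e-8) / total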
To generate the final neural network, after the probabilities converge we select the operation with the highest probability on every edge. For nodes with multiple inputs, we keep the operations with the top K probabilities. After the normal cell and the reduction cell are determined, they are stacked in a set number to form the complete neural network.
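The derivation of the final cell after convergence can be sketched as follows (K = 2 is an assumed value for the number of retained input edges per node; the text above only says the top K probabilities are kept, and derive_cell is an illustrative name):

    def derive_cell(probs, K=2):
        # pick the most probable operation on every edge ...
        best = {e: p.index(max(p)) for e, p in probs.items()}
        # ... then, for each node, keep only the K incoming edges whose chosen
        # operation has the highest probability
        kept = {}
        for j in sorted({jj for (_, jj) in probs}):
            incoming = [(e, probs[e][best[e]]) for e in probs if e[1] == j]
            incoming.sort(key=lambda t: t[1], reverse=True)
            for e, _ in incoming[:K]:
                kept[e] = best[e]
        return kept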
In a third aspect, the present invention provides a parameter-sharing strategy that trains while searching, for use in the neural network structure evaluation strategy.
As shown in Fig. 6, the search space has already been determined, and after each iteration of the cell search a complete neural network formed by stacking 8 cells is evaluated. We therefore share the operating parameters on each edge across the 8 cells, i.e. only the training parameters of 8 x 14 x 8 operations need to be saved. Whenever the neural network is trained or evaluated on the training set or the evaluation set, the corresponding operating parameters are read from the stored shared parameters instead of being randomly initialized, and after training the latest operating parameters are written back to the corresponding positions. When a neural network is evaluated, the shared parameters can therefore be read directly and the performance evaluated on the evaluation data set, avoiding the need to train every searched network to convergence and greatly accelerating the network evaluation process, as shown in Fig. 7.
How, then, are the shared parameters trained?
One option would be to pre-train: before the search starts, a network could be randomly sampled according to the initial probability of each operation and trained for one batch, and after a number of such epochs every operation's parameters would be trained to some extent. However, this increases the search cost, because the operation parameters must be trained sufficiently and too many epochs cannot be afforded. Therefore, a shared-parameter training mode that trains while searching is proposed. Referring to Fig. 7, after one generation of shared-parameter training, a round of network search is performed: networks are sampled, their performance is evaluated, and the operation probabilities are updated; a new round of shared-parameter training is then performed with the updated probabilities, and so on until the operation probabilities converge and stabilize. The advantage of this method is that poorly performing operations gradually lose their opportunities for parameter training while well-performing operations obtain more of them, so the best-performing operations are found, the convergence of the operation probabilities is accelerated, and the overall speed of the neural architecture search is improved.
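The alternation between shared-parameter training and probability-driven search can be summarized in a short sketch (the four callables and the default counts are placeholders standing in for the routines described above, not the patent's actual implementation):

    def search(probs, train_shared_one_epoch, sample_net, evaluate, update_probs,
               search_epochs=50, nets_per_round=100):
        # alternate one epoch of shared-parameter training with one round of
        # architecture sampling, evaluation and probability update
        for _ in range(search_epochs):
            train_shared_one_epoch(probs)      # ops are sampled by their current probability
            for _ in range(nets_per_round):
                arch = sample_net(probs)       # sample a cell and stack it into a network
                acc = evaluate(arch)           # shared weights, evaluated on the validation set
                update_probs(probs, arch, acc) # feed back accuracy, update the probabilities
        return probs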
The invention has the following beneficial effects: the proposed search framework greatly improves the search efficiency of NAS and can find an excellent neural network architecture with only 2 hours of search on a single GTX 1080 Ti, with improvements in accuracy, parameter count, and GPU latency. The search efficiency of the invention is the highest among existing NAS algorithms such as MetaQNN, Progressive NAS, DARTS, ENAS, and AmoebaNet-A + CutOut.
Drawings
FIG. 1 is a NAS framework diagram
FIG. 2 is an exemplary diagram of a chain network architecture
FIG. 3 is a diagram of an example multi-branch network architecture
FIG. 4 is a diagram of an exemplary network structure based on cell stacking
FIG. 5 is a diagram of exemplary network structure diversity
FIG. 6 is an exemplary graph of probabilistic iterative update
FIG. 7 is a diagram of an example of alternating iterations of network search and parameter training
FIG. 8 is a diagram illustrating a simple operation between nodes
FIG. 9 is an exemplary diagram of a cell
FIG. 10 is a network example diagram
FIG. 11 is a diagram of an example of a Normal Cell neural network architecture
FIG. 12 is a diagram of the optimal Reduction Cell neural network architecture
Detailed Description
Some terms used in the embodiments of the present application will be explained below.
The embodiments of the present application relate to applications of neural networks. In order to better understand the solution of the embodiments, the construction of the search space and related concepts that may be involved are described below.
For the search space, we search for cells as building blocks of the final architecture. The searched cells can be stacked to form a convolutional network, or recursively connected to form a recurrent network. The neural network is defined at different scales: network, cell, and node.
Node:
Nodes are the basic elements that make up a cell. Each node x_i is a specific tensor (e.g., a feature map in a convolutional neural network), and each directed edge (i, j) represents an operation o^(i,j), sampled from the operation search space, that transforms node x_i into another node x_j, as shown in Fig. 8. There are three types of nodes in a cell: input nodes, intermediate nodes, and output nodes. Each cell takes the output tensors of the previous cells as its input nodes and generates an intermediate node x_j by applying the sampled operations o^(i,j) to the previous nodes x_i (i < j). The concatenation of all intermediate nodes is taken as the final output node.
Following differentiable architecture search, the set of candidate operations (denoted O) consists of the following 8 operations: (1) 3 x 3 max pooling; (2) no connection (zero); (3) 3 x 3 average pooling; (4) skip connection (identity); (5) 3 x 3 dilated convolution with rate 2; (6) 5 x 5 dilated convolution with rate 2; (7) 3 x 3 depthwise separable convolution; (8) 5 x 5 depthwise separable convolution.
We apply element-wise addition only at the input of nodes that have multiple operations (edges). For example, in Fig. 9, B2 has three incoming operations; their results are added element-wise and the sum is taken as B2.
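A toy sketch of how a cell is evaluated under these rules follows (edges feeding the same node are summed element-wise and the intermediate nodes are concatenated; the operations used here are dummy element-wise functions, not the 8 candidate operations above):

    import numpy as np

    def cell_forward(inputs, arch, ops):
        # inputs: the cell's input tensors; arch: {(i, j): op_index};
        # ops: list of callables, one per candidate operation
        nodes = list(inputs)
        n_total = max(j for (_, j) in arch) + 1
        for j in range(len(inputs), n_total):
            # element-wise sum of all edges entering node j
            nodes.append(sum(ops[m](nodes[i])
                             for (i, jj), m in arch.items() if jj == j))
        # the output node is the concatenation of the intermediate nodes
        return np.concatenate(nodes[len(inputs):], axis=-1)

    ops = [lambda x: x, lambda x: -x] + [lambda x: 0.5 * x] * 6   # dummy "operations"
    x = np.ones((4, 4))
    out = cell_forward([x, x], {(0, 2): 0, (1, 2): 1, (0, 3): 2, (2, 3): 0}, ops)
    print(out.shape)   # (4, 8): two intermediate nodes concatenated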
Cell:
A Cell is defined as a small convolutional network that maps an H x W x F tensor to another H' x W' x F' tensor. There are two types of cells: normal cells and reduction cells. A normal cell uses operations with stride 1, so H' = H and W' = W; a reduction cell uses operations with stride 2, so H' = H/2 and W' = W/2. For the number of filters F, the convolutional neural network architectures designed by most people [10, 12, 13, 23, 32, 33] commonly double F whenever the spatial feature map is halved. Thus F' = F is used for stride 1 and F' = 2F for stride 2.
As shown in Fig. 9, a cell is represented by a DAG with 7 nodes: two input nodes I_1 and I_2, four intermediate nodes B1, B2, B3 and B4 that apply the sampled operations to the input nodes and earlier intermediate nodes, and an output node that concatenates the intermediate nodes. The edge between two nodes represents an operation o^(i,j) sampled from the operation search space. During training, when an intermediate node has several incoming edges (operations), its input is obtained by element-wise addition. At test time, we select the top K probabilities to generate the final cell. The size of the whole search space is therefore 8^|E_N|, where E_N is the set of possible edges for a cell with N intermediate nodes. In our case with N = 4 intermediate nodes, the total number of cell structures is 2 x 8^(2+3+4+5) = 2 x 8^14 (counting both cell types), which is a very large search space and therefore requires an efficient optimization method.
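The edge count and search-space size quoted above can be checked with a couple of lines:

    M, N = 8, 4                      # operations per edge, intermediate nodes
    edges = sum(range(2, 2 + N))     # 2 + 3 + 4 + 5 = 14 edges per cell
    size = 2 * M ** edges            # two cell types, 8 operations per edge
    print(edges, size)               # 14, 2 * 8**14 (about 8.8e12 structures)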
Network:
As shown in Fig. 10, the network is composed of a predetermined number of stacked cells, which may be normal cells or reduction cells. At the top of the network, global average pooling is followed by a softmax layer for the final output. Based on the shared-parameter network performance evaluation strategy, a small stacked model (for example, 8 layers) is trained on the relevant data set to search for the normal cell and the reduction cell, and the cells are then stacked into a deeper network (for example, 20 layers) for performance evaluation. The overall construction process and search space of the CNN are the same as in differentiable architecture search; note, however, that our search algorithm is different.
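A minimal sketch of stacking cells into the full network is shown below (the constructor callables and the reduction positions are placeholders; in the 8-cell search model of the embodiment the reduction cells are the 2nd and 5th layers):

    def build_network(num_cells, reduction_at, make_normal, make_reduction):
        # stack cells in order, inserting reduction cells at the given positions;
        # global average pooling and a softmax classifier follow the last cell
        return [make_reduction() if i in reduction_at else make_normal()
                for i in range(num_cells)]

    # e.g. the 8-cell search model with reduction cells as the 2nd and 5th layers:
    # cells = build_network(8, {1, 4}, make_normal, make_reduction)   # 0-indexed positions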
The present invention will be described in further detail with reference to specific examples.
In this example we first perform experiments on the CIFAR-10 dataset to demonstrate the feasibility of our algorithm. The cells searched on CIFAR-10 are then applied to wider image classification datasets (such as CIFAR-100 and ImageNet), and the method is compared with other recent NAS methods in terms of search efficiency, accuracy, and network parameter size.
The method comprises the following steps. Step one: data set setup. We follow the experimental data sets and evaluation metrics of most NAS algorithms and therefore perform extensive experiments on the CIFAR-10 dataset. CIFAR-10 comprises 50,000 training images and 10,000 test images. During the neural architecture search, 5,000 training images are randomly selected as a validation set to evaluate the sampled network architectures. CIFAR-10 images are 32 x 32 colour images in 10 categories. All colour intensities are normalized to [-1, +1].
Step two: search space setup. Throughout the search, according to the theory above, the number of stacked cells does not affect the probability updates driven by the evaluation results. Therefore, to speed up the search, the number of stacked cell layers is set to 8 during the search, with the 2nd and 5th layers being reduction cells and the remaining layers normal cells; each cell has 4 nodes. The search process trains the shared parameters for a total of 50 epochs and samples sub-networks for 100 epochs, with the batch size set to 128 and the initial number of channels set to 16. The initial learning rate is set to 0.025 (annealed to 0 with a cosine schedule), the momentum is set to 0.9, and the weight decay is set to 3 x 10^-4. The hyper-parameter α, which controls the magnitude of the operation-probability update, is set to 0.005 (an optimizer configuration illustrating these settings is sketched after the steps below).
Step three: initialize the probability of each operation in the cell to 1/8;
Step four: sample the operations according to the probabilities;
Step five: stack the cells to form a network;
Step six: assign the shared parameters;
Step seven: verify the network's performance on the validation set;
Step eight: feed back the accuracy;
Step nine: update the information and the operation probabilities;
Step ten: generate hundreds of sub-networks;
Step eleven: train each sub-network once;
Step twelve: update the shared parameters;
Step thirteen: repeat steps four to twelve in each epoch;
Step fourteen: after the search, the cells are stacked to form a complete neural network so that its performance can be evaluated on data sets such as CIFAR-10, CIFAR-100 and ImageNet.
Step fifteen: when evaluating on the CIFAR-10 and CIFAR-100 datasets, we essentially keep the hyper-parameter settings used when searching on CIFAR-10, but expand the 8 cells to 20 cells, train for 600 epochs, set the batch size to 128, and apply regularization such as cutout and path dropout with probability 0.3.
Step sixteen: when evaluating on the ImageNet dataset, we also essentially keep the previous hyper-parameter settings, but use 14 cells, train for 250 epochs, set the batch size to 64, set the weight decay to 3 x 10^-5, and use an initial SGD learning rate of 0.1 (decayed by a factor of 0.97 every epoch).
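As an illustration of the optimizer settings quoted in steps two and sixteen, a PyTorch-style configuration might look as follows (a sketch under the assumption that PyTorch is used; the Linear modules are placeholders for the stacked networks):

    import torch

    # search on CIFAR-10 (step two): SGD, cosine-annealed learning rate
    search_model = torch.nn.Linear(10, 10)   # placeholder for the 8-cell super-network
    opt = torch.optim.SGD(search_model.parameters(), lr=0.025,
                          momentum=0.9, weight_decay=3e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=50, eta_min=0)
    alpha = 0.005                            # step size of the probability update

    # evaluation on ImageNet (step sixteen): SGD, learning rate decayed by 0.97 per epoch
    eval_model = torch.nn.Linear(10, 10)     # placeholder for the 14-cell network
    opt_im = torch.optim.SGD(eval_model.parameters(), lr=0.1, weight_decay=3e-5)
    sched_im = torch.optim.lr_scheduler.ExponentialLR(opt_im, gamma=0.97)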
To eliminate random factors, we performed several experiments and found that the finally searched neural network architectures perform very similarly, which demonstrates the stability of our algorithm. The optimal Normal Cell neural network architecture is shown in FIG. 11, and the optimal Reduction Cell neural network architecture is shown in FIG. 12.
TABLE 1 (training results of the optimal neural network architecture on the CIFAR-10 and CIFAR-100 data sets; the table is provided as an image in the original document)
Table 1 lists the training results of the optimal neural network architecture on the CIFAR-10 and CIFAR-100 data sets. It is worth noting that, compared with other NAS methods, the method proposed by the present invention has a very significant advantage in computing resource consumption: only 2 GPU hours are required to complete the whole search process. In terms of the number of network parameters, the network has only 2.8M parameters, far fewer than the networks found by other NAS methods. In terms of error rate, the neural network achieves 2.69% on CIFAR-10 and 17% on CIFAR-100, slightly better than other neural network architectures. It is therefore clear that our method shows advantages in computational resource consumption and network test accuracy over other NAS methods as well as over manually designed neural networks.
We also train our optimal neural network structure on the ImageNet dataset. The searched cell structure is stacked into a complete neural network architecture and transferred directly to ImageNet for training. As described above, we set the number of cells to 14, with layers 4 and 9 being reduction cells and the rest normal cells; the input image size is 224 x 224.
TABLE 2 (results of the optimal neural network architecture on the ImageNet data set; the table is provided as an image in the original document)
As shown in Table 2, on the ImageNet data set the error rate of our neural network is 24%, better than the neural networks found by other NAS methods; compared with them, our method achieves higher performance with less computational cost.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (5)

1. An efficient neural network structure searching method based on probability distribution, characterized in that: a novel probability-distribution algorithm greatly reduces the number of sub-networks that must be trained, thereby accelerating the neural architecture search, and a parameter-sharing mode that trains while searching is used, which lowers the cost of evaluating sub-networks and ensures that better operations receive more training, further accelerating the neural architecture search.
2. The method of claim 1, wherein the probability-distribution algorithm comprises: initialization and sampling; the network structure is diversified by selecting one of the M candidate operations between every two nodes; a probability is first initialized for each operation, the probabilities are iteratively updated, and the operation that performs best is finally selected; at the start of the search, the probability parameter of every operation in the search space is initialized to 1/M, i.e. the probabilities of the M operations between two nodes sum to 1; then, in the sampling phase, the operation between every two nodes in the current round is selected according to the operation probabilities, a higher probability value giving a higher chance of being selected; the final selection result is the network sampled in the current round.
3. The method according to claim 1 or 2, wherein the probability-distribution algorithm comprises a probability updating method, wherein, for the network performance evaluation strategy, a forced parameter-sharing mode is adopted: after a network is sampled and its cells are stacked into a complete CNN, the shared parameters are assigned directly and performance is then evaluated on a data set; after the accuracy of the network on the data set is obtained, the accuracy is fed back and the probability of each operation is updated; operations are selected according to the probabilities, the determined cells are stacked into a complete network, the shared parameters are assigned, performance is evaluated and fed back to the controller, and the information and probabilities are updated, completing one round of iteration; the probability of each operation is defined as P_m, the number of training iterations of each operation as E_m, and the average accuracy of each operation as A_m, where m (1 ≤ m ≤ M) indexes the candidate operations; the following rule is applied: among the M operations between a pair of nodes x_i and x_j, if an operation o_m has fewer training iterations and higher average accuracy than the other operations, that operation is superior to the others; the updating formula of the average accuracy of each operation is:

A_m = (A_m × E_m + a) / (E_m + 1) (4)

the comparison between operations is:

Z = Σ_{k=1}^{M} [F(A_m > A_k) × F(E_m < E_k) - F(A_m < A_k) × F(E_m > E_k)] (5)

the probability update formula of the operation is:

P_m = P_m + α × Z, (1 ≤ m ≤ M) (6)

wherein α is a hyper-parameter representing the magnitude of the operation probability update; an operation with fewer training iterations but higher average accuracy in the search space is selected and its probability is enhanced; meanwhile, operations with more training iterations but lower average accuracy are regarded as poorly performing operations, and their probabilities are weakened; after a certain number of iterations, the probabilities of the operations in the search space converge and stabilize effectively; in order to generate the final neural network, after the probabilities converge, the operation with the highest probability on every edge is selected; for nodes with multiple inputs, the operations with the top K probabilities are kept; after the normal cell and the reduction cell are determined, they are stacked in a set number to form the complete neural network.
4. The method of claim 1, wherein the parameter-sharing mode of training while searching comprises: the search space having been determined, after each iteration of the cell search a complete neural network formed by stacking a set number of cells is evaluated; the operating parameters on each edge are therefore shared across all stacked cells, i.e. only the training parameters of L × |E_N| × M operations need to be saved, where L is the number of stacked cells, E_N is the set of possible edges of a cell with N intermediate nodes, and M is the number of candidate operations per edge; then, each time the neural network is trained or evaluated on the training data set or the evaluation data set, the corresponding operating parameters are read from the stored shared parameters instead of being randomly initialized, and after training the latest operating parameters are stored back to the corresponding positions.
5. The method according to claim 1 or 4, wherein the parameter-sharing mode of training while searching comprises: after one generation of shared-parameter training, a round of network search is performed, networks are sampled and their performance evaluated, the operation probabilities are updated, and a new round of shared-parameter training is then performed with the updated operation probabilities; this is repeated until the operation probabilities converge and stabilize.
CN202110421335.0A 2021-04-20 2021-04-20 Efficient neural network structure searching method based on probability distribution Pending CN113344174A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110421335.0A CN113344174A (en) 2021-04-20 2021-04-20 Efficient neural network structure searching method based on probability distribution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110421335.0A CN113344174A (en) 2021-04-20 2021-04-20 Efficient neural network structure searching method based on probability distribution

Publications (1)

Publication Number Publication Date
CN113344174A true CN113344174A (en) 2021-09-03

Family

ID=77468197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110421335.0A Pending CN113344174A (en) 2021-04-20 2021-04-20 Efficient neural network structure searching method based on probability distribution

Country Status (1)

Country Link
CN (1) CN113344174A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023087953A1 (en) * 2021-11-22 2023-05-25 华为技术有限公司 Method and apparatus for searching for neural network ensemble model, and electronic device
CN114429197A (en) * 2022-01-25 2022-05-03 西安交通大学 Neural network architecture searching method, system, equipment and readable storage medium
CN114429197B (en) * 2022-01-25 2024-05-28 西安交通大学 Neural network architecture searching method, system, equipment and readable storage medium
CN115115873A (en) * 2022-06-08 2022-09-27 中国船舶集团有限公司系统工程研究院 Image classification method and device based on differentiable network structure search
CN115760777A (en) * 2022-11-21 2023-03-07 脉得智能科技(无锡)有限公司 Hashimoto's thyroiditis diagnostic system based on neural network structure search
CN115760777B (en) * 2022-11-21 2024-04-30 脉得智能科技(无锡)有限公司 Hashimoto thyroiditis diagnosis system based on neural network structure search

Similar Documents

Publication Publication Date Title
Liashchynskyi et al. Grid search, random search, genetic algorithm: a big comparison for NAS
CN113344174A (en) Efficient neural network structure searching method based on probability distribution
CN109948029A (en) Based on the adaptive depth hashing image searching method of neural network
CN109934332A (en) The depth deterministic policy Gradient learning method in pond is tested based on reviewer and double ends
CN112232413B (en) High-dimensional data feature selection method based on graph neural network and spectral clustering
Bakhshi et al. Fast evolution of CNN architecture for image classification
Suganuma et al. Designing convolutional neural network architectures using cartesian genetic programming
CN110222824B (en) Intelligent algorithm model autonomous generation and evolution method, system and device
CN112288046B (en) Mixed granularity-based joint sparse method for neural network
Luo et al. HSCoNAS: Hardware-software co-design of efficient DNNs via neural architecture search
Dutta et al. Effective building block design for deep convolutional neural networks using search
CN116911459A (en) Multi-input multi-output ultra-short-term power load prediction method suitable for virtual power plant
Yamada et al. Weight Features for Predicting Future Model Performance of Deep Neural Networks.
US20230051955A1 (en) System and Method For Regularized Evolutionary Population-Based Training
Vanneschi et al. Heterogeneous cooperative coevolution: strategies of integration between gp and ga
Hen Solving spin glasses with optimized trees of clustered spins
Wan et al. RSSM-Net: Remote sensing image scene classification based on multi-objective neural architecture search
CN113554144A (en) Self-adaptive population initialization method and storage device for multi-target evolutionary feature selection algorithm
Zhang et al. Bandit neural architecture search based on performance evaluation for operation selection
Joldos et al. A parallel evolutionary approach to community detection in complex networks
JPH0561848A (en) Device and method for selecting and executing optimum algorithm
Frachon et al. An immune-inspired approach to macro-level neural ensemble search
Sun et al. Matrix-Based Ant Colony Optimization for Large-Scale Balanced Multiple Traveling Salesmen Problem
Zou et al. G-EvoNAS: Evolutionary Neural Architecture Search Based on Network Growth
US20230110362A1 (en) Data processing apparatus and data processing method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210903

WD01 Invention patent application deemed withdrawn after publication