CN114626506A - Attention mechanism-based neural network unit structure searching method and system - Google Patents

Attention mechanism-based neural network unit structure searching method and system

Info

Publication number
CN114626506A
CN114626506A (application CN202210219650.XA)
Authority
CN
China
Prior art keywords
unit structure
attention
network
search
edge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210219650.XA
Other languages
Chinese (zh)
Inventor
胡瑜
孙自浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202210219650.XA
Publication of CN114626506A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation using electronic means
    • G06N 3/08: Learning methods
    • G06N 7/00: Computing arrangements based on specific mathematical models
    • G06N 7/01: Probabilistic graphical models, e.g. probabilistic networks


Abstract

The invention provides an attention-mechanism-based neural network unit structure searching method and system, comprising: constructing a macro-architecture super-network in a search space, wherein each layer's unit structure in the macro-architecture super-network is a directed acyclic graph, the nodes of the directed acyclic graph are connected by edges, and each edge represents a combination of a plurality of candidate operations in the search space; adding an attention module after the output feature maps of all candidate operations of each edge in the unit structure to obtain a network to be searched; and training the network to be searched with a labeled data set, gradually deleting the candidate operation with the smallest attention weight on each edge of the unit structure during training until a preset number of iterations is reached, and then removing all attention modules from the current network to be searched to obtain the neural network unit structure search result for the data set. The invention not only accounts for the interactions among operations but also retains every operation until the final step of the search.

Description

Attention mechanism-based neural network unit structure searching method and system
Technical Field
The invention relates to the technical fields of neural network architecture search and picture classification within automatic machine learning, and in particular to an attention-mechanism-based neural network unit structure searching method and device.
Background
Automatic machine learning (AutoML) refers to automating steps such as data preprocessing, feature selection and algorithm selection in machine learning, and neural network architecture design, hyperparameter optimization and model training in deep learning, so that desired results are obtained without manual intervention. Neural architecture search (NAS) belongs to the network-design category of automatic machine learning and refers to automatically searching for a neural network architecture: for different computer vision tasks such as classification, detection, segmentation and tracking, operations are combined according to a search strategy from a search space containing various operations (such as convolution, pooling and skip connections) to obtain a neural network architecture, whose performance on the corresponding task is then measured under a specified evaluation strategy.
Early neural network architecture search strategies, including reinforcement learning, evolutionary algorithms, random search and Bayesian optimization, generally required retraining each resulting network structure to evaluate its performance, so the whole search process was computationally intensive and time consuming. In recent years, differentiable search strategies have attracted extensive attention in academia and industry because they use weight sharing and gradient-descent optimization to significantly reduce search time. A particularly representative one is differentiable architecture search (DARTS), which searches for cell structures and then stacks the searched cells into a target network to verify performance.
However, the differentiable search strategy DARTS only considers the influence of the model's loss function on each operation weight in the search space; it does not consider the interactions among the operations. StacNAS identified the problem that, owing to multicollinearity among operations, similar operations split the vote, which biases the selection. StacNAS therefore first computes a correlation matrix of all operations in the original search space, groups the operations by correlation, and selects one operation from each group as its representative, the representatives forming a compact search space; in this compact space, StacNAS obtains each operation's weight on each edge in the same way as DARTS, keeps only the operation with the largest weight on each edge, replaces it with the operations of its group in the original search space, and continues searching; finally, the retained operations are determined by the operation weights on each edge. Since both StacNAS and DARTS delete lower-weight operations during the search, when the operation weights differ only slightly, a deleted operation can never be selected again, so only a suboptimal neural network architecture can be found.
Disclosure of Invention
To address the deficiencies of the prior art, the invention provides an attention-mechanism-based neural network unit structure searching method, comprising the following steps:
step 1, constructing a macro-architecture super-network in a search space, wherein each layer's unit structure in the macro-architecture super-network is a directed acyclic graph whose nodes comprise input nodes, intermediate nodes and output nodes; the input nodes receive the output feature maps of the preceding unit structures, each intermediate node aggregates the feature maps of all preceding nodes in the unit structure, the output node concatenates the feature maps of all intermediate nodes, the nodes of the directed acyclic graph are connected by edges, and each edge represents a combination of a plurality of candidate operations in the search space;
step 2, adding an attention module after the output feature maps of all candidate operations of each edge in the unit structure to obtain a network to be searched;
and step 3, training the network to be searched with a labeled data set, gradually deleting the candidate operation with the smallest attention weight on each edge of the unit structure of the intermediate search network during training, and, when training reaches a preset number of iterations, removing all attention modules from the current network to be searched to obtain the neural network unit structure search result for the data set.
In the neural network unit structure searching method, the data set comprises a plurality of samples, each sample having a corresponding label; the samples are pictures and the labels are picture categories; the search space is the DARTS search space.
The attention-mechanism-based neural network unit structure searching method further comprises:
step 4, training the neural network unit structure search result with the data set to obtain a picture classification model, and inputting a picture to be classified into the model to obtain the picture category of the picture to be classified.
In the attention-mechanism-based neural network unit structure searching method, each edge in a unit structure of the macro-architecture super-network consists of a plurality of candidate operations; an edge with m candidate operations produces m output feature maps $F_1, \dots, F_m$, each of size $\mathbb{R}^{b \times c \times h \times w}$. The m feature maps are concatenated along the channel dimension to obtain the concatenated feature $F_{con} \in \mathbb{R}^{b \times mc \times h \times w}$, which is input into the attention module to calculate the attention weight of each candidate operation; the attention module consists of a global average pooling layer, fully connected layers and a Sigmoid layer.
Step 3 comprises calculating the attention weight of each candidate operation on each edge in the unit structure of the network to be searched: the concatenated feature $F_{con}$ of all candidate operations on each edge is reduced by global average pooling to the pooled feature $F_{gap} \in \mathbb{R}^{b \times mc \times 1 \times 1}$; the feature $F_{gap}$ then passes through two fully connected layers and a Sigmoid layer, whose output is the attention weight $A \in \mathbb{R}^{b \times mc \times 1 \times 1}$.
The invention also provides a neural network unit structure searching system based on the attention mechanism, which comprises the following components:
an initialization module, for constructing a macro-architecture super-network in a search space, wherein each layer's unit structure in the macro-architecture super-network is a directed acyclic graph whose nodes comprise input nodes, intermediate nodes and output nodes; the input nodes receive the output feature maps of the preceding unit structures, each intermediate node aggregates the feature maps of all preceding nodes in the unit structure, the output node concatenates the feature maps of all intermediate nodes, the nodes of the directed acyclic graph are connected by edges, and each edge represents a combination of a plurality of candidate operations in the search space;
an adding module, for adding an attention module after the output feature maps of all candidate operations of each edge in the unit structure to obtain a network to be searched;
and a searching module, for training the network to be searched with a labeled data set, gradually deleting the candidate operation with the smallest attention weight on each edge of the unit structure of the intermediate search network during training until a preset number of iterations is reached, and then removing all attention modules from the current network to be searched to obtain the neural network unit structure search result for the data set.
In the neural network unit structure searching system, the data set comprises a plurality of samples, each sample having a corresponding label; the samples are pictures and the labels are picture categories; the search space is the DARTS search space.
The attention-mechanism-based neural network unit structure search system further comprises:
a picture classification module, for training the neural network unit structure search result with the data set to obtain a picture classification model, and inputting a picture to be classified into the model to obtain the picture category of the picture to be classified.
In the attention-mechanism-based neural network unit structure search system, each edge in a unit structure of the macro-architecture super-network consists of a plurality of candidate operations; an edge with m candidate operations produces m output feature maps $F_1, \dots, F_m$, each of size $\mathbb{R}^{b \times c \times h \times w}$. The m feature maps are concatenated along the channel dimension to obtain the concatenated feature $F_{con} \in \mathbb{R}^{b \times mc \times h \times w}$, which is input into the attention module to calculate the attention weight of each candidate operation; the attention module consists of a global average pooling layer, fully connected layers and a Sigmoid layer.
The searching module is used for calculating the attention weight of each candidate operation on each edge in the unit structure of the network to be searched: the concatenated feature $F_{con}$ of all candidate operations on each edge is reduced by global average pooling to the pooled feature $F_{gap} \in \mathbb{R}^{b \times mc \times 1 \times 1}$; the feature $F_{gap}$ then passes through two fully connected layers and a Sigmoid layer, whose output is the attention weight $A \in \mathbb{R}^{b \times mc \times 1 \times 1}$.
The invention also provides a storage medium storing a program for executing any of the above attention-mechanism-based neural network unit structure searching methods.
The invention also provides a client for use in any of the above attention-mechanism-based neural network unit structure search systems.
According to the above scheme, the advantages of the invention are as follows:
The invention provides an attention-based neural network unit (cell) structure searching method and system (ANCS for short), which, on the basis of a differentiable search strategy, uses an attention mechanism to evaluate the importance of each operation, adds a regularization term to the loss function to sparsify the operation weights, and finally selects which operations to retain, and how many, according to constraints on computation, storage and inference time. Compared with the prior art, the method both considers the interactions among operations and retains every operation until the final step of the search, thereby obtaining a neural network architecture with excellent performance.
Drawings
FIG. 1 is a schematic diagram of a neural network unit structure searching method based on an attention mechanism according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a super-network in accordance with an embodiment of the present invention;
FIG. 3 is a schematic view of an attention module of an embodiment of the present invention;
FIG. 4 is a schematic diagram of an apparatus for searching a neural network unit structure based on an attention mechanism according to an embodiment of the present invention.
Detailed Description
The invention aims to provide an attention-mechanism-based neural network unit structure searching method and device. The invention mainly addresses cell structure search in the DARTS search space and adopts an attention mechanism to measure the importance of each candidate operation in a cell.
In a first aspect, the present invention provides a neural network unit structure search method based on an attention mechanism, specifically including the following steps:
step 1, designing a directed acyclic graph.
The directed acyclic graph is composed of N (e.g., N = 3) intermediate nodes and E (e.g., E = 9) edges. Each node represents a feature map and each edge represents a combination of several candidate operations in the search space; N and E are hyperparameters, and the larger they are, the larger the search space and the harder it is to find a good structure. The graph has two input nodes, each receiving the output feature map of one of the two preceding unit structures; each intermediate node aggregates the feature maps of all preceding nodes in the unit structure; and the feature map of the output node is defined as the concatenation of the feature maps of all intermediate nodes.
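To make the structure concrete, the following is a minimal PyTorch sketch of such a cell, assuming an illustrative candidate-operation set and a `combine` callback that fuses the m candidate maps of an edge (the attention-based fusion described below); the names `MixedEdge`, `Cell` and `CANDIDATE_OPS` are our own, not from the patent:

```python
import torch
import torch.nn as nn

CANDIDATE_OPS = {  # a few representative candidate operations (illustrative)
    "skip":     lambda c: nn.Identity(),
    "conv3x3":  lambda c: nn.Conv2d(c, c, 3, padding=1, bias=False),
    "avg_pool": lambda c: nn.AvgPool2d(3, stride=1, padding=1),
}

class MixedEdge(nn.Module):
    """One edge of the DAG: all m candidate operations applied in parallel."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList(make(channels) for make in CANDIDATE_OPS.values())

    def forward(self, x):
        return [op(x) for op in self.ops]  # one feature map per candidate op

class Cell(nn.Module):
    """DAG with 2 input nodes, N intermediate nodes and one output node."""
    def __init__(self, channels, n_intermediate=3):
        super().__init__()
        self.n = n_intermediate
        # Intermediate node j has an incoming edge from both inputs and from
        # every earlier intermediate node: 2 + 3 + 4 = 9 edges for N = 3.
        self.edges = nn.ModuleList(
            MixedEdge(channels) for j in range(self.n) for _ in range(2 + j))

    def forward(self, s0, s1, combine):
        # `combine` fuses the m candidate maps of one edge into a single map
        # (the attention-based fusion of the following sections).
        states, k = [s0, s1], 0
        for _ in range(self.n):
            node = sum(combine(self.edges[k + i](h)) for i, h in enumerate(states))
            k += len(states)
            states.append(node)
        return torch.cat(states[2:], dim=1)  # output node: concat intermediates
```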
Step 2, adding an attention module to each edge of the directed acyclic graph.
In the well-defined directed acyclic graph, each edge is composed of a plurality of candidate operations. Assuming the edge has m candidate operations, each candidate operation receives the edge's input node, yielding m feature maps $F_1, \dots, F_m$, each of size $\mathbb{R}^{b \times c \times h \times w}$, where h and w are the height and width of the feature map, c is the number of channels, and b is the batch size. The m feature maps are then concatenated along the channel dimension to obtain the concatenated feature $F_{con} \in \mathbb{R}^{b \times mc \times h \times w}$, which is input into the proposed attention module to calculate the attention weight of each candidate operation. The attention module consists of a global average pooling layer, fully connected layers and a Sigmoid layer.
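A minimal sketch of this attention module in PyTorch follows; the reduction ratio of the two fully connected layers is an assumption (the text does not specify the hidden width):

```python
import torch.nn as nn

class OpAttention(nn.Module):
    """Global average pooling -> two FC layers -> Sigmoid, per concatenated channel."""
    def __init__(self, m_ops, channels, reduction=4):
        super().__init__()
        mc = m_ops * channels                    # channels of F_con
        self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.fc = nn.Sequential(
            nn.Linear(mc, mc // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(mc // reduction, mc),
            nn.Sigmoid(),                        # attention weight A in (0, 1)
        )

    def forward(self, f_con):                    # f_con: (b, m*c, h, w)
        b, mc, _, _ = f_con.shape
        a = self.pool(f_con).view(b, mc)         # F_gap: (b, m*c)
        return self.fc(a).view(b, mc, 1, 1)      # A: (b, m*c, 1, 1)
```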
Step 3, calculating the importance of each candidate operation on each edge.
First, the concatenated feature $F_{con}$ of all candidate operations on each edge is reduced by global average pooling to the pooled feature $F_{gap} \in \mathbb{R}^{b \times mc \times 1 \times 1}$. The feature $F_{gap}$ then passes through two fully connected layers and a Sigmoid layer; the Sigmoid output $A \in \mathbb{R}^{b \times mc \times 1 \times 1}$ is the attention weight and can be viewed as the importance of each channel. The importance of each candidate operation is therefore represented by the sum of the activation values of its corresponding channels. To allow the weights of the attention module to be updated, the activation value of each channel is multiplied by the original concatenated feature, giving the attention-weighted feature map $F_{att} = A \odot F_{con}$. To keep the edge output at the same dimensions as the original input feature map, a point-wise addition is performed over the m attention-weighted maps.
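Continuing the sketch above, one possible reading of this step is as follows; `attn` is the `OpAttention` sketch, and the point-wise addition over the m weighted maps is our interpretation of the dimension-preserving step:

```python
import torch

def edge_forward(op_outputs, attn):
    """Fuse the m candidate maps of one edge using the attention weights."""
    f_con = torch.cat(op_outputs, dim=1)            # F_con: (b, m*c, h, w)
    a = attn(f_con)                                 # A: (b, m*c, 1, 1)
    f_att = a * f_con                               # attention-weighted F_att
    b, mc, h, w = f_att.shape
    m = len(op_outputs)
    c = mc // m
    # Importance of candidate op i = sum of activations over its c channels.
    importance = a.view(b, m, c).sum(dim=(0, 2))    # shape (m,)
    # Point-wise addition over the m weighted maps restores (b, c, h, w).
    fused = f_att.view(b, m, c, h, w).sum(dim=1)
    return fused, importance
```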
Step 4, training and updating the weights of the candidate operations in the directed acyclic graph and the weights of the attention modules, based on the selected task.
Depending on the target task, which may be object classification, object detection, semantic segmentation, instance segmentation, object tracking, etc., the directed acyclic graph is trained on the received training data set using conventional machine-learning training techniques appropriate to the task (e.g., stochastic gradient descent with backpropagation), and the weights of the different operations and the attention weights on each edge are updated according to the backpropagated gradients.
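A hedged sketch of this training step, assuming a standard SGD setup (the optimizer, learning rate and epoch count are placeholders, not values from the patent):

```python
import torch
import torch.nn as nn

def train_supernet(model, train_loader, epochs=50, lr=0.025):
    """One plain SGD loop; backprop updates operation and attention weights jointly."""
    opt = torch.optim.SGD(model.parameters(), lr=lr,
                          momentum=0.9, weight_decay=3e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:        # labeled batches of the selected task
            opt.zero_grad()
            loss = loss_fn(model(x), y)  # task loss on the super-network
            loss.backward()              # gradients flow into attention modules
            opt.step()
```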
Step 5, evaluating the performance of the directed acyclic graph with the validation data set of the selected task.
When the network trained in step 4 has converged on the training set of the selected task, the performance of the directed acyclic graph is evaluated on the task's validation set, the attention weights of all candidate operations on each edge of the unit structure are obtained, and the operations with the lowest attention weights in the directed acyclic graph are deleted first.
Step 6, repeating step 5: continuing training until convergence and then evaluating again with the validation set.
Deleting the operations with low attention weights on each edge of the unit structure in step 5 reduces the accuracy of the super-network, so the training set must be used again to train to convergence. The validation set is then used for another evaluation, the attention weights of all candidate operations on each edge of the unit structure in the current state are obtained, and the operations with low attention weights in the directed acyclic graph are deleted again.
Step 7, outputting the neural network corresponding to the directed acyclic graph.
When only the single operation with the largest attention weight remains on each edge of the unit structure, the optimal target neural network for the selected task is obtained. Finally, the network is retrained on the full data set and its final performance is verified.
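A small sketch of reading off the final cell, under the assumption that per-edge importance vectors (as in the `edge_forward` sketch) are available; the dictionary layout is illustrative:

```python
def derive_cell(edge_importance):
    """Keep, for every edge, only the candidate with the largest attention weight."""
    # edge_importance: {edge_id: importance tensor of shape (m,)}
    return {edge: int(scores.argmax()) for edge, scores in edge_importance.items()}
```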
In a second aspect, the invention provides an attention mechanism-based neural network unit structure searching device, which specifically comprises the following modules:
A. A unit structure construction module: constructs the directed acyclic graph structure of the unit structure, composed of N intermediate nodes and E edges.
B. An attention module: added to each edge of the unit structure to extract the attention weights of all candidate operations on that edge, i.e., the importance of each candidate operation.
C. A unit structure searching and optimizing module: feeds the training set into the unit structure for forward propagation, and optimizes the weight parameters of the different candidate operations in the unit structure and the weight parameters of the attention modules through backpropagation.
D. A unit structure evaluation and update module: evaluates the importance of each candidate operation in the unit structure, deletes the operations with low attention weights according to each operation's weight, and then updates the topology of the unit structure.
E. A unit structure acquisition module: obtains the optimal unit structure model according to the importance of each candidate operation on each edge of the unit structure.
F. A target network training and verification module: retrains the obtained optimal unit structure model on the full training set to optimize its weight parameters, and verifies the performance of the searched unit structure model on the test set.
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
Example 1
The invention provides a neural network unit structure searching method based on an attention mechanism, which comprises the following steps:
s11: the unit structure is designed to have an acyclic graph.
In this step, the invention performs neural network cell structure search, i.e., it searches for and determines the optimal candidate operation on each edge of the cell structure. Taking the DARTS search space as an example, each edge in the space contains a plurality of candidate operations, and the purpose of cell structure search is to select the optimal operations from them to form the final target structure.
S12: an attention module is added to each edge of the directed acyclic graph.
In this step, the super-network is formed by stacking L layers of cells, as shown in FIG. 2, where $c_{k-2}$ and $c_{k-1}$ are input nodes whose input data are the outputs of the two preceding cells, 0, 1 and 2 are intermediate nodes, and $c_k$ is the output node. Cells are of two types, normal cells and down-sampling (reduction) cells; the down-sampling cells are located at the $\lfloor L/3 \rfloor$-th and $\lfloor 2L/3 \rfloor$-th layers of the network, and normal cells occupy the other layers. Every cell contains a directed acyclic graph with N nodes and E edges, and each edge carries m candidate operations, such as the zero operation, convolution operations and pooling operations. Meanwhile, the invention adds an attention module after the output feature maps of all candidate operations of each edge, as shown in FIG. 3. The attention weights of the attention module can be interpreted as the importance of each candidate operation.
The normal unit structure does not change the resolution of the feature map; the down-sampling unit halves the feature-map resolution and doubles the channels. There are generally two down-sampling units in the search space, and they may be placed at different positions in the super-network as required; the invention does not limit this.
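A minimal sketch of this stacking rule, assuming the DARTS-style placement reconstructed above:

```python
def cell_types(num_layers):
    """Normal cells everywhere except reduction cells at depths L/3 and 2L/3."""
    reductions = {num_layers // 3, 2 * num_layers // 3}
    return ["reduction" if i in reductions else "normal"
            for i in range(num_layers)]

# cell_types(8) -> ['normal', 'normal', 'reduction', 'normal',
#                   'normal', 'reduction', 'normal', 'normal']
```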
S13: feed the training-set pictures of the selected task into the directed acyclic graph, compute the gradients, and use an optimizer to update the weights of the unit structure and of the attention modules along the gradient direction.
In this step, the super-network containing the attention modules is trained on the training set; the weights of the super-network are updated during training, and the weights of the attention modules are updated along with them, so that the importance of each candidate operation in the super-network is learned.
S14: delete the operations with low attention weights according to each operation's weight, then update the cell-structure topology.
In this step, as training of the super-network proceeds, the operations with smaller attention weights on each edge of the unit structure are deleted step by step, and the topology of the unit structure is then updated. The gradual deletion may be performed at every iteration, or after every preset number of iterations.
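A sketch of this progressive deletion, assuming each edge object exposes its remaining candidate list and current importance vector (names are illustrative):

```python
def prune_step(edges):
    """Drop the lowest-attention candidate on every edge that still has > 1.

    Intended to be called periodically during super-network training.
    """
    for edge in edges:
        if len(edge.ops) > 1:
            weakest = int(edge.importance.argmin())
            del edge.ops[weakest]   # nn.ModuleList supports item deletion
```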
S15: check whether the algorithm has reached the specified number of iterations.
This step determines whether training of the super-network containing the attention modules has reached the specified number of iterations, i.e., whether more than one candidate operation remains on any edge of the unit structure. If not, training continues from step S13; once the specified number of iterations is reached, i.e., only one operation remains on each edge, the method proceeds to the next step.
S16: obtain the target optimal unit structure.
In this step, candidate operations with low attention weights are deleted step by step through steps S13 to S15 until only the operation with the largest attention weight remains on each edge of the final cell structure, yielding the optimal target cell structure for the task.
S17: retrain the target structure on the entire data set and verify its performance.
In this step, the target structure obtained in step S16 is retrained on the entire training set until convergence, and its performance index is then tested on the test set.
Example 2
An embodiment of the present invention further provides an apparatus for searching a neural network unit structure based on an attention mechanism. As shown in FIG. 4, the apparatus includes: a unit structure construction module 21, an attention module 22, a unit structure searching and optimizing module 23, a unit structure evaluation and update module 24, a unit structure acquisition module 25, and a target network training and verification module 26.
The unit structure construction module 21 is configured to construct the directed acyclic graph structure of the unit structure, composed of N intermediate nodes and E edges. The attention module 22, which consists of a global average pooling layer, fully connected layers and a Sigmoid layer, is added to each edge of the unit structure to extract the attention weights of all candidate operations on that edge, i.e., the importance of each candidate operation. The unit structure searching and optimizing module 23 feeds the training set into the unit structure for forward propagation and optimizes the weight parameters of the different candidate operations in the unit structure and the weight parameters of the attention modules through backpropagation. The unit structure evaluation and update module 24 evaluates the importance of each candidate operation in the unit structure, deletes the operations with low attention weights according to each operation's weight, and then updates the topology of the unit structure. The unit structure acquisition module 25 obtains the optimal unit structure model according to the importance of each candidate operation on each edge of the unit structure. The target network training and verification module 26 retrains the obtained optimal unit structure model on the full training set to optimize its weight parameters, and verifies the performance of the searched unit structure model on the test set.
In the attention-mechanism-based neural network unit structure searching device provided by this embodiment of the invention, the working process of each module has the same technical features as the attention-mechanism-based neural network unit structure searching method; the same functions can therefore be achieved in the same way and are not described again.
The following are system embodiments corresponding to the above method embodiments, and this embodiment can be implemented in cooperation with the above embodiments. The related technical details mentioned in the above embodiments remain valid in this embodiment and are not repeated here; correspondingly, the related technical details mentioned in this embodiment can also be applied to the above embodiments.
The invention also provides a neural network unit structure searching system based on the attention mechanism, which comprises the following components:
an initialization module, for constructing a macro-architecture super-network in a search space, wherein each layer's unit structure in the macro-architecture super-network is a directed acyclic graph whose nodes comprise input nodes, intermediate nodes and output nodes; the input nodes receive the output feature maps of the preceding unit structures, each intermediate node aggregates the feature maps of all preceding nodes in the unit structure, the output node concatenates the feature maps of all intermediate nodes, the nodes of the directed acyclic graph are connected by edges, and each edge represents a combination of a plurality of candidate operations in the search space;
an adding module, for adding an attention module after the output feature maps of all candidate operations of each edge in the unit structure to obtain a network to be searched;
and a searching module, for training the network to be searched with a labeled data set, gradually deleting the candidate operation with the smallest attention weight on each edge of the unit structure of the intermediate search network during training until a preset number of iterations is reached, and then removing all attention modules from the current network to be searched to obtain the neural network unit structure search result for the data set.
In the neural network unit structure searching system, the data set comprises a plurality of samples, each sample having a corresponding label; the samples are pictures and the labels are picture categories; the search space is the DARTS search space.
The attention-mechanism-based neural network unit structure search system further comprises:
a picture classification module, for training the neural network unit structure search result with the data set to obtain a picture classification model, and inputting a picture to be classified into the model to obtain the picture category of the picture to be classified.
In the attention-mechanism-based neural network unit structure search system, each edge in a unit structure of the macro-architecture super-network consists of a plurality of candidate operations; an edge with m candidate operations produces m output feature maps $F_1, \dots, F_m$, each of size $\mathbb{R}^{b \times c \times h \times w}$. The m feature maps are concatenated along the channel dimension to obtain the concatenated feature $F_{con} \in \mathbb{R}^{b \times mc \times h \times w}$, which is input into the attention module to calculate the attention weight of each candidate operation; the attention module consists of a global average pooling layer, fully connected layers and a Sigmoid layer.
The searching module is configured to calculate the attention weight of each candidate operation on each edge in the unit structure of the network to be searched: the concatenated feature $F_{con}$ of all candidate operations on each edge is reduced by global average pooling to the pooled feature $F_{gap} \in \mathbb{R}^{b \times mc \times 1 \times 1}$; the feature $F_{gap}$ then passes through two fully connected layers and a Sigmoid layer, whose output is the attention weight $A \in \mathbb{R}^{b \times mc \times 1 \times 1}$.
The invention also provides a storage medium storing a program for executing any of the above attention-mechanism-based neural network unit structure searching methods.
The invention also provides a client for use in any of the above attention-mechanism-based neural network unit structure search systems.

Claims (10)

1. A neural network unit structure searching method based on an attention mechanism is characterized by comprising the following steps:
step 1, constructing a macro-architecture super-network in a search space, wherein each layer's unit structure in the macro-architecture super-network is a directed acyclic graph whose nodes comprise input nodes, intermediate nodes and output nodes; the input nodes receive the output feature maps of the preceding unit structures, each intermediate node aggregates the feature maps of all preceding nodes in the unit structure, the output node concatenates the feature maps of all intermediate nodes, the nodes of the directed acyclic graph are connected by edges, and each edge represents a combination of a plurality of candidate operations in the search space;
step 2, adding an attention module after the output feature maps of all candidate operations of each edge in the unit structure to obtain a network to be searched;
and step 3, training the network to be searched with a labeled data set, gradually deleting the candidate operation with the smallest attention weight on each edge of the unit structure of the intermediate search network during training until a preset number of iterations is reached, and then removing all attention modules from the current network to be searched to obtain the neural network unit structure search result for the data set.
2. The method of claim 1, wherein the data set comprises a plurality of samples, each sample having a corresponding label, the samples being pictures and the labels being picture categories; the search space is a DARTS search space.
3. The attention-mechanism-based neural network unit structure searching method of claim 2, further comprising:
step 4, training the neural network unit structure search result with the data set to obtain a picture classification model, and inputting a picture to be classified into the model to obtain the picture category of the picture to be classified.
4. The method according to claim 1 or 2, wherein each edge in the unit structure of the macro-architecture super-network consists of a plurality of candidate operations; an edge with m candidate operations produces m output feature maps $F_1, \dots, F_m$, each of size $\mathbb{R}^{b \times c \times h \times w}$; the m feature maps are concatenated along the channel dimension to obtain the concatenated feature $F_{con} \in \mathbb{R}^{b \times mc \times h \times w}$, which is input into the attention module to calculate the attention weight of each candidate operation, the attention module consisting of a global average pooling layer, fully connected layers and a Sigmoid layer;
step 3 comprises calculating the attention weight of each candidate operation on each edge in the unit structure of the network to be searched: the concatenated feature $F_{con}$ of all candidate operations on each edge is reduced by global average pooling to the pooled feature $F_{gap} \in \mathbb{R}^{b \times mc \times 1 \times 1}$; the feature $F_{gap}$ then passes through two fully connected layers and a Sigmoid layer, whose output is the attention weight $A \in \mathbb{R}^{b \times mc \times 1 \times 1}$.
5. An attention-mechanism-based neural network unit structure search system, characterized by comprising:
an initialization module, for constructing a macro-architecture super-network in a search space, wherein each layer's unit structure in the macro-architecture super-network is a directed acyclic graph whose nodes comprise input nodes, intermediate nodes and output nodes; the input nodes receive the output feature maps of the preceding unit structures, each intermediate node aggregates the feature maps of all preceding nodes in the unit structure, the output node concatenates the feature maps of all intermediate nodes, the nodes of the directed acyclic graph are connected by edges, and each edge represents a combination of a plurality of candidate operations in the search space;
an adding module, for adding an attention module after the output feature maps of all candidate operations of each edge in the unit structure to obtain a network to be searched;
and a searching module, for training the network to be searched with a labeled data set, gradually deleting the candidate operation with the smallest attention weight on each edge of the unit structure of the intermediate search network during training until a preset number of iterations is reached, and then removing all attention modules from the current network to be searched to obtain the neural network unit structure search result for the data set.
6. The neural network unit structure searching system of claim 5, wherein the data set comprises a plurality of samples, each sample having a corresponding label, the samples being pictures, the labels being picture categories; the search space is a DARTS search space.
7. The attention-mechanism-based neural network unit structure search system of claim 6, further comprising:
a picture classification module, for training the neural network unit structure search result with the data set to obtain a picture classification model, and inputting a picture to be classified into the model to obtain the picture category of the picture to be classified.
8. The attention-mechanism-based neural network unit structure search system of claim 5 or 6, wherein each edge in the unit structure of the macro-architecture super-network consists of a plurality of candidate operations; an edge with m candidate operations produces m output feature maps $F_1, \dots, F_m$, each of size $\mathbb{R}^{b \times c \times h \times w}$; the m feature maps are concatenated along the channel dimension to obtain the concatenated feature $F_{con} \in \mathbb{R}^{b \times mc \times h \times w}$, which is input into the attention module to calculate the attention weight of each candidate operation, the attention module consisting of a global average pooling layer, fully connected layers and a Sigmoid layer;
the searching module is used for calculating the attention weight of each candidate operation on each edge in the unit structure of the network to be searched: the concatenated feature $F_{con}$ of all candidate operations on each edge is reduced by global average pooling to the pooled feature $F_{gap} \in \mathbb{R}^{b \times mc \times 1 \times 1}$; the feature $F_{gap}$ then passes through two fully connected layers and a Sigmoid layer, whose output is the attention weight $A \in \mathbb{R}^{b \times mc \times 1 \times 1}$.
9. A storage medium storing a program for executing the attention-mechanism-based neural network unit structure searching method of any one of claims 1 to 4.
10. A client for use in the attention-mechanism-based neural network unit structure search system of any one of claims 5 to 8.
CN202210219650.XA (filed 2022-03-08, priority 2022-03-08): Attention mechanism-based neural network unit structure searching method and system. Status: Pending. Published as CN114626506A (en).

Priority Applications (1)

CN202210219650.XA, published as CN114626506A (en): Attention mechanism-based neural network unit structure searching method and system

Publications (1)

CN114626506A (en), published 2022-06-14

Family

ID=81899583

Family Applications (1)

CN202210219650.XA (priority/filing date 2022-03-08), published as CN114626506A (en): Attention mechanism-based neural network unit structure searching method and system

Country Status (1)

CN: CN114626506A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
CN116736713A * (priority 2023-06-13, published 2023-09-12, assignee 天津国能津能滨海热电有限公司): Power plant combustion control system and method based on NARX prediction model



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination