CN113344181B - Neural network structure searching method and device, computer equipment and storage medium


Info

Publication number: CN113344181B
Application number: CN202110602497.4A
Authority: CN (China)
Prior art keywords: network, vit, neural network, processing block, search
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN113344181A
Inventors: 苏修, 游山, 王飞, 钱晨
Current Assignee: Beijing Sensetime Technology Development Co Ltd
Original Assignee: Beijing Sensetime Technology Development Co Ltd
Application filed by Beijing Sensetime Technology Development Co Ltd; priority to CN202110602497.4A; published as CN113344181A; granted and published as CN113344181B.

Classifications

    • G — PHYSICS › G06 — COMPUTING; CALCULATING OR COUNTING › G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/00 — Computing arrangements based on biological models
    • G06N3/04 — Neural networks; architecture, e.g. interconnection topology
    • G06N3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N3/086 — Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G06N3/126 — Computing arrangements based on genetic models; evolutionary algorithms, e.g. genetic algorithms or genetic programming

Abstract

The present disclosure provides a structure search method, apparatus, computer device and storage medium for a neural network, wherein the method comprises: based on structural parameters of a visual depth self-attention transformation ViT network, constructing a search space of a super-large neural network corresponding to the ViT network, wherein the search space of the super-large neural network comprises a plurality of processing block layers, and each processing block layer comprises at least one processing block constructed by a multi-head self-attention mechanism layer and a multi-layer perceptron; carrying out structure search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network, wherein the target search path comprises one processing block of each of the plurality of processing block layers; and determining, based on the target search path, the structure of the ViT network for processing image data. Therefore, the structure of the ViT network can be accurately determined, a ViT network structure with better performance can be obtained, and the processing precision of the ViT network is improved.

Description

Neural network structure searching method and device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of deep learning technologies, and in particular, to a structure search method and apparatus for a neural network, a computer device, and a storage medium.
Background
Among neural networks, the depth self-attention transformation network (Transformer) is widely applied in the processing of sequence data (such as speech recognition and processing), while the Visual depth self-attention transformation (ViT) network inherits the characteristics of the Transformer to solve deep learning problems in the image field. When a ViT network is designed, its structure is usually designed manually; the ViT network structure obtained in this way often cannot guarantee the performance of the ViT network, which affects the processing precision when the ViT network is used to process images.
Disclosure of Invention
The embodiment of the disclosure at least provides a neural network structure searching method, a neural network structure searching device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a method for searching a structure of a neural network, including: based on structural parameters of a visual depth self-attention transformation ViT network, constructing a search space of a super-large neural network corresponding to the ViT network, wherein the search space of the super-large neural network comprises a plurality of processing block layers, and each processing block layer comprises at least one processing block constructed by a multi-head self-attention mechanism layer and a multi-layer perceptron; performing structure search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network, wherein the target search path comprises one processing block of each processing block layer in the plurality of processing block layers; determining a structure of the ViT network for processing image data based on the target search path.
Therefore, by constructing a search space of a corresponding super-large neural network specifically for the ViT network, the structure search of the ViT network can be carried out in a neural network architecture search manner. Because the search space is constructed based on the structural features of the ViT network, the structure of the ViT network is accurately determined on the premise of automatic structure search, so that a ViT network structure with better performance is obtained and the processing precision of the ViT network is improved.
In an alternative embodiment, the structural parameters include at least one of: the number of processing blocks, the number of self-attention feature processing modules corresponding to the multi-head self-attention mechanism layer in each processing block, the output dimension of the fully connected layer in each processing block, and the segmentation size of the image data input into the first processing block.
In an optional embodiment, the performing a structure search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network includes: training neural networks corresponding to a plurality of paths to be searched sampled from the search space based on the image sample set, to obtain a trained super-large neural network; and determining the target search path from the trained super-large neural network.
Therefore, the trained super-large neural network is obtained by training the neural networks corresponding to the paths to be searched using the image samples, and the target search path is determined from the trained super-large neural network; the trained super-large neural network can be further adjusted, the better structural parameters in the super-large neural network are gradually retained, and the parameters that cannot improve the accuracy of the ViT network are screened out.
In an optional embodiment, the training, based on the image sample set, a network structure corresponding to a path to be searched sampled from the search space to obtain a trained super-large neural network includes: executing training of a plurality of iteration cycles until a preset iteration stop condition is reached; using the super-large neural network obtained in the last iteration cycle as the trained super-large neural network; wherein the training of any iteration cycle comprises: determining a path to be searched in the current iteration cycle based on the search space; and training the neural network corresponding to the path to be searched by utilizing the image sample set.
Therefore, under the condition of consuming less computing power and time, the optimization of the structural parameters in the super-large neural network can be ensured, so that the further training and learning with better accuracy can be ensured, and the searching efficiency is improved.
In an optional embodiment, the preset iteration stop condition includes: and reaching a preset iteration number, and/or reaching a preset precision threshold value of the training precision of the path to be searched on the image sample set.
In an optional implementation manner, the structure search method further includes: acquiring an original image training sample set; and performing data enhancement processing on the original image training sample by using a predetermined data enhancement mode to obtain the image sample set.
Therefore, the data enhancement processing is carried out on the original image training sample by utilizing the predetermined data enhancement mode, the convergence of the super-large neural network can be ensured, and the ViT network with a better structure can be constructed.
In an alternative embodiment, the predetermined data enhancement mode includes at least one of: random cropping, random inversion, and label smoothing.
In an optional implementation manner, the predetermined data enhancement manner is determined based on the accuracy and/or the convergence degree of the super-large neural network after the structure search is performed on the super-large neural network based on the search space and the experimental sample.
In an optional implementation manner, the determining, based on the search space, a path to be searched in a current iteration cycle includes: determining the value of the structural parameter of each processing block in the ViT network based on the search space; and generating the path to be searched in the current iteration period based on the value of the structural parameter of each processing block in the ViT network.
In an alternative embodiment, the data Value parameters used by the linear mappings corresponding to the self-attention feature processing modules in the same processing block are the same in different iteration cycles.
In this way, the amount of data that needs to be processed can be further reduced.
In an optional implementation manner, the structure search method further includes: and retraining the ViT network based on the structure of the ViT network to obtain a target ViT network.
Therefore, when the ViT network is retrained, the structural parameters in the ViT network can be further adjusted to further improve the processing precision of the ViT network, so that the determined ViT network can better identify a plurality of different objects in the image when identifying the objects in the image, and can accurately classify and identify the different objects; namely, a ViT network with higher accuracy can be obtained.
In a second aspect, an embodiment of the present disclosure further provides a structure searching apparatus for a neural network, including: a construction module, configured to construct, based on structural parameters of a visual depth self-attention transformation ViT network, a search space of a super-large neural network corresponding to the ViT network, wherein the search space of the super-large neural network comprises a plurality of processing block layers, and each processing block layer comprises at least one processing block constructed by a multi-head self-attention mechanism layer and a multi-layer perceptron; a structure searching module, configured to perform structure search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network, wherein the target search path comprises one processing block of each of the plurality of processing block layers; and a determining module, configured to determine a structure of the ViT network for processing image data based on the target search path.
In an alternative embodiment, the structural parameters include at least one of: the processing block number, the number of the self-attention feature processing modules corresponding to the multi-head self-attention mechanism layer in each processing block, the output dimension of the full-connection layer in each processing block, and the segmentation size of the image data input into the first processing block.
In an optional implementation manner, when the structure search module performs a structure search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network, the structure search module is configured to: training neural networks corresponding to a plurality of paths to be searched sampled from the search space based on the image sample set to obtain a trained super-large neural network; and determining the target search path from the trained super-large neural network.
In an optional embodiment, the structure search module, when training a network structure corresponding to a path to be searched sampled from the search space based on the image sample set to obtain a trained super-large neural network, is configured to: executing training of a plurality of iteration cycles until a preset iteration stop condition is reached; using the super-large neural network obtained in the last iteration cycle as the trained super-large neural network; wherein the training of any iteration cycle comprises: determining a path to be searched in the current iteration cycle based on the search space; and training the neural network corresponding to the path to be searched by utilizing the image sample set.
In an optional embodiment, the preset iteration stop condition includes: and reaching a preset iteration number, and/or reaching a preset precision threshold value of the training precision of the path to be searched on the image sample set.
In an optional implementation manner, the structure searching apparatus further includes: a data processing module; the data processing module is used for: acquiring an original image training sample set; and performing data enhancement processing on the original image training sample by using a predetermined data enhancement mode to obtain the image sample set.
In an alternative embodiment, the predetermined data enhancement mode includes at least one of: random cropping, random inversion, and label smoothing.
In an optional embodiment, the predetermined data enhancement mode is determined based on the accuracy and/or the convergence degree of the super-large neural network after performing a structure search on the super-large neural network based on the search space and the experimental sample.
In an alternative embodiment, when determining the path to be searched in the current iteration cycle based on the search space, the structure search module is configured to: determining the value of the structural parameter of each processing block in the ViT network based on the search space; and generating the path to be searched in the current iteration cycle based on the value of the structural parameter of each processing block in the ViT network.
In an alternative embodiment, the data Value parameters used by the linear mappings corresponding to the self-attention feature processing modules in the same processing block are the same in different iteration cycles.
In an optional embodiment, the structure searching apparatus further includes a retraining module; the retraining module is to: and retraining the ViT network based on the structure of the ViT network to obtain a target ViT network.
In a third aspect, embodiments of the present disclosure further provide a computer device, including a processor and a memory, where the memory stores machine-readable instructions executable by the processor, and the processor is configured to execute the machine-readable instructions stored in the memory; when the machine-readable instructions are executed by the processor, the processor performs the steps in the first aspect, or in any one of the possible implementations of the first aspect.
In a fourth aspect, alternative implementations of the present disclosure also provide a computer-readable storage medium having a computer program stored thereon, where the computer program is executed to perform the steps of the first aspect or any one of the possible implementations of the first aspect.
For the description of the effects of the structure searching apparatus, the computer device, and the computer-readable storage medium of the neural network, reference is made to the description of the structure searching method of the neural network, and details are not repeated here.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments will be briefly described below; the drawings herein are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It is to be understood that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; for those of ordinary skill in the art, additional related drawings may be derived from these drawings without any creative effort.
Fig. 1 is a flowchart illustrating a structure searching method of a neural network provided by an embodiment of the present disclosure;
fig. 2 shows a schematic structural diagram of a ViT network provided by an embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating a structure searching apparatus of a neural network provided in an embodiment of the present disclosure;
fig. 4 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of embodiments of the present disclosure, as generally described and illustrated herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the disclosure is not intended to limit the scope of the disclosure as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making any creative effort, shall fall within the protection scope of the disclosure.
Research shows that, compared with the Transformer, which can complete speech recognition and processing tasks accurately and quickly, the ViT network has a further adjusted structure so as to better adapt to deep learning problems in the image field. For example, part of the decoder is deleted from the structure of the ViT network; to make it more suitable for processing images, a structure for segmenting and encoding the input image is added; and the respective processing layers within each processing block (block) are changed. Because the ViT network has this special structure, it is usually constructed by manual design and then further adjusted according to the actual performance of the constructed ViT network to determine its structure; when the structure of the ViT network is determined in this way, manual adjustment often cannot guarantee the performance of the ViT network, and the obtained ViT network has low processing precision when processing images.
Based on the research, the disclosure provides a structure searching method of a ViT network, which constructs a searching space of a corresponding super-large neural network for the ViT network according to the structure parameters of the ViT network, performs structure searching in the searching space by using a structure searching mode, and further determines the structure of the ViT network by using a determined target searching path. In this way, by specifically constructing a Search space of the huge Neural network corresponding to the ViT network, the ViT network can be further subjected to structure Search in a Neural Architecture Search (NAS) manner. Because the search space is constructed based on the ViT network structure characteristics, the structure of the ViT network is accurately determined on the premise of automatic structure search, so that the ViT network structure with better performance is obtained, and the processing precision of the ViT network is improved.
The above-mentioned drawbacks are the result of the inventor's practical and careful study; therefore, the discovery process of the above-mentioned problems and the solutions proposed by the present disclosure to these problems should be regarded as the inventor's contribution to the present disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
To facilitate understanding of the present embodiment, first, a structure searching method for a neural network disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the structure searching method for a neural network provided in the embodiments of the present disclosure is generally a computer device with certain computing power, and the computer device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle mounted device, a wearable device, or a server or other processing device. In some possible implementations, the structure search method of the neural network may be implemented by a processor calling computer readable instructions stored in a memory.
The following is a description of a method for searching a neural network structure provided in an embodiment of the present disclosure.
Referring to fig. 1, a flowchart of a structure searching method of a neural network provided in an embodiment of the present disclosure is shown, where the method includes steps S101 to S103, where:
s101: based on structural parameters of a visual depth self-attention transformation ViT network, constructing a search space of a super-large neural network corresponding to the ViT network, wherein the search space of the super-large neural network comprises a plurality of processing block layers, and each processing block layer comprises at least one processing block constructed by a multi-head self-attention mechanism layer and a multi-layer perceptron;
s102: performing structure search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network, wherein the target search path comprises one processing block of each processing block layer in the plurality of processing block layers;
s103: determining a structure of the ViT network for processing image data based on the target search path.
The method and apparatus construct the search space of the super-large neural network corresponding to the ViT network based on the structural parameters of the ViT network, so as to perform structure search on the super-large neural network in the search space, determine the target search path of the ViT network, and further determine the structure of the ViT network. Because the search space is constructed based on the structural features of the ViT network, the structure of the ViT network is accurately determined on the premise of automatic structure search, so that a ViT network structure with better performance is obtained and the processing precision of the ViT network is improved.
First, the structure of the ViT network and corresponding structure parameters will be described.
Referring to fig. 2, a schematic structural diagram of a ViT network provided in the embodiment of the present disclosure is shown. In fig. 2, an image Embedded layer (Embedded Patches) 21 is shown, as well as a processing block layer in the ViT network. In one possible embodiment, a processing block layer may comprise, for example, at least one processing block (block); in the embodiment of the present disclosure, an example in which one processing block is included in the processing block layer is described.
The ViT network may include, for example, a plurality of stacked processing blocks; for any processing block, the next processing block connected to it receives the data output by that processing block.
Illustratively, the ViT network may include L processing blocks, represented in fig. 2 in the form of "L ×". The first processing block 22 may further include a Normalization Layer (Layer Normalization) 23, a Multi-Head Attention layer (Multi-Head Attention) 24, a normalization layer 25, and a Multi-Layer Perceptron (MLP) 26. Among them, the multi-head attention layer 24 processes the features input to the processing block based on a self-attention mechanism.
As for the image embedding layer 21, it may perform segmentation processing on the image input into the ViT network. Since the Transformer structure operates on an input Sequence, the image embedding layer 21 may segment an image input to the ViT network, process each sub-image (patch) obtained after segmentation into a vector, and form an image sequence from the positions of the plurality of sub-images in the image; in this way, the image to be processed can be converted into a vector sequence formed by the vectors corresponding to the plurality of sub-images.
For example, when classification recognition processing is performed on an image to be processed having a size of 9000 × 9000 (in pixels), the image embedding layer 21 may divide it into sub-images having a size of P1 × P2, for example, 3 × 3 sub-images each having a size of 3000 × 3000. Each sub-image can then be abstractly expressed using the matrix determined from its pixel matrix together with its position information, so as to determine the output data of the image embedding layer 21 corresponding to that sub-image.
Specifically, the output data of the image embedding layer (i.e., the input data of the first processing block) satisfies the following formula (1):

$$z_0 = [x_{class};\; x_p^1 E;\; x_p^2 E;\; \cdots;\; x_p^N E] + E_{pos} \tag{1}$$

After the image is divided into two-dimensional (2D) sub-images of size P1 × P2, each image can correspond to C channels, for example the three channels Red (R), Green (G), and Blue (B). Thus, each sub-image has dimension P1 × P2 × C, i.e., each sub-image can be written as a vector of size P1 × P2 × C. The vector corresponding to the i-th of the N sub-images is denoted $x_p^i \in \mathbb{R}^{P_1 \cdot P_2 \cdot C}$.

In formula (1), E is a matrix that maps the input sub-images to their output representations; it may be, for example, a matrix acting on the original pixels of the image to be processed, or may be replaced by a Convolutional Neural Network (CNN) or an MLP. Since the ViT network uses constant hidden vectors (constant hidden vectors) of size D in all of its layers, the sub-images are reduced in dimension and mapped onto vectors of length D by a trainable linear projection: multiplying the vector $x_p^i$ of a sub-image with the matrix E yields the corresponding vector of length D. The matrix E has dimension $\mathbb{R}^{(P_1 \cdot P_2 \cdot C) \times D}$.

$x_{class}$ represents one additional learnable vector; in general, it may be an all-zero vector of length D. In addition, a position encoding (Position Embedding) $E_{pos}$ may be set in order to express the positions of the different sub-images within the image to be processed. Since, besides the vectors of the N sub-images, the additional learnable vector $x_{class}$ is also included, $E_{pos}$ has dimension $\mathbb{R}^{(N+1) \times D}$.

Using $x_{class}$, the vectors $x_p^i E$, and $E_{pos}$, the output data of the image embedding layer 21, i.e., the input data $z_0$ of the first processing block 22, can be calculated.
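For illustration only, the computation of formula (1) can be sketched as follows in PyTorch; the module name, the default sizes, and the all-zero initialization of the class token and position encoding are assumptions for the sketch, not details fixed by the disclosure.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of the image embedding layer of formula (1). Sizes are
    illustrative assumptions (224x224 input, 16x16 sub-images, D=192)."""
    def __init__(self, img_size=224, patch_size=16, channels=3, dim=192):
        super().__init__()
        n = (img_size // patch_size) ** 2                   # N sub-images
        self.p = patch_size
        # trainable linear projection E: (P1*P2*C) -> D
        self.proj = nn.Linear(patch_size * patch_size * channels, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))      # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, n + 1, dim))  # E_pos

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c = x.shape[:2]
        # split the image into P1 x P2 sub-images, flatten each to a vector
        x = x.unfold(2, self.p, self.p).unfold(3, self.p, self.p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * self.p * self.p)
        x = self.proj(x)                                    # x_p^i * E
        cls = self.cls_token.expand(b, -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed  # z_0: (B, N+1, D)

z0 = PatchEmbedding()(torch.randn(2, 3, 224, 224))          # -> (2, 197, 192)
```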
Taking the first processing block 22 as an example, the data $z_0$ output by the image embedding layer 21 is input to the first processing block 22. In the following, $\ell$ indicates the level of the current processing block among the L processing blocks included in the ViT network.
The normalization layer 23 and the multi-head attention mechanism layer 24 in the first processing block 22 process the input data $z_0$, for example, according to the following formula (2):

$$z'_{\ell} = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1, \ldots, L \tag{2}$$

wherein LN(·) represents the normalization process, which is a conventional step and will not be described herein. By means of the normalization process, independent feature scaling can be performed on each sub-image, and more feature information of the sub-images can be preserved.

MSA(·) denotes the multi-head self-attention processing; since the data processed by the multi-head attention mechanism can be mapped to different attention feature processing spaces, information in these different spaces can be attended to, and richer feature information can be captured.
After $z'_{\ell}$ is obtained by means of the multi-head attention mechanism layer 24, the output data $z_{\ell}$ of the $\ell$-th layer processing block can be determined through the normalization layer 25 and the multi-layer perceptron 26 according to the following formula (3):

$$z_{\ell} = \mathrm{MLP}(\mathrm{LN}(z'_{\ell})) + z'_{\ell}, \quad \ell = 1, \ldots, L \tag{3}$$

wherein MLP(·) represents the multi-layer perceptron processing, and the specific process is not described again.
In addition, for the L-th (last) layer processing block, the corresponding image representation (image representation) y can be determined, for example, according to the following formula (4):

$$y = \mathrm{LN}(z_L^0) \tag{4}$$
After the output data $z_{\ell}$ of the $\ell$-th layer processing block is obtained, it can be input to the $(\ell+1)$-th layer processing block for continued processing, until all L layers have completed processing and the classification result of the image to be processed is determined.
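As a minimal sketch of formulas (2) to (4), one processing block can be written as follows in PyTorch; the dimensions and the two-layer MLP are illustrative assumptions, not the disclosure's fixed configuration.

```python
import torch
import torch.nn as nn

class ProcessingBlock(nn.Module):
    """Sketch of one processing block: formula (2) then formula (3).
    dim, num_heads, and mlp_dim are illustrative assumptions."""
    def __init__(self, dim=192, num_heads=3, mlp_dim=768):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)      # normalization layer 23
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)      # normalization layer 25
        self.mlp = nn.Sequential(           # multi-layer perceptron 26
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.norm1(z)                   # formula (2): z' = MSA(LN(z)) + z
        z = self.attn(h, h, h, need_weights=False)[0] + z
        return self.mlp(self.norm2(z)) + z  # formula (3): z = MLP(LN(z')) + z'

# Formula (4): after the L-th block, the image representation is the
# normalized class-token vector, y = LN(z_L^0).
blocks = nn.Sequential(*[ProcessingBlock() for _ in range(12)])
y = nn.LayerNorm(192)(blocks(torch.randn(2, 197, 192))[:, 0])
```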
Here, the above description of fig. 2 is only one possible way for the ViT network to process the image to be processed; for the ViT network, there may be other variant forms, and the image to be processed is processed according to different processing manners, which is not described herein again.
The following describes in detail the above-mentioned S101 to S103, taking the ViT network listed in the above-mentioned fig. 2 as an example.
For the above S101, the structural parameters of the ViT network include at least one of the following: the number of processing blocks, the number of self-attention feature processing modules corresponding to the multi-head self-attention mechanism layer in each processing block, the output dimension of the fully connected layer in each processing block, and the segmentation size of the image data input into the first processing block.
The number of processing blocks may correspond to L, for example; from experience or experimental verification, it may be determined that the number of processing blocks may include, for example, 10, 11, …, or 18. In one possible case, when a search path in the super-large neural network is trained using the search space, if the number of processing blocks determined in the current iteration cycle is less than a preset maximum threshold on the number of processing blocks, a plurality of masking (shielding) modules can be arranged in parallel with the processing blocks, in order to prevent the parameters of the network layers in the masked processing blocks from participating in the search process of the current iteration cycle. When it is determined that a processing block does not participate in the search process of the current iteration cycle, the processing block at the corresponding position is replaced by a masking module, so that the search path determined in the current iteration cycle passes through the masking module corresponding to that processing block instead of through the processing block itself.
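A minimal sketch of this masking idea, under the assumption that a masked processing block is replaced by an identity mapping so that its parameters stay out of the current iteration cycle; the names and the maximum depth of 18 are illustrative.

```python
import torch.nn as nn

MAX_BLOCKS = 18  # assumed preset maximum threshold on the number of blocks

def build_depth_masked_path(blocks: nn.ModuleList, sampled_depth: int):
    """Replace the blocks beyond the sampled depth with masking modules
    (identity mappings), so the search path bypasses their parameters."""
    layers = [blocks[i] if i < sampled_depth else nn.Identity()
              for i in range(MAX_BLOCKS)]
    return nn.Sequential(*layers)
```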
The number of self-attention feature processing modules corresponding to each processing block is the number of parallel self-attention feature processing modules in the multi-head self-attention mechanism layer, that is, the number of "heads" in the multi-head attention mechanism layer. Illustratively, the number of self-attention feature processing modules may include, for example, 3, 4, or 5. In addition, the parameters of the different self-attention feature processing modules corresponding to the multi-head self-attention mechanism layer may be the same or different; therefore, the parameters of the different self-attention feature processing modules can also serve as structural parameters of the ViT network.
When determining the output dimension of the fully connected layer in each processing block, the output dimension may be determined, for example, according to the structure of the multi-layer perceptron. For example, the multi-layer perceptron may comprise a fully connected layer having a three-layer network structure, where the last layer outputs n different output data; the output dimension n may then be taken as the output dimension of the fully connected layer. In addition, the number of network layers of the fully connected layer may also be part of the structural parameters of the ViT network.
As for the first processing block, unlike the other processing blocks, it receives the image data input from the image embedding layer. Further, when the segmentation method of the image data differs, the sizes of the data handled by the processing blocks differ, and thus the required number of processing blocks may also differ. Therefore, the slicing size of the image data may also be taken as part of the structural parameters of the ViT network.
Here, the slice size may include, for example, a size of a sub-image obtained by slicing the image data, such as 30 × 30, 40 × 60, or 100 × 100; another example may include a method of slicing the image data, such as sequentially reading pixel values of the image data, and taking every 1000 read pixel values as pixel values of the sub-image, or taking every 2000 read pixel values as pixel values of the sub-image. When the search space corresponding to the segmentation size of the image data is specifically determined, the search space may be determined according to an actual situation, which is not described herein again.
The above-described structural parameters of the ViT network only include some examples, and the network structure of the ViT network may include other structural parameters that can be adjusted and that can improve or enhance the ViT network after adjustment, which are all included in the structural parameters of the ViT network provided in the embodiments of the present disclosure, and are not described in detail herein for example.
After the structural parameters of the ViT network are determined, the search space of the super-large neural network of the ViT network can be determined according to the structural parameters of the ViT network. Illustratively, a neural network architecture searching method can be utilized, and in a searching space of the super-large neural network, the structure parameters of the ViT network are sequentially constructed at a specific level in the super-large neural network according to the structure of the ViT network, so as to construct the super-large neural network.
Here, the search space of the super-large neural network includes a plurality of network layers, the plurality of network layers includes a plurality of processing block layers, and each processing block layer includes at least one processing block described above. In the search space of the super-large neural network, the parameter values that can serve as the structural parameters corresponding to a layer can be searched within the network layer included in the super-large neural network. For example, suppose the first network layer of the super-large neural network includes the parameters corresponding to the slicing size applied to the image data by the first processing block, specifically 30 × 30, 80 × 90, 100 × 120, and 150 × 150. When a search is performed in this network layer of the super-large neural network, the network nodes corresponding to 30 × 30, 80 × 90, 100 × 120, and 150 × 150 can be searched; in addition, each node may correspond to one searchable path. That is, this layer of the super-large neural network includes four different search paths, each passing through a different network node, and by searching, the slicing size of the image data can be determined as one of: 30 × 30, 80 × 90, 100 × 120, or 150 × 150.
For the network layer corresponding to the structural parameter in the processing block in the super-large neural network, the network layer is similar to the network layer corresponding to the segmentation size of the image data in the super-large neural network, and details are not repeated here. In addition, when the structures of different processing blocks are the same, the hierarchy of different processing blocks in a plurality of processing blocks is different, so that the characteristics represented by corresponding processing data are also different; in this way, the specific selectable values of the configuration parameters respectively corresponding to the different processing blocks are also different.
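For illustration, such a search space can be written down as a set of candidate values per network layer. The patch-size, depth, and head-count candidates below repeat the examples given in the text; the output dimensions are purely assumed.

```python
# Sketch of a search space assembled from the ViT structural parameters;
# values not quoted from the text above are illustrative assumptions.
search_space = {
    "patch_size": [(30, 30), (80, 90), (100, 120), (150, 150)],
    "depth": list(range(10, 19)),     # number of processing blocks: 10..18
    "num_heads": [3, 4, 5],           # self-attention feature processing modules
    "mlp_out_dim": [192, 384, 768],   # fully connected output dimension (assumed)
}
```

Every combination of one value per layer then corresponds to one searchable path through the super-large neural network.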
For the above S102, when performing structure search on the super-large neural network based on the search space and the image sample set to obtain the target search path of the ViT network, for example, the following method may be adopted: training a plurality of paths to be searched sampled from the search space based on the image sample set to obtain a trained super-large neural network; and determining the target search path from the trained super-large neural network.
Specifically, when each search path in the super-large neural network is trained based on the search space, for example, training of multiple iteration cycles may be performed until a preset iteration stop condition is reached; and taking the super-large neural network obtained in the last iteration cycle as the trained super-large neural network.
Here, the training of any iteration cycle includes: determining a path to be searched in the current iteration cycle based on the search space; and training the neural network corresponding to the path to be searched by utilizing the image sample set.
In specific implementation, when a path to be searched in a current iteration cycle is determined, for example, values of structural parameters of each processing block in the ViT network may be determined based on a search space; and then, generating a path to be searched in the current iteration period based on the values of the structural parameters of the processing blocks in the ViT network.
Because the number of the structural parameters in the ViT network is huge, for the oversized neural network, there are many search paths included, and the way of training the neural networks corresponding to the search paths one by one consumes excessive computing power and time, so that, for example, a random search mode can be selected to determine the paths to be searched, and the neural networks corresponding to a small number of selected paths to be searched are trained. Therefore, under the condition of consuming less computing power and time, the optimization of the structural parameters in the super-large neural network can be ensured, so that the further training and learning with better accuracy can be ensured, and the searching efficiency is improved.
Illustratively, taking the first iteration cycle as an example, based on a search space, a node may be randomly selected from each network layer in the super-large neural network in a random search manner, where each node corresponds to a value of a structural parameter of a processing block, and the nodes selected from the plurality of network layers may form a path to be searched, and in the path to be searched, a value of a structural parameter of each processing block in the ViT network may be determined. By using the determination method of the paths to be searched, a plurality of paths to be searched corresponding to the first iteration period can be determined.
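A minimal sketch of this random sampling of paths to be searched, reusing the search_space sketch above; in each iteration cycle, the sub-network defined by each sampled path would then be trained on the image sample set. All names are illustrative assumptions.

```python
import random

def sample_path(space, rng=random):
    """Randomly pick one node per network layer of the super-large neural
    network, yielding one path to be searched."""
    depth = rng.choice(space["depth"])
    return {
        "patch_size": rng.choice(space["patch_size"]),
        "depth": depth,
        "blocks": [{"num_heads": rng.choice(space["num_heads"]),
                    "mlp_out_dim": rng.choice(space["mlp_out_dim"])}
                   for _ in range(depth)],
    }

# a small number of paths per iteration cycle, repeated until the preset
# iteration stop condition (iteration count and/or precision threshold)
paths = [sample_path(search_space) for _ in range(4)]
```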
Alternatively, after each round of search is finished, the sampled paths to be searched may be evaluated based on the performance of the super-large neural network. An algorithm such as a genetic algorithm or reinforcement learning may be adopted, with the evaluation result used as the feedback information of the current round of search, for example, the fitness in a genetic algorithm or the reward value (reward) in reinforcement learning; in the next round of search, new paths to be searched are sampled based on the feedback information, so that the convergence speed of the super-large neural network can be increased.
In addition, in different iteration cycles, for the self-attention feature processing modules in the same processing block, the data Value parameters used by the corresponding linear mappings are the same, thereby further reducing the data amount required to be processed through parameter sharing. Optionally, in different iteration cycles, for the self-attention feature processing modules in the same processing block, the Query parameter used by the corresponding linear mapping is the same, and the Key value parameter Key used by the corresponding linear mapping is the same.
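One way this sharing can be realized is to let all head-count candidates of a processing block slice into a single maximal set of Query/Key/Value projection weights, so that different iteration cycles reuse the same underlying parameters. This layout is an assumption for illustration, not the disclosure's stated implementation.

```python
import torch
import torch.nn as nn

class SharedQKV(nn.Module):
    """Sketch: one maximal QKV projection per processing block; a sampled
    head count only slices it, so parameters are shared across cycles."""
    def __init__(self, dim=192, head_dim=64, max_heads=5):
        super().__init__()
        self.head_dim = head_dim
        self.qkv = nn.Linear(dim, 3 * max_heads * head_dim)  # shared weights

    def forward(self, x, num_heads):
        d = num_heads * self.head_dim        # active slice for this cycle
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        return q[..., :d], k[..., :d], v[..., :d]

q, k, v = SharedQKV()(torch.randn(2, 197, 192), num_heads=3)
```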
Under the condition that a plurality of paths to be searched are determined, the neural network corresponding to the paths to be searched can be trained by utilizing the image sample set.
The image sample set may be determined, for example, according to the following: acquiring an original image training sample set; and performing data enhancement processing on the original image training sample set by using a predetermined data enhancement mode to obtain the image sample set.
The original image training sample set may include, for example, a plurality of frames of images. The multi-frame image may include at least one identifiable object, such as a person, an animal, or a building. And data enhancement processing can be further carried out on the multi-frame images. Specifically, the predetermined data enhancement mode is determined by the Accuracy (Accuracy) and/or Convergence (Convergence) degree of the super-large neural network after performing the structure search on the super-large neural network by using the experimental sample, and may include at least one of: random cropping, random inversion, and label smoothing (label smoothing).
Specifically, when the structure of the ViT network is searched by using the super-large neural network, the ViT network includes a large number of processing blocks, and each processing block includes a plurality of different structural layers, so that the determined ViT network has a large depth. At this time, if the regularization mode is directly used to train the super-large neural network, the gradient disappears easily, so that the trained super-large neural network cannot achieve good convergence characteristics, or more time and calculation effort are consumed to adjust the structure of the ViT network in order to obtain the super-large neural network with good convergence characteristics.
Therefore, when data enhancement is performed on the original image training sample set, an experimental sample can first be selected and enhanced with several data enhancement processing modes respectively; then, for the experimental sample enhanced by each data enhancement processing mode, a structure search is performed on the super-large neural network based on the search space. According to the accuracy and/or convergence degree of the searched super-large neural network, the better-performing data enhancement processing modes are selected. The experimental sample may be obtained through different data enhancement processes, which may include, for example, mix enhancement (Mixup), cut filling (Cutout), mixed filling (CutMix), random depth, modified image attributes (ColorJitter), random cropping, random inversion, and label smoothing.
Specifically, the mix enhancement may include, for example, an enhancement mode in which two random image samples are mixed proportionally and the classification result is distributed in the same proportion, to obtain an experimental sample; the cut filling may include, for example, an enhancement mode in which a partial region of the image sample is randomly cut off and filled with 0 pixel values, with the classification result unchanged, to obtain an experimental sample; and the mixed filling may include, for example, an enhancement mode in which a partial region of the experimental sample is cut off but, instead of being filled with 0 pixel values, is randomly filled with the pixel values of that region from other data in the training set, with the classification result distributed in a certain proportion, to obtain an experimental sample.
After the experimental samples obtained through the different enhancement modes are used to perform a structure search on the super-large neural network, the accuracy and/or convergence degree of the super-large neural network under each enhancement mode can be determined. Using the accuracy and/or convergence degree of the super-large neural network obtained under the different enhancement modes, the better-performing subset of data enhancement processing modes, such as random cropping, random inversion, and label smoothing, can be screened out from the different enhancement modes and used as the predetermined data enhancement mode to perform data enhancement processing on the original image training samples, so as to improve the convergence speed of the super-large neural network without affecting the training precision, thereby constructing a ViT network with a better structure.
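A minimal sketch of the retained enhancement modes, assuming torchvision-style transforms; the crop size, padding, flip probability, and smoothing factor are illustrative assumptions (here "random inversion" is read as random flipping).

```python
import torch.nn as nn
import torchvision.transforms as T

# random cropping and random inversion (flipping) act on the images
augment = T.Compose([
    T.RandomCrop(224, padding=4),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

# label smoothing acts on the loss side rather than on the images
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```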
When the training sample is used for training the neural network corresponding to the path to be searched, the corresponding accuracy rate can be determined for different paths to be searched; the accuracy may include, for example, a predicted value, such as 60%, 73%, or 90%. By utilizing the accuracy rate respectively corresponding to each path to be searched, the parameters respectively corresponding to a plurality of nodes in the path can be reversely adjusted so as to correct the path to be searched. Because a plurality of paths to be searched can be obtained in a random searching mode and can comprise the same node, a plurality of nodes in the super-large neural network can be optimized by utilizing the random searching mode.
Similarly, since determining the paths to be searched by random search can cover only part of all the searchable paths in the super-large neural network, when training the search paths in the super-large neural network, the training of the current step can be considered complete once the preset iteration stop condition is reached.
Specifically, the preset iteration stop condition includes reaching a preset iteration number; that is, an iteration number, for example, 150 or 200, may be set, and after an iteration period exceeding a preset iteration number is performed, it may be considered that training of each search path in the super-large neural network is completed when the random search is completed. Or, the preset iteration stop condition may include that the training precision of the path to be searched on the image sample set reaches a preset precision threshold.
Then, a genetic algorithm can be used to determine the target search path from the trained super-large neural network. Specifically, when a genetic algorithm is utilized, any two search paths in the trained super-large neural network can be used as a father chromosome and a mother chromosome respectively; that is, for the father chromosome and the mother chromosome applied in the genetic algorithm, their corresponding genes respectively include the parameters of the nodes in the corresponding search paths. On the basis of sample verification, the performance of the networks respectively formed from the father chromosome and the mother chromosome is evaluated, and based on the evaluation result, the genes of the father chromosome and the mother chromosome are crossed, so that a new child chromosome, i.e., a new search path, can be obtained. The new search path can likewise be used to obtain a determined neural network, and the verification sample is then used to obtain the accuracy corresponding to that neural network, so as to judge whether the new search path has better performance than the two search paths corresponding to the father chromosome and the mother chromosome.
In addition, the structural parameters in the trained super-large neural network can be randomly mutated using the genetic algorithm, a corresponding neural network can be generated using the search path obtained after mutation, and whether the mutation is beneficial to obtaining a better ViT network can be evaluated.
Through multiple rounds of genetic variation, a search path with better performance can be screened from the super-large neural network, so that the structure of the ViT network is obtained, and the ViT network generated based on the structure also has better performance.
The trained super-large neural network can be searched at a relatively fast speed using the genetic algorithm, so as to determine a search path with higher accuracy and use it as the target search path.
By first training the search paths in the super-large neural network, part of the computing power can be used to coarsely adjust the search paths, and the training process can be completed efficiently while consuming little computing power. After the trained super-large neural network is obtained, it is further adjusted using the genetic algorithm, so that the better structural parameters in the super-large neural network can be gradually retained through the crossover, inheritance, and mutation processes, and the parameters that cannot improve the accuracy of the ViT network are screened out.
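A minimal sketch of the crossover/mutation/selection loop described above, with a path encoded as a flat list of gene values, `space` as the list of candidate values per gene position, and `evaluate` standing for validating the corresponding sub-network on verification samples; every name here is an illustrative assumption.

```python
import random

def crossover(father, mother, rng=random):
    """Cross the genes of the father and mother chromosomes gene-by-gene."""
    return [rng.choice(pair) for pair in zip(father, mother)]

def mutate(path, space, prob=0.1, rng=random):
    """Randomly vary single structural parameters of a search path."""
    return [rng.choice(space[i]) if rng.random() < prob else gene
            for i, gene in enumerate(path)]

def evolve(population, space, evaluate, rounds=20, rng=random):
    for _ in range(rounds):
        ranked = sorted(population, key=evaluate, reverse=True)
        parents = ranked[:len(ranked) // 2]       # keep the fitter half
        children = [mutate(crossover(rng.choice(parents),
                                     rng.choice(parents), rng),
                           space, rng=rng)
                    for _ in range(len(population) - len(parents))]
        population = parents + children
    return max(population, key=evaluate)          # the target search path
```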
For the above S103, in the case of determining the target search path, since the parameters corresponding to each node in the target search path are the structural parameters of the ViT network, the ViT network can be constructed by directly using the parameters included in the target search path.
Here, since the obtained target search path includes a search path with higher accuracy in the super-large neural network, constructing the ViT network using the target search path also has higher processing accuracy.
After the ViT network is constructed, the ViT network can be retrained based on the structure of the ViT network. When the ViT network is retrained, the image obtained after data enhancement processing can be used, or the raw image which is not processed can be directly used for training. Moreover, when the ViT network is retrained, the structural parameters in the ViT network can be further adjusted to further improve the processing precision of the ViT network, so that a plurality of different objects in the image can be better identified when the determined ViT network identifies the objects in the image, and the different objects can be accurately classified and identified; that is, a ViT network with high accuracy can be obtained.
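A minimal sketch of this retraining step; the optimizer, learning rate, epoch count, and loss are illustrative assumptions.

```python
import torch

def retrain(vit, dataloader, epochs=300, lr=1e-3):
    """Retrain the ViT network built from the target search path; its
    structural parameters may be further fine-tuned in this stage."""
    opt = torch.optim.AdamW(vit.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
    for _ in range(epochs):
        for images, labels in dataloader:
            opt.zero_grad()
            loss_fn(vit(images), labels).backward()
            opt.step()
    return vit  # the target ViT network
```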
According to the above, by constructing the search space of the super-large neural network corresponding to the visual depth self-attention transformation ViT network and searching the structure of the ViT network in the search space using the image sample set, the structure of a high-performance ViT network for processing image data can be searched automatically, so that the processing precision on image data is improved.
It will be understood by those of skill in the art that in the above method of the present embodiment, the order of writing the steps does not imply a strict order of execution and does not impose any limitations on the implementation, as the order of execution of the steps should be determined by their function and possibly inherent logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides a structure search apparatus of a neural network corresponding to the structure search method of the neural network, and since the principle of the apparatus in the embodiment of the present disclosure for solving the problem is similar to the structure search method of the neural network described above in the embodiment of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 3, a schematic diagram of a structure search apparatus of a neural network provided in an embodiment of the present disclosure is shown, where the apparatus includes: a construction module 31, a structure search module 32, and a determination module 33; wherein the content of the first and second substances,
the construction module 31 is configured to construct, based on structural parameters of a visual depth self-attention transformation ViT network, a search space of a super-large neural network corresponding to the ViT network, where the search space of the super-large neural network includes a plurality of processing block layers, and each processing block layer includes at least one processing block constructed from a multi-head self-attention mechanism layer and a multi-layer perceptron; the structure search module 32 is configured to perform a structure search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network, where the target search path includes one processing block of each of the plurality of processing block layers; and the determination module 33 is configured to determine, based on the target search path, the structure of the ViT network for processing image data.
In an alternative embodiment, the structural parameters include at least one of: the number of processing blocks, the number of self-attention feature processing modules corresponding to the multi-head self-attention mechanism layer in each processing block, the output dimension of the fully connected layer in each processing block, and the segmentation size of the image data input into the first processing block.
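As a concrete illustration of these structural parameters, the search space can be pictured as a small configuration object; the candidate values below are typical ViT settings chosen purely for illustration and are not fixed by this disclosure.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ViTSearchSpace:
    depths: List[int]        # candidate numbers of processing blocks
    num_heads: List[int]     # candidate counts of self-attention feature
                             # processing modules per processing block
    mlp_dims: List[int]      # candidate output dimensions of the fully
                             # connected layer in each processing block
    patch_sizes: List[int]   # candidate segmentation sizes for the image
                             # data input into the first processing block

space = ViTSearchSpace(
    depths=[12, 14, 16],
    num_heads=[3, 6, 12],
    mlp_dims=[768, 1536, 3072],
    patch_sizes=[14, 16, 32],
)
```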
In an alternative embodiment, when performing a structure search on the super-large neural network based on the search space and the image sample set to obtain the target search path of the ViT network, the structure search module 32 is configured to: train, based on the image sample set, the neural networks corresponding to a plurality of paths to be searched that are sampled from the search space, to obtain a trained super-large neural network; and determine the target search path from the trained super-large neural network.
In an alternative embodiment, when training the network structures corresponding to the paths to be searched that are sampled from the search space based on the image sample set to obtain the trained super-large neural network, the structure search module 32 is configured to: execute training over a plurality of iteration cycles until a preset iteration stop condition is reached, and use the super-large neural network obtained in the last iteration cycle as the trained super-large neural network; where the training in any iteration cycle includes: determining the path to be searched in the current iteration cycle based on the search space, and training the neural network corresponding to that path using the image sample set.
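The single-path training can be sketched as follows, in the spirit of one-shot weight sharing. This is a runnable toy, not the disclosed implementation: plain linear blocks stand in for the MHSA/MLP processing blocks, random tensors stand in for the image sample set, and all sizes are assumptions.

```python
import random
import torch
import torch.nn as nn

class ToySupernet(nn.Module):
    """Each processing block layer holds several candidate blocks;
    a path selects exactly one block per layer."""
    def __init__(self, num_layers=4, num_choices=3, dim=32, num_classes=10):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.ModuleList(nn.Sequential(nn.Linear(dim, dim), nn.GELU())
                          for _ in range(num_choices))
            for _ in range(num_layers))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, path):
        for layer, choice in zip(self.layers, path):
            x = layer[choice](x)
        return self.head(x)

supernet = ToySupernet()
optimizer = torch.optim.SGD(supernet.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for step in range(100):  # stands in for the preset number of iterations
    # One path to be searched is sampled per iteration cycle; only the
    # blocks on this path receive gradient updates in this cycle.
    path = [random.randrange(3) for _ in range(4)]
    images = torch.randn(8, 32)          # stand-in inputs
    labels = torch.randint(0, 10, (8,))  # stand-in labels
    loss = criterion(supernet(images, path), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```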
In an alternative embodiment, the preset iteration stop condition includes: a preset number of iterations being reached, and/or the training precision of the path to be searched on the image sample set reaching a preset precision threshold.
In an alternative embodiment, the structure search apparatus further includes a data processing module 34, configured to: acquire an original image training sample set; and perform data enhancement processing on the original image training samples in a predetermined data enhancement mode to obtain the image sample set.
In an alternative embodiment, the predetermined data enhancement mode includes at least one of: random cropping, random flipping, and label smoothing.
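Assuming a PyTorch/torchvision setup, the three modes might be wired up as below; the crop size and flip probability are illustrative. Note that label smoothing acts on the training targets rather than the pixels, so it enters through the loss function rather than the transform pipeline.

```python
import torch.nn as nn
from torchvision import transforms

# Random cropping and random flipping applied to the raw images.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),       # random cropping
    transforms.RandomHorizontalFlip(p=0.5),  # random flipping
    transforms.ToTensor(),
])

# Label smoothing applied through the loss (PyTorch >= 1.10).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```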
In an alternative embodiment, the predetermined data enhancement mode is determined based on the accuracy and/or degree of convergence exhibited by the super-large neural network after a structure search has been performed on it using the search space and an experimental sample.
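One way to read this selection step is as a small proxy experiment: each candidate mode is kept only if it improves the accuracy (or convergence) that the super-large neural network reaches on an experimental sample. The greedy loop and the `evaluate` callback below are assumptions introduced for illustration.

```python
CANDIDATE_MODES = ["random_crop", "random_flip", "label_smoothing"]

def select_modes(evaluate):
    # evaluate(modes) is assumed to briefly train the super-large neural
    # network on the experimental sample under the given modes and
    # return its accuracy or a convergence score.
    chosen, best = [], evaluate([])
    for mode in CANDIDATE_MODES:
        score = evaluate(chosen + [mode])
        if score > best:
            chosen, best = chosen + [mode], score
    return chosen
```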
In an alternative embodiment, when determining the path to be searched in the current iteration cycle based on the search space, the structure search module 32 is configured to: determine the value of the structural parameter of each processing block in the ViT network based on the search space; and generate the path to be searched in the current iteration cycle based on the values of the structural parameters of the processing blocks.
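A sketch of this two-step path generation, with illustrative value ranges: first a value is sampled for each structural parameter of each processing block, then the list of sampled values is taken as the path for the current iteration cycle.

```python
import random

HEAD_CHOICES = [3, 6, 12]            # assumed candidate head counts
MLP_DIM_CHOICES = [768, 1536, 3072]  # assumed candidate output dims

def sample_path(depth):
    # One dictionary of structural-parameter values per processing block.
    return [{"num_heads": random.choice(HEAD_CHOICES),
             "mlp_dim": random.choice(MLP_DIM_CHOICES)}
            for _ in range(depth)]

path = sample_path(depth=12)  # the path to be searched in this cycle
```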
In an alternative embodiment, the linear mappings corresponding to the self-attention feature processing modules in the same processing block use the same parameter values across different iteration cycles; that is, these parameters are shared between the paths sampled in different cycles.
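One common way to realize such sharing, offered here as an assumption rather than the disclosed implementation, is to keep a single maximal projection weight per block and slice it to the head count sampled in the current cycle, so that every iteration cycle reads and updates the same underlying parameters.

```python
import torch
import torch.nn as nn

class SharedQKV(nn.Module):
    """Query/key/value projection whose weight is shared across cycles."""
    def __init__(self, dim=768, max_heads=12, head_dim=64):
        super().__init__()
        self.head_dim = head_dim
        self.qkv = nn.Linear(dim, 3 * max_heads * head_dim)

    def forward(self, x, num_heads):
        width = 3 * num_heads * self.head_dim
        # Smaller sampled widths reuse the first rows of the same weight.
        return nn.functional.linear(
            x, self.qkv.weight[:width], self.qkv.bias[:width])

block = SharedQKV()
x = torch.randn(2, 196, 768)
out_small = block(x, num_heads=6)   # same parameters...
out_full = block(x, num_heads=12)   # ...reused at full width
```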
In an alternative embodiment, the structure search apparatus further includes a retraining module 35, configured to retrain the ViT network based on the structure of the ViT network to obtain a target ViT network.
For the processing flow of each module in the apparatus and the interaction flow between the modules, reference may be made to the relevant description in the above method embodiments; details are not repeated here.
An embodiment of the present disclosure further provides a computer device, as shown in fig. 4, which is a schematic structural diagram of the computer device provided in the embodiment of the present disclosure, and includes:
a processor 10 and a memory 20, where the memory 20 stores machine-readable instructions executable by the processor 10; when the machine-readable instructions are executed, the processor 10 performs the following steps:
based on structural parameters of a visual depth self-attention transformation ViT network, constructing a search space of a super-large neural network corresponding to the ViT network, wherein the search space of the super-large neural network comprises a plurality of processing block layers, and each processing block layer comprises at least one processing block constructed from a multi-head self-attention mechanism layer and a multi-layer perceptron; performing a structure search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network, wherein the target search path comprises one processing block of each of the plurality of processing block layers; and determining the structure of the ViT network for processing image data based on the target search path.
The memory 20 includes an internal memory 210 and an external memory 220. The internal memory 210 temporarily stores operation data for the processor 10 as well as data exchanged with the external memory 220, such as a hard disk; the processor 10 exchanges data with the external memory 220 through the internal memory 210.
For the specific execution process of the instruction, reference may be made to the steps of the neural network structure search method described in the embodiments of the present disclosure, and details are not described here.
The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the structure search method for a neural network described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product, where the computer program product carries a program code, and instructions included in the program code may be used to execute the steps of the neural network structure search method described in the above method embodiments, which may be referred to specifically for the above method embodiments, and are not described herein again.
The computer program product may be implemented by hardware, software, or a combination thereof. In one alternative embodiment, the computer program product is embodied in a computer storage medium; in another, it is embodied in a software product, such as a software development kit (SDK).
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and apparatus described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical division, and other divisions are possible in actual implementation; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through communication interfaces, apparatuses, or units, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a processor-executable non-volatile computer-readable storage medium. Based on such understanding, the technical solution of the present disclosure, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are merely specific implementations of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that modifications or changes may still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions may be made for some of their technical features, within the technical scope of the disclosure; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure and shall be covered by them. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

1. A structure search method of a neural network, comprising:
based on structural parameters of a visual depth self-attention transformation ViT network, constructing a search space of a super-large neural network corresponding to the ViT network, wherein the search space of the super-large neural network comprises a plurality of processing block layers, and each processing block layer comprises at least one processing block constructed by a multi-head self-attention mechanism layer and a multi-layer perceptron; wherein the structural parameters include at least one of: the number of the processing blocks, the number of self-attention feature processing modules corresponding to the multi-head self-attention mechanism layer in each processing block, the output dimension of a full-connection layer in each processing block, and the segmentation size of the image data input into the first processing block;
performing a structure search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network, wherein the target search path comprises one processing block of each of the plurality of processing block layers;
determining a structure of the ViT network for processing image data based on the target search path.
2. The structure search method of claim 1, wherein the performing the structure search on the super-large neural network based on the search space and the image sample set to obtain the target search path of the ViT network comprises:
training neural networks corresponding to a plurality of paths to be searched sampled from the search space based on the image sample set to obtain a trained super-large neural network;
and determining the target search path from the trained super-large neural network.
3. The structure search method according to claim 2, wherein the training of the network structure corresponding to the path to be searched sampled from the search space based on the image sample set to obtain the trained super-large neural network comprises:
executing training of a plurality of iteration cycles until a preset iteration stop condition is reached; using the super-large neural network obtained in the last iteration cycle as the trained super-large neural network;
wherein the training of any iteration cycle comprises:
determining a path to be searched in the current iteration cycle based on the search space;
and training the neural network corresponding to the path to be searched by utilizing the image sample set.
4. The structure search method according to claim 3, wherein the preset iteration stop condition comprises: and reaching a preset iteration number, and/or reaching a preset precision threshold value of the training precision of the path to be searched on the image sample set.
5. The structure search method according to claim 3 or 4, further comprising:
acquiring an original image training sample set;
and performing data enhancement processing on the original image training sample by using a predetermined data enhancement mode to obtain the image sample set.
6. The structure search method of claim 5, wherein the predetermined data enhancement mode comprises at least one of: random cropping, random flipping, and label smoothing.
7. The structure searching method of claim 5, wherein the predetermined data enhancement mode is determined based on an accuracy and/or a convergence degree of the super-large neural network after the structure search of the super-large neural network is performed based on the search space and the experimental sample.
8. The structure searching method according to claim 3, wherein the determining the path to be searched for in the current iteration cycle based on the search space comprises:
determining the value of the structural parameter of each processing block in the ViT network based on the search space;
and generating the path to be searched in the current iteration cycle based on the value of the structural parameter of each processing block in the ViT network.
9. The structure search method of claim 8, wherein the linear mappings corresponding to the self-attention feature processing modules in the same processing block use the same parameter values in different iteration cycles.
10. The structure search method according to claim 1, further comprising: and retraining the ViT network based on the structure of the ViT network to obtain a target ViT network.
11. An apparatus for searching a structure of a neural network, comprising:
a construction module, configured to construct, based on structural parameters of a visual depth self-attention transformation ViT network, a search space of a super-large neural network corresponding to the ViT network, wherein the search space of the super-large neural network comprises a plurality of processing block layers, and each processing block layer comprises at least one processing block constructed from a multi-head self-attention mechanism layer and a multi-layer perceptron; wherein the structural parameters comprise at least one of: the number of processing blocks, the number of self-attention feature processing modules corresponding to the multi-head self-attention mechanism layer in each processing block, the output dimension of a fully connected layer in each processing block, and the segmentation size of the image data input into the first processing block;
a structure search module, configured to perform a structure search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network, wherein the target search path comprises one processing block of each of the plurality of processing block layers; and
a determination module, configured to determine, based on the target search path, the structure of the ViT network for processing image data.
12. A computer device, comprising: a processor, a memory storing machine-readable instructions executable by the processor, the processor for executing the machine-readable instructions stored in the memory, the processor performing the steps of the method of structure search of a neural network of any one of claims 1 to 10 when the machine-readable instructions are executed by the processor.
13. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a computer device, executes the steps of the structure search method for a neural network according to any one of claims 1 to 10.
CN202110602497.4A 2021-05-31 2021-05-31 Neural network structure searching method and device, computer equipment and storage medium Active CN113344181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110602497.4A CN113344181B (en) 2021-05-31 2021-05-31 Neural network structure searching method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113344181A CN113344181A (en) 2021-09-03
CN113344181B (en) 2022-10-18

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112241468A * 2020-07-23 2021-01-19 Harbin Institute of Technology (Shenzhen) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
CN112464016A * 2020-12-17 2021-03-09 Hangzhou Dianzi University Scene graph generation method based on depth relation self-attention network

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN108229647A * 2017-08-18 2018-06-29 Beijing SenseTime Technology Development Co., Ltd. Method and device for generating a neural network structure, electronic device, and storage medium

Non-Patent Citations (2)

Title
GreedyNAS: Towards Fast One-Shot NAS With Greedy Supernet; Shan You; IEEE Xplore; 2020-06-19; full text *
Target recognition method based on neural network architecture search; Bian Weiwei et al.; Journal of Air Force Engineering University (Natural Science Edition); No. 04, 2020-08-25; full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant