CN113344181A - Neural network structure searching method and device, computer equipment and storage medium - Google Patents


Info

Publication number: CN113344181A (application number CN202110602497.4A)
Authority
CN
China
Prior art keywords: network, ViT, neural network, search, processing block
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202110602497.4A
Other languages: Chinese (zh)
Other versions: CN113344181B (en)
Inventors: 苏修, 游山, 王飞, 钱晨
Current Assignee: Beijing Sensetime Technology Development Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Beijing Sensetime Technology Development Co Ltd
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority: CN202110602497.4A
Publication of CN113344181A; application granted; publication of CN113344181B
Legal status: Active

Classifications

    • G06N3/04 — Neural networks: architecture, e.g. interconnection topology
    • G06N3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N3/086 — Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G06N3/126 — Computing arrangements based on biological models using genetic models: evolutionary algorithms, e.g. genetic algorithms or genetic programming


Abstract

The present disclosure provides a structure search method and apparatus for a neural network, a computer device, and a storage medium. The method comprises: constructing a search space of a super-large neural network corresponding to a visual depth self-attention transformation (ViT) network based on structural parameters of the ViT network, wherein the search space of the super-large neural network comprises a plurality of processing block layers, and each processing block layer comprises at least one processing block constructed from a multi-head self-attention mechanism layer and a multi-layer perceptron; performing a structure search on the super-large neural network based on the search space and an image sample set to obtain a target search path of the ViT network, wherein the target search path comprises one processing block from each of the plurality of processing block layers; and determining, based on the target search path, the structure of the ViT network for processing image data. In this way, the structure of the ViT network can be determined more accurately, so that a ViT network structure with better performance is obtained and the processing precision of the ViT network is improved.

Description

Neural network structure searching method and device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of deep learning technologies, and in particular, to a structure search method and apparatus for a neural network, a computer device, and a storage medium.
Background
Among neural networks, the depth self-attention transformation network (Transformer) is widely used for processing sequence data (such as speech recognition and processing), and the visual depth self-attention transformation (ViT) network inherits the characteristics of the Transformer to solve deep learning problems in the image field. When a ViT network is designed, its structure is typically designed manually; however, a manually designed ViT network structure often cannot guarantee the performance of the ViT network, which affects the processing precision of the ViT network when processing images.
Disclosure of Invention
The embodiment of the disclosure at least provides a neural network structure searching method, a neural network structure searching device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a method for searching a structure of a neural network, including: constructing a search space of a super-large neural network corresponding to the ViT network on the basis of structural parameters of a visual depth self-attention transformation ViT network, wherein the search space of the super-large neural network comprises a plurality of processing block layers, and each processing block layer comprises at least one processing block constructed by a multi-head self-attention mechanism layer and a multi-layer perceptron; performing a structural search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network, wherein the target search path includes one processing block of each processing block layer of the plurality of processing block layers; determining a structure of the ViT network for processing image data based on the target search path.
In this way, by specifically constructing the search space of the corresponding super-large neural network for the ViT network, a structure search can be performed on the ViT network using a neural network architecture search approach. Because the search space is constructed based on the structural characteristics of the ViT network, the structure of the ViT network is determined more accurately under the premise of automatic structure search, so that a ViT network structure with better performance is obtained and the processing precision of the ViT network is improved.
In an alternative embodiment, the structural parameters include at least one of: the number of processing blocks, the number of self-attention feature processing modules corresponding to the multi-head self-attention mechanism layer in each processing block, the output dimension of the fully connected layer in each processing block, and the segmentation size of the image data input into the first processing block.
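As an illustration of how such a search space might be laid out in code, the following sketch enumerates candidate values for each structural parameter listed above. The particular value sets and key names are assumptions made for the example, not taken from this disclosure.

```python
# Illustrative candidate values for the structural parameters named above.
# The value sets are assumptions for this sketch, not from the patent.
SEARCH_SPACE = {
    "num_blocks": [12, 14, 16],       # number of processing blocks
    "num_heads":  [3, 6, 12],         # self-attention feature processing modules per block
    "fc_dim":     [768, 1536, 3072],  # output dimension of the fully connected layer
    "patch_size": [14, 16, 32],       # segmentation size of the input image data
}

def search_space_size(space):
    """Count the distinct ViT structures, assuming one choice per
    parameter for the whole network (a simplification)."""
    total = 1
    for choices in space.values():
        total *= len(choices)
    return total
```

With per-block rather than per-network choices the space grows combinatorially, which is why the embodiment samples paths rather than enumerating all structures.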
In an optional implementation manner, the performing a structure search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network includes: training neural networks corresponding to a plurality of paths to be searched sampled from the search space based on the image sample set to obtain a trained super-large neural network; and determining the target search path from the trained super-large neural network.
In this way, the trained super-large neural network is obtained by training the neural networks corresponding to the paths to be searched using the image samples, and the target search path is then determined from the trained super-large neural network. The trained super-large neural network can thus be further adjusted, so that the better structural parameters in the super-large neural network are gradually retained, while parameters that do not improve the accuracy of the ViT network are filtered out.
In an optional implementation manner, the training, based on the image sample set, a network structure corresponding to a path to be searched sampled from the search space to obtain a trained super-large neural network includes: executing training of a plurality of iteration cycles until a preset iteration stop condition is reached; using the super-large neural network obtained in the last iteration cycle as the trained super-large neural network; wherein the training of any iteration cycle comprises: determining a path to be searched in the current iteration cycle based on the search space; and training the neural network corresponding to the path to be searched by utilizing the image sample set.
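The iteration-cycle training described above can be sketched as follows. The names `sample_path` and `train_step` are hypothetical, and the accuracy threshold is one of the preset stop conditions the embodiment mentions; the actual sub-network training is abstracted into the supplied `train_step` callable.

```python
import random

def sample_path(space, rng=random):
    """Sample one path to be searched: a depth plus per-block choices.
    Key names are hypothetical; the patent does not fix a data layout."""
    depth = rng.choice(space["num_blocks"])
    return {
        "num_blocks": depth,
        "blocks": [
            {"num_heads": rng.choice(space["num_heads"]),
             "fc_dim":    rng.choice(space["fc_dim"])}
            for _ in range(depth)
        ],
    }

def train_supernet(space, train_step, max_iters, target_acc=None):
    """Run iteration cycles until a preset stop condition is reached:
    a maximum iteration count and/or a training-precision threshold."""
    acc, it = 0.0, 0
    for it in range(1, max_iters + 1):
        path = sample_path(space)   # path to be searched in this cycle
        acc = train_step(path)      # train its sub-network on the image samples
        if target_acc is not None and acc >= target_acc:
            break
    return acc, it
```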
In this way, the optimization of the structural parameters in the super-large neural network can be ensured while consuming less computing power and time, so that further training and learning with better accuracy can be guaranteed, and the search efficiency is improved.
In an optional embodiment, the preset iteration stop condition includes: reaching a preset number of iterations, and/or the training precision of the path to be searched on the image sample set reaching a preset precision threshold.
In an optional implementation manner, the structure search method further includes: acquiring an original image training sample set; and performing data enhancement processing on the original image training sample by using a predetermined data enhancement mode to obtain the image sample set.
In this way, by subjecting the original image training samples to data enhancement processing using a predetermined data enhancement mode, the convergence of the super-large neural network can be ensured, so that a ViT network with a better structure can be constructed.
In an alternative embodiment, the predetermined data enhancement mode includes at least one of: random cropping, random flipping, and label smoothing.
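A minimal sketch of the three enhancement modes, using NumPy for illustration; the disclosure does not prescribe an implementation, so function names and defaults here are assumptions.

```python
import numpy as np

def random_crop(img, size, rng):
    """Randomly crop an H x W x C image to size x size."""
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size]

def random_flip(img, rng, p=0.5):
    """Flip the image horizontally with probability p."""
    return img[:, ::-1] if rng.random() < p else img

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: keep 1 - eps on the true class and spread
    eps uniformly over all k classes."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k
```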
In an optional implementation manner, the predetermined data enhancement manner is determined based on the accuracy and/or the convergence degree of the super-large neural network after the structure search is performed on the super-large neural network based on the search space and the experimental sample.
In an optional implementation manner, the determining, based on the search space, a path to be searched in a current iteration cycle includes: determining values of the structural parameters of each processing block in the ViT network based on the search space; and generating the path to be searched in the current iteration cycle based on the values of the structural parameters of each processing block in the ViT network.
In an alternative embodiment, the data cost parameters used by the linear mappings corresponding to the self-attention feature processing modules in the same processing block in different iteration cycles are the same.
In this way, the amount of data that needs to be processed can be further reduced.
In an optional implementation manner, the structure search method further includes: and retraining the ViT network based on the structure of the ViT network to obtain a target ViT network.
In this way, when the ViT network is retrained, the structural parameters in the ViT network can be further adjusted to further improve its processing precision, so that when the determined ViT network identifies objects in an image, it can better distinguish a plurality of different objects and classify and identify them accurately; that is, a ViT network with higher accuracy can be obtained.
In a second aspect, an embodiment of the present disclosure further provides a structure searching apparatus for a neural network, including: the building module is used for building a search space of the super-large neural network corresponding to the ViT network based on structural parameters of a visual depth self-attention transformation ViT network, wherein the search space of the super-large neural network comprises a plurality of processing block layers, and each processing block layer comprises at least one processing block built by a multi-head self-attention mechanism layer and a multi-layer perceptron; a structure search module, configured to perform a structure search on the super-large neural network based on the search space and the image sample set, to obtain a target search path of the ViT network, where the target search path includes one processing block of each of the multiple processing block layers; a determination module to determine a structure of the ViT network for processing image data based on the target search path.
In an alternative embodiment, the structural parameters include at least one of: the number of processing blocks, the number of self-attention feature processing modules corresponding to the multi-head self-attention mechanism layer in each processing block, the output dimension of the fully connected layer in each processing block, and the segmentation size of the image data input into the first processing block.
In an optional implementation manner, when performing a structure search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network, the structure search module is configured to: training neural networks corresponding to a plurality of paths to be searched sampled from the search space based on the image sample set to obtain a trained super-large neural network; and determining the target search path from the trained super-large neural network.
In an optional implementation manner, the structure search module, when training a network structure corresponding to a path to be searched sampled from the search space based on the image sample set to obtain a trained super-large neural network, is configured to: executing training of a plurality of iteration cycles until a preset iteration stop condition is reached; using the super-large neural network obtained in the last iteration cycle as the trained super-large neural network; wherein the training of any iteration cycle comprises: determining a path to be searched in the current iteration cycle based on the search space; and training the neural network corresponding to the path to be searched by utilizing the image sample set.
In an optional embodiment, the preset iteration stop condition includes: reaching a preset number of iterations, and/or the training precision of the path to be searched on the image sample set reaching a preset precision threshold.
In an optional implementation manner, the structure searching apparatus further includes: a data processing module; the data processing module is used for: acquiring an original image training sample set; and performing data enhancement processing on the original image training sample by using a predetermined data enhancement mode to obtain the image sample set.
In an alternative embodiment, the predetermined data enhancement mode includes at least one of: random cropping, random flipping, and label smoothing.
In an optional implementation manner, the predetermined data enhancement manner is determined based on the accuracy and/or the convergence degree of the super-large neural network after the structure search is performed on the super-large neural network based on the search space and the experimental sample.
In an optional embodiment, when determining the path to be searched in the current iteration cycle based on the search space, the structure search module is configured to: determine values of the structural parameters of each processing block in the ViT network based on the search space; and generate the path to be searched in the current iteration cycle based on the values of the structural parameters of each processing block in the ViT network.
In an alternative embodiment, the data cost parameters used by the linear mappings corresponding to the self-attention feature processing modules in the same processing block in different iteration cycles are the same.
In an optional embodiment, the structure searching apparatus further includes a retraining module; the retraining module is to: and retraining the ViT network based on the structure of the ViT network to obtain a target ViT network.
In a third aspect, this disclosure also provides a computer device, including a processor and a memory, where the memory stores machine-readable instructions executable by the processor; when the machine-readable instructions are executed by the processor, the steps in the first aspect, or in any one of the possible implementations of the first aspect, are performed.
In a fourth aspect, this disclosure also provides a computer-readable storage medium having a computer program stored thereon, where the computer program is executed to perform the steps in the first aspect or any one of the possible implementation manners of the first aspect.
For the description of the effects of the structure searching apparatus, the computer device, and the computer-readable storage medium of the neural network, reference is made to the description of the structure searching method of the neural network, and details are not repeated here.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for use in the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 shows a flowchart of a structure searching method of a neural network provided by an embodiment of the present disclosure;
fig. 2 shows a schematic structural diagram of an ViT network provided by an embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating a structure searching apparatus of a neural network provided in an embodiment of the present disclosure;
fig. 4 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of embodiments of the present disclosure, as generally described and illustrated herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
It has been found that, compared with the Transformer, which can perform speech recognition and processing accurately and quickly, the ViT network makes further structural adaptations to better fit deep learning problems in the image domain. For example, parts of the decoder are deleted from the structure of the ViT network; to make it more suitable for processing images, a structure for segmenting and encoding the input image is added; and the processing layers within each processing block (block) are changed. Because the ViT network has this special structure, it is usually constructed by manual design and then further adjusted according to the actual performance of the constructed network to determine the ViT network structure. Such manual adjustment often cannot guarantee the performance of the ViT network, and the resulting ViT network has low processing accuracy when processing images.
Based on this research, the present disclosure provides a ViT network structure search method: a search space corresponding to the structural parameters of the ViT network is constructed for the ViT network, a structure search is performed in that search space, and the structure of the ViT network is then determined from the resulting target search path. In this way, by specifically constructing the search space of a super-large neural network corresponding to the ViT network, a structure search can be performed on the ViT network using Neural Network Architecture Search (NAS). Because the search space is constructed based on the structural characteristics of the ViT network, the structure of the ViT network is determined more accurately under the premise of automatic structure search, so that a ViT network structure with better performance is obtained and the processing precision of the ViT network is improved.
The above-mentioned drawbacks are results obtained by the inventors after practice and careful study; therefore, the discovery process of the above problems, and the solutions proposed by the present disclosure to these problems, should be regarded as the inventors' contribution to the present disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
To facilitate understanding of the present embodiment, first, a structure searching method for a neural network disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the structure searching method for a neural network provided in the embodiments of the present disclosure is generally a computer device with certain computing power, and the computer device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle mounted device, a wearable device, or a server or other processing device. In some possible implementations, the structure search method of the neural network may be implemented by a processor calling computer-readable instructions stored in a memory.
The following describes a structure search method for a neural network provided in an embodiment of the present disclosure.
Referring to fig. 1, a flowchart of a structure searching method of a neural network provided in an embodiment of the present disclosure is shown, where the method includes steps S101 to S103, where:
s101: constructing a search space of a super-large neural network corresponding to the ViT network on the basis of structural parameters of a visual depth self-attention transformation ViT network, wherein the search space of the super-large neural network comprises a plurality of processing block layers, and each processing block layer comprises at least one processing block constructed by a multi-head self-attention mechanism layer and a multi-layer perceptron;
s102: performing a structural search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network, wherein the target search path includes one processing block of each processing block layer of the plurality of processing block layers;
s103: determining a structure of the ViT network for processing image data based on the target search path.
According to the embodiment of the disclosure, a search space of the super-large neural network corresponding to the ViT network is constructed based on the structural parameters of the ViT network, so that the super-large neural network is subjected to structural search in the search space, a target search path of the ViT network is determined, and the structure of the ViT network is determined. Because the search space is constructed based on ViT network structure characteristics, the structure of the ViT network is determined more accurately on the premise of automatic structure search, so that a ViT network structure with better performance is obtained, and the processing precision of the ViT network is improved.
Next, the configuration of the ViT network and the corresponding configuration parameters will be described first.
Referring to fig. 2, a schematic structural diagram of a ViT network provided by an embodiment of the present disclosure is shown. Fig. 2 shows an image embedding layer (Embedded Patches) 21 and a processing block layer in the ViT network. In one possible embodiment, a processing block layer may comprise, for example, at least one processing block (block); in the embodiment of the present disclosure, the case in which a processing block layer includes one processing block is described as an example.
The ViT network may comprise, for example, a plurality of stacked processing blocks; for any processing block, the next processing block connected to it receives the data that it outputs.
Illustratively, ViT networks may include L processing blocks, represented in fig. 2 as "L x". In the first processing block 22, a Normalization Layer (Layer Normalization)23, a Multi-Head Attention mechanism Layer (Multi-Head Attention)24, a Normalization Layer 25, and a Multi-Layer Perceptron (MLP) 26 may also be included. Therein, a Multi-Head Attention layer (Multi-Head Attention)24 processes features input to the processing block based on a self-Attention mechanism.
For the image embedding layer 21, it may perform a slicing process on the image input into the ViT network. Since the Transformer structure is effective for the input Sequence (Sequence), the image embedding layer 21 may segment the image input to the ViT network, process the sub-image (patch) obtained after the segmentation into one vector (vector), and form an image Sequence from the positions of the sub-images in the image, so that the image to be processed may be converted into a vector Sequence formed from the vectors corresponding to the sub-images.
For example, when an image to be processed of size 9000 × 9000 (in pixels) undergoes classification and recognition processing, the image embedding layer 21 may divide it into sub-images of size P1 × P2, for example a 3 × 3 grid of sub-images, each of size 3000 × 3000. Each sub-image can then be abstractly expressed using the matrix determined by its pixel matrix together with its position information, so as to determine the output data of the image embedding layer 21 corresponding to each sub-image.
Specifically, the input data of the image embedding layer satisfies the following formula (1):

z_0 = [x_class; x_p^1 E; x_p^2 E; ...; x_p^N E] + E_pos    (1)

After the image is divided into two-dimensional (2D) sub-images of size P1 × P2, each image can correspond to C channels, for example the three channels Red (R), Green (G), and Blue (B). Thus each sub-image has dimension P1 × P2 × C, i.e. each sub-image can be written as a vector of size P1 × P2 × C. The vector corresponding to the i-th of the N sub-images is denoted x_p^i.

In formula (1), E is a matrix that maps the input sub-images to output vectors; it may, for example, be formed from the original pixels of the image to be processed, or be replaced by any of a convolutional neural network (CNN) or an MLP. Since the ViT network uses constant hidden vectors of size D in all of its layers, the sub-images are reduced in dimension and mapped by a trainable linear projection onto vectors of length D: multiplying the vector x_p^i of a sub-image by the matrix E maps it to a vector of length D. The matrix E therefore has dimension (P1 · P2 · C) × D.

x_class represents an additional learnable vector; in general, it may be an all-zero vector of length D. In addition, in order to express the positions of the different sub-images in the image to be processed, a position encoding (Position Embedding) E_pos may also be set. Since the additional learnable vector x_class is included besides the vectors corresponding to the N sub-images, E_pos has dimension (N + 1) × D.

Using x_class, the vectors x_p^i E, and E_pos, the output data of the image embedding layer 21, i.e. the input data of the first processing block 22, denoted z_0, can be calculated.
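The computation of z_0 in formula (1) can be sketched as follows; the function name and the use of NumPy are illustrative assumptions, and square images with P1 = P2 = patch are assumed for brevity.

```python
import numpy as np

def embed_patches(img, patch, E, x_class, E_pos):
    """Formula (1) as a sketch: split an H x W x C image into
    N = (H/patch) * (W/patch) sub-images, flatten each into a vector of
    length patch*patch*C, project with E to length D, prepend x_class,
    and add the position encoding E_pos."""
    h, w, c = img.shape
    nh, nw = h // patch, w // patch
    # N x (P1*P2*C): one flattened vector x_p^i per sub-image
    patches = (img.reshape(nh, patch, nw, patch, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(nh * nw, patch * patch * c))
    tokens = patches @ E                                     # x_p^i E, each of length D
    z0 = np.concatenate([x_class[None, :], tokens], axis=0)  # prepend x_class
    return z0 + E_pos                                        # (N+1) x D
```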
Taking the first processing block 22 as an example, z_0 is output by the image embedding layer 21 and then input to the first processing block 22. In the following, l denotes the level at which the current processing block resides among the L processing blocks comprised by the ViT network.
The normalization layer 23 and the multi-head attention mechanism layer 24 in the first processing block 22 are applied to the input data z0In the treatment, for example, the following formula (2) can be used:
Figure BDA0003093354110000114
Wherein LN(·) represents the normalization process, which is a conventional step and is not described in detail here. Through the normalization process, independent feature scaling can be performed on each sub-image, so that more feature information of the sub-images is retained.

MSA(·) denotes the multi-head self-attention processing. Since the data processed by the multi-head attention mechanism can be mapped to different attention feature processing spaces, information in those different spaces receives more focused attention, and richer feature information can be captured.
After z′_ℓ is obtained by the multi-head attention mechanism layer 24, the output data z_ℓ of the ℓ-th layer processing block can be determined by the normalization layer 25 and the multi-layer perceptron 26 according to the following formula (3):

z_ℓ = MLP(LN(z′_ℓ)) + z′_ℓ,  ℓ = 1, 2, …, L    (3)
Wherein MLP(·) represents the multi-layer perceptron processing; the specific process is not described again.
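Formulas (2) and (3) can be sketched together as one processing block. This is a simplified NumPy illustration with random untrained weights and a ReLU activation standing in for the usual choice; it is not the exact implementation of the disclosure:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LN(.): normalize each row (token) independently
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def msa(x, Wq, Wk, Wv, Wo, num_heads):
    """MSA(.): project to Q/K/V, attend within each head's slice, merge."""
    T, D = x.shape
    dh = D // num_heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = np.zeros_like(x)
    for h in range(num_heads):
        s = slice(h * dh, (h + 1) * dh)
        att = softmax(q[:, s] @ k[:, s].T / np.sqrt(dh))
        out[:, s] = att @ v[:, s]
    return out @ Wo

def processing_block(z_prev, params, num_heads):
    z_mid = msa(layer_norm(z_prev), *params["attn"], num_heads) + z_prev  # formula (2)
    h = np.maximum(layer_norm(z_mid) @ params["W1"], 0)                   # MLP hidden
    return h @ params["W2"] + z_mid                                       # formula (3)

rng = np.random.default_rng(1)
D = 64
params = {
    "attn": tuple(rng.standard_normal((D, D)) * 0.02 for _ in range(4)),
    "W1": rng.standard_normal((D, 4 * D)) * 0.02,
    "W2": rng.standard_normal((4 * D, D)) * 0.02,
}
z1 = processing_block(rng.standard_normal((17, D)), params, num_heads=4)
print(z1.shape)  # (17, 64)
```

The residual additions in both sub-layers mirror the "+ z" terms of formulas (2) and (3).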
In addition, for the L-th (last) layer processing block, the image representation (image representation) y corresponding to it may be determined, for example, according to the following formula (4):

y = LN(z_L^0)    (4)

where z_L^0 denotes the component of the output of the L-th layer processing block at the position of the learnable vector x_class.

After the output data z_ℓ of the ℓ-th layer processing block is obtained, it can be input to the (ℓ + 1)-th layer processing block, which continues the processing until all L layers have finished processing and the classification result of the image to be processed is determined.
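A minimal sketch of formula (4) and the classification step follows; the 10-class head and the random untrained weights are hypothetical assumptions for illustration:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def classification_head(z_L, W_head):
    """Take the class-token row of the final block's output, normalize it
    (formula (4): y = LN(z_L^0)), and map it to class logits."""
    y = layer_norm(z_L[0])
    logits = y @ W_head
    return int(np.argmax(logits))

rng = np.random.default_rng(2)
pred = classification_head(rng.standard_normal((17, 64)),
                           rng.standard_normal((64, 10)))
print(pred)  # an integer class index in [0, 10)
```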
Here, the above description of fig. 2 is only one possible way for the ViT network to process the image to be processed; other variants of the ViT network exist, which process the image to be processed in different ways and are not described here again.
Next, the details of S101 to S103 will be described, taking the ViT network shown in fig. 2 as an example.
For the above S101, the structural parameters of the ViT network include at least one of the following: the number of processing blocks, the number of self-attention feature processing modules corresponding to the multi-head self-attention mechanism layer in each processing block, the output dimension of the fully connected layer in each processing block, and the segmentation size of the image data input into the first processing block.
The number of processing blocks may, for example, correspond to L; from experience or experimental verification, the number of processing blocks may be, for example, 10, 11, …, or 18. In one possible case, when training the search paths in the super-large neural network using the search space, if the number of processing blocks determined in the current iteration cycle is less than a preset maximum processing-block-number threshold, a masking module can be arranged in parallel with each of the processing blocks, in order to prevent the parameters of the network layers in the masked processing blocks from participating in the search process of the current iteration cycle. When it is determined that a processing block does not participate in the search process of the current iteration cycle, the processing block at the corresponding position is replaced by its masking module, so that the search path determined in the current iteration cycle passes through the masking module corresponding to the processing block rather than through the processing block itself.
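The masking-module mechanism can be sketched as follows; the block implementation (a stand-in doubling function) and the depth values are hypothetical:

```python
import numpy as np

class MaskModule:
    """Identity placeholder: when the sampled number of processing blocks is
    below the maximum, masked blocks are bypassed so their parameters do not
    participate in the current iteration cycle's search."""
    def __call__(self, z):
        return z

def run_path(z, blocks, sampled_depth):
    # blocks beyond sampled_depth are replaced by their parallel mask modules
    path = blocks[:sampled_depth] + [MaskModule()] * (len(blocks) - sampled_depth)
    for block in path:
        z = block(z)
    return z

double = lambda z: z * 2          # stand-in for a real processing block
z = run_path(np.ones(4), [double] * 18, sampled_depth=10)
print(z[0])  # 1024.0 — only 10 of the 18 blocks were applied
```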
Determining the number of self-attention feature processing modules corresponding to each processing block means determining the number of parallel self-attention feature processing modules in the multi-head attention mechanism layer, that is, the number of "heads" in that layer. Illustratively, the number of self-attention feature processing modules may be, for example, 3, 4, or 5. In addition, the parameters of the different self-attention feature processing modules corresponding to the multi-head self-attention mechanism layer may be the same or different; therefore, the parameters of the different self-attention feature processing modules can also serve as structural parameters of the ViT network.
When determining the output dimension of the fully connected layer in each processing block, the output dimension may be determined, for example, according to the structure of the multi-layer perceptron. For example, the multi-layer perceptron may comprise a fully connected layer with a three-layer network structure, where the last layer outputs n different output data; thus n can be taken as the output dimension of the fully connected layer. The number of network layers of the fully connected layer may likewise be part of the structural parameters of the ViT network.
As for the first processing block, different from the other processing blocks, it receives the image data input from the image embedding layer. When the segmentation method of the image data differs, the data sizes received by the processing blocks differ, and thus the number of necessary processing blocks may also differ. Therefore, the segmentation size of the image data may also be taken as part of the structural parameters of the ViT network.
Here, the segmentation size may include, for example, the size of a sub-image obtained by segmenting the image data, for example 30 × 30, 40 × 60, or 100 × 100; as another example, it may include the segmentation method of the image data, such as reading the pixel values of the image data sequentially and using every 1000, or every 2000, read pixel values as the pixel values of one sub-image. When the search space corresponding to the segmentation size of the image data is specifically determined, it may be determined according to the actual situation, which is not described here again.
The above structural parameters of the ViT network are only some examples; the network structure of the ViT network may include other structural parameters that can be adjusted and whose adjustment can improve or enhance the ViT network. All of these are included in the structural parameters of the ViT network provided in the embodiments of the present disclosure and are not enumerated in detail here.
After the structural parameters of the ViT network are determined, the search space of the super-large neural network of the ViT network can be determined according to those structural parameters. Illustratively, using the method of neural network architecture search, the specific levels in the super-large neural network can be constructed for the structural parameters in the search space, following the structural order of the ViT network, so as to construct the super-large neural network.
Here, the search space of the super-large neural network includes a plurality of network layers, each network layer includes a plurality of processing block layers, and each processing block layer includes at least one processing block. In the search space of the super-large neural network, the parameter values that can serve as the structural parameters corresponding to a layer can be searched in the network layers included in the super-large neural network. For example, the first network layer of the super-large neural network includes the parameters corresponding to the segmentation size of the image data in the first processing block, specifically 30 × 30, 80 × 90, 100 × 120, and 150 × 150. When this network layer of the super-large neural network is searched, the network nodes corresponding to 30 × 30, 80 × 90, 100 × 120, and 150 × 150 can be searched; in addition, each node corresponds to one searchable path. That is, this layer of the super-large neural network includes four different search paths, each search path passes through a different network node, and the sizes into which the image data can be segmented, namely 30 × 30, 80 × 90, 100 × 120, and 150 × 150, can be found by searching.
For the network layers corresponding to the structural parameters within the processing blocks of the super-large neural network, the situation is similar to that of the network layer corresponding to the segmentation size of the image data, and details are not repeated here. In addition, even when the structures of different processing blocks are the same, the hierarchies of the different processing blocks among the plurality of processing blocks are different, so the features represented by the corresponding processing data also differ; in this way, the specifically selectable values of the structural parameters respectively corresponding to different processing blocks differ.
For the above S102, when performing structure search on the super-large neural network based on the search space and the image sample set to obtain the target search path of the ViT network, for example, the following method may be adopted: training a plurality of paths to be searched sampled from the search space based on the image sample set to obtain a trained super-large neural network; and determining the target search path from the trained super-large neural network.
Specifically, when training each search path in the super-large neural network based on the search space, for example, training for multiple iteration cycles may be performed until a preset iteration stop condition is reached; and taking the super-large neural network obtained in the last iteration cycle as the trained super-large neural network.
Here, the training of any iteration cycle includes: determining a path to be searched in the current iteration cycle based on the search space; and training the neural network corresponding to the path to be searched by utilizing the image sample set.
In a specific implementation, when a path to be searched in a current iteration cycle is determined, for example, values of structural parameters of each processing block in the ViT network may be determined based on a search space; and then, generating a path to be searched in the current iteration cycle based on the values of the structural parameters of the processing blocks in the ViT network.
Because the number of structural parameters in the ViT network is huge, the super-large neural network contains many search paths, and training the neural networks corresponding to the search paths one by one would consume too much computing power and time. Therefore, a random search mode may, for example, be selected to determine the paths to be searched, and only the neural networks corresponding to a small number of selected paths to be searched are trained. In this way, while consuming less computing power and time, the optimization of the structural parameters in the super-large neural network can still be ensured, so that further training and learning with better accuracy can be guaranteed, and the search efficiency is improved.
Illustratively, taking the first iteration cycle as an example, based on the search space, one node may be randomly selected from each network layer in the super-large neural network in a random search manner, where each node corresponds to a value of a structural parameter of a processing block. The nodes selected from the plurality of network layers together form one path to be searched, and within that path, the value of each structural parameter of each processing block in the ViT network is determined. Using this determination method, a plurality of paths to be searched corresponding to the first iteration cycle can be determined.
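Random sampling of paths to be searched can be sketched as follows; the parameter names and candidate value sets are hypothetical examples, not the exact search space of the disclosure:

```python
import random

# Hypothetical search space: each network layer of the super-large neural
# network offers a set of candidate nodes for one structural parameter.
SEARCH_SPACE = {
    "slice_size": [(30, 30), (80, 90), (100, 120), (150, 150)],
    "num_blocks": list(range(10, 19)),
    "num_heads": [3, 4, 5],
    "mlp_dim": [192, 384, 768],
}

def sample_path(space, rng):
    """Randomly pick one node per network layer; the picked nodes together
    form one path to be searched for this iteration cycle."""
    return {name: rng.choice(choices) for name, choices in space.items()}

rng = random.Random(0)
paths = [sample_path(SEARCH_SPACE, rng) for _ in range(4)]  # paths for one cycle
for p in paths:
    print(p)
```

Only the few sampled paths are trained in the cycle, which is what keeps the computing-power cost low.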
Alternatively, after each round of search is finished, the sampled paths to be searched can be evaluated based on the performance of the super-large neural network. Algorithms such as a genetic algorithm or reinforcement learning can be adopted, with the evaluation result used as the feedback information of the current round of search, for example the fitness in the genetic algorithm or the reward value (reward) in reinforcement learning; in the next round of search, new paths to be searched are sampled based on the feedback information, so that the convergence speed of the super-large neural network can be increased.
In addition, in different iteration cycles, for the self-attention feature processing modules in the same processing block, the Value parameters used by the corresponding linear mappings are the same, thereby further reducing the amount of data to be processed through parameter sharing. Optionally, in different iteration cycles, for the self-attention feature processing modules in the same processing block, the Query parameters used by the corresponding linear mappings are the same, and the Key parameters used by the corresponding linear mappings are the same.
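The parameter sharing described above can be sketched as follows, with hypothetical dimensions; the point is that different iteration cycles reuse the same Value (and optionally Query/Key) projection matrices for the same processing block, instead of allocating fresh ones per sampled path:

```python
import numpy as np

class SharedAttentionParams:
    """One set of Q/K/V projection matrices per processing block, reused by
    every path sampled across iteration cycles."""
    def __init__(self, d_model, rng):
        self.Wq = rng.standard_normal((d_model, d_model)) * 0.02
        self.Wk = rng.standard_normal((d_model, d_model)) * 0.02
        self.Wv = rng.standard_normal((d_model, d_model)) * 0.02

    def project(self, x):
        # the same shared linear mappings serve any sampled head count;
        # the head count only changes how the projected result is split
        return x @ self.Wq, x @ self.Wk, x @ self.Wv

rng = np.random.default_rng(3)
shared = SharedAttentionParams(64, rng)
x = rng.standard_normal((17, 64))
q1, _, v1 = shared.project(x)   # cycle that samples, say, 4 heads
q2, _, v2 = shared.project(x)   # later cycle that samples 3 heads
print(np.allclose(v1, v2))  # True — the Value mapping is shared
```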
When a plurality of paths to be searched are determined, the neural network corresponding to the paths to be searched can be trained by using the image sample set.
The image sample set may be determined, for example, according to the following: acquiring an original image training sample set; and performing data enhancement processing on the original image training sample set by using a predetermined data enhancement mode to obtain the image sample set.
The original image training sample set may include, for example, a plurality of frames of images. The multi-frame image may include at least one identifiable object, such as a person, an animal, or a building. And data enhancement processing can be further carried out on the multi-frame images. Specifically, the predetermined data enhancement mode is determined by the Accuracy (Accuracy) and/or Convergence (Convergence) degree of the super-large neural network after the structure search is performed on the super-large neural network by using the experimental sample, and may include at least one of the following: random cropping, random inversion, and label smoothing (label smoothing).
In particular, when the structure of the ViT network is searched using the super-large neural network, the ViT network comprises a large number of processing blocks, and each processing block comprises a plurality of different structural layers, so the determined ViT network is very deep. In this case, if regularization is used directly to train the super-large neural network, the gradient easily vanishes, so the trained super-large neural network cannot achieve good convergence characteristics, or more time and effort must be spent adjusting the structure of the ViT network in order to obtain a super-large neural network with good convergence characteristics.
Therefore, when data enhancement is performed on the original image training sample set, an experimental sample can first be selected and enhanced with several data enhancement processing modes respectively; then, for the experimental sample enhanced by each mode, the structure search is performed on the super-large neural network based on the search space, and the better data enhancement processing modes are selected according to the accuracy and/or convergence degree of the searched super-large neural network. The different data enhancement processes applied to the experimental sample may include, for example, mix enhancement (Mixup), cut filling (Cutout), mixed filling (CutMix), stochastic depth, modifying image attributes (ColorJitter), random cropping, random inversion, and label smoothing.
Specifically, the mix enhancement may include, for example, an enhancement mode in which two random image samples are mixed proportionally and the classification result is distributed in the same proportion to obtain an experimental sample. The cut filling may include, for example, randomly cutting off a partial region of the image sample and filling it with 0 pixel values, the classification result being unchanged, to obtain an experimental sample. The mixed filling may include, for example, cutting off a partial region of the experimental sample and, instead of filling it with 0, randomly filling it with the pixel values of the corresponding region of other data in the training set, with the classification result distributed in a certain proportion, to obtain an experimental sample.
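The Mixup and Cutout modes can be sketched as follows; the image sizes and region size are hypothetical, and CutMix and the other modes follow the same pattern:

```python
import numpy as np

def mixup(img_a, label_a, img_b, label_b, lam):
    """Mix enhancement (Mixup): mix two samples proportionally and
    distribute the classification result in the same proportion."""
    return lam * img_a + (1 - lam) * img_b, lam * label_a + (1 - lam) * label_b

def cutout(img, rng, size=8):
    """Cut filling (Cutout): zero out a random square region; the
    classification result is unchanged."""
    out = img.copy()
    h, w = img.shape[:2]
    y, x = rng.integers(0, h - size), rng.integers(0, w - size)
    out[y:y + size, x:x + size] = 0
    return out

rng = np.random.default_rng(4)
a, b = rng.random((32, 32, 3)), rng.random((32, 32, 3))
la, lb = np.array([1.0, 0.0]), np.array([0.0, 1.0])
img, label = mixup(a, la, b, lb, lam=0.7)
print(label)  # [0.7 0.3]
masked = cutout(a, rng)
print((masked == 0).sum() >= 8 * 8 * 3)  # True — a square region was zeroed
```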
After the experimental samples obtained through the different enhancement modes are used to perform the structure search on the super-large neural network, the accuracy and/or convergence degree of the super-large neural network under each enhancement mode can be determined. Using the accuracy and/or convergence degree obtained under the different enhancement modes, the better data enhancement processing modes, such as random cropping, random inversion, and label smoothing, can be screened out from the different enhancement modes and used as the predetermined data enhancement modes for performing data enhancement on the original image training samples, so as to improve the convergence speed of the super-large neural network without affecting the training precision; in this way, a ViT network with a better structure can be constructed.
When the neural network corresponding to a path to be searched is trained with the training samples, a corresponding accuracy rate can be determined for each different path to be searched; the accuracy rate may be, for example, 60%, 73%, or 90%. Using the accuracy rates respectively corresponding to the paths to be searched, the parameters corresponding to the nodes in each path can be adjusted in reverse to correct the path to be searched. Because the plurality of paths to be searched obtained by random search may include the same node, many nodes in the super-large neural network can be optimized by the random search mode.
Similarly, since the random search method can only cover part of all the searchable paths in the super-large neural network, when training the search paths in the super-large neural network, the training of the current step can be considered complete once the preset iteration stop condition is reached.
Specifically, the preset iteration stop condition may include reaching a preset number of iterations; that is, a number of iterations, for example 150 or 200, may be set, and after more than the preset number of iteration cycles have been executed, the training of the search paths in the super-large neural network under random search can be considered complete. Alternatively, the preset iteration stop condition may include the training precision of the paths to be searched on the image sample set reaching a preset precision threshold.
Then, a genetic algorithm can be used to determine the target search path from the trained super-large neural network. Specifically, when the genetic algorithm is used, any two search paths in the trained super-large neural network can be taken respectively as a father chromosome and a mother chromosome; that is, the genes of the father chromosome and the mother chromosome applied in the genetic algorithm respectively comprise the parameters of the nodes in the corresponding search paths. On the basis of sample verification, the performance of the networks respectively formed by the father chromosome and the mother chromosome is evaluated, and based on the evaluation result, the genes of the father chromosome and the mother chromosome are crossed, so that a new child chromosome, namely a new search path, can be obtained. The neural network determined by the new search path can then be evaluated with the verification samples to obtain its accuracy, so as to judge whether the new search path performs better than the two search paths corresponding to the father chromosome and the mother chromosome.
In addition, structural parameters in the trained super-large neural network can be randomly mutated by using a genetic algorithm, a corresponding neural network is generated by using a search path obtained after mutation, and whether the mutation is beneficial to obtaining a better ViT network can be evaluated.
Through multiple rounds of genetic variation, a search path with better performance can be screened from the super-large neural network, so that the ViT network structure is obtained, and the ViT network generated based on the structure also has better performance.
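Crossover and mutation over search paths can be sketched as follows; the gene encoding (per-layer node choices) and parameter sets are hypothetical examples:

```python
import random

def crossover(parent_a, parent_b, rng):
    """Cross the genes of a father and mother chromosome (two search paths)
    to produce a child chromosome (a new search path)."""
    return {k: rng.choice([parent_a[k], parent_b[k]]) for k in parent_a}

def mutate(path, space, rng, rate=0.2):
    """Randomly mutate some structural parameters to other values from
    the search space."""
    return {k: (rng.choice(space[k]) if rng.random() < rate else v)
            for k, v in path.items()}

SPACE = {"num_blocks": list(range(10, 19)), "num_heads": [3, 4, 5]}
rng = random.Random(5)
pa = {"num_blocks": 12, "num_heads": 3}   # father chromosome
pb = {"num_blocks": 16, "num_heads": 5}   # mother chromosome
child = mutate(crossover(pa, pb, rng), SPACE, rng)
print(child)  # each gene comes from a parent or from a mutation within the space
```

Child paths would then be evaluated on verification samples, keeping the better ones across rounds.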
The trained super-large neural network can thus be searched at a faster speed using the genetic algorithm, so as to determine a search path with higher accuracy and use it as the target search path.
By first training the search paths in the super-large neural network, part of the computing power can be used to roughly adjust the search paths, and the training process can be completed efficiently with low computing-power consumption. After the trained super-large neural network is obtained, it is further adjusted by the genetic algorithm, so that the better structural parameters in the super-large neural network are gradually retained through the crossover, inheritance, and mutation processes, and the parameters that cannot improve the accuracy of the ViT network are screened out.
For the above S103, after the target search path is determined, since the parameters corresponding to the nodes in the target search path are the structural parameters of the ViT network, the ViT network can be constructed directly using the parameters included in the target search path.
Here, since the obtained target search path includes a search path with a higher accuracy in the super-large neural network, constructing ViT network by using the target search path also has a higher processing accuracy.
After the ViT network is constructed, the ViT network can also be retrained based on the structure of the ViT network. When the ViT network is retrained, the images obtained after data enhancement processing can be used, or the unprocessed original images can be directly used for training. Moreover, when the ViT network is retrained, the structural parameters in the ViT network can be further adjusted to further improve the processing precision of the ViT network, so that the determined ViT network can better identify a plurality of different objects in the image when identifying the objects in the image, and can accurately classify and identify the different objects; that is, ViT network with higher accuracy can be obtained.
According to the method of the embodiment, the search space of the super-large neural network corresponding to the visual depth self-attention transformation ViT network is constructed, the structure of the ViT network is searched in the search space by using the image sample set, and the structure of the ViT network with high performance for processing the image data can be automatically searched out, so that the processing precision of the image data is improved.
It will be understood by those skilled in the art that in the method of the present invention, the order of writing the steps does not imply a strict order of execution and any limitations on the implementation, and the specific order of execution of the steps should be determined by their function and possible inherent logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides a structure search apparatus of a neural network corresponding to the structure search method of the neural network, and since the principle of the apparatus in the embodiment of the present disclosure for solving the problem is similar to the structure search method of the neural network described above in the embodiment of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 3, a schematic diagram of a structure search apparatus of a neural network provided in an embodiment of the present disclosure is shown, where the apparatus includes: a construction module 31, a structure search module 32, and a determination module 33; wherein,
the building module 31 is configured to build a search space of the superlarge neural network corresponding to the ViT network based on structural parameters of a visual depth self-attention transformation ViT network, where the search space of the superlarge neural network includes multiple processing block layers, and each processing block layer includes at least one processing block built by a multi-head self-attention mechanism layer and a multi-layer perceptron; a structure search module 32, configured to perform a structure search on the super-large neural network based on the search space and the image sample set, so as to obtain a target search path of the ViT network, where the target search path includes one processing block of each of the multiple processing block layers; a determining module 33 for determining a structure of the ViT network for processing image data based on the target search path.
In an alternative embodiment, the structural parameters include at least one of: the processing block number, the number of the self-attention feature processing modules corresponding to the multi-head self-attention mechanism layer in each processing block, the output dimension of the full-connection layer in each processing block, and the segmentation size of the image data input into the first processing block.
In an optional implementation manner, when performing a structure search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network, the structure search module 32 is configured to: training neural networks corresponding to a plurality of paths to be searched sampled from the search space based on the image sample set to obtain a trained super-large neural network; and determining the target search path from the trained super-large neural network.
In an optional implementation manner, when the structure search module 32 trains a network structure corresponding to a path to be searched sampled from the search space based on the image sample set to obtain a trained super-large neural network, it is configured to: executing training of a plurality of iteration cycles until a preset iteration stop condition is reached; using the super-large neural network obtained in the last iteration cycle as the trained super-large neural network; wherein the training of any iteration cycle comprises: determining a path to be searched in the current iteration cycle based on the search space; and training the neural network corresponding to the path to be searched by utilizing the image sample set.
In an optional embodiment, the preset iteration stop condition includes: and reaching a preset iteration number, and/or reaching a preset precision threshold value of the training precision of the path to be searched on the image sample set.
In an optional implementation manner, the structure searching apparatus further includes: a data processing module 34; the data processing module is used for: acquiring an original image training sample set; and performing data enhancement processing on the original image training sample by using a predetermined data enhancement mode to obtain the image sample set.
In an alternative embodiment, the predetermined data enhancement mode includes at least one of: random cropping, random inversion, and label smoothing.
In an optional implementation manner, the predetermined data enhancement manner is determined based on the accuracy and/or the convergence degree of the super-large neural network after the structure search is performed on the super-large neural network based on the search space and the experimental sample.
In an alternative embodiment, the structure searching module 32, when determining the path to be searched for in the current iteration cycle based on the search space, is configured to: determining ViT values of structural parameters of each processing block in the network based on the search space; and generating the path to be searched in the current iteration cycle based on the value of the structural parameter of each processing block in the ViT network.
In an alternative embodiment, in different iteration cycles, the Value parameters used by the linear mappings corresponding to the self-attention feature processing modules in the same processing block are the same.
In an optional embodiment, the structure searching apparatus further includes a retraining module 35; the retraining module is to: and retraining the ViT network based on the structure of the ViT network to obtain a target ViT network.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
An embodiment of the present disclosure further provides a computer device, as shown in fig. 4, which is a schematic structural diagram of the computer device provided in the embodiment of the present disclosure, and the computer device includes:
a processor 10 and a memory 20; the memory 20 stores machine-readable instructions executable by the processor 10, the processor 10 being configured to execute the machine-readable instructions stored in the memory 20, the processor 10 performing the following steps when the machine-readable instructions are executed by the processor 10:
constructing a search space of a super-large neural network corresponding to the ViT network on the basis of structural parameters of a visual depth self-attention transformation ViT network, wherein the search space of the super-large neural network comprises a plurality of processing block layers, and each processing block layer comprises at least one processing block constructed by a multi-head self-attention mechanism layer and a multi-layer perceptron; performing a structural search on the super-large neural network based on the search space and the image sample set to obtain a target search path of the ViT network, wherein the target search path includes one processing block of each processing block layer of the plurality of processing block layers; determining a structure of the ViT network for processing image data based on the target search path.
The storage 20 includes a memory 210 and an external storage 220; the memory 210 is also referred to as an internal memory, and temporarily stores operation data in the processor 10 and data exchanged with the external memory 220 such as a hard disk, and the processor 10 exchanges data with the external memory 220 through the memory 210.
For the specific execution process of the instruction, reference may be made to the steps of the neural network structure search method described in the embodiments of the present disclosure, and details are not described here.
The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the structure search method for a neural network described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure further provide a computer program product carrying program code, where the instructions included in the program code may be used to execute the steps of the neural network structure search method described in the above method embodiments; for details, reference may be made to those embodiments, which are not repeated here.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, it is embodied as a software product, such as a software development kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and apparatus described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a logical division; other divisions are possible in actual implementation, a plurality of units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections of devices or units through communication interfaces, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a processor-executable non-volatile computer-readable storage medium. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are merely specific implementations of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may still modify the technical solutions described in the foregoing embodiments, or readily conceive of changes, or substitute equivalents for some of the technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure and shall all be covered by its protection scope. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (14)

1. A structure search method of a neural network, comprising:
constructing a search space of a super-large neural network corresponding to a vision Transformer (ViT) network based on structural parameters of the ViT network, wherein the search space of the super-large neural network comprises a plurality of processing block layers, and each processing block layer comprises at least one processing block built from a multi-head self-attention mechanism layer and a multi-layer perceptron;
performing a structure search on the super-large neural network based on the search space and an image sample set to obtain a target search path of the ViT network, wherein the target search path includes one processing block from each of the plurality of processing block layers;
determining a structure of the ViT network for processing image data based on the target search path.
2. The structure search method of claim 1, wherein the structural parameters comprise at least one of: the number of processing blocks, the number of self-attention feature processing modules corresponding to the multi-head self-attention mechanism layer in each processing block, the output dimension of the fully connected layer in each processing block, and the split size of the image data input to the first processing block.
3. The structure search method of claim 1, wherein performing the structure search on the super-large neural network based on the search space and the image sample set to obtain the target search path of the ViT network comprises:
training neural networks corresponding to a plurality of paths to be searched sampled from the search space based on the image sample set to obtain a trained super-large neural network;
and determining the target search path from the trained super-large neural network.
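The second step of claim 3, determining the target search path from the trained super-large neural network, can be sketched as a best-of-N selection. The evaluation function below is a toy stand-in; the claim does not mandate a particular selection strategy.

```python
def select_target_path(candidate_paths, evaluate):
    """Evaluate each candidate path of the trained supernet and keep the
    best-scoring one as the target search path."""
    return max(candidate_paths, key=evaluate)

# Toy stand-in for validation accuracy; a real system would run each
# subnetwork on held-out image samples and score its predictions.
paths = [(0, 1, 0), (2, 2, 1), (1, 0, 0)]
best = select_target_path(paths, evaluate=sum)
```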
4. The structure search method according to claim 3, wherein training, based on the image sample set, the neural networks corresponding to the paths to be searched sampled from the search space to obtain the trained super-large neural network comprises:
executing training over a plurality of iteration cycles until a preset iteration stop condition is reached, and using the super-large neural network obtained in the last iteration cycle as the trained super-large neural network;
wherein the training of any iteration cycle comprises:
determining a path to be searched in the current iteration cycle based on the search space;
and training the neural network corresponding to the path to be searched by utilizing the image sample set.
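The per-cycle loop of claim 4 can be sketched with single-path sampling: each iteration cycle determines one path from the search space and trains only the subnetwork on that path. The sampling strategy (uniform random) and the `train_step` callback are illustrative assumptions.

```python
import random

def sample_path(search_space, rng):
    """Pick one candidate block index per processing block layer, giving
    the path to be searched in the current iteration cycle."""
    return [rng.randrange(len(layer)) for layer in search_space]

def train_supernet(search_space, num_iterations, train_step, seed=0):
    """Iterative loop from the claim: each cycle determines a path from the
    search space, then trains the subnetwork corresponding to that path."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_iterations):
        path = sample_path(search_space, rng)
        train_step(path)  # stands in for one pass over the image sample set
        history.append(path)
    return history

# 12 processing block layers with 3 candidate blocks each; no-op training.
space = [[("block", i) for i in range(3)] for _ in range(12)]
history = train_supernet(space, num_iterations=5, train_step=lambda path: None)
```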
5. The structure search method according to claim 4, wherein the preset iteration stop condition comprises: a preset number of iterations being reached, and/or the training precision of the path to be searched on the image sample set reaching a preset precision threshold.
6. The structure search method according to claim 4 or 5, further comprising:
acquiring an original image training sample set;
and performing data enhancement processing on the original image training sample set in a predetermined data enhancement mode to obtain the image sample set.
7. The structure search method of claim 6, wherein the predetermined data enhancement mode comprises at least one of: random cropping, random flipping, and label smoothing.
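Of the enhancement modes listed in claim 7, label smoothing is the easiest to show concretely. A minimal sketch; the value epsilon = 0.1 is an illustrative choice, not one fixed by the claim.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: move epsilon of the probability mass from the
    one-hot target to a uniform distribution over all classes."""
    k = len(one_hot)
    return [(1 - epsilon) * p + epsilon / k for p in one_hot]

# For 4 classes, the true class keeps 0.925 of the mass; the rest get 0.025 each.
smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0])
```

Random cropping and random flipping operate on the image tensor itself and would typically be composed with this in the same preprocessing pipeline.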
8. The structure search method according to claim 6 or 7, wherein the predetermined data enhancement mode is determined based on the accuracy and/or convergence degree of the super-large neural network after a structure search of the super-large neural network is performed based on the search space and an experimental sample set.
9. The structure search method according to any one of claims 4 to 8, wherein determining the path to be searched in the current iteration cycle based on the search space comprises:
determining values of the structural parameters of each processing block in the ViT network based on the search space;
and generating the path to be searched in the current iteration cycle based on the value of the structural parameter of each processing block in the ViT network.
10. The structure search method of claim 9, wherein the linear mappings corresponding to the self-attention feature processing modules in the same processing block use the same parameter values in different iteration cycles.
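The parameter reuse of claim 10 can be sketched with a per-block shared weight store: every iteration cycle slices the same underlying matrix, so the linear mapping of a given processing block keeps the same parameter values across cycles. The maximal dimension and the dictionary layout are assumptions for illustration.

```python
MAX_DIM = 4                 # assumed maximal output dimension of a block
shared_weights = {}         # processing block index -> maximal weight matrix

def get_block_weights(block_idx, out_dim):
    """Return the leading out_dim x out_dim slice of the block's maximal
    weight matrix, creating the shared matrix on first use."""
    full = shared_weights.setdefault(
        block_idx, [[0.0] * MAX_DIM for _ in range(MAX_DIM)]
    )
    return [row[:out_dim] for row in full[:out_dim]]

get_block_weights(0, 2)            # cycle 1 samples a small block for layer 0
shared_weights[0][0][0] = 0.5      # pretend a training update happened
w_next = get_block_weights(0, 3)   # a later cycle samples a larger block:
                                   # it sees the updated shared parameters
```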
11. The structure search method according to any one of claims 1 to 10, further comprising: retraining the ViT network based on the structure of the ViT network to obtain a target ViT network.
12. An apparatus for searching a structure of a neural network, comprising:
a building module, configured to construct a search space of a super-large neural network corresponding to a vision Transformer (ViT) network based on structural parameters of the ViT network, wherein the search space of the super-large neural network comprises a plurality of processing block layers, and each processing block layer comprises at least one processing block built from a multi-head self-attention mechanism layer and a multi-layer perceptron;
a structure search module, configured to perform a structure search on the super-large neural network based on the search space and the image sample set, to obtain a target search path of the ViT network, where the target search path includes one processing block of each of the multiple processing block layers;
a determining module, configured to determine a structure of the ViT network for processing image data based on the target search path.
13. A computer device, comprising: a processor and a memory storing machine-readable instructions executable by the processor; the processor is configured to execute the machine-readable instructions stored in the memory, and when the machine-readable instructions are executed, the processor performs the steps of the neural network structure search method according to any one of claims 1 to 11.
14. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when executed by a computer device, performs the steps of the neural network structure search method according to any one of claims 1 to 11.
CN202110602497.4A 2021-05-31 2021-05-31 Neural network structure searching method and device, computer equipment and storage medium Active CN113344181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110602497.4A CN113344181B (en) 2021-05-31 2021-05-31 Neural network structure searching method and device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113344181A true CN113344181A (en) 2021-09-03
CN113344181B CN113344181B (en) 2022-10-18

Family

ID=77473327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110602497.4A Active CN113344181B (en) 2021-05-31 2021-05-31 Neural network structure searching method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113344181B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190095780A1 (en) * 2017-08-18 2019-03-28 Beijing Sensetime Technology Development Co., Ltd Method and apparatus for generating neural network structure, electronic device, and storage medium
CN112241468A (en) * 2020-07-23 2021-01-19 哈尔滨工业大学(深圳) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
CN112464016A (en) * 2020-12-17 2021-03-09 杭州电子科技大学 Scene graph generation method based on depth relation self-attention network


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHAN YOU: "GreedyNAS: Towards Fast One-Shot NAS With Greedy Supernet", IEEE Xplore *
BIAN WEIWEI ET AL.: "Target recognition method based on neural network structure search", Journal of Air Force Engineering University (Natural Science Edition) *

Also Published As

Publication number Publication date
CN113344181B (en) 2022-10-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant