WO2024032585A1 - Data processing method and apparatus, neural network model, device, and medium


Info

Publication number
WO2024032585A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2023/111669
Other languages
French (fr)
Chinese (zh)
Inventor
吴臻志
祝夭龙
Original Assignee
北京灵汐科技有限公司
Application filed by 北京灵汐科技有限公司
Publication of WO2024032585A1 publication Critical patent/WO2024032585A1/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing

Definitions

  • the embodiments of the present disclosure relate to the field of computer technology, and in particular, to a data processing method and device, a neural network model, electronic equipment, and a computer-readable storage medium.
  • neural networks have been widely used in image processing, video processing, speech processing, text processing and other fields.
  • feature extraction is usually required, and data processing is performed based on the extracted features.
  • the present disclosure provides a data processing method and device, a neural network model, electronic equipment, and a computer-readable storage medium.
  • the present disclosure provides a data processing method.
  • the data processing method includes: inputting data to be processed into a target neural network, performing data processing based on the spatial attention mechanism of the squeeze and excitation framework, and obtaining processing results; wherein the spatial attention mechanism is used to compress features from the channel dimension, stimulate the correlation of the channel-compressed features in the spatial dimension, and obtain the attention information of the features in the spatial dimension.
  • the present disclosure provides a neural network model, which is a model constructed based on the model parameters of a target neural network, wherein the target neural network is the target neural network described in any one of the embodiments of the present disclosure.
  • the present disclosure provides a data processing device.
  • the data processing device includes: a data processing module for inputting data to be processed into a target neural network, performing data processing based on the spatial attention mechanism of the squeeze and excitation framework, and obtaining processing results; wherein the spatial attention mechanism is used to compress features from the channel dimension, stimulate the correlation of the channel-compressed features in the spatial dimension, and obtain attention information of the features in the spatial dimension.
  • the present disclosure provides an electronic device, which includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor, and the one or more computer programs are executed by the at least one processor, so that the at least one processor can execute the above-mentioned data processing method.
  • the present disclosure provides a computer-readable storage medium on which a computer program is stored, wherein the computer program implements the above-mentioned data processing method when executed by a processor/processing core.
  • in the embodiments of the present disclosure, the data to be processed is input into the target neural network, so that the target neural network performs data processing based on the spatial attention mechanism of the squeeze and excitation framework and obtains the processing results; this realizes compression of features from the channel dimension, which reduces the size of the features in the channel dimension and the amount of data to be processed, thereby improving task processing efficiency.
  • at the same time, by stimulating the correlation of the channel-compressed features in the spatial dimension, the attention information of the features in the spatial dimension can be obtained, thereby improving the accuracy of the processing results and, in turn, the accuracy of task processing.
  • Figure 1 is a flow chart of a data processing method provided by an embodiment of the present disclosure
  • Figure 2 is a flow chart of a data processing method provided by an embodiment of the present disclosure
  • Figure 3 is a schematic diagram of a target neural network provided by an embodiment of the present disclosure.
  • Figure 4 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure.
  • Figure 5 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure.
  • Figure 6 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure.
  • Figure 7 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure.
  • Figure 8 is a schematic diagram of a target neural network provided by an embodiment of the present disclosure.
  • Figure 9 is a schematic diagram of a neural network model provided by an embodiment of the present disclosure.
  • Figure 10 is a block diagram of a data processing device provided by an embodiment of the present disclosure.
  • Figure 11 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
  • Figure 12 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
  • the original data (for example, pictures, speech, text, video, etc.) is usually high-dimensional information, which contains much redundant information and may also be sparse. Therefore, processing directly based on the original data requires too much calculation, and the task execution efficiency is low.
  • relatively low-dimensional feature data is obtained by extracting features from original data, and then data processing is performed based on the feature data to reduce the amount of calculation.
  • the amount of feature data is still large.
  • correspondingly, the amount of calculation is large, and task execution efficiency may still fail to meet user needs.
  • embodiments of the present disclosure provide a data processing method and device, a neural network model, electronic equipment, and a computer-readable storage medium.
  • the data processing method according to the embodiments of the present disclosure can compress features from the channel dimension, thereby reducing the size of the features in the channel dimension and the amount of data to be processed, and thus improving task processing efficiency; at the same time, by stimulating the correlation of the channel-compressed features in the spatial dimension, the attention information of the features in the spatial dimension can be obtained, thereby improving the accuracy of the processing results and, in turn, the accuracy of task processing.
  • the data processing method according to the embodiment of the present disclosure can be executed by an electronic device such as a terminal device or a server.
  • the terminal device can be a user equipment (User Equipment, UE), a mobile device, a terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc., and the method can be implemented by a processor calling computer-readable program instructions stored in a memory. Alternatively, the method can be executed by a server.
  • a first aspect of the embodiment of the present disclosure provides a data processing method.
  • Figure 1 is a flow chart of a data processing method provided by an embodiment of the present disclosure. Referring to Figure 1, the method includes the following steps.
  • step S11 the data to be processed is input into the target neural network, the data is processed based on the spatial attention mechanism of the squeeze and excitation framework, and the processing results are obtained.
  • the spatial attention mechanism is used to compress features from the channel dimension, stimulate the correlation of the channel-compressed features in the spatial dimension, and obtain the attention information of the features in the spatial dimension.
  • the Squeeze-and-Excitation (SE) framework can be implemented by the Squeeze-and-Excitation network (SENet) structure.
  • SENet is mainly used to model the interdependence between channels of convolutional features to improve the representation effect of features in the channel dimension.
  • in SENet, the amount of data processing is reduced by squeezing features in the spatial dimension, and channel attention information is obtained by stimulating the correlation between channels; however, the attention information of the features in the spatial dimension is lost.
  • embodiments of the present disclosure provide a spatial attention mechanism based on a squeeze and excitation framework.
  • the squeeze and excitation framework acts in the spatial attention dimension and focuses on the attention information of features in the spatial dimension.
  • the corresponding implementation is to compress features in the channel dimension, and then stimulate the correlation in the spatial dimension based on the channel-compressed features, so as to obtain the attention information of the features in the spatial dimension.
  • the above spatial attention mechanism can be used to process at least one of image data, voice data, text data, and video data. That is, in step S11, the data to be processed may include at least one of image data, voice data, text data, and video data.
  • the target neural network processes the data based on the spatial attention mechanism of the squeeze and excitation framework to obtain corresponding processing results.
  • the target neural network can be used to perform at least one of image processing tasks, speech processing tasks, text processing tasks, and video processing tasks.
  • the processing results include at least one of image processing results, speech processing results, text processing results, and video processing results (where the processing may include operations such as identification, classification, and labeling), and are related to the type of the data to be processed, the content of the data to be processed, the tasks executed by the target neural network, and so on.
  • the embodiments of this disclosure place no restrictions on the tasks that the target neural network can perform, the corresponding data to be processed, and the processing results.
  • the data to be processed includes at least one image to be processed.
  • after the image to be processed is input into the target neural network, it is processed through some network structures (for example, convolutional layers) to extract the image features to be processed from the image to be processed.
  • the image features to be processed can include image features at different levels. The higher the level, the better the semantic representation effect of the image features. The lower the level, the better the spatial representation effect of the image features.
  • for the image features to be processed, the target neural network compresses them from the channel dimension to obtain the image channel context features, further stimulates the correlation of the image channel context features in the spatial dimension to obtain the image spatial attention features (which can characterize the attention information in the spatial dimension), and then fuses the image spatial attention features with the corresponding image features to be processed, thereby obtaining the target image features that can represent the spatial attention information.
  • the above target image features can be further used in image processing processes such as image classification, image annotation, and image recognition.
  • the data to be processed includes at least one image to be classified.
  • after the image to be classified is input into the target neural network, it is processed through some network structures (for example, convolutional layers) to extract the image features to be classified from the image to be classified.
  • the image features to be classified can include image features at different levels. The higher the level, the better the semantic representation effect of the image features. The lower the level, the better the spatial representation effect of the image features.
  • for the image features to be classified, the target neural network compresses them from the channel dimension to obtain the image channel context features, further stimulates the correlation of the image channel context features in the spatial dimension to obtain the image spatial attention features (which can characterize the attention information in the spatial dimension), and then fuses the image spatial attention features with the corresponding image features to be classified, thereby obtaining the target image features that can represent the spatial attention information.
  • the image classification results can be obtained through the corresponding image classification algorithm.
  • the images to be classified are essentially external technical data, and the series of technical processing implemented on them at least includes: inputting the images to be classified into the target neural network, performing data processing based on the spatial attention mechanism of the squeeze and excitation framework, and obtaining the image classification results.
  • a series of technical means are adopted.
  • the technical operations in these technical means include but are not limited to data input, feature extraction, feature compression, feature conversion, etc.
  • the above technical operations all correspond to corresponding technical features, and they are all computerized processing and operations of given data.
  • in this way, the characteristics of the images to be classified can be preserved as much as possible to ensure the accuracy of the image classification results. It can be seen that, although the technical solutions of the embodiments of the present disclosure do not directly change the processing performance, operation speed, or calculation accuracy of the hardware device, they change the resource overhead required by the hardware device when processing data such as images to be classified, thereby improving resource utilization. In other words, due to the reduction in resource overhead, some hardware devices that were originally unable to handle tasks such as image classification can perform the corresponding data processing, such as image classification, based on the data processing methods of the embodiments of the present disclosure, thereby reducing the requirements on hardware devices.
  • the above spatial attention mechanism corresponds to certain network layers or network structures in the target neural network, and the corresponding network layers or network structures implement the above processing process.
  • the target neural network includes a spatial attention module.
  • the spatial attention module is a module built according to the spatial attention mechanism based on the squeeze and excitation framework. It is used to compress features from the channel dimension, stimulate the correlation of the channel-compressed features in the spatial dimension, and obtain the attention information of the features in the spatial dimension.
  • the spatial attention module itself can also include functional units with smaller granularity, and each functional unit has a certain connection relationship. Based on each functional unit and its connection relationship, the attention information of the feature in the spatial dimension can be obtained.
  • the spatial attention mechanism based on the squeeze and excitation framework is only one of the processing mechanisms used by the target neural network for data processing.
  • the target neural network can also perform data processing based on other data processing mechanisms.
  • the embodiments of the present disclosure place no restriction on this.
  • the spatial attention mechanism based on the squeeze and excitation framework processes the data to be processed; this can be one processing step in the entire data processing process, or it can correspond to the entire data processing process.
  • the target neural network may only include a spatial attention module, or may include other network layers or network structures in addition to the spatial attention module. This is not limited in the embodiments of the present disclosure.
  • in the embodiments of the present disclosure, the data to be processed is input into the target neural network, so that the target neural network performs data processing based on the spatial attention mechanism of the squeeze and excitation framework and obtains the processing results; this realizes compression of features from the channel dimension, which reduces the size of the features in the channel dimension and the amount of data to be processed, thereby improving task processing efficiency.
  • at the same time, by stimulating the correlation of the channel-compressed features in the spatial dimension, the attention information of the features in the spatial dimension can be obtained, thereby improving the accuracy of the processing results and, in turn, the accuracy of task processing.
  • FIG. 2 is a flow chart of a data processing method provided by an embodiment of the present disclosure. Referring to Figure 2, the method includes the following steps.
  • step S21 characteristics to be processed are determined based on the data to be processed.
  • step S22 feature compression is performed on the features to be processed from the channel dimension to obtain channel context features.
  • step S23 feature conversion is performed on the channel context features to obtain spatial attention features.
  • step S24 feature fusion is performed on the features to be processed and the spatial attention features to obtain the target features.
  • step S25 the processing result is determined based on the target characteristics.
  • the feature to be processed, the channel context feature, the spatial attention feature, and the target feature have the same feature size in the spatial dimension.
  • the feature to be processed and the target feature have the same first feature size in the channel dimension; the channel context feature and the spatial attention feature have the same second feature size in the channel dimension; and the first feature size is larger than the second feature size.
  • in the embodiments of the present disclosure, the features to be processed can first be compressed in the channel dimension, and then transformed based on the compressed features to obtain the attention information of the spatial dimension; the attention information of the spatial dimension can then be integrated into the features to be processed to obtain the target features.
  • the features to be processed may be features obtained based on the data to be processed, and may have a certain corresponding relationship with the data to be processed.
  • in step S21, determining the features to be processed based on the data to be processed includes: directly performing feature extraction on the data to be processed to obtain the features to be processed; or first performing some data processing operations on the data to be processed to obtain intermediate data, and then performing feature extraction on the intermediate data to obtain the features to be processed; or first performing feature extraction on the data to be processed to obtain initial features, and then performing some data processing operations on the initial features to obtain the features to be processed.
  • after the features to be processed are obtained, the spatial attention mechanism based on the squeeze and excitation framework can be used to compress the features to be processed in the channel dimension, and stimulate the correlation of the channel-compressed features in the spatial dimension to obtain the attention information of the features to be processed in the spatial dimension.
  • the above processing process corresponds to step S22 to step S25 in the embodiment of the present disclosure.
  • step S22 is mainly used to compress the features to be processed in the channel dimension to reduce the amount of data processing.
  • the channel context feature is used to characterize the contextual relationship of the features to be processed in the channel dimension.
  • performing feature compression on the features to be processed from the channel dimension to obtain the channel context features includes: pooling the features to be processed in the channel dimension to obtain the channel context features.
  • pooling is an important concept in neural networks, and its essence is a downsampling processing method.
  • through pooling processing, the receptive field of a feature can be increased and the number of parameters can be reduced, while certain invariances of the feature are maintained (for example, rotation invariance, translation invariance, scaling invariance, etc.).
  • Common pooling processing methods include average pooling (Average Pooling), maximum pooling (Max Pooling) and global pooling (Global Pooling).
  • average pooling refers to averaging the feature points in the neighborhood
  • maximum pooling refers to taking the maximum value of the feature points in the neighborhood
  • global pooling refers to using the entire feature as a window for pooling processing, which can be used to reduce the feature dimension
  • global pooling is usually used in combination with average pooling and max pooling (for example, global average pooling, global max pooling, etc.).
  • features can be pooled through various pooling functions.
  • performing pooling processing on the features to be processed in the channel dimension to obtain channel context features includes: performing average pooling processing on the features to be processed in the channel dimension to obtain channel context features with a feature scale of n1 in the channel dimension.
  • wherein 1 ≤ n1 < N, N is the number of channels of the features to be processed, and N > 1.
  • performing pooling processing on the features to be processed in the channel dimension to obtain channel context features includes: performing maximum pooling processing on the features to be processed in the channel dimension to obtain channel context features with a feature scale of n2 in the channel dimension.
  • wherein 1 ≤ n2 < N, and N is the number of channels of the features to be processed.
  • since both n1 and n2 are smaller than N, after the above average pooling or maximum pooling processing, the feature scale of the features to be processed in the channel dimension is reduced (from N to n1, or from N to n2), and the corresponding data volume is also reduced, thereby reducing the task processing pressure.
  • when n1 and n2 are both integers greater than 1, there is still room for reduction in the feature scale of the features to be processed in the channel dimension.
  • in some optional implementations, n1 and n2 are both equal to 1, so that the features to be processed are compressed in the channel dimension to the greatest extent, minimizing the amount of data processing; in this case, the corresponding data processing amount is the minimum processing amount.
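  For illustration only (not part of the disclosure), the channel-dimension pooling described above can be sketched in PyTorch as follows; the (b, c, h, w) tensor layout, the example shapes, and the use of torch.mean / torch.max are assumptions of this sketch.

```python
import torch

# Features to be processed: (b, c, h, w) = (batch, channels, height, width).
x = torch.randn(2, 256, 28, 28)

# Average pooling over the channel dimension: compresses c -> 1 (n1 = 1 here),
# leaving the spatial size (h, w) unchanged.
avg_ctx = x.mean(dim=1, keepdim=True)        # (2, 1, 28, 28)

# Maximum pooling over the channel dimension: also compresses c -> 1 (n2 = 1 here).
max_ctx = x.max(dim=1, keepdim=True).values  # (2, 1, 28, 28)

print(avg_ctx.shape, max_ctx.shape)
```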
  • when channel context features are obtained based on pooling processing, the corresponding parameters are usually hyperparameters, and the channel context features obtained through pooling can retain only limited channel context information.
  • convolution processing can also achieve feature size compression, and various learnable parameters (for example, the weights of the convolution kernel) can be introduced as needed when performing convolution processing, so that the convolution-processed features can better retain feature information while being compressed.
  • therefore, in some optional implementations, the channel context features are obtained through convolution processing, so that the channel context features can better retain the channel context information.
  • performing feature compression on the feature to be processed from the channel dimension to obtain the channel context feature includes: performing convolution processing on the feature to be processed in the channel dimension to obtain the channel context feature.
  • the feature scale of the channel context feature in the channel dimension can be adjusted.
  • performing convolution processing on the feature to be processed in the channel dimension to obtain the channel context feature includes: convolving the feature to be processed in the channel dimension to obtain the channel context feature with a feature scale of n3 in the channel dimension.
  • wherein 1 ≤ n3 < N, and N is the number of channels of the features to be processed.
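  As an illustrative sketch only, a learnable channel compression can be realized with a 1*1 convolution mapping N input channels to n3 output channels; the 1*1 kernel size and the example values of N and n3 are assumptions of this sketch, since the embodiment only requires that the spatial size be preserved.

```python
import torch
from torch import nn

N, n3 = 256, 1  # number of input channels and compressed channel scale (example values)

# A 1*1 convolution compresses the channel dimension from N to n3 while keeping
# the spatial size unchanged; its weights are learnable, unlike pooling.
channel_compress = nn.Conv2d(N, n3, kernel_size=1)

x = torch.randn(2, N, 28, 28)          # features to be processed
channel_context = channel_compress(x)  # (2, n3, 28, 28)
print(channel_context.shape)
```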
  • in some optional implementations, performing feature conversion on the channel context features to obtain the spatial attention features includes: performing feature extraction on the channel context features in the spatial dimension to obtain first intermediate features; activating the first intermediate features to obtain second intermediate features; performing feature reduction processing on the second intermediate features to obtain third intermediate features; and activating the third intermediate features to obtain the spatial attention features.
  • in this way, spatial attention features that can represent the attention information of the spatial dimension can be obtained, which clarifies the processing objects (i.e., the channel context features), the processing dimension (i.e., the spatial dimension), and the information to be acquired (i.e., the attention information). On this basis, the attention information of the channel context features in the spatial dimension can be obtained in any suitable way, and the embodiments of the present disclosure do not limit this.
  • performing feature extraction on the channel context features in the spatial dimension to obtain the first intermediate features includes: performing a first convolution process on the channel context features in the spatial dimension to obtain the first intermediate features.
  • in an example, the first convolution corresponds to one convolution kernel, the size of the convolution kernel is 3*3, and the step size is 2.
  • the first convolution process does not change the feature scale of the channel context features in the channel dimension; it mainly squeezes the channel context features in the spatial dimension. Through this squeezing process, the amount of data processing can be reduced. However, considering the requirements for processing accuracy, in some optional implementations, the amount of data processing can be appropriately increased in exchange for higher processing accuracy.
  • Corresponding processing methods include using multiple convolution kernels to obtain the characteristics of multiple output channels, and then averaging in the channel dimension (Channel-mean) to improve processing accuracy.
  • performing feature extraction on the channel context features in the spatial dimension to obtain the first intermediate features includes: performing a second convolution process on the channel context features in the spatial dimension to obtain the fourth corresponding to multiple channels. Intermediate features; determine the average value of multiple fourth intermediate features in the channel dimension to obtain the first intermediate feature.
  • the second convolution process not only squeezes the channel context features in the spatial dimension, but also appropriately expands the channel dimension (that is, expands the number of channels), and finally obtains the first intermediate feature through channel averaging.
  • the second convolution corresponds to four convolution kernels, each convolution kernel corresponds to one channel, and the size of the convolution kernel is 7*7, the step size is 4, and the expansion coefficient is 2.
  • the channel context features are extended to four channels to obtain spatial attention information, and then based on the channel averaging method, the average value of the spatial attention information of the four channels is determined, thereby obtaining the first intermediate feature.
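  A rough sketch of this second convolution followed by channel averaging is given below; the padding value is an assumption chosen so that a 28*28 channel context feature maps to 7*7 under the stated kernel size 7*7, step size 4, and expansion (dilation) coefficient 2, and the shapes are example values.

```python
import torch
from torch import nn

# Second convolution: 1 input channel -> 4 output channels, 7*7 kernel,
# stride 4, dilation (expansion coefficient) 2. padding=6 is an assumed value
# that maps a 28*28 channel context feature to 7*7.
second_conv = nn.Conv2d(1, 4, kernel_size=7, stride=4, dilation=2, padding=6)

channel_context = torch.randn(2, 1, 28, 28)
fourth = second_conv(channel_context)     # (2, 4, 7, 7): fourth intermediate features
first = fourth.mean(dim=1, keepdim=True)  # (2, 1, 7, 7): channel mean -> first intermediate feature
print(fourth.shape, first.shape)
```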
  • nonlinear activation processing can be performed on the first intermediate feature to increase the nonlinear characteristics of the network and obtain better processing results.
  • in some optional implementations, activation processing is performed on the first intermediate features to obtain the second intermediate features, including: performing nonlinear activation on the first intermediate features based on a rectified linear unit (Rectified Linear Unit, ReLU) function to obtain the second intermediate features.
  • the channel context features are convolved to obtain the first intermediate features, and after activation processing, the second intermediate features are obtained.
  • during the convolution processing, the feature size of the output (i.e., an intermediate feature) usually becomes smaller.
  • in some cases, the reduced feature needs to be restored to its original size (i.e., the feature size of the channel context feature) for further calculations.
  • this is achieved by enlarging the feature size, that is, by mapping the feature resolution from small to large.
  • the operation of mapping from low resolution to high resolution is called upsampling.
  • Deconvolution (Transposed Convolution) processing is one of the implementation methods of upsampling.
  • a first deconvolution process may be performed on the second intermediate features to obtain spatial attention features with the same size as the channel context features.
  • performing feature reduction processing on the second intermediate feature to obtain the third intermediate feature includes: performing a first deconvolution process on the second intermediate feature to obtain the third intermediate feature.
  • in an example, the first deconvolution corresponds to one convolution kernel, the size of the convolution kernel is 3*3, and the step size is 2.
  • the corresponding processing method is to use multiple convolution kernels to obtain multiple output channels, and then take the average in the channel dimension.
  • in some optional implementations, performing feature reduction processing on the second intermediate features to obtain the third intermediate features includes: performing a second deconvolution process on the second intermediate features to obtain fifth intermediate features corresponding to multiple channels; and determining the average value of the multiple fifth intermediate features in the channel dimension to obtain the third intermediate features.
  • in an example, the second deconvolution corresponds to four convolution kernels, each convolution kernel corresponds to one channel, the size of the convolution kernel is 7*7, the step size is 4, and the expansion coefficient is 2.
  • in this way, the second intermediate features are expanded to four channels, each channel corresponds to a fifth intermediate feature, and then, based on the channel averaging method, the average of the four fifth intermediate features is determined, so as to obtain the third intermediate features.
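  As a sketch only, the second deconvolution and channel averaging can be mirrored as below; the padding and output_padding values are assumptions chosen so that a 7*7 second intermediate feature is restored to 28*28, and the shapes are example values.

```python
import torch
from torch import nn

# Second deconvolution (transposed convolution): 1 -> 4 channels, 7*7 kernel,
# stride 4, dilation 2. padding / output_padding are assumed values that
# restore a 7*7 feature to 28*28.
second_deconv = nn.ConvTranspose2d(1, 4, kernel_size=7, stride=4, dilation=2,
                                   padding=6, output_padding=3)

second = torch.randn(2, 1, 7, 7)          # second intermediate feature
fifth = second_deconv(second)             # (2, 4, 28, 28): fifth intermediate features
third = fifth.mean(dim=1, keepdim=True)   # (2, 1, 28, 28): channel mean -> third intermediate feature
print(fifth.shape, third.shape)
```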
  • nonlinear activation processing can be performed on the third intermediate feature, and, in order to facilitate subsequent calculations, normalization processing is also required.
  • the nonlinear activation processing and the normalization processing can be implemented by one function that has both nonlinear activation and normalization functions, or they can be implemented by a nonlinear activation function and a normalization function respectively; the embodiments of the present disclosure do not limit this.
  • in some optional implementations, activating the third intermediate features to obtain the spatial attention features includes: performing nonlinear normalized activation on the third intermediate features based on the Sigmoid function to obtain the spatial attention features. Since the Sigmoid function has both nonlinear activation and normalization functions, the spatial attention features can be obtained directly based on the Sigmoid function.
  • in some other optional implementations, activation processing is performed on the third intermediate features to obtain the spatial attention features, including: performing nonlinear activation on the third intermediate features based on the ReLU function, and normalizing the nonlinear activation result based on the normalized exponential (Softmax) function to obtain the spatial attention features.
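  For illustration, one way to realize "ReLU activation followed by Softmax normalization" is to apply the Softmax over the spatial positions of each feature map; taking the Softmax over the flattened h*w axis, as well as the example shapes, are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

third = torch.randn(2, 1, 28, 28)   # third intermediate feature

activated = F.relu(third)           # nonlinear activation
b, c, h, w = activated.shape
# Normalize over the spatial positions (an assumed choice of Softmax axis),
# so the attention weights of each feature map sum to 1.
attn = F.softmax(activated.view(b, c, h * w), dim=-1).view(b, c, h, w)
print(attn.shape)                   # (2, 1, 28, 28) spatial attention features
```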
  • the target features that can well represent the spatial attention information can be obtained.
  • step S24 performing feature fusion on the features to be processed and the spatial attention features to obtain the target features includes: adding the features to be processed and the spatial attention features point by point to obtain the target features.
  • step S24 performing feature fusion on the features to be processed and the spatial attention features to obtain the target features includes: multiplying the features to be processed and the spatial attention features point by point to obtain the target features.
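  A minimal illustration of the point-by-point fusion in step S24 follows, relying on broadcasting of the single attention channel over the c channels of the features to be processed; the shapes are example values and not fixed by the embodiment.

```python
import torch

features = torch.randn(2, 256, 28, 28)  # features to be processed (b, c, h, w)
attention = torch.rand(2, 1, 28, 28)    # spatial attention features (b, 1, h, w)

# Point-by-point multiplication: the single attention channel is broadcast
# over all c channels, so the target features keep the shape (b, c, h, w).
target_mul = features * attention

# Point-by-point addition is the alternative fusion mentioned in the text.
target_add = features + attention
print(target_mul.shape, target_add.shape)  # both (2, 256, 28, 28)
```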
  • step S25 the processing result can be determined based on the target characteristics.
  • determining the processing result based on the target characteristics includes: determining the processing result directly based on the target characteristics; or performing some data processing operations on the target characteristics to obtain the processing results.
  • the embodiment of the present disclosure implements the corresponding data processing method through the above steps.
  • the above method corresponds to some network layers or network structures in the target neural network, and these network layers or network structures implement the above data processing method.
  • FIG 3 is a schematic diagram of a target neural network provided by an embodiment of the present disclosure.
  • the target neural network includes: a first network structure, a spatial attention module and a second network structure, where the spatial attention module includes a context modeling unit, a conversion unit and a fusion unit.
  • the data to be processed is input into the target neural network
  • the first network structure located before the spatial attention module processes the data to be processed, obtains the features to be processed, and uses the features to be processed as the input data of the spatial attention module; the context modeling unit compresses the features to be processed from the channel dimension to obtain the channel context features, and inputs the channel context features into the conversion unit; the conversion unit performs feature conversion on the channel context features to obtain the spatial attention features, and inputs the spatial attention features into the fusion unit; the fusion unit fuses the features to be processed and the spatial attention features through point-by-point multiplication or point-by-point addition to obtain the target features, and inputs the target features into the second network structure.
  • the second network structure performs corresponding data processing based on the target characteristics to obtain processing results.
  • the first network structure is used to perform the processing process of step S21
  • the context modeling unit is used to perform the processing process of step S22
  • the conversion unit is used to perform the processing process of step S23.
  • the fusion unit is used to perform the processing of step S24
  • the second network structure is used to perform the processing of step S25.
  • the first network structure and the second network structure are abstract network structures, and their internal structures may be the same or different; the embodiments of the present disclosure do not limit this. Further, in some optional implementations, the first network structure and the second network structure can be set according to task processing requirements, statistical data, experience, and other information.
  • the first network structure may include any one or more of convolutional layers, pooling layers, connection layers, activation layers, etc.
  • the second network structure may also include any one or more of network layers such as convolutional layers, pooling layers, connection layers, and activation layers.
  • FIG 3 only shows the framework structure of the target neural network relatively simply from the functional level.
  • each of the above network structures or modules can optionally be composed of more fine-grained functional units.
  • for example, the spatial attention module in the target neural network may be composed of the various functional units shown in Figure 4.
  • FIG 4 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure.
  • the spatial attention module includes: a context modeling unit, a conversion unit and a fusion unit, where the conversion unit includes a feature extraction layer, a first activation layer, a feature reduction layer and a second activation layer.
  • the context modeling unit first compresses them from the channel dimension to obtain the channel context features, and inputs the channel context features into the feature extraction layer.
  • the feature extraction layer performs feature extraction on the channel context feature in the spatial dimension, obtains the first intermediate feature, and inputs the first intermediate feature into the first activation layer;
  • the first activation layer activates the first intermediate feature, and obtains the second intermediate features, and input the second intermediate features into the feature reduction layer;
  • the feature reduction layer performs feature reduction processing on the second intermediate features, obtains the third intermediate features, and inputs the third intermediate features into the second activation layer;
  • the second activation layer activates the third intermediate features to obtain the spatial attention features, and inputs the spatial attention features into the fusion unit;
  • the fusion unit performs feature fusion on the features to be processed and the spatial attention features by point-by-point multiplication or point-by-point addition to obtain the target features, and outputs the target features so that other network structures of the target neural network can perform data processing based on the target features.
  • feature compression can be achieved by pooling processing
  • feature extraction can be achieved by convolution processing
  • feature restoration can be achieved by deconvolution processing
  • feature activation can be achieved by the corresponding activation function.
  • FIG. 5 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure.
  • the spatial attention module mainly includes: a global average pooling (GAP) layer, a first convolution layer, a ReLU activation layer, a first deconvolution layer, and a Sigmoid activation layer.
  • the feature to be processed is a four-dimensional tensor (b, c, h1, w1), where b represents the number of features to be processed, c represents the number of channels of the feature to be processed, h1 and w1 respectively Represents the height and width of the feature to be processed.
  • after the features to be processed are input into the spatial attention module, the global average pooling layer performs global average pooling on the features to be processed to obtain the channel context features. Since this is a global pooling process, the channel context feature is a global channel context feature, and the tensor size corresponding to the channel context feature is (b, 1, h1, w1). In other words, the global average pooling layer compresses the number of channels of the features to be processed from c to 1, but does not change their feature size in the spatial dimension.
  • after the channel context features are obtained, the first convolution layer performs convolution processing on the channel context features and extracts the first intermediate features.
  • the tensor size corresponding to the first intermediate features is (b, 1, h2, w2), where h2 < h1 and w2 < w1.
  • the first convolution layer can use a convolution kernel to squeeze the channel context features in the spatial dimension (the height is squeezed from h1 to h2, and the width is squeezed from w1 to w2) to obtain the first intermediate features with a smaller size in the spatial dimension.
  • the ReLU activation layer performs nonlinear activation processing on the first intermediate feature to obtain the second intermediate feature, and the corresponding tensor size is (b, 1, h2, w2).
  • after the second intermediate features are obtained, the first deconvolution layer performs deconvolution processing on the second intermediate features to expand them in the spatial dimension and obtain the third intermediate features.
  • the corresponding tensor size is (b, 1, h1, w1).
  • the height of the second intermediate feature in the spatial dimension is expanded from h2 to h1, and the width is expanded from w2 to w1, while keeping the number of channels unchanged.
  • the third intermediate feature is nonlinearly activated and normalized through the Sigmoid activation layer to obtain the spatial attention feature, and the corresponding tensor size is (b, 1, h1, w1).
  • the spatial attention features are multiplied point by point into the features to be processed to obtain the target features.
  • the tensor size corresponding to the target features is (b, c, h1, w1).
  • the target feature is a feature that incorporates spatial attention information.
  • in an example, the first convolution layer corresponds to one convolution kernel with a size of 3*3 and a step size of 2, and the first deconvolution corresponds to one convolution kernel with a size of 3*3 and a step size of 2.
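  For reference, a minimal PyTorch sketch of this simple spatial attention module (channel-dimension global average pooling, Conv3*3 with step size 2, ReLU, TransposedConv3*3 with step size 2, Sigmoid, point-by-point multiplication) is given below; the padding and output_padding values are assumptions chosen so that 28*28 features are squeezed to 14*14 and restored to 28*28, and are not specified by the embodiment.

```python
import torch
from torch import nn

class SimpleSpatialAttention(nn.Module):
    """Sketch of the simple spatial attention module (cf. Figure 5)."""

    def __init__(self):
        super().__init__()
        # Conv3*3, stride 2: squeezes the channel context feature in the spatial dimension.
        self.conv = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
        self.relu = nn.ReLU(inplace=True)
        # TransposedConv3*3, stride 2: restores the original spatial size.
        self.deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2,
                                         padding=1, output_padding=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                      # x: features to be processed (b, c, h, w)
        ctx = x.mean(dim=1, keepdim=True)      # average pooling over channels -> (b, 1, h, w)
        mid = self.relu(self.conv(ctx))        # first / second intermediate features (b, 1, h/2, w/2)
        attn = self.sigmoid(self.deconv(mid))  # spatial attention features (b, 1, h, w)
        return x * attn                        # target features (b, c, h, w)

x = torch.randn(2, 256, 28, 28)
print(SimpleSpatialAttention()(x).shape)       # torch.Size([2, 256, 28, 28])
```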
  • taking features to be processed with a shape of (b, 256, 28, 28) as an example, the features to be processed are input into the spatial attention module, and the global average pooling layer aggregates the information of all channel dimensions element by element to obtain channel context features with a shape of (b, 1, 28, 28); then the first convolution layer performs Conv3*3 convolution processing with a step size of 2 to achieve feature compression in the spatial dimension and obtain the first intermediate features with a shape of (b, 1, 14, 14); the ReLU activation layer performs nonlinear activation processing on the first intermediate features to obtain the second intermediate features with a shape of (b, 1, 14, 14); the first deconvolution layer performs TransposedConv3*3 deconvolution processing with a step size of 2 to achieve feature expansion in the spatial dimension and obtain the third intermediate features with a shape of (b, 1, 28, 28); after nonlinear activation and normalization of the third intermediate features by the Sigmoid activation layer, spatial attention features with a shape of (b, 1, 28, 28) are obtained; finally, the spatial attention features are multiplied point by point into the features to be processed to obtain the target features with a shape of (b, 256, 28, 28).
  • after the global average pooling, the number of channels is only 1, so the subsequent Conv3*3 and TransposedConv3*3 require much less calculation (taking features to be processed of shape (b, 256, 28, 28) as an example, since the number of channels is reduced by a factor of 256, the amount of calculation is reduced by at least 256 times), which can effectively improve task processing efficiency.
  • in contrast, channel squeezing and excitation based on SENet are usually based on a full connection over the channels, that is, each output channel is calculated from all input channels, regardless of whether the number of channels changes.
  • each pixel in the target features is calculated from only 3*3 pixels in the features to be processed, and the receptive field size is about 7*7 (calculated with a convolution kernel size of 3*3 and a step size of 2), resulting in a relatively limited range of spatial attention.
  • the size of the above-mentioned receptive field can be appropriately increased to expand the scope of spatial attention, thereby improving task processing accuracy. It should be understood that increasing the size of the receptive field usually leads to an increase in the amount of calculation. In other words, the increase in the accuracy of the above task processing may be achieved at the expense of some processing capabilities.
  • the spatial attention module shown in Figure 5 is regarded as a simple version; by improving or strengthening some of its network layers, an enhanced spatial attention module with a larger range of spatial attention can be obtained.
  • the global average pooling layer in Figure 5 is replaced by a global convolution layer, and the learnable parameters of the convolution are used to better retain the information of the features to be processed.
  • the first convolution layer in Figure 5 is replaced with a second convolution layer and a first channel average layer, where the second convolution layer uses a larger convolution kernel and corresponds to multiple output channels.
  • the first channel averaging layer averages the features of each output channel to obtain the corresponding first intermediate features.
  • the first deconvolution layer in Figure 5 is replaced with a second deconvolution layer and a second channel average layer, where the second deconvolution layer can use a larger convolution kernel and corresponds to multiple output channels, and the second channel averaging layer averages the features of each output channel to obtain the corresponding third intermediate features.
  • any one or more of the above improvements can be implemented on the basis of the simple version of the spatial attention module in order to increase the scope of spatial attention.
  • FIG. 6 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure, which is an enhanced version of the spatial attention module.
  • the spatial attention module includes: a global convolution layer, a feature extraction layer, a ReLU activation layer, a feature reduction layer and a Sigmoid activation layer, where the feature extraction layer includes a second convolution layer and a first channel average layer, The feature reduction layer includes a second deconvolution layer and a second channel averaging layer.
  • the feature to be processed is a four-dimensional tensor (b, c1, h1, w1), where b represents the number of features to be processed, c1 represents the number of channels of the feature to be processed, h1 and w1 respectively Represents the height and width of the feature to be processed.
  • after the features to be processed are input into the spatial attention module, the global convolution layer first performs global convolution processing on the features to be processed to obtain the channel context features. Since this is a global convolution process, the channel context feature is a global channel context feature, and the tensor size corresponding to the channel context feature is (b, 1, h1, w1). In other words, the global convolution layer compresses the number of channels of the features to be processed from c1 to 1 without changing their feature size in the spatial dimension; at the same time, the feature information is better retained through the learnable parameters in the global convolution layer.
  • the channel context features are processed through a feature extraction layer composed of a second convolution layer and a first channel average layer to obtain the first intermediate features.
  • the second convolution layer performs convolution processing on the channel context features to obtain the fourth intermediate features of multiple channels
  • the first channel average layer calculates the average value of the multiple fourth intermediate features in the channel dimension to obtain the first intermediate features.
  • the tensor size corresponding to the fourth intermediate feature of multiple channels is (b, c2, h3, w3)
  • the tensor size corresponding to the first intermediate features is (b, 1, h3, w3), where 1 < c2 < c1, h3 < h1, and w3 < w1.
  • the second convolutional layer obtains the fourth intermediate features of c2 channels based on the channel context features of 1 channel, and at the same time squeezes the channel context features in the spatial dimension (the height is squeezed from h1 to h3, and the width is squeezed from w1 to w3); the first channel averaging layer calculates the average of the above-mentioned plurality of fourth intermediate features in the channel dimension to obtain the first intermediate features.
  • the ReLU activation layer performs nonlinear activation processing on the first intermediate feature to obtain the second intermediate feature, and the corresponding tensor size is (b, 1, h3, w3).
  • the second intermediate feature is processed through a feature reduction layer composed of a second deconvolution layer and a second channel average layer to obtain a third intermediate feature.
  • the second intermediate features are deconvolved by the second deconvolution layer to obtain the fifth intermediate features of multiple channels, and then the second channel average layer calculates the average of the multiple fifth intermediate features in the channel dimension to obtain the third intermediate features.
  • the tensor size corresponding to the fifth intermediate feature of multiple channels is (b, c3, h1, w1)
  • the tensor size corresponding to the third intermediate features is (b, 1, h1, w1), where 1 < c3 < c1, and c3 and c2 can be the same or different.
  • the second deconvolution layer obtains the fifth intermediate features of c3 channels based on the second intermediate features of 1 channel, and at the same time expands the second intermediate features in the spatial dimension (the height is expanded from h3 to h1, and the width is expanded from w3 to w1); the second channel averaging layer calculates the average of the above-mentioned plurality of fifth intermediate features in the channel dimension to obtain the third intermediate features.
  • the third intermediate feature is nonlinearly activated and normalized through the Sigmoid activation layer to obtain the spatial attention feature, and the corresponding tensor size is (b, 1, h1, w1).
  • the spatial attention features are multiplied point by point into the features to be processed to obtain the target features.
  • the tensor size corresponding to the target features is (b, c1, h1, w1).
  • in an example, the second convolution layer corresponds to four convolution kernels, each convolution kernel corresponds to one channel, the convolution kernel size is 7*7, the step size is 4, and the expansion coefficient is 2; the second deconvolution also corresponds to four convolution kernels, each convolution kernel corresponds to one channel, the size of the convolution kernel is 7*7, the step size is 4, and the expansion coefficient is 2.
  • taking features to be processed with a shape of (b, 256, 28, 28) as an example, the features to be processed are input into the spatial attention module, and the global convolution layer performs feature compression on the features to be processed in the channel dimension to obtain channel context features with a shape of (b, 1, 28, 28); then the second convolution layer uses convolution kernels corresponding to 4 channels to perform Conv7*7 convolution processing with a step size of 4 and an expansion coefficient of 2, so that the channel context features are expanded to 4 channels in the channel dimension and compressed in the spatial dimension, obtaining the fourth intermediate features; the first channel average layer calculates the average of the above-mentioned fourth intermediate features in the channel dimension (that is, it averages the feature points at the same spatial position in the four channels) to obtain the first intermediate features with a shape of (b, 1, 7, 7); the ReLU activation layer performs nonlinear activation processing on the first intermediate features to obtain the second intermediate features with a shape of (b, 1, 7, 7); then the second deconvolution layer uses convolution kernels corresponding to 4 channels to perform TransposedConv7*7 deconvolution processing with a step size of 4 and an expansion coefficient of 2 to expand the second intermediate features in the spatial dimension, and the second channel average layer averages the result in the channel dimension to obtain the third intermediate features with a shape of (b, 1, 28, 28); after nonlinear activation and normalization by the Sigmoid activation layer, spatial attention features with a shape of (b, 1, 28, 28) are obtained; finally, the spatial attention features are multiplied point by point into the features to be processed to obtain the target features with a shape of (b, 256, 28, 28).
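  Similarly, a hedged sketch of the enhanced spatial attention module (cf. Figure 6) is shown below. The use of a 1*1 convolution to realize the "global convolution layer", and the padding / output_padding values, are assumptions of this sketch; the embodiment only fixes the 7*7 kernel, step size 4, expansion coefficient 2, and the 4 output channels.

```python
import torch
from torch import nn

class EnhancedSpatialAttention(nn.Module):
    """Sketch of the enhanced spatial attention module (cf. Figure 6)."""

    def __init__(self, in_channels=256):
        super().__init__()
        # "Global convolution layer": learnable compression of c1 channels to 1
        # channel; a 1*1 convolution is an assumed realization.
        self.global_conv = nn.Conv2d(in_channels, 1, kernel_size=1)
        # Second convolution: 1 -> 4 channels, Conv7*7, stride 4, dilation 2.
        self.conv = nn.Conv2d(1, 4, kernel_size=7, stride=4, dilation=2, padding=6)
        self.relu = nn.ReLU(inplace=True)
        # Second deconvolution: 1 -> 4 channels, TransposedConv7*7, stride 4, dilation 2.
        self.deconv = nn.ConvTranspose2d(1, 4, kernel_size=7, stride=4, dilation=2,
                                         padding=6, output_padding=3)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                              # x: (b, c1, h, w)
        ctx = self.global_conv(x)                      # channel context features (b, 1, h, w)
        fourth = self.conv(ctx)                        # fourth intermediate features (b, 4, h', w')
        first = fourth.mean(dim=1, keepdim=True)       # channel mean -> first intermediate feature
        second = self.relu(first)                      # second intermediate feature
        fifth = self.deconv(second)                    # fifth intermediate features (b, 4, h, w)
        third = fifth.mean(dim=1, keepdim=True)        # channel mean -> third intermediate feature
        attn = self.sigmoid(third)                     # spatial attention features (b, 1, h, w)
        return x * attn                                # target features (b, c1, h, w)

x = torch.randn(2, 256, 28, 28)
print(EnhancedSpatialAttention()(x).shape)             # torch.Size([2, 256, 28, 28])
```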
  • during the processing of the simple version of the spatial attention module, the range of spatial attention is about 7*7, which is small compared with the 28*28 spatial dimension of the feature; during the processing of the enhanced version of the spatial attention module, the convolution kernel with a size of 7*7, an expansion coefficient of 2, and a step size of 4 has a receptive field of about 53*53.
  • the spatial attention module shown in Figures 4 to 6 above corresponds to the SE Block structure in SENet. Based on the spatial attention module shown in Figure 3, a variety of deformations can be performed to obtain a variety of variant structures of SENet, and these variants can also be used to implement the spatial attention mechanism to obtain spatial attention information.
  • Figure 7 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure, which belongs to one of the variants of SENet (ie Simplified NL Block).
  • the tensor size corresponding to the feature to be processed is (b, c, h1, w1)
  • the context modeling unit includes a third convolution layer and a normalization layer, which is used to Perform feature compression on the features to be processed from the channel dimension to obtain the channel context features.
  • the corresponding tensor size is (b, 1, h1, w1); the fourth convolution layer corresponds to the conversion unit, which is used to perform feature conversion on the channel context features.
  • the fusion unit is implemented by a point-by-point multiplier, which is used to multiply the spatial attention features into the features to be processed point by point to obtain the target Features, the tensor size of the target feature is (b, c, h1, w1).
  • SENet global context modeling framework
  • GC Block global context modeling framework
  • the processing process is similar , both of which first compress the features to be processed from the channel dimension to obtain the channel context features, then stimulate the channel context features in the spatial dimension to obtain the spatial attention features, and finally fuse the spatial attention features with the features to be processed to obtain the target feature.
  • the above spatial attention mechanism can be used in combination with the channel attention mechanism to further improve the accuracy of task processing results.
  • the processing result may be determined by attention information in the spatial dimension and attention information in the channel dimension.
  • the attention information in the spatial dimension is obtained based on the spatial attention mechanism
  • the attention information in the channel dimension is obtained based on Channel attention mechanism is obtained.
  • combining the two mechanisms can simultaneously obtain the attention information of the spatial dimension and the attention information of the channel dimension, and the representation effect of the features can be further improved, and correspondingly, the accuracy of task processing results can also be improved.
  • Figure 8 is a schematic diagram of a target neural network provided by an embodiment of the present disclosure.
  • the target neural network includes a first network structure, a spatial attention module, a channel attention module, a fusion module and a second network structure, where the spatial attention module is a module set based on the spatial attention mechanism for To obtain attention information in the spatial dimension, the channel attention module is a module set based on the channel attention mechanism and is used to obtain attention information in the channel dimension.
  • the data to be processed is input into the target neural network, and the first network structure located before the spatial attention module and the channel attention module processes the data to be processed, obtains the features to be processed, and adds the features to be processed.
  • the features are processed as input data to the spatial attention module and the channel attention module.
  • the spatial attention module For the spatial attention module, it performs feature compression on the features to be processed from the channel dimension through the first context modeling unit, obtains the channel context features, and inputs the channel context features into the first conversion unit; the first conversion unit The features are transformed into features to obtain spatial attention features, and the spatial attention features are input into the first fusion unit; the first fusion unit combines the features to be processed with the spatial attention features by point-by-point multiplication or point-by-point addition. Fusion, obtain the first target feature, and input the first target feature into the fusion module.
  • the channel attention module performs feature compression on the features to be processed from the spatial dimension through the second context modeling unit, obtains the spatial context features, and inputs the spatial context features into the second conversion unit; the second conversion unit The spatial context features are transformed into features to obtain channel attention features, and the channel attention features are input into the second fusion unit; the second fusion unit multiplies the features to be processed and the channel attention features by point-by-point multiplication or point-by-point addition. Perform feature fusion to obtain the second target feature, and input the second target feature into the fusion module.
  • the fusion module further fuses the first target feature and the second target feature to obtain both spatial attention information and
  • the fusion feature of the channel attention information is input into the second network structure, and the second network structure performs corresponding data processing based on the fusion feature to obtain the processing result.
  • the spatial attention module and the channel attention module act on the same feature to be processed to simultaneously enhance its feature expression effect from the spatial dimension and the channel dimension.
  • the spatial attention module and the channel attention module can also act on different features to be processed, so as to adopt different processing methods for different features to be processed.
  • the spatial attention module acts on the first feature to be processed to obtain the attention information of the first feature to be processed in the spatial dimension; the channel attention module acts on the second feature to be processed to obtain the second feature to be processed. Process the attentional information of features in the channel dimension.
  • the spatial attention module includes a naive version and an enhanced version.
  • the channel attention module can also include a naive version and an enhanced version.
  • the naive versions of both can be used at the same time, or the enhanced versions of both can be used at the same time, or the naive version of one of them can be used and the enhanced version of the other can be used.
  • the embodiments of the present disclosure are suitable for This is not a restriction.
  • the spatial attention module and the channel attention module can use various SENet (including corresponding variants) structures, and the spatial attention module and the channel attention module can use the same SENet structure or different ones. SENet structure, the embodiment of the present disclosure does not limit this.
  • the processing processes of the spatial attention module and the channel attention module are relatively independent, and data processing can be performed without relying on the processing results of the other party. Therefore, the spatial attention module and the channel attention module
  • the processing procedures of the modules can be executed simultaneously or sequentially, and the embodiments of the present disclosure do not limit this.
  • the above-mentioned spatial attention module and channel attention module can be carried by different hardware devices, or can be carried by the same hardware device.
  • the processing processes of the spatial attention module and the channel attention module can be executed sequentially, or by establishing two processes, they can be executed simultaneously in the two processes.
  • the processing processes of the spatial attention module and the channel attention module can be executed by the respective hardware devices at the same time, or can be executed in sequence.
  • a second aspect of the embodiment of the present disclosure provides a neural network model.
  • Figure 9 is a schematic diagram of a neural network model provided by an embodiment of the present disclosure.
  • the neural network model is a model constructed based on the model parameters of the target neural network, wherein the target neural network adopts the target neural network in any one of the embodiments of the present disclosure.
  • the neural network model can be used to perform at least one of image processing tasks, speech processing tasks, text processing tasks, and video processing tasks. No matter what kind of task the neural network model performs, it needs to obtain the attention information of the features in the spatial dimension during the execution process. Based on this, the neural network model includes the following steps during the execution of the task: compress the features from the channel dimension, Excite the correlation of the channel-compressed features in the spatial dimension to obtain the attention information of the features in the spatial dimension.
  • the structure of the neural network model may be different for different types of tasks, but no matter how its structure changes, it includes functional modules for executing the spatial attention mechanism.
  • an initial neural network model is built based on the task to be processed.
  • the model parameters are initial parameters.
  • the task to be processed is executed directly based on the initial neural network model, the task Processing accuracy is low.
  • the model parameters of the target neural network are used to update the corresponding parameters in the initial neural network model to obtain a neural network model with higher accuracy.
  • the process of building a neural network model based on the model parameters of the target neural network can be implemented through model training.
  • each model parameter is an initialization parameter set based on experience, statistical data, or randomly set. This initial model cannot be directly used to perform tasks.
  • obtain the corresponding training set train the initial neural network model based on the training set, and obtain the training results. Then, according to the training results The result and the preset iteration conditions determine whether to continue training the model. If it is determined to continue training the model, it means that the current model parameters have not reached the optimum and there is room for continued optimization. Therefore, the model is updated according to the results of this round of training. parameters, and iteratively train the updated model based on the training set until it is determined to stop training the model, thereby obtaining a trained neural network model.
  • the model parameters correspond to the model parameters of the target neural network.
  • model verification and correction can also be performed based on the verification set.
  • model evaluation can also be performed based on the test set.
  • the embodiments of the present disclosure improve the neural network model. There is no restriction on the acquisition method.
  • the disclosure also provides data processing devices, electronic equipment, and computer-readable storage media, all of which can be used to implement any of the data processing methods provided by the disclosure.
  • data processing devices electronic equipment, and computer-readable storage media, all of which can be used to implement any of the data processing methods provided by the disclosure.
  • Figure 10 is a block diagram of a data processing device provided by an embodiment of the present disclosure.
  • an embodiment of the present disclosure provides a data processing device, which includes the following modules.
  • the data processing module 101 is used to input the data to be processed into the target neural network, perform data processing based on the spatial attention mechanism of the squeeze and excitation framework, and obtain processing results.
  • the spatial attention mechanism is used to compress features from the channel dimension, stimulate the correlation of the channel-compressed features in the spatial dimension, and obtain the attention information of the features in the spatial dimension.
  • the data processing module includes a first determination sub-module, a compression sub-module, a conversion sub-module, a fusion sub-module and a second determination sub-module.
  • the first determination sub-module is used to determine the features to be processed based on the data to be processed;
  • the compression sub-module is used to compress the features to be processed from the channel dimension to obtain channel context features;
  • the conversion sub-module is used to for performing feature conversion on the channel context features to obtain spatial attention features; a fusion submodule for feature fusion of the to-be-processed features and the spatial attention features to obtain target features;
  • a second determination submodule Used to determine the processing result according to the target feature; wherein the feature size to be processed, the channel context feature, the spatial attention feature and the target feature in the spatial dimension are the same, and the feature to be processed is
  • the channel context feature and the spatial attention feature have the same first feature size in the channel dimension, and the channel context feature and the spatial attention feature have the same second feature
  • the above functional sub-modules are mapped to the target neural network shown in Figure 3.
  • the first determination sub-module corresponds to the first network structure
  • the compression sub-module corresponds to the context modeling unit
  • the conversion sub-module corresponds to the conversion unit
  • the fusion sub-module corresponds to the fusion unit
  • the second determination sub-module corresponds to the second network structure.
  • the compression sub-module, conversion sub-module, and fusion sub-module also include more fine-grained functional units.
  • relevant content please refer to the corresponding descriptions of the embodiments of the present disclosure, and will not be repeated here.
  • Figure 11 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
  • an embodiment of the present disclosure provides an electronic device, which includes: at least one processor 1101; at least one memory 1102, and one or more I/O interfaces 1103 connected between the processor 1101 and the memory 1102 among them, the memory 1102 stores one or more computer programs that can be executed by at least one processor 501, and the one or more computer programs are executed by at least one processor 1101, so that at least one processor 1101 can execute the above-mentioned Data processing methods.
  • Figure 12 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
  • an embodiment of the present disclosure provides an electronic device.
  • the electronic device includes multiple processing cores 1201 and an on-chip network 1202.
  • the multiple processing cores 1201 are connected to the on-chip network 1202.
  • the on-chip network 1202 is used to interact with multiple processing cores.
  • One handles inter-core data and external data.
  • one or more instructions are stored in one or more processing cores 1201, and one or more instructions are processed by one or more processing cores 1201. 1201 is executed so that one or more processing cores 1201 can execute the above data processing method.
  • the electronic device may be a brain-like chip, because the brain-like chip can adopt a vectorized calculation method and needs to be loaded into the neural network through an external memory such as a double data rate (Double Data Rate, DDR) synchronous dynamic random access memory. Model weight information and other parameters. Therefore, the operation efficiency of batch processing in the embodiments of the present disclosure is relatively high.
  • DDR Double Data Rate
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored, wherein the computer program implements the above-mentioned data processing method when executed by a processor/processing core.
  • Computer-readable storage media may be volatile or non-volatile computer-readable storage media.
  • Embodiments of the present disclosure also provide a computer program product, including computer readable code, or a non-volatile computer readable storage medium carrying the computer readable code.
  • computer readable code When the computer readable code is stored in a processor of an electronic device, When running, the processor in the electronic device executes the above data processing method.
  • computer storage media includes volatile and non-volatile media implemented in any method or technology for storage of information such as computer readable program instructions, data structures, program modules or other data. lossless, removable and non-removable media.
  • Computer storage media include, but are not limited to, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM), static random access memory (SRAM), flash memory or other memory technology, portable Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disk (DVD) or other optical disk storage, magnetic cassette, magnetic tape, disk storage or other magnetic storage device, or that can be used to store the desired information and can be accessed by a computer any other media.
  • communication media typically embodies computer readable program instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery medium.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to various computing/processing devices, or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage on a computer-readable storage medium in the respective computing/processing device .
  • Computer program instructions for performing operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or instructions in one or more programming languages.
  • the computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server implement.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (such as an Internet service provider through the Internet). connect).
  • LAN local area network
  • WAN wide area network
  • an external computer such as an Internet service provider through the Internet. connect
  • an electronic circuit such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA)
  • the electronic circuit can Computer readable program instructions are executed to implement various aspects of the disclosure.
  • the computer program product described here may be implemented specifically through hardware, software, or a combination thereof.
  • the computer program product is embodied as a computer storage medium.
  • the computer program product is embodied as a software product, such as a Software Development Kit (SDK), etc. wait.
  • SDK Software Development Kit
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, thereby producing a machine that, when executed by the processor of the computer or other programmable data processing apparatus, , resulting in an apparatus that implements the functions/actions specified in one or more blocks in the flowchart and/or block diagram.
  • These computer-readable program instructions can also be stored in a computer-readable storage medium. These instructions cause the computer, programmable data processing device and/or other equipment to work in a specific manner. Therefore, the computer-readable medium storing the instructions includes An article of manufacture that includes instructions that implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other equipment, causing a series of operating steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process , thereby causing instructions executed on a computer, other programmable data processing apparatus, or other equipment to implement the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions that contains one or more components for implementing the specified logical function(s).
  • Executable instructions may occur out of the order noted in the figures. For example, two consecutive blocks may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • each block of the block diagram and/or flowchart illustration, and combinations of blocks in the block diagram and/or flowchart illustration can be implemented by special purpose hardware-based systems that perform the specified functions or acts. , or can be implemented using a combination of specialized hardware and computer instructions.
  • Example embodiments have been disclosed herein, and although specific terms are employed, they are used and should be interpreted in a general illustrative sense only and not for purpose of limitation. In some instances, it will be apparent to those skilled in the art that features, characteristics and/or elements described in connection with a particular embodiment may be used alone, or may be used in conjunction with other embodiments, unless expressly stated otherwise. Features and/or components used in combination. Accordingly, it will be understood by those skilled in the art that various changes in form and details may be made without departing from the scope of the present disclosure as set forth in the appended claims.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a data processing method and apparatus, a neural network model, a device, and a medium. The data processing method comprises: inputting data to be processed into a target neural network, and performing data processing on the basis of a spatial attention mechanism of a squeeze-and-excitation framework, so as to obtain a processing result, wherein the spatial attention mechanism is used for compressing features in a channel dimension, and exciting the relevance, in a spatial dimension, of the features which have been subjected to channel compression, so as to obtain attention information of the features in the spatial dimension.

Description

数据处理方法及装置、神经网络模型、设备、介质Data processing methods and devices, neural network models, equipment, media 技术领域Technical field
本公开实施例涉及计算机技术领域,尤其涉及一种数据处理方法及装置、神经网络模型、电子设备、计算机可读存储介质。The embodiments of the present disclosure relate to the field of computer technology, and in particular, to a data processing method and device, a neural network model, electronic equipment, and a computer-readable storage medium.
背景技术Background technique
神经网络等技术已经广泛应用在图像处理、视频处理、语音处理以及文本处理等领域中。在基于神经网络执行相应的任务时,通常需要进行特征提取,并基于提取的特征进行数据处理。Technologies such as neural networks have been widely used in image processing, video processing, speech processing, text processing and other fields. When performing corresponding tasks based on neural networks, feature extraction is usually required, and data processing is performed based on the extracted features.
发明内容Contents of the invention
本公开提供一种数据处理方法及装置、神经网络模型、电子设备、计算机可读存储介质。The present disclosure provides a data processing method and device, a neural network model, electronic equipment, and a computer-readable storage medium.
第一方面,本公开提供了一种数据处理方法,该数据处理方法包括:将待处理数据输入目标神经网络,基于挤压与激励框架的空间注意力机制进行数据处理,获得处理结果;其中,所述空间注意力机制用于从通道维度对特征进行压缩,激励经过通道压缩的特征在空间维度的关联性,获得所述特征在空间维度的注意力信息。In a first aspect, the present disclosure provides a data processing method. The data processing method includes: inputting data to be processed into a target neural network, performing data processing based on the spatial attention mechanism of the squeeze and excitation framework, and obtaining processing results; wherein, The spatial attention mechanism is used to compress features from the channel dimension, stimulate the correlation of the channel-compressed features in the spatial dimension, and obtain the attention information of the features in the spatial dimension.
第二方面,本公开提供了一种神经网络模型,该神经网络模型是基于目标神经网络的模型参数构建的模型,其中,所述目标神经网络采用本公开实施例中任一项所述的目标神经网络。In a second aspect, the present disclosure provides a neural network model, which is a model constructed based on the model parameters of a target neural network, wherein the target neural network adopts the target described in any one of the embodiments of the present disclosure. Neural Networks.
第三方面,本公开提供了一种数据处理装置,该数据处理装置包括:数据处理模块,用于将待处理数据输入目标神经网络,基于挤压与激励框架的空间注意力机制进行数据处理,获得处理结果;其中,所述空间注意力机制用于从通道维度对特征进行压缩,激励经过通道压缩的特征在空间维度的关联性,获得所述特征在空间维度的注意力信息。In a third aspect, the present disclosure provides a data processing device. The data processing device includes: a data processing module for inputting data to be processed into a target neural network and performing data processing based on the spatial attention mechanism of the extrusion and excitation framework. Obtain processing results; wherein, the spatial attention mechanism is used to compress features from the channel dimension, stimulate the correlation of the channel-compressed features in the spatial dimension, and obtain attention information of the features in the spatial dimension.
第四方面,本公开提供了一种电子设备,该电子设备包括:至少一个处理器;以及与所述至少一个处理器通信连接的存储器;其中,所述存储器存储有可被所述至少一个处理器执行的一个或多个计算机程序,一个或多个所述计算机程序被所述至少一个处理器执行,以使所述至少一个处理器能够执行上述的数据处理方法。In a fourth aspect, the present disclosure provides an electronic device, which includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores information that can be processed by the at least one processor. One or more computer programs are executed by the processor, and the one or more computer programs are executed by the at least one processor, so that the at least one processor can execute the above-mentioned data processing method.
第五方面,本公开提供了一种计算机可读存储介质,其上存储有计算机程序,其中,所述计算机程序在被处理器/处理核执行时实现上述的数据处理方法。In a fifth aspect, the present disclosure provides a computer-readable storage medium on which a computer program is stored, wherein the computer program implements the above-mentioned data processing method when executed by a processor/processing core.
在本公开实施例中,将待处理数据输入到目标神经网络,使得目标神经网络基于挤压与激励框架的空间注意力机制进行数据处理,获得处理结果,实现从通道维度对特征的压缩,从而可以降低特征在通道维度的尺寸,减少待处理数据量,从而提升任务处理效率,同时,通过激励经过通道压缩的特征在空间维度的关联性,能够获得特征在空间维度的注意力信息,从而提高处理结果的准确性,进而提升任务处理的准确率。In the embodiment of the present disclosure, the data to be processed is input to the target neural network, so that the target neural network performs data processing based on the spatial attention mechanism of the squeeze and excitation framework, obtains the processing results, and realizes compression of features from the channel dimension, thereby It can reduce the size of features in the channel dimension and reduce the amount of data to be processed, thereby improving task processing efficiency. At the same time, by stimulating the correlation of the channel-compressed features in the spatial dimension, the attention information of the feature in the spatial dimension can be obtained, thereby improving The accuracy of the processing results, thereby improving the accuracy of task processing.
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,而非限制本公开。根据下面参考附图对示例性实施例的详细说明,本公开的其它特征及方面将变得清楚。It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
附图说明Description of drawings
图1为本公开实施例提供的一种数据处理方法的流程图;Figure 1 is a flow chart of a data processing method provided by an embodiment of the present disclosure;
图2为本公开实施例提供的一种数据处理方法的流程图;Figure 2 is a flow chart of a data processing method provided by an embodiment of the present disclosure;
图3为本公开实施例提供的一种目标神经网络的示意图;Figure 3 is a schematic diagram of a target neural network provided by an embodiment of the present disclosure;
图4为本公开实施例提供的一种空间注意力模块的示意图; Figure 4 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure;
图5为本公开实施例提供的一种空间注意力模块的示意图;Figure 5 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure;
图6为本公开实施例提供的一种空间注意力模块的示意图;Figure 6 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure;
图7为本公开实施例提供的一种空间注意力模块的示意图;Figure 7 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure;
图8为本公开实施例提供的一种目标神经网络的示意图;Figure 8 is a schematic diagram of a target neural network provided by an embodiment of the present disclosure;
图9为本公开实施例提供的一种神经网络模型的示意图;Figure 9 is a schematic diagram of a neural network model provided by an embodiment of the present disclosure;
图10为本公开实施例提供的一种数据处理装置的框图;Figure 10 is a block diagram of a data processing device provided by an embodiment of the present disclosure;
图11为本公开实施例提供的一种电子设备的框图;Figure 11 is a block diagram of an electronic device provided by an embodiment of the present disclosure;
图12为本公开实施例提供的一种电子设备的框图。Figure 12 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
具体实施方式Detailed ways
下面结合附图和实施例对本公开作进一步的详细说明。可以理解的是,此处所描述的具体实施例仅仅用于解释本公开,而非对本公开的限定。另外还需要说明的是,为了便于描述,附图中仅示出了与本公开相关的部分而非全部结构。The present disclosure will be further described in detail below in conjunction with the accompanying drawings and examples. It can be understood that the specific embodiments described here are only used to explain the present disclosure, but not to limit the present disclosure. It should also be noted that, for convenience of description, only some but not all structures related to the present disclosure are shown in the drawings.
在执行各类任务时所依据的原始数据(例如,图片、语音、文本、视频等数据)通常是高维信息,其包含了较多的冗余信息,还有可能是稀疏性数据,因此,直接基于原始数据进行处理,计算量过大,任务的执行效率较低。The original data (for example, pictures, speech, text, video, etc.) used when performing various tasks is usually high-dimensional information, which contains more redundant information and may also be sparse data. Therefore, Processing based directly on the original data requires too much calculation and the task execution efficiency is low.
基于此,在相关技术中,通过从原始数据中提取特征的方式,获得维度相对较低的特征数据,进而基于特征数据进行数据处理,以降低计算量。但是,在部分场景中,特征的数据量仍然较大,直接基于特征数据进行数据处理时,计算量较大,可能导致任务的执行效率仍然无法满足用户需求。Based on this, in related technologies, relatively low-dimensional feature data is obtained by extracting features from original data, and then data processing is performed based on the feature data to reduce the amount of calculation. However, in some scenarios, the amount of feature data is still large. When data processing is performed directly based on feature data, the amount of calculation is large, which may result in task execution efficiency still being unable to meet user needs.
有鉴于此,本公开实施例提供一种数据处理方法及装置、神经网络模型、电子设备、计算机可读存储介质。根据本公开实施例的数据处理方法,能够从通道维度对特征的压缩,从而可以降低特征在通道维度的尺寸,减少待处理数据量,从而提升任务处理效率,同时,通过激励经过通道压缩的特征在空间维度的关联性,能够获得特征在空间维度的注意力信息,从而提高处理结果的准确性,进而提升任务处理的准确率。In view of this, embodiments of the present disclosure provide a data processing method and device, a neural network model, electronic equipment, and a computer-readable storage medium. The data processing method according to the embodiment of the present disclosure can compress features from the channel dimension, thereby reducing the size of the features in the channel dimension, reducing the amount of data to be processed, thereby improving task processing efficiency, and at the same time, by stimulating the features that have been compressed by the channel The correlation in the spatial dimension can obtain the attention information of the feature in the spatial dimension, thereby improving the accuracy of the processing results and thus improving the accuracy of task processing.
根据本公开实施例的数据处理方法可以由终端设备或服务器等电子设备执行,终端设备可以为用户设备(User Equipment,UE)、移动设备、终端、蜂窝电话、无绳电话、个人数字助理(Personal Digital Assistant,PDA)、手持设备、计算设备、车载设备、可穿戴设备等,该方法可以通过处理器调用存储器中存储的计算机可读程序指令的方式来实现。或者,可通过服务器执行该方法。The data processing method according to the embodiment of the present disclosure can be executed by an electronic device such as a terminal device or a server. The terminal device can be a user equipment (User Equipment, UE), a mobile device, a terminal, a cellular phone, a cordless phone, or a personal digital assistant (Personal Digital Assistant). Assistant, PDA), handheld devices, computing devices, vehicle-mounted devices, wearable devices, etc., the method can be implemented by the processor calling computer-readable program instructions stored in the memory. Alternatively, the method can be executed via the server.
本公开实施例第一方面提供一种数据处理方法。A first aspect of the embodiment of the present disclosure provides a data processing method.
图1为本公开实施例提供的一种数据处理方法的流程图。参照图1,该方法包括如下步骤。Figure 1 is a flow chart of a data processing method provided by an embodiment of the present disclosure. Referring to Figure 1, the method includes the following steps.
在步骤S11中,将待处理数据输入目标神经网络,基于挤压与激励框架的空间注意力机制进行数据处理,获得处理结果。In step S11, the data to be processed is input into the target neural network, the data is processed based on the spatial attention mechanism of the squeeze and excitation framework, and the processing results are obtained.
其中,空间注意力机制用于从通道维度对特征进行压缩,激励经过通道压缩的特征在空间维度的关联性,获得特征在空间维度的注意力信息。Among them, the spatial attention mechanism is used to compress features from the channel dimension, stimulate the correlation of the channel-compressed features in the spatial dimension, and obtain the attention information of the features in the spatial dimension.
在一些可选的实现方式中,挤压与激励(Squeeze-and-Excitation,SE)框架可以由SE网络结构(SE Network,SENet)实现。在相关技术中,SENet主要用于对卷积特性的通道之间的相互依赖关系进行建模,以提高特征在通道维度的表征效果。在该处理过程中,虽然通过在空间维度的特征挤压,降低了数据处理量,并通过激励通道之间的关联性获得了通道注意力信息,但是同时也丢失了特征在空间维度的注意力信息。有鉴于此,本公开实施例提供一种基于挤压与激励框架的空间注意力机制,挤压与激励框架作用在空间注意力维度,聚焦于特征在空间维度的注意力信息,相应的实现方式为在通道维度对特征进行压缩,基于通道压缩后的特征激励其在空间维度的关联性,从而获 得特征在空间维度的注意力信息。In some optional implementations, the Squeeze-and-Excitation (SE) framework can be implemented by the SE Network structure (SE Network, SENet). In related technologies, SENet is mainly used to model the interdependence between channels of convolutional features to improve the representation effect of features in the channel dimension. In this process, although the amount of data processing is reduced by squeezing features in the spatial dimension, and channel attention information is obtained by stimulating the correlation between channels, the attention of features in the spatial dimension is also lost. information. In view of this, embodiments of the present disclosure provide a spatial attention mechanism based on a squeeze and excitation framework. The squeeze and excitation framework acts in the spatial attention dimension and focuses on the attention information of features in the spatial dimension. The corresponding implementation method In order to compress features in the channel dimension, the correlation in the spatial dimension is stimulated based on the channel compressed features to obtain Obtain the attention information of features in the spatial dimension.
在一些可选的实现方式中,上述空间注意力机制可用于处理图像数据、语音数据、文本数据、视频数据中的至少一种。即在步骤S11中,待处理数据可以包括图像数据、语音数据、文本数据、视频数据中的至少一种。In some optional implementations, the above spatial attention mechanism can be used to process at least one of image data, voice data, text data, and video data. That is, in step S11, the data to be processed may include at least one of image data, voice data, text data, and video data.
在一些可选的实现方式中,将待处理数据输入目标神经网络之后,目标神经网络基于挤压与激励框架的空间注意力机制进行数据处理,获得相应的处理结果。In some optional implementations, after the data to be processed is input into the target neural network, the target neural network processes the data based on the spatial attention mechanism of the squeeze and excitation framework to obtain corresponding processing results.
在一些可选的实现方式中,目标神经网络可用于执行图像处理任务、语音处理任务、文本处理任务、视频处理任务中的至少一种。与之相应的,处理结果包括图像处理结果、语音处理结果、文本处理结果、视频处理结果中的至少一种(其中,处理可以包括识别、分类、标注等操作),其与待处理数据的类型、待处理数据的内容、以及目标神经网络的执行任务等相关。本公开实施例对目标神经网络所能执行的任务、相应的待处理数据及处理结果均不作限制。In some optional implementations, the target neural network can be used to perform at least one of image processing tasks, speech processing tasks, text processing tasks, and video processing tasks. Correspondingly, the processing results include at least one of image processing results, speech processing results, text processing results, and video processing results (where the processing may include operations such as identification, classification, and labeling), which are related to the type of data to be processed. , the content of the data to be processed, and the execution tasks of the target neural network, etc. The embodiments of this disclosure place no restrictions on the tasks that the target neural network can perform, the corresponding data to be processed, and the processing results.
示例性地,目标神经网络在执行图像处理任务时,待处理数据包括至少一张待处理图像,将该待处理图像输入目标神经网络之后,通过其中的一些网络结构的处理(例如,卷积层),从待处理图像中提取出待处理图像特征。其中,待处理图像特征可以包括不同层级的图像特征,层级越高,图像特征的语义表征效果越好,层级越低,图像特征的空间表征效果越好。在获得待处理图像特征之后,目标神经网络从通道维度对其进行压缩,获得图像通道上下文特征,并进一步激励图像通道上下文特征在空间维度的关联性,获得图像空间注意力特征(可以表征空间维度的注意力信息),然后将该图像空间注意力特征与对应的待处理图像特征进行融合,即获得可以表征空间注意力信息的目标图像特征。上述目标图像特征可进一步应用于图像分类、图像标注、图像识别等图像处理过程中。For example, when the target neural network performs an image processing task, the data to be processed includes at least one image to be processed. After the image to be processed is input to the target neural network, it is processed through some network structures (for example, convolutional layers) ), extract the image features to be processed from the image to be processed. Among them, the image features to be processed can include image features at different levels. The higher the level, the better the semantic representation effect of the image features. The lower the level, the better the spatial representation effect of the image features. After obtaining the image features to be processed, the target neural network compresses them from the channel dimension to obtain the image channel context features, and further stimulates the correlation of the image channel context features in the spatial dimension to obtain the image spatial attention features (which can characterize the spatial dimensions attention information), and then fuse the image spatial attention features with the corresponding image features to be processed, that is, obtain the target image features that can represent the spatial attention information. The above target image features can be further used in image processing processes such as image classification, image annotation, and image recognition.
示例性地,目标神经网络在执行图像分类任务时,待处理数据包括至少一张待分类图像,将该待分类图像输入目标神经网络之后,通过其中的一些网络结构的处理(例如,卷积层),从待分类图像中提取出待分类图像特征。其中,待分类图像特征可以包括不同层级的图像特征,层级越高,图像特征的语义表征效果越好,层级越低,图像特征的空间表征效果越好。在获得待分类图像特征之后,目标神经网络从通道维度对其进行压缩,获得图像通道上下文特征,并进一步激励图像通道上下文特征在空间维度的关联性,获得图像空间注意力特征(可以表征空间维度的注意力信息),然后将该图像空间注意力特征与对应的待分类图像特征进行融合,即获得可以表征空间注意力信息的目标图像特征。基于上述目标图像特征,通过相应的图像分类算法,即可获得图像分类结果。For example, when the target neural network performs an image classification task, the data to be processed includes at least one image to be classified. After the image to be classified is input to the target neural network, it is processed through some network structures (for example, convolutional layers) ), extract the image features to be classified from the image to be classified. Among them, the image features to be classified can include image features at different levels. The higher the level, the better the semantic representation effect of the image features. The lower the level, the better the spatial representation effect of the image features. After obtaining the image features to be classified, the target neural network compresses them from the channel dimension to obtain the image channel context features, and further stimulates the correlation of the image channel context features in the spatial dimension to obtain the image spatial attention features (which can characterize the spatial dimensions attention information), and then fuse the image spatial attention features with the corresponding image features to be classified, that is, obtain the target image features that can represent the spatial attention information. Based on the above target image characteristics, the image classification results can be obtained through the corresponding image classification algorithm.
需要说明的是,待分类图像本质上属于外部技术数据,对其实施的一系列技术处理至少包括:将待分类图像输入目标神经网络,基于挤压与激励框架的空间注意力机制进行数据处理,获得图像分类结果。由此可知,在本公开实施例中,为了解决相应的技术问题,采用了一系列技术手段,这些技术手段中的技术操作,包括但不限于数据输入、特征提取、特征压缩以及特征转换等,上述技术操作均对应相应的技术特征,其都是对给定数据的计算机化处理和操作。通过这些技术操作,一方面可以减少数据处理量,降低对硬件设备的要求,另一方面还尽量保持了待分类图像的特征,以保障图像分类结果的准确性。由此可知,本公开实施例的技术方案虽然没有直接改变硬件设备的处理性能、运算速度、计算精度,但是改变了硬件设备在处理待分类图像等数据时所需的资源开销,从而提高了资源利用率。换言之,由于资源开销的减少,使得一些原本无法处理图像分类等任务的硬件设备,能够基于本公开实施例的数据处理方法执行相应的图像分类等数据处理,从而降低了对硬件设备的要求。It should be noted that the images to be classified are essentially external technical data, and a series of technical processing implemented on them at least include: inputting the images to be classified into the target neural network, data processing based on the spatial attention mechanism of the squeeze and excitation framework, Get image classification results. It can be seen from this that in the embodiments of the present disclosure, in order to solve the corresponding technical problems, a series of technical means are adopted. The technical operations in these technical means include but are not limited to data input, feature extraction, feature compression, feature conversion, etc., The above technical operations all correspond to corresponding technical features, and they are all computerized processing and operations of given data. Through these technical operations, on the one hand, the amount of data processing can be reduced and the requirements for hardware equipment can be reduced. On the other hand, the characteristics of the images to be classified can be maintained as much as possible to ensure the accuracy of the image classification results. It can be seen from this that although the technical solutions of the embodiments of the present disclosure do not directly change the processing performance, operation speed, and calculation accuracy of the hardware device, it changes the resource overhead required by the hardware device when processing data such as images to be classified, thereby improving resources. Utilization. In other words, due to the reduction in resource overhead, some hardware devices that were originally unable to handle tasks such as image classification can perform corresponding data processing such as image classification based on the data processing methods of embodiments of the present disclosure, thereby reducing the requirements for hardware devices.
其他任务处理过程与上述内容类似,在此不再展开描述。Other task processing processes are similar to the above and will not be described here.
需要说明的是,在一些可选的实现方式中,上述空间注意力机制在目标神经网络中对应某些网 络层或者网络结构,并由对应的网络层或网络结构实现上述处理过程。It should be noted that in some optional implementations, the above spatial attention mechanism corresponds to certain networks in the target neural network. network layer or network structure, and the corresponding network layer or network structure implements the above processing process.
示例性的,目标神经网络包括空间注意力模块,该空间注意力模块是根据基于挤压与激励框架的空间注意力机制构建的模块,用于从通道维度对特征进行压缩,激励经过通道压缩的特征在空间维度的关联性,获得特征在空间维度的注意力信息。应当理解,空间注意力模块本身还可以包括粒度更小的功能单元,各个功能单元之间具有一定的连接关系,基于各个功能单元及其连接关系,可以获得特征在空间维度的注意力信息。Exemplarily, the target neural network includes a spatial attention module. The spatial attention module is a module built according to the spatial attention mechanism based on the squeeze and excitation framework. It is used to compress features from the channel dimension and excite the features that have been compressed by the channel. The correlation of features in the spatial dimension and the attention information of the features in the spatial dimension are obtained. It should be understood that the spatial attention module itself can also include functional units with smaller granularity, and each functional unit has a certain connection relationship. Based on each functional unit and its connection relationship, the attention information of the feature in the spatial dimension can be obtained.
需要说明的是,基于挤压与激励框架的空间注意力机制仅是目标神经网络进行数据处理时所使用的处理机制之一,目标神经网络还可以基于其他数据处理机制进行数据处理,本公开实施例对此不作限制。换言之,基于挤压与激励框架的空间注意力机制对待处理数据进行处理,可以是整个数据处理过程中的一个处理步骤,也可以对应整个数据过程。对应到目标神经网络中,该目标神经网络可以只包括空间注意力模块,也可以包括除空间注意力模块之外的其他网络层或网络结构,本公开实施例对此不作限制。It should be noted that the spatial attention mechanism based on the squeeze and excitation framework is only one of the processing mechanisms used by the target neural network for data processing. The target neural network can also perform data processing based on other data processing mechanisms. This disclosure implements There is no restriction on this. In other words, the spatial attention mechanism based on the squeeze and excitation framework processes the data to be processed. It can be a processing step in the entire data processing process, or it can correspond to the entire data process. Corresponding to the target neural network, the target neural network may only include a spatial attention module, or may include other network layers or network structures in addition to the spatial attention module. This is not limited in the embodiments of the present disclosure.
根据本公开实施例,将待处理数据输入到目标神经网络,使得目标神经网络基于挤压与激励框架的空间注意力机制进行数据处理,获得处理结果,实现从通道维度对特征的压缩,从而可以降低特征在通道维度的尺寸,减少待处理数据量,从而提升任务处理效率,同时,通过激励经过通道压缩的特征在空间维度的关联性,能够获得特征在空间维度的注意力信息,从而提高处理结果的准确性,进而提升任务处理的准确率。According to the embodiment of the present disclosure, the data to be processed is input to the target neural network, so that the target neural network performs data processing based on the spatial attention mechanism of the squeeze and excitation framework, obtains the processing results, and realizes compression of features from the channel dimension, so that it can Reduce the size of the feature in the channel dimension and reduce the amount of data to be processed, thereby improving task processing efficiency. At the same time, by stimulating the correlation of the channel-compressed features in the spatial dimension, the attention information of the feature in the spatial dimension can be obtained, thereby improving processing The accuracy of the results, thereby improving the accuracy of task processing.
图2为本公开实施例提供的一种数据处理方法的流程图。参照图2,该方法包括如下步骤。Figure 2 is a flow chart of a data processing method provided by an embodiment of the present disclosure. Referring to Figure 2, the method includes the following steps.
在步骤S21中,根据待处理数据确定待处理特征。In step S21, characteristics to be processed are determined based on the data to be processed.
在步骤S22中,从通道维度对待处理特征进行特征压缩,获得通道上下文特征。In step S22, feature compression is performed on the features to be processed from the channel dimension to obtain channel context features.
在步骤S23中,对通道上下文特征进行特征转换,获得空间注意力特征。In step S23, feature conversion is performed on the channel context features to obtain spatial attention features.
在步骤S24中,对待处理特征和空间注意力特征进行特征融合,获得目标特征。In step S24, feature fusion is performed on the features to be processed and the spatial attention features to obtain the target features.
在步骤S25中,根据目标特征,确定处理结果。In step S25, the processing result is determined based on the target characteristics.
其中,待处理特征、通道上下文特征、空间注意力特征以及目标特征在空间维度的特征尺寸相同,待处理特征与目标特征在通道维度具有相同的第一特征尺寸,通道上下文特征与空间注意力特征在通道维度具有相同的第二特征尺寸,且第一特征尺寸大于第二特征尺寸。换言之,在上述数据处理过程中,可以先对待处理特征在通道维度进行压缩,然后基于压缩后的特征进行转换处理,获得空间维度的注意力信息,最后将空间维度的注意力信息融合到待处理特征中,从而获得目标特征。Among them, the feature size to be processed, the channel context feature, the spatial attention feature and the target feature are the same in the spatial dimension. The feature to be processed and the target feature have the same first feature size in the channel dimension. The channel context feature and the spatial attention feature are the same. have the same second feature size in the channel dimension, and the first feature size is larger than the second feature size. In other words, during the above data processing process, the features to be processed can be compressed in the channel dimension first, and then transformed based on the compressed features to obtain the attention information of the spatial dimension. Finally, the attention information of the spatial dimension can be integrated into the data to be processed. features to obtain the target features.
在一些可选的实现方式中,待处理特征可以是基于待处理数据获得的特征,其与待处理数据之间可以具有一定的对应关系。In some optional implementations, the features to be processed may be features obtained based on the data to be processed, and may have a certain corresponding relationship with the data to be processed.
在一些可选的实现方式中,在步骤S21中,根据待处理数据确定待处理特征,包括:直接对待处理数据进行特征提取,获得待处理特征;或者,先对待处理数据进行一些数据处理操作,获得中间数据,再对中间数据进行特征提取,获得待处理特征;或者,先对待处理数据进行特征提取,获得初始特征,再对初始特征进行一些数据处理操作,获得待处理特征。In some optional implementations, in step S21, determining the features to be processed based on the data to be processed includes: directly performing feature extraction on the data to be processed to obtain the features to be processed; or, first performing some data processing operations on the data to be processed, Obtain the intermediate data, and then perform feature extraction on the intermediate data to obtain the features to be processed; or, first perform feature extraction on the data to be processed to obtain the initial features, and then perform some data processing operations on the initial features to obtain the features to be processed.
需要说明的是,以上对于待处理特征的确定方式仅是举例说明,本公开实施例不限制根据待处理数据确定待处理特征的方式。It should be noted that the above methods for determining the features to be processed are only examples, and the embodiments of the present disclosure do not limit the method of determining the features to be processed based on the data to be processed.
在获得待处理特征之后,可以利用基于挤压与激励框架的空间注意力机制,对待处理特征在通道维度进行压缩,并激励经过通道压缩的特征在空间维度的关联性,获得待处理特征在空间维度的注意力信息,将空间维度的注意力信息与待处理特征进行特征融合,从而获得表征效果更佳的目标特征,以便基于目标特征确定相应的处理结果。上述处理过程对应本公开实施例的步骤S22至步骤S25。 After obtaining the features to be processed, the spatial attention mechanism based on the extrusion and excitation framework can be used to compress the features to be processed in the channel dimension, and stimulate the correlation of the channel-compressed features in the spatial dimension to obtain the spatial dimension of the features to be processed. Dimensional attention information, feature fusion of spatial dimension attention information and features to be processed, to obtain target features with better representation, so as to determine the corresponding processing results based on the target features. The above processing process corresponds to step S22 to step S25 in the embodiment of the present disclosure.
在一些可选的实现方式中,步骤S22主要用于对待处理特征在通道维度进行压缩,以降低数据处理量。其中,通道上下文特征用于表征待处理特征在通道维度的上下文关联关系。In some optional implementations, step S22 is mainly used to compress the features to be processed in the channel dimension to reduce the amount of data processing. Among them, the channel context feature is used to characterize the contextual relationship of the features to be processed in the channel dimension.
In some optional implementations, in step S22, compressing the feature to be processed along the channel dimension to obtain the channel context feature includes: performing pooling on the feature to be processed in the channel dimension to obtain the channel context feature.
Pooling is an important concept in neural networks; in essence it is a form of downsampling. Pooling enlarges the receptive field of a feature and reduces the number of parameters, while preserving certain invariances of the feature (for example, rotation invariance, translation invariance, and scale invariance). Common pooling methods include average pooling, max pooling, and global pooling. Average pooling takes the mean of the feature points in a neighborhood; max pooling takes the maximum of the feature points in a neighborhood; global pooling treats the entire feature as a single window for pooling, which can be used to reduce the feature dimension, and global pooling is usually combined with average pooling or max pooling (for example, global average pooling and global max pooling). In practice, pooling can be implemented with various pooling functions.
For example, pooling the feature to be processed in the channel dimension to obtain the channel context feature includes: performing average pooling on the feature to be processed in the channel dimension to obtain a channel context feature whose feature scale in the channel dimension is n1, where 1&lt;n1&lt;N, N is the number of channels of the feature to be processed, and N≥1.
For example, pooling the feature to be processed in the channel dimension to obtain the channel context feature includes: performing max pooling on the feature to be processed in the channel dimension to obtain a channel context feature whose feature scale in the channel dimension is n2, where 1&lt;n2&lt;N and N is the number of channels of the feature to be processed.
It should be noted that, since n1 and n2 are smaller than N, after the above average pooling or max pooling, the feature scale of the feature to be processed in the channel dimension is reduced (from N to n1, or from N to n2), and the corresponding data volume is reduced accordingly, which relieves the task processing load.
Furthermore, since n1 and n2 are both integers greater than 1, there is still room to reduce the feature scale of the feature to be processed in the channel dimension. On this basis, in some optional implementations, the feature to be processed is compressed in the channel dimension to the greatest extent, so as to minimize the amount of data to be processed.
For example, pooling the feature to be processed in the channel dimension to obtain the channel context feature includes: performing global average pooling on the feature to be processed in the channel dimension to obtain a channel context feature whose feature scale in the channel dimension is 1 (that is, the number of channels n=1).
For example, pooling the feature to be processed in the channel dimension to obtain the channel context feature includes: performing global max pooling on the feature to be processed in the channel dimension to obtain a channel context feature whose feature scale in the channel dimension is 1 (that is, the number of channels n=1).
It should be understood that, since the feature scale of the feature to be processed in the channel dimension is compressed to 1, there is no room for further compression, and the corresponding amount of data processing is therefore the minimum.
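For illustration only, the following PyTorch-style sketch shows one possible realization of the channel-dimension pooling described above; the (batch, channel, height, width) tensor layout, the function names, and the example sizes are assumptions and are not part of the original disclosure.

```python
import torch

def channel_global_avg_pool(x: torch.Tensor) -> torch.Tensor:
    # x: (b, N, h, w) -> channel context feature (b, 1, h, w)
    return x.mean(dim=1, keepdim=True)

def channel_global_max_pool(x: torch.Tensor) -> torch.Tensor:
    # x: (b, N, h, w) -> channel context feature (b, 1, h, w)
    return x.amax(dim=1, keepdim=True)

x = torch.randn(2, 256, 28, 28)        # feature to be processed, N = 256 channels
ctx = channel_global_avg_pool(x)       # (2, 1, 28, 28): channels compressed to 1, spatial size unchanged
```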
It should be noted that, when the channel context feature is obtained by pooling, the corresponding parameters are usually hyperparameters, and the channel context feature obtained by pooling can retain only limited channel context information. Convolution can also compress the feature size, and learnable parameters (for example, the weights of the convolution kernels) can be introduced into the convolution as needed, so that the convolved feature can better preserve the feature information while being compressed. On this basis, in some optional implementations, the channel context feature is obtained by convolution, so that the channel context feature can better retain the channel context information.
In some optional implementations, in step S22, compressing the feature to be processed along the channel dimension to obtain the channel context feature includes: performing convolution on the feature to be processed in the channel dimension to obtain the channel context feature.
It should be noted that, similarly to pooling, the feature scale of the channel context feature in the channel dimension can be adjusted by modifying the parameters of the convolution.
For example, convolving the feature to be processed in the channel dimension to obtain the channel context feature includes: convolving the feature to be processed in the channel dimension to obtain a channel context feature whose feature scale in the channel dimension is n3, where 1&lt;n3&lt;N and N is the number of channels of the feature to be processed.
For example, convolving the feature to be processed in the channel dimension to obtain the channel context feature includes: performing global convolution on the feature to be processed in the channel dimension to obtain a channel context feature whose feature scale in the channel dimension is 1 (that is, the number of channels n=1).
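One possible reading of the "global convolution in the channel dimension" described above is a learnable 1x1 convolution that maps all N input channels to a single output channel while leaving the spatial size unchanged; the sketch below illustrates this reading, and the interpretation itself, the layer name, and the sizes used are assumptions rather than part of the original disclosure.

```python
import torch
import torch.nn as nn

N = 256                                    # channels of the feature to be processed
# Assumed reading: a 1x1 convolution that compresses all N channels to 1 channel
# with learnable weights; nn.Conv2d(N, n3, kernel_size=1) would similarly give an
# n3-channel context feature.
channel_squeeze = nn.Conv2d(N, 1, kernel_size=1, bias=False)

x = torch.randn(2, N, 28, 28)
ctx = channel_squeeze(x)                   # (2, 1, 28, 28)
```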
As described above, after the channel context feature is obtained, a spatial attention feature that can represent the spatial attention information can be obtained by performing feature transformation on the channel context feature to excite the attention information of the feature in the spatial dimension.
In some optional implementations, in step S23, performing feature transformation on the channel context feature to obtain the spatial attention feature includes: performing feature extraction on the channel context feature in the spatial dimension to obtain a first intermediate feature; performing activation on the first intermediate feature to obtain a second intermediate feature; performing feature restoration on the second intermediate feature to obtain a third intermediate feature; and performing activation on the third intermediate feature to obtain the spatial attention feature.
In other words, the spatial attention feature that represents the attention information in the spatial dimension can be obtained by performing feature transformation on the channel context feature. Since the object to be processed (the channel context feature), the processing dimension (the spatial dimension), and the information to be obtained (the attention information) are thereby specified, the attention information of the channel context feature in the spatial dimension can be obtained in any manner, and the embodiments of the present disclosure do not limit this.
In some optional implementations, performing feature extraction on the channel context feature in the spatial dimension to obtain the first intermediate feature includes: performing a first convolution on the channel context feature in the spatial dimension to obtain the first intermediate feature.
For example, the first convolution corresponds to one convolution kernel, the size of the convolution kernel is 3*3, and the stride is 2.
In other words, the first convolution does not change the feature scale of the channel context feature in the channel dimension; it mainly squeezes the channel context feature in the spatial dimension, and this squeezing reduces the amount of data to be processed. However, considering the requirement on processing accuracy, in some optional implementations the amount of data processing can be appropriately increased in exchange for higher processing accuracy. A corresponding approach is to use multiple convolution kernels to obtain features of multiple output channels and then take the mean in the channel dimension (channel-mean) to improve processing accuracy.
In some optional implementations, performing feature extraction on the channel context feature in the spatial dimension to obtain the first intermediate feature includes: performing a second convolution on the channel context feature in the spatial dimension to obtain fourth intermediate features corresponding to multiple channels; and determining the mean of the multiple fourth intermediate features in the channel dimension to obtain the first intermediate feature.
The second convolution squeezes the channel context feature in the spatial dimension while appropriately expanding it in the channel dimension (that is, expanding the number of channels), and the first intermediate feature is finally obtained by channel averaging.
For example, the second convolution corresponds to four convolution kernels, each corresponding to one channel, the size of each convolution kernel is 7*7, the stride is 4, and the dilation factor is 2. In other words, the second convolution expands the channel context feature into four channels to obtain spatial attention information, and the mean of the spatial attention information of the four channels is then determined by channel averaging, thereby obtaining the first intermediate feature.
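The following sketch shows one possible form of the "second convolution plus channel mean" step just described. The kernel size, stride, and dilation follow the text; the padding value and the example spatial size are assumptions chosen so that a 28x28 input maps to a 7x7 output.

```python
import torch
import torch.nn as nn

# Assumed padding=5 so that a 28x28 input yields a 7x7 output with
# kernel_size=7, stride=4, dilation=2 (effective kernel size 13).
second_conv = nn.Conv2d(1, 4, kernel_size=7, stride=4, dilation=2, padding=5)

ctx = torch.randn(2, 1, 28, 28)                          # channel context feature
fourth = second_conv(ctx)                                # (2, 4, 7, 7): four output channels
first_intermediate = fourth.mean(dim=1, keepdim=True)    # (2, 1, 7, 7): channel mean
```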
In some optional implementations, after the first intermediate feature is obtained, nonlinear activation may be performed on the first intermediate feature to increase the nonlinearity of the network and obtain better processing results.
In some optional implementations, performing activation on the first intermediate feature to obtain the second intermediate feature includes: performing nonlinear activation on the first intermediate feature based on a Rectified Linear Unit (ReLU) function to obtain the second intermediate feature.
It should be noted that the above nonlinear activation function is merely an example; the hyperbolic tangent (Tanh) function, the Exponential Linear Unit (ELU) function, the Gaussian Error Linear Unit (GELU) function, and the like can all be used to activate the first intermediate feature, and the embodiments of the present disclosure do not limit this.
As described above, the channel context feature is convolved to obtain the first intermediate feature, which is then activated to obtain the second intermediate feature. Usually, after features are extracted from the channel context feature by convolution, the size of the output feature (that is, the first intermediate feature) becomes smaller. In some cases the reduced feature needs to be restored to its original size (that is, the feature size of the channel context feature) for further computation. This operation, which enlarges the feature size to map a feature from a small resolution to a large resolution, is called upsampling. Transposed convolution (deconvolution) is one implementation of upsampling; in essence it is a special forward convolution, which first enlarges the original feature by zero padding at a certain ratio, then rotates the convolution kernel, and finally performs a forward convolution. In the embodiments of the present disclosure, a first deconvolution may be performed on the second intermediate feature to obtain a spatial attention feature with the same size as the channel context feature.
In some optional implementations, performing feature restoration on the second intermediate feature to obtain the third intermediate feature includes: performing a first deconvolution on the second intermediate feature to obtain the third intermediate feature.
For example, the first deconvolution corresponds to one convolution kernel, the size of the convolution kernel is 3*3, and the stride is 2.
Similarly to obtaining the first intermediate feature from the channel context feature, the amount of data processing can be appropriately increased to improve processing accuracy. The corresponding approach is to use multiple convolution kernels to obtain multiple output channels and then take the mean in the channel dimension.
In some optional implementations, performing feature restoration on the second intermediate feature to obtain the third intermediate feature includes: performing a second deconvolution on the second intermediate feature to obtain fifth intermediate features corresponding to multiple channels; and obtaining the third intermediate feature based on the mean of the multiple fifth intermediate features in the channel dimension.
For example, the second deconvolution corresponds to four convolution kernels, each corresponding to one channel, the size of each convolution kernel is 7*7, the stride is 4, and the dilation factor is 2. In other words, the second deconvolution expands the second intermediate feature into four channels, each corresponding to one fifth intermediate feature, and the mean of the four fifth intermediate features is then determined by channel averaging, thereby obtaining the third intermediate feature.
It should be noted that the parameters used in the above convolutions and deconvolutions are merely examples, and the embodiments of the present disclosure do not limit them.
Similarly to the first intermediate feature, after the third intermediate feature is obtained, nonlinear activation may be performed on it, and normalization is also required to facilitate subsequent computation. The nonlinear activation and normalization may be implemented by a single function that provides both nonlinear activation and normalization, or by a nonlinear activation function and a normalization function separately, and the embodiments of the present disclosure do not limit this.
In some optional implementations, performing activation on the third intermediate feature to obtain the spatial attention feature includes: performing nonlinear normalized activation on the third intermediate feature based on a Sigmoid function to obtain the spatial attention feature. Since the Sigmoid function provides both nonlinear activation and normalization, the spatial attention feature can be obtained directly based on the Sigmoid function.
In some optional implementations, performing activation on the third intermediate feature to obtain the spatial attention feature includes: performing nonlinear activation on the third intermediate feature based on the ReLU function, and normalizing the nonlinear activation result based on the Softmax function to obtain the spatial attention feature.
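The short sketch below contrasts the two activation options just described. The axis over which the Softmax is applied (the spatial positions) is an assumption made for illustration, as is the example tensor size.

```python
import torch

third = torch.randn(2, 1, 28, 28)          # third intermediate feature

# Option 1: Sigmoid provides nonlinear activation and normalization in one step.
attn_sigmoid = torch.sigmoid(third)

# Option 2: ReLU activation followed by Softmax normalization; normalizing over
# the spatial positions is an assumed choice of Softmax axis.
b, c, h, w = third.shape
attn_softmax = torch.softmax(torch.relu(third).view(b, c, h * w), dim=-1).view(b, c, h, w)
```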
After the spatial attention feature is obtained, the target feature that can well represent the spatial attention information can be obtained by feature fusion.
In some optional implementations, in step S24, fusing the feature to be processed and the spatial attention feature to obtain the target feature includes: adding the feature to be processed and the spatial attention feature point by point to obtain the target feature.
In some optional implementations, in step S24, fusing the feature to be processed and the spatial attention feature to obtain the target feature includes: multiplying the feature to be processed and the spatial attention feature point by point to obtain the target feature.
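As a minimal sketch of the two fusion options in step S24, the snippet below applies a single-channel spatial attention map to every channel of the feature to be processed by broadcasting; the broadcasting behaviour and the example shapes are assumptions consistent with the worked example later in the text.

```python
import torch

x = torch.randn(2, 256, 28, 28)         # feature to be processed
attn = torch.rand(2, 1, 28, 28)         # spatial attention feature (one value per spatial position)

# Point-by-point fusion; the single attention channel broadcasts over all 256 channels.
target_mul = x * attn                    # point-wise multiplication, (2, 256, 28, 28)
target_add = x + attn                    # point-wise addition,       (2, 256, 28, 28)
```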
After the target feature is obtained, in step S25 the processing result can be determined based on the target feature.
In some optional implementations, determining the processing result based on the target feature includes: determining the processing result directly based on the target feature; or performing some data processing operations on the target feature to obtain the processing result.
As described above, the embodiments of the present disclosure implement the corresponding data processing method through the above steps. In some optional implementations, the above method corresponds to some network layers or network structures in the target neural network, and these network layers or network structures implement the above data processing method.
The data processing method of the embodiments of the present disclosure is described below with reference to Figures 3 to 8.
Figure 3 is a schematic diagram of a target neural network provided by an embodiment of the present disclosure. Referring to Figure 3, the target neural network includes a first network structure, a spatial attention module, and a second network structure, where the spatial attention module includes a context modelling unit, a transformation unit, and a fusion unit.
In some optional implementations, the data to be processed is input into the target neural network; the first network structure located before the spatial attention module processes the data to be processed to obtain the feature to be processed, and the feature to be processed serves as the input data of the spatial attention module. The spatial attention module compresses the feature to be processed along the channel dimension through the context modelling unit to obtain the channel context feature and inputs the channel context feature into the transformation unit; the transformation unit performs feature transformation on the channel context feature to obtain the spatial attention feature and inputs the spatial attention feature into the fusion unit; the fusion unit fuses the feature to be processed and the spatial attention feature, for example by point-wise multiplication or point-wise addition, to obtain the target feature and inputs the target feature into the second network structure. The second network structure performs corresponding data processing based on the target feature to obtain the processing result.
Mapping the network structures and modules of the above target neural network to the embodiments of the present disclosure, the first network structure is used to perform the processing of step S21, the context modelling unit is used to perform the processing of step S22, the transformation unit is used to perform the processing of step S23, the fusion unit is used to perform the processing of step S24, and the second network structure is used to perform the processing of step S25.
It should be noted that the first network structure and the second network structure are abstract network structures; their internal structures may be the same or different, and the embodiments of the present disclosure do not limit this. Further, in some optional implementations, the first network structure and the second network structure can be set according to task processing requirements, statistical data, experience, and other information. For example, the first network structure may include any one or more of network layers such as convolutional layers, pooling layers, connection layers, and activation layers, and the second network structure may likewise include any one or more of such network layers.
Figure 3 shows the framework of the target neural network only at a functional level in a relatively simple way. In some optional implementations, each of the above network structures or modules may further be composed of finer-grained functional units. In the embodiments of the present disclosure, the structure of the spatial attention module in the target neural network is of primary concern and the other network structures are not limited, so only the functional units of the spatial attention module of the target neural network are shown in Figure 4.
Figure 4 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure. Referring to Figure 4, the spatial attention module includes a context modelling unit, a transformation unit, and a fusion unit, where the transformation unit includes a feature extraction layer, a first activation layer, a feature restoration layer, and a second activation layer.
In some optional implementations, after the obtained feature to be processed is input into the spatial attention module, the context modelling unit first compresses it along the channel dimension to obtain the channel context feature and inputs the channel context feature into the feature extraction layer; the feature extraction layer performs feature extraction on the channel context feature in the spatial dimension to obtain the first intermediate feature and inputs it into the first activation layer; the first activation layer activates the first intermediate feature to obtain the second intermediate feature and inputs it into the feature restoration layer; the feature restoration layer performs feature restoration on the second intermediate feature to obtain the third intermediate feature and inputs it into the second activation layer; the second activation layer activates the third intermediate feature to obtain the spatial attention feature and inputs it into the fusion unit; the fusion unit fuses the feature to be processed and the spatial attention feature, for example by point-wise multiplication or point-wise addition, to obtain the target feature and outputs the target feature, so that other network structures of the target neural network can perform data processing based on the target feature and obtain the corresponding processing result.
In Figure 4, feature compression can be implemented by pooling, feature extraction by convolution, feature restoration by deconvolution, and feature activation by corresponding activation functions. Replacing the above processing in Figure 4 with the corresponding network layers yields the spatial attention module shown in Figure 5.
Figure 5 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure. Referring to Figure 5, the spatial attention module mainly includes a global average pooling (GAP) layer, a first convolutional layer, a ReLU activation layer, a first deconvolution layer, and a Sigmoid activation layer.
In some optional implementations, the feature to be processed is a four-dimensional tensor (b, c, h1, w1), where b is the number of features to be processed, c is the number of channels of the feature to be processed, and h1 and w1 are the height and width of the feature to be processed, respectively.
After the feature to be processed is input into the spatial attention module, the global average pooling layer performs global average pooling on the feature to be processed to obtain the channel context feature. Since the pooling is global, the channel context feature is a global channel context feature, and the tensor size corresponding to the channel context feature is (b, 1, h1, w1). In other words, the global average pooling layer compresses the number of channels of the feature to be processed from c to 1 without changing its feature size in the spatial dimension.
After the channel context feature is obtained, the first convolutional layer convolves the channel context feature to extract the first intermediate feature, whose corresponding tensor size is (b, 1, h2, w2), with h2&lt;h1 and w2&lt;w1. The first convolutional layer can use one convolution kernel to squeeze the channel context feature in the spatial dimension (the height is squeezed from h1 to h2 and the width from w1 to w2) to obtain a first intermediate feature with a smaller spatial size.
Further, the ReLU activation layer performs nonlinear activation on the first intermediate feature to obtain the second intermediate feature, whose corresponding tensor size is (b, 1, h2, w2).
After the second intermediate feature is obtained, the first deconvolution layer deconvolves the second intermediate feature to expand it in the spatial dimension and obtain the third intermediate feature, whose corresponding tensor size is (b, 1, h1, w1). In other words, the first deconvolution expands the height of the second intermediate feature in the spatial dimension from h2 to h1 and the width from w2 to w1 while keeping the number of channels unchanged.
Further, the Sigmoid activation layer performs nonlinear activation and normalization on the third intermediate feature to obtain the spatial attention feature, whose corresponding tensor size is (b, 1, h1, w1).
Through the point-wise multiplication unit, the spatial attention feature is multiplied point by point into the feature to be processed to obtain the target feature, whose corresponding tensor size is (b, c, h1, w1). The target feature is a feature into which the spatial attention information has been fused.
The above data processing is described by taking as an example a feature to be processed of (b, 256, 28, 28), a first convolutional layer corresponding to one convolution kernel with size 3*3 and stride 2, and a first deconvolution corresponding to one convolution kernel with size 3*3 and stride 2. First, the feature to be processed is input into the spatial attention module, and the global average pooling layer aggregates the information of all channel dimensions element by element to obtain a channel context feature of shape (b, 1, 28, 28); the first convolutional layer then performs a Conv3*3, stride-2 convolution to compress the feature in the spatial dimension and obtain a first intermediate feature of shape (b, 1, 14, 14); the ReLU activation layer performs nonlinear activation on the first intermediate feature to obtain a second intermediate feature of shape (b, 1, 14, 14); the first deconvolution layer then performs a TransposedConv3*3, stride-2 deconvolution to expand the feature in the spatial dimension and obtain a third intermediate feature of shape (b, 1, 28, 28); the Sigmoid activation layer performs nonlinear activation and normalization on the third intermediate feature to obtain a spatial attention feature of shape (b, 1, 28, 28); the spatial attention feature is multiplied point by point into the feature to be processed to obtain a target feature of shape (b, 256, 28, 28).
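For illustration, the following PyTorch-style sketch mirrors the Figure 5 pipeline with the shapes used in the example above. The kernel sizes and strides follow the text; the class name, padding, and output_padding values are assumptions chosen so that the 28x28 to 14x14 to 28x28 shape sequence works out.

```python
import torch
import torch.nn as nn

class SpatialAttentionNaive(nn.Module):
    """Sketch of the Figure 5 pipeline: channel GAP -> Conv3x3/s2 -> ReLU
    -> TransposedConv3x3/s2 -> Sigmoid -> point-wise multiplication."""

    def __init__(self):
        super().__init__()
        # padding / output_padding are assumed values for 28x28 -> 14x14 -> 28x28.
        self.conv = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
        self.deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2,
                                         padding=1, output_padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ctx = x.mean(dim=1, keepdim=True)          # (b, 1, 28, 28) channel context feature
        mid = torch.relu(self.conv(ctx))           # (b, 1, 14, 14) second intermediate feature
        attn = torch.sigmoid(self.deconv(mid))     # (b, 1, 28, 28) spatial attention feature
        return x * attn                            # (b, 256, 28, 28) target feature

x = torch.randn(2, 256, 28, 28)
y = SpatialAttentionNaive()(x)                     # same shape as x
```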
In summary, after the feature to be processed has been processed by the global average pooling layer, the number of channels is only 1, so the subsequent Conv3*3 and TransposedConv3*3 require little computation (taking the feature to be processed (b, 256, 28, 28) as an example, reducing the number of channels by a factor of 256 reduces the amount of computation by at least a factor of 256), which can effectively improve task processing efficiency.
In addition, it should be emphasized that channel squeezing and excitation based on SENet are usually built on fully connected channels, that is, regardless of whether the output channels change, each output channel is computed from all input channels. In the embodiments of the present disclosure, each pixel of the target feature is computed from only 3*3 pixels of the feature to be processed, and the receptive field is about 7*7 (computed with a kernel size of 3*3 and a stride of 2), so the range of action of the spatial attention is relatively limited.
In view of this, in some optional implementations, the size of the above receptive field can be appropriately increased to expand the range of action of the spatial attention and thereby improve task processing accuracy. It should be understood that increasing the size of the receptive field usually increases the amount of computation; in other words, the above increase in task processing accuracy may be obtained at the cost of some processing capacity.
In some optional implementations, the spatial attention module shown in Figure 5 is regarded as a plain version; by improving or strengthening some of its network layers, an enhanced version of the spatial attention module with a larger range of spatial attention can be obtained.
For example, the global average pooling layer in Figure 5 is replaced by a global convolutional layer, and the learnable parameters of the convolution are used to better preserve the information of the feature to be processed.
For example, the first convolutional layer in Figure 5 is replaced by a second convolutional layer and a first channel averaging layer, where the second convolutional layer uses larger convolution kernels and corresponds to multiple output channels, and the first channel averaging layer averages the features of the output channels to obtain the corresponding first intermediate feature. Since the size of the convolution kernel increases, the size of the receptive field increases accordingly, and since the first intermediate feature is determined from the mean of multiple output channels, the accuracy of the first intermediate feature can be improved to some extent.
For example, the first deconvolution layer in Figure 5 is replaced by a second deconvolution layer and a second channel averaging layer, where the second deconvolution layer can use larger convolution kernels and corresponds to multiple output channels, and the second channel averaging layer averages the features of the output channels to obtain the corresponding third intermediate feature. Since the size of the convolution kernel increases, the size of the receptive field increases accordingly, and since the third intermediate feature is determined from the mean of multiple output channels, the accuracy of the third intermediate feature can be improved to some extent.
It should be noted that any one or more of the above improvements can be implemented on the basis of the plain version of the spatial attention module in order to increase the range of action of the spatial attention.
Figure 6 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure, which is an enhanced version of the spatial attention module. Referring to Figure 6, the spatial attention module includes a global convolutional layer, a feature extraction layer, a ReLU activation layer, a feature restoration layer, and a Sigmoid activation layer, where the feature extraction layer includes a second convolutional layer and a first channel averaging layer, and the feature restoration layer includes a second deconvolution layer and a second channel averaging layer.
In some optional implementations, the feature to be processed is a four-dimensional tensor (b, c1, h1, w1), where b is the number of features to be processed, c1 is the number of channels of the feature to be processed, and h1 and w1 are the height and width of the feature to be processed, respectively.
After the feature to be processed is input into the spatial attention module, the global convolutional layer first performs global convolution on the feature to be processed to obtain the channel context feature. Since the convolution is global, the channel context feature is a global channel context feature, and the tensor size corresponding to the channel context feature is (b, 1, h1, w1). In other words, the global convolutional layer compresses the number of channels of the feature to be processed from c1 to 1 without changing its feature size in the spatial dimension, while the learnable parameters in the global convolutional layer better preserve the feature information.
After the channel context feature is obtained, it is processed by the feature extraction layer composed of the second convolutional layer and the first channel averaging layer to obtain the first intermediate feature. For example, the second convolutional layer convolves the channel context feature to obtain fourth intermediate features of multiple channels, and the first channel averaging layer computes the mean of the multiple fourth intermediate features in the channel dimension to obtain the first intermediate feature. The tensor size corresponding to the fourth intermediate features of the multiple channels is (b, c2, h3, w3), and the tensor size corresponding to the first intermediate feature is (b, 1, h3, w3), with 1&lt;c2&lt;c1, h3&lt;h1, and w3&lt;w1. In other words, the second convolutional layer obtains fourth intermediate features of c2 channels from the single-channel channel context feature while squeezing the channel context feature in the spatial dimension (the height is squeezed from h1 to h3 and the width from w1 to w3); the first channel averaging layer computes the mean of the above fourth intermediate features in the channel dimension to obtain the first intermediate feature.
Further, the ReLU activation layer performs nonlinear activation on the first intermediate feature to obtain the second intermediate feature, whose corresponding tensor size is (b, 1, h3, w3).
After the second intermediate feature is obtained, it is processed by the feature restoration layer composed of the second deconvolution layer and the second channel averaging layer to obtain the third intermediate feature. For example, the second deconvolution layer deconvolves the second intermediate feature to obtain fifth intermediate features of multiple channels, and the second channel averaging layer computes the mean of the multiple fifth intermediate features in the channel dimension to obtain the third intermediate feature. The tensor size corresponding to the fifth intermediate features of the multiple channels is (b, c3, h1, w1), the tensor size corresponding to the third intermediate feature is (b, 1, h1, w1), 1&lt;c3&lt;c1, and c3 and c2 may be the same or different. In other words, the second deconvolution layer obtains fifth intermediate features of c3 channels from the single-channel second intermediate feature while expanding the second intermediate feature in the spatial dimension (the height is expanded from h3 to h1 and the width from w3 to w1); the second channel averaging layer computes the mean of the above fifth intermediate features in the channel dimension to obtain the third intermediate feature.
Further, the Sigmoid activation layer performs nonlinear activation and normalization on the third intermediate feature to obtain the spatial attention feature, whose corresponding tensor size is (b, 1, h1, w1).
Through the point-wise multiplication unit, the spatial attention feature is multiplied point by point into the feature to be processed to obtain the target feature, whose corresponding tensor size is (b, c1, h1, w1).
The above is described by taking as an example a feature to be processed of (b, 256, 28, 28), a second convolutional layer corresponding to four convolution kernels (one per channel) with kernel size 7*7, stride 4, and dilation factor 2, and a second deconvolution corresponding to four convolution kernels (one per channel) with kernel size 7*7, stride 4, and dilation factor 2. First, the feature to be processed is input into the spatial attention module, and the global convolutional layer compresses the feature to be processed in the channel dimension to obtain a channel context feature of shape (b, 1, 28, 28); the second convolutional layer then uses convolution kernels corresponding to 4 channels to perform a Conv7*7, stride-4, dilation-2 convolution, expanding the channel context feature into 4 channels in the channel dimension and compressing it in the spatial dimension, to obtain fourth intermediate features of shape (b, 4, 7, 7) (or, equivalently, fourth intermediate features of shape (b, 1, 7, 7) corresponding to 4 channels); the first channel averaging layer computes the mean of the above fourth intermediate features in the channel dimension (that is, the mean over the 4 channels of the feature points at the same spatial position) to obtain a first intermediate feature of size (b, 1, 7, 7); the ReLU activation layer performs nonlinear activation on the first intermediate feature to obtain a second intermediate feature of shape (b, 1, 7, 7); the second deconvolution layer then uses convolution kernels corresponding to 4 channels to perform a TransposedConv7*7, stride-4, dilation-2 deconvolution, expanding the second intermediate feature into 4 channels in the channel dimension and expanding it in the spatial dimension, to obtain fifth intermediate features of shape (b, 4, 28, 28) (or, equivalently, fifth intermediate features of shape (b, 1, 28, 28) corresponding to 4 channels); the second channel averaging layer computes the mean of the above fifth intermediate features in the channel dimension (that is, the mean over the 4 channels of the feature points at the same spatial position) to obtain a third intermediate feature of size (b, 1, 28, 28); the Sigmoid activation layer performs nonlinear activation and normalization on the third intermediate feature to obtain a spatial attention feature of shape (b, 1, 28, 28); the spatial attention feature is multiplied point by point into the feature to be processed to obtain a target feature of shape (b, 256, 28, 28).
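For illustration, the sketch below follows the Figure 6 pipeline with the shapes used in the example above. The kernel size, stride, dilation, and channel counts follow the text; the reading of "global convolution" as a learnable 1x1 convolution, the class name, and the padding/output_padding values (chosen so that 28x28 maps to 7x7 and back) are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttentionEnhanced(nn.Module):
    """Sketch of the Figure 6 pipeline: global conv over channels -> Conv7x7
    (4 kernels, stride 4, dilation 2) -> channel mean -> ReLU ->
    TransposedConv7x7 (4 kernels, stride 4, dilation 2) -> channel mean ->
    Sigmoid -> point-wise multiplication."""

    def __init__(self, in_channels: int = 256):
        super().__init__()
        # Assumed reading of "global convolution": a learnable 1x1 conv to 1 channel.
        self.global_conv = nn.Conv2d(in_channels, 1, kernel_size=1)
        # padding=5 and output_padding=1 are assumed values for 28x28 -> 7x7 -> 28x28.
        self.conv = nn.Conv2d(1, 4, kernel_size=7, stride=4, dilation=2, padding=5)
        self.deconv = nn.ConvTranspose2d(1, 4, kernel_size=7, stride=4, dilation=2,
                                         padding=5, output_padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ctx = self.global_conv(x)                        # (b, 1, 28, 28) channel context feature
        mid = self.conv(ctx).mean(dim=1, keepdim=True)   # (b, 1, 7, 7) first intermediate feature
        mid = torch.relu(mid)                            # second intermediate feature
        up = self.deconv(mid).mean(dim=1, keepdim=True)  # (b, 1, 28, 28) third intermediate feature
        attn = torch.sigmoid(up)                         # spatial attention feature
        return x * attn                                  # (b, 256, 28, 28) target feature

x = torch.randn(2, 256, 28, 28)
y = SpatialAttentionEnhanced(256)(x)
```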
It should be noted that, when convolution is performed with a 7*7 convolution kernel with a dilation factor of 2, the feature edges need to be zero padded (other padding methods can also be used), and the feature extraction result may be affected by the zero padding, making the result inaccurate. For this reason, a multi-kernel strategy is adopted in the second convolutional layer and the second deconvolution layer: multiple convolution kernels are used to expand the number of channels of the feature, and the influence of the above zero padding is then reduced by computing the mean of the feature in the channel dimension.
It should be emphasized that, taking the feature to be processed (b, 256, 28, 28) as an example, in the processing of the plain version of the spatial attention module the range of action of the spatial attention is about 7*7, which is small relative to the spatial size 28*28 of the feature, whereas in the processing of the enhanced version of the spatial attention module the convolution kernel with dilation factor 2, stride 4, and size 7*7 has a receptive field of about 53*53. In other words, the range of action of the enhanced spatial attention is about 53*53, an effective enlargement over the plain version; even for the largest feature map of the YOLO object detection network (with a spatial size of 608/8=76), a 53*53 receptive field is sufficient.
It should also be noted that although, relative to the spatial size 28*28 of the feature, a convolution kernel with dilation factor 2, stride 4, and size 7*7 is a very large kernel, it has been verified in the related art that very large convolution kernels still deliver good processing effect and efficiency when handling small feature maps, and there is no problem of very large kernels being unable to process small features.
It should be understood that, in the processing of the enhanced version of the spatial attention, although the number of channels is increased in some processing steps, the feature to be processed has already been compressed along the channel dimension into a smaller feature at the very beginning of the processing, so the amount of computation increases only slightly; compared with the accuracy improvement that can be achieved for the task processing result, the added computation is cost-effective.
It should be noted that the spatial attention modules shown in Figures 4-6 above correspond to the SE Block structure in SENet. On the basis of the spatial attention module shown in Figure 3, various modifications can be made to obtain multiple variant structures of SENet, and these variants can likewise be used to carry out the spatial attention mechanism to obtain spatial attention information.
Figure 7 is a schematic diagram of a spatial attention module provided by an embodiment of the present disclosure, which is one of the variants of SENet (namely, the Simplified NL Block). Referring to Figure 7, in this spatial attention module the tensor size corresponding to the feature to be processed is (b, c, h1, w1); the context modelling unit includes a third convolutional layer and a normalization layer, which are used to compress the feature to be processed along the channel dimension to obtain the channel context feature with a corresponding tensor size of (b, 1, h1, w1); the fourth convolutional layer corresponds to the transformation unit and is used to perform feature transformation on the channel context feature to obtain the spatial attention feature with a corresponding tensor size of (b, 1, h1, w1); the fusion unit is implemented by a point-wise multiplier, which multiplies the spatial attention feature point by point into the feature to be processed to obtain the target feature, whose tensor size is (b, c, h1, w1).
It can be seen that the attention information of the feature to be processed in the spatial dimension can also be obtained with the Simplified NL Block structure.
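The rough sketch below follows the Figure 7 description (third convolution plus normalization for context modelling, fourth convolution as the transformation unit, point-wise multiplication as fusion). The use of 1x1 convolutions, the choice of a Softmax over spatial positions as the normalization layer, and the class name are assumptions made for illustration rather than details given in the text.

```python
import torch
import torch.nn as nn

class SpatialAttentionVariant(nn.Module):
    """Rough sketch of the Figure 7 variant as described in the text;
    layer shapes and the Softmax normalization are assumptions."""

    def __init__(self, in_channels: int = 256):
        super().__init__()
        self.third_conv = nn.Conv2d(in_channels, 1, kernel_size=1)   # context modelling
        self.fourth_conv = nn.Conv2d(1, 1, kernel_size=1)             # transformation unit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        ctx = self.third_conv(x)                                       # (b, 1, h, w)
        ctx = torch.softmax(ctx.view(b, 1, h * w), dim=-1).view(b, 1, h, w)  # normalization layer
        attn = self.fourth_conv(ctx)                                   # (b, 1, h, w) spatial attention
        return x * attn                                                # (b, c, h, w) target feature
```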
It should be noted that, in addition to the Simplified NL Block, SENet has many other variants, such as the Global Context Block (GC Block). Whatever the variant, the processing is similar: the feature to be processed is first compressed along the channel dimension to obtain the channel context feature, the channel context feature is then excited in the spatial dimension to obtain the spatial attention feature, and finally the spatial attention feature is fused with the feature to be processed to obtain the target feature.
In some optional implementations, the above spatial attention mechanism can be used in combination with a channel attention mechanism to further improve the accuracy of the task processing result.
In some optional implementations, the processing result may be determined from attention information in the spatial dimension and attention information in the channel dimension, where the attention information in the spatial dimension is obtained based on the spatial attention mechanism and the attention information in the channel dimension is obtained based on the channel attention mechanism.
In other words, compared with obtaining the processing result using only the spatial attention mechanism or only the channel attention mechanism, combining the two mechanisms makes it possible to obtain attention information in the spatial dimension and in the channel dimension at the same time, so the representation effect of the features is further improved, and accordingly the accuracy of the task processing result can also be improved.
Figure 8 is a schematic diagram of a target neural network provided by an embodiment of the present disclosure. Referring to Figure 8, the target neural network includes a first network structure, a spatial attention module, a channel attention module, a fusion module, and a second network structure, where the spatial attention module is a module set based on the spatial attention mechanism and is used to obtain attention information in the spatial dimension, and the channel attention module is a module set based on the channel attention mechanism and is used to obtain attention information in the channel dimension.
In some optional implementations, the data to be processed is input into the target neural network; the first network structure located before the spatial attention module and the channel attention module processes the data to be processed to obtain the feature to be processed, and the feature to be processed serves as the input data of both the spatial attention module and the channel attention module.
For the spatial attention module, the first context modelling unit compresses the feature to be processed along the channel dimension to obtain the channel context feature and inputs the channel context feature into the first transformation unit; the first transformation unit performs feature transformation on the channel context feature to obtain the spatial attention feature and inputs the spatial attention feature into the first fusion unit; the first fusion unit fuses the feature to be processed and the spatial attention feature, for example by point-wise multiplication or point-wise addition, to obtain a first target feature and inputs the first target feature into the fusion module.
Similarly to the spatial attention module, the channel attention module compresses the feature to be processed along the spatial dimension through the second context modelling unit to obtain a spatial context feature and inputs the spatial context feature into the second transformation unit; the second transformation unit performs feature transformation on the spatial context feature to obtain a channel attention feature and inputs the channel attention feature into the second fusion unit; the second fusion unit fuses the feature to be processed and the channel attention feature, for example by point-wise multiplication or point-wise addition, to obtain a second target feature and inputs the second target feature into the fusion module.
The fusion module further fuses the first target feature and the second target feature to obtain a fused feature that contains both the spatial attention information and the channel attention information, and inputs the fused feature into the second network structure; the second network structure performs corresponding data processing based on the fused feature to obtain the processing result.
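As a sketch of the Figure 8 arrangement, the example below runs the same feature to be processed through a spatial attention branch and a channel attention branch in parallel and fuses the two target features. The plain spatial branch mirrors the Figure 5 sketch; the SE-style channel branch (spatial global pooling followed by two fully connected layers), the element-wise-sum fusion module, the padding values, and the class name are illustrative assumptions rather than details given in the text.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Sketch of Figure 8: parallel spatial and channel attention branches
    acting on the same feature, followed by an assumed sum-based fusion."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # Spatial branch (plain version): channel GAP -> Conv3x3/s2 -> ReLU
        # -> TransposedConv3x3/s2 -> Sigmoid (padding values are assumptions).
        self.sp_conv = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
        self.sp_deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2,
                                            padding=1, output_padding=1)
        # Channel branch (assumed SE-style): spatial GAP -> FC -> ReLU -> FC -> Sigmoid.
        self.ch_excite = nn.Sequential(
            nn.Linear(channels, channels // 16), nn.ReLU(),
            nn.Linear(channels // 16, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # First target feature: spatial attention applied to x.
        sp_ctx = x.mean(dim=1, keepdim=True)
        sp_attn = torch.sigmoid(self.sp_deconv(torch.relu(self.sp_conv(sp_ctx))))
        first_target = x * sp_attn
        # Second target feature: channel attention applied to x.
        ch_attn = self.ch_excite(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        second_target = x * ch_attn
        # Fusion module (assumed here to be an element-wise sum).
        return first_target + second_target

y = DualAttention(256)(torch.randn(2, 256, 28, 28))
```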
In the processing described above, the spatial attention module and the channel attention module act on the same features to be processed, strengthening their feature expression from the spatial dimension and the channel dimension simultaneously. In some optional implementations, the spatial attention module and the channel attention module may instead act on different features to be processed, so that different features are handled in different ways. For example, the spatial attention module acts on first features to be processed to obtain their attention information in the spatial dimension, and the channel attention module acts on second features to be processed to obtain their attention information in the channel dimension.
As mentioned above, the spatial attention module includes a naive version and an enhanced version; likewise, the channel attention module may include a naive version and an enhanced version. In the target neural network, the naive versions of both modules may be used, the enhanced versions of both may be used, or the naive version of one may be combined with the enhanced version of the other; the embodiments of the present disclosure impose no limitation on this.
It should be noted that the spatial attention module and the channel attention module may use any SENet structure (including corresponding variants), and the two modules may use the same SENet structure or different SENet structures; the embodiments of the present disclosure impose no limitation on this.
It should also be noted that, as can be seen from the data processing described above, the processing of the spatial attention module and that of the channel attention module are relatively independent, and each can be performed without relying on the other's result. The processing of the two modules may therefore be executed simultaneously or sequentially, and the embodiments of the present disclosure impose no limitation on this.
In some optional implementations, the spatial attention module and the channel attention module may be carried by different hardware devices or by the same hardware device. When carried by the same hardware device, their processing may be executed sequentially, or two processes may be established so that the processing is executed simultaneously in the two processes. When carried by different hardware devices, their processing may be executed simultaneously by the respective hardware devices or executed sequentially.
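Purely as an illustration of the concurrent option, the sketch below submits the two relatively independent branches at the same time using Python's standard concurrent.futures module; the module objects and input tensor are placeholders, and in practice the parallelism would be provided by the underlying hardware rather than by Python threads.

    from concurrent.futures import ThreadPoolExecutor

    def run_branches_concurrently(spatial_module, channel_module, features):
        # Run the spatial and channel branches on the same features at the same time.
        with ThreadPoolExecutor(max_workers=2) as pool:
            spatial_future = pool.submit(spatial_module, features)
            channel_future = pool.submit(channel_module, features)
            return spatial_future.result(), channel_future.result()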
A second aspect of the embodiments of the present disclosure provides a neural network model.
Figure 9 is a schematic diagram of a neural network model provided by an embodiment of the present disclosure. Referring to Figure 9, the neural network model is a model constructed based on the model parameters of a target neural network, where the target neural network is the target neural network of any of the embodiments of the present disclosure.
In some optional implementations, the neural network model can be used to perform at least one of an image processing task, a speech processing task, a text processing task, and a video processing task. Whatever task the neural network model performs, it needs to obtain the attention information of features in the spatial dimension during execution. Accordingly, while performing a task the neural network model carries out the following steps: compressing features from the channel dimension, exciting the correlation, in the spatial dimension, of the features that have undergone channel compression, and obtaining the attention information of the features in the spatial dimension. In other words, the structure of the neural network model may differ for different types of tasks, but however its structure changes, it includes a functional module for executing the spatial attention mechanism.
In some optional implementations, an initial neural network model is built according to the task to be processed. In the initial neural network model, at least some of the model parameters are initial parameters, and executing the task directly with the initial model yields low processing accuracy. The model parameters of the target neural network are therefore used to update the corresponding parameters in the initial neural network model to obtain a neural network model with higher accuracy.
In some optional implementations, the process of building a neural network model based on the model parameters of the target neural network can be implemented through model training.
For example, first, an initial neural network model is built; in the initial model, the model parameters are initialization parameters set according to experience or statistics, or set randomly, and the initial model cannot be used directly to perform the task. Next, a corresponding training set is obtained, and the initial neural network model is trained on the training set to obtain a training result. Then, whether to continue training the model is determined according to the training result and a preset iteration condition. If it is determined that training should continue, the current model parameters have not yet reached the optimum and there is room for further optimization; the model parameters are therefore updated according to this round's training result, and the updated model is trained iteratively on the training set until it is determined that training should stop, yielding a trained neural network model. In the trained neural network model, the model parameters correspond to the model parameters of the target neural network.
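A minimal training-loop sketch of this procedure is shown below, assuming a PyTorch model, a data loader, and a fixed epoch count standing in for the preset iteration condition; the loss function and hyper-parameters are placeholders rather than values taken from the embodiment.

    import torch
    import torch.nn as nn

    def train_model(model, train_loader, epochs=10, lr=1e-3):
        """Train an initial model until the preset iteration condition is met."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        for _ in range(epochs):                          # preset iteration condition
            for inputs, labels in train_loader:
                optimizer.zero_grad()
                loss = criterion(model(inputs), labels)  # training result for this batch
                loss.backward()
                optimizer.step()                         # update the model parameters for the next round
        return model                                     # parameters now correspond to those of the target network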
It should be noted that, after the trained neural network model is obtained on the training set, the model may also be verified and corrected on a validation set; similarly, the model may be evaluated on a test set. The embodiments of the present disclosure impose no limitation on how the neural network model is obtained.
It can be understood that the method embodiments mentioned above in the present disclosure can be combined with one another to form combined embodiments without violating their principles and logic; owing to space limitations, this is not elaborated further in the present disclosure. Those skilled in the art can understand that, in the methods of the specific implementations described above, the specific execution order of the steps should be determined by their functions and possible internal logic.
In addition, the present disclosure also provides a data processing apparatus, an electronic device, and a computer-readable storage medium, all of which can be used to implement any of the data processing methods provided by the present disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding records in the method section, which are not repeated here.
Figure 10 is a block diagram of a data processing apparatus provided by an embodiment of the present disclosure.
Referring to Figure 10, an embodiment of the present disclosure provides a data processing apparatus, which includes the following modules.
The data processing module 101 is used to input data to be processed into a target neural network, perform data processing based on the spatial attention mechanism of the squeeze-and-excitation framework, and obtain a processing result.
The spatial attention mechanism is used to compress features from the channel dimension, excite the correlation, in the spatial dimension, of the features that have undergone channel compression, and obtain the attention information of the features in the spatial dimension.
In some optional implementations, the data processing module includes a first determination submodule, a compression submodule, a conversion submodule, a fusion submodule, and a second determination submodule. The first determination submodule is used to determine features to be processed according to the data to be processed; the compression submodule is used to compress the features to be processed from the channel dimension to obtain channel context features; the conversion submodule is used to perform feature conversion on the channel context features to obtain spatial attention features; the fusion submodule is used to fuse the features to be processed with the spatial attention features to obtain target features; and the second determination submodule is used to determine the processing result according to the target features. The features to be processed, the channel context features, the spatial attention features, and the target features have the same feature size in the spatial dimension; the features to be processed and the target features have the same first feature size in the channel dimension; the channel context features and the spatial attention features have the same second feature size in the channel dimension; and the first feature size is larger than the second feature size.
Mapping the above functional submodules to the target neural network shown in Figure 3, the first determination submodule corresponds to the first network structure, the compression submodule corresponds to the context modeling unit, the conversion submodule corresponds to the conversion unit, the fusion submodule corresponds to the fusion unit, and the second determination submodule corresponds to the second network structure. The compression submodule, the conversion submodule, and the fusion submodule also include finer-grained functional units; for the relevant content, refer to the corresponding descriptions in the embodiments of the present disclosure, which are not repeated here.
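The dimension constraints described for these submodules can be checked with a short, purely illustrative shape walk-through; the tensor sizes below are arbitrary examples, not values from the embodiment.

    import torch

    x = torch.randn(1, 64, 32, 32)         # features to be processed: first feature size 64 in the channel dimension
    ctx = x.mean(dim=1, keepdim=True)      # channel context features: second feature size 1 in the channel dimension
    att = torch.sigmoid(ctx)               # spatial attention features: same spatial size of 32*32
    target = x * att                       # target features: first feature size 64 restored by broadcasting
    print(ctx.shape, target.shape)         # torch.Size([1, 1, 32, 32]) torch.Size([1, 64, 32, 32])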
Figure 11 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
Referring to Figure 11, an embodiment of the present disclosure provides an electronic device, which includes: at least one processor 1101; at least one memory 1102; and one or more I/O interfaces 1103 connected between the processor 1101 and the memory 1102. The memory 1102 stores one or more computer programs executable by the at least one processor 1101, and the one or more computer programs are executed by the at least one processor 1101 so that the at least one processor 1101 can execute the data processing method described above.
Figure 12 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
Referring to Figure 12, an embodiment of the present disclosure provides an electronic device, which includes a plurality of processing cores 1201 and a network-on-chip 1202, where the plurality of processing cores 1201 are all connected to the network-on-chip 1202, and the network-on-chip 1202 is used to exchange data among the plurality of processing cores and with the outside.
One or more instructions are stored in one or more of the processing cores 1201, and the one or more instructions are executed by the one or more processing cores 1201 so that the one or more processing cores 1201 can execute the data processing method described above.
In some embodiments, the electronic device may be a brain-inspired chip. Because a brain-inspired chip can adopt vectorized computation and needs to load parameters such as the weight information of the neural network model through an external memory, for example a double data rate (DDR) synchronous dynamic random access memory, the batch processing adopted in the embodiments of the present disclosure achieves relatively high operational efficiency.
Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored, where the computer program implements the data processing method described above when executed by a processor or a processing core. The computer-readable storage medium may be a volatile or non-volatile computer-readable storage medium.
Embodiments of the present disclosure also provide a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying the computer-readable code. When the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the data processing method described above.
Those of ordinary skill in the art can understand that all or some of the steps in the methods disclosed above, and the functional modules/units in the systems and apparatuses disclosed above, can be implemented as software, firmware, hardware, and appropriate combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable storage media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).
As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable program instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), static random access memory (SRAM), flash memory or other memory technology, portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable program instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery medium.
The computer-readable program instructions described here may be downloaded from a computer-readable storage medium to the respective computing/processing devices, or downloaded to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions used to perform the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), is personalized by utilizing the state information of the computer-readable program instructions, and the electronic circuit may execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
The computer program product described here may be implemented by hardware, software, or a combination thereof. In one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
Aspects of the present disclosure are described here with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, so that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create an apparatus that implements the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to work in a specific manner, so that the computer-readable medium storing the instructions includes an article of manufacture that includes instructions implementing aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or other devices, so that a series of operational steps are performed on the computer, the other programmable data processing apparatus, or the other devices to produce a computer-implemented process, so that the instructions executed on the computer, the other programmable data processing apparatus, or the other devices implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two consecutive blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special-purpose hardware-based system that performs the specified functions or actions, or by a combination of special-purpose hardware and computer instructions.
Example embodiments have been disclosed here, and although specific terms are employed, they are used and should be interpreted in a generic descriptive sense only, and not for the purpose of limitation. In some instances, it will be apparent to those skilled in the art that, unless explicitly stated otherwise, the features, characteristics, and/or elements described in connection with a particular embodiment may be used alone, or in combination with the features, characteristics, and/or elements described in connection with other embodiments. Accordingly, those skilled in the art will understand that various changes in form and details may be made without departing from the scope of the present disclosure as set forth in the appended claims.

Claims (19)

  1. A data processing method, comprising:
    inputting data to be processed into a target neural network, performing data processing based on a spatial attention mechanism of a squeeze-and-excitation framework, and obtaining a processing result;
    wherein the spatial attention mechanism is used to compress features from a channel dimension, excite the correlation, in a spatial dimension, of the features that have undergone channel compression, and obtain attention information of the features in the spatial dimension.
  2. The method according to claim 1, wherein the inputting of the data to be processed into the target neural network, performing data processing based on the spatial attention mechanism of the squeeze-and-excitation framework, and obtaining the processing result comprises:
    determining features to be processed according to the data to be processed;
    performing feature compression on the features to be processed from the channel dimension to obtain channel context features;
    performing feature conversion on the channel context features to obtain spatial attention features;
    performing feature fusion on the features to be processed and the spatial attention features to obtain target features;
    determining the processing result according to the target features;
    wherein the features to be processed, the channel context features, the spatial attention features, and the target features have the same feature size in the spatial dimension; the features to be processed and the target features have the same first feature size in the channel dimension; the channel context features and the spatial attention features have the same second feature size in the channel dimension; and the first feature size is larger than the second feature size.
  3. The method according to claim 2, wherein the performing of feature compression on the features to be processed from the channel dimension to obtain the channel context features comprises:
    performing pooling processing on the features to be processed in the channel dimension to obtain the channel context features;
    or,
    performing convolution processing on the features to be processed in the channel dimension to obtain the channel context features.
  4. The method according to claim 3, wherein the performing of pooling processing on the features to be processed in the channel dimension to obtain the channel context features comprises:
    performing global average pooling on the features to be processed in the channel dimension to obtain the channel context features with a feature scale of 1 in the channel dimension;
    and the performing of convolution processing on the features to be processed in the channel dimension to obtain the channel context features comprises:
    performing global convolution on the features to be processed in the channel dimension to obtain the channel context features with a feature scale of 1 in the channel dimension.
  5. The method according to claim 2, wherein the performing of feature conversion on the channel context features to obtain the spatial attention features comprises:
    performing feature extraction on the channel context features in the spatial dimension to obtain first intermediate features;
    performing activation processing on the first intermediate features to obtain second intermediate features;
    performing feature restoration processing on the second intermediate features to obtain third intermediate features;
    performing activation processing on the third intermediate features to obtain the spatial attention features.
  6. The method according to claim 5, wherein the performing of feature extraction on the channel context features in the spatial dimension to obtain the first intermediate features comprises:
    performing first convolution processing on the channel context features in the spatial dimension to obtain the first intermediate features;
    or,
    performing second convolution processing on the channel context features in the spatial dimension to obtain fourth intermediate features corresponding to a plurality of channels, and determining an average value of the plurality of fourth intermediate features in the channel dimension to obtain the first intermediate features.
  7. The method according to claim 6, wherein the first convolution corresponds to one convolution kernel, the size of the convolution kernel being 3*3 with a stride of 2;
    and the second convolution corresponds to four convolution kernels, each convolution kernel corresponding to one channel, the size of each convolution kernel being 7*7 with a stride of 4 and a dilation coefficient of 2.
  8. The method according to claim 5, wherein the performing of feature restoration processing on the second intermediate features to obtain the third intermediate features comprises:
    performing first deconvolution processing on the second intermediate features to obtain the third intermediate features;
    or,
    performing second deconvolution processing on the second intermediate features to obtain fifth intermediate features corresponding to a plurality of channels;
    obtaining the third intermediate features based on an average value of the plurality of fifth intermediate features in the channel dimension.
  9. The method according to claim 8, wherein the first deconvolution corresponds to one convolution kernel, the size of the convolution kernel being 3*3 with a stride of 2;
    and the second deconvolution corresponds to four convolution kernels, each convolution kernel corresponding to one channel, the size of each convolution kernel being 7*7 with a stride of 4 and a dilation coefficient of 2.
  10. The method according to claim 5, wherein the performing of activation processing on the first intermediate features to obtain the second intermediate features comprises:
    performing nonlinear activation on the first intermediate features based on a rectified linear function to obtain the second intermediate features;
    and the performing of activation processing on the third intermediate features to obtain the spatial attention features comprises:
    performing nonlinear normalized activation on the third intermediate features based on a sigmoid function to obtain the spatial attention features.
  11. The method according to claim 2, wherein the performing of feature fusion on the features to be processed and the spatial attention features to obtain the target features comprises:
    adding the features to be processed and the spatial attention features point by point to obtain the target features;
    or,
    multiplying the features to be processed and the spatial attention features point by point to obtain the target features.
  12. The method according to claim 1, wherein the target neural network further comprises a channel attention mechanism based on the squeeze-and-excitation framework;
    after the inputting of the data to be processed into the target neural network, the method further comprises:
    performing data processing based on the channel attention mechanism;
    wherein the channel attention mechanism is used to compress features from the spatial dimension, excite the correlation, in the channel dimension, of the features that have undergone spatial compression, and obtain attention information of the features in the channel dimension.
  13. The method according to claim 12, wherein the processing result is determined by the attention information in the spatial dimension and the attention information in the channel dimension, the attention information in the spatial dimension being obtained based on the spatial attention mechanism, and the attention information in the channel dimension being obtained based on the channel attention mechanism.
  14. The method according to any one of claims 1-13, wherein the target neural network is used to perform at least one of an image processing task, a speech processing task, a text processing task, and a video processing task.
  15. A neural network model, wherein the neural network model is a model constructed based on model parameters of a target neural network,
    wherein the target neural network is the target neural network according to any one of claims 1-14.
  16. A data processing apparatus, comprising:
    a data processing module, configured to input data to be processed into a target neural network, perform data processing based on a spatial attention mechanism of a squeeze-and-excitation framework, and obtain a processing result;
    wherein the spatial attention mechanism is used to compress features from a channel dimension, excite the correlation, in a spatial dimension, of the features that have undergone channel compression, and obtain attention information of the features in the spatial dimension.
  17. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor;
    wherein the memory stores one or more computer programs executable by the at least one processor, and the one or more computer programs are executed by the at least one processor so that the at least one processor can execute the data processing method according to any one of claims 1-14.
  18. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the data processing method according to any one of claims 1-14.
  19. A computer program product, comprising computer-readable code, or a non-volatile computer-readable storage medium carrying the computer-readable code, wherein, when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the data processing method according to any one of claims 1-14.
PCT/CN2023/111669 2022-08-09 2023-08-08 Data processing method and apparatus, neural network model, device, and medium WO2024032585A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210948062.X 2022-08-09
CN202210948062.XA CN115034375B (en) 2022-08-09 2022-08-09 Data processing method and device, neural network model, equipment and medium

Publications (1)

Publication Number Publication Date
WO2024032585A1 true WO2024032585A1 (en) 2024-02-15

Family

ID=83130537

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/111669 WO2024032585A1 (en) 2022-08-09 2023-08-08 Data processing method and apparatus, neural network model, device, and medium

Country Status (2)

Country Link
CN (1) CN115034375B (en)
WO (1) WO2024032585A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034375B (en) * 2022-08-09 2023-06-27 北京灵汐科技有限公司 Data processing method and device, neural network model, equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259982A (en) * 2020-02-13 2020-06-09 苏州大学 Premature infant retina image classification method and device based on attention mechanism
CN112215337A (en) * 2020-09-30 2021-01-12 江苏大学 Vehicle trajectory prediction method based on environment attention neural network model
CN114092764A (en) * 2021-11-19 2022-02-25 扬州大学 YOLOv5 neural network vehicle detection method added with attention mechanism
WO2022072659A1 (en) * 2020-10-01 2022-04-07 Beijing Dajia Internet Information Technology Co., Ltd. Video coding with neural network based in-loop filtering
CN114549538A (en) * 2022-02-24 2022-05-27 杭州电子科技大学 Brain tumor medical image segmentation method based on spatial information and characteristic channel
CN115034375A (en) * 2022-08-09 2022-09-09 北京灵汐科技有限公司 Data processing method and device, neural network model, device and medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11403486B2 (en) * 2019-11-13 2022-08-02 Huawei Technologies Co., Ltd. Methods and systems for training convolutional neural network using built-in attention
CN111310764B (en) * 2020-01-20 2024-03-26 上海商汤智能科技有限公司 Network training method, image processing device, electronic equipment and storage medium
CN111274999B (en) * 2020-02-17 2024-04-19 北京迈格威科技有限公司 Data processing method, image processing device and electronic equipment
US20230290134A1 (en) * 2020-09-25 2023-09-14 Intel Corporation Method and system of multiple facial attributes recognition using highly efficient neural networks
CN113111970B (en) * 2021-04-30 2023-12-26 陕西师范大学 Method for classifying images by constructing global embedded attention residual network
CN114202502A (en) * 2021-08-30 2022-03-18 浙大宁波理工学院 Thread turning classification method based on convolutional neural network
CN114359164A (en) * 2021-12-10 2022-04-15 中国科学院深圳先进技术研究院 Method and system for automatically predicting Alzheimer disease based on deep learning
CN114842185A (en) * 2022-03-21 2022-08-02 昭通亮风台信息科技有限公司 Method, device, equipment and medium for identifying fire
CN114782737A (en) * 2022-03-24 2022-07-22 福建亿榕信息技术有限公司 Image classification method, device and storage medium based on improved residual error network
CN114781513A (en) * 2022-04-22 2022-07-22 北京灵汐科技有限公司 Data processing method and device, equipment and medium


Also Published As

Publication number Publication date
CN115034375B (en) 2023-06-27
CN115034375A (en) 2022-09-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23851806

Country of ref document: EP

Kind code of ref document: A1