CN113298091A - Image processing method and device, electronic equipment and storage medium - Google Patents

Image processing method and device, electronic equipment and storage medium

Info

Publication number
CN113298091A
Authority
CN
China
Prior art keywords
image block
target
feature
module
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110573055.1A
Other languages
Chinese (zh)
Inventor
陈博宇
李楚鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sensetime Group Ltd
Original Assignee
Sensetime Group Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sensetime Group Ltd
Priority to CN202110573055.1A
Publication of CN113298091A
Priority to PCT/CN2021/126096 (WO2022247128A1)
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464 - Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/50 - Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis

Abstract

The present disclosure relates to an image processing method and apparatus, an electronic device, and a storage medium, the method including: determining a first image block feature corresponding to a target image, and dividing the first image block feature into a second image block feature and a third image block feature; performing feature enhancement on the second image block feature based on a global attention mechanism to obtain a fourth image block feature, and performing feature enhancement on the third image block feature based on a local attention mechanism to obtain a fifth image block feature; determining a target image block feature corresponding to the target image according to the fourth image block feature and the fifth image block feature; and performing a target image processing operation on the target image according to the target image block feature to obtain an image processing result. Embodiments of the present disclosure can improve the precision of the image processing result.

Description

Image processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
Recently, self-attention networks have been widely used in natural language processing; they perform feature enhancement by establishing connections between features, thereby improving the final performance of the network. With the introduction of visual self-attention networks, self-attention has also been applied at scale in the field of computer vision and has shown great potential. However, existing visual self-attention network designs simply follow the designs used in natural language processing and are not adapted to the characteristics of computer vision, so the performance of visual self-attention networks is poor.
Disclosure of Invention
The disclosure provides an image processing method and device, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided an image processing method including: determining a first image block feature corresponding to a target image, and dividing the first image block feature into a second image block feature and a third image block feature; performing feature enhancement on the second image block feature based on a global attention mechanism to obtain a fourth image block feature, and performing feature enhancement on the third image block feature based on a local attention mechanism to obtain a fifth image block feature; determining a target image block feature corresponding to the target image according to the fourth image block feature and the fifth image block feature; and performing a target image processing operation on the target image according to the target image block feature to obtain an image processing result.
In a possible implementation manner, the number of channels corresponding to the first image block feature is a target number of channels; the dividing the first image block feature into a second image block feature and a third image block feature includes: and dividing the first image block characteristics on channel dimensions according to the target channel number and the target proportion to obtain the second image block characteristics and the third image block characteristics.
In a possible implementation manner, the performing feature enhancement on the second image block feature based on the global attention mechanism to obtain a fourth image block feature includes: determining a first feature vector, a second feature vector and a third feature vector according to the features of the second image block; determining an attention feature map according to the first feature vector and the second feature vector; and determining the fourth image block characteristic according to the attention characteristic map and the third characteristic vector.
In a possible implementation manner, the performing feature enhancement on the third image block feature based on the local attention mechanism to obtain a fifth image block feature includes: and performing convolution processing on the third image block characteristics according to the target convolution kernel and the first channel expansion rate to obtain the fifth image block characteristics.
In a possible implementation manner, the determining, according to the fourth image block feature and the fifth image block feature, a target image block feature corresponding to the target image includes: and performing feature conversion on the fourth image block feature and the fifth image block feature according to a second channel expansion rate to obtain the target image block feature.
In one possible implementation, the image processing method is implemented by a self-attention neural network, wherein the self-attention neural network comprises at least one attention module, and at least one of the attention modules comprises a global attention submodule, a local attention submodule and a feed-forward submodule; the feature enhancement of the second image block feature based on the global attention mechanism to obtain a fourth image block feature includes: performing feature enhancement on the second image block feature based on a global attention mechanism by using the global attention submodule to obtain the fourth image block feature; the feature enhancement of the third image block feature based on the local attention mechanism to obtain a fifth image block feature includes: performing feature enhancement on the third image block feature based on a local attention mechanism by using the local attention submodule to obtain the fifth image block feature; and determining the target image block feature corresponding to the target image according to the fourth image block feature and the fifth image block feature includes: performing feature conversion on the fourth image block feature and the fifth image block feature by using the feed-forward submodule to obtain the target image block feature.
In one possible implementation, the method further includes: constructing a first network structure search space and a second network structure search space, wherein the first network structure search space comprises a plurality of module distribution hyper-parameters, and the second network structure search space comprises a plurality of module structure hyper-parameters; determining a target module distribution hyper-parameter from the plurality of module distribution hyper-parameters according to the first network structure search space, wherein the target module distribution hyper-parameter is used for indicating the number of the global attention submodules and the number of the local attention submodules included in each attention module; determining a target module structure hyper-parameter from the plurality of module structure hyper-parameters according to the target module distribution hyper-parameter and the second network structure search space, wherein the target module structure hyper-parameter is used for indicating the number of target channels corresponding to each attention module, the target convolution kernel and first channel expansion rate corresponding to each local attention submodule, and the second channel expansion rate corresponding to each feed-forward submodule; and constructing the self-attention neural network according to the target module distribution hyper-parameter and the target module structure hyper-parameter.
In one possible implementation, the determining a target module distribution hyper-parameter from the plurality of module distribution hyper-parameters according to the first network structure search space includes: constructing a first super network according to the first network structure search space, wherein the first super network comprises a plurality of first selectable network structures constructed according to the plurality of module distribution hyper-parameters; determining a first target network structure from the plurality of first selectable network structures by performing network training on the first super network; and determining the module distribution hyper-parameter corresponding to the first target network structure as the target module distribution hyper-parameter.
In one possible implementation, the determining a target module structure hyper-parameter from the plurality of module structure hyper-parameters according to the target module distribution hyper-parameter and the second network structure search space includes: constructing a second super network according to the target module distribution hyper-parameter and the second network structure search space, wherein the second super network comprises a plurality of second selectable network structures constructed according to the plurality of module structure hyper-parameters; determining a second target network structure from the plurality of second selectable network structures by performing network training on the second super network; and determining the module structure hyper-parameter corresponding to the second target network structure as the target module structure hyper-parameter.
In one possible implementation, in the same attention module, each local attention submodule corresponds to the same target convolution kernel and the same first channel expansion rate.
According to an aspect of the present disclosure, there is provided an image processing apparatus including: a feature determination module, configured to determine a first image block feature corresponding to a target image and divide the first image block feature into a second image block feature and a third image block feature; an attention module, configured to perform feature enhancement on the second image block feature based on a global attention mechanism to obtain a fourth image block feature, and perform feature enhancement on the third image block feature based on a local attention mechanism to obtain a fifth image block feature; a determination module, configured to determine a target image block feature corresponding to the target image according to the fourth image block feature and the fifth image block feature; and a target image processing module, configured to perform a target image processing operation on the target image according to the target image block feature to obtain an image processing result.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiment of the present disclosure, a first image block feature corresponding to a target image is determined, and the first image block feature is divided into a second image block feature and a third image block feature; feature enhancement is performed on the second image block feature based on a global attention mechanism to obtain a fourth image block feature, and feature enhancement is performed on the third image block feature based on a local attention mechanism to obtain a fifth image block feature; a target image block feature corresponding to the target image is determined according to the fourth image block feature and the fifth image block feature; and a target image processing operation is performed on the target image according to the target image block feature to obtain an image processing result. Because feature enhancement is performed on the image block features based on a global attention mechanism and a local attention mechanism respectively, the feature-enhanced target image block feature includes both global information and local information, which effectively improves the semantic expression capability of the target image block feature; performing the target image processing operation with a target image block feature of high semantic expression capability can therefore improve the precision of the image processing result.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a flow diagram of an image processing method according to an embodiment of the present disclosure;
FIG. 2 illustrates a schematic diagram of determining a plurality of first image block features corresponding to a target image according to an embodiment of the disclosure;
fig. 3 is a network structure diagram showing a self-attention neural network in the related art;
FIG. 4 illustrates a network architecture diagram of a self-attentive neural network, according to an embodiment of the disclosure;
FIG. 5 shows a schematic diagram of a hierarchical network structure search according to an embodiment of the present disclosure;
FIG. 6 shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure;
FIG. 7 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure;
FIG. 8 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a flow chart of an image processing method according to an embodiment of the present disclosure. The image processing method may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like, and the method may be implemented by a processor calling a computer readable instruction stored in a memory. Alternatively, the method may be performed by a server. As shown in fig. 1, the image processing method may include:
in step S11, a first image block feature corresponding to the target image is determined, and the first image block feature is divided into a second image block feature and a third image block feature.
The target image is an image to be processed which needs image processing. In order to better acquire the internal correlation of the target image, the target image may be divided into a plurality of image blocks, and feature extraction may be performed on each image block to obtain a plurality of first image block features corresponding to the target image. The number of the plurality of image blocks may be determined according to actual conditions, which is not specifically limited by the present disclosure.
Fig. 2 illustrates a schematic diagram of determining a plurality of first image block features corresponding to a target image according to an embodiment of the present disclosure. As shown in fig. 2, the target image is divided into L image blocks, and feature extraction is performed on each of the L image blocks to obtain L first image block features.
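For illustration, the image-block splitting and feature extraction step can be sketched as follows (a minimal PyTorch sketch; the patch size, embedding dimension, and class name are assumptions for illustration and are not specified by the present disclosure):

    import torch
    import torch.nn as nn

    class PatchEmbed(nn.Module):
        """Divide an image into L image blocks and extract one feature per block."""
        def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=192):
            super().__init__()
            self.num_blocks = (img_size // patch_size) ** 2  # L image blocks
            # A strided convolution is equivalent to cutting non-overlapping image
            # blocks and applying a shared linear projection to each block.
            self.proj = nn.Conv2d(in_chans, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)

        def forward(self, x):                    # x: (B, 3, H, W)
            x = self.proj(x)                     # (B, embed_dim, H/16, W/16)
            return x.flatten(2).transpose(1, 2)  # (B, L, embed_dim) feature sequence

The flattened output also corresponds to the conversion of the L first image block features into a first image block feature sequence described below.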
In order to facilitate subsequent processing of the plurality of first image block features, the plurality of first image block features may be converted into a sequence of first image block features. The specific manner of converting the plurality of first image block features into the first image block feature sequence may be determined according to actual situations, and this disclosure does not specifically limit this.
The first image block feature is divided into a second image block feature and a third image block feature, so that feature enhancement can be performed on the second image block feature and the third image block feature subsequently based on a global attention mechanism and a local attention mechanism respectively. Hereinafter, a process of dividing the first image block feature into the second image block feature and the third image block feature will be described in detail with reference to possible implementation manners of the present disclosure, and details will not be described here.
In step S12, the second image block feature is feature-enhanced based on the global attention mechanism to obtain a fourth image block feature, and the third image block feature is feature-enhanced based on the local attention mechanism to obtain a fifth image block feature.
The global attention mechanism may refer to a process of determining a global attention relationship between all features, and then determining a new feature according to the global attention relationship. Therefore, based on the global attention mechanism, the global attention relationship among the second image block features is determined, and then the fourth image block feature is determined based on the global attention relationship, so as to realize feature enhancement. The fourth image block feature is determined based on the global attention relationship, and therefore the fourth image block feature includes global information of the target image. Hereinafter, a process of performing feature enhancement based on global attention will be described in detail based on possible implementations of the present disclosure, and will not be described in detail herein.
The local attention mechanism may refer to a process of determining a local attention relationship between locally adjacent features, and then determining a new feature according to the local attention relationship. Therefore, based on the local attention mechanism, the local attention relationship between the local adjacent third image block features is determined, and then the fifth image block feature is determined based on the local attention relationship, so as to realize feature enhancement. The fifth image block feature is determined based on the local attention relationship, and therefore the fifth image block feature includes local information of the target image. Hereinafter, a process of performing feature enhancement based on local attention will be described in detail based on possible implementations of the present disclosure, and will not be described in detail herein.
In step S13, a target image block feature corresponding to the target image is determined according to the fourth image block feature and the fifth image block feature.
The fourth image block feature comprises global information of the target image, and the fifth image block feature comprises local information of the target image, so that the target image block feature determined according to the fourth image block feature and the fifth image block feature comprises both the global information and the local information of the target image, and has high semantic expression capability.
In step S14, a target image processing operation is performed on the target image according to the target image block feature, and an image processing result is obtained.
Because the target image block features include both global information and local information and have high semantic expression capability, the target image processing operation is performed by using the target image block features according to the image processing requirements, and an image processing result with high precision can be obtained.
In the embodiment of the present disclosure, a plurality of first image block features corresponding to a target image are determined, and the first image block features are divided into second image block features and third image block features; feature enhancement is performed on the second image block features based on a global attention mechanism to obtain fourth image block features, and feature enhancement is performed on the third image block features based on a local attention mechanism to obtain fifth image block features; target image block features corresponding to the target image are determined according to the fourth image block features and the fifth image block features; and a target image processing operation is performed on the target image according to the target image block features to obtain an image processing result. Because feature enhancement is performed on the image block features based on a global attention mechanism and a local attention mechanism respectively, the feature-enhanced target image block features include both global information and local information, which effectively improves their semantic expression capability; performing the target image processing operation with target image block features of high semantic expression capability can therefore improve the precision of the image processing result.
In a possible implementation manner, the number of channels corresponding to the first image block feature is a target number of channels; dividing the first image block feature into a second image block feature and a third image block feature, comprising: and dividing the first image block characteristics on the channel dimension according to the target channel number and the target proportion to obtain second image block characteristics and third image block characteristics.
By dividing the first image block features in the channel dimension, the second image block features used for global attention processing and the third image block features used for local attention processing can be obtained, with the numbers of second image block features and third image block features conforming to the target ratio, which prepares for the subsequent feature enhancement process. The specific values of the target channel number and the target ratio can be determined according to actual conditions, which is not specifically limited by the present disclosure.
Taking the above L first image block features as an example again, suppose the target channel number is d_k and the target ratio is a:b. The L first image block features may then be equally divided into N parts along the channel dimension; aN/(a+b) of the N parts are determined as aN/(a+b) second image block feature groups, and the remaining bN/(a+b) parts are determined as bN/(a+b) third image block feature groups. Each second image block feature group includes L second image block features, and the number of channels corresponding to each second image block feature is d_k/N; each third image block feature group includes L third image block features, and the number of channels corresponding to each third image block feature is d_k/N. The specific value of N may be determined according to actual conditions, which is not specifically limited by the present disclosure.
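As an illustration, this channel-dimension split can be sketched as follows (a minimal sketch assuming the L first image block features are held as a (B, L, d_k) tensor and that N is divisible by a+b; the function name and example values are assumptions):

    import torch

    def split_by_ratio(x: torch.Tensor, n_parts: int, a: int, b: int):
        """Equally divide (B, L, d_k) features into N parts along the channel
        dimension, then assign aN/(a+b) parts to the global-attention branch
        and the remaining bN/(a+b) parts to the local-attention branch."""
        parts = x.chunk(n_parts, dim=-1)           # N groups, each (B, L, d_k/N)
        n_global = n_parts * a // (a + b)          # number of second image block feature groups
        return parts[:n_global], parts[n_global:]  # (second groups, third groups)

    x = torch.randn(2, 196, 192)                   # B=2, L=196 image blocks, d_k=192
    second_groups, third_groups = split_by_ratio(x, n_parts=3, a=1, b=2)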
In a possible implementation manner, performing feature enhancement on the second image block feature based on a global attention mechanism to obtain a fourth image block feature includes: determining a first feature vector, a second feature vector and a third feature vector according to the second image block feature; determining an attention feature map according to the first feature vector and the second feature vector; and determining the fourth image block feature according to the attention feature map and the third feature vector.
By determining the attention feature map, a global attention relationship among the second image block features can be obtained, further feature enhancement based on a global attention mechanism can be achieved according to the attention feature map, and the fourth image block features including global information can be effectively obtained.
Taking the above aN/(a+b) second image block feature groups as an example again, for any one second image block feature group, the L second image block features included in that group are converted into three different feature vectors: a first feature vector Q, a second feature vector K and a third feature vector V, where the number of feature channels corresponding to each of Q, K and V is d_k/N. Feature enhancement based on the global attention mechanism is then performed using the following formula (1):

Att(Q, K, V) = Softmax(Q·K^T / sqrt(d_k/N)) · V    (1)

where Softmax(·) denotes the normalization function and · denotes the vector dot product. As shown in formula (1), a dot product operation is performed on the feature vector Q and the feature vector K to obtain the dot product result Q·K^T; the dot product result is scaled by the number of feature channels d_k/N and normalized with the Softmax function, yielding the attention feature map Softmax(Q·K^T / sqrt(d_k/N)) used for indicating the global attention relationship; and a dot product operation is performed between the attention feature map and the feature vector V to obtain Att(Q, K, V), which is the output result corresponding to the second image block feature group after feature enhancement based on the global attention mechanism.

The output results corresponding to the aN/(a+b) second image block feature groups are obtained in the above manner, and these output results are fused in the channel dimension to obtain the L fourth image block features after feature enhancement based on the global attention mechanism.
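Formula (1) translates directly into code; below is a minimal sketch of global attention for a single feature group held as a (B, L, d) tensor, where d = d_k/N (the linear projections producing Q, K and V are an assumed parameterization):

    import math
    import torch
    import torch.nn as nn

    class GlobalAttention(nn.Module):
        """Att(Q, K, V) = Softmax(Q.K^T / sqrt(d)).V over the block sequence."""
        def __init__(self, dim):
            super().__init__()
            self.to_q = nn.Linear(dim, dim)
            self.to_k = nn.Linear(dim, dim)
            self.to_v = nn.Linear(dim, dim)
            self.scale = 1.0 / math.sqrt(dim)      # d = d_k / N feature channels

        def forward(self, x):                      # x: (B, L, d)
            q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
            attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, L, L)
            attn = attn.softmax(dim=-1)            # attention feature map
            return attn @ v                        # feature-enhanced output (B, L, d)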
In a possible implementation manner, performing feature enhancement on the third image block feature based on a local attention mechanism to obtain a fifth image block feature includes: performing convolution processing on the third image block feature to obtain the fifth image block feature.
By performing convolution processing on the third image block features to realize feature enhancement based on a local attention mechanism, fifth image block features including local information can be effectively obtained from the convolution result. Since the third image block features are one-dimensional, the fifth image block features can be obtained by performing one-dimensional convolution processing on the third image block features.
Taking the above bN/(a+b) third image block feature groups as an example again, for any third image block feature group, one-dimensional convolution processing is performed on the L third image block features included in that group to obtain a convolution result corresponding to the group.
In a possible implementation manner, performing feature enhancement on the third image block feature based on a local attention mechanism to obtain a fifth image block feature includes: and performing convolution processing on the third image block characteristic according to the target convolution kernel and the first channel expansion rate to obtain a fifth image block characteristic.
The third image block feature is one-dimensional, so that the one-dimensional convolution processing can be performed on the third image block feature according to the target convolution kernel and the first channel expansion rate to obtain a fifth image block feature.
Taking the above bN/(a+b) third image block feature groups as an example again, convolution processing is performed on the L third image block features included in each third image block feature group according to the target convolution kernel and the first channel expansion rate, specifically including: performing one-dimensional point-by-point convolution (pointwise convolution) processing on the L third image block features according to the first channel expansion rate E to obtain a first convolution sub-result, where the number of channels corresponding to the first convolution sub-result is (d_k/N)×E; performing one-dimensional layer-by-layer convolution (depthwise convolution) processing on the first convolution sub-result according to the target convolution kernel K to obtain a second convolution sub-result, where the number of channels corresponding to the second convolution sub-result is (d_k/N)×E; and performing one-dimensional pointwise convolution processing on the second convolution sub-result according to the first channel expansion rate E to obtain the convolution result corresponding to the third image block feature group, where the number of channels of that convolution result is d_k/N. The specific values of the first channel expansion rate E and the target convolution kernel K may be determined according to actual conditions, which is not specifically limited by the present disclosure.
The convolution results corresponding to the bN/(a+b) third image block feature groups are obtained in the above manner, and these convolution results are fused in the channel dimension to obtain the L fifth image block features after feature enhancement based on the local attention mechanism.
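A minimal sketch of this pointwise-depthwise-pointwise stack for one feature group (assuming d = d_k/N input channels; the class name, default values, and the zero-padding used to preserve sequence length are assumptions):

    import torch
    import torch.nn as nn

    class LocalAttention(nn.Module):
        """Pointwise -> depthwise -> pointwise 1-D convolutions over the block
        sequence; channel counts follow d -> d*E -> d*E -> d."""
        def __init__(self, dim, kernel_size=17, expansion=4):
            super().__init__()
            hidden = dim * expansion               # first channel expansion rate E
            self.net = nn.Sequential(
                nn.Conv1d(dim, hidden, kernel_size=1),       # pointwise, expand channels
                nn.Conv1d(hidden, hidden, kernel_size,
                          padding=kernel_size // 2,
                          groups=hidden),                     # depthwise, target kernel K
                nn.Conv1d(hidden, dim, kernel_size=1),       # pointwise, restore channels
            )

        def forward(self, x):                      # x: (B, L, d)
            return self.net(x.transpose(1, 2)).transpose(1, 2)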
In a possible implementation manner, feature enhancement of the third image block features based on the local attention mechanism can also be implemented by reducing the size of the processed data, in addition to the one-dimensional convolution processing. Taking the above bN/(a+b) third image block feature groups as an example again, for any one third image block feature group, the feature enhancement shown in formula (1) is performed on only a part of the L third image block features included in that group (for example, L/2 third image block features), so that feature enhancement based on the local attention mechanism is realized by reducing the size of the processed data. The feature enhancement of the third image block features based on the local attention mechanism may also adopt other processing forms according to actual needs, which is not specifically limited by the present disclosure.
In a possible implementation manner, determining a target image block feature corresponding to the target image according to the fourth image block feature and the fifth image block feature includes: and performing feature conversion on the fourth image block feature and the fifth image block feature according to the second channel expansion rate to obtain a target image block feature.
And further performing feature conversion on the fourth image block feature and the fifth image block feature through the second channel expansion rate to obtain a target image block feature which comprises global information and local information, so that the semantic expression capability of the target image block feature is effectively improved.
Still taking the above L fourth image block features and L fifth image block features as an example, the L fourth image block features and the L fifth image block features are fused in the channel dimension to obtain a fusion result, and feature conversion is then performed on the fusion result according to the second channel expansion rate d_z, converting the channel number of the fusion result back to the target channel number corresponding to the first image block features before feature enhancement, so as to obtain the L target image block features with channel number d_k. The specific value of the second channel expansion rate d_z may be determined according to actual conditions, which is not specifically limited by the present disclosure.
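A minimal sketch of this feature conversion (a standard two-layer feed-forward block is assumed; the activation function is an assumption, as the text does not name one):

    import torch
    import torch.nn as nn

    class FeedForward(nn.Module):
        """Convert the fused global/local outputs: expand channels by the
        second channel expansion rate d_z, then project back to d_k."""
        def __init__(self, dim, expansion=4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, dim * expansion),   # expand by d_z
                nn.GELU(),                         # activation is an assumption
                nn.Linear(dim * expansion, dim),   # restore the target channel number d_k
            )

        def forward(self, x):                      # x: (B, L, d_k)
            return self.net(x)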
In one possible implementation, the target image processing operation includes one of: target detection, target tracking, image recognition and image classification.
And performing target image processing operation by using the target image block features with higher semantic expression capability, so that an image processing result with higher precision can be obtained.
For example, image classification is performed on the target image according to the target image block features, and an image classification result with higher classification accuracy corresponding to the target image is obtained. The target image processing operation may include other image processing operations according to actual image processing requirements, besides the above-mentioned target detection, target tracking, image recognition, and image classification, which is not specifically limited by the present disclosure.
In one possible implementation, the image processing method is implemented by a self-attention neural network, wherein the self-attention neural network comprises at least one attention module, and the at least one attention module comprises a global attention submodule, a local attention submodule and a feedforward submodule; performing feature enhancement on the second image block feature based on a global attention mechanism to obtain a fourth image block feature includes: performing feature enhancement on the second image block feature based on a global attention mechanism by using the global attention submodule to obtain the fourth image block feature; performing feature enhancement on the third image block feature based on a local attention mechanism to obtain a fifth image block feature includes: performing feature enhancement on the third image block feature based on a local attention mechanism by using the local attention submodule to obtain the fifth image block feature; and determining the target image block feature corresponding to the target image according to the fourth image block feature and the fifth image block feature includes: performing feature conversion on the fourth image block feature and the fifth image block feature by using the feedforward submodule to obtain the target image block feature.
The attention module arranged in the self-attention neural network includes a global attention submodule, a local attention submodule and a feedforward submodule, so that the global attention submodule can be used for feature enhancement based on the global attention mechanism, the local attention submodule can be used for feature enhancement based on the local attention mechanism, and the feedforward submodule can be used to further perform feature conversion on the image block features after feature enhancement based on the global and local attention mechanisms, thereby obtaining target image block features that include both global information and local information, effectively improving the semantic expression capability of the target image block features, and effectively improving the network performance of the self-attention neural network.
Fig. 3 shows a network structure diagram of a self-attention neural network in the related art. As shown in fig. 3, the self-attention neural network includes M attention modules (G-Block_1 to G-Block_M), each of which employs a Multi-Head Attention (MHA) mechanism. As shown in fig. 3, each attention module includes 3 global attention submodules (i.e., each attention module includes 3 attention heads) and a feed-forward submodule (FFN).
Taking the L first image block features as an example again, consider G-Block_1 with target channel number d_k. Since G-Block_1 includes 3 global attention submodules, the L first image block features are equally divided into N = 3 parts along the channel dimension, each part including L image block features with d_k/3 channels. The 3 parts of L image block features with d_k/3 channels are respectively sent to the global attention submodules, so that each global attention submodule performs feature enhancement based on the global attention mechanism shown in formula (1) and produces its output result. The output results of the global attention submodules are fused in the channel dimension and then input into the FFN submodule for further feature conversion, giving the output result of G-Block_1, namely L image block features with channel number d_k. The L image block features with channel number d_k output by G-Block_1 are then input into G-Block_2, and the above feature enhancement process is repeated until the output result of G-Block_M is obtained.
As shown in fig. 3, the self-attention neural network in the related art includes only the global attention submodule and the FFN submodule in each attention module, and therefore, the self-attention neural network in the related art can only implement feature enhancement based on the global attention mechanism, which results in ignoring local information in a target image, so that the network performance of the self-attention neural network is poor.
Fig. 4 illustrates a network structure diagram of a self-attention neural network according to an embodiment of the present disclosure. As shown in fig. 4, the self-attention neural network proposed by the present disclosure includes M attention modules (GL-Block_1 to GL-Block_M), at least one of which includes G global attention submodules, L local attention submodules and an FFN submodule. Comparing fig. 3 and fig. 4, G-Block_2 includes 3 global attention submodules; after 2 of the global attention submodules are replaced by 2 local attention submodules, GL-Block_2 is obtained, that is, GL-Block_2 includes 1 global attention submodule G, 2 local attention submodules L, and one FFN submodule.
Taking the L first image block features as an example again, consider GL-Block_1 with target channel number d_k, where GL-Block_1 includes G = 1 global attention submodule and L = 2 local attention submodules. The L first image block features are equally divided into N = 3 parts along the channel dimension, each part including L image block features with d_k/3 channels; one part of L image block features with d_k/3 channels is sent to the global attention submodule, and the remaining two parts of L image block features with d_k/3 channels are respectively sent to the two local attention submodules.

For GL-Block_1, the G = 1 global attention submodule performs feature enhancement based on the global attention mechanism shown in formula (1) to obtain its output result; the L = 2 local attention submodules respectively perform one-dimensional convolution processing to realize feature enhancement based on the local attention mechanism and obtain their output results. The output results of the 1 global attention submodule and the 2 local attention submodules are then fused in the channel dimension and input into the FFN submodule for further feature conversion, giving the output result of GL-Block_1, namely L image block features with channel number d_k. The L image block features with channel number d_k output by GL-Block_1 are input into GL-Block_2, and the above feature enhancement process is repeated according to the distribution of global and local attention submodules in GL-Block_2, until the output result of GL-Block_M is obtained.
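Putting the pieces together, one GL-Block can be sketched by reusing the GlobalAttention, LocalAttention and FeedForward classes sketched above (the 1:2 global/local split mirrors the GL-Block_1 example; residual connections and normalization layers, which such networks typically also contain, are omitted here as simplifying assumptions):

    import torch
    import torch.nn as nn

    class GLBlock(nn.Module):
        """One attention module with n_global global and n_local local attention
        submodules followed by an FFN submodule."""
        def __init__(self, dim, n_global=1, n_local=2,
                     kernel_size=17, expansion_e=4, expansion_z=4):
            super().__init__()
            n = n_global + n_local
            assert dim % n == 0                    # d_k must split into N equal parts
            sub = dim // n                         # d_k / N channels per submodule
            self.heads = nn.ModuleList(
                [GlobalAttention(sub) for _ in range(n_global)] +
                [LocalAttention(sub, kernel_size, expansion_e) for _ in range(n_local)]
            )
            self.ffn = FeedForward(dim, expansion_z)

        def forward(self, x):                      # x: (B, L, d_k)
            groups = x.chunk(len(self.heads), dim=-1)
            fused = torch.cat([h(g) for h, g in zip(self.heads, groups)], dim=-1)
            return self.ffn(fused)                 # (B, L, d_k)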
As shown in fig. 4, the self-attention neural network further includes an image processing module (a Head module shown in the figure), and the target image Block features output by GL-Block _ M are input to the Head module to perform corresponding target image processing operations, so as to obtain an image processing result. The Head module may be configured according to an actual image processing requirement, and for example, may be an image classification module, a target detection module, an image recognition module, an image tracking module, and the like, which is not specifically limited in this disclosure.
By introducing the local attention submodule into the self-attention neural network, the global information and the local information can be utilized in the characteristic strengthening process, so that the finally obtained target image block characteristic has higher semantic expression capability, and an image processing result with higher precision can be obtained after the target image characteristic is utilized to perform target image processing operation on the target image, thereby effectively improving the network performance of the self-attention neural network.
In one possible implementation, the local attention submodule may include two pointwise convolutional layers and one depthwise convolutional layer between the two pointwise convolutional layers. The first pointwise convolutional layer corresponds to the first channel expansion rate E so as to expand the channel number by a factor of E, the depthwise convolutional layer corresponds to the target convolution kernel K without changing the channel number, and the last pointwise convolutional layer restores the channel number to the input channel number. The feature enhancement process of the local attention submodule based on the local attention mechanism is similar to the related description above, and is not repeated here.
In one possible implementation, the FFN submodule corresponds to the second channel expansion rate d_z and is used to perform further feature conversion on the output results of each global attention submodule and each local attention submodule. The feature conversion process of the FFN submodule is similar to the related description above, and is not repeated here.
As shown in fig. 4, when constructing the self-attention neural network, the module distribution hyper-parameters corresponding to each attention module (the number G of global attention submodules and the number L of local attention submodules included in each attention module) and the module structure hyper-parameters corresponding to each attention module (the target channel number d_k corresponding to each attention module, the target convolution kernel K and first channel expansion rate E corresponding to each local attention submodule, and the second channel expansion rate d_z corresponding to each FFN submodule) are all network hyper-parameters that need to be considered.
In one possible implementation, the image processing method further includes: constructing a first network structure search space and a second network structure search space, wherein the first network structure search space comprises a plurality of module distribution hyper-parameters, and the second network structure search space comprises a plurality of module structure hyper-parameters; determining a target module distribution hyper-parameter from the plurality of module distribution hyper-parameters according to the first network structure search space, wherein the target module distribution hyper-parameter is used for indicating the number of global attention submodules and the number of local attention submodules included in each attention module; determining a target module structure hyper-parameter from the plurality of module structure hyper-parameters according to the target module distribution hyper-parameter and the second network structure search space, wherein the target module structure hyper-parameter is used for indicating the number of target channels corresponding to each attention module, the target convolution kernel and first channel expansion rate corresponding to each local attention submodule, and the second channel expansion rate corresponding to each FFN submodule; and constructing the self-attention neural network according to the target module distribution hyper-parameter and the target module structure hyper-parameter.
By constructing the first network structure search space and the second network structure search space, a hierarchical network structure search for the target module distribution hyper-parameters and the target module structure hyper-parameters can be performed based on the two spaces, which reduces the size of the search space corresponding to the hierarchical network structure search, effectively improves search efficiency, and enables the self-attention neural network to be constructed quickly.
Table 1 shows an example of a first network structure search space and a second network structure search space.
TABLE 1
Search space | Hyper-parameter | Alternatives
First network structure search space | module distribution (G, L) per attention module | e.g. (1, 2), (3, 0), (0, 3)
Second network structure search space | target channel number d_k | 96, 192, 384
Second network structure search space | target convolution kernel K | 17, 31, 45
Second network structure search space | first channel expansion rate E | 1, 2, 4
Second network structure search space | second channel expansion rate d_z | 2, 4, 6
As shown in Table 1, the first network structure search space includes the possible numbers of global attention submodules and local attention submodules in each attention module. For example, (G, L) = (1, 2) is used to indicate that the attention module includes G = 1 global attention submodule and L = 2 local attention submodules.
The second network structure search space includes the module structure hyper-parameters corresponding to each attention module, for example, the target channel number d_k corresponding to each attention module (alternatives include 96, 192, 384), the target convolution kernel K (alternatives include 17, 31, 45) and the first channel expansion rate E (alternatives include 1, 2, 4) corresponding to each local attention submodule, and the second channel expansion rate d_z corresponding to each FFN submodule (alternatives include 2, 4, 6).
The alternatives for the module distribution hyper-parameters (G, L) and the module structure hyper-parameters (K, E, d_k, d_z) may include other options besides those shown in Table 1, and both the options and the number of alternatives corresponding to each hyper-parameter may be determined according to actual conditions, which is not specifically limited by the present disclosure.
Fig. 5 shows a schematic diagram of a hierarchical network structure search according to an embodiment of the present disclosure. As shown in fig. 5, first, a high-level network structure search is performed in the first network structure search space, and a target module distribution hyper-parameter (G, L) is determined for each self-attention module from among the plurality of module distribution hyper-parameters. Then, with the target module distribution hyper-parameter (G, L) of each self-attention module held fixed, a low-level network structure search is performed in the second network structure search space, and a target module structure hyper-parameter (K, E, d_k, d_z) is determined for each self-attention module from among the plurality of module structure hyper-parameters.
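As a toy illustration of the hierarchy only (the actual search trains super networks and uses an evolutionary algorithm, as described below; the option lists and helper names are assumptions based on Table 1):

    import random

    # First search space: module distribution hyper-parameters (G, L).
    DISTRIBUTIONS = [(3, 0), (2, 1), (1, 2), (0, 3)]   # assumed option list
    # Second search space: module structure hyper-parameters, per Table 1.
    CHANNELS, KERNELS = [96, 192, 384], [17, 31, 45]   # d_k, K
    EXPANSION_E, EXPANSION_Z = [1, 2, 4], [2, 4, 6]    # E, d_z

    def sample_high_level(num_blocks):
        """High-level search: pick a (G, L) distribution for every attention module."""
        return [random.choice(DISTRIBUTIONS) for _ in range(num_blocks)]

    def sample_low_level(distribution):
        """Low-level search: with (G, L) fixed, pick structure hyper-parameters
        per module; K and E are shared by all local submodules of one module."""
        return [dict(G=g, L=l,
                     d_k=random.choice(CHANNELS),
                     K=random.choice(KERNELS),
                     E=random.choice(EXPANSION_E),
                     d_z=random.choice(EXPANSION_Z))
                for g, l in distribution]

    dist = sample_high_level(num_blocks=12)            # fix the distribution first
    arch = sample_low_level(dist)                      # then search the structures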
In one possible implementation, determining a target module distribution hyper-parameter from the plurality of module distribution hyper-parameters according to the first network structure search space includes: constructing a first super network according to the first network structure search space, wherein the first super network comprises a plurality of first selectable network structures constructed according to the plurality of module distribution hyper-parameters; determining a first target network structure from the plurality of first selectable network structures by performing network training on the first super network; and determining the module distribution hyper-parameter corresponding to the first target network structure as the target module distribution hyper-parameter.
According to the module distribution hyper-parameter included in the first network structure search space, a plurality of first selectable network structures can be constructed, further, a first super network including the plurality of first selectable network structures can be constructed, the first super network is trained to realize high-level network structure search in the first network structure search space, a target first network structure with optimal performance is determined from the plurality of first selectable network structures, and the module distribution hyper-parameter corresponding to the target first network structure is determined as the target module distribution hyper-parameter.
For example, the first super network is trained to obtain a plurality of trained first selectable network structures; accuracy verification is performed on the trained first selectable network structures based on an evolutionary algorithm, and a first preset number of first selectable network structures with the best accuracy ranking are selected; the selected first preset number of first selectable network structures are then trained again to determine a target first selectable network structure whose network performance meets the requirement, and the module distribution hyper-parameter corresponding to the target first selectable network structure is determined as the target module distribution hyper-parameter. The size of the first preset number may be determined according to actual conditions, which is not specifically limited by the present disclosure.
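The selection step alone can be sketched as follows (an illustrative helper only; evaluate stands in for accuracy verification on a validation set, and the mutation/crossover steps of the evolutionary algorithm and the subsequent retraining are omitted):

    def select_top_candidates(candidates, evaluate, preset_number):
        """Rank trained selectable network structures by verified accuracy and
        keep the preset number of best ones for retraining."""
        ranked = sorted(candidates, key=evaluate, reverse=True)
        return ranked[:preset_number]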
As shown in fig. 5, after the above high-level network structure search is performed in the first network structure search space shown in Table 1, the target module distribution hyper-parameter corresponding to GL-Block_1 is obtained as (G = 1, L = 2); the target module distribution hyper-parameter corresponding to GL-Block_2 is (G = 3, L = 0); ...; and the target module distribution hyper-parameter corresponding to GL-Block_M is (G = 0, L = 3).
In one possible implementation, determining the target module structure hyper-parameter from the plurality of module structure hyper-parameters according to the target module distribution hyper-parameter and the second network structure search space includes: constructing a second super network according to the target module distribution hyper-parameter and the second network structure search space, wherein the second super network comprises a plurality of second selectable network structures constructed according to the plurality of module structure hyper-parameters; determining a second target network structure from the plurality of second selectable network structures by performing network training on the second super network; and determining the module structure hyper-parameter corresponding to the second target network structure as the target module structure hyper-parameter.
With the target module distribution hyperparameter corresponding to each attention module held fixed, a plurality of second selectable network structures can be constructed according to the plurality of module structure hyperparameters included in the second network structure search space, and a second super network including the plurality of second selectable network structures can then be constructed. The second super network is trained to perform the low-level network structure search in the second network structure search space: a second target network structure with optimal performance is determined from the plurality of second selectable network structures, and the module structure hyperparameter corresponding to the second target network structure is determined as the target module structure hyperparameter.
In one possible implementation, in the same attention module, each local attention submodule corresponds to the same target convolution kernel and the same first channel expansion rate.
The principle of constructing the self-attention neural network may include: for the same attention module, each local attention submodule in the attention module corresponds to the same target convolution kernel K and the same first channel expansion rate E. For example, GL-Block_1 includes L = 2 local attention submodules, and these 2 local attention submodules correspond to the same target convolution kernel K and the same first channel expansion rate E. Based on this construction principle, the second selectable network structures included in the second super network that do not conform to the principle can be deleted, thereby reducing the size of the second network structure search space and improving the efficiency of the subsequent search for the second target network structure.
After the second network structure search space is reduced based on this construction principle of the self-attention neural network, the second super network includes only second selectable network structures that conform to the construction principle. The second super network is then trained to perform the low-level network structure search in the second network structure search space and determine the target module structure hyperparameter.
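The effect of this pruning on the size of the search space can be illustrated with a small Python computation; the candidate value sets below are hypothetical and do not reproduce the actual ranges of table 1.

    from itertools import product

    KERNELS = [17, 31, 45]   # hypothetical candidate target convolution kernels K
    EXPANSIONS = [1, 2, 4]   # hypothetical candidate first channel expansion rates E
    L = 3                    # local attention submodules in one attention module

    # Without the construction principle, every local attention submodule
    # would pick its own (K, E) pair independently:
    unconstrained = len(list(product(KERNELS, EXPANSIONS))) ** L  # 9 ** 3 = 729

    # With the principle, all L submodules of a module share a single (K, E):
    constrained = len(list(product(KERNELS, EXPANSIONS)))         # 9

    print(unconstrained, constrained)  # 729 9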
For example, the second super network is trained to obtain a plurality of trained second selectable network structures; accuracy verification is performed on the trained second selectable network structures based on an evolutionary algorithm, and a second preset number of second selectable network structures that rank best in accuracy are selected; the selected second preset number of second selectable network structures are then trained again to determine a second target network structure whose network performance meets the requirements, and the module structure hyperparameter corresponding to the second target network structure is determined as the target module structure hyperparameter. The second preset number may be determined according to the actual situation, which is not specifically limited in the present disclosure.
As shown in fig. 5, after the above low-level network structure search is performed in the second network structure search space shown in table 1, the target module structure hyperparameter corresponding to GL-Block_1 is (K = 17, E = 4, d_k = 92, d_z = 2); the target module structure hyperparameter corresponding to GL-Block_2 is (K = 45, E = 1); …; and the target module structure hyperparameter corresponding to GL-Block_M is (d_k = 192, d_z = 3).
The self-attention neural network can be constructed according to the target module distribution hyperparameter obtained by the high-level network structure search and the target module structure hyperparameter obtained by the low-level network structure search. The self-attention neural network includes at least one attention module, and the at least one attention module includes a global attention submodule and a local attention submodule, so that the self-attention neural network can utilize both global information and local information in the process of feature enhancement, effectively improving its network performance.
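A minimal sketch of how the searched hyperparameters might drive network construction is given below in PyTorch; AttentionModule is a hypothetical placeholder for one GL-Block, and the configurations shown are illustrative rather than the searched values of fig. 5.

    import torch.nn as nn

    class AttentionModule(nn.Module):
        # Hypothetical stand-in for one GL-Block, configured by its searched
        # module distribution hyperparameters (G, L) and module structure
        # hyperparameters (here only K and E are shown).
        def __init__(self, G, L, K, E):
            super().__init__()
            self.G, self.L = G, L  # numbers of global/local attention submodules
            self.K, self.E = K, E  # shared target convolution kernel / expansion rate

        def forward(self, x):
            return x  # placeholder body; real submodules omitted

    def build_self_attention_network(configs):
        # One attention module per searched configuration, stacked in order.
        return nn.Sequential(*(AttentionModule(**cfg) for cfg in configs))

    net = build_self_attention_network([
        dict(G=1, L=2, K=17, E=4),
        dict(G=3, L=0, K=45, E=1),
    ])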
According to image processing requirements, the self-attention neural network of the present disclosure can be applied to image processing tasks such as target detection, target tracking, image recognition, and image classification; the present disclosure does not specifically limit these tasks.
It can be understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from their principles and logic; due to space limitations, the details are not repeated in the present disclosure. Those skilled in the art will appreciate that, in the above methods of the specific embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides an image processing apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any one of the image processing methods provided by the present disclosure; for the corresponding technical solutions and descriptions, reference may be made to the corresponding descriptions in the methods section, which are not repeated for brevity.
Fig. 6 shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure. As shown in fig. 6, the apparatus 60 includes:
the feature determination module 61 is configured to determine a first image block feature corresponding to the target image, and divide the first image block feature into a second image block feature and a third image block feature;
the attention module 62 is configured to perform feature enhancement on the second image block feature based on a global attention mechanism to obtain a fourth image block feature, and perform feature enhancement on the third image block feature based on a local attention mechanism to obtain a fifth image block feature;
the first determining module 63 is configured to determine a target image block feature corresponding to the target image according to the fourth image block feature and the fifth image block feature;
and the target image processing module 64 is configured to perform a target image processing operation on the target image according to the target image block feature to obtain an image processing result.
In a possible implementation manner, the number of channels corresponding to the first image block feature is a target number of channels;
the feature determination module 61 is specifically configured to:
and dividing the first image block feature in the channel dimension according to the target channel number and the target proportion to obtain the second image block feature and the third image block feature.
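By way of illustration, the division can be sketched in PyTorch as follows; the 1:1 target proportion and the tensor shapes are assumptions for the example only.

    import torch

    def split_by_proportion(first_feature, proportion=0.5):
        # first_feature: (batch, num_image_blocks, target_channels); split
        # along the channel dimension according to the target proportion.
        channels = first_feature.shape[-1]
        c_second = int(channels * proportion)
        second, third = torch.split(
            first_feature, [c_second, channels - c_second], dim=-1)
        return second, third

    x = torch.randn(2, 196, 64)             # e.g. 14 x 14 image blocks, 64 channels
    second, third = split_by_proportion(x)  # shapes (2, 196, 32) and (2, 196, 32)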
In one possible implementation, the attention module 62 includes a global attention submodule, and the global attention submodule is specifically configured to:
determining a first feature vector, a second feature vector and a third feature vector according to the second image block feature;
determining an attention feature map according to the first feature vector and the second feature vector;
and determining the fourth image block feature according to the attention feature map and the third feature vector.
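A minimal single-head sketch of this submodule is shown below, reading the first, second and third feature vectors as query, key and value projections; the single-head form and the projection sizes are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GlobalAttention(nn.Module):
        def __init__(self, dim, d_k):
            super().__init__()
            self.q = nn.Linear(dim, d_k)  # first feature vector
            self.k = nn.Linear(dim, d_k)  # second feature vector
            self.v = nn.Linear(dim, dim)  # third feature vector

        def forward(self, x):  # x: (batch, num_image_blocks, dim)
            q, k, v = self.q(x), self.k(x), self.v(x)
            # attention feature map from the first and second feature vectors
            attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
            # fourth image block feature from the map and the third feature vector
            return attn @ v

    out = GlobalAttention(dim=32, d_k=92)(torch.randn(2, 196, 32))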
In one possible implementation, the attention module 62 further includes a local attention submodule, and the local attention submodule is specifically configured to:
and performing convolution processing on the third image block feature according to the target convolution kernel and the first channel expansion rate to obtain the fifth image block feature.
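A minimal sketch of this convolution processing is given below, treating it as an inverted-bottleneck depthwise convolution; this reading of the target convolution kernel K and the first channel expansion rate E, as well as the reshaping of image block features onto a 2D grid, are assumptions.

    import torch
    import torch.nn as nn

    class LocalAttention(nn.Module):
        def __init__(self, dim, K, E):
            super().__init__()
            hidden = dim * E  # expand channels by the first channel expansion rate
            self.expand = nn.Conv2d(dim, hidden, 1)
            self.depthwise = nn.Conv2d(hidden, hidden, K, padding=K // 2,
                                       groups=hidden)  # K x K target convolution kernel
            self.project = nn.Conv2d(hidden, dim, 1)

        def forward(self, x):  # x: (batch, dim, H, W) grid of image block features
            return self.project(self.depthwise(self.expand(x)))

    out = LocalAttention(dim=32, K=17, E=4)(torch.randn(2, 32, 14, 14))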
In a possible implementation manner, the first determining module 63 is specifically configured to:
and performing feature conversion on the fourth image block feature and the fifth image block feature according to the second channel expansion rate to obtain a target image block feature.
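A minimal sketch of this feature conversion follows; concatenating the fourth and fifth image block features back to the target channel number before a two-layer transformation with the second channel expansion rate is an assumption about the feedforward design.

    import torch
    import torch.nn as nn

    class FeedForward(nn.Module):
        def __init__(self, target_channels, E2):
            super().__init__()
            hidden = target_channels * E2  # second channel expansion rate
            self.fc1 = nn.Linear(target_channels, hidden)
            self.fc2 = nn.Linear(hidden, target_channels)

        def forward(self, fourth, fifth):
            # Re-join the two enhanced halves along the channel dimension,
            # then convert them into the target image block feature.
            x = torch.cat([fourth, fifth], dim=-1)
            return self.fc2(torch.relu(self.fc1(x)))

    ffn = FeedForward(target_channels=64, E2=3)
    target = ffn(torch.randn(2, 196, 32), torch.randn(2, 196, 32))  # (2, 196, 64)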
In one possible implementation, the apparatus 60 implements the image processing method by a self-attention neural network, wherein the self-attention neural network includes at least one attention module 62, and the at least one attention module 62 includes a global attention sub-module, a local attention sub-module, and a feed-forward sub-module;
the global attention sub-module is used for performing feature enhancement on the second image block features based on a global attention mechanism to obtain fourth image block features;
the local attention submodule is used for performing feature enhancement on the third image block feature based on a local attention mechanism to obtain a fifth image block feature;
and the feedforward sub-module is used for performing feature conversion on the fourth image block feature and the fifth image block feature to obtain a target image block feature.
In one possible implementation, the apparatus 60 further includes:
the search space construction module is used for constructing a first network structure search space and a second network structure search space, wherein the first network structure search space comprises a plurality of module distribution hyper-parameters, and the second network structure search space comprises a plurality of module structure hyper-parameters;
the second determining module is used for determining a target module distribution hyperparameter from the plurality of module distribution hyperparameters according to the first network structure search space, wherein the target module distribution hyperparameter is used for indicating the number of global attention sub-modules and the number of local attention sub-modules included in each attention module;
a third determining module, configured to determine a target module structure hyperparameter from the plurality of module structure hyperparameters according to the target module distribution hyperparameter and the second network structure search space, where the target module structure hyperparameter is used to indicate the number of target channels corresponding to each attention module, the target convolution kernel and the first channel expansion rate corresponding to each local attention submodule, and the second channel expansion rate corresponding to each feedforward sub-module;
and the network construction module is used for constructing the self-attention neural network according to the target module distribution hyper-parameter and the target module structure hyper-parameter.
In a possible implementation manner, the second determining module is specifically configured to:
constructing a first super network according to the first network structure search space, wherein the first super network includes a plurality of first selectable network structures constructed according to the plurality of module distribution hyperparameters;
determining a first target network structure from the plurality of first selectable network structures by performing network training on the first super network;
and determining the module distribution hyperparameter corresponding to the first target network structure as the target module distribution hyperparameter.
In a possible implementation manner, the third determining module is specifically configured to:
constructing a second super network according to the target module distribution hyperparameter and the second network structure search space, wherein the second super network includes a plurality of second selectable network structures constructed according to the plurality of module structure hyperparameters;
determining a second target network structure from the plurality of second selectable network structures by performing network training on the second super network;
and determining the module structure hyperparameter corresponding to the second target network structure as the target module structure hyperparameter.
In one possible implementation, in the same attention module, each local attention submodule corresponds to the same target convolution kernel and the same first channel expansion rate.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
The disclosed embodiments also provide a computer program product, including computer-readable code or a non-transitory computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in a processor of an electronic device, the processor executes the above method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 7 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 7, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or another such terminal.
Referring to fig. 7, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
FIG. 8 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 8, electronic device 1900 may be provided as a server. Referring to fig. 8, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), can execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, thereby implementing aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a software development kit (SDK) or the like.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

1. An image processing method, comprising:
determining a first image block feature corresponding to a target image, and dividing the first image block feature into a second image block feature and a third image block feature;
performing feature enhancement on the second image block feature based on a global attention mechanism to obtain a fourth image block feature, and performing feature enhancement on the third image block feature based on a local attention mechanism to obtain a fifth image block feature;
determining a target image block feature corresponding to the target image according to the fourth image block feature and the fifth image block feature;
and performing a target image processing operation on the target image according to the target image block feature to obtain an image processing result.
2. The method according to claim 1, wherein the number of channels corresponding to the first image block feature is a target number of channels;
the dividing the first image block feature into a second image block feature and a third image block feature includes:
and dividing the first image block feature in a channel dimension according to the target number of channels and a target proportion to obtain the second image block feature and the third image block feature.
3. The method according to claim 1 or 2, wherein the performing feature enhancement on the second image block feature based on the global attention mechanism to obtain a fourth image block feature comprises:
determining a first feature vector, a second feature vector and a third feature vector according to the second image block feature;
determining an attention feature map according to the first feature vector and the second feature vector;
and determining the fourth image block feature according to the attention feature map and the third feature vector.
4. The method according to any one of claims 1 to 3, wherein the performing feature enhancement on the third image block feature based on the local attention mechanism to obtain a fifth image block feature comprises:
and performing convolution processing on the third image block feature according to the target convolution kernel and the first channel expansion rate to obtain the fifth image block feature.
5. The method according to any one of claims 1 to 4, wherein the determining a target image block feature corresponding to the target image according to the fourth image block feature and the fifth image block feature comprises:
and performing feature conversion on the fourth image block feature and the fifth image block feature according to a second channel expansion rate to obtain the target image block feature.
6. The method according to any one of claims 1 to 5, wherein the image processing method is implemented by a self-attention neural network, wherein the self-attention neural network comprises at least one attention module, and at least one of the attention modules comprises a global attention sub-module, a local attention sub-module and a feedforward sub-module;
the feature enhancement of the second image block feature based on the global attention mechanism to obtain a fourth image block feature includes:
performing feature enhancement on the second image block feature based on a global attention mechanism by using the global attention submodule to obtain a fourth image block feature;
the feature enhancement of the third image block feature based on the local attention mechanism to obtain a fifth image block feature includes:
performing feature enhancement on the third image block feature based on a local attention mechanism by using the local attention submodule to obtain a fifth image block feature;
determining the target image block characteristics corresponding to the target image according to the fourth image block characteristics and the fifth image block characteristics, including:
and performing feature conversion on the fourth image block feature and the fifth image block feature by using the feedforward sub-module to obtain the target image block feature.
7. The method of claim 6, further comprising:
constructing a first network structure search space and a second network structure search space, wherein the first network structure search space comprises a plurality of module distribution hyper-parameters, and the second network structure search space comprises a plurality of module structure hyper-parameters;
determining a target module distribution hyperparameter from the plurality of module distribution hyperparameters according to the first network structure search space, wherein the target module distribution hyperparameter is used for indicating the number of the global attention sub-modules and the number of the local attention sub-modules included in each attention module;
determining a target module structure hyperparameter from the plurality of module structure hyperparameters according to the target module distribution hyperparameter and the second network structure search space, wherein the target module structure hyperparameter is used for indicating the number of target channels corresponding to each attention module, the target convolution kernel and the first channel expansion rate corresponding to each local attention submodule, and the second channel expansion rate corresponding to each feedforward sub-module;
and constructing the self-attention neural network according to the target module distribution hyper-parameter and the target module structure hyper-parameter.
8. The method of claim 7, wherein determining a target module distribution hyperparameter from the plurality of module distribution hyperparameters based on the first network structure search space comprises:
constructing a first super network according to the first network structure search space, wherein the first super network comprises a plurality of first selectable network structures constructed according to the plurality of module distribution super parameters;
determining a first target network structure from the plurality of first selectable network structures by network training the first super network;
and determining the module distribution hyperparameter corresponding to the first target network structure as the target module distribution hyperparameter.
9. The method of claim 7 or 8, wherein determining a target module structure hyperparameter from the plurality of module structure hyperparameters based on the target module distribution hyperparameter and the second network structure search space comprises:
constructing a second super network according to the target module distribution super parameter and the second network structure search space, wherein the second super network comprises a plurality of second selectable network structures constructed according to the plurality of module structure super parameters;
determining a second target network structure from the plurality of second selectable network structures by network training the second super network;
and determining the module structure hyperparameters corresponding to the second target network structure as the target module structure hyperparameters.
10. The method according to any one of claims 7 to 9, wherein, in the same attention module, each of the local attention sub-modules corresponds to the same target convolution kernel and the same first channel expansion rate.
11. An image processing apparatus characterized by comprising:
the feature determination module is used for determining a first image block feature corresponding to a target image and dividing the first image block feature into a second image block feature and a third image block feature;
the attention module is used for performing feature enhancement on the second image block feature based on a global attention mechanism to obtain a fourth image block feature, and performing feature enhancement on the third image block feature based on a local attention mechanism to obtain a fifth image block feature;
the first determining module is used for determining a target image block feature corresponding to the target image according to the fourth image block feature and the fifth image block feature;
and the target image processing module is used for performing a target image processing operation on the target image according to the target image block feature to obtain an image processing result.
12. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any one of claims 1 to 10.
13. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 10.
CN202110573055.1A 2021-05-25 2021-05-25 Image processing method and device, electronic equipment and storage medium Pending CN113298091A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110573055.1A CN113298091A (en) 2021-05-25 2021-05-25 Image processing method and device, electronic equipment and storage medium
PCT/CN2021/126096 WO2022247128A1 (en) 2021-05-25 2021-10-25 Image processing method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110573055.1A CN113298091A (en) 2021-05-25 2021-05-25 Image processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113298091A true CN113298091A (en) 2021-08-24

Family

ID=77325042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110573055.1A Pending CN113298091A (en) 2021-05-25 2021-05-25 Image processing method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113298091A (en)
WO (1) WO2022247128A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022247128A1 (en) * 2021-05-25 2022-12-01 上海商汤智能科技有限公司 Image processing method and apparatus, electronic device, and storage medium
WO2023098000A1 (en) * 2021-11-30 2023-06-08 上海商汤智能科技有限公司 Image processing method and apparatus, defect detection method and apparatus, electronic device and storage medium
WO2023211248A1 (en) * 2022-04-28 2023-11-02 Samsung Electronics Co., Ltd. Method and electronic device for determining optimal global attention in deep learning model

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170337682A1 (en) * 2016-05-18 2017-11-23 Siemens Healthcare Gmbh Method and System for Image Registration Using an Intelligent Artificial Agent
CN111080629B (en) * 2019-12-20 2021-10-22 河北工业大学 Method for detecting image splicing tampering
CN112070670B (en) * 2020-09-03 2022-05-10 武汉工程大学 Face super-resolution method and system of global-local separation attention mechanism
CN112418351B (en) * 2020-12-11 2023-04-07 天津大学 Zero sample learning image classification method based on global and local context sensing
CN112784764B (en) * 2021-01-27 2022-07-12 南京邮电大学 Expression recognition method and system based on local and global attention mechanism
CN113298091A (en) * 2021-05-25 2021-08-24 商汤集团有限公司 Image processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2022247128A1 (en) 2022-12-01

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40051295

Country of ref document: HK