CN115062673B - Image processing method, image processing device, electronic equipment and storage medium - Google Patents

Image processing method, image processing device, electronic equipment and storage medium

Info

Publication number: CN115062673B (application CN202210895580.XA; authority: CN, China)
Other versions: CN115062673A (in Chinese)
Legal status: Active (granted)
Prior art keywords: feature extraction, image, attention, self, token
Inventors: 赫然, 黄怀波, 周晓强
Assignee (original and current): Institute of Automation of Chinese Academy of Science
Application filed by the Institute of Automation of Chinese Academy of Science

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods


Abstract

The invention relates to the technical field of computer vision and provides an image processing method, an image processing device, an electronic device, and a storage medium. The method comprises the following steps: acquiring an image to be processed; inputting the image to be processed into a feature extraction model to obtain the image features output by the feature extraction model; and performing image processing on the image to be processed based on the image features. The feature extraction model includes an orthogonal self-attention module for projecting tokens of the image to be processed into an orthogonal space for self-attention transformation. Because the orthogonal self-attention module in the feature extraction model projects the tokens of the image to be processed into an orthogonal space for self-attention conversion, the method, device, electronic device, and storage medium provided by the invention reduce the complexity of the self-attention conversion, improve the quality of the extracted image features, and ensure the effectiveness of the image processing.

Description

Image processing method, image processing device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer vision technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of artificial intelligence, researchers have successfully applied the self-attention mechanism of the Transformer network to image feature extraction in image processing.
However, the existing self-attention mechanism suffers from high computational complexity. To address this, the prior art reduces the computational complexity of the global self-attention mechanism by reducing the number of tokens, but this is accompanied by the loss of fine-grained image feature information.
Therefore, how to extract image features while reducing the complexity of feature extraction and without losing fine-grained image feature information is an urgent problem to be solved in the technical field of image processing.
Disclosure of Invention
The invention provides an image processing method, an image processing device, electronic equipment and a storage medium, which are used for solving the defect of high complexity in image feature extraction in the prior art.
The invention provides an image processing method, which comprises the following steps:
acquiring an image to be processed;
inputting the image to be processed into a feature extraction model to obtain image features output by the feature extraction model;
based on the image characteristics, carrying out image processing on the image to be processed;
the feature extraction model comprises an orthogonal self-attention module for projecting tokens of the image to be processed into an orthogonal space for self-attention transformation.
According to the image processing method provided by the invention, the feature extraction model comprises a plurality of cascaded feature extraction modules, the plurality of feature extraction modules comprise orthogonal feature extraction modules, and the orthogonal feature extraction modules comprise the orthogonal self-attention module and a forward propagation network which are cascaded;
the inputting the image to be processed into a feature extraction model to obtain the image features output by the feature extraction model comprises the following steps:
inputting a previous token of the image to be processed to a current feature extraction module to obtain a current token output by the current feature extraction module, wherein the previous token is output by a feature extraction module before the current feature extraction module;
and taking the token output by the last feature extraction module as the image feature.
According to an image processing method provided by the present invention, in a case that the current feature extraction module is an orthogonal feature extraction module, the step of inputting a previous token of the image to be processed to the current feature extraction module to obtain a current token output by the current feature extraction module includes:
inputting the previous token into an orthogonal self-attention module of a current feature extraction module, orthogonalizing the previous token by the orthogonal self-attention module to obtain an orthogonal token, then performing multi-head attention calculation on the orthogonal token, performing inverse orthogonalization on the orthogonal attention feature obtained by calculation, and fusing the attention feature subjected to inverse orthogonalization and the previous token to obtain a current self-attention feature output by the orthogonal self-attention module;
and inputting the current self-attention feature into a forward propagation network of a current feature extraction module to obtain a current token output by the forward propagation network.
According to an image processing method provided by the invention, the plurality of feature extraction modules further comprise a window feature extraction module, the window feature extraction module comprises a window self-attention module and a forward propagation network which are cascaded, and the window self-attention module is used for dividing tokens of the image to be processed in a sliding window mode and then performing self-attention conversion.
According to an image processing method provided by the present invention, in a case that the current feature extraction module is a window feature extraction module, the step of inputting a previous token of the image to be processed to the current feature extraction module to obtain a current token output by the current feature extraction module includes:
inputting the previous token into a window self-attention module of a current feature extraction module, performing window division on the previous token by the window self-attention module to obtain a window token, performing multi-head attention calculation on the window token, performing window combination on the calculated window attention feature, and fusing the attention feature after window combination with the previous token to obtain a current self-attention feature output by the window self-attention module;
and inputting the current self-attention feature into a forward propagation network of a current feature extraction module to obtain a current token output by the forward propagation network.
According to an image processing method provided by the invention, the forward propagation network comprises a first convolution layer, and the first convolution layer is used for extracting position information.
According to an image processing method provided by the invention, the forward propagation network comprises a first branch and a second branch, the input of the first branch and the input of the second branch are the same, and the output of the first branch and the output of the second branch are added to be used as the output of the forward propagation network;
the first branch comprises a first normalization layer, a first full-connection layer, an activation layer, the first convolution layer and a second full-connection layer which are connected in sequence;
when the convolution kernel step size of the first convolution layer is 1, the input and the output of the second branch are the same;
and under the condition that the convolution kernel step size of the first convolution layer is larger than 1, the second branch comprises a second normalization layer and a second convolution layer which are sequentially connected, and the convolution kernel step size of the second convolution layer is the same as that of the first convolution layer.
The present invention also provides an image processing apparatus comprising:
the acquisition unit is used for acquiring an image to be processed;
the characteristic extraction unit is used for inputting the image to be processed into a characteristic extraction model to obtain the image characteristics output by the characteristic extraction model;
the image processing unit is used for carrying out image processing on the image to be processed based on the image characteristics;
the feature extraction model comprises an orthogonal self-attention module for projecting tokens of the image to be processed into an orthogonal space for self-attention transformation.
The present invention also provides an electronic device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the image processing method as described in any of the above when executing the program.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image processing method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the image processing method as described in any one of the above.
The orthogonal self-attention module in the feature extraction model can project the token of the image to be processed to the orthogonal space for self-attention conversion, so that the complexity of self-attention conversion is reduced, the extraction quality of the image features is improved, and the effectiveness of image processing is ensured.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of an image processing method provided by the present invention;
FIG. 2 is a schematic structural diagram of an orthogonal feature extraction module provided in the present invention;
FIG. 3 is a schematic structural diagram of a convolutional encoder provided in the present invention;
FIG. 4 is a schematic flow chart of the self-attention conversion performed by the orthogonal self-attention module according to the present invention;
FIG. 5 is a schematic structural diagram of a window feature extraction module provided in the present invention;
FIG. 6 is a schematic flow chart of the window feature extraction module performing window self-attention transformation according to the present invention;
FIG. 7 is a schematic structural diagram of a forward propagation network when the convolution kernel step size of the first convolution layer provided by the present invention is greater than 1;
fig. 8 is a schematic structural diagram of a forward propagation network in a case that a convolution kernel step of the first convolution layer provided by the present invention is 1;
FIG. 9 is a schematic structural diagram of a feature extraction model provided by the present invention;
FIG. 10 is a schematic diagram of an image processing apparatus according to the present invention;
fig. 11 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
The terms "first", "second", and the like in the description and claims of the present invention are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances, such that embodiments of the invention may be practiced in sequences other than those illustrated or described herein, and that the objects identified by "first", "second", etc. are generally of one class.
The present invention provides an image processing method, and fig. 1 is a schematic flow chart of the image processing method provided by the present invention, as shown in fig. 1, the method includes:
step 110, acquiring an image to be processed.
Here, the image to be processed may be acquired in advance by an image acquisition device, obtained by real-time shooting, or obtained by downloading or scanning through the Internet, which is not specifically limited in this embodiment of the present invention.
Step 120, inputting the image to be processed into a feature extraction model to obtain image features output by the feature extraction model;
step 130, processing the image to be processed based on the image characteristics;
the feature extraction model comprises an orthogonal self-attention module for projecting tokens of the image to be processed into an orthogonal space for self-attention transformation.
Specifically, the acquired image to be processed is input to a feature extraction model to obtain the image features output by the feature extraction model. The feature extraction model includes an orthogonal self-attention module, which is used to project tokens of the image to be processed into an orthogonal space for self-attention conversion. The orthogonal self-attention module here may perform the steps of orthogonalization, multi-head attention calculation, and inverse orthogonalization in sequence.
The tokens are obtained by dividing the image to be processed into different image blocks in a sliding-window manner and encoding each image block. The orthogonal space here refers to a space whose metric is defined by orthogonality (the vector inner product), i.e., a space equipped with a symmetric bilinear form. The orthogonal self-attention module can project the tokens of the image to be processed into the orthogonal space through orthogonalization, perform the multi-head attention calculation on the tokens in the orthogonal space, and then restore the tokens obtained by the attention calculation from the orthogonal space to the original space through inverse orthogonalization.
It can be appreciated that the computational complexity of the self-attention mechanism in the Transformer network is proportional to the square of the number of input tokens. Projecting the tokens of the image to be processed into an orthogonal space for self-attention conversion can therefore reduce the computational complexity of the self-attention mechanism and, in turn, the complexity of image feature extraction.
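A minimal NumPy sketch of this complexity argument, under the assumption that the orthogonal projection maps N tokens onto M orthogonal basis tokens with M < N (the specific shapes, the softmax attention, and the use of the transpose as the inverse projection are illustrative assumptions, not the patent's exact construction):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d = 64, 16, 8          # N input tokens reduced to M orthogonal tokens, M < N
X = rng.normal(size=(N, d))  # token matrix of the image to be processed

# Semi-orthogonal projection: rows are orthonormal, so P @ P.T = I_M and
# the transpose P.T acts as the inverse (de-orthogonalization) mapping.
P = np.linalg.qr(rng.normal(size=(N, M)))[0].T      # shape (M, N)

Z = P @ X                    # project tokens into the orthogonal space: (M, d)
A = np.exp(Z @ Z.T / np.sqrt(d))
A /= A.sum(-1, keepdims=True)          # M x M attention map instead of N x N
Y = P.T @ (A @ Z)            # attend in orthogonal space, project back: (N, d)

print(A.shape, Y.shape)      # attention cost scales with M**2, not N**2
```

The quadratic term of self-attention now scales with M² = 256 rather than N² = 4096, which is the source of the complexity reduction the patent describes.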
After the image to be processed is input to the feature extraction model to obtain the image features output by the feature extraction model, the image to be processed may be processed based on the image features, where the image processing may be target detection, semantic segmentation, image reconstruction, and the like, which is not specifically limited in the embodiment of the present invention. It can be understood that the image features output by the feature extraction model are image features subjected to self-attention conversion, and image processing is performed based on the image features obtained by the self-attention conversion, so that the reliability of the image processing can be ensured.
According to the method provided by the embodiment of the invention, the orthogonal self-attention module in the feature extraction model can project the token of the image to be processed to the orthogonal space for self-attention conversion, so that the complexity of self-attention conversion is reduced, the extraction quality of the image features is improved, and the effectiveness of image processing is ensured.
Based on the above embodiment, the feature extraction model includes a plurality of feature extraction modules connected in cascade, and the plurality of feature extraction modules includes an orthogonal feature extraction module. Fig. 2 is a schematic structural diagram of an orthogonal feature extraction module provided in the present invention, and as shown in fig. 2, the orthogonal feature extraction module includes the orthogonal self-attention module 20 and a forward propagation network 21, which are cascaded.
Specifically, the feature extraction model includes a plurality of feature extraction modules connected in cascade, where the plurality of feature extraction modules may include one or more orthogonal feature extraction modules, and where the orthogonal feature extraction module may include a cascaded orthogonal self-attention module 20 and a Forward propagation Network 21 (FFN). The orthogonal self-attention module herein may perform orthogonalization, normalization, multi-head attention calculation, and inverse orthogonalization.
Accordingly, step 120 includes:
step 121, inputting a previous token of the image to be processed to a current feature extraction module to obtain a current token output by the current feature extraction module, where the previous token is output by a feature extraction module before the current feature extraction module;
and step 122, taking the token output by the last feature extraction module as the image feature.
Specifically, there is a precedence order among the multiple cascaded feature extraction modules, and a previous token of the image to be processed may be input to the current feature extraction module to obtain a current token output by the current feature extraction module, where the previous token is output by a feature extraction module before the current feature extraction module, that is, the previous token may be output by an orthogonal feature extraction module or may be output by a window feature extraction module, which is not specifically limited in the embodiment of the present invention.
For example, the image coding of the image to be processed may be used as the input of the first feature extraction module; thereafter, the input of each feature extraction module is the output of the previous feature extraction module, and finally the token output by the last feature extraction module, i.e., the final module among the plurality of feature extraction modules, is taken as the image feature. In this process, the image to be processed undergoes feature extraction operations performed successively by the plurality of feature extraction modules.
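The cascade described above can be sketched in a few lines; the toy stand-in modules are hypothetical placeholders, not the patent's real orthogonal or window modules:

```python
def extract_features(image_coding, modules):
    """Chain cascaded feature extraction modules, as described above."""
    token = image_coding          # input of the first feature extraction module
    for module in modules:        # each module consumes the previous output
        token = module(token)
    return token                  # token of the last module = image feature

# Toy stand-in modules (assumptions for illustration only):
modules = [lambda t: t + 1, lambda t: t * 2, lambda t: t - 3]
feature = extract_features(10, modules)
print(feature)  # ((10 + 1) * 2) - 3 = 19
```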
It should be noted that the input of the first feature extraction module, that is, the image coding, may be the image itself to be processed, or may be obtained by coding the image to be processed, for example, the feature extraction module may further include a convolution encoder, an output end of the convolution encoder is connected to an input end of the first feature extraction module, that is, the image to be processed may be coded by the convolution encoder to obtain the image coding.
Based on the foregoing embodiment, fig. 3 is a schematic structural diagram of a convolutional encoder provided by the present invention. As shown in fig. 3, the feature extraction model may further include a convolutional encoder, which may include a convolutional layer 30, a normalization layer 31, and an activation layer 32, cascaded in that order. The convolutional layer 30 may be a depthwise convolutional layer or a standard (full) convolutional layer; the normalization layer 31 may use Layer Normalization (LN), Batch Normalization (BN), or Instance Normalization (IN); and the activation layer 32 may use a Gaussian Error Linear Unit (GELU) activation function, a Sigmoid activation function, or a Rectified Linear Unit (ReLU) activation function.
Setting the convolution kernel stride of the convolutional encoder to 2, for example, reduces the feature resolution: after the image to be processed is input to the convolutional encoder, the image coding features are output at half the original resolution.
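A sketch of this stride-2 encoder follows; it relies on the fact that a kernel-2, stride-2 convolution is equivalent to extracting non-overlapping 2x2 patches and applying a linear projection. The weight shapes and the LN/GELU choices are illustrative assumptions:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conv_encoder(img, weight, bias):
    # img: (H, W, C); weight: (4*C, D). A kernel-2, stride-2 convolution,
    # expressed as non-overlapping 2x2 patch extraction + linear projection,
    # followed by normalization and activation (Conv -> Norm -> Act).
    H, W, C = img.shape
    patches = img.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(H // 2, W // 2, 4 * C)
    return gelu(layer_norm(patches @ weight + bias))

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8, 3))
W_, b_ = rng.normal(size=(12, 16)), np.zeros(16)
feat = conv_encoder(img, W_, b_)
print(feat.shape)  # (4, 4, 16): spatial resolution halved in each dimension
```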
Based on the foregoing embodiment, fig. 4 is a schematic flow chart of the self-attention conversion performed by the orthogonal self-attention module provided by the present invention, and as shown in fig. 4, in a case that the current feature extraction module is an orthogonal feature extraction module, the step 121 includes:
step 1211, inputting the previous token to the orthogonal self-attention module of the current feature extraction module, performing orthogonalization on the previous token by the orthogonal self-attention module to obtain an orthogonal token, performing multi-head attention calculation on the orthogonal token, performing inverse orthogonalization on the calculated orthogonal attention feature, and fusing the inverse orthogonalized attention feature and the previous token to obtain the current self-attention feature output by the orthogonal self-attention module.
Specifically, when the current feature extraction module is an orthogonal feature extraction module, the previous token is input to the orthogonal self-attention module of the current feature extraction module, and the previous token is orthogonalized by the orthogonal self-attention module to obtain an orthogonal token, where the orthogonal token is the previous token projected from the original feature space to the orthogonal space. Subsequently, the orthogonal self-attention module may normalize the orthogonal token, perform multi-head attention calculation on the normalized orthogonal token, and perform inverse orthogonalization on the orthogonal attention feature obtained by the calculation, thereby obtaining an orthogonal attention feature of the original feature space, that is, obtaining the attention feature after the inverse orthogonalization.
Finally, the orthogonal self-attention module can fuse the inverse-orthogonalized attention feature with the previous token, i.e., the input and output of the orthogonal feature extraction module are connected in a residual manner, to obtain the current self-attention feature X_a output by the orthogonal self-attention module. The specific calculation formula is as follows:
X_a = X + P^(-1)(MSA(Norm(P(X))))
where X is the input previous token, P and P^(-1) denote orthogonalization and inverse orthogonalization respectively (for an orthogonal matrix, the inverse matrix is its transpose), Norm denotes normalization, and MSA denotes the multi-head attention calculation.
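The orthogonalize, normalize, multi-head attention, inverse-orthogonalize, residual sequence described above can be sketched in NumPy as follows. The projection matrix construction, head count, and identity Q/K/V projections are simplifying assumptions, not the patent's exact parameterization:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, heads=2):
    # Simplified MSA: identity Q/K/V projections, channels split across heads.
    N, d = x.shape
    dh = d // heads
    out = np.empty_like(x)
    for h in range(heads):
        z = x[:, h * dh:(h + 1) * dh]
        out[:, h * dh:(h + 1) * dh] = softmax(z @ z.T / np.sqrt(dh)) @ z
    return out

def orthogonal_self_attention(x, P):
    # x: (N, d) previous token; P: (M, N) with orthonormal rows (P @ P.T = I),
    # so the transpose P.T serves as the inverse orthogonalization.
    z = P @ x                                     # orthogonalization
    att = multi_head_attention(layer_norm(z))     # normalize + MSA in orthogonal space
    return x + P.T @ att                          # inverse orthogonalization + residual

rng = np.random.default_rng(0)
N, M, d = 64, 16, 8
x = rng.normal(size=(N, d))
P = np.linalg.qr(rng.normal(size=(N, M)))[0].T
y = orthogonal_self_attention(x, P)
print(y.shape)  # (64, 8): same shape as the input token, per the residual form
```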
Step 1212, inputting the current self-attention feature into a forward propagation network of the current feature extraction module, to obtain a current token output by the forward propagation network.
Specifically, after obtaining the current self-attention feature orthogonally output from the attention module, the current self-attention feature may be input to the forward propagation network of the current feature extraction module, resulting in a current token output by the forward propagation network.
Based on the foregoing embodiment, fig. 5 is a schematic structural diagram of the window feature extraction module provided in the present invention, and as shown in fig. 5, the feature extraction modules further include a window feature extraction module, the window feature extraction module includes a cascaded window self-attention module 50 and a forward propagation network 21, and the window self-attention module 50 is configured to divide a token of the image to be processed in a sliding window manner and then perform self-attention conversion.
Specifically, in addition to the convolutional encoder and the orthogonal feature extraction module, the plurality of feature extraction modules include a window feature extraction module, which includes a cascaded window self-attention module 50 and forward propagation network 21. The window self-attention module 50 is used to divide the tokens of the image to be processed in a sliding-window manner and then perform self-attention conversion. The sliding-window processing here comprises window partition and window merging, which correspond respectively to the orthogonalization and inverse orthogonalization of the orthogonal feature extraction module, and are used for performing self-attention conversion on the tokens of the image to be processed.
Based on the foregoing embodiment, fig. 6 is a schematic flowchart of a process of performing window self-attention conversion by a window feature extraction module provided by the present invention, and as shown in fig. 6, in a case that the current feature extraction module is the window feature extraction module, the inputting a previous token of the image to be processed to the current feature extraction module to obtain a current token output by the current feature extraction module includes:
step 310, inputting the previous token into a window self-attention module of a current feature extraction module, performing window division on the previous token by the window self-attention module to obtain a window token, performing multi-head attention calculation on the window token, performing window merging on the calculated window attention feature, and fusing the window merged attention feature and the previous token to obtain a current self-attention feature output by the window self-attention module;
and step 320, inputting the current self-attention feature into a forward propagation network of a current feature extraction module to obtain a current token output by the forward propagation network.
Specifically, when the current feature extraction module is a window feature extraction module, the previous token is input to the window self-attention module of the current feature extraction module, and the window self-attention module performs window partition on the previous token to obtain window tokens, where window partition refers to dividing the previous token into multiple windows of the same size. The window tokens are then normalized, multi-head attention calculation is performed on the normalized window tokens, and the calculated window attention features are merged, where window merging refers to combining the calculated window attention features according to the previous window-partition rule, thereby restoring the original token layout.
Finally, the window-merged attention feature is fused with the previous token, i.e., the input and output of the window feature extraction module are connected in a residual manner, to obtain the current self-attention feature X_a output by the window self-attention module. The specific calculation formula is as follows:
X_a = X + Merge(MSA(Norm(Partition(X))))
where X is the input previous token, Partition and Merge denote window partition and window merging respectively, Norm denotes normalization, and MSA denotes the multi-head attention calculation.
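The window partition and window merging steps described above can be sketched as follows. The grid size, window size, and the use of NumPy reshapes are illustrative assumptions; the check at the end confirms that merging exactly inverts partitioning, mirroring how inverse orthogonalization inverts orthogonalization:

```python
import numpy as np

def window_partition(x, H, W, w):
    # x: (H*W, d) tokens on an H x W grid -> (num_windows, w*w, d)
    d = x.shape[-1]
    g = x.reshape(H // w, w, W // w, w, d).transpose(0, 2, 1, 3, 4)
    return g.reshape(-1, w * w, d)

def window_merge(wins, H, W, w):
    # Inverse of window_partition: restore the (H*W, d) token layout.
    d = wins.shape[-1]
    g = wins.reshape(H // w, W // w, w, w, d).transpose(0, 2, 1, 3, 4)
    return g.reshape(H * W, d)

rng = np.random.default_rng(0)
H = W = 8      # 8x8 token grid (assumption)
w, d = 4, 16   # 4x4 windows, 16-dim tokens (assumption)
x = rng.normal(size=(H * W, d))

wins = window_partition(x, H, W, w)
print(wins.shape)  # (4, 16, 16): 4 windows, each holding 4*4 tokens
# Multi-head attention would run independently inside each window here,
# so the attention cost is w**2 per window instead of (H*W)**2 globally.
```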
Further, after obtaining the current self-attention feature output by the window self-attention module, the current self-attention feature may be input to the forward propagation network of the current window feature extraction module, so as to obtain the current token output by the forward propagation network.
Based on the above embodiment, the forward propagation network comprises a first convolutional layer for extracting the position information.
Specifically, the orthogonal feature extraction module and the window feature extraction module both include forward propagation networks of the same structure, each containing the first convolutional layer. The first convolutional layer can extract the position information carried in the current self-attention feature within the forward propagation network, so that the current token output by the forward propagation network also carries position information.
The position information referred to here may include information of a position of the image block in the image, or information of a position of the token in the feature, and the application of the first convolution layer in the forward propagation network realizes modeling embedding for the position information, and improves flexible processing capability of the feature extraction module for the to-be-processed image with any resolution.
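A small demonstration of why a convolution can supply position information: on a token map whose tokens are all identical, attention alone cannot distinguish positions, but a zero-padded convolution produces different values for border and interior tokens. The 3x3 averaging kernel below is an illustrative stand-in, not the patent's actual first convolution layer:

```python
import numpy as np

# Uniform token map: every token identical, so content alone carries no position.
x = np.ones((6, 6))

# 3x3 averaging convolution with zero padding (stand-in for the first conv layer).
k = np.ones((3, 3)) / 9.0
pad = np.pad(x, 1)
out = np.empty_like(x)
for i in range(6):
    for j in range(6):
        out[i, j] = (pad[i:i+3, j:j+3] * k).sum()

print(out[0, 0], out[3, 3])  # border token differs from interior token
```

The border values are smaller because the zero padding leaks into their receptive fields, so each token's output now depends on where it sits — an implicit positional encoding that works at any input resolution.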
Based on the above embodiment, fig. 7 is a schematic structural diagram of the forward propagation network when the convolution kernel step size of the first convolution layer provided by the present invention is greater than 1, and fig. 8 is a schematic structural diagram of the forward propagation network when the convolution kernel step size of the first convolution layer is 1. As shown in fig. 7 and fig. 8, the forward propagation network includes a first branch and a second branch; the inputs of the first branch and the second branch are the same, and their outputs are added to form the output of the forward propagation network;
the first branch comprises a first normalization layer, a first full-connection layer, an activation layer, the first convolution layer and a second full-connection layer which are connected in sequence;
in the case that the convolution kernel step size of the first convolution layer is 1, the input and the output of the second branch are the same;
and under the condition that the convolution kernel step size of the first convolution layer is larger than 1, the second branch comprises a second normalization layer and a second convolution layer which are sequentially connected, and the convolution kernel step size of the second convolution layer is the same as that of the first convolution layer.
Specifically, the forward propagation network in both the orthogonal feature extraction module and the window feature extraction module includes a first branch and a second branch, where the first branch includes a first normalization layer, a first Fully Connected (FC) layer, an activation layer, the first convolution layer, and a second fully connected layer, connected in sequence. The first normalization layer here may be Layer Normalization (LN), Batch Normalization (BN), or Instance Normalization (IN), and the activation layer may use a GELU, Sigmoid, or ReLU activation function, which is not specifically limited in this embodiment of the present invention.
The inputs of the first branch and the second branch are the same, and their outputs are added to form the output of the forward propagation network; that is, the two branches are combined through a residual connection.

The second branch has two cases. When the convolution kernel step size of the first convolution layer is 1, the input and output of the second branch are the same, i.e., the second branch is an identity mapping;

when the convolution kernel step size of the first convolution layer is greater than 1, the second branch includes a second normalization layer and a second convolution layer connected in sequence, and the convolution kernel step size of the second convolution layer is the same as that of the first convolution layer.

It can be understood that in both forward propagation networks shown in fig. 7 and fig. 8, the first convolution layer realizes the implicit embedding of position information; the difference between them is whether down-sampling is performed, that is, the forward propagation network shown in fig. 7 has the down-sampling function, while the one shown in fig. 8 does not.
Here, the forward propagation network without the down-sampling function shown in fig. 8 produces an output $\hat{Y}$ computed by the following formula:

$$\hat{Y} = Y + \mathrm{FC}_2\big(\mathrm{DWConv}\big(\sigma\big(\mathrm{FC}_1\big(\mathrm{LN}(Y)\big)\big)\big)\big)$$

where $Y$ is the input current self-attention feature, $\mathrm{LN}$ denotes the normalization layer, $\sigma$ denotes the activation layer, $\mathrm{FC}_1$ and $\mathrm{FC}_2$ denote the fully connected layers, and $\mathrm{DWConv}$ denotes the depthwise separable convolution layer. The first fully connected layer $\mathrm{FC}_1$ expands the channel number of the input current self-attention feature, and the second fully connected layer $\mathrm{FC}_2$ reduces the expanded channel number back to the channel number before expansion.
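Assuming LayerNorm, GELU, an expansion ratio of 4, and a stride-1 depthwise 3x3 convolution (the patent fixes the layer order but not these hyper-parameters, so they are assumptions here), the first branch plus the identity second branch of the forward propagation network can be sketched as:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def depthwise_conv3x3(x, kernels):
    """Per-channel 3x3 convolution with zero padding on an (H, W, C) map."""
    H, W, C = x.shape
    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.empty_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[i, j, c] = (pad[i:i+3, j:j+3, c] * kernels[c]).sum()
    return out

def ffn(y, expand=4, rng=None):
    """First branch LN -> FC1 (expand) -> GELU -> DWConv -> FC2 (reduce),
    added to the identity second branch (stride-1 case): Y_hat = Y + branch(Y)."""
    H, W, C = y.shape
    rng = rng or np.random.default_rng(0)
    fc1 = rng.standard_normal((C, expand * C)) / np.sqrt(C)
    fc2 = rng.standard_normal((expand * C, C)) / np.sqrt(expand * C)
    k = rng.standard_normal((expand * C, 3, 3)) / 9.0
    t = (y - y.mean(-1, keepdims=True)) / np.sqrt(y.var(-1, keepdims=True) + 1e-5)
    t = gelu(t @ fc1)                 # FC1 expands the channel number
    t = depthwise_conv3x3(t, k)       # position-aware mixing
    return y + t @ fc2                # FC2 restores the channel number; residual add

y = np.random.default_rng(1).standard_normal((4, 4, 8))
print(ffn(y).shape)  # (4, 4, 8)
```

In the stride-greater-than-1 case the second branch would instead apply its own normalization and strided convolution so that both branches shrink the spatial size identically before the addition.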
In one embodiment, fig. 9 is a schematic structural diagram of the feature extraction model provided by the present invention. As shown in fig. 9, an image to be processed of size $H \times W \times 3$ is first input to the convolutional encoder to obtain the image encoding output by the encoder, and the image features are then obtained through four stages of feature extraction. Here, H and W represent the height and width of the image to be processed, 3 is the number of channels of the image to be processed, and the feature extraction at each stage is implemented by a plurality of cascaded feature extraction modules.

The first stage comprises $N_1$ feature extraction modules, of which $N_1 - 1$ do not have the down-sampling function and one has the down-sampling function. For example, in the first stage, the current tokens output by the $N_1 - 1$ feature extraction modules without the down-sampling function all have the same size, with $C_1$ channels; the current token output by the feature extraction module with the down-sampling function has a reduced spatial size and $C_2$ channels, where $C_1$ and $C_2$ are channel numbers.

The second stage comprises $N_2$ feature extraction modules; correspondingly, the token output by the feature extraction module with the down-sampling function in the second stage has a further reduced spatial size and $C_3$ channels, where $C_3$ is the channel number.

The third stage comprises $N_3$ feature extraction modules, of which $N_3 - 1$ do not have the down-sampling function and one has the down-sampling function; the current token output by the feature extraction module with the down-sampling function in the third stage has a further reduced spatial size and $C_4$ channels, where $C_4$ is the channel number.

The fourth stage comprises $N_4$ feature extraction modules, none of which has the down-sampling function.
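The four-stage pyramid above can be traced with a short shape calculation. The concrete numbers are assumptions for illustration only: a 224x224 input, a convolutional encoder that reduces resolution by 4, stride-2 down-sampling at the end of stages 1-3 (the patent only requires a stride greater than 1), and the channel widths listed below:

```python
# Hypothetical shape trace through the four-stage pyramid; the 4x encoder
# reduction, the stride-2 down-sampling, and the channel widths are assumptions.
H, W = 224, 224
channels = [64, 128, 256, 512]          # assumed C1..C4

h, w = H // 4, W // 4                   # convolutional encoder output
shapes = []
for stage, c in enumerate(channels):
    shapes.append((h, w, c))            # token size inside this stage
    if stage < 3:                       # stages 1-3 end with a down-sampling module
        h, w = h // 2, w // 2
print(shapes)  # [(56, 56, 64), (28, 28, 128), (14, 14, 256), (7, 7, 512)]
```

Each down-sampling step trades spatial resolution for channel width, which is what lets the later stages model increasingly global structure at manageable cost.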
It should be noted that the feature extraction modules at each stage may be further divided into two types, namely orthogonal feature extraction modules and window feature extraction modules, and both types may be constructed based on a Transformer module.
The following describes the image processing apparatus provided by the present invention, and the image processing apparatus described below and the image processing method described above may be referred to in correspondence with each other.
Based on any of the above embodiments, fig. 10 is a schematic structural diagram of an image processing apparatus provided by the present invention, as shown in fig. 10, the apparatus includes:
an acquisition unit 1010 for acquiring an image to be processed;
a feature extraction unit 1020, configured to input the image to be processed into a feature extraction model, so as to obtain an image feature output by the feature extraction model;
an image processing unit 1030, configured to perform image processing on the image to be processed based on the image feature;
the feature extraction model includes an orthogonal self-attention module for projecting tokens of the image to be processed into an orthogonal space for self-attention transformation.
In the apparatus provided by the embodiment of the present invention, the orthogonal self-attention module in the feature extraction model projects the token of the image to be processed into an orthogonal space for self-attention transformation, which reduces the complexity of the self-attention transformation, improves the quality of the extracted image features, and thereby ensures the effectiveness of image processing.
Based on any of the above embodiments, the feature extraction model includes a plurality of feature extraction modules in cascade, the plurality of feature extraction modules includes an orthogonal feature extraction module, and the orthogonal feature extraction module includes the orthogonal self-attention module and a forward propagation network in cascade;
the feature extraction unit specifically includes:
inputting a previous token of the image to be processed to a current feature extraction module to obtain a current token output by the current feature extraction module, wherein the previous token is output by a feature extraction module before the current feature extraction module;
and taking the token output by the last feature extraction module as the image feature.
Based on any of the above embodiments, in a case that the current feature extraction module is an orthogonal feature extraction module, the inputting a previous token of the image to be processed to the current feature extraction module to obtain a current token output by the current feature extraction module includes:
the current self-attention feature unit is used for inputting the previous token into an orthogonal self-attention module of a current feature extraction module, orthogonalizing the previous token by the orthogonal self-attention module to obtain an orthogonal token, then performing multi-head attention calculation on the orthogonal token, inversely orthogonalizing the calculated orthogonal attention feature, and fusing the inversely orthogonalized attention feature and the previous token to obtain a current self-attention feature output by the orthogonal self-attention module;
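The patent does not disclose the concrete orthogonalization here. As one hedged reading of the orthogonalize / multi-head attention / inverse-orthogonalize / fuse sequence, projecting the token sequence with an orthogonal matrix $Q$ (so $Q^{\top}Q = I$ and the inverse orthogonalization is simply $Q^{\top}$), attending in that space, and projecting back can be sketched as follows; the QR-derived $Q$, the random weights, and the single attention head (shown for brevity in place of multi-head) are all assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def orthogonal_self_attention(x, rng):
    """Sketch: X_hat = X + Q^T . Attn(LN(Q . X)) with Q orthogonal (Q^T Q = I)."""
    N, C = x.shape
    Q, _ = np.linalg.qr(rng.standard_normal((N, N)))   # assumed orthogonal basis
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    t = Q @ x                                          # orthogonalization
    t = (t - t.mean(-1, keepdims=True)) / np.sqrt(t.var(-1, keepdims=True) + 1e-5)
    attn = softmax((t @ Wq) @ (t @ Wk).T / np.sqrt(C)) # attention in orthogonal space
    out = attn @ (t @ Wv)
    return x + Q.T @ out                               # inverse orthogonalization + fuse

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
print(orthogonal_self_attention(x, rng).shape)  # (16, 8)
```

Because $Q$ is orthogonal, the inverse projection is a transpose rather than a matrix inversion, and projecting onto fewer orthogonal directions than tokens would be one way to realize the complexity reduction the patent claims.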
and the current token unit is used for inputting the current self-attention characteristics into a forward propagation network of the current characteristic extraction module to obtain a current token output by the forward propagation network.
Based on any of the above embodiments, the feature extraction modules further include a window feature extraction module, the window feature extraction module includes a cascaded window self-attention module and a forward propagation network, and the window self-attention module is configured to divide the token of the image to be processed in the form of a sliding window and then perform self-attention conversion.
Based on any of the above embodiments, in a case that the current feature extraction module is a window feature extraction module, the inputting a previous token of the image to be processed to the current feature extraction module to obtain a current token output by the current feature extraction module includes:
the current self-attention feature subunit is configured to input the previous token to a window self-attention module of the current feature extraction module, perform window division on the previous token by the window self-attention module to obtain a window token, perform multi-head attention calculation on the window token, perform window merging on the calculated window attention features, and fuse the window-merged attention features with the previous token to obtain current self-attention features output by the window self-attention module;
and the current token subunit is used for inputting the current self-attention feature into a forward propagation network of the current feature extraction module to obtain a current token output by the forward propagation network.
In accordance with any of the above embodiments, the forward propagation network comprises a first convolutional layer for extracting the position information.
In any of the above embodiments, the forward propagation network comprises a first branch and a second branch, the inputs of the first branch and the second branch are the same, and the outputs of the first branch and the second branch are added as the output of the forward propagation network;
the first branch comprises a first normalization layer, a first full-connection layer, an activation layer, the first convolution layer and a second full-connection layer which are connected in sequence;
when the convolution kernel step size of the first convolution layer is 1, the input and the output of the second branch are the same;
and under the condition that the convolution kernel step size of the first convolution layer is larger than 1, the second branch comprises a second normalization layer and a second convolution layer which are sequentially connected, and the convolution kernel step size of the second convolution layer is the same as that of the first convolution layer.
Fig. 11 illustrates a physical structure diagram of an electronic device, and as shown in fig. 11, the electronic device may include: a processor (processor) 1110, a communication Interface (Communications Interface) 1120, a memory (memory) 1130, and a communication bus 1140, wherein the processor 1110, the communication Interface 1120, and the memory 1130 communicate with each other via the communication bus 1140. The processor 1110 may invoke logic instructions in the memory 1130 to perform an image processing method comprising: acquiring an image to be processed; inputting the image to be processed into a feature extraction model to obtain image features output by the feature extraction model; based on the image characteristics, carrying out image processing on the image to be processed; the feature extraction model includes an orthogonal self-attention module for projecting tokens of the image to be processed into an orthogonal space for self-attention transformation.
In addition, the logic instructions in the memory 1130 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program, when executed by a processor, being capable of executing an image processing method provided by the above methods, the method comprising: acquiring an image to be processed; inputting the image to be processed into a feature extraction model to obtain image features output by the feature extraction model; based on the image characteristics, carrying out image processing on the image to be processed; the feature extraction model comprises an orthogonal self-attention module for projecting tokens of the image to be processed into an orthogonal space for self-attention transformation.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an image processing method provided by the above methods, the method including: acquiring an image to be processed; inputting the image to be processed into a feature extraction model to obtain image features output by the feature extraction model; based on the image characteristics, carrying out image processing on the image to be processed; the feature extraction model comprises an orthogonal self-attention module for projecting tokens of the image to be processed into an orthogonal space for self-attention transformation.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or in the parts contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the method according to the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An image processing method, comprising:
acquiring an image to be processed;
inputting the image to be processed into a feature extraction model to obtain image features output by the feature extraction model;
based on the image characteristics, carrying out image processing on the image to be processed;
the feature extraction model comprises an orthogonal self-attention module, wherein the orthogonal self-attention module is used for projecting a token of the image to be processed to an orthogonal space for self-attention conversion, the token is obtained by dividing the image to be processed into different image blocks in a sliding window mode and coding each image block, the orthogonal space is a space with the metric concept of orthogonality, and the orthogonal self-attention module sequentially performs orthogonalization, multi-head attention calculation and inverse orthogonalization.
2. The image processing method of claim 1, wherein the feature extraction model comprises a cascade of a plurality of feature extraction modules, including an orthogonal feature extraction module comprising the cascade of the orthogonal self-attention module and a forward propagation network;
the inputting the image to be processed into a feature extraction model to obtain the image features output by the feature extraction model comprises the following steps:
inputting a previous token of the image to be processed to a current feature extraction module to obtain a current token output by the current feature extraction module, wherein the previous token is output by a feature extraction module before the current feature extraction module;
and taking the token output by the last feature extraction module as the image feature.
3. The image processing method according to claim 2, wherein in a case that the current feature extraction module is an orthogonal feature extraction module, the inputting a previous token of the image to be processed to the current feature extraction module to obtain a current token output by the current feature extraction module includes:
inputting the previous token into an orthogonal self-attention module of a current feature extraction module, orthogonalizing the previous token by the orthogonal self-attention module to obtain an orthogonal token, then performing multi-head attention calculation on the orthogonal token, performing inverse orthogonalization on the orthogonal attention feature obtained by calculation, and fusing the attention feature subjected to inverse orthogonalization and the previous token to obtain a current self-attention feature output by the orthogonal self-attention module;
and inputting the current self-attention feature into a forward propagation network of a current feature extraction module to obtain a current token output by the forward propagation network.
4. The image processing method according to claim 2, wherein the plurality of feature extraction modules further comprise a window feature extraction module, and the window feature extraction module comprises a cascaded window self-attention module and a forward propagation network, and the window self-attention module is configured to divide the token of the image to be processed in the form of a sliding window and then perform self-attention conversion.
5. The image processing method according to claim 4, wherein in a case that the current feature extraction module is a window feature extraction module, the inputting a last token of the image to be processed to the current feature extraction module to obtain a current token output by the current feature extraction module includes:
inputting the previous token into a window self-attention module of a current feature extraction module, performing window division on the previous token by the window self-attention module to obtain a window token, performing multi-head attention calculation on the window token, performing window combination on the calculated window attention feature, and fusing the attention feature after window combination with the previous token to obtain a current self-attention feature output by the window self-attention module;
and inputting the current self-attention feature into a forward propagation network of a current feature extraction module to obtain a current token output by the forward propagation network.
6. An image processing method according to any of claims 2 to 5, wherein the forward propagation network comprises a first convolutional layer for extracting position information.
7. The image processing method according to claim 6, wherein the forward propagation network comprises a first branch and a second branch, the inputs of the first branch and the second branch being the same, the outputs of the first branch and the second branch being added as the output of the forward propagation network;
the first branch comprises a first normalization layer, a first full-connection layer, an activation layer, the first convolution layer and a second full-connection layer which are connected in sequence;
in the case that the convolution kernel step size of the first convolution layer is 1, the input and the output of the second branch are the same;
and under the condition that the convolution kernel step length of the first convolution layer is larger than 1, the second branch comprises a second normalization layer and a second convolution layer which are connected in sequence, and the convolution kernel step length of the second convolution layer is the same as that of the first convolution layer.
8. An image processing apparatus characterized by comprising:
the acquisition unit is used for acquiring an image to be processed;
the characteristic extraction unit is used for inputting the image to be processed into a characteristic extraction model to obtain the image characteristics output by the characteristic extraction model;
the image processing unit is used for carrying out image processing on the image to be processed based on the image characteristics;
the feature extraction model comprises an orthogonal self-attention module, wherein the orthogonal self-attention module is used for projecting a token of the image to be processed to an orthogonal space for self-attention conversion, the token is obtained by dividing the image to be processed into different image blocks in a sliding window mode and coding each image block, the orthogonal space is a space with the metric concept of orthogonality, and the orthogonal self-attention module sequentially performs orthogonalization, multi-head attention calculation and inverse orthogonalization.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the image processing method according to any of claims 1 to 7 when executing the program.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the image processing method according to any one of claims 1 to 7.
CN202210895580.XA 2022-07-28 2022-07-28 Image processing method, image processing device, electronic equipment and storage medium Active CN115062673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210895580.XA CN115062673B (en) 2022-07-28 2022-07-28 Image processing method, image processing device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115062673A CN115062673A (en) 2022-09-16
CN115062673B true CN115062673B (en) 2022-10-28

Family

ID=83206109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210895580.XA Active CN115062673B (en) 2022-07-28 2022-07-28 Image processing method, image processing device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115062673B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB202002884D0 (en) * 2020-02-28 2020-04-15 Disperse Io Ltd Aligning images
CN112613383A (en) * 2020-12-17 2021-04-06 北京迈格威科技有限公司 Joint point detection method, posture recognition method and device
CN113256494A (en) * 2021-06-02 2021-08-13 同济大学 Text image super-resolution method

Also Published As

Publication number Publication date
CN115062673A (en) 2022-09-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant