CN117635962A - Multi-frequency fusion-based channel attention image processing method - Google Patents
- Publication number
- CN117635962A (application CN202410103196.0A)
- Authority
- CN
- China
- Prior art keywords
- vector
- image
- attention
- channel attention
- discrete cosine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of image processing and discloses a channel attention image processing method based on multi-frequency fusion. Multi-scale features of the image are extracted through a backbone network, the feature maps are processed by discrete cosine transforms of different frequency components to obtain a final channel attention vector, and this vector is multiplied with the original image feature map to yield a channel-attention-weighted feature map. Specifically, the feature maps are processed by discrete cosine transforms of 16 different frequency components; a percentage weighting method is applied to the resulting 16 one-dimensional vectors to obtain an initialized attention vector; a mapping function and a sigmoid function process the initialized attention vector into the final attention vector; and the attention vector is multiplied with the original image feature map to obtain the channel-attention-weighted feature map, effectively improving the matching capability and matching accuracy of images.
Description
Technical Field
The invention relates to the technical field of image processing and analysis, in particular to a channel attention image processing method based on multi-frequency fusion.
Background
Many basic computer vision tasks, such as Structure from Motion (SfM), Simultaneous Localization and Mapping (SLAM), relative pose estimation, and visual localization, center on matching two or more views of the same scene. Image matching is likewise an important image analysis and processing task: it aims to identify and align, at the pixel level, content or structures with the same or similar attributes in two images. Existing image matching methods can be divided into three types: Detect-then-Describe, Describe-to-Detect, and Detector-Free frameworks. A Detect-then-Describe framework first detects image keypoints, generates patches around the keypoints, and feeds the patches into a feature extraction network to extract features. A Describe-to-Detect framework first extracts dense descriptors using CNNs and then decides, based on a descriptor's distinctiveness, whether it corresponds to a keypoint. A Detector-Free framework needs no keypoint detection: a CNN or Transformer extracts feature maps at different levels, coarse-level matching is performed on the coarse-granularity feature maps, and fine matching is then carried out on the fine-granularity feature maps on the basis of the coarse matches.
However, current image processing methods rarely integrate channel attention; in particular, detector-free image matching methods do not introduce the concept of channel attention. For the feature map generated from a picture, the weights of different channels generally differ, which strongly influences the results of downstream image processing tasks. How to integrate channel attention into the solution of the image matching problem, and further into the image processing field, is a technical problem to be solved urgently.
Disclosure of Invention
(one) solving the technical problems
Aiming at the defects of the prior art, the invention provides a channel attention image processing method based on multi-frequency fusion, which has the advantage of merging channel attention into the image feature process, and solves the above technical problems.
(II) technical scheme
In order to achieve the above purpose, the present invention provides the following technical solutions: a channel attention image processing method based on multi-frequency fusion comprises the following steps:
s1, extracting multi-scale features of an image through a backbone network;
s2, processing the feature images by discrete cosine transformation of different frequency components respectively;
s3, calculating weights by using an algorithm formula, and weighting the vectors obtained by processing to obtain initialized attention vectors;
s4, processing the initialized attention vector to obtain a final channel attention vector;
s5, multiplying the attention vector with the original image feature map to obtain a feature map weighted by the attention of the channel.
As a preferred embodiment of the present invention, the backbone network that extracts features in step S1 adopts a ResNet34 residual network as its basic structure, and the multi-scale features of the image are extracted through a feature pyramid structure.
As a preferred embodiment of the present invention, the discrete cosine transform expression in step S2 is as follows:

$$f^{2d}_{h,w} = \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} x^{2d}_{i,j} \, B^{i,j}_{h,w}, \quad h \in \{0,1,\dots,H-1\}, \; w \in \{0,1,\dots,W-1\}$$

wherein $H$ is the height of the input image, $W$ is the width of the input image, the superscript $2d$ denotes a two-dimensional image, $f^{2d}$ is the spectrum of the two-dimensional discrete cosine transform, $x^{2d}$ denotes the input image, $B^{i,j}_{h,w}$ is the basis function of row $i$ and column $j$ of the two-dimensional discrete cosine transform, $i$ and $j$ index the height and width dimensions of the input two-dimensional image respectively, and $\sum$ denotes the summation over the variables $i$ and $j$.
As a preferred embodiment of the present invention, the specific expression of the basis function $B^{i,j}_{h,w}$ of the two-dimensional discrete cosine transform is as follows:

$$B^{i,j}_{h,w} = \cos\!\left(\frac{\pi h}{H}\left(i+\frac{1}{2}\right)\right)\cos\!\left(\frac{\pi w}{W}\left(j+\frac{1}{2}\right)\right)$$

wherein $\pi$ denotes the radian angle and $\cos$ is the cosine function.
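As a rough illustration (not part of the patent disclosure), the basis function above can be evaluated on a grid in NumPy. The function name `dct_basis` and the 8×8 grid size are illustrative assumptions:

```python
import numpy as np

def dct_basis(h, w, H, W):
    """Evaluate the 2-D DCT basis B^{i,j}_{h,w} on an H x W grid.

    Returns an (H, W) array whose (i, j) entry is
    cos(pi*h/H * (i + 0.5)) * cos(pi*w/W * (j + 0.5)).
    """
    i = np.arange(H).reshape(-1, 1)   # row (height) index
    j = np.arange(W).reshape(1, -1)   # column (width) index
    return np.cos(np.pi * h * (i + 0.5) / H) * np.cos(np.pi * w * (j + 0.5) / W)

# The (0, 0) component is the all-ones (lowest-frequency) basis, so its
# spectrum value reduces to a plain global sum over the input.
B00 = dct_basis(0, 0, 8, 8)
```

This also shows why the lowest frequency component corresponds to ordinary global pooling: the $(h,w)=(0,0)$ basis is constant.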
As a preferred solution of the present invention, the multiple frequency components in the processing of step S2 compress information and include the lowest frequency component; different frequency components are applied to the image feature map respectively, and each processing result can serve as an effect result of channel attention. The generated vector expression is as follows:

$$v^{k} = \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} x_{i,j} \, B^{i,j}_{u_k,v_k}, \quad k \in \{1,2,\dots,16\}$$

wherein $(u_k, v_k)$ denotes the $k$-th frequency component of the discrete cosine transform, $v^{k}$ denotes the vector processed by the $k$-th frequency component, $k$ is the sequence number of the processed vector (i.e., 16 frequency components are applied to the image feature map respectively), $B^{i,j}_{u_k,v_k}$ is the basis function of the two-dimensional discrete cosine transform evaluated at the frequency component $(u_k, v_k)$, and $\sum$ denotes the summation over the variables $i$ and $j$.
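A minimal sketch of applying several DCT frequency components to a $(C, H, W)$ feature map, following the formulation above. The name `multi_frequency_vectors`, the random test map, and the 4×4 grid of $(u, v)$ pairs are illustrative assumptions — the patent selects its 16 components empirically, not from a fixed grid:

```python
import numpy as np

def multi_frequency_vectors(x, freq_pairs):
    """Apply one DCT frequency component per (u, v) pair to a (C, H, W)
    feature map, producing one C-dimensional vector v^k per component."""
    C, H, W = x.shape
    i = np.arange(H).reshape(-1, 1)
    j = np.arange(W).reshape(1, -1)
    vecs = []
    for (u, v) in freq_pairs:
        B = np.cos(np.pi * u * (i + 0.5) / H) * np.cos(np.pi * v * (j + 0.5) / W)
        vecs.append((x * B).sum(axis=(1, 2)))   # v^k, shape (C,)
    return np.stack(vecs)                        # (len(freq_pairs), C)

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8, 8))
pairs = [(u, v) for u in range(4) for v in range(4)]   # 16 components (assumed grid)
V = multi_frequency_vectors(x, pairs)
```

Because the first pair is $(0,0)$, the first row of `V` equals the global sum of each channel, matching the lowest-frequency case.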
As a preferred technical solution of the present invention, performing the percentage weighting method on the vectors in step S3 comprises the following steps:
S3.1, setting the channel number of the feature map as $C$;
S3.2, for each processed vector $v^{k}$, calculating its percentage weights;
S3.3, weighting each dimension of the vectors to obtain the value $a_{c}$ of each dimension of the $C$-dimensional initialized attention vector.
As a preferred technical solution of the present invention, the percentage weight calculation formula in step S3.2 is as follows:

$$w^{k}_{c} = \frac{v^{k}_{c}}{\sum_{k'=1}^{16} v^{k'}_{c}}, \quad k \in \{1,\dots,16\}, \; c \in \{1,\dots,C\}$$

wherein $k$ denotes the sequence number of the vector processed by the $k$-th frequency component of the discrete cosine transform, $c$ denotes the channel sequence number of the vector, $w^{k}_{c}$ is the percentage weight of the $c$-th channel of the $k$-th vector, and $v^{k}_{c}$ is the value of the $c$-th channel of the $k$-th vector.
As a preferred embodiment of the present invention, the expression in step S3.3 for the value of each dimension of the $C$-dimensional initialized attention vector is as follows:

$$a_{c} = \sum_{k=1}^{16} w^{k}_{c} \, v^{k}_{c}$$

wherein $v^{k}_{c}$ denotes the $c$-th channel of the vector processed by the $k$-th frequency component of the discrete cosine transform, weighted by its percentage weight $w^{k}_{c}$.
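The percentage weighting of steps S3.1 to S3.3 can be sketched as follows. The name `percentage_weighting` and the toy two-component input are illustrative assumptions; note the division assumes the per-channel sums are non-zero (DCT values can be negative, so this is not guaranteed in general):

```python
import numpy as np

def percentage_weighting(V):
    """Collapse per-frequency vectors V (shape (K, C)) into one
    C-dimensional initialized attention vector.

    For channel c: w^k_c = v^k_c / sum_{k'} v^{k'}_c,
    then a_c = sum_k w^k_c * v^k_c.
    """
    col_sum = V.sum(axis=0, keepdims=True)   # (1, C), assumed non-zero
    W = V / col_sum                          # percentage weights w^k_c
    return (W * V).sum(axis=0)               # (C,)

# Toy case: 2 components, 2 channels.
V = np.array([[2.0, 1.0],
              [2.0, 3.0]])
a = percentage_weighting(V)
```

For channel 0 both components contribute equally ($a_0 = 2$), while for channel 1 the larger component dominates ($a_1 = 0.25 \cdot 1 + 0.75 \cdot 3 = 2.5$), illustrating how the scheme emphasizes whichever frequency response is stronger per channel.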
As a preferred embodiment of the present invention, the processing in step S4 uses a mapping function to process the initialized attention vector, and then maps the weight values into the interval $[0,1]$ through the Sigmoid function to obtain the final channel attention vector.
As a preferred embodiment of the present invention, the calculation formula of the channel-attention-weighted feature map obtained in step S5 is as follows:

$$\tilde{X}_{c} = a'_{c} \, X_{c}, \quad c \in \{1,\dots,C\}$$

wherein $a'$ is the final channel attention vector obtained in step S4, $X$ is the original image feature map, and $\tilde{X}$ is the feature map weighted by channel attention.
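A hedged sketch of step S5 combined with the sigmoid squashing of step S4 (the learned mapping layer is omitted); `apply_channel_attention` is an assumed name:

```python
import numpy as np

def apply_channel_attention(x, a):
    """Weight a (C, H, W) feature map by a C-dimensional attention vector:
    squash with a sigmoid into (0, 1), then rescale each channel."""
    s = 1.0 / (1.0 + np.exp(-a))     # sigmoid, shape (C,)
    return x * s[:, None, None]      # broadcast the scale over H and W

x = np.ones((3, 4, 4))
a = np.array([0.0, 100.0, -100.0])   # neutral / strongly kept / suppressed
y = apply_channel_attention(x, a)
```

Channels with large positive scores pass through almost unchanged, strongly negative ones are suppressed toward zero, and a zero score halves the channel.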
Compared with the prior art, the invention provides a channel attention image processing method based on multi-frequency fusion with the following beneficial effects:
The invention processes the feature map with discrete cosine transforms of 16 different frequency components, applies a percentage weighting method to the resulting 16 one-dimensional vectors to obtain an initialized attention vector, processes the initialized attention vector with a mapping function and a sigmoid function, and finally multiplies the attention vector with the original image feature map to obtain a channel-attention-weighted feature map, thereby effectively improving the matching capability and matching accuracy of images.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a schematic diagram of the image processing of the present invention;
FIG. 3 is a schematic diagram of a channel attention module according to the present invention;
FIG. 4 is a graph showing the comparison of the image matching results according to the present invention;
fig. 5 is a comparison diagram of AUC indexes of pose estimation according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-5, a method for processing a channel attention image based on multi-frequency fusion includes the following steps:
s1, extracting multi-scale characteristics of an image through a backbone network, when the network depth is increased to a certain degree, a deeper network means higher training error, the error is increased because the deeper the network is, the more obvious the gradient disappears, so that the gradient cannot be effectively updated to the front network layer when the gradient is transmitted backward, the parameters of the front network layer cannot be updated, the training and testing effects are poor, the problem faced by ResNet is how to effectively solve the gradient disappearance problem under the condition of increasing the network depth, the ResNet effectively solves the gradient extinction and model degradation problems in the deep neural network by introducing the concepts of residual learning and jump connection, so that the ResNet can construct a very deep network with better performance and learning capacity, the backbone network infrastructure is ResNet34, resNet34 adopts a residual structure design network, the residual network adds an identity mapping, the current output is directly transmitted to the next layer network without adding additional parameters, which is equivalent to taking a shortcut, skipping the operation of the layer, the direct connection is named as skip connection, meanwhile, in the backward propagation process, the gradient of the next layer network is directly transmitted to the last layer network, in the ResNet model, batch Normalization is widely applied to each residual block of the network, batch Normalization aims at carrying out normalization processing on the input in each layer of the network, thereby accelerating the training process and improving the generalization capability of the model, a feature pyramid structure (Feature Pyramid Networks) is used for generating a multi-scale feature map, the forward process of the FPN reduces the resolution through downsampling, then upsampling is carried out, the feature from 
the upstream high resolution is fused in the process, the enhanced feature map is obtained, the method has the advantages that more semantic information of the low-resolution feature map can be presented to high resolution, compared with the feature map of a shallow layer, more features can be used for downstream task processing by using the feature map with larger scale, the method is more accurate, the input of the FPN is the output of each layer of a main network, the features are up-sampled and added with the features of the previous layer to obtain the output of each layer of the FPN structure, the FPN structure and the main network are mutually independent, and three layers of FPN structures are used to obtain the feature map with the original image size of 1/2,1/4 and 1/8 respectively;
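The identity skip connection described above can be illustrated minimally. `residual_block` and the zero mapping are illustrative stand-ins, not ResNet34 itself:

```python
import numpy as np

def residual_block(x, layer):
    """Identity skip connection: output = layer(x) + x.

    Because the input is added back unchanged, gradients reach earlier
    layers through the shortcut even when `layer` contributes little,
    which is what makes very deep stacks trainable.
    """
    return layer(x) + x

x = np.array([1.0, 2.0, 3.0])
# With a zero mapping the block reduces exactly to the identity.
y = residual_block(x, lambda t: np.zeros_like(t))
```

In a real ResNet34, `layer` would be two 3×3 convolutions with Batch Normalization, but the shortcut semantics are the same.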
s2, processing the feature images by using discrete cosine transforms of different frequency components, wherein the discrete cosine transform expression in the step S2 is as follows:
;
wherein,,/>,/>for inputting the height of the image +.>For the width of the input image, superscript +.>Representing a two-dimensional image +.>Is the spectrum of a two-dimensional discrete cosine transform, and +.>,/>Representing an input image, and->,/>Is->Go (go)/(go)>Basis function of two-dimensional discrete cosine transform of column, < >>And->Representing specific values in the long and wide dimensions of the input two-dimensional image, respectively, < >>Representing internal data correspondenceAnd->Basis function of two-dimensional discrete cosine transform for summation of variables +.>The specific expression of (2) is as follows:
;
wherein,indicating radian angle ++>As a cosine function, the multiple frequency components compressed information in the processing of step S2, including the lowest frequency component, will be applied to the image feature map respectively, and the result of the processing can be used as the result of the channel attention effect, and the vector expression is generated as follows:
;
wherein,indicate->Frequency components of discrete cosine transform, +.>Indicate->Vector processed by frequency components of discrete cosine transform,/->Sequence number representing the vector processed,/->I.e. 16 frequency components are applied to the image profile, respectively +.>Basis function of two-dimensional discrete cosine transform, +.>,/>Corresponding +.>And->Value of>Representing the correspondence of internal data->,/>For the summation of the variables, 16 are chosen because we first determine the importance of each frequency component, then study the influence of using different numbers of frequency components, that is, we evaluate the results of each frequency component in the channel attention, and finally, according to the evaluation results, we choose the Top-k frequency component with the highest performance, k is 16;
s3, calculating the weight by using an algorithm formula, and processing the weight to obtainThe vectors are weighted to obtain initialized attention vectors, and the feature map of the input image hasEach of the channels obtained in the previous step>All are +.>A one-dimensional vector of values, each value representing information dependent on a channel that has not been processed, a total of 16The method used here is a percentage weighting method, the formula of which is as follows:
;
wherein,,/>,/>indicating that go through->Serial number of vector processed by frequency component of discrete cosine transform,/for each frequency component>Sequence number representing vector lane number,/->For each +.>Vector>Percentage weighted weight of the individual channels, +.>For each +.>Vector>The values of the channels, and the values of each dimension of the finally obtained C-dimensional initialized attention vector are as follows:
;
wherein,indicate use of->Vector processed by frequency component of discrete cosine transform after percentage weighting +.>A number of channels;
s4, processing the initialized attention vector through a mapping function to obtain a final channel attention vector, wherein the mapping function can be a full-connection layer or one-dimensional convolution, the full-connection layer (fully connected layers, FC) plays a role of a classifier in the whole convolution neural network, if the operations of the convolution layer, the pooling layer, the activation function and the like are to map the original data to the hidden layer feature space, the full-connection layer plays a role of mapping the learned distributed feature representation to the sample mark space, in actual use, the full-connection layer can be realized by convolution operation, after passing through the full-connection layer, we choose to map weight values into a range of 0-1 by using a Sigmoid function, the Sigmoid function is a mathematical function with a graceful S-shaped curve, and in logistic regression and artificial godIn information science, due to its single increment and its inverse, sigmoid functions are often used as threshold numbers for neural networks, mapping variables between 0,1,the values on each channel of (a) are expressed as the dependency weights between the different image feature channels;
s5, multiplying the attention vector with the original image feature map to obtain a channel attention weighted feature map, wherein the channel attention weighted feature map is as follows:
;
wherein,a feature map weighted by channel attention.
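Putting steps S2 through S5 together, the following is a non-authoritative end-to-end sketch for a single feature map. The 4×4 grid of frequency pairs stands in for the patent's empirically selected Top-16 components, and the learned mapping layer (FC or 1-D convolution) of step S4 is omitted:

```python
import numpy as np

def channel_attention(x, n_freq=16):
    """Sketch of multi-frequency channel attention on one (C, H, W) map:
    16 DCT components -> percentage weighting -> sigmoid -> rescaling.

    The (u, v) pairs over a 4x4 grid are an assumption; the patent
    selects its Top-16 components by evaluating each one's performance.
    """
    C, H, W = x.shape
    i = np.arange(H).reshape(-1, 1)
    j = np.arange(W).reshape(1, -1)
    pairs = [(u, v) for u in range(4) for v in range(4)][:n_freq]
    V = np.stack([
        (x * np.cos(np.pi * u * (i + 0.5) / H)
           * np.cos(np.pi * v * (j + 0.5) / W)).sum(axis=(1, 2))
        for (u, v) in pairs
    ])                                                        # (16, C)
    a = ((V / V.sum(axis=0, keepdims=True)) * V).sum(axis=0)  # (C,)
    s = 1.0 / (1.0 + np.exp(-a))                              # into (0, 1)
    return x * s[:, None, None]

rng = np.random.default_rng(1)
out = channel_attention(rng.standard_normal((32, 8, 8)))
```

The output keeps the input shape; only the per-channel scale changes, so the block can be dropped between any two layers of a backbone.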
Under otherwise identical objective conditions, a comparison experiment was carried out between this method and the detector-free image matching method LoFTR, to verify the image matching effect of the method designed by the invention.
The MegaDepth dataset is an outdoor scene dataset built from web pictures, containing 130K images in total, of which 100K have depth maps and 30K have ordinal maps. Ordinal labels simply represent the relative depth order between two objects in the same image. AUC values at the thresholds (5, 10, 20) are used to evaluate the quality of the network on the image matching task.
As shown in fig. 4, the first column shows the image matching result using LoFTR on a set of test images of the Megadepth dataset, and the second column shows the image matching result of the invention, and a significant improvement in the number of matches can be seen.
As shown in FIG. 5, comparing the image matching method of the invention with the classical LoFTR using the objective evaluation index AUC, the AUC value of the method of the invention is higher at all three thresholds, indicating that the method designed by the invention achieves high accuracy and an ideal effect in image matching.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (10)
1. A multi-frequency fusion-based channel attention image processing method is characterized in that: the method comprises the following steps:
s1, extracting multi-scale features of an image through a backbone network;
s2, processing the feature images by discrete cosine transformation of different frequency components respectively;
s3, calculating weights by using an algorithm formula, and weighting the vectors obtained by processing to obtain initialized attention vectors;
s4, processing the initialized attention vector to obtain a final channel attention vector;
s5, multiplying the attention vector with the original image feature map to obtain a feature map weighted by the attention of the channel.
2. The multi-frequency fusion-based channel attention image processing method of claim 1, wherein: the backbone network that extracts features in step S1 adopts a ResNet34 residual network as its basic structure, and the multi-scale features of the image are extracted through a feature pyramid structure.
3. The multi-frequency fusion-based channel attention image processing method of claim 1, wherein: the discrete cosine transform expression in step S2 is as follows:

$$f^{2d}_{h,w} = \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} x^{2d}_{i,j} \, B^{i,j}_{h,w}, \quad h \in \{0,\dots,H-1\}, \; w \in \{0,\dots,W-1\}$$

wherein $H$ is the height of the input image, $W$ is the width of the input image, the superscript $2d$ denotes a two-dimensional image, $f^{2d}$ is the spectrum of the two-dimensional discrete cosine transform, $x^{2d}$ denotes the input image, $B^{i,j}_{h,w}$ is the basis function of row $i$ and column $j$ of the two-dimensional discrete cosine transform, $i$ and $j$ index the height and width dimensions of the input two-dimensional image respectively, and $\sum$ denotes the summation over the variables $i$ and $j$.
4. A multi-frequency fusion-based channel attention image processing method as recited in claim 3, wherein: the basis function $B^{i,j}_{h,w}$ of the two-dimensional discrete cosine transform has the following specific expression:

$$B^{i,j}_{h,w} = \cos\!\left(\frac{\pi h}{H}\left(i+\frac{1}{2}\right)\right)\cos\!\left(\frac{\pi w}{W}\left(j+\frac{1}{2}\right)\right)$$

wherein $\pi$ denotes the radian angle and $\cos$ is the cosine function.
5. A multi-frequency fusion-based channel attention image processing method as recited in claim 3, wherein: the multiple frequency components in the processing of step S2 compress information and include the lowest frequency component; different frequency components are applied to the image feature map respectively, and each processing result can serve as an effect result of channel attention. The generated vector expression is as follows:

$$v^{k} = \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} x_{i,j} \, B^{i,j}_{u_k,v_k}, \quad k \in \{1,\dots,16\}$$

wherein $(u_k, v_k)$ denotes the $k$-th frequency component of the discrete cosine transform, $v^{k}$ denotes the vector processed by the $k$-th frequency component, $k$ is the sequence number of the processed vector (i.e., 16 frequency components are applied to the image feature map respectively), $B^{i,j}_{u_k,v_k}$ is the basis function of the two-dimensional discrete cosine transform evaluated at the frequency component $(u_k, v_k)$, and $\sum$ denotes the summation over the variables $i$ and $j$.
6. The multi-frequency fusion-based channel attention image processing method of claim 5, wherein: performing the percentage weighting method on the vectors in step S3 comprises the following steps:
S3.1, setting the channel number of the feature map as $C$;
S3.2, for each processed vector $v^{k}$, calculating its percentage weights;
S3.3, weighting each dimension of the vectors to obtain the value $a_{c}$ of each dimension of the $C$-dimensional initialized attention vector.
7. The multi-frequency fusion-based channel attention image processing method of claim 6, wherein: the percentage weight calculation formula in step S3.2 is as follows:

$$w^{k}_{c} = \frac{v^{k}_{c}}{\sum_{k'=1}^{16} v^{k'}_{c}}, \quad k \in \{1,\dots,16\}, \; c \in \{1,\dots,C\}$$

wherein $k$ denotes the sequence number of the vector processed by the $k$-th frequency component of the discrete cosine transform, $c$ denotes the channel sequence number of the vector, $w^{k}_{c}$ is the percentage weight of the $c$-th channel of the $k$-th vector, and $v^{k}_{c}$ is the value of the $c$-th channel of the $k$-th vector.
8. The multi-frequency fusion-based channel attention image processing method of claim 7, wherein: the expression in step S3.3 for the value of each dimension of the $C$-dimensional initialized attention vector is as follows:

$$a_{c} = \sum_{k=1}^{16} w^{k}_{c} \, v^{k}_{c}$$

wherein $v^{k}_{c}$ denotes the $c$-th channel of the vector processed by the $k$-th frequency component of the discrete cosine transform, weighted by its percentage weight $w^{k}_{c}$.
9. The multi-frequency fusion-based channel attention image processing method of claim 7, wherein: the processing in step S4 uses a mapping function to process the initialized attention vector, and then maps the weight values into the interval $[0,1]$ through the Sigmoid function.
10. The multi-frequency fusion-based channel attention image processing method of claim 1, wherein: the calculation formula of the channel-attention-weighted feature map in step S5 is as follows:

$$\tilde{X}_{c} = a'_{c} \, X_{c}, \quad c \in \{1,\dots,C\}$$

wherein $a'$ is the final channel attention vector, $X$ is the original image feature map, and $\tilde{X}$ is the feature map weighted by channel attention.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410103196.0A CN117635962B (en) | 2024-01-25 | 2024-01-25 | Multi-frequency fusion-based channel attention image processing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117635962A true CN117635962A (en) | 2024-03-01 |
CN117635962B CN117635962B (en) | 2024-04-12 |
Family
ID=90027294
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410103196.0A Active CN117635962B (en) | 2024-01-25 | 2024-01-25 | Multi-frequency fusion-based channel attention image processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117635962B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113408577A (en) * | 2021-05-12 | 2021-09-17 | 桂林电子科技大学 | Image classification method based on attention mechanism |
CN113643261A (en) * | 2021-08-13 | 2021-11-12 | 江南大学 | Lung disease diagnosis method based on frequency attention network |
US20210365717A1 (en) * | 2019-04-22 | 2021-11-25 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for segmenting a medical image, and storage medium |
CN116563108A (en) * | 2023-05-09 | 2023-08-08 | 桂林电子科技大学 | Residual error-based multispectral channel attention network underwater image super-resolution method |
CN117372782A (en) * | 2023-11-21 | 2024-01-09 | 中国科学技术大学 | Small sample image classification method based on frequency domain analysis |
- 2024-01-25 — CN202410103196.0A — patent CN117635962B granted (Active)
Non-Patent Citations (2)
Title |
---|
SHIGE XU: "Channel Attention for Sensor-Based Activity Recognition: Embedding Features into All Frequencies in DCT Domain", IEEE, 19 May 2023, pages 12497-12512 *
WANG Chunlong: "Peanut image recognition based on frequency channel attention network", Journal of Peanut Science, no. 2022, 18 August 2022, pages 69-76 *
Also Published As
Publication number | Publication date |
---|---|
CN117635962B (en) | 2024-04-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |