CN116958759A - Image processing method, apparatus, device, storage medium, and program product - Google Patents


Info

Publication number
CN116958759A
Authority
CN
China
Prior art keywords
information
feature information
image
attention weight
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210379553.7A
Other languages
Chinese (zh)
Inventor
刘天鸿
崔文学
惠晨
姜峰
高莹
谢绍伟
吴平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
ZTE Corp
Original Assignee
Harbin Institute of Technology
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology, ZTE Corp filed Critical Harbin Institute of Technology
Priority to CN202210379553.7A
Priority to PCT/CN2023/080226
Publication of CN116958759A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiment of the application provides an image processing method, apparatus, device, storage medium, and program product, including the following steps: acquiring a compressed image and corresponding encoding information; performing feature extraction on the compressed image according to the encoding information to obtain local feature information and global feature information; calculating a first attention weight corresponding to the local feature information and a second attention weight corresponding to the global feature information; performing weighted fusion of the local feature information, the global feature information, the first attention weight and the second attention weight to obtain fused feature information; and obtaining image residual information according to the fused feature information, and superimposing the compressed image and the image residual information to obtain a reconstructed image. Based on the attention fusion mechanism, the embodiment of the application adaptively selects fusion weights for different regions according to the picture characteristics, thereby achieving a better restoration effect; it makes no change to the encoding end and introduces no additional computation there, which ensures video clarity while reducing the cost of transmitting the video.

Description

Image processing method, apparatus, device, storage medium, and program product
Technical Field
The present application relates to the field of image processing technology, and in particular, to an image processing method, apparatus, device, storage medium, and program product.
Background
Video data with a large data volume often strains storage and bandwidth, so video needs to be compressed in practice; however, compressed video is often accompanied by distortion and compression noise and suffers a certain quality loss compared with the original video. Existing video image restoration processing techniques have certain limitations: the visual effect and quality of the reconstructed image cannot be well guaranteed, and the encoding end may have to be changed, introducing additional computational complexity.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiment of the application provides an image processing method, an image processing device, image processing equipment, a storage medium and a program product, which can improve the visual effect and quality of a reconstructed video image after encoding and decoding.
In one aspect, an embodiment of the present application provides an image processing method, including: acquiring a compressed image and coding information corresponding to the compressed image; extracting the characteristics of the compressed image according to the coding information to obtain local characteristic information and global characteristic information; calculating a first attention weight corresponding to the local feature information and a second attention weight corresponding to the global feature information; fusing the local feature information, the global feature information, the first attention weight and the second attention weight to obtain fused feature information; and obtaining image residual information according to the fusion characteristic information, and superposing the compressed image and the image residual information to obtain a reconstructed image.
On the other hand, the embodiment of the application also provides an image processing device, which comprises: an acquisition unit configured to acquire a compressed image and encoding information corresponding to the compressed image; the feature extraction unit is used for carrying out feature extraction on the compressed image according to the coding information to obtain local feature information and global feature information; an attention weight calculation unit configured to calculate a first attention weight corresponding to the local feature information and a second attention weight corresponding to the global feature information; the feature fusion unit is used for fusing the local feature information, the global feature information, the first attention weight and the second attention weight to obtain fused feature information; and the image superposition unit is used for obtaining image residual information according to the fusion characteristic information, and superposing the compressed image and the image residual information to obtain a reconstructed image.
In another aspect, an embodiment of the present application further provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the image processing method as described above when executing the computer program.
In another aspect, embodiments of the present application also provide a computer-readable storage medium storing computer-executable instructions for performing the image processing method as described above.
In another aspect, embodiments of the present application further provide a computer program product including a computer program or computer instructions stored in a computer-readable storage medium, from which a processor of a computer device reads the computer program or the computer instructions, the processor executing the computer program or the computer instructions, causing the computer device to perform the image processing method as described above.
In the embodiment of the application, an attention fusion mechanism is first provided that can perform weighted fusion of the local features and the global features of an image; the mechanism can adaptively select fusion weights for different regions according to the features of the picture, so a better restoration effect can be obtained. In addition, the embodiment of the application directly enhances the quality of the lossy compressed image after encoding and decoding without changing the process at the encoding end, which ensures that the encoding end introduces no extra computational complexity and that the bit stream is not changed, greatly increasing the flexibility of the framework. Because the quality of the decoded video is enhanced, on the one hand higher video clarity is guaranteed at the same bit rate, and on the other hand the bit stream that needs to be transmitted is reduced while the same video quality is guaranteed, so the cost of transmitting the video is greatly reduced.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate and do not limit the application.
Fig. 1 is a schematic structural view of an electronic device for performing an image processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of an image processing method provided by an embodiment of the present application;
FIG. 3 is a detailed flow chart of one embodiment of step S200 in FIG. 2;
FIG. 4 is a detailed flow chart of one embodiment of step S600 in FIG. 3;
FIG. 5 is a detailed flowchart of another embodiment of step S600 in FIG. 3;
FIG. 6 is a detailed flow chart of one embodiment of step S700 in FIG. 3;
FIG. 7 is a detailed flow chart of another embodiment of step S700 in FIG. 3;
FIG. 8 is a detailed flow chart of one embodiment of step S720 in FIG. 7;
FIG. 9 is a detailed flow chart of one embodiment of step S300 in FIG. 2;
FIG. 10 is a detailed flowchart of another embodiment of step S300 in FIG. 2;
FIG. 11 is a detailed flow chart of one embodiment of step S400 of FIG. 2;
fig. 12 is a specific flowchart of normalization processing for the first attention weight and the second attention weight before step S400 in fig. 2;
FIG. 13 is a flowchart showing an embodiment of obtaining image residual information according to the fused feature information in step S500 in FIG. 2;
FIG. 14 is a schematic diagram of a spatial attention fusion module combining a local feature extraction module and a global feature extraction module according to one embodiment of the present application;
FIG. 15 is a schematic diagram of a local feature extraction module according to one embodiment of the application;
FIG. 16 is a schematic diagram of a global feature extraction module according to one embodiment of the application;
FIG. 17 is a schematic diagram of a channel attention fusion module combining a local feature extraction module and a global feature extraction module according to one embodiment of the present application;
fig. 18 is a schematic structural diagram of a residual block according to an embodiment of the present application;
FIG. 19 is a detailed schematic diagram of the spatial attention fusion module shown in FIG. 14;
FIG. 20 is a schematic diagram of a network for attention-based quality enhancement for lossy compressed images, in accordance with one embodiment of the present application;
fig. 21 is a schematic structural view of an image processing apparatus according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description, in the claims and in the above-described figures, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
In the related art, with the development of the internet and the continuous progress of video encoding and decoding technology, video and image application scenarios are becoming more and more common, but the raw video data volume is huge, which brings great challenges to storage and bandwidth. Therefore, in practical situations compression processing is required for video, but compressed video is often accompanied by distortion and compression noise and has a certain quality loss compared with the original video. For this case, deep learning techniques can be used to improve the coding efficiency of video, but under the existing conventional codec framework, the coding framework itself generally has high time complexity due to the complex Rate-Distortion Optimization (RDO) process. This certainly places an additional computational burden on video coding if a deep network is considered in the overall RDO process.
In addition, both the High Efficiency Video Coding (H.265/HEVC) standard and the Versatile Video Coding (H.266/VVC) standard established by the Joint Video Experts Team (JVET) under ISO/IEC and ITU, and the Audio Video coding Standard (AVS) series of coding and decoding standards established by the national digital audio and video coding and decoding technical standards working group, adopt a block-based hybrid coding framework: after the original video data is divided into image blocks, multiple processes such as prediction, transformation, quantization, reconstruction and filtering are performed. Because these processes are all performed on the divided image blocks, the prediction methods, transformation processes, quantization parameters and the like adopted by different image blocks are not completely the same, which may cause distortion such as blocking artifacts at the boundaries of adjacent image blocks. In addition, since the human eye is sensitive to low-frequency information in an image, such as the overall brightness of an object, and insensitive to high-frequency detail information, the quantization process removes the high-frequency information the human eye is insensitive to by finely quantizing the coefficients of the low-frequency region and coarsely quantizing the coefficients of the high-frequency region, thereby reducing the amount of information to be transferred. A common quantization method is dividing by the quantization step size. The quantization step size may be indicated by the quantization parameter (QP); in general, the smaller the QP value, the smaller the corresponding quantization step size and thus the smaller the image compression loss, and the larger the QP value, the larger the corresponding quantization step size and thus the greater the image compression loss. This also affects the quality of the reconstructed image, and if a distorted reconstructed image is used as a reference image for a subsequently encoded image, the accuracy of the subsequently encoded image is further affected.
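As a rough illustration of the quantization loss described above, the following minimal sketch (illustrative only, not taken from the standards or the present application) quantizes a single transform coefficient by dividing by a quantization step and shows how a larger step, corresponding to a larger QP, produces a larger reconstruction error:

```python
# Illustrative scalar quantization of one transform coefficient.
def quantize(coeff: float, q_step: float) -> int:
    return round(coeff / q_step)

def dequantize(level: int, q_step: float) -> float:
    return level * q_step

coeff = 13.7
for q_step in (1.0, 4.0, 16.0):            # larger step roughly corresponds to larger QP
    rec = dequantize(quantize(coeff, q_step), q_step)
    print(q_step, rec, abs(coeff - rec))    # reconstruction error grows with the step
```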
At present, in most video compression coding and decoding standards, the restoration processing of the reconstructed image mainly adopts loop filtering techniques, including deblocking filtering, sample adaptive offset, adaptive loop filtering and the like. Although these traditional methods can eliminate compression noise to a certain extent and improve the quality of the reconstructed video image, their parameters are set manually, so they cannot necessarily mine the mapping relationship between the lossy compressed image and the original image to the greatest extent. Many researchers have made a series of improvements to the existing coding framework to address this problem, but the drawbacks of the traditional methods are still not eliminated.
In addition, with the rise of deep learning, great potential has been shown in various fields. Compared with traditional methods, deep learning techniques have a self-learning capability based on big data: they can be trained on a large amount of data and learn a nonlinear mapping relationship from the data to adapt to a specific task. Meanwhile, as the amount of training data increases, the effect, robustness and generalization capability of a deep learning algorithm are also enhanced. However, existing neural network methods for image restoration tasks are built with convolution layers, which have the characteristics of weight sharing and local perception; the receptive field of a convolution operation is limited to the size of the convolution kernel, so only neighborhood information can be extracted at each step while information at other, distant positions in the image is ignored, which imposes certain limitations.
Based on the above situation, the embodiments of the present application provide an image processing method, apparatus, device, storage medium, and program product, which specifically include the following steps: acquiring a compressed image and encoding information corresponding to the compressed image; performing feature extraction on the compressed image according to the encoding information to obtain local feature information and global feature information; calculating a first attention weight corresponding to the local feature information and a second attention weight corresponding to the global feature information; fusing the local feature information, the global feature information, the first attention weight and the second attention weight to obtain fused feature information; and obtaining image residual information according to the fused feature information, and superimposing the compressed image and the image residual information to obtain a reconstructed image. According to the technical solution of the embodiments of the application, an attention fusion mechanism is first provided that can perform weighted fusion of the local features and the global features of an image; the mechanism can adaptively select fusion weights for different regions according to the picture features, so a better restoration effect can be obtained. In addition, the embodiments of the application directly enhance the quality of the lossy compressed image after encoding and decoding without changing the process at the encoding end, which ensures that the encoding end introduces no extra computational complexity and that the bit stream is not changed, greatly increasing the flexibility of the framework. Because the quality of the decoded video is enhanced, on the one hand higher video clarity is guaranteed at the same bit rate, and on the other hand the bit stream that needs to be transmitted is reduced while the same video quality is guaranteed, so the cost of transmitting the video is greatly reduced.
Embodiments of the present application will be further described below with reference to the accompanying drawings.
As shown in fig. 1, fig. 1 is a schematic diagram of an electronic device for performing an image processing method according to an embodiment of the present application.
In the example of fig. 1, the electronic device 100 includes a processor 110 and a memory 120, wherein the processor 110 and the memory 120 are communicatively coupled.
Wherein the memory is operable as a non-transitory computer readable storage medium storing a non-transitory software program and a non-transitory computer executable program. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some implementations, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
As will be appreciated by those skilled in the art, the electronic device 100 may be applied to a 3G communication network system, an LTE communication network system, a 5G communication network system, a 6G communication network system, a mobile communication network system that is evolved later, and the like, which is not particularly limited in this embodiment.
It will be appreciated by those skilled in the art that the electronic device 100 shown in fig. 1 is not limiting of the embodiments of the application and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
In the electronic device 100 shown in fig. 1, the processor 110 may call an image processing program stored in the memory 120, thereby performing an image processing method.
Based on the above-described electronic device 100, various embodiments of the image processing method of the present application are set forth below.
As shown in fig. 2, fig. 2 is a flowchart of an image processing method according to an embodiment of the present application, and the image processing method includes, but is not limited to, step S100, step S200, step S300, step S400, and step S500.
Step S100, obtaining a compressed image and coding information corresponding to the compressed image;
Step S200, extracting the characteristics of the compressed image according to the coding information to obtain local characteristic information and global characteristic information;
step S300, calculating a first attention weight corresponding to the local feature information and a second attention weight corresponding to the global feature information;
step S400, carrying out weighted fusion on the local feature information, the global feature information, the first attention weight and the second attention weight to obtain fusion feature information;
Step S500, obtaining image residual information according to the fusion characteristic information, and superposing the compressed image and the image residual information to obtain a reconstructed image.
According to the technical solution of the embodiment of the application, an attention fusion mechanism is first provided that can perform weighted fusion of the local features and the global features of an image; the mechanism can adaptively select fusion weights for different regions according to the picture features, so a better restoration effect can be obtained. In addition, the embodiment of the application directly enhances the quality of the lossy compressed image after encoding and decoding without changing the process at the encoding end, which ensures that the encoding end introduces no extra computational complexity and that the bit stream is not changed, greatly increasing the flexibility of the framework. Because the quality of the decoded video is enhanced, on the one hand higher video clarity is guaranteed at the same bit rate, and on the other hand the bit stream that needs to be transmitted is reduced while the same video quality is guaranteed, so the cost of transmitting the video is greatly reduced. A minimal end-to-end data-flow sketch of steps S100 to S500 is given below.
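The following sketch assumes a PyTorch implementation and uses deliberately simple stand-ins (single convolutions and a softmax gate) for the local extraction, global extraction and attention modules that the later embodiments describe in detail; all sizes and the normalized QP plane are illustrative assumptions, not the claimed network.

```python
import torch
import torch.nn as nn

class ToyEnhancer(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.head = nn.Conv2d(2, channels, 3, padding=1)       # image + QP plane -> features (S100/S600)
        self.local = nn.Conv2d(channels, channels, 3, padding=1)   # stand-in for local extraction (S200)
        self.glob = nn.Conv2d(channels, channels, 3, padding=1)    # stand-in for global extraction (S200)
        self.gate = nn.Conv2d(2 * channels, 2, 1)               # stand-in attention weights (S300)
        self.tail = nn.Conv2d(channels, 1, 3, padding=1)        # fused features -> residual (S500)

    def forward(self, compressed, qp_plane):
        f = self.head(torch.cat([compressed, qp_plane], dim=1))
        f_la, f_ga = self.local(f), self.glob(f)
        w = torch.softmax(self.gate(torch.cat([f_la, f_ga], dim=1)), dim=1)  # weights sum to 1 per pixel
        fused = w[:, 0:1] * f_la + w[:, 1:2] * f_ga             # weighted fusion (S400)
        return compressed + self.tail(fused)                    # superimpose residual (S500)

x = torch.rand(1, 1, 64, 64)             # lossy compressed luma patch
qp = torch.full_like(x, 32.0 / 51.0)      # normalized QP plane (assumption)
print(ToyEnhancer()(x, qp).shape)         # torch.Size([1, 1, 64, 64])
```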
The lossy compressed image may be image information compressed by the codec, or may be image information in a video sequence compressed by the codec.
In addition, the coding information refers to information required when coding the image, and may specifically be coding unit partition structure information, quantization parameter information or other coding information.
In addition, it should be noted that the first attention weight and the second attention weight may be spatial attention weights or channel attention weights, and the types of the first attention weight and the second attention weight are not limited in the embodiment of the present application.
In addition, as shown in fig. 3, fig. 3 is a specific flow chart of one embodiment of step S200 in fig. 2, including, but not limited to, steps S600 and S700.
Step S600, obtaining feature information to be restored corresponding to the compressed image according to the coding information;
and step S700, extracting the characteristics of the characteristic information to be recovered to obtain local characteristic information and global characteristic information.
Specifically, with respect to step S200, before performing feature extraction on the compressed image, the embodiment of the present application further needs to obtain feature information to be restored corresponding to the compressed image according to the encoding information, and then perform feature extraction on the feature information to be restored to obtain local feature information and global feature information.
It should be noted that the feature information to be recovered may be a feature obtained by processing the lossy compressed image with a convolution layer, as shown in fig. 4; a feature obtained by processing the lossy compressed image with a convolution layer and at least one residual block, as shown in fig. 5; a feature obtained by processing it with a convolution layer and at least one fusion structure; or a feature obtained by processing it with a convolution layer, at least one residual block and a fusion structure.
As shown in fig. 4, fig. 4 is a specific flow chart of one embodiment of step S600 in fig. 3, including but not limited to step S610.
And step S610, performing convolution operation on the compressed image and the coding information to obtain the characteristic information to be restored corresponding to the compressed image.
Specifically, before extracting features of the compressed image according to the encoded information, the embodiment of the application also needs to process the lossy compressed image through a convolution layer to obtain the feature information to be recovered.
As shown in fig. 5, fig. 5 is a specific flowchart of another embodiment of step S600 in fig. 3, including, but not limited to, step S621 and step S622.
Step S621, performing convolution operation on the compressed image and the coding information to obtain initial characteristic information;
Step S622, performing optimization processing on the initial feature information to obtain feature information to be restored corresponding to the compressed image.
Specifically, regarding the optimization processing in step S622 described above, three cases are included, but not limited to:
first optimization processing case: before extracting the characteristics of the compressed image according to the coding information, the embodiment of the application also needs to process the lossy compressed image through a convolution layer to obtain initial characteristic information, and then inputs the initial characteristic information into a residual block for optimization processing to obtain the characteristic information to be recovered.
Second optimization process case: before extracting the characteristics of the compressed image according to the coding information, the embodiment of the application also needs to process the lossy compressed image through a convolution layer to obtain initial characteristic information, and then inputs the initial characteristic information into at least one fusion structure to perform optimization processing to obtain the characteristic information to be recovered.
Third optimization processing case: before extracting the characteristics of the compressed image according to the coding information, the embodiment of the application also needs to process the lossy compressed image through a convolution layer to obtain initial characteristic information, and then inputs the initial characteristic information into at least one residual block and a fusion structure to perform optimization processing to obtain the characteristic information to be recovered.
It should be noted that, regarding the above-mentioned fusion structure, the fusion structure may be a fusion structure based on spatial attention, or a fusion structure based on channel attention, and the type of the fusion structure is not limited in the embodiment of the present application.
In addition, as shown in fig. 6, fig. 6 is a specific flowchart of one embodiment of step S700 in fig. 3. Regarding the feature extraction of the feature information to be restored in step S700, local feature information is obtained, including but not limited to step S710.
And step S710, carrying out feature extraction on the feature information to be recovered through at least one pair of cascaded convolutional neural networks and an activation function to obtain local feature information.
Specifically, the embodiment of the application can input the feature information to be recovered to the local feature extraction module for feature extraction, wherein the local feature extraction module can adopt any network built based on convolution kernel. If the local feature extraction module adopts a cascaded convolutional neural network structure, the local feature extraction module is composed of at least one pair of cascaded convolutional neural networks and an activation function, and feature extraction can be performed on feature information to be recovered through the at least one pair of cascaded convolutional neural networks and the activation function to obtain local feature information.
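A minimal sketch of such a cascaded convolution and activation structure is given below, assuming a PyTorch implementation; the number of pairs (three, following the example of fig. 15), the channel width and the kernel size are assumptions for illustration only.

```python
import torch.nn as nn

class LocalFeatureExtractor(nn.Module):
    def __init__(self, channels: int = 64, pairs: int = 3):
        super().__init__()
        layers = []
        for _ in range(pairs):
            layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                       nn.PReLU()]                 # one cascaded convolution + activation pair
        self.body = nn.Sequential(*layers)

    def forward(self, f_e):        # f_e: feature information to be recovered, shape (N, C, H, W)
        return self.body(f_e)      # local feature information F_la
```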
In addition, as shown in fig. 7, fig. 7 is a specific flowchart of another embodiment of step S700 in fig. 3. Regarding the feature extraction of the feature information to be restored in step S700, global feature information is obtained, including but not limited to step S720.
Step S720, inputting the feature information to be recovered to a Transformer network, so that the Transformer network outputs global feature information based on the feature information to be recovered.
Specifically, the embodiment of the application can input the feature information to be recovered into the global feature extraction module for feature extraction, where the global feature extraction module may adopt a classical non-local network, a Transformer-based network, or other variant networks based on these two networks. If the global feature extraction module adopts a Transformer network, the embodiment of the application can input the feature information to be recovered to the Transformer network; the Transformer network then performs feature extraction based on the feature information to be recovered and finally outputs the global feature information.
It should be noted that the Transformer network can make up for the limitation of the convolutional neural network mentioned above: its multi-head attention mechanism can discover the global similarity of the feature map at different levels, compensating for the shortcoming of the convolutional network. Therefore, combining convolution layers and Transformers in the design of the fusion network can improve the modeling capability of the network and enhance the video frame restoration effect.
In addition, as shown in fig. 8, fig. 8 is a specific flowchart of one embodiment of step S720 in fig. 7. Regarding the above-mentioned Transformer network, a dimension reduction module, a shift window and a dimension increase module are provided; specifically, step S720 includes, but is not limited to, step S721, step S722, and step S723.
Step S721, performing dimension reduction processing on the feature information to be restored through a dimension reduction module to obtain input feature information;
step S722, inputting the input characteristic information into a shift window for characteristic extraction to obtain output characteristic information;
Step S723, performing dimension-lifting processing on the output characteristic information through a dimension-lifting module to obtain global characteristic information.
Specifically, the embodiment of the application can acquire global feature information by performing window attention calculation after performing block division on the lossy compressed image through the shift window of the Transformer network. For example, if the shift window can only accept two-dimensional vector features as input, the embodiment of the application needs to perform dimension reduction processing on the feature information to be restored through the dimension reduction module to obtain two-dimensional input feature information; then, feature extraction can be performed on the two-dimensional input feature information through the shift window to obtain two-dimensional output feature information; and finally, dimension-lifting processing is performed on the two-dimensional output feature information through the dimension-lifting module to obtain global feature information.
The input feature information and the output feature information may be two-dimensional feature information or feature information of other dimensions, and the dimensions of the input feature information and the output feature information are not limited in the embodiment of the present application.
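A simplified sketch of this dimension-reduce, window-attention, dimension-lift flow is given below, assuming PyTorch. A generic TransformerEncoderLayer stands in for the shift-window (Swin) blocks, and the downsampling factor, embedding width and PixelShuffle upsampling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalFeatureExtractor(nn.Module):
    def __init__(self, channels=64, embed=32, heads=4):
        super().__init__()
        self.down = nn.Conv2d(channels, embed, kernel_size=2, stride=2)   # dimension reduction
        self.norm = nn.LayerNorm(embed)
        self.block = nn.TransformerEncoderLayer(d_model=embed, nhead=heads,
                                                batch_first=True)         # stand-in for Swin Blocks
        self.up = nn.Sequential(nn.Conv2d(embed, channels * 4, 3, padding=1),
                                nn.PixelShuffle(2))                       # dimension lifting / upsample

    def forward(self, f_e):                                # (N, C, H, W)
        x = self.down(f_e)                                 # (N, E, H/2, W/2)
        n, e, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # two-dimensional token features
        tokens = self.block(tokens)                        # window-attention stand-in
        x = tokens.transpose(1, 2).view(n, e, h, w)        # back to a three-dimensional feature map
        return self.up(x)                                  # global feature information F_ga

f_e = torch.rand(1, 64, 32, 32)
print(GlobalFeatureExtractor()(f_e).shape)                 # torch.Size([1, 64, 32, 32])
```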
In addition, as shown in fig. 9, fig. 9 is a specific flowchart of one embodiment of step S300 in fig. 2, including but not limited to step S810, step S820, step S830, step S840, and step S850.
Step S810, fusion and extraction are carried out on the local characteristic information and the global characteristic information, and key diagram information is obtained;
step S820, extracting the local characteristic information to obtain first query graph information corresponding to the local characteristic information;
step S830, extracting the global feature information to obtain second query graph information corresponding to the global feature information;
step S840, calculating the key map information and the first query map information to obtain a first spatial attention weight corresponding to the local feature information;
step S850, calculating the key map information and the second query map information to obtain a second spatial attention weight corresponding to the global feature information.
Specifically, in the embodiment of the application, a fusion structure based on spatial attention can be adopted to calculate the first attention weight and the second attention weight, wherein the fusion step is divided into three branches, and the first branch is used for carrying out fusion extraction on local feature information and global feature information to obtain key map information; the second branch is used for extracting the local characteristic information to obtain first query graph information corresponding to the local characteristic information; the third branch is used for extracting the global feature information to obtain second query graph information corresponding to the global feature information. After obtaining the key map information, the first query map information and the second query map information, the embodiment of the application calculates the weight according to the key map information and the first query map information to obtain a first spatial attention weight; and meanwhile, weight calculation is carried out according to the key diagram information and the second query diagram information, so that a second spatial attention weight is obtained.
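A minimal sketch of this three-branch spatial-attention weight computation is given below, assuming PyTorch. The tensor shapes follow the detailed embodiment described later for fig. 19 (a Key map of shape (C/2, H*W) and Query maps of shape (1, C/2)), so each matrix product yields one spatial weight map; the 1x1 convolutions and channel widths are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttentionWeights(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        mid = channels // 2
        self.key = nn.Conv2d(channels, mid, 1)       # key-map extraction on the fused branch
        self.q_local = nn.Conv2d(channels, mid, 1)   # query map for the local branch
        self.q_global = nn.Conv2d(channels, mid, 1)  # query map for the global branch

    def _query(self, conv, f):
        q = conv(f).mean(dim=(2, 3))                 # global average pooling -> (N, C/2)
        return torch.softmax(q, dim=1).unsqueeze(1)  # (N, 1, C/2)

    def forward(self, f_la, f_ga):
        n, c, h, w = f_la.shape
        k = self.key(f_la + f_ga).flatten(2)                                # (N, C/2, H*W)
        w_la = torch.bmm(self._query(self.q_local, f_la), k).view(n, 1, h, w)
        w_ga = torch.bmm(self._query(self.q_global, f_ga), k).view(n, 1, h, w)
        return w_la, w_ga                             # first / second spatial attention weights

f_la, f_ga = torch.rand(2, 1, 64, 16, 16)
w_la, w_ga = SpatialAttentionWeights()(f_la, f_ga)
print(w_la.shape, w_ga.shape)                         # two maps of shape (1, 1, 16, 16)
```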
In addition, as shown in fig. 10, fig. 10 is a specific flowchart of another embodiment of step S300 in fig. 2, including but not limited to step S910, step S920, step S930, and step S940.
Step S910, splicing the local characteristic information and the global characteristic information to obtain spliced characteristic information;
step S920, global pooling is carried out on the splicing characteristic information to obtain pooled splicing characteristic information;
step S930, calculating the pooled spliced characteristic information through a first full-connection layer to obtain a first channel attention weight corresponding to the local characteristic information;
and step S940, calculating the pooled spliced characteristic information through a second full-connection layer to obtain a second channel attention weight corresponding to the global characteristic information.
Specifically, the embodiment of the application can calculate the first attention weight and the second attention weight by adopting a fusion structure based on channel attention, where the fusion step has only one branch: after the local feature information and the global feature information are spliced and globally pooled, the channel attention weights corresponding to the local feature information and the global feature information are respectively calculated through fully connected layers.
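A minimal sketch of this channel-attention variant is given below, assuming PyTorch: the two feature maps are spliced, globally pooled, and two fully connected layers produce one per-channel weight vector for each branch. Layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttentionWeights(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.fc_local = nn.Linear(2 * channels, channels)    # first fully connected layer
        self.fc_global = nn.Linear(2 * channels, channels)   # second fully connected layer

    def forward(self, f_la, f_ga):
        pooled = torch.cat([f_la, f_ga], dim=1).mean(dim=(2, 3))   # splice + global pooling -> (N, 2C)
        w_la = self.fc_local(pooled)[..., None, None]              # first channel attention weight (N, C, 1, 1)
        w_ga = self.fc_global(pooled)[..., None, None]             # second channel attention weight
        return w_la, w_ga

f_la, f_ga = torch.rand(2, 1, 64, 16, 16)
w_la, w_ga = ChannelAttentionWeights()(f_la, f_ga)
print(w_la.shape)   # torch.Size([1, 64, 1, 1])
```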
In addition, as shown in fig. 11, fig. 11 is a specific flowchart of one embodiment of step S400 in fig. 2, including, but not limited to, step S1010, step S1020, and step S1030.
Step S1010, carrying out weighted calculation on the local characteristic information and the first attention weight to obtain weighted local characteristic information;
step S1020, carrying out weighted calculation on the global feature information and the second attention weight to obtain weighted global feature information;
Step S1030, fusing the weighted local feature information and the weighted global feature information to obtain fused feature information.
Specifically, the embodiment of the application can utilize the first attention weight corresponding to the local feature information and the second attention weight corresponding to the global feature information to weight and fuse the local feature information and the global feature information to obtain high-dimensional fusion feature information. The method specifically comprises the steps of respectively carrying out weighted calculation on local feature information and global feature information to obtain weighted local feature information and global feature information, and then fusing the weighted local feature information and the weighted global feature information to obtain high-dimensional fused feature information.
In addition, as shown in fig. 12, fig. 12 is a specific flowchart of normalization processing for the first attention weight and the second attention weight before step S400 in fig. 2. Specifically, before step S400, the image processing method according to the embodiment of the present application further includes, but is not limited to, step S1100.
Step S1100, performing normalization processing on the first attention weight and the second attention weight, so that the sum of weights corresponding to each point in the space of the first attention weight and the second attention weight is one.
Specifically, before the step of weighted fusion, it is required to determine whether the sum of weights corresponding to each point in the space of the first attention weight and the second attention weight is one, and if not, it is also required to normalize the first attention weight and the second attention weight so that the sum of weights corresponding to each point in the space of the first attention weight and the second attention weight is one.
It should be noted that, when the first attention weight and the second attention weight are spatial attention weights, the first attention weight and the second attention weight are two-dimensional weights, and the normalization process may be performed by using Softmax 2D or other similar methods; in addition, when the first attention weight and the second attention weight are channel attention weights, the first attention weight and the second attention weight are one-dimensional weight values, and the normalization processing may be performed using Softmax or other similar methods.
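A minimal sketch of the normalization and subsequent weighted fusion is given below, assuming PyTorch. Stacking the two weight maps and applying softmax across them plays the role of the Softmax2D normalization described above, so that at every point the two weights sum to one; the shapes shown correspond to spatial attention weights and are illustrative.

```python
import torch

def fuse(f_la, f_ga, w_la, w_ga):
    w = torch.softmax(torch.stack([w_la, w_ga], dim=0), dim=0)  # the pair of weights sums to 1 at each point
    return w[0] * f_la + w[1] * f_ga                            # fused feature information

f_la, f_ga = torch.rand(2, 1, 64, 16, 16)    # local / global feature information
w_la, w_ga = torch.rand(2, 1, 1, 16, 16)     # first / second spatial attention weights
print(fuse(f_la, f_ga, w_la, w_ga).shape)    # torch.Size([1, 64, 16, 16])
```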
In addition, as shown in fig. 13, fig. 13 is a specific flowchart of one embodiment of obtaining image residual information according to the fusion characteristic information in step S500 in fig. 2. Specifically, regarding the obtaining of the image residual information according to the fusion characteristic information in step S500, there is included, but not limited to, step S1200.
Step S1200, performing dimension reduction processing on the fusion characteristic information to obtain image residual information.
Specifically, because the fusion characteristic information is high-dimensional characteristic information, the image restoration residual error is obtained after the dimension reduction processing is performed by a convolution layer or other similar methods.
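A minimal sketch of this last stage is given below, assuming PyTorch: a single convolution reduces the high-dimensional fused features to an image-shaped residual, which is then superimposed on the compressed image. The kernel size and channel counts are assumptions.

```python
import torch
import torch.nn as nn

reduce = nn.Conv2d(64, 1, kernel_size=3, padding=1)    # fused feature channels -> image channels

fused = torch.rand(1, 64, 16, 16)                      # fused feature information
compressed = torch.rand(1, 1, 16, 16)                  # lossy compressed image
residual = reduce(fused)                               # image residual information
reconstructed = compressed + residual                  # quality-enhanced reconstructed image
print(reconstructed.shape)                             # torch.Size([1, 1, 16, 16])
```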
Based on the above-described image processing methods of fig. 2 to 13, a plurality of specific embodiments of the present application are set forth below.
In one embodiment, in order to further improve the visual effect and quality of the decoded reconstructed video image, as shown in fig. 14, an embodiment of the present application provides a schematic structural diagram of a spatial attention fusion module that combines a local feature extraction module and a global feature extraction module.
Specifically, the spatial attention fusion module shown in fig. 14 mainly includes three processing steps of splitting, fusing, and selecting:
Regarding the splitting step in fig. 14, the corresponding modules include a local feature extraction module and a global feature extraction module, which extract features of different characteristics from the input feature information as local feature information and global feature information, respectively. The local feature extraction module may be any network built from convolution kernels; in the following embodiments, the local feature extraction module will be described taking a cascaded convolutional neural network structure as an example. The global feature extraction module may be a classical non-local network, a Transformer network, or other variants of these two networks; in the following embodiments, the global feature extraction module will be described taking a Swin Transformer Block (Shift Window Transformer Block) network structure as an example.
As an implementation manner, the local feature extraction module may be composed of at least one pair of cascaded convolutional neural networks and activation functions, as shown in fig. 15, where fig. 15 illustrates an example in which the local feature extraction module includes three pairs of cascaded convolutional neural networks and activation functions. When the activation function is a PReLU activation function, the formula corresponding to local feature extraction can be expressed as:
F_la = PReLU_(i-1)(Conv_(i-1)(...PReLU_0(Conv_0(F_e))...))
where F_e is the input feature information to be restored and F_la is the extracted local feature information.
Also, as an implementation, the global feature extraction module includes two or an integer multiple of two shift windows (Swin Transformer Blocks), denoted S_SBs. As shown in fig. 16, fig. 16 illustrates an example in which the global feature extraction module includes two Swin Blocks.
The Swin Block obtains global feature information by performing window attention calculation after dividing the lossy compressed image into blocks. Because a Swin Block can only accept two-dimensional vector features as input, the input three-dimensional feature F_e first needs to be downsampled by a convolutional neural network, and each block is processed into a two-dimensional vector feature F_patched by a Flatten operation and layer normalization (Layer Norm); the specific formula is as follows:
F_patched = LayerNorm(Flatten(Conv(F_e)))
Conversely, before outputting its features the global feature extraction module needs to convert the two-dimensional vector features back into three-dimensional features and then perform up-sampling to obtain the global feature information F_ga. As an implementation, the two-dimensional vectors are converted into three-dimensional features by a reshaping operation (View), and the up-sampling can be performed by Pixel Shuffle, deconvolution, nearest-neighbor interpolation or linear interpolation combined with a convolution layer, among other methods; the specific formula is as follows:
F_ga = UpSampler(View(LayerNorm(S_SBs(F_patched))))
it should be noted that the embodiment of the present application does not limit the global feature extraction module to directly perform global feature extraction on the lossy compressed image.
Regarding the fusion step in fig. 14, the corresponding modules include a query map extraction module, a key map extraction module and a weight calculation module, where the fusion step is used to calculate the spatial attention weights of the local feature information and the global feature information respectively. One possible implementation of the fusion step is shown in fig. 14 and is divided into three branches: the middle branch adds the features output by the local feature extraction module and the global feature extraction module and extracts the Key map of the overall feature data through the key map extraction module; the upper and lower branches extract, through the query map extraction module, the first Query map corresponding to the local feature information and the second Query map corresponding to the global feature information respectively. The weight calculation module performs matrix multiplication on the Key map and the Query maps to obtain the first spatial attention weight corresponding to the local feature information and the second spatial attention weight corresponding to the global feature information.
Regarding the selection step in fig. 14, the corresponding module is a weighted fusion module: the local feature information and the global feature information are weighted and fused with their corresponding first spatial attention weight and second spatial attention weight by the weighted fusion module, which outputs the fused feature information.
It should be noted that, the attention fusion module may be a spatial attention fusion module shown in fig. 14, or a channel attention fusion module shown in fig. 17. Specifically, compared with the spatial attention fusion module shown in fig. 14, the channel attention fusion module shown in fig. 17 is mainly different in that only one fusion branch exists in the fusion step, and the local feature information and the global feature information extracted in the splitting step are spliced and globally pooled to respectively calculate a first channel attention weight corresponding to the local feature information and a second channel attention weight corresponding to the global feature information through a full connection layer.
In addition, it should be noted that, in the following embodiments, the fusion module may be a spatial attention fusion module structure or a channel attention fusion module structure. The attention weight may correspond to a spatial attention weight or a channel attention weight.
In one embodiment, the method steps of FIG. 2 are illustrated in conjunction with FIGS. 14 and 17:
in step S100, the lossy compressed image may be an image after being encoded and decoded, or may be a frame of image in a video sequence after being encoded and decoded. The corresponding encoding information refers to information required when encoding an image, including, but not limited to, coding unit division structure information, quantization parameter information, and the like. The following description will take the quantization parameter QP as an example, but the method of the embodiment of the present application will be equally applicable to other encoded information or a combination of encoded information.
With respect to step S200, it corresponds to the splitting step of the fusion module in fig. 14 and 17. The feature information to be restored F_e that is input into the fusion module passes through the local feature extraction module S_la and the global feature extraction module S_ga respectively, giving feature maps F_la and F_ga recovered in two different ways:
F_la = S_la(F_e)
F_ga = S_ga(F_e)
It should be noted that, before the local feature information and the global feature information are extracted, the method further includes obtaining the feature information to be restored F_e corresponding to the lossy compressed image according to the encoding information. The feature information to be restored F_e may be a feature obtained by processing the lossy compressed image through a convolution layer, or a feature obtained by processing the lossy compressed image through a convolution layer and at least one residual block. Fig. 18 is a structural example of a residual block of an embodiment. The residual block passes the input feature through two convolution layers and a PReLU activation function and adds a residual connection, outputting features beneficial to the image recovery of the fusion module; the optimization processing formula is as follows:
F_o = Conv_1(PReLU(Conv_0(F_i))) + F_i
where F_i is the input feature and F_o is the output feature.
In addition, when there are a plurality of residual blocks, the output feature of the previous residual block is used as the input feature of the subsequent residual block, and the feature is continuously optimized.
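A sketch of such a residual block is given below, assuming PyTorch and following the formula F_o = Conv_1(PReLU(Conv_0(F_i))) + F_i; the kernel size and channel width are assumptions.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv0 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.PReLU()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f_i):
        return self.conv1(self.act(self.conv0(f_i))) + f_i   # residual connection
```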
With respect to step S300, the fusion steps included in the fusion modules in fig. 14 and 17 are corresponded. When the fusion module is a spatial attention fusion module, as shown in fig. 14, the fusion step is divided into three branches, and Key maps of the overall feature data, and Query maps of the local feature information and the global feature information are respectively obtained.
Fig. 19 is a detailed structural schematic diagram of the spatial attention fusion module shown in fig. 14. Specifically, the intermediate branch adds the three-dimensional local feature information F_la and global feature information F_ga, each of shape (C, H, W), output by the local feature extraction module and the global feature extraction module, to obtain a preliminary fusion feature; a convolution layer S_k converts the preliminary fusion feature from the restoration domain to the weight-solving domain, and a flattening (View) operation reshapes the feature to obtain a two-dimensional Key map of shape (C/2, H*W) for the preliminary fusion feature:
K_lga = View(S_k(F_la + F_ga))
The upper and lower branches use convolution layers S_qla and S_qga, global pooling (Average Pool), transposition and Softmax normalization to extract the two-dimensional Query maps of shape (1, C/2) corresponding to the local feature information F_la and the global feature information F_ga, namely Q_la and Q_ga:
Q_la = Softmax(AvgPool(S_qla(F_la))^T)
Q_ga = Softmax(AvgPool(S_qga(F_ga))^T)
Here the convolution layers convert the local or global features from the restoration domain to the weight-solving domain, and the global pooling compresses the local or global feature information; because this compression causes information loss, Softmax normalization is added to enhance the information.
In addition, the weight calculation module performs matrix multiplication on the Key map and the local feature information and global feature information Query map respectively to obtain a spatial attention weight value corresponding to the local feature information and a spatial attention weight value corresponding to the global feature information.
When the fusion module is the channel attention fusion module shown in fig. 17, the local feature information and the global feature information are spliced and globally pooled, and then the channel attention weight values corresponding to the local feature information and the global feature information are respectively calculated through the full connection layer.
Regarding step S400, the selection steps included in the fusion module in fig. 14 and 17 are corresponded; attention weight value W corresponding to the output local characteristic information la Attention weight value W corresponding to global feature information ga For local characteristic information F la And global feature information F ga Weighted fusion to obtain high-dimensional fusion characteristic information F of image gla The specific formula is as follows:
F_gla = W_la * F_la + W_ga * F_ga
Before this step is performed, it is necessary to determine whether the weights of W_la and W_ga corresponding to each point in space sum to one; if not, W_la and W_ga are normalized so that their weights corresponding to each point in space sum to one.
When the fusion module is a spatial attention fusion module, W_la and W_ga are two-dimensional weight maps, and normalization may be performed with Softmax2D or another similar method so that the weights of W_la and W_ga corresponding to each point in space sum to one; when the fusion module is a channel attention fusion module, W_la and W_ga are one-dimensional weight vectors, and Softmax or another similar method may be used so that the weights of W_la and W_ga corresponding to each channel sum to one.
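The normalization and weighted fusion can be sketched as follows; applying Softmax across the two branches is one way to guarantee that the weights at each position sum to one, and is used here as a stand-in for the Softmax2D/Softmax methods mentioned above.

import torch

def fuse(f_la, f_ga, w_la, w_ga):
    # normalize across the two branches so the weights at each position sum to one
    weights = torch.softmax(torch.stack([w_la, w_ga], dim=0), dim=0)
    # F_gla = W_la * F_la + W_ga * F_ga
    return weights[0] * f_la + weights[1] * f_ga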
It should be noted that steps S200 to S400 may be performed in a loop a plurality of times; during the loop, the fused feature information output in step S400 of one round is used as the input feature of step S200 of the next round.
Regarding step S500, the fused feature information F_r finally obtained in step S400 is high-dimensional feature information and needs to be reduced in dimension by a convolution layer or another similar method; passing it through the convolution layer S_C yields the image restoration residual. Obtaining the quality-enhanced reconstructed image according to the image residual information and the lossy compressed image may be done by directly superimposing the image restoration residual on the lossy image, the result being the final restoration result, i.e. the quality-enhanced reconstructed image.
In one embodiment, as shown in Fig. 20, a network for attention-based quality enhancement of lossy compressed images according to an embodiment of the present application is illustrated.
Specifically, the lossy video image I_lf and the quantization parameter information M_qp are spliced as the input information of the network, and a convolution layer S_s raises the dimension of the input information. The specific expression is as follows:
F_s = S_s(I_lf, M_qp)
The dimension-raised feature F_s is then fed into at least one DFB module to extract the fused feature F_r. A DFB module is composed of residual blocks and a fusion module, where the fusion module may be a spatial attention fusion module or a channel attention fusion module, and the structure of a residual block is shown in Fig. 18. A DFB module may include zero residual blocks or at least one residual block.
Denoting the number of DFB modules as i (Fig. 20 takes i = 4 as an example):
F_r = DFB_{i-1}(...(DFB_0(F_s)))
The extraction of the fused feature F_r follows steps S200 to S400 in the above embodiments; when i = 4, steps S200 to S400 need to be performed four times in succession.
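Combining the sketches above, one DFB module could be organized as follows; the local and global branches are passed in as generic sub-modules (for example cascaded Conv+PReLU layers and a shifted-window Transformer block), since their internals are described elsewhere, and all interfaces here are assumptions that reuse the ResidualBlock, attention-weight and fuse helpers sketched earlier.

import torch.nn as nn

class DFB(nn.Module):
    def __init__(self, channels, num_res_blocks, local_branch, global_branch, weight_module):
        super().__init__()
        # zero or more residual blocks that produce the feature to be restored
        self.res_blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_res_blocks)])
        self.local_branch = local_branch       # e.g. cascaded Conv + PReLU layers
        self.global_branch = global_branch     # e.g. a shifted-window Transformer block
        self.weight_module = weight_module     # SpatialAttentionWeights or ChannelAttentionWeights

    def forward(self, f_in):
        f_e = self.res_blocks(f_in)            # feature information to be restored
        f_la = self.local_branch(f_e)          # local feature information
        f_ga = self.global_branch(f_e)         # global feature information
        w_la, w_ga = self.weight_module(f_la, f_ga)
        return fuse(f_la, f_ga, w_la, w_ga)    # fused feature, input to the next DFB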
After the fused feature F_r is extracted, the convolution layer S_C further reduces the dimension of F_r to obtain the image restoration residual; superimposing the image restoration residual on the lossy compressed image yields the quality-enhanced reconstructed image I_rf. The specific expression is as follows:
I_rf = I_lf + S_C(F_r)
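As a rough end-to-end sketch of the network in Fig. 20, assuming a single-channel (luma) frame, a QP map of the same spatial size, and pre-built DFB modules reused from the sketches above:

import torch
import torch.nn as nn

class QualityEnhancementNet(nn.Module):
    def __init__(self, dfb_modules, channels=64):
        super().__init__()
        # S_s: lift the spliced lossy frame and QP map into feature space
        self.s_s = nn.Conv2d(2, channels, kernel_size=3, padding=1)
        self.dfbs = nn.Sequential(*dfb_modules)
        # S_C: reduce the fused feature to an image-domain restoration residual
        self.s_c = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, i_lf, m_qp):
        f_s = self.s_s(torch.cat([i_lf, m_qp], dim=1))   # F_s = S_s(I_lf, M_qp)
        f_r = self.dfbs(f_s)                             # F_r = DFB_{i-1}(...DFB_0(F_s))
        return i_lf + self.s_c(f_r)                      # I_rf = I_lf + S_C(F_r)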
Based on the above embodiments of Fig. 2 to Fig. 20, embodiments of the present application include, but are not limited to, the following technical effects. First, the embodiments of the present application provide an attention fusion mechanism, comprising a spatial attention fusion structure and a channel attention fusion structure, which can perform weighted fusion of the local feature information and the global feature information of an image. Second, the embodiments of the present application build the restoration network using both a convolutional neural network (Convolutional Neural Network, CNN), which excels at local modeling, and a Transformer, which excels at global modeling, so that the network has a stronger image restoration capability. In addition, the embodiments of the present application directly enhance the quality of the lossy compressed video image after encoding and decoding, without changing the encoding-side process; the encoder therefore introduces no extra computational complexity and the bit stream is unchanged, which greatly increases the flexibility of the framework. Because the quality of the decoded video is enhanced, higher video definition is obtained at the same bit rate on the one hand, and on the other hand the bit stream to be transmitted is reduced while the same video quality is guaranteed, so the cost of transmitting video can be greatly reduced.
Based on the above-described image processing method, various embodiments of the image processing apparatus, the electronic device, the computer-readable storage medium, and the computer program product of the present application are set forth below.
As shown in fig. 21, fig. 21 is a schematic structural view of an image processing apparatus according to an embodiment of the present application. The image processing apparatus 200 of the embodiment of the present application includes, but is not limited to, an acquisition unit 210, a feature extraction unit 220, an attention weight calculation unit 230, a feature fusion unit 240, and an image superimposition unit 250.
Specifically, the acquiring unit 210 is configured to acquire a compressed image and encoding information corresponding to the compressed image; the feature extraction unit 220 is configured to perform feature extraction on the compressed image according to the encoding information, so as to obtain local feature information and global feature information; the attention weight calculation unit 230 is configured to calculate a first attention weight corresponding to the local feature information and a second attention weight corresponding to the global feature information; the feature fusion unit 240 is configured to perform weighted fusion on the local feature information, the global feature information, the first attention weight and the second attention weight to obtain fused feature information; the image superimposing unit 250 is configured to obtain image residual information according to the fusion feature information, and superimpose the compressed image and the image residual information to obtain a reconstructed image.
It should be noted that, the specific implementation and technical effect of the image processing apparatus according to the embodiments of the present application may correspond to the specific implementation and technical effect of the image processing method described above.
Furthermore, an embodiment of the present application provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the image processing method as described above when executing the computer program.
Note that the electronic device in this embodiment may correspond to the one shown in Fig. 1, and is not described in detail here.
The non-transitory software programs and instructions required to implement the image processing methods of the above embodiments are stored in the memory and when executed by the processor, perform the image processing methods of the above embodiments, for example, perform the method steps of fig. 2 through 13 described above.
It should be noted that, the specific implementation and the technical effect of the electronic device according to the embodiments of the present application may correspond to the specific implementation and the technical effect of the image processing method described above.
Furthermore, an embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions for performing the above-described image processing method, for example, performing the above-described method steps of fig. 2 to 13.
Furthermore, an embodiment of the present application also discloses a computer program product comprising a computer program or computer instructions stored in a computer-readable storage medium, the computer program or computer instructions being read from the computer-readable storage medium by a processor of a computer device, the processor executing the computer program or computer instructions to cause the computer device to perform the image processing method as in any of the previous embodiments.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically include computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the above embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit and scope of the present application, and these equivalent modifications or substitutions are included in the scope of the present application as defined in the appended claims.

Claims (18)

1. An image processing method, comprising:
acquiring a compressed image and coding information corresponding to the compressed image;
extracting the characteristics of the compressed image according to the coding information to obtain local characteristic information and global characteristic information;
calculating a first attention weight corresponding to the local feature information and a second attention weight corresponding to the global feature information;
weighting and fusing the local feature information, the global feature information, the first attention weight and the second attention weight to obtain fused feature information;
and obtaining image residual information according to the fusion characteristic information, and superposing the compressed image and the image residual information to obtain a reconstructed image.
2. The image processing method according to claim 1, wherein the feature extraction of the compressed image according to the encoding information to obtain local feature information and global feature information includes:
Obtaining feature information to be restored corresponding to the compressed image according to the coding information;
and extracting the characteristics of the characteristic information to be recovered to obtain local characteristic information and global characteristic information.
3. The image processing method according to claim 2, wherein the obtaining feature information to be restored corresponding to the compressed image from the encoded information includes:
and carrying out convolution operation on the compressed image and the coding information to obtain the characteristic information to be restored corresponding to the compressed image.
4. The image processing method according to claim 2, wherein the obtaining feature information to be restored corresponding to the compressed image from the encoded information includes:
performing convolution operation on the compressed image and the coding information to obtain initial characteristic information;
and optimizing the initial characteristic information to obtain the characteristic information to be restored corresponding to the compressed image.
5. The image processing method according to claim 2, wherein the feature extraction of the feature information to be restored to obtain local feature information includes:
and extracting the characteristics of the characteristic information to be recovered through at least one pair of cascaded convolutional neural networks and an activation function to obtain local characteristic information.
6. The image processing method according to claim 2, wherein the feature extraction of the feature information to be restored to obtain global feature information includes:
and inputting the characteristic information to be recovered to a Transformer network so that the Transformer network outputs global characteristic information based on the characteristic information to be recovered.
7. The image processing method according to claim 6, wherein the Transformer network is provided with a dimension reduction module, a shift window and a dimension increase module; the inputting the feature information to be recovered to a Transformer network, so that the Transformer network obtains global feature information based on the feature information to be recovered, including:
performing dimension reduction processing on the feature information to be restored through the dimension reduction module to obtain input feature information;
inputting the input characteristic information into the shift window for characteristic extraction to obtain output characteristic information;
and carrying out dimension lifting processing on the output characteristic information through the dimension lifting module to obtain global characteristic information.
8. The image processing method according to claim 1, wherein the calculating of the first attention weight corresponding to the local feature information and the second attention weight corresponding to the global feature information includes:
Carrying out fusion extraction on the local characteristic information and the global characteristic information to obtain key map information;
extracting the local feature information to obtain first query graph information corresponding to the local feature information;
extracting the global feature information to obtain second query graph information corresponding to the global feature information;
calculating the key map information and the first query map information to obtain a first spatial attention weight corresponding to the local feature information;
and calculating the key map information and the second query map information to obtain a second spatial attention weight corresponding to the global feature information.
9. The image processing method according to claim 1, wherein the calculating of the first attention weight corresponding to the local feature information and the second attention weight corresponding to the global feature information includes:
splicing the local characteristic information and the global characteristic information to obtain spliced characteristic information;
global pooling is carried out on the splicing characteristic information to obtain pooled splicing characteristic information;
calculating the pooled spliced characteristic information through a first full-connection layer to obtain a first channel attention weight corresponding to the local characteristic information;
And calculating the pooled spliced characteristic information through a second full-connection layer to obtain a second channel attention weight corresponding to the global characteristic information.
10. The image processing method according to claim 1, wherein the weighting and fusing the local feature information, the global feature information, the first attention weight, and the second attention weight to obtain fused feature information includes:
performing weighted calculation on the local feature information and the first attention weight to obtain weighted local feature information;
performing weighted calculation on the global feature information and the second attention weight to obtain weighted global feature information;
and fusing the weighted local characteristic information and the weighted global characteristic information to obtain fused characteristic information.
11. The image processing method according to claim 1 or 10, wherein before the weighted fusion of the local feature information, the global feature information, the first attention weight and the second attention weight, the image processing method further comprises:
And carrying out normalization processing on the first attention weight and the second attention weight so that the sum of weights corresponding to each point in the space of the first attention weight and the second attention weight is one.
12. The image processing method according to claim 1, wherein the obtaining image residual information according to the fusion feature information includes:
and performing dimension reduction processing on the fusion characteristic information to obtain image residual information.
13. The image processing method according to any one of claims 1 to 10 and 12, wherein the compressed image includes image information after codec compression or image information in a video sequence after codec compression.
14. The image processing method according to any one of claims 1 to 10 and 12, wherein the coding information includes coding unit division structure information or quantization parameter information.
15. An image processing apparatus comprising:
an acquisition unit configured to acquire a compressed image and encoding information corresponding to the compressed image;
the feature extraction unit is used for carrying out feature extraction on the compressed image according to the coding information to obtain local feature information and global feature information;
An attention weight calculation unit configured to calculate a first attention weight corresponding to the local feature information and a second attention weight corresponding to the global feature information;
the feature fusion unit is used for carrying out weighted fusion on the local feature information, the global feature information, the first attention weight and the second attention weight to obtain fusion feature information;
and the image superposition unit is used for obtaining image residual information according to the fusion characteristic information, and superposing the compressed image and the image residual information to obtain a reconstructed image.
16. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the image processing method according to any one of claims 1 to 14 when executing the computer program.
17. A computer-readable storage medium storing computer-executable instructions for performing the image processing method of any one of claims 1 to 14.
18. A computer program product comprising a computer program or computer instructions, characterized in that the computer program or the computer instructions are stored in a computer readable storage medium, from which the computer program or the computer instructions are read by a processor of a computer device, which processor executes the computer program or the computer instructions, so that the computer device performs the image processing method according to any one of claims 1 to 14.
CN202210379553.7A 2022-04-12 2022-04-12 Image processing method, apparatus, device, storage medium, and program product Pending CN116958759A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210379553.7A CN116958759A (en) 2022-04-12 2022-04-12 Image processing method, apparatus, device, storage medium, and program product
PCT/CN2023/080226 WO2023197784A1 (en) 2022-04-12 2023-03-08 Image processing method and apparatus, device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210379553.7A CN116958759A (en) 2022-04-12 2022-04-12 Image processing method, apparatus, device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN116958759A true CN116958759A (en) 2023-10-27

Family

ID=88328822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210379553.7A Pending CN116958759A (en) 2022-04-12 2022-04-12 Image processing method, apparatus, device, storage medium, and program product

Country Status (2)

Country Link
CN (1) CN116958759A (en)
WO (1) WO2023197784A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117809381B (en) * 2024-03-01 2024-05-14 鹏城实验室 Video action classification method, device, equipment and storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108419094B (en) * 2018-03-05 2021-01-29 腾讯科技(深圳)有限公司 Video processing method, video retrieval method, device, medium and server
CN110544217B (en) * 2019-08-30 2021-07-20 深圳市商汤科技有限公司 Image processing method and device, electronic equipment and storage medium
CN112950463A (en) * 2019-12-11 2021-06-11 香港理工大学深圳研究院 Image super-resolution method, image super-resolution device and terminal equipment
CN112261414B (en) * 2020-09-27 2021-06-29 电子科技大学 Video coding convolution filtering method divided by attention mechanism fusion unit
CN112700392A (en) * 2020-12-01 2021-04-23 华南理工大学 Video super-resolution processing method, device and storage medium
CN112767251B (en) * 2021-01-20 2023-04-07 重庆邮电大学 Image super-resolution method based on multi-scale detail feature fusion neural network
CN112862690B (en) * 2021-03-09 2022-08-30 湖北工业大学 Transformers-based low-resolution image super-resolution method and system
CN113781308A (en) * 2021-05-19 2021-12-10 马明才 Image super-resolution reconstruction method and device, storage medium and electronic equipment
CN113808032B (en) * 2021-08-04 2023-12-15 北京交通大学 Multi-stage progressive image denoising algorithm
CN113709455B (en) * 2021-09-27 2023-10-24 北京交通大学 Multi-level image compression method using transducer
CN113989593A (en) * 2021-10-29 2022-01-28 北京百度网讯科技有限公司 Image processing method, search method, training method, device, equipment and medium
CN114222123B (en) * 2021-12-15 2022-11-15 华南农业大学 System and method for lossy compression and reconstruction of encrypted image with any compression rate
CN114092833B (en) * 2022-01-24 2022-05-27 长沙理工大学 Remote sensing image classification method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
WO2023197784A1 (en) 2023-10-19


Legal Events

Date Code Title Description
PB01 Publication