WO2022068682A1 - Image processing method and apparatus - Google Patents


Info

Publication number
WO2022068682A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
adjacent
decoding device
target
frame
Prior art date
Application number
PCT/CN2021/120193
Other languages
French (fr)
Chinese (zh)
Inventor
王诗淇
孙龙
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
香港城市大学 (City University of Hong Kong)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.) and 香港城市大学 (City University of Hong Kong)
Publication of WO2022068682A1 publication Critical patent/WO2022068682A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/513Processing of motion vectors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/527Global motion vector estimation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/53Multi-resolution motion estimation; Hierarchical motion estimation

Definitions

  • the embodiments of the present application relate to the field of image processing, and in particular, to an image processing method and apparatus.
  • Existing video super-resolution techniques usually use the combined local and global total-variation (CLG-TV) optical flow model algorithm to calculate optical flow velocity vectors between all low-resolution video images in the sequence and the current-frame video image, that is, to perform motion estimation; they obtain the motion-compensated low-resolution video images of the 2T frames and the low-resolution video image of the current frame according to the optical flow velocity vectors, and then use a deep residual network to process the low-resolution video images of the 2T frames and the low-resolution video image of the current frame through an initial stage, a concatenated convolution layer calculation stage, and a residual block calculation stage in turn, and finally reconstruct the high-resolution video image step by step using deconvolution and convolution operations.
  • the embodiments of the present application provide an image processing method, which uses the motion vector information in an encoded code stream to process a frame to be super-resolved into a high-resolution reconstructed frame, avoiding the large amount of computation required for motion estimation and significantly reducing the time for video super-resolution.
  • a first aspect of the embodiments of the present application provides an image processing method, the method includes: a decoding device obtains, from an encoded code stream, motion vector information of a target frame, image information of the target frame, and image information of adjacent frames
  • the target frame and the adjacent frames are images of a first resolution
  • the target frame is an image that needs to undergo super-resolution processing
  • the adjacent frames include images within a preset period before or after the target frame
  • the decoding device generates a reconstructed frame from the image information of the target frame and the image information of the adjacent frames
  • the reconstructed frame is an image of a second resolution
  • the second resolution is greater than the first resolution
  • the motion vector information is used to guide adaptive local sampling between the image information of the adjacent frames and the image information of the target frame.
  • the encoded code stream may be a code stream generated by an image compression coding technique that includes predictive coding based on motion estimation and compensation; the motion estimation and motion compensation algorithms are used to remove redundant information in the temporal domain, that is, the image compression coding technique may apply motion estimation algorithms to determine the motion vector information.
  • the decoding device may directly extract the motion vector information from the encoded code stream and decode the encoded code stream to obtain decoded image information, where the image information may include the image information of the target frame on which super-resolution processing is to be performed and the image information of the adjacent frames within a preset period before or after the target frame.
  • the adjacent frames may include T frames before and after the target frame.
  • the adjacent frames are all images of the first resolution, that is, the low resolution.
  • the decoding device can process the image information of the target frame and the image information of the adjacent frames in combination with the motion vector information to generate a reconstructed frame of the second resolution. The reconstructed frame is a high-resolution image; that is, the second resolution is greater than the above-mentioned first resolution, and the motion vector information is used to guide adaptive local sampling between the image information of the adjacent frames and the image information of the target frame.
  • in the embodiments of the present application, the decoding device does not need to perform the computation-intensive motion estimation process when performing super-resolution, which can greatly reduce the time for video super-resolution.
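The window of adjacent frames described above (the T frames before and after the target frame) can be sketched as follows; the helper name and parameters are illustrative, not from the patent:

```python
def adjacent_frame_indices(target_idx, num_frames, t=2):
    """Indices of the T frames before and after the target frame,
    clipped to the bounds of the sequence (t is the window radius T)."""
    lo = max(0, target_idx - t)
    hi = min(num_frames - 1, target_idx + t)
    return [i for i in range(lo, hi + 1) if i != target_idx]

print(adjacent_frame_indices(5, 10, t=2))  # frames 3, 4, 6, 7
```

Near the start or end of the sequence the window is simply truncated; other boundary policies (padding, mirroring) are equally plausible and not specified here.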
  • the decoding device performs, based on the motion vector information, adaptive local sampling at each position in the image information of the adjacent frames corresponding to the image information of the target frame; a reconstructed frame is generated from the image information of the target frame and the adaptively sampled image information of the adjacent frames.
  • for each local position of the image information of the target frame, the decoding device can, in combination with the motion vector information, adaptively select and sample the image points with high similarity from the multiple image points at the corresponding local position in the image information of the adjacent frames.
  • the decoding device can then perform image reconstruction based on the adaptively sampled image information of the adjacent frames to generate a reconstructed frame, which can reduce the noise present in the motion vector information and improve robustness.
  • the decoding device in the above steps performing, based on the motion vector information, adaptive local sampling at each position in the image information of the adjacent frames corresponding to the image information of the target frame includes: the decoding device generates a target feature pyramid according to the image information of the target frame, and generates an adjacent feature pyramid according to the image information of each adjacent frame.
  • the target feature pyramid includes target features of multiple scales, and each adjacent feature pyramid includes adjacent features of multiple scales.
  • based on the motion vector information, the decoding device performs adaptive local sampling on the adjacent features corresponding to each target feature in each adjacent feature pyramid, taking the position of each target feature as a reference.
  • the feature pyramid of an image is a series of feature sets arranged in a pyramid shape.
  • the feature pyramid is obtained by down-sampling an original feature step by step, so the size is reduced layer by layer.
  • a feature extraction function extracts features from the image information of the target frame and the image information of the adjacent frames to obtain the target features and adjacent features, then generates target features and adjacent features of multiple scales through downsampling, and forms the corresponding feature pyramids.
  • the decoding device may, based on the scale invariance of the feature pyramid, take the position of each target feature at each scale as a benchmark and perform adaptive local sampling on the adjacent features at the corresponding position of each adjacent feature pyramid according to the motion vector information; that is, it refines the search for better-matching features within the adjacent features at the mapped position, improving the feature quality of each adjacent feature.
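The step-by-step downsampling that builds a feature pyramid can be sketched as below. This is a minimal stand-in: 2x2 average pooling replaces the bilinear interpolation described in the patent, and random arrays replace learned features; all names are illustrative.

```python
import numpy as np

def downsample2x(feat):
    """Halve the spatial size by 2x2 average pooling (a simple stand-in
    for the bilinear downsampling described in the text)."""
    c, h, w = feat.shape
    return feat[:, :h // 2 * 2, :w // 2 * 2].reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def build_pyramid(feat, levels=3):
    """Feature pyramid: level 0 is the original feature map; each higher
    level is obtained by downsampling step by step, so size shrinks layer by layer."""
    pyr = [feat]
    for _ in range(levels - 1):
        pyr.append(downsample2x(pyr[-1]))
    return pyr

feat = np.random.rand(8, 32, 32)   # (channels, H, W)
pyr = build_pyramid(feat, levels=3)
print([p.shape for p in pyr])      # each level half the size of the previous
```

Target and adjacent pyramids would be built the same way, from features extracted by weight-sharing networks as described below.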
  • the decoding device in the above steps generates a reconstructed frame according to the image information of the target frame and the image information of the adjacent frames after adaptive local sampling, including: the decoding device converts the target feature pyramid and the Each adjacent feature pyramid after adaptive local sampling is fused to generate a fused feature pyramid, and the fused feature pyramid includes fused features of multiple scales; the decoding device processes the fused feature pyramid to generate a reconstructed frame.
  • the decoding device stacks the adaptively sampled adjacent feature pyramids and the target feature pyramid and convolutionally fuses them into a fused feature pyramid; the fused feature pyramid can then be reconstructed into a high-resolution image.
  • stacking (concat) is the merging of feature channels; that is to say, the number of features (channels) describing the image increases, while the information under each feature does not increase.
  • the above-mentioned decoding device, based on the motion vector information and taking the position of each target feature as a reference, performing adaptive local sampling on the adjacent features of each adjacent feature pyramid corresponding to the position of each target feature includes: for each adjacent feature pyramid, the decoding device finds the second local feature block based on the coordinates of the first local feature block in the target feature and the mapping relationship, included in the motion vector information, between the first local feature block and the second local feature block in the adjacent feature; the decoding device performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of correlation attention coefficients; the set includes a plurality of correlation attention coefficients, where each correlation attention coefficient indicates the similarity between a feature point in the first local feature block and the corresponding feature point in the second local feature block; the decoding device performs a weighted average of the multiple feature points in the second local feature block based on the set of correlation attention coefficients to determine the adaptively sampled adjacent feature pyramid.
  • the attention coefficient is a degree of attention; that is, more attention is attached to the feature points in the adjacent features that are more similar to the target feature.
  • the decoding device extracts the first local feature block at a scale of the target feature pyramid and, based on the mapping relationship indicated by the motion vector information and the coordinates of the first local feature block, extracts the second local feature block from the corresponding coordinates of an adjacent feature pyramid; it then determines the attention coefficient of each feature point in the second local feature block through a two-layer fully connected layer and attaches the corresponding degree of attention to each feature point in the second local feature block, obtaining the adaptively sampled second local feature block. After processing the feature blocks extracted from the adjacent features of all adjacent feature pyramids at all scales corresponding to the target feature pyramid, the adaptively sampled adjacent feature pyramids are obtained.
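A minimal sketch of the attention-weighted sampling described above. A plain dot-product similarity stands in for the two-layer fully connected network of the patent, and a single averaged output point stands in for the full re-sampled block; names and shapes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_local_sample(target_block, adj_block):
    """Re-weight an adjacent local feature block by attention.
    target_block, adj_block: (k*k, c) arrays of flattened feature points.
    Returns the attention coefficients and the attention-weighted average
    of the adjacent feature points."""
    # similarity of each adjacent feature point to the matching target point
    sim = (target_block * adj_block).sum(axis=1)        # (k*k,)
    attn = softmax(sim)                                 # correlation attention coefficients
    # weighted average of the feature points in the second local block
    sampled = (attn[:, None] * adj_block).sum(axis=0)   # (c,)
    return attn, sampled

tgt = np.random.rand(9, 4)   # a 3x3 block with 4 channels, flattened
adj = np.random.rand(9, 4)
attn, sampled = adaptive_local_sample(tgt, adj)
```

The coefficients form a distribution over the block, so more similar adjacent feature points contribute more to the sampled result, matching the "degree of attention" reading above.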
  • the decoding device in the above steps fusing the target feature pyramid and the adaptively sampled adjacent feature pyramids to generate a fused feature pyramid includes: for each adjacent feature pyramid, the decoding device calculates an attention map according to the target feature and the adaptively sampled adjacent features, where the attention map represents the similarity between the adaptively sampled adjacent features and the target feature; the decoding device performs feature enhancement processing on the adaptively sampled adjacent features using the attention map; the decoding device stacks and convolves all feature-enhanced adjacent features and the target features to generate fused features and determine the fused feature pyramid.
  • the decoding device can generate an attention map in the time domain according to the alignment quality of the adjacent frame features, and dynamically fuse the adaptively sampled adjacent frame features with the target frame features by increasing the weight of high-quality adaptively sampled local regions and reducing the weight of low-quality regions.
  • the alignment quality can be expressed by calculating, at each coordinate point, the feature inner product of the adaptively sampled adjacent frame feature and the target feature; the feature inner product can represent the similarity between the adaptively sampled adjacent feature and the target feature. Each feature region is then weighted, for example, by multiplying the feature and the above attention map point by point.
  • the above-mentioned stacking operation is performed on the feature-enhanced adjacent features and the target features, and the fused features are generated by convolution. After performing one feature fusion, the decoding device checks whether there remain adjacent features that have not undergone feature enhancement processing; all adjacent features and target features are fused to generate the fused feature pyramid.
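The per-pixel inner-product attention map and point-wise re-weighting described above can be sketched as follows. This is a hedged stand-in: a sigmoid squashes the inner product into a weight, and the learned convolution that would mix the stacked channels back down is omitted; all names are illustrative.

```python
import numpy as np

def fuse_features(target, neighbors):
    """Temporal fusion sketch.
    target: (c, h, w) target-frame feature; neighbors: list of (c, h, w)
    adaptively sampled (aligned) adjacent-frame features.
    Each neighbor is re-weighted point by point with an attention map
    (per-pixel inner product with the target, squashed by a sigmoid),
    then everything is stacked along the channel axis."""
    enhanced = []
    for n in neighbors:
        inner = (target * n).sum(axis=0)          # (h, w) per-pixel inner product
        attn = 1.0 / (1.0 + np.exp(-inner))       # attention map in (0, 1)
        enhanced.append(n * attn[None])           # point-wise feature enhancement
    # channel-wise stacking (concat); a learned convolution would follow
    return np.concatenate([target] + enhanced, axis=0)

tgt = np.random.rand(4, 8, 8)
fused = fuse_features(tgt, [np.random.rand(4, 8, 8), np.random.rand(4, 8, 8)])
```

Note how the channel count grows with each stacked neighbor, consistent with the description of concat as merging channel counts without adding information per feature.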
  • the decoding device in the above steps generating the target feature pyramid according to the image information of the target frame includes: the decoding device convolves the image information of the target frame and then processes it with first cascaded residual blocks to generate a target feature corresponding to the image information of the target frame; the decoding device generates target features of multiple scales by bilinear interpolation and constructs the target feature pyramid.
  • the residual blocks use skip connections to improve accuracy while adding considerable depth, where a skip connection routes the received input information directly to the output, protecting the integrity of the information.
  • the scale represents the number of pixels in the image.
  • the decoding device extracts a target feature in the image information of the target frame through a feature extraction function, where the feature extraction function includes convolution processing and cascaded residual blocks.
  • the decoding device can reduce the image to different degrees by downsampling the target features through bilinear interpolation, so as to obtain target features of different scales, and then arrange the target features according to the scale to generate the target feature pyramid.
  • the number of pixels in each layer of the pyramid continues to decrease from bottom to top, which can greatly reduce the amount of computation.
  • the decoding device generating an adjacent feature pyramid according to the image information of each adjacent frame includes: the decoding device convolves the image information of each adjacent frame and then processes it with the first cascaded residual blocks to generate an adjacent feature corresponding to the image information of each adjacent frame; the decoding device generates adjacent features of multiple scales by bilinear interpolation and constructs the adjacent feature pyramid.
  • the decoding device simultaneously performs feature extraction on the image information of the adjacent frames through a plurality of feature extraction functions that share weights with the above-mentioned feature extraction function to obtain adjacent features, then reduces the image to different degrees by bilinear-interpolation downsampling to obtain adjacent features of different scales, and arranges the adjacent features by scale to generate the adjacent feature pyramids.
  • the decoding device processing the fused feature pyramid to generate the reconstructed frame includes: the decoding device processes the fused feature pyramid through second cascaded residual blocks to generate an optimized feature pyramid; the decoding device enlarges and convolves the optimized feature pyramid to generate a reconstructed residual signal; the decoding device adds the reconstructed residual signal and an image enlargement result to obtain the reconstructed frame, where the image enlargement result is generated by bilinear interpolation of the image information of the target frame.
  • the decoding device exchanges information among the features of each scale level in the fused feature pyramid generated above through the second cascaded residual blocks.
  • the features of each scale level can be upsampled or downsampled so that they interact at the same scale to optimize the fused features; the optimized fused features are then enlarged and convolved, and added to the image information of the target frame enlarged by bilinear interpolation, yielding the high-resolution reconstructed frame.
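The final reconstruction step, residual signal plus an enlarged copy of the target frame, can be sketched as below. Nearest-neighbour upsampling stands in for the bilinear interpolation of the patent, and the residual is taken as given rather than produced by the residual blocks; names are illustrative.

```python
import numpy as np

def upsample2x(img):
    """Nearest-neighbour 2x upsampling (a stand-in for the bilinear
    interpolation used in the text)."""
    return img.repeat(2, axis=-2).repeat(2, axis=-1)

def reconstruct(target_lr, residual_hr):
    """High-resolution frame = enlarged target frame + reconstructed
    residual signal, as in the final step described above."""
    return upsample2x(target_lr) + residual_hr

lr = np.random.rand(3, 16, 16)       # low-resolution target frame (C, H, W)
res = np.zeros((3, 32, 32))          # residual signal from the network
hr = reconstruct(lr, res)            # (3, 32, 32) reconstructed frame
```

With a zero residual the output is just the enlarged target frame, which makes the division of labour explicit: the network only has to predict the high-frequency detail missing from the interpolated image.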
  • a second aspect of the embodiments of the present application provides a decoding device, where the decoding device has a function of implementing the method of the first aspect or any possible implementation manner of the first aspect.
  • This function can be implemented by hardware or by executing corresponding software by hardware.
  • the hardware or software includes one or more modules corresponding to the above functions, such as a receiving unit and a processing unit.
  • a third aspect of the embodiments of the present application provides a computer device, the computer device includes at least one processor, a storage system, an input/output (I/O) interface, and computer-executable instructions stored in the storage system and runnable on the processor; when the computer-executable instructions are executed by the processor, the processor performs the method according to the first aspect or any possible implementation manner of the first aspect.
  • a fourth aspect of the embodiments of the present application provides a computer-readable storage medium that stores one or more computer-executable instructions; when the computer-executable instructions are executed by a processor, the processor performs the method according to the first aspect or any possible implementation manner of the first aspect.
  • a fifth aspect of the embodiments of the present application provides a computer program product that stores one or more computer-executable instructions; when the computer-executable instructions are executed by a processor, the processor performs the method according to the first aspect or any possible implementation manner of the first aspect.
  • a sixth aspect of an embodiment of the present application provides a chip system, where the chip system includes at least one processor, and the at least one processor is configured to support a decoding device in implementing the functions of the first aspect or any possible implementation manner of the first aspect.
  • the chip system may further include a memory for storing necessary program instructions and data of the decoding device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • the decoding device generates high-resolution reconstructed frames by performing neural network processing on the motion vector information, the target feature pyramid, and the adjacent feature pyramids obtained from the encoded code stream; the encoded code stream already contains a certain amount of motion information, and the computational cost of extracting that motion information from the code stream is negligible, so the time for video super-resolution can be greatly reduced.
  • FIG. 1 is an application scenario of an embodiment of the present application
  • FIG. 2 is a schematic block diagram of a video encoding and decoding system in an embodiment of the present application
  • FIG. 3 is a flowchart of an embodiment of an image processing method in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of extracting local feature blocks in an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a decoding device in an embodiment of the present application.
  • FIG. 6 is an image processing flowchart of the feature extraction module in an embodiment of the present application.
  • FIG. 7 is an image processing flowchart of the flexible alignment module in an embodiment of the present application.
  • FIG. 8 is an image processing flowchart of a multi-frame feature fusion module in an embodiment of the present application.
  • FIG. 9 is an image processing flowchart of the feature super-resolution reconstruction module in an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a decoding device according to an embodiment of the present application.
  • FIG. 12 is another schematic structural diagram of a decoding device in an embodiment of the present application.
  • Embodiments of the present application provide an image processing method and a decoding device, which are used to reduce the time for video super-resolution.
  • the corresponding apparatus may include one or more units, such as functional units, to perform the described method step or steps (eg, one unit performing one or more steps, or multiple units each performing one or more of the steps), even if such unit or units are not explicitly described or illustrated in the figures.
  • the corresponding method may contain one step to perform the functionality of the one or more units (eg, one step performing the functionality of the one or more units, or multiple steps each performing the functionality of one or more of the plurality of units), even if such step or steps are not explicitly described or illustrated in the figures.
  • Video coding generally refers to the processing of sequences of images that form a video or video sequence.
  • the terms "picture", "frame", or "image" may be used as synonyms.
  • Video encoding is performed on the source side and typically involves processing (eg, by compressing) the original video image to reduce the amount of data required to represent the video image for more efficient storage and/or transmission.
  • Video decoding is performed on the destination side and typically involves inverse processing relative to the encoder to reconstruct the video image.
  • the "encoding" of video images referred to in the embodiments should be understood as referring to the "encoding" or "decoding" of video sequences.
  • the combination of the encoding part and the decoding part is also called a codec (encoding and decoding).
  • This embodiment can be applied to the application scenario shown in FIG. 1 .
  • the terminal 11, the server 12, the set-top box 13, and the display 14 are connected through a wireless or wired network, and the terminal 11 can use locally installed application software (APP) to remotely control the display 14. For example, the user can select a video source for playback by operating on the operation interface of the terminal 11; the terminal 11 has the video source encoded by the server 12 and forwarded to the set-top box 13, the set-top box 13 decodes the encoded video source and sends it to the display 14, and the display 14 can then play the decoded video source.
  • FIG. 2 exemplarily shows a schematic block diagram of a video encoding and decoding system to which the embodiments of the present application are applied.
  • a video encoding and decoding system may include an encoding apparatus 21 and a decoding apparatus 22, where the encoding apparatus 21 generates encoded video data.
  • Decoding apparatus 22 may decode the encoded video data generated by encoding apparatus 21 .
  • Various implementations of encoding apparatus 21, decoding apparatus 22, or both may include one or more processors and a memory coupled to the one or more processors.
  • the memory may include, but is not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures accessible by a computer, as described herein.
  • the encoding device 21 and decoding device 22 may comprise various devices, including desktop computers, mobile computing devices, notebook (eg, laptop) computers, tablet computers, set-top boxes, telephone handsets such as so-called "smart" phones, televisions, cameras, display devices, digital media players, video game consoles, in-vehicle computers, wireless communication devices, or the like.
  • although FIG. 2 shows the encoding device 21 and the decoding device 22 as separate devices, a device embodiment may also include both the encoding device 21 and the decoding device 22, or the functionality of both, that is, the encoding device 21 or the corresponding functionality and the decoding device 22 or the corresponding functionality.
  • encoding device 21 or corresponding functionality and decoding device 22 or corresponding functionality may be implemented using the same hardware and/or software, or using separate hardware and/or software, or any combination thereof.
  • the encoding device 21 and the decoding device 22 may be communicatively connected through a link 23 , and the decoding device 22 may receive encoded video data from the encoding device 21 via the link 23 .
  • Link 23 may include one or more media or devices capable of moving encoded video data from encoding apparatus 21 to decoding apparatus 22.
  • link 23 may include one or more communication media that enable encoding device 21 to transmit encoded video data directly to decoding device 22 in real-time.
  • encoding apparatus 21 may modulate the encoded video data according to a communication standard, such as a wireless communication protocol, and may transmit the modulated video data to decoding apparatus 22 .
  • the one or more communication media may include wireless and/or wired communication media, such as radio frequency spectrum or one or more physical transmission lines.
  • the one or more communication media may form part of a packet-based network, such as a local area network, a wide area network, or a global network (eg, the Internet).
  • the one or more communication media may include routers, switches, base stations, or other devices that facilitate communication from encoding device 21 to decoding device 22 .
  • the encoding device 21 includes an encoder 211 , and optionally, the encoding device 21 may further include an image preprocessor 212 and a first communication interface 213 .
  • the encoder 211 , the image preprocessor 212 and the first communication interface 213 may be hardware components in the encoding device 21 , or may be software programs in the encoding device 21 .
  • the image preprocessor 212 is configured to receive the original image data 214 transmitted by an external terminal and to preprocess the original image data 214 to obtain preprocessed image data 215.
  • the preprocessing performed by the image preprocessor 212 may include trimming, color format conversion (eg, from a red-green-blue (RGB) format to a luma-chroma (YUV, where Y stands for luma and UV for chroma) format), toning, or denoising.
  • the image can be regarded as a two-dimensional array or matrix of picture elements.
  • the pixels in the array can also be called sampling points.
  • the number of sampling points in the horizontal and vertical directions (or axes) of an array or image defines the size and/or resolution of the image.
  • three color components are usually employed, i.e. an image can be represented as or contain three arrays of samples.
  • an image includes corresponding arrays of red, green and blue samples.
  • each pixel is usually represented in a luma/chroma format or color space; for example, a YUV-format image includes a luma component indicated by Y (sometimes also indicated by L) and two chrominance components indicated by U and V.
  • the luminance (luma) component Y represents the luminance or gray level intensity (eg, both are the same in a gray scale image), while the two chroma (chroma) components U and V represent the chrominance or color information components.
  • an image in YUV format includes a luma sample array of luma sample values (Y) and two chroma sample arrays of chroma values (U and V). Images in RGB format can be converted or transformed to YUV format and vice versa; the process is also known as color transformation or conversion. If the image is black and white, the image may include only an array of luma samples.
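The RGB-to-YUV conversion mentioned here is a linear transform. A sketch using the BT.601 full-range coefficients follows; this is one common convention, and real video pipelines differ in range and matrix choice:

```python
import numpy as np

# BT.601 full-range RGB -> YUV matrix (one common convention).
RGB2YUV = np.array([[ 0.299,    0.587,    0.114  ],
                    [-0.14713, -0.28886,  0.436  ],
                    [ 0.615,   -0.51499, -0.10001]])

def rgb_to_yuv(rgb):
    """rgb: (..., 3) array in [0, 1]; returns the (..., 3) YUV representation."""
    return rgb @ RGB2YUV.T

def yuv_to_rgb(yuv):
    """Inverse transform back to RGB (color conversion is invertible)."""
    return yuv @ np.linalg.inv(RGB2YUV).T

px = np.array([0.5, 0.2, 0.8])
assert np.allclose(yuv_to_rgb(rgb_to_yuv(px)), px)   # round-trips losslessly
```

For a gray pixel (R = G = B) the chroma components U and V come out (near) zero, consistent with the text's note that a black-and-white image needs only the luma array.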
  • an encoder 211 (or video encoder 211) receives the preprocessed image data 215 and processes the preprocessed image data 215 using a relevant prediction mode (such as the prediction modes in various embodiments herein), thereby providing encoded image data 216.
  • a first communication interface 213 that can be used to receive the encoded image data 216 and to transmit the encoded image data 216 via the link 23 to the decoding device 22 or any other device (eg, memory) for storage or direct reconstruction.
  • the other device may be any device for decoding or storage.
  • the first communication interface 213 may be used, for example, to encapsulate the encoded image data 216 into a suitable format, such as a data packet, for transmission over the link 23 .
  • the decoding device 22 includes a decoder 221 , and optionally, the decoding device 22 may further include a second communication interface 222 and an image post-processor 223 . They are described as follows:
  • a second communication interface 222 may be used to receive encoded image data 216 from the encoding device 21 or any other source, such as a storage device, such as an encoded image data storage device.
  • the second communication interface 222 may be used to transmit or receive the encoded image data 216 via the link 23 between the encoding device 21 and the decoding device 22, such as a direct wired or wireless connection, or via any kind of network, Networks of any kind are, for example, wired or wireless networks or any combination thereof, or private and public networks of any kind, or any combination thereof.
  • the second communication interface 222 may be used, for example, to decapsulate the data packets transmitted by the first communication interface 213 to obtain the encoded image data 216 .
  • Both the second communication interface 222 and the first communication interface 213 may be configured as one-way or two-way communication interfaces, and may be used, for example, to send and receive messages to establish a connection, and to acknowledge and exchange any other information related to the communication link and/or to data transmission, such as encoded image data transmission.
  • Decoder 221 receives encoded image data 216 and provides decoded image data 224 or decoded image 224 .
  • the decoder 221 may be configured to execute various embodiments described later, so as to realize the application of the image processing method described in this application on the decoding side.
  • the post-processing performed by the image post-processor 223 may include color format conversion (eg, from YUV format to RGB format), toning, trimming or resampling, or any other processing, and the post-processed image data may be transmitted to an external display device for playback.
  • the display device may be or include any type of display for presenting the reconstructed image, eg, an integrated or external display or monitor.
  • displays may include liquid crystal displays (LCDs), organic light emitting diode (OLED) displays, plasma displays, projectors, micro-LED displays, liquid crystal on silicon (LCoS), digital light processors (DLPs), or any other kind of display.
  • Although FIG. 2 depicts the encoding device 21 and the decoding device 22 as separate devices, device embodiments may also include both at the same time, ie both the encoding device 21 or corresponding functionality and the decoding device 22 or corresponding functionality.
  • encoding device 21 or corresponding functionality and decoding device 22 or corresponding functionality may be implemented using the same hardware and/or software, or using separate hardware and/or software, or any combination thereof.
  • the encoding device 21 and the decoding device 22 may include any of a variety of devices, including any class of handheld or stationary devices, such as notebook or laptop computers, mobile phones, smart phones, tablets or tablet computers, video cameras, desktop computers, set-top boxes, televisions, cameras, in-vehicle devices, display devices, digital media players, video game consoles, video streaming devices (such as content serving servers or content distribution servers), broadcast receiver devices, broadcast transmitter devices, etc., and may use no operating system or any kind of operating system.
  • Both the encoder 211 and the decoder 221 may be implemented as any of a variety of suitable circuits, eg, one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, or any combination thereof.
  • an apparatus may store instructions for the software in a suitable non-transitory computer-readable storage medium and may execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Any of the foregoing (including hardware, software, a combination of hardware and software, etc.) may be considered one or more processors.
  • the video encoding and decoding system shown in FIG. 2 is merely an example, and the techniques of this application may be applicable to video encoding setups (eg, video encoding or video decoding) that do not necessarily involve any data communication between the encoding and decoding devices.
  • data may be retrieved from local storage, streamed over a network, and the like.
  • a video encoding device may encode and store data to memory, and/or a video decoding device may retrieve and decode data from memory.
  • encoding and decoding is performed by devices that do not communicate with each other but only encode data to and/or retrieve data from memory and decode data.
  • the original video image can be reconstructed, ie the reconstructed video image has the same quality as the original video image (eg no transmission loss or other data loss during storage or transmission).
  • the super-resolution algorithm used in the process of reconstructing the video image at the decoder side requires motion estimation, and motion estimation consumes a large amount of computing resources.
  • the embodiment of the present application provides a corresponding image processing method.
  • the method includes the decoding device acquiring the motion vector information of the target frame, the image information of the target frame and the image information of adjacent frames, where the target frame and the adjacent frames are images of a first resolution, the target frame is an image that needs to be processed by super-resolution, and the adjacent frames include images located in a preset period before or after the target frame;
  • the decoding device generates a reconstructed frame according to the motion vector information, the image information of the target frame and the image information of the adjacent frames; the reconstructed frame is an image of a second resolution, the second resolution is greater than the first resolution, and the motion vector information is used to perform adaptive local sampling on the image information of the adjacent frames relative to the image information of the target frame. In this way, the present application uses the motion vector information of the target frame already present in the encoded code stream to improve the resolution of the reconstructed frame, saving the resources that re-estimating the motion vector information would consume.
  • the implementation of the embodiment of the present application may also be: the decoding device, based on the motion vector information, performs adaptive local sampling at each position in the image information of the adjacent frames corresponding to the image information of the target frame; the decoding device then generates a reconstructed frame according to the image information of the target frame and the image information of the adjacent frames after adaptive local sampling.
  • image points with high similarity can be selected for sampling through adaptive local sampling, so as to reduce the influence of noise in the motion vector information, and improve the robustness of super-resolution.
  • an embodiment of the image processing method of the present application includes:
  • the decoding device acquires the image information of the target frame and the image information of the adjacent frames in the encoded code stream.
  • after receiving the encoded code stream sent by the server, the decoder can decode the encoded video to obtain the image information of the target frame and the image information of adjacent frames.
  • the target frame is an image that needs to be subjected to super-resolution processing in this embodiment.
  • Super-resolution means improving the resolution of the original image by means of hardware or software, that is, obtaining a high-resolution image from a series of low-resolution images; the process of obtaining the high-resolution image is super-resolution reconstruction.
  • the decoding device can obtain the image information of the adjacent frames.
  • the period T can be preset or changed according to actual needs.
  • for boundary conditions of the sequence, such as the first frame or the last frame, the input can be made to meet the needs of the network by repeating the existing adjacent frames.
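The boundary handling described above (repeating existing adjacent frames at the start or end of the sequence) can be implemented by clamping frame indices. The helper below is an illustrative sketch, not part of the embodiment:

```python
def neighbor_indices(t, num_frames, T):
    """Indices of the 2T neighbours of frame t; out-of-range indices are
    clamped, which repeats existing frames at sequence boundaries."""
    return [min(max(t + d, 0), num_frames - 1)
            for d in range(-T, T + 1) if d != 0]

# First frame of a 5-frame sequence with T = 2: the missing left
# neighbours are filled by repeating frame 0.
idx = neighbor_indices(0, 5, 2)   # [0, 0, 1, 2]
```

With this clamping, the network always receives exactly 2T adjacent frames regardless of the target frame's position in the sequence.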
  • the encoded code stream can be a code stream generated by an image compression coding technique including predictive coding based on motion estimation and compensation. The motion estimation and motion compensation algorithms are used to remove redundant information in the temporal domain; that is, the image compression coding technique can apply a motion estimation algorithm to determine the motion vector information.
  • the target frame and adjacent frames in this embodiment are images of the first resolution, and the resolution indicated by the first resolution specifically refers to the low resolution to which the image decoded by the encoded code stream belongs.
  • the decoding device extracts target features from the image information of the target frame, and extracts adjacent features from the image information of adjacent frames.
  • the decoding device includes multiple feature extraction functions that share weights. After receiving the image information of the target frame and the image information of the adjacent frames sent by the decoder, the decoding device can use each feature extraction function to extract image features from the image information of the target frame and the image information of each adjacent frame, so as to generate a target feature corresponding to the image information of the target frame and an adjacent feature corresponding to the image information of each adjacent frame.
  • the above-mentioned image feature may be image texture information of the image.
  • the above feature extraction function consists of a convolutional layer and several cascaded residual blocks.
  • the decoding device constructs a target feature pyramid from the target feature, and constructs an adjacent feature pyramid from the adjacent features.
  • the decoding device continuously reduces the image size through filtering and bilinear interpolation downsampling to obtain target features of different scales, and then arranges the target features of different scales by scale to generate the target feature pyramid.
  • the bottom image of the target feature pyramid corresponds to the original target feature.
  • after one downsampling, a 2-level target feature pyramid can be formed, and so on, to form a multi-level target feature pyramid.
  • the pyramid structure may be a Gaussian pyramid, a Laplacian pyramid, or a wavelet pyramid, etc., which is not specifically limited here.
  • the decoding device continuously reduces the image size corresponding to the adjacent features of the image information of each adjacent frame by means of filtering and bilinear interpolation downsampling, so as to obtain adjacent features of different scales, and then arranges the corresponding adjacent features of different scales by scale to generate the adjacent feature pyramids. Exemplarily, based on the 2T adjacent frames before and after the target frame, the embodiment of the present application may correspondingly generate 2T adjacent feature pyramids.
  • The feature pyramid is a basic component in multi-scale object detection systems.
  • the feature pyramid of an image is a series of feature sets arranged in a pyramid shape; it is obtained by downsampling the original feature step by step, so the size is reduced layer by layer.
  • the feature pyramid has a certain scale invariance, and this characteristic enables the decoding device of this embodiment to handle image content over a large range of scales.
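The pyramid construction described above can be sketched minimally as follows, with 2x2 block averaging standing in for the filtering and bilinear-interpolation downsampling (a simplification; the embodiment does not fix the exact filter):

```python
import numpy as np

def downsample2x(feat):
    """2x downsampling by averaging 2x2 blocks -- a simple stand-in for the
    filtering + bilinear interpolation described in the embodiment."""
    h, w = feat.shape[0] // 2 * 2, feat.shape[1] // 2 * 2
    f = feat[:h, :w]
    return (f[0::2, 0::2] + f[1::2, 0::2] + f[0::2, 1::2] + f[1::2, 1::2]) / 4.0

def build_pyramid(feat, levels):
    """Level 0 holds the original feature; each higher level halves the size,
    and the levels arranged by scale form the feature pyramid."""
    pyr = [feat]
    for _ in range(levels - 1):
        pyr.append(downsample2x(pyr[-1]))
    return pyr

pyr = build_pyramid(np.random.rand(32, 32), 3)   # 32x32, 16x16, 8x8
```

The same construction, applied once to the target feature and once per adjacent feature, yields the target feature pyramid and the 2T adjacent feature pyramids.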
  • the decoding device determines the position mapping relationship between the target feature and the adjacent feature according to the motion vector information.
  • the decoding device can directly determine the motion vector information from the encoded code stream, and determine the position mapping relationship between the target feature and the adjacent features indicated by the motion vector information. The position mapping relationship corresponds to the above-mentioned target feature pyramid and adjacent feature pyramids: each coordinate in the target feature of each scale in the target feature pyramid has a corresponding coordinate in the adjacent feature of the same scale in the adjacent feature pyramid.
  • the decoding device searches for the second local feature block of the adjacent feature according to the coordinates of the first local feature block of the target feature and the position mapping relationship.
  • for each scale in the adjacent feature pyramid, the same flexible alignment operation will be performed.
  • the specific operation is as follows: as in the schematic diagram of local feature block extraction shown in FIG. 4, under the guidance of the motion vector information, that is, the position mapping relationship 41, each coordinate of the target feature 42 in the target feature pyramid has a corresponding coordinate in the adjacent feature 223 of any adjacent feature pyramid.
  • the decoding device respectively extracts the first local feature block 421 corresponding to the target feature and the second local feature block 2231 corresponding to the adjacent feature from the two corresponding coordinates.
  • the decoding device performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of relevant attention coefficients.
  • the decoding device rearranges the first local feature block of a target feature and the second local feature block of an adjacent feature to form two one-dimensional feature vectors, then merges the two one-dimensional feature vectors through a concatenation operation and feeds them into a two-layer fully connected layer to generate an attention vector with the same length as the number of pixel values contained in a local feature block.
  • the decoding device rearranges the attention vector to obtain a set of related attention coefficients between the two local feature blocks.
  • the above-mentioned set of relevant attention coefficients includes a plurality of relevant attention coefficients, wherein each relevant attention coefficient can indicate the similarity between a feature point in the first local feature block and a corresponding feature point in the second local feature block;
  • the decoding device performs a weighted average of a plurality of feature points in the second local feature block based on the relevant attention coefficient set to determine the adjacent feature pyramid after adaptive local sampling.
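The block extraction, fully connected matching and weighted averaging described in the steps above can be sketched as follows. The two-layer fully connected network uses random weights as stand-ins for trained parameters, and the softmax that turns the attention vector into averaging weights is an assumption not stated in the source:

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_block(feat, y, x, k=3):
    """k x k local feature block centred at (y, x), clamped at the edges."""
    h, w = feat.shape
    ys = np.clip(np.arange(y - k // 2, y + k // 2 + 1), 0, h - 1)
    xs = np.clip(np.arange(x - k // 2, x + k // 2 + 1), 0, w - 1)
    return feat[np.ix_(ys, xs)]

def attention_coefficients(block_t, block_n, w1, b1, w2, b2):
    """Flatten the block pair, concatenate, pass through a two-layer fully
    connected network, and normalise (softmax, an assumption here) into
    one averaging weight per block pixel."""
    v = np.concatenate([block_t.ravel(), block_n.ravel()])  # two 1-D vectors, concatenated
    h = np.maximum(w1 @ v + b1, 0.0)                        # hidden layer with ReLU
    a = w2 @ h + b2                                         # one coefficient per block pixel
    a = np.exp(a - a.max())
    return (a / a.sum()).reshape(block_t.shape)

k = 3
target = rng.random((8, 8))
neighbor = rng.random((8, 8))
w1 = rng.standard_normal((16, 2 * k * k))   # stand-ins for trained FC weights
b1 = np.zeros(16)
w2 = rng.standard_normal((k * k, 16))
b2 = np.zeros(k * k)

bt = extract_block(target, 4, 4)            # first local feature block
bn = extract_block(neighbor, 4, 3)          # second block, at the coordinate given by the motion vector
coeff = attention_coefficients(bt, bn, w1, b1, w2, b2)
sampled = (coeff * bn).sum()                # point-by-point multiply, then sum
```

Because the coefficients are non-negative and sum to one, `sampled` is a weighted average of the neighbour block's feature points, which is the adaptive local sampling described above.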
  • the position mapping relationship provided by the above motion vector information is not necessarily completely real object motion, and may contain coding noise.
  • the relevant attention coefficients allow the network in the decoding device to refine the search for better-matching features in the local neighborhood of the mapped location.
  • the multiple relevant attention coefficients and the local feature block of the adjacent feature can be multiplied point by point and then summed, so as to realize the sampling of the second local feature block.
  • the relevant attention coefficient indicates the importance weight of each feature in the local feature block of the adjacent feature, and is used to improve the feature quality of the adjacent feature after adaptive local sampling.
  • when the decoding device performs one adaptive local sampling, it only operates on one local feature block in one target feature and one local feature block in one adjacent feature. The local feature blocks at other coordinates in the 2T adjacent feature pyramids and the corresponding local feature blocks of the target feature also need to be adaptively sampled with the same steps, so that adaptive local sampling is performed on all adjacent feature pyramids.
  • the decoding device calculates an attention map based on the target feature and the adaptively locally sampled adjacent features.
  • after the decoding device performs adaptive local sampling on the adjacent feature pyramids, it can generate an attention map in the time domain according to the adaptive local sampling quality of the adjacent frame features.
  • the decoding device can calculate, at each coordinate point, the feature inner product of the adaptively locally sampled adjacent frame feature and the target feature; the feature inner product represents the similarity between the adaptively locally sampled adjacent feature and the target feature at that point, and the similarity also represents, to a certain extent, the adaptive local sampling quality at that point.
  • the decoding device can obtain an attention map with the same size as the feature size through the above adaptive local sampling quality.
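The temporal attention map above can be sketched as a per-coordinate channel inner product. The sigmoid that squashes the similarity into a (0, 1) weight is an assumption, since the embodiment only specifies the inner product:

```python
import numpy as np

def temporal_attention_map(target_feat, aligned_feat):
    """Per-coordinate inner product over the channel axis of C x H x W
    features; a sigmoid (an assumption here) maps the similarity into a
    (0, 1) weight of the same spatial size as the features."""
    inner = (target_feat * aligned_feat).sum(axis=0)   # H x W similarity
    return 1.0 / (1.0 + np.exp(-inner))

rng = np.random.default_rng(1)
t = rng.standard_normal((4, 8, 8))   # target feature, C=4 channels
n = rng.standard_normal((4, 8, 8))   # adaptively sampled adjacent feature
att = temporal_attention_map(t, n)   # 8 x 8 attention map
```

A perfectly aligned neighbour (identical to the target) gives a non-negative inner product everywhere, hence weights of at least 0.5, matching the idea that better-aligned regions receive higher attention.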
  • the decoding device performs feature enhancement processing on the adjacent features after adaptive local sampling and the attention map.
  • the decoding device can dynamically fuse the features of the adjacent frame and the current frame after adaptive local sampling by increasing the weight of the high-quality local area and reducing the weight of the low-quality local area.
  • the decoding device multiplies the adjacent features in the adaptively locally sampled adjacent feature pyramid point by point with the above-mentioned attention map, so that the regions on the adjacent features that are more similar to the target feature are adaptively allocated higher attention, that is, a higher weight for their possible contribution to the super-resolution result; this enhances the required features and suppresses possible interference such as mismatches.
  • the decoding device performs stacking and convolution calculations on all adjacent features and target features after feature enhancement processing to generate fusion features, and determines a fusion feature pyramid.
  • the decoding device can stack the adjacent features after feature enhancement processing with the target features, that is, superimpose the enhanced adjacent features on the target features, and then obtain the fused features through a convolutional layer.
  • the fused feature pyramid can be obtained.
  • after the decoding device performs feature fusion once, it needs to detect whether there are adjacent features without feature enhancement processing. If so, it needs to perform the above feature enhancement processing on those adjacent features and the target feature, until all adjacent features are enhanced and fused with the target feature and a complete fused feature pyramid is determined.
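The enhancement-and-fusion step above can be sketched as follows, with a random 1x1 convolution (per-pixel channel mixing) standing in for the trained convolutional layer:

```python
import numpy as np

def fuse(target_feat, neighbor_feats, att_maps, conv_w):
    """Weight each aligned neighbour by its attention map (feature
    enhancement), stack everything on the channel axis, and mix with a
    1x1 convolution; conv_w is a stand-in for the trained layer."""
    enhanced = [f * a[None] for f, a in zip(neighbor_feats, att_maps)]
    stacked = np.concatenate([target_feat] + enhanced, axis=0)  # (C_total, H, W)
    # A 1x1 convolution is a per-pixel linear mix of channels.
    return np.einsum('oc,chw->ohw', conv_w, stacked)

rng = np.random.default_rng(2)
C, H, W = 4, 8, 8
target = rng.standard_normal((C, H, W))
neighbors = [rng.standard_normal((C, H, W)) for _ in range(2)]
atts = [rng.random((H, W)) for _ in range(2)]
conv_w = rng.standard_normal((C, 3 * C))    # mixes 3 stacked features back to C channels
fused = fuse(target, neighbors, atts, conv_w)
```

When an attention map is zero everywhere, that neighbour contributes nothing to the fused feature, which is the "reduce the weight of low-quality local areas" behaviour described above.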
  • the decoding device generates an optimized feature pyramid by calculating the fusion feature pyramid through the second cascaded residual block.
  • the decoding device uses the cascaded residual blocks with scale fusion to reconstruct the features of the fused feature pyramid.
  • at the end of the skip connection of the above second cascaded residual block, additional upsampling or downsampling operations are added, so that the feature reconstruction residuals of different scales can fully exchange information, enhancing the quality of the reconstructed features and yielding optimized fused features.
  • the optimized feature pyramid can be determined.
  • the decoding device performs size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal.
  • the decoding device adds the reconstructed residual signal and the image upscaling result to obtain a reconstructed frame.
  • the decoding device may expand the size of the image information of the target frame by performing an up-sampling operation of bilinear interpolation on the image information of the target frame.
  • the decoding device may add the reconstructed residual signal and the image information of the up-sampled target frame to obtain the image information of the reconstructed frame, and determine the reconstructed frame.
  • the reconstructed frame is the target frame after super-resolution.
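The final reconstruction steps above (size expansion of the target frame plus the predicted residual) can be sketched as follows; nearest-neighbour upsampling stands in for the bilinear interpolation of the embodiment, and the residual is random data for illustration:

```python
import numpy as np

def upsample2x_nearest(img):
    """Size expansion; nearest-neighbour here as a simple stand-in for the
    bilinear-interpolation upsampling described in the embodiment."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def reconstruct(target_lr, residual_hr):
    """Reconstructed frame = upscaled low-resolution target frame
    + reconstructed residual signal."""
    return upsample2x_nearest(target_lr) + residual_hr

rng = np.random.default_rng(3)
lr = rng.random((8, 8))                     # low-resolution target frame
res = rng.standard_normal((16, 16)) * 0.01  # predicted residual (illustrative)
hr = reconstruct(lr, res)                   # 16 x 16 reconstructed frame
```

Predicting only the residual rather than the full high-resolution frame is what lets the network focus its capacity on the high-frequency detail missing from the upscaled image.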
  • the decoding device generates high-resolution reconstructed frames by performing neural network processing on the motion vector information, the target feature pyramid and the adjacent feature pyramids obtained from the encoded code stream. The encoded code stream already contains a certain amount of motion information, and the computational cost of extracting motion information from the code stream is negligible, all of which can greatly reduce the time of video super-resolution.
  • the super-resolution process in the image processing in the embodiment of the present application may be implemented by a pre-trained network model.
  • FIG. 5 is a schematic diagram of the architecture of the decoding device in the embodiment of the present application.
  • the decoding device 22 may include a decoder 221 , a network model 222 , a graphics processing unit (graphics processing unit, GPU) memory 223 and an output buffer 224 . They are described as follows:
  • the decoder 221 is a device that performs restoration and decoding operations on the encoded encoded code stream.
  • the decoder 221 may be a video decoder that supports encoding and decoding standards such as H.264, high efficiency video coding (HEVC) or versatile video coding (VVC), for example, an HEVC decoder.
  • the decoder 221 in this embodiment of the present application adds a motion vector information output interface.
  • the network model 222 in the product implementation form of the embodiment of the present application is a program code included in the machine learning and deep learning platform software and deployed on the decoding device.
  • the program codes of the embodiments of the present application exist outside the existing decoder 221 .
  • the network model 222 may be generated by supervised training of the data of the decoded video at low resolution and its unencoded high-resolution video by a machine learning method.
  • the network model 222 is designed with a feature extraction module 2221 , a flexible alignment module 2222 , a multi-frame feature fusion module 2223 and a feature super-score reconstruction module 2224 in this embodiment of the present application.
  • the network model 222 first contains a feature extraction module 2221, which aims to transform the input decoded frames from the pixel domain to the feature domain, since image features have important physical meanings in deep learning methods.
  • Flexible alignment module 2222: this module receives the motion vector extracted from the code stream by the decoder 221, and uses it as a guide in a multi-scale local attention mechanism to achieve flexible alignment of adjacent frames at the feature level.
  • Multi-frame feature fusion module 2223: this module receives the aligned adjacent frame features and the current frame features, and uses an attention mechanism in the time domain to complete the feature fusion operation.
  • Feature super-resolution reconstruction module 2224: this module receives the fused image features, and uses cascaded multi-scale fusion residual blocks and sub-pixel convolution to complete the super-resolution reconstruction of the decoded video, generating a reconstructed frame.
  • the GPU memory 223 is for the execution of program codes that support the computation of each module in the network model 222 .
  • the output buffer 224 receives and saves the reconstructed frames output by the network model 222 .
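The data flow through the four modules of network model 222 can be sketched as a simple composition of callables; the lambdas below are trivial placeholders for the trained sub-networks, purely to make the flow runnable:

```python
def super_resolve(decoded_frames, motion_vectors, t,
                  extract, align, fuse, reconstruct):
    """Sketch of the pipeline of network model 222: feature extraction ->
    flexible alignment -> multi-frame feature fusion -> super-resolution
    reconstruction. The four callables stand in for the trained modules."""
    target = extract(decoded_frames[t])
    neighbors = [extract(f) for i, f in enumerate(decoded_frames) if i != t]
    aligned = [align(target, n, motion_vectors) for n in neighbors]
    fused = fuse(target, aligned)
    return reconstruct(fused)

# Toy run with placeholder modules: identity alignment/reconstruction and
# additive fusion, on scalar "frames".
frames = [[1.0], [2.0], [3.0]]
out = super_resolve(frames, None, 1,
                    extract=lambda f: f[0],
                    align=lambda t, n, mv: n,
                    fuse=lambda t, ns: t + sum(ns),
                    reconstruct=lambda x: x)
```

Note how the motion vectors enter only the alignment stage, mirroring the description above in which the decoder's motion vector output interface feeds the flexible alignment module.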
  • the embodiments of the present application are implemented on the open-source Python-first deep learning framework (PyTorch) machine learning platform, and run on a decoding unit equipped with an NVIDIA GPU card, to implement the HEVC-standard decoded-video super-resolution program code.
  • the NVIDIA GPU card provides computing acceleration capabilities through the unified computing device architecture (compute unified device architecture, CUDA) programming interface.
  • the inference process of the network model on the distributed PyTorch machine learning platform can thus be accelerated, and the trained model can directly perform end-to-end reconstruction from decoded video containing compression noise.
  • the feature extraction module converts the target frame and adjacent frames output by the decoder from the pixel domain to the feature domain, as follows:
  • the feature extraction module obtains image information of the target frame and image information of adjacent frames from the decoder.
  • after receiving the encoded code stream sent by the source device, the decoder can decode the encoded video to obtain the image information of the target frame and the image information of adjacent frames, and the feature extraction module can receive the image information of the target frame and the image information of the adjacent frames transmitted by the decoder.
  • the target frame is an image that needs to be subjected to super-resolution processing in this embodiment.
  • Super-resolution means improving the resolution of the original image by means of hardware or software, that is, obtaining a high-resolution image from a series of low-resolution images; the process of obtaining the high-resolution image is super-resolution reconstruction.
  • when the images in the preset period before and after the target frame are also decoded, exemplarily, when the 2T adjacent frames before and after the target frame are also decoded, the decoder outputs the image information of the adjacent frames.
  • the period T can be preset or changed according to actual needs. For boundary conditions of the sequence, such as the first frame or the last frame, the input can be made to meet the needs of the network by repeating the existing adjacent frames.
  • the encoded code stream can be a code stream generated by an image compression coding technique including predictive coding based on motion estimation and compensation. The motion estimation and motion compensation algorithms are used to remove redundant information in the temporal domain; that is, the image compression coding technique can apply a motion estimation algorithm to determine the motion vector information.
  • the target frame and adjacent frames in this embodiment are images of the first resolution, and the resolution indicated by the first resolution specifically refers to the low resolution to which the image decoded by the encoded code stream belongs.
  • the feature extraction module extracts target features from image information of the target frame, and extracts adjacent features from image information of adjacent frames.
  • the feature extraction module constructs a target feature pyramid from the target feature, and constructs an adjacent feature pyramid from the adjacent features.
  • Steps 602 to 603 are similar to steps 302 to 303 of the image processing method shown in FIG. 3 , and details are not repeated here.
  • the flexible alignment module realizes flexible alignment of adjacent frames according to the target feature pyramid and the adjacent feature pyramids output by the feature extraction module, as follows:
  • the flexible alignment module receives the motion vector information from the decoder, and the target feature pyramid and neighboring feature pyramids from the feature extraction module.
  • the encoded code stream may be a code stream generated by an image compression coding technique including predictive coding based on motion estimation and compensation. The motion estimation and motion compensation algorithms are used to remove redundant information in the temporal domain; that is, the image compression coding technique may apply a motion estimation algorithm to determine the motion vector information.
  • the decoder can extract the motion vector information from the encoded code stream and send it to the flexible alignment module.
  • the flexible alignment module can also receive the target feature pyramid and adjacent feature pyramid described in Figure 3 from the feature extraction module.
  • the flexible alignment module determines the position mapping relationship between the target feature and the adjacent feature according to the motion vector information.
  • the flexible alignment module searches for the second local feature block of the adjacent feature according to the coordinates of the first local feature block of the target feature and the position mapping relationship.
  • the flexible alignment module performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of relevant attention coefficients.
  • the flexible alignment module performs a weighted average on the second local feature block based on the set of relevant attention coefficients to determine the adjacent feature pyramid after adaptive local sampling.
  • Steps 702-705 are similar to steps 304-307 in the image processing method shown in FIG. 3 , and details are not repeated here.
  • as in the image processing flow chart of the multi-frame feature fusion module shown in FIG. 8, the multi-frame feature fusion module processes each adaptively locally sampled adjacent feature pyramid output by the flexible alignment module, as follows:
  • the multi-frame feature fusion module receives the adaptively locally sampled adjacent feature pyramids from the flexible alignment module.
  • after the flexible alignment module performs adaptive local sampling on the adjacent feature pyramids, it sends the adaptively locally sampled adjacent feature pyramids to the multi-frame feature fusion module.
  • the multi-frame feature fusion module calculates an attention map according to the target feature and the adjacent features after adaptive local sampling.
  • the multi-frame feature fusion module performs feature enhancement processing on the adjacent features after adaptive local sampling and the attention map.
  • the multi-frame feature fusion module stacks all the feature-enhanced adjacent features with the target features and performs convolution calculations to generate fusion features, and determines a fusion feature pyramid.
  • Steps 802-804 are similar to steps 308-310 in the image processing method shown in FIG. 3 , and details are not repeated here.
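The fusion steps above (attention map, feature enhancement, stacking and convolution) can be sketched as follows. This is a hedged NumPy toy under assumed channel-first shapes: the per-pixel inner product with a sigmoid stands in for the attention map computation, and a fixed random 1x1 projection stands in for the learned fusion convolution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_frames(target, aligned_neighbors):
    """Weight each aligned neighbour feature map by a per-pixel attention
    map (channel-wise inner product with the target feature, squashed by
    a sigmoid), then stack target and enhanced neighbours along the
    channel axis and mix back to the target's channel count with a 1x1
    convolution (a random projection standing in for learned weights)."""
    c, h, w = target.shape
    enhanced = []
    for nb in aligned_neighbors:
        att = sigmoid((target * nb).sum(axis=0, keepdims=True))  # (1,H,W)
        enhanced.append(nb * att)                 # feature enhancement
    stacked = np.concatenate([target] + enhanced, axis=0)  # channel concat
    k = stacked.shape[0]
    rng = np.random.default_rng(0)
    w1x1 = rng.normal(size=(c, k)) / np.sqrt(k)   # 1x1 conv kernel (c, k)
    return np.einsum('ok,khw->ohw', w1x1, stacked)
```

Note how the channel concatenation increases the channel count from `c` to `c * (1 + number of neighbours)` before the 1x1 convolution reduces it back.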
  • the feature super-resolution reconstruction module receives the fused feature pyramid from the multi-frame feature fusion module.
  • the feature super-resolution reconstruction module processes the fused feature pyramid through the second cascaded residual block to generate an optimized feature pyramid.
  • the feature super-resolution reconstruction module performs size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal.
  • the feature super-resolution reconstruction module adds the reconstructed residual signal to the image enlargement result to obtain a reconstructed frame.
  • Steps 902-904 are similar to steps 311-313 in the image processing method shown in FIG. 3 , and details are not repeated here.
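The reconstruction steps above amount to adding a predicted residual to a bilinearly enlarged target frame. A minimal sketch, assuming a 2x super-resolution factor and treating the residual signal as given (the cascaded residual blocks and size-expansion convolutions that would predict it are not modeled):

```python
import numpy as np

def upsample2x(img):
    """2x enlargement by bilinear interpolation (align-corners-false
    style sample positions), used for the image enlargement result."""
    h, w = img.shape
    ys = np.clip((np.arange(2 * h) + 0.5) / 2 - 0.5, 0, h - 1)
    xs = np.clip((np.arange(2 * w) + 0.5) / 2 - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int); x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def reconstruct(target_lr, residual_hr):
    """Reconstructed frame = bilinearly enlarged target frame plus the
    residual signal produced by the reconstruction network."""
    return upsample2x(target_lr) + residual_hr
```

The residual formulation means the network only has to predict the high-frequency detail missing from the bilinear enlargement, not the whole image.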
  • FIG. 10 is a comparison diagram of the super-resolution results of the embodiments of the present application with the super-resolution results of the prior art; the ordinate is the peak signal-to-noise ratio (PSNR) in decibels (dB), and the abscissa is the per-frame test time in milliseconds (ms).
  • VRCNN: variable-filter-size residual-learning convolutional neural network
  • VESPCN: video efficient sub-pixel convolutional network
  • MFQE: multi-frame quality enhancement
  • DCAD: deep convolutional neural networks-based auto decoder
  • VRCNN, the residual-learning convolutional neural network with variable filter size, is a lightweight compression-noise removal network consisting of 4 convolutional layers.
  • MFQE, the multi-frame quality enhancement method, is an end-to-end network for denoising compressed video that uses motion estimation and motion compensation and the idea that "good frames compensate bad frames".
  • DCAD, the deep convolutional neural network-based auto decoder, is a compression-noise removal network consisting of 10 convolutional layers.
  • VESPCN, the efficient sub-pixel convolutional network for video, is a video super-resolution method that uses motion estimation and motion compensation to exploit temporal correlations by aligning adjacent frames.
  • the video super-resolution method based on optical flow super-resolution likewise uses motion estimation and motion compensation to exploit temporal correlations by aligning adjacent frames, and additionally chooses to predict a more accurate high-resolution optical flow.
  • the progressive fusion video super-resolution network utilizing non-local spatiotemporal correlations is an end-to-end video super-resolution method built on non-local attention and a progressive fusion module.
  • FIG. 11 is a schematic diagram of an embodiment of a decoding device 110 in an embodiment of the present application.
  • an embodiment of the present application provides a decoding device, and the decoding device includes:
  • the obtaining unit 1101 is configured to obtain the motion vector information of the target frame in the encoded code stream, the image information of the target frame, and the image information of the adjacent frames, where the target frame and the adjacent frames are images of the first resolution, the target frame is an image that needs super-resolution processing, and the adjacent frames include images within a preset period before or after the target frame;
  • the generating unit 1102 is configured to generate a reconstructed frame according to the motion vector information, the image information of the target frame, and the image information of the adjacent frames, where the reconstructed frame is an image of the second resolution, the second resolution is greater than the first resolution, and the motion vector information is used to indicate adaptive local sampling between the image information of the adjacent frames and the image information of the target frame.
  • the decoding device generates high-resolution reconstructed frames by performing neural network processing on the motion vector information, the target feature pyramid, and the adjacent feature pyramids obtained from the encoded code stream; the encoded code stream already contains motion information, and the computational cost of extracting it from the code stream is negligible, so the time for video super-resolution can be greatly reduced.
  • the generating unit 1102 is specifically configured to generate a target feature pyramid according to the image information of the target frame and to generate one adjacent feature pyramid according to the image information of each adjacent frame, where the target feature pyramid includes target features of multiple scales and each adjacent feature pyramid includes adjacent features of multiple scales; to perform, based on the motion vector information and taking the position of each target feature as a reference, adaptive local sampling on the adjacent features at the position corresponding to each target feature in each adjacent feature pyramid; to fuse the target feature pyramid with each adaptively locally sampled adjacent feature pyramid to generate a fusion feature pyramid, which includes fusion features of multiple scales; and to process the fusion feature pyramid to generate the reconstructed frame.
  • the generating unit 1102 is further configured to, for each adjacent feature pyramid, find the second local feature block according to the coordinates of the first local feature block in the target feature and the mapping relationship, contained in the motion vector information, between the first local feature block and the second local feature block in the adjacent feature; and to perform feature matching on the first local feature block and the second local feature block through the fully connected layer to determine the set of relevant attention coefficients.
  • the generating unit 1102 is further configured to, for each adjacent feature pyramid, calculate an attention map according to the target feature and the adaptively locally sampled adjacent features, where the attention map represents the similarity between the adaptively locally sampled adjacent features and the target feature; perform feature enhancement processing on the adaptively locally sampled adjacent features with the attention map; and stack all the feature-enhanced adjacent features with the target features and perform convolution to generate fusion features and determine the fusion feature pyramid.
  • the generating unit 1102 is further configured to perform convolution processing on the image information of the target frame and then apply the first cascaded residual block to generate a target feature; target features of multiple scales are generated from it by bilinear interpolation to construct the target feature pyramid.
  • the generating unit 1102 is further configured to perform convolution processing on the image information of each adjacent frame and then apply the first cascaded residual block to generate one adjacent feature per adjacent frame; adjacent features of multiple scales are generated from it by bilinear interpolation to construct each adjacent feature pyramid.
  • the generating unit 1102 is further configured to process the fusion feature pyramid through the second cascaded residual block to generate an optimized feature pyramid; perform size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal; and add the reconstructed residual signal to the image enlargement result to obtain the reconstructed frame, where the image enlargement result is generated by bilinear interpolation of the image information of the target frame.
  • the decoding device described above can be understood by referring to the corresponding content in the foregoing method embodiment section, and details are not repeated here.
  • the decoding device 1200 may include one or more central processing units (CPU) 1201 and a memory 1205, and the memory 1205 stores one or more application programs or data.
  • the memory 1205 may be volatile storage or persistent storage.
  • the program stored in the memory 1205 may include one or more modules, and each module may include a series of instruction operations in the service control unit.
  • the central processing unit 1201 may be arranged to communicate with the memory 1205 to execute a series of instruction operations in the memory 1205 on the decoding device 1200.
  • the decoding device 1200 may also include one or more power supplies 1202, one or more wired or wireless network interfaces 1203, one or more input and output interfaces 1204, and/or, one or more operating systems, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the decoding device 1200 can perform the operations performed by the decoding device in the foregoing embodiments shown in FIG. 3 to FIG. 9 , and details are not repeated here.
  • a computer-readable storage medium is also provided, where computer-executable instructions are stored in the computer-readable storage medium; when the processor of a device executes the computer-executable instructions, the device executes the steps of the image processing method performed by the processor in the foregoing FIG. 3 to FIG. 9.
  • a computer program product includes computer-executable instructions stored in a computer-readable storage medium; when a processor of a device executes the computer-executable instructions, the device executes the steps of the image processing method executed by the processor in the foregoing FIG. 3 to FIG. 9.
  • a chip system is further provided, the chip system includes at least one processor, and the processor is configured to support a decoding device to implement the steps of the image processing method performed by the processor in the foregoing FIG. 3 to FIG. 9 .
  • the chip system may further include a memory for storing necessary program instructions and data of the decoding device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of units is only a logical function division.
  • in actual implementation there may be other division methods; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling, direct coupling, or communication connection may be implemented through some interfaces; the indirect coupling or communication connection of devices or units may be in electrical, mechanical, or other forms.
  • Units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium.
  • in essence, or in the part contributing to the prior art, or in whole or in part, the technical solutions of the present application can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.

Abstract

Disclosed are an image processing method and apparatus, which are used for reducing the video super-resolution time. The method in the embodiments of the present application comprises: a decoding device processing, in combination with motion vector information in an encoded code stream, image information of a target frame to be subjected to super-resolution and image information of an adjacent frame of the target frame, so as to generate a reconstructed frame with a high resolution.

Description

Image processing method and apparatus
This application claims priority to the Chinese patent application with application number 202011063576.4, entitled "Image processing method and apparatus", filed with the China Patent Office on September 30, 2020, the entire contents of which are incorporated into this application by reference.
Technical Field
The embodiments of the present application relate to the field of image processing, and in particular, to an image processing method and apparatus.
Background
Traditional video super-resolution algorithms can reconstruct a group of consecutive low-resolution frames through techniques such as linear interpolation and finally fuse them into one high-resolution image. In recent years, with the rapid development of deep learning, in order to improve the effect of super-resolution techniques, academia has generally begun to use deep learning methods to reconstruct low-resolution video sequences into spatiotemporally correlated high-resolution video sequences.
Existing video super-resolution techniques usually use the combining local and global-television (CLG-TV) optical flow model algorithm to calculate the optical flow velocity vectors between all the low-resolution video images in the sequence and the current-frame video image, that is, to perform motion estimation. According to the optical flow velocity vectors, the motion-compensated low-resolution video images of 2T frames and the low-resolution video image of the current frame are obtained. A deep residual network then sequentially performs an initial stage, a concatenated convolutional layer calculation stage, and a residual block calculation stage on the 2T frames of low-resolution video images and the current frame's low-resolution video image, and the high-resolution video image is finally reconstructed step by step through deconvolution and convolution operations.
Existing video super-resolution algorithms need to perform motion estimation, which consumes a large amount of computational resources.
Summary of the Invention
The embodiments of the present application provide an image processing method for processing a frame to be super-resolved into a high-resolution reconstructed frame by using motion vector information in an encoded code stream, which avoids the large amount of computation required for motion estimation and can greatly reduce the time for video super-resolution.
A first aspect of the embodiments of the present application provides an image processing method, the method including: a decoding device obtains motion vector information of a target frame in an encoded code stream, image information of the target frame, and image information of adjacent frames, where the target frame and the adjacent frames are images of a first resolution, the target frame is an image that needs super-resolution processing, and the adjacent frames include images within a preset period before or after the target frame; the decoding device generates a reconstructed frame according to the motion vector information, the image information of the target frame, and the image information of the adjacent frames, where the reconstructed frame is an image of a second resolution, the second resolution is greater than the first resolution, and the motion vector information is used to indicate adaptive local sampling between the image information of the adjacent frames and the image information of the target frame.
In the above first aspect, the encoded code stream may be a code stream generated by an image compression coding technique that includes predictive coding based on motion estimation and motion compensation; the motion estimation and motion compensation algorithms are used to remove temporal redundancy, that is, the image compression coding technique may apply a motion estimation algorithm to determine the motion vector information.
In the embodiment of the present application, the decoding device can directly extract the motion vector information from the encoded code stream and decode the encoded code stream to obtain the decoded image information, where the image information may include the image information of the target frame to be super-resolved and the image information of the adjacent frames within a preset period before or after the target frame; exemplarily, the adjacent frames may include the T frames before and the T frames after the target frame, and both the target frame and the adjacent frames are images of the first resolution, that is, low-resolution images. After decoding the image information of the target frame and of the adjacent frames, the decoding device can process them in combination with the motion vector information to generate a reconstructed frame of the second resolution; the reconstructed frame is a high-resolution image, that is, the second resolution is greater than the first resolution, and the motion vector information is used to indicate adaptive local sampling between the image information of the adjacent frames and the image information of the target frame. In the embodiment of the present application, the decoding device does not need to perform the computationally expensive motion estimation process when performing super-resolution, which can greatly reduce the time for video super-resolution.
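The key point above, reusing motion vectors already carried by the code stream instead of estimating them, can be illustrated with a toy block-wise warp. This is a hedged sketch, not the patent's pipeline: the per-block vectors are assumed to have been decoded from the stream already, the block size and array layout are assumptions, and no motion estimation is performed anywhere.

```python
import numpy as np

def align_with_mv(neighbor, mvs, block=4):
    """Warp a neighbour frame toward the target frame using per-block
    motion vectors taken directly from the decoded bitstream. mvs has
    one (dy, dx) pair per block of the frame."""
    h, w = neighbor.shape
    out = np.zeros_like(neighbor)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = mvs[by // block, bx // block]
            # source coordinates, clamped to the frame boundary
            sy = np.clip(np.arange(by, by + block) + dy, 0, h - 1)
            sx = np.clip(np.arange(bx, bx + block) + dx, 0, w - 1)
            out[by:by + block, bx:bx + block] = neighbor[np.ix_(sy, sx)]
    return out
```

Because the vectors come for free with the compressed stream, the only cost of this alignment is the gather itself, which is the efficiency argument the paragraph above makes.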
In a possible implementation of the first aspect, the decoding device performs, based on the motion vector information, adaptive local sampling on each position in the image information of the adjacent frames that corresponds to the image information of the target frame; the decoding device generates the reconstructed frame according to the image information of the target frame and the adaptively locally sampled image information of the adjacent frames.
In this possible implementation, based on each local position in the image information of the target frame in combination with the motion vector information, the decoding device can adaptively select and sample, among the multiple image points at each corresponding local position in the image information of the adjacent frames, the image points with high similarity. The decoding device can perform the corresponding image reconstruction based on the adaptively locally sampled image information of the adjacent frames to generate the reconstructed frame, which can reduce the noise present in the motion vector information and improve robustness.
In a possible implementation of the first aspect, the above step in which the decoding device performs, based on the motion vector information, adaptive local sampling on each position in the image information of the adjacent frames that corresponds to the image information of the target frame includes: the decoding device generates a target feature pyramid according to the image information of the target frame and generates one adjacent feature pyramid according to the image information of each adjacent frame, where the target feature pyramid includes target features of multiple scales and each adjacent feature pyramid includes adjacent features of multiple scales; based on the motion vector information, taking the position of each target feature as a reference, the decoding device performs adaptive local sampling on the adjacent features at the position corresponding to each target feature in each adjacent feature pyramid.
In this possible implementation, the feature pyramid of an image is a series of feature sets arranged in a pyramid shape, obtained by progressively downsampling an original feature, so that the size decreases layer by layer. The decoding device can extract features from the image information of the target frame and of the adjacent frames through a feature extraction function to obtain the target feature and the adjacent features, then generate target features and adjacent features of multiple scales through downsampling and construct the corresponding feature pyramids. Based on the scale invariance of the feature pyramid, taking the position of each target feature at each scale as a reference, the decoding device performs adaptive local sampling on the adjacent features at the corresponding position of each adjacent feature pyramid according to the motion vector information; that is, it refines the search for better-matching features within the adjacent features at the mapped position, improving the feature quality of each adjacent feature.
In a possible implementation of the first aspect, the above step in which the decoding device generates the reconstructed frame according to the image information of the target frame and the adaptively locally sampled image information of the adjacent frames includes: the decoding device fuses the target feature pyramid with each adaptively locally sampled adjacent feature pyramid to generate a fusion feature pyramid, where the fusion feature pyramid includes fusion features of multiple scales; the decoding device processes the fusion feature pyramid to generate the reconstructed frame.
In this possible implementation, the decoding device stacks the adaptively locally sampled adjacent feature pyramids with the target feature pyramid and fuses them by convolution into one fusion feature pyramid, which can then be reconstructed into a high-resolution image. Stacking (concat) is the merging of feature channels; that is, the number of features (channels) describing the image itself increases, while the information under each feature does not increase.
In a possible implementation of the first aspect, the decoding device performing, based on the motion vector information and taking the position of each target feature as a reference, adaptive local sampling on the adjacent features at the position corresponding to each target feature in each adjacent feature pyramid includes: for each adjacent feature pyramid, the decoding device finds the second local feature block according to the coordinates of the first local feature block in the target feature and the mapping relationship, contained in the motion vector information, between the first local feature block and the second local feature block in the adjacent feature; the decoding device performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of relevant attention coefficients, where the set includes multiple relevant attention coefficients, each indicating the similarity between a feature point in the first local feature block and the corresponding feature point in the second local feature block; the decoding device performs a weighted average over the multiple feature points in the second local feature block based on the set of relevant attention coefficients to determine the adaptively locally sampled adjacent feature pyramid.
In this possible implementation, the attention coefficient is a degree of attention: feature points in the adjacent features that are more similar to the target feature receive more attention. In the embodiment of the present application, the decoding device extracts the first local feature block at one scale of the target feature pyramid, extracts the second local feature block from the corresponding coordinates of an adjacent feature pyramid based on the mapping relationship indicated by the motion vector information, determines the attention coefficient of each feature point in the second local feature block through a two-layer fully connected layer, and then applies the corresponding degree of attention to each feature point in the second local feature block, obtaining the adaptively locally sampled second local feature block. After the feature blocks extracted from the adjacent features of all adjacent feature pyramids corresponding to all scales of the target feature pyramid have been processed, the adaptively locally sampled adjacent feature pyramids are obtained.
In a possible implementation of the first aspect, the above step in which the decoding device fuses the target feature pyramid with the adaptively locally sampled adjacent feature pyramids to generate the fusion feature pyramid includes: for each adjacent feature pyramid, the decoding device calculates an attention map according to the target feature and the adaptively locally sampled adjacent features, where the attention map represents the similarity between the adaptively locally sampled adjacent features and the target feature; the decoding device performs feature enhancement processing on the adaptively locally sampled adjacent features with the attention map; the decoding device stacks all the feature-enhanced adjacent features with the target features and performs convolution calculations to generate the fusion features and determine the fusion feature pyramid.
In this possible implementation, the decoding device can generate a temporal attention map according to the alignment quality of the adjacent frame features, and dynamically fuse the features of the adaptively locally sampled adjacent frames and of the target frame by increasing the weights of well-aligned local regions and reducing the weights of poorly aligned ones. The alignment quality can be expressed by the feature inner product, computed coordinate point by coordinate point, between the adaptively locally sampled adjacent frame features and the target features; this inner product characterizes the similarity of the adaptively locally sampled adjacent features to the target features at that point. Each feature region is then weighted, for example by point-wise multiplication of the features with the attention map. The feature-enhanced adjacent features are stacked with the target features and convolved to generate the fusion features; after each feature fusion, the decoding device checks whether any adjacent features have not yet undergone feature enhancement processing, until all adjacent features have been fused with the target features to generate the fusion feature pyramid.
In a possible implementation of the first aspect, the above step in which the decoding device generates the target feature pyramid according to the image information of the target frame includes: the decoding device performs convolution processing on the image information of the target frame and then applies the first cascaded residual block to generate a target feature; the decoding device generates target features of multiple scales from this target feature by bilinear interpolation and constructs the target feature pyramid.
In this possible implementation, residual blocks use skip connections, which improve accuracy by allowing considerable depth; a skip connection lets a residual block pass the received input directly around to its output, preserving the integrity of the information. Scale refers to the number of pixels in the image. In this embodiment of the present application, the decoding device extracts a target feature from the image information of the target frame through a feature extraction function, where the feature extraction function includes convolution processing and cascaded residual blocks. The decoding device can shrink the target feature to different degrees by bilinear-interpolation downsampling to obtain target features at different scales, and then arrange the target features by scale to generate the above target feature pyramid. The number of pixels in each layer of the pyramid decreases from bottom to top, which can greatly reduce the amount of computation.
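As a concrete, hypothetical illustration of building such a pyramid by bilinear-interpolation downsampling, the NumPy sketch below resizes a single-channel feature map and stacks the results largest-first. The learned feature extractor is omitted, and the resize convention (half-pixel sample centers) is one common choice, not necessarily the one used in the patent.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D array with bilinear interpolation (half-pixel centers)."""
    in_h, in_w = img.shape
    ys = (np.arange(out_h) + 0.5) * in_h / out_h - 0.5
    xs = (np.arange(out_w) + 0.5) * in_w / out_w - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, in_h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, in_w - 1)
    y1 = np.clip(y0 + 1, 0, in_h - 1)
    x1 = np.clip(x0 + 1, 0, in_w - 1)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None]   # vertical blend weights
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :]   # horizontal blend weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def build_pyramid(feat, levels=3):
    """Largest scale first; each level halves the height and width."""
    pyramid = [feat]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        pyramid.append(
            bilinear_resize(prev, prev.shape[0] // 2, prev.shape[1] // 2))
    return pyramid

pyr = build_pyramid(np.ones((16, 16)), levels=3)
```

Because the pixel count shrinks by a factor of four at every level, later processing on the upper levels of the pyramid costs correspondingly less, which is the computational saving the paragraph above refers to.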
In a possible implementation of the first aspect, the decoding device generating an adjacent feature pyramid from the image information of each adjacent frame includes: the decoding device convolves the image information of each adjacent frame and then processes it with the first cascade of residual blocks to generate one adjacent feature corresponding to the image information of each adjacent frame; the decoding device generates adjacent features at multiple scales by bilinear interpolation to build an adjacent feature pyramid.
In this possible implementation, the decoding device simultaneously performs feature extraction on the image information of the adjacent frames through multiple feature extraction functions that share weights with the above feature extraction function to obtain adjacent features, shrinks them to different degrees by bilinear-interpolation downsampling to obtain adjacent features at different scales, and then arranges the adjacent features by scale to generate the adjacent feature pyramids.
In a possible implementation of the first aspect, the decoding device processing the fused feature pyramid to generate the reconstructed frame includes: the decoding device passes the fused feature pyramid through a second cascade of residual blocks to generate an optimized feature pyramid; the decoding device enlarges and convolves the optimized feature pyramid to generate a reconstructed residual signal; the decoding device adds the reconstructed residual signal to an image enlargement result to obtain the reconstructed frame, where the image enlargement result is generated from the image information of the target frame by bilinear interpolation.
In this possible implementation, the decoding device exchanges information among the features at each scale level of the fused feature pyramid generated above through the second cascade of residual blocks; for example, the features at each scale level can be upsampled or downsampled and made to interact at a common scale so as to optimize the fused features. The optimized fused features are then enlarged and convolved, and added to the image information of the target frame enlarged by bilinear interpolation, to obtain a high-resolution reconstructed frame.
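The final addition step can be illustrated as follows. In this sketch, pixel replication via `np.kron` stands in for the bilinear enlargement of the target frame, and the reconstructed residual is taken as given rather than produced by residual blocks; all names are illustrative, not from the patent.

```python
import numpy as np

def reconstruct(target_frame, residual, scale=2):
    """Add a reconstructed residual to an enlarged low-resolution frame.

    Pixel replication (np.kron) stands in for bilinear interpolation here;
    the residual would come from the optimized feature pyramid in the
    actual method. Illustrative sketch only.
    """
    enlarged = np.kron(target_frame, np.ones((scale, scale)))  # (H*s, W*s)
    assert enlarged.shape == residual.shape
    return enlarged + residual

lr = np.full((4, 4), 0.5)    # low-resolution target frame
res = np.full((8, 8), 0.1)   # stand-in reconstructed residual signal
hr = reconstruct(lr, res)
```

The key design point is that the network only has to predict the residual detail on top of a cheap enlargement, not the whole high-resolution frame.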
A second aspect of the embodiments of this application provides a decoding device that has the function of implementing the method of the first aspect or any possible implementation of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function, for example, a receiving unit and a processing unit.
A third aspect of the embodiments of this application provides a computer device that includes at least one processor, a storage system, an input/output (I/O) interface, and computer-executable instructions that are stored in the storage system and can run on the processor. When the computer-executable instructions are executed by the processor, the processor performs the method of the first aspect or any possible implementation of the first aspect.
A fourth aspect of the embodiments of this application provides a computer-readable storage medium storing one or more computer-executable instructions. When the computer-executable instructions are executed by a processor, the processor performs the method of the first aspect or any possible implementation of the first aspect.
A fifth aspect of the embodiments of this application provides a computer program product storing one or more computer-executable instructions. When the computer-executable instructions are executed by a processor, the processor performs the method of the first aspect or any possible implementation of the first aspect.
A sixth aspect of the embodiments of this application provides a chip system that includes at least one processor, where the at least one processor is configured to support a decoding device in implementing the functions involved in the first aspect or any possible implementation of the first aspect. In a possible design, the chip system may further include a memory for storing the program instructions and data necessary for the decoding device. The chip system may consist of chips, or may include chips and other discrete components.
For the technical effects brought by the second to sixth aspects or any of their possible implementations, refer to the technical effects brought by the first aspect or its different possible implementations; details are not repeated here.
In the solution provided by the embodiments of this application, the decoding device generates a high-resolution reconstructed frame through neural-network processing of the motion vector information obtained from the encoded bitstream, the target feature pyramid, and the adjacent feature pyramids. The encoded bitstream already contains a certain amount of motion information, and the computational cost of extracting that motion information from the bitstream is negligible, so the time required for video super-resolution can be greatly reduced.
Description of the Drawings
FIG. 1 is an application scenario of an embodiment of this application;
FIG. 2 is a schematic block diagram of a video encoding and decoding system in an embodiment of this application;
FIG. 3 is a flowchart of an embodiment of an image processing method in an embodiment of this application;
FIG. 4 is a schematic diagram of local feature block extraction in an embodiment of this application;
FIG. 5 is a schematic architectural diagram of a decoding device in an embodiment of this application;
FIG. 6 is an image processing flowchart of the feature extraction module in an embodiment of this application;
FIG. 7 is an image processing flowchart of the flexible alignment module in an embodiment of this application;
FIG. 8 is an image processing flowchart of the multi-frame feature fusion module in an embodiment of this application;
FIG. 9 is an image processing flowchart of the feature super-resolution reconstruction module in an embodiment of this application;
FIG. 10 is a comparison of super-resolution results of an embodiment of this application with prior-art super-resolution results;
FIG. 11 is a schematic structural diagram of a decoding device in an embodiment of this application;
FIG. 12 is another schematic structural diagram of a decoding device in an embodiment of this application.
Detailed Description
Embodiments of this application provide an image processing method and a decoding device for reducing the time required for video super-resolution.
The embodiments of this application are described below with reference to the accompanying drawings. In the following description, reference is made to the accompanying drawings, which form a part of this disclosure and show, by way of illustration, specific aspects of embodiments of this application or specific aspects in which embodiments of this application may be used. It should be understood that the embodiments of this application may be used in other aspects and may include structural or logical changes not depicted in the drawings. The following detailed description is therefore not to be taken in a limiting sense, and the scope of this application is defined by the appended claims. For example, it should be understood that disclosure in connection with a described method may equally apply to a corresponding device or system for performing the method, and vice versa. For example, if one or more specific method steps are described, the corresponding device may include one or more units, such as functional units, to perform the described one or more method steps (for example, one unit performing one or more steps, or multiple units each performing one or more of the multiple steps), even if such one or more units are not explicitly described or illustrated in the drawings. Conversely, if, for example, a specific apparatus is described based on one or more units, such as functional units, the corresponding method may include one step to perform the functionality of the one or more units (for example, one step performing the functionality of the one or more units, or multiple steps each performing the functionality of one or more of the multiple units), even if such one or more steps are not explicitly described or illustrated in the drawings. Further, it should be understood that, unless expressly stated otherwise, features of the various exemplary embodiments and/or aspects described herein may be combined with one another.
Video coding generally refers to the processing of a sequence of images that form a video or video sequence. In the field of video coding, the terms "picture", "frame", and "image" may be used as synonyms. Video encoding is performed on the source side and typically includes processing (for example, compressing) the original video images to reduce the amount of data required to represent them, for more efficient storage and/or transmission. Video decoding is performed on the destination side and typically includes inverse processing relative to the encoder to reconstruct the video images. "Coding" of video images in the embodiments should be understood as "encoding" or "decoding" of a video sequence. The combination of the encoding part and the decoding part is also called codec (encoding and decoding).
This embodiment can be applied to the application scenario shown in FIG. 1. A terminal 11, a server 12, a set-top box 13, and a television 14 are connected through a wireless or wired network, and the terminal 11 can remotely control the display 14 through locally installed application software (APP). For example, a user operating on the operation interface of the terminal 11 can output a video source for television playback; the terminal 11 has the video source encoded by the server 12 and forwarded to the set-top box 13, the set-top box 13 decodes the encoded video source for the display 14, and the display 14 can then play back the decoded video source.
The following describes the system architecture to which the embodiments of this application apply. FIG. 2 exemplarily shows a schematic block diagram of a video encoding and decoding system to which the embodiments of this application apply. As shown in FIG. 2, the video encoding and decoding system may include an encoding device 21 and a decoding device 22; the encoding device 21 generates encoded video data, and the decoding device 22 can decode the encoded video data generated by the encoding device 21. Various implementations of the encoding device 21, the decoding device 22, or both may include one or more processors and a memory coupled to the one or more processors. The memory may include, but is not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures accessible by a computer, as described herein. The encoding device 21 and the decoding device 22 may include various apparatuses, including desktop computers, mobile computing devices, notebook (for example, laptop) computers, tablet computers, set-top boxes, telephone handsets such as so-called "smart" phones, televisions, cameras, display devices, digital media players, video game consoles, in-vehicle computers, wireless communication devices, or the like.
Although FIG. 2 shows the encoding device 21 and the decoding device 22 as separate devices, a device may also include both the encoding device 21 and the decoding device 22, or the functionality of both, that is, the encoding device 21 or corresponding functionality and the decoding device 22 or corresponding functionality. In such embodiments, the encoding device 21 or corresponding functionality and the decoding device 22 or corresponding functionality may be implemented using the same hardware and/or software, separate hardware and/or software, or any combination thereof.
The encoding device 21 and the decoding device 22 may be communicatively connected through a link 23, and the decoding device 22 may receive encoded video data from the encoding device 21 via the link 23. The link 23 may include one or more media or apparatuses capable of moving the encoded video data from the encoding device 21 to the decoding device 22. In one example, the link 23 may include one or more communication media that enable the encoding device 21 to transmit the encoded video data directly to the decoding device 22 in real time. In this example, the encoding device 21 may modulate the encoded video data according to a communication standard (for example, a wireless communication protocol) and may transmit the modulated video data to the decoding device 22. The one or more communication media may include wireless and/or wired communication media, such as the radio-frequency spectrum or one or more physical transmission lines. The one or more communication media may form part of a packet-based network, such as a local area network, a wide area network, or a global network (for example, the Internet). The one or more communication media may include routers, switches, base stations, or other devices that facilitate communication from the encoding device 21 to the decoding device 22.
The encoding device 21 includes an encoder 211; optionally, the encoding device 21 may further include an image preprocessor 212 and a first communication interface 213. In specific implementation forms, the encoder 211, the image preprocessor 212, and the first communication interface 213 may be hardware components in the encoding device 21, or may be software programs in the encoding device 21.
These are described separately as follows:
The image preprocessor 212 is configured to receive original image data 214 transmitted by an external terminal and perform preprocessing on the original image data 214 to obtain preprocessed image data 215. For example, the preprocessing performed by the image preprocessor 212 may include trimming, color format conversion (for example, from the three-primary-color (RGB) format to the luma-and-chroma (YUV, where Y denotes luma and UV denotes chroma) format), color grading, or denoising.
An image can be regarded as a two-dimensional array or matrix of picture elements (pixels). The pixels in the array may also be called sampling points. The number of sampling points of the array or image in the horizontal and vertical directions (or axes) defines the size and/or resolution of the image. To represent color, three color components are usually used; that is, the image can be represented as, or contain, three sample arrays. For example, in the RGB format or color space, an image includes corresponding red, green, and blue sample arrays. In video coding, however, each pixel is usually represented in a luma/chroma format or color space; for example, an image in the YUV format includes a luma component indicated by Y (sometimes also indicated by L) and two chroma components indicated by U and V. The luma component Y represents brightness or gray-level intensity (for example, the two are the same in a grayscale image), while the two chroma components U and V represent chrominance or color information. Correspondingly, an image in the YUV format includes a luma sample array of luma sample values (Y) and two chroma sample arrays of chroma values (U and V). An image in the RGB format can be converted or transformed into the YUV format and vice versa; this process is also called color transformation or conversion. If an image is black and white, the image may include only a luma sample array.
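The RGB-to-YUV transformation mentioned above can be written concretely as follows. The BT.601 coefficients used here are one common convention and are an assumption on my part, since the text does not fix a specific conversion matrix.

```python
import numpy as np

def rgb_to_yuv(rgb):
    """Convert an (H, W, 3) RGB image in [0, 1] to YUV.

    Uses the BT.601 luma weights; this is one common convention, chosen
    here for illustration only.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma (brightness)
    u = 0.492 * (b - y)                      # blue-difference chroma
    v = 0.877 * (r - y)                      # red-difference chroma
    return np.stack([y, u, v], axis=-1)

gray = np.full((2, 2, 3), 0.5)   # a gray pixel carries no color information
yuv = rgb_to_yuv(gray)
```

As the grayscale example shows, a black-and-white image ends up with zero chroma everywhere, which is why such an image needs only the luma sample array.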
The encoder 211 (or video encoder 211) is configured to receive the preprocessed image data 215 and process it using a relevant prediction mode (such as the prediction modes in various embodiments herein), thereby providing encoded image data 216.
The first communication interface 213 can be configured to receive the encoded image data 216 and transmit the encoded image data 216 through the link 23 to the decoding device 22 or any other device (such as a memory) for storage or direct reconstruction; the other device may be any device used for decoding or storage. The first communication interface 213 may, for example, be configured to encapsulate the encoded image data 216 into a suitable format, such as data packets, for transmission over the link 23.
The decoding device 22 includes a decoder 221; optionally, the decoding device 22 may further include a second communication interface 222 and an image post-processor 223. These are described separately as follows:
The second communication interface 222 can be configured to receive the encoded image data 216 from the encoding device 21 or any other source, such as a storage device, for example, an encoded-image-data storage device. The second communication interface 222 may be configured to transmit or receive the encoded image data 216 via the link 23 between the encoding device 21 and the decoding device 22 (for example, a direct wired or wireless connection) or via any kind of network, for example, a wired or wireless network or any combination thereof, or any kind of private or public network or any combination thereof. The second communication interface 222 may, for example, be configured to decapsulate the data packets transmitted by the first communication interface 213 to obtain the encoded image data 216.
Both the second communication interface 222 and the first communication interface 213 can be configured as one-way or two-way communication interfaces, and can be used, for example, to send and receive messages to establish connections, and to acknowledge and exchange any other information related to the communication link and/or to data transmission such as the transmission of encoded image data.
The decoder 221 is configured to receive the encoded image data 216 and provide decoded image data 224 or a decoded image 224. In some embodiments, the decoder 221 may be configured to perform the various embodiments described below, to apply the image processing method described in this application on the decoding side.
The image post-processor 223 is configured to perform post-processing on the decoded image data 224 (also called reconstructed image data) to obtain post-processed image data 225. The post-processing performed by the image post-processor 223 may include color format conversion (for example, from the YUV format to the RGB format), color grading, trimming, or resampling, or any other processing, and may also be used to transmit the post-processed image data to an external display device for playback. The display device may be or include any kind of display for presenting the reconstructed image, for example, an integrated or external display or monitor. For example, the display may include a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, a plasma display, a projector, a micro-LED display, liquid crystal on silicon (LCoS), a digital light processor (DLP), or any other kind of display.
Although FIG. 2 depicts the encoding device 21 and the decoding device 22 as separate devices, device embodiments may also include both the encoding device 21 and the decoding device 22, or the functionality of both, that is, the encoding device 21 or corresponding functionality and the decoding device 22 or corresponding functionality. In such embodiments, the encoding device 21 or corresponding functionality and the decoding device 22 or corresponding functionality may be implemented using the same hardware and/or software, separate hardware and/or software, or any combination thereof.
It is apparent to those skilled in the art from the description that the functionality of the different units, or the existence and (exact) division of the functionality of the encoding device 21 and/or the decoding device 22 shown in FIG. 2, may vary according to the actual device and application. The encoding device 21 and the decoding device 22 may each include any of various devices, including any kind of handheld or stationary device, for example, a notebook or laptop computer, a mobile phone, a smartphone, a tablet or tablet computer, a video camera, a desktop computer, a set-top box, a television, a camera, an in-vehicle device, a display device, a digital media player, a video game console, a video streaming device (such as a content service server or a content distribution server), a broadcast receiver device, a broadcast transmitter device, or the like, and may use no operating system or any kind of operating system.
Both the encoder 211 and the decoder 221 can be implemented as any of various suitable circuits, for example, one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, or any combination thereof. If the techniques are implemented partly in software, a device may store the software's instructions in a suitable non-transitory computer-readable storage medium and may execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Any of the foregoing (including hardware, software, a combination of hardware and software, and so on) may be regarded as one or more processors.
In some cases, the video encoding and decoding system shown in FIG. 2 is merely an example, and the techniques of this application may apply to video coding settings (for example, video encoding or video decoding) that do not necessarily include any data communication between the encoding and decoding devices. In other examples, data may be retrieved from a local memory, streamed over a network, and so on. A video encoding device may encode data and store the data in a memory, and/or a video decoding device may retrieve data from a memory and decode the data. In some examples, encoding and decoding are performed by devices that do not communicate with each other but only encode data to a memory and/or retrieve data from a memory and decode it.
Currently, in the case of lossless video coding, the original video images can be reconstructed; that is, the reconstructed video images have the same quality as the original video images (for example, assuming no transmission loss or other data loss during storage or transmission). In the case of lossy video coding, further compression is performed, for example, by quantization, to reduce the amount of data required to represent the video images. The super-resolution algorithms used when reconstructing video images on the decoder side require motion estimation, and motion estimation consumes a large amount of computing resources. To reduce the time required for video super-resolution, the embodiments of this application provide a corresponding image processing method. The method includes: a decoding device obtains, from an encoded bitstream, motion vector information of a target frame, image information of the target frame, and image information of adjacent frames, where the target frame and the adjacent frames are images of a first resolution, the target frame is an image that needs super-resolution processing, and the adjacent frames include images within a preset period before or after the target frame; the decoding device generates a reconstructed frame according to the motion vector information, the image information of the target frame, and the image information of the adjacent frames, where the reconstructed frame is an image of a second resolution greater than the first resolution, and the motion vector information is used to guide adaptive local sampling of the image information of the adjacent frames with respect to the image information of the target frame. In this way, this application uses the motion vector information of the target frame in the encoded bitstream to increase the resolution of the reconstructed frame, saving the resources that re-estimating the motion vector information would require.
Based on the foregoing step in which the decoding device generates the reconstructed frame according to the motion vector information, the image information of the target frame, and the image information of the adjacent frames, an implementation of an embodiment of this application may further be: the decoding device performs, based on the motion vector information, adaptive local sampling at each position in the image information of the adjacent frames corresponding to the image information of the target frame; the decoding device then generates the reconstructed frame according to the image information of the target frame and the adaptively locally sampled image information of the adjacent frames. Through adaptive local sampling, image points with high similarity can be selected for sampling, reducing the influence of noise in the motion vector information and improving the robustness of the super-resolution.
Hereinafter, based on the foregoing application scenario and system architecture, the image processing method in the embodiments of this application is described with reference to FIG. 3.
Referring to FIG. 3, an embodiment of the image processing method of this application includes the following steps.
301. The decoding device acquires the image information of the target frame and the image information of the adjacent frames from the encoded bitstream.
In this embodiment, after receiving the encoded bitstream sent by the server, the decoder may decode the encoded video to obtain the image information of the target frame and the image information of the adjacent frames.
The target frame is the image on which super-resolution processing is to be performed in this embodiment. Super-resolution means increasing the resolution of an original image by hardware or software; the process of obtaining one high-resolution image from a series of low-resolution images is super-resolution reconstruction.
When the images within the preset period before and after the target frame have also been decoded (for example, when the 2T adjacent frames before and after the target frame have been decoded), the decoding device can obtain the image information of these adjacent frames. The period T may be preset, or may be changed according to actual needs. For boundary cases of the sequence, such as the first or last frame, the existing adjacent frames may be repeated so that the input satisfies the requirements of the network.
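The neighbor-selection rule above, including the boundary handling that repeats existing frames, can be sketched as follows (a minimal illustration; the function name and the clamping strategy are assumptions, not taken from the text):

```python
def neighbor_indices(target_idx, num_frames, T):
    """Return the indices of the 2T adjacent frames (T before, T after)
    around the target frame, repeating existing frames at sequence
    boundaries so the network always receives 2T neighbors."""
    indices = []
    for offset in range(-T, T + 1):
        if offset == 0:
            continue  # skip the target frame itself
        # clamp to the valid range: boundary frames are repeated
        idx = min(max(target_idx + offset, 0), num_frames - 1)
        indices.append(idx)
    return indices
```

For example, with 10 decoded frames and T = 2, the first frame borrows itself twice in place of the missing preceding frames.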
The encoded bitstream may be generated by an image compression coding technique that includes predictive coding based on motion estimation and motion compensation; the motion estimation and motion compensation algorithms are used to remove temporal redundancy, i.e. the image compression coding technique may apply a motion estimation algorithm to determine the motion vector information.
The target frame and the adjacent frames in this embodiment are images of the first resolution, where the first resolution specifically refers to the low resolution of the images obtained by decoding the encoded bitstream.
302. The decoding device extracts a target feature from the image information of the target frame, and extracts adjacent features from the image information of the adjacent frames.
In this embodiment, the decoding device includes multiple feature extraction functions with shared weights. After receiving the image information of the target frame and the image information of the adjacent frames from the decoder, the decoding device may use these feature extraction functions to extract image features from the image information of the target frame and from the image information of each adjacent frame, so as to generate one target feature corresponding to the image information of the target frame and one adjacent feature corresponding to the image information of each adjacent frame.
Optionally, the above image feature may be image texture information of the image.
The above feature extraction function consists of one convolutional layer and several cascaded residual blocks.
303. The decoding device constructs a target feature pyramid from the target feature, and constructs an adjacent feature pyramid from each adjacent feature.
In this embodiment, the decoding device successively reduces the image size of the target feature by filtering and bilinear-interpolation downsampling to obtain target features at different scales, and then arranges the target features of different scales by scale to generate the target feature pyramid. The bottom level of the target feature pyramid corresponds to the original target feature; averaging every 2*2=4 pixels yields the level-2 target feature, and so on, forming a multi-level target feature pyramid. Optionally, the pyramid structure may be a Gaussian pyramid, a Laplacian pyramid, a wavelet pyramid, or the like, which is not specifically limited here.
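The 2*2 averaging construction described above can be sketched in plain Python (a simplified illustration using plain pixel averaging only; the filtering and bilinear-interpolation variants mentioned in the text, and all function names, are assumptions of this sketch):

```python
def downsample_2x2(feature):
    """Average every 2*2=4 pixels to produce the next pyramid level."""
    h, w = len(feature), len(feature[0])
    return [[(feature[y][x] + feature[y][x + 1]
              + feature[y + 1][x] + feature[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def build_pyramid(feature, levels):
    """Bottom level is the original feature; each higher level halves
    the spatial size by 2x2 averaging."""
    pyramid = [feature]
    for _ in range(levels - 1):
        feature = downsample_2x2(feature)
        pyramid.append(feature)
    return pyramid
```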
The decoding device likewise successively reduces the image size of the adjacent feature corresponding to the image information of each adjacent frame by filtering and bilinear-interpolation downsampling to obtain adjacent features at different scales, and then arranges the adjacent features of different scales by scale to generate an adjacent feature pyramid. Exemplarily, based on the 2T adjacent frames before and after the target frame, this embodiment of the application may correspondingly generate 2T adjacent feature pyramids.
A feature pyramid is a basic component of a multi-scale object detection system. The feature pyramid of an image is a series of feature sets arranged in a pyramid shape; it is obtained by stepwise downsampling of an original feature, so its size decreases layer by layer. The feature pyramid exhibits a degree of scale invariance, and this property enables the decoding device of this embodiment to detect images over a wide range of scales.
304. The decoding device determines the position mapping relationship between the target feature and the adjacent features according to the motion vector information.
In this embodiment, the decoding device can determine the motion vector information directly from the encoded bitstream, and determine the position mapping relationship between the target feature and the adjacent features indicated by that motion vector information. For the above target feature pyramid and adjacent feature pyramids, the position mapping relationship means that, for each coordinate in the target feature at each scale of the target feature pyramid, there is a corresponding coordinate in the adjacent feature at the same scale of the adjacent feature pyramid.
305. The decoding device finds the second local feature block of the adjacent feature according to the coordinates of the first local feature block of the target feature and the position mapping relationship.
In this embodiment, the same flexible alignment operation is performed for each scale of the adjacent feature pyramid. Taking one scale as an example, that is, the features at one level of the feature pyramid, the specific operation is as follows. As shown in the local feature block extraction diagram of FIG. 4, guided by the motion vector information, i.e. the position mapping relationship 41, for each coordinate of the target feature 42 in the target feature pyramid there is a corresponding coordinate in the adjacent feature 223 of any adjacent feature pyramid. From these two corresponding coordinates, the decoding device extracts the first local feature block 421 of the target feature and the second local feature block 2231 of the adjacent feature, respectively.
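Steps 304-305, mapping a target coordinate into the adjacent feature through its motion vector and extracting the two corresponding local feature blocks, can be sketched as follows (the k x k patch size, the border clamping, and the function names are illustrative assumptions):

```python
def extract_patch(feature, cx, cy, k=3):
    """Extract a k x k local feature block centered at (cx, cy),
    clamping coordinates at the feature border."""
    h, w = len(feature), len(feature[0])
    half = k // 2
    return [[feature[min(max(cy + dy, 0), h - 1)]
                    [min(max(cx + dx, 0), w - 1)]
             for dx in range(-half, half + 1)]
            for dy in range(-half, half + 1)]

def corresponding_patches(target, neighbor, mv_field, x, y, k=3):
    """The motion vector at (x, y) maps the target coordinate to its
    counterpart in the adjacent feature; local feature blocks are then
    extracted at both positions."""
    dx, dy = mv_field[y][x]
    return (extract_patch(target, x, y, k),
            extract_patch(neighbor, x + dx, y + dy, k))
```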
306. The decoding device performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of correlation attention coefficients.
In this embodiment, the decoding device rearranges the first local feature block of a target feature and the second local feature block of an adjacent feature into two one-dimensional feature vectors, then merges the two one-dimensional feature vectors by concatenation and feeds them into a two-layer fully connected layer to generate an attention vector whose length equals the number of pixel values contained in a local feature block. The decoding device rearranges this attention vector to obtain the set of correlation attention coefficients between the two local feature blocks.
The above set of correlation attention coefficients includes multiple correlation attention coefficients, where each correlation attention coefficient can indicate the similarity between a feature point in the first local feature block and the corresponding feature point in the second local feature block.
307. The decoding device performs a weighted average over the feature points in the second local feature block based on the set of correlation attention coefficients, so as to determine the adaptively locally sampled adjacent feature pyramid.
In this embodiment, owing to the design logic of the coding framework itself, the position mapping relationship provided by the motion vector information does not necessarily reflect the true object motion exactly and may carry coding noise. The correlation attention coefficients in the decoding device allow the network to refine the search for better-matching features within the local neighborhood of the mapped position. After the decoding device has obtained the set of correlation attention coefficients, it can multiply the correlation attention coefficients point by point with the local feature block of the adjacent feature and then sum the results, thereby sampling the second local feature block.
The correlation attention coefficients are the importance weights of the individual features in the local feature block of the adjacent feature, and are used to improve the feature quality of the adjacent feature after adaptive local sampling.
The same steps need to be performed for the local feature block at every coordinate of the target feature in the target feature pyramid.
When the decoding device performs one adaptive local sampling, it processes only one local feature block of one target feature and one local feature block of one adjacent feature. The same adaptive local sampling steps also need to be performed for the other adjacent feature pyramids among the 2T adjacent feature pyramids, and for the local feature blocks of the adjacent features corresponding to the local feature blocks at the other coordinates of the target feature, so that all adjacent feature pyramids are adaptively locally sampled.
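Steps 306-307 can be illustrated with the following sketch. The text derives the attention coefficients from a learned two-layer fully connected network over the concatenated, flattened patches; since the learned weights are not specified here, this sketch substitutes a softmax over element-wise products as a stand-in scoring function, and only the point-wise product-and-sum sampling step mirrors the text directly:

```python
import math

def attention_sample(target_patch, neighbor_patch):
    """Adaptive local sampling of one neighbor patch: derive one
    attention coefficient per pixel of the patch (stand-in scoring:
    softmax over element-wise products instead of the learned MLP),
    then take the attention-weighted sum of the neighbor patch."""
    scores = [t * n for t, n in zip(target_patch, neighbor_patch)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    coeffs = [e / total for e in exps]  # the correlation attention coefficients
    # point-wise product and sum = weighted average of the patch
    return sum(c * n for c, n in zip(coeffs, neighbor_patch))
```

The weighted average lets the network pick the best-matching point inside the local neighborhood rather than trusting the motion vector blindly, which is the noise-robustness property described above.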
308. The decoding device calculates an attention map according to the target feature and the adaptively locally sampled adjacent features.
After performing adaptive local sampling on the adjacent feature pyramids, the decoding device can generate a temporal attention map according to the adaptive local sampling quality of the adjacent frame features.
Exemplarily, the decoding device may compute, coordinate by coordinate, the feature inner product between the adaptively locally sampled adjacent frame feature and the target feature. This feature inner product can characterize the similarity at that point between the adaptively locally sampled adjacent feature and the target feature, and the similarity also represents, to a certain extent, the adaptive local sampling quality at that point. From this adaptive local sampling quality, the decoding device can obtain an attention map of the same size as the feature.
309. The decoding device performs feature enhancement processing on the adaptively locally sampled adjacent features with the attention map.
After determining the above attention map, the decoding device can dynamically fuse the features of the adaptively locally sampled adjacent frames and of the current frame by increasing the weights of high-quality local regions and reducing the weights of low-quality local regions.
Exemplarily, the decoding device multiplies, point by point, the adjacent features in the adaptively locally sampled adjacent feature pyramid with the above attention map, so that the regions of the adjacent features that are more similar to the target feature are adaptively assigned higher attention, i.e. a higher weight for their possible contribution to the super-resolution result, enhancing the required features and suppressing possible interference such as mismatches.
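Steps 308-309, the per-position inner product that forms the temporal attention map and the point-wise modulation of the aligned neighbor feature, can be sketched as follows (features as H x W grids of channel vectors; the sigmoid that squashes the raw inner product into (0, 1) is an assumed detail, as no normalization is specified in the text):

```python
import math

def temporal_attention_enhance(target_feat, aligned_feat):
    """Compute a per-position attention map from feature inner products,
    then multiply the aligned neighbor feature by it point by point,
    up-weighting regions similar to the target and damping mismatches."""
    attn_map, enhanced = [], []
    for t_row, a_row in zip(target_feat, aligned_feat):
        attn_row, enh_row = [], []
        for t_vec, a_vec in zip(t_row, a_row):
            sim = sum(t * a for t, a in zip(t_vec, a_vec))  # inner product
            w = 1.0 / (1.0 + math.exp(-sim))                # assumed squashing
            attn_row.append(w)
            enh_row.append([a * w for a in a_vec])
        attn_map.append(attn_row)
        enhanced.append(enh_row)
    return attn_map, enhanced
```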
310. The decoding device stacks and convolves all feature-enhanced adjacent features with the target feature to generate fused features, and determines the fused feature pyramid.
After feature enhancement has been applied to the adjacent features, the decoding device can stack the feature-enhanced adjacent features with the target feature, i.e. superimpose the features of the feature-enhanced adjacent features onto the target feature, and then obtain the fused feature through one convolutional layer. When all feature-enhanced adjacent features have been stacked onto the target features in the target feature pyramid and convolved, the fused feature pyramid is obtained.
After the decoding device performs one feature fusion, it needs to check whether any adjacent feature has not yet undergone feature enhancement processing. If such an adjacent feature exists, the above feature enhancement processing needs to be performed on it with the target feature, until all adjacent features have undergone feature enhancement and have been fused with the target feature, and the complete fused feature pyramid is determined.
311. The decoding device computes the fused feature pyramid through the second cascaded residual blocks to generate an optimized feature pyramid.
The decoding device uses cascaded residual blocks with scale fusion to generate reconstructed features from the fused feature pyramid. Unlike ordinary residual blocks that process only a single scale, the above second cascaded residual blocks add an extra upsampling or downsampling operation at the end of the skip connection, according to the pyramid level at which the feature is located, so that the feature reconstruction residuals at different scales can fully exchange information and the quality of the reconstructed features is enhanced. The optimized fused features thus obtained form the optimized feature pyramid.
312. The decoding device performs size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal.
Since the downsampling in the decoding device shrinks the feature images, the above optimized feature pyramid needs to be enlarged through a sub-pixel convolution layer, after which a convolutional layer produces the high-resolution reconstructed residual signal.
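The size-expansion step of a sub-pixel convolution layer is a depth-to-space rearrangement (pixel shuffle). A minimal sketch of that rearrangement (the function name and data layout are illustrative assumptions):

```python
def pixel_shuffle(feature, r):
    """Depth-to-space rearrangement at the heart of a sub-pixel
    convolution layer: an (H x W) grid of r*r-channel vectors becomes
    an (H*r x W*r) single-channel image, enlarging the feature without
    interpolation."""
    h, w = len(feature), len(feature[0])
    out = [[0.0] * (w * r) for _ in range(h * r)]
    for y in range(h):
        for x in range(w):
            for c, v in enumerate(feature[y][x]):
                dy, dx = divmod(c, r)  # channel index -> sub-pixel offset
                out[y * r + dy][x * r + dx] = v
    return out
```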
313. The decoding device adds the reconstructed residual signal to the image upscaling result to obtain the reconstructed frame.
In this embodiment, the decoding device may enlarge the image information of the target frame through an upsampling operation with bilinear interpolation. The decoding device may then add the reconstructed residual signal to the image information of the upsampled target frame to obtain the image information of the reconstructed frame and determine the reconstructed frame. The reconstructed frame is the target frame after super-resolution.
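Step 313 can be sketched end to end: bilinearly upscale the low-resolution target frame and add the reconstructed residual signal point by point (a 2x-only sketch; the sampling-grid alignment and the function names are assumptions of this illustration):

```python
def bilinear_upscale_2x(image):
    """Small bilinear-style 2x upsampling sketch: each output pixel is
    interpolated from its (border-clamped) low-resolution neighbors."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h * 2):
        sy = y / 2.0
        y0 = min(int(sy), h - 1)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * 2):
            sx = x / 2.0
            x0 = min(int(sx), w - 1)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bot = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out

def reconstruct(target_lr, residual_hr):
    """Reconstructed frame = upscaled target frame + residual signal."""
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(bilinear_upscale_2x(target_lr), residual_hr)]
```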
In the technical solutions of the embodiments of this application, the decoding device generates the high-resolution reconstructed frame by neural network processing of the motion vector information obtained from the encoded bitstream, the target feature pyramid, and the adjacent feature pyramids. The encoded bitstream already contains a certain amount of motion information, and the computational cost of extracting this motion information from the bitstream is negligible, so the time taken by video super-resolution can be greatly reduced.
The super-resolution process in the image processing of the embodiments of this application may be implemented by a pre-trained network model. Exemplarily, referring to FIG. 5, a schematic architecture diagram of the decoding device in an embodiment of this application, the decoding device 22 may include a decoder 221, a network model 222, a graphics processing unit (GPU) memory 223, and an output buffer 224, which are described respectively as follows.
The decoder 221 is a device that performs the restoration and decoding operation on the encoded bitstream. Optionally, the decoder 221 may be a video decoder supporting coding standards such as H.264, High Efficiency Video Coding (HEVC), or Versatile Video Coding (VVC), for example an HEVC decoder. The decoder 221 in this embodiment of the application additionally provides a motion vector information output interface.
In the product implementation form of the embodiments of this application, the network model 222 is program code that is included in machine learning and deep learning platform software and deployed on the decoding device. The program code of the embodiments of this application exists outside the existing decoder 221. The network model 222 may be generated by supervised training, using machine learning methods, on data consisting of decoded low-resolution video and its corresponding unencoded high-resolution video. For the network model 222, the embodiments of this application design a feature extraction module 2221, a flexible alignment module 2222, a multi-frame feature fusion module 2223, and a feature super-resolution reconstruction module 2224.
The network model 222 first contains a feature extraction module 2221. Since image features carry important physical meaning in deep learning methods, this module aims to transform the input decoded frames from the pixel domain to the feature domain.
Flexible alignment module 2222: this module receives the motion vectors extracted from the bitstream by the decoder 221 and, guided by them, applies a multi-scale local attention mechanism to achieve flexible alignment of the adjacent frames at the feature level.
Multi-frame feature fusion module 2223: this module receives the aligned adjacent frame features and the current frame feature, and uses an attention mechanism in the temporal domain to complete the feature fusion operation.
Feature super-resolution reconstruction module 2224: this module receives the fused image features and uses cascaded multi-scale fusion residual blocks and sub-pixel convolution to complete the super-resolution reconstruction of the decoded video and generate the reconstructed frame.
The GPU memory 223 supports the execution of the program code that performs the computation of each module in the network model 222.
The output buffer 224 receives and stores the reconstructed frames output by the network model 222.
The embodiments of this application implement the program code for super-resolution of HEVC-standard decoded video on the open-source Python-first deep learning framework (PyTorch) machine learning platform, running on a decoding unit equipped with an NVIDIA GPU card. The NVIDIA GPU card provides computing acceleration through the Compute Unified Device Architecture (CUDA) programming interface. In this embodiment, the network model inference process on the distributed PyTorch machine learning platform can be accelerated, and the trained model can perform end-to-end reconstruction directly from decoded video containing compression noise.
Based on the above architecture, the image processing method in the embodiments of this application is described below.
Referring to FIG. 6, the image processing flowchart of the feature extraction module: the feature extraction module converts the target frame and the adjacent frames output by the decoder from the pixel domain to the feature domain, specifically as follows.
601. The feature extraction module acquires the image information of the target frame and the image information of the adjacent frames from the decoder.
In this embodiment, after receiving the encoded bitstream sent by the source device, the decoder can decode the encoded video to obtain the image information of the target frame and the image information of the adjacent frames, and the feature extraction module can receive the image information of the target frame and of the adjacent frames transmitted by the decoder.
The target frame is the image on which super-resolution processing is to be performed in this embodiment. Super-resolution means increasing the resolution of an original image by hardware or software; the process of obtaining one high-resolution image from a series of low-resolution images is super-resolution reconstruction.
When the images within the preset period before and after the target frame have also been decoded (for example, when the 2T adjacent frames before and after the target frame have been decoded), the decoder outputs the image information of these adjacent frames. The period T may be preset, or may be changed according to actual needs. For boundary cases of the sequence, such as the first or last frame, the existing adjacent frames may be repeated so that the input satisfies the requirements of the network.
The encoded bitstream may be generated by an image compression coding technique that includes predictive coding based on motion estimation and motion compensation; the motion estimation and motion compensation algorithms are used to remove temporal redundancy, i.e. the image compression coding technique may apply a motion estimation algorithm to determine the motion vector information.
The target frame and the adjacent frames in this embodiment are images of the first resolution, where the first resolution specifically refers to the low resolution of the images obtained by decoding the encoded bitstream.
602. The feature extraction module extracts the target feature from the image information of the target frame, and extracts the adjacent features from the image information of the adjacent frames.
603. The feature extraction module constructs the target feature pyramid from the target feature, and constructs the adjacent feature pyramids from the adjacent features.
Steps 602-603 are similar to steps 302-303 of the image processing method shown in FIG. 3, and details are not repeated here.
Referring to FIG. 7, the image processing flowchart of the flexible alignment module: the flexible alignment module achieves flexible alignment of the adjacent frames according to the target feature pyramid and the adjacent feature pyramids output by the feature extraction module; for each adjacent feature pyramid, the procedure is as follows.
701. The flexible alignment module receives the motion vector information from the decoder, and the target feature pyramid and the adjacent feature pyramids from the feature extraction module.
In this embodiment, the encoded bitstream may be generated by an image compression coding technique that includes predictive coding based on motion estimation and motion compensation; the motion estimation and motion compensation algorithms are used to remove temporal redundancy, i.e. the image compression coding technique may apply a motion estimation algorithm to determine the motion vector information. The decoder can extract the motion vector information from the encoded bitstream and send it to the flexible alignment module.
The flexible alignment module can also receive the target feature pyramid and the adjacent feature pyramids described with reference to FIG. 3 from the feature extraction module.
702. The flexible alignment module determines the position mapping relationship between the target feature and the adjacent features according to the motion vector information.
703. The flexible alignment module finds the second local feature block of the adjacent feature according to the coordinates of the first local feature block of the target feature and the position mapping relationship.
704. The flexible alignment module performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of correlation attention coefficients.
705. The flexible alignment module performs a weighted average on the second local feature block based on the set of correlation attention coefficients, so as to determine the adaptively locally sampled adjacent feature pyramid.
Steps 702-705 are similar to steps 304-307 of the image processing method shown in FIG. 3, and details are not repeated here.
请参阅图8,图8所示的多帧特征融合模块的图像处理流程图,多帧特征融合模块根据柔性对齐模块输出的自适应局部采样后的相邻特征金字塔,针对所述每个自适应局部采样后的相邻特征金字塔,具体如下:Please refer to FIG. 8, the image processing flow chart of the multi-frame feature fusion module shown in FIG. 8, the multi-frame feature fusion module according to the adaptive local sampling adjacent feature pyramid output by the flexible alignment module, for each adaptive The adjacent feature pyramid after local sampling, as follows:
801.多帧特征融合模块接收来自柔性对齐模块的自适应局部采样后的相邻特征金字塔。801. The multi-frame feature fusion module receives the adaptively locally sampled adjacent feature pyramids from the flexible alignment module.
柔性对齐模块在对相邻特征金字塔进行自适应局部采样后,将该自适应局部采样后的相邻特征金字塔发送给多帧特征融合模块。After the flexible alignment module performs adaptive local sampling on the adjacent feature pyramids, the adjacent feature pyramids after the adaptive local sampling are sent to the multi-frame feature fusion module.
802.多帧特征融合模块根据目标特征和自适应局部采样后的相邻特征计算注意力图。802. The multi-frame feature fusion module calculates an attention map according to the target feature and the adjacent features after adaptive local sampling.
803.多帧特征融合模块将自适应局部采样后的相邻特征与注意力图进行特征增强处理。803. The multi-frame feature fusion module performs feature enhancement processing on the adjacent features after adaptive local sampling and the attention map.
804.多帧特征融合模块将所有特征增强处理后的相邻特征和目标特征进行堆叠和卷积计算以生成融合特征,并确定融合特征金字塔。804. The multi-frame feature fusion module performs stacking and convolution calculations on adjacent features and target features after all feature enhancement processing to generate fusion features, and determines a fusion feature pyramid.
步骤802-804与图3所示的图像处理方法中步骤308-310类似,具体此处不再赘述。Steps 802-804 are similar to steps 308-310 in the image processing method shown in FIG. 3 , and details are not repeated here.
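仅作说明，单个金字塔层级上步骤802-804的最小化示意如下。For illustration only, a minimal sketch of steps 802-804 at a single pyramid level might look as follows; a 1x1 convolution (a plain matrix product) stands in for the stacking-and-convolution network described above, and the per-pixel sigmoid similarity is an assumed form of the attention map:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_frames(target, neighbors, w):
    """Sketch of steps 802-804 at one pyramid level (shapes are assumptions).

    target: (H, W, C); neighbors: list of aligned (H, W, C) features;
    w: weights of an assumed 1x1 convolution, shape (len(neighbors)*C + C, C).
    """
    enhanced = []
    for nb in neighbors:
        # Step 802: attention map = per-pixel similarity with the target.
        attn = sigmoid(np.sum(target * nb, axis=-1, keepdims=True))
        # Step 803: feature enhancement weights the neighbor by its attention.
        enhanced.append(nb * attn)
    # Step 804: stack along channels, then "convolve" (1x1) to fuse.
    stacked = np.concatenate(enhanced + [target], axis=-1)
    return stacked @ w

rng = np.random.default_rng(1)
target = rng.standard_normal((6, 6, 4))
neighbors = [rng.standard_normal((6, 6, 4)) for _ in range(2)]
w = rng.standard_normal((2 * 4 + 4, 4)) * 0.1
fused = fuse_frames(target, neighbors, w)
```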
请参阅图9，图9为特征超分重建模块的图像处理流程图。特征超分重建模块对多帧特征融合模块输出的融合特征金字塔进行处理，具体如下：Referring to FIG. 9, which is a flowchart of the image processing performed by the feature super-resolution reconstruction module, the feature super-resolution reconstruction module processes the fused feature pyramid output by the multi-frame feature fusion module as follows:
901.特征超分重建模块接收来自多帧特征融合模块的融合特征金字塔。901. The feature super-resolution reconstruction module receives the fused feature pyramid from the multi-frame feature fusion module.
902.特征超分重建模块将融合特征金字塔通过第二级联残差块计算生成优化特征金字塔。902. The feature super-resolution reconstruction module processes the fused feature pyramid through the second cascaded residual block to generate an optimized feature pyramid.
903.特征超分重建模块对优化特征金字塔进行尺寸扩大和卷积生成重构残差信号。903. The feature super-resolution reconstruction module performs size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal.
904.特征超分重建模块将重构残差信号与图像放大结果相加，得到重构帧。904. The feature super-resolution reconstruction module adds the reconstructed residual signal to the image enlargement result to obtain the reconstructed frame.
步骤902-904与图3所示的图像处理方法中步骤311-313类似,具体此处不再赘述。Steps 902-904 are similar to steps 311-313 in the image processing method shown in FIG. 3 , and details are not repeated here.
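仅作说明，步骤902-904可以如下勾勒。For illustration only, steps 902-904 can be sketched as below; the cascaded residual blocks are omitted, a pixel-shuffle rearrangement plays the role of the size expansion, and nearest-neighbor repetition stands in for the bilinear image enlargement; all names and shapes are assumptions:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Step 903 "size expansion": rearrange (H, W, C*r*r) -> (H*r, W*r, C)."""
    H, W, crr = x.shape
    c = crr // (r * r)
    return x.reshape(H, W, r, r, c).transpose(0, 2, 1, 3, 4).reshape(H * r, W * r, c)

def enlarge(img, r):
    # Nearest-neighbor repetition as a stand-in for bilinear interpolation.
    return np.repeat(np.repeat(img, r, axis=0), r, axis=1)

def reconstruct(fused_feat, low_res_frame, w_out, r=2):
    """Sketch of steps 902-904; the cascaded residual blocks are omitted."""
    resid = pixel_shuffle(fused_feat @ w_out, r)   # step 903: expand + 1x1 "conv"
    return enlarge(low_res_frame, r) + resid       # step 904: add enlargement

rng = np.random.default_rng(2)
feat = rng.standard_normal((6, 6, 8))      # optimized fused feature
lr = rng.standard_normal((6, 6, 3))        # low-resolution target frame
w_out = rng.standard_normal((8, 3 * 2 * 2)) * 0.1
hr = reconstruct(feat, lr, w_out, r=2)
```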
基于本申请实施例的技术方案，图10为本申请实施例的超分辨率结果与现有技术超分辨率结果的对比图，纵坐标为峰值信噪比（peak signal to noise ratio，PSNR），单位为分贝（decibel，dB），横坐标为每帧的测试时间，单位为毫秒（ms）。本申请实施例的超分辨率结果Ours对比由现有技术的压缩噪声去除技术与轻量超分辨率技术组成的两步方法，例如：滤波器大小可变的残差学习卷积神经网络（variable-filter-size residue-learning convolutional neural networks，VRCNN）+用于视频的高效亚像素卷积网络（video efficient sub-pixel convolution network，VESPCN）、多帧质量增强方法（multi-frame quality enhancement，MFQE）+VESPCN、基于深度卷积神经网络的自动解码器（deep convolutional neural networks-based auto decoder，DCAD）+基于光流超分辨率的视频超分辨率方法（super-resolving optical flow for video super-resolution，SOFVSR）和MFQE+SOFVSR等，在峰值信噪比上有较大提升；对比同为端到端（压缩噪声去除+视频超分辨率）的方法，例如利用非局部时空相关性的渐进融合视频超分辨率网络（progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations，PFNL）时，由于合理地利用了运动矢量的特性，平均每帧的处理时间有明显的减少，示例性的，如图10所示，本申请实施例的超分辨率的测试时间为280ms，而PFNL所使用的测试时间为850ms。Based on the technical solutions of the embodiments of the present application, FIG. 10 compares the super-resolution results of the embodiments of the present application with those of the prior art. The ordinate is the peak signal-to-noise ratio (PSNR) in decibels (dB), and the abscissa is the test time per frame in milliseconds (ms). Compared with prior-art two-step methods composed of a compression noise removal technique and a lightweight super-resolution technique, for example, variable-filter-size residue-learning convolutional neural networks (VRCNN) + the video efficient sub-pixel convolution network (VESPCN), the multi-frame quality enhancement method (MFQE) + VESPCN, the deep convolutional neural networks-based auto decoder (DCAD) + super-resolving optical flow for video super-resolution (SOFVSR), and MFQE + SOFVSR, the super-resolution result "Ours" of the embodiment of the present application achieves a considerable improvement in PSNR. Compared with end-to-end (compression noise removal + video super-resolution) methods such as the progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations (PFNL), the reasonable use of the characteristics of motion vectors significantly reduces the average per-frame processing time; for example, as shown in FIG. 10, the test time of the super-resolution of the embodiment of the present application is 280 ms, versus 850 ms for PFNL.
滤波器大小可变的残差学习卷积神经网络用于轻型的压缩噪声去除网络，由4层卷积层组成。The residual-learning convolutional neural network with variable filter size is used as a lightweight compression noise removal network consisting of 4 convolutional layers.
多帧质量增强方法用于利用运动估计和运动补偿,以及“好帧补偿差帧”思想进行压缩视频噪声去除的端到端网络。The multi-frame quality enhancement method is used for an end-to-end network for denoising compressed video using motion estimation and motion compensation, and the idea of "good frames compensate bad frames".
基于深度卷积神经网络的自动解码器用于压缩噪声去除网络，由10层卷积层组成。The deep convolutional neural networks-based auto decoder is used as a compression noise removal network consisting of 10 convolutional layers.
用于视频的高效亚像素卷积网络用于利用运动估计和运动补偿，通过对齐相邻帧考虑时域相关性，从而进行视频超分辨率的方法。The efficient sub-pixel convolution network for video is a video super-resolution method that uses motion estimation and motion compensation, taking temporal correlation into account by aligning adjacent frames.
基于光流超分辨率的视频超分辨率方法用于利用运动估计和运动补偿，通过对齐相邻帧考虑时域相关性，从而进行视频超分辨率的方法。相对于只预测低分辨率光流的用于视频的高效亚像素卷积网络，基于光流超分辨率的视频超分辨率方法选择预测更加准确的高分辨率光流。The video super-resolution method based on optical flow super-resolution likewise uses motion estimation and motion compensation, taking temporal correlation into account by aligning adjacent frames. Whereas the efficient sub-pixel convolution network for video only predicts low-resolution optical flow, the video super-resolution method based on optical flow super-resolution predicts more accurate high-resolution optical flow.
利用非局部时空相关性的渐进融合视频超分辨率网络用于通过计算非局部注意力和所提出的渐进式融合模块,进行端到端视频超分辨率的方法。A progressive fusion video super-resolution network utilizing non-local spatiotemporal correlations is used for an end-to-end video super-resolution method by computing non-local attention and the proposed progressive fusion module.
以上描述了图像处理方法,下面结合附图介绍本申请实施例的解码设备。The image processing method has been described above, and the decoding device according to the embodiments of the present application will be described below with reference to the accompanying drawings.
图11为本申请实施例中解码设备110的一实施例示意图。FIG. 11 is a schematic diagram of an embodiment of a decoding device 110 in an embodiment of the present application.
如图11所示,本申请实施例提供了解码设备,该解码设备包括:As shown in FIG. 11 , an embodiment of the present application provides a decoding device, and the decoding device includes:
获取单元1101,用于获取编码码流中目标帧的运动矢量信息、所述目标帧的图像信息和相邻帧的图像信息,所述目标帧和所述相邻帧为第一分辨率的图像,所述目标帧为需要进行超分辨率处理的图像,所述相邻帧包括位于所述目标帧之前或之后的预设周期内的图像;The obtaining unit 1101 is used to obtain the motion vector information of the target frame in the encoded code stream, the image information of the target frame and the image information of the adjacent frame, where the target frame and the adjacent frame are images of the first resolution , the target frame is an image that needs to be subjected to super-resolution processing, and the adjacent frame includes an image within a preset period before or after the target frame;
生成单元1102，用于根据所述运动矢量信息、所述目标帧的图像信息和所述相邻帧的图像信息生成重构帧，所述重构帧为第二分辨率的图像，所述第二分辨率大于所述第一分辨率，所述运动矢量信息用于指示所述相邻帧的图像信息与所述目标帧的图像信息进行自适应局部采样。The generating unit 1102 is configured to generate a reconstructed frame according to the motion vector information, the image information of the target frame, and the image information of the adjacent frames, where the reconstructed frame is an image of a second resolution, the second resolution is greater than the first resolution, and the motion vector information is used to indicate adaptive local sampling of the image information of the adjacent frames with respect to the image information of the target frame.
本申请实施例提供的方案中，解码设备通过对从编码码流获得的运动矢量信息、目标特征金字塔和相邻特征金字塔进行神经网络处理生成高分辨率重构帧。编码码流中包含了一定的运动信息，而从码流中提取运动信息的计算代价可以忽略，所以能够大量减少视频超分辨率的时间。In the solution provided by this embodiment of the present application, the decoding device generates a high-resolution reconstructed frame by applying neural network processing to the motion vector information obtained from the encoded code stream, the target feature pyramid, and the adjacent feature pyramids. The encoded code stream already contains a certain amount of motion information, and the computational cost of extracting this motion information from the code stream is negligible, so the time for video super-resolution can be greatly reduced.
可选的，生成单元1102，具体用于根据目标帧的图像信息生成目标特征金字塔，并根据每个相邻帧的图像信息各自生成一个相邻特征金字塔，目标特征金字塔包括多个尺度的目标特征，每个相邻特征金字塔包括多个尺度的相邻特征；基于运动矢量信息，以每个目标特征的位置为基准，将每个相邻特征金字塔中与每个目标特征对应位置的相邻特征进行自适应局部采样；将目标特征金字塔和自适应局部采样后的每个相邻特征金字塔进行融合生成融合特征金字塔，融合特征金字塔包括多个尺度的融合特征；对融合特征金字塔进行处理，以生成重构帧。Optionally, the generating unit 1102 is specifically configured to: generate a target feature pyramid according to the image information of the target frame, and generate one adjacent feature pyramid according to the image information of each adjacent frame, where the target feature pyramid includes target features of multiple scales and each adjacent feature pyramid includes adjacent features of multiple scales; based on the motion vector information and with the position of each target feature as a reference, perform adaptive local sampling on the adjacent features at the position corresponding to each target feature in each adjacent feature pyramid; fuse the target feature pyramid with each adaptively locally sampled adjacent feature pyramid to generate a fused feature pyramid, where the fused feature pyramid includes fused features of multiple scales; and process the fused feature pyramid to generate the reconstructed frame.
可选的，生成单元1102，还用于针对每个相邻特征金字塔，根据目标特征中第一局部特征块的坐标，以及运动矢量信息中所包含的第一局部特征块与相邻特征中第二局部特征块之间的映射关系，查找第二局部特征块；通过全连接层对第一局部特征块和第二局部特征块进行特征匹配，以确定相关注意力系数集合，相关注意力系数集合包括多个相关注意力系数，其中，每个相关注意力系数指示第一局部特征块中的一个特征点与第二局部特征块中对应特征点的相似度；基于相关注意力系数集合对第二局部特征块中的多个特征点进行加权平均，以确定自适应局部采样后的相邻特征金字塔。Optionally, the generating unit 1102 is further configured to: for each adjacent feature pyramid, find the second local feature block according to the coordinates of the first local feature block in the target feature and the mapping relationship, included in the motion vector information, between the first local feature block and the second local feature block in the adjacent feature; perform feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of relevant attention coefficients, where the set includes a plurality of relevant attention coefficients and each relevant attention coefficient indicates the similarity between a feature point in the first local feature block and the corresponding feature point in the second local feature block; and perform a weighted average on a plurality of feature points in the second local feature block based on the set of relevant attention coefficients to determine the adaptively locally sampled adjacent feature pyramid.
可选的，生成单元1102，还用于针对每个相邻特征金字塔，根据目标特征和自适应局部采样后的相邻特征计算注意力图，注意力图用于表示自适应局部采样后的相邻特征与目标特征的相似度；将自适应局部采样后的相邻特征与注意力图进行特征增强处理；将所有特征增强处理后的相邻特征和目标特征进行堆叠和卷积计算以生成融合特征，并确定融合特征金字塔。Optionally, the generating unit 1102 is further configured to: for each adjacent feature pyramid, calculate an attention map according to the target feature and the adaptively locally sampled adjacent feature, where the attention map is used to represent the similarity between the adaptively locally sampled adjacent feature and the target feature; perform feature enhancement processing on the adaptively locally sampled adjacent feature with the attention map; and stack and convolve all the feature-enhanced adjacent features and the target feature to generate fused features, and determine the fused feature pyramid.
可选的，生成单元1102，还用于将目标帧的图像信息进行卷积处理，再进行第一级联残差块的处理，以生成多个尺度的目标特征；将多个尺度的目标特征通过双线性插值的方式生成目标特征金字塔。Optionally, the generating unit 1102 is further configured to: perform convolution processing on the image information of the target frame and then processing of the first cascaded residual block to generate target features of multiple scales; and generate the target feature pyramid from the target features of multiple scales by bilinear interpolation.
可选的，生成单元1102，还用于将每个相邻帧的图像信息进行卷积处理，再进行第一级联残差块的处理，以生成多个尺度的相邻特征；将多个尺度的相邻特征通过双线性插值的方式生成一个相邻特征金字塔。Optionally, the generating unit 1102 is further configured to: perform convolution processing on the image information of each adjacent frame and then processing of the first cascaded residual block to generate adjacent features of multiple scales; and generate one adjacent feature pyramid from the adjacent features of multiple scales by bilinear interpolation.
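仅作说明，上述多尺度金字塔的构建可以用如下片段勾勒。For illustration only, the multi-scale pyramid construction described above can be sketched as follows, where 2x2 average pooling is used as a stand-in for the bilinear rescaling, and the convolution and cascaded residual blocks that would produce the input feature are omitted:

```python
import numpy as np

def build_pyramid(feat, levels=3):
    """Sketch: derive multiple scales from one feature map (level count assumed)."""
    pyr = [feat]
    for _ in range(levels - 1):
        f = pyr[-1]
        H, W, C = f.shape
        # Halve the spatial size; 2x2 average pooling replaces bilinear resizing.
        f = f[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))
        pyr.append(f)
    return pyr

feat = np.arange(8 * 8 * 2, dtype=float).reshape(8, 8, 2)
pyramid = build_pyramid(feat)
```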
可选的，生成单元1102，还用于将融合特征金字塔通过第二级联残差块计算生成优化特征金字塔；对优化特征金字塔进行尺寸扩大和卷积，以生成重构残差信号；将重构残差信号与图像放大结果相加，得到重构帧，图像放大结果为目标帧的图像信息经过双线性插值生成的。Optionally, the generating unit 1102 is further configured to: process the fused feature pyramid through the second cascaded residual block to generate an optimized feature pyramid; perform size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal; and add the reconstructed residual signal to an image enlargement result to obtain the reconstructed frame, where the image enlargement result is generated by bilinear interpolation of the image information of the target frame.
以上所描述的解码设备可以参阅前述方法实施例部分的相应内容进行理解,此处不做过多赘述。The decoding device described above can be understood by referring to the corresponding content in the foregoing method embodiment section, and details are not repeated here.
图12是本申请实施例提供的一种解码设备的结构示意图，该解码设备1200可以包括一个或一个以上中央处理器（central processing units，CPU）1201和存储器1205，该存储器1205中存储有一个或一个以上的应用程序或数据。FIG. 12 is a schematic structural diagram of a decoding device provided by an embodiment of the present application. The decoding device 1200 may include one or more central processing units (CPU) 1201 and a memory 1205, and the memory 1205 stores one or more application programs or data.
其中,存储器1205可以是易失性存储或持久存储。存储在存储器1205的程序可以包括一个或一个以上模块,每个模块可以包括对业务控制单元中的一系列指令操作。更进一步地,中央处理器1201可以设置为与存储器1205通信,在解码设备1200上执行存储器1205中的一系列指令操作。Among them, the memory 1205 may be volatile storage or persistent storage. The program stored in the memory 1205 may include one or more modules, and each module may include a series of instruction operations in the service control unit. Further, the central processing unit 1201 may be arranged to communicate with the memory 1205 to execute a series of instruction operations in the memory 1205 on the decoding device 1200.
解码设备1200还可以包括一个或一个以上电源1202，一个或一个以上有线或无线网络接口1203，一个或一个以上输入输出接口1204，和/或，一个或一个以上操作系统，例如Windows Server™、Mac OS X™、Unix™、Linux™、FreeBSD™等。The decoding device 1200 may also include one or more power supplies 1202, one or more wired or wireless network interfaces 1203, one or more input/output interfaces 1204, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
该解码设备1200可以执行前述图3至图9所示实施例中解码设备所执行的操作,具体此处不再赘述。The decoding device 1200 can perform the operations performed by the decoding device in the foregoing embodiments shown in FIG. 3 to FIG. 9 , and details are not repeated here.
在本申请的另一实施例中，还提供一种计算机可读存储介质，计算机可读存储介质中存储有计算机执行指令，当设备的处理器执行该计算机执行指令时，设备执行上述图3至图9中处理器所执行的图像处理方法的步骤。In another embodiment of the present application, a computer-readable storage medium is further provided, where the computer-readable storage medium stores computer-executable instructions. When a processor of a device executes the computer-executable instructions, the device performs the steps of the image processing method performed by the processor in the foregoing FIG. 3 to FIG. 9.
在本申请的另一实施例中,还提供一种计算机程序产品,该计算机程序产品包括计算机执行指令,该计算机执行指令存储在计算机可读存储介质中;当设备的处理器执行该计算机执行指令时,设备执行上述图3至图9中处理器所执行的图像处理方法的步骤。In another embodiment of the present application, a computer program product is also provided, the computer program product includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium; when a processor of a device executes the computer-executable instructions , the device executes the steps of the image processing method executed by the processor in the above-mentioned FIG. 3 to FIG. 9 .
在本申请的另一实施例中，还提供一种芯片系统，该芯片系统包括至少一个处理器，处理器用于支持解码设备实现上述图3至图9中处理器所执行的图像处理方法的步骤。在一种可能的设计中，芯片系统还可以包括存储器，存储器用于保存解码设备必要的程序指令和数据。该芯片系统，可以由芯片构成，也可以包含芯片和其他分立器件。In another embodiment of the present application, a chip system is further provided. The chip system includes at least one processor configured to support a decoding device in implementing the steps of the image processing method performed by the processor in the foregoing FIG. 3 to FIG. 9. In a possible design, the chip system may further include a memory for storing the program instructions and data necessary for the decoding device. The chip system may consist of chips, or may include chips and other discrete devices.
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请实施例的范围。Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Experts may use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of the embodiments of the present application.
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the system, device and unit described above may refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. For example, the division of units is only a logical function division. In actual implementation, there may be other division methods, for example, multiple units or components may be combined or integrated. to another system, or some features can be ignored, or not implemented. On the other hand, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。Units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,read-only memory)、随机存取存储器(RAM,random access memory)、磁碟或者光盘等各种可以存储程序代码的介质。The integrated unit, if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application can be embodied in the form of software products in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, and the computer software products are stored in a storage medium , including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application. The aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM, read-only memory), random access memory (RAM, random access memory), magnetic disk or optical disk and other media that can store program codes .
以上,仅为本申请实施例的具体实施方式,但本申请实施例的保护范围并不局限于此。The above are only specific implementations of the embodiments of the present application, but the protection scope of the embodiments of the present application is not limited thereto.

Claims (13)

  1. 一种图像处理方法,其特征在于,包括:An image processing method, comprising:
    解码设备获取编码码流中目标帧的运动矢量信息、所述目标帧的图像信息和相邻帧的图像信息，所述目标帧和所述相邻帧为第一分辨率的图像，所述目标帧为需要进行超分辨率处理的图像，所述相邻帧包括位于所述目标帧之前或之后的预设周期内的一个或多个图像；A decoding device obtains motion vector information of a target frame in an encoded code stream, image information of the target frame, and image information of adjacent frames, where the target frame and the adjacent frames are images of a first resolution, the target frame is an image on which super-resolution processing needs to be performed, and the adjacent frames comprise one or more images within a preset period before or after the target frame;
    所述解码设备根据所述运动矢量信息、所述目标帧的图像信息和所述相邻帧的图像信息生成重构帧，所述重构帧为第二分辨率的图像，所述第二分辨率大于所述第一分辨率，所述运动矢量信息被用于所述相邻帧的图像信息与所述目标帧的图像信息的自适应局部采样。The decoding device generates a reconstructed frame according to the motion vector information, the image information of the target frame, and the image information of the adjacent frames, where the reconstructed frame is an image of a second resolution, the second resolution is greater than the first resolution, and the motion vector information is used for adaptive local sampling of the image information of the adjacent frames with respect to the image information of the target frame.
  2. 根据权利要求1所述的图像处理方法,其特征在于,所述解码设备根据所述运动矢量信息、所述目标帧的图像信息和所述相邻帧的图像信息生成重构帧包括:The image processing method according to claim 1, wherein the decoding device generating the reconstructed frame according to the motion vector information, the image information of the target frame and the image information of the adjacent frame comprises:
    所述解码设备基于所述运动矢量信息,对所述相邻帧的图像信息中与所述目标帧的图像信息对应的各个位置进行自适应局部采样;The decoding device performs adaptive local sampling on each position corresponding to the image information of the target frame in the image information of the adjacent frames based on the motion vector information;
    所述解码设备根据所述目标帧的图像信息和自适应局部采样后的所述相邻帧的图像信息生成所述重构帧。The decoding device generates the reconstructed frame according to the image information of the target frame and the image information of the adjacent frame after adaptive local sampling.
  3. 根据权利要求2所述的图像处理方法，其特征在于，所述解码设备基于所述运动矢量信息，对所述相邻帧的图像信息中与所述目标帧的图像信息对应的各个位置进行自适应局部采样包括：The image processing method according to claim 2, wherein the decoding device performing, based on the motion vector information, adaptive local sampling on each position in the image information of the adjacent frames corresponding to the image information of the target frame comprises:
    所述解码设备根据所述目标帧的图像信息生成目标特征金字塔，并根据每个相邻帧的图像信息各自生成一个相邻特征金字塔，所述目标特征金字塔包括多个尺度的目标特征，每个所述相邻特征金字塔包括多个尺度的相邻特征；The decoding device generates a target feature pyramid according to the image information of the target frame, and generates one adjacent feature pyramid according to the image information of each adjacent frame, where the target feature pyramid includes target features of multiple scales, and each adjacent feature pyramid includes adjacent features of multiple scales;
    所述解码设备基于所述运动矢量信息,以每个目标特征的位置为基准,将所述每个相邻特征金字塔中与所述每个目标特征对应位置的相邻特征进行自适应局部采样。Based on the motion vector information, the decoding device performs adaptive local sampling on the adjacent features at the corresponding positions of each target feature in each adjacent feature pyramid with the position of each target feature as a reference.
  4. 根据权利要求3所述的图像处理方法,其特征在于,所述解码设备根据所述目标帧的图像信息和自适应局部采样后的所述相邻帧的图像信息生成所述重构帧包括:The image processing method according to claim 3, wherein the decoding device generating the reconstructed frame according to the image information of the target frame and the image information of the adjacent frames after adaptive local sampling comprises:
    所述解码设备将所述目标特征金字塔和自适应局部采样后的所述每个相邻特征金字塔进行融合生成融合特征金字塔,所述融合特征金字塔包括多个尺度的融合特征;The decoding device fuses the target feature pyramid and each adjacent feature pyramid after adaptive local sampling to generate a fusion feature pyramid, and the fusion feature pyramid includes fusion features of multiple scales;
    所述解码设备对所述融合特征金字塔进行处理,以生成所述重构帧。The decoding device processes the fused feature pyramid to generate the reconstructed frame.
  5. 根据权利要求3所述的图像处理方法，其特征在于，所述解码设备基于所述运动矢量信息，以每个目标特征的位置为基准，将所述每个相邻特征金字塔中与所述每个目标特征对应位置的相邻特征进行自适应局部采样包括：针对所述每个相邻特征金字塔，The image processing method according to claim 3, wherein the decoding device performing, based on the motion vector information and with the position of each target feature as a reference, adaptive local sampling on the adjacent features at the position corresponding to each target feature in each adjacent feature pyramid comprises: for each adjacent feature pyramid,
    所述解码设备根据所述目标特征中第一局部特征块的坐标，以及所述运动矢量信息中所包含的所述第一局部特征块与所述相邻特征中第二局部特征块之间的映射关系，查找所述第二局部特征块；The decoding device finds the second local feature block according to coordinates of a first local feature block in the target feature and a mapping relationship, included in the motion vector information, between the first local feature block and a second local feature block in the adjacent feature;
    所述解码设备通过全连接层对所述第一局部特征块和所述第二局部特征块进行特征匹配，以确定相关注意力系数集合，所述相关注意力系数集合包括多个相关注意力系数，其中，每个相关注意力系数指示所述第一局部特征块中的一个特征点与所述第二局部特征块中对应特征点的相似度；The decoding device performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of relevant attention coefficients, where the set of relevant attention coefficients includes a plurality of relevant attention coefficients, and each relevant attention coefficient indicates a similarity between a feature point in the first local feature block and a corresponding feature point in the second local feature block;
    所述解码设备基于所述相关注意力系数集合对所述第二局部特征块中的多个特征点进行加权平均,以确定自适应局部采样后的所述相邻特征金字塔。The decoding device performs a weighted average of a plurality of feature points in the second local feature block based on the relevant attention coefficient set to determine the adjacent feature pyramid after adaptive local sampling.
  6. 根据权利要求4所述的图像处理方法,其特征在于,所述解码设备将所述目标特征金字塔和自适应局部采样后的所述相邻特征金字塔进行融合生成融合特征金字塔包括:针对所述每个相邻特征金字塔,The image processing method according to claim 4, wherein the decoding device fuses the target feature pyramid and the adjacent feature pyramids after adaptive local sampling to generate a fused feature pyramid comprising: for each of the adjacent feature pyramids,
    所述解码设备根据所述目标特征和所述自适应局部采样后的所述相邻特征计算注意力图，所述注意力图用于表示所述自适应局部采样后的所述相邻特征与所述目标特征的相似度；The decoding device calculates an attention map according to the target feature and the adaptively locally sampled adjacent feature, where the attention map is used to represent a similarity between the adaptively locally sampled adjacent feature and the target feature;
    所述解码设备将所述自适应局部采样后的所述相邻特征与所述注意力图进行特征增强处理;The decoding device performs feature enhancement processing on the adjacent features after the adaptive local sampling and the attention map;
    所述解码设备将所有特征增强处理后的相邻特征和所述目标特征进行堆叠和卷积计算以生成所述融合特征,并确定所述融合特征金字塔。The decoding device performs stacking and convolution calculations on all adjacent features after feature enhancement processing and the target feature to generate the fusion feature, and determines the fusion feature pyramid.
  7. 根据权利要求3-6任一项所述的图像处理方法,其特征在于,所述解码设备根据所述目标帧的图像信息生成目标特征金字塔包括:The image processing method according to any one of claims 3-6, wherein the decoding device generating the target feature pyramid according to the image information of the target frame comprises:
    所述解码设备将所述目标帧的图像信息进行卷积处理,再进行第一级联残差块的处理,以生成目标特征;The decoding device performs convolution processing on the image information of the target frame, and then performs processing on the first cascaded residual block to generate target features;
    所述解码设备将所述目标特征通过双线性插值的方式生成多个尺度的目标特征,并构建所述目标特征金字塔。The decoding device generates target features of multiple scales from the target features by means of bilinear interpolation, and constructs the target feature pyramid.
  8. 根据权利要求3-6任一项所述的图像处理方法,其特征在于,所述解码设备根据每个所述相邻帧的图像信息各自生成一个相邻特征金字塔包括:The image processing method according to any one of claims 3-6, wherein the decoding device generates an adjacent feature pyramid according to the image information of each adjacent frame, comprising:
    所述解码设备将每个所述相邻帧的图像信息进行卷积处理，再进行第一级联残差块的处理，以生成每个所述相邻帧的图像信息分别对应的一个相邻特征；The decoding device performs convolution processing on the image information of each of the adjacent frames, and then performs processing of the first cascaded residual block, to generate one adjacent feature corresponding to the image information of each of the adjacent frames;
    所述解码设备将所述相邻特征通过所述双线性插值的方式生成多个尺度的相邻特征,并构建所述相邻特征金字塔。The decoding device generates adjacent features of multiple scales from the adjacent features by means of the bilinear interpolation, and constructs the adjacent feature pyramid.
  9. 根据权利要求4或6所述的图像处理方法,其特征在于,所述解码设备对所述融合特征金字塔进行处理生成所述重构帧包括:The image processing method according to claim 4 or 6, wherein the decoding device performs processing on the fusion feature pyramid to generate the reconstructed frame, comprising:
    所述解码设备将所述融合特征金字塔通过第二级联残差块计算生成优化特征金字塔;The decoding device generates an optimized feature pyramid by calculating the fusion feature pyramid through the second cascade residual block;
    所述解码设备对所述优化特征金字塔进行尺寸扩大和卷积,以生成重构残差信号;The decoding device performs size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal;
    所述解码设备将所述重构残差信号与图像放大结果相加,得到所述重构帧,所述图像放大结果为所述目标帧的图像信息经过双线性插值生成的。The decoding device adds the reconstructed residual signal and an image enlargement result to obtain the reconstructed frame, and the image enlargement result is generated by bilinear interpolation of image information of the target frame.
  10. A computer-readable storage medium, wherein a program is stored in the computer-readable storage medium, and when a computer executes the program, the method according to any one of claims 1 to 9 is performed.
  11. A computing device, comprising a processor and a computer-readable storage medium storing a computer program;
    The processor is coupled to the computer-readable storage medium, and when the computer program is executed by the processor, the method according to any one of claims 1 to 9 is implemented.
  12. A computer program product, wherein when the computer program product is executed on a computer, the computer performs the method according to any one of claims 1 to 9.
  13. A chip system, comprising a processor, wherein the processor is invoked to perform the method according to any one of claims 1 to 9.
PCT/CN2021/120193 2020-09-30 2021-09-24 Image processing method and apparatus WO2022068682A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011063576.4A CN114339260A (en) 2020-09-30 2020-09-30 Image processing method and device
CN202011063576.4 2020-09-30

Publications (1)

Publication Number Publication Date
WO2022068682A1 true WO2022068682A1 (en) 2022-04-07

Family

ID=80949600

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120193 WO2022068682A1 (en) 2020-09-30 2021-09-24 Image processing method and apparatus

Country Status (2)

Country Link
CN (1) CN114339260A (en)
WO (1) WO2022068682A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115623242A (en) * 2022-08-30 2023-01-17 华为技术有限公司 Video processing method and related equipment thereof
CN115861131B (en) * 2023-02-03 2023-05-26 北京百度网讯科技有限公司 Training method and device for generating video and model based on image, and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100290529A1 (en) * 2009-04-14 2010-11-18 Pankaj Topiwala Real-time superresolution and video transmission
CN102236889A (en) * 2010-05-18 2011-11-09 王洪剑 Super-resolution reconfiguration method based on multiframe motion estimation and merging
CN106851046A (en) * 2016-12-28 2017-06-13 中国科学院自动化研究所 Video dynamic super-resolution processing method and system
CN111047516A (en) * 2020-03-12 2020-04-21 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11734837B2 (en) * 2020-09-30 2023-08-22 Shanghai United Imaging Intelligence Co., Ltd. Systems and methods for motion estimation
CN115190251A (en) * 2022-07-07 2022-10-14 北京拙河科技有限公司 Airport ground safety analysis method and device based on hundred million image array type camera
CN115190251B (en) * 2022-07-07 2023-09-22 北京拙河科技有限公司 Airport ground safety analysis method and device based on Yilike array camera
CN115567719A (en) * 2022-08-23 2023-01-03 天津市国瑞数码安全系统股份有限公司 Multi-level convolution video compression method and system
CN117714691A (en) * 2024-02-05 2024-03-15 佳木斯大学 Adaptive transmission system for AR (augmented reality) piano teaching
CN117714691B (en) * 2024-02-05 2024-04-12 佳木斯大学 Adaptive transmission system for AR (augmented reality) piano teaching

Also Published As

Publication number Publication date
CN114339260A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
WO2022068682A1 (en) Image processing method and apparatus
US10462457B2 (en) Dynamic reference motion vector coding mode
US9602819B2 (en) Display quality in a variable resolution video coder/decoder system
US9407915B2 (en) Lossless video coding with sub-frame level optimal quantization values
KR100985464B1 (en) Scaler architecture for image and video processing
WO2021042957A1 (en) Image processing method and device
US20240098298A1 (en) Segmentation-based parameterized motion models
CN107071440B (en) Motion vector prediction using previous frame residuals
WO2017129023A1 (en) Decoding method, encoding method, decoding apparatus, and encoding apparatus
US20230069953A1 (en) Learned downsampling based cnn filter for image and video coding using learned downsampling feature
WO2021109978A1 (en) Video encoding method, video decoding method, and corresponding apparatuses
WO2023000179A1 (en) Video super-resolution network, and video super-resolution, encoding and decoding processing method and device
CN115552905A (en) Global skip connection based CNN filter for image and video coding
WO2020143585A1 (en) Video encoder, video decoder, and corresponding method
TW202239209A (en) Multi-scale optical flow for learned video compression
US9210424B1 (en) Adaptive prediction block size in video coding
CN111225214A (en) Video processing method and device and electronic equipment
WO2022266955A1 (en) Image decoding method and apparatus, image processing method and apparatus, and device
US10448013B2 (en) Multi-layer-multi-reference prediction using adaptive temporal filtering
WO2021169817A1 (en) Video processing method and electronic device
CN114554205A (en) Image coding and decoding method and device
TWI834087B (en) Method and apparatus for reconstructing an image from bitstreams and encoding an image into bitstreams, and computer program product
CN116760976B (en) Affine prediction decision method, affine prediction decision device, affine prediction decision equipment and affine prediction decision storage medium
RU2787217C1 (en) Method and device for interpolation filtration for encoding with prediction
WO2023000182A1 (en) Image encoding, decoding and processing methods, image decoding apparatus, and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21874343

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21874343

Country of ref document: EP

Kind code of ref document: A1