WO2022068682A1 - Image processing method and apparatus - Google Patents


Info

Publication number
WO2022068682A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
adjacent
decoding device
target
frame
Prior art date
Application number
PCT/CN2021/120193
Other languages
French (fr)
Chinese (zh)
Inventor
王诗淇
孙龙
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
香港城市大学 (City University of Hong Kong)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.) and 香港城市大学 (City University of Hong Kong)
Publication of WO2022068682A1 publication Critical patent/WO2022068682A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/513Processing of motion vectors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/527Global motion vector estimation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/53Multi-resolution motion estimation; Hierarchical motion estimation

Definitions

  • the embodiments of the present application relate to the field of image processing, and in particular, to an image processing method and apparatus.
  • Existing video super-resolution techniques usually use the combined local and global total-variation (CLG-TV) optical flow model algorithm to calculate optical flow velocity vectors between all low-resolution video images in the sequence and the current-frame video image, that is, to perform motion estimation; they obtain the motion-compensated low-resolution video images of the 2T frames and the low-resolution video image of the current frame according to the optical flow velocity vectors, and then use a deep residual network to process the low-resolution video images of the 2T frames and the low-resolution video image of the current frame through an initial stage, a concatenated convolution layer calculation stage, and a residual block calculation stage in turn, and finally reconstruct the high-resolution video image step by step using deconvolution and convolution operations.
  • the embodiments of the present application provide an image processing method, which uses the motion vector information in an encoded code stream to process a frame to be super-resolved into a high-resolution reconstructed frame, avoiding the large amount of computation required for motion estimation and significantly reducing the time for video super-resolution.
  • a first aspect of the embodiments of the present application provides an image processing method, the method includes: a decoding device obtains, from an encoded code stream, motion vector information of a target frame, image information of the target frame, and image information of adjacent frames
  • the target frame and the adjacent frames are images of a first resolution
  • the target frame is an image that needs to undergo super-resolution processing
  • the adjacent frames include images within a preset period before or after the target frame
  • the decoding device generates a reconstructed frame from the image information of the target frame and the image information of the adjacent frames
  • the reconstructed frame is an image of a second resolution
  • the second resolution is greater than the first resolution
  • the motion vector information is used to guide adaptive local sampling between the image information of the adjacent frames and the image information of the target frame.
  • the encoded code stream may be a code stream generated by an image compression coding technique that includes predictive coding based on motion estimation and compensation; the motion estimation and motion compensation algorithms are used to remove redundant information in the temporal domain, that is, the image compression coding technique may apply motion estimation algorithms to determine the motion vector information.
  • the decoding device may directly extract the motion vector information from the encoded code stream and decode the encoded code stream to obtain decoded image information, where the image information may include the image information of the target frame on which super-resolution processing is to be performed and the image information of the adjacent frames within a preset period before or after the target frame.
  • the adjacent frames may include T frames before and after the target frame.
  • the adjacent frames are all images of the first resolution, that is, the low resolution.
  • the decoding device can process the image information of the target frame and the image information of the adjacent frames in combination with the motion vector information to generate a reconstructed frame of the second resolution. The reconstructed frame is a high-resolution image; that is, the second resolution is greater than the above-mentioned first resolution, and the motion vector information is used to guide adaptive local sampling between the image information of the adjacent frames and the image information of the target frame.
  • in the embodiments of the present application, the decoding device does not need to perform the computation-intensive motion estimation process when performing super-resolution, which can greatly reduce the time for video super-resolution.
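The window of adjacent frames described above (the T frames before and after the target frame) can be sketched as follows; the helper name and parameters are illustrative, not from the patent:

```python
def adjacent_frame_indices(target_idx, num_frames, t=2):
    """Indices of the T frames before and after the target frame,
    clipped to the bounds of the sequence (t is the window radius T)."""
    lo = max(0, target_idx - t)
    hi = min(num_frames - 1, target_idx + t)
    return [i for i in range(lo, hi + 1) if i != target_idx]

print(adjacent_frame_indices(5, 10, t=2))  # frames 3, 4, 6, 7
```

Near the start or end of the sequence the window is simply truncated; other boundary policies (padding, mirroring) are equally plausible and not specified here.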
  • the decoding device performs, based on the motion vector information, adaptive local sampling at each position in the image information of the adjacent frames corresponding to the image information of the target frame; a reconstructed frame is generated from the image information of the target frame and the adaptively sampled image information of the adjacent frames.
  • for each local position of the image information of the target frame, the decoding device can, in combination with the motion vector information, adaptively select and sample the image points with high similarity from the multiple image points at the corresponding local position in the image information of the adjacent frames.
  • the decoding device can then perform image reconstruction based on the adaptively sampled image information of the adjacent frames to generate a reconstructed frame, which can reduce the noise present in the motion vector information and improve robustness.
  • the decoding device in the above steps performing, based on the motion vector information, adaptive local sampling at each position in the image information of the adjacent frames corresponding to the image information of the target frame includes: the decoding device generates a target feature pyramid according to the image information of the target frame, and generates an adjacent feature pyramid according to the image information of each adjacent frame.
  • the target feature pyramid includes target features of multiple scales, and each adjacent feature pyramid includes adjacent features of multiple scales.
  • based on the motion vector information, the decoding device performs adaptive local sampling on the adjacent features corresponding to each target feature in each adjacent feature pyramid, taking the position of each target feature as a reference.
  • the feature pyramid of an image is a series of feature sets arranged in a pyramid shape.
  • the feature pyramid is obtained by down-sampling an original feature step by step, so the size is reduced layer by layer.
  • a feature extraction function extracts features from the image information of the target frame and the image information of the adjacent frames to obtain the target features and adjacent features, then generates target features and adjacent features of multiple scales through downsampling, and forms the corresponding feature pyramids.
  • the decoding device may, based on the scale invariance of the feature pyramid, take the position of each target feature at each scale as a benchmark and perform adaptive local sampling on the adjacent features at the corresponding position of each adjacent feature pyramid according to the motion vector information; that is, it refines the search for better-matching features within the adjacent features at the mapped position, improving the feature quality of each adjacent feature.
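The step-by-step downsampling that builds a feature pyramid can be sketched as below. This is a minimal stand-in: 2x2 average pooling replaces the bilinear interpolation described in the patent, and random arrays replace learned features; all names are illustrative.

```python
import numpy as np

def downsample2x(feat):
    """Halve the spatial size by 2x2 average pooling (a simple stand-in
    for the bilinear downsampling described in the text)."""
    c, h, w = feat.shape
    return feat[:, :h // 2 * 2, :w // 2 * 2].reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def build_pyramid(feat, levels=3):
    """Feature pyramid: level 0 is the original feature map; each higher
    level is obtained by downsampling step by step, so size shrinks layer by layer."""
    pyr = [feat]
    for _ in range(levels - 1):
        pyr.append(downsample2x(pyr[-1]))
    return pyr

feat = np.random.rand(8, 32, 32)   # (channels, H, W)
pyr = build_pyramid(feat, levels=3)
print([p.shape for p in pyr])      # each level half the size of the previous
```

Target and adjacent pyramids would be built the same way, from features extracted by weight-sharing networks as described below.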
  • the decoding device in the above steps generates a reconstructed frame according to the image information of the target frame and the image information of the adjacent frames after adaptive local sampling, including: the decoding device converts the target feature pyramid and the Each adjacent feature pyramid after adaptive local sampling is fused to generate a fused feature pyramid, and the fused feature pyramid includes fused features of multiple scales; the decoding device processes the fused feature pyramid to generate a reconstructed frame.
  • the decoding device stacks the adaptively sampled adjacent feature pyramids and the target feature pyramid and convolutionally fuses them into a fused feature pyramid; the fused feature pyramid can then be reconstructed into a high-resolution image.
  • stacking (concat) is the merging of feature channels; that is to say, the number of features (channels) describing the image increases, while the information under each feature does not increase.
  • the above-mentioned decoding device, based on the motion vector information and taking the position of each target feature as a reference, performing adaptive local sampling on the adjacent features of each adjacent feature pyramid corresponding to the position of each target feature includes: for each adjacent feature pyramid, the decoding device finds the second local feature block based on the coordinates of the first local feature block in the target feature and the mapping relationship, included in the motion vector information, between the first local feature block and the second local feature block in the adjacent feature; the decoding device performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of correlation attention coefficients; the set includes a plurality of correlation attention coefficients, where each correlation attention coefficient indicates the similarity between a feature point in the first local feature block and the corresponding feature point in the second local feature block; the decoding device performs a weighted average of the multiple feature points in the second local feature block based on the set of correlation attention coefficients to determine the adaptively sampled adjacent feature pyramid.
  • the attention coefficient is a degree of attention; that is, more attention is attached to the feature points in the adjacent features that are more similar to the target feature.
  • the decoding device extracts the first local feature block at a scale of the target feature pyramid and, based on the mapping relationship indicated by the motion vector information and the coordinates of the first local feature block, extracts the second local feature block from the corresponding coordinates of an adjacent feature pyramid; it then determines the attention coefficient of each feature point in the second local feature block through a two-layer fully connected layer and attaches the corresponding degree of attention to each feature point in the second local feature block, obtaining the adaptively sampled second local feature block. After processing the feature blocks extracted from the adjacent features of all adjacent feature pyramids at all scales corresponding to the target feature pyramid, the adaptively sampled adjacent feature pyramids are obtained.
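A minimal sketch of the attention-weighted sampling described above. A plain dot-product similarity stands in for the two-layer fully connected network of the patent, and a single averaged output point stands in for the full re-sampled block; names and shapes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_local_sample(target_block, adj_block):
    """Re-weight an adjacent local feature block by attention.
    target_block, adj_block: (k*k, c) arrays of flattened feature points.
    Returns the attention coefficients and the attention-weighted average
    of the adjacent feature points."""
    # similarity of each adjacent feature point to the matching target point
    sim = (target_block * adj_block).sum(axis=1)        # (k*k,)
    attn = softmax(sim)                                 # correlation attention coefficients
    # weighted average of the feature points in the second local block
    sampled = (attn[:, None] * adj_block).sum(axis=0)   # (c,)
    return attn, sampled

tgt = np.random.rand(9, 4)   # a 3x3 block with 4 channels, flattened
adj = np.random.rand(9, 4)
attn, sampled = adaptive_local_sample(tgt, adj)
```

The coefficients form a distribution over the block, so more similar adjacent feature points contribute more to the sampled result, matching the "degree of attention" reading above.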
  • the decoding device in the above steps fusing the target feature pyramid and the adaptively sampled adjacent feature pyramids to generate a fused feature pyramid includes: for each adjacent feature pyramid, the decoding device calculates an attention map according to the target feature and the adaptively sampled adjacent features, where the attention map represents the similarity between the adaptively sampled adjacent features and the target feature; the decoding device performs feature enhancement processing on the adaptively sampled adjacent features using the attention map; the decoding device stacks and convolves all feature-enhanced adjacent features and the target features to generate fused features and determine the fused feature pyramid.
  • the decoding device can generate an attention map in the time domain according to the alignment quality of the adjacent frame features, and dynamically fuse the adaptively sampled adjacent frame features with the target frame features by increasing the weight of high-quality adaptively sampled local regions and reducing the weight of low-quality regions.
  • the alignment quality can be expressed by calculating, at each coordinate point, the feature inner product of the adaptively sampled adjacent frame feature and the target feature; the feature inner product can represent the similarity between the adaptively sampled adjacent feature and the target feature. Each feature region is then weighted, for example, by multiplying the feature and the above attention map point by point.
  • the above-mentioned stacking operation is performed on the feature-enhanced adjacent features and the target features, and the fused features are generated by convolution. After performing one feature fusion, the decoding device checks whether there remain adjacent features that have not undergone feature enhancement processing; all adjacent features and target features are fused to generate the fused feature pyramid.
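The per-pixel inner-product attention map and point-wise re-weighting described above can be sketched as follows. This is a hedged stand-in: a sigmoid squashes the inner product into a weight, and the learned convolution that would mix the stacked channels back down is omitted; all names are illustrative.

```python
import numpy as np

def fuse_features(target, neighbors):
    """Temporal fusion sketch.
    target: (c, h, w) target-frame feature; neighbors: list of (c, h, w)
    adaptively sampled (aligned) adjacent-frame features.
    Each neighbor is re-weighted point by point with an attention map
    (per-pixel inner product with the target, squashed by a sigmoid),
    then everything is stacked along the channel axis."""
    enhanced = []
    for n in neighbors:
        inner = (target * n).sum(axis=0)          # (h, w) per-pixel inner product
        attn = 1.0 / (1.0 + np.exp(-inner))       # attention map in (0, 1)
        enhanced.append(n * attn[None])           # point-wise feature enhancement
    # channel-wise stacking (concat); a learned convolution would follow
    return np.concatenate([target] + enhanced, axis=0)

tgt = np.random.rand(4, 8, 8)
fused = fuse_features(tgt, [np.random.rand(4, 8, 8), np.random.rand(4, 8, 8)])
```

Note how the channel count grows with each stacked neighbor, consistent with the description of concat as merging channel counts without adding information per feature.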
  • the decoding device in the above steps generating the target feature pyramid according to the image information of the target frame includes: the decoding device convolves the image information of the target frame and then processes it with first cascaded residual blocks to generate a target feature corresponding to the image information of the target frame; the decoding device generates target features of multiple scales by bilinear interpolation and constructs the target feature pyramid.
  • the residual blocks use skip connections to improve accuracy while adding considerable depth, where a skip connection routes the received input information directly to the output, protecting the integrity of the information.
  • the scale represents the number of pixels in the image.
  • the decoding device extracts a target feature in the image information of the target frame through a feature extraction function, where the feature extraction function includes convolution processing and cascaded residual blocks.
  • the decoding device can reduce the image to different degrees by downsampling the target features through bilinear interpolation, so as to obtain target features of different scales, and then arrange the target features according to the scale to generate the target feature pyramid.
  • the number of pixels in each layer of the pyramid continues to decrease from bottom to top, which can greatly reduce the amount of computation.
  • the decoding device generating an adjacent feature pyramid according to the image information of each adjacent frame includes: the decoding device convolves the image information of each adjacent frame and then processes it with the first cascaded residual blocks to generate an adjacent feature corresponding to the image information of each adjacent frame; the decoding device generates adjacent features of multiple scales by bilinear interpolation and constructs the adjacent feature pyramid.
  • the decoding device simultaneously performs feature extraction on the image information of the adjacent frames through a plurality of feature extraction functions that share weights with the above-mentioned feature extraction function to obtain adjacent features, then reduces the image to different degrees by bilinear-interpolation downsampling to obtain adjacent features of different scales, and arranges the adjacent features by scale to generate the adjacent feature pyramids.
  • the decoding device processing the fused feature pyramid to generate the reconstructed frame includes: the decoding device processes the fused feature pyramid through second cascaded residual blocks to generate an optimized feature pyramid; the decoding device enlarges and convolves the optimized feature pyramid to generate a reconstructed residual signal; the decoding device adds the reconstructed residual signal and an image enlargement result to obtain the reconstructed frame, where the image enlargement result is generated by bilinear interpolation of the image information of the target frame.
  • the decoding device exchanges information among the features of each scale level in the fused feature pyramid generated above through the second cascaded residual blocks.
  • the features of each scale level can be upsampled or downsampled so that they interact at the same scale to optimize the fused features; the optimized fused features are then enlarged and convolved, and added to the image information of the target frame enlarged by bilinear interpolation, yielding the high-resolution reconstructed frame.
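The final reconstruction step, residual signal plus an enlarged copy of the target frame, can be sketched as below. Nearest-neighbour upsampling stands in for the bilinear interpolation of the patent, and the residual is taken as given rather than produced by the residual blocks; names are illustrative.

```python
import numpy as np

def upsample2x(img):
    """Nearest-neighbour 2x upsampling (a stand-in for the bilinear
    interpolation used in the text)."""
    return img.repeat(2, axis=-2).repeat(2, axis=-1)

def reconstruct(target_lr, residual_hr):
    """High-resolution frame = enlarged target frame + reconstructed
    residual signal, as in the final step described above."""
    return upsample2x(target_lr) + residual_hr

lr = np.random.rand(3, 16, 16)       # low-resolution target frame (C, H, W)
res = np.zeros((3, 32, 32))          # residual signal from the network
hr = reconstruct(lr, res)            # (3, 32, 32) reconstructed frame
```

With a zero residual the output is just the enlarged target frame, which makes the division of labour explicit: the network only has to predict the high-frequency detail missing from the interpolated image.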
  • a second aspect of the embodiments of the present application provides a decoding device, where the decoding device has a function of implementing the method of the first aspect or any possible implementation manner of the first aspect.
  • This function can be implemented by hardware or by executing corresponding software by hardware.
  • the hardware or software includes one or more modules corresponding to the above functions, such as a receiving unit and a processing unit.
  • a third aspect of the embodiments of the present application provides a computer device, the computer device includes at least one processor, a storage system, an input/output (I/O) interface, and computer-executable instructions stored in the storage system and runnable on the processor; when the computer-executable instructions are executed by the processor, the processor performs the method according to the first aspect or any possible implementation manner of the first aspect.
  • a fourth aspect of the embodiments of the present application provides a computer-readable storage medium that stores one or more computer-executable instructions; when the computer-executable instructions are executed by a processor, the processor performs the method according to the first aspect or any possible implementation manner of the first aspect.
  • a fifth aspect of the embodiments of the present application provides a computer program product that stores one or more computer-executable instructions; when the computer-executable instructions are executed by a processor, the processor performs the method according to the first aspect or any possible implementation manner of the first aspect.
  • a sixth aspect of an embodiment of the present application provides a chip system, where the chip system includes at least one processor, and the at least one processor is configured to support a decoding device in implementing the functions of the first aspect or any possible implementation manner of the first aspect.
  • the chip system may further include a memory for storing necessary program instructions and data of the decoding device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • the decoding device generates high-resolution reconstructed frames by performing neural network processing on the motion vector information, the target feature pyramid, and the adjacent feature pyramids obtained from the encoded code stream; the encoded code stream already contains a certain amount of motion information, and the computational cost of extracting that motion information from the code stream is negligible, so the time for video super-resolution can be greatly reduced.
  • FIG. 1 is an application scenario of an embodiment of the present application
  • FIG. 2 is a schematic block diagram of a video encoding and decoding system in an embodiment of the present application
  • FIG. 3 is a flowchart of an embodiment of an image processing method in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of extracting local feature blocks in an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a decoding device in an embodiment of the present application.
  • FIG. 6 is an image processing flowchart of the feature extraction module in an embodiment of the present application.
  • FIG. 7 is an image processing flowchart of the flexible alignment module in an embodiment of the present application.
  • FIG. 8 is an image processing flowchart of a multi-frame feature fusion module in an embodiment of the present application.
  • FIG. 9 is an image processing flowchart of the feature super-resolution reconstruction module in an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a decoding device according to an embodiment of the present application.
  • FIG. 12 is another schematic structural diagram of a decoding device in an embodiment of the present application.
  • Embodiments of the present application provide an image processing method and a decoding device, which are used to reduce the time for video super-resolution.
  • the corresponding apparatus may include one or more units, such as functional units, to perform the described method step or steps (eg, one unit performing one or more steps, or multiple units each performing one or more of the steps), even if such unit or units are not explicitly described or illustrated in the figures.
  • the corresponding method may contain one step to perform the functionality of the one or more units (eg, one step performing the functionality of the one or more units, or multiple steps each performing the functionality of one or more of the plurality of units), even if such step or steps are not explicitly described or illustrated in the figures.
  • Video coding generally refers to the processing of sequences of images that form a video or video sequence.
  • the terms "picture", "frame", or "image" may be used as synonyms.
  • Video encoding is performed on the source side and typically involves processing (eg, by compressing) the original video image to reduce the amount of data required to represent the video image for more efficient storage and/or transmission.
  • Video decoding is performed on the destination side and typically involves inverse processing relative to the encoder to reconstruct the video image.
  • the "encoding" of video images referred to in the embodiments should be understood as referring to the "encoding" or "decoding" of video sequences.
  • the combination of the encoding part and the decoding part is also called a codec (encoding and decoding).
  • This embodiment can be applied to the application scenario shown in FIG. 1 .
  • the terminal 11, the server 12, the set-top box 13, and the display 14 are connected through a wireless or wired network, and the terminal 11 can use locally installed application software (APP) to remotely control the display 14. For example, the user can select a video source for playback by operating on the operation interface of the terminal 11; the terminal 11 has the video source encoded by the server 12 and forwarded to the set-top box 13, the set-top box 13 decodes the encoded video source and sends it to the display 14, and the display 14 can then play the decoded video source.
  • FIG. 2 exemplarily shows a schematic block diagram of a video encoding and decoding system to which the embodiments of the present application are applied.
  • a video encoding and decoding system may include an encoding apparatus 21 and a decoding apparatus 22, where the encoding apparatus 21 generates encoded video data.
  • Decoding apparatus 22 may decode the encoded video data generated by encoding apparatus 21 .
  • Various implementations of encoding apparatus 21, decoding apparatus 22, or both may include one or more processors and a memory coupled to the one or more processors.
  • the memory may include, but is not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures accessible by a computer, as described herein.
  • the encoding device 21 and decoding device 22 may comprise various devices, including desktop computers, mobile computing devices, notebook (eg, laptop) computers, tablet computers, set-top boxes, telephone handsets such as so-called "smart" phones, televisions, cameras, display devices, digital media players, video game consoles, in-vehicle computers, wireless communication devices, or the like.
  • although FIG. 2 shows the encoding device 21 and the decoding device 22 as separate devices, a device embodiment may also include both the encoding device 21 and the decoding device 22, or the functionality of both, that is, the encoding device 21 or the corresponding functionality and the decoding device 22 or the corresponding functionality.
  • encoding device 21 or corresponding functionality and decoding device 22 or corresponding functionality may be implemented using the same hardware and/or software, or using separate hardware and/or software, or any combination thereof.
  • the encoding device 21 and the decoding device 22 may be communicatively connected through a link 23 , and the decoding device 22 may receive encoded video data from the encoding device 21 via the link 23 .
  • Link 23 may include one or more media or devices capable of moving encoded video data from encoding apparatus 21 to decoding apparatus 22.
  • link 23 may include one or more communication media that enable encoding device 21 to transmit encoded video data directly to decoding device 22 in real-time.
  • encoding apparatus 21 may modulate the encoded video data according to a communication standard, such as a wireless communication protocol, and may transmit the modulated video data to decoding apparatus 22 .
  • the one or more communication media may include wireless and/or wired communication media, such as radio frequency spectrum or one or more physical transmission lines.
  • the one or more communication media may form part of a packet-based network, such as a local area network, a wide area network, or a global network (eg, the Internet).
  • the one or more communication media may include routers, switches, base stations, or other devices that facilitate communication from encoding device 21 to decoding device 22 .
  • the encoding device 21 includes an encoder 211 , and optionally, the encoding device 21 may further include an image preprocessor 212 and a first communication interface 213 .
  • the encoder 211 , the image preprocessor 212 and the first communication interface 213 may be hardware components in the encoding device 21 , or may be software programs in the encoding device 21 .
  • the image preprocessor 212 is configured to receive the original image data 214 transmitted by an external terminal and to preprocess the original image data 214 to obtain preprocessed image data 215.
  • the preprocessing performed by the image preprocessor 212 may include trimming, color format conversion (eg, from a red-green-blue (RGB) format to a luma-chroma (YUV, where Y stands for luma and UV for chroma) format), toning, or denoising.
  • the image can be regarded as a two-dimensional array or matrix of picture elements.
  • the pixels in the array can also be called sampling points.
  • the number of sampling points in the horizontal and vertical directions (or axes) of an array or image defines the size and/or resolution of the image.
  • three color components are usually employed, i.e. an image can be represented as or contain three arrays of samples.
  • an image includes corresponding arrays of red, green and blue samples.
  • each pixel is usually represented in a luma/chroma format or color space; for example, a YUV-format image includes a luma component indicated by Y (sometimes also indicated by L) and two chrominance components indicated by U and V.
  • the luminance (luma) component Y represents the luminance or gray level intensity (eg, both are the same in a gray scale image), while the two chroma (chroma) components U and V represent the chrominance or color information components.
  • an image in YUV format includes a luma sample array of luma sample values (Y) and two chroma sample arrays of chroma values (U and V). Images in RGB format can be converted or transformed to YUV format and vice versa; the process is also known as color transformation or conversion. If the image is black and white, the image may include only an array of luma samples.
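The RGB-to-YUV conversion mentioned here is a linear transform. A sketch using the BT.601 full-range coefficients follows; this is one common convention, and real video pipelines differ in range and matrix choice:

```python
import numpy as np

# BT.601 full-range RGB -> YUV matrix (one common convention).
RGB2YUV = np.array([[ 0.299,    0.587,    0.114  ],
                    [-0.14713, -0.28886,  0.436  ],
                    [ 0.615,   -0.51499, -0.10001]])

def rgb_to_yuv(rgb):
    """rgb: (..., 3) array in [0, 1]; returns the (..., 3) YUV representation."""
    return rgb @ RGB2YUV.T

def yuv_to_rgb(yuv):
    """Inverse transform back to RGB (color conversion is invertible)."""
    return yuv @ np.linalg.inv(RGB2YUV).T

px = np.array([0.5, 0.2, 0.8])
assert np.allclose(yuv_to_rgb(rgb_to_yuv(px)), px)   # round-trips losslessly
```

For a gray pixel (R = G = B) the chroma components U and V come out (near) zero, consistent with the text's note that a black-and-white image needs only the luma array.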
  • an encoder 211 (or video encoder 211) receives the preprocessed image data 215 and processes the preprocessed image data 215 using a relevant prediction mode (such as the prediction modes in various embodiments herein), thereby providing encoded image data 216.
  • a first communication interface 213 that can be used to receive the encoded image data 216 and to transmit the encoded image data 216 via the link 23 to the decoding device 22 or any other device (eg, memory) for storage or direct reconstruction.
  • the other device may be any device for decoding or storage.
  • the first communication interface 213 may be used, for example, to encapsulate the encoded image data 216 into a suitable format, such as a data packet, for transmission over the link 23 .
  • the decoding device 22 includes a decoder 221 , and optionally, the decoding device 22 may further include a second communication interface 222 and an image post-processor 223 . They are described as follows:
  • a second communication interface 222 may be used to receive encoded image data 216 from the encoding device 21 or any other source, such as a storage device, such as an encoded image data storage device.
  • the second communication interface 222 may be used to transmit or receive the encoded image data 216 via the link 23 between the encoding device 21 and the decoding device 22, such as a direct wired or wireless connection, or via any kind of network, Networks of any kind are, for example, wired or wireless networks or any combination thereof, or private and public networks of any kind, or any combination thereof.
  • the second communication interface 222 may be used, for example, to decapsulate the data packets transmitted by the first communication interface 213 to obtain the encoded image data 216 .
  • Both the second communication interface 222 and the first communication interface 213 may be configured as one-way or two-way communication interfaces, and may be used, for example, to send and receive messages to establish a connection, and to acknowledge and exchange any other information related to the communication link and/or to data transmission, such as encoded image data transmission.
  • Decoder 221 receives encoded image data 216 and provides decoded image data 224 or decoded image 224 .
  • the decoder 221 may be configured to execute various embodiments described later, so as to realize the application of the image processing method described in this application on the decoding side.
  • the post-processing performed by the image post-processor 223 may include color format conversion (eg, from YUV format to RGB format), toning, trimming or resampling, or any other processing, and the post-processed image data may be transmitted to an external display device for playback.
  • the display device may be or include any type of display for presenting the reconstructed image, eg, an integrated or external display or monitor.
  • displays may include liquid crystal displays (LCDs), organic light emitting diode (OLED) displays, plasma displays, projectors, micro-LED displays, liquid crystal on silicon (LCoS), digital light processors (DLPs), or any other kind of display.
  • Although FIG. 2 depicts the encoding device 21 and the decoding device 22 as separate devices, device embodiments may also include both at the same time, ie both the encoding device 21 or corresponding functionality and the decoding device 22 or corresponding functionality.
  • encoding device 21 or corresponding functionality and decoding device 22 or corresponding functionality may be implemented using the same hardware and/or software, or using separate hardware and/or software, or any combination thereof.
  • the encoding device 21 and the decoding device 22 may include any of a variety of devices, including any class of handheld or stationary devices, such as notebook or laptop computers, mobile phones, smart phones, tablets or tablet computers, video cameras, desktop computers, set-top boxes, televisions, cameras, in-vehicle devices, display devices, digital media players, video game consoles, video streaming devices (such as content serving servers or content distribution servers), broadcast receiver devices, broadcast transmitter devices, etc., and may use no operating system or any kind of operating system.
  • Both the encoder 211 and the decoder 221 may be implemented as any of a variety of suitable circuits, eg, one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, or any combination thereof.
  • an apparatus may store instructions for the software in a suitable non-transitory computer-readable storage medium and may execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Any of the foregoing (including hardware, software, a combination of hardware and software, etc.) may be considered one or more processors.
  • the video encoding and decoding system shown in FIG. 2 is merely an example, and the techniques of this application may be applicable to video encoding setups (eg, video encoding or video decoding) that do not necessarily involve any data communication between the encoding and decoding devices.
  • data may be retrieved from local storage, streamed over a network, and the like.
  • a video encoding device may encode and store data to memory, and/or a video decoding device may retrieve and decode data from memory.
  • encoding and decoding is performed by devices that do not communicate with each other but only encode data to and/or retrieve data from memory and decode data.
  • the original video image can be reconstructed, ie the reconstructed video image has the same quality as the original video image (eg no transmission loss or other data loss during storage or transmission).
  • the super-resolution algorithm used in the process of reconstructing the video image at the decoder side requires motion estimation, and motion estimation consumes a large amount of computing resources.
  • the embodiment of the present application provides a corresponding image processing method.
  • the method includes the decoding device acquiring the motion vector information of the target frame, the image information of the target frame and the image information of adjacent frames, where the target frame and the adjacent frames are images of a first resolution, the target frame is an image that needs to be processed by super-resolution, and the adjacent frames include images located in a preset period before or after the target frame;
  • the decoding device generates a reconstructed frame according to the motion vector information, the image information of the target frame and the image information of the adjacent frames; the reconstructed frame is an image of a second resolution, the second resolution is greater than the first resolution, and the motion vector information is used to perform adaptive local sampling on the image information of the adjacent frames relative to the image information of the target frame. In this way, the present application uses the motion vector information of the target frame already present in the encoded code stream to improve the resolution of the reconstructed frame, saving the resources that re-estimating the motion vector information would consume.
  • the implementation of the embodiment of the present application may also be: the decoding device, based on the motion vector information, performs adaptive local sampling at each position in the image information of the adjacent frames corresponding to the image information of the target frame; the decoding device then generates a reconstructed frame according to the image information of the target frame and the image information of the adjacent frames after adaptive local sampling.
  • image points with high similarity can be selected for sampling through adaptive local sampling, so as to reduce the influence of noise in the motion vector information, and improve the robustness of super-resolution.
  • an embodiment of the image processing method of the present application includes:
  • the decoding device acquires the image information of the target frame and the image information of the adjacent frames in the encoded code stream.
  • after receiving the encoded code stream sent by the server, the decoder can decode the encoded video to obtain the image information of the target frame and the image information of adjacent frames.
  • the target frame is an image that needs to be subjected to super-resolution processing in this embodiment.
  • Super-resolution means improving the resolution of the original image by means of hardware or software, that is, obtaining a high-resolution image from a series of low-resolution images; the process of obtaining the high-resolution image is super-resolution reconstruction.
  • the decoding device can obtain the image information of the adjacent frames.
  • the period T can be preset or changed according to actual needs.
  • for boundary conditions of the sequence, such as the first frame or the last frame, the input can be made to meet the needs of the network by repeating the existing adjacent frames.
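The boundary handling described above (repeating existing adjacent frames at the start or end of the sequence) can be implemented by clamping frame indices. The helper below is an illustrative sketch, not part of the embodiment:

```python
def neighbor_indices(t, num_frames, T):
    """Indices of the 2T neighbours of frame t; out-of-range indices are
    clamped, which repeats existing frames at sequence boundaries."""
    return [min(max(t + d, 0), num_frames - 1)
            for d in range(-T, T + 1) if d != 0]

# First frame of a 5-frame sequence with T = 2: the missing left
# neighbours are filled by repeating frame 0.
idx = neighbor_indices(0, 5, 2)   # [0, 0, 1, 2]
```

With this clamping, the network always receives exactly 2T adjacent frames regardless of the target frame's position in the sequence.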
  • the encoded code stream can be a code stream generated by an image compression coding technique including predictive coding based on motion estimation and compensation. The motion estimation and motion compensation algorithms are used to remove redundant information in the temporal domain; that is, the image compression coding technique can apply a motion estimation algorithm to determine the motion vector information.
  • the target frame and adjacent frames in this embodiment are images of the first resolution, and the resolution indicated by the first resolution specifically refers to the low resolution to which the image decoded by the encoded code stream belongs.
  • the decoding device extracts target features from the image information of the target frame, and extracts adjacent features from the image information of adjacent frames.
  • the decoding device includes multiple feature extraction functions that share weights. After receiving the image information of the target frame and the image information of the adjacent frames sent by the decoder, the decoding device can use each feature extraction function to extract image features from the image information of the target frame and the image information of each adjacent frame, so as to generate a target feature corresponding to the image information of the target frame and an adjacent feature corresponding to the image information of each adjacent frame.
  • the above-mentioned image feature may be image texture information of the image.
  • the above feature extraction function consists of a convolutional layer and several cascaded residual blocks.
  • the decoding device constructs a target feature pyramid from the target feature, and constructs an adjacent feature pyramid from the adjacent features.
  • the decoding device continuously reduces the image size through filtering and bilinear interpolation downsampling to obtain target features of different scales, and then arranges the target features of different scales by scale to generate the target feature pyramid.
  • the bottom image of the target feature pyramid corresponds to the original target feature.
  • after one downsampling, a 2-level target feature pyramid can be formed, and so on, to form a multi-level target feature pyramid.
  • the pyramid structure may be a Gaussian pyramid, a Laplacian pyramid, or a wavelet pyramid, etc., which is not specifically limited here.
  • the decoding device continuously reduces the image size corresponding to the adjacent features of the image information of each adjacent frame by means of filtering and bilinear interpolation downsampling, so as to obtain adjacent features of different scales, and then arranges the corresponding adjacent features of different scales by scale to generate the adjacent feature pyramids. Exemplarily, based on the 2T adjacent frames before and after the target frame, the embodiment of the present application may correspondingly generate 2T adjacent feature pyramids.
  • The feature pyramid is a basic component in multi-scale object detection systems.
  • the feature pyramid of an image is a series of feature sets arranged in a pyramid shape; it is obtained by downsampling the original feature step by step, so the size is reduced layer by layer.
  • the feature pyramid has a certain scale invariance, and this characteristic enables the decoding device of this embodiment to handle image content over a large range of scales.
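The pyramid construction described above can be sketched minimally as follows, with 2x2 block averaging standing in for the filtering and bilinear-interpolation downsampling (a simplification; the embodiment does not fix the exact filter):

```python
import numpy as np

def downsample2x(feat):
    """2x downsampling by averaging 2x2 blocks -- a simple stand-in for the
    filtering + bilinear interpolation described in the embodiment."""
    h, w = feat.shape[0] // 2 * 2, feat.shape[1] // 2 * 2
    f = feat[:h, :w]
    return (f[0::2, 0::2] + f[1::2, 0::2] + f[0::2, 1::2] + f[1::2, 1::2]) / 4.0

def build_pyramid(feat, levels):
    """Level 0 holds the original feature; each higher level halves the size,
    and the levels arranged by scale form the feature pyramid."""
    pyr = [feat]
    for _ in range(levels - 1):
        pyr.append(downsample2x(pyr[-1]))
    return pyr

pyr = build_pyramid(np.random.rand(32, 32), 3)   # 32x32, 16x16, 8x8
```

The same construction, applied once to the target feature and once per adjacent feature, yields the target feature pyramid and the 2T adjacent feature pyramids.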
  • the decoding device determines the position mapping relationship between the target feature and the adjacent feature according to the motion vector information.
  • the decoding device can directly determine the motion vector information from the encoded code stream, and determine the position mapping relationship between the target feature and the adjacent features indicated by the motion vector information. The position mapping relationship corresponds to the above-mentioned target feature pyramid and adjacent feature pyramids: each coordinate in the target feature of each scale in the target feature pyramid has a corresponding coordinate in the adjacent feature of the same scale in the adjacent feature pyramid.
  • the decoding device searches for the second local feature block of the adjacent feature according to the coordinates of the first local feature block of the target feature and the position mapping relationship.
  • for each scale in the adjacent feature pyramid, the same flexible alignment operation will be performed.
  • the specific operation is as follows: as in the schematic diagram of local feature block extraction shown in FIG. 4, under the guidance of the motion vector information, that is, the position mapping relationship 41, each coordinate of the target feature 42 in the target feature pyramid has a corresponding coordinate in the adjacent feature 223 of any adjacent feature pyramid.
  • the decoding device respectively extracts the first local feature block 421 corresponding to the target feature and the second local feature block 2231 corresponding to the adjacent feature from the two corresponding coordinates.
  • the decoding device performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of relevant attention coefficients.
  • the decoding device rearranges the first local feature block of a target feature and the second local feature block of an adjacent feature to form two one-dimensional feature vectors, then merges the two one-dimensional feature vectors through a concatenation operation and feeds them into a two-layer fully connected layer to generate an attention vector with the same length as the number of pixel values contained in a local feature block.
  • the decoding device rearranges the attention vector to obtain a set of related attention coefficients between the two local feature blocks.
  • the above-mentioned set of relevant attention coefficients includes a plurality of relevant attention coefficients, wherein each relevant attention coefficient can indicate the similarity between a feature point in the first local feature block and a corresponding feature point in the second local feature block;
  • the decoding device performs a weighted average of a plurality of feature points in the second local feature block based on the relevant attention coefficient set to determine the adjacent feature pyramid after adaptive local sampling.
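The block extraction, fully connected matching and weighted averaging described in the steps above can be sketched as follows. The two-layer fully connected network uses random weights as stand-ins for trained parameters, and the softmax that turns the attention vector into averaging weights is an assumption not stated in the source:

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_block(feat, y, x, k=3):
    """k x k local feature block centred at (y, x), clamped at the edges."""
    h, w = feat.shape
    ys = np.clip(np.arange(y - k // 2, y + k // 2 + 1), 0, h - 1)
    xs = np.clip(np.arange(x - k // 2, x + k // 2 + 1), 0, w - 1)
    return feat[np.ix_(ys, xs)]

def attention_coefficients(block_t, block_n, w1, b1, w2, b2):
    """Flatten the block pair, concatenate, pass through a two-layer fully
    connected network, and normalise (softmax, an assumption here) into
    one averaging weight per block pixel."""
    v = np.concatenate([block_t.ravel(), block_n.ravel()])  # two 1-D vectors, concatenated
    h = np.maximum(w1 @ v + b1, 0.0)                        # hidden layer with ReLU
    a = w2 @ h + b2                                         # one coefficient per block pixel
    a = np.exp(a - a.max())
    return (a / a.sum()).reshape(block_t.shape)

k = 3
target = rng.random((8, 8))
neighbor = rng.random((8, 8))
w1 = rng.standard_normal((16, 2 * k * k))   # stand-ins for trained FC weights
b1 = np.zeros(16)
w2 = rng.standard_normal((k * k, 16))
b2 = np.zeros(k * k)

bt = extract_block(target, 4, 4)            # first local feature block
bn = extract_block(neighbor, 4, 3)          # second block, at the coordinate given by the motion vector
coeff = attention_coefficients(bt, bn, w1, b1, w2, b2)
sampled = (coeff * bn).sum()                # point-by-point multiply, then sum
```

Because the coefficients are non-negative and sum to one, `sampled` is a weighted average of the neighbour block's feature points, which is the adaptive local sampling described above.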
  • the position mapping relationship provided by the above motion vector information is not necessarily completely real object motion, and may contain coding noise.
  • the relevant attention coefficients allow the network in the decoding device to refine the search for better-matching features in the local neighborhood of the mapped location.
  • the multiple relevant attention coefficients and the local feature block of the adjacent feature can be multiplied point by point and then summed, so as to realize the sampling of the second local feature block.
  • the relevant attention coefficient indicates the importance weight of each feature in the local feature block of the adjacent feature, and is used to improve the feature quality of the adjacent feature after adaptive local sampling.
  • when the decoding device performs one adaptive local sampling, it only operates on one local feature block in one target feature and one local feature block in one adjacent feature. The local feature blocks at other coordinates in the 2T adjacent feature pyramids and the corresponding local feature blocks of the target feature also need to be adaptively sampled with the same steps, so that adaptive local sampling is performed on all adjacent feature pyramids.
  • the decoding device calculates an attention map based on the target feature and the adaptively locally sampled adjacent features.
  • after the decoding device performs adaptive local sampling on the adjacent feature pyramids, it can generate an attention map in the time domain according to the adaptive local sampling quality of the adjacent frame features.
  • the decoding device can calculate, at each coordinate point, the feature inner product of the adaptively locally sampled adjacent frame feature and the target feature; the feature inner product represents the similarity between the adaptively locally sampled adjacent feature and the target feature at that point, and the similarity also represents, to a certain extent, the adaptive local sampling quality at that point.
  • the decoding device can obtain an attention map with the same size as the feature size through the above adaptive local sampling quality.
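The temporal attention map above can be sketched as a per-coordinate channel inner product. The sigmoid that squashes the similarity into a (0, 1) weight is an assumption, since the embodiment only specifies the inner product:

```python
import numpy as np

def temporal_attention_map(target_feat, aligned_feat):
    """Per-coordinate inner product over the channel axis of C x H x W
    features; a sigmoid (an assumption here) maps the similarity into a
    (0, 1) weight of the same spatial size as the features."""
    inner = (target_feat * aligned_feat).sum(axis=0)   # H x W similarity
    return 1.0 / (1.0 + np.exp(-inner))

rng = np.random.default_rng(1)
t = rng.standard_normal((4, 8, 8))   # target feature, C=4 channels
n = rng.standard_normal((4, 8, 8))   # adaptively sampled adjacent feature
att = temporal_attention_map(t, n)   # 8 x 8 attention map
```

A perfectly aligned neighbour (identical to the target) gives a non-negative inner product everywhere, hence weights of at least 0.5, matching the idea that better-aligned regions receive higher attention.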
  • the decoding device performs feature enhancement processing on the adjacent features after adaptive local sampling and the attention map.
  • the decoding device can dynamically fuse the features of the adjacent frame and the current frame after adaptive local sampling by increasing the weight of the high-quality local area and reducing the weight of the low-quality local area.
  • the decoding device multiplies the adjacent features in the adaptively locally sampled adjacent feature pyramid point by point with the above-mentioned attention map, so that the regions on the adjacent features that are more similar to the target feature are adaptively allocated higher attention, that is, a higher weight for their possible contribution to the super-resolution result; this enhances the required features and suppresses possible interference such as mismatches.
  • the decoding device performs stacking and convolution calculations on all adjacent features and target features after feature enhancement processing to generate fusion features, and determines a fusion feature pyramid.
  • the decoding device can stack the adjacent features after feature enhancement processing with the target features, that is, superimpose the enhanced adjacent features on the target features, and then obtain the fused features through a convolutional layer.
  • the fused feature pyramid can be obtained.
  • after the decoding device performs feature fusion once, it needs to detect whether there are adjacent features without feature enhancement processing. If so, it needs to perform the above feature enhancement processing on those adjacent features and the target feature, until all adjacent features are enhanced and fused with the target feature and a complete fused feature pyramid is determined.
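The enhancement-and-fusion step above can be sketched as follows, with a random 1x1 convolution (per-pixel channel mixing) standing in for the trained convolutional layer:

```python
import numpy as np

def fuse(target_feat, neighbor_feats, att_maps, conv_w):
    """Weight each aligned neighbour by its attention map (feature
    enhancement), stack everything on the channel axis, and mix with a
    1x1 convolution; conv_w is a stand-in for the trained layer."""
    enhanced = [f * a[None] for f, a in zip(neighbor_feats, att_maps)]
    stacked = np.concatenate([target_feat] + enhanced, axis=0)  # (C_total, H, W)
    # A 1x1 convolution is a per-pixel linear mix of channels.
    return np.einsum('oc,chw->ohw', conv_w, stacked)

rng = np.random.default_rng(2)
C, H, W = 4, 8, 8
target = rng.standard_normal((C, H, W))
neighbors = [rng.standard_normal((C, H, W)) for _ in range(2)]
atts = [rng.random((H, W)) for _ in range(2)]
conv_w = rng.standard_normal((C, 3 * C))    # mixes 3 stacked features back to C channels
fused = fuse(target, neighbors, atts, conv_w)
```

When an attention map is zero everywhere, that neighbour contributes nothing to the fused feature, which is the "reduce the weight of low-quality local areas" behaviour described above.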
  • the decoding device generates an optimized feature pyramid by calculating the fusion feature pyramid through the second cascaded residual block.
  • the decoding device uses the cascaded residual blocks with scale fusion to reconstruct the features of the fused feature pyramid.
  • at the end of the skip connection of the above second cascaded residual block, additional upsampling or downsampling operations are added, so that the feature reconstruction residuals of different scales can fully exchange information, enhancing the quality of the reconstructed features and yielding optimized fused features.
  • the optimized feature pyramid can be determined.
  • the decoding device performs size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal.
  • the decoding device adds the reconstructed residual signal and the image upscaling result to obtain a reconstructed frame.
  • the decoding device may expand the size of the image information of the target frame by performing an up-sampling operation of bilinear interpolation on the image information of the target frame.
  • the decoding device may add the reconstructed residual signal and the image information of the up-sampled target frame to obtain the image information of the reconstructed frame, and determine the reconstructed frame.
  • the reconstructed frame is the target frame after super-resolution.
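The final reconstruction steps above (size expansion of the target frame plus the predicted residual) can be sketched as follows; nearest-neighbour upsampling stands in for the bilinear interpolation of the embodiment, and the residual is random data for illustration:

```python
import numpy as np

def upsample2x_nearest(img):
    """Size expansion; nearest-neighbour here as a simple stand-in for the
    bilinear-interpolation upsampling described in the embodiment."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def reconstruct(target_lr, residual_hr):
    """Reconstructed frame = upscaled low-resolution target frame
    + reconstructed residual signal."""
    return upsample2x_nearest(target_lr) + residual_hr

rng = np.random.default_rng(3)
lr = rng.random((8, 8))                     # low-resolution target frame
res = rng.standard_normal((16, 16)) * 0.01  # predicted residual (illustrative)
hr = reconstruct(lr, res)                   # 16 x 16 reconstructed frame
```

Predicting only the residual rather than the full high-resolution frame is what lets the network focus its capacity on the high-frequency detail missing from the upscaled image.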
  • the decoding device generates high-resolution reconstructed frames by performing neural network processing on the motion vector information, the target feature pyramid and the adjacent feature pyramids obtained from the encoded code stream. The encoded code stream already contains a certain amount of motion information, and the computational cost of extracting motion information from the code stream is negligible, all of which can greatly reduce the time of video super-resolution.
  • the super-resolution process in the image processing in the embodiment of the present application may be implemented by a pre-trained network model.
  • FIG. 5 is a schematic diagram of the architecture of the decoding device in the embodiment of the present application.
  • the decoding device 22 may include a decoder 221 , a network model 222 , a graphics processing unit (graphics processing unit, GPU) memory 223 and an output buffer 224 . They are described as follows:
  • the decoder 221 is a device that performs restoration and decoding operations on the encoded encoded code stream.
  • the decoder 221 may be a video decoder that supports encoding and decoding standards such as H.264, high efficiency video coding (HEVC) or versatile video coding (VVC), for example, an HEVC decoder.
  • the decoder 221 in this embodiment of the present application adds a motion vector information output interface.
  • the network model 222 in the product implementation form of the embodiment of the present application is a program code included in the machine learning and deep learning platform software and deployed on the decoding device.
  • the program codes of the embodiments of the present application exist outside the existing decoder 221 .
  • the network model 222 may be generated by supervised training of the data of the decoded video at low resolution and its unencoded high-resolution video by a machine learning method.
  • the network model 222 is designed with a feature extraction module 2221 , a flexible alignment module 2222 , a multi-frame feature fusion module 2223 and a feature super-score reconstruction module 2224 in this embodiment of the present application.
  • the network model 222 first contains a feature extraction module 2221, which aims to transform the input decoded frames from the pixel domain to the feature domain, since image features have important physical meanings in deep learning methods.
  • Flexible alignment module 2222: this module receives the motion vector extracted from the code stream by the decoder 221, and uses it as a guide in a multi-scale local attention mechanism to achieve flexible alignment of adjacent frames at the feature level.
  • Multi-frame feature fusion module 2223: this module receives the aligned adjacent frame features and the current frame features, and uses an attention mechanism in the time domain to complete the feature fusion operation.
  • Feature super-resolution reconstruction module 2224: this module receives the fused image features, and uses cascaded multi-scale fusion residual blocks and sub-pixel convolution to complete the super-resolution reconstruction of the decoded video, generating a reconstructed frame.
  • the GPU memory 223 is for the execution of program codes that support the computation of each module in the network model 222 .
  • the output buffer 224 receives and saves the reconstructed frames output by the network model 222 .
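The data flow through the four modules of network model 222 can be sketched as a simple composition of callables; the lambdas below are trivial placeholders for the trained sub-networks, purely to make the flow runnable:

```python
def super_resolve(decoded_frames, motion_vectors, t,
                  extract, align, fuse, reconstruct):
    """Sketch of the pipeline of network model 222: feature extraction ->
    flexible alignment -> multi-frame feature fusion -> super-resolution
    reconstruction. The four callables stand in for the trained modules."""
    target = extract(decoded_frames[t])
    neighbors = [extract(f) for i, f in enumerate(decoded_frames) if i != t]
    aligned = [align(target, n, motion_vectors) for n in neighbors]
    fused = fuse(target, aligned)
    return reconstruct(fused)

# Toy run with placeholder modules: identity alignment/reconstruction and
# additive fusion, on scalar "frames".
frames = [[1.0], [2.0], [3.0]]
out = super_resolve(frames, None, 1,
                    extract=lambda f: f[0],
                    align=lambda t, n, mv: n,
                    fuse=lambda t, ns: t + sum(ns),
                    reconstruct=lambda x: x)
```

Note how the motion vectors enter only the alignment stage, mirroring the description above in which the decoder's motion vector output interface feeds the flexible alignment module.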
  • the embodiments of the present application are implemented on the open-source Python-first deep learning framework (PyTorch) machine learning platform, and run on a decoding unit equipped with an NVIDIA GPU card, to implement the HEVC-standard decoded-video super-resolution program code.
  • the NVIDIA GPU card provides computing acceleration capabilities through the unified computing device architecture (compute unified device architecture, CUDA) programming interface.
  • the inference process of the network model on the distributed PyTorch machine learning platform can thus be accelerated, and the trained model can directly perform end-to-end reconstruction from decoded video containing compression noise.
  • the feature extraction module converts the target frame and adjacent frames output by the decoder from the pixel domain to the feature domain, as follows:
  • the feature extraction module obtains image information of the target frame and image information of adjacent frames from the decoder.
  • after receiving the encoded code stream sent by the source device, the decoder can decode the encoded video to obtain the image information of the target frame and the image information of adjacent frames, and the feature extraction module can receive the image information of the target frame and the image information of the adjacent frames transmitted by the decoder.
  • the target frame is an image that needs to be subjected to super-resolution processing in this embodiment.
  • Super-resolution means improving the resolution of the original image by means of hardware or software, that is, obtaining a high-resolution image from a series of low-resolution images; the process of obtaining the high-resolution image is super-resolution reconstruction.
  • when the images in the preset period before and after the target frame are also decoded, exemplarily, when the 2T adjacent frames before and after the target frame are also decoded, the decoder outputs the image information of the adjacent frames.
  • the period T can be preset or changed according to actual needs. For boundary conditions of the sequence, such as the first frame or the last frame, the input can be made to meet the needs of the network by repeating the existing adjacent frames.
  • the encoded code stream can be a code stream generated by an image compression coding technique including predictive coding based on motion estimation and compensation. The motion estimation and motion compensation algorithms are used to remove redundant information in the temporal domain; that is, the image compression coding technique can apply a motion estimation algorithm to determine the motion vector information.
  • the target frame and adjacent frames in this embodiment are images of the first resolution, and the resolution indicated by the first resolution specifically refers to the low resolution to which the image decoded by the encoded code stream belongs.
  • the feature extraction module extracts target features from image information of the target frame, and extracts adjacent features from image information of adjacent frames.
  • the feature extraction module constructs a target feature pyramid from the target feature, and constructs an adjacent feature pyramid from the adjacent features.
  • Steps 602 to 603 are similar to steps 302 to 303 of the image processing method shown in FIG. 3 , and details are not repeated here.
  • the flexible alignment module realizes flexible alignment of adjacent frames according to the target feature pyramid and the adjacent feature pyramids output by the feature extraction module, as follows:
  • the flexible alignment module receives the motion vector information from the decoder, and the target feature pyramid and neighboring feature pyramids from the feature extraction module.
  • the encoded code stream may be a code stream generated by an image compression coding technique including predictive coding based on motion estimation and compensation. The motion estimation and motion compensation algorithms are used to remove redundant information in the temporal domain; that is, the image compression coding technique may apply a motion estimation algorithm to determine the motion vector information.
  • the decoder can extract the motion vector information from the encoded code stream and send it to the flexible alignment module.
  • the flexible alignment module can also receive the target feature pyramid and adjacent feature pyramid described in Figure 3 from the feature extraction module.
  • the flexible alignment module determines the position mapping relationship between the target feature and the adjacent feature according to the motion vector information.
  • the flexible alignment module searches for the second local feature block of the adjacent feature according to the coordinates of the first local feature block of the target feature and the position mapping relationship.
  • the flexible alignment module performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of relevant attention coefficients.
  • the flexible alignment module performs a weighted average on the second local feature block based on the set of relevant attention coefficients to determine the adjacent feature pyramid after adaptive local sampling.
  • Steps 702-705 are similar to steps 304-307 in the image processing method shown in FIG. 3 , and details are not repeated here.
  • as in the image processing flow chart of the multi-frame feature fusion module shown in FIG. 8, the multi-frame feature fusion module processes each adaptively locally sampled adjacent feature pyramid output by the flexible alignment module, as follows:
  • the multi-frame feature fusion module receives the adaptively locally sampled adjacent feature pyramids from the flexible alignment module.
  • after the flexible alignment module performs adaptive local sampling on the adjacent feature pyramids, it sends the adaptively locally sampled adjacent feature pyramids to the multi-frame feature fusion module.
  • the multi-frame feature fusion module calculates an attention map according to the target feature and the adjacent features after adaptive local sampling.
  • the multi-frame feature fusion module performs feature enhancement processing on the adjacent features after adaptive local sampling and the attention map.
  • the multi-frame feature fusion module stacks all the feature-enhanced adjacent features with the target features and performs convolution calculations to generate fusion features, and determines a fusion feature pyramid.
  • Steps 802-804 are similar to steps 308-310 in the image processing method shown in FIG. 3 , and details are not repeated here.
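The fusion steps above (attention map, feature enhancement, stacking and convolution) can be sketched as follows. This is a hedged NumPy toy under assumed channel-first shapes: the per-pixel inner product with a sigmoid stands in for the attention map computation, and a fixed random 1x1 projection stands in for the learned fusion convolution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_frames(target, aligned_neighbors):
    """Weight each aligned neighbour feature map by a per-pixel attention
    map (channel-wise inner product with the target feature, squashed by
    a sigmoid), then stack target and enhanced neighbours along the
    channel axis and mix back to the target's channel count with a 1x1
    convolution (a random projection standing in for learned weights)."""
    c, h, w = target.shape
    enhanced = []
    for nb in aligned_neighbors:
        att = sigmoid((target * nb).sum(axis=0, keepdims=True))  # (1,H,W)
        enhanced.append(nb * att)                 # feature enhancement
    stacked = np.concatenate([target] + enhanced, axis=0)  # channel concat
    k = stacked.shape[0]
    rng = np.random.default_rng(0)
    w1x1 = rng.normal(size=(c, k)) / np.sqrt(k)   # 1x1 conv kernel (c, k)
    return np.einsum('ok,khw->ohw', w1x1, stacked)
```

Note how the channel concatenation increases the channel count from `c` to `c * (1 + number of neighbours)` before the 1x1 convolution reduces it back.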
  • the feature super-resolution reconstruction module receives the fused feature pyramid from the multi-frame feature fusion module.
  • the feature super-resolution reconstruction module processes the fused feature pyramid through the second cascaded residual block to generate an optimized feature pyramid.
  • the feature super-resolution reconstruction module performs size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal.
  • the feature super-resolution reconstruction module adds the reconstructed residual signal to the image enlargement result to obtain a reconstructed frame.
  • Steps 902-904 are similar to steps 311-313 in the image processing method shown in FIG. 3 , and details are not repeated here.
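The reconstruction steps above amount to adding a predicted residual to a bilinearly enlarged target frame. A minimal sketch, assuming a 2x super-resolution factor and treating the residual signal as given (the cascaded residual blocks and size-expansion convolutions that would predict it are not modeled):

```python
import numpy as np

def upsample2x(img):
    """2x enlargement by bilinear interpolation (align-corners-false
    style sample positions), used for the image enlargement result."""
    h, w = img.shape
    ys = np.clip((np.arange(2 * h) + 0.5) / 2 - 0.5, 0, h - 1)
    xs = np.clip((np.arange(2 * w) + 0.5) / 2 - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int); x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def reconstruct(target_lr, residual_hr):
    """Reconstructed frame = bilinearly enlarged target frame plus the
    residual signal produced by the reconstruction network."""
    return upsample2x(target_lr) + residual_hr
```

The residual formulation means the network only has to predict the high-frequency detail missing from the bilinear enlargement, not the whole image.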
  • FIG. 10 is a comparison diagram of the super-resolution results of the embodiments of the present application with the super-resolution results of the prior art; the ordinate is the peak signal-to-noise ratio (PSNR) in decibels (dB), and the abscissa is the per-frame test time in milliseconds (ms).
  • VRCNN: variable-filter-size residual-learning convolutional neural network
  • VESPCN: video efficient sub-pixel convolutional network
  • MFQE: multi-frame quality enhancement
  • DCAD: deep convolutional neural networks-based auto decoder
  • VRCNN, the residual-learning convolutional neural network with variable filter size, is a lightweight compression-noise removal network consisting of 4 convolutional layers.
  • MFQE, the multi-frame quality enhancement method, is an end-to-end network for denoising compressed video that uses motion estimation and motion compensation and the idea that "good frames compensate bad frames".
  • DCAD, the deep convolutional neural network-based auto decoder, is a compression-noise removal network consisting of 10 convolutional layers.
  • VESPCN, the efficient sub-pixel convolutional network for video, is a video super-resolution method that uses motion estimation and motion compensation to exploit temporal correlations by aligning adjacent frames.
  • the video super-resolution method based on optical flow super-resolution likewise uses motion estimation and motion compensation to exploit temporal correlations by aligning adjacent frames, and additionally chooses to predict a more accurate high-resolution optical flow.
  • the progressive fusion video super-resolution network utilizing non-local spatiotemporal correlations is an end-to-end video super-resolution method built on non-local attention and a progressive fusion module.
  • FIG. 11 is a schematic diagram of an embodiment of a decoding device 110 in an embodiment of the present application.
  • an embodiment of the present application provides a decoding device, and the decoding device includes:
  • the obtaining unit 1101 is configured to obtain the motion vector information of the target frame in the encoded code stream, the image information of the target frame, and the image information of the adjacent frames, where the target frame and the adjacent frames are images of the first resolution, the target frame is an image that needs super-resolution processing, and the adjacent frames include images within a preset period before or after the target frame;
  • the generating unit 1102 is configured to generate a reconstructed frame according to the motion vector information, the image information of the target frame, and the image information of the adjacent frames, where the reconstructed frame is an image of the second resolution, the second resolution is greater than the first resolution, and the motion vector information is used to indicate adaptive local sampling between the image information of the adjacent frames and the image information of the target frame.
  • the decoding device generates high-resolution reconstructed frames by performing neural network processing on the motion vector information, the target feature pyramid, and the adjacent feature pyramids obtained from the encoded code stream; the encoded code stream already contains motion information, and the computational cost of extracting it from the code stream is negligible, so the time for video super-resolution can be greatly reduced.
  • the generating unit 1102 is specifically configured to generate a target feature pyramid according to the image information of the target frame and to generate one adjacent feature pyramid according to the image information of each adjacent frame, where the target feature pyramid includes target features of multiple scales and each adjacent feature pyramid includes adjacent features of multiple scales; to perform, based on the motion vector information and taking the position of each target feature as a reference, adaptive local sampling on the adjacent features at the position corresponding to each target feature in each adjacent feature pyramid; to fuse the target feature pyramid with each adaptively locally sampled adjacent feature pyramid to generate a fusion feature pyramid, which includes fusion features of multiple scales; and to process the fusion feature pyramid to generate the reconstructed frame.
  • the generating unit 1102 is further configured to, for each adjacent feature pyramid, find the second local feature block according to the coordinates of the first local feature block in the target feature and the mapping relationship, contained in the motion vector information, between the first local feature block and the second local feature block in the adjacent feature; and to perform feature matching on the first local feature block and the second local feature block through the fully connected layer to determine the set of relevant attention coefficients.
  • the generating unit 1102 is further configured to, for each adjacent feature pyramid, calculate an attention map according to the target feature and the adaptively locally sampled adjacent features, where the attention map represents the similarity between the adaptively locally sampled adjacent features and the target feature; perform feature enhancement processing on the adaptively locally sampled adjacent features with the attention map; and stack all the feature-enhanced adjacent features with the target features and perform convolution to generate fusion features and determine the fusion feature pyramid.
  • the generating unit 1102 is further configured to perform convolution processing on the image information of the target frame and then apply the first cascaded residual block to generate a target feature; target features of multiple scales are generated from it by bilinear interpolation to construct the target feature pyramid.
  • the generating unit 1102 is further configured to perform convolution processing on the image information of each adjacent frame and then apply the first cascaded residual block to generate one adjacent feature per adjacent frame; adjacent features of multiple scales are generated from it by bilinear interpolation to construct each adjacent feature pyramid.
  • the generating unit 1102 is further configured to process the fusion feature pyramid through the second cascaded residual block to generate an optimized feature pyramid; perform size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal; and add the reconstructed residual signal to the image enlargement result to obtain the reconstructed frame, where the image enlargement result is generated by bilinear interpolation of the image information of the target frame.
  • the decoding device described above can be understood by referring to the corresponding content in the foregoing method embodiment section, and details are not repeated here.
  • the decoding device 1200 may include one or more central processing units (CPU) 1201 and a memory 1205, and the memory 1205 stores one or more application programs or data.
  • the memory 1205 may be volatile storage or persistent storage.
  • the program stored in the memory 1205 may include one or more modules, and each module may include a series of instruction operations in the service control unit.
  • the central processing unit 1201 may be arranged to communicate with the memory 1205 to execute a series of instruction operations in the memory 1205 on the decoding device 1200.
  • the decoding device 1200 may also include one or more power supplies 1202, one or more wired or wireless network interfaces 1203, one or more input and output interfaces 1204, and/or, one or more operating systems, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the decoding device 1200 can perform the operations performed by the decoding device in the foregoing embodiments shown in FIG. 3 to FIG. 9 , and details are not repeated here.
  • a computer-readable storage medium is also provided, where computer-executable instructions are stored in the computer-readable storage medium; when the processor of a device executes the computer-executable instructions, the device executes the steps of the image processing method performed by the processor in the foregoing FIG. 3 to FIG. 9.
  • a computer program product includes computer-executable instructions stored in a computer-readable storage medium; when a processor of a device executes the computer-executable instructions, the device executes the steps of the image processing method executed by the processor in the foregoing FIG. 3 to FIG. 9.
  • a chip system is further provided, the chip system includes at least one processor, and the processor is configured to support a decoding device to implement the steps of the image processing method performed by the processor in the foregoing FIG. 3 to FIG. 9 .
  • the chip system may further include a memory for storing necessary program instructions and data of the decoding device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of units is only a logical function division.
  • in actual implementation there may be other division methods; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling, direct coupling, or communication connection may be implemented through some interfaces; the indirect coupling or communication connection of devices or units may be in electrical, mechanical, or other forms.
  • Units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium.
  • in essence, or in the part contributing to the prior art, or in whole or in part, the technical solutions of the present application can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.

Abstract

Disclosed are an image processing method and apparatus, which are used for reducing the video super-resolution time. The method in the embodiments of the present application comprises: a decoding device processing, in combination with motion vector information in an encoded code stream, image information of a target frame to be subjected to super-resolution and image information of an adjacent frame of the target frame, so as to generate a reconstructed frame with a high resolution.

Description

Image processing method and apparatus
This application claims priority to the Chinese patent application with application number 202011063576.4, entitled "Image processing method and apparatus", filed with the China Patent Office on September 30, 2020, the entire contents of which are incorporated into this application by reference.
Technical Field
The embodiments of the present application relate to the field of image processing, and in particular, to an image processing method and apparatus.
Background
Traditional video super-resolution algorithms can reconstruct a group of consecutive low-resolution frames through techniques such as linear interpolation and finally fuse them into one high-resolution image. In recent years, with the rapid development of deep learning, in order to improve the effect of super-resolution techniques, academia has generally begun to use deep learning methods to reconstruct low-resolution video sequences into spatiotemporally correlated high-resolution video sequences.
Existing video super-resolution techniques usually use the combining local and global-television (CLG-TV) optical flow model algorithm to calculate the optical flow velocity vectors between all the low-resolution video images in the sequence and the current-frame video image, that is, to perform motion estimation. According to the optical flow velocity vectors, the motion-compensated low-resolution video images of 2T frames and the low-resolution video image of the current frame are obtained. A deep residual network then sequentially performs an initial stage, a concatenated convolutional layer calculation stage, and a residual block calculation stage on the 2T frames of low-resolution video images and the current frame's low-resolution video image, and the high-resolution video image is finally reconstructed step by step through deconvolution and convolution operations.
Existing video super-resolution algorithms need to perform motion estimation, which consumes a large amount of computational resources.
Summary of the Invention
The embodiments of the present application provide an image processing method for processing a frame to be super-resolved into a high-resolution reconstructed frame by using motion vector information in an encoded code stream, which avoids the large amount of computation required for motion estimation and can greatly reduce the time for video super-resolution.
A first aspect of the embodiments of the present application provides an image processing method, the method including: a decoding device obtains motion vector information of a target frame in an encoded code stream, image information of the target frame, and image information of adjacent frames, where the target frame and the adjacent frames are images of a first resolution, the target frame is an image that needs super-resolution processing, and the adjacent frames include images within a preset period before or after the target frame; the decoding device generates a reconstructed frame according to the motion vector information, the image information of the target frame, and the image information of the adjacent frames, where the reconstructed frame is an image of a second resolution, the second resolution is greater than the first resolution, and the motion vector information is used to indicate adaptive local sampling between the image information of the adjacent frames and the image information of the target frame.
In the above first aspect, the encoded code stream may be a code stream generated by an image compression coding technique that includes predictive coding based on motion estimation and motion compensation; the motion estimation and motion compensation algorithms are used to remove temporal redundancy, that is, the image compression coding technique may apply a motion estimation algorithm to determine the motion vector information.
In the embodiment of the present application, the decoding device can directly extract the motion vector information from the encoded code stream and decode the encoded code stream to obtain the decoded image information, where the image information may include the image information of the target frame to be super-resolved and the image information of the adjacent frames within a preset period before or after the target frame; exemplarily, the adjacent frames may include the T frames before and the T frames after the target frame, and both the target frame and the adjacent frames are images of the first resolution, that is, low-resolution images. After decoding the image information of the target frame and of the adjacent frames, the decoding device can process them in combination with the motion vector information to generate a reconstructed frame of the second resolution; the reconstructed frame is a high-resolution image, that is, the second resolution is greater than the first resolution, and the motion vector information is used to indicate adaptive local sampling between the image information of the adjacent frames and the image information of the target frame. In the embodiment of the present application, the decoding device does not need to perform the computationally expensive motion estimation process when performing super-resolution, which can greatly reduce the time for video super-resolution.
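The key point above, reusing motion vectors already carried by the code stream instead of estimating them, can be illustrated with a toy block-wise warp. This is a hedged sketch, not the patent's pipeline: the per-block vectors are assumed to have been decoded from the stream already, the block size and array layout are assumptions, and no motion estimation is performed anywhere.

```python
import numpy as np

def align_with_mv(neighbor, mvs, block=4):
    """Warp a neighbour frame toward the target frame using per-block
    motion vectors taken directly from the decoded bitstream. mvs has
    one (dy, dx) pair per block of the frame."""
    h, w = neighbor.shape
    out = np.zeros_like(neighbor)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = mvs[by // block, bx // block]
            # source coordinates, clamped to the frame boundary
            sy = np.clip(np.arange(by, by + block) + dy, 0, h - 1)
            sx = np.clip(np.arange(bx, bx + block) + dx, 0, w - 1)
            out[by:by + block, bx:bx + block] = neighbor[np.ix_(sy, sx)]
    return out
```

Because the vectors come for free with the compressed stream, the only cost of this alignment is the gather itself, which is the efficiency argument the paragraph above makes.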
In a possible implementation of the first aspect, the decoding device performs, based on the motion vector information, adaptive local sampling on each position in the image information of the adjacent frames that corresponds to the image information of the target frame; the decoding device generates the reconstructed frame according to the image information of the target frame and the adaptively locally sampled image information of the adjacent frames.
In this possible implementation, based on each local position in the image information of the target frame in combination with the motion vector information, the decoding device can adaptively select and sample, among the multiple image points at each corresponding local position in the image information of the adjacent frames, the image points with high similarity. The decoding device can perform the corresponding image reconstruction based on the adaptively locally sampled image information of the adjacent frames to generate the reconstructed frame, which can reduce the noise present in the motion vector information and improve robustness.
In a possible implementation of the first aspect, the above step in which the decoding device performs, based on the motion vector information, adaptive local sampling on each position in the image information of the adjacent frames that corresponds to the image information of the target frame includes: the decoding device generates a target feature pyramid according to the image information of the target frame and generates one adjacent feature pyramid according to the image information of each adjacent frame, where the target feature pyramid includes target features of multiple scales and each adjacent feature pyramid includes adjacent features of multiple scales; based on the motion vector information, taking the position of each target feature as a reference, the decoding device performs adaptive local sampling on the adjacent features at the position corresponding to each target feature in each adjacent feature pyramid.
In this possible implementation, the feature pyramid of an image is a series of feature sets arranged in a pyramid shape, obtained by progressively downsampling an original feature, so that the size decreases layer by layer. The decoding device can extract features from the image information of the target frame and of the adjacent frames through a feature extraction function to obtain the target feature and the adjacent features, then generate target features and adjacent features of multiple scales through downsampling and construct the corresponding feature pyramids. Based on the scale invariance of the feature pyramid, taking the position of each target feature at each scale as a reference, the decoding device performs adaptive local sampling on the adjacent features at the corresponding position of each adjacent feature pyramid according to the motion vector information; that is, it refines the search for better-matching features within the adjacent features at the mapped position, improving the feature quality of each adjacent feature.
In a possible implementation of the first aspect, the above step in which the decoding device generates the reconstructed frame according to the image information of the target frame and the adaptively locally sampled image information of the adjacent frames includes: the decoding device fuses the target feature pyramid with each adaptively locally sampled adjacent feature pyramid to generate a fusion feature pyramid, where the fusion feature pyramid includes fusion features of multiple scales; the decoding device processes the fusion feature pyramid to generate the reconstructed frame.
In this possible implementation, the decoding device stacks the adaptively locally sampled adjacent feature pyramids with the target feature pyramid and fuses them by convolution into one fusion feature pyramid, which can then be reconstructed into a high-resolution image. Stacking (concat) is the merging of feature channels; that is, the number of features (channels) describing the image itself increases, while the information under each feature does not increase.
In a possible implementation of the first aspect, the decoding device performing, based on the motion vector information and taking the position of each target feature as a reference, adaptive local sampling on the adjacent features at the position corresponding to each target feature in each adjacent feature pyramid includes: for each adjacent feature pyramid, the decoding device finds the second local feature block according to the coordinates of the first local feature block in the target feature and the mapping relationship, contained in the motion vector information, between the first local feature block and the second local feature block in the adjacent feature; the decoding device performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of relevant attention coefficients, where the set includes multiple relevant attention coefficients, each indicating the similarity between a feature point in the first local feature block and the corresponding feature point in the second local feature block; the decoding device performs a weighted average over the multiple feature points in the second local feature block based on the set of relevant attention coefficients to determine the adaptively locally sampled adjacent feature pyramid.
In this possible implementation, the attention coefficient is a degree of attention: feature points in the adjacent features that are more similar to the target feature receive more attention. In the embodiment of the present application, the decoding device extracts the first local feature block at one scale of the target feature pyramid, extracts the second local feature block from the corresponding coordinates of an adjacent feature pyramid based on the mapping relationship indicated by the motion vector information, determines the attention coefficient of each feature point in the second local feature block through a two-layer fully connected layer, and then applies the corresponding degree of attention to each feature point in the second local feature block, obtaining the adaptively locally sampled second local feature block. After the feature blocks extracted from the adjacent features of all adjacent feature pyramids corresponding to all scales of the target feature pyramid have been processed, the adaptively locally sampled adjacent feature pyramids are obtained.
In a possible implementation of the first aspect, the above step in which the decoding device fuses the target feature pyramid with the adaptively locally sampled adjacent feature pyramids to generate the fusion feature pyramid includes: for each adjacent feature pyramid, the decoding device calculates an attention map according to the target feature and the adaptively locally sampled adjacent features, where the attention map represents the similarity between the adaptively locally sampled adjacent features and the target feature; the decoding device performs feature enhancement processing on the adaptively locally sampled adjacent features with the attention map; the decoding device stacks all the feature-enhanced adjacent features with the target features and performs convolution calculations to generate the fusion features and determine the fusion feature pyramid.
In this possible implementation, the decoding device can generate a temporal attention map according to the alignment quality of the adjacent frame features, and dynamically fuse the features of the adaptively locally sampled adjacent frames and of the target frame by increasing the weights of well-aligned local regions and reducing the weights of poorly aligned ones. The alignment quality can be expressed by the feature inner product, computed coordinate point by coordinate point, between the adaptively locally sampled adjacent frame features and the target features; this inner product characterizes the similarity of the adaptively locally sampled adjacent features to the target features at that point. Each feature region is then weighted, for example by point-wise multiplication of the features with the attention map. The feature-enhanced adjacent features are stacked with the target features and convolved to generate the fusion features; after each feature fusion, the decoding device checks whether any adjacent features have not yet undergone feature enhancement processing, until all adjacent features have been fused with the target features to generate the fusion feature pyramid.
In a possible implementation of the first aspect, the above step in which the decoding device generates the target feature pyramid according to the image information of the target frame includes: the decoding device performs convolution processing on the image information of the target frame and then applies the first cascaded residual block to generate a target feature; the decoding device generates target features of multiple scales from this target feature by bilinear interpolation and constructs the target feature pyramid.
In this possible implementation, residual blocks use skip connections, which improve accuracy by allowing considerable depth; a skip connection lets a residual block pass the received input directly around to its output, preserving the integrity of the information. Scale refers to the number of pixels in the image. In this embodiment of the present application, the decoding device extracts a target feature from the image information of the target frame through a feature extraction function, where the feature extraction function includes convolution processing and cascaded residual blocks. The decoding device can shrink the target feature to different degrees by bilinear-interpolation downsampling to obtain target features at different scales, and then arrange the target features by scale to generate the above target feature pyramid. The number of pixels in each layer of the pyramid decreases from bottom to top, which can greatly reduce the amount of computation.
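As a concrete, hypothetical illustration of building such a pyramid by bilinear-interpolation downsampling, the NumPy sketch below resizes a single-channel feature map and stacks the results largest-first. The learned feature extractor is omitted, and the resize convention (half-pixel sample centers) is one common choice, not necessarily the one used in the patent.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D array with bilinear interpolation (half-pixel centers)."""
    in_h, in_w = img.shape
    ys = (np.arange(out_h) + 0.5) * in_h / out_h - 0.5
    xs = (np.arange(out_w) + 0.5) * in_w / out_w - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, in_h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, in_w - 1)
    y1 = np.clip(y0 + 1, 0, in_h - 1)
    x1 = np.clip(x0 + 1, 0, in_w - 1)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None]   # vertical blend weights
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :]   # horizontal blend weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def build_pyramid(feat, levels=3):
    """Largest scale first; each level halves the height and width."""
    pyramid = [feat]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        pyramid.append(
            bilinear_resize(prev, prev.shape[0] // 2, prev.shape[1] // 2))
    return pyramid

pyr = build_pyramid(np.ones((16, 16)), levels=3)
```

Because the pixel count shrinks by a factor of four at every level, later processing on the upper levels of the pyramid costs correspondingly less, which is the computational saving the paragraph above refers to.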
In a possible implementation of the first aspect, the decoding device generating an adjacent feature pyramid from the image information of each adjacent frame includes: the decoding device convolves the image information of each adjacent frame and then processes it with the first cascade of residual blocks to generate one adjacent feature corresponding to the image information of each adjacent frame; the decoding device generates adjacent features at multiple scales by bilinear interpolation to build an adjacent feature pyramid.
In this possible implementation, the decoding device simultaneously performs feature extraction on the image information of the adjacent frames through multiple feature extraction functions that share weights with the above feature extraction function to obtain adjacent features, shrinks them to different degrees by bilinear-interpolation downsampling to obtain adjacent features at different scales, and then arranges the adjacent features by scale to generate the adjacent feature pyramids.
In a possible implementation of the first aspect, the decoding device processing the fused feature pyramid to generate the reconstructed frame includes: the decoding device passes the fused feature pyramid through a second cascade of residual blocks to generate an optimized feature pyramid; the decoding device enlarges and convolves the optimized feature pyramid to generate a reconstructed residual signal; the decoding device adds the reconstructed residual signal to an image enlargement result to obtain the reconstructed frame, where the image enlargement result is generated from the image information of the target frame by bilinear interpolation.
In this possible implementation, the decoding device exchanges information among the features at each scale level of the fused feature pyramid generated above through the second cascade of residual blocks; for example, the features at each scale level can be upsampled or downsampled and made to interact at a common scale so as to optimize the fused features. The optimized fused features are then enlarged and convolved, and added to the image information of the target frame enlarged by bilinear interpolation, to obtain a high-resolution reconstructed frame.
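The final addition step can be illustrated as follows. In this sketch, pixel replication via `np.kron` stands in for the bilinear enlargement of the target frame, and the reconstructed residual is taken as given rather than produced by residual blocks; all names are illustrative, not from the patent.

```python
import numpy as np

def reconstruct(target_frame, residual, scale=2):
    """Add a reconstructed residual to an enlarged low-resolution frame.

    Pixel replication (np.kron) stands in for bilinear interpolation here;
    the residual would come from the optimized feature pyramid in the
    actual method. Illustrative sketch only.
    """
    enlarged = np.kron(target_frame, np.ones((scale, scale)))  # (H*s, W*s)
    assert enlarged.shape == residual.shape
    return enlarged + residual

lr = np.full((4, 4), 0.5)    # low-resolution target frame
res = np.full((8, 8), 0.1)   # stand-in reconstructed residual signal
hr = reconstruct(lr, res)
```

The key design point is that the network only has to predict the residual detail on top of a cheap enlargement, not the whole high-resolution frame.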
A second aspect of the embodiments of this application provides a decoding device that has the function of implementing the method of the first aspect or any possible implementation of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function, for example, a receiving unit and a processing unit.
A third aspect of the embodiments of this application provides a computer device that includes at least one processor, a storage system, an input/output (I/O) interface, and computer-executable instructions that are stored in the storage system and can run on the processor. When the computer-executable instructions are executed by the processor, the processor performs the method of the first aspect or any possible implementation of the first aspect.
A fourth aspect of the embodiments of this application provides a computer-readable storage medium storing one or more computer-executable instructions. When the computer-executable instructions are executed by a processor, the processor performs the method of the first aspect or any possible implementation of the first aspect.
A fifth aspect of the embodiments of this application provides a computer program product storing one or more computer-executable instructions. When the computer-executable instructions are executed by a processor, the processor performs the method of the first aspect or any possible implementation of the first aspect.
A sixth aspect of the embodiments of this application provides a chip system that includes at least one processor, where the at least one processor is configured to support a decoding device in implementing the functions involved in the first aspect or any possible implementation of the first aspect. In a possible design, the chip system may further include a memory for storing the program instructions and data necessary for the decoding device. The chip system may consist of chips, or may include chips and other discrete components.
For the technical effects brought by the second to sixth aspects or any of their possible implementations, refer to the technical effects brought by the first aspect or its different possible implementations; details are not repeated here.
In the solution provided by the embodiments of this application, the decoding device generates a high-resolution reconstructed frame through neural-network processing of the motion vector information obtained from the encoded bitstream, the target feature pyramid, and the adjacent feature pyramids. The encoded bitstream already contains a certain amount of motion information, and the computational cost of extracting that motion information from the bitstream is negligible, so the time required for video super-resolution can be greatly reduced.
Description of the Drawings
FIG. 1 is an application scenario of an embodiment of this application;
FIG. 2 is a schematic block diagram of a video encoding and decoding system in an embodiment of this application;
FIG. 3 is a flowchart of an embodiment of an image processing method in an embodiment of this application;
FIG. 4 is a schematic diagram of local feature block extraction in an embodiment of this application;
FIG. 5 is a schematic architectural diagram of a decoding device in an embodiment of this application;
FIG. 6 is an image processing flowchart of the feature extraction module in an embodiment of this application;
FIG. 7 is an image processing flowchart of the flexible alignment module in an embodiment of this application;
FIG. 8 is an image processing flowchart of the multi-frame feature fusion module in an embodiment of this application;
FIG. 9 is an image processing flowchart of the feature super-resolution reconstruction module in an embodiment of this application;
FIG. 10 is a comparison of super-resolution results of an embodiment of this application with prior-art super-resolution results;
FIG. 11 is a schematic structural diagram of a decoding device in an embodiment of this application;
FIG. 12 is another schematic structural diagram of a decoding device in an embodiment of this application.
Detailed Description
Embodiments of this application provide an image processing method and a decoding device for reducing the time required for video super-resolution.
The embodiments of this application are described below with reference to the accompanying drawings. In the following description, reference is made to the accompanying drawings, which form a part of this disclosure and show, by way of illustration, specific aspects of embodiments of this application or specific aspects in which embodiments of this application may be used. It should be understood that the embodiments of this application may be used in other aspects and may include structural or logical changes not depicted in the drawings. The following detailed description is therefore not to be taken in a limiting sense, and the scope of this application is defined by the appended claims. For example, it should be understood that disclosure in connection with a described method may equally apply to a corresponding device or system for performing the method, and vice versa. For example, if one or more specific method steps are described, the corresponding device may include one or more units, such as functional units, to perform the described one or more method steps (for example, one unit performing one or more steps, or multiple units each performing one or more of the multiple steps), even if such one or more units are not explicitly described or illustrated in the drawings. Conversely, if, for example, a specific apparatus is described based on one or more units, such as functional units, the corresponding method may include one step to perform the functionality of the one or more units (for example, one step performing the functionality of the one or more units, or multiple steps each performing the functionality of one or more of the multiple units), even if such one or more steps are not explicitly described or illustrated in the drawings. Further, it should be understood that, unless expressly stated otherwise, features of the various exemplary embodiments and/or aspects described herein may be combined with one another.
Video coding generally refers to the processing of a sequence of images that form a video or video sequence. In the field of video coding, the terms "picture", "frame", and "image" may be used as synonyms. Video encoding is performed on the source side and typically includes processing (for example, compressing) the original video images to reduce the amount of data required to represent them, for more efficient storage and/or transmission. Video decoding is performed on the destination side and typically includes inverse processing relative to the encoder to reconstruct the video images. "Coding" of video images in the embodiments should be understood as "encoding" or "decoding" of a video sequence. The combination of the encoding part and the decoding part is also called codec (encoding and decoding).
This embodiment can be applied to the application scenario shown in FIG. 1. A terminal 11, a server 12, a set-top box 13, and a television 14 are connected through a wireless or wired network, and the terminal 11 can remotely control the display 14 through locally installed application software (APP). For example, a user operating on the operation interface of the terminal 11 can output a video source for television playback; the terminal 11 has the video source encoded by the server 12 and forwarded to the set-top box 13, the set-top box 13 decodes the encoded video source for the display 14, and the display 14 can then play back the decoded video source.
The following describes the system architecture to which the embodiments of this application apply. FIG. 2 exemplarily shows a schematic block diagram of a video encoding and decoding system to which the embodiments of this application apply. As shown in FIG. 2, the video encoding and decoding system may include an encoding device 21 and a decoding device 22; the encoding device 21 generates encoded video data, and the decoding device 22 can decode the encoded video data generated by the encoding device 21. Various implementations of the encoding device 21, the decoding device 22, or both may include one or more processors and a memory coupled to the one or more processors. The memory may include, but is not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures accessible by a computer, as described herein. The encoding device 21 and the decoding device 22 may include various apparatuses, including desktop computers, mobile computing devices, notebook (for example, laptop) computers, tablet computers, set-top boxes, telephone handsets such as so-called "smart" phones, televisions, cameras, display devices, digital media players, video game consoles, in-vehicle computers, wireless communication devices, or the like.
Although FIG. 2 shows the encoding device 21 and the decoding device 22 as separate devices, a device may also include both the encoding device 21 and the decoding device 22, or the functionality of both, that is, the encoding device 21 or corresponding functionality and the decoding device 22 or corresponding functionality. In such embodiments, the encoding device 21 or corresponding functionality and the decoding device 22 or corresponding functionality may be implemented using the same hardware and/or software, separate hardware and/or software, or any combination thereof.
The encoding device 21 and the decoding device 22 may be communicatively connected through a link 23, and the decoding device 22 may receive encoded video data from the encoding device 21 via the link 23. The link 23 may include one or more media or apparatuses capable of moving the encoded video data from the encoding device 21 to the decoding device 22. In one example, the link 23 may include one or more communication media that enable the encoding device 21 to transmit the encoded video data directly to the decoding device 22 in real time. In this example, the encoding device 21 may modulate the encoded video data according to a communication standard (for example, a wireless communication protocol) and may transmit the modulated video data to the decoding device 22. The one or more communication media may include wireless and/or wired communication media, such as the radio-frequency spectrum or one or more physical transmission lines. The one or more communication media may form part of a packet-based network, such as a local area network, a wide area network, or a global network (for example, the Internet). The one or more communication media may include routers, switches, base stations, or other devices that facilitate communication from the encoding device 21 to the decoding device 22.
The encoding device 21 includes an encoder 211; optionally, the encoding device 21 may further include an image preprocessor 212 and a first communication interface 213. In specific implementation forms, the encoder 211, the image preprocessor 212, and the first communication interface 213 may be hardware components in the encoding device 21, or may be software programs in the encoding device 21.
These are described separately as follows:
The image preprocessor 212 is configured to receive original image data 214 transmitted by an external terminal and perform preprocessing on the original image data 214 to obtain preprocessed image data 215. For example, the preprocessing performed by the image preprocessor 212 may include trimming, color format conversion (for example, from the three-primary-color (RGB) format to the luma-and-chroma (YUV, where Y denotes luma and UV denotes chroma) format), color grading, or denoising.
An image can be regarded as a two-dimensional array or matrix of picture elements (pixels). The pixels in the array may also be called sampling points. The number of sampling points of the array or image in the horizontal and vertical directions (or axes) defines the size and/or resolution of the image. To represent color, three color components are usually used; that is, the image can be represented as, or contain, three sample arrays. For example, in the RGB format or color space, an image includes corresponding red, green, and blue sample arrays. In video coding, however, each pixel is usually represented in a luma/chroma format or color space; for example, an image in the YUV format includes a luma component indicated by Y (sometimes also indicated by L) and two chroma components indicated by U and V. The luma component Y represents brightness or gray-level intensity (for example, the two are the same in a grayscale image), while the two chroma components U and V represent chrominance or color information. Correspondingly, an image in the YUV format includes a luma sample array of luma sample values (Y) and two chroma sample arrays of chroma values (U and V). An image in the RGB format can be converted or transformed into the YUV format and vice versa; this process is also called color transformation or conversion. If an image is black and white, the image may include only a luma sample array.
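The RGB-to-YUV transformation mentioned above can be written concretely as follows. The BT.601 coefficients used here are one common convention and are an assumption on my part, since the text does not fix a specific conversion matrix.

```python
import numpy as np

def rgb_to_yuv(rgb):
    """Convert an (H, W, 3) RGB image in [0, 1] to YUV.

    Uses the BT.601 luma weights; this is one common convention, chosen
    here for illustration only.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma (brightness)
    u = 0.492 * (b - y)                      # blue-difference chroma
    v = 0.877 * (r - y)                      # red-difference chroma
    return np.stack([y, u, v], axis=-1)

gray = np.full((2, 2, 3), 0.5)   # a gray pixel carries no color information
yuv = rgb_to_yuv(gray)
```

As the grayscale example shows, a black-and-white image ends up with zero chroma everywhere, which is why such an image needs only the luma sample array.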
The encoder 211 (or video encoder 211) is configured to receive the preprocessed image data 215 and process it using a relevant prediction mode (such as the prediction modes in various embodiments herein), thereby providing encoded image data 216.
The first communication interface 213 can be configured to receive the encoded image data 216 and transmit the encoded image data 216 through the link 23 to the decoding device 22 or any other device (such as a memory) for storage or direct reconstruction; the other device may be any device used for decoding or storage. The first communication interface 213 may, for example, be configured to encapsulate the encoded image data 216 into a suitable format, such as data packets, for transmission over the link 23.
The decoding device 22 includes a decoder 221; optionally, the decoding device 22 may further include a second communication interface 222 and an image post-processor 223. These are described separately as follows:
The second communication interface 222 can be configured to receive the encoded image data 216 from the encoding device 21 or any other source, such as a storage device, for example, an encoded-image-data storage device. The second communication interface 222 may be configured to transmit or receive the encoded image data 216 via the link 23 between the encoding device 21 and the decoding device 22 (for example, a direct wired or wireless connection) or via any kind of network, for example, a wired or wireless network or any combination thereof, or any kind of private or public network or any combination thereof. The second communication interface 222 may, for example, be configured to decapsulate the data packets transmitted by the first communication interface 213 to obtain the encoded image data 216.
Both the second communication interface 222 and the first communication interface 213 can be configured as one-way or two-way communication interfaces, and can be used, for example, to send and receive messages to establish connections, and to acknowledge and exchange any other information related to the communication link and/or to data transmission such as the transmission of encoded image data.
The decoder 221 is configured to receive the encoded image data 216 and provide decoded image data 224 or a decoded image 224. In some embodiments, the decoder 221 may be configured to perform the various embodiments described below, to apply the image processing method described in this application on the decoding side.
The image post-processor 223 is configured to perform post-processing on the decoded image data 224 (also called reconstructed image data) to obtain post-processed image data 225. The post-processing performed by the image post-processor 223 may include color format conversion (for example, from the YUV format to the RGB format), color grading, trimming, or resampling, or any other processing, and may also be used to transmit the post-processed image data to an external display device for playback. The display device may be or include any kind of display for presenting the reconstructed image, for example, an integrated or external display or monitor. For example, the display may include a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, a plasma display, a projector, a micro-LED display, liquid crystal on silicon (LCoS), a digital light processor (DLP), or any other kind of display.
Although FIG. 2 depicts the encoding device 21 and the decoding device 22 as separate devices, device embodiments may also include both the encoding device 21 and the decoding device 22, or the functionality of both, that is, the encoding device 21 or corresponding functionality and the decoding device 22 or corresponding functionality. In such embodiments, the encoding device 21 or corresponding functionality and the decoding device 22 or corresponding functionality may be implemented using the same hardware and/or software, separate hardware and/or software, or any combination thereof.
It is apparent to those skilled in the art from the description that the functionality of the different units, or the existence and (exact) division of the functionality of the encoding device 21 and/or the decoding device 22 shown in FIG. 2, may vary according to the actual device and application. The encoding device 21 and the decoding device 22 may each include any of various devices, including any kind of handheld or stationary device, for example, a notebook or laptop computer, a mobile phone, a smartphone, a tablet or tablet computer, a video camera, a desktop computer, a set-top box, a television, a camera, an in-vehicle device, a display device, a digital media player, a video game console, a video streaming device (such as a content service server or a content distribution server), a broadcast receiver device, a broadcast transmitter device, or the like, and may use no operating system or any kind of operating system.
Both the encoder 211 and the decoder 221 can be implemented as any of various suitable circuits, for example, one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, or any combination thereof. If the techniques are implemented partly in software, a device may store the software's instructions in a suitable non-transitory computer-readable storage medium and may execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Any of the foregoing (including hardware, software, a combination of hardware and software, and so on) may be regarded as one or more processors.
In some cases, the video encoding and decoding system shown in FIG. 2 is merely an example, and the techniques of this application may apply to video coding settings (for example, video encoding or video decoding) that do not necessarily include any data communication between the encoding and decoding devices. In other examples, data may be retrieved from a local memory, streamed over a network, and so on. A video encoding device may encode data and store the data in a memory, and/or a video decoding device may retrieve data from a memory and decode the data. In some examples, encoding and decoding are performed by devices that do not communicate with each other but only encode data to a memory and/or retrieve data from a memory and decode it.
Currently, in the case of lossless video coding, the original video images can be reconstructed; that is, the reconstructed video images have the same quality as the original video images (for example, assuming no transmission loss or other data loss during storage or transmission). In the case of lossy video coding, further compression is performed, for example, by quantization, to reduce the amount of data required to represent the video images. The super-resolution algorithms used when reconstructing video images on the decoder side require motion estimation, and motion estimation consumes a large amount of computing resources. To reduce the time required for video super-resolution, the embodiments of this application provide a corresponding image processing method. The method includes: a decoding device obtains, from an encoded bitstream, motion vector information of a target frame, image information of the target frame, and image information of adjacent frames, where the target frame and the adjacent frames are images of a first resolution, the target frame is an image that needs super-resolution processing, and the adjacent frames include images within a preset period before or after the target frame; the decoding device generates a reconstructed frame according to the motion vector information, the image information of the target frame, and the image information of the adjacent frames, where the reconstructed frame is an image of a second resolution greater than the first resolution, and the motion vector information is used to guide adaptive local sampling of the image information of the adjacent frames with respect to the image information of the target frame. In this way, this application uses the motion vector information of the target frame in the encoded bitstream to increase the resolution of the reconstructed frame, saving the resources that re-estimating the motion vector information would require.
Based on the foregoing step in which the decoding device generates the reconstructed frame according to the motion vector information, the image information of the target frame, and the image information of the adjacent frames, an implementation of an embodiment of this application may further be: the decoding device performs, based on the motion vector information, adaptive local sampling at each position in the image information of the adjacent frames corresponding to the image information of the target frame; the decoding device then generates the reconstructed frame according to the image information of the target frame and the adaptively locally sampled image information of the adjacent frames. Through adaptive local sampling, image points with high similarity can be selected for sampling, reducing the influence of noise in the motion vector information and improving the robustness of the super-resolution.
Hereinafter, based on the foregoing application scenario and system architecture, the image processing method in the embodiments of this application is described with reference to FIG. 3.
Referring to FIG. 3, an embodiment of the image processing method of this application includes the following steps.
301. The decoding device acquires the image information of the target frame and the image information of the adjacent frames from the encoded bitstream.
In this embodiment, after receiving the encoded bitstream sent by the server, the decoder may decode the encoded video to obtain the image information of the target frame and the image information of the adjacent frames.
The target frame is the image on which super-resolution processing is to be performed in this embodiment. Super-resolution means increasing the resolution of an original image by hardware or software; the process of obtaining one high-resolution image from a series of low-resolution images is super-resolution reconstruction.
When the images within the preset period before and after the target frame have also been decoded (for example, when the 2T adjacent frames before and after the target frame have been decoded), the decoding device can obtain the image information of these adjacent frames. The period T may be preset, or may be changed according to actual needs. For boundary cases of the sequence, such as the first or last frame, the existing adjacent frames may be repeated so that the input satisfies the requirements of the network.
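The neighbor-selection rule above, including the boundary handling that repeats existing frames, can be sketched as follows (a minimal illustration; the function name and the clamping strategy are assumptions, not taken from the text):

```python
def neighbor_indices(target_idx, num_frames, T):
    """Return the indices of the 2T adjacent frames (T before, T after)
    around the target frame, repeating existing frames at sequence
    boundaries so the network always receives 2T neighbors."""
    indices = []
    for offset in range(-T, T + 1):
        if offset == 0:
            continue  # skip the target frame itself
        # clamp to the valid range: boundary frames are repeated
        idx = min(max(target_idx + offset, 0), num_frames - 1)
        indices.append(idx)
    return indices
```

For example, with 10 decoded frames and T = 2, the first frame borrows itself twice in place of the missing preceding frames.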
The encoded bitstream may be generated by an image compression coding technique that includes predictive coding based on motion estimation and motion compensation; the motion estimation and motion compensation algorithms are used to remove temporal redundancy, i.e. the image compression coding technique may apply a motion estimation algorithm to determine the motion vector information.
The target frame and the adjacent frames in this embodiment are images of the first resolution, where the first resolution specifically refers to the low resolution of the images obtained by decoding the encoded bitstream.
302. The decoding device extracts a target feature from the image information of the target frame, and extracts adjacent features from the image information of the adjacent frames.
In this embodiment, the decoding device includes multiple feature extraction functions with shared weights. After receiving the image information of the target frame and the image information of the adjacent frames from the decoder, the decoding device may use these feature extraction functions to extract image features from the image information of the target frame and from the image information of each adjacent frame, so as to generate one target feature corresponding to the image information of the target frame and one adjacent feature corresponding to the image information of each adjacent frame.
Optionally, the above image feature may be image texture information of the image.
The above feature extraction function consists of one convolutional layer and several cascaded residual blocks.
303. The decoding device constructs a target feature pyramid from the target feature, and constructs an adjacent feature pyramid from each adjacent feature.
In this embodiment, the decoding device successively reduces the image size of the target feature by filtering and bilinear-interpolation downsampling to obtain target features at different scales, and then arranges the target features of different scales by scale to generate the target feature pyramid. The bottom level of the target feature pyramid corresponds to the original target feature; averaging every 2*2=4 pixels yields the level-2 target feature, and so on, forming a multi-level target feature pyramid. Optionally, the pyramid structure may be a Gaussian pyramid, a Laplacian pyramid, a wavelet pyramid, or the like, which is not specifically limited here.
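The 2*2 averaging construction described above can be sketched in plain Python (a simplified illustration using plain pixel averaging only; the filtering and bilinear-interpolation variants mentioned in the text, and all function names, are assumptions of this sketch):

```python
def downsample_2x2(feature):
    """Average every 2*2=4 pixels to produce the next pyramid level."""
    h, w = len(feature), len(feature[0])
    return [[(feature[y][x] + feature[y][x + 1]
              + feature[y + 1][x] + feature[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def build_pyramid(feature, levels):
    """Bottom level is the original feature; each higher level halves
    the spatial size by 2x2 averaging."""
    pyramid = [feature]
    for _ in range(levels - 1):
        feature = downsample_2x2(feature)
        pyramid.append(feature)
    return pyramid
```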
The decoding device likewise successively reduces the image size of the adjacent feature corresponding to the image information of each adjacent frame by filtering and bilinear-interpolation downsampling to obtain adjacent features at different scales, and then arranges the adjacent features of different scales by scale to generate an adjacent feature pyramid. Exemplarily, based on the 2T adjacent frames before and after the target frame, this embodiment of the application may correspondingly generate 2T adjacent feature pyramids.
A feature pyramid is a basic component of a multi-scale object detection system. The feature pyramid of an image is a series of feature sets arranged in a pyramid shape; it is obtained by stepwise downsampling of an original feature, so its size decreases layer by layer. The feature pyramid exhibits a degree of scale invariance, and this property enables the decoding device of this embodiment to detect images over a wide range of scales.
304. The decoding device determines the position mapping relationship between the target feature and the adjacent features according to the motion vector information.
In this embodiment, the decoding device can determine the motion vector information directly from the encoded bitstream, and determine the position mapping relationship between the target feature and the adjacent features indicated by that motion vector information. For the above target feature pyramid and adjacent feature pyramids, the position mapping relationship means that, for each coordinate in the target feature at each scale of the target feature pyramid, there is a corresponding coordinate in the adjacent feature at the same scale of the adjacent feature pyramid.
305. The decoding device finds the second local feature block of the adjacent feature according to the coordinates of the first local feature block of the target feature and the position mapping relationship.
In this embodiment, the same flexible alignment operation is performed for each scale of the adjacent feature pyramid. Taking one scale as an example, that is, the features at one level of the feature pyramid, the specific operation is as follows. As shown in the local feature block extraction diagram of FIG. 4, guided by the motion vector information, i.e. the position mapping relationship 41, for each coordinate of the target feature 42 in the target feature pyramid there is a corresponding coordinate in the adjacent feature 223 of any adjacent feature pyramid. From these two corresponding coordinates, the decoding device extracts the first local feature block 421 of the target feature and the second local feature block 2231 of the adjacent feature, respectively.
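Steps 304-305, mapping a target coordinate into the adjacent feature through its motion vector and extracting the two corresponding local feature blocks, can be sketched as follows (the k x k patch size, the border clamping, and the function names are illustrative assumptions):

```python
def extract_patch(feature, cx, cy, k=3):
    """Extract a k x k local feature block centered at (cx, cy),
    clamping coordinates at the feature border."""
    h, w = len(feature), len(feature[0])
    half = k // 2
    return [[feature[min(max(cy + dy, 0), h - 1)]
                    [min(max(cx + dx, 0), w - 1)]
             for dx in range(-half, half + 1)]
            for dy in range(-half, half + 1)]

def corresponding_patches(target, neighbor, mv_field, x, y, k=3):
    """The motion vector at (x, y) maps the target coordinate to its
    counterpart in the adjacent feature; local feature blocks are then
    extracted at both positions."""
    dx, dy = mv_field[y][x]
    return (extract_patch(target, x, y, k),
            extract_patch(neighbor, x + dx, y + dy, k))
```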
306. The decoding device performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of correlation attention coefficients.
In this embodiment, the decoding device rearranges the first local feature block of a target feature and the second local feature block of an adjacent feature into two one-dimensional feature vectors, then merges the two one-dimensional feature vectors by concatenation and feeds them into a two-layer fully connected layer to generate an attention vector whose length equals the number of pixel values contained in a local feature block. The decoding device rearranges this attention vector to obtain the set of correlation attention coefficients between the two local feature blocks.
The above set of correlation attention coefficients includes multiple correlation attention coefficients, where each correlation attention coefficient can indicate the similarity between a feature point in the first local feature block and the corresponding feature point in the second local feature block.
307. The decoding device performs a weighted average over the feature points in the second local feature block based on the set of correlation attention coefficients, so as to determine the adaptively locally sampled adjacent feature pyramid.
In this embodiment, owing to the design logic of the coding framework itself, the position mapping relationship provided by the motion vector information does not necessarily reflect the true object motion exactly and may carry coding noise. The correlation attention coefficients in the decoding device allow the network to refine the search for better-matching features within the local neighborhood of the mapped position. After the decoding device has obtained the set of correlation attention coefficients, it can multiply the correlation attention coefficients point by point with the local feature block of the adjacent feature and then sum the results, thereby sampling the second local feature block.
The correlation attention coefficients are the importance weights of the individual features in the local feature block of the adjacent feature, and are used to improve the feature quality of the adjacent feature after adaptive local sampling.
The same steps need to be performed for the local feature block at every coordinate of the target feature in the target feature pyramid.
When the decoding device performs one adaptive local sampling, it processes only one local feature block of one target feature and one local feature block of one adjacent feature. The same adaptive local sampling steps also need to be performed for the other adjacent feature pyramids among the 2T adjacent feature pyramids, and for the local feature blocks of the adjacent features corresponding to the local feature blocks at the other coordinates of the target feature, so that all adjacent feature pyramids are adaptively locally sampled.
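Steps 306-307 can be illustrated with the following sketch. The text derives the attention coefficients from a learned two-layer fully connected network over the concatenated, flattened patches; since the learned weights are not specified here, this sketch substitutes a softmax over element-wise products as a stand-in scoring function, and only the point-wise product-and-sum sampling step mirrors the text directly:

```python
import math

def attention_sample(target_patch, neighbor_patch):
    """Adaptive local sampling of one neighbor patch: derive one
    attention coefficient per pixel of the patch (stand-in scoring:
    softmax over element-wise products instead of the learned MLP),
    then take the attention-weighted sum of the neighbor patch."""
    scores = [t * n for t, n in zip(target_patch, neighbor_patch)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    coeffs = [e / total for e in exps]  # the correlation attention coefficients
    # point-wise product and sum = weighted average of the patch
    return sum(c * n for c, n in zip(coeffs, neighbor_patch))
```

The weighted average lets the network pick the best-matching point inside the local neighborhood rather than trusting the motion vector blindly, which is the noise-robustness property described above.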
308. The decoding device calculates an attention map according to the target feature and the adaptively locally sampled adjacent features.
After performing adaptive local sampling on the adjacent feature pyramids, the decoding device can generate a temporal attention map according to the adaptive local sampling quality of the adjacent frame features.
Exemplarily, the decoding device may compute, coordinate by coordinate, the feature inner product between the adaptively locally sampled adjacent frame feature and the target feature. This feature inner product can characterize the similarity at that point between the adaptively locally sampled adjacent feature and the target feature, and the similarity also represents, to a certain extent, the adaptive local sampling quality at that point. From this adaptive local sampling quality, the decoding device can obtain an attention map of the same size as the feature.
309. The decoding device performs feature enhancement processing on the adaptively locally sampled adjacent features with the attention map.
After determining the above attention map, the decoding device can dynamically fuse the features of the adaptively locally sampled adjacent frames and of the current frame by increasing the weights of high-quality local regions and reducing the weights of low-quality local regions.
Exemplarily, the decoding device multiplies, point by point, the adjacent features in the adaptively locally sampled adjacent feature pyramid with the above attention map, so that the regions of the adjacent features that are more similar to the target feature are adaptively assigned higher attention, i.e. a higher weight for their possible contribution to the super-resolution result, enhancing the required features and suppressing possible interference such as mismatches.
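Steps 308-309, the per-position inner product that forms the temporal attention map and the point-wise modulation of the aligned neighbor feature, can be sketched as follows (features as H x W grids of channel vectors; the sigmoid that squashes the raw inner product into (0, 1) is an assumed detail, as no normalization is specified in the text):

```python
import math

def temporal_attention_enhance(target_feat, aligned_feat):
    """Compute a per-position attention map from feature inner products,
    then multiply the aligned neighbor feature by it point by point,
    up-weighting regions similar to the target and damping mismatches."""
    attn_map, enhanced = [], []
    for t_row, a_row in zip(target_feat, aligned_feat):
        attn_row, enh_row = [], []
        for t_vec, a_vec in zip(t_row, a_row):
            sim = sum(t * a for t, a in zip(t_vec, a_vec))  # inner product
            w = 1.0 / (1.0 + math.exp(-sim))                # assumed squashing
            attn_row.append(w)
            enh_row.append([a * w for a in a_vec])
        attn_map.append(attn_row)
        enhanced.append(enh_row)
    return attn_map, enhanced
```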
310. The decoding device stacks and convolves all feature-enhanced adjacent features with the target feature to generate fused features, and determines the fused feature pyramid.
After feature enhancement has been applied to the adjacent features, the decoding device can stack the feature-enhanced adjacent features with the target feature, i.e. superimpose the features of the feature-enhanced adjacent features onto the target feature, and then obtain the fused feature through one convolutional layer. When all feature-enhanced adjacent features have been stacked onto the target features in the target feature pyramid and convolved, the fused feature pyramid is obtained.
After the decoding device performs one feature fusion, it needs to check whether any adjacent feature has not yet undergone feature enhancement processing. If such an adjacent feature exists, the above feature enhancement processing needs to be performed on it with the target feature, until all adjacent features have undergone feature enhancement and have been fused with the target feature, and the complete fused feature pyramid is determined.
311. The decoding device computes the fused feature pyramid through the second cascaded residual blocks to generate an optimized feature pyramid.
The decoding device uses cascaded residual blocks with scale fusion to generate reconstructed features from the fused feature pyramid. Unlike ordinary residual blocks that process only a single scale, the above second cascaded residual blocks add an extra upsampling or downsampling operation at the end of the skip connection, according to the pyramid level at which the feature is located, so that the feature reconstruction residuals at different scales can fully exchange information and the quality of the reconstructed features is enhanced. The optimized fused features thus obtained form the optimized feature pyramid.
312. The decoding device performs size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal.
Since the downsampling in the decoding device shrinks the feature images, the above optimized feature pyramid needs to be enlarged through a sub-pixel convolution layer, after which a convolutional layer produces the high-resolution reconstructed residual signal.
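The size-expansion step of a sub-pixel convolution layer is a depth-to-space rearrangement (pixel shuffle). A minimal sketch of that rearrangement (the function name and data layout are illustrative assumptions):

```python
def pixel_shuffle(feature, r):
    """Depth-to-space rearrangement at the heart of a sub-pixel
    convolution layer: an (H x W) grid of r*r-channel vectors becomes
    an (H*r x W*r) single-channel image, enlarging the feature without
    interpolation."""
    h, w = len(feature), len(feature[0])
    out = [[0.0] * (w * r) for _ in range(h * r)]
    for y in range(h):
        for x in range(w):
            for c, v in enumerate(feature[y][x]):
                dy, dx = divmod(c, r)  # channel index -> sub-pixel offset
                out[y * r + dy][x * r + dx] = v
    return out
```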
313. The decoding device adds the reconstructed residual signal to the image upscaling result to obtain the reconstructed frame.
In this embodiment, the decoding device may enlarge the image information of the target frame through an upsampling operation with bilinear interpolation. The decoding device may then add the reconstructed residual signal to the image information of the upsampled target frame to obtain the image information of the reconstructed frame and determine the reconstructed frame. The reconstructed frame is the target frame after super-resolution.
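Step 313 can be sketched end to end: bilinearly upscale the low-resolution target frame and add the reconstructed residual signal point by point (a 2x-only sketch; the sampling-grid alignment and the function names are assumptions of this illustration):

```python
def bilinear_upscale_2x(image):
    """Small bilinear-style 2x upsampling sketch: each output pixel is
    interpolated from its (border-clamped) low-resolution neighbors."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h * 2):
        sy = y / 2.0
        y0 = min(int(sy), h - 1)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * 2):
            sx = x / 2.0
            x0 = min(int(sx), w - 1)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bot = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out

def reconstruct(target_lr, residual_hr):
    """Reconstructed frame = upscaled target frame + residual signal."""
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(bilinear_upscale_2x(target_lr), residual_hr)]
```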
In the technical solutions of the embodiments of this application, the decoding device generates the high-resolution reconstructed frame by neural network processing of the motion vector information obtained from the encoded bitstream, the target feature pyramid, and the adjacent feature pyramids. The encoded bitstream already contains a certain amount of motion information, and the computational cost of extracting this motion information from the bitstream is negligible, so the time taken by video super-resolution can be greatly reduced.
The super-resolution process in the image processing of the embodiments of this application may be implemented by a pre-trained network model. Exemplarily, referring to FIG. 5, a schematic architecture diagram of the decoding device in an embodiment of this application, the decoding device 22 may include a decoder 221, a network model 222, a graphics processing unit (GPU) memory 223, and an output buffer 224, which are described respectively as follows.
The decoder 221 is a device that performs the restoration and decoding operation on the encoded bitstream. Optionally, the decoder 221 may be a video decoder supporting coding standards such as H.264, High Efficiency Video Coding (HEVC), or Versatile Video Coding (VVC), for example an HEVC decoder. The decoder 221 in this embodiment of the application additionally provides a motion vector information output interface.
In the product implementation form of the embodiments of this application, the network model 222 is program code that is included in machine learning and deep learning platform software and deployed on the decoding device. The program code of the embodiments of this application exists outside the existing decoder 221. The network model 222 may be generated by supervised training, using machine learning methods, on data consisting of decoded low-resolution video and its corresponding unencoded high-resolution video. For the network model 222, the embodiments of this application design a feature extraction module 2221, a flexible alignment module 2222, a multi-frame feature fusion module 2223, and a feature super-resolution reconstruction module 2224.
The network model 222 first contains a feature extraction module 2221. Since image features carry important physical meaning in deep learning methods, this module aims to transform the input decoded frames from the pixel domain to the feature domain.
Flexible alignment module 2222: this module receives the motion vectors extracted from the bitstream by the decoder 221 and, guided by them, applies a multi-scale local attention mechanism to achieve flexible alignment of the adjacent frames at the feature level.
Multi-frame feature fusion module 2223: this module receives the aligned adjacent frame features and the current frame feature, and uses an attention mechanism in the temporal domain to complete the feature fusion operation.
Feature super-resolution reconstruction module 2224: this module receives the fused image features and uses cascaded multi-scale fusion residual blocks and sub-pixel convolution to complete the super-resolution reconstruction of the decoded video and generate the reconstructed frame.
The GPU memory 223 supports the execution of the program code that performs the computation of each module in the network model 222.
The output buffer 224 receives and stores the reconstructed frames output by the network model 222.
The embodiments of this application implement the program code for super-resolution of HEVC-standard decoded video on the open-source Python-first deep learning framework (PyTorch) machine learning platform, running on a decoding unit equipped with an NVIDIA GPU card. The NVIDIA GPU card provides computing acceleration through the Compute Unified Device Architecture (CUDA) programming interface. In this embodiment, the network model inference process on the distributed PyTorch machine learning platform can be accelerated, and the trained model can perform end-to-end reconstruction directly from decoded video containing compression noise.
Based on the above architecture, the image processing method in the embodiments of this application is described below.
Referring to FIG. 6, the image processing flowchart of the feature extraction module: the feature extraction module converts the target frame and the adjacent frames output by the decoder from the pixel domain to the feature domain, specifically as follows.
601. The feature extraction module acquires the image information of the target frame and the image information of the adjacent frames from the decoder.
In this embodiment, after receiving the encoded bitstream sent by the source device, the decoder can decode the encoded video to obtain the image information of the target frame and the image information of the adjacent frames, and the feature extraction module can receive the image information of the target frame and of the adjacent frames transmitted by the decoder.
The target frame is the image on which super-resolution processing is to be performed in this embodiment. Super-resolution means increasing the resolution of an original image by hardware or software; the process of obtaining one high-resolution image from a series of low-resolution images is super-resolution reconstruction.
When the images within the preset period before and after the target frame have also been decoded (for example, when the 2T adjacent frames before and after the target frame have been decoded), the decoder outputs the image information of these adjacent frames. The period T may be preset, or may be changed according to actual needs. For boundary cases of the sequence, such as the first or last frame, the existing adjacent frames may be repeated so that the input satisfies the requirements of the network.
The encoded bitstream may be generated by an image compression coding technique that includes predictive coding based on motion estimation and motion compensation; the motion estimation and motion compensation algorithms are used to remove temporal redundancy, i.e. the image compression coding technique may apply a motion estimation algorithm to determine the motion vector information.
The target frame and the adjacent frames in this embodiment are images of the first resolution, where the first resolution specifically refers to the low resolution of the images obtained by decoding the encoded bitstream.
602. The feature extraction module extracts the target feature from the image information of the target frame, and extracts the adjacent features from the image information of the adjacent frames.
603. The feature extraction module constructs the target feature pyramid from the target feature, and constructs the adjacent feature pyramids from the adjacent features.
Steps 602-603 are similar to steps 302-303 of the image processing method shown in FIG. 3, and details are not repeated here.
Referring to FIG. 7, the image processing flowchart of the flexible alignment module: the flexible alignment module achieves flexible alignment of the adjacent frames according to the target feature pyramid and the adjacent feature pyramids output by the feature extraction module; for each adjacent feature pyramid, the procedure is as follows.
701. The flexible alignment module receives the motion vector information from the decoder, and the target feature pyramid and the adjacent feature pyramids from the feature extraction module.
In this embodiment, the encoded bitstream may be generated by an image compression coding technique that includes predictive coding based on motion estimation and motion compensation; the motion estimation and motion compensation algorithms are used to remove temporal redundancy, i.e. the image compression coding technique may apply a motion estimation algorithm to determine the motion vector information. The decoder can extract the motion vector information from the encoded bitstream and send it to the flexible alignment module.
The flexible alignment module can also receive the target feature pyramid and the adjacent feature pyramids described with reference to FIG. 3 from the feature extraction module.
702. The flexible alignment module determines the position mapping relationship between the target feature and the adjacent features according to the motion vector information.
703. The flexible alignment module finds the second local feature block of the adjacent feature according to the coordinates of the first local feature block of the target feature and the position mapping relationship.
704. The flexible alignment module performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of correlation attention coefficients.
705. The flexible alignment module performs a weighted average on the second local feature block based on the set of correlation attention coefficients, so as to determine the adaptively locally sampled adjacent feature pyramid.
Steps 702-705 are similar to steps 304-307 of the image processing method shown in FIG. 3, and details are not repeated here.
请参阅图8,图8所示的多帧特征融合模块的图像处理流程图,多帧特征融合模块根据柔性对齐模块输出的自适应局部采样后的相邻特征金字塔,针对所述每个自适应局部采样后的相邻特征金字塔,具体如下:Please refer to FIG. 8, the image processing flow chart of the multi-frame feature fusion module shown in FIG. 8, the multi-frame feature fusion module according to the adaptive local sampling adjacent feature pyramid output by the flexible alignment module, for each adaptive The adjacent feature pyramid after local sampling, as follows:
801.多帧特征融合模块接收来自柔性对齐模块的自适应局部采样后的相邻特征金字塔。801. The multi-frame feature fusion module receives the adaptively locally sampled adjacent feature pyramids from the flexible alignment module.
柔性对齐模块在对相邻特征金字塔进行自适应局部采样后,将该自适应局部采样后的相邻特征金字塔发送给多帧特征融合模块。After the flexible alignment module performs adaptive local sampling on the adjacent feature pyramids, the adjacent feature pyramids after the adaptive local sampling are sent to the multi-frame feature fusion module.
802.多帧特征融合模块根据目标特征和自适应局部采样后的相邻特征计算注意力图。802. The multi-frame feature fusion module calculates an attention map according to the target feature and the adjacent features after adaptive local sampling.
803.多帧特征融合模块将自适应局部采样后的相邻特征与注意力图进行特征增强处理。803. The multi-frame feature fusion module performs feature enhancement processing on the adjacent features after adaptive local sampling and the attention map.
804.多帧特征融合模块将所有特征增强处理后的相邻特征和目标特征进行堆叠和卷积计算以生成融合特征,并确定融合特征金字塔。804. The multi-frame feature fusion module performs stacking and convolution calculations on adjacent features and target features after all feature enhancement processing to generate fusion features, and determines a fusion feature pyramid.
步骤802-804与图3所示的图像处理方法中步骤308-310类似,具体此处不再赘述。Steps 802-804 are similar to steps 308-310 in the image processing method shown in FIG. 3 , and details are not repeated here.
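仅作说明，单个金字塔层级上步骤802-804的最小化示意如下。For illustration only, a minimal sketch of steps 802-804 at a single pyramid level might look as follows; a 1x1 convolution (a plain matrix product) stands in for the stacking-and-convolution network described above, and the per-pixel sigmoid similarity is an assumed form of the attention map:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_frames(target, neighbors, w):
    """Sketch of steps 802-804 at one pyramid level (shapes are assumptions).

    target: (H, W, C); neighbors: list of aligned (H, W, C) features;
    w: weights of an assumed 1x1 convolution, shape (len(neighbors)*C + C, C).
    """
    enhanced = []
    for nb in neighbors:
        # Step 802: attention map = per-pixel similarity with the target.
        attn = sigmoid(np.sum(target * nb, axis=-1, keepdims=True))
        # Step 803: feature enhancement weights the neighbor by its attention.
        enhanced.append(nb * attn)
    # Step 804: stack along channels, then "convolve" (1x1) to fuse.
    stacked = np.concatenate(enhanced + [target], axis=-1)
    return stacked @ w

rng = np.random.default_rng(1)
target = rng.standard_normal((6, 6, 4))
neighbors = [rng.standard_normal((6, 6, 4)) for _ in range(2)]
w = rng.standard_normal((2 * 4 + 4, 4)) * 0.1
fused = fuse_frames(target, neighbors, w)
```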
请参阅图9，图9为特征超分重建模块的图像处理流程图。特征超分重建模块对多帧特征融合模块输出的融合特征金字塔进行处理，具体如下：Referring to FIG. 9, which is a flowchart of the image processing performed by the feature super-resolution reconstruction module, the feature super-resolution reconstruction module processes the fused feature pyramid output by the multi-frame feature fusion module as follows:
901.特征超分重建模块接收来自多帧特征融合模块的融合特征金字塔。901. The feature super-resolution reconstruction module receives the fused feature pyramid from the multi-frame feature fusion module.
902.特征超分重建模块将融合特征金字塔通过第二级联残差块计算生成优化特征金字塔。902. The feature super-resolution reconstruction module processes the fused feature pyramid through the second cascaded residual block to generate an optimized feature pyramid.
903.特征超分重建模块对优化特征金字塔进行尺寸扩大和卷积生成重构残差信号。903. The feature super-resolution reconstruction module performs size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal.
904.特征超分重建模块将重构残差信号与图像放大结果相加，得到重构帧。904. The feature super-resolution reconstruction module adds the reconstructed residual signal to the image enlargement result to obtain the reconstructed frame.
步骤902-904与图3所示的图像处理方法中步骤311-313类似,具体此处不再赘述。Steps 902-904 are similar to steps 311-313 in the image processing method shown in FIG. 3 , and details are not repeated here.
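仅作说明，步骤902-904可以如下勾勒。For illustration only, steps 902-904 can be sketched as below; the cascaded residual blocks are omitted, a pixel-shuffle rearrangement plays the role of the size expansion, and nearest-neighbor repetition stands in for the bilinear image enlargement; all names and shapes are assumptions:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Step 903 "size expansion": rearrange (H, W, C*r*r) -> (H*r, W*r, C)."""
    H, W, crr = x.shape
    c = crr // (r * r)
    return x.reshape(H, W, r, r, c).transpose(0, 2, 1, 3, 4).reshape(H * r, W * r, c)

def enlarge(img, r):
    # Nearest-neighbor repetition as a stand-in for bilinear interpolation.
    return np.repeat(np.repeat(img, r, axis=0), r, axis=1)

def reconstruct(fused_feat, low_res_frame, w_out, r=2):
    """Sketch of steps 902-904; the cascaded residual blocks are omitted."""
    resid = pixel_shuffle(fused_feat @ w_out, r)   # step 903: expand + 1x1 "conv"
    return enlarge(low_res_frame, r) + resid       # step 904: add enlargement

rng = np.random.default_rng(2)
feat = rng.standard_normal((6, 6, 8))      # optimized fused feature
lr = rng.standard_normal((6, 6, 3))        # low-resolution target frame
w_out = rng.standard_normal((8, 3 * 2 * 2)) * 0.1
hr = reconstruct(feat, lr, w_out, r=2)
```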
基于本申请实施例的技术方案，图10为本申请实施例的超分辨率结果与现有技术超分辨率结果的对比图，纵坐标为峰值信噪比（peak signal to noise ratio，PSNR），单位为分贝（decibel，dB），横坐标为每帧的测试时间，单位为毫秒（ms）。本申请实施例的超分辨率结果Ours对比由现有技术的压缩噪声去除技术与轻量超分辨率技术组成的两步方法，例如：滤波器大小可变的残差学习卷积神经网络（variable-filter-size residue-learning convolutional neural networks，VRCNN）+用于视频的高效亚像素卷积网络（video efficient sub-pixel convolution network，VESPCN）、多帧质量增强方法（multi-frame quality enhancement，MFQE）+VESPCN、基于深度卷积神经网络的自动解码器（deep convolutional neural networks-based auto decoder，DCAD）+基于光流超分辨率的视频超分辨率方法（super-resolving optical flow for video super-resolution，SOFVSR）和MFQE+SOFVSR等，在峰值信噪比上有较大提升；对比同为端到端（压缩噪声去除+视频超分辨率）的方法，例如利用非局部时空相关性的渐进融合视频超分辨率网络（progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations，PFNL）时，由于合理地利用了运动矢量的特性，平均每帧的处理时间有明显的减少，示例性的，如图10所示，本申请实施例的超分辨率的测试时间为280ms，而PFNL所使用的测试时间为850ms。Based on the technical solutions of the embodiments of the present application, FIG. 10 compares the super-resolution results of the embodiments of the present application with those of the prior art. The ordinate is the peak signal-to-noise ratio (PSNR) in decibels (dB), and the abscissa is the test time per frame in milliseconds (ms). Compared with prior-art two-step methods composed of a compression noise removal technique and a lightweight super-resolution technique, for example, variable-filter-size residue-learning convolutional neural networks (VRCNN) + the video efficient sub-pixel convolution network (VESPCN), the multi-frame quality enhancement method (MFQE) + VESPCN, the deep convolutional neural networks-based auto decoder (DCAD) + super-resolving optical flow for video super-resolution (SOFVSR), and MFQE + SOFVSR, the super-resolution result "Ours" of the embodiment of the present application achieves a considerable improvement in PSNR. Compared with end-to-end (compression noise removal + video super-resolution) methods such as the progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations (PFNL), the reasonable use of the characteristics of motion vectors significantly reduces the average per-frame processing time; for example, as shown in FIG. 10, the test time of the super-resolution of the embodiment of the present application is 280 ms, versus 850 ms for PFNL.
滤波器大小可变的残差学习卷积神经网络用于轻型的压缩噪声去除网络，由4层卷积层组成。The residual-learning convolutional neural network with variable filter size is used as a lightweight compression noise removal network consisting of 4 convolutional layers.
多帧质量增强方法用于利用运动估计和运动补偿,以及“好帧补偿差帧”思想进行压缩视频噪声去除的端到端网络。The multi-frame quality enhancement method is used for an end-to-end network for denoising compressed video using motion estimation and motion compensation, and the idea of "good frames compensate bad frames".
基于深度卷积神经网络的自动解码器用于压缩噪声去除网络，由10层卷积层组成。The deep convolutional neural networks-based auto decoder is used as a compression noise removal network consisting of 10 convolutional layers.
用于视频的高效亚像素卷积网络用于利用运动估计和运动补偿，通过对齐相邻帧考虑时域相关性，从而进行视频超分辨率的方法。The efficient sub-pixel convolution network for video is a video super-resolution method that uses motion estimation and motion compensation, taking temporal correlation into account by aligning adjacent frames.
基于光流超分辨率的视频超分辨率方法用于利用运动估计和运动补偿，通过对齐相邻帧考虑时域相关性，从而进行视频超分辨率的方法。相对于只预测低分辨率光流的用于视频的高效亚像素卷积网络，基于光流超分辨率的视频超分辨率方法选择预测更加准确的高分辨率光流。The video super-resolution method based on optical flow super-resolution likewise uses motion estimation and motion compensation, taking temporal correlation into account by aligning adjacent frames. Whereas the efficient sub-pixel convolution network for video only predicts low-resolution optical flow, the video super-resolution method based on optical flow super-resolution predicts more accurate high-resolution optical flow.
利用非局部时空相关性的渐进融合视频超分辨率网络用于通过计算非局部注意力和所提出的渐进式融合模块,进行端到端视频超分辨率的方法。A progressive fusion video super-resolution network utilizing non-local spatiotemporal correlations is used for an end-to-end video super-resolution method by computing non-local attention and the proposed progressive fusion module.
以上描述了图像处理方法,下面结合附图介绍本申请实施例的解码设备。The image processing method has been described above, and the decoding device according to the embodiments of the present application will be described below with reference to the accompanying drawings.
图11为本申请实施例中解码设备110的一实施例示意图。FIG. 11 is a schematic diagram of an embodiment of a decoding device 110 in an embodiment of the present application.
如图11所示,本申请实施例提供了解码设备,该解码设备包括:As shown in FIG. 11 , an embodiment of the present application provides a decoding device, and the decoding device includes:
获取单元1101,用于获取编码码流中目标帧的运动矢量信息、所述目标帧的图像信息和相邻帧的图像信息,所述目标帧和所述相邻帧为第一分辨率的图像,所述目标帧为需要进行超分辨率处理的图像,所述相邻帧包括位于所述目标帧之前或之后的预设周期内的图像;The obtaining unit 1101 is used to obtain the motion vector information of the target frame in the encoded code stream, the image information of the target frame and the image information of the adjacent frame, where the target frame and the adjacent frame are images of the first resolution , the target frame is an image that needs to be subjected to super-resolution processing, and the adjacent frame includes an image within a preset period before or after the target frame;
生成单元1102，用于根据所述运动矢量信息、所述目标帧的图像信息和所述相邻帧的图像信息生成重构帧，所述重构帧为第二分辨率的图像，所述第二分辨率大于所述第一分辨率，所述运动矢量信息用于指示所述相邻帧的图像信息与所述目标帧的图像信息进行自适应局部采样。The generating unit 1102 is configured to generate a reconstructed frame according to the motion vector information, the image information of the target frame, and the image information of the adjacent frames, where the reconstructed frame is an image of a second resolution, the second resolution is greater than the first resolution, and the motion vector information is used to indicate adaptive local sampling of the image information of the adjacent frames with respect to the image information of the target frame.
本申请实施例提供的方案中，解码设备通过对从编码码流获得的运动矢量信息、目标特征金字塔和相邻特征金字塔进行神经网络处理生成高分辨率重构帧。编码码流中包含了一定的运动信息，而从码流中提取运动信息的计算代价可以忽略，所以能够大量减少视频超分辨率的时间。In the solution provided by this embodiment of the present application, the decoding device generates a high-resolution reconstructed frame by applying neural network processing to the motion vector information obtained from the encoded code stream, the target feature pyramid, and the adjacent feature pyramids. The encoded code stream already contains a certain amount of motion information, and the computational cost of extracting this motion information from the code stream is negligible, so the time for video super-resolution can be greatly reduced.
可选的，生成单元1102，具体用于根据目标帧的图像信息生成目标特征金字塔，并根据每个相邻帧的图像信息各自生成一个相邻特征金字塔，目标特征金字塔包括多个尺度的目标特征，每个相邻特征金字塔包括多个尺度的相邻特征；基于运动矢量信息，以每个目标特征的位置为基准，将每个相邻特征金字塔中与每个目标特征对应位置的相邻特征进行自适应局部采样；将目标特征金字塔和自适应局部采样后的每个相邻特征金字塔进行融合生成融合特征金字塔，融合特征金字塔包括多个尺度的融合特征；对融合特征金字塔进行处理，以生成重构帧。Optionally, the generating unit 1102 is specifically configured to: generate a target feature pyramid according to the image information of the target frame, and generate one adjacent feature pyramid according to the image information of each adjacent frame, where the target feature pyramid includes target features of multiple scales and each adjacent feature pyramid includes adjacent features of multiple scales; based on the motion vector information and with the position of each target feature as a reference, perform adaptive local sampling on the adjacent features at the position corresponding to each target feature in each adjacent feature pyramid; fuse the target feature pyramid with each adaptively locally sampled adjacent feature pyramid to generate a fused feature pyramid, where the fused feature pyramid includes fused features of multiple scales; and process the fused feature pyramid to generate the reconstructed frame.
可选的，生成单元1102，还用于针对每个相邻特征金字塔，根据目标特征中第一局部特征块的坐标，以及运动矢量信息中所包含的第一局部特征块与相邻特征中第二局部特征块之间的映射关系，查找第二局部特征块；通过全连接层对第一局部特征块和第二局部特征块进行特征匹配，以确定相关注意力系数集合，相关注意力系数集合包括多个相关注意力系数，其中，每个相关注意力系数指示第一局部特征块中的一个特征点与第二局部特征块中对应特征点的相似度；基于相关注意力系数集合对第二局部特征块中的多个特征点进行加权平均，以确定自适应局部采样后的相邻特征金字塔。Optionally, the generating unit 1102 is further configured to: for each adjacent feature pyramid, find the second local feature block according to the coordinates of the first local feature block in the target feature and the mapping relationship, included in the motion vector information, between the first local feature block and the second local feature block in the adjacent feature; perform feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of relevant attention coefficients, where the set includes a plurality of relevant attention coefficients and each relevant attention coefficient indicates the similarity between a feature point in the first local feature block and the corresponding feature point in the second local feature block; and perform a weighted average on a plurality of feature points in the second local feature block based on the set of relevant attention coefficients to determine the adaptively locally sampled adjacent feature pyramid.
可选的，生成单元1102，还用于针对每个相邻特征金字塔，根据目标特征和自适应局部采样后的相邻特征计算注意力图，注意力图用于表示自适应局部采样后的相邻特征与目标特征的相似度；将自适应局部采样后的相邻特征与注意力图进行特征增强处理；将所有特征增强处理后的相邻特征和目标特征进行堆叠和卷积计算以生成融合特征，并确定融合特征金字塔。Optionally, the generating unit 1102 is further configured to: for each adjacent feature pyramid, calculate an attention map according to the target feature and the adaptively locally sampled adjacent feature, where the attention map is used to represent the similarity between the adaptively locally sampled adjacent feature and the target feature; perform feature enhancement processing on the adaptively locally sampled adjacent feature with the attention map; and stack and convolve all the feature-enhanced adjacent features and the target feature to generate fused features, and determine the fused feature pyramid.
可选的，生成单元1102，还用于将目标帧的图像信息进行卷积处理，再进行第一级联残差块的处理，以生成多个尺度的目标特征；将多个尺度的目标特征通过双线性插值的方式生成目标特征金字塔。Optionally, the generating unit 1102 is further configured to: perform convolution processing on the image information of the target frame and then processing of the first cascaded residual block to generate target features of multiple scales; and generate the target feature pyramid from the target features of multiple scales by bilinear interpolation.
可选的，生成单元1102，还用于将每个相邻帧的图像信息进行卷积处理，再进行第一级联残差块的处理，以生成多个尺度的相邻特征；将多个尺度的相邻特征通过双线性插值的方式生成一个相邻特征金字塔。Optionally, the generating unit 1102 is further configured to: perform convolution processing on the image information of each adjacent frame and then processing of the first cascaded residual block to generate adjacent features of multiple scales; and generate one adjacent feature pyramid from the adjacent features of multiple scales by bilinear interpolation.
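仅作说明，上述多尺度金字塔的构建可以用如下片段勾勒。For illustration only, the multi-scale pyramid construction described above can be sketched as follows, where 2x2 average pooling is used as a stand-in for the bilinear rescaling, and the convolution and cascaded residual blocks that would produce the input feature are omitted:

```python
import numpy as np

def build_pyramid(feat, levels=3):
    """Sketch: derive multiple scales from one feature map (level count assumed)."""
    pyr = [feat]
    for _ in range(levels - 1):
        f = pyr[-1]
        H, W, C = f.shape
        # Halve the spatial size; 2x2 average pooling replaces bilinear resizing.
        f = f[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))
        pyr.append(f)
    return pyr

feat = np.arange(8 * 8 * 2, dtype=float).reshape(8, 8, 2)
pyramid = build_pyramid(feat)
```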
可选的，生成单元1102，还用于将融合特征金字塔通过第二级联残差块计算生成优化特征金字塔；对优化特征金字塔进行尺寸扩大和卷积，以生成重构残差信号；将重构残差信号与图像放大结果相加，得到重构帧，图像放大结果为目标帧的图像信息经过双线性插值生成的。Optionally, the generating unit 1102 is further configured to: process the fused feature pyramid through the second cascaded residual block to generate an optimized feature pyramid; perform size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal; and add the reconstructed residual signal to an image enlargement result to obtain the reconstructed frame, where the image enlargement result is generated by bilinear interpolation of the image information of the target frame.
以上所描述的解码设备可以参阅前述方法实施例部分的相应内容进行理解,此处不做过多赘述。The decoding device described above can be understood by referring to the corresponding content in the foregoing method embodiment section, and details are not repeated here.
图12是本申请实施例提供的一种解码设备的结构示意图，该解码设备1200可以包括一个或一个以上中央处理器（central processing units，CPU）1201和存储器1205，该存储器1205中存储有一个或一个以上的应用程序或数据。FIG. 12 is a schematic structural diagram of a decoding device provided by an embodiment of the present application. The decoding device 1200 may include one or more central processing units (CPU) 1201 and a memory 1205, and the memory 1205 stores one or more application programs or data.
其中,存储器1205可以是易失性存储或持久存储。存储在存储器1205的程序可以包括一个或一个以上模块,每个模块可以包括对业务控制单元中的一系列指令操作。更进一步地,中央处理器1201可以设置为与存储器1205通信,在解码设备1200上执行存储器1205中的一系列指令操作。Among them, the memory 1205 may be volatile storage or persistent storage. The program stored in the memory 1205 may include one or more modules, and each module may include a series of instruction operations in the service control unit. Further, the central processing unit 1201 may be arranged to communicate with the memory 1205 to execute a series of instruction operations in the memory 1205 on the decoding device 1200.
解码设备1200还可以包括一个或一个以上电源1202，一个或一个以上有线或无线网络接口1203，一个或一个以上输入输出接口1204，和/或，一个或一个以上操作系统，例如Windows Server™、Mac OS X™、Unix™、Linux™、FreeBSD™等。The decoding device 1200 may also include one or more power supplies 1202, one or more wired or wireless network interfaces 1203, one or more input/output interfaces 1204, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
该解码设备1200可以执行前述图3至图9所示实施例中解码设备所执行的操作,具体此处不再赘述。The decoding device 1200 can perform the operations performed by the decoding device in the foregoing embodiments shown in FIG. 3 to FIG. 9 , and details are not repeated here.
在本申请的另一实施例中，还提供一种计算机可读存储介质，计算机可读存储介质中存储有计算机执行指令，当设备的处理器执行该计算机执行指令时，设备执行上述图3至图9中处理器所执行的图像处理方法的步骤。In another embodiment of the present application, a computer-readable storage medium is further provided, where the computer-readable storage medium stores computer-executable instructions. When a processor of a device executes the computer-executable instructions, the device performs the steps of the image processing method performed by the processor in the foregoing FIG. 3 to FIG. 9.
在本申请的另一实施例中,还提供一种计算机程序产品,该计算机程序产品包括计算机执行指令,该计算机执行指令存储在计算机可读存储介质中;当设备的处理器执行该计算机执行指令时,设备执行上述图3至图9中处理器所执行的图像处理方法的步骤。In another embodiment of the present application, a computer program product is also provided, the computer program product includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium; when a processor of a device executes the computer-executable instructions , the device executes the steps of the image processing method executed by the processor in the above-mentioned FIG. 3 to FIG. 9 .
在本申请的另一实施例中，还提供一种芯片系统，该芯片系统包括至少一个处理器，处理器用于支持解码设备实现上述图3至图9中处理器所执行的图像处理方法的步骤。在一种可能的设计中，芯片系统还可以包括存储器，存储器用于保存解码设备必要的程序指令和数据。该芯片系统，可以由芯片构成，也可以包含芯片和其他分立器件。In another embodiment of the present application, a chip system is further provided. The chip system includes at least one processor configured to support a decoding device in implementing the steps of the image processing method performed by the processor in the foregoing FIG. 3 to FIG. 9. In a possible design, the chip system may further include a memory for storing the program instructions and data necessary for the decoding device. The chip system may consist of chips, or may include chips and other discrete devices.
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请实施例的范围。Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Experts may use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of the embodiments of the present application.
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the system, device and unit described above may refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. For example, the division of units is only a logical function division. In actual implementation, there may be other division methods, for example, multiple units or components may be combined or integrated. to another system, or some features can be ignored, or not implemented. On the other hand, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。Units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,read-only memory)、随机存取存储器(RAM,random access memory)、磁碟或者光盘等各种可以存储程序代码的介质。The integrated unit, if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application can be embodied in the form of software products in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, and the computer software products are stored in a storage medium , including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application. The aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM, read-only memory), random access memory (RAM, random access memory), magnetic disk or optical disk and other media that can store program codes .
以上,仅为本申请实施例的具体实施方式,但本申请实施例的保护范围并不局限于此。The above are only specific implementations of the embodiments of the present application, but the protection scope of the embodiments of the present application is not limited thereto.

Claims (13)

  1. 一种图像处理方法,其特征在于,包括:An image processing method, comprising:
    解码设备获取编码码流中目标帧的运动矢量信息、所述目标帧的图像信息和相邻帧的图像信息，所述目标帧和所述相邻帧为第一分辨率的图像，所述目标帧为需要进行超分辨率处理的图像，所述相邻帧包括位于所述目标帧之前或之后的预设周期内的一个或多个图像；A decoding device obtains motion vector information of a target frame in an encoded code stream, image information of the target frame, and image information of adjacent frames, where the target frame and the adjacent frames are images of a first resolution, the target frame is an image on which super-resolution processing needs to be performed, and the adjacent frames comprise one or more images within a preset period before or after the target frame;
    所述解码设备根据所述运动矢量信息、所述目标帧的图像信息和所述相邻帧的图像信息生成重构帧，所述重构帧为第二分辨率的图像，所述第二分辨率大于所述第一分辨率，所述运动矢量信息被用于所述相邻帧的图像信息与所述目标帧的图像信息的自适应局部采样。The decoding device generates a reconstructed frame according to the motion vector information, the image information of the target frame, and the image information of the adjacent frames, where the reconstructed frame is an image of a second resolution, the second resolution is greater than the first resolution, and the motion vector information is used for adaptive local sampling of the image information of the adjacent frames with respect to the image information of the target frame.
  2. 根据权利要求1所述的图像处理方法,其特征在于,所述解码设备根据所述运动矢量信息、所述目标帧的图像信息和所述相邻帧的图像信息生成重构帧包括:The image processing method according to claim 1, wherein the decoding device generating the reconstructed frame according to the motion vector information, the image information of the target frame and the image information of the adjacent frame comprises:
    所述解码设备基于所述运动矢量信息,对所述相邻帧的图像信息中与所述目标帧的图像信息对应的各个位置进行自适应局部采样;The decoding device performs adaptive local sampling on each position corresponding to the image information of the target frame in the image information of the adjacent frames based on the motion vector information;
    所述解码设备根据所述目标帧的图像信息和自适应局部采样后的所述相邻帧的图像信息生成所述重构帧。The decoding device generates the reconstructed frame according to the image information of the target frame and the image information of the adjacent frame after adaptive local sampling.
  3. 根据权利要求2所述的图像处理方法，其特征在于，所述解码设备基于所述运动矢量信息，对所述相邻帧的图像信息中与所述目标帧的图像信息对应的各个位置进行自适应局部采样包括：The image processing method according to claim 2, wherein the decoding device performing, based on the motion vector information, adaptive local sampling on each position in the image information of the adjacent frames corresponding to the image information of the target frame comprises:
    所述解码设备根据所述目标帧的图像信息生成目标特征金字塔，并根据每个相邻帧的图像信息各自生成一个相邻特征金字塔，所述目标特征金字塔包括多个尺度的目标特征，每个所述相邻特征金字塔包括多个尺度的相邻特征；The decoding device generates a target feature pyramid according to the image information of the target frame, and generates one adjacent feature pyramid according to the image information of each adjacent frame, where the target feature pyramid includes target features of multiple scales, and each adjacent feature pyramid includes adjacent features of multiple scales;
    所述解码设备基于所述运动矢量信息,以每个目标特征的位置为基准,将所述每个相邻特征金字塔中与所述每个目标特征对应位置的相邻特征进行自适应局部采样。Based on the motion vector information, the decoding device performs adaptive local sampling on the adjacent features at the corresponding positions of each target feature in each adjacent feature pyramid with the position of each target feature as a reference.
  4. 根据权利要求3所述的图像处理方法,其特征在于,所述解码设备根据所述目标帧的图像信息和自适应局部采样后的所述相邻帧的图像信息生成所述重构帧包括:The image processing method according to claim 3, wherein the decoding device generating the reconstructed frame according to the image information of the target frame and the image information of the adjacent frames after adaptive local sampling comprises:
    所述解码设备将所述目标特征金字塔和自适应局部采样后的所述每个相邻特征金字塔进行融合生成融合特征金字塔,所述融合特征金字塔包括多个尺度的融合特征;The decoding device fuses the target feature pyramid and each adjacent feature pyramid after adaptive local sampling to generate a fusion feature pyramid, and the fusion feature pyramid includes fusion features of multiple scales;
    所述解码设备对所述融合特征金字塔进行处理,以生成所述重构帧。The decoding device processes the fused feature pyramid to generate the reconstructed frame.
  5. 根据权利要求3所述的图像处理方法，其特征在于，所述解码设备基于所述运动矢量信息，以每个目标特征的位置为基准，将所述每个相邻特征金字塔中与所述每个目标特征对应位置的相邻特征进行自适应局部采样包括：针对所述每个相邻特征金字塔，The image processing method according to claim 3, wherein the decoding device performing, based on the motion vector information and with the position of each target feature as a reference, adaptive local sampling on the adjacent features at the position corresponding to each target feature in each adjacent feature pyramid comprises: for each adjacent feature pyramid,
    所述解码设备根据所述目标特征中第一局部特征块的坐标，以及所述运动矢量信息中所包含的所述第一局部特征块与所述相邻特征中第二局部特征块之间的映射关系，查找所述第二局部特征块；The decoding device finds the second local feature block according to coordinates of a first local feature block in the target feature and a mapping relationship, included in the motion vector information, between the first local feature block and a second local feature block in the adjacent feature;
    所述解码设备通过全连接层对所述第一局部特征块和所述第二局部特征块进行特征匹配，以确定相关注意力系数集合，所述相关注意力系数集合包括多个相关注意力系数，其中，每个相关注意力系数指示所述第一局部特征块中的一个特征点与所述第二局部特征块中对应特征点的相似度；The decoding device performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of relevant attention coefficients, where the set of relevant attention coefficients includes a plurality of relevant attention coefficients, and each relevant attention coefficient indicates a similarity between a feature point in the first local feature block and a corresponding feature point in the second local feature block;
    所述解码设备基于所述相关注意力系数集合对所述第二局部特征块中的多个特征点进行加权平均,以确定自适应局部采样后的所述相邻特征金字塔。The decoding device performs a weighted average of a plurality of feature points in the second local feature block based on the relevant attention coefficient set to determine the adjacent feature pyramid after adaptive local sampling.
  6. 根据权利要求4所述的图像处理方法,其特征在于,所述解码设备将所述目标特征金字塔和自适应局部采样后的所述相邻特征金字塔进行融合生成融合特征金字塔包括:针对所述每个相邻特征金字塔,The image processing method according to claim 4, wherein the decoding device fuses the target feature pyramid and the adjacent feature pyramids after adaptive local sampling to generate a fused feature pyramid comprising: for each of the adjacent feature pyramids,
    所述解码设备根据所述目标特征和所述自适应局部采样后的所述相邻特征计算注意力图，所述注意力图用于表示所述自适应局部采样后的所述相邻特征与所述目标特征的相似度；The decoding device calculates an attention map according to the target feature and the adaptively locally sampled adjacent feature, where the attention map is used to represent a similarity between the adaptively locally sampled adjacent feature and the target feature;
    所述解码设备将所述自适应局部采样后的所述相邻特征与所述注意力图进行特征增强处理;The decoding device performs feature enhancement processing on the adjacent features after the adaptive local sampling and the attention map;
    所述解码设备将所有特征增强处理后的相邻特征和所述目标特征进行堆叠和卷积计算以生成所述融合特征,并确定所述融合特征金字塔。The decoding device performs stacking and convolution calculations on all adjacent features after feature enhancement processing and the target feature to generate the fusion feature, and determines the fusion feature pyramid.
  7. 根据权利要求3-6任一项所述的图像处理方法,其特征在于,所述解码设备根据所述目标帧的图像信息生成目标特征金字塔包括:The image processing method according to any one of claims 3-6, wherein the decoding device generating the target feature pyramid according to the image information of the target frame comprises:
    所述解码设备将所述目标帧的图像信息进行卷积处理,再进行第一级联残差块的处理,以生成目标特征;The decoding device performs convolution processing on the image information of the target frame, and then performs processing on the first cascaded residual block to generate target features;
    所述解码设备将所述目标特征通过双线性插值的方式生成多个尺度的目标特征,并构建所述目标特征金字塔。The decoding device generates target features of multiple scales from the target features by means of bilinear interpolation, and constructs the target feature pyramid.
  8. 根据权利要求3-6任一项所述的图像处理方法,其特征在于,所述解码设备根据每个所述相邻帧的图像信息各自生成一个相邻特征金字塔包括:The image processing method according to any one of claims 3-6, wherein the decoding device generates an adjacent feature pyramid according to the image information of each adjacent frame, comprising:
    所述解码设备将每个所述相邻帧的图像信息进行卷积处理，再进行第一级联残差块的处理，以生成每个所述相邻帧的图像信息分别对应的一个相邻特征；The decoding device performs convolution processing on the image information of each of the adjacent frames, and then performs processing of the first cascaded residual block, to generate one adjacent feature corresponding to the image information of each of the adjacent frames;
    所述解码设备将所述相邻特征通过所述双线性插值的方式生成多个尺度的相邻特征,并构建所述相邻特征金字塔。The decoding device generates adjacent features of multiple scales from the adjacent features by means of the bilinear interpolation, and constructs the adjacent feature pyramid.
  9. 根据权利要求4或6所述的图像处理方法,其特征在于,所述解码设备对所述融合特征金字塔进行处理生成所述重构帧包括:The image processing method according to claim 4 or 6, wherein the decoding device performs processing on the fusion feature pyramid to generate the reconstructed frame, comprising:
    所述解码设备将所述融合特征金字塔通过第二级联残差块计算生成优化特征金字塔;The decoding device generates an optimized feature pyramid by calculating the fusion feature pyramid through the second cascade residual block;
    所述解码设备对所述优化特征金字塔进行尺寸扩大和卷积,以生成重构残差信号;The decoding device performs size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal;
    所述解码设备将所述重构残差信号与图像放大结果相加,得到所述重构帧,所述图像放大结果为所述目标帧的图像信息经过双线性插值生成的。The decoding device adds the reconstructed residual signal and an image enlargement result to obtain the reconstructed frame, and the image enlargement result is generated by bilinear interpolation of image information of the target frame.
  10. A computer-readable storage medium, wherein a program is stored in the computer-readable storage medium, and when a computer executes the program, the method according to any one of claims 1 to 9 is performed.
  11. A computing device, comprising a processor and a computer-readable storage medium storing a computer program;
    The processor is coupled to the computer-readable storage medium, and when the computer program is executed by the processor, the method according to any one of claims 1 to 9 is implemented.
  12. A computer program product, wherein when the computer program product is executed on a computer, the computer performs the method according to any one of claims 1 to 9.
  13. A chip system, comprising a processor, wherein the processor is invoked to perform the method according to any one of claims 1 to 9.
PCT/CN2021/120193 2020-09-30 2021-09-24 Image processing method and apparatus WO2022068682A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011063576.4A CN114339260A (en) 2020-09-30 2020-09-30 Image processing method and device
CN202011063576.4 2020-09-30

Publications (1)

Publication Number Publication Date
WO2022068682A1 true WO2022068682A1 (en) 2022-04-07

Family

ID=80949600

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120193 WO2022068682A1 (en) 2020-09-30 2021-09-24 Image processing method and apparatus

Country Status (2)

Country Link
CN (1) CN114339260A (en)
WO (1) WO2022068682A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115623242A (en) * 2022-08-30 2023-01-17 华为技术有限公司 Video processing method and related equipment thereof
CN115861131B (en) * 2023-02-03 2023-05-26 北京百度网讯科技有限公司 Training method and device for generating video and model based on image, and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100290529A1 (en) * 2009-04-14 2010-11-18 Pankaj Topiwala Real-time superresolution and video transmission
CN102236889A (en) * 2010-05-18 2011-11-09 王洪剑 Super-resolution reconfiguration method based on multiframe motion estimation and merging
CN106851046A (en) * 2016-12-28 2017-06-13 中国科学院自动化研究所 Video dynamic super-resolution processing method and system
CN111047516A (en) * 2020-03-12 2020-04-21 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11734837B2 (en) * 2020-09-30 2023-08-22 Shanghai United Imaging Intelligence Co., Ltd. Systems and methods for motion estimation
CN115190251A (en) * 2022-07-07 2022-10-14 北京拙河科技有限公司 Airport ground safety analysis method and device based on hundred million image array type camera
CN115190251B (en) * 2022-07-07 2023-09-22 北京拙河科技有限公司 Airport ground safety analysis method and device based on Yilike array camera
CN115567719A (en) * 2022-08-23 2023-01-03 天津市国瑞数码安全系统股份有限公司 Multi-level convolution video compression method and system
CN117714691A (en) * 2024-02-05 2024-03-15 佳木斯大学 Adaptive transmission system for AR (augmented reality) piano teaching
CN117714691B (en) * 2024-02-05 2024-04-12 佳木斯大学 Adaptive transmission system for AR (augmented reality) piano teaching

Also Published As

Publication number Publication date
CN114339260A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
WO2022068682A1 (en) Image processing method and apparatus
US10462457B2 (en) Dynamic reference motion vector coding mode
US9602819B2 (en) Display quality in a variable resolution video coder/decoder system
US9407915B2 (en) Lossless video coding with sub-frame level optimal quantization values
KR100985464B1 (en) Scaler architecture for image and video processing
WO2021042957A1 (en) Image processing method and device
US20240098298A1 (en) Segmentation-based parameterized motion models
CN107071440B (en) Motion vector prediction using previous frame residuals
WO2017129023A1 (en) Decoding method, encoding method, decoding apparatus, and encoding apparatus
US20230069953A1 (en) Learned downsampling based cnn filter for image and video coding using learned downsampling feature
WO2021109978A1 (en) Video encoding method, video decoding method, and corresponding apparatuses
WO2023000179A1 (en) Video super-resolution network, and video super-resolution, encoding and decoding processing method and device
CN115552905A (en) Global skip connection based CNN filter for image and video coding
WO2020143585A1 (en) Video encoder, video decoder, and corresponding method
TW202239209A (en) Multi-scale optical flow for learned video compression
US9210424B1 (en) Adaptive prediction block size in video coding
CN111225214A (en) Video processing method and device and electronic equipment
WO2022266955A1 (en) Image decoding method and apparatus, image processing method and apparatus, and device
US10448013B2 (en) Multi-layer-multi-reference prediction using adaptive temporal filtering
WO2021169817A1 (en) Video processing method and electronic device
CN114554205A (en) Image coding and decoding method and device
TWI834087B (en) Method and apparatus for reconstructing an image from bitstreams and encoding an image into bitstreams, and computer program product
CN116760976B (en) Affine prediction decision method, affine prediction decision device, affine prediction decision equipment and affine prediction decision storage medium
RU2787217C1 (en) Method and device for interpolation filtration for encoding with prediction
WO2023000182A1 (en) Image encoding, decoding and processing methods, image decoding apparatus, and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21874343

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21874343

Country of ref document: EP

Kind code of ref document: A1