WO2022068682A1 - Image processing method and apparatus - Google Patents

Image processing method and apparatus

Info

Publication number
WO2022068682A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
adjacent
decoding device
target
frame
Prior art date
Application number
PCT/CN2021/120193
Other languages
English (en)
Chinese (zh)
Inventor
Wang Shiqi (王诗淇)
Sun Long (孙龙)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
City University of Hong Kong (香港城市大学)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. and City University of Hong Kong
Publication of WO2022068682A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N 19/51 Motion estimation or motion compensation
    • H04N 19/513 Processing of motion vectors
    • H04N 19/527 Global motion vector estimation
    • H04N 19/53 Multi-resolution motion estimation; Hierarchical motion estimation

Definitions

  • the embodiments of the present application relate to the field of image processing, and in particular, to an image processing method and apparatus.
  • Existing video super-resolution techniques usually use the combining local and global total variation (CLG-TV) optical flow model algorithm to calculate the optical flow velocity vectors between all the low-resolution video images in the sequence and the current-frame video image, that is, to perform motion estimation; the motion-compensated 2T low-resolution video images and the current-frame low-resolution video image are obtained according to the optical flow velocity vectors, and a deep residual network then processes the 2T low-resolution video images and the current-frame low-resolution video image through an initial stage, a concatenated convolution layer calculation stage and a residual block calculation stage in turn, after which deconvolution and convolution operations are gradually used to reconstruct the high-resolution video image.
  • The embodiments of the present application provide an image processing method that uses the motion vector information in an encoded code stream to process a frame to be super-resolved into a high-resolution reconstructed frame, which avoids the large amount of computation required for motion estimation and can significantly reduce the time needed for video super-resolution.
  • A first aspect of the embodiments of the present application provides an image processing method. The method includes: a decoding device obtains, from an encoded code stream, motion vector information of a target frame, image information of the target frame, and image information of adjacent frames, where the target frame and the adjacent frames are images of a first resolution, the target frame is an image that needs to undergo super-resolution processing, and the adjacent frames include images within a preset period before or after the target frame; the decoding device generates a reconstructed frame according to the motion vector information, the image information of the target frame and the image information of the adjacent frames, where the reconstructed frame is an image of a second resolution, the second resolution is greater than the first resolution, and the motion vector information is used to indicate adaptive local sampling between the image information of the adjacent frames and the image information of the target frame.
  • The encoded code stream may be a code stream generated by an image compression coding technique that includes predictive coding based on motion estimation and compensation; the motion estimation and motion compensation algorithms are used to remove redundant information in the temporal domain, that is, the image compression coding technique may apply a motion estimation algorithm to determine the motion vector information.
  • The decoding device may directly extract the motion vector information from the encoded code stream and decode the encoded code stream to obtain decoded image information, where the image information may include the image information of the target frame on which super-resolution processing is to be performed and the image information of the adjacent frames within the preset period before or after the target frame.
  • The adjacent frames may include T frames before and T frames after the target frame, and the target frame and the adjacent frames are all images of the first resolution, that is, the low resolution.
  • The decoding device can process the image information of the target frame and the image information of the adjacent frames in combination with the motion vector information to generate a reconstructed frame of the second resolution. The reconstructed frame is a high-resolution image, that is, the second resolution is greater than the above-mentioned first resolution, and the motion vector information is used to indicate that adaptive local sampling is performed between the image information of the adjacent frames and the image information of the target frame. In this way, the decoding device in this application does not need to perform the computation-heavy motion estimation process when performing super-resolution, which can greatly reduce the time needed for video super-resolution.
  • The decoding device performs, based on the motion vector information, adaptive local sampling at each position in the image information of the adjacent frames that corresponds to the image information of the target frame; the decoding device then generates a reconstructed frame from the image information of the target frame and the adaptively locally sampled image information of the adjacent frames.
  • In combination with the motion vector information, the decoding device can adaptively select, for each local position of the image information of the target frame, image points with high similarity from the multiple image points at the corresponding local position in the image information of the adjacent frames for sampling. The decoding device can then perform image reconstruction based on the adaptively locally sampled image information of the adjacent frames to generate the reconstructed frame, which can reduce the noise present in the motion vector information and improve robustness.
  • In the above steps, the decoding device performing adaptive local sampling, based on the motion vector information, at each position in the image information of the adjacent frames corresponding to the image information of the target frame includes: the decoding device generates a target feature pyramid according to the image information of the target frame, and generates an adjacent feature pyramid according to the image information of each adjacent frame, where the target feature pyramid includes target features of multiple scales and each adjacent feature pyramid includes adjacent features of multiple scales; based on the motion vector information and taking the position of each target feature as a reference, the decoding device performs adaptive local sampling on the adjacent features corresponding to each target feature in each adjacent feature pyramid.
  • the feature pyramid of an image is a series of feature sets arranged in a pyramid shape.
  • the feature pyramid is obtained by down-sampling an original feature step by step, so the size is reduced layer by layer.
  • A feature extraction function extracts features from the image information of the target frame and the image information of the adjacent frames to obtain the target feature and the adjacent features, then generates target features and adjacent features of multiple scales through downsampling, and forms the corresponding feature pyramids.
  • The decoding device may, based on the scale invariance of the feature pyramid, take the position of each target feature at each scale as a benchmark and perform adaptive local sampling on the adjacent features at the corresponding positions of each adjacent feature pyramid according to the motion vector information, that is, refine the search for better matching features within the adjacent features at the mapped positions and improve the feature quality of each adjacent feature.
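  • The pyramid construction described above can be sketched in PyTorch as follows; the function name, the number of levels and the (N, C, H, W) tensor layout are illustrative assumptions rather than details taken from the patent.

```python
import torch
import torch.nn.functional as F

def build_feature_pyramid(feat: torch.Tensor, num_levels: int = 3) -> list:
    """Build a feature pyramid by repeated bilinear downsampling.

    feat: (N, C, H, W) feature map extracted from one frame.
    Returns [level0, level1, ...] whose spatial size halves per level.
    """
    pyramid = [feat]
    for _ in range(num_levels - 1):
        # Halve the spatial resolution; the channel count is unchanged.
        feat = F.interpolate(feat, scale_factor=0.5,
                             mode="bilinear", align_corners=False)
        pyramid.append(feat)
    return pyramid
```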
  • In the above steps, the decoding device generating a reconstructed frame according to the image information of the target frame and the adaptively locally sampled image information of the adjacent frames includes: the decoding device fuses the target feature pyramid and each adaptively locally sampled adjacent feature pyramid to generate a fused feature pyramid, where the fused feature pyramid includes fused features of multiple scales; the decoding device processes the fused feature pyramid to generate the reconstructed frame.
  • Based on the above method, the decoding device stacks the adaptively locally sampled adjacent feature pyramids with the target feature pyramid and fuses them by convolution into a fused feature pyramid, which can then be reconstructed into a high-resolution image. Stacking (concat) is the merging of feature channels, that is, the number of features (channels) describing the image increases while the information under each feature does not increase.
  • In the above steps, the decoding device, based on the motion vector information and taking the position of each target feature as a reference, performing adaptive local sampling on the adjacent features of each adjacent feature pyramid corresponding to the position of each target feature includes: for each adjacent feature pyramid, the decoding device finds the second local feature block based on the coordinates of the first local feature block in the target feature and the mapping relationship, included in the motion vector information, between the first local feature block and the second local feature block in the adjacent feature; the decoding device performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of relevant attention coefficients, where the set includes a plurality of relevant attention coefficients and each relevant attention coefficient indicates the similarity between a feature point in the first local feature block and the corresponding feature point in the second local feature block; the decoding device performs a weighted average of multiple feature points in the second local feature block based on the set of relevant attention coefficients to determine the adaptively locally sampled adjacent feature pyramid.
  • The attention coefficient is the degree of attention, that is, more attention is attached to the feature points in the adjacent features that are more similar to the target feature. The decoding device extracts the first local feature block at a scale of the target feature pyramid and, based on the mapping relationship indicated by the motion vector information, extracts the second local feature block at the corresponding coordinates of an adjacent feature pyramid from the coordinates of the first local feature block. It then determines the attention coefficient of each feature point in the second local feature block through a two-layer fully connected layer, and attaches the corresponding degree of attention to each feature point in the second local feature block according to these attention coefficients, thereby obtaining the adaptively locally sampled second local feature block. After the feature blocks extracted from the adjacent features of all adjacent feature pyramids, at all scales corresponding to the target feature pyramid, have been processed in this way, the adaptively locally sampled adjacent feature pyramids are obtained.
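  • A minimal sketch of this adaptive local sampling step is given below, assuming 3x3 local feature blocks, a softmax-normalised weighted average and a two-layer fully connected network; the class name, block size and tensor shapes are illustrative assumptions, not the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalAttentionSampler(nn.Module):
    """Adaptive local sampling of one adjacent-feature block guided by a motion vector."""

    def __init__(self, channels: int, block: int = 3):
        super().__init__()
        self.block = block
        n = channels * block * block
        # Two-layer fully connected network: matches the two blocks and
        # outputs one relevant attention coefficient per point of the block.
        self.fc = nn.Sequential(
            nn.Linear(2 * n, n), nn.ReLU(inplace=True),
            nn.Linear(n, block * block),
        )

    def forward(self, target_feat, adj_feat, y, x, mv):
        # Extract the first local block around (y, x) in the target feature and
        # the second local block at the motion-vector-mapped position.
        # (Interior coordinates are assumed; boundary handling is omitted.)
        b = self.block // 2
        t_blk = target_feat[:, :, y - b:y + b + 1, x - b:x + b + 1]
        ay, ax = y + int(mv[0]), x + int(mv[1])
        a_blk = adj_feat[:, :, ay - b:ay + b + 1, ax - b:ax + b + 1]
        # Rearrange both blocks into 1-D vectors, concatenate them, and
        # predict an attention coefficient for every point of the block.
        vec = torch.cat([t_blk.flatten(1), a_blk.flatten(1)], dim=1)
        attn = F.softmax(self.fc(vec), dim=1).view(-1, 1, self.block, self.block)
        # Weighted average of the adjacent block gives the sampled feature.
        return (a_blk * attn).sum(dim=(2, 3))
```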
  • In the above steps, the decoding device fusing the target feature pyramid and the adaptively locally sampled adjacent feature pyramids to generate a fused feature pyramid includes: for each adjacent feature pyramid, the decoding device calculates an attention map according to the target feature and the adaptively locally sampled adjacent features, where the attention map is used to represent the similarity between the adaptively locally sampled adjacent features and the target feature; the decoding device performs feature enhancement processing on the adaptively locally sampled adjacent features using the attention map; the decoding device stacks and convolves all feature-enhanced adjacent features with the target features to generate the fused features and determine the fused feature pyramid.
  • Based on the above method, the decoding device can generate an attention map in the time domain according to the alignment quality of the adjacent-frame features and, by increasing the weight of high-quality adaptively sampled local areas and reducing the weight of low-quality areas, dynamically fuse the adaptively locally sampled features of the adjacent frames with those of the target frame. The alignment quality can be expressed by calculating, at each coordinate point, the feature inner product of the adaptively locally sampled adjacent-frame feature and the target feature; this inner product represents the similarity between the adaptively locally sampled adjacent feature and the target feature. Each feature area is then weighted, for example by multiplying the feature and the attention map point by point. The stacking operation is performed on the feature-enhanced adjacent features and the target features, and the fused features are generated by convolution. After the decoding device performs one feature fusion, it needs to detect whether there are adjacent features that have not undergone feature enhancement processing; the process is repeated until all adjacent features and the target features have been fused to generate the fused feature pyramid.
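  • The temporal attention fusion of one aligned adjacent feature with the target feature might look like the following sketch; the sigmoid normalisation of the inner product and the 1x1 fusion convolution are assumptions the patent does not specify.

```python
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    """Fuse one aligned adjacent feature with the target feature using a
    temporal attention map (illustrative sketch, not the patent's exact layers)."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution that merges the stacked (concatenated) channels.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, target_feat, aligned_adj_feat):
        # Alignment quality: channel-wise inner product at every coordinate,
        # squashed to (0, 1) so it can act as a per-pixel weight.
        attn_map = torch.sigmoid(
            (target_feat * aligned_adj_feat).sum(dim=1, keepdim=True))
        # Feature enhancement: point-by-point multiplication with the map.
        enhanced = aligned_adj_feat * attn_map
        # Stacking (concat) followed by convolution gives the fused feature.
        return self.fuse(torch.cat([target_feat, enhanced], dim=1))
```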
  • In the above steps, the decoding device generating the target feature pyramid according to the image information of the target frame includes: the decoding device convolves the image information of the target frame and then processes it with first cascaded residual blocks to generate the target feature corresponding to the image information of the target frame; the decoding device generates target features of multiple scales by bilinear interpolation and constructs the target feature pyramid.
  • The residual blocks use skip connections to improve accuracy by allowing a considerable depth; a skip connection directly routes the input information received by the residual block around to its output, protecting the integrity of the information.
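  • A plain residual block with such a skip connection, and a helper that cascades several of them, could be sketched as follows; the channel count, kernel size and number of blocks are illustrative assumptions.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain residual block with a skip connection (illustrative sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        # The skip connection routes the input directly to the output,
        # so the convolutions only learn a residual correction.
        return x + self.body(x)

def cascaded_residual_blocks(channels: int, n_blocks: int = 5) -> nn.Sequential:
    """Cascade of residual blocks, as used after the initial convolution."""
    return nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])
```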
  • the scale represents the number of pixels in the image.
  • the decoding device extracts a target feature in the image information of the target frame through a feature extraction function, where the feature extraction function includes convolution processing and cascaded residual blocks.
  • the decoding device can reduce the image to different degrees by downsampling the target features through bilinear interpolation, so as to obtain target features of different scales, and then arrange the target features according to the scale to generate the target feature pyramid.
  • the number of pixels in each layer of the pyramid continues to decrease from bottom to top, which can greatly reduce the amount of computation.
  • In the above steps, the decoding device generating an adjacent feature pyramid according to the image information of each adjacent frame includes: the decoding device convolves the image information of each adjacent frame and then processes it with the first cascaded residual blocks to generate the adjacent feature corresponding to the image information of each adjacent frame; the decoding device generates adjacent features of multiple scales by bilinear interpolation and constructs the adjacent feature pyramid.
  • Based on the above method, the decoding device simultaneously performs feature extraction on the image information of the adjacent frames through multiple feature extraction functions that share weights with the above feature extraction function to obtain the adjacent features, then reduces the image to different degrees by bilinear-interpolation downsampling to obtain adjacent features of different scales, and arranges the adjacent features by scale to generate the adjacent feature pyramids.
  • In the above steps, the decoding device processing the fused feature pyramid to generate the reconstructed frame includes: the decoding device computes the fused feature pyramid through second cascaded residual blocks to generate an optimized feature pyramid; the decoding device performs size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal; the decoding device adds the reconstructed residual signal to an image enlargement result to obtain the reconstructed frame, where the image enlargement result is the image information of the target frame enlarged by bilinear interpolation.
  • Based on the above method, the decoding device exchanges information among the features of each scale level of the fused feature pyramid through the second cascaded residual blocks: the features of each scale level can be upsampled or downsampled and interact at the same scale to optimize the fused features; the optimized fused features are then enlarged and convolved, and added to the image information of the target frame enlarged by bilinear interpolation to obtain the high-resolution reconstructed frame.
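  • Putting the last steps together, a reconstruction head along these lines might look as follows; the 4x scale factor, the single output convolution and the bilinear feature upsampling are assumptions for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionHead(nn.Module):
    """Turn the optimized fused feature into a reconstructed HR frame
    (illustrative sketch; scale factor and layer sizes are assumptions)."""

    def __init__(self, channels: int, scale: int = 4, out_channels: int = 3):
        super().__init__()
        self.scale = scale
        self.to_residual = nn.Conv2d(channels, out_channels, 3, padding=1)

    def forward(self, optimized_feat, lr_target_frame):
        # Size expansion and convolution produce the reconstructed residual signal.
        up_feat = F.interpolate(optimized_feat, scale_factor=self.scale,
                                mode="bilinear", align_corners=False)
        residual = self.to_residual(up_feat)
        # The target frame enlarged by bilinear interpolation ...
        upscaled = F.interpolate(lr_target_frame, scale_factor=self.scale,
                                 mode="bilinear", align_corners=False)
        # ... plus the residual gives the high-resolution reconstructed frame.
        return upscaled + residual
```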
  • a second aspect of the embodiments of the present application provides a decoding device, where the decoding device has a function of implementing the method of the first aspect or any possible implementation manner of the first aspect.
  • This function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions, such as a receiving unit and a processing unit.
  • A third aspect of the embodiments of the present application provides a computer device. The computer device includes at least one processor, a storage system, an input/output (I/O) interface, and computer-executable instructions that are stored in the storage system and can run on the processor; when the computer-executable instructions are executed by the processor, the processor executes the method according to the first aspect or any possible implementation manner of the first aspect.
  • A fourth aspect of the embodiments of the present application provides a computer-readable storage medium that stores one or more computer-executable instructions; when the computer-executable instructions are executed by a processor, the processor executes the method according to the first aspect or any possible implementation manner of the first aspect.
  • A fifth aspect of the embodiments of the present application provides a computer program product that stores one or more computer-executable instructions; when the computer-executable instructions are executed by a processor, the processor executes the method according to the first aspect or any possible implementation manner of the first aspect.
  • A sixth aspect of the embodiments of the present application provides a chip system. The chip system includes at least one processor, and the at least one processor is configured to support a decoding device in implementing the functions of the first aspect or any possible implementation manner of the first aspect.
  • the chip system may further include a memory for storing necessary program instructions and data of the decoding device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • The decoding device generates high-resolution reconstructed frames by performing neural network processing on the motion vector information, the target feature pyramid and the adjacent feature pyramids obtained from the encoded code stream; the encoded code stream already contains a certain amount of motion information, and the computational cost of extracting the motion information from the code stream is negligible, so the time for video super-resolution can be greatly reduced.
  • FIG. 1 is an application scenario of an embodiment of the present application;
  • FIG. 2 is a schematic block diagram of a video encoding and decoding system in an embodiment of the present application;
  • FIG. 3 is a flowchart of an embodiment of an image processing method in an embodiment of the present application;
  • FIG. 4 is a schematic diagram of extracting local feature blocks in an embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of a decoding device in an embodiment of the present application;
  • FIG. 6 is an image processing flowchart of the feature extraction module in an embodiment of the present application;
  • FIG. 7 is an image processing flowchart of the flexible alignment module in an embodiment of the present application;
  • FIG. 8 is an image processing flowchart of the multi-frame feature fusion module in an embodiment of the present application;
  • FIG. 9 is an image processing flowchart of the feature super-resolution reconstruction module in an embodiment of the present application;
  • FIG. 11 is a schematic structural diagram of a decoding device according to an embodiment of the present application;
  • FIG. 12 is another schematic structural diagram of a decoding device in an embodiment of the present application.
  • Embodiments of the present application provide an image processing method and a decoding device, which are used to reduce the time for video super-resolution.
  • The corresponding apparatus may include one or more units, such as functional units, to perform the described one or more method steps (for example, one unit performing the one or more steps, or multiple units each performing one or more of the steps), even if such unit or units are not explicitly described or illustrated in the figures. Correspondingly, the corresponding method may contain one step to perform the functionality of the one or more units (for example, one step performing the functionality of the one or more units, or multiple steps each performing the functionality of one or more of the units), even if such one or more steps are not explicitly described or illustrated in the figures.
  • Video coding generally refers to the processing of sequences of images that form a video or video sequence.
  • the terms "picture”, “frame” or “image” may be used as synonyms.
  • Video encoding is performed on the source side and typically involves processing (eg, by compressing) the original video image to reduce the amount of data required to represent the video image for more efficient storage and/or transmission.
  • Video decoding is performed on the destination side and typically involves inverse processing relative to the encoder to reconstruct the video image.
  • the "encoding" of video images referred to in the embodiments should be understood to refer to “encoding” or "decoding” of video sequences.
  • The combination of the encoding part and the decoding part is also referred to as codec (encoding and decoding).
  • This embodiment can be applied to the application scenario shown in FIG. 1.
  • As shown in FIG. 1, the terminal 11, the server 12, the set-top box 13 and the television 14 are connected through a wireless or wired network. The terminal 11 can use locally installed application software (APP) to remotely control the television 14; for example, the user can select a video source for television playback by operating on the operation interface of the terminal 11. The terminal 11 has the video source encoded by the server 12 and forwarded to the set-top box 13, the set-top box 13 decodes the encoded video source and sends it to the television 14, and the television 14 can then play back the decoded video source.
  • FIG. 2 exemplarily shows a schematic block diagram of a video encoding and decoding system to which the embodiments of the present application are applied.
  • A video encoding and decoding system may include an encoding apparatus 21, which generates encoded video data, and a decoding apparatus 22.
  • Decoding apparatus 22 may decode the encoded video data generated by encoding apparatus 21 .
  • Various implementations of encoding apparatus 21, decoding apparatus 22, or both may include one or more processors and a memory coupled to the one or more processors.
  • The memory may include, but is not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures accessible by a computer, as described herein.
  • The encoding device 21 and the decoding device 22 may comprise various devices, including desktop computers, mobile computing devices, notebook (for example, laptop) computers, tablet computers, set-top boxes, telephone handsets such as so-called "smart" phones, televisions, cameras, display devices, digital media players, video game consoles, in-vehicle computers, wireless communication devices, or the like.
  • Although FIG. 2 shows the encoding device 21 and the decoding device 22 as separate devices, a device embodiment may also include both devices or the functionality of both, that is, the encoding device 21 or the corresponding functionality and the decoding device 22 or the corresponding functionality.
  • encoding device 21 or corresponding functionality and decoding device 22 or corresponding functionality may be implemented using the same hardware and/or software, or using separate hardware and/or software, or any combination thereof.
  • the encoding device 21 and the decoding device 22 may be communicatively connected through a link 23 , and the decoding device 22 may receive encoded video data from the encoding device 21 via the link 23 .
  • Link 23 may include one or more media or devices capable of moving encoded video data from encoding apparatus 21 to decoding apparatus 22.
  • link 23 may include one or more communication media that enable encoding device 21 to transmit encoded video data directly to decoding device 22 in real-time.
  • encoding apparatus 21 may modulate the encoded video data according to a communication standard, such as a wireless communication protocol, and may transmit the modulated video data to decoding apparatus 22 .
  • the one or more communication media may include wireless and/or wired communication media, such as radio frequency spectrum or one or more physical transmission lines.
  • the one or more communication media may form part of a packet-based network, such as a local area network, a wide area network, or a global network (eg, the Internet).
  • the one or more communication media may include routers, switches, base stations, or other devices that facilitate communication from encoding device 21 to decoding device 22 .
  • the encoding device 21 includes an encoder 211 , and optionally, the encoding device 21 may further include an image preprocessor 212 and a first communication interface 213 .
  • the encoder 211 , the image preprocessor 212 and the first communication interface 213 may be hardware components in the encoding device 21 , or may be software programs in the encoding device 21 .
  • The image preprocessor 212 is configured to receive the original image data 214 transmitted by an external terminal and perform preprocessing on the original image data 214 to obtain preprocessed image data 215. The preprocessing performed by the image preprocessor 212 may include trimming, color format conversion (for example, from a three-primary-color (RGB) format to a luma-chroma (YUV, where Y denotes luma and UV denotes chroma) format), toning or denoising.
  • the image can be regarded as a two-dimensional array or matrix of picture elements.
  • the pixels in the array can also be called sampling points.
  • the number of sampling points in the horizontal and vertical directions (or axes) of an array or image defines the size and/or resolution of the image.
  • three color components are usually employed, i.e. an image can be represented as or contain three arrays of samples.
  • an image includes corresponding arrays of red, green and blue samples.
  • Each pixel is usually represented in a luma/chroma format or color space; for example, an image in YUV format includes a luma component indicated by Y (sometimes also indicated by L) and two chroma components indicated by U and V. The luminance (luma) component Y represents the luminance or grey-level intensity (for example, as in a grey-scale image), while the two chrominance (chroma) components U and V represent the chrominance or color information components. Accordingly, an image in YUV format includes a luma sample array of luma sample values (Y) and two chroma sample arrays of chroma values (U and V). Images in RGB format can be converted or transformed to YUV format and vice versa; this process is also known as color transformation or conversion. If an image is black and white, it may include only a luma sample array.
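  • As a concrete illustration of such a color conversion, the sketch below converts an RGB image to YUV using BT.601 full-range coefficients; the choice of BT.601 and the [0, 1] value range are assumptions, since the text does not name a specific standard.

```python
import numpy as np

def rgb_to_yuv_bt601(rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) RGB image with values in [0, 1] to full-range YUV (BT.601)."""
    m = np.array([[ 0.299,     0.587,     0.114   ],   # Y
                  [-0.14713,  -0.28886,   0.436   ],   # U
                  [ 0.615,    -0.51499,  -0.10001 ]])  # V
    return rgb @ m.T

# Example: a single mid-grey pixel maps to Y = 0.5, U = 0, V = 0.
pixel = np.array([[[0.5, 0.5, 0.5]]])
print(rgb_to_yuv_bt601(pixel))
```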
  • The encoder 211 (or video encoder 211) is configured to receive the preprocessed image data 215 and to process the preprocessed image data 215 using a relevant prediction mode (such as the prediction modes in the various embodiments herein), thereby providing encoded image data 216.
  • The first communication interface 213 can be used to receive the encoded image data 216 and to transmit the encoded image data 216 via the link 23 to the decoding device 22 or to any other device (for example, a memory) for storage or direct reconstruction, where the other device may be any device used for decoding or storage. The first communication interface 213 may be used, for example, to encapsulate the encoded image data 216 into a suitable format, such as data packets, for transmission over the link 23.
  • the decoding device 22 includes a decoder 221 , and optionally, the decoding device 22 may further include a second communication interface 222 and an image post-processor 223 . They are described as follows:
  • a second communication interface 222 may be used to receive encoded image data 216 from the encoding device 21 or any other source, such as a storage device, such as an encoded image data storage device.
  • the second communication interface 222 may be used to transmit or receive the encoded image data 216 via the link 23 between the encoding device 21 and the decoding device 22, such as a direct wired or wireless connection, or via any kind of network, Networks of any kind are, for example, wired or wireless networks or any combination thereof, or private and public networks of any kind, or any combination thereof.
  • the second communication interface 222 may be used, for example, to decapsulate the data packets transmitted by the first communication interface 213 to obtain the encoded image data 216 .
  • Both the second communication interface 222 and the first communication interface 213 may be configured as one-way or two-way communication interfaces, and may be used, for example, to send and receive messages to establish a connection, and to acknowledge and exchange any other information related to the communication link and/or the data transmission, such as the transmission of encoded image data.
  • Decoder 221 receives encoded image data 216 and provides decoded image data 224 or decoded image 224 .
  • the decoder 221 may be configured to execute various embodiments described later, so as to realize the application of the image processing method described in this application on the decoding side.
  • The post-processing performed by the image post-processor 223 may include color format conversion (for example, from YUV format to RGB format), toning, trimming, resampling, or any other processing; the post-processed image data is then transmitted to an external display device for playback.
  • the display device may be or include any type of display for presenting the reconstructed image, eg, an integrated or external display or monitor.
  • Displays may include liquid crystal displays (LCDs), organic light-emitting diode (OLED) displays, plasma displays, projectors, micro-LED displays, liquid crystal on silicon (LCoS) displays, digital light processors (DLPs), or any other kind of display.
  • Although FIG. 2 depicts the encoding device 21 and the decoding device 22 as separate devices, device embodiments may also include the functionality of both the encoding device 21 and the decoding device 22 at the same time, that is, the encoding device 21 or the corresponding functionality and the decoding device 22 or the corresponding functionality.
  • encoding device 21 or corresponding functionality and decoding device 22 or corresponding functionality may be implemented using the same hardware and/or software, or using separate hardware and/or software, or any combination thereof.
  • The encoding device 21 and the decoding device 22 may include any of a variety of devices, including any kind of handheld or stationary device, such as notebook or laptop computers, mobile phones, smart phones, tablet computers, video cameras, desktop computers, set-top boxes, televisions, cameras, in-vehicle devices, display devices, digital media players, video game consoles, video streaming devices (such as content-serving servers or content-distribution servers), broadcast receiver devices, broadcast transmitter devices and the like, and may use no operating system or any kind of operating system.
  • Both the encoder 211 and the decoder 221 may be implemented as any of a variety of suitable circuits, for example one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, or any combination thereof.
  • An apparatus may store instructions for the software in a suitable non-transitory computer-readable storage medium and may execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Any of the foregoing (including hardware, software, a combination of hardware and software, etc.) may be considered one or more processors.
  • The video encoding and decoding system shown in FIG. 2 is merely an example, and the techniques of this application may be applicable to video coding settings (for example, video encoding or video decoding) that do not necessarily involve any data communication between the encoding and decoding devices.
  • data may be retrieved from local storage, streamed over a network, and the like.
  • a video encoding device may encode and store data to memory, and/or a video decoding device may retrieve and decode data from memory.
  • encoding and decoding is performed by devices that do not communicate with each other but only encode data to and/or retrieve data from memory and decode data.
  • the original video image can be reconstructed, ie the reconstructed video image has the same quality as the original video image (eg no transmission loss or other data loss during storage or transmission).
  • The super-resolution algorithm used in the process of reconstructing the video image at the decoder side requires motion estimation, and motion estimation consumes a large amount of computing resources.
  • The embodiment of the present application therefore provides a corresponding image processing method. The method includes: the decoding device acquires the motion vector information of the target frame, the image information of the target frame and the image information of the adjacent frames, where the target frame and the adjacent frames are images of the first resolution, the target frame is an image that needs super-resolution processing, and the adjacent frames include images located within a preset period before or after the target frame; the decoding device generates a reconstructed frame according to the motion vector information, the image information of the target frame and the image information of the adjacent frames, where the reconstructed frame is an image of the second resolution, the second resolution is greater than the first resolution, and the motion vector information is used to indicate adaptive local sampling between the image information of the adjacent frames and the image information of the target frame. In this way, the present application uses the motion vector information of the target frame in the encoded code stream to improve the resolution of the reconstructed frame, saving the resources that would otherwise be needed to re-estimate the motion vector information.
  • The implementation manner of the embodiment of the present application may also be: the decoding device, based on the motion vector information, performs adaptive local sampling at each position in the image information of the adjacent frames that corresponds to the image information of the target frame; the decoding device then generates a reconstructed frame according to the image information of the target frame and the adaptively locally sampled image information of the adjacent frames. In this way, image points with high similarity can be selected for sampling through adaptive local sampling, which reduces the influence of noise in the motion vector information and improves the robustness of super-resolution.
  • an embodiment of the image processing method of the present application includes:
  • the decoding device acquires the image information of the target frame and the image information of the adjacent frames in the encoded code stream.
  • After receiving the encoded code stream sent by the server, the decoder can decode the encoded video to obtain the image information of the target frame and the image information of the adjacent frames.
  • the target frame is an image that needs to be subjected to super-resolution processing in this embodiment.
  • Super-resolution means improving the resolution of the original image by means of hardware or software, that is, obtaining a high-resolution image through a series of low-resolution images.
  • The process of obtaining the high-resolution image is super-resolution reconstruction.
  • When the images within the preset period before and after the target frame are also decoded, the decoding device can obtain the image information of the adjacent frames. The period T can be preset or changed according to actual needs. For boundary conditions of the sequence, such as the first frame or the last frame, the input can meet the needs of the network by repeating the existing adjacent frames.
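  • One simple way to realise this boundary handling is to clamp frame indices so that missing neighbours are replaced by repeating existing frames; the helper below is an illustrative sketch, not code from the patent.

```python
def neighbour_indices(target_idx: int, num_frames: int, t: int = 2) -> list:
    """Indices of the 2T adjacent frames around target_idx, with boundary
    frames repeated (clamped) at the start and end of the sequence."""
    return [min(max(target_idx + offset, 0), num_frames - 1)
            for offset in range(-t, t + 1) if offset != 0]

# Example: for the first frame of a 10-frame sequence and T = 2,
# the missing left neighbours are filled by repeating frame 0.
print(neighbour_indices(0, 10))   # [0, 0, 1, 2]
```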
  • the coded code stream can be a code stream generated by an image compression coding technique including predictive coding based on motion estimation and compensation, the motion estimation and motion compensation algorithms are used to remove redundant information in the temporal domain, that is, the image compression coding technique can apply motion An estimation algorithm determines motion vector information.
  • the target frame and adjacent frames in this embodiment are images of the first resolution, and the resolution indicated by the first resolution specifically refers to the low resolution to which the image decoded by the encoded code stream belongs.
  • the decoding device extracts target features from the image information of the target frame, and extracts adjacent features from the image information of adjacent frames.
  • The decoding device includes multiple feature extraction functions that share weights. After receiving the image information of the target frame and the image information of the adjacent frames sent by the decoder, the decoding device can use the feature extraction functions to extract image features from the image information of the target frame and the image information of each adjacent frame, so as to generate a target feature corresponding to the image information of the target frame and an adjacent feature corresponding to the image information of each adjacent frame.
  • the above-mentioned image feature may be image texture information of the image.
  • the above feature extraction function consists of a convolutional layer and several cascaded residual blocks.
  • the decoding device constructs a target feature pyramid from the target feature, and constructs an adjacent feature pyramid from the adjacent features.
  • The decoding device continuously reduces the image size through filtering and bilinear-interpolation downsampling to obtain target features of different scales, and then arranges the target features of different scales by scale to generate the target feature pyramid. The bottom level of the target feature pyramid corresponds to the original target feature; downsampling it once forms a two-level target feature pyramid, and so on to form a multi-level target feature pyramid.
  • the pyramid structure may be a Gaussian pyramid, a Laplacian pyramid, or a wavelet pyramid, etc., which is not specifically limited here.
  • Similarly, the decoding device continuously reduces the image size corresponding to the adjacent feature of the image information of each adjacent frame through filtering and bilinear-interpolation downsampling to obtain adjacent features of different scales, and then arranges the corresponding adjacent features of different scales by scale to generate the adjacent feature pyramids. Exemplarily, based on the 2T adjacent frames before and after the target frame, the embodiment of the present application may correspondingly generate 2T adjacent feature pyramids.
  • Feature pyramid is a basic component in multi-scale object detection system.
  • The feature pyramid of an image is a series of feature sets arranged in a pyramid shape; it is obtained by downsampling an original feature step by step, so the size is reduced layer by layer. The feature pyramid has a certain scale invariance, and this characteristic enables the decoding device of this embodiment to handle image content over a large range of scales.
  • the decoding device determines the position mapping relationship between the target feature and the adjacent feature according to the motion vector information.
  • Specifically, the decoding device can directly determine the motion vector information from the encoded code stream and determine the position mapping relationship, indicated by the motion vector information, between the target feature and the adjacent features. The position mapping relationship applies to the above-mentioned target feature pyramid and adjacent feature pyramids: for each coordinate in the target feature of each scale in the target feature pyramid, there is a corresponding coordinate in the adjacent feature of the same scale in each adjacent feature pyramid.
  • the decoding device searches for the second local feature block of the adjacent feature according to the coordinates of the first local feature block of the target feature and the position mapping relationship.
  • For each scale in each adjacent feature pyramid, the same flexible alignment operation will be performed. The specific operation is as follows: as in the schematic diagram of local feature block extraction shown in FIG. 4, under the guidance of the motion vector information, that is, the position mapping relationship 41, for each coordinate of the target feature 42 in the target feature pyramid there is a corresponding coordinate in the adjacent feature 223 of any adjacent feature pyramid. The decoding device respectively extracts the first local feature block 421 corresponding to the target feature and the second local feature block 2231 corresponding to the adjacent feature at the two corresponding coordinates.
  • the decoding device performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of relevant attention coefficients.
  • Specifically, the decoding device rearranges the first local feature block of the target feature and the second local feature block of the adjacent feature into two one-dimensional feature vectors, combines the two one-dimensional feature vectors through a concatenation operation, and feeds them into a two-layer fully connected layer to generate an attention vector whose length equals the number of pixel values contained in a local feature block. The decoding device then rearranges the attention vector to obtain the set of relevant attention coefficients between the two local feature blocks.
  • the above-mentioned set of relevant attention coefficients includes a plurality of relevant attention coefficients, wherein each relevant attention coefficient can indicate the similarity between a feature point in the first local feature block and a corresponding feature point in the second local feature block;
  • the decoding device performs a weighted average of a plurality of feature points in the second local feature block based on the relevant attention coefficient set to determine the adjacent feature pyramid after adaptive local sampling.
  • the position mapping relationship provided by the above motion vector information is not necessarily completely real object motion, and may contain coding noise.
  • The relevant attention coefficients allow the network to refine the search for better matching features in the local neighborhood at the mapped location. Specifically, the multiple relevant attention coefficients and the local feature block of the adjacent feature can be multiplied point by point and then summed, so as to realize the sampling of the second local feature block.
  • the relevant attention coefficient is the importance weight of each feature in the local feature block indicating the adjacent features, which is used to improve the feature quality of the adjacent features after adaptive local sampling.
  • When the decoding device performs one adaptive local sampling, it only performs adaptive local sampling on one local feature block of one target feature and one local feature block of one adjacent feature. The local feature blocks at other coordinates in the other adjacent feature pyramids among the 2T adjacent feature pyramids, together with the corresponding local feature blocks of the target feature, also need to be adaptively locally sampled through the same steps, so that adaptive local sampling is performed on all adjacent feature pyramids.
  • the decoding device calculates an attention map based on the target feature and the adaptively locally sampled adjacent features.
  • After the decoding device performs adaptive local sampling on the adjacent feature pyramids, it can generate an attention map in the time domain according to the adaptive local sampling quality of the adjacent-frame features. Specifically, the decoding device can calculate, at each coordinate point, the feature inner product of the adaptively locally sampled adjacent-frame feature and the target feature; this inner product represents the similarity between the adaptively locally sampled adjacent feature and the target feature at that point, and the similarity also reflects, to a certain extent, the adaptive local sampling quality at that point.
  • the decoding device can obtain an attention map with the same size as the feature size through the above adaptive local sampling quality.
  • the decoding device performs feature enhancement processing on the adjacent features after adaptive local sampling and the attention map.
  • the decoding device can dynamically fuse the features of the adjacent frame and the current frame after adaptive local sampling by increasing the weight of the high-quality local area and reducing the weight of the low-quality local area.
  • Specifically, the decoding device multiplies the adjacent features in the adaptively locally sampled adjacent feature pyramids point by point with the above attention map, so that regions of the adjacent features that are more similar to the target feature are adaptively allocated higher attention, that is, a higher weight for their possible contribution to the super-resolution result, enhancing the required features and suppressing possible interference such as mismatches.
  • the decoding device performs stacking and convolution calculations on all adjacent features and target features after feature enhancement processing to generate fusion features, and determines a fusion feature pyramid.
  • The decoding device can stack the feature-enhanced adjacent features with the target features, that is, superimpose the channels of the feature-enhanced adjacent features on the target features, and then obtain the fused features through a convolutional layer; in this way, the fused feature pyramid can be obtained.
  • After the decoding device performs feature fusion once, it needs to detect whether there are adjacent features that have not undergone feature enhancement processing. If so, it performs the above feature enhancement processing on those adjacent features and the target feature, until all adjacent features have been enhanced and fused with the target feature and a complete fused feature pyramid is determined.
  • the decoding device generates an optimized feature pyramid by calculating the fusion feature pyramid through the second cascaded residual block.
  • Specifically, the decoding device uses cascaded residual blocks with scale fusion to reconstruct the features of the fused feature pyramid. In the second cascaded residual blocks, additional upsampling or downsampling operations are added at the ends of the skip connections, so that the feature reconstruction residuals of different scales can fully exchange information, enhancing the quality of the reconstructed features and yielding optimized fused features.
  • the optimized feature pyramid can be determined.
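  • One possible form of such a scale-fusion residual block is sketched below for two pyramid levels: each level's residual branch also sees a resized copy of the other level before the skip connection; the exact wiring and layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusionResBlock(nn.Module):
    """Residual block that lets two pyramid levels exchange information
    (illustrative sketch of a scale-fusion residual block)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv_hi = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.conv_lo = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, feat_hi, feat_lo):
        # feat_hi: fine-scale feature, feat_lo: coarse-scale feature (smaller).
        # Resize each level to the other's size so the residual branches
        # can exchange information across scales.
        lo_up = F.interpolate(feat_lo, size=feat_hi.shape[-2:],
                              mode="bilinear", align_corners=False)
        hi_down = F.interpolate(feat_hi, size=feat_lo.shape[-2:],
                                mode="bilinear", align_corners=False)
        res_hi = self.conv_hi(torch.cat([feat_hi, lo_up], dim=1))
        res_lo = self.conv_lo(torch.cat([feat_lo, hi_down], dim=1))
        # Skip connections add the cross-scale residuals back to the inputs.
        return feat_hi + res_hi, feat_lo + res_lo
```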
  • the decoding device performs size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal.
  • the decoding device adds the reconstructed residual signal and the image upscaling result to obtain a reconstructed frame.
  • the decoding device may expand the size of the image information of the target frame by performing an up-sampling operation of bilinear interpolation on the image information of the target frame.
  • the decoding device may add the reconstructed residual signal and the image information of the up-sampled target frame to obtain the image information of the reconstructed frame, and determine the reconstructed frame.
  • the reconstructed frame is the target frame after super-resolution.
  • The decoding device generates high-resolution reconstructed frames by performing neural network processing on the motion vector information, the target feature pyramid and the adjacent feature pyramids obtained from the encoded code stream; the encoded code stream already contains a certain amount of motion information, and the computational cost of extracting it from the code stream is negligible, all of which can greatly reduce the time of video super-resolution.
  • the super-resolution process in the image processing in the embodiment of the present application may be implemented by a pre-trained network model.
  • FIG. 5 is a schematic diagram of the architecture of the decoding device in the embodiment of the present application.
  • The decoding device 22 may include a decoder 221, a network model 222, a graphics processing unit (GPU) memory 223 and an output buffer 224. They are described as follows:
  • The decoder 221 is a device that performs restoration and decoding operations on the encoded code stream. The decoder 221 may be a video decoder that supports encoding and decoding standards such as H.264, high efficiency video coding (HEVC) or versatile video coding (VVC), for example an HEVC decoding device.
  • the decoder 221 in this embodiment of the present application adds a motion vector information output interface.
  • the network model 222 in the product implementation form of the embodiment of the present application is a program code included in the machine learning and deep learning platform software and deployed on the decoding device.
  • the program codes of the embodiments of the present application exist outside the existing decoder 221 .
  • The network model 222 may be generated by a machine learning method through supervised training on data pairs of low-resolution decoded video and the corresponding unencoded high-resolution video.
  • In this embodiment of the present application, the network model 222 is designed with a feature extraction module 2221, a flexible alignment module 2222, a multi-frame feature fusion module 2223 and a feature super-resolution reconstruction module 2224.
  • the network model 222 first contains a feature extraction module 2221, which aims to transform the input decoded frames from the pixel domain to the feature domain, since image features have important physical meanings in deep learning methods.
  • Flexible alignment module 2222 this module receives the motion vector in the extracted code stream from the decoder 221, and uses this as a guide to design a multi-scale local attention mechanism to achieve flexible alignment of adjacent frames at the feature level .
  • Multi-frame feature fusion module 2223 this module receives the adjacent frame features and the current frame features after alignment, and uses the attention mechanism in the time domain to complete the feature fusion operation.
  • Feature super-resolution reconstruction module 2224: this module receives the fused image features, and uses cascaded multi-scale fusion residual blocks and sub-pixel convolution to complete the super-resolution reconstruction of the decoded video, generating a reconstructed frame.
  • the GPU memory 223 supports the execution of the program code that performs the computation of each module in the network model 222.
  • the output buffer 224 receives and saves the reconstructed frames output by the network model 222 .
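  • A minimal sketch of how the four modules of the network model 222 might be wired together at inference time is given below (PyTorch-style; the class name SRNetwork and its interfaces are illustrative assumptions, and the internals of each module are placeholders supplied by the caller):

      import torch.nn as nn

      class SRNetwork(nn.Module):
          def __init__(self, feat_extract, align, fuse, reconstruct):
              super().__init__()
              self.feat_extract = feat_extract  # feature extraction module 2221
              self.align = align                # flexible alignment module 2222
              self.fuse = fuse                  # multi-frame feature fusion module 2223
              self.reconstruct = reconstruct    # feature super-resolution reconstruction module 2224

          def forward(self, target_frame, adjacent_frames, motion_vectors):
              tgt_pyr = self.feat_extract(target_frame)
              adj_pyrs = [self.feat_extract(f) for f in adjacent_frames]
              aligned = [self.align(tgt_pyr, p, motion_vectors) for p in adj_pyrs]
              fused = self.fuse(tgt_pyr, aligned)
              return self.reconstruct(fused, target_frame)  # high-resolution reconstructed frame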
  • the embodiments of the present application are implemented on the open-source Python-first deep learning framework (PyTorch) machine learning platform, and run on a decoding unit equipped with an NVIDIA GPU graphics card, to implement the program code for super-resolution of video decoded according to the HEVC standard.
  • the NVIDIA GPU card provides computing acceleration capabilities through the compute unified device architecture (CUDA) programming interface.
  • the inference process of the network model can be accelerated on the distributed PyTorch machine learning platform, and the trained model can directly perform end-to-end reconstruction from the decoded video containing compression noise.
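  • A minimal sketch of invoking the trained model on the GPU is given below (the checkpoint file name, the tensor shapes and the number of adjacent frames are hypothetical):

      import torch

      # hypothetical inputs supplied by the decoder for one target frame
      target_frame = torch.rand(1, 3, 90, 160)                          # decoded low-resolution target frame
      adjacent_frames = [torch.rand(1, 3, 90, 160) for _ in range(4)]   # 2T adjacent frames
      motion_vectors = torch.rand(1, 2, 90, 160)                        # motion vectors extracted from the code stream

      model = torch.load('vsr_model.pth', map_location='cuda')          # hypothetical trained model checkpoint
      model.eval()
      with torch.no_grad():
          sr_frame = model(target_frame.cuda(),
                           [f.cuda() for f in adjacent_frames],
                           motion_vectors.cuda())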
  • the feature extraction module converts the target frame and adjacent frames output by the decoder from the pixel domain to the feature domain, as follows:
  • the feature extraction module obtains image information of the target frame and image information of adjacent frames from the decoder.
  • after receiving the encoded code stream sent by the source device, the decoder can decode the encoded video to obtain the image information of the target frame and the image information of the adjacent frames, and the feature extraction module can receive the image information of the target frame and the image information of the adjacent frames transmitted by the decoder.
  • the target frame is an image that needs to be subjected to super-resolution processing in this embodiment.
  • Super-resolution means improving the resolution of the original image by means of hardware or software, that is, obtaining a high-resolution image from a series of low-resolution images; this process of obtaining the high-resolution image is super-resolution reconstruction.
  • When the images within the preset period before and after the target frame are also decoded, for example when the 2T adjacent frames before and after the target frame are also decoded, the decoder outputs the image information of the adjacent frames.
  • the period T can be preset or changed according to actual needs. For boundary conditions of the sequence, such as the first frame or the last frame, the input can be made to meet the needs of the network by repeating the existing adjacent frames.
  • the encoded code stream can be a code stream generated by an image compression coding technique that includes predictive coding based on motion estimation and motion compensation; the motion estimation and motion compensation algorithms are used to remove redundant information in the temporal domain, that is, the image compression coding technique can apply a motion estimation algorithm to determine the motion vector information.
  • the target frame and the adjacent frames in this embodiment are images of the first resolution, where the first resolution specifically refers to the low resolution of the images decoded from the encoded code stream.
  • the feature extraction module extracts target features from image information of the target frame, and extracts adjacent features from image information of adjacent frames.
  • the feature extraction module constructs a target feature pyramid from the target feature, and constructs an adjacent feature pyramid from the adjacent features.
  • Steps 602 to 603 are similar to steps 302 to 303 of the image processing method shown in FIG. 3 , and details are not repeated here.
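  • A minimal sketch of the feature extraction path (convolution, cascaded residual blocks, then bilinear interpolation to build the feature pyramid) is given below; the channel count, the number of residual blocks and the number of pyramid levels are illustrative assumptions:

      import torch.nn as nn
      import torch.nn.functional as F

      class ResidualBlock(nn.Module):
          def __init__(self, channels):
              super().__init__()
              self.body = nn.Sequential(
                  nn.Conv2d(channels, channels, 3, padding=1),
                  nn.ReLU(inplace=True),
                  nn.Conv2d(channels, channels, 3, padding=1),
              )

          def forward(self, x):
              return x + self.body(x)

      class FeatureExtraction(nn.Module):
          def __init__(self, in_channels=3, channels=64, num_blocks=5, levels=3):
              super().__init__()
              self.head = nn.Conv2d(in_channels, channels, 3, padding=1)
              self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])
              self.levels = levels

          def forward(self, frame):
              feat = self.blocks(self.head(frame))  # full-resolution features
              pyramid = [feat]
              for _ in range(self.levels - 1):      # coarser levels obtained by bilinear interpolation
                  feat = F.interpolate(feat, scale_factor=0.5,
                                       mode='bilinear', align_corners=False)
                  pyramid.append(feat)
              return pyramid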
  • the flexible alignment module realizes the flexible alignment of adjacent frames according to the target feature pyramid and the adjacent feature pyramids output by the feature extraction module, generating the adaptively locally sampled adjacent feature pyramids, as follows:
  • the flexible alignment module receives the motion vector information from the decoder, and the target feature pyramid and neighboring feature pyramids from the feature extraction module.
  • the encoded code stream may be a code stream generated by an image compression coding technique including predictive coding based on motion estimation and compensation.
  • the motion estimation and motion compensation algorithms are used to remove redundant information in the temporal domain, that is, the image compression coding technique may apply a motion estimation algorithm to determine the motion vector information.
  • the decoder can extract the motion vector information from the encoded code stream and send it to the flexible alignment module.
  • the flexible alignment module can also receive the target feature pyramid and adjacent feature pyramid described in Figure 3 from the feature extraction module.
  • the flexible alignment module determines the position mapping relationship between the target feature and the adjacent feature according to the motion vector information.
  • the flexible alignment module searches for the second local feature block of the adjacent feature according to the coordinates of the first local feature block of the target feature and the position mapping relationship.
  • the flexible alignment module performs feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of relevant attention coefficients.
  • the flexible alignment module performs a weighted average on the second local feature block based on the set of relevant attention coefficients to determine the adjacent feature pyramid after adaptive local sampling.
  • Steps 702-705 are similar to steps 304-307 in the image processing method shown in FIG. 3 , and details are not repeated here.
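  • A minimal single-scale sketch of this motion-vector-guided local attention is given below (PyTorch-style; the window size, the fully connected scoring layer and the tensor layouts are illustrative assumptions rather than the exact formulation of the embodiments):

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class FlexibleAlignment(nn.Module):
          # motion-vector-guided local attention at one pyramid level
          def __init__(self, channels=64, window=3):
              super().__init__()
              self.window = window
              self.fc = nn.Linear(2 * channels, 1)  # scores a (target, candidate) feature pair

          def forward(self, target_feat, adj_feat, motion_vectors):
              # target_feat, adj_feat: (N, C, H, W); motion_vectors: (N, 2, H, W) in pixels
              n, c, h, w = target_feat.shape
              # 1) map adjacent features to the target positions using the motion vectors
              ys, xs = torch.meshgrid(torch.arange(h, device=adj_feat.device),
                                      torch.arange(w, device=adj_feat.device), indexing='ij')
              grid_x = (xs + motion_vectors[:, 0]) / (w - 1) * 2 - 1
              grid_y = (ys + motion_vectors[:, 1]) / (h - 1) * 2 - 1
              warped = F.grid_sample(adj_feat, torch.stack((grid_x, grid_y), dim=-1),
                                     align_corners=True)
              # 2) gather the second local feature block around every mapped position
              k = self.window
              cand = F.unfold(warped, k, padding=k // 2).view(n, c, k * k, h * w)
              # 3) feature matching through the fully connected layer -> attention coefficients
              tgt = target_feat.view(n, c, 1, h * w).expand(-1, -1, k * k, -1)
              pair = torch.cat((tgt, cand), dim=1).permute(0, 3, 2, 1)   # (N, H*W, k*k, 2C)
              attn = torch.softmax(self.fc(pair).squeeze(-1), dim=-1)    # (N, H*W, k*k)
              # 4) weighted average of the candidates = adaptively locally sampled adjacent feature
              out = (cand.permute(0, 3, 1, 2) * attn.unsqueeze(2)).sum(-1)
              return out.permute(0, 2, 1).view(n, c, h, w)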
  • Referring to the image processing flowchart of the multi-frame feature fusion module shown in FIG. 8, the multi-frame feature fusion module processes each adaptively locally sampled adjacent feature pyramid output by the flexible alignment module, as follows:
  • the multi-frame feature fusion module receives the adaptively locally sampled adjacent feature pyramids from the flexible alignment module.
  • After the flexible alignment module performs adaptive local sampling on the adjacent feature pyramids, the adaptively locally sampled adjacent feature pyramids are sent to the multi-frame feature fusion module.
  • the multi-frame feature fusion module calculates an attention map according to the target feature and the adjacent features after adaptive local sampling.
  • the multi-frame feature fusion module performs feature enhancement processing on the adjacent features after adaptive local sampling and the attention map.
  • the multi-frame feature fusion module performs stacking and convolution calculations on adjacent features and target features after all feature enhancement processing to generate fusion features, and determines a fusion feature pyramid.
  • Steps 802-804 are similar to steps 308-310 in the image processing method shown in FIG. 3 , and details are not repeated here.
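  • A minimal single-scale sketch of this temporal-attention fusion is given below (the embedding convolutions, the sigmoid similarity and the number of adjacent frames are illustrative assumptions):

      import torch
      import torch.nn as nn

      class MultiFrameFusion(nn.Module):
          def __init__(self, channels=64, num_adjacent=4):
              super().__init__()
              self.embed_tgt = nn.Conv2d(channels, channels, 3, padding=1)
              self.embed_adj = nn.Conv2d(channels, channels, 3, padding=1)
              self.fuse = nn.Conv2d(channels * (num_adjacent + 1), channels, 1)

          def forward(self, target_feat, aligned_adj_feats):
              tgt_emb = self.embed_tgt(target_feat)
              enhanced = []
              for adj in aligned_adj_feats:
                  # attention map: per-pixel similarity between target and aligned adjacent features
                  attn = torch.sigmoid((self.embed_adj(adj) * tgt_emb).sum(dim=1, keepdim=True))
                  enhanced.append(adj * attn)  # feature enhancement
              stacked = torch.cat(enhanced + [target_feat], dim=1)  # stacking
              return self.fuse(stacked)  # convolution producing the fusion feature at this level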
  • the feature super-resolution reconstruction module receives the fused feature pyramid from the multi-frame feature fusion module.
  • the feature super-resolution reconstruction module generates an optimized feature pyramid by calculating the fusion feature pyramid through the second cascaded residual block.
  • the feature super-resolution reconstruction module performs size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal.
  • the feature super-resolution reconstruction module adds the reconstructed residual signal and the image enlargement result to obtain a reconstructed frame.
  • Steps 902-904 are similar to steps 311-313 in the image processing method shown in FIG. 3 , and details are not repeated here.
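  • A minimal sketch of the feature super-resolution reconstruction path is given below (PyTorch-style, reusing the ResidualBlock class from the feature extraction sketch above; the block count, the scale factor and the channel counts are illustrative assumptions):

      import torch.nn as nn
      import torch.nn.functional as F

      class FeatureSRReconstruction(nn.Module):
          def __init__(self, channels=64, out_channels=3, num_blocks=5, scale=4):
              super().__init__()
              self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])
              self.to_residual = nn.Sequential(
                  nn.Conv2d(channels, channels * scale * scale, 3, padding=1),
                  nn.PixelShuffle(scale),  # size expansion by sub-pixel convolution
                  nn.Conv2d(channels, out_channels, 3, padding=1),
              )
              self.scale = scale

          def forward(self, fused_feat, target_lr):
              residual = self.to_residual(self.blocks(fused_feat))  # reconstructed residual signal
              upscaled = F.interpolate(target_lr, scale_factor=self.scale,
                                       mode='bilinear', align_corners=False)
              return upscaled + residual  # reconstructed frame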
  • FIG. 10 is a comparison diagram between the super-resolution results of the embodiments of the present application and the super-resolution results of the prior art; the ordinate is the peak signal-to-noise ratio (PSNR) in decibels (dB), and the abscissa is the test time per frame in milliseconds (ms). The compared prior-art methods include the variable-filter-size residual-learning convolutional neural network (VRCNN), the efficient sub-pixel convolutional network for video (VESPCN), the multi-frame quality enhancement method (MFQE) and the deep convolutional neural network-based auto decoder (DCAD).
  • The variable-filter-size residual-learning convolutional neural network (VRCNN) is a lightweight compression-noise removal network consisting of 4 convolutional layers.
  • The multi-frame quality enhancement method (MFQE) is an end-to-end network for denoising compressed video that uses motion estimation and motion compensation together with the idea that "good frames compensate bad frames".
  • The deep convolutional neural network-based auto decoder (DCAD) is a compression-noise removal network consisting of 10 convolutional layers.
  • The efficient sub-pixel convolutional network for video (VESPCN) is a video super-resolution method that uses motion estimation and motion compensation to exploit temporal correlations by aligning adjacent frames.
  • The video super-resolution method based on optical-flow super-resolution likewise uses motion estimation and motion compensation to exploit temporal correlations by aligning adjacent frames, and it chooses to predict a more accurate high-resolution optical flow.
  • The progressive fusion video super-resolution network utilizing non-local spatiotemporal correlations is an end-to-end video super-resolution method that computes non-local attention and uses a progressive fusion module.
  • FIG. 11 is a schematic diagram of an embodiment of a decoding device 110 in an embodiment of the present application.
  • an embodiment of the present application provides a decoding device, and the decoding device includes:
  • the obtaining unit 1101 is used to obtain the motion vector information of the target frame in the encoded code stream, the image information of the target frame and the image information of the adjacent frame, where the target frame and the adjacent frame are images of the first resolution , the target frame is an image that needs to be subjected to super-resolution processing, and the adjacent frame includes an image within a preset period before or after the target frame;
  • the generating unit 1102 is configured to generate a reconstructed frame according to the motion vector information, the image information of the target frame and the image information of the adjacent frame, where the reconstructed frame is an image of the second resolution, the second resolution is greater than the first resolution, and the motion vector information is used to guide adaptive local sampling between the image information of the adjacent frame and the image information of the target frame.
  • the decoding device generates high-resolution reconstructed frames by performing neural network processing on the motion vector information, the target feature pyramid and the adjacent feature pyramids obtained from the encoded code stream. Because the encoded code stream already carries a certain amount of motion information, and the computational cost of extracting the motion information from the code stream is negligible, the time required for video super-resolution can be greatly reduced.
  • the generating unit 1102 is specifically configured to generate a target feature pyramid according to the image information of the target frame, and generate an adjacent feature pyramid according to the image information of each adjacent frame, where the target feature pyramid includes target features of multiple scales and each adjacent feature pyramid includes adjacent features of multiple scales; perform, based on the motion vector information and with the position of each target feature as the reference, adaptive local sampling on the adjacent features corresponding to each target feature in each adjacent feature pyramid; fuse the target feature pyramid and each adaptively locally sampled adjacent feature pyramid to generate a fusion feature pyramid, which includes fusion features of multiple scales; and process the fusion feature pyramid to generate the reconstructed frame.
  • the generating unit 1102 is further configured to, for each adjacent feature pyramid, find the second local feature block in the adjacent feature according to the coordinates of the first local feature block in the target feature and the position mapping relationship, contained in the motion vector information, between the first local feature block and the second local feature block; perform feature matching on the first local feature block and the second local feature block through a fully connected layer to determine a set of relevant attention coefficients; and perform a weighted average on the second local feature block based on the set of relevant attention coefficients to determine the adaptively locally sampled adjacent feature pyramid.
  • the generating unit 1102 is further configured to, for each adjacent feature pyramid, calculate an attention map according to the target features and the adaptively locally sampled adjacent features, where the attention map is used to represent the similarity between the adaptively locally sampled adjacent features and the target features; perform feature enhancement processing on the adaptively locally sampled adjacent features with the attention map; and stack and convolve all the feature-enhanced adjacent features with the target features to generate fusion features and determine the fusion feature pyramid.
  • the generating unit 1102 is further configured to perform convolution processing on the image information of the target frame followed by processing by the first cascaded residual block to generate target features of multiple scales; the target features of multiple scales generate the target feature pyramid by bilinear interpolation.
  • the generating unit 1102 is further configured to perform convolution processing on the image information of each adjacent frame followed by processing by the first cascaded residual block to generate adjacent features of multiple scales; the adjacent features of multiple scales generate an adjacent feature pyramid by bilinear interpolation.
  • the generating unit 1102 is further configured to calculate the fusion feature pyramid through the second cascaded residual block to generate an optimized feature pyramid; perform size expansion and convolution on the optimized feature pyramid to generate a reconstructed residual signal; and add the reconstructed residual signal to the image enlargement result to obtain the reconstructed frame, where the image enlargement result is generated by bilinear interpolation of the image information of the target frame.
  • the decoding device described above can be understood by referring to the corresponding content in the foregoing method embodiment section, and details are not repeated here.
  • the decoding device 1200 may include one or more central processing units (CPUs) 1201 and a memory 1205, and the memory 1205 stores one or more applications or data.
  • the memory 1205 may be volatile storage or persistent storage.
  • the program stored in the memory 1205 may include one or more modules, and each module may include a series of instruction operations in the service control unit.
  • the central processing unit 1201 may be arranged to communicate with the memory 1205 to execute a series of instruction operations in the memory 1205 on the decoding device 1200.
  • the decoding device 1200 may also include one or more power supplies 1202, one or more wired or wireless network interfaces 1203, one or more input and output interfaces 1204, and/or, one or more operating systems, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the decoding device 1200 can perform the operations performed by the decoding device in the foregoing embodiments shown in FIG. 3 to FIG. 9 , and details are not repeated here.
  • a computer-readable storage medium is also provided, where computer-executable instructions are stored in the computer-readable storage medium.
  • when the processor of a device executes the computer-executable instructions, the device performs the steps of the image processing method performed by the processor in the foregoing FIG. 3 to FIG. 9.
  • a computer program product includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium; when a processor of a device executes the computer-executable instructions, the device performs the steps of the image processing method performed by the processor in the foregoing FIG. 3 to FIG. 9.
  • a chip system is further provided, the chip system includes at least one processor, and the processor is configured to support a decoding device to implement the steps of the image processing method performed by the processor in the foregoing FIG. 3 to FIG. 9 .
  • the chip system may further include a memory for storing necessary program instructions and data of the decoding device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of units is only a logical function division.
  • in actual implementation there may be other division methods; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling, direct coupling or communication connection may be implemented through some interfaces, as indirect coupling or communication connection between devices or units, and may be in electrical, mechanical or other forms.
  • Units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Abstract

Disclosed are an image processing method and apparatus, which are used to reduce the time of video super-resolution. The method in the embodiments of the present application comprises: a decoding device processes, in combination with motion vector information in an encoded code stream, image information of a target frame that is to be subjected to super-resolution and image information of a frame adjacent to the target frame, so as to generate a reconstructed frame with a high resolution.
PCT/CN2021/120193 2020-09-30 2021-09-24 Procédé et appareil de traitement d'images WO2022068682A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011063576.4A CN114339260A (zh) 2020-09-30 2020-09-30 图像处理方法及装置
CN202011063576.4 2020-09-30

Publications (1)

Publication Number Publication Date
WO2022068682A1 true WO2022068682A1 (fr) 2022-04-07

Family

ID=80949600

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120193 WO2022068682A1 (fr) 2020-09-30 2021-09-24 Procédé et appareil de traitement d'images

Country Status (2)

Country Link
CN (1) CN114339260A (fr)
WO (1) WO2022068682A1 (fr)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115623242A (zh) * 2022-08-30 2023-01-17 华为技术有限公司 一种视频处理方法及其相关设备
CN115861131B (zh) * 2023-02-03 2023-05-26 北京百度网讯科技有限公司 基于图像生成视频、模型的训练方法、装置及电子设备


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100290529A1 (en) * 2009-04-14 2010-11-18 Pankaj Topiwala Real-time superresolution and video transmission
CN102236889A (zh) * 2010-05-18 2011-11-09 王洪剑 一种基于多帧运动估计和融合的超分辨率重构方法
CN106851046A (zh) * 2016-12-28 2017-06-13 中国科学院自动化研究所 视频动态超分辨率处理方法及系统
CN111047516A (zh) * 2020-03-12 2020-04-21 腾讯科技(深圳)有限公司 图像处理方法、装置、计算机设备和存储介质

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11734837B2 (en) * 2020-09-30 2023-08-22 Shanghai United Imaging Intelligence Co., Ltd. Systems and methods for motion estimation
CN115190251A (zh) * 2022-07-07 2022-10-14 北京拙河科技有限公司 基于亿像阵列式摄像机的机场地面安全分析方法及装置
CN115190251B (zh) * 2022-07-07 2023-09-22 北京拙河科技有限公司 基于亿像阵列式摄像机的机场地面安全分析方法及装置
CN115567719A (zh) * 2022-08-23 2023-01-03 天津市国瑞数码安全系统股份有限公司 一种多层次卷积的视频压缩方法和系统
CN117714691A (zh) * 2024-02-05 2024-03-15 佳木斯大学 一种ar增强现实钢琴教学用自适应传输系统
CN117714691B (zh) * 2024-02-05 2024-04-12 佳木斯大学 一种ar增强现实钢琴教学用自适应传输系统

Also Published As

Publication number Publication date
CN114339260A (zh) 2022-04-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21874343

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21874343

Country of ref document: EP

Kind code of ref document: A1