WO2021042957A1 - Image processing method and apparatus - Google Patents

Image processing method and apparatus (一种图像处理方法和装置)

Info

Publication number
WO2021042957A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
image
residual
interest
alignment
Prior art date
Application number
PCT/CN2020/108829
Other languages
English (en)
French (fr)
Inventor
孙龙
何柏岐
陈濛
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to EP20860488.4A (EP4020370A4)
Publication of WO2021042957A1
Priority to US17/687,298 (US20220188976A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/85 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N 19/124 Quantisation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding, the unit being an image region, e.g. an object
    • H04N 19/176 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding, the unit being an image region, e.g. an object, the region being a block, e.g. a macroblock
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N 19/51 Motion estimation or motion compensation
    • H04N 19/513 Processing of motion vectors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Definitions

  • This application relates to the field of image processing technology, and in particular to an image processing method and apparatus used to generate clear images during image super-division (super-resolution).
  • Super-division technology is an important computer vision and image processing technique that restores a low-resolution picture to a high-resolution picture. It is widely used in fields such as medical imaging, security monitoring, and TV entertainment.
  • The difficulty of super-division is that low-resolution pictures have lost the high-frequency part of the information, while additional degradation is introduced by the camera's photosensitive element, compression coding losses, and packet loss on the transmission channel.
  • SISR sends only a single frame of image to the neural network for super-division processing.
  • SSR sends consecutive multi-frame images from the video stream into the neural network for super-division processing.
  • RSR uses previously super-divided images, or pre-placed image templates with special textures, as inputs to the neural network together with the input image/frame sequence for super-division processing (the sketch below contrasts the three input forms).
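  • The practical difference between the three approaches lies in what is fed to the network. The sketch below merely illustrates the input forms with hypothetical array shapes; the frame size, the number of neighbouring frames, and the 4x scale factor are assumptions, not values from this application.

```python
# Illustrative only: contrasting the inputs of the three super-division paradigms.
import numpy as np

H, W, C = 360, 640, 3                            # assumed low-resolution frame size

sisr_input = np.zeros((1, H, W, C))              # SISR: a single LR frame
ssr_input = np.zeros((5, H, W, C))               # SSR: several consecutive LR frames

prev_sr = np.zeros((1, 4 * H, 4 * W, C))         # RSR: a previously super-divided result
rsr_input = (ssr_input, prev_sr)                 #      (or a pre-placed texture template)
```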
  • The embodiments of the present application disclose an image processing method, which is used to solve the problem of low resolution in the image generated during super-division due to the lack of high-frequency information.
  • the embodiments of the present application provide an image processing method, which can be applied to a receiving end device, such as a destination device or a decoder, etc. Specifically, the method includes:
  • Obtain a video code stream, which includes a first frame image, a second frame image, and a third frame image that are adjacent in time series; decode the video code stream to obtain a first aligned frame, a second aligned frame, and at least one residual between the first frame image, the second frame image, and the third frame image; generate at least one residual frame according to the at least one residual; and perform super-division processing on the second frame image according to the at least one residual frame, the first aligned frame, and the second aligned frame to obtain a super-divided second frame image.
  • The first aligned frame is generated after the first frame image is moved toward the second frame image by pixel-block movement according to a first motion vector.
  • The second aligned frame is generated after the third frame image is moved toward the second frame image by pixel-block movement according to a second motion vector.
  • The residual is the difference between each macroblock of the previous frame image, after motion compensation according to the motion vector, and the corresponding macroblock of the subsequent frame image.
  • The method provided in this aspect uses the motion vector information and the residual information from the video encoding and decoding process to perform video frame alignment and residual accumulation, and uses the aligned frames and the residual frames as the input of the neural network. Because the accumulated residual frames yield high-frequency information after passing through the neural network, and this high-frequency information is pasted back onto the brightness channel, edge details can be enhanced, which makes up for the low image resolution caused by the missing high-frequency information of low-resolution video frames during super-division.
  • This method improves video image quality without increasing hardware cost.
  • the generating at least one residual frame according to the at least one residual includes: generating a first residual frame according to the first residual;
  • The first residual and the first aligned frame satisfy a relationship (a reconstruction is given below), where i represents a macroblock of the first frame image, t1 represents the generation time of the first frame image, and t2 represents the generation time of the second frame image.
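  • The relationship itself appears only as a formula image in the original publication. A plausible reconstruction based on the surrounding definitions (the residual is the per-macroblock difference between the second frame image and the motion-compensated first frame image) is given below; the notation and the sign convention are assumptions.

```latex
% Assumed reconstruction of the relationship between the first residual and the
% first aligned frame, for macroblock i:
%   F^{t2}_i        : macroblock i of the second frame image (generated at t2)
%   A^{t1 \to t2}_i : macroblock i of the first aligned frame (first frame image,
%                     generated at t1, motion-compensated toward the second frame)
\Delta^{t1,t2}_i = F^{t2}_i - A^{t1 \to t2}_i
```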
  • Performing super-division processing on the second frame image according to the at least one residual frame, the first aligned frame, and the second aligned frame to obtain the super-divided second frame image includes: inputting the at least one residual frame to a neural network for feature extraction to obtain at least one first feature map; inputting the first aligned frame, the second aligned frame, and the second frame image to the neural network for feature extraction to obtain at least one second feature map; inputting the at least one first feature map to a first super-division network for processing to generate high-frequency information; inputting the at least one second feature map to a second super-division network for processing to generate a brightness channel; and fusing the high-frequency information with the brightness channel to generate the super-divided second frame image.
  • At least one residual frame and at least one alignment frame are input to the same neural network for super-division processing, thereby improving the parameter utilization rate, reducing the amount of model parameters, and improving the super-division processing efficiency.
  • Before inputting the at least one residual frame to the neural network for feature extraction, the method further includes: determining a region of interest in the first residual frame, where a macroblock of the region of interest is a macroblock whose sum of all contained pixel values exceeds a preset value; and using the region-of-interest macroblocks of the first residual frame to determine the regions of interest of the remaining residual frames, that is, of the at least one residual frame other than the first residual frame.
  • Inputting the at least one residual frame to the neural network for feature extraction includes: inputting the macroblocks of all regions of interest of the at least one residual frame to the neural network for feature extraction, where the at least one residual frame includes the first residual frame and the remaining residual frames.
  • Each frame of image is divided into a region of interest and a region of non-interest; that is, the texture details of the accumulated residual of each frame and the motion characteristics of the preceding and following frames are analyzed, and only the macroblocks of the region of interest are super-divided.
  • For the region of non-interest, the super-division result of the previous frame can be used directly, which avoids super-division processing of the entire image, reduces the amount of computation, power consumption, delay, and memory overhead, and increases the super-division efficiency for a single frame, so that super-divided images can be obtained in real time within a short period.
  • Inputting the first aligned frame, the second aligned frame, and the second frame image to the neural network for feature extraction includes: inputting the macroblocks of the region of interest in the first aligned frame and the second aligned frame, together with the second frame image, to the neural network for feature extraction.
  • the region of interest in the first alignment frame and the second alignment frame is the same as the region of interest in the first residual frame.
  • the first frame image, the second frame image, and the third frame image are three frames of images in the first image group;
  • the first frame of image is the last frame of image in the first image group
  • the second frame of image and the third frame of image are the first two frames of image in the second image group
  • the first frame image and the second frame image are the last two frames of the first image group, and the third frame image is the first frame of the second image group.
  • the embodiments of the present application also provide an image processing device, which includes a unit for executing the steps in the first aspect and the implementation manners of the first aspect, for example, includes an acquisition unit, a processing unit, and a sending unit.
  • the device is a communication device or chip.
  • the device may be integrated in an encoder, a decoder, or a video decoding system.
  • the device may also be a source device or a destination device.
  • The specific form of the image processing device is not limited in this embodiment.
  • an embodiment of the present application also provides an electronic device or electronic device, such as a video encoding device.
  • the device includes a processor, a transceiver, and a memory.
  • The memory is coupled to the processor and is used to store the computer program instructions necessary for the video encoding device.
  • When the processor calls the computer program instructions stored in the memory, the device can be caused to execute the method of the foregoing first aspect and its various implementations.
  • the embodiments of the present application also provide a computer-readable storage medium.
  • the storage medium stores instructions.
  • When the instructions are run on a computer or a processor, they are used to execute the methods of the foregoing first aspect and its various implementations.
  • the embodiments of the present application also provide a computer program product.
  • The computer program product includes computer instructions. When the instructions are executed by a computer or a processor, the methods of the foregoing first aspect and its various implementations can be implemented.
  • An embodiment of the present application also provides a chip system. The chip system includes a processor and an interface circuit, the interface circuit is coupled with the processor, and the processor is used to execute computer programs or instructions to implement the methods of the foregoing first aspect and its various implementations; the interface circuit is used to communicate with modules other than the chip system.
  • The image processing method provided by the embodiments of this application uses the motion vector information and the residual information from the video encoding and decoding process to perform video frame alignment and residual accumulation, and uses the aligned frames and the cumulative residual frames as the input of the neural network module. Because the accumulated residual frames yield high-frequency information after passing through the neural network model, and this high-frequency information is pasted back onto the brightness channel, edge details can be enhanced, which makes up for the low image resolution caused by the missing high-frequency information of low-resolution video frames during super-division. This method improves video image quality without increasing hardware cost.
  • FIG. 1 is a schematic diagram of the classification of a super-division algorithm provided by this application;
  • FIG. 2A is a schematic block diagram of a video encoding and decoding system provided by an embodiment of this application;
  • FIG. 2B is a schematic diagram of an exemplary structure of a video decoding system provided by an embodiment of this application;
  • FIG. 3A is a schematic flowchart of a video encoding and decoding method provided by an embodiment of this application;
  • FIG. 3B is a schematic flowchart of another video encoding and decoding method provided by an embodiment of the application.
  • FIG. 4A is a schematic diagram of an exemplary structure of an encoder 20 provided by an embodiment of the application.
  • FIG. 4B is a schematic diagram of an exemplary structure of a decoder 30 provided by an embodiment of the application.
  • FIG. 5 is a flowchart of an image processing method provided by an embodiment of the application.
  • FIG. 6 is a flowchart of processing and generating a super-division image by using a neural network according to an embodiment of the application;
  • FIG. 7A is a flowchart of yet another image processing method provided by an embodiment of this application.
  • FIG. 7B is a schematic diagram of a super-division effect combined to generate a super-division image according to an embodiment of the application;
  • FIG. 8 is a schematic structural diagram of an image processing device provided by an embodiment of the application.
  • FIG. 9 is a schematic structural diagram of a video decoder device provided by an embodiment of the application.
  • the technical solution of this embodiment is applied to the technical field of image processing, mainly for super-division processing for images of a series of continuous frames in a video.
  • the video can be understood as several frames of images (also can be described as images in the art) played in a certain order and frame rate.
  • The process of processing the video stream includes video encoding and video decoding.
  • video encoding is a process of performing an encoding operation on each frame of image in the video to obtain the encoding information of each frame of image.
  • Video encoding is performed on the source side.
  • Video decoding is the process of reconstructing each frame of image according to the encoding information of each frame of image.
  • Video decoding is performed on the destination side.
  • the combination of video encoding operations and video decoding operations can be referred to as video encoding and decoding (encoding and decoding).
  • the existing video coding and decoding operations are performed according to the video coding and decoding standard (for example, the high efficiency video coding and decoding H.265 standard), and comply with the high efficiency video coding standard (HEVC) test model.
  • The video codec may also perform operations according to other proprietary or industry standards, such as ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, and ITU-T H.264 (or ISO/IEC MPEG-4 AVC), and the standards may also include scalable video coding and multi-view video coding extensions. It should be understood that the technology of this application is not limited to any specific codec standard or technology.
  • the coding and decoding operation is based on a coding unit (CU).
  • the image is divided into multiple CUs, and then the pixel data in these CUs are encoded to obtain the encoding information of each CU.
  • the image is divided into multiple CUs, and then each CU is reconstructed according to the coding information corresponding to each CU, and the reconstruction block of each CU is obtained.
  • the image can also be divided into a grid of coded tree blocks.
  • the coding tree block is also called "tree block", "largest coding unit” (largest coding unit, LCU) or "coding tree unit”.
  • the coding tree block may be further divided into multiple CUs.
  • FIG. 2A exemplarily shows a schematic block diagram of the video codec system 10 applied in this application.
  • the system 10 includes a source device 12 and a destination device 14.
  • the source device 12 generates encoded video data. Therefore, the source device 12 is also referred to as a video encoding device.
  • the destination device 14 decodes the encoded video data generated by the source device 12, and therefore, the destination device 14 is also referred to as a video decoding device.
  • the source device 12 and the destination device 14 include one or more processors, and a memory coupled to the one or more processors.
  • The memory includes, but is not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures accessible by a computer.
  • The source device 12 and the destination device 14 include various devices, such as desktop computers, mobile computing devices, notebook (for example, laptop) computers, tablet computers, set-top boxes, telephone handsets such as so-called "smart" phones, televisions, cameras, display devices, digital media players, video game consoles, on-board computers, wireless communication equipment, artificial intelligence equipment, virtual reality/mixed reality/augmented reality equipment, autonomous driving systems, or other devices.
  • the embodiment of the application does not limit the structure and specific form of the above-mentioned device.
  • the source device 12 and the destination device 14 are connected through a link 13, and the destination device 14 receives encoded video data from the source device 12 via the link 13.
  • the link 13 includes one or more media or devices.
  • link 13 includes one or more communication media that enable source device 12 to transmit encoded video data directly to destination device 14 in real time.
  • the source device 12 modulates the video data according to a communication standard (for example, a wireless communication protocol), and transmits the modulated video data to the destination device 14.
  • the one or more communication media include wireless or wired communication media, such as a radio frequency (RF) spectrum or at least one physical transmission line.
  • the one or more communication media may form a part of a packet-based network, and the packet network may be a local area network, a wide area network, or a global network (for example, the Internet).
  • the one or more communication media include routers, switches, base stations, or other devices that facilitate communication from source device 12 to destination device 14.
  • the source device 12 includes an image source 16, an image preprocessor 18, an encoder 20 and a communication interface 22.
  • the encoder 20, the image source 16, the image preprocessor 18, and the communication interface 22 may be hardware components in the source device 12, or may be software programs in the source device 12.
  • The image source 16 may include any type of image capturing device for capturing real-world images, and/or any type of device for generating images or comments.
  • Here, comments refer to some text on the screen that is encoded as part of the screen content.
  • the image capture device is used to acquire and/or provide real world images and computer animation images, such as screen content, virtual reality (VR) images, augmented reality (AR) images, and so on.
  • the image source 16 may be a camera for capturing images or a memory for storing images, and the image source 16 may also include any type (internal or external) interface for storing previously captured or generated images and/or acquiring or receiving images.
  • the image source 16 When the image source 16 is a camera, the image source 16 may be a local or an integrated camera integrated in the source device; when the image source 16 is a memory, the image source 16 may be a local or an integrated memory integrated in the source device.
  • the interface When the image source 16 includes an interface, the interface may be an external interface for receiving images from an external video source.
  • The external video source is, for example, an external image capture device such as a camera, an external memory, or an external image generation device such as an external computer graphics processor, a computer, or a server.
  • the interface can be any type of interface according to any proprietary or standardized interface protocol, such as a wired or wireless interface, and an optical interface.
  • the image stored in the image source 16 can be regarded as a two-dimensional array or matrix of picture elements.
  • the pixel points in the array can also be called sampling points.
  • the number of sampling points of the array or image in the horizontal and vertical directions (or axis) defines the size and/or resolution of the image.
  • three color components are usually used, that is, an image can be represented with three sample arrays.
  • the image includes corresponding red (R), green (G), and blue (B) sample arrays.
  • each pixel is usually expressed in a luminance/chrominance format or color space.
  • An image in the YUV format includes a luminance component indicated by Y (sometimes L is used instead) and two chrominance components indicated by U and V.
  • The luma component Y represents brightness or gray-level intensity (for example, in a grayscale image the two are the same), while the two chroma components U and V represent chrominance or color information.
  • an image in the YUV format includes a luminance sample array of luminance sample values (Y), and two chrominance sample arrays of chrominance values (U and V).
  • An image in RGB format can be converted to YUV format, and vice versa; this process is also called color conversion (a conversion sketch is given below). If the image is black and white, the image may only include an array of luminance samples.
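  • The following is a minimal sketch of such a color conversion. The patent does not specify a conversion matrix; the BT.601 coefficients below are an assumption used purely for illustration.

```python
import numpy as np

def rgb_to_yuv(rgb: np.ndarray) -> np.ndarray:
    """rgb: float array of shape (H, W, 3) with values in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b           # luminance
    u = -0.14713 * r - 0.28886 * g + 0.436 * b      # blue-difference chrominance
    v = 0.615 * r - 0.51499 * g - 0.10001 * b       # red-difference chrominance
    return np.stack([y, u, v], axis=-1)

# A black-and-white image would keep only the Y (luminance) array.
```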
  • the image transmitted from the image source 16 to the image preprocessor 18 may also be referred to as original image data 17.
  • the image preprocessor 18 is configured to receive the original image data 17 and perform preprocessing on the original image data 17 to obtain the preprocessed image 19 or the preprocessed image data 19.
  • the pre-processing performed by the image pre-processor 18 may include trimming, color format conversion (for example, conversion from RGB format to YUV format), toning, or denoising.
  • the encoder 20, or video encoder 20 is used to receive preprocessed image data 19, and process the preprocessed image data 19 in a prediction mode, thereby providing encoded image data 21 (or called a video code stream).
  • the encoder 20 may be used to execute the embodiments of the various video encoding methods described below to implement the image generation method described in this application.
  • the communication interface 22 can be used to receive the encoded image data 21 and transmit the encoded image data 21 to the destination device 14 through the link 13.
  • the communication interface 22 can be used to encapsulate the encoded image data 21 into a suitable format, such as a data packet, for transmission on the link 13.
  • the destination device 14 includes a communication interface 28, a decoder 30, an image post processor 32, and a display device 34.
  • the following describes each component or device included in the destination device 14 one by one, as follows:
  • the communication interface 28 is used to receive the encoded image data 21 from the source device 12. In addition, the communication interface 28 is also used to receive the encoded image data 21 through the link 13 between the source device 12 and the destination device 14.
  • The link 13 may be a direct wired or wireless connection, or any type of network, for example a wired or wireless network or any combination thereof, or any type of private network or public network or any combination thereof.
  • the communication interface 28 can also be used to decapsulate the data packet transmitted by the communication interface 22 to obtain the encoded image data 21.
  • Both the communication interface 28 and the communication interface 22 can be a one-way communication interface or a two-way communication interface, and can be used for sending and receiving messages and/or for establishing a communication link and transmitting image data, such as encoded image data, over that link.
  • the decoder 30 (or video decoder 30) is used to receive the encoded image data 21 and provide the decoded image data 31 or the decoded image 31.
  • the decoder 30 may be used to execute the embodiments of the various video decoding methods described below to implement the image generation method described in this application.
  • the image post-processor 32 is configured to perform post-processing on the decoded image data 31 to obtain post-processed image data 33.
  • the post-processing performed by the image post-processor 32 may include: color format conversion (for example, conversion from YUV format to RGB format), toning, trimming or resampling, or any other processing, and can also be used to convert post-processed image data 33 Transmitted to the display device 34.
  • the display device 34 is used for receiving the post-processed image data 33 so as to display the image to the user or the viewer.
  • The display device 34 includes any type of display for presenting reconstructed images, for example, an integrated or external display or monitor. Further, the display may include a liquid crystal display (LCD), an organic light emitting diode (OLED) display, a plasma display, a projector, a micro LED display, a liquid crystal on silicon (LCoS) display, a digital light processor (DLP), or any other type of display.
  • The source device 12 and the destination device 14 shown in FIG. 2A may be separate devices or integrated in the same device, that is, the integrated device includes the functionality of both the source device 12 and the destination device 14.
  • The same hardware and/or software, or separate hardware and/or software, or any combination thereof, may be used to implement the source device 12 or the corresponding functionality and the destination device 14 or the corresponding functionality.
  • The source device 12 and the destination device 14 may include any of a variety of devices, including any type of handheld or stationary device, such as a notebook or laptop computer, mobile phone, smart phone, tablet computer, video camera, desktop computer, set-top box, television, camera, in-vehicle device, display device, digital media player, video game console, video streaming device (such as a content service server or content distribution server), broadcast receiver device, or broadcast transmitter device. The embodiment of the present application does not limit the specific structure and implementation form of the source device 12 and the destination device 14.
  • Both the encoder 20 and the decoder 30 can be implemented as any of various suitable circuits, for example, one or more microprocessors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), discrete logic, hardware, or any combination thereof. If the technology is partially implemented in software, the device can store the instructions of the software in a suitable computer-readable storage medium and can use one or more processors to execute the computer program instructions to perform the image generation method described in this application.
  • The technical solutions of the embodiments of the present application can also be applied to video coding settings (for example, video encoding or video decoding) that do not necessarily include any data communication between the encoding device and the decoding device.
  • the data can be retrieved from local storage, streamed on the network, etc.
  • the video encoding device can encode data and store the data in the memory, and/or the video decoding device can retrieve the data from the memory and decode the data.
  • FIG. 2B is a schematic structural diagram of a video decoding system 40 including an encoder 20 and a decoder 30 according to an exemplary embodiment.
  • the video decoding system 40 can implement various method steps of the embodiments of the present application.
  • The video coding system 40 may include an imaging device 41, an encoder 20, a decoder 30 (and/or a video encoder/decoder implemented by the processing unit 46), an antenna 42, a processor 43, a memory 44, and a display device 45.
  • the imaging device 41, the antenna 42, the processing unit 46, the encoder 20, the decoder 30, the processor 43, the memory 44, and the display device 45 can communicate with each other.
  • the processing unit 46 may only include the encoder 20 or only the decoder 30.
  • the antenna 42 is used to transmit or receive a video code stream, or an encoded bit stream of video data.
  • the display device 45 can also be used to present video data.
  • the processing unit 46 may include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, and the like.
  • the video decoding system 40 may also include a processor 43, and the processor 43 may similarly include ASIC logic, a graphics processor, a general-purpose processor, and the like.
  • the processing unit 46 may be implemented by hardware, such as dedicated video encoding hardware.
  • the processor 43 can be implemented by general-purpose software, an operating system, and the like.
  • The memory 44 may be any type of memory, such as volatile memory (for example, static random access memory (SRAM) or dynamic random access memory (DRAM)) or non-volatile memory (for example, flash memory).
  • the processing unit 46 can access the memory 44, for example, to implement an image buffer.
  • the processing unit 46 may also include a memory, for example, a cache and the like.
  • the encoder 20 implemented by a logic circuit includes an image buffer and a graphics processing unit.
  • The image buffer can be implemented by the processing unit 46 or the memory 44, and the graphics processing unit can be implemented by the processing unit 46.
  • one possibility is to couple the graphics processing unit to the image buffer.
  • the graphics processing unit includes the encoder 20 implemented by the processing unit 46.
  • the decoder 30 can be implemented by the processing unit 46 in a similar manner.
  • the decoder 30 includes an image buffer and a graphics processing unit.
  • the graphics processing unit may be coupled to the image buffer.
  • the graphics processing unit includes a decoder 30 implemented by the processing unit 46.
  • the antenna 42 is used to receive a video code stream or an encoded bit stream of video data.
  • The coded bitstream includes data related to the coded video frames, such as indicators, index values, mode selection data, and data related to coding partitioning (for example, transform coefficients or quantized transform coefficients, optional indicators, and/or data defining the coding partitioning).
  • the video coding system 40 may also include a decoder 30 coupled to the antenna 42 and used to decode the encoded bitstream.
  • the display device 45 is used to display image frames.
  • the decoder 30 may be used to perform functions opposite to those of the encoder 20.
  • the decoder 30 can be used to receive and decode related video data. It should be noted that the decoding method described in this application is mainly used for the decoding process, and this process exists in both the encoder 20 and the decoder 30.
  • FIG. 3A is a schematic flowchart of a video encoding and decoding method provided by an embodiment of this application, which can be applied to the system shown in FIGS. 2A and 2B. Specifically, the method can be summarized in the following five steps, which are: input video 110, video encoding 120, video stream transmission 130, video decoding 140, and output video 150.
  • In the step "input video 110", the acquisition device, such as a camera, collects lossless video or images.
  • In the step "video encoding 120", the acquired video or images are compression-encoded by an H.264 or H.265 encoder to generate an encoded video stream; then, in the step "video stream transmission 130", the video stream is uploaded to the cloud server, and the user downloads the video stream from the cloud server.
  • the step “video decoding 140” includes a process in which the terminal device decodes the video stream downloaded from the cloud through a decoder, and finally outputs and displays the decoded video image in the step "output video 150".
  • the encoding and decoding process shown in FIG. 3A also includes the steps of "cross-GOP motion vector calculation 1201" and "video quality improvement 1401".
  • the calculation of the cross-GOP motion vector refers to the calculation of the motion vector between two adjacent GOPs.
  • the step 1201 cross-GOP motion vector calculation is mainly used in the video encoding 120 process; the step 1401 video quality improvement process is mainly used after the video decoding 140 and before the video 150 is output. Specifically, the implementation process of these two steps will be described in detail in subsequent embodiments.
  • The steps of "cross-GOP motion vector calculation 1201" and "video quality improvement 1401" can be implemented through program code or through a corresponding neural network model, for example, through a newly added unit module that implements steps 1201 and 1401, or through existing processing units (including the encoder and the decoder).
  • FIG. 4A is a schematic structural diagram of the encoder 20.
  • the execution function of the encoder 20 includes two paths, one is the forward path, and the other is the reconstruction path.
  • the input frame is subjected to intra-frame (Intra) or inter-frame (Inter) coding by the encoder in units of macroblocks or sub-blocks.
  • After prediction, a residual block is generated.
  • The residual block is transformed and quantized to produce a set of quantized transform coefficients, which are then entropy-encoded and, together with some side information required for decoding (such as the prediction mode, quantization parameters, and motion vector information), form a compressed video code stream that is finally handed to the network abstraction layer (NAL) for transmission and storage.
  • In the reconstruction path, in order to provide a reference image for prediction, the encoder needs to be able to reconstruct the image. Therefore, the transform coefficients are inversely quantized, and the residual obtained after the inverse transform is added to the predicted value to obtain the unfiltered reconstructed block. Finally, the reconstructed block is filtered through a filter to obtain a reconstructed reference image, that is, a reconstructed frame.
  • FIG. 4B is a schematic structural diagram of a decoder 30.
  • The decoder 30 receives the video code stream from the NAL unit. The video code stream is entropy-decoded to obtain transform coefficients, and the residual is then obtained after inverse quantization and inverse transformation. Using the header information obtained by decoding the code stream, the decoder generates a prediction value; the prediction value and the residual are added and then filtered to finally obtain the decoded image. A rough sketch of this per-block flow follows.
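  • As an illustration of the per-block decoding flow just described (entropy decoding omitted), the sketch below dequantizes the transform coefficients, applies an inverse transform, and adds the prediction value. The uniform quantization step and the 2-D IDCT are simple stand-ins, not the actual H.264/H.265 operations.

```python
import numpy as np
from scipy.fft import idctn

def decode_block(quantized: np.ndarray, qstep: float, prediction: np.ndarray) -> np.ndarray:
    """Reconstruct one block: inverse quantization, inverse transform, add prediction."""
    coeffs = quantized * qstep                  # inverse (uniform) quantization
    residual = idctn(coeffs, norm="ortho")      # inverse transform (2-D IDCT stand-in)
    return prediction + residual                # in-loop filtering would follow in a real decoder
```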
  • video frames are divided into two encoding modes: intra (intra) and inter (inter).
  • each video frame is divided into blocks (or macroblocks), so that the frame processing is performed at the block level.
  • For intra coding, the pixels of a block are predicted from neighboring pixels, and the block is transformed from the pixel domain to another domain so that the transform coefficients are concentrated on a few points; for inter coding, the temporal redundancy between consecutive video frames is exploited, and a match for the current block is searched for in the reference frame through motion vector estimation.
  • This embodiment is mainly applied to the coding and decoding of inter-frame prediction.
  • The following describes various techniques of inter-frame prediction and inter-frame coding.
  • Inter-frame prediction and coding mainly use the time-domain correlation of the video signal, and remove the time-domain redundant information of the video signal through motion estimation and compensation, so as to achieve the purpose of compressing video data. Since the time domain correlation of the video signal is far greater than its spatial correlation, the code stream can be reduced even more by using inter-frame prediction and coding.
  • A group of pictures (GOP) mainly includes two frame modes: I frame (intra frame) and P frame (predicted frame), and motion estimation uses integer-pixel motion vectors.
  • the I frame is also called a key frame, which only performs predictive coding of intra-frame macroblocks and can retain more information.
  • a P frame is also called a predicted frame.
  • the P frame is a forward predicted frame, which can be understood as the P frame needs to use the previous frame to estimate the motion of the macroblock between frames and calculate the motion vector.
  • H.264 uses different macroblock partitioning methods when performing motion estimation. For example, a 16×16 macroblock can be divided into one 16×16, two 16×8 or two 8×16, or four 8×8 blocks. An 8×8 block can further be divided into one 8×8, two 8×4 or two 4×8, or four 4×4 blocks. The chrominance component of the macroblock adopts the same partitioning mode as the luminance block, but its size is halved in the horizontal and vertical directions.
  • Each block has a motion vector MV, each MV is coded, transmitted, and the partition selection is also coded and compressed into the bitstream.
  • With a large partition size, fewer bits are needed to signal the MV and partition type, but the energy of the motion-compensated residual is high in detailed areas; with a small partition size, the motion-compensated residual energy is low, but more bits are needed to signal the MVs and partition selection. Therefore, on the whole, a large partition size is suitable for flat areas, and a small partition size is suitable for detailed areas; a sketch of this trade-off follows.
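  • The sketch below illustrates this trade-off in a simplified, rate-distortion style form: the partitioning of a 16x16 macroblock is chosen to balance the residual energy against the bits needed to signal the motion vectors. The cost weighting and the bit counts are assumptions, not values from any standard.

```python
PARTITIONS = {                       # sub-block layouts allowed for a 16x16 macroblock
    "16x16": [(16, 16)],
    "16x8": [(16, 8)] * 2,
    "8x16": [(8, 16)] * 2,
    "8x8": [(8, 8)] * 4,
}
LAMBDA = 32.0                        # assumed weight between residual energy and MV bits

def partition_cost(residual_energy: float, n_motion_vectors: int, bits_per_mv: int = 12) -> float:
    return residual_energy + LAMBDA * n_motion_vectors * bits_per_mv

def choose_partition(residual_energy_per_partition: dict) -> str:
    """Pick the partitioning with the lowest combined cost."""
    return min(residual_energy_per_partition,
               key=lambda p: partition_cost(residual_energy_per_partition[p], len(PARTITIONS[p])))
```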
  • H.264 supports a variety of macroblock and sub-macroblock partitions; the more detail the image contains, the smaller the partitioned block size, and if the MV of each block were coded independently, a considerable number of bits would be required.
  • However, the MV of a block has a strong correlation with its neighboring blocks. Therefore, the MV can be predicted from neighboring coded partition blocks: a motion vector predictor (MVP) is derived from the neighboring coded blocks, and only the difference between the current MV and the MVP needs to be encoded.
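  • A common way to form the motion vector predictor is a component-wise median of the motion vectors of neighbouring coded blocks, so that only the difference is entropy-coded. The sketch below illustrates this; the neighbour selection is simplified relative to the full H.264 rules.

```python
import numpy as np

def motion_vector_predictor(mv_left, mv_top, mv_topright):
    """Component-wise median of the neighbouring blocks' motion vectors (each as (dy, dx))."""
    return np.median(np.array([mv_left, mv_top, mv_topright]), axis=0)

def motion_vector_difference(mv_current, mv_left, mv_top, mv_topright):
    """Only this difference (the MVD) needs to be coded into the bitstream."""
    return np.asarray(mv_current) - motion_vector_predictor(mv_left, mv_top, mv_topright)
```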
  • the technical solutions provided by the embodiments of this application can be applied to real-time video service scenarios, such as displaying videos on mobile terminals and large-screen devices, and improving video quality through joint video coding and decoding technologies, so as to achieve high resolution of video frames.
  • The technical solutions of the embodiments of the present application are based on the motion vectors of each block (or macroblock) obtained by the decoder: pixel alignment is performed directly on multiple consecutive video frames according to the motion vectors to obtain aligned frames, and the residuals between the video frames are obtained to generate residual frames; the aligned frames and the residual frames are then sent to the neural network to perform super-division processing on the image to be processed, obtaining the super-divided video frame.
  • this embodiment provides an image processing method, which can apply the decoder 30, and the method includes:
  • Step 101 Obtain a video code stream, where the video code stream includes a first frame image, a second frame image, and a third frame image that are adjacent in time series.
  • the video code stream is a code stream or bit stream output after the input video is encoded and compressed by an encoder, and the video code stream includes two or more frames of images.
  • the time-series adjacent refers to frames that are continuously shot (or generated) in time.
  • the encoder divides the video bitstream into at least one group of pictures (GOP), and each GOP includes an I frame and several subsequent P frames.
  • For example, the video bitstream includes a first GOP and a second GOP. The first GOP includes one I frame and 3 P frames, namely {I1, P1, P2, P3}; the second GOP also includes one I frame and 3 P frames, namely {I2, P4, P5, P6}. The frames in the video bitstream are then: I1, P1, P2, P3, I2, P4, P5, P6.
  • Step 102 Decode the video code stream to obtain a first aligned frame, a second aligned frame, and at least one residual error between the first frame image, the second frame image, and the third frame image.
  • the decoder receives the video stream transmitted from the encoder and decodes the video stream to obtain video information.
  • The video information includes: all the frame images that make up the video stream, the motion vectors between adjacent frames, and the residuals.
  • After parsing the video code stream, the first motion vector, the second motion vector, the first aligned frame, the second aligned frame, and at least one residual are obtained.
  • The first aligned frame is generated by moving pixel blocks from the first frame image toward the second frame image according to the first motion vector; the second aligned frame is generated by moving pixel blocks from the third frame image toward the second frame image according to the second motion vector.
  • For example, the first frame image is I1, the second frame image is P1, and the third frame image is P2, where the second frame image P1 is the target frame (the frame to be super-divided).
  • The process of generating the first aligned frame and the second aligned frame includes: the first frame image I1 and the second frame image P1 are divided in advance into a plurality of macroblocks at the same positions, for example into 3×3 = 9 macroblocks.
  • Each macroblock includes multiple pixels, and each pixel corresponds to a pixel value.
  • The relative displacement between the first macroblock of the I1 frame and its best-matching macroblock in the target frame P1 is the motion vector of the first macroblock of the I1 frame, which can be expressed as MV11.
  • A matching macroblock is searched for in the P1 frame for each macroblock, so that the 9 motion vectors of the I1 frame are obtained. These 9 motion vectors are collectively called the "first motion vector", which can be represented by a matrix MV1.
  • Similarly, the third frame image P2 is moved toward the target frame P1 by pixel-block movement according to the second motion vector MV2 to generate the second aligned frame.
  • the specific implementation process is similar to the above-mentioned process of generating the first alignment frame, and will not be repeated.
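  • For illustration, the sketch below performs this kind of block matching and pixel-block movement on single-channel frames: every macroblock of the reference frame (e.g. I1) is matched within a small search window of the target frame (e.g. P1) by minimizing the sum of absolute differences, and the matched blocks are moved to form the aligned frame. The block size, search range, and SAD criterion are assumptions; in the method of this application the motion vectors are taken directly from the decoded video code stream rather than re-estimated.

```python
import numpy as np

BLOCK = 16     # assumed macroblock size
SEARCH = 8     # assumed search range in pixels

def motion_vector(ref_block, target, top, left):
    """Full-search SAD matching of one reference macroblock inside the target frame."""
    h, w = target.shape
    best_sad, best_mv = None, (0, 0)
    for dy in range(-SEARCH, SEARCH + 1):
        for dx in range(-SEARCH, SEARCH + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= h - BLOCK and 0 <= x <= w - BLOCK:
                candidate = target[y:y + BLOCK, x:x + BLOCK].astype(np.int32)
                sad = np.abs(candidate - ref_block.astype(np.int32)).sum()
                if best_sad is None or sad < best_sad:
                    best_sad, best_mv = sad, (dy, dx)
    return best_mv

def align(ref, target):
    """Move each macroblock of `ref` along its motion vector toward `target`."""
    aligned = np.zeros_like(ref)
    for top in range(0, ref.shape[0] - BLOCK + 1, BLOCK):
        for left in range(0, ref.shape[1] - BLOCK + 1, BLOCK):
            block = ref[top:top + BLOCK, left:left + BLOCK]
            dy, dx = motion_vector(block, target, top, left)
            aligned[top + dy:top + dy + BLOCK, left + dx:left + dx + BLOCK] = block
    return aligned

# The residual is then the per-macroblock difference between the target frame
# (e.g. P1) and the aligned previous frame (e.g. I1 aligned toward P1):
# residual = target.astype(np.int32) - align(ref, target).astype(np.int32)
```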
  • the residual is the pixel difference between each macroblock of the previous frame of the image and the subsequent frame of the image after motion compensation according to the motion vector.
  • The residual (also called the first residual) is the pixel difference between each macroblock of the first aligned frame and the corresponding macroblock of the second frame image P1, where the first aligned frame is generated after the first frame image I1 is motion-compensated and aligned toward the second frame image P1 according to the first motion vector MV1.
  • Since the first aligned frame includes 3×3 = 9 macroblocks, there are 9 pixel differences with respect to the 9 macroblocks of the second frame image P1. If the first frame image I1 is generated at time t1 and the second frame image P1 is generated at time t2, each pixel difference represents the residual between a macroblock of the first aligned frame and the corresponding macroblock of the P1 frame. The first residual can be represented by a matrix of these 9 per-macroblock differences, indexed by i, where i is a positive integer and 1 ≤ i ≤ 9.
  • The method further includes obtaining a second residual, where the second residual is the per-macroblock pixel difference between the second aligned frame and the third frame image P2, and the second aligned frame is generated after the second frame image P1 is motion-compensated and aligned toward the third frame image P2 according to the second motion vector MV2. The second residual can be expressed in the same per-macroblock form as the first residual.
  • Since the first frame image I1 is the first frame of the entire video stream and has no previous frame image, the residual of the first frame image I1 is 0.
  • Therefore, obtaining at least one residual between the first frame image I1, the second frame image P1, and the third frame image P2 through step 102 includes: obtaining the first residual and the second residual.
  • Step 103 Generate at least one residual frame according to the at least one residual.
  • Specifically, step 103 includes: generating a first residual frame according to the first residual, and generating a second residual frame according to the first residual and the second residual.
  • The first residual and the first aligned frame satisfy the relationship of Formula 1 (reconstructed below), where i represents a macroblock of the first frame image, t1 represents the generation time of the first frame image I1, and t2 represents the generation time of the second frame image P1.
  • The pixel values (RGB values) of the macroblocks indicated by the first residual are restored to obtain the first residual frame.
  • Formula 2 (also reconstructed below) relates the second aligned frame and the second residual, where i represents a macroblock of the second frame image and t3 represents the generation time of the third frame image P2.
  • Generating the second residual frame according to the second residual includes: restoring the pixel values (RGB values) of the macroblocks indicated by the cumulative residual (the sum of the first residual and the second residual) to obtain the second residual frame.
  • The correspondence between the first frame image and the third frame image is obtained through the above Formula 1 and Formula 2; that is, two non-adjacent frames are related, and the third frame image P2 is expressed in terms of the key frame I1.
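  • The formula images of Formula 1 and Formula 2 are not reproduced in this text. Based on the surrounding definitions (a residual is the per-macroblock difference between a frame and the motion-compensated neighbouring frame, and the second residual frame is built from the accumulated residual), a plausible reconstruction is given below; the exact notation and sign convention are assumptions.

```latex
% Formula 1 (assumed reconstruction): first residual for macroblock i
%   F^{t2}_i        : macroblock i of the second frame image P1
%   A^{t1 \to t2}_i : macroblock i of the first aligned frame (I1 aligned toward P1 via MV1)
\Delta^{t1,t2}_i = F^{t2}_i - A^{t1 \to t2}_i

% Formula 2 (assumed reconstruction): second residual for macroblock i
%   F^{t3}_i        : macroblock i of the third frame image P2
%   A^{t2 \to t3}_i : macroblock i of the frame obtained by aligning P1 toward P2 via MV2
\Delta^{t2,t3}_i = F^{t3}_i - A^{t2 \to t3}_i

% Cumulative residual used to generate the second residual frame:
\Delta^{t1,t3}_i = \Delta^{t1,t2}_i + \Delta^{t2,t3}_i
```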
  • Step 104 Perform super-division processing on the second frame of image according to the at least one residual frame, the first aligned frame, and the second aligned frame to obtain a super-divided second frame of image.
  • step 104 includes:
  • Step 1041 Input the at least one residual frame to the neural network for feature extraction, to obtain at least one first feature image.
  • the neural network includes multiple levels with different functions.
  • the functions of each of the multiple levels include but are not limited to arithmetic operations such as convolution, pooling, and activation.
  • Feature extraction, analysis, and integration can thus be performed on the input image data, and the final output is the super-divided image.
  • Specifically, the first residual frame and the second residual frame are input to the neural network, where the first residual frame is generated from the first residual, and the second residual frame is generated from the first residual and the second residual (i.e., the cumulative residual).
  • the neural network may include a feature extraction network and a super-division network
  • the super-division network includes a first super-division network and a second super-division network.
  • step 1041 specifically includes: inputting the first residual frame and the second residual frame to the feature extraction network to obtain at least one first feature map.
  • Step 1042 Input the first aligned frame, the second aligned frame, and the second frame image to the neural network for feature extraction to obtain at least one second feature image.
  • an implementation manner is that after the first alignment frame, the second alignment frame, and the second frame image are input to the feature extraction network, at least one second feature image is obtained through feature extraction.
  • Step 1043 Input the at least one first characteristic image to the first superdivision network for processing to generate high-frequency information.
  • Step 1044 Input the at least one second feature map to the second superdivision network for processing to generate a brightness channel.
  • the first superdivision network is different from the second superdivision network.
  • the network structure, complexity, and weight parameters of the two superdivision networks are different.
  • Step 1045 Fusion the high frequency information with the brightness channel to generate the second frame image after the super-division.
  • At least one residual frame and at least one alignment frame are input to the same neural network for super-division processing, thereby improving the parameter utilization rate, reducing the amount of model parameters, and improving the super-division processing efficiency.
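  • Below is a minimal sketch of the two-branch structure described in steps 1041-1045, written with PyTorch-style modules. The layer sizes, the upscaling factor, the per-frame feature summation, and the fusion of the high-frequency information onto the brightness channel by simple addition are all assumptions for illustration; the patent does not specify the network architecture or the fusion operator.

```python
import torch
import torch.nn as nn

SCALE = 2  # assumed upscaling factor

class FeatureExtractor(nn.Module):
    """Shared feature-extraction network for residual frames and aligned frames."""
    def __init__(self, in_ch=3, feat=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

class SRBranch(nn.Module):
    """Super-division branch: upsamples feature maps to a single-channel output."""
    def __init__(self, feat=32):
        super().__init__()
        self.up = nn.Sequential(
            nn.Conv2d(feat, feat * SCALE * SCALE, 3, padding=1),
            nn.PixelShuffle(SCALE),
            nn.Conv2d(feat, 1, 3, padding=1),
        )
    def forward(self, x):
        return self.up(x)

extract = FeatureExtractor()
hf_branch = SRBranch()     # first super-division branch  -> high-frequency information
luma_branch = SRBranch()   # second super-division branch -> brightness (luminance) channel

def super_divide(residual_frames, aligned_frames, target_frame):
    # Each frame is a (1, 3, H, W) tensor; the shared extractor processes the frames
    # individually and the resulting feature maps are summed per branch.
    f1 = sum(extract(r) for r in residual_frames)                  # first feature maps (step 1041)
    f2 = sum(extract(a) for a in aligned_frames + [target_frame])  # second feature maps (step 1042)
    high_freq = hf_branch(f1)      # step 1043: high-frequency information
    luma = luma_branch(f2)         # step 1044: brightness channel
    return luma + high_freq        # step 1045: fuse high-frequency info onto the brightness channel
```
  • Under these assumptions, calling super_divide([res1, res2], [align1, align2], p1) with (1, 3, H, W) tensors would return a single-channel tensor at twice the input resolution.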
  • The embodiment of this application provides a joint video coding and decoding scheme, which uses the motion vector information and the residual information from the video coding and decoding process to perform video frame alignment and residual accumulation, and uses the aligned frames and the residual frames as the input of the neural network. Because the accumulated residual frames yield high-frequency information after passing through the neural network, and this high-frequency information is pasted back onto the brightness channel, edge details can be enhanced, thereby making up for the missing high-frequency information in the low-resolution video frames.
  • the method further includes:
  • The at least one residual frame and the aligned frames are divided into regions of interest, where the regions of interest are the macroblocks in the residual frame or the aligned frame that need to be super-divided; further, the region of interest may be determined by comparing the pixel values of the divided macroblocks of the residual frame or the aligned frame with a preset value.
  • First, the region-of-interest macroblocks in the first residual frame are determined: a macroblock whose sum of all contained pixel values is greater than or equal to a first preset value belongs to the region of interest, while macroblocks whose pixel sum is smaller than the first preset value are set as the region of non-interest.
  • The macroblocks in the region of non-interest are low-texture areas, so the super-division result of the previous frame can be used directly for this part of the area, and no super-division processing needs to be performed, which saves computation and improves the efficiency of super-division processing.
  • the method further includes: processing all residual frames (for example, the second residual frame) except for the first residual frame to divide the region of interest.
  • a possible implementation manner is to determine each macroblock by comparing the pixel value of each macroblock with the size of the first preset value in the same way as the method for dividing the region of interest for the first residual frame. Whether it is a region of interest, the detailed determination process is similar to the foregoing determination process of the region of interest in the first residual frame, and will not be repeated here.
  • Another possible implementation is to use the macroblocks at the same positions as the region of interest determined for the first residual frame as the region of interest of the second residual frame. For example, when the macroblocks of the region of interest of the first residual frame are determined to be macroblocks 4-8 (out of a total of 9 macroblocks), the macroblocks of the region of interest in the second residual frame are correspondingly determined to be macroblocks 4-8 as well.
  • In another possible implementation, the region of interest of the first residual frame may also adopt the same region as the aligned frames: at least one aligned frame is first divided into a region of interest and a region of non-interest, and the region of interest determined in the aligned frame is then used as the region of interest of the first residual frame.
  • Specifically, dividing the region of interest for the at least one aligned frame may proceed as follows: compare the pixel value of each macroblock of the first aligned frame with a second preset value, set all macroblocks greater than or equal to the second preset value as the region of interest, and set the macroblocks smaller than the second preset value as the region of non-interest. Similarly, the region of interest of the second aligned frame may be determined to be the same as that of the first aligned frame.
  • In this way, each frame of image is divided into a region of interest and a region of non-interest; that is, the accumulated residual texture details of each frame and the motion characteristics of the preceding and following frames are analyzed, and only the macroblocks of the region of interest are super-divided. For the region of non-interest, the super-division result of the previous frame can be used directly, thereby avoiding super-division of the entire image, reducing the amount of calculation, and reducing power consumption, delay, and memory overhead. This improves the super-division efficiency for a single frame and achieves the beneficial effect of obtaining super-divided images in real time within a short period.
  • It should be noted that the foregoing embodiment takes the first three frames of a GOP as an example; the method is not limited to these three frames and may also use other frames, or frames located in different GOPs.
  • One possible case is that the first frame image is the last frame of the first GOP, and the second frame image and the third frame image are the first two frames of the second GOP.
  • Another possible case is that the first frame image and the second frame image are the last two frames of the first GOP, and the third frame image is the first frame of the second GOP.
  • The first GOP and the second GOP are adjacent in time sequence.
  • Moreover, the target frame to be super-divided may be any one of the foregoing first frame image, second frame image, and third frame image, which is not limited in this embodiment.
  • Taking the case where the first frame image is the last frame of the first GOP and the second and third frame images are the first two frames of the second GOP as an example, the image processing method provided by the present application is introduced below. This method is applied to the decoder 30, the video decoding system 40, or the destination device 14 described in the foregoing embodiments.
  • the method includes:
  • Step 701 Obtain a video source.
  • the video source can be from a short video or video call APP platform, and the short video or video stream can be downloaded and obtained from the cloud.
  • Step 702 The encoder encodes the video source to generate a video code stream.
  • In an example, in the encoding stage the encoder divides the video stream into two GOPs, a first GOP and a second GOP. The first GOP includes one I frame and 3 P frames, namely {I1, P1, P2, P3}; the second GOP also includes one I frame and 3 P frames, namely {I2, P4, P5, P6}. The frames in the video stream are therefore: I1, P1, P2, P3, I2, P4, P5, P6. Any one of these frames can be chosen as the target frame and super-divided, as illustrated by the simple layout sketched below.
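  • For orientation only, the example GOP layout can be written out as plain data; the frame labels and the choice of I2 as the target frame (with P3 and P4 as its temporal neighbours) simply follow the example used in this embodiment.

```python
# Illustrative only: the two example GOPs from this embodiment.
gop1 = ["I1", "P1", "P2", "P3"]        # first GOP: one I frame + 3 P frames
gop2 = ["I2", "P4", "P5", "P6"]        # second GOP: one I frame + 3 P frames
video_stream = gop1 + gop2             # I1, P1, P2, P3, I2, P4, P5, P6

target_frame = "I2"                    # any frame may be chosen as the target
prev_frame, next_frame = "P3", "P4"    # neighbours used for alignment
```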
  • In the video encoding stage, step 702 further includes: 7021: The encoder obtains a cross-GOP motion vector and performs inter-frame prediction according to the cross-GOP motion vector.
  • the "cross-GOP motion vector” refers to the motion vector between the last frame image of the previous GOP and the first frame image of the next GOP.
  • the cross-GOP motion vector between the first GOP and the second GOP refers to the motion vector between the P3 frame of the first GOP and the I2 frame of the second GOP.
  • The cross-GOP motion vector is used for alignment toward the target frame to generate the aligned frame, thereby ensuring the accuracy of "alignment" between frames that belong to different GOPs.
  • In an example, when the target frame is the I2 frame (the first frame of the second GOP), the alignment processing means that the P3 frame of the first GOP moves its pixel blocks toward the I2 frame according to the cross-GOP motion vector, and the P4 frame of the second GOP moves its pixel blocks toward the I2 frame according to its motion vector, thereby generating two aligned frames.
  • When motion estimation is performed with motion vectors, pixel-level or sub-pixel-level motion vectors of inter-frame pixel blocks can be used according to the accuracy requirement, and motion compensation information, such as the residual, can be calculated from the motion vectors. The correspondence among motion vectors, aligned frames, and residuals follows the formulas of the first embodiment and is not repeated here.
  • In this embodiment, motion vectors between temporally adjacent GOPs (i.e., cross-GOP motion vectors) are added to the motion estimation during encoding, and the video code stream is generated accordingly. This provides continuous motion vectors for the frame alignment operations in the subsequent decoding stage and thus ensures the continuity of inter-frame prediction; a sketch of the underlying block matching is given below.
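  • The encoder's motion-estimation algorithm is not specified here beyond block matching, so the following numpy sketch of an exhaustive block search is only one plausible reading; the 16x16 block size, the search range, and the SAD cost are assumptions. It shows the kind of computation that yields a per-macroblock motion vector such as the cross-GOP vector between P3 and I2.

```python
import numpy as np

def estimate_block_mv(ref, target, x, y, block=16, search=8):
    """Exhaustive block matching: find the displacement of the block at (x, y)
    in `ref` that best matches `target` (sum of absolute differences)."""
    h, w = ref.shape
    src = ref[y:y + block, x:x + block].astype(np.int32)
    best_mv, best_cost = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ty, tx = y + dy, x + dx
            if ty < 0 or tx < 0 or ty + block > h or tx + block > w:
                continue
            cand = target[ty:ty + block, tx:tx + block].astype(np.int32)
            cost = np.abs(src - cand).sum()
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv   # (0, 0) is kept when no better match exists

# e.g. the motion vector of one macroblock of P3 relative to the target frame I2
p3 = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
i2 = np.roll(p3, shift=(2, 3), axis=(0, 1))           # toy "motion"
mv = estimate_block_mv(p3, i2, x=16, y=16)
```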
  • Step 703: The decoder obtains the video code stream. Specifically, in one implementation the encoder transmits the video code stream to the cloud, and the decoder downloads the video code stream from the cloud; in another implementation, the encoder transmits the encoded video code stream directly to the decoder.
  • Step 704: The decoder decodes the video code stream to obtain information such as at least one video frame, the motion vectors between adjacent frames, and the residuals.
  • The motion vectors include the motion vectors between two adjacent frames in the same GOP as well as the cross-GOP motion vectors. The decoded information also includes all residuals between adjacent video frames.
  • Step 705: The decoder preprocesses the image to be super-divided according to the information such as the video frames, motion vectors, and residuals, to generate at least one aligned frame and at least one residual frame, which specifically includes:
  • Step 7051 "Align" the previous frame and the next frame of the target frame to the target frame according to their respective motion vectors to generate a first aligned frame and a second aligned frame.
  • In the process of generating the aligned frames, in order to improve the accuracy of the super-divided image, the precision of the motion vectors can be either pixel-level or sub-pixel-level. For pixel-precision motion vectors, aligned frames can be generated by the normal alignment operation.
  • For sub-pixel-precision motion vectors, the video frames need to be enlarged before alignment, and the aligned image is then restored to the original size. Specifically, this includes: extracting multiple adjacent consecutive video frames, selecting the image to be super-divided as the target frame, and using the remaining frames as the preceding and following auxiliary frames. Each auxiliary frame moves its pixels to the specified positions according to its motion vector relative to the target frame, using Formula 4:
  • (x', y') = (x + mv_x, y + mv_y)   (Formula 4)
  • where (x', y') is the position of the pixel after it is moved, (x, y) is the original position of the pixel, and (mv_x, mv_y) is the motion vector.
  • In addition, for sub-pixel-precision motion vectors the consecutive video frames are first enlarged and aligned, and the aligned frames are then restored to the original image size. This is because, when the codec calculates a sub-pixel-precision motion vector, the current target frame is first enlarged by the multiple corresponding to the pixel precision and the motion vector is then calculated against the preceding auxiliary frame; the sub-pixel-precision motion vector obtained in this way is a non-integer (i.e., a decimal).
  • In the alignment process, an image buffer corresponding to the magnification is first established for the multiple consecutive video frames, where the image to be super-divided is the target frame and the remaining frames are auxiliary frames. Each auxiliary frame then moves its pixels to the specified positions according to its motion vector relative to the current target frame and the super-division magnification, using Formula 5:
  • (x', y') = (x + mv_x × factor, y + mv_y × factor)   (Formula 5)
  • where factor is the super-division magnification, (x', y') is the position of the pixel after it is moved, (x, y) is the original position of the pixel, and (mv_x, mv_y) is the motion vector.
  • After this movement, only a fraction of the pixels in the enlarged buffer carry pixel values and the rest are 0. To reduce the amount of data input to the subsequent super-division model, the enlarged image buffer is weighted and sampled back to the original image frame size to complete the alignment. Moreover, when sub-pixel precision is used to align macroblocks, the changed area can fall more accurately on the pixel points in the macroblock, thereby improving the accuracy of the alignment; a simple sketch of this movement is given below.
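  • A minimal numpy sketch of the pixel movement is given below. It assumes integer displacements for the pixel-precision case (Formula 4) and, for the sub-pixel case (Formula 5), a zero-filled enlarged buffer with a simple average over the non-zero entries as the "weighted sampling" back to the original size; both the coordinate scaling of the source pixel and that averaging rule are assumptions made only for illustration.

```python
import numpy as np

def align_pixel_precision(aux, mv):
    """Formula 4: shift the auxiliary frame toward the target by (mv_x, mv_y)."""
    mv_x, mv_y = mv
    aligned = np.zeros_like(aux)
    h, w = aux.shape
    for y in range(h):
        for x in range(w):
            nx, ny = x + mv_x, y + mv_y
            if 0 <= nx < w and 0 <= ny < h:
                aligned[ny, nx] = aux[y, x]
    return aligned

def align_subpixel(aux, mv, factor=2):
    """Formula 5 sketch: move pixels inside an enlarged buffer, then sample back."""
    mv_x, mv_y = mv                           # fractional (sub-pixel) motion vector
    h, w = aux.shape
    buf = np.zeros((h * factor, w * factor), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            # the source coordinate is also scaled here so the move lands in the
            # enlarged buffer; Formula 5 writes only the displacement term
            nx = int(round((x + mv_x) * factor))
            ny = int(round((y + mv_y) * factor))
            if 0 <= nx < w * factor and 0 <= ny < h * factor:
                buf[ny, nx] = aux[y, x]
    aligned = np.zeros((h, w), dtype=np.float64)
    for y in range(h):                        # weighted sampling back to original size
        for x in range(w):
            cell = buf[y * factor:(y + 1) * factor, x * factor:(x + 1) * factor]
            nz = cell[cell != 0]
            aligned[y, x] = nz.mean() if nz.size else 0.0
    return aligned
```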
  • Step 7052 According to the residuals between the two adjacent frames, perform residual accumulation according to key frames to generate at least one residual frame with respect to the target frame.
  • the key frame is the first frame of each GOP.
  • the traceback technique may be used to accumulate the residuals to generate at least one residual frame.
  • In the traceback technique the residuals are accumulated; that is, the correspondence between each auxiliary frame and the key frame (the first frame) of each GOP is obtained. This can be understood as expressing each auxiliary frame in terms of the key frame plus the accumulated residuals.
  • In an example, the P3 frame, the I2 frame, and the P4 frame in the above two GOPs are taken as the first frame image, the second frame image, and the third frame image, and the I2 frame is the target frame to be super-divided. The process of generating the at least one residual frame then includes:
  • (1) For the residual frame of the P3 frame: according to Formulas 1 to 3 of the first embodiment, the relationship between the P3 frame of the first GOP and the I1 frame can be written (the stripped formula symbols are reconstructed here in plain notation from the surrounding definitions) as
  • f_t4^i = f̂^i + res_{t1,t2}^i + res_{t2,t3}^i + res_{t3,t4}^i = f̂^i + Res_{t1,t4}^i
  • where t1, t2, t3, and t4 are the generation moments of the I1, P1, P2, and P3 frame images respectively; res_{t1,t2}^i represents the residual of the I1 frame and the P1 frame; res_{t2,t3}^i represents the residual of the P1 frame and the P2 frame; res_{t3,t4}^i represents the residual of the P2 frame and the P3 frame; Res_{t1,t4}^i represents the cumulative residual between the I1 frame and the P3 frame; i represents a macroblock of the image; and f̂^i represents the macroblock after macroblock i is moved according to its corresponding motion vector. The first residual frame is generated from the cumulative residual between the I1 frame and the P3 frame.
  • (2) For the residual frame of the I2 frame: since the I2 frame is the first frame of the second GOP, the residual of the I2 frame is 0.
  • (3) For the residual frame of the P4 frame: the P4 frame is the second frame of the second GOP, so according to the above Formula 2 the correspondence between the P4 frame and the I2 frame is (again in plain reconstructed notation)
  • f_t6^i = f̂^i + res_{t5,t6}^i
  • where t5 is the generation moment of the I2 frame image, t6 is the generation moment of the P4 frame image, f_t6^i represents the P4 frame, f̂^i represents the aligned frame generated after the P4 frame is aligned toward the I2 frame according to the motion vector (macroblock i moved by its corresponding motion vector), res_{t5,t6}^i represents the residual between the I2 frame and the P4 frame, and i represents a macroblock of the image. The second residual frame is generated from this residual between the I2 frame and the P4 frame.
  • In this embodiment, the correspondence between the first frame of each GOP and any subsequent frame can be calculated. The expression no longer depends on the relationship between two adjacent frames; instead, a direct correspondence is established between the key frame I1 generated at t1 and each of the images generated at t2, t3, and t4, thereby breaking the interdependence between adjacent P frames and avoiding storing the correspondence between every two consecutive frames, which saves storage space.
  • In addition, the generated cumulative residual frame contains temporally correlated redundant information, which provides more original information for the subsequent detail recovery of moving subjects; a compact sketch of the accumulation is shown below.
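  • The following numpy sketch shows the key-frame residual accumulation; the dictionary of per-pair residuals and the helper name accumulate_residual are illustrative assumptions, and real residuals would come from the decoded bitstream rather than random data.

```python
import numpy as np

# residuals between adjacent frames, as decoded from the bitstream
# (random arrays stand in for the real decoded residuals)
res = {
    ("I1", "P1"): np.random.randn(64, 64),
    ("P1", "P2"): np.random.randn(64, 64),
    ("P2", "P3"): np.random.randn(64, 64),
    ("I2", "P4"): np.random.randn(64, 64),
}

def accumulate_residual(chain):
    """Sum the adjacent-frame residuals along a chain of frames, e.g.
    ["I1", "P1", "P2", "P3"] -> cumulative residual between I1 and P3."""
    total = np.zeros_like(res[(chain[0], chain[1])])
    for prev, nxt in zip(chain, chain[1:]):
        total += res[(prev, nxt)]
    return total

residual_frame_p3 = accumulate_residual(["I1", "P1", "P2", "P3"])  # first residual frame
residual_frame_p4 = accumulate_residual(["I2", "P4"])               # second residual frame
```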
  • Step 706: Determine whether the image to be generated is the first frame of the video code stream, where the first frame is the first video frame played by the video stream, that is, the I1 frame of the first GOP.
  • If it is not the first frame, perform steps 707 to 709; if it is the first frame, perform step 710.
  • Step 707 For the non-I1 frame, divide the region of interest and the region of non-interest for the at least one residual frame and the aligned frame.
  • Specifically, a possible screening process includes: using the motion vectors and residual information to select the super-division regions of interest. If a region contains little texture information in the residual accumulation map, the region is a region of non-interest, and the super-division result of that region in the previous frame can be used directly, reducing the super-division computation for this frame; otherwise, it is a region of interest and needs super-division processing. Further, the region of interest can be determined by Formula 6 and Formula 7.
  • One plausible rendering of the stripped formulas, consistent with the symbol definitions, is
  • I_SR = U_{n=1..I} [ sign( Σ_{k=1..K} i_k ) · I_residual ]   (Formula 6)
  • sign(x) = 1 if x ≥ Threshold, and sign(x) = 0 otherwise   (Formula 7)
  • where I_SR represents the set of macroblocks of all regions of interest, K represents the total number of pixels in each macroblock, I_residual represents the current image macroblock, i_k represents the pixel value of the k-th pixel, sign(x) is the sign function defined by Formula 7, I represents the number of macroblocks in the target frame, and the U function takes the union of the selected macroblocks. The threshold (Threshold) is a constant that can be preset. That is, if the sum of all pixel values in a macroblock is smaller than the preset threshold, sign(x) = 0 and the macroblock belongs to the region of non-interest; if sign(x) = 1, the macroblock corresponding to those pixel values is a region of interest and needs super-division processing. The union of all region-of-interest macroblocks is taken as the multi-texture region that finally needs super-division.
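  • The thresholded selection of Formulas 6 and 7 can be sketched as follows; the 16x16 macroblock grid, the example threshold, and the use of the absolute value of the residual (so that negative residual values also count as texture) are assumptions, since the formulas only specify a pixel-value sum compared against a preset threshold.

```python
import numpy as np

def select_roi_macroblocks(residual_frame, mb_size=16, threshold=500.0):
    """Formula 6/7 sketch: a macroblock is a region of interest when the sum of
    its (absolute) residual pixel values reaches the preset threshold."""
    h, w = residual_frame.shape
    roi = []
    for y in range(0, h, mb_size):
        for x in range(0, w, mb_size):
            block = residual_frame[y:y + mb_size, x:x + mb_size]
            s = np.abs(block).sum()           # sum of pixel values in the macroblock
            if s >= threshold:                # sign(x) = 1 -> region of interest
                roi.append((y, x))
    return roi                                # union of all region-of-interest blocks

residual_frame = np.zeros((64, 64))
residual_frame[16:32, 16:32] = 5.0            # one textured macroblock
print(select_roi_macroblocks(residual_frame))  # -> [(16, 16)]
```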
  • Step 7081: The macroblocks in the region of non-interest adopt the super-division result of the previous frame. For these macroblocks, the image information obtained by super-dividing the macroblocks at the corresponding positions in the previous frame is extracted; this image information includes the super-division result of the macroblock at the corresponding position of the previous frame of image.
  • Step 7082: Send the macroblocks of the region of interest to the neural network for super-division processing, where the macroblocks of the region of interest are those determined in the following frames.
  • In an example, the macroblocks of the regions of interest of the first aligned frame and the second aligned frame generated in step 7051, together with the target frame I2, are sent to the neural network, and at the same time the macroblocks of the regions of interest of the first residual frame and the second residual frame generated in step 7052 are sent to the neural network, to obtain the brightness channel and the high-frequency information of the target frame I2 respectively.
  • Step 709: Fuse the brightness channel with the high-frequency information to obtain the super-divided image, as shown in Figure 7B.
  • The neural network includes a feature extraction network and super-division networks. The feature extraction network is used to perform feature extraction on the input aligned frames and residual frames, obtaining the feature maps of the aligned frames and the feature maps of the residual frames respectively.
  • The feature extraction network may be a shared network, that is, it is shared by the aligned frames and the residual frames for feature extraction, which improves parameter utilization and reduces the number of parameters of the neural network.
  • The extracted feature maps are then sent to different super-division networks: the feature maps obtained from the aligned frames pass through the first super-division network and output the brightness channel, while the feature maps obtained from the residual frames pass through the second super-division network and output high-frequency information. Finally, the high-frequency information is pasted back into the brightness channel to enhance the edge details and high-frequency content of the image to be super-divided, yielding a high-quality super-divided image.
  • Step 710: In step 706 above, if the image to be super-divided is the first frame, the region of interest is not divided; the generated aligned frames and residual frames are directly input to the neural network for super-division processing to obtain the super-divided image. Because the first frame has no preceding frame, there is no previous-frame super-division result, so there is no need to divide regions of interest and non-interest, and the entire image is super-divided.
  • Otherwise, the output result needs to be spliced with the previous-frame super-division result corresponding to the region of non-interest in step 7081 to obtain the complete super-division result of the current frame. As shown in Figure 7B, the three super-divided blocks of the current frame, namely the hat brim, the nose, and the hat tassel, are merged with the previous-frame super-division blocks to obtain the super-divided image; a small sketch of this splicing is given below.
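  • The splicing of reused non-interest blocks with freshly super-divided region-of-interest blocks can be sketched as follows; the block bookkeeping (a dict keyed by macroblock position) and the x2 scale factor are assumptions made purely for the example.

```python
import numpy as np

def splice_superdivision(prev_sr, new_sr_blocks, mb_size=16, scale=2):
    """Start from the previous frame's super-division result and overwrite only
    the region-of-interest macroblocks with their newly super-divided versions."""
    out = prev_sr.copy()
    s = mb_size * scale
    for (y, x), block in new_sr_blocks.items():    # (y, x) in low-res coordinates
        out[y * scale:y * scale + s, x * scale:x * scale + s] = block
    return out

prev_sr = np.zeros((128, 128))                     # previous frame, already super-divided
new_blocks = {(16, 16): np.ones((32, 32))}         # freshly super-divided ROI block
current_sr = splice_superdivision(prev_sr, new_blocks)
```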
  • In this embodiment, when the frame to be super-divided is not the first frame of the video code stream, the region of interest and the region of non-interest are divided, and only the macroblocks of the region of interest are super-divided. For the macroblocks of the region of non-interest, the super-division result of the previous frame is used; that is, areas where pixels move and where the accumulated residual carries information are super-divided, while blocks with little pixel motion and little accumulated-residual information reuse the previous frame's super-division result or are directly enlarged by interpolation. This avoids super-dividing every macroblock of the whole image, saves the computation and time spent in the neural network, and improves super-division efficiency.
  • In addition, macroblocks at the same positions in the aligned frames complement each other with the same details, and this redundancy provides a greater amount of information. The residual is effectively the difference compensation of the pixel-matched blocks and contains information such as the edges of the moving subject; in this embodiment the residual frame is obtained by accumulating residuals, and the high-frequency edges of the image are restored with the residuals, so that the subjective effect is better. Meanwhile, the amount of calculation allocated to different chips can be dynamically adjusted according to the load capability of the terminal device's chips to reduce the delay.
  • In addition, the methods described in the foregoing embodiments of this application can also implement the corresponding functions through software modules. As shown in Fig. 2B, the video decoding system 40 includes the processing unit 46; alternatively, as shown in Fig. 8, an image processing apparatus is provided.
  • The apparatus includes an acquisition unit 810, a processing unit 820, and a sending unit 830, and may further include other functional modules or units, such as a storage unit.
  • The image processing apparatus can be used to execute the image processing procedures in Fig. 5, Fig. 6, and Fig. 7A.
  • For example, the obtaining unit 810 is configured to obtain a video code stream, where the video code stream includes a first frame image, a second frame image, and a third frame image that are adjacent in time sequence.
  • The processing unit 820 is configured to decode the video code stream to obtain a first aligned frame, a second aligned frame, and at least one residual between the first frame image, the second frame image, and the third frame image.
  • The processing unit 820 is further configured to generate at least one residual frame from the at least one residual, and to perform super-division processing on the second frame image according to the at least one residual frame, the first aligned frame, and the second aligned frame to obtain the super-divided second frame image.
  • The sending unit 830 is configured to output the super-divided second frame image to the display screen for display.
  • The first aligned frame is generated after the first frame image performs pixel block movement toward the second frame image according to the first motion vector, and the second aligned frame is generated after the third frame image performs pixel block movement toward the second frame image according to the second motion vector. The residual is the pixel difference between each macroblock of the later frame image and the earlier frame image after the earlier frame image is motion-compensated toward the later frame image according to the motion vector.
  • Optionally, the first frame image, the second frame image, and the third frame image are three frames in the first group of pictures; or the first frame image is the last frame of the first group of pictures, and the second frame image and the third frame image are the first two frames of the second group of pictures; or the first frame image and the second frame image are the last two frames of the first group of pictures, and the third frame image is the first frame of the second group of pictures.
  • the embodiment of the present application does not limit the three frames of images selected in the video code stream, and also does not limit which frame is selected as the image to be super-divided.
  • In a possible implementation of this embodiment, the processing unit 820 is specifically configured to generate a first residual frame, where the first residual and the first aligned frame satisfy the relationship of Formula 1, reconstructed here in plain notation as
  • f_t2^i = f̂_{t1→t2}^i + res_{t1,t2}^i
  • where f_t2^i denotes macroblock i of the second frame image, f̂_{t1→t2}^i denotes the corresponding macroblock of the first aligned frame (i.e., macroblock i of the first frame image after it is moved according to its corresponding motion vector), res_{t1,t2}^i denotes the first residual, i represents a macroblock of the first frame image, t1 represents the generation moment of the first frame image, and t2 represents the generation moment of the second frame image.
  • In another possible implementation of this embodiment, the processing unit 820 is specifically configured to: input the at least one residual frame to the neural network for feature extraction to obtain at least one first feature image; input the first aligned frame, the second aligned frame, and the second frame image to the neural network for feature extraction to obtain at least one second feature image; input the at least one first feature image to the first super-division network to generate high-frequency information; input the at least one second feature image to the second super-division network to generate a brightness channel; and fuse the high-frequency information with the brightness channel to generate the super-divided second frame image.
  • In yet another possible implementation of this embodiment, the processing unit 820 is further configured to determine the macroblocks of the region of interest in the first residual frame, and to determine, according to the macroblocks of the region of interest in the first residual frame, the regions of interest of the remaining residual frames in the at least one residual frame other than the first residual frame; a macroblock of the region of interest is a macroblock whose sum of all contained pixel values exceeds a preset value.
  • In addition, the processing unit 820 is specifically further configured to input the macroblocks of all regions of interest of the at least one residual frame to the neural network for feature extraction, where the at least one residual frame includes the first residual frame and the remaining residual frames.
  • In yet another possible implementation of this embodiment, the processing unit 820 is specifically configured to input the macroblocks of the regions of interest in the first aligned frame and the second aligned frame, together with the second frame image, to the neural network for feature extraction, where the regions of interest in the first aligned frame and the second aligned frame are the same as the region of interest in the first residual frame.
  • Fig. 9 shows another possible structural schematic diagram of the video encoding device involved in the foregoing embodiments.
  • The video encoding device includes a processor 901, a transceiver 902, and a memory 903. As shown in Fig. 9, the memory 903 is coupled with the processor 901 and stores the computer programs necessary for the video encoding device.
  • For example, in one embodiment, the transceiver 902 is configured to send encoded information to the decoder 30, and the processor 901 is configured to perform the encoding operations or functions of the video encoding device.
  • Further, the transceiver 902 is used to obtain a video code stream, and the processor 901 can be used to decode the video code stream to obtain a first aligned frame, a second aligned frame, and at least one residual between the first frame image, the second frame image, and the third frame image; to generate at least one residual frame according to the at least one residual; and to perform super-division processing on the second frame image according to the at least one residual frame, the first aligned frame, and the second aligned frame to obtain the super-divided second frame image.
  • In one specific implementation, the processor 901 is further configured to: input the at least one residual frame to the neural network for feature extraction to obtain at least one first feature image; input the first aligned frame, the second aligned frame, and the second frame image to the neural network for feature extraction to obtain at least one second feature image; input the at least one first feature image to the first super-division network for processing to generate high-frequency information; input the at least one second feature image to the second super-division network for processing to generate a brightness channel; and fuse the high-frequency information with the brightness channel to generate the super-divided second frame image.
  • In another specific implementation, the processor 901 may also be configured to: determine the macroblocks of the region of interest in the first residual frame, where a macroblock of the region of interest is a macroblock whose sum of all contained pixel values exceeds a preset value; determine, according to the macroblocks of the region of interest in the first residual frame, the regions of interest of the remaining residual frames in the at least one residual frame other than the first residual frame; and input the macroblocks of all regions of interest of the at least one residual frame to the neural network for feature extraction, where the at least one residual frame includes the first residual frame and the remaining residual frames.
  • In yet another specific implementation, the processor 901 is specifically further configured to input the macroblocks of the regions of interest in the first aligned frame and the second aligned frame, together with the second frame image, to the neural network for feature extraction, where the regions of interest in the first aligned frame and the second aligned frame are the same as the region of interest in the first residual frame.
  • The video decoding device provided in this embodiment further includes a computer storage medium, which can store computer program instructions; when the program instructions are executed, all steps of the image processing methods described in the foregoing embodiments of this application can be implemented.
  • the computer storage medium includes a magnetic disk, an optical disk, a read-only storage memory ROM, or a random storage memory RAM, and the like.
  • The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product, which is not limited in this embodiment.
  • This application also provides a computer program product, which includes one or more computer program instructions.
  • When the computer loads and executes the computer program instructions, all or part of the processes or functions described in the foregoing embodiments of this application are produced.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer program instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • For example, the computer program instructions may be transmitted from a network node, computer, server, or data center to another site, computer, or server by wired or wireless means.
  • the storage medium in any device can be a magnetic disk, an optical disc, a read-only memory (ROM) or a random access memory (RAM), etc.
  • the foregoing processor may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
  • the processor may further include a hardware chip.
  • the aforementioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • The memory may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory may also include a combination of the above types of memories.
  • The various illustrative logic units and circuits described in this application can be implemented by a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to implement or perform the described functions.
  • The general-purpose processor may be a microprocessor; optionally, it may also be any conventional processor, controller, microcontroller, or state machine.
  • The processor may also be implemented by a combination of computing devices, such as a digital signal processor and a microprocessor, multiple microprocessors, one or more microprocessors combined with a digital signal processor core, or any other similar configuration.
  • the steps of the method or algorithm described in this application can be directly embedded in hardware, a software unit executed by a processor, or a combination of the two.
  • the software unit can be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, removable disk, CD-ROM or any other storage medium in the art.
  • the storage medium may be connected to the processor, so that the processor can read information from the storage medium, and can store and write information to the storage medium.
  • the storage medium may also be integrated into the processor.
  • the processor and the storage medium may be set in the ASIC, and the ASIC may be set in the UE.
  • the processor and the storage medium may also be provided in different components in the UE.
  • It should be understood that, in the various embodiments of this application, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

An image processing method and apparatus are disclosed. The method includes: obtaining a video code stream, where the video code stream includes a first frame image, a second frame image, and a third frame image that are adjacent in time sequence; decoding the video code stream to obtain a first aligned frame, a second aligned frame, and at least one residual between the first frame image, the second frame image, and the third frame image; generating at least one residual frame according to the at least one residual; and performing super-division processing on the second frame image according to the at least one residual frame, the first aligned frame, and the second aligned frame to obtain the super-divided second frame image. The method uses the aligned frames and the residual frames as the input of a neural network; because the residual frames output high-frequency information after passing through the neural network, pasting this high-frequency information back into the brightness channel can enhance edge details, thereby improving video picture quality without increasing hardware cost.

Description

一种图像处理方法和装置 技术领域
本申请涉及图像处理技术领域,尤其是涉及一种图像处理方法和装置,用于在对图像超分的过程中生成清晰图像。
背景技术
随着手机、平板、智慧电视等设备的不断普及和移动通信技术的提升,短视频、流媒体视频和实时视频通话等业务如雨后春笋,越来越多的占据终端屏幕。为使用户在体验这些视频业务时,不仅能够满足在不确定网络条件下视频播放的流畅性,还要降低边缘存储的成本开销,因此往往传输给用户的视频分辨率不高,且画质模糊,有噪声,块状效应和边缘锯齿等负向效果。为了去除这些负向效果,在不增加边缘存贮量和传输数据量的基础上,提高画面质量,一般可以采用超分辨率技术。其中所述超分辨率技术又可称为“超分”技术。
超分技术,主要是指从低分辨率的图片恢复成高分辨率图片的一种重要的计算机视觉和图像处理手段。尤其是应用在医疗图像、安全监控和电视娱乐等相关领域。超分技术的难点在于低分辨率的图片损失了高频部分相关的信息量,同时加入了由于摄像头感光元件恢复能力、压缩编码损失和传输信道丢包等引起的负向效果。
传统的基于图像特征的超分算法一般从去噪,滤波,边缘轮廓分析、提取和拟合,小波变换恢复频域信息等角度进行恢复。随着深度学习相关领域的发展,越来越多的工作转向使用卷积神经网络模型进行超分,并且取得了突破性的成果。参见图1所示,目前基于深度学习的超分工作主要可以分为以下三个方向:单图超分(single image super resolution,SISR)、序列超分(sequence super resolution,SSR)和参考图片超分(reference super resolution,RSR)。
其中,SISR只将单帧图像送入神经网络进行超分处理;SSR是将视频码流中连续的多帧图像送入神经网络进行超分处理;RSR则是使用之前已经超分的图像,或者预先放置的有特殊纹理的图像模板,与输入图像/帧序列一起输入至神经网络进行超分处理。基于上述的三种研究方向都存在共同的问题,即在超分过程中,由于低分辨率帧图像缺失了高频部分的信息,并且该丢失的高频部分信息几乎不可能通过单张图像来得到恢复,进而由于图像高频边缘部分的信息缺失,导致超分后的图像分辨率较低,影响用户观看体验。
发明内容
本申请实施例公开了一种图像处理方法,用于解决在超分时产生的图像由于高频信息缺失而导致的超分后的图像分辨率低的问题。
为了解决该技术问题,本申请公开了以下技术方案:
第一方面,本申请实施例提供了一种图像处理方法,该方法可以应用于接收端设备,比如目的地设备或解码器等,具体地,所述方法包括:
获取视频码流,所述视频码流中包括时序相邻的第一帧图像、第二帧图像和第三帧图像;解码所述视频码流得到第一对齐帧、第二对齐帧、以及所述第一帧图像、所述第二帧图像和所述第三帧图像之间的至少一个残差,根据所述至少一个残差生成至少一个残差帧;根据所述至少一个残差帧、所述第一对齐帧和所述第二对齐帧对所述第二帧图像进行超分处理,得到超分后的第二帧图像。
其中,所述第一对齐帧是所述第一帧图像向所述第二帧图像按照第一运动矢量进行像素块移动后生成的,所述第二对齐帧是所述第三帧图像向所述第二帧图像按照第二运动矢量进行像素块移动后生成的,所述残差为前一帧图像按照运动矢量向后一帧图像做运动补偿后与所述后一帧图像的每个宏块之间的像素差;
本方面提供的方法,采用视频编解码过程中的运动矢量信息和残差信息进行视频帧的对齐和残差累积,并且采用对齐帧和残差帧作为神经网络的输入,由于累积帧经过神经网络后输出了高频信息,将该高频信息贴合回亮度通道中,可以加强边缘细节,从而弥补超分过程中,由于低分辨率视频帧缺失了高频部分的信息而导致的图像分辨率低的问题。本方法在提升了视频画质的质量的同时还不增加硬件成本。
结合第一方面,在第一方面的一种可能的实现中,所述根据所述至少一个残差生成至少一个残差帧,包括:根据第一残差生成第一残差帧;
其中,所述第一残差与所述第一对齐帧满足以下关系式:
Figure PCTCN2020108829-appb-000001
其中,
Figure PCTCN2020108829-appb-000002
表示所述第二帧图像,
Figure PCTCN2020108829-appb-000003
表示所述第一对齐帧,
Figure PCTCN2020108829-appb-000004
表示所述第一残差,i表示所述第一帧图像的一个宏块,
Figure PCTCN2020108829-appb-000005
表示所述宏块i按照其对应的运动矢量
Figure PCTCN2020108829-appb-000006
移动后的宏块,t1表示所述第一帧图像的生成时刻,t2表示所述第二帧图像的生成时刻。
结合第一方面,在第一方面的另一种可能的实现中,根据所述至少一个残差帧、所述第一对齐帧和所述第二对齐帧,对所述第二帧图像进行超分处理得到超分后的第二帧图像,包括:将所述至少一个残差帧输入至神经网络进行特征提取,得到至少一个第一特征图像;将所述第一对齐帧、所述第二对齐帧和所述第二帧图像输入至所述神经网络进行特征提取,得到至少一个第二特征图像;将所述至少一个第一特征图像输入至第一超分网络处理,生成高频信息;将所述至少一个第二特征图输入至第二超分网络处理,生成亮度通道;将所述高频信息与所述亮度通道相融合,生成所述超分后的第二帧图像。
本实现方式中,将至少一个残差帧和至少一个对齐帧输入至同一个神经网络进行超分处理,从而提高参数利用率,减少模型参数量,提高超分处理效率。
结合第一方面,在第一方面的又一种可能的实现中,将所述至少一个残差帧输入至神经网络进行特征提取之前,还包括:确定所述第一残差帧中感兴趣区域的宏块,所述感兴趣区域的宏块为当前宏块中所包含的所有像素值之和超过预设值的宏块;根据所述第一残差帧中感兴趣区域的宏块确定所述至少一个残差帧中除了所述第一残差帧之外其余的残差帧的感兴趣区域。
对应地,将所述至少一个残差帧输入至神经网络进行特征提取,包括:将所述至 少一个残差帧的所有感兴趣区域的宏块输入至所述神经网络进行特征提取,所述至少一个残差帧包括所述第一残差帧和所述其余的残差帧。
本实现方式中,通过对每帧图像划分感兴趣区域和非感兴趣区域,即分析每帧图像的残差累积纹理细节和前后帧运动特点,并且只对感兴趣区域的宏块进行超分处理,对于非感兴趣区域则可以直接采用前一帧的超分处理结果,从而避免对整张图像进行超分处理,降低了计算量,减小功耗、时延和内存的开销,提高了单帧图像超分效率,进而达到短时间内实时获取超分图像的有益效果。
结合第一方面,在第一方面的又一种可能的实现中,将所述第一对齐帧、所述第二对齐帧和所述第二帧图像输入至所述神经网络进行特征提取,包括:将所述第一对齐帧和所述第二对齐帧中的感兴趣区域的宏块,和所述第二帧图像输入至所述神经网络进行特征提取。其中所述第一对齐帧和所述第二对齐帧中的感兴趣区域,与所述第一残差帧中感兴趣区域相同。
结合第一方面,在第一方面的又一种可能的实现中,所述第一帧图像、所述第二帧图像和所述第三帧图像为第一图像组中的三帧图像;
或者,所述第一帧图像为第一图像组中的最后一帧图像,所述第二帧图像和所述第三帧图像为第二图像组中的前两帧图像;
或者,所述第一帧图像和所述第二帧图像为第一图像组中的最后两帧图像,所述第三帧图像为第二图像组中的第一帧图像。
第二方面,本申请实施例还提供了一种图像处理装置,该装置包括用于执行第一方面及第一方面各实现方式的中步骤的单元,例如包括获取单元、处理单元和发送单元等,另外还可以包括存储单元。
可选的,所述装置为一种通信装置或芯片。
可选的,所述装置可以集成在编码器、解码器,或者是在视频译码系统中,另外,所述装置还可以是源设备或者目的地设备,本实施例对图像处理装置的具体形态不做限制。
第三方面,本申请实施例还提供了一种电子设备或电子装置,比如视频编码设备,该设备包括处理器、收发器和存储器,所述存储器与处理器耦合,用于保存该视频编码设备必要的计算机程序指令,当所述处理器调用存储在所述存储器中的计算机程序指令时,可使得该设备执行前述第一方面以及第一方面各种实现方式中的方法。
第四方面,本申请实施例还提供了一种计算机可读存储介质,所述存储介质中存储有指令,当所述指令在计算机或处理器上运行时,用于执行前述第一方面以及第一方面各种实现方式中的方法。
第五方面,本申请实施例还提供了一种计算机程序产品,所述计算机程序产品包括计算机指令,当所述指令被计算机或处理器执行时,可实现前述第一方面和第一方面各种实现方式中的方法。
第六方面,本申请实施例还提供了一种芯片系统,所述芯片系统包括处理器和接口电路,所述接口电路与所述处理器耦合,所述处理器用于执行计算机程序或指令,以实现前述第一方面以及第一方面各种实现方式中的方法;其中所述接口电路用于与所述芯片系统之外的其它模块进行通信。
本申请实施例提供的图像处理方法,采用视频编解码过程中的运动矢量信息和残差信息进行视频帧的对齐和残差累积计算,并且采用对齐帧和累积残差帧作为神经网络模块的输入,由于残差累积帧经过神经网络模型后输出高频信息,将该高频信息贴合回亮度通道中,从而可以加强边缘细节,弥补超分过程中低分辨率视频帧缺失了高频部分的信息而导致的图像分辨率低的问题,本方法在提升了视频画质的质量的同时还不增加硬件成本。
附图说明
图1为本申请提供的一种超分算法的分类的示意图;
图2A为本申请实施例提供的一种视频编码及解码系统的示意性框图;
图2B为本申请实施例提供的一种视频译码系统的示例性结构示意图;
图3A为本申请实施例提供的一种视频编解码方法的流程示意图;
图3B为本申请实施例提供的另一种视频编解码方法的流程示意图;
图4A为本申请实施例提供的一种编码器20的示例性结构示意图;
图4B为本申请实施例提供的一种解码器30的示例性结构示意图;
图5为本申请实施例提供的一种图像处理方法的流程图;
图6为本申请实施例提供的一种利用神经网络处理并生成超分图像的流程图;
图7A为本申请实施例提供的又一种图像处理方法的流程图;
图7B为本申请实施例提供的一种超分效果拼合生成超分后的图像的示意图;
图8为本申请实施例提供的一种图像处理装置的结构示意图;
图9为本申请实施例提供的一种视频解码器设备的结构示意图。
具体实施方式
为了使本技术领域的人员更好地理解本申请实施例中的技术方案,并使本申请实施例的上述目的、特征和优点能够更加明显易懂,下面结合附图对本申请实施例中的技术方案作详细的说明。
在说明本申请实施例的技术方案之前,首先结合附图对本申请的技术场景和相关技术术语进行介绍。
本实施例的技术方案应用于图像处理的技术领域,主要是针对视频中一系列连续帧的图像做超分处理。其中,所述视频可以理解为按照一定顺序和帧速率播放的若干帧图像(本领域也可以描述为图像)。在对视频码流进行处理的过程中,包括视频编码和视频解码。
进一步地,视频编码是对视频中每帧图像执行编码操作,得到每帧图像的编码信息的过程。视频编码在源侧执行。视频解码是根据每帧图像的编码信息重构每帧图像的过程。视频解码在目的地侧执行。视频编码操作和视频解码操作的组合可以称为视频编解码(编码和解码)。
现有的视频编解码是根据视频编解码标准(例如,高效率视频编解码H.265标准)来执行操作的,且遵照高效视频编解码标准(high efficiency video coding standard,HEVC)测试模型。或者,视频编解码根据其它专属或行业标准来执行操作,例如标准包括ITU-TH.261、ISO/IECMPEG-1Visual、ITU-TH.262或ISO/IECMPEG-2Visual、 ITU-TH.263、ISO/IECMPEG-4Visual,ITU-TH.264(或称为ISO/IECMPEG-4AVC),或者所述标准还可以包括分级视频编解码及多视图视频编解码扩展。应理解,本申请的技术不限于任何特定编解码标准或技术。
一般地,编解码操作是以编码单元(coding unit,CU)为单位。具体地,在编码过程中,将图像划分为多个CU,然后对这些CU中的像素数据进行编码,得到每个CU的编码信息。在解码过程中,将图像划分为多个CU,然后根据每个CU对应的编码信息重建各个CU,得到每个CU的重建块。或者,还可以将图像划分成编码树型块的栅格。在一些示例中,编码树型块又被称作“树型块”、“最大编码单元”(largest coding unit,LCU)或“编码树型单元”。可选的,所述编码树型块还可以被继续划分为多个CU。
参见图2A,图2A示例性地给出了本申请所应用的视频编解码系统10的示意性框图。如图2A所示,该系统10包括源设备12和目的地设备14,其中源设备12产生经编码视频数据,因此,源设备12又被称为视频编码装置。目的地设备14对源设备12产生的经编码的视频数据进行解码,因此,目的地设备14又被称为视频解码装置。
其中,源设备12和目的地设备14中包括一个或多个处理器,以及耦合到所述一个或多个处理器的存储器。所述存储器包括但不限于随机存储记忆体(random access memory,RAM)、只读存储记忆体(read-only memory,ROM)、带电可擦可编程只读存储器(electrically erasable programmable read only memory,EEPROM)、快闪存储器或可用于以由计算机存取的指令或数据结构的形式存储所要的程序代码的任何其它媒体。
所述源设备12和所述目的地设备14包括各种装置,比如桌上型计算机、移动计算装置、笔记型(例如,膝上型)计算机、平板计算机、机顶盒、例如所谓的“智能”电话等电话手持机、电视机、相机、显示装置、数字媒体播放器、视频游戏控制台、车载计算机、无线通信设备、人工智能设备、虚拟现实/混合现实/增强现实设备、自动驾驶系统或其它装置,本申请实施例对上述装置的结构和具体形态不进行限制。
如图2A所示,源设备12和目的地设备14之间通过链路13连接,目的地设备14经由链路13从源设备12接收经编码的视频数据。其中,链路13包括一个或多个媒体或装置。在一种可能的实现中,链路13包括使得源设备12能够实时地将经编码视频数据直接发射到目的地设备14的一个或多个通信媒体。在一示例中,源设备12根据通信标准(例如无线通信协议)来调制视频数据,并且将经调制的视频数据传输到目的地设备14。所述一个或多个通信媒体包括无线或有线通信媒体,例如射频(RF)频谱或至少一个物理传输线。所述一个或多个通信媒体可形成基于分组网络的一部分,所述分组网络可以为局域网、广域网或全球网络(例如,因特网)等。所述一个或多个通信媒体包括路由器、交换器、基站或促进从源设备12到目的地设备14的通信的其它设备。
源设备12包括图像源16、图像预处理器18、编码器20和通信接口22。在一具体实现中,所述编码器20、图像源16、图像预处理器18和通信接口22可以是源设备12中的硬件部件,也可能是源设备12中的软件程序。
更具体的描述如下:
图像源16,可以包括任何类别的图像捕获设备,用于捕获现实世界图像或评论,所述评论是指对于屏幕内容编码,屏幕上的一些文字。其中所述图像捕获设备用于获取和/或提供现实世界图像、计算机动画图像,例如屏幕内容、虚拟现实(virtual reality,VR)图像、实景(augmented reality,AR)图像等。图像源16可以为用于捕获图像的相机或者用于存储图像的存储器,图像源16还可以包括存储先前捕获或产生的图像和/或获取或接收图像的任何类别的(内部或外部)接口。
当图像源16为相机时,图像源16可为本地的或集成在源设备中的集成相机;当图像源16为存储器时,图像源16可为本地的或集成在源设备中的集成存储器。当所述图像源16包括接口时,接口可为从外部视频源接收图像的外部接口,外部视频源为外部图像捕获设备,比如相机、外部存储器或外部图像生成设备,外部图像生成设备为外部计算机图形处理器、计算机或服务器。接口可以为根据任何专有或标准化接口协议的任何类别的接口,例如有线或无线接口、光接口。
在图像源16中存储的图像可以视为像素点(picture element)的二维阵列或矩阵。阵列中的像素点也可以称为采样点。阵列或图像在水平和垂直方向(或轴线)上的采样点数目定义图像的尺寸和/或分辨率。为了表示颜色,通常采用三个颜色分量,即图像可以表示包含三个采样阵列。例如在RBG格式或颜色空间中,图像包括对应的红色(R)、绿色(G)及蓝色(B)采样阵列。但是,在视频编码中,每个像素通常以亮度/色度格式或颜色空间表示,例如对于YUV格式的图像,包括Y指示的亮度分量(有时也可以用L指示)以及U和V指示的两个色度分量。亮度(luma)分量Y表示亮度或灰度水平强度,例如,在灰度等级图像中两者相同;而两个色度(chroma)分量U和V表示色度或颜色信息分量。相应地,YUV格式的图像包括亮度采样值(Y)的亮度采样阵列,和色度值(U和V)的两个色度采样阵列。RGB格式的图像可以转换或变换为YUV格式,反之亦然,该过程也称为色彩变换或转换。如果图像是黑白的,该图像可以只包括亮度采样阵列。本申请实施例中,由图像源16传输至图像预处理器18的图像也可称为原始图像数据17。
图像预处理器18,用于接收原始图像数据17并对原始图像数据17执行预处理,以获取经预处理的图像19或经预处理的图像数据19。例如,图像预处理器18执行的预处理可以包括整修、色彩格式转换(例如,从RGB格式转换为YUV格式)、调色或去噪。
编码器20或称视频编码器20,用于接收经预处理的图像数据19,采用预测模式对经过预处理的图像数据19进行处理,从而提供经编码图像数据21(或称视频码流)。在一些实施例中,编码器20可以用于执行后文所描述的各个视频编码方法的实施例,以实现本申请所描述的图像生成方法。
通信接口22,可用于接收经编码图像数据21,并通过链路13将经编码图像数据21传输至目的地设备14。通信接口22可用于将经编码图像数据21封装成合适的格式,例如数据包,以便在链路13上传输。
目的地设备14包括通信接口28、解码器30,图像后处理器32和显示设备34。下面对目的地设备14中所包含的各个部件或装置进行逐一地描述,具体如下:
通信接口28,用于从源设备12接收经编码图像数据21。另外,通信接口28还用 于藉由源设备12和目的地设备14之间的链路13接收经编码图像数据21,链路13为直接有线或无线连接,任何类别的网络例如为有线或无线网络或其任何组合,或任何类别的私网和公网,或其任何组合。通信接口28还可以用于解封装通信接口22所传输的数据包以获取经编码图像数据21。
需要说明的是,通信接口28和通信接口22都可以是单向通信接口或者双向通信接口,以及可以用于发送和接收消息,和/或用于建立通信链路,并通过该链路传输例如经编码图像数据传输的图像数据。
解码器30(或称视频解码器30),用于接收经编码图像数据21并提供经解码图像数据31或经解码图像31。在一些实施例中,解码器30可以用于执行后文所描述的各个视频解码方法的实施例,以实现本申请所描述的图像生成方法。
图像后处理器32,用于对经解码图像数据31执行后处理,以获得经过后处理图像数据33。图像后处理器32执行的后处理可以包括:色彩格式转换(例如,从YUV格式转换为RGB格式)、调色、整修或重采样,或任何其它处理,还可用于将经后处理图像数据33传输至显示设备34。
显示设备34,用于接收经后处理图像数据33以便向用户或观看者显示图像。显示设备34包括任何类别的用于呈现经重构图像的显示器,例如,集成的或外部的显示器或监视器。进一步地,显示器可以包括液晶显示器(liquid crystal display,LCD)、有机发光二极管(organic light emitting diode,OLED)显示器、等离子显示器、投影仪、微LED显示器、硅基液晶(liquid crystal on silicon,LCoS)、数字光处理器(digital light processor,DLP)或任何类别的其它显示器。
应理解,图2A所示的源设备12和目的地设备14可以是单独的设备,也可以集成在同一设备中,即所述集成的设备包括源设备12和目的地设备14两者的功能性。在一种可能的实施方式中,可以使用相同硬件和/或软件,或使用单独的硬件和/或软件,或其任何组合来实施源设备12或对应的功能性以及目的地设备14或对应的功能性。
此外,基于上述描述可知,不同单元的功能性或图2A所示的源设备12和/或目的地设备14的功能性的存在和(准确)划分可能根据实际设备和应用有所不同。源设备12和目的地设备14可以包括各种设备中的任一个,包含任何类别的手持或静止设备,例如,笔记本或膝上型计算机、移动电话、智能手机、平板或平板计算机、摄像机、台式计算机、机顶盒、电视机、相机、车载设备、显示设备、数字媒体播放器、视频游戏控制台、视频流式传输设备(例如内容服务服务器或内容分发服务器)、广播接收器设备、广播发射器设备等,本申请实施例对源设备12和目的地设备14的具体结构和实现形态不予限制。
编码器20和解码器30都可以为各种合适电路中的任一个,例如,一个或多个微处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)、离散逻辑、硬件或其任何组合。如果部分地以软件实施所述技术,则设备可将软件的指令存储于合适的计算机可读存储介质中,且可使用一个或多个处理器来执行计算机程序指令来执行本申请所述的图像生成方法。
在一示例中,以图2A所示的视频编码及解码系统10仅为示例,本申请实施例的技术方案可以适用于不必包含编码和解码设备之间的任何数据通信的视频编码设置,例如,视频编码或视频解码。在其它示例中,数据可从本地存储器检索、在网络上流式传输等。视频编码设备可以对数据进行编码,并且将数据存储到存储器中,和/或视频解码设备可以从存储器检索数据并且对数据进行解码。
参见图2B,是根据一示例性实施例的包含编码器20和解码器30的视频译码系统40的结构示意图。所述视频译码系统40可以实现本申请实施例的各种方法步骤。在所述各种实施例中,视频译码系统40可以包含成像设备41、编码器20、解码器30(和/或藉由处理单元46实施的视频编/解码器)、天线42、处理器43、存储器44和显示设备45。
如图2B所示,成像设备41、天线42、处理单元46、编码器20、解码器30、处理器43、存储器44和显示设备45能够互相通信。且所述处理单元46中可以仅包含编码器20或者只包含解码器30。
在一示例中,天线42用于传输或接收视频码流,或者视频数据的经编码比特流。另外,显示设备45还可以用于呈现视频数据。在一种实现方式中,处理单元46可以包含专用集成电路(application-specific integrated circuit,ASIC)逻辑、图形处理器、通用处理器等。视频译码系统40也可以包含处理器43,该处理器43类似地可以包含ASIC逻辑、图形处理器、通用处理器等。进一步地,处理单元46可以通过硬件实施,如视频编码专用硬件等。
处理器43可以通过通用软件、操作系统等实施。
存储器44可以是任何类型的存储器,例如易失性存储器,例如,静态随机存取存储器(Static Random Access Memory,SRAM)、动态随机存储器(Dynamic Random Access Memory,DRAM)等。或非易失性存储器(例如,闪存)等。其中,处理单元46可以访问存储器44,例如用于实施图像缓冲器。此外,处理单元46还可以包括存储器,例如,缓存等。
在一种实现方式中,通过逻辑电路实施的编码器20包括图像缓冲器和图形处理单元,所述图像缓冲器可通过处理单元46或存储器44来实施;所述图形处理单元可通过处理单元46来实施。此外,一种可能的情况是,将图形处理单元耦合至图像缓冲器。在一示例中,图形处理单元包含通过处理单元46实施的编码器20。
解码器30可以以类似方式通过处理单元46的方式实施。在一示例中,所述解码器30包括图像缓冲器和图形处理单元。所述图形处理单元可以耦合至图像缓冲器。在一示例中,图形处理单元包含通过处理单元46实施的解码器30。
天线42用于接收视频码流或视频数据的经编码比特流。具体地,经编码比特流中包括与编码视频帧相关的数据、指示符、索引值、模式选择数据等,例如与编码分割相关的数据,例如,变换系数或经量化变换系数,可选指示符,和/或定义编码分割的数据。此外,视频译码系统40还可包含耦合至天线42并用于解码经编码比特流的解码器30。显示设备45用于显示图像帧。
应理解,本申请实施例中关于编码器20的功能描述,解码器30可以用于执行与编码器20相反的功能。解码器30可以用于接收并解码相关视频数据。需要说明的是, 本申请描述的解码方法主要用于解码过程,此过程在编码器20和解码器30均存在。
参见图3A,为本申请实施例提供的一种视频编解码方法的流程示意图,可应用于前述图2A和2B所示的系统。具体地,该方法可以概况为以下五个步骤,分别是:输入视频110、视频编码120、视频码流传输130、视频解码140和输出视频150。
其中,步骤“输入视频110”中将采集设备,比如摄像头采集的无损视频或图像输入给编码器;步骤“视频编码120”中将获取的视频或图像通过H.264或者H.265编解码器进行压缩编码,生成编码后的视频码流;然后在步骤“视频码流传输130”中将视频码流上传至云端服务器,以及用户从云端服务器下载视频流的过程。步骤“视频解码140”包括终端设备将从云端下载的视频码流通过解码器进行解码的过程,最后在步骤“输出视频150”输出并显示解码后的视频图像。
进一步地,如图3B所示,在图3A所示的编解码过程中还包括步骤“跨GOP运动矢量计算1201”和“视频质量提升1401”。其中,所述跨GOP运动矢量计算是指相邻的两个GOP之间的运动矢量计算。步骤1201跨GOP运动矢量计算主要作用于视频编码120过程;步骤1401视频质量提升过程主要作用于视频解码140之后和输出视频150之前。具体地,这两个步骤的实现过程在后续实施例中会详细介绍。
可选的,所述“跨GOP运动矢量计算1201”和“视频质量提升1401”的步骤可以通过程序代码或者通过相应的神经网络模型来实现,比如通过新增的单元模块来实现上述步骤1201和1401,或者通过现有的处理单元(包括编码器和解码器)来实现。
下面简要介绍视频编解码标准H.264的视频编码层(video coding layer,VCL)编解码结构原理。如图4A所示,为一种编码器20的结构示意图。具体地,编码器20的执行功能包括两条路径,一条是前向路径,另一条是重构路径。对于前向路径,输入帧以宏块或子块为单位被编码器进行帧内(Intra)或帧间(Inter)编码处理。如果是帧内编码,其预测值由当前图像中像素预测得出;如果是帧间编码,其预测值由参考图像运动补偿获得,所述参考图像可以在过去或未来(在显示顺序上)仪编码,解码,重建和滤波的帧中选取。
其中,预测值与当前块相减后,产生一个残差块,该残差块经变换和量化会产生一组量化后的变换系数,再经熵编码,与解码所需的一些边信息(如预测模式、量化参数、运动矢量信息等)一起形成压缩视频码流,最后交给网络抽象层(network abstraction layer,NAL)供传输和存储。
在重构路径中,为了提供预测用的参考图像,编码器需要具备重建图像的功能,因此必须使变换系数图像经过反量化处理,将反变换后得到的残差与预测值相加,得到未经滤波的图像(重构块)。最后将重构块经过滤波器进行滤波处理,得到重建参考图像,即重构帧。
参见图4B所示,为一种解码器30的结构示意图。解码器30从NAL单元接收视频码流,该视频码流经过熵解码后得到变换系数,然后再经反量化和反变换后得到残差,使用解码码流得到的头信息,解码器产生的预测值,预测值与残差相加,再经滤波最后得到解码图像。
在编码时,将视频帧分为帧内(intra)和帧间(inter)两种编码模式。首先,将每个视频帧划分成块(或称宏块),以便将帧的处理在块的层次上进行。然后,利 用视频帧内存在的空间冗余性,通过相邻像素预测本块像素,对视频块进行不同域之间的变换。从时域转化到另一个域,使得变换系数集中到少数几个点上;或利用连续的视频帧之间具有的时间冗余性,通过运动矢量估计在参考帧内搜索到和当前块相匹配的视频块,然后计算两块之间的差值,并对差值进行变化。最后,对变换系数和运动矢量进行熵编码。
本实施例主要是应用于帧间预测时的编解码,下面对帧间预测以及帧间预测的各种技术进行介绍。
帧间预测和编码主要是利用视频信号的时域相关性,通过运动估计和补偿把视频信号的时域冗余信息去掉,从而达到压缩视频数据的目的。由于视频信号的时域相关性远大于其空域相关性,所以通过采用帧间预测和编码可以更大地降低编码码流。
在帧间编码中,一个帧组或称图像组(Group of pictures,GOP)中主要包括I帧(Intra Frame)和P帧(Predict Frame)两种帧模式,运动估计采用正像素运动矢量。其中,I帧也称关键帧,它只进行帧内宏块的预测编码,可以保留更多的信息。P帧也称预测帧,进一步地,所述P帧为前向预测帧,可以理解为P帧需要借助前面的帧来进行帧间的宏块的运动估计,并计算运动矢量。
下面介绍帧间编码的各种技术。
(1)块的划分
在进行运动估计时,使用的块的大小对运动估计的效果有较大影响,所以使用比较小的块可以使运动估计结果更准确,从而产生较小的残差,达到降低码率的作用。H.264在进行运动估计时使用了不同的宏块分割方式,比如一个16×16的宏块可以分为一个16×16,两个16×8或8×16,或4个8×8几种块大小。对于8×8的块,有可以分为一个8×8,两个8×4或4×8,4个4×4的块。其中,宏块的色度成分采用和亮度块相同的分割模式,只是尺寸在水平和垂直方向减半。
每个块具有一个运动矢量MV,每个MV被编码、传输,并且分割选择也被编码压缩到比特流中。对大的分割尺寸而言,MV的选择和分割类型只需较少比特,但运动补偿残差在多细节区域中的能量高。对小的分割尺寸运动补偿残差能量低,但需要较多的比特来表征MV和分割选择,所以整体而言,大的分割尺寸适用于平坦区域,小的分割尺寸适用于多细节区域。
(2)运动矢量MV预测
由于H.264支持多种宏块和子宏块的分割,如果图形细节较多时,划分的块尺寸较小,若对每个块的MV独立编码,则需要相当数目的比特。一个块的MV与邻近块具有较强的相关性,因此MV可由邻近已编码的分割块预测而得,即可以通过相邻已编码的分割块的运动矢量预测值(MV Predict,MVP)和当前宏块的MV得到与当前的差异,将所述差异编码传输。
下面对本申请实施例的技术方案做详细的描述。
本申请实施例提供的技术方案可应用于实时视频业务的场景下,比如将视频显示在移动终端及大屏设备上,通过联合视频编解码技术来提升视频质量,从而达到视频帧高分辨率,低负向效果的目的。
实施例一
本申请实施例的技术方案,基于解码器获取的各个块(或宏块)的运动矢量,按照各运动矢量直接对连续的多个视频帧进行像素对齐得到对齐帧,并且获取视频帧之间的残差得到残差帧,将对齐帧和残差帧送入神经网络对待处理图像进行超分处理,得到超分后的视频帧。
具体地,如图5所示,本实施例提供了一种图像处理方法,该方法可应用解码器30,所述方法包括:
步骤101:获取视频码流,所述视频码流中包括时序相邻的第一帧图像、第二帧图像和第三帧图像。
其中,所述视频码流为输入的视频经过编码器编码压缩后输出的码流或比特流,该视频码流中包括两帧或两帧以上图像。所述时序相邻是指在时间上是连续拍摄(或生成)的帧。
在一示例中,在编码阶段,编码器将视频码流划分为至少一个图像组(group of pictures,GOP),每个GOP包括一个I帧和后续若干个P帧。比如包括第一GOP和第二GOP,其中,第一GOP中包括一个I帧和3个P帧,即为{I1,P1,P2,P3};第二GOP中也包括一个I帧和3个P帧,即为{I2,P4,P5,P6},进而对于该视频码流中的帧包括有:I1,P1,P2,P3,I2,P4,P5,P6。
步骤102:解码所述视频码流得到第一对齐帧、第二对齐帧、以及所述第一帧图像、所述第二帧图像和所述第三帧图像之间的至少一个残差。
解码器接收来自编码器传输的视频码流,对该视频码流进行解码得到视频信息,所述视频信息包括:组成所述视频码流的所有帧图像、相邻两帧的运动矢量以及残差。
在本实施例中,以对第二帧图像做超分为例,解析所述视频码流后得到第一运动矢量、第二运动矢量,第一对齐帧和第二对齐帧,以及至少一个残差。具体地,所述第一对齐帧是第一帧图像向第二帧图像按照第一运动矢量进行像素块移动后生成的,所述第二对齐帧是第三帧图像向第二帧图像按照第二运动矢量进行像素块移动后生成的。
例如,在视频码流的第一个GOP中,第一帧图像是I1,第二帧图像是P1,第三帧图像是P2,以第二帧图像P1为目标帧(P1作为被超分的帧),生成所述第一对齐帧和所述第二对齐帧的过程包括:第一帧图像I1和第二帧图像P1的相同位置预先被划分为多个宏块,例如划分为3×3的9个宏块。每个宏块中包括多个像素点,每个像素点对应一个像素值,所述I1帧的第一个宏块的像素值与目标帧P1中的最佳匹配的宏块之间的相对位移为I1帧第一个宏块的运动矢量,可表示为MV 11。对I1帧的9个宏块分别在P1帧中查找相匹配的宏块,进而可以得到I1帧的9个运动矢量,这9个运动矢量统称为“第一运动矢量”,可以通过一个矩阵MV1表示,例如,
第一运动矢量,
Figure PCTCN2020108829-appb-000007
一种可能的情况是,如果I1帧中的某一宏块在目标帧P1中没有匹配的宏块,则该宏块的运动矢量取(0,0),表示该宏块不运动,匹配的位置为该宏块所在的原位置。第一帧图像I1中的每个宏块按照第一运动矢量MV1做像素块移动后形成的图像为第 一对齐帧。具体地过程可参见标准中关于“图像运动补偿对齐”的相关描述,本实施例此处不详细赘述。
同理地,第三帧图像P2按照第二运动矢量MV2向目标帧P1进行像素块移动后生成第二对齐帧,具体的实现过程与上述生成第一对齐帧的过程相似,不再赘述。
所述残差为前一帧图像按照运动矢量向后一帧图像做运动补偿后与所述后一帧图像的每个宏块之间的像素差。具体地,第一帧图像I1与第二帧图像P1之间的残差(又称第一残差)可定义为,第一帧图像I1按照第一运动矢量MV1向所述第二帧图像P1做运动补偿对齐后生成的所述第一对齐帧,该第一对齐帧与第二帧图像P1的每个宏块的像素差。若所述第一对齐帧包括3×3的9个宏块,则与所述第二帧图像P1的9个宏块之间会产生9个像素差,每个所述像素差可以表示为
Figure PCTCN2020108829-appb-000008
例如t1时刻生成第一帧图像I1,t2时刻生成第二帧图像P1,则
Figure PCTCN2020108829-appb-000009
表示所述第一对齐帧的第一个宏块与所述P1帧的第一个宏块之间的残差,可以理解地,所述第一残差可以通过矩阵表示,例如,
第一残差为,
Figure PCTCN2020108829-appb-000010
其中i为正整数,且1≤i≤9。
同理地,方法还包括获得第二残差,所述第二残差为所述第二对齐帧与所述第二帧图像P2之间的每个宏块的像素差,所述第二对齐帧是第二帧图像P1按照第二运动矢量MV2向所述第三帧图像P2做运动补偿对齐后生成的。例如在t3时刻生成第三帧图像P2的情况下,所述第二残差表示为
Figure PCTCN2020108829-appb-000011
另外,还包括获取第一帧图像P1的残差,由于第一帧图像P1是整个视频码流的第一帧,所以在前没有帧图像,所以所述第一帧图像P1的残差为0。
所以,通过步骤102得到所述第一帧图像I1、所述第二帧图像P1和所述第三帧图像P2之间的至少一个残差包括:得到所述第一残差
Figure PCTCN2020108829-appb-000012
和所述第二残差
Figure PCTCN2020108829-appb-000013
步骤103:根据所述至少一个残差生成至少一个残差帧。
具体地,步骤103包括:
根据所述第一残差
Figure PCTCN2020108829-appb-000014
生成第一残差帧,和,根据所述第一残差
Figure PCTCN2020108829-appb-000015
和所述第二残差
Figure PCTCN2020108829-appb-000016
生成第二残差帧。
其中,所述第一残差与所述第一对齐帧满足以下关系式:
Figure PCTCN2020108829-appb-000017
其中,
Figure PCTCN2020108829-appb-000018
表示所述第二帧图像P1,
Figure PCTCN2020108829-appb-000019
表示所述第一对齐帧,
Figure PCTCN2020108829-appb-000020
表示所述第一残差,i表示所述第一帧图像的一个宏块,
Figure PCTCN2020108829-appb-000021
表示所述宏块i按照其对应的运动矢量
Figure PCTCN2020108829-appb-000022
移动后的宏块,t1表示所述第一帧图像I1的生成时刻,t2表示所述第二帧图像P1的生成时刻。
将所述第一残差
Figure PCTCN2020108829-appb-000023
表示的各个宏块的像素值(RGB值)还原得到所述第一残差帧。
同理地,所述第二残差与所述第二对齐帧满足以下关系式:
Figure PCTCN2020108829-appb-000024
其中,
Figure PCTCN2020108829-appb-000025
表示所述第三帧图像P2,
Figure PCTCN2020108829-appb-000026
表示所述第二对齐帧,
Figure PCTCN2020108829-appb-000027
表示所述第二残差,i表示所述第二帧图像的一个宏块,
Figure PCTCN2020108829-appb-000028
表示所述宏块i按照其对应的运动矢量
Figure PCTCN2020108829-appb-000029
移动后的宏块,t3表示所述第三帧图像P2的生成时刻。
将上述公式2代入到所述公式1中得到第三帧图像P2与所述第一帧图像I1之间的对应关系为:
Figure PCTCN2020108829-appb-000030
其中,
Figure PCTCN2020108829-appb-000031
为从第一帧图像I1到第三帧图像P2之间的累积残差,即累积残差为第一残差
Figure PCTCN2020108829-appb-000032
与第二残差
Figure PCTCN2020108829-appb-000033
的和。
根据所述第一残差
Figure PCTCN2020108829-appb-000034
和所述第二残差
Figure PCTCN2020108829-appb-000035
生成第二残差帧,包括根据所述累积残差
Figure PCTCN2020108829-appb-000036
表示的各个宏块的像素值(RGB值)做还原得到所述第二残差帧。
本实施例通过上述公式1和公式2得到第一帧图像与第三帧图像之间的对应关系,即实现了间隔的两帧,所述第三帧图像P2通过关键帧I1表达。
应理解,根据上述公式1的对应关系,若还包括第四帧图像P3或者第五帧图像P4等,也都可以通过累积的关系得到该帧与关键帧I1之间的对应关系,即通过上述公式1和公式2的变形得到所述第四帧图像P3或者所述第五帧图像P4与所述关键帧I1的表达式,以及累积残差。
步骤104:根据所述至少一个残差帧、所述第一对齐帧和所述第二对齐帧对所述第二帧图像进行超分处理,得到超分后的第二帧图像。
具体地,如图6所示,步骤104包括:
步骤1041:将所述至少一个残差帧输入至神经网络进行特征提取,得到至少一个第一特征图像。
其中,所述神经网络中包括了多个不同功能的层级。所述多个层级中每个层级所具有的功能包括但不限于卷积、池化、激活等运算操作,经过这些层级的处理可以将输入图像数据进行特征提取、分析和整合,最终输出超分后的图像。
在一示例中,将所述第一残差帧和所述第二残差帧输入至所述神经网络,其中,所述第一残差帧通过第一残差
Figure PCTCN2020108829-appb-000037
生成,所述第二残差帧通过所述第一残差
Figure PCTCN2020108829-appb-000038
和所述第二残差
Figure PCTCN2020108829-appb-000039
生成(即累积残差
Figure PCTCN2020108829-appb-000040
)。
可选的,一种实现方式是,所述神经网络可以包括特征提取网络和超分网络,所述超分网络包括第一超分网络和第二超分网络。如图6所示,步骤1041具体包括:将所述第一残差帧和所述第二残差帧输入至所述特征提取网络得到至少一个第一特征图。
步骤1042:将所述第一对齐帧、所述第二对齐帧和所述第二帧图像输入至所述神经网络进行特征提取,得到至少一个第二特征图像。
可选的,一种实现方式是,将所述第一对齐帧、第二对齐帧和第二帧图像输入至所述特征提取网络后,经过特征提取得到至少一个第二特征图像。
步骤1043:将所述至少一个第一特征图像输入至所述第一超分网络处理,生成高 频信息。
步骤1044:将所述至少一个第二特征图输入至所述第二超分网络处理,生成亮度通道。
其中,所述第一超分网络与所述第二超分网络不同。例如两个超分网络的网络结构、复杂程度和权重参数等不同。
步骤1045:将所述高频信息与所述亮度通道相融合,生成所述超分后的第二帧图像。
本实施例中,将至少一个残差帧和至少一个对齐帧输入至同一个神经网络进行超分处理,从而提高参数利用率,减少模型参数量,提高超分处理效率。
本申请实施例提供的一种联合视频编解码的方案,采用视频编解码过程中的运动矢量信息和残差信息进行视频帧的对齐和残差累积,并且采用对齐帧和残差帧作为神经网络的输入,由于累积帧经过神经网络后输出了高频信息,将该高频信息贴合回亮度通道中,可以加强边缘细节,从而弥补超分过程中,由于低分辨率视频帧缺失了高频部分的信息而导致的图像分辨率低的问题。
另外,在一种可能的实施方式中,在上述步骤104中将所述至少一个残差帧和至少一个对齐帧送入神经网络进行超分处理之前,方法还包括:
对所述至少一个残差帧和对齐帧划分感兴趣区域,所述感兴趣区域为残差帧或对齐帧中需要进行超分处理的宏块;进一步地,可以通过残差帧或对齐帧中划分的宏块的像素值与预设值之间的大小来确定所述感兴趣区域。
例如以第一残差帧为例,确定所述第一残差帧中感兴趣区域的宏块,所述感兴趣区域的宏块为当前宏块中所包含的所有像素值之和超过第一预设值的宏块;即将第一残差帧划分为多个宏块,比如划分3×3的9个宏块,计算每个宏块中所有像素点的像素和,然后比较每个宏块的像素和与所述第一预设值的大小,并筛选出所有像素和大于等于第一预设值的宏块,作为感兴趣区域,并对这部分区域做超分处理。
对应地,将像素和小于所述第一预设值的宏块设为非感兴趣区域,所述非感兴趣区域的宏块为低纹理区域,所以对这部分区域可以直接采用前一帧处理的超分结果,不需要再进行超分处理,从而节约运算量,提高了超分处理的效率。
另外,本实施例中,所述方法还包括:对处理第一残差帧之外的其它所有残差帧(例如第二残差帧)划分感兴趣区域。
具体地,一种可能的实施方式是,与前述对第一残差帧划分感兴趣区域的方法相同,比较每个宏块的像素值与所述第一预设值的大小来确定各个宏块是否是感兴趣区域,详细的确定过程与前述对第一残差帧中感兴趣区域的确定过程相似,此处不再赘述。
另一种可能的实施方式是,采用与第一残差帧确定的感兴趣区域相同位置的宏块作为第二残差帧的感兴趣区域,比如,在确定所述第一残差帧的感兴趣区域的宏块为宏块4~8(一共9个宏块)时,对应地,确定第二残差帧中的感兴趣区域的宏块也为宏块4~8。
在另一种可能的实施方式中,确定所述第一残差帧的感兴趣区域还可以采用与对齐帧相同的区域。即先对至少一个对齐帧进行感兴趣区域和非感兴趣区域的划分,然 后再将对齐帧中确定的感兴趣区域作为第一残差帧的感兴趣区域。
具体地,对所述至少一个对齐帧划分感兴趣区域,包括:一种可能的划分方式是,比较第一对齐帧的每个宏块的像素值与第二预设值的大小,并将所有大于等于所述第二预设值的宏块设为感兴趣区域,将小于所述第二预设值的宏块设为非感兴趣区域。同理地,确定第二对齐帧的感兴趣区域可以与所述第一对齐帧的感兴趣区域相同。
在确定出所有残差帧和所有对齐帧的感兴趣区域之后,将所有所述感兴趣区域的宏块输入至所述神经网络进行特征提取,对于剩余的非感兴趣区域的宏块,可以利用前一帧图像的超分结果,再结合本次对感兴趣区域所做的超分处理得到的超分结果,最后融合成超分后的图像。
本实施例中,通过对每帧图像划分感兴趣区域和非感兴趣区域,即分析每帧图像的残差累积纹理细节和前后帧运动特点,并且只对感兴趣区域的宏块进行超分处理,对于非感兴趣区域则可以直接采用前一帧的超分处理结果,从而避免对整张图像进行超分处理,降低了计算量,减小功耗、时延和内存的开销,提高了单帧图像超分效率,进而达到短时间内实时获取超分图像的有益效果。
需要说明的是,上述实施例中例举了一个GOP的前三帧图像,包括但不限于这三帧图像,还可以是其他的帧图像,或者位于不同的GOP。例如一种可能的情况是,第一帧图像为第一GOP中的最后一帧图像,第二帧图像和第三帧图像为第二GOP中的前两帧图像。另一种可能的情况是,第一帧图像和第二帧图像为第一GOP中的最后两帧图像,第三帧图像为第二GOP中的第一帧图像。其中,第一GOP和第二GOP时序相邻。并且,待超分的目标帧可以是上述第一帧图像、第二帧图像和第三帧图像中的任意一帧,本实施例不予限制。
实施例二
下面以第一帧图像是第一GOP的最后一帧,第二帧图像和第三帧图像为第二GOP中的前两帧为例,对本申请提供的图像处理方法进行介绍。该方法应用于上述实施例所述的解码器30、或视频译码系统40、或目的地设备14。
进一步地,如图7A所示,该方法包括:
步骤701:获取视频源。其中,所述视频源可以来自短视频或视频通话APP平台,所述短视频或视频码流可以从云端下载和获取。
步骤702:编码器对所述视频源进行编码生成视频码流。
在一示例中,在编码阶段,编码器将视频码流划分为两个GOP,分别是第一GOP和第二GOP,其中,第一GOP中包括一个I帧和3个P帧,即为{I1,P1,P2,P3};第二GOP中也包括一个I帧和3个P帧,即为{I2,P4,P5,P6},所以对于该视频码流中的帧包括有:I1,P1,P2,P3,I2,P4,P5,P6。选择这些帧中的任意一帧为目标帧,对其进行超分。
其中,步骤702在视频编码阶段,还包括:7021:编码器获取跨GOP运动矢量,并根据该跨GOP运动矢量进行帧间预测。
所述“跨GOP运动矢量”是指前一个GOP的最后一帧图像与后一个GOP的第一帧图像之间的运动矢量。比如第一GOP和第二GOP之间的跨GOP运动矢量是指第一GOP的P3帧与第二GOP的I2帧之间的运动矢量。利用所述跨GOP运动矢量向目标帧做对 齐处理,生成所述对齐帧,进而保证了跨GOP帧间“对齐”的准确性。
在一示例中,在所述目标帧是I2帧(第二GOP的第一帧)的情况下,所述做对齐处理是指,第一GOP的P3帧按照所述跨GOP运动矢量向所述I2帧做像素块移动,第二GOP的P4帧按照运动矢量向所述I2帧做像素块移动,生成两个对齐帧。
其中,在利用运动矢量进行运动估计时,可以根据精度的需求,使用像素或者亚像素级帧间像素块的运动矢量来进行估计,并且根据运动矢量计算得到运动补偿信息,比如残差。具体地,运动矢量、对齐帧和残差的对应关系参见实施例一中各公式的相关描述,本实施例对此不再赘述。
本实施例在编码的过程中,增加了时序相邻的GOP之间的运动矢量(即跨GOP运动矢量)来进行运动估计,并生成视频码流,为后续解码阶段的帧对齐操作提供了连续的运动矢量,从而保证了帧间预测的连续性。
步骤703:解码器获取视频码流。具体地,一种实现方式是,编码器将视频码流传输到云端,解码器从云端下载并获取该视频码流。或者,另一种实现方式是,编码器直接将编码后的视频码流传输至解码器。
步骤704:解码器对所述视频码流进行解密得到至少一个视频帧、相邻帧间的运动矢量和残差等信息。所述运动矢量包括同一个GOP中相邻两帧之间的运动矢量,还包括跨GOP运动矢量。另外,还包括所有相邻两个视频帧之间的残差。
步骤705:解码器根据所述视频帧、运动矢量和残差等信息对待超分图像进行预处理生成至少一个对齐帧和至少一个残差帧,具体包括:
步骤7051:将目标帧的前一帧和后一帧分别按照各自的运动矢量向所述目标帧做“对齐”生成第一对齐帧和第二对齐帧。
在生成对齐帧的过程中,为了提高超分图像的精确度,运动矢量的精度可划分为像素和亚像素。对于像素精度的运动矢量,可以按照正常的对齐操作生成对齐帧。
对于亚像素精度的运动矢量,则需要对视频帧先进行放大再做对齐,然后再将对齐后的图像还原至原图大小。具体包括:提取相邻的连续多个视频帧,并选择一个待超分的图像作为目标帧,其余帧为前后辅助帧。前后辅助帧根据与所述目标帧之间的运动矢量利用公式4,将像素点移动到指定的位置。
(x′,y′)=(x+mv x,y+mv y),(公式4)
其中,(x′,y′)表示像素移动后的位置;(x,y)表示像素的原始位置;(mv x,mv y)表示运动矢量。
另外,对于亚像素精度的运动矢量在对齐处理时,先对连续视频帧进行放大和对齐,然后再将对齐帧还原至原图大小。因为在编解码器计算亚像素精度运动矢量时,先将当前目标帧放大相应像素精度的倍数,然后再和前一帧辅助帧进行运动矢量计算,在这种情况下获得的亚像素精度运动矢量是非整数(即小数)。在对齐过程中,首先将连续多帧视频帧建立放大倍数相应的图像缓存,其中待超分的图像为目标帧,其余帧为辅助帧。然后辅助帧根据与当前目标帧之间的运动矢量和超分放大倍数,利用公式5将像素点移动到指定的位置。
(x′,y′)=(x+mv x×factor,y+mv y×factor),(公式5)
其中,factor表示超分放大倍数,(x′,y′)表示像素移动后的位置;(x,y)表示像素的原始位置;(mv x,mv y)表示运动矢量。
本实施例中,通过对视频帧进行放大和移动,使得仅有
Figure PCTCN2020108829-appb-000041
的像素存在像素值,其余均为0。为了减少后续超分模型输入的数据量,将放大的图像缓存加权,并采样到原始图像帧大小,完成对齐。并且利用亚像素精度做宏块对齐时,能够使变化的区域更精准的落入到宏块中的像素点上,从而提高对齐的精确度。
步骤7052:根据所述相邻两帧之间的残差,按照关键帧进行残差累积生成关于所述目标帧的至少一个残差帧。
其中所述关键帧为每个GOP的第一帧。在步骤7052中可利用回溯技术对残差进行累积,生成至少一个残差帧。
在所述回溯技术中对残差进行累计,即得到每个GOP中的每个辅助帧与关键帧(第一帧)之间的对应关系,可以理解为,得到每个辅助帧通过关键帧和累积的残差来表达。
在一示例中,以上述两个GOP中的P3帧、I2帧和P4帧为所述第一帧图像、第二帧图像和第三帧图像,且I2帧为待超分的目标帧,生成所述至少一个残差帧的过程包括:
(1)对于P3帧的残差帧,根据上述实施例一的公式1至3得到,第一GOP的P3帧与I1帧的关系表达,
Figure PCTCN2020108829-appb-000042
其中,
Figure PCTCN2020108829-appb-000043
表示所述P3帧,
Figure PCTCN2020108829-appb-000044
表示所述P3帧按照跨GOP运动矢量向I2帧做对齐生成的对齐帧,t1为I1帧图像的生成时刻,t2为P1帧图像的生成时刻,t3为P2帧图像的生成时刻,t4为P3帧图像的生成时刻,
Figure PCTCN2020108829-appb-000045
表示所述I1帧与所述P1帧的残差,
Figure PCTCN2020108829-appb-000046
表示所述P1帧与所述P2帧的残差,
Figure PCTCN2020108829-appb-000047
表示所述P2帧与所述P3帧的残差,
Figure PCTCN2020108829-appb-000048
表示所述I1帧与所述P3帧之间的累积残差,i表示图像的一个宏块,
Figure PCTCN2020108829-appb-000049
表示所述宏块i按照其对应的运动矢量移动后的宏块。
根据所述I1帧与所述P3帧之间的累积残差
Figure PCTCN2020108829-appb-000050
生成第一残差帧。
(2)对于I2帧的残差帧,由于I2帧为第二GOP的第一帧,所以I2帧的残差为。
(3)对于P4帧的残差帧,P4帧为第二GOP的第二帧,所以根据上述公式2,得到P4帧与I2帧之间的对应关系为:
Figure PCTCN2020108829-appb-000051
其中,
Figure PCTCN2020108829-appb-000052
表示所述P4帧,
Figure PCTCN2020108829-appb-000053
表示所述P4帧按照运动矢量向所述I2帧做对齐后生成的对齐帧,t5为I2帧图像的生成时刻,t6为P4帧图像的生成时刻,
Figure PCTCN2020108829-appb-000054
表示所述I2帧与所述P4帧的残差,i表示图像的一个宏块,
Figure PCTCN2020108829-appb-000055
表示所述宏块i按照其对应的运动矢量
Figure PCTCN2020108829-appb-000056
移动后的宏块。
根据所述I2帧与所述P4帧之间的累积残差
Figure PCTCN2020108829-appb-000057
生成第二残差帧。
本实施例中,可以计算得到每个GOP的第一帧与后面任一帧的对应关系,不再依赖于相邻两帧的关系表达,而是直接建立了t1时刻生成的关键帧I1与t2、t3、t4生成的各个图像之间的对应关系,从而打破相邻两个P帧与P帧之间的相互依赖,避免存储连续两帧之间的对应关系,本方法节约存储空间。另外,生成的累积残差帧包含了时序上的相关冗余信息,为后续运动主体的细节恢复提供了更多的原始信息。
步骤706:判断待生成的图像是否是视频码流的第一帧,所述第一帧为视频码流播放的第一个视频帧,即第一GOP的I1帧。
如果不是第一帧,则执行步骤707至709;如果是第一帧,则执行步骤710。
步骤707:对于非I1帧,对所述至少一个残差帧和对齐帧划分感兴趣区域和非感兴趣区域。
具体地,一种可能的筛选过程包括:利用运动矢量和残差信息选取超分感兴趣区域。如果在残差累积图中该区域纹理信息很少,则该区域为非感兴趣区域,可以直接使用前一帧该区域超分结果,从而减少该帧超分计算量;反之,则为感兴趣区域,需要做超分处理。进一步地,所述感兴趣区域可以通过公式6和公式7来确定。
Figure PCTCN2020108829-appb-000058
Figure PCTCN2020108829-appb-000059
其中,I SR表示所有感兴趣区域的宏块集合,K表示每个宏块中的像素总数,I residual表示当前一个图像宏块,i k表示第k个像素的像素值,sign(x)为符号函数,具体可通过公式7来定义,
Figure PCTCN2020108829-appb-000060
I表示目标帧中的宏块个数,U函数表示将宏块取并集。门限值(Threshold)为常量,且可以预设。公式7表示如果某一个宏块中的所有像素值之和小于预设门限值,则sign(x)=0,该宏块为非感兴趣区域的宏块;如果sign(x)=1,则该i k像素值所对应的宏块为感兴趣区域,需要进行超分处理。然后将所有感兴趣区域的宏块取并集,作为最后需要进行超分处理的多纹理区域。
步骤7081:非感兴趣区域的宏块采用前一帧超分的处理结果。对于非感兴趣区域的宏块,提取所有非感兴趣区域的宏块在前一帧超分得到的图像信息,所述图像信息包括前一帧图像对应位置宏块的超分结果。
步骤7082:将感兴趣区域的宏块送入至神经网络进行超分处理。其中,所述感兴趣区域的宏块是对以下帧中的感兴趣区域。
在一示例中,将上述步骤7051生成的所述第一对齐帧和所述第二对齐帧的感兴趣区域的宏块,以及目标帧I2送入所述神经网络,同时,将步骤7052生成的所述第一残差帧和所述第二残差帧的感兴趣区域的宏块送入所述神经网络,分别得到关于目标帧I2的亮度通道和高频信息。
步骤709:将所述亮度通道与所述高频信息相融合得到超分后的图像。如图7B所示。
其中,所述神经网络包括:特征提取网络和超分网络。进一步地,所述特征提取网络用于对输入的至少一个对齐帧和至少一个残差帧进行特征提取,分别获得对齐帧 的特征图与残差帧的特征图。其中,所述特征提取网络可以为共享网络,即共享给对齐帧和残差帧做特征提取,从而可以提高参数利用率,减少神经网络的参数量。然后,将提取的特征图分别送入不同的超分网络,通过对齐帧得到的特征图经过第一超分网络后输出亮度通道;通过残差帧得到的特征图经过第二超分网络后输出高频信息,最后将高频信息贴合回亮度通道中,以加强待超分图像边缘细节和高频信息,最终得到高质量的超分图像。
步骤710:在上述步骤706中,如果待超分的图像是所述第一帧,则不划分感兴趣区域,直接将生成的对齐帧和残差帧输入至神经模型进行超分处理,得到超分后的图像。因为第一帧不存在图像不存在前一帧和前一帧的超分结果,因此不需要进行感兴趣和非感兴趣区域划分,而要将整张图都进行超分。
否则,需要将输出结果与步骤7081中非感兴趣区域对应的前一帧超分结果进行拼接,得到当前帧完整的超分结果,如图7B所示,将三个当前帧超分块,分别是帽檐、鼻子和帽饰穗与前一帧超分块进行融合得到超分后的图像。
本实施例中,对于待超分帧不是视频码流的第一帧时,划分感兴趣区域和非感兴趣区域,并且只对感兴趣区域的宏块进行超分处理,而对于非感兴趣区域的宏块则采用前一帧的超分结果,即将像素发生运动和累计残差中存在信息的区域进行超分,将像素不运动和累积残差信息量较小的模块采用前一帧超分的结果或者直接插值放大,从而避免对整张图像的所有宏块都进行超分处理,本方法节约输入至神经网络的运算量和运算时间,提高了超分效率。
另外,利用对齐帧相同位置的宏块是对相同细节的互相补充,这种冗余提供更大的信息量。残差实际为像素匹配块的差值补偿,包含了运动主体的边缘等信息,本实施例通过累积残差获得残差帧,用残差恢复图像高频边缘,使得主观效果更优。同时可以根据终端设备芯片的负载能力动态调节分配到不同芯片上的计算量,减小延时。
此外,本申请上述实施例所述的方法,还可以通过软件模块来实现相应的功能。如图2B所示,所述视频译码系统40包括处理单元46,或者,如图8所示提供了一种图像处理装置,该装置包括:获取单元810、处理单元820和发送单元830,此外还可以包括其它功能模块或单元,比如存储单元等。
在一个实施例中,对于所述图像处理装置可用于执行上述图5、图6和图7A中的图像处理流程。例如:获取单元810用于获取视频码流;所述视频码流中包括时序相邻的第一帧图像、第二帧图像和第三帧图像;处理单元820用于解码所述视频码流得到第一对齐帧、第二对齐帧、以及所述第一帧图像、所述第二帧图像和所述第三帧图像之间的至少一个残差;所述处理单元820还用于根据所述至少一个残差生成至少一个残差帧,以及根据所述至少一个残差帧、所述第一对齐帧和所述第二对齐帧对所述第二帧图像进行超分处理,得到超分后的第二帧图像。
发送单元830用于将所述超分后的第二帧图像输出到显示屏幕上进行显示。
其中所述第一对齐帧是所述第一帧图像向所述第二帧图像按照第一运动矢量进行像素块移动后生成的,所述第二对齐帧是所述第三帧图像向所述第二帧图像按照第二运动矢量进行像素块移动后生成的,所述残差为前一帧图像按照运动矢量向后一帧图像做运动补偿后,与所述后一帧图像的每个宏块之间的像素差。
其中,可选的,所述第一帧图像、第二帧图像和第三帧图像为第一图像组中的三帧图像;
或者,所述第一帧图像为第一图像组中的最后一帧图像,所述第二帧图像和所述第三帧图像为第二图像组中的前两帧图像;
或者,所述第一帧图像和所述第二帧图像为第一图像组中的最后两帧图像,所述第三帧图像为第二图像组中的第一帧图像。
本申请实施例对在视频码流中选择的三帧图像不进行限制,并且也不限制选择其中的哪一帧作为待超分的图像。
在本实施例的一种可能的实现方式中,处理单元820具体用于生成第一残差帧,其中,所述第一残差与所述第一对齐帧满足以下关系式:
Figure PCTCN2020108829-appb-000061
其中,
Figure PCTCN2020108829-appb-000062
表示所述第二帧图像,
Figure PCTCN2020108829-appb-000063
表示所述第一对齐帧,
Figure PCTCN2020108829-appb-000064
表示所述第一残差,i表示所述第一帧图像的一个宏块,
Figure PCTCN2020108829-appb-000065
表示所述宏块i按照其对应的运动矢量
Figure PCTCN2020108829-appb-000066
移动后的宏块,t1表示所述第一帧图像的生成时刻,t2表示所述第二帧图像的生成时刻。
在本实施例的另一种可能的实现方式中,所述处理单元820具体用于,
将所述至少一个残差帧输入至神经网络进行特征提取,得到至少一个第一特征图像;
将所述第一对齐帧、所述第二对齐帧和所述第二帧图像输入至所述神经网络进行特征提取,得到至少一个第二特征图像;
将所述至少一个第一特征图像输入至第一超分网络处理,生成高频信息;
将所述至少一个第二特征图输入至第二超分网络处理,生成亮度通道;
以及,将所述高频信息与所述亮度通道相融合,生成所述超分后的第二帧图像。
In yet another possible implementation of this embodiment, the processing unit 820 is further configured to determine the macroblocks of the region of interest in the first residual frame, and to determine, according to the macroblocks of the region of interest in the first residual frame, the regions of interest of the residual frames other than the first residual frame among the at least one residual frame; a macroblock of the region of interest is a macroblock in which the sum of all pixel values contained in the current macroblock exceeds a preset value.
In addition, the processing unit 820 is specifically further configured to input the macroblocks of all regions of interest of the at least one residual frame into the neural network for feature extraction, where the at least one residual frame includes the first residual frame and the remaining residual frames.
In yet another possible implementation of this embodiment, the processing unit 820 is specifically configured to input the macroblocks of the regions of interest in the first aligned frame and the second aligned frame, together with the second frame image, into the neural network for feature extraction, where the regions of interest in the first aligned frame and the second aligned frame are the same as the region of interest in the first residual frame.
FIG. 9 shows another possible schematic structural diagram of the video encoding device involved in the above embodiments. The video encoding device includes a processor 901, a transceiver 902, and a memory 903. As shown in FIG. 9, the memory 903 is configured to be coupled to the processor 901, and stores the computer program necessary for the video encoding device.
For example, in one embodiment, the transceiver 902 is configured to send encoded information to the decoder 30, and the processor 901 is configured to perform the encoding operations or functions of the video encoding device.
Further, the transceiver 902 is configured to obtain a video bitstream, and the processor 901 may be configured to decode the video bitstream to obtain a first aligned frame, a second aligned frame, and at least one residual among the first frame image, the second frame image, and the third frame image; generate at least one residual frame according to the at least one residual; and perform super-resolution processing on the second frame image according to the at least one residual frame, the first aligned frame, and the second aligned frame, to obtain a super-resolved second frame image.
In a specific implementation, the processor 901 is further configured to input the at least one residual frame into a neural network for feature extraction, to obtain at least one first feature image; input the first aligned frame, the second aligned frame, and the second frame image into the neural network for feature extraction, to obtain at least one second feature image; input the at least one first feature image into a first super-resolution network for processing, to generate high-frequency information; input the at least one second feature image into a second super-resolution network for processing, to generate a luminance channel; and fuse the high-frequency information with the luminance channel, to generate the super-resolved second frame image.
In another specific implementation, the processor 901 may be further configured to determine the macroblocks of the region of interest in the first residual frame, where a macroblock of the region of interest is a macroblock in which the sum of all pixel values contained in the current macroblock exceeds a preset value; determine, according to the macroblocks of the region of interest in the first residual frame, the regions of interest of the residual frames other than the first residual frame among the at least one residual frame; and input the macroblocks of all regions of interest of the at least one residual frame into the neural network for feature extraction, where the at least one residual frame includes the first residual frame and the remaining residual frames.
In yet another specific implementation, the processor 901 is specifically further configured to input the macroblocks of the regions of interest in the first aligned frame and the second aligned frame, together with the second frame image, into the neural network for feature extraction, where the regions of interest in the first aligned frame and the second aligned frame are the same as the region of interest in the first residual frame.
For the specific implementation process of the above processor 901, refer to the descriptions of Embodiment 1 and Embodiment 2 above and of FIG. 5, FIG. 6, and FIG. 7A; details are not repeated here in this embodiment.
The video decoding device provided in this embodiment further includes a computer storage medium, which may store computer program instructions. When the program instructions are executed, all the steps of the image processing methods described in the above embodiments of this application can be implemented. The computer storage medium includes a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
In the above embodiments, the implementation may be wholly or partly realized by software, hardware, firmware, or any combination thereof. When software is used, the implementation may be wholly or partly in the form of a computer program product, which is not limited in this embodiment.
This application further provides a computer program product, where the computer program product includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the above embodiments of this application are wholly or partly generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
The computer program instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one network node, computer, server, or data center to another site, computer, or server in a wired or wireless manner.
Further, when the program is executed, some or all of the steps in the embodiments of the video encoding methods and video decoding methods provided in FIG. 2A to FIG. 7A may be implemented. The storage medium in any of the devices may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
In the embodiments of this application, the above processor may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP. The processor may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The memory may include a volatile memory, for example a random-access memory (RAM); the memory may also include a non-volatile memory, for example a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory may also include a combination of the above kinds of memories.
A person skilled in the art may further appreciate that the various illustrative logical blocks and steps listed in this application may be implemented by electronic hardware, computer software, or a combination of the two. Whether such functionality is implemented by hardware or software depends on the specific application and the design requirements of the overall system. A person skilled in the art may use various methods to implement the described functionality for each specific application, but such implementation should not be understood as going beyond the protection scope of this application.
The various illustrative logical units and circuits described in this application may be implemented or operated by a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of the above designed to perform the described functions. The general-purpose processor may be a microprocessor; optionally, the general-purpose processor may also be any conventional processor, controller, microcontroller, or state machine. The processor may also be implemented by a combination of computing devices, for example a digital signal processor and a microprocessor, multiple microprocessors, one or more microprocessors combined with a digital signal processor core, or any other similar configuration.
The steps of the methods or algorithms described in this application may be directly embedded in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may be stored in a RAM memory, a flash memory, a ROM memory, an EPROM memory, an EEPROM memory, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium in the art. For example, the storage medium may be connected to the processor so that the processor can read information from the storage medium and write information to the storage medium. Optionally, the storage medium may also be integrated into the processor. The processor and the storage medium may be arranged in an ASIC, and the ASIC may be arranged in a UE. Optionally, the processor and the storage medium may also be arranged in different components of the UE.
It should be understood that, in the various embodiments of this application, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of this application.
In addition, the terms "first", "second", "third", and the like in the specification, claims, and the above drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described here can be implemented in an order other than that illustrated or described here. Furthermore, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to such processes, methods, products, or devices.
A person skilled in the art can clearly understand that the techniques in the embodiments of this application may be implemented by software plus the necessary general-purpose hardware platform. Based on such an understanding, the technical solutions in the embodiments of this application, in essence or as the part contributing to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention or in certain parts of the embodiments.
For identical or similar parts among the embodiments in this specification, reference may be made to one another. In particular, the network device/node or apparatus device embodiments are basically similar to the method embodiments and are therefore described relatively simply; for relevant parts, refer to the descriptions in the method embodiments.
The implementations of this application described above do not constitute a limitation on the protection scope of this application.

Claims (18)

  1. An image processing method, wherein the method comprises:
    obtaining a video bitstream, wherein the video bitstream comprises a first frame image, a second frame image, and a third frame image that are temporally adjacent;
    decoding the video bitstream to obtain a first aligned frame, a second aligned frame, and at least one residual among the first frame image, the second frame image, and the third frame image, wherein the first aligned frame is generated by moving pixel blocks of the first frame image toward the second frame image according to a first motion vector, the second aligned frame is generated by moving pixel blocks of the third frame image toward the second frame image according to a second motion vector, and the residual is a pixel difference between each macroblock of a later frame image and an earlier frame image after the earlier frame image is motion-compensated toward the later frame image according to a motion vector;
    generating at least one residual frame according to the at least one residual; and
    performing super-resolution processing on the second frame image according to the at least one residual frame, the first aligned frame, and the second aligned frame, to obtain a super-resolved second frame image.
  2. The method according to claim 1, wherein the generating at least one residual frame according to the at least one residual comprises:
    generating a first residual frame according to a first residual;
    wherein the first residual and the first aligned frame satisfy the following relationship:
    $I^{t_2} = \hat{I}^{\,t_1\to t_2} + R^{t_1\to t_2}$, with $\hat{I}^{\,t_1\to t_2} = \bigcup_i \mathcal{M}\big(i, \vec{v}_i^{\,t_1\to t_2}\big)$,
    where $I^{t_2}$ denotes the second frame image, $\hat{I}^{\,t_1\to t_2}$ denotes the first aligned frame, $R^{t_1\to t_2}$ denotes the first residual, $i$ denotes a macroblock of the first frame image, $\mathcal{M}\big(i, \vec{v}_i^{\,t_1\to t_2}\big)$ denotes the macroblock $i$ after being moved according to its corresponding motion vector $\vec{v}_i^{\,t_1\to t_2}$, $t_1$ denotes the generation time of the first frame image, and $t_2$ denotes the generation time of the second frame image.
  3. The method according to claim 1 or 2, wherein the performing super-resolution processing on the second frame image according to the at least one residual frame, the first aligned frame, and the second aligned frame to obtain a super-resolved second frame image comprises:
    inputting the at least one residual frame into a neural network for feature extraction, to obtain at least one first feature image;
    inputting the first aligned frame, the second aligned frame, and the second frame image into the neural network for feature extraction, to obtain at least one second feature image;
    inputting the at least one first feature image into a first super-resolution network for processing, to generate high-frequency information;
    inputting the at least one second feature image into a second super-resolution network for processing, to generate a luminance channel; and
    fusing the high-frequency information with the luminance channel, to generate the super-resolved second frame image.
  4. The method according to claim 3, wherein before the inputting the at least one residual frame into a neural network for feature extraction, the method further comprises:
    determining macroblocks of a region of interest in the first residual frame, wherein a macroblock of the region of interest is a macroblock in which a sum of all pixel values contained in the current macroblock exceeds a preset value; and
    determining, according to the macroblocks of the region of interest in the first residual frame, regions of interest of residual frames other than the first residual frame among the at least one residual frame;
    wherein the inputting the at least one residual frame into a neural network for feature extraction comprises:
    inputting the macroblocks of all regions of interest of the at least one residual frame into the neural network for feature extraction, wherein the at least one residual frame comprises the first residual frame and the remaining residual frames.
  5. The method according to claim 4, wherein the inputting the first aligned frame, the second aligned frame, and the second frame image into the neural network for feature extraction comprises:
    inputting the macroblocks of the regions of interest in the first aligned frame and the second aligned frame, and the second frame image, into the neural network for feature extraction;
    wherein the regions of interest in the first aligned frame and the second aligned frame are the same as the region of interest in the first residual frame.
  6. The method according to any one of claims 1 to 5, wherein:
    the first frame image, the second frame image, and the third frame image are three frame images in a first group of pictures;
    or, the first frame image is the last frame image in a first group of pictures, and the second frame image and the third frame image are the first two frame images in a second group of pictures;
    or, the first frame image and the second frame image are the last two frame images in a first group of pictures, and the third frame image is the first frame image in a second group of pictures.
  7. An image processing apparatus, wherein the apparatus comprises:
    an obtaining unit, configured to obtain a video bitstream, wherein the video bitstream comprises a first frame image, a second frame image, and a third frame image that are temporally adjacent;
    a processing unit, configured to decode the video bitstream to obtain a first aligned frame, a second aligned frame, and at least one residual among the first frame image, the second frame image, and the third frame image, wherein the first aligned frame is generated by moving pixel blocks of the first frame image toward the second frame image according to a first motion vector, the second aligned frame is generated by moving pixel blocks of the third frame image toward the second frame image according to a second motion vector, and the residual is a pixel difference between each macroblock of a later frame image and an earlier frame image after the earlier frame image is motion-compensated toward the later frame image according to a motion vector;
    wherein the processing unit is further configured to generate at least one residual frame according to the at least one residual, and to perform super-resolution processing on the second frame image according to the at least one residual frame, the first aligned frame, and the second aligned frame, to obtain a super-resolved second frame image.
  8. The apparatus according to claim 7, wherein:
    the processing unit is specifically configured to generate a first residual frame;
    wherein the first residual and the first aligned frame satisfy the following relationship:
    $I^{t_2} = \hat{I}^{\,t_1\to t_2} + R^{t_1\to t_2}$, with $\hat{I}^{\,t_1\to t_2} = \bigcup_i \mathcal{M}\big(i, \vec{v}_i^{\,t_1\to t_2}\big)$,
    where $I^{t_2}$ denotes the second frame image, $\hat{I}^{\,t_1\to t_2}$ denotes the first aligned frame, $R^{t_1\to t_2}$ denotes the first residual, $i$ denotes a macroblock of the first frame image, $\mathcal{M}\big(i, \vec{v}_i^{\,t_1\to t_2}\big)$ denotes the macroblock $i$ after being moved according to its corresponding motion vector $\vec{v}_i^{\,t_1\to t_2}$, $t_1$ denotes the generation time of the first frame image, and $t_2$ denotes the generation time of the second frame image.
  9. The apparatus according to claim 7 or 8, wherein:
    the processing unit is specifically configured to: input the at least one residual frame into a neural network for feature extraction, to obtain at least one first feature image; input the first aligned frame, the second aligned frame, and the second frame image into the neural network for feature extraction, to obtain at least one second feature image; input the at least one first feature image into a first super-resolution network for processing, to generate high-frequency information; input the at least one second feature image into a second super-resolution network for processing, to generate a luminance channel; and fuse the high-frequency information with the luminance channel, to generate the super-resolved second frame image.
  10. The apparatus according to claim 9, wherein:
    the processing unit is further configured to determine macroblocks of a region of interest in the first residual frame, and to determine, according to the macroblocks of the region of interest in the first residual frame, regions of interest of residual frames other than the first residual frame among the at least one residual frame, wherein a macroblock of the region of interest is a macroblock in which a sum of all pixel values contained in the current macroblock exceeds a preset value; and
    the processing unit is specifically configured to input the macroblocks of all regions of interest of the at least one residual frame into the neural network for feature extraction, wherein the at least one residual frame comprises the first residual frame and the remaining residual frames.
  11. The apparatus according to claim 10, wherein:
    the processing unit is specifically configured to input the macroblocks of the regions of interest in the first aligned frame and the second aligned frame, and the second frame image, into the neural network for feature extraction;
    wherein the regions of interest in the first aligned frame and the second aligned frame are the same as the region of interest in the first residual frame.
  12. The apparatus according to any one of claims 7 to 11, wherein:
    the first frame image, the second frame image, and the third frame image are three frame images in a first group of pictures;
    or, the first frame image is the last frame image in a first group of pictures, and the second frame image and the third frame image are the first two frame images in a second group of pictures;
    or, the first frame image and the second frame image are the last two frame images in a first group of pictures, and the third frame image is the first frame image in a second group of pictures.
  13. A computer-readable storage medium, wherein the storage medium stores instructions, and
    when the instructions are run, the method according to any one of claims 1 to 6 is implemented.
  14. A video encoding and decoding system, comprising a source device and a destination device, wherein the source device and the destination device comprise one or more processors and a memory coupled to the one or more processors;
    the memory is configured to store instructions; and
    the processor is configured to execute the instructions in the memory, to implement the method according to any one of claims 1 to 6.
  15. A source device, comprising: an image source, an image pre-processor, an encoder, and a communication interface;
    wherein the image source is configured to capture raw image data;
    the image pre-processor is configured to perform pre-processing on the raw image data;
    the encoder is configured to perform the method according to any one of claims 1 to 6, to obtain encoded image data; and
    the communication interface is configured to transmit the encoded image data to a destination device over a link.
  16. A destination device, comprising: a communication interface, a decoder, an image post-processor, and a display device;
    wherein the communication interface is configured to receive encoded image data from a source device;
    the decoder is configured to perform the method according to any one of claims 1 to 6, to obtain decoded image data;
    the image post-processor is configured to perform post-processing on the decoded image data to obtain post-processed image data; and the display device is configured to receive the post-processed image data so as to display an image to a user or viewer.
  17. A processor, wherein the processor is configured to perform the method according to any one of claims 1 to 6.
  18. A computer program product, which, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 6.