WO2022261838A1 - Residual coding and video coding methods, apparatuses, devices and systems - Google Patents

Residual coding and video coding methods, apparatuses, devices and systems

Info

Publication number
WO2022261838A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
residual
image
current frame
mode
Prior art date
Application number
PCT/CN2021/100191
Other languages
English (en)
French (fr)
Inventor
马展
夏琪
刘浩杰
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority to PCT/CN2021/100191
Priority to CN202180099185.4A (published as CN117480778A)
Publication of WO2022261838A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/42: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation

Definitions

  • Embodiments of the present disclosure relate to, but are not limited to, video compression technologies, and in particular to a residual coding method, a video coding method, and corresponding apparatuses, devices, and systems.
  • Digital video compression technology mainly compresses huge digital image and video data to facilitate transmission and storage.
  • Although digital video compression standards can save a large amount of video data, it is still necessary to pursue better digital video compression technology to further reduce the bandwidth and storage pressure of digital video.
  • An embodiment of the present disclosure provides a residual coding method, including:
  • when the current frame is an inter-frame prediction frame, calculating an influence factor for residual coding of the current frame according to a first mode, where the first mode is a mode of coding only the residual of the target region in the frame, and the influence factor is determined according to the first image quality and/or first code rate after coding; judging whether the influence factor satisfies a set condition; if it does, determining that the current frame performs residual coding according to the first mode; and if it does not, determining that the current frame performs residual coding according to a second mode, where the second mode is a mode of residual coding the entire frame image.
  • An embodiment of the present disclosure also provides a video coding method, including:
  • when the current frame is an inter-frame prediction frame, obtaining the predicted image of the current frame through inter-frame prediction, and calculating the residual of the entire frame image of the current frame according to the original image and the predicted image;
  • Residual coding is performed according to the residual coding method described in any embodiment of the present disclosure.
  • An embodiment of the present disclosure also provides a residual coding device, including a processor and a memory storing a computer program runnable on the processor, where, when the processor executes the computer program, the residual coding method described in any embodiment of the present disclosure is implemented.
  • An embodiment of the present disclosure also provides a video encoding device, including a processor and a memory storing a computer program runnable on the processor, where, when the processor executes the computer program, the video coding method described in any embodiment of the present disclosure is implemented.
  • An embodiment of the present disclosure further provides a video encoding and decoding system, which includes the video encoding device described in any embodiment of the present disclosure.
  • An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing a computer program, where, when the computer program is executed by a processor, the residual coding method or video coding method described in any embodiment of the present disclosure is implemented.
  • FIG. 1 is a schematic diagram of a video codec system that can be used in an embodiment of the present disclosure
  • FIG. 2A and FIG. 2B are schematic diagrams of a residual coding and decoding processing framework
  • FIG. 3 is a schematic diagram of a video encoding and decoding method according to an embodiment of the present disclosure
  • FIG. 4 is a block diagram of a video encoder according to an embodiment of the present disclosure.
  • FIG. 5 is a flowchart of a video encoding method for an I frame according to an embodiment of the present disclosure
  • FIG. 6 is a block diagram of the residual encoding processing device in FIG. 4;
  • FIG. 7 is a flowchart of a video encoding method for an inter-frame prediction frame according to an embodiment of the present disclosure
  • FIG. 8 is a flowchart of a residual coding method according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram of intermittent residual coding of a background region according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic diagram of an expansion kernel used when performing expansion processing on a target mask according to an embodiment of the present disclosure
  • FIG. 11 is a schematic structural diagram of a residual encoding device according to an embodiment of the present disclosure.
  • FIG. 12 is a functional block diagram of a video decoder according to an embodiment of the present disclosure.
  • FIG. 13A is a flowchart of a video decoding method for an I frame according to an embodiment of the present disclosure
  • FIG. 13B is a flowchart of a video decoding method for an inter-frame prediction frame according to an embodiment of the present disclosure
  • FIG. 14A is a schematic diagram of the target mask before expansion
  • FIG. 14B is a schematic diagram of the target mask after the target mask in FIG. 14A is expanded
  • FIG. 14C is the image obtained after processing using the target mask of FIG. 14A
  • FIG. 14D is the image obtained after processing using the target mask of FIG. 14B.
  • Words such as “exemplary” or “for example” are used to mean serving as an example, instance, or illustration. Any embodiment described in this disclosure as “exemplary” or “for example” should not be construed as preferred or advantageous over other embodiments.
  • "And/or” in this article is a description of the relationship between associated objects, which means that there can be three relationships, for example, A and/or B, which can mean: A exists alone, A and B exist simultaneously, and there exists alone B these three situations.
  • “A plurality” means two or more than two.
  • Words such as “first” and “second” are used to distinguish identical or similar items with substantially the same function and effect. Those skilled in the art will understand that such words do not limit quantity or execution order, nor do they necessarily indicate a difference.
  • Regarding inter-frame prediction frames: for a group of pictures (GOP: Group of Pictures) including I frames, P frames, and B frames, the inter-frame prediction frames include P frames and B frames; for a GOP including only I frames and P frames, the inter-frame prediction frames include P frames.
  • FIG. 1 is a block diagram of a video encoding and decoding system applicable to an embodiment of the present disclosure. As shown in FIG. 1 , the system is divided into an encoding-side device 1 and a decoding-side device 2 , and the encoding-side device 1 generates code streams.
  • the decoding side device 2 can decode the code stream.
  • The encoding side device 1 and the decoding side device 2 may include one or more processors and memory coupled to the one or more processors, such as random access memory, electrically erasable programmable read-only memory, flash memory, or other media.
  • The encoding side device 1 and the decoding side device 2 can be implemented with various devices, such as desktop computers, mobile computing devices, notebook computers, tablet computers, set-top boxes, televisions, cameras, display devices, digital media players, vehicle-mounted computers, or other similar apparatuses.
  • the device 2 on the decoding side can receive the code stream from the device 1 on the encoding side via the link 3 .
  • the link 3 includes one or more media or devices capable of moving the code stream from the device 1 on the encoding side to the device 2 on the decoding side.
  • the link 3 includes one or more communication media that enable the device 1 on the encoding side to directly transmit the code stream to the device 2 on the decoding side.
  • the device 1 on the encoding side can modulate the code stream according to a communication standard (such as a wireless communication protocol), and can send the modulated code stream to the device 2 on the decoding side.
  • the one or more communication media may include wireless and/or wired communication media, such as a radio frequency (RF) spectrum or one or more physical transmission lines.
  • the one or more communication media may form part of a packet-based network, such as a local area network, a wide area network, or a global network (eg, the Internet).
  • the one or more communication media may include routers, switches, base stations, or other devices that facilitate communication from device 1 on the encoding side to device 2 on the decoding side.
  • the code stream can also be output from the output interface 15 to a storage device, and the decoding-side device 2 can read the stored data from the storage device via streaming or downloading.
  • The storage device may comprise any of a variety of distributed-access or locally-accessed data storage media, such as hard disk drives, Blu-ray Discs, Digital Versatile Discs, CD-ROMs, flash memory, volatile or non-volatile memory, file servers, and the like.
  • the encoding side device 1 includes a data source 11 , an encoder 13 and an output interface 15 .
  • The data source 11 may include a video capture device (e.g., a video camera), an archive containing previously captured data, a feed interface for receiving data from a content provider, a computer graphics system for generating data, or a combination of these sources.
  • The encoder 13 can encode the data from the data source 11 and output it to the output interface 15; the output interface 15 can include at least one of a regulator, a modem, and a transmitter.
  • the decoding side device 2 includes an input interface 21 , a decoder 23 and a display device 25 .
  • input interface 21 includes at least one of a receiver and a modem.
  • the input interface 21 can receive the code stream via the link 3 or from a storage device.
  • the decoder 23 decodes the received code stream.
  • the display device 25 is used for displaying the decoded data, and the display device 25 may be integrated with other devices of the decoding side device 2 or provided separately.
  • the display device 25 may be, for example, a liquid crystal display, a plasma display, an organic light emitting diode display or other types of display devices.
  • the device 2 on the decoding side may not include the display device 25 , or may include other devices or devices for applying the decoded data.
  • various video codec methods can be used to implement video compression.
  • International video codec standards include H.264/Advanced Video Coding (AVC), H.265/High Efficiency Video Coding (HEVC), H.266/Versatile Video Coding (VVC), standards from MPEG (Moving Picture Experts Group), AOM (Alliance for Open Media), and AVS (Audio Video coding Standard), extensions of these standards, or any other self-defined standards. These standards reduce the amount of transmitted and stored data through video compression technology, so as to achieve more efficient video coding, decoding, transmission, and storage.
  • the above-mentioned video codec standards all adopt a block-based hybrid coding method.
  • A block is used as the basic unit for intra-frame prediction or inter-frame prediction; the residual (also called residual data or a residual block) is then transformed and quantized, and entropy coding is performed on syntax elements related to block division, prediction, and the like, as well as on the quantized residual, to obtain the coded video code stream (code stream for short).
  • With the development of neural networks, image and video compression technology based on neural networks has also developed greatly.
  • Technologies such as image compression based on random neural networks, image compression based on convolutional neural networks, image compression based on recurrent neural networks, and image compression based on generative adversarial networks have developed rapidly.
  • Neural network-based video coding and decoding technology has also achieved many achievements in hybrid neural network video coding and decoding, neural network rate-distortion optimized coding and decoding, and end-to-end video coding and decoding.
  • The hybrid neural network video codec replaces a traditional codec module with a neural network embedded into the traditional video codec framework, implementing or optimizing modules such as intra-frame predictive coding, inter-frame predictive coding, loop filtering, and entropy coding based on neural networks, together with the corresponding decoder modules, to further improve coding and decoding performance.
  • Neural network rate-distortion optimized coding can use neural networks to completely replace traditional mode decisions such as intra prediction mode decisions.
  • End-to-end video coding and decoding can realize a complete video coding and decoding framework through neural networks.
  • the framework for encoding and decoding residuals in the process of video encoding and decoding is as shown in FIG. 2A and FIG. 2B .
  • the residual generation unit 901 subtracts the pixel value of the predicted image from the pixel value of the original image of the video frame, and sends the obtained residual to the residual encoding processing device 903 .
  • The residual encoding processing device 903 includes a residual encoding network 9031 implemented based on a neural network and a residual quantization unit 9033; the residual is encoded and quantized by the residual encoding network 9031 and the residual quantization unit 9033 to obtain residual coded data, which is entropy-coded by the entropy coding unit 905 and written into the code stream.
  • The residual quantization unit here can perform quantization operations such as rounding up, rounding down, or rounding to the nearest integer on the data output by the residual coding network 9031.
  • the entropy decoding unit 911 performs entropy decoding on the code stream to extract the encoded residual data, and the encoded residual data is decoded in the residual decoding processing device 913 to obtain the reconstructed residual.
  • The residual decoding processing device 913 here may be a residual decoding network implemented based on a neural network. This method of encoding and decoding the residual of the video frame encodes the residual of the entire frame image, so the average bit rate after encoding and compression is relatively high, which affects the viewing experience when bandwidth is limited.
  • the user pays different attention to different regions in the image, and pays more attention to moving objects and specific targets, but does not care much about other background parts.
  • moving vehicles and pedestrians on the road are the targets that users pay attention to, while the background parts such as road surface and green belt are not paid attention to by users.
  • the images of these moving objects and specific targets are the target images, the areas where these moving objects and specific targets are located are the target areas, the images of other background parts are the background images, and the areas where other background parts are located are the background areas.
  • The video frame is encoded by the encoding network to obtain an image feature map; the image feature map is quantized and entropy-coded, and then written into the coded video stream (code stream for short).
  • The same code rate is thus assigned to the target area with high saliency and the background area with low saliency, which is wasteful when code rate resources are tight.
  • An embodiment of the present disclosure proposes a video encoding method, the schematic diagram of which is shown in FIG. 3, where X_t denotes the original image of the current frame and X̂_{t-1} denotes the reconstructed image of the previous frame. The predicted image X̄_t of the current frame is obtained from X_t and X̂_{t-1} through inter-frame prediction, and subtracting X̄_t from X_t gives the residual r_t of the entire frame image of the current frame.
  • The reconstructed residual r̂_t is obtained by residual decoding, and adding the reconstructed residual r̂_t to the predicted image X̄_t gives the reconstructed image X̂_t of the current frame.
  • the background residual refers to the residual of the background region in the frame
  • the target residual refers to the residual of the target region in the frame.
  • When encoding an inter-frame prediction frame (such as a P frame), an intermittent background residual encoding method is adopted, which saves code rate resources to a certain extent.
  • When encoding an I frame, an end-to-end target-based image encoding method can be used; by assigning a higher bit rate to the target image and a lower bit rate to the background image, the subjective quality of the video under low-bit-rate conditions can be improved.
  • An embodiment of the present disclosure provides a video encoder for implementing the video encoding method of the embodiment of the present disclosure, and the video encoder may be implemented based on an end-to-end video encoding framework.
  • the video encoder can be divided into an I-frame encoding part and an inter-frame predictive encoding part.
  • the inter-frame predictive encoding part can be used to encode P frames and B frames.
  • The encoding of P frames is taken as an example below.
  • The I-frame encoding part includes a first segmentation processing unit 101, a first image encoder 103, a second image encoder 105, a first multiplier 104, a second multiplier 106, a first quantization unit 107, a second quantization unit 109, an image merging unit 112, an image decoder 113, and an entropy coding unit 131 (the entropy coding unit 131 is shared by the I-frame coding part and the P-frame coding part).
  • the I-frame coding part may also include more, fewer or different units.
  • the first segmentation processing unit 101 is configured to segment the background image and the target image in the I-frame image based on the target segmentation network, and process the segmentation result into a binarized target mask and background mask;
  • The first image encoder 103 is configured to encode the I-frame image based on the first neural network and output an image feature map of the first code rate; the second image encoder 105 is configured to encode the I-frame image based on the second neural network and output an image feature map of the second code rate, where the first code rate is greater than the second code rate.
  • the first neural network and the second neural network may use neural networks with different structures, or use neural networks with the same structure but different parameters (such as weights, biases, etc.).
  • the first neural network is trained with the first bit rate as the target bit rate, and the second neural network is trained with the second bit rate as the target bit rate, so that the image feature maps of the first bit rate and the second bit rate can be respectively output.
  • the first multiplier 104 is configured to multiply the image feature map of the first code rate output by the first image encoder 103 with the target mask output by the first segmentation processing unit 101, and output the target feature map of the first code rate (i.e. feature map of the target image).
  • the second multiplier 106 is configured to multiply the image feature map of the second code rate output by the second image encoder 105 with the background mask output by the first segmentation processing unit 101, and output the background feature map of the second code rate (i.e. feature map of the background image).
  • the first quantization unit 107 is configured to quantize the target feature map and output the quantized target feature map; the second quantization unit 109 is configured to quantize the background feature map and output the quantized background feature map.
  • The quantization may be rounding up, rounding down, rounding to the nearest integer, etc.; this disclosure is not limited thereto.
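  • As a concrete illustration of these rounding choices, the following is a minimal sketch assuming numpy arrays for the feature maps (the function name and mode strings are hypothetical, not from the original text):

```python
import numpy as np

def quantize(feature_map: np.ndarray, mode: str = "round") -> np.ndarray:
    """Quantize a real-valued feature map to integers.

    mode: "ceil" (round up), "floor" (round down), or "round"
    (round to nearest), matching the options named above.
    """
    if mode == "ceil":
        return np.ceil(feature_map)
    if mode == "floor":
        return np.floor(feature_map)
    return np.round(feature_map)
```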
  • the entropy coding unit 131 performs entropy coding on the quantized target feature map and background feature map, and writes them into the code stream.
  • the image merging unit 112 is configured to merge the quantized target feature map and the quantized background feature map into a feature map of the entire frame image, and output it to the image decoder 113;
  • the image decoder 113 is configured to decode the feature map of the entire frame of image, and output an I frame of reconstructed image.
  • the image decoder 113 can be implemented based on a neural network.
  • the reconstructed image of the I frame output by the image decoder 113 is stored in the image buffer 209 and can be used as a reference image when performing inter-frame predictive encoding of the P frame.
  • An embodiment of the present disclosure provides a method for encoding the first frame image of a video sequence (such as a group of pictures), that is, an I-frame image, as shown in FIG. 5, including:
  • Step 310 input the I-frame image into the target segmentation network, and process the segmentation result into a binarized target mask and background mask;
  • Step 320 the I-frame image is input into two neural-network-based image encoders respectively, and the two image encoders output the image feature map of the first code rate and the image feature map of the second code rate respectively, where the first code rate is greater than the second code rate;
  • steps 310 and 320 are not in a fixed order, and may also be executed in parallel.
  • Step 330 multiplying the image feature map of the first code rate by the target mask to obtain the target feature map; multiplying the image feature map of the second code rate by the background mask to obtain the background feature map;
  • Step 340 perform quantization and entropy coding on the target feature map and the background feature map respectively, and write them into the code stream.
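  • The four steps above can be summarized in the following sketch (not from the patent itself). It assumes the segmentation network and the two image encoders are callables returning numpy arrays, that the masks are binary and already at the feature-map resolution, and that entropy_encode is a placeholder for the entropy coder:

```python
import numpy as np

def encode_i_frame(frame, segment, encoder_hi, encoder_lo, entropy_encode):
    # Step 310: segment the frame into binary target / background masks.
    target_mask = segment(frame)        # 1 inside target regions, else 0
    background_mask = 1 - target_mask

    # Step 320: encode the full frame at two code rates.
    feat_hi = encoder_hi(frame)         # feature map at the first (higher) rate
    feat_lo = encoder_lo(frame)         # feature map at the second (lower) rate

    # Step 330: high-rate features for the target, low-rate for the background.
    target_feat = feat_hi * target_mask
    background_feat = feat_lo * background_mask

    # Step 340: quantize (rounding here) and entropy-code both feature maps.
    return (entropy_encode(np.round(target_feat)),
            entropy_encode(np.round(background_feat)))
```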
  • When the I frame is encoded, different code rates are allocated to the target image and the background image in the I frame: the target image that the user pays attention to is assigned a higher code rate and thus more code rate resources, while the background image that the user does not pay attention to is assigned a lower code rate. This improves the subjective quality of the video at low bit rates (such as when bandwidth is limited).
  • The inter-frame predictive coding part is used to implement inter-frame predictive coding of P frames or B frames, and includes a feature fusion network 201, a motion compensation unit 203, a residual generation unit 204, a residual encoding processing device 205, a residual decoding processing device 207, a reconstruction unit 208, an image buffer 209, a third quantization unit 211, and the entropy encoding unit 131.
  • the inter-prediction coding part may also include more, fewer or different units.
  • The feature fusion network 201 can be implemented based on a neural network, and is configured to receive the original image of the current frame (a P frame is taken as an example; it may also be a B frame) and the reconstructed image of the previous frame (also referred to as a reference image), and to output an inter-frame motion information feature map;
  • the motion compensation unit 203 is configured to perform motion compensation according to the reconstructed image of the previous frame and the inter-frame motion information feature map output by the feature fusion network 201, and output the predicted image of the current frame;
  • the residual generating unit 204 is configured to generate the residual of the current frame (also referred to as residual data) according to the original image and the predicted image of the current frame;
  • The residual encoding processing device 205 is configured to encode and quantize the residual and output residual coded data, where the encoding of the residual can be realized by a neural-network-based residual encoding network; the residual coded data is split into two paths: one path is output to the entropy coding unit 131 for entropy coding and then written into the code stream, and the other path is output to the residual decoding processing device 207 for decoding to reconstruct the image.
  • the residual decoding processing device 207 is configured to decode the residual coded data, and output the reconstruction residual (also referred to as reconstruction residual data).
  • the residual decoding processing device 207 may use a neural network-based residual decoding network to decode residual encoded data;
  • the reconstruction unit 208 is configured to add the predicted image of the current frame to the reconstruction residual to obtain a reconstructed image of the current frame, such as a P frame, and store it in the image buffer 209;
  • the image buffer 209 is configured to save the reconstructed video frame image and provide the motion compensation unit 203 with reference images required for motion compensation.
  • the reconstructed video frame image includes a reconstructed I frame image and a reconstructed P frame image, and may also include a reconstructed B frame image;
  • the third quantization unit 211 is configured to quantize the inter-frame motion information feature map output by the feature fusion network 201, and output it to the entropy encoding unit 131;
  • the entropy coding unit 131 is also configured to perform entropy coding on the quantized inter-frame motion information feature map, residual coded data, etc., and write them into the code stream.
  • The multiple quantization units in the above video encoder 10 are mainly used to quantize the data output by the neural networks into integers. If these neural networks are trained to output integers, these quantization units may be omitted.
  • The video encoder 10 in FIG. 4 can be implemented using any one or any combination of the following circuits: one or more microprocessors, digital signal processors, application-specific integrated circuits, field programmable gate arrays, discrete logic, hardware, and the like. If the present disclosure is implemented partially in software, instructions for the software may be stored in a suitable non-transitory computer-readable storage medium and executed in hardware using one or more processors, thereby implementing the video encoding method of any embodiment of the present disclosure.
  • FIG. 6 is an exemplary functional unit diagram of the residual coding processing device 205 in the video encoder 10.
  • The residual coding processing device 205 includes a target segmentation network 2051, an expansion unit 2053, a third multiplier 2054, a residual selection unit 2055, a residual coding network 2057, and a fourth quantization unit 2059.
  • the residual coding processing device 205 may also be implemented by using more, fewer or different units.
  • For example, the expansion unit 2053 may be omitted, the residual coding network 2057 may be replaced with a transformation unit, and so on.
  • the target segmentation network 2051 is configured to segment the background image and the target image in the image of the current frame (taking the P frame as an example in the figure), and process the segmentation result into a binarized target mask;
  • the expansion unit 2053 is configured to perform morphological expansion processing on the target mask output by the target segmentation network 2051, and output the expanded target mask;
  • the third multiplier 2054 is configured to multiply the residual of the entire frame image of the current frame by the expanded target mask, and output the residual of the target region in the current frame;
  • the residual selection unit 2055 is configured to select one from the residual of the entire frame image of the current frame and the residual of the target area in the current frame according to the set conditions, and output it to the residual encoding network 2057 for encoding;
  • the fourth quantization unit 2059 is configured to quantize the data output by the residual coding network 2057, and output residual coding data (quantized data).
  • The above residual coding processing device 205 can be realized using any one or any combination of the following circuits: one or more microprocessors, digital signal processors, application-specific integrated circuits, field programmable gate arrays, discrete logic, hardware, and the like. If the present disclosure is implemented partially in software, instructions for the software may be stored in a suitable non-transitory computer-readable storage medium and executed in hardware using one or more processors, thereby implementing the residual coding method of any embodiment of the present disclosure.
  • An embodiment of the present disclosure provides a video encoding method, as shown in FIG. 7 , including:
  • Step 410 when the current frame is an inter-frame prediction frame, obtain a predicted image of the current frame through inter-frame prediction;
  • the reconstructed image of the previous frame and the original image of the current frame can be input into the trained feature fusion network, and the feature fusion network outputs a feature map of inter-frame motion information.
  • the inter-frame motion information feature map is added to the reconstructed image of the previous frame to obtain the predicted image of the current frame.
  • the inter-frame motion information feature map is written into the code stream after quantization and entropy coding.
  • Step 420 calculating the residual error of the entire frame image of the current frame according to the original image and the predicted image of the current frame;
  • The predicted image of the current frame can be subtracted from the original image of the current frame (pixel-value subtraction) to obtain the residual of the entire frame image of the current frame.
  • Step 430 perform residual coding according to the residual coding method described in any embodiment of the present disclosure.
  • The corresponding residual (such as the residual of the target area or the residual of the entire frame image) can be input into the residual coding network, and the data output by the residual coding network is then quantized, entropy-coded, and written into the code stream.
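  • Putting steps 410 to 430 together, the following is a minimal sketch of the inter-frame encoding flow. All of the callables (fusion_net, motion_compensate, residual_encode, entropy_encode) are hypothetical placeholders for the units described above:

```python
def encode_inter_frame(x_t, x_hat_prev, fusion_net, motion_compensate,
                       residual_encode, entropy_encode):
    # Step 410: inter-frame prediction from the previous reconstructed frame.
    motion_feat = fusion_net(x_hat_prev, x_t)             # motion feature map
    x_bar_t = motion_compensate(x_hat_prev, motion_feat)  # predicted image

    # Step 420: residual of the entire frame image (pixel-wise subtraction).
    r_t = x_t - x_bar_t

    # Step 430: residual coding; the mode decision described below chooses
    # between the full-frame residual and the target-region residual.
    coded_residual = residual_encode(r_t)
    return entropy_encode(motion_feat), entropy_encode(coded_residual)
```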
  • In some embodiments, the video encoding method further includes the following method for encoding an I frame: when the current frame is an I frame, the first neural network and the second neural network are respectively used to encode the original image of the current frame to obtain the image feature map of the first code rate and the image feature map of the second code rate, where the first code rate is greater than the second code rate; the image feature map of the first code rate is multiplied by the target mask to obtain the target feature map; the image feature map of the second code rate is multiplied by the background mask to obtain the background feature map; and quantization and entropy coding are performed on the target feature map and the background feature map respectively.
  • the subjective quality of the video at an extremely low bit rate is improved by assigning more bit rate resources to the video target area.
  • An exemplary embodiment of the present disclosure provides a residual coding method for residual coding of an inter-frame prediction frame.
  • the residual coding method may be implemented based on the residual coding processing device in FIG. 6 .
  • the residual coding method includes:
  • Step 510 when the current frame is an inter-frame prediction frame, calculate the influence factor of the current frame for residual coding according to the first mode;
  • where the first mode is a mode that encodes only the residual of the target area in the frame, and the influence factor is determined according to the first image quality and/or the first code rate after coding;
  • In one example, the influence factor is determined according to the coded first image quality of the current frame; in another example, the influence factor is determined according to both the coded first image quality and the first code rate of the current frame.
  • Step 520 judging whether the influencing factor satisfies the setting condition, if the influencing factor meets the setting condition, execute step 530, and if the influencing factor does not meet the setting condition, execute step 540;
  • Step 530 determine that the current frame performs residual coding according to the first mode
  • Step 540 determine that the current frame performs residual coding according to the second mode, and the second mode is a mode for residual coding of the entire frame of images.
  • FIG. 9 is a schematic diagram of a residual coding method according to an embodiment of the present disclosure. As can be seen from the figure, when performing residual coding in an embodiment of the present disclosure, a mode decision selects either the residual of the entire frame image of the current frame or the residual of the target area in the current frame for coding; the coding result is written into the code stream, and the decoding end decodes the residual coded data in the code stream to obtain the reconstructed residual.
  • the residual of the target region is obtained by multiplying the residual of the whole frame image with the dilated target mask.
  • The residual encoding processing device and residual encoding method of the embodiments of the present disclosure select, according to the set condition, either the residual of the entire frame image of the inter-frame prediction frame or the residual of the target area in the frame for encoding. That is, the residual of the background region in the frame is encoded intermittently while the residual of the target region in the frame is encoded continuously, selectively compensating the residual of the background region in inter-frame prediction frames. This reduces the amount of coding and improves coding efficiency while ensuring the visual quality of the target image, at the cost of slightly reduced background image quality. Since the background is not the area users focus on when watching the video, this has little impact on the subjective quality of the video.
  • The statement in step 510 above, “when the current frame is an inter-frame prediction frame, calculate the influence factor for residual coding of the current frame according to the first mode”, should not be understood as meaning that the influence factor must be calculated for all inter-frame prediction frames in a GOP. For the first inter-frame prediction frame in the GOP (usually the second frame in the GOP), the influence factor may not be calculated; instead, that frame can be used as a reference frame, and its coded image quality and/or code rate can be used to calculate a reference factor used in the mode decisions of subsequent inter-frame prediction frames.
  • In some embodiments, the residual of the background area in the current frame may be set equal to 0 and coded, and the coding result written into the code stream. Setting the residual of the background area to 0 effectively ignores it, and coding these zero values can be completed with very small coding overhead.
  • the data format of the residual coding is not changed, and the decoding end can still use the original decoding method to complete the decoding, so the encoding method in the embodiment of the present disclosure has good compatibility with the decoding end.
  • two methods may be used to calculate the residual of the target area in the current frame.
  • the first method is to multiply the residual of the entire frame image of the current frame by the target mask to obtain the residual of the target region in the current frame.
  • By inputting the original image of the video frame into the target segmentation network, the background image and the target image in the whole frame image can be segmented, and the segmentation result can then be processed into the binarized target mask.
  • This method does not expand the target mask, and the calculation is relatively simple and easy to implement.
  • However, since the embodiments of the present disclosure encode the residual of the background area intermittently, in frames where the background residual is not coded this method performs no residual compensation at the target edges, and the target edges in the decoded image may show subjective quality defects that affect the viewing experience.
  • the second method is to multiply the residual of the entire frame of the current frame by the dilated target mask to obtain the residual of the target region in the current frame.
  • This method calculates the residual of the target area after expansion processing, and since the residual of the target area is coded continuously, the residual at the target edges is compensated in every frame, which avoids the above subjective quality defects and improves the video viewing experience.
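  • Both methods amount to an element-wise multiplication of the full-frame residual by a binary mask. A minimal sketch, assuming numpy arrays, a binary mask at the image resolution, and OpenCV for the morphological dilation (the function name is hypothetical):

```python
import cv2
import numpy as np

def target_residual(r_t, target_mask, kernel=None):
    """Residual of the target region in the current frame.

    First method: kernel=None, plain binary target mask.
    Second method: pass a dilation kernel so target edges are also kept.
    """
    mask = target_mask.astype(np.uint8)
    if kernel is not None:
        mask = cv2.dilate(mask, kernel)   # morphological expansion
    if r_t.ndim == 3:                     # broadcast over color channels
        mask = mask[..., None]
    # Multiplying by the binary mask zeroes the background residual while
    # leaving the coded data format unchanged, so the decoder needs no change.
    return r_t * mask
```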
  • the dilation kernel used for dilation processing may be determined first, and the dilation processing is performed on the target mask by using the dilation kernel.
  • the expansion kernel is also called a structural element (SE: structure element) in image morphology, and the size and center point of the expansion kernel can be defined as needed.
  • the size of the expansion kernel is positively correlated with the displacement statistics of pixels in the target area.
  • the displacement statistic value is a maximum value among displacement values of all pixels in the target area, or an average value of displacement values of all pixels in the target area, and the present disclosure is not limited thereto.
  • the displacement value of the pixel reflects the moving speed of the target (such as a moving object in the monitoring screen) between the current frame and the previous frame.
  • This processing method associates the displacement values of the pixels in the target area with the size of the expansion kernel: a larger displacement value means the target moves faster, so a larger expansion kernel is selected to dilate the target mask, making the dilated target region larger and ensuring that the edge regions of the target are residual-compensated.
  • In an example, the expansion kernel used when performing expansion processing on the target mask is a square, and the side length k of the square is calculated according to the following formula:
  • k = ceil(max(D * M_o)) + k_0
  • where D is a matrix composed of the displacement values of the pixels in the current frame, M_o is the target mask, * denotes element-wise multiplication, k_0 is a set constant, ceil() is the round-up function, and max() returns the maximum element of a matrix.
  • For example, an expansion kernel of 3 × 3 pixel units as shown in FIG. 10 can be used, where one pixel unit can include one or more pixel points, and the center point of the expansion kernel is the point marked with crossed lines in the middle.
  • the setting of the constant can provide a certain margin for the calculation.
  • Although this example uses a square expansion kernel, the present disclosure does not limit the shape of the expansion kernel. In other examples, the shape of the expansion kernel may also be a triangle, rectangle, pentagon, cross, or other shapes.
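  • The kernel-sizing rule above can be sketched as follows, assuming per-pixel optical flow (u, v) is available and that the displacement matrix D holds the flow magnitudes (the default k0 = 2 is purely an illustrative assumption, not a value from the text):

```python
import numpy as np

def dilation_kernel(flow_u, flow_v, target_mask, k0=2):
    """Square structuring element sized from target-region motion.

    flow_u, flow_v: per-pixel optical flow between consecutive frames.
    k0: the set constant that provides margin for the calculation.
    """
    d = np.sqrt(flow_u ** 2 + flow_v ** 2)          # displacement matrix D
    k = int(np.ceil((d * target_mask).max())) + k0  # k = ceil(max(D * M_o)) + k0
    return np.ones((k, k), dtype=np.uint8)          # square expansion kernel

# e.g.: dilated = cv2.dilate(target_mask, dilation_kernel(u, v, target_mask))
```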
  • FIG. 14A is a schematic diagram of the target mask before expansion
  • FIG. 14B is a schematic diagram of the target mask after the target mask in FIG. 14A is expanded
  • FIG. 14C is an image obtained after processing the target mask in FIG. 14A
  • Fig. 14D is an image obtained after processing using the target mask of Fig. 14B.
  • the edges of the target area are clearer in Figure 14D.
  • the impact factor calculated in the above step 510 reflects the impact of the residual coding of the current frame according to the first mode (that is, only the residual coding of the target area in the frame, not the residual coding of the background area in the frame).
  • The impact can be measured by absolute indicators such as the coded video quality. Alternatively, it can be measured by the difference in video quality and bit rate between residual-coding the current frame according to the first mode and residual-coding it according to the second mode.
  • It can also be measured by the change in video quality and/or bit rate after residual-coding the current frame according to the first mode, relative to a previously coded inter-frame prediction frame (referred to herein as a reference frame) that was residual-coded according to the first mode or the second mode.
  • Measuring by relative change is a dynamic, adaptive way of making the mode judgment and has better adaptability.
  • In an example, the impact factor is calculated according to the following formula:
  • RD_cur = R_r_bkg + D_w/o_r_bkg
  • and the set condition is RD_cur − RD_comp < ε, where RD_cur is the impact factor, R_r_bkg is the code rate of the residual of the background area in the current frame, D_w/o_r_bkg is the distortion of the reconstructed image relative to the original image after the current frame is residual-coded according to the first mode, ε is a set threshold, and RD_comp is a reference factor:
  • RD_comp = R'_r_bkg + D_w_r_bkg
  • where R'_r_bkg is the code rate of the residual of the background area in the reference frame, and D_w_r_bkg is the distortion of the reconstructed image of the reference frame relative to the original image.
  • For the first inter-frame prediction frame in a GOP, the above mode judgment may be skipped, and the frame may be directly determined to perform residual coding according to the second mode, that is, coding the residual of the entire frame image. Alternatively, the initial value of RD_comp can be set to 0 or another value that ensures that, when the current frame is the first inter-frame prediction frame in the GOP, the set condition RD_cur − RD_comp < ε does not hold (that is, the impact factor does not satisfy the set condition), which is essentially the same as directly determining that the first inter-frame prediction frame performs residual coding according to the second mode.
  • In another example, the calculation formula and set condition of the impact factor are the same as in the previous example, but R_r_bkg is the code rate of the residual of the target area in the current frame, and R'_r_bkg is the code rate of the residual of the target area, or of the entire frame image, in the reference frame.
  • The code rate of the residual of the background area, the target area, or the entire frame image can be obtained by performing entropy coding on the residual data; alternatively, without performing entropy coding, the code rate can be estimated through an approximate code rate estimation model or other methods after the coding network encodes the residual.
  • The code rate of the residual is associated with the bit overhead after coding the residual: the larger the bit overhead, the larger the code rate. By considering video quality and code rate at the same time, a reasonable balance can be struck between improving coding efficiency and maintaining video quality, so as to optimize performance.
  • The reference factor RD_comp is introduced into the set condition, and the difference between the impact factor and the reference factor is compared with a set value. This reflects the difference, when the current frame is residual-coded according to the first mode, in image quality and code rate relative to a reference frame that was residual-coded according to the second mode. If the difference is large, residual-coding the current frame according to the first mode is considered to cause a degradation in image quality and code rate relative to the reference frame, so the current frame should be residual-coded according to the second mode instead. If the difference is small, the image quality and code rate of the current frame after residual coding according to the first mode are not significantly different from those of the reference frame, and it is determined that the current frame performs residual coding according to the first mode.
  • two factors of code rate and distortion degree are taken into consideration when making mode judgment, so as to achieve better comprehensive performance.
  • In an example, the degree of distortion is represented by the mean squared error (MSE: Mean Squared Error), but it is also possible to use the sum of absolute differences (SAD: Sum of Absolute Difference), the sum of absolute transformed differences (SATD: Sum of Absolute Transformed Difference), the sum of squared differences (SSD: Sum of Squared Difference), or the mean absolute difference (MAD: Mean Absolute Difference).
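  • For reference, the following is a sketch of these distortion metrics over numpy arrays; the SATD variant assumes one square block whose side is a power of two so that scipy's Hadamard matrix applies:

```python
import numpy as np
from scipy.linalg import hadamard

def mse(a, b): return np.mean((a - b) ** 2)   # Mean Squared Error
def sad(a, b): return np.sum(np.abs(a - b))   # Sum of Absolute Difference
def ssd(a, b): return np.sum((a - b) ** 2)    # Sum of Squared Difference
def mad(a, b): return np.mean(np.abs(a - b))  # Mean Absolute Difference

def satd(a, b):
    """Sum of Absolute Transformed Difference for one square block
    whose side is a power of two (e.g. an 8x8 block)."""
    d = a - b
    h = hadamard(d.shape[0])                  # Hadamard transform matrix
    return np.sum(np.abs(h @ d @ h.T))
```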
  • The impact factor of this embodiment better reflects the impact of omitting the background residual, and the above set conditions support adaptive mode judgment rather than being limited to fixed thresholds, so they can be better applied to various video coding scenarios.
  • a residual coding method including:
  • Step 1 input the image of the inter-frame prediction frame into the target segmentation network, and process the segmentation result into a target mask;
  • Step 2 performing morphological expansion processing on the target mask
  • The expansion kernel used in the expansion processing of this step is related to the maximum displacement value of the pixels in the target area and is determined by the following formula:
  • k = ceil(max(D * M_o)) + k_0
  • where D is a matrix composed of the displacement values of each pixel between the two consecutive frames, each element of D being the magnitude of the optical flow (u, v) of the corresponding pixel; M_o is the target mask before expansion; ceil() is the round-up function, so ceil(max(D * M_o)) is the maximum displacement value of the pixels in the target area rounded up; and k_0 is a constant.
  • Step 3: when the current frame is an inter-frame prediction frame, calculate the influence factor used to measure the effect of residual-coding the current frame according to the first mode (that is, encoding only the residual of the target area, without encoding or compensating the residual of the background area):
  • RD_cur = R_r_bkg + D_w/o_r_bkg
  • where RD_cur is the influence factor of the current frame, R_r_bkg is the code rate of the residual of the background area in the current frame, and D_w/o_r_bkg is the distortion of the reconstructed image relative to the original image after the current frame is residual-coded according to the first mode, represented by the MSE.
  • The impact factor relates to code rate and distortion, and may also be referred to as the rate-distortion loss (RD loss).
  • the residual of the target area in the current frame can be obtained by multiplying the residual of the entire frame image of the current frame with the target mask of the current frame.
  • the residual of the background area in the current frame can be obtained by multiplying the residual of the entire frame image of the current frame with the background mask of the current frame.
  • Step 4: determine whether the impact factor RD_cur satisfies the following condition:
  • RD_cur − RD_comp < ε
  • where RD_comp is a reference factor and ε is a set threshold.
  • If the growth of the RD loss does not exceed the threshold, that is, RD_cur − RD_comp < ε holds, it is determined that the current frame is residual-coded according to the first mode, and only the residual of the target area in the current frame is input to the subsequent coding network; when RD_cur − RD_comp ≥ ε, it is determined that the current frame is coded according to the second mode, and the residual of the entire frame image of the current frame is input into the coding network.
  • When the current frame is the first inter-frame prediction frame in the GOP, no judgment may be made and the current frame may be directly determined to perform residual coding according to the second mode; alternatively, the initial value of RD_comp may be set so that the decision result is that the current frame performs residual coding according to the second mode.
  • When the current frame is residual-coded according to the second mode, the reference factor RD_comp is calculated and saved according to the following formula:
  • RD_comp = R'_r_bkg + D_w_r_bkg
  • where R'_r_bkg is the code rate of the residual of the background area in the frame when the current frame is residual-coded according to the second mode, and D_w_r_bkg is the distortion (MSE loss) of the reconstructed image relative to the original image in that case.
  • The reference factor is thus the RD loss when the current frame is residual-coded according to the second mode. If an RD_comp was saved previously, the saved RD_comp is updated to the newly calculated RD_comp.
  • In subsequent mode decisions, the saved reference factor RD_comp can be obtained directly for the decision.
  • the update of the reference factor can adapt to the change of the video quality in time and make a more reasonable mode decision.
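  • Steps 3 and 4 and the updating of the reference factor can be combined into the following sketch; the function and parameter names are hypothetical, and the rates and distortions are assumed to have been measured (or estimated) beforehand as described above:

```python
def decide_mode(rd_state, rate_bkg, dist_first_mode,
                rate_bkg_full, dist_second_mode, eps,
                first_inter_frame=False):
    """Return "first" (target-only residual) or "second" (full-frame residual).

    rd_state: dict holding the saved reference factor RD_comp.
    """
    if first_inter_frame:
        # The first inter-frame prediction frame of a GOP is coded in the
        # second mode and initializes the reference factor RD_comp.
        rd_state["RD_comp"] = rate_bkg_full + dist_second_mode
        return "second"

    rd_cur = rate_bkg + dist_first_mode        # RD_cur = R_r_bkg + D_w/o_r_bkg
    if rd_cur - rd_state["RD_comp"] < eps:     # set condition holds
        return "first"                         # code only the target residual

    # RD loss grew too much: code the full-frame residual and update RD_comp.
    rd_state["RD_comp"] = rate_bkg_full + dist_second_mode
    return "second"
```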
  • In some embodiments, the set condition includes one or more of the following conditions:
  • Condition 1: the impact factor is less than a set first threshold;
  • Condition 2: the difference of the impact factor minus a first reference factor is less than a set second threshold, where the first reference factor is determined according to the second image quality and/or the second code rate after the current frame is residual-coded according to the second mode;
  • Condition 3: the difference of the impact factor minus a second reference factor is less than a set third threshold, where the reference frame is an encoded inter-frame prediction frame in the GOP where the current frame is located; for example, the reference frame is the inter-frame prediction frame in that GOP that has been determined to perform residual coding according to the second mode and is closest to the current frame, and the second reference factor is determined according to the third image quality and/or the third code rate after the reference frame is residual-coded according to the second mode.
  • In an example, the first image quality, the second image quality, and the third image quality are all represented by the distortion of the reconstructed image relative to the original image; the first code rate is represented by the code rate of the residual of the background area or of the target area; the second code rate is represented by the code rate of the residual of the background area and the target area; and the third code rate is represented by the code rate of the residual of the background area and/or the target area.
  • The first image quality, the second image quality, and the third image quality may also be represented by parameters such as the similarity of the reconstructed image relative to the original image or the peak signal-to-noise ratio (PSNR: Peak Signal to Noise Ratio).
  • In an example, the above Condition 1 is used for the mode decision, and the influence factor is determined according to the first image quality, for example, the distortion of the reconstructed image relative to the original image after the current frame is residual-coded according to the first mode.
  • If the influence factor satisfies the set condition, that is, the influence factor is less than the set distortion threshold, the distortion after coding is small, and it is determined that the current frame is residual-coded according to the first mode, that is, only the residual of the target region in the frame is coded, to save coding overhead.
  • If the impact factor is greater than or equal to the set distortion threshold, the distortion after coding is relatively large, and it is determined that the current frame performs residual coding according to the second mode, that is, the residual of the entire frame image is coded, to ensure video quality.
  • This judgment method is relatively simple and suits scenarios where the threshold is relatively fixed, but it is not very flexible and can hardly meet scenarios where video quality requirements change.
  • In another example, the above Condition 2 is used for the mode judgment.
  • The impact factor is equal to the first image quality after the current frame is residual-coded according to the first mode plus the first code rate, where the first image quality is the distortion of the reconstructed image relative to the original image and the first code rate is the code rate of the residual of the target region in the current frame. The first reference factor is equal to the second image quality after the current frame is residual-coded according to the second mode plus the second code rate, where the second image quality is the distortion of the reconstructed image relative to the original image and the second code rate is the code rate of the residual of the background area and the target area (that is, the entire frame image).
  • The impact factor is thus equivalent to the rate-distortion cost of residual-coding the current frame according to the first mode, and the first reference factor is equivalent to the rate-distortion cost of residual-coding the current frame according to the second mode. The second threshold can be set to 0, so that the mode decision is made by comparing the rate-distortion costs of the two modes: when the rate-distortion cost corresponding to the first mode (the impact factor) is smaller than the rate-distortion cost corresponding to the second mode (the first reference factor), the current frame is residual-coded according to the first mode.
  • the above condition three is used for mode judgment.
  • the second reference factor is determined according to the third image quality after residual coding of the reference frame according to the first mode, and the third image quality is equal to the degree of distortion of the reconstructed image relative to the original image after coding the reference frame.
  • the reference frame is the previous inter-predicted frame in the GOP where the current frame is located.
  • the impact factor is determined according to the first image quality after residual coding of the current frame according to the first mode, and the first image quality is equal to the distortion degree of the reconstructed image relative to the original image after coding of the current frame.
  • the difference between the influence factor and the second reference factor is used to reflect the distortion degree of the current frame after residual coding according to the first mode, compared to the distortion of the previous frame after residual coding according to the first mode If the difference is less than the set third threshold, it means that the change of the distortion degree is small, and the current frame can continue to perform residual coding according to the first mode. If the difference is greater than or equal to the set third threshold, it means that the distortion degree has If there is obvious degradation, the current frame should perform residual coding according to the second mode.
  • In a variant of this embodiment, the second reference factor may instead be determined from the third image quality and the third bit rate after the reference frame is residual-coded in the first mode, and the impact factor from the first image quality and the first bit rate after the current frame is residual-coded in the first mode.
  • In another exemplary embodiment of the present disclosure, condition three is likewise used for the mode decision.
  • The second reference factor is determined from the third image quality and the third bit rate after the reference frame is residual-coded in the second mode, where the third image quality equals the distortion of the reconstructed image relative to the original image after the reference frame is encoded, and the third bit rate equals the bit rate of the residual of the entire reference frame.
  • The reference frame is the inter-predicted frame, nearest to the current frame in the group of pictures of the current frame, that has been determined to be residual-coded in the second mode.
  • The impact factor is determined from the first image quality and the first bit rate after the current frame is residual-coded in the first mode, where the first image quality equals the distortion of the reconstructed image relative to the original image after the current frame is encoded, and the first bit rate equals the bit rate of the residual of the target region of the current frame.
  • The difference obtained by subtracting the second reference factor from the impact factor thus reflects how the distortion and bit rate of the current frame after first-mode residual coding compare with those of the last inter-predicted frame coded in the second mode. If the difference is less than the set third threshold, the change in distortion and bit rate is small and the current frame can be residual-coded in the first mode; if the difference is greater than or equal to the third threshold, the distortion and bit rate have degraded noticeably and the current frame should be residual-coded in the second mode. In a variant of this embodiment, the second reference factor may instead be determined from the third image quality after the reference frame is residual-coded in the second mode, and the impact factor from the first image quality after the current frame is residual-coded in the first mode.
  • In another exemplary embodiment of the present disclosure, several of the above conditions are combined for the mode decision. For example, condition one is used for the first few inter-predicted frames of a GOP, and once an inter-predicted frame has been determined to be residual-coded in the second mode, the decision switches to condition three; a sketch of such a policy follows this list. Alternatively, every inter-predicted frame of the GOP may be decided using conditions one and two together, conditions one and three, conditions two and three, and so on.
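  • A sketch of one such combined policy, under the assumption that per-frame statistics are available; the constants and dictionary keys below are illustrative and not part of the disclosure:

```python
# Illustrative thresholds; real values would be tuned for the application.
DISTORTION_THRESHOLD = 50.0   # hypothetical condition-one threshold (e.g., an MSE value)
LAMBDA = 5.0                  # hypothetical condition-three threshold

def choose_mode(frame_stats, state):
    """One possible combined policy: condition one until some frame has been
    coded in the second mode, condition three afterwards (a sketch, not the
    disclosure's mandated policy)."""
    if not state.get("seen_second_mode"):
        mode = "first" if frame_stats["impact"] < DISTORTION_THRESHOLD else "second"
    else:
        mode = "first" if frame_stats["impact"] - state["rd_comp"] < LAMBDA else "second"
    if mode == "second":
        state["seen_second_mode"] = True
        state["rd_comp"] = frame_stats["rd_second_mode"]  # refresh the reference factor
    return mode
```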
  • An embodiment of the present disclosure also provides a residual coding device, as shown in FIG. 11, including a processor and a memory storing a computer program runnable on the processor, wherein the processor, when executing the computer program, implements the residual coding method of any embodiment of the present disclosure.
  • An embodiment of the present disclosure also provides a video encoding device, for which reference may also be made to FIG. 11, including a processor and a memory storing a computer program runnable on the processor, wherein the processor, when executing the computer program, implements the video coding method of any embodiment of the present disclosure.
  • An embodiment of the present disclosure further provides a video encoding and decoding system including the video encoding device of any embodiment of the present disclosure, and further including a video decoding device.
  • An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the residual coding method or the video coding method of any embodiment of the present disclosure.
  • An embodiment of the present disclosure further provides a code stream generated according to the residual coding method or the video coding method of any embodiment of the present disclosure, wherein, when the current frame is determined to be residual-coded in the first mode, the code stream contains only the codewords obtained by coding the residual of the target region of the current frame, and when the current frame is determined to be residual-coded in the second mode, the code stream contains the codewords obtained by residual-coding the entire image of the current frame.
  • An embodiment of the present disclosure further provides a video decoder for implementing the video decoding method of the embodiment of the present disclosure, and the video decoder may be implemented based on an end-to-end video decoding framework.
  • As shown in FIG. 12, the video decoder 30 includes an entropy decoding unit 301, an image merging unit 302, an image decoder 303, an image buffer 305, a motion compensation unit 307, a residual decoding processing device 309 and a reconstruction unit 308. In other examples, the video decoder 30 may include more, fewer or different units.
  • The entropy decoding unit 301 is configured to entropy-decode the code stream, extract the target and background feature maps of I frames, the motion information feature maps of inter-predicted frames, the residual coded data and other information, and send them to the corresponding units for further processing;
  • The image merging unit 302 is configured to merge the target feature map and background feature map extracted by the entropy decoding unit 301 into the feature map of the entire I-frame image and output it to the image decoder 303;
  • The image decoder 303 is configured to decode the feature map of the entire I-frame image and output the reconstructed I-frame image; it may be implemented with a neural network;
  • The image buffer 305 is configured to store the reconstructed I-frame images output by the image decoder 303 and the reconstructed inter-predicted frames output by the reconstruction unit 308; the buffered reconstructed images are output for display as decoded video data and provide the motion compensation unit 307 with the reference images needed for motion compensation;
  • The motion compensation unit 307 is configured to perform motion compensation according to the reference image (e.g., the reconstructed image of the previous frame) and the inter-frame motion information feature map extracted by the entropy decoding unit 301, and to output the predicted image of the current frame;
  • The residual decoding processing device 309 is configured to decode the residual coded data extracted by the entropy decoding unit 301 and output the reconstructed residual; it may use a neural-network-based residual decoding network to decode the residual coded data;
  • The reconstruction unit 308 is configured to add the predicted image of the current frame to the reconstructed residual to obtain the reconstructed image of the inter-predicted frame (taking a P frame as an example) and store it in the image buffer 305.
  • The video decoder 30 of FIG. 12 may be implemented using any one or any combination of the following circuits: one or more microprocessors, digital signal processors, application-specific integrated circuits, field-programmable gate arrays, discrete logic, hardware, and so on. If the present disclosure is implemented partly in software, the instructions for the software may be stored in a suitable non-transitory computer-readable storage medium and executed in hardware by one or more processors, thereby implementing the video decoding method of any embodiment of the present disclosure.
  • An embodiment of the present disclosure proposes a video decoding method that can be implemented on the video decoding framework shown in FIG. 12. The decoding process for I frames, shown in FIG. 13A, includes:
  • Step 610: entropy-decode the current frame (an I frame) from the code stream to obtain the target feature map and the background feature map;
  • Step 620: add the target feature map and the background feature map to obtain the feature map of the entire image of the current frame (the I frame);
  • Step 630: input the feature map of the entire image of the current frame (the I frame) into the neural-network-based decoder to obtain, and store, the reconstructed image of the current frame (the I frame).
  • When the current frame is an inter-predicted frame (a P frame is taken as the example; it may also be a B frame), the corresponding decoding process, shown in FIG. 13B, includes:
  • Step 710: entropy-decode the current frame (a P frame) from the code stream to obtain the residual coded data and the inter-frame motion information feature map;
  • Step 720: use the inter-frame motion information feature map to compensate the reconstructed image of the previous frame, obtaining the predicted image of the current frame (the P frame);
  • Step 730: decode the residual coded data to obtain the reconstructed residual, and add the reconstructed residual to the predicted image of the current frame (the P frame) to obtain, and store, the reconstructed image of the current frame (the P frame). In these steps there is no need to identify whether the encoder used the first or the second mode for residual coding; the decoder decodes the residual coded data in the same way in either case.
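  • A minimal sketch of this inter-frame decoding path, assuming the entropy decoder, motion-compensation module and residual decoding network are available as callables and that images are array-like (all names are illustrative):

```python
def decode_p_frame(bitstream, prev_recon, entropy_dec, motion_comp, residual_dec):
    """Sketch of steps 710-730; the three callables stand in for the entropy
    decoder, the motion-compensation module and the residual decoding network,
    and images are assumed to be array-like so '+' is element-wise."""
    residual_code, motion_features = entropy_dec(bitstream)   # step 710
    predicted = motion_comp(prev_recon, motion_features)      # step 720
    recon_residual = residual_dec(residual_code)              # step 730
    return predicted + recon_residual                         # reconstructed P frame
```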
  • Sample images were encoded with H.264, H.265, NVC and the target-based video compression method of the embodiment of the present disclosure (OBVC). The average bit rate consumed by the four algorithms in compressing the sequence is as follows:

        Metric                          H.264     H.265     NVC       OBVC
        Average bits per pixel          0.0213    0.0199    0.0197    0.0175
        Peak signal-to-noise ratio      22.40     23.71     23.58     23.20

  • By granting more bit-rate resources to the target region of the video, the embodiment of the present disclosure improves the subjective quality of the video at extremely low bit rates. By intermittently compensating the background-region residual while compensating the target-region residual frame by frame, the visual quality of the target region is preserved while the background quality is slightly reduced, improving the subjective quality of the video and saving bit-rate resources to a certain extent.
  • In addition, this embodiment resolves the visual defects at target edges caused by intermittent compensation of the background residual by applying a dilation operation to the target segmentation mask.
  • In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which correspond to tangible media such as data storage media, or communication media, which include any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol. In this manner, computer-readable media generally correspond to non-transitory tangible computer-readable storage media or to communication media such as signals or carrier waves. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this disclosure. A computer program product may include a computer-readable medium.
  • By way of example and not limitation, such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store the desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium: for example, if instructions are transmitted from a website, server or other remote source using coaxial cable, fiber-optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio and microwave, then the coaxial cable, fiber-optic cable, twisted pair, DSL or those wireless technologies are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals or other transient media, but are instead directed to non-transient, tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Accordingly, the term "processor" as used herein may refer to any of the foregoing structures or to any other structure suitable for implementing the techniques described herein.
  • In addition, in some aspects the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec. The techniques may also be fully implemented in one or more circuits or logic elements.
  • The technical solutions of the embodiments of the present disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC) or a set of ICs (e.g., a chipset). Various components, modules or units are described in the disclosed embodiments to emphasize functional aspects of devices configured to perform the described techniques, but they do not necessarily require realization by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by a collection of interoperable hardware units (including one or more processors as described above) in combination with suitable software and/or firmware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present disclosure provides a residual coding method. When the current frame is an inter-predicted frame, a mode decision determines whether to code only the residual of the target region of the current frame or to code the residual of the entire frame; by coding the background-region residual only intermittently, coding efficiency is improved without affecting the subjective quality of the image. The present disclosure further provides a video coding method based on the residual coding method, together with corresponding devices, apparatuses and systems.

Description

Other aspects will become apparent upon reading and understanding the accompanying drawings and the detailed description.
Brief Description of the Drawings
The accompanying drawings are provided for an understanding of the embodiments of the present disclosure, constitute a part of the specification, serve together with the embodiments to explain the technical solutions of the present disclosure, and do not limit those technical solutions.
FIG. 1 is a schematic diagram of a video encoding and decoding system usable with embodiments of the present disclosure;
FIG. 2A and FIG. 2B are schematic diagrams of a residual encoding processing framework and a residual decoding processing framework, respectively;
FIG. 3 is a schematic diagram of a video encoding and decoding method according to an embodiment of the present disclosure;
FIG. 4 is a block diagram of a video encoder according to an embodiment of the present disclosure;
FIG. 5 is a flowchart of a video encoding method for I frames according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of the residual encoding processing device in FIG. 4;
FIG. 7 is a flowchart of a video encoding method for inter-predicted frames according to an embodiment of the present disclosure;
FIG. 8 is a flowchart of a residual encoding method according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of intermittent residual coding of the background region according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of the dilation kernel used when dilating the target mask according to an embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of a residual encoding device according to an embodiment of the present disclosure;
FIG. 12 is a functional block diagram of a video decoder according to an embodiment of the present disclosure;
FIG. 13A is a flowchart of a video decoding method for I frames according to an embodiment of the present disclosure;
FIG. 13B is a flowchart of a video decoding method for inter-predicted frames according to an embodiment of the present disclosure;
FIG. 14A is a schematic diagram of a target mask before dilation; FIG. 14B is a schematic diagram of the mask of FIG. 14A after dilation; FIG. 14C is an image obtained using the mask of FIG. 14A; and FIG. 14D is an image obtained using the mask of FIG. 14B.
Detailed Description
The present disclosure describes a number of embodiments, but the description is exemplary rather than limiting, and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein.
In the description of the present disclosure, words such as "exemplary" or "for example" indicate an example, illustration or explanation; no embodiment described as "exemplary" or "for example" should be construed as preferred over, or advantageous to, other embodiments. Herein, "and/or" describes an association between associated objects and covers three cases: A alone, both A and B, and B alone. "A plurality of" means two or more. In addition, to describe the technical solutions of the embodiments clearly, words such as "first" and "second" are used to distinguish identical or similar items having substantially the same functions and effects; those skilled in the art will understand that such words do not limit quantity or execution order, and do not imply that the items are necessarily different.
In describing representative exemplary embodiments, the specification may have presented methods and/or processes as a particular sequence of steps. However, to the extent that a method or process does not depend on the particular order of the steps described herein, it should not be limited to that particular order; as those of ordinary skill in the art will appreciate, other orders are also possible. Accordingly, the particular order of steps set forth in the specification should not be construed as limiting the claims, and claims directed to such methods and/or processes should not be limited to performing their steps in the order written; those orders may be varied while remaining within the spirit and scope of the embodiments of the present disclosure.
Herein, a video frame encoded using inter prediction is referred to as an inter-predicted frame. For a group of pictures (GOP) containing I frames, P frames and B frames, the inter-predicted frames include the P frames and the B frames; for a GOP containing I frames and P frames, the inter-predicted frames include the P frames.
FIG. 1 is a block diagram of a video encoding and decoding system usable with embodiments of the present disclosure. As shown in FIG. 1, the system is divided into an encoding-side device 1 and a decoding-side device 2: the encoding-side device 1 produces a code stream, and the decoding-side device 2 can decode the code stream. Each device may include one or more processors and a memory coupled to the one or more processors, such as random access memory, electrically erasable programmable read-only memory, flash memory or other media. The encoding-side device 1 and decoding-side device 2 may be implemented by various apparatuses, such as desktop computers, mobile computing devices, notebook computers, tablet computers, set-top boxes, televisions, cameras, display devices, digital media players, in-vehicle computers or other similar apparatuses.
The decoding-side device 2 may receive the code stream from the encoding-side device 1 via a link 3. The link 3 includes one or more media or devices capable of moving the code stream from the encoding-side device 1 to the decoding-side device 2. In one example, the link 3 includes one or more communication media enabling the encoding-side device 1 to send the code stream directly to the decoding-side device 2; the encoding-side device 1 may modulate the code stream according to a communication standard (e.g., a wireless communication protocol) and send the modulated code stream to the decoding-side device 2. The one or more communication media may include wireless and/or wired communication media, such as a radio frequency (RF) spectrum or one or more physical transmission lines, may form part of a packet-based network such as a local area network, a wide area network or a global network (e.g., the Internet), and may include routers, switches, base stations or other equipment facilitating communication from the encoding-side device 1 to the decoding-side device 2. In another example, the code stream may also be output from an output interface 15 to a storage device, from which the decoding-side device 2 can read the stored data via streaming or download. The storage device may include any of a variety of distributed or locally accessed data storage media, such as hard disk drives, Blu-ray discs, digital versatile discs, compact discs, flash memory, volatile or non-volatile memory, file servers, and so on.
In the example shown in FIG. 1, the encoding-side device 1 includes a data source 11, an encoder 13 and the output interface 15. In some examples, the data source 11 may include a video capture device (e.g., a camera), an archive containing previously captured data, a feed interface for receiving data from a content provider, a computer graphics system for generating data, or a combination of these sources. The encoder 13 encodes the data from the data source 11 and outputs it to the output interface 15, which may include at least one of a regulator, a modem and a transmitter.
In the example shown in FIG. 1, the decoding-side device 2 includes an input interface 21, a decoder 23 and a display device 25. In some examples, the input interface 21 includes at least one of a receiver and a modem. The input interface 21 may receive the code stream via the link 3 or from the storage device. The decoder 23 decodes the received code stream. The display device 25 displays the decoded data; it may be integrated with, or provided separately from, the other components of the decoding-side device 2, and may be, for example, a liquid crystal display, a plasma display, an organic light-emitting diode display or another type of display. In other examples, the decoding-side device 2 may not include the display device 25, or may include other devices or equipment that make use of the decoded data.
Based on the video encoding and decoding system shown in FIG. 1, various video coding methods can be used to achieve video compression. International video coding standards include H.264/Advanced Video Coding (AVC), H.265/High Efficiency Video Coding (HEVC), H.266/Versatile Video Coding (VVC), MPEG (Moving Picture Experts Group) standards, AOM (Alliance for Open Media) standards, AVS (Audio Video coding Standard) and extensions of these standards, as well as any other custom standards. Through video compression these standards reduce the amount of data transmitted and stored, enabling more efficient coding, transmission and storage. The above standards all adopt a block-based hybrid coding approach: intra or inter prediction is first performed with blocks as the basic unit; the residual (also called residual data or a residual block) is then transformed and quantized; and the syntax elements related to partitioning, prediction and so on, together with the quantized residual, are entropy-coded to obtain the encoded video code stream (code stream for short).
With the development of neural network architectures, neural-network-based image and video compression has also advanced considerably. Image compression based on random neural networks, convolutional neural networks, recurrent neural networks and generative adversarial networks has developed rapidly, and neural-network-based video coding has produced many results in hybrid neural-network video coding, neural-network rate-distortion-optimized coding and end-to-end video coding. In hybrid neural-network video coding, neural networks replace traditional modules embedded in a conventional video coding framework, implementing or optimizing intra-prediction coding, inter-prediction coding, in-loop filtering, entropy coding and the corresponding decoder-side modules to further improve coding performance. Neural-network rate-distortion-optimized coding can use a neural network to completely replace traditional mode decisions, such as the intra-prediction mode decision. End-to-end video coding implements the complete video coding framework with neural networks.
In some solutions, the framework for coding and decoding residuals in the video coding process is shown in FIG. 2A and FIG. 2B. At the encoding end, a residual generation unit 901 subtracts the pixel values of the predicted image from those of the original video frame, and the resulting residual is fed into a residual encoding processing device 903. In the illustrated example, the device 903 includes a neural-network-based residual encoding network 9031 and a residual quantization unit 9033; the residual is encoded and quantized by these to produce residual coded data, which is entropy-coded by an entropy encoding unit 905 and written into the code stream. The residual quantization unit may round the output of the residual encoding network 9031 up, down or to the nearest integer. At the decoding end, as shown in FIG. 2B, an entropy decoding unit 911 entropy-decodes the code stream and extracts the residual coded data, which is decoded in a residual decoding processing device 913 to obtain the reconstructed residual; the device 913 may be a neural-network-based residual decoding network. This way of coding the residual of the full video frame yields a relatively high average bit rate after compression, which degrades the viewing experience when bandwidth is limited.
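For illustration, a minimal sketch of this residual coding and decoding path, assuming NumPy arrays and treating the trained residual networks as opaque callables (all names here are hypothetical):

```python
import numpy as np

def encode_residual(original, predicted, residual_enc_net, quantize=np.round):
    """FIG. 2A path: residual generation, neural residual encoding, quantization
    (entropy coding omitted). residual_enc_net stands in for network 9031."""
    residual = original.astype(np.float32) - predicted.astype(np.float32)
    return quantize(residual_enc_net(residual))    # residual coded data

def decode_residual(residual_code, residual_dec_net):
    """FIG. 2B path: map the coded data back to a reconstructed residual."""
    return residual_dec_net(residual_code)
```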
In some video applications, users pay different amounts of attention to different regions of the image: attention is high for moving objects and specific targets, and low for the remaining background. In road surveillance, for example, the moving vehicles and pedestrians on the road are the targets users attend to, while the road surface, green belts and other background parts are not. The images of these moving objects and specific targets are the target images, the regions where they are located are the target regions, the images of the remaining background parts are the background images, and the regions where those parts are located are the background regions. In some neural-network-based video coding methods, a video frame is encoded by an encoding network into an image feature map, which is quantized, entropy-coded and written into the encoded video code stream (code stream for short). When encoding a video frame, such methods allocate the same bit rate to the highly salient target region and the less salient background region, which is wasteful when bit-rate resources are scarce.
To solve the above problems, an embodiment of the present disclosure proposes a video encoding method, illustrated in FIG. 3. In the figure, X_t denotes the original image of the current frame and X̂_{t-1} the reconstructed image of the previous frame. Inter prediction from X_t and X̂_{t-1} yields the predicted image X̄_t of the current frame, and subtracting X̄_t from X_t gives the residual r_t of the entire current frame. After intermittent coding of the background residual together with persistent coding of the target residual, residual decoding yields the reconstructed residual r̂_t, and adding r̂_t to the predicted image X̄_t gives the reconstructed image X̂_t of the current frame. Here, the background residual is the residual of the background region of the frame and the target residual is the residual of the target region. When encoding inter-predicted frames (e.g., P frames), the embodiment of the present disclosure codes the background residual only intermittently, which saves bit-rate resources to a certain extent. As shown in the figure, an end-to-end target-based image coding method can be used when encoding I frames; by allocating a higher bit rate to the target image and a lower bit rate to the background image, the subjective quality of the video under low-bit-rate conditions can be improved.
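A sketch of one iteration of this loop, under the same assumptions as above (array-like images, trained networks as opaque callables; the function names are illustrative only):

```python
def code_inter_frame(x_t, x_hat_prev, predict, enc_residual, dec_residual):
    """One iteration of the FIG. 3 loop.

    x_t: original image of the current frame; x_hat_prev: previous reconstruction.
    """
    x_bar_t = predict(x_t, x_hat_prev)   # inter prediction -> predicted image
    r_t = x_t - x_bar_t                  # whole-frame residual
    code = enc_residual(r_t)             # intermittent background / persistent target coding
    r_hat_t = dec_residual(code)         # reconstructed residual
    x_hat_t = x_bar_t + r_hat_t          # reconstructed image of the current frame
    return code, x_hat_t
```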
An embodiment of the present disclosure provides a video encoder for implementing the video encoding method of the embodiments of the present disclosure; the video encoder may be implemented on an end-to-end video coding framework. As shown in FIG. 4, the video encoder is divided into an I-frame encoding part and an inter-prediction encoding part; the latter can encode P frames and B frames, and P-frame encoding is taken as the example in the figure.
In the example of FIG. 4, the I-frame encoding part includes a first segmentation processing unit 101, a first image encoder 103, a second image encoder 105, a first multiplier 104, a second multiplier 106, a first quantization unit 107, a second quantization unit 109, an image merging unit 112, an image decoder 113 and an entropy encoding unit 131 (the entropy encoding unit is shared by the I-frame and P-frame encoding parts). In other examples, the I-frame encoding part may include more, fewer or different units.
The first segmentation processing unit 101 is configured to segment the background image and the target image of the I-frame image based on a target segmentation network, and to process the segmentation result into a binarized target mask and background mask;
The first image encoder 103 is configured to encode the I-frame image based on a first neural network and output an image feature map at a first bit rate; the second image encoder 105 is configured to encode the I-frame image based on a second neural network and output an image feature map at a second bit rate, the first bit rate being greater than the second bit rate. The first and second neural networks may have different structures, or the same structure with different parameters (e.g., weights and biases). The first neural network is trained with the first bit rate as its target bit rate and the second neural network with the second bit rate, so that they can output feature maps at the first and second bit rates respectively.
The first multiplier 104 is configured to multiply the first-bit-rate image feature map output by the first image encoder 103 with the target mask output by the first segmentation processing unit 101, producing the target feature map (i.e., the feature map of the target image) at the first bit rate.
The second multiplier 106 is configured to multiply the second-bit-rate image feature map output by the second image encoder 105 with the background mask output by the first segmentation processing unit 101, producing the background feature map (i.e., the feature map of the background image) at the second bit rate.
The first quantization unit 107 is configured to quantize the target feature map and output the quantized target feature map; the second quantization unit 109 is configured to quantize the background feature map and output the quantized background feature map. Quantization may be rounding up, rounding down, rounding to nearest, and so on, without limitation in the present disclosure.
The entropy encoding unit 131 entropy-codes the quantized target and background feature maps and writes them into the code stream.
The image merging unit 112 is configured to merge the quantized target feature map and the quantized background feature map into the feature map of the entire frame image and output it to the image decoder 113;
The image decoder 113 is configured to decode the feature map of the entire frame image and output the reconstructed I-frame image. The image decoder 113 may be implemented with a neural network. The reconstructed I-frame image it outputs is stored in an image buffer 209 and can serve as the reference image for inter-prediction encoding of P frames.
Based on the architecture of the I-frame encoding part above, an embodiment of the present disclosure provides a method for encoding the first frame image of a video sequence (e.g., of a group of pictures), i.e., the I-frame image, as shown in FIG. 5, including:
Step 310: input the I-frame image into the target segmentation network and process the segmentation result into a binarized target mask and background mask;
Step 320: input the I-frame image into two neural-network-based image encoders, which output an image feature map at a first bit rate and an image feature map at a second bit rate respectively, the first bit rate being greater than the second;
Steps 310 and 320 have no fixed order and may also be performed in parallel.
Step 330: multiply the first-bit-rate image feature map with the target mask to obtain the target feature map, and multiply the second-bit-rate image feature map with the background mask to obtain the background feature map;
Step 340: quantize and entropy-code the target feature map and the background feature map respectively, and write them into the code stream.
When encoding an I frame, this embodiment allocates different bit rates to the target image and the background image within the frame: the target image, which the user attends to, is allocated a higher bit rate and thus more bit-rate resources, while the background image, which the user does not attend to, is allocated a lower bit rate. This improves the subjective quality of the video at low bit rates (e.g., when bandwidth is limited).
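For illustration, the steps above might look as follows in a minimal sketch; the two encoders are assumed to be trained networks whose output feature maps broadcast against the binarized masks, and entropy coding is omitted:

```python
import numpy as np

def encode_i_frame(image, target_mask, background_mask, enc_high, enc_low,
                   quantize=np.round):
    """Sketch of steps 310-340. enc_high / enc_low stand in for the two trained
    encoders; the masks are assumed to broadcast against the feature maps
    (e.g., mask shape (H, W) against features of shape (C, H, W))."""
    features_high = enc_high(image)                   # feature map at the first (higher) bit rate
    features_low = enc_low(image)                     # feature map at the second (lower) bit rate
    target_features = features_high * target_mask     # keep the target region at the high rate
    background_features = features_low * background_mask
    return quantize(target_features), quantize(background_features)
```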
As shown in FIG. 4, the inter-prediction encoding part implements inter-prediction encoding of P frames or B frames and includes a feature fusion network 201, a motion compensation unit 203, a residual generation unit 204, a residual encoding processing device 205, a residual decoding processing device 207, a reconstruction unit 208, the image buffer 209, a third quantization unit 211 and the entropy encoding unit 131. In other examples, the inter-prediction encoding part may include more, fewer or different units.
The feature fusion network 201 may be implemented with a neural network and is configured to receive the input original image of the current frame (a P frame is taken as the example; it may also be a B frame) and the reconstructed image of the previous frame (also called the reference image), and to output an inter-frame motion information feature map;
The motion compensation unit 203 is configured to perform motion compensation according to the reconstructed image of the previous frame and the inter-frame motion information feature map output by the feature fusion network 201, and to output the predicted image of the current frame;
The residual generation unit 204 is configured to generate the residual (also called residual data) of the current frame from its original image and predicted image;
The residual encoding processing device 205 is configured to encode and quantize the residual and output residual coded data; the encoding of the residual may be performed by a neural-network-based residual encoding network. The residual coded data is split into two paths: one is output to the entropy encoding unit 131 to be entropy-coded and written into the code stream, and the other is output to the residual decoding processing device 207 for decoding to reconstruct the image.
The residual decoding processing device 207 is configured to decode the residual coded data and output the reconstructed residual (also called reconstructed residual data); it may use a neural-network-based residual decoding network to decode the residual coded data;
The reconstruction unit 208 is configured to add the predicted image of the current frame to the reconstructed residual to obtain the reconstructed image of the current frame, such as a P frame, and store it in the image buffer 209;
The image buffer 209 is configured to store the reconstructed video frame images and provide the motion compensation unit 203 with the reference images needed for motion compensation. The reconstructed video frame images include reconstructed I-frame images and reconstructed P-frame images, and may also include reconstructed B-frame images;
The third quantization unit 211 is configured to quantize the inter-frame motion information feature map output by the feature fusion network 201 and output it to the entropy encoding unit 131;
The entropy encoding unit 131 is further configured to entropy-code the quantized inter-frame motion information feature map, the residual coded data and so on, and write them into the code stream.
The quantization units in the video encoder 10 mainly quantize the data output by the neural networks into integers; if those neural networks are trained to output integers, the quantization units may be omitted.
The video encoder 10 of FIG. 4 may be implemented using any one or any combination of the following circuits: one or more microprocessors, digital signal processors, application-specific integrated circuits, field-programmable gate arrays, discrete logic, hardware, and so on. If the present disclosure is implemented partly in software, the instructions for the software may be stored in a suitable non-volatile computer-readable storage medium and executed in hardware by one or more processors, thereby implementing the video encoding method of any embodiment of the present disclosure.
FIG. 6 is an exemplary functional-unit diagram of the residual encoding processing device 205 in the video encoder 10. As shown in the figure, the device 205 includes a second target segmentation network 2051, a dilation unit 2053, a third multiplier 2054, a residual selection unit 2055, a residual encoding network 2057 and a fourth quantization unit 2059. In other examples, the device 205 may be implemented with more, fewer or different units, for example omitting the dilation unit 2053, replacing the residual encoding network 2057 with a transform unit, and so on.
The target segmentation network 2051 is configured to segment the background image and target image of the current frame (a P frame in the figure) and process the segmentation result into a binarized target mask;
The dilation unit 2053 is configured to apply morphological dilation to the target mask output by the target segmentation network 2051 and output the dilated target mask;
The third multiplier 2054 is configured to multiply the residual of the entire current frame with the dilated target mask and output the residual of the target region of the current frame;
The residual selection unit 2055 is configured to select, according to the set condition, between the residual of the entire current frame and the residual of the target region of the current frame, and to output the selection to the residual encoding network 2057 for encoding;
The fourth quantization unit 2059 is configured to quantize the data output by the residual encoding network 2057 and output the residual coded data (the quantized data).
The dilation method used by the dilation unit 2053 and the selection method of the residual selection unit 2055 are described in the residual encoding method of the embodiments of the present disclosure and are not repeated here.
The residual encoding processing device 205 may be implemented using any one or any combination of the following circuits: one or more microprocessors, digital signal processors, application-specific integrated circuits, field-programmable gate arrays, discrete logic, hardware, and so on. If the present disclosure is implemented partly in software, the instructions for the software may be stored in a suitable non-volatile computer-readable storage medium and executed in hardware by one or more processors, thereby implementing the residual encoding method of any embodiment of the present disclosure.
An embodiment of the present disclosure provides a video encoding method, as shown in FIG. 7, including:
Step 410: when the current frame is an inter-predicted frame, obtain the predicted image of the current frame through inter prediction;
In this step, the reconstructed image of the previous frame and the original image of the current frame may be input into the trained feature fusion network, which outputs the inter-frame motion information feature map. Adding the inter-frame motion information feature map to the reconstructed image of the previous frame yields the predicted image of the current frame. The inter-frame motion information feature map is quantized, entropy-coded and written into the code stream.
Step 420: compute the residual of the entire current frame from its original image and predicted image;
In this step, the predicted image of the current frame may be subtracted from its original image (pixel-value subtraction) to obtain the residual of the entire current frame.
Step 430: perform residual encoding according to the residual encoding method of any embodiment of the present disclosure.
In this step, for the residual encoding, the relevant residual (e.g., the residual of the target region or the residual of the entire frame) may be input into the residual encoding network, and the data output by the network is then quantized, entropy-coded and written into the code stream.
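A minimal sketch of step 410 under these assumptions (the fusion network is an opaque callable whose output feature map is assumed to have the same shape as the image, so that element-wise addition applies):

```python
def predict_current_frame(x_t, x_hat_prev, fusion_net):
    """Sketch of step 410: the fusion network produces a motion-information
    feature map, which is added to the previous reconstruction to form the
    predicted image of the current frame."""
    motion_features = fusion_net(x_hat_prev, x_t)
    return x_hat_prev + motion_features   # predicted image of the current frame
```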
In an exemplary embodiment of the present disclosure, the video encoding method further includes the following method of encoding I frames: when the current frame is an I frame, encode its original image with a first neural network and a second neural network respectively to obtain an image feature map at a first bit rate and an image feature map at a second bit rate, the first bit rate being greater than the second; multiply the first-bit-rate image feature map with the target mask to obtain the target feature map, and multiply the second-bit-rate image feature map with the background mask to obtain the background feature map; and quantize and entropy-code the target feature map and the background feature map respectively. By granting more bit-rate resources to the target region of the video, this embodiment improves the subjective quality of the video at extremely low bit rates.
An exemplary embodiment of the present disclosure provides a residual encoding method for the residual coding of inter-predicted frames; the method may be implemented on the residual encoding processing device of FIG. 6. As shown in FIG. 8, the residual encoding method includes:
Step 510: when the current frame is an inter-predicted frame, compute the impact factor of residual-coding the current frame in the first mode, the first mode being the mode in which only the residual of the target region of the frame is coded; the impact factor is determined from the first image quality and/or the first bit rate after encoding.
In one example, the impact factor is determined from the first image quality after the current frame is encoded; in another example, it is determined from both the first image quality and the first bit rate after the current frame is encoded.
Step 520: judge whether the impact factor satisfies the set condition; if it does, perform step 530, and if it does not, perform step 540;
Step 530: determine that the current frame is residual-coded in the first mode;
Step 540: determine that the current frame is residual-coded in the second mode, the second mode being the mode in which the residual of the entire frame is coded.
FIG. 9 is a schematic diagram of the residual encoding method of the embodiment of the present disclosure. As the figure shows, when performing residual encoding, a mode decision selects whether to code the residual of the entire current frame or the residual of the target region of the current frame; the coding result is written into the code stream, and the decoding end decodes the residual coded data in the code stream to obtain the reconstructed residual. The target-region residual is obtained by multiplying the whole-frame residual with the dilated target mask.
The residual encoding processing device and residual encoding method of the embodiments of the present disclosure select, according to a set condition, either the entire image of an inter-predicted frame or the residual of its target region for coding. That is, the residual of the background region is coded intermittently while the residual of the target region is coded persistently, selectively compensating the background-region residual of inter-predicted frames. This reduces the amount of coding and improves coding efficiency, preserving the visual quality of the target image while slightly reducing that of the background image. Since the background image is not the region users attend to when viewing the video, the method has little effect on the subjective quality of the video.
The statement in step 510 above, "when the current frame is an inter-predicted frame, compute the impact factor of residual-coding the current frame in the first mode", should not be understood as requiring the impact factor to be computed for every inter-predicted frame of a GOP. In one example of the present disclosure, the first inter-predicted frame of a GOP (usually the second frame of the GOP) may be directly determined to be residual-coded in the second mode without computing the impact factor. That first inter-predicted frame can then serve as a reference frame: its post-encoding image quality and/or bit rate can be used to compute the reference factor used in the mode decisions of subsequent inter-predicted frames.
In an embodiment of the present disclosure, when the current frame is determined to be residual-coded in the first mode, the residual of the background region of the current frame may be set equal to 0 and then encoded, with the coding result written into the code stream. Setting the background-region residual to 0 effectively ignores it, and these zero values can be coded with very little overhead. At the same time, the data format of the residual coding is unchanged, so the decoding end can still complete decoding with the original decoding method; the encoding method of this embodiment is therefore well compatible with the decoding end.
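For illustration, assuming the whole-frame residual and the binarized target mask are NumPy-compatible arrays, the first mode can be sketched as:

```python
def first_mode_residual(whole_frame_residual, target_mask):
    """Sketch of the first mode: keep the target-region residual and set the
    background residual to zero; the data format of the residual is unchanged,
    so the decoder needs no modification."""
    return whole_frame_residual * target_mask   # background positions become 0
```

The runs of zeros in the background region compress to almost nothing under entropy coding, which is why this choice carries very little overhead.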
In the embodiments of the present disclosure, the residual of the target region of the current frame can be computed in two ways.
The first is to multiply the residual of the entire current frame with the target mask to obtain the target-region residual. The target mask can be obtained by inputting the original image of the video frame into the target segmentation network, segmenting the background image and target image of the whole frame, and binarizing the segmentation result. This method does not dilate the target mask, so it is computationally simple and easy to implement. However, because the embodiments of the present disclosure code the background residual only intermittently, in frames where the background residual is not coded this method performs no residual compensation at the target edges; subjective quality defects may then appear at the target edges in the decoded image, affecting the viewing experience.
The second is to multiply the residual of the entire current frame with the dilated target mask to obtain the target-region residual. What this method computes is the residual of the dilated target region, and since the target-region residual is coded persistently, the target edges are residual-compensated in every frame, avoiding the above subjective quality defects and improving the viewing experience.
When dilating the target mask in the second method, the dilation kernel to be used can first be determined and then applied to the mask. In image morphology the dilation kernel is also called a structuring element (SE); its size and center point can both be defined as needed. In this embodiment, the size of the dilation kernel is positively correlated with a displacement statistic of the pixels of the target region. The displacement statistic may be the maximum of the displacement values of all pixels in the target region, the average of those displacement values, or the like, without limitation in the present disclosure. A pixel's displacement value reflects the speed at which the target (e.g., a moving object in a surveillance picture) moves between the current frame and the previous frame. This approach ties the pixel displacement values of the target region to the size of the dilation kernel: a larger displacement value means the target moves faster, and using a larger dilation kernel then enlarges the dilated target region, ensuring that the edge region of the target receives residual compensation.
In one example based on the second method, the dilation kernel used when dilating the target mask is a square whose side length k_d is computed as:

k_d = ceil(max(D * M_o)) + k_0

where D is the matrix of displacement values of the pixels of the current frame, M_o is the target mask, k_0 is a set constant, ceil() is the round-up function, and max() is the function taking the maximum element of a matrix.
Assuming the computed k_d = 3, the dilation kernel of FIG. 10, consisting of 3×3 pixel units, can be used; a pixel unit may contain one or more pixels, and the center point of the kernel is the cross-hatched point in FIG. 10. The constant provides some margin for the computation. Although a square dilation kernel is used as the example here, the present disclosure does not limit the shape of the kernel, which may also be a triangle, rectangle, pentagon, cross or another shape.
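A sketch of this dilation step, assuming NumPy arrays and using OpenCV's dilate as one convenient morphological implementation; the margin k0 = 2 is an illustrative value, not one specified by the disclosure:

```python
import numpy as np
import cv2  # OpenCV, used here only as one convenient way to dilate a binary mask

def dilate_target_mask(mask, displacement, k0=2):
    """Sketch of the dilation step: kernel side length k_d = ceil(max(D * M_o)) + k_0.

    mask: binarized target mask M_o (values 0/1);
    displacement: per-pixel displacement matrix D; k0: illustrative margin."""
    k_d = int(np.ceil((displacement * mask).max())) + k0
    kernel = np.ones((k_d, k_d), dtype=np.uint8)   # square structuring element
    return cv2.dilate(mask.astype(np.uint8), kernel)
```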
FIG. 14A is a schematic diagram of a target mask before dilation and FIG. 14B of the mask of FIG. 14A after dilation; FIG. 14C is the image obtained using the mask of FIG. 14A, while FIG. 14D is the image obtained using the mask of FIG. 14B. The edges of the target region are noticeably cleaner in FIG. 14D.
The impact factor computed in step 510 reflects the effect of residual-coding the current frame in the first mode (i.e., coding only the target-region residual and not the background-region residual). This effect can be measured by an absolute indicator such as the encoded video quality; or by the change in video quality and bit rate of first-mode residual coding of the current frame relative to second-mode residual coding of the current frame; or by the video quality and/or bit rate of the current frame after first-mode residual coding relative to the video quality and/or bit rate of a previously encoded inter-predicted frame (referred to herein as a reference frame) residual-coded in the first or second mode. Measuring by relative change is a dynamically adaptive way of making the mode decision and has better adaptability.
In an exemplary embodiment of the present disclosure, the impact factor is computed as:

RD_cur = R_r_bkg + D_w/o_r_bkg

where RD_cur is the impact factor, R_r_bkg is the bit rate of the residual of the background region of the current frame, and D_w/o_r_bkg is the distortion, relative to the original image, of the reconstructed image after the current frame is residual-coded in the first mode.
The set condition is: RD_cur − RD_comp < λ, with RD_comp = R'_r_bkg + D_w_r_bkg,
where λ is a set threshold, RD_comp is the reference factor, R'_r_bkg is the bit rate of the residual of the background region of the reference frame, D_w_r_bkg is the distortion of the reconstructed image of the reference frame relative to its original image, and the reference frame is the inter-predicted frame, nearest to the current frame in the group of pictures of the current frame, that has been determined to be residual-coded in the second mode.
For the first inter-predicted frame of a GOP (e.g., the first P frame), this embodiment may skip the above mode decision and directly determine that the frame is residual-coded in the second mode, i.e., that the residual of the entire frame is coded, while computing the reference factor from RD_comp = R'_r_bkg + D_w_r_bkg. That is, after the frame is residual-coded in the second mode, the distortion of its reconstructed image relative to its original image and the bit rate of the residual of its background region are computed; the computed distortion and bit rate are summed and stored as the reference factor RD_comp. In this way, from the second inter-predicted frame of the GOP onward, the formula RD_cur − RD_comp < λ can be used for the mode decision when residual-coding the current frame. If the decision result determines that the current frame is residual-coded in the second mode, RD_comp is likewise computed from RD_comp = R'_r_bkg + D_w_r_bkg, and the stored RD_comp is updated to the newly computed value. The present disclosure is not limited to this: in another embodiment, the initial value of RD_comp may also be set to 0 or another value so that, when the current frame is the first inter-predicted frame of the GOP, the set condition RD_cur − RD_comp < λ does not hold (i.e., the impact factor fails the set condition); this is essentially the same as directly determining that the first inter-predicted frame is residual-coded in the second mode.
In a variant of this embodiment, the computation formula of the impact factor and the set condition are the same as above, but R_r_bkg is the bit rate of the residual of the target region of the current frame and R'_r_bkg is the bit rate of the residual of the target region or the entire frame of the reference frame; this is also possible. In this embodiment and its variant, the bit rate of the residual of the background region, target region or entire frame after encoding can be obtained by entropy-coding the residual data, or, without entropy coding, estimated with a rate-approximation model or other method after the residual encoding network has encoded the residual. The bit rate of a residual is associated with the bit overhead of the coded residual: the larger the overhead, the higher the bit rate. Considering video quality and bit rate together strikes a reasonable balance between improving coding efficiency and improving video quality, optimizing performance.
This embodiment introduces the reference factor RD_comp into the set condition and compares the difference between the impact factor and the reference factor with a set value. The difference reflects how the current frame, when residual-coded in the first mode, compares in image quality and bit rate with a reference frame of the same GOP that has already been residual-coded in the second mode. If the difference is large, first-mode residual coding of the current frame is considered to degrade image quality and bit rate relative to the reference frame, so the current frame should not be residual-coded in the first mode but in the second mode; if the difference is small, the image quality and bit rate after first-mode residual coding are considered to differ little from those of the reference frame, and the current frame is determined to be residual-coded in the first mode. By considering both bit rate and distortion in the mode decision, this embodiment can achieve better overall performance.
In this embodiment and other embodiments, the distortion is expressed as the Mean Squared Error (MSE), but it may also be expressed as the Sum of Absolute Differences (SAD), the Sum of Absolute Transformed Differences (SATD), the Sum of Squared Differences (SSD) or the Mean Absolute Difference (MAD).
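For reference, minimal NumPy forms of these distortion measures (SATD additionally applies a transform, e.g., a Hadamard transform, to the difference before summing the absolute values, and is omitted here):

```python
import numpy as np

def mse(x, y): return float(np.mean((x - y) ** 2))   # Mean Squared Error
def sad(x, y): return float(np.sum(np.abs(x - y)))   # Sum of Absolute Differences
def ssd(x, y): return float(np.sum((x - y) ** 2))    # Sum of Squared Differences
def mad(x, y): return float(np.mean(np.abs(x - y)))  # Mean Absolute Difference
```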
The impact factor of this embodiment better reflects the effect of the missing background residual, and the above set condition supports an adaptive mode decision that is not restricted to a fixed threshold, making it better suited to a variety of video coding scenarios.
An exemplary embodiment of the present disclosure provides a residual encoding method, including:
Step one: input the image of the inter-predicted frame into the target segmentation network and process the segmentation result into a target mask;
Step two: apply morphological dilation to the target mask;
The dilation kernel used in this step is related to the maximum pixel displacement of the target region and is determined by:

k_d = ceil(max(D * M_o)) + k_0

where D is the matrix of displacement values of each pixel of the image between the two consecutive frames, each element being the displacement computed from that pixel's optical flow (u, v); M_o is the target mask before dilation; ceil() is the round-up function, so ceil(max(D * M_o)) is the maximum pixel displacement of the target region rounded up; and k_0 is a constant.
Step three: when the current frame is an inter-predicted frame, compute the impact factor that measures residual-coding the current frame in the first mode (i.e., coding only the target-region residual and not coding, or in other words not compensating, the background-region residual):

RD_cur = R_r_bkg + D_w/o_r_bkg

where RD_cur is the impact factor of the current frame, R_r_bkg is the bit rate of the residual of the background region of the current frame, and D_w/o_r_bkg is the distortion, expressed as the MSE, of the reconstructed image relative to the original image after the current frame is residual-coded in the first mode. In this embodiment the impact factor involves both bit rate and distortion and may also be called the rate-distortion loss (RD loss).
The residual of the target region of the current frame can be obtained by multiplying the whole-frame residual of the current frame with the current frame's target mask; the residual of the background region, by multiplying the whole-frame residual with the current frame's background mask.
Step four: judge whether the impact factor RD_cur satisfies the following condition:

RD_cur − RD_comp < λ

where RD_comp is the reference factor and λ is a set threshold.
If, after first-mode residual coding of the current frame, the growth of the RD loss does not exceed the threshold, i.e., RD_cur − RD_comp < λ holds, it is determined that the current frame is residual-coded in the first mode, and only the residual of the target region of the current frame is input into the subsequent encoding network; when RD_cur − RD_comp ≥ λ, it is determined that the current frame is residual-coded in the second mode, the residual of the entire current frame is input into the network, and the stored RD_comp value is updated.
In this embodiment, when the current frame is the first inter-predicted frame of the GOP, the decision may also be skipped and the current frame directly determined to be residual-coded in the second mode, or the initial value of RD_comp may be set so that the decision result is that the current frame is residual-coded in the second mode.
For a current frame determined to be residual-coded in the second mode, the reference factor RD_comp is computed and stored according to:

RD_comp = R'_r_bkg + D_w_r_bkg

where R'_r_bkg is the bit rate of the residual of the background region when the current frame is residual-coded in the second mode, and D_w_r_bkg is the distortion of the reconstructed image relative to the original image when the current frame is residual-coded in the second mode, which may be expressed as the MSE loss.
The reference factor above is the RD loss of the current frame when residual-coded in the second mode. From the second inter-predicted frame of the GOP onward, whenever the mode decision determines that the current frame is residual-coded in the second mode, RD_comp is computed from RD_comp = R'_r_bkg + D_w_r_bkg and the stored RD_comp is updated to the newly computed value. When making the mode decision for subsequent inter-predicted frames, the reference factor RD_comp can then be read directly for the judgment. Updating the reference factor adapts promptly to changes in video quality and yields more reasonable mode decisions.
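For illustration, steps three and four, together with the update of the stored reference factor, can be sketched as follows (all names are illustrative; rd_cur and rd_second are assumed to have been computed as described above):

```python
def decide_and_update(rd_cur, rd_second, state, lam):
    """rd_cur: RD loss of the current frame under the first mode;
    rd_second: R'_r_bkg + D_w_r_bkg of the current frame under the second mode;
    state['rd_comp'] holds the stored reference factor (initialise it to 0.0 so
    the first inter-predicted frame of a GOP falls through to the second mode)."""
    if rd_cur - state["rd_comp"] < lam:
        return "first"               # code only the target-region residual
    state["rd_comp"] = rd_second     # second mode chosen: refresh the reference factor
    return "second"

# Example usage: state = {"rd_comp": 0.0}; mode = decide_and_update(12.3, 10.8, state, 1.0)
```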
In the residual encoding method of the embodiment of the present disclosure shown in FIG. 8, the set condition includes one or more of the following:
Condition one: the impact factor is less than a set first threshold;
Condition two: the impact factor minus a first reference factor is less than a set second threshold, the first reference factor being determined from the second image quality and/or second bit rate after the current frame is residual-coded in the second mode;
Condition three: the impact factor minus a second reference factor is less than a set third threshold, the second reference factor being determined from the third image quality and/or third bit rate after a reference frame is residual-coded in the first mode or the second mode, the reference frame being an already-encoded inter-predicted frame in the group of pictures of the current frame. For example, the reference frame is the inter-predicted frame, nearest to the current frame in its GOP, that has been determined to be residual-coded in the second mode, and the second reference factor is determined from the third image quality and/or third bit rate after that reference frame is residual-coded in the second mode.
Here, the first image quality, second image quality and third image quality are each expressed as the distortion of the reconstructed image relative to the original image; the first bit rate is expressed as the bit rate of the residual of the background region or the target region; the second bit rate is expressed as the bit rate of the residual of the background region and the target region; and the third bit rate is expressed as the bit rate of the residual of the background region and/or the target region. In other embodiments of the present disclosure, the first, second and third image qualities may also be expressed by other parameters, such as the similarity of the reconstructed image to the original image or the Peak Signal-to-Noise Ratio (PSNR).

Claims (16)

  1. A residual coding method, comprising:
    when the current frame is an inter-predicted frame, computing an impact factor of residual-coding the current frame in a first mode, the first mode being a mode in which only the residual of a target region of the frame is coded, the impact factor being determined from a first image quality and/or a first bit rate after encoding;
    judging whether the impact factor satisfies a set condition; and
    when the impact factor satisfies the set condition, determining that the current frame is residual-coded in the first mode, and when the impact factor does not satisfy the set condition, determining that the current frame is residual-coded in a second mode, the second mode being a mode in which the residual of the entire frame is coded.
  2. The residual coding method of claim 1, wherein:
    the set condition comprises one or more of the following:
    condition one: the impact factor is less than a set first threshold;
    condition two: the impact factor minus a first reference factor is less than a set second threshold, the first reference factor being determined from a second image quality and/or a second bit rate after the current frame is residual-coded in the second mode;
    condition three: the impact factor minus a second reference factor is less than a set third threshold, the second reference factor being determined from a third image quality and/or a third bit rate after a reference frame is residual-coded in the first mode or the second mode, the reference frame being an already-encoded inter-predicted frame in the group of pictures of the current frame.
  3. The residual coding method of claim 2, wherein:
    the reference frame is the inter-predicted frame, nearest to the current frame in the group of pictures of the current frame, that has been determined to be residual-coded in the second mode, and the second reference factor is determined from the third image quality and/or the third bit rate after the reference frame is residual-coded in the second mode.
  4. The residual coding method of claim 1 or 2, wherein:
    the first image quality, the second image quality and the third image quality are each expressed as the distortion of the reconstructed image relative to the original image;
    the first bit rate is expressed as the bit rate of the residual of the background region or the target region;
    the second bit rate is expressed as the bit rate of the residual of the background region and the target region; and
    the third bit rate is expressed as the bit rate of the residual of the background region and/or the target region.
  5. The residual coding method of claim 1, wherein:
    the impact factor is computed as:
    RD_cur = R_r_bkg + D_w/o_r_bkg
    where RD_cur is the impact factor, R_r_bkg is the bit rate of the residual of the background region or the target region of the current frame, and D_w/o_r_bkg is the distortion, relative to the original image, of the reconstructed image after the current frame is residual-coded in the first mode.
  6. The residual coding method of claim 5, wherein:
    the set condition is: RD_cur − RD_comp < λ, with RD_comp = R'_r_bkg + D_w_r_bkg,
    where λ is a set threshold, RD_comp is a reference factor, R'_r_bkg is the bit rate of the residual of the background region and/or the target region of a reference frame, D_w_r_bkg is the distortion of the reconstructed image of the reference frame relative to its original image, and the reference frame is the inter-predicted frame, nearest to the current frame in the group of pictures of the current frame, that has been determined to be residual-coded in the second mode.
  7. The residual coding method of claim 1, wherein:
    the residual coding method further comprises: when the current frame is determined to be residual-coded in the first mode, setting the residual of the background region of the current frame equal to 0 and encoding it.
  8. The residual coding method of claim 1, wherein:
    the residual of the target region of the current frame is obtained by: generating a target mask of the current frame and dilating the target mask; and multiplying the residual data of the entire image of the current frame with the dilated target mask.
  9. The residual coding method of claim 8, wherein:
    dilating the target mask comprises: determining a dilation kernel to be used and dilating the target mask with the dilation kernel, wherein the size of the dilation kernel is positively correlated with a displacement statistic of the pixels of the target region of the current frame.
  10. The residual coding method of claim 9, wherein the dilation kernel is a square whose side length k_d is computed as:
    k_d = ceil(max(D * M_o)) + k_0
    where D is the matrix of displacement values of the pixels of the current frame, M_o is the first target mask before dilation, k_0 is a set constant, ceil() is the round-up function, and max() is the function taking the maximum element of a matrix.
  11. A video coding method, comprising:
    when the current frame is an inter-predicted frame, obtaining a predicted image of the current frame through inter prediction;
    computing the residual of the entire image of the current frame from the original image and the predicted image of the current frame; and
    performing residual coding according to the method of any one of claims 1 to 10.
  12. The video coding method of claim 11, wherein the video coding method further comprises:
    when the current frame is an I frame, encoding the original image of the current frame with a first neural network and a second neural network respectively to obtain an image feature map at a first bit rate and an image feature map at a second bit rate, the first bit rate being greater than the second bit rate;
    multiplying the image feature map at the first bit rate with a target mask to obtain a target feature map, and multiplying the image feature map at the second bit rate with a background mask to obtain a background feature map; and
    quantizing and entropy-coding the target feature map and the background feature map respectively.
  13. A residual coding apparatus, comprising a processor and a memory storing a computer program runnable on the processor, wherein the processor, when executing the computer program, implements the residual coding method of any one of claims 1 to 10.
  14. A video coding device, comprising a processor and a memory storing a computer program runnable on the processor, wherein the processor, when executing the computer program, implements the video coding method of claim 11 or 12.
  15. A video encoding and decoding system, comprising the video coding device of claim 14.
  16. A non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 12.
PCT/CN2021/100191 2021-06-15 2021-06-15 Residual coding and video coding method, apparatus, device and system WO2022261838A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/100191 WO2022261838A1 (zh) 2021-06-15 2021-06-15 Residual coding and video coding method, apparatus, device and system
CN202180099185.4A CN117480778A (zh) 2021-06-15 2021-06-15 Residual coding and video coding method, apparatus, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/100191 WO2022261838A1 (zh) 2021-06-15 2021-06-15 Residual coding and video coding method, apparatus, device and system

Publications (1)

Publication Number Publication Date
WO2022261838A1 true WO2022261838A1 (zh) 2022-12-22

Family

ID=84526800

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/100191 WO2022261838A1 (zh) 2021-06-15 2021-06-15 Residual coding and video coding method, apparatus, device and system

Country Status (2)

Country Link
CN (1) CN117480778A (zh)
WO (1) WO2022261838A1 (zh)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160150242A1 (en) * 2013-12-13 2016-05-26 Mediatek Singapore Pte. Ltd. Method of Background Residual Prediction for Video Coding
CN106162191A (zh) * 2015-04-08 2016-11-23 杭州海康威视数字技术股份有限公司 (Hangzhou Hikvision Digital Technology Co., Ltd.) Target-based video encoding method and system
US20170223357A1 (en) * 2016-01-29 2017-08-03 Google Inc. Motion vector prediction using prior frame residual
CN107396124A (zh) * 2017-08-29 2017-11-24 南京大学 (Nanjing University) Video compression method based on deep neural networks
CN110351557A (zh) * 2018-04-03 2019-10-18 朱政 (Zhu Zheng) Fast inter-prediction encoding method for video coding
WO2020256595A2 (en) * 2019-06-21 2020-12-24 Huawei Technologies Co., Ltd. Method and apparatus of still picture and video coding with shape-adaptive resampling of residual blocks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
M. LU, Z. MA (NJU), Z. DAI, D. WANG (OPPO): "EE1: Tests on Decomposition, Compression and Synthesis (DCS)-based Technology (JVET-V0149)", 23. JVET MEETING; 20210707 - 20210716; TELECONFERENCE; (THE JOINT VIDEO EXPLORATION TEAM OF ISO/IEC JTC1/SC29/WG11 AND ITU-T SG.16 ), 7 July 2021 (2021-07-07), pages 1 - 4, XP030296133 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115955568A (zh) * 2023-03-14 2023-04-11 中国电子科技集团公司第五十四研究所 (The 54th Research Institute of CETC) Low-latency video compression and intelligent target recognition method based on a HiSilicon chip
CN115955568B (zh) * 2023-03-14 2023-05-30 中国电子科技集团公司第五十四研究所 (The 54th Research Institute of CETC) Low-latency video compression and intelligent target recognition method based on a HiSilicon chip
CN116828184A (zh) * 2023-08-28 2023-09-29 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.) Video encoding and decoding method and apparatus, computer device and storage medium
CN116828184B (zh) * 2023-08-28 2023-12-22 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.) Video encoding and decoding method and apparatus, computer device and storage medium
CN117376551A (zh) 2023-12-04 2024-01-09 淘宝(中国)软件有限公司 (Taobao (China) Software Co., Ltd.) Video encoding acceleration method and electronic device
CN117376551B (zh) 2023-12-04 2024-02-23 淘宝(中国)软件有限公司 (Taobao (China) Software Co., Ltd.) Video encoding acceleration method and electronic device

Also Published As

Publication number Publication date
CN117480778A (zh) 2024-01-30


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21945430

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180099185.4

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE