CN116137659A - Inter-coded block partitioning method and apparatus - Google Patents

Inter-coded block partitioning method and apparatus

Info

Publication number
CN116137659A
Authority
CN
China
Prior art keywords
block
current block
image
sub
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111369165.2A
Other languages
Chinese (zh)
Inventor
钱生
马祥
杨海涛
张恋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202111369165.2A priority Critical patent/CN116137659A/en
Publication of CN116137659A publication Critical patent/CN116137659A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/119 Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/513 Processing of motion vectors
    • H04N19/567 Motion estimation based on rate distortion criteria
    • H04N19/593 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/96 Tree coding, e.g. quad-tree coding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application provides an inter-coded block partitioning method and apparatus. The method comprises: acquiring a data set of a current block, wherein the data set of the current block comprises image data of the current block and image data of a reference block of the current block; inputting the data set of the current block into a neural network model to obtain a division mode of the current block, wherein the neural network model is associated with the size of the image block corresponding to the input data set, and the image block comprises the current block and the reference block; and dividing the current block according to the division mode. The method and apparatus can accelerate block partitioning in inter-frame coding, and allow the deep neural network to effectively fit the RDO decision process of a standard encoder, so as to preserve coding performance.

Description

Inter-coded block partitioning method and apparatus
Technical Field
The present disclosure relates to the field of video coding technologies, and in particular, to a method and an apparatus for dividing blocks in inter-frame coding.
Background
In related video coding standards, a Coding Tree Unit (CTU) is partitioned based on a Quad Tree (QT): the CTU serves as the root node of the QT and is recursively partitioned into a plurality of leaf nodes. A QT is a tree structure in which a node is divided into four child nodes. QT-based CTU partitioning works as follows: the CTU is the root node, and each node corresponds to a square area; if a node is not divided any further, its area is a Coding Unit (CU); if a node is divided further, it is divided into four nodes of the next level, that is, the square area corresponding to the node is divided into four square areas of equal size, each with half the length and half the width of the area before division.
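The recursion described above can be sketched in a few lines of Python. This is a minimal illustration, not the encoder's actual implementation: the `should_split` callback, the `min_size` limit of 8 and the `demo_rule` function are assumptions introduced only for the example.

```python
from typing import Callable, List, Tuple

Region = Tuple[int, int, int]  # (x, y, size) of a square region


def quadtree_partition(
    x: int,
    y: int,
    size: int,
    should_split: Callable[[int, int, int], bool],
    min_size: int = 8,  # assumed smallest CU size, for illustration only
) -> List[Region]:
    """Return the leaf regions (CUs) of a quadtree rooted at (x, y, size)."""
    # A node that is not divided any further is a leaf, i.e. one CU.
    if size <= min_size or not should_split(x, y, size):
        return [(x, y, size)]
    half = size // 2
    leaves: List[Region] = []
    # Divide into four equal squares, each half as wide and half as tall.
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_partition(x + dx, y + dy, half, should_split, min_size)
    return leaves


# Hypothetical rule: split the 64x64 root, then split only its top-left child.
def demo_rule(x: int, y: int, size: int) -> bool:
    return size == 64 or (size == 32 and x == 0 and y == 0)


cus = quadtree_partition(0, 0, 64, demo_rule)
```

With this rule the 64×64 root yields four 16×16 CUs in the top-left quadrant and three 32×32 CUs elsewhere; together the leaves tile the CTU exactly.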
The division of one CTU into a set of CUs corresponds to one coding tree. Which coding tree a CTU uses is typically determined by the rate distortion optimization (RDO) of the encoder. The encoder tries multiple division modes for a CTU, each of which yields a rate distortion cost (RDC); the encoder compares the RDCs of the division modes and takes the division mode with the minimum RDC as the optimal division mode for encoding the CTU.
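The selection step can be illustrated with a short sketch. The text does not give the cost formula, so the Lagrangian form J = D + λ·R used below is an assumption (it is a common formulation), and the candidate names, distortion values and bit counts are made-up numbers.

```python
from typing import Dict, Tuple


def rd_cost(distortion: float, rate_bits: float, lam: float) -> float:
    # Assumed Lagrangian cost J = D + lambda * R; the source only says "RDC".
    return distortion + lam * rate_bits


def best_partition(candidates: Dict[str, Tuple[float, float]], lam: float) -> str:
    """Return the name of the division mode with the minimum RD cost."""
    return min(candidates, key=lambda name: rd_cost(*candidates[name], lam))


# Hypothetical (distortion, rate in bits) pairs for three division modes.
candidates = {
    "no_split": (1200.0, 40.0),
    "split_once": (700.0, 95.0),
    "split_twice": (650.0, 180.0),
}
choice = best_partition(candidates, lam=4.0)
```

Here the costs are 1360, 1080 and 1370, so `split_once` is selected. A real encoder has to encode the block once per candidate to obtain each pair, which is where the time goes.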
RDO is time-consuming and computationally expensive, so accelerating block partitioning is an important research topic in video coding; accelerating block partitioning for inter-frame coding in particular is a problem to be solved.
Disclosure of Invention
The embodiments of the application provide an inter-coded block partitioning method, which accelerates block partitioning in inter-frame coding and allows a deep neural network to effectively fit the RDO decision process of a standard encoder, so as to preserve coding performance.
In a first aspect, an embodiment of the present application provides a method for partitioning an inter-coded block, including: acquiring a data set of a current block, wherein the data set of the current block comprises image data of the current block and image data of a reference block of the current block; inputting the data set of the current block into a neural network model to obtain a division mode of the current block, wherein the neural network model is associated with the size of an image block corresponding to the input data set, and the image block comprises the current block and the reference block; and dividing the current block according to the dividing mode.
In this method, the current block and the reference block of the current block are used as the input of a deep neural network, and the deep neural network determines the division mode of the current block. Compared with the analysis performed by a standard encoder, the method accelerates block partitioning in inter-frame coding, and the deep neural network effectively fits the RDO decision process of the standard encoder, so as to preserve coding performance. It should be understood that, in addition to being used for inter-frame coding, the reference block of the current block may also be generated based on an intra-prediction mode in intra-frame coding, which is not specifically limited.
The current block in the embodiments of the present application may be the Coding Tree Unit (CTU) that the encoder is currently processing. A CTU generally corresponds to a square region in a picture; it may contain the luminance pixels and/or chrominance pixels in that square region, and may also contain syntax elements indicating how the CTU is divided into at least one Coding Unit (CU) and how each CU is decoded to obtain a reconstructed picture. A CU generally corresponds to an A × B rectangular area in an image and may contain the luminance pixels and/or chrominance pixels in that area, where A is the width of the rectangular area, B is its height, A and B may be the same or different, and the values of A and B are generally integer powers of 2.
In one possible implementation, the data set of the current block may include image data of the current block and image data of a reference block of the current block.
In the embodiments of the present application, a motion vector estimation algorithm may be used to obtain the motion vector (MV) of the current block: one MV of the current block is taken as a search starting point and offset within a search window to obtain a plurality of offset MVs; an optimal MV is determined from the plurality of offset MVs by rate distortion cost (RD cost); and the reference block of the current block is then obtained according to the optimal MV.
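The search described above can be sketched as an exhaustive scan of the window. This is a simplified illustration: the sum of absolute differences (SAD) stands in for the RD cost named in the text, and the function names, the toy frames and the search radius are assumptions made for the example.

```python
from typing import List, Tuple

Frame = List[List[int]]  # frame[row][col] holds one sample value


def sad(cur: Frame, ref: Frame, bx: int, by: int, n: int, mv: Tuple[int, int]) -> int:
    """SAD between the n x n current block at (bx, by) and the block displaced by mv."""
    mvx, mvy = mv
    return sum(
        abs(cur[by + r][bx + c] - ref[by + mvy + r][bx + mvx + c])
        for r in range(n)
        for c in range(n)
    )


def search_mv(cur: Frame, ref: Frame, bx: int, by: int, n: int,
              start: Tuple[int, int], radius: int) -> Tuple[int, int]:
    """Offset the starting MV inside a square window and keep the cheapest candidate."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = start, None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            mv = (start[0] + dx, start[1] + dy)
            x0, y0 = bx + mv[0], by + mv[1]
            # Skip candidates whose reference block leaves the frame.
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue
            cost = sad(cur, ref, bx, by, n, mv)  # stand-in for the RD cost
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = mv, cost
    return best_mv


# Toy frames: the current frame is the reference frame shifted by (2, 1).
ref = [[x * x + 10 * y for x in range(8)] for y in range(8)]
cur = [[ref[y + 1][x + 2] if y + 1 < 8 and x + 2 < 8 else 0 for x in range(8)]
       for y in range(8)]

best = search_mv(cur, ref, bx=2, by=2, n=2, start=(0, 0), radius=2)
# The reference block is read from the reference frame at the displaced position.
ref_block = [row[2 + best[0]:2 + best[0] + 2] for row in ref[2 + best[1]:2 + best[1] + 2]]
```

The search recovers the shift `(2, 1)` because only that candidate gives a SAD of zero on this toy data.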
The image data of the current block may include the values of luminance pixels and/or chrominance pixels included in an image area corresponding to the current block, where the image area is an area in an image to which the current block belongs. The image data of the reference block of the current block may include values of luminance pixels and/or chrominance pixels included in an image area corresponding to the reference block of the current block, where the image area is an area in an image to which the reference block of the current block belongs.
In one possible implementation, the data set of the current block includes image data of the current block, image data of a reference block of the current block, and pixel-level optical flow of the current block.
When an image comes with graphics processing unit (GPU) rendering information (GRI), the GRI may include information such as the position of any pixel in the image at different moments and the semantic segmentation of the image. The optical flow of a pixel can be obtained by calculating the difference between the positions of the pixel at two moments; the optical flow represents the motion direction and motion distance of the pixel. Each pixel has its own optical flow attribute, generally written pixel_{h,w}(MVx, MVy), meaning that the optical flow of the pixel at coordinates (h, w) is (MVx, MVy). The optical flows of a plurality of pixels form the pixel-level optical flow of the image block made up of those pixels. Deriving the block-level MV from the pixel-level optical flow can improve both the speed and the precision of the reference block search in inter-frame coding. The correspondence between the previous frame and the current frame can thus be found using the temporal variation of the pixels in the image sequence and the correlation between adjacent frames. Based on the pixel-level optical flow of the current block, the embodiments of the present application may use either of the following two methods to determine the reference block. In the first method, the MV of the current block is obtained from the pixel-level optical flow of the current block, and the reference block of the current block is then obtained according to that MV. In the second method, an initial MV is obtained from the pixel-level optical flow of the current block, the initial MV is used as a search starting point, the MV of the current block is obtained by a motion vector estimation algorithm, and the reference block of the current block is obtained according to that MV.
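A small sketch of the optical flow computation follows. The text only says that the flow is the difference between two positions and does not say how a block-level MV is derived from the per-pixel flows, so two things here are assumptions: the sign convention (previous position minus current position, so the vector points toward the reference frame) and the component-wise median used to aggregate the flows. All names and sample positions are hypothetical.

```python
from statistics import median
from typing import List, Tuple

Pos = Tuple[int, int]   # (h, w): row and column of a pixel
Flow = Tuple[int, int]  # (MVx, MVy): horizontal and vertical displacement


def pixel_flow(pos_cur: Pos, pos_prev: Pos) -> Flow:
    """Optical flow of one pixel as the difference of its positions at two moments."""
    # Assumed convention: MVx comes from the w coordinate, MVy from the h coordinate.
    return (pos_prev[1] - pos_cur[1], pos_prev[0] - pos_cur[0])


def block_mv(flows: List[Flow]) -> Flow:
    """Aggregate per-pixel flows into one block-level MV (median is an assumption)."""
    return (
        round(median(f[0] for f in flows)),
        round(median(f[1] for f in flows)),
    )


# Hypothetical positions of four pixels of a 2x2 block in the current and previous frame.
cur_positions = [(0, 0), (0, 1), (1, 0), (1, 1)]
prev_positions = [(1, 3), (1, 4), (2, 3), (2, 5)]

flows = [pixel_flow(c, p) for c, p in zip(cur_positions, prev_positions)]
mv = block_mv(flows)
```

The resulting `mv` can be used directly to fetch the reference block (first method) or as the starting point of a motion search (second method). The median is chosen here only because it tolerates an outlier pixel such as the last one.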
In addition to the image data of the current block and the image data of the reference block of the current block, the embodiment of the application can also add the pixel-level optical flow of the current block to the data set of the current block. Therefore, the GRI can be effectively utilized to assist block division in inter-frame coding, and particularly the searching speed and precision of a reference block are improved, so that the inter-frame coding efficiency is improved.
The partitioning of image blocks may be referred to as tree partitioning or hierarchical tree partitioning, in which a root block, for example at root tree level 0 (hierarchical level 0, depth 0), may be recursively partitioned into four blocks of the next lower tree level, for example nodes at tree level 1 (hierarchical level 1, depth 1). These blocks may in turn be partitioned into four blocks of the next lower level, for example tree level 2 (hierarchical level 2, depth 2), and so on until the partitioning is complete. Based on this, the embodiments of the present application provide a plurality of models, for example three deep neural network models nn_cu64, nn_cu32 and nn_cu16, each corresponding to one image block size. Dividing the current block yields smaller image blocks, which may in turn be divided into still smaller image blocks; inputting the data set of an image block obtained by division into the model of the corresponding size determines whether that image block is divided further. The embodiments adopt a deep-neural-network-based scheme in which training is driven by the data scenario (that is, image data of image blocks of the same size is used to train the deep neural network model for that size), which allows better scene adaptability.
In one possible implementation, the data set of the current block is input into a first neural network model to obtain a first division identifier, where the first division identifier indicates whether the current block is to be divided, and the first neural network model is trained on image blocks of the same size as the current block.
In one possible implementation, when the first division identifier indicates that division is to be performed, the current block is divided to obtain a plurality of first sub-blocks; a data set of the first sub-block is acquired, where the data set of the first sub-block includes image data of the first sub-block and image data of a reference block of the first sub-block; and the data set of the first sub-block is input into a second neural network model to obtain a second division identifier, where the second division identifier indicates whether the first sub-block is to be divided, and the second neural network model is trained on image blocks of the same size as the first sub-block.
Optionally, the data set of the first sub-block includes image data of the first sub-block and image data of a reference block of the first sub-block. Optionally, the data set of the first sub-block includes image data of the first sub-block, image data of a reference block of the first sub-block, and the pixel-level optical flow of the first sub-block.
In one possible implementation, when the second division identifier indicates that division is to be performed, the first sub-block is divided to obtain a plurality of second sub-blocks; a data set of the second sub-block is acquired, where the data set of the second sub-block includes image data of the second sub-block and image data of a reference block of the second sub-block; and the data set of the second sub-block is input into a third neural network model to obtain a third division identifier, where the third division identifier indicates whether the second sub-block is to be divided, and the third neural network model is trained on image blocks of the same size as the second sub-block.
Optionally, the data set of the second sub-block includes image data of the second sub-block and image data of a reference block of the second sub-block. Optionally, the data set of the second sub-block includes image data of the second sub-block, image data of a reference block of the second sub-block, and the pixel-level optical flow of the second sub-block.
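The three-stage decision above can be sketched as a cascade in which each block size has its own model. This is a minimal illustration under stated assumptions: the trained networks are replaced by stand-in lambdas, `build_dataset` returns placeholder fields instead of real pixel data, and the assumption that a block with no model of its size is never divided is made only for the example.

```python
from typing import Callable, Dict, List, Tuple

Block = Tuple[int, int, int]     # (x, y, size)
Model = Callable[[dict], bool]   # returns the division identifier: True means divide


def build_dataset(block: Block) -> dict:
    # Placeholder: a real data set would hold the block's pixels, the pixels of its
    # reference block and, optionally, its pixel-level optical flow.
    x, y, size = block
    return {"x": x, "y": y, "size": size}


def partition(block: Block, models: Dict[int, Model]) -> List[Block]:
    """Divide a block recursively, asking the model that matches the block's size."""
    x, y, size = block
    model = models.get(size)
    # Assumption: no model for this size means the block is not divided further.
    if model is None or not model(build_dataset(block)):
        return [block]
    half = size // 2
    leaves: List[Block] = []
    for dy in (0, half):
        for dx in (0, half):
            leaves.extend(partition((x + dx, y + dy, half), models))
    return leaves


# Stand-ins for the first, second and third models (e.g. nn_cu64, nn_cu32, nn_cu16).
models: Dict[int, Model] = {
    64: lambda d: True,                          # divide the 64x64 current block
    32: lambda d: d["x"] == 32 and d["y"] == 0,  # divide only the top-right 32x32
    16: lambda d: False,                         # never divide 16x16 blocks
}

leaves = partition((0, 0, 64), models)
```

Each sub-block triggers exactly one model call instead of a full RDO trial, which is the source of the speed-up the text describes.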
In one possible implementation, the current block is smaller than or equal to 64×64 in size.
In one possible implementation, the partitioning corresponds to a quadtree partitioning.
In a second aspect, an embodiment of the present application provides a block dividing apparatus, including: an acquisition module, configured to acquire a data set of a current block, where the data set of the current block includes image data of the current block and image data of a reference block of the current block; the decision module is used for inputting the data set of the current block into a neural network model to obtain the division mode of the current block, the neural network model is associated with the size of an image block corresponding to the input data set, and the image block comprises the current block and the reference block; and the dividing module is used for dividing the current block according to the dividing mode.
In one possible implementation, the decision module is specifically configured to input the data set of the current block into a first neural network model to obtain a first division identifier, where the first division identifier indicates whether the current block is to be divided, and the first neural network model is trained on image blocks of the same size as the current block.
In one possible implementation, the dividing module is further configured to divide the current block to obtain a plurality of first sub-blocks when the first division identifier indicates that the current block is to be divided; the decision module is further configured to acquire a data set of the first sub-block, where the data set of the first sub-block includes image data of the first sub-block and image data of a reference block of the first sub-block, and to input the data set of the first sub-block into a second neural network model to obtain a second division identifier, where the second division identifier indicates whether the first sub-block is to be divided, and the second neural network model is trained on image blocks of the same size as the first sub-block.
In one possible implementation, the dividing module is further configured to divide the first sub-block to obtain a plurality of second sub-blocks when the second division identifier indicates that the first sub-block is to be divided; the decision module is further configured to acquire a data set of the second sub-block, where the data set of the second sub-block includes image data of the second sub-block and image data of a reference block of the second sub-block, and to input the data set of the second sub-block into a third neural network model to obtain a third division identifier, where the third division identifier indicates whether the second sub-block is to be divided, and the third neural network model is trained on image blocks of the same size as the second sub-block.
In one possible implementation, the acquisition module is further configured to acquire graphics processor rendering information (GRI) and to acquire the pixel-level optical flow of the current block according to the GRI; the data set further includes the pixel-level optical flow of the current block.
In one possible implementation, the acquisition module is specifically configured to extract, from the GRI, position information of a first pixel at different moments, where the first pixel is any pixel in the current block, and to obtain the optical flow of the first pixel by calculating the difference between the positions of the first pixel at the different moments; the pixel-level optical flow of the current block includes the optical flow of the first pixel.
In one possible implementation, the acquisition module is further configured to obtain the MV of the current block according to the pixel-level optical flow of the current block, and to acquire the reference block of the current block according to the MV of the current block.
In one possible implementation, the acquisition module is further configured to obtain an initial MV according to the pixel-level optical flow of the current block; to obtain the MV of the current block by a motion vector estimation algorithm, taking the initial MV as the search starting point; and to acquire the reference block of the current block according to the MV of the current block.
In one possible implementation, the current block is smaller than or equal to 64×64 in size.
In one possible implementation, the partitioning corresponds to a quadtree partitioning.
In a third aspect, embodiments of the present application provide an encoder, comprising: one or more processors; and a non-transitory computer readable storage medium coupled to the processors and storing a program for execution by the processors, wherein the program, when executed by the processors, causes the encoder to perform the method of any of the first aspects above.
In a fourth aspect, embodiments of the present application provide a computer program product comprising program code for performing the method of any of the first aspects above, when the program code is executed on a computer or processor.
Drawings
FIG. 1a is an exemplary block diagram of a coding system 10 according to an embodiment of the present application;
FIG. 1b is an exemplary block diagram of a video coding system 40 according to an embodiment of the present application;
FIG. 2 is an exemplary block diagram of a video encoder 20 according to an embodiment of the present application;
FIG. 3 is a schematic diagram of HEVC intra prediction directions;
FIG. 4 is an exemplary block diagram of a video coding apparatus 400 according to an embodiment of the present application;
FIG. 5 is an exemplary block diagram of an apparatus 500 according to an embodiment of the present application;
FIG. 6 is an exemplary schematic diagram of a candidate image block according to an embodiment of the present application;
FIG. 7a is an exemplary training schematic of a deep neural network model;
FIG. 7b is an exemplary training schematic of a deep neural network model;
FIG. 8 is a flow chart of a process 800 of an inter-coded block partitioning method according to an embodiment of the present application;
FIG. 9 is an exemplary schematic diagram of a search window according to an embodiment of the present application;
FIG. 10 is an exemplary flowchart of an image block partitioning method provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of a block partitioning approach according to an embodiment of the present application;
FIG. 12 is an exemplary flowchart of an image block partitioning method provided by an embodiment of the present application;
FIG. 13 is a schematic diagram of block partitioning according to an embodiment of the present application;
FIG. 14 is an exemplary flowchart of an image block partitioning method provided by an embodiment of the present application;
FIG. 15 is a schematic diagram of a block partitioning approach according to an embodiment of the present application;
FIG. 16 is a schematic diagram of an exemplary structure of a block dividing apparatus 1600 according to an embodiment of the present application.
Detailed Description
The embodiments of the present application provide an AI-based video compression technique, in particular a neural-network-based video compression technique, and more specifically a neural network (NN)-based inter-frame prediction technique that improves a conventional hybrid video codec system.
Video coding generally refers to processing a sequence of images that form a video or video sequence. In the field of video coding, the terms "picture", "frame" and "image" may be used as synonyms. Video coding (or coding in general) includes two parts, video encoding and video decoding. Video encoding is performed on the source side and typically involves processing (e.g., compressing) the original video image to reduce the amount of data required to represent the video image (and thus to store and/or transmit it more efficiently). Video decoding is performed on the destination side and typically involves the inverse of the encoder's processing to reconstruct the video image. The "coding" of video images (or images in general) in the embodiments is to be understood as the "encoding" or "decoding" of video images or video sequences. The encoding part and the decoding part are also collectively referred to as a codec (encoding and decoding, CODEC).
In the case of lossless video coding, the original video image may be reconstructed, i.e. the reconstructed video image has the same quality as the original video image (assuming no transmission loss or other data loss during storage or transmission). In the case of lossy video coding, further compression is performed by quantization or the like to reduce the amount of data required to represent the video image, whereas the decoder side cannot fully reconstruct the video image, i.e. the quality of the reconstructed video image is lower or worse than the quality of the original video image.
Several video coding standards belong to the group of "lossy hybrid video codecs" (i.e., they combine spatial and temporal prediction in the pixel domain with 2D transform coding for applying quantization in the transform domain). Each picture in a video sequence is typically partitioned into a set of non-overlapping blocks, and coding is typically performed at the block level. In other words, the encoder typically processes, i.e., encodes, video at the block (video block) level, e.g., by using spatial (intra) prediction and temporal (inter) prediction to produce a prediction block; subtracting the prediction block from the current block (the block currently being processed or to be processed) to obtain a residual block; and transforming the residual block and quantizing it in the transform domain to reduce the amount of data to be transmitted (compression). The decoder applies the inverse of the encoder's processing to the encoded or compressed block to reconstruct the current block for representation. In addition, the encoder duplicates the processing steps of the decoder so that the encoder and the decoder generate the same predictions (e.g., intra-prediction and inter-prediction) and/or reconstructed pixels for processing, i.e., coding, subsequent blocks.
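The block-level loop described above can be sketched with one-dimensional toy blocks. This is a deliberate simplification: the 2D transform is omitted (treated as identity), quantization is plain uniform rounding, and the sample values and quantization step are made-up numbers.

```python
from typing import List


def encode_block(cur: List[int], pred: List[int], qstep: int) -> List[int]:
    """Subtract the prediction and quantize the residual (transform omitted)."""
    residual = [c - p for c, p in zip(cur, pred)]
    return [round(r / qstep) for r in residual]


def reconstruct_block(levels: List[int], pred: List[int], qstep: int) -> List[int]:
    """Inverse processing: dequantize the levels and add the prediction back."""
    return [p + level * qstep for p, level in zip(pred, levels)]


# Hypothetical samples of a current block and its prediction block.
cur = [101, 105, 97, 101]
pred = [98, 100, 100, 100]
qstep = 4

levels = encode_block(cur, pred, qstep)           # what would be entropy coded
recon = reconstruct_block(levels, pred, qstep)    # identical at encoder and decoder
```

The reconstruction `[102, 104, 96, 100]` differs from the original samples, which is the loss introduced by quantization. Because the encoder runs `reconstruct_block` itself, it predicts later blocks from the same pixels the decoder will have.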
In the following, embodiments of the coding system 10, the encoder 20 and the decoder 30 are described with reference to FIG. 1a to FIG. 3.
FIG. 1a is an exemplary block diagram of a coding system 10 according to an embodiment of the present application, such as a video coding system 10 (or simply coding system 10) that may utilize the techniques of the embodiments of the present application. Video encoder 20 (or simply encoder 20) and video decoder 30 (or simply decoder 30) in the video coding system 10 represent devices that may be used to perform techniques according to the various examples described in the embodiments of the present application.
As shown in FIG. 1a, the coding system 10 comprises a source device 12 configured to provide encoded image data 21, such as an encoded image, to a destination device 14 that decodes the encoded image data 21.
Source device 12 includes an encoder 20 and, in addition, optionally, may include an image source 16, a preprocessor (or preprocessing unit) 18 such as an image preprocessor, and a communication interface (or communication unit) 22.
Image source 16 may include or be any type of image capturing device for capturing real-world images and the like, and/or any type of image generating device, such as a computer graphics processor for generating computer-animated images, or any type of device for capturing and/or providing real-world images, computer-generated images (e.g., screen content, virtual reality (VR) images), and/or any combination thereof (e.g., augmented reality (AR) images).
In order to distinguish the processing performed by the preprocessor (or preprocessing unit) 18, the image (or image data) 17 may also be referred to as an original image (or original image data) 17.
The preprocessor 18 is configured to receive the original image data 17 and preprocess the original image data 17 to obtain a preprocessed image (or preprocessed image data) 19. For example, the preprocessing performed by the preprocessor 18 may include cropping, color format conversion (e.g., from RGB to YCbCr), toning, or denoising. It is understood that the preprocessing unit 18 may be an optional component.
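As an illustration of the color format conversion mentioned above, the following sketch converts one 8-bit RGB pixel to YCbCr. It assumes the full-range BT.601 coefficients used by JFIF; the embodiments do not specify which conversion matrix the preprocessor 18 uses, so the coefficients and the function name are assumptions.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to 8-bit YCbCr.

    Assumption: full-range BT.601 coefficients as used by JFIF.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    # Round to nearest integer and clamp to the 8-bit range.
    return tuple(min(255, max(0, int(v + 0.5))) for v in (y, cb, cr))
```

Neutral colors such as black, gray, and white map to Cb = Cr = 128, which shows that the chrominance components carry only the color information.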
A video encoder (or encoder) 20 is provided for receiving the preprocessed image data 19 and providing encoded image data 21 (described further below with respect to fig. 2, etc.).
The communication interface 22 in the source device 12 may be used to receive the encoded image data 21 and to transmit the encoded image data 21 (or any other processed version thereof) over the communication channel 13 to another device, such as the destination device 14, for storage or direct reconstruction.
The destination device 14 includes a decoder 30 and, in addition, optionally, may include a communication interface (or communication unit) 28, a post-processor (or post-processing unit) 32, and a display device 34.
The communication interface 28 in the destination device 14 is used to receive the encoded image data 21 (or any other processed version) directly from the source device 12 or from any other source device such as a storage device, e.g., a storage device for encoded image data, and to provide the encoded image data 21 to the decoder 30.
Communication interface 22 and communication interface 28 may be used to transmit or receive encoded image data (or encoded data) 21 over a direct communication link between source device 12 and destination device 14, such as a direct wired or wireless connection, or the like, or over any type of network, such as a wired network, a wireless network, or any combination thereof, any type of private and public networks, or any combination thereof.
For example, the communication interface 22 may be used to encapsulate the encoded image data 21 into a suitable format, such as a message, and/or process the encoded image data using any type of transmission encoding or processing for transmission over a communication link or network.
Communication interface 28 corresponds to communication interface 22 and may be used, for example, to receive the transmission data and process the transmission data using any type of corresponding transmission decoding or processing and/or decapsulation to obtain encoded image data 21.
Communication interface 22 and communication interface 28 may each be configured as a one-way communication interface, as indicated by the arrow for the communication channel 13 in fig. 1a pointing from the source device 12 to the destination device 14, or as a two-way communication interface, and may be used to send and receive messages or the like to establish a connection, and to acknowledge and exchange any other information related to the communication link and/or the data transmission, such as the transmission of encoded image data.
The video decoder (or decoder) 30 is configured to receive the encoded image data 21 and provide a decoded image (or decoded image data) 31 (described further below with respect to fig. 3, etc.).
The post-processor 32 is configured to post-process the decoded image data 31 (also referred to as reconstructed image data), such as a decoded image, to obtain post-processed image data 33, such as a post-processed image. The post-processing performed by the post-processing unit 32 may include, for example, color format conversion (e.g., from YCbCr to RGB), toning, cropping, or resampling, or any other processing for preparing the decoded image data 31 for display by the display device 34 or the like.
The display device 34 is configured to receive the post-processed image data 33 and display an image to a user or viewer or the like. The display device 34 may be or include any type of display for presenting a reconstructed image, such as an integrated or external display screen or monitor. For example, the display screen may include a liquid crystal display (LCD), an organic light emitting diode (OLED) display, a plasma display, a projector, a micro LED display, a liquid crystal on silicon (LCoS) display, a digital light processor (DLP), or any other type of display screen.
The decoding system 10 further comprises a training engine 25, the training engine 25 being arranged to train the encoder 20, in particular the segmentation unit in the encoder 20, to process the input image or image region or image block to obtain a segmented image region or image block.
The training data in the embodiments of the present application may be stored in a database (not shown), and the training engine 25 trains a target model (which may be, for example, a deep neural network for image segmentation) based on the training data. It should be noted that the embodiments of the present application do not limit the source of the training data; for example, the training data may be obtained from the cloud or elsewhere for model training.
The target model trained by the training engine 25 may be applied in the decoding system 10, for example, in the source device 12 (e.g., encoder 20) shown in fig. 1a. The training engine 25 may train the target model in the cloud, after which the decoding system 10 downloads the target model from the cloud and uses it; alternatively, the training engine 25 may train the target model in the cloud and the target model may be used in the cloud, with the decoding system 10 obtaining the processing result directly from the cloud. For example, the training engine 25 trains a deep neural network for image segmentation, the decoding system 10 downloads the deep neural network from the cloud, and the segmentation unit in the encoder 20 then segments the input image block according to the deep neural network to obtain a segmented image region or image block. As another example, the training engine 25 trains a deep neural network for image segmentation, and the decoding system 10 does not need to download the deep neural network from the cloud; instead, the encoder 20 transmits the image block to the cloud, the cloud segments the image block using the deep neural network, and the resulting segmented image region or image block is transmitted to a subsequent module of the encoder 20 for processing.
Although fig. 1a shows the source device 12 and the destination device 14 as separate devices, device embodiments may also include both devices or the functions of both, i.e., the source device 12 or its corresponding functions together with the destination device 14 or its corresponding functions. In these embodiments, the source device 12 or corresponding functions and the destination device 14 or corresponding functions may be implemented using the same hardware and/or software, using separate hardware and/or software, or using any combination thereof.
From the description, it will be apparent to the skilled person that the presence and (exact) division of the different units or functions in the source device 12 and/or the destination device 14 shown in fig. 1a may vary depending on the actual device and application.
Encoder 20 (e.g., video encoder 20) or decoder 30 (e.g., video decoder 30), or both, may be implemented by processing circuitry as shown in fig. 1b, such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, dedicated video coding processors, or any combination thereof. Encoder 20 may be implemented by processing circuitry 46 to include the various modules discussed with reference to encoder 20 of fig. 2 and/or any other encoder system or subsystem described herein. Decoder 30 may be implemented by processing circuitry 46 to include the various modules discussed with reference to decoder 30 of fig. 3 and/or any other decoder system or subsystem described herein. The processing circuitry 46 may be used to perform the various operations discussed below. As shown in fig. 5, if some of the techniques are implemented in software, the device may store the software instructions in a suitable non-transitory computer-readable storage medium and execute the instructions in hardware using one or more processors to perform the techniques of the embodiments of the present application. Either of the video encoder 20 and the video decoder 30 may be integrated in a single device as part of a combined encoder/decoder (codec), as shown in fig. 1b.
Source device 12 and destination device 14 may comprise any of a variety of devices, including any type of handheld or stationary device, such as a notebook or laptop computer, a cell phone, a smart phone, a tablet computer, a camera, a desktop computer, a set-top box, a television, a display device, a digital media player, a video game console, a video streaming device (e.g., a content service server or content distribution server), a broadcast receiving device, a broadcast transmitting device, etc., and may use no operating system or any type of operating system. In some cases, source device 12 and destination device 14 may be equipped with components for wireless communication. Thus, the source device 12 and the destination device 14 may be wireless communication devices.
In some cases, the video coding system 10 shown in fig. 1a is merely exemplary, and the techniques provided by embodiments of the present application may be applicable to video coding settings (e.g., video coding or video decoding) that do not necessarily include any data communication between the coding device and the decoding device. In other examples, the data is retrieved from local memory, sent over a network, and so on. The video encoding device may encode and store data in the memory and/or the video decoding device may retrieve and decode data from the memory. In some examples, encoding and decoding are performed by devices that do not communicate with each other, but simply encode data into memory and/or retrieve and decode data from memory.
Fig. 1b is an exemplary block diagram of a video coding system 40 according to an embodiment of the present application. As shown in fig. 1b, the video coding system 40 may include an imaging device 41, a video encoder 20, a video decoder 30 (and/or a video codec implemented by processing circuitry 46), an antenna 42, one or more processors 43, one or more memory devices 44, and/or a display device 45.
As shown in fig. 1b, the imaging device 41, the antenna 42, the processing circuit 46, the video encoder 20, the video decoder 30, the processor 43, the memory 44 and/or the display device 45 are capable of communicating with each other. In different examples, video coding system 40 may include only video encoder 20 or only video decoder 30.
In some examples, antenna 42 may be used to transmit or receive an encoded bitstream of video data. Additionally, in some examples, display device 45 may be used to present video data. The processing circuitry 46 may include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, and the like. The video coding system 40 may also include an optional processor 43, which may similarly include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, or the like. In addition, the memory storage 44 may be any type of memory, such as volatile memory (e.g., static random access memory (SRAM) or dynamic random access memory (DRAM)) or non-volatile memory (e.g., flash memory). In a non-limiting example, the memory storage 44 may be implemented by cache memory. In other examples, processing circuitry 46 may include memory (e.g., a cache) for implementing an image buffer or the like.
In some examples, video encoder 20 implemented by logic circuitry may include an image buffer (e.g., implemented by processing circuitry 46 or memory storage 44) and a graphics processing unit (e.g., implemented by processing circuitry 46). The graphics processing unit may be communicatively coupled to the image buffer. The graphics processing unit may include video encoder 20 implemented by processing circuitry 46 to implement the various modules discussed with reference to fig. 2 and/or any other encoder system or subsystem described herein. Logic circuitry may be used to perform various operations discussed herein.
In some examples, video decoder 30 may be implemented in a similar manner by processing circuitry 46 to implement the various modules discussed with reference to video decoder 30 of fig. 3 and/or any other decoder system or subsystem described herein. In some examples, video decoder 30 implemented by logic circuitry may include an image buffer (implemented by processing circuitry 46 or memory storage 44) and a graphics processing unit (e.g., implemented by processing circuitry 46). The graphics processing unit may be communicatively coupled to the image buffer. The graphics processing unit may include video decoder 30 implemented by processing circuitry 46 to implement the various modules discussed with reference to fig. 3 and/or any other decoder system or subsystem described herein.
In some examples, antenna 42 may be used to receive an encoded bitstream of video data. As discussed, the encoded bitstream may include data related to the encoded video frame, indicators, index values, mode selection data, etc., discussed herein, such as data related to the encoded partitions (e.g., transform coefficients or quantized transform coefficients, optional indicators (as discussed), and/or data defining the encoded partitions). Video coding system 40 may also include a video decoder 30 coupled to antenna 42 and used to decode the encoded bitstream. The display device 45 is used to present video frames.
It should be understood that video decoder 30 may be used to perform the reverse process for the example described with reference to video encoder 20 in the embodiments of the present application. Regarding signaling syntax elements, video decoder 30 may be configured to receive and parse such syntax elements and decode the associated video data accordingly. In some examples, video encoder 20 may entropy encode the syntax elements into an encoded video bitstream. In such examples, video decoder 30 may parse such syntax elements and decode the relevant video data accordingly.
For ease of description, embodiments of the present application are described with reference to Versatile Video Coding (VVC) reference software or High-Efficiency Video Coding (HEVC) developed by the Joint Collaboration Team on Video Coding (JCT-VC) of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). Those of ordinary skill in the art understand that embodiments of the present application are not limited to HEVC or VVC.
Encoder and encoding method
Fig. 2 is an exemplary block diagram of video encoder 20 of an embodiment of the present application. As shown in fig. 2, video encoder 20 includes an input (or input interface) 201, a residual calculation unit 204, a transform processing unit 206, a quantization unit 208, an inverse quantization unit 210, an inverse transform processing unit 212, a reconstruction unit 214, a loop filter 220, a decoded image buffer (decoded picture buffer, DPB) 230, a mode selection unit 260, an entropy encoding unit 270, and an output (or output interface) 272. The mode selection unit 260 may include an inter prediction unit 244, an intra prediction unit 254, and a partition unit 262. The inter prediction unit 244 may include a motion estimation unit and a motion compensation unit (not shown). The video encoder 20 shown in fig. 2 may also be referred to as a hybrid video encoder or a hybrid video codec-based video encoder.
The residual calculation unit 204, the transform processing unit 206, the quantization unit 208, and the mode selection unit 260 constitute a forward signal path of the encoder 20, while the inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the buffer 216, the loop filter 220, the decoded image buffer (decoded picture buffer, DPB) 230, the inter prediction unit 244, and the intra prediction unit 254 constitute a backward signal path of the encoder, wherein the backward signal path of the encoder 20 corresponds to a signal path of a decoder (see decoder 30 in fig. 3). The inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the loop filter 220, the decoded image buffer 230, the inter prediction unit 244, and the intra prediction unit 254 also constitute a "built-in decoder" of the video encoder 20.
Image and image segmentation (image and block)
Encoder 20 may be used to receive an image (or image data) 17 via input 201 or the like, for example, an image in a sequence of images forming a video or video sequence. The received image or image data may also be a preprocessed image (or preprocessed image data) 19. For simplicity, the following description refers to image 17. Image 17 may also be referred to as the current image or the image to be encoded (especially in video coding, to distinguish the current image from other images, e.g., previously encoded and/or decoded images of the same video sequence, i.e., the video sequence that also includes the current image).
A (digital) image is or can be regarded as a two-dimensional array or matrix of pixels having intensity values. A pixel in the array may also be referred to as a pel (short for picture element). The number of pixels of the array or image in the horizontal and vertical directions (or axes) determines the size and/or resolution of the image. To represent color, three color components are typically employed, i.e., an image may be represented as or include three pixel arrays. In the RGB format or color space, an image includes corresponding arrays of red, green, and blue pixels. However, in video coding, each pixel is typically represented in a luminance/chrominance format or color space, such as YCbCr, which includes a luminance component indicated by Y (sometimes also by L) and two chrominance components indicated by Cb and Cr. The luminance (luma) component Y represents the brightness or gray-level intensity (e.g., both are the same in a grayscale image), while the two chrominance (chroma) components Cb and Cr represent the chromaticity or color information. Accordingly, an image in YCbCr format includes a luminance pixel array of luminance values (Y) and two chrominance pixel arrays of chrominance values (Cb and Cr). An image in RGB format may be converted or transformed into YCbCr format and vice versa, a process also known as color transformation or conversion. If an image is monochrome, it may include only a luminance pixel array. Accordingly, an image may be, for example, a luminance pixel array in monochrome format, or a luminance pixel array and two corresponding chrominance pixel arrays in the 4:2:0, 4:2:2, or 4:4:4 color format.
In one embodiment, video encoder 20 may include an image segmentation unit (not shown in fig. 2) for segmenting image 17 into a plurality of (typically non-overlapping) image blocks 203. These blocks may also be referred to as root blocks, macroblocks (H.264/AVC), or coding tree blocks (CTBs) or coding tree units (CTUs) in the H.265/HEVC and VVC standards. The segmentation unit may be configured to use the same block size, and a corresponding grid defining the block size, for all images of the video sequence, or to change the block size between images or subsets or groups of images, and to segment each image into the corresponding blocks.
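The relationship between the color format and the dimensions of the pixel arrays can be sketched as follows. The subsampling factors are the conventional ones for these formats; the table and function names are hypothetical, and rounding up for odd dimensions is an assumption of this sketch.

```python
# Horizontal and vertical chroma subsampling factors per color format.
# None means that no chrominance arrays are present.
SUBSAMPLING = {
    "monochrome": None,
    "4:2:0": (2, 2),
    "4:2:2": (2, 1),
    "4:4:4": (1, 1),
}


def plane_sizes(width, height, color_format):
    """Return the (width, height) of each pixel array of an image."""
    luma = (width, height)
    factors = SUBSAMPLING[color_format]
    if factors is None:
        return [luma]
    sx, sy = factors
    # Assumption: odd dimensions are rounded up.
    chroma = ((width + sx - 1) // sx, (height + sy - 1) // sy)
    return [luma, chroma, chroma]
```

For a 1920x1080 image, the 4:2:0 format gives two 960x540 chrominance arrays, so the image holds half as many chrominance values as luminance values.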
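A minimal sketch of segmenting an image into a grid of non-overlapping blocks of a fixed size is given below. Cropping the blocks at the right and bottom image boundaries is an assumption made for illustration; actual codecs define their own boundary handling. The function name and block size are hypothetical.

```python
def split_into_blocks(width, height, block_size):
    """Return (x, y, w, h) for each block of a fixed-size grid, in raster order."""
    blocks = []
    for y in range(0, height, block_size):
        for x in range(0, width, block_size):
            # Assumption: blocks crossing the image boundary are cropped.
            w = min(block_size, width - x)
            h = min(block_size, height - y)
            blocks.append((x, y, w, h))
    return blocks


# Example: a 1920x1080 image with 128x128 blocks gives 15 columns and 9 rows.
blocks = split_into_blocks(1920, 1080, 128)
```

The blocks cover every pixel exactly once, so the sum of the block areas equals the image area.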
In other embodiments, a video encoder may be used to directly receive blocks 203 of image 17, e.g., one, several, or all of the blocks comprising the image 17. The image block 203 may also be referred to as a current image block or an image block to be encoded.
Like image 17, image block 203 is or may be regarded as a two-dimensional array or matrix of pixels having intensity values (pixel values), but image block 203 is smaller than image 17. In other words, block 203 may include one pixel array (e.g., a luminance array in the case of a monochrome image 17, or a luminance array or a chrominance array in the case of a color image), three pixel arrays (e.g., one luminance array and two chrominance arrays in the case of a color image 17), or any other number and/or type of arrays, depending on the color format employed. The number of pixels in the horizontal and vertical directions (or axes) of the block 203 defines the size of the block 203. Accordingly, a block may be an M×N (M columns × N rows) array of pixels, an M×N array of transform coefficients, or the like.
In one embodiment, video encoder 20 shown in fig. 2 is used to encode image 17 on a block-by-block basis, e.g., encoding and prediction are performed for each block 203.
In one embodiment, video encoder 20 shown in fig. 2 may also be used to segment and/or encode images using slices (also referred to as video slices), where an image may be segmented into or encoded using one or more slices (which are typically non-overlapping). Each slice may include one or more blocks (e.g., coding tree units (CTUs)) or one or more groups of blocks (e.g., tiles in the H.265/HEVC and VVC standards and bricks in the VVC standard).
In one embodiment, video encoder 20 shown in fig. 2 may also be configured to segment and/or encode an image using slices/tile groups (also referred to as video tile groups) and/or tiles (also referred to as video tiles), where an image may be segmented into or encoded using one or more slices/tile groups (typically non-overlapping), each slice/tile group may include one or more blocks (e.g., CTUs) or one or more tiles, and each tile may be rectangular or the like and may include one or more complete or partial blocks (e.g., CTUs).
Residual calculation
The residual calculation unit 204 is configured to calculate a residual block 205 from the image block (or original block) 203 and the prediction block 265 (the prediction block 265 is described in detail later), for example, by subtracting the pixel values of the prediction block 265 from the pixel values of the image block 203 pixel by pixel to obtain the residual block 205 in the pixel domain.
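The pixel-by-pixel subtraction can be sketched as follows, with blocks represented as lists of rows. The function name and the sample values are hypothetical.

```python
def residual_block(original, prediction):
    """Subtract the prediction block from the original block, pixel by pixel."""
    return [
        [o - p for o, p in zip(original_row, prediction_row)]
        for original_row, prediction_row in zip(original, prediction)
    ]


# Hypothetical 2x2 luminance blocks.
original = [[120, 122], [119, 125]]
prediction = [[118, 121], [121, 121]]
residual = residual_block(original, prediction)
```

Residual values are signed, so they need a wider value range than the pixel values they are derived from.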
Transformation
The transform processing unit 206 is configured to apply a transform, such as a discrete cosine transform (DCT) or a discrete sine transform (DST), to the pixel values of the residual block 205 to obtain transform coefficients 207 in the transform domain. The transform coefficients 207, which may also be referred to as transform residual coefficients, represent the residual block 205 in the transform domain.
The transform processing unit 206 may be used to apply an integer approximation of the DCT/DST, such as the transforms specified for H.265/HEVC. Compared with the orthogonal DCT, such an integer approximation is typically scaled by a certain factor. To preserve the norm of a residual block processed by the forward and inverse transforms, additional scaling factors are applied as part of the transform process. The scaling factors are typically selected based on certain constraints, such as being a power of 2 so that scaling can be done by shift operations, the bit depth of the transform coefficients, and the trade-off between accuracy and implementation cost. For example, a specific scaling factor is specified for the inverse transform performed by the inverse transform processing unit 212 on the encoder 20 side (and for the corresponding inverse transform performed by, for example, the inverse transform processing unit 312 on the decoder 30 side), and accordingly, a corresponding scaling factor may be specified for the forward transform performed by the transform processing unit 206 on the encoder 20 side.
In one embodiment, video encoder 20 (and correspondingly, transform processing unit 206) may be configured to output transform parameters, such as the type of the one or more transforms, either directly or after encoding or compression by entropy encoding unit 270, so that, for example, video decoder 30 may receive and use the transform parameters for decoding.
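The scaling of an integer approximation relative to the orthogonal DCT can be illustrated with the 4-point core transform matrix of H.265/HEVC, which approximates the orthonormal DCT-II matrix multiplied by 128. The sketch below is one-dimensional and omits the shift and rounding stages that a real codec applies between transform passes; the helper names are hypothetical.

```python
import math

# 4-point integer core transform matrix of H.265/HEVC.
HEVC_DCT4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def orthonormal_dct_matrix(n):
    """Build the n-point orthonormal DCT-II matrix."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([
            scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        ])
    return rows


def forward_transform(matrix, samples):
    """Apply a 1D transform: one output coefficient per matrix row."""
    return [sum(m * s for m, s in zip(row, samples)) for row in matrix]


SCALE = 128  # 2**7, a power of 2 so that it can be undone by a shift
exact = orthonormal_dct_matrix(4)
max_deviation = max(
    abs(HEVC_DCT4[k][i] - SCALE * exact[k][i])
    for k in range(4)
    for i in range(4)
)

# A flat residual concentrates all its energy in the first (DC) coefficient.
coefficients = forward_transform(HEVC_DCT4, [10, 10, 10, 10])
```

The integer entries stay within about 1.4 of the scaled exact values, which is the accuracy given up in exchange for integer-only arithmetic.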
Quantization
The quantization unit 208 is configured to quantize the transform coefficient 207 by, for example, scalar quantization or vector quantization, resulting in a quantized transform coefficient 209. The quantized transform coefficients 209 may also be referred to as quantized residual coefficients 209.
The quantization process may reduce the bit depth associated with some or all of the transform coefficients 207. For example, an n-bit transform coefficient may be rounded down to an m-bit transform coefficient during quantization, where n is greater than m. The degree of quantization may be modified by adjusting a quantization parameter (QP). For example, for scalar quantization, different degrees of scaling may be applied to achieve finer or coarser quantization. Smaller quantization step sizes correspond to finer quantization, while larger quantization step sizes correspond to coarser quantization. The appropriate quantization step size may be indicated by the quantization parameter. For example, the quantization parameter may be an index into a predefined set of suitable quantization step sizes. For example, a smaller quantization parameter may correspond to fine quantization (a smaller quantization step size) and a larger quantization parameter may correspond to coarse quantization (a larger quantization step size), or vice versa. Quantization may include division by the quantization step size, while the corresponding or inverse dequantization performed by the inverse quantization unit 210 or the like may include multiplication by the quantization step size. Embodiments according to some standards, such as HEVC, may use the quantization parameter to determine the quantization step size. In general, the quantization step size may be calculated from the quantization parameter using a fixed-point approximation of an equation that includes division. Additional scaling factors may be introduced for quantization and dequantization to restore the norm of the residual block, which may have been modified by the scaling used in the fixed-point approximation of the equation relating quantization step size and quantization parameter. In one exemplary implementation, the scaling of the inverse transform and of the dequantization may be combined.
Alternatively, custom quantization tables may be used and indicated from the encoder to the decoder in the bitstream, etc. Quantization is a lossy operation, where the larger the quantization step size, the larger the loss.
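The relationship between quantization parameter, quantization step size, and loss can be sketched as follows. The sketch assumes the commonly cited floating-point relation for H.264/HEVC-style codecs, in which the step size doubles for every increase of 6 in QP; real codecs use the fixed-point approximation and additional scaling factors mentioned above. The function names and coefficient values are hypothetical.

```python
def quantization_step(qp):
    """Assumed relation: step size 1 at QP 4, doubling every 6 QP."""
    return 2.0 ** ((qp - 4) / 6.0)


def quantize(coefficients, qp):
    """Divide by the step size and round to the nearest level."""
    step = quantization_step(qp)
    levels = []
    for c in coefficients:
        magnitude = int(abs(c) / step + 0.5)
        levels.append(magnitude if c >= 0 else -magnitude)
    return levels


def dequantize(levels, qp):
    """Multiply the levels by the same step size."""
    step = quantization_step(qp)
    return [level * step for level in levels]


# Hypothetical transform coefficients.
coefficients = [100, -37, 12, 3]
levels_fine = quantize(coefficients, 22)    # step size 8
levels_coarse = quantize(coefficients, 28)  # step size 16
```

Dequantizing `levels_fine` gives 104, -40, 16, 0 rather than the original coefficients, which shows why the operation is lossy; the coarser step size produces smaller levels that cost fewer bits but deviate further from the input.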
In one embodiment, video encoder 20 (and correspondingly quantization unit 208) may be configured to output the quantization parameter (QP), either directly or after encoding or compression by entropy encoding unit 270, so that, for example, video decoder 30 may receive and use the quantization parameter for decoding.
Inverse quantization
The inverse quantization unit 210 is configured to apply the inverse of the quantization performed by the quantization unit 208 to the quantized coefficients to obtain dequantized coefficients 211, for example, by performing an inverse quantization scheme that corresponds to the quantization scheme performed by the quantization unit 208 and is based on or uses the same quantization step size as the quantization unit 208. The dequantized coefficients 211 may also be referred to as dequantized residual coefficients 211 and correspond to the transform coefficients 207, but because of the quantization loss they are typically not identical to the transform coefficients.
Inverse transformation
The inverse transform processing unit 212 is configured to perform the inverse of the transform performed by the transform processing unit 206, for example, an inverse discrete cosine transform (DCT) or an inverse discrete sine transform (DST), to obtain a reconstructed residual block 213 (or corresponding dequantized coefficients 213) in the pixel domain. The reconstructed residual block 213 may also be referred to as a transform block 213.
Reconstruction
The reconstruction unit 214 (e.g., the summer 214) is configured to add the transform block 213 (i.e., the reconstructed residual block 213) to the prediction block 265 to obtain the reconstructed block 215 in the pixel domain, e.g., to add the pixel values of the reconstructed residual block 213 and the pixel values of the prediction block 265.
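The addition performed by the reconstruction unit can be sketched as follows. Clipping the sum to the valid pixel range for a given bit depth is an assumption added for illustration and is not stated in the paragraph above; the function name and values are hypothetical.

```python
def reconstruct(residual, prediction, bit_depth=8):
    """Add the reconstructed residual to the prediction, pixel by pixel."""
    max_value = (1 << bit_depth) - 1
    return [
        # Assumption: the sum is clipped to the range [0, max_value].
        [min(max_value, max(0, r + p)) for r, p in zip(res_row, pred_row)]
        for res_row, pred_row in zip(residual, prediction)
    ]


# Hypothetical 2x2 blocks; two of the sums fall outside the 8-bit range.
reconstructed = reconstruct([[2, -3], [10, -5]], [[118, 2], [250, 121]])
```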
Filtering
The loop filter unit 220 (or simply "loop filter" 220) is used to filter the reconstructed block 215 to obtain a filtered block 221, or more generally to filter reconstructed pixels to obtain filtered pixel values. For example, the loop filter unit is used to smooth pixel transitions or otherwise improve video quality. Loop filter unit 220 may include one or more loop filters, such as a deblocking filter, a sample-adaptive offset (SAO) filter, or one or more other filters, such as an adaptive loop filter (ALF), a noise suppression filter (NSF), or any combination thereof. For example, the loop filter unit 220 may include a deblocking filter, an SAO filter, and an ALF filter, and the filtering may be applied in the order deblocking filter, SAO filter, ALF filter. As another example, a process called luma mapping with chroma scaling (LMCS) (i.e., an adaptive in-loop reshaper) is added. This process is performed prior to deblocking. As another example, the deblocking filtering process may also be applied to internal sub-block edges, such as affine sub-block edges, ATMVP sub-block edges, sub-block transform (SBT) edges, and intra sub-partition (ISP) edges. Although loop filter unit 220 is shown in fig. 2 as an in-loop filter, in other configurations loop filter unit 220 may be implemented as a post-loop filter. The filtered block 221 may also be referred to as a filtered reconstructed block 221.
In one embodiment, video encoder 20 (and correspondingly loop filter unit 220) may be configured to output loop filter parameters (e.g., SAO filter parameters, ALF filter parameters, or LMCS parameters), either directly or after entropy encoding by entropy encoding unit 270, so that, for example, decoder 30 may receive the loop filter parameters and decode using the same or different loop filter parameters.
Decoded picture buffer
The decoded picture buffer (DPB) 230 may be a reference picture memory that stores reference picture data for use by the video encoder 20 in encoding video data. DPB 230 may be formed of any of a variety of memory devices, such as dynamic random access memory (DRAM), including synchronous DRAM (SDRAM), magnetoresistive RAM (MRAM), resistive RAM (RRAM), or other types of memory devices. The decoded image buffer 230 may be used to store one or more filtered blocks 221. The decoded image buffer 230 may also be used to store other previously filtered blocks, such as previously reconstructed and filtered blocks 221, of the same current image or of different images, such as previously reconstructed images, and may provide complete previously reconstructed, i.e., decoded, images (and corresponding reference blocks and pixels) and/or a partially reconstructed current image (and corresponding reference blocks and pixels), for example for inter prediction. The decoded image buffer 230 may also be used to store one or more unfiltered reconstructed blocks 215, or generally unfiltered reconstructed pixels, e.g., reconstructed blocks 215 that have not been filtered by the loop filter unit 220, or reconstructed blocks or reconstructed pixels that have not undergone any other processing.
Mode selection (segmentation and prediction)
The mode selection unit 260 comprises a segmentation unit 262, an inter prediction unit 244 and an intra prediction unit 254, and is used to receive or obtain original image data, such as the original block 203 (current block 203 of the current image 17), and reconstructed image data, e.g. filtered and/or unfiltered reconstructed pixels or reconstructed blocks of the same (current) image and/or of one or more previously decoded images, from the decoded image buffer 230 or other buffers (e.g. line buffers, not shown in the figure). The reconstructed image data is used as the reference image data required for prediction, such as inter prediction or intra prediction, to obtain a prediction block 265 or a prediction value 265.
The mode selection unit 260 may be configured to determine or select a partition for the current block (including no partition) and a prediction mode (e.g., an intra or inter prediction mode), and to generate a corresponding prediction block 265, which is used to calculate the residual block 205 and to reconstruct the reconstructed block 215.
In one embodiment, the mode selection unit 260 may be configured to select the partition and prediction mode (e.g., from among the prediction modes supported by or available to the mode selection unit 260) that provides the best match or the minimum residual (a minimum residual means better compression for transmission or storage), or that provides the minimum signaling overhead (a minimum signaling overhead means better compression for transmission or storage), or that considers or balances both. The mode selection unit 260 may be adapted to determine the segmentation and prediction modes based on rate-distortion optimization (RDO), i.e., to select the prediction mode that provides the minimum rate-distortion cost. The terms "best," "lowest," "optimal," and the like herein do not necessarily refer to the overall "best," "lowest," or "optimal," but may also refer to cases where a termination or selection criterion is met; for example, a value exceeding or falling below a threshold, or other constraints, may lead to a "sub-optimal selection" but reduce complexity and processing time.
In other words, the segmentation unit 262 may be used to segment the images in a video sequence into a sequence of coding tree units (CTUs), and a CTU 203 may be further segmented into smaller block portions or sub-blocks (which again form blocks), for example, by iteratively using quad-tree (QT) partitioning, binary-tree (BT) partitioning, or ternary-tree (TT) partitioning, or any combination thereof, and to perform prediction on each of the block portions or sub-blocks, where mode selection includes selecting the tree structure of the segmented block 203 and selecting the prediction mode applied to each of the block portions or sub-blocks.
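The selection rule described above can be sketched as follows. This is a minimal illustration, not the encoder's actual decision logic: the function name `select_mode`, the candidate tuples, and the `good_enough` threshold are hypothetical, and the cost is assumed to take the common form distortion + λ·rate.

```python
def select_mode(candidates, lam, good_enough=None):
    """Pick the candidate with the lowest rate-distortion cost.

    candidates: iterable of (name, distortion, rate_in_bits) tuples (hypothetical inputs).
    lam: Lagrangian multiplier that trades distortion against rate.
    good_enough: optional cost threshold; once a candidate reaches it, the
        search stops early, so the result may be sub-optimal but cheaper to find.
    """
    best = None
    for name, distortion, rate in candidates:
        cost = distortion + lam * rate
        if best is None or cost < best[1]:
            best = (name, cost)
        if good_enough is not None and best[1] <= good_enough:
            break  # termination criterion met; remaining candidates are skipped
    return best


modes = [("intra_dc", 100, 10), ("inter_merge", 60, 4), ("inter_amvp", 40, 12)]
exhaustive = select_mode(modes, lam=2.0)
early = select_mode(modes, lam=2.0, good_enough=70)
```

With the threshold set, the search stops at the first candidate whose cost is acceptable, which illustrates why "best" in the text does not necessarily mean the overall best.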
The partitioning performed by video encoder 20 (e.g., performed by partitioning unit 262) and the prediction processing (e.g., performed by inter prediction unit 244 and intra prediction unit 254) are described in detail below.
Segmentation
The segmentation unit 262 may segment (or divide) one image block (or CTU) 203 into smaller parts, such as square or rectangular small blocks. For an image with three pixel arrays, one CTU consists of an N×N block of luminance pixels and two corresponding blocks of chrominance pixels. The maximum allowable size of the luminance block in a CTU is specified as 128×128 in the versatile video coding (VVC) standard under development, but may be specified as a different value in the future, for example 256×256. The CTUs of an image may be grouped into slices/tile groups, tiles, or bricks. A tile covers a rectangular area of an image, and a tile may be divided into one or more bricks. A brick consists of multiple CTU rows within a tile. A tile that is not partitioned into multiple bricks may be referred to as a brick. However, a brick that is a true subset of a tile is not referred to as a tile. VVC supports two tile group modes: a raster-scan slice/tile group mode and a rectangular slice mode. In raster-scan tile group mode, a slice/tile group contains a sequence of tiles in the tile raster scan of an image. In rectangular slice mode, a slice contains a plurality of bricks of an image that together make up a rectangular area of the image. The bricks within a rectangular slice are arranged in the brick raster scan order of the slice. The smaller blocks obtained by segmentation (which may also be referred to as sub-blocks) may be further partitioned into even smaller portions. This is also referred to as tree segmentation or hierarchical tree segmentation, where a root block, e.g., at root tree level 0 (hierarchy level 0, depth 0), may be recursively segmented into two or more blocks of the next lower tree level, e.g., nodes at tree level 1 (hierarchy level 1, depth 1).
These blocks may in turn be partitioned into two or more next lower level blocks, e.g. tree level 2 (hierarchy level 2, depth 2) etc., until the partitioning is finished (because the ending criterion is fulfilled, e.g. maximum tree depth or minimum block size is reached). The blocks that are not further partitioned are also referred to as leaf blocks or leaf nodes of the tree. A tree divided into two parts is called a Binary Tree (BT), a tree divided into three parts is called a Ternary Tree (TT), and a tree divided into four parts is called a Quadtree (QT).
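The recursive segmentation and its ending criteria can be sketched with a quadtree. The function `quadtree_leaves` and the `should_split` callback are hypothetical names introduced for illustration; a real encoder would make the split decision from rate-distortion costs.

```python
def quadtree_leaves(x, y, size, depth, should_split, min_size=8, max_depth=4):
    """Recursively split a square block into four quadrants.

    Recursion ends when the minimum block size or the maximum tree depth is
    reached, or when should_split(x, y, size, depth) declines to split.
    Returns the leaf blocks as (x, y, size, depth) tuples.
    """
    if size <= min_size or depth >= max_depth or not should_split(x, y, size, depth):
        return [(x, y, size, depth)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_leaves(x + dx, y + dy, half, depth + 1,
                                      should_split, min_size, max_depth)
    return leaves


# Toy decision rule: keep splitting only the block that touches the top-left corner.
leaves = quadtree_leaves(0, 0, 64, 0, lambda x, y, size, depth: x == 0 and y == 0,
                         min_size=16)
```

Here the 64×64 root (depth 0) yields three 32×32 leaves at depth 1 and four 16×16 leaves at depth 2, and the leaves together tile the root block exactly.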
For example, a coding tree unit (CTU) may be or include a CTB of luminance pixels and two corresponding CTBs of chrominance pixels of an image having three pixel arrays, or a CTB of pixels of a monochrome image, or a CTB of pixels of an image encoded using three independent color planes and the syntax structures used for encoding the pixels. Accordingly, a coding tree block (CTB) may be an N×N block of pixels, where N may be set to a value such that the division of a component into CTBs is a partitioning. A coding unit (CU) may be or include a coding block of luminance pixels and two corresponding coding blocks of chrominance pixels of an image having three pixel arrays, or a coding block of pixels of a monochrome image, or a coding block of pixels of an image encoded using three independent color planes and the syntax structures used for encoding the pixels. Accordingly, a coding block (CB) may be an M×N block of pixels, where M and N may be set to values such that the division of a CTB into coding blocks is a partitioning.
For example, in an embodiment, a coding tree unit (CTU) may be divided into a plurality of CUs according to HEVC by using a quadtree structure denoted as a coding tree. The decision whether to encode an image region using inter (temporal) prediction or intra (spatial) prediction is made at the leaf CU level. Each leaf CU may be further divided into one, two, or four PUs according to the PU partition type. The same prediction process is used within one PU, and the related information is transmitted to the decoder in units of PUs. After the residual block is obtained by applying the prediction process according to the PU partition type, the leaf CU may be partitioned into transform units (TUs) according to another quadtree structure similar to the coding tree used for the CU.
In an embodiment, according to the latest video coding standard currently under development, referred to as versatile video coding (VVC), the segmentation structure used to segment the coding tree units is a quadtree with a nested multi-type tree (e.g., binary tree and ternary tree). In the coding tree structure within a coding tree unit, a CU may be square or rectangular. For example, the coding tree unit (CTU) is first partitioned by a quadtree structure, and the quadtree leaf nodes are further partitioned by the multi-type tree structure. The multi-type tree structure has four partition types: vertical binary tree partition (split_bt_ver), horizontal binary tree partition (split_bt_hor), vertical ternary tree partition (split_tt_ver), and horizontal ternary tree partition (split_tt_hor). The multi-type tree leaf nodes are referred to as coding units (CUs), and unless a CU is too large for the maximum transform length, this segmentation is used for prediction and transform processing without any further partitioning. Whether a node is further divided is indicated by a first flag (mtt_split_cu_flag); when the node is further divided, the dividing direction is indicated by a second flag (mtt_split_cu_vertical_flag), and whether the division is a binary tree division or a ternary tree division is then indicated by a third flag (mtt_split_cu_binary_flag). From the values of mtt_split_cu_vertical_flag and mtt_split_cu_binary_flag, the decoder may derive the multi-type tree partitioning mode (MttSplitMode) of the CU based on predefined rules or tables. For some designs, such as a 64×64 luma block and 32×32 chroma pipeline design in a VVC hardware decoder, TT division is not allowed when the width or height of the luma coding block is greater than 64. Likewise, TT division is not allowed when the width or height of the chroma coding block is greater than 32.
The pipeline design divides the image into a plurality of virtual pipeline data units (VPDUs), which are defined as non-overlapping units in the image. In a hardware decoder, consecutive VPDUs are processed simultaneously in a plurality of pipeline stages. In most pipeline stages, the VPDU size is approximately proportional to the buffer size, so it is important to keep the VPDU size small. In most hardware decoders, the VPDU size may be set to the maximum transform block (TB) size. However, in VVC, ternary tree (TT) and binary tree (BT) partitioning may increase the size of the VPDU.
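The derivation of the multi-type tree split mode from the two flags can be sketched as a table lookup. The mapping below follows the usual VVC convention, with the four partition types named in the text spelled in upper case; the 1:2:1 ratio used for ternary splits and the helper names `mtt_split_mode`, `tt_allowed`, and `child_sizes` are assumptions made for illustration and are not stated in the text above.

```python
# (mtt_split_cu_vertical_flag, mtt_split_cu_binary_flag) -> split mode
MTT_SPLIT_TABLE = {
    (0, 0): "SPLIT_TT_HOR",
    (0, 1): "SPLIT_BT_HOR",
    (1, 0): "SPLIT_TT_VER",
    (1, 1): "SPLIT_BT_VER",
}


def mtt_split_mode(vertical_flag, binary_flag):
    """Look up the split mode signalled by the two flags."""
    return MTT_SPLIT_TABLE[(vertical_flag, binary_flag)]


def tt_allowed(width, height, max_tt_size=64):
    """Ternary splits are disallowed when either luma dimension exceeds 64."""
    return width <= max_tt_size and height <= max_tt_size


def child_sizes(width, height, mode):
    """Return the (width, height) of each child block produced by a split mode."""
    if mode == "SPLIT_BT_VER":
        return [(width // 2, height)] * 2
    if mode == "SPLIT_BT_HOR":
        return [(width, height // 2)] * 2
    if mode == "SPLIT_TT_VER":  # assumed 1:2:1 ratio
        return [(width // 4, height), (width // 2, height), (width // 4, height)]
    if mode == "SPLIT_TT_HOR":  # assumed 1:2:1 ratio
        return [(width, height // 4), (width, height // 2), (width, height // 4)]
    raise ValueError("unknown split mode: " + mode)
```

For example, a 32×32 node with vertical_flag = 1 and binary_flag = 0 is split vertically into three children of widths 8, 16, and 8.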
In addition, it should be noted that when a portion of a tree node block exceeds the bottom or right image boundary, the tree node block is forced to be divided until all pixels of each encoded CU are located within the image boundary.
For example, the intra sub-partition (ISP) tool may divide a luminance intra-prediction block vertically or horizontally into two or four sub-parts, depending on the block size.
In one example, mode selection unit 260 of video encoder 20 may be used to perform any combination of the segmentation techniques described above.
As described above, video encoder 20 is configured to determine or select the best or optimal prediction mode from a (predetermined) set of prediction modes. The set of prediction modes may include, for example, intra prediction modes and/or inter prediction modes.
Intra prediction
The set of intra prediction modes may include 35 different intra prediction modes, for example, non-directional modes such as a DC (or mean) mode and a planar mode, and directional modes as defined in HEVC, or may include 67 different intra prediction modes, for example, non-directional modes such as a DC (or mean) mode and a planar mode, and directional modes as defined in VVC. For example, several conventional angular intra prediction modes are adaptively replaced with the wide-angle intra prediction modes defined in VVC for non-square blocks. As another example, to avoid a division operation in DC prediction, only the longer side is used to calculate the average for non-square blocks. In addition, the intra prediction result of the planar mode may be modified using a position dependent intra prediction combination (PDPC) method.
The intra prediction unit 254 is configured to generate an intra prediction block 265 using reconstructed pixels of neighboring blocks of the same current image according to intra prediction modes in the intra prediction mode set.
Intra-prediction unit 254 (or, typically, mode selection unit 260) is also used to output intra-prediction parameters (or, typically, information indicating a selected intra-prediction mode of the block) in the form of syntax element 266 to entropy encoding unit 270 for inclusion into encoded image data 21 so that video decoder 30 may perform operations, such as receiving and using the prediction parameters for decoding.
The intra prediction modes in HEVC include a direct current (DC) prediction mode, a planar prediction mode, and 33 angular prediction modes, for a total of 35 candidate prediction modes. Fig. 3 is a schematic view of the HEVC intra prediction directions. As shown in fig. 3, a current block may be intra predicted using the pixels of the reconstructed image blocks on its left and above it as a reference. An image block in the peripheral region of the current block that is used for intra prediction of the current block is called a reference block, and the pixels in the reference block are referred to as reference pixels. Of the 35 candidate prediction modes, the DC prediction mode is suitable for a region of the current block with flat texture, and all pixels in that region use the average value of the reference pixels in the reference block as their prediction; the planar prediction mode is suitable for an image block whose texture changes smoothly, and a current block meeting this condition uses bilinear interpolation of the reference pixels in the reference block as the prediction of all pixels in the current block; the angular prediction modes exploit the fact that the texture of the current block is highly correlated with the texture of the adjacent reconstructed image blocks, and copy the values of the reference pixels in the corresponding reference block along a certain angle as the prediction of all pixels in the current block.
The HEVC encoder selects an optimal intra prediction mode for the current block from among the 35 candidate prediction modes shown in fig. 3 and writes the optimal intra prediction mode to the video bitstream. To improve the coding efficiency of intra prediction, the encoder/decoder derives 3 most probable modes from the respective optimal intra prediction modes of the intra-predicted reconstructed image blocks in the peripheral region. If the optimal intra prediction mode selected for the current block is one of the 3 most probable modes, a first index is encoded indicating which of the 3 most probable modes the selected optimal intra prediction mode is; if the selected optimal intra prediction mode is not among the 3 most probable modes, a second index is encoded indicating which of the other 32 modes (the modes other than the 3 most probable modes among the 35 candidate prediction modes) the selected optimal intra prediction mode is. The HEVC standard uses a 5-bit fixed-length code for the second index.
The HEVC encoder derives the 3 most probable modes as follows. The optimal intra prediction modes of the left adjacent image block and the upper adjacent image block of the current block are selected; if the two optimal intra prediction modes are the same, only one of them is retained in the set. If the two optimal intra prediction modes are the same and are an angular prediction mode, the two angular prediction modes adjacent to it in angular direction are added to the set; otherwise, the planar prediction mode, the DC mode, and the vertical prediction mode are added to the set in that order until the number of modes in the set reaches 3.
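The derivation of the 3 most probable modes can be sketched as follows. The mode numbering (planar = 0, DC = 1, angular modes 2 to 34, vertical = 26) is the HEVC convention and is an assumption here, since the text above gives no mode numbers; the function name `derive_mpm` is hypothetical, and the handling of unavailable or non-intra neighbors is omitted.

```python
PLANAR, DC, VERTICAL = 0, 1, 26  # assumed HEVC mode numbers


def derive_mpm(left_mode, above_mode):
    """Build the list of 3 most probable modes from the two neighbor modes."""
    mpm = [left_mode]
    if above_mode != left_mode:
        mpm.append(above_mode)  # identical modes are kept only once

    if len(mpm) == 1 and left_mode >= 2:
        # Both neighbors share one angular mode: add its two angular
        # neighbors, wrapping around within the angular range 2..34.
        mpm.append(2 + ((left_mode + 29) % 32))
        mpm.append(2 + ((left_mode - 2 + 1) % 32))
    else:
        # Otherwise fill with planar, DC, vertical (in that order) up to 3 modes.
        for mode in (PLANAR, DC, VERTICAL):
            if len(mpm) == 3:
                break
            if mode not in mpm:
                mpm.append(mode)
    return mpm
```

For instance, two neighbors that both use angular mode 10 yield the set 10, 9, 11, while two planar neighbors yield planar, DC, vertical.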
After entropy decoding the code stream, the HEVC decoder obtains mode information of the current block, where the mode information includes an indication flag indicating whether an optimal intra-prediction mode of the current block is among 3 most probable modes, and an index of the optimal intra-prediction mode of the current block among 3 most probable modes or an index of the optimal intra-prediction mode of the current block among other 32 modes.
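The mode signalling described above (a flag, followed by either an index into the 3 most probable modes or a 5-bit index into the remaining 32 modes) can be sketched as a round trip. This is a simplified illustration: the function names are hypothetical, the remaining modes are assumed to be indexed in ascending mode order, and the actual entropy coding of the flag and indices is not modelled.

```python
NUM_MODES = 35  # HEVC: DC, planar, and 33 angular modes


def encode_intra_mode(mode, mpm):
    """Return (mpm_flag, payload) for the selected mode.

    payload is the position in the MPM list, or a 5-bit string indexing the
    32 modes that are not in the MPM list.
    """
    if mode in mpm:
        return 1, mpm.index(mode)
    remaining = [m for m in range(NUM_MODES) if m not in mpm]
    return 0, format(remaining.index(mode), "05b")


def decode_intra_mode(mpm_flag, payload, mpm):
    """Invert encode_intra_mode using the same MPM list."""
    if mpm_flag == 1:
        return mpm[payload]
    remaining = [m for m in range(NUM_MODES) if m not in mpm]
    return remaining[int(payload, 2)]


mpm_list = [10, 9, 11]
flag, payload = encode_intra_mode(34, mpm_list)
decoded = decode_intra_mode(flag, payload, mpm_list)
```

Because exactly 32 modes remain after removing the 3 most probable modes, every remaining mode fits in a 5-bit fixed-length code.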
Inter prediction
In a possible implementation, the set of inter prediction modes depends on the available reference pictures (i.e., at least some of the previously decoded pictures stored in DPB 230, as described above) and on other inter prediction parameters, for example, on whether the entire reference picture or only a portion of it (e.g., a search window region around the region of the current block) is used to search for the best matching reference block, and/or on whether pixel interpolation, such as half-pixel, quarter-pixel, and/or 1/16-pixel interpolation, is performed.
In addition to the above prediction modes, a skip mode and/or a direct mode may be employed.
For example, in extended merge prediction, the merge candidate list for this mode consists of the following five candidate types, in order: spatial MVPs from spatially neighboring CUs, temporal MVPs from collocated CUs, history-based MVPs from a FIFO table, pairwise average MVPs, and zero MVs. Decoder-side motion vector refinement (DMVR) based on bilateral matching can be used to increase the accuracy of the MVs of the merge mode. Merge mode with MVD (MMVD) is a merge mode with motion vector differences. An MMVD flag is transmitted immediately after the skip flag and the merge flag to specify whether the MMVD mode is used for the CU. A CU-level adaptive motion vector resolution (AMVR) scheme may be used. AMVR allows the MVD of a CU to be encoded with different precisions. The MVD precision of the current CU is adaptively selected according to the prediction mode of the current CU. When a CU is encoded in merge mode, a combined inter/intra prediction (CIIP) mode may be applied to the current CU. The CIIP prediction is obtained as a weighted average of the inter and intra prediction signals. For affine motion compensated prediction, the affine motion field of a block is described by the motion vectors of 2 control points (4 parameters) or 3 control points (6 parameters). Sub-block-based temporal motion vector prediction (SbTMVP) is similar to temporal motion vector prediction (TMVP) in HEVC, but predicts the motion vectors of the sub-CUs within the current CU. Bi-directional optical flow (BDOF), previously known as BIO, is a simplified version that requires less computation, in particular fewer multiplications and smaller multipliers.
In the triangle division mode, the CU is uniformly divided into two triangle parts in two division manners of diagonal division and anti-diagonal division. In addition, the bi-prediction mode is extended on a simple average basis to support weighted averaging of two prediction signals.
Inter prediction unit 244 may include a motion estimation (ME) unit and a motion compensation (MC) unit (both not shown in fig. 2). The motion estimation unit may be used to receive or obtain an image block 203 (current image block 203 of current image 17) and a decoded image 231, or at least one or more previously reconstructed blocks, e.g., reconstructed blocks of one or more other/different previously decoded images 231, for motion estimation. For example, a video sequence may include the current image and the previously decoded images 231; in other words, the current image and the previously decoded images 231 may be part of, or form, the sequence of images that makes up the video sequence.
For example, encoder 20 may be configured to select a reference block from a plurality of reference blocks of the same or different ones of a plurality of other pictures, and provide the reference picture (or reference picture index) and/or an offset (spatial offset) between a position (x, y coordinates) of the reference block and a position of the current block as inter prediction parameters to the motion estimation unit. This offset is also called Motion Vector (MV).
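The selection of a reference block and the resulting motion vector can be sketched with an exhaustive block-matching search. This is only an illustration under simplifying assumptions: integer-pixel accuracy, a single reference image, the sum of absolute differences as the matching criterion, and hypothetical helper names (`sad`, `get_block`, `full_search`); practical encoders use faster search strategies.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, x, y, size):
    """Extract the size x size block whose top-left corner is (x, y)."""
    return [row[x:x + size] for row in frame[y:y + size]]


def full_search(cur_frame, ref_frame, x, y, size, search_range):
    """Return ((dx, dy), sad) of the best match inside the search window."""
    cur = get_block(cur_frame, x, y, size)
    height, width = len(ref_frame), len(ref_frame[0])
    best_mv, best_sad = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = x + dx, y + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate block would leave the reference image
            cost = sad(cur, get_block(ref_frame, rx, ry, size))
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (dx, dy), cost
    return best_mv, best_sad


# Toy data: the current image is the reference image shifted by (2, 1).
ref = [[r * 8 + c for c in range(8)] for r in range(8)]
cur = [[ref[r + 1][c + 2] if r + 1 < 8 and c + 2 < 8 else 0 for c in range(8)]
       for r in range(8)]
mv, cost = full_search(cur, ref, x=2, y=2, size=2, search_range=2)
```

The returned offset (dx, dy) between the position of the best matching reference block and the position of the current block is the motion vector.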
The motion compensation unit is configured to obtain, e.g., receive, inter prediction parameters, and to perform inter prediction based on or using the inter prediction parameters to obtain the inter prediction block 246. The motion compensation performed by the motion compensation unit may involve fetching or generating a prediction block based on the motion/block vector determined by motion estimation, and may also involve performing interpolation to sub-pixel accuracy. Interpolation filtering may generate additional pixel values from known pixel values, potentially increasing the number of candidate prediction blocks that may be used to encode an image block. Upon receiving the motion vector corresponding to a PU of the current image block, the motion compensation unit may locate the prediction block to which the motion vector points in one of the reference image lists.
The motion compensation unit may also generate syntax elements related to the blocks and video slices for use by video decoder 30 in decoding the image blocks of the video slices. In addition to, or as an alternative to, slices and corresponding syntax elements, tile groups and/or tiles and corresponding syntax elements may be generated or used.
In the process of obtaining the candidate motion vector list in advanced motion vector prediction (AMVP) mode, the motion vectors (MVs) that may be added to the candidate motion vector list as candidates include the MVs of spatially and temporally adjacent image blocks of the current block, where the MVs of the spatially adjacent image blocks may in turn include the MVs of left candidate image blocks located to the left of the current block and the MVs of upper candidate image blocks located above the current block. For example, fig. 6 is an exemplary schematic diagram of candidate image blocks according to an embodiment of the present application. As shown in fig. 6, the set of left candidate image blocks includes {A0, A1}, the set of upper candidate image blocks includes {B0, B1, B2}, and the set of temporally adjacent candidate image blocks includes {C, T}. All three sets may supply candidates for the candidate motion vector list, but according to the existing coding standard, the maximum length of the AMVP candidate motion vector list is 2, so the MVs of at most two image blocks to be added to the candidate motion vector list need to be determined from the three sets according to a prescribed order. The order may be: first the set of left candidate image blocks {A0, A1} of the current block (A0 is considered first, and A1 is considered if A0 is unavailable), then the set of upper candidate image blocks {B0, B1, B2} of the current block (B0 is considered first, B1 is considered if B0 is unavailable, and B2 is considered if B1 is unavailable), and finally the set of temporally adjacent candidate image blocks {C, T} of the current block (T is considered first, and C is considered if T is unavailable).
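The prescribed order for filling the AMVP candidate list can be sketched as follows. The dictionary layout (position name mapped to an MV tuple, or None when the block is unavailable), the function names, and the skipping of duplicate MVs are assumptions made for illustration; padding of a short list is not modelled.

```python
def first_available(positions, mvs):
    """Return the MV of the first available position in the given order."""
    for name in positions:
        mv = mvs.get(name)
        if mv is not None:
            return mv
    return None


def build_amvp_list(mvs, max_len=2):
    """Collect up to max_len candidates: left set, then upper set, then temporal set."""
    candidates = []
    for group in (("A0", "A1"), ("B0", "B1", "B2"), ("T", "C")):
        if len(candidates) == max_len:
            break
        mv = first_available(group, mvs)
        if mv is not None and mv not in candidates:  # duplicate skipping is an assumption
            candidates.append(mv)
    return candidates


neighbors = {"A0": None, "A1": (3, -1), "B0": (3, -1), "B1": (0, 2), "T": (5, 5)}
amvp = build_amvp_list(neighbors)
```

In this toy input A0 is unavailable, so A1 supplies the left candidate; the upper candidate B0 duplicates it and is skipped, and the temporal block T fills the second slot.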
After the candidate motion vector list is obtained, an optimal MV is determined from the candidate motion vector list by using a rate-distortion cost (rate distortion cost, RD cost), and the candidate motion vector with the minimum RD cost is used as a motion vector predictor (motion vector predictor, MVP) of the current block. The rate distortion cost is calculated from the following formula:
J=SAD+λR
where J represents the RD cost, SAD is the sum of absolute differences between the pixel values of the prediction block obtained by motion estimation using the candidate motion vector and the pixel values of the current block, R represents the bit rate, and λ represents the Lagrangian multiplier.
The encoding end transmits the index of the determined MVP in the candidate motion vector list to the decoding end. Further, a motion search may be performed in a neighborhood centered on the MVP to obtain the actual motion vector of the current block; the encoding end calculates the motion vector difference (MVD) between the MVP and the actual motion vector and also transmits the MVD to the decoding end. The decoding end parses the index, finds the corresponding MVP in the candidate motion vector list according to the index, parses the MVD, and adds the MVD to the MVP to obtain the actual motion vector of the current block.
In the process of obtaining the candidate motion information list in merge mode, the motion information that may be added to the candidate motion information list as candidates includes the motion information of spatially adjacent or temporally adjacent image blocks of the current block, where the spatially adjacent and temporally adjacent image blocks are as shown in fig. 6. The spatial candidate motion information in the candidate motion information list comes from the 5 spatially adjacent blocks (A0, A1, B0, B1, and B2); if a spatially adjacent block is unavailable or is intra predicted, its motion information is not added to the candidate motion information list. The temporal candidate motion information of the current block is obtained by scaling the MV of the block at the corresponding position in the reference frame according to the picture order counts (POC) of the reference frame and the current frame. Whether the block at position T in the reference frame is available is determined first, and if it is not available, the block at position C is selected. After the candidate motion information list is obtained, the optimal motion information is determined from the candidate motion information list using the RD cost and is used as the motion information of the current block. The encoding end transmits the index value (denoted as the merge index) of the position of the optimal motion information in the candidate motion information list to the decoding end.
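The MVP selection by J = SAD + λR and the MVD round trip between the encoding end and the decoding end can be sketched as follows. The SAD and rate values are made-up inputs, and all function names are hypothetical; the sketch only illustrates the arithmetic described above.

```python
def rd_cost(sad, rate_bits, lam):
    """J = SAD + lambda * R."""
    return sad + lam * rate_bits


def choose_mvp(candidates, sads, rates, lam):
    """Return (index, mvp) of the candidate with the smallest RD cost."""
    costs = [rd_cost(s, r, lam) for s, r in zip(sads, rates)]
    index = min(range(len(costs)), key=costs.__getitem__)
    return index, candidates[index]


def encode_mvd(mv, mvp):
    """Encoding end: MVD is the actual MV minus the predictor."""
    return mv[0] - mvp[0], mv[1] - mvp[1]


def decode_mv(mvp, mvd):
    """Decoding end: actual MV is the predictor plus the parsed MVD."""
    return mvp[0] + mvd[0], mvp[1] + mvd[1]


candidate_list = [(3, -1), (5, 5)]
index, mvp = choose_mvp(candidate_list, sads=[120, 90], rates=[2, 10], lam=4)
mvd = encode_mvd((4, -2), mvp)                   # sent together with the index
recovered = decode_mv(candidate_list[index], mvd)
```

Only the index and the MVD need to be transmitted, because the decoding end builds the same candidate list and can look up the MVP itself.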
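The temporal candidate derivation can be sketched as a choice between positions T and C followed by POC-based scaling. The scaling formula below (ratio of the two POC distances, with simple rounding) is an assumption, since the text only states that the MV is scaled according to the POCs; standardized codecs use a fixed-point version of this idea. The function and parameter names are hypothetical.

```python
def pick_temporal_block(col_mvs):
    """Use the MV at position T if available, otherwise the MV at position C."""
    for position in ("T", "C"):
        mv = col_mvs.get(position)
        if mv is not None:
            return mv
    return None


def scale_temporal_mv(col_mv, poc_cur, poc_cur_ref, poc_col, poc_col_ref):
    """Scale the collocated MV by the ratio of POC distances (assumed formula)."""
    tb = poc_cur - poc_cur_ref   # distance: current image to its reference
    td = poc_col - poc_col_ref   # distance: collocated image to its reference
    if td == 0:
        return col_mv
    return tuple(round(component * tb / td) for component in col_mv)


col_mv = pick_temporal_block({"T": None, "C": (8, -4)})
scaled = scale_temporal_mv(col_mv, poc_cur=8, poc_cur_ref=4, poc_col=16, poc_col_ref=8)
```

In this example the collocated MV spans 8 pictures while the current block's reference is 4 pictures away, so the MV is halved.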
Entropy coding
The entropy encoding unit 270 is configured to apply an entropy encoding algorithm or scheme (e.g., a variable length coding (VLC) scheme, a context adaptive VLC (CAVLC) scheme, an arithmetic coding scheme, a binarization algorithm, context adaptive binary arithmetic coding (CABAC), syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding, or another entropy encoding method or technique) to the quantized residual coefficients 209, the inter prediction parameters, the intra prediction parameters, the loop filter parameters, and/or other syntax elements, to obtain encoded image data 21 that can be output through an output 272, for example in the form of an encoded bitstream 21, so that the video decoder 30 or the like can receive and use the parameters for decoding. The encoded bitstream 21 may be transmitted to the video decoder 30, or stored in memory for later transmission or retrieval by the video decoder 30.
Other structural variations of video encoder 20 may be used to encode the video stream. For example, the non-transform based encoder 20 may directly quantize the residual signal without a transform processing unit 206 for certain blocks or frames. In another implementation, encoder 20 may have quantization unit 208 and inverse quantization unit 210 combined into a single unit.
Decoder and decoding method
Fig. 3 is an exemplary block diagram of video decoder 30 of an embodiment of the present application. The video decoder 30 is configured to receive encoded image data 21 (e.g., encoded bitstream 21), encoded by, for example, the encoder 20, to obtain a decoded image 331. The encoded image data or bitstream comprises information for decoding the encoded image data, such as data representing the image blocks of an encoded video slice (and/or of tile groups or tiles) and associated syntax elements.
In the example of fig. 3, decoder 30 includes entropy decoding unit 304, inverse quantization unit 310, inverse transform processing unit 312, reconstruction unit 314 (e.g., summer 314), loop filter 320, decoded image buffer (DPB) 330, mode application unit 360, inter prediction unit 344, and intra prediction unit 354. The inter prediction unit 344 may be or include a motion compensation unit. In some examples, video decoder 30 may perform a decoding process that is substantially the inverse of the encoding process described with reference to video encoder 20 of fig. 2.
As described for encoder 20, inverse quantization unit 210, inverse transform processing unit 212, reconstruction unit 214, loop filter 220, decoded image buffer (DPB) 230, inter prediction unit 244, and intra prediction unit 254 also constitute the "built-in decoder" of video encoder 20. Accordingly, inverse quantization unit 310 may be functionally identical to inverse quantization unit 210, inverse transform processing unit 312 may be functionally identical to inverse transform processing unit 212, reconstruction unit 314 may be functionally identical to reconstruction unit 214, loop filter 320 may be functionally identical to loop filter 220, and decoded image buffer 330 may be functionally identical to decoded image buffer 230. Accordingly, the explanations of the respective units and functions of video encoder 20 apply correspondingly to the respective units and functions of video decoder 30.
Entropy decoding
The entropy decoding unit 304 is configured to parse the bitstream 21 (or generally the encoded image data 21) and perform entropy decoding on the encoded image data 21 to obtain quantized coefficients 309 and/or decoded coding parameters (not shown in fig. 3), such as any or all of inter prediction parameters (e.g., reference image indexes and motion vectors), intra prediction parameters (e.g., intra prediction modes or indexes), transform parameters, quantization parameters, loop filter parameters, and/or other syntax elements. Entropy decoding unit 304 may be used to apply a decoding algorithm or scheme corresponding to the encoding scheme of entropy encoding unit 270 of encoder 20. Entropy decoding unit 304 may also be used to provide inter prediction parameters, intra prediction parameters, and/or other syntax elements to mode application unit 360, and other parameters to other units of decoder 30. Video decoder 30 may receive syntax elements at the video slice and/or video block level. In addition to, or as an alternative to, slices and corresponding syntax elements, tile groups and/or tiles and corresponding syntax elements may be received or used.
Inverse quantization
The inverse quantization unit 310 may be configured to receive quantization parameters (QP) (or generally information related to inverse quantization) and quantized coefficients from the encoded image data 21 (e.g., as parsed and/or decoded by the entropy decoding unit 304), and to inverse quantize the decoded quantized coefficients 309 based on the quantization parameters to obtain dequantized coefficients 311, which may also be referred to as transform coefficients 311. The dequantization process may include using the quantization parameters calculated by video encoder 20 for each video block in a video slice to determine the degree of quantization and, likewise, the degree of dequantization that needs to be performed.
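How a quantization parameter determines the degree of dequantization can be sketched with the step-size relationship commonly associated with HEVC-style codecs, in which the step size doubles for every increase of 6 in QP. This relationship, the floating-point arithmetic, and the function names are assumptions for illustration; the text above does not specify a formula, and real decoders use integer scaling tables and shifts.

```python
def quantization_step(qp):
    """Assumed step size: 1.0 at QP 4, doubling every 6 QP values."""
    return 2.0 ** ((qp - 4) / 6.0)


def dequantize(levels, qp):
    """Scale quantized coefficient levels back to approximate transform coefficients."""
    step = quantization_step(qp)
    return [level * step for level in levels]


coefficients = dequantize([1, -2, 0], qp=10)
```

A higher QP gives a larger step size, so each decoded level maps back to a coarser approximation of the original transform coefficient.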
Inverse transformation
The inverse transform processing unit 312 may be configured to receive the dequantized coefficients 311, also referred to as transform coefficients 311, and to apply a transform to the dequantized coefficients 311 to obtain the reconstructed residual block 313 in the pixel domain. The reconstructed residual block 313 may also be referred to as a transform block 313. The transform may be an inverse transform, such as an inverse DCT, an inverse DST, an inverse integer transform, or a conceptually similar inverse transform process. The inverse transform processing unit 312 may also be used to receive transform parameters or corresponding information from the encoded image data 21 (e.g., as parsed and/or decoded by the entropy decoding unit 304) to determine the transform to be applied to the dequantized coefficients 311.
Reconstruction
The reconstruction unit 314 (e.g., the summer 314) is configured to add the reconstructed residual block 313 to the prediction block 365 to obtain a reconstructed block 315 in the pixel domain, e.g., to add the pixel values of the reconstructed residual block 313 and the pixel values of the prediction block 365.
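As an illustrative sketch only (not part of the described embodiments), the sample-wise addition performed by the reconstruction unit can be written as follows. The function name `reconstruct_block`, the list-of-lists block layout, and the final clipping to the `bit_depth` range are assumptions made for this example; the text above states only that the pixel values are added.

```python
def reconstruct_block(residual, prediction, bit_depth=8):
    """Add residual and prediction sample by sample.

    Clipping to [0, 2**bit_depth - 1] is an assumption of this sketch;
    the description above only specifies the addition.
    """
    max_val = (1 << bit_depth) - 1
    return [
        [min(max(r + p, 0), max_val) for r, p in zip(res_row, pred_row)]
        for res_row, pred_row in zip(residual, prediction)
    ]


residual = [[-3, 4], [10, -20]]
prediction = [[100, 250], [250, 5]]
print(reconstruct_block(residual, prediction))  # [[97, 254], [255, 0]]
```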
Filtering
The loop filter unit 320 is used to filter the reconstructed block 315 (in or after the encoding loop) to obtain a filter block 321, so as to smooth pixel transitions or otherwise improve video quality. Loop filter unit 320 may include one or more loop filters, such as a deblocking filter, a sample-adaptive offset (SAO) filter, or one or more other filters, such as an adaptive loop filter (adaptive loop filter, ALF), a noise suppression filter (noise suppression filter, NSF), or any combination thereof. For example, the loop filter unit 320 may include a deblocking filter, an SAO filter, and an ALF filter. The order of the filtering process may be the deblocking filter, the SAO filter, and then the ALF filter. As another example, a process called luma mapping with chroma scaling (luma mapping with chroma scaling, LMCS) (i.e., an adaptive in-loop reshaper) is added. This process is performed prior to deblocking. For another example, the deblocking filtering process may also be applied to internal sub-block edges, such as affine sub-block edges, ATMVP sub-block edges, sub-block transform (SBT) edges, and intra sub-partition (ISP) edges. Although loop filter unit 320 is shown in fig. 3 as a loop filter, in other configurations loop filter unit 320 may be implemented as a post-loop filter.
Decoded picture buffer
The decoded video blocks 321 in one picture are then stored in a decoded picture buffer 330, and the decoded picture buffer 330 stores the decoded picture 331 as a reference picture for subsequent motion compensation of other pictures and/or for output or display, respectively.
The decoder 30 is configured to output the decoded image 331 via an output 332 or the like, for display to or viewing by a user.
Prediction
Inter prediction unit 344 may be functionally identical to inter prediction unit 244, in particular the motion compensation unit, and intra prediction unit 354 may be functionally identical to intra prediction unit 254. These units make partitioning or segmentation decisions and perform prediction based on the segmentation and/or prediction parameters or corresponding information received from encoded image data 21 (e.g., parsed and/or decoded by entropy decoding unit 304). The mode application unit 360 may be configured to perform prediction (intra or inter prediction) of each block based on the reconstructed image, the block, or the corresponding pixel points (filtered or unfiltered), resulting in a predicted block 365.
When a video slice is encoded as an intra coded (I) slice, the intra prediction unit 354 in the mode application unit 360 is configured to generate a prediction block 365 for an image block of the current video slice based on the indicated intra prediction mode and data from a previously decoded block of the current image. When a video image is encoded as an inter-coded (i.e., B or P) slice, an inter prediction unit 344 (e.g., a motion compensation unit) in mode application unit 360 is used to generate a prediction block 365 for a video block of the current video slice from the motion vectors and other syntax elements received from entropy decoding unit 304. For inter prediction, these prediction blocks may be generated from one of the reference pictures in one of the reference picture lists. Video decoder 30 may construct reference frame list 0 and list 1 from the reference pictures stored in DPB 330 using a default construction technique. In addition to or as an alternative to slices (e.g., video slices), the same or similar process may be applied to embodiments using tile groups (e.g., video tile groups) and/or tiles (e.g., video tiles), e.g., video may be encoded using I, P or B tile groups and/or tiles.
The mode application unit 360 is for determining prediction information for a video block of a current video slice by parsing motion vectors and other syntax elements, and generating a prediction block for the current video block being decoded using the prediction information. For example, mode application unit 360 uses some of the received syntax elements to determine a prediction mode (e.g., intra prediction or inter prediction) used to encode the video blocks of a video slice, an inter prediction slice type (e.g., B slice, P slice, or GPB slice), construction information for one or more reference picture lists of the slice, motion vectors for each inter-encoded video block of the slice, an inter prediction state for each inter-encoded video block of the slice, and other information used to decode the video blocks within the current video slice. In addition to or as an alternative to slices (e.g., video slices), the same or similar process may be applied to embodiments using tile groups (e.g., video tile groups) and/or tiles (e.g., video tiles), e.g., video may be encoded using I, P or B tile groups and/or tiles.
In one embodiment, video decoder 30 of fig. 3 may also be used to segment and/or decode images using slices (also referred to as video slices), where the images may be segmented or decoded using one or more slices (typically non-overlapping). Each slice may include one or more blocks (e.g., CTUs) or one or more groups of blocks (e.g., tiles in the H.265/HEVC and VVC standards and bricks in the VVC standard).
In one embodiment, video decoder 30 shown in fig. 3 may also be used to segment and/or decode a picture using slices/tile groups (also referred to as video tile groups) and/or tiles (also referred to as video tiles), where the picture may be segmented or decoded using one or more slices/tile groups (typically non-overlapping). Each slice/tile group may include one or more blocks (e.g., CTUs) or one or more tiles, etc., where each tile may be rectangular, etc., and may include one or more complete or partial blocks (e.g., CTUs).
Other variations of video decoder 30 may be used to decode encoded image data 21. For example, decoder 30 may generate the output video stream without loop filter unit 320. For example, the non-transform based decoder 30 may directly inverse quantize the residual signal without an inverse transform processing unit 312 for certain blocks or frames. In another implementation, video decoder 30 may have an inverse quantization unit 310 and an inverse transform processing unit 312 combined into a single unit.
It should be understood that the processing results of the current step may be further processed in the encoder 20 and the decoder 30 and then output to the next step. For example, after interpolation filtering, motion vector derivation, or loop filtering, the processing result of the interpolation filtering, motion vector derivation, or loop filtering may be subjected to further operations, such as clipping (clip) or shift (shift) operations.
It should be noted that the derived motion vectors of the current block (including but not limited to control point motion vectors in affine mode, sub-block motion vectors in affine, planar, and ATMVP modes, temporal motion vectors, etc.) may be further processed. For example, the value of a motion vector is limited to a predefined range according to the number of bits used to represent the motion vector. If the motion vector is represented with bitDepth bits, the range is -2^(bitDepth-1) to 2^(bitDepth-1)-1, where "^" denotes exponentiation. For example, if bitDepth is set to 16, the range is -32768 to 32767; if bitDepth is set to 18, the range is -131072 to 131071. As another example, the values of the derived motion vectors (e.g., the MVs of the four 4×4 sub-blocks in one 8×8 block) are limited such that the maximum difference between the integer parts of the MVs of the 4×4 sub-blocks does not exceed N pixels, e.g., 1 pixel. Two methods of restricting motion vectors according to bitDepth are provided herein.
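The range restriction above can be sketched as follows. This excerpt does not spell out the two restriction methods, so the two variants shown here (saturating clipping and two's-complement wrap-around) are assumptions chosen for illustration, as are the function names `mv_range`, `clip_mv`, and `wrap_mv`.

```python
def mv_range(bit_depth):
    """Return (min, max) representable with bit_depth signed bits."""
    return -(1 << (bit_depth - 1)), (1 << (bit_depth - 1)) - 1


def clip_mv(value, bit_depth=18):
    """Assumed variant A: saturate the component at the range limits."""
    lo, hi = mv_range(bit_depth)
    return min(max(value, lo), hi)


def wrap_mv(value, bit_depth=18):
    """Assumed variant B: drop overflow bits (two's-complement wrap-around)."""
    span = 1 << bit_depth
    u = value % span  # non-negative in Python because span is positive
    return u - span if u >= (span >> 1) else u


print(mv_range(16))        # (-32768, 32767)
print(mv_range(18))        # (-131072, 131071)
print(clip_mv(200000))     # 131071
print(wrap_mv(131072))     # -131072
```

Clipping keeps an out-of-range vector at the nearest limit, whereas wrap-around reinterprets the low bitDepth bits as a signed value; which behaviour an actual codec uses should be taken from its specification rather than from this sketch.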
Although the above embodiments primarily describe video codecs, it should be noted that embodiments of the coding system 10, encoder 20, and decoder 30, as well as other embodiments described herein, may also be used for still image processing or codec, i.e., processing or codec of a single image independent of any previous or successive image in video codec. In general, if image processing is limited to only a single image 17, the inter prediction unit 244 (encoder) and the inter prediction unit 344 (decoder) may not be available. All other functions (also referred to as tools or techniques) of video encoder 20 and video decoder 30 are equally applicable to still image processing, such as residual computation 204/304, transform 206, quantization 208, inverse quantization 210/310, (inverse) transform 212/312, segmentation 262/362, intra-prediction 254/354 and/or loop filtering 220/320, entropy encoding 270, and entropy decoding 304.
Fig. 4 is an exemplary block diagram of a video coding apparatus 400 according to an embodiment of the present application. The video coding apparatus 400 is adapted to implement the disclosed embodiments described herein. In one embodiment, video coding device 400 may be a decoder, such as video decoder 30 in FIG. 1a, or an encoder, such as video encoder 20 in FIG. 1 a.
The video coding apparatus 400 includes: an ingress port 410 (or input port 410) and a receiving unit (Rx) 420 for receiving data; a processor, logic unit, or central processing unit (central processing unit, CPU) 430 for processing data, where, for example, the processor 430 may be a neural network processor 430; a transmitting unit (Tx) 440 and an egress port 450 (or output port 450) for transmitting data; and a memory 460 for storing data. The video coding apparatus 400 may further include optical-to-electrical (OE) components and electrical-to-optical (EO) components coupled to the ingress port 410, the receiving unit 420, the transmitting unit 440, and the egress port 450, for the egress or ingress of optical or electrical signals.
The processor 430 is implemented in hardware and software. Processor 430 may be implemented as one or more processor chips, cores (e.g., multi-core processors), FPGAs, ASICs, and DSPs. Processor 430 communicates with ingress port 410, receiving unit 420, transmitting unit 440, egress port 450, and memory 460. The processor 430 includes a coding module 470 (e.g., a neural network based coding module 470). The coding module 470 implements the embodiments disclosed above. For example, the coding module 470 performs, processes, prepares, or provides various encoding operations. Thus, the coding module 470 provides a substantial improvement to the functionality of the video coding apparatus 400 and effects the switching of the video coding apparatus 400 to a different state. Alternatively, the coding module 470 may be implemented as instructions stored in memory 460 and executed by processor 430.
Memory 460 includes one or more disks, tape drives, and solid state drives, and may be used as an overflow data storage device for storing programs when such programs are selected for execution, and for storing instructions and data that are read during program execution. The memory 460 may be volatile and/or nonvolatile, and may be read-only memory (ROM), random access memory (random access memory, RAM), ternary content-addressable memory (TCAM), and/or static random-access memory (SRAM).
Fig. 5 is an exemplary block diagram of an apparatus 500 of an embodiment of the present application, the apparatus 500 may be used as either or both of the source device 12 and the destination device 14 in fig. 1 a.
The processor 502 in the apparatus 500 may be a central processor. Processor 502 may be any other type of device or devices capable of manipulating or processing information, either as is known or later developed. Although the disclosed implementations may be implemented using a single processor, such as processor 502 as shown, using more than one processor may be faster and more efficient.
In one implementation, the memory 504 in the apparatus 500 may be a Read Only Memory (ROM) device or a Random Access Memory (RAM) device. Any other suitable type of storage device may be used as memory 504. Memory 504 may include code and data 506 that processor 502 accesses over bus 512. Memory 504 may also include an operating system 508 and application programs 510, application programs 510 including at least one program that allows processor 502 to perform the methods described herein. For example, application 510 may include applications 1 through N, as well as video coding applications that perform the methods described herein.
Apparatus 500 may also include one or more output devices, such as a display 518. In one example, display 518 may be a touch-sensitive display that combines the display with touch-sensitive elements that can be used to sense touch inputs. A display 518 may be coupled to the processor 502 by a bus 512.
Although the bus 512 in the apparatus 500 is described herein as a single bus, the bus 512 may include multiple buses. In addition, the secondary storage may be coupled directly to other components of the apparatus 500 or accessed through a network, and may include a single integrated unit such as a memory card or multiple units such as multiple memory cards. Thus, the apparatus 500 may have a variety of configurations.
Since embodiments of the present application relate to neural network applications, for ease of understanding, certain terms or terminology used in embodiments of the present application are explained below as part of the summary of the invention.
(1) Neural network
A neural network (NN) is a machine learning model composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept of 1 as inputs, and the output of the arithmetic unit may be:
$$h_{W,b}(x) = f\left(W^{T}x\right) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$
Where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function (activation function) of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as an input to the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining together many of the above-described single neural units, i.e., the output of one neural unit may be the input of another. The input of each neural unit may be connected to a local receptive field of the previous layer to extract features of the local receptive field, where the local receptive field may be an area composed of several neural units.
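As a minimal sketch of the neural unit just described, the following code computes the weighted sum of the inputs plus the bias and applies a sigmoid activation. The function names `sigmoid` and `neural_unit` and the sample numbers are illustrative assumptions only.

```python
import math


def sigmoid(z):
    """Sigmoid activation, one possible choice for f."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(x, w, b, f=sigmoid):
    """Output of a single neural unit: f(sum_s w_s * x_s + b)."""
    return f(sum(w_s * x_s for w_s, x_s in zip(w, x)) + b)


# Weighted sum is 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5.
print(neural_unit([1.0, 2.0], [0.5, -0.25], 0.0))  # 0.5
```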
(2) Deep neural network
A deep neural network (deep neural network, DNN), also known as a multi-layer neural network, can be understood as a neural network having many hidden layers, where there is no particular threshold for "many". Divided by the positions of the layers, the layers inside a DNN fall into three categories: the input layer, the hidden layers, and the output layer. Typically the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer. Although a DNN appears complex, the work of each layer is not complex; it is simply the following linear relational expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, W is the weight matrix (also called the coefficients), and α() is the activation function. Each layer simply applies this operation to the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has a large number of layers, the number of coefficients W and offset vectors $\vec{b}$ is also large. These parameters are defined in a DNN as follows, taking the coefficient W as an example: in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$. The superscript 3 represents the layer in which the coefficient W is located, and the subscript corresponds to the output third-layer index 2 and the input second-layer index 4. In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as $W^{L}_{jk}$. It should be noted that the input layer has no W parameters. In deep neural networks, more hidden layers make the network more capable of characterizing complex situations in the real world. Theoretically, a model with more parameters has higher complexity and greater "capacity", meaning that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final objective is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
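The per-layer operation and the index convention for W can be sketched as below. This is a hedged illustration only: the names `layer_forward` and `dnn_forward`, the use of a sigmoid for α(), and the nested-list representation (with 0-based indices, so `W[j][k]` stands for the coefficient from neuron k of the previous layer to neuron j of the current layer) are assumptions of the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(W, b, x, alpha=sigmoid):
    """One layer: y = alpha(W x + b).

    W[j][k] is the coefficient from input neuron k (previous layer)
    to output neuron j (this layer), matching the W^L_jk convention.
    """
    return [
        alpha(sum(W[j][k] * x[k] for k in range(len(x))) + b[j])
        for j in range(len(W))
    ]


def dnn_forward(layers, x):
    """Apply each (W, b) pair in turn; the input layer itself has no W."""
    for W, b in layers:
        x = layer_forward(W, b, x)
    return x


# Identity activation makes the linear part easy to check by hand.
print(layer_forward([[1, 2], [3, 4]], [0.5, -0.5], [1, 1], alpha=lambda z: z))
# [3.5, 6.5]
```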
(3) Convolutional neural network
A convolutional neural network (convolutional neural network, CNN) is a deep neural network with a convolutional structure. It is a deep learning architecture, where a deep learning architecture refers to learning at multiple levels of abstraction by means of machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which individual neurons can respond to an image input to it. The convolutional neural network comprises a feature extractor consisting of convolutional layers and pooling layers. The feature extractor can be seen as a filter, and the convolution process can be seen as convolving an input image or a convolutional feature plane (feature map) with a trainable filter.
A convolutional layer is a neuron layer in the convolutional neural network that performs convolution processing on an input signal. The convolutional layer may comprise a number of convolution operators, also called kernels, which act in image processing as filters that extract specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually moved across the input image in the horizontal direction one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride) to extract specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension (depth dimension) of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends over the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension. In most cases, however, a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), i.e., multiple matrices of the same shape, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where this dimension is determined by the number of weight matrices applied. Different weight matrices may be used to extract different features in the image, e.g., one weight matrix is used to extract image edge information, another weight matrix is used to extract a particular color of the image, yet another weight matrix is used to blur unwanted noise in the image, etc.
Because the plurality of weight matrices have the same size (rows × columns), the feature maps they extract also have the same size, and the extracted feature maps of the same size are combined to form the output of the convolution operation. In practical applications, the weight values in the weight matrices need to be obtained through a large amount of training, and each weight matrix formed by the trained weight values can be used to extract information from an input image, so that the convolutional neural network can make correct predictions. When a convolutional neural network has a plurality of convolutional layers, the initial convolutional layers tend to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network increases, the features extracted by later convolutional layers become more complex, such as high-level semantic features, and features with higher-level semantics are more suitable for the problem to be solved.
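The sliding of a weight matrix with a stride, and the stacking of the outputs of several same-size weight matrices along the depth dimension, can be sketched as follows. For brevity the sketch assumes a single-channel input (depth 1) and no padding; the names `conv2d` and `conv_layer` are hypothetical. As is usual in CNN practice, the kernel is applied without flipping.

```python
def conv2d(image, kernel, stride=1):
    """Slide one weight matrix over a single-channel image (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    return [
        [
            sum(
                image[i * stride + u][j * stride + v] * kernel[u][v]
                for u in range(kh)
                for v in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def conv_layer(image, kernels, stride=1):
    """Stack one feature map per weight matrix; depth == len(kernels)."""
    return [conv2d(image, k, stride) for k in kernels]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernels = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
print(conv_layer(image, kernels))
# [[[6, 8], [12, 14]], [[12, 16], [24, 28]]]
```

With two 2×2 weight matrices the output has depth 2, and each feature map has the same 2×2 size, as described above.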
Since it is often desirable to reduce the number of training parameters, the convolutional layers often require periodic introduction of pooling layers, either one convolutional layer followed by one pooling layer or multiple convolutional layers followed by one or more pooling layers. The only purpose of the pooling layer during image processing is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller size image. The average pooling operator may calculate pixel values in the image over a particular range to produce an average as a result of the average pooling. The max pooling operator may take the pixel with the largest value in a particular range as the result of max pooling. In addition, just as the size of the weighting matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after the processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel point in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
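The average and maximum pooling operators can be sketched as below. The helper names `pool2d` and `mean`, and the choice of non-overlapping windows (stride equal to the window size), are assumptions of this example.

```python
def mean(values):
    return sum(values) / len(values)


def pool2d(image, size=2, op=max):
    """Reduce each non-overlapping size x size window to one value using op."""
    out = []
    for i in range(0, len(image) - size + 1, size):
        row = []
        for j in range(0, len(image[0]) - size + 1, size):
            window = [image[i + u][j + v] for u in range(size) for v in range(size)]
            row.append(op(window))
        out.append(row)
    return out


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(pool2d(image, 2, max))   # [[6, 8], [14, 16]]
print(pool2d(image, 2, mean))  # [[3.5, 5.5], [11.5, 13.5]]
```

Each output pixel is the maximum or average of the corresponding 2×2 sub-region of the input, so a 4×4 input becomes a 2×2 output.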
After the convolutional layer/pooling layer processing, the convolutional neural network is not yet able to output the required output information, because, as previously mentioned, the convolutional layer/pooling layer only extracts features and reduces the number of parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network needs to use a neural network layer to generate an output of one class or a set of the required number of classes. Thus, multiple hidden layers may be included in the neural network layer, where the parameters included in the multiple hidden layers may be pre-trained based on relevant training data for a particular task type, such as image recognition, image classification, or image super-resolution reconstruction.
Optionally, the multiple hidden layers in the neural network layer are followed by the output layer of the whole convolutional neural network. The output layer has a loss function similar to categorical cross-entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the whole convolutional neural network is completed, back propagation starts to update the weight values and biases of the layers, so as to reduce the loss of the convolutional neural network, that is, the error between the result output by the convolutional neural network through the output layer and the ideal result.
(4) Recurrent neural network
A recurrent neural network (recurrent neural network, RNN) is used to process sequence data. In the traditional neural network model, the layers from the input layer through the hidden layers to the output layer are fully connected, while the nodes within each layer are unconnected. Although this common neural network solves many problems, it remains powerless for many others. For example, predicting the next word of a sentence generally requires the previous words, because the words in a sentence are not independent of one another. An RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs. Specifically, the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes between the hidden layers are no longer unconnected but connected, and the input of a hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, RNNs are able to process sequence data of any length. Training an RNN is the same as training a traditional CNN or DNN in that the error back propagation algorithm is also used, but with a difference: if the RNN is unrolled, its parameters, such as W, are shared, which is not the case with the conventional neural networks described above. In addition, when a gradient descent algorithm is used, the output of each step depends not only on the network of the current step but also on the states of the network in the previous steps. This learning algorithm is referred to as the back propagation through time (Back Propagation Through Time, BPTT) algorithm.
Why are recurrent neural networks needed when convolutional neural networks already exist? The reason is simple. A convolutional neural network rests on a precondition: the elements are independent of each other, and the inputs and outputs are also independent, such as cats and dogs. In the real world, however, many elements are interconnected, such as stock prices changing over time. As another example, someone says: "I like traveling, and my favorite place is Yunnan; when I have the chance, I will go to ____." A human knows that the blank should be filled with "Yunnan", because humans infer from the context. How can a machine do this? RNNs were developed for this purpose. RNNs aim to give machines the ability to memorize as humans do. Thus, the output of an RNN needs to rely on both the current input information and the historical memory information.
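The dependence of the current hidden state on both the current input and the previous hidden state, with the same parameters shared across time steps, can be sketched as follows. The tanh activation and the names `rnn_step`, `rnn_forward`, `W_x`, `W_h`, and `b` are assumptions of this sketch and are not taken from the description.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """h_t = tanh(W_x x_t + W_h h_prev + b); tanh is an assumed activation."""
    return [
        math.tanh(
            sum(W_x[j][k] * x_t[k] for k in range(len(x_t)))
            + sum(W_h[j][m] * h_prev[m] for m in range(len(h_prev)))
            + b[j]
        )
        for j in range(len(b))
    ]


def rnn_forward(xs, h0, W_x, W_h, b):
    """Run the same (shared) parameters over every element of the sequence."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states


# The second input is zero, yet the second state is non-zero: it carries memory.
states = rnn_forward([[1.0], [0.0]], [0.0], W_x=[[1.0]], W_h=[[1.0]], b=[0.0])
print(states[1][0] > 0.0)  # True
```

Setting `W_h` to zero removes the recurrent connection, and the second state then depends only on the second input.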
(5) Loss function
In training a deep neural network, the output of the network is expected to be as close as possible to the value that is actually desired. Therefore, the predicted value of the current network can be compared with the actually desired target value, and the weight vector of each layer of the neural network can be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. It is therefore necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the role of the loss function (loss function) or the objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing the loss as much as possible.
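The idea of repeatedly comparing predictions with target values and adjusting the weights to reduce the loss can be illustrated on a deliberately tiny model. The mean squared error loss, the one-parameter linear model, the learning rate, and the names `mse_loss` and `train_linear` are all assumptions chosen for this sketch; a real deep neural network would update many layers through back propagation.

```python
def mse_loss(pred, target):
    """Mean squared error: larger value means larger difference."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train_linear(xs, ys, lr=0.1, steps=500):
    """Fit y = w*x + b by gradient descent on the MSE loss."""
    w, b = 0.0, 0.0  # initialization before the first update
    n = len(xs)
    for _ in range(steps):
        preds = [w * x + b for x in xs]
        # Gradients of the loss with respect to w and b.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # If predictions are too high, the gradients are positive and the
        # parameters are lowered, and vice versa.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
w, b = train_linear(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```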
(6) Back propagation algorithm
A convolutional neural network can use the back propagation (back propagation, BP) algorithm to correct the parameters of an initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters of the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a backward pass dominated by the error loss, and it aims to obtain the parameters of the optimal super-resolution model, such as the weight matrices.
(7) Generative adversarial network
A generative adversarial network (generative adversarial network, GAN) is a deep learning model. The model comprises at least two modules: a generative model (Generative Model) and a discriminative model (Discriminative Model). The two modules learn by playing a game against each other, which yields a better output. Both the generative model and the discriminative model can be neural networks, in particular deep neural networks or convolutional neural networks. The basic principle of a GAN is as follows. Taking a GAN that generates pictures as an example, assume there are two networks, G (Generator) and D (Discriminator). G is a network for generating pictures; it receives random noise z and generates a picture from this noise, denoted as G(z). D is a discrimination network for judging whether a picture is "real". Its input parameter is x, which represents a picture, and its output D(x) represents the probability that x is a real picture: a value of 1 means the picture is certainly real, and a value of 0 means the picture cannot be real. During training of the generative adversarial network, the goal of the generation network G is to generate pictures that are as realistic as possible in order to deceive the discrimination network D, and the goal of the discrimination network D is to distinguish the pictures generated by G from real pictures as well as possible. Thus, G and D constitute a dynamic "game" process, which is the "adversarial" part of "generative adversarial network". As the final result of the game, in an ideal state, G can generate pictures G(z) that pass for real ones, and D has difficulty determining whether a picture generated by G is real, i.e., D(G(z)) = 0.5. This yields an excellent generative model G, which can be used to generate pictures.
In this embodiment, a deep neural network is used to determine whether an image block is to be further partitioned, and the training engine 25 obtains the deep neural network through the following training process:
First, the training data set includes data sets of three types of image blocks, corresponding to image blocks of sizes 64×64, 32×32, and 16×16, respectively, where the data set of 64×64 image blocks is denoted as rawpatch_cu64, the data set of 32×32 image blocks is denoted as rawpatch_cu32, and the data set of 16×16 image blocks is denoted as rawpatch_cu16.
In the embodiment of the application, the following methods may be used to construct the three types of image block data:
1. Constructing image data imagepatch_cu64 of the 64×64 image block; constructing image data imagepatch_cu32 of the 32×32 image block; constructing image data imagepatch_cu16 of the 16×16 image block.
The image data may include one or more of chromaticity values, luminance values, pixel values, and the like of the corresponding image block.
2. Acquiring image data referencepatch_cu64 of a reference block of the 64×64 image block; acquiring image data referencepatch_cu32 of a reference block of the 32×32 image block; acquiring image data referencepatch_cu16 of a reference block of the 16×16 image block.
For the image data of the reference block, refer to the description of the image data in step 1; details are not repeated here.
Optionally, the reference block may be determined as in the related description of inter prediction, for example by using a classical motion vector estimation (MVE) algorithm (for example, the TZSearch algorithm in H.265) to obtain the reference block of the corresponding image block.
Optionally, the reference block may be determined by fusing the pixel-level optical flows in the graphics processing unit (GPU) rendering information (GRI) to obtain the motion vector (MV) of the corresponding image block, and then obtaining the reference block of the corresponding image block according to the MV. The pixel-level optical flows are fused by taking the mean, median, or centroid of the optical flows of all pixels in the corresponding image block to generate the MV of that image block.
Optionally, the reference block may be determined by using the MV obtained from the pixel-level optical flow as the search starting point of an MVE algorithm to obtain the MV of the corresponding image block, and then obtaining the reference block of the corresponding image block according to that MV.
3. Acquiring the GRI renderpatch_cu64 of a 64×64 image block; acquiring the GRI renderpatch_cu32 of a 32×32 image block; acquiring the GRI renderpatch_cu16 of a 16×16 image block.
When the image has GRI, the GRI at the position corresponding to the image block is extracted, at the corresponding size, from the rendering information video frame. The GRI includes, for example, pixel-level optical flow and semantic segmentation information of the image (for example, which object the corresponding image block belongs to).
4. Combining the image data of the corresponding image block (imagepatch_cu), the image data of the reference block (referencepatch_cu), and the GRI (renderpatch_cu) to obtain the data set (rawpatch_cu) of the corresponding image block, where each name carries the size suffix 64, 32, or 16.
The corresponding image block may be a 64×64 image block, a 32×32 image block, or a 16×16 image block.
When the image has no GRI, the data set (rawpatch_cu) of the corresponding image block consists of the image data (imagepatch_cu) of the corresponding image block and the image data (referencepatch_cu) of the reference block of the corresponding image block, and does not include the GRI (renderpatch_cu) of the corresponding image block.
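Steps 1 to 4 can be sketched as follows. This is a minimal illustration only: the text says the three parts are "combined" without fixing a layout, so stacking them as channels is an assumption, and `build_raw_patch`, `constant_plane`, and the sample values are hypothetical names and data introduced here.

```python
from typing import List, Optional

Plane = List[List[float]]  # one size x size channel, e.g. luminance values


def build_raw_patch(image_patch: List[Plane],
                    reference_patch: List[Plane],
                    render_patch: Optional[List[Plane]] = None) -> List[Plane]:
    """Stack the channels of the block, its reference block and, if present, its GRI.

    Channel stacking is an assumed way of "combining" the three parts.
    """
    channels = list(image_patch) + list(reference_patch)
    if render_patch is not None:  # the GRI part is optional
        channels += list(render_patch)
    size = len(channels[0])
    for plane in channels:
        if len(plane) != size or any(len(row) != size for row in plane):
            raise ValueError("all channels must be square and share one block size")
    return channels


def constant_plane(size: int, value: float) -> Plane:
    """Helper that fabricates a flat test channel."""
    return [[value] * size for _ in range(size)]


size = 16
image_patch_cu16 = [constant_plane(size, 0.5)]       # luminance only
reference_patch_cu16 = [constant_plane(size, 0.4)]   # luminance of the reference block
render_patch_cu16 = [constant_plane(size, 1.0),      # optical flow, horizontal part
                     constant_plane(size, -2.0)]     # optical flow, vertical part

raw_without_gri = build_raw_patch(image_patch_cu16, reference_patch_cu16)
raw_with_gri = build_raw_patch(image_patch_cu16, reference_patch_cu16, render_patch_cu16)
print(len(raw_without_gri), len(raw_with_gri))
```

The same helper serves both cases described above: with GRI the data set has the extra rendering channels, and without GRI it contains only the block and its reference block.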
Next, based on the training data set described above, the training engine 25 constructs three deep neural network models nn_cu64, nn_cu32, and nn_cu16, where nn_cu64 is used to determine whether a 64×64 image block is to be divided, nn_cu32 is used to determine whether a 32×32 image block is to be divided, and nn_cu16 is used to determine whether a 16×16 image block is to be divided. The input of each deep neural network model is the data set rawpatch_cu of the corresponding image block, and the output is a division identifier that indicates whether the corresponding image block is to be divided. For example, a division identifier of 1 indicates that the corresponding image block is to be divided, and a division identifier of 0 indicates that it is not to be divided. The deep neural network model may consist of a fully convolutional feature extraction layer and a linear classification layer, and may be trained with an optimization method commonly used for binary classification models. It should be understood that the embodiments of the present application may also construct a deep neural network model with another structure, or obtain the deep neural network model with another training method, which is not limited here.
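The relation between a binary classifier's raw score and the division identifier can be sketched as below. The text only says that a common optimization method for binary classification models may be used; the sigmoid output, the 0.5 threshold, and the binary cross-entropy loss shown here are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math


def sigmoid(logit: float) -> float:
    """Squash a raw classifier score into a probability of 'divide'."""
    return 1.0 / (1.0 + math.exp(-logit))


def division_identifier(logit: float, threshold: float = 0.5) -> int:
    """Map the score to the division identifier: 1 = divide, 0 = do not divide."""
    return 1 if sigmoid(logit) >= threshold else 0


def binary_cross_entropy(logit: float, label: int) -> float:
    """Per-sample loss for a label of 1 (divide) or 0 (do not divide)."""
    p = sigmoid(logit)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))


print(division_identifier(2.0), division_identifier(-1.0))
print(binary_cross_entropy(0.0, 1))
```

In such a setup the label for each training block would be the divide/no-divide decision that the model is meant to reproduce.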
For example, fig. 7a is an exemplary training schematic of the deep neural network model, and as shown in fig. 7a, the training data set includes:
Multiple data sets rawpatch_cu64 of 64×64 image blocks, where the data set rawpatch_cu64 of each 64×64 image block includes imagepatch_cu64 and referencepatch_cu64. These data sets are input to nn_cu64, which outputs a division identifier with a value of 0 or 1 indicating whether the corresponding 64×64 image block is to be divided.
Multiple data sets rawpatch_cu32 of 32×32 image blocks, where the data set rawpatch_cu32 of each 32×32 image block includes imagepatch_cu32 and referencepatch_cu32. These data sets are input to nn_cu32, which outputs a division identifier with a value of 0 or 1 indicating whether the corresponding 32×32 image block is to be divided.
Multiple data sets rawpatch_cu16 of 16×16 image blocks, where the data set rawpatch_cu16 of each 16×16 image block includes imagepatch_cu16 and referencepatch_cu16. These data sets are input to nn_cu16, which outputs a division identifier with a value of 0 or 1 indicating whether the corresponding 16×16 image block is to be divided.
For example, fig. 7b is an exemplary training schematic of the deep neural network model, and as shown in fig. 7b, the training data set includes:
Multiple data sets rawpatch_cu64 of 64×64 image blocks, where the data set rawpatch_cu64 of each 64×64 image block includes imagepatch_cu64, referencepatch_cu64, and renderpatch_cu64. These data sets are input to nn_cu64, which outputs a division identifier with a value of 0 or 1 indicating whether the corresponding 64×64 image block is to be divided.
Multiple data sets rawpatch_cu32 of 32×32 image blocks, where the data set rawpatch_cu32 of each 32×32 image block includes imagepatch_cu32, referencepatch_cu32, and renderpatch_cu32. These data sets are input to nn_cu32, which outputs a division identifier with a value of 0 or 1 indicating whether the corresponding 32×32 image block is to be divided.
Multiple data sets rawpatch_cu16 of 16×16 image blocks, where the data set rawpatch_cu16 of each 16×16 image block includes imagepatch_cu16, referencepatch_cu16, and renderpatch_cu16. These data sets are input to nn_cu16, which outputs a division identifier with a value of 0 or 1 indicating whether the corresponding 16×16 image block is to be divided.
Based on the above 3 deep neural network models nn_cu64, nn_cu32, and nn_cu16, the embodiments of the present application provide an inter-frame coding block division method to achieve inter-frame coding block division acceleration.
Fig. 8 is a flowchart of a process 800 of an inter-coded block partitioning method according to an embodiment of the present application. The process 800 may be performed by the video encoder 20, in particular by the partition unit 262 of the video encoder 20. It should be understood that "partition" and "division" have the same meaning in the embodiments of the present application: both refer to dividing an image block to obtain a plurality of sub-blocks. Process 800 is described as a series of steps or operations; it should be understood that process 800 may be performed in various orders and/or concurrently, and is not limited to the order of execution depicted in fig. 8. Assuming that a video encoder is being used on a video data stream having a plurality of image frames, process 800, which includes the following steps, is performed to divide image blocks. Process 800 may include:
Step 801, acquiring a data set of a current block.
The current block in the embodiment of the present application may be a Coding Tree Unit (CTU) that the encoder is currently processing.
A frame of a video may be composed of a plurality of CTUs. A CTU generally corresponds to a square region in the picture and may contain the luminance pixels and/or chrominance pixels in this square region, as well as syntax elements indicating how the CTU is divided into at least one coding unit (CU) and how each CU is decoded to obtain a reconstructed picture. A CU generally corresponds to an A×B rectangular area in the image and may contain the luminance pixels and/or chrominance pixels in that area, where A is the width of the rectangular area and B is its height. A and B may be the same or different, and their values are generally integer powers of 2, for example 256, 128, 64, 32, 16, 8, or 4. Decoding a CU yields the reconstructed image of the A×B rectangular area.
In one possible implementation, the data set of the current block may include image data of the current block and image data of a reference block of the current block.
In the embodiment of the present application, a motion vector estimation algorithm may be used to obtain the MV of the current block. For example, fig. 9 is an exemplary schematic diagram of a search window according to an embodiment of the present application. As shown in fig. 9, a certain MV of the current block is (0, 0), and offsetting this MV within a 3×3 search window yields 9 offset MVs: (-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 0), (0, 1), (1, -1), (1, 0), (1, 1). The optimal MV is determined from the 9 offset MVs by rate distortion cost (RD cost), and the reference block of the current block is then obtained according to the optimal MV.
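The 3×3 window search can be sketched as follows. Note the substitution: the text selects the optimal MV by RD cost, whereas this sketch uses the sum of absolute differences (SAD) as a simple stand-in cost, because a real RD cost needs the encoder's rate model. The function names, the 8×8 test frames, and the block position are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]


def sad(cur: Frame, ref: Frame, x: int, y: int, size: int, mv: Tuple[int, int]) -> int:
    """Sum of absolute differences between the current block and the block displaced by mv."""
    dx, dy = mv
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(cur[y + r][x + c] - ref[y + dy + r][x + dx + c])
    return total


def search_3x3(cur: Frame, ref: Frame, x: int, y: int, size: int,
               center: Tuple[int, int] = (0, 0)) -> Tuple[int, int]:
    """Evaluate the 9 offset MVs around `center` and return the one with the lowest cost."""
    candidates = [(center[0] + dx, center[1] + dy)
                  for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
    return min(candidates, key=lambda mv: sad(cur, ref, x, y, size, mv))


# Synthetic frames: the current frame is the reference frame displaced so that
# the true MV of the block at (2, 2) is (dx, dy) = (1, -1).
N = 8
ref_frame = [[r * N + c for c in range(N)] for r in range(N)]
cur_frame = [[ref_frame[(r - 1) % N][(c + 1) % N] for c in range(N)] for r in range(N)]

best_mv = search_3x3(cur_frame, ref_frame, x=2, y=2, size=4)
print(best_mv)
```

The `center` parameter allows the same window to be placed around a starting MV other than (0, 0).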
The image data of the current block may include the values of luminance pixels and/or chrominance pixels included in an image area corresponding to the current block, where the image area is an area in an image to which the current block belongs. The image data of the reference block of the current block may include values of luminance pixels and/or chrominance pixels included in an image area corresponding to the reference block of the current block, where the image area is an area in an image to which the reference block of the current block belongs.
In one possible implementation, the data set of the current block includes image data of the current block, image data of a reference block of the current block, and pixel-level optical flow of the current block.
When graphics processing unit (GPU) rendering information (GRI) exists for the image, the GRI may include information such as pixel-level optical flow and semantic segmentation of the image. Pixel-level optical flow represents the instantaneous velocity of the pixel motion of a spatially moving object on the observation imaging plane. The correspondence between the previous frame and the current frame can therefore be found by using the change of pixels in the image sequence over time and the correlation between adjacent frames. The instantaneous rate of change of gray level at a particular coordinate point of the two-dimensional image plane is generally defined as the optical flow vector. It can be seen that pixel-level optical flow also reflects the degree of pixel change from the previous frame to the current frame. In addition to the image data of the current block and the image data of the reference block of the current block, the embodiments of the present application may also add the pixel-level optical flow of the current block to the data set of the current block. In this way, the GRI can be used effectively to assist block division in inter-frame coding, in particular to improve the speed and precision of the reference block search, thereby improving inter-frame coding efficiency.
Based on the pixel-level optical flow of the current block, the embodiments of the present application may employ the following two reference block determination methods:
In one method, the MV of the current block is obtained according to the pixel-level optical flow of the current block, and the reference block of the current block is then obtained according to the MV of the current block.
As described above, pixel-level optical flow represents the instantaneous velocity of the pixel motion of a spatially moving object on the observation imaging plane, so the mean, median, or centroid of all pixel-level optical flows within the current block can be taken as the MV of the current block. The reference block of the current block is then obtained based on this MV.
In the other method, an initial MV is obtained according to the pixel-level optical flow of the current block, the MV of the current block is obtained by a motion vector estimation algorithm that uses the initial MV as its search starting point, and the reference block of the current block is then obtained according to the MV of the current block.
The embodiments of the present application may take the mean, median, or centroid of all pixel-level optical flows in the current block as the initial MV. The MV of the current block is then obtained by the motion vector estimation algorithm with the initial MV as the search starting point; for this, refer to the description of fig. 9, which is not repeated here.
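Fusing the per-pixel optical flows of a block into one MV can be sketched as follows. The sketch covers only the mean and median options (the centroid option is left out), and rounding the result to an integer-pel MV is an assumption made here; the function name and sample flows are hypothetical.

```python
from statistics import mean, median
from typing import List, Tuple

Flow = Tuple[float, float]  # (horizontal, vertical) displacement of one pixel


def fuse_flow(flows: List[Flow], method: str = "mean") -> Tuple[int, int]:
    """Collapse the pixel-level optical flows of a block into a single MV.

    Rounding to integer-pel precision is an assumption of this sketch.
    """
    reducer = {"mean": mean, "median": median}[method]
    mvx = reducer([f[0] for f in flows])
    mvy = reducer([f[1] for f in flows])
    return round(mvx), round(mvy)


# Three pixels move consistently, one is an outlier (e.g. a different object).
flows = [(2.0, -1.0), (2.0, -1.0), (2.0, -1.0), (10.0, 3.0)]
mv_mean = fuse_flow(flows, "mean")
mv_median = fuse_flow(flows, "median")
print(mv_mean, mv_median)
```

The fused MV can be used directly to fetch the reference block (the first method above) or passed as the starting point of a motion search (the second method). In this example the median is less affected by the outlier pixel than the mean.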
Step 802, inputting the data set of the current block into the neural network model to obtain the dividing mode of the current block.
The partitioning of image blocks may be referred to as tree partitioning or hierarchical tree partitioning, in which a root block, for example at root tree level 0 (hierarchy level 0, depth 0), may be recursively partitioned into four blocks of the next lower tree level, for example nodes at tree level 1 (hierarchy level 1, depth 1). These blocks may in turn be partitioned into four blocks of the next lower level, for example tree level 2 (hierarchy level 2, depth 2), and so on until the partitioning is complete. Based on this, the embodiments of the present application provide a plurality of models (refer to the description of model training above), which correspond to a plurality of image block sizes. The current block yields smaller image blocks after being divided, and those smaller image blocks may in turn be divided into still smaller ones; the data set of each image block obtained by division is input into the model of the corresponding size to determine whether that image block is to be divided.
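One level of this tree partitioning can be sketched as follows. The (x, y, size) block representation and the raster ordering of the four sub-blocks (top-left, top-right, bottom-left, bottom-right) are assumptions of this sketch, and `quad_split` is a hypothetical name.

```python
from typing import List, Tuple

Block = Tuple[int, int, int]  # (x, y, size) of a square block, in pixels


def quad_split(block: Block) -> List[Block]:
    """Split a square block into four equally sized sub-blocks of the next tree level."""
    x, y, size = block
    half = size // 2
    return [(x, y, half), (x + half, y, half),
            (x, y + half, half), (x + half, y + half, half)]


level1 = quad_split((0, 0, 64))   # depth 1: four 32x32 blocks
level2 = quad_split(level1[1])    # depth 2: split the top-right 32x32 block
print(level1)
print(level2)
```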
In one possible implementation, the data set of the current block is input into a first neural network model to obtain a first partition identifier, where the first partition identifier is used to indicate whether the current block is to be partitioned or not, and the first neural network model is obtained through training based on an image block with the same size as the current block.
The size of the current block may be 64×64, and the first neural network model may be the deep neural network model nn_cu64 described above. When the first division identifier indicates no division, the current block is not divided further; the division manner of the current block is then no division, and the encoder encodes the undivided current block. When the first division identifier indicates division, the current block is divided into a plurality of first sub-blocks, and the size of each first sub-block may be 32×32.
In one possible implementation, when the first division identifier indicates that division is to be performed, dividing the current block to obtain a plurality of first sub-blocks; acquiring a data set of a first sub-block, wherein the data set of the first sub-block comprises image data of the first sub-block and image data of a reference block of the first sub-block; the data set of the first sub-block is input into a second neural network model to obtain a second division identifier, wherein the second division identifier is used for indicating whether the first sub-block is to be divided or not, and the second neural network model is trained based on the image blocks with the same size as the first sub-block.
Alternatively, the data set of the first sub-block may include image data of the first sub-block and image data of a reference block of the first sub-block. Optionally, the data set of the first sub-block includes image data of the first sub-block, image data of a reference block of the first sub-block, and pixel-level optical flow of the first sub-block.
The size of the first sub-block may be 32×32, and the second neural network model may be the deep neural network model nn_cu32 described above. When the second division identifier indicates no division, the first sub-block is not divided further, and the image area corresponding to the first sub-block is divided at depth 1. When the second division identifier indicates division, the first sub-block is divided into a plurality of second sub-blocks, and the size of each second sub-block may be 16×16.
In one possible implementation, when the second division identifier indicates that division is to be performed, dividing the first sub-block to obtain a plurality of second sub-blocks; acquiring a data set of a second sub-block, wherein the data set of the second sub-block comprises image data of the second sub-block and image data of a reference block of the second sub-block; and inputting the data set of the second sub-block into a third neural network model to obtain a third division identifier, wherein the third division identifier is used for indicating whether the second sub-block is to be divided or not, and the third neural network model is trained based on the image blocks with the same size as the second sub-block.
Alternatively, the data set of the second sub-block may include image data of the second sub-block and image data of a reference block of the second sub-block. Optionally, the data set of the second sub-block includes image data of the second sub-block, image data of a reference block of the second sub-block, and pixel-level optical flow of the second sub-block.
The size of the second sub-block may be 16×16, and the third neural network model may be the deep neural network model nn_cu16 described above. When the third division identifier indicates no division, the second sub-block is not divided further, and the image area corresponding to the second sub-block is divided at depth 2. When the third division identifier indicates division, the second sub-block is divided into 8×8 image blocks, which are not divided further; the image area corresponding to the second sub-block is then divided at depth 3.
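The three-level decision described above can be sketched as one recursive routine that looks up the model matching each block size. This is an illustration under stated assumptions: the real models take the data set of the block as input, whereas the stand-in callables below look only at block coordinates, and the names `decide_partition`, `MIN_SIZE`, and `models` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Block = Tuple[int, int, int]      # (x, y, size)
Model = Callable[[Block], int]    # returns the division identifier: 1 or 0

MIN_SIZE = 8  # 8x8 blocks are never divided further


def decide_partition(block: Block, models: Dict[int, Model]) -> List[Block]:
    """Return the leaf blocks obtained by querying the model of each block's size."""
    x, y, size = block
    if size <= MIN_SIZE or models[size](block) == 0:
        return [block]  # no (further) division
    half = size // 2
    leaves: List[Block] = []
    for sx, sy in ((x, y), (x + half, y), (x, y + half), (x + half, y + half)):
        leaves.extend(decide_partition((sx, sy, half), models))
    return leaves


# Stand-in models (hypothetical): 64x64 always divides, only the top-left
# 32x32 block divides, and no 16x16 block divides.
models: Dict[int, Model] = {
    64: lambda b: 1,
    32: lambda b: 1 if (b[0], b[1]) == (0, 0) else 0,
    16: lambda b: 0,
}

leaves = decide_partition((0, 0, 64), models)
print(leaves)
```

Starting the same routine with a 32×32 or 16×16 block only requires the models for that size and below, which matches the way each model is tied to one block size.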
The embodiments of the present application adopt a scheme based on deep neural networks whose training is data-driven per scenario (that is, image data of image blocks of a given size is used to train the deep neural network model for that size), which allows better scene adaptability.
Step 803, dividing the current block according to the division manner.
After the division manner of the current block is obtained in step 802, the current block may be divided based on the division manner.
In this method, the current block and the reference block of the current block are used as the input of the deep neural network, and the deep neural network determines the division manner of the current block. Compared with the analysis performed by a standard encoder, this accelerates block division in inter-frame coding, and the deep neural network effectively fits the RDO decision process of the standard encoder so that coding performance is maintained.
In one possible implementation, the current block size may be 64×64; when it is determined that the current block needs to be divided continuously, the current block may be divided to obtain 4 first sub-blocks, and the size of the first sub-blocks may be 32×32; when determining that a certain first sub-block needs to be divided continuously, dividing the first sub-block to obtain 4 second sub-blocks, wherein the size of each second sub-block can be 16×16; when it is determined that a certain second sub-block needs to be divided continuously, the second sub-block may be divided to obtain 4 third sub-blocks, and the size of the third sub-block may be 8×8. The division will not continue after the division into 8 x 8 image blocks, i.e. the third sub-block of 8 x 8 is the smallest coding unit.
In one possible implementation, the current block size may be 32×32; when it is determined that the current block needs to be divided continuously, the current block may be divided to obtain 4 first sub-blocks, and the size of the first sub-blocks may be 16×16; when it is determined that a certain first sub-block needs to be divided continuously, the first sub-block may be divided to obtain 4 second sub-blocks, and the size of the second sub-blocks may be 8×8. The division will not continue after the division into 8 x 8 image blocks, i.e. the second sub-block of 8 x 8 is the smallest coding unit.
In one possible implementation, the current block size may be 16×16; when it is determined that the current block needs to be divided continuously, the current block may be divided to obtain 4 first sub-blocks, and the size of the first sub-blocks may be 8×8. The division will not continue after the division into 8 x 8 image blocks, i.e. the first sub-block of 8 x 8 is the smallest coding unit.
The technical solution of the method embodiment shown in fig. 8 will be described in detail below using several specific embodiments.
Embodiment 1
Fig. 10 is an exemplary flowchart of an image block dividing method according to an embodiment of the present application, where, as shown in fig. 10, a current block is a 64×64 image block, and a data set of the current block is denoted as a rawpatch_cu64.
1. The rawpatch_cu64 is input to the deep neural network model nn_cu64, and nn_cu64 processes it and outputs the division identifier depth_cu64_result.
When depth_cu64_result = 0, the 64×64 image block does not need to be divided further; the process ends, and the division manner of the current block is no division. When depth_cu64_result = 1, the 64×64 image block needs to be divided further, and quadtree division is performed on the 64×64 image block to obtain four 32×32 image blocks.
2. The data set rawpatch_cu32 of the i-th 32×32 image block is input to the deep neural network model nn_cu32, where i ∈ [1, 4], and nn_cu32 processes it and outputs the division identifier depth_cu32_result[i].
When depth_cu32_result[i] = 0, the i-th 32×32 image block does not need to be divided further; the process for the i-th 32×32 image block ends, which gives the division manner of the region of the current block corresponding to the i-th 32×32 image block. When depth_cu32_result[i] = 1, the i-th 32×32 image block needs to be divided further, and the i-th 32×32 image block is quadtree-divided to obtain four 16×16 image blocks.
For example, the data set rawpatch_cu32 of each of the four 32×32 image blocks is input to nn_cu32, and the division identifiers obtained are depth_cu32_result[1] = 0, depth_cu32_result[2] = 1, depth_cu32_result[3] = 1, and depth_cu32_result[4] = 0. That is, the 2nd and 3rd 32×32 image blocks need to be divided further, and the 1st and 4th 32×32 image blocks do not.
3. The data set rawpatch_cu16 of the j-th 16×16 image block obtained by dividing the i-th 32×32 image block is input to the deep neural network model nn_cu16, where i is the sequence number of a 32×32 image block determined in step 2 to be divided further and j ∈ [1, 4], and nn_cu16 processes it and outputs the division identifier depth_cu16_result[ij].
When depth_cu16_result[ij] = 0, the j-th 16×16 image block in the i-th 32×32 image block does not need to be divided further; the process for that 16×16 image block ends, which gives the division manner of the region of the current block corresponding to the j-th 16×16 image block in the i-th 32×32 image block. When depth_cu16_result[ij] = 1, the j-th 16×16 image block in the i-th 32×32 image block needs to be divided further, and it is quadtree-divided to obtain four 8×8 image blocks.
For example, the data sets rawpatch_cu16 of the 16×16 image blocks in the 2nd 32×32 image block are each input to nn_cu16, and the division identifiers obtained are depth_cu16_result[21] = 0, depth_cu16_result[22] = 1, depth_cu16_result[23] = 1, and depth_cu16_result[24] = 0. That is, the 2nd and 3rd 16×16 image blocks in the 2nd 32×32 image block need to be divided further, and the 1st and 4th 16×16 image blocks in the 2nd 32×32 image block do not.
For another example, the data sets rawpatch_cu16 of the 16×16 image blocks in the 3rd 32×32 image block are each input to nn_cu16, and the division identifiers obtained are depth_cu16_result[31] = 1, depth_cu16_result[32] = 1, depth_cu16_result[33] = 0, and depth_cu16_result[34] = 0. That is, the 1st and 2nd 16×16 image blocks in the 3rd 32×32 image block need to be divided further, and the 3rd and 4th 16×16 image blocks in the 3rd 32×32 image block do not.
Division does not continue once 8×8 image blocks are reached, that is, the 8×8 image block is the smallest coding unit.
The division manner of the current block may be recursively obtained according to the division identifiers obtained in steps 1, 2 and 3, for example, the current block may be divided according to the division manner of the current block to obtain a division result as shown in fig. 11.
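The example identifiers from steps 1 to 3 can be tallied as in the following sketch, which counts the resulting blocks of each size and checks that they cover the 64×64 current block. The identifier values are those given in the example; depth_cu64_result = 1 is implied by the example rather than stated in it, and the two-digit key 10·i + j mirrors the [ij] notation.

```python
# Division identifiers from the worked example above.
depth_cu64_result = 1  # implied: the 64x64 block was divided into 32x32 blocks
depth_cu32_result = {1: 0, 2: 1, 3: 1, 4: 0}
depth_cu16_result = {21: 0, 22: 1, 23: 1, 24: 0,
                     31: 1, 32: 1, 33: 0, 34: 0}

leaf_count = {64: 0, 32: 0, 16: 0, 8: 0}  # undivided blocks per size
if depth_cu64_result == 0:
    leaf_count[64] = 1
else:
    for i in range(1, 5):
        if depth_cu32_result[i] == 0:
            leaf_count[32] += 1
            continue
        for j in range(1, 5):
            if depth_cu16_result[10 * i + j] == 0:
                leaf_count[16] += 1
            else:
                leaf_count[8] += 4  # a divided 16x16 block yields four 8x8 blocks

covered = sum(size * size * n for size, n in leaf_count.items())
print(leaf_count, covered)  # {64: 0, 32: 2, 16: 4, 8: 16} 4096
```

The covered area of 4096 pixels equals 64 × 64, so the two 32×32 blocks, four 16×16 blocks, and sixteen 8×8 blocks tile the current block exactly.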
Embodiment 2
Fig. 12 is an exemplary flowchart of an image block dividing method according to an embodiment of the present application, where, as shown in fig. 12, a current block is a 32×32 image block, and a data set of the current block is denoted as a rawpatch_cu32.
1. The rawpatch_cu32 is input to the deep neural network model nn_cu32, and nn_cu32 processes it and outputs the division identifier depth_cu32_result.
When depth_cu32_result = 0, the 32×32 image block does not need to be divided further; the process ends, and the division manner of the current block is no division. When depth_cu32_result = 1, the 32×32 image block needs to be divided further, and the 32×32 image block is quadtree-divided to obtain four 16×16 image blocks.
2. The data set rawpatch_cu16 of the i-th 16×16 image block is input to the deep neural network model nn_cu16, where i ∈ [1, 4], and nn_cu16 processes it and outputs the division identifier depth_cu16_result[i].
When depth_cu16_result[i] = 0, the i-th 16×16 image block does not need to be divided further; the process for the i-th 16×16 image block ends, which gives the division manner of the region of the current block corresponding to the i-th 16×16 image block. When depth_cu16_result[i] = 1, the i-th 16×16 image block needs to be divided further, and the i-th 16×16 image block is quadtree-divided to obtain four 8×8 image blocks.
For example, the data set rawpatch_cu16 of each of the four 16×16 image blocks is input to nn_cu16, and the division identifiers obtained are depth_cu16_result[1] = 1, depth_cu16_result[2] = 0, depth_cu16_result[3] = 0, and depth_cu16_result[4] = 1. That is, the 1st and 4th 16×16 image blocks need to be divided further, and the 2nd and 3rd 16×16 image blocks do not.
Division does not continue once 8×8 image blocks are reached, that is, the 8×8 image block is the smallest coding unit.
The division manner of the current block may be recursively obtained according to the division identifiers obtained in the steps 1 and 2, for example, the current block may be divided according to the division manner of the current block to obtain a division result as shown in fig. 13.
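The example identifiers of this embodiment can be turned into a small layout grid, as sketched below. Each grid cell stands for one 8×8 area and holds the size of the undivided block that covers it. The mapping of the 1st to 4th sub-blocks to top-left, top-right, bottom-left, and bottom-right is an assumption (the text does not state the ordering), so the grid is not guaranteed to match the layout drawn in fig. 13; `leaf_size_grid` is a hypothetical name.

```python
from typing import List

CELL = 8  # one grid cell is one 8x8 minimum coding unit


def leaf_size_grid(depth_cu32_result: int, depth_cu16_result: List[int]) -> List[List[int]]:
    """Return a 4x4 grid whose cells hold the size of the block covering them."""
    cells = 32 // CELL
    if depth_cu32_result == 0:
        return [[32] * cells for _ in range(cells)]
    grid = [[0] * cells for _ in range(cells)]
    for index, flag in enumerate(depth_cu16_result):   # index 0..3 is the 1st..4th block
        top, left = (index // 2) * 2, (index % 2) * 2  # assumed raster order
        for r in range(top, top + 2):
            for c in range(left, left + 2):
                grid[r][c] = 8 if flag == 1 else 16
    return grid


# Identifiers from the example: the 1st and 4th 16x16 blocks are divided.
grid = leaf_size_grid(1, [1, 0, 0, 1])
for row in grid:
    print(" ".join(f"{v:2d}" for v in row))
```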
Embodiment 3
Fig. 14 is an exemplary flowchart of an image block dividing method according to an embodiment of the present application, where, as shown in fig. 14, a current block is a 16×16 image block, and a data set of the current block is denoted as a rawpatch_cu16.
The rawpatch_cu16 is input to the deep neural network model nn_cu16, and nn_cu16 processes it and outputs the division identifier depth_cu16_result.
When depth_cu16_result = 0, the 16×16 image block does not need to be divided further; the process ends, and the division manner of the current block is no division. When depth_cu16_result = 1, the 16×16 image block needs to be divided further, and the 16×16 image block is quadtree-divided to obtain four 8×8 image blocks.
Division does not continue once 8×8 image blocks are reached, that is, the 8×8 image block is the smallest coding unit.
According to the division identifier obtained in the above process, the division manner of the current block may be recursively obtained, for example, the current block may be divided according to the division manner of the current block to obtain the division result as shown in fig. 15.
Embodiment 4
In the above embodiments, according to the above description of how the training engine 25 trains the deep neural network model, the data set rawpatch_cu of the current block (with the size suffix 64, 32, or 16) may consist of imagepatch_cu and referencepatch_cu (as shown in fig. 7a), or of imagepatch_cu, referencepatch_cu, and renderpatch_cu (as shown in fig. 7b). For imagepatch_cu, referencepatch_cu, and renderpatch_cu, refer to the descriptions above; details are not repeated here.
In Embodiments 1 to 4 of the present application, the current block and the reference block of the current block are used as the input of the deep neural network, and the deep neural network determines the division manner of the current block.
Fig. 16 is a schematic diagram illustrating an exemplary structure of a block dividing apparatus 1600 according to an embodiment of the present application. As shown in fig. 16, the apparatus 1600 of this embodiment may be applied to the encoder 20, in particular to the dividing unit 262 in the encoder 20. The apparatus 1600 may include an acquisition module 1601, a decision module 1602, and a division module 1603, wherein:
the acquisition module 1601 is configured to acquire a data set of a current block, where the data set of the current block includes image data of the current block and image data of a reference block of the current block; the decision module 1602 is configured to input the data set of the current block into a neural network model to obtain a division manner of the current block, where the neural network model is associated with a size of an image block corresponding to the input data set, and the image block includes the current block and the reference block; and the division module 1603 is configured to divide the current block according to the division manner.
In one possible implementation, the decision module 1602 is specifically configured to input the data set of the current block into a first neural network model to obtain a first division identifier, where the first division identifier indicates whether the current block is to be divided, and the first neural network model is obtained through training based on image blocks of the same size as the current block.
In a possible implementation, the division module 1603 is further configured to divide the current block to obtain a plurality of first sub-blocks when the first division identifier indicates that the current block is to be divided. The decision module 1602 is further configured to obtain a data set of the first sub-block, where the data set of the first sub-block includes image data of the first sub-block and image data of a reference block of the first sub-block, and to input the data set of the first sub-block into a second neural network model to obtain a second division identifier. The second division identifier indicates whether the first sub-block is to be divided, and the second neural network model is obtained through training based on image blocks of the same size as the first sub-block.
In a possible implementation, the division module 1603 is further configured to divide the first sub-block to obtain a plurality of second sub-blocks when the second division identifier indicates that the first sub-block is to be divided. The decision module 1602 is further configured to obtain a data set of the second sub-block, where the data set of the second sub-block includes image data of the second sub-block and image data of a reference block of the second sub-block, and to input the data set of the second sub-block into a third neural network model to obtain a third division identifier. The third division identifier indicates whether the second sub-block is to be divided, and the third neural network model is obtained through training based on image blocks of the same size as the second sub-block.
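The level-by-level decision described above can be sketched as a short recursion. This is only an illustration under stated assumptions: each neural network model is replaced by a hypothetical callable that returns True when a block should be divided, the callables are keyed by block size, and division is a four-way quadtree split. The names `decide_partition`, `quadtree_children` and `SplitModel` are not from the application, and a real model would take the data set of the block rather than its coordinates.

```python
from typing import Callable, Dict, List, Tuple

Block = Tuple[int, int, int]          # (x, y, size), hypothetical representation
SplitModel = Callable[[Block], bool]  # stand-in for a size-specific model


def quadtree_children(block: Block) -> List[Block]:
    """Split a square block into its four equally sized quadrants."""
    x, y, size = block
    half = size // 2
    return [(x, y, half), (x + half, y, half),
            (x, y + half, half), (x + half, y + half, half)]


def decide_partition(block: Block, models: Dict[int, SplitModel]) -> List[Block]:
    """Return the leaf blocks after querying one model per block size."""
    size = block[2]
    model = models.get(size)
    # Stop when no model exists for this size or the model says "do not divide".
    if model is None or not model(block):
        return [block]
    leaves: List[Block] = []
    for child in quadtree_children(block):
        leaves.extend(decide_partition(child, models))
    return leaves
```

Keying the callables by size mirrors the statement that each model is trained on image blocks of the same size as the block it decides on.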
In one possible implementation, the acquisition module 1601 is further configured to obtain graphics processor rendering information (GRI) and to obtain a pixel-level optical flow of the current block according to the GRI. The data set further includes the pixel-level optical flow of the current block.
In a possible implementation, the acquisition module 1601 is specifically configured to extract, from the GRI, position information of a first pixel at different moments, where the first pixel is any pixel in the current block, and to obtain the optical flow of the first pixel by computing the difference between the positions of the first pixel at the different moments. The pixel-level optical flow of the current block includes the optical flow of the first pixel.
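The difference computation can be illustrated as follows. The sketch assumes that the GRI has already been parsed into two dictionaries mapping each pixel of the current block to its position at an earlier and a later moment, and that the flow is taken as later minus earlier; the data layout, the sign convention and the name `pixel_level_flow` are assumptions for illustration only.

```python
from typing import Dict, Tuple

Pixel = Tuple[int, int]          # pixel index inside the current block
Position = Tuple[float, float]   # (x, y) position at one moment


def pixel_level_flow(pos_prev: Dict[Pixel, Position],
                     pos_curr: Dict[Pixel, Position]) -> Dict[Pixel, Position]:
    """Per-pixel optical flow as the difference of positions at two moments."""
    flow: Dict[Pixel, Position] = {}
    for pixel, (x1, y1) in pos_curr.items():
        x0, y0 = pos_prev[pixel]
        # Assumed convention: displacement from the earlier to the later moment.
        flow[pixel] = (x1 - x0, y1 - y0)
    return flow
```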
In a possible implementation, the acquisition module 1601 is further configured to obtain a motion vector (MV) of the current block according to the pixel-level optical flow of the current block, and to obtain a reference block of the current block according to the MV of the current block.
In one possible implementation, the acquisition module 1601 is further configured to obtain an initial MV according to the pixel-level optical flow of the current block, to obtain the MV of the current block by applying a motion vector estimation algorithm with the initial MV as the search starting point, and to obtain a reference block of the current block according to the MV of the current block.
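One way to picture the two-stage MV derivation is sketched below. The application does not state how the initial MV is computed from the flow or which motion vector estimation algorithm is used, so this sketch assumes the mean of the per-pixel flow as the initial MV and a small full search minimizing the sum of absolute differences (SAD) as the refinement. It also assumes the flow is already expressed as a displacement from the current block to the reference frame. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = List[List[int]]  # 2-D array of samples, indexed [row][column]
MV = Tuple[int, int]     # (dx, dy) in integer samples


def initial_mv(flow: Sequence[Tuple[float, float]]) -> MV:
    """Assumed rule: average the per-pixel flow and round to whole samples."""
    n = len(flow)
    return (round(sum(f[0] for f in flow) / n),
            round(sum(f[1] for f in flow) / n))


def sad(cur: Frame, ref: Frame, x: int, y: int, size: int, mv: MV) -> int:
    """Sum of absolute differences between the block and its displaced match."""
    dx, dy = mv
    return sum(abs(cur[y + j][x + i] - ref[y + j + dy][x + i + dx])
               for j in range(size) for i in range(size))


def refine_mv(cur: Frame, ref: Frame, x: int, y: int, size: int,
              start: MV, radius: int = 1) -> MV:
    """Full search in a (2*radius+1)^2 window centred on the starting MV."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(start[1] - radius, start[1] + radius + 1):
        for dx in range(start[0] - radius, start[0] + radius + 1):
            # Skip candidates whose reference block leaves the frame.
            if not (0 <= x + dx and x + dx + size <= width
                    and 0 <= y + dy and y + dy + size <= height):
                continue
            cost = sad(cur, ref, x, y, size, (dx, dy))
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    if best_mv is None:
        raise ValueError("no valid candidate inside the reference frame")
    return best_mv
```

Starting the search from a flow-derived MV keeps the window small, which is the apparent motivation for using the optical flow as a search starting point.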
In one possible implementation, the current block is smaller than or equal to 64×64 in size.
In one possible implementation, the partitioning corresponds to a quadtree partitioning.
In implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in the embodiments of the present application may be performed directly by a hardware encoding processor, or by a combination of hardware and software modules in the encoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in a memory, and the processor reads the information in the memory and performs the steps of the above method in combination with its hardware.
The memory mentioned in the above embodiments may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to include, without being limited to, these and any other suitable types of memory.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in the embodiments of the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The foregoing is merely a specific implementation of the embodiments of the present application, but the protection scope of the embodiments of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the embodiments of the present application shall fall within the protection scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (22)

1. A method of inter-coded block partitioning, comprising:
acquiring a data set of a current block, wherein the data set of the current block comprises image data of the current block and image data of a reference block of the current block;
inputting the data set of the current block into a neural network model to obtain a division mode of the current block, wherein the neural network model is associated with the size of an image block corresponding to the input data set, and the image block comprises the current block and the reference block;
and dividing the current block according to the dividing mode.
2. The method according to claim 1, wherein said inputting the data set of the current block into at least one neural network model to obtain the partitioning of the current block comprises:
inputting the data set of the current block into a first neural network model to obtain a first division identifier, wherein the first division identifier is used for indicating whether the current block is to be divided or not, and the first neural network model is obtained through training based on image blocks with the same size as the current block.
3. The method according to claim 2, wherein the method further comprises:
dividing the current block to obtain a plurality of first sub-blocks when the first division identifier indicates to be divided;
acquiring a data set of the first sub-block, wherein the data set of the first sub-block comprises image data of the first sub-block and image data of a reference block of the first sub-block;
inputting the data set of the first sub-block into a second neural network model to obtain a second division identifier, wherein the second division identifier is used for indicating whether the first sub-block is to be divided or not, and the second neural network model is obtained based on image block training with the same size as the first sub-block.
4. A method according to claim 3, characterized in that the method further comprises:
dividing the first sub-block to obtain a plurality of second sub-blocks when the second division identifier indicates to be divided;
acquiring a data set of the second sub-block, wherein the data set of the second sub-block comprises image data of the second sub-block and image data of a reference block of the second sub-block;
and inputting the data set of the second sub-block into a third neural network model to obtain a third division identifier, wherein the third division identifier is used for indicating whether the second sub-block is to be divided or not, and the third neural network model is obtained based on image block training with the same size as the second sub-block.
5. The method according to claim 1, wherein the method further comprises:
acquiring rendering information GRI of a graphics processor;
acquiring a pixel-level optical flow of the current block according to the GRI;
the data set also includes pixel-level optical flow for the current block.
6. The method of claim 5, wherein the acquiring pixel-level optical flow of the current block according to the GRI comprises:
extracting position information of a first pixel point at different moments from the GRI, wherein the first pixel point is any pixel point in the current block;
obtaining the optical flow of the first pixel point by calculating the difference according to the position information of the first pixel point at different moments; the pixel-level optical flow of the current block includes the optical flow of the first pixel point.
7. The method according to claim 5 or 6, characterized in that the method further comprises:
acquiring the MV of the current block according to the pixel-level optical flow of the current block;
and acquiring a reference block of the current block according to the MV of the current block.
8. The method according to claim 5 or 6, characterized in that the method further comprises:
acquiring an initial MV according to the pixel-level optical flow of the current block;
taking the initial MV as a searching starting point, and acquiring the MV of the current block by adopting a motion vector estimation algorithm;
and acquiring a reference block of the current block according to the MV of the current block.
9. The method according to any of claims 1-8, wherein the current block is smaller than or equal to 64 x 64 in size.
10. The method according to any of claims 1-9, wherein the partitioning corresponds to a quadtree partitioning.
11. A block dividing apparatus, comprising:
an acquisition module, configured to acquire a data set of a current block, where the data set of the current block includes image data of the current block and image data of a reference block of the current block;
a decision module, configured to input the data set of the current block into a neural network model to obtain a division mode of the current block, where the neural network model is associated with the size of an image block corresponding to the input data set, and the image block comprises the current block and the reference block;
and a dividing module, configured to divide the current block according to the division mode.
12. The apparatus according to claim 11, wherein the decision module is specifically configured to input the data set of the current block into a first neural network model to obtain a first partition identifier, where the first partition identifier is used to indicate that the current block is to be partitioned or not partitioned, and the first neural network model is obtained based on training of image blocks with a size identical to that of the current block.
13. The apparatus of claim 12, wherein the partitioning module is further configured to partition the current block to obtain a plurality of first sub-blocks when the first partition identity indicates to partition; the decision module is further configured to obtain a data set of the first sub-block, where the data set of the first sub-block includes image data of the first sub-block and image data of a reference block of the first sub-block; inputting the data set of the first sub-block into a second neural network model to obtain a second division identifier, wherein the second division identifier is used for indicating whether the first sub-block is to be divided or not, and the second neural network model is obtained based on image block training with the same size as the first sub-block.
14. The apparatus of claim 13, wherein the partitioning module is further configured to partition the first sub-block to obtain a plurality of second sub-blocks when the second partition identification indicates to partition; the decision module is further configured to obtain a data set of the second sub-block, where the data set of the second sub-block includes image data of the second sub-block and image data of a reference block of the second sub-block; and inputting the data set of the second sub-block into a third neural network model to obtain a third division identifier, wherein the third division identifier is used for indicating whether the second sub-block is to be divided or not, and the third neural network model is obtained based on image block training with the same size as the second sub-block.
15. The apparatus of claim 11, wherein the obtaining module is further configured to obtain graphics processor rendering information GRI; acquiring a pixel-level optical flow of the current block according to the GRI; the data set also includes pixel-level optical flow for the current block.
16. The apparatus of claim 15, wherein the obtaining module is specifically configured to extract, from the GRI, location information of a first pixel at different moments, where the first pixel is any one pixel in the current block; obtaining the optical flow of the first pixel point by calculating the difference according to the position information of the first pixel point at different moments; the pixel-level optical flow of the current block includes the optical flow of the first pixel point.
17. The apparatus according to claim 15 or 16, wherein the obtaining module is further configured to obtain the MV of the current block according to the pixel-level optical flow of the current block; and acquiring a reference block of the current block according to the MV of the current block.
18. The apparatus according to claim 15 or 16, wherein the obtaining module is further configured to obtain an initial MV from a pixel-level optical flow of the current block; taking the initial MV as a searching starting point, and acquiring the MV of the current block by adopting a motion vector estimation algorithm; and acquiring a reference block of the current block according to the MV of the current block.
19. The apparatus according to any of claims 11-18, wherein the current block is less than or equal to 64 x 64 in size.
20. The apparatus of any of claims 11-19, wherein the partitioning method corresponds to a quadtree partitioning method.
21. An encoder, comprising:
one or more processors;
a non-transitory computer readable storage medium coupled to the processor and storing a program for execution by the processor, wherein the program, when executed by the processor, causes the encoder to perform the method of any of claims 1-10.
22. A computer program product comprising program code for performing the method of any of claims 1-10 when the program code is executed on a computer or processor.
CN202111369165.2A 2021-11-18 2021-11-18 Inter-coded block partitioning method and apparatus Pending CN116137659A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111369165.2A CN116137659A (en) 2021-11-18 2021-11-18 Inter-coded block partitioning method and apparatus

Publications (1)

Publication Number Publication Date
CN116137659A true CN116137659A (en) 2023-05-19

Family

ID=86334277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111369165.2A Pending CN116137659A (en) 2021-11-18 2021-11-18 Inter-coded block partitioning method and apparatus

Country Status (1)

Country Link
CN (1) CN116137659A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117373695A (en) * 2023-10-12 2024-01-09 北京透彻未来科技有限公司 Extreme deep convolutional neural network-based diagnosis system for diagnosis of cancer disease
CN117373695B (en) * 2023-10-12 2024-05-17 北京透彻未来科技有限公司 Extreme deep convolutional neural network-based diagnosis system for diagnosis of cancer disease

Legal Events

Date Code Title Description
PB01 Publication