WO2022063265A1 - Inter-frame prediction method and apparatus

Info

Publication number: WO2022063265A1
Application number: PCT/CN2021/120640
Authority: WO
Inventors: 陈焕浜; 杨海涛
Original assignee: 华为技术有限公司
Priority date: 2020-09-28
Filing date: 2021-09-26
Publication date: 2022-03-31
Also published as: CN114286100A

Abstract

The present application provides an inter-frame prediction method and related apparatus. The invention relates to the technical field of artificial intelligence (AI)-based video or image compression, and specifically relates to the technical field of neural network-based video compression, said method comprising: obtaining the motion vectors of each of P reconstructed image blocks in a surrounding area of a current block; according to the respective motion vectors of the P reconstructed image blocks, obtaining Q prior candidate motion vectors and Q probability values of the current block; according to M probability values corresponding to M prior candidate motion vectors, obtaining M weight factors corresponding to the M prior candidate motion vectors; M, P, and Q are positive integers; according to the M prior candidate motion vectors, performing motion compensation to obtain M predicted values, respectively; performing weighted summation according to the M predicted values and the corresponding M weighting factors to obtain a predicted value of the current block. The present application improves the accuracy of inter-frame prediction, reduces the error in inter-frame prediction, and improves RDO efficiency for inter-frame prediction.

Description

Inter prediction method and device

This application claims priority to Chinese patent application No. 202011043942.X, entitled "Inter-frame prediction method and device" and filed with the Chinese Patent Office on September 28, 2020, the entire contents of which are incorporated herein by reference.

Technical field

The embodiments of the present application relate to the technical field of video or image compression based on artificial intelligence (AI), and in particular, to an inter-frame prediction method and apparatus.

Background

Video coding (video encoding and decoding) is widely used in digital video applications such as broadcast digital television, video transmission over the Internet and mobile networks, real-time conversational applications such as video chat and video conferencing, Digital Versatile Discs (DVDs) and Blu-ray Discs, video content capture and editing systems, and camcorders for security applications.

The large amount of video data that needs to be described even in the case of short films can create difficulties when the data is to be sent or otherwise transmitted over a network with limited bandwidth capacity. Therefore, video data is usually compressed before being transmitted in modern telecommunication networks. Since memory resources can be limited, the size of the video can also be an issue when storing the video on a storage device. Video compression devices typically use software and/or hardware on the source side to encode video data prior to transmission or storage, thereby reducing the amount of data required to represent digital video images. Then, the compressed data is received by the video decompression device at the destination side. With limited network resources and growing demand for higher video quality, there is a need for improved compression and decompression techniques that can increase compression ratios with little impact on image quality.

Prediction in video coding can be divided into intra-frame prediction and inter-frame prediction. Inter-frame prediction finds, in a reconstructed image, a matching reference block for the current block of the current image, and uses the pixel values in the reference block as the predicted values of the pixels in the current block. The encoder tries multiple reference blocks in the reference picture for the current block, decides which reference block is suitable for the current block, and transmits the motion information to the decoder. The decoder can find the reference block of the corresponding image block through the motion information in the code stream, and then obtain the prediction of the image block. The motion information includes one or two motion vectors (MVs) pointing to the reference block, and indication information of the image where the reference block is located (usually denoted as a reference frame index (reference index, RI)). In the high efficiency video coding (HEVC) standard, two inter-frame prediction modes are defined, namely the advanced motion vector prediction (AMVP) mode and the merge (Merge) mode. In both modes, a candidate motion information list is first constructed from the motion information of the reconstructed image blocks adjacent to the current block in the spatial or temporal domain, and then the optimal motion information is determined from the candidate motion information list as the motion information of the current block. Further, the prediction of the current block is obtained based on the motion information of the current block.

Therefore, how to obtain the prediction of the current block according to the multiple candidate motion information is the key to realize the inter prediction.

SUMMARY OF THE INVENTION

The present application provides an inter-frame prediction method and apparatus, so as to improve the accuracy of inter-frame prediction, reduce the error of inter-frame prediction, and improve the RDO efficiency of inter-frame prediction.

In a first aspect, the present application provides an inter-frame prediction method, comprising: acquiring the respective motion vectors of P reconstructed image blocks in a surrounding area of a current block, where the surrounding area includes a spatial neighborhood and/or a temporal neighborhood of the current block; obtaining Q a priori candidate motion vectors of the current block and Q probability values corresponding to the Q a priori candidate motion vectors according to the respective motion vectors of the P reconstructed image blocks; obtaining M weighting factors corresponding to M a priori candidate motion vectors according to M probability values corresponding to the M a priori candidate motion vectors, where M, P and Q are positive integers; performing motion compensation according to the M a priori candidate motion vectors respectively to obtain M predicted values; and obtaining the predicted value of the current block by weighted summation of the M predicted values and the corresponding M weighting factors.

The surrounding area of the current block includes spatial and/or temporal neighborhoods of the current block, wherein the image blocks in the spatial neighborhood may include left candidate image blocks located to the left of the current block and upper candidate image blocks located above the current block.

The reconstructed image block may refer to an encoded image block that has been encoded and reconstructed by an encoder, or a decoded image block that has been decoded and reconstructed by a decoder. The reconstructed image block may also refer to a basic unit image block of a preset size obtained by dividing an encoded image block or a decoded image block by size.

The motion vectors of a reconstructed image block may include: (1) multiple a posteriori motion vectors of the reconstructed image block, where the multiple a posteriori motion vectors are determined according to the reconstructed value of the reconstructed image block and the predicted values corresponding to multiple a posteriori candidate motion vectors; or (2) the optimal motion vector of the reconstructed image block, where the optimal motion vector is the a posteriori motion vector with the largest probability value or the smallest prediction error value among the above-mentioned multiple a posteriori motion vectors.

The multiple a posteriori candidate motion vectors of the reconstructed image block are obtained from the multiple a priori candidate motion vectors of the reconstructed image block. Any a priori candidate motion vector of the reconstructed image block can be offset within a preset search window to generate multiple offset candidate motion vectors, so one a priori candidate motion vector yields multiple offset candidate motion vectors. Each of the multiple a priori candidate motion vectors of the reconstructed image block is processed in this way, and all the offset candidate motion vectors obtained are the multiple a posteriori candidate motion vectors of the reconstructed image block. Each of the above-mentioned P reconstructed image blocks can obtain its multiple a posteriori candidate motion vectors according to this method, which will not be described one by one here.
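The offsetting step can be illustrated with a minimal Python sketch. It assumes integer motion vectors and a square search window; the function name `offset_candidates` and the parameter `search_range` are hypothetical and do not appear in the application, which does not fix the window shape or precision at this point.

```python
from itertools import product


def offset_candidates(prior_mvs, search_range=1):
    """Offset each a priori candidate MV inside a square search window.

    prior_mvs: list of (mvx, mvy) integer tuples.
    search_range: assumed half-width of the window, in MV units.
    Returns every offset candidate MV; together they form the
    a posteriori candidate MVs of the block.
    """
    steps = range(-search_range, search_range + 1)
    offsets = list(product(steps, repeat=2))  # includes the zero offset
    posterior = []
    for mvx, mvy in prior_mvs:
        for dx, dy in offsets:
            posterior.append((mvx + dx, mvy + dy))
    return posterior


# One a priori candidate with a +/-1 window gives 3 x 3 = 9 candidates.
candidates = offset_candidates([(2, -1)], search_range=1)
print(len(candidates))  # 9
```

In this sketch, offset candidates generated from different a priori candidates are not deduplicated; whether duplicates are merged is left open here.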

The multiple a posteriori motion vectors of the reconstructed image block may refer to the above-mentioned multiple a posteriori candidate motion vectors, or may refer to some of them, such as multiple specified motion vectors selected from the above-mentioned multiple a posteriori candidate motion vectors. Each of the above-mentioned P reconstructed image blocks can obtain its multiple a posteriori motion vectors according to this method, which will not be described one by one here.

The respective motion vectors of the P reconstructed image blocks can be input into the trained neural network to obtain Q a priori candidate motion vectors of the current block and Q probability values corresponding to the Q prior candidate motion vectors. For the neural network, reference may be made to the description of the training engine 25 below, which will not be repeated here.

The Q a priori candidate motion vectors of the current block may refer to all the motion vectors remaining after deduplication of the respective multiple a posteriori motion vectors of the P reconstructed image blocks, or may refer to some of the motion vectors remaining after that deduplication.

Optionally, M=Q, in this case, the M probability values refer to the above-mentioned Q probability values, and the M a priori candidate motion vectors refer to the above-mentioned Q a priori candidate motion vectors.

Optionally, M<Q. In this case, the M probability values are all greater than the other probability values among the Q probability values, and the M a priori candidate motion vectors corresponding to the M probability values are selected from the Q a priori candidate motion vectors of the current block. That is, the M largest probability values are selected from the Q probability values corresponding to the Q a priori candidate motion vectors of the current block, and the M a priori candidate motion vectors corresponding to those M probability values are selected from the Q a priori candidate motion vectors. The weighting factors and predicted values are then calculated based on the M probability values and the M a priori candidate motion vectors, and the predicted value of the current block is obtained from them. The remaining probability values, other than the aforementioned M probability values, can be ignored because they are small, which reduces the amount of calculation and improves the efficiency of inter-frame prediction.
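The selection for the M<Q case can be sketched as follows. This is only an illustration under the assumption that candidates and probability values are held in two parallel lists; the name `select_top_m` is hypothetical.

```python
def select_top_m(candidates, probabilities, m):
    """Keep the m candidates whose probability values are largest.

    candidates and probabilities are parallel lists of length Q.
    Returns (selected_candidates, selected_probabilities), ordered
    from the largest probability value downwards.
    """
    if len(candidates) != len(probabilities):
        raise ValueError("candidates and probabilities must have equal length")
    if not 1 <= m <= len(candidates):
        raise ValueError("m must satisfy 1 <= m <= Q")
    # Indices of the Q candidates, sorted by descending probability value.
    order = sorted(range(len(candidates)),
                   key=lambda i: probabilities[i], reverse=True)[:m]
    return [candidates[i] for i in order], [probabilities[i] for i in order]


mvs = [(0, 0), (1, 0), (0, 1), (2, 2)]
probs = [0.1, 0.5, 0.3, 0.1]
print(select_top_m(mvs, probs, 2))  # ([(1, 0), (0, 1)], [0.5, 0.3])
```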

It should be noted that "corresponding" in "the M probability values corresponding to the M a priori candidate motion vectors" does not necessarily refer to a one-to-one correspondence. For example, if the current block has 5 a priori candidate motion vectors, the corresponding probability values may be 5 probability values or fewer than 5 probability values.

When the sum of the M probability values is 1, the probability value corresponding to the first a priori candidate motion vector is used as the weighting factor corresponding to the first a priori candidate motion vector. That is, the respective weighting factors of the M a priori candidate motion vectors are the respective probability values of the M a priori candidate motion vectors. Alternatively, when the sum of the M probability values is not 1, the M probability values are normalized, and the normalized value of the probability value corresponding to the first a priori candidate motion vector is used as the weighting factor corresponding to the first a priori candidate motion vector. That is, the respective weighting factors of the M a priori candidate motion vectors are the normalized values of the respective probability values of the M a priori candidate motion vectors. The above-mentioned first a priori candidate motion vector is only a term used for convenience of description; it does not refer to a specific a priori candidate motion vector, but represents any one of the Q a priori candidate motion vectors. It can be seen that the sum of the M weighting factors corresponding to the M a priori candidate motion vectors is 1.
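A minimal sketch of the weighting-factor rule, assuming that "normalized" means dividing each probability value by the sum of the M values (the application does not name a specific normalization here). The function name `weights_from_probabilities` is hypothetical.

```python
import math


def weights_from_probabilities(probs):
    """Turn M probability values into M weighting factors that sum to 1."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probability values must have a positive sum")
    if math.isclose(total, 1.0):
        # The probability values are used directly as weighting factors.
        return list(probs)
    # Assumed normalization: divide by the sum of the M values.
    return [p / total for p in probs]


weights = weights_from_probabilities([0.5, 0.25])
print(weights)
```

With the two probability values 0.5 and 0.25 (sum 0.75), the resulting weighting factors are approximately 2/3 and 1/3.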

According to the principle of inter-frame prediction, a candidate motion vector can find a reference block in the reference frame of the current block, and perform inter-frame prediction on the current block according to the reference block to obtain the predicted value corresponding to the candidate motion vector. The predicted values correspond to candidate motion vectors. Therefore, the motion compensation is respectively performed according to the M a priori candidate motion vectors, and M predicted values of the current block can be obtained.

The predicted value of the current block is obtained by weighted summation of the M predicted values and the corresponding M weighting factors. As mentioned above, the M predicted values correspond to the M a priori candidate motion vectors, and the M weighting factors also correspond to the M a priori candidate motion vectors. Therefore, for the same a priori candidate motion vector, a correspondence is also established between its predicted value and its weighting factor. The weighting factor and the predicted value corresponding to the same a priori candidate motion vector are multiplied, and the multiple products corresponding to the multiple a priori candidate motion vectors are then added to obtain the predicted value of the current block.
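The weighted summation can be written out as a short sketch. It assumes each predicted value is a block of samples stored as a list of rows and that all M blocks have the same size; rounding or clipping of the result to the sample bit depth is omitted. The name `weighted_prediction` is hypothetical.

```python
def weighted_prediction(predictions, weights):
    """Combine M predicted blocks into one block by weighted summation.

    predictions: list of M blocks, each a list of rows of sample values.
    weights: list of M weighting factors (expected to sum to 1).
    """
    if len(predictions) != len(weights):
        raise ValueError("need one weighting factor per predicted block")
    rows, cols = len(predictions[0]), len(predictions[0][0])
    return [
        [sum(w * pred[r][c] for pred, w in zip(predictions, weights))
         for c in range(cols)]
        for r in range(rows)
    ]


pred_a = [[100, 100], [100, 100]]
pred_b = [[120, 80], [60, 140]]
print(weighted_prediction([pred_a, pred_b], [0.75, 0.25]))
# [[105.0, 95.0], [90.0, 110.0]]
```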

The present application obtains multiple weighting factors and multiple predicted values of the current block based on the respective motion vectors of multiple reconstructed image blocks in the surrounding area of the current block, multiplies the weighting factor and the predicted value corresponding to the same a priori candidate motion vector, and adds the multiple products corresponding to the multiple a priori candidate motion vectors to obtain the predicted value of the current block. The predicted value obtained in this way combines multiple a priori candidate motion vectors, so it can fit the rich and changeable textures of the real world well, which improves the accuracy of inter-frame prediction, reduces the error of inter-frame prediction, and improves the overall rate-distortion optimization (RDO) efficiency of inter-frame prediction.

In a possible implementation manner, in addition to acquiring the respective motion vectors of the P reconstructed image blocks, the respective related information of the P reconstructed image blocks may also be acquired. The related information of a reconstructed image block may be multiple prediction error values corresponding to the multiple a posteriori motion vectors of the reconstructed image block, where the multiple prediction error values are also determined according to the reconstructed value of the reconstructed image block and the predicted values corresponding to the multiple a posteriori candidate motion vectors.

Motion compensation is performed on the reconstructed image block according to each of the multiple a posteriori candidate motion vectors of the reconstructed image block to obtain multiple predicted values, which correspond to the aforementioned multiple a posteriori candidate motion vectors.

The multiple predicted values are each compared with the reconstructed value of the reconstructed image block to obtain multiple prediction error values, which correspond to the multiple a posteriori candidate motion vectors. In the present application, methods such as the sum of absolute differences (SAD) or the sum of squared errors (SSE) can be used to obtain the prediction error value corresponding to a given a posteriori candidate motion vector.
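As an illustration of the two error measures named above, the following sketch computes SAD and SSE between a reconstructed block and one predicted block. Blocks are assumed to be lists of rows of equal size; the function names are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def sse(block_a, block_b):
    """Sum of squared errors between two equally sized blocks."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


reconstructed = [[10, 12], [8, 9]]
predicted = [[11, 12], [6, 9]]
print(sad(reconstructed, predicted))  # 3
print(sse(reconstructed, predicted))  # 5
```

Evaluating one of these functions once per a posteriori candidate motion vector yields the list of prediction error values for the block.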

If the multiple a posteriori motion vectors of the reconstructed image block refer to the multiple a posteriori candidate motion vectors, the multiple prediction error values corresponding to the multiple a posteriori motion vectors refer to the multiple prediction error values corresponding to the multiple a posteriori candidate motion vectors of the reconstructed image block. If the multiple a posteriori motion vectors of the reconstructed image block refer to some of the above multiple a posteriori candidate motion vectors, the multiple prediction error values corresponding to the multiple a posteriori motion vectors refer to the prediction error values of those motion vectors, selected from the multiple prediction error values corresponding to the multiple a posteriori candidate motion vectors.

Correspondingly, the input to the neural network includes a plurality of a posteriori motion vectors for each of the P reconstructed image blocks and a plurality of prediction error values corresponding to the plurality of a posteriori motion vectors.

In a possible implementation manner, in addition to acquiring the respective motion vectors of the P reconstructed image blocks, the respective related information of the P reconstructed image blocks may also be acquired. The related information of a reconstructed image block may be multiple probability values corresponding to the multiple a posteriori motion vectors of the reconstructed image block, where the multiple probability values are also determined according to the reconstructed value of the reconstructed image block and the predicted values corresponding to the multiple a posteriori candidate motion vectors.

The multiple probability values corresponding to multiple posterior motion vectors of the reconstructed image block can be obtained in the following two ways:

One is to obtain the multiple probability values of the reconstructed image block according to the multiple prediction error values of the reconstructed image block obtained by the above method. For example, a normalized exponential function (softmax), a linear normalization method, or the like can be used to normalize the multiple prediction error values of the reconstructed image block; the normalized values of the multiple prediction error values are the multiple probability values of the reconstructed image block. Based on the correspondence between the multiple prediction error values of the reconstructed image block and the multiple a posteriori motion vectors, the multiple probability values also correspond to the multiple a posteriori motion vectors of the reconstructed image block. A probability value represents the probability that its corresponding a posteriori motion vector is the optimal motion vector of the reconstructed image block.

The other is to input the reconstructed value of the reconstructed image block and the multiple predicted values of the reconstructed image block obtained in the first method into the trained neural network to obtain the multiple probability values corresponding to the multiple a posteriori motion vectors of the reconstructed image block. For the neural network, reference may be made to the description of the training engine 25 below, which will not be repeated here.
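A hedged sketch of the first way, using the normalized exponential function. The application does not state how the errors enter the function; this sketch assumes the errors are negated first, so that a smaller prediction error value gives a larger probability value, and it introduces a hypothetical `temperature` scale parameter.

```python
import math


def errors_to_probabilities(errors, temperature=1.0):
    """Map prediction error values to probability values with softmax.

    Assumption: softmax is applied to the negated errors, so the
    candidate with the smallest error gets the largest probability.
    temperature is an illustrative scale factor, not from the source.
    """
    scaled = [-e / temperature for e in errors]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [x / total for x in exps]


probs = errors_to_probabilities([3.0, 1.0, 5.0])
print(probs)  # the second candidate has the largest probability value
```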

Correspondingly, the input to the neural network includes a plurality of a posteriori motion vectors of each of the P reconstructed image blocks and a plurality of probability values corresponding to the plurality of posterior motion vectors.

Therefore, after obtaining multiple prediction error values or probability values corresponding to multiple posterior motion vectors through the above two methods, the optimal motion vector of the reconstructed image block can be obtained by the following two methods:

One is to use the posterior motion vector corresponding to the smallest prediction error value among the multiple prediction error values corresponding to the multiple posterior motion vectors as the optimal motion vector of the reconstructed image block.

The other is to use the posterior motion vector corresponding to the largest probability value among the multiple probability values corresponding to the multiple posterior motion vectors as the optimal motion vector of the reconstructed image block.
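Both selection rules reduce to an arg-min or arg-max over the a posteriori motion vectors. A minimal sketch, assuming the motion vectors and their error or probability values are held in parallel lists (the function names are hypothetical):

```python
def optimal_mv_by_error(mvs, errors):
    """Return the MV whose prediction error value is smallest."""
    best = min(range(len(mvs)), key=lambda i: errors[i])
    return mvs[best]


def optimal_mv_by_probability(mvs, probabilities):
    """Return the MV whose probability value is largest."""
    best = max(range(len(mvs)), key=lambda i: probabilities[i])
    return mvs[best]


mvs = [(0, 0), (1, -1), (2, 0)]
print(optimal_mv_by_error(mvs, [40, 12, 25]))           # (1, -1)
print(optimal_mv_by_probability(mvs, [0.2, 0.1, 0.7]))  # (2, 0)
```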

It should be noted that the optimal motion vector in this application only refers to the motion vector obtained by one of the above two methods, which is one of the multiple a posteriori motion vectors of the reconstructed image block. It is not the only motion vector used when inter-predicting the reconstructed image block.

In a possible implementation manner, after obtaining the reconstruction value of the current block, the posterior motion vector of the current block and its related information can be obtained immediately, and the obtaining method includes:

1. Obtain multiple a posteriori motion vectors of the current block and multiple prediction error values corresponding to the multiple a posteriori motion vectors according to the reconstructed value of the current block and the predicted values corresponding to multiple a posteriori candidate motion vectors of the current block. The multiple a posteriori motion vectors of the current block are obtained from the multiple a priori candidate motion vectors of the current block.

2. Input the reconstructed value of the current block and the predicted values corresponding to the multiple a posteriori candidate motion vectors of the current block into the neural network to obtain multiple a posteriori motion vectors of the current block and multiple probability values corresponding to the multiple a posteriori motion vectors, where the multiple a posteriori motion vectors of the current block are obtained according to multiple a priori candidate motion vectors of the current block; or obtain the multiple probability values corresponding to the multiple a posteriori motion vectors of the current block according to the multiple prediction error values of the current block.

3. Determine the a posteriori motion vector with the largest probability value or the smallest prediction error value among the multiple posterior motion vectors of the current block as the optimal motion vector of the current block.

In a possible implementation manner, the training data set on which the training engine trains the neural network includes information of multiple groups of image blocks, where the information of each group of image blocks includes the respective multiple a posteriori motion vectors of multiple reconstructed image blocks, multiple probability values corresponding to those a posteriori motion vectors, multiple a posteriori motion vectors of the current block, and multiple probability values corresponding to the a posteriori motion vectors of the current block. The multiple reconstructed image blocks are image blocks in the spatial neighborhood and/or temporal neighborhood of the current block. The neural network is obtained by training according to the training data set.

In a possible implementation manner, the training data set on which the training engine trains the neural network includes information of multiple groups of image blocks, where the information of each group of image blocks includes the respective multiple a posteriori motion vectors of multiple reconstructed image blocks, multiple prediction error values corresponding to those a posteriori motion vectors, multiple a posteriori motion vectors of the current block, and multiple probability values corresponding to the a posteriori motion vectors of the current block. The multiple reconstructed image blocks are image blocks in the spatial neighborhood and/or temporal neighborhood of the current block. The neural network is obtained by training according to the training data set.

In a possible implementation manner, the training data set on which the training engine trains the neural network includes information of multiple groups of image blocks, where the information of each group of image blocks includes the respective optimal motion vectors of multiple reconstructed image blocks, multiple a posteriori motion vectors of the current block, and multiple probability values corresponding to the a posteriori motion vectors of the current block. The multiple reconstructed image blocks are image blocks in the spatial neighborhood and/or temporal neighborhood of the current block. The neural network is obtained by training according to the training data set.

Optionally, the neural network includes at least a convolution layer and an activation layer. The depth of the convolution kernel of the convolution layer is 2, 3, 4, 5, 6, 16, 24, 32, 48, 64 or 128, and the size of the convolution kernel in the convolution layer is 1×1, 3×3, 5×5 or 7×7. For example, the size of a convolution layer is 3×3×2×10, where 3×3 represents the size of the convolution kernel in the convolution layer; 2 represents the depth of the convolution kernel contained in the convolution layer, and the number of data channels input to the convolution layer is the same as this depth, that is, the number of input data channels is also 2; and 10 represents the number of convolution kernels contained in the convolution layer, and the number of data channels output by the convolution layer is the same as the number of convolution kernels, that is, the number of output data channels is also 10.
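The 3×3×2×10 example can be checked with a small pure-Python convolution. This sketch assumes stride 1 and no padding, neither of which is specified in the application, and the name `conv2d` is hypothetical. It only shows how 2 input channels and 10 kernels of depth 2 give 10 output channels.

```python
def conv2d(inputs, kernels):
    """Plain 2D convolution layer (stride 1, no padding; assumptions).

    inputs:  [in_channels][height][width]
    kernels: [num_kernels][depth][k][k], where depth == in_channels
    Returns  [num_kernels][height - k + 1][width - k + 1]
    """
    in_ch, height, width = len(inputs), len(inputs[0]), len(inputs[0][0])
    k = len(kernels[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    outputs = []
    for kernel in kernels:
        if len(kernel) != in_ch:
            raise ValueError("kernel depth must equal input channel count")
        plane = [
            [sum(kernel[c][i][j] * inputs[c][y + i][x + j]
                 for c in range(in_ch) for i in range(k) for j in range(k))
             for x in range(out_w)]
            for y in range(out_h)
        ]
        outputs.append(plane)
    return outputs


# A 3x3x2x10 layer: 10 kernels, each 3x3 with depth 2, on a 2-channel 5x5 input.
inputs = [[[1] * 5 for _ in range(5)] for _ in range(2)]
kernels = [[[[1] * 3 for _ in range(3)] for _ in range(2)] for _ in range(10)]
outputs = conv2d(inputs, kernels)
print(len(outputs), len(outputs[0]), len(outputs[0][0]))  # 10 3 3
```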

Optionally, the neural network includes a convolutional neural network (CNN), a deep neural network (DNN), or a recurrent neural network (RNN).

In a second aspect, the present application provides an encoder, comprising a processing circuit for performing the method according to any one of the above-mentioned first aspects.

In a third aspect, the present application provides a decoder, including a processing circuit, configured to perform the method described in any one of the above-mentioned first aspect.

In a fourth aspect, the present application provides a computer program product, including program code, which, when executed on a computer or a processor, is used to perform the method described in any one of the above-mentioned first aspects.

In a fifth aspect, the present application provides an encoder, comprising: one or more processors; and a non-transitory computer-readable storage medium coupled to the processors and storing a program executed by the processors, wherein the program, when executed by the processors, causes the encoder to execute the method described in any one of the above-mentioned first aspect.

In a sixth aspect, the present application provides a decoder, comprising: one or more processors; and a non-transitory computer-readable storage medium coupled to the processors and storing a program executed by the processors, wherein the program, when executed by the processors, causes the decoder to execute the method described in any one of the above-mentioned first aspect.

In a seventh aspect, the present application provides a non-transitory computer-readable storage medium, comprising program code, which, when executed by a computer device, is used to perform the method described in any one of the above-mentioned first aspects.

In an eighth aspect, the present application relates to an inter-frame prediction apparatus. For its beneficial effects, reference may be made to the description of the first aspect, which will not be repeated here. The inter-frame prediction apparatus has the function of implementing the behavior in the method embodiment of the first aspect. The functions can be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions. In a possible design, the inter-frame prediction apparatus includes a motion estimation unit and an inter-frame prediction processing unit, where the motion estimation unit is configured to acquire the respective motion vectors of the P reconstructed image blocks in the surrounding area of the current block, the surrounding area including a spatial neighborhood and/or a temporal neighborhood of the current block, and the inter-frame prediction processing unit is configured to implement the method described in any one of the first aspect. These modules can perform the corresponding functions in the method examples of the first aspect. For details, refer to the detailed descriptions in the method examples, which will not be repeated here.

One or more embodiments are described in detail in the accompanying drawings and the description below. Other features, objects, and advantages are apparent from the description, drawings, and claims.

Description of drawings

FIG. 1a is an exemplary block diagram of a decoding system 10 according to an embodiment of the present application;

FIG. 1b is an exemplary block diagram of a video decoding system 40 according to an embodiment of the present application;

FIG. 2 is an exemplary block diagram of a video encoder 20 according to an embodiment of the present application;

FIG. 3 is an exemplary block diagram of a video decoder 30 according to an embodiment of the present application;

FIG. 4 is an exemplary block diagram of a video decoding apparatus 400 according to an embodiment of the present application;

FIG. 5 is an exemplary block diagram of an apparatus 500 according to an embodiment of the present application;

FIGS. 6a-6e are several exemplary architectures of a neural network for inter-frame prediction according to an embodiment of the present application;

FIG. 7 is an exemplary schematic diagram of a candidate image block according to an embodiment of the present application;

FIG. 8 is a flowchart of a process 800 of an inter-frame prediction method according to an embodiment of the present application;

FIG. 9 is a flowchart of a process 900 of an inter-frame prediction method according to an embodiment of the present application;

FIG. 10 is an exemplary schematic diagram of a search window according to an embodiment of the present application;

FIG. 11 is a flowchart of a process 1100 of an inter-frame prediction method according to an embodiment of the present application;

FIG. 12 is a flowchart of a process 1200 of an inter-frame prediction method according to an embodiment of the present application;

FIG. 13 is a schematic structural diagram of an inter-frame prediction apparatus 1300 according to an embodiment of the present application.

detailed description

Embodiments of the present application provide an AI-based video compression technology, in particular a neural network-based video compression technology, and specifically provide an inter-frame prediction technology based on a neural network (NN) to improve the traditional hybrid video codec system.

Video coding generally refers to the processing of a sequence of images that form a video or video sequence. In the field of video coding, the terms "picture", "frame", and "image" may be used as synonyms. Video coding (or coding for short) includes two parts: video encoding and video decoding. Video encoding is performed on the source side and typically involves processing (e.g., compressing) the original video image to reduce the amount of data required to represent the video image (and thus to store and/or transmit it more efficiently). Video decoding is performed on the destination side and typically involves inverse processing relative to the encoder to reconstruct the video image. The "coding" of a video image (or an image for short) in the embodiments should be understood as the "encoding" or "decoding" of a video image or a video sequence. The encoding part and the decoding part are also collectively referred to as a codec (encoding and decoding, CODEC).

In the case of lossless video coding, the original video image can be reconstructed, i.e., the reconstructed video image has the same quality as the original video image (assuming no transmission loss or other data loss during storage or transmission). In the case of lossy video coding, further compression is performed through quantization and the like to reduce the amount of data required to represent the video image, and the decoder side cannot completely reconstruct the video image, that is, the quality of the reconstructed video image is lower or worse than that of the original video image.

Several video coding standards belong to the category of "lossy hybrid video codecs" (i.e., they combine spatial and temporal prediction in the pixel domain with 2D transform coding for applying quantization in the transform domain). Each image in a video sequence is usually partitioned into a set of non-overlapping blocks, and coding is usually performed at the block level. In other words, the encoder typically processes, i.e., encodes, video at the block (video block) level: for example, it generates a prediction block by spatial (intra) prediction or temporal (inter) prediction; subtracts the prediction block from the current block (the block currently being processed or to be processed) to obtain a residual block; and transforms the residual block and quantizes it in the transform domain to reduce the amount of data to be transmitted (compressed). The decoder applies the inverse processing relative to the encoder to the encoded or compressed block to reconstruct the current block for representation. Additionally, the encoder needs to repeat the decoder's processing steps so that the encoder and the decoder generate the same predictions (e.g., intra and inter predictions) and/or reconstructed pixels for processing, i.e., coding, subsequent blocks.
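The block-level loop described above can be sketched as follows. This is a minimal illustration, not the codec of the embodiments: the transform step is omitted, the quantizer is a plain uniform scalar quantizer, and the function names `quantize`, `dequantize`, `encode_block`, and `decode_block` are hypothetical. It shows why the encoder repeats the decoder's steps: both sides then hold the same reconstructed block.

```python
def quantize(residual, step):
    """Uniform scalar quantization (assumed for illustration): residual -> integer levels."""
    return [[round(r / step) for r in row] for row in residual]


def dequantize(levels, step):
    """Inverse quantization: integer levels -> approximate residual."""
    return [[q * step for q in row] for row in levels]


def decode_block(levels, prediction, step):
    """Decoder side: reconstruct the block as prediction + dequantized residual."""
    residual = dequantize(levels, step)
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(prediction, residual)]


def encode_block(current, prediction, step):
    """Encoder side: form the residual, quantize it, then run the decoder's
    reconstruction so that encoder and decoder share the same reference."""
    residual = [[c - p for c, p in zip(crow, prow)]
                for crow, prow in zip(current, prediction)]
    levels = quantize(residual, step)
    reconstruction = decode_block(levels, prediction, step)
    return levels, reconstruction


current = [[53, 55], [61, 59]]
prediction = [[50, 50], [60, 60]]
levels, recon = encode_block(current, prediction, step=4)
print(levels)  # quantized residual levels that would be entropy coded
print(recon)   # lossy reconstruction, identical at encoder and decoder
```

With `step=1` the quantizer is the identity on integer residuals and the reconstruction equals the original block, which corresponds to the lossless case; larger steps trade reconstruction quality for fewer bits.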

In the following, embodiments of the coding system 10, the encoder 20, and the decoder 30 are described with reference to FIGS. 1a to 3.

FIG. 1a is an exemplary block diagram of a coding system 10 according to an embodiment of the present application, for example, a video coding system 10 (or coding system 10 for short) that can utilize the technology of the present application. The video encoder 20 (or encoder 20 for short) and the video decoder 30 (or decoder 30 for short) in the video coding system 10 represent devices that may be used to perform techniques in accordance with the various examples described in this application.

As shown in FIG. 1a, the coding system 10 includes a source device 12 for providing encoded image data 21, such as encoded images, to a destination device 14 for decoding the encoded image data 21.

The source device 12 includes an encoder 20 and, optionally, an image source 16, a preprocessor (or preprocessing unit) 18 such as an image preprocessor, and a communication interface (or communication unit) 22.

Image source 16 may include or be any type of image capture device for capturing real-world images and the like, and/or any type of image generation device, such as a computer graphics processor for generating computer-animated images, or any type of device for acquiring and/or providing real-world images, computer-generated images (e.g., screen content or virtual reality (VR) images), and/or any combination thereof (e.g., augmented reality (AR) images). The image source may be any type of memory or storage that stores any of the above-mentioned images.

To distinguish it from the processing performed by the preprocessor (or preprocessing unit) 18, the image (or image data) 17 may also be referred to as the original image (or original image data) 17.

The preprocessor 18 is configured to receive the original image data 17 and preprocess the original image data 17 to obtain a preprocessed image (or preprocessed image data) 19. For example, the preprocessing performed by the preprocessor 18 may include trimming, color format conversion (e.g., from RGB to YCbCr), color correction, or denoising. It is understood that the preprocessing unit 18 may be an optional component.

A video encoder (or encoder) 20 is used to receive preprocessed image data 19 and to provide encoded image data 21 (described further below with respect to Figure 2 etc.).

The communication interface 22 in the source device 12 can be used to receive the encoded image data 21 and send the encoded image data 21 (or any other processed version) over the communication channel 13 to another device, such as the destination device 14 or any other device, for storage or direct reconstruction.

The destination device 14 includes a decoder 30 and may optionally also include a communication interface (or communication unit) 28, a post-processor (or post-processing unit) 32, and a display device 34.

The communication interface 28 in the destination device 14 is used to receive the encoded image data 21 (or any other processed version) directly from the source device 12 or from any other source device such as a storage device (for example, an encoded image data storage device), and to supply the encoded image data 21 to the decoder 30.

Communication interface 22 and communication interface 28 may be used to send or receive encoded image data (or encoded data) 21 through a direct communication link between the source device 12 and the destination device 14, such as a direct wired or wireless connection, or through any type of network, such as a wired network, a wireless network, or any combination thereof, or any type of private or public network, or any combination thereof.

For example, the communication interface 22 may be used to encapsulate the encoded image data 21 into a suitable format such as a message, and/or to process the encoded image data using any type of transmission encoding or processing for transmission over a communication link or communication network.

The communication interface 28 corresponds to the communication interface 22 and may be used, for example, to receive transmission data and process the transmission data using any type of corresponding transmission decoding or processing and/or decapsulation to obtain encoded image data 21 .
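As a rough illustration of the encapsulation performed by the communication interface 22 and the matching decapsulation performed by the communication interface 28, the sketch below frames encoded image data with a small header. The header layout (a 32-bit payload length followed by a 16-bit sequence number) and the function names are assumptions made for this example only; the application does not specify a message format.

```python
import struct

# Hypothetical header: payload length (uint32) and sequence number (uint16),
# in network byte order. Not a format defined by the application.
HEADER = struct.Struct("!IH")


def encapsulate(payload: bytes, seq: int) -> bytes:
    """Sender side: prepend the header to the encoded image data."""
    return HEADER.pack(len(payload), seq) + payload


def decapsulate(packet: bytes):
    """Receiver side: parse the header and recover the encoded image data."""
    length, seq = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated packet")
    return seq, payload


packet = encapsulate(b"\x00\x01\x02", seq=7)
print(decapsulate(packet))
```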

Both the communication interface 22 and the communication interface 28 can be configured as one-way communication interfaces, as indicated by the arrow of the communication channel 13 pointing from the source device 12 to the destination device 14 in FIG. 1a, or as two-way communication interfaces, and can be used to send and receive messages and the like to establish a connection, and to acknowledge and exchange any other information related to the communication link and/or data transmission, such as the transmission of encoded image data.

A video decoder (or decoder) 30 is used to receive encoded image data 21 and to provide decoded image data (or a decoded image) 31 (described further below with reference to FIG. 3 etc.).

The post-processor 32 is configured to perform post-processing on the decoded image data 31 (also referred to as reconstructed image data), such as a decoded image, to obtain post-processed image data 33, such as a post-processed image. Post-processing performed by the post-processing unit 32 may include, for example, color format conversion (e.g., from YCbCr to RGB), color correction, trimming, or resampling, or any other processing used to prepare the decoded image data 31 for display by the display device 34 or the like.

A display device 34 is used to receive the post-processed image data 33 to display the image to a user or viewer. Display device 34 may be or include any type of display for representing the reconstructed image, e.g., an integrated or external display screen or monitor. For example, the display screen may include a liquid crystal display (LCD), an organic light emitting diode (OLED) display, a plasma display, a projector, a micro LED display, a liquid crystal on silicon (LCoS) display, a digital light processor (DLP), or any other type of display.

The coding system 10 further comprises a training engine 25 for training the encoder 20 (in particular the inter prediction unit in the encoder 20) or the decoder 30 (in particular the inter prediction unit in the decoder 30) to process an input image, image region, or image block and generate a predicted value for the input image, image region, or image block.

Optionally, in this embodiment of the present application, the training data set includes information of multiple groups of image blocks, where the information of each group of image blocks includes: multiple a posteriori motion vectors of each of multiple reconstructed image blocks, multiple probability values corresponding to the multiple a posteriori motion vectors, multiple a posteriori candidate motion vectors of the current block, and multiple probability values corresponding to the multiple a posteriori candidate motion vectors. The multiple reconstructed image blocks are image blocks in the spatial neighborhood and/or temporal neighborhood of the current block. A neural network is obtained by training with the training data set. The input of the neural network is the multiple a posteriori motion vectors of each of the multiple reconstructed image blocks in the surrounding area of the current block and the multiple probability values corresponding to the multiple a posteriori motion vectors; the output is multiple a priori candidate motion vectors of the current block and multiple probability values corresponding to the multiple a priori candidate motion vectors.

Optionally, in this embodiment of the present application, the training data set includes information of multiple groups of image blocks, where the information of each group of image blocks includes: multiple a posteriori motion vectors of each of multiple reconstructed image blocks, multiple prediction error values corresponding to the multiple a posteriori motion vectors, multiple a posteriori candidate motion vectors of the current block, and multiple probability values corresponding to the multiple a posteriori candidate motion vectors. The multiple reconstructed image blocks are image blocks in the spatial neighborhood and/or temporal neighborhood of the current block. A neural network is obtained by training with the training data set. The input of the neural network is the multiple a posteriori motion vectors of each of the multiple reconstructed image blocks in the surrounding area of the current block and the multiple prediction error values corresponding to the multiple a posteriori motion vectors; the output is multiple a priori candidate motion vectors of the current block and multiple probability values corresponding to the multiple a priori candidate motion vectors.

Optionally, the training data set in this embodiment of the present application includes information of multiple groups of image blocks, where the information of each group of image blocks includes: the respective optimal motion vectors of multiple reconstructed image blocks, multiple a posteriori candidate motion vectors of the current block, and multiple probability values corresponding to the multiple a posteriori candidate motion vectors. The multiple reconstructed image blocks are image blocks in the spatial neighborhood and/or temporal neighborhood of the current block. A neural network is obtained by training with the training data set. The input of the neural network is the respective optimal motion vectors of the multiple reconstructed image blocks in the surrounding area of the current block; the output is multiple a priori candidate motion vectors of the current block and multiple probability values corresponding to the multiple a priori candidate motion vectors.

Optionally, the training data set in this embodiment of the present application includes information of multiple groups of image blocks, where the information of each group of image blocks includes: the reconstructed value of the image block, the predicted values corresponding to multiple a posteriori candidate motion vectors, multiple a posteriori motion vectors of the image block, and multiple probability values corresponding to the multiple a posteriori motion vectors. A neural network is obtained by training with the training data set. The input of the neural network is the reconstructed value of the current block and the predicted values corresponding to the multiple a posteriori candidate motion vectors; the output is multiple a posteriori motion vectors of the current block and multiple probability values corresponding to the multiple a posteriori motion vectors.
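For orientation, one group of image-block information in the optimal-motion-vector variant of the training data set could be laid out as in the sketch below. The class name, field names, and the `(dx, dy)` integer representation of a motion vector are hypothetical; the application only lists which quantities each group contains.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical representation: a motion vector as (dx, dy) in some pixel unit.
MotionVector = Tuple[int, int]


@dataclass
class TrainingSample:
    # Network input: optimal motion vector of each reconstructed neighbor block.
    neighbor_best_mvs: List[MotionVector]
    # Training targets: a posteriori candidate motion vectors of the current
    # block and the probability value attached to each of them.
    target_candidate_mvs: List[MotionVector]
    target_probabilities: List[float]

    def __post_init__(self):
        if len(self.target_candidate_mvs) != len(self.target_probabilities):
            raise ValueError("each candidate motion vector needs one probability value")
        if any(p < 0.0 or p > 1.0 for p in self.target_probabilities):
            raise ValueError("probability values must lie in [0, 1]")


sample = TrainingSample(
    neighbor_best_mvs=[(1, 0), (2, -1), (1, -1)],
    target_candidate_mvs=[(1, 0), (2, 0)],
    target_probabilities=[0.7, 0.3],
)
print(sample)
```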

The training engine 25 trains the neural network so that the multiple a priori candidate motion vectors of the current block that it outputs approximate the multiple a posteriori motion vectors of the current block, and the multiple probability values corresponding to the multiple a priori candidate motion vectors approximate the multiple probability values corresponding to the multiple a posteriori motion vectors. Each training process can use a mini-batch size of 64 images and an initial learning rate of 1e-4, with a step size of 10. The information of the multiple groups of image blocks may be data generated when the encoder performs inter-frame encoding on multiple current blocks. The neural network can be used to implement the inter-frame prediction method provided by the embodiments of the present application: the motion vectors of multiple reconstructed image blocks in the surrounding area of the current block and their related information are input into the neural network, and multiple a priori candidate motion vectors of the current block and multiple probability values corresponding to the multiple a priori candidate motion vectors are obtained. The neural network is described in detail below in conjunction with FIGS. 6a-6e.
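According to the abstract, the probability values output for the candidate motion vectors are turned into weight factors, and the motion-compensated predictions are combined by weighted summation. The sketch below illustrates that final step only. Deriving the weights by normalizing the probabilities so that they sum to one is an assumption made here; the function names are hypothetical, and the per-candidate predictions are taken as given rather than produced by real motion compensation.

```python
def weights_from_probabilities(probabilities):
    """Assumed rule: normalize the M probability values so the weights sum to 1."""
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("at least one probability value must be positive")
    return [p / total for p in probabilities]


def weighted_prediction(predictions, weights):
    """Pixel-wise weighted sum of M prediction blocks of identical size."""
    rows, cols = len(predictions[0]), len(predictions[0][0])
    return [[sum(w * pred[y][x] for pred, w in zip(predictions, weights))
             for x in range(cols)]
            for y in range(rows)]


# Two candidate predictions of a 1x2 block, e.g. from two candidate motion vectors.
preds = [[[100, 104]], [[108, 96]]]
weights = weights_from_probabilities([3.0, 1.0])
print(weights)
print(weighted_prediction(preds, weights))
```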

The training data in this embodiment of the present application may be stored in a database (not shown), and the training engine 25 trains a target model based on the training data (for example, a neural network for image inter-frame prediction). It should be noted that the embodiments of the present application do not limit the source of the training data, for example, the training data may be obtained from the cloud or other places to perform model training.

The target model trained by the training engine 25 can be applied to the coding system 10, for example, to the source device 12 (e.g., the encoder 20) or the destination device 14 (e.g., the decoder 30) shown in FIG. 1a. The training engine 25 can train the target model in the cloud, and the coding system 10 then downloads the target model from the cloud and uses it; alternatively, the training engine 25 can train the target model in the cloud and use it there, and the coding system 10 obtains the processing result directly from the cloud. For example, the training engine 25 trains a target model with an inter-frame prediction function, the coding system 10 downloads the target model from the cloud, and the inter-frame prediction unit 244 in the encoder 20 or the inter-frame prediction unit 344 in the decoder 30 can then perform inter-frame prediction on the input image or image block according to the target model to obtain the prediction of the image or image block. For another example, the training engine 25 trains a target model with an inter-frame prediction function, and the coding system 10 does not need to download the target model from the cloud: the encoder 20 or the decoder 30 transmits the image or image block to the cloud, the cloud performs inter-frame prediction on the image or image block using the target model, and the resulting prediction of the image or image block is transmitted to the encoder 20 or the decoder 30.

Although FIG. 1a shows the source device 12 and the destination device 14 as separate devices, device embodiments may also include both the source device 12 and the destination device 14 or the functions of both, i.e., include both the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality. In these embodiments, the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality may be implemented using the same hardware and/or software, or by separate hardware and/or software, or any combination thereof.

As will be apparent to the skilled person from the description, the existence and (exact) division of the different units or functions in the source device 12 and/or the destination device 14 shown in FIG. 1a may vary depending on the actual device and application.

Encoder 20 (e.g., video encoder 20) or decoder 30 (e.g., video decoder 30) or both may be implemented by processing circuitry as shown in FIG. 1b, e.g., one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, special-purpose processors for video encoding, or any combination thereof. Encoder 20 may be implemented by processing circuitry 46 to include the various modules discussed with reference to encoder 20 of FIG. 2 and/or any other encoder system or subsystem described herein. Decoder 30 may be implemented by processing circuitry 46 to include the various modules discussed with reference to decoder 30 of FIG. 3 and/or any other decoder system or subsystem described herein. The processing circuitry 46 may be used to perform the various operations discussed below. As shown in FIG. 5, if parts of the techniques are implemented in software, a device may store the instructions of the software in a suitable non-transitory computer-readable storage medium and execute the instructions in hardware using one or more processors, thereby implementing the techniques of this application. Either of the video encoder 20 and the video decoder 30 may be integrated in a single device as part of a combined codec (encoder/decoder, CODEC), as shown in FIG. 1b.

Source device 12 and destination device 14 may include any of a variety of devices, including any type of handheld or stationary device, such as a notebook or laptop computer, mobile phone, smartphone, tablet or tablet computer, camera, desktop computer, set-top box, television, display device, digital media player, video game console, video streaming device (e.g., a content service server or content distribution server), broadcast receiving device, broadcast transmitting device, and the like, and may use no operating system or any type of operating system. In some cases, source device 12 and destination device 14 may be equipped with components for wireless communication. Thus, source device 12 and destination device 14 may be wireless communication devices.

In some cases, the video coding system 10 shown in FIG. 1a is merely exemplary, and the techniques provided herein may be applicable to video coding settings (e.g., video encoding or video decoding) that do not necessarily include any data communication between the encoding device and the decoding device. In other examples, data is retrieved from local storage, sent over a network, and so on. The video encoding device may encode and store the data in memory, and/or the video decoding device may retrieve and decode the data from the memory. In some examples, encoding and decoding are performed by devices that do not communicate with each other but merely encode data to memory and/or retrieve and decode data from memory.

FIG. 1b is an exemplary block diagram of a video coding system 40 according to an embodiment of the present application. As shown in FIG. 1b, the video coding system 40 may include an imaging device 41, a video encoder 20, a video decoder 30 (and/or a video encoder/decoder implemented by processing circuitry 46), an antenna 42, one or more processors 43, one or more memories 44, and/or a display device 45.

As shown in FIG. 1b, the imaging device 41, antenna 42, processing circuitry 46, video encoder 20, video decoder 30, processor 43, memory 44, and/or display device 45 can communicate with each other. In different examples, video coding system 40 may include only video encoder 20 or only video decoder 30.

In some examples, antenna 42 may be used to transmit or receive an encoded bitstream of video data. Additionally, in some examples, display device 45 may be used to present video data. Processing circuitry 46 may include application-specific integrated circuit (ASIC) logic, graphics processors, general-purpose processors, and the like. Video coding system 40 may also include an optional processor 43, which may similarly include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, and the like. Additionally, the memory 44 may be any type of memory, such as volatile memory (e.g., static random access memory (SRAM), dynamic random access memory (DRAM), etc.) or non-volatile memory (e.g., flash memory, etc.). In a non-limiting example, the memory 44 may be implemented by cache memory. In other examples, processing circuitry 46 may include memory (e.g., cache memory, etc.) for implementing image buffers and the like.

In some examples, video encoder 20 implemented by logic circuitry may include an image buffer (e.g., implemented by processing circuitry 46 or memory 44) and a graphics processing unit (e.g., implemented by processing circuitry 46). The graphics processing unit may be communicatively coupled to the image buffer. The graphics processing unit may include video encoder 20 implemented by processing circuitry 46 to implement the various modules discussed with reference to FIG. 2 and/or any other encoder system or subsystem described herein. Logic circuitry may be used to perform the various operations discussed herein.

In some examples, video decoder 30 may be implemented by processing circuitry 46 in a similar manner to implement the various modules discussed with reference to video decoder 30 of FIG. 3 and/or any other decoder system or subsystem described herein. In some examples, video decoder 30 implemented by logic circuitry may include an image buffer (e.g., implemented by processing circuitry 46 or memory 44) and a graphics processing unit (e.g., implemented by processing circuitry 46). The graphics processing unit may be communicatively coupled to the image buffer. The graphics processing unit may include video decoder 30 implemented by processing circuitry 46 to implement the various modules discussed with reference to FIG. 3 and/or any other decoder system or subsystem described herein.

In some examples, antenna 42 may be used to receive an encoded bitstream of video data. As discussed, the encoded bitstream may include data, indicators, index values, mode selection data, and the like related to the encoded video frames discussed herein, such as data related to the coding partitions (e.g., transform coefficients or quantized transform coefficients, optional indicators (as discussed), and/or data defining the coding partitions). Video coding system 40 may also include video decoder 30 coupled to antenna 42 for decoding the encoded bitstream. Display device 45 is used to present video frames.

It should be understood that for the examples described with reference to the video encoder 20 in the embodiments of the present application, the video decoder 30 may be used to perform the opposite process. With regard to signaling syntax elements, video decoder 30 may be operable to receive and parse such syntax elements, decoding the associated video data accordingly. In some examples, video encoder 20 may entropy encode the syntax elements into an encoded video bitstream. In such instances, video decoder 30 may parse such syntax elements and decode related video data accordingly.

For ease of description, the embodiments of the present application are described with reference to the versatile video coding (VVC) reference software or to high-efficiency video coding (HEVC) developed by the joint collaboration team on video coding (JCT-VC) of the ITU-T video coding experts group (VCEG) and the ISO/IEC motion picture experts group (MPEG). Those of ordinary skill in the art understand that the embodiments of the present application are not limited to HEVC or VVC.

Encoders and Encoding Methods

FIG. 2 is an exemplary block diagram of a video encoder 20 according to an embodiment of the present application. As shown in FIG. 2, the video encoder 20 includes an input terminal (or input interface) 201, a residual calculation unit 204, a transform processing unit 206, a quantization unit 208, an inverse quantization unit 210, an inverse transform processing unit 212, a reconstruction unit 214, a loop filter 220, a decoded picture buffer (DPB) 230, a mode selection unit 260, an entropy encoding unit 270, and an output terminal (or output interface) 272. Mode selection unit 260 may include inter prediction unit 244, intra prediction unit 254, and partition unit 262. Inter prediction unit 244 may include a motion estimation unit and a motion compensation unit (not shown). The video encoder 20 shown in FIG. 2 may also be referred to as a hybrid video encoder or a hybrid video codec-based video encoder.

Referring to FIG. 2 , the inter-frame prediction unit is a trained target model (also called a neural network) for processing an input image or image region or image block to generate a predicted value for the input image block. For example, a neural network for inter prediction is used to receive an input image or image region or image patch, and generate a predicted value for the input image or image region or image patch. The neural network for inter prediction will be described in detail below in conjunction with Figures 6a-6e.

The residual calculation unit 204, the transform processing unit 206, the quantization unit 208, and the mode selection unit 260 constitute the forward signal path of the encoder 20, while the inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the buffer 216, the loop filter 220, the decoded picture buffer (DPB) 230, the inter-frame prediction unit 244, and the intra-frame prediction unit 254 constitute the backward signal path of the encoder, where the backward signal path of the encoder 20 corresponds to the signal path of the decoder (see decoder 30 in FIG. 3). The inverse quantization unit 210, inverse transform processing unit 212, reconstruction unit 214, loop filter 220, decoded picture buffer 230, inter prediction unit 244, and intra prediction unit 254 also make up the "built-in decoder" of video encoder 20.

Image and image segmentation (images and blocks)

The encoder 20 may be operable to receive images (or image data) 17, e.g., images in a sequence of images forming a video or video sequence, via an input 201 or the like. The received image or image data may also be a preprocessed image (or preprocessed image data) 19. For simplicity, the following description uses image 17. The image 17 may also be referred to as the current image or the image to be encoded (especially in video coding, to distinguish the current image from other images, e.g., previously encoded and/or decoded images of the same video sequence, i.e., the video sequence that also includes the current image).

A (digital) image is or can be viewed as a two-dimensional array or matrix of pixel points with intensity values. The pixel points in the array may also be called pixels or pels (short for picture elements). The number of pixels in the array or image in the horizontal and vertical directions (or axes) determines the size and/or resolution of the image. To represent color, three color components are usually used, that is, an image can be represented as or include three pixel arrays. In RGB format or color space, an image includes corresponding red, green, and blue pixel arrays. However, in video coding, each pixel is usually represented in a luma/chroma format or color space, such as YCbCr, which includes a luma component denoted by Y (and sometimes L) and two chroma components denoted by Cb and Cr. The luminance (luma) component Y represents the luminance or gray level intensity (e.g., both are the same in a grayscale image), while the two chrominance (chroma) components Cb and Cr represent the chrominance or color information components. Correspondingly, an image in YCbCr format includes a luminance pixel array of luminance pixel values (Y) and two chrominance pixel arrays of chrominance values (Cb and Cr). Images in RGB format can be converted or transformed to YCbCr format and vice versa; this process is also known as color transformation or conversion. If the image is black and white, the image may include only a luminance pixel array. Correspondingly, the image may be, for example, a luminance pixel array in monochrome format, or a luminance pixel array and two corresponding chrominance pixel arrays in 4:2:0, 4:2:2, or 4:4:4 color format.
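To make the color conversion and the chroma formats concrete, the sketch below converts one 8-bit RGB pixel to YCbCr and computes the size of each chrominance array for a given format. The full-range BT.601 (JFIF) coefficients are one common choice and are an assumption here, since the application does not name a conversion matrix; the function names are likewise hypothetical.

```python
def _clip8(value):
    """Round to the nearest integer and clip to the 8-bit range [0, 255]."""
    return max(0, min(255, round(value)))


def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JFIF) conversion of one 8-bit RGB pixel (assumed matrix)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return _clip8(y), _clip8(cb), _clip8(cr)


def chroma_plane_size(width, height, chroma_format):
    """Size (width, height) of each chrominance array for a luminance array size."""
    if chroma_format == "4:4:4":      # no chroma subsampling
        return width, height
    if chroma_format == "4:2:2":      # chroma halved horizontally
        return (width + 1) // 2, height
    if chroma_format == "4:2:0":      # chroma halved horizontally and vertically
        return (width + 1) // 2, (height + 1) // 2
    if chroma_format == "monochrome":  # luminance only
        return 0, 0
    raise ValueError(f"unknown chroma format: {chroma_format}")


print(rgb_to_ycbcr(255, 255, 255))             # white has neutral chroma
print(chroma_plane_size(1920, 1080, "4:2:0"))  # each chroma array is a quarter of luma
```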

In one embodiment, the video encoder 20 may include an image segmentation unit (not shown in FIG. 2) for segmenting the image 17 into a plurality of (generally non-overlapping) image blocks 203. These blocks may also be referred to as root blocks, macroblocks (H.264/AVC), or coding tree blocks (CTBs) or coding tree units (CTUs) in the H.265/HEVC and VVC standards. The segmentation unit can be used to use the same block size for all images in a video sequence and a corresponding grid that defines the block size, or to vary the block size between images or subsets or groups of images, and to segment each image into the corresponding blocks.
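The segmentation into a grid of non-overlapping blocks can be sketched as follows. The function name is hypothetical, and cropping the blocks at the right and bottom image borders is an assumption made for this example; real codecs handle partial blocks at picture boundaries with their own rules.

```python
def partition_into_blocks(width, height, block_size):
    """Yield (x, y, w, h) for non-overlapping blocks in raster-scan order.

    Blocks at the right and bottom borders are cropped to the image size
    (an assumption of this sketch).
    """
    for y in range(0, height, block_size):
        for x in range(0, width, block_size):
            yield x, y, min(block_size, width - x), min(block_size, height - y)


blocks = list(partition_into_blocks(1920, 1080, 128))
print(len(blocks))  # number of blocks covering a 1920x1080 image
print(blocks[0])    # top-left block
print(blocks[-1])   # bottom-right block, cropped vertically
```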

In other embodiments, the video encoder may be used to directly receive blocks 203 of the image 17 , eg, one, several or all of the blocks that make up the image 17 . The image block 203 may also be referred to as a current image block or an image block to be encoded.

Like image 17, image block 203 is also or can be considered as a two-dimensional array or matrix of pixels with intensity values (pixel values), but image block 203 is smaller than image 17. In other words, block 203 may include one pixel array (e.g., a luminance array in the case of a monochrome image 17, or a luminance array or a chrominance array in the case of a color image), or three pixel arrays (e.g., one luminance array and two chrominance arrays in the case of a color image 17), or any other number and/or type of arrays depending on the color format employed. The number of pixels in the horizontal and vertical directions (or axes) of the block 203 defines the size of the block 203. Correspondingly, the block may be an array of M×N (M columns × N rows) pixel points, or an array of M×N transform coefficients, or the like.

In one embodiment, the video encoder 20 shown in FIG. 2 is used to encode the image 17 block by block, eg, performing encoding and prediction on each block 203 .

In one embodiment, the video encoder 20 shown in FIG. 2 may also be used to segment and/or encode an image using slices (also referred to as video slices), where an image may be segmented or encoded using one or more (typically non-overlapping) slices. Each slice may include one or more blocks (e.g., coding tree units, CTUs) or one or more groups of blocks (e.g., tiles in the H.265/HEVC and VVC standards, and bricks in the VVC standard).

In one embodiment, the video encoder 20 shown in FIG. 2 may also be used to segment and/or encode an image using slices/coding block groups (also referred to as video coding block groups) and/or coding blocks (also referred to as video coding blocks), wherein the image may be segmented or encoded using one or more (usually non-overlapping) slices/coding block groups. Each slice/coding block group may include one or more blocks (e.g., CTUs) or one or more coding blocks, wherein each coding block may be rectangular or the like and may include one or more full or partial blocks (e.g., CTUs).

residual calculation

The residual calculation unit 204 is configured to calculate the residual block 205 from the image block (or original block) 203 and the prediction block 265 (the prediction block 265 is described in detail later), for example by subtracting the pixel values of the prediction block 265 from the pixel values of the image block 203 pixel by pixel, to obtain the residual block 205 in the pixel domain.
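By way of illustration only, the pixel-by-pixel subtraction described above can be sketched in Python as follows. Blocks are represented as lists of rows, and the function name `residual_block` is hypothetical.

```python
def residual_block(original, prediction):
    """Subtract the prediction block from the original block, pixel by pixel."""
    return [
        [o - p for o, p in zip(orig_row, pred_row)]
        for orig_row, pred_row in zip(original, prediction)
    ]
```

The residual may contain negative values, so it generally needs a signed representation even when the pixel values themselves are unsigned.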

transform

The transform processing unit 206 is configured to perform discrete cosine transform (discrete cosine transform, DCT) or discrete sine transform (discrete sine transform, DST) etc. on the pixel point values of the residual block 205 to obtain transform coefficients 207 in the transform domain. Transform coefficients 207, which may also be referred to as transform residual coefficients, represent the residual block 205 in the transform domain.

Transform processing unit 206 may be used to apply integer approximations of the DCT/DST, such as the transforms specified for H.265/HEVC. Compared to the orthogonal DCT transform, such an integer approximation is usually scaled by some factor. In order to preserve the norm of a residual block that is processed by the forward and inverse transforms, additional scaling factors are applied as part of the transform process. The scaling factors are usually chosen according to certain constraints, such as the scaling factor being a power of 2 so that it can be applied by a shift operation, the bit depth of the transform coefficients, and the trade-off between accuracy and implementation cost. For example, a specific scaling factor is specified for the inverse transform performed by the inverse transform processing unit 212 on the encoder 20 side (and for the corresponding inverse transform performed by, for example, the inverse transform processing unit 312 on the decoder 30 side), and accordingly a corresponding scaling factor can be specified for the forward transform performed by the transform processing unit 206 on the encoder 20 side.
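By way of illustration only, the following Python sketch applies a floating-point orthonormal DCT-II to a square residual block and inverts it. It is not the scaled integer approximation used by H.265/HEVC; because the basis is orthonormal, the norm is preserved without the additional scaling factors discussed above. All function names are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis matrix: row k is the k-th basis function."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def forward_dct(block):
    """2-D transform of a square block: C * X * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def inverse_dct(coeffs):
    """Inverse 2-D transform of a square block: C^T * Y * C."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)
```

For a flat 4×4 residual block with every value equal to 10, all the energy ends up in the single DC coefficient (value 40), which is why the transform helps compression of smooth residuals.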

In one embodiment, the video encoder 20 (correspondingly, the transform processing unit 206) may be configured to output transform parameters, such as the type of one or more transforms, either directly or after they are encoded or compressed by the entropy encoding unit 270, so that, for example, the video decoder 30 can receive the transform parameters and use them for decoding.

quantization

The quantization unit 208 is configured to quantize the transform coefficients 207 by, for example, scalar quantization or vector quantization, to obtain quantized transform coefficients 209 . The quantized transform coefficients 209 may also be referred to as quantized residual coefficients 209 .

The quantization process may reduce the bit depth associated with some or all of the transform coefficients 207. For example, n-bit transform coefficients may be rounded down to m-bit transform coefficients during quantization, where n is greater than m. The degree of quantization can be modified by adjusting the quantization parameter (QP). For example, with scalar quantization, different degrees of scaling can be applied to achieve finer or coarser quantization. Smaller quantization step sizes correspond to finer quantization, while larger quantization step sizes correspond to coarser quantization. A suitable quantization step size can be indicated by the quantization parameter (QP). For example, the quantization parameter may be an index into a predefined set of suitable quantization step sizes. For example, a smaller quantization parameter may correspond to fine quantization (a smaller quantization step size) and a larger quantization parameter may correspond to coarse quantization (a larger quantization step size), or vice versa. Quantization may include division by the quantization step size, and the corresponding dequantization, performed by the inverse quantization unit 210 or the like, may include multiplication by the quantization step size. Embodiments according to some standards, such as HEVC, may use the quantization parameter to determine the quantization step size. In general, the quantization step size can be calculated from the quantization parameter using a fixed-point approximation of an equation that involves division. Additional scaling factors may be introduced for quantization and dequantization to restore the norm of the residual block, which may have been modified by the scaling used in the fixed-point approximation of the equation relating the quantization step size and the quantization parameter. In one exemplary implementation, the scaling of the inverse transform and of the dequantization may be combined. Alternatively, custom quantization tables can be used and signaled from the encoder to the decoder, e.g., in the bitstream. Quantization is a lossy operation, where the larger the quantization step size, the larger the loss.
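By way of illustration only, scalar quantization and dequantization can be sketched in Python as follows. The relation between QP and step size used here (the step size doubles for every increase of 6 in QP, with a step of 1 at QP 4) is the commonly cited floating-point form of the HEVC-style mapping and is an assumption of this sketch; actual codecs use fixed-point tables and additional scaling factors. The function names are hypothetical.

```python
def quantization_step(qp):
    """Assumed HEVC-style mapping: step = 2 ** ((qp - 4) / 6)."""
    return 2.0 ** ((qp - 4) / 6.0)


def quantize(coeffs, qp):
    """Divide each transform coefficient by the step size and round."""
    step = quantization_step(qp)
    return [[int(round(c / step)) for c in row] for row in coeffs]


def dequantize(levels, qp):
    """Multiply each quantized level by the same step size."""
    step = quantization_step(qp)
    return [[level * step for level in row] for row in levels]
```

With QP 10 the step size is 2, so the coefficients 10.0, -7.2 and 3.3 are quantized to the levels 5, -4 and 2 and dequantized to 10.0, -8.0 and 4.0; the differences from the original values are the quantization loss.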

In one embodiment, the video encoder 20 (correspondingly, the quantization unit 208) may be used to output a quantization parameter (QP), either directly or after it is encoded or compressed by the entropy encoding unit 270, so that, for example, the video decoder 30 can receive the quantization parameter and use it for decoding.

inverse quantization

The inverse quantization unit 210 is used to apply the inverse of the quantization performed by the quantization unit 208 to the quantized coefficients to obtain the dequantized coefficients 211, for example by applying the inverse of the quantization scheme used by the quantization unit 208, based on or using the same quantization step size as the quantization unit 208. The dequantized coefficients 211 may also be referred to as dequantized residual coefficients 211 and correspond to the transform coefficients 207, although, because of the loss caused by quantization, they are usually not identical to the transform coefficients.

Inverse transform

The inverse transform processing unit 212 is used to perform the inverse of the transform performed by the transform processing unit 206, for example an inverse discrete cosine transform (DCT) or an inverse discrete sine transform (DST), to obtain a reconstructed residual block 213 (or corresponding dequantized coefficients 213). The reconstructed residual block 213 may also be referred to as a transform block 213.

reconstruction

The reconstruction unit 214 (e.g., summer 214) is used to add the transform block 213 (i.e., the reconstructed residual block 213) to the prediction block 265 to obtain the reconstructed block 215 in the pixel domain, e.g., by adding the pixel values of the reconstructed residual block 213 to the pixel values of the prediction block 265.
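By way of illustration only, the summation performed by the reconstruction unit can be sketched in Python as follows. Clipping the sum to the valid range for the bit depth is an assumption of this sketch (it is common practice, but is not stated in the paragraph above), and the function name `reconstruct_block` is hypothetical.

```python
def reconstruct_block(residual, prediction, bit_depth=8):
    """Add residual and prediction pixel by pixel, then clip to [0, 2**bit_depth - 1]."""
    max_value = (1 << bit_depth) - 1
    return [
        [max(0, min(max_value, p + r)) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]
```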

filter

The loop filter unit 220 (or "loop filter" 220 for short) is used to filter the reconstruction block 215 to obtain the filter block 221, or generally to filter the reconstructed pixels to obtain filtered pixel values. For example, the loop filter unit is used to smooth pixel transitions or otherwise improve video quality. The loop filter unit 220 may include one or more loop filters, such as a deblocking filter, a sample-adaptive offset (SAO) filter, or one or more other filters, such as an adaptive loop filter (ALF), a noise suppression filter (NSF), or any combination thereof. For example, the loop filter unit 220 may include a deblocking filter, an SAO filter and an ALF filter. The order of the filtering process can be deblocking filter, SAO filter and ALF filter. As another example, a process called luma mapping with chroma scaling (LMCS) (i.e., an adaptive in-loop reshaper) is added. This process is performed before deblocking. As another example, the deblocking filtering process can also be applied to internal sub-block edges, such as affine sub-block edges, ATMVP sub-block edges, sub-block transform (SBT) edges and intra sub-partition (ISP) edges. Although the loop filter unit 220 is shown in FIG. 2 as an in-loop filter, in other configurations the loop filter unit 220 may be implemented as a post-loop filter. The filter block 221 may also be referred to as a filtered reconstruction block 221.

In one embodiment, video encoder 20 (correspondingly, loop filter unit 220) may be used to output loop filter parameters (e.g., SAO filter parameters, ALF filter parameters or LMCS parameters), either directly or after entropy encoding by the entropy encoding unit 270, so that, for example, the decoder 30 can receive the loop filter parameters and use the same or different loop filter parameters for decoding.

decoded image buffer

A decoded picture buffer (DPB) 230 may be a reference picture memory that stores reference picture data for use by the video encoder 20 in encoding the video data. The DPB 230 may be formed by any of a variety of memory devices, such as dynamic random access memory (DRAM), including synchronous DRAM (SDRAM), magnetoresistive RAM (MRAM), resistive RAM (RRAM), or other types of storage devices. The decoded image buffer 230 may be used to store one or more filter blocks 221. The decoded image buffer 230 may also be used to store other previously filtered blocks, e.g., previously reconstructed and filtered blocks 221, of the same current image or of a different image such as a previously reconstructed image, and may provide a complete previously reconstructed or decoded image (and corresponding reference blocks and pixels) and/or a partially reconstructed current image (and corresponding reference blocks and pixels), e.g., for inter prediction. The decoded image buffer 230 may also be used to store one or more unfiltered reconstructed blocks 215, or generally unfiltered reconstructed pixels, e.g., reconstructed blocks 215 that have not been filtered by the loop filter unit 220, or any other further processed version of the reconstructed blocks or reconstructed pixels.

Mode selection (segmentation and prediction)

Mode selection unit 260 includes partition unit 262, inter prediction unit 244 and intra prediction unit 254, and is used to receive or obtain original image data, such as the original block 203 (the current block 203 of the current image 17), and reconstructed image data, e.g., filtered and/or unfiltered reconstructed pixels or reconstructed blocks of the same (current) image and/or of one or more previously decoded images, from the decoded image buffer 230 or other buffers (e.g., line buffers, not shown). The reconstructed image data is used as the reference image data required for prediction, such as inter prediction or intra prediction, to obtain the prediction block 265 or the prediction value 265.

Mode selection unit 260 may be used to determine or select a partitioning for the current block (including no partitioning) and a prediction mode (e.g., an intra or inter prediction mode), and to generate a corresponding prediction block 265, which is used to calculate the residual block 205 and to reconstruct the reconstructed block 215.

In one embodiment, mode selection unit 260 may be used to select the partitioning and prediction mode (e.g., from among those supported by or available to mode selection unit 260) that provides the best match or the smallest residual (a smaller residual means better compression for transmission or storage), or that provides the smallest signaling overhead (a smaller signaling overhead means better compression for transmission or storage), or that considers or balances both. The mode selection unit 260 may be configured to determine the partitioning and prediction mode according to rate distortion optimization (RDO), i.e., to select the prediction mode that provides the minimum rate distortion cost. The terms "best", "lowest", "optimal", etc. herein do not necessarily refer to the overall "best", "lowest", "optimal", etc., but may also refer to the fulfillment of a termination or selection criterion, for example a value exceeding or falling below a threshold, or other constraints, which may lead to a "sub-optimal selection" but reduces complexity and processing time.
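By way of illustration only, a rate distortion decision of the kind described above can be sketched in Python as follows. The Lagrangian cost J = D + λ·R is the formulation commonly used for RDO and is an assumption of this sketch, as are the function name `select_mode_rdo` and the candidate tuples; the paragraph above does not specify a cost function.

```python
def select_mode_rdo(candidates, lagrange_multiplier):
    """Pick the candidate with the smallest cost J = D + lambda * R.

    candidates: iterable of (mode_name, distortion, rate_in_bits) tuples.
    Returns (best_mode_name, best_cost).
    """
    best_mode, best_cost = None, float("inf")
    for mode_name, distortion, rate_bits in candidates:
        cost = distortion + lagrange_multiplier * rate_bits
        if cost < best_cost:
            best_mode, best_cost = mode_name, cost
    return best_mode, best_cost
```

A larger multiplier favors modes with low signaling overhead, while a multiplier of zero selects the mode with the smallest distortion regardless of rate, which corresponds to balancing the two criteria named above.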

In other words, partition unit 262 may be used to partition the images of a video sequence into a sequence of coding tree units (CTUs), and a CTU 203 may be further partitioned into smaller block parts or sub-blocks (which again form blocks), e.g., by iteratively using quad-tree (QT) partitioning, binary-tree (BT) partitioning or triple-tree (TT) partitioning, or any combination thereof, and, for example, to perform prediction for each of the block parts or sub-blocks, wherein the mode selection includes selecting the tree structure of the partitioned block 203 and selecting the prediction mode applied to each of the block parts or sub-blocks.

The segmentation (eg, performed by segmentation unit 262 ) and prediction processing (eg, performed by inter-prediction unit 244 and intra-prediction unit 254 ) performed by video encoder 20 will be described in detail below.

segmentation

The partitioning unit 262 may partition (or divide) an image block (or CTU) 203 into smaller parts, such as smaller square or rectangular blocks. For an image with three pixel arrays, a CTU consists of an N×N block of luminance pixels and two corresponding blocks of chrominance pixels. The maximum allowed size of a luma block in a CTU is specified as 128×128 in the versatile video coding (VVC) standard under development, but may be specified in the future as a value other than 128×128, such as 256×256. The CTUs of an image can be aggregated/grouped into slices/coding block groups, coding blocks or bricks. A coding block (tile) covers a rectangular area of an image, and a coding block can be divided into one or more bricks. A brick consists of multiple CTU rows within a coding block. A coding block that is not divided into multiple bricks can be called a brick. However, a brick that is a true subset of a coding block is not called a coding block. VVC supports two coding block group modes, namely the raster scan slice/coding block group mode and the rectangular slice mode. In the raster scan coding block group mode, a slice/coding block group contains a sequence of coding blocks in the coding block raster scan of an image. In the rectangular slice mode, a slice contains multiple bricks of an image that together make up a rectangular area of the image. The bricks within a rectangular slice are arranged in the order of the brick raster scan of the image. These smaller blocks (also referred to as sub-blocks) may be further divided into even smaller parts. This is also known as tree splitting or hierarchical tree splitting, where a root block, e.g., at root tree level 0 (hierarchy level 0, depth 0), can be recursively split into two or more blocks of the next lower tree level, e.g., nodes at tree level 1 (hierarchy level 1, depth 1). These blocks can in turn be split into two or more blocks of the next lower level, e.g., tree level 2 (hierarchy level 2, depth 2), and so on, until the splitting ends (because an ending criterion is met, such as reaching a maximum tree depth or a minimum block size). Blocks that are not further divided are also called leaf blocks or leaf nodes of the tree. A tree that splits into two parts is called a binary tree (BT), a tree that splits into three parts is called a ternary tree (TT), and a tree that splits into four parts is called a quad-tree (QT).
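By way of illustration only, the recursive (hierarchical) tree splitting described above can be sketched in Python for the quad-tree case as follows. The callback `should_split` stands in for whatever decision the encoder makes (for example, a rate distortion decision) and, like the function name `quadtree_split`, is hypothetical.

```python
def quadtree_split(x, y, size, min_size, should_split):
    """Recursively split a square block and return the leaf blocks as (x, y, size).

    Splitting ends when the minimum block size is reached or when the
    caller-supplied should_split(x, y, size) callback declines to split.
    """
    if size <= min_size or not should_split(x, y, size):
        return [(x, y, size)]          # leaf block (leaf node of the tree)
    half = size // 2
    leaves = []
    for dy in (0, half):               # top row of sub-blocks first, then bottom row
        for dx in (0, half):
            leaves.extend(quadtree_split(x + dx, y + dy, half, min_size, should_split))
    return leaves
```

For example, starting from a 64×64 root block with a minimum size of 16 and a callback that splits only blocks located at the origin, the result is four 16×16 leaves in the top-left quadrant and three 32×32 leaves elsewhere; together they cover the root block without overlap.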

For example, a coding tree unit (CTU) may be or include a CTB of luma pixels and two corresponding CTBs of chroma pixels of an image that has three pixel arrays, or a CTB of pixels of a monochrome image or of an image that is encoded using three independent color planes and syntax structures (used to encode the pixels). Correspondingly, a coding tree block (CTB) can be an N×N block of pixels, where N can be set to a certain value such that a component is divided into CTBs; this division is a partitioning. A coding unit (CU) may be or include a coding block of luminance pixels and two corresponding coding blocks of chrominance pixels of an image that has three pixel arrays, or a coding block of pixels of a monochrome image or of an image that is encoded using three independent color planes and syntax structures (used to encode the pixels). Correspondingly, a coding block (CB) can be an M×N block of pixels, where M and N can be set to certain values such that a CTB is divided into coding blocks; this division is a partitioning.

For example, in an embodiment, a coding tree unit (CTU) may be divided into multiple CUs according to HEVC by using a quad-tree structure represented as a coding tree. The decision whether to use inter (temporal) prediction or intra (spatial) prediction to encode image regions is made at the leaf-CU level. Each leaf-CU may be further divided into one, two, or four PUs according to the PU partition type. The same prediction process is used within a PU, and relevant information is transmitted to the decoder on a PU basis. After applying the prediction process to obtain residual blocks according to the PU partition type, the leaf CU may be partitioned into transform units (TUs) according to other quad-tree structures similar to the coding tree used for the CU.

For example, in an embodiment, according to the latest video coding standard currently under development, called Versatile Video Coding (VVC), a combined quad-tree with a nested multi-type tree (e.g., binary and ternary trees) is used as the segmentation structure for partitioning a coding tree unit. In the coding tree structure within a coding tree unit, a CU can be square or rectangular. For example, the coding tree unit (CTU) is first divided by a quad-tree structure. The quad-tree leaf nodes are then further divided by a multi-type tree structure. There are four division types in the multi-type tree structure: vertical binary tree division (SPLIT_BT_VER), horizontal binary tree division (SPLIT_BT_HOR), vertical ternary tree division (SPLIT_TT_VER) and horizontal ternary tree division (SPLIT_TT_HOR). The leaf nodes of the multi-type tree are called coding units (CUs); unless a CU is too large for the maximum transform length, this segmentation is used for prediction and transform processing without any further partitioning. In most cases, this means that the CU, PU and TU have the same block size in the coding block structure of the quad-tree with a nested multi-type tree. The exception occurs when the maximum supported transform length is less than the width or height of a color component of the CU. VVC has formulated a unique signaling mechanism for the partitioning information in the coding structure of the quad-tree with a nested multi-type tree. In this signaling mechanism, the coding tree unit (CTU), as the root of the quad-tree, is first divided by the quad-tree structure. Each quad-tree leaf node (when it is large enough) can then be further divided by a multi-type tree structure. In the multi-type tree structure, a first flag (mtt_split_cu_flag) indicates whether the node is further divided; when the node is further divided, a second flag (mtt_split_cu_vertical_flag) first indicates the division direction, and a third flag (mtt_split_cu_binary_flag) then indicates whether the division is a binary tree division or a ternary tree division. According to the values of mtt_split_cu_vertical_flag and mtt_split_cu_binary_flag, the decoder can derive the multi-type tree division mode (MttSplitMode) of the CU based on a predefined rule or table. It should be noted that for certain designs, such as the 64×64 luma block and 32×32 chroma pipeline design in a VVC hardware decoder, TT division is not allowed when the width or height of a luma coding block is greater than 64. TT division is also not allowed when the width or height of a chroma coding block is greater than 32. The pipeline design divides the image into multiple virtual pipeline data units (VPDUs), each of which is defined as a non-overlapping unit in the image. In hardware decoders, consecutive VPDUs are processed simultaneously in multiple pipeline stages. In most pipeline stages, the VPDU size is roughly proportional to the buffer size, so it is necessary to keep the VPDU small. In most hardware decoders, the VPDU size can be set to the maximum transform block (TB) size. However, in VVC, ternary tree (TT) and binary tree (BT) partitioning may increase the VPDU size.
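By way of illustration only, the derivation of the multi-type tree division mode from the two flags, and the resulting sub-block sizes, can be sketched in Python as follows. The flag-to-mode table reflects the mapping generally given for VVC, and the 1:2:1 ratio of a ternary division is likewise taken from VVC; both are assumptions relative to the paragraph above, which only refers to "a predefined rule or table". The function names are hypothetical.

```python
# Assumed mapping: (mtt_split_cu_vertical_flag, mtt_split_cu_binary_flag) -> MttSplitMode
_MTT_TABLE = {
    (0, 0): "SPLIT_TT_HOR",
    (0, 1): "SPLIT_BT_HOR",
    (1, 0): "SPLIT_TT_VER",
    (1, 1): "SPLIT_BT_VER",
}


def mtt_split_mode(mtt_split_cu_vertical_flag, mtt_split_cu_binary_flag):
    """Derive the division mode of a node whose mtt_split_cu_flag is 1."""
    return _MTT_TABLE[(mtt_split_cu_vertical_flag, mtt_split_cu_binary_flag)]


def split_block(width, height, mode):
    """Return the (width, height) of each sub-block produced by the division."""
    if mode == "SPLIT_BT_VER":   # two halves side by side
        return [(width // 2, height)] * 2
    if mode == "SPLIT_BT_HOR":   # two halves stacked
        return [(width, height // 2)] * 2
    if mode == "SPLIT_TT_VER":   # quarter, half, quarter (1:2:1 assumed)
        return [(width // 4, height), (width // 2, height), (width // 4, height)]
    if mode == "SPLIT_TT_HOR":
        return [(width, height // 4), (width, height // 2), (width, height // 4)]
    raise ValueError("unknown split mode: " + mode)


def tt_allowed(width, height, max_tt_size=64):
    """TT division is disallowed when the width or height exceeds max_tt_size."""
    return width <= max_tt_size and height <= max_tt_size
```

The default `max_tt_size` of 64 corresponds to the luma restriction stated above; a value of 32 would be passed for chroma coding blocks.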

In addition, it should be noted that when a part of the tree node block exceeds the bottom or the right border of the image, the tree node block is forced to be divided until all the pixels of each coded CU are located within the image border.

For example, the intra sub-partitions (ISP) tool may divide the luma intra prediction block vertically or horizontally into two or four sub-parts depending on the block size.

In one example, mode selection unit 260 of video encoder 20 may be used to perform any combination of the partitioning techniques described above.

As described above, video encoder 20 is used to determine or select the best or optimal prediction mode from a set of (predetermined) prediction modes. The set of prediction modes may include, for example, intra prediction modes and/or inter prediction modes.

Intra prediction

The set of intra prediction modes may include 35 different intra prediction modes, for example, non-directional modes like DC (or mean) mode and planar mode, or directional modes as defined by HEVC, or may include 67 different Intra prediction modes, for example, non-directional modes like DC (or mean) mode and planar mode, or directional modes as defined in VVC. For example, several conventional-angle intra-prediction modes are adaptively replaced with wide-angle intra-prediction modes for non-square blocks defined in VVC. As another example, in order to avoid the division operation of DC prediction, only the longer side is used to calculate the average value of the non-square block. In addition, the intra prediction result of the planar mode may also be modified using a position-dependent intra prediction combination (PDPC) method.

The intra-frame prediction unit 254 is configured to generate an intra-frame prediction block 265 using reconstructed pixels of adjacent blocks of the same current image according to the intra-frame prediction mode in the intra-frame prediction mode set.

Intra-prediction unit 254 (or generally mode selection unit 260 ) is also used to output intra-prediction parameters (or generally information indicating the selected intra-prediction mode of the block) to entropy encoding unit 270 in the form of syntax element 266 , to be included in the encoded image data 21 so that the video decoder 30 may perform operations such as receiving and using prediction parameters for decoding.

The intra prediction modes in HEVC include the DC prediction mode, the plane prediction mode and 33 angle prediction modes, for a total of 35 candidate prediction modes. FIG. 3 is a schematic diagram of the HEVC intra prediction directions. As shown in FIG. 3, the current block can use the pixels of the reconstructed image blocks to the left and above as references for intra prediction. An image block in the surrounding area of the current block that is used for intra prediction of the current block is called a reference block, and the pixels in the reference block are called reference pixels. Among the 35 candidate prediction modes, the DC prediction mode is suitable for areas of the current block with flat texture, where all pixels in the area use the average value of the reference pixels in the reference block as their prediction; the plane prediction mode is suitable for image blocks with smoothly changing texture, where a current block that meets this condition uses bilinear interpolation of the reference pixels in the reference block as the prediction of all pixels in the current block; the angle prediction mode exploits the correlation between the texture of the current block and the texture of the adjacent reconstructed image blocks, and copies the values of the reference pixels in the corresponding reference block along a certain angle as the prediction of all pixels in the current block.
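By way of illustration only, two of the prediction modes described above can be sketched in Python as follows: DC prediction, and the purely vertical special case of angle prediction. The sketch omits the plane mode, the other angles and any reference smoothing; the function names and the rounding of the average are assumptions.

```python
def dc_prediction(left, above, size):
    """Predict every pixel as the rounded average of the left and above reference pixels."""
    refs = list(left[:size]) + list(above[:size])
    dc = (sum(refs) + len(refs) // 2) // len(refs)   # integer average with rounding
    return [[dc] * size for _ in range(size)]


def vertical_prediction(above, size):
    """Copy the reference pixels above the block straight down each column."""
    return [list(above[:size]) for _ in range(size)]
```

With left reference pixels all equal to 10 and above reference pixels all equal to 20, DC prediction fills a 4×4 block with the value 15, whereas vertical prediction repeats the row of above reference pixels in every row of the block.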

The HEVC encoder selects an optimal intra prediction mode for the current block from the 35 candidate prediction modes shown in FIG. 3, and writes the optimal intra prediction mode into the video code stream. To improve the coding efficiency of intra prediction, the encoder/decoder derives 3 most probable modes from the respective optimal intra prediction modes of the reconstructed image blocks in the surrounding area that use intra prediction. If the optimal intra prediction mode selected for the current block is one of the 3 most probable modes, a first index is encoded, indicating that the selected optimal intra prediction mode is one of the 3 most probable modes; if the selected optimal intra prediction mode is not one of the 3 most probable modes, a second index is encoded, indicating that the selected optimal intra prediction mode is one of the other 32 modes (the modes among the 35 candidate prediction modes other than the aforementioned 3 most probable modes). The HEVC standard uses a 5-bit fixed-length code as the aforementioned second index.

The method by which the HEVC encoder derives the 3 most probable modes includes: adding the optimal intra prediction modes of the left adjacent image block and the upper adjacent image block of the current block to a set; if the two optimal intra prediction modes are the same, only one of them is kept in the set. If the two optimal intra prediction modes are the same and both are angle prediction modes, the two angle prediction modes adjacent to that angular direction are then added to the set; otherwise, the plane prediction mode, the DC mode and the vertical prediction mode are added to the set in turn until the number of modes in the set reaches 3.
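By way of illustration only, the derivation described above can be sketched in Python as follows. The mode numbering (0 for the plane mode, 1 for DC, 2 to 34 for the angle modes, 26 for vertical) and the wrap-around formula for the two adjacent angles follow the HEVC convention and are assumptions relative to the paragraph above; the function name `derive_mpm` is hypothetical. The order of the list when both neighbors are the same non-angle mode follows the description above and may differ from the order in the standard.

```python
PLANAR, DC, VERTICAL = 0, 1, 26   # HEVC mode numbers (assumed)


def derive_mpm(left_mode, above_mode):
    """Return the 3 most probable modes from the left and above neighbor modes."""
    if left_mode == above_mode and left_mode >= 2:
        # Same angle mode: keep it and add its two adjacent angles,
        # wrapping around within the angle mode range 2..34.
        m = left_mode
        return [m, 2 + ((m + 29) % 32), 2 + ((m - 2 + 1) % 32)]
    mpm = [left_mode]
    if above_mode != left_mode:
        mpm.append(above_mode)
    for fallback in (PLANAR, DC, VERTICAL):   # fill in turn until 3 modes are present
        if len(mpm) == 3:
            break
        if fallback not in mpm:
            mpm.append(fallback)
    return mpm
```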

After the HEVC decoder performs entropy decoding on the code stream, it obtains the mode information of the current block. The mode information includes an indicator of whether the optimal intra prediction mode of the current block is among the 3 most probable modes, and either the index of the optimal intra prediction mode of the current block within the 3 most probable modes or its index within the other 32 modes.
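By way of illustration only, the mode information described above can be sketched in Python as follows: an indicator, followed either by an index into the 3 most probable modes or by a 5-bit fixed-length index into the other 32 modes. The dictionary keys, the function names and the choice of ordering the other 32 modes by ascending mode number are assumptions of this sketch, not the bitstream syntax of the standard.

```python
NUM_MODES = 35   # total number of candidate prediction modes in HEVC


def encode_intra_mode(mode, mpm):
    """Build the mode information for `mode`, given 3 distinct most probable modes."""
    if mode in mpm:
        return {"mpm_flag": 1, "mpm_idx": mpm.index(mode)}
    remaining = [m for m in range(NUM_MODES) if m not in mpm]   # the other 32 modes
    return {"mpm_flag": 0, "rem_mode": format(remaining.index(mode), "05b")}


def decode_intra_mode(info, mpm):
    """Recover the prediction mode from the mode information."""
    if info["mpm_flag"] == 1:
        return mpm[info["mpm_idx"]]
    remaining = [m for m in range(NUM_MODES) if m not in mpm]
    return remaining[int(info["rem_mode"], 2)]
```

Because exactly 32 modes remain after the 3 most probable modes are removed, every remaining mode fits in a 5-bit code, which is consistent with the fixed-length second index mentioned earlier.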

Inter prediction

In a possible implementation, the set of inter prediction modes depends on the available reference pictures (i.e., for example, at least some of the previously decoded pictures stored in the DPB 230) and on other inter prediction parameters, for example on whether the entire reference picture or only a portion of it (e.g., a search window area around the area of the current block) is used to search for the best matching reference block, and/or on whether pixel interpolation such as half-pixel, quarter-pixel and/or 1/16-pixel interpolation is performed.

In addition to the above prediction modes, skip mode and/or direct mode may also be employed.

For example, in extended merge prediction, the merge candidate list of this mode consists of the following five candidate types in order: spatial MVPs from spatially adjacent CUs, temporal MVPs from collocated CUs, history-based MVPs from a FIFO table, pairwise average MVPs, and zero MVs. Decoder-side motion vector refinement (DMVR) based on bilateral matching can be used to increase the accuracy of the MVs of the merge mode. The merge mode with MVD (MMVD) is a merge mode with motion vector differences. An MMVD flag is sent immediately after the skip flag and the merge flag to specify whether the CU uses the MMVD mode. A CU-level adaptive motion vector resolution (AMVR) scheme may be used. AMVR allows the MVD of a CU to be encoded with different precisions, and the MVD precision of the current CU is selected adaptively according to the prediction mode of the current CU. When a CU is encoded in merge mode, a combined inter/intra prediction (CIIP) mode may be applied to the current CU. A weighted average of the inter and intra prediction signals is computed to obtain the CIIP prediction. For affine motion compensated prediction, the affine motion field of a block is described by the motion vectors of 2 control points (4 parameters) or 3 control points (6 parameters). Subblock-based temporal motion vector prediction (SbTMVP) is similar to temporal motion vector prediction (TMVP) in HEVC, but predicts the motion vectors of the sub-CUs in the current CU. Bi-directional optical flow (BDOF), formerly known as BIO, is a simplified version that reduces computation, especially in terms of the number of multiplications and the size of the multipliers. In the triangular division mode, a CU is evenly divided into two triangular parts using one of two divisions: diagonal division or anti-diagonal division. In addition, the bidirectional prediction mode is extended beyond simple averaging to support a weighted average of the two prediction signals.
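By way of illustration only, the weighted average of two prediction signals mentioned at the end of the paragraph above can be sketched in Python as follows. Expressing the weight in eighths, with the rounding offset 4 and a right shift by 3, is an assumption of this sketch, as is the function name `weighted_bi_prediction`; a weight of 4 reduces to simple averaging.

```python
def weighted_bi_prediction(pred0, pred1, w1=4):
    """Blend two prediction blocks as ((8 - w1) * P0 + w1 * P1 + 4) >> 3."""
    return [
        [((8 - w1) * a + w1 * b + 4) >> 3 for a, b in zip(row0, row1)]
        for row0, row1 in zip(pred0, pred1)
    ]
```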

Inter prediction unit 244 may include a motion estimation (ME) unit and a motion compensation (MC) unit (neither is shown in FIG. 2). The motion estimation unit may be used to receive or obtain the image block 203 (the current image block 203 of the current image 17) and a decoded image 231, or at least one or more previously reconstructed blocks, e.g., reconstructed blocks of one or more other/different previously decoded images 231, for motion estimation. For example, the video sequence may include the current image and the previously decoded image 231, or in other words, the current image and the previously decoded image 231 may be part of, or form, the sequence of images that makes up the video sequence.

For example, the encoder 20 may be operable to select a reference block from a plurality of reference blocks of the same image or of different images among a plurality of other images, and to provide the reference image (or reference image index) and/or the offset (spatial offset) between the position (x, y coordinates) of the reference block and the position of the current block to the motion estimation unit as inter prediction parameters. This offset is also called a motion vector (MV).
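By way of illustration only, the search for a best matching reference block within a search window, and the resulting motion vector, can be sketched in Python as follows. Exhaustive integer-pixel search with the sum of absolute differences (SAD) as the matching criterion is an assumption of this sketch; the description above does not fix a search strategy or a cost measure. All function names are hypothetical.

```python
def get_block(frame, x, y, width, height):
    """Extract a width x height block whose top-left corner is at (x, y)."""
    return [row[x:x + width] for row in frame[y:y + height]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def motion_search(current, reference, bx, by, bw, bh, search_range):
    """Full search in a +/- search_range window; returns ((mv_x, mv_y), best_sad)."""
    cur_block = get_block(current, bx, by, bw, bh)
    ref_h, ref_w = len(reference), len(reference[0])
    best_mv, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + bw > ref_w or ry + bh > ref_h:
                continue   # candidate lies outside the reference image
            cost = sad(cur_block, get_block(reference, rx, ry, bw, bh))
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

The returned pair (mv_x, mv_y) is the spatial offset between the position of the best matching reference block and the position of the current block, i.e., the motion vector.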

The motion compensation unit is used to obtain, e.g., receive, inter prediction parameters, and to perform inter prediction based on or using the inter prediction parameters to obtain the inter prediction block 246. The motion compensation performed by the motion compensation unit may involve extracting or generating the prediction block based on the motion/block vector determined through motion estimation, and may also include performing interpolation to sub-pixel precision. Interpolation filtering can generate additional pixel samples from known pixel samples, thereby potentially increasing the number of candidate prediction blocks that can be used to encode an image block. Once the motion vector corresponding to the PU of the current image block is received, the motion compensation unit may locate the prediction block pointed to by the motion vector in one of the reference image lists.
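By way of illustration only, motion compensation with sub-pixel interpolation can be sketched in Python as follows. The motion vector is given in half-pixel units and half-pixel positions are obtained by bilinear averaging; both are simplifying assumptions of this sketch, since practical codecs use finer precision and longer interpolation filters. No boundary padding is performed, so the caller is assumed to keep the referenced area inside the reference image. The function name `motion_compensate` is hypothetical.

```python
def motion_compensate(reference, bx, by, bw, bh, mv_x2, mv_y2):
    """Build the prediction block for a motion vector given in half-pixel units."""
    prediction = []
    for j in range(bh):
        row = []
        for i in range(bw):
            x2 = 2 * (bx + i) + mv_x2        # position in half-pixel units
            y2 = 2 * (by + j) + mv_y2
            x0, y0 = x2 >> 1, y2 >> 1        # integer part of the position
            x1, y1 = x0 + (x2 & 1), y0 + (y2 & 1)   # neighbor used at half positions
            total = (reference[y0][x0] + reference[y0][x1] +
                     reference[y1][x0] + reference[y1][x1])
            row.append((total + 2) >> 2)     # rounded average of the four taps
        prediction.append(row)
    return prediction
```

At integer positions the four taps coincide and the reference pixel is copied unchanged; at a horizontal half-pixel position the result is the rounded average of the two horizontally adjacent reference pixels.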

The motion compensation unit may also generate block- and video slice-related syntax elements for use by video decoder 30 in decoding image blocks of the video slice. In addition, or instead of slices and corresponding syntax elements, coding block groups and/or coding blocks and corresponding syntax elements may be generated or used.

In the process of acquiring the candidate motion vector list in the advanced motion vector prediction (AMVP) mode, the motion vectors (MVs) that can be added to the candidate motion vector list as candidates include the MVs of image blocks that are spatially adjacent and temporally adjacent to the current block, wherein the MVs of spatially adjacent image blocks may in turn include the MV of the left candidate image block located to the left of the current block and the MV of the upper candidate image block located above the current block. Exemplarily, FIG. 7 is an exemplary schematic diagram of candidate image blocks according to an embodiment of the present application. As shown in FIG. 7, the set of candidate image blocks on the left includes {A0, A1}, the set of candidate image blocks on the upper side includes {B0, B1, B2}, and the set of temporally adjacent candidate image blocks includes {C, T}. The MVs of all three sets can be added to the candidate motion vector list as candidates, but according to the existing coding standard, the maximum length of the AMVP candidate motion vector list is 2, so the MVs of at most two image blocks must be selected from the three sets, in a specified order, for addition to the candidate motion vector list. The order may be to give priority to the set {A0, A1} of candidate image blocks on the left of the current block (consider A0 first, and then consider A1 when A0 is unavailable), secondly to consider the set {B0, B1, B2} of candidate image blocks above the current block (consider B0 first; if B0 is unavailable, consider B1; if B1 is unavailable, consider B2), and finally to consider the set {C, T} of candidate image blocks adjacent to the current block in the temporal domain (consider T first, and consider C when T is unavailable).
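The selection order described above can be sketched as follows. This is a minimal illustration rather than the procedure of any particular standard: the function names, the dictionary layout, and the use of `None` to mark an unavailable block are assumptions introduced here, and one candidate is taken per group only as one possible reading of the order.

```python
from typing import Dict, List, Optional, Tuple

MV = Tuple[int, int]

# Scan order inside each group, following the order given in the text.
LEFT_ORDER = ["A0", "A1"]
ABOVE_ORDER = ["B0", "B1", "B2"]
TEMPORAL_ORDER = ["T", "C"]
MAX_AMVP_CANDIDATES = 2  # maximum list length stated in the text


def first_available(order: List[str], mvs: Dict[str, Optional[MV]]) -> Optional[MV]:
    """Return the MV of the first available block in scan order, or None."""
    for name in order:
        mv = mvs.get(name)  # a missing key or None means "unavailable" (assumption)
        if mv is not None:
            return mv
    return None


def build_amvp_list(mvs: Dict[str, Optional[MV]]) -> List[MV]:
    """Take one MV per group (left, above, temporal) until the list holds two."""
    candidates: List[MV] = []
    for order in (LEFT_ORDER, ABOVE_ORDER, TEMPORAL_ORDER):
        if len(candidates) == MAX_AMVP_CANDIDATES:
            break
        mv = first_available(order, mvs)
        if mv is not None:
            candidates.append(mv)
    return candidates
```

With A0 and B0 unavailable, for instance, the list is filled from A1 and B1, and the temporal group is not consulted because the list is already full.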

After the above candidate motion vector list is obtained, the optimal MV is determined from the candidate motion vector list through the rate distortion cost (RD cost), and the candidate motion vector with the smallest RD cost is used as the motion vector predictor (MVP) of the current block. The rate-distortion cost is calculated by the following formula:

J=SAD+λR

Here, J represents the RD cost, SAD is the sum of absolute differences between the pixel values of the prediction block obtained after motion estimation using the candidate motion vector and the pixel values of the current block, R represents the code rate, and λ denotes the Lagrange multiplier.
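As a rough illustration of the formula J = SAD + λR, the following sketch computes the cost for each candidate and picks the one with the smallest cost. The block representation (nested lists), the function names, and the way the rate is supplied as a plain number are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Block = List[List[int]]


def sad(pred: Block, cur: Block) -> int:
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - c) for pr, cr in zip(pred, cur) for p, c in zip(pr, cr))


def rd_cost(pred: Block, cur: Block, rate: float, lam: float) -> float:
    """J = SAD + lambda * R."""
    return sad(pred, cur) + lam * rate


def pick_mvp(candidates: Sequence[Tuple[Block, float]], cur: Block, lam: float) -> int:
    """Return the index of the candidate (prediction block, rate) with the smallest J."""
    costs = [rd_cost(pred, cur, rate, lam) for pred, rate in candidates]
    return min(range(len(costs)), key=costs.__getitem__)
```

The returned index plays the role of the MVP index in the candidate list; a larger λ shifts the choice toward candidates that cost fewer bits.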

The encoder transmits the index of the determined MVP in the candidate motion vector list to the decoder. Further, a motion search can be performed in the neighborhood centered on the MVP to obtain the actual motion vector of the current block; the encoder calculates the motion vector difference (MVD) between the MVP and the actual motion vector, and passes the MVD to the decoder. The decoder parses the index, finds the corresponding MVP in the candidate motion vector list according to the index, parses the MVD, and adds the MVD and the MVP to obtain the actual motion vector of the current block.
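The MVP/MVD exchange between encoder and decoder amounts to a subtraction on one side and an addition on the other. A minimal sketch, with hypothetical function names and with motion vectors represented as integer pairs:

```python
from typing import List, Tuple

MV = Tuple[int, int]


def encode_mvd(actual_mv: MV, mvp: MV) -> MV:
    """Encoder side: MVD = actual MV - MVP."""
    return (actual_mv[0] - mvp[0], actual_mv[1] - mvp[1])


def decode_mv(candidate_list: List[MV], mvp_index: int, mvd: MV) -> MV:
    """Decoder side: look up the MVP by its index, then add the MVD."""
    mvp = candidate_list[mvp_index]
    return (mvp[0] + mvd[0], mvp[1] + mvd[1])
```

Both sides must build the same candidate list for the transmitted index to refer to the same MVP.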

In the process of acquiring the candidate motion information list in the merge mode, the motion information that can be added to the candidate motion information list as candidates includes the motion information of the image blocks spatially adjacent or temporally adjacent to the current block; for the spatially adjacent and temporally adjacent image blocks, refer to FIG. 7. The spatial candidate motion information in the candidate motion information list comes from the 5 spatially adjacent blocks (A0, A1, B0, B1, and B2); if a spatially adjacent block is unavailable or is intra predicted, its motion information is not added to the candidate motion information list. The temporal candidate motion information of the current block is obtained by scaling the MV of the block at the corresponding position in the reference frame according to the picture order counts (POCs) of the reference frame and the current frame. First, it is determined whether the block at position T in the reference frame is available; if not, the block at position C is selected. After the above candidate motion information list is obtained, the optimal motion information is determined from the candidate motion information list through the RD cost and used as the motion information of the current block. The encoder transmits the index value of the position of the optimal motion information in the candidate motion information list (denoted as the merge index) to the decoder.
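The text does not give the scaling formula for the temporal candidate. One common form, used here purely as an assumption, scales the co-located MV by the ratio of two POC distances. The parameter names are hypothetical, and the floating-point rounding is a simplification; practical codecs use fixed-point arithmetic with clipping.

```python
from typing import Optional, Tuple

MV = Tuple[int, int]


def temporal_candidate(t_mv: Optional[MV], c_mv: Optional[MV]) -> Optional[MV]:
    """Prefer the block at position T; fall back to position C if T is unavailable."""
    return t_mv if t_mv is not None else c_mv


def scale_temporal_mv(col_mv: MV, cur_poc: int, cur_ref_poc: int,
                      col_poc: int, col_ref_poc: int) -> MV:
    """Scale the co-located MV by the ratio of POC distances (assumed formula)."""
    tb = cur_poc - cur_ref_poc  # distance from the current frame to its reference
    td = col_poc - col_ref_poc  # distance from the co-located frame to its reference
    if td == 0:
        return col_mv  # nothing to scale against; keep the MV unchanged
    return (round(col_mv[0] * tb / td), round(col_mv[1] * tb / td))
```

When the current frame is half as far from its reference as the co-located frame is from its own, the co-located MV is halved.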

Entropy coding

The entropy coding unit 270 is used to apply an entropy coding algorithm or scheme (for example, a variable length coding (VLC) scheme, a context adaptive VLC (CAVLC) scheme, an arithmetic coding scheme, a binarization algorithm, context adaptive binary arithmetic coding (CABAC), syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding, or another entropy coding method or technique) to the quantized residual coefficients 209, inter prediction parameters, intra prediction parameters, loop filter parameters and/or other syntax elements, to obtain encoded image data 21 that can be output through the output 272 in the form of an encoded bitstream 21 or the like, so that the video decoder 30 or the like can receive and use the parameters for decoding. The encoded bitstream 21 may be transmitted to the video decoder 30, or stored in memory for later transmission or retrieval by the video decoder 30.

Other structural variations of video encoder 20 may be used to encode the video stream. For example, the non-transform based encoder 20 may directly quantize the residual signal without transform processing unit 206 for certain blocks or frames. In another implementation, encoder 20 may have quantization unit 208 and inverse quantization unit 210 combined into a single unit.

Decoders and Decoding Methods

FIG. 3 is an exemplary block diagram of a video decoder 30 according to an embodiment of the present application. The video decoder 30 is adapted to receive the encoded image data 21 (e.g., the encoded bitstream 21) encoded by, for example, the encoder 20, to obtain a decoded image 331. The encoded image data or bitstream includes information for decoding the encoded image data, such as data representing image blocks of an encoded video slice (and/or encoded block groups or encoded blocks) and associated syntax elements.

In the example of FIG. 3, decoder 30 includes an entropy decoding unit 304, an inverse quantization unit 310, an inverse transform processing unit 312, a reconstruction unit 314 (e.g., summer 314), a loop filter 320, a decoded picture buffer (DPB) 330, a mode application unit 360, an inter prediction unit 344 and an intra prediction unit 354. Inter prediction unit 344 may be or include a motion compensation unit. In some examples, video decoder 30 may perform a decoding process that is substantially the inverse of the encoding process described with reference to video encoder 20 of FIG. 2.

Referring to FIG. 3, the inter prediction unit includes a trained target model (also called a neural network) for processing an input image or image region or image patch to generate predicted values for the input image patch. For example, a neural network for inter prediction is used to receive an input image or image region or image patch, and generate a predicted value for the input image or image region or image patch. The neural network for inter prediction will be described in detail below in conjunction with Figures 6a-6e.

As described for the encoder 20, the inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the loop filter 220, the decoded picture buffer (DPB) 230, the inter prediction unit 244 and the intra prediction unit 254 also constitute the "built-in decoder" of the video encoder 20. Accordingly, the inverse quantization unit 310 may be functionally the same as the inverse quantization unit 210, the inverse transform processing unit 312 may be functionally the same as the inverse transform processing unit 212, the reconstruction unit 314 may be functionally the same as the reconstruction unit 214, the loop filter 320 may be functionally the same as the loop filter 220, and the decoded picture buffer 330 may be functionally the same as the decoded picture buffer 230. Therefore, the explanations of the corresponding units and functions of the video encoder 20 apply correspondingly to the corresponding units and functions of the video decoder 30.

Entropy decoding

The entropy decoding unit 304 is used to parse the bitstream 21 (or, generally, the encoded image data 21) and perform entropy decoding on the encoded image data 21 to obtain quantized coefficients 309 and/or decoded coding parameters (not shown in FIG. 3), such as any or all of inter prediction parameters (e.g., reference picture indices and motion vectors), intra prediction parameters (e.g., intra prediction modes or indices), transform parameters, quantization parameters, loop filter parameters and/or other syntax elements. The entropy decoding unit 304 may be configured to apply a decoding algorithm or scheme corresponding to the encoding scheme of the entropy encoding unit 270 of the encoder 20. The entropy decoding unit 304 may also be used to provide inter prediction parameters, intra prediction parameters, and/or other syntax elements to the mode application unit 360, as well as other parameters to other units of the decoder 30. Video decoder 30 may receive syntax elements at the video slice and/or video block level. In addition to, or instead of, slices and corresponding syntax elements, encoded block groups and/or encoded blocks and corresponding syntax elements may be received or used.

Inverse quantization

Inverse quantization unit 310 may be operable to receive quantization parameters (QPs) (or information related to inverse quantization in general) and quantized coefficients from the encoded image data 21 (e.g., parsed and/or decoded by the entropy decoding unit 304), and to inverse quantize the decoded quantized coefficients 309 based on the quantization parameters to obtain inverse quantized coefficients 311, which may also be referred to as transform coefficients 311. The inverse quantization process may include using the quantization parameters calculated by video encoder 20 for each video block in the video slice to determine the degree of quantization and, likewise, the degree of inverse quantization that needs to be performed.
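The text does not specify how the quantization parameter maps to a step size. The sketch below assumes the HEVC-style relation in which the step size doubles every 6 QP values, and it ignores scaling lists and integer rounding; the function names are hypothetical.

```python
from typing import List


def quant_step(qp: int) -> float:
    """Assumed QP-to-step mapping: the step doubles every 6 QP values."""
    return 2.0 ** ((qp - 4) / 6.0)


def inverse_quantize(levels: List[List[int]], qp: int) -> List[List[float]]:
    """Multiply each quantized level by the step size derived from the QP."""
    step = quant_step(qp)
    return [[level * step for level in row] for row in levels]
```

A larger QP therefore means a coarser step on both the quantization and the inverse quantization side.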

Inverse transform

An inverse transform processing unit 312 may be operable to receive the dequantized coefficients 311, also referred to as transform coefficients 311, and apply a transform to the dequantized coefficients 311 to obtain a reconstructed residual block 313 in the pixel domain. The reconstructed residual block 313 may also be referred to as a transform block 313. The transform may be an inverse transform, such as an inverse DCT, an inverse DST, an inverse integer transform, or a conceptually similar inverse transform process. The inverse transform processing unit 312 may also be operable to receive transform parameters or corresponding information from the encoded image data 21 (e.g., parsed and/or decoded by the entropy decoding unit 304) to determine the transform to apply to the dequantized coefficients 311.

Reconstruction

The reconstruction unit 314 (e.g., summer 314) is used to add the reconstructed residual block 313 to the prediction block 365 to obtain the reconstructed block 315 in the pixel domain, for example, by adding the pixel values of the reconstructed residual block 313 and the pixel values of the prediction block 365.
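The sample-wise addition performed by the reconstruction unit can be sketched as below. Clipping the sum to the valid sample range is an assumption of this example (the text only describes the addition), as are the function name and the default bit depth.

```python
from typing import List

Block = List[List[int]]


def reconstruct(residual: Block, prediction: Block, bit_depth: int = 8) -> Block:
    """Add residual and prediction sample by sample, then clip (clipping is assumed)."""
    max_val = (1 << bit_depth) - 1
    return [
        [min(max(r + p, 0), max_val) for r, p in zip(res_row, pred_row)]
        for res_row, pred_row in zip(residual, prediction)
    ]
```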

Filtering

The loop filter unit 320 (in or after the encoding loop) is used to filter the reconstructed block 315 to obtain a filtered block 321, so as to smooth pixel transitions or otherwise improve video quality. The loop filter unit 320 may include one or more loop filters, such as a deblocking filter, a sample-adaptive offset (SAO) filter, or one or more other filters, such as an adaptive loop filter (ALF), a noise suppression filter (NSF), or any combination thereof. For example, the loop filter unit 320 may include a deblocking filter, an SAO filter, and an ALF filter. The order of the filtering process can be deblocking filter, SAO filter and ALF filter. As another example, a process called luma mapping with chroma scaling (LMCS) (i.e., an adaptive in-loop reshaper) is added. This process is performed before deblocking. As yet another example, the deblocking filtering process can also be applied to internal sub-block edges, such as affine sub-block edges, ATMVP sub-block edges, sub-block transform (SBT) edges, and intra sub-partition (ISP) edges. Although loop filter unit 320 is shown in FIG. 3 as a loop filter, in other configurations, loop filter unit 320 may be implemented as a post-loop filter.

Decoded image buffer

The decoded video blocks 321 of a picture are then stored in the decoded picture buffer 330, which stores decoded pictures 331 as reference pictures for subsequent motion compensation of other pictures and/or for output or display.

The decoder 30 is configured to output the decoded image 331 through the output terminal 312 or the like, for display to the user or for the user to view.

Prediction

The inter prediction unit 344 may be functionally the same as the inter prediction unit 244 (in particular, the motion compensation unit), and the intra prediction unit 354 may be functionally the same as the intra prediction unit 254; they decide on splitting or partitioning and perform prediction based on the partitioning and/or prediction parameters or corresponding information received from the encoded image data 21 (e.g., parsed and/or decoded by the entropy decoding unit 304). The mode application unit 360 may be configured to perform prediction (intra or inter prediction) of each block according to the reconstructed image, block or corresponding pixel points (filtered or unfiltered), resulting in a prediction block 365.

When the video slice is encoded as an intra-coded (I) slice, the intra prediction unit 354 in the mode application unit 360 is used to generate a prediction block 365 for an image block of the current video slice based on the indicated intra prediction mode and data from previously decoded blocks of the current image. When the video image is encoded as an inter-coded (i.e., B or P) slice, the inter prediction unit 344 (e.g., a motion compensation unit) in the mode application unit 360 is used to generate a prediction block 365 for a video block of the current video slice based on the motion vectors and other syntax elements received from the entropy decoding unit 304. For inter prediction, these prediction blocks may be generated from one of the reference pictures in one of the reference picture lists. Video decoder 30 may construct the reference frame lists, List 0 and List 1, from reference pictures stored in DPB 330 using default construction techniques. In addition to or instead of slices (e.g., video slices), the same or a similar process may be applied to embodiments using coding block groups (e.g., video coding block groups) and/or coding blocks (e.g., video coding blocks); for example, video may be encoded using I, P, or B coding block groups and/or coding blocks.

Mode application unit 360 is operable to determine prediction information for a video block of the current video slice by parsing motion vectors and other syntax elements, and to use the prediction information to generate a prediction block for the current video block being decoded. For example, the mode application unit 360 uses some of the received syntax elements to determine the prediction mode (e.g., intra prediction or inter prediction) used to encode the video blocks of the video slice, the inter prediction slice type (e.g., B slice, P slice, or GPB slice), construction information for one or more reference picture lists of the slice, the motion vector of each inter-coded video block of the slice, the inter prediction status of each inter-coded video block of the slice, and other information, in order to decode the video blocks within the current video slice. In addition to or instead of slices (e.g., video slices), the same or a similar process may be applied to embodiments using coding block groups (e.g., video coding block groups) and/or coding blocks (e.g., video coding blocks); for example, video may be encoded using I, P, or B coding block groups and/or coding blocks.

In one embodiment, the video decoder 30 of FIG. 3 may also be used to segment and/or decode an image using slices (also referred to as video slices), where an image may be segmented or decoded using one or more slices (usually non-overlapping). Each slice may include one or more blocks (e.g., CTUs) or one or more groups of blocks (e.g., coding blocks in the H.265/HEVC/VVC standards and bricks in the VVC standard).

In one embodiment, the video decoder 30 shown in FIG. 3 may also be used to segment and/or decode an image using slices/coding block groups (also referred to as video coding block groups) and/or coding blocks (also referred to as video coding blocks), wherein the image may be segmented or decoded using one or more slices/coding block groups (usually non-overlapping), and each slice/coding block group may include one or more blocks (e.g., CTUs) or one or more coding blocks, etc., wherein each coding block may be rectangular or the like, and may include one or more full or partial blocks (e.g., CTUs).

Other variations of the video decoder 30 may be used to decode the encoded image data 21 . For example, decoder 30 may generate the output video stream without loop filter unit 320 . For example, the non-transform based decoder 30 may directly inverse quantize the residual signal without the inverse transform processing unit 312 for certain blocks or frames. In another implementation, video decoder 30 may have inverse quantization unit 310 and inverse transform processing unit 312 combined into a single unit.

It should be understood that in the encoder 20 and the decoder 30, the processing result of the current step can be further processed, and then output to the next step. For example, after interpolation filtering, motion vector derivation or loop filtering, further operations, such as clip or shift operations, may be performed on the processing results of interpolation filtering, motion vector derivation or loop filtering.

It should be noted that further operations may be performed on the derived motion vectors of the current block (including but not limited to control point motion vectors in affine mode, sub-block motion vectors in affine, planar and ATMVP modes, temporal motion vectors, etc.). For example, the value of a motion vector is limited to a predefined range according to the number of bits used to represent it. If the number of representation bits of the motion vector is bitDepth, the range is -2^(bitDepth-1) to 2^(bitDepth-1)-1, where "^" represents exponentiation. For example, if bitDepth is set to 16, the range is -32768 to 32767; if bitDepth is set to 18, the range is -131072 to 131071. As another example, the values of the derived motion vectors (e.g., the MVs of the four 4x4 sub-blocks in an 8x8 block) are limited such that the maximum difference between the integer parts of the four 4x4 sub-block MVs does not exceed N pixels, e.g., does not exceed 1 pixel. There are two ways to limit motion vectors based on bitDepth.
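The range computation above can be checked with a short sketch. Clipping (saturating each component at the range limits) is shown only as one possible way of enforcing the range; the text does not spell out which methods it has in mind, and the function names are hypothetical.

```python
from typing import Tuple

MV = Tuple[int, int]


def mv_range(bit_depth: int) -> Tuple[int, int]:
    """Range -2^(bitDepth-1) .. 2^(bitDepth-1)-1 for a given number of bits."""
    half = 1 << (bit_depth - 1)
    return (-half, half - 1)


def clip_mv(mv: MV, bit_depth: int) -> MV:
    """Saturate each MV component at the limits of the representable range."""
    lo, hi = mv_range(bit_depth)
    return (min(max(mv[0], lo), hi), min(max(mv[1], lo), hi))
```

For bitDepth values of 16 and 18 the helper reproduces the ranges quoted in the text.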

Although the above embodiments have primarily described video codecs, it should be noted that embodiments of the coding system 10, encoder 20 and decoder 30, as well as other embodiments described herein, may also be used for still image processing or coding, that is, the processing or coding of a single image independent of any previous or consecutive images, as in video coding. In general, if image processing is limited to a single image 17, inter prediction unit 244 (encoder) and inter prediction unit 344 (decoder) may not be available. All other functions (also referred to as tools or techniques) of video encoder 20 and video decoder 30 are also available for still image processing, such as residual calculation 204/304, transform 206, quantization 208, inverse quantization 210/310, (inverse) transform 212/312, partition 262/362, intra prediction 254/354 and/or loop filtering 220/320, entropy encoding 270 and entropy decoding 304.

FIG. 4 is an exemplary block diagram of a video coding apparatus 400 according to an embodiment of the present application. Video coding apparatus 400 is suitable for implementing the disclosed embodiments described herein. In one embodiment, the video coding apparatus 400 may be a decoder, such as the video decoder 30 in FIG. 1a, or an encoder, such as the video encoder 20 in FIG. 1a.

The video coding apparatus 400 includes: an ingress port 410 (or input port 410) and a receiver unit (Rx) 420 for receiving data; a processor, logic unit or central processing unit (CPU) 430 for processing data, where, for example, the processor 430 may be a neural network processor 430; a transmitter unit (Tx) 440 and an egress port 450 (or output port 450) for transmitting data; and a memory 460. The video coding apparatus 400 may also include optical-to-electrical (OE) components and electrical-to-optical (EO) components coupled to the ingress port 410, the receiver unit 420, the transmitter unit 440 and the egress port 450, serving as the exit or entrance for optical or electrical signals.

The processor 430 is implemented by hardware and software. Processor 430 may be implemented as one or more processor chips, cores (e.g., multi-core processors), FPGAs, ASICs, and DSPs. The processor 430 communicates with the ingress port 410, the receiver unit 420, the transmitter unit 440, the egress port 450 and the memory 460. The processor 430 includes a coding module 470 (e.g., a neural network-based coding module 470). The coding module 470 implements the embodiments disclosed above. For example, the coding module 470 performs, processes, prepares or provides various coding operations. The coding module 470 thus provides a substantial improvement to the functionality of the video coding apparatus 400 and effects the switching of the video coding apparatus 400 to different states. Alternatively, the coding module 470 is implemented as instructions stored in memory 460 and executed by processor 430.

Memory 460 includes one or more magnetic disks, tape drives, and solid-state drives, and may serve as an overflow data storage device, for storing programs when such programs are selected for execution, and for storing instructions and data read during program execution. Memory 460 may be volatile and/or non-volatile, and may be read-only memory (ROM), random access memory (RAM), ternary content-addressable memory (TCAM) and/or static random-access memory (SRAM).

FIG. 5 is an exemplary block diagram of an apparatus 500 according to an embodiment of the present application, and the apparatus 500 can be used as either or both of the source device 12 and the destination device 14 in FIG. 1a.

The processor 502 in the apparatus 500 may be a central processing unit. Alternatively, the processor 502 may be any other type of device or devices, existing or to be developed in the future, capable of manipulating or processing information. Although the disclosed implementations may be implemented using a single processor, such as the processor 502 shown, using more than one processor can be faster and more efficient.

In one implementation, the memory 504 in the apparatus 500 may be a read only memory (ROM) device or a random access memory (RAM) device. Any other suitable type of storage device may be used as memory 504 . Memory 504 may include code and data 506 accessed by processor 502 via bus 512 . The memory 504 may also include an operating system 508 and application programs 510 including at least one program that allows the processor 502 to perform the methods described herein. For example, applications 510 may include applications 1 through N, and also include video coding applications that perform the methods described herein.

Apparatus 500 may also include one or more output devices, such as display 518 . In one example, display 518 may be a touch-sensitive display that combines a display with touch-sensitive elements that may be used to sense touch input. Display 518 may be coupled to processor 502 through bus 512 .

Although bus 512 in device 500 is described herein as a single bus, bus 512 may include multiple buses. In addition, secondary storage may be directly coupled to other components of the device 500 or accessed through a network, and may include a single integrated unit, such as a memory card, or multiple units, such as multiple memory cards. Accordingly, the apparatus 500 may have various configurations.

Since the embodiments of the present application involve the application of neural networks, for ease of understanding, some nouns or terms used in the embodiments of the present application are explained below, and the nouns or terms are also part of the content of the invention.

(1) Neural network

A neural network (NN) is a machine learning model. A neural network can be composed of neural units. A neural unit can refer to an operation unit that takes $x_s$ and an intercept of 1 as input. The output of the operation unit can be:

$$h_{W,b}(x) = f\left(W^{T}x\right) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

Here, s = 1, 2, ..., n, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer. The activation function can be a sigmoid function. A neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
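A single neural unit of the kind described above, which applies the activation function f to the weighted sum of the inputs plus the bias b, can be sketched in a few lines. The function names are illustrative, and the sigmoid is used because the text names it as one possible activation function.

```python
import math
from typing import Callable, Sequence


def sigmoid(z: float) -> float:
    """Sigmoid activation: maps any real input to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(xs: Sequence[float], ws: Sequence[float], b: float,
                f: Callable[[float], float] = sigmoid) -> float:
    """Output of one unit: f(sum over s of W_s * x_s, plus b)."""
    return f(sum(w * x for w, x in zip(ws, xs)) + b)
```

Feeding the output of one such unit into another unit is what turns isolated units into a network.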

(2) Deep neural network

A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers; there is no special metric for "many" here. When a DNN is divided according to the positions of its layers, the layers inside the DNN fall into three categories: input layer, hidden layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the (i+1)-th layer. Although a DNN looks complicated, the work of each layer is not complicated. In short, each layer computes the following linear relationship expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer simply applies this operation to the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, the numbers of coefficients W and offset vectors $\vec{b}$ are also large. These parameters are defined in the DNN as follows, taking the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as $W^{3}_{24}$. The superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4. In summary, the coefficient from the k-th neuron in layer L-1 to the j-th neuron in layer L is defined as $W^{L}_{jk}$.

It should be noted that the input layer does not have a W parameter. In a deep neural network, more hidden layers allow the network to better capture the complexities of the real world. In theory, a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks. Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors W of many layers).
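The per-layer operation of a fully connected DNN (an activation function applied to the weight matrix times the input vector plus the offset vector) can be sketched as follows. The indexing `W[j][k]` follows the convention in the text, with j the index of the neuron in the current layer and k the index of the neuron in the previous layer; the function names and the list-of-layers layout are assumptions of this example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]
Layer = Tuple[Matrix, List[float]]  # (weight matrix W, offset vector b)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(W: Matrix, b: Sequence[float], x: Sequence[float],
                  act: Callable[[float], float] = sigmoid) -> List[float]:
    """Compute y = act(W x + b); W[j][k] links input neuron k to output neuron j."""
    return [
        act(sum(W[j][k] * x[k] for k in range(len(x))) + b[j])
        for j in range(len(b))
    ]


def dnn_forward(layers: Sequence[Layer], x: Sequence[float]) -> List[float]:
    """Pass the input through every layer in turn; the input layer has no W."""
    out = list(x)
    for W, b in layers:
        out = layer_forward(W, b, out)
    return out
```

Training, in these terms, is the search for the entries of every `W` and `b` in `layers`.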

(3) Convolutional Neural Network

A convolutional neural network (CNN) is a deep neural network with a convolutional structure. It is a deep learning architecture, in which learning is performed at multiple levels. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which individual neurons can respond to the images fed into it. A convolutional neural network contains a feature extractor consisting of convolutional and pooling layers. The feature extractor can be viewed as a filter, and the convolution process can be viewed as convolving an input image or a convolutional feature map with a trainable filter.

The convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal. The convolutional layer can include many convolution operators. A convolution operator is also called a kernel; its role in image processing is equivalent to a filter that extracts specific information from the input image matrix. The convolution operator is essentially a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), so as to extract specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension. In most cases, however, a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolutional image, where this dimension is determined by the "multiple" described above. Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to blur unwanted noise in the image.

Because the multiple weight matrices have the same size (rows × columns), the feature maps they extract also have the same size, and the extracted feature maps of the same size are then combined to form the output of the convolution operation. In practical applications, the weight values in these weight matrices need to be obtained through extensive training, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network can make correct predictions. When the convolutional neural network has multiple convolutional layers, the initial convolutional layers often extract more general features, which can also be called low-level features; as the depth of the convolutional neural network increases, the features extracted by the later convolutional layers become more and more complex, such as high-level semantic features, and features with higher-level semantics are more suitable for the problem to be solved.
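The behaviour described above (a weight matrix that spans the full input depth, slides with a given stride, and yields one feature map per weight matrix, with the maps stacked along the depth dimension) can be sketched without any library. The nested-list layout `image[y][x][d]`, the function names, and the absence of padding are assumptions of this example; as is usual in CNN practice, the kernel is applied without flipping.

```python
from typing import List

Image = List[List[List[float]]]   # image[y][x][d]: height x width x depth
Kernel = List[List[List[float]]]  # kernel[i][j][d]: same depth as the image


def conv2d_single(image: Image, kernel: Kernel, stride: int = 1) -> List[List[float]]:
    """Slide one weight matrix over the image; the output has a single depth."""
    h, w = len(image), len(image[0])
    kh, kw, depth = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for top in range(0, h - kh + 1, stride):
        row = []
        for left in range(0, w - kw + 1, stride):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    for d in range(depth):  # the kernel covers the whole depth
                        acc += image[top + i][left + j][d] * kernel[i][j][d]
            row.append(acc)
        out.append(row)
    return out


def conv2d(image: Image, kernels: List[Kernel], stride: int = 1) -> Image:
    """Apply several same-sized kernels and stack their maps along the depth."""
    maps = [conv2d_single(image, k, stride) for k in kernels]
    return [
        [[m[y][x] for m in maps] for x in range(len(maps[0][0]))]
        for y in range(len(maps[0]))
    ]
```

Applying two kernels to a single-channel 3×3 image with stride 1 gives a 2×2 output with depth 2, one channel per kernel.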

Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. This can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator computes the average of the pixel values within a specific range of the image as the result of average pooling. The max pooling operator takes the pixel with the largest value within a specific range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image. The size of the image output by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
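The average pooling operator and the max pooling operator may be sketched, for illustration only, as follows. The function name `pool2d` and the use of non-overlapping square sub-regions are assumptions of this sketch.

```python
def pool2d(image, size, mode="max"):
    """Pool non-overlapping size x size sub-regions of a 2-D image.

    mode="max" keeps the largest pixel of each sub-region;
    any other mode value returns the average of the sub-region.
    """
    out = []
    for top in range(0, len(image) - size + 1, size):
        row = []
        for left in range(0, len(image[0]) - size + 1, size):
            window = [image[top + i][left + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        out.append(row)
    return out
```

Each output pixel corresponds to one sub-region of the input, so a 4×4 input pooled with `size=2` yields a 2×2 output.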

After processing by the convolutional layer/pooling layer, the convolutional neural network is still unable to output the required output information, because, as mentioned above, the convolutional layer/pooling layer only extracts features and reduces the parameters brought by the input image. In order to generate the final output information (the required class information or other relevant information), the convolutional neural network needs to use the neural network layer to generate one output or a set of outputs whose number equals the required number of classes. Therefore, the neural network layer may include multiple hidden layers, and the parameters contained in the multiple hidden layers may be obtained by pre-training on training data related to a specific task type. For example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.

Optionally, the multiple hidden layers in the neural network layer are followed by the output layer of the entire convolutional neural network. The output layer has a loss function similar to categorical cross-entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network is completed, back propagation starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network, that is, the error between the result output by the convolutional neural network through the output layer and the ideal result.

(4) Recurrent Neural Network

Recurrent neural networks (RNNs) are used to process sequence data. In the traditional neural network model, from the input layer to the hidden layer to the output layer, adjacent layers are fully connected, while the nodes within each layer are unconnected. Although this ordinary neural network solves many problems, it remains powerless for many others. For example, to predict the next word of a sentence, the previous words generally need to be used, because the words in a sentence are not independent of each other. An RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs. Specifically, the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes within the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, an RNN can process sequence data of any length. The training of an RNN is the same as the training of a traditional CNN or DNN: the error back propagation algorithm is also used, with one difference, namely that if the RNN is unrolled, its parameters, such as W, are shared across time steps, which is not the case in the traditional neural network mentioned above. Moreover, in the gradient descent algorithm, the output of each step depends not only on the network of the current step but also on the state of the network in the previous steps. This learning algorithm is called back propagation through time (BPTT).
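The dependence of the hidden layer on both the current input and its own output at the previous moment, with shared parameters, may be sketched as follows. The scalar hidden state, the tanh activation, and the names `w_in`, `w_rec` and `bias` are assumptions chosen to keep the sketch short.

```python
import math

def rnn_forward(inputs, w_in, w_rec, bias, h0=0.0):
    """Run a one-unit recurrent layer over a sequence of scalar inputs.

    The same parameters (w_in, w_rec, bias) are reused at every time step,
    and each hidden state depends on the previous hidden state.
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states
```

For the input sequence `[1.0, 0.0]`, the second hidden state is non-zero even though the second input is zero, because it carries the memory of the first step.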

Why is a recurrent neural network needed when a convolutional neural network is already available? The reason is simple. A convolutional neural network is based on the premise that the elements are independent of each other and that the inputs and outputs are also independent, such as images of cats and dogs. In the real world, however, many elements are interconnected, such as the change of stock prices over time. As another example, a person says: "I like to travel, and my favorite place is Yunnan. In the future, when I have the chance, I must go to ____." A human knows that the blank should be filled with "Yunnan", because humans make inferences based on the context. How can a machine be made to do the same? The RNN came into being for this purpose. RNNs are designed to give machines the ability to memorize as humans do. Therefore, the output of an RNN depends on both the current input information and the historical memory information.

(5) Loss function

In the process of training a deep neural network, it is hoped that the output of the deep neural network is as close as possible to the value that is actually desired. Therefore, the predicted value of the current network can be compared with the actually desired target value, and the weight vector of each layer of the neural network can then be updated based on the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vector is adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. It is therefore necessary to predefine "how to compare the difference between the predicted value and the target value". This is the loss function or objective function, which is an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference, so training the deep neural network becomes the process of reducing the loss as much as possible.

(6) Back propagation algorithm

The convolutional neural network can use the error back propagation (BP) algorithm to correct the values of the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, propagating the input signal forward to the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a backward pass driven by the error loss, aiming to obtain the parameters of the optimal super-resolution model, such as the weight matrix.
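The interplay between a loss function and a parameter update driven by the error loss may be illustrated with a deliberately small sketch. A single-weight model `y = w * x`, a mean squared error loss, and the learning rate are assumptions standing in for a full neural network; the gradient is written out by hand rather than back-propagated through layers.

```python
def mse_loss(w, samples):
    """Mean squared difference between predictions w * x and targets t."""
    return sum((w * x - t) ** 2 for x, t in samples) / len(samples)

def mse_gradient(w, samples):
    """Derivative of mse_loss with respect to w."""
    return sum(2 * (w * x - t) * x for x, t in samples) / len(samples)

def train(samples, w=0.0, lr=0.1, steps=50):
    """Repeatedly move w against the gradient so that the loss decreases."""
    losses = []
    for _ in range(steps):
        losses.append(mse_loss(w, samples))
        w -= lr * mse_gradient(w, samples)
    return w, losses
```

For the samples `[(1.0, 2.0), (2.0, 4.0)]`, the weight converges toward 2 and the recorded loss shrinks from step to step.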

(7) Generative Adversarial Networks

Generative adversarial networks (GANs) are deep learning models. The model includes at least two modules: one is the generative model and the other is the discriminative model, and the two modules learn from each other through a game to produce better output. Both the generative model and the discriminative model can be neural networks, specifically deep neural networks or convolutional neural networks. The basic principle of a GAN is as follows. Taking a GAN that generates pictures as an example, suppose there are two networks, G (generator) and D (discriminator). G is a network that generates pictures: it receives random noise z and generates a picture from this noise, denoted as G(z). D is a discriminant network used to determine whether a picture is "real". Its input parameter is x, which represents a picture, and its output D(x) represents the probability that x is a real picture: a value of 1 means that the picture is certainly real, and a value of 0 means that the picture cannot be real. In the process of training the generative adversarial network, the goal of the generative network G is to generate pictures that are as realistic as possible in order to deceive the discriminant network D, and the goal of the discriminant network D is to distinguish the pictures generated by G from real pictures as well as possible. In this way, G and D constitute a dynamic "game" process, that is, the "adversarial" part of the "generative adversarial network". As the final result of the game, in an ideal state, G can generate a picture G(z) that passes for real, and it is difficult for D to determine whether the picture generated by G is real, that is, D(G(z))=0.5. This yields an excellent generative model G, which can be used to generate pictures.

The target model (also referred to as a neural network) for inter prediction will be described in detail below with reference to Figs. 6a-6e. Figs. 6a-6e show several exemplary architectures of a neural network for inter-frame prediction according to an embodiment of the present application.

As shown in Figure 6a, the neural network includes, in order of processing: a 3×3 convolutional layer (3×3 Conv), an activation layer (ReLU), a block processing layer (Res-Block), ..., a block processing layer, a 3×3 convolutional layer, an activation layer, and a 3×3 convolutional layer. The original matrix input to the neural network is processed by the above layers, and the resulting matrix is then added to the original matrix to obtain the final output matrix.

As shown in Figure 6b, the neural network includes, in order of processing: two 3×3 convolutional layers and activation layers, a block processing layer, ..., a block processing layer, a 3×3 convolutional layer, an activation layer, and a 3×3 convolutional layer. The first matrix passes through one 3×3 convolutional layer and an activation layer, and the second matrix passes through another 3×3 convolutional layer and an activation layer. The two processed matrices are concatenated (concat) and then passed through the block processing layer, ..., the block processing layer, the 3×3 convolutional layer, the activation layer, and the 3×3 convolutional layer, and the resulting matrix is added to the first matrix to obtain the final output matrix.

As shown in Figure 6c, the neural network includes, in order of processing: two 3×3 convolutional layers and activation layers, a block processing layer, ..., a block processing layer, a 3×3 convolutional layer, an activation layer, and a 3×3 convolutional layer. The first matrix and the second matrix are multiplied before being input to the neural network. The first matrix then passes through one 3×3 convolutional layer and an activation layer, and the multiplied matrix passes through another 3×3 convolutional layer and an activation layer. The two processed matrices are added and then processed by the block processing layer, ..., the block processing layer, the 3×3 convolutional layer, the activation layer, and the 3×3 convolutional layer, and the resulting matrix is added to the first matrix to obtain the final output matrix.

As shown in Figure 6d, the above-mentioned block processing layer includes, in order of processing: a 3×3 convolutional layer, an activation layer, and a 3×3 convolutional layer. After the input matrix is processed by these three layers, the resulting matrix is added to the initial input matrix to obtain the output matrix. As shown in Figure 6e, the above-mentioned block processing layer includes, in order of processing: a 3×3 convolutional layer, an activation layer, a 3×3 convolutional layer, and an activation layer. After the input matrix is processed by the 3×3 convolutional layer, the activation layer, and the 3×3 convolutional layer, the resulting matrix is added to the initial input matrix, and the output matrix is then obtained through an activation layer.
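The two block processing layer variants described above differ only in whether an activation layer follows the addition to the initial input. A minimal sketch of this skip connection is given below. The callable `body` is a hypothetical stand-in for the sequence of 3×3 convolutional layer, activation layer and 3×3 convolutional layer, and the flat list representation of the matrix is an assumption of this sketch.

```python
def relu(values):
    """Activation layer: clamp negative values to zero."""
    return [max(0.0, v) for v in values]

def block_processing_layer(x, body, post_activation=False):
    """Add the processed matrix to the initial input (skip connection).

    body:            callable standing in for conv -> activation -> conv
    post_activation: if True, apply one more activation after the addition
    """
    processed = body(x)
    summed = [a + b for a, b in zip(x, processed)]
    return relu(summed) if post_activation else summed
```

With `post_activation=False` the sum is output directly; with `post_activation=True` the sum passes through one more activation layer before being output.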

It should be noted that Figures 6a-6e only show several exemplary architectures of the neural network used for inter-frame prediction in the embodiments of the present application, and do not constitute a limitation on the architecture of the neural network. The number of layers, the layer structure, and the operations such as addition, multiplication, or concatenation included in the neural network, as well as the number and size of the input and/or output matrices, can be determined according to the actual situation, which is not specifically limited in this application.

FIG. 8 is a flowchart of a process 800 of an inter-frame prediction method according to an embodiment of the present application. Process 800 may be performed by video encoder 20 or video decoder 30, and in particular, may be performed by inter prediction units 244, 344 of video encoder 20 or video decoder 30. Process 800 is described as a series of steps or operations; it should be understood that process 800 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 8. Assuming that a video encoder or a video decoder is being used on a video data stream having multiple image frames, process 800, comprising the following steps, is performed to inter-predict an image or image block. Process 800 may include:

Step 801: Acquire respective motion vectors of multiple reconstructed image blocks in the surrounding area of the current block.

The surrounding area of the current block includes spatial and/or temporal neighborhoods of the current block, wherein the image blocks in the spatial neighborhood may include left candidate image blocks located to the left of the current block and upper candidate image blocks located above the current block. Exemplarily, as shown in FIG. 7, the set of candidate image blocks on the left includes {A0, A1}, the set of candidate image blocks on the upper side includes {B0, B1, B2}, and the set of temporally adjacent candidate image blocks includes {C, T}. The reconstructed image block may refer to an encoded image block that has been encoded and then reconstructed by an encoder, or a decoded image block that has been decoded and reconstructed by a decoder. The reconstructed image block may also refer to a basic unit image block of a preset size obtained by dividing an encoded image block or a decoded image block. For example, the size of the encoded image block or the decoded image block may be 16×16, 64×64 or 32×16, and the size of the basic unit image block may be 4×4 or 8×8.

The following description takes one reconstructed image block as an example. The reconstructed image block may be any one of the plurality of reconstructed image blocks in the surrounding area, and the other reconstructed image blocks may be handled with reference to this method.

The multiple a posteriori candidate motion vectors are obtained from multiple prior candidate motion vectors of the reconstructed image block. Any a priori candidate motion vector among the multiple prior candidate motion vectors of the reconstructed image block can be offset within a preset search window to generate multiple offset candidate motion vectors. It can be seen that multiple offset candidate motion vectors can be obtained from one a priori candidate motion vector of the reconstructed image block. The above operation is performed on each of the multiple a priori candidate motion vectors of the reconstructed image block, and all the offset candidate motion vectors thus obtained are the multiple a posteriori candidate motion vectors of the reconstructed image block. For example, FIG. 10 is an exemplary schematic diagram of a search window according to an embodiment of the present application. As shown in FIG. 10, assuming that a certain prior candidate motion vector of the reconstructed image block is (0, 0), this prior candidate motion vector is offset within a 3×3 search window, and 9 offset candidate motion vectors can be obtained: (-1,-1), (-1,0), (-1,1), (0,-1), (0,0), (0,1), (1,-1), (1,0), (1,1). These nine offset candidate motion vectors are a posteriori candidate motion vectors of the reconstructed image block.
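The offsetting of a prior candidate motion vector within a search window may be sketched as follows. The function names, the square window with odd side length, and the integer offsets are assumptions made for this sketch.

```python
def offset_candidates(mv, window=3):
    """Offset one prior candidate motion vector within a window x window
    search window (window is assumed to be odd) and return all offsets."""
    radius = window // 2
    return [(mv[0] + dx, mv[1] + dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)]

def posterior_candidates(prior_mvs, window=3):
    """Collect the offset candidates of every prior candidate motion vector."""
    candidates = []
    for mv in prior_mvs:
        candidates.extend(offset_candidates(mv, window))
    return candidates
```

For the prior candidate motion vector (0, 0) and a 3×3 window, `offset_candidates((0, 0))` returns the nine vectors listed above, from (-1,-1) to (1,1).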

The multiple a posteriori motion vectors of the reconstructed image block may refer to the above-mentioned multiple a posteriori candidate motion vectors, or may refer to some of the above-mentioned multiple a posteriori candidate motion vectors, such as multiple specified motion vectors selected from the above-mentioned multiple a posteriori candidate motion vectors.

The probability values or prediction error values of the multiple a posteriori motion vectors are described below.

In a possible implementation manner, in addition to acquiring the motion vector of the reconstructed image block, related information of the reconstructed image block can also be acquired, and the related information and its acquisition method are as follows:

1. Multiple prediction error values corresponding to the multiple posterior motion vectors of the reconstructed image block, where the multiple prediction error values are determined according to the reconstructed value of the reconstructed image block and the predicted values corresponding to the multiple posterior candidate motion vectors.

Motion compensation is respectively performed on the reconstructed image block according to the multiple a posteriori candidate motion vectors, and multiple predicted values can be obtained, and the multiple predicted values correspond to the multiple posterior candidate motion vectors.

2. Multiple probability values corresponding to the multiple posterior motion vectors of the reconstructed image block, where the multiple probability values are also determined according to the reconstructed value of the reconstructed image block and the predicted values corresponding to the multiple posterior candidate motion vectors.

In one manner, the multiple probability values of the reconstructed image block are obtained according to the multiple prediction error values of the reconstructed image block obtained in item 1 above. For example, a normalized exponential function (softmax), a linear normalization method, or the like can be used to normalize the multiple prediction error values of the reconstructed image block, and the resulting normalized values are the multiple probability values of the reconstructed image block. Based on the correspondence between the multiple prediction error values of the reconstructed image block and the multiple posterior motion vectors, the multiple probability values of the reconstructed image block also correspond to the multiple posterior motion vectors of the reconstructed image block, and each probability value represents the probability that its corresponding posterior motion vector is the optimal motion vector of the reconstructed image block.
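One way to turn prediction error values into probability values with a normalized exponential function may be sketched as follows. The text does not state how the error values enter the function; this sketch assumes that the errors are negated so that a smaller prediction error yields a larger probability, and the `temperature` parameter is likewise an assumption.

```python
import math

def errors_to_probabilities(errors, temperature=1.0):
    """Normalize prediction error values into probabilities with softmax.

    Assumption: scores are the negated errors, so the motion vector with
    the smallest prediction error receives the largest probability.
    """
    scores = [-e / temperature for e in errors]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]
```

The returned values are positive and sum to 1, and they keep the one-to-one correspondence with the posterior motion vectors whose error values were passed in.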

In the present application, the above-mentioned motion vector and related information of the reconstructed image block can be obtained by directly reading the memory. After the reconstructed image block is encoded or decoded, the above method can be used immediately to obtain the motion vector, or the motion vector and its related information, of the reconstructed image block, which is then stored. When inter-frame prediction is performed on a subsequent image block (the current block), the stored data can be read directly from the corresponding location in the memory. In this way, the inter prediction efficiency of the current block can be improved.

In this application, the motion vector, or the motion vector and related information, of the reconstructed image block can also be calculated only when inter-frame prediction is performed on the current block; that is, the above method is used to obtain the motion vector, or the motion vector and its related information, of the reconstructed image block at the time the current block is inter-predicted. In this way, the calculation is performed only after it is determined which reconstructed image blocks need to be used, which can save storage space.

If the above-mentioned multiple reconstructed image blocks all use inter-frame prediction in the encoding or decoding process, the above-mentioned method can be used to obtain the motion vectors, or the motion vectors and related information, of the multiple reconstructed image blocks. If some image blocks among the plurality of reconstructed image blocks do not use inter-frame prediction in the process of encoding or decoding, the motion vectors, or the motion vectors and related information, of these image blocks can also be obtained according to any one of the methods described in the above three cases.

If the reconstructed image block contains multiple basic unit image blocks, the motion vector, or the motion vector and related information, of the reconstructed image block can be taken as the motion vector, or the motion vector and related information, of all the basic unit image blocks contained in the reconstructed image block. Further, the motion vector, or the motion vector and related information, of the reconstructed image block can be refined to be the motion vector, or the motion vector and related information, of all the pixels contained therein.

Step 802: Obtain multiple prior candidate motion vectors of the current block and multiple probability values corresponding to the multiple prior candidate motion vectors according to the respective motion vectors of the multiple reconstructed image blocks.

The multiple a priori candidate motion vectors of the current block may refer to all the motion vectors remaining after deduplication of the multiple posterior motion vectors of the multiple reconstructed image blocks, or may refer to some of the motion vectors remaining after deduplication of the multiple posterior motion vectors of the multiple reconstructed image blocks.
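The deduplication of the posterior motion vectors of several reconstructed image blocks may be sketched as follows. The function name `merge_posterior_mvs` and the choice to keep the first occurrence of each motion vector, preserving order, are assumptions of this sketch.

```python
def merge_posterior_mvs(per_block_mvs):
    """Merge the posterior motion vectors of all reconstructed image blocks
    and drop duplicates, keeping the first occurrence of each vector."""
    seen = set()
    merged = []
    for block_mvs in per_block_mvs:
        for mv in block_mvs:
            if mv not in seen:
                seen.add(mv)
                merged.append(mv)
    return merged
```

The result can serve as the list of prior candidate motion vectors of the current block, or a subset of it can be taken.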

The respective motion vectors of the multiple reconstructed image blocks can be input into the trained neural network to obtain multiple prior candidate motion vectors of the current block and multiple probability values corresponding to the multiple prior candidate motion vectors. For the neural network, reference may be made to the description of the training engine 25 above, which will not be repeated here.

Optionally, the multiple a posteriori motion vectors of the multiple reconstructed image blocks and the multiple prediction error values corresponding to the multiple posterior motion vectors can be input into a trained neural network to obtain multiple prior candidate motion vectors of the current block and multiple probability values corresponding to the multiple prior candidate motion vectors.

Optionally, the multiple posterior motion vectors of the multiple reconstructed image blocks and the multiple probability values corresponding to the multiple posterior motion vectors can be input into the trained neural network to obtain multiple prior candidate motion vectors of the current block and multiple probability values corresponding to the multiple prior candidate motion vectors.

Optionally, the optimal motion vectors of the multiple reconstructed image blocks can be input into a trained neural network to obtain multiple prior candidate motion vectors of the current block and multiple probability values corresponding to the multiple prior candidate motion vectors.

Step 803: Obtain multiple weighting factors corresponding to the multiple prior candidate motion vectors according to multiple probability values corresponding to multiple prior candidate motion vectors.

When the sum of the plurality of probability values is 1, the probability value corresponding to the first a priori candidate motion vector is used as the weighting factor corresponding to the first a priori candidate motion vector. That is, the respective weighting factors of the multiple prior candidate motion vectors are the respective probability values of the multiple prior candidate motion vectors. Alternatively, when the sum of the multiple probability values is not 1, the multiple probability values are normalized, and the normalized value of the probability value corresponding to the first a priori candidate motion vector is used as the weighting factor corresponding to the first a priori candidate motion vector. That is, the respective weighting factors of the multiple prior candidate motion vectors are the normalized values of the respective probability values of the multiple prior candidate motion vectors. The above-mentioned first a priori candidate motion vector is any one of the plurality of a priori candidate motion vectors. It can be seen that the sum of the multiple weighting factors corresponding to the multiple prior candidate motion vectors is 1.
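The derivation of the weighting factors in step 803 may be sketched as follows. The function name `weight_factors`, the tolerance used to decide whether the sum equals 1, and normalization by division by the sum are assumptions of this sketch.

```python
def weight_factors(probabilities, tol=1e-9):
    """Return one weighting factor per prior candidate motion vector.

    If the probability values already sum to 1 (within tol), they are used
    directly; otherwise they are divided by their sum so that the weighting
    factors sum to 1.
    """
    total = sum(probabilities)
    if total <= 0:
        raise ValueError("probability values must have a positive sum")
    if abs(total - 1.0) <= tol:
        return list(probabilities)
    return [p / total for p in probabilities]
```

For example, probability values of 2, 1 and 1 (which do not sum to 1) give weighting factors of 0.5, 0.25 and 0.25.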

Step 804: Perform motion compensation respectively according to the multiple prior candidate motion vectors to obtain multiple predicted values.

According to the principle of inter-frame prediction, a reference block can be found in the reference frame of the current block according to a candidate motion vector, and inter-frame prediction is performed on the current block according to the reference block to obtain the predicted value corresponding to the candidate motion vector. The predicted values are in one-to-one correspondence with the candidate motion vectors. Therefore, motion compensation is performed according to each of the multiple a priori candidate motion vectors, and multiple predicted values of the current block can be obtained.
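A simplified form of the motion compensation in step 804 may be sketched as follows. The sketch assumes integer-pel motion vectors given as (dx, dy), a reference frame stored as a list of rows, and clamping of out-of-range coordinates to the frame border; sub-pel interpolation is not modeled.

```python
def motion_compensate(reference, x, y, width, height, mv):
    """Fetch the reference block that the motion vector points to.

    (x, y) is the top-left corner of the current block and mv = (dx, dy).
    Coordinates outside the reference frame are clamped to the border
    (an assumption standing in for reference frame padding).
    """
    dx, dy = mv
    max_y = len(reference) - 1
    max_x = len(reference[0]) - 1
    block = []
    for j in range(height):
        row = []
        for i in range(width):
            ref_y = min(max(y + j + dy, 0), max_y)
            ref_x = min(max(x + i + dx, 0), max_x)
            row.append(reference[ref_y][ref_x])
        block.append(row)
    return block

def predict_all(reference, x, y, width, height, candidate_mvs):
    """Produce one predicted block per candidate motion vector."""
    return [motion_compensate(reference, x, y, width, height, mv)
            for mv in candidate_mvs]
```

Calling `predict_all` with the list of prior candidate motion vectors returns the predicted values in the same order as the candidates.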

Step 805: Obtain the prediction value of the current block according to the weighted summation of multiple weighting factors and multiple prediction values.

Multiply the weight factor corresponding to the same prior candidate motion vector and the predicted value, and then add up multiple products corresponding to multiple prior candidate motion vectors to obtain the predicted value of the current block.
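The weighted summation of step 805 may be sketched as follows. The representation of each predicted value as a two-dimensional list of samples and the function name `weighted_prediction` are assumptions of this sketch; rounding and clipping to the sample bit depth are not modeled.

```python
def weighted_prediction(predictions, weights):
    """Combine the predicted blocks into the predicted value of the current
    block: each sample is the sum of weight * predicted sample over all
    prior candidate motion vectors."""
    if len(predictions) != len(weights):
        raise ValueError("one weighting factor is needed per predicted value")
    height = len(predictions[0])
    width = len(predictions[0][0])
    return [[sum(w * p[r][c] for p, w in zip(predictions, weights))
             for c in range(width)]
            for r in range(height)]
```

With two predicted blocks whose samples are 100 and 200 and weighting factors of 0.75 and 0.25, every sample of the result is 125.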

In a possible implementation manner, after obtaining the reconstruction value of the current block, the motion vector of the current block and its related information can be obtained immediately. For the motion vector and its related information, refer to step 801, and the obtaining method includes:

In a possible implementation manner, the multiple probability values of the current block include M probability values, each of which is greater than every other probability value among the multiple probability values of the current block. Therefore, the M a priori candidate motion vectors corresponding to the M probability values can be selected from the multiple a priori candidate motion vectors of the current block; M weighting factors are then obtained according to the M probability values, motion compensation is performed according to each of the M a priori candidate motion vectors to obtain M predicted values of the current block, and finally the predicted value of the current block is obtained by weighted summation of the M weighting factors and the M predicted values. That is, the M largest probability values are selected from the multiple probability values corresponding to the multiple prior candidate motion vectors of the current block, the M a priori candidate motion vectors corresponding to these M probability values are selected from the multiple prior candidate motion vectors of the current block, the weighting factors and the predicted values are calculated based on the M probability values and the M a priori candidate motion vectors, and the predicted value of the current block is then obtained. Among the multiple probability values corresponding to the multiple prior candidate motion vectors, the remaining probability values other than the aforementioned M probability values can be ignored because they are small, which reduces the amount of calculation and improves the efficiency of inter-frame prediction.
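The selection of the M largest probability values and of their prior candidate motion vectors may be sketched as follows. The function name `select_top_m` is hypothetical, and renormalizing the M retained probability values so that the weighting factors sum to 1 is an assumption that follows the normalization described for step 803.

```python
def select_top_m(candidates, probabilities, m):
    """Keep the M prior candidate motion vectors with the largest
    probability values and derive their weighting factors."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i],
                   reverse=True)[:m]
    top_mvs = [candidates[i] for i in order]
    top_probs = [probabilities[i] for i in order]
    total = sum(top_probs)
    # The retained probabilities no longer sum to 1, so normalize them.
    weights = [p / total for p in top_probs]
    return top_mvs, weights
```

Only the M selected motion vectors then need motion compensation, which is where the reduction in computation comes from.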

The technical solutions of the method embodiment shown in FIG. 8 are described in detail below by using several specific embodiments.

Example 1

In this embodiment, multiple prior candidate motion vectors of the current block and multiple probability values corresponding to the multiple prior candidate motion vectors are determined according to multiple a posteriori motion vectors of each of the multiple reconstructed image blocks in the surrounding area and multiple prediction error values corresponding to the multiple posterior motion vectors.

FIG. 9 is a flowchart of a process 900 of an inter-frame prediction method according to an embodiment of the present application. Process 900 may be performed by video encoder 20 or video decoder 30, and in particular, may be performed by inter prediction units 244, 344 of video encoder 20 or video decoder 30. Process 900 is described as a series of steps or operations; it should be understood that process 900 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 9. Assuming that a video encoder or a video decoder is being used on a video data stream having multiple image frames, process 900, comprising the following steps, is performed to inter-predict an image or image block. Process 900 may include:

Step 901: Acquire a plurality of a posteriori motion vectors of each of the plurality of reconstructed image blocks in the surrounding area and a plurality of prediction error values corresponding to the plurality of a posteriori motion vectors.

The following description takes one reconstructed image block as an example. The reconstructed image block can be any one of the multiple reconstructed image blocks in the surrounding area. For the other reconstructed image blocks, the multiple posterior motion vectors and the multiple prediction error values corresponding to the multiple a posteriori motion vectors can be obtained with reference to this method.

The reconstructed image block has N4 a posteriori candidate motion vectors, which are obtained according to a plurality of prior candidate motion vectors of the reconstructed image block. For the acquisition method, refer to the description of step 801 above. Motion compensation is performed according to each of the N4 a posteriori candidate motion vectors, and N4 predicted values of the reconstructed image block can be obtained. The N4 predicted values correspond to the N4 posterior candidate motion vectors; that is, inter-frame prediction is performed on the reconstructed image block by using the reference block indicated by one posterior candidate motion vector, to obtain one predicted value of the reconstructed image block. The N4 predicted values are each compared with the reconstructed value of the reconstructed image block to obtain N4 prediction error values of the reconstructed image block, and the N4 prediction error values correspond to the N4 a posteriori candidate motion vectors. The present application may adopt methods such as the sum of absolute differences (SAD) or the sum of squared errors (SSE) to obtain the prediction error value of the reconstructed image block corresponding to a certain posterior candidate motion vector.
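The SAD and SSE measures named above may be sketched as follows, with blocks represented as two-dimensional lists of samples (an assumption of this sketch). The helper `prediction_errors` is a hypothetical name for computing one error value per predicted value.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two blocks of equal size."""
    return sum(abs(p - q)
               for row_a, row_b in zip(block_a, block_b)
               for p, q in zip(row_a, row_b))

def sse(block_a, block_b):
    """Sum of squared errors between two blocks of equal size."""
    return sum((p - q) ** 2
               for row_a, row_b in zip(block_a, block_b)
               for p, q in zip(row_a, row_b))

def prediction_errors(reconstructed, predictions, metric=sad):
    """One prediction error value per predicted value, in the same order
    as the posterior candidate motion vectors that produced them."""
    return [metric(reconstructed, pred) for pred in predictions]
```

For a reconstructed block `[[10, 12], [8, 9]]` and a predicted block `[[11, 10], [8, 12]]`, the SAD is 6 and the SSE is 14.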

The N2 a posteriori motion vectors of the reconstructed image block may refer to the above-mentioned N4 a posteriori candidate motion vectors, or may refer to some of the above-mentioned N4 a posteriori candidate motion vectors, such as multiple specified motion vectors selected from the above-mentioned N4 a posteriori candidate motion vectors.

Correspondingly, the number of prediction error values corresponding to the N2 posterior motion vectors of the reconstructed image block is also N2.

All the a posteriori motion vectors of the multiple reconstructed image blocks can be represented as an N2×Q two-dimensional matrix, where N2 is the number of a posteriori motion vectors of each reconstructed image block and Q is the number of reconstructed image blocks. The element indexed by k = 0, 1, ..., Q-1 and n = 0, 1, ..., N2-1 is the a posteriori motion vector indicated by n of the reconstructed image block indicated by k.

All the prediction error values of the multiple reconstructed image blocks can also be represented as an N2×Q two-dimensional matrix. The element indexed by k = 0, 1, ..., Q-1 and n = 0, 1, ..., N2-1 is the prediction error value corresponding to the a posteriori motion vector indicated by n of the reconstructed image block indicated by k.

Step 902: Obtain multiple a priori candidate motion vectors of the current block, and multiple prediction error values corresponding to the multiple a priori candidate motion vectors, according to the multiple a posteriori motion vectors of the multiple reconstructed image blocks and the multiple prediction error values corresponding to the multiple a posteriori motion vectors.

The present application can input all the prediction error values and all the a posteriori motion vectors of the multiple reconstructed image blocks, that is, the above two N2×Q two-dimensional matrices, into a trained neural network, and the neural network outputs multiple a priori candidate motion vectors and multiple prediction error values corresponding to the multiple a priori candidate motion vectors. For the neural network, reference may be made to the description of the training engine 25 above, which is not repeated here.
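As an illustration of the input layout only, the following sketch arranges per-block lists into the N2×Q matrix form described above. The helper name `to_n2_by_q` and the choice of nested lists are assumptions made for this example; the neural network itself is not modeled here.

```python
def to_n2_by_q(per_block):
    """Arrange Q per-block lists of N2 entries into an N2 x Q matrix.

    per_block[k][n] is the entry indicated by n for the block indicated by k;
    the result stores it at matrix[n][k].
    """
    n2 = len(per_block[0])
    if any(len(block) != n2 for block in per_block):
        raise ValueError("every block must provide the same number of entries")
    return [[block[n] for block in per_block] for n in range(n2)]


# Q = 3 reconstructed blocks, N2 = 2 a posteriori motion vectors per block.
block_mvs = [[(0, 0), (1, 0)], [(2, 1), (2, 2)], [(0, -1), (1, -1)]]
block_errors = [[12, 30], [7, 9], [40, 5]]

mv_matrix = to_n2_by_q(block_mvs)        # N2 x Q matrix of motion vectors
error_matrix = to_n2_by_q(block_errors)  # N2 x Q matrix of prediction errors
```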

The multiple a priori candidate motion vectors of the current block can be represented as an N1×S two-dimensional matrix, where N1 is the number of a priori candidate motion vectors of the current block and S is the number of basic unit image blocks or pixels contained in the current block; if the current block is not further divided, then S=1. The element indexed by l = 0, 1, ..., S-1 and n = 0, 1, ..., N1-1 is the a priori candidate motion vector indicated by n of the basic unit image block or pixel indicated by l.

The multiple prediction error values of the current block corresponding to the multiple a priori candidate motion vectors can also be represented as an N1×S two-dimensional matrix. The element indexed by l = 0, 1, ..., S-1 and n = 0, 1, ..., N1-1 is the probability that the a priori candidate motion vector indicated by n of the basic unit image block or pixel indicated by l is the optimal motion vector for that basic unit image block or pixel.

Optionally, for a fixed l, the N1 probability values sum to 1, that is, the sum of the N1 probability values corresponding to the N1 a priori candidate motion vectors of the basic unit image block or pixel indicated by l is 1. Alternatively, the probability values can be expressed as integers, in which case, for a fixed l, the N1 integer values sum to 256. The value 256 is related to the number of binary bits used for the integer representation: here each integer value is represented in 8 bits. The sum can also be equal to 128, 512, etc.
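As an illustration of the integer expression described above, the following sketch converts N1 normalized probability values into integers whose sum is exactly 2^bits (256 for 8 bits). The function name `quantize_probabilities` and the largest-remainder rounding rule are assumptions made for this example; the application does not state how rounding is performed.

```python
def quantize_probabilities(probs, bits=8):
    """Map non-negative probabilities that sum to 1 onto integers summing to 2**bits.

    Rounding rule (an assumption): floor every scaled value, then hand the
    remaining units to the entries with the largest fractional parts.
    """
    total = 1 << bits
    scaled = [p * total for p in probs]
    ints = [int(s) for s in scaled]  # floor, since inputs are non-negative
    shortfall = max(total - sum(ints), 0)
    by_remainder = sorted(range(len(probs)),
                          key=lambda i: scaled[i] - ints[i],
                          reverse=True)
    for i in by_remainder[:shortfall]:
        ints[i] += 1
    return ints


exact = quantize_probabilities([0.5, 0.25, 0.25])   # values already on the grid
rounded = quantize_probabilities([0.5, 0.3, 0.2])   # needs one unit redistributed
coarse = quantize_probabilities([0.5, 0.25, 0.25], bits=7)  # sum of 128
```

Keeping the sum equal to a power of two allows the later weighted summation to be normalized with a bit shift instead of a division.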

Step 903: Obtain multiple weighting factors corresponding to the multiple prior candidate motion vectors according to multiple prediction error values of the current block corresponding to multiple prior candidate motion vectors.

The multiple weighting factors corresponding to the multiple a priori candidate motion vectors of the current block can also be represented as an N1×S two-dimensional matrix. The element indexed by l = 0, 1, ..., S-1 and n = 0, 1, ..., N1-1 is the weighting factor of the a priori candidate motion vector indicated by n of the basic unit image block or pixel indicated by l.

If the N1 probability values corresponding to the N1 a priori candidate motion vectors of the basic unit image block or pixel indicated by l in the current block are normalized, that is, they sum to 1, then the N1 probability values can be used directly as the N1 weighting factors corresponding to the N1 a priori candidate motion vectors; for example, each weighting factor is equal to the corresponding probability value.

If the N1 probability values corresponding to the N1 a priori candidate motion vectors of the basic unit image block or pixel indicated by l in the current block are not normalized, the N1 probability values may first be normalized, and the normalized values are then used as the N1 weighting factors corresponding to the N1 a priori candidate motion vectors. Therefore, in either case, for a fixed l, the N1 weighting factors sum to 1.
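The two cases above can be sketched as a single helper. This is a minimal illustration; the name `weights_from_probabilities` and the tolerance used to decide whether the values are already normalized are assumptions made for this example.

```python
def weights_from_probabilities(probs, tol=1e-9):
    """Return one weighting factor per probability value; the weights sum to 1.

    If the values already sum to 1 (within tol) they are used unchanged,
    otherwise each value is divided by the sum.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probability values must have a positive sum")
    if abs(total - 1.0) <= tol:
        return list(probs)
    return [p / total for p in probs]


already_normalized = weights_from_probabilities([0.5, 0.25, 0.25])
needs_normalizing = weights_from_probabilities([2, 1, 1])
```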

Step 904: Perform motion compensation respectively according to the multiple prior candidate motion vectors to obtain multiple predicted values.

A certain prior candidate motion vector is taken as an example for description. The prior candidate motion vector is any one of a plurality of prior candidate motion vectors, and other prior candidate motion vectors may refer to this method.

A prediction value of the current block is obtained by performing motion compensation according to the prior candidate motion vector, so N1 prediction values can be obtained from N1 prior candidate motion vectors.
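A minimal sketch of this step, assuming integer-pixel motion vectors given as (dx, dy) and clamping of coordinates at the frame border. Practical codecs typically also apply sub-pixel interpolation, which is omitted here, and the function name `motion_compensate` is hypothetical.

```python
def motion_compensate(reference, top, left, height, width, mv):
    """Copy the reference block displaced by mv = (dx, dy) from position (top, left).

    Coordinates outside the reference frame are clamped to the nearest
    border sample (an assumption made for this sketch).
    """
    dx, dy = mv
    ref_h, ref_w = len(reference), len(reference[0])
    block = []
    for i in range(height):
        y = min(max(top + dy + i, 0), ref_h - 1)
        row = []
        for j in range(width):
            x = min(max(left + dx + j, 0), ref_w - 1)
            row.append(reference[y][x])
        block.append(row)
    return block


# 6x6 reference frame whose sample value encodes its position (row*10 + column).
reference = [[r * 10 + c for c in range(6)] for r in range(6)]
candidate_mvs = [(0, 0), (1, 0), (-1, 2)]

# One predicted 2x2 block per a priori candidate motion vector.
predictions = [motion_compensate(reference, 2, 2, 2, 2, mv) for mv in candidate_mvs]
```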

The multiple predicted values of the current block can be represented as a BH×WH×S three-dimensional matrix, where BH×WH represents the size of the basic unit image blocks contained in the current block, and S is the number of basic unit image blocks or pixels contained in the current block; if the current block is not further divided, then S=1. For l = 0, 1, ..., S-1 and n = 0, 1, ..., N1-1, each element is the predicted value, corresponding to the a priori candidate motion vector indicated by n, of the pixel in the i-th row and the j-th column of the basic unit image block indicated by l.

Step 905: Obtain the predicted value of the current block according to the weighted summation of the multiple weighting factors and the multiple predicted values.

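The predicted value of the current block is obtained by multiplying the weighting factor and the predicted value that correspond to the same a priori candidate motion vector, and then adding up the products over all the a priori candidate motion vectors. In the current block, the predicted value of the pixel in the i-th row and the j-th column of the basic unit image block indicated by l can thus be expressed as the sum, over the N1 a priori candidate motion vectors indicated by n, of the weighting factor of that candidate for the basic unit image block indicated by l multiplied by the predicted value of that pixel for that candidate.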
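The weighted summation can be sketched as follows for one basic unit image block. The function name `weighted_prediction` and the nested-list block layout are assumptions made for this example.

```python
def weighted_prediction(predictions, weights):
    """Per-pixel weighted sum of N1 predicted blocks of identical size.

    predictions[n][i][j] is the predicted value of pixel (i, j) for the
    candidate indicated by n, and weights[n] is that candidate's weighting factor.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per predicted block")
    rows, cols = len(predictions[0]), len(predictions[0][0])
    out = [[0.0] * cols for _ in range(rows)]
    for pred, w in zip(predictions, weights):
        for i in range(rows):
            for j in range(cols):
                out[i][j] += w * pred[i][j]
    return out


predictions = [
    [[100, 104], [108, 112]],  # predicted block for candidate 0
    [[96, 100], [104, 108]],   # predicted block for candidate 1
]
weights = [0.75, 0.25]  # weighting factors that sum to 1
block = weighted_prediction(predictions, weights)
```

Because the weights sum to 1, every output sample lies between the smallest and largest predicted value at that position.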

Embodiment 2

In this embodiment, multiple a priori candidate motion vectors of the current block and multiple probability values corresponding to the multiple a priori candidate motion vectors are determined according to the multiple a posteriori motion vectors, and the multiple probability values corresponding to them, of each of the multiple reconstructed image blocks in the surrounding area.

FIG. 11 is a flowchart of a process 1100 of an inter-frame prediction method according to an embodiment of the present application. Process 1100 may be performed by video encoder 20 or video decoder 30, and in particular by the inter prediction units 244, 344 of video encoder 20 or video decoder 30. Process 1100 is described as a series of steps or operations; it should be understood that process 1100 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 11. Assuming that a video encoder or a video decoder is being used on a video data stream with multiple image frames, process 1100, comprising the following steps, is performed to inter-predict an image or image block. Process 1100 may include:

Step 1101: Acquire a plurality of a posteriori motion vectors and a plurality of probability values corresponding to the plurality of a posteriori motion vectors for each of the plurality of reconstructed image blocks in the surrounding area.

Step 1101 of this embodiment is different from step 901 of the above-mentioned first embodiment in that the multiple prediction error values corresponding to the multiple posterior motion vectors become multiple probability values corresponding to the multiple posterior motion vectors.

The following description takes one reconstructed image block as an example. The reconstructed image block may be any one of the multiple reconstructed image blocks in the surrounding area; for the other reconstructed image blocks, the multiple a posteriori motion vectors and the multiple probability values corresponding to them can be obtained by the same method.

The N2 a posteriori motion vectors of the reconstructed image block can be obtained by referring to the method in the above step 901, and details are not repeated here.

The N2 probability values corresponding to the N2 posterior motion vectors of the reconstructed image block can be obtained in the following two ways:

One is to obtain N2 probability values of the reconstructed image block according to the N2 prediction error values of the reconstructed image block obtained in the first embodiment.

The N2 prediction error values of the reconstructed image block form an N2-dimensional vector. With k1 being the index of the reconstructed image block, the element indexed by n = 0, 1, ..., N2-1 is the prediction error value corresponding to the a posteriori motion vector indicated by n. The N2 probability values of the reconstructed image block can likewise be represented as an N2-dimensional vector. With k1 being the index of the reconstructed image block, the element indexed by n = 0, 1, ..., N2-1 is the probability that the a posteriori motion vector indicated by n is the optimal motion vector for the reconstructed image block.

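Optionally, a normalized exponential function (softmax) can be used to convert the N2 prediction error values into the N2 probability values. As another example, a linear normalization method can be used for the conversion. In either case, for a fixed reconstructed image block, the N2 probability values sum to 1.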
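The two conversions can be sketched as follows. This is an illustration under stated assumptions, not the exact formulas of the application: the errors are negated before the softmax so that a smaller error yields a larger probability, the `scale` parameter is hypothetical, and the linear variant scores each candidate by the total error minus its own error.

```python
import math


def softmax_from_errors(errors, scale=1.0):
    """Normalized exponential of the negated errors (sign and scale are assumptions)."""
    logits = [-e / scale for e in errors]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def linear_from_errors(errors):
    """Linear normalization: score = total error - own error, then divide by the sum."""
    total = sum(errors)
    scores = [total - e for e in errors]
    score_sum = sum(scores)
    if score_sum == 0:  # single candidate, or all errors equal to zero
        return [1.0 / len(errors)] * len(errors)
    return [s / score_sum for s in scores]


errors = [0, 2, 5]  # prediction error values of three a posteriori motion vectors
p_softmax = softmax_from_errors(errors)
p_linear = linear_from_errors(errors)
```

Both variants return values that sum to 1 and rank the candidate with the smallest error highest; the softmax concentrates more probability on that candidate than the linear rule does.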

The other is to input the reconstructed value of the reconstructed image block and the N2 predicted values corresponding to the N2 a posteriori motion vectors into a trained neural network to obtain the N2 probability values corresponding to the N2 a posteriori motion vectors of the reconstructed image block. For the neural network, reference may be made to the description of the training engine 25 above, which is not repeated here.

The reconstructed value of the reconstructed image block can be obtained after the reconstructed image block is encoded, and the N2 predicted values corresponding to the N2 a posteriori motion vectors of the reconstructed image block can be obtained by referring to the method in step 901 above; details are not repeated here.

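As in step 901, all the a posteriori motion vectors of the multiple reconstructed image blocks can be represented as an N2×Q two-dimensional matrix, in which the element indexed by k = 0, 1, ..., Q-1 and n = 0, 1, ..., N2-1 is the a posteriori motion vector indicated by n of the reconstructed image block indicated by k.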

All the probability values of the multiple reconstructed image blocks can also be represented as an N2×Q two-dimensional matrix. The element indexed by k = 0, 1, ..., Q-1 and n = 0, 1, ..., N2-1 is the probability that the a posteriori motion vector indicated by n is the optimal motion vector for the reconstructed image block indicated by k.

Step 1102: Obtain multiple a priori candidate motion vectors of the current block, and multiple probability values corresponding to the multiple a priori candidate motion vectors, according to the multiple a posteriori motion vectors of the multiple reconstructed image blocks and the multiple probability values corresponding to the multiple a posteriori motion vectors.

Step 1102 of this embodiment differs from step 902 of the first embodiment in that the multiple prediction error values corresponding to the multiple a posteriori motion vectors that are input to the neural network are replaced by the multiple probability values corresponding to the multiple a posteriori motion vectors.

Step 1103: Obtain multiple weighting factors corresponding to the multiple prior candidate motion vectors according to multiple probability values of the current block corresponding to multiple prior candidate motion vectors.

Step 1104: Perform motion compensation respectively according to the multiple prior candidate motion vectors to obtain multiple predicted values.

Step 1105: Obtain the prediction value of the current block according to the weighted summation of multiple weighting factors and multiple prediction values.

For steps 1103-1105 in this embodiment, reference may be made to steps 903-905 in Embodiment 1, and details are not repeated here.

Embodiment 3

In this embodiment, according to the respective optimal motion vectors of multiple reconstructed image blocks in the surrounding area, multiple prior candidate motion vectors of the current block and multiple probability values corresponding to the multiple prior candidate motion vectors are determined.

FIG. 12 is a flowchart of a process 1200 of an inter-frame prediction method according to an embodiment of the present application. Process 1200 may be performed by video encoder 20 or video decoder 30, and in particular by the inter prediction units 244, 344 of video encoder 20 or video decoder 30. Process 1200 is described as a series of steps or operations; it should be understood that process 1200 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 12. Assuming that a video encoder or a video decoder is being used on a video data stream with multiple image frames, process 1200, comprising the following steps, is performed to inter-predict an image or image block. Process 1200 may include:

Step 1201: Obtain the respective optimal motion vectors of multiple reconstructed image blocks in the surrounding area.

Step 1201 of this embodiment is different from step 901 of the above-mentioned first embodiment in that the multiple posterior motion vectors and multiple prediction error values corresponding to the multiple posterior motion vectors become optimal motion vectors.

The following description takes one reconstructed image block as an example. The reconstructed image block may be any one of the multiple reconstructed image blocks in the surrounding area; for the other reconstructed image blocks, the optimal motion vector can be obtained by the same method.

The optimal motion vector of the reconstructed image block can be obtained in the following two ways:

One is to obtain the optimal motion vector of the reconstructed image block according to the N2 prediction error values of the reconstructed image block obtained in the first embodiment, that is, the a posteriori motion vector corresponding to the smallest of the N2 prediction error values of the reconstructed image block is taken as the optimal motion vector of the reconstructed image block.

The other is to obtain the optimal motion vector of the reconstructed image block according to the N2 probability values of the reconstructed image block obtained in the second embodiment, that is, the a posteriori motion vector corresponding to the largest of the N2 probability values of the reconstructed image block is taken as the optimal motion vector of the reconstructed image block.
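Both selection rules can be sketched in a few lines. The function names and the (dx, dy) tuple representation of a motion vector are assumptions made for this example; when several candidates tie, this sketch returns the first one.

```python
def best_by_error(mvs, errors):
    """A posteriori motion vector with the smallest prediction error value."""
    index = min(range(len(errors)), key=errors.__getitem__)
    return mvs[index]


def best_by_probability(mvs, probs):
    """A posteriori motion vector with the largest probability value."""
    index = max(range(len(probs)), key=probs.__getitem__)
    return mvs[index]


mvs = [(0, 0), (1, -1), (2, 0)]  # N2 = 3 a posteriori motion vectors
optimal_from_errors = best_by_error(mvs, [14, 3, 9])
optimal_from_probs = best_by_probability(mvs, [0.2, 0.1, 0.7])
```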

Step 1202: Obtain multiple prior candidate motion vectors of the current block and multiple probability values corresponding to the multiple prior candidate motion vectors according to the respective optimal motion vectors of the multiple reconstructed image blocks.

Step 1202 of this embodiment differs from step 902 of the first embodiment in that the multiple a posteriori motion vectors and the corresponding multiple prediction error values that are input to the neural network are replaced by the respective optimal motion vectors of the multiple reconstructed image blocks.

Step 1203: Obtain multiple weighting factors corresponding to the multiple prior candidate motion vectors according to multiple probability values of the current block corresponding to multiple prior candidate motion vectors.

Step 1204: Perform motion compensation respectively according to the multiple prior candidate motion vectors to obtain multiple predicted values.

Step 1205: Obtain the predicted value of the current block according to the weighted summation of multiple weighting factors and multiple predicted values.

For steps 1203-1205 in this embodiment, reference may be made to steps 903-905 in the first embodiment, and details are not repeated here.

FIG. 13 is a schematic structural diagram of an inter-frame prediction apparatus 1300 according to an embodiment of the present application. The inter-frame prediction apparatus 1300 includes a motion estimation unit 1301 and an inter-frame prediction processing unit 1302. The motion estimation unit 1301 is configured to obtain the respective motion vectors of P reconstructed image blocks in the surrounding area of the current block, the surrounding area including the spatial neighborhood and/or temporal neighborhood of the current block. The inter prediction processing unit 1302 is configured to: obtain Q a priori candidate motion vectors of the current block and Q probability values corresponding to the Q a priori candidate motion vectors according to the respective motion vectors of the P reconstructed image blocks; obtain M weighting factors corresponding to M a priori candidate motion vectors according to the M probability values corresponding to the M a priori candidate motion vectors, where M, P, and Q are positive integers and M is less than or equal to Q; perform motion compensation according to each of the M a priori candidate motion vectors to obtain M predicted values; and perform weighted summation of the M predicted values and the corresponding M weighting factors to obtain the predicted value of the current block. In one example, the inter prediction apparatus 1300 including the motion estimation unit 1301 and the inter prediction processing unit 1302 may correspond to the inter prediction unit 244 in FIG. 2, or to the inter prediction unit 344 in FIG. 3.
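When M is smaller than Q, the claims state that the M probability values are those greater than the remaining probability values. A minimal sketch of that selection, followed by the normalization that turns the kept values into weighting factors, is given below. The function name `select_top_m` and the data layout are assumptions made for this example.

```python
def select_top_m(mvs, probs, m):
    """Keep the M candidates with the largest probability values.

    Returns the kept motion vectors and their weighting factors, obtained by
    normalizing the kept probability values so that they sum to 1.
    """
    if not 1 <= m <= len(probs):
        raise ValueError("M must be between 1 and Q")
    # sorted() is stable, so tied candidates keep their original order.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:m]
    total = sum(probs[i] for i in order)
    return [mvs[i] for i in order], [probs[i] / total for i in order]


# Q = 4 a priori candidate motion vectors with their probability values.
candidate_mvs = [(0, 0), (1, 0), (0, 1), (2, 2)]
candidate_probs = [0.5, 0.125, 0.25, 0.125]

kept_mvs, kept_weights = select_top_m(candidate_mvs, candidate_probs, 2)
```

With M equal to Q the same helper simply reorders the candidates and leaves already normalized probability values unchanged.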

In the implementation process, each step of the above method embodiments may be completed by an integrated logic circuit of hardware in a processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware coding processor, or executed by a combination of hardware and software modules in the coding processor. The software modules may be located in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.

The memory mentioned in the above embodiments may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory. Volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM) and direct rambus random access memory (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to include, but not be limited to, these and any other suitable types of memory.

Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the above-described systems, devices and units may refer to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. For example, the division of the units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be indirect coupling or communication connection of devices or units through some interfaces, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a mobile hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disk.

The above are only specific embodiments of the present application, but the protection scope of the present application is not limited to them. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall be covered within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

An inter-frame prediction method, comprising:

obtaining the respective motion vectors of the P reconstructed image blocks in the surrounding area of the current block, the surrounding area including the spatial neighborhood and/or the temporal neighborhood of the current block;

obtaining Q a priori candidate motion vectors of the current block and Q probability values corresponding to the Q a priori candidate motion vectors according to the respective motion vectors of the P reconstructed image blocks;

obtaining M weighting factors corresponding to M a priori candidate motion vectors according to the M probability values corresponding to the M a priori candidate motion vectors, where M, P and Q are positive integers, and M is less than or equal to Q;

performing motion compensation respectively according to the M a priori candidate motion vectors to obtain M predicted values; and

obtaining the predicted value of the current block by weighted summation of the M predicted values and the corresponding M weighting factors.
The method according to claim 1, wherein obtaining, according to the respective motion vectors of the P reconstructed image blocks, the Q a priori candidate motion vectors of the current block and the Q probability values corresponding to the Q a priori candidate motion vectors comprises:

inputting the respective motion vectors of the P reconstructed image blocks into a trained neural network to obtain the Q a priori candidate motion vectors and the Q probability values corresponding to the Q a priori candidate motion vectors.
The method according to claim 1 or 2, wherein obtaining the M weighting factors corresponding to the M a priori candidate motion vectors according to the M probability values corresponding to the M a priori candidate motion vectors comprises:

when the sum of the M probability values is 1, using the probability value corresponding to a first a priori candidate motion vector as the weighting factor corresponding to the first a priori candidate motion vector; or,

when the sum of the M probability values is not 1, normalizing the M probability values, and using the normalized value of the probability value corresponding to the first a priori candidate motion vector as the weighting factor corresponding to the first a priori candidate motion vector;

wherein the first a priori candidate motion vector is any one of the M a priori candidate motion vectors.
The method according to any one of claims 1-3, wherein M is equal to Q, and the M probability values are the Q probability values.
The method according to any one of claims 1-3, wherein M is smaller than Q, and each of the M probability values is greater than every other probability value among the Q probability values.
The method according to any one of claims 1-5, wherein obtaining, according to the respective motion vectors of the P reconstructed image blocks, the Q a priori candidate motion vectors of the current block and the Q probability values corresponding to the Q a priori candidate motion vectors comprises:

inputting multiple a posteriori motion vectors of the P reconstructed image blocks and multiple probability values corresponding to the multiple a posteriori motion vectors into a trained neural network to obtain the Q a priori candidate motion vectors of the current block and the Q probability values corresponding to the Q a priori candidate motion vectors; wherein the multiple a posteriori motion vectors of a reconstructed image block and the multiple probability values corresponding to the multiple a posteriori motion vectors are determined according to the reconstructed value of the reconstructed image block and predicted values corresponding to multiple a posteriori candidate motion vectors, the reconstructed image block being any one of the P reconstructed image blocks.
The method according to any one of claims 1-5, wherein obtaining, according to the respective motion vectors of the P reconstructed image blocks, the Q a priori candidate motion vectors of the current block and the Q probability values corresponding to the Q a priori candidate motion vectors comprises:

inputting multiple a posteriori motion vectors of each of the P reconstructed image blocks and multiple prediction error values corresponding to the multiple a posteriori motion vectors into a trained neural network to obtain the Q a priori candidate motion vectors of the current block and the Q probability values corresponding to the Q a priori candidate motion vectors; wherein the multiple a posteriori motion vectors of a reconstructed image block and the multiple prediction error values corresponding to the multiple a posteriori motion vectors are determined according to the reconstructed value of the reconstructed image block and predicted values corresponding to multiple a posteriori candidate motion vectors, the reconstructed image block being any one of the P reconstructed image blocks.
The method according to any one of claims 1-5, wherein obtaining, according to the respective motion vectors of the P reconstructed image blocks, the Q a priori candidate motion vectors of the current block and the Q probability values corresponding to the Q a priori candidate motion vectors comprises:

inputting the respective optimal motion vectors of the P reconstructed image blocks into a trained neural network to obtain the Q a priori candidate motion vectors of the current block and the Q probability values corresponding to the Q a priori candidate motion vectors; wherein the optimal motion vector of a reconstructed image block is the a posteriori motion vector with the largest probability value or the smallest prediction error value among multiple a posteriori motion vectors of the reconstructed image block, the reconstructed image block being any one of the multiple reconstructed image blocks; wherein,

the multiple a posteriori motion vectors of the reconstructed image block correspond to multiple probability values, and the multiple a posteriori motion vectors and the multiple probability values corresponding to the multiple a posteriori motion vectors are determined according to the reconstructed value of the reconstructed image block and predicted values corresponding to multiple a posteriori candidate motion vectors; or,

the multiple a posteriori motion vectors of the reconstructed image block correspond to multiple prediction error values, and the multiple a posteriori motion vectors and the multiple prediction error values corresponding to the multiple a posteriori motion vectors are determined according to the reconstructed value of the reconstructed image block and predicted values corresponding to multiple a posteriori candidate motion vectors.
The method of claim 6, further comprising:

acquiring a training data set, wherein the training data set includes information of multiple groups of image blocks, and the information of each group of image blocks includes multiple a posteriori motion vectors of each of multiple reconstructed image blocks, multiple probability values corresponding to those a posteriori motion vectors, multiple a posteriori motion vectors of the current block, and multiple probability values corresponding to the multiple a posteriori motion vectors of the current block, the multiple reconstructed image blocks being image blocks in the spatial neighborhood and/or temporal neighborhood of the current block;

obtaining the neural network by training according to the training data set.
The method of claim 7, further comprising:

acquiring a training data set, wherein the training data set includes information of multiple groups of image blocks, and the information of each group of image blocks includes multiple a posteriori motion vectors of each of multiple reconstructed image blocks, multiple prediction error values corresponding to those a posteriori motion vectors, multiple a posteriori motion vectors of the current block, and multiple probability values corresponding to the multiple a posteriori motion vectors of the current block, the multiple reconstructed image blocks being image blocks in the spatial neighborhood and/or temporal neighborhood of the current block;

obtaining the neural network by training according to the training data set.
The method of claim 8, further comprising:

Obtaining a training data set, wherein the training data set includes information of multiple groups of image blocks, wherein the information of each group of image blocks includes respective optimal motion vectors of a plurality of reconstructed image blocks, a plurality of a posteriori motion vectors of the current block, and a plurality of probability values corresponding to the plurality of a posteriori motion vectors of the current block, the plurality of reconstructed image blocks being image blocks in a spatial neighborhood and/or a temporal neighborhood of the current block;

The neural network is obtained by training according to the training data set.
The method according to any one of claims 9-11, wherein the neural network at least includes a convolution layer and an activation layer.
The method according to claim 12, wherein the depth of the convolution kernel of the convolution layer is 2, 3, 4, 5, 6, 16, 24, 32, 48, 64 or 128; and the size of the convolution kernel in the convolution layer is 1×1, 3×3, 5×5, or 7×7.
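For orientation, a convolution layer followed by an activation layer could be sketched as below. This is a minimal NumPy sketch under stated assumptions: the names `conv2d_same` and `relu` are hypothetical, the kernel depth is read here as the number of input channels, zero padding is used to preserve the spatial size, and ReLU is chosen as the activation. The claims specify only the admissible kernel depths and sizes, not these details.

```python
import numpy as np

def conv2d_same(x, kernels):
    """Apply a convolution layer with zero padding (output keeps H x W).

    x:       array of shape (C_in, H, W)
    kernels: array of shape (C_out, C_in, k, k), with k odd (1, 3, 5 or 7)
    As is usual for neural networks, the kernel is not flipped
    (cross-correlation).
    """
    c_out, c_in, k, _ = kernels.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    _, h, w = x.shape
    out = np.zeros((c_out, h, w))
    for i in range(h):
        for j in range(w):
            window = xp[:, i:i + k, j:j + k]  # shape (C_in, k, k)
            # Sum over channels and kernel positions for every output channel.
            out[:, i, j] = np.tensordot(kernels, window,
                                        axes=([1, 2, 3], [0, 1, 2]))
    return out

def relu(x):
    """Activation layer (ReLU is an assumption; the claims name no function)."""
    return np.maximum(x, 0.0)

# Usage: kernel depth 2 (two input channels), kernel size 3x3, 3 output channels.
x = np.ones((2, 4, 4))
kernels = np.ones((3, 2, 3, 3))
out = relu(conv2d_same(x, kernels))
neg = relu(conv2d_same(x, -kernels))
```

The explicit loops favor readability over speed; a practical codec would rely on an optimized library implementation.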
The method according to any one of claims 9-13, wherein the neural network comprises a convolutional neural network CNN, a deep neural network DNN or a recurrent neural network RNN.
An encoder, characterized by comprising a processing circuit for executing the method according to any one of claims 1 to 14.
A decoder, characterized by comprising a processing circuit for executing the method of any one of claims 1 to 14.
A computer program product, characterized in that it includes program code, which, when executed on a computer or processor, is used to perform the method of any one of the claims.
An encoder, characterized in that it includes:

one or more processors;

A non-transitory computer-readable storage medium coupled to the processor and storing a program executed by the processor, wherein the program, when executed by the processor, causes the decoder to perform the method described in any one of the claims.
A decoder, characterized in that it includes:

one or more processors;

A non-transitory computer-readable storage medium coupled to the processor and storing a program executed by the processor, wherein the program, when executed by the processor, causes the encoder to perform the method described in any one of the claims.
A non-transitory computer-readable storage medium, characterized in that it includes program code, which, when executed by a computer device, is used to perform the method of any one of the claims.