CN114286100A - Inter-frame prediction method and device - Google Patents
- Publication number
- CN114286100A CN114286100A CN202011043942.XA CN202011043942A CN114286100A CN 114286100 A CN114286100 A CN 114286100A CN 202011043942 A CN202011043942 A CN 202011043942A CN 114286100 A CN114286100 A CN 114286100A
- Authority
- CN
- China
- Prior art keywords
- motion vectors
- block
- reconstructed image
- candidate motion
- prediction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 190
- 239000013598 vector Substances 0.000 claims abstract description 605
- 238000013528 artificial neural network Methods 0.000 claims abstract description 118
- 230000002093 peripheral effect Effects 0.000 claims abstract description 30
- 238000012545 processing Methods 0.000 claims description 108
- 238000012549 training Methods 0.000 claims description 68
- 238000003860 storage Methods 0.000 claims description 24
- 230000002123 temporal effect Effects 0.000 claims description 22
- 238000013527 convolutional neural network Methods 0.000 claims description 21
- 230000004913 activation Effects 0.000 claims description 13
- 238000010606 normalization Methods 0.000 claims description 10
- 230000000306 recurrent effect Effects 0.000 claims description 5
- 238000004590 computer program Methods 0.000 claims description 2
- 230000006835 compression Effects 0.000 abstract description 11
- 238000007906 compression Methods 0.000 abstract description 11
- 238000013473 artificial intelligence Methods 0.000 abstract description 3
- 230000002829 reductive effect Effects 0.000 abstract description 3
- 230000000875 corresponding effect Effects 0.000 description 211
- 238000013139 quantization Methods 0.000 description 79
- 230000008569 process Effects 0.000 description 71
- 239000011159 matrix material Substances 0.000 description 59
- 230000006870 function Effects 0.000 description 29
- 239000000872 buffer Substances 0.000 description 25
- 230000006854 communication Effects 0.000 description 25
- 238000004891 communication Methods 0.000 description 25
- 238000010586 diagram Methods 0.000 description 23
- 238000005192 partition Methods 0.000 description 22
- 238000011176 pooling Methods 0.000 description 19
- 238000001914 filtration Methods 0.000 description 18
- 230000005540 biological transmission Effects 0.000 description 13
- 238000004422 calculation algorithm Methods 0.000 description 12
- 230000011218 segmentation Effects 0.000 description 12
- 238000000638 solvent extraction Methods 0.000 description 12
- 238000003491 array Methods 0.000 description 10
- 230000001537 neural effect Effects 0.000 description 10
- 230000036961 partial effect Effects 0.000 description 10
- 230000003044 adaptive effect Effects 0.000 description 8
- 238000004364 calculation method Methods 0.000 description 8
- 210000002569 neuron Anatomy 0.000 description 8
- 238000013461 design Methods 0.000 description 5
- 230000014509 gene expression Effects 0.000 description 5
- 230000005055 memory storage Effects 0.000 description 5
- 230000009466 transformation Effects 0.000 description 5
- 239000011449 brick Substances 0.000 description 4
- 238000013135 deep learning Methods 0.000 description 4
- 238000012805 post-processing Methods 0.000 description 4
- 238000007781 pre-processing Methods 0.000 description 4
- 238000006243 chemical reaction Methods 0.000 description 3
- 230000008878 coupling Effects 0.000 description 3
- 238000010168 coupling process Methods 0.000 description 3
- 238000005859 coupling reaction Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 239000004973 liquid crystal related substance Substances 0.000 description 3
- 230000003287 optical effect Effects 0.000 description 3
- 230000011664 signaling Effects 0.000 description 3
- 230000003068 static effect Effects 0.000 description 3
- 230000001360 synchronised effect Effects 0.000 description 3
- 238000012935 Averaging Methods 0.000 description 2
- 238000010276 construction Methods 0.000 description 2
- 238000013500 data storage Methods 0.000 description 2
- 230000006837 decompression Effects 0.000 description 2
- 230000001419 dependent effect Effects 0.000 description 2
- 238000009795 derivation Methods 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 230000018109 developmental process Effects 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 238000003709 image segmentation Methods 0.000 description 2
- 238000003384 imaging method Methods 0.000 description 2
- 230000000670 limiting effect Effects 0.000 description 2
- 238000010801 machine learning Methods 0.000 description 2
- 238000013507 mapping Methods 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 230000007727 signaling mechanism Effects 0.000 description 2
- 230000001629 suppression Effects 0.000 description 2
- 230000007704 transition Effects 0.000 description 2
- 238000012952 Resampling Methods 0.000 description 1
- 230000003190 augmentative effect Effects 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000007175 bidirectional communication Effects 0.000 description 1
- 230000002146 bilateral effect Effects 0.000 description 1
- 210000004027 cell Anatomy 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 230000002596 correlated effect Effects 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 238000003062 neural network model Methods 0.000 description 1
- 230000001902 propagating effect Effects 0.000 description 1
- 238000013138 pruning Methods 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 230000002441 reversible effect Effects 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/136—Incoming video signal characteristics or properties
- H04N19/137—Motion inside a coding unit, e.g. average field, frame or block difference
- H04N19/139—Analysis of motion vectors, e.g. their magnitude, direction, variance or reliability
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/157—Assigned coding mode, i.e. the coding mode being predefined or preselected to be further used for selection of another element or parameter
- H04N19/159—Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/176—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The application provides an inter-frame prediction method and device. The application relates to the technical field of artificial intelligence (AI)-based video or image compression, and in particular to the technical field of neural-network-based video compression. The method comprises the following steps: acquiring respective motion vectors of P reconstructed image blocks in the peripheral region of the current block; obtaining Q prior candidate motion vectors of the current block and Q corresponding probability values according to the respective motion vectors of the P reconstructed image blocks; obtaining M weight factors corresponding to M prior candidate motion vectors according to the M probability values corresponding to those motion vectors, where M, P and Q are positive integers; performing motion compensation separately according to the M prior candidate motion vectors to obtain M predicted values; and weighting the M predicted values by the corresponding M weight factors and summing them to obtain the predicted value of the current block. The method and the device can improve the accuracy of inter-frame prediction, reduce the inter-frame prediction error and improve the rate-distortion optimization (RDO) efficiency of inter-frame prediction.
Description
Technical Field
Embodiments of the present application relate to the technical field of artificial intelligence (AI)-based video or image compression, and in particular to an inter-frame prediction method and apparatus.
Background
Video coding (video encoding and decoding) is widely used in digital video applications, such as broadcast digital television, video transmission over the internet and mobile networks, real-time conversational applications such as video chat and video conferencing, digital versatile discs (DVDs) and Blu-ray discs, video content acquisition and editing systems, and camcorders for security applications.
Even a relatively short video requires a large amount of data to describe it, which can cause difficulties when the data is to be sent or otherwise transmitted over a network with limited bandwidth capacity. Therefore, video data is typically compressed before being transmitted over modern telecommunication networks. Because memory resources may be limited, the size of the video can also be an issue when the video is stored on a storage device. Video compression devices typically use software and/or hardware on the source side to encode the video data prior to transmission or storage, thereby reducing the amount of data required to represent digital video images. The compressed data is then received on the destination side by a video decompression device. With limited network resources and an ever-increasing demand for higher video quality, there is a need for improved compression and decompression techniques that increase the compression ratio with little impact on image quality.
Prediction in video coding can be divided into intra prediction and inter prediction. Inter prediction finds, in a reconstructed image, a matching reference block for the current block of the current image, and uses the values of the pixels in the reference block as the predicted values of the pixels in the current block. The encoder tries multiple reference blocks in the reference picture for the current block, decides which reference block is suitable for the current block, and transmits the corresponding motion information to the decoder. The decoder finds the reference block of the corresponding image block according to the motion information in the bitstream and thereby obtains the prediction of that image block. The motion information includes one or two motion vectors (MVs) pointing to the reference block, and indication information (usually referred to as a reference index) of the picture in which the reference block is located. The High Efficiency Video Coding (HEVC) standard defines two inter prediction modes: the advanced motion vector prediction (AMVP) mode and the merge mode. In both modes, a candidate motion information list is constructed from the motion information of reconstructed image blocks that are spatially or temporally adjacent to the current block; optimal motion information is then determined from the candidate motion information list and used as the motion information of the current block, and the prediction of the current block is obtained based on that motion information.
How to obtain the prediction of the current block from multiple pieces of candidate motion information is key to inter prediction.
Disclosure of Invention
The present application provides an inter-frame prediction method and apparatus, which improve the accuracy of inter-frame prediction, reduce the inter-frame prediction error, and improve the rate-distortion optimization (RDO) efficiency of inter-frame prediction.
In a first aspect, the present application provides an inter prediction method, including: obtaining respective motion vectors of P reconstructed image blocks in a peripheral region of a current block, wherein the peripheral region comprises a spatial neighborhood and/or a temporal neighborhood of the current block; obtaining Q prior candidate motion vectors of the current block and Q probability values corresponding to the Q prior candidate motion vectors according to the respective motion vectors of the P reconstructed image blocks; obtaining M weight factors corresponding to the M priori candidate motion vectors according to M probability values corresponding to the M priori candidate motion vectors; m, P and Q are positive integers; respectively executing motion compensation according to the M prior candidate motion vectors to obtain M predicted values; and weighting and summing the M predicted values and the corresponding M weight factors to obtain the predicted value of the current block.
The peripheral region of the current block includes a spatial neighborhood and/or a temporal neighborhood of the current block, wherein the image blocks of the spatial neighborhood may include a left candidate image block located at the left side of the current block and an above candidate image block located above the current block.
A reconstructed image block may refer to an image block that has been encoded and then reconstructed at the encoding end, or an image block that has been decoded and reconstructed at the decoding end. A reconstructed image block may also refer to a basic-unit image block of a predetermined size obtained by dividing an encoded image block or a decoded image block.
The motion vectors of the reconstructed image block may include: (1) a plurality of a posteriori motion vectors for the reconstructed image block, the plurality of a posteriori motion vectors determined according to a reconstructed value of the reconstructed image block and predicted values corresponding to a plurality of a posteriori candidate motion vectors; or, (2) an optimal motion vector of the reconstructed image block, where the optimal motion vector is an a posteriori motion vector with a largest probability value or a smallest prediction error value among the a posteriori motion vectors.
The plurality of a posteriori candidate motion vectors of a reconstructed image block are obtained from the plurality of a priori candidate motion vectors of that reconstructed image block. Any one of the a priori candidate motion vectors of the reconstructed image block can be offset within a preset search window to generate a plurality of offset candidate motion vectors; thus one a priori candidate motion vector of a reconstructed image block may yield multiple offset candidate motion vectors. All offset candidate motion vectors obtained in this way constitute the plurality of a posteriori candidate motion vectors of the reconstructed image block. Each of the P reconstructed image blocks can obtain its plurality of a posteriori candidate motion vectors in the same manner, which is not described block by block here.
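As an illustration only (the search-window shape, the integer-sample grid and the radius below are assumptions, not values fixed by this application), the following Python sketch shows how a single a priori candidate motion vector can be offset within a preset search window to produce a set of offset candidate motion vectors:

```python
from itertools import product

def offset_candidates(prior_mv, radius=1):
    """Shift prior_mv by up to `radius` integer samples in each direction
    (the unshifted vector is included), yielding the offset candidates."""
    mvx, mvy = prior_mv
    return [(mvx + dx, mvy + dy)
            for dx, dy in product(range(-radius, radius + 1), repeat=2)]

# A 3x3 search window around one a priori candidate yields 9 offset candidates.
print(offset_candidates((4, -2), radius=1))
```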
The plurality of a posteriori motion vectors of the reconstructed image block may be the plurality of a posteriori candidate motion vectors themselves, or may be a subset of the a posteriori candidate motion vectors, for example several designated motion vectors selected from them. Each of the P reconstructed image blocks can obtain its plurality of a posteriori motion vectors in this manner, which is not described block by block here.
The motion vectors of the P reconstructed image blocks can be input into the trained neural network to obtain Q prior candidate motion vectors of the current block and Q probability values corresponding to the Q prior candidate motion vectors. The neural network may be referred to in the following description of the training engine 25, and will not be described in detail here.
The Q a priori candidate motion vectors of the current block may refer to all the motion vectors remaining after the multiple a posteriori motion vectors of the P reconstructed image blocks are de-duplicated, or may refer to some motion vectors of all the motion vectors remaining after the multiple a posteriori motion vectors of the P reconstructed image blocks are de-duplicated.
Optionally, when M is equal to Q, the M probability values refer to the Q probability values, and the M prior candidate motion vectors refer to the Q prior candidate motion vectors.
Optionally, M is less than Q. In this case, each of the M probability values is greater than every other probability value among the Q probability values, and the M a priori candidate motion vectors corresponding to the M probability values are selected from the Q a priori candidate motion vectors of the current block. That is, the M largest probability values are selected from the Q probability values corresponding to the Q a priori candidate motion vectors of the current block, the M a priori candidate motion vectors corresponding to those M probability values are selected from the Q a priori candidate motion vectors, and the weight factors and predicted values are computed from these M probability values and M a priori candidate motion vectors to obtain the predicted value of the current block. The remaining probability values, apart from the M selected ones, are small enough to be ignored, which reduces the amount of computation and improves the efficiency of inter-frame prediction.
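A minimal sketch of this selection step, assuming the Q a priori candidate motion vectors and their probability values are available as plain Python lists (the variable names are illustrative):

```python
def select_top_m(candidate_mvs, probabilities, m):
    """Keep the m candidates with the largest probability values."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)[:m]
    return [candidate_mvs[i] for i in order], [probabilities[i] for i in order]

# Q = 5 candidates, keep the M = 3 most probable ones.
mvs, probs = select_top_m([(0, 0), (1, 0), (0, 1), (2, 2), (-1, 0)],
                          [0.05, 0.40, 0.30, 0.15, 0.10], m=3)
print(mvs, probs)   # [(1, 0), (0, 1), (2, 2)] [0.4, 0.3, 0.15]
```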
It should be noted that "corresponding" in the M probability values corresponding to the M a priori candidate motion vectors does not mean a one-to-one correspondence, for example, the current block has 5 a priori candidate motion vectors, and the probability values corresponding to the prior candidate motion vectors may be 5 probability values or less than 5 probability values.
When the sum of the M probability values is 1, the probability value corresponding to the first a priori candidate motion vector is used as the weight factor corresponding to the first a priori candidate motion vector; that is, the weight factors of the M a priori candidate motion vectors are their probability values. When the sum of the M probability values is not 1, the M probability values are normalized, and the normalized value of the probability value corresponding to the first a priori candidate motion vector is used as the weight factor corresponding to the first a priori candidate motion vector; that is, the weight factor of each of the M a priori candidate motion vectors is the normalized value of its probability value. The term "first a priori candidate motion vector" is used only for convenience of description; it does not refer to a specific a priori candidate motion vector, but represents any one of the Q a priori candidate motion vectors. In either case, the sum of the M weight factors corresponding to the M a priori candidate motion vectors is 1.
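The weight-factor rule described above can be sketched as follows (illustration only; the tolerance used to test whether the probabilities already sum to 1 is an assumption):

```python
def weight_factors(probabilities, tol=1e-9):
    """If the M probability values sum to 1, use them directly as weight
    factors; otherwise normalize them so the weight factors sum to 1."""
    total = sum(probabilities)
    if abs(total - 1.0) <= tol:
        return list(probabilities)
    return [p / total for p in probabilities]

print(weight_factors([0.4, 0.3, 0.15]))   # normalized so that the factors sum to 1
```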
According to the principle of inter prediction, a candidate motion vector locates a reference block in a reference frame of the current block; inter prediction is performed on the current block according to that reference block to obtain the predicted value corresponding to the candidate motion vector, so each predicted value of the current block corresponds to one candidate motion vector. Therefore, by performing motion compensation separately according to the M a priori candidate motion vectors, M predicted values of the current block can be obtained.
The M predicted values are weighted by the corresponding M weight factors and summed to obtain the predicted value of the current block. As described above, the M predicted values correspond to the M a priori candidate motion vectors, and the M weight factors also correspond to the M a priori candidate motion vectors; therefore, the predicted value and weight factor corresponding to the same a priori candidate motion vector also correspond to each other. The weight factor and predicted value corresponding to the same a priori candidate motion vector are multiplied, and the products corresponding to the several a priori candidate motion vectors are added to obtain the predicted value of the current block.
In this method, multiple weight factors and multiple predicted values of the current block are obtained from the respective motion vectors of multiple reconstructed image blocks in the peripheral region of the current block; the weight factor and predicted value corresponding to the same a priori candidate motion vector are multiplied, and the products corresponding to the multiple a priori candidate motion vectors are added to obtain the predicted value of the current block.
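As a hedged numpy sketch of these last two steps (integer-pel motion, a single reference picture and in-bounds block positions are simplifying assumptions; this is not the normative motion-compensation process):

```python
import numpy as np

def motion_compensate(ref, x, y, w, h, mv):
    """Fetch the w x h reference block pointed to by integer motion vector mv."""
    mvx, mvy = mv
    return ref[y + mvy: y + mvy + h, x + mvx: x + mvx + w].astype(np.float64)

def weighted_prediction(ref, x, y, w, h, candidate_mvs, weights):
    """Weighted sum of the M predicted values, one per a priori candidate MV."""
    pred = np.zeros((h, w), dtype=np.float64)
    for mv, wf in zip(candidate_mvs, weights):
        pred += wf * motion_compensate(ref, x, y, w, h, mv)
    return pred

ref = np.random.randint(0, 256, size=(64, 64))
pred = weighted_prediction(ref, x=16, y=16, w=8, h=8,
                           candidate_mvs=[(0, 0), (1, 0), (0, 1)],
                           weights=[0.5, 0.3, 0.2])
```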
In one possible implementation, in addition to obtaining the motion vector of each of the P reconstructed image blocks, the related information of each of the P reconstructed image blocks may also be obtained. The information related to the reconstructed image block may be a plurality of prediction error values corresponding to a plurality of a posteriori motion vectors for the reconstructed image block, and the plurality of prediction error values are also determined according to the reconstructed value of the reconstructed image block and prediction values corresponding to a plurality of a posteriori candidate motion vectors.
Motion compensation is performed on the reconstructed image block separately according to each of its a posteriori candidate motion vectors to obtain a plurality of predicted values, which correspond to the plurality of a posteriori candidate motion vectors.
The predicted values are then compared with the reconstructed value of the reconstructed image block to obtain a plurality of prediction error values, which correspond to the plurality of a posteriori candidate motion vectors. The prediction error value corresponding to an a posteriori candidate motion vector can be computed using, for example, the sum of absolute differences (SAD) or the sum of squared differences (SSE).
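For illustration, the two distortion measures mentioned above can be computed as follows for a predicted block and the reconstructed block (numpy arrays of the same shape are assumed):

```python
import numpy as np

def sad(pred, recon):
    """Sum of absolute differences between prediction and reconstruction."""
    return int(np.abs(pred.astype(np.int64) - recon.astype(np.int64)).sum())

def sse(pred, recon):
    """Sum of squared differences between prediction and reconstruction."""
    diff = pred.astype(np.int64) - recon.astype(np.int64)
    return int((diff * diff).sum())
```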
If the plurality of a posteriori motion vectors of the reconstructed image block refer to the plurality of a posteriori candidate motion vectors, the plurality of prediction error values of the reconstructed image block corresponding to the plurality of a posteriori motion vectors refer to a plurality of prediction error values corresponding to the plurality of a posteriori candidate motion vectors; if the plurality of a posteriori motion vectors of the reconstructed image block are partial motion vectors of the plurality of a posteriori motion vector candidates, the plurality of prediction error values corresponding to the plurality of a posteriori motion vectors of the reconstructed image block are prediction error values corresponding to the partial motion vectors selected from the plurality of prediction error values corresponding to the plurality of a posteriori motion vector candidates.
Accordingly, the plurality of a posteriori motion vectors of each of the P reconstructed image blocks and the plurality of prediction error values corresponding to those a posteriori motion vectors are input into the neural network.
In one possible implementation, in addition to obtaining the motion vector of each of the P reconstructed image blocks, the related information of each of the P reconstructed image blocks may also be obtained. The information related to the reconstructed image block may be a plurality of probability values corresponding to a plurality of a posteriori motion vectors for the reconstructed image block, the plurality of probability values also being determined according to a reconstructed value of the reconstructed image block and predicted values corresponding to a plurality of a posteriori candidate motion vectors.
The probability values corresponding to a plurality of a posteriori motion vectors of the reconstructed image block may be obtained by the following two methods:
one is to derive the plurality of probability values of the reconstructed image block from the plurality of prediction error values of the reconstructed image block obtained as described above. For example, the prediction error values of the reconstructed image block may be normalized using a normalized exponential function (softmax), a linear normalization method, or the like, to obtain normalized values of the prediction error values; these normalized values are the plurality of probability values of the reconstructed image block. Based on the correspondence between the prediction error values and the a posteriori motion vectors of the reconstructed image block, the probability values also correspond to the a posteriori motion vectors of the reconstructed image block, and each probability value represents the probability that the corresponding a posteriori motion vector is the optimal motion vector of the reconstructed image block.
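A sketch of this mapping under the assumption that a normalized exponential (softmax) over the negated prediction errors is used, so that a smaller prediction error yields a larger probability (the temperature parameter is an illustrative assumption):

```python
import numpy as np

def errors_to_probabilities(errors, temperature=1.0):
    """Map prediction error values to probability values that sum to 1."""
    logits = -np.asarray(errors, dtype=np.float64) / temperature
    logits -= logits.max()            # for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

print(errors_to_probabilities([120.0, 80.0, 300.0], temperature=50.0))
```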
The other method is to input the reconstructed value of the reconstructed image block and the plurality of predicted values of the reconstructed image block obtained in the first method into a trained neural network to obtain the plurality of probability values corresponding to the plurality of a posteriori motion vectors of the reconstructed image block. For the neural network, refer to the description of the training engine 25, which is not repeated here.
Accordingly, a plurality of a posteriori motion vectors and a plurality of probability values corresponding to the plurality of a posteriori motion vectors are input to the neural network for each of the P reconstructed image blocks.
After the prediction error values or probability values corresponding to the plurality of a posteriori motion vectors are obtained by the above two methods, the optimal motion vector of the reconstructed image block can be obtained in either of the following two ways:
one is to use the a posteriori motion vector corresponding to the smallest prediction error value among a plurality of prediction error values corresponding to a plurality of a posteriori motion vectors as the optimal motion vector of the reconstructed image block.
The other is to take the a posteriori motion vector corresponding to the largest probability value among the probability values corresponding to the plurality of a posteriori motion vectors as the optimal motion vector of the reconstructed image block.
It should be noted that the optimal motion vector in the present application refers to a motion vector obtained by one of the two methods, and is one of a plurality of a posteriori motion vectors of a reconstructed image block, but the optimal motion vector is not the only motion vector used when inter prediction is performed on the reconstructed image block.
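The two selection rules can be sketched in a few lines (an illustrative helper; the function and variable names are assumptions):

```python
def optimal_mv(posterior_mvs, errors=None, probabilities=None):
    """Pick the a posteriori MV with the smallest error or largest probability."""
    if errors is not None:
        return min(zip(posterior_mvs, errors), key=lambda t: t[1])[0]
    if probabilities is not None:
        return max(zip(posterior_mvs, probabilities), key=lambda t: t[1])[0]
    raise ValueError("either errors or probabilities must be provided")

print(optimal_mv([(0, 0), (1, 0), (0, 1)], errors=[120.0, 80.0, 300.0]))  # (1, 0)
```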
In a possible implementation, after the reconstructed value of the current block is obtained, the a posteriori motion vectors of the current block and their related information can be obtained as follows:
First, a plurality of a posteriori motion vectors of the current block and a plurality of prediction error values corresponding to those a posteriori motion vectors are obtained according to the reconstructed value of the current block and the predicted values corresponding to the plurality of a posteriori candidate motion vectors of the current block, where the a posteriori motion vectors of the current block are obtained from the plurality of a priori candidate motion vectors of the current block.
Second, the reconstructed value of the current block and the predicted values corresponding to the plurality of a posteriori candidate motion vectors of the current block are input into a neural network to obtain the plurality of a posteriori motion vectors of the current block and a plurality of probability values corresponding to those a posteriori motion vectors, where the a posteriori motion vectors of the current block are obtained from the plurality of a priori candidate motion vectors of the current block, or the probability values corresponding to the a posteriori motion vectors of the current block are obtained from the plurality of prediction error values of the current block.
Third, the a posteriori motion vector with the largest probability value or the smallest prediction error value among the plurality of a posteriori motion vectors of the current block is determined as the optimal motion vector of the current block.
In one possible implementation, the training data set according to which the training engine trains the neural network includes information of a plurality of groups of image blocks, where the information of each group of image blocks includes a plurality of a posteriori motion vectors of each of a plurality of reconstructed image blocks, a plurality of probability values corresponding to the plurality of a posteriori motion vectors, and a plurality of a posteriori motion vectors of a current block, a plurality of probability values corresponding to the plurality of a posteriori motion vectors, and the plurality of reconstructed image blocks are image blocks in a spatial neighborhood and/or a temporal neighborhood of the current block; and training according to the training data set to obtain the neural network.
In one possible implementation, the training data set according to which the training engine trains the neural network includes information of a plurality of groups of image blocks, where the information of each group of image blocks includes a plurality of a posteriori motion vectors of a plurality of reconstructed image blocks, a plurality of prediction error values corresponding to the plurality of a posteriori motion vectors, and a plurality of a posteriori motion vectors of a current block, a plurality of probability values corresponding to the plurality of a posteriori motion vectors, and the plurality of reconstructed image blocks are image blocks in a spatial neighborhood and/or a temporal neighborhood of the current block; and training according to the training data set to obtain the neural network.
In one possible implementation, the training data set according to which the training engine trains the neural network includes information of a plurality of groups of image blocks, where the information of each group of image blocks includes respective optimal motion vectors of a plurality of reconstructed image blocks, a plurality of a posteriori motion vectors of the current block, and a plurality of probability values corresponding to the plurality of a posteriori motion vectors, where the plurality of reconstructed image blocks are image blocks in a spatial neighborhood and/or a temporal neighborhood of the current block; and training according to the training data set to obtain the neural network.
Optionally, the neural network comprises at least a convolutional layer and an activation layer. The depth of the convolution kernels in the convolutional layer is 2, 3, 4, 5, 6, 16, 24, 32, 48, 64, or 128; the size of the convolution kernels in the convolutional layer is 1 × 1, 3 × 3, 5 × 5, or 7 × 7. For example, if the size of a convolutional layer is 3 × 3 × 2 × 10, then 3 × 3 denotes the size of the convolution kernels in that layer; 2 denotes the depth of the convolution kernels, and the number of data channels input to the convolutional layer matches this depth, i.e., the number of input data channels is also 2; 10 denotes the number of convolution kernels in the convolutional layer, and the number of data channels output by the convolutional layer matches the number of convolution kernels, i.e., the number of output data channels is also 10.
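A toy numpy sketch of the 3 × 3 × 2 × 10 example above (a naive valid convolution without padding, stride or bias, shown only to make the channel bookkeeping concrete; it is not the network used in this application):

```python
import numpy as np

kh, kw, in_ch, num_kernels = 3, 3, 2, 10          # "3 x 3 x 2 x 10"
weights = np.random.randn(num_kernels, in_ch, kh, kw)

x = np.random.randn(in_ch, 16, 16)                # 2-channel input feature map
out_h, out_w = 16 - kh + 1, 16 - kw + 1
y = np.zeros((num_kernels, out_h, out_w))         # 10-channel output feature map
for k in range(num_kernels):
    for i in range(out_h):
        for j in range(out_w):
            y[k, i, j] = np.sum(weights[k] * x[:, i:i + kh, j:j + kw])
print(y.shape)   # (10, 14, 14)
```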
Optionally, the neural network includes a convolutional neural network CNN, a deep neural network DNN, or a recurrent neural network RNN.
In a second aspect, the present application provides an encoder comprising processing circuitry for performing the method according to any of the first aspect above.
In a third aspect, the present application provides a decoder comprising processing circuitry for performing the method of any of the first aspect above.
In a fourth aspect, the present application provides a computer program product comprising program code for performing the method of any of the first aspect above when executed on a computer or processor.
In a fifth aspect, the present application provides an encoder comprising: one or more processors; a non-transitory computer readable storage medium coupled to the processor and storing a program for execution by the processor, wherein the program, when executed by the processor, causes the encoder to perform the method of any of the first aspects above.
In a sixth aspect, the present application provides a decoder comprising: one or more processors; a non-transitory computer readable storage medium coupled to the processor and storing a program for execution by the processor, wherein the program, when executed by the processor, causes the decoder to perform the method of any of the first aspects above.
In a seventh aspect, the present application provides a non-transitory computer readable storage medium comprising program code which, when executed by a computer device, is adapted to perform the method of any of the first aspect above.
In an eighth aspect, the present invention relates to an inter-frame prediction apparatus; for the beneficial effects, refer to the description of the first aspect, which is not repeated here. The inter-frame prediction apparatus has the functions of implementing the behavior in the method embodiment of the first aspect described above. These functions may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the functions described above. In one possible design, the inter-prediction apparatus includes: a motion estimation unit, configured to acquire the motion vectors of P reconstructed image blocks in a peripheral region of the current block, where the peripheral region comprises a spatial neighborhood and/or a temporal neighborhood of the current block; and an inter-prediction processing unit, configured to implement the method according to any of the first aspect above. These modules perform the corresponding functions in the method example of the first aspect; for details, refer to the detailed description of the method example, which is not repeated here.
The details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Drawings
FIG. 1a is an exemplary block diagram of a decoding system 10 according to an embodiment of the present application;
FIG. 1b is an exemplary block diagram of a video coding system 40 according to an embodiment of the present application;
FIG. 2 is an exemplary block diagram of a video encoder 20 according to an embodiment of the present application;
FIG. 3 is an exemplary block diagram of a video decoder 30 according to an embodiment of the present application;
FIG. 4 is an exemplary block diagram of a video coding apparatus 400 according to an embodiment of the present application;
FIG. 5 is an exemplary block diagram of an apparatus 500 according to an embodiment of the present application;
FIGS. 6a-6e are several exemplary architectures of neural networks for inter-frame prediction according to embodiments of the present application;
FIG. 7 is an exemplary diagram of candidate image blocks according to an embodiment of the present application;
FIG. 8 is a flow diagram of a process 800 of an inter prediction method according to an embodiment of the present application;
FIG. 9 is a flow diagram of a process 900 of an inter prediction method according to an embodiment of the present application;
FIG. 10 is an exemplary diagram of a search window according to an embodiment of the present application;
FIG. 11 is a flow diagram of a process 1100 of an inter prediction method according to an embodiment of the present application;
FIG. 12 is a flow diagram of a process 1200 of an inter prediction method according to an embodiment of the present application;
FIG. 13 is a block diagram of an inter prediction apparatus 1300 according to an embodiment of the present application.
Detailed Description
The embodiment of the present application provides an AI-based video compression technology, in particular, a neural network-based video compression technology, and specifically, provides a Neural Network (NN) -based inter-frame prediction technology to improve a conventional hybrid video coding and decoding system.
Video coding generally refers to the processing of a sequence of images that form a video or video sequence. In the field of video coding, the terms "image", "frame" and "picture" may be used as synonyms. Video coding (or coding in general) comprises both video encoding and video decoding. Video encoding is performed on the source side and typically involves processing (e.g., compressing) the original video image to reduce the amount of data required to represent it (and thus allow more efficient storage and/or transmission). Video decoding is performed on the destination side and typically involves inverse processing with respect to the encoder to reconstruct the video image. In the embodiments, "coding" of video images (or images in general) should be understood as "encoding" or "decoding" of video images or video sequences. The encoding part and the decoding part are also collectively referred to as a CODEC (coding and decoding).
In the case of lossless video coding, the original video image can be reconstructed, i.e., the reconstructed video image has the same quality as the original video image (assuming no transmission loss or other data loss during storage or transmission). In the case of lossy video coding, further compression is performed by quantization or the like to reduce the amount of data required to represent the video image, whereas the decoder side cannot reconstruct the video image completely, i.e. the quality of the reconstructed video image is lower or worse than the quality of the original video image.
Several video coding standards belong to the "lossy hybrid video codec" (i.e., the combination of spatial and temporal prediction in the pixel domain and 2D transform coding in the transform domain for applying quantization). Each image in a video sequence is typically partitioned into non-overlapping sets of blocks, typically encoded at the block level. In other words, the encoder typically processes, i.e., encodes, the video at the block (video block) level, e.g., producing prediction blocks by spatial (intra) prediction and temporal (inter) prediction; subtracting the prediction block from the current block (currently processed/block to be processed) to obtain a residual block; the residual block is transformed and quantized in the transform domain to reduce the amount of data to be transmitted (compressed), while the decoder side applies the inverse processing part with respect to the encoder to the encoded or compressed block to reconstruct the current block for representation. In addition, the encoder needs to repeat the processing steps of the decoder so that the encoder and decoder generate the same prediction (e.g., intra-prediction and inter-prediction) and/or reconstructed pixels for processing, i.e., encoding, of subsequent blocks.
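Not the codec itself, but a toy numpy sketch of the hybrid loop just described, with a Fourier transform standing in for the real 2D transform and a single quantization step size as an assumption:

```python
import numpy as np

def encode_block(current, prediction, q_step=8.0):
    """Encoder side: residual -> transform -> quantize."""
    residual = current.astype(np.float64) - prediction
    coeffs = np.fft.fft2(residual)          # stand-in for the codec's transform
    return np.round(coeffs / q_step)

def decode_block(levels, prediction, q_step=8.0):
    """Decoder side: dequantize -> inverse transform -> add prediction."""
    residual = np.real(np.fft.ifft2(levels * q_step))
    return prediction + residual

cur = np.random.randint(0, 256, (8, 8)).astype(np.float64)
pred = np.roll(cur, 1, axis=1)              # a crude prediction, for illustration
rec = decode_block(encode_block(cur, pred), pred)
```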
In the following embodiments of the decoding system 10, the encoder 20 and the decoder 30 are described with reference to fig. 1a to 3.
Fig. 1a is an exemplary block diagram of a decoding system 10 of an embodiment of the present application, such as a video decoding system 10 (or simply decoding system 10) that may utilize the techniques of the present application. Video encoder 20 (or simply encoder 20) and video decoder 30 (or simply decoder 30) in video coding system 10 represent, among other things, devices that may be used to perform techniques in accordance with various examples described in this application.
As shown in FIG. 1a, a decoding system 10 includes a source device 12, the source device 12 configured to provide encoded image data 21, such as an encoded image, to a destination device 14 configured to decode the encoded image data 21.
Source device 12 includes an encoder 20 and may additionally, or alternatively, include an image source 16, a pre-processor (or pre-processing unit) 18 such as an image pre-processor, a communication interface (or communication unit) 22.
Image sources 16 may include or may be any type of image capture device for capturing real-world images and the like, and/or any type of image generation device, such as a computer graphics processor for generating computer-animated images or any type of device for acquiring and/or providing real-world images, computer-generated images (e.g., screen content, Virtual Reality (VR) images, and/or any combination thereof (e.g., Augmented Reality (AR) images)).
In order to distinguish the processing performed by the preprocessor (or preprocessing unit) 18, the image (or image data) 17 may also be referred to as an original image (or original image data) 17.
Preprocessor 18 is configured to receive original image data 17 and preprocess original image data 17 to obtain a preprocessed image (or preprocessed image data) 19. For example, the pre-processing performed by pre-processor 18 may include pruning, color format conversion (e.g., from RGB to YCbCr), toning, or denoising. It will be appreciated that the pre-processing unit 18 may be an optional component.
A video encoder (or encoder) 20 is operative to receive pre-processed image data 19 and provide encoded image data 21 (described further below with respect to fig. 2, etc.).
The communication interface 22 in the source device 12 may be used to: receives encoded image data 21 and transmits encoded image data 21 (or any other processed version) over communication channel 13 to another device, such as destination device 14, or any other device for storage or direct reconstruction.
The destination device 14 includes a decoder 30 and may additionally, or alternatively, include a communication interface (or communication unit) 28, a post-processor (or post-processing unit) 32, and a display device 34.
Communication interface 28 in destination device 14 is used to receive encoded image data 21 (or other arbitrarily-processed version) either directly from source device 12 or from any other source device, such as a storage device, for example, an encoded image data storage device, and to provide encoded image data 21 to decoder 30.
The communication interface 22 and the communication interface 28 may be used to transmit or receive encoded image data (or encoded data) 21 over a direct communication link, such as a direct wired or wireless connection, etc., between the source device 12 and the destination device 14, or over any type of network, such as a wired network, a wireless network, or any combination thereof, any type of private and public networks, or any type of combination thereof.
For example, communication interface 22 may be used to encapsulate encoded image data 21 into a suitable format such as a message and/or process the encoded image data using any type of transport encoding or processing for transmission over a communication link or network.
Communication interface 28 corresponds to communication interface 22, and may be used, for example, to receive transmitted data and process the transmitted data using any type of corresponding transmission decoding or processing and/or decapsulation to obtain encoded image data 21.
Both the communication interface 22 and the communication interface 28 may be configured as unidirectional communication interfaces, as indicated by the arrows of the corresponding communication channel 13 pointing from the source device 12 to the destination device 14 in fig. 1a, or as bidirectional communication interfaces, and may be used to send and receive messages or the like, to establish a connection, to acknowledge and exchange any other information related to a communication link and/or a data transmission, e.g. an encoded image data transmission, etc.
Video decoder (or decoder) 30 is operative to receive encoded image data 21 and provide decoded image data (or decoded image data) 31 (described further below with respect to fig. 3, etc.).
The post-processor 32 is configured to perform post-processing on decoded image data 31 (also referred to as reconstructed image data) such as a decoded image, and obtain post-processed image data 33 such as a post-processed image. Post-processing performed by post-processing unit 32 may include, for example, color format conversion (e.g., from YCbCr to RGB), toning, cropping, or resampling, or any other processing for generating decoded image data 31 for display by display device 34 or the like.
The display device 34 is used to receive the post-processed image data 33 to display an image to a user or viewer or the like. The display device 34 may be or include any type of display for representing the reconstructed image, such as an integrated or external display screen or display. For example, the display screen may include a Liquid Crystal Display (LCD), an Organic Light Emitting Diode (OLED) display, a plasma display, a projector, a micro LED display, a liquid crystal on silicon (LCoS), a Digital Light Processor (DLP), or any other type of display screen.
The coding system 10 further includes a training engine 25, the training engine 25 being configured to train the encoder 20 (and in particular an inter prediction unit in the encoder 20) or the decoder 30 (and in particular an inter prediction unit in the decoder 30) to process an input image or image area or image block to generate a prediction value for the input image or image area or image block.
Optionally, the training data set in this embodiment of the present application includes: information of a plurality of groups of image blocks, where the information of each group of image blocks comprises a plurality of a posteriori motion vectors of the reconstructed image blocks, a plurality of probability values corresponding to those a posteriori motion vectors, a plurality of a posteriori candidate motion vectors of the current block, and a plurality of probability values corresponding to those a posteriori candidate motion vectors, and the reconstructed image blocks are image blocks in a spatial neighborhood and/or a temporal neighborhood of the current block. A neural network is obtained by training on this training data set; the input of the neural network is the plurality of a posteriori motion vectors of the plurality of reconstructed image blocks in the peripheral region of the current block and the probability values corresponding to those a posteriori motion vectors, and the output of the neural network is a plurality of a priori candidate motion vectors of the current block and a plurality of probability values corresponding to those a priori candidate motion vectors.
Optionally, the training data set in this embodiment of the present application includes: information of a plurality of groups of image blocks, where the information of each group of image blocks comprises a plurality of a posteriori motion vectors of the reconstructed image blocks, a plurality of prediction error values corresponding to those a posteriori motion vectors, a plurality of a posteriori candidate motion vectors of the current block, and a plurality of probability values corresponding to those a posteriori candidate motion vectors, and the reconstructed image blocks are image blocks in a spatial neighborhood and/or a temporal neighborhood of the current block. A neural network is obtained by training on this training data set; the input of the neural network is the plurality of a posteriori motion vectors of the plurality of reconstructed image blocks in the peripheral region of the current block and the prediction error values corresponding to those a posteriori motion vectors, and the output of the neural network is a plurality of a priori candidate motion vectors of the current block and a plurality of probability values corresponding to those a priori candidate motion vectors.
Optionally, the training data set in this embodiment of the present application includes: information of a plurality of groups of image blocks, where the information of each group of image blocks comprises the respective optimal motion vectors of a plurality of reconstructed image blocks, a plurality of a posteriori candidate motion vectors of the current block, and a plurality of probability values corresponding to those a posteriori candidate motion vectors, and the plurality of reconstructed image blocks are image blocks in a spatial neighborhood and/or a temporal neighborhood of the current block. A neural network is obtained by training on this training data set; the input of the neural network is the respective optimal motion vectors of the plurality of reconstructed image blocks in the peripheral region of the current block, and the output of the neural network is a plurality of a priori candidate motion vectors of the current block and a plurality of probability values corresponding to those a priori candidate motion vectors.
Optionally, the training data set in this embodiment of the present application includes: information of a plurality of groups of image blocks, where the information of each group of image blocks comprises the reconstructed value of the image block, the predicted values corresponding to a plurality of a posteriori candidate motion vectors, a plurality of a posteriori motion vectors of the image block, and a plurality of probability values corresponding to those a posteriori motion vectors. A neural network is obtained by training on this training data set; the input of the neural network is the reconstructed value of the current block and the predicted values corresponding to the plurality of a posteriori candidate motion vectors, and the output of the neural network is the plurality of a posteriori motion vectors of the current block and a plurality of probability values corresponding to those a posteriori motion vectors.
In the process by which the training engine 25 trains the neural network, the plurality of a priori candidate motion vectors of the current block output by the network are made to approximate the plurality of a posteriori motion vectors of the current block, and the probability values corresponding to the a priori candidate motion vectors are made to approximate the probability values corresponding to the a posteriori motion vectors. Each training run may use a mini-batch size of 64 images and an initial learning rate of 1e-4 with a step size of 10. The information of the plurality of groups of image blocks may be data generated when a plurality of current blocks are inter-encoded by the encoder. The neural network can be used to implement the inter-frame prediction method provided by the embodiments of this application: the motion vectors and related information of a plurality of reconstructed image blocks in the peripheral region of the current block are input into the neural network, and a plurality of a priori candidate motion vectors of the current block and a plurality of probability values corresponding to those a priori candidate motion vectors are obtained. The neural network is described in detail below in connection with FIGS. 6a-6e.
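Reading "a step size of 10" as a step learning-rate schedule is an interpretation, not something the text states explicitly; under that assumption, the schedule could look like the following sketch (the decay factor 0.1 is likewise assumed):

```python
BATCH_SIZE = 64        # mini-batch of 64 images per iteration
INITIAL_LR = 1e-4      # initial learning rate
STEP_SIZE = 10         # assumed: decay the learning rate every 10 epochs
GAMMA = 0.1            # assumed decay factor

def learning_rate(epoch):
    return INITIAL_LR * (GAMMA ** (epoch // STEP_SIZE))

print([learning_rate(e) for e in (0, 9, 10, 25)])  # roughly 1e-4, 1e-4, 1e-5, 1e-6
```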
The training data in the embodiment of the present application may be stored in a database (not shown), and the training engine 25 trains a target model (for example, a neural network for image inter-frame prediction) based on the training data. It should be noted that, in the embodiment of the present application, a source of the training data is not limited, and for example, the training data may be obtained from a cloud or other places to perform model training.
The object model trained by the training engine 25 may be applied to the decoding system 10, for example, to the source device 12 (e.g., encoder 20) or the destination device 14 (e.g., decoder 30) shown in fig. 1 a. The training engine 25 may train in the cloud to obtain a target model, and then the decoding system 10 downloads and uses the target model from the cloud; alternatively, the training engine 25 may train in the cloud to obtain the target model and use the target model, and the decoding system 10 obtains the processing result directly from the cloud. For example, the training engine 25 trains an object model with inter prediction function, the decoding system 10 downloads the object model from the cloud, and then the inter prediction unit 244 in the encoder 20 or the inter prediction unit 344 in the decoder 30 may perform inter prediction on the input image or image block according to the object model to obtain prediction of the image or image block. For another example, the training engine 25 trains a target model with inter-frame prediction function, the decoding system 10 does not need to download the target model from the cloud, the encoder 20 or the decoder 30 transmits the image or the image block to the cloud, and the cloud performs inter-frame prediction on the image or the image block through the target model to obtain prediction of the image or the image block, and transmits the prediction to the encoder 20 or the decoder 30.
Although fig. 1a shows the source device 12 and the destination device 14 as separate devices, device embodiments may also include both devices or both functionalities, i.e. the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality. In these embodiments, the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality may be implemented using the same hardware and/or software or by separate hardware and/or software or any combination thereof.
It will be apparent to the skilled person from the description that the presence and (exact) division of different units or functions in the source device 12 and/or the destination device 14 shown in fig. 1a may differ depending on the actual device and application.
Encoder 20 (e.g., video encoder 20) or decoder 30 (e.g., video decoder 30) or both may be implemented by processing circuitry as shown in fig. 1b, such as one or more microprocessors, Digital Signal Processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, video-encoding dedicated processors, or any combination thereof. Encoder 20 may be implemented by processing circuitry 46 to include the various modules discussed with reference to encoder 20 of fig. 2 and/or any other encoder system or subsystem described herein. Decoder 30 may be implemented by processing circuitry 46 to include the various modules discussed with reference to decoder 30 of fig. 3 and/or any other decoder system or subsystem described herein. The processing circuit 46 may be used to perform various operations discussed below. As shown in fig. 5, if portions of the techniques are implemented in software, the device may store the instructions of the software in a suitable non-transitory computer-readable storage medium and execute the instructions in hardware using one or more processors to perform the techniques of this application. One of the video encoder 20 and the video decoder 30 may be integrated in a single device as part of a combined CODEC (CODEC), as shown in fig. 1 b.
Source device 12 and destination device 14 may comprise any of a variety of devices, including any type of handheld or fixed device, such as a notebook or laptop computer, a cell phone, a smart phone, a tablet or tablet computer, a camera, a desktop computer, a set-top box, a television, a display device, a digital media player, a video game console, a video streaming device (e.g., a content service server or a content distribution server), a broadcast receiving device, a broadcast transmitting device, etc., and may use no operating system or any type of operating system. In some cases, source device 12 and destination device 14 may be equipped with components for wireless communication. Thus, the source device 12 and the destination device 14 may be wireless communication devices.
In some cases, the video coding system 10 shown in fig. 1a is merely exemplary, and the techniques provided herein may be applicable to video encoding settings (e.g., video encoding or video decoding) that do not necessarily include any data communication between the encoding device and the decoding device. In other examples, the data is retrieved from local storage, sent over a network, and so on. A video encoding device may encode and store data in memory, and/or a video decoding device may retrieve and decode data from memory. In some examples, the encoding and decoding are performed by devices that do not communicate with each other, but merely encode data to and/or retrieve and decode data from memory.
Fig. 1b is an exemplary block diagram of a video decoding system 40 according to an embodiment of the present disclosure, and as shown in fig. 1b, the video decoding system 40 may include an imaging device 41, a video encoder 20, a video decoder 30 (and/or a video codec implemented by a processing circuit 46), an antenna 42, one or more processors 43, one or more memory storages 44, and/or a display device 45.
As shown in fig. 1b, the imaging device 41, the antenna 42, the processing circuit 46, the video encoder 20, the video decoder 30, the processor 43, the memory storage 44 and/or the display device 45 are capable of communicating with each other. In different examples, video coding system 40 may include only video encoder 20 or only video decoder 30.
In some instances, antenna 42 may be used to transmit or receive an encoded bitstream of video data. Additionally, in some instances, display device 45 may be used to present video data. The processing circuit 46 may comprise application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, or the like. Video decoding system 40 may also include an optional processor 43, which optional processor 43 similarly may include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, or the like. In addition, the memory storage 44 may be any type of memory, such as a volatile memory (e.g., Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), etc.) or a non-volatile memory (e.g., flash memory, etc.), and so on. In a non-limiting example, memory storage 44 may be implemented by cache memory. In other examples, the processing circuitry 46 may include memory (e.g., cache, etc.) for implementing an image buffer, etc.
In some examples, video encoder 20, implemented by logic circuitry, may include an image buffer (e.g., implemented by processing circuitry 46 or memory storage 44) and a graphics processing unit (e.g., implemented by processing circuitry 46). The graphics processing unit may be communicatively coupled to the image buffer. The graphics processing unit may include video encoder 20 implemented by processing circuitry 46 to implement the various modules discussed with reference to fig. 2 and/or any other encoder system or subsystem described herein. Logic circuitry may be used to perform various operations discussed herein.
In some examples, video decoder 30 may be implemented by processing circuitry 46 in a similar manner to implement the various modules discussed with reference to video decoder 30 of fig. 3 and/or any other decoder system or subsystem described herein. In some examples, logic circuit implemented video decoder 30 may include an image buffer (implemented by processing circuit 46 or memory storage 44) and a graphics processing unit (implemented by processing circuit 46, for example). The graphics processing unit may be communicatively coupled to the image buffer. The graphics processing unit may include a video decoder 30 implemented by processing circuitry 46 to implement the various modules discussed with reference to fig. 3 and/or any other decoder system or subsystem described herein.
In some instances, antenna 42 may be used to receive an encoded bitstream of video data. As discussed, the encoded bitstream may include data related to the encoded video frame, indicators, index values, mode selection data, etc., discussed herein, such as data related to the encoding partition (e.g., transform coefficients or quantized transform coefficients, (as discussed) optional indicators, and/or data defining the encoding partition). Video coding system 40 may also include a video decoder 30 coupled to antenna 42 and used to decode the encoded bitstream. The display device 45 is used to present video frames.
It should be understood that video decoder 30 may be used to perform the reverse process for the example described with reference to video encoder 20 in the embodiments of the present application. With respect to signaling syntax elements, video decoder 30 may be configured to receive and parse such syntax elements and decode the associated video data accordingly. In some examples, video encoder 20 may entropy encode the syntax elements into an encoded video bitstream. In such instances, video decoder 30 may parse such syntax elements and decode the relevant video data accordingly.
For convenience of description, embodiments of the present application are described with reference to versatile video coding (VVC) reference software or high-efficiency video coding (HEVC) developed by the joint collaborative team on video coding (JCT-VC) of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). Those of ordinary skill in the art understand that the embodiments of the present application are not limited to HEVC or VVC.
Encoder and encoding method
Fig. 2 is an exemplary block diagram of a video encoder 20 according to an embodiment of the present application. As shown in fig. 2, the video encoder 20 includes an input terminal (or input interface) 201, a residual calculation unit 204, a transform processing unit 206, a quantization unit 208, an inverse quantization unit 210, an inverse transform processing unit 212, a reconstruction unit 214, a loop filter 220, a Decoded Picture Buffer (DPB) 230, a mode selection unit 260, an entropy coding unit 270, and an output terminal (or output interface) 272. Mode select unit 260 may include inter prediction unit 244, intra prediction unit 254, and partition unit 262. The inter prediction unit 244 may include a motion estimation unit and a motion compensation unit (not shown). The video encoder 20 shown in fig. 2 may also be referred to as a hybrid video encoder or a hybrid video codec-based video encoder.
Referring to fig. 2, the inter-frame prediction unit is a trained target model (also called a neural network) for processing an input image or an image area or an image block to generate a prediction value of the input image block. For example, a neural network for inter prediction is used to receive an input image or image area or image block and generate a prediction value of the input image or image area or image block. The neural network for inter prediction will be described in detail below in conjunction with fig. 6a-6 e.
The residual calculation unit 204, the transform processing unit 206, the quantization unit 208, and the mode selection unit 260 constitute a forward signal path of the encoder 20, and the inverse quantization unit 210, the inverse transform processing unit 212, the reconstruction unit 214, the buffer 216, the loop filter 220, the Decoded Picture Buffer (DPB) 230, the inter prediction unit 244, and the intra prediction unit 254 constitute a backward signal path of the encoder, wherein the backward signal path of the encoder 20 corresponds to a signal path of a decoder (see the decoder 30 in fig. 3). Inverse quantization unit 210, inverse transform processing unit 212, reconstruction unit 214, loop filter 220, decoded image buffer 230, inter prediction unit 244, and intra prediction unit 254 also constitute a "built-in decoder" of video encoder 20.
Image and image segmentation (image and block)
The encoder 20 is operable to receive images (or image data) 17, e.g. forming images in a sequence of images of a video or video sequence, via an input 201 or the like. The received image or image data may also be a pre-processed image (or pre-processed image data) 19. For simplicity, the following description uses image 17. The image 17 may also be referred to as the current image or as an image to be encoded (in particular when the current image is distinguished from other images in video encoding, such as previously encoded images and/or decoded images in the same video sequence, i.e. a video sequence also comprising the current image).
The (digital) image is or can be considered as a two-dimensional array or matrix of pixels with intensity values. The pixels in the array may also be referred to as pixels (or pels) (short for picture elements). The number of pixels in the array or image in both the horizontal and vertical directions (or axes) determines the size and/or resolution of the image. To represent color, three color components are typically employed, i.e., the image may be represented as or include three arrays of pixel points. In the RGB format or color space, the image includes corresponding arrays of red, green, and blue pixel points. However, in video coding, each pixel is typically represented in a luminance/chrominance format or color space, such as YCbCr, comprising a luminance component (sometimes also denoted L) indicated by Y and two chrominance components, denoted Cb and Cr. The luminance (luma) component Y represents luminance or gray level intensity (e.g., both are the same in a gray scale image), while the two chrominance (abbreviated chroma) components Cb and Cr represent chrominance or color information components. Accordingly, an image in YCbCr format includes a luminance pixel point array of luminance pixel point values (Y) and two chrominance pixel point arrays of chrominance values (Cb and Cr). An image in RGB format may be converted or transformed into YCbCr format and vice versa, a process also known as color transformation or conversion. If the image is black and white, the image may include only an array of luminance pixel points. Accordingly, the image may be, for example, an array of luminance pixel points in monochrome format, or an array of luminance pixel points and two corresponding arrays of chrominance pixel points in the 4:2:0, 4:2:2 and 4:4:4 color formats.
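As a small illustration of the color conversion mentioned above, the sketch below converts one 8-bit RGB pixel to YCbCr. The BT.601 full-range coefficients are an assumption made for this sketch; the embodiment does not prescribe a specific conversion matrix.

```python
# Sketch of an RGB -> YCbCr conversion for one 8-bit pixel.
# Assumption: BT.601 full-range coefficients; other matrices are equally possible.
def rgb_to_ycbcr(r: float, g: float, b: float):
    y  =  0.299    * r + 0.587    * g + 0.114    * b          # luminance (luma)
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128.0  # blue-difference chroma
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128.0  # red-difference chroma
    return y, cb, cr

# Example: a pure gray pixel maps to chroma values of 128 (no color information).
print(rgb_to_ycbcr(100, 100, 100))   # -> (100.0, 128.0, 128.0)
```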
In one embodiment, an embodiment of video encoder 20 may include an image partitioning unit (not shown in fig. 2) for partitioning image 17 into a plurality of (typically non-overlapping) image blocks 203. These blocks may also be referred to as root blocks, macroblocks (h.264/AVC), or Coding Tree Blocks (CTBs), or Coding Tree Units (CTUs) in the h.265/HEVC and VVC standards. The segmentation unit may be adapted to use the same block size for all images in the video sequence and to use a corresponding grid defining the block sizes, or to change the block sizes between images or subsets or groups of images and to segment each image into corresponding blocks.
In other embodiments, the video encoder may be configured to receive the blocks 203 of the image 17 directly, e.g., one, several, or all of the blocks that make up the image 17. The image block 203 may also be referred to as a current image block or an image block to be encoded.
As with image 17, image block 203 is also or can be thought of as a two-dimensional array or matrix of pixels having intensity values (pixel point values), but image block 203 is smaller than image 17. In other words, block 203 may include one pixel point array (e.g., a luma array in the case of a monochrome image 17, or a luma array or a chroma array in the case of a color image) or three pixel point arrays (e.g., a luma array and two chroma arrays in the case of a color image 17) or any other number and/or type of arrays depending on the color format employed. The number of pixels in the horizontal and vertical directions (or axes) of the block 203 defines the size of the block 203. Accordingly, a block may be an array of M × N (M columns × N rows) pixel points, or an array of M × N transform coefficients, or the like.
In one embodiment, the video encoder 20 shown in fig. 2 is used to encode the image 17 on a block-by-block basis, e.g., encoding and prediction is performed for each block 203.
In one embodiment, the video encoder 20 shown in fig. 2 may also be used to segment and/or encode pictures using slices (also referred to as video slices), where pictures may be segmented or encoded using one or more slices (which typically are non-overlapping). Each slice may comprise one or more blocks (e.g. a coding tree unit CTU) or one or more groups of blocks (e.g. a coded block (tile) in the h.265/HEVC/VVC standard and a brick (brick) in the VVC standard).
In one embodiment, the video encoder 20 shown in fig. 2 may be further configured to segment and/or encode a picture using one or more slice/coding block groups (generally non-overlapping) and/or coding blocks (also referred to as video coding blocks), wherein each slice/coding block group may include one or more blocks (e.g., CTUs) or one or more coding blocks, etc., wherein each coding block may be rectangular, etc., and may include one or more complete or partial blocks (e.g., CTUs).
Residual calculation
The residual calculation unit 204 is configured to calculate a residual block 205 from the image block (or original block) 203 and the prediction block 265 (the prediction block 265 is described in detail later) as follows: for example, pixel-point-by-pixel (pixel-by-pixel) values of the prediction block 265 are subtracted from pixel-point values of the image block 203, resulting in the residual block 205 in the pixel domain.
Transformation of
The transform processing unit 206 is configured to perform Discrete Cosine Transform (DCT), Discrete Sine Transform (DST), or the like on pixel point values of the residual block 205, to obtain transform coefficients 207 in a transform domain. The transform coefficients 207, which may also be referred to as transform residual coefficients, represent a residual block 205 in the transform domain.
Transform processing unit 206 may be used to apply an integer approximation of DCT/DST, such as the transform specified for h.265/HEVC. Such an integer approximation is typically scaled by some factor compared to the orthogonal DCT transform. To maintain the norm of the residual block that is processed by the forward and inverse transforms, other scaling factors are used as part of the transform process. The scaling factor is typically selected according to certain constraints, e.g., the scaling factor is a power of 2 for a shift operation, the bit depth of the transform coefficients, a tradeoff between accuracy and implementation cost, etc. For example, a specific scaling factor may be specified for the inverse transform by the inverse transform processing unit 212 on the encoder 20 side (and for the corresponding inverse transform by, for example, the inverse transform processing unit 312 on the decoder 30 side), and accordingly, a corresponding scaling factor may be specified for the forward transform by the transform processing unit 206 on the encoder 20 side.
In one embodiment, video encoder 20 (correspondingly, transform processing unit 206) may be configured to output transform parameters, such as the type of transform(s), e.g., directly or after being encoded or compressed by entropy encoding unit 270, e.g., such that video decoder 30 may receive and use the transform parameters for decoding.
Quantization
The quantization unit 208 is configured to quantize the transform coefficients 207 by, for example, scalar quantization or vector quantization, resulting in quantized transform coefficients 209. The quantized transform coefficients 209 may also be referred to as quantized residual coefficients 209.
The quantization process may reduce the bit depth associated with some or all of transform coefficients 207. For example, an n-bit transform coefficient may be rounded down to an m-bit transform coefficient during quantization, where n is greater than m. The quantization level may be modified by adjusting a Quantization Parameter (QP). For example, for scalar quantization, different degrees of scaling may be applied to achieve finer or coarser quantization. Smaller quantization steps correspond to finer quantization and larger quantization steps correspond to coarser quantization. An appropriate quantization step size may be indicated by a Quantization Parameter (QP). For example, the quantization parameter may be an index of a predefined set of suitable quantization step sizes. For example, a smaller quantization parameter may correspond to a fine quantization (smaller quantization step size) and a larger quantization parameter may correspond to a coarse quantization (larger quantization step size), or vice versa. The quantization may comprise division by a quantization step size, while the corresponding or inverse dequantization performed by the dequantization unit 210, etc., may comprise multiplication by the quantization step size. Embodiments according to some standards such as HEVC may be used to determine the quantization step size using a quantization parameter. In general, the quantization step size may be calculated from the quantization parameter using a fixed point approximation of an equation that includes division. Other scaling factors may be introduced for quantization and dequantization to recover the norm of the residual block that may be modified due to the scale used in the fixed point approximation of the equation for the quantization step size and quantization parameter. In one exemplary implementation, the inverse transform and dequantization scales may be combined. Alternatively, a custom quantization table may be used and indicated from the encoder to the decoder in the bitstream or the like. Quantization is a lossy operation, where the larger the quantization step size, the greater the loss.
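As an illustration of the relation between the quantization parameter and the quantization step size described above, the sketch below uses the HEVC-style relation Qstep ≈ 2^((QP−4)/6). This is a simplified model only; real codecs implement the relation with fixed-point scaling tables rather than floating-point division.

```python
# Simplified scalar quantization sketch.
# Assumption: Qstep = 2^((QP - 4) / 6), as in HEVC; real implementations use
# fixed-point approximations and per-coefficient scaling lists.
def q_step(qp: int) -> float:
    return 2.0 ** ((qp - 4) / 6.0)

def quantize(coeff: float, qp: int) -> int:
    return int(round(coeff / q_step(qp)))   # division by the quantization step (lossy)

def dequantize(level: int, qp: int) -> float:
    return level * q_step(qp)               # multiplication by the quantization step

# A QP increase of 6 doubles the step size, i.e. coarser quantization and larger loss.
print(q_step(22), q_step(28))               # the second value is twice the first
```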
In one embodiment, video encoder 20 (correspondingly, quantization unit 208) may be used to output Quantization Parameters (QPs), e.g., directly or encoded or compressed by entropy encoding unit 270, e.g., such that video decoder 30 may receive and decode using the quantization parameters.
Inverse quantization
The inverse quantization unit 210 is configured to perform inverse quantization of the quantization unit 208 on the quantized coefficients, resulting in dequantized coefficients 211, e.g., perform an inverse quantization scheme according to or using the same quantization step as the quantization unit 208. Dequantized coefficients 211, which may also be referred to as dequantized residual coefficients 211, correspond to transform coefficients 207, but dequantized coefficients 211 are typically not exactly the same as the transform coefficients due to losses caused by quantization.
Inverse transformation
The inverse transform processing unit 212 is configured to perform an inverse transform of the transform performed by the transform processing unit 206, such as an inverse Discrete Cosine Transform (DCT) or an inverse Discrete Sine Transform (DST), to obtain a reconstructed residual block 213 (or corresponding dequantized coefficients 213) in the pixel domain. The reconstructed residual block 213 may also be referred to as a transform block 213.
Reconstruction
The reconstruction unit 214 (e.g. summer 214) is configured to add the transform block 213 (i.e. the reconstructed residual block 213) to the prediction block 265 to obtain the reconstruction block 215 in the pixel domain, e.g. to add pixel point values of the reconstructed residual block 213 and pixel point values of the prediction block 265.
Filtering
The loop filter unit 220 (or simply "loop filter" 220) is used for filtering the reconstruction block 215 to obtain a filter block 221, or is generally used for filtering the reconstruction pixel points to obtain filtered pixel point values. For example, the loop filter unit is used to smoothly perform pixel transition or improve video quality. Loop filter unit 220 may include one or more loop filters, such as a deblocking filter, a sample-adaptive offset (SAO) filter, or one or more other filters, such as an Adaptive Loop Filter (ALF), a Noise Suppression Filter (NSF), or any combination thereof. For example, the loop filter unit 220 may include a deblocking filter, an SAO filter, and an ALF filter. The order of the filtering process may be a deblocking filter, an SAO filter, and an ALF filter. As another example, a process called luma mapping with chroma scaling (LMCS) (i.e., adaptive in-loop shaper) is added. This process is performed prior to deblocking. As another example, the deblocking filtering process may also be applied to intra sub-block edges, such as affine sub-block edges, ATMVP sub-block edges, sub-block transform (SBT) edges, and intra sub-partition (ISP) edges. Although loop filter unit 220 is shown in fig. 2 as a loop filter, in other configurations, loop filter unit 220 may be implemented as a post-loop filter. The filtering block 221 may also be referred to as a filtered reconstruction block 221.
In one embodiment, video encoder 20 (correspondingly, loop filter unit 220) may be used to output loop filter parameters (e.g., SAO filtering parameters, ALF filtering parameters, or LMCS parameters), e.g., directly or after entropy encoding by entropy encoding unit 270, e.g., such that decoder 30 may receive and decode using the same or different loop filter parameters.
Decoded picture buffer
Decoded Picture Buffer (DPB) 230 may be a reference picture memory that stores reference picture data for use by video encoder 20 in encoding video data. DPB230 may be formed from any of a variety of memory devices, such as Dynamic Random Access Memory (DRAM), including Synchronous DRAM (SDRAM), Magnetoresistive RAM (MRAM), Resistive RAM (RRAM), or other types of memory devices. The decoded picture buffer 230 may be used to store one or more filter blocks 221. The decoded picture buffer 230 may also be used to store other previously filtered blocks, such as previously reconstructed and filtered blocks 221, of the same current picture or of a different picture, such as a previously reconstructed picture, and may provide a complete previously reconstructed, i.e., decoded picture (and corresponding reference blocks and pixels) and/or a partially reconstructed current picture (and corresponding reference blocks and pixels), for example, for inter prediction. Decoded picture buffer 230 may also be used to store one or more unfiltered reconstructed blocks 215, or generally, unfiltered reconstructed pixel points, such as reconstructed blocks 215 that are not filtered by loop filtering unit 220, or reconstructed blocks or reconstructed pixel points that have not been subjected to any other processing.
Mode selection (segmentation and prediction)
The mode selection unit 260 includes a segmentation unit 262, an inter prediction unit 244, and an intra prediction unit 254, which are used to receive or obtain original image data, such as the original block 203 (the current block 203 of the current image 17) and reconstructed image data, e.g., filtered and/or unfiltered pixel reconstructed or reconstructed blocks of the same (current) image and/or one or more previously decoded images, from the decoded image buffer 230 or other buffers (e.g., column buffers, not shown). The reconstructed image data is used as reference image data necessary for prediction such as inter prediction or intra prediction to obtain a prediction block 265 or a prediction value 265.
The mode selection unit 260 may be used to determine or select a partition for the current block (including no partition) and the prediction mode (e.g., intra or inter prediction modes) and generate a corresponding prediction block 265 for calculation of the residual block 205 and reconstruction of the reconstructed block 215.
In one embodiment, mode selection unit 260 may be used to select the partitioning and prediction modes (e.g., from among the prediction modes supported or available by mode selection unit 260) that provide the best match or the smallest residual (smallest residual refers to better compression in transmission or storage), or that provide the smallest signaling overhead (smallest signaling overhead refers to better compression in transmission or storage), or both. The mode selection unit 260 may be configured to determine the partitioning and prediction modes according to rate distortion optimization (RDO), i.e., to select the prediction mode that provides the minimum rate distortion. The terms "best," "lowest," "optimal," and the like herein do not necessarily refer to "best," "lowest," "optimal," as a whole, but may also refer to situations where termination or selection criteria are met, e.g., values above or below a threshold or other limit may result in "sub-optimal" selections but reduce complexity and processing time.
In other words, the segmentation unit 262 may be used to segment images in a video sequence into a sequence of Coding Tree Units (CTUs), and the CTUs 203 may be further segmented into smaller block portions or sub-blocks (again forming blocks), e.g., by iteratively using quad-tree (QT) segmentation, binary-tree (BT) segmentation, or triple-tree (TT) segmentation, or any combination thereof, and to perform prediction, e.g., on each of the block portions or sub-blocks, wherein mode selection includes selecting a tree structure of the segmented blocks 203 and selecting a prediction mode to apply to each of the block portions or sub-blocks.
The partitioning (e.g., by partitioning unit 262) and prediction processing (e.g., by inter-prediction unit 244 and intra-prediction unit 254) performed by video encoder 20 will be described in detail below.
Segmentation
The partitioning unit 262 may divide (or partition) one image block (or CTU) 203 into smaller parts, such as square or rectangular small blocks. For an image with three pixel point arrays, one CTU consists of an N × N block of luminance pixel points and two corresponding blocks of chrominance pixel points. The maximum allowable size of a luminance block in a CTU is specified as 128 × 128 in the versatile video coding (VVC) standard under development, but may be specified as a value other than 128 × 128, for example 256 × 256, in the future. The CTUs of a picture may be grouped into slices/coding block groups, coding blocks or bricks. A coding block covers a rectangular area of a picture, and a coding block may be divided into one or more bricks. A brick consists of multiple rows of CTUs within a coding block. A coding block that is not partitioned into multiple bricks may be referred to as a brick. However, a brick that is a true subset of a coding block is not referred to as a coding block. VVC supports two modes of coding block groups, namely a raster scan slice/coding block group mode and a rectangular slice mode. In the raster scan coding block group mode, a slice/coding block group contains a sequence of coding blocks in the coding block raster scan of an image. In the rectangular slice mode, a slice contains a plurality of bricks of an image that together make up a rectangular area of the image. The bricks in a rectangular slice are arranged in the brick raster scan order of the slice. These smaller blocks (which may also be referred to as sub-blocks) may be further partitioned into smaller portions. This is also referred to as tree splitting or hierarchical tree splitting, where a root block at root tree level 0 (hierarchical level 0, depth 0), etc., may be recursively split into two or more blocks of the next lower tree level, e.g., nodes at tree level 1 (hierarchical level 1, depth 1). These blocks may in turn be split into two or more blocks of the next lower level, e.g. tree level 2 (hierarchical level 2, depth 2), etc., until the splitting ends (because an end criterion is met, e.g. the maximum tree depth or the minimum block size is reached). Blocks that are not further divided are also referred to as leaf blocks or leaf nodes of the tree. A tree divided into two parts is called a binary tree (BT), a tree divided into three parts is called a ternary tree (TT), and a tree divided into four parts is called a quadtree (QT).
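The recursion described above can be illustrated as follows. This is a sketch only: the split decision function is a placeholder for the encoder's rate-distortion based choice, and the maximum depth and minimum size used here are arbitrary assumptions.

```python
# Illustrative recursion for hierarchical tree splitting (quad/binary/ternary).
# 'choose_split' is a placeholder; an encoder would choose the split by RDO.
def split_block(x, y, w, h, depth, choose_split, max_depth=6, min_size=4):
    mode = None
    if depth < max_depth and min(w, h) > min_size:
        mode = choose_split(x, y, w, h, depth)   # 'QT', 'BT_H', 'BT_V', 'TT_H', 'TT_V' or None
    if mode is None:
        return [(x, y, w, h)]                    # leaf block (not further divided)
    children = {
        'QT':   [(x, y, w//2, h//2), (x+w//2, y, w//2, h//2),
                 (x, y+h//2, w//2, h//2), (x+w//2, y+h//2, w//2, h//2)],
        'BT_H': [(x, y, w, h//2), (x, y+h//2, w, h//2)],
        'BT_V': [(x, y, w//2, h), (x+w//2, y, w//2, h)],
        'TT_H': [(x, y, w, h//4), (x, y+h//4, w, h//2), (x, y+3*h//4, w, h//4)],
        'TT_V': [(x, y, w//4, h), (x+w//4, y, w//2, h), (x+3*w//4, y, w//4, h)],
    }[mode]
    leaves = []
    for cx, cy, cw, ch in children:
        leaves += split_block(cx, cy, cw, ch, depth + 1, choose_split, max_depth, min_size)
    return leaves
```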
For example, a Coding Tree Unit (CTU) may be or include a CTB of luminance pixels, two corresponding CTBs of chrominance pixels of an image having three pixel point arrays, or a CTB of pixels of a monochrome image, or a CTB of pixels of an image encoded using three independent color planes and syntax structures (for encoding pixels). Accordingly, a Coding Tree Block (CTB) may be a block of N × N pixels, where N may be set to a value such that the components are divided into CTBs, which is the partition. A Coding Unit (CU) may be or include a coding block of luminance pixels, two corresponding coding blocks of chrominance pixels of an image having three arrays of pixel points, or a coding block of pixels of a monochrome image or a coding block of pixels of an image coded using three independent color planes and syntax structures (for coding pixels). Accordingly, a Coding Block (CB) may be an M × N block of pixels, where M and N may be set to a value such that the CTB is divided into coding blocks, which is the partition.
For example, in an embodiment, a Coding Tree Unit (CTU) may be partitioned into CUs according to HEVC by using a quadtree structure represented as a coding tree. The decision whether to encode an image region using inter (temporal) prediction or intra (spatial) prediction is made at the leaf CU level. Each leaf-CU may be further divided into one, two, or four PUs according to the PU partition type. The same prediction process is used within a PU and the relevant information is transmitted to the decoder in units of PU. After applying the prediction process according to the PU partition type to obtain the residual block, the leaf-CU may be partitioned into Transform Units (TUs) according to other quadtree structures similar to the coding tree used for the CU.
In an embodiment, for example, according to the latest video coding standard currently under development, called versatile video coding (VVC), a combined quadtree with nested multi-type tree (e.g., binary tree and ternary tree) is used to partition the coding tree unit. In the coding tree structure within a coding tree unit, a CU may be square or rectangular. For example, a coding tree unit (CTU) is first partitioned by a quadtree structure, and the quadtree leaf nodes are further partitioned by a multi-type tree structure. The multi-type tree structure has four partition types: vertical binary tree partition (SPLIT_BT_VER), horizontal binary tree partition (SPLIT_BT_HOR), vertical ternary tree partition (SPLIT_TT_VER), and horizontal ternary tree partition (SPLIT_TT_HOR). The leaf nodes of the multi-type tree are called coding units (CUs), and unless the CU is too large for the maximum transform length, this segmentation is used for prediction and transform processing without any further partitioning. In most cases, this means that the block sizes of the CU, PU, and TU are the same in the coding block structure of the quadtree with nested multi-type tree. An exception occurs when the maximum supported transform length is smaller than the width or height of a color component of the CU. VVC formulates a unique signaling mechanism for the partition information in the coding structure with the quadtree and nested multi-type tree. In the signaling mechanism, a coding tree unit (CTU) is first partitioned by the quadtree structure as the root of the quadtree. Each quadtree leaf node (when sufficiently large) is then further partitioned by the multi-type tree structure. In the multi-type tree structure, a first flag (mtt_split_cu_flag) indicates whether the node is further divided; when the node is further divided, a second flag (mtt_split_cu_vertical_flag) indicates the division direction, and a third flag (mtt_split_cu_binary_flag) indicates whether the division is a binary tree division or a ternary tree division. From the values of mtt_split_cu_vertical_flag and mtt_split_cu_binary_flag, the decoder can derive the multi-type tree partition mode (MttSplitMode) of the CU based on predefined rules or a table. It should be noted that for a certain design, such as the 64 × 64 luma block and 32 × 32 chroma pipeline design in a VVC hardware decoder, TT partitioning is not allowed when the width or height of the luma coding block is larger than 64. TT partitioning is also not allowed when the width or height of the chroma coding block is larger than 32. The pipeline design divides the image into a plurality of virtual pipeline data units (VPDUs), each defined as a unit that does not overlap with the others in the image. In a hardware decoder, successive VPDUs are processed simultaneously in multiple pipeline stages. In most pipeline stages, the VPDU size is roughly proportional to the buffer size, so it is desirable to keep the VPDU small. In most hardware decoders, the VPDU size can be set to the maximum transform block (TB) size. However, in VVC, ternary tree (TT) and binary tree (BT) partitioning may increase the size of the VPDU.
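A sketch of how a decoder could map the three flags described above to a split mode is given below. The string names mirror the partition types listed above; the function is a simplification of the table-based derivation in the standard.

```python
# Sketch of the multi-type tree split-mode derivation from the three flags
# described above (simplified; the actual VVC derivation is specified by table).
def mtt_split_mode(mtt_split_cu_flag, mtt_split_cu_vertical_flag, mtt_split_cu_binary_flag):
    if not mtt_split_cu_flag:
        return 'NO_SPLIT'                       # leaf node: coding unit (CU)
    if mtt_split_cu_binary_flag:
        return 'SPLIT_BT_VER' if mtt_split_cu_vertical_flag else 'SPLIT_BT_HOR'
    return 'SPLIT_TT_VER' if mtt_split_cu_vertical_flag else 'SPLIT_TT_HOR'
```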
In addition, when a part of the tree node block exceeds the bottom or the right boundary of the image, the tree node block is forcibly divided until all pixel points of each encoded CU are located within the image boundary.
For example, the intra sub-partitions (ISP) tool may divide a luma intra predicted block into two or four sub-parts vertically or horizontally according to a block size.
In one example, mode select unit 260 of video encoder 20 may be used to perform any combination of the segmentation techniques described above.
As described above, video encoder 20 is configured to determine or select the best or optimal prediction mode from a set of (predetermined) prediction modes. The set of prediction modes may include, for example, intra-prediction modes and/or inter-prediction modes.
Intra prediction
The set of intra prediction modes may include 35 different intra prediction modes, e.g., non-directional modes like DC (or mean) and planar modes, or directional modes as defined in HEVC, or may include 67 different intra prediction modes, e.g., non-directional modes like DC (or mean) and planar modes, or directional modes as defined in VVC. For example, several conventional angular intra prediction modes are adaptively replaced with wide-angle intra prediction modes of non-square blocks defined in VVC. For another example, to avoid division operations for DC prediction, only the longer sides are used to calculate the average of non-square blocks. Also, the intra prediction result of the planar mode may be modified by using a position-dependent intra prediction combination (PDPC) method.
The intra-prediction unit 254 is configured to generate an intra-prediction block 265 according to intra-prediction modes in the intra-prediction mode set by using reconstructed pixel points of neighboring blocks of the same current picture.
Intra-prediction unit 254 (or, generally, mode selection unit 260) is also used to output intra-prediction parameters (or, generally, information indicating the selected intra-prediction mode for the block) to entropy encoding unit 270 in the form of syntax elements 266 for inclusion into encoded image data 21 so that video decoder 30 may perform operations, such as receiving and using the prediction parameters for decoding.
Intra prediction modes in HEVC include direct current prediction mode, planar prediction mode and 33 angular prediction modes, for a total of 35 candidate prediction modes. Fig. 3 is a schematic diagram of HEVC intra prediction directions, and as shown in fig. 3, a current block may be intra predicted using pixels of left and upper reconstructed image blocks as references. An image block used for intra prediction of the current block in the peripheral region of the current block becomes a reference block, and pixels in the reference block are referred to as reference pixels. Of the 35 candidate prediction modes, the direct current prediction mode is applied to a region with flat texture in the current block, and all pixels in the region use the average value of reference pixels in a reference block as prediction; the plane prediction mode is suitable for the image block with smoothly changed texture, and the current block meeting the condition uses the reference pixels in the reference block to carry out bilinear interpolation to be used as the prediction of all pixels in the current block; the angular prediction mode uses the property that the texture of the current block is highly correlated with the texture of the neighboring reconstructed image block to copy the values of the reference pixels in the corresponding reference block along a certain angle as the prediction of all pixels in the current block.
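As an illustration of the DC prediction described above, the following sketch fills the current block with the mean of the reference pixels of the left and above reconstructed blocks. Boundary filtering and the handling of unavailable references are omitted, and the rounding used here is an assumption of this sketch.

```python
# Illustrative DC intra prediction: every pixel of the current block is
# predicted by the mean of the reference pixels of the left and above
# reconstructed neighbours (a sketch; boundary filtering is omitted).
import numpy as np

def intra_dc_predict(left_ref, above_ref, block_w, block_h):
    # left_ref: reconstructed column to the left; above_ref: reconstructed row above
    dc = int(round((np.sum(left_ref) + np.sum(above_ref)) /
                   (len(left_ref) + len(above_ref))))
    return np.full((block_h, block_w), dc, dtype=np.int32)
```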
The HEVC encoder selects an optimal intra prediction mode from the 35 candidate prediction modes shown in fig. 3 for the current block, and writes the optimal intra prediction mode into the video stream. To improve the encoding efficiency of intra prediction, the encoder/decoder derives 3 most probable modes from respective optimal intra prediction modes of reconstructed image blocks in the peripheral region using intra prediction, and if the selected optimal intra prediction mode for the current block is one of the 3 most probable modes, encodes a first index indicating that the selected optimal intra prediction mode is one of the 3 most probable modes; if the selected optimal intra prediction mode is not the 3 most probable modes, a second index is encoded indicating that the selected optimal intra prediction mode is one of the other 32 modes (the other modes except the aforementioned 3 most probable modes among the 35 candidate prediction modes). The HEVC standard uses a fixed length code of 5 bits as the aforementioned second index.
The method for deriving the 3 most probable modes by the HEVC encoder includes: and selecting the optimal intra-frame prediction modes of the left adjacent image block and the upper adjacent image block of the current block to be put into a set, and if the two optimal intra-frame prediction modes are the same, only one optimal intra-frame prediction mode is reserved in the set. If the two optimal intra-frame prediction modes are the same and are both angle prediction modes, two angle prediction modes adjacent to the angle direction are selected and added into a set; otherwise, sequentially selecting a plane prediction mode, a direct current mode and a vertical prediction mode to be added into the set until the number of modes in the set reaches 3.
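The derivation described above can be sketched as follows. The mode numbering (0 = planar, 1 = DC, 2-34 = angular, 26 = vertical) follows the usual HEVC convention and is an assumption of this sketch; the formula for the two neighbouring angular modes is one possible realization of "adjacent in the angular direction".

```python
# Sketch of the 3-most-probable-mode derivation described above (HEVC-style).
# Assumed mode numbering: 0 = planar, 1 = DC, 2..34 = angular, 26 = vertical.
PLANAR, DC, VERTICAL = 0, 1, 26

def derive_mpm(left_mode, above_mode):
    if left_mode == above_mode:
        if left_mode >= 2:                       # identical and both angular
            m = left_mode
            return [m, 2 + ((m - 3) % 32), 2 + ((m - 1) % 32)]
        mpm = [left_mode]                        # identical non-angular: keep only one
    else:
        mpm = [left_mode, above_mode]
    for cand in (PLANAR, DC, VERTICAL):          # fill with planar, DC, vertical in order
        if cand not in mpm:
            mpm.append(cand)
        if len(mpm) == 3:
            break
    return mpm
```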
After entropy decoding is carried out on a code stream by an HEVC decoder, mode information of a current block is obtained, wherein the mode information comprises an indication identifier indicating whether an optimal intra-frame prediction mode of the current block is in 3 most probable modes, and indexes of the optimal intra-frame prediction mode of the current block in the 3 most probable modes or indexes of the optimal intra-frame prediction mode of the current block in other 32 modes.
Inter prediction
In possible implementations, the set of inter prediction modes depends on the available reference pictures (i.e., at least some previously decoded pictures stored in the DBP230, for example, as described above) and other inter prediction parameters, such as whether to use the entire reference picture or only a portion of the reference picture, such as a search window area near the area of the current block, to search for the best matching reference block, and/or such as whether to perform half-pel, quarter-pel, and/or 16-th-pel interpolation of pixels, for example.
In addition to the prediction mode described above, a skip mode and/or a direct mode may be employed.
For example, in extended merge prediction, the merge candidate list for this mode consists of the following five candidate types in order: spatial MVPs from spatially neighboring CUs, temporal MVPs from collocated CUs, history-based MVPs from a FIFO table, pairwise average MVPs, and zero MVs. Decoder-side motion vector refinement (DMVR) based on bilateral matching may be used to increase the accuracy of the MVs in merge mode. Merge mode with MVD (MMVD) is derived from the merge mode with motion vector differences. An MMVD flag is sent immediately after the skip flag and the merge flag to specify whether the CU uses MMVD mode. A CU-level adaptive motion vector resolution (AMVR) scheme may be used. AMVR allows the MVD of a CU to be encoded with different precisions, and the MVD precision of the current CU is adaptively selected according to the prediction mode of the current CU. When a CU is encoded in merge mode, a combined inter/intra prediction (CIIP) mode may be applied to the current CU. The inter and intra prediction signals are weighted-averaged to obtain the CIIP prediction. For affine motion compensated prediction, the affine motion field of a block is described by the motion information of 2 control-point (4-parameter) or 3 control-point (6-parameter) motion vectors. Sub-block-based temporal motion vector prediction (SbTMVP) is similar to temporal motion vector prediction (TMVP) in HEVC, but predicts the motion vectors of the sub-CUs within the current CU. Bi-directional optical flow (BDOF), formerly known as BIO, is a simplified version that requires much less computation, particularly in terms of the number of multiplications and the size of the multiplier. In the triangle partition mode, a CU is evenly divided into two triangular parts using either diagonal or anti-diagonal splitting. Furthermore, the bi-directional prediction mode is extended beyond simple averaging to support weighted averaging of the two prediction signals.
The inter prediction unit 244 may include a Motion Estimation (ME) unit and a Motion Compensation (MC) unit (both not shown in fig. 2). The motion estimation unit may be configured to receive or retrieve an image block 203 (a current image block 203 of a current image 17) and a decoded image 231, or at least one or more previously reconstructed blocks, e.g., of one or more other/different previously decoded images 231, for motion estimation. For example, the video sequence may comprise a current picture and a previous decoded picture 231, or in other words, the current picture and the previous decoded picture 231 may be part of or form the sequence of pictures forming the video sequence.
For example, the encoder 20 may be configured to select a reference block from a plurality of reference blocks of the same or different one of a plurality of other images, and to provide the reference image (or reference image index) and/or an offset (spatial offset) between the position (x, y coordinates) of the reference block and the position of the current block as an inter prediction parameter to the motion estimation unit. This offset is also called a Motion Vector (MV).
The motion compensation unit is configured to obtain, e.g., receive, inter-prediction parameters and perform inter-prediction according to or using the inter-prediction parameters to obtain an inter-prediction block 246. The motion compensation performed by the motion compensation unit may involve extracting or generating a prediction block from a motion/block vector determined by motion estimation, and may also include performing interpolation on sub-pixel precision. Interpolation filtering may generate pixel points for other pixels from pixel points for known pixels, potentially increasing the number of candidate prediction blocks that may be used to encode an image block. Upon receiving a motion vector corresponding to a PU of a current image block, the motion compensation unit may locate, in one of the reference picture lists, a prediction block to which the motion vector points.
Motion compensation unit may also generate syntax elements related to the block and the video slice for use by video decoder 30 in decoding an image block of the video slice. In addition, or as an alternative to slices and corresponding syntax elements, coding block groups and/or coding blocks and corresponding syntax elements may be generated or used.
In the process of obtaining a candidate motion vector list in the advanced motion vector prediction (AMVP) mode, the motion vectors (MVs) that may be added to the candidate motion vector list as alternatives include MVs of spatially adjacent and temporally adjacent image blocks of the current block, where the MVs of the spatially adjacent image blocks may include the MV of a left candidate image block located on the left side of the current block and the MV of an above candidate image block located above the current block. For example, fig. 7 is an exemplary schematic diagram of candidate image blocks according to an embodiment of the present application. As shown in fig. 7, the set of left candidate image blocks includes {A0, A1}, the set of above candidate image blocks includes {B0, B1, B2}, and the set of temporally adjacent candidate image blocks includes {C, T}. All six candidates may be added to the candidate motion vector list, but according to the existing coding standard the maximum length of the AMVP candidate motion vector list is 2, so the MVs of at most two image blocks to be added to the candidate motion vector list need to be determined from the three sets in a specified order. The order may be: first the set of left candidate image blocks {A0, A1} of the current block (A0 is considered first, and A1 is considered if A0 is not available), then the set of above candidate image blocks {B0, B1, B2} of the current block (B0 is considered first, B1 is considered if B0 is not available, and B2 is considered if B1 is not available), and finally the set of temporally adjacent candidate image blocks {C, T} of the current block (T is considered first, and C is considered if T is not available).
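The order described above can be sketched as the following list construction. The 'available' and 'mv_of' callables are placeholders for the availability checks and the stored motion vectors of the neighbouring blocks, and pruning of duplicate candidates and MV scaling are omitted.

```python
# Sketch of the AMVP candidate-list construction order described above.
# 'available' and 'mv_of' are placeholders for availability checks and the
# stored motion vectors of the neighbouring blocks; pruning/scaling omitted.
def build_amvp_list(available, mv_of, max_len=2):
    groups = [['A0', 'A1'],            # left candidates, A0 considered first
              ['B0', 'B1', 'B2'],      # above candidates, B0 considered first
              ['T', 'C']]              # temporal candidates, T considered first
    mv_list = []
    for group in groups:
        for pos in group:
            if available(pos):
                mv_list.append(mv_of(pos))
                break                  # at most one candidate taken per group
        if len(mv_list) == max_len:
            break
    return mv_list
```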
After obtaining the candidate motion vector list, determining an optimal MV from the candidate motion vector list by a rate distortion cost (RD cost), and using the candidate motion vector with the minimum RD cost as a Motion Vector Predictor (MVP) of the current block. The rate-distortion cost is calculated by the following formula:
J=SAD+λR
where J denotes RD cost, SAD is the Sum of Absolute Differences (SAD) between the pixel values of the prediction block and the current block obtained by performing motion estimation using the candidate motion vector, R denotes a code rate, and λ denotes a lagrange multiplier.
The encoding end transmits the index of the determined MVP in the candidate motion vector list to the decoding end. Further, a motion search may be performed in a neighborhood centered on the MVP to obtain the actual motion vector of the current block; the encoding end calculates the motion vector difference (MVD) between the MVP and the actual motion vector, and transmits the MVD to the decoding end. The decoding end parses the index, finds the corresponding MVP in the candidate motion vector list according to the index, parses the MVD, and adds the MVD to the MVP to obtain the actual motion vector of the current block.
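A minimal sketch of the MVP selection by the rate-distortion cost J = SAD + λR and of the MVD computed at the encoding end is shown below; the sad and rate_bits functions and the tuple representation of motion vectors are placeholders for the motion-compensated SAD and the signalling bit cost.

```python
# Sketch of MVP selection by the rate-distortion cost J = SAD + lambda * R and
# of the MVD computed at the encoding end ('sad' and 'rate_bits' are placeholders).
def select_mvp(candidates, sad, rate_bits, lam):
    best_idx, best_cost = 0, float('inf')
    for idx, mv in enumerate(candidates):
        cost = sad(mv) + lam * rate_bits(idx, mv)   # J = SAD + lambda * R
        if cost < best_cost:
            best_idx, best_cost = idx, cost
    return best_idx

def encode_mv(candidates, actual_mv, sad, rate_bits, lam):
    idx = select_mvp(candidates, sad, rate_bits, lam)
    mvp = candidates[idx]
    mvd = (actual_mv[0] - mvp[0], actual_mv[1] - mvp[1])   # transmitted with the index
    return idx, mvd
```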
In the process of obtaining the candidate motion information list in the merge mode, the motion information that may be added to the candidate motion information list as alternatives includes the motion information of spatially adjacent or temporally adjacent image blocks of the current block, where the spatially adjacent and temporally adjacent image blocks are shown in fig. 7. The candidate motion information corresponding to the spatial domain in the candidate motion information list comes from the 5 spatially adjacent blocks (A0, A1, B0, B1, and B2); if a spatially adjacent block is not available or is intra-predicted, its motion information is not added to the candidate motion information list. The temporal candidate motion information of the current block is obtained by scaling the MV of the block at the corresponding position in the reference frame according to the picture order counts (POC) of the reference frame and the current frame; it is first determined whether the block at position T in the reference frame is available, and if not, the block at position C is selected. After the candidate motion information list is obtained, the optimal motion information is determined from the candidate motion information list as the motion information of the current block through the RD cost. The encoding end transmits the index value (denoted merge index) of the position of the optimal motion information in the candidate motion information list to the decoding end.
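The POC-based scaling of the temporal motion vector mentioned above can be sketched as follows; this is a simplified floating-point version, whereas real codecs use fixed-point arithmetic with clipping.

```python
# Sketch of the POC-based scaling of a temporal (collocated) motion vector.
# Simplified float version; real codecs use fixed-point arithmetic and clipping.
def scale_temporal_mv(mv_col, poc_cur, poc_cur_ref, poc_col, poc_col_ref):
    tb = poc_cur - poc_cur_ref    # distance between current picture and its reference
    td = poc_col - poc_col_ref    # distance between collocated picture and its reference
    if td == 0:
        return mv_col
    return (mv_col[0] * tb / td, mv_col[1] * tb / td)
```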
Entropy coding
Entropy encoding unit 270 is configured to apply an entropy encoding algorithm or scheme (e.g., a variable length coding (VLC) scheme, a context adaptive VLC (CAVLC) scheme, an arithmetic coding scheme, a binarization algorithm, context adaptive binary arithmetic coding (CABAC), syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding, or other entropy encoding methods or techniques) to the quantized coefficients 209, inter-prediction parameters, intra-prediction parameters, loop filter parameters, and/or other syntax elements to obtain encoded image data 21 that can be output via output 272 in the form of an encoded bitstream 21 or the like, so that the parameters can be received and used for decoding by video decoder 30 or the like. The encoded bitstream 21 may be transmitted to the video decoder 30 or stored in memory for later transmission or retrieval by the video decoder 30.
Other structural variations of video encoder 20 may be used to encode the video stream. For example, the non-transform based encoder 20 may directly quantize the residual signal without the transform processing unit 206 for some blocks or frames. In another implementation, the encoder 20 may have the quantization unit 208 and the inverse quantization unit 210 combined into a single unit.
Decoder and decoding method
Fig. 3 is an exemplary block diagram of a video decoder 30 according to an embodiment of the present application. The video decoder 30 is configured to receive encoded image data 21 (e.g., encoded bitstream 21), for example, encoded by the encoder 20, resulting in a decoded image 331. The encoded image data or bitstream comprises information for decoding said encoded image data, such as data representing image blocks of an encoded video slice (and/or a group of encoded blocks or a coded block) and associated syntax elements.
In the example of fig. 3, the decoder 30 includes an entropy decoding unit 304, an inverse quantization unit 310, an inverse transform processing unit 312, a reconstruction unit 314 (e.g., a summer 314), a loop filter 320, a decoded picture buffer (DPB) 330, a mode application unit 360, an inter prediction unit 344, and an intra prediction unit 354. The inter prediction unit 344 may be or include a motion compensation unit. In some examples, video decoder 30 may perform a decoding process that is substantially the inverse of the encoding process described with reference to video encoder 20 of fig. 2.
Referring to fig. 3, the inter prediction unit is a trained target model (also called a neural network) for processing an input image or image area or image block to generate a prediction value of the input image block. For example, a neural network for inter prediction is used to receive an input image or image area or image block and generate a prediction value of the input image or image area or image block. The neural network for inter prediction will be described in detail below in conjunction with fig. 6a-6 e.
As described for encoder 20, inverse quantization unit 210, inverse transform processing unit 212, reconstruction unit 214, loop filter 220, decoded picture buffer DPB 230, inter prediction unit 344, and intra prediction unit 354 also constitute a "built-in decoder" for video encoder 20. Accordingly, the inverse quantization unit 310 may be functionally identical to the inverse quantization unit 210, the inverse transform processing unit 312 may be functionally identical to the inverse transform processing unit 212, the reconstruction unit 314 may be functionally identical to the reconstruction unit 214, the loop filter 320 may be functionally identical to the loop filter 220, and the decoded picture buffer 330 may be functionally identical to the decoded picture buffer 230. Accordingly, the explanations of the corresponding units and functions of video encoder 20 apply to the corresponding units and functions of video decoder 30, respectively.
Entropy decoding
The entropy decoding unit 304 is configured to parse the bitstream 21 (or generally the encoded image data 21) and perform entropy decoding on the encoded image data 21 to obtain quantized coefficients 309 and/or decoded encoding parameters (not shown in fig. 3), such as any or all of inter-prediction parameters (e.g., reference image indexes and motion vectors), intra-prediction parameters (e.g., intra-prediction modes or indexes), transformation parameters, quantization parameters, loop filter parameters, and/or other syntax elements. Entropy decoding unit 304 may be used to apply a decoding algorithm or scheme corresponding to the encoding scheme of entropy encoding unit 270 of encoder 20. Entropy decoding unit 304 may also be used to provide inter-prediction parameters, intra-prediction parameters, and/or other syntax elements to mode application unit 360, as well as to provide other parameters to other units of decoder 30. Video decoder 30 may receive syntax elements at the video slice and/or video block level. In addition, or as an alternative to slices and corresponding syntax elements, coding block groups and/or coding blocks and corresponding syntax elements may be received or used.
Inverse quantization
Inverse transformation
The inverse transform processing unit 312 is operable to receive the dequantized coefficients 311, also referred to as transform coefficients 311, and apply a transform to the dequantized coefficients 311 to obtain a reconstructed residual block 313 in the pixel domain. The reconstructed residual block 313 may also be referred to as a transform block 313. The transform may be an inverse transform, such as an inverse DCT, an inverse DST, an inverse integer transform, or a conceptually similar inverse transform process. The inverse transform processing unit 312 may also be used to receive transform parameters or corresponding information from the encoded image data 21 (e.g., parsed and/or decoded by the entropy decoding unit 304) to determine the transform to apply to the dequantized coefficients 311.
Reconstruction
The reconstruction unit 314 (e.g., summer 314) is configured to add the reconstructed residual block 313 to the prediction block 365 to obtain a reconstruction block 315 in the pixel domain, e.g., to add pixel point values of the reconstructed residual block 313 and pixel point values of the prediction block 365.
Filtering
The loop filter unit 320 (in or after the coding loop) is used to filter the reconstruction block 315, resulting in a filter block 321, to smooth pixel transitions or otherwise improve video quality. The loop filter unit 320 may include one or more loop filters, such as a deblocking filter, a sample-adaptive offset (SAO) filter, or one or more other filters, such as an adaptive loop filter (ALF), a noise suppression filter (NSF), or any combination thereof. For example, the loop filter unit 320 may include a deblocking filter, an SAO filter, and an ALF filter, and the order of the filtering process may be deblocking filter, SAO filter, and ALF filter. As another example, a process called luma mapping with chroma scaling (LMCS) (i.e., an adaptive in-loop reshaper) is added; this process is performed before deblocking. As another example, the deblocking filtering process may also be applied to internal sub-block edges, such as affine sub-block edges, ATMVP sub-block edges, sub-block transform (SBT) edges, and intra sub-partition (ISP) edges. Although the loop filter unit 320 is shown in fig. 3 as an in-loop filter, in other configurations the loop filter unit 320 may be implemented as a post-loop filter.
Decoded picture buffer
The decoded video blocks 321 of a picture are then stored in the decoded picture buffer 330, which stores the decoded pictures 331 as reference pictures for subsequent motion compensation of other pictures and/or for output or display, respectively.
The decoder 30 is configured to output the decoded image 331 via an output 332 or the like for display to a user or for viewing by a user.
Prediction
When a video slice is encoded as an intra-coded (I) slice, the intra-prediction unit 354 in the mode application unit 360 is used to generate a prediction block 365 for an image block of the current video slice according to the indicated intra-prediction mode and data from a previously decoded block of the current image. When the video picture is encoded as an inter-coded (i.e., B or P) slice, an inter prediction unit 344 (e.g., a motion compensation unit) in the mode application unit 360 is used to generate a prediction block 365 for the video block of the current video slice based on the motion vectors and other syntax elements received from the entropy decoding unit 304. For inter prediction, the prediction blocks may be generated from one of the reference pictures in one of the reference picture lists. Video decoder 30 may use a default construction technique to construct reference frame list 0 and list 1 from the reference pictures stored in DPB 330. The same or similar process may be applied to embodiments of coding block groups (e.g., video coding block groups) and/or coding blocks (e.g., video coding blocks), in addition to or instead of slices (e.g., video slices), e.g., video may be encoded using I, P or B coding block groups and/or coding blocks.
Mode application unit 360 is configured to determine prediction information for the video blocks of the current video slice by parsing the motion vectors and other syntax elements, and to generate prediction blocks for the current video block being decoded using the prediction information. For example, mode application unit 360 uses some of the received syntax elements to determine a prediction mode (e.g., intra prediction or inter prediction) used to encode the video blocks of the video slice, an inter prediction slice type (e.g., B slice, P slice, or GPB slice), construction information for one or more reference picture lists of the slice, a motion vector for each inter-coded video block of the slice, an inter prediction state for each inter-coded video block of the slice, and other information, in order to decode the video blocks within the current video slice. The same or similar process may be applied to embodiments of coding block groups (e.g., video coding block groups) and/or coding blocks (e.g., video coding blocks), in addition to or instead of slices (e.g., video slices); for example, video may be encoded using I, P, or B coding block groups and/or coding blocks.
In one embodiment, the video decoder 30 of fig. 3 may also be used to segment and/or decode images using slices (also referred to as video slices), where an image may be segmented or decoded using one or more slices (typically non-overlapping). Each slice may comprise one or more blocks (e.g., CTUs) or one or more block groups (e.g., tiles in the H.265/HEVC and VVC standards and bricks in the VVC standard).
In one embodiment, the video decoder 30 shown in fig. 3 may be further configured to partition and/or decode a picture using one or more slice/coding block groups (generally non-overlapping) and/or coding blocks (also referred to as video coding blocks), each of which may include one or more blocks (e.g., CTUs) or one or more coding blocks, etc., wherein each of the coding blocks may be rectangular, etc., and may include one or more complete or partial blocks (e.g., CTUs).
Other variations of video decoder 30 may be used to decode encoded image data 21. For example, decoder 30 may generate an output video stream without loop filter unit 320. For example, the non-transform based decoder 30 may directly inverse quantize the residual signal without the inverse transform processing unit 312 for some blocks or frames. In another implementation, video decoder 30 may have inverse quantization unit 310 and inverse transform processing unit 312 combined into a single unit.
It should be understood that in the encoder 20 and the decoder 30, the processing result of the current step may be further processed and then output to the next step. For example, after interpolation filtering, motion vector derivation, or loop filtering, further operations, such as clip (clip) or shift (shift) operations, may be performed on the processing results of interpolation filtering, motion vector derivation, or loop filtering.
It should be noted that the derived motion vector for the current block (including but not limited to the control point motion vectors for affine mode, the sub-block motion vectors for affine, planar, and ATMVP modes, temporal motion vectors, etc.) may be further operated on. For example, the value of the motion vector is limited to a predefined range according to the representation bits of the motion vector. If the representation bits of the motion vector are bitDepth, the range is -2^(bitDepth-1) to 2^(bitDepth-1) - 1, where "^" represents exponentiation. For example, if bitDepth is set to 16, the range is -32768 to 32767; if bitDepth is set to 18, the range is -131072 to 131071. For example, the values of the derived motion vectors (e.g., the MVs of the 4 × 4 sub-blocks within an 8 × 8 block) are restricted such that the maximum difference between the integer parts of the 4 × 4 sub-block MVs does not exceed N pixels, e.g., 1 pixel. Two methods of limiting the motion vector according to bitDepth are provided herein.
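For illustration only, a minimal Python sketch of the clamping idea behind such a bitDepth-based restriction is given below; it is not one of the two specific methods referred to above, and the function name and interface are assumptions.

```python
def clip_mv(mv, bit_depth=16):
    """Clamp each motion vector component to the signed range representable
    with bit_depth bits, e.g. [-32768, 32767] when bit_depth is 16."""
    lo = -(1 << (bit_depth - 1))
    hi = (1 << (bit_depth - 1)) - 1
    return tuple(max(lo, min(hi, c)) for c in mv)

# Example: clip_mv((40000, -5), 16) returns (32767, -5).
```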
Although the above embodiments primarily describe video coding, it should be noted that the embodiments of coding system 10, encoder 20, and decoder 30, as well as the other embodiments described herein, may also be used for still image processing or coding, i.e., the processing or coding of a single image independent of any preceding or consecutive image, as in video coding. In general, if image processing is limited to a single image 17, the inter prediction unit 244 (encoder) and the inter prediction unit 344 (decoder) may not be available. All other functions (also referred to as tools or techniques) of video encoder 20 and video decoder 30 are equally available for still image processing, such as residual calculation 204/304, transform 206, quantization 208, inverse quantization 210/310, (inverse) transform 212/312, partition 262/362, intra prediction 254/354 and/or loop filtering 220/320, entropy encoding 270, and entropy decoding 304.
Fig. 4 is an exemplary block diagram of a video coding apparatus 400 of an embodiment of the present application. Video coding apparatus 400 is suitable for implementing the disclosed embodiments described herein. In one embodiment, video coding device 400 may be a decoder, such as video decoder 30 in FIG. 1a, or an encoder, such as video encoder 20 in FIG. 1 a.
The processor 430 is implemented by hardware and software. The processor 430 may be implemented as one or more processor chips, cores (e.g., multi-core processors), FPGAs, ASICs, and DSPs. The processor 430 is in communication with the ingress port 410, the receiving unit 420, the transmitting unit 440, the egress port 450, and the memory 460. The processor 430 includes a coding module 470 (e.g., a neural-network-based coding module 470). The coding module 470 implements the embodiments disclosed above. For example, the coding module 470 performs, processes, prepares, or provides various encoding operations. Thus, the coding module 470 provides a substantial improvement to the functionality of the video coding apparatus 400 and effects a transformation of the video coding apparatus 400 to a different state. Alternatively, the coding module 470 may be implemented as instructions stored in the memory 460 and executed by the processor 430.
Fig. 5 is an exemplary block diagram of an apparatus 500 according to an embodiment of the present application, and the apparatus 500 may be used as either or both of the source device 12 and the destination device 14 in fig. 1 a.
The processor 502 in the apparatus 500 may be a central processing unit. Alternatively, the processor 502 may be any other type of device, or multiple devices, now existing or later developed, capable of manipulating or processing information. Although the disclosed implementations may be practiced with a single processor such as the processor 502 shown, advantages in speed and efficiency may be achieved by using more than one processor.
In one implementation, the memory 504 in the apparatus 500 may be a Read Only Memory (ROM) device or a Random Access Memory (RAM) device. Any other suitable type of storage device may be used for memory 504. The memory 504 may include code and data 506 that the processor 502 accesses over the bus 512. The memory 504 may also include an operating system 508 and application programs 510, the application programs 510 including at least one program that allows the processor 502 to perform the methods described herein. For example, applications 510 may include applications 1 through N, including video coding applications that perform the methods described herein.
The apparatus 500 may also include one or more output devices, such as a display 518. In one example, display 518 may be a touch-sensitive display that combines the display with touch-sensitive elements that may be used to sense touch inputs. A display 518 may be coupled to the processor 502 by the bus 512.
Although the bus 512 in the apparatus 500 is described herein as a single bus, the bus 512 may include multiple buses. Further, the secondary storage may be coupled directly to other components of the apparatus 500 or accessed over a network, and may comprise a single integrated unit such as a memory card or multiple units such as multiple memory cards. Accordingly, the apparatus 500 may have a variety of configurations.
Since the embodiments of the present application relate to the application of neural networks, some terms or expressions used in the embodiments of the present application will be explained below for the convenience of understanding, and the terms or expressions are also included as part of the summary of the invention.
(1) Neural network
A Neural Network (NN) is a machine learning model. A neural network may be composed of neural units, where a neural unit may refer to an operation unit that takes x_s and an intercept of 1 as inputs, and the output of the operation unit may be: f(∑_{s=1}^{n} W_s·x_s + b),
where s = 1, 2, … n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of the previous layer to extract the features of that local receptive field, where the local receptive field may be a region composed of several neural units.
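As an illustration of the formula above, a single neural unit might be sketched in Python as follows; the sigmoid activation and the array shapes are assumptions made for the example.

```python
import numpy as np

def neural_unit(xs, ws, b):
    """Output of one neural unit: f(sum_s W_s * x_s + b) with a sigmoid f."""
    s = np.dot(ws, xs) + b           # weighted sum of the inputs plus the bias
    return 1.0 / (1.0 + np.exp(-s))  # sigmoid activation f

# Example: neural_unit(np.array([0.5, -1.0]), np.array([0.3, 0.8]), b=0.1)
```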
(2) Deep neural network
Deep Neural Networks (DNNs), also known as multi-layer neural networks, can be understood as neural networks with many hidden layers, where "many" has no particular metric. Based on the positions of the different layers, the layers inside a DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, that is, any neuron of the i-th layer is connected to every neuron of the (i+1)-th layer. Although a DNN appears complex, the work of each layer is not complex and is simply the following linear relational expression: y = α(W·x + b), where x is the input vector, y is the output vector, b is the offset (bias) vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Because the number of DNN layers is large, the numbers of coefficients W and offset vectors b are also large. These parameters are defined in the DNN as follows, taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_{24}, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4. In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W^L_{jk}. Note that the input layer has no W parameters. In a deep neural network, more hidden layers make the network better able to describe complex situations in the real world. In theory, a model with more parameters has higher complexity and a larger "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
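A minimal sketch of the per-layer relation y = α(Wx + b) chained over several fully connected layers is shown below for illustration; the layer sizes and the tanh activation are assumptions.

```python
import numpy as np

def dnn_forward(x, layers, alpha=np.tanh):
    """Forward pass of a fully connected DNN; `layers` is a list of (W, b)
    pairs, and W[j, k] is the coefficient from neuron k of the previous
    layer to neuron j of the current layer."""
    y = x
    for W, b in layers:
        y = alpha(W @ y + b)
    return y

# Example: a 4-dimensional input through two hidden layers of 8 neurons.
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((8, 4)), np.zeros(8)),
          (rng.standard_normal((8, 8)), np.zeros(8))]
y = dnn_forward(rng.standard_normal(4), layers)
```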
(3) Convolutional neural network
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, where the deep learning architecture refers to learning at multiple levels in different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto. The convolutional neural network includes a feature extractor consisting of convolutional and pooling layers. The feature extractor may be viewed as a filter and the convolution process may be viewed as convolving an input image or convolved feature plane (feature map) with a trainable filter.
A convolutional layer is a layer of neurons in a convolutional neural network that performs convolution processing on the input signal. A convolutional layer may comprise a number of convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually slid over the input image pixel by pixel (or two pixels by two pixels, etc., depending on the value of the stride) in the horizontal direction, so as to extract a specific feature from the image. The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends over the entire depth of the input image. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a single weight matrix is not used; instead, a plurality of weight matrices of the same size (rows × columns), i.e., a plurality of matrices of the same type, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where this dimension can be understood as being determined by the "plurality" described above. Different weight matrices may be used to extract different features in the image, e.g., one weight matrix to extract image edge information, another to extract a particular color of the image, yet another to blur unwanted noise in the image, and so on. The plurality of weight matrices have the same size (rows × columns), so the feature maps extracted by them also have the same size, and the extracted feature maps of the same size are combined to form the output of the convolution operation. In practical applications, the weight values in these weight matrices need to be obtained through a large amount of training, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network can make correct predictions. When the convolutional neural network has a plurality of convolutional layers, the initial convolutional layers often extract more general features, which may also be called low-level features; as the depth of the convolutional neural network increases, the features extracted by the later convolutional layers become more complex, such as features with high-level semantics, and features with higher semantics are more suitable for the problem to be solved.
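For illustration, a naive Python sketch of the convolution described above (one kernel spanning the full input depth per output channel, slid with a given stride) is given below; it is a didactic example, not the convolution used in any particular layer of the embodiments.

```python
import numpy as np

def conv2d(image, kernels, stride=1):
    """Valid 2D convolution: `image` is H x W x C_in, `kernels` is
    K x K x C_in x C_out. Each kernel extends over the whole input depth,
    and the C_out single-depth outputs are stacked to form the depth
    dimension of the feature map."""
    H, W, _ = image.shape
    K, _, _, C_out = kernels.shape
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    out = np.zeros((out_h, out_w, C_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + K,
                          j * stride:j * stride + K, :]
            for c in range(C_out):
                out[i, j, c] = np.sum(patch * kernels[:, :, :, c])
    return out
```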
Since it is often desirable to reduce the number of training parameters, it is often desirable to periodically introduce pooling layers after a convolutional layer, either one convolutional layer followed by one pooling layer or multiple convolutional layers followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to smaller sized images. The average pooling operator may calculate pixel values in the image over a certain range to produce an average as a result of the average pooling. The max pooling operator may take the pixel with the largest value in a particular range as the result of the max pooling. In addition, just as the size of the weighting matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after the processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel point in the image output by the pooling layer represents an average value or a maximum value of a corresponding sub-region of the image input to the pooling layer.
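A minimal sketch of non-overlapping pooling is given below for illustration; the window size and the max variant are assumptions, and average pooling would replace the maximum with the mean.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Max pooling over non-overlapping size x size windows of an
    H x W x C feature map; each output pixel is the maximum of the
    corresponding sub-region of the input."""
    H, W, C = x.shape
    x = x[:H - H % size, :W - W % size, :]  # drop any edge remainder
    return x.reshape(H // size, size, W // size, size, C).max(axis=(1, 3))
```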
After processing by the convolutional layers/pooling layers, the convolutional neural network is still not sufficient to output the required output information, because, as mentioned above, the convolutional layers/pooling layers only extract features and reduce the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network needs to use the neural network layer to generate one output or a group of outputs whose number equals the number of required classes. Therefore, the neural network layer may include a plurality of hidden layers, and the parameters contained in the plurality of hidden layers may be obtained by pre-training on the associated training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
Optionally, after the plurality of hidden layers in the neural network layer, an output layer of the whole convolutional neural network is further included; the output layer has a loss function similar to categorical cross-entropy and is specifically used for calculating the prediction error.
(4) Recurrent neural networks
Recurrent Neural Networks (RNNs) are used to process sequence data. In the traditional neural network model, from the input layer to the hidden layer to the output layer, the layers are fully connected, while the nodes within each layer are unconnected. Although this common neural network solves many problems, it is still incapable of solving many others. For example, to predict the next word in a sentence, one usually needs to use the previous words, because the preceding and following words in a sentence are not independent. The RNN is called a recurrent neural network because the current output of a sequence is also related to the previous output. The concrete manifestation is that the network memorizes the previous information and applies it to the computation of the current output; that is, the nodes of the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, RNNs can process sequence data of any length. Training an RNN is the same as training a conventional CNN or DNN: the error back-propagation algorithm is also used, but with a small difference: if the RNN is unrolled, the parameters therein, such as W, are shared, which is not the case in the conventional neural networks exemplified above. Moreover, when using the gradient descent algorithm, the output of each step depends not only on the network of the current step but also on the network states of the previous steps. This learning algorithm is called the Back Propagation Through Time (BPTT) algorithm.
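The dependence of the hidden layer on its output at the previous moment can be sketched, for illustration only, as a single recurrent step; the tanh activation and the shapes are assumptions, and unrolling over a sequence reuses the same shared parameters W_x, W_h, b at every step.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a simple RNN: the new hidden state depends on the
    current input x_t and on the hidden state of the previous moment."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)
```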
Why is a recurrent neural network needed when there is already a convolutional neural network? The reason is simple: in a convolutional neural network there is a precondition assumption that the elements are independent of each other, as are the inputs and outputs, such as cats and dogs. However, in the real world many elements are interconnected, such as stock prices changing over time. For another example, a person says: "I like traveling, and my favorite place is Yunnan; in the future, when I have the chance, I will go to ____." To fill in the blank here, humans all know to fill in "Yunnan", because humans infer from the context; but how can a machine do this? The RNN was created for this purpose. RNNs aim to give machines a memory ability like humans. Therefore, the output of an RNN needs to depend on the current input information and on historical memory information.
(5) Loss function
In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is truly desired, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the truly desired target value (of course, an initialization process usually takes place before the first update, i.e., parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to predict a slightly lower value, and the adjustment continues until the deep neural network can predict the truly desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the purpose of loss functions (loss functions) or objective functions (objective functions), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
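For illustration, a minimal sketch of a loss function and of the kind of update that reduces it is given below; mean squared error and a plain gradient step are assumptions, not the loss or optimizer used by the embodiments.

```python
import numpy as np

def mse_loss(pred, target):
    """The larger the loss, the larger the difference between the
    predicted value and the truly desired target value."""
    return np.mean((pred - target) ** 2)

def gradient_step(w, grad, lr=1e-3):
    """Adjust the weights slightly in the direction that lowers the loss."""
    return w - lr * grad
```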
(6) Back propagation algorithm
During training, a convolutional neural network can use the error Back Propagation (BP) algorithm to correct the values of the parameters in the initial super-resolution model, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is propagated forward until an error loss is produced at the output, and the parameters of the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back-propagation algorithm is a back-propagation process dominated by the error loss, aiming to obtain the optimal parameters of the super-resolution model, such as the weight matrices.
(7) Generative countermeasure network
Generative Adversarial Networks (GANs) are a deep learning model. The model comprises at least two modules: one module is a Generative Model and the other is a Discriminative Model, and the two modules learn from each other through a game, thereby producing better output. Both the generative model and the discriminative model may be neural networks, specifically deep neural networks or convolutional neural networks. The basic principle of a GAN is as follows, taking a GAN that generates pictures as an example: assume there are two networks, G (Generator) and D (Discriminator), where G is a network that generates pictures; it receives a random noise z and generates a picture from this noise, denoted G(z). D is a discrimination network used to judge whether a picture is "real"; its input parameter is x, where x represents a picture, and the output D(x) represents the probability that x is a real picture: if it is 1, the picture is 100% real, and if it is 0, the picture cannot be real. In the process of training this generative adversarial network, the goal of the generative network G is to generate pictures as realistic as possible to deceive the discrimination network D, while the goal of D is to distinguish the pictures generated by G from real pictures as well as possible. Thus, G and D constitute a dynamic "game" process, namely the "adversarial" part of the "generative adversarial network". In the ideal final state of the game, G can generate pictures G(z) that are difficult to distinguish from real ones, and D finds it hard to judge whether a picture generated by G is real, i.e., D(G(z)) = 0.5. This yields an excellent generative model G, which can be used to generate pictures.
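For illustration, the two objectives of the game can be sketched as binary cross-entropy losses; this common formulation is an assumption, not the formulation of the embodiments.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """D should output values near 1 for real pictures (d_real) and near 0
    for generated pictures (d_fake = D(G(z)))."""
    eps = 1e-12
    return -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def generator_loss(d_fake):
    """G tries to push D(G(z)) towards 1, i.e. to deceive the discriminator."""
    eps = 1e-12
    return -np.mean(np.log(d_fake + eps))
```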
The target model (also called a neural network) for inter prediction will be described in detail below in conjunction with figs. 6a-6e. Figs. 6a-6e are several exemplary architectures of neural networks for inter prediction according to embodiments of the present application.
As shown in fig. 6a, the neural network includes, in processing order: a 3 × 3 convolutional layer (3 × 3 Conv), an activation layer (Relu), a block-processing layer (Res-Block), …, a block-processing layer, a 3 × 3 convolutional layer, an activation layer, and a 3 × 3 convolutional layer. The matrix obtained by passing the original input matrix of the neural network through these layers is added to the original matrix to obtain the final output matrix.
As shown in fig. 6b, the neural network includes, in processing order: two parallel branches each consisting of a 3 × 3 convolutional layer and an activation layer, followed by a single path of a block-processing layer, …, a block-processing layer, a 3 × 3 convolutional layer, an activation layer, and a 3 × 3 convolutional layer. The first matrix passes through one branch of 3 × 3 convolutional layer and activation layer, the second matrix passes through the other branch of 3 × 3 convolutional layer and activation layer, and the two processed matrices are concatenated (concat); the concatenated matrix is then processed by the block-processing layer, …, the block-processing layer, the 3 × 3 convolutional layer, the activation layer, and the 3 × 3 convolutional layer, and the result is finally added to the first matrix to obtain the final output matrix.
As shown in fig. 6c, the neural network includes, in processing order: two parallel branches each consisting of a 3 × 3 convolutional layer and an activation layer, followed by a single path of a block-processing layer, …, a block-processing layer, a 3 × 3 convolutional layer, an activation layer, and a 3 × 3 convolutional layer. The first matrix and the second matrix are multiplied before being input into the neural network; the first matrix then passes through one branch of 3 × 3 convolutional layer and activation layer, the multiplied matrix passes through the other branch of 3 × 3 convolutional layer and activation layer, and the two processed matrices are added; the sum is then processed by the block-processing layer, …, the block-processing layer, the 3 × 3 convolutional layer, the activation layer, and the 3 × 3 convolutional layer, and the result is finally added to the first matrix to obtain the final output matrix.
As shown in fig. 6d, a block-processing layer includes, in processing order: a 3 × 3 convolutional layer, an activation layer, and a 3 × 3 convolutional layer; the input matrix is processed by these layers, and the processed matrix is added to the initial input matrix to obtain the output matrix. As shown in fig. 6e, a block-processing layer may alternatively include, in processing order: a 3 × 3 convolutional layer, an activation layer, a 3 × 3 convolutional layer, and an activation layer; the input matrix is processed by these layers, the processed matrix is added to the initial input matrix, and the sum then passes through an activation layer to obtain the output matrix.
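For illustration, the block-processing layer of fig. 6d could be sketched in Python (PyTorch) as follows; the channel count and the choice of framework are assumptions, and this is a sketch of the structure rather than the trained model itself.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Block-processing layer as in fig. 6d: 3x3 conv, activation, 3x3 conv,
    with the initial input added back to the processed result."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return x + self.conv2(self.relu(self.conv1(x)))
```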
It should be noted that fig. 6a to 6e only show several exemplary architectures of the neural network for inter-frame prediction according to the embodiment of the present application, which do not constitute a limitation on the architecture of the neural network, the number of layers, the layer structure, the addition, the multiplication, or the combination included in the neural network, and the number, the size, and the like of the input and/or output matrices may be determined according to actual situations, and this is not specifically limited in the present application.
Fig. 8 is a flow chart of a process 800 of an inter prediction method according to an embodiment of the present application. Process 800 may be performed by video encoder 20 or video decoder 30, and in particular, may be performed by the inter prediction unit 244, 344 of video encoder 20 or video decoder 30. Process 800 is described as a series of steps or operations; it should be understood that process 800 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in fig. 8. Assuming that a video data stream having a plurality of image frames is being processed by a video encoder or a video decoder, the process 800, comprising the following steps, is performed to inter-predict an image or an image block. The process 800 may include:
The peripheral region of the current block includes a spatial neighborhood and/or a temporal neighborhood of the current block, wherein the image blocks of the spatial neighborhood may include left candidate image blocks located to the left of the current block and above candidate image blocks located above the current block. Illustratively, as shown in FIG. 7, the set of left candidate image blocks comprises {A0, A1}, the set of above candidate image blocks comprises {B0, B1, B2}, and the set of temporally adjacent candidate image blocks comprises {C, T}. A reconstructed image block may refer to an encoded image block that has been encoded and reconstructed at the encoding end, or a decoded image block that has been decoded and reconstructed at the decoding end. A reconstructed image block may also refer to a basic unit image block of a predetermined size obtained by dividing an encoded image block or a decoded image block. For example, the size of an encoded or decoded image block may be 16 × 16, 64 × 64, or 32 × 16, and the size of a basic unit image block may be 4 × 4 or 8 × 8.
In the following, a reconstructed image block is taken as an example, the reconstructed image block may be any one of a plurality of reconstructed image blocks in the peripheral region, and the method may be referred to for other reconstructed image blocks.
The motion vectors of the reconstructed image block may include: (1) a plurality of a posteriori motion vectors for the reconstructed image block, the plurality of a posteriori motion vectors determined according to a reconstructed value of the reconstructed image block and predicted values corresponding to a plurality of a posteriori candidate motion vectors; or, (2) an optimal motion vector of the reconstructed image block, where the optimal motion vector is an a posteriori motion vector with a largest probability value or a smallest prediction error value among the a posteriori motion vectors.
The plurality of a posteriori candidate motion vectors are obtained from the plurality of a priori candidate motion vectors of the reconstructed image block: any one of the plurality of a priori candidate motion vectors of the reconstructed image block can be offset within a preset search window to generate a plurality of offset candidate motion vectors. It can be seen that one a priori candidate motion vector of a reconstructed image block may yield multiple offset candidate motion vectors, and all the offset candidate motion vectors obtained by this operation constitute the plurality of a posteriori candidate motion vectors of the reconstructed image block. For example, fig. 10 is an exemplary schematic diagram of a search window according to an embodiment of the present application. As shown in fig. 10, assuming that a certain a priori candidate motion vector of the reconstructed image block is (0,0), offsetting this a priori candidate motion vector within a 3 × 3 search window may yield 9 offset candidate motion vectors: (-1,-1), (-1,0), (-1,1), (0,-1), (0,0), (0,1), (1,-1), (1,0), and (1,1). These 9 offset candidate motion vectors are a plurality of a posteriori candidate motion vectors of the reconstructed image block.
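For illustration, generating the offset candidate motion vectors within a square search window might be sketched as follows; the function name and the vector representation are assumptions. With prior_mv = (0, 0) and window = 3 it yields the 9 vectors listed above.

```python
def offset_candidates(prior_mv, window=3):
    """Offset an a priori candidate motion vector within a window x window
    search window, producing window * window offset candidate motion vectors."""
    r = window // 2
    mx, my = prior_mv
    return [(mx + dx, my + dy)
            for dx in range(-r, r + 1)
            for dy in range(-r, r + 1)]
```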
The plurality of a posteriori motion vectors of the reconstructed image block may refer to the plurality of a posteriori candidate motion vectors, or may be a subset of the plurality of a posteriori candidate motion vectors, for example, a plurality of designated motion vectors selected from the plurality of a posteriori candidate motion vectors.
The probability values or prediction error values of a plurality of a posteriori motion vectors may be referred to as described below.
In one possible implementation, besides obtaining the motion vector of the reconstructed image block, the related information of the reconstructed image block may also be obtained, and the related information and the obtaining method thereof are as follows:
First, a plurality of prediction error values corresponding to the plurality of a posteriori motion vectors of the reconstructed image block; these prediction error values are also determined according to the reconstructed value of the reconstructed image block and the predicted values corresponding to the plurality of a posteriori candidate motion vectors.
Motion compensation is performed on the reconstructed image block according to each of the plurality of a posteriori candidate motion vectors to obtain a plurality of predicted values, which correspond to the plurality of a posteriori candidate motion vectors.
The plurality of predicted values are then compared with the reconstructed value of the reconstructed image block to obtain a plurality of prediction error values, which correspond to the plurality of a posteriori candidate motion vectors. The prediction error value corresponding to a given a posteriori candidate motion vector may be obtained using methods such as the sum of absolute differences (SAD) or the sum of squared differences (SSE).
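A minimal sketch of the SAD comparison is given below for illustration; `motion_compensate` is a hypothetical helper standing in for the motion compensation step described above.

```python
import numpy as np

def sad(prediction, reconstruction):
    """Sum of absolute differences between a predicted block and the
    reconstructed block; SSE would use squared differences instead."""
    diff = prediction.astype(np.int64) - reconstruction.astype(np.int64)
    return np.abs(diff).sum()

# One prediction error value per a posteriori candidate motion vector, e.g.:
# errors = [sad(motion_compensate(reference, mv), recon_block)
#           for mv in posterior_candidates]
```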
If the plurality of a posteriori motion vectors of the reconstructed image block refer to the plurality of a posteriori candidate motion vectors, the plurality of prediction error values of the reconstructed image block corresponding to the plurality of a posteriori motion vectors refer to a plurality of prediction error values corresponding to the plurality of a posteriori candidate motion vectors; if the plurality of a posteriori motion vectors of the reconstructed image block are partial motion vectors of the plurality of a posteriori motion vector candidates, the plurality of prediction error values corresponding to the plurality of a posteriori motion vectors of the reconstructed image block are prediction error values corresponding to the partial motion vectors selected from the plurality of prediction error values corresponding to the plurality of a posteriori motion vector candidates.
Second, a plurality of probability values corresponding to the plurality of a posteriori motion vectors of the reconstructed image block; these probability values are also determined according to the reconstructed value of the reconstructed image block and the predicted values corresponding to the plurality of a posteriori candidate motion vectors.
The probability values corresponding to a plurality of a posteriori motion vectors of the reconstructed image block may be obtained by the following two methods:
one is to obtain probability values of the reconstructed image block according to the prediction error values of the reconstructed image block obtained in the first method. For example, normalization processing may be performed on the multiple prediction error values of the reconstructed image block by using a normalization index function, a linear normalization method, or the like, to obtain normalized values of the multiple prediction error values, where the normalized values of the multiple prediction error values are multiple probability values of the reconstructed image block, and based on a correspondence between the multiple prediction error values of the reconstructed image block and multiple a posteriori motion vectors, the multiple probability values of the reconstructed image block also correspond to the multiple a posteriori motion vectors of the reconstructed image block, and the probability values may represent probabilities that the a posteriori motion vector corresponding to the probability value becomes an optimal motion vector of the reconstructed image block.
The other method is to input the reconstructed value of the reconstructed image block and the plurality of predicted values of the reconstructed image block obtained by the first method into a trained neural network to obtain the plurality of probability values of the reconstructed image block corresponding to the plurality of a posteriori motion vectors. For the neural network, refer to the description of the training engine 25 above, which will not be repeated here.
Therefore, after the prediction error values or the probability values corresponding to the plurality of a posteriori motion vectors are obtained by the above two methods, the optimal motion vector of the reconstructed image block may be obtained in either of the following two ways:
one is to use the a posteriori motion vector corresponding to the smallest prediction error value among a plurality of prediction error values corresponding to a plurality of a posteriori motion vectors as the optimal motion vector of the reconstructed image block.
The other is to take the a posteriori motion vector corresponding to the maximum probability value among the probability values corresponding to the plurality of a posteriori motion vectors as the optimal motion vector of the reconstructed image block.
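For illustration, the normalization of prediction error values into probability values and the selection of the optimal motion vector might be sketched as follows; the normalized exponential over negated errors (so that a smaller error yields a larger probability) and the scale parameter are assumptions.

```python
import numpy as np

def errors_to_probabilities(errors, scale=1.0):
    """Normalized exponential (softmax) over negated prediction errors:
    the probabilities sum to 1 and a smaller error gives a larger value."""
    z = -np.asarray(errors, dtype=np.float64) / scale
    z -= z.max()          # numerical stability
    p = np.exp(z)
    return p / p.sum()

def optimal_motion_vector(posterior_mvs, errors):
    """Optimal motion vector: the a posteriori motion vector with the
    smallest prediction error (equivalently, the largest probability)."""
    return posterior_mvs[int(np.argmin(errors))]
```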
In the present application, the memory may be directly read to obtain the motion vector and the related information of the reconstructed image block. After the reconstructed image block is encoded or decoded, the motion vector or the motion vector and the related information of the reconstructed image block can be immediately obtained by adopting the method, and then the motion vector or the motion vector and the related information are stored, and when the subsequent image block (current block) is subjected to inter-frame prediction, the motion vector or the motion vector and the related information can be directly read from the corresponding position of a memory. This may improve the inter prediction efficiency of the current block.
In the present application, the motion vector or motion vector of the reconstructed image block and the related information thereof may also be calculated only when the inter prediction is performed on the current block, that is, when the inter prediction is performed on the current block, the motion vector or motion vector of the reconstructed image block and the related information thereof are obtained by using the above method. Thus, the storage space can be saved by determining which reconstructed image block needs to be used for calculation.
If the plurality of reconstructed image blocks all use inter prediction in the encoding or decoding process, the motion vectors, or the motion vectors and the related information, of the plurality of reconstructed image blocks can be obtained by the above method. If some of the plurality of reconstructed image blocks do not use inter prediction in the encoding or decoding process, the motion vectors, or the motion vectors and the related information, of those image blocks can be obtained according to any one of the methods described in the above three cases.
If a reconstructed image block contains a plurality of basic unit image blocks, the motion vector, or the motion vector and the related information, of the reconstructed image block may be used as the motion vector, or the motion vector and the related information, of all the basic unit image blocks contained in it. Further, the motion vector, or the motion vector and the related information, of the reconstructed image block may be used as the motion vector, or the motion vector and the related information, of all of its pixels.
The a priori candidate motion vectors of the current block may refer to all the motion vectors remaining after the de-duplication of the a posteriori motion vectors of the reconstructed image blocks, or may refer to some of all the motion vectors remaining after the de-duplication of the a posteriori motion vectors of the reconstructed image blocks.
The motion vectors of the reconstructed image blocks may be input into a trained neural network to obtain a plurality of prior candidate motion vectors of the current block and a plurality of probability values corresponding to the prior candidate motion vectors. The neural network may refer to the description of the training engine 25 above, and will not be described in detail here.
Optionally, a plurality of a posteriori motion vectors of each of the plurality of reconstructed image blocks and a plurality of prediction error values corresponding to the plurality of a posteriori motion vectors may be input to the trained neural network to obtain a plurality of a priori candidate motion vectors of the current block and a plurality of probability values corresponding to the plurality of a priori candidate motion vectors.
Optionally, a plurality of a posteriori motion vectors and a plurality of probability values corresponding to the plurality of a posteriori motion vectors of each of the plurality of reconstructed image blocks may be input to the trained neural network to obtain a plurality of a priori candidate motion vectors of the current block and a plurality of probability values corresponding to the plurality of a priori candidate motion vectors.
Optionally, the optimal motion vectors of the reconstructed image blocks may be input to a trained neural network to obtain a plurality of prior candidate motion vectors of the current block and a plurality of probability values corresponding to the prior candidate motion vectors.
When the sum of the probability values is 1, the probability value corresponding to the first prior candidate motion vector is taken as the weight factor corresponding to the first prior candidate motion vector. That is, the weighting factors of the prior candidate motion vectors are the probability values of the prior candidate motion vectors; or when the sum of the probability values is not 1, carrying out normalization processing on the probability values; the normalized value of the probability value corresponding to the first a priori candidate motion vector is taken as the weighting factor corresponding to the first a priori candidate motion vector. I.e. the weighting factor for each of the plurality of a priori candidate motion vectors, is a normalized value of the probability value for each of the plurality of a priori candidate motion vectors. The first a priori candidate motion vector is any one of a plurality of a priori candidate motion vectors. As can be seen, the sum of the weighting factors corresponding to the a priori candidate motion vectors is 1.
Step 804: perform motion compensation according to each of the plurality of a priori candidate motion vectors to obtain a plurality of predicted values.
According to the principle of inter prediction, a candidate motion vector can find a reference block in a reference frame of a current block, inter prediction is performed on the current block according to the reference block to obtain a predicted value corresponding to the candidate motion vector, and the predicted value of the current block corresponds to the candidate motion vector. Therefore, motion compensation is respectively performed according to a plurality of prior candidate motion vectors, and a plurality of predicted values of the current block can be obtained.
The weight factor and the predicted value corresponding to the same a priori candidate motion vector are multiplied, and the products corresponding to the plurality of a priori candidate motion vectors are added to obtain the predicted value of the current block.
According to the method, a plurality of weighting factors and a plurality of predicted values of the current block are obtained through respective motion vectors of a plurality of reconstructed image blocks in the peripheral area of the current block, the weighting factors and the predicted values corresponding to the same prior candidate motion vector are multiplied, and then a plurality of products corresponding to a plurality of prior candidate motion vectors are added to obtain the predicted value of the current block.
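For illustration, the weighted summation described above might be sketched as follows; the array shapes are assumptions, and the normalization simply turns the probability values into weight factors that sum to 1.

```python
import numpy as np

def weighted_prediction(predictions, probabilities):
    """Blend the per-candidate predicted values of the current block using
    weight factors obtained by normalizing the probability values."""
    p = np.asarray(probabilities, dtype=np.float64)
    weights = p / p.sum()                               # weight factors sum to 1
    preds = np.asarray(predictions, dtype=np.float64)   # shape (N, H, W)
    return np.tensordot(weights, preds, axes=1)         # shape (H, W)
```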
In a possible implementation manner, after obtaining the reconstruction value of the current block, the motion vector and the related information of the current block may be obtained immediately, and the motion vector and the related information thereof may refer to step 801, where the obtaining method includes:
First, the plurality of a posteriori motion vectors of the current block and the plurality of prediction error values corresponding to the plurality of a posteriori motion vectors are obtained according to the reconstructed value of the current block and the predicted values corresponding to the plurality of a posteriori candidate motion vectors of the current block, where the plurality of a posteriori motion vectors of the current block are obtained according to the plurality of a priori candidate motion vectors of the current block.
Second, the reconstructed value of the current block and the predicted values corresponding to the plurality of a posteriori candidate motion vectors of the current block are input into a neural network to obtain the plurality of a posteriori motion vectors of the current block and the plurality of probability values corresponding to the plurality of a posteriori motion vectors, where the plurality of a posteriori motion vectors of the current block are obtained according to the plurality of a priori candidate motion vectors of the current block, or the plurality of probability values corresponding to the plurality of a posteriori motion vectors of the current block are obtained according to the plurality of prediction error values of the current block.
Third, the a posteriori motion vector with the maximum probability value or the minimum prediction error value among the plurality of a posteriori motion vectors of the current block is determined as the optimal motion vector of the current block.
In one possible implementation, the plurality of probability values of the current block includes M probability values, each of which is greater than any of the remaining probability values of the current block. Therefore, M a priori candidate motion vectors corresponding to the M probability values can be selected from the plurality of a priori candidate motion vectors of the current block, M weight factors are obtained according to the M probability values, motion compensation is performed according to each of the M a priori candidate motion vectors to obtain M predicted values of the current block, and finally the predicted value of the current block is obtained by weighted summation of the M weight factors and the M predicted values. That is, the M largest probability values are selected from the plurality of probability values corresponding to the plurality of a priori candidate motion vectors of the current block, the M a priori candidate motion vectors corresponding to these M probability values are selected from the plurality of a priori candidate motion vectors of the current block, and the weight factors and the predicted values are calculated based on the M probability values and the M a priori candidate motion vectors to obtain the predicted value of the current block. The remaining probability values other than the M probability values can be ignored because of their smaller values, which reduces the amount of calculation and improves the efficiency of inter prediction.
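For illustration, selecting the M a priori candidate motion vectors with the largest probability values might be sketched as follows; the interface is an assumption.

```python
import numpy as np

def select_top_m(candidate_mvs, probabilities, m):
    """Keep the M candidates with the largest probability values; the
    remaining candidates are ignored to reduce the amount of calculation."""
    order = np.argsort(probabilities)[::-1][:m]
    return ([candidate_mvs[i] for i in order],
            [float(probabilities[i]) for i in order])
```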
The following describes the technical solution of the embodiment of the method shown in fig. 8 in detail by using several specific embodiments.
Example one
In this embodiment, a plurality of prior candidate motion vectors and a plurality of probability values corresponding to the plurality of prior candidate motion vectors of the current block are determined according to a plurality of posterior motion vectors and a plurality of prediction error values corresponding to the plurality of posterior motion vectors of each of the plurality of reconstructed image blocks in the peripheral region.
Fig. 9 is a flowchart of a process 900 of an inter prediction method according to an embodiment of the present application. Process 900 may be performed by video encoder 20 or video decoder 30, and in particular, may be performed by the inter prediction unit 244, 344 of video encoder 20 or video decoder 30. Process 900 is described as a series of steps or operations; it should be understood that process 900 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in fig. 9. Assuming that a video data stream having a plurality of image frames is being processed by a video encoder or a video decoder, the process 900, comprising the following steps, is performed to inter-predict an image or an image block. The process 900 may include:
The following description will be given by taking an example of a reconstructed image block, where the reconstructed image block may be any one of a plurality of reconstructed image blocks in a peripheral region, and other reconstructed image blocks may obtain a plurality of a posteriori motion vectors and a plurality of prediction error values corresponding to the plurality of a posteriori motion vectors with reference to the method.
The reconstructed image block has N4 a posteriori candidate motion vectors, and the N4 a posteriori candidate motion vectors are obtained from a plurality of a priori candidate motion vectors of the reconstructed image block, and the obtaining method may refer to the description of step 801 above. Performing motion compensation according to the N4 a posteriori candidate motion vectors to obtain N4 predictors of the reconstructed image block, where the N4 predictors correspond to the N4 a posteriori candidate motion vectors, i.e., performing inter prediction on the reconstructed image block according to a reference block corresponding to one a posteriori candidate motion vector to obtain one predictor of the reconstructed image block. And comparing the N4 predicted values with the reconstructed values of the reconstructed image block respectively to obtain N4 prediction error values of the reconstructed image block, wherein the N4 prediction error values correspond to the N4 posterior candidate motion vectors. The prediction error value of the reconstructed image block corresponding to a certain posterior candidate motion vector can be obtained by SAD or SSE and other methods.
The N2 a posteriori motion vectors of the reconstructed image block may refer to the N4 a posteriori candidate motion vectors, or may be a subset of the N4 a posteriori candidate motion vectors, for example, a plurality of designated motion vectors selected from the N4 a posteriori candidate motion vectors.
Accordingly, the number of prediction error values corresponding to N2 a posteriori motion vectors for a reconstructed image block is also N2.
The total a posteriori motion vectors for a plurality of reconstructed image blocks may be represented as a two-dimensional matrix of N2 × Q, N2 being the number of a posteriori motion vectors, Q being the number of reconstructed image blocks, where the elements are represented ask is 0,1, …, Q-1, indicating the index of the reconstructed image block, N is 0,1, …, N2-1, indicating the index of the a posteriori motion vector, meaning the a posteriori motion vector indicated by N for the reconstructed image block indicated by k.
The prediction error values of all of the plurality of reconstructed image blocks may also be represented as a two-dimensional matrix of size N2 × Q, whose elements are represented as $E_{k,n}$, where k = 0, 1, …, Q-1 indicates the index of the reconstructed image block and n = 0, 1, …, N2-1 indicates the index of the a posteriori motion vector; $E_{k,n}$ denotes the prediction error value corresponding to the a posteriori motion vector indicated by n of the reconstructed image block indicated by k.
In the present application, all prediction error values and all a posteriori motion vectors of the plurality of reconstructed image blocks, i.e., the two N2 × Q two-dimensional matrices, may be input into a trained neural network, and the neural network outputs a plurality of a priori candidate motion vectors of the current block and a plurality of probability values corresponding to the plurality of a priori candidate motion vectors. The neural network may refer to the description of the training engine 25 above and is not described in detail here.
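The patent does not fix a concrete network architecture at this point; the following is a minimal sketch, assuming a small PyTorch convolutional network whose layer sizes, the stacking of x/y motion-vector components and error values as input channels, and the softmax output head are all illustrative assumptions rather than the mandated design.

```python
# Minimal sketch: map the two N2 x Q matrices (a posteriori MVs and prediction errors)
# to N1 a priori candidate MVs and N1 probability values per basic unit image block.
import torch
import torch.nn as nn

N2, Q, N1, S = 8, 6, 16, 1   # hypothetical sizes: Q = number of reconstructed blocks

class PriorCandidateNet(nn.Module):
    def __init__(self):
        super().__init__()
        # input channels: MV x component, MV y component, prediction error value
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.mv_head = nn.Linear(64 * N2 * Q, N1 * S * 2)   # a priori candidate MVs
        self.prob_head = nn.Linear(64 * N2 * Q, N1 * S)     # probability values

    def forward(self, mv, err):
        x = torch.cat([mv, err.unsqueeze(1)], dim=1)         # (B, 3, N2, Q)
        f = self.body(x).flatten(1)
        cand_mv = self.mv_head(f).view(-1, N1, S, 2)
        probs = torch.softmax(self.prob_head(f).view(-1, N1, S), dim=1)
        return cand_mv, probs

net = PriorCandidateNet()
mv = torch.randn(1, 2, N2, Q)   # x/y components of the N2 x Q a posteriori MVs
err = torch.rand(1, N2, Q)      # N2 x Q prediction error values
cand_mv, probs = net(mv, err)   # probs sums to 1 over the N1 candidates for each l
```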
The plurality of a priori candidate motion vectors of the current block may be represented as a two-dimensional matrix of size N1 × S, where N1 is the number of a priori candidate motion vectors of the current block and S is the number of basic unit image blocks or pixels included in the current block; if the current block is not further divided, S is 1. The elements of the matrix are represented as $MV'_{l,n}$, where l = 0, 1, …, S-1 indicates the index of the basic unit image block or pixel and n = 0, 1, …, N1-1 indicates the index of the a priori candidate motion vector; $MV'_{l,n}$ denotes the a priori candidate motion vector indicated by n of the basic unit image block or pixel indicated by l.
The probability values of the current block corresponding to the a priori candidate motion vectors may also be represented as a two-dimensional matrix of size N1 × S. The elements of the matrix are represented as $P_{l,n}$, where l = 0, 1, …, S-1 indicates the index of the basic unit image block or pixel and n = 0, 1, …, N1-1 indicates the index of the a priori candidate motion vector; $P_{l,n}$ denotes the probability that the a priori candidate motion vector indicated by n of the basic unit image block or pixel indicated by l becomes the optimal motion vector of that basic unit image block or pixel.
Optionally, for a fixed l, $\sum_{n=0}^{N1-1} P_{l,n} = 1$, i.e., the sum of the N1 probability values corresponding to the N1 a priori candidate motion vectors of the basic unit image block or pixel indicated by l is 1. Alternatively, the probability values may be expressed as integers, in which case the sum of the N1 probability values may be 256; this value is related to the number of binary bits used to represent the integer probability values (256 corresponds to an 8-bit representation), and therefore the sum may also be 128 or 512, etc.
The weight factors of the current block corresponding to the a priori candidate motion vectors may also be represented as a two-dimensional matrix of size N1 × S. The elements of the matrix are represented as $W_{l,n}$, where l = 0, 1, …, S-1 indicates the index of the basic unit image block or pixel and n = 0, 1, …, N1-1 indicates the index of the a priori candidate motion vector; $W_{l,n}$ denotes the weight factor of the a priori candidate motion vector indicated by n of the basic unit image block or pixel indicated by l.
If the N1 probability values, corresponding to the N1 a priori candidate motion vectors, of the basic unit image block or pixel indicated by l in the current block are normalized, i.e., $\sum_{n=0}^{N1-1} P_{l,n} = 1$, the N1 probability values may be used directly as the N1 weight factors corresponding to the N1 a priori candidate motion vectors, e.g., $W_{l,n} = P_{l,n}$. If the N1 probability values, corresponding to the N1 a priori candidate motion vectors, of the basic unit image block or pixel indicated by l in the current block are not normalized, the N1 probability values may be normalized first, and the normalized values of the N1 probability values may then be used as the N1 weight factors corresponding to the N1 a priori candidate motion vectors. Therefore, for a fixed l, $\sum_{n=0}^{N1-1} W_{l,n} = 1$.
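A minimal sketch of this weight-factor derivation, assuming the probability values arrive either already normalized or as non-negative integers on a scale such as 256; the helper name is hypothetical.

```python
# Minimal sketch: N1 probability values of one basic unit image block (or pixel)
# -> N1 weight factors, normalizing only when the values do not already sum to 1.
import numpy as np

def weight_factors(prob_values):
    p = np.asarray(prob_values, dtype=np.float64)
    total = p.sum()
    if np.isclose(total, 1.0):
        return p              # already normalized
    return p / total          # also covers integer scales such as 128/256/512

print(weight_factors([64, 128, 64]))   # integers summing to 256 -> [0.25, 0.5, 0.25]
```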
Step 904: perform motion compensation according to the plurality of a priori candidate motion vectors, respectively, to obtain a plurality of prediction values.
A certain a priori candidate motion vector is taken as an example for explanation, the a priori candidate motion vector is any one of a plurality of a priori candidate motion vectors, and other a priori candidate motion vectors can refer to the method.
Performing motion compensation according to one a priori candidate motion vector yields one prediction value of the current block, so that the N1 a priori candidate motion vectors yield N1 prediction values.
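A minimal sketch of motion compensation for one candidate motion vector, assuming integer-pel motion vectors and a clamped reference fetch; the sub-pel interpolation filters of a real codec are omitted, and the function name is illustrative.

```python
# Minimal sketch: fetch the reference block pointed to by one candidate MV.
import numpy as np

def motion_compensate(reference_frame, block_x, block_y, bw, bh, mv):
    """Return the bh x bw reference block displaced by mv = (mvx, mvy)."""
    H, W = reference_frame.shape
    ys = np.clip(np.arange(block_y + mv[1], block_y + mv[1] + bh), 0, H - 1)
    xs = np.clip(np.arange(block_x + mv[0], block_x + mv[0] + bw), 0, W - 1)
    return reference_frame[np.ix_(ys, xs)]

ref = np.random.randint(0, 256, (64, 64))
predictors = [motion_compensate(ref, 16, 16, 8, 8, mv)
              for mv in [(0, 0), (1, -2), (-3, 4)]]   # one predictor per candidate MV
```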
The prediction values of the current block may be represented as a three-dimensional matrix of size BH × WH × S, where BH × WH is the size of a basic unit image block included in the current block and S is the number of basic unit image blocks or pixels included in the current block; if the current block is not further divided, S is 1. The elements of the matrix are represented as $pred_{l,n}(i,j)$, where l = 0, 1, …, S-1 indicates the index of the basic unit image block or pixel and n = 0, 1, …, N1-1 indicates the index of the a priori candidate motion vector; $pred_{l,n}(i,j)$ denotes the prediction value, corresponding to the a priori candidate motion vector indicated by n, of the pixel in the i-th row and j-th column of the basic unit image block indicated by l.
The prediction value corresponding to each a priori candidate motion vector is multiplied by the weight factor corresponding to the same a priori candidate motion vector, and the plurality of products corresponding to the plurality of a priori candidate motion vectors are added to obtain the prediction value of the current block. In the current block, the prediction value of the pixel in the i-th row and j-th column of the basic unit image block indicated by l may be expressed as:

$\hat{pred}_l(i,j) = \sum_{n=0}^{N1-1} W_{l,n} \cdot pred_{l,n}(i,j)$
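A minimal sketch of this weighted sum, assuming the predictors and weight factors are already available as arrays; the shapes and variable names are illustrative.

```python
# Minimal sketch: weighted sum of the N1 predictors -> prediction of the current block.
import numpy as np

N1, S, BH, WH = 3, 1, 8, 8
predictors = np.random.rand(N1, S, BH, WH)   # pred[n, l, i, j]
weights = np.array([[0.5], [0.3], [0.2]])    # W[n, l], sums to 1 over n for each l

# pred_hat[l, i, j] = sum_n W[n, l] * pred[n, l, i, j]
pred_current = np.einsum('nl,nlij->lij', weights, predictors)   # shape (S, BH, WH)
```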
Embodiment Two
In this embodiment, a plurality of prior candidate motion vectors and a plurality of probability values corresponding to the plurality of prior candidate motion vectors of the current block are determined according to a plurality of posterior motion vectors and a plurality of probability values corresponding to the plurality of posterior motion vectors of each of the plurality of reconstructed image blocks in the peripheral region.
Fig. 11 is a flowchart of a process 1100 of an inter prediction method according to an embodiment of the present application. Process 1100 may be performed by video encoder 20 or video decoder 30, and in particular by the inter prediction unit 244 of video encoder 20 or the inter prediction unit 344 of video decoder 30. Process 1100 is described as a series of steps or operations; it should be understood that process 1100 may be performed in various orders and/or concurrently and is not limited to the order of execution shown in Fig. 11. Assuming that a video data stream having a plurality of image frames is being processed by a video encoder or a video decoder, process 1100 is performed to inter-predict an image or an image block and includes the following steps. Process 1100 may include:
The following description will be given by taking an example of a reconstructed image block, where the reconstructed image block may be any one of a plurality of reconstructed image blocks in a peripheral region, and other reconstructed image blocks may obtain a plurality of a posteriori motion vectors and a plurality of probability values corresponding to the a posteriori motion vectors with reference to the method.
The N2 posterior motion vectors of the reconstructed image block may be obtained by referring to the method in step 901, which is not described herein again.
The N2 probability values corresponding to the N2 a posteriori motion vectors for the reconstructed image block may be obtained by the following two methods:
One method is to obtain the N2 probability values of the reconstructed image block from the N2 prediction error values of the reconstructed image block obtained according to the first embodiment.
The N2 prediction error values of the reconstructed image block may be represented as an N2-dimensional vector, whose elements are represented as $E_{k1,n}$, where k1 is the index of the reconstructed image block and n = 0, 1, …, N2-1 indicates the index of the a posteriori motion vector. The N2 probability values of the reconstructed image block can be calculated from the N2 prediction error values of the reconstructed image block. The N2 probability values of the reconstructed image block may also be represented as an N2-dimensional vector, whose elements are represented as $P_{k1,n}$, where k1 is the index of the reconstructed image block and n = 0, 1, …, N2-1 indicates the index of the a posteriori motion vector; $P_{k1,n}$ denotes the probability that the a posteriori motion vector indicated by n of the reconstructed image block indicated by k1 becomes the optimal motion vector of the reconstructed image block.
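The exact mapping from prediction error values to probability values is not spelled out here; one plausible, hedged choice is a softmax over the negated errors, so that smaller errors yield larger probabilities. The function name and the temperature parameter are assumptions of this sketch.

```python
# Minimal sketch (assumed mapping): N2 prediction error values -> N2 probability values.
import numpy as np

def errors_to_probabilities(errors, T=1.0):
    e = -np.asarray(errors, dtype=np.float64) / T
    e -= e.max()                 # numerical stability
    p = np.exp(e)
    return p / p.sum()           # N2 probability values summing to 1

print(errors_to_probabilities([120.0, 40.0, 300.0], T=50.0))
```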
The other method is to input the reconstructed values of the reconstructed image block and the N2 prediction values corresponding to the N2 a posteriori motion vectors into a trained neural network to obtain the N2 probability values corresponding to the N2 a posteriori motion vectors of the reconstructed image block. The neural network may refer to the description of the training engine 25 above and is not described in detail here.
The reconstructed values of the reconstructed image block may be obtained after the reconstructed image block is encoded, and the N2 prediction values of the reconstructed image block corresponding to the N2 posterior motion vectors may be obtained by referring to the method in step 901, which is not described herein again.
The a posteriori motion vectors of all of the plurality of reconstructed image blocks may be represented as a two-dimensional matrix of size N2 × Q, where N2 is the number of a posteriori motion vectors and Q is the number of reconstructed image blocks. The elements of the matrix are represented as $MV_{k,n}$, where k = 0, 1, …, Q-1 indicates the index of the reconstructed image block and n = 0, 1, …, N2-1 indicates the index of the a posteriori motion vector; $MV_{k,n}$ denotes the a posteriori motion vector indicated by n of the reconstructed image block indicated by k.
The probability values of all of the plurality of reconstructed image blocks may also be represented as a two-dimensional matrix of size N2 × Q, whose elements are represented as $P_{k,n}$, where k = 0, 1, …, Q-1 indicates the index of the reconstructed image block and n = 0, 1, …, N2-1 indicates the index of the a posteriori motion vector; $P_{k,n}$ denotes the probability that the a posteriori motion vector indicated by n of the reconstructed image block indicated by k becomes the optimal motion vector of the reconstructed image block.
Step 1104: perform motion compensation according to the plurality of a priori candidate motion vectors, respectively, to obtain a plurality of prediction values.
In the present embodiment, the steps 1103-1105 can refer to the steps 903-905 in the first embodiment, which are not described herein again.
Embodiment Three
In this embodiment, a plurality of prior candidate motion vectors of the current block and a plurality of probability values corresponding to the plurality of prior candidate motion vectors are determined according to the respective optimal motion vectors of the plurality of reconstructed image blocks in the peripheral region.
Fig. 12 is a flowchart of a process 1200 of an inter prediction method according to an embodiment of the present application. Process 1200 may be performed by video encoder 20 or video decoder 30, and in particular by the inter prediction unit 244 of video encoder 20 or the inter prediction unit 344 of video decoder 30. Process 1200 is described as a series of steps or operations; it should be understood that process 1200 may be performed in various orders and/or concurrently and is not limited to the order of execution shown in Fig. 12. Assuming that a video data stream having a plurality of image frames is being processed by a video encoder or a video decoder, process 1200 is performed to inter-predict an image or an image block and includes the following steps. Process 1200 may include:
In the following, a reconstructed image block is taken as an example, the reconstructed image block may be any one of a plurality of reconstructed image blocks in the peripheral region, and other reconstructed image blocks may obtain the optimal motion vector with reference to the method.
The optimal motion vector of the reconstructed image block can be obtained by the following two methods:
One method is to obtain the optimal motion vector of the reconstructed image block from the N2 prediction error values of the reconstructed image block obtained according to the first embodiment, that is, the a posteriori motion vector corresponding to the minimum prediction error value among the N2 prediction error values of the reconstructed image block is used as the optimal motion vector of the reconstructed image block.
The other method is to acquire the optimal motion vector of the reconstructed image block according to the N2 probability values of the reconstructed image block acquired in the second embodiment, that is, the posterior motion vector corresponding to the maximum probability value of the N2 probability values of the reconstructed image block is used as the optimal motion vector of the reconstructed image block.
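A minimal sketch covering both selection rules, assuming the a posteriori motion vectors and their prediction error values or probability values are given as arrays; the function names are illustrative.

```python
# Minimal sketch: optimal motion vector of a reconstructed image block, either as the
# a posteriori MV with the smallest prediction error or with the largest probability.
import numpy as np

def optimal_mv_from_errors(posterior_mvs, prediction_errors):
    return posterior_mvs[int(np.argmin(prediction_errors))]

def optimal_mv_from_probabilities(posterior_mvs, probability_values):
    return posterior_mvs[int(np.argmax(probability_values))]

mvs = np.array([[0, 0], [1, -2], [-3, 4]])
print(optimal_mv_from_errors(mvs, [120, 40, 300]))          # -> [ 1 -2]
print(optimal_mv_from_probabilities(mvs, [0.2, 0.7, 0.1]))  # -> [ 1 -2]
```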
Step 1204: perform motion compensation according to the plurality of a priori candidate motion vectors, respectively, to obtain a plurality of prediction values.
Step 1205: weight and sum the plurality of weight factors and the plurality of prediction values to obtain the prediction value of the current block.
In this embodiment, steps 1203 to 1205 may refer to steps 903 to 905 in the first embodiment and are not described herein again.
Fig. 13 is a block diagram illustrating an inter prediction apparatus 1300 according to an embodiment of the present disclosure. The inter prediction apparatus 1300 includes: a motion estimation unit 1301 and an inter-frame prediction processing unit 1302, wherein the motion estimation unit 1301 is configured to obtain motion vectors of P reconstructed image blocks in a peripheral region of a current block, where the peripheral region includes a spatial neighborhood and/or a temporal neighborhood of the current block; the inter-frame prediction processing unit 1302 is configured to obtain Q priori candidate motion vectors of the current block and Q probability values corresponding to the Q priori candidate motion vectors according to motion vectors of the P reconstructed image blocks; obtaining M weight factors corresponding to the M priori candidate motion vectors according to M probability values corresponding to the M priori candidate motion vectors; m, P and Q are positive integers, M is less than or equal to Q; respectively executing motion compensation according to the M prior candidate motion vectors to obtain M predicted values; and weighting and summing the M predicted values and the corresponding M weight factors to obtain the predicted value of the current block. In one example, the inter prediction apparatus 1300 including the motion estimation unit 1301 and the inter prediction processing unit 1302 may correspond to the inter prediction unit 244 in fig. 2 or correspond to the inter prediction unit 344 in fig. 3.
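A minimal sketch of this two-unit structure, assuming the trained neural network and the motion-compensation routine are supplied as callables; the class and callable names, and the selection of the M most probable candidates, are illustrative assumptions rather than the patent's concrete implementation.

```python
# Minimal sketch: apparatus 1300 as a motion estimation stage feeding an
# inter prediction processing stage (candidate MVs -> weights -> weighted prediction).
import numpy as np

class InterPredictionApparatus:
    def __init__(self, neural_network, motion_compensate):
        self.neural_network = neural_network        # P motion vectors -> (Q candidate MVs, Q probability values)
        self.motion_compensate = motion_compensate  # one candidate MV -> one predictor block

    def predict(self, neighbor_mvs, M):
        # motion estimation unit 1301 supplies the motion vectors of the P reconstructed blocks
        cand_mvs, probs = self.neural_network(neighbor_mvs)
        keep = np.argsort(probs)[::-1][:M]                  # the M most probable candidates, M <= Q
        weights = probs[keep] / probs[keep].sum()           # M weight factors (normalized)
        predictors = np.stack([self.motion_compensate(cand_mvs[i]) for i in keep])
        return np.einsum('m,mij->ij', weights, predictors)  # weighted sum = prediction of the current block

# toy usage with dummy stand-ins for the trained network and motion compensation
dummy_net = lambda mvs: (np.zeros((4, 2)), np.array([0.4, 0.3, 0.2, 0.1]))
dummy_mc = lambda mv: np.full((8, 8), 128.0)
apparatus = InterPredictionApparatus(dummy_net, dummy_mc)
pred = apparatus.predict(np.zeros((5, 2)), M=2)   # (8, 8) prediction of the current block
```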
In implementation, the steps of the above method embodiments may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The processor may be a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of the present application may be directly implemented by a hardware encoding processor, or implemented by a combination of hardware and software modules in the encoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM, an EPROM, or a register. The storage medium is located in a memory, and a processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
The memory referred to in the various embodiments above may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to include, without being limited to, these and any other suitable types of memory.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (personal computer, server, network device, or the like) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (20)
1. An inter-frame prediction method, comprising:
obtaining respective motion vectors of P reconstructed image blocks in a peripheral region of a current block, wherein the peripheral region comprises a spatial neighborhood and/or a temporal neighborhood of the current block;
obtaining Q prior candidate motion vectors of the current block and Q probability values corresponding to the Q prior candidate motion vectors according to the respective motion vectors of the P reconstructed image blocks;
obtaining M weight factors corresponding to the M priori candidate motion vectors according to M probability values corresponding to the M priori candidate motion vectors; m, P and Q are positive integers, M is less than or equal to Q;
respectively executing motion compensation according to the M prior candidate motion vectors to obtain M predicted values;
and weighting and summing the M predicted values and the corresponding M weight factors to obtain the predicted value of the current block.
2. The method of claim 1, wherein obtaining the Q prior candidate motion vectors of the current block and the Q probability values corresponding to the Q prior candidate motion vectors according to the motion vectors of the P reconstructed image blocks comprises:
and inputting the motion vectors of the P reconstructed image blocks into a trained neural network to obtain the Q prior candidate motion vectors and Q probability values corresponding to the Q prior candidate motion vectors.
3. The method of claim 1 or 2, wherein obtaining M weight factors corresponding to M a priori candidate motion vectors according to M probability values corresponding to the M a priori candidate motion vectors comprises:
when the sum of the M probability values is 1, taking the probability value corresponding to a first prior candidate motion vector as the weight factor corresponding to the first prior candidate motion vector; or,
when the sum of the M probability values is not 1, carrying out normalization processing on the M probability values; taking a normalized value of the probability value corresponding to the first a priori candidate motion vector as the weight factor corresponding to the first a priori candidate motion vector;
wherein the first a priori candidate motion vector is any one of the M a priori candidate motion vectors.
4. The method of any one of claims 1-3, wherein M is equal to Q, and wherein the M probability values are the Q probability values.
5. The method of any one of claims 1-3, wherein M is less than Q, and wherein each of the M probability values is greater than any one of the Q probability values other than the M probability values.
6. The method according to any of claims 1-5, wherein said deriving Q a priori candidate motion vectors for the current block and Q probability values corresponding to the Q a priori candidate motion vectors from the respective motion vectors of the P reconstructed image blocks comprises:
inputting a plurality of posterior motion vectors of the P reconstructed image blocks and a plurality of probability values corresponding to the posterior motion vectors into a trained neural network to obtain Q prior candidate motion vectors of the current block and Q probability values corresponding to the Q prior candidate motion vectors; the plurality of a posteriori motion vectors and a plurality of probability values corresponding to the plurality of a posteriori motion vectors of a reconstructed image block are determined according to a reconstructed value of the reconstructed image block and a prediction value corresponding to a plurality of a posteriori candidate motion vectors, the reconstructed image block being any one of the P reconstructed image blocks.
7. The method according to any of claims 1-5, wherein said deriving Q a priori candidate motion vectors for the current block and Q probability values corresponding to the Q a priori candidate motion vectors from the respective motion vectors of the P reconstructed image blocks comprises:
inputting a plurality of posterior motion vectors of the P reconstructed image blocks and a plurality of prediction error values corresponding to the posterior motion vectors into a trained neural network to obtain Q prior candidate motion vectors of the current block and Q probability values corresponding to the Q prior candidate motion vectors; the plurality of a posteriori motion vectors and a plurality of prediction error values corresponding to the plurality of a posteriori motion vectors of a reconstructed image block are determined according to a reconstructed value of the reconstructed image block and a prediction value corresponding to a plurality of a posteriori candidate motion vectors, the reconstructed image block being any one of the P reconstructed image blocks.
8. The method according to any of claims 1-5, wherein said deriving Q a priori candidate motion vectors for the current block and Q probability values corresponding to the Q a priori candidate motion vectors from the respective motion vectors of the P reconstructed image blocks comprises:
inputting the respective optimal motion vectors of the P reconstructed image blocks into a trained neural network to obtain Q prior candidate motion vectors of the current block and Q probability values corresponding to the Q prior candidate motion vectors; the optimal motion vector of the reconstructed image block is a posterior motion vector with the maximum probability value or the minimum prediction error value in a plurality of posterior motion vectors of the reconstructed image block, and the reconstructed image block is any one of the plurality of reconstructed image blocks; wherein,
a plurality of posterior motion vectors of the reconstructed image block correspond to a plurality of probability values, and the plurality of posterior motion vectors and the plurality of probability values corresponding to the plurality of posterior motion vectors are determined according to a reconstructed value of the reconstructed image block and predicted values corresponding to a plurality of posterior candidate motion vectors; or,
the plurality of a posteriori motion vectors of the reconstructed image block correspond to a plurality of prediction error values, and the plurality of a posteriori motion vectors and the plurality of prediction error values corresponding to the plurality of a posteriori motion vectors are determined according to a reconstructed value of the reconstructed image block and prediction values corresponding to a plurality of a posteriori candidate motion vectors.
9. The method of claim 6, further comprising:
acquiring a training data set, wherein the training data set comprises information of a plurality of groups of image blocks, wherein the information of each group of image blocks comprises a plurality of posterior motion vectors of a plurality of reconstructed image blocks, a plurality of probability values corresponding to the plurality of posterior motion vectors, a plurality of posterior motion vectors of a current block, and a plurality of probability values corresponding to the plurality of posterior motion vectors, and the plurality of reconstructed image blocks are image blocks in a spatial neighborhood and/or a temporal neighborhood of the current block;
and training according to the training data set to obtain the neural network.
10. The method of claim 7, further comprising:
acquiring a training data set, wherein the training data set comprises information of a plurality of groups of image blocks, wherein the information of each group of image blocks comprises a plurality of posterior motion vectors of a plurality of reconstructed image blocks, a plurality of prediction error values corresponding to the plurality of posterior motion vectors, a plurality of posterior motion vectors of a current block, and a plurality of probability values corresponding to the plurality of posterior motion vectors, and the plurality of reconstructed image blocks are image blocks in a spatial neighborhood and/or a temporal neighborhood of the current block;
and training according to the training data set to obtain the neural network.
11. The method of claim 8, further comprising:
acquiring a training data set, wherein the training data set comprises information of a plurality of groups of image blocks, the information of each group of image blocks comprises respective optimal motion vectors of a plurality of reconstructed image blocks, a plurality of posterior motion vectors of a current block and a plurality of probability values corresponding to the plurality of posterior motion vectors, and the plurality of reconstructed image blocks are image blocks in a spatial neighborhood and/or a temporal neighborhood of the current block;
and training according to the training data set to obtain the neural network.
12. The method of any one of claims 9-11, wherein the neural network comprises at least a convolutional layer and an activation layer.
13. The method of claim 12, wherein the convolution layer has a depth of 2, 3, 4, 5, 6, 16, 24, 32, 48, 64, or 128 convolution kernels; the size of the convolution kernel in the convolutional layer is 1 × 1, 3 × 3, 5 × 5, or 7 × 7.
14. The method according to any of claims 9-13, wherein the neural network comprises a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), or a Recurrent Neural Network (RNN).
15. An encoder, characterized in that it comprises processing circuitry for carrying out the method of any one of claims 1 to 14.
16. A decoder, characterized in that it comprises processing circuitry for carrying out the method of any one of claims 1 to 14.
17. A computer program product comprising program code for performing the method of any one of the claims when executed on a computer or processor.
18. An encoder, comprising:
one or more processors;
a non-transitory computer readable storage medium coupled to the processor and storing a program for execution by the processor, wherein the program, when executed by the processor, causes the encoder to perform the method of any of the claims.
19. A decoder, comprising:
one or more processors;
a non-transitory computer readable storage medium coupled to the processor and storing a program for execution by the processor, wherein the program, when executed by the processor, causes the decoder to perform the method of any of the claims.
20. A non-transitory computer-readable storage medium comprising program code that, when executed by a computer device, performs the method of any one of the claims.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011043942.XA CN114286100A (en) | 2020-09-28 | 2020-09-28 | Inter-frame prediction method and device |
PCT/CN2021/120640 WO2022063265A1 (en) | 2020-09-28 | 2021-09-26 | Inter-frame prediction method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011043942.XA CN114286100A (en) | 2020-09-28 | 2020-09-28 | Inter-frame prediction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114286100A true CN114286100A (en) | 2022-04-05 |
Family
ID=80844989
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011043942.XA Pending CN114286100A (en) | 2020-09-28 | 2020-09-28 | Inter-frame prediction method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114286100A (en) |
WO (1) | WO2022063265A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114898577A (en) * | 2022-07-13 | 2022-08-12 | 环球数科集团有限公司 | Road intelligent management system and method for peak period access management |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114611637B (en) * | 2022-05-11 | 2022-08-05 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment and readable storage medium |
CN114842424B (en) * | 2022-06-07 | 2023-01-24 | 北京拙河科技有限公司 | Intelligent security image identification method and device based on motion compensation |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103124344A (en) * | 2011-11-18 | 2013-05-29 | 江南大学 | Motion estimation method |
EP3370424A1 (en) * | 2017-03-02 | 2018-09-05 | Thomson Licensing | A method and a device for picture encoding and decoding |
WO2019183906A1 (en) * | 2018-03-29 | 2019-10-03 | 华为技术有限公司 | Inter-frame prediction method and device |
WO2020056798A1 (en) * | 2018-09-21 | 2020-03-26 | 华为技术有限公司 | Method and device for video encoding and decoding |
CN111200735B (en) * | 2018-11-19 | 2023-03-17 | 华为技术有限公司 | Inter-frame prediction method and device |
CN109672886B (en) * | 2019-01-11 | 2023-07-04 | 京东方科技集团股份有限公司 | Image frame prediction method and device and head display equipment |
CN110062239B (en) * | 2019-03-18 | 2020-10-20 | 杭州师范大学 | Reference frame selection method and device for video coding |
- 2020-09-28 CN CN202011043942.XA patent/CN114286100A/en active Pending
- 2021-09-26 WO PCT/CN2021/120640 patent/WO2022063265A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2022063265A1 (en) | 2022-03-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |