WO2022194137A1 - Video image encoding and decoding method and related device - Google Patents

Video image encoding and decoding method and related device (视频图像的编解码方法及相关设备)

Info

Publication number
WO2022194137A1
Authority
WO
WIPO (PCT)
Prior art keywords
residual
image
current image
neural network
probability distribution
Prior art date
Application number
PCT/CN2022/080900
Other languages
English (en)
French (fr)
Inventor
马祥
毛珏
杨海涛
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP22770490.5A (published as EP4300958A4)
Publication of WO2022194137A1
Priority to US18/368,683 (published as US20240007637A1)

Classifications

    • H: ELECTRICITY
      • H04: ELECTRIC COMMUNICATION TECHNIQUE
        • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
          • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
            • H04N 19/10: using adaptive coding
              • H04N 19/102: characterised by the element, parameter or selection affected or controlled by the adaptive coding
                • H04N 19/13: Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
                • H04N 19/124: Quantisation
                • H04N 19/132: Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
              • H04N 19/169: characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
                • H04N 19/17: the unit being an image region, e.g. an object
                  • H04N 19/172: the region being a picture, frame or field
            • H04N 19/60: using transform coding
              • H04N 19/61: in combination with predictive coding
            • H04N 19/90: using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
              • H04N 19/91: Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/045: Combinations of networks
              • G06N 3/08: Learning methods
                • G06N 3/084: Backpropagation, e.g. using gradient descent

Definitions

  • the present application relates to the field of video encoding and decoding, and in particular, to a video image encoding and decoding method and related equipment.
  • Video coding (video encoding and decoding) is widely used in digital video applications such as broadcast digital television, video transmission over the Internet and mobile networks, real-time conversational applications such as video chat and video conferencing, DVD and Blu-ray discs, video content capture and editing systems, and security applications for camcorders.
  • Video compression devices typically use software and/or hardware on the source side to encode video data prior to transmission or storage, thereby reducing the amount of data required to represent digital video images. Then, the compressed data is received by the video decompression device at the destination side.
  • Deep neural networks can provide powerful nonlinear processing capabilities.
  • For example, image compression may be performed using a neural network such as an autoencoder, which maps an image into a feature domain for encoding.
  • the decoding end performs entropy decoding and inverse quantization on the received code stream to obtain the values in the feature domain, and then restores the reconstructed image through a decoding network.
  • Embodiments of the present application provide a video image encoding and decoding method and related equipment, which can improve encoding and decoding efficiency and accuracy.
  • According to a first aspect, the present application relates to a method for decoding video images.
  • the method is performed by a decoding device, and the method includes:
  • receive the code stream of the current image; perform probability estimation processing on input data through a trained neural network (also known as a probability estimation network) to obtain the probability distribution of the residual values of multiple samples included in the residual of the current image, where the input data includes at least the residual of a reference image, and the reference image is an image decoded before the current image; perform arithmetic entropy decoding on the code stream based on the probability distribution of the residual values of the multiple samples included in the residual of the current image to obtain first entropy decoded data, where the first entropy decoded data represents the residual of the current image; and obtain the reconstructed sample values of the current image based on the residual of the current image.
  • the steps of "receiving the code stream of the current image” and the step of “performing probability estimation processing on the input data through a neural network to obtain the probability distribution of the residual values of multiple samples included in the residual of the current image” are executed in no order. Points, the former can be executed first, the latter can be executed first, or it can be executed at the same time.
  • the reference image includes multiple samples, and the residual of the reference image refers to the multiple differences between the reconstructed values and the predicted values of those samples, i.e., residual(i) = reconstructed(i) - predicted(i) for each sample i;
  • the current image likewise includes multiple samples, and the residual of the current image refers to the multiple differences between the reconstructed values and the predicted values of its samples; that is, on the decoding side the residual of the current image and the residual of the reference image both refer to reconstruction residuals.
  • the first entropy decoded data is partial entropy decoded data decoded from the code stream and is used to represent the residual of the current image. It should be understood that other entropy decoded data may also be decoded from the code stream, which is not limited here.
  • when decoding the code stream, the reference image is decoded before the current image; for example, if image 1 is decoded first and image 2 is decoded later, image 1 is a reference image of image 2.
  • the reference image and the current image may be consecutive frames or non-consecutive frames; for example, the reference image is the (n-1)-th frame and the current image is the n-th frame, or the reference image is the (n-2)-th frame and the current image is the n-th frame.
  • the difference between the timestamp of the current image and the timestamp of the reference image is less than the preset threshold, thereby ensuring that the residual of the reference image and the residual of the current image are correlated.
  • the trained neural network is used to perform probability estimation processing on the residual values of multiple samples included in the residual of the reference image to obtain the probability distribution of the residual values of multiple samples included in the residual of the current image.
  • because the correlation between the residual of the current image and the residual of the reference image can be fully exploited, a high-precision probability distribution of the residual values of the multiple samples included in the residual of the current image can be obtained; when arithmetic entropy decoding is then performed on the code stream using this probability distribution, decoding efficiency and decoding accuracy can be improved.
  • the neural network is a neural network obtained by training in the following process, and the method of this embodiment further includes:
  • obtain input data for training, where the input data for training includes images decoded before the sample image, and the sample image is a known decoded image; perform probability estimation processing on the input data for training through the neural network to obtain an estimated probability distribution of the residual values of the multiple samples included in the residual of the sample image; obtain a loss value based on the known probability distribution and the estimated probability distribution of the residual values of the multiple samples included in the residual of the sample image; and when the loss value converges to a first threshold, or when the current number of training iterations of the neural network is greater than or equal to a second threshold, determine that the current neural network is the neural network obtained when training is completed (a training sketch follows).
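  • A minimal training-loop sketch matching the stopping criteria above (loss converging to a first threshold, or the iteration count reaching a second threshold). prob_net, loader and both threshold values are assumptions; the loss here is a cross-entropy between the estimated distribution and the known residual values, which is one common choice, not necessarily the one used by the application.

```python
import torch
import torch.nn.functional as F

def train_prob_net(prob_net, loader, first_threshold=1e-3, second_threshold=100000):
    """Hypothetical trainer: `loader` yields (reference residual, target residual
    levels as class indices); both thresholds are illustrative values."""
    opt = torch.optim.Adam(prob_net.parameters(), lr=1e-4)
    step = 0
    while True:
        for ref_residual, target_levels in loader:
            logits = prob_net(ref_residual)               # (N, L, H, W) over L levels
            loss = F.cross_entropy(logits, target_levels) # known vs. estimated distribution
            opt.zero_grad()
            loss.backward()                               # error back propagation
            opt.step()
            step += 1
            if loss.item() <= first_threshold or step >= second_threshold:
                return prob_net                           # training completed
```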
  • the input data for training includes:
  • residual values of the multiple samples included in the residual of a first reference image; or residual values of the multiple samples included in the residuals of the first reference image and a second reference image, where the first reference image and the second reference image are images decoded before the sample image.
  • data such as the residuals of multiple reference images (for example, the first reference image and the second reference image) are used as training samples, so that the trained neural network can estimate a more accurate probability distribution; when the code stream is decoded using this probability distribution, decoding efficiency and decoding accuracy are improved.
  • the training of the above-mentioned neural network can be carried out online or offline; when training is carried out online, the decoding device can obtain a high-precision neural network in time and therefore a high-precision probability distribution, and performing arithmetic entropy decoding on the code stream with this high-precision probability distribution improves decoding efficiency and decoding precision.
  • the above probability distribution represents the probability distribution of the multiple differences between the reconstructed values and the predicted values of the multiple samples included in the residual of the current image; it may be the probability distribution of the residual value of each individual sample, or a probability distribution representing the residual values of multiple samples of the current image.
  • the above-mentioned first entropy decoded data includes the residual of the current image, or a feature map of the residual of the current image, or a transformed and quantized residual of the current image.
  • the first entropy decoded data is a feature map of the residual of the current image
  • the decoding method of the present application further includes:
  • pass the feature map of the residual of the current image through a decoding network to obtain the residual of the current image.
  • during decoding, the feature map of the residual of the current image is obtained first, and the residual of the current image is then obtained through the decoding network, which helps improve decoding efficiency.
  • In a possible implementation, the first entropy decoded data is the transformed and quantized residual of the current image,
  • and the decoding method of the present application further includes:
  • performing inverse quantization and inverse transformation on the transformed and quantized residual to obtain the residual of the current image.
  • a possible inverse transform is the inverse discrete cosine transform (DCT); the inverse discrete sine transform (DST), the inverse discrete wavelet transform (DWT) and the like may also be used.
  • In a possible implementation, the input data further includes super-prior information, and the super-prior information is used to represent deeper features of the residual of the current image; the super-prior information can be obtained by performing feature extraction on the feature map of the residual of the current image through a convolutional neural network.
  • the above-mentioned neural network includes a convolutional network, a uniform pooling layer, a first convolutional layer, a splicing (concatenation) layer and a normalized exponential function (Softmax) layer, and the convolutional network includes multiple convolutional layers and rectified linear unit (ReLU) layers, wherein performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image includes:
  • extracting, through the convolutional network, the feature information in the residual of the reference image and the super-prior information; and performing probability estimation on the extracted feature information through the uniform pooling layer, the first convolutional layer, the splicing layer and the Softmax layer, and outputting the probability distribution of the residual values of the multiple samples included in the residual of the current image (one plausible arrangement is sketched below).
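  • One plausible arrangement of the layers named above, sketched in PyTorch; all channel counts, kernel sizes and the number of residual levels are illustrative assumptions, the uniform pooling layer is interpreted here as average pooling, and the exact topology is not fixed by this description.

```python
import torch
import torch.nn as nn

class ProbEstimationNet(nn.Module):
    """Sketch of: convolutional network (Conv+ReLU blocks), uniform pooling layer,
    first convolutional layer, splicing (concatenation) layer, and Softmax over
    L candidate residual levels. Channel counts are assumptions."""
    def __init__(self, in_ch=2, levels=511):
        super().__init__()
        self.conv_net = nn.Sequential(                    # convolutional network
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)  # uniform pooling layer
        self.first_conv = nn.Conv2d(64, 64, 1)            # first convolutional layer
        self.head = nn.Conv2d(128, levels, 1)
        self.softmax = nn.Softmax(dim=1)                  # Softmax layer

    def forward(self, ref_residual, hyper_prior):
        x = torch.cat([ref_residual, hyper_prior], dim=1)  # residual + super-prior info
        feat = self.conv_net(x)
        branch = self.first_conv(self.pool(feat))
        fused = torch.cat([feat, branch], dim=1)           # splicing (concatenation)
        return self.softmax(self.head(fused))              # per-sample PMF over levels
```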
  • In a possible implementation, the input data also includes the prediction of the reference image and the prediction of the current image;
  • the above neural network includes a convolutional network, a uniform pooling layer and a Softmax layer, and the convolutional network includes multiple convolutional layers and excitation ReLU layers, wherein performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image includes:
  • extracting, through the convolutional network, the feature information of the residual of the reference image, the prediction of the reference image and the prediction of the current image; and performing probability estimation on the extracted feature information through the uniform pooling layer and the Softmax layer, and outputting the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • In this solution, the prediction of the reference image and the prediction of the current image are introduced; their feature information, together with that of the residual of the reference image, is extracted through the convolutional network, and the uniform pooling layer and Softmax layer then perform probability estimation on the extracted feature information, so as to obtain a more accurate probability distribution.
  • In a possible implementation, the input data includes the residuals of multiple reference images, where the multiple reference images are images decoded before the current image;
  • the neural network includes a convolutional network, a uniform pooling layer, and a Softmax layer;
  • the convolutional network includes multiple convolutional layers and ReLU layers, where:
  • performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image includes:
  • extracting the feature information of the residuals of the multiple reference images through the convolutional network; and performing probability estimation on the extracted feature information through the uniform pooling layer and the Softmax layer, and outputting the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • In this solution, the residuals of multiple reference images are introduced (for example, stacked as in the sketch below), and their feature information is extracted through the convolutional network; probability estimation is performed on this feature information through the uniform pooling layer and the Softmax layer, so that a more accurate probability distribution can be obtained, and performing arithmetic entropy decoding on the code stream with this probability distribution helps improve decoding efficiency and decoding accuracy.
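  • How the residuals of multiple reference images are fed to the convolutional network is a design choice; one simple assumption is to stack them along the channel dimension:

```python
import torch

res_nm1 = torch.randn(1, 1, 64, 64)  # residual of reference image n-1 (assumed shape)
res_nm2 = torch.randn(1, 1, 64, 64)  # residual of reference image n-2
net_input = torch.cat([res_nm1, res_nm2], dim=1)  # (1, 2, 64, 64), fed to the conv network
```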
  • In a possible implementation, the input data includes the residuals of multiple reference images, the predictions of the multiple reference images, and the prediction of the current image, where the multiple reference images are images decoded before the current image;
  • the neural network includes a convolutional network, a uniform pooling layer and a Softmax layer;
  • the convolutional network includes multiple convolutional layers and ReLU layers, wherein performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image includes:
  • extracting, through the convolutional network, the feature information of the residuals of the multiple reference images, the predictions of the multiple reference images and the prediction of the current image; and performing probability estimation on the extracted feature information through the uniform pooling layer and the Softmax layer, and outputting the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • In this solution, the residuals and predictions of multiple reference images and the prediction of the current image are introduced; their feature information is extracted through the convolutional network, and probability estimation is performed on it through the uniform pooling layer and the Softmax layer, so that a more accurate probability distribution can be obtained, which improves decoding efficiency and decoding accuracy.
  • the prediction of the current image is obtained by predicting the current image using the reconstructed sample values of the reference image, or by up-sampling the reconstructed sample values of a low-resolution image of the current image, where the up-sampled image has the same resolution as the current image.
  • the low-resolution image of the current image can be obtained by down-sampling the current image.
  • the prediction signal of the reference image is obtained by predicting the reference image using the reconstructed sample values of an image decoded before the reference image, or by up-sampling the reconstructed sample values of a low-resolution image of the reference image, where the image obtained by up-sampling has the same resolution as the reference image (an up-sampling sketch is given below).
  • a low-resolution image of the reference image can be obtained by down-sampling the reference image.
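  • A minimal sketch of the up-sampling path, assuming bilinear interpolation and illustrative resolutions; the application does not prescribe the interpolation method or the down-sampling factor.

```python
import torch
import torch.nn.functional as F

# Reconstructed low-resolution image, up-sampled back to the resolution of the
# current image to serve as its prediction (shapes are assumptions).
low_res_recon = torch.rand(1, 3, 270, 480)
prediction = F.interpolate(low_res_recon, size=(1080, 1920),
                           mode='bilinear', align_corners=False)
assert prediction.shape[-2:] == (1080, 1920)  # same resolution as the current image
```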
  • According to a second aspect, the present application relates to a video image encoding method.
  • the method is performed by an encoding device, and the method includes:
  • obtain the residual of the current image, where the current image includes multiple samples and the residual of the current image consists of the multiple differences between the original values of the multiple samples and the predicted values of the multiple samples; and perform probability estimation processing on input data through a trained neural network (also called a probability estimation network) to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • the input data includes at least the residual of the reference image.
  • the step of obtaining the residual of the current image and the step of performing probability estimation processing on the input data may be executed in any order: either may be executed first, or they may be executed simultaneously.
  • the reference image includes multiple samples, and the residual of the reference image refers to the multiple differences between the reconstructed values and the predicted values of those samples; that is, on the encoding side the residual of the reference image refers to the reconstruction residual.
  • the current image includes multiple samples, and the residual of the current image refers to the multiple differences between the original values and the predicted values of those samples; that is, on the encoding side the residual of the current image refers to the original residual.
  • the prediction of the current image includes the predicted values of the multiple samples; optionally, the prediction of the current image may be obtained by predicting the current image using the reconstructed sample values of the reference image, or by up-sampling the reconstructed sample values of a low-resolution image of the current image, where the up-sampled image has the same resolution as the current image.
  • the low-resolution image of the current image can be obtained by down-sampling the current image.
  • the reference image is an encoded image before encoding the current image, which means that the encoding of the reference image has been completed before encoding the current image.
  • image 1 is first encoded, and then image 2 is encoded, and image 1 is a reference image of image 2.
  • the current image and the reference image may be continuous frames or non-continuous frames.
  • the difference between the timestamp of the current image and the timestamp of the reference image is less than the preset threshold, thereby ensuring that the residual of the reference image and the residual of the current image are correlated.
  • the trained neural network is used to perform probability estimation processing on the residual values of multiple samples included in the residual of the reference image to obtain the probability distribution of the residual values of multiple samples included in the residual of the current image.
  • because the correlation between the residual of the current image and the residual of the reference image can be fully exploited, a high-precision probability distribution of the residual values of the multiple samples included in the residual of the current image can be obtained; when arithmetic entropy encoding is performed on the first entropy encoded data according to this probability distribution, encoding efficiency and encoding accuracy can be improved (a minimal encoder-side sketch follows).
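  • Encoder-side counterpart of the decoder sketch given under the first aspect; estimate_probabilities and arithmetic_encode are again hypothetical stand-ins, and the original residual is formed from original minus predicted sample values as defined above.

```python
import numpy as np

def encode_current_image(original: np.ndarray, prediction: np.ndarray,
                         ref_residual: np.ndarray,
                         estimate_probabilities, arithmetic_encode) -> bytes:
    # original residual: differences between original and predicted sample values
    residual = original.astype(np.int16) - prediction.astype(np.int16)
    pmfs = estimate_probabilities(ref_residual)  # probability estimation network
    return arithmetic_encode(residual, pmfs)     # arithmetic entropy encoding
```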
  • the above-mentioned neural network is a neural network obtained by training, and the method of this implementation further includes:
  • the input data for training includes images encoded before the sample image, where the sample image is a known encoded image; probability estimation processing is performed on the input data for training through the neural network to obtain an estimated probability distribution of the residual values of the multiple samples included in the residual of the sample image; a loss value is obtained based on the known probability distribution and the estimated probability distribution of the residual values of the multiple samples included in the residual of the sample image; and when the loss value converges to the first threshold, or when the current number of training iterations of the neural network is greater than or equal to the second threshold, the current neural network is determined to be the neural network obtained when training is completed.
  • the input data for training includes:
  • residual values of the multiple samples included in the residual of a first reference image; or residual values of the multiple samples included in the residuals of the first reference image and a second reference image, where the first reference image and the second reference image are images encoded before the sample image.
  • when training the above neural network, data such as the residuals of multiple reference images (the first reference image and the second reference image) are used as training samples, so that the trained neural network can estimate a more accurate probability distribution; when the first entropy encoded data is arithmetically entropy encoded using this probability distribution, encoding efficiency and encoding accuracy are improved.
  • the training of the above-mentioned neural network may be performed online or offline; when training is performed online, the encoding device can obtain a high-precision neural network in time and thus a high-precision probability distribution, and performing arithmetic entropy encoding on the first entropy encoded data with this probability distribution improves encoding efficiency and encoding accuracy.
  • the above probability distribution represents the probability distribution of the multiple differences between the original values and the predicted values of the multiple samples included in the residual of the current image; it may be the probability distribution of the residual value of each individual sample, or a probability distribution corresponding to the residual values of multiple samples of the current image.
  • the first entropy-encoded data includes a residual of the current image, or a feature map of the residual of the current image, or a transformed and quantized residual of the current image.
  • In a possible implementation, the first entropy encoded data is a feature map of the residual of the current image,
  • and the encoding method of the present application further includes: obtaining the feature map of the residual of the current image by passing the residual of the current image through an encoding network.
  • In a possible implementation, the first entropy encoded data is the transformed and quantized residual of the current image,
  • and the encoding method of the present application further includes: performing transformation and quantization on the residual of the current image to obtain the transformed and quantized residual of the current image.
  • a possible transform is the DCT; the DST, the DWT and the like may also be used (see the sketch below).
  • the quantization can adopt methods such as uniform quantization or non-uniform quantization.
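  • An illustrative transform-and-quantization round trip for one residual block, assuming an 8x8 block, an orthonormal DCT and a uniform quantization step of 4; any of the DCT/DST/DWT variants mentioned above could be substituted.

```python
import numpy as np
from scipy.fft import dctn, idctn

residual_block = np.random.randn(8, 8)             # residual block (assumed size)
coeffs = dctn(residual_block, norm='ortho')        # forward DCT
q_step = 4.0                                       # uniform quantization step (assumed)
q_coeffs = np.round(coeffs / q_step)               # uniform quantization
recon_residual = idctn(q_coeffs * q_step, norm='ortho')  # inverse quantization + inverse DCT
```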
  • In a possible implementation, the input data further includes super-prior information, which is used to characterize deeper features of the residual of the current image;
  • the super-prior information can be obtained by performing feature extraction on the feature map of the residual of the current image through a convolutional neural network.
  • the above-mentioned neural network includes a convolutional network, a uniform pooling layer, a first convolutional layer, a splicing layer and a normalized exponential function (Softmax) layer, and the convolutional network includes multiple convolutional layers and excitation (ReLU) layers, wherein performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image includes:
  • extracting, through the convolutional network, the feature information in the residual of the reference image and the super-prior information; and performing probability estimation on the extracted feature information through the uniform pooling layer, the first convolutional layer, the splicing layer and the Softmax layer, and outputting the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • In this solution, the first convolutional layer and the splicing layer are introduced; the feature information of the residual of the reference image and the super-prior information is extracted through the convolutional network, and probability estimation is then performed on the extracted feature information through the uniform pooling layer, the first convolutional layer, the splicing layer and the Softmax layer, so that a more accurate probability distribution can be obtained; when arithmetic entropy encoding is performed on the first entropy encoded data using this probability distribution, encoding efficiency and encoding precision are improved.
  • In a possible implementation, the input data also includes the prediction of the reference image and the prediction of the current image;
  • the above neural network includes a convolutional network, a uniform pooling layer and a Softmax layer, and the convolutional network includes multiple convolutional layers and excitation ReLU layers, wherein performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image includes:
  • extracting, through the convolutional network, the feature information of the residual of the reference image, the prediction of the reference image and the prediction of the current image; and performing probability estimation on the extracted feature information through the uniform pooling layer and the Softmax layer, and outputting the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • In this solution, the prediction of the reference image and the prediction of the current image are introduced; their feature information, together with that of the residual of the reference image, is extracted through the convolutional network, and the uniform pooling layer and Softmax layer then perform probability estimation on the extracted feature information, so as to obtain a more accurate probability distribution.
  • In a possible implementation, the input data includes the residuals of multiple reference images, where the multiple reference images are images encoded before the current image;
  • the neural network includes a convolutional network, a uniform pooling layer, and a Softmax layer;
  • the convolutional network includes multiple convolutional layers and ReLU layers, where:
  • performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image includes: extracting the feature information of the residuals of the multiple reference images through the convolutional network; and performing probability estimation on the extracted feature information through the uniform pooling layer and the Softmax layer, and outputting the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • In this solution, the residuals of multiple reference images are introduced, and their feature information is extracted through the convolutional network; probability estimation is performed on this feature information through the uniform pooling layer and the Softmax layer, so that a more accurate estimated probability distribution can be obtained; when arithmetic entropy encoding is performed on the first entropy encoded data using this probability distribution, encoding efficiency and encoding accuracy are improved.
  • In a possible implementation, the input data includes the residuals of multiple reference images, the predictions of the multiple reference images, and the prediction of the current image, where the multiple reference images are images encoded before the current image;
  • the neural network includes a convolutional network, a uniform pooling layer and a Softmax layer;
  • the convolutional network includes multiple convolutional layers and ReLU layers, wherein performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image includes:
  • extracting, through the convolutional network, the feature information of the residuals of the multiple reference images, the predictions of the multiple reference images and the prediction of the current image; and performing probability estimation on the extracted feature information through the uniform pooling layer and the Softmax layer, and outputting the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • In this solution, the residuals and predictions of multiple reference images and the prediction of the current image are introduced; their feature information is extracted through the convolutional network, and probability estimation is performed on it through the uniform pooling layer and the Softmax layer, so that a more accurate probability distribution can be obtained.
  • it should be noted that, although the residual of the current image to be encoded on the encoding side is obtained from the differences between the original values and the predicted values of the samples of the image, the residuals appearing elsewhere on the encoding side and on the decoding side all refer to reconstruction residuals.
  • the reconstruction residual of the reference image may be obtained by decoding the code stream of the reference image, and for the specific decoding process, refer to the decoding method of the first aspect.
  • the reconstruction residual of the reference image can also be obtained in the following ways:
  • the residual of the reference image is quantized to obtain the quantized residual of the reference image, and the quantized residual of the reference image is then inversely quantized to obtain the reconstruction residual of the reference image.
  • specifically, the residual of the reference image is quantized to obtain the quantized residual of the reference image (denoted as R1), and entropy coding is then performed on R1 to obtain the code stream (denoted as R3); because entropy coding is lossless, the reconstruction residual of the reference image can be obtained by directly inverse quantizing R1, as in the sketch below.
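  • A sketch of that alternative, assuming a uniform quantization step; because the entropy coding of R1 is lossless, inverse-quantizing R1 directly yields the same reconstruction residual that the decoder would obtain.

```python
import numpy as np

q_step = 4.0                              # uniform quantization step (assumed)
ref_residual = np.random.randn(64, 64)    # residual of the reference image
r1 = np.round(ref_residual / q_step)      # quantized residual of the reference image (R1)
recon_ref_residual = r1 * q_step          # inverse quantization gives the reconstruction residual
```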
  • According to a third aspect, the present application relates to a video image decoding apparatus; for beneficial effects, refer to the description of the first aspect, which is not repeated here.
  • the decoding device has a function to implement the behavior in the method example of the first aspect above.
  • the functions can be implemented by hardware, or can be implemented by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • According to a fourth aspect, the present application relates to an apparatus for encoding video images; for beneficial effects, refer to the description of the second aspect, which is not repeated here.
  • the encoding device has the function of implementing the behavior in the method example of the second aspect above.
  • the functions can be implemented by hardware, or can be implemented by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the method described in the first aspect of the present application may be performed by the apparatus described in the third aspect of the present application.
  • Other features and implementations of the method described in the first aspect of the present application directly depend on the functionality and implementation of the device described in the third aspect of the present application.
  • the method described in the second aspect of the present application may be performed by the apparatus described in the fourth aspect of the present application.
  • Other features and implementations of the method described in the second aspect of the present application directly depend on the functionality and implementation of the device described in the fourth aspect of the present application.
  • the present application relates to an apparatus for decoding a video stream, comprising a processor and a memory.
  • the memory stores instructions that cause the processor to perform the method of the first aspect.
  • the present application relates to an apparatus for encoding a video stream, comprising a processor and a memory.
  • the memory stores instructions that cause the processor to perform the method of the second aspect.
  • a computer-readable storage medium having stored thereon instructions that, when executed, cause one or more processors to encode video data.
  • the instructions cause the one or more processors to perform the method of the first or second aspect or any possible embodiment of the first or second aspect.
  • the present application relates to a computer program product comprising program code which, when run, performs the method of the first or second aspect or any one of possible embodiments of the first or second aspect.
  • FIG. 1 is a block diagram of an example of a video decoding system for implementing an embodiment of the present application;
  • FIG. 2 is a block diagram of another example of a video decoding system for implementing an embodiment of the present application;
  • FIG. 3 is a schematic block diagram of a video decoding apparatus for implementing an embodiment of the present application;
  • FIG. 4 is a schematic block diagram of a video decoding apparatus for implementing an embodiment of the present application;
  • FIG. 5 is a schematic diagram comparing the original image, the predicted image and the residual;
  • FIG. 6A is a schematic structural diagram of a video encoder according to an embodiment of the present application;
  • FIG. 6B is a schematic diagram of a principle of obtaining a residual signal according to an embodiment of the present application;
  • FIG. 6C is a schematic diagram of another principle of obtaining a residual signal according to an embodiment of the present application;
  • FIG. 6D is a schematic structural diagram of an encoding network according to an embodiment of the present application;
  • FIG. 6E is a schematic structural diagram of another encoding network according to an embodiment of the present application;
  • FIG. 6F is a schematic structural diagram of another encoding network according to an embodiment of the present application;
  • FIG. 6G is a schematic structural diagram of a network layer according to an embodiment of the present application;
  • FIG. 6H is a schematic diagram for explaining the probability distribution;
  • FIG. 7A is a schematic structural diagram of a neural network used for probability estimation according to an embodiment of the present application;
  • FIG. 7B is a schematic structural diagram of another neural network used for probability estimation according to an embodiment of the present application;
  • FIG. 7C is a schematic diagram of a principle of obtaining hidden information according to an embodiment of the present application;
  • FIG. 7D is a schematic diagram of another principle of obtaining hidden information according to an embodiment of the present application;
  • FIG. 7E is a schematic diagram of arithmetic coding interval division;
  • FIG. 8A is a schematic structural diagram of another video encoder provided by an embodiment of the present application;
  • FIG. 8B is a schematic structural diagram of another video encoder provided by an embodiment of the present application;
  • FIG. 9A is a schematic structural diagram of a video decoder provided by an embodiment of the present application;
  • FIG. 9B is a functional schematic diagram of a decoding network according to an embodiment of the present application;
  • FIG. 9C is a schematic structural diagram of a decoding network according to an embodiment of the present application;
  • FIG. 9D is a schematic structural diagram of another decoding network according to an embodiment of the present application;
  • FIG. 9E is a schematic structural diagram of another video decoder according to an embodiment of the present application;
  • FIG. 9F is a schematic structural diagram of another video decoder according to an embodiment of the present application;
  • FIG. 10 is a flowchart illustrating a process 1000 of an encoding method according to an embodiment of the present application;
  • FIG. 11A is a flowchart illustrating a probability distribution estimation process according to the present application;
  • FIG. 11B is a flowchart illustrating a probability distribution estimation process according to the present application;
  • FIG. 12 is a flowchart illustrating a process 1200 of a decoding method according to an embodiment of the present application;
  • FIG. 12A is a flowchart illustrating a probability distribution estimation process according to the present application;
  • FIG. 12B is a flowchart illustrating a probability distribution estimation process according to the present application;
  • FIG. 13 is a schematic diagram of an encoding and decoding process according to an embodiment of the present application;
  • FIG. 14 is a schematic diagram of another encoding and decoding process according to an embodiment of the present application;
  • FIG. 15 is a schematic diagram of another encoding and decoding process according to an embodiment of the present application;
  • FIG. 16 is a schematic diagram of another encoding and decoding process according to an embodiment of the present application;
  • FIG. 17 is a schematic diagram of another encoding and decoding process according to an embodiment of the present application;
  • FIG. 18 is a schematic diagram of another encoding and decoding process according to an embodiment of the present application.
  • Embodiments of the present application provide an AI-based video image compression technology, especially a neural network-based video compression technology, and specifically provide an entropy encoding and decoding technology for video images to improve traditional hybrid video encoding and decoding systems.
  • Video coding generally refers to the processing of sequences of images that form a video or video sequence. In the field of video coding, the terms “picture”, “frame” or “image” may be used as synonyms.
  • Video coding includes two parts: video encoding and video decoding. Video encoding is performed on the source side and typically involves processing (e.g., compressing) the original video image to reduce the amount of data required to represent the video image (and thus store and/or transmit it more efficiently). Video decoding is performed on the destination side and typically involves inverse processing relative to the encoder to reconstruct the video image.
  • the "encoding" of a video image in relation to the embodiments should be understood as the “encoding” or “decoding” of a video image or a video sequence.
  • the encoding part and the decoding part are also collectively referred to as codec (encoding and decoding, CODEC).
  • in the case of lossless video coding, the original video image can be reconstructed, i.e., the reconstructed video image has the same quality as the original video image (assuming no transmission loss or other data loss during storage or transmission).
  • in the case of lossy video coding, further compression is performed through quantization and the like to reduce the amount of data required to represent the video image, and the decoder side cannot completely reconstruct the video image; that is, the quality of the reconstructed video image is lower or worse than that of the original video image.
  • a neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes x_s and an intercept of 1 as inputs, and the output of the operation unit can be: h_{W,b}(x) = f(Σ_{s=1}^{n} W_s · x_s + b), where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit (a worked numerical example is given below).
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
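  • A worked numerical example of the neural-unit formula above with a sigmoid activation; the weights, inputs and bias are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])  # inputs x_s
W = np.array([0.4, 0.1, -0.6])  # weights W_s
b = 0.2                         # bias of the neural unit
output = sigmoid(W @ x + b)     # f(sum_s W_s * x_s + b)
print(output)
```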
  • a neural network is a network formed by connecting a plurality of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • a deep neural network also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers.
  • the DNN is divided according to the positions of different layers: the layers inside the DNN can be divided into three categories, namely the input layer, the hidden layers, and the output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • although the DNN looks complicated, the work of each layer is not complicated; in short, it is the following linear relationship expression: y = a(W · x + b), where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called coefficients), and a() is the activation function.
  • each layer simply performs this operation on its input vector x to obtain the output vector y; because the number of DNN layers is large, the numbers of coefficient matrices W and offset vectors b are also large.
  • these parameters are defined in the DNN as follows, taking the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the fourth neuron of the second layer to the second neuron of the third layer is defined as W^3_{24}, where the superscript 3 represents the layer number of the coefficient W, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.
  • in summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W^L_{jk}.
  • the input layer does not have a W parameter.
  • more hidden layers allow the network to better capture the complexities of the real world.
  • a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors W of many layers).
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network contains a feature extractor consisting of convolutional layers and subsampling layers, and the feature extractor can be viewed as a filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • in a convolutional layer of a convolutional neural network, a neuron can be connected to only some of the neurons in adjacent layers.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle; neural units in the same feature plane share weights, and the shared weights are the convolution kernel. Weight sharing can be understood as meaning that the way image information is extracted is independent of location (see the parameter-count sketch below).
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
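  • The effect of weight sharing can be illustrated by a parameter count: a 3x3 convolution applies the same kernel at every spatial position, so its parameter count is independent of image size, unlike a fully connected layer over a 64x64 single-channel input (all sizes are illustrative).

```python
import torch.nn as nn

conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)  # shared 3x3 kernels
fc = nn.Linear(64 * 64, 8 * 64 * 64)              # fully connected equivalent output size

conv_params = sum(p.numel() for p in conv.parameters())  # 8*1*3*3 + 8 = 80
fc_params = sum(p.numel() for p in fc.parameters())      # ~134 million
print(conv_params, fc_params)
```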
  • Recurrent neural networks (RNN) are used to process sequence data.
  • in an ordinary neural network, from the input layer to the hidden layer to the output layer, the layers are fully connected, while the nodes within each layer are unconnected.
  • although this ordinary neural network solves many problems, it is still powerless for many others. For example, to predict the next word of a sentence, the previous words are generally needed, because the words in a sentence are not independent of each other. RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs.
  • RNN can process sequence data of any length.
  • the training of RNN is the same as the training of traditional CNN or DNN.
  • RNNs are designed to give machines the ability to memorize like humans do. Therefore, the output of RNN needs to rely on current input information and historical memory information.
  • the neural network can use the error back propagation (BP) algorithm to correct the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller.
  • specifically, the input signal is passed forward until the output generates an error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges.
  • the back-propagation algorithm is a back-propagation movement dominated by error loss, aiming to obtain the parameters of the optimal neural network model, such as the weight matrix.
  • In the following, the encoder 20 and the decoder 30 are described based on FIG. 1 and FIG. 2.
  • Video encoder 20 (or encoder 20 for short) and video decoder 30 (or decoder 30 for short) in the video coding system 10 represent devices that may be used to perform techniques according to the various examples described in this application.
  • the decoding system 10 includes a source device 12 for providing encoded image data 21 such as encoded images to a destination device 14 for decoding the encoded image data 21 .
  • the source device 12 includes an encoder 20 and, alternatively, an image source 16 , a preprocessor (or preprocessing unit) 18 such as an image preprocessor, and a communication interface (or communication unit) 22 .
  • Image source 16 may include or be any type of image capture device for capturing real-world images and the like, and/or any type of image generation device, such as a computer graphics processor for generating computer-animated images, or any type of device for acquiring and/or providing real-world images or computer-generated images (for example, screen content, virtual reality (VR) images, and/or any combination thereof, such as augmented reality (AR) images).
  • the image source may be any type of memory or storage that stores any of the above-mentioned images.
  • the image (or image data 17 ) may also be referred to as the original image (or original image data) 17 .
  • the pre-processor 18 is used to receive the (raw) image data 17 and to pre-process the image data 17 to obtain a pre-processed image (or pre-processed image data) 19.
  • the preprocessing performed by the preprocessor 18 may include trimming, color format conversion (eg, from RGB to YCbCr), toning, or denoising. It is understood that the preprocessing unit 18 may be an optional component.
  • a video encoder (or encoder) 20 is used to receive preprocessed image data 19 and provide encoded image data 21 .
  • the communication interface 22 in the source device 12 can be used to receive the encoded image data 21 and send the encoded image data 21 (or any other processed version thereof) over the communication channel 13 to another device, such as the destination device 14 or any other device, for storage or direct reconstruction.
  • the destination device 14 includes a decoder 30 and may additionally, alternatively, include a communication interface (or communication unit) 28 , a post-processor (or post-processing unit) 32 and a display device 34 .
  • the communication interface 28 in the destination device 14 is used to receive the encoded image data 21 (or any other processed version thereof) directly from the source device 12 or from any other source device such as a storage device, for example an encoded-image-data storage device, and to supply the encoded image data 21 to the decoder 30.
  • Communication interface 22 and communication interface 28 may be used to send or receive the encoded image data (or encoded data) 21 over a direct communication link between the source device 12 and the destination device 14, such as a direct wired or wireless connection, or over any type of network, such as a wired network, a wireless network, or any combination thereof, any type of private and public network, or any combination thereof.
  • the communication interface 22 may be used to encapsulate the encoded image data 21 into a suitable format such as a message, and/or to process the encoded image data using any type of transfer encoding or processing for transmission over a communication link or communication network.
  • the communication interface 28 corresponds to the communication interface 22 and may be used, for example, to receive transmission data and process the transmission data using any type of corresponding transmission decoding or processing and/or decapsulation to obtain encoded image data 21 .
• Both the communication interface 22 and the communication interface 28 may be configured as a one-way communication interface, as indicated by the arrow in FIG. 1 pointing from the source device 12 to the destination device 14 along the communication channel 13, or as a two-way communication interface, and may be used to send and receive messages and the like to establish a connection, and to acknowledge and exchange any other information related to the communication link and/or to data transfer such as the transfer of encoded image data.
  • a video decoder (or decoder) 30 is used to receive encoded image data 21 and provide decoded image data (or decoded image data) 31 (which will be further described below based on Figures 9A, 9E and 9F, etc.).
• the post-processor 32 is configured to post-process the decoded image data 31 (also referred to as reconstructed image data), such as a decoded image, to obtain post-processed image data 33, such as a post-processed image.
• Post-processing performed by the post-processing unit 32 may include, for example, color format conversion (e.g., from YCbCr to RGB), toning, trimming, or resampling, or any other processing used to prepare the decoded image data 31 for display by the display device 34, etc.
  • a display device 34 is used to receive post-processed image data 33 to display the image to a user or viewer or the like.
  • Display device 34 may be or include any type of display for representing the reconstructed image, eg, an integrated or external display screen or display.
• the display screen may include a liquid crystal display (LCD), an organic light emitting diode (OLED) display, a plasma display, a projector, a micro LED display, a liquid crystal on silicon (LCoS) display, a digital light processor (DLP), or any other type of display.
• the decoding system 10 also includes a training engine 25, which is used to train the neural network in the encoder 20 or the decoder 30 so as to process its input data (for example, the reconstruction residual of an encoded image or of a decoded image) and obtain the probability distribution of residual values of the multiple samples included in the residual of the current image.
  • the input data also includes super-prior information.
  • the training data may be stored in a database (not shown), and the training engine 25 trains a neural network based on the training data.
  • the neural network is used to estimate the probability distribution of residual values of a plurality of samples included in the residual of the current image. It should be noted that the embodiments of the present application do not limit the source of the training data, for example, the training data may be obtained from the cloud or other places to perform neural network training for probability estimation.
  • the neural network trained by the training engine 25 can be applied to the decoding systems 10, 40, eg, the source device 12 (eg, the encoder 20) or the destination device 14 (eg, the decoder 30) shown in FIG. 1 .
• the training engine 25 may perform training in the cloud to obtain the above-mentioned neural network, and the decoding system 10 then downloads and uses the neural network from the cloud; alternatively, the training engine 25 may perform training in the cloud to obtain the neural network and use it to derive the probability distribution of residual values of the multiple samples included in the residual of the current image, and the decoding system 10 then obtains the probability distribution directly from the cloud.
• although FIG. 1 shows the source device 12 and the destination device 14 as separate devices, a device embodiment may also include both devices or both functions, that is, include the source device 12 or corresponding function and the destination device 14 or corresponding function at the same time.
• the source device 12 or corresponding function and the destination device 14 or corresponding function may be implemented using the same hardware and/or software, by separate hardware and/or software, or by any combination thereof.
  • the existence and (exact) division of the different units or functions in the source device 12 and/or the destination device 14 shown in FIG. 1 may vary based on the actual device and application, as will be apparent to the skilled person .
• Encoder 20 (e.g., video encoder 20) or decoder 30 (e.g., video decoder 30), or both, may be implemented by processing circuitry as shown in FIG. 2, e.g., one or more microprocessors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), discrete logic, hardware, special-purpose processors for video encoding, or any combination thereof.
  • Encoder 20 may be implemented by processing circuitry 46 to include the various modules discussed with reference to encoder 20 of FIG. 2 and/or any other decoder system or subsystem described herein.
• Decoder 30 may be implemented by processing circuitry 46 to include the various modules discussed with reference to decoder 30 of FIG. 2 and/or any other decoder system or subsystem described herein.
• the processing circuitry 46 may be used to perform the various operations discussed below. If parts of the techniques are implemented in software, a device may store the software instructions in a suitable non-transitory computer-readable storage medium and execute the instructions in hardware using one or more processors, thereby implementing the techniques of this application (see also FIG. 4).
• Either of the video encoder 20 and the video decoder 30 may be integrated in a single device as part of a combined codec (encoder/decoder, CODEC), as shown in FIG. 2.
• Source device 12 and destination device 14 may include any of a variety of devices, including any type of handheld or stationary device, such as a notebook or laptop computer, a mobile phone, a smartphone, a tablet or tablet computer, a camera, a desktop computer, a set-top box, a television, a display device, a digital media player, a video game console, a video streaming device (e.g., a content service server or a content distribution server), a broadcast receiving device, or a broadcast transmitting device, and may use no operating system or any type of operating system.
  • source device 12 and destination device 14 may be equipped with components for wireless communication.
  • source device 12 and destination device 14 may be wireless communication devices.
• the video coding system 10 shown in FIG. 1 is merely exemplary, and the techniques provided herein may be applicable to video coding settings (e.g., video encoding or video decoding) that do not necessarily include any data communication between the encoding device and the decoding device.
  • data is retrieved from local storage, sent over a network, and so on.
  • the video encoding device may encode and store the data in memory, and/or the video decoding device may retrieve and decode the data from the memory.
  • encoding and decoding are performed by devices that do not communicate with each other but merely encode data to and/or retrieve and decode data from memory.
• Video coding system 40 may include an imaging device 41, video encoder 20, video decoder 30 (and/or a video encoder/decoder implemented by processing circuitry 46), an antenna 42, one or more processors 43, one or more memories 44, and/or a display device 45.
  • imaging device 41, antenna 42, processing circuit 46, video encoder 20, video decoder 30, processor 43, memory storage 44 and/or display device 45 can communicate with each other.
  • video coding system 40 may include only video encoder 20 or only video decoder 30 .
  • antenna 42 may be used to transmit or receive an encoded bitstream of video data.
  • display device 45 may be used to present video data.
  • Processing circuitry 46 may include application-specific integrated circuit (ASIC) logic, graphics processors, general purpose processors, and the like.
  • Video coding system 40 may also include an optional processor 43, which may similarly include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, or the like.
• the memory 44 may be any type of memory, such as volatile memory (e.g., static random access memory (SRAM), dynamic random access memory (DRAM), etc.) or non-volatile memory (e.g., flash memory, etc.).
• the memory 44 may be implemented by cache memory.
• processing circuitry 46 may include memory (e.g., cache memory, etc.) for implementing an image buffer and the like.
• video encoder 20 implemented by logic circuitry may include an image buffer (e.g., implemented by processing circuitry 46 or memory 44) and a graphics processing unit (e.g., implemented by processing circuitry 46).
  • the graphics processing unit may be communicatively coupled to the image buffer.
• the graphics processing unit may include video encoder 20 implemented by processing circuitry 46 to implement the various modules discussed with reference to video encoder 20 of FIG. 2 and/or any other encoder system or subsystem described herein.
  • Logic circuits may be used to perform the various operations discussed herein.
• video decoder 30 may be implemented by processing circuitry 46 in a similar manner to implement the various modules discussed with reference to video decoder 30 of FIG. 2 and/or any other decoder system or subsystem described herein.
• logic-circuit-implemented video decoder 30 may include an image buffer (implemented by processing circuitry 46 or memory 44) and a graphics processing unit (e.g., implemented by processing circuitry 46).
  • the graphics processing unit may be communicatively coupled to the image buffer.
  • the graphics processing unit may include video decoder 30 implemented by processing circuitry 46 .
  • antenna 42 may be used to receive an encoded bitstream of video data.
• the encoded bitstream may include data, indicators, index values, mode selection data, and the like related to encoded video frames as discussed herein, such as data related to encoding partitions (e.g., transform coefficients or quantized transform coefficients, optional indicators (as discussed), and/or data defining the encoding partitions).
  • Video coding system 40 may also include video decoder 30 coupled to antenna 42 for decoding the encoded bitstream.
  • Display device 45 is used to present video frames.
  • video decoder 30 may be used to perform the opposite process.
• video encoder 20 may entropy encode such syntax elements into an encoded video bitstream, and video decoder 30 may receive and parse such syntax elements and decode the related video data accordingly.
• the codec process described in this application exists in most video codecs, such as those corresponding to H.263, H.264, MPEG-2, MPEG-4, VP8, VP9, and AI-based end-to-end image coding.
  • FIG. 3 is a schematic diagram of a video decoding apparatus 400 provided by an embodiment of the present application.
  • Video coding apparatus 400 is suitable for implementing the disclosed embodiments described herein.
  • the video coding apparatus 400 may be a decoder, such as the video decoder 30 in FIG. 2 , or an encoder, such as the video encoder 20 in FIG. 2 .
• the video coding apparatus 400 includes: an ingress port 410 (or input port 410) and a receiver unit (Rx) 420 for receiving data; a processor, logic unit, or central processing unit (CPU) 430 for processing data (for example, the processor 430 here may be a neural network processor 430); a transmitter unit (Tx) 440 and an egress port 450 (or output port 450) for transmitting data; and a memory 460.
• the video coding apparatus 400 may also include optical-to-electrical (OE) components and electrical-to-optical (EO) components coupled to the ingress port 410, the receiver unit 420, the transmitter unit 440, and the egress port 450, serving as an egress or ingress for optical or electrical signals.
  • the processor 430 is implemented by hardware and software.
  • Processor 430 may be implemented as one or more processor chips, cores (eg, multi-core processors), FPGAs, ASICs, and DSPs.
  • the processor 430 communicates with the ingress port 410 , the receiving unit 420 , the sending unit 440 , the egress port 450 and the memory 460 .
  • the processor 430 includes a decoding module 470 (eg, a neural network NN based decoding module 470).
• the decoding module 470 implements the embodiments disclosed above. For example, the decoding module 470 performs, processes, prepares, or provides various encoding operations.
  • decoding module 470 is implemented as instructions stored in memory 460 and executed by processor 430 .
  • Memory 460 includes one or more magnetic disks, tape drives, and solid-state drives, and may serve as an overflow data storage device for storing programs when such programs are selected for execution, and for storing instructions and data read during program execution.
• Memory 460 may be volatile and/or non-volatile, and may be read-only memory (ROM), random access memory (RAM), ternary content-addressable memory (TCAM), and/or static random-access memory (SRAM).
  • FIG. 4 is a simplified block diagram of an apparatus 500 provided by an exemplary embodiment, and the apparatus 500 can be used as either or both of the source device 12 and the destination device 14 in FIG. 1 .
  • the processor 502 in the apparatus 500 may be a central processing unit.
  • the processor 502 may be any other type of device or devices, existing or to be developed in the future, capable of manipulating or processing information.
• although the disclosed implementations may be implemented using a single processor, such as processor 502 as shown, speed and efficiency advantages may be achieved by using more than one processor.
  • the memory 504 in the apparatus 500 may be a read only memory (ROM) device or a random access memory (RAM) device. Any other suitable type of storage device may be used as memory 504 .
  • Memory 504 may include code and data 506 accessed by processor 502 via bus 512 .
  • the memory 504 may also include an operating system 508 and application programs 510 including at least one program that allows the processor 502 to perform the methods described herein.
  • applications 510 may include applications 1 through N, and also include video coding applications that perform the methods described herein.
  • Apparatus 500 may also include one or more output devices, such as display 518 .
  • display 518 may be a touch-sensitive display that combines a display with touch-sensitive elements that may be used to sense touch input.
  • Display 518 may be coupled to processor 502 through bus 512 .
• although the bus 512 of the apparatus 500 is described herein as a single bus, the bus 512 may include multiple buses.
  • secondary storage may be directly coupled to other components of the device 500 or accessed through a network, and may include a single integrated unit, such as a memory card, or multiple units, such as multiple memory cards. Accordingly, the apparatus 500 may have various configurations.
• in the illustrated example, the first row is the original image, the second row is the prediction of the original image, and the third row is the residual of the original image.
  • FIG. 6A is a schematic block diagram of an example of a video encoder for implementing the techniques of the present application.
• the video encoder 20 includes an input (or input interface) 202, an encoding network 204, a rounding unit 206, an entropy encoding unit 208, a neural network 210 for probability estimation, and an output (or output interface) 212.
  • the video encoder 20 shown in FIG. 6A may also be referred to as an end-to-end video encoder or an end-to-end video encoder-based video encoder.
• the encoder 20 may receive input data 203 through an input 202 or the like, the input data including the residual of the current image as the input to the encoding network 204.
• the above-mentioned input data 203 also includes the residual of a reference image, or
• residuals of multiple reference images and the prediction of the current image, or
• residuals of multiple reference images, predictions of the multiple reference images, and the prediction of the current image, which are used as inputs to the neural network 210 for probability estimation.
  • a (digital) image is or can be viewed as a two-dimensional array or matrix of pixel points with intensity values.
  • the pixels in the array may also be called pixels or pels (short for picture elements).
  • the number of pixels in the array or image in the horizontal and vertical directions (or axes) determines the size and/or resolution of the image.
• three color components are usually used, that is, an image can be represented as or include three arrays of pixel points.
• in RGB format or color space, an image includes corresponding arrays of red, green, and blue pixel points.
• each pixel is usually represented in a luminance/chrominance format or color space, such as YCbCr, which includes a luminance component indicated by Y (sometimes also indicated by L) and two chrominance components indicated by Cb and Cr.
• the luminance (luma) component Y represents luminance or gray-level intensity (e.g., as in a grayscale image), while the two chrominance (chroma) components Cb and Cr represent the chrominance or color information components.
  • an image in YCbCr format includes a luminance pixel array of luminance pixel value (Y) and two chrominance pixel arrays of chrominance values (Cb and Cr).
• Images in RGB format may be converted or transformed to YCbCr format and vice versa; this process is also known as color transformation or conversion. If an image is black and white, the image may include only an array of luminance pixels. Correspondingly, an image may be, for example, an array of luminance pixels in monochrome format, or an array of luminance pixels and two corresponding arrays of chrominance pixels in 4:2:0, 4:2:2, or 4:4:4 color format.
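• As a concrete illustration of the color conversion just described, the following Python sketch converts a full-range RGB image to YCbCr using the BT.601 coefficients; the use of numpy and of full-range (JPEG-style) coefficients is an assumption for illustration, not something mandated by this application.

```python
import numpy as np

def rgb_to_ycbcr(rgb: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 RGB image to full-range YCbCr (BT.601)."""
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    return np.clip(np.stack([y, cb, cr], axis=-1), 0, 255).astype(np.uint8)
```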
  • the residual of the current image is obtained based on the prediction of the current image and the current image.
• the current image includes the original values of multiple samples (pixels), and the prediction of the current image includes the predicted values of the multiple samples (pixels); the original value of each sample in the current image and the predicted value of the sample at the corresponding position in the prediction of the current image are subtracted one by one, so as to obtain the residual of the current image.
• the prediction signal of the current image is obtained by prediction from the reconstructed sample values of an encoded image; the original signal of the current image and the prediction of the current image are then processed in the above manner to obtain the residual of the current image.
• alternatively, the low-resolution image of the current image is up-sampled to obtain the prediction of the current image, and the original signal of the current image and the prediction of the current image are then processed in the above manner to obtain the residual of the current image, wherein the low-resolution image of the current image is obtained by down-sampling the current image.
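• A minimal Python sketch of these two steps follows: sample-wise differencing to form the residual, and nearest-neighbour up-sampling as a stand-in for the unspecified up-sampler (both function names are hypothetical).

```python
import numpy as np

def image_residual(current: np.ndarray, prediction: np.ndarray) -> np.ndarray:
    # Sample-by-sample difference between original values and predicted values.
    return current.astype(np.int16) - prediction.astype(np.int16)

def upsampled_prediction(low_res: np.ndarray, factor: int = 2) -> np.ndarray:
    # Nearest-neighbour up-sampling of the low-resolution image; the actual
    # up-sampling filter is not specified here, so this is only illustrative.
    return low_res.repeat(factor, axis=0).repeat(factor, axis=1)
```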
• in this application, the prediction signal of the current image may also be called the prediction of the current image, the prediction signal of the reference image may also be called the prediction of the reference image, and the original signal of the current image refers to the current image itself.
  • the residual of the current image in this application may also be referred to as the residual signal of the current image.
  • the encoding network 204 is configured to process the residual of the current image through the encoding network to obtain a feature map 205 of the residual of the current image;
• the structure of the encoding network 204 is shown in FIG. 6D; the encoding network 204 includes K sub-encoding networks, each corresponding to its own feature layer and output feature map.
• the residual of the current image is input into the first sub-encoding network for feature extraction to obtain the first output feature map; the first output feature map is then input into the second sub-encoding network for feature extraction to obtain the second output feature map, and so on.
• the feature map of the residual of the above-mentioned current image may be any one of the K output feature maps, for example the Kth output feature map, or a multi-channel feature map obtained by stacking multiple of the K output feature maps, which has features at multiple scales.
  • the structure of the encoding network 204 is shown in FIG. 6E, and the encoding network 204 is composed of T network layers, wherein M, L, T, and K are positive integers.
• the Mth output feature map and the Kth output feature map are both outputs of the encoding network; the Kth output feature map is output after network layer L of the encoding network 204, and the Mth output feature map is output after network layer T.
• the feature map of the residual of the current image may be the Mth output feature map, or a multi-channel feature map obtained by stacking multiple output feature maps, in which case the feature map has multi-scale features; the multiple output feature maps are processed, for example by up-sampling or down-sampling, so that the dimensions of the multiple output feature maps are the same.
  • Any one of the network layers in FIG. 6E may be a convolution layer, a normalization layer, a nonlinear activation layer, or the like.
• the sub-encoding network structures in the encoding network 204 corresponding to FIG. 6D may be the same as or different from each other; the network structures of the Mth and Nth sub-encoding networks are shown in FIG. 6F.
• the structure of any network layer of the Mth sub-encoding network and that of the Nth sub-encoding network may be different from each other, and L and P may be the same or different, where both L and P are positive integers greater than 0.
• the Nth sub-encoding network in the example includes 5 network layers, specifically a normalization layer, two convolutional layers, and two nonlinear activation layers.
• the rounding unit 206 is used to round the output feature map 205, for example by scalar quantization or vector quantization, to obtain an integerized feature map 207.
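• For illustration, rounding with an optional scalar-quantization step size might look as follows in Python; the step size is an assumption, and with step 1.0 this reduces to plain rounding.

```python
import numpy as np

def round_feature_map(y: np.ndarray, step: float = 1.0) -> np.ndarray:
    # Scalar quantization: scale by the step size, then round to the
    # nearest integer ("integerization").
    return np.rint(y / step).astype(np.int32)
```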
• the video encoder 20 may be used to output a quantization parameter (QP), e.g., directly or after being encoded or compressed by the entropy encoding unit 208, so that, for example, the video decoder 30 can receive and use the quantization parameter for decoding.
  • the neural network 210 for probability estimation is used to estimate the probability distribution of residual values of a plurality of samples included in the residual of the current image based on the input data.
• the above-mentioned probability distribution represents the probability distribution of the multiple differences between the original values of the multiple samples included in the current image and the predicted values of the multiple samples, or represents the probability distribution of the residual value of each sample in the multiple samples of the current image.
• C0, C1, C2, and C3 are the residuals of the 4 samples of the current image, with values -2, 1, -2, and 0, respectively; R0, R1, R2, and R3 are the residuals of the 4 samples of the reference image, with values -1, 2, -2, and 1, respectively.
• the residual value is in the range [-2, 2], that is, an integer between -2 and 2.
• the residual of the reference image is processed by the neural network 210 for probability estimation to obtain the probability distribution of the residual values of the four samples included in the residual of the current image:
• the probability distribution of the residual values of the four samples included in the residual of the current image specifically refers to the probability distribution of the residual value of each of the four samples of the current image, that is, the probability distribution of the residual value at the C0 position, the probability distribution of the residual value at the C1 position, the probability distribution of the residual value at the C2 position, and the probability distribution of the residual value at the C3 position, as shown in Table 1 below:
  • the above residuals may be arithmetic entropy encoded according to the above probability distribution.
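• For illustration only, the following Python sketch shows how such per-position tables feed an arithmetic entropy coder: an ideal arithmetic coder spends about -log2(p) bits per symbol. The probability values in `table1` are invented placeholders, not the actual Table 1 of this application.

```python
import math

SYMBOLS = [-2, -1, 0, 1, 2]  # residual value range [-2, 2]
# Hypothetical per-position distributions in the spirit of Table 1.
table1 = {
    "C0": [0.4, 0.2, 0.2, 0.1, 0.1],
    "C1": [0.1, 0.2, 0.2, 0.4, 0.1],
    "C2": [0.4, 0.2, 0.2, 0.1, 0.1],
    "C3": [0.1, 0.2, 0.4, 0.2, 0.1],
}
residuals = {"C0": -2, "C1": 1, "C2": -2, "C3": 0}  # values from the example

# Ideal code length of the four residual values under the per-position tables.
total_bits = 0.0
for pos, value in residuals.items():
    p = table1[pos][SYMBOLS.index(value)]
    total_bits += -math.log2(p)
print(f"ideal coded size: {total_bits:.2f} bits")
```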
• Another possible case is to obtain the probability distribution of the residual values of the four samples included in the current image based on the residual values of the four samples R0, R1, R2, and R3 included in the reference image. For example, based on the residual values of R0, R1, R2, and R3, which are -1, 2, -2, and 1 respectively, the probability distribution of the residual values of the four samples of the current image is obtained, as shown in Table 2 below.
• the probability distributions shown in Tables 2 and 3 are the probability distributions of each value within the residual value range and do not distinguish the locations of the samples; that is, the probability distributions of the residual values of the multiple samples are the same.
• the residual can then be encoded by performing arithmetic entropy encoding according to the above probability distribution, in the manner described above.
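• One simple, purely illustrative way to obtain such a single shared distribution is a smoothed histogram of the reference residual values; in this application the distribution is instead produced by the trained neural network, so the sketch below is only a stand-in under that assumption.

```python
import numpy as np

SYMBOLS = np.arange(-2, 3)                    # residual value range [-2, 2]
ref_residuals = np.array([-1, 2, -2, 1])      # R0..R3 from the example

# Smoothed histogram of the reference residual values as a shared
# distribution for all sample positions (Laplace smoothing avoids zeros).
counts = np.array([(ref_residuals == s).sum() for s in SYMBOLS], dtype=np.float64)
probs = (counts + 1.0) / (counts.sum() + len(SYMBOLS))
for s, p in zip(SYMBOLS, probs):
    print(f"P(residual = {s:+d}) = {p:.3f}")
```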
• the input data for the neural network 210 for probability estimation includes the residual of the reference image, or
• residuals of multiple reference images, or
• residuals of multiple reference images together with predictions of multiple images.
• the neural network 210 for probability estimation includes a convolutional network, a uniform pooling layer, and a normalized exponential (Softmax) layer, wherein the convolutional network includes a rectified linear unit (ReLU) activation layer and multiple convolutional layers.
• when the input data of the neural network 210 for probability estimation includes the residual of the reference image, the feature information of the residual of the reference image is extracted through the convolutional network; when the input data includes the residual of the reference image, the prediction of the reference image, and the prediction of the current image, the feature information of the residual of the reference image, the prediction of the reference image, and the prediction of the current image is extracted through the convolutional network; when the input data includes the residuals of multiple reference images, the feature information of the residuals of the multiple reference images is extracted through the convolutional network; when the input data includes the residuals of multiple reference images, the predictions of the multiple reference images, and the prediction of the current image, the feature information of the residuals of the multiple reference images, the predictions of the multiple reference images, and the prediction of the current image is extracted through the convolutional network; the uniform pooling layer and the Softmax layer then perform probability estimation on the features extracted by the convolutional network.
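• As a rough sketch of this architecture, the following PyTorch module chains a small convolutional network (convolution and ReLU layers) with a uniform (average) pooling layer and a per-sample Softmax over candidate residual values. Channel counts, kernel sizes, and the choice of average pooling are illustrative assumptions, not the trained network of this application.

```python
import torch
import torch.nn as nn

class ProbEstimator(nn.Module):
    """Sketch of a probability-estimation network: convolutional feature
    extraction, uniform pooling, and a per-sample Softmax."""
    def __init__(self, in_channels: int = 1, num_symbols: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, num_symbols, kernel_size=3, padding=1),
        )
        self.pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)  # "uniform pooling"
        self.softmax = nn.Softmax(dim=1)  # distribution over residual values, per sample

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, in_channels, H, W), e.g. the reference-image residual.
        # Returns (N, num_symbols, H, W) probabilities summing to 1 along dim 1.
        return self.softmax(self.pool(self.features(x)))
```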
• when the input of the neural network 210 for probability estimation includes multiple images (such as the residual of the reference image, the prediction of the reference image, and the prediction of the current image; or the residuals of multiple reference images; or the residuals of multiple reference images, the predictions of multiple reference images, and the prediction of the current image), a stacking operation is first performed on the multiple images to obtain a multi-channel image before they are input to the neural network 210 for probability estimation; the multi-channel image is then input into the neural network 210, the convolutional network in the neural network 210 extracts the feature information of the multi-channel image, the uniform pooling layer and the Softmax layer then perform probability estimation on the extracted feature information, and the probability distribution of the residual values of the multiple samples included in the residual of the current image is output; see the sketch below.
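• A minimal sketch of the stacking operation, using numpy and placeholder arrays:

```python
import numpy as np

# Stack several single-channel inputs (e.g. reference residual, reference
# prediction, current-image prediction) into one multi-channel image before
# feeding the probability-estimation network.
ref_residual = np.zeros((64, 64), dtype=np.float32)    # placeholder data
ref_prediction = np.zeros((64, 64), dtype=np.float32)
cur_prediction = np.zeros((64, 64), dtype=np.float32)

multi_channel = np.stack([ref_residual, ref_prediction, cur_prediction], axis=0)
print(multi_channel.shape)  # (3, 64, 64): a three-channel input
```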
  • the input data of the neural network 210 for probability estimation also includes super-prior information, and the structure of the neural network 210 for probability estimation is different from that in FIG. 7A .
• in this case the neural network includes a convolutional network, a uniform pooling layer, a first convolutional layer, a concatenation (concat) layer, and a Softmax layer, where the convolutional network includes multiple convolutional layers and ReLU layers.
• when the input data of the neural network 210 for probability estimation includes the residual of the reference image and super-prior information, the feature information of the residual of the reference image and of the super-prior information is extracted through the convolutional network; when the input data includes the residual of the reference image, the prediction of the reference image, super-prior information, and the prediction of the current image, the feature information of the residual of the reference image, the prediction of the reference image, the super-prior information, and the prediction of the current image is extracted through the convolutional network; when the input data includes the residuals of multiple reference images and super-prior information, the feature information of the residuals of the multiple reference images and of the super-prior information is extracted through the convolutional network; when the input data includes the residuals of multiple reference images, the predictions of multiple reference images, super-prior information, and the prediction of the current image, the feature information of the residuals of the multiple reference images, the predictions of the multiple reference images, the super-prior information, and the prediction of the current image is extracted through the convolutional network; the concatenation layer, the first convolutional layer, the uniform pooling layer, and the Softmax layer then perform probability estimation on the feature information extracted by the convolutional network and output the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • the first convolutional layer and the convolutional layer in the convolutional network may be the same or different.
• when the input data of the neural network 210 for probability estimation includes the residual of the reference image, the prediction of the current image, and the prediction of the reference image, these are stacked to obtain a three-channel image before being input to the neural network 210 for probability estimation, and the three-channel image is then input into the neural network 210 for processing; or,
• when the input data of the neural network 210 for probability estimation includes the residuals of multiple reference images, a stacking operation is performed on the residuals of the multiple reference images to obtain a multi-channel image, and the multi-channel image is then input into the neural network 210 for probability estimation for processing; or,
• when the input data of the neural network 210 for probability estimation includes the residuals of multiple reference images, the predictions of multiple images, and the prediction of the current image, the residuals of the multiple reference images, the predictions of the multiple images, and the prediction of the current image are stacked to obtain a multi-channel image before being input into the neural network 210 for probability estimation, and the multi-channel image is then input into the neural network 210 for processing.
• alternatively, when the input data of the neural network 210 for probability estimation includes the residuals of multiple reference images, the residuals of the multiple reference images are passed through a recurrent neural network to obtain first hidden information, and the first hidden information is then input into the neural network 210 for probability estimation for processing.
• the residual P_j of the jth reference image and the hidden information H_{j-1} are input into the recurrent neural network for processing to obtain the hidden information H_j; in this way the residuals of the reference images are processed to obtain the above-mentioned first hidden information, wherein, when the residual of the jth reference image is the first of the residuals of the multiple reference images, the hidden information H_{j-1} is 0, and when the residual of the jth reference image is the last of the residuals of the multiple reference images, the hidden information H_j is the first hidden information.
  • the difference between the time stamps of two adjacent reference images in the multiple reference images is the above-mentioned preset threshold.
• obtaining the first hidden information from the residuals of N reference images through the recurrent neural network specifically includes:
• the residual of the first reference image is input into the recurrent neural network for processing to obtain hidden information H_1; the hidden information H_1 and the residual of the second reference image are then input into the recurrent neural network for processing to obtain hidden information H_2; ...; the hidden information H_{N-3} and the residual of the (N-2)th reference image are input into the recurrent neural network for processing to obtain hidden information H_{N-2}; the hidden information H_{N-2} and the residual of the (N-1)th reference image are input into the recurrent neural network for processing to obtain hidden information H_{N-1}; the hidden information H_{N-1} and the residual of the Nth reference image are input into the recurrent neural network for processing to obtain hidden information H_N; the hidden information H_N is the first hidden information.
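• A minimal sketch of this recurrence, assuming a GRU cell as the recurrent network and flattened residual features of a fixed dimension (both assumptions for illustration):

```python
import torch
import torch.nn as nn

feat_dim, hidden_dim = 256, 256
rnn = nn.GRUCell(feat_dim, hidden_dim)  # stand-in for the recurrent network

def first_hidden_info(residual_feats: list[torch.Tensor]) -> torch.Tensor:
    h = torch.zeros(1, hidden_dim)       # H_0 = 0
    for p_j in residual_feats:           # residuals of the N reference images, in order
        h = rnn(p_j, h)                  # H_j = RNN(P_j, H_{j-1})
    return h                             # H_N is the first hidden information

feats = [torch.randn(1, feat_dim) for _ in range(4)]  # e.g. N = 4 flattened residuals
H_N = first_hidden_info(feats)
```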
• when the input data of the neural network 210 for probability estimation includes the residuals of multiple reference images, the predictions of the multiple reference images, and the prediction of the current image, the residuals of the multiple reference images and the predictions of the multiple reference images are passed through the recurrent neural network to obtain second hidden information, and the second hidden information and the prediction of the current image are then input into the neural network 210 for probability estimation for processing.
• the second hidden information is essentially a feature map, so after the second hidden information is obtained, the second hidden information and the prediction of the current image can be stacked to obtain a two-channel image, and the two-channel image is then input to the neural network 210 for probability estimation for processing.
• the second hidden information is obtained based on the residuals of multiple reference images and the predictions of the multiple reference images, as follows:
• the residual P_j of the jth reference image among the residuals of the multiple reference images, the prediction P_j' of the jth reference image among the predictions of the multiple reference images, and the hidden information H_{j-1}' are input into the recurrent neural network for processing to obtain hidden information H_j'; in this way, the residuals of the multiple reference images and the predictions of the multiple reference images are processed to obtain the above-mentioned second hidden information; wherein, when the residual of the jth reference image is the first of the residuals of the multiple reference images and the prediction of the jth reference image is the first of the predictions of the multiple reference images, the hidden information H_{j-1}' is 0; when the residual of the jth reference image is the last of the residuals of the multiple reference images and the prediction of the jth reference image is the last of the predictions of the multiple reference images, the hidden information H_j' is the second hidden information.
• obtaining the second hidden information from the residuals of N reference images and the predictions of the N reference images specifically includes:
• the residual of the first reference image and the prediction of the first reference image are input into the recurrent neural network for processing to obtain hidden information H_1'; the hidden information H_1', the residual of the second reference image, and the prediction of the second reference image are then input into the recurrent neural network for processing to obtain hidden information H_2'; ...; the hidden information H_{N-3}', the residual of the (N-2)th reference image, and the prediction of the (N-2)th reference image are input into the recurrent neural network for processing to obtain hidden information H_{N-2}'; the hidden information H_{N-2}', the residual of the (N-1)th reference image, and the prediction of the (N-1)th reference image are input into the recurrent neural network for processing to obtain hidden information H_{N-1}'; the hidden information H_{N-1}', the residual of the Nth reference image, and the prediction of the Nth reference image are input into the recurrent neural network for processing to obtain hidden information H_N'; the hidden information H_N' is the second hidden information.
• the layers are not limited to the execution order shown in FIG. 7A and FIG. 7B; other execution orders are of course possible.
• the video encoder 20 further includes a super-prior encoding network 218 and a super-prior decoding network 220, wherein the super-prior encoding network 218 performs further feature extraction on the feature map of the residual of the current image output by the encoding network 204 to obtain the feature vector of the super-prior information; the feature vector of the super-prior information is then input into the rounding unit 206, which rounds it to obtain the integerized feature vector of the super-prior information; the integerized feature vector of the super-prior information is then input into the super-prior decoding network 220 for processing to obtain the super-prior information; a sketch of this path follows below.
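• A minimal sketch of this super-prior path, assuming simple strided convolutions for the super-prior encoding network and transposed convolutions for the super-prior decoding network (all layer shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn

hyper_enc = nn.Sequential(  # stand-in for the super-prior encoding network 218
    nn.Conv2d(192, 128, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, 3, stride=2, padding=1))
hyper_dec = nn.Sequential(  # stand-in for the super-prior decoding network 220
    nn.ConvTranspose2d(128, 128, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
    nn.ConvTranspose2d(128, 192, 3, stride=2, padding=1, output_padding=1))

y = torch.randn(1, 192, 16, 16)   # feature map of the current-image residual
z = hyper_enc(y)                  # feature vector of the super-prior information
z_hat = torch.round(z)            # integerization by the rounding unit
side_info = hyper_dec(z_hat)      # super-prior information for probability estimation
```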
• for the network architecture of the super-prior encoding network 218, reference may be made to the network architecture of the encoding network 204, or other forms of architecture may be used, which will not be described in detail here.
• the function of the super-prior encoding network 218 is the inverse of the function of the super-prior decoding network 220, so for the network structure of the super-prior decoding network 220 reference may be made to the network structure of the super-prior encoding network 218.
• the super-prior information is optional for the neural network 210 for probability estimation; accordingly, for the video encoder 20, the super-prior encoding network 218 and the super-prior decoding network 220 are also optional, and are therefore represented using dashed boxes.
• the entropy coding unit 208 is used to apply an entropy coding algorithm or scheme (for example, a variable length coding (VLC) scheme, a context adaptive VLC (CALVC) scheme, an arithmetic coding scheme, a binarization algorithm, context adaptive binary arithmetic coding (CABAC), syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding, or another entropy coding method or technique) to the integerized feature map and the probability distribution of the residual values of the multiple samples included in the residual of the current image, to obtain encoded image data 21 that can be output through the output terminal 212 in the form of an encoded bitstream 21, so that the video decoder 30 or the like can receive and use the parameters for decoding.
• if necessary, statistics or related information can be added by the entropy encoding unit 208 and sent to the decoding end.
• the neural network 210 for probability estimation utilizes super-prior information when estimating the probability distribution of the residual values of the multiple samples included in the residual of the current image; after the super-prior information is obtained, entropy coding is performed on the super-prior information to obtain a code stream of the super-prior information, and the code stream is transmitted to the video decoder 30 for decoding, so that the video decoder 30 obtains the super-prior information and uses it to estimate the above probability distribution.
  • the video encoder 20 includes an input (or input interface) 202 , a quantization unit 216 , an entropy encoding unit 208 , a neural network 210 for probability estimation, and an output (or output interface) 212 .
  • the video encoder 20 shown in FIG. 8A may also be referred to as an end-to-end video encoder or an end-to-end video encoder-based video encoder.
• Encoder 20 may receive input data 203 through input 202 or the like, the input data including the residual of the current image as input to quantization unit 216.
• the above-mentioned input data 203 also includes the residual of a reference image, or
• residuals of multiple reference images and the prediction of the current image, or
• residuals of multiple reference images, predictions of the multiple reference images, and the prediction of the current image, which are used as inputs to the neural network 210 for probability estimation.
  • the quantization unit 216 is configured to quantize the residual of the current image to obtain the quantized residual of the current image, and the quantized residual of the current image is used as the input of the entropy encoding unit 208 .
  • the above-mentioned quantization may be uniform quantization or non-uniform quantization.
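• For illustration, a uniform scalar quantizer and its inverse might look as follows in Python; the step size (or its derivation from a QP) is an assumption for the sketch.

```python
import numpy as np

def quantize(residual: np.ndarray, step: float) -> np.ndarray:
    # Uniform quantization of the current-image residual.
    return np.rint(residual / step).astype(np.int32)

def dequantize(levels: np.ndarray, step: float) -> np.ndarray:
    # Inverse quantization, as performed on the decoding side.
    return levels.astype(np.float32) * step
```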
• for the detailed functions of the neural network 210 for probability estimation, refer to the relevant description of the neural network 210 for probability estimation in FIG. 6A; the function of the neural network 210 for probability estimation in FIG. 8A is the same as that in FIG. 6A and will not be described again here.
• Entropy coding unit 208 is used to apply an entropy coding algorithm or scheme (e.g., a VLC scheme, CALVC, an arithmetic coding scheme, a binarization algorithm, CABAC, SBAC, PIPE coding, or other entropy coding methods or techniques) to the quantized residual of the current image and the probability distribution of the residual values of the multiple samples included in the residual of the current image, to obtain encoded image data 21 that can be output in the form of an encoded bitstream 21 through the output terminal 212, so that the video decoder 30 and the like can receive and use the parameters for decoding.
  • the encoded bitstream 21 may be transmitted to the video decoder 30, or stored in memory for later transmission or retrieval by the video decoder 30.
• for the detailed function of the entropy encoding unit 208, refer to the relevant description of the entropy encoding unit 208 in FIG. 6A; the function of the entropy encoding unit 208 in FIG. 8A is the same as that in FIG. 6A and will not be described again here.
  • video encoder 20 includes an input (or input interface) 202, a transform unit 214, a quantization unit 216, an entropy encoding unit 208, a neural network 210 for probability estimation, and an output (or output interface) 212.
  • the video encoder 20 shown in FIG. 8B may also be referred to as an end-to-end video encoder or an end-to-end video encoder-based video encoder.
• the encoder 20 may receive input data 203 through the input 202 or the like, the input data including the residual of the current image as input to the transform unit 214.
• the above-mentioned input data 203 also includes the residual of a reference image, or
• residuals of multiple reference images and the prediction of the current image, or
• residuals of multiple reference images, predictions of the multiple reference images, and the prediction of the current image, which are used as inputs to the neural network 210 for probability estimation.
  • the transforming unit 214 is configured to transform the residual of the current image to obtain the residual of the transformed current image, and the transformed residual of the current image is used as the input of the quantization unit 216 .
• the above-mentioned transformation may be a discrete cosine transform (DCT), a discrete sine transform (DST), or a discrete wavelet transform (DWT), etc.; see the sketch below.
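• A small Python sketch of the transform step, using SciPy's 2-D DCT on an 8x8 residual block; the block size and the orthonormal normalization are illustrative choices, not requirements of this application.

```python
import numpy as np
from scipy.fft import dctn, idctn

residual_block = np.random.randn(8, 8).astype(np.float64)

coeffs = dctn(residual_block, norm="ortho")   # forward 2-D DCT of the residual
recon = idctn(coeffs, norm="ortho")           # inverse transform on the decoding side
assert np.allclose(recon, residual_block)     # lossless up to floating-point error
```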
  • the quantization unit 216 is configured to quantize the residual of the transformed current image to obtain the transformed and quantized residual of the current image, and the transformed and quantized residual of the current image is used as the input of the entropy encoding unit 208 .
  • the above-mentioned quantization may be uniform quantization or non-uniform quantization.
• for the detailed functions of the neural network 210 for probability estimation, refer to the relevant description of the neural network 210 for probability estimation in FIG. 6A; the function of the neural network 210 for probability estimation in FIG. 8B is the same as that in FIG. 6A and will not be described again here.
• Entropy coding unit 208 is used to apply an entropy coding algorithm or scheme (e.g., a VLC scheme, CALVC, an arithmetic coding scheme, a binarization algorithm, CABAC, SBAC, PIPE coding, or other entropy coding methods or techniques) to the transformed and quantized residual of the current image and the probability distribution of the residual values of the multiple samples included in the residual of the current image, to obtain encoded image data 21 that can be output in the form of an encoded bitstream 21 through the output terminal 212, so that the video decoder 30 and the like can receive and use the parameters for decoding.
  • the encoded bitstream 21 may be transmitted to the video decoder 30, or stored in memory for later transmission or retrieval by the video decoder 30.
• for the detailed function of the entropy encoding unit 208, refer to the relevant description of the entropy encoding unit 208 in FIG. 6A; the function of the entropy encoding unit 208 in FIG. 8B is the same as that in FIG. 6A and will not be described again here.
• it should be noted that, except for the residual of the input current image, which is obtained from the differences between the original values of multiple samples of the current image and the predicted values of those samples, the residuals mentioned elsewhere (including the residual of the current image and the residual of the reference image) are all reconstruction residuals.
  • the reconstruction residuals of an image refer to a plurality of differences between the reconstructed values of a plurality of samples of the image and the predicted values of the plurality of samples.
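• The distinction can be made concrete with a small Python example using hypothetical sample values:

```python
import numpy as np

# Original residual: original values minus predicted values (encoder input).
# Reconstruction residual: reconstructed values minus predicted values
# (what the residuals of reference images refer to here).
original = np.array([[120, 121], [119, 118]], dtype=np.int16)
prediction = np.array([[118, 120], [120, 118]], dtype=np.int16)
reconstructed = np.array([[121, 121], [118, 118]], dtype=np.int16)  # after decoding

original_residual = original - prediction
reconstruction_residual = reconstructed - prediction
```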
  • FIG. 9A is a schematic block diagram of an example of a video decoder for implementing the techniques of the present application.
  • video decoder 30 includes an input (or input interface) 302 , a neural network 304 for probability estimation, an entropy decoding unit 306 , a decoding network 308 and an output (or output interface) 310 .
  • the video decoder 30 shown in FIG. 9A may also be referred to as an end-to-end video decoder or an end-to-end video decoder-based video decoder.
  • the decoder 30 may receive input data through the input terminal 302 or the like, where the input data includes the code stream of the current image as the input of the entropy decoding unit 306 .
• the above-mentioned input data also includes the residual of a reference image, or
• residuals of multiple reference images and the prediction of the current image, or
• residuals of multiple reference images, predictions of the multiple reference images, and the prediction of the current image, which are used as inputs to the neural network 304 for probability estimation.
  • the residual of the current image is obtained based on the reconstruction value of the current image and the prediction of the current image.
• the reconstruction of the current image includes the reconstructed values of multiple samples (pixels), and the prediction of the current image includes the predicted values of the multiple samples (pixels); the reconstructed value of each sample in the current image and the predicted value of the sample at the corresponding position in the prediction of the current image are subtracted one by one to obtain the residual of the current image. Therefore, after the residual of the current image is obtained, the residual values of the multiple samples (pixels) included in the residual of the current image can be added one by one to the predicted values of the multiple samples (pixels) included in the prediction of the current image to obtain the reconstruction of the current image; see the sketch below.
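• A minimal sketch of this sample-wise reconstruction, assuming 8-bit samples and clipping to the valid range (both assumptions for illustration):

```python
import numpy as np

def reconstruct(residual: np.ndarray, prediction: np.ndarray) -> np.ndarray:
    # Sample-wise addition of decoded residual values and predicted values,
    # clipped to the valid 8-bit sample range.
    return np.clip(residual.astype(np.int16) + prediction.astype(np.int16),
                   0, 255).astype(np.uint8)
```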
• the prediction of the current image is obtained by prediction from the reconstructed sample values of the reference image, or by up-sampling the reconstructed sample values of the low-resolution image of the current image to obtain a sampled image whose resolution is the same as that of the current image; the sampled image is then the prediction of the current image, wherein the low-resolution image of the current image is obtained by down-sampling the current image.
  • the neural network 304 for probability estimation is used to estimate the probability distribution of residual values of a plurality of samples included in the residual of the current image based on the input data.
  • the input data of the neural network 304 for probability estimation further includes super-prior information, and the super-prior information is used to characterize the features of the residual of the current image at a deeper level.
• the function of the neural network 304 used for probability estimation is the same as the function of the neural network 210 used for probability estimation on the encoding side; for the function of the neural network 304, see the related description of the neural network 210, which will not be repeated here.
• the video decoder 30 further includes a super-prior decoding network 312; the super-prior decoding network 312 processes the integerized feature vector of the super-prior information, which the entropy decoding unit 306 obtains by performing arithmetic entropy decoding on the super-prior code stream, to obtain the above-mentioned super-prior information.
  • the function of the super-a priori encoding network 218 is opposite to that of the super-a priori decoding network 312 , so the network structure of the super-a priori decoding network 312 can be referred to the network structure of the super-a priori encoding network 218 .
• the super-prior decoding network 312 is also optional and is therefore indicated using dashed boxes.
  • the entropy decoding unit 306 is configured to apply an entropy decoding algorithm or scheme to the code stream of the current image and the probability distribution of residual values of multiple samples included in the residual of the current image, to obtain a feature map of the residual of the current image.
  • the entropy decoding unit 306 may employ a decoding algorithm or scheme corresponding to the encoding scheme applied by the entropy encoding unit 208 of the encoder 20 .
• the feature map of the residual of the current image is processed through the decoding network 308, and the residual of the current image can be output through the output terminal 310; the video decoder 30 can subsequently obtain the reconstructed sample values of the current image, that is, the reconstruction of the current image, through the residual of the current image.
  • the feature map of the residual of the current image is a multi-channel feature map
  • the multi-channel feature map includes N decoded feature maps, where N is greater than 1.
  • FIG. 9B is a schematic diagram of the input of N decoded feature maps to the decoding network 308 .
  • the input and output structure of the decoding network 308 is shown in FIG. 9C , and the decoding network 308 is composed of P network layers, where P>L and M, L, T, and P are all positive integers.
• the Tth decoding feature map is input after network layer L of the decoding network 308. Both the Tth decoding feature map and the Mth decoding feature map are inputs to the decoding network. It can be understood that multiple decoding feature maps may be used as inputs after different network layer positions in the decoding network 308, which is not limited here.
  • Any one of the network layers in FIG. 9C may be a convolutional layer, an inverse normalization layer, a nonlinear activation layer, or the like.
  • operations including addition, subtraction, etc. may exist in the decoding network 308 .
  • the network layer structures in the decoding network 308 may be the same or different from each other.
• an example of the structure of the decoding network 308 is shown in FIG. 9D.
• the decoding network in the example includes 12 network layers, including one inverse normalization layer, six convolutional layers, and four nonlinear activation layers; m, n, and t are all positive integers, and the output of the decoding network 308 is the residual of the current image.
  • the t-th decoded feature map, the n-th decoded feature map, and the m-th decoded feature map are all inputs to the decoding network, where t, n, and m are all integers.
  • the video decoder 30 includes an input (or input interface) 302 , a neural network 304 for probability estimation, an entropy decoding unit 306 , an inverse quantization unit 314 and an output (or output interface) 310 .
  • the video decoder 30 shown in FIG. 9E may also be referred to as an end-to-end video decoder or an end-to-end video decoder-based video decoder.
• the decoder 30 may receive input data through the input terminal 302 or the like, the input data including the code stream of the current image as the input of the entropy decoding unit 306.
  • the above input data also includes: the residual of the reference image; or,
  • residuals of multiple reference images and the prediction of the current image; or,
  • residuals of multiple reference images, predictions of the multiple reference images, and the prediction of the current image, as inputs to the neural network 304 for probability estimation.
  • for the detailed function of the neural network 304 for probability estimation, reference may be made to the relevant description of the neural network 210 for probability estimation in FIG. 6A; the function of the neural network 304 is the same as that of the neural network 210 and is not described again here.
  • the entropy decoding unit 306 is configured to apply an entropy decoding algorithm or scheme to the code stream of the current image based on the probability distribution of the residual values of the multiple samples included in the residual of the current image, to obtain the quantized residual of the current image.
  • the entropy decoding unit 306 may employ a decoding algorithm or scheme corresponding to the encoding scheme applied by the entropy encoding unit 208 of the encoder 20 .
  • the inverse quantization unit 314 is used to inverse-quantize the quantized residual of the current image to obtain the residual of the current image, which can be output through the output terminal 310; the subsequent video decoder 30 can obtain the reconstructed sample values of the current image, that is, the reconstruction of the current image, from the residual of the current image.
  • the above-mentioned quantization may be uniform quantization or non-uniform quantization.
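  • As a rough illustration of the quantization and inverse quantization mentioned above, the following is a minimal sketch assuming uniform quantization; the step size is an illustrative assumption, since the actual quantizer is not specified here.

```python
import numpy as np

def quantize(residual, step=8.0):
    """Uniform quantization: map residual values to integer levels."""
    return np.round(residual / step).astype(np.int32)

def dequantize(levels, step=8.0):
    """Inverse quantization: map levels back to approximate residual values."""
    return levels.astype(np.float32) * step

r = np.array([-21.0, -3.5, 0.0, 4.9, 17.2])
r_hat = dequantize(quantize(r))   # reconstructed residual (lossy)
```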
  • the video decoder 30 includes an input (or input interface) 302, a neural network 304 for probability estimation, an entropy decoding unit 306, an inverse quantization unit 314, an inverse transform unit 316, and an output (or output interface) 310.
  • the video decoder 30 shown in FIG. 9F may also be referred to as an end-to-end video decoder, or a video decoder based on an end-to-end video codec.
  • the decoder 30 may receive input data through the input terminal 302 or the like; the input data includes the code stream of the current image, which serves as the input of the entropy decoding unit 306.
  • the above input data also includes: the residual of the reference image; or,
  • residuals of multiple reference images and the prediction of the current image; or,
  • residuals of multiple reference images, predictions of the multiple reference images, and the prediction of the current image, as inputs to the neural network 304 for probability estimation.
  • for the detailed function of the neural network 304 for probability estimation, reference may be made to the relevant description of the neural network 210 for probability estimation in FIG. 6A; the function of the neural network 304 is the same as that of the neural network 210 and is not described again here.
  • the entropy decoding unit 306 is configured to apply the entropy decoding algorithm or scheme to the code stream of the current image based on the probability distribution of the residual values of the multiple samples included in the residual of the current image, to obtain the transformed and quantized residual of the current image.
  • the entropy decoding unit 306 may employ a decoding algorithm or scheme corresponding to the encoding scheme applied by the entropy encoding unit 208 of the encoder 20 .
  • the inverse quantization unit 314 is configured to perform inverse quantization on the transformed and quantized residual of the current image to obtain the transformed residual of the current image.
  • the above-mentioned quantization may be uniform quantization or non-uniform quantization.
  • the inverse transform unit 316 is used to inverse-transform the transformed residual of the current image to obtain the residual of the current image, which can be output through the output terminal 310; the subsequent video decoder 30 can obtain the reconstructed sample values of the current image, that is, the reconstruction of the current image, from the residual of the current image.
  • the above transform may be a DCT; alternatively, a DST, a DWT, or the like may also be performed.
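  • A minimal sketch of a DCT-based transform/inverse-transform pair, one of the options named above; the 8x8 block size and the orthonormal DCT-II are assumptions for illustration.

```python
import numpy as np
from scipy.fft import dctn, idctn

block = np.random.randn(8, 8)                # a residual block
coeff = dctn(block, norm='ortho')            # forward 2-D DCT (encoder side)
recon = idctn(coeff, norm='ortho')           # inverse 2-D DCT (decoder side)
assert np.allclose(block, recon)             # lossless without quantization
```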
  • the processing result of the current step can be further processed, and then output to the next step.
  • the processing results of the encoder 20 and the decoder 30 may be subjected to further operations or processing, such as clip or shift operations or filtering processing.
  • the residuals mentioned on the decoding side are all reconstructed residuals, and the reconstructed residuals are multiple differences between the reconstructed values of multiple samples of the image and the predicted values of the multiple samples.
  • the reconstructed residual may also be referred to as a reconstructed residual signal.
  • FIG. 10 is a flowchart illustrating a process 1000 of an encoding method according to an embodiment of the present application.
  • Process 1000 may be performed by video encoder 20 .
  • Process 1000 is described as a series of steps or operations, and it should be understood that process 1000 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 10 .
  • the encoding method includes:
  • the current image includes multiple samples, and the residual of the current image is the multiple differences between the original values of the multiple samples included in the current image and the predicted values of those samples.
  • obtaining the residual of the current image includes: taking, one by one, the differences between the original values of the multiple samples of the current image and the predicted values of those samples to obtain the residual values of the multiple samples of the current image;
  • the residual includes the residual values of the multiple samples, wherein the resolution of the current image is the same as the resolution of the prediction of the current image.
  • the prediction of the current image includes prediction values of multiple samples of the current image.
  • for the prediction process of the current image, reference may be made to the related descriptions in FIG. 6B and FIG. 6C, which are not repeated here.
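  • A minimal sketch of the residual computation described above, assuming the original image and its prediction are arrays of the same resolution; the sample values are illustrative.

```python
import numpy as np

def residual(original, prediction):
    """Per-sample residual: original value minus predicted value.
    Both arrays must have the same resolution (shape)."""
    assert original.shape == prediction.shape
    return original - prediction

orig = np.array([[120, 130], [125, 128]], dtype=np.int16)
pred = np.array([[118, 131], [124, 126]], dtype=np.int16)
res = residual(orig, pred)   # [[2, -1], [1, 2]]
```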
  • the current image and the reference image may be consecutive frames, or may be non-consecutive frames.
  • when they are non-consecutive frames, the difference between the timestamp of the current image and the timestamp of the reference image is less than a preset threshold.
  • a low-resolution image of the current image can be obtained by down-sampling the current image.
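  • A minimal sketch of the down-sampling/up-sampling relationship mentioned above (and used later for the prediction of the reference image): the reconstruction of the low-resolution image is up-sampled back to the original resolution to serve as a prediction. The scale factor and interpolation mode are assumptions.

```python
import torch
import torch.nn.functional as F

low_res_recon = torch.randn(1, 1, 16, 16)    # reconstructed low-resolution image
prediction = F.interpolate(low_res_recon, scale_factor=2,
                           mode='bilinear', align_corners=False)
# prediction now has the same resolution (32x32) as the original image
```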
  • the time stamp of the image may be the moment when the image was acquired.
  • the neural network is a trained neural network.
  • the method of this implementation further includes:
  • the input data for training includes images encoded before the sample image, the sample image being a known encoded image; the training input data is subjected to probability estimation processing through the neural network to obtain an estimated probability distribution of the residual values of the multiple samples included in the residual of the sample image;
  • a loss value is obtained based on the known probability distribution and the estimated probability distribution of the residual values of the multiple samples included in the residual of the sample image; when the loss value converges to the first threshold, or the current number of training iterations of the neural network is greater than or equal to the second threshold, the current neural network is determined to be the neural network obtained when training is completed.
  • the input data for training includes:
  • residual values of multiple samples included in the residual of the first reference image; or,
  • the first reference image and the second reference image are images encoded before the sample image is encoded.
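  • A minimal training-loop sketch of the procedure described above, under assumptions: the hypothetical `prob_net` maps reference residuals to per-sample probability distributions, KL divergence stands in for the (unspecified) loss between the known and estimated distributions, and both thresholds are illustrative.

```python
import torch
import torch.nn.functional as F

FIRST_THRESHOLD, SECOND_THRESHOLD = 1e-3, 10_000   # illustrative values

def train(prob_net, loader, lr=1e-4):
    opt = torch.optim.Adam(prob_net.parameters(), lr=lr)
    for step, (ref_residual, known_dist) in enumerate(loader):
        est_dist = prob_net(ref_residual)          # estimated distribution
        # loss between the known and the estimated probability distributions
        loss = F.kl_div(est_dist.log(), known_dist, reduction='batchmean')
        opt.zero_grad(); loss.backward(); opt.step()
        # stop when the loss converges, or when the training count hits the cap
        if loss.item() < FIRST_THRESHOLD or step + 1 >= SECOND_THRESHOLD:
            break
    return prob_net
```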
  • the above-mentioned probability distribution represents the probability distribution of the multiple differences between the original values of the multiple samples included in the current image and the predicted values of those samples; or represents the probability distribution of the residual value of each sample among the multiple samples of the current image.
  • the above-mentioned first entropy encoded data includes the residual of the current image, or the feature map of the residual of the current image, or the residual of the current image after transformation and quantization.
  • the method of this implementation further includes: extracting features of the residual of the current image through an encoding network to obtain a feature map of the residual of the current image.
  • the method of this implementation further includes:
  • the transformation may be DCT, DST, DWT or other transformations; the quantization may be uniform quantization, non-uniform quantization or other quantization.
  • the above-mentioned neural network includes a convolutional network, a uniform pooling layer, and a Softmax layer, and the convolutional network includes a plurality of convolutional layers and ReLU layers; wherein, as shown in the corresponding figure, performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples of the current image includes:
  • the above input data includes: the residual of the reference image; or,
  • residuals of multiple reference images, predictions of the multiple reference images, and the prediction of the current image.
  • the prediction of the reference image is obtained by prediction based on the reconstructed sample values of images encoded before the reference image, or by up-sampling the reconstructed sample values of a low-resolution image of the reference image to obtain a sampled image;
  • the resolution of the sampled image is the same as the resolution of the reference image, and the sampled image is the prediction of the reference image.
  • the low-resolution image of the reference image can be obtained by down-sampling the reference image.
  • S1102. Perform probability estimation on the feature information extracted by the convolutional network based on the uniform pooling layer and the Softmax layer, and output the probability distribution of residual values of multiple samples included in the residual of the current image.
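  • A minimal, hypothetical sketch of the network shape described in S1101/S1102, assuming that "uniform pooling" denotes average pooling and that residual values are modelled over K discrete levels; all widths, depths, and the value of K are illustrative.

```python
import torch
import torch.nn as nn

K = 256  # assumed number of candidate residual values

class ProbEstimator(nn.Module):
    def __init__(self, in_ch=1, ch=64):
        super().__init__()
        self.features = nn.Sequential(              # convolutional network
            nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, K, 3, padding=1))
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)  # "uniform pooling"
        self.softmax = nn.Softmax(dim=1)            # per-sample distribution

    def forward(self, ref_residual):
        # output: (batch, K, H, W) probabilities over residual values per sample
        return self.softmax(self.pool(self.features(ref_residual)))

probs = ProbEstimator()(torch.randn(1, 1, 32, 32))
assert torch.allclose(probs.sum(dim=1), torch.ones(1, 32, 32))
```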
  • the above-mentioned neural network includes a convolutional network, a uniform pooling layer, a first convolutional layer, a splicing layer, and a Softmax layer, and the convolutional network includes a plurality of convolutional layers and ReLU layers. Performing probability estimation processing on the input data to obtain the probability distribution of the residual values of the multiple samples of the current image includes:
  • the convolutional neural network extracts the feature information of the residual of the reference image and the feature information of the super-prior information respectively.
  • super-prior information is obtained by:
  • the function of the super-prior encoding network and the function of the super-prior decoding network are opposite.
  • arithmetic entropy coding is performed on the feature vector of the integerized super-prior information to obtain a super-prior code stream.
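  • A minimal sketch of the super-prior path described above, assuming small convolutional stacks for both networks and `torch.round` for the integerization step; the arithmetic entropy coding of the integerized vector is omitted here.

```python
import torch
import torch.nn as nn

hyper_encoder = nn.Sequential(   # extracts super-prior features from the
    nn.Conv2d(64, 32, 3, stride=2, padding=1), nn.ReLU(),  # residual feature map
    nn.Conv2d(32, 32, 3, stride=2, padding=1))
hyper_decoder = nn.Sequential(   # the "opposite" network: upsamples back
    nn.ConvTranspose2d(32, 32, 3, stride=2, padding=1, output_padding=1),
    nn.ReLU(),
    nn.ConvTranspose2d(32, 64, 3, stride=2, padding=1, output_padding=1))

y = torch.randn(1, 64, 16, 16)      # feature map of the current residual
z = hyper_encoder(y)                 # feature vector Z of the super-prior info
z_int = torch.round(z)               # integerized Z' (entropy-coded in practice)
super_prior = hyper_decoder(z_int)   # super-prior info fed to the estimator
```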
  • the residual of the current image is correlated with the residual of the reference image, so data including at least the residual of the reference image is input into the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image;
  • the first entropy-encoded data representing the residual of the current image is encoded by using this probability distribution, which is beneficial to improving the accuracy of encoding.
  • super-prior information is introduced to obtain a higher-precision probability distribution of the residual values of the multiple samples included in the residual of the current image, thereby improving the encoding accuracy.
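  • A toy illustration, with assumed numbers, of why a more accurate probability distribution improves coding: an ideal arithmetic coder spends about -log2(p) bits per symbol, so a sharper distribution that matches the data costs fewer bits.

```python
import numpy as np

symbols = np.array([0, 0, 0, 1])          # residual levels to encode
sharp = np.array([0.80, 0.15, 0.05])      # accurate estimate of the data
flat = np.array([0.34, 0.33, 0.33])       # weak, nearly uniform estimate

bits = lambda p: -np.log2(p[symbols]).sum()
print(bits(sharp), bits(flat))            # fewer bits with the sharper model
```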
  • FIG. 12 is a flowchart illustrating a process 1200 of a decoding method according to an embodiment of the present application.
  • Process 1200 may be performed by video decoder 30 .
  • Process 1200 is described as a series of steps or operations, and it should be understood that process 1200 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 12 .
  • the decoding method includes:
  • S1202. Perform probability estimation processing on the input data through a neural network to obtain a probability distribution of the residual values of multiple samples included in the residual of the current image, where the input data includes at least the residual of a reference image, and the reference image is an image decoded before the current image is decoded.
  • the reference image includes multiple samples, and the residual of the reference image includes the multiple differences between the reconstructed values of the multiple samples and the predicted values of the multiple samples.
  • the prediction of the reference picture includes prediction values of multiple samples.
  • the prediction of the reference image is obtained by predicting the reference image based on the reconstructed sample values of images decoded before the reference image, or by up-sampling the reconstructed sample values of a low-resolution image of the reference image to obtain a sampled image.
  • the resolution of the sampled image is the same as the resolution of the reference image
  • the sampled image is the prediction of the reference image.
  • the low-resolution image of the reference image can be obtained by down-sampling the reference image.
  • the first image and the current image may be consecutive frames or non-consecutive frames; for example, the first image is the (n-1)-th frame and the current image is the n-th frame; or the first image is the (n-2)-th frame and the current image is the n-th frame; for another example, the difference between the timestamp of the current image and the timestamp of the reference image is less than a preset threshold.
  • a low-resolution image of the current image can be obtained by down-sampling the current image.
  • the time stamp of the image may be the moment when the image was acquired.
  • the above-mentioned neural network is a neural network obtained by training, and the method of this embodiment further includes:
  • the input data for training includes the decoded image before decoding the sample image.
  • the sample image is a known decoded image.
  • the neural network is used to perform probability estimation processing on the input data for training to obtain an estimated probability distribution of the residual values of the multiple samples included in the residual of the sample image;
  • a loss value is obtained based on the known probability distribution and the estimated probability distribution of the residual values of the multiple samples included in the residual of the sample image; when the loss value converges to the first threshold, or the current number of training iterations of the neural network is greater than or equal to the second threshold, the current neural network is determined to be the neural network obtained when training is completed.
  • the input data for training includes: residual values of multiple samples included in the residual of the first reference image; or,
  • the first reference image and the second reference image are images decoded before the sample image is decoded.
  • the specific training process can be found in the above method.
  • the above-mentioned probability distribution represents the probability distribution of the multiple differences between the reconstructed values of the multiple samples included in the current image and the predicted values of those samples; or represents the probability distribution of the residual value of each sample among the multiple samples of the current image.
  • the above-mentioned neural network includes a convolutional network, a uniform pooling layer, and a Softmax layer, and the convolutional network includes a plurality of convolutional layers and ReLU layers; wherein, as shown in the corresponding figure, performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples of the current image includes:
  • the above input data includes: the residual of the reference image; or,
  • residuals of multiple reference images, predictions of the multiple reference images, and the prediction of the current image.
  • the prediction of the reference image is obtained by prediction based on the reconstructed sample values of images decoded before the reference image, or by up-sampling the reconstructed sample values of a low-resolution image of the reference image to obtain a sampled image;
  • the resolution of the sampled image is the same as the resolution of the reference image, and the sampled image is the prediction of the reference image.
  • the low-resolution image of the reference image can be obtained by down-sampling the reference image.
  • S1202a. Perform probability estimation, via the uniform pooling layer and the Softmax layer, on the feature information extracted by the convolutional network, and output the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • the above-mentioned neural network includes a convolutional network, a uniform pooling layer, a first convolutional layer, a splicing layer, and a Softmax layer; the convolutional network includes multiple convolutional layers and ReLU layers. Performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples of the current image includes:
  • S1203a. Extract the feature information of the residual of the reference image and of the super-prior information through the convolutional network.
  • the convolutional neural network extracts the feature information of the residual of the reference image and the feature information of the super-prior information respectively.
  • S1204a. Perform probability estimation, via the uniform pooling layer, the first convolutional layer, the splicing layer, and the Softmax layer, on the feature information extracted by the convolutional network, and output the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • super-prior information is obtained by:
  • the first entropy-decoded data may be part of the entropy-decoded data decoded from the code stream; after all, the code stream carries not only residual data, and other entropy-decoded data can also be entropy-decoded from the code stream, so the qualifier "first" is added.
  • the above-mentioned first entropy decoding data includes the residual of the current image, or the feature map of the residual of the current image, or the residual of the current image after transformation and quantization.
  • the method of this implementation further includes: obtaining the residual of the current image by passing the feature map of the residual of the current image through a decoding network.
  • the function of the above-mentioned decoding network is the inverse of that of the encoding network, so for the structure of the decoding network, reference may be made to the structure of the encoding network shown in FIG. 6D, FIG. 6E, FIG. 6F, and FIG. 6G, which is not described again here.
  • the method of this implementation further includes:
  • the inverse transform may be inverse DCT, inverse DST, inverse DWT or other inverse transforms; inverse quantization may be uniform inverse quantization, non-uniform inverse quantization or other inverse quantization.
  • arithmetic entropy decoding and arithmetic entropy encoding are a pair of mutually inverse processes.
  • for the arithmetic entropy decoding process, reference may be made to the above entropy encoding process, which is not described here.
  • the residual of the current image is correlated with the residual of the reference image, so data including at least the residual of the reference image is input into the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image;
  • the first entropy-decoded data representing the residual of the current image is decoded by using this probability distribution, which is beneficial to improving the accuracy of decoding.
  • super-prior information is introduced to obtain a higher-precision probability distribution of the residual values of the multiple samples included in the residual of the current image, thereby improving the decoding accuracy.
  • FIG. 13 is a schematic diagram of an encoding and decoding process provided by an embodiment of the present application.
  • the encoding device obtains the original residual of the current image, where the original residual of the current image is a plurality of differences between the original values of the multiple samples included in the current image and the predicted values of the multiple samples;
  • the feature map Y of the residual is processed by a rounding module to obtain an integerized feature map Y'; based on the probability distribution of the residual values of the multiple samples included in the residual of the current image, the encoding device uses an arithmetic encoder to perform arithmetic entropy encoding on the integerized feature map Y' to obtain the code stream of the current image; before this, the encoding device inputs the reconstructed residual of the reference image into the neural network for probability estimation to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image;
  • it should be pointed out that the residual of the current image mentioned in this embodiment is the original residual of the current image described above; the reconstructed residual of the reference image is the multiple differences between the reconstructed values of the multiple samples included in the reference image and the predicted values of those samples.
  • the encoding device inputs the reconstructed residual of the reference image and the super-prior information into the neural network for probability estimation (that is, the above-mentioned neural network) to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image; before that, the encoding device uses the super-prior encoding network to perform super-prior feature extraction on the feature map of the residual of the current image to obtain the feature vector Z of the super-prior information;
  • the encoding device rounds the feature vector Z of the super-prior information to obtain the integerized feature vector Z'; the encoding device then uses the super-prior decoding network to process the integerized feature vector of the super-prior information to obtain the super-prior information of the current image.
  • the encoding device uses an arithmetic encoder to encode the integerized feature vector of the super-prior information to obtain a code stream of the super-prior information.
  • the decoding device inputs the reconstructed residual of the reference image into a neural network for probability estimation (that is, the above-mentioned neural network) for processing, and obtains the probability distribution of residual values of multiple samples included in the reconstructed residual of the current image;
  • an arithmetic decoder is used to perform arithmetic entropy decoding on the code stream of the current image to obtain the feature map of the reconstructed residual of the current image;
  • the feature map of the reconstructed residual of the current image is processed by the decoding network to obtain the reconstructed residual of the current image.
  • the decoding device uses an arithmetic decoder to decode the code stream of the super-prior information to obtain the integerized feature vector of the super-prior information, and uses the super-prior decoding network to process the feature vector to obtain the super-prior information.
  • the super-prior information and the reconstructed residual of the reference frame image are input into the neural network for probability estimation for processing, and the probability distribution of residual values of multiple samples included in the residual of the current image is obtained.
  • FIG. 14 is a schematic diagram of another encoding and decoding process provided by an embodiment of the present application.
  • the encoding device obtains the original residual of the current image, where the original residual of the current image is a plurality of differences between the original values of the multiple samples included in the current image and the predicted values of the multiple samples;
  • the feature map of the residual of the current image is processed by the rounding module to obtain an integerized feature map Y'; based on the probability distribution of the residual values of the multiple samples included in the residual of the current image, the encoding device uses an arithmetic encoder to perform arithmetic entropy encoding on the integerized feature map Y' to obtain the code stream of the current image.
  • the encoding device uses a recurrent neural network to obtain hidden information Hn based on the reconstruction residuals of multiple reference images, and the specific process can refer to the above implementation.
  • the encoding device inputs the hidden information Hn into the neural network for probability estimation (that is, the above-mentioned neural network) to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • the residual of the current image mentioned in this embodiment is also the original residual of the above-mentioned current image
  • the reconstructed residual of the reference image is the multiple differences between the reconstructed values of the multiple samples included in the reference image and the predicted values of those samples.
  • the encoding device inputs the hidden information Hn and the super-prior information into the neural network for probability estimation to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image; before this, the encoding device uses the super-prior encoding network to perform super-prior feature extraction on the feature map of the residual of the current image to obtain the feature vector Z of the super-prior information; the feature vector Z is rounded to obtain the integerized feature vector Z' of the super-prior information; the encoding device then uses the super-prior decoding network to process the integerized feature vector to obtain the super-prior information of the current image.
  • the encoding device uses an arithmetic encoder to encode the integerized feature vector of the super-prior information to obtain a code stream of the super-prior information.
  • the decoding device uses a cyclic neural network to obtain hidden information Hn based on the reconstruction residuals of multiple reference images.
  • the hidden information Hn is input into the neural network for probability estimation to obtain the probability distribution of the residual values of the multiple samples included in the reconstructed residual of the current image;
  • after obtaining the code stream of the current image, the decoding device, based on this probability distribution, uses the arithmetic decoder to perform arithmetic entropy decoding on the code stream of the current image to obtain the feature map of the reconstructed residual of the current image, and uses the decoding network to process the feature map to obtain the reconstructed residual of the current image.
  • the decoding device uses an arithmetic decoder to decode the code stream of the super-prior information to obtain the integerized feature vector of the super-prior information, and uses the super-prior decoding network to process the integerized feature vector to obtain the super-prior information;
  • the super-prior information and the hidden information Hn are input into the neural network for probability estimation for processing to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image.
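  • A minimal sketch of deriving the hidden information Hn from the reconstructed residuals of several reference images with a recurrent cell; using `nn.GRUCell` over flattened residuals, and all sizes, are assumptions for illustration.

```python
import torch
import torch.nn as nn

H, W = 16, 16
cell = nn.GRUCell(input_size=H * W, hidden_size=128)

hn = torch.zeros(1, 128)                         # initial hidden state
for ref_residual in [torch.randn(1, H, W) for _ in range(3)]:
    hn = cell(ref_residual.reshape(1, -1), hn)   # fold in each reference image
# hn plays the role of Hn and is fed to the probability-estimation network
```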
  • FIG. 15 is a schematic diagram of another encoding and decoding process provided by an embodiment of the present application.
  • the encoding apparatus obtains the original residual of the current image, where the original residual of the current image is a plurality of differences between the original values of the multiple samples included in the current image and the predicted values of the multiple samples,
  • the feature map Y of the original residual of the current image is processed by the rounding module to obtain the feature map Y' after integerization;
  • based on the probability distribution of the residual values of the multiple samples included in the residual of the current image, the encoder performs arithmetic entropy encoding on the integerized feature map Y' to obtain the code stream of the current image;
  • before this, the encoding device inputs the reconstructed residual of the reference image, the prediction of the reference image, and the prediction of the current image into the neural network for probability estimation to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image; it should be pointed out that the residual of the current image mentioned in this embodiment is the original residual of the current image described above.
  • the encoding device inputs the reconstructed residual of the reference image, the prediction of the reference image, the prediction of the current image, and the super-prior information into the neural network for probability estimation to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image;
  • before that, the encoding device uses the super-prior encoding network to perform super-prior feature extraction on the feature map of the residual to obtain the feature vector Z of the super-prior information; Z is rounded to obtain the integerized feature vector Z' of the super-prior information; the encoding device then uses the super-prior decoding network to process the integerized feature vector to obtain the super-prior information of the current image.
  • the encoding device uses an arithmetic encoder to encode the integerized feature vector of the super-prior information to obtain a code stream of the super-prior information.
  • the decoding device inputs the reconstructed residual of the reference image, the prediction of the reference image, and the prediction of the current image into a neural network for probability estimation for processing, and obtains a probability distribution of residual values of a plurality of samples included in the reconstructed residual of the current image
  • after obtaining the code stream of the current image, the decoding device uses the arithmetic decoder, based on the probability distribution of the residual values of the multiple samples included in the reconstructed residual of the current image, to perform arithmetic entropy decoding on the code stream of the current image to obtain the feature map of the reconstructed residual of the current image;
  • the decoding network is then used to process the feature map of the reconstructed residual of the current image to obtain the reconstructed residual of the current image.
  • the decoding device uses an arithmetic decoder to decode the code stream of the super-prior information to obtain the integerized feature vector of the super-prior information, and uses the super-prior decoding network to process the integerized feature vector to obtain the super-prior information;
  • the super-prior information, the reconstructed residual of the reference image, the prediction of the reference image, and the prediction of the current image are input into the neural network for probability estimation for processing to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • FIG. 16 is a schematic diagram of another encoding and decoding process provided by an embodiment of the present application.
  • the encoding device obtains the original residual of the current image, where the original residual of the current image is a plurality of differences between the original values of the multiple samples included in the current image and the predicted values of the multiple samples,
  • the feature map Y of the original residual of the current image is processed by the rounding module to obtain the feature map Y' after integerization;
  • based on the probability distribution of the residual values of the multiple samples included in the residual of the current image, the encoder performs arithmetic entropy encoding on the integerized feature map Y' to obtain the code stream of the current image; before this, the encoding device uses a recurrent neural network to obtain the hidden information Hn based on the reconstructed residuals of multiple reference images and the predictions of the multiple reference images;
  • for the specific process of obtaining the hidden information Hn, reference may be made to the relevant descriptions of the above embodiments, which are not repeated here;
  • the encoding device inputs the hidden information Hn, the prediction of the current image and the super-prior information into a neural network for probability estimation, and obtains the probability distribution of residual values of multiple samples included in the residual of the current image;
  • the encoding device uses the super-prior encoding network to perform super-prior feature extraction on the feature map of the residual of the current image to obtain the feature vector Z of the super-prior information; the feature vector Z is rounded to obtain the integerized feature vector Z' of the super-prior information; the encoding device then uses the super-prior decoding network to process the integerized feature vector to obtain the super-prior information of the current image.
  • the encoding device uses an arithmetic encoder to encode the integerized feature vector of the super-prior information to obtain a code stream of the super-prior information.
  • the decoding device uses a cyclic neural network to obtain hidden information Hn based on the reconstruction residuals of multiple reference images and the prediction of multiple reference images.
  • the hidden information Hn and the prediction of the current image are input into the neural network for probability estimation for processing to obtain the probability distribution of the residual values of the multiple samples included in the reconstructed residual of the current image;
  • after obtaining the code stream of the current image, the decoding device, based on the probability distribution of the residual values of the multiple samples included in the reconstructed residual of the current image, uses an arithmetic decoder to perform arithmetic entropy decoding on the code stream of the current image to obtain the feature map of the reconstructed residual of the current image;
  • the decoding network processes the feature map of the reconstructed residual to obtain the reconstructed residual of the current image.
  • the decoding device uses an arithmetic decoder to decode the code stream of the super-prior information to obtain the integerized feature vector of the super-prior information, and uses the super-prior decoding network to process the integerized feature vector to obtain the super-prior information;
  • the super-prior information, the hidden information Hn, and the prediction of the current image are input into the neural network for probability estimation for processing to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • FIG. 17 is a schematic diagram of an encoding and decoding process provided by an embodiment of the present application.
  • the encoding device obtains the original residual of the current image, where the original residual of the current image is a plurality of differences between the original values of the multiple samples included in the current image and the predicted values of the multiple samples;
  • the encoding device quantizes the original residual of the current image to obtain the quantized original residual of the current image;
  • based on the probability distribution of the residual values of the multiple samples included in the residual of the current image, an arithmetic encoder is used to perform arithmetic entropy encoding on the quantized residual of the current image to obtain the code stream of the current image; it should be pointed out that, in this embodiment, the residual of the current image is the original residual of the current image described above.
  • the input data of the neural network for probability estimation includes the following situations:
  • the encoding device inputs the reconstruction residual of the reference image into the neural network for probability estimation, and obtains the probability distribution of residual values of a plurality of samples included in the residual of the current image;
  • the encoding device uses a cyclic neural network to obtain hidden information Hn based on the reconstruction residuals of multiple reference images.
  • the encoding device inputs the hidden information Hn into the neural network for probability estimation to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image;
  • the encoding device inputs the reconstructed residual of the reference image, the prediction of the reference image, and the prediction of the current image into the neural network for probability estimation to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image;
  • the encoding device uses a cyclic neural network to obtain hidden information Hn based on the reconstruction residuals of multiple reference images and the prediction of multiple reference images.
  • the encoding device inputs the hidden information Hn and the prediction of the current image into a neural network for probability estimation, and obtains the probability distribution of residual values of a plurality of samples included in the residual of the current image.
  • after obtaining the code stream of the current image, the decoding device uses an arithmetic decoder to perform arithmetic entropy decoding on the code stream of the current image based on the probability distribution of the residual values of the current image to obtain the quantized residual of the current image;
  • the quantized residual of the current image is inverse-quantized to obtain the reconstructed residual of the current image; wherein, as shown in FIG. 13, the decoding device inputs the reconstructed residual of the reference image into the neural network for probability estimation for processing to obtain the probability distribution of the residual values of the current image; or, as shown in FIG. 14, the decoding device uses a recurrent neural network to obtain the hidden information Hn based on the reconstructed residuals of multiple reference images.
  • the decoding device uses a cyclic neural network to obtain the hidden information Hn based on the reconstruction residuals of multiple reference images and the prediction of multiple reference images.
  • the specific process can refer to the relevant description of the above-mentioned embodiment, which will not be described here;
  • the hidden information Hn and the prediction of the current image are input into the neural network for probability estimation for processing to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image.
  • FIG. 18 is a schematic diagram of an encoding and decoding process provided by an embodiment of the present application.
  • the encoding device obtains the original residual of the current image, where the original residual of the current image is a plurality of differences between the original values of the multiple samples included in the current image and the predicted values of the multiple samples;
  • the encoding device transforms and quantizes the original residual of the current image; based on the probability distribution of the residual values of the multiple samples included in the residual of the current image, the encoding device uses an arithmetic encoder to perform arithmetic entropy encoding on the transformed and quantized original residual of the current image to obtain the code stream of the current image; it should be pointed out that the residual of the current image mentioned in this embodiment is the original residual of the current image described above;
  • the input data of the neural network for probability estimation includes the following situations:
  • the encoding device inputs the reconstruction residual of the reference image into the neural network for probability estimation, and obtains the probability distribution of residual values of a plurality of samples included in the residual of the current image;
  • the encoding device uses a cyclic neural network to obtain hidden information Hn based on the reconstruction residuals of multiple reference images.
  • the encoding device inputs the hidden information Hn into the neural network for probability estimation to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image;
  • the encoding device inputs the reconstructed residual of the reference image, the prediction of the reference image, and the prediction of the current image into the neural network for probability estimation to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image;
  • the encoding device uses a cyclic neural network to obtain hidden information Hn based on the reconstruction residuals of multiple reference images and the prediction of multiple reference images.
  • the encoding device inputs the hidden information Hn and the prediction of the current image into a neural network for probability estimation, and obtains the probability distribution of residual values of a plurality of samples included in the residual of the current image.
  • after obtaining the code stream of the current image, the decoding device uses an arithmetic decoder to perform arithmetic entropy decoding on the code stream of the current image based on the probability distribution of the residual values of the current image to obtain the transformed and quantized residual of the current image; inverse quantization and inverse transformation are performed on the transformed and quantized residual of the current image to obtain the reconstructed residual of the current image;
  • wherein, as shown in FIG. 13, the decoding device inputs the reconstructed residual of the reference image into the neural network for probability estimation for processing to obtain the probability distribution of the residual values of the current image; or, as shown in FIG. 14, the decoding device uses a recurrent neural network to obtain the hidden information Hn based on the reconstructed residuals of multiple reference images.
  • the hidden information Hn is input into the neural network for probability estimation for processing to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image; or, as shown in FIG. 15,
  • the decoding device inputs the reconstructed residual of the reference image, the prediction of the reference image, and the prediction of the current image into the neural network for probability estimation for processing to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image; or, as shown in FIG. 16, the decoding device uses a recurrent neural network to obtain the hidden information Hn based on the reconstructed residuals of multiple reference images and the predictions of the multiple reference images.
  • the hidden information Hn and the prediction of the current image are input into the neural network for probability estimation for processing, and the probability distribution of residual values of multiple samples included in the residual of the current image is obtained.
  • encoding network and decoding network may be implemented by a neural network, such as a convolutional neural network.
  • Computer-readable media may include computer-readable storage media, which correspond to tangible media such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another (e.g., according to a communication protocol).
  • a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium, such as a signal or carrier wave.
  • Data storage media can be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this application.
  • the computer program product may comprise a computer-readable medium.
  • such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage devices, magnetic disk storage devices or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • any connection is properly termed a computer-readable medium.
  • for example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber-optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber-optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
  • computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media.
  • magnetic disks and optical discs, as used herein, include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), and Blu-ray discs, where disks typically reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • the instructions may be executed by one or more processors, such as digital signal processors (DSPs), application-specific integrated circuits (ASICs), or field programmable logic arrays (FPGAs).
  • the term "processor,” as used herein may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein.
  • the functions described by the various illustrative logical blocks, modules, and steps described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec.
  • the techniques may be fully implemented in one or more circuits or logic elements.
  • the techniques of this application may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (eg, a chip set).
  • Various components, modules, or units are described herein to emphasize functional aspects of devices for performing the disclosed techniques, but they do not necessarily need to be realized by different hardware units. Indeed, as described above, the various units may be combined in a codec hardware unit in conjunction with suitable software and/or firmware, or provided by a collection of interoperating hardware units (including one or more processors as described above).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

This application provides a video image encoding and decoding method and related devices, relating to the technical field of artificial intelligence (AI)-based video or image compression, and in particular to the field of neural-network-based video compression. The decoding method includes: performing probability estimation processing on input data through a neural network to obtain a probability distribution of residual values of multiple samples included in a residual of a current image, the input data including at least a residual of a reference image, the reference image being an image decoded before the current image; performing arithmetic entropy decoding on a code stream based on the probability distribution of the residual values of the multiple samples included in the residual of the current image to obtain first entropy-decoded data, the first entropy-decoded data representing the residual of the current image; and obtaining reconstructed sample values of the current image based on the residual of the current image. This application can improve encoding and decoding efficiency and accuracy.

Description

Video image encoding and decoding method and related devices
This application claims priority to Chinese Patent Application No. 202110289642.8, entitled "Video image encoding and decoding method and related devices", filed with the China National Intellectual Property Administration on March 17, 2021, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of video encoding and decoding, and in particular to a video image encoding and decoding method and related devices.
Background
Video coding (video encoding and decoding) is widely used in digital video applications, such as broadcast digital television, video transmission over the Internet and mobile networks, real-time conversational applications such as video chat and video conferencing, DVDs and Blu-ray discs, video content acquisition and editing systems, and security applications of camcorders.
Even a short film requires a large amount of video data to describe, which can cause difficulties when the data is to be sent or otherwise transmitted over a network with limited bandwidth capacity. Therefore, video data is usually compressed before being transmitted over modern telecommunication networks. Since memory resources may be limited, the size of the video may also be an issue when the video is stored on a storage device. Video compression devices usually use software and/or hardware at the source side to encode the video data before transmission or storage, thereby reducing the amount of data required to represent digital video images. The compressed data is then received at the destination side by a video decompression device. With limited network resources and an ever-growing demand for higher video quality, improved compression and decompression techniques are needed that can increase the compression ratio with little impact on image quality. In recent years, applying deep learning to the field of image and video coding has gradually become a trend. Deep neural networks provide powerful nonlinear processing capabilities. In current mainstream neural-network-based image or residual coding methods, a neural network (for example, an AutoEncoder) is generally used to transform the image into a feature domain, and the feature values in the feature domain are then quantized and entropy-encoded (for example, by arithmetic entropy coding). The decoding side performs entropy decoding and inverse quantization on the received code stream to obtain the values in the feature domain, and then restores the reconstructed image through a decoding network.
Summary
Embodiments of this application provide a video image encoding and decoding method and related devices, which can improve encoding and decoding efficiency and accuracy.
The above and other objects are achieved by the subject matter of the independent claims. Other implementations are apparent from the dependent claims, the detailed description, and the accompanying drawings.
Specific embodiments are outlined in the attached independent claims, and other embodiments are outlined in the dependent claims.
According to a first aspect, this application relates to a video image decoding method. The method is performed by a decoding apparatus and includes:
receiving a code stream of a current image; performing probability estimation processing on input data through a trained neural network (which may also be called a probability estimation network) to obtain a probability distribution of residual values of multiple samples included in a residual of the current image, the input data including at least a residual of a reference image, the reference image being an image decoded before the current image; performing arithmetic entropy decoding on the code stream based on the probability distribution of the residual values of the multiple samples included in the residual of the current image to obtain first entropy-decoded data, the first entropy-decoded data representing the residual of the current image; and obtaining reconstructed sample values of the current image based on the residual of the current image.
There is no fixed execution order between the step of receiving the code stream of the current image and the step of performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image; either step may be performed first, or both may be performed simultaneously.
The reference image includes multiple samples; on the decoding side, the residual of the reference image refers to the multiple differences between the reconstructed values of the multiple samples included in the reference image and the predicted values of those samples. The current image includes multiple samples, and the residual of the current image refers to the multiple differences between the reconstructed values of the multiple samples included in the current image and the predicted values of those samples; that is, on the decoding side, both the residual of the current image and the residual of the reference image refer to reconstructed residuals.
The first entropy-decoded data is part of the entropy-decoded data decoded from the code stream and is used to represent the residual of the current image. It should be understood that other entropy-decoded data may also be decoded from the code stream, which is not limited here.
When decoding the code stream, the reference image is decoded first, and the current image is decoded afterwards. For example, during decoding of the code stream, image 1 is decoded first and image 2 is decoded later; image 1 is the reference image of image 2.
Optionally, the reference image and the current image may be consecutive frames or non-consecutive frames; for example, the reference image is the (n-1)-th frame and the current image is the n-th frame, or the reference image is the (n-2)-th frame and the current image is the n-th frame. When the reference image and the current image are non-consecutive frames, the difference between the timestamp of the current image and the timestamp of the reference image is less than a preset threshold, so as to ensure that the residual of the reference image is correlated with the residual of the current image.
Probability estimation processing is performed on the residual values of the multiple samples included in the residual of the reference image through the trained neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image. The computing power of the neural network is used to fully exploit the correlation between the residual of the current image and the residual of the reference image, so that a high-precision probability distribution of the residual values of the multiple samples included in the residual of the current image can be obtained; then, when this probability distribution is used to perform arithmetic entropy decoding on the code stream, decoding efficiency and decoding accuracy can be improved.
In a possible design, the neural network is obtained through the following training process, and the method of this embodiment further includes:

obtaining training input data, the training input data including images decoded before a sample image, the sample image being a known decoded image; performing probability estimation processing on the training input data through the neural network to obtain an estimated probability distribution of the residual values of the multiple samples included in the residual of the sample image; obtaining a loss value based on the known probability distribution and the estimated probability distribution of the residual values of the multiple samples included in the residual of the sample image; and when the loss value converges to a first threshold, or the current number of training iterations of the neural network is greater than or equal to a second threshold, determining that the current neural network is the neural network obtained when training is completed.
In a possible design, the training input data includes:

residual values of multiple samples included in the residual of a first reference image; or

residual values of multiple samples included in the residual of the first reference image and residual values of multiple samples included in the residual of a second reference image; or

predicted values of multiple samples included in the prediction of the sample image, residual values of multiple samples included in the residual of the first reference image, and predicted values of multiple samples included in the prediction of the first reference image; or

predicted values of multiple samples included in the prediction of the sample image, residual values of multiple samples included in the residual of the first reference image, predicted values of multiple samples included in the prediction of the first reference image, predicted values of multiple samples included in the prediction of the second reference image, and residual values of multiple samples included in the residual of the second reference image;

where the first reference image and the second reference image are images decoded before the sample image.
When training the above neural network, data such as the residuals of multiple reference images (for example, the first reference image and the second reference image) are used as training samples, so that the trained neural network can estimate a more accurate probability distribution; when this probability distribution is used for decoding, decoding efficiency and decoding accuracy can be improved.
In a possible design, the training of the above neural network may be performed online or offline. When training is performed online, the decoding apparatus can obtain a high-precision neural network in time, and thus a high-precision probability distribution; then, when the high-precision probability distribution is used to perform arithmetic entropy decoding on the code stream, decoding efficiency and decoding accuracy can be improved.
In a possible design, the above probability distribution represents the probability distribution of the multiple differences between the reconstructed values of the multiple samples included in the current image and the predicted values of those samples; or represents the probability distribution of the residual value of each sample among the multiple samples of the current image; or represents the probability distribution of the residual values of the multiple samples of the current image.
In a possible design, the first entropy-decoded data includes the residual of the current image, or a feature map of the residual of the current image, or the transformed and quantized residual of the current image.
In a possible design, the first entropy-decoded data is the feature map of the residual of the current image, and the decoding method of this application further includes:

passing the feature map of the residual of the current image through a decoding network to obtain the residual of the current image. By transforming the residual of the current image into the feature domain, the feature map of the residual of the current image is obtained first during decoding, and the feature map is then passed through the decoding network to obtain the residual of the current image, which helps improve decoding efficiency.
In a possible design, the first entropy-decoded data is the transformed and quantized residual of the current image, and the decoding method of this application further includes:

performing inverse quantization and inverse transformation on the transformed and quantized residual of the current image to obtain the residual of the current image.
One possible inverse transform method is the inverse discrete cosine transform (DCT); in addition, an inverse discrete sine transform (DST) or an inverse discrete wavelet transform (DWT), etc., may also be performed.
In a possible design, the input data further includes super-prior information, which is used to characterize deeper-level features of the residual of the current image and can be obtained by performing feature extraction on the feature map of the residual of the current image using a convolutional neural network. The above neural network includes a convolutional network, a uniform pooling layer, a first convolutional layer, a splicing layer, and a normalized exponential function (Softmax) layer, and the convolutional network includes multiple convolutional layers and activation (ReLU) layers, where performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image includes:

extracting feature information from the residual of the reference image and the super-prior information through the convolutional network; and performing probability estimation on the feature information extracted by the convolutional network through the uniform pooling layer, the first convolutional layer, the splicing layer, and the Softmax layer, and outputting the probability distribution of the residual values of the multiple samples included in the residual of the current image.
When the above probability distribution is estimated through the neural network, since super-prior information is introduced, the first convolutional layer and the splicing layer are introduced in order to jointly extract features from the super-prior information and other information (such as the residual of the reference image); the feature information of the residual of the reference image and of the super-prior information is extracted through the convolutional network, and probability estimation is then performed on the extracted feature information through the uniform pooling layer, the first convolutional layer, the splicing layer, and the Softmax layer, so that a more accurate probability distribution can be obtained; when this probability distribution is used to perform arithmetic entropy decoding on the code stream, decoding efficiency and decoding accuracy are improved.
In a possible design, the input data further includes the prediction of the reference image and the prediction of the current image; the above neural network includes a convolutional network, a uniform pooling layer, and a Softmax layer, and the convolutional network includes multiple convolutional layers and activation (ReLU) layers, where performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image includes:

extracting feature information from the residual of the reference image, the prediction of the reference image, and the prediction of the current image through the convolutional network; and performing probability estimation on the feature information extracted by the convolutional network through the uniform pooling layer and the Softmax layer, and outputting the probability distribution of the residual values of the multiple samples included in the residual of the current image.
When the above probability distribution is estimated through the neural network, the prediction of the reference image and the prediction of the current image are introduced; feature information is extracted from the residual of the reference image, the prediction of the reference image, and the prediction of the current image through the convolutional network, and probability estimation is performed on the extracted feature information through the uniform pooling layer and the Softmax layer, so that a more accurate probability distribution can be obtained; when this probability distribution is used to perform arithmetic entropy decoding on the code stream, decoding efficiency and decoding accuracy are improved.
In a possible design, the input data includes residuals of multiple reference images, the multiple reference images being multiple images decoded before the current image; the neural network includes a convolutional network, a uniform pooling layer, and a Softmax layer; the convolutional network includes multiple convolutional layers and ReLU layers, where:

performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image includes:

extracting feature information of the residuals of the multiple reference images through the convolutional network; and performing probability estimation on the feature information extracted by the convolutional network through the uniform pooling layer and the Softmax layer, and outputting the probability distribution of the residual values of the multiple samples included in the residual of the current image.
When the above probability distribution is estimated through the neural network, the residuals of multiple reference images are introduced; feature information of the residuals of the multiple reference images is extracted through the convolutional network, and probability estimation is performed on this feature information through the uniform pooling layer and the Softmax layer, so that a more accurate probability distribution can be obtained; when this probability distribution is used to perform arithmetic entropy decoding on the code stream, decoding efficiency and decoding accuracy are improved.
In a possible design, the input data includes residuals of multiple reference images, predictions of the multiple reference images, and the prediction of the current image, the multiple reference images being multiple images decoded before the current image; the neural network includes a convolutional network, a uniform pooling layer, and a Softmax layer; the convolutional network includes multiple convolutional layers and ReLU layers, where performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual values of the multiple samples included in the residual of the current image includes:

extracting feature information of the residuals of the multiple reference images, the predictions of the multiple reference images, and the prediction of the current image through the convolutional network; and performing probability estimation on the feature information extracted by the convolutional network through the uniform pooling layer and the Softmax layer, and outputting the probability distribution of the residual values of the multiple samples included in the residual of the current image.
When the above probability distribution is estimated through the neural network, the residuals and predictions of multiple reference images and the prediction of the current image are introduced; feature information of the residuals of the multiple reference images, the predictions of the multiple reference images, and the prediction of the current image is extracted through the convolutional network, and probability estimation is performed on this feature information through the uniform pooling layer and the Softmax layer, so that a more accurate probability distribution can be obtained; when this probability distribution is used to perform arithmetic entropy decoding on the code stream, decoding efficiency and decoding accuracy are improved.
The prediction of the current image is obtained by predicting the current image using the reconstructed sample values of the reference image, or by up-sampling the reconstructed sample values of a low-resolution image of the current image, where the resolution of the up-sampled image is the same as the resolution of the current image. The low-resolution image of the current image can be obtained by down-sampling the current image.
Similarly, the prediction signal of the reference image is obtained by predicting the reference image using the reconstructed sample values of images decoded before the reference image, or by up-sampling the reconstructed sample values of a low-resolution image of the reference image, where the resolution of the image obtained by up-sampling the reconstructed low-resolution image of the reference image is the same as the resolution of the reference image. The low-resolution image of the reference image can be obtained by down-sampling the reference image.
According to a second aspect, this application relates to a video image encoding method. The method is performed by an encoding apparatus and includes:
obtaining a residual of a current image, the current image including multiple samples, the residual of the current image being the multiple differences between the original values of the multiple samples and the predicted values of the multiple samples; performing probability estimation processing on input data through a trained neural network (which may also be called a probability estimation network) to obtain a probability distribution of residual values of the multiple samples included in the residual of the current image, the input data including at least a residual of a reference image, the reference image being an image encoded before the current image; and performing arithmetic entropy encoding on first entropy-encoded data based on the probability distribution of the residual values of the multiple samples included in the residual of the current image to obtain a code stream of the current image, the first entropy-encoded data representing the residual of the current image.
There is no fixed execution order between the step of obtaining the residual of the current image and the step of performing probability estimation processing on the input data through the neural network to obtain the probability distribution of the residual of the current image; either step may be performed first, or both may be performed simultaneously.
The reference image includes multiple samples; on the encoding side, the residual of the reference image refers to the multiple differences between the reconstructed values of the multiple samples included in the reference image and the predicted values of those samples; that is, the residual of the reference image on the encoding side refers to a reconstructed residual. On the encoding side, the current image includes multiple samples, and the residual of the current image refers to the multiple differences between the original values of the multiple samples included in the current image and the predicted values of those samples; that is, the residual of the current image on the encoding side refers to an original residual.
The prediction of the current image includes predicted values of multiple samples. Optionally, the prediction of the current image may be obtained by predicting the current image using the reconstructed sample values of the reference image, or by up-sampling the reconstructed sample values of a low-resolution image of the current image, where the resolution of the up-sampled image is the same as the resolution of the current image. The low-resolution image of the current image can be obtained by down-sampling the current image.
The reference image being an image encoded before the current image means that encoding of the reference image has been completed before the current image is encoded. For example, during encoding, image 1 is encoded first and image 2 is encoded later; image 1 is the reference image of image 2.
Optionally, the current image and the reference image may be consecutive frames or non-consecutive frames. When the reference image and the current image are non-consecutive frames, the difference between the timestamp of the current image and the timestamp of the reference image is less than a preset threshold, so as to ensure that the residual of the reference image is correlated with the residual of the current image.
通过训练好的神经网络对参考图像的残差包括的多个样本的残差值进行概率估计处理,得到当前图像的残差包括的多个样本的残差值的概率分布,利用神经网络的计算能力充分挖掘当前图像的残差与参考图像的残差的相关性,可以得到高精度的当前图像的残差包括的多个样本的残差值的概率分布,进而在利用当前图像的残差包括的多个样本的残差值的概率分布对第一熵编码数据进行算术熵编码时,可以提高编码效率和编码精度。
在一种可能的设计中,上述神经网络为经训练得到的神经网络,本实施例的方法还包括:
获取训练的输入数据,训练的输入数据包括在编码样本图像之前的已编码图像,样本图像为已知的已编码图像;通过神经网络对训练的输入数据进行概率估计处理,得到样本图像的残差包括的多个样本的残差值的估计的概率分布;基于样本图像的残差包括的多个样本的残差值的已知概率分布和估计的概率分布,获取损失值;在损失值收敛于第一阈值,或神经网络当前的训练次数大于或等于第二阈值时,确定当前的神经网络为神经网络训练完成时对应的神经网络。
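以下为上述训练停止条件(损失收敛于第一阈值,或训练次数达到第二阈值)的一个示意性训练循环(Python/PyTorch);损失的交叉熵形式、数据的组织方式均为示例性假设:

```python
import torch

def train(net, loader, optimizer, thr_loss, max_steps):
    """示意性训练:loader产出(输入数据, 已知概率分布known_prob)。"""
    step = 0
    for inputs, known_prob in loader:
        est_prob = net(inputs)                       # 估计的概率分布
        # 以交叉熵衡量已知概率分布与估计的概率分布之间的差异(示例性损失)
        loss = -(known_prob * torch.log(est_prob + 1e-9)).sum(dim=1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        # 损失收敛于第一阈值,或训练次数达到第二阈值时,停止训练
        if loss.item() <= thr_loss or step >= max_steps:
            break
    return net
```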
在一个可能的设计中,训练的输入数据包括:
第一参考图像的残差包括的多个样本的残差值,或者;
第一参考图像的残差包括的多个样本的残差值和第二参考图像的残差包括的多个样本的残差值,或者;
样本图像的预测包括的多个样本的预测值、第一参考图像的残差包括的多个样本的残差值和第一参考图像的预测包括的多个样本的预测值;或者;
样本图像的预测包括的多个样本的预测值、第一参考图像的残差包括的多个样本的残差值、第一参考图像的预测包括的多个样本的预测值、第二参考图像的预测包括的多个样本的预测值和第二参考图像的残差包括的多个样本的残差值;
其中,第一参考图像和第二参考图像为在编码样本图像之前的已编码图像。
在训练上述神经网络时,利用多个参考图像(第一参考图像和第二参考图像)的残差等数据作为训练样本,使得训练得到的神经网络能够估计得到更加精确的概率分布,在利用该概率分布对第一熵编码数据进行算术熵编码时,有利于提高编码效率和编码精度。
在一个可能的设计中,上述神经网络的训练可以是在线进行的,也可以是离线进行的,其中,通过在线方式进行训练时,可以使得编码装置及时得到高精度的神经网络,进而得到高精度的概率分布,在利用该概率分布对第一熵编码数据进行算术熵编码时,有利于提高编码效率和编码精度。
在一个可能的设计中,上述概率分布表示当前图像的残差包括的多个样本的原始值与多个样本的预测值之间的多个差值的概率分布;或者,表示当前图像的多个样本中各个样本点的残差值的概率分布;或者,表示当前图像的多个样本的残差值对应的概率分布。
在一个可能的设计中,第一熵编码数据包括当前图像的残差,或者当前图像的残差的特征图,或者经变换和量化后的当前图像的残差。
在一个可能的设计中,第一熵编码数据为当前图像的残差的特征图,本申请的编码方法还包括:将当前图像的残差经过编码网络获得当前图像的残差的特征图。通过将当前图像的残差变换到特征域,有利于提高编码效率。
需要指出的是,上述编码网络可以看成一个特征提取网络。
在一个可能的设计中,第一熵编码数据为经变换和量化后的当前图像的残差,本申请的编码方法还包括:对当前图像的残差进行变换和量化,得到经变换和量化后的当前图像的残差。
一种可能的变换方法为DCT,此外还可以进行DST,或者DWT等。量化可以采用均匀量化或者非均匀量化等方法。
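以二维DCT与均匀量化为例,下面给出变换与量化及其逆过程的示意代码(Python/SciPy);量化步长qstep的取值为示例性假设:

```python
import numpy as np
from scipy.fft import dctn, idctn

def transform_and_quantize(resi, qstep=8.0):
    coeff = dctn(resi, norm='ortho')       # 变换:二维DCT
    return np.round(coeff / qstep)         # 均匀量化

def dequantize_and_inverse(q, qstep=8.0):
    coeff = q * qstep                      # 反量化
    return idctn(coeff, norm='ortho')      # 反变换:二维反DCT
```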
在一个可能的设计中,为了提高编码精度,输入数据进一步包括超先验信息,该超先验信息用于表征当前图像的残差的更深层次的特征,可以利用卷积神经网络对当前图像的残差的特征图进行特征提取得到。上述神经网络包括卷积网络、均匀池化层、第一卷积层、拼接层和归一化指数函数(Softmax)层,卷积网络包括多个卷积层、激励(ReLU)层,其中,通过神经网络对输入数据进行概率估计处理,得到当前图像的残差包括的多个样本的残差值的概率分布,包括:
通过卷积网络提取参考图像的残差和超先验信息中的特征信息;通过均匀池化层、第一卷积层、拼接层和Softmax层对卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
在通过神经网络估计上述概率分布时,由于引入超先验信息,为了将超先验信息与其他信息(比如参考图像的残差)进行综合提取特征,因此引入了第一卷积层和拼接层,然后通过卷积网络提取得到参考图像的残差和超先验信息的特征信息,再通过均匀池化层、第一卷积层、拼接层和Softmax层对上述提取的特征信息进行概率估计,可以得到更加精确的概率分布,在利用该概率分布对第一熵编码数据进行算术熵编码时,有利于提高编码效率和编码精度。
在一个可能的设计中,输入数据还包括参考图像的预测和当前图像的预测;上述神经网络包括卷积网络、均匀池化层和Softmax层,卷积网络包括多个卷积层,激励ReLU层,其中:通过神经网络对输入数据进行概率估计处理,得到当前图像的残差包括的多个样本的残差值的概率分布,包括:
通过卷积网络提取参考图像的残差、参考图像的预测和当前图像的预测中的特征信息;通过均匀池化层和Softmax层对卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
在通过神经网络估计上述概率分布时,引入参考图像的预测和当前图像的预测,通过卷积网络提取参考图像的残差、参考图像的预测和当前图像的预测中的特征信息;通过均匀池化层和Softmax层对上述提取的特征信息进行概率估计,可以得到更加精确的概率分布,在利用该概率分布对第一熵编码数据进行算术熵编码时,有利于提高编码效率和编码精度。
在一个可能的设计中,输入数据包括多个参考图像的残差,该多个参考图像为在编码当前图像之前的多个已编码图像;神经网络包括卷积网络、均匀池化层和Softmax层;卷积网络包括多个卷积层、ReLU层,其中:
所述通过神经网络对输入数据进行概率估计处理,得到所述当前图像的残差包括的多个样本的残差值的概率分布,包括:
通过卷积网络提取多个参考图像的残差的特征信息;通过均匀池化层和Softmax层对卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
在通过神经网络估计上述概率分布时,引入多个参考图像的残差,可以得到更加精确的概率分布,在利用该概率分布对第一熵编码数据进行算术熵编码时,有利于提高编码效率和编码精度。
在一个可能的设计中,所述输入数据包括多个参考图像的残差、多个参考图像的预测和当前图像的预测,多个参考图像为在编码当前图像之前的多个已编码图像;神经网络包括卷积网络、均匀池化层和Softmax层;卷积网络包括多个卷积层、ReLU层,其中,通过神经网络对输入数据进行概率估计处理,得到当前图像的残差包括的多个样本的残差值的概率分布,包括:
通过卷积网络提取多个参考图像的残差、多个参考图像的预测和当前图像的预测的特征信息;通过均匀池化层和Softmax层对卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
在通过神经网络估计上述概率分布时,引入多张参考图像的残差和预测,及当前图像的预测,通过卷积网络提取多个参考图像的残差、多个参考图像的预测和当前图像的预测的特征信息;通过均匀池化层和Softmax层对该特征信息进行概率估计,可以得到更加精确的概率分布,在利用该概率分布对第一熵编码数据进行算术熵编码时,有利于提高编码效率和编码精度。
在此需要指出的是,在本申请中,除了编码侧要编码的当前图像的残差是基于图像的样本的原始值与预测值的差值得到,编码侧其他地方的残差和解码侧出现的残差指的都是重建残差。
可选地,在编码侧,参考图像的重建残差可以通过对参考图像的码流进行解码得到,具体解码过程可参见第一方面的解码方法。参考图像的重建残差还可以通过如下方式得到:
对参考图像的残差进行量化,得到量化后的参考图像的残差;再对量化后的参考图像的残差进行反量化,得到参考图像的重建残差。在编码过程中,得到参考图像的残差(记为R)后,对参考图像的残差进行量化,得到量化后的参考图像的残差(记为R1),再对R1进行熵编码,得到码流(记为R3)。因为熵编码是无损的,所以可以直接对R1进行反量化得到参考图像的重建残差,这与对R3进行熵解码、再对解码结果进行反量化得到的结果是一样的。
基于第三方面,本申请涉及视频图像的解码装置,有益效果可以参见第一方面的描述,此处不再赘述。所述解码装置具有实现上述第一方面的方法实例中行为的功能。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的模块。
基于第四方面,本申请涉及视频图像的编码装置,有益效果可以参见第二方面的描述,此处不再赘述。所述编码装置具有实现上述第二方面的方法实例中行为的功能。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的模块。
本申请第一方面所述的方法可由本申请第三方面所述的装置执行。本申请第一方面所述的方法的其它特征和实现方式直接取决于本申请第三方面所述的装置的功能性和实现方式。
本申请第二方面所述的方法可由本申请第四方面所述的装置执行。本申请第二方面所述的方法的其它特征和实现方式直接取决于本申请第四方面所述的装置的功能性和实现方式。
基于第五方面,本申请涉及解码视频流的装置,包含处理器和存储器。所述存储器存储指令,所述指令使得所述处理器执行第一方面所述的方法。
基于第六方面,本申请涉及编码视频流的装置,包含处理器和存储器。所述存储器存储指令,所述指令使得所述处理器执行第二方面所述的方法。
基于第七方面,提供一种计算机可读存储介质,其上储存有指令,当所述指令执行时,使得一个或多个处理器编码视频数据。所述指令使得所述一个或多个处理器执行第一或第二方面或第一或第二方面任意一种可能的实施例中的方法。
基于第八方面,本申请涉及包括程序代码的计算机程序产品,所述程序代码在运行时执行第一或第二方面或第一或第二方面任意一种可能的实施例中的方法。
附图及以下说明中将详细描述一个或多个实施例。其它特征、目的和优点在说明、附图以及权利要求中是显而易见的。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以基于这些附图获得其他的附图。
图1为用于实现本申请实施例的视频译码系统示例的框图;
图2为用于实现本申请实施例的视频译码系统另一示例的框图;
图3为用于实现本申请实施例的视频译码装置的示意性框图;
图4为用于实现本申请实施例的视频译码装置的示意性框图;
图5为原始图像、预测图像和残差的对比示意图;
图6A为本申请实施例提供的一种视频编码器的结构示意图;
图6B为本申请实施例示意出的一种残差信号获取原理示意图;
图6C为本申请实施例示意出的另一种残差信号获取原理示意图;
图6D为本申请实施例示意出的编码网络的结构的示意图;
图6E为本申请实施例示意出的另一种编码网络的结构的示意图;
图6F为本申请实施例示意出的另一种编码网络的结构的示意图;
图6G为本申请实施例示意出的网络层的结构示意图;
图6H为概率分布的解释示意图;
图7A为本申请实施例示意出的用于概率估计的神经网络的结构示意图;
图7B为本申请实施例示意出的另一种用于概率估计的神经网络的结构示意图;
图7C为本申请实施例示意出的一种隐藏信息获取原理示意图;
图7D为本申请实施例示意出的另一种隐藏信息获取原理示意图;
图7E为算术编码区间划分示意图;
图8A为本申请实施例提供的另一种视频编码器的结构示意图;
图8B为本申请实施例提供的另一种视频编码器的结构示意图;
图9A为本申请实施例提供的一种视频解码器的结构示意图;
图9B为本申请实施例示意出的解码网络的功能示意图;
图9C为本申请实施例示意出的解码网络的结构示意图;
图9D为本申请实施例示意出的另一种解码网络的结构示意图;
图9E为本申请实施例示意出的另一种视频解码器的结构示意图;
图9F为本申请实施例示意出的另一种视频解码器的结构示意图;
图10为示出基于本申请一种实施例的编码方法的过程1000的流程图;
图11A为示出基于本申请一种概率分布估计过程的流程图;
图11B为示出基于本申请一种概率分布估计过程的流程图;
图12为示出基于本申请一种实施例的解码方法的过程1200的流程图;
图12A为示出基于本申请一种概率分布估计过程的流程图;
图12B为示出基于本申请一种概率分布估计过程的流程图;
图13为本申请实施例示意出的一种编解码过程示意图;
图14为本申请实施例示意出的另一种编解码过程示意图;
图15为本申请实施例示意出的另一种编解码过程示意图;
图16为本申请实施例示意出的另一种编解码过程示意图;
图17为本申请实施例示意出的另一种编解码过程示意图;
图18为本申请实施例示意出的另一种编解码过程示意图。
具体实施方式
本申请实施例提供一种基于AI的视频图像压缩技术,尤其是提供一种基于神经网络的视频压缩技术,具体提供一种视频图像的熵编解码技术,以改进传统的混合视频编解码系统。
视频编码通常是指处理形成视频或视频序列的图像序列。在视频编码领域,术语“图像(picture)”、“帧(frame)”或“图片(image)”可以用作同义词。视频译码(或通常称为编码)包括视频编码和视频解码两部分。视频编码在源侧执行,通常包括处理(例如,压缩)原始视频图像以减少表示该视频图像所需的数据量(从而更高效存储和/或传输)。视频解码在目的地侧执行,通常包括相对于编码器作逆处理,以重建视频图像。实施例涉及的视频图像(或通常称为图像)的“编码”应理解为视频图像或视频序列的“编码”或“解码”。编码部分和解码部分也合称为编解码(编码和解码,CODEC)。
在无损视频编码情况下,可以重建原始视频图像,即重建的视频图像与原始视频图像具有相同的质量(假设存储或传输期间没有传输损耗或其它数据丢失)。在有损视频编码情况下,通过量化等执行进一步压缩,来减少表示视频图像所需的数据量,而解码器侧无法完全重建视频图像,即重建的视频图像的质量低于原始视频图像的质量。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例可能涉及的神经网络的相关术语和概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以xs和截距1为输入的运算单元,该运算单元的输出可以为:
$h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$
其中,s=1、2、……n,n为大于1的自然数,Ws为xs的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
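下面给出与上式对应的一个极简示意实现(Python);以sigmoid作为激活函数f为示例性假设:

```python
import numpy as np

def neuron(x, W, b):
    """按上式计算单个神经单元的输出:f(sum_s(Ws*xs) + b)。"""
    z = np.dot(W, x) + b                  # 加权求和并加上偏置
    return 1.0 / (1.0 + np.exp(-z))       # 激活函数f,此处示例性地取sigmoid
```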
(2)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有多层隐含层的神经网络。按照不同层的位置对DNN进行划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。
虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:$\vec{y}=a(W\vec{x}+\vec{b})$。其中,$\vec{x}$是输入向量,$\vec{y}$是输出向量,$\vec{b}$是偏移向量,$W$是权重矩阵(也称系数),$a(\cdot)$是激活函数。每一层仅仅是对输入向量$\vec{x}$经过如此简单的操作得到输出向量$\vec{y}$。由于DNN层数多,系数$W$和偏移向量$\vec{b}$的数量也比较多。这些参数在DNN中的定义如下所述:以系数$W$为例,假设在一个三层的DNN中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$W_{24}^{3}$,上标3代表系数$W$所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。综上,第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W_{jk}^{L}$。
需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
(3)卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器,该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(4)循环神经网络(recurrent neural networks,RNN)是用来处理序列数据的。在传统的神经网络模型中,是从输入层到隐含层再到输出层,层与层之间是全连接的,而每一层内的各个节点之间是无连接的。这种普通的神经网络虽然解决了很多难题,但是却仍然对很多问题无能无力。例如,你要预测句子的下一个单词是什么,一般需要用到前面的单词,因为一个句子中前后单词并不是独立的。RNN之所以称为循环神经网络,即一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用于当前输出的计算中,即隐含层本层之间的节点不再无连接而是有连接的,并且隐含层的输入不仅包括输入层的输出还包括上一时刻隐含层的输出。理论上,RNN能够对任何长度的序列数据进行处理。对于RNN的训练和对传统的CNN或DNN的训练一样。RNN旨在让机器像人一样拥有记忆的能力。因此,RNN的输出就需要依赖当前的输入信息和历史的记忆信息。
(5)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再基于两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断地调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(6)反向传播算法
神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的神经网络模型中参数的大小,使得神经网络模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的神经网络模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的神经网络模型的参数,例如权重矩阵。
在以下译码系统10的实施例中,编码器20和解码器30基于图1至图2进行描述。
图1为示例性译码系统10的示意性框图,例如可以利用本申请技术的视频译码系统10(或简称为译码系统10)。视频译码系统10中的视频编码器20(或简称为编码器20)和视频解码器30(或简称为解码器30)代表可用于基于本申请中描述的各种示例执行各技术的设备等。
如图1所示,译码系统10包括源设备12,源设备12用于将编码图像等编码图像数据21提供给用于对编码图像数据21进行解码的目的设备14。
源设备12包括编码器20,另外即可选地,可包括图像源16、图像预处理器等预处理器(或预处理单元)18、通信接口(或通信单元)22。
图像源16可包括或可以为任意类型的用于捕获现实世界图像等的图像捕获设备,和/或任意类型的图像生成设备,例如用于生成计算机动画图像的计算机图形处理器或任意类型的用于获取和/或提供现实世界图像、计算机生成图像(例如,屏幕内容、虚拟现实(virtual reality,VR)图像和/或其任意组合(例如增强现实(augmented reality,AR)图像)的设备。所述图像源可以为存储上述图像中的任意图像的任意类型的内存或存储器。
为了区分预处理器(或预处理单元)18执行的处理,图像(或图像数据17)也可称为原始图像(或原始图像数据)17。
预处理器18用于接收(原始)图像数据17,并对图像数据17进行预处理,得到预处理图像(或预处理图像数据)19。例如,预处理器18执行的预处理可包括修剪、颜色格式转换(例如从RGB转换为YCbCr)、调色或去噪。可以理解的是,预处理单元18可以为可选组件。
视频编码器(或编码器)20用于接收预处理图像数据19并提供编码图像数据21。
源设备12中的通信接口22可用于:接收编码图像数据21并通过通信信道13向目的设备14等另一设备或任何其它设备发送编码图像数据21(或其它任意处理后的版本),以便存储或直接重建。
目的设备14包括解码器30,另外即可选地,可包括通信接口(或通信单元)28、后处理器(或后处理单元)32和显示设备34。
目的设备14中的通信接口28用于直接从源设备12或从存储设备等任意其它源设备接收编码图像数据21(或其它任意处理后的版本),例如,存储设备为编码图像数据存储设备,并将编码图像数据21提供给解码器30。
通信接口22和通信接口28可用于通过源设备12与目的设备14之间的直连通信链路,例如直接有线或无线连接等,或者通过任意类型的网络,例如有线网络、无线网络或其任意组合、任意类型的私网和公网或其任意类型的组合,发送或接收编码图像数据(或编码数据)21。
例如,通信接口22可用于将编码图像数据21封装为报文等合适的格式,和/或使用任意类型的传输编码或处理来处理所述编码后的图像数据,以便在通信链路或通信网络上进行传输。
通信接口28与通信接口22对应,例如,可用于接收传输数据,并使用任意类型的对应传输解码或处理和/或解封装对传输数据进行处理,得到编码图像数据21。
通信接口22和通信接口28均可配置为如图1中从源设备12指向目的设备14的对应通信信道13的箭头所指示的单向通信接口,或双向通信接口,并且可用于发送和接收消息等,以建立连接,确认并交换与通信链路和/或例如编码后的图像数据传输等数据传输相关的任何其它信息,等等。
视频解码器(或解码器)30用于接收编码图像数据21并提供解码图像数据(或解码数据)31(下面将基于图9A、图9E和图9F等进一步描述)。
后处理器32用于对解码后的图像等解码图像数据31(也称为重建后的图像数据)进行后处理,得到后处理后的图像等后处理图像数据33。后处理单元32执行的后处理可以包括例如颜色格式转换(例如从YCbCr转换为RGB)、调色、修剪或重采样,或者用于产生供显示设备34等显示的解码图像数据31等任何其它处理。
显示设备34用于接收后处理图像数据33,以向用户或观看者等显示图像。显示设备34可以为或包括任意类型的用于表示重建后图像的显示器,例如,集成或外部显示屏或显示器。例如,显示屏可包括液晶显示器(liquid crystal display,LCD)、有机发光二极管(organic light emitting diode,OLED)显示器、等离子显示器、投影仪、微型LED显示器、硅基液晶显示器(liquid crystal on silicon,LCoS)、数字光处理器(digital light processor,DLP)或任意类型的其它显示屏。
译码系统10还包括训练引擎25,训练引擎25用于训练编码器20或解码器30中的神经网络,以处理输入已编码图像的重建残差或者已解码图像的重建残差得到当前图像的残差包括的多个样本的残差值的概率分布。可选地,输入的数据还包括超先验信息。
训练数据可以存入数据库(未示意)中,训练引擎25基于训练数据训练得到神经网络, 该神经网络用于估算当前图像的残差包括的多个样本的残差值的概率分布。需要说明的是,本申请实施例对于训练数据的来源不做限定,例如可以是从云端或其他地方获取训练数据进行用于概率估计的神经网络训练。
训练引擎25训练得到的神经网络可以应用于译码系统10,40中,例如,应用于图1所示的源设备12(例如编码器20)或目的设备14(例如解码器30)。训练引擎25可以在云端训练得到上述神经网络,然后译码系统10从云端下载并使用该神经网络;或者,训练引擎25可以在云端训练得到上述神经网络并使用该神经网络得到当前图像的残差包括的多个样本的残差值的概率分布,译码系统10从云端直接获取该概率分布。
尽管图1示出了源设备12和目的设备14作为独立的设备,但设备实施例也可以同时包括源设备12和目的设备14或同时包括源设备12和目的设备14的功能,即同时包括源设备12或对应功能和目的设备14或对应功能。在这些实施例中,源设备12或对应功能和目的设备14或对应功能可以使用相同硬件和/或软件或通过单独的硬件和/或软件或其任意组合来实现。
基于描述,图1所示的源设备12和/或目的设备14中的不同单元或功能的存在和(准确)划分可能基于实际设备和应用而有所不同,这对技术人员来说是显而易见的。
编码器20(例如视频编码器20)或解码器30(例如视频解码器30)或两者都可通过如图2所示的处理电路实现,例如一个或多个微处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application-specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)、离散逻辑、硬件、视频编码专用处理器或其任意组合。编码器20可以通过处理电路46实现,以包含参照图2编码器20论述的各种模块和/或本文描述的任何其它解码器系统或子系统。解码器30可以通过处理电路46实现,以包含参照图3解码器30论述的各种模块和/或本文描述的任何其它解码器系统或子系统。所述处理电路46可用于执行下文论述的各种操作。如图4所示,如果部分技术在软件中实施,则设备可以将软件的指令存储在合适的非瞬时性计算机可读存储介质中,并且使用一个或多个处理器在硬件中执行指令,从而执行本申请技术。视频编码器20和视频解码器30中的其中一个可作为组合编解码器(encoder/decoder,CODEC)的一部分集成在单个设备中,如图2所示。
源设备12和目的设备14可包括各种设备中的任一种,包括任意类型的手持设备或固定设备,例如,笔记本电脑或膝上型电脑、手机、智能手机、平板或平板电脑、相机、台式计算机、机顶盒、电视机、显示设备、数字媒体播放器、视频游戏控制台、视频流设备(例如,内容业务服务器或内容分发服务器)、广播接收设备、广播发射设备,等等,并可以不使用或使用任意类型的操作系统。在一些情况下,源设备12和目的设备14可配备用于无线通信的组件。因此,源设备12和目的设备14可以是无线通信设备。
在一些情况下,图1所示的视频译码系统10仅仅是示例性的,本申请提供的技术可适用于视频编码设置(例如,视频编码或视频解码),这些设置不一定包括编码设备与解码设备之间的任何数据通信。在其它示例中,数据从本地存储器中检索,通过网络发送,等等。视频编码设备可以对数据进行编码并将数据存储到存储器中,和/或视频解码设备可以从存储器中检索数据并对数据进行解码。在一些示例中,编码和解码由相互不通信而只是编码数据到存储器和/或从存储器中检索并解码数据的设备来执行。
图2是基于一示例性实施例的包含视频编码器20和/或视频解码器30的视频译码系统40的实例的说明图。视频译码系统40可以包含成像设备41、视频编码器20、视频解码器30(和/或藉由处理电路46实施的视频编/解码器)、天线42、一个或多个处理器43、一个或多个内存存储器44和/或显示设备45。
如图2所示,成像设备41、天线42、处理电路46、视频编码器20、视频解码器30、处理器43、内存存储器44和/或显示设备45能够互相通信。在不同实例中,视频译码系统40可以只包含视频编码器20或只包含视频解码器30。
在一些实例中,天线42可以用于传输或接收视频数据的经编码比特流。另外,在一些实例中,显示设备45可以用于呈现视频数据。处理电路46可以包含专用集成电路(application-specific integrated circuit,ASIC)逻辑、图形处理器、通用处理器等。视频译码系统40也可以包含可选的处理器43,该可选处理器43类似地可以包含专用集成电路(application-specific integrated circuit,ASIC)逻辑、图形处理器、通用处理器等。另外,内存存储器44可以是任何类型的存储器,例如易失性存储器(例如,静态随机存取存储器(static random access memory,SRAM)、动态随机存储器(dynamic random access memory,DRAM)等)或非易失性存储器(例如,闪存等)等。在非限制性实例中,内存存储器44可以由超速缓存内存实施。在其它实例中,处理电路46可以包含存储器(例如,缓存等)用于实施图像缓冲器等。
在一些实例中,通过逻辑电路实施的视频编码器20可以包含(例如,通过处理电路46或内存存储器44实施的)图像缓冲器和(例如,通过处理电路46实施的)图形处理单元。图形处理单元可以通信耦合至图像缓冲器。图形处理单元可以包含通过处理电路46实施的视频编码器20,以实施参照图2的视频编码器20和/或本文中所描述的任何其它编码器系统或子系统所论述的各种模块。逻辑电路可以用于执行本文所论述的各种操作。
在一些实例中,视频解码器30可以以类似方式通过处理电路46实施,以实施参照图2的视频解码器30和/或本文中所描述的任何其它解码器系统或子系统所论述的各种模块。在一些实例中,逻辑电路实施的视频解码器30可以包含(通过处理电路46或内存存储器44实施的)图像缓冲器和(例如,通过处理电路46实施的)图形处理单元。图形处理单元可以通信耦合至图像缓冲器。图形处理单元可以包含通过处理电路46实施的视频解码器30。
在一些实例中,天线42可以用于接收视频数据的经编码比特流。如所论述,经编码比特流可以包含本文所论述的与编码视频帧相关的数据、指示符、索引值、模式选择数据等,例如与编码分割相关的数据(例如,变换系数或经量化变换系数,(如所论述的)可选指示符,和/或定义编码分割的数据)。视频译码系统40还可包含耦合至天线42并用于解码经编码比特流的视频解码器30。显示设备45用于呈现视频帧。
应理解,本申请实施例中对于参考视频编码器20所描述的实例,视频解码器30可以用于执行相反过程。关于信令语法元素,视频解码器30可以用于接收并解析这种语法元素,相应地解码相关视频数据。在一些例子中,视频编码器20可以将语法元素熵编码成经编码视频比特流。在此类实例中,视频解码器30可以解析这种语法元素,并相应地解码相关视频数据。
需要说明的是,本申请所描述的编解码过程存在于绝大部分视频编解码器中,例如H.263、H.264、MPEG-2、MPEG-4、VP8、VP9、基于AI的端到端的图像编码等对应的编解码器中。
图3为本申请实施例提供的视频译码设备400的示意图。视频译码设备400适用于实现本文描述的公开实施例。在一个实施例中,视频译码设备400可以是解码器,例如图2中的视频解码器30,也可以是编码器,例如图2中的视频编码器20。
视频译码设备400包括:用于接收数据的入端口410(或输入端口410)和接收单元(receiver unit,Rx)420;用于处理数据的处理器、逻辑单元或中央处理器(central processing unit,CPU)430;例如,这里的处理器430可以是神经网络处理器430;用于传输数据的发送单元(transmitter unit,Tx)440和出端口450(或输出端口450);用于存储数据的存储器460。视频译码设备400还可包括耦合到入端口410、接收单元420、发送单元440和出端口450的光电(optical-to-electrical,OE)组件和电光(electrical-to-optical,EO)组件,用于光信号或电信号的出口或入口。
处理器430通过硬件和软件实现。处理器430可实现为一个或多个处理器芯片、核(例如,多核处理器)、FPGA、ASIC和DSP。处理器430与入端口410、接收单元420、发送单元440、出端口450和存储器460通信。处理器430包括译码模块470(例如,基于神经网络NN的译码模块470)。译码模块470实施上文所公开的实施例。例如,译码模块470执行、处理、准备或提供各种编码操作。因此,通过译码模块470为视频译码设备400的功能提供了实质性的改进,并且影响了视频译码设备400到不同状态的切换。或者,以存储在存储器460中并由处理器430执行的指令来实现译码模块470。
存储器460包括一个或多个磁盘、磁带机和固态硬盘,可以用作溢出数据存储设备,用于在选择执行程序时存储此类程序,并且存储在程序执行过程中读取的指令和数据。存储器460可以是易失性和/或非易失性的,可以是只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、三态内容寻址存储器(ternary content-addressable memory,TCAM)和/或静态随机存取存储器(static random-access memory,SRAM)。
图4为示例性实施例提供的装置500的简化框图,装置500可用作图1中的源设备12和目的设备14中的任一个或两个。
装置500中的处理器502可以是中央处理器。或者,处理器502可以是现有的或今后将研发出的能够操控或处理信息的任何其它类型设备或多个设备。虽然可以使用如图所示的处理器502等单个处理器来实施已公开的实现方式,但使用一个以上的处理器速度更快和效率更高。
在一种实现方式中,装置500中的存储器504可以是只读存储器(ROM)设备或随机存取存储器(RAM)设备。任何其它合适类型的存储设备都可以用作存储器504。存储器504可以包括处理器502通过总线512访问的代码和数据506。存储器504还可包括操作系统508和应用程序510,应用程序510包括允许处理器502执行本文所述方法的至少一个程序。例如,应用程序510可以包括应用1至N,还包括执行本文所述方法的视频译码应用。
装置500还可以包括一个或多个输出设备,例如显示器518。在一个示例中,显示器518可以是将显示器与可用于感测触摸输入的触敏元件组合的触敏显示器。显示器518可以通过总线512耦合到处理器502。
虽然装置500中的总线512在本文中描述为单个总线,但是总线512可以包括多个总线。此外,辅助储存器可以直接耦合到装置500的其它组件或通过网络访问,并且可以包括存储卡等单个集成单元或多个存储卡等多个单元。因此,装置500可以具有各种各样的配置。
如图5所示,第一行为原始图像,第二行为原始图像的预测,第三行为原始图像的残差。从图中可以看出,连续图像的残差之间存在明显的相关性,可以基于此相关性对当前图像的残差进行编码和对当前图像的码流进行解码,从而提高编解码的效率。
应理解,非连续图像的残差之间也存在相关性,但是非连续图像的时间戳之间的差值小于预设阈值,以使非连续图像中的内容大部分是相同的,从而保证非连续图像的残差之间具有明显的相关性。
编码器和编码方法
图6A为用于实现本申请技术的一种视频编码器的示例的示意性框图。在图6A的示例中, 视频编码器20包括输入端(或输入接口)202、编码网络204、取整单元206、熵编码单元208、用于概率估计的神经网络210和输出端(或输出接口)212。图6A所示的视频编码器20也可称为端到端的视频编码器或者基于端到端视频编码器的视频编码器。
编码器20可通过输入端202等接收输入数据203,该输入数据包括当前图像的残差,作为编码网络204的输入。
上述输入数据203还包括参考图像的残差,或者,
多个参考图像的残差,或者,
多个参考图像的残差和当前图像的预测,或者,
多个参考图像的残差、多个参考图像的预测和当前图像的预测,作为用于概率估计的神经网络210的输入。
(数字)图像为或可以视为具有强度值的像素点组成的二维阵列或矩阵。阵列中的像素点也可以称为像素(pixel或pel)(图像元素的简称)。阵列或图像在水平方向和垂直方向(或轴线)上的像素点数量决定了图像的大小和/或分辨率。为了表示颜色,通常采用三个颜色分量,即图像可以表示为或包括三个像素点阵列。在RBG格式或颜色空间中,图像包括对应的红色、绿色和蓝色像素点阵列。但是,在视频或图像编码中,每个像素通常以亮度/色度格式或颜色空间表示,例如YCbCr,包括Y指示的亮度分量(有时也用L表示)以及Cb、Cr表示的两个色度分量。亮度(luma)分量Y表示亮度或灰度水平强度(例如,在灰度等级图像中两者相同),而两个色度(chrominance,简写为chroma)分量Cb和Cr表示色度或颜色信息分量。相应地,YCbCr格式的图像包括亮度像素点值(Y)的亮度像素点阵列和色度值(Cb和Cr)的两个色度像素点阵列。RGB格式的图像可以转换或变换为YCbCr格式,反之亦然,该过程也称为颜色变换或转换。如果图像是黑白的,则该图像可以只包括亮度像素点阵列。相应地,图像可以为例如单色格式的亮度像素点阵列或4:2:0、4:2:2和4:4:4彩色格式的亮度像素点阵列和两个相应的色度像素点阵列。
其中,当前图像的残差是基于当前图像与当前图像的预测得到的,具体地,当前图像包括多个样本(像素点)的原始值,当前图像的预测包括多个样本(像素点)的预测值,将当前图像中样本(像素点)的原始值逐个与当前图像的预测中对应位置的样本的预测值作差,从而得到当前图像的残差。
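上述逐样本作差过程可示意如下(Python);其中将数据转换为有符号整数仅是为避免无符号类型溢出的示例性处理:

```python
import numpy as np

def compute_residual(original, prediction):
    """当前图像与其预测逐样本作差得到残差(两者分辨率需相同)。"""
    assert original.shape == prediction.shape
    return original.astype(np.int32) - prediction.astype(np.int32)
```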
在一个示例中,如图6B所示,对已编码图像的重建样本值进行预测得到当前图像的预测信号;然后按照上述方法对当前图像的原始信号和当前图像的预测进行处理,得到当前图像的残差。在另一个示例中,如图6C所示,对当前图像的低分辨率图像进行上采样,得到当前图像的预测,然后按照上述方式对当前图像的原始信号及当前图像的预测进行处理,得到当前图像的残差;其中,当前图像的低分辨率图像是对当前图像进行下采样得到的。
在此需要指出的是,本申请中的当前图像的预测信号又可以称为当前图像的预测,参考图像的预测信号又可以称为参考图像的预测;当前图像的原始信号指的是当前图像本身。本申请中的当前图像的残差又可以称为当前图像的残差信号。
编码网络204
如图6A所示,编码网络204用于通过编码网络对当前图像的残差进行处理,得到当前图像的残差的特征图205;
在一个可能性中,编码网络204如图6D所示,编码网络204包含了K个子编码网络,每个子编码网络对应了所在的特征层以及对应的输出特征图。当前图像的残差输入到第一子编码网络中进行特征提取,得到第一输出特征图;再将第一输出特征图输入到第二子编码网络中进行特征提取,得到第二输出特征图,依次类推,将第K-1输出特征图输入到第K子编码网络中进行特征提取,得到第K输出特征图;此时有K个输出特征图输出,其中K>=1。上述当前图像的残差的特征图可以为K个输出特征图中的任一个,或者为第K输出特征图,或者为K个输出特征图经堆叠操作得到的多通道的特征图,该特征图具有多尺度的特征。
在一个可能性中,编码网络204的结构如图6E所示,编码网络204由T个网络层所构成,其中M、L、T、K为正整数。第M输出特征图与第K输出特征图均为编码网络的输出,且编码网络输出的第K输出特征图在编码网络204的网络层L后输出,第M输出特征图在网络层T后输出。可以理解,可以有多个输出特征图在编码网络204中的不同网络层位置后输出,此处不做限定。上述当前图像的残差的特征图可以为第M输出特征图,或者为多个输出特征图经堆叠操作得到的多通道的特征图,该特征图具有多尺度的特征。
在此需要指出的是,在对多个输出特征图进行堆叠操作之前,若多个输出特征图的尺寸不一致时,对多个输出特征图进行处理,比如上采样、下采样等,以使得多个输出特征图的尺寸一致。
其中图6E中的任意一网络层可以为卷积层、归一化层、非线性激活层等。
在一个可能性中,编码网络204中图6D所对应的各子编码网络结构可以互为相同或者不同,第M和第N子编码网络的网络结构如图6F所示。其中第M子编码网络的网络层1和第N子编码网络中的任意一网络层结构可以互不相同,且L和P可以相同或不同,L和P均为大于0的正整数。
在一个可能性中,编码网络204中的任意一子编码网络的网络结构示例如图6G所示,可见示例中第N子编码网络包含了5个网络层,具体包括了一个归一化层、两个卷积层以及两个非线性激活层。
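与图6G所示结构对应的一个示意性实现如下(Python/PyTorch);通道数、步长等超参数以及以BatchNorm实现归一化层均为示例性假设:

```python
import torch.nn as nn

# 图6G示例:一个归一化层、两个卷积层、两个非线性激活层构成的子编码网络
sub_encoding_net = nn.Sequential(
    nn.BatchNorm2d(32),                         # 归一化层(示例性地采用BatchNorm)
    nn.Conv2d(32, 64, 3, stride=2, padding=1),  # 卷积层1(此处同时做下采样)
    nn.ReLU(),                                  # 非线性激活层1
    nn.Conv2d(64, 64, 3, padding=1),            # 卷积层2
    nn.ReLU())                                  # 非线性激活层2
```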
取整单元206
取整单元206用于通过例如标量量化或矢量量化对输出特征图205进行取整,得到整数化后的特征图207。
在一个可能性中,视频编码器20(对应地,取整单元206)可用于输出量化参数(quantization parameter,QP),例如,直接输出或由熵编码单元208进行编码或压缩后输出,例如使得视频解码器30可接收并使用量化参数进行解码。
用于概率估计的神经网络210
用于概率估计的神经网络210用于基于输入的数据估算得到当前图像的残差包括的多个样本的残差值的概率分布。
可选地,上述概率分布表示当前图像包括的多个样本的原始值与该多个样本的预测值之间的多个差值的概率分布,或者,表示当前图像的多个样本中各个样本点的残差值的概率分布,或者表示当前图像的多个样本的残差值对应的概率分布。
对当前图像的残差包括的多个样本的残差值的概率分布信息进行如下解释:
以一个2x2大小的图像为例,如图6H所示,C0、C1、C2、C3为当前图像的4个样本的残差,其值分别为-2、1、-2、0,R0、R1、R2、R3为参考图像的4个样本的残差,其值分别为-1、2、-2、1。假设残差值的取值范围为[-2,2],即-2到2之间的整数。通过用于概率估计的神经网络210对参考图像的残差进行概率估计,得到当前图像的残差包括的4个样本的残差值的概率分布:
一种情况为,当前图像的残差包括的4个样本的残差值的概率分布具体是指当前图像的4个样本中各个样本点的残差值的概率分布,即C0位置的残差值的概率分布,C1位置的残差 值的概率分布,C2位置的残差值的概率分布,和C3位置的残差值的概率分布如下表1所示:
位置 | 值为-2的概率 | 值为-1的概率 | 值为0的概率 | 值为1的概率 | 值为2的概率
C0 | 0.6 | 0.1 | 0.1 | 0.1 | 0.1
C1 | 0.1 | 0.1 | 0.1 | 0.5 | 0.2
C2 | 0.5 | 0.2 | 0.1 | 0.1 | 0.1
C3 | 0.1 | 0.2 | 0.4 | 0.2 | 0.1
表1
可以按照上述概率分布对上述残差进行算术熵编码。
另外一种可能的情况为,基于参考图像包括的4个样本R0、R1、R2、R3的残差值得到当前图像包括的4个样本的残差值的概率分布。例如基于R0、R1、R2、R3四个样本的残差值,其值分别为-1、2、-2、1,得到当前图像中的4个样本的残差值的概率分布,如下表2所示,
(表2原文以图像形式给出,此处未能完整恢复;其内容为基于R0、R1、R2、R3的残差值得到的、残差取值范围[-2,2]内各个值的概率分布,不区分样本所在位置)
表2
或者如下表3所示:
位置 | 值为-2的概率 | 值为-1的概率 | 值为0的概率 | 值为1的概率 | 值为2的概率
C0 | 0.5 | 0.2 | 0.1 | 0.1 | 0.1
C1 | 0.5 | 0.2 | 0.1 | 0.1 | 0.1
C2 | 0.5 | 0.2 | 0.1 | 0.1 | 0.1
C3 | 0.5 | 0.2 | 0.1 | 0.1 | 0.1
表3
相比于表1,表2和表3所示的概率分布为残差值取值范围内各个值的概率分布,并不区分样本所在的位置,即多个样本的残差值的概率分布相同。
可以按照上述概率分布,采用算术熵编码的方法对上述残差进行编码。
可选地,用于概率估计的神经网络210的输入数据包括参考图像的残差,或者,
参考图像的残差、当前图像的预测和参考图像的预测,或者,
多张参考图像的残差,或者,
多张参考图像的残差和多张参考图像的预测。
需要指出的是,上述参考图像或多张参考图像为在编码当前图像之前的已编码图像。
在一个示例中,如图7A所示,用于概率估计的神经网络210包括卷积网络、均匀池化层和归一化指数函数Softmax层,其中,卷积网络包括激励ReLU层和多个卷积层,
用于概率估计的神经网络210的输入数据包括参考图像的残差时,通过卷积网络提取参考图像的残差的特征信息;用于概率估计的神经网络210的输入数据包括参考图像的残差、参考图像的预测和当前图像的预测时,通过卷积网络提取参考图像的残差、参考图像的预测和当前图像的预测的特征信息;用于概率估计的神经网络210的输入数据包括多张参考图像的残差时,通过卷积网络提取多张参考图像的残差的特征信息;用于概率估计的神经网络210的输入数据包括多张参考图像的残差、多张参考图像的预测及当前图像的预测时,通过卷积网络提取多张参考图像的残差、多张参考图像的预测和当前图像的预测的特征信息;再通过均匀池化层和Softmax层对卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
可选地,当用于概率估计的神经网络210的输入包括多张图像(比如参考图像的残差、参考图像的预测和当前图像的预测,或者多张参考图像的残差,或者多张参考图像的残差、多张参考图像的预测及当前图像的预测)时,在该多张图像输入到用于概率估计的神经网络210之前,首先对多张图像进行堆叠操作,得到多通道的图像,然后再将该多通道的图像输入用于概率估计的神经网络210中,用于概率估计的神经网络210中的卷积网络提取多通道的图像的特征信息,再通过均匀池化层和Softmax层对卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
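上述堆叠操作可示意如下(Python/PyTorch),此处以在通道维拼接为例:

```python
import torch

def stack_inputs(images):
    """images为若干形如[N, 1, H, W]的张量(如多张参考图像的残差、预测等)。"""
    return torch.cat(images, dim=1)   # 得到[N, K, H, W]的多通道图像,整体送入概率估计网络
```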
在一个示例中,用于概率估计的神经网络210的输入数据还包括超先验信息,此时用于概率估计的神经网络210的结构与图7A的不相同,如图7B所示,神经网络包括卷积网络、卷积池化层、第一卷积层、拼接(concat)层和Softmax层,其中,卷积网络包括多个卷积层、ReLU层,
用于概率估计的神经网络210的输入数据包括参考图像的残差和超先验信息时,通过卷积网络提取参考图像的残差和超先验信息的特征信息;用于概率估计的神经网络210的输入数据包括参考图像的残差、参考图像的预测、超先验信息和当前图像的预测时,通过卷积网络提取参考图像的残差、参考图像的预测、超先验信息和当前图像的预测的特征信息;用于概率估计的神经网络210的输入数据包括多张参考图像的残差和超先验信息时,通过卷积网络提取多张参考图像的残差和超先验信息的特征信息;用于概率估计的神经网络210的输入数据包括多张参考图像的残差、多张参考图像的预测、超先验信息及当前图像的预测时,通过卷积网络提取多张参考图像的残差、多张参考图像的预测、超先验信息和当前图像的预测的特征信息;再通过拼接层、第一卷积层、均匀池化层和Softmax层对卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
可选地,第一卷积层与卷积网络中的卷积层可以相同,也可以不相同。
在一个示例中,当用于概率估计的神经网络210的输入数据包括参考图像的残差、当前图像的预测和参考图像的预测时,在参考图像的残差、当前图像的预测和参考图像的预测输入到用于概率估计的神经网络210之前,对参考图像的残差、当前图像的预测和参考图像的预测进行堆叠操作,得到一个三通道的图像,然后再将该三通道的图像输入到用于概率估计的神经网络210中进行处理;或者,
当用于概率估计的神经网络210的输入数据包括多张参考图像的残差时,在将该多张参考图像的残差输入用于概率估计的神经网络210之前,对多张参考图像的残差进行堆叠操作,得到一个多通道的图像,再将该多通道的图像输入到用于概率估计的神经网络210中进行处理,或者,
当用于概率估计的神经网络210的输入数据包括多张参考图像的残差、多张参考图像的预测和当前图像的预测时,在输入到用于概率估计的神经网络210中之前,对多张参考图像的残差、多张参考图像的预测和当前图像的预测进行堆叠操作,得到一个多通道的图像,再将该多通道的图像输入到用于概率估计的神经网络210中进行处理。
在另一个示例中,当用于概率估计的神经网络210的输入数据包括多张参考图像的残差时,在将该多张参考图像的残差输入用于概率估计的神经网络210之前,多张参考图像的残差经过循环神经网络得到第一隐藏信息,再将该第一隐藏信息输入到用于概率估计的神经网络210中进行处理。
具体地,将多个参考图像的残差中的第j个参考图像的残差Pj及隐藏信息Hj-1输入到循环神经网络中进行处理,得到隐藏信息Hj;按照该方式对多个参考图像的残差进行处理得到上述第一隐藏信息;其中,当第j个参考图像的残差为多个参考图像的残差中的第1个时,隐藏信息Hj-1为0;当第j个参考图像的残差为多个参考图像的残差中的最后一个时,隐藏信息Hj为第一隐藏信息。
其中,多张参考图像中相邻两张参考图像的时间戳的差值为上述预设阈值。
举例说明,如图7C所示,假设N个参考图像的残差经循环神经网络得到第一隐藏信息,具体包括:
首先将第1个参考图像的残差输入到循环神经网络中进行处理,得到隐藏信息H1;然后再将隐藏信息H1和第2个参考图像的残差输入到循环神经网络中进行处理,得到隐藏信息H2;……将隐藏信息HN-3和第N-2个参考图像的残差输入到循环神经网络中进行处理,得到隐藏信息HN-2;将隐藏信息HN-2和第N-1个参考图像的残差输入到循环神经网络中进行处理,得到隐藏信息HN-1;将隐藏信息HN-1和第N个参考图像的残差输入到循环神经网络中进行处理,得到隐藏信息HN;该隐藏信息HN为第一隐藏信息。
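上述循环处理可用如下示意代码表达(Python/PyTorch);循环单元以单个卷积层加tanh近似,属于示例性假设,返回的HN即第一隐藏信息:

```python
import torch
import torch.nn as nn

class SimpleConvRNN(nn.Module):
    def __init__(self, hidden_ch=16):
        super().__init__()
        self.hidden_ch = hidden_ch
        self.cell = nn.Conv2d(1 + hidden_ch, hidden_ch, 3, padding=1)  # 循环单元(示例性)

    def forward(self, ref_resis):                 # ref_resis: [P1, ..., PN],每个为[N,1,H,W]
        n, _, h, w = ref_resis[0].shape
        hidden = ref_resis[0].new_zeros(n, self.hidden_ch, h, w)       # H0 = 0
        for p_j in ref_resis:                      # 逐个融合Pj与Hj-1,得到Hj
            hidden = torch.tanh(self.cell(torch.cat([p_j, hidden], dim=1)))
        return hidden                              # HN,即第一隐藏信息
```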
在另一个示例中,当用于概率估计的神经网络210的输入数据包括多张参考图像的残差、多张参考图像的预测和当前图像的预测时,在输入到用于概率估计的神经网络210中之前,多张参考图像的残差和多张参考图像的预测经循环神经网络得到第二隐藏信息,再将该第二隐藏信息和当前图像的预测输入到用于概率估计的神经网络210中进行处理。其中,第二隐藏信息本质上是一个特征图,因此可以在得到第二隐藏信息后,将第二隐藏信息和当前图像的预测进行堆叠操作,得到一个二通道的图像,然后再将该二通道的图像输入到用于概率估计的神经网络210中进行处理。
可选地,基于多个参考图像的残差和多个参考图像的预测得到第二隐藏信息,包括:
将多个参考图像的残差中的第j个参考图像的残差Pj、多个参考图像的预测中的第j个参考图像的预测Pj'和隐藏信息Hj-1'输入到循环神经网络中进行处理,得到隐藏信息Hj';按照该方式对多个参考图像的残差和多个参考图像的预测进行处理得到上述第二隐藏信息;其中,当第j个参考图像的残差和第j个参考图像的预测分别为多个参考图像的残差和预测中的第1个时,隐藏信息Hj-1'为0;当第j个参考图像的残差和第j个参考图像的预测分别为多个参考图像的残差和预测中的最后一个时,隐藏信息Hj'为第二隐藏信息。
举例说明,如图7D所示,假设基于N个参考图像的残差和N个参考图像的预测得到第二隐藏信息,具体包括:
首先将第1个参考图像的残差和第1个参考图像的预测输入到循环神经网络中进行处理,得到隐藏信息H1';然后再将隐藏信息H1'、第2个参考图像的残差和第2个参考图像的预测输入到循环神经网络中进行处理,得到隐藏信息H2';……将隐藏信息HN-3'、第N-2个参考图像的残差和第N-2个参考图像的预测输入到循环神经网络中进行处理,得到隐藏信息HN-2';将隐藏信息HN-2'、第N-1个参考图像的残差和第N-1个参考图像的预测输入到循环神经网络中进行处理,得到隐藏信息HN-1';将隐藏信息HN-1'、第N个参考图像的残差和第N个参考图像的预测输入到循环神经网络中进行处理,得到隐藏信息HN';该隐藏信息HN'为第二隐藏信息。
在此需要指出的是,图7A和图7B所示的神经网络结构中,各层之间的执行顺序不限于图7A和图7B所示的执行顺序,当然还可以是其他执行顺序。
可选地,视频编码器20还包括超先验编码网络218和超先验解码网络220,其中,超先验编码网络218对编码网络204输出的当前图像的残差的特征图做进一步的特征提取,得到超先验信息的特征向量,然后再将超先验信息的特征向量输入取整单元206中,取整单元206对超先验信息的特征向量进行取整,得到整数化后的超先验信息的特征向量;再将整数化后的超先验信息的特征向量输入到超先验解码网络220中进行处理,得到超先验信息。
可选地,超先验编码网络218的网络架构可以参见编码网络204的网络架构,也可以是其他形式的架构,在此不再具体叙述。超先验编码网络218的功能与超先验解码网络220的功能是相反的,因此超先验解码网络220的网络结构可参见超先验编码网络218的网络结构。
由于超先验信息对于用于概率估计的神经网络210来说是可选的,进而对于视频编码器20来说,超先验编码网络218和超先验解码网络220也是可选的,因此二者使用虚线框表示。
熵编码单元208
熵编码单元208用于将熵编码算法或方案(例如,可变长度编码(variable length coding,VLC)方案、上下文自适应VLC方案(context adaptive VLC,CAVLC)、算术编码方案、二值化算法、上下文自适应二进制算术编码(context adaptive binary arithmetic coding,CABAC)、基于语法的上下文自适应二进制算术编码(syntax-based context-adaptive binary arithmetic coding,SBAC)、概率区间分割熵(probability interval partitioning entropy,PIPE)编码或其它熵编码方法或技术)应用于整数化后的特征图和当前图像的残差包括的多个样本的残差值的概率分布,得到可以通过输出端212以编码比特流21等形式输出的编码图像数据21,使得视频解码器30等可以接收并使用用于解码的参数。可将编码比特流21传输到视频解码器30,或将其保存在存储器中稍后由视频解码器30传输或检索。
在一个示例中,由于熵编码时无法获知整数化后的特征图的真实的字符概率,所以在需要时可以统计这些或相关的信息并添加至熵编码单元208中,再把这些信息传到解码端。
在一个示例中,由于在编码侧,用于概率估计的神经网络210在估计当前图像的残差包括的多个样本的残差值的概率分布时,利用到超先验信息;因此在得到超先验信息后,对该超先验信息进行熵编码得到超先验信息码流,将该码流传输至视频解码器30进行解码,以使视频解码器30得到超先验信息并利用该超先验信息估计上述概率分布。
以下介绍算术熵编码的过程:
仍以图6H为例,假设当前需要对4个样本C0、C1、C2、C3上的残差值(其值分别为-2、1、-2、0)进行算术熵编码,已知其概率分布为表1。按照算术编码算法,基于C0、C1、C2、C3样本的残差值的概率分布,对0到1区间进行划分,将C0、C1、C2、C3样本的残差值表示成实数0到1之间的一个间隔,最后输出的“当前间隔”的下边界就是编码后的码字。具体过程如图7E所示,0.285为最后的编码码字,在实际编码中,可将该数值定点化表示后写入码流。
类似的,当其概率分布为表2或者表3时,按照上述类似方法进行算术编码。
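下面给出按表1对残差序列[-2, 1, -2, 0]进行区间细分的示意代码(Python);此处假设残差值按-2到2升序排列计算累积概率,在该约定下最终区间为[0.225, 0.285),区间内任一数值经定点化后均可作为码字,与文中给出的0.285相对应:

```python
values = [-2, -1, 0, 1, 2]                 # 残差取值范围(升序为示例性约定)
probs = {                                   # 表1:各样本位置上残差值的概率
    'C0': [0.6, 0.1, 0.1, 0.1, 0.1],
    'C1': [0.1, 0.1, 0.1, 0.5, 0.2],
    'C2': [0.5, 0.2, 0.1, 0.1, 0.1],
    'C3': [0.1, 0.2, 0.4, 0.2, 0.1],
}
symbols = [('C0', -2), ('C1', 1), ('C2', -2), ('C3', 0)]

low, high = 0.0, 1.0
for pos, v in symbols:
    p = probs[pos]
    i = values.index(v)
    cum = sum(p[:i])                        # 该残差值之前的累积概率
    width = high - low
    low, high = low + width * cum, low + width * (cum + p[i])
print(low, high)                            # 在上述约定下约为 0.225 0.285
```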
图8A为用于实现本申请技术的另一种视频编码器的示例的示意性框图。在图8A的示例中,视频编码器20包括输入端(或输入接口)202、量化单元216、熵编码单元208、用于概率估计的神经网络210和输出端(或输出接口)212。图8A所示的视频编码器20也可称为端到端的视频编码器或者基于端到端视频编码器的视频编码器。
编码器20可通过输入端202等接收输入数据203,该输入数据包括当前图像的残差,作为量化单元216的输入。
上述输入数据203还包括参考图像的残差,或者,
多个参考图像的残差,或者,
多个参考图像的残差和当前图像的预测,或者,
多个参考图像的残差、多个参考图像的预测和当前图像的预测,作为用于概率估计的神经网络210的输入。
量化单元216
量化单元216用于对当前图像的残差进行量化,得到量化后的当前图像的残差,该量化后的当前图像的残差作为熵编码单元208的输入。
可选地,上述量化可以是均匀量化或者是非均匀量化。
用于概率估计的神经网络210
用于概率估计的神经网络210的详细功能可以参见图6A中用于概率估计的神经网络210的相关描述,或者说图8A中的用于概率估计的神经网络210的功能与图6A中用于概率估计的神经网络210的功能相同,在此不再叙述。
熵编码单元208
熵编码单元208用于将熵编码算法或方案(例如,VLC方案、CALVC、算术编码方案、二值化算法、CABAC、SBAC、PIPE编码或其它熵编码方法或技术)应用于量化后的特征图和当前图像的残差包括的多个样本的残差值的概率分布,得到可以通过输出端212以编码比特流21等形式输出的编码图像数据21,使得视频解码器30等可以接收并使用用于解码的参数。可将编码比特流21传输到视频解码器30,或将其保存在存储器中稍后由视频解码器30传输或检索。
在此需要说明的是,熵编码单元208的详细功能可以参见图6A中熵编码单元208的相关描述,或者说图8A中的熵编码单元208的功能与图6A中熵编码单元208的功能相同,在此不再叙述。
图8B为用于实现本申请技术的另一种视频编码器的示例的示意性框图。在图8B的示例中,视频编码器20包括输入端(或输入接口)202、变换单元214、量化单元216、熵编码单元208、用于概率估计的神经网络210和输出端(或输出接口)212。图8B所示的视频编码器20也可称为端到端的视频编码器或者基于端到端视频编码器的视频编码器。
编码器20可通过输入端202等接收输入数据203,该输入数据包括当前图像的残差,作为变换单元214的输入。
上述输入数据203还包括参考图像的残差,或者,
多个参考图像的残差,或者,
多个参考图像的残差和当前图像的预测,或者,
多个参考图像的残差、多个参考图像的预测和当前图像的预测,作为用于概率估计的神经网络210的输入。
变换单元214
变换单元214用于对当前图像的残差进行变换,得到变换后的当前图像的残差,该变换后的当前图像的残差作为量化单元216的输入。
可选地,上述变换可以为离散余弦变换(discrete cosine transform,DCT),此外还可以进行离散正弦变换(discrete sine transform,DST),或者离散小波变换(discrete wavelet transform,DWT)等。
量化单元216
量化单元216用于对变换后的当前图像的残差进行量化,得到经变换和量化后的当前图像的残差,该将变换和量化后的当前图像的残差作为熵编码单元208的输入。
可选地,上述量化可以是均匀量化或者是非均匀量化。
用于概率估计的神经网络210
用于概率估计的神经网络210的详细功能可以参见图6A中用于概率估计的神经网络210的相关描述,或者说图8B中的用于概率估计的神经网络210的功能与图6A中用于概率估计的神经网络210的功能相同,在此不再叙述。
熵编码单元208
熵编码单元208用于将熵编码算法或方案(例如,VLC方案、CALVC、算术编码方案、二值化算法、CABAC、SBAC、PIPE编码或其它熵编码方法或技术)应用于经变换和量化后的特征图和当前图像的残差包括的多个样本的残差值的概率分布,得到可以通过输出端212以编码比特流21等形式输出的编码图像数据21,使得视频解码器30等可以接收并使用用于解码的参数。可将编码比特流21传输到视频解码器30,或将其保存在存储器中稍后由视频解码器30传输或检索。
在此需要说明的是,熵编码单元208的详细功能可以参见图6A中熵编码单元208的相关描述,或者说图8B中的熵编码单元208的功能与图6A中熵编码单元208的功能相同,在此不再叙述。
在此需要指出的是,在编码侧,除了图6A中的编码网络204、图8A中的量化单元216和图8B中的变换单元214所输入的当前图像的残差是基于当前图像的多个样本的原始值与该多个样本的预测值作差得到之外,其他地方所说的残差(包括当前图像的残差和参考图像的残差)均为重建残差。图像的重建残差指的是图像的多个样本的重建值与该多个样本的预测值之间的多个差值。
解码器和解码方法
图9A为用于实现本申请技术的一种视频解码器的示例的示意性框图。在图9A的示例中,视频解码器30包括输入端(或输入接口)302、用于概率估计的神经网络304、熵解码单元306、解码网络308和输出端(或输出接口)310。图9A所示的视频解码器30也可称为端到端的视频解码器或者基于端到端视频解码器的视频解码器。
解码器30可通过输入端302等接收输入数据,该输入数据包括当前图像的码流,作为熵解码单元306的输入。
其中,上述输入数据还包括参考图像的残差,或者,
多个参考图像的残差,或者,
多个参考图像的残差和当前图像的预测,或者,
多个参考图像的残差、多个参考图像的预测和当前图像的预测,作为用于概率估计的神经网络304的输入。
其中,当前图像的残差是基于当前图像的重建值与当前图像的预测得到的,具体地,当前图像的重建包括多个样本(像素点)的重建值,当前图像的预测包括多个样本(像素点)的预测值,将当前图像中样本(像素点)的重建值逐个与当前图像的预测中对应位置的样本的预测值作差,从而得到当前图像的残差。因此,在得到当前图像的残差后,可以将当前图像的残差包括的多个样本(像素点)的残差值逐个与当前图像的预测包括的多个样本(像素点)的预测值相加,从而得到当前图像的重建值。
在一个示例中,当前图像的预测是对参考图像的重建样本值进行预测得到的,或者对当前图像的低分辨率图像的重建样本值进行上采样,得到采样后的图像,该采样后的图像的分辨率与当前图像的分辨率相同,且采样后的图像为当前图像的预测,其中当前图像的低分辨率图像是通过对当前图像进行下采样得到的。
用于概率估计的神经网络304
用于概率估计的神经网络304用于基于输入的数据估算得到当前图像的残差包括的多个样本的残差值的概率分布。
在此需要指出的是,当前图像的残差包括的多个样本的残差值的概率分布的解释说明可参见图6A所示实施例中的相关描述,在此不再叙述。
在一个示例中,为了提高解码的准确率,用于概率估计的神经网络304的输入数据还包括超先验信息,该超先验信息用于表征当前图像的残差的更深层次的特征。
在此需要说明的是,用于概率估计的神经网络304的功能与编码侧的用于概率估计的神经网络210的功能一致,用于概率估计的神经网络304功能的描述可以参见编码侧的用于概率估计的神经网络210的相关描述,在此不再叙述。
可选地,视频解码器30还包括超先验解码网络312,该超先验解码网络312对熵解码单元306对超先验码流进行算术熵解码得到的整数化后的超先验信息的特征向量进行处理,得到上述超先验信息。超先验编码网络218的功能与超先验解码网络312的功能是相反的,因此超先验解码网络312的网络结构可参见超先验编码网络218的网络结构。
由于超先验信息对于用于概率估计的神经网络304来说是可选的,进而对于视频解码器30来说,超先验信息解码网络312也是可选的,因此超先验信息解码网络312使用虚线框表示的。
熵解码单元306
熵解码单元306用于将熵解码算法或方案应用于当前图像的码流和当前图像的残差包括的多个样本的残差值的概率分布,得到当前图像的残差的特征图。
熵解码单元306可采用应用编码器20的熵编码单元208的编码方案对应的解码算法或方案。
解码网络308
当前图像的残差的特征图经过解码网络308得到当前图像的残差,可以通过输出端310输出;后续视频解码器30可以通过当前图像的残差得到当前图像的重建样本值,即当前图像的重建。
可选地,当前图像的残差的特征图为多通道的特征图,该多通道的特征图包括N个解码特征图,其中N大于1。如图9B所示为N个解码特征图输入解码网络308的示意图。
在一个可能性中,解码网络308中的输入输出结构如图9C所示,解码网络308由P个网络层所构成,其中P>L且M、L、T、P均为正整数。其中第T解码特征图在解码网络308的网络层L后输入。第T解码特征图与第M解码特征图均为解码网络的输入,可以理解,可以有多个解码特征图在解码网络308中的不同网络层位置后作为输入,此处不做限定。
图9C中的任意一网络层可以为卷积层、反归一化层、非线性激活层等。
在一个可能性中,解码网络308中可以包括相加、相减等操作。
在一个可能性中,解码网络308中各网络层结构可以互为相同或者不同。
在一个可能性中,解码网络308的结构示例如图9D所示,可见示例中解码网络包含了12个网络层,具体包括了一个反归一化层、六个卷积层以及四个非线性激活层。其中m、n、t均为正整数,解码网络308的输出为当前图像的残差。第t解码特征图、第n解码特征图、第m解码特征图均为解码网络的输入。
图9E为用于实现本申请技术的一种视频解码器的示例的示意性框图。在图9E的示例中,视频解码器30包括输入端(或输入接口)302、用于概率估计的神经网络304、熵解码单元306、反量化单元314和输出端(或输出接口)310。图9E所示的视频解码器30也可称为端到端的视频解码器或者基于端到端视频解码器的视频解码器。
解码器30可通过输入端302等接收输入数据,该输入数据包括当前图像的码流,作为熵解码单元306的输入。
上述输入数据还包括参考图像的残差,或者,
多个参考图像的残差,或者,
多个参考图像的残差和当前图像的预测,或者,
多个参考图像的残差、多个参考图像的预测和当前图像的预测,作为用于概率估计的神经网络304的输入。
用于概率估计的神经网络304
用于概率估计的神经网络304的详细功能可以参见图6A中用于概率估计的神经网络210的相关描述,或者说图9E中的用于概率估计的神经网络304的功能与图6A中用于概率估计的神经网络210的功能相同,在此不再叙述。
熵解码单元306
熵解码单元306用于将熵解码算法或方案应用于当前图像的码流和当前图像的残差包括的多个样本的残差值的概率分布,得到量化后的当前图像的残差。
熵解码单元306可采用应用编码器20的熵编码单元208的编码方案对应的解码算法或方案。
反量化单元314
反量化单元314用于对量化后的当前图像的残差进行反量化,得到当前图像的残差,可以通过输出端310输出;后续视频解码器30可以通过当前图像的残差得到当前图像的重建样本值,即当前图像的重建。
可选地,上述量化可以是均匀量化或者是非均匀量化。
图9F为用于实现本申请技术的一种视频解码器的示例的示意性框图。在图9F的示例中,视频解码器30包括输入端(或输入接口)302、用于概率估计的神经网络304、熵解码单元306、反量化单元314、反变换单元316和输出端(或输出接口)310。图9F所示的视频解码器30也可称为端到端的视频解码器或者基于端到端视频解码器的视频解码器。
解码器30可通过输入端302等接收输入数据,该输入数据包括当前图像的码流,作为熵解码单元306的输入。
上述输入数据还包括参考图像的残差,或者,
多个参考图像的残差,或者,
多个参考图像的残差和当前图像的预测,或者,
多个参考图像的残差、多个参考图像的预测和当前图像的预测,作为用于概率估计的神经网络304的输入。
用于概率估计的神经网络304
用于概率估计的神经网络304的详细功能可以参见图6A中用于概率估计的神经网络210的相关描述,或者说图9F中的用于概率估计的神经网络304的功能与图6A中用于概率估计的神经网络210的功能相同,在此不再叙述。
熵解码单元306
熵解码单元306用于将熵解码算法或方案应用于当前图像的码流和当前图像的残差包括的多个样本的残差值的概率分布,得到经变换和量化后的当前图像的残差。
熵解码单元306可采用应用编码器20的熵编码单元208的编码方案对应的解码算法或方案。
反量化单元314
反量化单元314用于对经变换和量化后的当前图像的残差进行反量化,得到变换后的当前图像的残差。
可选地,上述量化可以是均匀量化或者是非均匀量化。
反变换单元316
反变换单元316用于对变换后的当前图像的残差进行反变换,得到当前图像的残差,可以通过输出端310输出;后续视频解码器30可以通过当前图像的残差得到当前图像的重建样本值,即当前图像的重建。
上述变换可以为DCT,此外还可以进行DST,或者DWT等。
应理解,在编码器20和解码器30中,可以对当前步骤的处理结果进一步处理,然后输出到下一步骤。例如,在编码器20和解码器30之后,可以对编码器20和解码器30的处理结果进行进一步的运算或处理,例如裁剪(clip)或移位(shift)运算或滤波处理。
在此需要指出的是,在解码侧所说的残差均为重建残差,该重建残差为图像的多个样本的重建值与该多个样本的预测值之间的多个差值。重建残差又可称为重建残差信号。
图10是示出基于本申请一种实施例的编码方法的过程1000的流程图。过程1000可由视频编码器20执行。过程1000描述为一系列的步骤或操作,应当理解的是,过程1000可以以各种顺序执行和/或同时发生,不限于图10所示的执行顺序。
如图10所示,该编码方法包括:
S1001、获取当前图像的残差。
其中,当前图像包括多个样本,当前图像的残差为当前图像包括的多个样本的原始值与预测值之间的差值。
具体地,获取当前图像的残差,包括:将当前图像的多个样本的原始值与当前图像的多个样本的预测值逐个作差得到当前图像的多个样本的残差值,当前图像的残差包括该多个样本的残差值,其中,当前图像的分辨率与当前图像的预测的分辨率相同。
在此需要指出的是,当前图像的预测包括当前图像的多个样本的预测值。当前图像的预测的确定过程可参见图6B和图6C的相关描述,在此不再叙述。
可选地,当前图像与参考图像可以为连续帧,也可以为非连续帧;当为非连续帧时,当前图像的时间戳与参考图像的时间戳的差值小于预设阈值。当前图像的低分辨率图像可通过对当前图像进行下采样得到。图像的时间戳可以为图像被采集得到的时刻。
S1002、通过神经网络对输入数据进行概率估计处理,得到当前图像的残差包括的多个样 本的残差值的概率分布,输入数据至少包括参考图像的残差,参考图像为在编码当前图像之前的已编码的图像。
在一个可能的设计中,神经网络为经训练得到的神经网络,本实施例的方法还包括:
获取训练的输入数据,训练的输入数据包括在编码样本图像之前的已编码图像,样本图像为已知的已编码图像;通过神经网络对训练的输入数据进行概率估计处理,得到样本图像的残差包括的多个样本的残差值的估计的概率分布;基于样本图像的残差包括的多个样本的残差值的已知概率分布和估计的概率分布,获取损失值;在损失值收敛于第一阈值,或神经网络当前的训练次数大于或等于第二阈值时,确定当前的神经网络为神经网络训练完成时对应的神经网络。
其中,训练的输入数据包括:
第一参考图像的残差包括的多个样本的残差值,或者;
第一参考图像的残差包括的多个样本的残差值和第二参考图像的残差包括的多个样本的残差值,或者;
样本图像的预测包括的多个样本的预测值、第一参考图像的残差包括的多个样本的残差值和第一参考图像的预测包括的多个样本的预测值;或者;
样本图像的预测包括的多个样本的预测值、第一参考图像的残差包括的多个样本的残差值、第一参考图像的预测包括的多个样本的预测值、第二参考图像的预测包括的多个样本的预测值和第二参考图像的残差包括的多个样本的残差值;
其中,第一参考图像和第二参考图像为在编码样本图像之前的已编码图像。
可选地,上述概率分布表示当前图像包括的多个样本的原始值与该多个样本的预测值之间的多个差值的概率分布,或者,表示当前图像的多个样本中各个样本点的残差值的概率分布,或者表示当前图像的多个样本的残差值对应的概率分布。
在一个可能的设计中,上述第一熵编码数据包括当前图像的残差,或者当前图像的残差的特征图,或者经变换和量化后的当前图像的残差。
可选地,当第一熵编码数据为当前图像的残差的特征图时,本实施例的方法还包括:将当前图像的残差经过编码网络进行特征提取,得到当前图像的残差的特征图。
在此需要指出的是,上述编码网络的具体结构功能可参见图6D、图6E、图6F和图6G的相关描述,在此不再叙述。
可选地,当第一熵编码数据为经变换和量化后的当前图像的残差时,本实施例的方法还包括:
对当前图像的残差进行变换和量化,得到经变换和量化后的当前图像的残差。其中,变换可以为DCT、DST、DWT或者其他变换;量化可以为均匀量化、非均匀量化或者其他量化。
在一个可能的设计中,上述神经网络包括卷积网络、均匀池化层和Softmax层,该卷积网络包括多个卷积层、ReLU层,其中,如图11A所示,通过神经网络对输入数据进行概率估计处理,得到当前图像的多个样本的残差值的概率分布,包括:
S1101、通过卷积网络提取输入数据的特征信息。
其中,上述输入数据包括:参考图像的残差;或者,
参考图像的残差、参考图像的预测和当前图像的预测;或者,
多个参考图像的残差;或者,
多个参考图像的残差、多个参考图像的预测和当前图像的预测。
其中,参考图像的预测是基于参考图像之前被编码图像的重建样本值进行预测得到的,或者对参考图像的低分辨率图像的重建样本值进行上采样,得到采样后的图像,该采样后的图像的分辨率与参考图像的分辨率相同,且该采样后的图像为参考图像的预测。其中,参考图像的低分辨率图像可通过对参考图像进行下采样得到。
S1102、基于均匀池化层和Softmax层对卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
在一个可能的设计中,在对当前图像的残差的特征图进行编码时,为了提高编码精度,在计算当前图像的残差包括的多个样本的残差值的概率分布时,引入超先验信息;上述神经网络包括卷积网络、均匀池化层、第一卷积层、拼接层和Softmax层,卷积网络包括多个卷积层、ReLU层,其中,如图11B所示,通过神经网络对输入数据进行概率估计处理,得到当前图像的多个样本的残差值的概率分布,包括:
S1103、通过卷积网络提取参考图像的残差和超先验信息的特征信息。
其中,卷积神经网络分别提取参考图像的残差的特征信息和超先验信息的特征信息。
S1104、基于均匀池化层、第一卷积层、拼接层和Softmax层对卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
在此需要指出的是,通过神经网络对输入数据的处理过程具体可以参见用于概率估计的神经网络210的功能的描述,在此不再叙述。
在一个可能的设计中,通过如下方式获取超先验信息:
将当前图像的残差的特征图输入超先验编码网络中进行特征提取,得到超先验信息的特征向量;对该超先验信息的特征向量进行取整,得到整数化后的超先验信息的特征向量;将整数化后的超先验信息的特征向量输入超先验解码网络进行处理得到超先验信息。
其中,超先验编码网络的功能和超先验解码网络的功能是相反的。
为了使解码侧获取超先验信息,在得到整数化后的超先验信息的特征向量后,对该整数化后的超先验信息的特征向量进行算术熵编码,得到超先验码流。
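上述超先验信息的获取与编码流程可示意如下(Python/PyTorch);其中hyper_enc、hyper_dec与entropy_encode均为示例性假设的接口,并非本申请限定的实现:

```python
import torch

def make_hyper_prior(feat_y, hyper_enc, hyper_dec, entropy_encode):
    """feat_y为当前图像的残差的特征图。"""
    z = hyper_enc(feat_y)              # 超先验编码网络:提取超先验信息的特征向量Z
    z_hat = torch.round(z)             # 取整,得到整数化后的特征向量Z'
    bitstream = entropy_encode(z_hat)  # 对Z'算术熵编码,得到超先验码流供解码侧使用
    hyper_info = hyper_dec(z_hat)      # 超先验解码网络:由Z'得到超先验信息
    return hyper_info, bitstream
```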
S1003、基于当前图像的残差包括的多个样本的残差值的概率分布对第一熵编码数据进行算术熵编码,得到当前图像的码流,第一熵编码数据表示当前图像的残差。
可以看出,在本申请的实施例中,当前图像的残差与参考图像的残差具有相关性,因此将至少包括参考图像的残差的数据输入到神经网络中,得到当前图像的残差包括的多个样本的残差值的概率分布,利用该概率分布对表示当前图像的残差的第一熵编码数据进行编码,有利于提高编码的准确率。在确定当前图像的残差包括的多个样本的残差值的概率分布时,引入超先验信息,可以得到精度更高的当前图像的残差包括的多个样本的残差值的概率分布,从而提高编码精度。
图12是示出基于本申请一种实施例的解码方法的过程1200的流程图。过程1200可由视频解码器30执行。过程1200描述为一系列的步骤或操作,应当理解的是,过程1200可以以各种顺序执行和/或同时发生,不限于图12所示的执行顺序。
如图12所示,该解码方法包括:
S1201、接收当前图像的码流。
S1202、通过神经网络对输入数据进行概率估计处理,得到当前图像的残差包括的多个样本的残差值的概率分布,输入数据至少包括参考图像的残差,参考图像为在解码当前图像之前的已解码图像。
其中,参考图像包括多个样本,参考图像的残差为该多个样本的重建值与该多个样本的预测值之间的多个差值。
参考图像的预测包括多个样本的预测值。参考图像的预测是基于在参考图像之前已解码图像的重建样本值对参考图像进行预测得到的;或者对参考图像的低分辨率图像的重建样本值进行上采样,得到采样后的图像,该采样后的图像的分辨率与参考图像的分辨率相同,且该采样后的图像为参考图像的预测。其中,参考图像的低分辨率图像可通过对参考图像进行下采样得到。
可选地,参考图像与当前图像可以为连续帧,也可以为非连续帧;比如参考图像为第n-1帧图像,当前图像为第n帧图像;或者参考图像为第n-2帧图像,当前图像为第n帧图像;再比如当前图像的时间戳与参考图像的时间戳的差值小于预设阈值。当前图像的低分辨率图像可通过对当前图像进行下采样得到。图像的时间戳可以为图像被采集得到的时刻。
在一个可能的设计中,上述神经网络为经训练得到的神经网络,本实施例的方法还包括:
获取训练的输入数据,训练的输入数据包括在解码样本图像之前的已解码图像,样本图像为已知的已解码图像,通过神经网络对训练的输入数据进行概率估计处理,得到样本图像的残差包括的多个样本的残差值的估计的概率分布;基于样本图像的残差包括的多个样本的残差值的已知概率分布和估计的概率分布,获取损失值;在损失值收敛于第一阈值,或神经网络当前的训练次数大于或等于第二阈值时,确定当前的神经网络为神经网络训练完成时对应的神经网络。
其中,训练的输入数据包括:第一参考图像的残差包括的多个样本的残差值,或者;
第一参考图像的残差包括的多个样本的残差值和第二参考图像的残差包括的多个样本的残差值,或者;
样本图像的预测包括的多个样本的预测值、第一参考图像的残差包括的多个样本的残差值和第一参考图像的预测包括的多个样本的预测值;或者;
样本图像的预测包括的多个样本的预测值、第一参考图像的残差包括的多个样本的残差值、第一参考图像的预测包括的多个样本的预测值、第二参考图像的预测包括的多个样本的预测值和第二参考图像的残差包括的多个样本的残差值;
其中,第一参考图像和第二参考图像为在解码样本图像之前的已解码图像。具体训练过程可参见上述方式。
可选地,上述概率分布表示当前图像包括的多个样本的原始值与该多个样本的预测值之间的多个差值的概率分布,或者,表示当前图像的多个样本中各个样本点的残差值的概率分布,或者表示当前图像的多个样本的残差值对应的概率分布。
在一个可能的设计中,上述神经网络包括卷积网络、均匀池化层和Softmax层,该卷积网络包括多个卷积层、ReLU层,其中,如图12A所示,通过神经网络对输入数据进行概率估计处理,得到当前图像的多个样本的残差值的概率分布,包括:
S1201a、通过卷积网络提取输入数据的特征信息。
其中,上述输入数据包括:参考图像的残差;或者,
参考图像的残差、参考图像的预测和当前图像的预测;或者,
多个参考图像的残差;或者,
多个参考图像的残差、多个参考图像的预测和当前图像的预测。
其中,参考图像的预测是基于参考图像之前被解码图像的重建样本值进行预测得到的,或者对参考图像的低分辨率图像的重建样本值进行上采样,得到采样后的图像,该采样后的图像的分辨率与参考图像的分辨率相同,且该采样后的图像为参考图像的预测。其中,参考图像的低分辨率图像可通过对参考图像进行下采样得到。
S1202a、基于均匀池化层和Softmax层对卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
在一个可能的设计中,在对当前图像的残差的特征图进行解码时,为了提高解码精度,在计算当前图像的残差包括的多个样本的残差值的概率分布时,引入超先验信息;上述神经网络包括卷积网络、均匀池化层、第一卷积层、拼接层和Softmax层,卷积网络包括多个卷积层、ReLU层,其中,如图12B所示,通过神经网络对输入数据进行概率估计处理,得到当前图像的多个样本的残差值的概率分布,包括:
S1203a、通过卷积网络提取参考图像的残差和超先验信息的特征信息。
其中,卷积神经网络分别提取参考图像的残差的特征信息和超先验信息的特征信息。
S1204a、基于均匀池化层、第一卷积层、拼接层和Softmax层对卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
在此需要指出的是,通过神经网络对输入数据的处理过程具体可以参见用于概率估计的神经网络210的功能的描述,在此不再叙述。
在一个可能的设计中,通过如下方式获取超先验信息:
对超先验码流进行算术熵解码,得到整数化后的超先验信息的特征向量,再将该整数化后的超先验信息的特征向量经过超先验解码网络进行处理得到超先验信息。其中,超先验编码网络的功能和超先验解码网络的功能是相反的。
S1203、基于当前图像的残差包括的多个样本的残差值的概率分布对所述码流进行算术熵解码,得到第一熵解码数据,该第一熵解码数据表示当前图像的残差;基于所述当前图像的残差,得到当前图像的重建样本值。
需要指出的是,第一熵解码数据可以是从码流中解码出的部分熵解码数据;码流中不仅包含残差数据,还可以从码流中熵解码出其他熵解码数据,所以加了“第一”作为限定。
在一个可能的设计中,上述第一熵解码数据包括当前图像的残差,或者当前图像的残差的特征图,或者经变换和量化后的当前图像的残差。
可选地,当第一熵解码数据为当前图像的残差的特征图时,本实施例的方法还包括:将当前图像的残差的特征图经过解码网络得到当前图像的残差。
在此需要指出的是,上述解码网络的功能与编码网络的功能相反,因此解码网络的结构可以参照图6D、图6E、图6F和图6G所示的编码网络的结构,在此不再叙述。
可选地,当第一熵解码数据为经变换和量化后的当前图像的残差时,本实施例的方法还包括:
对经变换和量化后的当前图像的残差进行反量化和反变换,得到当前图像的残差。其中,反变换可以为反DCT、反DST、反DWT或者其他反变换;反量化可以为均匀反量化、非均匀反量化或者其他反量化。
在此需要说明的是,算术熵解码与算术熵编码是一对互逆的过程,算术熵解码过程可参见上述熵编码过程,在此不再叙述。
可以看出,在本申请的实施例中,当前图像的残差与参考图像的残差具有相关性,因此将至少包括参考图像的残差的数据输入到神经网络中,得到当前图像的残差包括的多个样本的残差值的概率分布,利用该概率分布对表示当前图像的残差的第一熵解码数据进行解码,有利于提高解码的准确率。在确定当前图像的残差包括的多个样本的残差值的概率分布时,引入超先验信息,可以得到精度更高的当前图像的残差包括的多个样本的残差值的概率分布,从而提高解码精度。
下面结合附图综合介绍编码过程和解码过程。
参见图13,图13为本申请实施例提供的一种编码和解码过程的示意图。如图13所示,编码装置获取当前图像的原始残差,该当前图像的原始残差为当前图像包括的多个样本的原始值与该多个样本的预测值之间的多个差值;当前图像的原始残差的具体过程可参见上述相关描述,在此不再叙述;编码装置利用编码网络对当前图像的原始残差进行特征提取,得到当前图像的原始残差的特征图Y;该残差的特征图Y经取整模块处理,得到整数化后的特征图Y’;编码装置基于当前图像的残差包括的多个样本的残差值的概率分布,利用算术编码器对整数化后的特征图Y’进行算术熵编码,得到当前图像的码流;在此之前,编码装置将参考图像的重建残差输入到用于概率估计的神经网络,得到当前图像的残差包括的多个样本的残差值的概率分布;需要指出的是,本实施例所说的当前图像的残差也就是上述当前图像的原始残差;参考图像的重建残差为参考图像包括的多个样本的重建值与该多个样本的预测值之间的多个差值。
可选地,编码装置将参考图像的重建残差和超先验信息输入到用于概率估计的神经网络(即上述神经网络),得到当前图像的残差包括的多个样本的残差值的概率分布;在此之前,编码装置利用超先验编码网络对当前图像的残差的特征图进行超先验特征提取,得到超先验信息的特征向量Z;对超先验信息的特征向量Z进行取整,得到整数化后的超先验信息的特征向量Z’;编码装置再利用超先验解码网络对整数化后的超先验信息的特征向量进行处理,得到当前图像的超先验信息。
为了方便解码装置在解码时使用超先验信息,编码装置利用算术编码器对整数化后的超先验信息的特征向量进行编码,得到超先验信息的码流。
解码装置将参考图像的重建残差输入用于概率估计的神经网络(即上述神经网络)中进行处理,得到当前图像的重建残差包括的多个样本的残差值的概率分布;解码装置在得到当前图像的码流后,基于当前图像的重建残差包括的多个样本的残差值的概率分布,利用算术解码器对当前图像的码流进行算术熵解码,得到当前图像的重建残差的特征图;利用解码网络对当前图像的重建残差的特征图进行处理,得到当前图像的重建残差。
可选地,解码装置利用算术解码器对超先验信息的码流进行解码,得到超先验信息的特征向量,利用超先验解码网络对超先验信息的特征向量进行处理,得到超先验信息;将超先验信息和参考帧图像的重建残差输入到用于概率估计的神经网络中进行处理,得到当前图像的残差包括的多个样本的残差值的概率分布。
参见图14,图14为本申请实施例提供的另一种编码和解码过程的示意图。如图14所示,编码装置获取当前图像的原始残差,该当前图像的原始残差为当前图像包括的多个样本的原始值与该多个样本的预测值之间的多个差值;当前图像的原始残差的具体过程可参见上述相关描述,在此不再叙述;编码装置利用编码网络对当前图像的残差进行特征提取,得到当前图像的残差ResiPicN的特征图Y;该当前图像的残差的特征图经取整模块处理,得到整数化后的特征图Y';编码装置基于当前图像的残差包括的多个样本的残差值的概率分布,利用算术编码器对整数化后的特征图Y'进行算术熵编码,得到当前图像的码流;在此之前,编码装置利用循环神经网络,基于多个参考图像的重建残差得到隐藏信息Hn,具体过程可参见上述实施例的相关描述,在此不再叙述;编码装置将隐藏信息Hn输入到用于概率估计的神经网络(即上述神经网络),得到当前图像的残差包括的多个样本的残差值的概率分布;需要指出的是,本实施例所说的当前图像的残差也就是上述当前图像的原始残差;参考图像的重建残差为参考图像包括的多个样本的重建值与该多个样本的预测值之间的多个差值。
可选地,编码装置将隐藏信息Hn和超先验信息输入到用于概率估计的神经网络,得到当前图像的残差包括的多个样本的残差值的概率分布;在此之前,编码装置利用超先验编码网络对当前图像的残差的特征图进行超先验特征提取,得到超先验信息的特征向量Z;对超先验信息的特征向量Z进行取整,得到整数化后的超先验信息的特征向量Z’;编码装置再利用超先验解码网络对整数化后的超先验信息的特征向量进行处理,得到当前图像的超先验信息。
为了方便解码装置在解码时使用超先验信息,编码装置利用算术编码器对整数化后的超先验信息的特征向量进行编码,得到超先验信息的码流。
解码装置利用循环神经网络,基于多个参考图像的重建残差得到隐藏信息Hn,具体过程可参见上述实施例的相关描述,在此不再叙述;将隐藏信息Hn输入用于概率估计的神经网络中进行处理,得到当前图像的重建残差包括的多个样本的残差值的概率分布;解码装置在得到当前图像的码流后,基于当前图像的重建残差包括的多个样本的残差值的概率分布,利用算术解码器对当前图像的码流进行算术熵解码,得到当前图像的重建残差的特征图;利用解码网络对当前图像的重建残差的特征图进行处理,得到当前图像的重建残差。
可选地,解码装置利用算术解码器对超先验信息的码流进行解码,得到整数化后超先验信息的特征向量,利用超先验解码网络对整数化后的超先验信息的特征向量进行处理,得到超先验信息;将超先验信息和隐藏信息Hn输入到用于概率估计的神经网络中进行处理,得到当前图像的残差包括的多个样本的残差值的概率分布。
参见图15,图15为本申请实施例提供的另一种编码和解码过程的示意图。如图15所示,编码装置获取当前图像的原始残差,该当前图像的原始残差为当前图像包括的多个样本的原始值与该多个样本的预测值之间的多个差值,当前图像的原始残差的具体过程可参见上述相关描述,在此不再叙述;编码装置利用编码网络对当前图像的原始残差进行特征提取,得到当前图像的原始残差的特征图Y;该当前图像的原始残差的特征图Y经过取整模块处理,得到整数化后的特征图Y’;编码装置基于当前图像的残差包括的多个样本的残差值的概率分布,利用算术编码器对整数化后的特征图Y’进行算术熵编码,得到当前图像的码流;在此之前,编码装置将参考图像的重建残差、参考图像的预测和当前图像的预测输入到用于概率估计的神经网络,得到当前图像的残差包括的多个样本的残差值的概率分布;需要指出的是,本实施例所说的当前图像的残差也就是上述当前图像的原始残差;参考图像的重建残差为参考图像包括的多个样本的重建值与该多个样本的预测值之间的多个差值。
可选地,编码装置将参考图像的重建残差、参考图像的预测、当前图像的预测和超先验信息输入到用于概率估计的神经网络,得到当前图像的残差包括的多个样本的残差值的概率分布;在此之前,编码装置利用超先验编码网络对残差的特征图进行超先验特征提取,得到超先验信息的特征向量Z;对超先验信息的特征向量Z进行取整,得到整数化后的超先验信息的特征向量Z’;编码装置再利用超先验解码网络对整数化后的超先验信息的特征向量进行处理,得到当前图像的超先验信息。
为了方便解码装置在解码时使用超先验信息,编码装置利用算术编码器对整数化后的超先验信息的特征向量进行编码,得到超先验信息的码流。
解码装置将参考图像的重建残差、参考图像的预测、当前图像的预测输入用于概率估计的神经网络中进行处理,得到当前图像的重建残差包括的多个样本的残差值的概率分布;解码装置在得到当前图像的码流后,基于当前图像的重建残差包括的多个样本的残差值的概率分布,利用算术解码器对当前图像的码流进行算术熵解码,得到当前图像的重建残差的特征图;利用解码网络对当前图像的重建残差的特征图进行处理,得到当前图像的重建残差。
可选地,解码装置利用算术解码器对超先验信息的码流进行解码,得到整数化后的超先验信息的特征向量,利用超先验解码网络对整数化后的超先验信息的特征向量进行处理,得到超先验信息;将超先验信息、参考图像的重建残差、参考图像的预测、当前图像的预测输入到用于概率估计的神经网络中进行处理,得到当前图像的残差包括的多个样本的残差值的概率分布。
参见图16,图16为本申请实施例提供的另一种编码和解码过程的示意图。如图16所示,编码装置获取当前图像的原始残差,该当前图像的原始残差为当前图像包括的多个样本的原始值与该多个样本的预测值之间的多个差值,当前图像的原始残差的具体过程可参见上述相关描述,在此不再叙述;编码装置利用编码网络对当前图像的原始残差进行特征提取,得到当前图像的原始残差的特征图Y;该当前图像的原始残差的特征图Y经取整模块处理,得到整数化后的特征图Y’;编码装置基于当前图像的残差包括的多个样本的残差值的概率分布,利用算术编码器对整数化后的特征图Y’进行算术熵编码,得到当前图像的码流;在此之前,编码装置利用循环神经网络,基于多个参考图像的重建残差和多个参考图像的预测得到隐藏信息Hn,具体过程可参见上述实施例的相关描述,在此不再叙述;编码装置并将隐藏信息Hn和当前图像的预测输入到用于概率估计的神经网络,得到当前图像的残差包括的多个样本的残差值的概率分布;需要指出的是,本实施例所说的当前图像的残差也就是上述当前图像的原始残差;参考图像的重建残差为参考图像包括的多个样本的重建值与该多个样本的预测值之间的多个差值。
可选地,编码装置将隐藏信息Hn、当前图像的预测和超先验信息输入到用于概率估计的神经网络,得到当前图像的残差包括的多个样本的残差值的概率分布;在此之前,编码装置利用超先验编码网络对当前图像的残差的特征图进行超先验特征提取,得到超先验信息的特征向量Z;对超先验信息的特征向量Z进行取整,得到整数化后的超先验信息的特征向量Z’;编码装置再利用超先验解码网络对整数化后的超先验信息的特征向量进行处理,得到当前图像的超先验信息。
为了方便解码装置在解码时使用超先验信息,编码装置利用算术编码器对整数化后的超先验信息的特征向量进行编码,得到超先验信息的码流。
解码装置利用循环神经网络,基于多个参考图像的重建残差和多个参考图像的预测得到隐藏信息Hn,具体过程可参见上述实施例的相关描述,在此不再叙述;将隐藏信息Hn和当前图像的预测输入用于概率估计的神经网络中进行处理,得到当前图像的重建残差包括的多个样本的残差值的概率分布;解码装置在得到当前图像的码流后,基于当前图像的重建残差包括的多个样本的残差值的概率分布,利用算术解码器对当前图像的码流进行算术熵解码,得到当前图像的重建残差的特征图;利用解码网络对当前图像的重建残差的特征图进行处理,得到当前图像的重建残差。
可选地,解码装置利用算术解码器对超先验信息的码流进行解码,得到整数化后的超先验信息的特征向量,利用超先验解码网络对整数化后的超先验信息的特征向量进行处理,得到超先验信息;将超先验信息、隐藏信息Hn和当前图像的预测输入到用于概率估计的神经网络中进行处理,得到当前图像的残差包括的多个样本的残差值的概率分布。
参见图17,图17为本申请实施例提供的一种编码和解码过程的示意图。如图17所示,编码装置获取当前图像的原始残差,该当前图像的原始残差为当前图像包括的多个样本的原始值与该多个样本的预测值之间的多个差值;当前图像的原始残差的具体过程可参见上述相关描述,在此不再叙述;编码装置对当前图像的原始残差进行量化,得到量化后的当前图像的原始残差;编码装置基于当前图像的残差包括的多个样本的残差值的概率分布,利用算术编码器对量化后的当前图像的残差进行算术熵编码,得到当前图像的码流;需要指出的是,本实施例所说的当前图像的残差也就是上述当前图像的原始残差。
其中,用于概率估计的神经网络的输入数据包括以下情况:
如图13所示,编码装置将参考图像的重建残差输入到用于概率估计的神经网络,得到当前图像的残差包括的多个样本的残差值的概率分布;
或者如图14所示,编码装置利用循环神经网络,基于多张参考图像的重建残差得到隐藏信息Hn,具体过程可参见上述实施例的相关描述,在此不再叙述;编码装置并将隐藏信息Hn输入到用于概率估计的神经网络,得到当前图像的残差包括的多个样本的残差值的概率分布;
或者如图15所示,编码装置将参考图像的重建残差、参考图像的预测和当前图像的预测输入到用于概率估计的神经网络,得到当前图像的残差包括的多个样本的残差值的概率分布,
或者如图16所示,编码装置利用循环神经网络,基于多张参考图像的重建残差和多张参考图像的预测得到隐藏信息Hn,具体过程可参见上述实施例的相关描述,在此不再叙述;编码装置并将隐藏信息Hn和当前图像的预测输入到用于概率估计的神经网络,得到当前图像的残差包括的多个样本的残差值的概率分布。
解码装置在得到当前图像的码流后,基于当前图像的残差值的概率分布,利用算术解码器对当前图像的码流进行算术熵解码,得到量化后的当前图像的残差;对量化后的当前图像的残差进行反量化,得到当前图像的重建残差;其中,如图13所示,解码装置将参考图像的重建残差输入用于概率估计的神经网络中进行处理,得到当前图像的残差值的概率分布;或者,如图14所示,解码装置利用循环神经网络,基于多张参考图像的重建残差得到隐藏信息Hn,具体过程可参见上述实施例的相关描述,在此不再叙述;将隐藏信息Hn输入用于概率估计的神经网络中进行处理,得到当前图像的残差包括的多个样本的残差值的概率分布;或者,如图15所示,解码装置将参考图像的重建残差、参考图像的预测和当前图像的预测输入用于概率估计的神经网络中进行处理,得到当前图像的残差包括的多个样本的残差值的概率分布;或者,如图16所示,解码装置利用循环神经网络,基于多张参考图像的重建残差和多张参考图像的预测得到隐藏信息Hn,具体过程可参见上述实施例的相关描述,在此不再叙述;将隐藏信息Hn和当前图像的预测输入用于概率估计的神经网络中进行处理,得到当前图像的残差包括的多个样本的残差值的概率分布。
参见图18,图18为本申请实施例提供的一种编码和解码过程的示意图。如图18所示,编码装置获取当前图像的原始残差,该当前图像的原始残差为当前图像包括的多个样本的原始值与该多个样本的预测值之间的多个差值;当前图像的原始残差的具体过程可参见上述相关描述,在此不再叙述;编码装置对当前图像的原始残差进行变换(例如DCT,或者DST,DWT)和量化,得到经变换和量化后的当前图像的原始残差;编码装置基于当前图像的残差包括的多个样本的残差值的概率分布,利用算术编码器对经变换和量化后的当前图像的原始残差进行算术熵编码,得到当前图像的码流;需要指出的是,本实施例所说的当前图像的残差也就是上述当前图像的原始残差;
其中,用于概率估计的神经网络的输入数据包括以下情况:
如图13所示,编码装置将参考图像的重建残差输入到用于概率估计的神经网络,得到当前图像的残差包括的多个样本的残差值的概率分布;
或者如图14所示,编码装置利用循环神经网络,基于多张参考图像的重建残差得到隐藏信息Hn,具体过程可参见上述实施例的相关描述,在此不再叙述;编码装置并将隐藏信息Hn输入到用于概率估计的神经网络,得到当前图像的残差包括的多个样本的残差值的概率分布;
或者如图15所示,编码装置将参考图像的重建残差、参考图像的预测和当前图像的预测输入到用于概率估计的神经网络,得到当前图像的残差包括的多个样本的残差值的概率分布,
或者如图16所示,编码装置利用循环神经网络,基于多张参考图像的重建残差和多张参考图像的预测得到隐藏信息Hn,具体过程可参见上述实施例的相关描述,在此不再叙述;编码装置并将隐藏信息Hn和当前图像的预测输入到用于概率估计的神经网络,得到当前图像的残差包括的多个样本的残差值的概率分布。
解码装置在得到当前图像的码流后,基于当前图像的残差值的概率分布,利用算术解码器对当前图像的码流进行算术熵解码,得到经变换和量化后的当前图像的残差;对经变换和量化后的当前图像的残差进行反量化和反变换,得到当前图像的重建残差;其中,如图13所示,解码装置将参考图像的重建残差输入用于概率估计的神经网络中进行处理,得到当前图像的残差值的概率分布;或者,如图14所示,解码装置利用循环神经网络,基于多张参考图像的重建残差得到隐藏信息Hn,具体过程可参见上述实施例的相关描述,在此不再叙述;将隐藏信息Hn输入用于概率估计的神经网络中进行处理,得到当前图像的残差包括的多个样本的残差值的概率分布;或者,如图15所示,解码装置将参考图像的重建残差、参考图像的预测和当前图像的预测输入用于概率估计的神经网络中进行处理,得到当前图像的残差包括的多个样本的残差值的概率分布;或者,如图16所示,解码装置利用循环神经网络,基于多张参考图像的重建残差和多张参考图像的预测得到隐藏信息Hn,具体过程可参见上述实施例的相关描述,在此不再叙述;将隐藏信息Hn和当前图像的预测输入用于概率估计的神经网络中进行处理,得到当前图像的残差包括的多个样本的残差值的概率分布。
在此需要说明的是,上述编码网络、解码网络可通过神经网络实现,比如卷积神经网络实现。
本领域技术人员能够领会,结合本文公开描述的各种说明性逻辑框、模块和算法步骤所描述的功能可以硬件、软件、固件或其任何组合来实施。如果以软件来实施,那么各种说明性逻辑框、模块、和步骤描述的功能可作为一或多个指令或代码在计算机可读媒体上存储或传输,且由基于硬件的处理单元执行。计算机可读媒体可包含计算机可读存储媒体,其对应于有形媒体,例如数据存储媒体,或包括任何促进将计算机程序从一处传送到另一处的媒体(例如,基于通信协议)的通信媒体。以此方式,计算机可读媒体大体上可对应于(1)非暂时性的有形计算机可读存储媒体,或(2)通信媒体,例如信号或载波。数据存储媒体可为可由 一或多个计算机或一或多个处理器存取以检索用于实施本申请中描述的技术的指令、代码和/或数据结构的任何可用媒体。计算机程序产品可包含计算机可读媒体。
作为实例而非限制,此类计算机可读存储媒体可包括RAM、ROM、EEPROM、CD-ROM或其它光盘存储装置、磁盘存储装置或其它磁性存储装置、快闪存储器或可用来存储指令或数据结构的形式的所要程序代码并且可由计算机存取的任何其它媒体。并且,任何连接被恰当地称作计算机可读媒体。举例来说,如果使用同轴缆线、光纤缆线、双绞线、数字订户线(DSL)或例如红外线、无线电和微波等无线技术从网站、服务器或其它远程源传输指令,那么同轴缆线、光纤缆线、双绞线、DSL或例如红外线、无线电和微波等无线技术包含在媒体的定义中。但是,应理解,所述计算机可读存储媒体和数据存储媒体并不包括连接、载波、信号或其它暂时媒体,而是实际上针对于非暂时性有形存储媒体。如本文中所使用,磁盘和光盘包含压缩光盘(CD)、激光光盘、光学光盘、数字多功能光盘(DVD)和蓝光光盘,其中磁盘通常以磁性方式再现数据,而光盘利用激光以光学方式再现数据。以上各项的组合也应包含在计算机可读媒体的范围内。
可通过例如一或多个数字信号处理器(DSP)、通用微处理器、专用集成电路(ASIC)、现场可编程逻辑阵列(FPGA)或其它等效集成或离散逻辑电路等一或多个处理器来执行指令。因此,如本文中所使用的术语“处理器”可指前述结构或适合于实施本文中所描述的技术的任一其它结构中的任一者。另外,在一些方面中,本文中所描述的各种说明性逻辑框、模块、和步骤所描述的功能可以提供于经配置以用于编码和解码的专用硬件和/或软件模块内,或者并入在组合编解码器中。而且,所述技术可完全实施于一或多个电路或逻辑元件中。
本申请的技术可在各种各样的装置或设备中实施,包含无线手持机、集成电路(IC)或一组IC(例如,芯片组)。本申请中描述各种组件、模块或单元是为了强调用于执行所揭示的技术的装置的功能方面,但未必需要由不同硬件单元实现。实际上,如上文所描述,各种单元可结合合适的软件和/或固件组合在编码解码器硬件单元中,或者通过互操作硬件单元(包含如上文所描述的一或多个处理器)来提供。
以上所述,仅为本申请示例性的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到的变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应该以权利要求的保护范围为准。

Claims (29)

  1. 一种视频图像的解码方法,其特征在于,包括:
    接收当前图像的码流;
    通过经训练得到的神经网络对输入数据进行概率估计处理,得到所述当前图像的残差包括的多个样本的残差值的概率分布,所述输入数据至少包括参考图像的残差,所述参考图像为在解码所述当前图像之前的已解码图像;
    基于所述当前图像的残差包括的多个样本的残差值的概率分布对所述码流进行算术熵解码,得到第一熵解码数据,所述第一熵解码数据表示当前图像的残差;
    基于所述当前图像的残差,得到当前图像的重建样本值。
  2. 根据权利要求1所述的方法,其特征在于,所述神经网络为经如下过程训练得到的神经网络,
    所述方法还包括:
    获取所述训练的输入数据,所述训练的输入数据包括在解码样本图像之前的已解码图像,所述样本图像为已知的已解码图像,
    通过神经网络对所述训练的输入数据进行概率估计处理,得到所述样本图像的残差包括的多个样本的残差值的估计的概率分布;
    基于所述样本图像的残差包括的多个样本的残差值的已知概率分布和所述估计的概率分布,获取损失值;
    在所述损失值收敛于第一阈值,或所述神经网络当前的训练次数大于或等于第二阈值时,确定当前的神经网络为所述神经网络训练完成时对应的神经网络。
  3. 根据权利要求2所述的方法,其特征在于,所述训练的输入数据包括:
    第一参考图像的残差包括的多个样本的残差值,或者;
    所述第一参考图像的残差包括的多个样本的残差值和第二参考图像的残差包括的多个样本的残差值,或者;
    所述样本图像的预测包括的多个样本的预测值、所述第一参考图像的残差包括的多个样本的残差值和所述第一参考图像的预测包括的多个样本的预测值;或者;
    所述样本图像的预测包括的多个样本的预测值、所述第一参考图像的残差包括的多个样本的残差值、所述第一参考图像的预测包括的多个样本的预测值、第二参考图像的预测包括的多个样本的预测值和第二参考图像的残差包括的多个样本的残差值;
    其中,所述第一参考图像和第二参考图像为在解码所述样本图像之前的已解码图像。
  4. 根据权利要求1-3任一项所述的方法,其特征在于,
    所述概率分布表示所述当前图像的残差包括的多个样本的重建值与所述多个样本的预测值之间的多个差值的概率分布;或者,
    所述概率分布表示所述当前图像的多个样本中各个样本的残差值的概率分布;或者,
    所述概率分布表示所述当前图像的多个样本的残差值的概率分布。
  5. 根据权利要求1-4任一项所述的方法,其特征在于,所述第一熵解码数据为所述当前图像的残差,或者所述当前图像的残差的特征图,或者经变换和量化后的所述当前图像的残差。
  6. 根据权利要求5所述的方法,其特征在于,所述第一熵解码数据为所述当前图像的残差的特征图,所述方法还包括:
    将所述当前图像的残差的特征图经过解码网络获得所述当前图像的残差。
  7. 根据权利要求5所述的方法,其特征在于,所述第一熵解码数据为所述经变换和量化后的所述当前图像的残差,所述方法还包括:
    对所述经变换和量化后的所述当前图像的残差进行反变换和反量化,得到所述当前图像的残差。
  8. 根据权利要求6所述的方法,其特征在于,所述输入数据进一步包括超先验信息,
    所述神经网络包括卷积网络、均匀池化层、第一卷积层、拼接concat层和归一化指数函数Softmax层;所述卷积网络包括多个卷积层、激励ReLU层,其中:
    所述通过神经网络对输入数据进行概率估计处理,得到所述当前图像的残差的概率分布,包括:
    通过所述卷积网络分别提取所述参考图像的残差和所述超先验信息中的特征信息;通过所述拼接concat层、所述第一卷积层、所述均匀池化层和所述Softmax层对所述卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
  9. 根据权利要求1-7任一项所述的方法,其特征在于,所述输入数据还包括所述参考图像的预测和所述当前图像的预测;
    所述神经网络包括卷积网络、均匀池化层和归一化指数函数Softmax层;所述卷积网络包括多个卷积层、激励ReLU层,其中:
    所述通过神经网络对输入数据进行概率估计处理,得到所述当前图像的残差包括的多个样本的残差值的概率分布,包括:
    通过所述卷积网络提取所述参考图像的残差、所述参考图像的预测和所述当前图像的预测的特征信息;通过所述均匀池化层和所述Softmax层对所述卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
  10. 根据权利要求1-7任一项所述的方法,其特征在于,所述输入数据包括:
    多个参考图像的残差,所述多个参考图像为在解码所述当前图像之前的多个已解码图像;
    所述神经网络包括卷积网络、均匀池化层和归一化指数函数Softmax层;所述卷积网络包括多个卷积层、激励ReLU层,其中:
    所述通过神经网络对输入数据进行概率估计处理,得到所述当前图像的残差包括的多个样本的残差值的概率分布,包括:
    通过所述卷积网络提取所述多个参考图像的残差的特征信息;通过所述均匀池化层和所述Softmax层对所述卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
  11. 根据权利要求1-7任一项所述的方法,其特征在于,所述输入数据包括:
    多个参考图像的残差、所述多个参考图像的预测和当前图像的预测,所述多个参考图像为在解码所述当前图像之前的多个已解码图像;
    所述神经网络包括卷积网络、均匀池化层和归一化指数函数Softmax层;所述卷积网络包括多个卷积层、激励ReLU层,其中:
    所述通过神经网络对输入数据进行概率估计处理,得到所述当前图像的残差包括的多个样本的残差值的概率分布,包括:
    通过所述卷积网络提取所述多个参考图像的残差和所述多个参考图像的预测和所述当前图像的预测的特征信息;通过所述均匀池化层和所述Softmax层对所述卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
  12. 一种视频图像的编码方法,其特征在于,包括:
    获取当前图像的残差,所述当前图像包括多个样本,所述残差为所述样本的原始值与所述样本的预测值之间的差值;
    通过经训练得到的神经网络对输入数据进行概率估计处理,得到所述当前图像的残差包括的多个样本的残差值的概率分布,所述输入数据至少包括参考图像的残差,所述参考图像为在编码所述当前图像之前的已编码图像;
    根据所述当前图像的残差包括的多个样本的残差值的概率分布对第一熵编码数据进行算术熵编码,得到当前图像的码流,所述第一熵编码数据表示当前图像的残差。
  13. 根据权利要求12所述的方法,其特征在于,所述神经网络为经如下过程训练得到神经网络,
    所述方法还包括:
    获取所述训练的输入数据,所述训练的输入数据包括在编码样本图像之前的已编码图像,所述样本图像为已知的已编码图像;
    通过神经网络对所述训练的输入数据进行概率估计处理,得到所述样本图像的残差包括的多个样本的残差值的估计的概率分布;
    基于所述样本图像的残差包括的多个样本的残差值的已知概率分布和所述估计的概率分布,获取损失值;
    在所述损失值收敛于第一阈值,或所述神经网络当前的训练次数大于或等于第二阈值时,确定当前的神经网络为所述神经网络训练完成时对应的神经网络。
  14. 根据权利要求13所述的方法,其特征在于,所述训练的输入数据包括:
    第一参考图像的残差包括的多个样本的残差值,或者;
    所述第一参考图像的残差包括的多个样本的残差值和第二参考图像的残差包括的多个样本的残差值,或者;
    所述样本图像的预测包括的多个样本的预测值、所述第一参考图像的残差包括的多个样本的残差值和所述第一参考图像的预测包括的多个样本的预测值;或者;
    所述样本图像的预测包括的多个样本的预测值、所述第一参考图像的残差包括的多个样本的残差值、所述第一参考图像的预测包括的多个样本的预测值、第二参考图像的预测包括的多个样本的预测值和第二参考图像的残差包括的多个样本的残差值;
    其中,所述第一参考图像和第二参考图像为在编码所述样本图像之前的已编码图像。
  15. 根据权利要求12-14任一项所述的方法,其特征在于,
    所述概率分布表示所述当前图像的残差包括的多个样本的原始值与所述多个样本的预测值之间的多个差值的概率分布;或者,
    所述概率分布表示所述当前图像的多个样本中各个样本点的残差值的概率分布;或者,
    所述概率分布表示所述当前图像的多个样本的残差值对应的概率分布。
  16. 根据权利要求12-15任一项所述的方法,其特征在于,所述第一熵编码数据包括所述当前图像的残差,或者所述当前图像的残差的特征图,或者经变换和量化后的所述当前图像的残差。
  17. 根据权利要求16所述的方法,其特征在于,所述第一熵编码数据为所述当前图像的残差的特征图,所述方法还包括:
    将所述当前图像的残差经过编码网络获得所述当前图像的残差的特征图。
  18. 根据权利要求16所述的方法,其特征在于,所述第一熵编码数据为所述经变换和量化后的所述当前图像的残差,所述方法还包括:
    对所述当前图像的残差进行变换和量化,得到所述经变换和量化后的所述当前图像的残差。
  19. 根据权利要求17所述的方法,其特征在于,
    所述输入数据进一步包括超先验信息,
    所述神经网络包括卷积网络、均匀池化层、第一卷积层、拼接concat层和归一化指数函数Softmax层;所述卷积网络包括多个卷积层、激励ReLU层,其中:
    所述通过神经网络对输入数据进行概率估计处理,得到所述当前图像的残差包括的多个样本的残差值的概率分布,包括:
    通过所述卷积网络提取所述参考图像的残差和所述超先验信息中的特征信息;通过所述均匀池化层、所述第一卷积层、所述concat层和所述Softmax层对所述卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
  20. 根据权利要求19所述的方法,其特征在于,所述方法还包括:
    将所述当前图像的残差的特征图输入超先验编码网络,得到所述超先验信息的特征向量;
    对所述超先验信息的特征向量进行取整,得到整数化后的超先验信息的特征向量;
    将所述整数化后的超先验信息的特征向量输入超先验解码网络得到所述超先验信息。
  21. 根据权利要求12-18任一项所述的方法,其特征在于,
    所述输入数据还包括所述参考图像的预测和所述当前图像的预测;
    所述神经网络包括卷积网络、均匀池化层和归一化指数函数Softmax层;所述卷积网络包括多个卷积层、激励ReLU层,其中:
    所述通过神经网络对输入数据进行概率估计处理,得到所述当前图像的残差包括的多个样本的残差值的概率分布,包括:
    通过所述卷积网络提取所述参考图像的残差、参考图像的预测和当前图像的预测的特征信息;通过所述均匀池化层和所述Softmax层对所述卷积网络提取的特征信息进行概率估计,输出当前图像的多个样本的残差值的概率分布。
  22. 根据权利要求12-18任一项所述的方法,其特征在于,
    所述输入数据包括多个参考图像的残差,多个参考图像的预测信号,和当前图像的预测信号;所述多个参考图像为在编码所述当前图像之前的多个已编码图像,
    所述神经网络包括卷积网络、均匀池化层和归一化指数函数Softmax层;所述卷积网络包括多个卷积层、激励ReLU层,其中:
    所述通过神经网络对输入数据进行概率估计处理,得到所述当前图像的残差包括的多个样本的残差值的概率分布,包括:
    通过所述卷积网络提取所述多个参考图像的残差、所述多个参考图像的预测和当前图像的预测的特征信息;通过所述均匀池化层和所述Softmax层对所述卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
  23. 根据权利要求12-18任一项所述的方法,其特征在于,
    所述输入数据包括多个参考图像的残差,所述多个参考图像为在编码所述当前图像之前的多个已编码图像,
    所述神经网络包括卷积网络、均匀池化层和归一化指数函数Softmax层;所述卷积网络包括多个卷积层、激励ReLU层,其中:
    所述通过神经网络对输入数据进行概率估计处理,得到所述当前图像的残差包括的多个样本的残差值的概率分布,包括:
    通过所述卷积网络提取所述多个参考图像的残差的特征信息;通过所述均匀池化层和所述Softmax层对所述卷积网络提取的特征信息进行概率估计,输出当前图像的残差包括的多个样本的残差值的概率分布。
  24. 一种解码器,其特征在于,包括处理电路,用于执行如权利要求1-11任一项所述的方法。
  25. 一种编码器,其特征在于,包括处理电路,用于执行如权利要求12-23任一项所述的方法。
  26. 一种计算机程序产品,其特征在于,包括程序代码,当其在计算机或处理器上执行时,用于执行如权利要求1-23任一项所述的方法。
  27. 一种解码器,其特征在于,包括:
    一个或多个处理器;
    非瞬时性计算机可读存储介质,耦合到所述处理器,存储有所述处理器执行的程序,其中,所述程序在由所述处理器执行时,使得所述解码器执行如权利要求1-11任一项所述的方法。
  28. 一种编码器,其特征在于,包括:
    一个或多个处理器;
    非瞬时性计算机可读存储介质,耦合到所述处理器,存储有所述处理器执行的程序,其中,所述程序在由所述处理器执行时,使得所述编码器执行如权利要求12-23任一项所述的方法。
  29. 一种非瞬时性计算机可读存储介质,其特征在于,包括程序代码,当其由计算机设备执行时,用于执行基于权利要求1-23任一项所述的方法。