WO2023193629A1 - Encoding and decoding method and apparatus for regional enhancement layer - Google Patents

Encoding and decoding method and apparatus for regional enhancement layer

Info

Publication number
WO2023193629A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
feature map
reconstructed
input
target area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2023/084290
Other languages
English (en)
French (fr)
Chinese (zh)
Inventor
毛珏
赵寅
杨海涛
张恋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to EP23784193.7A (patent EP4492786A4)
Priority to JP2024559394A (patent JP7760756B2)
Publication of WO2023193629A1
Priority to MX2024012429A (patent MX2024012429A)
Priority to US18/908,185 (patent US20250030879A1)
Priority to JP2025173783A (patent JP2026027261A)
Current legal status: Ceased

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/33: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the spatial domain
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • H04N19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117: Filters, e.g. for pre-processing or post-processing
    • H04N19/119: Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
    • H04N19/134: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146: Data rate or code amount at the encoder output
    • H04N19/149: Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
    • H04N19/167: Position within a video image, e.g. region of interest [ROI]
    • H04N19/169: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • H04N19/182: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a pixel
    • H04N19/187: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a scalable video layer
    • H04N19/90: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91: Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Definitions

  • The present application relates to video encoding and decoding technology, and in particular to an encoding and decoding method and apparatus for a regional enhancement layer.
  • Scalable video coding introduces the concept of scalability in the temporal domain, the spatial domain, and quality. By adding enhancement layer information on top of the base layer, it can provide video content with a higher frame rate, higher resolution, and higher quality than the base layer. Different users can choose whether they need the enhancement layer code stream, to match their respective network bandwidth and device processing capabilities.
  • This application provides a coding and decoding method and device for a regional enhancement layer to improve the coding efficiency and coding accuracy of the enhancement layer.
  • This application provides an encoding method for a regional enhancement layer, which includes: obtaining reconstructed pixels of the base layer of a target area; inputting the reconstructed pixels into a correction network to obtain correction information of the target area; inputting the correction information and the original pixels of the target area into an encoding network to obtain a residual feature map of the enhancement layer of the target area; and encoding the residual feature map to obtain the enhancement layer code stream of the target area.
  • The reconstructed pixels of the base layer are passed through a correction network, which removes noise signals that are useless for AI enhancement layer coding, to obtain correction information; the residual feature map of the enhancement layer of the target area is then encoded based on the correction information.
  • Enhancement layer coding is performed only on the area that requires it (the target area).
  • Encoding based on the correction information can improve the accuracy of the encoding.
  • The target area is intended to represent the position of the image block that the scheme of the embodiments of the present application focuses on and processes during one encoding pass. The shape of the target area can be a regular rectangle or square, or an irregular shape; this is not specifically limited.
  • The initially obtained image block can be called an original block, and the pixels it contains can be called original pixels; the reconstructed image block can be called a reconstruction block, and the pixels it contains can be called reconstructed pixels.
  • The encoding process is broadly similar for each layer; in particular, each layer involves initially obtained image blocks and reconstructed image blocks.
  • For the base layer, the pixels contained in the initially obtained area are called the original pixels of the base layer of the area, and the pixels contained in the reconstructed area are called the reconstructed pixels of the base layer of the area.
  • For the enhancement layer, the pixels contained in the initially obtained area are called the original pixels of the enhancement layer of the area, and the pixels contained in the reconstructed area are called the reconstructed pixels of the enhancement layer of the area.
  • The reconstructed pixels are input into the correction network to obtain at least one of multiple pixel values and multiple feature values, and the correction information is the multiple pixel values or the multiple feature values.
  • A neural network may be used to implement the correction network.
  • For example, the correction network can be built as a cascade of four convolution/deconvolution layers interleaved with three activation layers.
  • The input of the correction network is the reconstructed pixels of the base layer of the target area, and the output is the correction information corresponding to the target area. The correction network serves to remove noise signals that are useless for AI enhancement layer coding.
  • The convolution kernel size of the convolution layers can be set to 3×3, the number of channels of the output feature map can be set to M, and each convolution layer downsamples the width and height by a factor of 2. It should be understood that the foregoing examples do not constitute specific limitations.
  • The size of the convolution kernel, the number of feature map channels, the downsampling factor, the number of downsampling steps, the number of convolution layers, and the number of activation layers can all be adjusted; the embodiments of this application do not specifically limit this.
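As a rough illustration of the example configuration described above, the following sketch tracks the feature map shape through four 3×3 convolution layers that each downsample the width and height by 2. The padding of 1, the helper names, the 64×64 input block, and M = 192 are assumptions chosen for illustration; the source fixes none of them, and the activation layers are omitted because they do not change the shape.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula).

    kernel=3 and stride=2 follow the example in the text; padding=1 is an assumption.
    """
    return (size + 2 * padding - kernel) // stride + 1


def correction_network_shapes(height, width, channels_m, num_layers=4):
    """Return the (channels, height, width) shape after each convolution layer."""
    shapes = []
    for _ in range(num_layers):
        height = conv_output_size(height)
        width = conv_output_size(width)
        shapes.append((channels_m, height, width))
    return shapes


# Hypothetical 64x64 target area with M = 192 output channels.
shapes = correction_network_shapes(64, 64, channels_m=192)
for index, shape in enumerate(shapes, start=1):
    print(f"layer {index}: {shape}")
```

Under these assumptions each layer halves the spatial size, so the output of the fourth layer is 16 times smaller than the input in each dimension.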
  • Multiple probability distributions can be obtained, corresponding to the multiple feature values contained in the residual feature map; the corresponding feature values in the residual feature map are then entropy encoded according to the multiple probability distributions to obtain the enhancement layer code stream.
  • The residual feature map of the enhancement layer of the target area contains multiple feature values.
  • The multiple probability distributions can be obtained in several ways:
  • The probability estimation network may also include a convolution layer and a GDN; whether the probability estimation network has other activation functions is not limited. The embodiments of the present application limit neither the number of convolution layers nor the size of the convolution kernel.
  • A probability distribution model is first used for modeling, then the correction information is input into the probability estimation network to obtain model parameters, and the model parameters are substituted into the probability distribution model to obtain a probability distribution.
  • The probability distribution model can be a single Gaussian model (Gaussian single model, GSM), an asymmetric Gaussian model, a Gaussian mixture model (GMM), or a Laplace distribution model.
  • For a Gaussian model, the model parameters are the values of the mean parameter μ and the variance σ of the Gaussian distribution.
  • For a Laplace model, the model parameters are the values of the location parameter μ and the scale parameter b of the Laplace distribution. It should be understood that, in addition to the above probability distribution models, other models can also be used, without specific limitation.
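To make the role of the model parameters concrete, the sketch below shows one common way a probability for a single feature value could be derived from a Gaussian or Laplace model. It assumes the feature values are quantized to integers and integrates the density over a unit-width bin; that quantization scheme, the function names, and the use of `sigma` as a standard deviation are assumptions for illustration, not details given in the source.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """CDF of a Gaussian; sigma is treated as a standard deviation (assumption)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def laplace_cdf(x, mu, b):
    """CDF of a Laplace distribution with location mu and scale b."""
    if x < mu:
        return 0.5 * math.exp((x - mu) / b)
    return 1.0 - 0.5 * math.exp(-(x - mu) / b)


def symbol_probability(value, cdf, *params):
    """Probability mass of the integer bin [value - 0.5, value + 0.5]."""
    return cdf(value + 0.5, *params) - cdf(value - 0.5, *params)


def ideal_code_length_bits(probability):
    """Ideal entropy-coding cost of a symbol with the given probability."""
    return -math.log2(probability)


# Hypothetical model parameters, as a probability estimation network might output.
p_gauss = symbol_probability(0, gaussian_cdf, 0.0, 1.0)
p_laplace = symbol_probability(0, laplace_cdf, 0.0, 1.0)
print(p_gauss, ideal_code_length_bits(p_gauss))
print(p_laplace, ideal_code_length_bits(p_laplace))
```

The better the estimated distribution matches the actual feature values, the higher the probabilities assigned to them and the fewer bits an entropy coder needs.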
  • The residual feature map of the enhancement layer of the target area can be input into the side information extraction network to obtain the side information of the residual feature map; the side information is entropy encoded and written into the code stream.
  • The aforementioned side information of the residual feature map is used as the reconstructed side information of the residual feature map.
  • When the correction information is multiple feature values, the reconstructed side information is input into the side information processing network to obtain a first feature map, and the multiple feature values and the first feature map are input into the probability estimation network to obtain multiple probability distributions.
  • When the correction information is multiple pixel values, the multiple pixel values are input into the feature estimation network to obtain a second feature map, the reconstructed side information is input into the side information processing network to obtain a first feature map, and the first feature map and the second feature map are input into the probability estimation network to obtain multiple probability distributions.
  • Alternatively, the reconstructed pixels can be input into the feature estimation network to obtain a third feature map, the reconstructed side information can be input into the side information processing network to obtain a first feature map, and the first feature map and the third feature map can be input into the probability estimation network to obtain multiple probability distributions.
  • The method further includes: inputting the residual feature map into a side information extraction network to obtain the side information of the residual feature map; and entropy encoding the side information, or the quantized side information, and writing the result into the code stream.
  • The encoding network includes a first encoding network. Inputting the correction information and the original pixels of the target area into the encoding network to obtain the residual feature map of the enhancement layer of the target area includes: when the correction information is multiple pixel values, calculating the difference between each original pixel and the corresponding pixel value in the correction information; and inputting the difference result into the first encoding network to obtain the residual feature map.
  • Here, "corresponding" can be understood as positional correspondence, that is, the difference is taken between the original pixel and the pixel value at the corresponding position in the correction information.
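The position-wise difference that feeds the first encoding network can be sketched as follows. The representation of a target area as a 2-D list of integers, the function name, and the sample values are assumptions for illustration; the encoding network itself is not modeled.

```python
def pixel_residual(original, correction):
    """Subtract the correction pixel value at each position from the original pixel."""
    if len(original) != len(correction) or any(
        len(orow) != len(crow) for orow, crow in zip(original, correction)
    ):
        raise ValueError("original and correction must have the same shape")
    return [
        [o - c for o, c in zip(orow, crow)]
        for orow, crow in zip(original, correction)
    ]


# Hypothetical 2x2 target area.
original = [[120, 130], [125, 135]]
correction = [[118, 131], [125, 130]]
residual = pixel_residual(original, correction)
print(residual)
```

The resulting residual block is what would be passed on to the first encoding network in place of the raw pixels.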
  • The encoding network includes a second encoding network. Inputting the correction information and the original pixels of the target area into the encoding network to obtain the residual feature map of the enhancement layer of the target area includes: inputting the original pixels into the second encoding network; when the correction information is multiple feature values, calculating the difference between the output of any convolution layer in the second encoding network and the corresponding feature values in the correction information; and inputting the difference result into the network layers that follow that convolution layer in the second encoding network to obtain the residual feature map.
  • Here, "corresponding" can be understood as positional correspondence, that is, the difference is taken between the output of any convolution layer in the second encoding network and the feature value at the corresponding position in the correction information.
  • The correction information can take two forms: multiple pixel values or multiple feature values.
  • Accordingly, the encoding network can also adopt two structures.
  • The input of the encoding network (Encoder) on the encoding side is the correction information and the original pixels of the target area, and the output is the residual feature map of the enhancement layer of the target area.
  • The embodiments of the present application can also use encoding networks with other structures, which are not specifically limited.
  • Obtaining the reconstructed pixels of the base layer of the target area may include: encoding the image to which the target area belongs to obtain the base layer code stream of the image; decoding the base layer code stream to obtain a reconstruction map of the base layer of the image; and determining at least one area that needs to be enhanced according to the reconstruction map, the target area being one of the at least one area.
  • Determining at least one area that needs to be enhanced according to the reconstruction map includes: dividing the reconstruction map to obtain multiple regions; and determining a region whose variance is greater than a first threshold among the multiple regions as the at least one region; or determining, for the multiple regions, the proportion of pixels whose gradient is greater than a second threshold among the total pixels, and determining a region in which the proportion is greater than a third threshold as the at least one region. For example, if the variance of a region is greater than a threshold t1, t1 > 0, the texture of the region can be considered relatively complex, so enhancement processing is needed to improve image quality.
  • Alternatively, if the proportion of pixels in a region whose gradient is greater than a threshold a, relative to the total pixels, is greater than a threshold t2, with a > 0 and 0 < t2 < 1, the texture of the region can also be considered relatively complex, so enhancement processing is required to improve image quality.
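The two selection criteria above can be sketched as follows. The division into square blocks, the forward-difference gradient with an L1 magnitude, the function names, and the sample thresholds are all assumptions made for illustration; the source does not specify how regions are divided or how the gradient is computed.

```python
def split_into_blocks(image, block):
    """Yield ((top, left), region) for square blocks of a 2-D list of pixels."""
    height, width = len(image), len(image[0])
    for top in range(0, height, block):
        for left in range(0, width, block):
            yield (top, left), [row[left:left + block] for row in image[top:top + block]]


def variance(region):
    pixels = [p for row in region for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def gradient_ratio(region, a):
    """Fraction of pixels whose gradient magnitude exceeds threshold a.

    Uses forward differences clamped at the region border (an assumption).
    """
    height, width = len(region), len(region[0])
    count = 0
    for y in range(height):
        for x in range(width):
            gx = region[y][min(x + 1, width - 1)] - region[y][x]
            gy = region[min(y + 1, height - 1)][x] - region[y][x]
            if abs(gx) + abs(gy) > a:
                count += 1
    return count / (height * width)


def select_by_variance(image, block, t1):
    """Regions whose variance exceeds the first threshold t1."""
    return [pos for pos, region in split_into_blocks(image, block) if variance(region) > t1]


def select_by_gradient(image, block, a, t2):
    """Regions where the share of high-gradient pixels exceeds t2."""
    return [pos for pos, region in split_into_blocks(image, block) if gradient_ratio(region, a) > t2]


# Hypothetical reconstruction map: flat left half, textured right half.
image = [
    [10, 10, 0, 255],
    [10, 10, 255, 0],
    [10, 10, 0, 255],
    [10, 10, 255, 0],
]
by_variance = select_by_variance(image, block=2, t1=100.0)
by_gradient = select_by_gradient(image, block=2, a=50, t2=0.5)
print(by_variance, by_gradient)
```

In this toy input both criteria pick the textured blocks on the right and skip the flat blocks on the left, so only the selected blocks would go on to enhancement layer coding.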
  • The encoding end encodes the original image to obtain the base layer code stream, and then decodes the base layer code stream to obtain the reconstructed image of the base layer.
  • For example, a VVC encoder encodes the original image to obtain the base layer code stream, and a VVC decoder decodes the base layer code stream to obtain the reconstructed image of the base layer. It should be understood that other codecs, such as an HEVC codec or an AVC codec, can also be used for the base layer; the embodiments of this application do not specifically limit this.
  • The target area in this example is an area that needs to be enhanced.
  • In subsequent enhancement layer coding, enhancement layer encoding and decoding can be performed only on the target area. In this way, there is no need to enhance the entire frame, which can improve the encoding and decoding efficiency of the image.
  • The method further includes: using the side information of the residual feature map as the reconstructed side information of the residual feature map.
  • This application provides a decoding method for a regional enhancement layer, which includes: obtaining reconstructed pixels of the base layer of a target area; inputting the reconstructed pixels into a correction network to obtain correction information of the target area; obtaining the enhancement layer code stream of the target area; decoding the enhancement layer code stream to obtain the residual feature map of the enhancement layer of the target area; and inputting the residual feature map and the correction information into a decoding network to obtain the reconstructed pixels of the enhancement layer of the target area.
  • The reconstructed pixels of the base layer are passed through a correction network, which removes noise signals that are useless for AI enhancement layer coding, to obtain correction information; the enhancement layer code stream is then decoded based on the correction information.
  • Inputting the reconstructed pixels into a correction network to obtain the correction information of the target area includes: inputting the reconstructed pixels into the correction network to obtain at least one of multiple pixel values and multiple feature values of the target area, where the correction information is the multiple pixel values or the multiple feature values.
  • Decoding the enhancement layer code stream to obtain the residual feature map of the enhancement layer of the target area includes: obtaining multiple probability distributions according to the correction information, where the multiple probability distributions correspond to multiple feature value code streams included in the enhancement layer code stream; and entropy decoding the corresponding feature value code streams in the enhancement layer code stream according to the multiple probability distributions to obtain the residual feature map.
  • Obtaining the multiple probability distributions based on the correction information includes: inputting the correction information into a probability estimation network to obtain the multiple probability distributions.
  • Obtaining the multiple probability distributions based on the correction information includes: obtaining the multiple probability distributions based on the correction information and reconstructed side information of the residual feature map.
  • Obtaining the multiple probability distributions based on the correction information and the reconstructed side information of the residual feature map includes: when the correction information is multiple feature values, inputting the reconstructed side information into a side information processing network to obtain a first feature map; and inputting the multiple feature values and the first feature map into a probability estimation network to obtain the multiple probability distributions.
  • Obtaining the multiple probability distributions based on the correction information and the reconstructed side information of the residual feature map includes: when the correction information is multiple pixel values, inputting the multiple pixel values into a feature estimation network to obtain a second feature map; inputting the reconstructed side information into a side information processing network to obtain a first feature map; and inputting the first feature map and the second feature map into a probability estimation network to obtain the multiple probability distributions.
  • Decoding the enhancement layer code stream to obtain the residual feature map of the enhancement layer of the target area includes: obtaining multiple probability distributions according to the reconstructed side information of the residual feature map, where the multiple probability distributions correspond to multiple feature value code streams included in the enhancement layer code stream; and entropy decoding the corresponding feature value code streams in the enhancement layer code stream according to the multiple probability distributions to obtain the residual feature map.
  • Obtaining the multiple probability distributions based on the reconstructed side information of the residual feature map includes: inputting the reconstructed side information into a probability estimation network to obtain the multiple probability distributions.
  • Obtaining the multiple probability distributions based on the reconstructed side information of the residual feature map includes: obtaining the multiple probability distributions based on the reconstructed side information and the reconstructed pixels.
  • Obtaining the multiple probability distributions based on the reconstructed side information and the reconstructed pixels includes: inputting the reconstructed pixels into a feature estimation network to obtain a third feature map; inputting the reconstructed side information into a side information processing network to obtain a first feature map; and inputting the first feature map and the third feature map into a probability estimation network to obtain the multiple probability distributions.
  • The method further includes: inputting the residual feature map into a side information extraction network to obtain side information of the residual feature map; and using the side information as the reconstructed side information of the residual feature map.
  • The method further includes: obtaining a side information code stream of the target area; and parsing the side information code stream to obtain the reconstructed side information.
  • The decoding network includes a first decoding network. Inputting the residual feature map and the correction information into the decoding network to obtain the reconstructed pixels of the enhancement layer of the target area includes: inputting the residual feature map into the first decoding network to obtain reconstructed residual pixels of the enhancement layer of the target area; and, when the correction information is multiple pixel values, summing the reconstructed residual pixels and the corresponding pixel values in the correction information to obtain the reconstructed pixels.
  • The decoding network includes a second decoding network. Inputting the residual feature map and the correction information into the decoding network to obtain the reconstructed pixels of the enhancement layer of the target area includes: inputting the residual feature map into the second decoding network; when the correction information is multiple feature values, summing the output of any convolution layer in the second decoding network and the corresponding feature values in the correction information; and inputting the summed result into the network layers that follow that convolution layer in the second decoding network to obtain the reconstructed pixels.
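The feature-domain case (second decoding network) can be sketched as a layer pipeline in which the correction feature values are added to the output of one chosen layer before the remaining layers run. The flat lists standing in for feature maps, the toy lambda layers standing in for trained convolution layers, and the `inject_after` parameter are hypothetical and only illustrate the data flow.

```python
def run_second_decoding_network(residual_features, correction_features, layers, inject_after):
    """Run layers in order; add the correction features to the output of layer `inject_after`."""
    x = residual_features
    for index, layer in enumerate(layers):
        x = layer(x)
        if index == inject_after:
            if len(x) != len(correction_features):
                raise ValueError("correction features must match the layer output size")
            # Position-wise sum of the layer output and the correction feature values.
            x = [value + corr for value, corr in zip(x, correction_features)]
    return x


# Toy stand-ins for trained layers (assumptions, not the patent's networks).
layers = [
    lambda v: [2 * e for e in v],
    lambda v: [e + 1 for e in v],
]
out = run_second_decoding_network([1, 2, 3], [10, 20, 30], layers, inject_after=0)
print(out)
```

The pixel-domain case (first decoding network) is the special case where the sum is applied to the final output of the network instead of an intermediate layer.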
  • The method further includes: obtaining a base layer code stream of the image to which the target area belongs; parsing the base layer code stream to obtain a reconstruction map of the base layer of the image; and determining at least one area that needs to be enhanced according to the reconstruction map, the target area being one of the at least one area.
  • Determining at least one area that needs to be enhanced based on the reconstruction map includes: dividing the reconstruction map to obtain multiple areas; and determining an area whose variance is greater than a first threshold among the multiple areas as the at least one area; or determining, for the multiple areas, the proportion of pixels whose gradient is greater than a second threshold among the total pixels, and determining an area in which the proportion is greater than a third threshold as the at least one area.
  • The present application provides an encoding device, including: an acquisition module, configured to acquire reconstructed pixels of the base layer of a target area; a processing module, configured to input the reconstructed pixels into a correction network to obtain correction information of the target area, and to input the correction information and the original pixels of the target area into an encoding network to obtain the residual feature map of the enhancement layer of the target area; and an encoding module, configured to encode the residual feature map to obtain the enhancement layer code stream of the target area.
  • The processing module is specifically configured to input the reconstructed pixels into the correction network to obtain at least one of multiple pixel values and multiple feature values of the target area, where the correction information is the multiple pixel values or the multiple feature values.
  • The encoding module is specifically configured to obtain multiple probability distributions based on the correction information, where the multiple probability distributions correspond to multiple feature values contained in the residual feature map; and to entropy encode the corresponding feature values in the residual feature map according to the multiple probability distributions to obtain the enhancement layer code stream.
  • the encoding module is specifically configured to input the correction information into a probability estimation network to obtain the multiple probability distributions.
  • the encoding module is specifically configured to obtain the plurality of probability distributions based on the correction information and reconstructed side information of the residual feature map.
  • the encoding module is specifically configured to, when the correction information is multiple feature values, input the reconstructed side information into the side information processing network to obtain the first feature map; and input the multiple feature values and the first feature map into a probability estimation network to obtain the multiple probability distributions.
  • the encoding module is specifically configured to, when the correction information is multiple pixel values, input the multiple pixel values into the feature estimation network to obtain the second feature map; input the reconstructed side information into the side information processing network to obtain the first feature map; and input the first feature map and the second feature map into the probability estimation network to obtain the multiple probability distributions.
  • the encoding module is specifically configured to obtain multiple probability distributions based on the reconstructed side information of the residual feature map, the multiple probability distributions corresponding to the multiple feature values included in the residual feature map; and perform entropy coding on the corresponding feature values in the residual feature map according to the multiple probability distributions to obtain the enhancement layer code stream.
  • the encoding module is specifically configured to input the reconstructed side information into a probability estimation network to obtain the multiple probability distributions.
  • the encoding module is specifically configured to obtain the multiple probability distributions according to the reconstructed side information and the reconstructed pixels.
  • the encoding module is specifically configured to input the reconstructed pixels into a feature estimation network to obtain a third feature map; input the reconstructed side information into a side information processing network to obtain a first feature map; and input the first feature map and the third feature map into a probability estimation network to obtain the multiple probability distributions.
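  • To illustrate why each feature value is paired with its own probability distribution, the following sketch computes the ideal entropy-coded length, the sum of -log2 p over the coded values, which an arithmetic coder approaches. It is a generic illustration under stated assumptions: the symbol values and probability tables are made up, and the probability estimation network of this application is not modeled.

```python
import math

def estimated_bits(symbols, distributions):
    """Ideal code length: sum of -log2 p(symbol), one probability table per symbol."""
    return sum(-math.log2(dist[s]) for s, dist in zip(symbols, distributions))

# Hypothetical quantized feature values from a residual feature map.
symbols = [0, 1, 0, -1]

# One shared table reused for every feature value.
shared = [{-1: 0.25, 0: 0.5, 1: 0.25}] * 4

# Per-value tables, as if context (for example correction information)
# had made each estimate sharper.
tailored = [
    {-1: 0.25, 0: 0.5, 1: 0.25},
    {-1: 0.25, 0: 0.25, 1: 0.5},
    {-1: 0.25, 0: 0.5, 1: 0.25},
    {-1: 0.5, 0: 0.25, 1: 0.25},
]

print(estimated_bits(symbols, shared))
print(estimated_bits(symbols, tailored))
```

  • In this toy case the per-value tables need fewer bits than the shared table, which is the general motivation for conditioning the probability distributions on additional information.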
  • the encoding module is also used to input the residual feature map into a side information extraction network to obtain the side information of the residual feature map; and entropy encode the side information or the quantized side information and write it into the code stream.
  • the encoding network includes a first encoding network; and the encoding module is specifically configured to, when the correction information is multiple pixel values, compute the difference between the original pixels and the corresponding pixel values in the correction information; and input the difference result into the first encoding network to obtain the residual feature map.
  • the encoding network includes a second encoding network; the encoding module is specifically configured to input the original pixels into the second encoding network; when the correction information is multiple feature values, compute the difference between the output of any convolutional layer in the second encoding network and the corresponding feature values in the correction information; and input the difference result into the network layer after that convolutional layer in the second encoding network to obtain the residual feature map.
  • the encoding module is also used to encode the image to which the target area belongs to obtain the base layer code stream of the image; decode the base layer code stream to obtain a reconstructed image of the base layer of the image; and determine at least one area that needs to be enhanced based on the reconstructed image, the target area being one of the at least one area.
  • the encoding module is specifically configured to divide the reconstructed image to obtain multiple regions; and determine a region with a variance greater than a first threshold among the multiple regions as the at least one region; or determine, for each of the multiple regions, the proportion of pixels with gradients greater than a second threshold to the total pixels, and determine a region in which the proportion is greater than a third threshold as the at least one region.
  • the encoding module is also configured to use the side information of the residual feature map as the reconstructed side information of the residual feature map.
  • the present application provides a decoding device, including: an acquisition module for acquiring reconstructed pixels of the base layer of a target area; a processing module for inputting the reconstructed pixels into a correction network to obtain the correction information of the target area; the acquisition module is also used to obtain the enhancement layer code stream of the target area; a decoding module is used to decode the enhancement layer code stream to obtain the residual feature map of the enhancement layer of the target area; and the processing module is also configured to input the residual feature map and the correction information into a decoding network to obtain reconstructed pixels of the enhancement layer of the target area.
  • the processing module is specifically configured to input the reconstructed pixels into the correction network to obtain at least one of multiple pixel values and multiple feature values of the target area, the correction information being the multiple pixel values or the multiple feature values.
  • the decoding module is specifically configured to obtain multiple probability distributions according to the correction information, the multiple probability distributions corresponding to the multiple feature value code streams contained in the enhancement layer code stream; and perform entropy decoding on the corresponding feature value code streams in the enhancement layer code stream according to the multiple probability distributions to obtain the residual feature map.
  • the decoding module is specifically configured to input the correction information into a probability estimation network to obtain the multiple probability distributions.
  • the decoding module is specifically configured to obtain the plurality of probability distributions based on the correction information and reconstructed side information of the residual feature map.
  • the decoding module is specifically configured to, when the correction information is multiple feature values, input the reconstructed side information into the side information processing network to obtain the first feature map; and input the multiple feature values and the first feature map into a probability estimation network to obtain the multiple probability distributions.
  • the decoding module is specifically configured to, when the correction information is multiple pixel values, input the multiple pixel values into the feature estimation network to obtain the second feature map; input the reconstructed side information into the side information processing network to obtain the first feature map; and input the first feature map and the second feature map into the probability estimation network to obtain the multiple probability distributions.
  • the decoding module is specifically configured to obtain the multiple probability distributions according to the reconstructed side information of the residual feature map, the multiple probability distributions corresponding to the multiple feature value code streams included in the enhancement layer code stream; and perform entropy decoding on the corresponding feature value code streams in the enhancement layer code stream according to the multiple probability distributions to obtain the residual feature map.
  • the decoding module is specifically configured to input the reconstructed side information into a probability estimation network to obtain the multiple probability distributions.
  • the decoding module is specifically configured to obtain the multiple probability distributions according to the reconstructed side information and the reconstructed pixels.
  • the decoding module is specifically configured to input the reconstructed pixels into a feature estimation network to obtain a third feature map; input the reconstructed side information into a side information processing network to obtain a first feature map; and input the first feature map and the third feature map into a probability estimation network to obtain the multiple probability distributions.
  • the decoding module is also used to input the residual feature map into the side information extraction network to obtain the side information of the residual feature map; and use the side information as the reconstructed side information of the residual feature map.
  • the decoding module is also used to obtain the side information code stream of the target area; and parse the side information code stream to obtain the reconstructed side information.
  • the decoding network includes a first decoding network; the decoding module is specifically configured to input the residual feature map into the first decoding network to obtain the reconstructed residual pixels of the enhancement layer of the target area; and, when the correction information is multiple pixel values, sum the reconstructed residual pixels and the corresponding pixel values in the correction information to obtain the reconstructed pixels.
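  • The pixel-domain case, in which the encoder subtracts the correction pixel values and the decoder adds them back, can be sketched as below. This is a minimal sketch under a strong assumption: the first encoding network, the entropy coding and the first decoding network are replaced by a lossless pass-through, so only the difference and summation steps are shown. All names and pixel values are hypothetical.

```python
def pixel_difference(original, correction):
    """Encoder side: original pixels minus the corresponding correction pixel values."""
    return [[o - c for o, c in zip(o_row, c_row)]
            for o_row, c_row in zip(original, correction)]

def pixel_sum(recon_residual, correction):
    """Decoder side: reconstructed residual pixels plus the corresponding correction pixel values."""
    return [[r + c for r, c in zip(r_row, c_row)]
            for r_row, c_row in zip(recon_residual, correction)]

# Hypothetical 2x2 target area.
original = [[120, 124], [118, 130]]
# Hypothetical output of the correction network, derived from base-layer reconstructed pixels.
correction = [[118, 125], [117, 126]]

residual = pixel_difference(original, correction)
# Assumption: the coding networks and entropy coding are lossless here.
recon_residual = residual
reconstructed = pixel_sum(recon_residual, correction)
print(residual)
print(reconstructed)
```

  • Because both sides derive the same correction information from the base layer, only the residual needs to be carried in the enhancement layer code stream.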
  • the decoding network includes a second decoding network; the decoding module is specifically configured to input the residual feature map into the second decoding network; when the correction information is multiple feature values, sum the output of any convolutional layer in the second decoding network and the corresponding feature values in the correction information; and input the summed result into the network layer after that convolutional layer in the second decoding network to obtain the reconstructed pixels.
  • the decoding module is also used to obtain the base layer code stream of the image to which the target area belongs; parse the base layer code stream to obtain the reconstruction map of the base layer of the image; and determine at least one area that needs to be enhanced according to the reconstruction map, the target area being one of the at least one area.
  • the decoding module is specifically configured to divide the reconstructed image to obtain multiple regions; and determine a region with a variance greater than a first threshold among the multiple regions as the at least one region; or determine, for each of the multiple regions, the proportion of pixels with gradients greater than a second threshold to the total pixels, and determine a region in which the proportion is greater than a third threshold as the at least one region.
  • this application provides an encoder, including: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method described in any implementation of the above first aspect.
  • the present application provides a decoder, including: one or more processors; and a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method described in any implementation of the above second aspect.
  • the present application provides a computer-readable storage medium, including a computer program. When the computer program is executed on a computer, it causes the computer to perform the method described in any one of the first to second aspects.
  • the present application provides a computer program product. The computer program product contains instructions. When the instructions are run on a computer or a processor, the computer or the processor implements the method described in any one of the first to second aspects.
  • Figure 1 is an exemplary hierarchical diagram of scalable video decoding in this application
  • Figure 2A is a schematic block diagram of an exemplary decoding system 10
  • Figure 2B is an illustration of an example of video coding system 40
  • Figure 3 is a schematic diagram of a video decoding device 400 provided by an embodiment of the present invention.
  • Figure 4 is a simplified block diagram of an apparatus 500 provided by an exemplary embodiment
  • Figure 5 is an example diagram of an application scenario according to the embodiment of the present application.
  • Figure 6 is an example diagram of an application scenario according to the embodiment of the present application.
  • Figure 7 is an example diagram of an application scenario according to the embodiment of the present application.
  • Figure 8 is a flow chart of a process 800 of the regional enhancement layer encoding method according to the embodiment of the present application.
  • Figure 9a is an exemplary schematic diagram of the correction network
  • Figure 9b is an exemplary schematic diagram of the correction network
  • Figure 9c is an exemplary schematic diagram of the correction network
  • Figure 9d is an exemplary schematic diagram of the correction network
  • Figure 10a is an exemplary schematic diagram of the encoding network
  • Figure 10b is an exemplary schematic diagram of the encoding network
  • Figure 11a is an exemplary schematic diagram of a probability estimation network
  • Figure 11b is an exemplary schematic diagram of a probability estimation network
  • Figure 12 is a flow chart of a process 1200 of the regional enhancement layer decoding method according to the embodiment of the present application.
  • Figure 13a is an exemplary schematic diagram of the decoding network
  • Figure 13b is an exemplary schematic diagram of the decoding network
  • Figure 14 is an exemplary schematic diagram of the encoding and decoding process
  • Figure 15 is an exemplary schematic diagram of the encoding and decoding process
  • Figure 16 is an exemplary schematic diagram of the encoding and decoding process
  • Figure 17 is an exemplary schematic diagram of the encoding and decoding process
  • Figure 18 is an exemplary structural schematic diagram of the encoding device 1800 according to the embodiment of the present application.
  • Figure 19 is an exemplary structural schematic diagram of the decoding device 1900 according to the embodiment of the present application.
  • "At least one (item)" refers to one or more, and "plurality" refers to two or more.
  • "And/or" describes the relationship between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including a single item or any combination of multiple items.
  • For example, at least one of a, b or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, c can be single or multiple.
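  • The seven meanings listed for "at least one of a, b or c" are simply the non-empty combinations of the three items. The short sketch below, given only as an illustration, enumerates them with the standard `itertools.combinations` function.

```python
from itertools import combinations

items = ["a", "b", "c"]

# Every non-empty combination of the items, from single items up to all of them.
choices = [" and ".join(combo)
           for size in range(1, len(items) + 1)
           for combo in combinations(items, size)]
print(choices)
```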
  • Neural network is a machine learning model.
  • the neural network can be composed of neural units.
  • the neural unit can refer to an arithmetic unit that takes $x_s$ and an intercept of 1 as input.
  • the output of the arithmetic unit can be:
  • $$f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$
  • where $s = 1, 2, \ldots, n$, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a nonlinear function such as ReLU.
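  • The neural unit described above (a weighted sum of the inputs plus a bias, passed through an activation function such as ReLU) can be written in a few lines. The function names and the numeric inputs are illustrative assumptions.

```python
def relu(value):
    """ReLU activation: keep positive values, clamp negative values to zero."""
    return max(0.0, value)

def neural_unit(xs, ws, b, f=relu):
    """Output of one neural unit: f(sum over s of W_s * x_s, plus bias b)."""
    return f(sum(w * x for w, x in zip(ws, xs)) + b)

# Weighted sum is 0.5*1.0 + 1.0*2.0 + 0.25 = 2.75, which ReLU passes through.
positive = neural_unit([1.0, 2.0], [0.5, 1.0], 0.25)
# Weighted sum is 0.5*1.0 - 1.0*2.0 + 0.25 = -1.25, which ReLU clamps to zero.
negative = neural_unit([1.0, 2.0], [0.5, -1.0], 0.25)
print(positive, negative)
```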
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • Multi-layer perceptron (MLP)
  • MLP is a simple deep neural network (DNN) (different layers are fully connected), also called a multi-layer neural network, which can be understood as a neural network with many hidden layers.
  • the layers inside a DNN can be divided into three categories: input layer, hidden layer, and output layer.
  • generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer is connected to any neuron in the (i+1)-th layer.
  • although a DNN looks very complicated, the work of each layer is actually not complicated.
  • the coefficient from the k-th neuron in layer L-1 to the j-th neuron in layer L is defined as $W^{L}_{jk}$. It should be noted that the input layer has no W parameter.
  • more hidden layers make the network more capable of describing complex situations in the real world. Theoretically, a model with more parameters has higher complexity and greater "capacity", which means it can complete more complex learning tasks.
  • Training a deep neural network is the process of learning the weight matrix. The ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (a weight matrix formed by the vectors W of many layers).
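  • A forward pass through fully connected layers, using the indexing convention above in which `W[j][k]` is the coefficient from the k-th neuron of the previous layer to the j-th neuron of the current layer, can be sketched as follows. The ReLU activation, the per-layer bias vector and all weight values are assumptions chosen for illustration; the input layer carries no weights, so the list contains only the hidden and output layers.

```python
def relu(value):
    """ReLU activation used for every layer in this sketch."""
    return max(0.0, value)

def dense_layer(x, W, b):
    """Fully connected layer: neuron j sums W[j][k] * x[k] over k, adds b[j], applies ReLU."""
    return [relu(sum(W[j][k] * x[k] for k in range(len(x))) + b[j])
            for j in range(len(W))]

def mlp_forward(x, layers):
    """Feed the input through each (W, b) pair in order."""
    for W, b in layers:
        x = dense_layer(x, W, b)
    return x

layers = [
    # Hidden layer: 2 inputs -> 3 neurons.
    ([[1.0, 0.0], [0.0, -1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),
    # Output layer: 3 inputs -> 1 neuron.
    ([[1.0, 1.0, 1.0]], [0.5]),
]
output = mlp_forward([1.0, 2.0], layers)
print(output)
```

  • Training would adjust the entries of each `W` (and `b`); this sketch only evaluates a network whose weights are already fixed.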
  • A convolutional neural network (CNN) is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture refers to learning at multiple levels of abstraction through machine learning algorithms.
  • as a deep learning architecture, a CNN is a feed-forward artificial neural network in which each neuron can respond to the image input into it.
  • the convolutional neural network contains a feature extractor composed of convolutional layers and pooling layers.
  • the feature extractor can be regarded as a filter, and the convolution process can be regarded as using a trainable filter to convolve with an input image or convolution feature plane (feature map).
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • the convolution layer can include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved across the input image along the horizontal direction one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), so as to extract specific features from the image.
  • the size of the weight matrix should be related to the size of the image.
  • the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • during the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolved output with a single depth dimension. In most cases, however, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolution image.
  • the dimension here can be understood as being determined by the "multiple" mentioned above. Different weight matrices can be used to extract different features in the image.
  • for example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to blur unwanted noise in the image, and so on.
  • the multiple weight matrices have the same size (rows × columns), so the feature maps they extract also have the same size.
  • the extracted feature maps of the same size are then merged to form the output of the convolution operation.
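  • The sliding of a weight matrix over the image with a given stride, and the stacking of the outputs of several weight matrices into the depth dimension, can be sketched as below. Several simplifying assumptions apply: the input has a single channel, there is no padding, the operation is the cross-correlation form commonly used in deep learning, and the two kernels are hand-written rather than trained.

```python
def conv2d(image, kernel, stride=1):
    """Slide one weight matrix over a single-channel image (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for top in range(0, len(image) - kh + 1, stride):
        row = []
        for left in range(0, len(image[0]) - kw + 1, stride):
            row.append(sum(kernel[i][j] * image[top + i][left + j]
                           for i in range(kh) for j in range(kw)))
        output.append(row)
    return output

def conv_layer(image, kernels, stride=1):
    """Apply several same-size weight matrices; each one contributes one output channel."""
    return [conv2d(image, kernel, stride) for kernel in kernels]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernels = [
    [[-1, 1], [-1, 1]],          # responds to horizontal intensity changes (edges)
    [[0.25, 0.25], [0.25, 0.25]],  # local averaging (blur)
]
feature_maps = conv_layer(image, kernels)
print(len(feature_maps))   # depth of the output equals the number of kernels
print(feature_maps[0])
print(feature_maps[1])
```

  • Both feature maps have the same 2 × 2 size because the kernels share the same size, which is what allows them to be stacked into one output.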
  • the weight values in these weight matrices require a lot of training in practical applications.
  • Each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, so that the convolutional neural network can make correct predictions.
  • the initial convolutional layers often extract more general features, which can also be called low-level features; as the depth of the convolutional neural network increases, the features extracted by later convolutional layers become more and more complex, such as high-level semantic features.
  • Features with higher semantics are more suitable for the problem to be solved.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
  • the max pooling operator can take the pixel with the largest value in a specific range as the result of max pooling.
  • the operators in the pooling layer should also be related to the size of the image.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image input to the pooling layer.
  • Each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
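  • Average pooling and max pooling over non-overlapping sub-regions can be sketched as follows. The window size of 2 and the sample image are assumptions for illustration.

```python
def pool2d(image, size, op):
    """Reduce each non-overlapping size x size sub-region to a single value using op."""
    return [[op([image[top + i][left + j] for i in range(size) for j in range(size)])
             for left in range(0, len(image[0]) - size + 1, size)]
            for top in range(0, len(image) - size + 1, size)]

def mean(values):
    """Arithmetic mean of a list of pixel values."""
    return sum(values) / len(values)

image = [[1, 3, 2, 4],
         [5, 7, 6, 8],
         [9, 11, 10, 12],
         [13, 15, 14, 16]]

max_pooled = pool2d(image, 2, max)
avg_pooled = pool2d(image, 2, mean)
print(max_pooled)
print(avg_pooled)
```

  • The 4 × 4 input becomes a 2 × 2 output, and each output pixel is the maximum or the average of its corresponding sub-region.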
  • after processing by the convolutional layers/pooling layers, the convolutional neural network is still not able to output the required output information, because, as mentioned above, the convolutional layers/pooling layers only extract features and reduce the number of parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network needs to use a neural network layer to generate one output or a set of outputs whose number equals the required number of classes. The neural network layer can therefore include multiple hidden layers, whose parameters can be pre-trained on training data relevant to a specific task type, for example image recognition, image classification, or image super-resolution reconstruction.
  • after the multiple hidden layers of the neural network layer comes the output layer of the entire convolutional neural network.
  • this output layer has a loss function similar to categorical cross-entropy, which is used specifically to calculate the prediction error.
  • Recurrent neural networks (RNNs) are used to process sequence data.
  • in a traditional neural network model, the layers are fully connected to each other, while the nodes within each layer are unconnected.
  • although this ordinary neural network has solved many difficult problems, it remains inadequate for many others. For example, predicting the next word of a sentence generally requires the preceding words, because the words in a sentence are not independent of each other. An RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs.
  • RNN can process sequence data of any length.
  • the training of an RNN is the same as the training of a traditional CNN or DNN.
  • the error backpropagation algorithm is also used, with one difference: if the RNN is unrolled into a network, the parameters, such as W, are shared across the steps, which is not the case in the traditional neural network described above.
  • the output of each step depends not only on the network of the current step but also on the states of the network in several previous steps. This learning algorithm is called backpropagation through time (BPTT).
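  • The two properties described above, parameters shared across the unrolled steps and an output that depends on earlier steps, can be seen in a minimal forward pass. This sketch assumes a scalar hidden state and a tanh activation and does not implement BPTT training; the parameter names `w_x`, `w_h` and `b` are hypothetical.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Unrolled recurrent forward pass with a scalar hidden state.

    The same w_x, w_h and b are reused at every step (shared parameters),
    and each state is computed from the current input and the previous state.
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

with_history = rnn_forward([1.0, 0.0], w_x=1.0, w_h=1.0, b=0.0)
without_history = rnn_forward([0.0, 0.0], w_x=1.0, w_h=1.0, b=0.0)
print(with_history)
print(without_history)
```

  • The second input is 0.0 in both sequences, yet the second outputs differ, because the first sequence carries a non-zero state forward from its first step.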
  • a convolutional neural network can use the error back propagation (BP) algorithm to correct the values of the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, propagating the input signal forward to the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the super-resolution model, such as the weight matrix.
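  • The loop of forward propagation, error loss, and parameter update from the back-propagated error can be shown on a toy model with a single weight. This is only a sketch: the one-parameter model y = w * x, the squared-error loss, the learning rate and the step count are all assumptions, and a real network would propagate gradients through many layers.

```python
def train(x, target, w=0.0, learning_rate=0.1, steps=20):
    """Gradient descent on a one-weight model y = w * x with squared-error loss."""
    losses = []
    for _ in range(steps):
        y = w * x                       # forward propagation
        error = y - target
        losses.append(error * error)    # error loss for this step
        gradient = 2.0 * error * x      # back-propagated derivative of the loss w.r.t. w
        w -= learning_rate * gradient   # parameter update
    return w, losses

weight, losses = train(x=1.0, target=2.0)
print(weight)
print(losses[0], losses[-1])
```

  • In this toy run the loss shrinks at every step and the weight approaches the value 2.0 that reproduces the target, which is the convergence behaviour described above.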
  • A generative adversarial network (GAN) is a deep learning model.
  • the model includes at least two modules: a generative model and a discriminative model. The two modules learn by competing against each other, which produces better output.
  • Both the generative model and the discriminative model can be neural networks, specifically deep neural networks or convolutional neural networks.
  • the basic principle of a GAN is as follows, taking a GAN that generates pictures as an example. Suppose there are two networks, G (Generator) and D (Discriminator), where G is a network that generates pictures.
  • D is a discriminant network used to judge whether a picture is "real". Its input is x, where x represents a picture, and its output D(x) represents the probability that x is a real picture: an output of 1 means that x is 100% a real picture, and an output of 0 means that x cannot be a real picture.
  • the goal of the generative network G is to generate pictures that are as realistic as possible in order to deceive the discriminant network D, and the goal of D is to distinguish the pictures generated by G from the real pictures as well as possible.
  • G and D constitute a dynamic "game” process, that is, the "confrontation” in the "generative adversarial network".
  • Embodiments of the present application relate to scalable video coding. To facilitate understanding, related terms are explained below:
  • Scalable video coding, also known as layered video coding, is an extension of current video coding standards (generally scalable video coding (SVC), the extension of advanced video coding (AVC) (H.264), or scalable high efficiency video coding (SHVC), the extension of high efficiency video coding (HEVC) (H.265)).
  • Scalable video coding technology can obtain code streams of different resolution levels by spatially grading (resolution grading) the original image blocks.
  • the resolution can refer to the size of the image block in pixels.
  • the resolution of a lower level is lower, and the resolution of a higher level is not lower than that of the lower level; or, by temporally grading (frame rate grading) the original image blocks, code streams of different frame rate levels can be obtained.
  • the frame rate can refer to the number of image frames contained in the video per unit time.
  • the frame rate of a lower level is lower, and the frame rate of a higher level is not lower than that of the lower level; or, by grading the original image blocks in the quality domain, code streams of different encoding quality levels can be obtained.
  • the encoding quality can refer to the quality of the video.
  • the image distortion of a lower level is greater, and the image distortion of a higher level is not higher than that of the lower level.
  • a layer called a base layer is the lowest layer in scalable video coding.
  • in spatial grading, the base layer image blocks are encoded using the lowest resolution; in temporal grading, the base layer image blocks are encoded using the lowest frame rate; in quality-domain grading, the base layer image blocks are encoded using the highest QP or the lowest code rate. That is, the base layer is the lowest-quality layer in scalable video coding.
  • the level called the enhancement layer is a level above the base layer in scalable video coding, and can be divided into multiple enhancement layers from low to high.
  • the lowest layer enhancement layer is based on the coding information obtained from the base layer, and the combined code stream obtained by encoding has a higher encoding resolution than the base layer, or a higher frame rate than the base layer, or a higher code rate than the base layer.
  • the higher-layer enhancement layer can encode higher-quality image blocks based on the coding information of the lower-layer enhancement layer.
  • as shown in Figure 1, the original image blocks are layered into base layer image blocks B and enhancement layer image blocks (E1 to En, n ≥ 1), which are then encoded respectively to obtain a code stream including a base layer code stream and an enhancement layer code stream.
  • the base layer code stream is generally the code stream obtained by encoding the image blocks with the lowest resolution, the lowest frame rate or the lowest encoding quality parameters.
  • the enhancement layer code stream is the code stream obtained by encoding the image blocks, on the basis of the base layer, with high resolution, high frame rate or high encoding quality parameters.
  • when the encoder transmits the code stream to the decoder, it prioritizes the transmission of the base layer code stream.
  • when the network has spare capacity, it gradually transmits code streams of higher and higher levels.
  • the decoder first receives and decodes the base layer code stream, and then according to the received enhancement layer code stream, in order from low level to high level, it decodes code streams with increasingly higher levels of spatial domain, time domain or quality layer by layer.
  • the higher-level decoding information is then superimposed on the lower-level reconstruction block to obtain a higher resolution, higher frame rate or higher quality reconstruction block.
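  • The idea of decoding the base layer first and then superimposing enhancement layer information can be sketched with a toy spatial example on a 1-D signal. This is an illustrative sketch only, not the inter-layer prediction of SVC, SHVC or this application: the base layer is a 2:1 subsampled signal, the enhancement layer is the difference from the upsampled base layer, and no quantization or entropy coding is applied.

```python
def downsample(signal):
    """Base layer: keep every second sample (half resolution)."""
    return signal[::2]

def upsample(signal):
    """Bring the base layer back to full resolution by repeating each sample."""
    return [sample for sample in signal for _ in range(2)]

def encode_layers(signal):
    """Split an even-length signal into a base layer and one enhancement layer."""
    base = downsample(signal)
    enhancement = [o - p for o, p in zip(signal, upsample(base))]
    return base, enhancement

def decode(base, enhancement=None):
    """Decode the base layer alone, or superimpose the enhancement layer on it."""
    reconstruction = upsample(base)
    if enhancement is not None:
        reconstruction = [p + e for p, e in zip(reconstruction, enhancement)]
    return reconstruction

signal = [10, 12, 20, 26]
base, enhancement = encode_layers(signal)
print(decode(base))                # lower-quality reconstruction from the base layer only
print(decode(base, enhancement))   # full-quality reconstruction with the enhancement layer
```

  • A receiver that only gets the base layer can still display a coarse result, while a receiver that also gets the enhancement layer recovers the original signal.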
  • each image in a video sequence is typically split into a set of non-overlapping blocks, which are typically encoded at the block level.
  • Encoders usually process or encode video at the block (image block) level, e.g., by producing prediction blocks through spatial (intra) prediction and temporal (inter) prediction, subtracting the prediction block from the image block (the block currently being processed or to be processed) to obtain the residual block, and transforming the residual block into the transform domain and quantizing it to reduce the amount of data to be transmitted (compression).
  • the encoder also needs to obtain the reconstructed residual block through inverse quantization and inverse transformation, and then add the pixel values of the reconstructed residual block and the pixel values of the prediction block to obtain the reconstructed block.
  • the reconstruction block of the base layer refers to the reconstruction block obtained by performing the above operation on the base layer image block obtained by layering the original image block.
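  • The block-level loop described above (prediction, residual, quantization, inverse quantization, reconstruction) can be sketched as follows. This is a simplified illustration under stated assumptions: the transform step is omitted, the quantizer is a plain uniform quantizer with step `qstep`, and all names are hypothetical rather than taken from any codec.

```python
import math


def quantize(value, qstep):
    """Uniform quantization with round-half-up (assumed for this sketch)."""
    return math.floor(value / qstep + 0.5)


def encode_block(block, prediction, qstep):
    """Subtract the prediction from the block and quantize the residual."""
    residual = [o - p for o, p in zip(block, prediction)]
    return [quantize(r, qstep) for r in residual]


def reconstruct_block(levels, prediction, qstep):
    """Inverse-quantize the levels and add them back onto the prediction."""
    recon_residual = [level * qstep for level in levels]
    return [p + r for p, r in zip(prediction, recon_residual)]
```

  • The encoder runs `reconstruct_block` as well as the decoder, so that both sides predict from the same reconstructed pixels; the reconstruction differs from the original block by at most half a quantization step per pixel in this sketch.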
  • The area below may refer to a largest coding unit (LCU) in the entire frame image, an image block obtained by dividing the entire frame image, a region of interest (ROI) in the entire frame image (that is, a specified image area that needs to be processed), or a slice in a frame of image.
  • VVC versatile video coding
  • AI artificial intelligence
  • the VVC encoder encodes the input image x to obtain a code stream
  • the VVC decoder decodes the code stream to obtain the reconstructed image xc of the base layer.
  • At the enhancement layer, x and xc are input into the encoding network (Encoder) to obtain the residual feature map (y) of the enhancement layer.
  • Input y into the side information extraction network to obtain the side information (z) of y, and quantize z to obtain the quantized side information.
  • Perform entropy encoding on the quantized side information and write it into the code stream; then perform entropy decoding on that code stream to obtain the decoded side information.
  • Perform inverse quantization on the decoded side information to obtain the reconstructed side information, and input the reconstructed side information into the probability estimation network to obtain the probability distribution.
  • y is quantized to obtain the quantized residual feature map of the enhancement layer.
  • The VVC decoder parses the code stream to obtain the reconstructed image xc of the base layer.
  • The enhancement layer parses the code stream to obtain the decoded side information, performs inverse quantization on it to obtain the reconstructed side information, and inputs the reconstructed side information into the probability estimation network to obtain the probability distribution. Entropy decoding is performed on the code stream according to this probability distribution to obtain the decoded residual feature map, which is inverse-quantized to obtain the reconstructed residual feature map. xc and the reconstructed residual feature map are input into the decoding network (Decoder) to obtain the reconstructed image yc of the enhancement layer.
  • The AI image encoding and decoding method is used in the enhancement layer, and it outperforms traditional image coding schemes on objective quality metrics such as multi-scale structural similarity (MS-SSIM) and peak signal-to-noise ratio (PSNR).
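  • For reference, PSNR is conventionally computed from the mean squared error between the original and reconstructed pixels as 10·log10(MAX² / MSE). The sketch below assumes 8-bit samples (MAX = 255) stored in flat lists; the function name and argument layout are illustrative only.

```python
import math


def psnr(original, reconstructed, max_value=255):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no distortion
    return 10 * math.log10(max_value ** 2 / mse)
```

  • A higher PSNR means the reconstruction is closer to the original; identical inputs give an infinite value.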
  • the AI image encoding and decoding network used in the above solution has high complexity and high computing power requirements, resulting in low encoding and decoding efficiency of the enhancement layer.
  • embodiments of the present application provide a coding and decoding method for a regional enhancement layer to improve the coding and decoding efficiency of the enhancement layer.
  • the following describes the systems and/or scenarios to which the solution of the embodiment of the present application is applicable.
  • FIG. 2A is a schematic block diagram of an exemplary coding system 10.
  • The video encoder 20 (or simply the encoder 20) and the video decoder 30 (or simply the decoder 30) in the coding system 10 may be used to perform the various example solutions described in the embodiments of this application.
  • The coding system 10 includes a source device 12 that provides encoded image data 21, such as an encoded image, to a destination device 14 that decodes the encoded image data 21.
  • the source device 12 includes an encoder 20 and, additionally or optionally, an image source 16, a preprocessor (or preprocessing unit) 18 such as an image preprocessor, and a communication interface (or communication unit) 22.
  • Image source 16 may include or be any type of image capture device for capturing real-world images or the like, and/or any type of image generation device, such as a computer graphics processor for generating computer-animated images, or any type of device for acquiring and/or providing real-world images, computer-generated images (e.g., screen content or virtual reality (VR) images), and/or any combination thereof (e.g., augmented reality (AR) images).
  • the image source may be any type of memory or storage that stores any of the above images.
  • the image (or image data) 17 may also be referred to as the original image (or original image data) 17.
  • the preprocessor 18 is used to receive (original) image data 17 and perform preprocessing on the image data 17 to obtain a preprocessed image (or preprocessed image data) 19 .
  • Preprocessing performed by the preprocessor 18 may include cropping, color format conversion (e.g., from RGB to YCbCr), color grading, or denoising. It can be understood that the preprocessor 18 may be an optional component.
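  • As an illustration of the color format conversion mentioned above, the sketch below converts one 8-bit RGB pixel to YCbCr. The application does not say which conversion matrix is used; the full-range BT.601 coefficients used by JFIF/JPEG are assumed here, and the function name is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (BT.601, as in JFIF)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    # Round to integers and clip to the valid 8-bit range.
    return tuple(min(255, max(0, int(round(v)))) for v in (y, cb, cr))
```

  • Gray pixels map to Cb = Cr = 128, so the chroma planes of a grayscale image are constant; the post-processor would apply the inverse matrix to return to RGB.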
  • Video encoder (or encoder) 20 is used to receive pre-processed image data 19 and provide encoded image data 21 (further described below with reference to FIG. 3 and the like).
  • The communication interface 22 in the source device 12 may be used to receive the encoded image data 21 and send the encoded image data 21 (or any other processed version) through the communication channel 13 to another device, such as the destination device 14 or any other device, for storage or direct reconstruction.
  • the destination device 14 includes a decoder 30 and may additionally or optionally include a communication interface (or communication unit) 28, a post-processor (or post-processing unit) 32 and a display device 34.
  • The communication interface 28 in the destination device 14 is used to receive the encoded image data 21 (or any other processed version) directly from the source device 12 or from any other source device such as a storage device (for example, an encoded image data storage device), and to provide the encoded image data 21 to the decoder 30.
  • Communication interface 22 and communication interface 28 may be used to send or receive the encoded image data 21 via a direct communication link between source device 12 and destination device 14, such as a direct wired or wireless connection, or via any type of network, such as a wired network, a wireless network, or any combination thereof.
  • The communication interface 22 may be used to encapsulate the encoded image data 21 into a suitable format such as a message, and/or process the encoded image data using any type of transmission encoding or processing for transmission over a communication link or communication network.
  • the communication interface 28 corresponds to the communication interface 22 and can, for example, be used to receive transmission data and process the transmission data using any type of corresponding transmission decoding or processing and/or decapsulation to obtain the encoded image data 21 .
  • Both communication interface 22 and communication interface 28 can be configured as a one-way communication interface, as indicated by the arrow of the communication channel 13 pointing from the source device 12 to the destination device 14 in Figure 2A, or as a bi-directional communication interface, and can be used to send and receive messages, etc., to establish a connection and to acknowledge and exchange any other information related to the communication link and/or data transmission, such as the transmission of encoded image data.
  • The video decoder (or decoder) 30 is configured to receive encoded image data 21 and provide decoded image data (or a decoded image) 31 (further described below with reference to FIG. 4 and the like).
  • the post-processor 32 is used to perform post-processing on decoded image data 31 (also referred to as reconstructed image data) such as decoded images to obtain post-processed image data 33 such as post-processed images.
  • Post-processing performed by the post-processor 32 may include, for example, color format conversion (e.g., from YCbCr to RGB), toning, cropping, or resampling, or any other processing that prepares the decoded image data 31 for display by the display device 34 or the like.
  • the display device 34 is used to receive the post-processed image data 33 to display the image to a user or viewer or the like.
  • Display device 34 may be or include any type of display for presenting reconstructed images, such as an integrated or external display screen or monitor.
  • The display screen may include a liquid crystal display (LCD), an organic light emitting diode (OLED) display, a plasma display, a projector, a micro LED display, a liquid crystal on silicon (LCoS) display, a digital light processor (DLP), or any other type of display.
  • The coding system 10 also includes a training engine 25, which is used to train the encoder 20 or the decoder 30, in particular the neural network used in the encoder 20 or the decoder 30 (described in detail below).
  • the training data can be stored in a database (not shown), and the training engine 25 trains and obtains a neural network based on the training data. It should be noted that the embodiment of the present application does not limit the source of the training data. For example, the training data may be obtained from the cloud or other places for model training.
  • Although FIG. 2A shows source device 12 and destination device 14 as separate devices, device embodiments may also include both the source device 12 and the destination device 14, or the functionality of both, that is, the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality.
  • In these embodiments, the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality may be implemented using the same hardware and/or software, by separate hardware and/or software, or by any combination thereof.
  • Encoder 20 (e.g., video encoder 20) or decoder 30 (e.g., video decoder 30), or both, may be implemented by processing circuitry such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, dedicated video coding processors, or any combination thereof.
  • the encoder 20 and the decoder 30 may each be implemented by a processing circuit 46.
  • the processing circuitry 46 may be used to perform various operations discussed below.
  • the device can store the instructions of the software in a suitable non-transitory computer-readable storage medium, and use one or more processors to execute the instructions in hardware, thereby performing the technology of the present application.
  • One of the encoder 20 and the decoder 30 may be integrated in a single device as part of a combined encoder/decoder (CODEC), as shown in Figure 2B.
  • Source device 12 and destination device 14 may include any of a variety of devices, including any type of handheld or stationary device, for example, a notebook or laptop computer, a smartphone, a tablet, a camera, a desktop computer, a set-top box, a television, a display device, a digital media player, a video game console, or a video streaming device (e.g., a content service server or content distribution server), and may use no operating system or any type of operating system.
  • source device 12 and destination device 14 may be equipped with components for wireless communications. Accordingly, source device 12 and destination device 14 may be wireless communication devices.
  • The coding system 10 shown in FIG. 2A is only exemplary, and the technology provided in this application may be applicable to video coding settings (e.g., video encoding or video decoding) that do not necessarily include any data communication between the encoding device and the decoding device. In other examples, data is retrieved from local storage, sent over a network, and so on. The video encoding device may encode the data and store the data in memory, and/or the video decoding device may retrieve the data from memory and decode the data. In some examples, encoding and decoding are performed by devices that do not communicate with each other but merely encode data to memory and/or retrieve and decode data from memory.
  • FIG. 2B is an illustration of an example of video coding system 40.
  • Video coding system 40 may include imaging device 41, video encoder 20, video decoder 30 (and/or a video codec implemented by processing circuitry 46), antenna 42, one or more processors 43, one or more memory stores 44, and/or display device 45.
  • the imaging device 41, the antenna 42, the processing circuit 46, the video encoder 20, the video decoder 30, the processor 43, the memory storage 44 and/or the display device 45 can communicate with each other.
  • video coding system 40 may include only video encoder 20 or only video decoder 30 .
  • antenna 42 may be used to transmit or receive an encoded bitstream of video data.
  • display device 45 may be used to present video data.
  • Processing circuit 46 may include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, or the like.
  • Video coding system 40 may also include an optional processor 43, which may similarly include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, or the like.
  • The memory store 44 may be any type of memory, such as volatile memory (e.g., static random access memory (SRAM), dynamic random access memory (DRAM), etc.) or non-volatile memory (e.g., flash memory, etc.).
  • memory store 44 may be implemented by cache memory.
  • processing circuitry 46 may include memory (e.g., cache etc.) for implementing image buffers etc.
  • video encoder 20 implemented by logic circuitry may include an image buffer (eg, implemented by processing circuitry 46 or memory storage 44) and a graphics processing unit (eg, implemented by processing circuitry 46).
  • a graphics processing unit may be communicatively coupled to the image buffer.
  • the graphics processing unit may include video encoder 20 implemented through processing circuitry 46 .
  • Logic circuits can be used to perform the various operations discussed herein.
  • Video decoder 30 may be implemented in a similar manner by processing circuitry 46 to implement the various modules discussed with respect to video decoder 30 of FIG. 4 and/or any other decoder system or subsystem described herein.
  • Video decoder 30 implemented by logic circuitry may include an image buffer (e.g., implemented by processing circuitry 46 or memory storage 44) and a graphics processing unit (e.g., implemented by processing circuitry 46).
  • a graphics processing unit may be communicatively coupled to the image buffer.
  • the graphics processing unit may include video decoder 30 implemented by processing circuitry 46 .
  • antenna 42 may be used to receive an encoded bitstream of video data.
  • The encoded bitstream may include data related to encoded video frames, indicators, index values, mode selection data, etc., as discussed herein, such as data related to encoded partitions (e.g., transform coefficients or quantized transform coefficients, optional indicators (as discussed), and/or data defining the encoded partitions).
  • Video coding system 40 may also include video decoder 30 coupled to antenna 42 and for decoding the encoded bitstream.
  • Display device 45 is used to present video frames.
  • video decoder 30 may be used to perform the opposite process.
  • video decoder 30 may be configured to receive and parse such syntax elements and decode related video data accordingly.
  • video encoder 20 may entropy encode the syntax elements into an encoded video bitstream.
  • video decoder 30 may parse such syntax elements and decode the associated video data accordingly.
  • VVC Versatile video coding
  • VCEG ITU-T Video Coding Experts Group
  • HEVC High-Efficiency Video Coding
  • JCT-VC Joint Collaboration Team on Video Coding
  • FIG. 3 is a schematic diagram of a video coding device 400 provided by an embodiment of the present invention.
  • Video coding device 400 is suitable for implementing the disclosed embodiments described herein.
  • The video coding device 400 may be a decoder, such as the video decoder 30 in FIG. 2A, or an encoder, such as the video encoder 20 in FIG. 2A.
  • The video coding device 400 includes: an ingress port 410 (or input port 410) and a receiver unit (Rx) 420 for receiving data; a processor, logic unit, or central processing unit (CPU) 430 for processing data (for example, the processor 430 may be a neural network processor 430); a transmitter unit (Tx) 440 and an egress port 450 (or output port 450) for transmitting data; and a memory 460 for storing data.
  • The video coding device 400 may further include optical-to-electrical (OE) components and electrical-to-optical (EO) components coupled to the ingress port 410, the receiver unit 420, the transmitter unit 440, and the egress port 450, serving as an outlet or inlet for optical or electrical signals.
  • Processor 430 is implemented in hardware and software.
  • Processor 430 may be implemented as one or more processor chips, cores (eg, multi-core processors), FPGAs, ASICs, and DSPs.
  • Processor 430 communicates with ingress port 410, receiving unit 420, transmitting unit 440, egress port 450, and memory 460.
  • Processor 430 includes a coding module 470 (e.g., a neural network (NN)-based coding module 470).
  • Coding module 470 implements the embodiments disclosed above. For example, coding module 470 performs, processes, prepares, or provides various coding operations.
  • Coding module 470 therefore provides a substantial improvement in the functionality of the video coding device 400 and affects the switching of the video coding device 400 to different states.
  • Alternatively, coding module 470 may be implemented as instructions stored in memory 460 and executed by processor 430.
  • Memory 460 includes one or more disks, tape drives, and solid-state drives that may serve as overflow data storage devices for storing programs when they are selected for execution, and for storing instructions and data that are read during program execution.
  • Memory 460 may be volatile and/or non-volatile, and may be read-only memory (ROM), random access memory (RAM), ternary content-addressable memory (TCAM), and/or static random-access memory (SRAM).
  • FIG. 4 is a simplified block diagram of an apparatus 500 provided by an exemplary embodiment.
  • the apparatus 500 may be used as either or both of the source device 12 and the destination device 14 in FIG. 2A .
  • Processor 502 in device 500 may be a central processing unit.
  • processor 502 may be any other type of device or devices that exists or may be developed in the future that is capable of manipulating or processing information.
  • Although the disclosed implementations may be implemented using a single processor, such as processor 502 as shown, using more than one processor can be faster and more efficient.
  • memory 504 in apparatus 500 may be a read-only memory (ROM) device or a random access memory (RAM) device. Any other suitable type of storage device may be used as memory 504.
  • Memory 504 may include code and data 506 that processor 502 accesses through bus 512 .
  • Memory 504 may also include an operating system 508 and application programs 510 including at least one program that allows processor 502 to perform the methods described herein.
  • applications 510 may include Applications 1 through N, including a video coding application that performs the methods described herein.
  • Apparatus 500 may also include one or more output devices, such as display 518.
  • display 518 may be a touch-sensitive display that combines a display with a touch-sensitive element that can be used to sense touch input.
  • Display 518 may be coupled to processor 502 via bus 512 .
  • Although bus 512 in device 500 is described herein as a single bus, bus 512 may include multiple buses. Additionally, auxiliary storage may be directly coupled to other components of device 500 or accessed through a network, and may include a single integrated unit, such as a memory card, or multiple units, such as multiple memory cards. Accordingly, device 500 may have a wide variety of configurations.
  • Figure 5 is an example diagram of an application scenario of an embodiment of the present application.
  • The application scenario can be a service involving image/video collection, storage, or transmission in terminals, cloud servers, and video surveillance, for example, terminal photography/video recording, photo albums, cloud photo albums, video surveillance, etc.
  • Encoding end: a camera collects images/videos.
  • The artificial intelligence (AI) image/video encoding network extracts features from the images/videos to obtain image features with low redundancy, and then compresses the image features to obtain code streams/image files.
  • the AI image/video decoding network decompresses the code stream/image file to obtain image features, and then performs inverse feature extraction on the image features to obtain the reconstructed image/video.
  • the storage/transmission module stores the compressed stream/image files for different businesses (for example, terminal photography, video surveillance, cloud servers, etc.) or transmits them (for example, cloud services, live broadcast technology, etc.).
  • Figure 6 is an example diagram of an application scenario of an embodiment of the present application.
  • The application scenario can be a service involving image/video collection, storage, or transmission in terminals and video surveillance, for example, terminal photo albums, video surveillance, live broadcast, etc.
  • Encoding side: the encoding network transforms the image/video into image features with lower redundancy; it usually contains nonlinear transformation units and has nonlinear characteristics.
  • The entropy estimation network is responsible for calculating the encoding probability of each data item in the image features.
  • the entropy coding network performs lossless encoding on image features according to the probability corresponding to each data to obtain a code stream/image file, further reducing the amount of data transmission in the image compression process.
  • the entropy decoding network performs lossless decoding on the code stream/image file based on the probability corresponding to each data to obtain reconstructed image features.
  • the decoding network inversely transforms the image features output by entropy decoding and parses them into images/videos. Corresponding to the coding network, it usually contains nonlinear transformation units and has nonlinear characteristics.
  • the saving module saves the code stream/image file to the corresponding storage location of the terminal.
  • the loading module loads the code stream/image file from the corresponding storage location of the terminal and inputs it into the entropy decoding network.
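  • The role of the per-item probabilities in the entropy coding described above can be illustrated with the ideal code length: a symbol with probability p costs about -log2(p) bits under an ideal entropy coder, so more accurate probability estimates give shorter code streams. The sketch below only estimates that length; the symbol alphabet and the probability table are hypothetical stand-ins for the output of an entropy estimation network.

```python
import math


def estimated_bits(symbols, probability):
    """Ideal total code length in bits: the sum of -log2 p(symbol)."""
    return sum(-math.log2(probability[s]) for s in symbols)
```

  • For example, with probabilities 0.5, 0.25, and 0.25 for the symbols 0, 1, and 2, the sequence 0, 0, 1 costs 1 + 1 + 2 = 4 bits; a practical arithmetic coder comes close to this bound.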
  • Figure 7 is an example diagram of an application scenario of an embodiment of the present application.
  • The application scenario can be a service involving image/video collection, storage, or transmission in the cloud and video surveillance, for example, cloud photo albums, video surveillance, live broadcast, etc.
  • Encoding end: the image/video is obtained locally and image (JPEG) encoded to obtain a compressed image/video, which is then sent to the cloud.
  • The cloud performs JPEG decoding on the compressed image/video to obtain the image/video, and then compresses the image/video to obtain the code stream/image file and stores it.
  • Decoding end: when the local side needs to obtain images/videos from the cloud, the cloud decompresses the code stream/image files to obtain the images/videos, JPEG-encodes the images/videos to obtain compressed images/videos, and sends the compressed images/videos to the local side. The local side performs JPEG decoding on the compressed images/videos to obtain the images/videos.
  • For the structure of the cloud and the purpose of each module, refer to the structure and purpose of each module in Figure 7; details are not repeated here in the embodiment of the present application.
  • embodiments of the present application provide an image encoding/decoding method to achieve efficient nonlinear transformation processing and improve the rate-distortion performance in image/video compression algorithms.
  • FIG. 8 is a flowchart of a process 800 of a regional enhancement layer encoding method according to an embodiment of the present application.
  • Process 800 may be performed by video encoder 20.
  • Process 800 is described as a series of steps or operations, and it should be understood that process 800 may be performed in various orders and/or concurrently and is not limited to the order of execution shown in FIG. 8. Assuming that video encoder 20 is being used for a video data stream with multiple image frames, process 800 including the following steps is performed to encode a regional enhancement layer.
  • Process 800 may include:
  • Step 801: Obtain the reconstructed pixels of the base layer of the target area.
  • each frame in a video sequence can be segmented into a set of non-overlapping image blocks and then encoded at the block level.
  • Encoders usually process or encode video at the block (image block) level, e.g., by producing prediction blocks through spatial (intra) prediction and temporal (inter) prediction, subtracting the prediction block from the image block (the block currently being processed or to be processed) to obtain the residual block, and transforming the residual block into the transform domain and quantizing it to reduce the amount of data to be transmitted (compression).
  • the encoder also needs to obtain the reconstructed residual block through inverse quantization and inverse transformation, and then add the pixel values of the reconstructed residual block and the pixel values of the prediction block to obtain the reconstructed block.
  • The area included in the image frame may refer to a largest coding unit (LCU) in the entire frame image, an image block obtained by dividing the entire frame image, or a region of interest (ROI) in the entire frame image (that is, a specified image area that needs to be processed), etc. It should be understood that, in addition to the aforementioned situations, the region may also be a partial image described in other ways, and there is no specific limitation on this.
  • The target area is intended to represent the position of the image block that the solution of the embodiment of the present application focuses on and processes during one encoding pass. The shape of the target area can be a regular rectangle or square, or an irregular shape, and there is no specific limitation on this.
  • The initially obtained image block can be called the original block, and the pixels it contains can be called original pixels; the reconstructed image block can be called the reconstructed block, and the pixels it contains can be called reconstructed pixels.
  • The encoding process of each layer is roughly similar; in particular, each layer includes initial image blocks and reconstructed image blocks.
  • For the base layer, the pixels contained in the initially obtained region are called the original pixels of the base layer of the region, and the pixels contained in the reconstructed region are called the reconstructed pixels of the base layer of the region.
  • For the enhancement layer, the pixels contained in the initially obtained region are called the original pixels of the enhancement layer of the region, and the pixels contained in the reconstructed region are called the reconstructed pixels of the enhancement layer of the region.
  • Obtaining the reconstructed pixels of the base layer of the target area may include: encoding the image to which the target area belongs to obtain the base layer code stream of the image, decoding the base layer code stream to obtain the reconstructed map of the base layer of the image, and then determining at least one area that needs to be enhanced based on the reconstructed map, where the target area is one of the at least one area.
  • the encoding end encodes the original image to obtain the base layer code stream, and then decodes the base layer code stream to obtain the reconstructed image of the base layer.
  • the VVC encoder encodes the original image to obtain the base layer code stream
  • the VVC decoder decodes the base layer code stream to obtain the reconstructed image of the base layer.
  • other coders such as HEVC codec and AVC codec, can also be used for the base layer, and this is not specifically limited in the embodiments of the present application.
  • the reconstruction map can be divided to obtain multiple regions.
  • the multiple regions can refer to the relevant descriptions of the above regions, for example, multiple LCUs, multiple image blocks, multiple ROIs, etc. It should be noted that the regions may be partial images described in multiple ways, and correspondingly, multiple ways may be used to obtain divisions of multiple regions, which are not specifically limited in the embodiments of the present application.
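  • One simple way to divide the reconstruction map into regions is a regular grid of fixed-size blocks, sketched below. The block size, the rectangle representation `(top, left, bottom, right)`, and the function names are assumptions for illustration; the application allows other divisions such as LCUs or ROIs.

```python
def split_into_regions(height, width, block_size):
    """Tile an image of the given size with block_size x block_size rectangles.

    Blocks on the right and bottom edges are cropped to the image boundary.
    Each region is (top, left, bottom, right) with exclusive bottom/right.
    """
    regions = []
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            bottom = min(top + block_size, height)
            right = min(left + block_size, width)
            regions.append((top, left, bottom, right))
    return regions


def crop(image, region):
    """Extract the pixels of one region from an image stored as a list of rows."""
    top, left, bottom, right = region
    return [row[left:right] for row in image[top:bottom]]
```

  • `crop` applied to the base layer reconstruction map and a selected region yields the reconstructed pixels of the base layer for that region.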
  • The target area in the embodiment of the present application is an area that needs to be enhanced.
  • In the subsequent enhancement layer coding, only the target area needs to be encoded and decoded by the enhancement layer. In this way, there is no need to enhance the entire frame of the image, which can improve the encoding and decoding efficiency of the image.
  • The areas determined to be target areas among the plurality of areas may include at least one area that satisfies one of the following conditions: the variance of the area is greater than a first threshold (for example, if the variance of the area is greater than a threshold t1, t1 > 0, the texture of the area can be considered relatively complex, so enhancement processing is needed to improve the image quality); or the proportion of pixels in the area whose gradient is greater than a second threshold to the total number of pixels is greater than a third threshold (for example, if the proportion of pixels in the area whose gradient is greater than a threshold a to the total number of pixels is greater than a threshold t2, where a > 0 and t2 lies between 0 and 1, the texture of the area can likewise be considered relatively complex, so enhancement processing is required to improve the image quality).
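  • The two selection conditions above can be sketched as follows. The thresholds t1, a, and t2 follow the text; the gradient definition (sum of absolute forward differences in x and y, with the last row and column repeated) and the function names are assumptions, since the application does not specify how the gradient is computed.

```python
def variance(region):
    """Population variance of all pixels in a region (list of rows)."""
    pixels = [p for row in region for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def high_gradient_ratio(region, a):
    """Share of pixels whose gradient magnitude exceeds the threshold a."""
    h, w = len(region), len(region[0])
    count = 0
    for y in range(h):
        for x in range(w):
            dx = region[y][min(x + 1, w - 1)] - region[y][x]
            dy = region[min(y + 1, h - 1)][x] - region[y][x]
            if abs(dx) + abs(dy) > a:
                count += 1
    return count / (h * w)


def needs_enhancement(region, t1, a, t2):
    """True if the region meets either texture-complexity condition."""
    return variance(region) > t1 or high_gradient_ratio(region, a) > t2
```

  • A flat region fails both tests and is skipped, while a strongly textured region passes and becomes a candidate target area for enhancement layer coding.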
  • the encoding end can use any one of the above at least one area as the current target area, and extract the pixels corresponding to the target area from the reconstruction map of the base layer as the reconstructed pixels of the base layer of the target area.
  • Step 802: Input the reconstructed pixels into the correction network to obtain correction information of the target area.
  • The input of the correction network is the reconstructed pixels of the base layer of the target area, and the output is the correction information corresponding to the target area. Its function is to remove noise signals that are useless for AI enhancement layer encoding.
  • The correction network can be composed of convolutional layers (conv) and activation layers (ReLU). Whether the correction network includes generalized divisive normalization (GDN) layers, or other activation functions, is not limited.
  • The embodiments of this application do not limit the number of convolutional layers or the size of the convolution kernel.
  • For example, the convolution kernel can be 3×3, 5×5 or 7×7.
  • The reconstructed pixels of the base layer of the target area are input into the correction network to obtain at least one of multiple pixel values and multiple feature values of the target area.
  • The correction information may be the multiple pixel values or the multiple feature values.
  • A neural network may be used to implement the correction network.
  • For example, a neural network composed of four convolutional/deconvolutional layers interleaved and cascaded with three activation layers is used to construct the correction network.
  • The convolution kernel size of each convolutional layer can be set to 3×3, the number of channels of the output feature map is set to M, and each convolutional layer downsamples the width and height by a factor of 2. It should be understood that the foregoing example does not constitute a specific limitation.
  • The size of the convolution kernel, the number of feature map channels, the downsampling factor, the number of downsampling operations, the number of convolutional layers, and the number of activation layers can all be adjusted, and the embodiments of this application do not specifically limit this.
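  • As a rough check of the example above, the spatial size after a stack of stride-2 convolutions can be computed as follows. This is only a sketch: the helper names and the padding of 1 are assumptions (the text gives the 3×3 kernel and the factor-2 downsampling, but not the padding).

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def stacked_output_size(size, num_layers=4, **conv_kwargs):
    """Spatial size after num_layers identical convolutions in cascade."""
    for _ in range(num_layers):
        size = conv_output_size(size, **conv_kwargs)
    return size
```

  • Under these assumptions, four layers that each halve the width and height reduce a 64×64 target area to 4×4, that is, an overall downsampling factor of 16.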
  • Multiple pixel values and/or multiple feature values can be output.
  • The input of the correction network is the reconstructed pixels, and the reconstructed pixels lie within the target area. Therefore, even if the output multiple pixel values and/or multiple feature values do not correspond one-to-one to pixels in the target area, they can still be considered to belong to the target area; that is, the multiple pixel values and/or multiple feature values correspond to the target area.
  • The correction information may be upsampled values of the reconstructed pixels of the base layer of the target area.
  • The resolution of the reconstructed pixels of the base layer of the target area may be the same as or different from the resolution of the reconstructed pixels of the enhancement layer of the target area, which is not specifically limited in this embodiment of the present application.
  • Figure 9a is an exemplary schematic diagram of the correction network.
  • The correction network includes 6 convolutional layers and 5 ReLU activation layers, and the convolution kernel size is 3×3.
  • The reconstructed pixels of the base layer of the target area are input into the correction network, and multiple pixel values are output.
  • This correction network can denoise the reconstructed pixels to remove noise signals that are not useful for AI enhancement layer encoding, and obtain multiple pixel values.
  • Figure 9b is an exemplary schematic diagram of the correction network.
  • The correction network includes 6 convolutional layers and 5 ReLU activation layers, and the convolution kernel size is 3×3.
  • The reconstructed pixels of the base layer of the target area are input into the correction network, and multiple feature values are output.
  • Figure 9c is an exemplary schematic diagram of the correction network.
  • The correction network includes 8 convolutional layers and 6 ReLU activation layers, and the convolution kernel size is 3×3.
  • The reconstructed pixels of the base layer of the target area are input into the correction network, and multiple pixel values and multiple feature values are output from different layers of the correction network.
  • Figure 9d is an exemplary schematic diagram of the correction network.
  • The correction network includes 5 convolutional layers, one deconvolution layer and 5 ReLU activation layers, and the convolution kernel size is 3×3.
  • The convolution kernel size of the deconvolution layer of the correction network can be set to 3×3, the number of channels of the output feature map can be set to 48 (other values are also possible; this is not limited here), the deconvolution layer upsamples the width and height by a factor of 2, and the number of output channels of the last convolutional layer is M.
  • When M is 3, multiple pixel values are output; when M is 48, multiple feature values are output.
  • The resolution of the correction information is twice the resolution of the reconstructed pixels of the base layer.
  • In addition to the above four examples of correction networks, the embodiments of the present application can also use correction networks with other structures, which is not specifically limited.
  • Step 803: Input the correction information and the original pixels of the target area into the encoding network to obtain the residual feature map of the enhancement layer of the target area.
  • The encoding end can extract the pixels at the position corresponding to the target area from the original image; these are the original pixels of the target area.
  • The correction information can take one of two forms: multiple pixel values or multiple feature values.
  • Correspondingly, the encoding network can also adopt two structures.
  • The input of the encoding network (Encoder) at the encoding end is the correction information and the original pixels of the target area, and the output is the residual feature map of the enhancement layer of the target area.
  • Figure 10a is an exemplary schematic diagram of an encoding network.
  • The encoding network may be a first encoding network, including 4 convolutional layers and 3 GDNs.
  • The difference between each original pixel and the corresponding pixel value in the correction information is calculated, and the difference result is then input into the first encoding network to obtain the residual feature map.
  • Because the correction information (multiple pixel values) is in the pixel domain, it can be subtracted directly from the original pixels of the target area.
  • Figure 10b is an exemplary schematic diagram of an encoding network.
  • The encoding network may be a second encoding network, including 4 convolutional layers and 3 GDNs.
  • The original pixels are input into the second encoding network, the difference between the output of any convolutional layer (for example, the second convolutional layer) and the multiple feature values in the correction information is calculated, and the difference result is then input into the network layers after that convolutional layer to obtain the residual feature map.
  • Because the correction information (multiple feature values) is in the feature domain, the original pixels must first be input into the second encoding network and converted into the feature domain, and then the difference with the multiple feature values is calculated.
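  • The two encoder structures described above can be summarized compactly. The notation is an assumption introduced here for illustration: x denotes the original pixels of the target area, p the correction information, y the residual feature map, g_a the first encoding network, and g_a^{(1)}, g_a^{(2)} the parts of the second encoding network before and after the layer at which the difference is taken.

```latex
\begin{aligned}
\text{pixel domain (Figure 10a):}\quad   & y = g_a\!\left(x - p\right) \\
\text{feature domain (Figure 10b):}\quad & y = g_a^{(2)}\!\left(g_a^{(1)}(x) - p\right)
\end{aligned}
```

  • In the first form p has the same representation as x, whereas in the second form p must match the shape of the intermediate feature map g_a^{(1)}(x).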
  • Step 804: Encode the residual feature map to obtain the enhancement layer code stream of the target area.
  • Multiple probability distributions can be obtained, corresponding to the multiple feature values contained in the residual feature map; entropy encoding is then performed on the corresponding feature values in the residual feature map according to these probability distributions to obtain the enhancement layer code stream.
  • The residual feature map of the enhancement layer of the target area contains multiple feature values.
  • The multiple probability distributions can be obtained in several ways:
  • The probability estimation network may also include convolutional layers and GDN layers. Whether the probability estimation network includes other activation functions is not limited, and the embodiments of the present application do not limit the number of convolutional layers or the size of the convolution kernel.
  • A probability distribution model is first used for modeling; the correction information is then input into the probability estimation network to obtain model parameters, and the model parameters are substituted into the probability distribution model to obtain the probability distributions.
  • The probability distribution model can be a single Gaussian model (Gaussian single model, GSM), an asymmetric Gaussian model, a Gaussian mixture model (GMM) or a Laplace distribution model.
  • For a Gaussian distribution, the model parameters are the values of the mean parameter μ and the variance σ.
  • For a Laplace distribution, the model parameters are the values of the position parameter μ and the scale parameter b. It should be understood that in addition to the above probability distribution models, other models can also be used, without specific limitation.
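  • The step of substituting model parameters into a probability distribution model can be illustrated as follows. This is a sketch under stated assumptions: the application does not say how a continuous model is turned into a probability for a quantized feature value, so integrating the density over a unit-width bin around each integer is an assumption borrowed from common practice in learned compression, and all function names are hypothetical.

```python
import math


def gaussian_cdf(v, mu, variance):
    """CDF of a Gaussian with mean mu and the given variance."""
    return 0.5 * (1.0 + math.erf((v - mu) / math.sqrt(2.0 * variance)))


def laplace_cdf(v, mu, b):
    """CDF of a Laplace distribution with position mu and scale b."""
    if v < mu:
        return 0.5 * math.exp((v - mu) / b)
    return 1.0 - 0.5 * math.exp(-(v - mu) / b)


def symbol_probability(cdf, value, *params):
    """Probability of an integer value: mass of the bin [value-0.5, value+0.5].

    The unit-width bin is an assumption, not taken from the application.
    """
    return cdf(value + 0.5, *params) - cdf(value - 0.5, *params)


def ideal_bits(probability):
    """Ideal code length in bits that an entropy coder approaches."""
    return -math.log2(probability)
```

  • For instance, `symbol_probability(gaussian_cdf, 0, 0.0, 1.0)` gives the probability assigned to the value 0 under a zero-mean, unit-variance model, and `ideal_bits` of that probability estimates its cost in the code stream. The better the estimated parameters match the actual feature values, the fewer bits are needed.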
  • Figure 11a is an exemplary schematic diagram of a probability estimation network. As shown in Figure 11a, the probability estimation network includes 4 convolutional layers and 3 GDNs, and the correction information input to it is multiple pixel values.
  • Figure 11b is an exemplary schematic diagram of a probability estimation network. As shown in Figure 11b, the probability estimation network includes two convolutional layers and two GDNs, and the correction information input to it is multiple feature values.
  • Other structures can also be used to construct the probability estimation network, and there is no specific limitation on this.
  • When the correction information is multiple feature values, the reconstructed side information is input into the side information processing network to obtain the first feature map, and the multiple feature values and the first feature map are input into the probability estimation network to obtain the multiple probability distributions.
  • The residual feature map of the enhancement layer of the target area can be input into the side information extraction network to obtain the side information of the residual feature map; entropy encoding is performed on the side information, and the result is written into the code stream.
  • The aforementioned side information of the residual feature map is used as the reconstructed side information of the residual feature map.
  • Side information can be considered as a feature map, with the same dimension as the residual feature map, obtained by further feature extraction on the residual feature map of the enhancement layer of the target area. Therefore, the side information extraction network performs further feature extraction on the residual feature map of the enhancement layer of the target area to obtain a feature map with the same dimension as the residual feature map.
  • The side information processing network may perform feature extraction on the side information to output a first feature map with the same resolution as the residual feature map.
  • For example, a neural network composed of three deconvolution layers interleaved and cascaded with two activation layers is used to implement the aforementioned function.
  • For the probability estimation network, refer to the description above; details are not repeated here.
  • When the correction information is multiple pixel values, the multiple pixel values are input into the feature estimation network to obtain the second feature map, the reconstructed side information is input into the side information processing network to obtain the first feature map, and the first feature map and the second feature map are input into the probability estimation network to obtain the multiple probability distributions.
  • The feature estimation network can convert the pixel values (the multiple pixel values), which are in the pixel domain, into feature values (the second feature map) in the feature domain.
  • The feature estimation network can adopt the structure of the probability estimation network shown in Figure 11b.
  • The difference from the probability estimation network lies in the input, the output and the training process. Precisely because of these differences, even if the network structures are the same (i.e., they contain the same layer structure), the two can still be regarded as different networks that perform different functions.
  • For the side information processing network and the probability estimation network, refer to the description above; details are not repeated here.
  • For the probability estimation network, refer to the description above; details are not repeated here.
  • The reconstructed pixels can be input into the feature estimation network to obtain the third feature map, the reconstructed side information can be input into the side information processing network to obtain the first feature map, and the first feature map and the third feature map can be input into the probability estimation network to obtain the multiple probability distributions.
  • For the feature estimation network, the side information processing network and the probability estimation network, refer to the description above; details are not repeated here.
  • AI coding is used for the enhancement layer of the selected target area.
  • On this basis, the reconstructed pixels of the base layer are passed through a correction network to remove noise signals that are useless for AI enhancement layer coding, yielding correction information.
  • The residual feature map of the enhancement layer of the target area is then encoded based on this correction information.
  • On the one hand, enhancement layer coding is performed only on the required area (the target area), which can reduce the complexity of enhancement layer coding and improve its efficiency.
  • On the other hand, encoding based on the correction information can improve the accuracy of encoding.
  • FIG. 12 is a flowchart of a process 1200 of the regional enhancement layer decoding method according to an embodiment of the present application.
  • Process 1200 may be performed by video decoder 30.
  • Process 1200 is described as a series of steps or operations, and it should be understood that process 1200 may be performed in various orders and/or occur simultaneously and is not limited to the order of execution shown in FIG. 12 .
  • A process 1200 including the following steps is performed to decode the code stream and obtain the regional enhancement layer reconstructed image.
  • Process 1200 may include:
  • Step 1201: Obtain the reconstructed pixels of the base layer of the target area.
  • The decoding end can receive the code stream from the encoding end.
  • This code stream contains the base layer code stream obtained after the encoding end encodes the original pixels of the image. Therefore, the decoding end obtains the reconstructed image of the base layer by decoding the base layer code stream.
  • The decoding end may obtain the reconstructed pixels of the base layer of the target area in the same manner as the encoding end. Refer to the description of step 801; details are not repeated here.
  • Step 1202: Input the reconstructed pixels into the correction network to obtain correction information of the target area.
  • For step 1202, refer to the description of step 802; details are not repeated here.
  • Step 1203: Obtain the enhancement layer code stream of the target area.
  • In step 804, the encoding end encodes the residual feature map to obtain the enhancement layer code stream of the target area. Correspondingly, the code stream received by the decoding end also includes the enhancement layer code stream of the target area.
  • Step 1204: Decode the enhancement layer code stream to obtain the residual feature map of the enhancement layer of the target area.
  • Multiple probability distributions can be obtained, corresponding to the multiple feature value code streams contained in the enhancement layer code stream; the corresponding feature value code streams in the enhancement layer code stream are then entropy decoded according to these probability distributions to obtain the residual feature map of the target area.
  • The enhancement layer code stream contains multiple feature value code streams.
  • Step 1205: Input the residual feature map and the correction information into the decoding network to obtain reconstructed pixels of the enhancement layer of the target area.
  • The encoding end uses the encoding network to obtain the residual feature map of the enhancement layer of the target area based on the input correction information and the original pixels of the target area.
  • Correspondingly, the decoding end uses the decoding network to perform the reverse operation, obtaining the reconstructed pixels of the enhancement layer of the target area based on the input residual feature map and correction information.
  • The correction information can take one of two forms: multiple pixel values or multiple feature values.
  • Correspondingly, the decoding network can also adopt two structures.
  • The input of the decoding network (Decoder) at the decoding end is the correction information and the residual feature map of the enhancement layer of the target area, and the output is the reconstructed pixels of the enhancement layer of the target area.
  • Figure 13a is an exemplary schematic diagram of a decoding network.
  • The decoding network may be a first decoding network, including 4 convolutional layers and 3 GDNs.
  • The residual feature map is input into the first decoding network to obtain the reconstructed residual pixels of the enhancement layer of the target area, and the reconstructed residual pixels and the corresponding pixel values in the correction information are then summed to obtain the reconstructed pixels.
  • In this case the correction information is multiple pixel values. Because the correction information (multiple pixel values) is in the pixel domain, it can be summed directly with the reconstructed residual pixels of the enhancement layer of the target area.
  • Figure 13b is an exemplary schematic diagram of a decoding network.
  • The decoding network may be a second decoding network, including 4 convolutional layers and 3 GDNs.
  • Because the correction information (multiple feature values) is in the feature domain, the residual feature map must first be input into the second decoding network and converted into the feature domain, and then summed with the multiple feature values.
  • The embodiments of the present application may also use decoding networks with other structures, which is not specifically limited.
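  • The two decoder structures can be written in the same compact form. The notation is an assumption introduced for illustration: \hat{y} denotes the decoded residual feature map, p the correction information, \hat{x} the reconstructed pixels of the enhancement layer, and g_s the first decoding network. For the second decoding network, the split into g_s^{(1)} (layers before the sum) and g_s^{(2)} (layers after the sum) mirrors the encoder side and is an assumption, since the text only states that the sum is taken in the feature domain.

```latex
\begin{aligned}
\text{pixel domain (Figure 13a):}\quad   & \hat{x} = g_s\!\left(\hat{y}\right) + p \\
\text{feature domain (Figure 13b):}\quad & \hat{x} = g_s^{(2)}\!\left(g_s^{(1)}(\hat{y}) + p\right)
\end{aligned}
```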
  • AI decoding is used for the enhancement layer code stream.
  • On this basis, the reconstructed pixels of the base layer are passed through a correction network to remove noise signals that are useless for AI enhancement layer coding, yielding correction information.
  • The enhancement layer code stream is then decoded based on the correction information.
  • On the one hand, only the required area (the target area) is decoded, which can reduce the complexity of enhancement layer decoding and improve its efficiency.
  • On the other hand, decoding based on the correction information can improve decoding accuracy.
  • Figure 14 is an exemplary schematic diagram of the encoding and decoding process. As shown in Figure 14, the process of this embodiment is as follows.
  • The base layer encoder (Encoder1) encodes the original image x to obtain the base layer code stream (Bitstream1), and the base layer decoder (Decoder1) decodes Bitstream1 to obtain the reconstructed image xc.
  • Encoder1 and Decoder1 can use traditional video coding standards for base layer encoding and decoding, such as the H.264/AVC, H.265/HEVC or H.266/VVC standards, or can use an existing JPEG image coding standard for base layer encoding and decoding; this is not specifically limited.
  • The method for determining the target area may include selecting an area with a relatively complex texture, selecting an area of interest to the human eye, or randomly selecting one or more areas. This step can use related technologies, and there is no specific limitation on this.
  • For example, when the variance of the area is greater than the threshold t1, the area has a relatively complex texture and can be used as a target area; or, when the proportion of pixels in the area whose gradient is greater than the threshold a is greater than t2, the area is determined to have a relatively complex texture and can be used as a target area.
  • Here, t1 and a are numbers greater than 0, and t2 is a number between 0 and 1.
  • The function of the correction network is to remove noise signals that are useless for AI enhancement layer encoding.
  • The correction network can output multiple pixel values and/or multiple feature values, and the correction information is the multiple pixel values or the multiple feature values. For example, the following three methods can be used:
  • Method 1: Input the reconstructed pixels of the base layer of xc1 into the correction network shown in Figure 9a, and output multiple pixel values as p.
  • Method 2: Input the reconstructed pixels of the base layer of xc1 into the correction network shown in Figure 9b, and output multiple feature values as p.
  • Method 3: Input the reconstructed pixels of the base layer of xc1 into the correction network shown in Figure 9c, output multiple pixel values and multiple feature values, and use one of them as p.
  • The floating point numbers in y can also be truncated to obtain integers, or the quantized feature values can be obtained by quantizing according to a preset quantization step size.
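  • The quantization options for the feature values in y can be sketched as follows. The function names are hypothetical, and round-half-up rounding is included only as a common baseline for comparison; it is an assumption and is not stated in the text above, which mentions truncation and a preset quantization step size.

```python
import math


def quantize_round(y):
    """Round each value to the nearest integer (half rounds up); assumed baseline."""
    return [math.floor(v + 0.5) for v in y]


def quantize_truncate(y):
    """Drop the fractional part of each floating point value."""
    return [math.trunc(v) for v in y]


def quantize_step(y, step):
    """Map each value to an integer index using a preset quantization step size."""
    return [math.floor(v / step + 0.5) for v in y]


def dequantize_step(indices, step):
    """Recover approximate values from the integer indices."""
    return [k * step for k in indices]
```

  • A smaller step size keeps more detail in the residual feature map at the cost of a larger enhancement layer code stream, while a larger step size has the opposite effect.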
  • There are two types of correction information: pixel values or feature values. Both can be input into the probability estimation network, but the structure of the probability estimation network differs. For pixel values, the probability estimation network adopts the structure shown in Figure 11a; for feature values, the probability estimation network adopts the structure shown in Figure 11b.
  • The base layer decoder parses the base layer code stream (Bitstream1) to obtain the reconstructed image xc.
  • Figure 15 is an exemplary schematic diagram of the encoding and decoding process. As shown in Figure 15, the difference between this embodiment and Embodiment 1 lies in the method of obtaining the probability distribution.
  • The base layer encoder (Encoder1) encodes the original image x to obtain the base layer code stream (Bitstream1), and the base layer decoder (Decoder1) decodes Bitstream1 to obtain the reconstructed image xc.
  • Encoder1 and Decoder1 can use traditional video coding standards for base layer encoding and decoding, such as the H.264/AVC, H.265/HEVC or H.266/VVC standards, or can use an existing JPEG image coding standard for base layer encoding and decoding; this is not specifically limited.
  • The method for determining the target area may include selecting an area with a relatively complex texture, selecting an area of interest to the human eye, or randomly selecting one or more areas. This step can use related technologies, and there is no specific limitation on this.
  • For example, when the variance of the area is greater than the threshold t1, the area has a relatively complex texture and can be used as a target area; or, when the proportion of pixels in the area whose gradient is greater than the threshold a is greater than t2, the area is determined to have a relatively complex texture and can be used as a target area.
  • Here, t1 and a are numbers greater than 0, and t2 is a number between 0 and 1.
  • The function of the correction network is to remove noise signals that are useless for AI enhancement layer encoding.
  • The correction network can output multiple pixel values and/or multiple feature values, and the correction information is the multiple pixel values or the multiple feature values. For example, the following three methods can be used:
  • Method 1: Input the reconstructed pixels of the base layer of xc1 into the correction network shown in Figure 9a, and output multiple pixel values as p.
  • Method 2: Input the reconstructed pixels of the base layer of xc1 into the correction network shown in Figure 9b, and output multiple feature values as p.
  • Method 3: Input the reconstructed pixels of the base layer of xc1 into the correction network shown in Figure 9c, output multiple pixel values and multiple feature values, and use one of them as p.
  • The floating point numbers in y can also be truncated to obtain integers, or the quantized feature values can be obtained by quantizing according to a preset quantization step size.
  • The encoding end can transmit Bitstream3 to the decoding end.
  • The base layer decoder parses the base layer code stream (Bitstream1) to obtain the reconstructed image xc.
  • Figure 16 is an exemplary schematic diagram of the encoding and decoding process. As shown in Figure 16, the difference between this embodiment and Embodiment 2 lies in the method of obtaining the probability distribution.
  • The base layer encoder (Encoder1) encodes the original image x to obtain the base layer code stream (Bitstream1), and the base layer decoder (Decoder1) decodes Bitstream1 to obtain the reconstructed image xc.
  • Encoder1 and Decoder1 can use traditional video coding standards for base layer encoding and decoding, such as the H.264/AVC, H.265/HEVC or H.266/VVC standards, or can use an existing JPEG image coding standard for base layer encoding and decoding; this is not specifically limited.
  • The method for determining the target area may include selecting an area with a relatively complex texture, selecting an area of interest to the human eye, or randomly selecting one or more areas. This step can use related technologies, and there is no specific limitation on this.
  • For example, when the variance of the area is greater than the threshold t1, the area has a relatively complex texture and can be used as a target area; or, when the proportion of pixels in the area whose gradient is greater than the threshold a is greater than t2, the area is determined to have a relatively complex texture and can be used as a target area.
  • Here, t1 and a are numbers greater than 0, and t2 is a number between 0 and 1.
  • The function of the correction network is to remove noise signals that are useless for AI enhancement layer encoding.
  • The correction network can output multiple pixel values and/or multiple feature values, and the correction information is the multiple pixel values or the multiple feature values. For example, the following three methods can be used:
  • Method 1: Input the reconstructed pixels of the base layer of xc1 into the correction network shown in Figure 9a, and output multiple pixel values as p.
  • Method 2: Input the reconstructed pixels of the base layer of xc1 into the correction network shown in Figure 9b, and output multiple feature values as p.
  • Method 3: Input the reconstructed pixels of the base layer of xc1 into the correction network shown in Figure 9c, output multiple pixel values and multiple feature values, and use one of them as p.
  • The floating point numbers in y can also be truncated to obtain integers, or the quantized feature values can be obtained by quantizing according to a preset quantization step size.
  • The encoding end can transmit Bitstream3 to the decoding end.
  • The reconstructed side information is input into the side information processing network to obtain the reconstructed side information in the feature domain.
  • The base layer decoder parses the base layer code stream (Bitstream1) to obtain the reconstructed image xc.
  • The reconstructed side information is input into the side information processing network to obtain the reconstructed side information in the feature domain.
  • Figure 17 is an exemplary schematic diagram of the encoding and decoding process. As shown in Figure 17, the difference between this embodiment and Embodiment 3 lies in the method of obtaining the probability distribution.
  • The base layer encoder (Encoder1) encodes the original image x to obtain the base layer code stream (Bitstream1), and the base layer decoder (Decoder1) decodes Bitstream1 to obtain the reconstructed image xc.
  • Encoder1 and Decoder1 can use traditional video coding standards for base layer encoding and decoding, such as the H.264/AVC, H.265/HEVC or H.266/VVC standards, or can use an existing JPEG image coding standard for base layer encoding and decoding; this is not specifically limited.
  • The method for determining the target area may include selecting an area with a relatively complex texture, selecting an area of interest to the human eye, or randomly selecting one or more areas. This step can use related technologies, and there is no specific limitation on this.
  • For example, when the variance of the area is greater than the threshold t1, the area has a relatively complex texture and can be used as a target area; or, when the proportion of pixels in the area whose gradient is greater than the threshold a is greater than t2, the area is determined to have a relatively complex texture and can be used as a target area.
  • Here, t1 and a are numbers greater than 0, and t2 is a number between 0 and 1.
  • The function of the correction network is to remove noise signals that are useless for AI enhancement layer encoding.
  • The correction network can output multiple pixel values and/or multiple feature values, and the correction information is the multiple pixel values or the multiple feature values. For example, the following three methods can be used:
  • Method 1: Input the reconstructed pixels of the base layer of xc1 into the correction network shown in Figure 9a, and output multiple pixel values as p.
  • Method 2: Input the reconstructed pixels of the base layer of xc1 into the correction network shown in Figure 9b, and output multiple feature values as p.
  • Method 3: Input the reconstructed pixels of the base layer of xc1 into the correction network shown in Figure 9c, output multiple pixel values and multiple feature values, and use one of them as p.
  • The floating point numbers in y can also be truncated to obtain integers, or the quantized feature values can be obtained by quantizing according to a preset quantization step size.
  • The encoding end can transmit Bitstream3 to the decoding end.
  • The reconstructed side information is input into the side information processing network to obtain the reconstructed side information in the feature domain.
  • The base layer decoder parses the base layer code stream (Bitstream1) to obtain the reconstructed image xc.
  • The reconstructed side information is input into the side information processing network to obtain the reconstructed side information in the feature domain.
  • FIG. 18 is an exemplary structural diagram of the encoding device 1800 according to the embodiment of the present application. As shown in FIG. 18 , the encoding device 1800 according to this embodiment can be applied to the encoding end 20 .
  • the encoding device 1800 may include: an acquisition module 1801, a processing module 1802, and an encoding module 1803, where:
  • the acquisition module 1801 is used to obtain the reconstructed pixels of the base layer of the target area; the processing module 1802 is used to input the reconstructed pixels into a correction network to obtain the correction information of the target area, and to input the correction information and the original pixels of the target area into the encoding network to obtain the residual feature map of the enhancement layer of the target area; the encoding module 1803 is used to encode the residual feature map to obtain the enhancement layer code stream of the target area.
  • the processing module 1802 is specifically configured to input the reconstructed pixels into the correction network to obtain at least one of multiple pixel values and multiple feature values of the target area, where the correction information is the plurality of pixel values or the plurality of feature values.
  • the encoding module 1803 is specifically configured to obtain multiple probability distributions based on the correction information, where the multiple probability distributions correspond to the multiple feature values included in the residual feature map; and to perform entropy coding on the corresponding feature values in the residual feature map according to the multiple probability distributions to obtain the enhancement layer code stream.
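  • The one-to-one pairing of probability distributions with feature values can be illustrated by estimating the ideal code length, which is what an entropy coder approaches. This is only a sketch under stated assumptions: each distribution is represented here as a plain dictionary from symbol to probability, the function name is hypothetical, and no actual arithmetic coder is implemented.

```python
import math


def estimated_bits(feature_values, distributions):
    """Sum of -log2 p(value) over all feature values.

    `distributions[i]` is the (assumed) probability mass function for
    `feature_values[i]`, given as a dict {symbol: probability}.
    """
    if len(feature_values) != len(distributions):
        raise ValueError("need exactly one distribution per feature value")
    total = 0.0
    for value, pmf in zip(feature_values, distributions):
        p = pmf.get(value, 0.0)
        if p <= 0.0:
            raise ValueError(f"symbol {value!r} has zero probability")
        total += -math.log2(p)
    return total


values = [0, 1, 0]
pmfs = [
    {0: 0.5, 1: 0.5},    # 1 bit for symbol 0
    {0: 0.75, 1: 0.25},  # 2 bits for symbol 1
    {0: 0.25, 1: 0.75},  # 2 bits for symbol 0
]
print(estimated_bits(values, pmfs))  # 5.0
```

  • The better the estimated distributions match the actual feature values, the smaller this sum, which is why the correction information and side information are used to condition the probability estimation.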
  • the encoding module 1803 is specifically configured to input the correction information into a probability estimation network to obtain the plurality of probability distributions.
  • the encoding module 1803 is specifically configured to obtain the multiple probability distributions based on the correction information and the reconstructed side information of the residual feature map.
  • the encoding module 1803 is specifically configured to, when the correction information is multiple feature values, input the reconstructed side information into the side information processing network to obtain the first feature map; and input the plurality of feature values and the first feature map into a probability estimation network to obtain the plurality of probability distributions.
  • the encoding module 1803 is specifically configured to, when the correction information is multiple pixel values, input the multiple pixel values into the feature estimation network to obtain the second feature map; input the reconstructed side information into a side information processing network to obtain a first feature map; and input the first feature map and the second feature map into a probability estimation network to obtain the multiple probability distributions.
  • the encoding module 1803 is specifically configured to obtain multiple probability distributions based on the reconstructed side information of the residual feature map, where the multiple probability distributions correspond to the multiple feature values contained in the residual feature map.
  • the encoding module 1803 is specifically configured to input the reconstructed side information into a probability estimation network to obtain the multiple probability distributions.
  • the encoding module 1803 is specifically configured to obtain the multiple probability distributions according to the reconstruction side information and the reconstruction pixels.
  • the encoding module 1803 is specifically configured to input the reconstructed pixels into a feature estimation network to obtain a third feature map; input the reconstructed side information into a side information processing network to obtain the first feature map; and input the first feature map and the third feature map into a probability estimation network to obtain the multiple probability distributions.
  • the encoding module 1803 is also used to input the residual feature map into a side information extraction network to obtain the side information of the residual feature map; and to entropy encode the side information or the quantized side information and write it into the code stream.
  • the encoding network includes a first encoding network; the encoding module 1803 is specifically configured to, when the correction information is multiple pixel values, compute the difference between the original pixels and the corresponding pixel values in the correction information; and input the difference result into the first encoding network to obtain the residual feature map.
  • the encoding network includes a second encoding network; the encoding module 1803 is specifically configured to input the original pixels into the second encoding network; when the correction information is multiple feature values, compute the difference between the output of any convolutional layer in the second encoding network and the corresponding feature values in the correction information; and input the difference result into the network layers after that convolutional layer in the second encoding network to obtain the residual feature map.
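  • The feature-domain variant, in which the correction is subtracted from the output of one layer and the result is passed to the remaining layers, can be sketched as below. The layers here are toy callables standing in for trained convolutional layers; `encode_with_feature_correction`, `split_index`, `double` and `plus_one` are hypothetical names used only for illustration.

```python
def encode_with_feature_correction(pixels, layers, split_index, correction):
    """Run layers[0..split_index], subtract the correction, then run the rest.

    `layers` is a list of callables that map a list of numbers to a list of
    numbers; in a real codec these would be trained network layers.
    """
    features = pixels
    for layer in layers[: split_index + 1]:
        features = layer(features)
    if len(features) != len(correction):
        raise ValueError("correction must match the layer output size")
    # Element-wise difference between the layer output and the correction.
    features = [f - c for f, c in zip(features, correction)]
    for layer in layers[split_index + 1:]:
        features = layer(features)
    return features


def double(xs):
    return [2 * v for v in xs]


def plus_one(xs):
    return [v + 1 for v in xs]


# double -> [2, 4, 6]; minus correction -> [1, 3, 5]; plus_one -> [2, 4, 6]
residual = encode_with_feature_correction([1, 2, 3], [double, plus_one], 0, [1, 1, 1])
print(residual)  # [2, 4, 6]
```

  • The pixel-domain variant corresponds to performing the subtraction before the first layer, directly between the original pixels and the corrected pixel values.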
  • the encoding module 1803 is also used to encode the image to which the target area belongs to obtain the base layer code stream of the image; decode the base layer code stream to obtain a reconstruction map of the base layer of the image; and determine, based on the reconstruction map, at least one area that needs to be enhanced, where the target area is one of the at least one area.
  • the encoding module 1803 is specifically configured to divide the reconstruction map to obtain multiple regions; determine a region whose variance is greater than a first threshold among the multiple regions as the at least one area; or, determine, for each of the multiple regions, the proportion of pixels whose gradient is greater than a second threshold to the total pixels, and determine a region in which the proportion is greater than a third threshold as the at least one area.
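  • The two region-selection criteria can be sketched as follows. This is an illustrative sketch only: the reconstruction map is a 2D list of sample values, regions are non-overlapping square tiles, and the gradient is taken as the sum of absolute forward differences, all of which are assumptions because the text does not fix the partitioning or the gradient operator. All function names are hypothetical.

```python
from statistics import pvariance


def tiles(image, size):
    """Yield ((top, left), tile) for non-overlapping size x size tiles."""
    height, width = len(image), len(image[0])
    for top in range(0, height, size):
        for left in range(0, width, size):
            yield (top, left), [row[left:left + size] for row in image[top:top + size]]


def gradient_ratio(tile, grad_threshold):
    """Fraction of pixels whose |dx| + |dy| exceeds grad_threshold."""
    h, w = len(tile), len(tile[0])
    count = 0
    for r in range(h):
        for c in range(w):
            dx = tile[r][min(c + 1, w - 1)] - tile[r][c]
            dy = tile[min(r + 1, h - 1)][c] - tile[r][c]
            if abs(dx) + abs(dy) > grad_threshold:
                count += 1
    return count / (h * w)


def select_regions(image, size, var_threshold=None,
                   grad_threshold=None, ratio_threshold=None):
    """Return tile origins that pass the variance or the gradient criterion."""
    use_variance = var_threshold is not None
    if not use_variance and (grad_threshold is None or ratio_threshold is None):
        raise ValueError("give var_threshold, or grad_threshold and ratio_threshold")
    selected = []
    for origin, tile in tiles(image, size):
        if use_variance:
            keep = pvariance([v for row in tile for v in row]) > var_threshold
        else:
            keep = gradient_ratio(tile, grad_threshold) > ratio_threshold
        if keep:
            selected.append(origin)
    return selected


image = [
    [10, 10, 0, 255],
    [10, 10, 255, 0],
]
print(select_regions(image, 2, var_threshold=100.0))                       # [(0, 2)]
print(select_regions(image, 2, grad_threshold=50, ratio_threshold=0.5))    # [(0, 2)]
```

  • In this toy input the flat left tile is skipped and the textured right tile is selected by both criteria. Because the selection uses only the base layer reconstruction, a decoder can repeat the same computation without extra signalling.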
  • the encoding module 1803 is also configured to use the side information of the residual feature map as the reconstructed side information of the residual feature map.
  • the device of this embodiment can be used to execute the technical solution of the method embodiment shown in Figure 8. Its implementation principles and technical effects are similar and will not be described again here.
  • FIG. 19 is an exemplary structural diagram of a decoding device 1900 according to an embodiment of the present application. As shown in FIG. 19 , the decoding device 1900 according to this embodiment can be applied to the decoding end 30 .
  • the decoding device 1900 may include: an acquisition module 1901, a processing module 1902 and a decoding module 1903, where:
  • the acquisition module 1901 is used to acquire the reconstructed pixels of the base layer of the target area; the processing module 1902 is used to input the reconstructed pixels into the correction network to obtain the correction information of the target area; the acquisition module 1901 is also used to obtain the enhancement layer code stream of the target area; the decoding module 1903 is used to decode the enhancement layer code stream to obtain the residual feature map of the enhancement layer of the target area; the processing module 1902 is also used to input the residual feature map and the correction information into the decoding network to obtain reconstructed pixels of the enhancement layer of the target area.
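  • The data flow between the three decoder-side modules can be sketched as a single function. Every callable below is a placeholder supplied by the caller; the names `decode_target_area`, `correction_net`, `entropy_decode` and `decoding_net` are hypothetical, and the toy stand-ins in the usage lines are not trained networks or a real entropy decoder.

```python
def decode_target_area(base_pixels, enhancement_stream,
                       correction_net, entropy_decode, decoding_net):
    """Mirror the module roles described for the decoding device.

    correction_net : base layer reconstructed pixels -> correction information
    entropy_decode : enhancement layer code stream   -> residual feature map
    decoding_net   : (residual feature map, correction) -> reconstructed pixels
    """
    correction = correction_net(base_pixels)            # processing module
    residual_map = entropy_decode(enhancement_stream)   # decoding module
    return decoding_net(residual_map, correction)       # processing module


# Toy stand-ins, for illustration only.
def toy_correction(pixels):
    return [v // 2 for v in pixels]


def toy_entropy_decode(stream):
    return list(stream)


def toy_decoding_net(residual_map, correction):
    return [r + c for r, c in zip(residual_map, correction)]


print(decode_target_area([10, 20], bytes([1, 2]),
                         toy_correction, toy_entropy_decode, toy_decoding_net))
# [6, 12]
```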
  • the processing module 1902 is specifically configured to input the reconstructed pixels into the correction network to obtain at least one of multiple pixel values and multiple feature values of the target area, where the correction information is the plurality of pixel values or the plurality of feature values.
  • the decoding module 1903 is specifically configured to obtain multiple probability distributions according to the correction information, where the multiple probability distributions correspond to the multiple feature value code streams contained in the enhancement layer code stream; and to perform entropy decoding on the corresponding feature value code streams in the enhancement layer code stream according to the multiple probability distributions to obtain the residual feature map.
  • the decoding module 1903 is specifically configured to input the correction information into a probability estimation network to obtain the multiple probability distributions.
  • the decoding module 1903 is specifically configured to obtain the plurality of probability distributions according to the correction information and the reconstructed side information of the residual feature map.
  • the decoding module 1903 is specifically configured to, when the correction information is multiple feature values, input the reconstructed side information into the side information processing network to obtain the first feature map; and input the plurality of feature values and the first feature map into a probability estimation network to obtain the plurality of probability distributions.
  • the decoding module 1903 is specifically configured to, when the correction information is multiple pixel values, input the multiple pixel values into the feature estimation network to obtain the second feature map; input the reconstructed side information into a side information processing network to obtain a first feature map; and input the first feature map and the second feature map into a probability estimation network to obtain the multiple probability distributions.
  • the decoding module 1903 is specifically configured to obtain multiple probability distributions based on the reconstructed side information of the residual feature map, where the multiple probability distributions correspond to the multiple feature value code streams contained in the enhancement layer code stream; and to perform entropy decoding on the corresponding feature value code streams in the enhancement layer code stream according to the multiple probability distributions to obtain the residual feature map.
  • the decoding module 1903 is specifically configured to input the reconstructed side information into a probability estimation network to obtain the multiple probability distributions.
  • the decoding module 1903 is specifically configured to obtain the multiple probability distributions according to the reconstructed side information and the reconstructed pixels.
  • the decoding module 1903 is specifically configured to input the reconstructed pixels into a feature estimation network to obtain a third feature map; input the reconstructed side information into a side information processing network to obtain the first feature map; and input the first feature map and the third feature map into a probability estimation network to obtain the multiple probability distributions.
  • the decoding module 1903 is also configured to input the residual feature map into a side information extraction network to obtain the side information of the residual feature map; and to use the side information as the reconstructed side information of the residual feature map.
  • the decoding module 1903 is also used to obtain the side information code stream of the target area; and parse the side information code stream to obtain the reconstructed side information.
  • the decoding network includes a first decoding network; the decoding module 1903 is specifically configured to input the residual feature map into the first decoding network to obtain enhancement of the target area.
  • the decoding network includes a second decoding network; the decoding module 1903 is specifically configured to input the residual feature map into the second decoding network; when the correction information is multiple feature values, sum the output of any convolutional layer in the second decoding network and the corresponding feature values in the correction information; and input the summed result into the network layers after that convolutional layer in the second decoding network to obtain the reconstructed pixels.
  • the decoding module 1903 is also used to obtain the base layer code stream of the image to which the target area belongs; parse the base layer code stream to obtain the reconstruction map of the base layer of the image; and determine, according to the reconstruction map, at least one area that needs to be enhanced, where the target area is one of the at least one area.
  • the decoding module 1903 is specifically configured to divide the reconstructed image to obtain multiple regions; determine a region with a variance greater than a first threshold among the multiple regions as the at least one region; or, determine the proportion of pixels in the plurality of regions whose gradient is greater than the second threshold to the total pixels, and determine the region in which the proportion is greater than the third threshold as the at least one region.
  • the device of this embodiment can be used to execute the technical solution of the method embodiment shown in Figure 12. Its implementation principles and technical effects are similar and will not be described again here.
  • each step of the above method embodiment can be completed through an integrated logic circuit of hardware in the processor or instructions in the form of software.
  • the processor can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the steps of the methods disclosed in the embodiments of the present application can be directly implemented by a hardware encoding processor, or executed by a combination of hardware and software modules in the encoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the memory mentioned in the above embodiments may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memories.
  • the non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
  • Volatile memory may be random access memory (RAM), which is used as an external cache. Many forms of RAM are available, for example: static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct rambus RAM (DR RAM).
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or can be integrated into another system, or some features can be ignored, or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • if the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Compression Of Band Width Or Redundancy In Fax (AREA)
PCT/CN2023/084290 2022-04-08 2023-03-28 Method and apparatus for encoding and decoding region enhancement layer Ceased WO2023193629A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
EP23784193.7A EP4492786A4 (en) 2022-04-08 2023-03-28 ENCODING METHOD AND APPARATUS FOR REGION ENHANCEMENT LAYER, AND DECODING METHOD AND APPARATUS FOR AREA ENHANCEMENT LAYER
JP2024559394A JP7760756B2 (ja) 2022-04-08 2023-03-28 Method and apparatus for encoding and decoding region enhancement layer
MX2024012429A MX2024012429A (es) 2022-04-08 2024-10-07 Method and apparatus for encoding and decoding region enhancement layer
US18/908,185 US20250030879A1 (en) 2022-04-08 2024-10-07 Method and Apparatus for Encoding and Decoding Region Enhancement Layer
JP2025173783A JP2026027261A (ja) 2022-04-08 2025-10-15 Method and apparatus for encoding and decoding region enhancement layer

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210365196.9A CN116939218A (zh) 2022-04-08 2022-04-08 Method and apparatus for encoding and decoding region enhancement layer
CN202210365196.9 2022-04-08

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/908,185 Continuation US20250030879A1 (en) 2022-04-08 2024-10-07 Method and Apparatus for Encoding and Decoding Region Enhancement Layer

Publications (1)

Publication Number Publication Date
WO2023193629A1 (zh) 2023-10-12

Family

ID=88244001

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/084290 Ceased WO2023193629A1 (zh) 2022-04-08 2023-03-28 Method and apparatus for encoding and decoding region enhancement layer

Country Status (6)

Country Link
US (1) US20250030879A1
EP (1) EP4492786A4
JP (2) JP7760756B2
CN (1) CN116939218A
MX (1) MX2024012429A
WO (1) WO2023193629A1

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
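CN117496246A (zh) * 2023-11-09 2024-02-02 Jinan University Malware classification method based on convolutional neural network
CN118152497A (zh) * 2024-03-28 2024-06-07 South China Botanical Garden, Chinese Academy of Sciences Novel structured hierarchical geographic grid encoding method, system, device, and medium
CN119150706A (zh) * 2024-11-18 2024-12-17 珠海市格努科技有限公司 Pantograph-catenary physical variable prediction method, apparatus, system, and storage medium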

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120343277A (zh) * 2024-01-18 2025-07-18 Hangzhou Hikvision Digital Technology Co., Ltd. Decoding method, apparatus, and device thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
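CN104685879A (zh) * 2012-09-27 2015-06-03 Dolby Laboratories Licensing Corporation Inter-layer reference picture processing for coding standard scalability
CN112702604A (zh) * 2021-03-25 2021-04-23 Beijing Dajia Internet Information Technology Co., Ltd. Encoding method and apparatus and decoding method and apparatus for layered video
CN113628115A (zh) * 2021-08-25 2021-11-09 Guangdong OPPO Mobile Telecommunications Corp., Ltd. Image reconstruction processing method and apparatus, electronic device, and storage medium
WO2022068716A1 (zh) * 2020-09-30 2022-04-07 Huawei Technologies Co., Ltd. Entropy encoding/decoding method and apparatus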

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
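KR100886191B1 (ko) * 2004-12-06 2009-02-27 LG Electronics Inc. Method for decoding an image block
KR20060063613A (ko) * 2004-12-06 2006-06-12 LG Electronics Inc. Method for scalable encoding and decoding of a video signal
KR100888963B1 (ko) * 2004-12-06 2009-03-17 LG Electronics Inc. Method for scalable encoding and decoding of a video signal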
EP2119236A1 (en) * 2007-03-15 2009-11-18 Nokia Corporation System and method for providing improved residual prediction for spatial scalability in video coding
US9554132B2 (en) * 2011-05-31 2017-01-24 Dolby Laboratories Licensing Corporation Video compression implementing resolution tradeoffs and optimization
US20140003504A1 (en) * 2012-07-02 2014-01-02 Nokia Corporation Apparatus, a Method and a Computer Program for Video Coding and Decoding
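CN113518228B (zh) * 2012-09-28 2024-06-11 InterDigital Madison Patent Holdings Method for video encoding, method for video decoding, and apparatus therefor
JP2020022145A (ja) * 2018-08-03 2020-02-06 Japan Broadcasting Corporation (NHK) Encoding device, decoding device, learning device, and program
CN112673625A (zh) * 2018-09-10 2021-04-16 Huawei Technologies Co., Ltd. Hybrid video and feature coding and decoding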
EP3633990B1 (en) * 2018-10-02 2021-10-27 Nokia Technologies Oy An apparatus and method for using a neural network in video coding
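CN110151133B (zh) * 2019-05-24 2021-10-01 Harbin Institute of Technology Breast optical imaging device and method based on image segmentation and time-frequency information fusion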
US11228776B1 (en) * 2020-03-27 2022-01-18 Tencent America LLC Method for output layer set mode in multilayered video stream

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
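CN104685879A (zh) * 2012-09-27 2015-06-03 Dolby Laboratories Licensing Corporation Inter-layer reference picture processing for coding standard scalability
WO2022068716A1 (zh) * 2020-09-30 2022-04-07 Huawei Technologies Co., Ltd. Entropy encoding/decoding method and apparatus
CN112702604A (zh) * 2021-03-25 2021-04-23 Beijing Dajia Internet Information Technology Co., Ltd. Encoding method and apparatus and decoding method and apparatus for layered video
CN113628115A (zh) * 2021-08-25 2021-11-09 Guangdong OPPO Mobile Telecommunications Corp., Ltd. Image reconstruction processing method and apparatus, electronic device, and storage medium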

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4492786A4

Also Published As

Publication number Publication date
JP2026027261A (ja) 2026-02-18
JP7760756B2 (ja) 2025-10-27
JP2025511850A (ja) 2025-04-16
EP4492786A1 (en) 2025-01-15
CN116939218A (zh) 2023-10-24
EP4492786A4 (en) 2025-07-09
US20250030879A1 (en) 2025-01-23
MX2024012429A (es) 2024-11-08

Similar Documents

Publication Publication Date Title
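TWI850806B (zh) Attention-based context modeling for image and video compression
TWI826160B (zh) Image encoding and decoding method and apparatus
CN115604485B (zh) Video image decoding method and apparatus
WO2023193629A1 (zh) Method and apparatus for encoding and decoding region enhancement layer
CN117321989A (zh) Independent positioning of auxiliary information in neural network based image processing
CN118872266A (zh) Video coding method based on multi-modal processing
US12052443B2 (en) Loop filtering method and apparatus
CN117501696A (zh) Parallel context modeling using information shared between patches
US20240296594A1 (en) Generalized Difference Coder for Residual Coding in Video Compression
CN117441333A (zh) Configurable positions for auxiliary information input into an image data processing neural network
WO2023279968A1 (zh) Video image encoding and decoding method and apparatus
CN118020306A (zh) Video encoding and decoding method, encoder, decoder, and storage medium
TW202503681A (zh) Encoding and decoding method and apparatus
WO2025035805A1 (zh) Training method and apparatus
TW202416712A (zh) Parallel processing of image regions with neural networks: decoding, post filtering, and RDOQ
CN121711482A (zh) Encoding and decoding method and apparatus
CN121842395A (zh) Encoding and decoding method and apparatus
KR20250071263A (ko) Neural network with a variable number of channels and method of operating the same
WO2025086895A1 (zh) Entropy encoding/decoding method and apparatus
CN120730066A (zh) Encoding and decoding method and apparatus
CN116797674A (zh) Image encoding and decoding method and apparatus
CN121569481A (zh) Resampling in image compression
CN121960568A (zh) Independent positioning of auxiliary information in neural network based image processing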

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application Ref document number: 23784193; Country of ref document: EP; Kind code of ref document: A1
WWE Wipo information: entry into national phase Ref document number: 2024559394; Country of ref document: JP; Ref document number: MX/A/2024/012429; Country of ref document: MX
REG Reference to national code Ref country code: BR; Ref legal event code: B01A; Ref document number: 112024020321; Country of ref document: BR
WWE Wipo information: entry into national phase Ref document number: 2023784193; Country of ref document: EP
ENP Entry into the national phase Ref document number: 2023784193; Country of ref document: EP; Effective date: 20241008
WWE Wipo information: entry into national phase Ref document number: 202417077846; Country of ref document: IN
NENP Non-entry into the national phase Ref country code: DE
ENP Entry into the national phase Ref document number: 112024020321; Country of ref document: BR; Kind code of ref document: A2; Effective date: 20240930