WO2023000182A1 - Image encoding/decoding and processing method, apparatus and device - Google Patents

Image encoding/decoding and processing method, apparatus and device

Info

Publication number
WO2023000182A1
WO2023000182A1 (PCT/CN2021/107466)
Authority
WO
WIPO (PCT)
Prior art keywords
image
scale
prediction
reference image
feature information
Prior art date
Application number
PCT/CN2021/107466
Other languages
English (en)
French (fr)
Inventor
高艳博
贾梦虎
李帅
岳建
元辉
李明
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority to CN202180100797.0A (CN117678221A)
Priority to PCT/CN2021/107466 (WO2023000182A1)
Publication of WO2023000182A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117Filters, e.g. for pre-processing or post-processing

Definitions

  • the present application relates to the technical field of image processing, and in particular to an image encoding/decoding and processing method, apparatus and device.
  • the video production equipment collects low-quality video streams and transmits the low-quality video streams to the video playback equipment.
  • the video playback equipment processes the low-quality videos and generates high-quality videos for playback.
  • the quality of video is improved by means of filtering.
  • the decoder performs filtering on the decoded reconstructed image and then plays it.
  • the filtering method cannot significantly improve the quality of the video.
  • Embodiments of the present application provide an image encoding, decoding, and processing method, device, and equipment, so as to significantly improve an image enhancement effect.
  • the embodiment of the present application provides an image decoding method, including:
  • the present application provides an image coding method, including:
  • the present application provides an image processing method, including:
  • the present application provides a model training method for training a quality enhancement network.
  • the quality enhancement network includes a feature extraction module, an offset value prediction module, a time domain alignment module, and a quality enhancement module.
  • the method includes:
  • performing multi-scale prediction through the offset value prediction module to obtain the offset value of the reference image;
  • performing time domain alignment in the time domain alignment module to obtain the second feature information of the reference image;
  • training the quality enhancement network according to the predicted value of the enhanced image of the image to be enhanced and the real value of that enhanced image.
  • an image decoding device configured to execute the method in the above first aspect or its various implementations.
  • the image decoding device includes a functional unit configured to execute the method in the above first aspect or each implementation manner thereof.
  • a decoder including a processor and a memory.
  • the memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory to execute the method in the above first aspect or its various implementations.
  • an image encoding device configured to execute the method in the above second aspect or various implementations thereof.
  • the image encoding device includes a functional unit configured to execute the method in the above second aspect or each implementation manner thereof.
  • an encoder including a processor and a memory.
  • the memory is used to store a computer program
  • the processor is used to invoke and run the computer program stored in the memory, so as to execute the method in the above second aspect or its various implementations.
  • an image processing device configured to execute the method in the above third aspect or various implementations thereof.
  • the device includes a functional unit configured to execute the method in the above third aspect or each implementation manner thereof.
  • an image processing device including a processor and a memory.
  • the memory is used to store a computer program
  • the processor is used to call and run the computer program stored in the memory, so as to execute the method in the above third aspect or its various implementations.
  • a model training device configured to execute the method in the above fourth aspect or various implementations thereof.
  • the model training device includes a functional unit for executing the method in the above fourth aspect or each implementation manner thereof.
  • a model training device including a processor and a memory.
  • the memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory, so as to execute the method in the above fourth aspect or each implementation manner thereof.
  • a chip for implementing the method in any one of the above first to fourth aspects or the implementations thereof.
  • the chip includes: a processor, configured to call and run a computer program from the memory, so that a device installed with the chip executes the method in any one of the above first to fourth aspects or the implementations thereof.
  • in a fourteenth aspect, a computer-readable storage medium is provided for storing a computer program, and the computer program causes a computer to execute the method in any one of the above first to fourth aspects or the implementations thereof.
  • a computer program product including computer program instructions, the computer program instructions cause a computer to execute any one of the above first to fourth aspects or the method in each implementation manner.
  • a computer program which, when running on a computer, causes the computer to execute any one of the above first to fourth aspects or the method in each implementation manner thereof.
  • the current reconstructed image is obtained by decoding the code stream; M reference images of the current reconstructed image are obtained from the reconstructed images; and the current reconstructed image and the M reference images are input into the quality enhancement network.
  • the quality enhancement network performs feature extraction at different scales to obtain the first feature information of the current reconstructed image and of the reference images at N scales, performs multi-scale prediction to obtain the offset value of the reference image, then performs temporal alignment according to the offset value of the reference image and the first feature information of the reference image to obtain the second feature information of the reference image, and finally predicts the enhanced image of the current reconstructed image according to the second feature information of the reference image, thereby achieving significant image enhancement.
  • FIG. 1 is a schematic block diagram of a video encoding and decoding system involved in an embodiment of the present application
  • Fig. 2 is a schematic block diagram of a video encoder provided by an embodiment of the present application.
  • Fig. 3 is a schematic block diagram of a video decoder provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a principle of an embodiment of the present application.
  • FIG. 5 is a schematic flow chart of a quality enhancement network training method provided by an embodiment of the present application.
  • FIG. 6 is a schematic network diagram of a quality enhancement network according to an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a training method for a quality enhancement network provided by an embodiment of the present application.
  • FIG. 8A is a network diagram of a feature extraction module involved in an embodiment of the present application.
  • FIG. 8B is a network diagram of a feature extraction module involved in an embodiment of the present application.
  • FIG. 8C is a network schematic diagram of an offset value prediction module involved in an embodiment of the present application.
  • FIG. 8D is a network schematic diagram of an offset value prediction module involved in an embodiment of the present application.
  • FIG. 8E is a network schematic diagram of a time domain alignment module involved in an embodiment of the present application.
  • FIG. 8F is a network schematic diagram of a quality enhancement module involved in an embodiment of the present application.
  • FIG. 8G is a schematic network diagram of a quality enhancement network according to an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a training method for a quality enhancement network provided by an embodiment of the present application.
  • FIG. 10A is a network schematic diagram of an offset value prediction module involved in an embodiment of the present application.
  • FIG. 10B is a network schematic diagram of an offset value prediction module involved in an embodiment of the present application.
  • FIG. 10C is a schematic network diagram of a time domain alignment module involved in an embodiment of the present application.
  • FIG. 10D is a schematic network diagram of a quality enhancement module involved in an embodiment of the present application.
  • Fig. 11 is a schematic flowchart of an image decoding method provided by an embodiment of the present application.
  • Fig. 12 is a schematic flowchart of an image coding method provided by an embodiment of the present application.
  • FIG. 13 is a schematic flowchart of an image processing method provided by an embodiment of the present application.
  • Fig. 14 is a schematic block diagram of an image decoding device provided by an embodiment of the present application.
  • Fig. 15 is a schematic block diagram of an image encoding device provided by an embodiment of the present application.
  • Fig. 16 is a schematic block diagram of an image processing device provided by an embodiment of the present application.
  • Fig. 17 is a schematic block diagram of a model training device provided by an embodiment of the present application.
  • Fig. 18 is a schematic block diagram of an electronic device provided by an embodiment of the present application.
  • the present application can be applied to the technical field of point cloud upsampling, for example, to the technical field of point cloud compression.
  • the application can be applied to the fields of image codec, video codec, hardware video codec, dedicated-circuit video codec, real-time video codec, etc.
  • the solution of the present application can be combined with audio video coding standards (AVS), for example, the H.264/advanced video coding (AVC) standard, the H.265/high efficiency video coding (HEVC) standard and the H.266/versatile video coding (VVC) standard.
  • the solutions of the present application may operate in conjunction with other proprietary or industry standards, including ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), including its scalable video coding (SVC) and multi-view video coding (MVC) extensions.
  • FIG. 1 is a schematic block diagram of a video encoding and decoding system involved in an embodiment of the present application. It should be noted that FIG. 1 is only an example, and the video codec system in the embodiment of the present application includes but is not limited to what is shown in FIG. 1 .
  • the video codec system 100 includes an encoding device 110 and a decoding device 120 .
  • the encoding device is used to encode (can be understood as compression) the video data to generate a code stream, and transmit the code stream to the decoding device.
  • the decoding device decodes the code stream generated by the encoding device to obtain decoded video data.
  • the encoding device 110 in the embodiment of the present application can be understood as a device having a video encoding function
  • the decoding device 120 can be understood as a device having a video decoding function; that is, the encoding device 110 and the decoding device 120 in the embodiment of the present application cover a wide range of devices, including smartphones, desktop computers, mobile computing devices, notebook (e.g., laptop) computers, tablet computers, set-top boxes, televisions, cameras, display devices, digital media players, video game consoles, vehicle-mounted computers, and the like.
  • the encoding device 110 may transmit the encoded video data (such as code stream) to the decoding device 120 via the channel 130 .
  • Channel 130 may include one or more media and/or devices capable of transmitting encoded video data from encoding device 110 to decoding device 120 .
  • channel 130 includes one or more communication media that enable encoding device 110 to transmit encoded video data directly to decoding device 120 in real-time.
  • encoding device 110 may modulate the encoded video data according to a communication standard and transmit the modulated video data to decoding device 120 .
  • the communication medium includes a wireless communication medium, such as a radio frequency spectrum.
  • the communication medium may also include a wired communication medium, such as one or more physical transmission lines.
  • the channel 130 includes a storage medium that can store video data encoded by the encoding device 110 .
  • the storage medium includes a variety of local access data storage media, such as optical discs, DVDs, flash memory, and the like.
  • the decoding device 120 may acquire encoded video data from the storage medium.
  • channel 130 may include a storage server that may store video data encoded by encoding device 110 .
  • the decoding device 120 may download the stored encoded video data from the storage server.
  • the storage server may store the encoded video data and transmit it to the decoding device 120; examples include a web server (e.g., for a website), a file transfer protocol (FTP) server, and the like.
  • the encoding device 110 includes a video encoder 112 and an output interface 113 .
  • the output interface 113 may include a modulator/demodulator (modem) and/or a transmitter.
  • the encoding device 110 may include a video source 111 in addition to the video encoder 112 and the output interface 113 .
  • the video source 111 may include at least one of a video capture device (for example, a video camera), a video archive, a video input interface, or a computer graphics system, where the video input interface is used to receive video data from a video content provider and the computer graphics system is used to generate video data.
  • the video encoder 112 encodes the video data from the video source 111 to generate a code stream.
  • Video data may include one or more pictures or a sequence of pictures.
  • the code stream contains the encoding information of an image or image sequence in the form of a bit stream.
  • Encoding information may include encoded image data and associated data.
  • the associated data may include a sequence parameter set (SPS for short), a picture parameter set (PPS for short) and other syntax structures.
  • An SPS may contain parameters that apply to one or more sequences.
  • a PPS may contain parameters applied to one or more images.
  • the syntax structure refers to a set of zero or more syntax elements arranged in a specified order in the code stream.
  • the video encoder 112 directly transmits encoded video data to the decoding device 120 via the output interface 113 .
  • the encoded video data can also be stored on a storage medium or a storage server for subsequent reading by the decoding device 120 .
  • the decoding device 120 includes an input interface 121 and a video decoder 122 .
  • the decoding device 120 may include a display device 123 in addition to the input interface 121 and the video decoder 122 .
  • the input interface 121 includes a receiver and/or a modem.
  • the input interface 121 can receive encoded video data through the channel 130 .
  • the video decoder 122 is used to decode the encoded video data to obtain decoded video data, and transmit the decoded video data to the display device 123 .
  • the display device 123 displays the decoded video data.
  • the display device 123 may be integrated with the decoding device 120 or external to the decoding device 120 .
  • the display device 123 may include various display devices, such as a liquid crystal display (LCD), a plasma display, an organic light emitting diode (OLED) display, or other types of display devices.
  • FIG. 1 is only an example, and the technical solutions of the embodiments of the present application are not limited to FIG. 1 .
  • the technology of the present application may also be applied to video encoding only or video decoding only.
  • Fig. 2 is a schematic block diagram of a video encoder provided by an embodiment of the present application. It should be understood that the video encoder 200 can be used to perform lossy compression on images, and can also be used to perform lossless compression on images.
  • the lossless compression may be visually lossless compression or mathematically lossless compression.
  • the video encoder 200 can be applied to image data in luminance-chrominance (YCbCr, YUV) format.
  • the YUV ratio can be 4:2:0, 4:2:2 or 4:4:4, where Y denotes luminance (Luma), Cb (U) denotes blue chroma, Cr (V) denotes red chroma, and U and V together denote the chroma (Chroma) components describing color and saturation.
  • 4:2:0 means that every 4 pixels have 4 luminance components and 2 chroma components (YYYYCbCr); 4:2:2 means that every 4 pixels have 4 luminance components and 4 chroma components (YYYYCbCrCbCr); 4:4:4 means full pixel display (YYYYCbCrCbCrCbCrCbCr).
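  • for illustration, the plane sizes implied by these sampling ratios can be computed as follows; this is a minimal sketch, and the function name and example resolution are our assumptions, not anything defined by the patent:

```python
def yuv_plane_sizes(width: int, height: int, subsampling: str):
    """Return (luma_samples, samples_per_chroma_plane) for one frame."""
    luma = width * height
    if subsampling == "4:2:0":    # chroma halved horizontally and vertically
        chroma = (width // 2) * (height // 2)
    elif subsampling == "4:2:2":  # chroma halved horizontally only
        chroma = (width // 2) * height
    elif subsampling == "4:4:4":  # full-resolution chroma
        chroma = width * height
    else:
        raise ValueError(subsampling)
    return luma, chroma

print(yuv_plane_sizes(1920, 1080, "4:2:0"))  # (2073600, 518400)
```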
  • the video encoder 200 reads video data and, for each frame of image in the video data, divides the frame into several coding tree units (CTUs); a CTU may also be called a "largest coding unit" (LCU) or a "coding tree block" (CTB).
  • Each CTU may be associated with a pixel block of equal size within the image.
  • Each pixel may correspond to one luminance (luma) sample and two chrominance (chrominance or chroma) samples.
  • each CTU may be associated with one block of luma samples and two blocks of chroma samples.
  • a CTU size is, for example, 128 ⁇ 128, 64 ⁇ 64, 32 ⁇ 32 and so on.
  • a CTU can be further divided into several coding units (Coding Unit, CU) for coding, and the CU can be a rectangular block or a square block.
  • the CU can be further divided into a prediction unit (PU for short) and a transform unit (TU for short), so that coding, prediction, and transformation are separated, and processing is more flexible.
  • a CTU is divided into CUs in a quadtree manner, and a CU is divided into TUs and PUs in a quadtree manner.
  • the video encoder and video decoder can support various PU sizes. Assuming that the size of a specific CU is 2N ⁇ 2N, video encoders and video decoders may support 2N ⁇ 2N or N ⁇ N PU sizes for intra prediction, and support 2N ⁇ 2N, 2N ⁇ N, N ⁇ 2N, NxN or similarly sized symmetric PUs for inter prediction. The video encoder and video decoder may also support asymmetric PUs of 2NxnU, 2NxnD, nLx2N, and nRx2N for inter prediction.
  • the video encoder 200 may include: a prediction unit 210, a residual unit 220, a transform/quantization unit 230, an inverse transform/quantization unit 240, a reconstruction unit 250, a loop filter unit 260, a decoded image cache 270 and an entropy encoding unit 280. It should be noted that the video encoder 200 may include more, fewer or different functional components.
  • the current block may be called a current coding unit (CU) or a current prediction unit (PU).
  • a prediction block may also be referred to as a prediction image block to be encoded, and a reconstructed block may also be referred to as a reconstructed image block to be encoded.
  • the prediction unit 210 includes an inter prediction unit 211 and an intra prediction unit 212 . Because there is a strong correlation between adjacent pixels within a video frame, intra-frame prediction is used in video coding and decoding technology to eliminate spatial redundancy between adjacent pixels. Because of the strong similarity between adjacent frames in a video, inter-frame prediction is used to eliminate temporal redundancy between adjacent frames, thereby improving coding efficiency.
  • the inter-frame prediction unit 211 can be used for inter-frame prediction.
  • the inter-frame prediction can refer to image information of different frames.
  • the inter-frame prediction uses motion information to find a reference block from the reference frame, and generates a prediction block according to the reference block to eliminate temporal redundancy;
  • Frames used for inter-frame prediction may be P frames and/or B frames, P frames refer to forward predictive frames, and B frames refer to bidirectional predictive frames.
  • the motion information includes the reference frame list where the reference frame is located, the reference frame index, and the motion vector.
  • the motion vector can be of integer-pixel or sub-pixel precision. If the motion vector is of sub-pixel precision, interpolation filtering needs to be applied in the reference frame to produce the required sub-pixel block.
  • the block of whole pixels or sub-pixels found in the reference frame according to the motion vector is called a reference block.
  • some technologies use the reference block directly as the prediction block, while others further process the reference block to generate the prediction block; the latter can also be understood as taking the reference block as the prediction block and then generating a new prediction block from it.
  • inter-frame prediction methods include the geometric partitioning mode (GPM) in the VVC video codec standard and angular weighted prediction (AWP) in the AVS3 video codec standard. These two inter-frame prediction modes have something in common in principle.
  • the intra-frame prediction unit 212 only refers to the image information of the same frame to predict the pixel information in the block to be encoded in the current frame for eliminating spatial redundancy.
  • a frame used for intra prediction may be an I frame.
  • the intra prediction method further includes a multiple reference line intra prediction method (multiple reference line, MRL).
  • MRL can use more reference pixels to improve coding efficiency.
  • mode 0 copies the pixels above the current block into the block in the vertical direction as the prediction value;
  • mode 1 copies the reference pixels on the left into the current block in the horizontal direction as the prediction value;
  • mode 2 (DC) uses the average value of the 8 reference points A–D and I–L as the prediction value of all points;
  • modes 3 to 8 copy the reference pixels to the corresponding positions of the current block along certain angles. Because some positions of the current block cannot correspond exactly to a reference pixel, a weighted average of the reference pixels, i.e., interpolated sub-pixel values, may be needed.
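  • the three simplest modes above can be sketched as follows; this is a toy NumPy illustration, with array names (`above` for reference pixels A–D, `left` for I–L) chosen by us rather than taken from any codec API:

```python
import numpy as np

def intra_predict_4x4(above: np.ndarray, left: np.ndarray, mode: int) -> np.ndarray:
    if mode == 0:   # vertical: copy the pixels above straight down
        return np.tile(above, (4, 1))
    if mode == 1:   # horizontal: copy the left reference pixels to the right
        return np.tile(left.reshape(4, 1), (1, 4))
    if mode == 2:   # DC: average of the 8 reference pixels A-D and I-L
        dc = int(round((above.sum() + left.sum()) / 8.0))
        return np.full((4, 4), dc, dtype=above.dtype)
    raise NotImplementedError("angular modes 3-8 need interpolated references")

above = np.array([100, 102, 104, 106])  # reference pixels A-D
left = np.array([98, 99, 101, 103])     # reference pixels I-L
print(intra_predict_4x4(above, left, 2))
```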
  • the intra prediction modes used by HEVC include planar mode (Planar), DC and 33 angle modes, a total of 35 prediction modes.
  • the intra-frame modes used by VVC include Planar, DC and 65 angle modes, with a total of 67 prediction modes.
  • the intra-frame modes used by AVS3 include DC, Plane, Bilinear and 63 angle modes, a total of 66 prediction modes.
  • with more angular modes, intra-frame prediction becomes more accurate and better meets the demand for the development of high-definition and ultra-high-definition digital video.
  • the residual unit 220 may generate a residual block of the CU based on the pixel block of the CU and the prediction blocks of the PUs of the CU. For example, residual unit 220 may generate a residual block for the CU such that each sample in the residual block has a value equal to the difference between a sample in the pixel block of the CU and the corresponding sample in a prediction block of a PU of the CU.
  • Transform/quantization unit 230 may quantize the transform coefficients. Transform/quantization unit 230 may quantize transform coefficients associated with TUs of a CU based on quantization parameter (QP) values associated with the CU. Video encoder 200 may adjust the degree of quantization applied to transform coefficients associated with a CU by adjusting the QP value associated with the CU.
  • Inverse transform/quantization unit 240 may apply inverse quantization and inverse transform to the quantized transform coefficients, respectively, to reconstruct a residual block from the quantized transform coefficients.
  • the reconstruction unit 250 may add samples of the reconstructed residual block to corresponding samples of one or more prediction blocks generated by the prediction unit 210 to generate a reconstructed block to be encoded associated with the TU. By reconstructing the sample blocks of each TU of the CU in this way, the video encoder 200 can reconstruct the pixel blocks of the CU.
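  • the residual/quantization/reconstruction path described above can be sketched as follows; a toy scalar quantizer stands in for the codec's actual transform and quantization, so this is illustrative only:

```python
import numpy as np

def encode_reconstruct(original: np.ndarray, prediction: np.ndarray, qstep: float):
    """Toy sketch: residual -> quantize -> inverse quantize -> reconstruct."""
    residual = original.astype(np.int32) - prediction.astype(np.int32)
    levels = np.round(residual / qstep).astype(np.int32)  # "quantized coefficients"
    recon_residual = levels * qstep                       # inverse quantization
    reconstructed = np.clip(prediction + recon_residual, 0, 255).astype(np.uint8)
    return levels, reconstructed

orig = np.array([[120, 121], [119, 118]], dtype=np.uint8)
pred = np.array([[118, 118], [118, 118]], dtype=np.uint8)
levels, recon = encode_reconstruct(orig, pred, qstep=2.0)
print(levels)  # what entropy coding would receive
print(recon)   # identical to what the decoder reconstructs
```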
  • Loop filtering unit 260 may perform deblocking filtering operations to reduce blocking artifacts of pixel blocks associated with a CU.
  • the loop filtering unit 260 includes a deblocking filtering unit, a sample adaptive offset (SAO) unit, and an adaptive loop filter (ALF) unit.
  • the decoded image buffer 270 may store reconstructed pixel blocks.
  • Inter prediction unit 211 may use reference pictures containing reconstructed pixel blocks to perform inter prediction on PUs of other pictures.
  • intra prediction unit 212 may use the reconstructed pixel blocks in decoded picture cache 270 to perform intra prediction on other PUs in the same picture as the CU.
  • Entropy encoding unit 280 may receive the quantized transform coefficients from transform/quantization unit 230 . Entropy encoding unit 280 may perform one or more entropy encoding operations on the quantized transform coefficients to generate entropy encoded data.
  • the basic flow of video coding involved in this application is as follows: at the coding end, the current image is divided into blocks, and for the current block, the prediction unit 210 uses intra prediction or inter prediction to generate a prediction block of the current block.
  • the residual unit 220 may calculate a residual block based on the predicted block and the original block of the current block, that is, a difference between the predicted block and the original block of the current block, and the residual block may also be referred to as residual information.
  • the residual block can be transformed and quantized by the transformation/quantization unit 230 to remove information that is not sensitive to human eyes, so as to eliminate visual redundancy.
  • the residual block before transform and quantization by the transform/quantization unit 230 may be called the time-domain residual block, and the residual block after transform and quantization may be called the frequency-domain residual block.
  • the entropy encoding unit 280 receives the quantized transform coefficients output by the transform and quantization unit 230 , may perform entropy encoding on the quantized transform coefficients, and output a code stream.
  • the entropy coding unit 280 can eliminate character redundancy according to the target context model and the probability information of the binary code stream.
  • the video encoder performs inverse quantization and inverse transformation on the quantized transform coefficients output by the transform and quantization unit 230 to obtain a residual block of the current block, and then adds the residual block of the current block to the prediction block of the current block to obtain the reconstructed block of the current block.
  • reconstructed blocks corresponding to other blocks to be encoded in the current image can be obtained, and these reconstructed blocks are spliced to obtain a reconstructed image of the current image.
  • the reconstructed image is then filtered, for example with ALF, to reduce the difference between the pixel values of the reconstructed image and the original pixel values of the current image.
  • the filtered reconstructed image is stored in the decoded image buffer 270, which may serve as a reference frame for inter-frame prediction for subsequent frames.
  • the block division information determined by the encoder as well as mode information or parameter information such as prediction, transformation, quantization, entropy coding, and loop filtering, etc., are carried in the code stream when necessary.
  • the decoding end parses the code stream to determine, from the available information, the same block division information and the same mode or parameter information for prediction, transformation, quantization, entropy coding, loop filtering, etc. as the encoding end, thereby ensuring that the decoded image obtained at the encoding end is the same as the decoded image obtained at the decoding end.
  • Fig. 3 is a schematic block diagram of a video decoder provided by an embodiment of the present application.
  • the video decoder 300 includes: an entropy decoding unit 310 , a prediction unit 320 , an inverse quantization/transformation unit 330 , a reconstruction unit 340 , a loop filter unit 350 and a decoded image buffer 360 . It should be noted that the video decoder 300 may include more, less or different functional components.
  • the video decoder 300 can receive code streams.
  • the entropy decoding unit 310 may parse the codestream to extract syntax elements from the codestream. As part of parsing the codestream, the entropy decoding unit 310 may parse the entropy-encoded syntax elements in the codestream.
  • the prediction unit 320 , the inverse quantization/transformation unit 330 , the reconstruction unit 340 and the loop filter unit 350 can decode video data according to the syntax elements extracted from the code stream, that is, generate decoded video data.
  • the prediction unit 320 includes an intra prediction unit 321 and an inter prediction unit 322 .
  • Intra prediction unit 321 may perform intra prediction to generate a predictive block for a PU. Intra prediction unit 321 may use an intra prediction mode to generate a prediction block for a PU based on pixel blocks of spatially neighboring PUs. Intra prediction unit 321 may also determine an intra prediction mode for a PU from one or more syntax elements parsed from a codestream.
  • the inter prediction unit 322 may construct a first reference picture list (list 0) and a second reference picture list (list 1) according to the syntax elements parsed from the codestream. Furthermore, if the PU is encoded using inter prediction, entropy decoding unit 310 may parse the motion information for the PU. Inter prediction unit 322 may determine one or more reference blocks for the PU according to the motion information of the PU. Inter prediction unit 322 may generate a predictive block for the PU from one or more reference blocks for the PU.
  • Inverse quantization/transform unit 330 may inverse quantize (ie, dequantize) transform coefficients associated with a TU. Inverse quantization/transform unit 330 may use the QP value associated with the CU of the TU to determine the degree of quantization.
  • inverse quantization/transform unit 330 may apply one or more inverse transforms to the inverse quantized transform coefficients in order to generate a residual block associated with the TU.
  • Reconstruction unit 340 uses the residual blocks associated with the TUs of the CU and the prediction blocks of the PUs of the CU to reconstruct the pixel blocks of the CU. For example, the reconstruction unit 340 may add the samples of the residual block to the corresponding samples of the prediction block to reconstruct the pixel block of the CU and obtain the reconstructed block.
  • Loop filtering unit 350 may perform deblocking filtering operations to reduce blocking artifacts of pixel blocks associated with a CU.
  • the loop filtering unit 350 includes a deblocking filtering unit, a sample adaptive offset (SAO) unit, and an adaptive loop filter (ALF) unit.
  • Video decoder 300 may store the reconstructed picture of the CU in decoded picture cache 360 .
  • the video decoder 300 may use the reconstructed picture in the decoded picture buffer 360 as a reference picture for subsequent prediction, or transmit the reconstructed picture to a display device for presentation.
  • the entropy decoding unit 310 can parse the code stream to obtain the prediction information, quantization coefficient matrix, etc. of the current block, and the prediction unit 320 uses intra prediction or inter prediction to generate a prediction block of the current block based on the prediction information.
  • the inverse quantization/transformation unit 330 performs inverse quantization and inverse transformation on the quantization coefficient matrix obtained from the code stream to obtain a residual block.
  • the reconstruction unit 340 adds the predicted block and the residual block to obtain a reconstructed block.
  • the reconstructed blocks form a reconstructed image
  • the loop filtering unit 350 performs loop filtering on the reconstructed image based on the image or based on the block to obtain a decoded image.
  • the decoded image can also be referred to as a reconstructed image.
  • the reconstructed image can be displayed by a display device, and on the other hand, it can be stored in the decoded image buffer 360 and serve as a reference frame for inter-frame prediction for subsequent frames.
  • the above is the basic process of the video codec under the block-based hybrid coding framework. With the development of technology, some modules or steps of the framework or process may be optimized. This application is applicable to the basic process of the video codec under the block-based hybrid coding framework, but is not limited to that framework and process.
  • the quality of video is improved by filtering.
  • DBF technology and SAO technology are used for filtering.
  • ALF technology is additionally added in VVC/H.266.
  • DBF reduces the block effect by smoothing the coding unit boundary
  • SAO alleviates the ringing effect by compensating the pixel value
  • ALF further enhances the reconstructed image quality by minimizing the error between the reconstructed block and the original block.
  • the filtering method cannot significantly improve the quality of the video, and the effect is poor.
  • compressed video quality enhancement based on spatio-temporal deformable convolution, referred to as spatio-temporal deformable fusion (STDF), is mainly applied to post-processing of the reconstructed image at the decoding end, enhancing the quality of the current frame by using multiple adjacent reference frames.
  • STDF uses the temporal information of the reference frames to enhance the quality of the current frame, exploiting the effective alignment property of deformable convolution to align and fuse temporal information.
  • the STDF technology is mainly realized through the following processes:
  • a) consecutive frames are extracted and stitched together along the time domain dimension; b) the stitched frames are input into the offset value prediction network to generate the offset values.
  • the offset value refers to the offset of the sampling points in the deformable convolution.
  • the offset value prediction network adopts the form of a U-shaped network (U-net), combining low-level detail information with high-level semantic information to fully learn the time domain information and directly predict the offset values.
  • one set of offset values is predicted for each frame, that is, 2R+1 sets of offset values are output.
  • for each pixel of each frame there are 9 sampling points, that is, 9 offset values, and each offset value includes the sampling distances in the horizontal and vertical directions.
  • c) the offset values predicted in step b are used as the offsets of the deformable convolution sampling points, and the reference frames are aligned to the current frame, thereby fusing temporal information.
  • d) the fusion features generated in step c are input into the quality enhancement network to learn a reconstruction residual map, that is, the difference between the input frame to be enhanced and the real image; after the residual map is added to the frame to be enhanced, the enhanced frame is output.
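  • under assumed layer sizes (the real offset predictor is a U-net and the enhancement network is deeper than the single convolutions used here), the STDF flow can be sketched with torchvision's deformable convolution:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

R = 1                                        # temporal radius -> 2R+1 input frames
frames = torch.randn(1, 2 * R + 1, 64, 64)   # grayscale frames stacked on channels

# b) offset prediction: 9 sampling points per pixel per frame, each a (dy, dx)
offset_net = nn.Conv2d(2 * R + 1, 2 * 9 * (2 * R + 1), 3, padding=1)
offsets = offset_net(frames)

# c) deformable convolution aligns and fuses the frames (one offset group
#    per input frame) into 64 feature channels
weight = torch.randn(64, 2 * R + 1, 3, 3)
fused = deform_conv2d(frames, offsets, weight, padding=1)

# d) quality enhancement: predict a residual map and add it to the middle
#    (to-be-enhanced) frame
enhance = nn.Conv2d(64, 1, 3, padding=1)
enhanced = frames[:, R:R + 1] + enhance(fused)
print(enhanced.shape)                        # torch.Size([1, 1, 64, 64])
```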
  • the first method above, that is, the in-loop filtering technology, is commonly used for intra-frame filtering; subsequent frames that have not yet been reconstructed cannot be obtained, so it has great limitations.
  • for the second method, the sampling position indicated by the offset value is assumed to be P(x, y).
  • bilinear interpolation is usually used for sampling, that is, the four points around the sampling position have coordinates P1(x1, y1), P2(x2, y2), P3(x3, y3) and P4(x4, y4).
  • when training the network, the offset value is optimized towards the true value, but in the early stage of training the current offset value deviates greatly from the real offset value.
  • if the real offset position is far beyond the range of the receptive field, the optimization direction of the offset value will deviate from the direction of the true value, resulting in larger errors.
  • suppose the real offset position is Pt and the current offset position is P. Since network training is optimized along the gradient direction and the value at Pt is greater than the value at P, P will shift towards larger values, that is, towards point P4, resulting in larger errors and larger deviations in alignment. As a result, the generated offset value is inaccurate and the alignment operation is biased; multi-frame information cannot be effectively fused, and temporal information that is not conducive to recovering the current frame may even be fused.
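  • the bilinear sampling at the heart of this problem can be sketched as below (a toy NumPy version with names of our choosing); note that the sampled value, and hence any gradient with respect to the offset, depends only on the four neighbours P1–P4, which is exactly the limited receptive field described above:

```python
import numpy as np

def bilinear_sample(img: np.ndarray, x: float, y: float) -> float:
    """Sample img at fractional position (x, y) from its 4 integer neighbours."""
    x1, y1 = int(np.floor(x)), int(np.floor(y))   # P1(x1, y1)
    x2, y2 = x1 + 1, y1 + 1                       # opposite corner P4(x2, y2)
    wx, wy = x - x1, y - y1
    return ((1 - wx) * (1 - wy) * img[y1, x1]     # P1
            + wx * (1 - wy) * img[y1, x2]         # P2
            + (1 - wx) * wy * img[y2, x1]         # P3
            + wx * wy * img[y2, x2])              # P4

img = np.arange(16, dtype=np.float64).reshape(4, 4)
print(bilinear_sample(img, 1.25, 2.5))  # 11.25: a blend of only 4 neighbours
```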
  • in view of this, the present application provides a method for implementing image enhancement through a new quality enhancement model.
  • the quality enhancement model performs multi-scale prediction based on the first feature information of the image to be enhanced and of its reference images at N scales to obtain the offset value of the reference image. Since the model predicts the offset value at multiple scales, the range of the receptive field is expanded, so that the offset value can learn the direction of the real offset; accurate prediction of the offset value is thus achieved, and the subsequent multi-scale deformable convolution alignment is performed based on the accurately predicted offset value, realizing efficient enhancement of the image.
  • the image processing method provided in this application uses a quality enhancement network to enhance image quality; the quality enhancement network may be implemented as software code or as a chip with data processing functions. Based on this, the training process of the quality enhancement network is introduced first.
  • FIG. 5 is a schematic flow chart of a quality enhancement network training method provided by an embodiment of the present application. As shown in FIG. 5, the training process includes:
  • M is a positive integer.
  • the image to be enhanced is an image in the training set, which includes multiple images to be enhanced and M reference images for each of them.
  • training the quality enhancement network with the images to be enhanced in the training set and their M reference images is an iterative process. For example, the first image to be enhanced and its M reference images are input into the quality enhancement network to be trained, and the initial parameters of the quality enhancement network are adjusted once to obtain the quality enhancement network after the first training iteration.
  • the training end condition of the quality enhancement network includes that the number of training times reaches a preset number of times, or the loss reaches a preset loss.
  • the methods for determining the initial parameters of the above-mentioned quality enhancement network include but are not limited to the following:
  • in the first way, the initial parameters of the quality enhancement network may be preset values, random values, or empirical values.
  • in the second way, the pre-training parameters obtained when pre-training a pre-trained model are acquired, and these pre-training parameters are determined as the initial parameters of the quality enhancement network.
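  • a schematic training loop covering the iteration and both end conditions above might look as follows; the single-convolution network and random tensors are stand-ins for the quality enhancement network and the training set, not the patent's actual model or data:

```python
import torch
import torch.nn as nn

net = nn.Conv2d(1 + 2, 1, 3, padding=1)     # stand-in network, M = 2 references
optim = torch.optim.Adam(net.parameters(), lr=1e-4)
max_steps, target_loss = 1000, 1e-3         # the two end conditions in the text

for step in range(max_steps):
    to_enhance = torch.randn(4, 1, 32, 32)  # batch of images to be enhanced
    refs = torch.randn(4, 2, 32, 32)        # M = 2 reference images each
    gt = torch.randn(4, 1, 32, 32)          # "true value" of the enhanced image
    pred = net(torch.cat([to_enhance, refs], dim=1))
    loss = torch.mean((pred - gt) ** 2)     # loss between prediction and truth
    optim.zero_grad()
    loss.backward()                         # reverse-adjust the parameters
    optim.step()
    if loss.item() < target_loss:           # or stop when the loss is small enough
        break
```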
  • the above M reference images of the image to be enhanced may be M images located before the image to be enhanced in the playback order of the video stream.
  • the above M reference images of the image to be enhanced may be M images located after the image to be enhanced in the playback order of the video stream.
  • for example, if the video contains image 1, image 2 and image 3 in playback order and image 2 is the image to be enhanced, then image 1 and image 3 can be used as reference images for image 2.
  • the image to be enhanced and the M reference images are consecutive images in a playing sequence.
  • the image to be enhanced and the M reference images are discontinuous in playback order.
  • the training process is the same for each image to be enhanced in the training set and its M reference images; the embodiment of the present application therefore takes one image to be enhanced as an example to describe the training process of the quality enhancement network.
  • the network structure of the quality enhancement network involved in the embodiment of the present application is described below in conjunction with FIG. 6. It should be noted that the quality enhancement network in the embodiment of the present application includes, but is not limited to, the modules shown in FIG. 6, and may include more or fewer modules than FIG. 6.
  • the quality enhancement network includes a feature extraction module, an offset value prediction module, a temporal alignment module and a quality enhancement module.
  • the feature extraction module is used to extract the first feature information of the image at different scales.
  • the scale of the image in this application refers to the length and width of the image.
  • the offset value prediction module is used to predict the offset value of the image according to the first feature information in different scales extracted by the feature extraction module.
  • the time-domain alignment module is used to perform time-domain alignment according to the first feature information extracted by the feature extraction module and the offset value predicted by the offset value prediction module, so as to obtain time-domain aligned second feature information.
  • the quality enhancement module is used to predict an enhanced image of the image according to the second feature information aligned by the time domain alignment module.
  • FIG. 6 is only a schematic framework diagram of the quality enhancement network involved in the embodiment of the present application, and the quality enhancement network in the embodiment of the present application may include more or fewer modules than FIG. 6, which is not limited here.
  • the above S501 includes the following steps from S502 to S504.
  • N is a positive integer greater than 1; that is to say, the feature extraction module performs feature extraction on the input M+1 images at least at two different scales, obtaining the first feature information of the image to be enhanced and of the reference images at at least two scales.
  • the scale L1 represents the scale of the original image
  • the scale L2 represents the half scale of the original image
  • the scale L3 represents the quarter scale of the original image.
  • the size of the first feature information of the image to be enhanced and/or the reference image at scale L1 is H×W;
  • the size of the first feature information of the image to be enhanced and/or the reference image at scale L2 is H/2×W/2;
  • the size of the first feature information of the image to be enhanced and/or the reference image at scale L3 is H/4×W/4.
  • for the image to be enhanced I_t, its forward reference images I_{t-r} to I_{t-1} and its backward reference images I_{t+1} to I_{t+r}, a total of 2r+1 images denoted I_i ∈ R^{H×W}, i ∈ {t-r, ..., t+r}, are fed into the quality enhancement network for processing.
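  • the resulting shapes can be illustrated as follows; average pooling is used here purely to show the scales, whereas the patent's feature extraction module obtains them with strided convolutions (see FIG. 8A/8B):

```python
import torch
import torch.nn.functional as F

r, H, W = 1, 64, 64
frames = torch.randn(2 * r + 1, 1, H, W)   # I_i, i in {t-r, ..., t+r}
feat_L1 = frames                           # scale L1: H x W
feat_L2 = F.avg_pool2d(frames, 2)          # scale L2: H/2 x W/2
feat_L3 = F.avg_pool2d(frames, 4)          # scale L3: H/4 x W/4
print(feat_L1.shape, feat_L2.shape, feat_L3.shape)
```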
  • the first feature information of the reference images at N scales output by the feature extraction module includes the feature information of at least one of the M reference images at N scales. That is to say, the feature extraction module performs feature extraction on each of the M reference images to obtain the first feature information of each reference image at N scales, or performs feature extraction on a part of the M reference images to obtain the first feature information of that part of the reference images at N scales.
  • the first feature information of the image to be enhanced and the reference image at scale L1, at scale L2 and at scale L3 is input into the offset value prediction module.
  • the offset value prediction module learns the first feature information of the image to be enhanced and the reference image at different scales, expanding the receptive field range it learns from, so that the offset value can learn the direction of the real offset and accurate prediction of the offset value is achieved.
  • the offset value of the reference image can be understood as an offset value matrix.
  • the offset value prediction module of the embodiment of the present application is a pyramid progressive prediction network, and the pyramid progressive prediction network gradually learns the deformable convolution offset value from coarse to fine.
  • the pyramidal progressive structure can effectively enhance the compressed video with large motion distance.
  • the offset value of the reference image predicted by the offset value prediction module and the first feature information of the reference image extracted by the feature extraction module are input into the time domain alignment module.
  • for each point in the first feature information of the reference image, the temporal alignment module obtains the offset values corresponding to that point (for example, 9 offset values) from the offset values of the reference image, takes them as the offsets of the sampling points to obtain 9 sampling points, and convolves these 9 sampling points to obtain a convolved value, which is used as the second feature information of that point; the above operations are performed on all points in the first feature information in turn to obtain the second feature information of the reference image.
  • the above S503 includes: according to the offset value of the reference image and the first feature information of the reference image, performing multi-scale temporal alignment in the temporal alignment module to obtain the second multi-scale feature information of the reference image.
  • the temporal alignment module downsamples the offset value of the reference image and the first feature information of the reference image to multiple small scales; for a given scale, the offset value and the first feature information at that scale are aligned in the time domain to obtain the second feature information at that scale.
  • this application optimizes network training.
  • the multi-scale alignment technology is adopted, that is, the time-domain alignment module in FIG. 6 synchronously downsamples the first feature information and the offset values to be aligned to multiple small scales and performs the deformable convolution alignment operation at each scale. Since a small-scale offset value is closer to the real sampling point than a large-scale one, the direction of gradient optimization during training points towards the real sampling point; for large-scale offset values, the sampling mechanism of bilinear filtering makes it impossible to find the correct optimization direction, so the optimization of small-scale offset values guides the optimization of large-scale offset values and ultimately guides the entire alignment process to be more precise.
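  • a minimal sketch of this multi-scale alignment, under assumed shapes (offsets are divided by the scale factor so that they keep pointing at the same image content after downsampling):

```python
import torch
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

def multi_scale_align(feat, offsets, weight, scales=(1, 2, 4)):
    aligned = []
    for s in scales:
        f = F.avg_pool2d(feat, s) if s > 1 else feat
        o = F.avg_pool2d(offsets, s) / s if s > 1 else offsets  # rescale offsets
        aligned.append(deform_conv2d(f, o, weight, padding=1))
    return aligned  # the small-scale branches guide the large-scale optimization

feat = torch.randn(1, 8, 64, 64)          # first feature information to align
offsets = torch.randn(1, 2 * 9, 64, 64)   # 9 sampling points, (dy, dx) each
weight = torch.randn(8, 8, 3, 3)          # shared deformable conv kernel
for a in multi_scale_align(feat, offsets, weight):
    print(a.shape)  # (1, 8, 64, 64), (1, 8, 32, 32), (1, 8, 16, 16)
```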
  • the second feature information of the reference image aligned by the temporal alignment module is input into the quality enhancement module to obtain a predicted value of the enhanced image of the image to be enhanced.
  • in some embodiments, the method also includes obtaining the second feature information of the image to be enhanced, inputting the second feature information of the image to be enhanced and of the reference image into the quality enhancement module, and obtaining the predicted value of the enhanced image of the image to be enhanced.
  • in some embodiments, in addition to the second feature information of the reference image, the first feature information of the image to be enhanced can also be input into the quality enhancement module to obtain the predicted value of the enhanced image of the image to be enhanced; for the specific process, refer to the embodiment shown in FIG. 9 below.
  • the embodiment of the present application does not limit the manner of acquiring the true value of the enhanced image of the image to be enhanced.
  • the true value of the enhanced image of the image to be enhanced may be an enhanced image obtained by using an existing image quality enhancement method.
  • the ground truth value of the enhanced image of the image to be enhanced may be an image collected by a high-quality image collection device.
  • the loss between the predicted value of the enhanced image of the image to be enhanced and the true value of the enhanced image is calculated, and the parameters of the quality enhancement network are adjusted backwards according to the loss, so as to implement the training of the quality enhancement network.
  • in the embodiment of the present application, the image to be enhanced and the M reference images are input into the feature extraction module for feature extraction at different scales, obtaining the first feature information of the image to be enhanced and of the reference images at N scales.
  • multi-scale prediction is performed through the offset value prediction module to obtain the offset value of the reference image.
  • time domain alignment is performed in the time domain alignment module to obtain the second feature information of the reference image.
  • the quality enhancement module obtains the predicted value of the enhanced image of the image to be enhanced, and the quality enhancement network is trained according to the predicted value and the real value of the enhanced image of the image to be enhanced.
  • the offset value prediction module learns the first feature information at different scales, which expands the receptive field range it learns from, so that the offset value can learn the direction of the real offset; the offset value can thus be predicted accurately, and the image enhancement effect is improved based on the accurately predicted offset value.
  • the model training methods in the embodiment of the present application include two methods.
  • the network structure and training process of the quality enhancement network involved in the embodiment of the present application are introduced below in conjunction with these two training methods.
  • FIG. 7 is a schematic flow diagram of a training method for a quality enhancement network provided by an embodiment of the present application. As shown in FIG. 7, the training process includes:
  • M is a positive integer.
  • N is a positive integer greater than 1.
  • the embodiment of the present application does not limit the network structure of the feature extraction module.
  • the feature extraction module includes N first feature extraction units.
  • the above S602 includes: for the image to be enhanced, inputting the image to be enhanced into the feature extraction module to obtain the first feature information of the image to be enhanced at the (N-i+1)-th scale extracted by the i-th first feature extraction unit, and inputting the first feature information at the (N-i+1)-th scale into the (i+1)-th first feature extraction unit for feature extraction to obtain the first feature information of the image to be enhanced at the (N-i)-th scale, where i is a positive integer from 1 to N-1.
  • for at least one reference image among the M reference images, the reference image is input into the feature extraction module to obtain the first feature information of the reference image at the (N-i+1)-th scale extracted by the i-th first feature extraction unit, and the first feature information of the reference image at the (N-i+1)-th scale is input into the (i+1)-th first feature extraction unit for feature extraction to obtain the first feature information of the reference image at the (N-i)-th scale, where i is a positive integer from 1 to N-1.
  • the first first feature extraction unit processes the image, and outputs first feature information of the image at a third scale.
  • the first first feature extraction unit also inputs the extracted first feature information of the image at the third scale (for example, L1 scale) to the second first feature extraction unit.
• The second first feature extraction unit processes the first feature information of the image at the third scale and outputs the first feature information of the image at the second scale (for example, the L2 scale).
  • the second first feature extraction unit also inputs the extracted first feature information of the image at the second scale into the third first feature extraction unit.
  • the third first feature extraction unit processes the first feature information of the image at the second scale, and outputs the first feature information of the image at the first scale (for example, at the L3 scale).
  • This embodiment does not limit the specific sizes of the above-mentioned first scale, second scale and third scale.
• The above-mentioned third scale is the original scale of the image, such as H×W. The second scale is half of the third scale, for example H/2×W/2. The first scale is half of the second scale, for example H/4×W/4.
  • the embodiment of the present application does not limit the network structure of the first feature extraction unit.
  • the first feature extraction unit includes at least one convolutional layer.
  • each of the N first feature extraction units includes the same number of convolutional layers.
  • each first feature extraction unit includes two convolutional layers.
• The number of convolutional layers included in each first feature extraction unit among the N first feature extraction units is not exactly the same; for example, some first feature extraction units include 2 convolutional layers, some include 1 convolutional layer, and some include 3 convolutional layers, and so on.
  • the parameters of the convolutional layers included in each first feature extraction unit may be the same or different.
• The feature extraction module includes 6 convolutional layers; the convolution stride of the first convolutional layer and the second convolutional layer is a first value, the convolution stride of the third convolutional layer and the fourth convolutional layer is a second value, and the convolution stride of the fifth convolutional layer and the sixth convolutional layer is a third value, wherein the first value is greater than the second value, and the second value is greater than the third value.
  • the feature extraction module includes three first feature extraction units, and each first feature extraction unit includes two convolutional layers.
  • the first first feature extraction unit includes two convolutional layers, and the convolution step of the two convolutional layers is 1.
  • the second first feature extraction unit includes two convolutional layers, wherein the convolutional stride of the first convolutional layer is 2, and the convolutional stride of the second convolutional layer is 1.
  • the third first feature extraction unit includes two convolutional layers, wherein the convolutional stride of the first convolutional layer is 2, and the convolutional stride of the second convolutional layer is 1.
  • the number of channels of each convolutional layer shown in FIG. 8B is not limited.
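• As a non-limiting illustration, the three-unit feature extraction described above (two convolutional layers per unit, with strides 1,1 / 2,1 / 2,1) can be sketched in PyTorch as follows; the channel count, the 3×3 kernels, and the LeakyReLU slope are assumptions of this sketch, not fixed by the embodiment.

```python
import torch
import torch.nn as nn

class FeatureExtractionUnit(nn.Module):
    def __init__(self, in_ch, out_ch, first_stride):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=first_stride, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class FeatureExtractionModule(nn.Module):
    """Outputs first feature information at three scales: L1 (HxW), L2 (H/2xW/2), L3 (H/4xW/4)."""
    def __init__(self, in_ch=1, ch=64):
        super().__init__()
        self.unit1 = FeatureExtractionUnit(in_ch, ch, first_stride=1)  # third scale, HxW
        self.unit2 = FeatureExtractionUnit(ch, ch, first_stride=2)     # second scale, H/2xW/2
        self.unit3 = FeatureExtractionUnit(ch, ch, first_stride=2)     # first scale, H/4xW/4

    def forward(self, x):
        f_l1 = self.unit1(x)   # first feature information at the third (original) scale
        f_l2 = self.unit2(f_l1)
        f_l3 = self.unit3(f_l2)
        return f_l1, f_l2, f_l3
```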
  • the Nth scale is the largest scale among the N scales.
  • This embodiment does not limit the specific network structure of the offset value prediction module.
  • the offset value prediction module includes N first prediction units, then the above S603 includes S603-A and S603-B:
• The first feature information of the image to be enhanced and of the reference image at the N-th scale, together with the offset values of the image to be enhanced and of the reference image at the N-th scale predicted by the (N-1)-th first prediction unit, is input into the N-th first prediction unit, so as to obtain the offset values of the image to be enhanced and of the reference image at the N-th scale predicted by the N-th first prediction unit.
• When j is 1, the offset values of the image to be enhanced and the reference image at the j-th scale are each 0.
• Specifically, the first feature information of the image to be enhanced and of the reference image at the first scale, output by the third first feature extraction unit shown in FIG. 8B above, is spliced and input into the first first prediction unit for offset value prediction, and the offset values of the image to be enhanced and of the reference image at the second scale predicted by the first first prediction unit are obtained. The first feature information of the image to be enhanced and of the reference image at the second scale and the predicted offset values at the second scale are spliced and input into the second first prediction unit for offset value prediction, and the offset values of the image to be enhanced and of the reference image at the third scale predicted by the second first prediction unit are obtained. The first feature information of the image to be enhanced and of the reference image at the third scale and the predicted offset values at the third scale are spliced and input into the third first prediction unit for offset value prediction, and the offset values of the image to be enhanced and of the reference image at the third scale predicted by the third first prediction unit are obtained.
  • the embodiment of the present application does not limit the specific network structure of the first prediction unit.
• The first first prediction unit includes the first first prediction subunit and the first first upsampling subunit.
• If the j-th first prediction unit is a first prediction unit other than the first first prediction unit among the N first prediction units, the j-th first prediction unit includes the j-th first alignment subunit, the j-th first prediction subunit, and the j-th first upsampling subunit. For example, as shown in FIG. 8D, if the j-th first prediction unit is the second first prediction unit among the N first prediction units, the second first prediction unit includes the second first alignment subunit, the second first prediction subunit, and the second first upsampling subunit.
  • S603-A1 includes S603-A21 to S603-A23:
• S603-A21: The first feature information of the image to be enhanced and of the reference image at the j-th scale, and the offset values of the image to be enhanced and of the reference image at the j-th scale predicted by the (j-1)-th first prediction unit, are input into the j-th first alignment subunit for time-domain feature alignment, and the feature information of the image to be enhanced and of the reference image aligned at the j-th scale is obtained;
  • the Nth first prediction unit includes the Nth first alignment subunit and the Nth first prediction subunit.
  • the third first prediction unit includes a third first alignment subunit and a third first prediction subunit.
  • the embodiment of the present application does not limit the network structure of the above-mentioned first alignment subunit, first prediction subunit, and first upsampling subunit.
  • the above-mentioned first prediction subunit is an offset prediction network (Offset prediction network, OPN for short).
• The OPN adopts 3 convolutional layers; the number of input channels is T×C and the number of output channels is T×3×9, where 3 means that in addition to the sampling point position (x, y), the OPN also outputs the magnitude of the sampled value.
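• A non-limiting sketch of such an OPN, with 3 convolutional layers, T×C input channels, and T×3×9 output channels, is given below; the 3×3 kernels and the hidden width are assumptions of the sketch.

```python
import torch.nn as nn

def make_opn(T, C, hidden=64):
    """Offset prediction network: features of T stitched frames in, T*3*9 offsets out."""
    return nn.Sequential(
        nn.Conv2d(T * C, hidden, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(hidden, hidden, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        # per frame: (x, y) sampling positions plus a magnitude for each of 9 points
        nn.Conv2d(hidden, T * 3 * 9, 3, padding=1),
    )
```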
• The above-mentioned first alignment subunit is a deformable convolution (Deformable convolution, DCN for short). The input and output channels of the DCN, that is, the deformable convolution, are both C.
  • the first upsampling subunit is a bilinear interpolation upsampling unit.
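• The T×3×9 offset layout maps naturally onto a modulated deformable convolution: for a 3×3 kernel, 2×9 channels give the (x, y) sampling positions and 9 channels give the sampling magnitudes, used as the modulation mask. The following non-limiting sketch uses torchvision's DeformConv2d with C input and output channels; the channel ordering and the sigmoid applied to the magnitudes are assumptions of the sketch.

```python
import torch
from torchvision.ops import DeformConv2d

class AlignDCN(torch.nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # input and output channels are both C, as stated above
        self.dcn = DeformConv2d(ch, ch, kernel_size=3, padding=1)

    def forward(self, feat, offset_3x9):
        # offset_3x9: (B, 3*9, H, W) for one frame; the first 18 channels are assumed
        # to be the (x, y) sampling positions, the last 9 the sampling magnitudes
        xy = offset_3x9[:, :18]
        mask = torch.sigmoid(offset_3x9[:, 18:])
        return self.dcn(feat, xy, mask)
```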
• The predicted offset value is gradually adjusted from coarse to fine; that is, what each subsequent prediction subunit predicts is the residual of the offset value, not the offset value itself. Specifically, the first feature information of the image to be enhanced and of the reference images at the first scale, generated by the above-mentioned feature extraction module, is spliced and input into the first first prediction subunit (OPN) to predict the offset value. The OPN uses 3 convolutional layers and obtains the offset value O0 of the image to be enhanced and the reference image at the first scale. Then O0 is upsampled by the first first upsampling subunit to the offset value O2 at the second scale (that is, the L2 scale). The first feature information of the image to be enhanced and of the reference image at the second scale is spliced and, together with the offset value O2, is input into the second first alignment subunit (DCN) for deformable convolution, obtaining the feature information of the image to be enhanced and the reference image aligned at the second scale. The aligned feature information is input into the second first prediction subunit (OPN) to obtain the offset value O3 predicted at the second scale; O3 is added to O2, and the sum is input into the second first upsampling subunit to obtain the offset value O4 at the third scale. O4 is input into the third first alignment subunit, which samples and aligns the first feature information of the image to be enhanced and of the reference image at the third scale, obtaining the alignment features of the image to be enhanced and the reference image at the third scale. The alignment features at the third scale are input into the third first prediction subunit (OPN), which predicts the offset value O5 of the image to be enhanced and the reference image; O5 is added to O4 to obtain the offset value O∈R^(T×3×9×H×W) of the image to be enhanced and the reference image at the third scale. Since each previous prediction uses small-scale features to predict offset values for larger-scale features, the offset value loses detail; this additional prediction and alignment operation on the original-scale features therefore refines the offset value at the same scale and yields a more accurate offset value O.
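• The coarse-to-fine chain O0 → O2 → O4 → O can be sketched as follows: only the offset is upsampled, while the features at each scale are re-aligned by the DCN and an OPN predicts a residual added to the upsampled offset. Rescaling the offset values together with the resolution is an assumption that keeps them in pixel units (strictly, only the (x, y) channels should be rescaled; kept simple here).

```python
import torch.nn.functional as F

def upsample_offset(offset, scale=2):
    # bilinear upsampling of the offset map; values are rescaled so the
    # sampling positions stay expressed in pixels at the new resolution
    up = F.interpolate(offset, scale_factor=scale, mode="bilinear", align_corners=False)
    return up * scale

def predict_offsets(feats, opns, aligns):
    # feats:  [f_L3, f_L2, f_L1] stitched first feature information, smallest scale first
    # opns:   one OPN per scale; aligns: DCN alignment modules (aligns[0] is unused)
    offset = opns[0](feats[0])                    # O0 at the first (smallest) scale
    offset = upsample_offset(offset)              # O2 at the second scale
    for s in range(1, len(feats)):
        aligned = aligns[s](feats[s], offset)     # DCN alignment at this scale
        offset = offset + opns[s](aligned)        # residual prediction (O2+O3, O4+O5)
        if s < len(feats) - 1:
            offset = upsample_offset(offset)      # O4 at the next scale
    return offset                                 # final offset O at the original scale
```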
  • the embodiment of the present application does not limit the specific network structure of the time domain alignment module.
• The time domain alignment module includes K first time domain alignment units and K-1 first downsampling units.
  • K is a positive integer greater than 2.
• The first time domain alignment unit is a deformable convolution DCN.
  • the first downsampling unit is an average pooling layer.
  • the first downsampling unit is a maximum pooling layer.
  • the above S604 includes the following S604-A1 to S604-A3:
  • k is a positive integer from K to 2.
• When k is K, the offset value and first feature information of the first image at the k-th scale are the offset value and first feature information of the first image at the N-th scale.
• Optionally, K is equal to N.
• The reference images here can be understood as all of the M reference images of the image to be enhanced, or as part of the M reference images. The process of extracting the second feature information is the same for the image to be enhanced and for each reference image; for ease of description, any image among the image to be enhanced and the reference images is recorded as the first image, and the extraction of the second feature information for each of these images can refer to the first image.
• In this step, multi-scale alignment is performed on the offset value of the first image at the N-th scale predicted by the above offset value prediction module and the first feature information of the first image at the N-th scale extracted by the feature extraction module. Specifically, the first feature information and offset value of the first image at the N-th scale are downsampled to obtain the first feature information and offset values at different scales, and the first feature information at each scale is aligned with the offset value at that scale to obtain the second feature information of the first image at the different scales.
• Taking K=3 as an example, the offset value and first feature information of the first image at the third scale are input into the third first time domain alignment unit to obtain the second feature information of the first image at the third scale, where the offset value, the first feature information, and the second feature information of the first image at the third scale are all of size H×W. The offset value and first feature information of the first image at the third scale are also input into the second first downsampling unit for downsampling to obtain the offset value and first feature information of the first image at the second scale; optionally, their size is H/2×W/2. The offset value and first feature information of the first image at the second scale are input into the second first time domain alignment unit to obtain the second feature information of the first image at the second scale. The offset value and first feature information of the first image at the second scale are further input into the first first downsampling unit for downsampling to obtain the offset value and first feature information of the first image at the first scale; optionally, their size is H/4×W/4. These are input into the first first time domain alignment unit to obtain the second feature information of the first image at the first scale.
• That is, this step adopts a multi-scale alignment operation: the offset value O of the first image and the first feature information at the L1 scale are synchronously downsampled to multiple smaller scales, for example, O and the L1-scale features of the original scale are downsampled to half or a quarter of the original scale. Deformable convolution alignment is then performed on the first feature information at each of the three scales. The offset values at the three scales all come from the offset value O at the original scale; therefore, when training the network, the coarse offset values at the small scales guide the accurate offset values at the large scale to be optimized toward the true offset values.
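• A non-limiting sketch of this multi-scale alignment is given below, using average pooling (one of the options named above) to synchronously downsample the offset and the first feature information; halving the offset values together with the resolution (strictly, only their (x, y) channels) is an assumption of the sketch.

```python
import torch.nn.functional as F

def multi_scale_align(feat, offset, dcns, num_scales=3):
    # feat, offset: first feature information and offset of one image at the Nth scale
    aligned = []
    for s in range(num_scales):
        aligned.append(dcns[s](feat, offset))      # second feature info at this scale
        if s < num_scales - 1:
            feat = F.avg_pool2d(feat, 2)           # synchronous downsampling
            offset = F.avg_pool2d(offset, 2) * 0.5 # keep offsets in pixel units
    return aligned  # second feature information at the N-th, (N-1)-th, ... scales
```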
• In some embodiments, the above S603 includes: inputting the first feature information of the first image at the N scales into the offset value prediction module for multi-scale prediction to obtain P groups of offset values of the first image at the N-th scale, where P is a positive integer. Correspondingly, the above S604 includes: dividing the first image into P image blocks and assigning the P groups of offset values to the P image blocks one by one (see the sketch after this paragraph); inputting the group of offset values corresponding to each image block and the first feature information into the time domain alignment module for multi-scale time domain alignment to obtain the multi-scale second feature information of the image block at the N-th scale; and obtaining the multi-scale second feature information of the first image at the N-th scale according to the multi-scale second feature information of the image blocks at the N-th scale in the first image.
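• A non-limiting sketch of this block-wise variant follows; the regular grid split and the P = gh×gw layout are assumptions used for illustration.

```python
def align_by_blocks(feat, block_offsets, align_dcn, grid=(2, 2)):
    # feat: (B, C, H, W) first feature information at one scale
    # block_offsets: list of P full-size offset maps, one group per block
    B, C, H, W = feat.shape
    gh, gw = grid                      # P = gh * gw blocks
    bh, bw = H // gh, W // gw
    out = feat.clone()
    p = 0
    for i in range(gh):
        for j in range(gw):
            ys, xs = slice(i * bh, (i + 1) * bh), slice(j * bw, (j + 1) * bw)
            out[:, :, ys, xs] = align_dcn(feat[:, :, ys, xs],
                                          block_offsets[p][:, :, ys, xs])
            p += 1
    return out  # multi-scale alignment would repeat this at each downsampled scale
```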
  • the embodiment of the present application does not limit the specific network structure of the quality enhancement module.
  • the quality enhancement module includes K first enhancement units and K-1 first upsampling units, then the above S605 includes the following S605-A1 to S605-A4:
• When k is 1, the fusion value of the enhanced image of the image to be enhanced at the k-th scale is the initial prediction value of the enhanced image of the image to be enhanced at the first scale, obtained by the first first enhancement unit according to the second feature information of the image to be enhanced and the reference image at the first scale.
• S605-A4: Determine the fusion value of the enhanced image of the image to be enhanced at the K-th scale as the predicted value of the enhanced image of the image to be enhanced at the N-th scale.
• Specifically, the second feature information of the image to be enhanced and of the reference images at the first scale is concatenated and input into the first first enhancement unit for quality enhancement, and the fusion value of the enhanced image of the image to be enhanced at the first scale is obtained. The fusion value at the first scale is input into the first first upsampling unit for upsampling to obtain the upsampled value of the enhanced image of the image to be enhanced at the second scale. The second feature information of the image to be enhanced and of the reference images at the second scale is then concatenated and input into the second first enhancement unit for image quality enhancement to obtain the initial prediction value of the enhanced image of the image to be enhanced at the second scale; the upsampled value and the initial prediction value at the second scale are fused to obtain the fusion value of the enhanced image of the image to be enhanced at the second scale. Next, the fusion value at the second scale is input into the second first upsampling unit for upsampling to obtain the upsampled value of the enhanced image of the image to be enhanced at the third scale. The second feature information of the image to be enhanced and of the reference images at the third scale is concatenated and input into the third first enhancement unit for image quality enhancement to obtain the initial prediction value of the enhanced image of the image to be enhanced at the third scale; the upsampled value and the initial prediction value at the third scale are fused to obtain the fusion value of the enhanced image of the image to be enhanced at the third scale, which is determined as the predicted value of the enhanced image of the image to be enhanced.
  • the last convolutional layer among the plurality of convolutional layers of each first enhancement unit does not include an activation function.
  • a LeakyReLU activation function is used in the first enhancement unit, where the coefficient of the activation function is 0.1.
  • the second feature information of the image to be enhanced and the reference image generated by the time-domain alignment module aligned at multiple scales are input to the quality enhancement module at the same time.
  • the aligned second feature information at different scales is stitched together and input to the quality enhancement module.
• The quality enhancement module has three branches, corresponding to the alignment features of the three input scales. Specifically, the branch at the smallest scale L3 generates a preliminary restored image, and the other branches further learn residual information and restore detail information.
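• A non-limiting sketch of such a three-branch enhancement module is given below. The stated points are kept (three branches, LeakyReLU with slope 0.1, no activation after the last convolutional layer); additive fusion of the upsampled image with the residual branch, bilinear upsampling, and the channel counts are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhanceUnit(nn.Module):
    def __init__(self, in_ch, out_ch=1, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(hidden, out_ch, 3, padding=1),  # last layer: no activation
        )

    def forward(self, x):
        return self.body(x)

class QualityEnhancementModule(nn.Module):
    def __init__(self, feat_ch):
        super().__init__()
        self.units = nn.ModuleList(EnhanceUnit(feat_ch) for _ in range(3))

    def forward(self, aligned):  # aligned: [L3, L2, L1] stitched second feature info
        fused = self.units[0](aligned[0])          # preliminary restored image at L3
        for k in range(1, 3):
            up = F.interpolate(fused, scale_factor=2,
                               mode="bilinear", align_corners=False)
            fused = up + self.units[k](aligned[k]) # fuse learned residual detail
        return fused                               # predicted enhanced image
```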
• The above steps describe alignment and enhancement using the offset value at the N-th scale, and the process of training the quality enhancement network according to the predicted value of the enhanced image of the image to be enhanced at the N-th scale. In some embodiments, the training method of the embodiments of the present application also uses offset values at scales other than the N-th scale for alignment and enhancement, so that the quality enhancement network can also be trained according to the predicted values of the enhanced image of the image to be enhanced at the other scales. This specifically includes the following steps:
• Step A1: Input the first feature information of the image to be enhanced and of the reference image at the N scales into the offset value prediction module for multi-scale prediction, and obtain the offset values of the image to be enhanced and of the reference image at the j-th scale, where the j-th scale is a scale other than the N-th scale among the N scales.
• Step A2: Input the offset value and first feature information of the image to be enhanced at the j-th scale and the offset value and first feature information of the reference image at the j-th scale into the time domain alignment module for multi-scale time domain alignment, and obtain the multi-scale second feature information of the image to be enhanced and of the reference image at the j-th scale.
• Step A3: Input the multi-scale second feature information of the image to be enhanced and of the reference image at the j-th scale into the quality enhancement module, and obtain the predicted value of the enhanced image of the image to be enhanced at the j-th scale.
• Step A4: Train the quality enhancement network according to the predicted value and the real value of the enhanced image of the image to be enhanced at the j-th scale.
• Taking N=3 as an example, as shown in FIG. 8D, the offset values of the image to be enhanced and of the reference image at the second scale (that is, the L2 scale) predicted by the second first prediction unit are obtained. Then, replacing the offset values and first feature information of the image to be enhanced and of the reference image at the N-th scale in S604 with the offset values and first feature information of the image to be enhanced and of the reference image at the j-th scale, and following the method of S604 above, the multi-scale second feature information of the image to be enhanced and of the reference image at the j-th scale output by the time domain alignment module can be obtained.
• FIG. 8G is a schematic diagram of a quality enhancement network provided by a specific embodiment of the present application; for the functions of each module, refer to the description of the above embodiments. In this way, offset values at scales other than the N-th scale are further used to train the quality enhancement network, thereby improving the training efficiency and training accuracy of the quality enhancement network.
  • the embodiment of the present application does not limit the specific training environment of the quality enhancement network and the selection of training data.
  • a total of 108 sequences from Xiph.org and JCT-VC are used, which are divided into 100 sequences in the training set and 8 sequences in the test set.
• The data of each QP is used as a training set and a test set respectively; a total of 4 models are trained.
  • the test set uses the test sequence under the public test conditions required by JVET. After the test set undergoes the same data processing process as the training set, the trained model is input for testing.
  • PSNR is selected as the evaluation standard of image reconstruction quality.
  • the model is trained based on the Pytorch platform.
  • the training set is randomly divided into 128x128 blocks as input, the training batch (batch) is set to 64, the optimizer uses the Adam optimizer, the initial learning rate is 1e-4, and gradually decreases to 1e-6 as the training progresses.
  • Four models are obtained by training under 4 QPs respectively.
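• A non-limiting sketch of this training configuration follows (random 128×128 crops, batch size 64, Adam with initial learning rate 1e-4 decayed toward 1e-6, one model trained per QP); the dataset interface, the L2 loss, and the cosine schedule are assumptions of the sketch, since the embodiment only states that the loss between the predicted and true enhanced images is back-propagated.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def train(model, dataset, epochs=100, device="cuda"):
    # dataset is assumed to yield (frames, gt): frames of shape (T, C, 128, 128)
    # holding the image to be enhanced and its references, gt the ground truth
    loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs, eta_min=1e-6)
    model.to(device).train()
    for _ in range(epochs):
        for frames, gt in loader:
            pred = model(frames.to(device))
            loss = F.mse_loss(pred, gt.to(device))  # assumed L2 loss
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()  # decays the learning rate from 1e-4 toward 1e-6
```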
  • image-level input is used to input the entire image into the network for processing.
  • Table 1 shows the improvement effect of the present application relative to the HM16.9 compression reconstruction video quality.
• BD-rate and BD-PSNR are among the main parameters for evaluating the performance of a video coding algorithm, and represent the change in bit rate and PSNR (Peak Signal to Noise Ratio) of video coded by the new algorithm (that is, the technical solution of this application) relative to the original algorithm. "-" indicates a performance improvement, that is, a bit rate saving at the same PSNR. The technical solution proposed in this application achieves an average bit rate saving of 21.0%.
• In summary, the embodiments of the present application provide a training method for a quality enhancement network, where the quality enhancement network includes a feature extraction module, an offset value prediction module, a time domain alignment module, and a quality enhancement module. During training, the image to be enhanced and M reference images of the image to be enhanced are obtained; the image to be enhanced and the M reference images are input into the feature extraction module for feature extraction at different scales, and the first feature information of the image to be enhanced and of the reference images at N scales is obtained; the first feature information at the N scales is input into the offset value prediction module for multi-scale prediction, and the offset values of the image to be enhanced and of the reference images at the N-th scale are obtained; the offset value and first feature information of the image to be enhanced at the N-th scale and the offset value and first feature information of the reference image at the N-th scale are input into the time domain alignment module for multi-scale time domain alignment, and the second feature information of the image to be enhanced and of the reference images at multiple scales is obtained; the second feature information at multiple scales is input into the quality enhancement module to obtain the predicted value of the enhanced image of the image to be enhanced; and the quality enhancement network is trained according to the predicted value of the enhanced image of the image to be enhanced and the ground-truth value of the enhanced image of the image to be enhanced. Since the above quality enhancement network adopts a pyramid-shaped prediction network in which only the offset value is upsampled, the information loss caused by upsampling image features is avoided. In addition, in order to predict the offset value more accurately and optimize network training, a multi-scale alignment technique is adopted, in which the offset value at the original scale and the features to be aligned are synchronously downsampled; the offset value at a small scale is closer to the real sampling point than the offset value at a large scale, so when training the network the gradient optimization direction points toward the real sampling point, finally guiding the entire alignment process to be more accurate. When the trained network is used for image enhancement, efficient image enhancement can be achieved.
  • FIG. 7 introduces the process of using the offset value of the image to be enhanced and the reference image to train the quality enhancement network.
  • the process of using the offset value of the reference image to train the quality enhancement network will be introduced below with reference to FIG. 9 .
  • Fig. 9 is a schematic flowchart of a training method for a quality enhancement network provided by an embodiment of the present application. As shown in Fig. 9, the training process includes:
  • M is a positive integer.
  • the Nth scale is the largest scale among the N scales.
  • This embodiment does not limit the specific network structure of the offset value prediction module.
  • the offset value prediction module includes N second prediction units, then the above S703 includes:
• When j is 1, the offset value of the reference image at the j-th scale is 0.
• Specifically, the first feature information of the image to be enhanced and of the reference image at the first scale, output by the third first feature extraction unit shown in FIG. 8B above, is spliced and input into the first second prediction unit for offset value prediction, and the offset value of the reference image at the second scale predicted by the first second prediction unit is obtained. The first feature information of the image to be enhanced and of the reference image at the second scale and the predicted offset value of the reference image at the second scale are spliced and input into the second second prediction unit for prediction, and the offset value of the reference image at the third scale predicted by the second second prediction unit is obtained. The first feature information of the image to be enhanced and of the reference image at the third scale and the predicted offset value of the reference image at the third scale are spliced and input into the third second prediction unit for offset value prediction, and the offset value of the reference image at the third scale predicted by the third second prediction unit is obtained.
• The embodiment of the present application does not limit the specific network structure of the second prediction unit. In some embodiments, the first second prediction unit includes the first second prediction subunit and the first second upsampling subunit.
  • the above S703-A includes:
• If the j-th second prediction unit is a second prediction unit other than the first second prediction unit among the N second prediction units, the j-th second prediction unit includes the j-th second alignment subunit, the j-th second prediction subunit, and the j-th second upsampling subunit. As shown in FIG. 10B, if the j-th second prediction unit is the second second prediction unit among the N second prediction units, the second second prediction unit includes the second second alignment subunit, the second second prediction subunit, and the second second upsampling subunit.
  • the Nth second prediction unit includes the Nth second alignment subunit and the Nth second prediction subunit, then the above S703-B includes:
• The embodiment of the present application does not limit the network structure of the above-mentioned second alignment subunit, second prediction subunit, and second upsampling subunit.
  • the above-mentioned second prediction subunit is an offset value prediction network OPN.
  • the above-mentioned second alignment subunit is a deformable convolutional DCN.
• Specifically, the first feature information of the image to be enhanced and of the reference images at the first scale (that is, the smallest scale L3), generated by the above feature extraction module, is concatenated and input into the first second prediction subunit (OPN) to predict the offset value. The OPN uses 3 convolutional layers to predict the offset value and obtains the offset value of the reference image at the first scale. Then, the offset value of the reference image at the first scale is upsampled by the first second upsampling subunit to the offset value O2 at the second scale (that is, the L2 scale). The offset value O2 is input into the second second alignment subunit (DCN), which aligns the first feature information of the image to be enhanced and of the reference image at the second scale. The aligned feature information is input into the second second prediction subunit (OPN) to obtain the offset value O3 of the reference image predicted at the second scale; O3 is added to O2, and the sum is input into the second second upsampling subunit to obtain the offset value O4. O4 is input into the third second alignment subunit, so that the third second alignment subunit samples and aligns the first feature information of the image to be enhanced and of the reference image at the third scale (that is, the original scale L1) output by the above steps, obtaining the alignment features of the image to be enhanced and the reference image at the third scale. The alignment features at the third scale are input into the third second prediction subunit, which predicts the offset value O5 of the reference image; O5 is added to O4 to obtain the offset value of the reference image at the third scale.
  • the embodiment of the present application does not limit the specific network structure of the time domain alignment module.
  • the time domain alignment module includes K second time domain alignment units and K-1 second downsampling units, where K is a positive integer greater than 2.
• The second time domain alignment unit is a deformable convolution DCN.
  • the second downsampling unit is an average pooling layer.
  • the second downsampling unit is a maximum pooling layer.
  • k is a positive integer from K to 2.
• When k is K, the offset value and first feature information of the reference image at the k-th scale are the offset value and first feature information of the reference image at the N-th scale.
• Taking K=3 as an example, the offset value and first feature information of the reference image at the third scale are input into the third second time domain alignment unit to obtain the second feature information of the reference image at the third scale, where the offset value, the first feature information, and the second feature information of the reference image at the third scale are all of size H×W. The offset value and first feature information of the reference image at the third scale are also input into the second second downsampling unit for downsampling to obtain the offset value and first feature information of the reference image at the second scale; optionally, their size is H/2×W/2.
• In some embodiments, the above S703 includes: inputting the first feature information of the reference image at the N scales into the offset value prediction module for multi-scale prediction to obtain P groups of offset values of the reference image at the N-th scale, where P is a positive integer. Correspondingly, the above S704 includes: dividing the reference image into P image blocks and assigning the P groups of offset values to the P image blocks one by one; inputting the group of offset values corresponding to each image block and the first feature information into the time domain alignment module for multi-scale time domain alignment to obtain the multi-scale second feature information of the image block at the N-th scale; and obtaining the multi-scale second feature information of the reference image at the N-th scale according to the multi-scale second feature information of the image blocks at the N-th scale in the reference image.
  • the embodiment of the present application does not limit the specific network structure of the quality enhancement module.
  • the quality enhancement module includes K second enhancement units and K-1 second upsampling units, then the above S705 includes:
• S705-A4: Determine the fusion value of the enhanced image of the image to be enhanced at the K-th scale as the predicted value of the enhanced image of the image to be enhanced at the N-th scale.
• Specifically, the first feature information of the image to be enhanced at the first scale and the second feature information of the reference image at the first scale are concatenated and input into the first second enhancement unit for quality enhancement, and the fusion value of the enhanced image of the image to be enhanced at the first scale is obtained; this fusion value is input into the first second upsampling unit for upsampling to obtain the upsampled value of the enhanced image of the image to be enhanced at the second scale. The first feature information of the image to be enhanced at the second scale and the second feature information of the reference image at the second scale are spliced and input into the second second enhancement unit for image quality enhancement to obtain the initial prediction value of the enhanced image of the image to be enhanced at the second scale; the upsampled value and the initial prediction value at the second scale are fused to obtain the fusion value of the enhanced image of the image to be enhanced at the second scale. Next, the fusion value at the second scale is input into the second second upsampling unit for upsampling to obtain the upsampled value of the enhanced image of the image to be enhanced at the third scale. The first feature information of the image to be enhanced at the third scale and the second feature information of each reference image at the third scale are spliced and input into the third second enhancement unit for image quality enhancement to obtain the initial prediction value of the enhanced image of the image to be enhanced at the third scale; the upsampled value and the initial prediction value at the third scale are fused to obtain the fusion value of the enhanced image of the image to be enhanced at the third scale, which is determined as the predicted value of the enhanced image of the image to be enhanced at the third scale.
• In some embodiments, the second enhancement unit includes a plurality of convolutional layers, and the last convolutional layer among the plurality of convolutional layers does not include an activation function.
• The above steps describe alignment and enhancement using the offset value at the N-th scale, and the process of training the quality enhancement network according to the predicted value of the enhanced image of the image to be enhanced at the N-th scale. In some embodiments, the training method of the embodiments of the present application also uses offset values at scales other than the N-th scale for alignment and enhancement, so that the quality enhancement network can also be trained according to the predicted values of the enhanced image of the image to be enhanced at the other scales. This specifically includes the following steps:
• Step B1: Input the first feature information of the image to be enhanced and of the reference image at the N scales into the offset value prediction module for multi-scale prediction, and obtain the offset value of the reference image at the j-th scale, where the j-th scale is a scale other than the N-th scale among the N scales.
• Step B2: Input the offset value and first feature information of the reference image at the j-th scale into the time domain alignment module for multi-scale time domain alignment, and obtain the multi-scale second feature information of the reference image at the j-th scale.
• Step B3: Input the first feature information of the image to be enhanced at multiple scales and the second feature information of the reference image at multiple scales into the quality enhancement module, and obtain the predicted value of the enhanced image of the image to be enhanced at the j-th scale.
• Step B4: Train the quality enhancement network according to the predicted value of the enhanced image of the image to be enhanced at the j-th scale and the true value of the enhanced image of the image to be enhanced.
• In summary, in this training method, the image to be enhanced and the M reference images of the image to be enhanced are input into the feature extraction module for feature extraction at different scales, and the first feature information of the image to be enhanced and of the reference images at N scales is obtained; the first feature information at the N scales is input into the offset value prediction module for multi-scale prediction, and the offset value of the reference image at the N-th scale is obtained; the offset value and first feature information of the reference image at the N-th scale are input into the time domain alignment module for multi-scale time domain alignment, and the second feature information of the reference image at multiple scales is obtained; the second feature information of the reference image at multiple scales is input into the quality enhancement module to obtain the predicted value of the enhanced image of the image to be enhanced; and the quality enhancement network is trained according to the predicted value of the enhanced image of the image to be enhanced and the ground-truth value of the enhanced image of the image to be enhanced.
• Since the above quality enhancement network adopts a pyramid-shaped prediction network in which only the offset value is upsampled, the information loss caused by upsampling image features is avoided. In addition, in order to predict the offset value more accurately and optimize network training, a multi-scale alignment technique is adopted, in which the offset value at the original scale and the features to be aligned are synchronously downsampled; the offset value at a small scale is closer to the real sampling point than the offset value at a large scale. When training the network, the gradient optimization direction therefore points toward the real sampling point, finally guiding the entire alignment process to be more accurate. When the trained network is used for image enhancement, efficient image enhancement can be achieved.
• Furthermore, in this training method the offset value prediction module only predicts the offset value of the reference image, and the time domain alignment module only performs time domain alignment on the reference image, thereby reducing the amount of computation of each module and the model training complexity, and thus improving the training efficiency of the model.
• The quality enhancement network provided by the embodiments of the present application can also be applied to the video codec framework, for example, to the video decoding end, to perform quality enhancement on the reconstructed image obtained by the decoding end and obtain an enhanced image of the reconstructed image.
  • Fig. 11 is a schematic flowchart of an image decoding method provided by an embodiment of the present application. As shown in Fig. 11, the method includes:
  • the entropy decoding unit 310 can analyze the code stream to obtain prediction information of the current block, quantization coefficient matrix, etc., and the prediction unit 320 uses intra prediction or inter prediction for the current block based on the prediction information to generate a prediction block of the current block.
• The inverse quantization/transformation unit 330 performs inverse quantization and inverse transformation on the quantization coefficient matrix obtained from the code stream to obtain a residual block.
  • the reconstruction unit 340 adds the predicted block and the residual block to obtain a reconstructed block.
  • the reconstructed blocks form a reconstructed image, and the optional loop filtering unit 350 performs loop filtering on the reconstructed image based on the image or based on the block to obtain the current reconstructed image.
  • a quality enhancement network is combined with a video coding framework.
  • the quality enhancement network described in the above embodiment is added at the output end of the decoder.
  • the decoded current reconstructed image is input to the quality enhancement network, and the quality enhancement network can be used to significantly improve the image quality of the current reconstructed image, and further improve the decoded image quality under the premise of ensuring the bit rate.
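• The placement of the network in the decoding loop can be sketched as follows; `decode_one_picture`, the buffer interface, and the enhancement call are illustrative names for this sketch, not an actual codec API.

```python
def decode_and_enhance(bitstream, decoder, qe_net, recon_buffer, r):
    # entropy decode -> predict -> inverse quantize/transform -> reconstruct
    # -> (optional) loop filter, all inside the conventional decoder
    recon = decoder.decode_one_picture(bitstream)
    recon_buffer.store(recon)
    refs = recon_buffer.pick_references(recon.poc, r)  # forward/backward frames
    if not refs:
        return recon              # nothing to align against yet
    return qe_net(recon, refs)    # enhanced image for display or output
```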
  • Ways to obtain M reference images of the current reconstructed image in this step include but are not limited to the following:
• Way 1: The M reference images of the current reconstructed image are any M images among the reconstructed images.
• Way 2: From the reconstructed images, at least one image located in the forward direction and/or backward direction of the current reconstructed image in the playing order is obtained as a reference image of the current reconstructed image.
  • the current reconstructed image and the M reference images are consecutive images in a playback sequence.
  • the current reconstructed image and the M reference images are not consecutive images in a playback order.
  • the method in the embodiment of the present application further includes: decoding the code stream to obtain first flag information, where the first flag information is used to indicate whether to use a quality enhancement network to perform quality enhancement on the currently reconstructed image.
• If the first flag information indicates that the quality enhancement network is used to enhance the quality of the current reconstructed image, M reference images of the current reconstructed image are acquired from the reconstructed images.
• The above first flag information is included in the sequence parameter set SPS.
  • the decoder needs to read the first flag information from the SPS before performing the above S802. If the value of the first flag information is 1, it means that the quality enhancement network of the present application is used to enhance the quality of the currently decoded reconstructed image. If the value of the first flag information is 0, it means that the quality enhancement network of the present application is not used to enhance the quality of the currently decoded reconstructed image.
  • the reference image of the current reconstructed image has the following two situations:
• Situation 1: the current reconstructed image is first input into the reconstructed video buffer, and after one or more Groups Of Pictures (GOP) have been processed, the images from t-r to t-1 in the forward direction and/or from t+1 to t+r in the backward direction of the current reconstructed image t are read from the reconstructed video buffer as reference images of the current reconstructed image.
  • each of the above reference images is an image that has not been enhanced by a quality enhancement network.
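• A non-limiting sketch of reading the forward/backward reference images from the reconstructed video buffer follows; representing the buffer as a mapping from picture order to the un-enhanced reconstructed frame is an assumption of the sketch.

```python
def pick_references(buffer, t, r):
    # buffer: {picture order -> un-enhanced reconstructed frame}
    refs = []
    for k in range(t - r, t + r + 1):   # forward t-r..t-1 and backward t+1..t+r
        if k != t and k in buffer:
            refs.append(buffer[k])
    return refs
```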
• The quality enhancement network includes a feature extraction module, an offset value prediction module, a time domain alignment module, and a quality enhancement module. The feature extraction module is used to perform feature extraction at different scales on the current reconstructed image and the reference image, obtaining the first feature information of each image at N scales, where N is a positive integer greater than 1; the offset value prediction module is used to perform multi-scale prediction according to the first feature information of the current reconstructed image and the reference image at the N scales to obtain the offset value of the reference image; the time domain alignment module is used to perform time domain alignment according to the offset value and first feature information of the reference image to obtain the second feature information of the reference image; and the quality enhancement module is used to predict the enhanced image of the current reconstructed image according to the second feature information of the reference image.
  • the enhanced image of the current reconstructed image is marked and stored in the reconstructed video buffer.
  • the enhanced image of the current reconstructed image is directly displayed.
  • the quality enhancement network includes a feature extraction module, an offset value prediction module, a time domain alignment module and a quality enhancement module.
  • the feature extraction module is used to perform feature extraction of different scales on the current reconstructed image and the reference image respectively, and obtain first feature information of the current reconstructed image and the reference image at N scales respectively.
  • the offset value prediction module is used to perform multi-scale prediction according to the first feature information of the current reconstructed image and the reference image at N scales respectively, to obtain the offset value of the reference image.
  • the temporal alignment module is configured to perform temporal alignment according to the offset value of the reference image and the first characteristic information of the reference image to obtain second characteristic information of the reference image.
  • the quality enhancement module is used to predict the enhanced image of the current reconstructed image according to the second characteristic information of the reference image.
  • the temporal alignment module is configured to perform multi-scale temporal alignment according to the offset value of the reference image and the first feature information, to obtain second feature information of the reference image at multiple scales.
  • the feature extraction module includes N first feature extraction units.
• Any image among the current reconstructed image and the reference image is recorded as the first image. The i-th first feature extraction unit is used to output the extracted first feature information of the first image at the (N-i+1)-th scale, and to input the extracted first feature information of the first image at the (N-i+1)-th scale into the (i+1)-th first feature extraction unit, so that the (i+1)-th first feature extraction unit outputs the first feature information of the first image at the (N-i+2)-th scale, where i is a positive integer from 1 to N-1.
• The above reference images can be understood as all of the M reference images of the current reconstructed image, or as part of the M reference images. The process of extracting the first feature information is the same for the current reconstructed image and for each reference image; for ease of description, any image among the current reconstructed image and the reference images is recorded as the first image, and the extraction of the first feature information for each of these images can refer to the above-mentioned first image.
• The feature extraction module includes 6 convolutional layers; the convolution stride of the first convolutional layer and the second convolutional layer is a first value, the convolution stride of the third convolutional layer and the fourth convolutional layer is a second value, and the convolution stride of the fifth convolutional layer and the sixth convolutional layer is a third value, wherein the first value is greater than the second value, and the second value is greater than the third value.
  • the quality enhancement network in the embodiment of the present application is trained by two methods, and the execution process of some modules in the quality enhancement network trained by different training methods is different during prediction.
  • the prediction process of the quality enhancement network obtained by the above two different training methods will be introduced respectively.
• The offset value prediction module is used to perform multi-scale prediction according to the first feature information of the current reconstructed image and the reference image at the N scales, to obtain the offset values of the current reconstructed image and the reference image at the N-th scale, where the N-th scale is the largest scale among the N scales. The time domain alignment module is used to perform multi-scale time domain alignment according to the offset value and first feature information of the current reconstructed image at the N-th scale, and according to the offset value and first feature information of the reference image at the N-th scale, to obtain the second feature information of the current reconstructed image and the reference image at multiple scales. The quality enhancement module is used to obtain an enhanced image of the current reconstructed image according to the second feature information of the current reconstructed image and the reference image at multiple scales.
  • the offset value prediction module includes N first prediction units.
• For the j-th first prediction unit among the N first prediction units, the j-th first prediction unit is used to obtain the offset values of the current reconstructed image and the reference image at the (j+1)-th scale according to the first feature information of the current reconstructed image and the reference image at the j-th scale and the offset values of the current reconstructed image and the reference image at the j-th scale.
• The N-th first prediction unit is used to obtain the offset values of the current reconstructed image and the reference image at the N-th scale predicted by the N-th first prediction unit, according to the first feature information of the current reconstructed image and the reference image at the N-th scale and the offset values of the current reconstructed image and the reference image at the N-th scale predicted by the (N-1)-th first prediction unit.
• When j is 1, the offset values of the current reconstructed image and the reference image at the j-th scale are each 0.
• In some embodiments, the first first prediction unit includes the first first prediction subunit and the first first upsampling subunit.
• The first first prediction subunit is used to predict offset values according to the first feature information of the current reconstructed image and the reference image at the first scale, to obtain the predicted offset values of the current reconstructed image and the reference image at the first scale; the first first upsampling subunit is used to upsample the offset values of the current reconstructed image and the reference image at the first scale predicted by the first first prediction subunit, to obtain the offset values of the current reconstructed image and the reference image at the second scale.
• If the j-th first prediction unit is a first prediction unit other than the first first prediction unit among the N first prediction units, the j-th first prediction unit includes a j-th first alignment subunit, a j-th first prediction subunit, and a j-th first upsampling subunit.
• The j-th first alignment subunit is used to perform temporal feature alignment according to the first feature information of the current reconstructed image and the reference image at the j-th scale and the offset values of the current reconstructed image and the reference image at the j-th scale predicted by the (j-1)-th first prediction unit, to obtain the feature information of the current reconstructed image and the reference image aligned at the j-th scale;
• the j-th first prediction subunit is used to predict offset values according to the aligned feature information of the current reconstructed image and the reference image at the j-th scale, to obtain the offset values of the current reconstructed image and the reference image at the j-th scale;
• the j-th first upsampling subunit is used to upsample the sum of the offset values of the current reconstructed image and the reference image at the j-th scale output by the j-th first prediction subunit and the offset values of the current reconstructed image and the reference image at the j-th scale predicted by the (j-1)-th first prediction unit, to obtain the offset values of the current reconstructed image and the reference image at the (j+1)-th scale.
  • the Nth first prediction unit includes the Nth first alignment subunit and the Nth first prediction subunit.
• The N-th first alignment subunit is used to perform temporal feature alignment according to the first feature information of the current reconstructed image and the reference image at the N-th scale and the offset values of the current reconstructed image and the reference image at the N-th scale predicted by the (N-1)-th first prediction unit, to obtain the feature information of the current reconstructed image and the reference image aligned at the N-th scale;
• the N-th first prediction subunit is used to predict offset values according to the aligned feature information of the current reconstructed image and the reference image at the N-th scale, to obtain the predicted offset values of the current reconstructed image and the reference image at the N-th scale;
• the offset values of the current reconstructed image and the reference image at the N-th scale predicted by the N-th first prediction unit are determined by adding the offset values of the current reconstructed image and the reference image at the N-th scale predicted by the N-th first prediction subunit to the offset values of the current reconstructed image and the reference image at the N-th scale predicted by the (N-1)-th first prediction unit.
  • each of the foregoing first prediction subunits is an OPN.
  • the above-mentioned first alignment subunit is a DCN.
  • the time domain alignment module includes K first time domain alignment units and K-1 first downsampling units, where K is a positive integer greater than 2.
• The k-th first time domain alignment unit is used to obtain the second feature information of the first image at the k-th scale according to the offset value and first feature information of the first image at the k-th scale, where the first image is the current reconstructed image or a reference image;
  • the k-1th first down-sampling unit is used to perform down-sampling according to the offset value of the first image at the k-th scale and the first feature information to obtain the offset of the first image at the k-1-th scale value and first feature information;
  • the k-1 first temporal alignment unit is used to obtain the second image of the first image at the k-1 scale according to the offset value of the first image at the k-1 scale and the first feature information characteristic information.
  • the offset value and first feature information of the first image at the k-th scale are the offset value and first feature information of the first image at the N-th scale.
  • the first time domain alignment unit is a DCN.
  • the above-mentioned first down-sampling unit is an average pooling layer or a maximum pooling layer.
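A minimal sketch of the temporal alignment cascade: K DCN alignment units interleaved with K-1 pooling downsampling units. The offsets and first feature information enter at the largest (K-th) scale; halving the offset magnitudes on downsampling and the channel counts are illustrative assumptions.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class TemporalAlignment(nn.Module):
    def __init__(self, channels=64, K=3):
        super().__init__()
        self.aligners = nn.ModuleList(
            [DeformConv2d(channels, channels, 3, padding=1) for _ in range(K)])
        self.down = nn.AvgPool2d(2)  # could equally be nn.MaxPool2d(2)

    def forward(self, feat, offset):
        # feat/offset: first feature information and offset values of one image
        # (current reconstruction or reference) at the K-th (largest) scale.
        second_feats = []
        for k in range(len(self.aligners) - 1, -1, -1):  # K-th unit ... 1st unit
            second_feats.append(self.aligners[k](feat, offset))
            if k > 0:  # downsample inputs for the (k-1)-th alignment unit
                feat, offset = self.down(feat), self.down(offset) / 2.0
        return second_feats[::-1]  # second feature information at scales 1..K
```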
  • the offset value prediction module is used to perform multi-scale prediction according to the first feature information of the first image at the N scales, to obtain P groups of offset values of the first image at the N-th scale, where P is a positive integer;
  • the temporal alignment module is used to divide the first image into P image blocks and assign the P groups of offset values to the P image blocks one by one; for each image block, multi-scale temporal alignment is performed according to the block's group of offset values and its first feature information to obtain the multi-scale second feature information of the image block at the N-th scale, and the multi-scale second feature information of the first image at the N-th scale is then obtained from the multi-scale second feature information of its image blocks, as in the block-wise sketch below.
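A minimal sketch of the block-wise variant: the first image's features are split into P image blocks, each block is aligned with its own group of offset values, and the aligned blocks are stitched back together. The 2x2 block grid and the `aligner` callable (e.g. a DCN as above) are illustrative assumptions.

```python
import torch

def align_by_blocks(feat, offset_groups, aligner, grid=(2, 2)):
    gh, gw = grid                                   # P = gh * gw image blocks
    rows = feat.chunk(gh, dim=2)
    blocks = [b for row in rows for b in row.chunk(gw, dim=3)]
    # One group of offset values per image block, assigned one by one.
    aligned = [aligner(blk, off) for blk, off in zip(blocks, offset_groups)]
    # Reassemble the per-block second feature information into one map.
    out_rows = [torch.cat(aligned[i * gw:(i + 1) * gw], dim=3) for i in range(gh)]
    return torch.cat(out_rows, dim=2)
```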
  • the quality enhancement module includes K first enhancement units and K-1 first upsampling units.
  • the (k+1)-th first enhancement unit is used to perform image quality enhancement according to the second feature information of the current reconstructed image and of the reference image at the (k+1)-th scale, to obtain the initial prediction value of the enhanced image of the current reconstructed image at the (k+1)-th scale, where k is a positive integer from 1 to K-1;
  • the k-th first upsampling unit is used to upsample the fusion value of the enhanced image of the current reconstructed image at the k-th scale, to obtain the upsampled value of the enhanced image of the current reconstructed image at the (k+1)-th scale; when k is 1, the fusion value of the enhanced image at the k-th scale is the initial prediction value of the enhanced image at the first scale, obtained by the first first enhancement unit from the second feature information of the current reconstructed image and of the reference image at the first scale;
  • the fusion value of the enhanced image of the current reconstructed image at the (k+1)-th scale is determined by fusing the upsampled value and the initial prediction value of the enhanced image of the current reconstructed image at the (k+1)-th scale.
  • the predicted value of the enhanced image of the current reconstructed image at the N-th scale is determined according to the fusion value of the enhanced image of the current reconstructed image at the K-th scale.
  • each first enhancement unit includes a plurality of convolutional layers, and the last convolutional layer among them does not include an activation function; a hedged sketch follows.
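A minimal sketch of the quality enhancement module: K enhancement units and K-1 upsampling units fuse coarse-to-fine predictions of the enhanced image. The additive fusion and channel counts are assumptions; the application leaves the exact fusion open, and a final projection back to image channels is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhanceUnit(nn.Module):
    def __init__(self, in_ch, mid_ch=64, out_ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 3, padding=1),  # last layer: no activation
        )

    def forward(self, x):
        return self.body(x)

class QualityEnhancementModule(nn.Module):
    def __init__(self, channels=64, K=3):
        super().__init__()
        self.units = nn.ModuleList(
            [EnhanceUnit(2 * channels, channels, channels) for _ in range(K)])

    def forward(self, cur_feats, ref_feats):
        # cur_feats/ref_feats: second feature information at scales 1..K.
        fused = self.units[0](torch.cat([cur_feats[0], ref_feats[0]], dim=1))
        for k in range(1, len(self.units)):
            up = F.interpolate(fused, scale_factor=2, mode='bilinear',
                               align_corners=False)
            init = self.units[k](torch.cat([cur_feats[k], ref_feats[k]], dim=1))
            fused = up + init  # fuse upsampled value with initial prediction
        return fused  # predicted enhanced image at the largest scale
```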
  • the offset value prediction module is used to perform multi-scale prediction according to the first feature information of the current reconstructed image and of the reference image at the N scales, to obtain the offset value of the reference image at the N-th scale, the N-th scale being the largest scale among the N scales;
  • the temporal alignment module is used to perform multi-scale temporal alignment according to the offset value of the reference image at the N-th scale and its first feature information, to obtain the second feature information of the reference image at multiple scales;
  • the quality enhancement module is used to obtain the predicted value of the enhanced image of the current reconstructed image according to the first feature information of the current reconstructed image at multiple scales and the second feature information of the reference image at multiple scales.
  • the offset value prediction module includes N second prediction units.
  • the j-th second prediction unit is used to obtain the offset value of the reference image at the (j+1)-th scale according to the first feature information of the current reconstructed image and of the reference image at the j-th scale and the offset value of the reference image at the j-th scale, where j is a positive integer from 1 to N-1.
  • the N-th second prediction unit is used to obtain the offset value of the reference image at the N-th scale predicted by the N-th second prediction unit, according to the first feature information of the current reconstructed image and of the reference image at the N-th scale and the offset value of the reference image at the N-th scale predicted by the (N-1)-th second prediction unit.
  • when j-1 equals 0, the offset value of each reference image at the (j-1)-th scale is 0.
  • the first second prediction unit includes a first second prediction subunit and a first second upsampling subunit.
  • the first second prediction subunit is used to perform offset value prediction according to the first feature information of the current reconstructed image and of the reference image at the first scale, to obtain the offset value of the reference image at the first scale;
  • the first second upsampling subunit is configured to upsample the offset value of the reference image at the first scale, to obtain the offset value of the reference image at the second scale.
  • if the j-th second prediction unit is a second prediction unit other than the first second prediction unit among the N second prediction units, then the j-th second prediction unit includes a j-th second alignment subunit, a j-th second prediction subunit, and a j-th second upsampling subunit.
  • the j-th second alignment subunit is used to perform temporal feature alignment according to the first feature information of the current reconstructed image and of the reference image at the j-th scale and the offset value of the reference image at the j-th scale predicted by the (j-1)-th second prediction unit, to obtain the aligned feature information of the current reconstructed image and of the reference image at the j-th scale;
  • the j-th second prediction subunit is used to perform offset value prediction according to the aligned feature information of the current reconstructed image and of the reference image at the j-th scale, to obtain the offset value of the reference image at the j-th scale;
  • the j-th second upsampling subunit is used to upsample the sum of the offset value of the reference image output by the j-th second prediction subunit at the j-th scale and the offset value of the reference image at the j-th scale predicted by the (j-1)-th second prediction unit, to obtain the offset value of the reference image at the (j+1)-th scale.
  • the Nth second prediction unit includes the Nth second alignment subunit and the Nth second prediction subunit.
  • the N-th second alignment subunit is used to perform temporal alignment according to the first feature information of the current reconstructed image and of the reference image at the N-th scale and the offset value of the reference image at the N-th scale predicted by the (N-1)-th second prediction unit, to obtain the aligned feature information of the current reconstructed image and of the reference image at the N-th scale;
  • the N-th second prediction subunit is used to perform offset value prediction according to the aligned feature information of the current reconstructed image and of the reference image at the N-th scale, to obtain the offset value of the reference image at the N-th scale predicted by the N-th second prediction subunit;
  • the offset value of the reference image at the N-th scale predicted by the N-th second prediction unit is determined by adding the offset value predicted by the N-th second prediction subunit at the N-th scale to the offset value of the reference image at the N-th scale predicted by the (N-1)-th second prediction unit.
  • each of the foregoing second prediction subunits is an OPN.
  • each of the foregoing second alignment subunits is a DCN.
  • the time domain alignment module includes K second time domain alignment units and K-1 second downsampling units, where K is a positive integer greater than 2.
  • the k-th second temporal alignment unit is used to obtain the second feature information of the reference image at the k-th scale according to the offset value and the first feature information of the reference image at the k-th scale, where k is a positive integer from K to 2;
  • when k equals K, the offset value and the first feature information of the reference image at the K-th scale are the offset value and the first feature information of the reference image at the N-th scale;
  • the (k-1)-th second downsampling unit is used to downsample the offset value and the first feature information of the reference image at the k-th scale, to obtain the offset value and the first feature information of the reference image at the (k-1)-th scale;
  • the (k-1)-th second temporal alignment unit is used to obtain the second feature information of the reference image at the (k-1)-th scale according to the offset value and the first feature information of the reference image at the (k-1)-th scale, until k-1 equals 1.
  • the foregoing second time domain alignment unit is a DCN.
  • the above-mentioned second down-sampling unit is an average pooling layer or a maximum pooling layer.
  • the offset value prediction module is used to perform multi-scale prediction according to the first feature information of the current reconstructed image and of the reference image at the N scales, to obtain P groups of offset values of the reference image at the N-th scale, where P is a positive integer;
  • the temporal alignment module is used to divide the reference image into P image blocks and assign the P groups of offset values to the P image blocks one by one; for each image block, multi-scale temporal alignment is performed according to the block's group of offset values and its first feature information to obtain the multi-scale second feature information of the image block at the N-th scale, and the multi-scale second feature information of the reference image at the N-th scale is then obtained from the multi-scale second feature information of its image blocks.
  • the quality enhancement module includes K second enhancement units and K-1 second upsampling units.
  • the (k+1)-th second enhancement unit is used to perform image quality enhancement according to the first feature information of the current reconstructed image at the (k+1)-th scale and the second feature information of the reference image at the (k+1)-th scale, to obtain the initial prediction value of the enhanced image of the current reconstructed image at the (k+1)-th scale, where k is a positive integer from 1 to K-1;
  • the k-th second upsampling unit is used to upsample the fusion value of the enhanced image of the current reconstructed image at the k-th scale, to obtain the upsampled value of the enhanced image of the current reconstructed image at the (k+1)-th scale;
  • when k is 1, the fusion value of the enhanced image of the current reconstructed image at the k-th scale is the initial prediction value of the enhanced image at the first scale, obtained by the first second enhancement unit from the first feature information of the current reconstructed image at the first scale and the second feature information of the reference image at the first scale;
  • the fusion value of the enhanced image of the current reconstructed image at the (k+1)-th scale is determined by fusing the upsampled value and the initial prediction value of the enhanced image of the current reconstructed image at the (k+1)-th scale.
  • each second enhancement unit includes a plurality of convolutional layers, and the last convolutional layer among them does not include an activation function.
  • the above-mentioned quality enhancement network is used to enhance the quality of the current reconstructed image.
  • the whole process is simple and low-cost, enables efficient enhancement of the current reconstructed image, and thereby improves the quality of the current reconstructed image.
  • the quality enhancement network provided by the embodiments of the present application can also be applied to the video encoding end in the video coding and decoding framework, and perform quality enhancement on the reconstructed image obtained by the encoding end to obtain an enhanced image of the reconstructed image.
  • Fig. 12 is a schematic flowchart of an image coding method provided by an embodiment of the present application. As shown in Fig. 12, the method includes:
  • the basic flow of video encoding involved in the present application is as follows: at the encoding end, the image to be encoded (i.e., the current image) is divided into blocks, and for the current block, the prediction unit 210 uses intra prediction or inter prediction to obtain a predicted block of the current block.
  • the residual unit 220 may calculate a residual block based on the predicted block and the original block of the current block, that is, a difference between the predicted block and the original block of the current block, and the residual block may also be referred to as residual information.
  • the residual block can be transformed and quantized by the transform/quantization unit 230 to remove information to which the human eye is not sensitive, thereby eliminating visual redundancy.
  • the residual block before being transformed and quantized by the transform/quantization unit 230 may be called a time domain residual block, and the time domain residual block after being transformed and quantized by the transform/quantization unit 230 may be called a frequency residual block or a frequency-domain residual block.
  • the entropy encoding unit 280 receives the quantized transform coefficients output by the transform and quantization unit 230 , may perform entropy encoding on the quantized transform coefficients, and output a code stream.
  • the entropy coding unit 280 can eliminate character redundancy according to the target context model and the probability information of the binary code stream.
  • the video encoder performs inverse quantization and inverse transformation on the quantized transform coefficients output by the transform and quantization unit 230 to obtain the residual block of the current block, and then adds the residual block to the prediction block of the current block to obtain the reconstructed block of the current block, as in the sketch below.
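A minimal sketch of this reconstruction path, mirroring the flow just described; the transform/quantization helpers are passed in as hypothetical callables rather than any particular codec's API.

```python
def reconstruct_block(original_block, predicted_block, transform, quantize,
                      dequantize, inverse_transform):
    residual = original_block - predicted_block          # time-domain residual block
    coeffs = quantize(transform(residual))               # frequency-domain residual
    # The same quantized coefficients are entropy-coded into the stream (not shown).
    recon_residual = inverse_transform(dequantize(coeffs))
    return predicted_block + recon_residual              # reconstructed block
```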
  • similarly, reconstructed blocks corresponding to the other blocks to be encoded in the current image can be obtained, and these reconstructed blocks are spliced together to obtain the current reconstructed image of the current image.
  • the current reconstructed image is then filtered, for example with an adaptive loop filter (ALF), to reduce the difference between the pixel values of the pixels in the current reconstructed image and the original pixel values of the corresponding pixels in the current image.
  • the filtered current reconstructed image is stored in the decoded image buffer 270, which may serve as a reference image for inter-frame prediction for subsequent frames.
  • Ways to obtain M reference images of the current reconstructed image in this step include but are not limited to the following:
  • Way 1: the M reference images of the current reconstructed image are any M images among the reconstructed images in the decoded image buffer 270.
  • Way 2: from the reconstructed images in the decoded image buffer 270, at least one image located forward and/or backward of the current reconstructed image in playback order is acquired as a reference image of the current reconstructed image.
  • in one case, the current reconstructed image and the M reference images are consecutive images in playback order; in another case, they are not consecutive in playback order. A sketch of the forward/backward selection in Way 2 is given below.
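A minimal sketch of Way 2, assuming the decoded image buffer can be indexed as a list in playback order: up to r forward and r backward neighbours of the current reconstructed image t are taken as its reference images.

```python
def get_reference_images(buffer, t, r):
    """buffer: reconstructed images in playback order; t: current image index."""
    forward = [buffer[i] for i in range(max(0, t - r), t)]            # t-r .. t-1
    backward = [buffer[i] for i in range(t + 1, min(len(buffer), t + r + 1))]
    return forward + backward  # up to M = 2r reference images
```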
  • first flag information is written into the Sequence Parameter Set (SPS), where the first flag information is used to indicate whether to use the quality enhancement network to perform quality enhancement on the current reconstructed image; a sketch of this sequence-level switch follows.
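A minimal sketch of how the sequence-level flag could gate enhancement on the decoding side. The flag name and the dict-like SPS access are hypothetical; the application only states that the first flag information lives in the SPS and switches the quality enhancement network on or off.

```python
def maybe_enhance(sps, recon, refs, quality_enhancement_net):
    # Hypothetical flag name; the bitstream syntax is not fixed here.
    if sps.get("quality_enhancement_enabled_flag", 0) == 1:
        return quality_enhancement_net(recon, refs)
    return recon  # flag off: output the reconstructed image unchanged
```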
  • M reference images of the current reconstructed image are obtained from the reconstructed image.
  • the reference image of the current reconstructed image has the following two situations:
  • the current reconstructed image is the first reconstructed image.
  • the current reconstructed image is first input into the reconstructed video buffer, and after one or more GOPs have been processed, the forward images t-r to t-1 and/or the backward images t+1 to t+r of the current reconstructed image t are read from the reconstructed video buffer as reference images of the current reconstructed image.
  • each of the above reference images is an image that has not been enhanced by a quality enhancement network.
  • the quality enhancement network includes a feature extraction module, an offset value prediction module, a temporal alignment module and a quality enhancement module.
  • the feature extraction module is used to extract features at different scales from the current reconstructed image and the reference image, to obtain the first feature information of the current reconstructed image and of the reference image at N scales, where N is a positive integer greater than 1;
  • the offset value prediction module is used to perform multi-scale prediction based on the first feature information of the current reconstructed image and of the reference image at the N scales, to obtain the offset value of the reference image;
  • the temporal alignment module is used to perform temporal alignment according to the offset value of the reference image and the first feature information of the reference image, to obtain the second feature information of the reference image;
  • the quality enhancement module is used to predict an enhanced image of the current reconstructed image according to the second feature information of the reference image, as wired together in the sketch below.
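A minimal sketch wiring the four modules into one forward pass, using the illustrative module classes sketched earlier in this section; the exact interfaces between the modules are assumptions.

```python
import torch.nn as nn

class QualityEnhancementNetwork(nn.Module):
    def __init__(self, extractor, offset_predictor, aligner, enhancer):
        super().__init__()
        self.extractor = extractor                 # feature extraction module
        self.offset_predictor = offset_predictor   # offset value prediction module
        self.aligner = aligner                     # temporal alignment module
        self.enhancer = enhancer                   # quality enhancement module

    def forward(self, recon, ref):
        f_recon = self.extractor(recon)            # first feature info, scales 1..N
        f_ref = self.extractor(ref)
        # Multi-scale offset prediction from both images' feature pyramids.
        off_recon, off_ref = self.offset_predictor(f_recon, f_ref)
        # Temporal alignment yields second feature info at multiple scales.
        cur_feats = self.aligner(f_recon[-1], off_recon)
        ref_feats = self.aligner(f_ref[-1], off_ref)
        return self.enhancer(cur_feats, ref_feats)  # enhanced reconstruction
```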
  • the application of the quality enhancement network to the codec system has been introduced above, and the above quality enhancement network can also be applied to other scenarios that require image quality enhancement.
  • Fig. 13 is a schematic flowchart of an image processing method provided by an embodiment of the present application. As shown in Fig. 13, the method includes:
  • the captured images are stored in the buffer in sequence; after the (t+r)-th image has been captured, the (t-r)-th to (t+r)-th images, 2r+1 images in total, are taken out of the buffer and input to the quality enhancement network, where the t-th image is the target image to be enhanced and the other images are the reference images of the target image.
  • enhancement proceeds image by image in playback order: the target image to be enhanced is taken out of the buffer in turn, and together with its forward and backward consecutive reference images it is input into the quality enhancement network to obtain the enhanced image of the target image, as in the loop sketched below.
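A minimal sketch of this sliding-window loop, assuming `net` is a quality enhancement network taking the target image and its reference images; boundary frames without a full window are skipped for brevity.

```python
def enhance_stream(frames, net, r):
    enhanced = []
    for t in range(r, len(frames) - r):
        window = frames[t - r: t + r + 1]           # 2r+1 consecutive images
        target, refs = window[r], window[:r] + window[r + 1:]
        enhanced.append(net(target, refs))          # enhanced image of frame t
    return enhanced
```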
  • the quality enhancement network includes a feature extraction module, an offset value prediction module, a temporal alignment module, and a quality enhancement module. the feature extraction module is used to obtain the first feature information of the image at N scales, where N is a positive integer greater than 1; the offset value prediction module is used to perform multi-scale prediction according to the first feature information of the target image and of the reference image at the N scales, to obtain the offset value of the reference image; the temporal alignment module is used to perform temporal alignment according to the offset value and the first feature information of the reference image, to obtain the second feature information of the reference image; and the quality enhancement module is used to predict the enhanced image of the target image based on the second feature information of the reference image.
  • Fig. 5 to Fig. 13 are only examples of the present application, and should not be construed as limiting the present application.
  • the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present application.
  • the term "and/or" is only an association relationship describing associated objects, indicating that there may be three relationships. Specifically, A and/or B may mean: A exists alone, A and B exist simultaneously, and B exists alone.
  • the character "/" in this article generally indicates that the contextual objects are an "or" relationship.
  • the network structure of the quality enhancement network and the image processing method are introduced above with reference to FIG. 5 to FIG. 13 , and the device embodiment of the present application is described in detail below in conjunction with FIG. 14 to FIG. 16 .
  • FIG. 14 is a schematic block diagram of an image decoding device provided by an embodiment of the present application.
  • the image decoding device may be the decoder shown in FIG. 3 , or a component in the decoder, such as a processor in the decoder.
  • the image decoding device 10 may include:
  • Decoding unit 11 configured to decode the code stream to obtain the current reconstructed image
  • An acquisition unit 12 configured to acquire M reference images of the current reconstructed image from the reconstructed image, where M is a positive integer;
  • the enhancement unit 13 is configured to input the current reconstructed image and the M reference images into a quality enhancement network to obtain an enhanced image of the current reconstructed image.
  • the quality enhancement network includes a feature extraction module, an offset value prediction module, a temporal alignment module, and a quality enhancement module;
  • the feature extraction module is used to perform feature extraction at different scales on the current reconstructed image and the reference image, to obtain the first feature information of the current reconstructed image and of the reference image at N scales, where N is a positive integer greater than 1;
  • the offset value prediction module is used to perform multi-scale prediction on the first feature information of the current reconstructed image and of the reference image at the N scales, to obtain an offset value of the reference image;
  • the temporal alignment module is used to perform temporal alignment on the offset value of the reference image and the first feature information of the reference image, to obtain the second feature information of the reference image;
  • the quality enhancement module is used to predict an enhanced image of the current reconstructed image according to the second feature information of the reference image.
  • the temporal alignment module is configured to perform multi-scale temporal alignment according to the offset value of the reference image and the first characteristic information of the reference image, to obtain the first position of the reference image at multiple scales Two feature information.
  • the feature extraction module includes N first feature extraction units
  • the i-th first feature extraction unit is used to output the extracted first feature information of the first image at the (N-i+1)-th scale and to input that first feature information into the (i+1)-th first feature extraction unit, so that the (i+1)-th first feature extraction unit outputs the first feature information of the first image at the (N-i)-th scale, where i is a positive integer from 1 to N-1 and the first image is either the current reconstructed image or the reference image; a hedged sketch of this cascade follows.
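A minimal sketch of the feature extraction module: N cascaded units, where the i-th unit outputs the first feature information at the (N-i+1)-th scale and feeds it to the next unit. Strided convolutions as the way to move between scales, and the channel counts, are illustrative assumptions.

```python
import torch.nn as nn

class FeatureExtractionModule(nn.Module):
    def __init__(self, in_ch=3, channels=64, N=3):
        super().__init__()
        first = nn.Sequential(nn.Conv2d(in_ch, channels, 3, padding=1),
                              nn.ReLU(inplace=True))
        rest = [nn.Sequential(nn.Conv2d(channels, channels, 3, stride=2, padding=1),
                              nn.ReLU(inplace=True)) for _ in range(N - 1)]
        self.units = nn.ModuleList([first] + rest)

    def forward(self, image):
        feats, x = [], image
        for unit in self.units:           # i-th unit -> (N-i+1)-th scale
            x = unit(x)
            feats.append(x)
        # feats[0] is the largest (N-th) scale, feats[-1] the smallest (1st);
        # reverse so the list is ordered as scales 1..N (small to large).
        return feats[::-1]
```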
  • the offset value prediction module is configured to perform multi-scale prediction according to the first feature information of the current reconstructed image and of the reference image at the N scales, to obtain the offset values of the current reconstructed image and of the reference image at the N-th scale;
  • the temporal alignment module is used to perform multi-scale temporal alignment according to the offset value and the first feature information of the current reconstructed image at the N-th scale, to obtain the second feature information of the current reconstructed image at multiple scales, and to perform multi-scale temporal alignment according to the offset value and the first feature information of the reference image at the N-th scale, to obtain the second feature information of the reference image at multiple scales;
  • the quality enhancement module is configured to obtain an enhanced image of the current reconstructed image according to the second characteristic information of the current reconstructed image and the reference image at multiple scales respectively.
  • the offset value prediction module includes N first prediction units;
  • the j-th first prediction unit is used to obtain the offset values of the current reconstructed image and of the reference image at the (j+1)-th scale according to the first feature information of the current reconstructed image and of the reference image at the j-th scale and the offset values of the current reconstructed image and of the reference image at the j-th scale, where j is a positive integer from 1 to N-1;
  • the N-th first prediction unit is used to obtain the offset values of the current reconstructed image and of the reference image at the N-th scale predicted by the N-th first prediction unit, according to the first feature information of the current reconstructed image and of the reference image at the N-th scale and the offset values of the current reconstructed image and of the reference image at the N-th scale predicted by the (N-1)-th first prediction unit.
  • when j-1 equals 0, the offset values of the current reconstructed image and of the reference image at the (j-1)-th scale are 0.
  • the first first prediction unit includes a first first prediction subunit and a first first upsampling subunit;
  • the first first prediction subunit is used to perform offset value prediction according to the first feature information of the current reconstructed image and of the reference image at the first scale, to obtain the predicted offset values of the current reconstructed image and of the reference image at the first scale;
  • the first first upsampling subunit is used to upsample the offset values of the current reconstructed image and of the reference image at the first scale predicted by the first first prediction subunit, to obtain the offset values of the current reconstructed image and of the reference image at the second scale.
  • if the j-th first prediction unit is a first prediction unit other than the first first prediction unit among the N first prediction units, then the j-th first prediction unit includes a j-th first alignment subunit, a j-th first prediction subunit, and a j-th first upsampling subunit;
  • the j-th first alignment subunit is used to perform temporal feature alignment according to the first feature information of the current reconstructed image and of the reference image at the j-th scale and the offset values of the current reconstructed image and of the reference image at the j-th scale predicted by the (j-1)-th first prediction unit, to obtain the aligned feature information of the current reconstructed image and of the reference image at the j-th scale;
  • the j-th first prediction subunit is used to perform offset value prediction according to the aligned feature information of the current reconstructed image and of the reference image at the j-th scale, to obtain the offset values of the current reconstructed image and of the reference image at the j-th scale;
  • the j-th first upsampling subunit is used to upsample the sum of the offset values output by the j-th first prediction subunit and the offset values predicted by the (j-1)-th first prediction unit, both at the j-th scale, to obtain the offset values of the current reconstructed image and of the reference image at the (j+1)-th scale.
  • the Nth first prediction unit includes an Nth first alignment subunit and an Nth first prediction subunit
  • the N-th first alignment subunit is used to perform temporal alignment according to the first feature information of the current reconstructed image and of the reference image at the N-th scale and the offset values of the current reconstructed image and of the reference image at the N-th scale predicted by the (N-1)-th first prediction unit, to obtain the aligned feature information of the current reconstructed image and of the reference image at the N-th scale;
  • the N-th first prediction subunit is used to perform offset value prediction according to the aligned feature information of the current reconstructed image and of the reference image at the N-th scale, to obtain the predicted offset values of the current reconstructed image and of the reference image at the N-th scale;
  • the offset values of the current reconstructed image and of the reference image at the N-th scale predicted by the N-th first prediction unit are determined by adding the offset values predicted by the N-th first prediction subunit at the N-th scale to the offset values of the current reconstructed image and of the reference image at the N-th scale predicted by the (N-1)-th first prediction unit.
  • the first prediction subunit is an offset value prediction network (OPN).
  • the first alignment subunit is a deformable convolution network (DCN).
  • the time domain alignment module includes K first time domain alignment units and K-1 first downsampling units, where K is a positive integer greater than 2;
  • the k-th first temporal alignment unit is used to obtain the second feature information of the first image at the k-th scale according to the offset value and the first feature information of the first image at the k-th scale;
  • the (k-1)-th first downsampling unit is used to downsample the offset value and the first feature information of the first image at the k-th scale, to obtain the offset value and the first feature information of the first image at the (k-1)-th scale;
  • the (k-1)-th first temporal alignment unit is used to obtain the second feature information of the first image at the (k-1)-th scale according to the offset value and the first feature information of the first image at the (k-1)-th scale, until k-1 equals 1.
  • the first temporal alignment unit is a deformable convolution network (DCN).
  • the first downsampling unit is an average pooling layer.
  • the quality enhancement module includes K first enhancement units and K-1 first upsampling units;
  • the (k+1)-th first enhancement unit is used to perform image quality enhancement according to the second feature information of the current reconstructed image and of the reference image at the (k+1)-th scale, to obtain the initial prediction value of the enhanced image of the current reconstructed image at the (k+1)-th scale, where k is a positive integer from 1 to K-1;
  • the k-th first upsampling unit is used to upsample the fusion value of the enhanced image of the current reconstructed image at the k-th scale, to obtain the upsampled value of the enhanced image of the current reconstructed image at the (k+1)-th scale; when k is 1, the fusion value of the enhanced image at the k-th scale is the initial prediction value of the enhanced image at the first scale, obtained by the first first enhancement unit from the second feature information of the current reconstructed image and of the reference image at the first scale;
  • the fusion value of the enhanced image of the current reconstructed image at the (k+1)-th scale is determined by fusing the upsampled value and the initial prediction value of the enhanced image of the current reconstructed image at the (k+1)-th scale; the predicted value of the enhanced image of the current reconstructed image at the N-th scale is determined according to the fusion value of the enhanced image of the current reconstructed image at the K-th scale.
  • the first enhancement unit includes a plurality of convolutional layers, and a last convolutional layer of the plurality of convolutional layers does not include an activation function.
  • the offset value prediction module is configured to perform multi-scale prediction according to the first feature information of the current reconstructed image and of the reference image at the N scales, to obtain P groups of offset values of the current reconstructed image and of the reference image at the N-th scale, where P is a positive integer;
  • the temporal alignment module is used to divide the first image into P image blocks and assign the P groups of offset values to the P image blocks one by one; for each image block, multi-scale temporal alignment is performed according to the block's group of offset values and its first feature information to obtain the second feature information of the image block at multiple scales, and the multi-scale second feature information of the first image at the N scales is then obtained from the multi-scale second feature information of its image blocks.
  • the offset value prediction module is configured to perform multi-scale prediction according to the first feature information of the current reconstructed image and of the reference image at the N scales, to obtain the offset value of the reference image at the N-th scale, the N-th scale being the largest scale among the N scales;
  • the time domain alignment module is used to perform multi-scale time domain alignment according to the offset value of the reference image at the Nth scale and the first feature information, to obtain the second feature information of the reference image at multiple scales ;
  • the quality enhancement module is configured to obtain an enhanced image of the current reconstructed image according to the first feature information of the current reconstructed image at multiple scales and the second feature information of the reference image at multiple scales.
  • the offset value prediction module includes N second prediction units;
  • the j-th second prediction unit is used to obtain the offset value of the reference image at the (j+1)-th scale according to the first feature information of the current reconstructed image and of the reference image at the j-th scale and the offset value of the reference image at the j-th scale, where j is a positive integer from 1 to N-1;
  • the N-th second prediction unit is used to obtain the offset value of the reference image at the N-th scale predicted by the N-th second prediction unit, according to the first feature information of the current reconstructed image and of the reference image at the N-th scale and the offset value of the reference image at the N-th scale predicted by the (N-1)-th second prediction unit.
  • when j-1 equals 0, the offset value of the reference image at the (j-1)-th scale is 0.
  • the first second prediction unit includes a first second prediction subunit and a first second upsampling subunit;
  • the first second prediction subunit is used to perform offset value prediction according to the first feature information of the current reconstructed image and of the reference image at the first scale, to obtain the offset value of the reference image at the first scale;
  • the first second upsampling subunit is configured to perform upsampling according to the offset value of the reference image at the first scale to obtain the offset value of the reference image at the second scale.
  • if the j-th second prediction unit is a second prediction unit other than the first second prediction unit among the N second prediction units, then the j-th second prediction unit includes a j-th second alignment subunit, a j-th second prediction subunit, and a j-th second upsampling subunit;
  • the j-th second alignment subunit is used to perform temporal feature alignment according to the first feature information of the current reconstructed image and of the reference image at the j-th scale and the offset value of the reference image at the j-th scale predicted by the (j-1)-th second prediction unit, to obtain the aligned feature information of the current reconstructed image and of the reference image at the j-th scale;
  • the j-th second prediction subunit is used to perform offset value prediction according to the aligned feature information of the current reconstructed image and of the reference image at the j-th scale, to obtain the offset value of the reference image at the j-th scale;
  • the j-th second upsampling subunit is used to upsample the sum of the offset value of the reference image output by the j-th second prediction subunit at the j-th scale and the offset value of the reference image at the j-th scale predicted by the (j-1)-th second prediction unit, to obtain the offset value of the reference image at the (j+1)-th scale.
  • the Nth second prediction unit includes an Nth second alignment subunit and an Nth second prediction subunit
  • the N-th second alignment subunit is used to perform temporal feature alignment according to the first feature information of the current reconstructed image and of the reference image at the N-th scale and the offset value of the reference image at the N-th scale predicted by the (N-1)-th second prediction unit, to obtain the aligned feature information of the current reconstructed image and of the reference image at the N-th scale;
  • the N-th second prediction subunit is used to perform offset value prediction according to the aligned feature information of the current reconstructed image and of the reference image at the N-th scale, to obtain the offset value of the reference image at the N-th scale predicted by the N-th second prediction subunit;
  • the offset value of the reference image at the N-th scale predicted by the N-th second prediction unit is determined by adding the offset value predicted by the N-th second prediction subunit at the N-th scale to the offset value of the reference image at the N-th scale predicted by the (N-1)-th second prediction unit.
  • the second prediction subunit is an offset value prediction network (OPN).
  • the second alignment subunit is a deformable convolution network (DCN).
  • the time domain alignment module includes K second time domain alignment units and K-1 second downsampling units, where K is a positive integer greater than 2;
  • the k-th second temporal alignment unit is used to obtain the second feature information of the reference image at the k-th scale according to the offset value and the first feature information of the reference image at the k-th scale;
  • the (k-1)-th second downsampling unit is used to downsample the offset value and the first feature information of the reference image at the k-th scale, to obtain the offset value and the first feature information of the reference image at the (k-1)-th scale;
  • the (k-1)-th second temporal alignment unit is used to obtain the second feature information of the reference image at the (k-1)-th scale according to the offset value and the first feature information of the reference image at the (k-1)-th scale, until k-1 equals 1.
  • the second temporal alignment unit is a deformable convolution network (DCN).
  • the second downsampling unit is an average pooling layer.
  • the quality enhancement module includes K second enhancement units and K-1 second upsampling units;
  • the (k+1)-th second enhancement unit is used to perform image quality enhancement according to the first feature information of the current reconstructed image at the (k+1)-th scale and the second feature information of the reference image at the (k+1)-th scale, to obtain the initial prediction value of the enhanced image of the current reconstructed image at the (k+1)-th scale, where k is a positive integer from 1 to K-1;
  • the k-th second upsampling unit is used to upsample the fusion value of the enhanced image of the current reconstructed image at the k-th scale, to obtain the upsampled value of the enhanced image of the current reconstructed image at the (k+1)-th scale; when k is 1, the fusion value of the enhanced image at the k-th scale is the initial prediction value of the enhanced image of the current reconstructed image at the first scale, obtained by the first second enhancement unit from the first feature information of the current reconstructed image at the first scale and the second feature information of the reference image at the first scale;
  • the fusion value of the enhanced image of the current reconstructed image at the (k+1)-th scale is determined by fusing the upsampled value and the initial prediction value of the enhanced image of the current reconstructed image at the (k+1)-th scale.
  • each second enhancement unit includes a plurality of convolutional layers, and the last convolutional layer among the plurality of convolutional layers does not include an activation function.
  • the offset value prediction module is configured to perform multi-scale prediction according to the first feature information of the current reconstructed image and of the reference image at the N scales, to obtain P groups of offset values of the reference image at the N-th scale, where P is a positive integer;
  • the temporal alignment module is used to divide the reference image into P image blocks and assign the P groups of offset values to the P image blocks one by one; for each image block, multi-scale temporal alignment is performed according to the block's group of offset values and its first feature information to obtain the multi-scale second feature information of the image block at the N-th scale, and the multi-scale second feature information of the reference image at the N-th scale is then obtained from the multi-scale second feature information of its image blocks.
  • the decoding unit 11 is further configured to decode the code stream to obtain first flag information, where the first flag information is used to indicate whether to use the quality enhancement network to perform quality enhancement on the currently reconstructed image;
  • when the first flag information indicates that the quality enhancement network is to be used, the M reference images of the current reconstructed image are acquired from the reconstructed images.
  • the first flag information is included in a sequence parameter set.
  • the obtaining unit 12 is specifically configured to obtain, from the reconstructed images, at least one image that is located forward and/or backward of the current reconstructed image in the playing order as the current reconstructed image. Reference image.
  • the current reconstructed image and the reference image are continuous in playback order.
  • the device embodiment and the method embodiment may correspond to each other, and similar descriptions may refer to the method embodiment. To avoid repetition, details are not repeated here.
  • the decoding device 10 shown in FIG. 14 may correspond to the corresponding subject in the image decoding method of the embodiments of the present application, and the foregoing and other operations and/or functions of the units in the decoding device 10 are intended to implement the corresponding processes in the image decoding method; for brevity, they are not repeated here.
  • FIG. 15 is a schematic block diagram of an image encoding device provided by an embodiment of the present application.
  • the image encoding device may be the encoder shown in FIG. 2 , or a component in the encoder, such as a processor in the encoder.
  • the image encoding device 20 may include:
  • a first acquiring unit 21 configured to acquire an image to be encoded
  • An encoding unit 22 configured to encode the image to be encoded to obtain a current reconstructed image of the image to be encoded
  • the second acquiring unit 23 is configured to acquire M reference images of the current reconstructed image from the reconstructed image, where M is a positive integer;
  • the enhancement unit 24 is configured to input the current reconstructed image and the M reference images into a quality enhancement network to obtain an enhanced image of the current reconstructed image.
  • the quality enhancement network includes a feature extraction module, an offset value prediction module, a temporal alignment module, and a quality enhancement module;
  • the feature extraction module is used to perform feature extraction at different scales on the current reconstructed image and the reference image, to obtain the first feature information of the current reconstructed image and of the reference image at N scales, where N is a positive integer greater than 1;
  • the offset value prediction module is used to perform multi-scale prediction on the first feature information of the current reconstructed image and of the reference image at the N scales, to obtain an offset value of the reference image;
  • the temporal alignment module is used to perform temporal alignment on the offset value of the reference image and the first feature information of the reference image, to obtain the second feature information of the reference image;
  • the quality enhancement module is used to predict an enhanced image of the current reconstructed image according to the second feature information of the reference image.
  • the temporal alignment module is configured to perform multi-scale temporal alignment according to the offset value of the reference image and the first characteristic information of the reference image, to obtain the first position of the reference image at multiple scales Two feature information.
  • the feature extraction module includes N first feature extraction units
  • the i-th first feature extraction unit is used to output the extracted first feature information of the first image at the (N-i+1)-th scale and to input that first feature information into the (i+1)-th first feature extraction unit, so that the (i+1)-th first feature extraction unit outputs the first feature information of the first image at the (N-i)-th scale, where i is a positive integer from 1 to N-1 and the first image is either the current reconstructed image or the reference image.
  • the offset value prediction module is configured to perform multi-scale prediction according to the first feature information of the current reconstructed image and of the reference image at the N scales, to obtain the offset values of the current reconstructed image and of the reference image at the N-th scale;
  • the temporal alignment module is used to perform multi-scale temporal alignment according to the offset value and the first feature information of the current reconstructed image at the N-th scale, to obtain the second feature information of the current reconstructed image at multiple scales, and to perform multi-scale temporal alignment according to the offset value and the first feature information of the reference image at the N-th scale, to obtain the second feature information of the reference image at multiple scales;
  • the quality enhancement module is configured to obtain an enhanced image of the current reconstructed image according to the second characteristic information of the current reconstructed image and the reference image at multiple scales respectively.
  • the offset value prediction module includes N first prediction units;
  • the j-th first prediction unit is used to obtain the offset values of the current reconstructed image and of the reference image at the (j+1)-th scale according to the first feature information of the current reconstructed image and of the reference image at the j-th scale and the offset values of the current reconstructed image and of the reference image at the j-th scale, where j is a positive integer from 1 to N-1;
  • the N-th first prediction unit is used to obtain the offset values of the current reconstructed image and of the reference image at the N-th scale predicted by the N-th first prediction unit, according to the first feature information of the current reconstructed image and of the reference image at the N-th scale and the offset values of the current reconstructed image and of the reference image at the N-th scale predicted by the (N-1)-th first prediction unit.
  • when j-1 equals 0, the offset values of the current reconstructed image and of the reference image at the (j-1)-th scale are 0.
  • the first first prediction unit includes a first first prediction subunit and a first first upsampling subunit;
  • the first first prediction subunit is used to perform offset value prediction according to the first feature information of the current reconstructed image and of the reference image at the first scale, to obtain the predicted offset values of the current reconstructed image and of the reference image at the first scale;
  • the first first upsampling subunit is used to upsample the offset values of the current reconstructed image and of the reference image at the first scale predicted by the first first prediction subunit, to obtain the offset values of the current reconstructed image and of the reference image at the second scale.
  • if the j-th first prediction unit is a first prediction unit other than the first first prediction unit among the N first prediction units, then the j-th first prediction unit includes a j-th first alignment subunit, a j-th first prediction subunit, and a j-th first upsampling subunit;
  • the j-th first alignment subunit is used to perform temporal feature alignment according to the first feature information of the current reconstructed image and of the reference image at the j-th scale and the offset values of the current reconstructed image and of the reference image at the j-th scale predicted by the (j-1)-th first prediction unit, to obtain the aligned feature information of the current reconstructed image and of the reference image at the j-th scale;
  • the j-th first prediction subunit is used to perform offset value prediction according to the aligned feature information of the current reconstructed image and of the reference image at the j-th scale, to obtain the offset values of the current reconstructed image and of the reference image at the j-th scale;
  • the j-th first upsampling subunit is used to upsample the sum of the offset values output by the j-th first prediction subunit and the offset values predicted by the (j-1)-th first prediction unit, both at the j-th scale, to obtain the offset values of the current reconstructed image and of the reference image at the (j+1)-th scale.
  • the Nth first prediction unit includes an Nth first alignment subunit and an Nth first prediction subunit
  • the N-th first alignment subunit is used to perform temporal alignment according to the first feature information of the current reconstructed image and of the reference image at the N-th scale and the offset values of the current reconstructed image and of the reference image at the N-th scale predicted by the (N-1)-th first prediction unit, to obtain the aligned feature information of the current reconstructed image and of the reference image at the N-th scale;
  • the N-th first prediction subunit is used to perform offset value prediction according to the aligned feature information of the current reconstructed image and of the reference image at the N-th scale, to obtain the predicted offset values of the current reconstructed image and of the reference image at the N-th scale;
  • the offset values of the current reconstructed image and of the reference image at the N-th scale predicted by the N-th first prediction unit are determined by adding the offset values predicted by the N-th first prediction subunit at the N-th scale to the offset values of the current reconstructed image and of the reference image at the N-th scale predicted by the (N-1)-th first prediction unit.
  • the first prediction subunit is an offset value prediction network (OPN).
  • the first alignment subunit is a deformable convolution network (DCN).
  • the time domain alignment module includes K first time domain alignment units and K-1 first downsampling units, where K is a positive integer greater than 2;
  • the k-th first temporal alignment unit is used to obtain the second feature information of the first image at the k-th scale according to the offset value and the first feature information of the first image at the k-th scale;
  • the (k-1)-th first downsampling unit is used to downsample the offset value and the first feature information of the first image at the k-th scale, to obtain the offset value and the first feature information of the first image at the (k-1)-th scale;
  • the (k-1)-th first temporal alignment unit is used to obtain the second feature information of the first image at the (k-1)-th scale according to the offset value and the first feature information of the first image at the (k-1)-th scale, until k-1 equals 1.
  • the first temporal alignment unit is a deformable convolution network (DCN).
  • the first downsampling unit is an average pooling layer.
  • the quality enhancement module includes K first enhancement units and K-1 first upsampling units;
  • the (k+1)-th first enhancement unit is used to perform image quality enhancement according to the second feature information of the current reconstructed image and of the reference image at the (k+1)-th scale, to obtain the initial prediction value of the enhanced image of the current reconstructed image at the (k+1)-th scale, where k is a positive integer from 1 to K-1;
  • the k-th first upsampling unit is used to upsample the fusion value of the enhanced image of the current reconstructed image at the k-th scale, to obtain the upsampled value of the enhanced image of the current reconstructed image at the (k+1)-th scale; when k is 1, the fusion value of the enhanced image at the k-th scale is the initial prediction value of the enhanced image at the first scale, obtained by the first first enhancement unit from the second feature information of the current reconstructed image and of the reference image at the first scale;
  • the fusion value of the enhanced image of the current reconstructed image at the (k+1)-th scale is determined by fusing the upsampled value and the initial prediction value of the enhanced image of the current reconstructed image at the (k+1)-th scale; the predicted value of the enhanced image of the current reconstructed image at the N-th scale is determined according to the fusion value of the enhanced image of the current reconstructed image at the K-th scale.
  • the first enhancement unit includes a plurality of convolutional layers, and a last convolutional layer of the plurality of convolutional layers does not include an activation function.
  • the offset value prediction module is configured to perform multi-scale prediction according to the first feature information of the current reconstructed image and the reference image at N scales, to obtain P groups of offset values of the current reconstructed image and of the reference image at the Nth scale, where P is a positive integer;
  • the temporal alignment module is used to divide the first image into P image blocks, assign the P groups of offset values to the P image blocks one by one, perform multi-scale temporal alignment for each image block according to its corresponding group of offset values and its first feature information to obtain the second feature information of the image block at multiple scales, and then obtain the multi-scale second feature information of the first image at the N scales from the multi-scale second feature information of the image blocks in the first image.
  • the offset value prediction module is configured to perform multi-scale prediction according to the first feature information of the current reconstructed image and the reference image at N scales, to obtain the offset value of the reference image at the Nth scale, the Nth scale being the largest of the N scales;
  • the temporal alignment module is used to perform multi-scale temporal alignment according to the offset value of the reference image at the Nth scale and the first feature information, to obtain the second feature information of the reference image at multiple scales;
  • the quality enhancement module is configured to obtain an enhanced image of the current reconstructed image according to the first feature information of the current reconstructed image at multiple scales and the second feature information of the reference image at multiple scales.
  • the offset value prediction module includes N second prediction units;
  • the jth second prediction unit is used to obtain the offset value of the reference image at the (j+1)th scale according to the first feature information of the current reconstructed image and the reference image at the jth scale and the offset value of the reference image at the jth scale, where j is a positive integer from 1 to N-1;
  • the Nth second prediction unit is used to obtain the offset value of the reference image at the Nth scale predicted by the Nth second prediction unit, according to the first feature information of the current reconstructed image and the reference image at the Nth scale and the offset value of the reference image at the Nth scale predicted by the (N-1)th second prediction unit.
  • when the jth second prediction unit is the first of the N second prediction units, the offset value of the reference image at the (j-1)th scale is 0.
  • the first second prediction unit includes a first second prediction subunit and a first second upsampling subunit;
  • the first second prediction subunit is used to perform offset value prediction according to the first feature information of the current reconstructed image and the reference image at the first scale, to obtain the offset value of the reference image at the first scale;
  • the first second upsampling subunit is configured to perform upsampling according to the offset value of the reference image at the first scale to obtain the offset value of the reference image at the second scale.
  • when the jth second prediction unit is a second prediction unit other than the first among the N second prediction units, the jth second prediction unit includes a jth second alignment subunit, a jth second prediction subunit and a jth second upsampling subunit;
  • the jth second alignment subunit is used to perform temporal feature alignment according to the first feature information of the current reconstructed image and the reference image at the jth scale and the offset value of the reference image at the jth scale predicted by the (j-1)th second prediction unit, to obtain feature information of the current reconstructed image and the reference image aligned at the jth scale;
  • the jth second prediction subunit is used to perform offset value prediction according to the feature information of the current reconstructed image and the reference image aligned at the jth scale, to obtain the offset value of the reference image at the jth scale;
  • the jth second upsampling subunit is used to upsample the sum of the offset value of the reference image at the jth scale output by the jth second prediction subunit and the offset value of the reference image at the jth scale predicted by the (j-1)th second prediction unit, to obtain the offset value of the reference image at the (j+1)th scale.
  • the Nth second prediction unit includes an Nth second alignment subunit and an Nth second prediction subunit;
  • the Nth second alignment subunit is used to perform temporal feature alignment according to the first feature information of the current reconstructed image and the reference image at the Nth scale and the offset value of the reference image at the Nth scale predicted by the (N-1)th second prediction unit, to obtain feature information of the current reconstructed image and the reference image aligned at the Nth scale;
  • the Nth second prediction subunit is used to perform offset value prediction according to the feature information of the current reconstructed image and the reference image aligned at the Nth scale, to obtain the offset value of the reference image at the Nth scale predicted by the Nth second prediction subunit;
  • the offset value of the reference image at the Nth scale predicted by the Nth second prediction unit is determined by adding the offset value predicted by the Nth second prediction subunit to the offset value of the reference image at the Nth scale predicted by the (N-1)th second prediction unit.
  • the second prediction subunit is an offset prediction network (OPN).
  • the second alignment subunit is a deformable convolution (DCN).
  • the time domain alignment module includes K second time domain alignment units and K-1 second downsampling units, where K is a positive integer greater than 2;
  • the kth second temporal alignment unit is used to obtain the second feature information of the reference image at the kth scale according to the offset value and the first feature information of the reference image at the kth scale;
  • the (k-1)th second down-sampling unit is used to perform down-sampling according to the offset value and the first feature information of the reference image at the kth scale, to obtain the offset value and the first feature information of the reference image at the (k-1)th scale;
  • the (k-1)th second temporal alignment unit is used to obtain the second feature information of the reference image at the (k-1)th scale according to the offset value and the first feature information of the reference image at the (k-1)th scale, until k-1 equals 1.
  • the second temporal alignment unit is a deformable convolution (DCN).
  • the second downsampling unit is an average pooling layer.
  • the quality enhancement module includes K second enhancement units and K-1 second upsampling units;
  • the (k+1)th second enhancement unit is used to perform image quality enhancement according to the first feature information of the current reconstructed image at the (k+1)th scale and the second feature information of the reference image at the (k+1)th scale, to obtain the initial predicted value of the enhanced image of the current reconstructed image at the (k+1)th scale, where k is a positive integer from 1 to K-1;
  • the kth second upsampling unit is used to perform upsampling according to the fusion value of the enhanced image of the current reconstructed image at the kth scale, to obtain the upsampled value of the enhanced image of the current reconstructed image at the (k+1)th scale;
  • when k is 1, the fusion value of the enhanced image of the current reconstructed image at the first scale is the initial predicted value of the enhanced image at the first scale obtained by the first second enhancement unit according to the first feature information of the current reconstructed image at the first scale and the second feature information of the reference image at the first scale;
  • the fusion value of the enhanced image of the current reconstructed image at the (k+1)th scale is determined by fusing the upsampled value and the initial predicted value of the enhanced image at the (k+1)th scale.
  • the second enhancement unit includes a plurality of convolutional layers, and the last of the plurality of convolutional layers does not include an activation function.
  • the offset value prediction module is configured to perform multi-scale prediction according to the first feature information of the current reconstructed image and the reference image at N scales, to obtain P groups of offset values of the reference image at the Nth scale, where P is a positive integer;
  • the temporal alignment module is used to divide the reference image into P image blocks, assign the P groups of offset values to the P image blocks one by one, perform multi-scale temporal alignment for each image block according to its corresponding group of offset values and its first feature information to obtain the multi-scale second feature information of the image block, and then obtain the multi-scale second feature information of the reference image at the Nth scale from the multi-scale second feature information of the image blocks in the reference image.
  • the second acquiring unit 23 is further configured to acquire first flag information, where the first flag information is used to indicate whether to use the quality enhancement network to perform quality enhancement on the reconstructed image; and, when the first flag information indicates that the quality enhancement network is used to enhance the quality of the current reconstructed image, to acquire M reference images of the current reconstructed image from the already reconstructed images.
  • the first flag information is included in a sequence parameter set.
  • the second obtaining unit 23 is specifically configured to obtain, from the already reconstructed images, at least one image located forward and/or backward of the current reconstructed image in playback order as a reference image of the current reconstructed image (see the sketch below).
  • the current reconstructed image and its M reference images are consecutive images in playback order.
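As a rough sketch of how such reference acquisition might look in practice (a hypothetical helper over a picture buffer keyed by playback order; nothing below is mandated by the application):

```python
from typing import Dict, List
import torch

def get_reference_pictures(buffer: Dict[int, torch.Tensor], t: int, r: int) -> List[torch.Tensor]:
    # Up to r reconstructed pictures before and r after picture t, consecutive in playback order.
    return [buffer[i] for i in range(t - r, t + r + 1) if i != t and i in buffer]
```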
  • the device embodiment and the method embodiment may correspond to each other, and similar descriptions may refer to the method embodiment. To avoid repetition, details are not repeated here.
  • the encoding device 20 shown in FIG. 15 may correspond to the subject that performs the image encoding method of the embodiment of the present application, and the foregoing and other operations and/or functions of the units in the encoding device 20 are respectively for realizing the corresponding processes in the image encoding method; for brevity, they are not repeated here.
  • Fig. 16 is a schematic block diagram of an image processing apparatus provided by an embodiment of the present application.
  • the image processing apparatus may be an image processing device, such as a video acquisition device or a video playback device.
  • the image processing device 50 may include:
  • An acquisition unit 51 configured to acquire a target image to be enhanced, and M reference images of the target image, where M is a positive integer;
  • the enhancement unit 52 is configured to input the target image and the M reference images into a quality enhancement network to obtain an enhanced image of the target image.
  • the quality enhancement network includes a feature extraction module, an offset value prediction module, a temporal alignment module and a quality enhancement module;
  • the feature extraction module is used to perform feature extraction at different scales on the target image and the reference image, to obtain the first feature information of the target image and the reference image at N scales, where N is a positive integer greater than 1; the offset value prediction module is used to perform multi-scale prediction according to the first feature information of the target image and the reference image at the N scales, to obtain the offset value of the reference image; the temporal alignment module is used to perform temporal alignment according to the offset value of the reference image and the first feature information of the reference image, to obtain the second feature information of the reference image;
  • the quality enhancement module is used to predict the enhanced image of the target image according to the second feature information of the reference image.
  • the device embodiment and the method embodiment may correspond to each other, and similar descriptions may refer to the method embodiment. To avoid repetition, details are not repeated here.
  • the image processing device 50 shown in FIG. 16 may correspond to the subject that performs the image processing method of the embodiment of the present application, and the foregoing and other operations and/or functions of the units in the image processing device 50 are respectively for realizing the corresponding processes in the image processing method; for brevity, they are not repeated here.
  • Fig. 17 is a schematic block diagram of a model training device provided by an embodiment of the present application.
  • the model training device may be a computing device, or a processor in the computing device.
  • the model training device 40 is used to train the quality enhancement network, and the quality enhancement network includes a feature extraction module, an offset value prediction module, a time domain alignment module and a quality enhancement module, and the model training device 40 may include:
  • An acquisition unit 41 configured to acquire M+1 images, the M+1 images including the image to be enhanced and M reference images of the image to be enhanced, where M is a positive integer;
  • the feature extraction unit 42 is used to input the image to be enhanced and its M reference images into the feature extraction module to perform feature extraction at different scales, to obtain the first feature information of the image to be enhanced and the reference image at N scales, where N is a positive integer greater than 1;
  • the offset value prediction unit 43 is configured to perform multi-scale prediction through the offset value prediction module according to the first feature information of the image to be enhanced and the reference image at N scales respectively, to obtain the offset value of the reference image;
  • the temporal alignment unit 44 is configured to perform temporal alignment in the temporal alignment module according to the offset value of the reference image and the first characteristic information of the reference image to obtain second characteristic information of the reference image;
  • the enhancement unit 45 is configured to obtain the predicted value of the enhanced image of the image to be enhanced through the quality enhancement module according to the second characteristic information of the reference image;
  • the training unit 46 is configured to train the quality enhancement network according to the predicted value of the enhanced image of the image to be enhanced and the real value of the enhanced image of the image to be enhanced.
  • the device embodiment and the method embodiment may correspond to each other, and similar descriptions may refer to the method embodiment. To avoid repetition, details are not repeated here.
  • the model training device 40 shown in FIG. 17 may correspond to the subject that performs the model training method of the embodiment of the present application, and the foregoing and other operations and/or functions of the units in the model training device 40 are respectively for realizing the corresponding processes in the model training method; for brevity, they are not repeated here.
  • the functional unit may be implemented in the form of hardware, may also be implemented by instructions in the form of software, and may also be implemented by a combination of hardware and software units.
  • each step of the method embodiments in the embodiments of the present application may be completed by an integrated logic circuit of hardware in the processor and/or by instructions in the form of software; the steps of the methods disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of the hardware and software units in a decoding processor.
  • the software unit may be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, and registers.
  • the storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps in the above method embodiments in combination with its hardware.
  • Fig. 18 is a schematic block diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device 30 may be the image processing device described in the embodiment of the present application, or a decoder, or an encoder, or a model training device, and the electronic device 30 may include:
  • a memory 33 and a processor 32, where the memory 33 is used to store a computer program 34 and transmit the program code to the processor 32.
  • the processor 32 can call and run the computer program 34 from the memory 33 to implement the method in the embodiment of the present application.
  • the processor 32 can be used to execute the steps in the above-mentioned method 200 according to the instructions in the computer program 34 .
  • the processor 32 may include, but is not limited to:
  • a Digital Signal Processor (DSP);
  • an Application Specific Integrated Circuit (ASIC);
  • a Field Programmable Gate Array (FPGA).
  • the memory 33 includes but is not limited to:
  • non-volatile memory, which can be Read-Only Memory (ROM), Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM) or Flash memory.
  • the volatile memory can be Random Access Memory (RAM), which acts as external cache memory.
  • By way of example but not limitation, many forms of RAM are available, such as:
  • Static Random Access Memory (SRAM);
  • Dynamic Random Access Memory (DRAM);
  • Synchronous Dynamic Random Access Memory (SDRAM);
  • Double Data Rate SDRAM (DDR SDRAM);
  • Enhanced SDRAM (ESDRAM);
  • Synchlink Dynamic Random Access Memory (SLDRAM);
  • Direct Rambus RAM (DR RAM).
  • the computer program 34 can be divided into one or more units, and the one or more units are stored in the memory 33 and executed by the processor 32 to complete the methods provided in the present application.
  • the one or more units may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 34 in the electronic device 30 .
  • the electronic device 30 may also include:
  • a transceiver 33, where the transceiver 33 can be connected to the processor 32 or the memory 33.
  • the processor 32 can control the transceiver 33 to communicate with other devices, specifically, can send information or data to other devices, or receive information or data sent by other devices.
  • Transceiver 33 may include a transmitter and a receiver.
  • the transceiver 33 may further include antennas, and the number of antennas may be one or more.
  • the bus system includes not only a data bus, but also a power bus, a control bus and a status signal bus.
  • the present application also provides a computer storage medium, on which a computer program is stored, and when the computer program is executed by a computer, the computer can execute the methods of the above method embodiments.
  • the embodiments of the present application further provide a computer program product including instructions, and when the instructions are executed by a computer, the computer executes the methods of the foregoing method embodiments.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center by wire (such as coaxial cable, optical fiber or digital subscriber line (DSL)) or wirelessly (such as infrared, radio or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, hard disk or magnetic tape), an optical medium (such as a digital video disc (DVD)), or a semiconductor medium (such as a solid state disk (SSD)), etc.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • a unit described as a separate component may or may not be physically separated, and a component displayed as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Processing (AREA)

Abstract

The present application provides an image encoding/decoding and processing method, apparatus and device. The method includes: obtaining M reference images of a current reconstructed image from already reconstructed images; inputting the current reconstructed image and the M reference images into a quality enhancement network, so that the network performs feature extraction at different scales to obtain first feature information of the current reconstructed image and of the reference images at N scales, performs multi-scale prediction according to the first feature information of the current reconstructed image and the M reference images at the N scales to obtain offset values of the reference images, performs temporal alignment according to the offset values and the first feature information of the reference images to obtain second feature information of the reference images, and finally predicts an enhanced image of the current reconstructed image according to the second feature information of the reference images, achieving significant image enhancement.

Description

Image encoding/decoding and processing method, apparatus and device — Technical Field
The present application relates to the technical field of image processing, and in particular to an image encoding/decoding and processing method, apparatus and device.
Background
With the development of image processing technology, users have increasingly high requirements on video quality, while high-quality video places high demands on acquisition equipment, data transmission and data storage. To balance these costs, video production equipment captures a low-quality video stream and transmits it to video playback equipment; the playback equipment processes the low-quality video and generates high-quality video for playback.
At present, video quality is improved by filtering; for example, in video codec technology, the decoding end filters the decoded reconstructed image before playback. However, filtering cannot significantly improve video quality.
Summary
Embodiments of the present application provide an image encoding/decoding and processing method, apparatus and device, so as to significantly improve the image enhancement effect.
In a first aspect, an embodiment of the present application provides an image decoding method, including:
decoding a bitstream to obtain a current reconstructed image;
obtaining M reference images of the current reconstructed image from already reconstructed images, where M is a positive integer;
inputting the current reconstructed image and the M reference images into a quality enhancement network to obtain an enhanced image of the current reconstructed image.
In a second aspect, the present application provides an image encoding method, including:
obtaining an image to be encoded;
encoding the image to be encoded to obtain a current reconstructed image of the image to be encoded;
obtaining M reference images of the current reconstructed image from already reconstructed images, where M is a positive integer;
inputting the current reconstructed image and the M reference images into a quality enhancement network to obtain an enhanced image of the current reconstructed image.
In a third aspect, the present application provides an image processing method, including:
obtaining a target image to be enhanced and M reference images of the target image, where M is a positive integer;
inputting the target image and the M reference images into a quality enhancement network to obtain an enhanced image of the target image.
In a fourth aspect, the present application provides a model training method for training a quality enhancement network, the quality enhancement network including a feature extraction module, an offset value prediction module, a temporal alignment module and a quality enhancement module, the method including:
obtaining an image to be enhanced and M reference images of the image to be enhanced, where M is a positive integer;
inputting the image to be enhanced and the M reference images into the feature extraction module for feature extraction at different scales, to obtain first feature information of the image to be enhanced and of the reference images at N scales, where N is a positive integer greater than 1;
performing multi-scale prediction through the offset value prediction module according to the first feature information of the image to be enhanced and the reference images at the N scales, to obtain offset values of the reference images;
performing temporal alignment through the temporal alignment module according to the offset values of the reference images and the first feature information of the reference images, to obtain second feature information of the reference images;
obtaining a predicted value of the enhanced image of the image to be enhanced through the quality enhancement module according to the second feature information of the reference images;
training the quality enhancement network according to the predicted value of the enhanced image of the image to be enhanced and the true value of the enhanced image of the image to be enhanced.
In a fifth aspect, an image decoding apparatus is provided for performing the method in the first aspect or its implementations; specifically, the apparatus includes functional units for performing that method.
In a sixth aspect, a decoder is provided, including a processor and a memory, where the memory is used to store a computer program and the processor is used to call and run the computer program stored in the memory, to perform the method in the first aspect or its implementations.
In a seventh aspect, an image encoding apparatus is provided for performing the method in the second aspect or its implementations; specifically, the apparatus includes functional units for performing that method.
In an eighth aspect, an encoder is provided, including a processor and a memory, where the memory is used to store a computer program and the processor is used to call and run the computer program stored in the memory, to perform the method in the second aspect or its implementations.
In a ninth aspect, an image processing apparatus is provided for performing the method in the third aspect or its implementations; specifically, the apparatus includes functional units for performing that method.
In a tenth aspect, an image processing device is provided, including a processor and a memory, where the memory is used to store a computer program and the processor is used to call and run the computer program stored in the memory, to perform the method in the third aspect or its implementations.
In an eleventh aspect, a model training apparatus is provided for performing the method in the fourth aspect or its implementations; specifically, the apparatus includes functional units for performing that method.
In a twelfth aspect, a model training device is provided, including a processor and a memory, where the memory is used to store a computer program and the processor is used to call and run the computer program stored in the memory, to perform the method in the fourth aspect or its implementations.
In a thirteenth aspect, a chip is provided for implementing the method in any one of the first to fourth aspects or their implementations; specifically, the chip includes a processor for calling and running a computer program from a memory, so that a device on which the chip is installed performs the method in any one of the first to fourth aspects or their implementations.
In a fourteenth aspect, a computer-readable storage medium is provided for storing a computer program that causes a computer to perform the method in any one of the first to fourth aspects or their implementations.
In a fifteenth aspect, a computer program product is provided, including computer program instructions that cause a computer to perform the method in any one of the first to fourth aspects or their implementations.
In a sixteenth aspect, a computer program is provided which, when run on a computer, causes the computer to perform the method in any one of the first to fourth aspects or their implementations.
Based on the above technical solutions, a bitstream is decoded to obtain a current reconstructed image; M reference images of the current reconstructed image are obtained from already reconstructed images; the current reconstructed image and the M reference images are input into a quality enhancement network, so that the network performs feature extraction at different scales to obtain first feature information of the current reconstructed image and the reference images at N scales, performs multi-scale prediction according to the first feature information of the current reconstructed image and the M reference images at the N scales to obtain offset values of the reference images, performs temporal alignment according to the offset values and the first feature information of the reference images to obtain second feature information of the reference images, and finally predicts the enhanced image of the current reconstructed image according to the second feature information of the reference images, achieving significant image enhancement.
Brief Description of the Drawings
Fig. 1 is a schematic block diagram of a video codec system involved in an embodiment of the present application;
Fig. 2 is a schematic block diagram of a video encoder provided by an embodiment of the present application;
Fig. 3 is a schematic block diagram of a video decoder provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of a principle of an embodiment of the present application;
Fig. 5 is a schematic flowchart of a quality enhancement network training method provided by an embodiment of the present application;
Fig. 6 is a network schematic diagram of a quality enhancement network involved in an embodiment of the present application;
Fig. 7 is a schematic flowchart of a training method for a quality enhancement network provided by an embodiment of the present application;
Fig. 8A is a network schematic diagram of a feature extraction module involved in an embodiment of the present application;
Fig. 8B is a network schematic diagram of a feature extraction module involved in an embodiment of the present application;
Fig. 8C is a network schematic diagram of an offset value prediction module involved in an embodiment of the present application;
Fig. 8D is a network schematic diagram of an offset value prediction module involved in an embodiment of the present application;
Fig. 8E is a network schematic diagram of a temporal alignment module involved in an embodiment of the present application;
Fig. 8F is a network schematic diagram of a quality enhancement module involved in an embodiment of the present application;
Fig. 8G is a network schematic diagram of a quality enhancement network involved in an embodiment of the present application;
Fig. 9 is a schematic flowchart of a training method for a quality enhancement network provided by an embodiment of the present application;
Fig. 10A is a network schematic diagram of an offset value prediction module involved in an embodiment of the present application;
Fig. 10B is a network schematic diagram of an offset value prediction module involved in an embodiment of the present application;
Fig. 10C is a network schematic diagram of a temporal alignment module involved in an embodiment of the present application;
Fig. 10D is a network schematic diagram of a quality enhancement module involved in an embodiment of the present application;
Fig. 11 is a schematic flowchart of an image decoding method provided by an embodiment of the present application;
Fig. 12 is a schematic flowchart of an image encoding method provided by an embodiment of the present application;
Fig. 13 is a schematic flowchart of an image processing method provided by an embodiment of the present application;
Fig. 14 is a schematic block diagram of an image decoding apparatus provided by an embodiment of the present application;
Fig. 15 is a schematic block diagram of an image encoding apparatus provided by an embodiment of the present application;
Fig. 16 is a schematic block diagram of an image processing apparatus provided by an embodiment of the present application;
Fig. 17 is a schematic block diagram of a model training apparatus provided by an embodiment of the present application;
Fig. 18 is a schematic block diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
The present application can be applied to the technical field of point cloud upsampling, for example to the technical field of point cloud compression.
The present application can be applied to the fields of image codecs, video codecs, hardware video codecs, dedicated-circuit video codecs, real-time video codecs, etc. For example, the solution of the present application may be combined with audio video coding standards (AVS), such as the H.264/audio video coding (AVC) standard, the H.265/high efficiency video coding (HEVC) standard and the H.266/versatile video coding (VVC) standard. Alternatively, the solution of the present application may operate in combination with other proprietary or industry standards, including ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), including the scalable video codec (SVC) and multi-view video codec (MVC) extensions. It should be understood that the technology of the present application is not limited to any particular codec standard or technology.
For ease of understanding, the video codec system involved in the embodiments of the present application is first introduced with reference to Fig. 1.
Fig. 1 is a schematic block diagram of a video codec system involved in an embodiment of the present application. It should be noted that Fig. 1 is only an example, and the video codec system of the embodiments of the present application includes but is not limited to what is shown in Fig. 1. As shown in Fig. 1, the video codec system 100 includes an encoding device 110 and a decoding device 120. The encoding device is used to encode (which can be understood as compressing) video data to generate a bitstream and transmit the bitstream to the decoding device; the decoding device decodes the bitstream generated by the encoding device to obtain decoded video data.
The encoding device 110 of the embodiments of the present application can be understood as a device with a video encoding function, and the decoding device 120 as a device with a video decoding function; that is, the embodiments of the present application cover a wide range of devices for the encoding device 110 and the decoding device 120, including, for example, smartphones, desktop computers, mobile computing devices, notebook (e.g., laptop) computers, tablet computers, set-top boxes, televisions, cameras, display devices, digital media players, video game consoles, in-vehicle computers, and the like.
In some embodiments, the encoding device 110 may transmit the encoded video data (such as a bitstream) to the decoding device 120 via a channel 130. The channel 130 may include one or more media and/or devices capable of transmitting the encoded video data from the encoding device 110 to the decoding device 120.
In one example, the channel 130 includes one or more communication media that enable the encoding device 110 to transmit the encoded video data directly to the decoding device 120 in real time. In this example, the encoding device 110 may modulate the encoded video data according to a communication standard and transmit the modulated video data to the decoding device 120. The communication media include wireless communication media, such as the radio frequency spectrum; optionally, the communication media may also include wired communication media, such as one or more physical transmission lines.
In another example, the channel 130 includes a storage medium that can store the video data encoded by the encoding device 110. Storage media include a variety of locally accessible data storage media, such as optical discs, DVDs, flash memory, and the like. In this example, the decoding device 120 may obtain the encoded video data from the storage medium.
In another example, the channel 130 may include a storage server that can store the video data encoded by the encoding device 110. In this example, the decoding device 120 may download the stored encoded video data from the storage server. Optionally, the storage server may store the encoded video data and transmit it to the decoding device 120, for example a web server (e.g., for a website), a file transfer protocol (FTP) server, and the like.
In some embodiments, the encoding device 110 includes a video encoder 112 and an output interface 113, where the output interface 113 may include a modulator/demodulator (modem) and/or a transmitter.
In some embodiments, in addition to the video encoder 112 and the output interface 113, the encoding device 110 may also include a video source 111.
The video source 111 may include at least one of a video acquisition device (for example, a video camera), a video archive, a video input interface, or a computer graphics system, where the video input interface is used to receive video data from a video content provider and the computer graphics system is used to generate video data.
The video encoder 112 encodes the video data from the video source 111 to generate a bitstream. Video data may include one or more pictures or a sequence of pictures. The bitstream contains the encoding information of a picture or a sequence of pictures in the form of a bit stream. The encoding information may include encoded picture data and associated data; the associated data may include a sequence parameter set (SPS), a picture parameter set (PPS) and other syntax structures. An SPS may contain parameters applied to one or more sequences; a PPS may contain parameters applied to one or more pictures. A syntax structure refers to a set of zero or more syntax elements arranged in a specified order in the bitstream.
The video encoder 112 transmits the encoded video data directly to the decoding device 120 via the output interface 113; the encoded video data may also be stored on a storage medium or storage server for subsequent reading by the decoding device 120.
In some embodiments, the decoding device 120 includes an input interface 121 and a video decoder 122.
In some embodiments, in addition to the input interface 121 and the video decoder 122, the decoding device 120 may also include a display device 123.
The input interface 121 includes a receiver and/or a modem, and may receive the encoded video data through the channel 130.
The video decoder 122 is used to decode the encoded video data to obtain decoded video data, and to transmit the decoded video data to the display device 123.
The display device 123 displays the decoded video data. The display device 123 may be integrated with the decoding device 120 or external to it, and may include a variety of display devices, such as a liquid crystal display (LCD), a plasma display, an organic light-emitting diode (OLED) display, or other types of display devices.
In addition, Fig. 1 is only an example, and the technical solutions of the embodiments of the present application are not limited to Fig. 1; for example, the technology of the present application may also be applied to one-sided video encoding or one-sided video decoding.
The video encoder involved in the embodiments of the present application is introduced below.
Fig. 2 is a schematic block diagram of the video encoder provided by an embodiment of the present application. It should be understood that the video encoder 200 can be used for lossy compression of images as well as for lossless compression, where the lossless compression may be visually lossless compression or mathematically lossless compression.
The video encoder 200 can be applied to image data in luminance-chrominance (YCbCr, YUV) format. For example, the YUV ratio can be 4:2:0, 4:2:2 or 4:4:4, where Y denotes luminance (Luma), Cb (U) denotes blue chrominance, Cr (V) denotes red chrominance, and U and V denote chrominance (Chroma), used to describe color and saturation. In terms of color format, 4:2:0 means every 4 pixels have 4 luminance components and 2 chrominance components (YYYYCbCr), 4:2:2 means every 4 pixels have 4 luminance components and 4 chrominance components (YYYYCbCrCbCr), and 4:4:4 means full-pixel display (YYYYCbCrCbCrCbCrCbCr).
For example, the video encoder 200 reads video data and, for each picture, divides the picture into several coding tree units (CTUs), also called "largest coding units" (LCUs) or "coding tree blocks" (CTBs). Each CTU may be associated with a pixel block of equal size within the picture. Each pixel may correspond to one luminance (luma) sample and two chrominance (chroma) samples, so each CTU may be associated with one luma sample block and two chroma sample blocks. A CTU size is, for example, 128×128, 64×64, 32×32, etc. A CTU can be further divided into several coding units (CUs) for coding; a CU can be a rectangular or square block. A CU can be further divided into prediction units (PUs) and transform units (TUs), decoupling coding, prediction and transform and making processing more flexible. In one example, CTUs are divided into CUs in a quadtree manner, and CUs are divided into TUs and PUs in a quadtree manner.
The video encoder and video decoder can support various PU sizes. Assuming the size of a particular CU is 2N×2N, video encoders and decoders may support PU sizes of 2N×2N or N×N for intra prediction, and 2N×2N, 2N×N, N×2N, N×N or similarly sized symmetric PUs for inter prediction; they may also support asymmetric PUs of 2N×nU, 2N×nD, nL×2N and nR×2N for inter prediction.
In some embodiments, as shown in Fig. 2, the video encoder 200 may include: a prediction unit 210, a residual unit 220, a transform/quantization unit 230, an inverse transform/quantization unit 240, a reconstruction unit 250, a loop filtering unit 260, a decoded picture buffer 270 and an entropy coding unit 280. It should be noted that the video encoder 200 may contain more, fewer or different functional components.
Optionally, in the present application, the current block may be called the current coding unit (CU) or current prediction unit (PU), etc. A prediction block may also be called a predicted block to be encoded or an image prediction block, and a reconstructed block to be encoded may also be called a reconstruction block or an image reconstructed block to be encoded.
In some embodiments, the prediction unit 210 includes an inter prediction unit 211 and an intra prediction unit 212. Because there is a strong correlation between adjacent pixels within one frame of video, intra prediction is used in video codec technology to eliminate spatial redundancy between adjacent pixels; because there is a strong similarity between adjacent frames, inter prediction is used to eliminate temporal redundancy between adjacent frames, thereby improving coding efficiency.
The inter prediction unit 211 can be used for inter prediction, which can refer to image information of different frames: motion information is used to find a reference block in a reference frame, and a prediction block is generated from the reference block to eliminate temporal redundancy. Frames used for inter prediction can be P frames (forward-predicted frames) and/or B frames (bi-directionally predicted frames). The motion information includes the reference frame list in which the reference frame is located, a reference frame index, and a motion vector. The motion vector can be of integer-pixel or sub-pixel precision; if it is sub-pixel, interpolation filtering in the reference frame is needed to produce the required sub-pixel block. Here, the integer-pixel or sub-pixel block found in the reference frame according to the motion vector is called the reference block. Some techniques use the reference block directly as the prediction block, while others further process the reference block to generate the prediction block, which can also be understood as taking the reference block as a prediction block and then generating a new prediction block on that basis.
The most commonly used inter prediction methods at present include the geometric partitioning mode (GPM) in the VVC video codec standard and angular weighted prediction (AWP) in the AVS3 video codec standard. These two prediction modes share common principles.
The intra prediction unit 212 refers only to the information of the same picture to predict the pixel information of the current block to be encoded, eliminating spatial redundancy. Frames used for intra prediction can be I frames.
In some embodiments, intra prediction methods also include the multiple reference line (MRL) intra prediction method, which can use more reference pixels to improve coding efficiency.
Intra prediction has multiple prediction modes; H.264 has 9 modes for intra prediction of 4×4 blocks. Mode 0 copies the pixels above the current block vertically into the current block as predicted values; mode 1 copies the reference pixels on the left horizontally into the current block; mode 2 (DC) uses the average of the 8 points A–D and I–L as the predicted value of all points; modes 3 to 8 copy the reference pixels to the corresponding positions of the current block at certain angles. Because some positions of the current block cannot correspond exactly to a reference pixel, a weighted average of the reference pixels, i.e., interpolated sub-pixels of the reference pixels, may be needed.
HEVC uses the Planar mode, DC and 33 angular modes, 35 prediction modes in total; VVC uses Planar, DC and 65 angular modes, 67 in total; AVS3 uses DC, Plane, Bilinear and 63 angular modes, 66 in total.
It should be noted that as the number of angular modes increases, intra prediction becomes more precise, better matching the needs of high-definition and ultra-high-definition digital video.
The residual unit 220 may generate a residual block of the CU based on the pixel block of the CU and the prediction blocks of the PUs of the CU; for example, each sample of the residual block has a value equal to the difference between a sample in the CU's pixel block and the corresponding sample in the prediction block of the CU's PU.
The transform/quantization unit 230 may quantize the transform coefficients, based on a quantization parameter (QP) value associated with the CU, quantizing the transform coefficients associated with the CU's TUs. The video encoder 200 may adjust the degree of quantization applied to the transform coefficients associated with the CU by adjusting the QP value associated with the CU.
The inverse transform/quantization unit 240 may apply inverse quantization and inverse transform, respectively, to the quantized transform coefficients, to reconstruct a residual block from the quantized transform coefficients.
The reconstruction unit 250 may add samples of the reconstructed residual block to corresponding samples of one or more prediction blocks generated by the prediction unit 210, to produce the reconstructed block associated with the TU. By reconstructing the sample block of each TU of the CU in this way, the video encoder 200 can reconstruct the pixel block of the CU.
The loop filtering unit 260 may perform deblocking filtering operations to reduce blocking artifacts of the pixel block associated with the CU.
In some embodiments, the loop filtering unit 260 includes a deblocking filtering unit, a sample adaptive offset (SAO) unit and an adaptive loop filtering (ALF) unit.
The decoded picture buffer 270 may store the reconstructed pixel blocks. The inter prediction unit 211 may use reference pictures containing reconstructed pixel blocks to perform inter prediction on PUs of other pictures, and the intra prediction unit 212 may use the reconstructed pixel blocks in the decoded picture buffer 270 to perform intra prediction on other PUs in the same picture as the CU.
The entropy coding unit 280 may receive the quantized transform coefficients from the transform/quantization unit 230 and may perform one or more entropy coding operations on them to generate entropy-coded data.
The basic video encoding flow involved in the present application is as follows: at the encoding end, the current picture is divided into blocks; for the current block, the prediction unit 210 uses intra or inter prediction to generate the prediction block of the current block. The residual unit 220 may compute a residual block based on the prediction block and the original block of the current block, i.e., their difference, also called residual information. Through transform and quantization by the transform/quantization unit 230, information insensitive to the human eye can be removed from the residual block, eliminating visual redundancy. Optionally, the residual block before transform and quantization may be called a time-domain residual block, and after transform and quantization a frequency or frequency-domain residual block. The entropy coding unit 280 receives the quantized transform coefficients output by the transform/quantization unit 230, entropy-codes them and outputs a bitstream; for example, the entropy coding unit 280 can eliminate character redundancy according to a target context model and the probability information of the binary bitstream.
In addition, the video encoder inversely quantizes and inversely transforms the quantized transform coefficients output by the transform/quantization unit 230 to obtain the residual block of the current block, and adds it to the prediction block of the current block to obtain the reconstructed block of the current block. As encoding proceeds, reconstructed blocks corresponding to other blocks of the current picture are obtained and stitched together, producing the reconstructed picture of the current picture. Because errors are introduced during encoding, the reconstructed picture is filtered to reduce them, for example with ALF, reducing the difference between the pixel values of the reconstructed picture and the original pixel values of the current picture. The filtered reconstructed picture is stored in the decoded picture buffer 270 and may serve as a reference frame for inter prediction of subsequent frames.
It should be noted that the block division information determined at the encoding end, as well as mode or parameter information for prediction, transform, quantization, entropy coding, loop filtering, etc., is carried in the bitstream when necessary. The decoding end parses the bitstream and analyzes the available information to determine the same block division information and the same prediction, transform, quantization, entropy coding and loop filtering mode or parameter information as the encoding end, ensuring that the decoded picture obtained at the encoding end is identical to the one obtained at the decoding end.
Fig. 3 is a schematic block diagram of the video decoder provided by an embodiment of the present application.
As shown in Fig. 3, the video decoder 300 includes an entropy decoding unit 310, a prediction unit 320, an inverse quantization/transform unit 330, a reconstruction unit 340, a loop filtering unit 350 and a decoded picture buffer 360. It should be noted that the video decoder 300 may contain more, fewer or different functional components.
The video decoder 300 may receive a bitstream. The entropy decoding unit 310 may parse the bitstream to extract syntax elements from it; as part of parsing the bitstream, it may parse the entropy-coded syntax elements. The prediction unit 320, the inverse quantization/transform unit 330, the reconstruction unit 340 and the loop filtering unit 350 may decode the video data according to the syntax elements extracted from the bitstream, i.e., generate decoded video data.
In some embodiments, the prediction unit 320 includes an intra prediction unit 321 and an inter prediction unit 322.
The intra prediction unit 321 may perform intra prediction to generate the prediction block of a PU, using an intra prediction mode to generate the prediction block based on the pixel blocks of spatially adjacent PUs; it may also determine the intra prediction mode of the PU from one or more syntax elements parsed from the bitstream.
The inter prediction unit 322 may construct a first reference picture list (list 0) and a second reference picture list (list 1) according to syntax elements parsed from the bitstream. Moreover, if a PU is encoded using inter prediction, the entropy decoding unit 310 may parse the motion information of the PU, from which the inter prediction unit 322 may determine one or more reference blocks of the PU and generate the prediction block of the PU.
The inverse quantization/transform unit 330 may inversely quantize (i.e., dequantize) the transform coefficients associated with a TU, using the QP value associated with the TU's CU to determine the degree of quantization.
After inversely quantizing the transform coefficients, the inverse quantization/transform unit 330 may apply one or more inverse transforms to the inversely quantized transform coefficients in order to generate the residual block associated with the TU.
The reconstruction unit 340 uses the residual blocks associated with the TUs of a CU and the prediction blocks of the PUs of the CU to reconstruct the pixel block of the CU; for example, it may add the samples of the residual block to the corresponding samples of the prediction block to obtain the reconstructed block.
The loop filtering unit 350 may perform deblocking filtering operations to reduce blocking artifacts of the pixel block associated with the CU.
In some embodiments, the loop filtering unit 350 includes a deblocking filtering unit, a sample adaptive offset (SAO) unit and an adaptive loop filtering (ALF) unit.
The video decoder 300 may store the reconstructed picture of the CU in the decoded picture buffer 360, and may use the reconstructed picture in the decoded picture buffer 360 as a reference picture for subsequent prediction, or transmit it to a display device for presentation.
The basic video decoding flow involved in the present application is as follows: the entropy decoding unit 310 parses the bitstream to obtain the prediction information, the quantized coefficient matrix, etc. of the current block; based on the prediction information, the prediction unit 320 uses intra or inter prediction on the current block to generate its prediction block. The inverse quantization/transform unit 330 uses the quantized coefficient matrix obtained from the bitstream and inversely quantizes and inversely transforms it to obtain the residual block. The reconstruction unit 340 adds the prediction block and the residual block to obtain the reconstructed block. The reconstructed blocks compose the reconstructed picture, and the loop filtering unit 350 filters the reconstructed picture on a picture or block basis to obtain the decoded picture. The decoded picture may also be called a reconstructed picture; it can be displayed by a display device and can also be stored in the decoded picture buffer 360 to serve as a reference frame for inter prediction of subsequent frames.
The above is the basic flow of a video codec under the block-based hybrid coding framework. With the development of technology, some modules or steps of this framework or flow may be optimized. The present application is applicable to the basic flow of the video codec under the block-based hybrid coding framework, but is not limited to this framework and flow.
At present, video quality is improved by filtering. For example, in HEVC/H.265, DBF and SAO are used for filtering to improve reconstructed picture quality, and VVC/H.266 additionally adds ALF. DBF reduces blocking artifacts by smoothing coding unit boundaries, SAO mitigates the ringing effect by compensating pixel values, and ALF further improves reconstruction quality by minimizing the error between the reconstructed block and the original block. However, filtering cannot significantly improve video quality, and its effect is poor.
In some embodiments of the present application, compressed-video quality enhancement based on spatio-temporal deformable convolution, referred to as the Spatio-Temporal Deformable Fusion (STDF) technique, is mainly applied to post-processing of reconstructed pictures at the decoding end, using several neighboring reference frames to enhance the quality of the current frame. STDF exploits the effective alignment property of deformable convolution to align and fuse temporal information, using the temporal information of reference frames to enhance the current frame.
The STDF technique is mainly realized through the following flow:
a) Extract 2R+1 consecutive pictures from the reconstructed-picture buffer stream at the decoding end, where the middle frame is the frame to be enhanced and the other frames are reference frames. The reference frames provide supplementary temporal information for the frame to be enhanced.
b) Concatenate the extracted consecutive frames along the temporal dimension and input them to an offset prediction network to generate offset values, i.e., the offsets of the sampling points of the deformable convolution. The offset prediction network takes the form of a U-shaped network (Unet), combining low-level detail information with high-level semantic information to learn the temporal information sufficiently and predict the offsets directly. One group of offsets is predicted for each frame, i.e., 2R+1 groups are output. For each pixel of each frame there are 9 sampling points, i.e., 9 offsets, and each offset contains sampling distances in the horizontal and vertical directions.
c) Use the offsets predicted in step b) as the sampling-point offsets of the deformable convolution, aligning the reference frames to the current frame and thereby fusing temporal information.
d) Input the fused features generated in step c) into a quality enhancement network to learn a reconstruction residual map, i.e., the difference between the input frame to be enhanced and the real image. The residual map is added to the frame to be enhanced to output the enhanced frame.
In practical applications it has been found that the first approach above, i.e., in-loop filtering, is difficult to design and yields little gain. Moreover, in-loop filtering is usually intra-frame filtering; for multi-frame enhancement, subsequent frames that have not yet been reconstructed cannot be obtained, which is very limiting.
As for the STDF technique, for a current predicted sampling point with offset position P(x, y), bilinear filtering is usually adopted during sampling to make it differentiable: let the four points around the sampling point be P1(x1, y1), P2(x2, y2), P3(x3, y3) and P4(x4, y4); the sampled value is computed as P = W(P1, P)·P1 + W(P2, P)·P2 + W(P3, P)·P3 + W(P4, P)·P4, where W denotes the bilinear filtering weights. When training the network, the offsets are optimized towards the true values; however, at the beginning of training, the deviation between the current offset and the true offset is large. The true offset is far beyond the receptive field, the optimization direction of the offset deviates from the direction of the true value, and the error grows. Specifically, as shown in Fig. 4, the true offset position is Pt and the current offset position is P. Since network training optimizes along the gradient direction and the value of Pt is greater than that of P, P drifts towards larger values, i.e., towards point P4, so the error becomes larger and alignment deviates considerably. The generated offsets are therefore inaccurate and the alignment operation is biased, so multi-frame information cannot be fused effectively, and temporal information detrimental to restoring the current frame may even be fused.
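Written out, the bilinear weights above take the standard form (a routine expansion of bilinear interpolation, not specific to this application):

```latex
P=\sum_{i=1}^{4} W(P_i,P)\,f(P_i),\qquad
W\big((x_i,y_i),(x,y)\big)=\max\!\big(0,\,1-|x-x_i|\big)\cdot\max\!\big(0,\,1-|y-y_i|\big)
```

Because the weights vanish outside the 2×2 neighborhood, the gradient of P with respect to (x, y) only "sees" the four surrounding samples — exactly the limited receptive field that makes a badly initialized offset drift towards P4 in Fig. 4.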
On this basis, the present application provides a method for image enhancement through a new quality enhancement model. The quality enhancement model performs multi-scale prediction according to the first feature information of the image to be enhanced and of its reference images at N scales to obtain the offset values of the reference images. Because the model realizes multi-scale prediction of the offsets, the range of the receptive field is enlarged so that the offsets can learn the direction of the true offset, enabling accurate offset prediction; subsequent deformable-convolution multi-scale alignment based on the accurately predicted offsets then achieves efficient image enhancement.
The image processing method involved in the embodiments of the present application is introduced below with reference to specific embodiments.
The image processing method provided by the present application uses a quality enhancement network to enhance images; the quality enhancement network is a piece of software code or a chip with data processing functions. On this basis, the training process of the quality enhancement network is introduced first.
Fig. 5 is a schematic flowchart of the quality enhancement network training method provided by an embodiment of the present application. As shown in Fig. 5, the training process includes:
S501. Obtain an image to be enhanced and M reference images of the image to be enhanced.
Here, M is a positive integer.
The image to be enhanced is one image in a training set that includes multiple images to be enhanced and the M reference images of each. Training the quality enhancement network with the images to be enhanced and their M reference images in the training set is an iterative process. For example, the first image to be enhanced and its M reference images are input into the quality enhancement network to be trained, and the initial parameters of the network are adjusted once, yielding the network after the first training pass. Then the second image to be enhanced and its M reference images are input into that network and its parameters are adjusted once more, yielding the network after the second pass; iteration continues in this way until the training end condition of the quality enhancement network is reached. The training end condition includes the number of training passes reaching a preset number, or the loss reaching a preset loss.
The initial parameters of the quality enhancement network can be determined in ways including but not limited to the following:
Mode 1: the initial parameters may be preset values, random values, or empirical values.
Mode 2: pre-training parameters obtained from a pre-trained model are used as the initial parameters of the quality enhancement network.
In some embodiments, the M reference images of the image to be enhanced may be the M images preceding the image to be enhanced in playback order in the video stream.
In some embodiments, the M reference images may be the M images following the image to be enhanced in playback order in the video stream.
In some embodiments, the M reference images may be the R images preceding and the R images following the image to be enhanced in playback order, where 2R = M. For example, if a video stream includes, in playback order, image 1, image 2 and image 3, with image 2 the image to be enhanced, then images 1 and 3 may be used as reference images of image 2.
In some embodiments, the image to be enhanced and the M reference images are consecutive in playback order.
In some embodiments, the image to be enhanced and the M reference images are not consecutive in playback order.
In the embodiments of the present application, the process of training the quality enhancement network with each image to be enhanced and its M reference images is the same; for ease of description, one image to be enhanced is taken as an example to explain the training process.
The network structure of the quality enhancement network involved in the embodiments of the present application is introduced below with reference to Fig. 6. It should be noted that this network structure includes but is not limited to the modules shown in Fig. 6 and may include more or fewer modules than Fig. 6.
As shown in Fig. 6, the quality enhancement network includes a feature extraction module, an offset value prediction module, a temporal alignment module and a quality enhancement module.
The feature extraction module is used to extract the first feature information of an image at different scales. It should be noted that the scale of an image in the present application refers to the size of its height and width.
The offset value prediction module is used to predict the offset values of an image according to the first feature information at different scales extracted by the feature extraction module.
The temporal alignment module is used to perform temporal alignment according to the first feature information extracted by the feature extraction module and the offset values predicted by the offset value prediction module, to obtain temporally aligned second feature information.
The quality enhancement module is used to predict the enhanced image according to the second feature information aligned by the temporal alignment module.
It should be noted that Fig. 6 is only a schematic framework of the quality enhancement network involved in the embodiments of the present application; the quality enhancement network may include more or fewer modules than Fig. 6, which is not limited by the present application.
Taking Fig. 6 as an example, when training the quality enhancement network shown in Fig. 6, the above S501 is followed by steps S502 to S506 below.
S502. Input the image to be enhanced and its M reference images into the feature extraction module for feature extraction at different scales, to obtain the first feature information of the image to be enhanced and the reference images at N scales.
Here, N is a positive integer greater than 1; that is, the feature extraction module performs feature extraction at at least two different scales on the M+1 input images, obtaining the first feature information of the image to be enhanced and the reference images at at least two sizes. For example, for N = 3, the feature extraction module outputs the first feature information of the image to be enhanced and the reference images at scale L1, at scale L2 and at scale L3.
Optionally, scale L1 denotes the scale of the original image, scale L2 one half of it, and scale L3 one quarter of it. For example, if the original size of the image to be enhanced and/or the reference image is H×W, the size of its first feature information is H×W at scale L1, H/2×W/2 at scale L2 and H/4×W/4 at scale L3.
For example, suppose the image to be enhanced is t, its forward reference images are t−r to t−1 and its backward reference images are t+1 to t+r, 2r+1 images in total, denoted I_i ∈ R^{H×W}, i ∈ {t−r, ..., t+r}; these are fed into the quality enhancement network for processing. The feature extraction module performs multi-scale feature extraction on the 2r+1 images and outputs the first feature information of the images at three scales,

    f_i^L ∈ R^{C×(H/L)×(W/L)}, L ∈ {1, 2, 4}, i ∈ {t−r, ..., t+r},

where the 1, 2 and 4 in L = {1, 2, 4} make H/L and W/L in the formula correspond to the original scale, the one-half scale and the one-quarter scale respectively.
It should be noted that the original scale, the one-half scale and the one-quarter scale above are taken as an example; the N scales involved in the embodiments of the present application include but are not limited to these three scales and are set according to actual needs.
In addition, it should be noted that the first feature information of the reference images at N scales output by the feature extraction module includes the feature information at N scales of at least one of the M reference images. That is, the feature extraction module may perform feature extraction on every one of the M reference images, obtaining the first feature information of each at N scales, or it may perform feature extraction on only some of the M reference images, obtaining the first feature information of those at N scales.
S503. Perform multi-scale prediction through the offset value prediction module according to the first feature information of the image to be enhanced and the reference images at the N scales, to obtain the offset values of the reference images.
For example, the first feature information of the image to be enhanced and the reference images at scales L1, L2 and L3 is input into the offset value prediction module, which learns from the first feature information at the different scales so as to enlarge the range of the receptive field it learns over, letting the offsets learn the direction of the true offset and enabling accurate offset prediction.
Here, the offset values of a reference image can be understood as an offset matrix.
In some embodiments, the offset value prediction module of the embodiments of the present application is a pyramid progressive prediction network that learns the deformable-convolution offsets step by step from coarse to fine. This pyramid progressive structure is effective for enhancing compressed video with large motion distances.
S504. Perform temporal alignment through the temporal alignment module according to the offset values of the reference images and the first feature information of the reference images, to obtain the second feature information of the reference images.
Specifically, referring to Fig. 6, the offset values of a reference image predicted by the offset value prediction module and the first feature information of that reference image extracted by the feature extraction module are input into the temporal alignment module. For each point in the first feature information, the temporal alignment module takes the offsets corresponding to that point (for example, 9 offsets) from the reference image's offset values, uses them as the offsets of the sampling points to obtain 9 sampling points, convolves these 9 samples to obtain one convolved value, and uses that value as the second feature information of the point; the same operation is applied to the points of the first feature information to obtain the second feature information of the reference image.
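A minimal sketch of this single-scale alignment, assuming torchvision's modulated deformable convolution; the mapping of the per-point (x, y, magnitude) triplets onto deform_conv2d's offset and mask arguments is this sketch's assumption, not something specified by the application:

```python
import torch
import torchvision.ops as ops

def align_features(feat, offsets, weight, bias=None):
    # feat:    (B, C, H, W) first feature information of one reference image
    # offsets: (B, 3*9, H, W) per-pixel (x, y, magnitude) for the 9 sampling points
    # weight:  (C, C, 3, 3) deformable 3x3 kernel
    b, _, h, w = offsets.shape
    o = offsets.view(b, 9, 3, h, w)
    xy = o[:, :, :2].reshape(b, 18, h, w)   # spatial offsets, torchvision's channel layout
    mask = torch.sigmoid(o[:, :, 2])        # (B, 9, H, W) sampling magnitudes
    return ops.deform_conv2d(feat, xy, weight, bias, padding=1, mask=mask)
```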
In some embodiments, the above S504 includes: performing multi-scale temporal alignment through the temporal alignment module according to the offset values of the reference image and its first feature information, to obtain multi-scale second feature information of the reference image.
Specifically, the temporal alignment module downsamples the offset values and the first feature information of the reference image to several small scales; for a given scale, the offsets and the first feature information at that scale are temporally aligned, obtaining the second feature information at that scale.
To predict offsets more accurately and optimize network training, the present application adopts a multi-scale alignment technique: the temporal alignment module in Fig. 6 synchronously downsamples the first feature information to be aligned and the offsets to several small scales and performs the deformable-convolution alignment operation at each of them. Since small-scale offsets are closer to the true sampling points than large-scale offsets, during training the gradient optimization direction will point towards the true sampling points. For large-scale offsets, the sampling mechanism of bilinear filtering prevents them from finding the correct optimization direction, so the optimization of the small-scale offsets guides that of the large-scale offsets, ultimately making the whole alignment process more precise.
S505. Obtain the predicted value of the enhanced image of the image to be enhanced through the quality enhancement module according to the second feature information of the reference images.
In some embodiments, the second feature information of the reference images aligned by the temporal alignment module is input into the quality enhancement module to obtain the predicted value of the enhanced image of the image to be enhanced.
In some embodiments, the method further includes obtaining the second feature information of the image to be enhanced, and inputting the second feature information of the image to be enhanced and of the reference images into the quality enhancement module to obtain the predicted value of the enhanced image; for the specific process, refer to the embodiment shown in Fig. 7 below.
In some embodiments, in addition to the second feature information of the reference images, the first feature information of the image to be enhanced may also be input into the quality enhancement module to obtain the predicted value of the enhanced image; for the specific process, refer to the embodiment shown in Fig. 9 below.
S506. Train the quality enhancement network according to the predicted value of the enhanced image of the image to be enhanced and the true value of that enhanced image.
The embodiments of the present application do not limit the way the true value of the enhanced image of the image to be enhanced is obtained.
In some embodiments, the true value may be an enhanced image obtained with an existing image quality enhancement method.
In some embodiments, the true value may be an image captured by high-quality image acquisition equipment.
Specifically, the loss between the predicted value and the true value of the enhanced image of the image to be enhanced is computed according to a preset loss function, and the parameters of the quality enhancement network are adjusted by back-propagation according to the magnitude of the loss, realizing the training of the quality enhancement network.
The above steps are repeated until training of the quality enhancement network is completed.
In the training method of the embodiments of the present application, an image to be enhanced and its M reference images are obtained and input into the feature extraction module for feature extraction at different scales, obtaining the first feature information of the image to be enhanced and the reference images at N scales; multi-scale prediction is performed through the offset value prediction module according to that first feature information, obtaining the offset values of the reference images; temporal alignment is performed through the temporal alignment module according to the offset values and the first feature information of the reference images, obtaining their second feature information; the predicted value of the enhanced image is obtained through the quality enhancement module according to the second feature information; and the quality enhancement network is trained according to the predicted value and the true value of the enhanced image. In the quality enhancement network proposed by the embodiments of the present application, the offset value prediction module learns from first feature information at different scales, enlarging the range of the receptive field it learns over, so that the offsets can learn the direction of the true offset and be predicted accurately; based on the accurately predicted offsets, the image enhancement effect can be improved.
The model training manner of the embodiments of the present application includes two manners; the network structure and training process of the quality enhancement network are introduced below in combination with the two training manners respectively.
Fig. 7 is a schematic flowchart of a training method for the quality enhancement network provided by an embodiment of the present application. As shown in Fig. 7, the training process includes:
S601. Obtain an image to be enhanced and M reference images of the image to be enhanced.
Here, M is a positive integer.
For the implementation of S601, refer to the description of S501 above, which is not repeated here.
S602. Input the image to be enhanced and its M reference images into the feature extraction module for feature extraction at different scales, to obtain the first feature information of the image to be enhanced and the reference images at N scales.
Here, N is a positive integer greater than 1.
The embodiments of the present application do not limit the network structure of the feature extraction module.
In some embodiments, as shown in Fig. 8A, the feature extraction module includes N first feature extraction units. In this case, the above S602 includes: for the image to be enhanced, inputting it into the feature extraction module, obtaining the first feature information of the image at the (N−i+1)th scale extracted by the ith first feature extraction unit, and inputting that first feature information into the (i+1)th first feature extraction unit for feature extraction to obtain the first feature information at the (N−i+2)th scale, with i a positive integer from 1 to N−1; and likewise, for at least one of the M reference images, inputting the reference image into the feature extraction module, obtaining its first feature information at the (N−i+1)th scale extracted by the ith unit, and inputting that into the (i+1)th unit to obtain its first feature information at the (N−i+2)th scale. It should be noted that Fig. 8A shows the network structure with N = 3 as an example; the feature extraction module of the embodiments of the present application may include 2 first feature extraction units or more than 3.
For example, suppose N is 3, as shown in Fig. 8A. For any of the M+1 images formed by the image to be enhanced and the M reference images, the image is input into the first first feature extraction unit, which processes it and outputs its first feature information at the third scale (for example, scale L1); the first unit also feeds that first feature information into the second first feature extraction unit. The second unit processes it and outputs the image's first feature information at the second scale (for example, scale L2), and feeds that into the third first feature extraction unit, which processes it and outputs the image's first feature information at the first scale (for example, scale L3).
This embodiment does not limit the specific sizes of the first, second and third scales.
In some embodiments, the third scale is the original scale of the image, for example H×W; the second scale is one half of the third, for example H/2×W/2; and the first scale is one half of the second, for example H/4×W/4.
The embodiments of the present application do not limit the network structure of the first feature extraction unit.
In some embodiments, the first feature extraction unit includes at least one convolutional layer.
Optionally, each of the N first feature extraction units includes the same number of convolutional layers, for example two convolutional layers each.
Optionally, the numbers of convolutional layers in the N first feature extraction units are not all the same; for example, some units include 2 convolutional layers, some include 1, or some include 3, etc.
Optionally, the parameters of the convolutional layers of each first feature extraction unit may be the same or different.
In a specific embodiment of the present application, the feature extraction module includes 6 convolutional layers, grouped so that the first two layers share one stride value, the third and fourth another, and the fifth and sixth a third; in the configuration of Fig. 8B the per-layer strides are 1, 1, 2, 1, 2, 1.
For example, as shown in Fig. 8B, the feature extraction module includes 3 first feature extraction units, each including 2 convolutional layers. The first unit includes two convolutional layers, both with stride 1; the second unit includes two convolutional layers, the first with stride 2 and the second with stride 1; and the third unit likewise has a first layer with stride 2 and a second layer with stride 1.
This embodiment does not limit the number of channels of the convolutional layers shown in Fig. 8B; for example, each has C = 64 channels.
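The following is a hedged PyTorch sketch of the Fig. 8B layout (the LeakyReLU activation is borrowed from the enhancement-module description later in this document, and the input channel count is an assumption):

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):
    # Three first feature extraction units of two 3x3 convs each, C = 64 channels;
    # units 2 and 3 open with a stride-2 conv, giving features at HxW, H/2xW/2, H/4xW/4.
    def __init__(self, in_ch: int = 3, c: int = 64):
        super().__init__()
        def unit(cin, stride):
            return nn.Sequential(
                nn.Conv2d(cin, c, 3, stride, 1), nn.LeakyReLU(0.1),
                nn.Conv2d(c, c, 3, 1, 1), nn.LeakyReLU(0.1))
        self.u1, self.u2, self.u3 = unit(in_ch, 1), unit(c, 2), unit(c, 2)

    def forward(self, x):
        f_l1 = self.u1(x)       # original scale
        f_l2 = self.u2(f_l1)    # one-half scale
        f_l3 = self.u3(f_l2)    # one-quarter scale
        return f_l1, f_l2, f_l3
```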
S603. Input the first feature information of the image to be enhanced and the reference images at the N scales into the offset value prediction module for multi-scale prediction, to obtain the offset values of the image to be enhanced and of the reference images at the Nth scale.
Here, the Nth scale is the largest of the N scales.
This embodiment does not limit the specific network structure of the offset value prediction module.
In some embodiments, as shown in Fig. 8C, the offset value prediction module includes N first prediction units, and the above S603 includes S603-A and S603-B:
S603-A. Input the first feature information of the image to be enhanced and the reference image at the jth scale, together with their offset values at the jth scale, into the jth first prediction unit to obtain their offset values at the (j+1)th scale, with j a positive integer from 1 to N−1. For example, the first feature information and the offset value of the image to be enhanced at the jth scale, and the first feature information and the offset value of the reference image at the jth scale, are input into the jth first prediction unit, obtaining the offset value of the image to be enhanced at the (j+1)th scale and the offset value of the reference image at the (j+1)th scale respectively.
S603-B. Input the first feature information of the image to be enhanced and the reference image at the Nth scale, together with their offset values at the Nth scale predicted by the (N−1)th first prediction unit, into the Nth first prediction unit to obtain the offset values of the image to be enhanced and the reference image at the Nth scale predicted by the Nth first prediction unit.
Here, if the jth prediction unit is the first of the N prediction units, the offset values of the image to be enhanced and the reference image at the jth scale are 0.
For example, suppose N = 3, as shown in Fig. 8C. The first feature information of the image to be enhanced and the reference image at the first scale, output by the third first feature extraction unit of Fig. 8B, is concatenated and input into the first first prediction unit for offset prediction, yielding the offsets of the image to be enhanced and the reference image at the second scale predicted by the first unit. The concatenated first feature information at the second scale and the predicted second-scale offsets are then input into the second first prediction unit for offset prediction, yielding the predicted offsets at the third scale. Finally, the concatenated first feature information at the third scale and the predicted third-scale offsets are input into the third first prediction unit for offset prediction, yielding the offsets of the image to be enhanced and the reference image at the third scale predicted by the third first prediction unit.
The embodiments of the present application do not limit the specific network structure of the first prediction unit.
In some embodiments, as shown in Fig. 8D, if the jth prediction unit is the first of the N first prediction units, the first first prediction unit includes a first first prediction subunit and a first first upsampling subunit.
Based on Fig. 8D, the above S603-A includes:
S603-A11. Input the first feature information of the image to be enhanced and the reference image at the first scale into the first first prediction subunit for offset prediction, to obtain their offsets at the first scale predicted by the first prediction subunit;
S603-A12. Input the first-scale offsets of the image to be enhanced and the reference image predicted by the first first prediction subunit into the first first upsampling subunit for upsampling, to obtain their offsets at the second scale.
In some embodiments, if the jth first prediction unit is a first prediction unit other than the first among the N first prediction units, it includes a jth first alignment subunit, a jth first prediction subunit and a jth first upsampling subunit. For example, as shown in Fig. 8D, the second first prediction unit includes a second first alignment subunit, a second first prediction subunit and a second first upsampling subunit.
Based on Fig. 8D, the above S603-A includes S603-A21 to S603-A23:
S603-A21. Input the first feature information of the image to be enhanced and the reference image at the jth scale, together with their offsets at the jth scale predicted by the (j−1)th first prediction unit, into the jth first alignment subunit for temporal feature alignment, to obtain the feature information of the image to be enhanced and the reference image aligned at the jth scale;
S603-A22. Input the aligned feature information into the jth first prediction subunit for offset prediction and add the result to the offsets at the jth scale predicted by the (j−1)th first prediction unit, to obtain the offsets of the image to be enhanced and the reference image at the jth scale;
S603-A23. Input those jth-scale offsets into the jth first upsampling subunit for upsampling, to obtain the offsets at the (j+1)th scale predicted by the jth first prediction unit.
In some embodiments, the Nth first prediction unit includes an Nth first alignment subunit and an Nth first prediction subunit; as shown in Fig. 8D, the third first prediction unit includes a third first alignment subunit and a third first prediction subunit. Then the above S603-B includes S603-B1 and S603-B2:
S603-B1. Input the first feature information of the image to be enhanced and the reference image at the Nth scale, together with their offsets at the Nth scale predicted by the (N−1)th first prediction unit, into the Nth first alignment subunit for temporal feature alignment, to obtain the feature information aligned at the Nth scale;
S603-B2. Input the aligned feature information into the Nth first prediction subunit for offset prediction and add the result to the offsets at the Nth scale predicted by the (N−1)th first prediction unit, to obtain the offsets of the image to be enhanced and the reference image at the Nth scale predicted by the Nth first prediction unit.
The embodiments of the present application do not limit the network structures of the first alignment subunits, the first prediction subunits and the first upsampling subunits.
In some embodiments, the first prediction subunit is an offset prediction network (OPN).
Optionally, the OPN uses 3 convolutional layers, with T×C input channels and T×3×9 output channels, where the 3 indicates that in addition to the sampling point positions (x, y) the OPN also outputs the magnitude of the sampled value.
Exemplarily, T = 3 and C = 64.
In some embodiments, the first alignment subunit is a deformable convolution (DCN); exemplarily, the input and output channels of the DCN are both C.
In some embodiments, the first upsampling subunit is a bilinear-interpolation upsampling unit.
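A minimal sketch of such an OPN under the stated channel counts (the hidden width and the activation function are assumptions):

```python
import torch.nn as nn

def make_opn(t: int = 3, c: int = 64) -> nn.Sequential:
    # Three conv layers: T*C concatenated feature channels in, T*3*9 channels out,
    # i.e. (x, y, magnitude) for each of the 9 sampling points of every frame.
    return nn.Sequential(
        nn.Conv2d(t * c, c, 3, 1, 1), nn.LeakyReLU(0.1),
        nn.Conv2d(c, c, 3, 1, 1), nn.LeakyReLU(0.1),
        nn.Conv2d(c, t * 3 * 9, 3, 1, 1))
```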
In the pyramid progressive offset value prediction module, to predict offsets more effectively, the predicted offsets are adjusted gradually from coarse to fine, i.e., offset residuals rather than the offsets themselves are predicted.
For example, as shown in Fig. 8D with N = 3, the first feature information f_i^1 of the image to be enhanced and the reference image at the first scale (i.e., the smallest scale L3) generated by the feature extraction module is concatenated and input into the first first prediction subunit (OPN) to predict offsets. The OPN uses 3 convolutional layers to predict the offsets of the image to be enhanced and the reference image at the first scale, denoted O_0.
Then, the first-scale offsets O_0 are upsampled by the first first upsampling subunit to the offsets O_2 at the second scale (i.e., scale L2).
The first feature information f_i^2 of the image to be enhanced and the reference image at the second scale is concatenated and, together with the offsets O_2, input into the second first alignment subunit (DCN) for deformable convolution, obtaining the feature information of the image to be enhanced and the reference image aligned at the second scale. The aligned feature information is input into the second first prediction subunit (OPN), obtaining the second-scale offsets O_3 predicted by it; O_3 is added to O_2 and input into the second first upsampling subunit, obtaining the offsets O_4. O_4 is input into the third first alignment subunit, so that it samples and aligns the first feature information of the image to be enhanced and the reference image at the third scale (i.e., the original scale L1) output by the preceding steps, obtaining their aligned features at the third scale; these aligned features are input into the third first prediction subunit, predicting offsets O_5 of the image to be enhanced and the reference image; O_5 is added to O_4 to obtain their offsets at the third scale, O ∈ R^{T×3×9×H×W}. In this embodiment, since each preceding prediction uses small-scale features to predict the offsets of large-scale features, the offsets lose detail, so an extra prediction and alignment operation is added on the original-scale features: the multi-scale features of the image to be enhanced and the reference image aligned according to O_4 are input into the third prediction subunit (OPN), whose output offsets are added to O_4 to obtain more precise offsets at the same scale, O ∈ R^{T×3×9×H×W}.
S604. Input the offsets and the first feature information of the image to be enhanced at the Nth scale, and the offsets and the first feature information of the reference image at the Nth scale, into the temporal alignment module for multi-scale temporal alignment, to obtain the second feature information of the image to be enhanced at multiple scales and the second feature information of the reference image at multiple scales.
The embodiments of the present application do not limit the specific network structure of the temporal alignment module.
In some embodiments, as shown in Fig. 8E, the temporal alignment module includes K first temporal alignment units and K−1 first downsampling units, with K a positive integer greater than 2.
In a possible implementation, the first temporal alignment unit is a deformable convolution (DCN).
In a possible implementation, the first downsampling unit is an average pooling layer.
In a possible implementation, the first downsampling unit is a max pooling layer.
In this case, the above S604 includes S604-A1 to S604-A3:
S604-A1. Denote any one of the image to be enhanced and the reference images as the first image, and input the offsets and the first feature information of the first image at the kth scale into the kth first temporal alignment unit, to obtain the second feature information of the first image at the kth scale.
Here, k is a positive integer from K down to 2; when k = K, the offsets and the first feature information of the first image at the kth scale are its offsets and first feature information at the Nth scale.
Optionally, K = N.
It should be noted that the reference images above can be understood as all of the M reference images of the image to be enhanced or as some of them; the process of extracting the second feature information is the same for each of the image to be enhanced and the reference images, so for ease of description any one of them is denoted the first image, and the process for every other image is the same as for the first image.
S604-A2. Input the offsets and the first feature information of the first image at the kth scale into the (k−1)th first downsampling unit for downsampling, to obtain the offsets and the first feature information of the first image at the (k−1)th scale;
S604-A3. Input the offsets and the first feature information of the first image at the (k−1)th scale into the (k−1)th first temporal alignment unit, to obtain the second feature information of the first image at the (k−1)th scale, until k−1 equals 1.
In this step, the offsets of the first image at the Nth scale predicted by the offset value prediction module and the first feature information of the first image at the Nth scale extracted by the feature extraction module are aligned at multiple scales. Specifically, the first feature information and offsets of the first image at the Nth scale are downsampled, obtaining first feature information and offsets at different scales, and at each scale they are aligned, obtaining the second feature information of the first image at the different scales.
For example, suppose K = 3. The offsets and the first feature information of the first image at the third scale (e.g., scale L1) are input into the third first temporal alignment unit, obtaining its second feature information at the third scale, where the offsets, the first feature information and the second feature information at the third scale all have size H×W. The third-scale offsets and first feature information are also input into the second first downsampling unit for downsampling, obtaining the offsets and first feature information at the second scale, optionally of size H/2×W/2. These are input into the second first temporal alignment unit, obtaining the second feature information at the second scale. Then the second-scale offsets and first feature information are input into the first first downsampling unit for downsampling, obtaining the offsets and first feature information at the first scale, optionally of size H/4×W/4.
As shown in Fig. 8E, to predict offsets more accurately and optimize gradient propagation effectively, this step adopts a multi-scale alignment operation: the offsets O of the first image and its first feature information at scale L1 are synchronously downsampled to several small scales, for example to one half and one quarter of the original scale. Deformable-convolution alignment is performed separately on the first feature information at the three scales; the offsets at all three scales come from the original-scale offsets O, so during training the coarse small-scale offsets will guide the precise large-scale offsets to optimize towards the true offsets. The second feature information after multi-scale alignment can be expressed as

    f̃_i^L ∈ R^{C'×(H/L)×(W/L)}, L ∈ {1, 2, 4}, C' = T×C.
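A hedged sketch of this synchronized downsampling and per-scale alignment (halving the offset magnitudes along with the resolution is this sketch's assumption; the application only specifies that offsets and features are downsampled together, here with average pooling):

```python
import torch.nn.functional as F

def multiscale_align(feat, offsets, dcns):
    # feat/offsets at the original scale; dcns = one deformable conv per scale,
    # ordered original -> 1/2 -> 1/4. Returns the second feature information of
    # this image at each scale.
    aligned = []
    for i, dcn in enumerate(dcns):
        aligned.append(dcn(feat, offsets))
        if i + 1 < len(dcns):
            feat = F.avg_pool2d(feat, 2)
            offsets = 0.5 * F.avg_pool2d(offsets, 2)
    return aligned
```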
In some embodiments, the above S603 includes: inputting the first feature information of the first image at the N scales into the offset value prediction module for multi-scale prediction, to obtain P groups of offsets of the first image at the Nth scale, with P a positive integer.
Correspondingly, the above S604 includes: dividing the first image into P image blocks and assigning the P groups of offsets to the P image blocks one by one; inputting the group of offsets corresponding to an image block and the block's first feature information into the temporal alignment module for multi-scale temporal alignment, to obtain the block's multi-scale second feature information at the Nth scale; and obtaining the multi-scale second feature information of the first image at the Nth scale from that of its image blocks.
After the multi-scale second feature information of the image to be enhanced and the reference image at the Nth scale is obtained according to the above steps, the following S605 is performed.
S605. Input the second feature information of the image to be enhanced and the reference image at the multiple scales into the quality enhancement module, to obtain the predicted value of the enhanced image of the image to be enhanced.
The embodiments of the present application do not limit the specific network structure of the quality enhancement module.
In some embodiments, as shown in Fig. 8F, the quality enhancement module includes K first enhancement units and K−1 first upsampling units, and the above S605 includes S605-A1 to S605-A4:
S605-A1. Input the second feature information of the image to be enhanced and the reference image at the (k+1)th scale into the (k+1)th first enhancement unit for image quality enhancement, to obtain the initial predicted value of the enhanced image of the image to be enhanced at the (k+1)th scale, with k a positive integer from 1 to K−1.
S605-A2. Input the fusion value of the enhanced image of the image to be enhanced at the kth scale into the kth first upsampling unit for upsampling, to obtain the upsampled value of the enhanced image at the (k+1)th scale.
When k is 1, the fusion value of the enhanced image at the kth scale is the initial predicted value of the enhanced image at the first scale obtained by the first first enhancement unit according to the second feature information of the image to be enhanced and the reference image at the first scale.
S605-A3. Fuse the upsampled value and the initial predicted value of the enhanced image at the (k+1)th scale, to obtain the fusion value of the enhanced image at the (k+1)th scale.
S605-A4. Determine the fusion value of the enhanced image at the Kth scale as the predicted value of the enhanced image of the image to be enhanced at the Nth scale.
For example, suppose K = 3, referring to Fig. 8F. The second feature information of the image to be enhanced and the reference image at the first scale is concatenated and input into the first first enhancement unit for quality enhancement, obtaining the fusion value of the enhanced image at the first scale. That fusion value is input into the first first upsampling unit for upsampling, obtaining the upsampled value of the enhanced image at the second scale. The concatenated second feature information at the second scale is input into the second first enhancement unit for image quality enhancement, obtaining the initial predicted value at the second scale; the upsampled value and the initial predicted value at the second scale are fused, obtaining the fusion value at the second scale. That fusion value is input into the second first upsampling unit, obtaining the upsampled value at the third scale. The concatenated second feature information at the third scale is input into the third first enhancement unit, obtaining the initial predicted value at the third scale. The upsampled value and the initial predicted value at the third scale are then fused, obtaining the fusion value at the third scale, which is determined as the predicted value of the enhanced image at the third scale.
In a possible implementation, the first enhancement unit includes multiple convolutional layers, for example 8, each with C = 64 input/output channels (the first layer has T×C = 3×64 input channels and the last layer outputs 1 channel). In addition, the last convolutional layer of each first enhancement unit includes no activation function.
Optionally, the LeakyReLU activation function is used in the first enhancement unit, with coefficient 0.1.
In this step, as shown in Fig. 8F, the second feature information of the image to be enhanced and the reference image aligned at the multiple scales by the temporal alignment module is input into the quality enhancement module simultaneously. To fuse the aligned multi-scale second feature information, the second feature information aligned at each scale is concatenated and input into the quality enhancement module, which restores image quality from coarse to fine. The quality enhancement module has three branches, corresponding to the aligned features at the three input scales: the smallest scale L3 generates a preliminary restored image, and the other branches further learn residual information to restore details.
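A sketch of this coarse-to-fine fusion (additive fusion of the upsampled estimate with the per-scale residual is an assumption, though it matches the residual-learning description above):

```python
import torch.nn.functional as F

def enhance(second_feats, units):
    # second_feats: concatenated aligned features ordered coarse-to-fine
    # [L3, L2, L1]; units: one conv stack per scale (last layer unactivated).
    out = units[0](second_feats[0])        # preliminary restored image at 1/4 scale
    for feat, unit in zip(second_feats[1:], units[1:]):
        up = F.interpolate(out, scale_factor=2, mode="bilinear", align_corners=False)
        out = up + unit(feat)              # fuse upsampled estimate with learned residual
    return out                             # fusion value at the original scale
```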
After the predicted value of the enhanced image of the image to be enhanced at the Nth scale is predicted according to the above method, S606 is performed to adjust the parameters of the quality enhancement network.
S606. Train the quality enhancement network according to the predicted value of the enhanced image of the image to be enhanced and the true value of that enhanced image.
The implementation of S606 is the same as that of S506 above; refer to the specific description of S506, which is not repeated here.
The above steps describe alignment and enhancement using the offsets at the Nth scale, and the process of training the quality enhancement network according to the predicted value of the enhanced image at the Nth scale.
In some embodiments, the training manner of the embodiments of the present application further includes alignment and enhancement using offsets at scales other than the Nth scale, so as to train the quality enhancement network according to the predicted values of the enhanced image at those other scales. Specifically this includes the following steps:
Step A1. Input the first feature information of the image to be enhanced and the reference image at the N scales into the offset value prediction module for multi-scale prediction, to obtain their offsets at the jth scale, where the jth scale is a scale other than the Nth among the N scales.
Step A2. Input the offsets and the first feature information of the image to be enhanced at the jth scale, and those of the reference image at the jth scale, into the temporal alignment module for multi-scale temporal alignment, to obtain the multi-scale second feature information of the image to be enhanced and the reference image at the jth scale.
Step A3. Input that multi-scale second feature information into the quality enhancement module, to obtain the predicted value of the enhanced image of the image to be enhanced at the jth scale.
Step A4. Train the quality enhancement network according to the predicted value and the true value of the enhanced image at the jth scale.
For example, with N = 3 and referring to Fig. 8D, the offsets of the image to be enhanced and the reference image at the second scale (i.e., scale L2) predicted by the second first prediction unit are obtained. Referring to S604, the offsets and first feature information at the Nth scale are replaced by those at the jth scale, and following the method of S604 the multi-scale second feature information of the image to be enhanced and the reference image at the jth scale output by the temporal alignment module can be obtained. Then, following the method of S605, that information is input into the quality enhancement module, obtaining the predicted value of the enhanced image at the jth scale. The true value of the enhanced image is downsampled to the jth scale, the loss between the predicted value and the true value at the jth scale is computed, and the quality enhancement network is trained according to this loss.
Fig. 8G is a schematic diagram of the quality enhancement network provided by a specific embodiment of the present application; for the functions of its modules, refer to the descriptions of the above embodiments.
In this embodiment, besides training the quality enhancement network with the offsets at the Nth scale, the network is further trained with offsets at scales other than the Nth, improving the training efficiency and training accuracy of the quality enhancement network.
The embodiments of the present application do not limit the specific training environment of the quality enhancement network or the selection of training data.
In some embodiments, regarding datasets, 108 sequences from Xiph.org and JCT-VC were used, divided into a training set of 100 sequences and a test set of 8 sequences. Optionally, the sequences in the training and test sets were compressed and decoded with the HM16.9 codec in LDP mode under QP = {22, 27, 32, 37} to obtain reconstructed video sequences, which serve as the input to the quality enhancement network. The data of each QP form one training set and one test set, and 4 models are trained in total. The test set uses the test sequences under the common test conditions required by JVET; after the test set undergoes the same data processing flow as the training set, it is input into the trained model for testing.
Regarding evaluation criteria, the peak signal-to-noise ratio (PSNR) is selected as the criterion of image reconstruction quality.
Regarding network training, the model is trained on the PyTorch platform. The training set is randomly divided into 128×128 patches as input, the training batch size is set to 64, the optimizer is Adam, and the initial learning rate is 1e-4, gradually lowered to 1e-6 as training proceeds. Four models are trained under the 4 QPs respectively.
For the test process, image-level input is used: the whole image is input into the network for processing.
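A minimal training-loop sketch matching the stated setup (the loss is only described as a preset loss function, so the MSE below, the cosine decay schedule, and the loader interface are assumptions):

```python
import torch
from torch.optim import Adam

def train(net, loader, steps, device="cuda"):
    net.to(device).train()
    opt = Adam(net.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=steps, eta_min=1e-6)
    for step, (target, refs, gt) in zip(range(steps), loader):
        # target: 128x128 crop of the frame to be enhanced; refs: its reference crops.
        pred = net(target.to(device), [r.to(device) for r in refs])
        loss = torch.nn.functional.mse_loss(pred, gt.to(device))
        opt.zero_grad(); loss.backward(); opt.step(); sched.step()
```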
Table 1
(The BD-rate results of Table 1 are rendered as images in the source — PCTCN2021107466-appb-000005/000006 — and their cell contents are not recoverable here.)
Table 1 shows the improvement of the present application over HM16.9 in compressed-reconstructed video quality. BD-rate and PSNR are among the main parameters for evaluating the performance of video coding algorithms, indicating the change in bitrate and PSNR (Peak Signal to Noise Ratio) of video encoded by the new algorithm (i.e., the technical solution of the present application) relative to the original algorithm, i.e., the change in bitrate at the same signal-to-noise ratio; "−" indicates a performance improvement, for example in bitrate or PSNR. As shown in Table 1, compared with the quality of video compressed and reconstructed by HM16.9, the technical solution proposed by the present application achieves an average performance improvement of 21.0% in bitrate saving.
The embodiments of the present application provide a training method for the quality enhancement network, which includes a feature extraction module, an offset value prediction module, a temporal alignment module and a quality enhancement module. During training, an image to be enhanced and its M reference images are obtained and input into the feature extraction module for feature extraction at different scales, obtaining the first feature information of the image to be enhanced and the reference images at N scales; that first feature information is input into the offset value prediction module for multi-scale prediction, obtaining the offsets of the image to be enhanced and the reference images at the Nth scale; the offsets and first feature information of the image to be enhanced and the reference images at the Nth scale are input into the temporal alignment module for multi-scale temporal alignment, obtaining their second feature information at multiple scales; that second feature information is input into the quality enhancement module, obtaining the predicted value of the enhanced image, and the quality enhancement network is trained according to the predicted value and the true value of the enhanced image. Since the quality enhancement network adopts a pyramid prediction network, only the offsets are upsampled, avoiding the information loss caused by upsampling image features. In addition, to predict offsets more accurately and optimize network training, a multi-scale alignment technique is adopted: the original-scale offsets and the features to be aligned are synchronously downsampled; small-scale offsets are closer to the true sampling points than large-scale ones, so during training the gradient optimization direction points towards the true sampling points, ultimately guiding the whole alignment process to be more precise. When the trained network is used for image enhancement, efficient image enhancement can be achieved.
The embodiment shown in Fig. 7 above describes the process of training the quality enhancement network with the offsets of both the image to be enhanced and the reference images. The process of training the network with the offsets of the reference images is introduced below with reference to Fig. 9.
Fig. 9 is a schematic flowchart of a training method for the quality enhancement network provided by an embodiment of the present application. As shown in Fig. 9, the training process includes:
S701. Obtain an image to be enhanced and M reference images of the image to be enhanced.
Here, M is a positive integer.
S702. Input the image to be enhanced and the M reference images into the feature extraction module for feature extraction at different scales, to obtain the first feature information of the image to be enhanced and the reference images at N scales, where N is a positive integer greater than 1.
For the implementation of S701 and S702, refer to the descriptions of S601 and S602 above, which are not repeated here.
S703. Input the first feature information of the image to be enhanced and the reference images at the N scales into the offset value prediction module for multi-scale prediction, to obtain the offsets of the reference image at the Nth scale.
Here, the Nth scale is the largest of the N scales.
This embodiment does not limit the specific network structure of the offset value prediction module.
In some embodiments, as shown in Fig. 10A, the offset value prediction module includes N second prediction units, and in this case the above S703 includes:
S703-A. Input the first feature information of the image to be enhanced and the reference image at the jth scale, together with the offsets of the reference image at the jth scale, into the jth second prediction unit to obtain the offsets of the reference image at the (j+1)th scale, until j+1 equals N, with j a positive integer from 1 to N−1.
S703-B. Input the first feature information of the image to be enhanced and the reference image at the Nth scale, together with the offsets of the reference image at the Nth scale predicted by the (N−1)th second prediction unit, into the Nth second prediction unit to obtain the offsets of the reference image at the Nth scale predicted by the Nth second prediction unit.
Optionally, if the jth second prediction unit is the first of the N second prediction units, the offsets of the reference image at the (j−1)th scale are 0.
For example, suppose N = 3, as shown in Fig. 10A. The first feature information of the image to be enhanced and the reference image at the first scale, output by the third first feature extraction unit of Fig. 8B, is concatenated and input into the first second prediction unit for offset prediction, yielding the offsets of the reference image at the second scale predicted by the first unit. The concatenated first feature information at the second scale and the predicted second-scale offsets of the reference image are input into the second second prediction unit for offset prediction, yielding the predicted offsets of the reference image at the third scale. Then the concatenated first feature information at the third scale and the predicted third-scale offsets are input into the third second prediction unit for offset prediction, yielding the offsets of the reference image at the third scale predicted by the third second prediction unit.
The embodiments of the present application do not limit the specific network structure of the second prediction unit.
In some embodiments, as shown in Fig. 10B, if the jth second prediction unit is the first of the N second prediction units, the first second prediction unit includes a first second prediction subunit and a first second upsampling subunit. In this case, the above S703-A includes:
S703-A11. Input the first feature information of the image to be enhanced and the reference image at the first scale into the first second prediction subunit for offset prediction, to obtain the offsets of the reference image at the first scale output by the first prediction subunit;
S703-A12. Input the first-scale offsets of the reference image into the first second upsampling subunit for upsampling, to obtain the offsets of the reference image at the second scale.
In some embodiments, if the jth second prediction unit is a second prediction unit other than the first among the N second prediction units, it includes a jth second alignment subunit, a jth second prediction subunit and a jth second upsampling subunit. As shown in Fig. 10B, if the jth second prediction unit is the second of the N second prediction units, the second second prediction unit includes a second second alignment subunit, a second second prediction subunit and a second second upsampling subunit.
Based on Fig. 10B, the above S703-A includes:
S703-A21. Input the first feature information of the image to be enhanced and the reference image at the jth scale, together with the offsets of the reference image at the jth scale predicted by the (j−1)th second prediction unit, into the jth second alignment subunit for temporal feature alignment, to obtain the feature information of the image to be enhanced and the reference image aligned at the jth scale;
S703-A22. Input the aligned feature information into the jth second prediction subunit for offset prediction and add the result to the offsets of the reference image at the jth scale predicted by the (j−1)th second prediction unit, to obtain the offsets of the reference image at the jth scale predicted by the jth second prediction subunit;
S703-A23. Input the jth-scale offsets of the reference image predicted by the jth second prediction subunit into the jth second upsampling subunit for upsampling, to obtain the offsets of the reference image at the (j+1)th scale predicted by the jth second prediction unit.
In some embodiments, the Nth second prediction unit includes an Nth second alignment subunit and an Nth second prediction subunit, and the above S703-B includes:
S703-B1. Input the first feature information of the image to be enhanced and the reference image at the Nth scale, together with the offsets of the reference image at the Nth scale predicted by the (N−1)th second prediction unit, into the Nth second alignment subunit for temporal feature alignment, to obtain the feature information of the image to be enhanced and the reference image aligned at the Nth scale;
S703-B2. Input the aligned feature information into the Nth second prediction subunit for offset prediction and add the result to the offsets of the reference image at the Nth scale predicted by the (N−1)th second prediction unit, to obtain the offsets of the reference image at the Nth scale predicted by the Nth second prediction unit.
The embodiments of the present application do not limit the network structures of the second alignment subunits, the second prediction subunits and the second upsampling subunits.
Optionally, the second prediction subunit is an offset prediction network (OPN).
Optionally, the second alignment subunit is a deformable convolution (DCN).
For example, as shown in Fig. 10B with N = 3, for at least one reference image, the first feature information of the image to be enhanced and the reference image at the first scale (i.e., the smallest scale L3) generated by the feature extraction module is concatenated and input into the first second prediction subunit (OPN) to predict offsets. The OPN uses 3 convolutional layers to predict the offsets of the reference image at the first scale. These first-scale offsets are then upsampled by the first second upsampling subunit to the second-scale (i.e., L2) offsets O_2. The concatenated first feature information at the second scale and O_2 are input into the second second alignment subunit (DCN) for deformable convolution, obtaining the feature information of the image to be enhanced and the reference image aligned at the second scale. The aligned feature information is input into the second second prediction subunit (OPN), obtaining the reference image's second-scale offsets O_3 predicted by it; O_3 is added to O_2 and input into the second second upsampling subunit, obtaining the offsets O_4. O_4 is input into the third second alignment subunit, so that it samples and aligns the first feature information of the image to be enhanced and the reference image at the third scale (i.e., the original scale L1), obtaining their aligned features at the third scale; these are input into the third second prediction subunit, predicting the reference image's offsets O_5; O_5 is added to O_4 to obtain the reference image's offsets at the third scale.
S704. Input the offsets and the first feature information of the reference image at the Nth scale into the temporal alignment module for multi-scale temporal alignment, to obtain the second feature information of the reference image at multiple scales.
The embodiments of the present application do not limit the specific network structure of the temporal alignment module.
In some embodiments, as shown in Fig. 10C, the temporal alignment module includes K second temporal alignment units and K−1 second downsampling units, with K a positive integer greater than 2.
In a possible implementation, the second temporal alignment unit is a deformable convolution (DCN).
Optionally, in the temporal alignment module, the deformable convolutions all have the same number of parameters, for example input and output channels both C = 64.
In a possible implementation, the second downsampling unit is an average pooling layer.
In a possible implementation, the second downsampling unit is a max pooling layer.
Then the above S704 includes:
S704-A1. Input the offsets and the first feature information of the reference image at the kth scale into the kth second temporal alignment unit, to obtain the second feature information of the reference image at the kth scale.
Here, k is a positive integer from K down to 2; when k = K, the offsets and the first feature information of the reference image at the kth scale are its offsets and first feature information at the Nth scale.
S704-A2. Input the offsets and the first feature information of the reference image at the kth scale into the (k−1)th second downsampling unit for downsampling, to obtain the offsets and the first feature information of the reference image at the (k−1)th scale.
S704-A3. Input the offsets and the first feature information of the reference image at the (k−1)th scale into the (k−1)th second temporal alignment unit, to obtain the second feature information of the reference image at the (k−1)th scale, until k−1 equals 1.
For example, suppose K = 3. For at least one of the M reference images, the offsets and the first feature information of the reference image at the third scale (e.g., scale L1) are input into the third temporal alignment unit, obtaining its second feature information at the third scale, where the offsets, the first feature information and the second feature information at the third scale all have size H×W. The third-scale offsets and first feature information are also input into the second downsampling unit for downsampling, obtaining the offsets and first feature information at the second scale, optionally of size H/2×W/2; these are input into the second temporal alignment unit, obtaining the second feature information at the second scale. Then the second-scale offsets and first feature information are input into the first downsampling unit for downsampling, obtaining the offsets and first feature information at the first scale, optionally of size H/4×W/4.
In some embodiments, the above S703 includes: inputting the first feature information of the reference image at the N scales into the offset value prediction module for multi-scale prediction, to obtain P groups of offsets of the reference image at the Nth scale, with P a positive integer.
Correspondingly, the above S704 includes: dividing the reference image into P image blocks and assigning the P groups of offsets to the P image blocks one by one; inputting the group of offsets corresponding to an image block and the block's first feature information into the temporal alignment module for multi-scale temporal alignment, to obtain the block's multi-scale second feature information at the Nth scale; and obtaining the second feature information of the reference image at the multiple scales from the multi-scale second feature information of its image blocks.
After the second feature information of the reference image at the multiple scales is obtained according to the above steps, the following S705 is performed.
S705、将待增强图像在多个尺度下的第一特征信息和参考图像在多个尺度下的第二特征信息输入质量增强模块,得到待增强图像的增强图像的预测值。
本申请实施例对质量增强模块的具体网络结构不做限制。
在一些实施例中,如图10D所示,质量增强模块包括K个第二增强单元和K-1个第二上采样单元,则上述S705包括:
S704-A1、将待增强图像在第k+1个尺度下的第一特征信息和参考图像在第k+1个尺度下的第二特征信息,输入第k+1个第二增强单元中进行图像质量增强,得到待增强图像在第k+1个尺度下的增强图像的初始预测值,k为1至K-1的正整数;
S704-A2、将待增强图像在第k个尺度下的增强图像的融合值输入第k个第二上采样单元中进行上采样,得到待增强图像在第k+1个尺度下的增强图像的上采样值,当k为1时,待增强图像在第k个尺度下的增强图像的融合值为第一个第二增强单元根据待增强图像在第一个尺度下的第一特征信息和参考图像在第一个尺度下的第二特征信息,得到的待增强图像在第一个尺度下的增强图像的初始预测值;
S704-A3、根据待增强图像在第k+1个尺度下的增强图像的上采样值和初始预测值进行融合,得到待增强图像在第k+1个尺度下的增强图像的融合值;
S704-A4、将待增强图像在第K个尺度下的增强图像的融合值确定为待增强图像在第N个尺度下的增强图像的预测值。
举例说明，假设K=3，参照图10D所示，将待增强图像在第一个尺度下的第一特征信息和参考图像在第一个尺度下的第二特征信息拼接后输入第一个第二增强单元中进行质量增强，得到待增强图像在第一个尺度下的增强图像的融合值。接着，将待增强图像在第一个尺度下的增强图像的融合值输入第一个第二上采样单元中进行上采样，得到待增强图像在第二个尺度下的增强图像的上采样值。另外，将待增强图像在第二个尺度下的第一特征信息和参考图像在第二个尺度下的第二特征信息拼接后，输入第二个第二增强单元中进行图像质量增强，得到待增强图像在第二个尺度下的增强图像的初始预测值，将待增强图像在第二个尺度下的增强图像的上采样值和初始预测值进行融合，得到待增强图像在第二个尺度下的增强图像的融合值。接着，将待增强图像在第二个尺度下的增强图像的融合值输入第二个第二上采样单元中进行上采样，得到待增强图像在第三个尺度下的增强图像的上采样值。另外，将待增强图像在第三个尺度下的第一特征信息和每个参考图像在第三个尺度下的第二特征信息拼接后，输入第三个第二增强单元中进行图像质量增强，得到待增强图像在第三个尺度下的增强图像的初始预测值。然后，将待增强图像在第三个尺度下的增强图像的上采样值和初始预测值进行融合，得到待增强图像在第三个尺度下的增强图像的融合值。将待增强图像在第三个尺度下的增强图像的融合值确定为待增强图像在第三个尺度下的增强图像的预测值。
可选的，第二增强单元包括多个卷积层，且多个卷积层中的最后一个卷积层不包括激活函数。
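结合上述例子与"最后一层卷积不接激活函数"的可选设计，下面给出质量增强模块的一个PyTorch最小草图(K=3)；其中融合方式假设为逐元素相加，输出通道数为示意取值，文中均未限定。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhanceUnit(nn.Module):
    """第二增强单元示意：若干卷积层，最后一层不接激活函数。"""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, out_ch, 3, padding=1))    # 最后一层无激活函数

    def forward(self, x):
        return self.net(x)

class QualityEnhance(nn.Module):
    """质量增强模块示意：逐尺度得到初始预测值，并与上采样值融合。"""
    def __init__(self, ch=64, img_ch=3):
        super().__init__()
        self.units = nn.ModuleList(EnhanceUnit(2 * ch, img_ch) for _ in range(3))

    def forward(self, cur_feats, ref_feats):
        # cur_feats：待增强图像各尺度的第一特征信息；ref_feats：参考图像各尺度的第二特征信息
        fused = self.units[0](torch.cat([cur_feats[0], ref_feats[0]], 1))
        for k in (1, 2):
            up = F.interpolate(fused, scale_factor=2, mode='bilinear',
                               align_corners=False)  # 增强图像的上采样值
            init = self.units[k](torch.cat([cur_feats[k], ref_feats[k]], 1))
            fused = up + init                        # 融合方式假设为相加
        return fused    # 待增强图像在第N个尺度下的增强图像的预测值
```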
S706、根据待增强图像的增强图像的预测值和待增强图像的增强图像的真值,对质量增强网络进行训练。
上述S706的实现过程与上述S506一致,参照上述S506的具体描述,在此不再赘述。
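作为示意，下面给出基于PyTorch的单步训练草图；其中L1损失与Adam优化器均为假设选择，具体损失形式以上文S506的描述为准。

```python
import torch
import torch.nn.functional as F

def train_step(quality_net, optimizer, cur, refs, gt):
    """单步训练示意：cur为待增强图像，refs为M个参考图像，gt为增强图像的真值。"""
    optimizer.zero_grad()
    pred = quality_net(cur, refs)   # 待增强图像的增强图像的预测值
    loss = F.l1_loss(pred, gt)      # 以L1损失为例（假设）
    loss.backward()                 # 梯度将引导偏移值预测逼近真实采样点
    optimizer.step()
    return loss.item()

# 用法示意：optimizer = torch.optim.Adam(quality_net.parameters(), lr=1e-4)
```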
上述步骤介绍了使用第N个尺度下的偏移值进行对齐和增强，并根据待增强图像在第N个尺度下的增强图像的预测值对质量增强网络进行训练的过程。
在一些实施例中，本申请实施例的训练方式还包括使用除第N个尺度外的其他尺度下的偏移值进行对齐和增强，以根据待增强图像在其他尺度下的增强图像的预测值对质量增强网络进行训练的过程。具体包括如下步骤：
步骤B1、将待增强图像和参考图像分别在N个尺度下的第一特征信息，输入偏移值预测模块中进行多尺度预测，得到参考图像在第j个尺度下的偏移值，第j个尺度为N个尺度中除第N个尺度之外的尺度；
步骤B2、将参考图像在第j个尺度下的偏移值和第一特征信息,输入时域对齐模块中进行多尺度时域对齐,得到参考图像在第j个尺度下的多尺度第二特征信息;
步骤B3、将待增强图像在多个尺度下的第一特征信息和参考图像在多个尺度下的第二特征信息输入质量增强模块,得到待增强图像在第j个尺度下的增强图像的预测值;
步骤B4、根据待增强图像在第j个尺度下的增强图像的预测值和待增强图像的增强图像的真值,对质量增强网络进行训练。
具体参照上述步骤A1至上述步骤A4的描述,在此不再赘述。
本申请实施例的模型训练方法，通过获取待增强图像以及待增强图像的M个参考图像，将待增强图像以及待增强图像的M个参考图像输入特征提取模块进行不同尺度的特征提取，分别得到待增强图像和参考图像在N个尺度下的第一特征信息；将待增强图像和参考图像分别在N个尺度下的第一特征信息，输入偏移值预测模块中进行多尺度预测，得到参考图像在第N个尺度下的偏移值；将参考图像在第N个尺度下的偏移值和第一特征信息，输入时域对齐模块中进行多尺度时域对齐，得到参考图像在多个尺度下的第二特征信息；将待增强图像在多个尺度下的第一特征信息和参考图像在多个尺度下的第二特征信息输入质量增强模块，得到待增强图像的增强图像的预测值；根据待增强图像的增强图像的预测值和待增强图像的增强图像的真值，对质量增强网络进行训练。由于上述质量增强网络采用金字塔形预测网络，只对偏移值进行上采样处理，避免了图像特征上采样造成的信息损失。另外，为了更准确地预测偏移值，优化网络训练，采用了多尺度对齐技术，将原尺度的偏移值和待对齐特征同步下采样，小尺度的偏移值相对于大尺度的偏移值会更接近真实采样点，训练网络时，梯度优化方向将会指向真实采样点方向，最终引导整个对齐过程更加精确。使用该训练好的网络进行图像增强时，可以实现对图像的高效增强。进一步的，本申请实施例中，偏移值预测模块只预测参考图像的偏移值，且时域对齐模块只对参考图像进行时域对齐，进而降低了各模块的计算量，降低了模型训练的复杂性，进而提高模型的训练效率。
上文结合质量增强网络的网络结构,对质量增强网络的训练过程进行介绍,下面对质量增强网络的应用过程进行介绍。
在一些实施例中,本申请实施例提供的质量增强网络还可以应用于视频编解码框架中,例如可以应用于视频解码端,对解码端得到的重建图像进行质量增强,得到重建图像的增强图像。
图11为本申请一实施例提供的图像解码方法的流程示意图,如图11所示,该方法包括:
S801、解码码流,得到当前重建图像。
例如图3所示,熵解码单元310可解析码流得到当前块的预测信息、量化系数矩阵等,预测单元320基于预测信息对当前块使用帧内预测或帧间预测产生当前块的预测块。反量化/变换单元330使用从码流得到的量化系数矩阵,对量化系数矩阵进行反量化、反变换得到残差块。重建单元340将预测块和残差块相加得到重建块。重建块组成重建图像,可选的环路滤波单元350基于图像或基于块对重建图像进行环路滤波,得到当前重建图像。
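为便于理解上述重建流程，下面给出一段极简的Python伪代码草图；其中decoder对象的各方法仅对应图3中各单元的功能，均为假设的伪接口，并非任何真实解码器的API。

```python
def decode_picture(bitstream, decoder):
    """解码得到当前重建图像的流程示意（伪接口，仅作说明）。"""
    recon_blocks = []
    for blk in decoder.parse_blocks(bitstream):   # 熵解码：预测信息、量化系数矩阵
        pred = decoder.predict(blk)               # 帧内/帧间预测产生预测块
        resid = decoder.inverse_transform(
            decoder.dequantize(blk.coeffs))       # 反量化、反变换得到残差块
        recon_blocks.append(pred + resid)         # 预测块+残差块=重建块
    picture = decoder.assemble(recon_blocks)      # 重建块拼成重建图像
    return decoder.loop_filter(picture)           # 可选的环路滤波
```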
在本实施例中,将质量增强网络与视频编码框架相结合。
在一种示例中，在解码器的输出端增加上述实施例所述的质量增强网络。将解码后的当前重建图像输入质量增强网络，利用该质量增强网络可以显著提升当前重建图像的图像质量，进而在保证码率的前提下，进一步提升解码后的图像质量。
S802、从已重建的图像中,获取当前重建图像的M个参考图像,所述M为正整数。
本步骤获取当前重建图像的M个参考图像的方式包括但不限于如下几种:
方式一，上述当前重建图像的M个参考图像为已重建的图像中的任意M个图像。
方式二,从已重建的图像中,获取在播放顺序上位于该当前重建图像的前向和/或后向的至少一个图像作为当前重建图像的参考图像。
可选的,当前重建图像与M个参考图像在播放顺序上为连续图像。
可选的,当前重建图像与M个参考图像在播放顺序上不为连续图像。
在一些实施例中，本申请实施例的方法还包括：解码码流，得到第一标记信息，该第一标记信息用于指示是否使用质量增强网络对所述当前重建图像进行质量增强。在该第一标记信息指示使用质量增强网络对当前重建图像进行质量增强时，从已重建的图像中，获取该当前重建图像的M个参考图像。
可选的,上述第一标记信息包含在序列参数集SPS中。
也就是说,解码端在执行上述S802之前,需要从SPS中读取第一标记信息。如果第一标记信息的值为1,则表示采用本申请的质量增强网络对解码的当前重建图像进行质量增强。如果第一标记信息的值为0,则表示不采用本申请的质量增强网络对解码的当前重建图像进行质量增强。
在采用本申请的质量增强网络对解码的当前重建图像进行质量增强时，当前重建图像的参考图像存在如下两种情况：
情况1,如果该当前重建图像的前向和/或后向参考图像已经重建,此时直接从重建视频缓存器中读取当前重建图像t的前向t-r到t-1和/或后向t+1到t+r个图像作为该当前重建图像的参考图像。
情况2，如果该当前重建图像的参考图像暂时不能获取，例如该当前重建图像为第一张重建图像。此时，先将该当前重建图像输入重建视频缓存器中，待处理完一个或多个图像组(Group Of Picture，GOP)后，从重建视频缓存器中读取当前重建图像t的前向t-r到t-1和/或后向t+1到t+r个图像作为该当前重建图像的参考图像。
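下面用一段Python草图示意第一标记信息的判断以及上述两种情况下参考图像的获取逻辑；其中rec_buffer的接口与quality_enhance_flag字段名均为本示例的假设。

```python
def get_reference_frames(rec_buffer, t, r):
    """从重建视频缓存器读取第t个图像的前向t-r到t-1和/或后向t+1到t+r个图像。
    rec_buffer：以播放顺序序号为键的已重建图像字典（假设接口）。"""
    return [rec_buffer[i] for i in range(t - r, t + r + 1)
            if i != t and i in rec_buffer]

def enhance_reconstructed(sps, rec_buffer, t, r, frame, enhance_net):
    if not getattr(sps, 'quality_enhance_flag', 0):  # 第一标记信息（字段名为假设）
        return frame
    refs = get_reference_frames(rec_buffer, t, r)
    if not refs:                     # 情况2：参考图像暂不可得，先入缓存延后处理
        rec_buffer[t] = frame
        return frame
    return enhance_net(frame, refs)  # 情况1：直接读取参考图像并增强
```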
在一些实施例中,上述各参考图像均为未经过质量增强网络增强过的图像。
S803、将当前重建图像和M个参考图像输入质量增强网络中,得到当前重建图像的增强图像。
其中,质量增强网络包括特征提取模块、偏移值预测模块、时域对齐模块和质量增强模块,特征提取模块用于对当前重建图像和参考图像进行不同尺度的特征提取,得到当前重建图像和参考图像在N个尺度下的第一特征信息,N为大于1的正整数,偏移值预测模块用于根据当前重建图像和参考图像在N个尺度下的第一特征信息进行多尺度预测,得到参考图像的偏移值,时域对齐模块用于根据参考图像的偏移值和第一特征信息进行时域对齐,得到参考图像的第二特征信息,质量增强模块用于根据参考图像的第二特征信息预测当前重建图像的增强图像。
在一些实施例中,使用质量增强网络对当前重建图像进行质量增强后,将当前重建图像的增强图像进行标记后,存入重建视频缓存器中。或者,直接显示该当前重建图像的增强图像。
参照上述图6所示,质量增强网络包括特征提取模块、偏移值预测模块、时域对齐模块和质量增强模块。
其中,特征提取模块用于对当前重建图像和参考图像分别进行不同尺度的特征提取,分别得到当前重建图像和参考图像在N个尺度下的第一特征信息。
偏移值预测模块用于根据当前重建图像和参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到参考图像的偏移值。
时域对齐模块用于根据参考图像的偏移值和参考图像的第一特征信息进行时域对齐,得到参考图像的第二特征信息。
质量增强模块用于根据参考图像的第二特征信息预测当前重建图像的增强图像。
在一些实施例中,上述时域对齐模块用于根据参考图像的偏移值和第一特征信息进行多尺度时域对齐,得到参考图像在多个尺度下的第二特征信息。
在一些实施例中,如图8A所示,特征提取模块包括N个第一特征提取单元。
其中,将当前重建图像和参考图像中的任一图像记为第一图像,第i个第一特征提取单元用于输出所提取的第一图像的在第N-i+1个尺度下的第一特征信息,并将所提取的第一图像在第N-i+1个尺度下的第一特征信息输入第i+1个第一特征提取单元中,以使第i+1个第一特征提取单元输出第一图像在第N-i+2个尺度下的第一特征信息,其中,i为1至N-1的正整数。
需要说明的是，上述参考图像可以理解为当前重建图像的M个参考图像中的所有参考图像，也可以理解为M个参考图像中的部分参考图像。当前重建图像和参考图像中的每一个图像提取第一特征信息的过程一致，为了便于描述，将当前重建图像和参考图像中的任一图像记为第一图像，其余各图像提取第一特征信息的过程与上述第一图像相同，参照上述第一图像即可。
在一些实施例中,如图8B所示,特征提取模块包括6层卷积层,第一层卷积层和第二层卷积层的卷积步长为第一数值,第三卷积层和第四卷积层的卷积步长为第二数值,第五卷积层和第六卷积层的卷积步长为第三数值,其中第一数值大于第二数值,第二数值大于第三数值。
由上述可知，本申请实施例的质量增强网络可以通过两种方式训练得到，不同训练方式训练得到的质量增强网络中，部分模块在预测时的执行过程也不相同。下面针对上述两种不同训练方法得到的质量增强网络的预测过程分别进行介绍。
情况1,偏移值预测模块用于根据当前重建图像和参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到当前重建图像和参考图像分别在第N个尺度下的偏移值,第N个尺度为N个尺度中的最大尺度;时域对齐模块用于根据当前重建图像在第N个尺度下的偏移值和第一特征信息进行多尺度时域对齐,得到当前重建图像在多个尺度下的第二特征信息,以及根据参考图像在第N个尺度下的偏移值和第一特征信息进行多尺度时域对齐,得到参考图像在多个尺度下的第二特征信息;质量增强模块用于根据当前重建图像和参考图像分别在多个尺度下的第二特征信息,得到当前重建图像的增强图像。
在情况1下,如图8C所示,偏移值预测模块包括N个第一预测单元。
对于N个第一预测单元中的第j个第一预测单元,第j个第一预测单元用于根据当前重建图像和参考图像分别在第j个尺度下的第一特征信息、以及当前重建图像和参考图像分别在第j个尺度下的偏移值,得到当前重建图像和参考图像分别在第j+1个尺度下的偏移值。其中,j为1至N-1的正整数,即从j=1开始,重复执行上述步骤,直到j为N-1为止,进而得到第N-1个第一预测单元预测的当前重建图像和参考图像分别在第N个尺度下的偏移值。
对于N个第一预测单元中的第N个第一预测单元,该第N个第一预测单元用于根据当前重建图像和参考图像分别在第N个尺度下的第一特征信息、以及第N-1个第一预测单元预测的当前重建图像和参考图像分别在第N个尺度下的偏移值,得到第N个第一预测单元预测的当前重建图像和参考图像分别在第N个尺度下的偏移值。
示例性的,若上述第j个预测单元为N个预测单元中的第一个预测单元,则当前重建图像和参考图像分别在第j个尺度下的偏移值为0。
在一些实施例中,如图8D所示,若第j个预测单元为N个第一预测单元中的第一个第一预测单元,则第一个第一预测单元包括第一个第一预测子单元和第一个第一上采样子单元。
其中，第一个第一预测子单元用于根据当前重建图像和参考图像分别在第一个尺度下的第一特征信息进行偏移值预测，预测当前重建图像和参考图像分别在第一个尺度下的偏移值；
第一个第一上采样子单元用于根据第一个第一预测子单元预测的当前重建图像和参考图像分别在第一个尺度下的偏移值进行上采样，得到当前重建图像和参考图像分别在第二个尺度下的偏移值。
在一些实施例中,如图8D所示,若第j个第一预测单元为N个第一预测单元中除第一个第一预测单元之外的第一预测单元,则第j个第一预测单元包括第j个第一对齐子单元、第j个第一预测子单元、第j个第一上采样子单元。
其中,第j个第一对齐子单元用于根据当前重建图像和参考图像分别在第j个尺度下的第一特征信息、以及第j-1个第一预测单元预测的当前重建图像和参考图像分别在第j个尺度下的偏移值进行时域特征对齐,得到当前重建图像和参考图像分别在第j个尺度下对齐的特征信息;
第j个第一预测子单元用于根据当前重建图像和参考图像分别在第j个尺度下对齐的特征信息进行偏移值预测,得到当前重建图像和参考图像分别在j个尺度下的偏移值;
第j个第一上采样子单元用于根据第j个第一预测子单元输出的当前重建图像和参考图像分别在j个尺度下的偏移值和第j-1个第一预测单元预测的当前重建图像和参考图像分别在第j个尺度下的偏移值的和值进行上采样,得到当前重建图像和参考图像分别在j+1个尺度下的偏移值。
在一些实施例中,如图8D所示,第N个第一预测单元包括第N个第一对齐子单元和第N个第一预测子单元。
其中，第N个第一对齐子单元用于根据当前重建图像和参考图像分别在第N个尺度下的第一特征信息、以及第N-1个第一预测单元预测的当前重建图像和参考图像分别在第N个尺度下的偏移值进行时域特征对齐，得到当前重建图像和参考图像分别在第N个尺度下对齐的特征信息；
第N个第一预测子单元用于根据当前重建图像和参考图像分别在第N个尺度下对齐的特征信息进行偏移值预测,得到预测的当前重建图像和参考图像分别在第N个尺度下的偏移值;
上述第N个第一预测单元预测的当前重建图像和参考图像分别在第N个尺度下的偏移值是根据第N个第一预测子单元预测的当前重建图像和参考图像分别在第N个尺度下的偏移值,以及第N-1个第一预测单元预测的当前重建图像和参考图像分别在第N个尺度下的偏移值相加后确定的。
可选的,上述各第一预测子单元为OPN。
可选的,上述第一对齐子单元为DCN。
在情况1下,如图8E所示,时域对齐模块包括K个第一时域对齐单元和K-1个第一下采样单元,K为大于2的正整数。
具体的，第k个第一时域对齐单元用于根据第一图像在第k个尺度下的偏移值和第一特征信息，得到第一图像在第k个尺度下的第二特征信息，其中第一图像为当前重建图像或参考图像；
第k-1个第一下采样单元用于根据第一图像在第k个尺度下的偏移值和第一特征信息进行下采样,得到第一图像在第k-1个尺度下的偏移值和第一特征信息;
第k-1个第一时域对齐单元用于根据第一图像在第k-1个尺度下的偏移值和第一特征信息,得到第一图像在第k-1个尺度下的第二特征信息。
其中,k为K至2的正整数,也就是说,从k=K开始,重复执行上述步骤,直到k=2为止。
示例性的,当k=K时,第一图像在第k个尺度下的偏移值和第一特征信息为第一图像在第N个尺度下的偏移值和第一特征信息。
可选的,上述第一时域对齐单元为DCN。
可选的,上述第一下采样单元为平均池化层或最大池化层。
在一些实施例中,上述偏移值预测模块用于根据第一图像在N个尺度下的第一特征信息进行多尺度预测,得到第一图像在第N个尺度下的P组偏移值,P为正整数;
时域对齐模块用于将第一图像划分为P个图像块,并将P组偏移值一一分配给P个图像块,且根据图像块对应的一组偏移值和图像块的第一特征信息进行多尺度时域对齐,得到图像块在第N个尺度下的多尺度第二特征信息,进而根据第一图像中图像块在第N个尺度下的多尺度第二特征信息,得到第一图像在第N个尺度下的多尺度第二特征信息。
在情况1下,如图8F所示,质量增强模块包括K个第一增强单元和K-1个第一上采样单元。
其中,第k+1个第一增强单元用于根据当前重建图像和参考图像分别在第k+1个尺度下的第二特征信息进行图像质量增强,得到当前重建图像在第k+1个尺度下的增强图像的初始预测值;
第k个第一上采样单元用于根据当前重建图像在第k个尺度下的增强图像的融合值进行上采样，得到当前重建图像在第k+1个尺度下的增强图像的上采样值，当k为1时，当前重建图像在第k个尺度下的增强图像的融合值为第一个第一增强单元根据当前重建图像和参考图像分别在第一个尺度下的第二特征信息，得到的当前重建图像在第一个尺度下的增强图像的初始预测值；
其中,当前重建图像在第k+1个尺度下的增强图像的融合值是根据当前重建图像在第k+1个尺度下的增强图像的上采样值和初始预测值进行融合后确定的。
上述k为1至K-1的正整数,也就是说,从k=1开始,重复执行上述步骤,直到k=K-1为止。
其中,当前重建图像在第N个尺度下的增强图像的预测值是根据当前重建图像在第K个尺度下的增强图像的融合值确定的。
可选的,上述第一增强单元包括多个卷积层,且多个卷积层中的最后一个卷积层不包括激活函数。
情况2，偏移值预测模块用于根据当前重建图像和参考图像分别在N个尺度下的第一特征信息进行多尺度预测，得到参考图像在第N个尺度下的偏移值，第N个尺度为N个尺度中的最大尺度；时域对齐模块用于根据参考图像在第N个尺度下的偏移值和参考图像在第N个尺度下的第一特征信息进行多尺度时域对齐，得到参考图像在多个尺度下的第二特征信息；质量增强模块用于根据当前重建图像在多个尺度下的第一特征信息和参考图像在多个尺度下的第二特征信息，得到当前重建图像的增强图像的预测值。
在情况2中,如图10A所示,偏移值预测模块包括N个第二预测单元。
针对任意参考图像,第j个第二预测单元用于根据当前重建图像和参考图像在第j个尺度下的第一特征信息、以及参考图像在第j个尺度下的偏移值,得到参考图像在第j+1个尺度下的偏移值。
其中,j为1至N-1的正整数,也就是说,从j=1开始,重复执行上述步骤,直到j=N-1为止。
其中,第N个第二预测单元用于根据当前重建图像和参考图像分别在第N个尺度下的第一特征信息、以及第N-1个第二预测单元预测的参考图像在第N个尺度下的偏移值,得到第N个第二预测单元预测的参考图像在第N个尺度下的偏移值。
在一些实施例中,若第j个第二预测单元为N个第二预测单元中的第一个第二预测单元,则每个参考图像在第j-1个尺度下的偏移值为0。
在一些实施例中,如图10B所示,若第j个第二预测单元为N个第二预测单元中的第一个第二预测单元,则第一个第二预测单元包括第一个第二预测子单元和第一个第二上采样子单元。
其中，第一个第二预测子单元用于根据当前重建图像和参考图像分别在第一个尺度下的第一特征信息进行偏移值预测，得到参考图像在第一个尺度下的偏移值；
第一个第二上采样子单元用于根据参考图像在第一个尺度下的偏移值进行上采样，得到参考图像在第二个尺度下的偏移值。
在一些实施例中,如图10B所示,若第j个第二预测单元为N个第二预测单元中除第一个第二预测单元之外的第二预测单元,则第j个第二预测单元包括第j个第二对齐子单元、第j个第二预测子单元、第j个第二上采样子单元。
其中,第j个第二对齐子单元用于根据当前重建图像和参考图像分别在第j个尺度下的第一特征信息、以及第j-1个第二预测单元预测的参考图像在第j个尺度下的偏移值进行时域特征对齐,得到当前重建图像和参考图像分别在第j个尺度下对齐的特征信息;
第j个第二预测子单元用于根据当前重建图像和参考图像在第j个尺度下对齐的特征信息进行偏移值预测,得到参考图像在j个尺度下的偏移值;
第j个第二上采样子单元用于根据第j个第二预测子单元输出的参考图像在j个尺度下的偏移值和第j-1个第二预测单元预测的参考图像在第j个尺度下的偏移值的和值进行上采样，得到参考图像在j+1个尺度下的偏移值。
在一些实施例中,如图10B所示,第N个第二预测单元包括第N个第二对齐子单元和第N个第二预测子单元。
其中,第N个第二对齐子单元用于根据当前重建图像和参考图像分别在第N个尺度下的第一特征信息、以及第N-1个第二预测单元预测的参考图像在第N个尺度下的偏移值进行时域特征对齐,得到当前重建图像和参考图像在第N个尺度下对齐的特征信息;
第N个第二预测子单元用于根据当前重建图像和参考图像在第N个尺度下对齐的特征信息进行偏移值预测，得到第N个第二预测子单元预测的参考图像在第N个尺度下的偏移值；
第N个第二预测单元预测的参考图像在第N个尺度下的偏移值是根据第N个第二预测子单元预测的参考图像在第N个尺度下的偏移值,以及第N-1个第二预测单元预测的参考图像在第N个尺度下的偏移值相加后确定的。
可选的,上述第二预测子单元为OPN。
可选的,上述第二对齐子单元为DCN。
在情况2中,如图10C所示,时域对齐模块包括K个第二时域对齐单元和K-1个第二下采样单元,K为大于2的正整数。
其中,第k个第二时域对齐单元用于根据参考图像在第k个尺度下的偏移值和第一特征信息,得到参考图像在第k个尺度下的第二特征信息。
其中，k为K至2的正整数，当k=K时，参考图像在第k个尺度下的偏移值和第一特征信息为参考图像在第N个尺度下的偏移值和第一特征信息。
第k-1个第二下采样单元用于根据参考图像在第k个尺度下的偏移值和第一特征信息进行下采样,得到参考图像在第k-1个尺度下的偏移值和第一特征信息;
第k-1个第二时域对齐单元用于根据参考图像在第k-1个尺度下的偏移值和第一特征信息,得到参考图像在第k-1个尺度下的第二特征信息,直到k-1等于1为止。
可选的,上述第二时域对齐单元为DCN。
可选的,上述第二下采样单元为平均池化层或最大池化层。
在一些实施例中，上述偏移值预测模块用于根据当前重建图像和参考图像分别在N个尺度下的第一特征信息进行多尺度预测，得到参考图像在第N个尺度下的P组偏移值，P为正整数；
对应的,时域对齐模块用于将参考图像划分为P个图像块,并将P组偏移值一一分配给P个图像块,且根据图像块对应的一组偏移值和图像块的第一特征信息进行多尺度时域对齐,得到图像块在第N个尺度下的多尺度第二特征信息,进而根据参考图像中图像块在第N个尺度下的多尺度第二特征信息,得到参考图像在第N个尺度下的多尺度第二特征信息。
在情况2中,如图10D所示,质量增强模块包括K个第二增强单元和K-1个第二上采样单元。
其中,第k+1个第二增强单元用于根据当前重建图像在第k+1个尺度下的第一特征信息和参考图像在第k+1个尺度下的第二特征信息进行图像质量增强,得到当前重建图像在第k+1个尺度下的增强图像的初始预测值,k为1至K-1的正整数;
第k个第二上采样单元用于根据当前重建图像在第k个尺度下的增强图像的融合值进行上采样,得到当前重建图像在第k+1个尺度下的增强图像的上采样值,当k为1时,当前重建图像在第k个尺度下的增强图像的融合值为第一个第二增强单元根据当前重建图像在第一个尺度下的第一特征信息和参考图像在第一个尺度下的第二特征信息,得到的当前重建图像在第一个尺度下的增强图像的初始预测值;
当前重建图像在第k+1个尺度下的增强图像的融合值是根据当前重建图像在第k+1个尺度下的增强图像的上采样值和初始预测值进行融合后确定的。
可选的，第二增强单元包括多个卷积层，且多个卷积层中的最后一个卷积层不包括激活函数。
本申请实施例,采用上述质量增强网络对当前重建图像进行质量增强,整个过程简单,且成本低,可以实现对当前重建图像的高效增强,进而提高了当前重建图像的质量。
在一些实施例中,本申请实施例提供的质量增强网络还可以应用于视频编解码框架中的视频编码端,对编码端得到的重建图像进行质量增强,得到重建图像的增强图像。
图12为本申请一实施例提供的图像编码方法的流程示意图,如图12所示,该方法包括:
S901、获取待编码图像。
S902、对待编码图像进行编码，得到待编码图像的当前重建图像。
参照上述图2所示的编码器，本申请涉及的视频编码的基本流程如下：在编码端，将待编码的图像(即当前图像)划分成块，针对当前块，预测单元210使用帧内预测或帧间预测产生当前块的预测块。残差单元220可基于预测块与当前块的原始块计算残差块，即预测块和当前块的原始块的差值，该残差块也可称为残差信息。该残差块经由变换/量化单元230变换与量化等过程，可以去除人眼不敏感的信息，以消除视觉冗余。可选的，经过变换/量化单元230变换与量化之前的残差块可称为时域残差块，经过变换/量化单元230变换与量化之后的时域残差块可称为频率残差块或频域残差块。熵编码单元280接收到变换/量化单元230输出的量化后的变换系数，可对该量化后的变换系数进行熵编码，输出码流。例如，熵编码单元280可根据目标上下文模型以及二进制码流的概率信息消除字符冗余。
另外，视频编码器对变换/量化单元230输出的量化后的变换系数进行反量化和反变换，得到当前块的残差块，再将当前块的残差块与当前块的预测块进行相加，得到当前块的重建块。随着编码的进行，可以得到当前图像中其他待编码块对应的重建块，这些重建块进行拼接，得到当前图像的当前重建图像。
可选的,由于编码过程中引入误差,为了降低误差,对当前重建图像进行滤波,例如,使用ALF对当前重建图像进行滤波,以减小当前重建图像中像素点的像素值与当前图像中像素点的原始像素值之间差异。将滤波后的当前重建图像存放在解码图像缓存270中,可以为后续的帧作为帧间预测的参考图像。
S903、从已重建的图像中,获取当前重建图像的M个参考图像,所述M为正整数。
本步骤获取当前重建图像的M个参考图像的方式包括但不限于如下几种:
方式一,上述当前重建图像的M个参考图像为解码图像缓存270中已重建的图像中的任意M个图像。
方式二,从解码图像缓存270中已重建的图像中,获取在播放顺序上位于当前重建图像的前向和/或后向的至少一个图像作为当前重建图像的参考图像。
可选的,当前重建图像与M个参考图像在播放顺序上为连续图像。
可选的,当前重建图像与M个参考图像在播放顺序上不为连续图像。
在一些实施例中,在序列参数集(SPS)中写入第一标记信息,该第一标记信息用于指示是否使用质量增强网络对当前重建图像进行质量增强。在该第一标记信息指示使用质量增强网络对当前重建图像进行质量增强时,从已重建的图像中,获取该当前重建图像的M个参考图像。
如果第一标记信息指示采用本申请的质量增强网络对上述当前重建图像进行质量增强时,当前重建图像的参考图像存在如下两种情况:
情况1,如果该当前重建图像的前向和/或后向参考图像已经重建,此时直接从重建视频缓存器中读取当前重建图像t的前向t-r到t-1和/或后向t+1到t+r个图像作为该当前重建图像的参考图像。
情况2,如果该当前重建图像的参考图像暂时不能获取,例如该当前重建图像为第一张重建图像。此时,先将该当前重建图像输入重建视频缓存器中,待处理完一个或多个GOP后,从重建视频缓存器中读取当前重建图像t的前向t-r到t-1和/或后向t+1到t+r个图像作为该当前重建图像的参考图像。
在一些实施例中，上述各参考图像均为未经过质量增强网络增强过的图像。
S904、将当前重建图像和M个参考图像输入质量增强网络中,得到当前重建图像的增强图像。
其中,质量增强网络包括特征提取模块、偏移值预测模块、时域对齐模块和质量增强模块,特征提取模块用于对当前重建图像和参考图像分别进行不同尺度的特征提取,得到当前重建图像和参考图像分别在N个尺度下的第一特征信息,N为大于1的正整数,偏移值预测模块用于根据当前重建图像和参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到参考图像的偏移值,时域对齐模块用于根据参考图像的偏移值和参考图像的第一特征信息进行时域对齐,得到参考图像的第二特征信息,质量增强模块用于根据参考图像的第二特征信息预测当前重建图像的增强图像。
其中,质量增强网络的具体网络结构,以及质量增强网络中各模块的功能参照上述图11所示实施例的描述,在此不再赘述。
上文对质量增强网络应用于编解码系统中进行了介绍,上述质量增强网络还可以应用于其他需要对图像质量进行增强的场景。
图13为本申请一实施例提供的图像处理方法的流程示意图,如图13所示,该方法包括:
S101、获取待增强的目标图像,以及目标图像的M个参考图像,M为正整数。
S102、将目标图像和M个参考图像输入质量增强网络中,得到目标图像的增强图像。
当质量增强网络应用于视频采集设备采集的视频处理时，对于采集到的第t个图像，按顺序存入缓存器，在采集到第t+r个图像后，便可以从缓存器取出第t-r到t+r个图像共2r+1个图像输入质量增强网络，其中第t个图像为待增强的目标图像，其他图像为待增强的目标图像的参考图像。当应用于视频播放器时，按照播放顺序逐图像增强，即从解码缓冲器中依次取出待增强的目标图像，和其前向后向连续参考图像共同输入质量增强网络，得到目标图像的增强图像。
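下面给出采集端滑动窗口缓存的一个最小草图；其中enhance_net(target, refs)的接口形式为本示例的假设。

```python
from collections import deque

def capture_stream(frames, r, enhance_net):
    """逐帧入缓存，凑满2r+1个图像后增强居中的第t个图像（生成器示意）。"""
    buf = deque(maxlen=2 * r + 1)
    for frame in frames:                 # 按采集顺序到来的图像
        buf.append(frame)
        if len(buf) == 2 * r + 1:
            target = buf[r]              # 窗口中间的待增强目标图像
            refs = [f for i, f in enumerate(buf) if i != r]
            yield enhance_net(target, refs)
```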
其中，质量增强网络包括特征提取模块、偏移值预测模块、时域对齐模块和质量增强模块，特征提取模块用于对目标图像和参考图像进行不同尺度的特征提取，得到目标图像和参考图像在N个尺度下的第一特征信息，N为大于1的正整数，偏移值预测模块用于根据目标图像和参考图像在N个尺度下的第一特征信息进行多尺度预测，得到参考图像的偏移值，时域对齐模块用于根据参考图像的偏移值和第一特征信息进行时域对齐，得到参考图像的第二特征信息，质量增强模块用于根据参考图像的第二特征信息预测目标图像的增强图像。
上述质量增强网络的网络结构可以参照上述图8A至10D所示,具体参照上述实施例的描述,在此不再赘述。
应理解,图5至图13仅为本申请的示例,不应理解为对本申请的限制。
以上结合附图详细描述了本申请的优选实施方式,但是,本申请并不限于上述实施方式中的具体细节,在本申请的技术构思范围内,可以对本申请的技术方案进行多种简单变型,这些简单变型均属于本申请的保护范围。例如,在上述具体实施方式中所描述的各个具体技术特征,在不矛盾的情况下,可以通过任何合适的方式进行组合,为了避免不必要的重复,本申请对各种可能的组合方式不再另行说明。又例如,本申请的各种不同的实施方式之间也可以进行任意组合,只要其不违背本申请的思想,其同样应当视为本申请所公开的内容。
还应理解,在本申请的各种方法实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。另外,本申请实施例中,术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系。具体地,A和/或B可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
上文结合图5至图13对质量增强网络的网络结构以及图像处理方法进行了介绍,下文结合图14至图16,详细描述本申请的装置实施例。
图14是本申请一实施例提供的图像解码装置的示意性框图,该图像解码装置可以为图3所示的解码器,或者为解码器中的部件,例如为解码器中的处理器。
如图14所示,该图像解码装置10可包括:
解码单元11,用于解码码流,得到当前重建图像;
获取单元12,用于从已重建的图像中,获取所述当前重建图像的M个参考图像,所述M为正整数;
增强单元13,用于将所述当前重建图像和所述M个参考图像输入质量增强网络中,得到所述当前重建图像的增强图像。
其中,所述质量增强网络包括特征提取模块、偏移值预测模块、时域对齐模块和质量增强模块,所述特征提取模块用于对所述当前重建图像和所述参考图像分别进行不同尺度的特征提取,得到所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息,所述N为大于1的正整数,所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述参考图像的偏移值,所述时域对齐模块用于根据所述参考图像的偏移值和所述参考图像的第一特征信息进行时域对齐,得到所述参考图像的第二特征信息,所述质量增强模块用于根据所述参考图像的第二特征信息预测所述当前重建图像的增强图像。
在一些实施例中,所述时域对齐模块用于根据所述参考图像的偏移值和参考图像的第一特征信息进行多尺度时域对齐,得到所述参考图像在多个尺度下的第二特征信息。
在一些实施例中,所述特征提取模块包括N个第一特征提取单元;
其中,第i个第一特征提取单元用于输出所提取的第一图像在第N-i+1个尺度下的第一特征信息,并将所提取的所述第一图像在第N-i+1个尺度下的第一特征信息输入第i+1个第一特征提取单元中,以使第i+1个第一特征提取单元输出所述第一图像在第N-i+2个尺度下的第一特征信息,所述i为1至N-1的正整数,所述第一图像为所述当前重建图像和所述参考图像中的任一图像。
在一些实施例中,所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,分别得到所述当前重建图像和所述参考图像在第N个尺度下的偏移值,所述第N个尺度为所述N个尺度中的最大尺度;
所述时域对齐模块用于根据所述当前重建图像在第N个尺度下的偏移值和第一特征信息进行多尺度时域对齐,得到所述当前重建图像在多个尺度下的第二特征信息,以及根据所述参考图像在第N 个尺度下的偏移值和第一特征信息进行多尺度时域对齐,得到所述参考图像在多个尺度下的第二特征信息;
所述质量增强模块用于根据所述当前重建图像和所述参考图像分别在多个尺度下的第二特征信息,得到所述当前重建图像的增强图像。
在一些实施例中,所述偏移值预测模块包括N个第一预测单元;
其中，第j个第一预测单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值，得到所述当前重建图像和所述参考图像分别在第j+1个尺度下的偏移值，所述j为1至N-1的正整数；
第N个第一预测单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及第N-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值,得到所述第N个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值。
在一些实施例中,若所述第j个预测单元为所述N个预测单元中的第一个预测单元,则所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值为0。
在一些实施例中,若所述第j个预测单元为所述N个第一预测单元中的第一个第一预测单元,则所述第一个第一预测单元包括第一个第一预测子单元和第一个第一上采样子单元;
所述第一个第一预测子单元用于根据所述当前重建图像和所述参考图像分别在第一个尺度下的第一特征信息进行偏移值预测，预测所述当前重建图像和所述参考图像分别在第一个尺度下的偏移值；
所述第一个第一上采样子单元用于根据所述第一个第一预测子单元预测的所述当前重建图像和所述参考图像分别在第一个尺度下的偏移值进行上采样,得到所述当前重建图像和所述参考图像分别在第二个尺度下的偏移值。
在一些实施例中,若所述第j个第一预测单元为所述N个第一预测单元中除第一个第一预测单元之外的第一预测单元,则所述第j个第一预测单元包括第j个第一对齐子单元、第j个第一预测子单元、第j个第一上采样子单元;
所述第j个第一对齐子单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及第j-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像分别在第j个尺度下对齐的特征信息;
所述第j个第一预测子单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下对齐的特征信息进行偏移值预测,得到所述当前重建图像和所述参考图像分别在j个尺度下的偏移值;
所述第j个第一上采样子单元用于根据所述第j个第一预测子单元输出的所述当前重建图像和所述参考图像分别在j个尺度下的偏移值和第j-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值的和值进行上采样,得到所述当前重建图像和所述参考图像分别在j+1个尺度下的偏移值。
在一些实施例中,所述第N个第一预测单元包括第N个第一对齐子单元和第N个第一预测子单元;
所述第N个第一对齐子单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及所述第N-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像分别在第N个尺度下对齐的特征信息;
所述第N个第一预测子单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下对齐的特征信息进行偏移值预测,得到预测的所述当前重建图像和所述参考图像在第N个尺度下的偏移值;
所述第N个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值是根据所述第N个第一预测子单元预测的当前重建图像和所述参考图像分别在第N个尺度下的偏移值,以及第N-1个第一预测单元预测的当前重建图像和所述参考图像分别在第N个尺度下的偏移 值相加后确定的。
在一些实施例中,所述第一预测子单元为偏移值预测网络OPN。
在一些实施例中,所述第一对齐子单元为可变形卷积DCN。
在一些实施例中,所述时域对齐模块包括K个第一时域对齐单元和K-1个第一下采样单元,所述K为大于2的正整数;
其中,第k个第一时域对齐单元用于根据所述第一图像在第k个尺度下的偏移值和第一特征信息,得到所述第一图像在第k个尺度下的第二特征信息,所述k为K至2的正整数,当k=K时,所述第一图像在第k个尺度下的偏移值和第一特征信息为所述第一图像在第N个尺度下的偏移值和第一特征信息;
第k-1个第一下采样单元用于根据所述第一图像在第k个尺度下的偏移值和第一特征信息进行下采样,得到所述第一图像在第k-1个尺度下的偏移值和第一特征信息;
第k-1个第一时域对齐单元用于根据所述第一图像在第k-1个尺度下的偏移值和第一特征信息,得到所述第一图像在第k-1个尺度下的第二特征信息,直到k-1等于1为止。
在一些实施例中,所述第一时域对齐单元为可变形卷积DCN。
在一些实施例中,所述第一下采样单元为平均池化层。
在一些实施例中,所述质量增强模块包括K个第一增强单元和K-1个第一上采样单元;
第k+1个第一增强单元用于根据所述当前重建图像和所述参考图像分别在第k+1个尺度下的第二特征信息进行图像质量增强,得到所述当前重建图像在第k+1个尺度下的增强图像的初始预测值,所述k为1至K-1的正整数;
第k个第一上采样单元用于根据所述当前重建图像在第k个尺度下的增强图像的融合值进行上采样,得到所述当前重建图像在第k+1个尺度下的增强图像的上采样值,当所述k为1时,所述当前重建图像在第k个尺度下的增强图像的融合值为第一个第一增强单元根据所述当前重建图像和所述参考图像分别在第一个尺度下的第二特征信息,得到的所述当前重建图像在第一个尺度下的增强图像的初始预测值;
其中,所述当前重建图像在第k+1个尺度下的增强图像的融合值是根据所述当前重建图像在第k+1个尺度下的增强图像的上采样值和初始预测值进行融合后确定的,所述当前重建图像在第N个尺度下的增强图像的预测值是根据所述当前重建图像在第K个尺度下的增强图像的融合值确定的。
在一些实施例中,所述第一增强单元包括多个卷积层,且所述多个卷积层中的最后一个卷积层不包括激活函数。
在一些实施例中,所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述当前重建图像和所述参考图像分别在第N个尺度下的P组偏移值,所述P为正整数;
所述时域对齐模块用于将所述第一图像划分为P个图像块,并将所述P组偏移值一一分配给所述P个图像块,且根据所述图像块对应的一组偏移值和所述图像块的第一特征信息进行多尺度时域对齐,得到所述图像块在多个尺度下的第二特征信息,进而根据所述第一图像中图像块在第N个尺度下的多尺度第二特征信息,得到所述第一图像在第N个尺度下的多尺度第二特征信息。
在一些实施例中,所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述参考图像在第N个尺度下的偏移值,所述第N个尺度为所述N个尺度中的最大尺度;
所述时域对齐模块用于根据所述参考图像在第N个尺度下的偏移值和第一特征信息进行多尺度时域对齐,得到所述参考图像在多个尺度下的第二特征信息;
所述质量增强模块用于根据所述当前重建图像在多个尺度下的第一特征信息和所述参考图像在多个尺度下的第二特征信息,得到所述当前重建图像的增强图像。
在一些实施例中,所述偏移值预测模块包括N个第二预测单元;
其中，第j个第二预测单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及所述参考图像在第j个尺度下的偏移值，得到所述参考图像在第j+1个尺度下的偏移值，所述j为1至N-1的正整数；
第N个第二预测单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及第N-1个第二预测单元预测的所述参考图像在第N个尺度下的偏移值,得到所述第N个第二预测单元预测的所述参考图像在第N个尺度下的偏移值。
在一些实施例中,若所述第j个第二预测单元为所述N个第二预测单元中的第一个第二预测单元,则所述参考图像在第j-1个尺度下的偏移值为0。
在一些实施例中,若所述第j个第二预测单元为所述N个第二预测单元中的第一个第二预测单元,则所述第一个第二预测单元包括第一个第二预测子单元和第一个第二上采样子单元;
所述第一个第二预测子单元用于根据所述当前重建图像和所述参考图像分别在第一个尺度下的第一特征信息进行偏移值预测，得到所述参考图像在第一个尺度下的偏移值；
所述第一个第二上采样子单元用于根据所述参考图像在第一个尺度下的偏移值进行上采样，得到所述参考图像在第二个尺度下的偏移值。
在一些实施例中,若所述第j个第二预测单元为所述N个第二预测单元中除第一个第二预测单元之外的第二预测单元,则所述第j个第二预测单元包括第j个第二对齐子单元、第j个第二预测子单元、第j个第二上采样子单元;
所述第j个第二对齐子单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及第j-1个第二预测单元预测的参考图像在第j个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像在第j个尺度下对齐的特征信息;
所述第j个第二预测子单元用于根据所述当前重建图像和所述参考图像在第j个尺度下对齐的特征信息进行偏移值预测,得到所述参考图像在j个尺度下的偏移值;
所述第j个第二上采样子单元用于根据所述第j个第二预测子单元输出的所述参考图像在j个尺度下的偏移值和第j-1个第二预测单元预测的所述参考图像在第j个尺度下的偏移值的和值进行上采样，得到所述参考图像在j+1个尺度下的偏移值。
在一些实施例中,所述第N个第二预测单元包括第N个第二对齐子单元和第N个第二预测子单元;
所述第N个第二对齐子单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及所述第N-1个第二预测单元预测的所述参考图像在第N个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像在第N个尺度下对齐的特征信息;
所述第N个第二预测子单元用于根据所述当前重建图像和所述参考图像在第N个尺度下对齐的特征信息进行偏移值预测，得到所述第N个第二预测子单元预测的所述参考图像在第N个尺度下的偏移值；
所述第N个第二预测单元预测的所述参考图像在第N个尺度下的偏移值是根据所述第N个第二预测子单元预测的所述参考图像在第N个尺度下的偏移值,以及第N-1个第二预测单元预测的所述参考图像在第N个尺度下的偏移值相加后确定的。
在一些实施例中,所述第二预测子单元为偏移值预测网络OPN。
在一些实施例中,所述第二对齐子单元为可变形卷积DCN。
在一些实施例中,所述时域对齐模块包括K个第二时域对齐单元和K-1个第二下采样单元,所述K为大于2的正整数;
其中,第k个第二时域对齐单元用于根据所述参考图像在第k个尺度下的偏移值和第一特征信息,得到所述参考图像在第k个尺度下的第二特征信息,所述k为K至2的正整数,当k=K时,所述参考图像在第k个尺度下的偏移值和第一特征信息为所述参考图像在第N个尺度下的偏移值和第一特征信息;
第k-1个第二下采样单元用于根据所述参考图像在第k个尺度下的偏移值和第一特征信息进行下采样,得到所述参考图像在第k-1个尺度下的偏移值和第一特征信息;
第k-1个第二时域对齐单元用于根据所述参考图像在第k-1个尺度下的偏移值和第一特征信息,得到所述参考图像在第k-1个尺度下的第二特征信息,直到k-1等于1为止。
在一些实施例中,所述第二时域对齐单元为可变形卷积DCN。
在一些实施例中,所述第二下采样单元为平均池化层。
在一些实施例中,所述质量增强模块包括K个第二增强单元和K-1个第二上采样单元;
其中,第k+1个第二增强单元用于根据所述当前重建图像在第k+1个尺度下的第一特征信息和所述参考图像在第k+1个尺度下的第二特征信息进行图像质量增强,得到所述当前重建图像在第k+1个尺度下的增强图像的初始预测值,所述k为1至K-1的正整数;
第k个第二上采样单元用于根据所述当前重建图像在第k个尺度下的增强图像的融合值进行上采样,得到所述当前重建图像在第k+1个尺度下的增强图像的上采样值,当所述k为1时,所述当前重建图像在第k个尺度下的增强图像的融合值为第一个第二增强单元根据所述当前重建图像在第一个尺度下的第一特征信息和所述参考图像在第一个尺度下的第二特征信息,得到的所述当前重建图像在第一个尺度下的增强图像的初始预测值;
所述当前重建图像在第k+1个尺度下的增强图像的融合值是根据所述当前重建图像在第k+1个尺度下的增强图像的上采样值和初始预测值进行融合后确定的。
在一些实施例中，所述第二增强单元包括多个卷积层，且所述多个卷积层中的最后一个卷积层不包括激活函数。
在一些实施例中,所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述参考图像在第N个尺度下的P组偏移值,所述P为正整数;
所述时域对齐模块用于将所述参考图像划分为P个图像块,并将所述P组偏移值一一分配给所述P个图像块,且针对每一个图像块,根据所述图像块对应的一组偏移值和所述图像块的第一特征信息进行多尺度时域对齐,得到所述图像块在第N个尺度下的多尺度第二特征信息,进而根据所述参考图像中每个图像块在第N个尺度下的多尺度第二特征信息,得到所述参考图像在第N个尺度下的多尺度第二特征信息。
在一些实施例中,解码单元11,还用于解码码流,得到第一标记信息,所述第一标记信息用于指示是否使用所述质量增强网络对所述当前重建图像进行质量增强;
在所述第一标记信息指示使用所述质量增强网络对所述当前重建图像进行质量增强时,从已重建的图像中,获取所述当前重建图像的M个参考图像。
在一些实施例中,所述第一标记信息包含在序列参数集中。
在一些实施例中,获取单元12,具体用于从已重建的图像中,获取在播放顺序上位于所述当前重建图像的前向和/或后向的至少一个图像作为所述当前重建图像的参考图像。
可选的，所述当前重建图像与所述参考图像在播放顺序上连续。
应理解,装置实施例与方法实施例可以相互对应,类似的描述可以参照方法实施例。为避免重复,此处不再赘述。具体地,图14所示的解码装置10可以对应于执行本申请实施例的图像解码方法中的相应主体,并且解码装置10中的各个单元的前述和其它操作和/或功能分别为了实现图像解码方法中的相应流程,为了简洁,在此不再赘述。
图15是本申请一实施例提供的图像编码装置的示意性框图,该图像编码装置可以为图2所示的编码器,或者为编码器中的部件,例如为编码器中的处理器。
如图15所示,该图像编码装置20可包括:
第一获取单元21,用于获取待编码图像;
编码单元22,用于对所述待编码图像进行编码,得到所述待编码图像的当前重建图像;
第二获取单元23,用于从已重建的图像中,获取所述当前重建图像的M个参考图像,所述M为正整数;
增强单元24,用于将所述当前重建图像和所述M个参考图像输入质量增强网络中,得到所述当前重建图像的增强图像。
其中，所述质量增强网络包括特征提取模块、偏移值预测模块、时域对齐模块和质量增强模块，所述特征提取模块用于对所述当前重建图像和所述参考图像分别进行不同尺度的特征提取，得到所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息，所述N为大于1的正整数，所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测，得到所述参考图像的偏移值，所述时域对齐模块用于根据所述参考图像的偏移值和所述参考图像的第一特征信息进行时域对齐，得到所述参考图像的第二特征信息，所述质量增强模块用于根据所述参考图像的第二特征信息预测所述当前重建图像的增强图像。
在一些实施例中,所述时域对齐模块用于根据所述参考图像的偏移值和参考图像的第一特征信息进行多尺度时域对齐,得到所述参考图像在多个尺度下的第二特征信息。
在一些实施例中,所述特征提取模块包括N个第一特征提取单元;
其中,第i个第一特征提取单元用于输出所提取的第一图像在第N-i+1个尺度下的第一特征信息,并将所提取的所述第一图像在第N-i+1个尺度下的第一特征信息输入第i+1个第一特征提取单元中,以使第i+1个第一特征提取单元输出所述第一图像在第N-i+2个尺度下的第一特征信息,所述i为1至N-1的正整数,所述第一图像为所述当前重建图像和所述参考图像中的任一图像。
在一些实施例中,所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,分别得到所述当前重建图像和所述参考图像在第N个尺度下的偏移值,所述第N个尺度为所述N个尺度中的最大尺度;
所述时域对齐模块用于根据所述当前重建图像在第N个尺度下的偏移值和第一特征信息进行多尺度时域对齐,得到所述当前重建图像在多个尺度下的第二特征信息,以及根据所述参考图像在第N个尺度下的偏移值和第一特征信息进行多尺度时域对齐,得到所述参考图像在多个尺度下的第二特征信息;
所述质量增强模块用于根据所述当前重建图像和所述参考图像分别在多个尺度下的第二特征信息,得到所述当前重建图像的增强图像。
在一些实施例中,所述偏移值预测模块包括N个第一预测单元;
其中，第j个第一预测单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值，得到所述当前重建图像和所述参考图像分别在第j+1个尺度下的偏移值，所述j为1至N-1的正整数；
第N个第一预测单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及第N-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值,得到所述第N个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值。
在一些实施例中,若所述第j个预测单元为所述N个预测单元中的第一个预测单元,则所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值为0。
在一些实施例中,若所述第j个预测单元为所述N个第一预测单元中的第一个第一预测单元,则所述第一个第一预测单元包括第一个第一预测子单元和第一个第一上采样子单元;
所述第一个第一预测子单元用于根据所述当前重建图像和所述参考图像分别在第一个尺度下的第一特征信息进行偏移值预测，预测所述当前重建图像和所述参考图像分别在第一个尺度下的偏移值；
所述第一个第一上采样子单元用于根据所述第一个第一预测子单元预测的所述当前重建图像和所述参考图像分别在第一个尺度下的偏移值进行上采样,得到所述当前重建图像和所述参考图像分别在第二个尺度下的偏移值。
在一些实施例中,若所述第j个第一预测单元为所述N个第一预测单元中除第一个第一预测单元之外的第一预测单元,则所述第j个第一预测单元包括第j个第一对齐子单元、第j个第一预测子单元、第j个第一上采样子单元;
所述第j个第一对齐子单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及第j-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像分别在第j个尺度下对齐的特征信息;
所述第j个第一预测子单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下对齐的特征信息进行偏移值预测,得到所述当前重建图像和所述参考图像分别在j个尺度下的偏移值;
所述第j个第一上采样子单元用于根据所述第j个第一预测子单元输出的所述当前重建图像和所述参考图像分别在j个尺度下的偏移值和第j-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值的和值进行上采样，得到所述当前重建图像和所述参考图像分别在j+1个尺度下的偏移值。
在一些实施例中,所述第N个第一预测单元包括第N个第一对齐子单元和第N个第一预测子单元;
所述第N个第一对齐子单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及所述第N-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像分别在第N个尺度下对齐的特征信息;
所述第N个第一预测子单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下对齐的特征信息进行偏移值预测,得到预测的所述当前重建图像和所述参考图像在第N个尺度下的偏移值;
所述第N个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值是根据所述第N个第一预测子单元预测的当前重建图像和所述参考图像分别在第N个尺度下的偏移值,以及第N-1个第一预测单元预测的当前重建图像和所述参考图像分别在第N个尺度下的偏移值相加后确定的。
在一些实施例中,所述第一预测子单元为偏移值预测网络OPN。
在一些实施例中,所述第一对齐子单元为可变形卷积DCN。
在一些实施例中,所述时域对齐模块包括K个第一时域对齐单元和K-1个第一下采样单元,所述K为大于2的正整数;
其中,第k个第一时域对齐单元用于根据所述第一图像在第k个尺度下的偏移值和第一特征信息,得到所述第一图像在第k个尺度下的第二特征信息,所述k为K至2的正整数,当k=K时,所述第一图像在第k个尺度下的偏移值和第一特征信息为所述第一图像在第N个尺度下的偏移值和第一特征信息;
第k-1个第一下采样单元用于根据所述第一图像在第k个尺度下的偏移值和第一特征信息进行下采样,得到所述第一图像在第k-1个尺度下的偏移值和第一特征信息;
第k-1个第一时域对齐单元用于根据所述第一图像在第k-1个尺度下的偏移值和第一特征信息,得到所述第一图像在第k-1个尺度下的第二特征信息,直到k-1等于1为止。
在一些实施例中,所述第一时域对齐单元为可变形卷积DCN。
在一些实施例中,所述第一下采样单元为平均池化层。
在一些实施例中,所述质量增强模块包括K个第一增强单元和K-1个第一上采样单元;
第k+1个第一增强单元用于根据所述当前重建图像和所述参考图像分别在第k+1个尺度下的第二特征信息进行图像质量增强,得到所述当前重建图像在第k+1个尺度下的增强图像的初始预测值,所述k为1至K-1的正整数;
第k个第一上采样单元用于根据所述当前重建图像在第k个尺度下的增强图像的融合值进行上采样,得到所述当前重建图像在第k+1个尺度下的增强图像的上采样值,当所述k为1时,所述当前重建图像在第k个尺度下的增强图像的融合值为第一个第一增强单元根据所述当前重建图像和所述参考图像分别在第一个尺度下的第二特征信息,得到的所述当前重建图像在第一个尺度下的增强图像的初始预测值;
其中,所述当前重建图像在第k+1个尺度下的增强图像的融合值是根据所述当前重建图像在第k+1个尺度下的增强图像的上采样值和初始预测值进行融合后确定的,所述当前重建图像在第N个尺度下的增强图像的预测值是根据所述当前重建图像在第K个尺度下的增强图像的融合值确定的。
在一些实施例中,所述第一增强单元包括多个卷积层,且所述多个卷积层中的最后一个卷积层不包括激活函数。
在一些实施例中，所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测，得到所述当前重建图像和所述参考图像分别在第N个尺度下的P组偏移值，所述P为正整数；
所述时域对齐模块用于将所述第一图像划分为P个图像块,并将所述P组偏移值一一分配给所述P个图像块,且根据所述图像块对应的一组偏移值和所述图像块的第一特征信息进行多尺度时域对齐,得到所述图像块在多个尺度下的第二特征信息,进而根据所述第一图像中图像块在第N个尺度下的多尺度第二特征信息,得到所述第一图像在第N个尺度下的多尺度第二特征信息。
在一些实施例中,所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述参考图像在第N个尺度下的偏移值,所述第N个尺度为所述N个尺度中的最大尺度;
所述时域对齐模块用于根据所述参考图像在第N个尺度下的偏移值和第一特征信息进行多尺度时域对齐,得到所述参考图像在多个尺度下的第二特征信息;
所述质量增强模块用于根据所述当前重建图像在多个尺度下的第一特征信息和所述参考图像在多个尺度下的第二特征信息,得到所述当前重建图像的增强图像。
在一些实施例中,所述偏移值预测模块包括N个第二预测单元;
其中,第j个第二预测单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及所述参考图像在第j个尺度下的偏移值,得到所述参考图像在第j+1个尺度下的偏移值,所述j为1至N-1的正整数;
第N个第二预测单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及第N-1个第二预测单元预测的所述参考图像在第N个尺度下的偏移值,得到所述第N个第二预测单元预测的所述参考图像在第N个尺度下的偏移值。
在一些实施例中,若所述第j个第二预测单元为所述N个第二预测单元中的第一个第二预测单元,则所述参考图像在第j-1个尺度下的偏移值为0。
在一些实施例中,若所述第j个第二预测单元为所述N个第二预测单元中的第一个第二预测单元,则所述第一个第二预测单元包括第一个第二预测子单元和第一个第二上采样子单元;
所述第一个第二预测子单元用于根据所述当前重建图像和所述参考图像分别在第一个尺度下的第一特征信息进行偏移值预测，得到所述参考图像在第一个尺度下的偏移值；
所述第一个第二上采样子单元用于根据所述参考图像在第一个尺度下的偏移值进行上采样，得到所述参考图像在第二个尺度下的偏移值。
在一些实施例中,若所述第j个第二预测单元为所述N个第二预测单元中除第一个第二预测单元之外的第二预测单元,则所述第j个第二预测单元包括第j个第二对齐子单元、第j个第二预测子单元、第j个第二上采样子单元;
所述第j个第二对齐子单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及第j-1个第二预测单元预测的参考图像在第j个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像在第j个尺度下对齐的特征信息;
所述第j个第二预测子单元用于根据所述当前重建图像和所述参考图像在第j个尺度下对齐的特征信息进行偏移值预测,得到所述参考图像在j个尺度下的偏移值;
所述第j个第二上采样子单元用于根据所述第j个第二预测子单元输出的所述参考图像在j个尺度下的偏移值和第j-1个第二预测单元预测的所述参考图像在第j个尺度下的偏移值的和值进行上采样，得到所述参考图像在j+1个尺度下的偏移值。
在一些实施例中,所述第N个第二预测单元包括第N个第二对齐子单元和第N个第二预测子单元;
所述第N个第二对齐子单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及所述第N-1个第二预测单元预测的所述参考图像在第N个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像在第N个尺度下对齐的特征信息;
所述第N个第二预测子单元用于根据所述当前重建图像和所述参考图像在第N个尺度下对齐的特征信息进行偏移值预测，得到所述第N个第二预测子单元预测的所述参考图像在第N个尺度下的偏移值；
所述第N个第二预测单元预测的所述参考图像在第N个尺度下的偏移值是根据所述第N个第二预测子单元预测的所述参考图像在第N个尺度下的偏移值，以及第N-1个第二预测单元预测的所述参考图像在第N个尺度下的偏移值相加后确定的。
在一些实施例中,所述第二预测子单元为偏移值预测网络OPN。
在一些实施例中,所述第二对齐子单元为可变形卷积DCN。
在一些实施例中,所述时域对齐模块包括K个第二时域对齐单元和K-1个第二下采样单元,所述K为大于2的正整数;
其中,第k个第二时域对齐单元用于根据所述参考图像在第k个尺度下的偏移值和第一特征信息,得到所述参考图像在第k个尺度下的第二特征信息,所述k为K至2的正整数,当k=K时,所述参考图像在第k个尺度下的偏移值和第一特征信息为所述参考图像在第N个尺度下的偏移值和第一特征信息;
第k-1个第二下采样单元用于根据所述参考图像在第k个尺度下的偏移值和第一特征信息进行下采样,得到所述参考图像在第k-1个尺度下的偏移值和第一特征信息;
第k-1个第二时域对齐单元用于根据所述参考图像在第k-1个尺度下的偏移值和第一特征信息,得到所述参考图像在第k-1个尺度下的第二特征信息,直到k-1等于1为止。
在一些实施例中,所述第二时域对齐单元为可变形卷积DCN。
在一些实施例中,所述第二下采样单元为平均池化层。
在一些实施例中,所述质量增强模块包括K个第二增强单元和K-1个第二上采样单元;
其中,第k+1个第二增强单元用于根据所述当前重建图像在第k+1个尺度下的第一特征信息和所述参考图像在第k+1个尺度下的第二特征信息进行图像质量增强,得到所述当前重建图像在第k+1个尺度下的增强图像的初始预测值,所述k为1至K-1的正整数;
第k个第二上采样单元用于根据所述当前重建图像在第k个尺度下的增强图像的融合值进行上采样,得到所述当前重建图像在第k+1个尺度下的增强图像的上采样值,当所述k为1时,所述当前重建图像在第k个尺度下的增强图像的融合值为第一个第二增强单元根据所述当前重建图像在第一个尺度下的第一特征信息和所述参考图像在第一个尺度下的第二特征信息,得到的所述当前重建图像在第一个尺度下的增强图像的初始预测值;
所述当前重建图像在第k+1个尺度下的增强图像的融合值是根据所述当前重建图像在第k+1个尺度下的增强图像的上采样值和初始预测值进行融合后确定的。
在一些实施例中，所述第二增强单元包括多个卷积层，且所述多个卷积层中的最后一个卷积层不包括激活函数。
在一些实施例中,所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述参考图像在第N个尺度下的P组偏移值,所述P为正整数;
所述时域对齐模块用于将所述参考图像划分为P个图像块,并将所述P组偏移值一一分配给所述P个图像块,且针对每一个图像块,根据所述图像块对应的一组偏移值和所述图像块的第一特征信息进行多尺度时域对齐,得到所述图像块在第N个尺度下的多尺度第二特征信息,进而根据所述参考图像中每个图像块在第N个尺度下的多尺度第二特征信息,得到所述参考图像在第N个尺度下的多尺度第二特征信息。
在一些实施例中，所述第二获取单元23，还用于获取第一标记信息，所述第一标记信息用于指示是否使用所述质量增强网络对所述当前重建图像进行质量增强；并在所述第一标记信息指示使用所述质量增强网络对所述当前重建图像进行质量增强时，从已重建的图像中，获取所述当前重建图像的M个参考图像。
在一些实施例中,所述第一标记信息包含在序列参数集中。
在一些实施例中,第二获取单元23,具体用于从已重建的图像中,获取在播放顺序上位于所述当前重建图像的前向和/或后向的至少一个图像作为所述当前重建图像的参考图像。
可选的，所述当前重建图像与所述当前重建图像的M个参考图像在播放顺序上为连续图像。
应理解，装置实施例与方法实施例可以相互对应，类似的描述可以参照方法实施例。为避免重复，此处不再赘述。具体地，图15所示的编码装置20可以对应于执行本申请实施例的图像编码方法中的相应主体，并且编码装置20中的各个单元的前述和其它操作和/或功能分别为了实现图像编码方法中的相应流程，为了简洁，在此不再赘述。
图16是本申请一实施例提供的图像处理装置的示意性框图,该图像处理装置可以为图像处理设备,例如视频采集设备或视频播放设备。
如图16所示,该图像处理装置50可包括:
获取单元51,用于获取待增强的目标图像,以及所述目标图像的M个参考图像,所述M为正整数;
增强单元52,用于将所述目标图像和所述M个参考图像输入质量增强网络中,得到所述目标图像的增强图像。
其中,所述质量增强网络包括特征提取模块、偏移值预测模块、时域对齐模块和质量增强模块,所述特征提取模块用于对所述目标图像和所述参考图像分别进行不同尺度的特征提取,分别得到所述目标图像和所述参考图像在N个尺度下的第一特征信息,所述N为大于1的正整数,所述偏移值预测模块用于根据所述目标图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述参考图像的偏移值,所述时域对齐模块用于根据所述参考图像的偏移值和所述参考图像的第一特征信息进行时域对齐,得到所述参考图像的第二特征信息,所述质量增强模块用于根据所述参考图像的第二特征信息预测所述目标图像的增强图像。
其中,质量增强网络的具体结构参照上述实施例的描述,在此不再赘述。
应理解,装置实施例与方法实施例可以相互对应,类似的描述可以参照方法实施例。为避免重复,此处不再赘述。具体地,图16所示的图像处理装置50可以对应于执行本申请实施例的图像处理方法中的相应主体,并且图像处理装置50中的各个单元的前述和其它操作和/或功能分别为了实现图像处理方法中的相应流程,为了简洁,在此不再赘述。
图17是本申请一实施例提供的模型训练装置的示意性框图,该模型训练装置可以为计算设备,或者为计算设备中的处理器。
如图17所示,该模型训练装置40用于训练质量增强网络,所述质量增强网络包括特征提取模块、偏移值预测模块、时域对齐模块和质量增强模块,模型训练装置40可包括:
获取单元41，用于获取M+1个图像，所述M+1个图像包括待增强图像以及所述待增强图像的M个参考图像，所述M为正整数；
特征提取单元42,用于将待增强图像以及待增强图像的M个参考图像输入特征提取模块分别进行不同尺度的特征提取,分别得到待增强图像和参考图像在N个尺度下的第一特征信息,所述N为大于1的正整数;
偏移值预测单元43,用于根据待增强图像和参考图像分别在N个尺度下的第一特征信息,通过偏移值预测模块进行多尺度预测,得到参考图像的偏移值;
时域对齐单元44,用于根据参考图像的偏移值和参考图像的第一特征信息,通过时域对齐模块中进行时域对齐,得到参考图像的第二特征信息;
增强单元45,用于根据参考图像的第二特征信息,通过质量增强模块,得到待增强图像的增强图像的预测值;
训练单元46,用于根据待增强图像的增强图像的预测值和待增强图像的增强图像的真值,对质量增强网络进行训练。
应理解,装置实施例与方法实施例可以相互对应,类似的描述可以参照方法实施例。为避免重复,此处不再赘述。具体地,图17所示的模型训练装置40可以对应于执行本申请实施例的模型训练方法中的相应主体,并且模型训练装置40中的各个单元的前述和其它操作和/或功能分别为了实现模型训练方法中的相应流程,为了简洁,在此不再赘述。
上文中结合附图从功能单元的角度描述了本申请实施例的装置和系统。应理解，该功能单元可以通过硬件形式实现，也可以通过软件形式的指令实现，还可以通过硬件和软件单元组合实现。具体地，本申请实施例中的方法实施例的各步骤可以通过处理器中的硬件的集成逻辑电路和/或软件形式的指令完成，结合本申请实施例公开的方法的步骤可以直接体现为硬件译码处理器执行完成，或者用译码处理器中的硬件及软件单元组合执行完成。可选地，软件单元可以位于随机存储器，闪存、只读存储器、可编程只读存储器、电可擦写可编程存储器、寄存器等本领域的成熟的存储介质中。该存储介质位于存储器，处理器读取存储器中的信息，结合其硬件完成上述方法实施例中的步骤。
图18是本申请实施例提供的电子设备的示意性框图。
如图18所示,该电子设备30可以为本申请实施例所述的图像处理设备,或者解码器,或者编码器,或者为模型训练设备,该电子设备30可包括:
存储器33和处理器32，该存储器33用于存储计算机程序34，并将该计算机程序34传输给该处理器32。换言之，该处理器32可以从存储器33中调用并运行计算机程序34，以实现本申请实施例中的方法。
例如,该处理器32可用于根据该计算机程序34中的指令执行上述方法200中的步骤。
在本申请的一些实施例中,该处理器32可以包括但不限于:
通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等等。
在本申请的一些实施例中,该存储器33包括但不限于:
易失性存储器和/或非易失性存储器。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synch link DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。
在本申请的一些实施例中,该计算机程序34可以被分割成一个或多个单元,该一个或者多个单元被存储在该存储器33中,并由该处理器32执行,以完成本申请提供的方法。该一个或多个单元可以是能够完成特定功能的一系列计算机程序指令段,该指令段用于描述该计算机程序34在该电子设备30中的执行过程。
如图18所示,该电子设备30还可包括:
收发器33,该收发器33可连接至该处理器32或存储器33。
其中,处理器32可以控制该收发器33与其他设备进行通信,具体地,可以向其他设备发送信息或数据,或接收其他设备发送的信息或数据。收发器33可以包括发射机和接收机。收发器33还可以进一步包括天线,天线的数量可以为一个或多个。
应当理解,该电子设备30中的各个组件通过总线系统相连,其中,总线系统除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。
本申请还提供了一种计算机存储介质,其上存储有计算机程序,该计算机程序被计算机执行时使得该计算机能够执行上述方法实施例的方法。或者说,本申请实施例还提供一种包含指令的计算机程序产品,该指令被计算机执行时使得计算机执行上述方法实施例的方法。
当使用软件实现时，可以全部或部分地以计算机程序产品的形式实现。该计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行该计算机程序指令时，全部或部分地产生按照本申请实施例所述的流程或功能。该计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。该计算机指令可以存储在计算机可读存储介质中，或者从一个计算机可读存储介质向另一个计算机可读存储介质传输，例如，该计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。该计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。该可用介质可以是磁性介质(例如，软盘、硬盘、磁带)、光介质(例如数字视频光盘(digital video disc,DVD))、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,该单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。例如,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
以上内容,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以该权利要求的保护范围为准。

Claims (80)

  1. 一种图像解码方法,其特征在于,包括:
    解码码流,得到当前重建图像;
    从已重建的图像中,获取所述当前重建图像的M个参考图像,所述M为正整数;
    将所述当前重建图像和所述M个参考图像输入质量增强网络中,得到所述当前重建图像的增强图像。
  2. 根据权利要求1所述的方法,其特征在于,所述质量增强网络包括特征提取模块、偏移值预测模块、时域对齐模块和质量增强模块,所述特征提取模块用于对所述当前重建图像和所述参考图像分别进行不同尺度的特征提取,得到所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息,所述N为大于1的正整数,所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述参考图像的偏移值,所述时域对齐模块用于根据所述参考图像的偏移值和所述参考图像的第一特征信息进行时域对齐,得到所述参考图像的第二特征信息,所述质量增强模块用于根据所述参考图像的第二特征信息预测所述当前重建图像的增强图像。
  3. 根据权利要求2所述的方法,其特征在于,所述时域对齐模块用于根据所述参考图像的偏移值和所述参考图像的第一特征信息进行多尺度时域对齐,得到所述参考图像在多个尺度下的第二特征信息。
  4. 根据权利要求3所述的方法,其特征在于,所述特征提取模块包括N个第一特征提取单元;
    其中,第i个第一特征提取单元用于输出所提取的第一图像在第N-i+1个尺度下的第一特征信息,并将所提取的所述第一图像在第N-i+1个尺度下的第一特征信息输入第i+1个第一特征提取单元中,以使第i+1个第一特征提取单元输出所述第一图像在第N-i+2个尺度下的第一特征信息,所述i为1至N-1的正整数,所述第一图像为所述当前重建图像和所述参考图像中的任一图像。
  5. 根据权利要求4所述的方法,其特征在于,
    所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,分别得到所述当前重建图像和所述参考图像在第N个尺度下的偏移值,所述第N个尺度为所述N个尺度中的最大尺度;
    所述时域对齐模块用于根据所述当前重建图像在第N个尺度下的偏移值和第一特征信息进行多尺度时域对齐,得到所述当前重建图像在多个尺度下的第二特征信息,以及根据所述参考图像在第N个尺度下的偏移值和第一特征信息进行多尺度时域对齐,得到所述参考图像在多个尺度下的第二特征信息;
    所述质量增强模块用于根据所述当前重建图像和所述参考图像分别在多个尺度下的第二特征信息,得到所述当前重建图像的增强图像。
  6. 根据权利要求5所述的方法,其特征在于,所述偏移值预测模块包括N个第一预测单元;
    其中,第j个第一预测单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值,得到所述当前重建图像和所述参考图像分别在第j+1个尺度下的偏移值,所述j为1至N-1的正整数;
    第N个第一预测单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及第N-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值,得到所述第N个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值。
  7. 根据权利要求6所述的方法,其特征在于,若所述第j个预测单元为所述N个预测单元中的第一个预测单元,则所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值为0。
  8. 根据权利要求6所述的方法,其特征在于,若所述第j个预测单元为所述N个第一预测单元中的第一个第一预测单元,则所述第一个第一预测单元包括第一个第一预测子单元和第一个第一上采样子单元;
    所述第一个第一预测子单元用于根据所述当前重建图像和所述参考图像分别在第一个尺度下的第一特征信息进行偏移值预测，预测所述当前重建图像和所述参考图像分别在第一个尺度下的偏移值；
    所述第一个第一上采样子单元用于根据所述第一个第一预测子单元预测的所述当前重建图像和所述参考图像分别在第一个尺度下的偏移值进行上采样,得到所述当前重建图像和所述参考图像分别在第二个尺度下的偏移值。
  9. 根据权利要求6所述的方法,其特征在于,若所述第j个第一预测单元为所述N个第一预测单元中除第一个第一预测单元之外的第一预测单元,则所述第j个第一预测单元包括第j个第一对齐子单元、第j个第一预测子单元、第j个第一上采样子单元;
    所述第j个第一对齐子单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及第j-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像在第j个尺度下对齐的特征信息;
    所述第j个第一预测子单元用于根据所述当前重建图像和所述参考图像在第j个尺度下对齐的特征信息进行偏移值预测,得到所述当前重建图像和所述参考图像分别在j个尺度下的偏移值;
    所述第j个第一上采样子单元用于根据所述第j个第一预测子单元输出的所述当前重建图像和所述参考图像分别在j个尺度下的偏移值和第j-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值的和值进行上采样,得到所述当前重建图像和所述参考图像分别在j+1个尺度下的偏移值。
  10. 根据权利要求6所述的方法,其特征在于,所述第N个第一预测单元包括第N个第一对齐子单元和第N个第一预测子单元;
    所述第N个第一对齐子单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及所述第N-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像分别在第N个尺度下对齐的特征信息;
    所述第N个第一预测子单元用于根据所述当前重建图像和所述参考图像在第N个尺度下对齐的特征信息进行偏移值预测,得到预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值;
    所述第N个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值是根据所述第N个第一预测子单元预测的当前重建图像和所述参考图像分别在第N个尺度下的偏移值,以及第N-1个第一预测单元预测的当前重建图像和所述参考图像分别在第N个尺度下的偏移值相加后确定的。
  11. 根据权利要求8-10任一项所述的方法,其特征在于,所述第一预测子单元为偏移值预测网络OPN。
  12. 根据权利要求9或10所述的方法,其特征在于,所述第一对齐子单元为可变形卷积DCN。
  13. 根据权利要求5所述的方法,其特征在于,所述时域对齐模块包括K个第一时域对齐单元和K-1个第一下采样单元,所述K为大于2的正整数;
    其中,第k个第一时域对齐单元用于根据所述第一图像在第k个尺度下的偏移值和第一特征信息,得到所述第一图像在第k个尺度下的第二特征信息,所述k为K至2的正整数,当k=K时,所述第一图像在第k个尺度下的偏移值和第一特征信息为所述第一图像在第N个尺度下的偏移值和第一特征信息;
    第k-1个第一下采样单元用于根据所述第一图像在第k个尺度下的偏移值和第一特征信息进行下采样,得到所述第一图像在第k-1个尺度下的偏移值和第一特征信息;
    第k-1个第一时域对齐单元用于根据所述第一图像在第k-1个尺度下的偏移值和第一特征信息,得到所述第一图像在第k-1个尺度下的第二特征信息,直到k-1等于1为止。
  14. 根据权利要求13所述的方法,其特征在于,所述第一时域对齐单元为可变形卷积DCN。
  15. 根据权利要求13所述的方法,其特征在于,所述第一下采样单元为平均池化层。
  16. 根据权利要求13所述的方法,其特征在于,所述质量增强模块包括K个第一增强单元和K-1个第一上采样单元;
    第k+1个第一增强单元用于根据所述当前重建图像和所述参考图像分别在第k+1个尺度下的第二特征信息进行图像质量增强,得到所述当前重建图像在第k+1个尺度下的增强图像的初始预测值,所述k为1至K-1的正整数;
    第k个第一上采样单元用于根据所述当前重建图像在第k个尺度下的增强图像的融合值进行上采样,得到所述当前重建图像在第k+1个尺度下的增强图像的上采样值,当所述k为1时,所述当前重建图像在第k个尺度下的增强图像的融合值为第一个第一增强单元根据所述当前重建图像和所述参考图像分别在第一个尺度下的第二特征信息,得到的所述当前重建图像在第一个尺度下的增强图像的初始预测值;
    其中,所述当前重建图像在第k+1个尺度下的增强图像的融合值是根据所述当前重建图像在第k+1个尺度下的增强图像的上采样值和初始预测值进行融合后确定的,所述当前重建图像在第N个尺度下的增强图像的预测值是根据所述当前重建图像在第K个尺度下的增强图像的融合值确定的。
  17. 根据权利要求16所述的方法,其特征在于,所述第一增强单元包括多个卷积层,且所述多个卷积层中的最后一个卷积层不包括激活函数。
  18. 根据权利要求5所述的方法,其特征在于,所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述当前重建图像和所述参考图像分别在第N个尺度下的P组偏移值,所述P为正整数;
    所述时域对齐模块用于将所述第一图像划分为P个图像块,并将所述P组偏移值一一分配给所述P个图像块,且根据所述图像块对应的一组偏移值和所述图像块的第一特征信息进行多尺度时域对齐,得到所述图像块在多个尺度下的第二特征信息,进而根据所述第一图像中图像块在第N个尺度下的多尺度第二特征信息,得到所述第一图像在第N个尺度下的多尺度第二特征信息。
  19. 根据权利要求4所述的方法,其特征在于,
    所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述参考图像在第N个尺度下的偏移值,所述第N个尺度为所述N个尺度中的最大尺度;
    所述时域对齐模块用于根据所述参考图像在第N个尺度下的偏移值和所述参考图像在第N个尺度下的第一特征信息进行多尺度时域对齐,得到所述参考图像在多个尺度下的第二特征信息;
    所述质量增强模块用于根据所述当前重建图像在多个尺度下的第一特征信息和所述参考图像在多个尺度下的第二特征信息,得到所述当前重建图像的增强图像。
  20. 根据权利要求19所述的方法,其特征在于,所述偏移值预测模块包括N个第二预测单元;
    其中,第j个第二预测单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及所述参考图像在第j个尺度下的偏移值,得到所述参考图像在第j+1个尺度下的偏移值,所述j为1至N-1的正整数;
    第N个第二预测单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及第N-1个第二预测单元预测的所述参考图像在第N个尺度下的偏移值,得到所述第N个第二预测单元预测的所述参考图像在第N个尺度下的偏移值。
  21. 根据权利要求20所述的方法,其特征在于,若所述第j个第二预测单元为所述N个第二预测单元中的第一个第二预测单元,则所述参考图像在第j-1个尺度下的偏移值为0。
  22. 根据权利要求21所述的方法,其特征在于,若所述第j个第二预测单元为所述N个第二预测单元中的第一个第二预测单元,则所述第一个第二预测单元包括第一个第二预测子单元和第一个第二上采样子单元;
    所述第一个第二预测子单元用于根据所述当前重建图像和所述参考图像分别在第一个尺度下的第一特征信息进行偏移值预测，得到所述参考图像在第一个尺度下的偏移值；
    所述第一个第二上采样子单元用于根据所述参考图像在第一个尺度下的偏移值进行上采样，得到所述参考图像在第二个尺度下的偏移值。
  23. 根据权利要求20所述的方法，其特征在于，若所述第j个第二预测单元为所述N个第二预测单元中除第一个第二预测单元之外的第二预测单元，则所述第j个第二预测单元包括第j个第二对齐子单元、第j个第二预测子单元、第j个第二上采样子单元；
    所述第j个第二对齐子单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及第j-1个第二预测单元预测的参考图像在第j个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像在第j个尺度下对齐的特征信息;
    所述第j个第二预测子单元用于根据所述当前重建图像和所述参考图像在第j个尺度下对齐的特征信息进行偏移值预测,得到所述参考图像在j个尺度下的偏移值;
    所述第j个第二上采样子单元用于根据所述第j个第二预测子单元输出的所述参考图像在j个尺度下的偏移值和第j-1个第二预测单元预测的所述参考图像在第j个尺度下的偏移值的和值进行上采样，得到所述参考图像在j+1个尺度下的偏移值。
  24. 根据权利要求20所述的方法,其特征在于,所述第N个第二预测单元包括第N个第二对齐子单元和第N个第二预测子单元;
    所述第N个第二对齐子单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及所述第N-1个第二预测单元预测的所述参考图像在第N个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像在第N个尺度下对齐的特征信息;
    所述第N个第二预测子单元用于根据所述当前重建图像和所述参考图像在第N个尺度下对齐的特征信息进行偏移值预测，得到所述第N个第二预测子单元预测的所述参考图像在第N个尺度下的偏移值；
    所述第N个第二预测单元预测的所述参考图像在第N个尺度下的偏移值是根据所述第N个第二预测子单元预测的所述参考图像在第N个尺度下的偏移值,以及第N-1个第二预测单元预测的所述参考图像在第N个尺度下的偏移值相加后确定的。
  25. 根据权利要求22-24任一项所述的方法,其特征在于,所述第二预测子单元为偏移值预测网络OPN。
  26. 根据权利要求23或24任一项所述的方法,其特征在于,所述第二对齐子单元为可变形卷积DCN。
  27. 根据权利要求20所述的方法,其特征在于,所述时域对齐模块包括K个第二时域对齐单元和K-1个第二下采样单元,所述K为大于2的正整数;
    其中,第k个第二时域对齐单元用于根据所述参考图像在第k个尺度下的偏移值和第一特征信息,得到所述参考图像在第k个尺度下的第二特征信息,所述k为K至2的正整数,当k=K时,所述参考图像在第k个尺度下的偏移值和第一特征信息为所述参考图像在第N个尺度下的偏移值和第一特征信息;
    第k-1个第二下采样单元用于根据所述参考图像在第k个尺度下的偏移值和第一特征信息进行下采样,得到所述参考图像在第k-1个尺度下的偏移值和第一特征信息;
    第k-1个第二时域对齐单元用于根据所述参考图像在第k-1个尺度下的偏移值和第一特征信息,得到所述参考图像在第k-1个尺度下的第二特征信息,直到k-1等于1为止。
  28. 根据权利要求27所述的方法,其特征在于,所述第二时域对齐单元为可变形卷积DCN。
  29. 根据权利要求27所述的方法,其特征在于,所述第二下采样单元为平均池化层。
  30. 根据权利要求27所述的方法,其特征在于,所述质量增强模块包括K个第二增强单元和K-1个第二上采样单元;
    其中,第k+1个第二增强单元用于根据所述当前重建图像在第k+1个尺度下的第一特征信息和所述参考图像在第k+1个尺度下的第二特征信息进行图像质量增强,得到所述当前重建图像在第k+1个尺度下的增强图像的初始预测值,所述k为1至K-1的正整数;
    第k个第二上采样单元用于根据所述当前重建图像在第k个尺度下的增强图像的融合值进行上采样,得到所述当前重建图像在第k+1个尺度下的增强图像的上采样值,当所述k为1时,所述当前重建图像在第k个尺度下的增强图像的融合值为第一个第二增强单元根据所述当前重建图像在第一个尺度下的第一特征信息和所述参考图像在第一个尺度下的第二特征信息,得到的所述当前重建图像在第一个尺度下的增强图像的初始预测值;
    所述当前重建图像在第k+1个尺度下的增强图像的融合值是根据所述当前重建图像在第k+1个尺度下的增强图像的上采样值和初始预测值进行融合后确定的。
  31. 根据权利要求30所述的方法,其特征在于,所述第二增强单元包括多个卷积层,且所述多个卷积层中的最后一个卷积层不包括激活函数。
  32. 根据权利要求19所述的方法,其特征在于,
    所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述参考图像在第N个尺度下的P组偏移值,所述P为正整数;
    所述时域对齐模块用于将所述参考图像划分为P个图像块,并将所述P组偏移值一一分配给所述P个图像块,且针对每一个图像块,根据所述图像块对应的一组偏移值和所述图像块的第一特征信息进行多尺度时域对齐,得到所述图像块在第N个尺度下的多尺度第二特征信息,进而根据所述参考图像中每个图像块在第N个尺度下的多尺度第二特征信息,得到所述参考图像在第N个尺度下的多尺度第二特征信息。
  33. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    解码码流,得到第一标记信息,所述第一标记信息用于指示是否使用所述质量增强网络对所述当前重建图像进行质量增强;
    在所述第一标记信息指示使用所述质量增强网络对所述当前重建图像进行质量增强时,从已重建的图像中,获取所述当前重建图像的M个参考图像。
  34. 根据权利要求33所述的方法,其特征在于,所述第一标记信息包含在序列参数集中。
  35. 根据权利要求1所述的方法，其特征在于，所述从已重建的图像中，获取所述当前重建图像的M个参考图像，包括：
    从已重建的图像中,获取在播放顺序上位于所述当前重建图像的前向和/或后向的至少一个图像作为所述当前重建图像的参考图像。
  36. 根据权利要求35所述的方法,其特征在于,所述当前重建图像与所述参考图像在播放顺序上连续。
  37. 一种图像编码方法,其特征在于,包括:
    获取待编码图像;
    对所述待编码图像进行编码,得到所述待编码图像的当前重建图像;
    从已重建的图像中,获取所述当前重建图像的M个参考图像,所述M为正整数;
    将所述当前重建图像和所述M个参考图像输入质量增强网络中,得到所述当前重建图像的增强图像。
  38. 根据权利要求37所述的方法,其特征在于,所述质量增强网络包括特征提取模块、偏移值预测模块、时域对齐模块和质量增强模块,所述特征提取模块用于对所述当前重建图像和所述参考图像分别进行不同尺度的特征提取,分别得到所述当前重建图像和所述参考图像在N个尺度下的第一特征信息,所述N为大于1的正整数,所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述参考图像的偏移值,所述时域对齐模块用于根据所述参考图像的偏移值和所述参考图像的第一特征信息进行时域对齐,得到所述参考图像的第二特征信息,所述质量增强模块用于根据所述参考图像的第二特征信息预测所述当前重建图像的增强图像。
  39. 根据权利要求38所述的方法,其特征在于,所述时域对齐模块用于根据所述参考图像的偏移值和第一特征信息进行多尺度时域对齐,得到所述参考图像在多个尺度下的第二特征信息。
  40. 根据权利要求39所述的方法,其特征在于,所述特征提取模块包括N个第一特征提取单元;
    其中,第i个第一特征提取单元用于输出所提取的第一图像在第N-i+1个尺度下的第一特征信息,并将所提取的所述第一图像在第N-i+1个尺度下的第一特征信息输入第i+1个第一特征提取单元中,以使第i+1个第一特征提取单元输出所述第一图像在第N-i+2个尺度下的第一特征信息,所述i为1至N-1的正整数,所述第一图像为所述当前重建图像和所述参考图像中的任一图像。
  41. 根据权利要求40所述的方法,其特征在于,
    所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息分别进行多尺度预测,得到所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值,所述第N个尺度为所述N个尺度中的最大尺度;
    所述时域对齐模块用于根据所述当前重建图像在第N个尺度下的偏移值和第一特征信息进行多尺度时域对齐,得到所述当前重建图像在多个尺度下的第二特征信息,以及根据所述参考图像在第N个尺度下的偏移值和第一特征信息进行多尺度时域对齐,得到所述参考图像在多个尺度下的第二特征信息;
    所述质量增强模块用于根据所述当前重建图像和所述参考图像分别在多个尺度下的第二特征信息,得到所述当前重建图像的增强图像。
  42. 根据权利要求41所述的方法,其特征在于,所述偏移值预测模块包括N个第一预测单元;
    其中,第j个第一预测单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值,得到所述当前重建图像和所述参考图像分别在第j+1个尺度下的偏移值,所述j为1至N-1的正整数;
    第N个第一预测单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及第N-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值,得到所述第N个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值。
  43. 根据权利要求42所述的方法,其特征在于,若所述第j个预测单元为所述N个预测单元中的第一个预测单元,则所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值为0。
  44. 根据权利要求42所述的方法,其特征在于,若所述第j个预测单元为所述N个第一预测单元中的第一个第一预测单元,则所述第一个第一预测单元包括第一个第一预测子单元和第一个第一上采样子单元;
    所述第一个第一预测子单元用于根据所述当前重建图像和所述参考图像分别在第一个尺度下的第一特征信息进行偏移值预测,预测所述当前重建图像和所述参考图像分别在第一个尺度下的偏移值;
    所述第一个第一上采样子单元用于根据所述第一个第一预测子单元预测的所述当前重建图像和所述参考图像分别在第一个尺度下的偏移值进行上采样,得到所述当前重建图像和所述参考图像分别在第二个尺度下的偏移值。
  45. 根据权利要求42所述的方法,其特征在于,若所述第j个第一预测单元为所述N个第一预测单元中除第一个第一预测单元之外的第一预测单元,则所述第j个第一预测单元包括第j个第一对齐子单元、第j个第一预测子单元、第j个第一上采样子单元;
    所述第j个第一对齐子单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及第j-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像在第j个尺度下对齐的特征信息;
    所述第j个第一预测子单元用于根据所述当前重建图像和所述参考图像在第j个尺度下对齐的特征信息进行偏移值预测,得到所述当前重建图像和所述参考图像分别在j个尺度下的偏移值;
    所述第j个第一上采样子单元用于根据所述第j个第一预测子单元输出的所述当前重建图像和所述参考图像分别在j个尺度下的偏移值和第j-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第j个尺度下的偏移值的和值进行上采样,得到所述当前重建图像和所述参考图像分别在j+1个尺度下的偏移值。
  46. 根据权利要求42所述的方法,其特征在于,所述第N个第一预测单元包括第N个第一对齐子单元和第N个第一预测子单元;
    所述第N个第一对齐子单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及所述第N-1个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像在第N个尺度下对齐的特征信息;
    所述第N个第一预测子单元用于根据所述当前重建图像和所述参考图像在第N个尺度下对齐的特征信息进行偏移值预测,得到预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值;
    所述第N个第一预测单元预测的所述当前重建图像和所述参考图像分别在第N个尺度下的偏移值是根据所述第N个第一预测子单元预测的当前重建图像和所述参考图像分别在第N个尺度下的偏移值,以及第N-1个第一预测单元预测的当前重建图像和所述参考图像分别在第N个尺度下的偏移值相加后确定的。
  47. 根据权利要求44-46任一项所述的方法,其特征在于,所述第一预测子单元为偏移值预测网络OPN。
  48. 根据权利要求45或46所述的方法,其特征在于,所述第一对齐子单元为可变形卷积DCN。
  49. 根据权利要求41所述的方法,其特征在于,所述时域对齐模块包括K个第一时域对齐单元和K-1个第一下采样单元,所述K为大于2的正整数;
    其中,第k个第一时域对齐单元用于根据所述第一图像在第k个尺度下的偏移值和第一特征信息,得到所述第一图像在第k个尺度下的第二特征信息,所述k为K至2的正整数,当k=K时,所述第一图像在第k个尺度下的偏移值和第一特征信息为所述第一图像在第N个尺度下的偏移值和第一特征信息;
    第k-1个第一下采样单元用于根据所述第一图像在第k个尺度下的偏移值和第一特征信息进行下采样,得到所述第一图像在第k-1个尺度下的偏移值和第一特征信息;
    第k-1个第一时域对齐单元用于根据所述第一图像在第k-1个尺度下的偏移值和第一特征信息,得到所述第一图像在第k-1个尺度下的第二特征信息,直到k-1等于1为止。
  50. 根据权利要求49所述的方法,其特征在于,所述第一时域对齐单元为可变形卷积DCN。
  51. 根据权利要求49所述的方法,其特征在于,所述第一下采样单元为平均池化层。
  52. 根据权利要求49所述的方法,其特征在于,所述质量增强模块包括K个第一增强单元和K-1个第一上采样单元;
    第k+1个第一增强单元用于根据所述当前重建图像和所述参考图像分别在第k+1个尺度下的第二特征信息进行图像质量增强,得到所述当前重建图像在第k+1个尺度下的增强图像的初始预测值,所述k为1至K-1的正整数;
    第k个第一上采样单元用于根据所述当前重建图像在第k个尺度下的增强图像的融合值进行上采样,得到所述当前重建图像在第k+1个尺度下的增强图像的上采样值,当所述k为1时,所述当前重建图像在第k个尺度下的增强图像的融合值为第一个第一增强单元根据所述当前重建图像和所述参考图像分别在第一个尺度下的第二特征信息,得到的所述当前重建图像在第一个尺度下的增强图像的初始预测值;
    其中,所述当前重建图像在第k+1个尺度下的增强图像的融合值是根据所述当前重建图像在第k+1个尺度下的增强图像的上采样值和初始预测值进行融合后确定的,所述当前重建图像在第N个尺度下的增强图像的预测值是根据所述当前重建图像在第K个尺度下的增强图像的融合值确定的。
  53. 根据权利要求52所述的方法,其特征在于,所述第一增强单元包括多个卷积层,且所述多个卷积层中最后一个卷积层不包括激活函数。
  54. 根据权利要求41所述的方法,其特征在于,所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述当前重建图像和所述参考图像分别在第N个尺度下的P组偏移值,所述P为正整数;
    所述时域对齐模块用于将所述第一图像划分为P个图像块,并将所述P组偏移值一一分配给所述P个图像块,且根据所述图像块对应的一组偏移值和所述图像块的第一特征信息进行多尺度时域对齐,得到所述图像块在多个尺度下的第二特征信息,进而根据所述第一图像中图像块在第N个尺度下的多尺度第二特征信息,得到所述第一图像在第N个尺度下的多尺度第二特征信息。
  55. 根据权利要求40所述的方法，其特征在于，
    所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测，得到所述参考图像在第N个尺度下的偏移值，所述第N个尺度为所述N个尺度中的最大尺度；
    所述时域对齐模块用于根据所述参考图像在第N个尺度下的偏移值和第一特征信息进行多尺度时域对齐,得到所述参考图像在多个尺度下的第二特征信息;
    所述质量增强模块用于根据所述当前重建图像在多个尺度下的第一特征信息和所述参考图像在多个尺度下的第二特征信息,得到所述当前重建图像的增强图像。
  56. 根据权利要求55所述的方法,其特征在于,所述偏移值预测模块包括N个第二预测单元;
    其中,第j个第二预测单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及所述参考图像在第j个尺度下的偏移值,得到所述参考图像在第j+1个尺度下的偏移值,所述j为1至N-1的正整数;
    第N个第二预测单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及第N-1个第二预测单元预测的所述参考图像在第N个尺度下的偏移值,得到所述第N个第二预测单元预测的所述参考图像在第N个尺度下的偏移值。
  57. 根据权利要求56所述的方法,其特征在于,若所述第j个第二预测单元为所述N个第二预测单元中的第一个第二预测单元,则所述参考图像在第j-1个尺度下的偏移值为0。
  58. 根据权利要求57所述的方法,其特征在于,若所述第j个第二预测单元为所述N个第二预测单元中的第一个第二预测单元,则所述第一个第二预测单元包括第一个第二预测子单元和第一个第二上采样子单元;
    所述第一个第二预测子单元用于根据所述当前重建图像和所述参考图像分别在第一个尺度下的第一特征信息进行偏移值预测，得到所述参考图像在第一个尺度下的偏移值；
    所述第一个第二上采样子单元用于根据所述参考图像在第一个尺度下的偏移值进行上采样，得到所述参考图像在第二个尺度下的偏移值。
  59. 根据权利要求56所述的方法,其特征在于,若所述第j个第二预测单元为所述N个第二预测单元中除第一个第二预测单元之外的第二预测单元,则所述第j个第二预测单元包括第j个第二对齐子单元、第j个第二预测子单元、第j个第二上采样子单元;
    所述第j个第二对齐子单元用于根据所述当前重建图像和所述参考图像分别在第j个尺度下的第一特征信息、以及第j-1个第二预测单元预测的参考图像在第j个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像在第j个尺度下对齐的特征信息;
    所述第j个第二预测子单元用于根据所述当前重建图像和所述参考图像在第j个尺度下对齐的特征信息进行偏移值预测,得到所述参考图像在j个尺度下的偏移值;
    所述第j个第二上采样子单元用于根据所述第j个第二预测子单元输出的所述参考图像在j个尺度下的偏移值和第j-1个第二预测单元预测的所述参考图像在第j个尺度下的偏移值的和值进行上采样，得到所述参考图像在j+1个尺度下的偏移值。
  60. 根据权利要求56所述的方法,其特征在于,所述第N个第二预测单元包括第N个第二对齐子单元和第N个第二预测子单元;
    所述第N个第二对齐子单元用于根据所述当前重建图像和所述参考图像分别在第N个尺度下的第一特征信息、以及所述第N-1个第二预测单元预测的所述参考图像在第N个尺度下的偏移值进行时域特征对齐,得到所述当前重建图像和所述参考图像在第N个尺度下对齐的特征信息;
    所述第N个第二预测子单元用于根据所述当前重建图像和所述参考图像在第N个尺度下对齐的特征信息进行偏移值预测，得到所述第N个第二预测子单元预测的所述参考图像在第N个尺度下的偏移值；
    所述第N个第二预测单元预测的所述参考图像在第N个尺度下的偏移值是根据所述第N个第二预测子单元预测的所述参考图像在第N个尺度下的偏移值,以及第N-1个第二预测单元预测的所述参考图像在第N个尺度下的偏移值相加后确定的。
  61. 根据权利要求58-60任一项所述的方法,其特征在于,所述第二预测子单元为偏移值预测网络OPN。
  62. 根据权利要求59或60任一项所述的方法,其特征在于,所述第二对齐子单元为可变形卷积 DCN。
  63. 根据权利要求56所述的方法,其特征在于,所述时域对齐模块包括K个第二时域对齐单元和K-1个第二下采样单元,所述K为大于2的正整数;
    其中,第k个第二时域对齐单元用于根据所述参考图像在第k个尺度下的偏移值和第一特征信息,得到所述参考图像在第k个尺度下的第二特征信息,所述k为K至2的正整数,当k=K时,所述参考图像在第k个尺度下的偏移值和第一特征信息为所述参考图像在第N个尺度下的偏移值和第一特征信息;
    第k-1个第二下采样单元用于根据所述参考图像在第k个尺度下的偏移值和第一特征信息进行下采样,得到所述参考图像在第k-1个尺度下的偏移值和第一特征信息;
    第k-1个第二时域对齐单元用于根据所述参考图像在第k-1个尺度下的偏移值和第一特征信息,得到所述参考图像在第k-1个尺度下的第二特征信息,直到k-1等于1为止。
  64. 根据权利要求63所述的方法,其特征在于,所述第二时域对齐单元为可变形卷积DCN。
  65. 根据权利要求63所述的方法,其特征在于,所述第二下采样单元为平均池化层。
  66. 根据权利要求63所述的方法,其特征在于,所述质量增强模块包括K个第二增强单元和K-1个第二上采样单元;
    其中,第k+1个第二增强单元用于根据所述当前重建图像在第k+1个尺度下的第一特征信息和所述参考图像在第k+1个尺度下的第二特征信息进行图像质量增强,得到所述当前重建图像在第k+1个尺度下的增强图像的初始预测值,所述k为1至K-1的正整数;
    第k个第二上采样单元用于根据所述当前重建图像在第k个尺度下的增强图像的融合值进行上采样,得到所述当前重建图像在第k+1个尺度下的增强图像的上采样值,当所述k为1时,所述当前重建图像在第k个尺度下的增强图像的融合值为第一个第二增强单元根据所述当前重建图像在第一个尺度下的第一特征信息和所述参考图像在第一个尺度下的第二特征信息,得到的所述当前重建图像在第一个尺度下的增强图像的初始预测值;
    所述当前重建图像在第k+1个尺度下的增强图像的融合值是根据所述当前重建图像在第k+1个尺度下的增强图像的上采样值和初始预测值进行融合后确定的。
  67. 根据权利要求66所述的方法,其特征在于,所述第二增强单元包括多个卷积层,且所述多个卷积层中的最后一个卷积层不包括激活函数。
  68. 根据权利要求55所述的方法，其特征在于，
    所述偏移值预测模块用于根据所述当前重建图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述参考图像在第N个尺度下的P组偏移值,所述P为正整数;
    所述时域对齐模块用于将所述参考图像划分为P个图像块,并将所述P组偏移值一一分配给所述P个图像块,且针对每一个图像块,根据所述图像块对应的一组偏移值和所述图像块的第一特征信息进行多尺度时域对齐,得到所述图像块在第N个尺度下的多尺度第二特征信息,进而根据所述参考图像中每个图像块在第N个尺度下的多尺度第二特征信息,得到所述参考图像在第N个尺度下的多尺度第二特征信息。
  69. 根据权利要求37所述的方法,其特征在于,所述方法还包括:
    获取第一标记信息,所述第一标记信息用于指示是否使用所述质量增强网络对所述当前重建图像进行质量增强;
    在所述第一标记信息指示使用所述质量增强网络对所述当前重建图像进行质量增强时,从已重建的图像中,获取所述当前重建图像的M个参考图像。
  70. 根据权利要求69所述的方法,其特征在于,所述第一标记信息包含在序列参数集中。
  71. 根据权利要求37所述的方法，其特征在于，所述从已重建的图像中，获取所述当前重建图像的M个参考图像，包括：
    从已重建的图像中,获取在播放顺序上位于所述当前重建图像的前向和/或后向的至少一个图像作为所述当前重建图像的参考图像。
  72. 根据权利要求71所述的方法,其特征在于,所述当前重建图像与所述参考图像在播放顺序上连续。
  73. 一种图像处理方法,其特征在于,包括:
    获取待增强的目标图像,以及所述目标图像的M个参考图像,所述M为正整数;
    将所述目标图像和所述M个参考图像输入质量增强网络中,得到所述目标图像的增强图像。
  74. 根据权利要求73所述的方法,其特征在于,所述质量增强网络包括特征提取模块、偏移值预测模块、时域对齐模块和质量增强模块,所述特征提取模块用于对所述目标图像和所述参考图像分别进行不同尺度的特征提取,分别得到所述目标图像和所述参考图像在N个尺度下的第一特征信息,所述N为大于1的正整数,所述偏移值预测模块用于根据所述目标图像和所述参考图像分别在N个尺度下的第一特征信息进行多尺度预测,得到所述参考图像的偏移值,所述时域对齐模块用于根据所述参考图像的偏移值和所述参考图像的第一特征信息进行时域对齐,得到所述参考图像的第二特征信息,所述质量增强模块用于根据所述参考图像的第二特征信息预测所述目标图像的增强图像。
  75. 一种图像解码装置,其特征在于,包括:
    解码单元,用于解码码流,得到当前重建图像;
    获取单元,用于从已重建的图像中,获取所述当前重建图像的M个参考图像,所述M为正整数;
    增强单元,用于将所述当前重建图像和所述M个参考图像输入质量增强网络中,得到所述当前重建图像的增强图像。
  76. 一种图像编码装置,其特征在于,包括:
    第一获取单元,用于获取待编码图像;
    编码单元,用于对所述待编码图像进行编码,得到所述待编码图像的当前重建图像;
    第二获取单元,用于从已重建的图像中,获取所述当前重建图像的M个参考图像,所述M为正整数;
    增强单元,用于将所述当前重建图像和所述M个参考图像输入质量增强网络中,得到所述当前重建图像的增强图像。
  77. 一种图像处理装置,其特征在于,包括:
    获取单元,用于获取待增强的目标图像,以及所述目标图像的M个参考图像,所述M为正整数;
    增强单元,用于将所述目标图像和所述M个参考图像输入质量增强网络中,得到所述目标图像的增强图像。
  78. 一种解码器,其特征在于,包括:处理器和存储器;
    所述存储器用于存储计算机程序;
    所述处理器用于调用并运行所述存储器中存储的计算机程序,以执行如权利要求1-36任一项所述的方法。
  79. 一种编码器,其特征在于,包括:处理器和存储器;
    所述存储器用于存储计算机程序;
    所述处理器用于调用并运行所述存储器中存储的计算机程序，以执行如权利要求37-72任一项所述的方法。
  80. 一种计算机可读存储介质,其特征在于,用于存储计算机程序,所述计算机程序使得计算机执行如权利要求1至36或37至72或73至74任一项所述的方法。