CN117441186A - Image decoding and processing method, device and equipment

Info

Publication number: CN117441186A
Application number: CN202180097934.XA
Authority: CN (China)
Prior art keywords: information, channel, characteristic information, image, feature
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 元辉, 姜世奇, 杨烨, 李明
Current assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Original assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Publication of CN117441186A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application provides an image decoding and processing method, device and equipment. The method includes: decoding a code stream to obtain a reconstructed image; and inputting the reconstructed image into a dynamic conversion model for dynamic conversion to obtain a high dynamic range (HDR) image of the reconstructed image. The dynamic conversion model includes N coding modules and N decoding modules, where the i-th coding module is skip-connected to the (N-i+1)-th decoding module. The i-th coding module performs feature extraction on the (i-1)-th first feature information output by the (i-1)-th coding module to obtain the i-th first feature information of the reconstructed image, and the (N-i+1)-th decoding module performs feature extraction on the (i-1)-th first feature information and the (N-i)-th second feature information to obtain the (N-i+1)-th second feature information. The HDR image is determined according to the second feature information output by the last decoding module. The application uses the dynamic conversion model to convert a low dynamic range image into a high dynamic range image, with a simple process and low cost.

Description

Image decoding and processing method, device and equipment

Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image decoding and processing method, device, and apparatus.
Background
Dynamic range is a term used to define how widely a camera can capture tonal details of an image, generally referring to the range from the lowest value to the highest overflow value. Briefly, it describes the ratio between the brightest and darkest shades that a camera can record within a single frame. The greater the dynamic range, the more information may be retained in the highlight and shadow regions.
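As a purely illustrative worked example (the luminance values below are assumptions, not figures from this application), the dynamic range can be expressed as a ratio or in photographic stops:

    DR_stops = log2(L_max / L_min)
    For example, L_max = 10 000 cd/m^2 and L_min = 0.1 cd/m^2 give a ratio of 10^5, i.e. DR_stops = log2(10^5) ≈ 16.6 stops.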
However, the acquisition of the high dynamic range image is relatively complex, and higher requirements are also put on hardware and algorithms in terms of data acquisition, transmission, storage, display and the like, so that the conversion cost for converting the low dynamic range image into the high dynamic range image is high at present.
Disclosure of Invention
The embodiment of the application provides an image decoding and processing method, device and equipment, so as to reduce the cost of converting a low dynamic range image into a high dynamic range image.
In a first aspect, an embodiment of the present application provides an image decoding method, including:
decoding the code stream to obtain a reconstructed image;
inputting the reconstructed image into a dynamic conversion model for dynamic conversion to obtain a high dynamic range HDR image of the reconstructed image;
Wherein the dynamic conversion model includes N coding modules connected in series and N decoding modules connected in series, the output of the last coding module among the N coding modules is connected to the input of the first decoding module among the N decoding modules, and the i-th coding module is skip-connected to the (N-i+1)-th decoding module. The i-th coding module is used for performing feature extraction on the (i-1)-th first feature information output by the (i-1)-th coding module to obtain the i-th first feature information of the reconstructed image, the (N-i+1)-th decoding module is used for performing feature extraction on the (i-1)-th first feature information and the (N-i)-th second feature information of the reconstructed image to obtain the (N-i+1)-th second feature information of the reconstructed image, the HDR image of the reconstructed image is determined according to the second feature information output by the last decoding module among the N decoding modules, i is a positive integer less than or equal to N, and N is a positive integer.
In a second aspect, the present application provides an image processing method, including:
acquiring a low dynamic range LDR image to be processed;
inputting the LDR image into a dynamic conversion model for dynamic conversion to obtain a high dynamic range HDR image of the LDR image;
wherein the dynamic conversion model includes N coding modules connected in series and N decoding modules connected in series, the output of the last coding module among the N coding modules is connected to the input of the first decoding module among the N decoding modules, and the i-th coding module is skip-connected to the (N-i+1)-th decoding module. The i-th coding module is used for performing feature extraction on the (i-1)-th first feature information output by the (i-1)-th coding module to obtain the i-th first feature information of the LDR image, the (N-i+1)-th decoding module is used for performing feature extraction on the (i-1)-th first feature information and the (N-i)-th second feature information of the LDR image to obtain the (N-i+1)-th second feature information of the LDR image, the HDR image of the LDR image is determined according to the second feature information output by the last decoding module among the N decoding modules, i is a positive integer less than or equal to N, and N is a positive integer.
In a third aspect, the present application provides a model training method, including:
acquiring a low dynamic range LDR training image and a high dynamic range HDR image truth value of the LDR training image;
inputting the LDR training image into a dynamic conversion model, and performing feature extraction on the (i-1)-th first feature information through the i-th coding module to obtain the i-th first feature information of the LDR training image, wherein the dynamic conversion model includes N coding modules connected in series and N decoding modules connected in series, the output of the last coding module among the N coding modules is connected to the input of the first decoding module among the N decoding modules, the i-th coding module is skip-connected to the (N-i+1)-th decoding module, i is a positive integer less than or equal to N, and N is a positive integer;
performing feature extraction on the (i-1)-th first feature information and the (N-i)-th second feature information of the LDR training image through the (N-i+1)-th decoding module to obtain the (N-i+1)-th second feature information of the LDR training image;
determining an HDR image predicted value of the LDR training image according to the second characteristic information of the LDR training image output by the last decoding module in the N decoding modules;
and determining loss between the HDR image predicted value of the LDR training image and the HDR image true value of the LDR training image, and training the dynamic conversion model according to the loss.
In a fourth aspect, an image decoding apparatus is provided for performing the method of the first aspect or each implementation thereof. Specifically, the image decoding apparatus includes a functional unit for performing the method of the first aspect or its respective implementation forms.
In a fifth aspect, a decoder is provided that includes a processor and a memory. The memory is for storing a computer program and the processor is for calling and running the computer program stored in the memory for performing the method of the first aspect or implementations thereof.
In a sixth aspect, an image processing apparatus is provided for performing the method of the second aspect or each implementation thereof. In particular, the apparatus comprises functional units for performing the method of the second aspect described above or in various implementations thereof.
In a seventh aspect, an image processing apparatus is provided that includes a processor and a memory. The memory is for storing a computer program and the processor is for invoking and running the computer program stored in the memory to perform the method of the second aspect or implementations thereof described above.
In an eighth aspect, a model training apparatus is provided for performing the method of the third aspect or implementations thereof. In particular, the model training apparatus comprises functional units for performing the method of the third aspect described above or implementations thereof.
In a ninth aspect, a model training apparatus is provided that includes a processor and a memory. The memory is for storing a computer program and the processor is for calling and running the computer program stored in the memory for performing the method of the third aspect or implementations thereof.
In a tenth aspect, a chip is provided for implementing the method in any one of the first to third aspects or each implementation thereof. Specifically, the chip includes: a processor for calling and running a computer program from a memory, causing a device on which the chip is mounted to perform the method as in any one of the first to third aspects or implementations thereof described above.
In an eleventh aspect, a computer-readable storage medium is provided for storing a computer program for causing a computer to perform the method of any one of the above first to third aspects or implementations thereof.
In a twelfth aspect, there is provided a computer program product comprising computer program instructions for causing a computer to perform the method of any one of the above first to third aspects or implementations thereof.
In a thirteenth aspect, there is provided a computer program which, when run on a computer, causes the computer to perform the method of any one of the above-described first to third aspects or implementations thereof.
Based on the above technical solution, the dynamic conversion model includes N coding modules connected in series and N decoding modules connected in series, where the output of the last coding module among the N coding modules is connected to the input of the first decoding module among the N decoding modules, and the i-th coding module is skip-connected to the (N-i+1)-th decoding module. The i-th coding module is used for performing feature extraction on the (i-1)-th first feature information output by the (i-1)-th coding module to obtain the i-th first feature information of the reconstructed image, the (N-i+1)-th decoding module is used for performing feature extraction on the (i-1)-th first feature information and the (N-i)-th second feature information of the reconstructed image to obtain the (N-i+1)-th second feature information of the reconstructed image, the HDR image of the reconstructed image is determined according to the second feature information output by the last decoding module among the N decoding modules, and i is a positive integer less than or equal to N. The dynamic conversion model can thus be used to convert an LDR image into an HDR image, so that HDR images are obtained without increasing the cost of data acquisition, encoding, transmission, storage and the like, the efficiency of HDR image conversion is improved, and the cost of obtaining the HDR image is reduced.
Drawings
Fig. 1 is a schematic block diagram of a video codec system according to an embodiment of the present application;
FIG. 2 is a schematic block diagram of a video encoder provided by an embodiment of the present application;
FIG. 3 is a schematic block diagram of a video decoder provided by an embodiment of the present application;
FIG. 4 is a flowchart of a training method for a dynamic conversion model according to an embodiment of the present application;
FIG. 5A is a schematic diagram of a network of dynamic transition models according to an embodiment of the present application;
FIG. 5B is a network diagram of convolution blocks according to one embodiment of the present disclosure;
FIG. 5C is a schematic diagram of a network of dynamic transition models according to an embodiment of the present application;
FIG. 5D is a network diagram of a convolution attention module according to one embodiment of the present disclosure;
FIG. 5E is a network diagram of a channel attention module according to one embodiment of the present application;
FIG. 5F is a network diagram of a spatial attention module according to one embodiment of the present application;
FIG. 5G is a schematic diagram of a network of dynamic transition models according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating an image decoding method according to an embodiment of the present disclosure;
FIG. 7 is a network diagram of a spatial attention module according to one embodiment of the present application;
FIG. 8 is a flowchart of an image processing method according to an embodiment of the present disclosure;
fig. 9 is a schematic block diagram of an image decoding apparatus provided in an embodiment of the present application;
fig. 10 is a schematic block diagram of an image processing apparatus provided in an embodiment of the present application;
FIG. 11 is a schematic block diagram of a model training apparatus provided by an embodiment of the present application;
fig. 12 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The method and the device can be applied to the technical field of point cloud up-sampling, for example, the technical field of point cloud compression.
The method and the device can be applied to the fields of image encoding and decoding, video encoding and decoding, hardware video encoding and decoding, special-purpose-circuit video encoding and decoding, real-time video encoding and decoding, and the like. For example, the schemes of the present application may be incorporated into video coding standards such as the audio video coding standard (AVS), the H.264/Advanced Video Coding (AVC) standard, the H.265/High Efficiency Video Coding (HEVC) standard, and the H.266/Versatile Video Coding (VVC) standard. Alternatively, the schemes of the present application may operate in conjunction with other proprietary or industry standards, including ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), including its Scalable Video Coding (SVC) and Multiview Video Coding (MVC) extensions. It should be understood that the techniques of this application are not limited to any particular codec standard or technique.
For ease of understanding, a video codec system according to an embodiment of the present application will be described first with reference to fig. 1.
Fig. 1 is a schematic block diagram of a video codec system according to an embodiment of the present application. It should be noted that fig. 1 is only an example, and the video codec system of the embodiment of the present application includes, but is not limited to, the one shown in fig. 1. As shown in fig. 1, the video codec system 100 includes an encoding device 110 and a decoding device 120. Wherein the encoding device is arranged to encode (which may be understood as compressing) the video data to generate a code stream and to transmit the code stream to the decoding device. The decoding device decodes the code stream generated by the encoding device to obtain decoded video data.
The encoding device 110 of the present embodiment may be understood as a device having a video encoding function, and the decoding device 120 may be understood as a device having a video decoding function; that is, the encoding device 110 and the decoding device 120 in this embodiment cover a wide range of apparatuses, including, for example, smart phones, desktop computers, mobile computing devices, notebook (e.g., laptop) computers, tablet computers, set-top boxes, televisions, cameras, display devices, digital media players, video game consoles, vehicle-mounted computers, and the like.
In some embodiments, the encoding device 110 may transmit the encoded video data (e.g., a bitstream) to the decoding device 120 via the channel 130. Channel 130 may include one or more media and/or devices capable of transmitting encoded video data from encoding device 110 to decoding device 120.
In one example, channel 130 includes one or more communication media that enable encoding device 110 to transmit encoded video data directly to decoding device 120 in real-time. In this example, the encoding apparatus 110 may modulate the encoded video data according to a communication standard and transmit the modulated video data to the decoding apparatus 120. Where the communication medium comprises a wireless communication medium, such as a radio frequency spectrum, the communication medium may optionally also comprise a wired communication medium, such as one or more physical transmission lines.
In another example, channel 130 includes a storage medium that may store video data encoded by encoding device 110. Storage media include a variety of locally accessed data storage media such as compact discs, DVDs, flash memory, and the like. In this example, the decoding device 120 may obtain encoded video data from the storage medium.
In another example, channel 130 may comprise a storage server that may store video data encoded by the encoding device 110. In this example, the decoding device 120 may download the stored encoded video data from the storage server. Alternatively, the storage server may store the encoded video data and transmit it to the decoding device 120; the storage server may be, for example, a web server (e.g., for a website), a file transfer protocol (FTP) server, or the like.
In some embodiments, the encoding apparatus 110 includes a video encoder 112 and an output interface 113. Wherein the output interface 113 may comprise a modulator/demodulator (modem) and/or a transmitter.
In some embodiments, the encoding device 110 may include a video source 111 in addition to the video encoder 112 and the output interface 113.
Video source 111 may include at least one of a video capture device (e.g., a video camera), a video archive, a video input interface for receiving video data from a video content provider, a computer graphics system for generating video data.
The video encoder 112 encodes video data from the video source 111 to produce a bitstream. The video data may include one or more pictures (pictures) or sequences of pictures (sequence of pictures). The code stream contains encoded information of the image or image sequence in the form of a bit stream. The encoded information may include encoded image data and associated data. The associated data may include a sequence parameter set (sequence parameter set, SPS for short), a picture parameter set (picture parameter set, PPS for short), and other syntax structures. An SPS may contain parameters that apply to one or more sequences. PPS may contain parameters that apply to one or more pictures. A syntax structure refers to a set of zero or more syntax elements arranged in a specified order in a bitstream.
The video encoder 112 directly transmits the encoded video data to the decoding apparatus 120 via the output interface 113. The encoded video data may also be stored on a storage medium or storage server for subsequent reading by the decoding device 120.
In some embodiments, decoding apparatus 120 includes an input interface 121 and a video decoder 122.
In some embodiments, decoding apparatus 120 may include a display device 123 in addition to input interface 121 and video decoder 122.
Wherein the input interface 121 comprises a receiver and/or a modem. The input interface 121 may receive encoded video data through the channel 130.
The video decoder 122 is configured to decode the encoded video data to obtain decoded video data, and transmit the decoded video data to the display device 123.
The display device 123 displays the decoded video data. The display device 123 may be integral with the decoding apparatus 120 or external to the decoding apparatus 120. The display device 123 may include a variety of display devices, such as a Liquid Crystal Display (LCD), a plasma display, an Organic Light Emitting Diode (OLED) display, or other types of display devices.
In addition, fig. 1 is merely an example, and the technical solution of the embodiment of the present application is not limited to fig. 1, for example, the technology of the present application may also be applied to single-side video encoding or single-side video decoding.
The following describes a video encoder according to an embodiment of the present application.
Fig. 2 is a schematic block diagram of a video encoder provided by an embodiment of the present application. It should be appreciated that the video encoder 200 may be used for lossy compression of images (lossy compression) and may also be used for lossless compression of images (lossless compression). The lossless compression may be visual lossless compression (visually lossless compression) or mathematical lossless compression (mathematically lossless compression).
The video encoder 200 may be applied to image data in luminance-chrominance (YCbCr, YUV) format. For example, the YUV ratio may be 4:2:0, 4:2:2, or 4:4:4, where Y represents luminance (Luma), Cb (U) represents blue chrominance, Cr (V) represents red chrominance, and U and V together represent chrominance (Chroma), which describes color and saturation. For example, in color format, 4:2:0 represents 4 luminance components and 2 chrominance components per 4 pixels (YYYYCbCr), 4:2:2 represents 4 luminance components and 4 chrominance components per 4 pixels (YYYYCbCrCbCr), and 4:4:4 represents full-resolution chrominance (YYYYCbCrCbCrCbCrCbCr).
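Purely as an illustration of the sampling ratios described above, the following Python sketch counts the luma and chroma samples of one frame; the helper function and its name are assumptions added for this example and are not part of the codec.

    # Illustrative only: sample counts per frame for the YCbCr formats described above.
    def yuv_sample_counts(width, height, fmt="4:2:0"):
        luma = width * height
        if fmt == "4:2:0":       # Cb and Cr each subsampled by 2 horizontally and vertically
            chroma = 2 * (width // 2) * (height // 2)
        elif fmt == "4:2:2":     # Cb and Cr each subsampled by 2 horizontally only
            chroma = 2 * (width // 2) * height
        else:                    # "4:4:4": chroma at full resolution
            chroma = 2 * width * height
        return luma, chroma

    # Example: a 1920x1080 frame in 4:2:0 has 2073600 luma and 1036800 chroma samples.
    print(yuv_sample_counts(1920, 1080, "4:2:0"))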
For example, the video encoder 200 reads video data and, for each frame of image in the video data, divides the frame into a number of coding tree units (CTUs); a CTU may also be called a "largest coding unit" (LCU) or a "coding tree block" (CTB). Each CTU may be associated with a block of pixels of equal size within the image. Each pixel may correspond to one luminance (luma) sample and two chrominance (chroma) samples. Thus, each CTU may be associated with one block of luma samples and two blocks of chroma samples. A CTU size is, for example, 128×128, 64×64, or 32×32. A CTU may be further divided into several coding units (CUs), where a CU may be a rectangular block or a square block. A CU may be further divided into a prediction unit (PU) and a transform unit (TU), so that coding, prediction and transform are separated and processing is more flexible. In one example, a CTU is divided into CUs in a quadtree manner, and a CU is divided into TUs and PUs in a quadtree manner.
Video encoders and video decoders may support various PU sizes. Assuming that the size of a particular CU is 2N×2N, video encoders and video decoders may support PU sizes of 2N×2N or N×N for intra prediction, and symmetric PUs of 2N×2N, 2N×N, N×2N, N×N or similar sizes for inter prediction. Video encoders and video decoders may also support asymmetric PUs of 2N×nU, 2N×nD, nL×2N and nR×2N for inter prediction.
In some embodiments, as shown in fig. 2, the video encoder 200 may include: a prediction unit 210, a residual unit 220, a transform/quantization unit 230, an inverse transform/quantization unit 240, a reconstruction unit 250, a loop filtering unit 260, a decoded image buffer 270, and an entropy encoding unit 280. It should be noted that video encoder 200 may include more, fewer, or different functional components.
Alternatively, in this application, a current block (current block) may be referred to as a current Coding Unit (CU) or a current Prediction Unit (PU), or the like. The prediction block may also be referred to as a prediction block or an image prediction block, and the reconstruction block may also be referred to as a reconstruction block or an image reconstruction block.
In some embodiments, prediction unit 210 includes an inter prediction unit 211 and an intra prediction unit 212. Because of the strong correlation between adjacent pixels in a frame of video, intra-prediction methods are used in video coding techniques to eliminate spatial redundancy between adjacent pixels. Because of the strong similarity between adjacent frames in video, the inter-frame prediction method is used in the video coding and decoding technology to eliminate the time redundancy between adjacent frames, thereby improving the coding efficiency.
The inter prediction unit 211 may be used for inter prediction. Inter prediction may refer to image information of different frames: it uses motion information to find a reference block in a reference frame and generates a prediction block from the reference block, so as to eliminate temporal redundancy. The frames used for inter prediction may be P frames and/or B frames, where a P frame refers to a forward predicted frame and a B frame refers to a bi-directionally predicted frame. The motion information includes the reference frame list in which the reference frame is located, a reference frame index, and a motion vector. The motion vector may be of integer-pixel or sub-pixel precision; if the motion vector is of sub-pixel precision, interpolation filtering is required in the reference frame to generate the required sub-pixel block. The integer-pixel or sub-pixel block in the reference frame found according to the motion vector is referred to as the reference block. Some techniques use the reference block directly as the prediction block, while others reprocess the reference block to generate the prediction block. Reprocessing the reference block to generate a prediction block can also be understood as taking the reference block as the prediction block and then processing it to obtain a new prediction block.
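The following simplified Python sketch illustrates the integer-pixel case of the motion-compensated prediction described above; sub-pixel interpolation filtering and any reprocessing of the reference block are omitted, and all names and the array layout are assumptions for the example.

    import numpy as np

    def inter_predict_integer(ref_frame, y0, x0, block_h, block_w, mv_y, mv_x):
        # Fetch the reference block pointed to by an integer motion vector; the
        # reference block is used directly as the prediction block in this sketch.
        y, x = y0 + mv_y, x0 + mv_x
        return ref_frame[y:y + block_h, x:x + block_w].copy()

    # Usage sketch: predict a 16x16 block at (32, 64) with motion vector (-3, 5).
    ref = np.zeros((1080, 1920), dtype=np.uint8)
    pred = inter_predict_integer(ref, 32, 64, 16, 16, -3, 5)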
The most commonly used inter prediction methods at present include: the geometric partitioning mode (GPM) in the VVC video codec standard, and angular weighted prediction (AWP) in the AVS3 video codec standard. These two inter prediction modes share a common principle.
The intra prediction unit 212 predicts the pixel information within the current block to be encoded by referring only to information of the same frame image, so as to eliminate spatial redundancy. The frame used for intra prediction may be an I frame.
In some embodiments, the intra prediction method further includes a multi-reference line intra prediction method (multiple reference line, MRL), which may use more reference pixels to improve coding efficiency.
There are multiple prediction modes for intra prediction; for example, H.264 uses 9 modes for intra prediction of a 4×4 block. Mode 0 copies the pixels above the current block downwards, in the vertical direction, as the prediction value; mode 1 copies the left reference pixels to the current block in the horizontal direction as the prediction value; mode 2 (DC) uses the average of the 8 reference points A to D and I to L as the prediction value for all points; and modes 3 to 8 copy the reference pixels to the corresponding positions of the current block at certain angles. Because some positions of the current block do not correspond exactly to a reference pixel, it may be necessary to use a weighted average of the reference pixels, or sub-pixels of the interpolated reference pixels.
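A simplified Python sketch of modes 0, 1 and 2 described above for a 4×4 block; the angular modes 3 to 8, boundary availability handling and exact rounding rules are omitted or simplified, so this is an illustration rather than a normative H.264 implementation.

    import numpy as np

    def intra_4x4_predict(mode, top, left):
        # top: reference pixels A..D above the block; left: reference pixels I..L to its left.
        if mode == 0:                                    # vertical: copy the pixels above downwards
            return np.tile(top, (4, 1))
        if mode == 1:                                    # horizontal: copy the left pixels rightwards
            return np.tile(left.reshape(4, 1), (1, 4))
        if mode == 2:                                    # DC: average of A..D and I..L for all samples
            dc = (int(top.sum()) + int(left.sum()) + 4) // 8
            return np.full((4, 4), dc, dtype=top.dtype)
        raise NotImplementedError("angular modes 3-8 are not shown in this sketch")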
The intra prediction modes used by HEVC are Planar, DC, and 33 angular modes, 35 prediction modes in total. The intra modes used by VVC are Planar, DC, and 65 angular modes, 67 in total. The intra modes used by AVS3 are DC, Plane, Bilinear, and 63 angular modes, 66 in total.
It should be noted that as the number of angular modes increases, intra prediction becomes more accurate and better meets the demands of high-definition and ultra-high-definition digital video.
Residual unit 220 may generate a residual block of the CU based on the pixel block of the CU and the prediction block of the PU of the CU. For example, residual unit 220 may generate a residual block of the CU such that each sample in the residual block has a value equal to the difference between: samples in pixel blocks of a CU, and corresponding samples in prediction blocks of PUs of the CU.
The transform/quantization unit 230 may quantize the transform coefficients. Transform/quantization unit 230 may quantize transform coefficients associated with TUs of a CU based on Quantization Parameter (QP) values associated with the CU. The video encoder 200 may adjust the degree of quantization applied to the transform coefficients associated with the CU by adjusting the QP value associated with the CU.
The inverse transform/quantization unit 240 may apply inverse quantization and inverse transform, respectively, to the quantized transform coefficients to reconstruct a residual block from the quantized transform coefficients.
The reconstruction unit 250 may add samples of the reconstructed residual block to corresponding samples of one or more prediction blocks generated by the prediction unit 210 to generate a reconstructed block associated with the TU. By reconstructing the sample blocks of each TU of the CU in this way, the video encoder 200 may reconstruct the pixel block of the CU.
Loop filtering unit 260 may perform a deblocking filtering operation to reduce blocking artifacts of pixel blocks associated with the CU.
In some embodiments, the loop filtering unit 260 includes a deblocking filtering unit, a sample adaptive offset (SAO) unit, and an adaptive loop filter (ALF) unit.
The decoded image buffer 270 may store reconstructed pixel blocks. Inter prediction unit 211 may use the reference image containing the reconstructed pixel block to perform inter prediction on PUs of other images. In addition, intra prediction unit 212 may use the reconstructed pixel blocks in decoded image buffer 270 to perform intra prediction on other PUs in the same image as the CU.
The entropy encoding unit 280 may receive the quantized transform coefficients from the transform/quantization unit 230. Entropy encoding unit 280 may perform one or more entropy encoding operations on the quantized transform coefficients to generate entropy encoded data.
The basic video encoding flow involved in this application is as follows: at the encoding end, the current image is divided into blocks, and for the current block, the prediction unit 210 generates a prediction block of the current block using intra prediction or inter prediction. The residual unit 220 may calculate a residual block, also referred to as residual information, based on the difference between the prediction block and the original block of the current block. The residual block is transformed and quantized by the transform/quantization unit 230, which removes information insensitive to the human eye so as to eliminate visual redundancy. Optionally, the residual block before transform and quantization by the transform/quantization unit 230 may be referred to as a time domain residual block, and the time domain residual block after transform and quantization may be referred to as a frequency residual block or frequency domain residual block. The entropy encoding unit 280 receives the quantized transform coefficients output by the transform/quantization unit 230 and may entropy encode them to output a code stream. For example, the entropy encoding unit 280 may eliminate character redundancy according to the target context model and probability information of the binary code stream.
In addition, the video encoder performs inverse quantization and inverse transformation on the quantized transform coefficients output from the transform quantization unit 230 to obtain a residual block of the current block, and then adds the residual block of the current block to the predicted block of the current block to obtain a reconstructed block of the current block. Along with the progress of coding, the reconstruction blocks corresponding to other blocks to be coded in the current image can be obtained, and the reconstruction blocks are spliced to obtain a reconstruction image of the current image. Since errors are introduced during the encoding process, to reduce the errors, the reconstructed image is filtered, for example, using ALF, to reduce the difference between the pixel values of the pixels in the reconstructed image and the original pixel values of the pixels in the current image. The filtered reconstructed image is stored in the decoded image buffer 270 and may be used as a reference frame for inter prediction for subsequent frames.
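A highly simplified Python sketch of the per-block flow just described: residual calculation, a stand-in for transform and quantization (a plain uniform quantizer is used here instead of an actual transform), inverse quantization, and reconstruction. The quantization rule and all names are assumptions for illustration only.

    import numpy as np

    def encode_block(original, prediction, qstep=8):
        residual = original.astype(np.int32) - prediction.astype(np.int32)
        levels = np.round(residual / qstep).astype(np.int32)   # stand-in for transform + quantization
        return levels                                          # these would be entropy coded

    def reconstruct_block(levels, prediction, qstep=8):
        dequant = levels * qstep                               # stand-in for inverse quantization/transform
        return np.clip(prediction.astype(np.int32) + dequant, 0, 255).astype(np.uint8)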
The block division information determined by the encoding end, as well as mode information or parameter information for prediction, transform, quantization, entropy coding, loop filtering and the like, are carried in the code stream when necessary. The decoding end parses the code stream and, based on the existing information, determines the same block division information and the same prediction, transform, quantization, entropy coding and loop filtering mode or parameter information as the encoding end, thereby ensuring that the decoded image obtained at the encoding end is identical to the decoded image obtained at the decoding end.
Fig. 3 is a schematic block diagram of a video decoder provided by an embodiment of the present application.
As shown in fig. 3, the video decoder 300 includes: an entropy decoding unit 310, a prediction unit 320, an inverse quantization/transformation unit 330, a reconstruction unit 340, a loop filtering unit 350, and a decoded image buffer 360. It should be noted that the video decoder 300 may include more, fewer, or different functional components.
The video decoder 300 may receive the bitstream. The entropy decoding unit 310 may parse the bitstream to extract syntax elements from the bitstream. As part of parsing the bitstream, the entropy decoding unit 310 may parse entropy-encoded syntax elements in the bitstream. The prediction unit 320, the inverse quantization/transformation unit 330, the reconstruction unit 340, and the loop filtering unit 350 may decode video data according to syntax elements extracted from a bitstream, i.e., generate decoded video data.
In some embodiments, prediction unit 320 includes an intra prediction unit 321 and an inter prediction unit 322.
The intra prediction unit 321 may perform intra prediction to generate a prediction block of the PU. Intra-prediction unit 321 may use an intra-prediction mode to generate a prediction block for a PU based on pixel blocks of spatially-neighboring PUs. The intra prediction unit 321 may also determine an intra prediction mode of the PU according to one or more syntax elements parsed from the bitstream.
The inter prediction unit 322 may construct a first reference picture list (list 0) and a second reference picture list (list 1) according to syntax elements parsed from the bitstream. Furthermore, if the PU uses inter prediction encoding, entropy decoding unit 310 may parse the motion information of the PU. Inter prediction unit 322 may determine one or more reference blocks for the PU based on the motion information of the PU. Inter prediction unit 322 may generate a prediction block for the PU based on one or more reference blocks for the PU.
The inverse quantization/transform unit 330 may inverse quantize (i.e., dequantize) transform coefficients associated with the TUs. Inverse quantization/transform unit 330 may determine the degree of quantization using QP values associated with the CUs of the TUs.
After inverse quantizing the transform coefficients, inverse quantization/transform unit 330 may apply one or more inverse transforms to the inverse quantized transform coefficients in order to generate a residual block associated with the TU.
Reconstruction unit 340 uses the residual blocks associated with the TUs of the CU and the prediction blocks of the PUs of the CU to reconstruct the pixel blocks of the CU. For example, the reconstruction unit 340 may add samples of the residual block to corresponding samples of the prediction block to reconstruct a pixel block of the CU, resulting in a reconstructed block to be encoded.
Loop filtering unit 350 may perform a deblocking filtering operation to reduce blocking artifacts of pixel blocks associated with the CU.
In some embodiments, the loop filtering unit 350 includes a deblocking filtering unit, a sample adaptive offset (SAO) unit, and an adaptive loop filter (ALF) unit.
The video decoder 300 may store the reconstructed image of the CU in a decoded image buffer 360. The video decoder 300 may use the reconstructed image in the decoded image buffer 360 as a reference image for subsequent prediction or may transmit the reconstructed image to a display device for presentation.
The basic flow of video decoding related to the application is as follows: the entropy decoding unit 310 may parse the code stream to obtain prediction information of the current block, a quantization coefficient matrix, etc., and the prediction unit 320 generates a prediction block of the current block using intra prediction or inter prediction on the current block based on the prediction information. The inverse quantization/transformation unit 330 performs inverse quantization and inverse transformation on the quantized coefficient matrix using the quantized coefficient matrix obtained from the code stream to obtain a residual block. The reconstruction unit 340 adds the prediction block and the residual block to obtain a reconstructed block. The reconstructed blocks constitute a reconstructed image, and the loop filtering unit 350 performs loop filtering on the reconstructed image based on the image or based on the blocks, resulting in a decoded image. The decoded image may also be referred to as a reconstructed image, which may be displayed by a display device on the one hand, and stored in the decoded image buffer 360 on the other hand, for subsequent frames as reference frames for inter prediction.
The foregoing is a basic flow of a video codec under a block-based hybrid coding framework, and as technology advances, some modules or steps of the framework or flow may be optimized.
Real-world scenes have a large dynamic range, from a moonless late night to the glare of an afternoon sun, spanning up to 14 orders of magnitude. In such a complex environment, a low dynamic range (LDR) image captured by a conventional camera may be overexposed or underexposed in some parts of the image, so that the real world cannot be truly restored, whereas a high dynamic range (HDR) image contains the abundant light and color information of various illumination environments in the real scene and can more completely record or display texture details of bright and dark regions that are substantially the same as those of the real scene. At the same time, the acquisition of an HDR image is relatively complex and imposes higher requirements on hardware and algorithms in terms of data acquisition, transmission, storage, display and the like.
With the rapid development of deep learning technology in recent years, and in particular, the widespread use of Convolutional Neural Networks (CNNs), it has become possible to reconstruct a High Dynamic Range (HDR) image covering the entire dynamic range from a single or multiple exposure Low Dynamic Range (LDR) images of the same scene.
The embodiment of the application provides an image processing method based on a model, which converts an LDR image into an HDR image through the model. Namely, the encoding end encodes the LDR image to form a code stream, the decoding end decodes the LDR image, and then the decoding end dynamically converts the decoded LDR image by using the model of the embodiment of the application to obtain an HDR image, so that the conversion of the HDR image is realized without increasing the cost of data acquisition, encoding, transmission, storage and the like.
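A minimal sketch of this decode-then-convert pipeline is given below in Python/PyTorch style; the function and object names (decode_to_hdr, video_decoder, dynamic_conversion_model) are assumptions for the example and are not defined by this application.

    import torch

    def decode_to_hdr(bitstream, video_decoder, dynamic_conversion_model):
        # Step 1: decode the code stream to obtain the reconstructed (LDR) image.
        reconstructed = video_decoder.decode(bitstream)        # assumed to return a [1, 3, H, W] tensor
        # Step 2: input the reconstructed image into the dynamic conversion model.
        with torch.no_grad():
            hdr_image = dynamic_conversion_model(reconstructed)
        return hdr_image                                       # HDR image of the reconstructed image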
The technical solutions according to the embodiments of the present application are described below with reference to specific embodiments.
The image processing method provided by the application is to convert an LDR image into an HDR image by using a dynamic conversion model, wherein the dynamic conversion model is a piece of software code or a chip with a data processing function. Based on this, a training process of the dynamic conversion model will be described first.
Fig. 4 is a schematic flow chart of a dynamic conversion model training method according to an embodiment of the present application, and as shown in fig. 4, the training process includes:
s401, acquiring an LDR training image and an HDR image true value of the LDR training image.
The LDR training image is an LDR training image randomly selected from a training set, the training set includes a plurality of LDR training images, and the process of training the dynamic conversion model using the LDR training images in the training set is an iterative process. For example, a first LDR training image is input into the dynamic conversion model to be trained, and the initial parameters of the dynamic conversion model are adjusted once to obtain the dynamic conversion model after the first round of training. Then, a second LDR training image is input into the dynamic conversion model after the first round of training, and the parameters of that model are adjusted to obtain the dynamic conversion model after the second round of training; training continues iteratively in this way until the training end condition of the dynamic conversion model is reached. The training end condition of the dynamic conversion model includes the number of training iterations reaching a preset number, or the loss reaching a preset loss.
The above method for determining the initial parameters of the dynamic conversion model includes, but is not limited to, the following:
In a first mode, the initial parameters of the dynamic conversion model may be preset values, random values, or empirical values.
In a second mode, pre-training parameters obtained by pre-training a pre-trained model are acquired and determined as the initial parameters of the dynamic conversion model.
In the second mode, using the pre-training parameters of the pre-trained model as the initial parameters of the dynamic conversion model can reduce the number of training iterations of the dynamic conversion model and improve the training accuracy.
The embodiment of the application does not limit the type of the pre-training model, for example, the pre-training model is a VGG-16 network model.
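A compact Python/PyTorch sketch of the iterative training procedure described above, assuming an L1 loss and an Adam optimizer (neither is mandated by this application); initialization from a pre-trained model such as VGG-16 is indicated only as a comment.

    import torch
    from torch import nn

    def train_dynamic_conversion_model(model, training_set, max_steps=100000):
        # Initial parameters may be preset/random/empirical values or taken from a
        # pre-trained model (e.g. VGG-16) as described above; defaults are used here.
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # assumed optimizer
        criterion = nn.L1Loss()                                     # assumed loss function
        for step, (ldr_image, hdr_truth) in enumerate(training_set):
            hdr_pred = model(ldr_image)                 # forward pass through the coding/decoding modules
            loss = criterion(hdr_pred, hdr_truth)       # loss between HDR prediction and HDR truth value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                            # adjust the model parameters once
            if step + 1 >= max_steps:                   # training end condition: preset number of steps
                break
        return model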
From the above, the process of training the dynamic conversion model by using each LDR training image in the training set is consistent, and for convenience of description, in this embodiment, a training process of the dynamic conversion model is described by taking one LDR training image as an example.
In some embodiments, the HDR image truth value of the LDR training image may be generated by manually dynamically converting the LDR training image.
In some embodiments, the HDR image truth value of the LDR training image may be an HDR image obtained by converting the LDR training image using an existing high dynamic conversion method.
In some embodiments, the acquired HDR image may be converted into an LDR image, the converted LDR image is used as an LDR training image, and the acquired HDR image is used as an HDR image truth value of the LDR training image.
The embodiment of the application does not limit the way of acquiring the LDR training image and acquiring the HDR image true value of the LDR training image.
S402, inputting the LDR training image into a dynamic conversion model for dynamic conversion, and extracting the features of the i-1 th first feature information through an i-th coding module to obtain the i-th first feature information of the LDR training image.
S403, performing feature extraction on the i-1 first feature information and the N-i second feature information of the LDR training image through the N-i+1 decoding module to obtain the N-i+1 second feature information of the LDR training image.
The following describes the network structure of the dynamic conversion model according to the embodiment of the present application with reference to fig. 5A, and it should be noted that the network structure of the dynamic conversion model according to the embodiment of the present application includes, but is not limited to, the modules shown in fig. 5A, and may further include more or less modules than fig. 5A.
Fig. 5A is a network schematic diagram of a dynamic conversion model according to an embodiment of the present application. As shown in fig. 5A, the dynamic conversion model may be understood as a self-encoder network composed of N levels of encoding components and decoding components. The dynamic conversion model includes N coding modules connected in series and N decoding modules connected in series, where the output of the last coding module among the N coding modules is connected to the input of the first decoding module among the N decoding modules, and the i-th coding module is skip-connected to the (N-i+1)-th decoding module. The skip connection can be understood as connecting the input end of the i-th coding module to the input end of the (N-i+1)-th decoding module. The i-th coding module is used for performing feature extraction on the (i-1)-th first feature information to obtain the i-th first feature information of the LDR training image, the (N-i+1)-th decoding module is used for performing feature extraction on the (i-1)-th first feature information and the (N-i)-th second feature information of the LDR training image to obtain the (N-i+1)-th second feature information of the LDR training image, and i is a positive integer less than or equal to N.
If i is equal to N, the (N-i)-th second feature information is determined according to the N-th first feature information output by the N-th coding module.
If i is smaller than N, the (N-i)-th second feature information is determined according to the second feature information output by the (N-i)-th decoding module.
If i is equal to 1, the (i-1)-th first feature information is determined according to the LDR training image.
If i is greater than 1, the (i-1)-th first feature information is determined according to the first feature information output by the (i-1)-th coding module.
For example, as shown in fig. 5A, n=4, the encoding assembly includes 4 encoding modules connected in series, the decoding assembly includes 4 decoding modules connected in series, and the output of the last encoding module is connected to the input of the first decoding module. The first encoding module is in jump connection with the fourth decoding module, the second encoding module is in jump connection with the third decoding module, the third encoding module is in jump connection with the second decoding module, and the fourth encoding module is in jump connection with the first decoding module.
The LDR training image is input into the dynamic conversion model to obtain the 0th first feature information, where the 0th first feature information may be the LDR training image itself or a feature map obtained by processing the LDR training image. The 0th first feature information is input into the first coding module and the fourth decoding module respectively. The first coding module outputs the 1st first feature information according to the 0th first feature information, and the 1st first feature information is input into the second coding module and the third decoding module respectively. The second coding module obtains the 2nd first feature information according to the 1st first feature information and inputs it into the third coding module and the second decoding module respectively. The third coding module obtains the 3rd first feature information according to the 2nd first feature information and inputs it into the fourth coding module and the first decoding module respectively. The fourth coding module outputs the 4th first feature information according to the 3rd first feature information and inputs it into the first decoding module. The first decoding module obtains the 1st second feature information according to the 4th first feature information and the 3rd first feature information and inputs it into the second decoding module. The second decoding module obtains the 2nd second feature information according to the 1st second feature information and the 2nd first feature information and inputs it into the third decoding module. The third decoding module obtains the 3rd second feature information according to the 2nd second feature information and the 1st first feature information and inputs it into the fourth decoding module. The fourth decoding module obtains the 4th second feature information according to the 0th first feature information and the 3rd second feature information.
In some embodiments, as shown in fig. 5A, step S403 includes: concatenating the (i-1)-th first feature information and the (N-i)-th second feature information of the LDR training image, where "C" in fig. 5A denotes concatenation; and inputting the concatenated feature information into the (N-i+1)-th decoding module for feature extraction to obtain the (N-i+1)-th second feature information of the LDR training image. For example, the 4th first feature information and the 3rd first feature information are concatenated and input into the first decoding module to obtain the 1st second feature information output by the first decoding module. The 1st second feature information is concatenated with the 2nd first feature information and input into the second decoding module to obtain the 2nd second feature information output by the second decoding module. The 2nd second feature information is concatenated with the 1st first feature information and input into the third decoding module to obtain the 3rd second feature information output by the third decoding module. Similarly, the 0th first feature information is concatenated with the 3rd second feature information and input into the fourth decoding module to obtain the 4th second feature information output by the fourth decoding module.
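The N = 4 data flow described above can be sketched as follows in Python/PyTorch style. The convolution-block structure, the output head and the feature dimensions are placeholders based on the examples given later in this description, and any change of spatial resolution between modules is not shown, so this is an illustrative sketch rather than the exact network of fig. 5A.

    import torch
    from torch import nn

    def conv_block(in_ch, out_ch):
        # Placeholder block; one possible convolution-block structure is described later (fig. 5B).
        return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.SiLU())

    class DynamicConversionModelSketch(nn.Module):
        # Illustrative N = 4 self-encoder network with skip connections (no resampling shown).
        def __init__(self, in_ch=3):
            super().__init__()
            enc = [64, 128, 256, 512]                 # example encoder feature dimensions
            dec = [256, 128, 64, 32]                  # example decoder feature dimensions
            self.enc1 = conv_block(in_ch, enc[0])
            self.enc2 = conv_block(enc[0], enc[1])
            self.enc3 = conv_block(enc[1], enc[2])
            self.enc4 = conv_block(enc[2], enc[3])
            self.dec1 = conv_block(enc[3] + enc[2], dec[0])
            self.dec2 = conv_block(dec[0] + enc[1], dec[1])
            self.dec3 = conv_block(dec[1] + enc[0], dec[2])
            self.dec4 = conv_block(dec[2] + in_ch, dec[3])
            self.head = nn.Conv2d(dec[3], 3, kernel_size=1)   # assumed output head for the HDR image

        def forward(self, x):                            # x: the 0th first feature information
            f1 = self.enc1(x)                            # 1st first feature information
            f2 = self.enc2(f1)                           # 2nd first feature information
            f3 = self.enc3(f2)                           # 3rd first feature information
            f4 = self.enc4(f3)                           # 4th first feature information
            d1 = self.dec1(torch.cat([f4, f3], dim=1))   # "C": concatenation on the skip path
            d2 = self.dec2(torch.cat([d1, f2], dim=1))
            d3 = self.dec3(torch.cat([d2, f1], dim=1))
            d4 = self.dec4(torch.cat([d3, x], dim=1))    # 4th second feature information
            return self.head(d4)                         # HDR prediction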
The embodiment of the application does not limit the specific network structure of the coding module.
In one embodiment, each of the N coding modules includes at least one convolution block, wherein parameters of the convolution blocks included by each of the N coding modules are not exactly the same. For example, the characteristic dimension of the convolution block included in the first encoding module is 64, the characteristic dimension of the convolution block included in the second encoding module is 128, the characteristic dimension of the convolution block included in the third encoding module is 256, the characteristic dimension of the convolution block included in the fourth encoding module is 512, and so on.
The embodiment of the application does not limit the specific network structure of the decoding module.
In one embodiment, each of the N decoding modules includes at least one convolution block, wherein the parameters of the convolution blocks included in the N decoding modules are not exactly the same. For example, the feature dimension of the convolution block included in the first decoding module is 256, the feature dimension of the convolution block included in the second decoding module is 128, the feature dimension of the convolution block included in the third decoding module is 64, the feature dimension of the convolution block included in the fourth decoding module is 32, and so on.
The network structures of the convolution blocks included in each coding module in the embodiment of the present application may be the same or different. The network structure of the convolution blocks included in each decoding module may be the same or different. In addition, the network structures of the convolution blocks included in the encoding module and the decoding module may be the same or different, which is not limited in this application.
In one possible implementation, the network structure of the convolution blocks comprised by the encoding module and/or the decoding module comprises convolution layer 1, convolution layer 2, convolution layer 3 and an activation function.
Alternatively, as shown in fig. 5B, the convolution kernels of convolution layer 1 and convolution layer 2 are 3×3, the convolution kernel of convolution layer 3 is 1×1, and the activation function is the Sigmoid-weighted linear unit (SiLU).
The convolution kernel sizes of the above-mentioned convolution layers 1, 2 and 3 include, but are not limited to, the above-mentioned values, and the activation function includes, but is not limited to, a SiLU, for example, a RELU, etc., which is not limited in this application.
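One possible reading of the convolution block of fig. 5B, written as a Python/PyTorch sketch; how the three convolution layers and the SiLU activation are combined (for example, whether the activation follows every layer or only the last one) is not fully specified above, so the ordering below is an assumption.

    from torch import nn

    class ConvBlockSketch(nn.Module):
        # Convolution block with two 3x3 layers, one 1x1 layer and a SiLU activation.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
            self.conv3 = nn.Conv2d(out_ch, out_ch, kernel_size=1)
            self.act = nn.SiLU()               # ReLU or another activation is also possible

        def forward(self, x):
            x = self.act(self.conv1(x))        # assumption: activation applied after each layer
            x = self.act(self.conv2(x))
            return self.act(self.conv3(x))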
In some embodiments, as shown in fig. 5C, the dynamic conversion model further includes: a convolution attention module (Convolutional Block Attention Module, CBAM for short) located in the jump connection between the ith coding module and the N-i+1 th decoding module. The attention mechanism of the convolution attention module enables the dynamic conversion model to focus more attention on the relevant parts of the encoder-side features and less attention on irrelevant parts; that is, the convolution attention mechanism improves the representation capability of the dynamic conversion model by emphasizing important features and suppressing unnecessary ones, thereby greatly improving the efficiency of the model.
In one possible implementation, one or more CBAMs are included in each jump connection between an encoding module and a decoding module.
Based on the dynamic conversion model shown in fig. 5C, the step S403 of extracting features from the i-1 th first feature information and the N-i th second feature information of the LDR training image by using the N-i+1 th decoding module to obtain the N-i+1 th second feature information of the LDR training image includes steps S403-A and S403-B:
S403-A, extracting spatial information and channel information from the ith-1 first characteristic information through a convolution attention module to obtain the ith-1 third characteristic information of the LDR training image.
S403-B, performing feature extraction on the i-1 th third feature information and the N-i second feature information through the N-i+1 th decoding module to obtain the N-i+1 th second feature information of the LDR training image. For example, the ith-1 third characteristic information and the nth-i second characteristic information are cascaded, the cascaded ith-1 third characteristic information and the cascaded nth-i second characteristic information are input into the nth-i+1 decoding module, and the nth-i+1 second characteristic information of the LDR training image output by the nth-i+1 decoding module is obtained.
The embodiment of the application does not limit the network structure of the convolution attention module.
In one possible implementation, as shown in fig. 5D, the convolution attention module includes: a channel attention module and a spatial attention module. The channel attention module learns channel information of the features by utilizing the relation among channels of the features, and the space attention module learns the space information of the features by utilizing the space relation of the features.
It should be noted that the channel referred to in the present application may be understood as a feature dimension; for example, if the feature dimension of a piece of feature information is 32, the number of channels of the feature information is 32.
On the basis of fig. 5D, the step S403-A of extracting spatial information and channel information from the i-1 th first feature information through the convolution attention module to obtain the i-1 th third feature information of the LDR training image includes steps S403-A1 to S403-A3:
S403-A1, extracting channel information from the i-1 th first characteristic information through a channel attention module to obtain the channel attention information of the i-1 th first characteristic information.
S403-A2, extracting spatial information of the fusion channel characteristic information of the ith-1 first characteristic information through the spatial attention module to obtain the spatial attention information of the ith-1 first characteristic information.
The fusion channel characteristic information of the ith-1 first characteristic information is determined according to the ith-1 first characteristic information and the channel attention information of the ith-1 first characteristic information.
In some embodiments, as shown in FIG. 5D, the convolution attention module further includes a first multiplication unit, where S403-A2 includes S403-A21 and S403-A22:
S403-A21, multiplying the i-1 th first characteristic information and the channel attention information of the i-1 th first characteristic information by a first multiplication unit to obtain the fused channel characteristic information of the i-1 th first characteristic information.
S403-A22, inputting the fusion channel characteristic information of the ith-1 first characteristic information into a spatial attention module for spatial information extraction to obtain the spatial attention information of the ith-1 first characteristic information.
S403-A3, determining the ith-1 third characteristic information of the LDR training image according to the channel attention information and the space attention information of the ith-1 first characteristic information.
In some embodiments, as shown in FIG. 5D, where the convolution attention module further includes a second multiplication unit, then S403-A3 includes: and multiplying the fusion channel characteristic information of the ith-1 first characteristic information and the spatial attention information by a second multiplication unit to obtain the ith-1 third characteristic information of the LDR training image.
For example, assuming the i-1 th first feature information is a feature map F, as shown in FIG. 5D, the feature map F is input to the CBAM module, which sequentially infers attention maps along two independent dimensions (i.e., the channel dimension and the spatial dimension) and multiplies them with the input feature map for adaptive feature refinement. Specifically, a one-dimensional channel attention map MC is obtained through the channel attention module, and MC is multiplied by the input feature F to obtain F'. F' is input into the spatial attention module, which produces a two-dimensional spatial attention map Ms. Ms is multiplied by F' to obtain the final feature map F'', which is the i-1 th third feature information of the LDR training image.
In FIG. 5D, the multiplication symbol represents element-wise multiplication of the corresponding elements. Here, if the dimension of the input feature map F is H×W×C, the dimension of the one-dimensional channel attention map MC is 1×1×C, and the dimension of the two-dimensional spatial attention map Ms is H×W×1.
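As an illustration of the two multiplications, the following minimal sketch uses random tensors mc and ms as stand-ins for the outputs of the channel attention module and the spatial attention module, and relies on broadcasting to realize the element-wise multiplications; the shapes are assumptions for illustration only.

```python
import torch

# Assume a feature map F of shape (B, C, H, W); mc and ms stand in for the
# channel and spatial attention maps produced by the modules described below.
B, C, H, W = 1, 64, 32, 32
f = torch.randn(B, C, H, W)
mc = torch.rand(B, C, 1, 1)   # one-dimensional channel attention map MC (1x1xC per sample)
ms = torch.rand(B, 1, H, W)   # two-dimensional spatial attention map Ms (HxWx1 per sample)

f_prime = mc * f               # first multiplication unit: F' = MC (x) F, broadcast over H, W
f_double_prime = ms * f_prime  # second multiplication unit: F'' = Ms (x) F', broadcast over C
print(f_double_prime.shape)    # torch.Size([1, 64, 32, 32]) -- same shape as F
```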
The above S403-A1 will be described below with reference to the network structure of the channel attention module.
In some embodiments, as shown in fig. 5E, the channel attention module includes: the device comprises a first space compression unit, a second space compression unit and a channel characteristic extraction unit. The first space compression unit and the second space compression unit are both used for compressing the space size of the feature map, and the channel feature extraction unit is used for extracting the features of the feature map after the space compression. That is, as shown in fig. 5F, the present application compresses the spatial dimensions of the input feature map in order to efficiently calculate channel attention.
Optionally, the first spatial compression unit and/or the second spatial compression unit comprise a pooling layer.
Optionally, the first spatial compression unit is a maximum pooling layer, and/or the second spatial compression unit is an average pooling layer.
Alternatively, the channel feature extraction unit is a multi-layer perceptron (Multilayer Perceptron, abbreviated as MLP), for example, an MLP comprising a single hidden layer.
On the basis of fig. 5E, the step S403-A1 of extracting channel information from the i-1 th first feature information through the channel attention module to obtain the channel attention information of the i-1 th first feature information includes S403-A11 to S403-A15:
S403-A11, performing space dimension compression on the i-1 th first characteristic information through a first space compression unit to obtain first space compression information of the i-1 th first characteristic information.
S403-A12, performing space dimension compression on the i-1 th first characteristic information through a second space compression unit to obtain second space compression information of the i-1 th first characteristic information.
S403-A13, carrying out channel feature extraction on the first space compression information of the i-1 th first feature information through a channel feature extraction unit to obtain first channel information of the i-1 th first feature information.
S403-A14, carrying out channel feature extraction on the second space compression information of the i-1 th first feature information through a channel feature extraction unit to obtain second channel information of the i-1 th first feature information.
S403-A15, determining channel attention information of the i-1 th first characteristic information according to the first channel information and the second channel information of the i-1 th first characteristic information.
In some embodiments, as shown in fig. 5E, the channel attention module further comprises: a first adding unit and a first activation function, where S403-a15 includes:
S403-A151, adding the first channel information and the second channel information of the i-1 th first feature information through a first adding unit to obtain fusion channel information of the i-1 th first feature information.
S403-A152, performing nonlinear processing on the fusion channel information of the i-1 th first feature information through a first activation function to obtain channel attention information of the i-1 th first feature information.
The embodiment of the application does not limit the specific form of the first activation function, and the specific form is determined according to actual needs.
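As a point of reference, the channel attention branch of S403-A11 to S403-A152 could be sketched in PyTorch as follows; the reduction ratio, the ReLU inside the MLP and the sigmoid used as the first activation function are assumptions for illustration, since the application leaves these unspecified.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention sketch: max/avg spatial pooling -> shared single-hidden-layer MLP -> add -> sigmoid."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # first spatial compression unit
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # second spatial compression unit
        # channel feature extraction unit: MLP with a single hidden layer,
        # implemented with 1x1 convolutions so no explicit reshaping is needed
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        self.act = nn.Sigmoid()  # assumed form of the first activation function

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        max_branch = self.mlp(self.max_pool(x))   # first channel information
        avg_branch = self.mlp(self.avg_pool(x))   # second channel information
        return self.act(max_branch + avg_branch)  # channel attention map, shape (B, C, 1, 1)

# mc = ChannelAttention(64)(torch.randn(1, 64, 32, 32))  # -> (1, 64, 1, 1)
```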
The above S403-A2 will be described below in connection with the network structure of the spatial attention module.
In some embodiments, as shown in fig. 5F, the spatial attention module includes: a first channel compression unit, a second channel compression unit and a spatial feature extraction unit. The first channel compression unit and the second channel compression unit are both used for compressing the channel dimension of the feature map, and the spatial feature extraction unit is used for extracting features from the channel-compressed feature map. That is, the spatial attention module shown in fig. 5F generates a spatial attention map by utilizing the spatial relationships between features. The spatial attention and the channel attention complement each other. To calculate the spatial attention, the channel dimension of the input feature map is compressed.
Optionally, the first channel compression unit and/or the second channel compression unit comprise a pooling layer.
Optionally, the first channel compression unit is a maximum pooling layer (MaxPool), and/or the second channel compression unit is an average pooling layer (AvgPool).
Optionally, the spatial feature extraction unit is a convolution layer.
At this time, the step S403-A2 performs spatial information extraction on the fused channel characteristic information of the ith-1 first characteristic information through the spatial attention module to obtain spatial attention information of the ith-1 first characteristic information, including S403-A21 to S403-A24:
S403-A21, carrying out channel dimension compression on the fusion channel characteristic information of the ith-1 first characteristic information through a first channel compression unit to obtain first channel compression information of the ith-1 first characteristic information.
S403-A22, performing channel dimension compression on the fusion channel characteristic information of the ith-1 first characteristic information through a second channel compression unit to obtain second channel compression information of the ith-1 first characteristic information.
S403-A23, performing spatial feature extraction on the first channel compression information and the second channel compression information of the ith-1 first feature information through a spatial feature extraction unit to obtain spatial feature information of the ith-1 first feature information.
S403-A24, determining the spatial attention information of the ith-1 first characteristic information according to the spatial characteristic information of the ith-1 first characteristic information.
In some embodiments, as shown in FIG. 5F, the spatial attention module further includes a second activation function, S403-A24 includes: and carrying out nonlinear processing on the spatial characteristic information of the i-1 th first characteristic information through a second activation function to obtain the spatial attention information of the i-1 th first characteristic information.
The embodiment of the application does not limit the specific form of the second activation function, for example, a sigmoid activation function.
In one specific example, the spatial attention module generates corresponding feature vectors along the channel axis using an average pooling operation (i.e., the second channel compression unit) and a max pooling operation (i.e., the first channel compression unit), and concatenates the two to generate an efficient feature descriptor. On this basis, the dimension is reduced to one channel through a convolution layer (i.e., the spatial feature extraction unit), and the two-dimensional spatial attention feature map Ms is generated after a sigmoid activation function (i.e., the second activation function).
Optionally, the spatial dimension of the channel attention information of the i-1 th first feature information is 1×1.
Optionally, the feature dimension of the spatial attention information of the i-1 th first feature information is 1.
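A corresponding sketch of the spatial attention branch of S403-A21 to S403-A24 is shown below; the 7×7 convolution kernel and the sigmoid second activation function are assumptions, since the application only requires a convolution layer and an activation function.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention sketch: channel-wise max/avg pooling -> concat -> conv -> sigmoid."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # spatial feature extraction unit: a single convolution layer; the 7x7 kernel is an assumption
        self.conv = nn.Conv2d(2, 1, kernel_size=kernel_size, padding=kernel_size // 2)
        self.act = nn.Sigmoid()  # second activation function (e.g. sigmoid, as suggested above)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        max_out, _ = torch.max(x, dim=1, keepdim=True)      # first channel compression unit (max over channels)
        avg_out = torch.mean(x, dim=1, keepdim=True)        # second channel compression unit (average over channels)
        descriptor = torch.cat([max_out, avg_out], dim=1)   # efficient 2-channel feature descriptor
        return self.act(self.conv(descriptor))              # spatial attention map, shape (B, 1, H, W)

# ms = SpatialAttention()(torch.randn(1, 64, 32, 32))  # -> (1, 1, 32, 32)
```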
In the dynamic conversion model provided by the embodiment of the application, a convolution attention module comprising a channel attention module and a spatial attention module is added to each jump-connection branch, so that channel features and spatial features are learned separately. This further improves the learning of image detail features by the dynamic conversion model, so that the trained dynamic conversion model can reconstruct more detail features in the image, which in turn improves the quality of the HDR image generated by the dynamic conversion model.
In some embodiments, as shown in fig. 5G, the dynamic conversion model further includes at least one downsampling unit, and the training method of the embodiments of the present application further includes: and performing space dimension downsampling on the characteristic information output by the coding module through a downsampling unit. That is, in order to reduce network complexity, the embodiment of the present application sets at least one downsampling unit in the encoding component, so as to reduce the spatial dimension of the feature information output by the encoding module.
The number of downsampling units included in the dynamic conversion model is not limited in the embodiment of the present application and is determined according to actual requirements.
In one possible implementation, a downsampling unit is arranged between two adjacent coding modules. It downsamples the spatial dimension of the feature information output by the previous coding module and inputs the downsampled feature information into the next coding module. This reduces the amount of data processed by the coding modules and thus the complexity of the model, and allows each coding module to learn features at different scales, improving the prediction accuracy of the dynamic conversion model.
Optionally, the downsampling unit is a maximum pooling layer.
In some embodiments, as shown in fig. 5G, the dynamic conversion model further includes at least one upsampling unit, and the training method of the embodiments of the present application further includes: and carrying out space dimension up-sampling on the characteristic information output by the decoding module through an up-sampling unit.
As shown in fig. 5G, since at least one downsampling unit is disposed in the encoding component, in order to ensure that the size of the decoded image is consistent with the size of the original image, at least one upsampling unit is disposed in the decoding component, for performing spatial dimension upsampling on the feature information output by the decoding module.
Optionally, the upsampling unit is a bilinear interpolation unit.
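A minimal sketch of the two sampling units under these options is given below; the align_corners setting and the example tensor shape are assumptions.

```python
import torch
import torch.nn as nn

# Downsampling unit: a max pooling layer halving the spatial dimensions.
downsample = nn.MaxPool2d(kernel_size=2, stride=2)

# Upsampling unit: bilinear interpolation doubling the spatial dimensions
# (a convolution could follow to adjust the channel count).
upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

x = torch.randn(1, 64, 128, 128)
print(downsample(x).shape)            # torch.Size([1, 64, 64, 64])
print(upsample(downsample(x)).shape)  # torch.Size([1, 64, 128, 128])
```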
In some embodiments, as shown in fig. 5G, the dynamic conversion model further includes a first convolution layer, where the first convolution layer is located at an input end of the dynamic conversion model, and is configured to process an image input to the dynamic conversion model to obtain an initial feature map of the input image. For example, inputting an LDR training image into a dynamic conversion model, and extracting features of the LDR training image through a first convolution layer in the dynamic conversion model to obtain an initial feature map of the LDR training image; the initial feature map is respectively input into a first coding module and a first convolution attention module to obtain first feature information output by the first coding module and first third feature information output by the first convolution attention module. The initial feature map may be understood as the 0 th first feature information described above.
According to the method, the LDR training image is input into the dynamic conversion model, so that the second feature information of the LDR training image output by the last decoding module in the dynamic conversion model can be obtained, and then the following S404 is executed.
S404, determining an HDR image prediction value of the LDR training image according to the second characteristic information of the LDR training image output by the last decoding module in the N decoding modules.
In some embodiments, the channel of the second feature information of the LDR training image is converted into 3 channels (e.g., RGB channels), resulting in an HDR image predictor of the LDR training image.
In some embodiments, as shown in fig. 5G, the dynamic conversion model further includes a second convolution layer, and S404 includes: and extracting the characteristics of the second characteristic information of the LDR training image output by the last decoding module through the second convolution layer, and outputting the HDR image predicted value of the LDR training image.
The second convolution layer further includes an activation function, and the feature dimension of the second convolution layer is 3, that is, after passing through the second convolution layer, a 3-channel (e.g. RGB) image may be output, and the 3-channel image is used as an HDR image prediction value of the LDR training image.
Alternatively, the convolution kernel of the second convolution layer may have a size of 1×1.
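A minimal sketch of such an output head is shown below; the 32 input channels follow the example above, and the ReLU is an assumed activation, since the application does not fix the form of the activation function in the second convolution layer.

```python
import torch
import torch.nn as nn

# Second convolution layer sketch: 1x1 convolution mapping the last decoder's
# feature channels (e.g. 32) to a 3-channel (RGB) HDR prediction.
second_conv = nn.Sequential(
    nn.Conv2d(32, 3, kernel_size=1),
    nn.ReLU(inplace=True),  # assumed activation; the application does not specify its form
)

features = torch.randn(1, 32, 256, 256)  # example 4th second feature information
hdr_pred = second_conv(features)         # HDR image prediction, shape (1, 3, 256, 256)
```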
S405, determining a target loss between the HDR image predicted value of the LDR training image and the HDR image true value of the LDR training image, and training the dynamic conversion model according to the target loss.
After obtaining the HDR image predicted value of the LDR training image according to the step of S404, comparing the HDR image predicted value of the LDR training image with the HDR image true value of the LDR training image, determining a target loss between the HDR image predicted value of the LDR training image and the HDR image true value of the LDR training image, and adjusting parameters in the dynamic conversion model according to the target loss, so as to realize one training of the dynamic conversion model. Next, the dynamic conversion model is trained using another LDR training image with reference to the same procedure as described above until the dynamic conversion model training is completed.
In some embodiments, the manner of determining the loss in S405 includes S405A: and determining target loss between the HDR image predicted value of the LDR training image and the HDR image true value of the LDR training image according to a preset loss function.
Optionally, the predetermined loss function includes at least one of a reconstruction loss function, a perceptual loss function, and a style loss function.
In one possible implementation manner, the preset loss function includes a reconstruction loss function, a perceptual loss function, and a style loss function, where S405A includes:
determining a reconstruction penalty between the HDR image predictor and the HDR image truth value;
determining a perceived loss between the HDR image predictor and the HDR image truth value;
determining a style loss between the HDR image predictor and the HDR image truth value;
and determining the target loss between the HDR image predictor and the HDR image truth value according to the reconstruction loss, the perceptual loss and the style loss.
The reconstruction loss constrains the HDR image predictor to approximate the HDR image truth value at the pixel level.
The perceptual penalty evaluates the degree of matching of features of the HDR image predictor with features extracted from the HDR image truth and allows the model to produce textures that are perceptually similar to the HDR image truth, i.e. the perceptual penalty ensures that a visually pleasing image with more texture details is generated.
The style loss captures style and texture by comparing global statistics collected over the entire image using Gram matrices, guaranteeing the style consistency and color consistency of the predicted image.
In some embodiments, the weighted sum of the reconstruction loss, the perceptual loss and the style loss may be taken as the target loss.
The target loss between the HDR image predictor and the HDR image truth value is determined, for example, according to the following equation (1):
Loss = L1 + λs·Lst + λp·Lp (1)
where Loss is the target loss, L1 is the reconstruction loss, Lst is the perceptual loss, Lp is the style loss, and λs and λp are hyperparameters. Formula (1) can be understood as follows: the weight of the reconstruction loss is 1, the weight of the perceptual loss is λs, and the weight of the style loss is λp.
It should be noted that, the above formula (1) is only an example, and the manner of determining the target loss in the present application includes, but is not limited to, adding, subtracting, multiplying or dividing a certain parameter in the above formula (1), or equivalent deformation of the above formula (1), which falls within the scope of protection of the present application.
In one example, the compressed tone mapping value of the HDR image predictor is determined according to a preset compressed tone mapping function; determining a compressed tone mapping value of the HDR image truth value according to the compressed tone mapping function; the reconstruction penalty is determined from the error between the compressed tone mapping value of the HDR image truth value and the compressed tone mapping value of the HDR image predictor.
For example, the reconstruction loss is determined according to the following formula (2):
L1 = ‖T(H) − T(GT)‖1 (2)
where L1 represents the reconstruction loss, T is the μ-law compressed tone mapping function, T(x) = log(1 + μx) / log(1 + μ) with x = H or GT, T(H) is the compressed tone mapping value of the HDR image predicted value, T(GT) is the compressed tone mapping value of the HDR image true value, H is the HDR image predicted value output by the dynamic conversion model, GT is the HDR image true value of the LDR training image, ‖·‖1 denotes the L1 norm, and μ is a preset parameter.
It should be noted that the above formula (2) is only an example, and the manner of determining the reconstruction loss in the present application includes, but is not limited to, adding, subtracting, multiplying or dividing a certain parameter in the above formula (2), or equivalent deformation of the above formula (2), which falls within the scope of the present application.
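For illustration, a minimal PyTorch sketch of the tone-mapped reconstruction loss is given below; the μ-law form T(x) = log(1 + μx)/log(1 + μ), the value μ = 5000 and the mean reduction are assumptions, since the application only states that μ is a preset parameter.

```python
import math
import torch

def mu_law_tonemap(x: torch.Tensor, mu: float = 5000.0) -> torch.Tensor:
    """Assumed μ-law compressed tone mapping: T(x) = log(1 + μx) / log(1 + μ)."""
    return torch.log(1.0 + mu * x) / math.log(1.0 + mu)

def reconstruction_loss(h_pred: torch.Tensor, gt: torch.Tensor, mu: float = 5000.0) -> torch.Tensor:
    """L1 = ||T(H) - T(GT)||_1; the mean reduction over all elements is an assumption."""
    return torch.mean(torch.abs(mu_law_tonemap(h_pred, mu) - mu_law_tonemap(gt, mu)))

# Example: h_pred and gt are linear-domain HDR tensors of shape (B, 3, 256, 256).
# loss = reconstruction_loss(h_pred, gt)
```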
In one example, the perceptual loss is determined as follows: acquiring the feature map of the l-th layer of the pre-training model; determining the compressed tone mapping value of the HDR image predicted value according to a preset compressed tone mapping function; determining the compressed tone mapping value of the HDR image true value according to the compressed tone mapping function; determining a first feature value corresponding to the compressed tone mapping value of the HDR image predicted value in the feature map of the l-th layer; determining a second feature value corresponding to the compressed tone mapping value of the HDR image true value in the feature map of the l-th layer; and determining the perceptual loss according to the error between the first feature value and the second feature value.
For example, the perceptual loss is determined according to the following equation (3):
where Lp represents the perceptual loss, φl represents the feature map of the l-th layer of the pre-training model (e.g., VGG-16), the feature map having size Cl×Hl×Wl, φl(T(H)) is the first feature value corresponding to the compressed tone mapping value of the HDR image predicted value in the feature map of the l-th layer, and φl(T(GT)) is the second feature value corresponding to the compressed tone mapping value of the HDR image true value in the feature map of the l-th layer.
It should be noted that, the above formula (3) is only an example, and the manner of determining the perceived loss in the present application includes, but is not limited to, the manner shown in the above formula (3), such as adding, subtracting, multiplying or dividing a certain parameter in the formula (3), or equivalent deformation of the above formula (3), which falls within the protection scope of the present application.
In one example, the style loss is determined as follows: obtaining the Gram matrix of the l-th layer feature map of the pre-training model; determining the compressed tone mapping value of the HDR image predicted value according to a preset compressed tone mapping function; determining the compressed tone mapping value of the HDR image true value according to the compressed tone mapping function; determining a first element value corresponding to the compressed tone mapping value of the HDR image predicted value in the Gram matrix; determining a second element value corresponding to the compressed tone mapping value of the HDR image true value in the Gram matrix; and determining the style loss according to the error between the first element value and the second element value.
For example, the style loss is determined according to the following equation (4):
where G(·) denotes the Gram matrix of the l-th layer feature map of the pre-training model, G(T(H)) is the first element value corresponding to the compressed tone mapping value of the HDR image predicted value in the Gram matrix, G(T(GT)) is the second element value corresponding to the compressed tone mapping value of the HDR image true value in the Gram matrix, x = H or GT, and Kl = Cl·Hl·Wl is the normalization factor; the feature φl is reshaped to (Hl·Wl)×Cl, so the Gram matrix has size Cl×Cl.
Optionally, a pre-trained VGG-16 network is used; the feature maps of the predicted and real features are computed at the first three pooling layers pool1, pool2 and pool3 of VGG-16, and the perceptual loss and the style loss are calculated for these features according to formula (3) and formula (4), respectively.
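The following sketch illustrates how the pool1–pool3 features of a pre-trained VGG-16 and the corresponding perceptual and Gram-matrix style terms could be computed in PyTorch; the L1 error, the normalization, and the omission of ImageNet input normalization are assumptions consistent with the description above, not the application's definitive formulas.

```python
import torch
import torchvision

# Feature extractor over the first three pooling layers of a pre-trained VGG-16.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
pool_indices = [4, 9, 16]  # pool1, pool2, pool3 in torchvision's vgg16.features

def vgg_features(x):
    feats, out = [], x
    for idx, layer in enumerate(vgg):
        out = layer(out)
        if idx in pool_indices:
            feats.append(out)
    return feats

def gram(feat):
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)               # φ flattened over the spatial dimensions
    return f @ f.transpose(1, 2) / (c * h * w)  # C x C Gram matrix with K_l normalization

def perceptual_and_style_loss(t_pred, t_gt):
    """t_pred, t_gt: tone-mapped prediction T(H) and ground truth T(GT), shape (B, 3, H, W)."""
    lp, ls = 0.0, 0.0
    with torch.no_grad():
        gt_feats = vgg_features(t_gt)
    pred_feats = vgg_features(t_pred)
    for fp, fg in zip(pred_feats, gt_feats):
        lp = lp + torch.mean(torch.abs(fp - fg))              # perceptual term (assumed L1 error)
        ls = ls + torch.mean(torch.abs(gram(fp) - gram(fg)))  # style term on Gram matrices
    return lp, ls
```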
The target loss in the embodiment of the application comprises reconstruction loss, perception loss and style loss, so that reconstruction distortion, artifacts and tone anomalies of the high dynamic range image are reduced, and the quality of the HDR image generated by the model is further improved.
Further, the image processing capability of the dynamic conversion model proposed in the embodiment of the present application is verified by experimental means.
Collection of data sets: deep learning models rely on large-scale data sets, and no ready-made data set of LDR-HDR image pairs is available. The present application therefore collects images from multiple HDR image data sets and HDR video data, and sets up a virtual camera to capture multiple random areas of a scene using randomly selected camera calibrations. The virtual camera calibration includes parameters for exposure, camera curve, white balance and noise level. The virtual camera parameters are randomly selected, and the camera curve parameters are randomly fitted to a database of camera curves. This provides a set of LDR images and corresponding HDR images, which are used as the input and the ground truth for training, respectively. A set of data enhancement operations is then applied to improve the robustness of the prediction: each HDR image is treated as a real scene, regions of random size and position are selected and cropped, then randomly flipped and resampled to 256×256 pixels. The network trained with these data enhancement functions generalizes well to various images captured by different cameras. The obtained data set is then divided into a training set and a test set. Specifically, two data sets, namely the Fairchild HDR data set and the HDR EYE data set, are collected for testing.
Experimental environment: the hardware is an AMD Ryzen 5 CPU, an NVIDIA GTX 1080 Ti GPU and 16 GB of memory, and the framework is PyTorch.
To illustrate the performance of the proposed method, it is compared with five existing single-image HDR reconstruction techniques, including three conventional non-learning methods (the Akyuz method, the KOV method and the Masia method) and two deep-learning-based methods (ExpandNet and HDRCNN). To evaluate the quality of the reconstructed images obtained by the various single-image HDR reconstruction methods, three objective evaluation metrics are used: PU-PSNR, PU-SSIM and the HDR-VDP Q score.
The perceptually uniform (PU) encoding adopted in the present application converts the luminance values of an HDR image into approximately perceptually uniform pixel values. Among the evaluation metrics, PU-PSNR measures the pixel difference between the predicted image and the reference image, and PU-SSIM measures the structural difference between the predicted image and the reference image from a visual-perception perspective. HDR-VDP is a visual metric that compares a reference image with a test image and predicts the quality of the HDR image relative to the reference image; the quality Q score provided by HDR-VDP is used as the evaluation index.
For these objective metrics, the larger the Q score, PU-PSNR and PU-SSIM values are, the closer the high dynamic range image reconstructed by the model is to the original image, and the higher the reconstruction quality.
Table 1 shows a quantitative comparison between the existing methods and the proposed method for HDR images reconstructed on the HDR EYE data set and the Fairchild data set, where bold indicates the method with the best experimental result and underline indicates the second-best. The method of the present application achieves the best results on the Fairchild data set, a good Q score on the HDR EYE data set, and better performance than the other methods in terms of the PSNR and SSIM metrics on both data sets.
TABLE 1
The Fairchild data set was constructed by the team of Professor Mark D. Fairchild at the University of Rochester and comprises a series of more than 100 HDR images and associated data.
As can be seen from Table 1, the other methods fail to restore texture in the overexposed areas and can lead to color changes, blurring and tiling artifacts. Compared with the method of the present application, the conventional methods cannot eliminate noise or recover the details lost in saturated regions. The model provided by the present application performs well compared with the existing methods: the resulting HDR images have more natural colors and richer details, and noise in low-exposure areas is effectively suppressed.
The embodiment of the application provides a dynamic conversion model, which includes N coding modules connected in series and N decoding modules connected in series, where the output of the last coding module among the N coding modules is connected with the input of the first decoding module among the N decoding modules, and the ith coding module is connected with the N-i+1 th decoding module by a jump connection. The model is trained using LDR training images, and the training process is as follows: inputting the LDR training image into the dynamic conversion model, performing feature extraction on the i-1 th first feature information through the ith coding module to obtain the ith first feature information of the LDR training image, and performing feature extraction on the i-1 th first feature information and the N-i th second feature information of the LDR training image through the N-i+1 th decoding module to obtain the N-i+1 th second feature information of the LDR training image; determining the HDR image predicted value of the LDR training image according to the second feature information of the LDR training image output by the last decoding module among the N decoding modules; and determining the loss between the HDR image predicted value of the LDR training image and the HDR image true value of the LDR training image, and training the dynamic conversion model according to the loss. In subsequent use, the trained dynamic conversion model can be used to convert an LDR image into an HDR image, so that HDR image conversion is realized without increasing the cost of data acquisition, encoding, transmission, storage and the like, and the efficiency of HDR image conversion is improved.
The training process of the dynamic conversion model is described above in connection with the network structure of the dynamic conversion model, and the application process of the dynamic conversion model is described below.
In some embodiments, the dynamic conversion model provided in the embodiments of the present application may be further applied to a video encoding and decoding framework, for example, may be applied to a video decoding end, and perform high-dynamic conversion on a reconstructed image obtained by the decoding end, so as to obtain an HDR image of the reconstructed image.
Fig. 6 is a flowchart of an image decoding method according to an embodiment of the present application, as shown in fig. 6, where the method includes:
S601, decoding the code stream to obtain a reconstructed image.
For example, as shown in fig. 3, the entropy decoding unit 310 may parse the code stream to obtain the prediction information, the quantization coefficient matrix and the like of the current block, and the prediction unit 320 generates a prediction block of the current block using intra prediction or inter prediction based on the prediction information. The inverse quantization/transformation unit 330 performs inverse quantization and inverse transformation on the quantization coefficient matrix obtained from the code stream to obtain a residual block. The reconstruction unit 340 adds the prediction block and the residual block to obtain a reconstructed block. The reconstructed blocks constitute a reconstructed image, and the loop filtering unit 350 performs loop filtering on the reconstructed image, either image-based or block-based, to obtain the final reconstructed image.
In this embodiment, the dynamic conversion model is combined with the video coding framework.
In one example, to facilitate encoding, the input 10-bit HDR data is converted into 8-bit LDR data by a tone mapping (TM) module at the encoding end, then split into CTUs and sent to the encoder, where a code stream is formed through stages such as motion estimation, motion compensation, intra-frame prediction, inter-frame prediction, transformation, quantization, filtering and entropy encoding. The dynamic conversion model described in the above embodiments is added at the output of the decoder to expand the dynamic range of the decoded LDR reconstructed image. Using this model, the quality of the obtained HDR data can be significantly improved, so that the quality of the decoded image is further improved while the code rate is maintained.
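A minimal sketch of attaching a trained dynamic conversion model at the decoder output is shown below; the serialized model file, the tensor layout and the normalization are illustrative assumptions, and the actual decoder interface depends on the codec implementation.

```python
import torch

# Hypothetical: a trained, serialized dynamic conversion model and a decoded 8-bit LDR frame.
model = torch.jit.load("dynamic_conversion_model.pt").eval()  # assumed TorchScript checkpoint
ldr_frame = torch.randint(0, 256, (1, 3, 1080, 1920), dtype=torch.uint8)  # decoder output (RGB)

with torch.no_grad():
    x = ldr_frame.float() / 255.0  # normalize the 8-bit LDR reconstruction to [0, 1]
    hdr_frame = model(x)           # expand the dynamic range of the reconstructed image
# hdr_frame can then be quantized/packed into the target HDR format (e.g. 10-bit).
```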
S602, inputting the reconstructed image into a dynamic conversion model for dynamic conversion to obtain a high dynamic range HDR image of the reconstructed image.
Referring to fig. 5A, the dynamic conversion model includes: the method comprises the steps of connecting N coding modules in series and N decoding modules in series, wherein the output of the last coding module in the N coding modules is connected with the input of the first decoding module in the N decoding modules, the ith coding module is connected with the (N-i+1) th decoding module in a jumping mode, the ith coding module is used for carrying out feature extraction on the (i-1) th first feature information output by the (i-1) th coding module to obtain the (i-1) th first feature information of a reconstructed image, the (N-i+1) th decoding module is used for carrying out feature extraction on the (i-1) th first feature information and the (N-i) th second feature information of the reconstructed image to obtain the (N-i+1) th second feature information of the reconstructed image, i is a positive integer smaller than or equal to N, and N is a positive integer.
The HDR image of the reconstructed image is determined according to the second characteristic information output by the last decoding module in the N decoding modules.
And if i is equal to N, the N-i second characteristic information is determined according to the N first characteristic information output by the N coding module.
If i is smaller than N, the N-i second characteristic information is determined according to the N-i second characteristic information output by the N-i decoding module.
If i is equal to 1, the i-1 th first feature information is determined according to the reconstructed image, for example, the 0 th first feature information is the reconstructed image or a feature map obtained by processing the reconstructed image.
If i is greater than 1, the i-1 th first characteristic information is determined according to the first characteristic information output by the i-1 st coding module.
The embodiment of the application does not limit the specific network structure of the coding module.
In one embodiment, each of the N coding modules includes at least one convolution block, wherein parameters of the convolution blocks included by each of the N coding modules are not exactly the same. For example, the characteristic dimension of the convolution block included in the first encoding module is 64, the characteristic dimension of the convolution block included in the second encoding module is 128, the characteristic dimension of the convolution block included in the third encoding module is 256, the characteristic dimension of the convolution block included in the fourth encoding module is 512, and so on.
The embodiment of the application does not limit the specific network structure of the decoding module.
In one embodiment, each of the N decoding modules includes at least one convolution block, wherein the parameters of the convolution blocks included by each of the N decoding modules are not exactly the same. For example, the characteristic dimension of the convolution block included in the first decoding module is 256, the characteristic dimension of the convolution block included in the second decoding module is 128, the characteristic dimension of the convolution block included in the third decoding module is 64, the characteristic dimension of the convolution block included in the fourth decoding module is 32, and so on.
The network structures of the convolution blocks included in each coding module in the embodiment of the present application may be the same or different. The network structure of the convolution blocks included in each decoding module may be the same or different. In addition, the network structures of the convolution blocks included in the encoding module and the decoding module may be the same or different, which is not limited in this application.
In one possible implementation, the network structure of the convolution blocks included in the encoding module and/or the decoding module includes convolution layer 1, convolution layer 2, convolution layer 3 and an activation function, as shown in fig. 5B.
Optionally, the convolution kernels of convolution layer 1 and convolution layer 2 are 3×3, the convolution kernel of convolution layer 3 is 1×1, and the activation function is a Sigmoid Weighted Linear Unit (abbreviated as SiLU).
The convolution kernel sizes of convolution layers 1, 2 and 3 are not limited to the above values, and the activation function is not limited to the SiLU; for example, a ReLU may also be used, which is not limited in this application.
In some embodiments, as shown in fig. 5C, the dynamic conversion model further includes: a convolution attention module (CBAM) located in the jump connection between the ith coding module and the N-i+1 th decoding module. The attention mechanism of the convolution attention module enables the dynamic conversion model to focus more attention on the relevant parts of the encoder-side features and less attention on irrelevant parts; that is, the convolution attention mechanism improves the performance of the dynamic conversion model by emphasizing important features and suppressing unnecessary ones, thereby greatly improving the efficiency of the model.
In one possible implementation, one or more CBAMs are included in each jump connection between an encoding module and a decoding module.
The convolution attention module is located in jump connection between the ith coding module and the (N-i+1) th decoding module and is used for extracting space information and channel information of the (i-1) th first characteristic information to obtain the (i-1) th third characteristic information of the reconstructed image.
At this time, the (N-i+1) -th decoding module is used for extracting the features of the (i-1) -th third feature information and the (N-i) -th second feature information to obtain the (N-i+1) -th second feature information of the reconstructed image. For example, the (N-i+1) -th decoding module is configured to perform feature extraction on feature information obtained by concatenating the (i-1) -th first feature information and the (N-i) -th second feature information of the reconstructed image, so as to obtain the (N-i+1) -th second feature information of the reconstructed image.
In some embodiments, as shown in fig. 5D, the convolution attention module includes a channel attention module and a spatial attention module.
The channel attention module is used for extracting channel information from the i-1 th first characteristic information to obtain the channel attention information of the i-1 th first characteristic information.
The spatial attention module is used for extracting spatial information from the i-1 th first characteristic information and the channel attention information of the i-1 th first characteristic information to obtain the spatial attention information of the i-1 th first characteristic information.
The i-1 th third feature information of the reconstructed image is determined based on the channel attention information and the spatial attention information of the i-1 th first feature information.
As shown in fig. 5D, the convolution attention module further includes a first multiplication unit. The first multiplication unit is used for multiplying the i-1 th first feature information and the channel attention information of the i-1 th first feature information to obtain the fusion channel feature information of the i-1 th first feature information; in this case, the spatial attention module is used for extracting spatial information from the fusion channel feature information of the i-1 th first feature information to obtain the spatial attention information of the i-1 th first feature information.
With continued reference to FIG. 5D, the convolution attention module further includes a second multiplication unit; the second multiplication unit is used for multiplying the fusion channel characteristic information of the ith-1 first characteristic information and the spatial attention information to obtain the ith-1 third characteristic information of the reconstructed image.
In some embodiments, as shown in fig. 5E, the channel attention module includes: the device comprises a first space compression unit, a second space compression unit and a channel characteristic extraction unit.
The first space compression unit is used for carrying out space dimension compression on the ith-1 first characteristic information to obtain first space compression information of the ith-1 first characteristic information;
the second space compression unit is used for performing space dimension compression on the i-1 th first characteristic information to obtain second space compression information of the i-1 th first characteristic information;
the channel feature extraction unit is used for extracting channel features of the first space compression information of the i-1 th first feature information to obtain the first channel information of the i-1 th first feature information, and extracting channel features of the second space compression information of the i-1 th first feature information to obtain the second channel information of the i-1 th first feature information.
The channel attention information of the i-1 th first feature information is determined based on the first channel information and the second channel information of the i-1 th first feature information.
Optionally, the first spatial compression unit and/or the second spatial compression unit comprises a pooling layer.
Optionally, the first spatial compression unit is a maximum pooling layer and/or the second spatial compression unit is an average pooling layer.
Optionally, the channel feature extraction unit is a multi-layer perceptron MLP.
With continued reference to fig. 5E, the channel attention module further includes: a first addition unit and a first activation function;
the first adding unit is used for adding the first channel information and the second channel information of the i-1 pieces of first characteristic information to obtain fusion channel information of the i-1 pieces of first characteristic information;
the first activation function is used for carrying out nonlinear processing on the fusion channel information of the i-1 first characteristic information to obtain the channel attention information of the i-1 first characteristic information.
In some embodiments, as shown in fig. 5F, the spatial attention module includes: the device comprises a first channel compression unit, a second channel compression unit and a spatial feature extraction unit;
the first channel compression unit is used for carrying out channel dimension compression on the fusion channel characteristic information of the ith-1 first characteristic information to obtain first channel compression information of the ith-1 first characteristic information;
The second channel compression unit is used for carrying out channel dimension compression on the fusion channel characteristic information of the ith-1 first characteristic information to obtain second channel compression information of the ith-1 first characteristic information;
the spatial feature extraction unit is used for extracting spatial features of the first channel compression information and the second channel compression information of the i-1 th first feature information to obtain spatial feature information of the i-1 th first feature information;
the spatial attention information of the i-1 th first feature information is determined based on the spatial feature information of the i-1 th first feature information.
Optionally, the first channel compression unit and/or the second channel compression unit comprises a pooling layer.
Optionally, the first channel compression unit is a maximum pooling layer and/or the second channel compression unit is an average pooling layer.
Optionally, the spatial feature extraction unit is a convolution layer.
With continued reference to FIG. 5F, the spatial attention module further includes a second activation function;
the second activation function is used for carrying out nonlinear processing on the spatial characteristic information of the ith-1 first characteristic information to obtain the spatial attention information of the ith-1 first characteristic information.
Optionally, the spatial dimension of the channel attention information of the i-1 th first feature information is 1×1.
Optionally, the spatial attention information of the i-1 th first feature information has a feature dimension of 1.
In the dynamic conversion model provided by the embodiment of the application, a convolution attention module comprising a channel attention module and a spatial attention module is added to each jump-connection branch, so that channel features and spatial features are learned separately. This further improves the learning of image detail features by the dynamic conversion model, enables the dynamic conversion model to reconstruct more detail features in the image, and thus improves the quality of the HDR image generated by the dynamic conversion model.
In some embodiments, as shown in fig. 5G, the dynamic conversion model further includes at least one downsampling unit; the downsampling unit is used for performing space dimension downsampling on the characteristic information output by the coding module.
Optionally, the downsampling unit is a maximum pooling layer.
In some embodiments, as shown in fig. 5G, the dynamic conversion model further includes at least one upsampling unit; the up-sampling unit is used for performing space dimension up-sampling on the characteristic information output by the decoding module.
Optionally, the upsampling unit is a bilinear interpolation unit.
With continued reference to FIG. 5G, the dynamic transition model also includes a first convolution layer; the first convolution layer is used for extracting features of the reconstructed image to obtain an initial feature map of the reconstructed image, and the initial feature map is respectively input into the first coding module and the first convolution attention module.
With continued reference to FIG. 5G, the dynamic transition model also includes a second convolution layer; the second convolution layer is used for extracting the characteristics of the second characteristic information of the reconstructed image output by the last decoding module and outputting an HDR image of the reconstructed image.
In a specific embodiment of the present application, as shown in fig. 7, the dynamic conversion model includes a first convolution layer, 4 serially connected coding modules, 3 downsampling units, 4 serially connected decoding modules, 3 upsampling units, 4 CBAMs located on the jump connections between the coding modules and the decoding modules, and a second convolution layer. Illustratively, the first convolution layer has a 3×3 convolution kernel and 32 channels (the number of channels can also be understood as the characteristic dimension), and the second convolution layer has a 1×1 convolution kernel, 3 channels and an activation function. The first coding module includes a convolution block with 64 channels, the second coding module includes a convolution block with 128 channels, the third coding module includes a convolution block with 256 channels, and the fourth coding module includes a convolution block with 512 channels. A first downsampling unit is arranged between the first coding module and the second coding module, a second downsampling unit is arranged between the second coding module and the third coding module, and a third downsampling unit is arranged between the third coding module and the fourth coding module; the first, second and third downsampling units are max pooling layers with a 2×2 convolution kernel and a stride of 2. The first decoding module includes a convolution block with 256 channels, the second decoding module includes a convolution block with 128 channels, the third decoding module includes a convolution block with 64 channels, and the fourth decoding module includes a convolution block with 32 channels. A first upsampling unit is arranged between the fourth coding module and the first decoding module, a second upsampling unit is arranged between the first decoding module and the second decoding module, and a third upsampling unit is arranged between the second decoding module and the third decoding module. The first, second and third upsampling units are bilinear interpolation units with an upsampling factor of 2×2; in addition, each upsampling unit further includes a convolution layer, for example, the first upsampling unit is Bilinear Upsample 2×2, Conv 3×3 256, the second upsampling unit is Bilinear Upsample 2×2, Conv 3×3 128, and the third upsampling unit is Bilinear Upsample 2×2, Conv 3×3 64.
Assume the size of the reconstructed image is H×W×3, where H×W denotes the height and width of the reconstructed image and 3 denotes its RGB channels. The reconstructed image is input into the dynamic conversion model shown in fig. 7, and after processing by the first convolution layer an initial feature map of the reconstructed image of size H×W×32 is output. The initial feature map output by the first convolution layer is input into the first coding module and the first CBAM respectively; the convolution block in the first coding module performs convolution processing on the initial feature map to obtain the first first feature information of the reconstructed image, of size H×W×64, which is input into the second CBAM and the first downsampling unit respectively. The first downsampling unit downsamples the first first feature information to H/2×W/2×64 and inputs the sampled first first feature information into the second coding module. The convolution block in the second coding module performs convolution processing on the sampled first first feature information to obtain the second first feature information of the reconstructed image, of size H/2×W/2×128, which is input into the third CBAM and the second downsampling unit respectively. The second downsampling unit downsamples the second first feature information to H/4×W/4×128 and inputs the sampled second first feature information into the third coding module. The convolution block in the third coding module performs convolution processing on the sampled second first feature information to obtain the third first feature information of the reconstructed image, of size H/4×W/4×256, which is input into the fourth CBAM and the third downsampling unit respectively. The third downsampling unit downsamples the third first feature information to H/8×W/8×256 and inputs the sampled third first feature information into the fourth coding module. The convolution block in the fourth coding module performs convolution processing on the sampled third first feature information to obtain the fourth first feature information of the reconstructed image, of size H/8×W/8×512, which is input into the first upsampling unit.
The first upsampling unit upsamples the fourth first feature information to H/4×W/4×256. The fourth CBAM performs feature extraction on the third first feature information and outputs the first third feature information of the reconstructed image. The first third feature information and the upsampled fourth first feature information are concatenated and input into the first decoding module. The first decoding module performs feature extraction on the concatenated first third feature information and upsampled fourth first feature information to obtain the first second feature information of the reconstructed image, which is input into the second upsampling unit. The second upsampling unit upsamples the first second feature information to H/2×W/2×128. The third CBAM performs feature extraction on the second first feature information and outputs the second third feature information of the reconstructed image. The second third feature information and the upsampled first second feature information are concatenated and input into the second decoding module. The second decoding module performs feature extraction on the concatenated second third feature information and upsampled first second feature information to obtain the second second feature information of the reconstructed image, which is input into the third upsampling unit. The third upsampling unit upsamples the second second feature information to H×W×64. The second CBAM performs feature extraction on the first first feature information and outputs the third third feature information of the reconstructed image. The third third feature information and the upsampled second second feature information are concatenated and input into the third decoding module. The third decoding module performs feature extraction on the concatenated third third feature information and upsampled second second feature information to obtain the third second feature information of the reconstructed image. The first CBAM performs feature extraction on the initial feature map of the reconstructed image and outputs the fourth third feature information of the reconstructed image. The fourth third feature information and the third second feature information are concatenated and input into the fourth decoding module. The fourth decoding module performs feature extraction on the concatenated fourth third feature information and third second feature information to obtain the fourth second feature information of the reconstructed image, of size H×W×32, which is input into the second convolution layer. The second convolution layer processes the fourth second feature information and outputs the HDR image of the reconstructed image, of size H×W×3.
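The sizes in this walkthrough can be checked with a dummy input. The snippet below reuses the DynamicConversionModel class sketched after fig. 7, and H = W = 256 is an arbitrary example size (any size divisible by 8 keeps the concatenations aligned).

```python
import torch

model = DynamicConversionModel().eval()   # sketch from the fig. 7 description above
x = torch.randn(1, 3, 256, 256)           # reconstructed image of size H x W x 3
with torch.no_grad():
    hdr = model(x)
print(hdr.shape)                          # torch.Size([1, 3, 256, 256]), i.e. an HDR image of size H x W x 3
```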
The embodiment of the application adopts the dynamic conversion model to convert the reconstructed image with low dynamic range into the image with high dynamic range, and the whole conversion process is simple and the cost is low.
In some embodiments, the initial parameters used when training the dynamic conversion model are the pre-training parameters obtained by pre-training a pre-training model.
In some embodiments, the loss function of the dynamic conversion model includes at least one of a reconstruction loss function, a perceptual loss function, and a style loss function.
In one example, the loss function of the dynamic conversion model is given by the following formula:
Loss = L_1 + λ_s·L_st + λ_p·L_p
where Loss is the loss function of the dynamic conversion model, L_1 is the reconstruction loss function, L_st is the style loss function, L_p is the perceptual loss function, and λ_s and λ_p are hyperparameters.
In one example, the reconstruction loss function of the dynamic conversion model is determined according to the error between the compressed tone mapping value of the HDR image true value and the compressed tone mapping value of the HDR image predicted value, where the compressed tone mapping value of the HDR image predicted value is determined according to a preset compressed tone mapping function and the HDR image predicted value, and the compressed tone mapping value of the HDR image true value is determined according to the compressed tone mapping function and the HDR image true value.
For example, the reconstruction loss function of the dynamic conversion model is determined based on the following formula:
L_1 = ‖T(H) - T(GT)‖_1
where L_1 represents the reconstruction loss function, T(·) denotes the preset compressed tone mapping function applied to x = H or GT, H is the predicted value output by the dynamic conversion model when the dynamic conversion model is trained, GT is the true value of the training image, ‖·‖_1 denotes the L1 norm, and μ is a preset parameter of the compressed tone mapping function.
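A minimal sketch of this reconstruction loss is given below. The μ-law form used for T(·) is an assumption for illustration only; the application only requires a preset compressed tone mapping function with a preset parameter μ.

```python
import math
import torch

def tone_map(x, mu=5000.0):
    # Illustrative compressed tone mapping T(.); the exact form and the value of mu are assumptions.
    return torch.log(1.0 + mu * x) / math.log(1.0 + mu)

def reconstruction_loss(hdr_pred, hdr_gt):
    # L_1 = ||T(H) - T(GT)||_1, mean-reduced over all elements here.
    return torch.mean(torch.abs(tone_map(hdr_pred) - tone_map(hdr_gt)))
```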
In one example, the perceptual loss function of the dynamic conversion model is determined based on the error between a first feature value and a second feature value, where the first feature value is the feature value corresponding to the compressed tone mapping value of the HDR image predicted value in the feature map of the l-th layer of the pre-training model, the second feature value is the feature value corresponding to the compressed tone mapping value of the HDR image true value in the feature map of the l-th layer, the compressed tone mapping value of the HDR image predicted value is determined according to a preset compressed tone mapping function and the HDR image predicted value, and the compressed tone mapping value of the HDR image true value is determined according to the compressed tone mapping function and the HDR image true value.
For example, the perceptual loss function of the dynamic conversion model is determined based on the following formula:
L_p = (1/(C_l·H_l·W_l))·‖φ_l(T(H)) - φ_l(T(GT))‖_1
where L_p represents the perceptual loss function, φ_l denotes the feature map of the l-th layer of the pre-training model, of size C_l×H_l×W_l, and T(·), H, GT and ‖·‖_1 are as defined above.
In one example, the style loss function of the dynamic conversion model is determined based on the error between a first element value and a second element value, where the first element value is the element value corresponding to the compressed tone mapping value of the HDR image predicted value in the Gram matrix of the l-th layer feature map of the pre-training model, the second element value is the element value corresponding to the compressed tone mapping value of the HDR image true value in the Gram matrix, the compressed tone mapping value of the HDR image predicted value is determined according to a preset compressed tone mapping function and the HDR image predicted value, and the compressed tone mapping value of the HDR image true value is determined according to the compressed tone mapping function and the HDR image true value.
For example, the style loss function of the dynamic conversion model is determined based on the following formula:
L_st = (1/K_l)·‖G(φ_l(T(H))) - G(φ_l(T(GT)))‖_1
where L_st denotes the style loss function, G(·) is the Gram matrix of the l-th layer features of the pre-training model, φ_l denotes the feature map of the l-th layer of the pre-training model, of size C_l×H_l×W_l, and K_l = C_l·H_l·W_l.
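A hedged sketch of the perceptual and style terms is given below. The application does not name the pre-training model or the layer l; a fixed VGG-16 feature layer is assumed here purely for illustration, and the ImageNet input normalization usually applied before VGG is omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualStyleLoss(nn.Module):
    def __init__(self, layer_idx=16):
        super().__init__()
        # phi_l: feature map of an assumed l-th layer of an assumed pre-training model (VGG-16).
        self.phi = vgg16(weights='IMAGENET1K_V1').features[:layer_idx].eval()
        for p in self.phi.parameters():
            p.requires_grad_(False)

    @staticmethod
    def gram(f):
        # Gram matrix of the layer features, normalized by K_l = C_l * H_l * W_l.
        b, c, h, w = f.shape
        f = f.view(b, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)

    def forward(self, tm_pred, tm_gt):
        # Inputs are the compressed tone mapping values T(H) and T(GT).
        fp, fg = self.phi(tm_pred), self.phi(tm_gt)
        l_p = torch.mean(torch.abs(fp - fg))                         # perceptual loss L_p
        l_st = torch.mean(torch.abs(self.gram(fp) - self.gram(fg)))  # style loss L_st
        return l_p, l_st
```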
According to the embodiment of the present application, the dynamic conversion model is adopted to convert the reconstructed low dynamic range image into a high dynamic range image, and the whole conversion process is simple and low in cost. In addition, the reconstruction loss, perceptual loss and style loss are set to reduce reconstruction distortion, artifacts and abnormal color tones in the high dynamic range image, which further improves the decoded image quality on the premise of ensuring the code rate.
The application of the dynamic conversion model to the codec system has been described above. The dynamic conversion model may also be applied to other scenarios in which a low dynamic range image is converted into a high dynamic range image.
Fig. 8 is a flowchart of an image processing method according to an embodiment of the present application. As shown in fig. 8, the method includes:
S801, obtaining an LDR image to be processed;
S802, inputting the LDR image into a dynamic conversion model for dynamic conversion to obtain an HDR image of the LDR image.
As shown in fig. 5A, the dynamic conversion model includes: N serially connected coding modules and N serially connected decoding modules, where the output of the last coding module of the N coding modules is connected with the input of the first decoding module of the N decoding modules, and the i-th coding module is skip-connected with the (N-i+1)-th decoding module. The i-th coding module is used for performing feature extraction on the (i-1)-th first feature information output by the (i-1)-th coding module to obtain the i-th first feature information of the LDR image; the (N-i+1)-th decoding module is used for performing feature extraction on the (i-1)-th first feature information and the (N-i)-th second feature information of the LDR image to obtain the (N-i+1)-th second feature information of the LDR image; the HDR image of the LDR image is determined according to the second feature information output by the last decoding module of the N decoding modules; i is a positive integer less than or equal to N.
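A usage sketch of steps S801 and S802 follows. It reuses the DynamicConversionModel class sketched for fig. 7, assumes trained weights are available in a hypothetical file model.pth, and assumes an 8-bit LDR input normalized to [0, 1]; the file names are illustrative only.

```python
import torch
from torchvision.io import read_image, ImageReadMode

model = DynamicConversionModel()
model.load_state_dict(torch.load('model.pth'))              # hypothetical trained weights
model.eval()

ldr = read_image('input_ldr.png', ImageReadMode.RGB).float() / 255.0   # S801: obtain the LDR image to be processed
with torch.no_grad():
    hdr = model(ldr.unsqueeze(0))                            # S802: dynamic conversion to an HDR image
```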
The network structure of the dynamic conversion model may be shown in fig. 5A to 5G, and specifically refer to the description of the foregoing embodiments, which are not repeated herein.
It should be understood that fig. 4-8 are only examples of the present application and should not be construed as limiting the present application.
The preferred embodiments of the present application have been described in detail above with reference to the accompanying drawings, but the present application is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solutions of the present application within the scope of the technical concept of the present application, and all the simple modifications belong to the protection scope of the present application. For example, the specific features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described in detail. As another example, any combination of the various embodiments of the present application may be made without departing from the spirit of the present application, which should also be considered as disclosed herein.
It should be further understood that, in the various method embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present application. In addition, in the embodiment of the present application, the term "and/or" is merely an association relationship describing the association object, which means that three relationships may exist. Specifically, a and/or B may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
The network structure of the dynamic conversion model and the image processing method are described above in conjunction with fig. 4 to 8; the device embodiments of the present application are described in detail below in conjunction with fig. 9 to 12.
Fig. 9 is a schematic block diagram of an image decoding apparatus provided in an embodiment of the present application, which may be the decoder shown in fig. 3, or a component in the decoder, for example, a processor in the decoder.
As shown in fig. 9, the image decoding apparatus 10 may include:
a decoding unit 11, configured to decode the code stream to obtain a reconstructed image;
the processing unit 12 is configured to input the reconstructed image into a dynamic conversion model for dynamic conversion, so as to obtain a high dynamic range HDR image of the reconstructed image;
wherein the dynamic conversion model includes: N serially connected coding modules and N serially connected decoding modules, where the output of the last coding module of the N coding modules is connected with the input of the first decoding module of the N decoding modules, and the i-th coding module is skip-connected with the (N-i+1)-th decoding module. The i-th coding module is used for performing feature extraction on the (i-1)-th first feature information output by the (i-1)-th coding module to obtain the i-th first feature information of the reconstructed image; the (N-i+1)-th decoding module is used for performing feature extraction on the (i-1)-th first feature information and the (N-i)-th second feature information of the reconstructed image to obtain the (N-i+1)-th second feature information of the reconstructed image; the HDR image of the reconstructed image is determined according to the second feature information output by the last decoding module of the N decoding modules; i is a positive integer less than or equal to N.
In one embodiment, the dynamic conversion model further includes: a convolution attention module located on the skip connection between the i-th coding module and the (N-i+1)-th decoding module;
the convolution attention module is used for extracting spatial information and channel information from the (i-1)-th first feature information to obtain the (i-1)-th third feature information of the reconstructed image;
the (N-i+1)-th decoding module is used for performing feature extraction on the (i-1)-th third feature information and the (N-i)-th second feature information to obtain the (N-i+1)-th second feature information of the reconstructed image.
In one embodiment, the convolution attention module includes a channel attention module and a spatial attention module;
the channel attention module is used for extracting channel information from the i-1 th first characteristic information to obtain channel attention information of the i-1 th first characteristic information;
the spatial attention module is used for extracting spatial information from the i-1 th first characteristic information and the channel attention information of the i-1 th first characteristic information to obtain the spatial attention information of the i-1 th first characteristic information;
the i-1 th third feature information of the reconstructed image is determined based on the channel attention information and the spatial attention information of the i-1 th first feature information.
In one embodiment, the convolution attention module further comprises a first multiplication unit;
the first multiplication unit is used for multiplying the i-1 th first characteristic information and the channel attention information of the i-1 th first characteristic information to obtain the fused channel characteristic information of the i-1 th first characteristic information;
the spatial attention module is used for extracting spatial information of the fusion channel characteristic information of the ith-1 first characteristic information to obtain the spatial attention information of the ith-1 first characteristic information.
In one embodiment, the convolution attention module further comprises a second multiplication unit;
the second multiplication unit is used for multiplying the fusion channel characteristic information of the ith-1 th first characteristic information and the spatial attention information to obtain the ith-1 th third characteristic information of the reconstructed image.
In one embodiment, the channel attention module comprises: the device comprises a first space compression unit, a second space compression unit and a channel characteristic extraction unit;
the first space compression unit is used for performing space dimension compression on the i-1 th first characteristic information to obtain first space compression information of the i-1 th first characteristic information;
The second space compression unit is used for performing space dimension compression on the i-1 th first characteristic information to obtain second space compression information of the i-1 th first characteristic information;
the channel characteristic extraction unit is used for extracting channel characteristics of the first space compression information of the i-1 th first characteristic information to obtain first channel information of the i-1 th first characteristic information, and extracting channel characteristics of the second space compression information of the i-1 th first characteristic information to obtain second channel information of the i-1 th first characteristic information;
the channel attention information of the i-1 th first feature information is determined based on the first channel information and the second channel information of the i-1 th first feature information.
In one embodiment, the first spatial compression unit and/or the second spatial compression unit comprises a pooling layer.
In one embodiment, the first spatial compression unit is a maximum pooling layer and/or the second spatial compression unit is an average pooling layer.
In an embodiment, the channel feature extraction unit is a multi-layer perceptron MLP.
In one embodiment, the channel attention module further comprises: a first addition unit and a first activation function;
The first adding unit is used for adding the first channel information and the second channel information of the i-1 first characteristic information to obtain fusion channel information of the i-1 first characteristic information;
the first activation function is used for carrying out nonlinear processing on the fusion channel information of the i-1 first characteristic information to obtain the channel attention information of the i-1 first characteristic information.
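A minimal sketch of this channel attention module is given below, assuming the shared channel feature extraction unit is a two-layer MLP with a reduction ratio of 16 and the first activation function is a sigmoid; these internal choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.max_pool = nn.AdaptiveMaxPool2d(1)      # first space compression unit (max pooling)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)      # second space compression unit (average pooling)
        self.mlp = nn.Sequential(                    # shared channel feature extraction unit (MLP)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        # Add the two channel descriptors (first addition unit), then apply the first activation function.
        attn = torch.sigmoid(self.mlp(self.max_pool(x)) + self.mlp(self.avg_pool(x)))
        return attn                                  # channel attention information, spatial dimension 1x1
```

In this sketch, the fused channel feature information would then be obtained by multiplying the input with this attention map, for example x * ChannelAttention(64)(x) for 64-channel first feature information.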
In one embodiment, the spatial attention module comprises: the device comprises a first channel compression unit, a second channel compression unit and a spatial feature extraction unit;
the first channel compression unit is used for carrying out channel dimension compression on the fusion channel characteristic information of the ith-1 first characteristic information to obtain first channel compression information of the ith-1 first characteristic information;
the second channel compression unit is used for carrying out channel dimension compression on the fusion channel characteristic information of the ith-1 first characteristic information to obtain second channel compression information of the ith-1 first characteristic information;
the spatial feature extraction unit is used for extracting spatial features of the first channel compression information and the second channel compression information of the ith-1 first feature information to obtain spatial feature information of the ith-1 first feature information;
The spatial attention information of the i-1 th first feature information is determined based on the spatial feature information of the i-1 th first feature information.
In one embodiment, the first channel compression unit and/or the second channel compression unit comprises a pooling layer.
In one embodiment, the first channel compression unit is a maximum pooling layer and/or the second channel compression unit is an average pooling layer.
In one embodiment, the spatial feature extraction unit is a convolutional layer.
In one embodiment, the spatial attention module further comprises a second activation function;
the second activation function is used for carrying out nonlinear processing on the spatial characteristic information of the ith-1 first characteristic information to obtain the spatial attention information of the ith-1 first characteristic information.
In one embodiment, the spatial dimension of the channel attention information of the i-1 th first feature information is 1×1.
In one embodiment, the spatial attention information of the i-1 th first feature information has a feature dimension of 1.
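A minimal sketch of this spatial attention module is given below, assuming the spatial feature extraction unit is a single 7×7 convolution and the second activation function is a sigmoid; these internal choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        # Spatial feature extraction unit over the two channel-compressed maps.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, fused):
        # fused: the fused channel feature information of the (i-1)-th first feature information.
        max_map, _ = fused.max(dim=1, keepdim=True)   # first channel compression unit (max pooling over channels)
        avg_map = fused.mean(dim=1, keepdim=True)     # second channel compression unit (average pooling over channels)
        attn = torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))   # second activation function
        return attn                                   # spatial attention information, feature dimension 1
```

Multiplying the fused channel feature information by this map (the second multiplication unit) would then yield the (i-1)-th third feature information.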
In one embodiment, the dynamic conversion model further comprises at least one downsampling unit;
the downsampling unit is used for performing space dimension downsampling on the characteristic information output by the coding module.
In one embodiment, the downsampling unit is a maximum pooling layer.
In one embodiment, the dynamic conversion model further comprises at least one upsampling unit;
the up-sampling unit is used for performing spatial dimension up-sampling on the characteristic information output by the decoding module.
In one embodiment, the upsampling unit is a bilinear interpolation unit.
In one embodiment, each of the N coding modules includes at least one convolution block, wherein parameters of the convolution blocks included by each of the N coding modules are not exactly the same.
In one embodiment, each of the N decoding modules includes at least one convolution block, wherein parameters of the convolution blocks included by each of the N decoding modules are not exactly the same.
In one embodiment, if i is equal to N, the (N-i)-th second feature information is determined according to the N-th first feature information output by the N-th coding module; or,
if i is smaller than N, the (N-i)-th second feature information is determined according to the (N-i)-th second feature information output by the (N-i)-th decoding module; or,
if i is equal to 1, the (i-1)-th first feature information is determined according to the reconstructed image; or,
if i is greater than 1, the (i-1)-th first feature information is determined according to the first feature information output by the (i-1)-th coding module.
In one embodiment, the (N-i+1) -th decoding module is configured to perform feature extraction on the feature information obtained by concatenating the (i-1) -th third feature information and the (N-i) -th second feature information, so as to obtain the (N-i+1) -th second feature information of the reconstructed image.
In one embodiment, the dynamic conversion model further comprises a first convolution layer;
the first convolution layer is used for extracting features of the reconstructed image to obtain an initial feature map of the reconstructed image, and the initial feature map is input into a first coding module and a first convolution attention module respectively.
In one embodiment, the dynamic conversion model further comprises a second convolution layer;
the second convolution layer is used for extracting the characteristics of the second characteristic information of the reconstructed image output by the last decoding module and outputting an HDR image of the reconstructed image.
In one embodiment, the initial parameters used when training the dynamic conversion model are the pre-training parameters obtained by pre-training a pre-training model.
In one embodiment, the loss function of the dynamic conversion model includes at least one of a reconstructed loss function, a perceived loss function, and a style loss function.
In one embodiment, the loss function of the dynamic conversion model is represented by the following formula:
Loss = L_1 + λ_s·L_st + λ_p·L_p
where Loss is the loss function of the dynamic conversion model, L_1 is the reconstruction loss function, L_st is the style loss function, L_p is the perceptual loss function, and λ_s and λ_p are hyperparameters.
In one embodiment, the reconstruction loss function of the dynamic conversion model is determined according to the error between the compressed tone mapping value of the HDR image true value and the compressed tone mapping value of the HDR image predicted value, where the compressed tone mapping value of the HDR image predicted value is determined according to a preset compressed tone mapping function and the HDR image predicted value, and the compressed tone mapping value of the HDR image true value is determined according to the compressed tone mapping function and the HDR image true value.
For example, the reconstruction loss function of the dynamic conversion model is determined based on the following formula:
L_1 = ‖T(H) - T(GT)‖_1
where L_1 represents the reconstruction loss function, T(·) denotes the preset compressed tone mapping function applied to x = H or GT, H is the predicted value output by the dynamic conversion model when the dynamic conversion model is trained, GT is the true value of the training image, ‖·‖_1 represents the L1 norm, and μ is a preset parameter of the compressed tone mapping function.
In an embodiment, the perceptual loss function of the dynamic conversion model is determined based on the error between a first feature value and a second feature value, where the first feature value is the feature value corresponding to the compressed tone mapping value of the HDR image predicted value in the feature map of the l-th layer of the pre-training model, the second feature value is the feature value corresponding to the compressed tone mapping value of the HDR image true value in the feature map of the l-th layer, the compressed tone mapping value of the HDR image predicted value is determined according to a preset compressed tone mapping function and the HDR image predicted value, and the compressed tone mapping value of the HDR image true value is determined according to the compressed tone mapping function and the HDR image true value.
For example, the perceptual loss function of the dynamic conversion model is determined based on the following formula:
L_p = (1/(C_l·H_l·W_l))·‖φ_l(T(H)) - φ_l(T(GT))‖_1
where L_p represents the perceptual loss function, T(·) denotes the preset compressed tone mapping function applied to x = H or GT, H is the predicted value output by the dynamic conversion model when the dynamic conversion model is trained, GT is the true value of the training image, ‖·‖_1 represents the L1 norm, μ is a preset parameter of the compressed tone mapping function, and φ_l represents the feature map of the l-th layer of the pre-training model, of size C_l×H_l×W_l.
In an embodiment, the style loss function of the dynamic conversion model is determined based on the error between a first element value and a second element value, where the first element value is the element value corresponding to the compressed tone mapping value of the HDR image predicted value in the Gram matrix of the l-th layer feature map of the pre-training model, the second element value is the element value corresponding to the compressed tone mapping value of the HDR image true value in the Gram matrix, the compressed tone mapping value of the HDR image predicted value is determined according to a preset compressed tone mapping function and the HDR image predicted value, and the compressed tone mapping value of the HDR image true value is determined according to the compressed tone mapping function and the HDR image true value.
For example, the style loss function of the dynamic conversion model is determined based on the following formula:
L_st = (1/K_l)·‖G(φ_l(T(H))) - G(φ_l(T(GT)))‖_1
where L_st represents the style loss function, G(·) is the Gram matrix of the l-th layer features of the pre-training model, T(·) denotes the preset compressed tone mapping function applied to x = H or GT, H is the predicted value output by the dynamic conversion model when the dynamic conversion model is trained, GT is the HDR true value of the training image, ‖·‖_1 represents the L1 norm, μ is a preset parameter of the compressed tone mapping function, φ_l represents the feature map of the l-th layer of the pre-training model, of size C_l×H_l×W_l, and K_l = C_l·H_l·W_l.
It should be understood that apparatus embodiments and method embodiments may correspond with each other and that similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the apparatus 10 shown in fig. 9 may correspond to a corresponding main body in performing the image decoding method according to the embodiment of the present application, and the foregoing and other operations and/or functions of each unit in the apparatus 10 are respectively for implementing a corresponding flow in the image decoding method, which are not described herein for brevity.
Fig. 10 is a schematic block diagram of an image processing apparatus provided in an embodiment of the present application.
As shown in fig. 10, the image processing apparatus 20 may include:
an acquisition unit 21 for acquiring a low dynamic range LDR image to be processed;
the processing unit 22 is configured to input the LDR image into a dynamic conversion model for dynamic conversion, so as to obtain a high dynamic range HDR image of the LDR image;
wherein the dynamic conversion model includes: N serially connected coding modules and N serially connected decoding modules, where the output of the last coding module of the N coding modules is connected with the input of the first decoding module of the N decoding modules, and the i-th coding module is skip-connected with the (N-i+1)-th decoding module. The i-th coding module is used for performing feature extraction on the (i-1)-th first feature information output by the (i-1)-th coding module to obtain the i-th first feature information of the LDR image; the (N-i+1)-th decoding module is used for performing feature extraction on the (i-1)-th first feature information and the (N-i)-th second feature information of the LDR image to obtain the (N-i+1)-th second feature information of the LDR image; the HDR image of the LDR image is determined according to the second feature information output by the last decoding module of the N decoding modules; i is a positive integer less than or equal to N.
In some embodiments, the dynamic conversion model further includes: a convolution attention module located on the skip connection between the i-th coding module and the (N-i+1)-th decoding module;
the convolution attention module is used for extracting spatial information and channel information from the (i-1)-th first feature information to obtain the (i-1)-th third feature information of the LDR image;
the (N-i+1)-th decoding module is used for performing feature extraction on the (i-1)-th third feature information and the (N-i)-th second feature information to obtain the (N-i+1)-th second feature information of the LDR image.
In some embodiments, the convolution attention module includes a channel attention module and a spatial attention module;
the channel attention module is used for extracting channel information from the i-1 th first characteristic information to obtain channel attention information of the i-1 th first characteristic information;
the spatial attention module is used for extracting spatial information from the i-1 th first characteristic information and the channel attention information of the i-1 th first characteristic information to obtain the spatial attention information of the i-1 th first characteristic information;
the i-1 th third feature information of the LDR image is determined according to the channel attention information and the spatial attention information of the i-1 th first feature information.
In some embodiments, the convolution attention module further comprises a first multiplication unit;
the first multiplication unit is used for multiplying the i-1 th first characteristic information and the channel attention information of the i-1 th first characteristic information to obtain the fused channel characteristic information of the i-1 th first characteristic information;
the spatial attention module is used for extracting spatial information of the fusion channel characteristic information of the ith-1 first characteristic information to obtain the spatial attention information of the ith-1 first characteristic information.
In some embodiments, the convolution attention module further comprises a second multiplication unit;
the second multiplication unit is used for multiplying the fusion channel characteristic information of the ith-1 first characteristic information and the spatial attention information to obtain the ith-1 third characteristic information of the LDR image.
In some embodiments, the channel attention module comprises: the device comprises a first space compression unit, a second space compression unit and a channel characteristic extraction unit;
the first space compression unit is used for performing space dimension compression on the i-1 th first characteristic information to obtain first space compression information of the i-1 th first characteristic information;
The second space compression unit is used for performing space dimension compression on the i-1 th first characteristic information to obtain second space compression information of the i-1 th first characteristic information;
the channel characteristic extraction unit is used for extracting channel characteristics of the first space compression information of the i-1 th first characteristic information to obtain first channel information of the i-1 th first characteristic information, and extracting channel characteristics of the second space compression information of the i-1 th first characteristic information to obtain second channel information of the i-1 th first characteristic information;
the channel attention information of the i-1 th first feature information is determined based on the first channel information and the second channel information of the i-1 th first feature information.
In some embodiments, the first spatial compression unit and/or the second spatial compression unit comprises a pooling layer.
In some embodiments, the first spatial compression unit is a maximum pooling layer and/or the second spatial compression unit is an average pooling layer.
In some embodiments, the channel feature extraction unit is a multi-layer perceptron MLP.
In some embodiments, the channel attention module further comprises: a first addition unit and a first activation function;
The first adding unit is used for adding the first channel information and the second channel information of the i-1 first characteristic information to obtain fusion channel information of the i-1 first characteristic information;
the first activation function is used for carrying out nonlinear processing on the fusion channel information of the i-1 first characteristic information to obtain the channel attention information of the i-1 first characteristic information.
In some embodiments, the spatial attention module comprises: the device comprises a first channel compression unit, a second channel compression unit and a spatial feature extraction unit;
the first channel compression unit is used for carrying out channel dimension compression on the fusion channel characteristic information of the ith-1 first characteristic information to obtain first channel compression information of the ith-1 first characteristic information;
the second channel compression unit is used for carrying out channel dimension compression on the fusion channel characteristic information of the ith-1 first characteristic information to obtain second channel compression information of the ith-1 first characteristic information;
the spatial feature extraction unit is used for extracting spatial features of the first channel compression information and the second channel compression information of the ith-1 first feature information to obtain spatial feature information of the ith-1 first feature information;
The spatial attention information of the i-1 th first feature information is determined based on the spatial feature information of the i-1 th first feature information.
In some embodiments, the first channel compression unit and/or the second channel compression unit comprises a pooling layer.
In some embodiments, the first channel compression unit is a maximum pooling layer and/or the second channel compression unit is an average pooling layer.
In some embodiments, the spatial feature extraction unit is a convolutional layer.
In some embodiments, the spatial attention module further comprises a second activation function;
the second activation function is used for carrying out nonlinear processing on the spatial characteristic information of the ith-1 first characteristic information to obtain the spatial attention information of the ith-1 first characteristic information.
In some embodiments, the spatial dimension of the channel attention information of the i-1 th first feature information is 1×1.
In some embodiments, the spatial attention information of the i-1 th first feature information has a feature dimension of 1.
In some embodiments, the dynamic conversion model further comprises at least one downsampling unit;
the downsampling unit is used for performing space dimension downsampling on the characteristic information output by the coding module.
In some embodiments, the downsampling unit is a maximum pooling layer.
In some embodiments, the dynamic conversion model further comprises at least one upsampling unit;
the up-sampling unit is used for performing spatial dimension up-sampling on the characteristic information output by the decoding module.
In some embodiments, the upsampling unit is a bilinear interpolation unit.
In some embodiments, each of the N coding modules includes at least one convolution block, wherein parameters of the convolution blocks included by each of the N coding modules are not exactly the same.
In some embodiments, each of the N decoding modules includes at least one convolution block, wherein parameters of the convolution blocks included by each of the N decoding modules are not exactly the same.
In some embodiments, if i is equal to N, the (N-i)-th second feature information is determined according to the N-th first feature information output by the N-th coding module; or,
if i is smaller than N, the (N-i)-th second feature information is determined according to the (N-i)-th second feature information output by the (N-i)-th decoding module; or,
if i is equal to 1, the (i-1)-th first feature information is determined according to the LDR image; or,
if i is greater than 1, the (i-1)-th first feature information is determined according to the first feature information output by the (i-1)-th coding module.
In some embodiments, the (N-i+1)-th decoding module is configured to perform feature extraction on the feature information obtained by concatenating the (i-1)-th third feature information and the (N-i)-th second feature information, so as to obtain the (N-i+1)-th second feature information of the LDR image.
In some embodiments, the dynamic conversion model further comprises a first convolution layer;
the first convolution layer is used for extracting features of the LDR image to obtain an initial feature map of the LDR image, and the initial feature map is input into a first coding module and a first convolution attention module respectively.
In some embodiments, the dynamic conversion model further comprises a second convolution layer;
the second convolution layer is used for extracting the characteristics of the second characteristic information of the LDR image output by the last decoding module and outputting an HDR image of the LDR image.
In some embodiments, the initial parameters used when training the dynamic conversion model are the pre-training parameters obtained by pre-training a pre-training model.
In some embodiments, the loss function of the dynamic conversion model includes at least one of a reconstructed loss function, a perceived loss function, and a style loss function.
In some embodiments, the loss function of the dynamic conversion model is as follows:
Loss = L_1 + λ_s·L_st + λ_p·L_p
where Loss is the loss function of the dynamic conversion model, L_1 is the reconstruction loss function, L_st is the style loss function, L_p is the perceptual loss function, and λ_s and λ_p are hyperparameters.
In some embodiments, the reconstruction loss function of the dynamic conversion model is determined according to the error between the compressed tone mapping value of the HDR image true value and the compressed tone mapping value of the HDR image predicted value, where the compressed tone mapping value of the HDR image predicted value is determined according to a preset compressed tone mapping function and the HDR image predicted value, and the compressed tone mapping value of the HDR image true value is determined according to the compressed tone mapping function and the HDR image true value.
For example, the reconstruction loss function of the dynamic conversion model is determined based on the following formula:
L_1 = ‖T(H) - T(GT)‖_1
where L_1 represents the reconstruction loss function, T(·) denotes the preset compressed tone mapping function applied to x = H or GT, H is the predicted value output by the dynamic conversion model when the dynamic conversion model is trained, GT is the true value of the training image, ‖·‖_1 represents the L1 norm, and μ is a preset parameter of the compressed tone mapping function.
In some embodiments, the perceptual loss function of the dynamic conversion model is determined based on the error between a first feature value and a second feature value, where the first feature value is the feature value corresponding to the compressed tone mapping value of the HDR image predicted value in the feature map of the l-th layer of the pre-training model, the second feature value is the feature value corresponding to the compressed tone mapping value of the HDR image true value in the feature map of the l-th layer, the compressed tone mapping value of the HDR image predicted value is determined according to a preset compressed tone mapping function and the HDR image predicted value, and the compressed tone mapping value of the HDR image true value is determined according to the compressed tone mapping function and the HDR image true value.
For example, the perceptual loss function of the dynamic conversion model is determined based on the following formula:
L_p = (1/(C_l·H_l·W_l))·‖φ_l(T(H)) - φ_l(T(GT))‖_1
where L_p represents the perceptual loss function, T(·) denotes the preset compressed tone mapping function applied to x = H or GT, H is the predicted value output by the dynamic conversion model when the dynamic conversion model is trained, GT is the true value of the training image, ‖·‖_1 represents the L1 norm, μ is a preset parameter of the compressed tone mapping function, and φ_l represents the feature map of the l-th layer of the pre-training model, of size C_l×H_l×W_l.
In some embodiments, the style loss function of the dynamic conversion model is determined based on the error between a first element value and a second element value, where the first element value is the element value corresponding to the compressed tone mapping value of the HDR image predicted value in the Gram matrix of the l-th layer feature map of the pre-training model, the second element value is the element value corresponding to the compressed tone mapping value of the HDR image true value in the Gram matrix, the compressed tone mapping value of the HDR image predicted value is determined according to a preset compressed tone mapping function and the HDR image predicted value, and the compressed tone mapping value of the HDR image true value is determined according to the compressed tone mapping function and the HDR image true value.
For example, the style loss function of the dynamic conversion model is determined based on the following formula:
L_st = (1/K_l)·‖G(φ_l(T(H))) - G(φ_l(T(GT)))‖_1
where L_st represents the style loss function, G(·) is the Gram matrix of the l-th layer features of the pre-training model, T(·) denotes the preset compressed tone mapping function applied to x = H or GT, H is the predicted value output by the dynamic conversion model when the dynamic conversion model is trained, GT is the HDR true value of the training image, ‖·‖_1 represents the L1 norm, μ is a preset parameter of the compressed tone mapping function, φ_l represents the feature map of the l-th layer of the pre-training model, of size C_l×H_l×W_l, and K_l = C_l·H_l·W_l.
It should be understood that apparatus embodiments and method embodiments may correspond with each other and that similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the apparatus 20 shown in fig. 10 may correspond to a corresponding main body in performing the image processing method in the embodiment of the present application, and the foregoing and other operations and/or functions of each unit in the apparatus 20 are respectively for implementing a corresponding flow in the image processing method, which are not described herein for brevity.
Fig. 11 is a schematic block diagram of a model training apparatus provided in an embodiment of the present application.
As shown in fig. 11, the model training apparatus 40 includes:
an obtaining unit 41, configured to obtain a low dynamic range LDR training image and a high dynamic range HDR image truth value of the LDR training image;
the processing unit 42 is configured to input the LDR training image into a dynamic conversion model, and perform feature extraction on the (i-1)-th first feature information through the i-th coding module to obtain the i-th first feature information of the LDR training image, where the dynamic conversion model includes N serially connected coding modules and N serially connected decoding modules, the output of the last coding module of the N coding modules is connected with the input of the first decoding module of the N decoding modules, the i-th coding module is skip-connected with the (N-i+1)-th decoding module, i is a positive integer less than or equal to N, and N is a positive integer; perform feature extraction on the (i-1)-th first feature information and the (N-i)-th second feature information of the LDR training image through the (N-i+1)-th decoding module to obtain the (N-i+1)-th second feature information of the LDR training image; determine an HDR image predicted value of the LDR training image according to the second feature information of the LDR training image output by the last decoding module of the N decoding modules; and determine a loss between the HDR image predicted value of the LDR training image and the HDR image true value of the LDR training image, and train the dynamic conversion model according to the loss.
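A hedged training-loop sketch for this apparatus is given below. It reuses the DynamicConversionModel, tone_map, reconstruction_loss and PerceptualStyleLoss sketches from earlier in this document; the optimizer, learning rate, hyperparameter values and the train_loader data source are illustrative assumptions, not settings stated by the application.

```python
import torch

model = DynamicConversionModel()
percep_style = PerceptualStyleLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
lambda_s, lambda_p = 1e-2, 1e-1                      # hyperparameters lambda_s, lambda_p (illustrative values)

for ldr, hdr_gt in train_loader:                     # assumed DataLoader of (LDR training image, HDR true value) pairs
    hdr_pred = model(ldr)                            # forward pass through the coding/decoding modules
    l1 = reconstruction_loss(hdr_pred, hdr_gt)       # reconstruction loss L_1
    l_p, l_st = percep_style(tone_map(hdr_pred), tone_map(hdr_gt))
    loss = l1 + lambda_s * l_st + lambda_p * l_p     # Loss = L_1 + lambda_s*L_st + lambda_p*L_p
    opt.zero_grad()
    loss.backward()
    opt.step()
```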
In one embodiment, the dynamic conversion model further includes: a convolution attention module located on the skip connection between the i-th coding module and the (N-i+1)-th decoding module. The processing unit 42 is specifically configured to extract spatial information and channel information from the (i-1)-th first feature information by using the convolution attention module, so as to obtain the (i-1)-th third feature information of the LDR training image; and perform feature extraction on the (i-1)-th third feature information and the (N-i)-th second feature information through the (N-i+1)-th decoding module to obtain the (N-i+1)-th second feature information of the LDR training image.
In some embodiments, the convolution attention module includes a channel attention module and a spatial attention module, and the processing unit 42 is specifically configured to perform channel information extraction on the i-1 th first feature information by using the channel attention module to obtain channel attention information of the i-1 th first feature information; extracting spatial information of fusion channel characteristic information of the ith-1 first characteristic information through the spatial attention module to obtain spatial attention information of the ith-1 first characteristic information, wherein the fusion channel characteristic information of the ith-1 first characteristic information is determined according to the ith-1 first characteristic information and the channel attention information of the ith-1 first characteristic information; and determining the ith-1 third characteristic information of the LDR training image according to the channel attention information and the space attention information of the ith-1 first characteristic information.
In some embodiments, the convolution attention module further includes a first multiplication unit, and the processing unit 42 is further configured to multiply the i-1 th first feature information and the i-1 st first feature information channel attention information by the first multiplication unit to obtain the i-1 st first feature information fused channel feature information.
In some embodiments, the convolution attention module further includes a second multiplication unit, and the processing unit 42 is specifically configured to multiply the fused channel feature information of the i-1 th first feature information and the spatial attention information by the second multiplication unit to obtain the i-1 th third feature information of the LDR training image.
In some embodiments, the channel attention module includes: a first space compression unit, a second space compression unit and a channel feature extraction unit. The processing unit 42 is specifically configured to perform space dimension compression on the (i-1)-th first feature information through the first space compression unit to obtain first space compression information of the (i-1)-th first feature information; perform space dimension compression on the (i-1)-th first feature information through the second space compression unit to obtain second space compression information of the (i-1)-th first feature information; perform channel feature extraction on the first space compression information of the (i-1)-th first feature information through the channel feature extraction unit to obtain first channel information of the (i-1)-th first feature information; perform channel feature extraction on the second space compression information of the (i-1)-th first feature information through the channel feature extraction unit to obtain second channel information of the (i-1)-th first feature information; and determine the channel attention information of the (i-1)-th first feature information according to the first channel information and the second channel information of the (i-1)-th first feature information.
In some embodiments, the first spatial compression unit and/or the second spatial compression unit comprises a pooling layer.
In some embodiments, the first spatial compression unit is a maximum pooling layer and/or the second spatial compression unit is an average pooling layer.
In some embodiments, the channel feature extraction unit is a multi-layer perceptron MLP.
In some embodiments, the channel attention module further includes: a first addition unit and a first activation function. The processing unit 42 is specifically configured to add the first channel information and the second channel information of the (i-1)-th first feature information through the first addition unit to obtain fused channel information of the (i-1)-th first feature information; and perform nonlinear processing on the fused channel information of the (i-1)-th first feature information through the first activation function to obtain the channel attention information of the (i-1)-th first feature information.
In some embodiments, the spatial attention module includes: a first channel compression unit, a second channel compression unit and a spatial feature extraction unit. The processing unit 42 is specifically configured to perform channel dimension compression on the fused channel feature information of the (i-1)-th first feature information through the first channel compression unit to obtain first channel compression information of the (i-1)-th first feature information; perform channel dimension compression on the fused channel feature information of the (i-1)-th first feature information through the second channel compression unit to obtain second channel compression information of the (i-1)-th first feature information; perform spatial feature extraction on the first channel compression information and the second channel compression information of the (i-1)-th first feature information through the spatial feature extraction unit to obtain spatial feature information of the (i-1)-th first feature information; and determine the spatial attention information of the (i-1)-th first feature information according to the spatial feature information of the (i-1)-th first feature information.
In some embodiments, the first channel compression unit and/or the second channel compression unit comprises a pooling layer.
In some embodiments, the first channel compression unit is a maximum pooling layer and/or the second channel compression unit is an average pooling layer.
In some embodiments, the spatial feature extraction unit is a convolutional layer.
In some embodiments, the spatial attention module further includes a second activation function, and the processing unit 42 is specifically configured to perform nonlinear processing on the spatial feature information of the i-1 th first feature information by using the second activation function to obtain the spatial attention information of the i-1 th first feature information.
In some embodiments, the spatial dimension of the channel attention information of the i-1 th first feature information is 1×1.
In some embodiments, the spatial attention information of the i-1 th first feature information has a feature dimension of 1.
In some embodiments, the dynamic conversion model further includes at least one downsampling unit, and the processing unit 42 is further configured to spatially dimensionally downsample the feature information output by the encoding module through the downsampling unit.
Optionally, the downsampling unit is a maximum pooling layer.
In some embodiments, the dynamic conversion model further includes at least one upsampling unit, and the processing unit 42 is further configured to spatially upsample the feature information output by the decoding module through the upsampling unit.
Optionally, the up-sampling unit is a bilinear interpolation unit.
Optionally, each of the N coding modules includes at least one convolution block, where parameters of the convolution blocks included in each of the N coding modules are not identical.
Optionally, each of the N decoding modules includes at least one convolution block, where parameters of the convolution block included in each of the N decoding modules are not identical.
In some embodiments, if i is equal to N, the (N-i)-th second feature information is determined according to the N-th first feature information output by the N-th coding module; or, if i is smaller than N, the (N-i)-th second feature information is determined according to the (N-i)-th second feature information output by the (N-i)-th decoding module; or, if i is equal to 1, the (i-1)-th first feature information is determined according to the LDR training image; or, if i is greater than 1, the (i-1)-th first feature information is determined according to the first feature information output by the (i-1)-th coding module.
In some embodiments, the processing unit 42 is specifically configured to concatenate the i-1 th third feature information and the N-i th second feature information; inputting the cascaded characteristic information into the (N-i+1) th decoding module for characteristic extraction to obtain the (N-i+1) th second characteristic information of the LDR training image.
In some embodiments, the dynamic conversion model further includes a first convolution layer, and the processing unit 42 is further configured to perform feature extraction on the LDR training image through the first convolution layer to obtain an initial feature map of the LDR training image; and respectively inputting the initial feature map into a first coding module and a first convolution attention module to obtain first feature information output by the first coding module and first third feature information output by the first convolution attention module.
In some embodiments, the dynamic conversion model further includes a second convolution layer, and the processing unit 42 is specifically configured to perform feature extraction on the second feature information of the LDR training image output by the last decoding module through the second convolution layer, and output an HDR image prediction value of the LDR training image.
In some embodiments, the processing unit 42 is further configured to obtain pre-training parameters obtained by the pre-training model during pre-training; and determining the pre-training parameters as initial parameters of the dynamic conversion model.
In some embodiments, the processing unit 42 is specifically configured to determine, according to a preset loss function, a target loss between the HDR image predicted value of the LDR training image and the HDR image true value of the LDR training image.
In some embodiments, the preset loss function includes at least one of a reconstruction loss function, a perceptual loss function, and a style loss function.
In some embodiments, the processing unit 42 is specifically configured to determine the reconstruction loss between the HDR image predicted value and the HDR image true value; determine the perceptual loss between the HDR image predicted value and the HDR image true value; determine the style loss between the HDR image predicted value and the HDR image true value; and determine the target loss between the HDR image predicted value and the HDR image true value according to the reconstruction loss, the perceptual loss and the style loss between the HDR image predicted value and the HDR image true value.
In some embodiments, the processing unit 42 is specifically configured to determine the target loss between the HDR image predicted value and the HDR image true value according to the following formula:
Loss = L_1 + λ_s·L_st + λ_p·L_p
where Loss is the target loss, L_1 is the reconstruction loss, L_st is the style loss, L_p is the perceptual loss, and λ_s and λ_p are hyperparameters.
In some embodiments, the processing unit 42 is specifically configured to determine the compressed tone mapping value of the HDR image predicted value according to a preset compressed tone mapping function; determine the compressed tone mapping value of the HDR image true value according to the compressed tone mapping function; and determine the reconstruction loss according to the error between the compressed tone mapping value of the HDR image true value and the compressed tone mapping value of the HDR image predicted value.
For example, the reconstruction loss is determined according to the following formula:
L1 = ‖T(H) − T(GT)‖1
wherein L1 represents the reconstruction loss, T(·) is the preset compressed tone mapping function applied to x = H or x = GT, H is the HDR image predicted value output by the dynamic conversion model, GT is the HDR image true value, "‖·‖1" denotes the L1 norm, and μ is a preset parameter of the compressed tone mapping function.
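A sketch of this term is shown below. The patent does not spell out the compressed tone mapping function here, so a common μ-law style compression T(x) = log(1 + μx) / log(1 + μ) is assumed purely for illustration, as is the mean-reduced L1 error:

```python
import math
import torch

def compress_tone_map(x: torch.Tensor, mu: float = 5000.0) -> torch.Tensor:
    # Assumed mu-law style compression; the patent only states that T has a preset parameter mu.
    return torch.log(1.0 + mu * x) / math.log(1.0 + mu)

def reconstruction_loss(hdr_pred: torch.Tensor, hdr_gt: torch.Tensor, mu: float = 5000.0) -> torch.Tensor:
    # L1 = || T(H) - T(GT) ||_1, implemented here as a mean-reduced absolute error
    return torch.mean(torch.abs(compress_tone_map(hdr_pred, mu) - compress_tone_map(hdr_gt, mu)))
```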
In some embodiments, the processing unit 42 is specifically configured to obtain a feature map of a first layer of the pre-training model; determine a compressed tone mapping value of the HDR image predicted value according to a preset compressed tone mapping function; determine a compressed tone mapping value of the HDR image true value according to the compressed tone mapping function; determine first feature values, in the feature map of the first layer, corresponding to the compressed tone mapping value of the HDR image predicted value; determine second feature values, in the feature map of the first layer, corresponding to the compressed tone mapping value of the HDR image true value; and determine the perceptual loss according to an error between the first feature values and the second feature values.
For example, the perceptual loss is determined according to the following formula:
Lst = ‖φl(T(H)) − φl(T(GT))‖1
wherein Lst represents the perceptual loss, T(·) is the preset compressed tone mapping function applied to x = H or x = GT, H is the HDR image predicted value output by the dynamic conversion model, GT is the HDR image true value, "‖·‖1" represents the L1 norm, μ is a preset parameter of the compressed tone mapping function, and φl represents the feature map of the first layer of the pre-training model, the size of the feature map being Cl×Hl×Wl.
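The sketch below mirrors that computation, reusing compress_tone_map from the reconstruction-loss sketch above. The feature_extractor argument stands in for one layer of the pre-training model (the patent does not name a specific network), and normalizing by Cl·Hl·Wl is an assumption suggested by the stated feature-map size:

```python
import torch
import torch.nn as nn

def perceptual_loss(hdr_pred: torch.Tensor, hdr_gt: torch.Tensor,
                    feature_extractor: nn.Module, mu: float = 5000.0) -> torch.Tensor:
    # Compare first-layer features of the tone-mapped prediction and truth under an L1 error.
    feat_pred = feature_extractor(compress_tone_map(hdr_pred, mu))
    feat_gt = feature_extractor(compress_tone_map(hdr_gt, mu))
    c, h, w = feat_pred.shape[-3:]
    return torch.sum(torch.abs(feat_pred - feat_gt)) / (c * h * w)  # assumed normalization
```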
In some embodiments, the processing unit 42 is specifically configured to obtain a Gram matrix of the first-layer feature map of the pre-training model; determine a compressed tone mapping value of the HDR image predicted value according to a preset compressed tone mapping function; determine a compressed tone mapping value of the HDR image true value according to the compressed tone mapping function; determine first element values, in the Gram matrix, corresponding to the compressed tone mapping value of the HDR image predicted value; determine second element values, in the Gram matrix, corresponding to the compressed tone mapping value of the HDR image true value; and determine the style loss according to an error between the first element values and the second element values.
For example, the style loss is determined according to the following formula:
Lp = ‖G(φl(T(H))) − G(φl(T(GT)))‖1
wherein Lp represents the style loss, G(·) is the Gram matrix of the first-layer features of the pre-training model, T(·) is the preset compressed tone mapping function applied to x = H or x = GT, H is the HDR image predicted value output by the dynamic conversion model, GT is the HDR image true value, "‖·‖1" represents the L1 norm, μ is a preset parameter of the compressed tone mapping function, φl represents the feature map of the first layer of the pre-training model, the size of the feature map being Cl×Hl×Wl, and Kl denotes Cl·Hl·Wl.
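A corresponding sketch of the Gram-matrix term follows, again reusing compress_tone_map and a placeholder feature_extractor; dividing by Cl·Hl·Wl (the Kl factor mentioned above) inside gram_matrix is an assumption made for numerical convenience:

```python
import torch
import torch.nn as nn

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    # feat: (B, C, H, W) feature map from one layer of the pre-training model
    b, c, h, w = feat.shape
    flat = feat.reshape(b, c, h * w)
    return torch.bmm(flat, flat.transpose(1, 2)) / (c * h * w)  # (B, C, C), assumed K_l normalization

def style_loss(hdr_pred: torch.Tensor, hdr_gt: torch.Tensor,
               feature_extractor: nn.Module, mu: float = 5000.0) -> torch.Tensor:
    # L1 error between Gram matrices of the tone-mapped prediction and ground truth
    gram_pred = gram_matrix(feature_extractor(compress_tone_map(hdr_pred, mu)))
    gram_gt = gram_matrix(feature_extractor(compress_tone_map(hdr_gt, mu)))
    return torch.mean(torch.abs(gram_pred - gram_gt))
```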
It should be understood that apparatus embodiments and method embodiments may correspond with each other and that similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the apparatus 40 shown in fig. 11 may correspond to a corresponding main body in performing the model training method in the embodiment of the present application, and the foregoing and other operations and/or functions of each unit in the apparatus 40 are respectively for implementing corresponding flows in each method such as the model training method, and are not described herein for brevity.
The apparatus and system of embodiments of the present application are described above in terms of functional units in conjunction with the accompanying drawings. It should be understood that the functional units may be implemented in hardware, or in instructions in software, or in a combination of hardware and software units. Specifically, each step of the method embodiments in the embodiments of the present application may be implemented by an integrated logic circuit of hardware in a processor and/or an instruction in software form, and the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented as a hardware decoding processor or implemented by a combination of hardware and software units in the decoding processor. Alternatively, the software elements may reside in a well-established storage medium in the art such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, and the like. The storage medium is located in a memory, and the processor reads information in the memory, and in combination with hardware, performs the steps in the above method embodiments.
Fig. 12 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
As shown in fig. 12, the electronic device 30 may be an image processing device, or a decoder, or a model training device according to the embodiment of the present application, and the electronic device 30 may include:
a memory 33 and a processor 32, the memory 33 being adapted to store a computer program 34 and to transmit the program code 34 to the processor 32. In other words, the processor 32 may call and run the computer program 34 from the memory 33 to implement the methods in embodiments of the present application.
For example, the processor 32 may be configured to perform the steps of the method 200 described above in accordance with instructions in the computer program 34.
In some embodiments of the present application, the processor 32 may include, but is not limited to:
a general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
In some embodiments of the present application, the memory 33 includes, but is not limited to:
Volatile memory and/or nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In some embodiments of the present application, the computer program 34 may be partitioned into one or more units that are stored in the memory 33 and executed by the processor 32 to perform the methods provided herein. The one or more elements may be a series of computer program instruction segments capable of performing the specified functions, which instruction segments describe the execution of the computer program 34 in the electronic device 30.
As shown in fig. 12, the electronic device 30 may further include:
a transceiver 33, the transceiver 33 being connectable to the processor 32 or the memory 33.
The processor 32 may control the transceiver 33 to communicate with other devices, and in particular, may send information or data to other devices or receive information or data sent by other devices. The transceiver 33 may include a transmitter and a receiver. The transceiver 33 may further include antennas, the number of which may be one or more.
It will be appreciated that the various components in the electronic device 30 are connected by a bus system that includes, in addition to a data bus, a power bus, a control bus, and a status signal bus.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. Alternatively, embodiments of the present application also provide a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of the method embodiments described above.
When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, functional units in various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (107)

  1. An image decoding method, comprising:
    decoding the code stream to obtain a reconstructed image;
    inputting the reconstructed image into a dynamic conversion model for dynamic conversion to obtain a high dynamic range HDR image of the reconstructed image;
    wherein the dynamic conversion model includes N coding modules and N decoding modules which are connected in series, the output of the last coding module in the N coding modules is connected with the input of the first decoding module in the N decoding modules, the ith coding module is connected with the (N-i+1)th decoding module in a jumping manner, the ith coding module is used for carrying out feature extraction on the (i-1)th first feature information output by the (i-1)th coding module to obtain the ith first feature information of the reconstructed image, the (N-i+1)th decoding module is used for carrying out feature extraction on the (i-1)th first feature information and the (N-i)th second feature information of the reconstructed image to obtain the (N-i+1)th second feature information of the reconstructed image, the HDR image of the reconstructed image is determined according to the second feature information output by the last decoding module in the N decoding modules, and i is a positive integer less than or equal to N.
  2. The method of claim 1, wherein the dynamic transformation model further comprises: a convolution attention module located in a jump connection of the ith coding module and the N-i+1 decoding module;
    The convolution attention module is used for extracting space information and channel information of the ith-1 first characteristic information to obtain the ith-1 third characteristic information of the reconstructed image;
    the (N-i+1) th decoding module is used for carrying out feature extraction on the (i-1) th third feature information and the (N-i) th second feature information to obtain the (N-i+1) th second feature information of the reconstructed image.
  3. The method of claim 2, wherein the convolution attention module comprises a channel attention module and a spatial attention module;
    the channel attention module is used for extracting channel information from the i-1 th first characteristic information to obtain channel attention information of the i-1 th first characteristic information;
    the spatial attention module is used for extracting spatial information from the i-1 th first characteristic information and the channel attention information of the i-1 th first characteristic information to obtain the spatial attention information of the i-1 th first characteristic information;
    the i-1 th third feature information of the reconstructed image is determined based on the channel attention information and the spatial attention information of the i-1 th first feature information.
  4. A method according to claim 3, wherein the convolution attention module further comprises a first multiplication unit;
    the first multiplication unit is used for multiplying the i-1 th first characteristic information and the channel attention information of the i-1 th first characteristic information to obtain the fused channel characteristic information of the i-1 th first characteristic information;
    the spatial attention module is used for extracting spatial information of the fusion channel characteristic information of the ith-1 first characteristic information to obtain the spatial attention information of the ith-1 first characteristic information.
  5. The method of claim 4, wherein the convolution attention module further comprises a second multiplication unit;
    the second multiplication unit is used for multiplying the fusion channel characteristic information of the ith-1 th first characteristic information and the spatial attention information to obtain the ith-1 th third characteristic information of the reconstructed image.
  6. A method according to claim 3, wherein the channel attention module comprises: a first space compression unit, a second space compression unit and a channel characteristic extraction unit;
    the first space compression unit is used for performing space dimension compression on the i-1 th first characteristic information to obtain first space compression information of the i-1 th first characteristic information;
    The second space compression unit is used for performing space dimension compression on the i-1 th first characteristic information to obtain second space compression information of the i-1 th first characteristic information;
    the channel characteristic extraction unit is used for extracting channel characteristics of the first space compression information of the i-1 th first characteristic information to obtain first channel information of the i-1 th first characteristic information, and extracting channel characteristics of the second space compression information of the i-1 th first characteristic information to obtain second channel information of the i-1 th first characteristic information;
    the channel attention information of the i-1 th first feature information is determined based on the first channel information and the second channel information of the i-1 th first feature information.
  7. The method of claim 6, wherein the first spatial compression unit and/or the second spatial compression unit comprises a pooling layer.
  8. The method of claim 6, wherein the first spatial compression unit is a maximum pooling layer and/or the second spatial compression unit is an average pooling layer.
  9. The method of claim 6, wherein the channel feature extraction unit is a multi-layer perceptron MLP.
  10. The method of claim 6, wherein the channel attention module further comprises: a first addition unit and a first activation function;
    the first adding unit is used for adding the first channel information and the second channel information of the i-1 first characteristic information to obtain fusion channel information of the i-1 first characteristic information;
    the first activation function is used for carrying out nonlinear processing on the fusion channel information of the i-1 first characteristic information to obtain the channel attention information of the i-1 first characteristic information.
  11. A method according to claim 3, wherein the spatial attention module comprises: a first channel compression unit, a second channel compression unit and a spatial feature extraction unit;
    the first channel compression unit is used for carrying out channel dimension compression on the fusion channel characteristic information of the ith-1 first characteristic information to obtain first channel compression information of the ith-1 first characteristic information;
    the second channel compression unit is used for carrying out channel dimension compression on the fusion channel characteristic information of the ith-1 first characteristic information to obtain second channel compression information of the ith-1 first characteristic information;
    The spatial feature extraction unit is used for extracting spatial features of the first channel compression information and the second channel compression information of the ith-1 first feature information to obtain spatial feature information of the ith-1 first feature information;
    the spatial attention information of the i-1 th first feature information is determined based on the spatial feature information of the i-1 th first feature information.
  12. The method of claim 11, wherein the first channel compression unit and/or the second channel compression unit comprises a pooling layer.
  13. The method according to claim 11, wherein the first channel compression unit is a maximum pooling layer and/or the second channel compression unit is an average pooling layer.
  14. The method of claim 11, wherein the spatial feature extraction unit is a convolutional layer.
  15. The method of claim 11, wherein the spatial attention module further comprises a second activation function;
    the second activation function is used for carrying out nonlinear processing on the spatial characteristic information of the ith-1 first characteristic information to obtain the spatial attention information of the ith-1 first characteristic information.
  16. The method according to any of claims 3-15, wherein the spatial dimension of the channel attention information of the i-1 th first characteristic information is 1 x 1.
  17. The method according to any of claims 3-15, wherein the spatial attention information of the i-1 th first feature information has a feature dimension of 1.
  18. The method of claim 2, wherein the dynamic conversion model further comprises at least one downsampling unit;
    the downsampling unit is used for performing space dimension downsampling on the characteristic information output by the coding module.
  19. The method of claim 18, wherein the downsampling unit is a maximum pooling layer.
  20. The method of claim 18, wherein the dynamic transformation model further comprises at least one upsampling unit;
    the up-sampling unit is used for performing spatial dimension up-sampling on the characteristic information output by the decoding module.
  21. The method of claim 20, wherein the upsampling unit is a bilinear interpolation unit.
  22. The method of claim 1, wherein each of the N coding modules comprises at least one convolution block, wherein parameters of the convolution blocks included by each of the N coding modules are not exactly the same.
  23. The method of claim 1, wherein each of the N decoding modules comprises at least one convolution block, wherein parameters of the convolution blocks included by each of the N decoding modules are not exactly the same.
  24. The method of claim 1, wherein,
    if the i is equal to N, the N-i second characteristic information is determined according to the N first characteristic information output by the N coding module; or,
    if the i is smaller than N, the Nth-i second characteristic information is determined according to the Nth-i second characteristic information output by the Nth-i decoding module; or,
    if the i is equal to 1, the i-1 th first characteristic information is determined according to the reconstructed image; or,
    if i is greater than 1, the i-1 th first characteristic information is determined according to the first characteristic information output by the i-1 st coding module.
  25. The method according to claim 2, wherein the N-i+1 th decoding module is configured to perform feature extraction on the feature information obtained by cascading the i-1 th third feature information and the N-i th second feature information, so as to obtain the N-i+1 th second feature information of the reconstructed image.
  26. The method of claim 2, wherein the dynamic transformation model further comprises a first convolution layer;
    the first convolution layer is used for extracting features of the reconstructed image to obtain an initial feature map of the reconstructed image, and the initial feature map is input into a first coding module and a first convolution attention module respectively.
  27. The method of claim 2, wherein the dynamic transformation model further comprises a second convolution layer;
    the second convolution layer is used for extracting the characteristics of the second characteristic information of the reconstructed image output by the last decoding module and outputting an HDR image of the reconstructed image.
  28. The method of claim 2, wherein the initial parameters of the dynamic transformation model at training are pre-training parameters obtained by the pre-training model at pre-training.
  29. The method of claim 28, wherein the loss function of the dynamic transformation model comprises at least one of a reconstructed loss function, a perceived loss function, and a style loss function.
  30. The method of claim 29, wherein the loss function of the dynamic transformation model is represented by the formula:
    Loss = L1 + λs·Lst + λp·Lp
    wherein Loss is a loss function of the dynamic conversion model, L1 is the reconstruction loss function, Lst is the perceptual loss function, Lp is the pattern loss function, and λs and λp are hyperparameters.
  31. The method of claim 30, wherein the reconstruction loss function of the dynamic conversion model is determined based on an error between a compressed tone mapping value of an HDR image truth value and a compressed tone mapping value of an HDR image predictor, wherein the compressed tone mapping value of the HDR image predictor is determined according to a preset compressed tone mapping function and the HDR image predictor, and wherein the compressed tone mapping value of the HDR image truth value is determined according to the compressed tone mapping function and the HDR image truth value.
  32. The method of claim 30, wherein the perceptual penalty function of the dynamic conversion model is determined based on an error between a first eigenvalue, which is a eigenvalue of a compressed tone mapping value of an HDR image predictor, corresponding in a feature map of a first layer of the pre-training model, and a second eigenvalue, which is a eigenvalue of a compressed tone mapping value of an HDR image truth, corresponding in a feature map of the first layer, the compressed tone mapping value of the HDR image predictor being determined according to a preset compressed tone mapping function and the HDR image predictor, the compressed tone mapping value of the HDR image truth being determined according to the compressed tone mapping function and the HDR image truth.
  33. The method of claim 30, wherein the pattern loss function of the dynamic conversion model is determined based on an error between a first element value that is an element value in a Gram matrix of a first layer feature map of the pre-training model to which compressed tone mapping values of HDR image prediction values correspond and a second element value that is an element value in the Gram matrix to which compressed tone mapping values of HDR image truth values are corresponding, the compressed tone mapping values of the HDR image prediction values being determined according to a preset compressed tone mapping function and the HDR image prediction values, the compressed tone mapping values of the HDR image truth values being determined according to the compressed tone mapping function and the HDR image truth values.
  34. An image processing method, comprising:
    acquiring a low dynamic range LDR image to be processed;
    inputting the LDR image into a dynamic conversion model for dynamic conversion to obtain a high dynamic range HDR image of the LDR image;
    wherein the dynamic conversion model includes N coding modules and N decoding modules which are connected in series, the output of the last coding module in the N coding modules is connected with the input of the first decoding module in the N decoding modules, the ith coding module is connected with the (N-i+1)th decoding module in a jumping manner, the ith coding module is used for carrying out feature extraction on the (i-1)th first feature information output by the (i-1)th coding module to obtain the ith first feature information of the LDR image, the (N-i+1)th decoding module is used for carrying out feature extraction on the (i-1)th first feature information and the (N-i)th second feature information of the LDR image to obtain the (N-i+1)th second feature information of the LDR image, the HDR image of the LDR image is determined according to the second feature information output by the last decoding module in the N decoding modules, and i is a positive integer less than or equal to N.
  35. The method of claim 34, wherein the dynamic transformation model further comprises: a convolution attention module located in a jump connection of the ith coding module and the N-i+1 decoding module;
    the convolution attention module is used for extracting space information and channel information of the ith-1 first characteristic information to obtain the ith-1 third characteristic information of the LDR image;
    the (N-i+1) th decoding module is used for carrying out feature extraction on the (i-1) th third feature information and the (N-i) th second feature information to obtain the (N-i+1) th second feature information of the LDR image.
  36. The method of claim 35, wherein the convolution attention module comprises a channel attention module and a spatial attention module;
    the channel attention module is used for extracting channel information from the i-1 th first characteristic information to obtain channel attention information of the i-1 th first characteristic information;
    the spatial attention module is used for extracting spatial information from the i-1 th first characteristic information and the channel attention information of the i-1 th first characteristic information to obtain the spatial attention information of the i-1 th first characteristic information;
    The i-1 th third feature information of the LDR image is determined according to the channel attention information and the spatial attention information of the i-1 th first feature information.
  37. The method of claim 36, wherein the convolution attention module further comprises a first multiplication unit;
    the first multiplication unit is used for multiplying the i-1 th first characteristic information and the channel attention information of the i-1 th first characteristic information to obtain the fused channel characteristic information of the i-1 th first characteristic information;
    the spatial attention module is used for extracting spatial information of the fusion channel characteristic information of the ith-1 first characteristic information to obtain the spatial attention information of the ith-1 first characteristic information.
  38. The method of claim 37, wherein the convolution attention module further comprises a second multiplication unit;
    the second multiplication unit is used for multiplying the fusion channel characteristic information of the ith-1 first characteristic information and the spatial attention information to obtain the ith-1 third characteristic information of the LDR image.
  39. The method of claim 36, wherein the channel attention module comprises: a first space compression unit, a second space compression unit and a channel characteristic extraction unit;
    The first space compression unit is used for performing space dimension compression on the i-1 th first characteristic information to obtain first space compression information of the i-1 th first characteristic information;
    the second space compression unit is used for performing space dimension compression on the i-1 th first characteristic information to obtain second space compression information of the i-1 th first characteristic information;
    the channel characteristic extraction unit is used for extracting channel characteristics of the first space compression information of the i-1 th first characteristic information to obtain first channel information of the i-1 th first characteristic information, and extracting channel characteristics of the second space compression information of the i-1 th first characteristic information to obtain second channel information of the i-1 th first characteristic information;
    the channel attention information of the i-1 th first feature information is determined based on the first channel information and the second channel information of the i-1 th first feature information.
  40. The method of claim 39, wherein the first spatial compression unit and/or the second spatial compression unit comprises a pooling layer.
  41. The method of claim 39, wherein the first spatial compression unit is a maximum pooling layer and/or the second spatial compression unit is an average pooling layer.
  42. The method of claim 39, wherein the channel feature extraction unit is a multi-layer perceptron MLP.
  43. The method of claim 39, wherein the channel attention module further comprises: a first addition unit and a first activation function;
    the first adding unit is used for adding the first channel information and the second channel information of the i-1 first characteristic information to obtain fusion channel information of the i-1 first characteristic information;
    the first activation function is used for carrying out nonlinear processing on the fusion channel information of the i-1 first characteristic information to obtain the channel attention information of the i-1 first characteristic information.
  44. The method of claim 36, wherein the spatial attention module comprises: a first channel compression unit, a second channel compression unit and a spatial feature extraction unit;
    the first channel compression unit is used for carrying out channel dimension compression on the fusion channel characteristic information of the ith-1 first characteristic information to obtain first channel compression information of the ith-1 first characteristic information;
    the second channel compression unit is used for carrying out channel dimension compression on the fusion channel characteristic information of the ith-1 first characteristic information to obtain second channel compression information of the ith-1 first characteristic information;
    The spatial feature extraction unit is used for extracting spatial features of the first channel compression information and the second channel compression information of the ith-1 first feature information to obtain spatial feature information of the ith-1 first feature information;
    the spatial attention information of the i-1 th first feature information is determined based on the spatial feature information of the i-1 th first feature information.
  45. The method of claim 44, wherein the first channel compression unit and/or the second channel compression unit comprises a pooling layer.
  46. The method of claim 44, wherein the first channel compression unit is a maximum pooling layer and/or the second channel compression unit is an average pooling layer.
  47. The method of claim 44, wherein the spatial feature extraction unit is a convolutional layer.
  48. The method of claim 44, wherein the spatial attention module further comprises a second activation function;
    the second activation function is used for carrying out nonlinear processing on the spatial characteristic information of the ith-1 first characteristic information to obtain the spatial attention information of the ith-1 first characteristic information.
  49. The method of any one of claims 36-48, wherein the channel attention information of the i-1 th first characteristic information has a spatial dimension of 1 x 1.
  50. The method of any one of claims 36-48, wherein the spatial attention information of the i-1 st first feature information has a feature dimension of 1.
  51. The method of claim 35, wherein the dynamic transformation model further comprises at least one downsampling unit;
    the downsampling unit is used for performing space dimension downsampling on the characteristic information output by the coding module.
  52. The method of claim 51, wherein the downsampling unit is a maximum pooling layer.
  53. The method of claim 51, wherein the dynamic transformation model further comprises at least one upsampling unit;
    the up-sampling unit is used for performing spatial dimension up-sampling on the characteristic information output by the decoding module.
  54. The method of claim 53, wherein the upsampling unit is a bilinear interpolation unit.
  55. The method of claim 34, wherein each of the N coding modules comprises at least one convolution block, and wherein parameters of the convolution blocks included in each of the N coding modules are not exactly the same.
  56. The method of claim 34, wherein each of the N decoding modules comprises at least one convolution block, and wherein parameters of the convolution blocks included by each of the N decoding modules are not exactly the same.
  57. The method of claim 34, wherein,
    if the i is equal to N, the N-i second characteristic information is determined according to the N first characteristic information output by the N coding module; or,
    if the i is smaller than N, the Nth-i second characteristic information is determined according to the Nth-i second characteristic information output by the Nth-i decoding module; or,
    if the i is equal to 1, the i-1 th first characteristic information is determined according to the LDR image; or,
    if i is greater than 1, the i-1 th first characteristic information is determined according to the first characteristic information output by the i-1 st coding module.
  58. The method of claim 35, wherein the N-i+1 decoding module is configured to perform feature extraction on the i-1 th third feature information and the N-i th second feature information after concatenation, so as to obtain the N-i+1 th second feature information of the LDR image.
  59. The method of claim 35, wherein the dynamic transformation model further comprises a first convolution layer;
    the first convolution layer is used for extracting features of the LDR image to obtain an initial feature map of the LDR image, and the initial feature map is input into a first coding module and a first convolution attention module respectively.
  60. The method of claim 35, wherein the dynamic transformation model further comprises a second convolution layer;
    the second convolution layer is used for extracting the characteristics of the second characteristic information of the LDR image output by the last decoding module and outputting an HDR image of the LDR image.
  61. The method of claim 35, wherein the initial parameters of the dynamic transformation model when trained are pre-training parameters obtained by the pre-training model when pre-trained.
  62. The method of claim 61, wherein the loss function of the dynamic transformation model comprises at least one of a reconstructed loss function, a perceived loss function, and a style loss function.
  63. The method of claim 62, wherein the loss function of the dynamic transformation model is represented by the formula:
    Loss = L1 + λs·Lst + λp·Lp
    wherein Loss is a loss function of the dynamic conversion model, L1 is the reconstruction loss function, Lst is the perceptual loss function, Lp is the pattern loss function, and λs and λp are hyperparameters.
  64. The method of claim 63, wherein the reconstruction loss function of the dynamic conversion model is determined based on an error between a compressed tone mapping value of an HDR image truth value and a compressed tone mapping value of an HDR image predictor, wherein the compressed tone mapping value of the HDR image predictor is determined according to a preset compressed tone mapping function and the HDR image predictor, and the compressed tone mapping value of the HDR image truth value is determined according to the compressed tone mapping function and the HDR image truth value.
  65. The method of claim 63 wherein the perceptual penalty function of the dynamic conversion model is determined based on an error between a first eigenvalue, which is a corresponding eigenvalue of a compressed tone map value of an HDR image predictor in a feature map of a first layer of the pre-training model, and a second eigenvalue, which is a corresponding eigenvalue of a compressed tone map value of an HDR image truth in a feature map of the first layer, the compressed tone map value of the HDR image predictor being determined according to a preset compressed tone map function and the HDR image predictor, the compressed tone map value of the HDR image truth being determined according to the compressed tone map function and the HDR image truth.
  66. The method of claim 63 wherein the pattern loss function of the dynamic conversion model is determined based on an error between a first element value that is an element value in a Gram matrix of a first layer feature map of the pre-training model to which compressed tone mapping values of HDR image prediction values correspond and a second element value that is an element value in the Gram matrix to which compressed tone mapping values of HDR image truth values are corresponding, the compressed tone mapping values of the HDR image prediction values being determined according to a preset compressed tone mapping function and the HDR image prediction values, the compressed tone mapping values of the HDR image truth values being determined according to the compressed tone mapping function and the HDR image truth values.
  67. A method of model training, comprising:
    acquiring a low dynamic range LDR training image and a high dynamic range HDR image truth value of the LDR training image;
    inputting the LDR training image into a dynamic conversion model, and performing feature extraction on the i-1 th first feature information through an i-th coding module to obtain the i-th first feature information of the LDR training image, wherein the dynamic conversion model comprises N coding modules connected in series and N decoding modules connected in series, the output of the last coding module in the N coding modules is connected with the input of the first decoding module in the N decoding modules, the i-th coding module is connected with the N-i+1 th decoding module in a jumping manner, i is a positive integer smaller than or equal to N, and N is a positive integer;
    Performing feature extraction on the i-1 th first feature information and the N-i second feature information of the LDR training image through the N-i+1 th decoding module to obtain the N-i+1 th second feature information of the LDR training image;
    determining an HDR image prediction value of the LDR training image according to the second characteristic information of the LDR training image output by the last decoding module in the N decoding modules;
    determining a loss between an HDR image predicted value of the LDR training image and an HDR image true value of the LDR training image, and training the dynamic conversion model according to the loss.
  68. The method of claim 67, wherein said dynamic transition model further comprises: the convolution attention module is located in jump connection between the ith coding module and the (N-i+1) th decoding module, and performs feature extraction on the (i-1) th first feature information and the (N-i) th second feature information of the LDR training image through the (N-i+1) th decoding module to obtain the (N-i+1) th second feature information of the LDR training image, and the convolution attention module comprises:
    extracting spatial information and channel information from the ith-1 first characteristic information through the convolution attention module to obtain the ith-1 third characteristic information of the LDR training image;
    And carrying out feature extraction on the ith-1 third feature information and the nth-i second feature information through the (N-i+1) decoding module to obtain the (N-i+1) second feature information of the LDR training image.
  69. A method as in claim 68 wherein the convolution attention module comprises a channel attention module and a spatial attention module, the performing spatial information and channel information extraction on the i-1 th first feature information by the convolution attention module to obtain the i-1 th third feature information of the LDR training image comprises:
    extracting channel information from the i-1 th first characteristic information through the channel attention module to obtain channel attention information of the i-1 th first characteristic information;
    extracting spatial information of fusion channel characteristic information of the ith-1 first characteristic information through the spatial attention module to obtain spatial attention information of the ith-1 first characteristic information, wherein the fusion channel characteristic information of the ith-1 first characteristic information is determined according to the ith-1 first characteristic information and the channel attention information of the ith-1 first characteristic information;
    And determining the ith-1 third characteristic information of the LDR training image according to the channel attention information and the space attention information of the ith-1 first characteristic information.
  70. The method of claim 69, wherein the convolution attention module further comprises a first multiplication unit, the method further comprising:
    multiplying the i-1 th first characteristic information and the channel attention information of the i-1 th first characteristic information by the first multiplication unit to obtain the fused channel characteristic information of the i-1 th first characteristic information.
  71. A method as in claim 69 wherein said convolution attention module further comprises a second multiplication unit, said determining an i-1 th third characteristic information of said LDR training image from channel attention information and spatial attention information of said i-1 th first characteristic information comprising:
    and multiplying the fusion channel characteristic information of the ith-1 first characteristic information and the spatial attention information by the second multiplication unit to obtain the ith-1 third characteristic information of the LDR training image.
  72. The method of claim 69, wherein the channel attention module comprises a first space compression unit, a second space compression unit and a channel feature extraction unit, and the extracting channel information from the i-1 th first characteristic information through the channel attention module to obtain the channel attention information of the i-1 th first characteristic information comprises:
    Performing space dimension compression on the i-1 th first characteristic information through the first space compression unit to obtain first space compression information of the i-1 th first characteristic information;
    performing space dimension compression on the i-1 th first characteristic information through the second space compression unit to obtain second space compression information of the i-1 th first characteristic information;
    channel feature extraction is carried out on the first space compression information of the i-1 th first feature information through the channel feature extraction unit, so that first channel information of the i-1 th first feature information is obtained;
    channel feature extraction is carried out on the second space compression information of the i-1 th first feature information through the channel feature extraction unit, so that second channel information of the i-1 th first feature information is obtained;
    and determining the channel attention information of the i-1 th first characteristic information according to the first channel information and the second channel information of the i-1 th first characteristic information.
  73. The method of claim 72, wherein the first spatial compression unit and/or the second spatial compression unit comprises a pooling layer.
  74. The method of claim 72, wherein the first spatial compression unit is a maximum pooling layer and/or the second spatial compression unit is an average pooling layer.
  75. The method of claim 72, wherein the channel feature extraction unit is a multi-layer perceptron MLP.
  76. The method of claim 72, wherein the channel attention module further comprises: the first adding unit and the first activation function, the channel attention information of the i-1 th first characteristic information is determined according to the first channel information and the second channel information of the i-1 th first characteristic information, and the method comprises the following steps:
    adding the first channel information and the second channel information of the i-1 pieces of first characteristic information through the first adding unit to obtain fusion channel information of the i-1 pieces of first characteristic information;
    and carrying out nonlinear processing on the fusion channel information of the i-1 first characteristic information through the first activation function to obtain the channel attention information of the i-1 first characteristic information.
  77. The method of claim 69, wherein the spatial attention module comprises a first channel compression unit, a second channel compression unit and a spatial feature extraction unit, and the extracting spatial information from the fusion channel characteristic information of the i-1 th first characteristic information through the spatial attention module to obtain the spatial attention information of the i-1 th first characteristic information comprises:
    Channel dimension compression is carried out on the fusion channel characteristic information of the ith-1 first characteristic information through the first channel compression unit, so that first channel compression information of the ith-1 first characteristic information is obtained;
    channel dimension compression is carried out on the fusion channel characteristic information of the ith-1 first characteristic information through the second channel compression unit, so that second channel compression information of the ith-1 first characteristic information is obtained;
    performing spatial feature extraction on the first channel compression information and the second channel compression information of the ith-1 first feature information through the spatial feature extraction unit to obtain spatial feature information of the ith-1 first feature information;
    and determining the spatial attention information of the ith-1 first characteristic information according to the spatial characteristic information of the ith-1 first characteristic information.
  78. The method of claim 77, wherein said first channel compression unit and/or said second channel compression unit comprises a pooling layer.
  79. The method of claim 77, wherein said first channel compression unit is a maximum pooling layer and/or said second channel compression unit is an average pooling layer.
  80. The method of claim 77, wherein said spatial feature extraction unit is a convolutional layer.
  81. The method of claim 77, wherein said spatial attention module further includes a second activation function, said determining spatial attention information for said i-1 th first feature information based on spatial feature information for said i-1 th first feature information, comprising:
    and carrying out nonlinear processing on the spatial characteristic information of the ith-1 first characteristic information through the second activation function to obtain the spatial attention information of the ith-1 first characteristic information.
  82. The method of any of claims 69-81, wherein the spatial dimension of channel attention information of the i-1 th first characteristic information is 1 x 1.
  83. The method of any one of claims 69-81, wherein the spatial attention information of the i-1 st first feature information has a feature dimension of 1.
  84. The method of claim 68, wherein the dynamic transformation model further comprises at least one downsampling unit, the method further comprising:
    and carrying out space dimension downsampling on the characteristic information output by the coding module through the downsampling unit.
  85. The method of claim 84, wherein the downsampling unit is a maximum pooling layer.
  86. The method of claim 84, wherein the dynamic transformation model further comprises at least one upsampling unit, the method further comprising:
    and carrying out space dimension up-sampling on the characteristic information output by the decoding module through the up-sampling unit.
  87. The method of claim 86, wherein the upsampling unit is a bilinear interpolation unit.
  88. The method of claim 67, wherein each of said N coding modules comprises at least one convolution block, and wherein parameters of said convolution blocks comprised by each of said N coding modules are not exactly the same.
  89. The method of claim 67, wherein each of said N decoding modules comprises at least one convolution block, and wherein parameters of the convolution blocks comprised by each of said N decoding modules are not exactly the same.
  90. The method of claim 67, wherein,
    if the i is equal to N, the N-i second characteristic information is determined according to the N first characteristic information output by the N coding module; or,
    If the i is smaller than N, the Nth-i second characteristic information is determined according to the Nth-i second characteristic information output by the Nth-i decoding module; or,
    if the i is equal to 1, the i-1 th first characteristic information is determined according to the LDR training image; or,
    if i is greater than 1, the i-1 th first characteristic information is determined according to the first characteristic information output by the i-1 st coding module.
  91. A method as in claim 68 wherein the feature extracting, by the N-i+1 decoding module, the i-1 th third feature information and the N-i th second feature information to obtain the N-i+1 th second feature information of the LDR training image comprises:
    cascading the i-1 th third characteristic information and the N-i th second characteristic information;
    inputting the cascaded characteristic information into the (N-i+1) th decoding module for characteristic extraction to obtain the (N-i+1) th second characteristic information of the LDR training image.
  92. The method of claim 68, wherein the dynamic transformation model further comprises a first convolution layer, the method further comprising:
    extracting features of the LDR training image through the first convolution layer to obtain an initial feature map of the LDR training image;
    And respectively inputting the initial feature map into a first coding module and a first convolution attention module to obtain first feature information output by the first coding module and first third feature information output by the first convolution attention module.
  93. A method as in claim 67 wherein said dynamic transition model further comprises a second convolution layer, said determining an HDR image prediction value for said LDR training image based on second characteristic information of said LDR training image output by a last decoding module of said N decoding modules, comprising:
    and extracting the characteristics of the second characteristic information of the LDR training image output by the last decoding module through the second convolution layer, and outputting an HDR image predicted value of the LDR training image.
  94. The method of claim 68, further comprising:
    obtaining pre-training parameters obtained by a pre-training model during pre-training;
    and determining the pre-training parameters as initial parameters of the dynamic conversion model.
  95. A method as described in claim 94 wherein said determining a penalty between an HDR image predictor of said LDR training image and an HDR image truth value of said LDR training image comprises:
    And determining target loss between the HDR image predicted value of the LDR training image and the HDR image true value of the LDR training image according to a preset loss function.
  96. The method of claim 95, wherein the predetermined loss function comprises at least one of a reconstruction loss function, a perceptual loss function, and a pattern loss function.
  97. The method of claim 96, wherein the determining a target loss between the HDR image predicted value of the LDR training image and the HDR image true value of the LDR training image according to a preset loss function comprises:
    determining a reconstruction loss between the HDR image predicted value and the HDR image true value;
    determining a perceptual loss between the HDR image predicted value and the HDR image true value;
    determining a style loss between the HDR image predicted value and the HDR image true value;
    determining the target loss between the HDR image predicted value and the HDR image true value according to the reconstruction loss, the perceptual loss, and the style loss between the HDR image predicted value and the HDR image true value.
  98. The method of claim 97, wherein the determining the target loss between the HDR image predicted value and the HDR image true value according to the reconstruction loss, the perceptual loss, and the style loss between the HDR image predicted value and the HDR image true value comprises:
    determining the target loss between the HDR image predicted value and the HDR image true value according to the following formula:
    Loss = L_1 + λ_s · L_st + λ_p · L_p
    wherein Loss is the target loss, L_1 is the reconstruction loss, L_st is the perceptual loss, L_p is the style loss, and λ_s and λ_p are hyperparameters.
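A one-line sketch of the weighted sum in claim 98; the default hyperparameter values below are purely illustrative and not taken from the patent.

    def total_loss(l_rec, l_perc, l_style, lambda_s=1e-2, lambda_p=1e-2):
        # Loss = L_1 + λ_s * L_st + λ_p * L_p (weights are illustrative only)
        return l_rec + lambda_s * l_perc + lambda_p * l_style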
  99. The method of claim 97, wherein the determining a reconstruction loss between the HDR image predicted value and the HDR image true value comprises:
    determining a compressed tone mapping value of the HDR image predicted value according to a preset compressed tone mapping function;
    determining a compressed tone mapping value of the HDR image true value according to the compressed tone mapping function;
    determining the reconstruction loss according to an error between the compressed tone mapping value of the HDR image true value and the compressed tone mapping value of the HDR image predicted value.
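A sketch of claim 99. The claim only requires "a preset compressed tone mapping function"; the μ-law compressor and the L1 error used below are common choices in HDR reconstruction work and are assumptions here.

    import torch
    import torch.nn.functional as F

    def mu_law(x, mu=5000.0):
        # Compressed tone mapping (μ-law compression; assumed form of the preset function).
        return torch.log(1.0 + mu * x) / torch.log(torch.tensor(1.0 + mu))

    def reconstruction_loss(hdr_pred, hdr_true):
        # Error between the compressed tone mapping values of prediction and ground truth.
        return F.l1_loss(mu_law(hdr_pred), mu_law(hdr_true))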
  100. The method of claim 97, wherein the determining a perceptual loss between the HDR image predicted value and the HDR image true value comprises:
    acquiring a feature map of a first layer of the pre-training model;
    determining a compressed tone mapping value of the HDR image predicted value according to a preset compressed tone mapping function;
    determining a compressed tone mapping value of the HDR image true value according to the compressed tone mapping function;
    determining a first feature value, in the feature map of the first layer, corresponding to the compressed tone mapping value of the HDR image predicted value;
    determining a second feature value, in the feature map of the first layer, corresponding to the compressed tone mapping value of the HDR image true value;
    determining the perceptual loss according to an error between the first feature value and the second feature value.
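A sketch of claim 100, assuming the layer features are taken from a frozen pre-trained feature extractor (for example an early block of a VGG-style network); the extractor argument, the tone_map argument (e.g. the mu_law function from the previous sketch), and the MSE error are assumptions.

    import torch.nn as nn
    import torch.nn.functional as F

    def perceptual_loss(hdr_pred, hdr_true, feature_layer: nn.Module, tone_map):
        # Feature values of the tone-mapped prediction and truth at the chosen layer.
        feat_pred = feature_layer(tone_map(hdr_pred))
        feat_true = feature_layer(tone_map(hdr_true))
        return F.mse_loss(feat_pred, feat_true)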
  101. The method of claim 97, wherein the determining a style loss between the HDR image predicted value and the HDR image true value comprises:
    acquiring a Gram matrix of a feature map of a first layer of the pre-training model;
    determining a compressed tone mapping value of the HDR image predicted value according to a preset compressed tone mapping function;
    determining a compressed tone mapping value of the HDR image true value according to the compressed tone mapping function;
    determining a first element value, in the Gram matrix, corresponding to the compressed tone mapping value of the HDR image predicted value;
    determining a second element value, in the Gram matrix, corresponding to the compressed tone mapping value of the HDR image true value;
    determining the style loss according to an error between the first element value and the second element value.
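A sketch of the Gram-matrix comparison in claim 101; the normalization by C·H·W and the MSE error are conventional choices and assumptions here, and feature_layer/tone_map are the same assumed helpers as in the previous sketch.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def gram_matrix(feat):
        # feat: (B, C, H, W) -> Gram matrix (B, C, C) of channel-wise correlations.
        b, c, h, w = feat.shape
        f = feat.reshape(b, c, h * w)
        return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

    def style_loss(hdr_pred, hdr_true, feature_layer: nn.Module, tone_map):
        g_pred = gram_matrix(feature_layer(tone_map(hdr_pred)))
        g_true = gram_matrix(feature_layer(tone_map(hdr_true)))
        return F.mse_loss(g_pred, g_true)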
  102. An image decoding apparatus, comprising:
    the decoding unit is used for decoding the code stream to obtain a reconstructed image;
    the processing unit is used for inputting the reconstructed image into a dynamic conversion model for dynamic conversion to obtain a high dynamic range HDR image of the reconstructed image;
    wherein the dynamic conversion model comprises N coding modules connected in series and N decoding modules connected in series, the output of the last coding module in the N coding modules is connected with the input of the first decoding module in the N decoding modules, the i th coding module is connected to the (N-i+1) th decoding module by a skip connection, the i th coding module is configured to perform feature extraction on the i-1 th first feature information output by the i-1 th coding module to obtain the i th first feature information of the reconstructed image, the (N-i+1) th decoding module is configured to perform feature extraction on the i-1 th first feature information and the N-i th second feature information of the reconstructed image to obtain the (N-i+1) th second feature information of the reconstructed image, the HDR image of the reconstructed image is determined according to the second feature information output by the last decoding module in the N decoding modules, and i is a positive integer less than or equal to N.
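For readers who prefer code, a compact sketch of the encoder-decoder topology described above, fixing N = 3 and using plain strided convolutions, bilinear upsampling, and channel-wise concatenation as the skip connections. Every architectural detail (channel widths, class name, upsampling method) is an assumption made for illustration, not the patent's reference network.

    import torch
    import torch.nn as nn

    def conv_block(cin, cout, stride=1):
        return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                             nn.ReLU(inplace=True))

    class DynamicConversionSketch(nn.Module):
        """U-Net-like sketch: N = 3 coding modules, N = 3 decoding modules, skip connections."""
        def __init__(self):
            super().__init__()
            # i-th coding module extracts the i-th first feature information.
            self.enc = nn.ModuleList([conv_block(3, 64, 2),
                                      conv_block(64, 128, 2),
                                      conv_block(128, 256, 2)])
            # (N-i+1)-th decoding module fuses the (i-1)-th first features with the (N-i)-th second features.
            self.dec = nn.ModuleList([conv_block(256 + 128, 128),   # decoder 1: skip from encoder 2
                                      conv_block(128 + 64, 64),     # decoder 2: skip from encoder 1
                                      conv_block(64 + 3, 64)])      # decoder 3: skip from the input image
            self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
            self.head = nn.Conv2d(64, 3, 3, padding=1)               # maps the last second features to HDR

        def forward(self, x):
            f1 = self.enc[0](x)         # 1st first feature information
            f2 = self.enc[1](f1)        # 2nd first feature information
            f3 = self.enc[2](f2)        # 3rd first feature information (0th second features)
            s1 = self.dec[0](torch.cat([f2, self.up(f3)], dim=1))    # 1st second feature information
            s2 = self.dec[1](torch.cat([f1, self.up(s1)], dim=1))    # 2nd second feature information
            s3 = self.dec[2](torch.cat([x,  self.up(s2)], dim=1))    # 3rd second feature information
            return self.head(s3)                                      # HDR image of the input

    # Example: DynamicConversionSketch()(torch.randn(1, 3, 256, 256)) -> (1, 3, 256, 256)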
  103. An image processing apparatus, comprising:
    the acquisition unit is used for acquiring the LDR image with the low dynamic range to be processed;
    the processing unit is used for inputting the LDR image into a dynamic conversion model for dynamic conversion to obtain a high dynamic range HDR image of the LDR image;
    wherein the dynamic conversion model comprises N coding modules connected in series and N decoding modules connected in series, the output of the last coding module in the N coding modules is connected with the input of the first decoding module in the N decoding modules, the i th coding module is connected to the (N-i+1) th decoding module by a skip connection, the i th coding module is configured to perform feature extraction on the i-1 th first feature information output by the i-1 th coding module to obtain the i th first feature information of the LDR image, the (N-i+1) th decoding module is configured to perform feature extraction on the i-1 th first feature information and the N-i th second feature information of the LDR image to obtain the (N-i+1) th second feature information of the LDR image, the HDR image of the LDR image is determined according to the second feature information output by the last decoding module in the N decoding modules, and i is a positive integer less than or equal to N.
  104. A model training device, comprising:
    the acquisition unit is used for acquiring a low dynamic range LDR training image and a high dynamic range HDR image true value of the LDR training image;
    the processing unit is used for inputting the LDR training image into a dynamic conversion model, extracting the characteristics of the i-1 th first characteristic information through an i-th coding module to obtain the i-th first characteristic information of the LDR training image, wherein the dynamic conversion model comprises N coding modules connected in series and N decoding modules connected in series, the output of the last coding module in the N coding modules is connected with the input of the first decoding module in the N decoding modules, the i-th coding module is connected with the N-i+1 th decoding module in a jumping manner, i is a positive integer smaller than or equal to N, and N is a positive integer; performing feature extraction on the i-1 th first feature information and the N-i second feature information of the LDR training image through the N-i+1 th decoding module to obtain the N-i+1 th second feature information of the LDR training image; determining an HDR image prediction value of the LDR training image according to the second characteristic information of the LDR training image output by the last decoding module in the N decoding modules; determining a loss between an HDR image predicted value of the LDR training image and an HDR image true value of the LDR training image, and training the dynamic conversion model according to the loss.
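A sketch of a single training step corresponding to the processing unit above; model, optimizer and loss_fn are placeholders (for example the hypothetical DynamicConversionSketch and total_loss from the earlier sketches), not components named by the patent.

    def train_step(model, optimizer, loss_fn, ldr, hdr_true):
        # One optimization step: forward pass, target loss, backward pass, parameter update.
        hdr_pred = model(ldr)                  # HDR image predicted value of the LDR training image
        loss = loss_fn(hdr_pred, hdr_true)     # target loss against the HDR image true value
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()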
  105. A decoder, comprising: a processor and a memory;
    the memory is used for storing a computer program;
    the processor is configured to invoke and run a computer program stored in the memory to perform the method of any of claims 1-33.
  106. An electronic device, comprising: a processor and a memory;
    the memory is used for storing a computer program;
    the processor is configured to invoke and run a computer program stored in the memory to perform the method of any of claims 34-66 or 67 to 101.
  107. A computer readable storage medium storing a computer program for causing a computer to perform the method of any one of claims 1 to 33 or 34 to 66 or 67 to 101.
CN202180097934.XA 2021-06-24 2021-06-24 Image decoding and processing method, device and equipment Pending CN117441186A (en)

Applications Claiming Priority (1)

Application Number: PCT/CN2021/102173 (published as WO2022266955A1)
Priority Date: 2021-06-24    Filing Date: 2021-06-24
Title: Image decoding method and apparatus, image processing method and apparatus, and device

Publications (1)

Publication Number Publication Date
CN117441186A true CN117441186A (en) 2024-01-23

Family

ID=84543976

Family Applications (1)

Application Number: CN202180097934.XA (published as CN117441186A, status: Pending)
Priority Date: 2021-06-24    Filing Date: 2021-06-24
Title: Image decoding and processing method, device and equipment

Country Status (2)

Country Link
CN (1) CN117441186A (en)
WO (1) WO2022266955A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115776571B (en) * 2023-02-10 2023-04-28 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Image compression method, device, equipment and storage medium
CN117854138B (en) * 2024-03-07 2024-05-10 深圳航天信息有限公司 Information acquisition and analysis method, device, equipment and storage medium based on big data

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10475169B2 (en) * 2017-11-28 2019-11-12 Adobe Inc. High dynamic range illumination estimation
CN108805836A (en) * 2018-05-31 2018-11-13 大连理工大学 Method for correcting image based on the reciprocating HDR transformation of depth
CN109447907B (en) * 2018-09-20 2020-06-16 宁波大学 Single image enhancement method based on full convolution neural network
CN109785263B (en) * 2019-01-14 2022-09-16 北京大学深圳研究生院 Retinex-based inverse tone mapping image conversion method
US11107205B2 (en) * 2019-02-18 2021-08-31 Samsung Electronics Co., Ltd. Techniques for convolutional neural network-based multi-exposure fusion of multiple image frames and for deblurring multiple image frames
CN111951171A (en) * 2019-05-16 2020-11-17 武汉Tcl集团工业研究院有限公司 HDR image generation method and device, readable storage medium and terminal equipment
CN110717868B (en) * 2019-09-06 2022-05-03 上海交通大学 Video high dynamic range inverse tone mapping model construction and mapping method and device
CN111709900A (en) * 2019-10-21 2020-09-25 上海大学 High dynamic range image reconstruction method based on global feature guidance
CN111292264B (en) * 2020-01-21 2023-04-21 武汉大学 Image high dynamic range reconstruction method based on deep learning
CN111372006B (en) * 2020-03-03 2021-05-07 山东大学 High dynamic range imaging method and system for mobile terminal
CN111914938B (en) * 2020-08-06 2024-01-30 上海金桥信息股份有限公司 Image attribute classification and identification method based on full convolution two-branch network

Also Published As

Publication number Publication date
WO2022266955A1 (en) 2022-12-29

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination