WO2022266955A1

WO2022266955A1 - Image decoding method and apparatus, image processing method and apparatus, and device

Info

Publication number: WO2022266955A1
Application number: PCT/CN2021/102173
Authority: WO
Inventors: 元辉; 姜世奇; 杨烨; 李明
Original assignee: Oppo广东移动通信有限公司
Priority date: 2021-06-24
Filing date: 2021-06-24
Publication date: 2022-12-29
Also published as: CN117441186A

Abstract

The present application provides an image decoding method and apparatus, an image processing method and apparatus, and a device. The method comprises: decoding a code stream to obtain a reconstructed image, and inputting the reconstructed image into a dynamic conversion model for dynamic conversion, so as to obtain a high dynamic range (HDR) image of the reconstructed image, wherein the dynamic conversion model comprises N encoding modules and N decoding modules; an ith encoding module is in skip connection with a (N-i+1)th decoding module; the ith encoding module is configured to perform feature extraction on a (i-1)th piece of first feature information outputted by a (i-1)th encoding module, so as to obtain an ith piece of first feature information of the reconstructed image, and the (N-i+1)th decoding module is configured to perform feature extraction on the (i-1)th piece of first feature information and a (N-i)th piece of second feature information to obtain a (N-i+1)th piece of second feature information, and the HDR image is determined according to the second feature information outputted by a last decoding module. The present application converts an image having a low dynamic range into an image having a high dynamic range by using the dynamic conversion model, and thus, the process is simple and the cost is low.

Description

Image decoding and processing method, device and equipment

technical field

The present application relates to the technical field of image processing, and in particular to an image decoding and processing method, device and equipment.

Background technique

Dynamic range is a term used to define how wide a range of tonal detail a camera can capture in an image, usually the range from the lowest value to the highest overflow value. Simply put, it describes the ratio between the brightest and darkest tones a camera can record in a single frame. The larger the dynamic range, the more likely it is to preserve information in highlights and shadows.

However, the acquisition of high dynamic range images is relatively complicated, and higher requirements are placed on hardware and algorithms in terms of data acquisition, transmission, storage, and display. At present, the conversion cost of converting low dynamic range images to high dynamic range images is high. .

Contents of the invention

Embodiments of the present application provide an image decoding and processing method, device, and equipment to reduce the cost of converting a low dynamic range image into a high dynamic range image.

In the first aspect, the embodiment of the present application provides an image decoding method, including:

Decode the code stream to obtain the reconstructed image;

Input the reconstructed image into the dynamic conversion model for dynamic conversion to obtain a high dynamic range HDR image of the reconstructed image;

Wherein, the dynamic conversion model includes: N encoding modules connected in series and N decoding modules connected in series, the output of the last encoding module in the N encoding modules is connected to the input of the first decoding module in the N decoding modules , and the i-th coding module is skip-connected to the N-i+1-th decoding module, and the i-th coding module is used to perform feature extraction on the i-1-th first feature information output by the i-1-th coding module, Obtain the i-th first feature information of the reconstructed image, and the N-i+1 decoding module is used to perform feature extraction on the i-1-th first feature information and the N-i-th second feature information of the reconstructed image, and obtain the reconstruction The N-i+1th second feature information of the image, the HDR image of the reconstructed image is determined according to the second feature information output by the last decoding module in the N decoding modules, i is a positive integer less than or equal to N, and N is a positive integer.

In a second aspect, the present application provides an image processing method, including:

Obtain the low dynamic range LDR image to be processed;

Input the LDR image into the dynamic conversion model for dynamic conversion, and obtain the high dynamic range HDR image of the LDR image;

Wherein, the dynamic conversion model includes: N encoding modules connected in series and N decoding modules connected in series, the output of the last encoding module in the N encoding modules is connected to the input of the first decoding module in the N decoding modules , and the i-th coding module is skip-connected to the N-i+1-th decoding module, and the i-th coding module is used to perform feature extraction on the i-1-th first feature information output by the i-1-th coding module, Obtain the i-th first feature information of the LDR image, and the N-i+1 decoding module is used to perform feature extraction on the i-1-th first feature information and the N-i-th second feature information of the LDR image, and obtain the LDR The N-i+1th second feature information of the image, the HDR image of the LDR image is determined according to the second feature information output by the last decoding module in the N decoding modules, i is a positive integer less than or equal to N, and N is a positive integer.

In a third aspect, the present application provides a model training method, including:

Obtain the true value of the low dynamic range LDR training image and the high dynamic range HDR image of the LDR training image;

Input the LDR training image into the dynamic conversion model, and extract the i-1 first feature information through the i-th encoding module to obtain the i-th first feature information of the LDR training image, wherein the dynamic conversion model includes serial connection The N encoding modules of N encoding modules and N decoding modules connected in series, the output of the last encoding module among the N encoding modules is connected to the input of the first decoding module among the N decoding modules, and the i-th encoding module is connected to the first N-i+1 decoding modules are skipped and connected, i is a positive integer less than or equal to N, and N is a positive integer;

Through the N-i+1th decoding module, the i-1th first feature information and the N-ith second feature information of the LDR training image are extracted to obtain the N-i+1th second feature information of the LDR training image characteristic information;

According to the second feature information of the LDR training image output by the last decoding module in the N decoding modules, determine the HDR image prediction value of the LDR training image;

Determine the loss between the predicted value of the HDR image of the LDR training image and the true value of the HDR image of the LDR training image, and train the dynamic transformation model according to the loss.

In a fourth aspect, an image decoding device is provided, configured to execute the method in the above first aspect or its implementations. Specifically, the image decoding device includes a functional unit configured to execute the method in the above first aspect or each implementation manner thereof.

In a fifth aspect, a decoder is provided, including a processor and a memory. The memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory, so as to execute the method in the above first aspect or its various implementations.

In a sixth aspect, an image processing device is provided, configured to execute the method in the above-mentioned second aspect or various implementations thereof. Specifically, the device includes a functional unit configured to execute the method in the above second aspect or each implementation manner thereof.

In a seventh aspect, an image processing device is provided, including a processor and a memory. The memory is used to store a computer program, and the processor is used to invoke and run the computer program stored in the memory, so as to execute the method in the above second aspect or its various implementations.

In an eighth aspect, a model training device is provided, configured to execute the method in the above third aspect or various implementations thereof. Specifically, the model training device includes a functional unit for executing the method in the above third aspect or its various implementations.

In a ninth aspect, a model training device is provided, including a processor and a memory. The memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory, so as to execute the method in the above third aspect or its various implementations.

In a tenth aspect, a chip is provided, configured to implement any one of the foregoing first to third aspects or the method in each implementation manner thereof. Specifically, the chip includes: a processor, configured to call and run a computer program from the memory, so that the device installed with the chip executes any one of the above-mentioned first to third aspects or any of the implementations thereof. method.

In an eleventh aspect, there is provided a computer-readable storage medium for storing a computer program, and the computer program causes a computer to execute any one of the above-mentioned first to third aspects or the method in each implementation manner thereof.

A twelfth aspect provides a computer program product, including computer program instructions, the computer program instructions cause a computer to execute any one of the above first to third aspects or the method in each implementation manner.

A thirteenth aspect provides a computer program, which, when running on a computer, causes the computer to execute any one of the above first to third aspects or the method in each implementation manner.

Based on the above technical scheme, the dynamic conversion model includes N encoding modules connected in series and N decoding modules connected in series, the output of the last encoding module in the N encoding modules is connected with the first decoding module in the N decoding modules The input connection of , and the i-th encoding module is skip-connected to the N-i+1-th decoding module, and the i-th encoding module is used to perform the i-1th first feature information output by the i-1-th encoding module Feature extraction, to obtain the i-th first feature information of the reconstructed image, and the N-i+1-th decoding module is used to perform feature extraction on the i-1-th first feature information and the N-i-th second feature information of the reconstructed image , to obtain the N-i+1th second characteristic information of the reconstructed image, the HDR image of the reconstructed image is determined according to the second characteristic information output by the last decoding module in the N decoding modules, i is a positive value less than or equal to N Integer, N is a positive integer. Using this dynamic conversion model, LDR images can be converted into HDR images, and HDR image conversion can be realized without increasing the cost of data acquisition, encoding, transmission, storage, etc., thereby improving the efficiency of HDR image conversion and reducing the cost of HDR images. image.

Description of drawings

FIG. 1 is a schematic block diagram of a video encoding and decoding system involved in an embodiment of the present application;

Fig. 2 is a schematic block diagram of a video encoder provided by an embodiment of the present application;

Fig. 3 is a schematic block diagram of a video decoder provided by an embodiment of the present application;

FIG. 4 is a schematic flow chart of a dynamic conversion model training method provided by an embodiment of the present application;

FIG. 5A is a network schematic diagram of a dynamic conversion model involved in an embodiment of the present application;

FIG. 5B is a schematic network diagram of a convolution block involved in an embodiment of the present application;

FIG. 5C is a network schematic diagram of a dynamic conversion model involved in an embodiment of the present application;

FIG. 5D is a network diagram of a convolutional attention module involved in an embodiment of the present application;

FIG. 5E is a network diagram of a channel attention module involved in an embodiment of the present application;

FIG. 5F is a network schematic diagram of a spatial attention module involved in an embodiment of the present application;

FIG. 5G is a network schematic diagram of a dynamic conversion model involved in an embodiment of the present application;

FIG. 6 is a schematic flowchart of an image decoding method provided by an embodiment of the present application;

FIG. 7 is a network diagram of a spatial attention module involved in an embodiment of the present application;

FIG. 8 is a schematic flowchart of an image processing method provided by an embodiment of the present application;

FIG. 9 is a schematic block diagram of an image decoding device provided by an embodiment of the present application;

FIG. 10 is a schematic block diagram of an image processing device provided by an embodiment of the present application;

Fig. 11 is a schematic block diagram of a model training device provided by an embodiment of the present application;

Fig. 12 is a schematic block diagram of an electronic device provided by an embodiment of the present application.

detailed description

The present application can be applied to the technical field of point cloud upsampling, for example, can be applied to the technical field of point cloud compression.

The application can be applied to the field of image codec, video codec, hardware video codec, dedicated circuit video codec, real-time video codec, etc. For example, the solution of the present application can be combined with audio and video coding standards (audio video coding standard, referred to as AVS), for example, H.264/audio video coding (audio video coding, referred to as AVC) standard, H.265/high efficiency video coding ( High efficiency video coding (HEVC for short) standard and H.266/versatile video coding (VVC for short) standard. Alternatively, the solutions of the present application may operate in conjunction with other proprietary or industry standards, including ITU-TH.261, ISO/IECMPEG-1Visual, ITU-TH.262 or ISO/IECMPEG-2Visual, ITU-TH.263 , ISO/IECMPEG-4Visual, ITU-TH.264 (also known as ISO/IECMPEG-4AVC), including scalable video codec (SVC) and multi-view video codec (MVC) extensions. It should be understood that the techniques of this application are not limited to any particular codec standard or technology.

For ease of understanding, the video codec system involved in the embodiment of the present application is first introduced with reference to FIG. 1 .

FIG. 1 is a schematic block diagram of a video encoding and decoding system involved in an embodiment of the present application. It should be noted that FIG. 1 is only an example, and the video codec system in the embodiment of the present application includes but is not limited to what is shown in FIG. 1 . As shown in FIG. 1 , the video codec system 100 includes an encoding device 110 and a decoding device 120 . The encoding device is used to encode (can be understood as compression) the video data to generate a code stream, and transmit the code stream to the decoding device. The decoding device decodes the code stream generated by the encoding device to obtain decoded video data.

The encoding device 110 in the embodiment of the present application can be understood as a device having a video encoding function, and the decoding device 120 can be understood as a device having a video decoding function, that is, the embodiment of the present application includes a wider range of devices for the encoding device 110 and the decoding device 120, Examples include smartphones, desktop computers, mobile computing devices, notebook (eg, laptop) computers, tablet computers, set-top boxes, televisions, cameras, display devices, digital media players, video game consoles, vehicle-mounted computers, and the like.

In some embodiments, the encoding device 110 may transmit the encoded video data (such as code stream) to the decoding device 120 via the channel 130 . Channel 130 may include one or more media and/or devices capable of transmitting encoded video data from encoding device 110 to decoding device 120 .

In one example, channel 130 includes one or more communication media that enable encoding device 110 to transmit encoded video data directly to decoding device 120 in real-time. In this example, encoding device 110 may modulate the encoded video data according to a communication standard and transmit the modulated video data to decoding device 120 . The communication medium includes a wireless communication medium, such as a radio frequency spectrum. Optionally, the communication medium may also include a wired communication medium, such as one or more physical transmission lines.

In another example, the channel 130 includes a storage medium that can store video data encoded by the encoding device 110 . The storage medium includes a variety of local access data storage media, such as optical discs, DVDs, flash memory, and the like. In this example, the decoding device 120 may acquire encoded video data from the storage medium.

In another example, channel 130 may include a storage server that may store video data encoded by encoding device 110 . In this instance, the decoding device 120 may download the stored encoded video data from the storage server. Optionally, the storage server may store the encoded video data and may transmit the encoded video data to the decoding device 120, such as a web server (eg, for a website), a file transfer protocol (FTP) server, and the like.

In some embodiments, the encoding device 110 includes a video encoder 112 and an output interface 113 . Wherein, the output interface 113 may include a modulator/demodulator (modem) and/or a transmitter.

In some embodiments, the encoding device 110 may include a video source 111 in addition to the video encoder 112 and the input interface 113 .

The video source 111 may include at least one of a video capture device (for example, a video camera), a video archive, a video input interface, a computer graphics system, wherein the video input interface is used to receive video data from a video content provider, and the computer graphics system Used to generate video data.

The video encoder 112 encodes the video data from the video source 111 to generate a code stream. Video data may include one or more pictures or a sequence of pictures. The code stream contains the encoding information of an image or image sequence in the form of a bit stream. Encoding information may include encoded image data and associated data. The associated data may include a sequence parameter set (SPS for short), a picture parameter set (PPS for short) and other syntax structures. An SPS may contain parameters that apply to one or more sequences. A PPS may contain parameters applied to one or more images. The syntax structure refers to a set of zero or more syntax elements arranged in a specified order in the code stream.

The video encoder 112 directly transmits encoded video data to the decoding device 120 via the output interface 113 . The encoded video data can also be stored on a storage medium or a storage server for subsequent reading by the decoding device 120 .

In some embodiments, the decoding device 120 includes an input interface 121 and a video decoder 122 .

In some embodiments, the decoding device 120 may include a display device 123 in addition to the input interface 121 and the video decoder 122 .

Wherein, the input interface 121 includes a receiver and/or a modem. The input interface 121 can receive encoded video data through the channel 130 .

The video decoder 122 is used to decode the encoded video data to obtain decoded video data, and transmit the decoded video data to the display device 123 .

The display device 123 displays the decoded video data. The display device 123 may be integrated with the decoding device 120 or external to the decoding device 120 . The display device 123 may include various display devices, such as a liquid crystal display (LCD), a plasma display, an organic light emitting diode (OLED) display, or other types of display devices.

In addition, FIG. 1 is only an example, and the technical solutions of the embodiments of the present application are not limited to FIG. 1 . For example, the technology of the present application may also be applied to one-sided video encoding or one-sided video decoding.

The video encoder involved in the embodiment of the present application is introduced below.

Fig. 2 is a schematic block diagram of a video encoder provided by an embodiment of the present application. It should be understood that the video encoder 200 can be used to perform lossy compression on images, and can also be used to perform lossless compression on images. The lossless compression may be visually lossless compression or mathematically lossless compression.

The video encoder 200 can be applied to image data in luminance-chrominance (YCbCr, YUV) format. For example, the YUV ratio can be 4:2:0, 4:2:2 or 4:4:4, Y means brightness (Luma), Cb (U) means blue chroma, Cr (V) means red chroma, U and V are expressed as chroma (Chroma) for describing color and saturation. For example, in terms of color format, 4:2:0 means that every 4 pixels have 4 luminance components, 2 chroma components (YYYYCbCr), 4:2:2 means that every 4 pixels have 4 luminance components, 4 Chroma component (YYYYCbCrCbCr), 4:4:4 means full pixel display (YYYYCbCrCbCrCbCrCbCr).

For example, the video encoder 200 reads video data, and for each frame of image in the video data, divides a frame of image into several coding tree units (coding tree unit, CTU), "largest coding unit" (Largest Coding unit, LCU for short) or "coding tree block" (coding tree block, CTB for short). Each CTU may be associated with a pixel block of equal size within the image. Each pixel may correspond to one luminance (luma) sample and two chrominance (chrominance or chroma) samples. Thus, each CTU may be associated with one block of luma samples and two blocks of chroma samples. A CTU size is, for example, 128×128, 64×64, 32×32 and so on. A CTU can be further divided into several coding units (Coding Unit, CU) for coding, and the CU can be a rectangular block or a square block. The CU can be further divided into a prediction unit (PU for short) and a transform unit (TU for short), so that coding, prediction, and transformation are separated, and processing is more flexible. In an example, a CTU is divided into CUs in a quadtree manner, and a CU is divided into TUs and PUs in a quadtree manner.

The video encoder and video decoder can support various PU sizes. Assuming that the size of a specific CU is 2N×2N, video encoders and video decoders may support 2N×2N or N×N PU sizes for intra prediction, and support 2N×2N, 2N×N, N×2N, NxN or similarly sized symmetric PUs for inter prediction. The video encoder and video decoder may also support asymmetric PUs of 2NxnU, 2NxnD, nLx2N, and nRx2N for inter prediction.

In some embodiments, as shown in FIG. 2 , the video encoder 200 may include: a prediction unit 210, a residual unit 220, a transform/quantization unit 230, an inverse transform/quantization unit 240, a reconstruction unit 250, and a loop filter unit 260. Decoded image cache 270 and entropy encoding unit 280. It should be noted that the video encoder 200 may include more, less or different functional components.

Optionally, in this application, the current block (current block) may be called a current coding unit (CU) or a current prediction unit (PU). A predicted block may also be referred to as a predicted block to be encoded or an image predicted block, and a reconstructed block to be encoded may also be referred to as a reconstructed block or an image reconstructed block to be encoded.

In some embodiments, the prediction unit 210 includes an inter prediction unit 211 and an intra prediction unit 212 . Because there is a strong correlation between adjacent pixels in a video frame, the intra-frame prediction method is used in video coding and decoding technology to eliminate the spatial redundancy between adjacent pixels. Due to the strong similarity between adjacent frames in video, the inter-frame prediction method is used in video coding and decoding technology to eliminate time redundancy between adjacent frames, thereby improving coding efficiency.

The inter-frame prediction unit 211 can be used for inter-frame prediction. The inter-frame prediction can refer to image information of different frames. The inter-frame prediction uses motion information to find a reference block from the reference frame, and generates a prediction block according to the reference block to eliminate temporal redundancy; Frames used for inter-frame prediction may be P frames and/or B frames, P frames refer to forward predictive frames, and B frames refer to bidirectional predictive frames. The motion information includes the reference frame list where the reference frame is located, the reference frame index, and the motion vector. The motion vector can be an integer pixel or a sub-pixel. If the motion vector is sub-pixel, then it is necessary to use interpolation filtering in the reference frame to make the required sub-pixel block. Here, the reference frame found according to the motion vector A block of whole pixels or sub-pixels is called a reference block. Some technologies will directly use the reference block as a prediction block, and some technologies will further process the reference block to generate a prediction block. Reprocessing and generating a prediction block based on a reference block can also be understood as taking the reference block as a prediction block and then processing and generating a new prediction block based on the prediction block.

Currently the most commonly used inter-frame prediction methods include: geometric partitioning mode (GPM) in the VVC video codec standard, and angular weighted prediction (AWP) in the AVS3 video codec standard. These two intra-frame prediction modes have something in common in principle.

The intra-frame prediction unit 212 only refers to the information of the same frame image, and predicts the pixel information in the block to be encoded of the current code, so as to eliminate spatial redundancy. A frame used for intra prediction may be an I frame.

In some embodiments, the intra prediction method further includes a multiple reference line intra prediction method (multiple reference line, MRL). MRL can use more reference pixels to improve coding efficiency.

There are multiple prediction modes for intra-frame prediction, and there are 9 modes for intra-frame prediction for 4×4 blocks in H.264. Among them, mode 0 is to copy the pixels above the current block to the current block in the vertical direction as the prediction value; mode 1 is to copy the reference pixels on the left to the current block in the horizontal direction as the prediction value; mode 2 (DC) is to copy the pixels from A to The average value of the 8 points D and I~L is used as the prediction value of all points, and modes 3 to 8 are to copy the reference pixel to the corresponding position of the current block according to a certain angle. Because some positions of the current block cannot exactly correspond to the reference pixels, it may be necessary to use the weighted average of the reference pixels, or the sub-pixels of the interpolated reference pixels.

The intra prediction modes used by HEVC include planar mode (Planar), DC and 33 angle modes, a total of 35 prediction modes. The intra-frame modes used by VVC include Planar, DC and 65 angle modes, with a total of 67 prediction modes. The intra-frame modes used by AVS3 include DC, Plane, Bilinear and 63 angle modes, a total of 66 prediction modes.

It should be noted that with the increase of the angle mode, the intra-frame prediction will be more accurate, and it will be more in line with the demand for the development of high-definition and ultra-high-definition digital video.

The residual unit 220 may generate a residual block of the CU based on the pixel blocks of the CU and the prediction blocks of the PUs of the CU. For example, residual unit 220 may generate a residual block for a CU such that each sample in the residual block has a value equal to the difference between the samples in the pixel blocks of the CU, and the samples in the PUs of the CU. Corresponding samples in the predicted block.

Transform/quantization unit 230 may quantize the transform coefficients. Transform/quantization unit 230 may quantize transform coefficients associated with TUs of a CU based on quantization parameter (QP) values associated with the CU. Video encoder 200 may adjust the degree of quantization applied to transform coefficients associated with a CU by adjusting the QP value associated with the CU.

Inverse transform/quantization unit 240 may apply inverse quantization and inverse transform to the quantized transform coefficients, respectively, to reconstruct a residual block from the quantized transform coefficients.

The reconstruction unit 250 may add samples of the reconstructed residual block to corresponding samples of one or more prediction blocks generated by the prediction unit 210 to generate a reconstructed block to be encoded associated with the TU. By reconstructing the sample blocks of each TU of the CU in this way, the video encoder 200 can reconstruct the pixel blocks of the CU.

Loop filtering unit 260 may perform deblocking filtering operations to reduce blocking artifacts of pixel blocks associated with a CU.

In some embodiments, the loop filtering unit 260 includes a deblocking filtering unit, a sample point adaptive compensation SAO unit, and an adaptive loop filtering ALF unit.

The decoded image buffer 270 may store reconstructed pixel blocks. Inter prediction unit 211 may use reference pictures containing reconstructed pixel blocks to perform inter prediction on PUs of other pictures. In addition, intra prediction unit 212 may use the reconstructed pixel blocks in decoded picture cache 270 to perform intra prediction on other PUs in the same picture as the CU.

Entropy encoding unit 280 may receive the quantized transform coefficients from transform/quantization unit 230 . Entropy encoding unit 280 may perform one or more entropy encoding operations on the quantized transform coefficients to generate entropy encoded data.

The basic flow of video coding involved in this application is as follows: at the coding end, the current image is divided into blocks, and for the current block, the prediction unit 210 uses intra prediction or inter prediction to generate a prediction block of the current block. The residual unit 220 may calculate a residual block based on the predicted block and the original block of the current block, that is, a difference between the predicted block and the original block of the current block, and the residual block may also be referred to as residual information. The residual block can be transformed and quantized by the transformation/quantization unit 230 to remove information that is not sensitive to human eyes, so as to eliminate visual redundancy. Optionally, the residual block before being transformed and quantized by the transform/quantization unit 230 may be called a time domain residual block, and the time domain residual block after being transformed and quantized by the transform/quantization unit 230 may be called a frequency residual block or a frequency-domain residual block. The entropy encoding unit 280 receives the quantized transform coefficients output by the transform and quantization unit 230 , may perform entropy encoding on the quantized transform coefficients, and output a code stream. For example, the entropy coding unit 280 can eliminate character redundancy according to the target context model and the probability information of the binary code stream.

In addition, the video encoder performs inverse quantization and inverse transformation on the quantized transform coefficients output by the transform and quantization unit 230 to obtain a residual block of the current block, and then adds the residual block of the current block to the prediction block of the current block, Get the reconstructed block of the current block. As the encoding proceeds, reconstructed blocks corresponding to other blocks to be encoded in the current image can be obtained, and these reconstructed blocks are spliced to obtain a reconstructed image of the current image. Due to the error introduced in the encoding process, in order to reduce the error, filter the reconstructed image, for example, use ALF to filter the reconstructed image to reduce the difference between the pixel value of the pixel in the reconstructed image and the original pixel value of the pixel in the current image difference. The filtered reconstructed image is stored in the decoded image buffer 270, which may serve as a reference frame for inter-frame prediction for subsequent frames.

It should be noted that the block division information determined by the encoder, as well as mode information or parameter information such as prediction, transformation, quantization, entropy coding, and loop filtering, etc., are carried in the code stream when necessary. The decoding end analyzes the code stream and analyzes the existing information to determine the same block division information as the encoding end, prediction, transformation, quantization, entropy coding, loop filtering and other mode information or parameter information, so as to ensure the decoding image obtained by the encoding end It is the same as the decoded image obtained by the decoder.

Fig. 3 is a schematic block diagram of a video decoder provided by an embodiment of the present application.

As shown in FIG. 3 , the video decoder 300 includes: an entropy decoding unit 310 , a prediction unit 320 , an inverse quantization/transformation unit 330 , a reconstruction unit 340 , a loop filter unit 350 and a decoded image buffer 360 . It should be noted that the video decoder 300 may include more, less or different functional components.

The video decoder 300 can receive code streams. The entropy decoding unit 310 may parse the codestream to extract syntax elements from the codestream. As part of parsing the codestream, the entropy decoding unit 310 may parse the entropy-encoded syntax elements in the codestream. The prediction unit 320 , the inverse quantization/transformation unit 330 , the reconstruction unit 340 and the loop filter unit 350 can decode video data according to the syntax elements extracted from the code stream, that is, generate decoded video data.

In some embodiments, the prediction unit 320 includes an intra prediction unit 321 and an inter prediction unit 322 .

Intra prediction unit 321 may perform intra prediction to generate a predictive block for a PU. Intra prediction unit 321 may use an intra prediction mode to generate a prediction block for a PU based on pixel blocks of spatially neighboring PUs. Intra prediction unit 321 may also determine an intra prediction mode for a PU from one or more syntax elements parsed from a codestream.

The inter prediction unit 322 may construct a first reference picture list (list 0) and a second reference picture list (list 1) according to the syntax elements parsed from the codestream. Furthermore, if the PU is encoded using inter prediction, entropy decoding unit 310 may parse the motion information for the PU. Inter prediction unit 322 may determine one or more reference blocks for the PU according to the motion information of the PU. Inter prediction unit 322 may generate a predictive block for the PU from one or more reference blocks for the PU.

Inverse quantization/transform unit 330 may inverse quantize (ie, dequantize) transform coefficients associated with a TU. Inverse quantization/transform unit 330 may use QP values associated with CUs of the TU to determine the degree of quantization.

After inverse quantizing the transform coefficients, inverse quantization/transform unit 330 may apply one or more inverse transforms to the inverse quantized transform coefficients in order to generate a residual block associated with the TU.

Reconstruction unit 340 uses the residual blocks associated with the TUs of the CU and the prediction blocks of the PUs of the CU to reconstruct the pixel blocks of the CU. For example, the reconstruction unit 340 may add the samples of the residual block to the corresponding samples of the prediction block to reconstruct the pixel block of the CU, and obtain the reconstructed block to be encoded.

Loop filtering unit 350 may perform deblocking filtering operations to reduce blocking artifacts of pixel blocks associated with a CU.

In some embodiments, the loop filtering unit 350 includes a deblocking filtering unit, a sample point adaptive compensation SAO unit, and an adaptive loop filtering ALF unit.

Video decoder 300 may store the reconstructed picture of the CU in decoded picture cache 360 . The video decoder 300 may use the reconstructed picture in the decoded picture buffer 360 as a reference picture for subsequent prediction, or transmit the reconstructed picture to a display device for presentation.

The basic process of video decoding involved in this application is as follows: the entropy decoding unit 310 can analyze the code stream to obtain the prediction information of the current block, the quantization coefficient matrix, etc., and the prediction unit 320 uses intra prediction or inter prediction to generate the current block based on the prediction information. The predicted block for the block. The inverse quantization/transformation unit 330 uses the quantization coefficient matrix obtained from the code stream to perform inverse quantization and inverse transformation on the quantization coefficient matrix to obtain a residual block. The reconstruction unit 340 adds the predicted block and the residual block to obtain a reconstructed block. The reconstructed blocks form a reconstructed image, and the loop filtering unit 350 performs loop filtering on the reconstructed image based on the image or based on the block to obtain a decoded image. The decoded image can also be referred to as a reconstructed image. On the one hand, the reconstructed image can be displayed by a display device, and on the other hand, it can be stored in the decoded image buffer 360 and serve as a reference frame for inter-frame prediction for subsequent frames.

The above is the basic process of the video codec under the block-based hybrid coding framework. With the development of technology, some modules or steps of the framework or process may be optimized. This application is applicable to the block-based hybrid coding framework. The basic process of the video codec, but not limited to the framework and process.

Real-world scenes have a large dynamic range, spanning up to 14 orders of magnitude, from a moonless late night to the harsh midday sun. In such a complex environment, low dynamic range (LDR) images captured by traditional cameras will cause some parts of the image to be overexposed or underexposed, which cannot truly restore the real world, while high dynamic range (High Dynamic Range) Range (HDR for short) images contain rich light, shadow and color information in various lighting environments in real scenes, and can more completely record or show texture details in light and dark areas that are basically the same as real scenes. At the same time, the acquisition of HDR images is relatively complex, and higher requirements are placed on hardware and algorithms in terms of data acquisition, transmission, storage, and display.

In recent years, with the rapid development of deep learning technology, especially the wide application of convolutional neural network (CNN), reconstruction of high dynamic range (HDR) covering the entire dynamic range from single or multiple exposure low dynamic range (LDR) images of the same scene Range (HDR) images are possible.

An embodiment of the present application provides a model-based image processing method, which converts an LDR image into an HDR image through a model. That is, the encoding end encodes the LDR image to form a code stream and transmits it to the decoding end. After decoding the LDR image, the decoding end uses the model of the embodiment of the present application to dynamically convert the decoded LDR image to obtain an HDR image. HDR image conversion is achieved while reducing the cost of encoding, transmission, and storage.

The technical solutions involved in the embodiments of the present application will be introduced below in conjunction with specific embodiments.

The image processing method provided in the present application converts an LDR image into an HDR image by using a dynamic conversion model, and the dynamic conversion model is a piece of software code or a chip with data processing functions. Based on this, the training process of the dynamic conversion model is firstly introduced.

Fig. 4 is a schematic flow chart of a dynamic conversion model training method provided by an embodiment of the present application. As shown in Fig. 4, the training process includes:

S401. Acquire the LDR training image and the HDR image truth value of the LDR training image.

The above-mentioned LDR training image is a randomly selected LDR training image in the training set, which includes a plurality of LDR training images, and the training process of the dynamic conversion model using the LDR training images in the training set is an iterative process. For example, the first LDR training image is input into the dynamic conversion model to be trained, and the initial parameters of the dynamic conversion model are adjusted once to obtain the dynamic conversion model trained for the first time. Next, input the second LDR training image into the dynamic conversion model trained for the first time, adjust the parameters of the dynamic conversion model trained for the first time, and obtain the dynamic conversion model trained for the second time, refer to the above method, iterates in sequence until the training end condition of the dynamic conversion model is reached. Wherein, the training end condition of the dynamic conversion model includes that the number of training times reaches a preset number of times, or the loss reaches a preset loss.

The methods for determining the initial parameters of the above-mentioned dynamic conversion model include but are not limited to the following:

In a first manner, the initial parameters of the dynamic conversion model may be preset values, or random values, or empirical values.

The second way is to obtain the pre-training parameters obtained during the pre-training of the pre-training model, and determine the pre-training parameters as the initial parameters of the dynamic conversion model.

The second way is to determine the pre-training parameters of the pre-training model as the initial parameters of the dynamic conversion model, which can reduce the number of training times and training accuracy of the dynamic conversion model.

The embodiment of the present application does not limit the type of the pre-training model, for example, the pre-training model is the VGG-16 network model.

It can be known from the above that the process of training the dynamic conversion model using each LDR training image in the training set is consistent. For the convenience of description, the embodiment of the present application uses an LDR training image as an example to illustrate the training process of the dynamic conversion model.

In some embodiments, the true value of the HDR image of the above-mentioned LDR training image may be dynamically converted to the LDR training image manually to generate an HDR image.

In some embodiments, the true value of the HDR image of the above-mentioned LDR training image may be an HDR image obtained by converting the LDR training image using an existing high dynamic conversion method.

In some embodiments, the collected HDR image may be converted into an LDR image, the converted LDR image may be used as an LDR training image, and the collected HDR image may be used as a true value of the HDR image of the LDR training image.

The embodiment of the present application does not limit the way of acquiring the LDR training image and the HDR image true value of the LDR training image.

S402. Input the LDR training image into the dynamic conversion model for dynamic conversion, and extract the i-1 first feature information through the i-th encoding module to obtain the i-th first feature information of the LDR training image.

S403, perform feature extraction on the i-1th first feature information and the N-ith second feature information of the LDR training image through the N-i+1 decoding module, and obtain the N-i+1th LDR training image Second characteristic information.

The network structure of the dynamic conversion model involved in the embodiment of the present application will be introduced below in conjunction with FIG. 5A. It should be noted that the network structure of the dynamic conversion model in the embodiment of the present application includes but is not limited to the modules shown in FIG. Figure 5A More or less modules.

FIG. 5A is a schematic network diagram of a dynamic conversion model according to an embodiment of the present application. As shown in FIG. 5A , the dynamic conversion model can be understood as an autoencoder network composed of N-level encoding components and decoding components. The dynamic conversion model includes: N encoding modules connected in series and N decoding modules connected in series, the output of the last encoding module in the N encoding modules is connected to the input of the first decoding module in the N decoding modules, and The i-th encoding module is connected to the N-i+1-th decoding module by skip connection. The skip connection can be understood as the connection between the input end of the i-th encoding module and the input end of the N-i+1-th decoding module. The i-th encoding module is used to perform feature extraction on the i-1-th first feature information to obtain the i-th first feature information of the LDR training image, and the N-i+1-th decoding module is used to extract the i-1-th first feature information The first feature information and the N-i second feature information of the LDR training image are extracted to obtain the N-i+1 second feature information of the LDR training image, i is a positive integer less than or equal to N, and N is positive integer.

Wherein, if i is equal to N, the above N-i th second feature information is determined according to the N th first feature information output by the N th encoding module.

If i is less than N, the above N-i th second feature information is determined according to the N-i th second feature information output by the N-i th decoding module.

If i is equal to 1, the i-1th first feature information is determined according to the LDR training image.

If i is greater than 1, the i-1th first feature information is determined according to the first feature information output by the i-1th coding module.

For example, as shown in FIG. 5A , N=4, the encoding component includes 4 serial encoding modules, the decoding component includes 4 serial decoding modules, and the output of the last encoding module is connected to the input of the first decoding module. The first coding module is connected to the fourth decoding module by skipping, the second coding module is connected to the third decoding module by skipping, the third coding module is connected to the second decoding module by skipping, and the fourth coding module is connected to the first skip connections of decoding modules.

Input the LDR training image into the dynamic conversion model to obtain the 0th first feature information, the 0th first feature information can be the LDR training image, or the feature map after the LDR training image is processed, the embodiment of the present application is for This is not limited. Input the 0th first feature information into the first encoding module and the fourth decoding module respectively, the first encoding module outputs the first first feature information according to the 0th first feature information, and the first The first feature information is respectively input into the second encoding module and the third decoding module. The second encoding module obtains the second first feature information according to the first first feature information, and inputs the second first feature information into the third encoding module and the second decoding module respectively. The third encoding module obtains the third first characteristic information according to the second first characteristic information, and inputs the third first characteristic information into the fourth encoding module and the first decoding module respectively. The fourth encoding module outputs the fourth first characteristic information according to the third first characteristic information, and inputs the fourth first characteristic information into the first decoding module. The first decoding module obtains the first second characteristic information according to the fourth first characteristic information and the third first characteristic information, and inputs the first second characteristic information into the second decoding module. The second decoding module obtains the second second characteristic information according to the first second characteristic information and the second first characteristic information, and inputs the second second characteristic information into the third decoding module. The third decoding module obtains the third second characteristic information according to the second second characteristic information and the first first characteristic information, and inputs the third second characteristic information into the fourth decoding module. The fourth decoding module obtains the fourth second characteristic information according to the 0th first characteristic information and the third second characteristic information.

In some embodiments, as shown in FIG. 5A, the above S403 includes: concatenating the i-1th first feature information and the N-ith second feature information of the LDR training image, and “C” in FIG. 5A indicates Concatenation: input the concatenated feature information into the N-i+1th decoding module for feature extraction, and obtain the N-i+1th second feature information of the LDR training image. For example, the fourth first feature information and the third first feature information are concatenated, and the concatenated fourth first feature information and third first feature information are input into the first decoding module to obtain The first second feature information output by the first decoding module. Concatenate the first second feature information and the second first feature information, and input the concatenated first second feature information and the second first feature information into the second decoding module to obtain the first The second second feature information output by the two decoding modules. The second second characteristic information and the first first characteristic information are concatenated, and the concatenated second second characteristic information and the first first characteristic information are input into the third decoding module to obtain the third The third second feature information output by a decoding module. Similarly, the 0th first characteristic information and the third second characteristic information are concatenated, and the concatenated 0th first characteristic information and the third second characteristic information are input into the fourth decoding module, The fourth second feature information output by the fourth decoding module is obtained.

The embodiment of the present application does not limit the specific network structure of the encoding module.

In an embodiment, each of the N coding modules includes at least one convolutional block, and parameters of the convolutional blocks included in each of the N coding modules are not completely the same. For example, the feature dimension of the convolution block included in the first encoding module is 64, the feature dimension of the convolution block included in the second encoding module is 128, and the feature dimension of the convolution block included in the third encoding module is is 256, the feature dimension of the convolutional block included in the fourth encoding module is 512 and so on.

The embodiment of the present application does not limit the specific network structure of the decoding module.

In an embodiment, each of the N decoding modules includes at least one convolutional block, and parameters of the convolutional blocks included in each of the N decoding modules are not completely the same. For example, the feature dimension of the convolution block included in the first decoding module is 256, the feature dimension of the convolution block included in the second decoding module is 128, and the feature dimension of the convolution block included in the third decoding module is is 64, the feature dimension of the convolutional block included in the fourth coding module is 32 and so on.

The network structures of the convolutional blocks included in the encoding modules in the embodiments of the present application may be the same or different. The network structures of the convolutional blocks included in each decoding module may be the same or different. In addition, the network structures of the convolutional blocks included in the encoding module and the decoding module may be the same or different, which is not limited in this application.

In a possible implementation manner, the network structure of the convolutional block included in the encoding module and/or the decoding module includes a convolutional layer 1, a convolutional layer 2, a convolutional layer 3 and an activation function.

Optionally, as shown in FIG. 5B, the convolution kernels of convolution layer 1 and convolution layer 2 are 3×3, the convolution kernel of convolution layer 3 is 1×1, and the activation function is a Sigmoid weighted linear unit (Sigmoid Weighted Liner Unit, referred to as SiLU).

It should be noted that the sizes of the convolution kernels of the above-mentioned convolutional layer 1, convolutional layer 2, and convolutional layer 3 include but are not limited to the above values, and the activation functions include but are not limited to SiLU, such as RELU, etc. This is not limited.

In some embodiments, as shown in Figure 5C, the dynamic conversion model further includes: a convolutional block attention module (Convolutional Block Attention Module, Abbreviated as CBAM). The attention mechanism of this convolutional attention module enables the dynamic transformation model to focus more attention on the relevant parts of the encoding side features and less attention on other irrelevant parts, that is, by The convolutional attention mechanism is used to improve the representation ability of the dynamic conversion model, focusing on important features and suppressing unnecessary features, thus greatly improving the efficiency of the model.

In a possible implementation manner, one or more CBAMs are included in the skip connections between each encoding module and decoding module.

On the basis of the dynamic conversion model shown in Figure 5C, the above S403 performs feature extraction on the i-1th first feature information and the N-ith second feature information of the LDR training image through the N-i+1th decoding module , the N-i+1 second feature information of the LDR training image includes S403-A and S403-B:

S403-A. Extract the spatial information and channel information of the i-1 th first feature information through the convolutional attention module, and obtain the i-1 th third feature information of the LDR training image.

S403-B. Use the N-i+1th decoding module to perform feature extraction on the i-1th third feature information and the N-ith second feature information to obtain the N-i+1th second feature information of the LDR training image characteristic information. For example, the i-1th third feature information and the N-ith second feature information are concatenated, and the concatenated i-1th third feature information and N-ith second feature information are input into the N-th The i+1 decoding module obtains the N-i+1th second feature information of the LDR training image output by the N-i+1th decoding module.

The embodiment of the present application does not limit the network structure of the convolutional attention module.

In a possible implementation, as shown in FIG. 5D , the convolutional attention module includes: a channel attention module and a spatial attention module. Among them, the channel attention module learns the channel information of features by using the inter-channel relationship of features, and the spatial attention module learns the spatial information of features by using the spatial relationship of features.

It should be noted that the channel to which it belongs here can be understood as a feature dimension. For example, if the feature dimension of a piece of feature information is 32, it means that the number of channels of the feature information is 32.

On the basis of Figure 5D, in the above S403-A, the spatial information and channel information of the i-1th first feature information are extracted through the convolution attention module, and the i-1th third feature information of the LDR training image is obtained Including S403-A1 to S403-A3:

S403-A1. Perform channel information extraction on the i-1 th first feature information through the channel attention module, and obtain channel attention information of the i-1 th first feature information.

S403-A2. Using the spatial attention module, perform spatial information extraction on the fusion channel feature information of the i-1 first feature information, to obtain the spatial attention information of the i-1 first feature information.

Wherein, the fused channel feature information of the i-1 th first feature information is determined according to the i-1 th first feature information and the channel attention information of the i-1 th first feature information.

In some embodiments, as shown in Figure 5D, the convolutional attention module also includes a first multiplication unit, at this time S403-A2 includes S403-A21 and S403-A22:

S403-A21. Multiply the i-1 first feature information and the channel attention information of the i-1 first feature information by the first multiplication unit to obtain the fusion of the i-1 first feature information Channel characteristic information.

S403-A22. Input the fused channel feature information of the i-1 th first feature information into the spatial attention module to extract spatial information, and obtain the spatial attention information of the i-1 th first feature information.

S403-A3. Determine the i-1th third feature information of the LDR training image according to the channel attention information and the spatial attention information of the i-1th first feature information.

In some embodiments, as shown in FIG. 5D , the convolutional attention module further includes a second multiplication unit, then S403-A3 includes: the fusion channel feature information of the i-1th first feature information through the second multiplication unit Multiply with the spatial attention information to obtain the i-1th third feature information of the LDR training image.

For example, the network structure of the convolutional attention module is shown in Figure 5D. Assume that the i-1th first feature information is a feature map F, and the feature map F is input into the CBAM module, and the CBAM module will follow two independent dimensions (i.e. channel dimension and spatial dimension) the attention map is sequentially inferred, and then the attention map is multiplied with the input feature map for adaptive feature optimization. Specifically, firstly, the one-dimensional channel attention map MC is obtained through the channel attention module, and F' is obtained after multiplying MC and the input feature F. Input F' into the spatial attention module, and get the two-dimensional spatial attention map Ms through the spatial attention module. The final feature map F" is obtained after multiplying Ms and F', and the final feature map is the i-1th third feature information of the LDR training image.

It should be noted that in Figure 5D

Indicates that the corresponding elements are multiplied sequentially. Here, if the dimension of the input feature map F is H×W×C, then the dimension of the 1D channel attention map MC is 1×1×C, and the dimension of the 2D spatial attention map Ms is H×W×1.

The above S403-A1 will be described below in conjunction with the network structure of the channel attention module.

In some embodiments, as shown in FIG. 5E , the channel attention module includes: a first spatial compression unit, a second spatial compression unit, and a channel feature extraction unit. Wherein, both the first space compression unit and the second space compression unit are used to compress the spatial size of the feature map, and the channel feature extraction unit is used to perform feature extraction on the space compressed feature map. That is, as shown in Figure 5F, in order to efficiently calculate channel attention, the present application compresses the spatial dimension of the input feature map.

Optionally, the above-mentioned first spatial compression unit and/or the second spatial compression unit includes a pooling layer.

Optionally, the above-mentioned first spatial compression unit is a maximum pooling layer, and/or the second spatial compression unit is an average pooling layer.

Optionally, the channel feature extraction unit is a multilayer perception machine (Multilayer perception, MLP for short), for example, the MLP is an MLP including a single hidden layer.

On the basis of Figure 5E, in the above S403-A1, channel information is extracted from the i-1 first feature information through the channel attention module, and the channel attention information of the i-1 first feature information is obtained including S403- A11 to S403-A15:

S403-A11. Perform spatial dimension compression on the i-1 th first feature information by the first spatial compression unit, to obtain first spatial compression information of the i-1 th first feature information.

S403-A12. Perform spatial dimension compression on the i-1 th first feature information by the second spatial compression unit, to obtain second spatial compression information of the i-1 th first feature information.

S403-A13. Perform channel feature extraction on the first spatially compressed information of the i-1 first feature information by the channel feature extraction unit, to obtain the first channel information of the i-1 first feature information.

S403-A14. Perform channel feature extraction on the i-1 second spatially compressed information of the first feature information by the channel feature extraction unit to obtain the i-1 second channel information of the first feature information.

S403-A15. Determine the channel attention information of the i-1 first feature information according to the first channel information and the second channel information of the i-1 first feature information.

In some embodiments, as shown in FIG. 5E, the channel attention module further includes: a first addition unit and a first activation function. At this time, the above S403-A15 includes:

S403-A151. Add the first channel information and the second channel information of the i-1 pieces of first feature information by the first adding unit to obtain the fusion channel information of the i-1 pieces of first feature information.

S403-A152. Perform non-linear processing on the fused channel information of the i-1 pieces of first feature information by using the first activation function to obtain channel attention information of the i-1 th piece of first feature information.

The embodiment of the present application does not limit the specific form of the first activation function, which is specifically determined according to actual needs.

The above S403-A2 will be described below in conjunction with the network structure of the spatial attention module.

In some embodiments, as shown in FIG. 5F , the spatial attention module includes: a first channel compression unit, a second channel compression unit, and a spatial feature extraction unit. Both the first channel compression unit and the second channel compression unit are used to compress the channel dimension of the feature map, and the spatial feature extraction unit is used to perform feature extraction on the channel compressed feature map. That is, the spatial attention module shown in Figure 5F generates a spatial attention map by utilizing the spatial relationship between features. Spatial attention complements channel attention. To compute spatial attention, the channel dimensions of the input feature maps are compressed.

Optionally, the first channel compression unit and/or the second channel compression unit include a pooling layer.

Optionally, the first channel compression unit is a maximum pooling layer (MaxPool), and/or the second channel compression unit is an average pooling (AvgPool) layer.

Optionally, the aforementioned spatial feature extraction unit is a convolutional layer.

At this time, the above S403-A2 uses the spatial attention module to extract the spatial information of the fusion channel feature information of the i-1 first feature information to obtain the spatial attention information of the i-1 first feature information, including S403 -A21 to S403-A24:

S403-A21. Perform channel dimension compression on the fused channel feature information of the i-1 th first feature information by the first channel compression unit, to obtain the first channel compressed information of the i-1 th first feature information.

S403-A22. Perform channel dimension compression on the fused channel feature information of the i-1 th first feature information by the second channel compression unit to obtain second channel compressed information of the i-1 th first feature information.

S403-A23, performing spatial feature extraction on the first channel compressed information and the second channel compressed information of the i-1 first feature information through the spatial feature extraction unit, to obtain the spatial feature information of the i-1 first feature information .

S403-A24. Determine the spatial attention information of the i-1 th first feature information according to the spatial feature information of the i-1 th first feature information.

In some embodiments, as shown in FIG. 5F , the spatial attention module further includes a second activation function, and S403-A24 includes: performing non-linearity on the spatial feature information of the i-1th first feature information through the second activation function processing to obtain the spatial attention information of the i-1th first feature information.

The embodiment of the present application does not limit the specific form of the second activation function, for example, a sigmoid activation function.

In a specific example, for example, the spatial attention module utilizes average pooling (ie, the second channel compression unit) and maximum pooling (ie, the first channel compression unit) operations to generate corresponding feature vectors along the channel (channel) axis, and concatenate the two to generate efficient feature descriptors. On this basis, after a convolutional layer (ie, the spatial feature extraction unit) is reduced to a channel, a two-dimensional spatial attention feature map Ms is generated after a sigmoid activation function (ie, the second activation function).

Optionally, the spatial dimension of the channel attention information of the i-1th first feature information is 1×1.

Optionally, the feature dimension of the spatial attention information of the i-1th first feature information is 1.

The dynamic conversion model provided by the embodiment of the present application adds a convolutional attention module to each branch, and the convolutional attention module includes a channel attention module and a spatial attention module, respectively for channel features and spatial features Learning is carried out, thereby improving the learning of image detail features by the dynamic conversion model, so that the trained dynamic conversion model can reconstruct more detailed features in the image, thereby improving the quality of the HDR image generated by the dynamic conversion model.

In some embodiments, as shown in FIG. 5G , the dynamic conversion model further includes at least one downsampling unit, and the training method in the embodiment of the present application further includes: performing spatial dimension downsampling on the feature information output by the encoding module through the downsampling unit. That is, in order to reduce network complexity in the embodiment of the present application, at least one downsampling unit is set in the coding component to reduce the spatial dimension of the feature information output by the coding module.

The embodiment of the present application does not limit the number of down-sampling units included in the dynamic conversion model, which is specifically determined according to actual requirements.

In a possible implementation, a downsampling unit is set between two adjacent encoding modules, which is used to downsample the feature information output by the previous encoding unit in a spatial dimension, and then input it into the next encoding module, This not only reduces the amount of data processed by the encoding module and reduces the complexity of the model, but also enables each encoding module to learn features of different sizes to improve the prediction accuracy of the dynamic conversion model.

Optionally, the downsampling unit is a maximum pooling layer.

In some embodiments, as shown in FIG. 5G , the dynamic conversion model further includes at least one upsampling unit, and the training method in the embodiment of the present application further includes: performing spatial dimension upsampling on the feature information output by the decoding module through the upsampling unit.

As shown in Figure 5G, since at least one down-sampling unit is set in the encoding component, in order to ensure that the size of the decoded image is consistent with the size of the original image, at least one up-sampling unit is set in the decoding component for decoding The feature information output by the module is up-sampled in the spatial dimension.

Optionally, the upsampling unit is a bilinear interpolation unit.

In some embodiments, as shown in Figure 5G, the dynamic conversion model further includes a first convolutional layer, the first convolutional layer is located at the input end of the dynamic conversion model, and is used to process the image input to the dynamic conversion model to obtain the input The initial feature map of the image. For example, input the LDR training image into the dynamic conversion model, and extract the features of the LDR training image through the first convolutional layer in the dynamic conversion model to obtain the initial feature map of the LDR training image; input the initial feature map into the first encoding module respectively And in the first convolutional attention module, the first first feature information output by the first encoding module, and the first third feature information output by the first convolutional attention module are obtained. The aforementioned initial feature map can be understood as the aforementioned 0th first feature information.

In the embodiment of the present application, according to the above method, the LDR training image is input into the dynamic conversion model, and the second characteristic information of the LDR training image output by the last decoding module in the dynamic conversion model can be obtained, and then, the following S404 is performed.

S404. Determine the HDR image prediction value of the LDR training image according to the second characteristic information of the LDR training image output by the last decoding module among the N decoding modules.

In some embodiments, the channel of the second feature information of the LDR training image is converted into 3 channels (such as RGB channels) to obtain the predicted value of the HDR image of the LDR training image.

In some embodiments, as shown in FIG. 5G, the dynamic conversion model further includes a second convolutional layer, then the above S404 includes: performing the second feature information of the LDR training image output by the last decoding module through the second convolutional layer Feature extraction, output the HDR image prediction value of the LDR training image.

The second convolutional layer above also includes an activation function, and the feature dimension of the second convolutional layer is 3, that is, after passing through the second convolutional layer, a 3-channel (such as RGB) image can be output, and the 3-channel image can be used as an LDR HDR image predictors for training images.

Optionally, the size of the convolution kernel of the second convolution layer may be 1×1.

S405. Determine the target loss between the predicted value of the HDR image of the LDR training image and the true value of the HDR image of the LDR training image, and train the dynamic transformation model according to the loss.

After the HDR image prediction value of the LDR training image is obtained according to the steps of S404 above, the HDR image prediction value of the LDR training image is compared with the HDR image true value of the LDR training image to determine the HDR image prediction value of the LDR training image and the LDR training image The target loss between the true values of the HDR image, and adjust the parameters in the dynamic conversion model according to the target loss, to achieve a training of the dynamic conversion model. Next, use another LDR training image to train the dynamic transformation model by referring to the same steps as above, until the dynamic transformation model training is completed.

In some embodiments, the manner of determining the loss in S405 includes S405A: according to a preset loss function, determine a target loss between the predicted value of the HDR image of the LDR training image and the true value of the HDR image of the LDR training image.

Optionally, the aforementioned preset loss function includes at least one of a reconstruction loss function, a perceptual loss function, and a style loss function.

In a possible implementation, the above preset loss function includes a reconstruction loss function, a perceptual loss function, and a style loss function. At this time, S405A includes:

Determine the reconstruction loss between the predicted value of the HDR image and the true value of the HDR image;

Determine the perceptual loss between the predicted value of the HDR image and the true value of the HDR image;

Determine the style loss between the predicted value of the HDR image and the true value of the HDR image;

According to the reconstruction loss, perceptual loss and style loss between the predicted value of the HDR image and the ground truth value of the HDR image, the target loss between the predicted value of the HDR image and the ground truth value of the HDR image is determined.

Among them, the reconstruction loss determines that the predicted value of the HDR image is close to the true value of the HDR image on the pixel.

The perceptual loss evaluates how well the features of the predicted value of the HDR image match the features extracted from the ground truth of the HDR image, and allows the model to produce textures that are perceptually similar to the ground truth of the HDR image, i.e., the perceptual loss ensures the generation of textures with more texture details. Visually pleasing images.

The style loss captures both style and texture by comparing global statistics with Gram matrices collected over the entire image, ensuring both style consistency and color consistency of the predicted image.

In some embodiments, the weighted sum of reconstruction loss, perceptual loss and style loss can be used as the target loss.

For example, according to the following formula (1), determine the target loss between the predicted value of the HDR image and the true value of the HDR image:

Loss＝L ₁ +λ _s L _st +λ _p L _p (1)

Among them, Loss is the target loss, L1 is the reconstruction loss, Lst is the perceptual loss, Lp is the style loss, and λ _s and λ _p are hyperparameters. In the above formula (1), it can be understood that the weight of the reconstruction loss is 1, the weight of the perceptual loss is λ _s , and the weight of the style loss is λ _p .

It should be noted that the above formula (1) is just an example, and the method of determining the target loss in this application includes but is not limited to the above formula (1), such as adding, subtracting, multiplying or dividing in formula (1) A certain parameter, or the equivalent deformation of the above formula (1), etc., all belong to the protection scope of the present application.

In one example, according to the preset compressed tone mapping function, the compressed tone mapping value of the predicted value of the HDR image is determined; according to the compressed tone mapping function, the compressed tone mapping value of the true value of the HDR image is determined; according to the compression of the true value of the HDR image The error between the tonemapped value and the compressed tonemapped value of the HDR image prediction determines the reconstruction loss.

For example, the reconstruction loss is determined according to the following formula (2):

L ₁ =‖T(H)-T(GT)‖ ₁ (2)

Among them, L1 represents the reconstruction loss, T is the μ-law compressed tone mapping function, T(H) is the compressed tone mapping value of the predicted value of the HDR image, and T(GT) is the compressed tone mapping value of the true value of the HDR image,

x=H or GT, H is the predicted value of the HDR image output by the dynamic conversion model, GT is the true value of the HDR image of the LDR training image, "‖.‖ ₁ "indicates the L1 norm, and μ is the preset parameter.

It should be noted that the above formula (2) is just an example, and the method of determining the reconstruction loss in this application includes but is not limited to the above formula (2), such as adding, subtracting, multiplying or multiplying in formula (2) Except for a certain parameter, or the equivalent deformation of the above formula (2), etc., all belong to the protection scope of the present application.

In one example, the perceptual loss is determined in the following manner: obtain the feature map of the l-th layer of the pre-training model; determine the compressed tone-mapping value of the HDR image prediction value according to the preset compressed tone-mapping function; according to the compressed tone-mapping function , determine the compressed tone mapping value of the true value of the HDR image; determine the compressed tone mapping value of the predicted value of the HDR image, the first feature value corresponding to the feature map of the l layer; determine the compressed tone mapping value of the true value of the HDR image, in The second eigenvalue corresponding to the feature map of the l-th layer; determining the perceptual loss according to the error between the first eigenvalue and the second eigenvalue.

For example, the perceptual loss is determined according to the following formula (3):

Among them, Lp represents the perceptual loss, φ _l represents the feature map of the l-th layer of the pre-training model, such as the feature map of the l-th layer of TGG-16, the size of the feature map is C _l × H _l × W _l , φ _l (T(H)) is the first eigenvalue corresponding to the compressed tone mapping value of the HDR image prediction value in the feature map of the l layer, φ _l (T(GT)) is the compressed tone mapping value of the true value of the HDR image in The second eigenvalue corresponding to the feature map of layer l.

It should be noted that the above formula (3) is just an example, and the method of determining the perceptual loss in the present application includes but not limited to the above formula (3), such as adding, subtracting, multiplying or dividing in formula (3) A certain parameter, or the equivalent deformation of the above formula (3), etc., all belong to the protection scope of the present application.

In one example, the style loss is determined according to the following manner: obtain the Gram Gram matrix of the l-th layer feature map of the pre-training model; determine the compressed tone mapping value of the HDR image prediction value according to the preset compressed tone mapping function; Compressed tone mapping function, determine the compressed tone mapping value of the true value of the HDR image; determine the compressed tone mapping value of the predicted value of the HDR image, the corresponding first element value in the Gram Gram matrix; determine the compressed tone mapping value of the true value of the HDR image , the second element value corresponding to the feature map of the first layer; according to the error between the first element value and the second element value, determine the style loss.

For example, the style loss is determined according to the following formula (4):

Among them, Lp represents the perceptual loss function, G(.) is the Gram Gram matrix of the l-th layer feature map of the pre-trained model, and G(T(H)) is the compressed tone mapping value of the HDR image prediction value in the Gram Gram matrix The first element value corresponding to G(T(GT)) is the second element value corresponding to the compressed tone mapping value of the true value of the HDR image in the feature map of layer l,

x=H or GT, the size of K _l is C _l H _l W _l , which represents the normalization factor of the calculation, and the feature φ is a matrix of (H _l W _l )×C _l , therefore, the size of the Gram matrix is C _l ×C _l .

Optionally, use the pre-trained VGG-16 network, and calculate the output of the feature maps and real features of the first three pooling layers pool1, pool2 and pool3 of VGG-16 respectively, and according to the above formula (3) and formula ( 4) Compute the perceptual loss and style loss for these features separately.

The target loss in the embodiment of the present application includes reconstruction loss, perceptual loss and style loss, so as to reduce reconstruction distortion, artifacts and tone anomalies of high dynamic range images, and further improve the quality of HDR images generated by the model.

Further, the image processing capability of the dynamic transformation model proposed in the embodiment of the present application is verified through experiments below.

Collection of datasets: Deep learning models rely on large-scale datasets, since datasets with LDR-HDR image pairs cannot be used. This application collects from multiple HDR image datasets and HDR video data, and sets up a virtual camera to capture multiple random regions of the scene using randomly selected camera calibrations. Virtual camera calibration contains parameters for exposure, camera curve, white balance and noise level. The virtual camera parameters are randomly selected, and the camera curve parameters are randomly fitted into the camera curve database. This provides a set of LDR and corresponding HDR images, which are used as input and ground truth for training, respectively. A set of data augmentation operations are then applied to improve the robustness of the predictions. Treating each HDR image as a real scene, a region is selected as an image crop with random size and position, then randomly flipped and resampled to 256×256 pixels. The final trained network using these data augmentations generalizes well to a variety of images captured with different cameras. The obtained dataset is then divided into training set and test set. Specifically, two datasets, Fairchild HDR dataset and HDR EYE dataset, are collected from the HDR dataset for testing.

Experimental environment: The hardware experimental equipment of this application is AMD Ryzen 5 CPU, NVIDIA GTX 1080 Ti and 16G memory, and the framework is PyTorch.

To illustrate the performance of the method proposed in this application, the method is compared with five existing single-image HDR reconstruction techniques, including three conventional non-learning methods: Akyuz method, KOV method and Masia method. In addition, there are two methods based on deep learning technology: ExpandNet and HDRCNN. To evaluate the quality of reconstructed images obtained by various single-image HDR reconstruction methods, three objective evaluation methods PU-PSNR, PU-SSIM and HDR-VDP Q-score were used to evaluate the image quality.

The perceptually uniform coding proposed in this application converts luminance values into approximately perceptually uniform pixel values of an HDR image. Among the evaluation metrics, PU-PSNR measures the pixel-wise difference between the predicted image and the reference image. PU-SSIM measures the structural difference between predicted and reference images from the perspective of visual perception. HDR-VDP is a visual metric used to compare reference and test images and predict the quality of an HDR image relative to the reference image. The quality Q-score provided in HDR-VDP is used as the evaluation metric.

Among the objective indicators, the larger the Q value, PU-PSNR and PU-SSIM value, the closer the high dynamic range image reconstructed by the model is to the original image, and the higher the reconstruction quality is.

Table 1 shows a quantitative comparison of reconstructed HDR images using existing methods on the HDR EYE dataset and the Fairchild dataset. Among them, the bold indicates the method with the best experimental results, and the underline indicates the second best algorithm. Our method has the best results in the Fairchild dataset, good Q-score in the HDR EYE dataset, and outperforms other methods in terms of PSNR and SSIM metrics on both datasets.

Table 1

Among them, the Fairchild dataset was constructed by the team of Professor Mark D. Fairchild of Rochester Institute of Technology, and contains a series of HDR images and data of more than 100 pieces.

As can be seen from Table 1, other methods cannot recover the texture of the overexposed regions and lead to results of discoloration, blurring and tiling artifacts. Compared with the method of this application, conventional methods cannot remove noise or restore lost details in saturated regions. The model proposed in this application has good performance compared with existing methods, and the finally obtained HDR images have more natural colors and richer details, and can effectively suppress noise in low-exposure regions.

The embodiment of the present application provides a dynamic conversion model, the model includes N encoding modules connected in series and N decoding modules connected in series, the output of the last encoding module in the N encoding modules and the output of the first encoding module in the N decoding modules The input of a decoding module is connected, and the i-th encoding module is skipped and connected to the N-i+1-th decoding module, and the model is trained using the LDR training image. The training process is: input the LDR training image into the dynamic conversion model, The i-1th first feature information is extracted by the i-th encoding module to obtain the i-th first feature information of the LDR training image, and the i-1-th first feature information is obtained by the N-i+1-th decoding module The first feature information and the N-i second feature information of the LDR training image are extracted to obtain the N-i+1 second feature information of the LDR training image; according to the LDR training output of the last decoding module in the N decoding modules The second feature information of the image is to determine the HDR image prediction value of the LDR training image; determine the loss between the HDR image prediction value of the LDR training image and the HDR image true value of the LDR training image, and train the dynamic conversion model according to the loss. In subsequent use, the trained dynamic conversion model can be used to convert the LDR image into an HDR image, and then realize the conversion of the HDR image without increasing the cost of data acquisition, encoding, transmission, storage, etc., thereby improving the quality of the HDR image. conversion efficiency.

Combining with the network structure of the dynamic conversion model, the training process of the dynamic conversion model is introduced above, and the application process of the dynamic conversion model is introduced below.

In some embodiments, the dynamic conversion model provided by the embodiment of the present application can also be applied to the video codec framework, for example, it can be applied to the video decoding end to perform high dynamic conversion on the reconstructed image obtained by the decoding end to obtain the HDR of the reconstructed image image.

Fig. 6 is a schematic flowchart of an image decoding method provided by an embodiment of the present application. As shown in Fig. 6, the method includes:

S601. Decode the code stream to obtain a reconstructed image.

For example, as shown in FIG. 3 , the entropy decoding unit 310 can analyze the code stream to obtain prediction information of the current block, quantization coefficient matrix, etc., and the prediction unit 320 uses intra prediction or inter prediction for the current block based on the prediction information to generate a prediction block of the current block. The inverse quantization/transformation unit 330 uses the quantization coefficient matrix obtained from the code stream to perform inverse quantization and inverse transformation on the quantization coefficient matrix to obtain a residual block. The reconstruction unit 340 adds the predicted block and the residual block to obtain a reconstructed block. The reconstructed blocks form a reconstructed image, and the loop filtering unit 350 performs loop filtering on the reconstructed image based on the image or based on the block to obtain the reconstructed image.

In this embodiment, the dynamic transformation model is combined with the video coding framework.

In one example, in order to facilitate encoding, the input 10-bit HDR data is converted into 8-bit LDR data through a tone mapping module (TM) at the encoding end, and then divided into CTUs and sent to the encoder for encoding. After motion estimation, Motion compensation, intra-frame prediction, inter-frame prediction, transformation, quantization, filtering, and entropy coding form a code stream. The dynamic conversion model described in the above embodiment is added at the output end of the decoder. The dynamic range of the decoded LDR reconstruction image is extended. Using this model, the quality of the obtained HDR data can be significantly improved, and the decoded image quality can be further improved under the premise of ensuring the bit rate.

S602. Input the reconstructed image into a dynamic conversion model to perform dynamic conversion to obtain a high dynamic range HDR image of the reconstructed image.

Shown in Fig. 5 A with reference to, dynamic transformation model comprises: N encoding modules connected in series and N decoding modules connected in series, the output of the last encoding module in N encoding modules is decoded with the first decoding module in N decoding modules The input connection of the module, and the i-th encoding module is skipped and connected to the N-i+1-th decoding module, and the i-th encoding module is used for the i-1th first feature information output by the i-1-th encoding module Perform feature extraction to obtain the i-th first feature information of the reconstructed image, and the N-i+1-th decoding module is used to perform feature extraction on the i-1-th first feature information and the N-i-th second feature information of the reconstructed image Extracting to obtain the N-i+1th second feature information of the reconstructed image, where i is a positive integer less than or equal to N, and N is a positive integer.

Wherein, the HDR image of the reconstructed image is determined according to the second characteristic information output by the last decoding module among the N decoding modules.

If i is equal to 1, the i-1th first feature information is determined according to the reconstructed image, for example, the 0th first feature information is the reconstructed image, or is a feature map after processing the reconstructed image.

In a possible implementation manner, the network structure included in the encoding module and/or the decoding module is as shown in FIG. 5B , including a convolutional layer 1, a convolutional layer 2, a convolutional layer 3 and an activation function.

Optionally, the convolution kernel of convolution layer 1 and convolution layer 2 is 3×3, the convolution kernel of convolution layer 3 is 1×1, and the activation function is Sigmoid Weighted Linear Unit (Sigmoid Weighted Liner Unit, referred to as SiLU ).

In some embodiments, as shown in FIG. 5C , the dynamic conversion model further includes: a convolutional attention module (CBAM) located in the skip connection between the i-th encoding module and the N-i+1-th decoding module. The attention mechanism of this convolutional attention module enables the dynamic transformation model to focus more attention on the relevant parts of the encoding side features and less attention on other irrelevant parts, that is, by The convolutional attention mechanism is used to improve the representation ability of the dynamic conversion model, focusing on important features and suppressing unnecessary features, thus greatly improving the efficiency of the model.

Among them, the convolutional attention module located in the skip connection between the i-th encoding module and the N-i+1-th decoding module is used to extract the spatial information and channel information of the i-1-th first feature information, and obtain the reconstruction The i-1th third feature information of the image.

At this time, the N-i+1th decoding module is used to perform feature extraction on the i-1th third feature information and the N-ith second feature information to obtain the N-i+1th second feature of the reconstructed image information. For example, the N-i+1 decoding module is used to perform feature extraction on the concatenated feature information of the i-1 first feature information and the N-i second feature information of the reconstructed image, and obtain the N-th feature information of the reconstructed image. i+1 pieces of second feature information.

In some embodiments, as shown in Figure 5D, the convolutional attention module includes a channel attention module and a spatial attention module.

Wherein, the channel attention module is used to extract the channel information of the i-1 first feature information, and obtain the channel attention information of the i-1 first feature information.

The spatial attention module is used to extract the spatial information of the i-1 first feature information and the channel attention information of the i-1 first feature information, and obtain the spatial attention of the i-1 first feature information information.

The i-1th third feature information of the reconstructed image is determined according to the channel attention information and the spatial attention information of the i-1th first feature information.

As shown in Figure 5E, the convolutional attention module also includes a first multiplication unit; the first multiplication unit is used for channel attention information of the i-1 first feature information and the i-1 first feature information Perform multiplication to obtain the fusion channel feature information of the i-1 first feature information. At this time, the spatial attention module is used to extract the spatial information of the fusion channel feature information of the i-1 first feature information, and obtain The spatial attention information of the i-1th first feature information.

Continuing to refer to Figure 5D, the convolutional attention module also includes a second multiplication unit; the second multiplication unit is used to multiply the fusion channel feature information and spatial attention information of the i-1 first feature information, Obtain the i-1th third feature information of the reconstructed image.

In some embodiments, as shown in FIG. 5E , the channel attention module includes: a first spatial compression unit, a second spatial compression unit, and a channel feature extraction unit.

Wherein, the first spatial compression unit is used to compress the spatial dimension of the i-1 first feature information to obtain the first spatial compression information of the i-1 first feature information;

The second spatial compression unit is used to perform spatial dimension compression on the i-1 first feature information to obtain second spatial compression information of the i-1 first feature information;

The channel feature extraction unit is used to perform channel feature extraction on the first spatial compression information of the i-1 first feature information, to obtain the first channel information of the i-1 first feature information, and to obtain the first channel information of the i-1 first feature information. Channel feature extraction is performed on the second space compressed information of the feature information to obtain i-1 second channel information of the first feature information.

The channel attention information of the i-1 first feature information is determined according to the first channel information and the second channel information of the i-1 first feature information.

Optionally, the first spatial compression unit and/or the second spatial compression unit includes a pooling layer.

Optionally, the first spatial compression unit is a maximum pooling layer, and/or the second spatial compression unit is an average pooling layer.

Optionally, the channel feature extraction unit is a multi-layer perceptron MLP.

Continuing to refer to Figure 5E, the channel attention module also includes: a first addition unit and a first activation function;

Wherein, the first addition unit is configured to add the first channel information and the second channel information of the i-1 first feature information to obtain the fusion channel information of the i-1 first feature information;

The first activation function is used to perform non-linear processing on the fusion channel information of the i-1 first feature information to obtain the channel attention information of the i-1 first feature information.

In some embodiments, as shown in FIG. 5F , the spatial attention module includes: a first channel compression unit, a second channel compression unit, and a spatial feature extraction unit;

Wherein, the first channel compression unit is configured to perform channel dimension compression on the fusion channel feature information of the i-1 first feature information, to obtain the first channel compression information of the i-1 first feature information;

The second channel compression unit is used to perform channel dimension compression on the fusion channel feature information of the i-1 first feature information, to obtain the second channel compression information of the i-1 first feature information;

The spatial feature extraction unit is used to perform spatial feature extraction on the first channel compressed information and the second channel compressed information of the i-1 first feature information, to obtain the spatial feature information of the i-1 first feature information;

The spatial attention information of the i-1 first feature information is determined according to the spatial feature information of the i-1 first feature information.

Optionally, the first channel compression unit and/or the second channel compression unit includes a pooling layer.

Optionally, the first channel compression unit is a maximum pooling layer, and/or the second channel compression unit is an average pooling layer.

Optionally, the spatial feature extraction unit is a convolutional layer.

Continuing to refer to Figure 5F, the spatial attention module also includes a second activation function;

Wherein, the second activation function is used to perform non-linear processing on the spatial feature information of the i-1 first feature information to obtain the spatial attention information of the i-1 first feature information.

Optionally, the spatial dimension of the channel attention information of the i-1 th first feature information is 1×1.

The dynamic conversion model provided by the embodiment of the present application adds a convolutional attention module to each branch, and the convolutional attention module includes a channel attention module and a spatial attention module, respectively for channel features and spatial features Learning is carried out, thereby improving the learning of image detail features by the dynamic conversion model, so that the dynamic conversion model can reconstruct more detailed features in the image, thereby improving the quality of the HDR image generated by the dynamic conversion model.

In some embodiments, as shown in FIG. 5G , the dynamic conversion model further includes at least one downsampling unit; the downsampling unit is used for downsampling the feature information output by the encoding module in a spatial dimension.

Optionally, the downsampling unit is a maximum pooling layer.

In some embodiments, as shown in FIG. 5G , the dynamic conversion model further includes at least one upsampling unit; the upsampling unit is used to perform spatial dimension upsampling on the feature information output by the decoding module.

Optionally, the upsampling unit is a bilinear interpolation unit.

Continuing to refer to Figure 5G, the dynamic conversion model also includes a first convolutional layer; the first convolutional layer is used to extract features from the reconstructed image, obtain the initial feature map of the reconstructed image, and input the initial feature map into the first in the first encoding module and the first convolutional attention module.

Continuing to refer to Figure 5G, the dynamic conversion model also includes a second convolutional layer; the second convolutional layer is used for feature extraction of the second feature information of the reconstructed image output by the last decoding module, and outputs the HDR image of the reconstructed image .

In a specific embodiment of the present application, as shown in FIG. 7, the dynamic conversion model includes a first convolutional layer, 4 encoding modules connected in series, 3 down-sampling units, 4 decoding modules connected in series, 3 Upsampling units, 4 CBAMs on the skip connections of the encoding module and decoding module, and the second convolutional layer. Exemplarily, the convolution kernel of the first convolution layer is 3×3, and the number of channels is 32, where the number of channels can also be understood as a feature dimension, and the convolution kernel of the second convolution layer is 1×1, and the number of channels is 3, and the second convolutional layer includes an activation function. The first encoding module includes a convolutional block with 64 channels, the second encoding module includes a convolutional block with 128 channels, the third encoding module includes a convolutional block with 256 channels, and the fourth encoding module includes A convolutional block with 512 channels. A first down-sampling unit is set between the first coding module and the second coding module, a second down-sampling unit is set between the second coding module and the third coding module, and a second down-sampling unit is set between the third coding module and the fourth coding module A third down-sampling unit is set, the first down-sampling unit, the second down-sampling unit and the third down-sampling unit are all maximum pooling layers with a convolution kernel of 2×2 and a step size of S2. The first decoding module includes a convolutional block with 256 channels, the second decoding module includes a convolutional block with 128 channels, the third decoding module includes a convolutional block with 64 channels, and the fourth decoding module includes A convolutional block with 32 channels. A first upsampling unit is set between the fourth coding block and the first decoding module, a second upsampling unit is set between the first decoding module and the second decoding module, and a second upsampling unit is set between the second decoding module and the third decoding module The third upsampling unit is set, the first upsampling unit, the second upsampling unit and the third sampling unit are all bilinear interpolation units, and the upsampling multiple is 2×2. In addition, each upsampling unit also Including a convolutional layer, for example, the first upsampling unit is Bilinear Upsample 2×2, Conv 3×3 256, the second upsampling unit is Bilinear Upsample 2×2, Conv 3×3 128, and the third upsampling unit is Bilinear Upsample 2×2, Conv 3×3 64.

Suppose the size of the reconstructed image is H×W×3, where H×W represents the length and width dimensions of the reconstructed image, and 3 represents the number of RGB3 channels of the reconstructed image. Input the above reconstructed image into the dynamic conversion model shown in Figure 7, and after the first convolutional layer processing, output the initial feature map of the reconstructed image, the size of the initial feature map is H×W×32. The initial feature map output by the first convolutional layer is respectively input into the first encoding module and the first CBAM, and the convolution block in the first convolution module performs convolution processing on the initial feature map to obtain the first first image of the reconstructed image. Feature information, the first first feature information is respectively input into the second CBAM and the first down-sampling unit, and the size of the first first feature information is H×W×64. The first down-sampling unit down-samples the first first feature information to H/2×W/2×64, and inputs the sampled first first feature information into the second coding module. The convolution block in the second encoding module performs convolution processing on the sampled first first feature information to obtain the second first feature information of the reconstructed image, and input the second first feature information respectively For the third CBAM and the second down-sampling unit, the size of the second first feature information is H/2×W/2×128. The second down-sampling unit down-samples the second first feature information to H/4×W/4×128, and inputs the sampled second first feature information into the third coding module. The convolution block in the third encoding module performs convolution processing on the sampled second first feature information to obtain the third first feature information of the reconstructed image, and input the third first feature information respectively The fourth CBAM and the third down-sampling unit, the size of the third first feature information is H/4×W/4×256. The third down-sampling unit down-samples the third first feature information to H/8×W/8×256, and inputs the sampled third first feature information into the fourth coding module. The convolution block in the fourth encoding module performs convolution processing on the sampled third first feature information to obtain the fourth first feature information of the reconstructed image, and input the fourth first feature information into the first An upsampling unit, the size of the fourth first feature information is H/8×W/8×512.

The first upsampling unit upsamples the fourth first feature information to H/4×W/4×256. The fourth CBAM performs feature extraction on the third first feature information, and outputs the first third feature information of the reconstructed image. The first third feature information is concatenated with the upsampled fourth first feature information and input to the first decoding module. The first decoding module performs feature extraction on the concatenated first third feature information and the upsampled fourth first feature information to obtain the first second feature information of the reconstructed image, and converts the first The second feature information is input into the second upsampling unit. The second upsampling unit upsamples the first second feature information to H/2×W/2×128. The third CBAM performs feature extraction on the second first feature information, and outputs the second and third feature information of the reconstructed image. The second third feature information is concatenated with the upsampled first second feature information and then input to the second decoding module. The second decoding module performs feature extraction on the concatenated second third feature information and the up-sampled first second feature information to obtain the second second feature information of the reconstructed image, and converts the second The second feature information is input into the third upsampling unit. The third upsampling unit upsamples the second second feature information to H×W×64. The second CBAM performs feature extraction on the first first feature information, and outputs the third third feature information of the reconstructed image. The third third feature information is concatenated with the upsampled second second feature information and input to the third decoding module. The third decoding module performs feature extraction on the concatenated third third feature information and the up-sampled second second feature information to obtain the third second feature information of the reconstructed image. The first CBAM performs feature extraction on the initial feature map of the reconstructed image, and outputs the fourth and third feature information of the reconstructed image. The fourth third feature information is concatenated with the third second feature information and input to the fourth decoding module. The fourth decoding module performs feature extraction on the concatenated fourth third feature information and third second feature information to obtain the fourth second feature information of the reconstructed image, and converts the fourth second feature information Input the second convolutional layer, the size of the fourth second feature information is H×W×32. The second convolutional layer processes the fourth second feature information, and outputs the HDR image of the reconstructed image, and the size of the HDR image is H×W×3.

In the embodiment of the present application, the above-mentioned dynamic conversion model is used to convert the reconstructed image with a low dynamic range into an image with a high dynamic range, and the whole conversion process is simple and low in cost.

In some embodiments, the initial parameters of the dynamic conversion model during training are pre-training parameters obtained during pre-training of the pre-training model.

In some embodiments, the loss function of the dynamic transformation model includes at least one of a reconstruction loss function, a perceptual loss function, and a style loss function.

In one example, the loss function of the dynamic conversion model is as shown in the following formula:

Loss＝L ₁ +λ _s L _st +λ _p L _p

Among them, Loss is the loss function of the dynamic conversion model, L1 is the reconstruction loss function, Lst is the perceptual loss function, Lp is the style loss function, and λ _s and λ _p are hyperparameters.

In one example, the reconstruction loss function of the dynamic transformation model is determined based on the error between the compressed tone-mapped values of the true value of the HDR image and the compressed tone-mapped value of the predicted value of the HDR image, where the compressed tone-mapped value of the predicted value of the HDR image The mapping value is determined according to the preset compressed tone mapping function and the predicted value of the HDR image, and the compressed tone mapping value of the true value of the HDR image is determined according to the compressed tone mapping function and the true value of the HDR image.

For example, the reconstruction loss function of the dynamic transformation model is determined based on the following formula:

L ₁ =‖T(H)-T(GT) _‖1

Among them, L1 represents the reconstruction loss function,

x=H or GT, H is the predicted value output by the dynamic conversion model when training the dynamic conversion model, GT is the real value of the training image, "‖.‖ ₁ "indicates the L1 norm, and μ is the preset parameter.

In one example, the perceptual loss function of the dynamic transformation model is determined based on the error between the first feature value and the second feature value, wherein the first feature value is the compressed tone map value of the predicted value of the HDR image in the pre-training The corresponding eigenvalue in the feature map of layer l of the model, the second eigenvalue is the compressed tone mapping value of the true value of the HDR image The corresponding eigenvalue in the feature map of layer l, the compressed tone mapping value of the predicted value of the HDR image It is determined according to the preset compressed tone mapping function and the predicted value of the HDR image, and the compressed tone mapping value of the true value of the HDR image is determined according to the compressed tone mapping function and the true value of the HDR image.

For example, the perceptual loss function of the dynamic transition model is determined based on the following formula:

Among them, Lp represents the perceptual loss function, φ _l represents the feature map of the l-th layer of the pre-training model, and the size is C _l × H _l × W _l .

In one example, the style loss function of the dynamic transformation model is determined based on the error between the first element value and the second element value, wherein the first element value is the compressed tone map value of the HDR image prediction value in the pre-training The corresponding element value in the Gram matrix of the layer l feature map of the model, the second element value is the compressed tone mapping value of the true value of the HDR image corresponding to the element value in the Gram matrix, and the compressed tone mapping value of the predicted value of the HDR image It is determined according to the preset compressed tone mapping function and the predicted value of the HDR image, and the compressed tone mapping value of the true value of the HDR image is determined according to the compressed tone mapping function and the true value of the HDR image.

For example, the style loss function of the dynamic transformation model is determined based on the following formula:

Among them, Lp represents the perceptual loss function, G(.) is the Gram matrix of the l-th layer feature of the pre-trained model,

φ _l represents the feature map of layer l of the pre-training model, the size of which is C _l ×H _l ×W _l , and the size of K _l is C _l H _l W _l .

In the embodiment of the present application, the reconstruction image with a low dynamic range is converted into an image with a high dynamic range by using the above dynamic conversion model, and the whole conversion process is simple and low in cost. In addition, by setting the reconstruction loss, perceptual loss and style loss to reduce high dynamic range image reconstruction distortion, artifacts and abnormal tone, the decoded image quality is further improved under the premise of ensuring the bit rate.

The application of the dynamic conversion model to the codec system has been introduced above, and the above dynamic conversion model can also be applied to other scenarios where an image with a low dynamic range is converted to a high dynamic range.

Fig. 8 is a schematic flow chart of an image processing method provided by an embodiment of the present application. As shown in Fig. 8, the method includes:

S801. Acquire the LDR image to be processed;

S802. Input the LDR image into the dynamic conversion model to perform dynamic conversion to obtain an HDR image of the LDR image.

Wherein, as shown in Figure 5A, the dynamic conversion model includes: N encoding modules connected in series and N decoding modules connected in series, the output of the last encoding module in the N encoding modules and the first in the N decoding modules The input of the decoding module is connected, and the i-th encoding module is skipped to the N-i+1 decoding module, and the i-th encoding module is used for the i-1th first output of the i-1 encoding module The feature information is extracted to obtain the i-th first feature information of the LDR image, and the N-i+1 decoding module is used for the i-1-th first feature information and the N-i-th second feature information of the LDR image Perform feature extraction to obtain the N-i+1th second feature information of the LDR image, the HDR image of the LDR image is determined according to the second feature information output by the last decoding module in the N decoding modules, and i is less than or equal to A positive integer of N, where N is a positive integer.

For the network structure of the dynamic conversion model, refer to the above-mentioned FIGS. 5A to 5G , and specifically refer to the description of the above-mentioned embodiments, and details are not repeated here.

It should be understood that Fig. 4 to Fig. 8 are only examples of the present application, and should not be construed as limiting the present application.

The preferred embodiments of the present application have been described in detail above in conjunction with the accompanying drawings. However, the present application is not limited to the specific details in the above embodiments. Within the scope of the technical concept of the present application, various simple modifications can be made to the technical solutions of the present application. These simple modifications all belong to the protection scope of the present application. For example, the various specific technical features described in the above specific implementation manners can be combined in any suitable manner if there is no contradiction. Separately. As another example, any combination of various implementations of the present application can also be made, as long as they do not violate the idea of the present application, they should also be regarded as the content disclosed in the present application.

It should also be understood that in the various method embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the order of execution, and the order of execution of the processes should be determined by their functions and internal logic, and should not be used in this application. The implementation of the examples constitutes no limitation. In addition, in the embodiment of the present application, the term "and/or" is only an association relationship describing associated objects, indicating that there may be three relationships. Specifically, A and/or B may mean: A exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in this article generally indicates that the contextual objects are an "or" relationship.

The network structure of the action conversion model and the image processing method are introduced above with reference to FIG. 4 to FIG. 8 , and the device embodiment of the present application is described in detail below in conjunction with FIG. 9 to FIG. 12 .

FIG. 9 is a schematic block diagram of an image decoding device provided by an embodiment of the present application. The image decoding device may be the decoder shown in FIG. 3 , or a component in the decoder, such as a processor in the decoder.

As shown in Figure 9, the image decoding device 10 may include:

Decoding unit 11, configured to decode the code stream to obtain a reconstructed image;

A processing unit 12, configured to input the reconstructed image into a dynamic conversion model for dynamic conversion to obtain a high dynamic range HDR image of the reconstructed image;

Wherein, the dynamic conversion model includes: N encoding modules connected in series and N decoding modules connected in series, the output of the last encoding module in the N encoding modules is the same as that of the first encoding module in the N decoding modules The input of the decoding module is connected, and the i-th encoding module is skip-connected to the N-i+1 decoding module, and the i-th encoding module is used for the i-1th output of the i-1 encoding module Feature extraction is performed on the first feature information to obtain the i-th first feature information of the reconstructed image, and the N-i+1th decoding module is used to extract the i-1-th first feature information and the Perform feature extraction on the N-ith second feature information of the reconstructed image to obtain the N-i+1th second feature information of the reconstructed image, and the HDR image of the reconstructed image is based on the last one of the N decoding modules Determined by the second feature information output by the decoding module, the i is a positive integer less than or equal to N, and the N is a positive integer.

In one embodiment, the dynamic conversion model further includes: a convolutional attention module located in the skip connection between the i-th encoding module and the N-i+1-th decoding module;

The convolutional attention module is used to extract spatial information and channel information from the i-1 first feature information to obtain the i-1 third feature information of the reconstructed image;

The N-i+1th decoding module is used to perform feature extraction on the i-1th third feature information and the N-ith second feature information to obtain the N-i+th feature information of the reconstructed image. 1 piece of second characteristic information.

In one embodiment, the convolutional attention module includes a channel attention module and a spatial attention module;

The channel attention module is used to extract the channel information of the i-1 first feature information, and obtain the channel attention information of the i-1 first feature information;

The spatial attention module is used to extract spatial information from the i-1 first feature information and channel attention information of the i-1 first feature information, to obtain the i-1 first feature information Spatial attention information of the first feature information;

The i-1 th third feature information of the reconstructed image is determined according to the channel attention information and the spatial attention information of the i-1 th first feature information.

In one embodiment, the convolutional attention module further includes a first multiplication unit;

The first multiplication unit is configured to multiply the i-1 first feature information and the channel attention information of the i-1 first feature information to obtain the i-1 first feature Information fusion channel feature information;

The spatial attention module is configured to extract spatial information from the fused channel feature information of the i-1 first feature information to obtain the spatial attention information of the i-1 first feature information.

In one embodiment, the convolutional attention module further includes a second multiplication unit;

The second multiplication unit is used to multiply the fusion channel feature information and the spatial attention information of the i-1 first feature information to obtain the i-1 third feature information of the reconstructed image.

In one embodiment, the channel attention module includes: a first spatial compression unit, a second spatial compression unit, and a channel feature extraction unit;

The first spatial compression unit is configured to perform spatial dimension compression on the i-1 first feature information to obtain first spatial compression information of the i-1 first feature information;

The second spatial compression unit is configured to perform spatial dimension compression on the i-1 first feature information to obtain second spatial compression information of the i-1 first feature information;

The channel feature extraction unit is configured to perform channel feature extraction on the first spatially compressed information of the i-1 first feature information, obtain the first channel information of the i-1 first feature information, and perform the channel feature extraction on the i-1 first feature information. performing channel feature extraction on the second spatial compression information of the i-1 first feature information, and obtaining the second channel information of the i-1 first feature information;

In an embodiment, the first spatial compression unit and/or the second spatial compression unit includes a pooling layer.

In an embodiment, the first spatial compression unit is a maximum pooling layer, and/or the second spatial compression unit is an average pooling layer.

In one embodiment, the channel feature extraction unit is a multi-layer perceptron MLP.

In one embodiment, the channel attention module further includes: a first addition unit and a first activation function;

The first adding unit is configured to add the first channel information and the second channel information of the i-1 pieces of first feature information to obtain the fusion channel information of the i-1 pieces of first feature information;

The first activation function is used to perform nonlinear processing on the fused channel information of the i-1 pieces of first feature information to obtain channel attention information of the i-1 th piece of first feature information.

In one embodiment, the spatial attention module includes: a first channel compression unit, a second channel compression unit, and a spatial feature extraction unit;

The first channel compression unit is configured to perform channel dimension compression on the fused channel feature information of the i-1 first feature information, to obtain the first channel compression information of the i-1 first feature information;

The second channel compression unit is configured to perform channel dimension compression on the fused channel feature information of the i-1 first feature information to obtain second channel compression information of the i-1 first feature information;

The spatial feature extraction unit is configured to perform spatial feature extraction on the first channel compressed information and the second channel compressed information of the i-1 first feature information, to obtain the i-1 first feature information Spatial feature information;

The spatial attention information of the i-1 th first feature information is determined according to the spatial feature information of the i-1 th first feature information.

In an embodiment, the first channel compression unit and/or the second channel compression unit includes a pooling layer.

In one embodiment, the first channel compression unit is a maximum pooling layer, and/or the second channel compression unit is an average pooling layer.

In one embodiment, the spatial feature extraction unit is a convolutional layer.

In one embodiment, the spatial attention module further includes a second activation function;

The second activation function is used to perform nonlinear processing on the spatial feature information of the i-1 th first feature information to obtain the spatial attention information of the i-1 th first feature information.

In an embodiment, the spatial dimension of the channel attention information of the i-1 th first feature information is 1×1.

In an embodiment, the feature dimension of the spatial attention information of the i-1 th first feature information is 1.

In one embodiment, the dynamic conversion model further includes at least one downsampling unit;

The down-sampling unit is used for down-sampling the feature information output by the encoding module in a spatial dimension.

In one embodiment, the downsampling unit is a max pooling layer.

In one embodiment, the dynamic conversion model further includes at least one upsampling unit;

The up-sampling unit is used for up-sampling the feature information output by the decoding module in a spatial dimension.

In an embodiment, the upsampling unit is a bilinear interpolation unit.

In one embodiment, each of the N coding modules includes at least one convolutional block, wherein the parameters of the convolutional blocks included in each of the N coding modules are not completely the same.

In an embodiment, each of the N decoding modules includes at least one convolutional block, and parameters of the convolutional blocks included in each of the N decoding modules are not completely the same.

In one embodiment, if the i is equal to N, the N-i second feature information is determined according to the N first feature information output by the N encoding module; or,

If the i is less than N, the N-i-th second feature information is determined according to the N-i-th second feature information output by the N-i-th decoding module; or,

If the i is equal to 1, the i-1th first feature information is determined according to the reconstructed image; or,

If the i is greater than 1, the i-1 first feature information is determined according to the first feature information output by the i-1 encoding module.

In one embodiment, the N-i+1th decoding module is configured to perform concatenated feature information on the i-1th third feature information and the N-ith second feature information feature extraction, to obtain the N-i+1th second feature information of the reconstructed image.

In one embodiment, the dynamic conversion model further includes a first convolutional layer;

The first convolutional layer is used to perform feature extraction on the reconstructed image to obtain an initial feature map of the reconstructed image, and input the initial feature map to the first coding module and the first convolutional attention module respectively middle.

In one embodiment, the dynamic conversion model further includes a second convolutional layer;

The second convolutional layer is used to perform feature extraction on the second feature information of the reconstructed image output by the last decoding module, and output an HDR image of the reconstructed image.

In an embodiment, the initial parameters of the dynamic conversion model during training are pre-training parameters obtained during pre-training of the pre-training model.

In one embodiment, the loss function of the dynamic conversion model includes at least one of a reconstruction loss function, a perceptual loss function and a style loss function.

In one embodiment, the loss function of the dynamic conversion model is as shown in the following formula:

Loss＝L ₁ +λ _s L _st +λ _p L _p

Wherein, Loss is the loss function of the dynamic conversion model, the L1 is the reconstruction loss function, the Lst is the perceptual loss function, the Lp is the style loss function, and the λ _s and λ _p is a hyperparameter.

In one embodiment, the reconstruction loss function of the dynamic transformation model is determined according to the error between the compressed tone mapping value of the true value of the HDR image and the compressed tone mapping value of the predicted value of the HDR image, wherein the HDR image The compressed tone mapping value of the predicted value is determined according to the preset compressed tone mapping function and the predicted value of the HDR image, and the compressed tone mapping value of the real value of the HDR image is determined according to the compressed tone mapping function and the HDR image The truth value is determined.

L ₁ =‖T(H)-T(GT) _‖1

Among them, L1 represents the reconstruction loss function,

x=H or GT, the H is the predicted value output by the dynamic conversion model when the dynamic conversion model is trained, the GT is the real value of the training image, "‖.‖ ₁ "represents the L1 norm, μ is a preset parameter.

In one embodiment, the perceptual loss function of the dynamic conversion model is determined based on an error between a first eigenvalue and a second eigenvalue, wherein the first eigenvalue is the compressed tone of the predicted value of the HDR image A feature value corresponding to the mapping value in the feature map of the first layer of the pre-training model, and the second feature value is a feature corresponding to the compressed tone mapping value of the true value of the HDR image in the feature map of the first layer value, the compressed tone mapping value of the predicted value of the HDR image is determined according to the preset compressed tone mapping function and the predicted value of the HDR image, and the compressed tone mapping value of the real value of the HDR image is determined according to the compressed tone mapping function and the ground truth value of the HDR image is determined.

where Lp represents the perceptual loss function,

x=H or GT, the H is the predicted value output by the dynamic conversion model when the dynamic conversion model is trained, the GT is the real value of the training image, "‖.‖ ₁ "represents the L1 norm, μ is a preset parameter, φ _l represents the feature map of layer l of the pre-training model, and its size is C _l ×H _l ×W _l .

In one embodiment, the style loss function of the dynamic conversion model is determined based on an error between a first element value and a second element value, wherein the first element value is a compressed tone of an HDR image prediction value The element value corresponding to the mapping value in the Gram Gram matrix of the l-th layer feature map of the pre-training model, and the second element value is the corresponding element in the Gram matrix of the compressed tone mapping value of the true value of the HDR image value, the compressed tone mapping value of the predicted value of the HDR image is determined according to the preset compressed tone mapping function and the predicted value of the HDR image, and the compressed tone mapping value of the real value of the HDR image is determined according to the compressed tone mapping function and the ground truth value of the HDR image is determined.

where Lp represents the perceptual loss function,

G(.) is the Gram matrix of the l-th layer feature of the pre-training model,

x=H or GT, the H is the predicted value output by the dynamic conversion model when the H is training the dynamic conversion model, the GT is the HDR true value of the training image, "‖.‖ ₁ "represents the L1 norm, μ is a preset parameter, φ _l represents the feature map of layer l of the pre-training model, and its size is C _l ×H _l ×W _l , and the size of K _l is C _l H _l W _l .

It should be understood that the device embodiment and the method embodiment may correspond to each other, and similar descriptions may refer to the method embodiment. To avoid repetition, details are not repeated here. Specifically, the device 10 shown in FIG. 9 may correspond to the corresponding subject in the image decoding method of the embodiment of the present application, and the aforementioned and other operations and/or functions of each unit in the device 10 are respectively in order to realize the image decoding method For the sake of brevity, the corresponding process will not be repeated here.

Fig. 10 is a schematic block diagram of an image processing device provided by an embodiment of the present application.

As shown in Figure 10, the image processing device 20 may include:

An acquisition unit 21, configured to acquire a low dynamic range LDR image to be processed;

A processing unit 22, configured to input the LDR image into a dynamic conversion model for dynamic conversion to obtain a high dynamic range HDR image of the LDR image;

Wherein, the dynamic conversion model includes: N encoding modules connected in series and N decoding modules connected in series, the output of the last encoding module in the N encoding modules is the same as that of the first encoding module in the N decoding modules The input of the decoding module is connected, and the i-th encoding module is skip-connected to the N-i+1 decoding module, and the i-th encoding module is used for the i-1th output of the i-1 encoding module Feature extraction is performed on the first feature information to obtain the ith first feature information of the LDR image, and the N-i+1 decoding module is used to extract the i-1 first feature information and the Feature extraction is performed on the N-i second feature information of the LDR image to obtain the N-i+1 second feature information of the LDR image, and the HDR image of the LDR image is based on the last one of the N decoding modules Determined by the second feature information output by the decoding module, the i is a positive integer less than or equal to N, and the N is a positive integer.

In some embodiments, the dynamic conversion model further includes: a convolutional attention module located in the skip connection between the i-th encoding module and the N-i+1-th decoding module;

The convolutional attention module is used to extract spatial information and channel information from the i-1 first feature information to obtain the i-1 third feature information of the LDR image;

The N-i+1th decoding module is used to perform feature extraction on the i-1th third feature information and the N-ith second feature information to obtain the N-i+th feature information of the LDR image. 1 piece of second characteristic information.

In some embodiments, the convolutional attention module includes a channel attention module and a spatial attention module;

The i-1 th third feature information of the LDR image is determined according to the channel attention information and the spatial attention information of the i-1 th first feature information.

In some embodiments, the convolutional attention module further includes a first multiplication unit;

In some embodiments, the convolutional attention module further includes a second multiplication unit;

The second multiplication unit is configured to multiply the fusion channel feature information and the spatial attention information of the i-1 first feature information to obtain the i-1 third feature information of the LDR image.

In some embodiments, the channel attention module includes: a first spatial compression unit, a second spatial compression unit, and a channel feature extraction unit;

In some embodiments, the first spatial compression unit and/or the second spatial compression unit comprises a pooling layer.

In some embodiments, the first spatial compression unit is a max pooling layer, and/or the second spatial compression unit is an average pooling layer.

In some embodiments, the channel feature extraction unit is a multi-layer perceptron MLP.

In some embodiments, the channel attention module further includes: a first addition unit and a first activation function;

In some embodiments, the spatial attention module includes: a first channel compression unit, a second channel compression unit, and a spatial feature extraction unit;

In some embodiments, the first channel compression unit and/or the second channel compression unit comprises a pooling layer.

In some embodiments, the first channel compression unit is a max pooling layer, and/or the second channel compression unit is an average pooling layer.

In some embodiments, the spatial feature extraction unit is a convolutional layer.

In some embodiments, the spatial attention module further includes a second activation function;

In some embodiments, the spatial dimension of the channel attention information of the i-1 th first feature information is 1×1.

In some embodiments, the feature dimension of the spatial attention information of the i-1 th first feature information is 1.

In some embodiments, the dynamic conversion model further includes at least one downsampling unit;

In some embodiments, the downsampling unit is a max pooling layer.

In some embodiments, the dynamic conversion model further includes at least one upsampling unit;

In some embodiments, the upsampling unit is a bilinear interpolation unit.

In some embodiments, each of the N encoding modules includes at least one convolutional block, and parameters of the convolutional blocks included in each of the N encoding modules are not completely the same.

In some embodiments, each of the N decoding modules includes at least one convolutional block, wherein parameters of the convolutional blocks included in each of the N decoding modules are not completely the same.

In some embodiments, if the i is equal to N, the N-i th second feature information is determined according to the N th first feature information output by the N th encoding module; or,

If the i is equal to 1, the i-1th first feature information is determined according to the LDR image; or,

In some embodiments, the N-i+1th decoding module is used to perform feature extraction on the concatenated feature information of the i-1th third feature information and the N-ith second feature information , to obtain the N-i+1th second feature information of the LDR image.

In some embodiments, the dynamic transformation model further includes a first convolutional layer;

The first convolutional layer is used to extract features from the LDR image, obtain an initial feature map of the LDR image, and input the initial feature map to the first coding module and the first convolutional attention module respectively middle.

In some embodiments, the dynamic transformation model further includes a second convolutional layer;

The second convolutional layer is used to perform feature extraction on the second feature information of the LDR image output by the last decoding module, and output an HDR image of the LDR image.

In some embodiments, the initial parameters of the dynamic transformation model during training are pre-training parameters obtained during pre-training of the pre-training model.

In some embodiments, the loss function of the dynamic transformation model includes at least one of a reconstruction loss function, a perceptual loss function and a style loss function.

In some embodiments, the loss function of the dynamic conversion model is as shown in the following formula:

Loss＝L ₁ +λ _s L _st +λ _p L _p

In some embodiments, the reconstruction loss function of the dynamic transformation model is determined from the error between the compressed tone-mapped values of the true value of the HDR image and the compressed tone-mapped value of the predicted value of the HDR image, wherein the predicted HDR image The compressed tone-mapping value of the value is determined according to the preset compressed tone-mapping function and the predicted value of the HDR image, and the compressed tone-mapping value of the real value of the HDR image is determined according to the compressed tone-mapping function and the real value of the HDR image. The value is determined.

L ₁ =‖T(H)-T(GT) _‖1

Among them, L1 represents the reconstruction loss function,

In some embodiments, the perceptual loss function of the dynamic transformation model is determined based on an error between a first eigenvalue and a second eigenvalue, wherein the first eigenvalue is a compressed tone map of an HDR image prediction value The value corresponds to the feature value in the feature map of the first layer of the pre-training model, and the second feature value is the corresponding feature value of the compressed tone mapping value of the true value of the HDR image in the feature map of the first layer , the compressed tone mapping value of the predicted value of the HDR image is determined according to a preset compressed tone mapping function and the predicted value of the HDR image, and the compressed tone mapping value of the real value of the HDR image is determined according to the compressed tone mapping function and the true value of the HDR image is determined.

where Lp represents the perceptual loss function,

In some embodiments, the style loss function of the dynamic transformation model is determined based on an error between a first element value and a second element value, wherein the first element value is a compressed tone map of an HDR image prediction value The value corresponds to the element value in the Gram Gram matrix of the l-th layer feature map of the pre-training model, and the second element value is the corresponding element value in the Gram matrix of the compressed tone mapping value of the true value of the HDR image , the compressed tone mapping value of the predicted value of the HDR image is determined according to a preset compressed tone mapping function and the predicted value of the HDR image, and the compressed tone mapping value of the real value of the HDR image is determined according to the compressed tone mapping function and the true value of the HDR image is determined.

where Lp represents the perceptual loss function,

G(.) is the Gram matrix of the l-th layer feature of the pre-training model,

It should be understood that the device embodiment and the method embodiment may correspond to each other, and similar descriptions may refer to the method embodiment. To avoid repetition, details are not repeated here. Specifically, the device 20 shown in FIG. 10 may correspond to the corresponding subject in the image processing method of the embodiment of the present application, and the aforementioned and other operations and/or functions of each unit in the device 20 are respectively in order to realize the image processing method For the sake of brevity, the corresponding process will not be repeated here.

Fig. 11 is a schematic block diagram of a model training device provided by an embodiment of the present application.

As shown in Figure 11, model training device 40 comprises:

An acquisition unit 41, configured to acquire a low dynamic range LDR training image and a true value of a high dynamic range HDR image of the LDR training image;

The processing unit 42 is configured to input the LDR training image into the dynamic conversion model, and extract the i-1 first feature information through the i-th encoding module to obtain the i-th first feature of the LDR training image information, wherein the dynamic conversion model includes N encoding modules connected in series and the N decoding modules connected in series, and the output of the last encoding module in the N encoding modules is the same as that of the N decoding modules The input connection of the first decoding module of , and the i-th encoding module is skipped and connected to the N-i+1 decoding module, the i is a positive integer less than or equal to N, and the N is a positive integer; through the The N-i+1 decoding module performs feature extraction on the i-1 first feature information and the N-i second feature information of the LDR training image to obtain the N-i of the LDR training image. i+1 second feature information; according to the second feature information of the LDR training image output by the last decoding module in the N decoding modules, determine the HDR image prediction value of the LDR training image; determine the LDR The loss between the HDR image prediction value of the training image and the HDR image true value of the LDR training image, and the dynamic conversion model is trained according to the loss.

In one embodiment, the dynamic conversion model further includes: a convolutional attention module located in the skip connection between the i-th encoding module and the N-i+1-th decoding module, the above-mentioned processing unit 42 , specifically for performing spatial information and channel information extraction on the i-1th first feature information through the convolution attention module to obtain the i-1th third feature information of the LDR training image; by The N-i+1th decoding module performs feature extraction on the i-1th third feature information and the N-ith second feature information to obtain the N-i+1th of the LDR training image a second characteristic information.

In some embodiments, the convolutional attention module includes a channel attention module and a spatial attention module, and the above-mentioned processing unit 42 is specifically configured to perform the i-1th first feature through the channel attention module Extract channel information from the information to obtain the channel attention information of the i-1 first feature information; perform spatial information extraction on the fusion channel feature information of the i-1 first feature information through the spatial attention module , to obtain the spatial attention information of the i-1 first feature information, the fusion channel feature information of the i-1 first feature information is based on the i-1 first feature information and the Determined by the channel attention information of the i-1 first feature information; according to the channel attention information and spatial attention information of the i-1 first feature information, determine the i-1th of the LDR training image A third characteristic information.

In some embodiments, the convolutional attention module further includes a first multiplication unit, the above-mentioned processing unit 42 is also used to perform the i-1th first feature information and the i-th by the first multiplication unit Multiply the channel attention information of the first feature information to obtain the fusion channel feature information of the i-1 first feature information.

In some embodiments, the convolutional attention module further includes a second multiplication unit, the above-mentioned processing unit 42, which is specifically used to perform the fusion channel of the i-1th first feature information through the second multiplication unit The feature information is multiplied by the spatial attention information to obtain the i-1th third feature information of the LDR training image.

In some embodiments, the channel attention module includes: a first spatial compression unit, a second spatial compression unit, and a channel feature extraction unit, and the above processing unit 42 is specifically configured to use the first spatial compression unit to analyze the Perform spatial dimension compression on the i-1th first feature information to obtain first spatial compression information of the i-1th first feature information; use the second spatial compression unit to compress the i-1th first feature information A feature information is subjected to spatial dimension compression to obtain the second spatial compression information of the i-1 first feature information; the first spatial dimension of the i-1 first feature information is obtained by the channel feature extraction unit performing channel feature extraction on the compressed information to obtain the first channel information of the i-1 first feature information; performing the second spatial compression information on the i-1 first feature information through the channel feature extraction unit Channel feature extraction, obtaining the second channel information of the i-1 first feature information; determining the i-1th channel information according to the first channel information and second channel information of the i-1 first feature information Channel attention information of the first feature information.

In some embodiments, the channel attention module further includes: a first addition unit and a first activation function, and the above-mentioned processing unit 42 is specifically configured to perform the i-1 first features through the first addition unit adding the first channel information and the second channel information of the information to obtain the fusion channel information of the i-1 pieces of first feature information; The channel information is fused to perform nonlinear processing to obtain the channel attention information of the i-1 th first feature information.

In some embodiments, the spatial attention module includes: a first channel compression unit, a second channel compression unit, and a spatial feature extraction unit, and the above-mentioned processing unit 42 is specifically configured to use the first channel compression unit to Perform channel dimension compression on the fused channel feature information of the i-1 first feature information to obtain first channel compression information of the i-1 first feature information; use the second channel compression unit to compress the first channel The fusion channel feature information of the i-1 first feature information is compressed in the channel dimension to obtain the second channel compression information of the i-1 first feature information; the i-th first feature information is extracted by the spatial feature extraction unit. The first channel compressed information and the second channel compressed information of one first feature information are subjected to spatial feature extraction to obtain the spatial feature information of the i-1 first feature information; according to the i-1 first The spatial feature information of the feature information is to determine the spatial attention information of the i-1 first feature information.

In some embodiments, the spatial attention module further includes a second activation function, and the above-mentioned processing unit 42 is specifically configured to perform the spatial feature information of the i-1th first feature information through the second activation function Perform nonlinear processing to obtain the spatial attention information of the i-1th first feature information.

In some embodiments, the dynamic conversion model further includes at least one down-sampling unit, the above-mentioned processing unit 42 is further configured to down-sample the feature information output by the encoding module through the down-sampling unit in a spatial dimension.

Optionally, the downsampling unit is a maximum pooling layer.

In some embodiments, the dynamic conversion model further includes at least one upsampling unit, the above-mentioned processing unit 42 is further configured to perform spatial dimension upsampling on the feature information output by the decoding module through the upsampling unit.

Optionally, the upsampling unit is a bilinear interpolation unit.

Optionally, each of the N encoding modules includes at least one convolutional block, and parameters of the convolutional blocks included in each of the N encoding modules are not completely the same.

Optionally, each of the N decoding modules includes at least one convolutional block, and parameters of the convolutional blocks included in each of the N decoding modules are not completely the same.

In some embodiments, if the i is equal to N, the N-i th second feature information is determined according to the N th first feature information output by the N th encoding module; or, if the i is less than N, the N-i-th second feature information is determined according to the N-i-th second feature information output by the N-i-th decoding module; or, if the i is equal to 1, then the i-1-th A feature information is determined according to the LDR training image; or, if the i is greater than 1, the i-1th first feature information is determined according to the first feature information output by the i-1th coding module of.

In some embodiments, the above processing unit 42 is specifically configured to concatenate the i-1th third feature information and the N-ith second feature information; input the concatenated feature information into the The N-i+1th decoding module performs feature extraction to obtain the N-i+1th second feature information of the LDR training image.

In some embodiments, the dynamic conversion model further includes a first convolutional layer, and the above-mentioned processing unit 42 is also configured to perform feature extraction on the LDR training image through the first convolutional layer to obtain the LDR training image. The initial feature map of the image; the initial feature map is input into the first coding module and the first convolution attention module respectively, and the first first feature information output by the first coding module is obtained, and the obtained The first third feature information output by the first convolutional attention module.

In some embodiments, the dynamic conversion model further includes a second convolutional layer, and the above-mentioned processing unit 42 is specifically used to process the LDR training image output by the last decoding module through the second convolutional layer The second feature information performs feature extraction, and outputs the HDR image prediction value of the LDR training image.

In some embodiments, the processing unit 42 is further configured to obtain pre-training parameters obtained during pre-training of the pre-training model; and determine the pre-training parameters as initial parameters of the dynamic transformation model.

In some embodiments, the above processing unit 42 is specifically configured to determine the target loss between the predicted value of the HDR image of the LDR training image and the true value of the HDR image of the LDR training image according to a preset loss function.

In some embodiments, the preset loss function includes at least one of a reconstruction loss function, a perceptual loss function and a style loss function.

In some embodiments, the above processing unit 42 is specifically configured to determine the reconstruction loss between the predicted value of the HDR image and the true value of the HDR image; determine the difference between the predicted value of the HDR image and the true value of the HDR image Perceptual loss between; determine the style loss between the predicted value of the HDR image and the true value of the HDR image; according to the reconstruction loss, perceptual loss and style between the predicted value of the HDR image and the true value of the HDR image Loss, determining the target loss between the predicted value of the HDR image and the true value of the HDR image.

In some embodiments, the above-mentioned processing unit 42 is specifically configured to determine the target loss between the predicted value of the HDR image and the true value of the HDR image according to the following formula:

Loss＝L ₁ +λ _s L _st +λ _p L _p

Wherein, Loss is the target loss, the L1 is the reconstruction loss, the Lst is the perceptual loss, the Lp is the style loss, and the λ _s and λ _p are hyperparameters.

In some embodiments, the above-mentioned processing unit 42 is specifically configured to determine the compressed tone-mapping value of the predicted value of the HDR image according to a preset compressed tone-mapping function; The compressed tone-mapped value of the value; the reconstruction loss is determined according to the error between the compressed tone-mapped value of the true value of the HDR image and the compressed tone-mapped value of the predicted value of the HDR image.

For example, the reconstruction loss is determined according to the following formula:

L ₁ =‖T(H)-T(GT) _‖1

where L1 represents the reconstruction loss,

x=H or GT, the H is the predicted value of the HDR image output by the dynamic conversion model, the GT is the true value of the HDR image, "‖.‖ ₁ "indicates the L1 norm, μ is preset parameter.

In some embodiments, the above-mentioned processing unit 42 is specifically configured to obtain the feature map of the first layer of the pre-training model; determine the compressed tone-mapping value of the HDR image prediction value according to a preset compressed tone-mapping function; According to the compressed tone mapping function, determine the compressed tone mapping value of the real value of the HDR image; determine the compressed tone mapping value of the predicted value of the HDR image, and the corresponding first feature value in the feature map of the first layer ; Determining the compressed tone mapping value of the true value of the HDR image, the second eigenvalue corresponding to the feature map of the first layer; according to the error between the first eigenvalue and the second eigenvalue, Determine the perceptual loss.

For example, the perceptual loss is determined according to the following formula:

where Lp represents the perceptual loss,

x=H or GT, the H is the predicted value of the HDR image output by the dynamic conversion model, the GT is the true value of the HDR image, "‖.‖ ₁ "indicates the L1 norm, μ is preset parameter, φ _l represents the feature map of layer l of the pre-training model, and its size is C _l ×H _l ×W _l .

In some embodiments, the above-mentioned processing unit 42 is specifically configured to obtain the Gram Gram matrix of the l-th layer feature map of the pre-training model; determine the compression of the predicted value of the HDR image according to a preset compressed tone mapping function Tone mapping value; according to the compressed tone mapping function, determine the compressed tone mapping value of the true value of the HDR image; determine the compressed tone mapping value of the predicted value of the HDR image, and the corresponding first element value in the Gram matrix ; Determine the compressed tone mapping value of the true value of the HDR image, the corresponding second element value in the Gram matrix; determine the style according to the error between the first element value and the second element value loss.

For example, the style loss is determined according to the following formula:

where Lp represents the perceptual loss function,

G(.) is the Gram matrix of the l-th layer feature of the pre-training model,

x=H or GT, the H is the predicted value of the HDR image output by the dynamic conversion model, the GT is the true value of the HDR image, "‖.‖ ₁ "indicates the L1 norm, μ is preset Parameters, φ _l represents the feature map of layer l of the pre-training model, with a size of C _l ×H _l ×W _l , and the size of K _l is C _l H _l W _l .

It should be understood that the device embodiment and the method embodiment may correspond to each other, and similar descriptions may refer to the method embodiment. To avoid repetition, details are not repeated here. Specifically, the device 40 shown in FIG. 11 may correspond to the corresponding subject in the model training method of the embodiment of the present application, and the aforementioned and other operations and/or functions of each unit in the device 40 are for realizing the model training method, etc. For the sake of brevity, the corresponding processes in each method are not repeated here.

The device and system of the embodiments of the present application are described above from the perspective of functional units with reference to the accompanying drawings. It should be understood that the functional unit may be implemented in the form of hardware, may also be implemented by instructions in the form of software, and may also be implemented by a combination of hardware and software units. Specifically, each step of the method embodiment in the embodiment of the present application can be completed by an integrated logic circuit of the hardware in the processor and/or instructions in the form of software, and the steps of the method disclosed in the embodiment of the present application can be directly embodied as hardware The decoding processor is executed, or the combination of hardware and software units in the decoding processor is used to complete the execution. Optionally, the software unit may be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, and registers. The storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps in the above method embodiments in combination with its hardware.

As shown in Figure 12, the electronic device 30 may be the image processing device described in the embodiment of the present application, or a decoder, or a model training device, and the electronic device 30 may include:

A memory 33 and a processor 32 , the memory 33 is used to store a computer program 34 and transmit the program code 34 to the processor 32 . In other words, the processor 32 can call and run the computer program 34 from the memory 33 to implement the method in the embodiment of the present application.

For example, the processor 32 can be used to execute the steps in the above-mentioned method 200 according to the instructions in the computer program 34 .

In some embodiments of the present application, the processor 32 may include, but is not limited to:

General-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field Programmable Gate Array, FPGA) or other programmable logic devices, discrete gates Or transistor logic devices, discrete hardware components, and so on.

In some embodiments of the present application, the memory 33 includes but is not limited to:

volatile memory and/or non-volatile memory. Among them, the non-volatile memory can be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electronically programmable Erase Programmable Read-Only Memory (Electrically EPROM, EEPROM) or Flash. The volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (Static RAM, SRAM), Dynamic Random Access Memory (Dynamic RAM, DRAM), Synchronous Dynamic Random Access Memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchronous connection dynamic random access memory (synch link DRAM, SLDRAM) and Direct Memory Bus Random Access Memory (Direct Rambus RAM, DR RAM).

In some embodiments of the present application, the computer program 34 can be divided into one or more units, and the one or more units are stored in the memory 33 and executed by the processor 32 to complete the present application. Methods. The one or more units may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 34 in the electronic device 30 .

As shown in Figure 12, the electronic device 30 may also include:

A transceiver 33 , the transceiver 33 can be connected to the processor 32 or the memory 33 .

Wherein, the processor 32 can control the transceiver 33 to communicate with other devices, specifically, can send information or data to other devices, or receive information or data sent by other devices. Transceiver 33 may include a transmitter and a receiver. The transceiver 33 may further include antennas, and the number of antennas may be one or more.

It should be understood that the various components in the electronic device 30 are connected through a bus system, wherein the bus system includes not only a data bus, but also a power bus, a control bus and a status signal bus.

The present application also provides a computer storage medium, on which a computer program is stored, and when the computer program is executed by a computer, the computer can execute the methods of the above method embodiments. In other words, the embodiments of the present application further provide a computer program product including instructions, and when the instructions are executed by a computer, the computer executes the methods of the foregoing method embodiments.

When implemented using software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part. The computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transferred from a website, computer, server, or data center by wire (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.) to another website site, computer, server or data center. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media. The available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a tape), an optical medium (such as a digital video disc (DVD)), or a semiconductor medium (such as a solid state disk (SSD)), etc. .

Those skilled in the art can appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components can be combined or can be Integrate into another system, or some features may be ignored, or not implemented. In another point, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.

A unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit, that is, it may be located in one place, or may also be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.

The above content is only the specific implementation of the application, but the scope of protection of the application is not limited thereto. Anyone familiar with the technical field can easily think of changes or substitutions within the technical scope disclosed in the application, and should covered within the scope of protection of this application. Therefore, the protection scope of the present application should be based on the protection scope of the claims.

Claims

An image decoding method, characterized in that, comprising:

Decode the code stream to obtain the reconstructed image;

Inputting the reconstructed image into a dynamic conversion model for dynamic conversion to obtain a high dynamic range HDR image of the reconstructed image;

Wherein, the dynamic conversion model includes: N encoding modules connected in series and N decoding modules connected in series, the output of the last encoding module in the N encoding modules is the same as that of the first encoding module in the N decoding modules The input of the decoding module is connected, and the i-th encoding module is skip-connected to the N-i+1 decoding module, and the i-th encoding module is used for the i-1th output of the i-1 encoding module Feature extraction is performed on the first feature information to obtain the i-th first feature information of the reconstructed image, and the N-i+1th decoding module is used to extract the i-1-th first feature information and the Perform feature extraction on the N-ith second feature information of the reconstructed image to obtain the N-i+1th second feature information of the reconstructed image, and the HDR image of the reconstructed image is based on the last one of the N decoding modules Determined by the second feature information output by the decoding module, the i is a positive integer less than or equal to N, and the N is a positive integer.
The method according to claim 1, wherein the dynamic conversion model further comprises: convolutional attention located in the skip connection between the i-th encoding module and the N-i+1-th decoding module module;

The convolutional attention module is used to extract spatial information and channel information from the i-1 first feature information to obtain the i-1 third feature information of the reconstructed image;

The N-i+1th decoding module is used to perform feature extraction on the i-1th third feature information and the N-ith second feature information to obtain the N-i+th feature information of the reconstructed image. 1 piece of second characteristic information.
The method according to claim 2, wherein the convolution attention module includes a channel attention module and a spatial attention module;

The channel attention module is used to extract the channel information of the i-1 first feature information, and obtain the channel attention information of the i-1 first feature information;

The spatial attention module is used to extract spatial information from the i-1 first feature information and channel attention information of the i-1 first feature information, to obtain the i-1 first feature information Spatial attention information of the first feature information;

The i-1 th third feature information of the reconstructed image is determined according to the channel attention information and the spatial attention information of the i-1 th first feature information.
The method according to claim 3, wherein the convolution attention module also includes a first multiplication unit;

The first multiplication unit is configured to multiply the i-1 first feature information and the channel attention information of the i-1 first feature information to obtain the i-1 first feature Information fusion channel feature information;

The spatial attention module is configured to extract spatial information from the fused channel feature information of the i-1 first feature information to obtain the spatial attention information of the i-1 first feature information.
The method according to claim 4, wherein the convolution attention module also includes a second multiplication unit;

The second multiplication unit is configured to multiply the fusion channel feature information and the spatial attention information of the i-1 first feature information to obtain the i-1 third feature information of the reconstructed image.
The method according to claim 3, wherein the channel attention module comprises: a first space compression unit, a second space compression unit and a channel feature extraction unit;

The first spatial compression unit is configured to perform spatial dimension compression on the i-1 first feature information to obtain first spatial compression information of the i-1 first feature information;

The second spatial compression unit is configured to perform spatial dimension compression on the i-1 first feature information to obtain second spatial compression information of the i-1 first feature information;

The channel feature extraction unit is configured to perform channel feature extraction on the first spatially compressed information of the i-1 first feature information, obtain the first channel information of the i-1 first feature information, and perform the channel feature extraction on the i-1 first feature information. performing channel feature extraction on the second spatial compression information of the i-1 first feature information, and obtaining the second channel information of the i-1 first feature information;

The channel attention information of the i-1 first feature information is determined according to the first channel information and the second channel information of the i-1 first feature information.
The method according to claim 6, wherein the first spatial compression unit and/or the second spatial compression unit comprises a pooling layer.
The method according to claim 6, wherein the first spatial compression unit is a maximum pooling layer, and/or the second spatial compression unit is an average pooling layer.
The method according to claim 6, wherein the channel feature extraction unit is a multi-layer perceptron (MLP).
The method according to claim 6, wherein the channel attention module further comprises: a first addition unit and a first activation function;

The first adding unit is configured to add the first channel information and the second channel information of the i-1 pieces of first feature information to obtain the fusion channel information of the i-1 pieces of first feature information;

The first activation function is used to perform nonlinear processing on the fused channel information of the i-1 pieces of first feature information to obtain channel attention information of the i-1 th piece of first feature information.
The method according to claim 3, wherein the spatial attention module comprises: a first channel compression unit, a second channel compression unit and a spatial feature extraction unit;

The first channel compression unit is configured to perform channel dimension compression on the fused channel feature information of the i-1 first feature information, to obtain the first channel compression information of the i-1 first feature information;

The second channel compression unit is configured to perform channel dimension compression on the fused channel feature information of the i-1 first feature information to obtain second channel compression information of the i-1 first feature information;

The spatial feature extraction unit is configured to perform spatial feature extraction on the first channel compressed information and the second channel compressed information of the i-1 first feature information, to obtain the i-1 first feature information Spatial feature information;

The spatial attention information of the i-1 th first feature information is determined according to the spatial feature information of the i-1 th first feature information.
The method according to claim 11, wherein the first channel compression unit and/or the second channel compression unit comprises a pooling layer.
The method according to claim 11, wherein the first channel compression unit is a maximum pooling layer, and/or the second channel compression unit is an average pooling layer.
The method according to claim 11, wherein the spatial feature extraction unit is a convolutional layer.
The method according to claim 11, wherein the spatial attention module further comprises a second activation function;

The second activation function is used to perform nonlinear processing on the spatial feature information of the i-1 th first feature information to obtain the spatial attention information of the i-1 th first feature information.
The method according to any one of claims 3-15, wherein the spatial dimension of the channel attention information of the i-1 th first feature information is 1×1.
The method according to any one of claims 3-15, wherein the feature dimension of the spatial attention information of the i-1 th first feature information is 1.
The method according to claim 2, wherein the dynamic conversion model further comprises at least one downsampling unit;

The down-sampling unit is used for down-sampling the feature information output by the encoding module in a spatial dimension.
The method according to claim 18, wherein the downsampling unit is a maximum pooling layer.
The method according to claim 18, wherein the dynamic conversion model further comprises at least one upsampling unit;

The up-sampling unit is used for up-sampling the feature information output by the decoding module in a spatial dimension.
The method according to claim 20, wherein the upsampling unit is a bilinear interpolation unit.
The method according to claim 1, wherein each of the N encoding modules includes at least one convolutional block, wherein the convolutional block included in each of the N encoding modules The parameters are not exactly the same.
The method according to claim 1, wherein each of the N decoding modules includes at least one convolutional block, wherein the convolutional block included in each of the N decoding modules The parameters are not exactly the same.
The method according to claim 1, characterized in that,

If the i is equal to N, the N-i th second feature information is determined according to the N th first feature information output by the N th encoding module; or,

If the i is less than N, the N-i-th second feature information is determined according to the N-i-th second feature information output by the N-i-th decoding module; or,

If the i is equal to 1, the i-1th first feature information is determined according to the reconstructed image; or,

If the i is greater than 1, the i-1 first feature information is determined according to the first feature information output by the i-1 encoding module.
The method according to claim 2, wherein the N-i+1th decoding module is used to concatenate the i-1th third characteristic information and the N-ith second characteristic information Feature extraction is performed on the final feature information to obtain the N-i+1th second feature information of the reconstructed image.
The method according to claim 2, wherein the dynamic conversion model further comprises a first convolutional layer;

The first convolutional layer is used to perform feature extraction on the reconstructed image to obtain an initial feature map of the reconstructed image, and input the initial feature map to the first coding module and the first convolutional attention module respectively middle.
The method according to claim 2, wherein the dynamic transformation model further comprises a second convolutional layer;

The second convolutional layer is used to perform feature extraction on the second feature information of the reconstructed image output by the last decoding module, and output an HDR image of the reconstructed image.
The method according to claim 2, wherein the initial parameters of the dynamic conversion model during training are pre-training parameters obtained during pre-training of the pre-training model.
The method according to claim 28, wherein the loss function of the dynamic conversion model includes at least one of a reconstruction loss function, a perceptual loss function and a style loss function.
The method according to claim 29, wherein the loss function of the dynamic conversion model is as shown in the following formula:

Loss＝L 1 +λ s L st +λ p L p

Wherein, Loss is the loss function of the dynamic conversion model, the L1 is the reconstruction loss function, the Lst is the perceptual loss function, the Lp is the style loss function, and the λ s and λ p is a hyperparameter.
The method according to claim 30, wherein the reconstruction loss function of the dynamic transformation model is determined based on the error between the compressed tone mapping value of the true value of the HDR image and the compressed tone mapping value of the predicted value of the HDR image , wherein the compressed tone mapping value of the predicted value of the HDR image is determined according to a preset compressed tone mapping function and the predicted value of the HDR image, and the compressed tone mapping value of the real value of the HDR image is determined according to the compressed tone mapping function and the ground truth value of the HDR image is determined.
The method according to claim 30, wherein the perceptual loss function of the dynamic conversion model is determined based on an error between a first eigenvalue and a second eigenvalue, wherein the first eigenvalue is HDR The compressed tone mapping value of the image prediction value corresponds to the feature value in the feature map of the first layer of the pre-training model, and the second feature value is the compressed tone mapping value of the true value of the HDR image in the first layer. The corresponding feature value in the feature map, the compressed tone mapping value of the predicted value of the HDR image is determined according to the preset compressed tone mapping function and the predicted value of the HDR image, and the compressed tone mapped value of the true value of the HDR image is Determined according to the compressed tone mapping function and the true value of the HDR image.
The method according to claim 30, wherein the style loss function of the dynamic conversion model is determined based on an error between a first element value and a second element value, wherein the first element value is HDR The compressed tone mapping value of the image prediction value corresponds to the element value in the Gram Gram matrix of the l-th layer feature map of the pre-training model, and the second element value is the compressed tone mapping value of the true value of the HDR image in the The corresponding element value in the Gram matrix, the compressed tone mapping value of the predicted value of the HDR image is determined according to the preset compressed tone mapping function and the predicted value of the HDR image, and the compressed tone mapped value of the true value of the HDR image is Determined according to the compressed tone mapping function and the true value of the HDR image.
An image processing method, characterized in that, comprising:

Obtain the low dynamic range LDR image to be processed;

The LDR image is input into a dynamic conversion model for dynamic conversion to obtain a high dynamic range HDR image of the LDR image;

Wherein, the dynamic conversion model includes: N encoding modules connected in series and N decoding modules connected in series, the output of the last encoding module in the N encoding modules is the same as that of the first encoding module in the N decoding modules The input of the decoding module is connected, and the i-th encoding module is skip-connected to the N-i+1 decoding module, and the i-th encoding module is used for the i-1th output of the i-1 encoding module Feature extraction is performed on the first feature information to obtain the ith first feature information of the LDR image, and the N-i+1 decoding module is used to extract the i-1 first feature information and the Feature extraction is performed on the N-i second feature information of the LDR image to obtain the N-i+1 second feature information of the LDR image, and the HDR image of the LDR image is based on the last one of the N decoding modules Determined by the second feature information output by the decoding module, the i is a positive integer less than or equal to N, and the N is a positive integer.
The method according to claim 34, wherein the dynamic conversion model further comprises: convolutional attention located in the skip connection between the i-th encoding module and the N-i+1-th decoding module module;

The convolutional attention module is used to extract spatial information and channel information from the i-1 first feature information to obtain the i-1 third feature information of the LDR image;

The N-i+1th decoding module is used to perform feature extraction on the i-1th third feature information and the N-ith second feature information to obtain the N-i+th feature information of the LDR image. 1 piece of second characteristic information.
The method according to claim 35, wherein the convolution attention module includes a channel attention module and a spatial attention module;

The channel attention module is used to extract the channel information of the i-1 first feature information, and obtain the channel attention information of the i-1 first feature information;

The spatial attention module is used to extract spatial information from the i-1 first feature information and channel attention information of the i-1 first feature information, to obtain the i-1 first feature information Spatial attention information of the first feature information;

The i-1 th third feature information of the LDR image is determined according to the channel attention information and the spatial attention information of the i-1 th first feature information.
The method according to claim 36, wherein the convolutional attention module further comprises a first multiplication unit;

The first multiplication unit is configured to multiply the i-1 first feature information and the channel attention information of the i-1 first feature information to obtain the i-1 first feature Information fusion channel feature information;

The spatial attention module is configured to extract spatial information from the fused channel feature information of the i-1 first feature information to obtain the spatial attention information of the i-1 first feature information.
The method according to claim 37, wherein the convolution attention module further comprises a second multiplication unit;

The second multiplication unit is configured to multiply the fusion channel feature information and the spatial attention information of the i-1 first feature information to obtain the i-1 third feature information of the LDR image.
The method according to claim 36, wherein the channel attention module comprises: a first space compression unit, a second space compression unit and a channel feature extraction unit;

The first spatial compression unit is configured to perform spatial dimension compression on the i-1 first feature information to obtain first spatial compression information of the i-1 first feature information;

The second spatial compression unit is configured to perform spatial dimension compression on the i-1 first feature information to obtain second spatial compression information of the i-1 first feature information;

The channel feature extraction unit is configured to perform channel feature extraction on the first spatially compressed information of the i-1 first feature information, obtain the first channel information of the i-1 first feature information, and perform the channel feature extraction on the i-1 first feature information. performing channel feature extraction on the second spatial compression information of the i-1 first feature information, and obtaining the second channel information of the i-1 first feature information;

The channel attention information of the i-1 first feature information is determined according to the first channel information and the second channel information of the i-1 first feature information.
The method according to claim 39, wherein the first spatial compression unit and/or the second spatial compression unit comprises a pooling layer.
The method according to claim 39, wherein the first spatial compression unit is a maximum pooling layer, and/or the second spatial compression unit is an average pooling layer.
The method according to claim 39, wherein the channel feature extraction unit is a multi-layer perceptron (MLP).
The method according to claim 39, wherein the channel attention module further comprises: a first addition unit and a first activation function;

The first adding unit is configured to add the first channel information and the second channel information of the i-1 pieces of first feature information to obtain the fusion channel information of the i-1 pieces of first feature information;

The first activation function is used to perform nonlinear processing on the fused channel information of the i-1 pieces of first feature information to obtain channel attention information of the i-1 th piece of first feature information.
The method according to claim 36, wherein the spatial attention module comprises: a first channel compression unit, a second channel compression unit and a spatial feature extraction unit;

The first channel compression unit is configured to perform channel dimension compression on the fused channel feature information of the i-1 first feature information, to obtain the first channel compression information of the i-1 first feature information;

The second channel compression unit is configured to perform channel dimension compression on the fused channel feature information of the i-1 first feature information to obtain second channel compression information of the i-1 first feature information;

The spatial feature extraction unit is configured to perform spatial feature extraction on the first channel compressed information and the second channel compressed information of the i-1 first feature information, to obtain the i-1 first feature information Spatial feature information;

The spatial attention information of the i-1 th first feature information is determined according to the spatial feature information of the i-1 th first feature information.
The method according to claim 44, wherein the first channel compression unit and/or the second channel compression unit comprises a pooling layer.
The method according to claim 44, wherein the first channel compression unit is a maximum pooling layer, and/or the second channel compression unit is an average pooling layer.
The method according to claim 44, wherein the spatial feature extraction unit is a convolutional layer.
The method according to claim 44, wherein the spatial attention module further comprises a second activation function;

The second activation function is used to perform nonlinear processing on the spatial feature information of the i-1 th first feature information to obtain the spatial attention information of the i-1 th first feature information.
The method according to any one of claims 36-48, wherein the spatial dimension of the channel attention information of the i-1 th first feature information is 1×1.
The method according to any one of claims 36-48, characterized in that the feature dimension of the spatial attention information of the i-1 th first feature information is 1.
The method according to claim 35, wherein the dynamic conversion model further comprises at least one downsampling unit;

The down-sampling unit is used for down-sampling the feature information output by the encoding module in a spatial dimension.
The method according to claim 51, wherein the downsampling unit is a max pooling layer.
The method according to claim 51, wherein the dynamic conversion model further comprises at least one upsampling unit;

The up-sampling unit is used for up-sampling the feature information output by the decoding module in a spatial dimension.
The method according to claim 53, wherein the upsampling unit is a bilinear interpolation unit.
The method according to claim 34, wherein each of the N encoding modules includes at least one convolutional block, wherein the convolutional block included in each of the N encoding modules The parameters are not exactly the same.
The method according to claim 34, wherein each of the N decoding modules includes at least one convolutional block, wherein the convolutional block included in each of the N decoding modules The parameters are not exactly the same.
The method of claim 34, wherein,

If the i is equal to N, the N-i th second feature information is determined according to the N th first feature information output by the N th encoding module; or,

If the i is less than N, the N-i-th second feature information is determined according to the N-i-th second feature information output by the N-i-th decoding module; or,

If the i is equal to 1, the i-1th first feature information is determined according to the LDR image; or,

If the i is greater than 1, the i-1 first feature information is determined according to the first feature information output by the i-1 encoding module.
The method according to claim 35, wherein the N-i+1th decoding module is used to concatenate the i-1th third characteristic information and the N-ith second characteristic information Feature extraction is performed on the last feature information to obtain the N-i+1th second feature information of the LDR image.
The method according to claim 35, wherein the dynamic conversion model further comprises a first convolutional layer;

The first convolutional layer is used to extract features from the LDR image, obtain an initial feature map of the LDR image, and input the initial feature map to the first coding module and the first convolutional attention module respectively middle.
The method according to claim 35, wherein the dynamic conversion model further comprises a second convolutional layer;

The second convolutional layer is used to perform feature extraction on the second feature information of the LDR image output by the last decoding module, and output an HDR image of the LDR image.
The method according to claim 35, wherein the initial parameters of the dynamic conversion model during training are pre-training parameters obtained during pre-training of the pre-training model.
The method according to claim 61, wherein the loss function of the dynamic conversion model includes at least one of a reconstruction loss function, a perceptual loss function and a style loss function.
The method according to claim 62, wherein the loss function of the dynamic conversion model is as shown in the following formula:

Loss＝L 1 +λ s L st +λ p L p

Wherein, Loss is the loss function of the dynamic conversion model, the L1 is the reconstruction loss function, the Lst is the perceptual loss function, the Lp is the style loss function, and the λ s and λ p is a hyperparameter.
The method according to claim 63, wherein the reconstruction loss function of the dynamic transformation model is determined based on the error between the compressed tone mapping value of the true value of the HDR image and the compressed tone mapping value of the predicted value of the HDR image wherein the compressed tone mapping value of the predicted value of the HDR image is determined according to a preset compressed tone mapping function and the predicted value of the HDR image, and the compressed tone mapping value of the true value of the HDR image is determined according to the compressed tone mapping function The mapping function and the ground truth of the HDR image are determined.
The method according to claim 63, wherein the perceptual loss function of the dynamic conversion model is determined based on an error between a first eigenvalue and a second eigenvalue, wherein the first eigenvalue is HDR The compressed tone mapping value of the image prediction value corresponds to the feature value in the feature map of the first layer of the pre-training model, and the second feature value is the compressed tone mapping value of the true value of the HDR image in the first layer. The corresponding feature value in the feature map, the compressed tone mapping value of the predicted value of the HDR image is determined according to the preset compressed tone mapping function and the predicted value of the HDR image, and the compressed tone mapped value of the true value of the HDR image is Determined according to the compressed tone mapping function and the true value of the HDR image.
The method according to claim 63, wherein the style loss function of the dynamic conversion model is determined based on an error between a first element value and a second element value, wherein the first element value is HDR The compressed tone mapping value of the image prediction value corresponds to the element value in the Gram Gram matrix of the l-th layer feature map of the pre-training model, and the second element value is the compressed tone mapping value of the true value of the HDR image in the The corresponding element value in the Gram matrix, the compressed tone mapping value of the predicted value of the HDR image is determined according to the preset compressed tone mapping function and the predicted value of the HDR image, and the compressed tone mapped value of the true value of the HDR image is Determined according to the compressed tone mapping function and the true value of the HDR image.
A model training method, characterized in that, comprising:

Obtain the true value of the high dynamic range HDR image of the low dynamic range LDR training image and the LDR training image;

The LDR training image is input into the dynamic conversion model, and the i-1 first feature information is extracted through the i-th encoding module to obtain the i-th first feature information of the LDR training image, wherein the The dynamic conversion model comprises N coding modules connected in series and the N decoding modules connected in series, the output of the last coding module in the N coding modules is connected with the first decoding module in the N decoding modules The input connection of the i-th encoding module and the N-i+1-th decoding module are jump-connected, the i is a positive integer less than or equal to N, and the N is a positive integer;

Feature extraction is performed on the i-1th first feature information and the N-ith second feature information of the LDR training image through the N-i+1 decoding module, to obtain the LDR training image. N-i+1 pieces of second feature information;

Determine the HDR image prediction value of the LDR training image according to the second characteristic information of the LDR training image output by the last decoding module in the N decoding modules;

Determining a loss between the predicted value of the HDR image of the LDR training image and the true value of the HDR image of the LDR training image, and training the dynamic transformation model according to the loss.
The method according to claim 67, wherein the dynamic conversion model further comprises: convolutional attention located in the skip connection between the i-th encoding module and the N-i+1-th decoding module module, wherein the N-i+1th decoding module performs feature extraction on the i-1th first feature information and the N-ith second feature information of the LDR training image to obtain the LDR The N-i+1 second feature information of the training image, including:

Carry out spatial information and channel information extraction to the i-1th first feature information by the convolution attention module, obtain the i-1th third feature information of the LDR training image;

The N-i+1th decoding module performs feature extraction on the i-1th third feature information and the N-ith second feature information to obtain the N-i+th of the LDR training image 1 piece of second characteristic information.
The method according to claim 68, wherein the convolutional attention module includes a channel attention module and a spatial attention module, and the ith-1th A characteristic information is carried out spatial information and channel information extraction, obtains the i-1 the 3rd characteristic information of described LDR training image, comprises:

performing channel information extraction on the i-1 first feature information through the channel attention module, to obtain channel attention information of the i-1 first feature information;

Through the spatial attention module, spatial information is extracted from the fusion channel feature information of the i-1th first feature information, and the spatial attention information of the i-1th first feature information is obtained, and the i-1th first feature information is obtained. The fused channel feature information of one first feature information is determined according to the i-1 first feature information and the channel attention information of the i-1 first feature information;

Determine the i-1th third feature information of the LDR training image according to the channel attention information and the spatial attention information of the i-1th first feature information.
The method according to claim 69, wherein the convolution attention module also includes a first multiplication unit, and the method also includes:

The first multiplication unit multiplies the i-1 first feature information and the channel attention information of the i-1 first feature information to obtain the i-1 first feature information The fusion channel feature information.
The method according to claim 69, wherein the convolutional attention module further comprises a second multiplication unit, and the channel attention information and spatial attention according to the i-1th first feature information Information, determine the i-1th third feature information of the LDR training image, including:

The fusion channel feature information and the spatial attention information of the i-1 first feature information are multiplied by the second multiplication unit to obtain the i-1 third feature information of the LDR training image.
The method according to claim 69, wherein the channel attention module comprises: a first space compression unit, a second space compression unit, and a channel feature extraction unit, and the channel attention module is used for the Extracting channel information from the i-1 first feature information to obtain channel attention information of the i-1 first feature information;

performing spatial dimension compression on the i-1 th first feature information by the first spatial compression unit to obtain first spatial compression information of the i-1 th first feature information;

performing spatial dimension compression on the i-1 th first feature information by the second spatial compression unit to obtain second spatial compression information of the i-1 th first feature information;

performing channel feature extraction on the first spatial compression information of the i-1 first feature information by the channel feature extraction unit, to obtain the first channel information of the i-1 first feature information;

performing channel feature extraction on the second spatial compression information of the i-1 first feature information by the channel feature extraction unit, to obtain the second channel information of the i-1 first feature information;

Determine the channel attention information of the i-1 first feature information according to the first channel information and the second channel information of the i-1 first feature information.
The method according to claim 72, wherein the first spatial compression unit and/or the second spatial compression unit comprises a pooling layer.
The method according to claim 72, wherein the first spatial compression unit is a maximum pooling layer, and/or the second spatial compression unit is an average pooling layer.
The method according to claim 72, wherein the channel feature extraction unit is a multi-layer perceptron (MLP).
The method according to claim 72, wherein the channel attention module further comprises: a first addition unit and a first activation function, and the first channel information according to the i-1 first feature information and the second channel information, determine the channel attention information of the i-1 first feature information, including:

Adding the first channel information and the second channel information of the i-1 pieces of first feature information by the first adding unit to obtain the fusion channel information of the i-1 pieces of first feature information;

Perform non-linear processing on the fused channel information of the i-1 pieces of first feature information by using the first activation function to obtain channel attention information of the i-1 th piece of first feature information.
The method according to claim 69, wherein the spatial attention module comprises: a first channel compression unit, a second channel compression unit, and a spatial feature extraction unit, and the i-th - Extracting spatial information from the fusion channel feature information of 1 first feature information to obtain the spatial attention information of the i-1 first feature information, including:

performing channel dimension compression on the fusion channel feature information of the i-1 first feature information by the first channel compression unit, to obtain the first channel compression information of the i-1 first feature information;

performing channel dimension compression on the fusion channel feature information of the i-1 first feature information by the second channel compression unit to obtain the second channel compression information of the i-1 first feature information;

The space feature extraction unit performs spatial feature extraction on the first channel compressed information and the second channel compressed information of the i-1 first feature information to obtain the space of the i-1 first feature information characteristic information;

The spatial attention information of the i-1 th first feature information is determined according to the spatial feature information of the i-1 th first feature information.
The method according to claim 77, wherein the first channel compression unit and/or the second channel compression unit comprises a pooling layer.
The method according to claim 77, wherein the first channel compression unit is a maximum pooling layer, and/or the second channel compression unit is an average pooling layer.
The method according to claim 77, wherein the spatial feature extraction unit is a convolutional layer.
The method according to claim 77, wherein the spatial attention module further includes a second activation function, and determining the ith according to the spatial feature information of the i-1 first feature information -1 spatial attention information of the first feature information, including:

The spatial feature information of the i-1 th first feature information is nonlinearly processed by the second activation function to obtain the spatial attention information of the i-1 th first feature information.
The method according to any one of claims 69-81, wherein the spatial dimension of the channel attention information of the i-1th first feature information is 1×1.
The method according to any one of claims 69-81, wherein the feature dimension of the spatial attention information of the i-1 first feature information is 1.
The method according to claim 68, wherein the dynamic conversion model further comprises at least one downsampling unit, and the method further comprises:

The feature information output by the coding module is down-sampled in a spatial dimension by the down-sampling unit.
The method according to claim 84, wherein the downsampling unit is a max pooling layer.
The method according to claim 84, wherein the dynamic conversion model further comprises at least one upsampling unit, and the method further comprises:

The feature information output by the decoding module is subjected to spatial dimension up-sampling by the up-sampling unit.
The method according to claim 86, wherein the upsampling unit is a bilinear interpolation unit.
The method according to claim 67, wherein each of the N encoding modules includes at least one convolutional block, wherein the convolutional block included in each of the N encoding modules The parameters are not exactly the same.
The method according to claim 67, wherein each of the N decoding modules includes at least one convolutional block, wherein the convolutional block included in each of the N decoding modules The parameters are not exactly the same.
The method of claim 67, wherein,

If the i is equal to N, the N-i th second feature information is determined according to the N th first feature information output by the N th encoding module; or,

If the i is less than N, the N-i-th second feature information is determined according to the N-i-th second feature information output by the N-i-th decoding module; or,

If the i is equal to 1, the i-1th first feature information is determined according to the LDR training image; or,

If the i is greater than 1, the i-1 first feature information is determined according to the first feature information output by the i-1 encoding module.
The method according to claim 68, wherein the i-1th third characteristic information and the N-ith second characteristic information are performed by the N-i+1th decoding module Feature extraction, obtaining the N-i+1 second feature information of the LDR training image includes:

Concatenating the i-1th third characteristic information and the N-ith second characteristic information;

Inputting the concatenated feature information into the N-i+1th decoding module for feature extraction to obtain the N-i+1th second feature information of the LDR training image.
The method according to claim 68, wherein the dynamic conversion model further comprises a first convolutional layer, and the method further comprises:

Carrying out feature extraction to the LDR training image through the first convolutional layer to obtain an initial feature map of the LDR training image;

Input the initial feature map into the first encoding module and the first convolution attention module respectively, obtain the first first feature information output by the first encoding module, and obtain the first convolution The first third feature information output by the attention module.
The method according to claim 67, wherein the dynamic conversion model further comprises a second convolutional layer, and the second convolutional layer of the LDR training image output by the last decoding module among the N decoding modules is characterized in that: Feature information, determining the HDR image prediction value of the LDR training image, including:

performing feature extraction on the second feature information of the LDR training image output by the last decoding module through the second convolution layer, and outputting the HDR image prediction value of the LDR training image.
The method of claim 68, further comprising:

Obtain the pre-training parameters obtained by the pre-training model during pre-training;

The pre-training parameters are determined as initial parameters of the dynamic conversion model.
The method according to claim 94, wherein said determining the loss between the predicted value of the HDR image of the LDR training image and the true value of the HDR image of the LDR training image comprises:

A target loss between the predicted value of the HDR image of the LDR training image and the true value of the HDR image of the LDR training image is determined according to a preset loss function.
The method according to claim 95, wherein the preset loss function includes at least one of a reconstruction loss function, a perceptual loss function and a style loss function.
The method according to claim 96, wherein the target loss between the predicted value of the HDR image of the LDR training image and the true value of the HDR image of the LDR training image is determined according to a preset loss function, include:

determining a reconstruction loss between the predicted value of the HDR image and the true value of the HDR image;

determining a perceptual loss between the predicted value of the HDR image and the true value of the HDR image;

determining a style loss between the predicted value of the HDR image and the true value of the HDR image;

The determining a target loss between the predicted value of the HDR image and the true value of the HDR image is determined according to a reconstruction loss, a perceptual loss, and a style loss between the predicted value of the HDR image and the true value of the HDR image.
The method according to claim 97, wherein said determining said HDR image is determined according to reconstruction loss, perceptual loss, and style loss between said HDR image prediction value and said HDR image true value. The target loss between the predicted value and the true value of the HDR image, including:

Determine the target loss between the HDR image prediction value and the HDR image true value according to the following formula:

Loss＝L 1 +λ s L st +λ p L p

Wherein, Loss is the target loss, the L1 is the reconstruction loss, the Lst is the perceptual loss, the Lp is the style loss, and the λ s and λ p are hyperparameters.
The method according to claim 97, wherein the determining the reconstruction loss between the predicted value of the HDR image and the true value of the HDR image comprises:

determining a compressed tone-mapping value of the predicted value of the HDR image according to a preset compressed tone-mapping function;

determining a compressed tone-mapping value of the true value of the HDR image according to the compressed tone-mapping function;

The reconstruction loss is determined from an error between the compressed tone-mapped values of the true value of the HDR image and the compressed tone-mapped value of the predicted value of the HDR image.
The method according to claim 97, wherein the determining the perceptual loss between the predicted value of the HDR image and the true value of the HDR image comprises:

Obtain the feature map of the first layer of the pre-training model;

determining a compressed tone-mapping value of the predicted value of the HDR image according to a preset compressed tone-mapping function;

determining a compressed tone-mapping value of the true value of the HDR image according to the compressed tone-mapping function;

Determining the compressed tone mapping value of the predicted value of the HDR image, corresponding to the first feature value in the feature map of the first layer;

Determining a compressed tone mapping value of the true value of the HDR image, a second feature value corresponding to the feature map of the first layer;

The perceptual loss is determined based on an error between the first feature value and the second feature value.
The method according to claim 97, wherein the determining the style loss between the predicted value of the HDR image and the true value of the HDR image comprises:

Obtain the Gram Gram matrix of the l-th layer feature map of the pre-training model;

determining a compressed tone-mapping value of the predicted value of the HDR image according to a preset compressed tone-mapping function;

determining a compressed tone-mapping value of the true value of the HDR image according to the compressed tone-mapping function;

Determine the compressed tone mapping value of the predicted value of the HDR image, and the corresponding first element value in the Gram matrix;

Determine the compressed tone mapping value of the true value of the HDR image, and the corresponding second element value in the Gram matrix;

The style loss is determined based on an error between the first element value and the second element value.
An image decoding device, characterized in that it comprises:

a decoding unit, configured to decode the code stream to obtain a reconstructed image;

A processing unit, configured to input the reconstructed image into a dynamic conversion model for dynamic conversion to obtain a high dynamic range HDR image of the reconstructed image;

Wherein, the dynamic conversion model includes: N encoding modules connected in series and N decoding modules connected in series, the output of the last encoding module in the N encoding modules is the same as that of the first encoding module in the N decoding modules The input of the decoding module is connected, and the i-th encoding module is skip-connected to the N-i+1 decoding module, and the i-th encoding module is used for the i-1th output of the i-1 encoding module Feature extraction is performed on the first feature information to obtain the i-th first feature information of the reconstructed image, and the N-i+1th decoding module is used to extract the i-1-th first feature information and the Perform feature extraction on the N-ith second feature information of the reconstructed image to obtain the N-i+1th second feature information of the reconstructed image, and the HDR image of the reconstructed image is based on the last one of the N decoding modules Determined by the second feature information output by the decoding module, the i is a positive integer less than or equal to N, and the N is a positive integer.
An image processing device, characterized in that it comprises:

An acquisition unit, configured to acquire a low dynamic range LDR image to be processed;

A processing unit, configured to input the LDR image into a dynamic conversion model for dynamic conversion to obtain a high dynamic range HDR image of the LDR image;

Wherein, the dynamic conversion model includes: N encoding modules connected in series and N decoding modules connected in series, the output of the last encoding module in the N encoding modules is the same as that of the first encoding module in the N decoding modules The input of the decoding module is connected, and the i-th encoding module is skip-connected to the N-i+1 decoding module, and the i-th encoding module is used for the i-1th output of the i-1 encoding module Feature extraction is performed on the first feature information to obtain the ith first feature information of the LDR image, and the N-i+1 decoding module is used to extract the i-1 first feature information and the Feature extraction is performed on the N-i second feature information of the LDR image to obtain the N-i+1 second feature information of the LDR image, and the HDR image of the LDR image is based on the last one of the N decoding modules Determined by the second feature information output by the decoding module, the i is a positive integer less than or equal to N, and the N is a positive integer.
A model training device, characterized in that it comprises:

An acquisition unit configured to acquire a low dynamic range LDR training image and a true value of a high dynamic range HDR image of the LDR training image;

A processing unit, configured to input the LDR training image into a dynamic conversion model, perform feature extraction on the i-1 first feature information through the i-th encoding module, and obtain the i-th first feature information of the LDR training image , wherein the dynamic conversion model includes N encoding modules connected in series and the N decoding modules connected in series, the output of the last encoding module in the N encoding modules is the same as that of the N decoding modules The input of the first decoding module is connected, and the i-th encoding module is skipped and connected to the N-i+1 decoding module, the i is a positive integer less than or equal to N, and the N is a positive integer; through the The N-i+1 decoding module performs feature extraction on the i-1 first feature information and the N-i second feature information of the LDR training image to obtain the N-i of the LDR training image +1 second feature information; according to the second feature information of the LDR training image output by the last decoding module in the N decoding modules, determine the HDR image prediction value of the LDR training image; determine the LDR training The loss between the HDR image prediction value of the image and the HDR image true value of the LDR training image, and the dynamic conversion model is trained according to the loss.
A decoder, characterized in that it includes: a processor and a memory;

The memory is used to store computer programs;

The processor is used for invoking and running the computer program stored in the memory, so as to execute the method according to any one of claims 1-33.
An electronic device, characterized in that it includes: a processor and a memory;

The memory is used to store computer programs;

The processor is used for invoking and running the computer program stored in the memory, so as to execute the method according to any one of claims 34-66 or 67-101.
A computer-readable storage medium, characterized by being used to store a computer program, the computer program causes a computer to execute the method according to any one of claims 1-33 or 34-66 or 67-101.