CN115564662A - Image reconstruction method, electronic device, storage medium, and computer program product - Google Patents

Info

Publication number
CN115564662A
CN115564662A (application number CN202210932958.9A)
Authority
CN
China
Prior art keywords
image
feature
features
images
processed
Prior art date
Legal status
Pending
Application number
CN202210932958.9A
Other languages
Chinese (zh)
Inventor
刘震
刘帅成
Current Assignee
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd
Priority to CN202210932958.9A
Publication of CN115564662A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 5/90 Dynamic range modification of images or parts thereof
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/768 Using context analysis, e.g. recognition aided by known co-occurring patterns
    • G06V 10/82 Using neural networks
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30204 Marker

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The embodiments of the present application provide an image reconstruction method, an electronic device, a computer-readable storage medium, and a computer program product. The method comprises: acquiring a plurality of images to be processed, the images to be processed being LDR images; performing feature extraction on the plurality of images to be processed to obtain extracted image features; extracting global context information and local context information based on the extracted image features to obtain global context feature information and local context feature information; and obtaining a reconstructed image by combining the global context feature information and the local context feature information, the reconstructed image being a high dynamic range image. The method can effectively alleviate the ghosting problem in scenes with large foreground motion.

Description

Image reconstruction method, electronic device, storage medium, and computer program product
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image reconstruction method, an electronic device, a computer-readable storage medium, and a computer program product.
Background
In the field of digital image processing, dynamic range refers to the range of light intensities in a scene that an image can capture. The dynamic range of natural scenes observable by the human eye can reach approximately 10,000:1, whereas common consumer-grade photographic devices (e.g., mobile phones) can usually capture only Low Dynamic Range (LDR) images, with a dynamic range of, for example, 100 to 300. Compared with an LDR image, a High Dynamic Range (HDR) image has a wider dynamic range and can restore the light and shadow of a real scene more faithfully, yielding photographs with richer tonal levels, a more realistic appearance, and higher quality.
There are two main ways to obtain HDR images: the first is to capture HDR images directly with dedicated hardware, but such devices are bulky and expensive and have not become common in consumer electronics (e.g., smartphones); the second is to obtain an HDR image by fusing multiple LDR frames captured at different exposures (i.e., multi-exposure HDR image reconstruction). Multi-frame fusion, however, is prone to ghosting caused by hand-held camera shake or the motion of foreground objects.
Of the above two methods, the second, i.e., multi-exposure HDR image reconstruction in a dynamic scene, is the more common way of acquiring HDR images. To obtain a high-quality HDR image, the ghosting caused by foreground motion or camera shake when fusing multiple LDR images with different exposures needs to be suppressed. In particular, when the scene contains large-scale object motion or sharp luminance changes, the ghosting problem can be very difficult to solve.
Disclosure of Invention
The present application has been made in view of the above problems, and provides an image reconstruction method, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present application, there is provided an image reconstruction method including: acquiring a plurality of images to be processed, wherein the images to be processed are low dynamic range images; performing feature extraction on a plurality of images to be processed to obtain extracted image features; extracting global context information and local context information based on the extracted image features to obtain global context feature information and local context feature information, and obtaining a reconstructed image by combining the global context feature information and the local context feature information, wherein the reconstructed image is a high dynamic range image.
Illustratively, extracting global context information and local context information based on the extracted image features, obtaining global context feature information and local context feature information, and obtaining a reconstructed image by combining the global context feature information and the local context feature information, includes: sequentially executing at least one reconstruction operation, wherein the reconstruction operation comprises at least one characteristic processing operation which is sequentially executed, and the characteristic processing operation comprises: extracting global context information of image features to be processed to obtain global context features, wherein the image features to be processed corresponding to a first feature processing operation in first reconstruction operations are extracted image features, the image features to be processed corresponding to the first feature processing operation in non-first reconstruction operations are execution results of previous reconstruction operations, the image features to be processed corresponding to the non-first feature processing operation in all the reconstruction operations are execution results of the previous feature processing operations, and the global context feature information comprises the global context features; extracting local context information of the image features to be processed to obtain local context features, wherein the local context feature information comprises the local context features; fusing the global context features and the local context features together to obtain an execution result of the feature processing operation; obtaining a reconstructed image by combining the global context feature information and the local context feature information, wherein the method comprises the following steps: a reconstructed image is obtained based on a result of the execution of the last reconstruction operation, the reconstructed image being a high dynamic range image.
Illustratively, extracting local context information of the image feature to be processed to obtain the local context feature includes: carrying out layer normalization processing on the image feature to be processed to obtain a normalized feature; shaping the normalized feature to convert it into a two-dimensional image feature form, obtaining a shaped feature; convolving the shaped feature to obtain a first convolution feature; performing channel attention processing on the first convolution feature to obtain a channel attention feature; and shaping the channel attention feature to convert it into a block-embedded feature form, obtaining the local context feature.
Illustratively, performing feature extraction on a plurality of images to be processed, obtaining extracted image features comprises: performing initial feature extraction on a plurality of images to be processed to obtain initial image features; the initial image features are shaped to convert the initial image features into a block-embedded feature form to obtain extracted image features.
Illustratively, performing initial feature extraction on a plurality of images to be processed, and obtaining initial image features comprises: respectively carrying out convolution on the plurality of images to be processed to obtain a plurality of second convolution characteristics which are in one-to-one correspondence with the plurality of images to be processed; for any non-reference image, performing spatial attention processing on a second convolution feature corresponding to the non-reference image and a second convolution feature corresponding to a reference image to obtain a spatial attention feature corresponding to the non-reference image, wherein the reference image is one of the images to be processed, and the non-reference image is an image except the reference image in the images to be processed; and fusing the spatial attention features corresponding to all the non-reference images and the second convolution features corresponding to the reference images to obtain initial image features.
Illustratively, the reference image is an image of the plurality of images to be processed whose exposure value is closest to 0, or the reference image is one of the plurality of images to be processed whose exposure value is at the center.
Illustratively, obtaining a reconstructed image based on a result of the performance of the last reconstruction operation includes: combining the execution result of the last reconstruction operation with the initial image feature and/or the second convolution feature corresponding to the reference image to obtain an intermediate output feature; convolving the intermediate output characteristics to obtain third convolution characteristics; and mapping the third convolution characteristic through the activation function to obtain a reconstructed image.
Illustratively, obtaining a reconstructed image based on the result of the execution of the last reconstruction operation includes: combining the execution result of the last reconstruction operation with the initial image characteristics to obtain intermediate output characteristics; convolving the intermediate output features to obtain fourth convolution features; and mapping the fourth convolution characteristic through an activation function to obtain a reconstructed image.
Illustratively, the reconstruction operation further includes a convolution operation performed on the execution result of the last feature processing operation, and in each reconstruction operation, the to-be-processed image feature corresponding to the first feature processing operation and the execution result of the convolution operation are merged together as the execution result of the reconstruction operation.
Exemplarily, the operations of performing feature extraction on a plurality of images to be processed, performing global context information extraction and local context information extraction based on the extracted image features, and obtaining a reconstructed image by combining the global context feature information and the local context feature information are realized by a reconstruction network model, and the method further includes: acquiring a training image, wherein the training image comprises a plurality of sample images and annotation images corresponding to the sample images, the sample images are low dynamic range images, and the annotation images are high dynamic range images; processing the plurality of sample images using the initial reconstructed network model to obtain a predicted reconstructed image; calculating a reconstruction loss term based on the predicted reconstructed image and the labeled image; acquiring predicted image characteristics and annotated image characteristics corresponding to the predicted reconstructed image and the annotated image respectively; calculating a perception loss item based on the predicted image characteristic and the annotated image characteristic; calculating a total loss based on the reconstruction loss term and the perceptual loss term; and optimizing the parameters of the initial reconstructed network model based on the total loss to obtain the reconstructed network model.
Illustratively, obtaining the predicted image feature and the annotated image feature corresponding to each of the predicted reconstructed image and the annotated image comprises: tone mapping is carried out on the prediction reconstruction image and the annotation image respectively to obtain a new prediction reconstruction image and a new annotation image; and respectively inputting the new prediction reconstruction image and the new annotation image into the pre-training network model to obtain the corresponding prediction image characteristics and annotation image characteristics.
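By way of example and not limitation, the training objective described above might be sketched as follows. This is a minimal PyTorch sketch under several assumptions that are not specified by the present application: an L1 reconstruction loss, a μ-law operator for the tone mapping, a fixed weighting factor for the perceptual loss term, and a generic pre-trained feature network supplied by the caller.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def mu_law_tonemap(x: torch.Tensor, mu: float = 5000.0) -> torch.Tensor:
    # Illustrative tone-mapping operator (an assumption of this sketch).
    return torch.log(1.0 + mu * x) / math.log(1.0 + mu)

def total_loss(pred_hdr: torch.Tensor, gt_hdr: torch.Tensor,
               feature_net: nn.Module, perc_weight: float = 0.01) -> torch.Tensor:
    # Reconstruction loss term between the predicted reconstructed image
    # and the annotated (ground-truth) HDR image.
    recon = F.l1_loss(pred_hdr, gt_hdr)
    # Perceptual loss term: tone-map both images, feed them to a pre-trained
    # network, and compare the predicted image features with the annotated image features.
    pred_feat = feature_net(mu_law_tonemap(pred_hdr))
    gt_feat = feature_net(mu_law_tonemap(gt_hdr))
    perc = F.l1_loss(pred_feat, gt_feat)
    # Total loss based on the reconstruction loss term and the perceptual loss term.
    return recon + perc_weight * perc
```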
According to another aspect of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the image reconstruction method described above.
According to another aspect of the present application, there is provided a computer readable storage medium having stored thereon a computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the above-mentioned image reconstruction method.
According to another aspect of the present application, there is provided a computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the above-mentioned image reconstruction method.
According to the image reconstruction method, the electronic device, the computer-readable storage medium and the computer program product of the embodiments of the application, global context information extraction and local context information extraction are performed based on image features of an LDR image, and an HDR image is obtained by combining the global context feature information and the local context feature information. The scheme can simultaneously aggregate the long-range and local context information of the image, and when the processing mode is applied to HDR image reconstruction, the ghost problem in a large foreground motion scene can be effectively solved.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally indicate like parts or steps.
FIG. 1 shows a schematic block diagram of an example electronic device for implementing image feature processing or image reconstruction methods and apparatus according to embodiments of the present application;
FIG. 2 shows a schematic flow diagram of an image feature processing method according to one embodiment of the present application;
FIG. 3 illustrates a network architecture diagram of a context-aware visual converter module according to one embodiment of the present application;
FIG. 4 shows a schematic flow chart of an image reconstruction method according to an embodiment of the present application;
FIG. 5 illustrates a schematic diagram of a process flow for reconstructing a network model according to one embodiment of the present application;
FIG. 6 shows a schematic block diagram of an image feature processing apparatus according to one embodiment of the present application;
FIG. 7 shows a schematic block diagram of an image reconstruction apparatus according to an embodiment of the present application;
FIG. 8 shows a schematic block diagram of an electronic device according to an embodiment of the present application; and
FIG. 9 shows a schematic block diagram of an electronic device according to one embodiment of the present application.
Detailed Description
In recent years, technical research based on artificial intelligence, such as computer vision, deep learning, machine learning, image processing, and image recognition, has developed rapidly. Artificial Intelligence (AI) is an emerging science and technology that studies and develops theories, methods, techniques, and application systems for simulating and extending human intelligence. Artificial intelligence is a comprehensive discipline involving a wide range of technical fields, such as chips, big data, cloud computing, the Internet of Things, distributed storage, deep learning, machine learning, and neural networks. Computer vision, an important branch of artificial intelligence, studies in particular how to make machines "see" and understand the world. Computer vision technologies generally include face recognition, image reconstruction, fingerprint recognition and anti-counterfeiting verification, biometric recognition, face detection, pedestrian detection, object detection, pedestrian recognition, image processing, image recognition, image semantic understanding, image retrieval, character recognition, video processing, video content recognition, behavior recognition, three-dimensional reconstruction, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), computational photography, and robot navigation and positioning. With the research and progress of artificial intelligence technology, such technology has been applied in many fields, such as security, city management, traffic management, building management, park management, face-based access, face-based attendance, logistics management, warehouse management, robots, intelligent marketing, computational photography, mobile phone imaging, cloud services, smart homes, wearable devices, unmanned driving, autonomous driving, smart healthcare, face-based payment, face unlocking, fingerprint unlocking, identity verification, smart screens, smart televisions, cameras, the mobile Internet, live streaming, beauty applications, medical aesthetics, and intelligent temperature measurement.
In order to make the objects, technical solutions and advantages of the present application more apparent, exemplary embodiments according to the present application will be described in detail below with reference to the accompanying drawings. It should be apparent that the described embodiments are only a few embodiments of the present application, and not all embodiments of the present application, and it should be understood that the present application is not limited to the example embodiments described herein. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the application described in the application without inventive step, shall fall within the scope of protection of the application.
To at least partially solve the above technical problem, embodiments of the present application provide an image reconstruction method, an electronic device, a computer-readable storage medium, and a computer program product. According to the image reconstruction method, long-range and local context information of an image can be aggregated simultaneously, and when this processing approach is applied to HDR image reconstruction, the ghosting problem in scenes with large foreground motion can be effectively alleviated. The image reconstruction technology according to the embodiments of the present application can be applied to any field in which HDR images need to be generated.
First, an example electronic device 100 for implementing an image feature processing or image reconstruction method and apparatus according to an embodiment of the present application is described with reference to fig. 1.
As shown in fig. 1, electronic device 100 includes one or more processors 102, one or more memory devices 104. Optionally, the electronic device 100 may also include an input device 106, an output device 108, and an image capture device 110, which are interconnected via a bus system 112 and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in fig. 1 are exemplary only, and not limiting, and the electronic device may have other components and structures as desired.
The processor 102 may be implemented in hardware as at least one of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), or a microprocessor. The processor 102 may be one of, or a combination of, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), an Application-Specific Integrated Circuit (ASIC), or another form of processing unit with data processing capability and/or instruction execution capability, and may control other components in the electronic device 100 to perform desired functions.
The storage device 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), hard disks, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement the client functionality (implemented by the processor) of the embodiments of the application described below and/or other desired functionality. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images and/or sounds) to an external (e.g., user), and may include one or more of a display, a speaker, etc. Alternatively, the input device 106 and the output device 108 may be integrated together, implemented using the same interactive device (e.g., a touch screen).
The image capture device 110 may capture images and store the captured images in the storage device 104 for use by other components. The image capture device 110 may be a separate camera or a camera in a mobile terminal, etc. It should be understood that the image capture device 110 is merely an example, and the electronic device 100 may not include the image capture device 110. In this case, other devices having image capturing capabilities may be used to capture an image and transmit the captured image to the electronic device 100.
For example, an example electronic device for implementing the image feature processing or image reconstruction method and apparatus according to the embodiments of the present application may be implemented on a device such as a personal computer, a terminal device, an attendance machine, a panel machine, a camera, or a remote server. Wherein, the terminal device includes but is not limited to: tablet computers, mobile phones, PDAs (Personal Digital assistants), touch screen-enabled all-in-one machines, wearable devices, and the like.
According to the present application, a novel image feature processing approach is adopted to aggregate long-range and local context information of an image, and this approach is then applied to the field of image reconstruction in order to alleviate the ghosting problem described above. For ease of understanding, an image feature processing method according to an embodiment of the present application is first described below with reference to fig. 2 to 3. FIG. 2 shows a schematic flow diagram of an image feature processing method 200 according to one embodiment of the present application. As shown in fig. 2, the image feature processing method 200 includes steps S210, S220, S230, and S240.
In step S210, to-be-processed image features are acquired, which are obtained based on image features extracted from the to-be-processed image.
The image to be processed here may be an arbitrary image. By way of example and not limitation, the image to be processed may be an original Low Dynamic Range (LDR) image acquired by the image acquisition device, or may be an LDR image obtained by performing some pre-processing on the original LDR image. Illustratively, the preprocessing may include smoothing, filtering, normalizing, and the like.
For the image features extracted from the image to be processed, any suitable further processing may be performed to obtain the image features to be processed. Such further processing may include, for example, reshaping (reshape). For example, the image features extracted from the image to be processed may be converted by reshaping from a two-dimensional image feature form of H × W × C, where H denotes height, W denotes width, and C denotes the number of channels, into the form of block-embedded features (patch embeddings). The image features to be processed are the input features of a Context-aware Vision Transformer (CA-ViT) module, that is, the features that need to be processed by the CA-ViT module.
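By way of example and not limitation, the conversion between the H × W × C two-dimensional image feature form and the patch-embedding form might be sketched as follows. This is a minimal PyTorch sketch in which each spatial position is treated as one token; this choice, as well as the tensor layout and function names, are assumptions of the illustration rather than part of the present application.

```python
import torch

def to_patch_embeddings(feat_2d: torch.Tensor) -> torch.Tensor:
    # feat_2d: B x C x H x W two-dimensional image features.
    # Returns B x (H*W) x C patch embeddings (one token per spatial position).
    return feat_2d.flatten(2).permute(0, 2, 1)

def to_image_features(tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
    # tokens: B x (H*W) x C patch embeddings; inverse of the conversion above.
    b, n, c = tokens.shape
    return tokens.permute(0, 2, 1).reshape(b, c, h, w)
```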
In step S220, global context information is extracted from the image feature to be processed to obtain a global context feature. Illustratively, a visual converter module in the context-aware visual converter module may be utilized to perform global context information extraction on the image feature to be processed, so as to obtain a global context feature.
The image features to be processed may be input into a Vision Transformer (ViT) module, so as to obtain global context features output by the ViT module. The ViT module employed in step S220 may be a conventional ViT. The network structure and the operation principle of the conventional ViT can be understood by those skilled in the art, and are not described herein in detail.
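As an illustration only, a conventional ViT block of the kind referred to above might be sketched as follows. This is a PyTorch sketch; the number of attention heads, the MLP expansion ratio, and the pre-norm arrangement are assumptions of this illustration, not limitations of the present application.

```python
import torch
import torch.nn as nn

class ViTBlockSketch(nn.Module):
    # Conventional ViT block: layer normalization -> multi-head self-attention (MSA)
    # -> layer normalization -> multi-layer perceptron (MLP), each with a residual
    # connection, operating on B x N x C patch embeddings.
    def __init__(self, dim: int, num_heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.norm1(tokens)
        tokens = tokens + self.attn(x, x, x, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens
```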
In step S230, local context information extraction is performed on the image feature to be processed, so as to obtain a local context feature. For example, a local context extraction module in the context-aware visual converter module may be utilized to extract local context information of the image feature to be processed, so as to obtain a local context feature.
Alternatively, step S220 and step S230 may be performed synchronously.
Fig. 3 shows a schematic network structure diagram of a CA-ViT module according to an embodiment of the present application. As shown in fig. 3, the CA-ViT module includes two parallel branches, viT module 302 and local context extraction module 304. As shown in fig. 3, the ViT module 302 mainly includes a Multi-head Self-Attention layer (MSA) and a Multi-layer Perceptron (MLP). In addition, the ViT module 302 may further include at least one Layer Normalization (LN) Layer. The network structure and the operation principle of the ViT module 302 can be understood by those skilled in the art, and are not described in detail herein. The image feature to be processed may be patch embeddings, the image feature to be processed is input to the ViT module 302, and the ViT module 302 may output the global context feature. That is, the extraction of the global context information may be performed by the ViT module 302.
The extraction of local context information may be performed by the local context extraction module 304. As shown in FIG. 3, local context extraction module 304 may include a local context extractor. The local context extractor may include a first convolution module and a channel attention module. The extraction of the local context information in the image can be realized through the first convolution module and the channel attention module. It is to be understood that the network architecture shown in fig. 3 is only an example and not a limitation of the present application, which is not limited to the network architecture shown in fig. 3. For example, the first convolution module in the local context extraction module 304 may be replaced by other network structures such as a pooling layer, as long as the extraction of the local context information can be achieved.
In step S240, the global context feature and the local context feature are fused together to obtain an execution result of the image feature processing method 200. The global context feature and the local context feature are fused together, so that the output feature of the context-aware vision converter module, that is, the execution result of the image feature processing method 200, can be obtained.
As shown in fig. 3, there is a fusion module after the outputs of the ViT module 302 and the local context extraction module 304 for fusing the global context feature and the local context feature together.
Optionally, the global context feature and the local context feature may be merged together by means of channel splicing.
According to the image feature processing method, the features (i.e., the execution results) obtained by processing carry global context information and local context information, that is, long-range and local context information of an image can be aggregated at the same time, and the performance of the image feature processing method is improved remarkably compared with that of the existing visual converter on tasks such as HDR image reconstruction.
Illustratively, the image feature processing method according to the embodiment of the present application may be implemented in a device, apparatus or system having a memory and a processor.
The image feature processing method according to the embodiment of the present application may be deployed at an image capturing end, for example, at a personal terminal or a server end.
Alternatively, the image feature processing method according to the embodiment of the present application may also be distributively deployed at a server side (or a cloud side) and a personal terminal. For example, an LDR image may be acquired at a client, the client transmits the acquired LDR image to a server (or a cloud), and the server (or the cloud) extracts features from the LDR image, and obtains features of an image to be processed for image feature processing.
According to the embodiment of the present application, extracting local context information from the image feature to be processed to obtain the local context feature (step S230) may include: carrying out layer normalization processing on the image feature to be processed to obtain a normalized feature; shaping the normalized feature to convert it into a two-dimensional image feature form, obtaining a shaped feature; convolving the shaped feature to obtain a first convolution feature; performing channel attention processing on the first convolution feature to obtain a channel attention feature; and shaping the channel attention feature to convert it into a block-embedded feature form, obtaining the local context feature.
Exemplarily, the local context extraction module includes a layer normalization module, a first shaping module, a first convolution module, a channel self-attention module, and a second shaping module, and extracting local context information of the image feature to be processed by using a local context extraction module in the context-aware visual converter module to obtain the local context feature includes: inputting the characteristics of the image to be processed into a layer normalization module for layer normalization processing to obtain normalized characteristics; inputting the normalized features into a first shaping module for shaping so as to convert the normalized features into a two-dimensional image feature form and obtain shaping features; inputting the shaping feature into a first convolution module for convolution to obtain a first convolution feature; inputting the first convolution characteristic into a channel attention module for channel attention processing to obtain a channel attention characteristic; and inputting the channel attention features into a second shaping module for shaping so as to convert the channel attention features into a block embedded feature form and obtain local context features.
Fig. 3 shows the LN module, the first convolution module, and the channel self-attention module included in the local context extraction module 304, and the first shaping module and the second shaping module are not shown.
In the case where the image feature to be processed is patch embedding, a certain processing, for example, conversion into a form of a two-dimensional image feature may be performed before the feature is input to the local context extractor. Illustratively, the local context extraction module 304 may include a Layer Normalization (LN) layer and/or a shaping (reshape) layer. For example, the image features to be processed may be input into the LN layer for normalization. The normalized features can also be input into a reshape layer for feature conversion, and the features are converted into a two-dimensional image feature form from a patch embedding form, namely, into features which can be expressed as H × W × C. The H x W x C dimensional features may then be input to the first convolution module. The dimension of the feature output by the first convolution module may be H × W × C, and of course, for any parameter of the height H, the width W, and the number of channels C, the output feature of the first convolution module may or may not be consistent with the parameter of the input feature.
The first convolution module may include at least one convolution layer. The number of convolution layers included in the first convolution module may be arbitrary, and the present application does not limit this. The channel attention module may include network layers such as a pooling layer, a linear layer, an activation function layer, and a dimension expansion layer. The pooling layer and the dimension expansion layer are not explicitly shown in FIG. 3, but the operations they perform are shown at the corresponding positions of the flow. The number of layers included in the channel attention module may be set as needed, which is not limited in the present application. In the example shown in FIG. 3, the channel attention module includes two Linear (Linear) layers and two activation function layers. The two activation function layers are a ReLU layer and a Sigmoid layer, respectively.
The features output by the first convolution module are input to the pooling layer, which may output features of 1 × 1 × C dimensions. The pooling layer may be an average pooling layer and/or a maximum pooling layer, etc. After several linear layers and several activation function layers, the output features still maintain the 1 × 1 × C dimension. The feature output by the last activation function layer may be dimension extended for multiplication with the feature output by the first convolution module, converting it back to H × W × C size. The output characteristic of the channel attention module, namely the channel attention characteristic, can be obtained by multiplying the converted characteristic element by element with the characteristic output by the first convolution module.
The channel attention feature is in the form H × W × C. The feature output by the MLP in the ViT module is in patch-embedding form. To fuse these two features together, one of them can be transformed. Illustratively, a patch-embedding conversion may be performed on the channel attention feature, and the resulting feature in patch-embedding form is taken as the local context feature. The global context feature and the local context feature may then be fused together, for example by means of channel splicing.
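Putting the above steps together, the local-context branch and its fusion with the global branch might be sketched as follows. This is a PyTorch sketch under illustrative assumptions: the 3 × 3 kernel size of the convolution, the channel-reduction ratio inside the channel attention, and fusion by channel concatenation followed by a linear projection back to the original channel dimension are choices of this sketch, not of the present application.

```python
import torch
import torch.nn as nn

class LocalContextExtractorSketch(nn.Module):
    # Layer normalization -> reshape to B x C x H x W -> convolution -> channel
    # attention (pool -> linear -> ReLU -> linear -> Sigmoid -> expand -> multiply)
    # -> reshape back to patch embeddings.
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = tokens.shape
        x = self.norm(tokens)                         # layer normalization
        x = x.permute(0, 2, 1).reshape(b, c, h, w)    # to two-dimensional image form
        x = self.conv(x)                              # first convolution feature
        weights = self.fc(self.pool(x).flatten(1))    # 1 x 1 x C channel weights
        x = x * weights.view(b, c, 1, 1)              # dimension expansion and multiply
        return x.flatten(2).permute(0, 2, 1)          # back to patch-embedding form

def fuse_branches(global_tokens: torch.Tensor, local_tokens: torch.Tensor,
                  projection: nn.Linear) -> torch.Tensor:
    # Channel splicing of the two branches, followed by an (assumed) projection
    # back to the original channel dimension.
    return projection(torch.cat([global_tokens, local_tokens], dim=-1))
```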
According to an embodiment of the application, the channel attention module comprises at least one linear layer and an activation function layer.
The present embodiment has been described above, and is not described herein again.
An implementation of solving the ghosting problem in a large foreground motion scene using the CA-ViT module described above according to an embodiment of the present application is described below.
According to another aspect of the present application, there is provided an image reconstruction method. FIG. 4 shows a schematic flow diagram of an image reconstruction method 400 according to one embodiment of the present application. As shown in fig. 4, the image reconstruction method 400 includes steps S410, S420, S430, and S440.
In step S410, a plurality of images to be processed, which are low dynamic range images, are acquired.
It will be appreciated by those skilled in the art that if a reconstruction is desired to obtain a certain HDR image, the LDR images on which it is based are theoretically images of similar image content but different luminances. For example, the plurality of LDR images may be a plurality of images acquired continuously for the same scene at the same view angle using the same camera. The scene may include a target object (foreground) and a portion other than the target object (background). In a plurality of LDR images, the background may be kept substantially unchanged, while the foreground may be stationary or moving. When the foreground motion is large, the aforementioned ghost problem exists.
In step S420, feature extraction is performed on a plurality of images to be processed, obtaining extracted image features. For example, a plurality of images to be processed may be input to a feature extraction module in the reconstruction network model for feature extraction, obtaining extracted image features.
In step S430, global context information extraction and local context information extraction are performed based on the extracted image features, so as to obtain global context feature information and local context feature information, and a reconstructed image is obtained by combining the global context feature information and the local context feature information, where the reconstructed image is a high dynamic range image.
The global contextual feature information and the local contextual feature information may be extracted using any suitable existing or future possible global contextual information extraction and local contextual information extraction methods, such as using neural network models, exemplary extraction methods being described below.
According to the image reconstruction method, the global context information and the local context information are extracted based on the image characteristics of the LDR image, and the HDR image is obtained by combining the global context characteristic information and the local context characteristic information. The scheme can simultaneously aggregate the long-range and local context information of the image, and when the processing mode is applied to HDR image reconstruction, the ghost problem in a large foreground motion scene can be effectively solved.
Illustratively, the image reconstruction method according to an embodiment of the present application may be implemented in a device, apparatus, or system having a memory and a processor.
The image reconstruction method according to the embodiment of the application can be deployed at an image acquisition end, for example, can be deployed at a personal terminal or a server end.
Alternatively, the image reconstruction method according to the embodiment of the present application may also be distributively deployed at a server side (or a cloud side) and a personal terminal side. For example, an LDR image may be acquired at a client, and the client transmits the acquired LDR image to a server (or a cloud), so that the server (or the cloud) reconstructs the image.
Illustratively, extracting global context information and local context information based on the extracted image features, obtaining global context feature information and local context feature information, and obtaining a reconstructed image in combination with the global context feature information and the local context feature information, where the reconstructed image is a high dynamic range image may include: sequentially executing at least one reconstruction operation, wherein the reconstruction operation comprises at least one characteristic processing operation which is sequentially executed, and the characteristic processing operation comprises: extracting global context information of image features to be processed to obtain global context features, wherein the image features to be processed corresponding to a first feature processing operation in first reconstruction operations are extracted image features, the image features to be processed corresponding to the first feature processing operation in non-first reconstruction operations are execution results of previous reconstruction operations, the image features to be processed corresponding to the non-first feature processing operation in all the reconstruction operations are execution results of the previous feature processing operations, and the global context feature information comprises global context features; extracting local context information of the image feature to be processed to obtain a local context feature, wherein the local context feature information comprises the local context feature; fusing the global context features and the local context features together to obtain an execution result of the feature processing operation; a reconstructed image is obtained based on a result of the last reconstruction operation, the reconstructed image being a high dynamic range image.
The reconstruction operation may be performed one or more times, and the feature processing operation in any one reconstruction operation may also be performed one or more times. The result of the last reconstruction operation may be further processed to obtain an HDR image. By way of example and not limitation, the result of the last reconstruction operation may be convolved, and then the features obtained by convolution are subjected to the calculation of an activation function to obtain the required HDR image.
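A minimal sketch of this final output step is given below; the 3 × 3 kernel, the channel counts, and the choice of a sigmoid as the activation function are assumptions of the illustration (the present application only specifies a convolution followed by an activation function).

```python
import torch
import torch.nn as nn

# Illustrative output head: convolve the result of the last reconstruction
# operation, then map it through an activation function to obtain the HDR image.
output_head = nn.Sequential(
    nn.Conv2d(64, 3, kernel_size=3, padding=1),  # assumed 64 feature channels in, RGB out
    nn.Sigmoid(),
)

def reconstruct_hdr(last_features_2d: torch.Tensor) -> torch.Tensor:
    # last_features_2d: B x 64 x H x W features of the last reconstruction operation,
    # already converted back from patch-embedding form to two-dimensional form.
    return output_head(last_features_2d)
```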
For example, the extracted image features may be input to a reconstruction module in the reconstruction network model for processing, to obtain a reconstructed image, which is a high dynamic range image; the reconstruction module comprises at least one perception module which is connected in sequence, each perception module comprises at least one sub-perception module which is connected in sequence and a second convolution module which is connected with the last sub-perception module, each sub-perception module is the context perception visual converter module, and the input characteristic of each sub-perception module is the characteristic of the image to be processed corresponding to the sub-perception module. The feature processing operation is the operation included in the image feature processing method 200 described above. The feature processing operations may correspond one-to-one to the sub-perception modules, i.e., each sub-perception module may perform one feature processing operation. The reconstruction operations correspond one-to-one to the perception modules, i.e., each perception module can perform one reconstruction operation.
FIG. 5 illustrates a schematic diagram of a process flow for reconstructing a network model according to one embodiment of the present application. The reconstruction network model is a network model for reconstructing an HDR image based on a plurality of LDR images. As shown in fig. 5, the reconstructed network model may include a feature extraction module 502 and a reconstruction module 504.
The feature extraction module 502 may be implemented using any suitable network architecture as long as it can extract features that meet the input requirements of the reconstruction module 504. Illustratively, the features output by the feature extraction module 502 may be features in the form of patches embeddings.
The reconstruction module 504 may include at least one perception module, and each perception module may include at least one sub-perception module and a second convolution module, where each sub-perception module is the CA-ViT module described above. There may be N perception modules, where N ≥ 1. It is noted that fig. 5 only shows a single perception module; in the case of multiple perception modules, the perception modules may be connected end to end in sequence. The second convolution module may be any suitable convolution module and may include at least one convolution layer. In one example, the second convolution module may include a dilated convolution layer (Dilated Conv).
The extracted features output by the feature extraction module 502 are input to the first perception module in the reconstruction module 504, the features output by the first perception module are input to the second perception module, and so on. Within each perception module, the sub-perception modules are likewise connected in sequence, so that the input feature of a later sub-perception module is the output feature of the preceding sub-perception module. Illustratively, the output features of each sub-perception module may also be features in patch-embedding form.
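By way of example and not limitation, one perception module of the kind shown in fig. 5 might be sketched as follows. Assumptions of this PyTorch sketch: each sub-perception module is treated as a module that maps B × N × C patch embeddings to patch embeddings of the same shape, the second convolution module is a single dilated 3 × 3 convolution, and the merging of the module input with the convolution result is an element-wise addition (the application states only that the two are merged together).

```python
import torch
import torch.nn as nn

class PerceptionModuleSketch(nn.Module):
    # Several sub-perception (CA-ViT-like) modules executed in sequence, followed
    # by a dilated convolution; the module's input feature is then merged with
    # the convolution result to form the output of the reconstruction operation.
    def __init__(self, dim: int, sub_modules: nn.ModuleList, dilation: int = 2):
        super().__init__()
        self.sub_modules = sub_modules
        self.conv = nn.Conv2d(dim, dim, kernel_size=3,
                              padding=dilation, dilation=dilation)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = tokens.shape
        x = tokens
        for sub in self.sub_modules:                  # sequential feature processing operations
            x = sub(x)
        x2d = x.permute(0, 2, 1).reshape(b, c, h, w)  # to two-dimensional form
        x2d = self.conv(x2d)                          # convolution on the last result
        out = x2d.flatten(2).permute(0, 2, 1)         # back to patch embeddings
        return tokens + out                           # merge with the module's input feature
```

With N such perception modules chained end to end, the output of one module becomes the input of the next, matching the connection described above.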
According to the embodiment of the application, extracting the local context information of the image feature to be processed to obtain the local context feature comprises the following steps: carrying out layer normalization processing on the image feature to be processed to obtain a normalized feature; shaping the normalized feature to convert it into a two-dimensional image feature form, obtaining a shaped feature; convolving the shaped feature to obtain a first convolution feature; performing channel attention processing on the first convolution feature to obtain a channel attention feature; and shaping the channel attention feature to convert it into a block-embedded feature form, obtaining the local context feature.
The manner of extracting the local context features has already been described above, and will not be described herein again.
According to an embodiment of the present application, performing feature extraction on a plurality of images to be processed, obtaining extracted image features includes: performing initial feature extraction on a plurality of images to be processed to obtain initial image features; the initial image features are reshaped to convert the initial image features into a block-embedded feature form to obtain extracted image features.
Illustratively, the feature extraction module may include an initial feature extraction module and a third shaping module, wherein inputting a plurality of images to be processed into the feature extraction module in the reconstructed network model for feature extraction, and obtaining the extracted image features includes: inputting a plurality of images to be processed into an initial feature extraction module for initial feature extraction to obtain initial image features; inputting the initial image features into a third shaping module for shaping to convert the initial image features into a block-embedded feature form to obtain extracted image features.
The features of each LDR image may be extracted separately in an initial feature extraction module, which may output the result of the integration of these features, i.e., the initial image features. Subsequently, the initial image features output by the initial feature extraction module may be converted from the H × W × C feature form to the feature form of patch embeddings. Through the conversion of the characteristic forms, the characteristics in a proper form can be conveniently obtained and input into the sub-perception module for processing.
According to the embodiment of the application, the initial feature extraction is carried out on a plurality of images to be processed, and the obtaining of the initial image features comprises the following steps: respectively convolving the multiple images to be processed to obtain multiple second convolution characteristics which are in one-to-one correspondence with the multiple images to be processed; for any non-reference image, performing spatial attention processing on a second convolution feature corresponding to the non-reference image and a second convolution feature corresponding to a reference image to obtain a spatial attention feature corresponding to the non-reference image, wherein the reference image is one of the images to be processed, and the non-reference image is an image except the reference image in the images to be processed; and fusing the spatial attention features corresponding to all the non-reference images and the second convolution features corresponding to the reference images to obtain initial image features.
For example, the initial feature extraction module may include a third convolution module, a spatial attention module, and a fusion module, and inputting a plurality of images to be processed into the initial feature extraction module for feature extraction, and obtaining the initial image features includes: inputting a plurality of images to be processed into a third convolution module for convolution to obtain a plurality of second convolution characteristics which are in one-to-one correspondence with the plurality of images to be processed; for any non-reference image, inputting a second convolution characteristic corresponding to the non-reference image and a second convolution characteristic corresponding to a reference image into a spatial attention module to obtain a spatial attention characteristic corresponding to the non-reference image, wherein the reference image is one of the images to be processed, and the non-reference image is an image except the reference image in the images to be processed; and inputting the spatial attention features corresponding to all the non-reference images and the second convolution features corresponding to the reference images into a fusion module for fusion to obtain initial image features.
The third convolution module may include at least one sub-convolution module. In one example, the third convolution module may include a plurality of sub-convolution modules in one-to-one correspondence with the plurality of images to be processed, and each sub-convolution module may include at least one convolution layer. Each sub-convolution module may be configured to convolve a corresponding image to be processed.
Alternatively, a certain image may be selected in advance from the plurality of images to be processed as a reference image, and the remaining images may be used as non-reference images. The second convolution feature corresponding to each non-reference image and the second convolution feature corresponding to the reference image may be spliced and then input to a spatial attention module for spatial attention calculation. By way of example and not limitation, the number of spatial attention modules may be the number of images to be processed minus one, so that for each pair consisting of a non-reference image and the reference image, the corresponding features are input into a corresponding spatial attention module for processing. Alternatively, there may be a single spatial attention module, and the features of each non-reference/reference image pair may be input to it in turn for processing.
Referring to fig. 5, three images to be processed are shown. In the embodiment shown in fig. 5, the second image to be processed, whose exposure value is in the middle, is taken as the reference image, and the other two are taken as non-reference images. First, shallow features may be extracted from the three images to be processed by three convolution layers (Conv), yielding three convolution features f1, f2 and f3, respectively. The convolution feature f1 and the convolution feature f2 may then be channel-spliced (Concat), and the spliced feature is fed into a spatial attention module to obtain the spatial attention feature f1'. The spatial attention module may include at least one convolution layer and at least one activation function layer; in one example, the spatial attention module consists of two convolution layers and one sigmoid activation function layer. Similarly, the convolution feature f3 and the convolution feature f2 may be channel-spliced (Concat), and the spliced feature is fed into another spatial attention module to obtain the spatial attention feature f3'. Subsequently, the spatial attention feature f1', the spatial attention feature f3' and the convolution feature f2 may be fused together to obtain the initial image feature f_att. By way of example and not limitation, the fusion module may include a splicing module and a fourth convolution module: the three features are first spliced together, and the spliced feature may then be input into the fourth convolution module for convolution to obtain the initial image feature f_att. Alternatively, the fusion module may comprise only the splicing module, in which case the three features are directly spliced together to obtain the initial image feature f_att. Note that m1 and m3 shown in FIG. 5 are the spatial weights computed by the spatial attention modules; multiplying these weights element-wise with the original features (i.e., f1 and f3) yields the spatial attention features.
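By way of example and not limitation, the flow just described might be sketched for the three-image case as follows. This is a PyTorch sketch; the number of input channels, the 3 × 3 kernel sizes, and the channel count of the shallow features are assumptions of the illustration.

```python
import torch
import torch.nn as nn

class InitialFeatureExtractionSketch(nn.Module):
    # Shallow convolution per image to obtain f1, f2, f3; spatial attention of each
    # non-reference feature (f1, f3) against the reference feature (f2); fusion of
    # f1', f2 and f3' by channel splicing followed by a convolution, yielding f_att.
    def __init__(self, in_ch: int = 3, dim: int = 64):
        super().__init__()
        self.shallow = nn.ModuleList(
            [nn.Conv2d(in_ch, dim, 3, padding=1) for _ in range(3)])

        def make_spatial_attention():
            # Two convolution layers and one sigmoid layer, producing spatial weights m_i.
            return nn.Sequential(nn.Conv2d(2 * dim, dim, 3, padding=1),
                                 nn.Conv2d(dim, dim, 3, padding=1),
                                 nn.Sigmoid())

        self.sa1 = make_spatial_attention()
        self.sa3 = make_spatial_attention()
        self.fuse = nn.Conv2d(3 * dim, dim, 3, padding=1)  # fourth convolution module

    def forward(self, x1: torch.Tensor, x2: torch.Tensor, x3: torch.Tensor) -> torch.Tensor:
        f1 = self.shallow[0](x1)
        f2 = self.shallow[1](x2)                     # reference image feature
        f3 = self.shallow[2](x3)
        m1 = self.sa1(torch.cat([f1, f2], dim=1))    # spatial weights m1
        m3 = self.sa3(torch.cat([f3, f2], dim=1))    # spatial weights m3
        f1_att = f1 * m1                             # spatial attention feature f1'
        f3_att = f3 * m3                             # spatial attention feature f3'
        return self.fuse(torch.cat([f1_att, f2, f3_att], dim=1))  # initial feature f_att
```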
The above feature extraction manner is only an example and not a limitation of the present application. For example, a plurality of images to be processed may be each subjected to feature extraction by a convolution layer, and after obtaining a plurality of convolution features, the convolution features are directly spliced together as initial image features.
By calculating spatial attention relative to the reference image, the subsequent processing can take the reference image as a reference and generate an HDR reconstructed image that is closer to the reference image.
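By way of example and not limitation, the initial feature extraction described above can be sketched in Python (PyTorch) following the layout of fig. 5 for three images to be processed, with two spatial attention modules each consisting of two convolution layers and a sigmoid layer. The channel counts, kernel sizes and number of input channels below are assumptions chosen only for illustration and are not fixed by this application.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # Two convolution layers followed by a sigmoid, as in the example above.
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, f_non_ref, f_ref):
        m = self.net(torch.cat([f_non_ref, f_ref], dim=1))  # spatial weight (m1 or m3 in fig. 5)
        return f_non_ref * m                                 # spatial attention feature

class InitialFeatureExtraction(nn.Module):
    # Sketch for three LDR inputs, the middle one being the reference image.
    def __init__(self, in_channels=3, channels=64):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_channels, channels, kernel_size=3, padding=1) for _ in range(3)]
        )
        self.att1 = SpatialAttention(channels)
        self.att3 = SpatialAttention(channels)
        # fusion module: splicing followed by a fourth convolution
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x1, x2, x3):
        f1, f2, f3 = (conv(x) for conv, x in zip(self.convs, (x1, x2, x3)))
        f1p = self.att1(f1, f2)   # spatial attention relative to the reference feature f2
        f3p = self.att3(f3, f2)
        return self.fuse(torch.cat([f1p, f2, f3p], dim=1))   # initial image feature f_att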
According to the embodiment of the present application, the reference image may be an image of which exposure value is closest to 0 among the plurality of images to be processed.
Preferably, the selected LDR images include at least one 0EV image, in which case the 0EV image may be directly used as the reference image. When the LDR images do not contain a 0EV image, the image whose exposure value is closest to 0EV can still be selected as the reference image.
For example, assume that the number of images to be processed is three. In one example, the exposure values of the three images to be processed are +2EV, 0EV and-1 EV, respectively, and then the image of 0EV may be selected as the reference image. In another example, the exposure values of the three images to be processed are +3EV, +2EV, +1EV, respectively, and then the image of +1EV may be selected as the reference image. In yet another example, the exposure values of the three images to be processed are-1 EV, -2EV, -3EV, respectively, and then the image of-1 EV may be selected as the reference image. In yet another example, the exposure values of the three images to be processed are +3EV, +2EV, -1EV, respectively, and then the image of-1 EV may be selected as the reference image.
The image of 0EV is an image of standard exposure, which contains relatively balanced image information, and the brightness of the HDR image after reconstruction is generally relatively close to the image of 0EV, so that it is beneficial to reconstruct and obtain a relatively accurate HDR image by using the image closest to 0EV as a reference image.
According to an embodiment of the application, the reference image may be one of the images with the exposure value in the middle of the plurality of images to be processed.
In the case where the number of images to be processed is an odd number, the image with the exposure value at the center may be directly selected from all the images to be processed as a reference image. For example, assume that the number of images to be processed is three. In one example, the exposure values of the three images to be processed are +2EV, 0EV and-1 EV, respectively, and then the image of 0EV may be selected as the reference image. In another example, exposure values of three to-be-processed images are +3EV, +2EV, +1EV, respectively, and an image of +2EV may be selected as a reference image. In yet another example, the exposure values of the three images to be processed are-1 EV, -2EV, -3EV, respectively, and then the image of-2 EV may be selected as the reference image. In yet another example, the exposure values of the three images to be processed are +3EV, +2EV, -1EV, respectively, and then the image of +2EV may be selected as the reference image.
In the case where the number of images to be processed is an even number, there are two images at the middlemost. For these two images, one of them can be further selected as a reference image. The selection of one of the two images may be achieved in any suitable way of selection. For example, one of the two images may be randomly selected or an image having an exposure value closest to 0 may be selected as the reference image. For example, assume that the number of images to be processed is four. In one example, the exposure values of the four images to be processed are +3EV, +2EV, -1EV and-3 EV, respectively, and the images at the center are the images at +2EV and-1 EV, from which the image closest to 0EV (i.e., the image at-1 EV) can be selected as the reference image.
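The two selection rules described above can be summarized in a short Python sketch. This is only an illustration; the function names are assumptions and are not part of this application.

def select_closest_to_zero(exposure_values):
    # Rule 1: pick the image whose exposure value is closest to 0 EV.
    return min(range(len(exposure_values)), key=lambda i: abs(exposure_values[i]))

def select_middle_exposure(exposure_values):
    # Rule 2: pick the image whose exposure value is in the middle; for an even
    # number of images, take the middle candidate closest to 0 EV.
    order = sorted(range(len(exposure_values)), key=lambda i: exposure_values[i])
    n = len(order)
    if n % 2 == 1:
        return order[n // 2]
    candidates = [order[n // 2 - 1], order[n // 2]]
    return min(candidates, key=lambda i: abs(exposure_values[i]))

# Examples from the text:
# select_closest_to_zero([+3, +2, +1])      -> index of the +1EV image
# select_middle_exposure([+3, +2, -1, -3])  -> index of the -1EV image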
According to an embodiment of the present application, obtaining a reconstructed image based on a result of performing the last reconstruction operation includes: combining the execution result of the last reconstruction operation with the initial image feature and/or a second convolution feature corresponding to the reference image to obtain an intermediate output feature; performing convolution on the intermediate output characteristic to obtain a third convolution characteristic; and mapping the third convolution characteristic through an activation function to obtain a reconstructed image.
Illustratively, the reconstruction module may further include an output module connected after the last perception module. Inputting the extracted image features into the reconstruction module for processing to obtain the reconstructed image includes: inputting the extracted image features into at least one perception module for processing; combining the feature output by the last perception module with the initial image feature and/or the convolution feature corresponding to the reference image to obtain an intermediate output feature; and inputting the intermediate output feature into the output module for processing to obtain the reconstructed image.
The output module may include at least one convolution layer and at least one activation function layer. Alternatively, the activation function layer may be a sigmoid activation layer.
Referring to fig. 5, at the output of the last perception module, the initial image feature f_att and the convolution feature f2 are connected by means of a skip connection (shortcut), so that the feature output by the last perception module is combined with the initial image feature f_att and the convolution feature f2 to obtain an intermediate output feature. The merging may be an element-by-element summation.
One or both of the initial image feature f_att and the convolution feature f2 may optionally be combined with the feature output by the last perception module; alternatively, the feature output by the last perception module may also be output directly as the intermediate output feature without being combined with other features.
Combining the feature output by the last perception module with the initial image feature and/or the convolution feature corresponding to the reference image amounts to a residual (skip-connection) processing scheme, which facilitates the training of the reconstruction network model.
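A minimal PyTorch sketch of this output stage is given below. The feature channel count and kernel size are assumptions, and only the variant that merges both the initial image feature and the reference convolution feature is shown.

import torch.nn as nn

class OutputModule(nn.Module):
    # One convolution layer followed by a sigmoid activation layer, as described above.
    def __init__(self, channels=64, out_channels=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, out_channels, kernel_size=3, padding=1)
        self.act = nn.Sigmoid()

    def forward(self, last_perception_out, f_att, f2):
        intermediate = last_perception_out + f_att + f2   # element-by-element summation via skip connections
        return self.act(self.conv(intermediate))          # reconstructed HDR image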
According to an embodiment of the present application, obtaining a reconstructed image based on a result of performing the last reconstruction operation includes: combining the execution result of the last reconstruction operation with the initial image characteristic to obtain an intermediate output characteristic; performing convolution on the intermediate output characteristic to obtain a fourth convolution characteristic; and mapping the fourth convolution characteristic through an activation function to obtain a reconstructed image.
Illustratively, the reconstruction module further comprises an output module connected after the last perception module. Inputting the extracted image features into the reconstruction module for processing to obtain the reconstructed image comprises: inputting the extracted image features into at least one perception module for processing; combining the feature output by the last perception module with the initial image feature to obtain an intermediate output feature; and inputting the intermediate output feature into the output module for processing to obtain the reconstructed image.
The manner of merging the feature output by the last perception module with the initial image feature has been described above and will not be repeated here.
According to the embodiment of the application, the reconstruction operation further comprises a convolution operation executed according to the execution result of the last feature processing operation, and in each reconstruction operation, the to-be-processed image feature corresponding to the first feature processing operation and the execution result of the convolution operation are combined together to serve as the execution result of the reconstruction operation.
Illustratively, in each perception module, the input features of the first sub-perception module and the output features of the second convolution module are merged together as the output features of the perception module.
Referring to fig. 5, the input features of the first CA-ViT module are connected to the output of the second convolution module via a skip connection, so that the input features of the first CA-ViT module can be combined with the output features of the second convolution module. The combination here may be an element-by-element summation. Similar to the above embodiment of combining the feature output by the last perception module with the initial image feature and/or the second convolution feature corresponding to the reference image, the combining scheme of this embodiment is a residual processing manner, which facilitates the training of the reconstruction network model.
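The residual merge inside a reconstruction operation can be sketched as follows; here perception_modules and conv stand for the sequentially executed feature processing operations and the trailing convolution, and the names are assumptions used only for illustration.

def reconstruction_operation(x, perception_modules, conv):
    # x: the to-be-processed image feature fed to the first feature processing operation
    y = x
    for module in perception_modules:   # sequentially executed feature processing operations
        y = module(y)
    return x + conv(y)                  # element-by-element summation via the skip connection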
According to an embodiment of the application, the operations of performing feature extraction on a plurality of images to be processed, performing global context information extraction and local context information extraction based on the extracted image features, and obtaining a reconstructed image by combining the global context feature information and the local context feature information are implemented by a reconstructed network model, and the method 400 may further include: acquiring a training image, wherein the training image comprises a plurality of sample images and an annotation image corresponding to the plurality of sample images, the sample images are LDR images, and the annotation image is an HDR image; processing the plurality of sample images using the initial reconstructed network model to obtain a predicted reconstructed image; calculating a reconstruction loss term based on the predicted reconstructed image and the labeled image; acquiring predicted image characteristics and labeled image characteristics corresponding to the predicted reconstructed image and the labeled image respectively; calculating a perception loss item based on the predicted image characteristic and the annotated image characteristic; calculating a total loss based on the reconstruction loss term and the perceptual loss term; and optimizing the parameters of the initial reconstruction network model based on the total loss to obtain the reconstruction network model.
The initial reconstruction network model is the reconstruction network model whose parameters still take their initial values. The way in which the plurality of sample images are processed by the initial reconstruction network model is consistent with the way in which the plurality of images to be processed are processed by the reconstruction network model, so a person skilled in the art can understand how the initial reconstruction network model is used and how the predicted reconstructed image is obtained during training, which is not described here again.
Those skilled in the art will appreciate that the training process of the reconstruction network model may include: inputting a group of sample images into the initial reconstruction network model to obtain the corresponding predicted reconstructed image, substituting the predicted reconstructed image and the annotated image into a loss function to calculate the loss terms, and optimizing the parameters of the initial reconstruction network model through back propagation and a gradient descent algorithm until the value of the loss meets the requirement. After the current optimization is finished, the next group of sample images is input into the optimized reconstruction network model, and the optimization process is repeated until the maximum number of iterations is reached or the model converges, completing the training process. The process of optimizing the parameters of the reconstruction network model based on the loss terms can be implemented in a conventional optimization manner, which is not described herein again. The manner in which the loss terms involved in the training of the reconstruction network model are calculated is mainly described herein.
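By way of example and not limitation, the training procedure described above can be sketched as the following Python loop; the data pipeline and variable names are assumptions, and the two loss terms are detailed in formulas (1), (2) and (4) below.

def train_one_epoch(model, loader, optimizer, recon_loss_fn, perceptual_loss_fn, lam=0.01):
    # lam corresponds to the preset scaling factor λ in formula (1)
    for ldr_images, hdr_gt in loader:       # a group of sample images and the annotated HDR image
        pred = model(*ldr_images)           # predicted reconstructed image
        loss = recon_loss_fn(pred, hdr_gt) + lam * perceptual_loss_fn(pred, hdr_gt)
        optimizer.zero_grad()
        loss.backward()                     # back propagation
        optimizer.step()                    # gradient descent update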
Unlike existing loss function calculation methods that use pixel-level loss only, the loss function used in embodiments of the present application may include a reconstruction loss term and a perceptual loss term, as follows:

$\mathcal{L} = \mathcal{L}_r + \lambda \mathcal{L}_p$  (1)

In formula (1), $\mathcal{L}$ represents the total loss, $\mathcal{L}_r$ and $\mathcal{L}_p$ respectively represent the reconstruction loss term and the perceptual loss term, and $\lambda$ represents a preset scaling factor. The preset scaling factor $\lambda$ may be any suitable value, for example 0.01.

The reconstruction loss term is used to calculate the error between the predicted reconstructed image and the annotated image (the ground-truth label). By way of example and not limitation, the error may be a minimum absolute value ($\ell_1$) error, namely:

$\mathcal{L}_r = \lVert \mathcal{T}(I_H) - \mathcal{T}(I_{GT}) \rVert_1$  (2)

wherein $I_{GT}$ represents the annotated image, $I_H$ represents the predicted reconstructed image, and $\mathcal{T}(\cdot)$ represents the $\mu$-law function. The expression for the $\mu$-law function is as follows:

$\mathcal{T}(x) = \dfrac{\log(1 + \mu x)}{\log(1 + \mu)}$  (3)

where $\mu$ is a predetermined compression parameter. The $\mu$-law function represents tone mapping of the input image (represented by x in formula (3)), i.e., transforming the original input image onto another luminance domain. For better display, the reconstructed HDR image is usually passed through the $\mathcal{T}(\cdot)$ transformation before being shown on a display.
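Formulas (2) and (3) can be written as the following PyTorch sketch. The value of the compression parameter μ is not fixed by this application; μ = 5000 is used here only as a commonly adopted example.

import math
import torch
import torch.nn.functional as F

def mu_law_tonemap(x, mu=5000.0):
    # T(x) = log(1 + μx) / log(1 + μ), applied element-wise to images in [0, 1]  (formula (3))
    return torch.log(1.0 + mu * x) / math.log(1.0 + mu)

def reconstruction_loss(pred_hdr, gt_hdr):
    # ℓ1 error between the tone-mapped predicted and annotated images  (formula (2))
    return F.l1_loss(mu_law_tonemap(pred_hdr), mu_law_tonemap(gt_hdr))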
The perceptual loss term may be used to calculate the error on the image features corresponding to the predicted reconstructed image and the annotated image respectively. By way of example and not limitation, this error may also be an $\ell_1$ error. The perceptual loss term is used to constrain the quality of the reconstructed HDR image at the feature level. Due to the addition of the perceptual loss term, in the process of optimizing the reconstruction network model based on the loss terms, the error between the image features corresponding to the predicted reconstructed image and the annotated image is also taken into account as an optimization factor, which helps improve the quality of the HDR image output by the reconstruction network model.
Illustratively, obtaining the predicted image feature and the annotated image feature corresponding to each of the predicted reconstructed image and the annotated image comprises: carrying out tone mapping on the predicted reconstructed image and the annotated image respectively to obtain a new predicted reconstructed image and a new annotated image; and respectively inputting the new prediction reconstruction image and the new annotation image into a pre-training network model to obtain the corresponding predicted image characteristic and annotation image characteristic.
The perceptual loss term may be defined as:

$\mathcal{L}_p = \sum_{i=1}^{m} \lVert \phi_i(\mathcal{T}(I_H)) - \phi_i(\mathcal{T}(I_{GT})) \rVert_1$  (4)

wherein $\phi_i(\cdot)$ represents the output features of the i-th layer among the m layers of the pre-trained network model that participate in the training of the reconstruction network model, i = 1, 2, ..., m.

The m layers of the pre-trained network model that participate in the training of the reconstruction network model may include any m network layers of the pre-trained network model other than the input layer and the output layer, that is, any m hidden layers of the pre-trained network model. m is an integer greater than or equal to 1. The size of m can be set as required, and the present application is not limited in this respect. Preferably, the m layers of the pre-trained network model that participate in the training of the reconstruction network model are the m layers of the pre-trained network model closest to the output layer (i.e., the last m hidden layers).

As mentioned above, for better display, the reconstructed HDR image is usually passed through the $\mathcal{T}(\cdot)$ transformation before being shown on a display. Therefore, when features are extracted using the pre-trained network model, the predicted reconstructed image and the annotated image may each be subjected to the $\mathcal{T}(\cdot)$ transformation before feature extraction, rather than extracting features directly from the predicted reconstructed image and the annotated image. Since the perceptual loss term is calculated based on the features of the $\mathcal{T}(\cdot)$-transformed images, the HDR image reconstructed by the reconstruction network model trained with this perceptual loss term can have a good display effect.
By way of example and not limitation, the pre-trained network model may be a VGG16 model.
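A possible PyTorch sketch of the perceptual loss term of formula (4) with a VGG16 backbone is given below. The specific layer indices (i.e., which VGG16 feature layers play the role of the m hidden layers) are an assumption chosen only for illustration.

import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    def __init__(self, layer_ids=(3, 8, 15, 22)):    # assumed VGG16 feature layers (relu1_2 ... relu4_3)
        super().__init__()
        # note: newer torchvision versions use the weights=... argument instead of pretrained=True
        backbone = vgg16(pretrained=True).features.eval()
        for p in backbone.parameters():
            p.requires_grad = False                   # the pre-trained network is kept frozen
        self.backbone = backbone
        self.layer_ids = set(layer_ids)

    def forward(self, pred_tonemapped, gt_tonemapped):
        # inputs are the μ-law tone-mapped predicted and annotated images
        loss, x, y = 0.0, pred_tonemapped, gt_tonemapped
        for i, layer in enumerate(self.backbone):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                loss = loss + F.l1_loss(x, y)         # ℓ1 error on the i-th layer features
        return loss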
In training, the training images may first be cropped to a size of 128×128. Illustratively, the reconstruction network model may be trained using an Adam optimizer. The parameters of the Adam optimizer may be set as follows: β1 = 0.9, β2 = 0.999, ε = 10^(-8), and the learning rate is 2e-4. The training code can be implemented using the PyTorch framework.
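These settings correspond to the following PyTorch configuration; the placeholder model stands for the reconstruction network model, and the use of random cropping is an assumption (the text only states that training images are cropped to 128×128).

import torch
import torch.nn as nn
from torchvision import transforms

model = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # placeholder for the reconstruction network model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999), eps=1e-8)
crop = transforms.RandomCrop(128)                    # e.g. crop training images to 128×128 patches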
Compared with existing image reconstruction methods, the reconstruction network model based on the CA-ViT module can better remove ghosting and restore visually consistent details in overexposed and occluded regions, so that an HDR image of higher quality can be reconstructed.
According to another aspect of the present application, an image feature processing apparatus is provided. Fig. 6 shows a schematic block diagram of an image feature processing apparatus 600 according to an embodiment of the present application.
As shown in fig. 6, the image feature processing apparatus 600 according to the embodiment of the present application includes an obtaining module 610, a first extracting module 620, a second extracting module 630, and a fusing module 640. The respective modules may perform the respective steps/functions of the image feature processing method described above in connection with fig. 2, respectively. Only the main functions of the respective components of the image feature processing apparatus 600 will be described below, and details that have been described above will be omitted.
The obtaining module 610 is configured to obtain a feature of an image to be processed, where the feature of the image to be processed is obtained based on an image feature extracted from the image to be processed. The obtaining module 610 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
The first extraction module 620 is configured to perform global context information extraction on the image feature to be processed to obtain a global context feature. The first extraction module 620 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
The second extraction module 630 is configured to perform local context information extraction on the image feature to be processed, so as to obtain a local context feature. The second extraction module 630 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
The fusion module 640 is configured to fuse the global context feature and the local context feature together to obtain an execution result of the image feature processing apparatus 600. The fusion module 640 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
According to another aspect of the present application, an image reconstruction apparatus is provided. Fig. 7 shows a schematic block diagram of an image reconstruction apparatus 700 according to an embodiment of the present application.
As shown in fig. 7, the image reconstruction apparatus 700 according to the embodiment of the present application includes an acquisition module 710, a first extraction module 720, and a second extraction module 730. The respective modules may perform the respective steps/functions of the image reconstruction method described above in connection with fig. 4, respectively. Only the main functions of the respective components of the image reconstruction apparatus 700 will be described below, and the details that have been described above will be omitted.
The obtaining module 710 is configured to obtain a plurality of images to be processed, where the images to be processed are low dynamic range images. The obtaining module 710 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
The first extraction module 720 is configured to perform feature extraction on a plurality of images to be processed, so as to obtain extracted image features. The first extraction module 720 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
The second extraction module 730 is configured to perform global context information extraction and local context information extraction based on the extracted image features, obtain global context feature information and local context feature information, and obtain a reconstructed image by combining the global context feature information and the local context feature information, where the reconstructed image is a high dynamic range image. The second extraction module 730 may be implemented by the processor 102 in the electronic device shown in fig. 1 executing program instructions stored in the storage 104.
FIG. 8 shows a schematic block diagram of an electronic device 800 according to an embodiment of the application. The electronic device 800 includes a storage (i.e., memory) 810, a processor 820, and computer programs stored on the memory.
The storage 810 stores computer program instructions for implementing respective steps in the image feature processing method according to an embodiment of the present application.
The processor 820 is used for executing computer program instructions stored in the storage device 810 to execute the corresponding steps of the image feature processing method according to the embodiment of the application.
In one embodiment, the computer program instructions, when executed by the processor 820, are operable to perform the steps of: acquiring to-be-processed image features, wherein the to-be-processed image features are obtained on the basis of image features extracted from to-be-processed images; extracting global context information of the image features to be processed to obtain global context features; extracting local context information of the image features to be processed to obtain local context features; and fusing the global context feature and the local context feature together to obtain an execution result.
FIG. 9 shows a schematic block diagram of an electronic device 900 according to an embodiment of the present application. The electronic device 900 includes a storage (i.e., memory) 910, a processor 920, and computer programs stored on the memory.
The storage 910 stores computer program instructions for implementing respective steps in an image reconstruction method according to an embodiment of the present application.
The processor 920 is configured to execute computer program instructions stored in the storage device 910 to perform the corresponding steps of the image reconstruction method according to the embodiment of the present application.
In one embodiment, the computer program instructions, when executed by the processor 920, are configured to perform the steps of: acquiring a plurality of images to be processed, wherein the images to be processed are low dynamic range images; performing feature extraction on a plurality of images to be processed to obtain extracted image features; extracting global context information and local context information based on the extracted image features to obtain global context feature information and local context feature information, and obtaining a reconstructed image by combining the global context feature information and the local context feature information, wherein the reconstructed image is a high dynamic range image.
Furthermore, according to the embodiment of the present application, there is also provided a computer readable storage medium on which a computer program/instruction is stored, which is used for executing the corresponding steps of the image feature processing method according to the embodiment of the present application when the computer program/instruction is executed by a computer or a processor, and is used for realizing the corresponding modules in the image feature processing device according to the embodiment of the present application. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, or any combination of the above storage media.
In one embodiment, the computer program/instructions when executed are for performing the steps of: acquiring the characteristics of an image to be processed, wherein the characteristics of the image to be processed are obtained based on the image characteristics extracted from the image to be processed; extracting global context information of the image features to be processed to obtain global context features; extracting local context information of the image features to be processed to obtain local context features; and fusing the global context feature and the local context feature together to obtain an execution result.
Furthermore, according to the embodiment of the present application, there is also provided a computer readable storage medium on which a computer program/instruction is stored, which is used for executing the corresponding steps of the image reconstruction method according to the embodiment of the present application when the computer program/instruction is executed by a computer or a processor, and is used for realizing the corresponding modules in the image reconstruction device according to the embodiment of the present application. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media.
In one embodiment, the computer program/instructions when executed are for performing the steps of: acquiring a plurality of images to be processed, wherein the images to be processed are low dynamic range images; performing feature extraction on a plurality of images to be processed to obtain extracted image features; extracting global context information and local context information based on the extracted image features to obtain global context feature information and local context feature information, and obtaining a reconstructed image by combining the global context feature information and the local context feature information, wherein the reconstructed image is a high dynamic range image.
Furthermore, according to an embodiment of the present application, there is also provided a computer program product including computer programs/instructions which, when executed by a processor, implement the above-mentioned image feature processing method or the above-mentioned image reconstruction method.
Although the example embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the above-described example embodiments are merely illustrative and are not intended to limit the scope of the present application thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present application. All such changes and modifications are intended to be included within the scope of the present application as claimed in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, a division of a unit is only one type of division of a logical function, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the description of exemplary embodiments of the present application, various features of the present application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various application aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where such features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some of the blocks in an image feature processing apparatus or an image reconstruction apparatus according to embodiments of the present application. The present application may also be embodied as apparatus programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website, or provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
The above description is only for the specific embodiments of the present application or the description thereof, and the protection scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope disclosed in the present application, and all the changes or substitutions should be covered by the protection scope of the present application. The protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. An image reconstruction method, comprising:
acquiring a plurality of images to be processed, wherein the images to be processed are low dynamic range images;
performing feature extraction on the plurality of images to be processed to obtain extracted image features;
extracting global context information and local context information based on the extracted image features to obtain global context feature information and local context feature information, and obtaining a reconstructed image by combining the global context feature information and the local context feature information, wherein the reconstructed image is a high dynamic range image.
2. The method of claim 1, wherein,
the extracting global context information and local context information based on the extracted image features to obtain global context feature information and local context feature information, and obtaining a reconstructed image by combining the global context feature information and the local context feature information, including:
sequentially executing at least one reconstruction operation, wherein the reconstruction operation comprises at least one feature processing operation which is sequentially executed, and the feature processing operation comprises:
extracting global context information of image features to be processed to obtain global context features, wherein the image features to be processed corresponding to a first feature processing operation in first reconstruction operations are the extracted image features, the image features to be processed corresponding to the first feature processing operation in non-first reconstruction operations are the execution results of the previous reconstruction operations, the image features to be processed corresponding to the non-first feature processing operations in all the reconstruction operations are the execution results of the previous feature processing operations, and the global context feature information comprises the global context features;
extracting local context information of the image features to be processed to obtain local context features, wherein the local context feature information comprises the local context features;
fusing the global context feature and the local context feature together to obtain an execution result of the feature processing operation;
a reconstructed image is obtained based on a result of the execution of the last reconstruction operation, the reconstructed image being a high dynamic range image.
3. The method as claimed in claim 2, wherein said extracting local context information from the image feature to be processed to obtain a local context feature comprises:
carrying out layer normalization processing on the image features to be processed to obtain normalized features;
shaping the normalized features to convert the normalized features into a two-dimensional image feature form to obtain shaped features;
performing convolution on the shaping feature to obtain a first convolution feature;
performing channel attention processing on the first convolution features to obtain channel attention features;
shaping the channel attention feature to convert the channel attention feature into a block embedded feature form to obtain the local context feature.
4. The method of any one of claims 1-3, wherein the performing feature extraction on the plurality of images to be processed to obtain extracted image features comprises:
performing initial feature extraction on the plurality of images to be processed to obtain initial image features;
shaping the initial image features to convert the initial image features into a block-embedded feature form, obtaining the extracted image features.
5. The method as claimed in claim 4, wherein said performing initial feature extraction on said plurality of images to be processed, obtaining initial image features comprises:
performing convolution on the multiple images to be processed respectively to obtain multiple second convolution characteristics corresponding to the multiple images to be processed one by one;
for any non-reference image, performing spatial attention processing on a second convolution feature corresponding to the non-reference image and a second convolution feature corresponding to a reference image to obtain a spatial attention feature corresponding to the non-reference image, wherein the reference image is one of the images to be processed, and the non-reference image is an image except the reference image in the images to be processed;
and fusing the spatial attention features corresponding to all the non-reference images and the second convolution features corresponding to the reference images to obtain the initial image features.
6. The method of claim 5, wherein the reference image is one of the plurality of images to be processed in which the exposure value is closest to 0 or one of the plurality of images to be processed in which the exposure value is at the middle most.
7. The method of claim 5 when dependent on claim 2 or 3, wherein said obtaining a reconstructed image based on the result of the last reconstruction operation performed comprises:
combining the execution result of the last reconstruction operation with the initial image feature and/or a second convolution feature corresponding to the reference image to obtain an intermediate output feature;
performing convolution on the intermediate output characteristic to obtain a third convolution characteristic;
and mapping the third convolution characteristic through an activation function to obtain the reconstructed image.
8. The method of claim 4 when dependent on claim 2 or 3, wherein said obtaining a reconstructed image based on the result of the performance of the last reconstruction operation comprises:
combining the execution result of the last reconstruction operation with the initial image characteristic to obtain an intermediate output characteristic;
performing convolution on the intermediate output characteristic to obtain a fourth convolution characteristic;
and mapping the fourth convolution characteristic through an activation function to obtain the reconstructed image.
9. The method of claim 2 or 3, wherein the reconstruction operation further comprises a convolution operation performed on the result of the execution of the last feature processing operation,
in each reconstruction operation, the image feature to be processed corresponding to the first feature processing operation and the execution result of the convolution operation are combined together to be used as the execution result of the reconstruction operation.
10. The method of any one of claims 1-9, wherein the performing feature extraction on the plurality of images to be processed, the performing global context information extraction and local context information extraction based on the extracted image features, and the combining the global context feature information and the local context feature information to obtain a reconstructed image is performed by a reconstruction network model, the method further comprising:
acquiring training images, wherein the training images comprise a plurality of sample images and annotation images corresponding to the sample images, the sample images are low dynamic range images, and the annotation images are high dynamic range images;
processing the plurality of sample images using an initial reconstructed network model to obtain a predicted reconstructed image;
calculating a reconstruction loss term based on the predicted reconstructed image and the annotated image;
acquiring predicted image characteristics and annotated image characteristics corresponding to the predicted reconstructed image and the annotated image respectively;
calculating a perception loss term based on the predicted image features and the annotated image features;
calculating a total loss based on the reconstruction loss term and the perceptual loss term;
and optimizing the parameters of the initial reconstructed network model based on the total loss to obtain the reconstructed network model.
11. The method of claim 10, wherein said obtaining predictive image features and annotation image features corresponding to each of said predictively reconstructed image and said annotation image comprises:
carrying out tone mapping on the predicted reconstructed image and the annotated image respectively to obtain a new predicted reconstructed image and a new annotated image;
and respectively inputting the new prediction reconstruction image and the new annotation image into a pre-training network model to obtain the corresponding prediction image characteristics and the corresponding annotation image characteristics.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the image reconstruction method according to any one of claims 1 to 11.
13. A computer-readable storage medium, on which a computer program/instructions are stored, which, when being executed by a processor, carry out the image reconstruction method according to any one of claims 1-11.
14. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the image reconstruction method according to any of claims 1-11.
CN202210932958.9A 2022-08-04 2022-08-04 Image reconstruction method, electronic device, storage medium, and computer program product Pending CN115564662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210932958.9A CN115564662A (en) 2022-08-04 2022-08-04 Image reconstruction method, electronic device, storage medium, and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210932958.9A CN115564662A (en) 2022-08-04 2022-08-04 Image reconstruction method, electronic device, storage medium, and computer program product

Publications (1)

Publication Number Publication Date
CN115564662A true CN115564662A (en) 2023-01-03

Family

ID=84738274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210932958.9A Pending CN115564662A (en) 2022-08-04 2022-08-04 Image reconstruction method, electronic device, storage medium, and computer program product

Country Status (1)

Country Link
CN (1) CN115564662A (en)

Similar Documents

Publication Publication Date Title
CN112308200B (en) Searching method and device for neural network
CN110910486B (en) Indoor scene illumination estimation model, method and device, storage medium and rendering method
CN111598998A (en) Three-dimensional virtual model reconstruction method and device, computer equipment and storage medium
CN111402130B (en) Data processing method and data processing device
CN111507333B (en) Image correction method and device, electronic equipment and storage medium
CN113095254B (en) Method and system for positioning key points of human body part
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
JP7355851B2 (en) Method and apparatus for identifying videos
CN112598597A (en) Training method of noise reduction model and related device
CN113592726A (en) High dynamic range imaging method, device, electronic equipment and storage medium
CN112862828A (en) Semantic segmentation method, model training method and device
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN113066018A (en) Image enhancement method and related device
CN110827341A (en) Picture depth estimation method and device and storage medium
CN112381707A (en) Image generation method, device, equipment and storage medium
CN114202454A (en) Graph optimization method, system, computer program product and storage medium
CN116977804A (en) Image fusion method, electronic device, storage medium and computer program product
CN114581316A (en) Image reconstruction method, electronic device, storage medium, and program product
CN114820755B (en) Depth map estimation method and system
CN114648604A (en) Image rendering method, electronic device, storage medium and program product
CN115565039A (en) Monocular input dynamic scene new view synthesis method based on self-attention mechanism
CN115564662A (en) Image reconstruction method, electronic device, storage medium, and computer program product
CN115620403A (en) Living body detection method, electronic device, and storage medium
CN115294353A (en) Crowd scene image subtitle description method based on multi-layer attribute guidance
CN115115518A (en) Method, apparatus, device, medium and product for generating high dynamic range image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination