CN115311149A - Image denoising method, model, computer-readable storage medium and terminal device
- Publication number
- CN115311149A (application CN202110500670.XA)
- Authority
- CN
- China
- Prior art keywords
- image
- attention
- feature map
- module
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/70—Denoising; Smoothing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Landscapes
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Image Processing (AREA)
Abstract
The application discloses an image denoising method, a model, a computer-readable storage medium and a terminal device. The method acquires a plurality of feature maps corresponding to an image to be processed; determines an attention feature map corresponding to the image to be processed based on a target feature map among the plurality of feature maps, wherein the target feature map is the feature map with the smallest image size among the plurality of feature maps; and determines a denoised image corresponding to the image to be processed based on the attention feature map and the feature maps other than the target feature map. In the embodiments of the application, a plurality of feature maps carrying image detail information are obtained, and the attention feature map is determined using an attention mechanism and the target feature map, so that the attention feature map retains the color features and texture features of the image to be processed. This increases the image detail information in the denoised image determined based on the attention feature map and the plurality of feature maps, and thereby improves the image quality of the denoised image.
Description
Technical Field
The present disclosure relates to image processing, and in particular, to an image denoising method, a model, a computer-readable storage medium, and a terminal device.
Background
With the rapid development of terminal devices, terminal devices can be used in a variety of shooting scenes, among which night scene shooting is common. However, during night scene shooting the photosensitive element of the shooting device is in a low-light environment for a long time, so the captured image carries image noise. The image noise degrades image quality and can even obscure part of the image content, which inconveniences users.
Disclosure of Invention
In view of the defects in the prior art, the technical problem to be solved by the present application is to provide an image denoising method.
In order to solve the foregoing technical problem, a first aspect of the embodiments of the present application provides an image denoising method, where the image denoising method includes:
acquiring a plurality of feature maps corresponding to an image to be processed;
determining an attention feature map corresponding to the image to be processed based on a target feature map in a plurality of feature maps, wherein the target feature map is the feature map with the smallest image size in the plurality of feature maps;
and determining a denoised image corresponding to the image to be processed based on the attention feature map and the feature maps except the target feature map.
In the image denoising method, the image size of the denoised image is larger than that of the image to be processed.
In the image denoising method, an image denoising model is applied; the image denoising model comprises a down-sampling module, and the down-sampling module comprises a convolution unit and a plurality of down-sampling units; the acquiring of the plurality of feature maps corresponding to the image to be processed specifically includes:
inputting the image to be processed into the convolution unit, and outputting a reference feature map through the convolution unit;
determining a plurality of candidate feature maps corresponding to the image to be processed based on the reference feature map by using the plurality of down-sampling units;
and taking the reference feature map and the candidate feature maps as a plurality of feature maps corresponding to the image to be processed.
In the image denoising method, the plurality of down-sampling units are sequentially cascaded and correspond one-to-one to the plurality of candidate feature maps; each candidate feature map is the output item of its corresponding down-sampling unit; the input item of the first down-sampling unit in the cascade order comprises the reference feature map; and, for any two adjacent down-sampling units in the cascade order, the output item of the preceding down-sampling unit is the input item of the following down-sampling unit.
In the image denoising method, an image denoising model is applied; the image denoising model comprises a plurality of cascaded attention modules, among which there are at least a first attention module and a second attention module; the first attention module precedes the second attention module in the cascade order; and the input items of the second attention module comprise the output item of the attention module immediately preceding the second attention module and the input item of the first attention module.
The image denoising method, wherein the determining of the attention feature map corresponding to the image to be processed based on the target feature map in the feature maps specifically includes:
inputting the target feature map into the first attention module in the cascade order, and outputting a first attention feature map through that attention module;
taking the first attention feature map as a target attention feature map, and taking an attention module located at the second position as a target attention module;
detecting whether the target attention module is a second attention module;
if the target attention module is not the second attention module, taking the target attention feature map as an input image of the target attention module; if the target attention module is a second attention module, taking the target attention feature map and an input item of a first attention module corresponding to the target attention module as an input image of the target attention module;
determining, with a target attention module, a second attention feature map based on the input image;
and taking the second attention feature map as a target attention feature map, taking a subsequent attention module of the target attention module as a target attention module, and continuing to execute the step of detecting whether the target attention module is the second attention module until the target attention module is the last attention module.
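The iteration over the cascaded attention modules described above can be sketched as a simple loop. This is only an illustrative sketch, not the applicant's implementation: the function name `run_attention_cascade`, the `skip_from` mapping (index of a second attention module to the index of its first attention module) and the convention that every module is a callable taking a list of inputs are assumptions introduced here.

```python
def run_attention_cascade(target_feature_map, modules, skip_from):
    """Run cascaded attention modules with optional long-range inputs.

    modules:   list of callables, each taking a list of input items
               (hypothetical stand-ins for real attention modules).
    skip_from: dict mapping the index of a "second attention module" to the
               index of its "first attention module" (assumed representation).
    """
    primary_inputs = {}              # index -> input item fed to that module
    current = target_feature_map     # the first module receives the target feature map
    for index, module in enumerate(modules):
        primary_inputs[index] = current
        inputs = [current]
        if index in skip_from:
            # A second attention module also receives the input item
            # of its corresponding first attention module.
            inputs.append(primary_inputs[skip_from[index]])
        current = module(inputs)     # becomes the new target attention feature map
    return current                   # output of the last attention module


# Toy usage with numbers standing in for feature maps.
toy_modules = [lambda xs: sum(xs) + 1] * 4
result = run_attention_cascade(10, toy_modules, skip_from={3: 0})
```

In the toy usage, the fourth module is treated as a second attention module whose first attention module is the first one in the cascade, so it receives both the preceding output and the original target feature map.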
In the image denoising method, the attention module comprises a plurality of cascaded attention units and a fusion unit; the determining, by the target attention module, of the second attention feature map based on the input image specifically includes:
inputting the input image into the first attention unit in the cascade order, and outputting a reference feature map through the plurality of attention units;
and inputting the reference feature map and the input image into the fusion unit, and outputting a second attention feature map corresponding to the image to be processed through the fusion unit.
In the image denoising method, the attention unit comprises a convolution block, a first attention block, a second attention block, a first fusion block and a second fusion block; the output item of the convolution block is the input item of the first attention block and the input item of the second attention block; the input items of the first fusion block include the output item of the first attention block and the output item of the second attention block; the input items of the second fusion block include the input item of the convolution block and the output item of the first fusion block.
In the image denoising method, the first attention block comprises a first global mean pooling layer, a global maximum pooling layer, a first fusion layer and a second fusion layer; the output item of the convolution block is the input item of the first global mean pooling layer and the input item of the global maximum pooling layer; the input items of the first fusion layer comprise the output item of the first global mean pooling layer and the output item of the global maximum pooling layer; and the input items of the second fusion layer comprise the output item of the first fusion layer and the output item of the convolution block.
In the image denoising method, the second attention block comprises a second global mean pooling layer and a third fusion layer; the output item of the convolution block is the input item of the second global mean pooling layer, and the input items of the third fusion layer comprise the output item of the second global mean pooling layer and the output item of the convolution block.
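The data flow inside one attention unit can be illustrated with a small sketch in plain Python. The text specifies only which items feed which blocks; the concrete operations are not given. The sketch therefore assumes, purely for illustration, that the first fusion layer adds the two pooled values and applies a sigmoid, that the second and third fusion layers scale each channel by the resulting weight, that both fusion blocks use element-wise addition, and that the convolution block is an identity stand-in. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def global_mean(channel):
    values = [v for row in channel for v in row]
    return sum(values) / len(values)


def global_max(channel):
    return max(v for row in channel for v in row)


def scale(channel, weight):
    return [[v * weight for v in row] for row in channel]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def first_attention_block(conv_out):
    out = []
    for channel in conv_out:
        # First fusion layer (assumed): combine mean-pooled and max-pooled values.
        weight = sigmoid(global_mean(channel) + global_max(channel))
        # Second fusion layer (assumed): re-weight the convolution output.
        out.append(scale(channel, weight))
    return out


def second_attention_block(conv_out):
    # Third fusion layer (assumed): re-weight by the mean-pooled value.
    return [scale(ch, sigmoid(global_mean(ch))) for ch in conv_out]


def attention_unit(unit_input, conv_block=lambda features: features):
    conv_out = conv_block(unit_input)           # identity stand-in for a real convolution
    branch_1 = first_attention_block(conv_out)
    branch_2 = second_attention_block(conv_out)
    # First fusion block (assumed addition): merge both attention branches.
    fused = [add(c1, c2) for c1, c2 in zip(branch_1, branch_2)]
    # Second fusion block (assumed addition): residual link to the block input.
    return [add(cin, cf) for cin, cf in zip(unit_input, fused)]


# A feature map is a list of channels; each channel is a list of rows.
ones = [[[1.0, 1.0], [1.0, 1.0]]]
unit_output = attention_unit(ones)
```

The residual addition in the second fusion block keeps the unit's input available to later layers, which is one common reason for wiring the input of a convolution block into a final fusion step.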
In the image denoising method, an image denoising model is applied; the image denoising model comprises an up-sampling module, and the up-sampling module comprises a plurality of cascaded up-sampling units. The input item of the first up-sampling unit in the cascade order is the attention feature map, and, for any two adjacent up-sampling units in the cascade order, the output item of the preceding up-sampling unit is an input item of the following up-sampling unit. The input items of every up-sampling unit other than the first also comprise one of the plurality of feature maps, and the feature maps corresponding to the respective up-sampling units are different from one another.
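A minimal sketch of the cascaded up-sampling units follows. The text does not state how each unit enlarges its input or how it combines the previous output with its feature map, so the sketch assumes nearest-neighbor 2x up-sampling and element-wise addition; the names `upsample2x`, `fuse` and `upsample_module` are hypothetical.

```python
def upsample2x(feature_map):
    """Nearest-neighbor 2x up-sampling of a 2-D list (assumed operation)."""
    out = []
    for row in feature_map:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse(a, b):
    """Element-wise addition of two equally sized feature maps (assumed fusion)."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("feature maps must have the same size to be fused")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def upsample_module(attention_map, skip_maps):
    # The first up-sampling unit receives only the attention feature map.
    current = upsample2x(attention_map)
    # Every later unit receives the previous output plus its own feature map.
    for skip in skip_maps:
        current = upsample2x(fuse(current, skip))
    return current


attention = [[1.0]]
skips = [
    [[1.0] * 2 for _ in range(2)],   # feature map matching the first unit's output
    [[2.0] * 4 for _ in range(4)],   # feature map matching the second unit's output
]
restored = upsample_module(attention, skips)
```

Feeding a different feature map into each unit mirrors the requirement that the feature maps corresponding to the respective up-sampling units differ from one another.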
The image denoising method is implemented by applying an image denoising model, and the training method of the image denoising model specifically comprises the following steps:
acquiring a training sample set, wherein the training sample set comprises a plurality of training image groups, each training image group in the plurality of training image groups comprises a training image and a scene image, and the training image is determined by adding noise to the scene image;
inputting the training images in the training sample set into a preset network model, and outputting a predicted image corresponding to the training image through the preset network model;
and training the preset network model based on the predicted image and the scene image corresponding to the training image to obtain the image denoising model.
In the image denoising method, the acquiring of the training sample set specifically includes:
acquiring a plurality of scene images, wherein the ambient brightness corresponding to each scene image in the plurality of scene images meets a preset condition;
for each scene image in a plurality of scene images, determining the simulated noise corresponding to the scene image, and adding the simulated noise into the scene image to obtain a training image corresponding to the scene image, wherein the simulated noise is greater than a preset noise threshold;
and generating a training sample set based on each scene image and the training images corresponding to the scene images.
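The construction of the training sample set can be sketched as follows. The noise model, the brightness condition and the noise measure are not specified in the text; the sketch assumes zero-mean Gaussian noise whose standard deviation is drawn above the preset threshold, pixel values in [0, 1], and a caller-supplied brightness predicate. The names `build_training_set`, `brightness_ok` and `max_sigma` are hypothetical.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip to [0, 1] (assumed noise model)."""
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in image]


def build_training_set(scenes, brightness_ok, noise_threshold, max_sigma, seed=0):
    """scenes: list of (scene_image, ambient_brightness) pairs."""
    rng = random.Random(seed)
    groups = []
    for scene_image, brightness in scenes:
        if not brightness_ok(brightness):       # preset brightness condition
            continue
        # Draw a noise level strictly above the preset noise threshold.
        sigma = max_sigma - (max_sigma - noise_threshold) * rng.random()
        groups.append({
            "scene_image": scene_image,         # clean reference
            "training_image": add_gaussian_noise(scene_image, sigma, rng),
            "sigma": sigma,
        })
    return groups


scenes = [
    ([[0.5, 0.5], [0.5, 0.5]], 80.0),
    ([[0.2, 0.2], [0.2, 0.2]], 5.0),
]
training_set = build_training_set(
    scenes, brightness_ok=lambda b: b >= 50.0, noise_threshold=0.05, max_sigma=0.2
)
```

Each training image group pairs a scene image with its noisy counterpart, so the scene image can serve as the supervision target during training.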
In the image denoising method, the training of the preset network model based on the predicted image and the scene image corresponding to the training image to obtain the image denoising model specifically includes:
determining a least absolute deviation (L1) loss value and a structural similarity index (SSIM) loss value between the predicted image and the scene image;
determining a total loss value corresponding to the training image based on the least absolute deviation loss value and the structural similarity index loss value;
and training the preset network model based on the total loss value to obtain the image denoising model.
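One way to combine the two loss terms is sketched below. The text does not say how the total loss is formed from the two values, so a weighted sum with a hypothetical `ssim_weight` is assumed. For brevity the structural similarity index is computed once over the whole image rather than over sliding windows, using the customary constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2 for a data range L.

```python
def flatten(image):
    return [v for row in image for v in row]


def l1_loss(pred, target):
    """Mean absolute deviation between two equally sized images."""
    p, t = flatten(pred), flatten(target)
    return sum(abs(a - b) for a, b in zip(p, t)) / len(p)


def global_ssim(pred, target, data_range=1.0):
    """SSIM over the whole image as a single window (simplification)."""
    p, t = flatten(pred), flatten(target)
    n = len(p)
    mean_p, mean_t = sum(p) / n, sum(t) / n
    var_p = sum((a - mean_p) ** 2 for a in p) / n
    var_t = sum((b - mean_t) ** 2 for b in t) / n
    cov = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t)) / n
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mean_p * mean_t + c1) * (2 * cov + c2)
    denominator = (mean_p ** 2 + mean_t ** 2 + c1) * (var_p + var_t + c2)
    return numerator / denominator


def total_loss(pred, target, ssim_weight=1.0):
    # Assumed combination: L1 term plus a weighted (1 - SSIM) term.
    return l1_loss(pred, target) + ssim_weight * (1.0 - global_ssim(pred, target))


clean = [[0.1, 0.4], [0.6, 0.9]]
noisy = [[0.2, 0.3], [0.7, 0.8]]
loss_value = total_loss(noisy, clean)
```

The L1 term penalizes per-pixel deviation, while the SSIM term rewards preserved structure, which is why the two are often used together for restoration tasks.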
The image denoising method, wherein after generating a training sample set based on each scene image and a training image corresponding to each scene image, the training method of the image denoising model further includes:
inputting the training images in the training sample set into an image denoising model, and outputting a denoising image corresponding to the training image through the image denoising model;
determining a chroma loss value corresponding to the training image based on the de-noised image and the predicted image;
and training the image denoising model based on the chroma loss value, and taking the trained image denoising model as the image denoising model.
A second aspect of the embodiments of the present application provides an image denoising model, where the image denoising model includes:
the down-sampling module is used for acquiring a plurality of feature maps corresponding to the image to be processed;
the attention module is used for determining an attention feature map corresponding to the image to be processed based on a target feature map in a plurality of feature maps, wherein the target feature map is the feature map with the smallest image size in the plurality of feature maps;
and the up-sampling module is used for determining a de-noised image corresponding to the image to be processed based on the attention feature map and all feature maps except the target feature map.
A third aspect of embodiments of the present application provides a computer readable storage medium storing one or more programs, which are executable by one or more processors to implement the steps in the image denoising method as described in any one of the above.
A fourth aspect of embodiments of the present application provides a terminal device, including: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes the connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps of the image denoising method as described in any one of the above.
Beneficial effects: compared with the prior art, the present application provides an image denoising method, a model, a computer-readable storage medium and a terminal device. The method acquires a plurality of feature maps corresponding to the image to be processed; determines an attention feature map corresponding to the image to be processed based on a target feature map among the plurality of feature maps, wherein the target feature map is the feature map with the smallest image size among the plurality of feature maps; and determines a denoised image corresponding to the image to be processed based on the attention feature map and the feature maps other than the target feature map. In the embodiments of the application, a plurality of feature maps carrying image detail information are obtained, and the attention feature map is determined using an attention mechanism and the target feature map, so that the attention feature map retains the color features and texture features of the image to be processed. This increases the image detail information in the denoised image determined based on the attention feature map and the plurality of feature maps, and thereby improves the image quality of the denoised image.
Drawings
Fig. 1 is a flowchart of the image denoising method provided in the present application.
Fig. 2 is a schematic structural diagram of the image denoising model in the image denoising method provided in the present application.
Fig. 3 is a schematic structural diagram of an attention module in the image denoising method provided in the present application.
Fig. 4 is a schematic diagram of an original image and a denoised image in the image denoising method provided in the present application.
Fig. 5 is a schematic structural diagram of the terminal device provided in the present application.
Detailed Description
In order to make the purpose, technical solution and effects of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The applicant has found that, with the rapid development of terminal devices, terminal devices can be used in a variety of shooting scenes, among which night scene shooting is common. However, during night scene shooting the photosensitive element of the shooting device is in a low-light environment for a long time, so the captured image carries image noise. The image noise degrades image quality and can even obscure part of the image content, which inconveniences users.
In order to improve the image quality of a night scene image captured in a night shooting scene, the night scene image is processed by a super night scene algorithm after it is acquired. The super night scene algorithm takes multiple frames with different exposures as input, performs multi-frame synthesis, demosaicing, High Dynamic Range (HDR) imaging and denoising, and then outputs a display image through an ISP (Image Signal Processor). The purpose of denoising is to remove the image noise in a noisy image so as to recover the latent clean image while retaining the original details of the image.
The image denoising methods in common use today fall mainly into two categories. The first category is based on filtering; representative methods include BM3D (Block-Matching and 3D filtering, i.e., image denoising by sparse three-dimensional transform-domain collaborative filtering), DCT (Discrete Cosine Transform), and the like. The main idea of traditional denoising methods, represented by the classical BM3D, is three-dimensional block-matching filtering. Although the effect is good, the high algorithmic complexity means that processing a high-resolution image, such as a currently popular 4K image, takes a long time, so real-time processing cannot be achieved. Moreover, noise types are complex, including additive noise and multiplicative noise, so the robustness of such methods is poor. The second category consists of data-driven deep network methods; representative methods include the convolution-based end-to-end residual learning network DnCNN (Denoising Convolutional Neural Network), RIDNet (Real Image Denoising Network), CBDNet (Convolutional Blind Denoising Network), PRIDNet (Pyramid Real Image Denoising Network), and the like, which can be subdivided into raw2raw, raw2rgb and rgb2rgb schemes according to the type of input data. In recent years, with the rapid development of data-driven CNNs (Convolutional Neural Networks) in denoising tasks, from the earlier DnCNN to the present PRIDNet and RIDNet, deep-learning-based denoising has completely surpassed the conventional methods. However, for images shot in night scenes, because of characteristics such as the diversity of the noise distribution, existing image denoising methods still suffer from low denoising precision.
In order to solve the above problem, in the embodiments of the present application, a plurality of feature maps corresponding to an image to be processed are acquired; an attention feature map corresponding to the image to be processed is determined based on a target feature map among the plurality of feature maps, wherein the target feature map is the feature map with the smallest image size among the plurality of feature maps; and a denoised image corresponding to the image to be processed is determined based on the attention feature map and the feature maps other than the target feature map. In the embodiments of the application, a plurality of feature maps carrying image detail information are obtained, and the attention feature map is determined using an attention mechanism and the target feature map, so that the attention feature map retains the color features and texture features of the image to be processed. This increases the image detail information in the denoised image determined based on the attention feature map and the plurality of feature maps, and thereby improves the image quality of the denoised image.
For example, the embodiments of the present application may be applied to fields such as night photography and night-time face recognition. It can be understood that what these application scenarios have in common is that shooting takes place in a low-light environment; such shooting is not limited to night scenes, and mainly refers to shooting in any environment whose brightness does not meet the requirements of conventional shooting. It should be noted that the above application scenarios are shown only to facilitate understanding of the present application, and the embodiments of the present application are not limited in this respect. Rather, the embodiments of the present application may be applied to any applicable scenario.
The following description of the embodiments is provided to further explain the present disclosure by way of example in connection with the appended drawings.
As shown in fig. 1, the present embodiment provides an image denoising method, which may include the following steps:
S10, acquiring a plurality of feature maps corresponding to the image to be processed.
Specifically, the image to be processed may be captured by an imaging system (e.g., an on-screen camera) configured in the electronic device, obtained through a network (e.g., Baidu), or sent by another external device (e.g., a smartphone). In one implementation of this embodiment, the image to be processed is captured by the imaging system of an electronic device configured with the image denoising method provided in this embodiment, and is captured under a dim-light condition, where the dim-light condition means that the ambient light brightness of the shooting scene does not satisfy a preset condition; for example, the image to be processed is captured by the imaging system in a night scene. The ambient light brightness not satisfying the preset condition may mean, for example, that the ambient light brightness is less than a preset brightness threshold.
The image sizes of the plurality of feature maps differ from one another, and each feature map carries image content information of the image to be processed. For example, the plurality of feature maps include a feature map A, a feature map B, and a feature map C, where the image size of feature map A is 256 × 256, that of feature map B is 128 × 128, and that of feature map C is 56 × 56. The feature maps may be determined by a traditional image feature extraction method or extracted by a neural network model.
In an implementation manner of this embodiment, the image denoising method applies an image denoising model, where the image denoising model includes a down-sampling module, and the down-sampling module includes a convolution unit and several down-sampling units; the acquiring of the plurality of feature maps corresponding to the image to be processed specifically includes:
inputting the image to be processed into the convolution unit, and outputting a reference feature map through the convolution unit;
determining a plurality of candidate feature maps corresponding to the image to be processed based on the reference feature map by using the plurality of down-sampling units;
and taking the reference feature map and the candidate feature maps as a plurality of feature maps corresponding to the image to be processed.
Specifically, the image denoising model is a trained deep learning model used to remove the image noise carried by the image to be processed so as to obtain the denoised image corresponding to the image to be processed. The image denoising model comprises a down-sampling module, and the down-sampling module comprises a convolution unit and a plurality of down-sampling units. The input item of the convolution unit is the image to be processed; the plurality of down-sampling units are sequentially cascaded; the input item of the first down-sampling unit in the cascade order is the output item of the convolution unit; and, for any two adjacent down-sampling units in the cascade order, the output item of the preceding down-sampling unit is the input item of the following down-sampling unit.
In one implementation of this embodiment, as shown in fig. 2, the convolution unit may include a first convolution layer and a first activation function layer, and the step size (stride) of the first convolution layer is 1, so that the image size of the output item of the convolution unit is the same as that of its input item. Each of the down-sampling units includes a second convolution layer and a second activation function layer, and the step size of the second convolution layer is 2, so that the image size of the output item of each down-sampling unit is half the image size of its input item, which down-samples the input item. The first activation function layer and each second activation function layer are configured with a ReLU (Rectified Linear Unit) activation function, which enhances the generalization capability of the image denoising model. In practical applications, the step sizes of the second convolution layers in different down-sampling units may differ; for example, the step size of the second convolution layer in down-sampling unit A is 3, while that in down-sampling unit B is 2.
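The relationship between step size and output size can be checked with the standard convolution output-size formula, out = floor((in + 2 × padding − kernel) / stride) + 1. The kernel size and padding are not given in the text; the sketch below assumes a 3 × 3 kernel with padding 1, and the function name `conv_output_size` is hypothetical.

```python
def conv_output_size(size, stride, kernel=3, padding=1):
    """Spatial output size of a convolution along one dimension.

    kernel=3 and padding=1 are assumptions made for this illustration.
    """
    return (size + 2 * padding - kernel) // stride + 1


same_size = conv_output_size(256, stride=1)    # first convolution layer, step size 1
halved = conv_output_size(256, stride=2)       # second convolution layer, step size 2
third = conv_output_size(256, stride=3)        # a down-sampling unit with step size 3

# Sizes produced by three cascaded step-size-2 down-sampling units.
sizes = [256]
for _ in range(3):
    sizes.append(conv_output_size(sizes[-1], stride=2))
```

Under these assumptions a step size of 1 preserves the image size and a step size of 2 halves it, whereas a step size of 3 reduces the size to roughly one third rather than one half.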
The output item of each of the several down-sampling units is a candidate feature map. It can be understood that the number of down-sampling units is the same as the number of candidate feature maps, the down-sampling units are in one-to-one correspondence with the candidate feature maps, and each candidate feature map is the output item of its corresponding down-sampling unit. For example, the downsampling module includes N downsampling units. For the reference feature map output by the convolution unit, the first downsampling unit downsamples the reference feature map to obtain a first candidate feature map; the second down-sampling unit downsamples the first candidate feature map to obtain a second candidate feature map; and by analogy, the Nth down-sampling unit downsamples the (N-1)th candidate feature map output by the (N-1)th down-sampling unit to obtain an Nth candidate feature map. The first candidate feature map, the second candidate feature map, ..., and the Nth candidate feature map are then taken as the several candidate feature maps corresponding to the image to be processed.
Taking the downsampling module shown in fig. 2 including 3 downsampling units as an example, a specific process of determining the candidate feature maps corresponding to the image to be processed based on the reference feature map by using the downsampling units will be described, where the 3 downsampling units are respectively denoted as a first downsampling unit, a second downsampling unit, and a third downsampling unit according to a cascade order. Then, the determining, by using the several downsampling units, several candidate feature maps corresponding to the image to be processed based on the reference feature map may specifically include: inputting the reference feature map into a first downsampling unit, and downsampling the reference feature map through the first downsampling unit to obtain a first candidate feature map; inputting the first candidate feature map into a second down-sampling unit, and down-sampling the first candidate feature map through the second down-sampling unit to obtain a second candidate feature map; inputting the second candidate feature map into a third down-sampling unit, and down-sampling the second candidate feature map through the third down-sampling unit to obtain a third candidate feature map; and taking the first candidate feature map, the second candidate feature map and the third candidate feature map as a plurality of candidate feature maps corresponding to the image to be processed.
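The sizes of the candidate feature maps in the three-unit example can be traced with a short sketch. It uses the standard convolution output-size formula; the 3 × 3 kernel, the padding of 1, and the names `conv_output_size` and `candidate_feature_sizes` are assumptions made for illustration and are not stated in this embodiment.

```python
from typing import List, Sequence, Tuple


def conv_output_size(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Standard output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def candidate_feature_sizes(
    height: int, width: int, strides: Sequence[int] = (2, 2, 2)
) -> List[Tuple[int, int]]:
    """Sizes of the candidate feature maps produced by the cascaded downsampling units."""
    # The convolution unit has step size 1, so the reference feature map keeps the input size.
    h = conv_output_size(height, stride=1)
    w = conv_output_size(width, stride=1)
    sizes = []
    for stride in strides:
        # Each downsampling unit consumes the output of the previous unit.
        h = conv_output_size(h, stride=stride)
        w = conv_output_size(w, stride=stride)
        sizes.append((h, w))
    return sizes


print(candidate_feature_sizes(64, 48))  # [(32, 24), (16, 12), (8, 6)]
```

Passing a different `strides` tuple, for example `(3, 2)`, models the case in which the step sizes of the downsampling units differ.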
S20, determining an attention feature map corresponding to the image to be processed based on a target feature map among the several feature maps.
Specifically, the target feature map is the feature map with the smallest image size among the several feature maps. For example, the feature maps include a feature map A and a feature map B, where the image size of feature map A is smaller than that of feature map B; feature map A is then the target feature map. The attention feature map carries the color features, texture features and image detail information of the image to be processed. Converting the target feature map into the attention feature map enriches the image detail information carried by the several feature maps, so that the subsequently determined denoised image retains the image detail information, color information and texture information of the image to be processed, and the image quality of the denoised image is improved.
In an implementation manner of this embodiment, the image denoising method applies an image denoising model that includes several attention modules. The image denoising model may include only the several attention modules, in which case, after the target feature map is obtained, the attention feature map is determined from the target feature map through the several attention modules in the image denoising model. Alternatively, the image denoising model includes a down-sampling module and several attention modules, where the attention module located at the forefront among the several attention modules is connected to the down-sampling module, the target feature map output by the down-sampling module is the input item of the forefront attention module, and the attention feature map is output by the several attention modules. The down-sampling module may be the down-sampling module described above; reference may be made to the above description, which is not repeated here.
In an implementation manner of this embodiment, the several attention modules may be cascaded in sequence, where the input item of the forefront attention module is the target feature map, the output item of the previous attention module of two adjacent attention modules is the input item of the next attention module, and the output item of the last attention module is the attention feature map. For example, the several attention modules include M attention modules. For the target feature map, the first attention module determines an attention feature map A based on the target feature map; the second attention module determines an attention feature map B based on the attention feature map A; and by analogy, the M-th attention module determines an attention feature map M based on the attention feature map M-1 output by the (M-1)-th attention module, and the attention feature map M is taken as the attention feature map, where M is the number of attention modules and is a positive integer.
In an implementation manner of this embodiment, besides being cascaded in sequence, the several attention modules include at least a first attention module and a second attention module, where the first attention module is located before the second attention module in the cascade order, and the input items of the second attention module include the output item of the attention module adjacent to and located before the second attention module as well as the input item of the first attention module. It can be understood that at least two of the several attention modules are joined by a jump connection, and the jump connection increases the image detail information learned by the second attention module, so that the image detail information carried by the finally determined attention feature map is increased. For example, when the two attention modules joined by a jump connection are two adjacent attention modules, the input items of the latter attention module include the input item of the former attention module and the output item of the former attention module. In practical applications, every pair of adjacent attention modules among the several attention modules may be joined by both a short connection and a jump connection, in other words, for any two adjacent attention modules, the input items of the latter attention module include the input item of the former attention module and the output item of the former attention module; alternatively, only one pair of adjacent attention modules may be joined by both a short connection and a jump connection, in other words, there is only one pair of adjacent attention modules for which the input items of the latter attention module include the input item of the former attention module and the output item of the former attention module.
In a specific implementation manner of this embodiment, a connection manner of the attention modules is as shown in fig. 2, the attention modules include 4 attention modules, which are respectively denoted as an attention module a, an attention module b, an attention module c, and an attention module d according to a cascade order, where the attention module c and the attention module d include short connections and jump connections, and the input items of the attention module d include the output item of the attention module c and the input item of the attention module c.
Based on this, the determining, based on the target feature map in the feature maps, the attention feature map corresponding to the image to be processed specifically includes:
inputting the target feature map into the most front attention module according to the cascade order, and outputting a first attention feature map through the most front attention module;
taking the first attention feature map as a target attention feature map, and taking an attention module located at the second position as a target attention module;
detecting whether the target attention module is a second attention module;
if the target attention module is not the second attention module, taking the target attention feature map as an input image of the target attention module; if the target attention module is a second attention module, taking the target attention feature map and an input item of a first attention module corresponding to the target attention module as an input image of the target attention module;
determining, with a target attention module, a second attention feature map based on the input image;
and taking the second attention feature map as the target attention feature map, taking the attention module following the target attention module as the target attention module, and continuing to perform the step of inputting the target attention feature map into the target attention module until the target attention module is the last attention module.
Specifically, the first attention module and the second attention module are two attention modules joined by a jump connection. It can be understood that the first attention module is located before the second attention module, and the input item of the first attention module is also one of the input items of the second attention module. Thus, for the attention module located at the forefront, the target feature map is directly used as its input item. For the attention module located at the second position, it is detected whether the attention module is a second attention module (i.e., whether the attention module is configured with a jump connection). When the attention module is not a second attention module, its input item consists only of the output item of the previous attention module; conversely, when the attention module is a second attention module, its input items include the output item of the previous attention module and the input item of the corresponding first attention module (i.e., the attention module joined to it by the jump connection).
After the input image of the attention module located at the second position is determined, the input image is fed into that attention module, which determines the second attention feature map. The following steps are then repeated: the second attention feature map is taken as the target attention feature map, the next attention module (starting with the attention module located at the third position) is taken as the target attention module, and the step of inputting the target attention feature map into the target attention module is performed again. Finally, the output item of the attention module located at the last position is acquired, which gives the attention feature map corresponding to the image to be processed.
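The iteration described in the steps above can be sketched as follows. This is a minimal illustration, not the implementation of this embodiment: each attention module is modeled as a hypothetical callable that receives a list of input items, feature maps are replaced by plain numbers, and `skip_from` is an assumed mapping from the index of a second attention module to the index of its first attention module.

```python
from typing import Callable, Dict, List, Sequence

Module = Callable[[Sequence[float]], float]  # hypothetical stand-in for an attention module


def run_attention_cascade(
    target_feature: float, modules: List[Module], skip_from: Dict[int, int]
) -> float:
    """Feed the target feature map through cascaded modules with optional jump connections."""
    recorded_inputs: List[float] = []  # primary input item of every module visited so far
    current = target_feature
    for index, module in enumerate(modules):
        inputs = [current]  # output item of the previous module (or the target feature map)
        if index in skip_from:
            # This module is a "second attention module": also pass the input item
            # of its corresponding "first attention module".
            inputs.append(recorded_inputs[skip_from[index]])
        recorded_inputs.append(current)
        current = module(inputs)
    return current


def toy_module(inputs: Sequence[float]) -> float:
    """Toy module: sums its input items and adds 1."""
    return sum(inputs) + 1.0


# Four modules a, b, c, d as in fig. 2; module d (index 3) also receives the input of c (index 2).
result = run_attention_cascade(1.0, [toy_module] * 4, skip_from={3: 2})
print(result)  # 8.0
```

With an empty `skip_from` the same call reduces to a plain cascade, which makes the effect of the jump connection easy to compare.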
In one implementation of this embodiment, the attention module may adopt an RRG (Recursive Residual Group) and, as shown in fig. 3, may include several cascaded attention units 201 and a fusion unit 202; the determining, by the target attention module, the second attention feature map based on the input image specifically includes:
inputting the input image into the attention unit positioned at the forefront according to the cascade order, and outputting a reference feature map through a plurality of attention units;
and inputting the reference feature map and the input image into the fusion unit, and outputting a second attention feature map corresponding to the image to be processed through the fusion unit.
Specifically, the several attention units 201 are sequentially cascaded. The input item of the attention unit located at the forefront in the cascade order is the input image, the output item of the previous attention unit of two attention units adjacent in the cascade order is the input item of the next attention unit, and the output item of the attention unit located last in the cascade order is the reference feature map. The input items of the fusion unit include the input image and the reference feature map; the fusion unit fuses the reference feature map with the input image to obtain the output item of the attention module, namely the second attention feature map corresponding to the image to be processed. For example, the attention module comprises P attention units. For the input image of the attention module, the first attention unit determines an attention feature map a based on the input image; the second attention unit determines an attention feature map b based on the attention feature map a; and by analogy, the P-th attention unit determines an attention feature map P based on the attention feature map P-1 output by the (P-1)-th attention unit, and the attention feature map P is taken as the reference feature map, where P is the number of attention units and is a positive integer.
Taking an attention module that includes 4 attention units as an example, the specific process of inputting the input image into the attention unit positioned at the forefront in the cascade order and outputting the reference feature map through the several attention units is described below, where the 4 attention units are respectively denoted as a first attention unit, a second attention unit, a third attention unit and a fourth attention unit in the cascade order. The inputting the input image into the attention unit located at the forefront in the cascade order and outputting the reference feature map through the several attention units may then specifically include: inputting the input image into the first attention unit, and outputting an attention feature map 1 through the first attention unit; inputting the attention feature map 1 into the second attention unit, and outputting an attention feature map 2 through the second attention unit; inputting the attention feature map 2 into the third attention unit, and outputting an attention feature map 3 through the third attention unit; and inputting the attention feature map 3 into the fourth attention unit, and outputting an attention feature map 4 through the fourth attention unit, where the attention feature map 4 serves as the reference feature map corresponding to the image to be processed.
In an implementation manner of this embodiment, the attention module may further include a convolution layer located between the last attention unit 201 and the fusion unit 202. The reference feature map output by the last attention unit 201 is input into the convolution layer, the convolution layer performs a convolution operation on the reference feature map, and the reference feature map after the convolution operation is input into the fusion unit 202, so that the fusion unit 202 fuses the reference feature map after the convolution operation with the input image to obtain the attention feature map corresponding to the attention module. In this way, performing a convolution operation on the reference feature map with the convolution layer enriches the image detail information in the reference feature map after the convolution operation. In a specific implementation manner, the image scale of the input item of the convolution layer is the same as the image scale of the input image, the fusion unit may be an adder, and the adder performs pixel-by-pixel addition on the output item of the convolution layer and the input image to obtain the attention feature map corresponding to the attention module. The convolution kernel of the convolution layer may be 3 × 3.
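The data flow inside one attention module (cascaded attention units, an optional convolution layer, then pixel-by-pixel addition with the input image) can be sketched as below. The feature map is flattened to a list of floats, and the attention units and the convolution layer are replaced by hypothetical callables (`halve`, `identity`); only the wiring follows the description.

```python
from typing import Callable, List, Sequence

FeatureMap = List[float]                    # flattened stand-in for a feature map
Layer = Callable[[FeatureMap], FeatureMap]  # hypothetical attention unit or convolution layer


def attention_module(
    input_image: FeatureMap, attention_units: Sequence[Layer], conv_layer: Layer
) -> FeatureMap:
    """Cascade the attention units, apply the convolution layer, then add the input image."""
    reference = list(input_image)
    for unit in attention_units:
        # The output item of each unit is the input item of the next unit.
        reference = unit(reference)
    refined = conv_layer(reference)
    if len(refined) != len(input_image):
        raise ValueError("the adder requires both inputs to have the same size")
    # Fusion unit modeled as an adder: pixel-by-pixel addition.
    return [r + x for r, x in zip(refined, input_image)]


def halve(fmap: FeatureMap) -> FeatureMap:
    """Toy attention unit that scales every value by 0.5."""
    return [0.5 * v for v in fmap]


def identity(fmap: FeatureMap) -> FeatureMap:
    """Toy convolution layer that leaves the feature map unchanged."""
    return list(fmap)


output = attention_module([8.0, 16.0], [halve] * 4, identity)
print(output)  # [8.5, 17.0]
```

Because the input image is added back at the end, the module only has to learn a correction to its input, which is the usual motivation for residual wiring of this kind.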
In an implementation manner of this embodiment, the attention unit may employ a DAB (Dual Attention Block), and the attention unit is configured with a temporal attention mechanism, a spatial attention mechanism, and/or a channel attention mechanism, so that each attention module includes at least one attention unit having a temporal attention mechanism, a spatial attention mechanism, and/or a channel attention mechanism. The attention unit suppresses features carrying less information and propagates features carrying richer information, so that the feature map learned by the attention unit retains color, texture, and image detail features. The feature map learned by the attention module therefore also retains color, texture, and image detail features, which further improves the image quality of the denoised image determined based on the attention feature map.
In one implementation of the present embodiment, as shown in fig. 3, the attention unit 201 includes a convolution block 213, a first attention block 211, a second attention block 212, a first fusion block 214, and a second fusion block 215. The input item of the convolution block 213 is the target feature map, and the output item of the convolution block 213 is the input item of the first attention block 211 and the input item of the second attention block 212; the input items of the first fusion block 214 include the output item of the first attention block 211 and the output item of the second attention block 212; and the input items of the second fusion block 215 include the input item of the convolution block 213 and the output item of the first fusion block 214. The first attention block attends to time and space, and the second attention block attends to channels: the first attention block learns image features based on a temporal attention mechanism and a spatial attention mechanism, and the second attention block learns image features based on a channel attention mechanism. In addition, when the intermediate feature map is determined based on the temporal attention mechanism, the spatial attention mechanism and the channel attention mechanism, the first fusion block and the second fusion block are used for fusion, so that feature loss during image transmission is reduced and more of the original effective information is retained.
In an implementation manner of this embodiment, the attention unit may further include a convolution layer located between the first fusion block and the second fusion block; the output item of the first fusion block is input into the convolution layer and, after passing through the convolution layer, is input into the second fusion block. It can be understood that the output item of the first fusion block is the input item of the convolution layer, and the output item of the convolution layer is an input item of the second fusion block. The first fusion block may adopt a concatenator (Concat), and the second fusion block may adopt an adder. The convolution block may comprise a first convolution layer, an activation function layer and a second convolution layer that are sequentially cascaded, where the convolution kernels of the first convolution layer and the second convolution layer are 3 × 3, the activation function configured in the activation function layer may be a ReLU (Rectified Linear Unit) function, and the generalization capability of the image denoising model is increased through the ReLU function.
The present embodiment illustrates the structure of the first attention block by taking fig. 3 as an example, where the first attention block 211 includes a first global mean pooling layer (Global Average Pooling) 2011, a global maximum pooling layer (Global Max Pooling) 2012, a first fusion layer 2013, and a second fusion layer 2014. The output item of the convolution block 213 is the input item of the first global mean pooling layer 2011 and the input item of the global maximum pooling layer 2012, the input items of the first fusion layer 2013 include the output item of the first global mean pooling layer 2011 and the output item of the global maximum pooling layer 2012, and the input items of the second fusion layer 2014 include the output item of the first fusion layer 2013 and the output item of the convolution block 213. The first attention block realizes a temporal attention mechanism and a spatial attention mechanism through the first global mean pooling layer and the global maximum pooling layer; the output item of the first global mean pooling layer and the output item of the global maximum pooling layer are fused through the first fusion layer, and the output item of the first fusion layer is then fused with the output item of the convolution block, so that effective image features are retained and the probability of passing on image noise is reduced.
In addition, in practical applications, the first attention block may further include a convolution layer and a normalization layer located between the first fusion layer and the second fusion layer. The output item of the first fusion layer is input to the convolution layer, the output item of the convolution layer is input to the normalization layer, and the output item of the normalization layer is input to the second fusion layer, so that the output item of the normalization layer serves as a space-time attention coefficient for the output item of the convolution block. Processing the output item of the convolution block with the space-time attention coefficient allows effective image features in the image to be processed to be learned and image noise in the image to be processed to be removed. The first fusion layer may use a concatenation layer (Concat), and the second fusion layer may use a multiplier.
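One possible reading of the first attention block is sketched below. Several points are assumptions rather than statements of this embodiment: the two pooling layers are taken to pool across the channel axis at each pixel (as is common for spatial attention in dual attention blocks), the convolution layer is reduced to a 1 × 1 weighted combination with hypothetical weights `w_avg`, `w_max` and `bias`, and the normalization layer is taken to be a sigmoid.

```python
import math
from typing import List

Tensor3D = List[List[List[float]]]  # feature map indexed as [channel][row][column]


def spatial_attention(
    features: Tensor3D, w_avg: float = 1.0, w_max: float = 1.0, bias: float = 0.0
) -> Tensor3D:
    """Scale every pixel by a coefficient derived from mean and max pooling over channels."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    output = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for i in range(height):
        for j in range(width):
            values = [features[c][i][j] for c in range(channels)]
            mean_value = sum(values) / channels  # mean pooling branch
            max_value = max(values)              # max pooling branch
            # First fusion layer (concatenation) followed by a 1 x 1 convolution,
            # modeled here as a weighted sum of the two pooled values.
            logit = w_avg * mean_value + w_max * max_value + bias
            coefficient = 1.0 / (1.0 + math.exp(-logit))  # normalization layer (sigmoid)
            for c in range(channels):
                # Second fusion layer modeled as a multiplier.
                output[c][i][j] = features[c][i][j] * coefficient
    return output


features = [[[0.0, 2.0]], [[0.0, 4.0]]]  # 2 channels, 1 row, 2 columns
attended = spatial_attention(features)
```

In this toy input the second pixel has the larger pooled response, so it receives a coefficient close to 1, while the first pixel receives a coefficient of 0.5.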
In one implementation of the present embodiment, the second attention block 212 includes a second global mean pooling layer (Global Average Pooling) 2015 and a third fusion layer 2016; the output item of the convolution block 213 is the input item of the second global mean pooling layer 2015, and the input items of the third fusion layer 2016 include the output item of the second global mean pooling layer 2015 and the output item of the convolution block 213. The second attention block realizes a channel attention mechanism through the second global mean pooling layer: the output item of the second global mean pooling layer and the output item of the convolution block are fused through the third fusion layer, the importance of each feature channel of the input item of the attention unit is calculated, and the feature channels are enhanced or suppressed based on the importance of each channel, so that effective image features are retained and the probability of passing on image noise is reduced.
In addition, in practical applications, the second attention block may further be provided with other network layers based on actual requirements. For example, as shown in fig. 3, the second attention block includes a cascaded convolution sub-block and normalization layer located between the second global mean pooling layer (Global Average Pooling) and the third fusion layer, where the input item of the convolution sub-block is the output item of the second global mean pooling layer, the output item of the convolution sub-block is the input item of the normalization layer, and the output item of the normalization layer is an input item of the third fusion layer. The convolution sub-block may include a convolution layer a, an activation function layer, and a convolution layer b that are cascaded, where the convolution kernel of convolution layer a may be 1 × 1, the activation function layer may use a ReLU (Rectified Linear Unit) function, the convolution kernel of convolution layer b may be 3 × 3, and the third fusion layer may be a multiplier.
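The channel branch described above (global mean pooling, convolution sub-block, normalization, multiplier) can be sketched as follows. On a pooled 1 × 1 map a convolution acts like a small dense layer, so the two convolution layers are modeled as weight matrices; the matrices `w_reduce` and `w_expand`, the sigmoid used for the normalization layer, and the function name are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Tensor3D = List[List[List[float]]]  # feature map indexed as [channel][row][column]
Matrix = List[List[float]]


def channel_attention(
    features: Tensor3D, w_reduce: Matrix, w_expand: Matrix
) -> Tuple[Tensor3D, List[float]]:
    """Scale each channel by an importance coefficient computed from its global mean."""
    # Global mean pooling: one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    # Convolution layer a followed by ReLU.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_reduce]
    # Convolution layer b.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_expand]
    # Normalization layer (sigmoid, an assumption) gives per-channel importance in (0, 1).
    coefficients = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Third fusion layer modeled as a multiplier.
    scaled = [[[v * k for v in row] for row in ch] for ch, k in zip(features, coefficients)]
    return scaled, coefficients


features = [[[1.0, 3.0]], [[0.0, 0.0]]]  # 2 channels, 1 row, 2 columns
w_reduce = [[1.0, 0.0]]                  # 2 channels -> 1 hidden value
w_expand = [[1.0], [-1.0]]               # 1 hidden value -> 2 channels
scaled, coefficients = channel_attention(features, w_reduce, w_expand)
```

Here the first channel is enhanced relative to the second, because its coefficient is sigmoid(2) while the second channel receives sigmoid(-2).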
S30, determining a de-noised image corresponding to the image to be processed based on the attention feature map and the feature maps except the target feature map.
Specifically, the signal-to-noise ratio of the denoised image corresponding to the image to be processed is higher than the signal-to-noise ratio of the image to be processed, and the image size of the denoised image is larger than the image size of the image to be processed; for example, the image size of the denoised image is twice the image size of the image to be processed, so that the image denoising time can be shortened and the memory overhead can be reduced.
For example, assuming that the left side in fig. 4 is an original image and the right side is a denoised image determined by using the image denoising method provided in this embodiment, it can be seen from fig. 4 that the right image carries less image noise relative to the left image and the text information is clear, so that the scheme can well remove the noise, fully retain the text information, and improve the image quality of the denoised image relative to the image to be processed.
In an implementation manner of this embodiment, the image denoising method applies an image denoising model that includes an upsampling module, and the upsampling module is used to determine the denoised image corresponding to the image to be processed based on the attention feature map and each feature map except the target feature map. The image denoising model may include only the up-sampling module, in which case, after the attention feature map and the feature maps except the target feature map are obtained, the denoised image is determined through the up-sampling module in the image denoising model. Alternatively, the image denoising model further includes a downsampling module or several attention modules, or the image denoising model further includes both a downsampling module and several attention modules, where the downsampling module may be the downsampling module described above and the attention modules may be the attention modules described above; reference may be made to the above description, which is not repeated here.
In a specific implementation manner of this embodiment, the image denoising model further includes a down-sampling module and several attention modules; in other words, the image denoising model includes a down-sampling module, several attention modules, and an up-sampling module. The down-sampling module and the up-sampling module form a U-Net structure, and the several attention modules are cascaded and located between the down-sampling module and the up-sampling module. The target feature map determined by the down-sampling module is the input item of the forefront attention module among the several attention modules, the input items of the up-sampling module include the attention feature map output by the last attention module among the several attention modules, and the down-sampling module determines each feature map except the target feature map among the several feature maps. The image denoising model in this embodiment has 64964 network parameters, requires a small peak memory, and has a short inference time; for example, the inference time is only 0.15 s for processing an image of 4640 × 3472 pixels on a GTX 1080 Ti, so the image denoising model can be effectively applied to denoising when a mobile terminal (e.g., a mobile phone) takes a picture.
In an implementation manner of this embodiment, the upsampling module includes a plurality of cascaded upsampling units, an input item of an upsampling unit located at the forefront in a cascaded order is an attention feature map, an output item of a previous upsampling unit of two upsampling units adjacent to each other in the cascaded order is an input item of a next upsampling unit, and input items of upsampling units other than the forefront upsampling unit in the plurality of upsampling units all include one feature map of the plurality of feature maps, and feature maps corresponding to the upsampling units are different from each other. In a specific implementation manner, each upsampling unit in the plurality of upsampling units performs upsampling in a bilinear interpolation manner, and the speed of upsampling can be effectively increased by using the bilinear interpolation, and the smoothness of an image output each time is increased.
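As an illustration of the bilinear interpolation mentioned above, the following sketch upsamples a single-channel image by an integer factor. The half-pixel coordinate mapping and the clamping at the border are one common convention and are assumptions here; the embodiment does not specify which convention its upsampling units use.

```python
from typing import List

Image = List[List[float]]  # single-channel image indexed as [row][column]


def bilinear_upsample(image: Image, scale: int = 2) -> Image:
    """Upsample an image by an integer factor using bilinear interpolation."""
    height, width = len(image), len(image[0])
    output = []
    for y in range(height * scale):
        # Map the output pixel center back to source coordinates and clamp to the image.
        src_y = min(max((y + 0.5) / scale - 0.5, 0.0), height - 1)
        y0 = int(src_y)
        y1 = min(y0 + 1, height - 1)
        fy = src_y - y0
        row = []
        for x in range(width * scale):
            src_x = min(max((x + 0.5) / scale - 0.5, 0.0), width - 1)
            x0 = int(src_x)
            x1 = min(x0 + 1, width - 1)
            fx = src_x - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        output.append(row)
    return output


upsampled = bilinear_upsample([[0.0, 4.0]], scale=2)
print(upsampled)  # [[0.0, 1.0, 3.0, 4.0], [0.0, 1.0, 3.0, 4.0]]
```

Each output value is a weighted average of at most four neighboring source pixels, which is why the result varies smoothly and is cheap to compute.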
In an implementation manner of this embodiment, the number of upsampling units included in the upsampling module is one more than the number of downsampling units included in the downsampling module, so that the image size of the denoised image output by the upsampling module is larger than the image size of the image to be processed. In this way, a small image to be processed can be denoised to obtain a large denoised image, which reduces the amount of computation needed to determine the denoised image and increases the speed of determining the denoised image.
Taking fig. 2 as an example, the up-sampling module includes 4 up-sampling units and a first convolution unit, and the 4 up-sampling units are sequentially denoted as a first up-sampling unit, a second up-sampling unit, a third up-sampling unit, and a fourth up-sampling unit according to the cascade order. The down-sampling module comprises 3 down-sampling units and a second convolution unit, where the 3 down-sampling units are respectively denoted as a first down-sampling unit, a second down-sampling unit and a third down-sampling unit according to the cascade order. The first up-sampling unit is connected with the last attention module, and the fourth up-sampling unit is connected with the first convolution unit. The second convolution unit is connected with the first up-sampling unit, so that the input items of the first up-sampling unit include the output item of the second convolution unit; the first down-sampling unit is connected with the second up-sampling unit, so that the input items of the second up-sampling unit include the output item of the first down-sampling unit; and the second down-sampling unit is connected with the third up-sampling unit, so that the input items of the third up-sampling unit include the output item of the second down-sampling unit. The first convolution unit may include a convolution layer with a step size of 1 and an activation function layer, where the activation function layer may use a ReLU (Rectified Linear Unit) function, and the generalization capability of the image denoising model is increased by the ReLU function.
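The effect of having one more up-sampling unit than down-sampling units can be checked with simple size bookkeeping. The sketch assumes that every down-sampling unit halves and every up-sampling unit doubles the image size, ignores the feature maps passed between the two modules, and uses a hypothetical 512 × 384 input; `trace_sizes` is an illustrative name.

```python
from typing import List, Tuple


def trace_sizes(
    height: int, width: int, num_down: int = 3, num_up: int = 4
) -> List[Tuple[str, int, int]]:
    """List the image size after every down-sampling and up-sampling unit."""
    sizes = [("input", height, width)]
    h, w = height, width
    for i in range(num_down):
        h, w = h // 2, w // 2  # step size 2 halves each dimension
        sizes.append((f"down{i + 1}", h, w))
    # The attention modules keep the size of the target feature map unchanged.
    for i in range(num_up):
        h, w = h * 2, w * 2    # each up-sampling unit doubles each dimension
        sizes.append((f"up{i + 1}", h, w))
    return sizes


for name, h, w in trace_sizes(512, 384):
    print(name, h, w)
# The last entry is ("up4", 1024, 768): twice the input size in each dimension.
```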
In an implementation manner of this embodiment, the image denoising method applies an image denoising model, and the training method of the image denoising model specifically includes:
acquiring a training sample set;
inputting the training images in the training sample set into a preset network model, and outputting a predicted denoised image corresponding to the training image through the preset network model;
training the preset network model based on the predicted denoised image and the scene image corresponding to the training image to obtain the image denoising model.
Specifically, the training sample set includes a plurality of training image groups, each of which includes a training image and a scene image, where the training image is determined by adding noise to the scene image. It can be understood that the training image is the noisy image corresponding to the scene image, and the scene image is the denoised image corresponding to the training image. When the preset network model is trained, the scene image can be used as the reference against which the predicted denoised image corresponding to the training image is measured, so that the preset network model can be trained based on the scene image and the predicted denoised image. The model structure of the preset network model is the same as that of the image denoising model; the two differ only in their model parameters: the model parameters of the preset network model are initial model parameters, and the model parameters of the image denoising model are model parameters obtained by training on the training sample set.
The training environment and training parameters adopted in this embodiment are briefly described as follows. The hardware environment for training the initial model is a GTX 1080 Ti. Parameters are initialized with Xavier initialization, which allows information to flow better through the network and keeps the variance of each layer's output as equal as possible. The preset learning rate is $10^{-4}$. An adaptive optimizer is used, such as the Adaptive Gradient (AdaGrad) optimizer or the Adam optimizer. The learning rate may be decayed using polynomial decay, natural exponential decay, or another decay schedule. Training runs for 6000 epochs with a batch size of 1.
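As a rough illustration of the initialization and learning-rate settings mentioned above, the following sketch implements Xavier (Glorot) uniform initialization together with polynomial and natural exponential decay. The kernel size, the end learning rate, the decay power, and the decay rate are assumptions chosen for illustration; the text only specifies the initial learning rate of 10^-4 and 6000 epochs.

```python
import math
import numpy as np


def xavier_uniform(shape, fan_in, fan_out, rng):
    """Sample weights from U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape)


def polynomial_decay(step, total_steps, base_lr=1e-4, end_lr=1e-6, power=2.0):
    """Polynomial decay from base_lr to end_lr (end_lr and power are assumed values)."""
    frac = min(step, total_steps) / total_steps
    return (base_lr - end_lr) * (1.0 - frac) ** power + end_lr


def natural_exp_decay(step, decay_steps, base_lr=1e-4, decay_rate=0.5):
    """Natural exponential decay: base_lr * exp(-decay_rate * step / decay_steps)."""
    return base_lr * math.exp(-decay_rate * step / decay_steps)


rng = np.random.default_rng(0)
# Hypothetical 3x3 convolution kernel with 16 input and 32 output channels.
fan_in, fan_out = 3 * 3 * 16, 3 * 3 * 32
kernel = xavier_uniform((3, 3, 16, 32), fan_in, fan_out, rng)

lr_start = polynomial_decay(0, 6000)      # learning rate at the first epoch
lr_end = polynomial_decay(6000, 6000)     # learning rate at the last epoch
lr_exp_mid = natural_exp_decay(3000, 6000)
```

With Xavier uniform initialization the weight variance is approximately 2 / (fan_in + fan_out), which is what keeps the output variance roughly constant from layer to layer.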
The training sample set may be obtained in a variety of ways. For example, standard images in a standard image set are used as scene images, image processing (e.g., linear transformation, brightness adjustment) is applied to each scene image, and artificially synthesized simulated noise is added using a noise model to obtain the training image corresponding to the scene image. As another example, for the same scene, a low-ISO image is used as the scene image and a high-ISO image as the training image, with camera parameters such as exposure time adjusted so that the brightness of the low-ISO image matches that of the high-ISO image. As a further example, multiple images of the same scene are captured in succession and processed (for example, image registration and rejection of abnormal images), and the processed images are averaged with weights to obtain the scene image; either one of the captured images is used as the training image, or each of the captured images is used as a training image.
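The multi-frame option can be sketched as follows. This is a minimal example that assumes the frames are already registered; the median-based rejection rule and its threshold are hypothetical stand-ins for the abnormal-image rejection mentioned in the text, which does not specify a criterion.

```python
import numpy as np


def fuse_frames(frames, weights=None, reject_threshold=None):
    """Weighted average of registered frames of one scene.

    frames: array of shape (N, H, W, C), assumed to be registered already.
    weights: optional per-frame weights; uniform when omitted.
    reject_threshold: optional; frames whose mean absolute difference from
        the per-pixel median exceeds it are dropped (hypothetical rule).
    """
    frames = np.asarray(frames, dtype=np.float64)
    n = frames.shape[0]
    weights = np.ones(n) if weights is None else np.asarray(weights, dtype=np.float64)
    if reject_threshold is not None:
        median = np.median(frames, axis=0)
        deviation = np.abs(frames - median).mean(axis=(1, 2, 3))
        keep = deviation <= reject_threshold
        frames, weights = frames[keep], weights[keep]
    weights = weights / weights.sum()
    # Contract the frame axis: result has shape (H, W, C).
    return np.tensordot(weights, frames, axes=1)


rng = np.random.default_rng(0)
clean = np.full((8, 8, 3), 0.5)
frames = clean + rng.normal(0.0, 0.1, size=(16, 8, 8, 3))
frames[3] += 1.0  # simulate one abnormal frame
fused = fuse_frames(frames, reject_threshold=0.5)
```

Averaging N frames with independent noise reduces the noise standard deviation by roughly a factor of sqrt(N), which is why the fused result can serve as the scene image.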
Existing deep-learning-based denoising algorithms mainly rely on adding simulated Gaussian noise to a clean image, which cannot reproduce the noise that a real camera introduces when capturing an image. To address this, the SIDD (Smartphone Image Denoising Dataset) was proposed. The SIDD dataset is synthesized from images captured by real phones under different exposures, covers different Bayer patterns, and can adequately represent noise in real environments; it synthesizes each final noise-free image from 150 frames with different exposure times, and has achieved good results in training various denoising models. However, because noise in low-light scenes is complex and strong and follows a mixed distribution of Gaussian noise and Poisson noise, collecting data with the SIDD scheme is difficult, and the clean images (ground truth) obtained by denoising the noisy images vary in quality. Consequently, the denoising performance of the trained models is also uneven.
Training the model requires a large number of samples, and capturing qualified samples with a mobile phone is time-consuming and labor-intensive. Therefore, in this embodiment the acquired standard images are all captured under conventional shooting conditions, yielding a plurality of scene images. Simulated noise is then added to each scene image to generate the corresponding training image, and each scene image together with its training image forms a training image group.
Based on this, in an implementation manner of this embodiment, the acquiring a training sample set specifically includes:
acquiring a plurality of scene images;
for each scene image in a plurality of scene images, determining simulation noise corresponding to the scene image, and adding the simulation noise to the scene image to obtain a training image corresponding to the scene image, wherein the simulation noise is greater than a preset noise threshold;
and generating a training sample set based on each scene image and the training images corresponding to the scene images.
Specifically, the scene images may all be acquired in the same way, for example all captured in a preset scene by an imaging module; alternatively, some of the scene images may be acquired in one way and others in a different way. For example, the plurality of scene images include a first scene image captured by an imaging module of an electronic device and a second scene image obtained from the Internet. In addition, the ambient brightness corresponding to each scene image satisfies a preset condition, which may be that the ambient brightness is greater than a preset brightness threshold; that is, the ambient brightness of every scene image is greater than the preset brightness threshold. For example, 800 RGB images are captured by the imaging module during the day and used as the scene images, each with a resolution of 4640 × 3472.
After the plurality of scene images are acquired, the simulated noise corresponding to each scene image is determined and then added to that scene image to obtain the corresponding training image. The simulated noise may be taken from existing noise data, or may be obtained by running a denoising network in reverse. In an implementation manner of this embodiment, the simulated noise may be determined using the denoising network model CycleISP (cyclic Image Signal Processing). The CycleISP framework may include four parts: RGB2RAW, for converting an sRGB image into a RAW image; RAW2RGB, for converting a RAW image into an sRGB image; color correction, an auxiliary color recovery network that provides an explicit color attention mechanism to RAW2RGB so that the RGB image is recovered correctly; and noise injection, which is set to OFF when CycleISP is trained and set to ON when noisy data needs to be generated. In addition, the simulated noise is greater than a preset noise threshold, where the preset noise threshold is the level of SIDD noise synthesized by the CycleISP model. In this embodiment, the noise intensity of the simulated noise may be 120% of the SIDD noise synthesized by the CycleISP model, so that the image denoising model trained on the training image set is better suited to images captured in a low-light environment.
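The idea of making the simulated noise stronger than a base noise level can be sketched as follows. This example does not use CycleISP: the base noise comes from a simple Poisson-Gaussian generator, which is a stand-in chosen only for illustration, and the `gain` and `read_sigma` values are assumptions. Only the 120% scaling factor comes from the text.

```python
import numpy as np


def poisson_gaussian_noise(scene, rng, gain=0.01, read_sigma=0.02):
    """Stand-in base noise generator (NOT CycleISP).

    scene: float array with values in [0, 1].
    Returns a noisy image (not clipped) with signal-dependent shot noise
    plus Gaussian read noise.
    """
    shot = rng.poisson(scene / gain) * gain          # mean equals scene
    read = rng.normal(0.0, read_sigma, scene.shape)  # zero-mean read noise
    return shot + read


def make_training_image(scene, base_noisy, strength=1.2):
    """Scale the noise residual so the added noise is `strength` times the base noise."""
    residual = base_noisy - scene
    return np.clip(scene + strength * residual, 0.0, 1.0)


rng = np.random.default_rng(0)
scene = np.full((64, 64, 3), 0.5)                  # hypothetical clean scene image
base_noisy = poisson_gaussian_noise(scene, rng)    # base noise level
training = make_training_image(scene, base_noisy)  # 120% of the base noise
noise_ratio = (training - scene).std() / (base_noisy - scene).std()
```

Scaling the residual rather than re-sampling keeps the spatial structure of the base noise while raising its intensity, so the pair (training, scene) still describes the same scene.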
After the simulated noise is obtained, it is added to each scene image to obtain the training image corresponding to that scene image. The plurality of scene images may include scene images of different shooting scenes, for example some corresponding to night scenes and some to cloudy scenes, so that the image denoising model trained on the training sample set can be applied to images captured in different scenes. In addition, after the training images are obtained, operations such as cropping, rotation, scaling, flipping, and random cutout may be applied to randomly selected images to augment the training sample set, which avoids the time and effort of collecting scene images on a large scale. In one specific implementation, the resolution of the training image is 4640 × 3472 pixels, and the resolution of the training image after cropping is 512 × 512 pixels.
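A minimal sketch of such augmentation is given below. It assumes that the same random crop, flip, and rotation must be applied to both images of a training image group so that they stay aligned; the flip probability and the restriction to 90-degree rotations are assumptions, and scaling and cutout are omitted for brevity.

```python
import numpy as np


def augment_pair(train_img, scene_img, rng, crop=512):
    """Apply the same random crop, flip and 90-degree rotation to both images."""
    h, w = train_img.shape[:2]
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    pair = [img[top:top + crop, left:left + crop] for img in (train_img, scene_img)]
    if rng.random() < 0.5:                       # horizontal flip
        pair = [img[:, ::-1] for img in pair]
    k = int(rng.integers(0, 4))                  # rotate by k * 90 degrees
    pair = [np.rot90(img, k, axes=(0, 1)) for img in pair]
    return pair[0], pair[1]


rng = np.random.default_rng(0)
scene = rng.random((600, 700, 3))   # small hypothetical scene image
train = scene + 0.1                 # toy "noisy" counterpart with a known offset
aug_train, aug_scene = augment_pair(train, scene, rng)
```

Because the crop is square, rotating by multiples of 90 degrees leaves the output size at 512 × 512.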
In an implementation manner of this embodiment, the training the preset network model based on the predicted image and the scene image corresponding to the training image to obtain the image denoising model specifically includes:
determining a minimum absolute value deviation loss value and a structural similarity index loss value of the predicted image and the scene image;
determining a total loss value corresponding to the training image based on the minimum absolute value deviation loss value and the structural similarity index loss value;
and training the preset network model based on the total loss value to obtain the image denoising model.
Specifically, the total loss value is a denoising loss value between the scene image and the predicted image. The denoising loss function may be a conventional loss function such as a minimum absolute value deviation loss function or a Euclidean distance function. The total loss value adopted in this embodiment is composed of several loss values, which include a minimum absolute value deviation loss value (L1 norm loss value) and a Structural SIMilarity index loss value (SSIM loss value).
The L1 norm loss value is determined by an L1 norm loss function, which is also referred to as the Least Absolute Deviations (LAD) function or the Least Absolute Error (LAE). The L1 norm loss function minimizes the sum of the absolute differences between the target values and the estimated values. The formula for the L1 norm loss function is:
$$L_1 = \sum_{i=1}^{n} \left| x_i - y_i \right|$$
where $x_i$ is the $i$-th target value (from the scene image $x$), $y_i$ is the $i$-th estimated value (from the predicted image $y$), and $n$ is the number of values.
The structural similarity index loss value is determined by an SSIM loss function. SSIM is an index that measures the similarity of two images. Natural images are highly structured, which shows up as spatial correlation between pixels, so the SSIM loss function can better measure the similarity between a predicted image and a calibration image. The formula of SSIM is as follows:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
where $\mu_x$ is the mean of the scene image $x$, $\mu_y$ is the mean of the predicted image $y$, $\sigma_x^2$ is the variance of the scene image $x$, $\sigma_y^2$ is the variance of the predicted image $y$, $\sigma_{xy}$ is the covariance of the scene image $x$ and the predicted image $y$, and $c_1$ and $c_2$ are constants. In addition, since the structural similarity ranges from -1 to 1, the structural similarity index loss value may be calculated as $\mathrm{SSIM\_LOSS} = 1 - \mathrm{SSIM}$.
After the minimum absolute value deviation loss value and the structural similarity index loss value are obtained, the total loss value is determined as their weighted sum, $L_{total} = w_1 \cdot L_1 + w_2 \cdot \mathrm{SSIM\_LOSS}$, where $w_1$ and $w_2$ are weighting factors, for example $w_1 = 0.15$ and $w_2 = 0.85$. After the total loss value is obtained, it is back-propagated through the preset network model, and the parameters of the preset network model are adjusted according to the total loss value, so as to obtain the image denoising model.
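The weighted loss can be sketched in NumPy as follows. Two points are assumptions rather than statements from the text: the L1 term is averaged over pixels instead of summed, so that it stays on a scale comparable to the SSIM term, and SSIM is computed from global image statistics with the commonly used constants c1 = 0.01^2 and c2 = 0.03^2 for images in [0, 1]. The weights 0.15 and 0.85 are the example values given above.

```python
import numpy as np


def l1_mean(x, y):
    """Mean absolute difference between scene image x and predicted image y."""
    return np.abs(x - y).mean()


def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM from global statistics of the two images (single window, assumed)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def total_loss(x, y, w1=0.15, w2=0.85):
    """Weighted sum of the L1 term and SSIM_LOSS = 1 - SSIM."""
    return w1 * l1_mean(x, y) + w2 * (1.0 - ssim_global(x, y))


rng = np.random.default_rng(0)
scene = rng.random((64, 64, 3))
predicted = np.clip(scene + rng.normal(0.0, 0.05, scene.shape), 0.0, 1.0)
loss_same = total_loss(scene, scene)       # identical images
loss_noisy = total_loss(scene, predicted)  # noisy prediction
```

In a training framework the same expression would be written with differentiable tensor operations so that the total loss can be back-propagated.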
In an implementation manner of this embodiment, after generating a training sample set based on each scene image and a training image corresponding to each scene image, the training method for the image denoising model further includes:
inputting the training images in the training sample set into an image denoising model, and outputting a denoising image corresponding to the training image through the image denoising model;
determining a chroma loss value corresponding to the training image based on the denoised image and the predicted image;
and training the image denoising model based on the chromaticity loss value, and taking the trained image denoising model as the image denoising model.
Specifically, the image quality of a denoised image produced by the image denoising model fine-tuned with the chroma loss value is higher than that of a denoised image produced by the model without such fine-tuning. The denoising approach adopted in this embodiment performs convolution followed by up-sampling. Although details of the original image are retained as much as possible, some color blocks and regions with unnatural transitions remain, mainly because color details are not adjusted during training. Therefore, after the image denoising model is built, fine-tuning it based on the chroma loss value can avoid such color blocks and unnatural transitions and improve the image quality of the denoised image.
The chroma loss value is determined by a color loss function, which computes the color loss between the denoised image and the corresponding scene image. The color loss function adopted in this embodiment mainly measures the color difference between the denoised image and the calibration image, so as to constrain the color accuracy of the image denoising model. In one implementation of this embodiment, when the image denoising model is fine-tuned with the chroma loss, the number of epochs is a preset fine-tuning number that is smaller than the number of epochs used to train the preset network model, in order to limit the impact on the parameters that have already been trained. In this embodiment the fine-tuning number is 1000 and the fine-tuning learning rate is fixed at $10^{-6}$; other training parameters may be the same as those used to train the preset network model. After fine-tuning with the chroma loss value, the model parameters may be frozen using the TensorFlow framework to obtain the image denoising model. In addition, to facilitate use of the trained denoising model on a mobile terminal, the PB file corresponding to the frozen denoising model may subsequently be ported to the mobile terminal using a mobile framework such as MACE (Mobile AI Compute Engine).
In summary, the present embodiment provides an image denoising method, a model, a computer-readable storage medium, and a terminal device, where the method obtains a plurality of feature maps corresponding to an image to be processed; determining an attention feature map corresponding to the image to be processed based on a target feature map in a plurality of feature maps, wherein the target feature map is the feature map with the smallest image size in the plurality of feature maps; and determining a denoised image corresponding to the image to be processed based on the attention feature map and the feature maps except the target feature map. According to the embodiment of the application, a plurality of feature maps carrying image detail information are obtained, and the attention mechanism and the target feature map are utilized to determine the attention feature map, so that the attention feature map can retain color features and texture features of an image to be processed, the image detail information in a de-noised image determined based on the attention feature map and the plurality of feature maps can be improved, and the image quality of the de-noised image is improved.
In addition, this embodiment uses the CycleISP model to simulate SIDD real-noise scenes and adds simulated real noise of higher intensity to collected clean standard images to construct the training image set. This effectively addresses the difficulty of collecting a data set and the poor training results obtained with currently synthesized noisy images and calibration images. Meanwhile, a lightweight end-to-end image denoising network is designed, and a good denoising effect is obtained through the design of the denoising loss function and the optimization of the training strategy. Experimental results show that the image denoising model based on this method has better overall performance in memory footprint, inference time, final denoising effect, and image sharpness.
Based on the image denoising method, this embodiment provides an image denoising model, as shown in fig. 2, where the image denoising model includes:
a down-sampling module 100, configured to obtain a plurality of feature maps corresponding to an image to be processed;
the attention module 200 is configured to determine an attention feature map corresponding to the image to be processed based on a target feature map in a plurality of feature maps, where the target feature map is a feature map with a smallest image size in the plurality of feature maps;
an up-sampling module 300, configured to determine a denoised image corresponding to the image to be processed based on the attention feature map and the feature maps other than the target feature map.
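The data flow among the three modules can be illustrated with the following toy sketch. Every operation in it is a stand-in chosen for brevity (average pooling, a sigmoid channel gate, nearest-neighbour enlargement, additive fusion), not the convolutional units of the embodiment, and it uses an equal number of down-sampling and up-sampling steps, so the output has the same size as the input. All function names are hypothetical.

```python
import numpy as np


def downsample(x):
    """Stand-in down-sampling unit: 2x2 average pooling."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))


def upsample(x):
    """Stand-in up-sampling unit: nearest-neighbour 2x enlargement."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def channel_attention(x):
    """Stand-in attention: rescale channels by a sigmoid of their global mean."""
    gate = 1.0 / (1.0 + np.exp(-x.mean(axis=(0, 1))))
    return x * gate


def denoise_dataflow(image, levels=3):
    # Down-sampling module: collect feature maps of decreasing size.
    feature_maps = [image]
    for _ in range(levels):
        feature_maps.append(downsample(feature_maps[-1]))
    # Attention module: only the smallest (target) feature map is used.
    target = min(feature_maps, key=lambda f: f.shape[0] * f.shape[1])
    out = channel_attention(target)
    # Up-sampling module: enlarge step by step and fuse the remaining maps.
    others = [f for f in feature_maps if f is not target]
    for skip in reversed(others):
        out = upsample(out) + skip
    return out


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # height and width divisible by 2**levels
output = denoise_dataflow(image)
```

The point of the sketch is the routing: attention is computed once on the smallest feature map, and every larger feature map re-enters the pipeline during up-sampling.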
In addition, it should be noted that the model structure and training process of the image denoising model provided in this embodiment are the same as those of the image denoising model in the image denoising method described above; they are not repeated here, and reference may be made to the above description.
Based on the foregoing image denoising method, the present embodiment provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the steps in the image denoising method according to the foregoing embodiment.
Based on the image denoising method, the present application further provides a terminal device, as shown in fig. 5, including at least one processor (processor) 20; a display screen 21; and a memory (memory) 22, and may further include a communication Interface (Communications Interface) 23 and a bus 24. The processor 20, the display 21, the memory 22 and the communication interface 23 can communicate with each other through the bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may call logic instructions in the memory 22 to perform the methods in the embodiments described above.
Furthermore, the logic instructions in the memory 22 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product.
The memory 22, which is a computer-readable storage medium, may be configured to store a software program, a computer-executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 executes the functional application and data processing, i.e. implements the method in the above-described embodiments, by executing the software program, instructions or modules stored in the memory 22.
The memory 22 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal device, and the like. Further, the memory 22 may include a high-speed random access memory and may also include a non-volatile memory, for example any of a variety of media that can store program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk; it may also be a transitory storage medium.
In addition, the specific processes by which the instructions in the storage medium and the terminal device are loaded and executed by the processor have been described in detail in the above method and are not repeated here.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (18)
1. An image denoising method, comprising:
acquiring a plurality of feature maps corresponding to an image to be processed;
determining an attention feature map corresponding to the image to be processed based on a target feature map in a plurality of feature maps, wherein the target feature map is the feature map with the smallest image size in the plurality of feature maps;
and determining a denoised image corresponding to the image to be processed based on the attention feature map and the feature maps except the target feature map.
2. The image denoising method of claim 1, wherein an image size of the denoised image is larger than an image size of the image to be processed.
3. The image denoising method of claim 1, wherein the image denoising method applies an image denoising model, the image denoising model comprises a down-sampling module, the down-sampling module comprises a convolution unit and a plurality of down-sampling units; the acquiring of the plurality of feature maps corresponding to the image to be processed specifically includes:
inputting an image to be processed into the convolution unit, and outputting a reference feature map through the convolution unit;
determining a plurality of candidate feature maps corresponding to the image to be processed based on the reference feature maps by using the plurality of down-sampling units;
and taking the reference feature map and the candidate feature maps as a plurality of feature maps corresponding to the image to be processed.
4. The image denoising method of claim 3, wherein the plurality of downsampling units are cascaded in sequence, the plurality of downsampling units correspond to the plurality of candidate feature maps one to one, each candidate feature map is an output item of the downsampling unit corresponding to each candidate feature map, wherein an input item of the downsampling unit positioned at the forefront in the cascaded sequence comprises a reference feature map, and an output item of a previous downsampling unit in two downsampling units adjacent in the cascaded sequence is an input item of a next downsampling unit.
5. The image denoising method of claim 1, wherein the image denoising method applies an image denoising model, the image denoising model comprises a plurality of cascaded attention modules, at least a first attention module and a second attention module exist in the plurality of attention modules, the first attention module precedes the second attention module in a cascaded order, and the input items of the second attention module comprise the output item of the attention module adjacent to and preceding the second attention module and the input item of the first attention module.
6. The image denoising method of claim 5, wherein the determining the attention feature map corresponding to the image to be processed based on a target feature map of the feature maps specifically comprises:
inputting the target feature map into the most front attention module according to the cascade order, and outputting a first attention feature map through the most front attention module;
taking the first attention feature map as a target attention feature map, and taking an attention module located at the second position as a target attention module;
detecting whether the target attention module is a second attention module;
if the target attention module is not the second attention module, taking the target attention feature map as an input image of the target attention module; if the target attention module is a second attention module, taking the target attention feature map and an input item of a first attention module corresponding to the target attention module as an input image of the target attention module;
determining, with a target attention module, a second attention feature map based on the input image;
and taking the second attention feature map as a target attention feature map, taking a subsequent attention module of the target attention module as a target attention module, and continuing to execute the step of detecting whether the target attention module is the second attention module until the target attention module is the last attention module.
7. The image denoising method of claim 6, wherein the attention module comprises a plurality of cascaded attention units and fusion units; the determining, by the target attention module, the second attention feature map based on the input image specifically includes:
inputting the input image into the attention unit positioned at the forefront according to the cascade order, and outputting a reference feature map through a plurality of attention units;
and inputting the reference feature map and the input image into the fusion unit, and outputting a second attention feature map corresponding to the image to be processed through the fusion unit.
8. The image denoising method of claim 7, wherein the attention unit comprises a convolution block, a first attention block, a second attention block, a first fusion block and a second fusion block; the output items of the convolution block are the input items of the first attention block and the input items of the second attention block; the input items of the first fusion block include output items of the first attention block and output items of the second attention block; the input items of the second fusion block include input items of the convolution block and output items of the first fusion block.
9. The image denoising method according to claim 8, wherein the first attention block includes a first global mean pooling layer, a global maximum pooling layer, a first fusion layer, and a second fusion layer, the output items of the convolution block are input items of the first global mean pooling layer and the global maximum pooling layer, the input items of the first fusion layer include output items of the first global mean pooling layer and output items of the global maximum pooling layer, and the input items of the second fusion layer include output items of the first fusion layer and output items of the convolution block.
10. The image denoising method of claim 8, wherein the second attention block comprises a second global mean pooling layer and a third fusion layer; the output item of the convolution block is the input item of the second global mean pooling layer, and the input item of the third fusion layer comprises the output item of the second global mean pooling layer and the output item of the convolution block.
11. The image denoising method of claim 1, wherein the image denoising method applies an image denoising model, the image denoising model includes an upsampling module, the upsampling module includes a plurality of cascaded upsampling units, an input item of a first upsampling unit located at the forefront in a cascaded order is an attention feature map, an output item of a previous upsampling unit of two upsampling units adjacent in the cascaded order is an input item of a next upsampling unit, and input items of the upsampling units except the first upsampling unit in the plurality of upsampling units include a feature map of a plurality of feature maps, and corresponding feature maps of the upsampling units are different from each other.
12. The image denoising method according to any one of claims 1 to 11, wherein the image denoising method applies an image denoising model, and the training method of the image denoising model specifically comprises:
acquiring a training sample set, wherein the training sample set comprises a plurality of training image groups, each training image group in the plurality of training image groups comprises a training image and a scene image, and the training images are determined by adding noise to the scene images;
inputting the training images in the training sample set into an image denoising model, and outputting a denoising image corresponding to the training image through the image denoising model;
and training the preset network model based on the predicted image and the scene image corresponding to the training image to obtain the image denoising model.
13. The image denoising method of claim 12, wherein the obtaining of the training sample set specifically comprises:
acquiring a plurality of scene images, wherein the ambient brightness corresponding to each scene image in the plurality of scene images meets a preset condition;
for each scene image in a plurality of scene images, determining simulation noise corresponding to the scene image, and adding the simulation noise to the scene image to obtain a training image corresponding to the scene image, wherein the simulation noise is greater than a preset noise threshold;
and generating a training sample set based on each scene image and the training images corresponding to the scene images.
14. The image denoising method of claim 12, wherein the training the preset network model based on the predicted image and the scene image corresponding to the training image to obtain the image denoising model specifically comprises:
determining a minimum absolute value deviation loss value and a structural similarity index loss value of the predicted image and the scene image;
determining a total loss value corresponding to the training image based on the minimum absolute value deviation loss value and the structural similarity index loss value;
and training the preset network model based on the total loss value to obtain the image denoising model.
15. The image denoising method of claim 12, wherein after generating a training sample set based on each scene image and a training image corresponding to each scene image, the training method of the image denoising model further comprises:
inputting the training images in the training sample set into an image denoising model, and outputting a denoising image corresponding to the training image through the image denoising model;
determining a chroma loss value corresponding to the training image based on the de-noised image and the predicted image;
and training the image denoising model based on the chromaticity loss value, and taking the trained image denoising model as the image denoising model.
16. An image denoising model, comprising:
the down-sampling module is used for acquiring a plurality of feature maps corresponding to the image to be processed;
the attention module is used for determining an attention feature map corresponding to the image to be processed based on a target feature map in a plurality of feature maps, wherein the target feature map is the feature map with the smallest image size in the plurality of feature maps;
and the up-sampling module is used for determining a de-noised image corresponding to the image to be processed based on the attention feature map and all feature maps except the target feature map.
17. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps in the image denoising method according to any one of claims 1 through 15.
18. A terminal device, comprising: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes the connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in the image denoising method according to any one of claims 1-15.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110500670.XA CN115311149A (en) | 2021-05-08 | 2021-05-08 | Image denoising method, model, computer-readable storage medium and terminal device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110500670.XA CN115311149A (en) | 2021-05-08 | 2021-05-08 | Image denoising method, model, computer-readable storage medium and terminal device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115311149A (en) | 2022-11-08 |
Family
ID=83854005
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110500670.XA Pending CN115311149A (en) | 2021-05-08 | 2021-05-08 | Image denoising method, model, computer-readable storage medium and terminal device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115311149A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116823597A (en) * | 2023-08-02 | 2023-09-29 | 北京中科闻歌科技股份有限公司 | Image generation system |
CN116823597B (en) * | 2023-08-02 | 2024-05-07 | 北京中科闻歌科技股份有限公司 | Image generation system |
CN118411654A (en) * | 2024-07-02 | 2024-07-30 | 贵州道坦坦科技股份有限公司 | Water transport abnormal event identification method and monitoring system based on deep learning |
CN118411654B (en) * | 2024-07-02 | 2024-10-11 | 贵州道坦坦科技股份有限公司 | Water transport abnormal event identification method and monitoring system based on deep learning |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN112233038B (en) | True image denoising method based on multi-scale fusion and edge enhancement | |
CN111539879B (en) | Video blind denoising method and device based on deep learning | |
Xu et al. | Learning to restore low-light images via decomposition-and-enhancement | |
CN109360171B (en) | Real-time deblurring method for video image based on neural network | |
CN111402146B (en) | Image processing method and image processing apparatus | |
Hu et al. | Underwater image restoration based on convolutional neural network | |
US20220301114A1 (en) | Noise Reconstruction For Image Denoising | |
CN111986084A (en) | Multi-camera low-illumination image quality enhancement method based on multi-task fusion | |
CN113850741B (en) | Image noise reduction method and device, electronic equipment and storage medium | |
CN115311149A (en) | Image denoising method, model, computer-readable storage medium and terminal device | |
Rasheed et al. | LSR: Lightening super-resolution deep network for low-light image enhancement | |
CN112509144A (en) | Face image processing method and device, electronic equipment and storage medium | |
US20240020796A1 (en) | Noise reconstruction for image denoising | |
CN115035011B (en) | Low-illumination image enhancement method of self-adaption RetinexNet under fusion strategy | |
CN113284061A (en) | Underwater image enhancement method based on gradient network | |
Zhang et al. | Deep motion blur removal using noisy/blurry image pairs | |
CN115984570A (en) | Video denoising method and device, storage medium and electronic device | |
CN116547694A (en) | Method and system for deblurring blurred images | |
CN116152128A (en) | High dynamic range multi-exposure image fusion model and method based on attention mechanism | |
CN113628134B (en) | Image noise reduction method and device, electronic equipment and storage medium | |
Saleem et al. | A non-reference evaluation of underwater image enhancement methods using a new underwater image dataset | |
CN113379606B (en) | Face super-resolution method based on pre-training generation model | |
Huang et al. | Underwater image enhancement based on color restoration and dual image wavelet fusion | |
CN112150363B (en) | Convolutional neural network-based image night scene processing method, computing module for operating method and readable storage medium | |
CN115063301A (en) | Video denoising method, video processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||