CN116703768A - Training method, device, medium and equipment for blind spot denoising network model - Google Patents

Training method, device, medium and equipment for blind spot denoising network model

Info

Publication number
CN116703768A
CN116703768A
Authority
CN
China
Prior art keywords
training
network model
feature
feature fusion
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310666771.3A
Other languages
Chinese (zh)
Inventor
张旦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Qigan Electronic Information Technology Co ltd
Original Assignee
Shanghai Qigan Electronic Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Qigan Electronic Information Technology Co ltd filed Critical Shanghai Qigan Electronic Information Technology Co ltd
Priority to CN202310666771.3A priority Critical patent/CN116703768A/en
Publication of CN116703768A publication Critical patent/CN116703768A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a training method, device, medium, and equipment for a blind spot denoising network model, providing a self-supervised deep-learning blind spot denoising network model that can effectively remove large-area noise from an image. The method comprises the following steps: optimizing a feature fusion network model, wherein the optimized feature fusion network model comprises at least two feature extraction channels, the convolution layers in different feature extraction channels differ in mask type or mask size, and a fusion layer completes feature fusion from the extraction results of the different channels; taking the noisy images in a public data set as the training data set; and inputting the training data set into the optimized feature fusion network model for self-supervised model training until a set number of iterations is reached or the loss value of the loss function falls below a set threshold, then outputting the blind spot denoising network model.

Description

Training method, device, medium and equipment for blind spot denoising network model
Technical Field
The invention relates to the technical field of image processing, and in particular to a training method, device, medium, and equipment for a blind spot denoising network model.
Background
Image denoising is an unavoidable and important step in image processing, and the denoising effect strongly influences the subsequent processing stages. Traditional image denoising algorithms are slow and lack robustness. With the development of deep learning, deep-learning image denoising algorithms have made great progress. Supervised image denoising requires noise-clean image pairs, which are very difficult to collect in practice, so many self-supervised training methods that do not require clean pictures have been developed, such as blind spot network (Blind Spot Network, BSN) denoising. However, a necessary condition for BSN denoising is that the noise is assumed to be spatially independent, whereas real noise tends to be spatially continuous. Therefore, to break the spatial correlation of the noise, pixel sub-sampling is applied to the picture before training, and during training the BSN applies a center mask to the convolution kernel to achieve the blind spot effect. When denoising real noisy pictures, the keys to improving the blind spot denoising effect are breaking the spatial correlation of the noise to generate blind spots, while preserving as much of the original pixel detail of the picture as possible. In existing blind spot denoising models, whether the mask is applied to the input image or inside the network, the mask is point-shaped. Blind spot denoising recovers a blind spot from the spatial relationship between its pixel and the surrounding pixels; it can therefore denoise only if the point is spatially correlated with the surrounding points, and the surrounding points are pixels of the picture itself rather than noise.
When noise occupies a large area of the picture, masking only the current pixel during feature extraction and then recovering it from the surrounding information is likely to yield a pixel that is still a noise point, so large-area noise in the image is difficult to remove.
Therefore, a training scheme for an image denoising network model is needed that can effectively remove large-area noise from an image.
Disclosure of Invention
The embodiments of the invention provide a training method, device, medium, and equipment for a blind spot denoising network model, used to effectively remove large-area noise from an image.
In a first aspect, the invention provides a training method for a blind spot denoising network model, which may include the following steps: optimizing a feature fusion network model, wherein the optimized feature fusion network model comprises at least two feature extraction channels, the convolution layers in different feature extraction channels differ in mask type or mask size, and a fusion layer completes feature fusion from the extraction results of the different channels; taking the noisy images in a public data set as the training data set; and inputting the training data set into the optimized feature fusion network model for self-supervised model training until a set number of iterations is reached or the loss value of the loss function falls below a set threshold, then outputting the blind spot denoising network model.
The training method of the blind spot denoising network model provided by the invention has the following beneficial effects: the image denoising network model can be trained with noisy images alone, without collecting noise-clean image pairs in advance; and by optimizing the feature fusion network model and designing convolution kernels with a variety of mask shapes for feature extraction, the optimized model combines global and local features, so that large-area noise that is otherwise hard to remove can be effectively removed from the image.
In one possible implementation, inputting the training data set into the optimized feature fusion network model for self-supervised model training includes: after feature pre-extraction of the noisy image, performing feature extraction through each feature extraction channel separately; inputting the extraction results of the different feature extraction channels into a CDCL containing DCLs for convolution; then first concatenating and fusing the results of the channels with the same mask size, thereby fusing the features extracted by convolution kernels with different mask types; after several DCLs, concatenating the outputs of all channels to complete the feature fusion across convolution kernels with different mask sizes; and finally obtaining the output through several convolutions for channel transformation and feature fusion. This implementation can remove large-area noise that is hard to remove from the image by combining multiple masks in the self-supervised denoising process.
In another possible implementation, the mask type is one or more of: cross, square ring (回-shaped), row, column, diagonal, and anti-diagonal.
In another possible embodiment, the loss function satisfies: Loss = |I_out - I_N|_1,
where Loss is the L1 loss function, I_out is the output of the blind spot denoising network model, and I_N is the noisy image.
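As an illustration, the self-supervised L1 loss above can be sketched in plain Python; the function name and the flat-list representation of the images are ours, not the patent's:

```python
def l1_loss(output, noisy):
    """Mean absolute difference |I_out - I_N|_1 between the model output
    and the noisy input, both given as flat lists of pixel values."""
    assert len(output) == len(noisy)
    return sum(abs(o - n) for o, n in zip(output, noisy)) / len(output)
```

Because the target is the noisy image itself, no clean reference image is needed during training.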
In other possible embodiments, inputting the training data set into the optimized feature fusion network model for self-supervised model training includes: randomly cropping the noisy image into several sub-images, randomly rotating them and flipping them horizontally or vertically, and inputting the sub-images into the optimized feature fusion network model for self-supervised model training.
In a second aspect, an embodiment of the invention further provides a training device for a blind spot denoising network model, which includes modules/units for performing the method of any one of the possible embodiments of the first aspect. These modules/units may be implemented in hardware, or by hardware executing corresponding software.
In a third aspect, embodiments of the invention further provide a computer-readable storage medium containing a program which, when executed on an electronic device, causes the electronic device to perform the method of any one of the possible implementations of the first aspect.
In a fourth aspect, an embodiment of the invention further provides a computer program product which, when run on an electronic device, causes the electronic device to perform the method of any one of the possible embodiments of the first aspect.
For the advantageous effects of the second to fourth aspects above, see the description of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present invention;
fig. 2 is a schematic flow chart of a training method of a blind spot denoising network model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a blind spot denoising network model according to an embodiment of the present invention;
fig. 4 is a schematic diagram of CDCL and DCL in a blind spot denoising network model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of various mask shapes according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a training device for a blind spot denoising network model according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Image denoising aims to recover a clean signal from noisy observations and is one of the important tasks in image processing and low-level computer vision. Recently, with the rapid development of neural networks, learning-based supervised denoising models have achieved satisfactory performance. However, such models depend heavily on noise-clean image pairs. In practical applications these pairs are complex and expensive to collect; in dynamic scenes, medical imaging, and similar settings, real-world constraints may make suitable pairs impossible to obtain at all, so supervised image denoising is hard to adapt to some scenarios or struggles to reach the desired effect. Self-supervised image denoising methods have therefore been proposed in the prior art; compared with supervised methods they have practical value because they do not need noise-clean image pairs as references, and most current self-supervised methods can train a denoising model using only noisy images. However, in existing self-supervised denoising methods, when noise occupies a large area of the picture, masking only the current pixel during feature extraction and then restoring it from the surrounding information is likely to yield a pixel that is still a noise point, so large-area noise is difficult to remove and the denoising effect is insufficient.
To overcome the insufficient denoising of existing image denoising network models, the invention provides a training method for a blind spot denoising network model. It can train the image denoising network model with noisy images alone, without collecting noise-clean image pairs in advance, and it extracts features by optimizing the feature fusion network model and designing convolution kernels with a variety of mask shapes, so that the optimized model combines global and local features and can effectively remove large-area noise that is otherwise hard to remove from the image.
Some terms in the embodiments of the present invention are explained below to facilitate understanding by those skilled in the art.
1. Convolutional neural networks are a class of feedforward neural networks that contain convolution computations and have a deep structure, and are among the representative algorithms of deep learning. A convolutional neural network has feature-learning capability and can perform translation-invariant classification of input information according to its hierarchical structure. Convolutional neural networks grew out of biological neuroscience research and were originally proposed to process data with a grid-like structure; for example, an image can be regarded as a two-dimensional grid of pixels. The general structure of a convolutional neural network comprises a data input layer, convolution layers, activation (excitation) layers, pooling layers, fully connected layers, and a data output layer.
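The generic layer stack just described can be sketched, for example, in PyTorch; the channel counts and layer sizes below are arbitrary illustrations, not the patent's architecture:

```python
import torch
import torch.nn as nn

# input -> convolution -> excitation (activation) -> pooling -> fully connected -> output
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution layer
    nn.ReLU(),                                   # excitation (activation) layer
    nn.MaxPool2d(2),                             # pooling layer
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # fully connected output layer
)

out = net(torch.randn(1, 3, 32, 32))  # a single 3-channel 32x32 input
```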
Embodiments of the invention relate to artificial intelligence (AI) and machine learning (ML) techniques, and are designed based on deep learning networks and machine learning within artificial intelligence.
With the research and progress of artificial intelligence technology, AI has been deployed in many fields, such as smart homes, intelligent customer service, virtual assistants, smart speakers, smart marketing, unmanned and autonomous driving, robotics, and smart healthcare; it is expected that with further technological development, AI will be applied in still more fields and will be of ever greater value.
2. Machine learning is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
In the description of embodiments of the invention, the terminology used in the embodiments below is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the specification of the invention and the appended claims, the singular forms "a," "an," and "the" are intended to include expressions such as "one or more," unless the context clearly indicates otherwise. It should also be understood that in the following embodiments of the invention, "at least one" and "one or more" mean one, two, or more than two. The term "and/or" describes an association relationship between associated objects and means that three relationships are possible; for example, A and/or B may represent: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the invention. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise. The term "coupled" includes both direct and indirect connections, unless stated otherwise. The terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated.
In embodiments of the invention, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
The training method of the blind spot denoising network model provided by the invention can be applied to an application scene shown in fig. 1, wherein the application scene comprises a server 100 and a terminal device 200.
In a possible implementation, the server 100 is configured to optimize a feature fusion network model, wherein the optimized feature fusion network model includes at least two feature extraction channels, the convolution layers in different feature extraction channels differ in mask type or mask size, and a fusion layer completes feature fusion from the extraction results of the different channels; to take noisy images as the training data set; and to input the training data set into the optimized feature fusion network model for self-supervised model training until the set number of iterations is reached or the loss value of the loss function falls below a set threshold, then output the blind spot denoising network model. The terminal device 200 acquires the blind spot denoising network model from the server 100 and uses it for image denoising.
The server 100 and the terminal device 200 may be connected through a wireless network. The terminal device 200 may be a device with an image sensor, such as a smartphone, a tablet computer, or a medical imaging device. The server 100 may be a single server, a server cluster formed by several servers, or a cloud computing center.
Based on the application scenario shown in fig. 1, an embodiment of the invention provides the training method flow of a blind spot denoising network model shown in fig. 2. The flow may be executed by a server and comprises the following steps:
S201, optimizing a feature fusion network model, wherein the optimized feature fusion network model comprises at least two feature extraction channels, the convolution layers in different feature extraction channels differ in mask type or mask size, and a fusion layer completes feature fusion from the extraction results of the different channels.
S202, taking the noisy images in a public data set as the training data set.
Illustratively, the invention may use the real-image denoising public data sets SIDD-Medium and DND. SIDD-Medium contains 320 pairs of real noisy and clean images; this embodiment can use its noisy sRGB images as the training set, with the corresponding SIDD Validation and Benchmark sets as the validation and test sets, respectively. The DND data set contains 50 real noisy images, so DND images are typically used only as a test set; however, because the invention can train on noisy pictures, DND is used here as both the training set and the test set for the blind spot denoising network model.
S203, inputting the training data set into the optimized feature fusion network model for self-supervised model training until the set number of iterations is reached or the loss value of the loss function falls below the set threshold, then outputting the blind spot denoising network model.
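The stopping criterion of S203 — a fixed iteration budget or the loss dropping below a threshold — can be sketched as follows; `step_fn`, the function names, and the return convention are illustrative assumptions, not the patent's code:

```python
def train_until_converged(step_fn, max_iters, loss_threshold):
    """Run step_fn (one training iteration that returns its loss value)
    until max_iters iterations have run or the loss drops below
    loss_threshold. Returns (iterations_run, final_loss)."""
    loss = float("inf")
    for i in range(1, max_iters + 1):
        loss = step_fn()
        if loss < loss_threshold:
            return i, loss     # early stop: loss fell below the threshold
    return max_iters, loss     # stop: iteration budget exhausted
```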
In this step, in one possible embodiment, inputting the training data set into the optimized feature fusion network model for self-supervised model training includes: after feature pre-extraction of the noisy image, performing feature extraction through each feature extraction channel separately; inputting the extraction results of the different feature extraction channels into a CDCL containing DCLs for convolution; then first concatenating and fusing the results of the channels with the same mask size, thereby fusing the features extracted by convolution kernels with different mask types; after several DCLs, concatenating the outputs of all channels to complete the feature fusion across convolution kernels with different mask sizes; and finally obtaining the output through several convolutions for channel transformation and feature fusion.
Fig. 3 illustrates the framework of an optimized feature fusion network model, named the MM-BSN model framework. Fig. 3 shows the case in which the masks are a point mask and a cross mask, each in two sizes, so the optimized feature fusion network model has four feature extraction channels, one per mask. After simple feature pre-extraction, the noisy image passes through multiple convolution layers whose convolution kernels carry the different masks; the results are then fed into a Concatenation-based Dilated Convolution Layer (CDCL) containing a small number of Dilated Convolution Layers (DCL) (set to 2 in fig. 3), and the features extracted by the kernels with different mask types are fused according to mask size. After several further DCLs (set to 7 in fig. 3), all mask outputs are concatenated, completing the feature fusion across kernels with different mask sizes, and the final output is obtained through multiple 1×1 convolutions for channel transformation and feature fusion. It should be appreciated that if more mask types or mask sizes are used, additional feature extraction channels can be added to the model framework of fig. 3.
Fig. 4 illustrates the specific composition of the DCL and the CDCL. As can be seen from fig. 4, the DCL includes a 1×1 convolution, a 3×3 convolution, and a summer. The CDCL includes two branches: one contains two 1×1 convolutions and one DCL, the other a single 1×1 convolution. The outputs of the two branches are spliced along the channel dimension by a concatenation operator, and the final result is output after one further 1×1 convolution layer.
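Based on this description of fig. 4, the DCL and CDCL could be sketched in PyTorch roughly as follows; the channel count, the dilation rate, and the exact placement of the residual sum are our assumptions, since fig. 4 itself is not reproduced here:

```python
import torch
import torch.nn as nn

class DCL(nn.Module):
    """Dilated Convolution Layer sketch: a 1x1 convolution, a dilated 3x3
    convolution, and a summer adding the input back (assumed residual)."""
    def __init__(self, ch, dilation=2):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=1)
        self.conv3 = nn.Conv2d(ch, ch, kernel_size=3,
                               padding=dilation, dilation=dilation)

    def forward(self, x):
        return x + self.conv3(self.conv1(x))

class CDCL(nn.Module):
    """Concatenation-based Dilated Convolution Layer sketch: one branch with
    two 1x1 convolutions around a DCL, another with a single 1x1
    convolution; the branches are concatenated along the channel dimension
    and fused by a final 1x1 convolution."""
    def __init__(self, ch, dilation=2):
        super().__init__()
        self.branch_a = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=1),
            DCL(ch, dilation),
            nn.Conv2d(ch, ch, kernel_size=1),
        )
        self.branch_b = nn.Conv2d(ch, ch, kernel_size=1)
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)

    def forward(self, x):
        a, b = self.branch_a(x), self.branch_b(x)
        return self.fuse(torch.cat([a, b], dim=1))
```

Both modules preserve the spatial size and channel count, so they can be stacked freely inside the feature extraction channels.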
In addition, the invention also proposes the cross, square-ring (回-shaped), row, column, diagonal, anti-diagonal, and X-shaped masks of fig. 5. By way of example, fig. 5 shows the shapes of the various masks for a convolution kernel size of 5×5, where gray represents a value of 1 and white a value of 0. Figs. 5(a) and 5(f) show the point mask and the square-ring mask respectively; note that the point mask can be understood as a special case of the square-ring mask. Figs. 5(b) and 5(g) show masks whose middle row and middle column, respectively, are 0. Figs. 5(c) and 5(h) show an inverse cross mask whose middle cross is 0, and a cross mask that is 0 everywhere outside the central cross, with the center point also 0. Figs. 5(d) and 5(i) show masks that are 0 on the 45° diagonal and on the 135° diagonal, respectively. Figs. 5(e) and 5(j) show a mask that is 0 along the X directions, and a mask that is 0 everywhere outside the X, with the center point also 0.
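For illustration, a few of the 0/1 mask shapes of fig. 5 can be generated like this; only three representative kinds are sketched, and the function and its names are ours, not the patent's:

```python
def make_mask(size, kind):
    """Build a size x size convolution-kernel mask where 1 keeps a weight
    (gray in fig. 5) and 0 blinds it (white). Supported kinds here:
    'point' (only the centre is 0), 'cross' (centre row and column are 0),
    'diag' (the 45-degree main diagonal is 0)."""
    c = size // 2
    mask = [[1] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if kind == "point" and (i, j) == (c, c):
                mask[i][j] = 0
            elif kind == "cross" and (i == c or j == c):
                mask[i][j] = 0
            elif kind == "diag" and i == j:
                mask[i][j] = 0
    return mask
```

Multiplying a kernel element-wise by such a mask before the convolution is what produces the blind region.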
In one possible embodiment, when training the model the batch size is set to 8 and the number of iterations to 20 epochs. The optimizer is Adam with an initial learning rate of 0.0001, decayed by a factor of 0.1 every 8 epochs. This embodiment randomly crops the picture into 128×128 sub-images (patches), applies random rotations in multiples of 90° and horizontal or vertical flips, and inputs the patches for model training. The models may be trained with Python 3.8.0 and PyTorch 1.12.0 on Nvidia Tesla T4 GPUs.
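The augmentation pipeline just described — random 128×128 crops, rotations in multiples of 90°, and horizontal or vertical flips — might look like this in plain Python over a 2-D list of pixels; this is a sketch, and a real pipeline would operate on tensors:

```python
import random

def augment(img, patch=128):
    """Randomly crop a patch x patch sub-image from img (an H x W 2-D list),
    rotate it by a random multiple of 90 degrees, and flip it horizontally
    or vertically, each with probability 0.5."""
    h, w = len(img), len(img[0])
    top = random.randint(0, h - patch)
    left = random.randint(0, w - patch)
    sub = [row[left:left + patch] for row in img[top:top + patch]]
    for _ in range(random.randint(0, 3)):            # 0, 90, 180 or 270 degrees
        sub = [list(r) for r in zip(*sub[::-1])]     # rotate 90 degrees clockwise
    if random.random() < 0.5:
        sub = [row[::-1] for row in sub]             # horizontal flip
    if random.random() < 0.5:
        sub = sub[::-1]                              # vertical flip
    return sub
```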
In addition, the trained network can be deployed to the cloud or to edge devices, with the sRGB image captured at the front end used as the network input and the model output passed directly to subsequent image processing equipment. The invention is also applicable to other basic image processing tasks, such as image reconstruction.
In some embodiments of the invention, a training device for a blind spot denoising network model is disclosed, as shown in fig. 6. The device is configured to implement the methods described in the foregoing training method embodiments and includes an optimization unit 601 and a training unit 602. The optimization unit 601 is configured to optimize a feature fusion network model, wherein the optimized feature fusion network model comprises at least two feature extraction channels, the convolution layers in different feature extraction channels differ in mask type or mask size, and a fusion layer completes feature fusion from the extraction results of the different channels. The training unit 602 is configured to take the noisy images in a public data set as the training data set, input them into the optimized feature fusion network model for self-supervised model training until the set number of iterations is reached or the loss value of the loss function falls below the set threshold, and output the blind spot denoising network model. For the details of each step, refer to the functional descriptions of the corresponding functional modules in the foregoing method embodiments, which are not repeated here.
In other embodiments of the present invention, an electronic device is disclosed. The electronic device may be the server 100 above or the terminal device 200 above. As shown in fig. 7, the electronic device may include: one or more processors 701; a memory 702; a display 703; one or more applications (not shown); and one or more programs 704, where the above components may be connected by one or more communication buses 705. The one or more programs 704 are stored in the memory 702 and configured to be executed by the one or more processors 701, and include instructions for performing the steps in fig. 2 and the corresponding embodiments.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that the above division of functional modules is illustrated merely for convenience and brevity of description; in practical applications, the above functions may be allocated to different functional modules as needed, i.e. the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. For the specific working processes of the systems, devices and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
The functional units in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a readable storage medium. Based on this understanding, the part of the technical solution of the embodiments of the present invention that in essence contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing an electronic device or processor to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes: flash memory, removable hard disk, read-only memory, random access memory, magnetic disk, optical disk, and the like.
The foregoing is merely a specific implementation of the embodiment of the present invention, but the protection scope of the embodiment of the present invention is not limited to this, and any changes or substitutions within the technical scope disclosed in the embodiment of the present invention should be covered in the protection scope of the embodiment of the present invention. Therefore, the protection scope of the embodiments of the present invention shall be subject to the protection scope of the claims.
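The L1 loss recited in claim 4 below, Loss = ‖I_out − I_N‖₁, can be computed directly. This minimal sketch uses NumPy and treats the norm as the sum of absolute pixel differences (an assumption; averaging over pixels is also common in practice).

```python
import numpy as np

def l1_loss(i_out, i_n):
    # ||I_out - I_N||_1: sum of absolute differences between the
    # network output and the noisy input image.
    return float(np.abs(np.asarray(i_out) - np.asarray(i_n)).sum())
```

Because the target is the noisy image itself rather than a clean ground truth, minimizing this loss constitutes the self-supervised training objective of the blind spot network.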

Claims (12)

1. A method for training a blind spot denoising network model, the method comprising:
optimizing a feature fusion network model, wherein the optimized feature fusion network model comprises at least two feature extraction channels, the convolution layers in different feature extraction channels differ in mask type or mask size, and feature fusion is completed by a fusion layer according to the extraction results of the different feature extraction channels;
taking the noisy image in the public data set as a training data set;
and inputting the training data set into the optimized feature fusion network model for self-supervised model training until a set number of iterations is reached or a loss value of the loss function is smaller than a set threshold, and outputting the blind spot denoising network model.
2. The training method of claim 1, wherein inputting the training data set into the optimized feature fusion network model for self-supervised model training comprises:
performing feature pre-extraction on the noisy image, and then performing feature extraction through each feature extraction channel respectively;
inputting the extraction results of the different feature extraction channels into a CDCL comprising DCLs for convolution; first concatenating and fusing the extraction results of the feature extraction channels having the same mask size, so as to fuse the features extracted by convolution kernels of different mask types; and, after a plurality of DCLs, concatenating the outputs of all channels together, so as to complete the feature fusion of convolution kernels of different mask sizes;
and finally obtaining the final output through channel transformation and feature fusion by a plurality of convolutions.
3. The training method of claim 2, wherein the mask type is at least one of cross, square-ring, row, column, diagonal, and anti-diagonal.
4. A training method according to any one of claims 1 to 3, characterized in that the loss function satisfies:
Loss = ‖I_out − I_N‖₁
wherein Loss is L1 Loss function, I out I is the output result of the blind spot denoising network model N Is a noisy image.
5. A training method according to any one of claims 1 to 3, wherein inputting the mask image training set into the optimized feature fusion network model for self-supervised model training comprises:
randomly cropping the noisy image into a plurality of sub-images, randomly rotating the sub-images and flipping them horizontally or vertically, and inputting the sub-images into the optimized feature fusion network model for self-supervised model training.
6. A training device for a blind spot denoising network model, comprising:
an optimizing unit, configured to optimize a feature fusion network model, wherein the optimized feature fusion network model comprises at least two feature extraction channels, the convolution layers in different feature extraction channels differ in mask type or mask size, and feature fusion is completed by a fusion layer according to the extraction results of the different feature extraction channels;
and a training unit, configured to take the noisy images in a public data set as a training data set, input the training data set into the optimized feature fusion network model for self-supervised model training until a set number of iterations is reached or a loss value of the loss function is smaller than a set threshold, and output the blind spot denoising network model.
7. The training device of claim 6, wherein, when inputting the training data set into the optimized feature fusion network model for self-supervised model training, the training unit is specifically configured to:
perform feature pre-extraction on the noisy image, and then perform feature extraction through each feature extraction channel respectively;
input the extraction results of the different feature extraction channels into a CDCL comprising DCLs for convolution; first concatenate and fuse the extraction results of the feature extraction channels having the same mask size, so as to fuse the features extracted by convolution kernels of different mask types; and, after a plurality of DCLs, concatenate the outputs of all channels together, so as to complete the feature fusion of convolution kernels of different mask sizes;
and finally obtain the final output through channel transformation and feature fusion by a plurality of convolutions.
8. The training device of claim 7, wherein the mask type is at least one of cross, square-ring, row, column, diagonal, and anti-diagonal.
9. Training device according to any of the claims 6 to 8, characterized in that the loss function satisfies:
Loss = ‖I_out − I_N‖₁
wherein I_out is the output of the blind spot denoising network model, and I_N is the noisy image.
10. A training device according to any one of claims 6 to 8, wherein, when inputting the mask image training set into the optimized feature fusion network model for self-supervised model training, the training unit is specifically configured to:
randomly crop the noisy image into a plurality of sub-images, randomly rotate the sub-images and flip them horizontally or vertically, and input the sub-images into the optimized feature fusion network model for self-supervised model training.
11. A computer readable storage medium having a program stored therein, characterized in that the program, when executed by a processor, implements the method of any one of claims 1 to 5.
12. An electronic device comprising a memory and a processor, the memory having stored thereon a program executable on the processor, which when executed by the processor, causes the electronic device to implement the method of any of claims 1 to 5.
CN202310666771.3A 2023-06-06 2023-06-06 Training method, device, medium and equipment for blind spot denoising network model Pending CN116703768A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310666771.3A CN116703768A (en) 2023-06-06 2023-06-06 Training method, device, medium and equipment for blind spot denoising network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310666771.3A CN116703768A (en) 2023-06-06 2023-06-06 Training method, device, medium and equipment for blind spot denoising network model

Publications (1)

Publication Number Publication Date
CN116703768A true CN116703768A (en) 2023-09-05

Family

ID=87832086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310666771.3A Pending CN116703768A (en) 2023-06-06 2023-06-06 Training method, device, medium and equipment for blind spot denoising network model

Country Status (1)

Country Link
CN (1) CN116703768A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117710240A (en) * 2023-12-15 2024-03-15 山东财经大学 Self-supervision image denoising method, system, device and readable storage medium
CN117710240B (en) * 2023-12-15 2024-05-24 山东财经大学 Self-supervision image denoising method, system, device and readable storage medium

Similar Documents

Publication Publication Date Title
Pan et al. Physics-based generative adversarial models for image restoration and beyond
Zhang et al. Pyramid channel-based feature attention network for image dehazing
US10943145B2 (en) Image processing methods and apparatus, and electronic devices
Liu et al. Learning affinity via spatial propagation networks
Fakhry et al. Residual deconvolutional networks for brain electron microscopy image segmentation
CN109858487B (en) Weak supervision semantic segmentation method based on watershed algorithm and image category label
CN111539290B (en) Video motion recognition method and device, electronic equipment and storage medium
CN110189260B (en) Image noise reduction method based on multi-scale parallel gated neural network
Sun et al. Multiscale generative adversarial network for real‐world super‐resolution
CN106709877A (en) Image deblurring method based on multi-parameter regular optimization model
Kim et al. Deeply aggregated alternating minimization for image restoration
CN111832592A (en) RGBD significance detection method and related device
CN110728636A (en) Monte Carlo rendering image denoising model, method and device based on generative confrontation network
Jiao et al. Guided-Pix2Pix: End-to-end inference and refinement network for image dehazing
CN116703768A (en) Training method, device, medium and equipment for blind spot denoising network model
CN112861718A (en) Lightweight feature fusion crowd counting method and system
CN115393191A (en) Method, device and equipment for reconstructing super-resolution of lightweight remote sensing image
Yang et al. Low‐light image enhancement based on Retinex decomposition and adaptive gamma correction
Niu et al. Deep robust image deblurring via blur distilling and information comparison in latent space
Yu et al. FS-GAN: Fuzzy Self-guided structure retention generative adversarial network for medical image enhancement
Hua et al. Dynamic scene deblurring with continuous cross-layer attention transmission
CN108961268B (en) Saliency map calculation method and related device
Tang et al. Structure-embedded ghosting artifact suppression network for high dynamic range image reconstruction
Ma et al. Depth estimation from single image using CNN-residual network
CN113096032A (en) Non-uniform blur removing method based on image area division

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination