CN112991235A - Video noise reduction method and video noise reduction terminal - Google Patents

Video noise reduction method and video noise reduction terminal

Info

Publication number
CN112991235A
Authority
CN
China
Prior art keywords
noise
noise reduction
video frame
denoising
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110541126.XA
Other languages
Chinese (zh)
Other versions
CN112991235B (en)
Inventor
葛益军
王军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Xinmai Microelectronics Co ltd
Original Assignee
Hangzhou Xiongmai Integrated Circuit Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Xiongmai Integrated Circuit Technology Co Ltd filed Critical Hangzhou Xiongmai Integrated Circuit Technology Co Ltd
Priority to CN202110541126.XA priority Critical patent/CN112991235B/en
Publication of CN112991235A publication Critical patent/CN112991235A/en
Application granted granted Critical
Publication of CN112991235B publication Critical patent/CN112991235B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Picture Signal Circuits (AREA)

Abstract

The application relates to a video noise reduction method and a video noise reduction terminal that adopt a dynamic-static fusion technique and make full use of the information between frames in a video stream. Compared with other noise reduction methods based on single-picture neural networks, the method overcomes the instability of the noise reduction effect across multi-frame sequences, eliminates jumping and flickering, and is better suited to denoising video frame sequences. In addition, the method uses a neural network architecture for motion analysis and estimation; compared with traditional video denoising, which judges motion or stillness directly from the brightness difference between the previous and current frames, the neural network can understand more context information, is less easily affected by the noise itself and by environmental disturbance, and is more accurate. It also better overcomes the smear problem of traditional algorithms.

Description

Video noise reduction method and video noise reduction terminal
Technical Field
The present application relates to the field of video noise processing technologies, and in particular, to a video noise reduction method and a video noise reduction terminal.
Background
At present, surveillance cameras are widely used in security monitoring and family life. Users are increasingly demanding about the quality of the resulting images and expect clear, reliable pictures both during the day and at night. However, noise is unavoidable because of the intrinsic process characteristics of the image sensor. For example, thermal effects of electrons in the image sensor cause fluctuations of the charge potential and produce thermal noise; inconsistent switching characteristics of the amplifiers in the image sensor produce fixed-pattern (solid-state) noise; and when current flows through a potential barrier (PN junction) in the image sensor, shot noise associated with incident photons and dark current arises. These noise phenomena are particularly noticeable under poor night-time lighting conditions and already affect image sharpness very severely. Thoroughly denoising the video to improve the image is therefore an urgent need of users.
Traditional video noise reduction mainly uses multi-stage filtering algorithms, chiefly because filtering is fast, requires few resources and is easier to deploy. However, because a filter separates invalid noise from the valid image according to the frequency characteristics of the digital signal, it cannot perceive the context of the image, so the picture quality of the denoised video is low. There are two main reasons for this:
1) Traditional video noise reduction cannot tell noise from edge detail; while it erases some scene noise, it also removes edge details along with it, so the picture becomes blurred and sharpness drops.
2) Traditional video noise reduction lacks good dynamic-static analysis; it does not know which areas are moving and which are static, which causes image smear, mainly seen as residue left along the path of a moving object in the picture. This is caused by excessive superposition of motion regions during temporal filtering.
Disclosure of Invention
Therefore, it is necessary to provide a video denoising method and a video denoising terminal for solving the problem that the conventional video denoising method causes low picture quality in a denoised video.
The application provides a video denoising method, which comprises the following steps:
acquiring a video frame and a historical state frame at the current moment, and inputting the video frame and the historical state frame at the current moment into a motion estimation network model;
running the motion estimation network model to obtain a dynamic and static analysis result output by the motion estimation network model; the dynamic and static analysis result comprises the motion probability and the static probability of each image area in the current video frame;
inputting the video frame of the current moment into a noise reduction network model;
operating the noise reduction network model, controlling a front noise reduction module in the noise reduction network model to perform coarse noise removal on the video frame at the current moment, and acquiring the current video frame output by the front noise reduction module after the coarse noise removal;
Performing dynamic and static fusion on the current video frame and the historical state frame after the coarse noise is removed to generate a fused video frame and storing the fused video frame as a new historical state frame;
and controlling a post-denoising module in the denoising network model to perform secondary denoising processing on the fused video frame, acquiring the video frame after the secondary denoising processing output by the post-denoising module, and outputting the video frame after the secondary denoising processing as a denoised video frame.
The present application further provides a video noise reduction terminal, including:
a processor for performing the video denoising method as mentioned in the foregoing;
a motion estimation network model coupled to the processor;
one end of the noise reduction network model is connected with the processor, and the other end of the noise reduction network model is connected with the motion estimation network model;
and the database is connected with the processor.
The application relates to a video noise reduction method and a video noise reduction terminal that adopt a dynamic-static fusion technique and make full use of the information between frames in a video stream. Compared with other noise reduction methods based on single-picture neural networks, the method overcomes the instability of the noise reduction effect across multi-frame sequences, eliminates jumping and flickering, and is better suited to denoising video frame sequences. In addition, the method uses a neural network architecture for motion analysis and estimation; compared with traditional video denoising, which judges motion or stillness directly from the brightness difference between the previous and current frames, the neural network can understand more context information, is less easily affected by the noise itself and by environmental disturbance, and is more accurate. It also better overcomes the smear problem of traditional algorithms.
Drawings
Fig. 1 is a schematic flowchart of a video denoising method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a video denoising terminal according to an embodiment of the present application;
fig. 3 is a noise reduction logic diagram of a video noise reduction method according to an embodiment of the present application.
Reference numerals:
100-a processor; 200-a motion estimation network model; 300-a noise reduction network model; 400-database.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The application provides a video denoising method. It should be noted that the video noise reduction method provided by the present application is applicable to videos of any format.
In addition, the execution subject of the video denoising method provided by the present application is not limited. Optionally, the execution subject of the video denoising method provided by the present application may be a video denoising terminal. Specifically, it may be one or more processors in the video denoising terminal.
As shown in fig. 1, in an embodiment of the present application, the video denoising method includes the following steps S100 to S600:
s100, obtaining a video frame and a historical state frame at the current moment, and inputting the video frame and the historical state frame at the current moment into the motion estimation network model.
Specifically, the video frame at the current time is the video frame corresponding to the current time node. The historical state frame is a fusion of the video frames from all past times, i.e. times other than the current time node. In other words, the historical state frame is not the video frame of one particular past time node, but a fusion of the video frames of all past times. The video noise reduction terminal may be connected to a camera or other video-generating equipment to obtain the video frame at the current time, and may obtain the historical state frame by reading the historical state frame stored in a local database.
And S200, operating the motion estimation network model to obtain a dynamic and static analysis result output by the motion estimation network model. And the dynamic and static analysis result comprises the motion probability and the static probability of each image area in the current video frame.
Specifically, the motion estimation network model may calculate, according to the video frame at the current time and the historical state frame, a motion probability and a stationary probability of each image area in the video frame at the current time compared to the historical state frame.
The motion estimation network model may use a Unet network as the basic network model. The number of input channels of the network is 6, so that the video frame at the current time and the historical state frame can be input jointly. The number of output channels of the network is 2, and a SoftMax layer can be appended to the Unet network as its output. The SoftMax layer maps the data to the range 0-1, representing the prediction probabilities of the two classes, motion and stillness. For example, after an image region is analysed by the motion estimation network model, the motion probability obtained may be 0.8 and the stillness probability 0.2.
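As an illustration only, a minimal PyTorch sketch of such a network is given below. The 6 input channels, 2 output channels and SoftMax output come from this application; the layer widths, depth and upsampling choices are assumptions, and the actual training described here uses caffe, not PyTorch.

```python
import torch
import torch.nn as nn

class MotionEstimationNet(nn.Module):
    """Unet-style motion/stillness classifier: 6 input channels
    (current frame + historical state frame), 2 output channels
    (per-pixel motion / stillness probabilities)."""
    def __init__(self):
        super().__init__()
        self.enc1 = self._block(6, 32)
        self.enc2 = self._block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec1 = self._block(64 + 32, 32)
        self.head = nn.Conv2d(32, 2, kernel_size=1)   # 2 classes: motion / stillness

    @staticmethod
    def _block(c_in, c_out):
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, cur_frame, history_frame):
        x = torch.cat([cur_frame, history_frame], dim=1)  # joint 6-channel input
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        # SoftMax over the 2 channels gives per-pixel motion/stillness
        # probabilities in the range 0-1
        return torch.softmax(self.head(d1), dim=1)
```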
And S300, inputting the video frame at the current moment into the noise reduction network model.
Specifically, the noise reduction network model is mainly used for 2D spatial noise reduction of video frames. A video frame is essentially an image: image data containing noise is input into the noise reduction network model before noise reduction, and denoised image data is output from it. The number of input and output channels of the noise reduction network model is 3. Noise reduction is divided into two levels, front noise reduction and rear noise reduction, and different network sizes and loss functions are set for the two levels.
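For illustration, a hedged sketch of a 3-channel-in, 3-channel-out fully convolutional spatial denoiser follows. Only the channel counts of the deepest layer (512 for the front module and 256 for the rear module, per the later embodiment) come from this application; the layer schedule and the residual formulation are assumptions.

```python
import torch.nn as nn

class SpatialDenoiser(nn.Module):
    """Fully convolutional 2D spatial denoiser: 3 channels in, 3 channels out.
    deepest_channels is 512 for the front (coarse) module and 256 for the
    rear (fine) module per the embodiment; the width schedule and residual
    formulation below are illustrative assumptions."""
    def __init__(self, deepest_channels=512):
        super().__init__()
        widths = [64, 128, deepest_channels]
        layers, c_in = [], 3
        for c_out in widths:                      # contracting stack
            layers += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True)]
            c_in = c_out
        for c_out in reversed(widths[:-1]):       # expanding stack
            layers += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True)]
            c_in = c_out
        layers += [nn.Conv2d(c_in, 3, 3, padding=1)]   # back to a 3-channel image
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        # predict the noise and subtract it from the input (assumed residual design)
        return x - self.body(x)

front_denoiser = SpatialDenoiser(deepest_channels=512)  # coarse noise removal
rear_denoiser = SpatialDenoiser(deepest_channels=256)   # fine noise removal
```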
And S400, operating the noise reduction network model, controlling a front noise reduction module in the noise reduction network model to perform coarse noise removal on the video frame at the current moment, and acquiring the current video frame output by the front noise reduction module after the coarse noise removal.
Specifically, continuing from the previous step, this step uses the front noise reduction module to remove coarse noise; secondary noise reduction is then performed later by the rear noise reduction module.
And S500, performing dynamic-static fusion on the current video frame and the historical state frame after the coarse noise is removed, generating a fused video frame and storing the fused video frame as a new historical state frame.
Specifically, the fused video frame is stored in the database as a new historical state frame, and replaces the original historical state frame.
S600, controlling a post-denoising module in the denoising network model to perform secondary denoising processing on the fused video frame, acquiring the video frame after the secondary denoising processing output by the post-denoising module, and outputting the video frame after the secondary denoising processing as a denoised video frame.
Specifically, the step uses a post-denoising module, and fine noise removal, i.e., secondary denoising, is performed.
In this embodiment, by adopting a dynamic-static fusion technique, the information between frames in the video stream is fully utilized. Compared with other noise reduction methods based on single-picture neural networks, the method overcomes the instability of the noise reduction effect across multi-frame sequences, eliminates jumping and flickering, and is better suited to denoising video frame sequences. In addition, the method uses a neural network architecture for motion analysis and estimation; compared with traditional video denoising, which judges motion or stillness directly from the brightness difference between the previous and current frames, the neural network can understand more context information, is less easily affected by the noise itself and by environmental disturbance, and is more accurate. It also better overcomes the smear problem of traditional algorithms.
In addition, in this embodiment, the front noise reduction module is controlled to remove coarse noise from the video frame at the current moment, and the rear noise reduction module then performs secondary noise reduction on the fused video frame after dynamic-static fusion, so that large noise is removed first and small noise afterwards, giving a better noise reduction effect.
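To make the data flow of S100 to S600 concrete, a minimal Python sketch of how one frame could be processed is given below. The names motion_net, front_denoiser, rear_denoiser and fuse are placeholders standing in for the models and fusion step described in this application, not an actual implementation, and the assumed channel ordering of the motion output is noted in the comments.

```python
import torch

@torch.no_grad()
def denoise_frame(frame, history, motion_net, front_denoiser, rear_denoiser, fuse):
    """One pass of steps S100-S600 for a single video frame.

    frame:   current video frame, tensor of shape (1, 3, H, W)
    history: historical state frame (fusion of all past frames), same shape
    Returns (denoised_frame, new_history).
    """
    # S100/S200: dynamic-static analysis of the current frame vs. the history frame
    probs = motion_net(frame, history)      # (1, 2, H, W): motion / stillness
    motion_prob = probs[:, 0:1]             # channel ordering is an assumption

    # S300/S400: coarse (front) spatial noise reduction of the current frame
    coarse = front_denoiser(frame)

    # S500: dynamic-static fusion with the historical state frame; the fused
    #       frame replaces the stored historical state frame
    fused = fuse(coarse, history, motion_prob)
    new_history = fused

    # S600: secondary (fine, rear) spatial noise reduction of the fused frame
    output = rear_denoiser(fused)
    return output, new_history
```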
In an embodiment of the present application, before S100, the video denoising method further includes the following S010 to S060:
and S010, establishing a full convolution neural network model with equal input and output sizes, and recording the model as a first model.
And S020, acquiring a plurality of noise-containing pictures in the same scene and a plurality of noise-containing pictures in different scenes as training data, inputting the training data into the first model for training, and taking the trained first model as a motion estimation network model.
And S030, establishing another full convolution neural network model with equal input and output sizes, and recording the model as a second model.
And S040, constructing an internal structure of the second model, wherein the internal structure of the second model comprises a front noise reduction module and a rear noise reduction module.
S050, setting a loss function of the front noise reduction module and a loss function of the rear noise reduction module respectively;
and S060, acquiring a plurality of groups of noise-containing pictures and noise-free pictures of different scenes, respectively training the front noise reduction module and the rear noise reduction module, and taking the trained second model as a noise reduction network model.
Specifically, equal input/output size means that the resolution of the picture input to the first model matches the resolution of the picture it outputs. For example, after a 1080P picture is input into the full convolution neural network model and processed, the output is still a 1080P picture. For training the first model, a plurality of noise-containing pictures of the same scene and a plurality of noise-containing pictures of different scenes can be packed and compressed into a database file (HDFS) as the training data set; a deep learning framework is used for training, and the caffe tool can be used as the training tool.
The noise reduction is divided into two levels of front noise reduction and rear noise reduction. Different network sizes and loss functions are set for different levels.
In this embodiment, the motion estimation network model and the noise reduction network model are pre-constructed and trained, so that the whole video noise reduction terminal has both dynamic and static analysis capability and gradient multiple noise reduction capability.
In an embodiment of the present application, the S020 includes S021 to S025:
and S021, acquiring a plurality of noise-containing pictures of the same scene and a plurality of noise-containing pictures of different scenes.
S022, setting at least one still image group, placing a plurality of noise-containing pictures with the same scene into the still image group, setting at least one moving image group, and placing a plurality of noise-containing pictures with different scenes into the moving image group.
S023, capturing an image region at the same position in each picture in the still image group to obtain a local image block, and generating a still tag corresponding to the local image block.
S024, intercepting image areas at the same position in each picture in the motion image group to obtain a local image block, and generating a motion label corresponding to the local image block.
And S025, inputting all the local image blocks and the corresponding static labels or the corresponding motion labels as training data to the first model for training.
Specifically, the image blocks and their corresponding labels are packed and compressed into a corresponding database file (HDFS) as the training data set. A deep learning framework is used for training, and the caffe tool can be used as the training tool.
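A minimal sketch of how such patch-and-label samples could be produced is shown below. The 64x64 patch size, the number of crops and the numeric label convention (0 = still, 1 = motion) are assumptions, and the actual packing into an HDFS database file for caffe is not shown.

```python
import random
import numpy as np

def make_motion_training_samples(image_group, label, patch=64, n_crops=16):
    """Cut image regions at the *same* position out of every picture in a group.

    image_group: list of noisy images (H, W, C) -- same scene for a still
                 group, different scenes for a motion group.
    label:       0 for a still group, 1 for a motion group (assumed convention).
    Returns a list of (stack of local image blocks, label) training samples.
    """
    h, w = image_group[0].shape[:2]
    samples = []
    for _ in range(n_crops):
        y = random.randint(0, h - patch)
        x = random.randint(0, w - patch)
        # the same (y, x) window is taken from every picture in the group
        patches = np.stack([img[y:y + patch, x:x + patch] for img in image_group])
        samples.append((patches, label))
    return samples

# still_group / motion_group: lists of noisy images (numpy arrays) loaded elsewhere
still_samples = make_motion_training_samples(still_group, label=0)
motion_samples = make_motion_training_samples(motion_group, label=1)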
In this embodiment, by using multiple noise-containing pictures of the same scene as training data, the trained first model, i.e. the motion estimation network model, gains the ability to identify stationary regions in a video frame; by using multiple noise-containing pictures of different scenes as training data, it gains the ability to identify motion regions in a video frame. Because this embodiment uses a neural network for motion analysis and estimation, compared with the traditional method that judges motion and stillness directly from the brightness difference between two consecutive frames, the neural network can understand more context information, is less easily affected by the noise itself and by environmental disturbance, and is more accurate. It also better overcomes the smear problem of traditional algorithms.
In an embodiment of the present application, the S050 includes the following S051 to S052:
S051, setting the number of channels at the deepest layer of the network of the front noise reduction module to 512, and selecting L2Loss + SSIM as the loss function of the front noise reduction module.
And S052, setting the number of channels at the deepest layer of the network of the rear noise reduction module to be 256, and selecting L1Loss as a Loss function of the rear noise reduction module.
Specifically, the front noise reduction module targets raw input whose noise has not yet been processed and is large; its main goal is to fit and remove noise with coarse-grain morphological characteristics, so a somewhat larger network scale is chosen, together with a loss function that emphasises overall consistency of the image.
Conversely, the rear noise reduction module targets the fused data whose noise is already weak; its main goal is to fit and remove noise with fine-grain morphological characteristics while preserving more detail information, so the network scale can be moderately smaller, and a loss function that emphasises restoring local detail of the image is chosen. In this embodiment, the number of channels at the deepest layer of the network chosen for the front noise reduction module is 512, and L2Loss + SSIM is selected as its loss function; the number of channels at the deepest layer of the network chosen for the rear noise reduction module is 256, and L1Loss is selected as its loss function.
In this embodiment, by choosing the deepest-layer channel number and the loss function of each network according to the different noise reduction requirements of the front and rear noise reduction modules, the two modules obtain a clear division of labour and each achieve the highest noise reduction efficiency.
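For illustration, the two loss configurations could look like the following PyTorch-style sketch. The SSIM implementation used here (the pytorch-msssim package) and the 0.2 weighting of the SSIM term are assumptions, since this application only names the loss types.

```python
import torch.nn.functional as F
from pytorch_msssim import ssim  # any SSIM implementation would do

def front_loss(pred, target, ssim_weight=0.2):
    """Front (coarse) module: L2 loss plus an SSIM term that emphasises
    overall structural consistency of the image. The 0.2 weight is assumed."""
    l2 = F.mse_loss(pred, target)
    ssim_term = 1.0 - ssim(pred, target, data_range=1.0)
    return l2 + ssim_weight * ssim_term

def rear_loss(pred, target):
    """Rear (fine) module: plain L1 loss, which favours local detail."""
    return F.l1_loss(pred, target)
```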
In an embodiment of the present application, the S060 includes the following S061 to S063:
s061, an original picture set is created, a plurality of groups of noise-containing pictures and noise-free pictures of different scenes are obtained, and all the pictures are placed into the original picture set. Each scene corresponds to a plurality of noisy pictures and a plurality of noiseless pictures. The noise-containing picture is a picture with the exposure time of less than 50 milliseconds, and the noise-free picture is a picture with the exposure time of more than 5 seconds.
S062, screening out a noise-containing picture in the first exposure time from the original picture set, and a noise-free picture in the same scene with the noise-containing picture in the first exposure time, and forming a first type of image pair by the noise-containing picture in the first exposure time and the noise-free picture in the same scene.
And S063, repeatedly executing the previous step S062 to obtain a plurality of first-class image pairs, using the plurality of first-class image pairs as training data of the front noise reduction module, and training the front noise reduction module.
Specifically, this embodiment collects pairs of noise-containing and noise-free pictures from multiple groups of different scenes as the training set of the network. During acquisition, the noise intensity can be controlled by limiting the exposure time and the system gain.
The purpose of the front noise reduction module is coarse noise removal. The shorter the exposure time, the noisier the picture. Therefore, in this embodiment, strong-noise image data with a short exposure time is used as the input of the front noise reduction module, weak-noise image data with a longer exposure time is used as the input of the subsequent rear noise reduction module, and images acquired without limiting the exposure time (i.e. with a long enough exposure time) are regarded as noise-free data and used as the reference data for training.
The exposure times are set according to the shooting environment and equipment; this embodiment only gives one common set of parameters. The shorter the exposure time, the noisier the picture. A noise-containing picture is set as one with an exposure time of less than 50 milliseconds, and a noise-free picture as one with an exposure time of more than 5 seconds. The first exposure time is set to 10 milliseconds or less, and the second exposure time to more than 10 and at most 50 milliseconds. A plurality of first-type image pairs may be packed into a database file format (HDFS), and the front noise reduction module may be trained using a deep learning framework such as caffe.
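A minimal sketch of sorting the original picture set into first-type and second-type image pairs by exposure time follows. The metadata layout (a list of dicts with scene, exposure and image fields) and the choice of the first noise-free picture per scene are assumptions.

```python
def build_training_pairs(picture_set):
    """Split the original picture set into the two pair types by exposure time.

    picture_set is assumed to be a list of dicts like
    {"scene": str, "exposure": float_seconds, "image": ndarray}.
    Returns (first_pairs, second_pairs): data for the front and rear modules.
    """
    first_pairs, second_pairs = [], []
    for scene in {p["scene"] for p in picture_set}:
        in_scene = [p for p in picture_set if p["scene"] == scene]
        clean = [p for p in in_scene if p["exposure"] > 5.0]       # noise-free (> 5 s)
        if not clean:
            continue
        reference = clean[0]["image"]                              # assumed choice
        for noisy in in_scene:
            if noisy["exposure"] <= 0.010:                         # first exposure time
                first_pairs.append((noisy["image"], reference))
            elif 0.010 < noisy["exposure"] <= 0.050:               # second exposure time
                second_pairs.append((noisy["image"], reference))
    return first_pairs, second_pairs
```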
The training in this embodiment also uses a Mixup technique and an activation function clamped above and below with a low slope, which to a certain extent helps control the stability of the model output.
In an embodiment of the present application, the S060 further includes the following S064 to S065:
and S064, screening out a noise-containing picture in the second exposure time and a noise-free picture in the same scene with the noise-containing picture in the second exposure time from the original picture set, and forming a second type of picture pair by the noise-containing picture in the second exposure time and the noise-free picture in the same scene. The second exposure time is greater than the first exposure time.
And S065, repeatedly executing the previous step S064 to obtain a plurality of second-class image pairs, taking the plurality of second-class image pairs as training data of the rear noise reduction module, and training the rear noise reduction module.
Specifically, the second exposure time is greater than the first exposure time. In other words, this embodiment uses weak-noise image data with a longer exposure time as the input of the rear noise reduction module, and regards images acquired without limiting the exposure time (i.e. with a long enough exposure time) as noise-free data used as the reference data for training.
A plurality of second type image pairs may be packed into a database file format (HDFS), and the post-denoising module may be trained using a deep learning framework, caffe.
As pointed out in the previous embodiment, the exposure times are set according to the shooting environment and equipment, and this embodiment only gives one common set of parameters. The shorter the exposure time, the noisier the picture. A noise-containing picture is set as one with an exposure time of less than 50 milliseconds, and a noise-free picture as one with an exposure time of more than 5 seconds. The first exposure time is set to 10 milliseconds or less, and the second exposure time to more than 10 and at most 50 milliseconds.
In this embodiment, the front noise reduction module and the rear noise reduction module are trained respectively, so that the front noise reduction module obtains the capability of removing coarse noise, and the rear noise reduction module obtains the capability of removing fine noise.
In an embodiment of the present application, the S400 includes the following S410 to S420:
and S410, obtaining a dynamic and static analysis result output by the motion estimation network model.
And S420, operating the noise reduction network model, controlling the front noise reduction module to perform spatial domain noise reduction processing on the motion region in the video frame at the current moment based on the dynamic and static analysis results, and taking the video frame after the spatial domain noise reduction processing as the video frame after the coarse noise removal.
Specifically, as shown in fig. 3, the video frame It collected at the current time is sent to the front noise reduction module, which removes coarse noise from It to obtain the denoised video frame Kt. During this noise reduction, based on the dynamic and static analysis result of the motion estimation network model mentioned in the previous step, spatial-domain noise reduction is applied only to motion regions, and static regions are simply skipped, which saves computing resources. The static regions can instead be handled by the dynamic-static fusion part of the subsequent S500.
In this embodiment, a neural network is used for spatial-domain noise reduction. A neural network has strong fitting ability and can obtain clearer image quality than a traditional noise reduction algorithm. Meanwhile, based on the dynamic and static analysis result of the motion estimation network model from the previous step, spatial-domain noise reduction is applied only to the moving parts, which saves computation time and facilitates installation and deployment on devices.
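The following is a minimal PyTorch-style sketch of this masked coarse denoising step, not the patent's implementation: the 0.5 motion threshold, the tensor shapes and the mask-and-blend formulation are illustrative assumptions. An implementation that actually wants to save computation would crop and process only the motion tiles instead of running the denoiser on the full frame and blending afterwards.

```python
import torch

def coarse_denoise_motion_only(frame, motion_prob, front_denoiser, threshold=0.5):
    """Apply the front (coarse) spatial denoiser only to motion regions.

    frame:       (1, 3, H, W) current video frame
    motion_prob: (1, 1, H, W) per-pixel motion probability from the
                 motion estimation network model
    Static regions are passed through untouched; they are handled later by
    the dynamic-static fusion step. The 0.5 threshold is an assumption.
    """
    motion_mask = (motion_prob > threshold).float()
    denoised = front_denoiser(frame)
    # keep denoised pixels in motion regions, original pixels in static regions
    return motion_mask * denoised + (1.0 - motion_mask) * frame
```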
In an embodiment of the present application, the S500 includes the following S510 to S540:
and S510, obtaining a dynamic and static analysis result output by the motion estimation network model.
And S520, based on the dynamic and static analysis result, converting each static area or motion area in the current video frame after the coarse noise removal into a corresponding region fusion coefficient μ through the nonlinear transformation function power(x, 0.1).
S530, based on the dynamic and static analysis result, converting each static area or motion area in the historical state frame into a corresponding region fusion coefficient 1−μ through the nonlinear transformation function power(x, 0.1).
S540, generating a fused video frame according to the following formula 1;
Ct = Kt × μ + Ct-1 × (1 − μ)   (Formula 1)
Here Ct denotes the fused video frame, Kt the current video frame after coarse noise removal, and Ct-1 the historical state frame; μ is the region fusion coefficient corresponding to the current video frame after coarse noise removal, and 1−μ the region fusion coefficient corresponding to the historical state frame. Specifically, during dynamic-static fusion, according to the dynamic and static analysis result, when an image area in the current video frame after coarse noise removal is a static area, the region fusion coefficient 1−μ of the historical state frame dominates; when an image area in the current video frame after coarse noise removal is a motion area, the region fusion coefficient μ corresponding to the current video frame after coarse noise removal dominates.
It can be seen that, in the dynamic-static fusion, the fused video frame is determined mainly by the weight ratio between the current video frame after coarse noise removal and the historical state frame; the two frames are then superposed according to this weight ratio to generate the fused video frame. μ and 1−μ are the weights constraining the two. The principle followed is that static areas rely mainly on the data of the historical state frame, while motion areas rely mainly on the data of the current video frame after coarse noise removal.
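A hedged code sketch of Formula 1 follows. The interpretation that the argument x of power(x, 0.1) is the per-pixel motion probability from the dynamic and static analysis result is an assumption; the code simply expresses Ct = Kt × μ + Ct-1 × (1 − μ) with μ obtained through that nonlinear transform.

```python
import torch

def dynamic_static_fusion(coarse_frame, history_frame, motion_prob):
    """Dynamic-static fusion per Formula 1: Ct = Kt * mu + Ct-1 * (1 - mu).

    coarse_frame  (Kt):   current video frame after coarse noise removal
    history_frame (Ct-1): historical state frame
    motion_prob:          dynamic-static analysis result, assumed here to be
                          the per-pixel motion probability in [0, 1]
    """
    mu = torch.pow(motion_prob, 0.1)   # region fusion coefficient via power(x, 0.1)
    fused = coarse_frame * mu + history_frame * (1.0 - mu)
    return fused  # Ct, stored back as the new historical state frame
```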
In the embodiment, a dynamic and static fusion technology is adopted, information between frames in a video stream is fully utilized, and compared with other noise reduction methods based on a neural network of a picture, the method overcomes the problem of unstable multi-frame sequences of noise reduction effects, eliminates the phenomenon of jumping and flickering, and is more suitable for noise reduction processing of video frame sequences.
In an embodiment of the present application, the S600 includes the following steps:
s610, controlling the post-denoising module to perform airspace denoising processing on the fused video frame, and removing noise of fine particle morphological characteristics in the fused video frame.
Specifically, this step also uses 2D spatial noise reduction, but it is fine noise reduction whose purpose is to remove noise with fine-grain morphological characteristics from the fused video frame. As shown in fig. 3, the final output video frame after the secondary spatial-domain noise reduction is Ot, which is the finally obtained denoised video frame.
In this embodiment, coarse noise reduction of the video frame at the current time removes noise with coarse-grain morphological characteristics, and the secondary spatial-domain noise reduction of the fused video frame then removes noise with fine-grain morphological characteristics, so the noise reduction is gradual and its effect is better. If coarse and fine noise were removed in one mixed step instead, the design complexity of the noise reduction network model would increase and its processing efficiency would drop.
The application also provides a video noise reduction terminal.
As shown in fig. 2, in an embodiment of the present application, the video denoising terminal includes a processor 100, a motion estimation network model 200, a denoising network model 300, and a database 400. The processor 100 is connected to the motion estimation network model 200. One end of the noise reduction network model 300 is connected to the processor 100. The other end of the noise reduction network model 300 is connected to the motion estimation network model 200. The database 400 is connected to the processor 100. The processor 100 is configured to perform the video denoising method provided in any of the foregoing embodiments.
In particular, the database 400 is used to store the historical status frames. The historical state frames in the database 400 are continually updated as different video frames are de-noised.
It should be noted that, for brevity of description, the devices or terminals in the video noise reduction terminal provided in this embodiment that share names with those appearing in the video denoising methods of the foregoing embodiments are only given reference numerals in this embodiment; these same-named devices or terminals include the processor 100, the motion estimation network model 200, the noise reduction network model 300 and the database 400.
The technical features of the above embodiments may be combined arbitrarily, and the order of execution of the method steps is not limited. For simplicity of description, not all possible combinations of the technical features in the embodiments are described; however, as long as there is no contradiction in a combination of technical features, the combination should be considered within the scope of this description.
The above-mentioned embodiments only express several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A video denoising method, comprising:
acquiring a video frame and a historical state frame at the current moment, and inputting the video frame and the historical state frame at the current moment into a motion estimation network model;
running the motion estimation network model to obtain a dynamic and static analysis result output by the motion estimation network model; the dynamic and static analysis result comprises the motion probability and the static probability of each image area in the current video frame;
inputting the video frame of the current moment into a noise reduction network model;
operating the noise reduction network model, and controlling a front noise reduction module in the noise reduction network model to perform coarse noise removal on a video frame at the current moment to obtain the current video frame output by the front noise reduction module after the coarse noise removal;
performing dynamic and static fusion on the current video frame and the historical state frame after the coarse noise is removed to generate a fused video frame and storing the fused video frame as a new historical state frame;
and controlling a post-denoising module in the denoising network model to perform secondary denoising processing on the fused video frame, acquiring the video frame after the secondary denoising processing output by the post-denoising module, and outputting the video frame after the secondary denoising processing as a denoised video frame.
2. The video denoising method of claim 1, wherein before the obtaining the video frame at the current time and the historical state frame, the video denoising method further comprises:
establishing a full convolution neural network model with equal input and output sizes, and recording the model as a first model;
acquiring a plurality of noise-containing pictures in the same scene and a plurality of noise-containing pictures in different scenes as training data, inputting the training data into the first model for training, and taking the trained first model as a motion estimation network model;
establishing another full convolution neural network model with equal input and output sizes, and recording the model as a second model;
constructing an internal structure of the second model, wherein the internal structure of the second model comprises a front noise reduction module and a rear noise reduction module;
respectively setting a loss function of the front noise reduction module and a loss function of the rear noise reduction module;
and acquiring a plurality of groups of noise-containing pictures and noise-free pictures of different scenes, respectively training the front noise reduction module and the rear noise reduction module, and taking the trained second model as a noise reduction network model.
3. The method of claim 2, wherein the obtaining a plurality of noise-containing pictures of the same scene and a plurality of noise-containing pictures of different scenes as training data input to the first model training comprises:
acquiring a plurality of noise-containing pictures of the same scene and a plurality of noise-containing pictures of different scenes;
setting at least one static image group, placing a plurality of noise-containing pictures with the same scene into the static image group, setting at least one moving image group, and placing a plurality of noise-containing pictures with different scenes into the moving image group;
intercepting an image area at the same position in each picture in the still image group to obtain a local image block and generate a still label corresponding to the local image block;
intercepting an image area at the same position in each picture in the motion image group to obtain a local image block and generate a motion tag corresponding to the local image block;
and inputting all the local image blocks and the corresponding static labels or the corresponding motion labels as training data to the first model for training.
4. The method of claim 3, wherein the separately setting the loss function of the front noise reduction module and the loss function of the rear noise reduction module comprises:
setting the number of channels at the deepest layer of the network of the front noise reduction module to be 512, and selecting L2Loss + SSIM as a loss function of the front noise reduction module;
setting the number of channels at the deepest layer of the network of the rear noise reduction module to be 256, and selecting L1Loss as a Loss function of the rear noise reduction module.
5. The method of claim 4, wherein the obtaining a plurality of groups of noisy pictures and noiseless pictures of different scenes, and training the front denoising module and the rear denoising module respectively comprises:
creating an original picture set, acquiring a plurality of groups of noise-containing pictures and noise-free pictures of different scenes, and putting all the pictures into the original picture set; each scene corresponds to a plurality of noise-containing pictures and a plurality of noise-free pictures, the noise-containing pictures are pictures with exposure time less than 50 milliseconds, and the noise-free pictures are pictures with exposure time more than 5 seconds;
screening out a noise-containing picture in the first exposure time from the original picture set, and a noise-free picture in the same scene with the noise-containing picture in the first exposure time, and forming a first-class image pair by the noise-containing picture in the first exposure time and the noise-free picture in the same scene;
and repeatedly executing the previous step to obtain a plurality of first-class image pairs, taking the plurality of first-class image pairs as training data of the front noise reduction module, and training the front noise reduction module.
6. The method of claim 5, wherein the obtaining a plurality of groups of noisy pictures and noiseless pictures of different scenes, and training the front denoising module and the rear denoising module respectively comprises:
screening out a noise-containing picture in the second exposure time from the original picture set, and a noise-free picture in the same scene with the noise-containing picture in the second exposure time, and forming a second type of picture pair by the noise-containing picture in the second exposure time and the noise-free picture in the same scene; the second exposure time is greater than the first exposure time;
and repeatedly executing the previous step to obtain a plurality of second-class image pairs, taking the second-class image pairs as training data of the rear noise reduction module, and training the rear noise reduction module.
7. The method of claim 6, wherein the operating the denoising network model, controlling a front denoising module in the denoising network model to perform coarse denoising on a video frame at a current time, and obtaining the current video frame after the coarse denoising output by the front denoising module, comprises:
acquiring a dynamic and static analysis result output by the motion estimation network model;
and operating the noise reduction network model, controlling the front noise reduction module to perform spatial domain noise reduction processing on a motion region in the video frame at the current moment based on the dynamic and static analysis results, and taking the video frame after the spatial domain noise reduction processing as the current video frame after the coarse noise is removed.
8. The method of claim 7, wherein the performing motion-still fusion on the current video frame and the historical state frame after the coarse noise removal comprises:
acquiring a dynamic and static analysis result output by the motion estimation network model;
based on the dynamic and static analysis result, converting each static area or motion area in the current video frame after the coarse noise removal into an area fusion coefficient mu corresponding to each static area or motion area through a nonlinear transformation function power (x, 0.1);
based on the dynamic and static analysis results, converting each static area or motion area in the historical state frame into an area fusion coefficient 1-mu corresponding to each static area or motion area through a nonlinear transformation function power (x, 0.1);
generating a fused video frame according to the following formula 1;
Ct = Kt × μ + Ct-1 × (1 − μ)   (Formula 1);
wherein Ct is the expression of the fused video frame, Kt is the expression of the current video frame after coarse noise removal, Ct-1 is the expression of the historical state frame, μ is the region fusion coefficient corresponding to the current video frame after coarse noise removal, and 1−μ is the region fusion coefficient corresponding to the historical state frame.
9. The method of claim 8, wherein the controlling a post-denoising module in the denoising network model to perform a secondary denoising process on the fused video frame comprises:
and controlling the rear noise reduction module to perform airspace noise reduction processing on the fused video frame, and removing the noise with the fine particle morphological characteristics in the fused video frame.
10. A video denoising terminal, comprising:
a processor for performing the video denoising method of any one of claims 1-9;
a motion estimation network model coupled to the processor;
one end of the noise reduction network model is connected with the processor, and the other end of the noise reduction network model is connected with the motion estimation network model;
a database coupled to the processor.
CN202110541126.XA 2021-05-18 2021-05-18 Video noise reduction method and video noise reduction terminal Active CN112991235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110541126.XA CN112991235B (en) 2021-05-18 2021-05-18 Video noise reduction method and video noise reduction terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110541126.XA CN112991235B (en) 2021-05-18 2021-05-18 Video noise reduction method and video noise reduction terminal

Publications (2)

Publication Number Publication Date
CN112991235A true CN112991235A (en) 2021-06-18
CN112991235B CN112991235B (en) 2021-10-01

Family

ID=76336760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110541126.XA Active CN112991235B (en) 2021-05-18 2021-05-18 Video noise reduction method and video noise reduction terminal

Country Status (1)

Country Link
CN (1) CN112991235B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487522A (en) * 2021-09-08 2021-10-08 深圳市诚识科技有限公司 Multi-channel switching noise reduction method for image communication
CN114338957A (en) * 2022-03-14 2022-04-12 杭州雄迈集成电路技术股份有限公司 Video denoising method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103905692A (en) * 2012-12-26 2014-07-02 苏州赛源微电子有限公司 Simple 3D noise reduction algorithm base on motion detection
CN106709933A (en) * 2016-11-17 2017-05-24 南京邮电大学 Unsupervised learning-based motion estimation method
US20180061015A1 (en) * 2015-04-16 2018-03-01 Institute Of Automation Chinese Academy Of Sciences Video Denoising System Based on Noise Correlation
CN110852961A (en) * 2019-10-28 2020-02-28 北京影谱科技股份有限公司 Real-time video denoising method and system based on convolutional neural network
CN112233148A (en) * 2020-09-14 2021-01-15 浙江大华技术股份有限公司 Method and apparatus for estimating motion of object, and computer storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103905692A (en) * 2012-12-26 2014-07-02 苏州赛源微电子有限公司 Simple 3D noise reduction algorithm base on motion detection
US20180061015A1 (en) * 2015-04-16 2018-03-01 Institute Of Automation Chinese Academy Of Sciences Video Denoising System Based on Noise Correlation
CN106709933A (en) * 2016-11-17 2017-05-24 南京邮电大学 Unsupervised learning-based motion estimation method
CN110852961A (en) * 2019-10-28 2020-02-28 北京影谱科技股份有限公司 Real-time video denoising method and system based on convolutional neural network
CN112233148A (en) * 2020-09-14 2021-01-15 浙江大华技术股份有限公司 Method and apparatus for estimating motion of object, and computer storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487522A (en) * 2021-09-08 2021-10-08 深圳市诚识科技有限公司 Multi-channel switching noise reduction method for image communication
CN114338957A (en) * 2022-03-14 2022-04-12 杭州雄迈集成电路技术股份有限公司 Video denoising method and system
CN114338957B (en) * 2022-03-14 2022-07-29 杭州雄迈集成电路技术股份有限公司 Video denoising method and system

Also Published As

Publication number Publication date
CN112991235B (en) 2021-10-01

Similar Documents

Publication Publication Date Title
Wei et al. A physics-based noise formation model for extreme low-light raw denoising
CN112991235B (en) Video noise reduction method and video noise reduction terminal
CN111539884B (en) Neural network video deblurring method based on multi-attention mechanism fusion
US8149336B2 (en) Method for digital noise reduction in low light video
Chen et al. Haze removal using radial basis function networks for visibility restoration applications
US11127117B2 (en) Information processing method, information processing apparatus, and recording medium
CN111199522A (en) Single-image blind motion blur removing method for generating countermeasure network based on multi-scale residual errors
CN113592736B (en) Semi-supervised image deblurring method based on fused attention mechanism
CN111462149B (en) Instance human body analysis method based on visual saliency
CN110544213A (en) Image defogging method based on global and local feature fusion
CN115393227B (en) Low-light full-color video image self-adaptive enhancement method and system based on deep learning
CN113034413A (en) Low-illumination image enhancement method based on multi-scale fusion residual error codec
Das et al. A comparative study of single image fog removal methods
Yang et al. Joint image dehazing and super-resolution: Closed shared source residual attention fusion network
CN114913095B (en) Depth deblurring method based on domain adaptation
CN116208812A (en) Video frame inserting method and system based on stereo event and intensity camera
CN115797224A (en) High-dynamic image generation method and device for removing ghosts and storage medium
Tan et al. Two‐Stage CNN Model for Joint Demosaicing and Denoising of Burst Bayer Images
Hu et al. Fast outdoor hazy image dehazing based on saturation and brightness
He et al. Resolution enhancement of video sequences with adaptively weighted low-resolution images and simultaneous estimation of the regularization parameter
CN117726549B (en) Image deblurring method based on event guidance
Bouderbane et al. Real-time ghost free HDR video stream generation using weight adaptation based method
KR102552308B1 (en) Haze removal appraratus and method using unified function based on neural networks
CN117808721B (en) Low-illumination image enhancement method, device, equipment and medium based on deep learning
CN114049732B (en) Substation video monitoring method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Video noise reduction method and video noise reduction terminal

Effective date of registration: 20221109

Granted publication date: 20211001

Pledgee: Zhejiang Fuyang Rural Commercial Bank branch Limited by Share Ltd. Silver Lake

Pledgor: Hangzhou xiongmai integrated circuit technology Co.,Ltd.

Registration number: Y2022980021287

PE01 Entry into force of the registration of the contract for pledge of patent right
CP03 Change of name, title or address

Address after: 311422 4th floor, building 9, Yinhu innovation center, 9 Fuxian Road, Yinhu street, Fuyang District, Hangzhou City, Zhejiang Province

Patentee after: Zhejiang Xinmai Microelectronics Co.,Ltd.

Address before: 311400 4th floor, building 9, Yinhu innovation center, No.9 Fuxian Road, Yinhu street, Fuyang District, Hangzhou City, Zhejiang Province

Patentee before: Hangzhou xiongmai integrated circuit technology Co.,Ltd.

CP03 Change of name, title or address
PC01 Cancellation of the registration of the contract for pledge of patent right

Granted publication date: 20211001

Pledgee: Zhejiang Fuyang Rural Commercial Bank branch Limited by Share Ltd. Silver Lake

Pledgor: Hangzhou xiongmai integrated circuit technology Co.,Ltd.

Registration number: Y2022980021287

PC01 Cancellation of the registration of the contract for pledge of patent right