CN108257105A - Optical flow estimation and denoising joint learning deep network model for video images - Google Patents

Optical flow estimation and denoising joint learning deep network model for video images

Info

Publication number
CN108257105A
Authority
CN
China
Prior art keywords
optical flow
image
denoising
module
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810081519.5A
Other languages
Chinese (zh)
Other versions
CN108257105B (en)
Inventor
李望秀 (Li Wangxiu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of South China
Original Assignee
University of South China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of South China
Priority to CN201810081519.5A
Publication of CN108257105A
Application granted
Publication of CN108257105B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The present invention discloses an optical flow estimation and denoising joint learning deep network model for video images, belonging to the field of image processing. The model comprises a preprocessing module, an optical flow estimation module and a denoising module, each of which uses an Encoder-Decoder network structure. The preprocessing module is first trained on its own using a sample data set; the parameters of the preprocessing module are then fixed and the preprocessing module and the optical flow estimation module are trained together; finally, the parameters of the preprocessing module and the optical flow estimation module are fixed and the deep network model comprising all three modules is trained as a whole. Once training is complete, the deep network model can perform optical flow estimation and denoising directly on noisy video images. The joint learning deep network model proposed by the present invention performs optical flow estimation and denoising quickly and with high accuracy, which makes it convenient for rapidly processing large volumes of video images in practice.

Description

Optical flow estimation and denoising joint learning deep network model for video images
Technical field
The present invention relates to the field of image processing, and in particular to an optical flow estimation and denoising joint learning deep network model for video images.
Background art
Video images are subject to noise interference during acquisition, compression, storage, transmission and other stages. Noise significantly reduces the visual quality of video images and makes subsequent intelligent analysis such as target recognition and tracking more difficult. It is therefore necessary to remove the noise in video images while preserving the video information, so as to improve the signal-to-noise ratio and the visual effect.
Because video images exhibit temporal correlation, optical flow estimation and video denoising can be combined to obtain a better denoising effect. However, existing joint optical flow estimation and video denoising algorithms require a large number of iterative computations, consuming considerable computing resources and time, which makes them inconvenient to apply in practice; moreover, optical flow estimation is easily disturbed by video noise, which in turn degrades the denoising effect. Proposing a fast and effective joint optical flow estimation and video denoising algorithm is therefore a pressing problem in the field of video image processing.
Summary of the invention
To overcome the above shortcomings, the present invention aims to provide an optical flow estimation and denoising joint learning deep network model for video images, which uses a deep network model to jointly learn optical flow estimation and video denoising from a large number of training samples, so as to solve the problems of low optical flow estimation accuracy, poor denoising effect and long processing time in the prior art.
In order to solve the above technical problems, the technical solution proposed by the present invention is:
An optical flow estimation and denoising joint learning deep network model for video images, characterized in that: the joint deep learning network model comprises three modules: a preprocessing module, an optical flow estimation module and a denoising module. The deep network model is first trained with a sample data set. Then, for the input noisy images im_n1 and im_n2, preliminary denoising is performed by the preprocessing module to obtain the preprocessed image pair im_p1 and im_p2; the optical flow estimation module performs motion estimation on the image pair im_p1 and im_p2 to obtain the optical flow estimation result flow; the noisy image im_n2 is transformed according to the optical flow estimation result flow to obtain the image im_n2'; and the image im_n2' and the noisy image im_n1 are then used as the input images of the denoising module to obtain the final denoised image im_dn corresponding to the noisy image im_n1.
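To make the data flow concrete, the following is a minimal PyTorch sketch of this forward pass (the patent itself uses Caffe, so this is an illustration, not the patent's implementation). The names preprocess, flow_net and denoise_net stand for the three modules; the channel-wise concatenation of the paired inputs and the bilinear warping function are assumptions, and only the ordering of the operations follows the text.

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Warp `image` (N, C, H, W) with a dense flow field `flow` (N, 2, H, W).

    flow[:, 0] is assumed to hold the horizontal and flow[:, 1] the vertical displacement.
    """
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(image.device)   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                              # displaced sampling positions
    # grid_sample expects sampling coordinates normalised to [-1, 1]
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(image, torch.stack((gx, gy), dim=-1), align_corners=True)

def joint_forward(preprocess, flow_net, denoise_net, im_n1, im_n2):
    """Preliminary denoising -> optical flow estimation -> warping -> final denoising."""
    im_p1, im_p2 = preprocess(im_n1), preprocess(im_n2)        # preprocessed pair im_p1, im_p2
    flow = flow_net(torch.cat((im_p1, im_p2), dim=1))          # optical flow estimation result
    im_n2_t = warp(im_n2, flow)                                # im_n2 transformed by the flow (im_n2')
    im_dn = denoise_net(torch.cat((im_n1, im_n2_t), dim=1))    # final denoised image im_dn
    return flow, im_dn
```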
The input noisy images im_n1 and im_n2 are two adjacent frames of a video containing noise.
The sample data set contains no fewer than 20000 samples, where each sample comprises two adjacent noisy frames n1 and n2 of a video, the clean (noise-free) reference images p1 and p2 corresponding to the noisy images n1 and n2, and the optical flow estimation result f corresponding to the image pair p1 and p2.
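One sample of this data set can be pictured as the following record (a sketch only; the field names and the NumPy representation are illustrative and not prescribed by the patent):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingSample:
    n1: np.ndarray   # noisy video frame t, e.g. shape (H, W, 3)
    n2: np.ndarray   # adjacent noisy frame t+1
    p1: np.ndarray   # clean reference image corresponding to n1
    p2: np.ndarray   # clean reference image corresponding to n2
    f: np.ndarray    # ground-truth optical flow between p1 and p2, e.g. shape (H, W, 2)
```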
The specific training method of the deep network model is as follows: using the corresponding data in the sample data set, the preprocessing module is first trained on its own; then the parameters of the preprocessing module are fixed and the preprocessing module and the optical flow estimation module are trained together; finally, the parameters of the preprocessing module and the optical flow estimation module are fixed and the deep network model comprising all three modules is trained as a whole.
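Written as PyTorch-style pseudocode, the three stages could look as follows. Which parameters are frozen at each stage follows the text; the optimizer and initial learning rates are taken from the detailed embodiment further below, and the train_stage1/2/3 training loops are placeholders, not part of the patent.

```python
import torch

def freeze(module: torch.nn.Module) -> None:
    """Fix a module's parameters so that later stages do not update them."""
    for p in module.parameters():
        p.requires_grad = False

def trainable(*modules):
    """Collect the parameters that are still being learned."""
    return [p for m in modules for p in m.parameters() if p.requires_grad]

def staged_training(preprocess, flow_net, denoise_net,
                    train_stage1, train_stage2, train_stage3):
    # Stage 1: train the preprocessing module on its own.
    opt = torch.optim.Adagrad(trainable(preprocess), lr=0.01)
    train_stage1(preprocess, opt)

    # Stage 2: fix the preprocessing parameters, then train the preprocessing and
    # optical flow estimation modules together (only the flow module is updated).
    freeze(preprocess)
    opt = torch.optim.Adagrad(trainable(flow_net), lr=0.01)
    train_stage2(preprocess, flow_net, opt)

    # Stage 3: fix the preprocessing and flow parameters, then train the whole
    # three-module network (only the denoising module is updated).
    freeze(flow_net)
    opt = torch.optim.Adagrad(trainable(denoise_net), lr=0.02)
    train_stage3(preprocess, flow_net, denoise_net, opt)
```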
The preprocessing module, the optical flow estimation module and the denoising module all use Encoder-Decoder network structures.
The Encoder-Decoder network structure comprises an encoding (Encoder) part and a decoding (Decoder) part. The Encoder part comprises M convolutional layers, and the Decoder part comprises M sub-networks, each of which comprises 1 deconvolutional layer and N convolutional layers. When each sub-network layer of the Decoder part performs deconvolution, it calls the image features of the corresponding convolutional layer of the Encoder part, and the output of the previous layer is used as the input of the next layer.
The deep network model is trained with the Caffe deep learning framework.
Advantageous effects of the present invention:
1) The optical flow estimation and denoising joint learning deep network model designed by the present invention can solve the optical flow estimation and denoising problems of noisy videos at the same time in practice. Compared with prior-art joint optical flow estimation and video denoising algorithms based on iterative computation, its optical flow estimation accuracy is higher and the denoising effect aided by the optical flow estimation is better; and once training of the joint learning deep network model is complete, optical flow estimation and denoising are very fast, which makes it convenient for rapidly processing large volumes of video images in practice.
2) For the joint learning deep network model, the present invention first trains individual network modules separately and then trains the whole network model, which effectively reduces the number of network parameters to be learned at each stage of training and prevents the network model from overfitting.
3) The method of the present invention uses a deep learning model to automatically learn the image features for optical flow estimation and image denoising, and performs end-to-end optical flow estimation and image denoising without the assistance of motion boundary estimation; moreover, the Encoder-Decoder deep network model it adopts can fully exploit the multi-dimensional features of the input images, which improves the effect of both optical flow estimation and denoising.
Description of the drawings
Fig. 1 is a structural diagram of the joint learning deep network model of the present invention;
Fig. 2 shows two adjacent noisy frames n1 and n2 of a video in the sample data set;
Fig. 3 shows the clean reference images p1 and p2 corresponding to the noisy images n1 and n2 in the sample data set;
Fig. 4 shows the optical flow estimation result f corresponding to the image pair p1 and p2 in the sample data set;
Fig. 5 is a schematic diagram of the Encoder-Decoder network structure;
Fig. 6 is a structural diagram of a sub-network in the Encoder-Decoder network;
Fig. 7 shows two adjacent noisy frames im_n1 and im_n2 of a video to be processed;
Fig. 8 shows the optical flow estimation result corresponding to the noisy images im_n1 and im_n2;
Fig. 9 shows the denoising result of the noisy image im_n1.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
As shown in Fig. 1, the optical flow estimation and denoising joint learning deep network model for video images provided in this embodiment comprises three modules: a preprocessing module, an optical flow estimation module and a denoising module. A sample data set containing 30000 samples is first constructed, where each sample comprises two adjacent noisy frames n1 and n2 of a video, as shown in Fig. 2, the clean reference images p1 and p2 corresponding to the noisy images n1 and n2, as shown in Fig. 3, and the optical flow estimation result f corresponding to the image pair p1 and p2, as shown in Fig. 4.
The preprocessing module, the optical flow estimation module and the denoising module all use Encoder-Decoder network structures. As shown in Fig. 5, the Encoder-Decoder network structure comprises an encoding (Encoder) part and a decoding (Decoder) part. The Encoder part comprises 6 convolutional layers c1-c6, whose numbers of feature maps are 64, 64, 128, 128, 256 and 512 respectively. The Decoder part comprises 5 sub-networks subnet1-subnet5; the structure of a sub-network is shown in Fig. 6. Each sub-network comprises 1 convolutional layer and 4 deconvolutional layers; the convolutional layer has 64 feature maps, and the 4 deconvolutional layers in each sub-network have 512, 256, 128 and 64 feature maps respectively. When each sub-network layer of the Decoder part performs deconvolution, it calls the image features of the corresponding convolutional layer of the Encoder part, and the output of the previous layer is used as the input of the next layer.
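As a concrete illustration, the PyTorch sketch below builds an Encoder-Decoder of this general shape: a 6-layer convolutional encoder with 64/64/128/128/256/512 feature maps and a decoder of 5 sub-networks, each of which upsamples with a deconvolution, fuses the image features of the matching encoder layer (the skip connection described above), and passes its output on to the next sub-network. The per-sub-network layer composition, kernel sizes and strides are assumptions, since the text admits more than one reading; only the feature-map counts and the reuse of encoder features follow the description.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Illustrative Encoder-Decoder; input height and width should be divisible by 32."""

    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        enc_ch = [64, 64, 128, 128, 256, 512]            # feature maps of c1..c6
        self.encoder = nn.ModuleList()
        prev = in_ch
        for i, ch in enumerate(enc_ch):
            stride = 1 if i == 0 else 2                  # c2..c6 halve the resolution (assumed)
            self.encoder.append(nn.Sequential(
                nn.Conv2d(prev, ch, kernel_size=3, stride=stride, padding=1),
                nn.ReLU(inplace=True)))
            prev = ch
        dec_ch = [512, 256, 128, 64, 64]                 # widths of subnet1..subnet5 (assumed)
        self.deconvs = nn.ModuleList()
        self.fusions = nn.ModuleList()
        for i, ch in enumerate(dec_ch):
            # deconvolution: upsample the output of the previous sub-network
            self.deconvs.append(nn.ConvTranspose2d(prev, ch, kernel_size=4, stride=2, padding=1))
            skip_ch = enc_ch[len(enc_ch) - 2 - i]        # matching encoder layer c5, c4, ..., c1
            # convolution: fuse the upsampled features with the encoder features
            self.fusions.append(nn.Sequential(
                nn.Conv2d(ch + skip_ch, ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True)))
            prev = ch
        self.head = nn.Conv2d(prev, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        feats = []
        for layer in self.encoder:                       # encoding part c1..c6
            x = layer(x)
            feats.append(x)
        for i, (up, fuse) in enumerate(zip(self.deconvs, self.fusions)):
            x = up(x)                                    # deconvolution in sub-network i+1
            skip = feats[len(feats) - 2 - i]             # image features of the matching encoder layer
            x = fuse(torch.cat((x, skip), dim=1))        # output feeds the next sub-network
        return self.head(x)
```

In this interpretation the preprocessing and denoising modules would use out_ch=3 to output an image, while the optical flow estimation module would use out_ch=2 to output a flow field.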
The joint learning deep network model is trained with the sample data set, using the Caffe deep learning environment under Ubuntu and the ADAGRAD optimization algorithm. The preprocessing module is first trained on its own; then the parameters of the preprocessing module are fixed and the preprocessing module and the optical flow estimation module are trained together; finally, the parameters of the preprocessing module and the optical flow estimation module are fixed and the deep network model comprising all three modules is trained as a whole. When the preprocessing module is trained on its own and when the preprocessing module and the optical flow estimation module are trained together, the initial learning rate is 0.01 and the number of training iterations is 600000, with the learning rate divided by 10 at iterations 300000, 400000 and 500000. When the deep network model comprising all three modules is trained as a whole, the initial learning rate is 0.02 and the number of training iterations is 500000, with the learning rate divided by 8 at iterations 200000, 300000 and 400000.
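The step-wise schedule described here can be written down as a small helper (a sketch only; in Caffe this kind of schedule is typically expressed with a multistep learning-rate policy):

```python
def learning_rate(iteration, base_lr, drop_iters, divisor):
    """Learning rate in effect at a given iteration: divided by `divisor` at each drop point."""
    lr = base_lr
    for it in drop_iters:
        if iteration >= it:
            lr /= divisor
    return lr

# Stages 1 and 2: 600000 iterations, initial rate 0.01, divided by 10 at 300k, 400k and 500k.
print(round(learning_rate(450000, 0.01, (300000, 400000, 500000), 10), 6))   # 0.0001
# Stage 3 (whole network): 500000 iterations, initial rate 0.02, divided by 8 at 200k, 300k and 400k.
print(round(learning_rate(450000, 0.02, (200000, 300000, 400000), 8), 8))    # ~3.9e-05
```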
After training of the joint learning deep network model is complete, the model can be used directly to process noisy video images. Two adjacent noisy frames im_n1 and im_n2 of a video, shown in Fig. 7, are input into the model, which quickly produces the optical flow estimation result corresponding to the noisy images im_n1 and im_n2, shown in Fig. 8, and the denoising result of the noisy image im_n1, shown in Fig. 9.
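Continuing the illustrative PyTorch sketch from the summary above (the joint_forward function; the patent's actual implementation is a trained Caffe model, and the file names here are placeholders), processing a pair of adjacent frames would look roughly like this:

```python
import torch
from torchvision.io import read_image

# Load two adjacent noisy frames and scale them to [0, 1].
im_n1 = read_image("frame_0001.png").unsqueeze(0).float() / 255.0
im_n2 = read_image("frame_0002.png").unsqueeze(0).float() / 255.0

with torch.no_grad():                        # inference only, no gradients needed
    flow, im_dn = joint_forward(preprocess, flow_net, denoise_net, im_n1, im_n2)
# `flow` corresponds to the optical flow estimate of Fig. 8,
# `im_dn` to the denoised version of im_n1 shown in Fig. 9.
```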
The above discloses only a preferred embodiment of the present invention, which of course cannot be used to limit the scope of the claims of the present invention. Equivalent variations made in accordance with the claims of the present invention therefore still fall within the scope of the present invention.

Claims (7)

1. An optical flow estimation and denoising joint learning deep network model for video images, characterized in that: the joint deep learning network model comprises three modules: a preprocessing module, an optical flow estimation module and a denoising module; the deep network model is first trained with a sample data set; then, for the input noisy images im_n1 and im_n2, preliminary denoising is performed by the preprocessing module to obtain the preprocessed image pair im_p1 and im_p2; the optical flow estimation module performs motion estimation on the image pair im_p1 and im_p2 to obtain the optical flow estimation result flow; the noisy image im_n2 is transformed according to the optical flow estimation result flow to obtain the image im_n2'; and the image im_n2' and the noisy image im_n1 are used as the input images of the denoising module to obtain the final denoised image im_dn corresponding to the noisy image im_n1.
2. The optical flow estimation and denoising joint learning deep network model for video images according to claim 1, characterized in that: the input noisy images im_n1 and im_n2 are two adjacent frames of a video containing noise.
3. The optical flow estimation and denoising joint learning deep network model for video images according to claim 1, characterized in that: the sample data set contains no fewer than 20000 samples, where each sample comprises two adjacent noisy frames n1 and n2 of a video, the clean reference images p1 and p2 corresponding to the noisy images n1 and n2, and the optical flow estimation result f corresponding to the image pair p1 and p2.
4. The optical flow estimation and denoising joint learning deep network model for video images according to claim 1, characterized in that: the specific training method of the deep network model is as follows: using the corresponding data in the sample data set, the preprocessing module is first trained on its own; then the parameters of the preprocessing module are fixed and the preprocessing module and the optical flow estimation module are trained together; finally, the parameters of the preprocessing module and the optical flow estimation module are fixed and the deep network model comprising all three modules is trained as a whole.
5. The optical flow estimation and denoising joint learning deep network model for video images according to claim 1 or 4, characterized in that: the preprocessing module, the optical flow estimation module and the denoising module all use Encoder-Decoder network structures.
6. The optical flow estimation and denoising joint learning deep network model for video images according to claim 5, characterized in that: the Encoder-Decoder network structure comprises an encoding (Encoder) part and a decoding (Decoder) part; the Encoder part comprises M convolutional layers, and the Decoder part comprises M sub-networks, each of which comprises 1 deconvolutional layer and N convolutional layers; when each sub-network layer of the Decoder part performs deconvolution, it calls the image features of the corresponding convolutional layer of the Encoder part, and the output of the previous layer is used as the input of the next layer.
7. The optical flow estimation and denoising joint learning deep network model for video images according to claim 4, characterized in that: the deep network model is trained with the Caffe deep learning framework.
CN201810081519.5A 2018-01-29 2018-01-29 Optical flow estimation and denoising joint learning depth network model for video image Active CN108257105B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810081519.5A CN108257105B (en) 2018-01-29 2018-01-29 Optical flow estimation and denoising joint learning depth network model for video image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810081519.5A CN108257105B (en) 2018-01-29 2018-01-29 Optical flow estimation and denoising joint learning depth network model for video image

Publications (2)

Publication Number Publication Date
CN108257105A true CN108257105A (en) 2018-07-06
CN108257105B CN108257105B (en) 2021-04-20

Family

ID=62743478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810081519.5A Active CN108257105B (en) 2018-01-29 2018-01-29 Optical flow estimation and denoising joint learning depth network model for video image

Country Status (1)

Country Link
CN (1) CN108257105B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034398A (en) * 2018-08-10 2018-12-18 深圳前海微众银行股份有限公司 Feature selection approach, device and storage medium based on federation's training
CN109165683A (en) * 2018-08-10 2019-01-08 深圳前海微众银行股份有限公司 Sample predictions method, apparatus and storage medium based on federation's training
CN111539879A (en) * 2020-04-15 2020-08-14 清华大学深圳国际研究生院 Video blind denoising method and device based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103686139A (en) * 2013-12-20 2014-03-26 华为技术有限公司 Frame image conversion method, frame video conversion method and frame video conversion device
CN103700117A (en) * 2013-11-21 2014-04-02 北京工业大学 Robust optical flow field estimating method based on TV-L1 variation model
CN106331433A (en) * 2016-08-25 2017-01-11 上海交通大学 Video denoising method based on deep recursive neural network
CN106991646A (en) * 2017-03-28 2017-07-28 福建帝视信息科技有限公司 A kind of image super-resolution method based on intensive connection network
US9767540B2 (en) * 2014-05-16 2017-09-19 Adobe Systems Incorporated Patch partitions and image processing

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103700117A (en) * 2013-11-21 2014-04-02 北京工业大学 Robust optical flow field estimating method based on TV-L1 variation model
CN103686139A (en) * 2013-12-20 2014-03-26 华为技术有限公司 Frame image conversion method, frame video conversion method and frame video conversion device
US9767540B2 (en) * 2014-05-16 2017-09-19 Adobe Systems Incorporated Patch partitions and image processing
CN106331433A (en) * 2016-08-25 2017-01-11 上海交通大学 Video denoising method based on deep recursive neural network
CN106991646A (en) * 2017-03-28 2017-07-28 福建帝视信息科技有限公司 A kind of image super-resolution method based on intensive connection network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A. BUADES et al.: "Patch-Based Video Denoising With Optical Flow Estimation", IEEE Transactions on Image Processing *
许文丹 (Xu Wendan): "Research on Video Signal Compression and Image Stability Algorithms", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034398A (en) * 2018-08-10 2018-12-18 深圳前海微众银行股份有限公司 Feature selection approach, device and storage medium based on federation's training
CN109165683A (en) * 2018-08-10 2019-01-08 深圳前海微众银行股份有限公司 Sample predictions method, apparatus and storage medium based on federation's training
CN109165683B (en) * 2018-08-10 2023-09-12 深圳前海微众银行股份有限公司 Sample prediction method, device and storage medium based on federal training
CN109034398B (en) * 2018-08-10 2023-09-12 深圳前海微众银行股份有限公司 Gradient lifting tree model construction method and device based on federal training and storage medium
CN111539879A (en) * 2020-04-15 2020-08-14 清华大学深圳国际研究生院 Video blind denoising method and device based on deep learning
WO2021208122A1 (en) * 2020-04-15 2021-10-21 清华大学深圳国际研究生院 Blind video denoising method and device based on deep learning
CN111539879B (en) * 2020-04-15 2023-04-14 清华大学深圳国际研究生院 Video blind denoising method and device based on deep learning

Also Published As

Publication number Publication date
CN108257105B (en) 2021-04-20

Similar Documents

Publication Publication Date Title
US11928792B2 (en) Fusion network-based method for image super-resolution and non-uniform motion deblurring
US20210390339A1 (en) Depth estimation and color correction method for monocular underwater images based on deep neural network
CN110533044B (en) Domain adaptive image semantic segmentation method based on GAN
CN108257105A (en) A kind of light stream estimation for video image and denoising combination learning depth network model
CN106600632B (en) A kind of three-dimensional image matching method improving matching cost polymerization
CN101009835A (en) Background-based motion estimation coding method
CN108961227B (en) Image quality evaluation method based on multi-feature fusion of airspace and transform domain
CN110930327A (en) Video denoising method based on cascade depth residual error network
CN110458784A (en) It is a kind of that compression noise method is gone based on image perception quality
CN113052764A (en) Video sequence super-resolution reconstruction method based on residual connection
CN103905815B (en) Based on the video fusion method of evaluating performance of Higher-order Singular value decomposition
CN110276777A (en) A kind of image partition method and device based on depth map study
CN117745596B (en) Cross-modal fusion-based underwater de-blocking method
CN113838102B (en) Optical flow determining method and system based on anisotropic dense convolution
CN105049678A (en) Self-adaptation camera path optimization video stabilization method based on ring winding
CN111080533B (en) Digital zooming method based on self-supervision residual sensing network
CN110120009B (en) Background blurring implementation method based on salient object detection and depth estimation algorithm
CN108010061A (en) A kind of deep learning light stream method of estimation instructed based on moving boundaries
CN116071281A (en) Multi-mode image fusion method based on characteristic information interaction
CN115131254A (en) Constant bit rate compressed video quality enhancement method based on two-domain learning
CN115396683A (en) Video optimization processing method and device, electronic equipment and computer readable medium
CN106709873A (en) Super-resolution method based on cubic spline interpolation and iterative updating
CN109068083B (en) Adaptive motion vector field smoothing method based on square
CN106303538A (en) A kind of video spatial scalable coded method supporting multisource data fusion and framework
CN112529815A (en) Method and system for removing raindrops in real image after rain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant