CN112819743A - General video time domain alignment method based on neural network - Google Patents

General video time domain alignment method based on neural network

Info

Publication number
CN112819743A
Authority
CN
China
Prior art keywords
image frame
neural network
aligned
video
time domain
Prior art date
Legal status
Pending
Application number
CN202110169802.5A
Other languages
Chinese (zh)
Inventor
陈弘林
李茹
谢军伟
童同
高钦泉
罗鸣
Current Assignee
Fujian Imperial Vision Information Technology Co ltd
Original Assignee
Fujian Imperial Vision Information Technology Co ltd
Priority date: 2021-02-08
Filing date: 2021-02-08
Publication date: 2021-05-18
Application filed by Fujian Imperial Vision Information Technology Co ltd filed Critical Fujian Imperial Vision Information Technology Co ltd
Priority to CN202110169802.5A
Publication of CN112819743A
Legal status: Pending

Classifications

    • G06T5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06N3/044: Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06T5/70: Image enhancement or restoration; denoising, smoothing
    • G06T5/73: Image enhancement or restoration; deblurring, sharpening
    • G06T2207/10016: Image acquisition modality; video, image sequence
    • G06T2207/20081: Special algorithmic details; training, learning
    • G06T2207/20084: Special algorithmic details; artificial neural networks [ANN]


Abstract

The invention provides a general video time domain alignment method based on a neural network, comprising the following steps: collecting all original video image frames of the current video; processing the original video image frames with an image processing neural network model to obtain processed image frames; constructing a deep convolutional neural network that can be used for temporal alignment between video image frames; taking the original video image frames and the processed image frames as input and obtaining the output temporally aligned video image frames through the deep convolutional neural network; and synthesizing the output temporally aligned video image frames to obtain the final temporally aligned complete video. The invention solves the temporal inconsistency that arises when an image processing model is applied directly to a video task, with the advantage that a single general algorithm resolves the temporal inconsistency of many different video tasks.

Description

General video time domain alignment method based on neural network
Technical Field
The invention relates to the technical field of video image processing, in particular to a general video time domain alignment method based on a neural network.
Background
In recent years, with the rapid development of computer vision, image processing tasks based on neural networks have made major breakthroughs, and many image processing algorithms exhibit excellent performance on single-image tasks. For example, in image denoising, a well-designed neural network can turn a noisy image into a clean, noise-free image with an excellent visual effect. In practical applications, however, video scenarios are closer to real needs. Most video tasks today are handled by extracting the video into individual image frames, processing the frames one by one with an image processing neural network model, and finally synthesizing the processed frames into the output video. The quality of video obtained in this way is generally poor: because the frames are processed independently, the generated frames are temporally inconsistent, which easily causes video flicker, incoherent video details and similar problems and degrades the overall visual quality of the video.
To ensure the temporal consistency of video, researchers have designed dedicated algorithms for specific video processing tasks such as video colorization, video denoising and video enhancement. Although a task-specific video processing algorithm can improve the temporal consistency of its own task to some extent, a similar strategy cannot be applied directly to other tasks, which greatly limits its usefulness; designing a different processing algorithm for every video processing task also increases development and deployment costs in practical applications.
Disclosure of Invention
In view of this, the present invention provides a general video time domain alignment method based on a neural network that solves the temporal inconsistency caused by applying an image processing model directly to a video task, with the advantage that a single general algorithm resolves the temporal inconsistency of many different video tasks.
The invention is realized by the following scheme: a general video time domain alignment method based on a neural network, comprising the following steps:
collecting all original video image frames of the current video;
processing the original video image frames with an image processing neural network model to obtain processed image frames;
constructing a deep convolutional neural network that can be used for temporal alignment between video image frames;
taking the original video image frames and the processed image frames as input and obtaining the output temporally aligned video image frames through the deep convolutional neural network;
and synthesizing the output temporally aligned video image frames to obtain the final temporally aligned complete video.
Further, the image processing neural network model comprises an image enhancement model, an image denoising model, an image defogging model and an image coloring model.
Further, the deep convolutional neural network for aligning the time domain between the video image frames is a U-Net image transformation network integrating a ConvLSTM convolutional long-short term memory unit layer.
Furthermore, the U-Net image transformation network is an encoder-decoder architecture comprising four down-sampling operations and four up-sampling operations that form a U-shaped structure; after the fourth down-sampling operation, a ConvLSTM convolutional long short-term memory unit layer is inserted, and the up-sampling operations then follow.
Further, the ConvLSTM convolutional long short-term memory unit layer comprises a forget gate, an input gate and an output gate. The forget gate decides which part of the state needs to be forgotten according to the current input X_t and the output H_{t-1} of the previous moment; the input gate decides, according to the current input X_t and the output H_{t-1} of the previous moment, which information to add to the state C_{t-1} of the previous moment in order to generate the new state C_t; and the output gate determines the output H_t of the current moment according to the new state C_t, the output H_{t-1} of the previous moment and the current input X_t.
Further, taking the original video image frames and the processed image frames as input and obtaining the output temporally aligned video image frames through the deep convolutional neural network specifically comprises:
when t is the first frame, t-1 and t-2 are set to t; when t is the second frame, t-1 is set to t; when t is the last frame, t+1 is set to t;
for the t-th original video image frame, the (t-1)-th original video image frame, the (t-1)-th processed image frame, the t-th original video image frame and the t-th processed image frame are taken together as input, a preliminary temporally stable t-th aligned image frame is obtained through the deep convolutional neural network, and the supervised learning training function of the t-th processed image frame and the t-th aligned image frame is calculated;
for the (t-1)-th original video image frame, the (t-2)-th original video image frame, the (t-2)-th processed image frame, the (t-1)-th original video image frame and the (t-1)-th processed image frame are taken together as input, a preliminary temporally stable (t-1)-th aligned image frame is obtained through the deep convolutional neural network, and the supervised learning training function of the (t-1)-th processed image frame and the (t-1)-th aligned image frame is calculated;
for the (t+1)-th original video image frame, the t-th original video image frame, the t-th processed image frame, the (t+1)-th original video image frame and the (t+1)-th processed image frame are taken together as input, a preliminary temporally stable (t+1)-th aligned image frame is obtained through the deep convolutional neural network, and the supervised learning training function of the (t+1)-th processed image frame and the (t+1)-th aligned image frame is calculated;
the supervised learning training functions of the (t-1)-th, t-th and (t+1)-th pairs of processed and aligned image frames are weighted at a ratio of 1:1:1 and summed to obtain the master supervised learning training function finally used to optimize the deep convolutional neural network for the t-th original video image frame.
Further, the supervised learning training function comprises a perceptual loss function, a structural similarity loss function and a total training function;
the perceptual loss function is:
L_perceptual(ŷ, y) = (1 / (C_j H_j W_j)) · || φ_j(ŷ) − φ_j(y) ||²
where the subscript j denotes the j-th layer of the loss network and C_j H_j W_j is the size of the j-th layer feature map; the loss network φ is a pre-trained VGG-19 network; y denotes the processed image frame and ŷ the aligned image frame;
the structural similarity loss function is:
L_SSIM(x, y) = 1 − ((2 μ_x μ_y + c_1)(2 σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))
where x and y denote the processed image frame and the aligned image frame respectively, μ_x and μ_y are their means, σ_x² and σ_y² their variances, σ_xy their covariance, and c_1, c_2 are constants;
the total training function is:
L = L_{t-1} + L_t + L_{t+1}
where L_t denotes the loss function between the t-th processed image frame and the t-th aligned image frame:
L_t = L_perceptual(ŷ_t, y_t) + L_SSIM(ŷ_t, y_t)
in which L_perceptual denotes the perceptual loss function, L_SSIM the structural similarity loss function, ŷ_t the t-th aligned image frame and y_t the t-th processed image frame; L_{t-1} and L_{t+1} are defined analogously, and L is the master supervised training function finally used to optimize the deep convolutional neural network.
The present invention also provides a general video time domain alignment system based on a neural network, comprising a memory, a processor and computer program instructions stored on the memory and executable by the processor, which when executed by the processor, are capable of implementing the method steps as described above.
The present invention also provides a computer readable storage medium having stored thereon computer program instructions executable by a processor, the computer program instructions when executed by the processor being capable of performing the method steps as described above.
Compared with the prior art, the invention has the following beneficial effects:
firstly, the invention proposes using a general algorithm to solve the temporal inconsistency of different video processing tasks, so that image processing neural network models with excellent current performance can be applied directly to the corresponding video processing tasks and meet engineering requirements, greatly reducing the development and application costs traditionally incurred by adopting a different processing algorithm for each video task.
Secondly, the invention adopts a novel neural network structure that combines a convolutional neural network with a recurrent neural network, so that the network has both strong image processing capability and the ability to capture spatio-temporal correlations, giving the video image frames it produces spatio-temporal consistency.
Thirdly, the invention designs a carefully constructed supervised training function that combines a perceptual loss function with a structural similarity loss function, so that the finally output aligned image frames achieve the expected processing effect.
Fourthly, with the method of the invention an image processing model can be applied directly to video processing tasks in the field of computer vision without introducing perceptual problems such as video flicker or incoherent video details, matching users' real viewing experience and earning their approval.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
Fig. 2 shows a network structure and a training function of a neural network-based general video time domain alignment method according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a ConvLSTM long-short term memory unit layer in a network structure of a general video time domain alignment method based on a neural network according to an embodiment of the present invention.
Fig. 4 is a comparison diagram of the effect of the general video time domain alignment method based on the neural network applied to the video coloring task according to the embodiment of the present invention.
Fig. 5 is an effect comparison diagram of the general video time domain alignment method based on the neural network applied to the video denoising task according to the embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the present embodiment provides a general video time domain alignment method based on a neural network, which specifically includes the following steps:
step one, collecting all original video image frames of a current video; specifically, the original video image frame may be a black and white video image frame acquired in a black and white video or a noise-containing video image frame acquired in a noise-containing video.
Step two, processing the original video image frame by an image processing neural network model to obtain a processed image frame;
Step three, constructing a deep convolutional neural network that can be used for temporal alignment between video image frames;
step four, taking the original video image frames and the processed image frames as input and obtaining the output temporally aligned video image frames through the deep convolutional neural network;
and step five, synthesizing the output temporally aligned video image frames to obtain the final temporally aligned complete video.
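To make these five steps concrete, the following is a minimal pipeline sketch in Python (PyTorch and OpenCV). It is a sketch under assumptions, not the patented implementation: image_model stands for any pre-trained image processing network (e.g. a colorization or denoising model), alignment_net stands for the temporal alignment network described below and is assumed to return the aligned frame together with its recurrent state, the four input frames are assumed to be concatenated along the channel dimension, and carrying the ConvLSTM state across frames at inference time is likewise an assumption.

import cv2
import torch

def temporally_align_video(in_path, out_path, image_model, alignment_net, device="cuda"):
    """Steps 1-5 of the embodiment: extract frames, process them frame by frame,
    align them with the temporal network and write the aligned video."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    originals, processed = [], []
    while True:                                   # step 1: collect all original frames
        ok, frame = cap.read()
        if not ok:
            break
        x = torch.from_numpy(frame).permute(2, 0, 1).float().div(255).unsqueeze(0).to(device)
        originals.append(x)
        with torch.no_grad():                     # step 2: per-frame image processing
            processed.append(image_model(x))
    cap.release()

    aligned, state = [], None                     # step 4: temporal alignment network
    for t in range(len(originals)):
        tp = max(t - 1, 0)                        # first frame: reuse frame t as frame t-1
        inp = torch.cat([originals[tp], processed[tp], originals[t], processed[t]], dim=1)
        with torch.no_grad():
            y, state = alignment_net(inp, state)
        aligned.append(y)

    h, w = aligned[0].shape[-2:]                  # step 5: synthesize the aligned video
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for y in aligned:
        img = (y.squeeze(0).permute(1, 2, 0).clamp(0, 1) * 255).byte().cpu().numpy()
        writer.write(img)
    writer.release()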
In this embodiment, the image processing neural network model includes a neural network model used in image processing in the computer vision field, such as an image enhancement model, an image denoising model, an image defogging model, an image coloring model, and the like. Specifically, the processed image frame may be a color video image frame of a black-and-white video processed by an image coloring model or a noise-free video image frame of a noise-containing video processed by an image denoising model.
In this embodiment, the deep convolutional neural network for aligning the time domain between video image frames is a U-Net image transformation network integrating a ConvLSTM convolutional long-short term memory unit layer.
In the present embodiment, as shown in fig. 2, the U-Net image transformation network is an encoder-decoder architecture comprising four down-sampling operations and four up-sampling operations that form a U-shaped structure; after the fourth down-sampling operation, a ConvLSTM convolutional long short-term memory unit layer is inserted, and the up-sampling operations then follow.
The down-sampling operations are implemented by max pooling layers and the up-sampling operations by deconvolution.
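A minimal PyTorch sketch of such an architecture is given below. Only the overall shape comes from the description (four max-pooling down-sampling stages, a ConvLSTM layer after the fourth one, four deconvolution up-sampling stages); the channel widths, kernel sizes, the use of U-Net skip connections, the 12-channel input (two original plus two processed RGB frames concatenated) and the ConvLSTMCell module (a sketch of which follows the gate formulas below) are assumptions for illustration.

import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class AlignmentUNet(nn.Module):
    """U-Net with a ConvLSTM bottleneck; widths and input layout are illustrative.
    Input height and width are assumed to be divisible by 16."""
    def __init__(self, in_ch=12, out_ch=3, base=32):
        super().__init__()
        widths = [base, base * 2, base * 4, base * 8]
        self.encoders = nn.ModuleList()
        c = in_ch
        for w in widths:                              # four down-sampling stages
            self.encoders.append(conv_block(c, w))
            c = w
        self.pool = nn.MaxPool2d(2)                   # down-sampling by max pooling
        self.bottleneck = ConvLSTMCell(c, c)          # ConvLSTM after the 4th down-sampling
        self.upconvs, self.decoders = nn.ModuleList(), nn.ModuleList()
        for w in reversed(widths):                    # four up-sampling stages (deconvolution)
            self.upconvs.append(nn.ConvTranspose2d(c, w, 2, stride=2))
            self.decoders.append(conv_block(w * 2, w))
            c = w
        self.head = nn.Conv2d(c, out_ch, 1)

    def forward(self, x, state=None):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        x, state = self.bottleneck(x, state)          # recurrent bottleneck carries the state
        for up, dec, skip in zip(self.upconvs, self.decoders, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([x, skip], dim=1))      # U-Net style skip connection (assumed)
        return self.head(x), state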
In this embodiment, as shown in fig. 3, the ConvLSTM convolutional long short-term memory unit layer consists mainly of three "gate" structures: a forget gate, an input gate and an output gate. Each "gate" is an operation combining a sigmoid neural network layer with an element-wise multiplication.
The forget gate decides which part of the state needs to be forgotten according to the current input X_t and the output H_{t-1} of the previous moment; the input gate decides, according to the current input X_t and the output H_{t-1} of the previous moment, which information to add to the state C_{t-1} of the previous moment in order to generate the new state C_t; and the output gate determines the output H_t of the current moment according to the new state C_t, the output H_{t-1} of the previous moment and the current input X_t.
The ConvLSTM convolutional long short-term memory unit layer is implemented by the following formulas:
i_t = σ(W_xi ∗ X_t + W_hi ∗ H_{t-1} + W_ci ∘ C_{t-1} + b_i)
f_t = σ(W_xf ∗ X_t + W_hf ∗ H_{t-1} + W_cf ∘ C_{t-1} + b_f)
C_t = f_t ∘ C_{t-1} + i_t ∘ tanh(W_xc ∗ X_t + W_hc ∗ H_{t-1} + b_c)
o_t = σ(W_xo ∗ X_t + W_ho ∗ H_{t-1} + W_co ∘ C_t + b_o)
H_t = o_t ∘ tanh(C_t)
where i_t denotes the input gate, f_t the forget gate, C_t the updated state, o_t the output gate, H_t the output and X_t the current input; ∗ denotes convolution, ∘ denotes element-wise multiplication and σ is the sigmoid function. W_xi, W_hi and W_ci are the input, output and state weights of the input gate and b_i its bias term; W_xf, W_hf, W_cf and b_f are the corresponding weights and bias term of the forget gate; W_xc and W_hc are the input and output weights of the updated state and b_c its bias term; W_xo, W_ho and W_co are the input, output and state weights of the output gate and b_o its bias term.
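A direct PyTorch sketch of these formulas is given below. The kernel size is an assumption, the four gate pre-activations are computed by a single convolution over the concatenated input and previous output, and the peephole weights W_ci, W_cf and W_co are simplified to per-channel parameters rather than full-size Hadamard weights.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """ConvLSTM cell following the gate formulas above (kernel size is an assumption)."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        # One convolution produces the pre-activations of i_t, f_t, the candidate state and o_t,
        # i.e. the W_x* X_t + W_h* H_{t-1} terms for all four gates at once.
        self.conv = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                              kernel_size, padding=kernel_size // 2)
        # Peephole weights W_ci, W_cf, W_co, applied by element-wise multiplication with the state
        # (simplified here to one parameter per channel).
        self.w_ci = nn.Parameter(torch.zeros(1, hidden_channels, 1, 1))
        self.w_cf = nn.Parameter(torch.zeros(1, hidden_channels, 1, 1))
        self.w_co = nn.Parameter(torch.zeros(1, hidden_channels, 1, 1))
        self.hidden_channels = hidden_channels

    def forward(self, x, state=None):
        if state is None:                             # zero state before the first frame
            b, _, h, w = x.shape
            zeros = x.new_zeros(b, self.hidden_channels, h, w)
            state = (zeros, zeros)
        h_prev, c_prev = state
        i_pre, f_pre, g_pre, o_pre = self.conv(torch.cat([x, h_prev], dim=1)).chunk(4, dim=1)
        i = torch.sigmoid(i_pre + self.w_ci * c_prev)                 # input gate i_t
        f = torch.sigmoid(f_pre + self.w_cf * c_prev)                 # forget gate f_t
        c = f * c_prev + i * torch.tanh(g_pre)                        # new state C_t
        o = torch.sigmoid(o_pre + self.w_co * c)                      # output gate o_t
        h = o * torch.tanh(c)                                         # output H_t
        return h, (h, c)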
In this embodiment, taking the original video image frames and the processed image frames as input and obtaining the output temporally aligned video image frames through the deep convolutional neural network specifically includes:
when t is the first frame, t-1 and t-2 are set to t; when t is the second frame, t-1 is set to t; when t is the last frame, t+1 is set to t;
for the t-th original video image frame, the (t-1)-th original video image frame, the (t-1)-th processed image frame, the t-th original video image frame and the t-th processed image frame are taken together as input, a preliminary temporally stable t-th aligned image frame is obtained through the deep convolutional neural network, and the supervised learning training function of the t-th processed image frame and the t-th aligned image frame is calculated;
for the (t-1)-th original video image frame, the (t-2)-th original video image frame, the (t-2)-th processed image frame, the (t-1)-th original video image frame and the (t-1)-th processed image frame are taken together as input, a preliminary temporally stable (t-1)-th aligned image frame is obtained through the deep convolutional neural network, and the supervised learning training function of the (t-1)-th processed image frame and the (t-1)-th aligned image frame is calculated;
for the (t+1)-th original video image frame, the t-th original video image frame, the t-th processed image frame, the (t+1)-th original video image frame and the (t+1)-th processed image frame are taken together as input, a preliminary temporally stable (t+1)-th aligned image frame is obtained through the deep convolutional neural network, and the supervised learning training function of the (t+1)-th processed image frame and the (t+1)-th aligned image frame is calculated;
the supervised learning training functions of the (t-1)-th, t-th and (t+1)-th pairs of processed and aligned image frames are weighted at a ratio of 1:1:1 and summed to obtain the master supervised learning training function finally used to optimize the deep convolutional neural network for the t-th original video image frame.
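The frame indexing and the assembly of the four-frame input described above might look like the following sketch. Boundary indices are simply clamped to the valid range, which is a close but not necessarily identical simplification of the rule stated above; orig and proc are lists of original and processed frame tensors as in the earlier pipeline sketch, alignment_net is the alignment network, and carrying the recurrent state across the three passes is an assumption.

import torch

def clamp_index(t, n_frames):
    """Boundary handling: out-of-range neighbour indices fall back to a valid frame."""
    return min(max(t, 0), n_frames - 1)

def aligned_frame(alignment_net, orig, proc, t, state=None):
    """Aligned frame t from frames t-1 and t of the original and processed sequences."""
    n = len(orig)
    tp, tc = clamp_index(t - 1, n), clamp_index(t, n)
    inp = torch.cat([orig[tp], proc[tp], orig[tc], proc[tc]], dim=1)  # four frames, 12 channels
    return alignment_net(inp, state)

def three_frame_outputs(alignment_net, orig, proc, t):
    """Aligned frames t-1, t and t+1 used by the 1:1:1 training objective."""
    outputs, state = [], None
    for k in (t - 1, t, t + 1):
        y, state = aligned_frame(alignment_net, orig, proc, clamp_index(k, len(orig)), state)
        outputs.append(y)
    return outputs  # [aligned t-1, aligned t, aligned t+1]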
In this embodiment, the supervised learning training function includes a perceptual loss function, a structural similarity loss function, and a total training function;
The perceptual loss function compares the aligned image frame produced by the deep convolutional neural network with the processed image frame:
L_perceptual(ŷ, y) = (1 / (C_j H_j W_j)) · || φ_j(ŷ) − φ_j(y) ||²
where the subscript j denotes the j-th layer of the loss network and C_j H_j W_j is the size of the j-th layer feature map; the loss network φ is a pre-trained VGG-19 network; y denotes the processed image frame and ŷ the aligned image frame.
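A sketch of this perceptual loss on top of the VGG-19 model shipped with torchvision is shown below. The patent only states that VGG-19 is the loss network; the choice of layer (features up to relu3_3) and the omission of ImageNet input normalization are simplifying assumptions.

import torch
import torch.nn as nn
import torchvision

class PerceptualLoss(nn.Module):
    """(1 / (C_j H_j W_j)) * ||phi_j(aligned) - phi_j(processed)||^2 on frozen VGG-19 features.
    The layer choice is an assumption; inputs are assumed to be in [0, 1]."""
    def __init__(self):
        super().__init__()
        weights = torchvision.models.VGG19_Weights.IMAGENET1K_V1
        vgg = torchvision.models.vgg19(weights=weights)
        self.phi = nn.Sequential(*vgg.features[:16]).eval()   # up to relu3_3
        for p in self.phi.parameters():
            p.requires_grad_(False)                            # the loss network is not trained

    def forward(self, aligned, processed):
        f_hat, f = self.phi(aligned), self.phi(processed)
        # Mean over C, H and W equals the sum of squares divided by C_j * H_j * W_j.
        return ((f_hat - f) ** 2).mean(dim=(1, 2, 3)).mean()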
The structural similarity loss function compares the aligned image frame produced by the deep convolutional neural network with the processed image frame:
L_SSIM(x, y) = 1 − ((2 μ_x μ_y + c_1)(2 σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))
where x and y denote the processed image frame and the aligned image frame respectively, μ_x and μ_y are their means, σ_x² and σ_y² their variances, σ_xy their covariance, and c_1, c_2 are constants.
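A global (whole-frame) version of this structural similarity loss is sketched below; practical SSIM implementations usually average the index over local windows, so treating each frame as a single window and using the customary constants for inputs in [0, 1] are simplifications.

import torch

def ssim_loss(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - SSIM computed from global per-frame statistics (x: processed, y: aligned)."""
    dims = (1, 2, 3)
    mu_x, mu_y = x.mean(dim=dims), y.mean(dim=dims)
    var_x = x.var(dim=dims, unbiased=False)                     # sigma_x^2
    var_y = y.var(dim=dims, unbiased=False)                     # sigma_y^2
    cov_xy = ((x - mu_x.view(-1, 1, 1, 1)) *
              (y - mu_y.view(-1, 1, 1, 1))).mean(dim=dims)      # sigma_xy
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return (1 - ssim).mean()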
The total training function compares the (t-1)-th aligned image frame with the (t-1)-th processed image frame, the t-th aligned image frame with the t-th processed image frame, and the (t+1)-th aligned image frame with the (t+1)-th processed image frame, and is continuously optimized during training. The total training function is:
L = L_{t-1} + L_t + L_{t+1}
where L_t denotes the loss function between the t-th processed image frame and the t-th aligned image frame:
L_t = L_perceptual(ŷ_t, y_t) + L_SSIM(ŷ_t, y_t)
in which L_perceptual denotes the perceptual loss function, L_SSIM the structural similarity loss function, ŷ_t the t-th aligned image frame and y_t the t-th processed image frame; L_{t-1} and L_{t+1} are defined analogously. L is the master supervised training function finally used to optimize the deep convolutional neural network.
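Putting the pieces together, one optimization step for a center frame t might look like the sketch below. It reuses three_frame_outputs and clamp_index from the earlier sketch as well as PerceptualLoss and ssim_loss from above, and applies the 1:1:1 weighting of the total training function; the optimizer and any batching strategy are assumptions.

import torch

def train_step(alignment_net, optimizer, perceptual, orig, proc, t):
    """One step minimizing L = L_{t-1} + L_t + L_{t+1} for center frame t."""
    optimizer.zero_grad()
    outputs = three_frame_outputs(alignment_net, orig, proc, t)    # aligned t-1, t, t+1
    total = 0.0
    for k, aligned_k in zip((t - 1, t, t + 1), outputs):
        target = proc[clamp_index(k, len(proc))]                   # supervision: processed frame k
        total = total + perceptual(aligned_k, target) + ssim_loss(target, aligned_k)
    total.backward()                                               # the three losses weighted 1:1:1
    optimizer.step()
    return float(total.detach())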
Finally, the output temporally aligned video image frames are synthesized to obtain a complete video with temporal consistency. In particular, the complete video may be a temporally aligned color video or a temporally aligned noise-free video. As shown in fig. 4 and fig. 5, the comparison results of the video colorization task and the video denoising task show that, without alignment, the temporally inconsistent processed image frames exhibit obvious flicker, abrupt changes in detail and similar artifacts that degrade the visual effect.
The present embodiment also provides a general video time domain alignment system based on a neural network, which includes a memory, a processor, and computer program instructions stored on the memory and capable of being executed by the processor, and when the computer program instructions are executed by the processor, the method steps as described above can be implemented.
The present embodiments also provide a computer readable storage medium having stored thereon computer program instructions executable by a processor, the computer program instructions, when executed by the processor, being capable of performing the method steps as described above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.

Claims (9)

1. A general video time domain alignment method based on a neural network is characterized by comprising the following steps:
collecting all original video image frames of a current video;
processing an original video image frame by an image processing neural network model to obtain a processed image frame;
constructing a deep convolutional neural network which can be used for aligning the time domain between video image frames;
the original video image frame and the processed image frame are used as input, and the output time domain aligned video image frame is obtained through the deep convolutional neural network;
and synthesizing the output time domain aligned video image frames to obtain a final time domain aligned complete video.
2. The neural network-based universal video time domain alignment method according to claim 1, wherein the image processing neural network model comprises an image enhancement model, an image denoising model, an image defogging model and an image coloring model.
3. The method according to claim 1, wherein the deep convolutional neural network for aligning the inter-frame time domain of the video image is a U-Net image transform network that integrates a ConvLSTM convolutional long-short term memory unit layer.
4. The method according to claim 3, wherein the U-Net image transformation network is an encoder-decoder architecture, and comprises four down-sampling operations and four up-sampling operations, forming a U-shaped structure; and after the fourth downsampling operation is carried out, a ConvLSTM convolution long-short term memory unit layer is accessed, and then the upsampling operation is carried out.
5. The neural network-based universal video time domain alignment method according to claim 3, wherein the ConvLSTM convolutional long short-term memory unit layer comprises a forget gate, an input gate and an output gate; the forget gate decides which part of the state needs to be forgotten according to the current input X_t and the output H_{t-1} of the previous moment; the input gate decides, according to the current input X_t and the output H_{t-1} of the previous moment, which information to add to the state C_{t-1} of the previous moment in order to generate the new state C_t; and the output gate determines the output H_t of the current moment according to the new state C_t, the output H_{t-1} of the previous moment and the current input X_t.
6. The method as claimed in claim 1, wherein the obtaining of the output time-domain aligned video image frame by the deep convolutional neural network using the original video image frame and the processed image frame as input specifically comprises:
when t is the first frame, setting t-1 to t and t-2 to t; when t is the second frame, setting t-1 as t; when t is the last frame, setting t +1 as t;
for the t original video image frame, simultaneously taking the t-1 original video image frame, the t-1 processed image frame, the t original video image frame and the t processed image frame as input, and obtaining a preliminary time-domain stable t aligned image frame through the deep convolutional neural network; calculating a supervised learning training function of the tth processed image frame and the tth aligned image frame;
for the t-1 original video image frame, simultaneously taking a t-2 original video image frame, a t-2 processed image frame, a t-1 original video image frame and a t-1 processed image frame as input, obtaining a preliminary time-domain stable t-1 aligned image frame through the deep convolutional neural network, and calculating a supervised learning training function of the t-1 processed image frame and the t-1 aligned image frame;
for the t +1 th original video image frame, simultaneously taking the t th original video image frame, the t th processed image frame, the t +1 th original video image frame and the t +1 th processed image frame as input, obtaining a preliminary time-domain-stable t +1 th aligned image frame through the deep convolutional neural network, and calculating a supervised learning training function of the t +1 th processed image frame and the t +1 th aligned image frame;
and calculating the supervised learning training function of the (t-1)-th processed image frame and the (t-1)-th aligned image frame, the supervised learning training function of the t-th processed image frame and the t-th aligned image frame, and the supervised learning training function of the (t+1)-th processed image frame and the (t+1)-th aligned image frame, weighting them at a ratio of 1:1:1 and summing them to obtain the master supervised learning training function of the deep convolutional neural network finally used for optimizing the processing of the t-th original video image frame.
7. The neural network-based universal video time domain alignment method according to claim 6, wherein the supervised learning training functions include a perceptual loss function, a structural similarity loss function, and a total training function;
the perceptual loss function is:
L_perceptual(ŷ, y) = (1 / (C_j H_j W_j)) · || φ_j(ŷ) − φ_j(y) ||²
wherein the subscript j denotes the j-th layer of the loss network, C_j H_j W_j denotes the size of the j-th layer feature map, the loss network φ is a pre-trained VGG-19 network, y denotes the processed image frame and ŷ denotes the aligned image frame;
the structural similarity loss function is:
L_SSIM(x, y) = 1 − ((2 μ_x μ_y + c_1)(2 σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))
wherein x and y denote the processed image frame and the aligned image frame respectively, μ_x and μ_y denote their means, σ_x² and σ_y² denote their variances, σ_xy denotes their covariance, and c_1 and c_2 are constants;
the total training function is:
L = L_{t-1} + L_t + L_{t+1}
wherein L_t denotes the loss function between the t-th processed image frame and the t-th aligned image frame:
L_t = L_perceptual(ŷ_t, y_t) + L_SSIM(ŷ_t, y_t)
in which L_perceptual denotes the perceptual loss function, L_SSIM denotes the structural similarity loss function, ŷ_t denotes the t-th aligned image frame and y_t denotes the t-th processed image frame; and L is the master supervised training function finally used to optimize the deep convolutional neural network.
8. A neural network based universal video temporal alignment system comprising a memory, a processor and computer program instructions stored on the memory and executable by the processor, the computer program instructions when executed by the processor being capable of implementing the method steps of claims 1-7.
9. A computer-readable storage medium having stored thereon computer program instructions executable by a processor, the computer program instructions when executed by the processor being capable of performing the method steps of claims 1-7.
CN202110169802.5A 2021-02-08 2021-02-08 General video time domain alignment method based on neural network Pending CN112819743A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110169802.5A CN112819743A (en) 2021-02-08 2021-02-08 General video time domain alignment method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110169802.5A CN112819743A (en) 2021-02-08 2021-02-08 General video time domain alignment method based on neural network

Publications (1)

Publication Number Publication Date
CN112819743A 2021-05-18

Family

ID=75862267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110169802.5A Pending CN112819743A (en) 2021-02-08 2021-02-08 General video time domain alignment method based on neural network

Country Status (1)

Country Link
CN (1) CN112819743A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115243073A (en) * 2022-07-22 2022-10-25 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN115243073B (en) * 2022-07-22 2024-05-14 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107529650B (en) Closed loop detection method and device and computer equipment
CN108875935B (en) Natural image target material visual characteristic mapping method based on generation countermeasure network
CN107369166B (en) Target tracking method and system based on multi-resolution neural network
WO2021048607A1 (en) Motion deblurring using neural network architectures
CN112434655B (en) Gait recognition method based on adaptive confidence map convolution network
CN112507990A (en) Video time-space feature learning and extracting method, device, equipment and storage medium
CN111696110B (en) Scene segmentation method and system
CN112598597A (en) Training method of noise reduction model and related device
CN112634296A (en) RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN105046664A (en) Image denoising method based on self-adaptive EPLL algorithm
CA3137297C (en) Adaptive convolutions in neural networks
CN113570516B (en) Image blind motion deblurring method based on CNN-Transformer hybrid self-encoder
Qin et al. Etdnet: An efficient transformer deraining model
CN113095254A (en) Method and system for positioning key points of human body part
Chaurasiya et al. Deep dilated CNN based image denoising
CN112651360A (en) Skeleton action recognition method under small sample
CN115588237A (en) Three-dimensional hand posture estimation method based on monocular RGB image
CN112819743A (en) General video time domain alignment method based on neural network
CN107729821B (en) Video summarization method based on one-dimensional sequence learning
CN109711417B (en) Video saliency detection method based on low-level saliency fusion and geodesic
CN115761242B (en) Denoising method and terminal based on convolutional neural network and fuzzy image characteristics
CN116703772A (en) Image denoising method, system and terminal based on adaptive interpolation algorithm
CN116167015A (en) Dimension emotion analysis method based on joint cross attention mechanism
CN115358952A (en) Image enhancement method, system, equipment and storage medium based on meta-learning
KR102340387B1 (en) Method of learning brain connectivity and system threrfor

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination