CN112819743A - General video time domain alignment method based on neural network - Google Patents

General video time domain alignment method based on neural network

Info

Publication number
CN112819743A
Authority
CN
China
Prior art keywords
image frame
neural network
aligned
video
time domain
Prior art date
Legal status
Pending
Application number
CN202110169802.5A
Other languages
Chinese (zh)
Inventor
陈弘林
李茹
谢军伟
童同
高钦泉
罗鸣
Current Assignee
Fujian Imperial Vision Information Technology Co ltd
Original Assignee
Fujian Imperial Vision Information Technology Co ltd
Priority date: 2021-02-08
Filing date: 2021-02-08
Publication date: 2021-05-18
Application filed by Fujian Imperial Vision Information Technology Co ltd filed Critical Fujian Imperial Vision Information Technology Co ltd
Priority to CN202110169802.5A
Publication of CN112819743A
Legal status: Pending

Classifications

    • G06T5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06N3/044: Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06T5/70: Image enhancement or restoration; denoising, smoothing
    • G06T5/73: Image enhancement or restoration; deblurring, sharpening
    • G06T2207/10016: Image acquisition modality; video, image sequence
    • G06T2207/20081: Special algorithmic details; training, learning
    • G06T2207/20084: Special algorithmic details; artificial neural networks [ANN]


Abstract

The invention provides a general video time domain alignment method based on a neural network, comprising the following steps: collecting all original video image frames of the current video; processing the original video image frames with an image processing neural network model to obtain processed image frames; constructing a deep convolutional neural network that can be used for temporal alignment between video image frames; taking the original video image frames and the processed image frames as input and obtaining the output temporally aligned video image frames through the deep convolutional neural network; and synthesizing the output temporally aligned video image frames to obtain the final temporally aligned complete video. The invention solves the temporal inconsistency that arises when an image processing model is applied directly to a video task, with the advantage that a single general algorithm resolves the temporal inconsistency of many different video tasks.

Description

General video time domain alignment method based on neural network
Technical Field
The invention relates to the technical field of video image processing, in particular to a general video time domain alignment method based on a neural network.
Background
In recent years, with the rapid development of computer vision, image processing tasks based on neural networks have made major breakthroughs, and many image processing algorithms exhibit excellent performance on single-image tasks. For example, in image denoising, a well-designed neural network can turn a noisy image into a clean, noise-free image with an excellent visual effect. In practical applications, however, video scenarios are closer to real needs. Most video tasks today are handled by extracting the video into individual image frames, processing the frames one by one with an image processing neural network model, and finally synthesizing the processed frames into the output video. The quality of video obtained in this way is generally poor: because the frames are processed independently, the generated frames are temporally inconsistent, which easily causes video flicker, incoherent video details and similar problems and degrades the overall visual quality of the video.
To ensure the temporal consistency of video, researchers have designed dedicated algorithms for specific video processing tasks such as video colorization, video denoising and video enhancement. Although a task-specific video processing algorithm can improve the temporal consistency of its own task to some extent, a similar strategy cannot be applied directly to other tasks, which greatly limits its usefulness; designing a different processing algorithm for every video processing task also increases development and deployment costs in practical applications.
Disclosure of Invention
In view of this, the present invention provides a general video time domain alignment method based on a neural network that solves the temporal inconsistency caused by applying an image processing model directly to a video task, with the advantage that a single general algorithm resolves the temporal inconsistency of many different video tasks.
The invention is realized by the following scheme: a general video time domain alignment method based on a neural network, comprising the following steps:
collecting all original video image frames of the current video;
processing the original video image frames with an image processing neural network model to obtain processed image frames;
constructing a deep convolutional neural network that can be used for temporal alignment between video image frames;
taking the original video image frames and the processed image frames as input and obtaining the output temporally aligned video image frames through the deep convolutional neural network;
and synthesizing the output temporally aligned video image frames to obtain the final temporally aligned complete video.
Further, the image processing neural network model comprises an image enhancement model, an image denoising model, an image defogging model and an image coloring model.
Further, the deep convolutional neural network for aligning the time domain between the video image frames is a U-Net image transformation network integrating a ConvLSTM convolutional long-short term memory unit layer.
Furthermore, the U-Net image transformation network is an encoder-decoder architecture comprising four down-sampling operations and four up-sampling operations that form a U-shaped structure; after the fourth down-sampling operation, a ConvLSTM convolutional long short-term memory unit layer is inserted, and the up-sampling operations then follow.
Further, the ConvLSTM convolutional long short-term memory unit layer comprises a forget gate, an input gate and an output gate. The forget gate decides which part of the state needs to be forgotten according to the current input X_t and the output H_{t-1} of the previous moment; the input gate decides, according to the current input X_t and the output H_{t-1} of the previous moment, which information to add to the state C_{t-1} of the previous moment in order to generate the new state C_t; and the output gate determines the output H_t of the current moment according to the new state C_t, the output H_{t-1} of the previous moment and the current input X_t.
Further, taking the original video image frames and the processed image frames as input and obtaining the output temporally aligned video image frames through the deep convolutional neural network specifically comprises:
when t is the first frame, t-1 and t-2 are set to t; when t is the second frame, t-1 is set to t; when t is the last frame, t+1 is set to t;
for the t-th original video image frame, the (t-1)-th original video image frame, the (t-1)-th processed image frame, the t-th original video image frame and the t-th processed image frame are taken together as input, a preliminary temporally stable t-th aligned image frame is obtained through the deep convolutional neural network, and the supervised learning training function of the t-th processed image frame and the t-th aligned image frame is calculated;
for the (t-1)-th original video image frame, the (t-2)-th original video image frame, the (t-2)-th processed image frame, the (t-1)-th original video image frame and the (t-1)-th processed image frame are taken together as input, a preliminary temporally stable (t-1)-th aligned image frame is obtained through the deep convolutional neural network, and the supervised learning training function of the (t-1)-th processed image frame and the (t-1)-th aligned image frame is calculated;
for the (t+1)-th original video image frame, the t-th original video image frame, the t-th processed image frame, the (t+1)-th original video image frame and the (t+1)-th processed image frame are taken together as input, a preliminary temporally stable (t+1)-th aligned image frame is obtained through the deep convolutional neural network, and the supervised learning training function of the (t+1)-th processed image frame and the (t+1)-th aligned image frame is calculated;
the supervised learning training functions of the (t-1)-th, t-th and (t+1)-th pairs of processed and aligned image frames are weighted at a ratio of 1:1:1 and summed to obtain the master supervised learning training function finally used to optimize the deep convolutional neural network for the t-th original video image frame.
Further, the supervised learning training function comprises a perceptual loss function, a structural similarity loss function and a total training function;
the perceptual loss function is:
L_perceptual(ŷ, y) = (1 / (C_j H_j W_j)) · || φ_j(ŷ) − φ_j(y) ||²
where the subscript j denotes the j-th layer of the loss network and C_j H_j W_j is the size of the j-th layer feature map; the loss network φ is a pre-trained VGG-19 network; y denotes the processed image frame and ŷ the aligned image frame;
the structural similarity loss function is:
L_SSIM(x, y) = 1 − ((2 μ_x μ_y + c_1)(2 σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))
where x and y denote the processed image frame and the aligned image frame respectively, μ_x and μ_y are their means, σ_x² and σ_y² their variances, σ_xy their covariance, and c_1, c_2 are constants;
the total training function is:
L = L_{t-1} + L_t + L_{t+1}
where L_t denotes the loss function between the t-th processed image frame and the t-th aligned image frame:
L_t = L_perceptual(ŷ_t, y_t) + L_SSIM(ŷ_t, y_t)
in which L_perceptual denotes the perceptual loss function, L_SSIM the structural similarity loss function, ŷ_t the t-th aligned image frame and y_t the t-th processed image frame; L_{t-1} and L_{t+1} are defined analogously, and L is the master supervised training function finally used to optimize the deep convolutional neural network.
The present invention also provides a general video time domain alignment system based on a neural network, comprising a memory, a processor and computer program instructions stored on the memory and executable by the processor, which when executed by the processor, are capable of implementing the method steps as described above.
The present invention also provides a computer readable storage medium having stored thereon computer program instructions executable by a processor, the computer program instructions when executed by the processor being capable of performing the method steps as described above.
Compared with the prior art, the invention has the following beneficial effects:
firstly, the invention proposes using a general algorithm to solve the temporal inconsistency of different video processing tasks, so that image processing neural network models with excellent current performance can be applied directly to the corresponding video processing tasks and meet engineering requirements, greatly reducing the development and application costs traditionally incurred by adopting a different processing algorithm for each video task.
Secondly, the invention adopts a novel neural network structure that combines a convolutional neural network with a recurrent neural network, so that the network has both strong image processing capability and the ability to capture spatio-temporal correlations, giving the video image frames it produces spatio-temporal consistency.
Thirdly, the invention designs a carefully constructed supervised training function that combines a perceptual loss function with a structural similarity loss function, so that the finally output aligned image frames achieve the expected processing effect.
Fourthly, with the method of the invention an image processing model can be applied directly to video processing tasks in the field of computer vision without introducing perceptual problems such as video flicker or incoherent video details, matching users' real viewing experience and earning their approval.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
Fig. 2 shows a network structure and a training function of a neural network-based general video time domain alignment method according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a ConvLSTM long-short term memory unit layer in a network structure of a general video time domain alignment method based on a neural network according to an embodiment of the present invention.
Fig. 4 is a comparison diagram of the effect of the general video time domain alignment method based on the neural network applied to the video coloring task according to the embodiment of the present invention.
Fig. 5 is an effect comparison diagram of the general video time domain alignment method based on the neural network applied to the video denoising task according to the embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the present embodiment provides a general video time domain alignment method based on a neural network, which specifically includes the following steps:
step one, collecting all original video image frames of a current video; specifically, the original video image frame may be a black and white video image frame acquired in a black and white video or a noise-containing video image frame acquired in a noise-containing video.
Step two, processing the original video image frame by an image processing neural network model to obtain a processed image frame;
Step three, constructing a deep convolutional neural network that can be used for temporal alignment between video image frames;
step four, taking the original video image frames and the processed image frames as input and obtaining the output temporally aligned video image frames through the deep convolutional neural network;
and step five, synthesizing the output temporally aligned video image frames to obtain the final temporally aligned complete video.
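To make these five steps concrete, the following is a minimal pipeline sketch in Python (PyTorch and OpenCV). It is a sketch under assumptions, not the patented implementation: image_model stands for any pre-trained image processing network (e.g. a colorization or denoising model), alignment_net stands for the temporal alignment network described below and is assumed to return the aligned frame together with its recurrent state, the four input frames are assumed to be concatenated along the channel dimension, and carrying the ConvLSTM state across frames at inference time is likewise an assumption.

import cv2
import torch

def temporally_align_video(in_path, out_path, image_model, alignment_net, device="cuda"):
    """Steps 1-5 of the embodiment: extract frames, process them frame by frame,
    align them with the temporal network and write the aligned video."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    originals, processed = [], []
    while True:                                   # step 1: collect all original frames
        ok, frame = cap.read()
        if not ok:
            break
        x = torch.from_numpy(frame).permute(2, 0, 1).float().div(255).unsqueeze(0).to(device)
        originals.append(x)
        with torch.no_grad():                     # step 2: per-frame image processing
            processed.append(image_model(x))
    cap.release()

    aligned, state = [], None                     # step 4: temporal alignment network
    for t in range(len(originals)):
        tp = max(t - 1, 0)                        # first frame: reuse frame t as frame t-1
        inp = torch.cat([originals[tp], processed[tp], originals[t], processed[t]], dim=1)
        with torch.no_grad():
            y, state = alignment_net(inp, state)
        aligned.append(y)

    h, w = aligned[0].shape[-2:]                  # step 5: synthesize the aligned video
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for y in aligned:
        img = (y.squeeze(0).permute(1, 2, 0).clamp(0, 1) * 255).byte().cpu().numpy()
        writer.write(img)
    writer.release()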
In this embodiment, the image processing neural network model includes a neural network model used in image processing in the computer vision field, such as an image enhancement model, an image denoising model, an image defogging model, an image coloring model, and the like. Specifically, the processed image frame may be a color video image frame of a black-and-white video processed by an image coloring model or a noise-free video image frame of a noise-containing video processed by an image denoising model.
In this embodiment, the deep convolutional neural network for aligning the time domain between video image frames is a U-Net image transformation network integrating a ConvLSTM convolutional long-short term memory unit layer.
In the present embodiment, as shown in fig. 2, the U-Net image transformation network is an encoder-decoder architecture comprising four down-sampling operations and four up-sampling operations that form a U-shaped structure; after the fourth down-sampling operation, a ConvLSTM convolutional long short-term memory unit layer is inserted, and the up-sampling operations then follow.
The down-sampling operations are implemented by max pooling layers and the up-sampling operations by deconvolution.
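A minimal PyTorch sketch of such an architecture is given below. Only the overall shape comes from the description (four max-pooling down-sampling stages, a ConvLSTM layer after the fourth one, four deconvolution up-sampling stages); the channel widths, kernel sizes, the use of U-Net skip connections, the 12-channel input (two original plus two processed RGB frames concatenated) and the ConvLSTMCell module (a sketch of which follows the gate formulas below) are assumptions for illustration.

import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class AlignmentUNet(nn.Module):
    """U-Net with a ConvLSTM bottleneck; widths and input layout are illustrative.
    Input height and width are assumed to be divisible by 16."""
    def __init__(self, in_ch=12, out_ch=3, base=32):
        super().__init__()
        widths = [base, base * 2, base * 4, base * 8]
        self.encoders = nn.ModuleList()
        c = in_ch
        for w in widths:                              # four down-sampling stages
            self.encoders.append(conv_block(c, w))
            c = w
        self.pool = nn.MaxPool2d(2)                   # down-sampling by max pooling
        self.bottleneck = ConvLSTMCell(c, c)          # ConvLSTM after the 4th down-sampling
        self.upconvs, self.decoders = nn.ModuleList(), nn.ModuleList()
        for w in reversed(widths):                    # four up-sampling stages (deconvolution)
            self.upconvs.append(nn.ConvTranspose2d(c, w, 2, stride=2))
            self.decoders.append(conv_block(w * 2, w))
            c = w
        self.head = nn.Conv2d(c, out_ch, 1)

    def forward(self, x, state=None):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        x, state = self.bottleneck(x, state)          # recurrent bottleneck carries the state
        for up, dec, skip in zip(self.upconvs, self.decoders, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([x, skip], dim=1))      # U-Net style skip connection (assumed)
        return self.head(x), state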
In this embodiment, as shown in fig. 3, the ConvLSTM convolutional long short-term memory unit layer consists mainly of three "gate" structures: a forget gate, an input gate and an output gate. Each "gate" is an operation combining a sigmoid neural network layer with an element-wise multiplication.
The forget gate decides which part of the state needs to be forgotten according to the current input X_t and the output H_{t-1} of the previous moment; the input gate decides, according to the current input X_t and the output H_{t-1} of the previous moment, which information to add to the state C_{t-1} of the previous moment in order to generate the new state C_t; and the output gate determines the output H_t of the current moment according to the new state C_t, the output H_{t-1} of the previous moment and the current input X_t.
The ConvLSTM convolutional long short-term memory unit layer is implemented by the following formulas:
i_t = σ(W_xi ∗ X_t + W_hi ∗ H_{t-1} + W_ci ∘ C_{t-1} + b_i)
f_t = σ(W_xf ∗ X_t + W_hf ∗ H_{t-1} + W_cf ∘ C_{t-1} + b_f)
C_t = f_t ∘ C_{t-1} + i_t ∘ tanh(W_xc ∗ X_t + W_hc ∗ H_{t-1} + b_c)
o_t = σ(W_xo ∗ X_t + W_ho ∗ H_{t-1} + W_co ∘ C_t + b_o)
H_t = o_t ∘ tanh(C_t)
where i_t denotes the input gate, f_t the forget gate, C_t the updated state, o_t the output gate, H_t the output and X_t the current input; ∗ denotes convolution, ∘ denotes element-wise multiplication and σ is the sigmoid function. W_xi, W_hi and W_ci are the input, output and state weights of the input gate and b_i its bias term; W_xf, W_hf, W_cf and b_f are the corresponding weights and bias term of the forget gate; W_xc and W_hc are the input and output weights of the updated state and b_c its bias term; W_xo, W_ho and W_co are the input, output and state weights of the output gate and b_o its bias term.
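A direct PyTorch sketch of these formulas is given below. The kernel size is an assumption, the four gate pre-activations are computed by a single convolution over the concatenated input and previous output, and the peephole weights W_ci, W_cf and W_co are simplified to per-channel parameters rather than full-size Hadamard weights.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """ConvLSTM cell following the gate formulas above (kernel size is an assumption)."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        # One convolution produces the pre-activations of i_t, f_t, the candidate state and o_t,
        # i.e. the W_x* X_t + W_h* H_{t-1} terms for all four gates at once.
        self.conv = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                              kernel_size, padding=kernel_size // 2)
        # Peephole weights W_ci, W_cf, W_co, applied by element-wise multiplication with the state
        # (simplified here to one parameter per channel).
        self.w_ci = nn.Parameter(torch.zeros(1, hidden_channels, 1, 1))
        self.w_cf = nn.Parameter(torch.zeros(1, hidden_channels, 1, 1))
        self.w_co = nn.Parameter(torch.zeros(1, hidden_channels, 1, 1))
        self.hidden_channels = hidden_channels

    def forward(self, x, state=None):
        if state is None:                             # zero state before the first frame
            b, _, h, w = x.shape
            zeros = x.new_zeros(b, self.hidden_channels, h, w)
            state = (zeros, zeros)
        h_prev, c_prev = state
        i_pre, f_pre, g_pre, o_pre = self.conv(torch.cat([x, h_prev], dim=1)).chunk(4, dim=1)
        i = torch.sigmoid(i_pre + self.w_ci * c_prev)                 # input gate i_t
        f = torch.sigmoid(f_pre + self.w_cf * c_prev)                 # forget gate f_t
        c = f * c_prev + i * torch.tanh(g_pre)                        # new state C_t
        o = torch.sigmoid(o_pre + self.w_co * c)                      # output gate o_t
        h = o * torch.tanh(c)                                         # output H_t
        return h, (h, c)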
In this embodiment, taking the original video image frames and the processed image frames as input and obtaining the output temporally aligned video image frames through the deep convolutional neural network specifically includes:
when t is the first frame, t-1 and t-2 are set to t; when t is the second frame, t-1 is set to t; when t is the last frame, t+1 is set to t;
for the t-th original video image frame, the (t-1)-th original video image frame, the (t-1)-th processed image frame, the t-th original video image frame and the t-th processed image frame are taken together as input, a preliminary temporally stable t-th aligned image frame is obtained through the deep convolutional neural network, and the supervised learning training function of the t-th processed image frame and the t-th aligned image frame is calculated;
for the (t-1)-th original video image frame, the (t-2)-th original video image frame, the (t-2)-th processed image frame, the (t-1)-th original video image frame and the (t-1)-th processed image frame are taken together as input, a preliminary temporally stable (t-1)-th aligned image frame is obtained through the deep convolutional neural network, and the supervised learning training function of the (t-1)-th processed image frame and the (t-1)-th aligned image frame is calculated;
for the (t+1)-th original video image frame, the t-th original video image frame, the t-th processed image frame, the (t+1)-th original video image frame and the (t+1)-th processed image frame are taken together as input, a preliminary temporally stable (t+1)-th aligned image frame is obtained through the deep convolutional neural network, and the supervised learning training function of the (t+1)-th processed image frame and the (t+1)-th aligned image frame is calculated;
the supervised learning training functions of the (t-1)-th, t-th and (t+1)-th pairs of processed and aligned image frames are weighted at a ratio of 1:1:1 and summed to obtain the master supervised learning training function finally used to optimize the deep convolutional neural network for the t-th original video image frame.
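The frame indexing and the assembly of the four-frame input described above might look like the following sketch. Boundary indices are simply clamped to the valid range, which is a close but not necessarily identical simplification of the rule stated above; orig and proc are lists of original and processed frame tensors as in the earlier pipeline sketch, alignment_net is the alignment network, and carrying the recurrent state across the three passes is an assumption.

import torch

def clamp_index(t, n_frames):
    """Boundary handling: out-of-range neighbour indices fall back to a valid frame."""
    return min(max(t, 0), n_frames - 1)

def aligned_frame(alignment_net, orig, proc, t, state=None):
    """Aligned frame t from frames t-1 and t of the original and processed sequences."""
    n = len(orig)
    tp, tc = clamp_index(t - 1, n), clamp_index(t, n)
    inp = torch.cat([orig[tp], proc[tp], orig[tc], proc[tc]], dim=1)  # four frames, 12 channels
    return alignment_net(inp, state)

def three_frame_outputs(alignment_net, orig, proc, t):
    """Aligned frames t-1, t and t+1 used by the 1:1:1 training objective."""
    outputs, state = [], None
    for k in (t - 1, t, t + 1):
        y, state = aligned_frame(alignment_net, orig, proc, clamp_index(k, len(orig)), state)
        outputs.append(y)
    return outputs  # [aligned t-1, aligned t, aligned t+1]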
In this embodiment, the supervised learning training function includes a perceptual loss function, a structural similarity loss function, and a total training function;
The perceptual loss function compares the aligned image frame produced by the deep convolutional neural network with the processed image frame:
L_perceptual(ŷ, y) = (1 / (C_j H_j W_j)) · || φ_j(ŷ) − φ_j(y) ||²
where the subscript j denotes the j-th layer of the loss network and C_j H_j W_j is the size of the j-th layer feature map; the loss network φ is a pre-trained VGG-19 network; y denotes the processed image frame and ŷ the aligned image frame.
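A sketch of this perceptual loss on top of the VGG-19 model shipped with torchvision is shown below. The patent only states that VGG-19 is the loss network; the choice of layer (features up to relu3_3) and the omission of ImageNet input normalization are simplifying assumptions.

import torch
import torch.nn as nn
import torchvision

class PerceptualLoss(nn.Module):
    """(1 / (C_j H_j W_j)) * ||phi_j(aligned) - phi_j(processed)||^2 on frozen VGG-19 features.
    The layer choice is an assumption; inputs are assumed to be in [0, 1]."""
    def __init__(self):
        super().__init__()
        weights = torchvision.models.VGG19_Weights.IMAGENET1K_V1
        vgg = torchvision.models.vgg19(weights=weights)
        self.phi = nn.Sequential(*vgg.features[:16]).eval()   # up to relu3_3
        for p in self.phi.parameters():
            p.requires_grad_(False)                            # the loss network is not trained

    def forward(self, aligned, processed):
        f_hat, f = self.phi(aligned), self.phi(processed)
        # Mean over C, H and W equals the sum of squares divided by C_j * H_j * W_j.
        return ((f_hat - f) ** 2).mean(dim=(1, 2, 3)).mean()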
The structural similarity loss function compares the aligned image frame produced by the deep convolutional neural network with the processed image frame:
L_SSIM(x, y) = 1 − ((2 μ_x μ_y + c_1)(2 σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))
where x and y denote the processed image frame and the aligned image frame respectively, μ_x and μ_y are their means, σ_x² and σ_y² their variances, σ_xy their covariance, and c_1, c_2 are constants.
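A global (whole-frame) version of this structural similarity loss is sketched below; practical SSIM implementations usually average the index over local windows, so treating each frame as a single window and using the customary constants for inputs in [0, 1] are simplifications.

import torch

def ssim_loss(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - SSIM computed from global per-frame statistics (x: processed, y: aligned)."""
    dims = (1, 2, 3)
    mu_x, mu_y = x.mean(dim=dims), y.mean(dim=dims)
    var_x = x.var(dim=dims, unbiased=False)                     # sigma_x^2
    var_y = y.var(dim=dims, unbiased=False)                     # sigma_y^2
    cov_xy = ((x - mu_x.view(-1, 1, 1, 1)) *
              (y - mu_y.view(-1, 1, 1, 1))).mean(dim=dims)      # sigma_xy
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return (1 - ssim).mean()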
The total training function compares the (t-1)-th aligned image frame with the (t-1)-th processed image frame, the t-th aligned image frame with the t-th processed image frame, and the (t+1)-th aligned image frame with the (t+1)-th processed image frame, and is continuously optimized during training. The total training function is:
L = L_{t-1} + L_t + L_{t+1}
where L_t denotes the loss function between the t-th processed image frame and the t-th aligned image frame:
L_t = L_perceptual(ŷ_t, y_t) + L_SSIM(ŷ_t, y_t)
in which L_perceptual denotes the perceptual loss function, L_SSIM the structural similarity loss function, ŷ_t the t-th aligned image frame and y_t the t-th processed image frame; L_{t-1} and L_{t+1} are defined analogously. L is the master supervised training function finally used to optimize the deep convolutional neural network.
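Putting the pieces together, one optimization step for a center frame t might look like the sketch below. It reuses three_frame_outputs and clamp_index from the earlier sketch as well as PerceptualLoss and ssim_loss from above, and applies the 1:1:1 weighting of the total training function; the optimizer and any batching strategy are assumptions.

import torch

def train_step(alignment_net, optimizer, perceptual, orig, proc, t):
    """One step minimizing L = L_{t-1} + L_t + L_{t+1} for center frame t."""
    optimizer.zero_grad()
    outputs = three_frame_outputs(alignment_net, orig, proc, t)    # aligned t-1, t, t+1
    total = 0.0
    for k, aligned_k in zip((t - 1, t, t + 1), outputs):
        target = proc[clamp_index(k, len(proc))]                   # supervision: processed frame k
        total = total + perceptual(aligned_k, target) + ssim_loss(target, aligned_k)
    total.backward()                                               # the three losses weighted 1:1:1
    optimizer.step()
    return float(total.detach())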
Finally, the output temporally aligned video image frames are synthesized to obtain a complete video with temporal consistency. In particular, the complete video may be a temporally aligned color video or a temporally aligned noise-free video. As shown in fig. 4 and fig. 5, the comparison results of the video colorization task and the video denoising task show that, without alignment, the temporally inconsistent processed image frames exhibit obvious flicker, abrupt changes in detail and similar artifacts that degrade the visual effect.
The present embodiment also provides a general video time domain alignment system based on a neural network, which includes a memory, a processor, and computer program instructions stored on the memory and capable of being executed by the processor, and when the computer program instructions are executed by the processor, the method steps as described above can be implemented.
The present embodiments also provide a computer readable storage medium having stored thereon computer program instructions executable by a processor, the computer program instructions, when executed by the processor, being capable of performing the method steps as described above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.

Claims (9)

1. A general video time domain alignment method based on a neural network is characterized by comprising the following steps:
collecting all original video image frames of a current video;
processing an original video image frame by an image processing neural network model to obtain a processed image frame;
constructing a deep convolutional neural network which can be used for aligning the time domain between video image frames;
the original video image frame and the processed image frame are used as input, and the output time domain aligned video image frame is obtained through the deep convolutional neural network;
and synthesizing the output time domain aligned video image frames to obtain a final time domain aligned complete video.
2. The neural network-based universal video time domain alignment method according to claim 1, wherein the image processing neural network model comprises an image enhancement model, an image denoising model, an image defogging model and an image coloring model.
3. The method according to claim 1, wherein the deep convolutional neural network for aligning the inter-frame time domain of the video image is a U-Net image transform network that integrates a ConvLSTM convolutional long-short term memory unit layer.
4. The method according to claim 3, wherein the U-Net image transformation network is an encoder-decoder architecture, and comprises four down-sampling operations and four up-sampling operations, forming a U-shaped structure; and after the fourth downsampling operation is carried out, a ConvLSTM convolution long-short term memory unit layer is accessed, and then the upsampling operation is carried out.
5. The neural network-based universal video time domain alignment method according to claim 3, wherein the ConvLSTM convolutional long short-term memory unit layer comprises a forget gate, an input gate and an output gate; the forget gate decides which part of the state needs to be forgotten according to the current input X_t and the output H_{t-1} of the previous moment; the input gate decides, according to the current input X_t and the output H_{t-1} of the previous moment, which information to add to the state C_{t-1} of the previous moment in order to generate the new state C_t; and the output gate determines the output H_t of the current moment according to the new state C_t, the output H_{t-1} of the previous moment and the current input X_t.
6. The method as claimed in claim 1, wherein the obtaining of the output time-domain aligned video image frame by the deep convolutional neural network using the original video image frame and the processed image frame as input specifically comprises:
when t is the first frame, setting t-1 to t and t-2 to t; when t is the second frame, setting t-1 as t; when t is the last frame, setting t +1 as t;
for the t original video image frame, simultaneously taking the t-1 original video image frame, the t-1 processed image frame, the t original video image frame and the t processed image frame as input, and obtaining a preliminary time-domain stable t aligned image frame through the deep convolutional neural network; calculating a supervised learning training function of the tth processed image frame and the tth aligned image frame;
for the t-1 original video image frame, simultaneously taking a t-2 original video image frame, a t-2 processed image frame, a t-1 original video image frame and a t-1 processed image frame as input, obtaining a preliminary time-domain stable t-1 aligned image frame through the deep convolutional neural network, and calculating a supervised learning training function of the t-1 processed image frame and the t-1 aligned image frame;
for the t +1 th original video image frame, simultaneously taking the t th original video image frame, the t th processed image frame, the t +1 th original video image frame and the t +1 th processed image frame as input, obtaining a preliminary time-domain-stable t +1 th aligned image frame through the deep convolutional neural network, and calculating a supervised learning training function of the t +1 th processed image frame and the t +1 th aligned image frame;
and calculating the supervised learning training function of the (t-1)-th processed image frame and the (t-1)-th aligned image frame, the supervised learning training function of the t-th processed image frame and the t-th aligned image frame, and the supervised learning training function of the (t+1)-th processed image frame and the (t+1)-th aligned image frame, weighting them at a ratio of 1:1:1 and summing them to obtain the master supervised learning training function of the deep convolutional neural network finally used for optimizing the processing of the t-th original video image frame.
7. The neural network-based universal video time domain alignment method according to claim 6, wherein the supervised learning training functions include a perceptual loss function, a structural similarity loss function, and a total training function;
the perceptual loss function is:
L_perceptual(ŷ, y) = (1 / (C_j H_j W_j)) · || φ_j(ŷ) − φ_j(y) ||²
wherein the subscript j denotes the j-th layer of the loss network, C_j H_j W_j denotes the size of the j-th layer feature map, the loss network φ is a pre-trained VGG-19 network, y denotes the processed image frame and ŷ denotes the aligned image frame;
the structural similarity loss function is:
L_SSIM(x, y) = 1 − ((2 μ_x μ_y + c_1)(2 σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))
wherein x and y denote the processed image frame and the aligned image frame respectively, μ_x and μ_y denote their means, σ_x² and σ_y² denote their variances, σ_xy denotes their covariance, and c_1 and c_2 are constants;
the total training function is:
L = L_{t-1} + L_t + L_{t+1}
wherein L_t denotes the loss function between the t-th processed image frame and the t-th aligned image frame:
L_t = L_perceptual(ŷ_t, y_t) + L_SSIM(ŷ_t, y_t)
in which L_perceptual denotes the perceptual loss function, L_SSIM denotes the structural similarity loss function, ŷ_t denotes the t-th aligned image frame and y_t denotes the t-th processed image frame; and L is the master supervised training function finally used to optimize the deep convolutional neural network.
8. A neural network based universal video temporal alignment system comprising a memory, a processor and computer program instructions stored on the memory and executable by the processor, the computer program instructions when executed by the processor being capable of implementing the method steps of claims 1-7.
9. A computer-readable storage medium having stored thereon computer program instructions executable by a processor, the computer program instructions when executed by the processor being capable of performing the method steps of claims 1-7.
CN202110169802.5A 2021-02-08 2021-02-08 General video time domain alignment method based on neural network Pending CN112819743A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110169802.5A CN112819743A (en) 2021-02-08 2021-02-08 General video time domain alignment method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110169802.5A CN112819743A (en) 2021-02-08 2021-02-08 General video time domain alignment method based on neural network

Publications (1)

Publication Number Publication Date
CN112819743A 2021-05-18

Family

ID=75862267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110169802.5A Pending CN112819743A (en) 2021-02-08 2021-02-08 General video time domain alignment method based on neural network

Country Status (1)

Country Link
CN (1) CN112819743A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115243073A (en) * 2022-07-22 2022-10-25 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN115243073B (en) * 2022-07-22 2024-05-14 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107529650B (en) Closed loop detection method and device and computer equipment
CN108875935B (en) Natural image target material visual characteristic mapping method based on generation countermeasure network
CN107369166B (en) Target tracking method and system based on multi-resolution neural network
WO2021048607A1 (en) Motion deblurring using neural network architectures
CN112434655B (en) Gait recognition method based on adaptive confidence map convolution network
CN112507990A (en) Video time-space feature learning and extracting method, device, equipment and storage medium
CN111696110B (en) Scene segmentation method and system
CN112598597A (en) Training method of noise reduction model and related device
CN112634296A (en) RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN105046664A (en) Image denoising method based on self-adaptive EPLL algorithm
CA3137297C (en) Adaptive convolutions in neural networks
CN113570516B (en) Image blind motion deblurring method based on CNN-Transformer hybrid self-encoder
Qin et al. Etdnet: An efficient transformer deraining model
CN113095254A (en) Method and system for positioning key points of human body part
Chaurasiya et al. Deep dilated CNN based image denoising
CN112651360A (en) Skeleton action recognition method under small sample
CN115588237A (en) Three-dimensional hand posture estimation method based on monocular RGB image
CN112819743A (en) General video time domain alignment method based on neural network
CN107729821B (en) Video summarization method based on one-dimensional sequence learning
CN109711417B (en) Video saliency detection method based on low-level saliency fusion and geodesic
CN115761242B (en) Denoising method and terminal based on convolutional neural network and fuzzy image characteristics
CN116703772A (en) Image denoising method, system and terminal based on adaptive interpolation algorithm
CN116167015A (en) Dimension emotion analysis method based on joint cross attention mechanism
CN115358952A (en) Image enhancement method, system, equipment and storage medium based on meta-learning
KR102340387B1 (en) Method of learning brain connectivity and system threrfor

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination