CN112819743A - General video time domain alignment method based on neural network - Google Patents
- Publication number
- CN112819743A CN112819743A CN202110169802.5A CN202110169802A CN112819743A CN 112819743 A CN112819743 A CN 112819743A CN 202110169802 A CN202110169802 A CN 202110169802A CN 112819743 A CN112819743 A CN 112819743A
- Authority
- CN
- China
- Prior art keywords
- image frame
- neural network
- aligned
- video
- time domain
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/70—Denoising; Smoothing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/73—Deblurring; Sharpening
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention provides a general video time domain alignment method based on a neural network, comprising the following steps: collecting all original video image frames of a current video; processing the original video image frames with an image processing neural network model to obtain processed image frames; constructing a deep convolutional neural network for aligning the time domain between video image frames; taking the original video image frames and the processed image frames as input and obtaining output time-domain-aligned video image frames through the deep convolutional neural network; and synthesizing the output time-domain-aligned video image frames to obtain a final time-domain-aligned complete video. The invention solves the problem of video time domain inconsistency that arises when an image processing model is applied directly to a video task, and has the advantage of solving the time domain inconsistency of many different video tasks with a single general algorithm.
Description
Technical Field
The invention relates to the technical field of video image processing, in particular to a general video time domain alignment method based on a neural network.
Background
In recent years, with the rapid development of the computer vision field, image processing tasks based on neural networks have made major breakthroughs, and many image processing algorithms exhibit excellent performance on a single image processing task. For example, in the image denoising task, a well-designed neural network can generate a clear, noise-free image from a noisy image, achieving an excellent visual effect. In practical applications, however, video scenes are closer to real requirements. Most video tasks today are handled by extracting the video into a sequence of image frames, processing the frames one by one with an image processing neural network model, and finally synthesizing the processed frames into a video. The quality of video obtained in this way is generally poor: because the frames are processed independently, the time domains of the generated video image frames differ, which easily causes problems such as video flicker and incoherent video detail and degrades the overall visual effect of the video.
To ensure the temporal consistency of video, researchers have designed dedicated algorithms for different video processing tasks, such as video coloring, video denoising and video enhancement. Although a task-specific video processing algorithm can improve the temporal consistency of that task's output to a certain extent, the same strategy cannot be applied directly to other tasks, so such methods are highly limited; moreover, designing a different processing algorithm for each video processing task increases development and deployment costs in practical applications.
Disclosure of Invention
In view of this, the present invention provides a general video time domain alignment method based on a neural network, which solves the problem of video time domain inconsistency caused by directly applying an image processing model to a video task, and has the advantage of using a general algorithm to solve the time domain inconsistency of a plurality of different video tasks.
The invention is realized by adopting the following scheme: a general video time domain alignment method based on a neural network specifically comprises the following steps:
collecting all original video image frames of a current video;
processing an original video image frame by an image processing neural network model to obtain a processed image frame;
constructing a deep convolutional neural network for aligning the time domain between video image frames;
the original video image frame and the processed image frame are used as input, and the output time domain aligned video image frame is obtained through the deep convolutional neural network;
and synthesizing the output time domain aligned video image frames to obtain a final time domain aligned complete video.
Further, the image processing neural network model comprises an image enhancement model, an image denoising model, an image defogging model and an image coloring model.
Further, the deep convolutional neural network for aligning the time domain between the video image frames is a U-Net image transformation network integrating a ConvLSTM convolutional long-short term memory unit layer.
Furthermore, the U-Net image transformation network is an encoder-decoder architecture comprising four down-sampling operations and four up-sampling operations, forming a U-shaped structure; after the fourth downsampling operation, a ConvLSTM convolutional long-short term memory unit layer is inserted, followed by the upsampling operations.
Further, the ConvLSTM convolutional long-short term memory unit layer comprises a forgetting gate, an input gate and an output gate. The forgetting gate decides from the current input $x_t$ and the output of the previous moment $h_{t-1}$ which part of the previous state needs to be forgotten; the input gate decides from the current input $x_t$ and the output of the previous moment $h_{t-1}$ which information to add to the state of the previous moment $c_{t-1}$ in generating a new state $c_t$; the output gate determines the output at that moment $h_t$ from the latest state $c_t$, the output of the last moment $h_{t-1}$ and the current input $x_t$.
Further, the obtaining of the output time-domain aligned video image frame by using the original video image frame and the processed image frame as input through the deep convolutional neural network specifically includes:
when t is the first frame, setting t-1 to t and t-2 to t; when t is the second frame, setting t-1 as t; when t is the last frame, setting t +1 as t;
for the t original video image frame, simultaneously taking the t-1 original video image frame, the t-1 processed image frame, the t original video image frame and the t processed image frame as input, and obtaining a preliminary time-domain stable t aligned image frame through the deep convolutional neural network; calculating a supervised learning training function of the tth processed image frame and the tth aligned image frame;
for the t-1 original video image frame, simultaneously taking a t-2 original video image frame, a t-2 processed image frame, a t-1 original video image frame and a t-1 processed image frame as input, obtaining a preliminary time-domain stable t-1 aligned image frame through the deep convolutional neural network, and calculating a supervised learning training function of the t-1 processed image frame and the t-1 aligned image frame;
for the t +1 th original video image frame, simultaneously taking the t th original video image frame, the t th processed image frame, the t +1 th original video image frame and the t +1 th processed image frame as input, obtaining a preliminary time-domain-stable t +1 th aligned image frame through the deep convolutional neural network, and calculating a supervised learning training function of the t +1 th processed image frame and the t +1 th aligned image frame;
and summing the supervised learning training function of the t-1 th processed image frame and the t-1 th aligned image frame, the supervised learning training function of the t th processed image frame and the t th aligned image frame, and the supervised learning training function of the t+1 th processed image frame and the t+1 th aligned image frame with equal 1:1:1 weights, to obtain the master supervised learning training function of the deep convolutional neural network finally used to optimize the processing of the t th original video image frame.
Further, the supervised learning training function comprises a perceptual loss function, a structural similarity loss function and a total training function;
the perceptual loss function is as follows:
in which the index j denotes the j-th layer of the network, CjHjWjRepresenting the size of the jth layer feature map; the loss network uses a VGG-19 pre-training network, and phi represents the network; y denotes the image frame to be processed,representing an aligned image frame;
the structural similarity loss function is as follows:
wherein x and y represent processed image frame and aligned image frame, respectively, muxAnd muyRepresents the average values of x, y,andrespectively representing the variance, σ, of x, yxyRepresents the covariance of x and y, c1,c2Are respectively a constant;
the overall training function is:
L=Lt 1+Lt+Lt+1;
Wherein L istRepresenting the loss function between the tth processed image frame and the tth aligned image frame, is expressed as follows:
in the formula (I), the compound is shown in the specification,representing a function of the perceptual loss as a function of,a loss function of similarity of the illustrated structure is represented,representing the t-th aligned image frame, ytRepresenting the t-th processed image frame;
and $L$ is the master supervised training function finally used to optimize the deep convolutional neural network.
The present invention also provides a general video time domain alignment system based on a neural network, comprising a memory, a processor and computer program instructions stored on the memory and executable by the processor, which when executed by the processor, are capable of implementing the method steps as described above.
The present invention also provides a computer readable storage medium having stored thereon computer program instructions executable by a processor, the computer program instructions when executed by the processor being capable of performing the method steps as described above.
Compared with the prior art, the invention has the following beneficial effects:
First, the invention proposes using a general algorithm to solve the video time domain inconsistency of different video processing tasks, so that currently excellent image processing neural network models can be applied directly to the corresponding video processing tasks, meeting engineering requirements and greatly reducing the traditional development and application costs of adopting a different processing algorithm for each video task.
Second, the invention adopts a novel neural network structure combining a convolutional neural network and a recurrent neural network; this network has both excellent image processing capability and the ability to capture spatial-temporal correlation, so that video image frames processed by it have spatial-temporal consistency.
Third, the invention designs a carefully constructed supervised training function that combines a perceptual loss function and a structural similarity loss function, so that the finally output aligned image frames achieve the expected processing effect.
Fourth, the method of the invention allows image processing models to be applied directly to video processing tasks in the field of computer vision without producing perceptual problems such as video flicker and incoherent video detail, matching users' real-life viewing experience.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
Fig. 2 shows a network structure and a training function of a neural network-based general video time domain alignment method according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a ConvLSTM long-short term memory unit layer in a network structure of a general video time domain alignment method based on a neural network according to an embodiment of the present invention.
Fig. 4 is a comparison diagram of the effect of the general video time domain alignment method based on the neural network applied to the video coloring task according to the embodiment of the present invention.
Fig. 5 is an effect comparison diagram of the general video time domain alignment method based on the neural network applied to the video denoising task according to the embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the present embodiment provides a general video time domain alignment method based on a neural network, which specifically includes the following steps:
step one, collecting all original video image frames of a current video; specifically, the original video image frame may be a black and white video image frame acquired in a black and white video or a noise-containing video image frame acquired in a noise-containing video.
Step two, processing the original video image frame by an image processing neural network model to obtain a processed image frame;
step three, constructing a deep convolutional neural network for aligning the time domain between video image frames;
step four, the original video image frame and the processed image frame are used as input, and the output time domain aligned video image frame is obtained through the deep convolution neural network;
and step five, synthesizing the output time domain aligned video image frames to obtain a final time domain aligned complete video.
In this embodiment, the image processing neural network model includes a neural network model used in image processing in the computer vision field, such as an image enhancement model, an image denoising model, an image defogging model, an image coloring model, and the like. Specifically, the processed image frame may be a color video image frame of a black-and-white video processed by an image coloring model or a noise-free video image frame of a noise-containing video processed by an image denoising model.
In this embodiment, the deep convolutional neural network for aligning the time domain between video image frames is a U-Net image transformation network integrating a ConvLSTM convolutional long-short term memory unit layer.
In the present embodiment, as shown in fig. 2, the U-Net image transform network is an encoder-decoder architecture, and includes four down-sampling operations and four up-sampling operations to form a U-shaped structure; and after the fourth downsampling operation is carried out, a ConvLSTM convolution long-short term memory unit layer is accessed, and then the upsampling operation is carried out.
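As a rough sanity check of the U-shaped structure described above, the following sketch (an illustration only, not part of the patented implementation) traces the spatial size of the feature maps through the four downsampling and four upsampling operations, with the ConvLSTM bottleneck assumed to preserve shape:

```python
def unet_shape_trace(h, w):
    """Trace feature-map sizes through the U-shaped encoder-decoder.

    Assumes each max-pool halves and each deconvolution doubles the
    spatial size; the ConvLSTM bottleneck leaves the shape unchanged.
    """
    sizes = [(h, w)]
    for _ in range(4):            # four downsampling (max-pool) steps
        h, w = h // 2, w // 2
        sizes.append((h, w))
    # ConvLSTM convolutional long-short term memory layer: same shape
    for _ in range(4):            # four upsampling (deconvolution) steps
        h, w = h * 2, w * 2
        sizes.append((h, w))
    return sizes
```

For a 256x256 input frame this places the ConvLSTM bottleneck at 16x16 and returns the output to 256x256, matching the symmetric U-shaped structure.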
Wherein the downsampling operation is implemented by a maximum pooling layer and the upsampling operation is implemented by deconvolution.
In this embodiment, as shown in fig. 3, the ConvLSTM convolutional long-short term memory unit layer is mainly composed of three "gate" structures: a forgetting gate, an input gate, and an output gate. Each "gate" structure is an operation that uses a sigmoid neural network layer and an element-wise multiplication.
The forgetting gate decides from the current input $x_t$ and the output of the previous moment $h_{t-1}$ which part of the previous state needs to be forgotten; the input gate decides from the current input $x_t$ and the output of the previous moment $h_{t-1}$ which information to add to the state of the previous moment $c_{t-1}$ in generating a new state $c_t$; the output gate determines the output at that moment $h_t$ from the latest state $c_t$, the output of the last moment $h_{t-1}$ and the current input $x_t$.
The concrete implementation formulas of the ConvLSTM convolutional long-short term memory unit layer are as follows:

$$i_t = \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + W_{ci} \circ c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + W_{cf} \circ c_{t-1} + b_f)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + W_{co} \circ c_t + b_o)$$
$$h_t = o_t \circ \tanh(c_t)$$

wherein $i_t$ represents the input gate, $f_t$ the forgetting gate, $c_t$ the updated state, $o_t$ the output gate, $h_t$ the output and $x_t$ the current input; $*$ denotes convolution, $\circ$ the Hadamard (element-wise) product and $\sigma$ the sigmoid function. $W_{xi}$, $W_{hi}$ and $W_{ci}$ represent the input, output and updated-state weights of the input gate and $b_i$ its bias term; $W_{xf}$, $W_{hf}$ and $W_{cf}$ the corresponding weights of the forgetting gate and $b_f$ its bias term; $W_{xc}$ and $W_{hc}$ the input and output weights of the updated state and $b_c$ its bias term; $W_{xo}$, $W_{ho}$ and $W_{co}$ the input, output and updated-state weights of the output gate and $b_o$ its bias term.
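The gating algebra of the ConvLSTM layer can be checked numerically. The snippet below is an illustrative simplification (not the patented implementation): the convolutions are replaced by scalar 1x1 weights so the equations run on small arrays, and `W` is a hypothetical dictionary of weights and biases.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x_t, h_prev, c_prev, W):
    # Input, forgetting and output gates of the ConvLSTM cell;
    # convolutions are reduced to scalar (1x1) weights for brevity.
    i_t = sigmoid(W['xi'] * x_t + W['hi'] * h_prev + W['ci'] * c_prev + W['bi'])
    f_t = sigmoid(W['xf'] * x_t + W['hf'] * h_prev + W['cf'] * c_prev + W['bf'])
    c_t = f_t * c_prev + i_t * np.tanh(W['xc'] * x_t + W['hc'] * h_prev + W['bc'])
    o_t = sigmoid(W['xo'] * x_t + W['ho'] * h_prev + W['co'] * c_t + W['bo'])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```

Each gate is a sigmoid of a weighted combination of the current input, the previous output and a state term, exactly mirroring the three-gate description in the text; shapes of the hidden state and cell state are preserved across steps.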
In this embodiment, the obtaining, by using the original video image frame and the processed image frame as inputs and using the deep convolutional neural network, an output video image frame aligned in a time domain specifically includes:
when t is the first frame, setting t-1 to t and t-2 to t; when t is the second frame, setting t-1 as t; when t is the last frame, setting t +1 as t;
for the t original video image frame, simultaneously taking the t-1 original video image frame, the t-1 processed image frame, the t original video image frame and the t processed image frame as input, and obtaining a preliminary time-domain stable t aligned image frame through the deep convolutional neural network; calculating a supervised learning training function of the tth processed image frame and the tth aligned image frame;
for the t-1 original video image frame, simultaneously taking a t-2 original video image frame, a t-2 processed image frame, a t-1 original video image frame and a t-1 processed image frame as input, obtaining a preliminary time-domain stable t-1 aligned image frame through the deep convolutional neural network, and calculating a supervised learning training function of the t-1 processed image frame and the t-1 aligned image frame;
for the t +1 th original video image frame, simultaneously taking the t th original video image frame, the t th processed image frame, the t +1 th original video image frame and the t +1 th processed image frame as input, obtaining a preliminary time-domain-stable t +1 th aligned image frame through the deep convolutional neural network, and calculating a supervised learning training function of the t +1 th processed image frame and the t +1 th aligned image frame;
and summing the supervised learning training function of the t-1 th processed image frame and the t-1 th aligned image frame, the supervised learning training function of the t th processed image frame and the t th aligned image frame, and the supervised learning training function of the t+1 th processed image frame and the t+1 th aligned image frame with equal 1:1:1 weights, to obtain the master supervised learning training function of the deep convolutional neural network finally used to optimize the processing of the t th original video image frame.
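A minimal sketch of the frame-index boundary rule above (one reading of the translated rule, under the assumption that any out-of-range neighbour index is simply clamped to the nearest valid frame):

```python
def neighbor_indices(t, num_frames):
    """Indices of the frames involved for current frame t (0-based).

    Out-of-range neighbours are clamped into [0, num_frames - 1],
    which reproduces the stated substitutions at the first and last
    frames of the video.
    """
    clamp = lambda i: max(0, min(i, num_frames - 1))
    return clamp(t - 2), clamp(t - 1), t, clamp(t + 1)
```

At the first frame both t-2 and t-1 collapse onto frame 0 itself, and at the last frame t+1 collapses onto the last frame, as the embodiment requires.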
In this embodiment, the supervised learning training function includes a perceptual loss function, a structural similarity loss function, and a total training function;
The perceptual loss function compares the aligned image frame produced by the deep convolutional neural network with the processed image frame. Its formula is as follows:

$$L_{perc}(\hat{y}, y) = \frac{1}{C_j H_j W_j}\left\|\phi_j(\hat{y}) - \phi_j(y)\right\|_2^2$$

in which the index $j$ denotes the $j$-th layer of the loss network and $C_j H_j W_j$ represents the size of the $j$-th layer feature map; the loss network is a VGG-19 pre-trained network, denoted $\phi$; $y$ denotes the processed image frame and $\hat{y}$ represents the aligned image frame.
The structural similarity loss function likewise compares the aligned image frame produced by the deep convolutional neural network with the processed image frame. Its formula is specifically as follows:

$$SSIM(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}, \qquad L_{ssim}(x, y) = 1 - SSIM(x, y)$$

wherein $x$ and $y$ represent the processed image frame and the aligned image frame respectively, $\mu_x$ and $\mu_y$ their means, $\sigma_x^2$ and $\sigma_y^2$ their variances, $\sigma_{xy}$ the covariance of $x$ and $y$, and $c_1$, $c_2$ constants.
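The structural similarity term can be sketched with global image statistics, a simplification of the usual windowed SSIM. The constants below follow the conventional choice $c_1 = 0.01^2$, $c_2 = 0.03^2$ for images in $[0, 1]$, which is an assumption of this sketch and is not stated in the patent:

```python
import numpy as np

def ssim_loss(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - SSIM between a processed frame x and an aligned frame y,
    computed from global (whole-image) statistics."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
    return 1.0 - ssim
```

The loss is zero when the two frames are identical and grows as their luminance, contrast and structure diverge, which is what drives the aligned frame toward the processed frame during training.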
The total training function compares the t-1 th aligned image frame with the t-1 th processed image frame, the t th aligned image frame with the t th processed image frame, and the t+1 th aligned image frame with the t+1 th processed image frame, all produced by the deep convolutional neural network, and is continuously optimized. The total training function is specifically:

$$L = L_{t-1} + L_t + L_{t+1}$$

wherein $L_t$ represents the loss function between the t-th processed image frame and the t-th aligned image frame, expressed as follows:

$$L_t = L_{perc}(\hat{y}_t, y_t) + L_{ssim}(\hat{y}_t, y_t)$$

in which $L_{perc}$ represents the perceptual loss function, $L_{ssim}$ the structural similarity loss function, $\hat{y}_t$ the t-th aligned image frame and $y_t$ the t-th processed image frame. $L$ is the master supervised training function finally used to optimize the deep convolutional neural network.
Finally, the output time-domain-aligned video image frames are synthesized to obtain a complete video with a consistent time domain. In particular, the complete video may be a time-domain-aligned color video or a time-domain-aligned noise-free video. As shown in fig. 4 and fig. 5, the comparison results for the video coloring task and the video denoising task show that processed image frames with inconsistent time domains exhibit obvious flicker, abrupt changes in detail and other problems that degrade the visual effect.
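The five steps of the embodiment can be strung together as in the sketch below. Here `image_model` and `align_net` are hypothetical stand-ins for the trained image processing network and the trained alignment network; the sketch only makes the data flow concrete and omits the two-neighbour input and the ConvLSTM state for brevity.

```python
def align_video(raw_frames, image_model, align_net):
    # Step 1: raw_frames is the list of collected original video image frames.
    # Step 2: process each frame independently with the image model.
    processed = [image_model(f) for f in raw_frames]

    # Steps 3-4: feed each frame together with its previous neighbour
    # (clamped at the boundary) through the alignment network.
    n = len(raw_frames)
    aligned = []
    for t in range(n):
        p = max(t - 1, 0)  # boundary rule: first frame is its own neighbour
        aligned.append(
            align_net(raw_frames[p], processed[p], raw_frames[t], processed[t])
        )

    # Step 5: the aligned frames are synthesized back into a video
    # (here simply returned as an ordered list of frames).
    return aligned
```

With identity stand-ins the pipeline passes every frame through unchanged, which is a convenient smoke test before plugging in real networks.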
The present embodiment also provides a general video time domain alignment system based on a neural network, which includes a memory, a processor, and computer program instructions stored on the memory and capable of being executed by the processor, and when the computer program instructions are executed by the processor, the method steps as described above can be implemented.
The present embodiments also provide a computer readable storage medium having stored thereon computer program instructions executable by a processor, the computer program instructions, when executed by the processor, being capable of performing the method steps as described above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to preferred embodiments of the present invention; other and further embodiments may be devised without departing from its basic scope, which is determined by the claims that follow. Any simple modification, equivalent change, or variation of the above embodiments in accordance with the technical essence of the present invention falls within the protection scope of the technical solution of the present invention.
Claims (9)
1. A general video time domain alignment method based on a neural network is characterized by comprising the following steps:
collecting all original video image frames of a current video;
processing an original video image frame by an image processing neural network model to obtain a processed image frame;
constructing a deep convolutional neural network for aligning the time domain between video image frames;
taking the original video image frame and the processed image frame as input, and obtaining the output time-domain-aligned video image frame through the deep convolutional neural network;
and synthesizing the output time domain aligned video image frames to obtain a final time domain aligned complete video.
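The five claimed steps can be sketched as a minimal Python pipeline. Here `process_frame` and `align_frame` are hypothetical stand-ins for the trained image-processing model and alignment network described in the later claims, not the patented models themselves:

```python
import numpy as np

def process_frame(frame):
    # Stand-in for the image-processing neural network model
    # (e.g. enhancement, denoising, defogging or coloring).
    return np.clip(frame * 1.1, 0.0, 1.0)

def align_frame(prev_orig, prev_proc, cur_orig, cur_proc):
    # Stand-in for the deep convolutional alignment network: it sees the
    # previous and current original/processed frame pairs and returns a
    # temporally stabilized version of the current processed frame.
    return 0.5 * cur_proc + 0.5 * prev_proc

def align_video(frames):
    # Step 2: process every collected original frame.
    processed = [process_frame(f) for f in frames]
    # Steps 4-5: feed frame pairs through the alignment network, then the
    # aligned frames are synthesized back into the output video.
    aligned = []
    for t, cur in enumerate(frames):
        p = max(t - 1, 0)  # first frame: the t-1 index is clamped to t
        aligned.append(align_frame(frames[p], processed[p], cur, processed[t]))
    return aligned

frames = [np.full((4, 4), v, dtype=np.float32) for v in (0.1, 0.2, 0.3)]
out = align_video(frames)
```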
2. The neural network-based universal video time domain alignment method according to claim 1, wherein the image processing neural network model comprises an image enhancement model, an image denoising model, an image defogging model and an image coloring model.
3. The method according to claim 1, wherein the deep convolutional neural network for aligning the inter-frame time domain of the video image is a U-Net image transform network that integrates a ConvLSTM convolutional long-short term memory unit layer.
4. The method according to claim 3, wherein the U-Net image transformation network is an encoder-decoder architecture comprising four downsampling operations and four upsampling operations, forming a U-shaped structure; after the fourth downsampling operation, a ConvLSTM convolutional long short-term memory unit layer is inserted, and the upsampling operations then follow.
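Assuming each downsampling operation halves the spatial resolution (the usual U-Net convention; the claim does not state the stride), the encoder/decoder feature-map sizes can be tracked as follows:

```python
def unet_spatial_sizes(h, w, depth=4):
    """Track feature-map sizes through `depth` stride-2 downsampling
    operations and the mirrored upsampling path (the U shape)."""
    encoder = [(h, w)]
    for _ in range(depth):
        h, w = h // 2, w // 2
        encoder.append((h, w))
    # The ConvLSTM layer sits at the bottleneck, after the 4th downsample,
    # and preserves spatial size; four upsampling operations then mirror
    # the encoder back to the input resolution.
    decoder = list(reversed(encoder[:-1]))
    return encoder, decoder

enc, dec = unet_spatial_sizes(256, 256)
```

For a 256x256 input this gives a 16x16 bottleneck where the ConvLSTM layer operates, and the decoder returns to 256x256.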
5. The neural network-based universal video time domain alignment method according to claim 3, wherein the ConvLSTM convolutional long short-term memory unit layer comprises a forget gate, an input gate and an output gate; the forget gate decides, based on the current input x_t and the output h_{t-1} of the previous moment, which part of the previous state needs to be forgotten; the input gate decides, based on the current input x_t and the output h_{t-1} of the previous moment, which information to add to the state c_{t-1} of the previous moment when generating the new state c_t; and the output gate determines the output h_t at the current moment based on the latest state c_t, the output h_{t-1} of the previous moment and the current input x_t.
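Using the standard ConvLSTM formulation (the notation below is assumed, since the claim's inline symbols were lost in extraction), the three gates described in claim 5 can be written as:

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_{xf} * x_t + W_{hf} * h_{t-1} + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\!\left(W_{xi} * x_t + W_{hi} * h_{t-1} + b_i\right) && \text{(input gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\!\left(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c\right)\\
o_t &= \sigma\!\left(W_{xo} * x_t + W_{ho} * h_{t-1} + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

where $*$ denotes convolution (the distinguishing feature of ConvLSTM versus a fully connected LSTM), $\odot$ the elementwise product, and $\sigma$ the sigmoid function.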
6. The method as claimed in claim 1, wherein the obtaining of the output time-domain aligned video image frame by the deep convolutional neural network using the original video image frame and the processed image frame as input specifically comprises:
when t is the first frame, setting t-1 = t and t-2 = t; when t is the second frame, setting t-2 = t-1; when t is the last frame, setting t+1 = t;
for the t-th original video image frame, simultaneously taking the t-1-th original video image frame, the t-1-th processed image frame, the t-th original video image frame and the t-th processed image frame as input, and obtaining a preliminary temporally stable t-th aligned image frame through the deep convolutional neural network; calculating a supervised learning training function of the t-th processed image frame and the t-th aligned image frame;
for the t-1 original video image frame, simultaneously taking a t-2 original video image frame, a t-2 processed image frame, a t-1 original video image frame and a t-1 processed image frame as input, obtaining a preliminary time-domain stable t-1 aligned image frame through the deep convolutional neural network, and calculating a supervised learning training function of the t-1 processed image frame and the t-1 aligned image frame;
for the t +1 th original video image frame, simultaneously taking the t th original video image frame, the t th processed image frame, the t +1 th original video image frame and the t +1 th processed image frame as input, obtaining a preliminary time-domain-stable t +1 th aligned image frame through the deep convolutional neural network, and calculating a supervised learning training function of the t +1 th processed image frame and the t +1 th aligned image frame;
and combining, in a 1:1:1 ratio, the supervised learning training function of the t-1-th processed image frame and the t-1-th aligned image frame, the supervised learning training function of the t-th processed image frame and the t-th aligned image frame, and the supervised learning training function of the t+1-th processed image frame and the t+1-th aligned image frame, to obtain the master supervised learning training function of the deep convolutional neural network that is finally used to optimize the processing of the t-th original video image frame.
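The boundary handling and the 1:1:1 combination of the three per-frame training functions in claim 6 can be sketched as follows; `loss_fn` is a hypothetical stand-in for the supervised learning training function of claim 7:

```python
def clamp(idx, n_frames):
    # Out-of-range neighbour indices fall back to the nearest valid frame,
    # mirroring the boundary rules of claim 6 (first/last frame cases).
    return min(max(idx, 0), n_frames - 1)

def master_loss(loss_fn, processed, aligned, t):
    # Combine the three per-frame supervised training functions in a
    # 1:1:1 ratio to obtain the master training function for frame t.
    n = len(processed)
    return sum(loss_fn(processed[clamp(k, n)], aligned[clamp(k, n)])
               for k in (t - 1, t, t + 1))
```

For example, with `loss_fn = lambda p, a: abs(p - a)` over scalar stand-in "frames", the last frame reuses itself as its t+1 neighbour.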
7. The neural network-based universal video time domain alignment method according to claim 6, wherein the supervised learning training functions include a perceptual loss function, a structural similarity loss function, and a total training function;
the perceptual loss function is as follows:
in which the index j denotes the j-th layer of the network, CjHjWjRepresenting the size of the jth layer feature map; the loss network uses a VGG-19 pre-training network, and phi represents the network; y denotes the image frame to be processed,representing an aligned image frame;
the structural similarity loss function is as follows:
wherein x and y represent processed image frame and aligned image frame, respectively, muxAnd muyRepresents the average values of x, y,andrespectively representing the variance, σ, of x, ySyRepresents the covariance of x and y, c1,c2Are respectively a constant;
the overall training function is:
L=Lt-1+Lt+Lt+1;
wherein $L_t$ represents the loss function between the t-th processed image frame and the t-th aligned image frame, and is expressed as follows:

$$L_t = L_{perceptual}(\hat{y}_t, y_t) + L_{SSIM}(\hat{y}_t, y_t)$$

in the formula, $L_{perceptual}$ represents the perceptual loss function, $L_{SSIM}$ represents the structural similarity loss function, $\hat{y}_t$ represents the t-th aligned image frame, and $y_t$ represents the t-th processed image frame;
and the total function L finally serves as the master supervised training function for optimizing the deep convolutional neural network.
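As an illustration, the structural-similarity term of claim 7 can be computed with NumPy. This is a single global-window SSIM, which is a simplification of the common windowed SSIM; `c1` and `c2` use the conventional values for intensities in [0, 1]:

```python
import numpy as np

def ssim_loss(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - SSIM(x, y), computed over a single global window.

    Equals 0 for identical images and grows as luminance, contrast or
    structure diverge; c1 and c2 are small stabilizing constants."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()            # sigma_x^2, sigma_y^2
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # sigma_xy
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim
```

The perceptual term would additionally require a pre-trained VGG-19 feature extractor, so it is omitted here; in a full training loop the two terms are simply summed per frame, as in the formula for $L_t$ above.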
8. A neural network based universal video temporal alignment system comprising a memory, a processor and computer program instructions stored on the memory and executable by the processor, the computer program instructions when executed by the processor being capable of implementing the method steps of any one of claims 1 to 7.
9. A computer-readable storage medium having stored thereon computer program instructions executable by a processor, the computer program instructions when executed by the processor being capable of performing the method steps of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110169802.5A CN112819743A (en) | 2021-02-08 | 2021-02-08 | General video time domain alignment method based on neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112819743A true CN112819743A (en) | 2021-05-18 |
Family
ID=75862267
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110169802.5A Pending CN112819743A (en) | 2021-02-08 | 2021-02-08 | General video time domain alignment method based on neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112819743A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115243073A (en) * | 2022-07-22 | 2022-10-25 | 腾讯科技(深圳)有限公司 | Video processing method, device, equipment and storage medium |
CN115243073B (en) * | 2022-07-22 | 2024-05-14 | 腾讯科技(深圳)有限公司 | Video processing method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107529650B (en) | Closed loop detection method and device and computer equipment | |
CN108875935B (en) | Natural image target material visual characteristic mapping method based on generation countermeasure network | |
CN107369166B (en) | Target tracking method and system based on multi-resolution neural network | |
WO2021048607A1 (en) | Motion deblurring using neural network architectures | |
CN112434655B (en) | Gait recognition method based on adaptive confidence map convolution network | |
CN112507990A (en) | Video time-space feature learning and extracting method, device, equipment and storage medium | |
CN111696110B (en) | Scene segmentation method and system | |
CN112598597A (en) | Training method of noise reduction model and related device | |
CN112634296A (en) | RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism | |
CN105046664A (en) | Image denoising method based on self-adaptive EPLL algorithm | |
CA3137297C (en) | Adaptive convolutions in neural networks | |
CN113570516B (en) | Image blind motion deblurring method based on CNN-Transformer hybrid self-encoder | |
Qin et al. | Etdnet: An efficient transformer deraining model | |
CN113095254A (en) | Method and system for positioning key points of human body part | |
Chaurasiya et al. | Deep dilated CNN based image denoising | |
CN112651360A (en) | Skeleton action recognition method under small sample | |
CN115588237A (en) | Three-dimensional hand posture estimation method based on monocular RGB image | |
CN112819743A (en) | General video time domain alignment method based on neural network | |
CN107729821B (en) | Video summarization method based on one-dimensional sequence learning | |
CN109711417B (en) | Video saliency detection method based on low-level saliency fusion and geodesic | |
CN115761242B (en) | Denoising method and terminal based on convolutional neural network and fuzzy image characteristics | |
CN116703772A (en) | Image denoising method, system and terminal based on adaptive interpolation algorithm | |
CN116167015A (en) | Dimension emotion analysis method based on joint cross attention mechanism | |
CN115358952A (en) | Image enhancement method, system, equipment and storage medium based on meta-learning | |
KR102340387B1 (en) | Method of learning brain connectivity and system threrfor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||