CN115883764A - Underwater high-speed video frame interpolation method and system based on data cooperation - Google Patents

Underwater high-speed video frame interpolation method and system based on data cooperation

Info

Publication number
CN115883764A
Authority
CN
China
Prior art keywords
optical flow
flow estimation
event
data
frame
Prior art date
Legal status
Granted
Application number
CN202310076493.6A
Other languages
Chinese (zh)
Other versions
CN115883764B (en)
Inventor
姜宇
齐红
赵明浩
王跃航
张永霁
魏枫林
王凯
Current Assignee
Jilin University
Original Assignee
Jilin University
Priority date
Filing date
Publication date
Application filed by Jilin University
Priority to CN202310076493.6A
Publication of CN115883764A
Application granted
Publication of CN115883764B
Legal status: Active

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/30 - Assessment of water resources

Abstract

An underwater high-speed video frame interpolation method and system based on data cooperation. RGB data and event data are respectively acquired by a conventional camera and an event camera; the acquired RGB data and event data are fused by a U-shaped synthesis network to obtain a synthesis result; frame optical flow estimation is performed through a three-layer multi-scale optical flow estimation network using the synthesis result and the acquired RGB data; event optical flow estimation is performed through a three-layer multi-scale optical flow estimation network using the acquired RGB data and event data; and the synthesis result, the frame optical flow estimation result, and the event optical flow estimation result are fused by a U-shaped fusion network, which outputs an intermediate frame. The method and system generate intermediate frames of a video from RGB data and event data, increase the frame rate of the video, and improve the robustness of the frame interpolation effect under nonlinear motion in underwater scenes.

Description

Underwater high-speed video frame interpolation method and system based on data cooperation
Technical Field
The invention belongs to the field of video frame synthesis, and particularly relates to an underwater high-speed video frame interpolation method and system based on RGB data and event data cooperation.
Background
The purpose of underwater video frame interpolation is to reasonably restore the image of an intermediate frame from the information of the preceding and following frames in a given underwater low-frame-rate video, as close to the real motion trajectory as possible, so as to increase the frame rate of the video. Underwater video frame interpolation has very high application value: recording underwater high-frame-rate video is extremely expensive, whereas low-frame-rate video is easily acquired by a variety of devices. For example, underwater animals move rapidly and underwater equipment rotates, and current conventional equipment can hardly capture complete visual images of such rapid motion, mainly because the frame rate of the video is not high enough, so the motion does not appear coherent and clear from a visual point of view. Video frame interpolation can effectively solve this problem by increasing the frame rate of the video.
However, existing video frame interpolation methods suffer from inaccurate motion estimation and incomplete details when faced with nonlinear motion. The prior art is mainly based on motion estimation methods and kernel-based methods. Kernel-based methods mainly estimate inter-frame motion with deformable convolution, but their efficiency is limited by the size of the deformable convolution kernel: the kernel size strongly affects the computational resources occupied by the algorithm, and once the motion range exceeds the size of the deformable convolution kernel, the efficiency of kernel-based methods drops significantly. Motion-estimation-based techniques estimate the optical flow between two RGB frames and obtain an intermediate frame by forward or backward mapping, and are therefore limited by the accuracy of the motion estimation. The optical flow method relies on the brightness-constancy and linear-motion assumptions, whereas most motion under water is nonlinear, so optical-flow-based methods cannot accurately restore the intermediate frame between two nonlinearly moving frames. An event camera has extremely low latency and can output an asynchronous event stream in real time, that is, the brightness change of a pixel at a given position at any instant. Once the brightness change exceeds the event camera's threshold, the event camera outputs an event with polarity (positive or negative), time, and location (x, y). The event information contains the actual motion information of the object; by combining the event information with the RGB image information and performing deep learning with a convolutional neural network, the actual intermediate motion state of the object can be simulated and restored, i.e., an intermediate frame close to reality.
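As a minimal illustration of the event-stream data just described (not taken from the patent; the names are hypothetical), each asynchronous event can be modeled as an (x, y, t, polarity) record:

```python
from typing import NamedTuple, List

class Event(NamedTuple):
    """One asynchronous event emitted when the brightness change at a pixel
    exceeds the event camera's threshold."""
    x: int         # pixel column
    y: int         # pixel row
    t: float       # timestamp (event cameras give microsecond-level resolution)
    polarity: int  # +1 for a brightness increase, -1 for a decrease

# A recording between two RGB boundary frames is simply a time-ordered stream.
EventStream = List[Event]
```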
Disclosure of Invention
The invention provides an underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, which realizes generation of an intermediate frame of a video by utilizing the RGB data and the event data, improves the frame rate of the video, and optimizes the robustness of a frame interpolation effect under nonlinear motion in an underwater scene.
The invention provides an underwater high-speed video frame interpolation system based on cooperation of RGB data and event data, which realizes generation of an intermediate frame of a video by utilizing the RGB data and the event data, improves the frame rate of the video, and optimizes the robustness of a frame interpolation effect under nonlinear motion in an underwater scene.
The invention is realized by the following technical scheme:
an underwater high-speed video frame interpolation method based on cooperation of RGB data and event data comprises the following steps:
step 1, respectively acquiring RGB data and event data of a visual object in a space where the visual object is located by a traditional camera and an event camera;
step 2, fusing the RGB data and the event data acquired in the step 1 by utilizing a U-shaped synthetic network to acquire a synthetic result;
step 3, performing frame optical flow estimation through a three-layer multi-scale optical flow estimation network by using the synthesis result of the step 2 and the RGB data acquired in the step 1;
step 4, performing event optical flow estimation by using the RGB data and the event data acquired in the step 1 through a three-layer multi-scale optical flow estimation network;
and 5, fusing the synthesis result of step 2, the frame optical flow estimation result obtained by the three-layer multi-scale optical flow estimation network in step 3, and the event optical flow estimation result obtained by the three-layer multi-scale optical flow estimation network in step 4 through a U-shaped fusion network, and outputting an intermediate frame.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein step 1 further comprises converting the asynchronous event data into a synchronous representation form; specifically, 5 bins are selected over the time span between the two boundary frames, events at nearby timestamps are compressed into the corresponding bins by bilinear interpolation, and a synchronous 5-channel event frame is obtained.
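A minimal sketch of this synchronization step, assuming the events between the two boundary frames are given as NumPy arrays x, y, t, p; the function name and the linear-in-time weighting are illustrative assumptions rather than the patent's exact procedure:

```python
import numpy as np

def events_to_voxel_grid(x, y, t, p, height, width, num_bins=5):
    """Compress an asynchronous event stream spanning the interval between two
    boundary frames into a synchronous num_bins-channel event frame. Each event
    is split between its two nearest temporal bins with bilinear weights."""
    voxel = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(t) == 0:
        return voxel
    # Normalize timestamps to [0, num_bins - 1] over the inter-frame interval.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    left = np.floor(t_norm).astype(int)
    right = np.clip(left + 1, 0, num_bins - 1)
    w_right = t_norm - left
    np.add.at(voxel, (left, y, x), p * (1.0 - w_right))
    np.add.at(voxel, (right, y, x), p * w_right)
    return voxel  # shape (5, H, W) for num_bins=5; one such frame per boundary pair
```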
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein the U-shaped synthesis network of step 2 is specifically a residual-connected convolutional neural network model; the input RGB images and the corresponding event data are synthesized through its encoding-decoding structure, and the output is recorded as the synthesis result.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein the synthesis result of step 2 is obtained as follows: the two 3-channel RGB image frames and the two 5-channel event frames pass through 12 groups of convolutions and an encoding-decoding structure comprising 4 downsampling and 4 upsampling operations, which fuses and learns the information of the two modalities; a 3-channel RGB image is output as the predicted value, L1 loss and perceptual loss functions are used as the loss functions, and the real intermediate frame data is used as the ground truth for supervised learning.
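A simplified PyTorch sketch of such a synthesis network, with the channel counts stated above (two 3-channel RGB frames plus two 5-channel event frames = 16 input channels, 4 downsamplings, 4 upsamplings, 3-channel output); the layer widths, the use of max-pooling, and all names are assumptions, not the patent's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class UNetSynthesis(nn.Module):
    """U-shaped synthesis network: fuses two RGB frames and two 5-channel event
    frames into one predicted 3-channel intermediate frame."""
    def __init__(self, in_ch=2 * 3 + 2 * 5, base=32):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8, base * 16]
        self.inc = self._block(in_ch, chs[0])
        self.downs = nn.ModuleList(self._block(chs[i], chs[i + 1]) for i in range(4))
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(chs[4 - i], chs[3 - i], 2, stride=2) for i in range(4))
        self.dec = nn.ModuleList(self._block(2 * chs[3 - i], chs[3 - i]) for i in range(4))
        self.out = nn.Conv2d(chs[0], 3, 1)          # 3-channel RGB prediction

    @staticmethod
    def _block(cin, cout):
        return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                             nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, rgb0, rgb1, ev0, ev1):
        x = torch.cat([rgb0, rgb1, ev0, ev1], dim=1)        # (B, 16, H, W)
        skips, h = [], self.inc(x)
        for down in self.downs:                             # 4 downsamplings
            skips.append(h)
            h = down(Fn.max_pool2d(h, 2))
        for up, dec, skip in zip(self.ups, self.dec, reversed(skips)):
            h = dec(torch.cat([up(h), skip], dim=1))        # 4 upsamplings with skips
        return self.out(h)
```

Training such a network would minimize an L1 plus perceptual (feature-space) loss against the real intermediate frame, as described above.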
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein the three-layer multi-scale optical flow estimation network of steps 3 and 4 is specifically as follows: the synthesis result of the U-shaped synthesis network and the RGB images are fed to a three-layer multi-scale residual-connected convolutional neural network model, which fuses multi-scale feature information and outputs feature vectors, and the frame synthesis result is obtained through optical flow mapping;
the RGB images and the corresponding event data are likewise fed to a three-layer multi-scale residual-connected convolutional neural network model, which fuses multi-modal, multi-scale feature information and outputs feature vectors, and the event synthesis result is obtained through optical flow mapping.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein the frame optical flow estimation comprises: splicing the two 3-channel RGB image frames and the obtained 3-channel RGB synthesized frame together to obtain initial input data F; passing F through a first optical flow estimation module to obtain two 2-channel optical flows F1; splicing F1 with the bilinearly scaled F and passing the result through a second optical flow estimation module, the obtained features being added to F1 to obtain F2; splicing F2 with the bilinearly scaled F and passing the result through a third optical flow estimation module, the obtained features being added to F2 to obtain F3. The input feature F thus passes through 3 optical flow estimation modules in total, finally yielding two 2-channel optical flows that represent the motion vectors from the intermediate frame to the left and right boundary frames, respectively;
two 3-channel frame-based RGB estimation results can be obtained from the two boundary frames and the two motion vectors through backward mapping; the L1 loss function is used as the loss function, and the real intermediate frame data is used as the ground truth for supervised learning;
the optical flow estimation module comprises 10 convolutional layers, one transposed convolutional layer, and ReLU activation functions.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein the event optical flow estimation of step 4 comprises: splicing the two 3-channel RGB image frames and the two 5-channel event frames together to obtain initial input data F; passing F through a first optical flow estimation module to obtain two 2-channel optical flows F1; splicing F1 with the bilinearly scaled F and passing the result through a second optical flow estimation module, the obtained features being added to F1 to obtain F2; splicing F2 with the bilinearly scaled F and passing the result through a third optical flow estimation module, the obtained features being added to F2 to obtain F3. The input feature F thus passes through 3 optical flow estimation modules in total, finally yielding two 2-channel optical flows that represent the motion vectors from the intermediate frame to the left and right boundary frames, respectively;
two 3-channel event-based RGB estimation results can be obtained from the two boundary frames and the two motion vectors through backward mapping; the L1 loss function is used as the loss function, and the real intermediate frame data is used as the ground truth for supervised learning;
the optical flow estimation module comprises 10 convolutional layers, one transposed convolutional layer, and ReLU activation functions.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein the 3-channel RGB synthesized frame, the two 3-channel frame-based RGB estimation results, and the two 3-channel event-based RGB estimation results are spliced together along the channel dimension;
the information of the two modalities is first fused and learned through a two-layer dynamic convolution network with 10 experts and 10 groups of convolutions and an encoding-decoding structure comprising 4 downsampling and 4 upsampling operations; a 3-channel RGB image is output as the predicted value, L1 loss and perceptual loss functions are used as the loss functions, and the real intermediate frame data is used as the ground truth for supervised learning.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein the calculation formula of the video frame interpolation model for event optical flow estimation is as follows:
[Formula not reproduced in this text version]
In the formula, 0 and 1 are the times of the two boundary frames, t is the time of the intermediate frame, V represents a motion vector, and i and j denote channels of the event representation vector.
An underwater high-speed video frame interpolation system based on RGB data and event data cooperation comprises a synthesis module, a frame optical flow estimation module, an event optical flow estimation module, and a fusion module;
the synthesis module is used for directly synthesizing information of two modes of RGB data and event data to obtain a synthesized video intermediate frame;
the frame optical flow estimation module is used for estimating the optical flows from the intermediate frame to the two boundary frames by utilizing a three-layer multi-scale optical flow estimation network according to the synthesized intermediate frame and RGB data, and obtaining a video intermediate frame after the optical flows are mapped;
the event optical flow estimation module is used for estimating the optical flow from an intermediate frame to two boundary frames by using the event data and the RGB data and utilizing a three-layer multi-scale optical flow estimation network, and obtaining a video intermediate frame after optical flow mapping;
and the fusion module is used for fusing the results of the three modules to obtain the most accurate video intermediate frame.
The invention has the beneficial effects that:
the invention realizes the generation of the intermediate frame of the video by using the RGB data and the event data, improves the frame rate of the video and optimizes the robustness of the frame interpolation effect under the nonlinear motion.
The invention utilizes a convolutional neural network to perform deep learning, and can simulate and restore the real intermediate motion state of the object, i.e., an intermediate frame close to reality.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
FIG. 2 is a block diagram of the modules of the present invention.
Detailed description of the preferred embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data comprises the following steps:
step 1, respectively acquiring RGB data and event data of a visual object in a space where the visual object is located by a traditional camera and an event camera;
step 2, fusing the RGB data and the event data acquired in the step 1 by utilizing a U-shaped synthetic network to acquire a synthetic result;
step 3, performing frame optical flow estimation through a three-layer multi-scale optical flow estimation network by using the synthesis result of the step 2 and the RGB data acquired in the step 1;
step 4, performing event optical flow estimation by using the RGB data and the event data acquired in the step 1 through a three-layer multi-scale optical flow estimation network;
and 5, fusing the synthesis result of step 2, the frame optical flow estimation result obtained by the three-layer multi-scale optical flow estimation network in step 3, and the event optical flow estimation result obtained by the three-layer multi-scale optical flow estimation network in step 4 through a U-shaped fusion network, and outputting an intermediate frame.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein step 1 further comprises converting the asynchronous event data into a synchronous representation form; specifically, 5 bins are selected over the time span between the two boundary frames, events at nearby timestamps are compressed into the corresponding bins by bilinear interpolation, and a synchronous 5-channel event frame is obtained.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein the U-shaped synthesis network of step 2 is specifically a residual-connected convolutional neural network model; the input RGB images and the corresponding event data are synthesized through its encoding-decoding structure, and the output is recorded as the synthesis result.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein the synthesis result of step 2 is obtained as follows: the two 3-channel RGB image frames and the two 5-channel event frames pass through 12 groups of convolutions and an encoding-decoding structure comprising 4 downsampling and 4 upsampling operations, which fuses and learns the information of the two modalities; a 3-channel RGB image is output as the predicted value, L1 loss and perceptual loss functions are used as the loss functions, and the real intermediate frame data is used as the ground truth for supervised learning.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein the three-layer multi-scale optical flow estimation network of steps 3 and 4 is specifically as follows: the synthesis result of the U-shaped synthesis network and the RGB images are fed to a three-layer multi-scale residual-connected convolutional neural network model, which fuses multi-scale feature information and outputs feature vectors, and the frame synthesis result is obtained through optical flow mapping;
the RGB images and the corresponding event data are likewise fed to a three-layer multi-scale residual-connected convolutional neural network model, which fuses multi-modal, multi-scale feature information and outputs feature vectors, and the event synthesis result is obtained through optical flow mapping.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein the frame optical flow estimation comprises: splicing the two 3-channel RGB image frames and the obtained 3-channel RGB synthesized frame together to obtain initial input data F; passing F through a first optical flow estimation module to obtain two 2-channel optical flows F1; splicing F1 with the bilinearly scaled F and passing the result through a second optical flow estimation module, the obtained features being added to F1 to obtain F2; splicing F2 with the bilinearly scaled F and passing the result through a third optical flow estimation module, the obtained features being added to F2 to obtain F3. The input feature F thus passes through 3 optical flow estimation modules in total, finally yielding two 2-channel optical flows that represent the motion vectors from the intermediate frame to the left and right boundary frames, respectively;
two 3-channel frame-based RGB estimation results can be obtained from the two boundary frames and the two motion vectors through backward mapping; the L1 loss function is used as the loss function, and the real intermediate frame data is used as the ground truth for supervised learning;
the optical flow estimation module comprises 10 convolutional layers, one transposed convolutional layer, and ReLU activation functions.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein the event optical flow estimation of step 4 comprises: splicing the two 3-channel RGB image frames and the two 5-channel event frames together to obtain initial input data F; passing F through a first optical flow estimation module to obtain two 2-channel optical flows F1; splicing F1 with the bilinearly scaled F and passing the result through a second optical flow estimation module, the obtained features being added to F1 to obtain F2; splicing F2 with the bilinearly scaled F and passing the result through a third optical flow estimation module, the obtained features being added to F2 to obtain F3. The input feature F thus passes through 3 optical flow estimation modules in total, finally yielding two 2-channel optical flows that represent the motion vectors from the intermediate frame to the left and right boundary frames, respectively;
two 3-channel event-based RGB estimation results can be obtained from the two boundary frames and the two motion vectors through backward mapping; the L1 loss function is used as the loss function, and the real intermediate frame data is used as the ground truth for supervised learning;
the optical flow estimation module comprises 10 convolutional layers, one transposed convolutional layer, and ReLU activation functions.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein the 3-channel RGB synthesized frame, the two 3-channel frame-based RGB estimation results, and the two 3-channel event-based RGB estimation results are spliced together along the channel dimension;
the information of the two modalities is first fused and learned through a two-layer dynamic convolution network with 10 experts and 10 groups of convolutions and an encoding-decoding structure comprising 4 downsampling and 4 upsampling operations; a 3-channel RGB image is output as the predicted value, L1 loss and perceptual loss functions are used as the loss functions, and the real intermediate frame data is used as the ground truth for supervised learning.
An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data, wherein the calculation formula of the video frame interpolation model for event optical flow estimation is as follows:
[Formula not reproduced in this text version]
In the formula, 0 and 1 are the times of the two boundary frames, t is the time of the intermediate frame, V represents a motion vector, and i and j respectively denote channels of the event representation vector. For example, the motion vector over the period from the interpolated-frame instant to the next event representation is estimated in this way; the overall motion vector estimates from the intermediate frame to the two boundary frames are denoted V(t→0) and V(t→1), and they are the motion vectors after calibration with the event data.
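The formula itself is only present as an image in the original publication. Based on the surrounding description (per-channel motion vectors of the event representation accumulated between the intermediate time t and the boundary times 0 and 1), one plausible reading, offered purely as an assumption, is:

```latex
% Hedged reconstruction; V_i is the motion vector estimated over the i-th
% event-representation channel, i_t is the channel containing time t, and N
% is the number of channels. Summing the per-channel vectors toward each
% boundary gives the event-calibrated flows used for backward mapping.
V_{t \rightarrow 0} = -\sum_{i=1}^{i_t} V_i , \qquad
V_{t \rightarrow 1} = \sum_{j=i_t+1}^{N} V_j
```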
An underwater high-speed video frame interpolation system based on RGB data and event data cooperation comprises a synthesis module, a frame optical flow estimation module, an event optical flow estimation module, and a fusion module;
the synthesis module is used for directly synthesizing information of two modes of RGB data and event data to obtain a synthesized video intermediate frame;
the frame optical flow estimation module is used for estimating the optical flows from the intermediate frame to the two boundary frames by utilizing a three-layer multi-scale optical flow estimation network according to the synthesized intermediate frame and RGB data, and obtaining a video intermediate frame after the optical flows are mapped;
the event optical flow estimation module is used for estimating the optical flow from an intermediate frame to two boundary frames by using the event data and the RGB data and utilizing a three-layer multi-scale optical flow estimation network, and obtaining a video intermediate frame after optical flow mapping;
and the fusion module is used for fusing the results of the three modules to obtain the most accurate video intermediate frame.
The U-shaped synthesis network is a residual-connected convolutional neural network model; the input RGB images and the corresponding event data are synthesized through its encoding-decoding structure, and the output is recorded as the synthesis result;
the three-layer multi-scale optical flow estimation network feeds the synthesis result of the U-shaped synthesis network and the RGB images to a three-layer multi-scale residual-connected convolutional neural network model, fuses multi-scale feature information, outputs feature vectors, and obtains the frame synthesis result through optical flow mapping;
it likewise feeds the RGB images and the corresponding event data to a three-layer multi-scale residual-connected convolutional neural network model, fuses multi-modal, multi-scale feature information, outputs feature vectors, and obtains the event synthesis result through optical flow mapping;
the U-shaped fusion network is a U-shaped residual-connected convolutional neural network model with double-layer dynamic convolution; through its encoding-decoding structure it fuses the synthesis result of the U-shaped synthesis network, the frame synthesis result obtained by the three-layer multi-scale optical flow estimation network, and the event synthesis result, and outputs a single 3-channel RGB image.
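Putting the pieces together, a high-level forward pass of the system described above might look as follows; the module classes refer to the sketches given earlier in this description, and the composition details (channel ordering, warping both boundary frames with both flow types) are assumptions:

```python
import torch

def interpolate_intermediate_frame(rgb0, rgb1, ev0, ev1,
                                   synth_net, frame_flow_net, event_flow_net, fusion_net):
    """Sketch of the four-module pipeline: synthesis, frame flow, event flow, fusion."""
    # 1) U-shaped synthesis network: direct fusion of the two modalities.
    synth = synth_net(rgb0, rgb1, ev0, ev1)                          # (B, 3, H, W)

    # 2) Frame optical flow: boundary frames + synthesized frame (9 input channels).
    flows_f = frame_flow_net(torch.cat([rgb0, rgb1, synth], dim=1))  # (B, 4, H, W)
    warp_f0 = backward_warp(rgb0, flows_f[:, 0:2])
    warp_f1 = backward_warp(rgb1, flows_f[:, 2:4])

    # 3) Event optical flow: boundary frames + event frames (16 input channels).
    flows_e = event_flow_net(torch.cat([rgb0, rgb1, ev0, ev1], dim=1))
    warp_e0 = backward_warp(rgb0, flows_e[:, 0:2])
    warp_e1 = backward_warp(rgb1, flows_e[:, 2:4])

    # 4) U-shaped fusion network: 15-channel input, single 3-channel intermediate frame.
    return fusion_net(torch.cat([synth, warp_f0, warp_f1, warp_e0, warp_e1], dim=1))
```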

Claims (10)

1. An underwater high-speed video frame interpolation method based on cooperation of RGB data and event data is characterized by comprising the following steps of:
step 1, respectively acquiring RGB data and event data by a traditional camera and an event camera;
step 2, fusing the RGB data and the event data acquired in the step 1 by utilizing a U-shaped synthetic network to acquire a synthetic result;
step 3, performing frame optical flow estimation through a three-layer multi-scale optical flow estimation network by using the synthesis result of the step 2 and the RGB data acquired in the step 1;
step 4, performing event optical flow estimation by using the RGB data and the event data acquired in the step 1 through a three-layer multi-scale optical flow estimation network;
and 5, fusing the synthesis result of step 2, the frame optical flow estimation result obtained by the three-layer multi-scale optical flow estimation network in step 3, and the event optical flow estimation result obtained by the three-layer multi-scale optical flow estimation network in step 4 through a U-shaped fusion network, and outputting an intermediate frame.
2. The underwater high-speed video frame interpolation method based on the cooperation of the RGB data and the event data as claimed in claim 1, wherein step 1 further includes converting asynchronous event data into a synchronous representation form, specifically, selecting 5 bins over the time span between the two boundary frames, compressing events at nearby timestamps into the corresponding bins in a manner similar to bilinear interpolation, and acquiring a synchronous 5-channel event frame.
3. The underwater high-speed video frame interpolation method based on the cooperation of the RGB data and the event data according to claim 2, wherein the U-shaped synthesis network of step 2 is specifically a residual-connected convolutional neural network model in which the input RGB images and the corresponding event data are synthesized through an encoding-decoding structure and recorded as the synthesis result.
4. The method as claimed in claim 3, wherein the synthesis result obtained in step 2 is specifically that the two 3-channel RGB image frames and the two 5-channel event frames undergo 12 groups of convolutions and an encoding-decoding structure including 4 downsampling and 4 upsampling operations that fuses and learns the information of the two modalities; a 3-channel RGB image is output as the predicted value, L1 loss and perceptual loss functions are used as the loss functions, and the real intermediate frame data is used as the ground truth for supervised learning.
5. The underwater high-speed video frame interpolation method based on the cooperation of the RGB data and the event data as claimed in claim 1, wherein the three-layer multi-scale optical flow estimation network of steps 3 and 4 specifically uses a three-layer multi-scale residual-connected convolutional neural network model that fuses multi-scale feature information and outputs feature vectors, the frame synthesis result being obtained through optical flow mapping;
and a three-layer multi-scale residual-connected convolutional neural network model that fuses multi-modal, multi-scale feature information and outputs feature vectors, the event synthesis result being obtained through optical flow mapping.
6. The underwater high-speed video frame interpolation method based on the cooperation of the RGB data and the event data as claimed in claim 5, wherein the frame optical flow estimation of step 3 specifically comprises: splicing the two 3-channel RGB image frames and the obtained 3-channel RGB synthesized frame to obtain initial input data F; passing F through a first optical flow estimation module to obtain two 2-channel optical flows F1; splicing F1 with the bilinearly scaled F and passing the result through a second optical flow estimation module, the obtained features being added to F1 to obtain F2; splicing F2 with the bilinearly scaled F and passing the result through a third optical flow estimation module, the obtained features being added to F2 to obtain F3; the input feature F passes through 3 optical flow estimation modules in total, finally yielding two 2-channel optical flows that respectively represent the motion vectors from the intermediate frame to the left and right boundary frames;
two 3-channel frame-based RGB estimation results are obtained from the two boundary frames and the two motion vectors through backward mapping; the L1 loss function is used as the loss function, and the real intermediate frame data is used as the ground truth for supervised learning;
the optical flow estimation module includes 10 convolutional layers, one transposed convolutional layer, and ReLU activation functions.
7. The method as claimed in claim 6, wherein the event optical flow estimation of step 4 is implemented by splicing the two 3-channel RGB image frames and the two 5-channel event frames to obtain initial input data F; passing F through a first optical flow estimation module to obtain two 2-channel optical flows F1; splicing F1 with the bilinearly scaled F and passing the result through a second optical flow estimation module, the obtained features being added to F1 to obtain F2; splicing F2 with the bilinearly scaled F and passing the result through a third optical flow estimation module, the obtained features being added to F2 to obtain F3; the input feature F passes through 3 optical flow estimation modules in total, finally yielding two 2-channel optical flows that respectively represent the motion vectors from the intermediate frame to the left and right boundary frames;
two 3-channel event-based RGB estimation results are obtained from the two boundary frames and the two motion vectors through backward mapping; the L1 loss function is used as the loss function, and the real intermediate frame data is used as the ground truth for supervised learning;
the optical flow estimation module includes 10 convolutional layers, one transposed convolutional layer, and ReLU activation functions.
8. The underwater high-speed video frame interpolation method based on the cooperation of the RGB data and the event data as claimed in claim 1, wherein the 3-channel RGB synthesized frame, the two 3-channel frame-based RGB estimation results, and the two 3-channel event-based RGB estimation results are spliced together along the channel dimension;
the information of the two modalities is first fused and learned through a two-layer dynamic convolution network with 10 experts and 10 groups of convolutions and an encoding-decoding structure comprising 4 downsampling and 4 upsampling operations; a 3-channel RGB image is output as the predicted value, L1 loss and perceptual loss functions are used as the loss functions, and the real intermediate frame data is used as the ground truth for supervised learning.
9. The underwater high-speed video frame interpolation method based on RGB data and event data cooperation as claimed in claim 4, wherein the calculation formula of the video frame interpolation model for event optical flow estimation is as follows:
[Formula not reproduced in this text version]
in the formula, 0 and 1 are the times of the two boundary frames, t is the time of the intermediate frame, V represents a motion vector, and i and j denote channels of the event representation vector.
10. An underwater high-speed video frame interpolation system based on RGB data and event data cooperation, characterized by comprising a synthesis module, a frame optical flow estimation module, an event optical flow estimation module, and a fusion module;
the synthesis module is used for directly synthesizing information of two modes of RGB data and event data to obtain a synthesized video intermediate frame;
the frame optical flow estimation module is used for estimating the optical flows from the intermediate frame to the two boundary frames by utilizing a three-layer multi-scale optical flow estimation network according to the synthesized intermediate frame and RGB data, and obtaining a video intermediate frame after the optical flows are mapped;
the event optical flow estimation module is used for estimating the optical flow from an intermediate frame to two boundary frames by using the event data and the RGB data and utilizing a three-layer multi-scale optical flow estimation network, and obtaining a video intermediate frame after optical flow mapping;
and the fusion module is used for fusing the results of the three modules to obtain the most accurate video intermediate frame.
CN202310076493.6A 2023-02-08 2023-02-08 Underwater high-speed video frame inserting method and system based on data collaboration Active CN115883764B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310076493.6A CN115883764B (en) 2023-02-08 2023-02-08 Underwater high-speed video frame inserting method and system based on data collaboration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310076493.6A CN115883764B (en) 2023-02-08 2023-02-08 Underwater high-speed video frame inserting method and system based on data collaboration

Publications (2)

Publication Number Publication Date
CN115883764A (en) 2023-03-31
CN115883764B (en) 2023-05-23

Family

ID=85760841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310076493.6A Active CN115883764B (en) 2023-02-08 2023-02-08 Underwater high-speed video frame inserting method and system based on data collaboration

Country Status (1)

Country Link
CN (1) CN115883764B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116389912A (en) * 2023-04-24 2023-07-04 北京大学 Method for reconstructing high-frame-rate high-dynamic-range video by fusing pulse camera with common camera
CN116405626A (en) * 2023-06-05 2023-07-07 吉林大学 Global matching underwater moving object vision enhancement method
CN117745596A (en) * 2024-02-19 2024-03-22 吉林大学 Cross-modal fusion-based underwater de-blocking method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200265590A1 (en) * 2019-02-19 2020-08-20 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
WO2021241804A1 (en) * 2020-05-29 2021-12-02 연세대학교 산학협력단 Multi-flow-based frame interpolation device and method
CN114245007A (en) * 2021-12-06 2022-03-25 西北工业大学 High frame rate video synthesis method, device, equipment and storage medium
WO2022096158A1 (en) * 2020-11-05 2022-05-12 Huawei Technologies Co., Ltd. Device and method for video interpolation
WO2022141376A1 (en) * 2020-12-31 2022-07-07 华为技术有限公司 Posture estimation method related apparatus
CN114885144A (en) * 2022-03-23 2022-08-09 清华大学 High frame rate 3D video generation method and device based on data fusion

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200265590A1 (en) * 2019-02-19 2020-08-20 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
WO2021241804A1 (en) * 2020-05-29 2021-12-02 연세대학교 산학협력단 Multi-flow-based frame interpolation device and method
WO2022096158A1 (en) * 2020-11-05 2022-05-12 Huawei Technologies Co., Ltd. Device and method for video interpolation
WO2022141376A1 (en) * 2020-12-31 2022-07-07 华为技术有限公司 Posture estimation method related apparatus
CN114245007A (en) * 2021-12-06 2022-03-25 西北工业大学 High frame rate video synthesis method, device, equipment and storage medium
CN114885144A (en) * 2022-03-23 2022-08-09 清华大学 High frame rate 3D video generation method and device based on data fusion

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116389912A (en) * 2023-04-24 2023-07-04 北京大学 Method for reconstructing high-frame-rate high-dynamic-range video by fusing pulse camera with common camera
CN116389912B (en) * 2023-04-24 2023-10-10 北京大学 Method for reconstructing high-frame-rate high-dynamic-range video by fusing pulse camera with common camera
CN116405626A (en) * 2023-06-05 2023-07-07 吉林大学 Global matching underwater moving object vision enhancement method
CN116405626B (en) * 2023-06-05 2023-09-22 吉林大学 Global matching underwater moving object vision enhancement method and system
CN117745596A (en) * 2024-02-19 2024-03-22 吉林大学 Cross-modal fusion-based underwater de-blocking method

Also Published As

Publication number Publication date
CN115883764B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN115883764B (en) Underwater high-speed video frame inserting method and system based on data collaboration
CN111311490B (en) Video super-resolution reconstruction method based on multi-frame fusion optical flow
CN110782490B (en) Video depth map estimation method and device with space-time consistency
CN111260560B (en) Multi-frame video super-resolution method fused with attention mechanism
CN113837938B (en) Super-resolution method for reconstructing potential image based on dynamic vision sensor
CN108765479A (en) Using deep learning to monocular view estimation of Depth optimization method in video sequence
CN111105432B (en) Unsupervised end-to-end driving environment perception method based on deep learning
CN111028150A (en) Rapid space-time residual attention video super-resolution reconstruction method
CN111582483A (en) Unsupervised learning optical flow estimation method based on space and channel combined attention mechanism
CN112019828B (en) Method for converting 2D (two-dimensional) video into 3D video
CN114170286B (en) Monocular depth estimation method based on unsupervised deep learning
CN113077505B (en) Monocular depth estimation network optimization method based on contrast learning
CN111260680B (en) RGBD camera-based unsupervised pose estimation network construction method
CN115187638B (en) Unsupervised monocular depth estimation method based on optical flow mask
CN112040222A (en) Visual saliency prediction method and equipment
CN115035171A (en) Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN115100090A (en) Monocular image depth estimation system based on space-time attention
Chai et al. Expression-aware face reconstruction via a dual-stream network
Liu et al. GridDehazeNet+: An enhanced multi-scale network with intra-task knowledge transfer for single image dehazing
CN112750092A (en) Training data acquisition method, image quality enhancement model and method and electronic equipment
CN111767679A (en) Method and device for processing time-varying vector field data
CN115841523A (en) Double-branch HDR video reconstruction algorithm based on Raw domain
CN115941872B (en) Video frame inserting method and system for underwater high-speed moving target
CN114663802A (en) Cross-modal video migration method of surveillance video based on characteristic space-time constraint

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant