CN115174817A - Hybrid anti-shake method and system based on deep learning - Google Patents


Info

Publication number
CN115174817A
Authority
CN
China
Prior art keywords
optical flow
network
acquiring
camera
bidirectional optical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211077092.4A
Other languages
Chinese (zh)
Inventor
高歌
王保耀
郭奇锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shenzhi Future Intelligence Co ltd
Original Assignee
Shenzhen Shenzhi Future Intelligence Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Shenzhi Future Intelligence Co ltd filed Critical Shenzhen Shenzhi Future Intelligence Co ltd
Priority to CN202211077092.4A priority Critical patent/CN115174817A/en
Publication of CN115174817A publication Critical patent/CN115174817A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The invention discloses a hybrid anti-shake method and a hybrid anti-shake system based on deep learning, wherein the method comprises the following steps: acquiring a video shot by a camera, and acquiring continuous N frames of images based on the video; inputting the continuous N frames of images into a bidirectional optical flow network to obtain an output result of the bidirectional optical flow network; acquiring pose data of the camera; inputting the output result of the bidirectional optical flow network and the pose data into an alignment network; and acquiring an output result of the alignment network, warping the output result of the alignment network to the corresponding pose to obtain an image stabilization result of the current image frame, and finishing the anti-shake operation. The embodiment of the invention calculates dense optical flow with a deep learning end-to-end neural network, which is more robust than traditional algorithms and yields more accurate optical flow results, and selects historical and future camera pose data in the time domain. The pose data is fused and corrected in the spatial domain, enhancing the anti-shake effect and improving the quality of video images.

Description

Hybrid anti-shake method and system based on deep learning
Technical Field
The invention relates to the technical field of image processing, in particular to a hybrid anti-shake method and system based on deep learning.
Background
With the continuous development of smart cameras, video anti-shake technology is becoming more and more important in products in the fields of unmanned aerial vehicles, unmanned ships, city security, high-point monitoring, robots, aerospace and the like.
Video anti-shake techniques can be roughly classified into optical image stabilization (OIS), electronic image stabilization (EIS), and hybrid image stabilization (HIS).
OIS is a hardware solution that uses micro-electro-mechanical system (MEMS) gyroscopes to detect motion and adjust the camera system accordingly.
EIS approaches the problem from the software-algorithm side, requires no additional hardware support, and stabilizes low-frequency jitter and large-amplitude motion in video. Compared with OIS, it has the advantages of being embedded in software, easy to upgrade, low in power consumption, and low in cost. HIS is a fusion scheme combining OIS and EIS: it can take the strengths of each sensor, gather their information together, and improve the judgment accuracy of the camera anti-shake system through comprehensive analysis.
Most anti-shake algorithms in devices on the market today use image-based methods to smooth the camera path. Such algorithms are flexible and well suited to nonlinear motion compensation, but without rigid constraints the crop ratio is large, non-rigid distortion and smearing can occur, and the motion compensation effect is poor.
The prior art is therefore still subject to further development.
Disclosure of Invention
In view of the above technical problems, embodiments of the present invention provide a hybrid anti-shake method and system based on deep learning, which can solve the technical problems that, in the prior art, most anti-shake algorithms smooth the camera path with image-processing-based methods, so that the crop ratio is large, non-rigid distortion and smearing occur in the absence of rigid constraints, the motion compensation effect is relatively poor, and video shooting quality suffers.
A first aspect of an embodiment of the present invention provides a hybrid anti-shake method based on deep learning, including:
acquiring a video shot by a camera, and acquiring continuous N frames of images based on the video;
inputting continuous N frames of images into a bidirectional optical flow network to obtain an output result of the bidirectional optical flow network;
acquiring pose data of a camera;
inputting the output result of the bidirectional optical flow network and the pose data into an alignment network;
and acquiring an output result of the alignment network, warping the output result of the alignment network to a corresponding pose to obtain an image stabilization result of the current image frame, and finishing anti-shake operation.
Optionally, acquiring a video shot by a camera, and acquiring consecutive N-frame images based on the video includes:
acquiring a video shot by a camera, and acquiring continuous 5-frame RGB images based on the video;
4 pairs of RGB color space data are generated based on the 5-frame RGB images.
Optionally, inputting the continuous N frames of images into the bidirectional optical flow network, and acquiring an output result of the bidirectional optical flow network, including:
inputting 4 pairs of RGB color space data into a bidirectional optical flow network, wherein the bidirectional optical flow network is a CNN network conforming to a UNet structure;
and acquiring 4 positive and negative optical flow results output by the bidirectional optical flow network.
Optionally, acquiring pose data of the camera includes:
acquiring initial triaxial angular velocity data of a camera based on the MEMS gyroscope;
and filtering the initial triaxial angular velocity data based on complementary filtering and Kalman filtering to generate pose data of the camera.
Optionally, the inputting the output result of the bidirectional optical flow network and the pose data into an alignment network comprises:
carrying out synchronous processing on the triaxial angular velocity data and the video data to generate synchronous triaxial angular velocity data and obtain a relative rotation matrix corresponding to the triaxial angular velocity data;
and inputting the synchronized triaxial angular velocity data and the output result of the bidirectional optical flow network into an alignment network, wherein the alignment network is an RNN (recurrent neural network) comprising a forgetting stage, a memory-selection stage and an output stage.
A second aspect of the embodiments of the present invention provides a hybrid anti-shake system based on deep learning, where the system includes: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the steps of:
acquiring a video shot by a camera, and acquiring continuous N frames of images based on the video;
inputting continuous N frames of images into a bidirectional optical flow network to obtain an output result of the bidirectional optical flow network;
acquiring pose data of a camera;
inputting the output result of the bidirectional optical flow network and the pose data into an alignment network;
and acquiring an output result of the alignment network, warping the output result of the alignment network to a corresponding pose to obtain an image stabilization result of the current image frame, and finishing anti-shake operation.
Optionally, the computer program when executed by the processor further implements the steps of:
acquiring a video shot by a camera, and acquiring continuous 5-frame RGB images based on the video;
4 pairs of RGB color space data are generated based on the 5-frame RGB images.
Optionally, the computer program when executed by the processor further implements the steps of:
inputting 4 pairs of RGB color space data into a bidirectional optical flow network, wherein the bidirectional optical flow network is a CNN network conforming to a UNet structure;
and acquiring 4 positive and negative optical flow results output by the bidirectional optical flow network.
Optionally, the computer program when executed by the processor further implements the steps of:
acquiring initial triaxial angular velocity data of a camera based on the MEMS gyroscope;
and filtering the initial triaxial angular velocity data based on complementary filtering and Kalman filtering to generate pose data of the camera.
A third aspect of embodiments of the present invention provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions, and when executed by one or more processors, the computer-executable instructions may cause the one or more processors to perform the deep learning based hybrid anti-shake method described above.
In the technical scheme provided by the embodiment of the invention, a video shot by a camera is obtained, and continuous N frames of images are obtained based on the video; the continuous N frames of images are input into a bidirectional optical flow network to obtain an output result of the bidirectional optical flow network; pose data of the camera is acquired; the output result of the bidirectional optical flow network and the pose data are input into an alignment network; and an output result of the alignment network is acquired and warped to the corresponding pose to obtain an image stabilization result of the current image frame, finishing the anti-shake operation. The embodiment of the invention uses a deep learning end-to-end neural network to calculate dense optical flow, which is more robust than traditional algorithms and yields more accurate optical flow results, and selects historical and future camera pose data in the time domain. The pose data is fused and corrected in the spatial domain, enhancing the anti-shake effect and improving the quality of video images.
Drawings
Fig. 1 is a schematic flowchart illustrating an embodiment of a hybrid anti-shake method based on deep learning according to the present invention;
fig. 2 is a schematic diagram of a hardware structure of another embodiment of a hybrid anti-shake system based on deep learning according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The following detailed description of embodiments of the invention refers to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart illustrating an embodiment of a hybrid anti-shake method based on deep learning according to the present invention. As shown in fig. 1, the method includes:
Step S100, acquiring a video shot by a camera, and acquiring continuous N-frame images based on the video;
Step S200, inputting the continuous N frames of images into a bidirectional optical flow network, and acquiring an output result of the bidirectional optical flow network;
Step S300, acquiring pose data of the camera;
Step S400, inputting the output result of the bidirectional optical flow network and the pose data into an alignment network;
Step S500, acquiring an output result of the alignment network, warping the output result of the alignment network to the corresponding pose to obtain an image stabilization result of the current image frame, and finishing the anti-shake operation.
In specific implementation, the embodiment of the invention uses a camera to shoot video and converts the captured video data into images; the converted format includes, but is not limited to, raw image formats such as RGB, DNG and RAW, or pictures in other color spaces such as HSV and YUV.
Continuous N frames are taken from the converted images and input into the bidirectional optical flow network, and the network's output result is acquired. The images are processed with a bidirectional optical flow network, where the optical flow algorithm rests on three assumptions: brightness is constant between adjacent frames; object motion between adjacent frames is relatively small; and spatial consistency holds, i.e. adjacent pixels share the same motion.
Bidirectional optical flow means the optical flow result is calculated in both the forward and reverse time directions, which plays an important role in inferring occluded areas between frames. The training data used for the bidirectional optical flow network consists of 720P-resolution pictures, but pictures of other resolutions may be used instead, combined with preprocessing such as up- or down-sampling.
The sensor can be OIS, a gyroscope, an accelerometer, a magnetometer, or any other sensor able to obtain camera pose information, and can be built as a MEMS (Micro-Electro-Mechanical System), also called a micro-machine: a device measuring a few millimeters or less whose internal structures are on the micron or even nanometer scale, forming an independent intelligent system. The camera pose information obtained this way mainly comprises the camera's three-axis angular velocity data.
The output result of the optical flow network and the camera pose information from the three-axis angular velocity sensor are input into the trained alignment network, the alignment operation is performed, and the alignment network's output result is acquired; the output result of the alignment network is then warped to the pose corresponding to the camera, giving the image stabilization result of the current image frame and finishing the anti-shake operation.
Further, acquiring a video shot by a camera, and acquiring continuous N frames of images based on the video, comprising:
acquiring a video shot by a camera, and acquiring continuous 5-frame RGB images based on the video;
4 pairs of RGB color space data are generated based on the 5-frame RGB images.
In specific implementation, a video shot by a camera is acquired and five consecutive RGB frames $I_{t-2}, I_{t-1}, I_t, I_{t+1}, I_{t+2}$ (each of dimension $H \times W \times 3$) are taken from it, forming four consecutive RGB color-space data pairs; using such pairs as input to recover inter-frame motion is a widely used approach.
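The grouping of five consecutive frames into four adjacent pairs can be sketched as follows (a minimal illustration; frame reading via `cv2.VideoCapture` is omitted and dummy arrays stand in for real frames):

```python
import numpy as np

def make_frame_pairs(frames):
    """Group 5 consecutive frames into 4 adjacent (prev, next) pairs.

    frames: list of 5 H x W x 3 RGB arrays -> list of 4 tuples
    (frame_i, frame_{i+1}), the pairs fed to the optical flow network.
    """
    if len(frames) != 5:
        raise ValueError("expected exactly 5 consecutive frames")
    return [(frames[i], frames[i + 1]) for i in range(4)]

# Dummy 720p clip; a real pipeline would decode frames from the video.
clip = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(5)]
pairs = make_frame_pairs(clip)
```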
Further, inputting the continuous N frames of images into the bidirectional optical flow network, and acquiring the output result of the bidirectional optical flow network, the method comprises the following steps:
inputting 4 pairs of RGB color space data into a bidirectional optical flow network, wherein the bidirectional optical flow network is a CNN network conforming to a UNet structure;
and acquiring 4 positive and negative optical flow results output by the bidirectional optical flow network.
Specifically, the bidirectional optical flow network is a CNN with a UNet structure, and its output is 4 pairs of forward and backward optical flow results $F_{t-2 \leftrightarrow t-1}$, $F_{t-1 \leftrightarrow t}$, $F_{t \leftrightarrow t+1}$, $F_{t+1 \leftrightarrow t+2}$, each in the data format $H \times W \times 2$.
The OpenCV-based Farneback algorithm is the most classical traditional dense optical flow algorithm; deep-learning networks such as FlowNet 1, 2 and 3 and PWC-Net, as well as later optical flow networks, paired with a flow-reversal layer, yield the bidirectional optical flow directly. Bidirectional optical flow results can also be obtained directly from bidirectional optical flow networks designed for frame-based applications.
Further, acquiring pose data of the camera and performing synchronous operation with the video time stamp comprises:
acquiring initial triaxial angular velocity data of a camera based on the MEMS gyroscope;
and filtering the initial triaxial angular velocity data based on complementary filtering and Kalman filtering to generate pose data of the camera.
In specific implementation, the MEMS gyroscope supplies the three-axis angular velocity $(\omega_x, \omega_y, \omega_z)$ as a function of time. Data preprocessing with complementary filtering and Kalman filtering is first applied to the gyroscope data: over short intervals the angle obtained from the gyroscope is taken as the optimal value, and at regular intervals the averaged accelerometer samples are used to correct the gyroscope angle. Kalman filtering then uses the state estimate of the previous time and the observation of the current time to obtain the optimal estimate of the dynamic system's state variables at the current time.
Further, inputting the output result of the bidirectional optical flow network and the camera pose data into an alignment network, comprising:
carrying out synchronous processing on the triaxial angular velocity data and the video data to generate synchronous triaxial angular velocity data and obtain a relative rotation matrix corresponding to the triaxial angular velocity data;
and inputting the synchronized triaxial angular velocity data and the output result of the bidirectional optical flow network into an alignment network, wherein the alignment network is an RNN (recurrent neural network) comprising a forgetting stage, a memory-selection stage and an output stage.
Specifically, the alignment network takes the bidirectional optical flow results $F_{t-2 \leftrightarrow t-1}$, $F_{t-1 \leftrightarrow t}$, $F_{t \leftrightarrow t+1}$, $F_{t+1 \leftrightarrow t+2}$ as input and passes them through several 2D convolutional layers and activation functions. The encoder encodes the high-dimensional data into low-dimensional hidden variables, forcing the neural network to learn the most informative features, and produces the image's motion parameters as a rotation matrix $R$ that includes the rotation and translation parameters of the three axes. The transformation of the current frame $I_t$ is taken as the mean of the transformation parameters of its four adjacent frames:

$R_t = \frac{1}{4}\,(R_{t-2} + R_{t-1} + R_{t+1} + R_{t+2})$ (formula 1)
The motion parameters, obtained in time sequence, are then fed to the RNN to learn long-term dependency information and allow information to persist. The anti-shake algorithm must infer the next moment from the persisted motion information of the preceding period while avoiding excessive long-term dependence. The RNN of this invention is designed with three internal stages to filter valid information over the time sequence:
Forgetting stage: this stage selectively forgets the state passed in from the previous node.
Memory-selection stage: this stage selectively memorizes the current input, recording the important parts and down-weighting the unimportant ones; the results of these two stages are summed to form the state passed onward.
Output stage: this stage decides what becomes the output of the current state; the result of the previous stage is scaled by a tanh activation function, and the final result is output.
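The three stages just described correspond to the gates of a standard LSTM cell; a minimal numpy sketch of one step (dimensions and random weight initialisation are purely illustrative, not the patent's network):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM step: forgetting, memory-selection and output stages.

    x: input vector; h, c: previous hidden and cell state;
    W: dict of weight matrices for the gates (illustrative values).
    """
    z = np.concatenate([h, x])
    f = sigmoid(W["f"] @ z)      # forgetting stage: drop parts of old state
    i = sigmoid(W["i"] @ z)      # memory selection: what to record
    g = np.tanh(W["g"] @ z)      # candidate content to record
    c_new = f * c + i * g        # sum of the two stages, passed onward
    o = sigmoid(W["o"] @ z)      # output stage: what to emit
    h_new = o * np.tanh(c_new)   # scaled by tanh, as in the text
    return h_new, c_new

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = {k: rng.normal(size=(d_h, d_h + d_in)) * 0.1 for k in "figo"}
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, W)
```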
Besides memorizing and selecting the implicit motion parameters of the previous step over the time sequence, the RNN's input must also fuse and filter the preprocessed, synchronized MEMS camera poses. Because the gyroscope samples at a higher frequency than the video, the gyroscope and video data must be synchronized with respect to sampling time; the invention uses spherical linear interpolation (Slerp) between the gyroscope quaternions $q_0$ and $q_1$ bracketing each video timestamp, with interpolation parameter $t \in [0, 1]$:

$\mathrm{Slerp}(q_0, q_1; t) = \dfrac{\sin((1 - t)\,\theta)}{\sin\theta}\, q_0 + \dfrac{\sin(t\,\theta)}{\sin\theta}\, q_1$ (formula 2)

where $\theta$ is the arc angle, in radians, of the rotation from $q_0$ to $q_1$. In this way the gyroscope pose $q_t$ at the same timestamp as each camera video frame can be computed.
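The Slerp interpolation above can be implemented directly on unit quaternions (a standard formulation, not code from the patent):

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0 and q1."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:              # take the shorter arc on the 4-sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:           # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)     # arc angle from q0 to q1, in radians
    s = np.sin(theta)
    return (np.sin((1 - t) * theta) / s) * q0 + (np.sin(t * theta) / s) * q1

# Interpolate halfway between the identity and a 90-degree rotation about z.
q_id = np.array([1.0, 0.0, 0.0, 0.0])
q_z90 = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
q_mid = slerp(q_id, q_z90, 0.5)  # a 45-degree rotation about z
```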
Since the gyroscope data is acquired in the 3D world coordinate system, the camera pose must be mapped to 2D image coordinates by combining the extrinsic parameters with the camera intrinsic parameters:

$x = K\,[R \mid t]\,X$ (formula 3)

$K = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}$ (formula 4)

where $K$ is the camera intrinsic matrix, $R$ is the camera rotation matrix, and $f$ is the focal length. The RNN needs a historical pose queue, and the preceding step yields absolute pose information; however, the rotation transformation between poses and the network's learning process both require relative rotation matrices, so the relative rotation matrix derived from the gyroscope data is needed. The advantage of this design is that the network model only needs to learn incremental changes and is invariant to the absolute pose. It was also found during training that using relative information achieves more consistent visual effects and stronger generalization. The pose information passed through the alignment network, assisted by the MEMS relative poses, learns rotation information better and filters high-frequency jitter in the time domain.
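Under a pure-rotation model, combining the intrinsic matrix with a camera rotation gives the image-space warp as the homography $H = K R K^{-1}$, a common construction in gyro-based stabilization (a sketch under that assumption; the patent's exact mapping is given only as images and may differ):

```python
import numpy as np

def intrinsics(f, cx, cy):
    """Pinhole intrinsic matrix K: focal length f, principal point (cx, cy)."""
    return np.array([[f, 0.0, cx],
                     [0.0, f, cy],
                     [0.0, 0.0, 1.0]])

def rotation_homography(K, R):
    """Homography H = K R K^-1 mapping pixels under a pure camera rotation R."""
    return K @ R @ np.linalg.inv(K)

K = intrinsics(f=800.0, cx=640.0, cy=360.0)  # illustrative intrinsics
H = rotation_homography(K, np.eye(3))        # no rotation: identity warp
```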
And further, warping the output result of the alignment network to a corresponding pose to obtain an image stabilization result of the current image frame, and finishing anti-shake operation.
In specific implementation, the stable transformation matrix output by the alignment network is applied to the jittery initial RGB color-space data $I_t$: the image is warped to the pose corresponding to the rotation-matrix result, which is the image stabilization result of the current frame. The warping here divides the picture into a 12 × 12 grid and warps the image within each grid cell to the stabilized pose. The image stabilization result has good uniformity and stability while preserving the original parallax.
Further, the loss function in the embodiment of the present invention is calculated as follows:
transformation loss: this loss has two components, and the camera motion can be tracked in the initial stage in order for the network to learn the motion parameters first. Part of the matrix for rotationCalculated parameters
Figure 770727DEST_PATH_IMAGE029
Sum true value
Figure 914133DEST_PATH_IMAGE030
To find the L1 loss, another part is to transform the image before
Figure 123397DEST_PATH_IMAGE031
And the image after the parameter transformation from the network science
Figure 620238DEST_PATH_IMAGE032
The L1 loss was calculated.
Figure 660875DEST_PATH_IMAGE033
(formula 5)
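A minimal numpy sketch of the two-part L1 transformation loss (the equal weighting between the two parts is an assumption for illustration, not specified here):

```python
import numpy as np

def l1_loss(a, b):
    """Mean absolute error: the L1 loss used by both parts."""
    return np.mean(np.abs(a - b))

def transformation_loss(r_pred, r_true, img_warped, img_true, w=1.0):
    """Rotation-parameter L1 plus image L1; `w` balances the two parts."""
    return l1_loss(r_pred, r_true) + w * l1_loss(img_warped, img_true)

# Perfect rotation estimate, images differing by a constant 0.5 everywhere.
loss = transformation_loss(np.eye(3), np.eye(3),
                           np.zeros((8, 8)), np.full((8, 8), 0.5))
```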
Smoothing loss: based on the sampling time interval, the invention designs the smoothing loss with two parts to constrain the camera trajectory. One part directly constrains the inter-frame displacement (formula 6), and the other enlarges the time interval to constrain the current frame to fit the global displacement more closely (formula 7).
Drawing loss: the network's smoothing effect often has the side effect of pushing the picture beyond the actual picture boundary, so the invention designs a drawing loss to penalize this directly. The loss combines a weight parameter following a Gaussian distribution whose standard deviation is a preset value, a parameter that controls the tolerance of the picture's appearance, and the number of future frames that can be merged into the calculation. To define the function, the four corners of the stabilized pose are projected into the actual camera space, the twist angle and the maximum distance to the frame edge are normalized, and this relative distance is computed (formula 8). This loss design controls the sensitivity of the algorithm to camera motion.
Deformation loss: deformation is the most important criterion for judging an anti-shake algorithm, because it greatly degrades the original image quality. The loss uses the spherical angle between the current image space and the true camera pose, a threshold, and a parameter that controls the slope of a logistic function; the deformation loss takes effect only when the angular deviation is greater than the threshold (formula 9).
Optical flow loss: the other loss functions are computed over the image as a whole, whereas the optical flow loss is applied to reduce the range of motion between individual pixels. In the calculation, points of the actual camera space are converted to the virtual space so that the correspondence between pixel points remains tight after the warping operation in image stabilization; this also avoids hollow pixels appearing after warping, which would require interpolation (formulas 10 to 12).
Total loss: since the invention is trained in stages, the weight of each loss function must be adjusted in each stage to achieve that stage's training objective; the total loss is the weighted sum of the individual losses, $L_{total} = \sum_i \lambda_i L_i$ (formula 13).
The method absorbs and integrates the advantages of a camera hardware system and a deep learning algorithm; it provides excellent video image stabilization in everyday, parallax, running, fast-rotation and crowd scenes, corrects images at the pixel level, restores the original viewing angle as far as possible, and maintains high-quality video with high stability, a low crop ratio and low distortion.
The embodiment of the invention has the following technical advantages:
the method for computing the dense optical flow by using the deep learning end-to-end CNN network is more robust than the traditional algorithm, and the obtained optical flow result precision (EPE) is higher.
An RNN is used for the first time to select historical and future camera pose data in the time domain, and the pose data is fused and corrected in the spatial domain.
The MEMS gyroscope data provides more accurate rotation parameters for the camera on top of the existing 3DOF, realizing a 6DOF anti-shake algorithm. This tracks the true motion of the camera more closely and supplements the camera data.
Index factors are fused into the loss function for the first time: trajectory smoothness, deformation, and drawing (cropping), the three hard metrics that anti-shake cares about most, are merged directly into training, and control parameters are added that constrain but do not excessively alter the actual scene.
Previous anti-shake algorithms only focus on the shake patterns associated with human motion and are not designed to handle the rolling-shutter phenomenon of the camera itself. The invention corrects rolling-shutter distortion in the optical-flow part of the pipeline.
The design of the drawing loss function not only directly controls the crop ratio but also restores the viewing angle of the original video better than other algorithms, which previous anti-shake algorithms did not address.
Representing the pose with a rotation matrix greatly reduces the parameter count and computation, and Slerp spherical linear interpolation solves the multi-sensor time synchronization problem.
It should be noted that, a certain order does not necessarily exist between the above steps, and those skilled in the art can understand, according to the description of the embodiments of the present invention, that in different embodiments, the above steps may have different execution orders, that is, may be executed in parallel, may also be executed interchangeably, and the like.
With reference to fig. 2, fig. 2 is a schematic diagram of the hardware structure of another embodiment of the deep learning-based hybrid anti-shake system according to an embodiment of the present invention. As shown in fig. 2, the system 10 includes: a memory 101, a processor 102 and a computer program stored on the memory and executable on the processor, the computer program implementing the following steps when executed by the processor 102:
acquiring a video shot by a camera, and acquiring continuous N frames of images based on the video;
inputting continuous N frames of images into a bidirectional optical flow network to obtain an output result of the bidirectional optical flow network;
acquiring pose data of a camera;
inputting the output result of the bidirectional optical flow network and the pose data into an alignment network;
and acquiring an output result of the alignment network, warping the output result of the alignment network to a corresponding pose to obtain an image stabilization result of the current image frame, and finishing anti-shake operation.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
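For orientation, the five steps above can be sketched as a single function. Every callable here (`flow_net`, `pose_source`, `align_net`, `warp`) is a hypothetical placeholder, not the patent's implementation:

```python
def stabilize_frame(frames, flow_net, pose_source, align_net, warp):
    """Minimal sketch of the five steps above, with all components
    stubbed out as callables (hypothetical placeholders)."""
    pairs = list(zip(frames[:-1], frames[1:]))          # N frames -> N-1 adjacent pairs
    flows = [flow_net(a, b) for a, b in pairs]          # bidirectional optical flow step
    pose = pose_source()                                # gyro-derived camera pose
    target_pose = align_net(flows, pose)                # alignment network output
    return warp(frames[len(frames) // 2], target_pose)  # warp current frame to pose
```

With N = 5 frames, the middle frame is treated as the "current" frame being stabilized; that choice is an assumption for the sketch.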
Optionally, the computer program when executed by the processor 102 further implements the steps of:
acquiring a video shot by a camera, and acquiring continuous 5-frame RGB images based on the video;
4 pairs of RGB color space data are generated based on the 5-frame RGB images.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
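The 5-frames-to-4-pairs step amounts to adjacent pairing, e.g.:

```python
def make_pairs(frames):
    """Pair N consecutive RGB frames into N-1 adjacent (prev, next)
    pairs; for the 5-frame case described above this yields 4 pairs."""
    return list(zip(frames[:-1], frames[1:]))
```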
Optionally, the computer program when executed by the processor 102 further implements the steps of:
inputting 4 pairs of RGB color space data into a bidirectional optical flow network, wherein the bidirectional optical flow network is a CNN with a UNet structure;
and acquiring 4 positive and negative optical flow results output by the bidirectional optical flow network.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
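As a toy stand-in for the network (which predicts dense flow fields), estimating a single global integer shift in both directions illustrates what one "positive and negative" flow result per frame pair means. The exhaustive search below is purely illustrative:

```python
import numpy as np

def shift_flow(a, b, max_shift=3):
    """Toy stand-in for an optical flow estimate: find the integer
    (dy, dx) translation that best aligns frame `a` to frame `b` by
    exhaustive search. The patent's UNet-style CNN would predict a
    dense per-pixel flow field instead; this global shift only
    illustrates the forward/backward idea.
    """
    best, best_err = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            err = np.mean((np.roll(a, (dy, dx), axis=(0, 1)) - b) ** 2)
            if err < best_err:
                best, best_err = (dy, dx), err
    return best

def bidirectional_flows(frames):
    """For N frames, return N-1 (forward, backward) flow pairs."""
    return [(shift_flow(a, b), shift_flow(b, a))
            for a, b in zip(frames[:-1], frames[1:])]
```

Forward and backward estimates are (up to noise) negatives of each other, which is the consistency property a bidirectional network can exploit.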
Optionally, the computer program when executed by the processor 102 further implements the steps of:
acquiring initial triaxial angular velocity data of a camera based on the MEMS gyroscope;
and filtering the initial triaxial angular velocity data based on complementary filtering and Kalman filtering to generate pose data of the camera.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
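A minimal sketch of the two filtering ideas on a single axis, with illustrative noise parameters (not the patent's values):

```python
import numpy as np

def complementary_filter(gyro_angle, accel_angle, alpha=0.98):
    """One complementary-filter step: trust the gyro-integrated angle
    at high frequency and the accelerometer-derived angle at low
    frequency. alpha is an illustrative blend weight."""
    return alpha * gyro_angle + (1.0 - alpha) * accel_angle

def kalman_1d(measurements, q=1e-4, r=1e-2):
    """Minimal scalar Kalman filter smoothing one angular-velocity
    axis. q: process noise, r: measurement noise (assumed values)."""
    x, p = measurements[0], 1.0
    out = []
    for z in measurements:
        p += q                  # predict: uncertainty grows
        k = p / (p + r)         # Kalman gain
        x += k * (z - x)        # update toward the measurement
        p *= (1.0 - k)          # uncertainty shrinks after update
        out.append(x)
    return out
```

In the patent the same two ideas are applied per axis to the MEMS gyroscope's three-axis angular velocity before the pose is formed.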
Optionally, the computer program when executed by the processor 102 further implements the steps of:
carrying out synchronous processing on the triaxial angular velocity data and the video data to generate synchronous triaxial angular velocity data and obtain a relative rotation matrix corresponding to the triaxial angular velocity data;
and inputting the synchronized triaxial angular velocity data and the output result of the bidirectional optical flow network into an alignment network, wherein the alignment network is an RNN (recurrent neural network) comprising a forgetting stage, a select-and-memorize stage, and an output stage.
The specific implementation steps are the same as those of the method embodiments, and are not described herein again.
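The forgetting, select-and-memorize, and output stages correspond to the gates of an LSTM-style recurrent cell. A minimal numpy step, with illustrative shapes and biases omitted for brevity, might be:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    """One step of an LSTM-style cell matching the three stages named
    above: a forget gate, a select-and-memorize (input) gate, and an
    output gate. W holds four weight matrices applied to [h, x];
    shapes and naming are assumptions, not the patent's.
    """
    hx = np.concatenate([h, x])
    f = sigmoid(W["f"] @ hx)      # forgetting stage
    i = sigmoid(W["i"] @ hx)      # select stage
    g = np.tanh(W["g"] @ hx)      # candidate memory
    c = f * c + i * g             # memorize stage
    o = sigmoid(W["o"] @ hx)      # output stage
    h = o * np.tanh(c)
    return h, c
```

Run over the frame sequence, such a cell carries a hidden state across time, which is what lets the alignment network smooth the pose trajectory rather than treat frames independently.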
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer-executable instructions for execution by one or more processors, for example, to perform method steps S100 through S500 of fig. 1 described above.
By way of example, non-volatile storage media can include read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The memory components or memory of the operating environment described in embodiments of the invention are intended to comprise one or more of these and/or any other suitable types of memory.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A hybrid anti-shake method based on deep learning is characterized by comprising the following steps:
acquiring a video shot by a camera, and acquiring continuous N frames of images based on the video;
inputting continuous N frames of images into a bidirectional optical flow network to obtain an output result of the bidirectional optical flow network;
acquiring pose data of a camera;
inputting the output result of the bidirectional optical flow network and the pose data into an alignment network;
and acquiring an output result of the alignment network, warping the output result of the alignment network to a corresponding pose to obtain an image stabilization result of the current image frame, and finishing anti-shake operation.
2. The deep learning-based hybrid anti-shake method according to claim 1, wherein the acquiring a video taken by a camera, acquiring consecutive N-frame images based on the video, comprises:
acquiring a video shot by a camera, and acquiring continuous 5-frame RGB images based on the video;
4 pairs of RGB color space data are generated based on the 5-frame RGB images.
3. The deep learning-based hybrid anti-shake method according to claim 2, wherein the inputting of the consecutive N frames of images into the bidirectional optical flow network and obtaining the output result of the bidirectional optical flow network comprises:
inputting 4 pairs of RGB color space data into a bidirectional optical flow network, wherein the bidirectional optical flow network is a CNN with a UNet structure;
and acquiring 4 positive and negative optical flow results output by the bidirectional optical flow network.
4. The deep learning-based hybrid anti-shake method according to claim 3, wherein the acquiring pose data of the camera comprises:
acquiring initial triaxial angular velocity data of a camera based on the MEMS gyroscope;
and filtering the initial triaxial angular velocity data based on complementary filtering and Kalman filtering to generate pose data of the camera.
5. The deep learning-based hybrid anti-shake method according to claim 4, wherein the inputting the output results of the bidirectional optical flow network and the pose data into an alignment network comprises:
performing synchronous processing on the triaxial angular velocity data and the video data to generate synchronized triaxial angular velocity data and obtain a relative rotation matrix corresponding to the triaxial angular velocity data;
and inputting the synchronized triaxial angular velocity data and the output result of the bidirectional optical flow network into an alignment network, wherein the alignment network is an RNN (recurrent neural network) comprising a forgetting stage, a select-and-memorize stage, and an output stage.
6. A hybrid anti-shake system based on deep learning, the system comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program when executed by the processor implementing the steps of:
acquiring a video shot by a camera, and acquiring continuous N frames of images based on the video;
inputting continuous N frames of images into a bidirectional optical flow network to obtain an output result of the bidirectional optical flow network;
acquiring pose data of a camera;
inputting the output result of the bidirectional optical flow network and the pose data into an alignment network;
and acquiring an output result of the alignment network, warping the output result of the alignment network to a corresponding pose to obtain an image stabilization result of the current image frame, and finishing anti-shake operation.
7. The deep learning-based hybrid anti-shake system according to claim 6, wherein the computer program, when executed by the processor, further implements the steps of:
acquiring a video shot by a camera, and acquiring continuous 5-frame RGB images based on the video;
4 pairs of RGB color space data are generated based on the 5-frame RGB images.
8. The deep learning based hybrid anti-shake system according to claim 7, wherein the computer program, when executed by the processor, further implements the steps of:
inputting 4 pairs of RGB color space data into a bidirectional optical flow network, wherein the bidirectional optical flow network is a CNN with a UNet structure;
and acquiring 4 positive and negative optical flow results output by the bidirectional optical flow network.
9. The deep learning based hybrid anti-shake system according to claim 8, wherein the computer program, when executed by the processor, further implements the steps of:
acquiring initial triaxial angular velocity data of a camera based on the MEMS gyroscope;
and filtering the initial triaxial angular velocity data based on complementary filtering and Kalman filtering to generate pose data of the camera.
10. A non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the deep learning based hybrid anti-shake method of any one of claims 1-5.
CN202211077092.4A 2022-09-05 2022-09-05 Hybrid anti-shake method and system based on deep learning Pending CN115174817A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211077092.4A CN115174817A (en) 2022-09-05 2022-09-05 Hybrid anti-shake method and system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211077092.4A CN115174817A (en) 2022-09-05 2022-09-05 Hybrid anti-shake method and system based on deep learning

Publications (1)

Publication Number Publication Date
CN115174817A true CN115174817A (en) 2022-10-11

Family

ID=83481881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211077092.4A Pending CN115174817A (en) 2022-09-05 2022-09-05 Hybrid anti-shake method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN115174817A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012164303A (en) * 2010-12-23 2012-08-30 Samsung Electronics Co Ltd Digital image stabilization method utilizing adaptive filtering
CN109729263A (en) * 2018-12-07 2019-05-07 苏州中科广视文化科技有限公司 Video based on fusional movement model removes fluttering method
CN110490928A (en) * 2019-07-05 2019-11-22 天津大学 A kind of camera Attitude estimation method based on deep neural network
CN112967341A (en) * 2021-02-23 2021-06-15 湖北枫丹白露智慧标识科技有限公司 Indoor visual positioning method, system, equipment and storage medium based on live-action image
CN114429191A (en) * 2022-04-02 2022-05-03 深圳深知未来智能有限公司 Electronic anti-shake method, system and storage medium based on deep learning
WO2022125090A1 (en) * 2020-12-10 2022-06-16 Google Llc Enhanced video stabilization based on machine learning models


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU XU: "Research on Advanced Video Stabilization Technology", Master's Thesis, Shanghai Jiao Tong University *

Similar Documents

Publication Publication Date Title
CN111133747B (en) Method and device for stabilizing video
CN108363946B (en) Face tracking system and method based on unmanned aerial vehicle
CN101616310B (en) Target image stabilizing method of binocular vision system with variable visual angle and resolution ratio
CN110493525B (en) Zoom image determination method and device, storage medium and terminal
US10764496B2 (en) Fast scan-type panoramic image synthesis method and device
CN107566688B (en) Convolutional neural network-based video anti-shake method and device and image alignment device
CN105611116B (en) A kind of global motion vector method of estimation and monitor video digital image stabilization method and device
JP6087671B2 (en) Imaging apparatus and control method thereof
CN110520694A (en) A kind of visual odometry and its implementation
CN107564063B (en) Virtual object display method and device based on convolutional neural network
CN112585644A (en) System and method for creating background blur in camera panning or movement
CN114175091A (en) Method for optimal body or face protection with adaptive dewarping based on context segmentation layer
WO2010151215A1 (en) Real time video stabilization
Wang et al. Video stabilization: A comprehensive survey
CN108900775A (en) A kind of underwater robot realtime electronic image stabilizing method
CN110060295B (en) Target positioning method and device, control device, following equipment and storage medium
CN114170290A (en) Image processing method and related equipment
Wang et al. Automated camera-exposure control for robust localization in varying illumination environments
US11531211B2 (en) Method for stabilizing a camera frame of a video sequence
CN116152121B (en) Curved surface screen generating method and correcting method based on distortion parameters
CN115174817A (en) Hybrid anti-shake method and system based on deep learning
JP2016110312A (en) Image processing method, and image processor and program
US10764500B2 (en) Image blur correction device and control method
KR101576426B1 (en) Apparatus and Method for surveillance using fish eyes lens
JP7013205B2 (en) Image shake correction device and its control method, image pickup device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination