CN113163121A - Video anti-shake method and readable storage medium - Google Patents
- Publication number: CN113163121A (application CN202110431451.0A)
- Authority: CN (China)
- Prior art keywords: layer, input, video, image, output
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- H04N23/6811: Motion detection based on the image signal (control of cameras or camera modules for stable pick-up of the scene, e.g. compensating for camera body vibrations)
- H04N23/682: Vibration or motion blur correction
- G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T2207/10016: Video; Image sequence
- G06T2207/20081: Training; Learning
- G06T2207/20084: Artificial neural networks [ANN]
Abstract
The invention discloses a video anti-shake method and a readable storage medium, belonging to the technical field of deep learning and video anti-shake. The method comprises the following steps: acquiring a video image sequence to be processed as the input of a pre-trained deep neural network model, wherein the deep neural network model comprises a feature extraction module and a jitter parameter estimation module; acquiring deep abstract features of each frame of the video image sequence to be processed with the feature extraction module; processing the deep abstract features of a plurality of adjacent frames with the jitter parameter estimation module to estimate the jitter parameters of the current frame; and obtaining a stabilized current frame image according to the jitter parameters of the current frame. The method offers strong interference resistance, high running speed, and a good anti-shake effect.
Description
Technical Field
The invention relates to the technical field of deep learning and video anti-shake, and in particular to a deep-learning-based video anti-shake method and a readable storage medium.
Background
With the advance of science and technology and the modernization of daily life, video data has become an essential part of everyday life and work. Video acquisition equipment is subject to interference from many environmental factors, such as strong wind, vibration of the mounting base, and hand tremor on handheld devices, so captured video almost always contains some picture shake; when the shake is pronounced, it degrades the viewing experience. Anti-shake processing of video is therefore an indispensable step. The main video anti-shake techniques in current use are the following:
(1) Mechanical video anti-shake: dedicated sensors such as gyroscopes and accelerometers measure the camera's motion parameters, and the camera motion is compensated by moving the image sensor, achieving the anti-shake purpose.
(2) Optical video anti-shake: a movable lens group variably adjusts the optical path length. After the camera's motion parameters are measured by dedicated sensors, the camera motion is compensated by moving the lenses, achieving the anti-shake purpose.
(3) Digital video anti-shake: after the video is captured, computer vision techniques estimate its motion, smooth the motion trajectory, and reconstruct a video with a stable picture. No dedicated sensors are needed to estimate the camera motion.
Mechanical and optical video anti-shake both require dedicated sensors to sense camera motion; their hardware cost is high, which hinders large-scale deployment. Digital video anti-shake needs no special hardware and relies on computer vision alone. Its low cost and good anti-shake effect have made it widely used.
Disclosure of Invention
The invention aims to overcome the defects in the background art and provide a video anti-shake method with strong interference resistance, high running speed, and a good anti-shake effect.
To achieve the above object, in one aspect, a video anti-shake method is adopted, which includes the following steps:
acquiring a video image sequence to be processed as the input of a pre-trained deep neural network model, wherein the deep neural network model comprises a feature extraction module and a jitter parameter estimation module;
acquiring deep abstract features of each frame of image of a video image sequence to be processed by using a feature extraction module;
processing deep abstract features of a plurality of adjacent frame images by using a jitter parameter estimation module, and estimating jitter parameters of the current frame image;
and acquiring a stable current frame image according to the jitter parameters of the current frame.
Further, the feature extraction module includes an input image resolution fast-dropping layer IFD, convolution layers conv0, conv1_0, conv1_1, conv2_0, conv2_1, conv2_2 and conv3, feature map addition layers sum0 and sum1, and a maximum value pooling layer maxpool0, wherein the video image sequence to be processed serves as the input of the input image resolution fast-dropping layer IFD, the output of the IFD layer is connected to the input of the convolution layer conv0, the output of conv0 is connected to the input of the maximum value pooling layer maxpool0, the output of maxpool0 is connected to the input of the feature map addition layer sum0 and the input of the convolution layer conv1_0, the output of conv1_0 is connected to the input of the convolution layer conv1_1, and the output of conv1_1 is connected to the input of the feature map addition layer sum0; the output of sum0 is connected to the input of the convolution layer conv2_0 and the input of the convolution layer conv2_2, respectively, the output of conv2_0 is connected to the input of the convolution layer conv2_1, the outputs of conv2_1 and conv2_2 are both connected to the input of the feature map addition layer sum1, the output of sum1 is connected to the input of the convolution layer conv3, and the output of conv3 is the deep abstract feature corresponding to each frame of the video image sequence to be processed.
Further, the jitter parameter estimation module comprises a feature splicing layer concat, convolution layers conv4, conv5_0 and conv5_1, a feature map addition layer sum2, a global mean pooling layer gavepool, and a fully connected layer fc;
the input of the feature splicing layer concat is the deep abstract features corresponding to each frame of the video image sequence to be processed, the output of concat is connected to the input of the convolution layer conv4, the output of conv4 is connected to the input of the convolution layer conv5_0 and the input of the feature map addition layer sum2, the output of conv5_0 is connected to the input of sum2 through the convolution layer conv5_1, the output of sum2 is connected to the input of the fully connected layer fc through the global mean pooling layer gavepool, and the output of fc is the jitter parameters of the current frame image.
Further, the input image resolution fast dropping layer IFD is configured to decompose each frame of image of the video image sequence to be processed into a plurality of decomposed sub-images, specifically:
carrying out uniform grid division on each frame of image of the video image sequence to be processed to obtain a grid subgraph;
carrying out digital coding on each grid subgraph according to the row priority order to obtain a digital coding grid subgraph;
and taking out the pixels with the same digital code in each grid subgraph and arranging and splicing them in grid order to obtain the decomposed subgraphs, which serve as the input of the convolution layer conv0.
Further, the training process of the deep neural network model comprises the following steps:
acquiring training sample data;
designing a target loss function of the deep neural network model as a mean square error loss function;
training the deep neural network model by using the training sample data, and learning the model parameters to obtain the pre-trained deep neural network model.
Further, the acquiring training sample data comprises:
collecting a jitter video set;
processing data in a jittering video set to obtain jittering parameters of the video;
and uniformly dividing the jittering video into a plurality of short video segments according to different sampling time lengths t, uniformly sampling a plurality of continuous images for each short video segment to form an input image sample sequence of the deep neural network model, and recording the jittering parameters of the current sampling image as the label data of the input image sequence.
Further, the overlapping area of the frames of any two continuous images of the plurality of continuous images is more than 50%.
Further, the obtaining a stable current frame image according to the jitter parameter of the current frame includes:
calculating a corresponding similarity transformation matrix according to the jitter parameters;
and multiplying the current frame image by the similarity transformation matrix to obtain a stable current frame image.
Further, the method further comprises:
and when the jitter parameters of the current frame image are predicted, storing the characteristic data of other images except the first image in the video image sequence to be processed for predicting the jitter parameters of the next frame image.
In another aspect, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the video anti-shake method described above.
Compared with the prior art, the invention has the following technical effects: the method adopts deep learning, comprehensively uses the pixel motion information across a multi-frame image sequence, adaptively predicts the jitter parameters of the current frame, and obtains a stabilized current frame through a compensation operation, giving a good anti-shake effect; it resists interference from complex lighting changes, object motion, and the like, so its robustness is stronger; its efficient network structure keeps the model computation small and the running speed high; and the network is trained end to end, making the model convenient to use.
Drawings
The following detailed description of embodiments of the invention refers to the accompanying drawings in which:
FIG. 1 is a flow chart of a video anti-shake method;
FIG. 2 is a flow chart of the construction and training of the deep neural network model;
FIG. 3 is a diagram of a deep neural network architecture;
FIG. 4 is a diagram of a feature extraction module architecture;
FIG. 5 is a block diagram of a jitter parameter estimation module;
FIG. 6 is a schematic diagram of single-channel image decomposition, where diagram (a) represents the encoded image and diagram (b) represents the decomposed subgraphs;
wherein the alphanumerics beside each network layer represent the feature map size of that feature layer, namely: feature map height × feature map width × number of feature map channels.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.
In the embodiments of the present invention, unless otherwise specified, directional terms such as "upper, lower, top, and bottom" are generally used with respect to the orientation shown in the drawings or to the positional relationship of the components in the vertical (gravitational) direction.
In addition, if there is a description of "first", "second", etc. in the embodiments of the present invention, the description of "first", "second", etc. is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature.
As shown in fig. 1, the present embodiment discloses a video anti-shake method, which includes the following steps S1 to S4:
s1, acquiring a video image sequence to be processed as the input of a pre-trained deep neural network model, wherein the deep neural network model comprises a feature extraction module and a jitter parameter estimation module;
s2, acquiring deep abstract features of each frame of image of the video image sequence to be processed by using the feature extraction module;
s3, processing deep abstract features of a plurality of adjacent frame images by using a jitter parameter estimation module, and estimating jitter parameters of the current frame image;
and S4, acquiring a stable current frame image according to the jitter parameters of the current frame.
As a further preferred technical solution, before the anti-shake processing is performed on the video image sequence to be processed by using the pre-trained deep neural network model, the deep neural network model needs to be constructed and trained, as shown in fig. 2:
(1) Designing the deep neural network model:
Video image jitter manifests itself across several consecutive adjacent frames. Video frame jitter correction therefore acquires the jitter parameters of the current frame image from the motion pattern of a plurality of adjacent frames, compensates the current frame accordingly, and reconstructs a stable video picture. The method adopts deep learning and adaptively computes the jitter parameters of the current frame from the motion pattern of adjacent frames by means of a deep neural network model. The invention uses a convolutional neural network (CNN) to design the deep neural network model. For convenience of description, some terms are defined: feature resolution refers to feature height × feature width; feature size refers to feature height × feature width × number of feature channels; kernel size refers to kernel width × kernel height; span refers to width span × height span; and each convolution layer is followed by a batch normalization layer and a nonlinear activation layer. The specific network structure of the designed deep neural network model is shown in FIG. 3, and the design process comprises the following steps:
1-1) Designing the input images of the deep neural network model:
the present invention needs to obtain the dithering parameters of the current frame according to a plurality of adjacent frame images, therefore, the input image comprises 20 RGB images with 3 channels, wherein the first 14 images represent the continuous image sequence before the current frame, the 15 th image represents the current frame image, and the later 5 images represent the continuous image sequence after the current frame. It should be noted that, the consecutive 20 3-channel RGB images may select different sampling steps according to different application scenarios. The resolution of the input image is 256 × 256, and the input image may be a whole video picture or a captured partial video picture according to different requirements.
It should be understood that the number of input images may be chosen to be different depending on the task requirements.
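For concreteness, a minimal Python sketch of how such an input window could be assembled is given below; the helper name and the decoded-frame list are illustrative assumptions, not part of the patent.

```python
def input_window(frames, t, step=1):
    """Gather the 20-frame input window around frame index t: 14 past
    frames, the current frame in the 15th position, then 5 future frames.
    `frames` is assumed to be a list of decoded 256x256 RGB images and
    `step` the sampling stride chosen for the application scenario."""
    return [frames[i] for i in range(t - 14 * step, t + 6 * step, step)]
```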
1-2) Designing the network structure of the deep neural network model:
the deep neural network is mainly used for acquiring the jitter parameters of the current frame image according to the input adjacent frame image sequences. The deep neural network model designed by the invention comprises a feature extraction module and a jitter parameter estimation module. The specific design steps are as follows:
1-2-1) Designing the feature extraction module of the deep neural network model. The feature extraction module extracts deep abstract features of the input image and is characterized by strong feature extraction capability and high operating efficiency. Its specific network structure is shown in FIG. 4, wherein the IFD (image fast definition) layer rapidly reduces the resolution of the input image; conv0 is a convolution layer with a kernel size of 5 × 5 and a span of 2 × 2; maxpool0 is a maximum pooling layer with a kernel size of 2 × 2 and a span of 2 × 2; conv1_0, conv1_1 and conv2_0 are convolution layers, each with a kernel size of 3 × 3 and a span of 1 × 1; conv2_1 is a convolution layer with a kernel size of 3 × 3 and a span of 2 × 2; conv2_2 is a convolution layer with a kernel size of 1 × 1 and a span of 2 × 2; conv3 is a convolution layer with a kernel size of 3 × 3 and a span of 2 × 2; and sum0 and sum1 are feature map addition layers that add two input feature maps pixel by pixel. The IFD layer decomposes a large input feature map into several small decomposed subgraphs, i.e., sub input feature maps; a multi-channel image is first split into single-channel images, which are decomposed separately. Taking a single-channel image as an example, the decomposition proceeds as follows:
image gridding, which is to divide an input image into uniform grids to obtain grid sub-images, wherein the width and height of each grid are 4 pixels, and a schematic diagram is shown in fig. 6 (a).
The grid image coding is to perform digital coding on each grid subgraph according to the line priority order to obtain a digital coding grid subgraph, wherein the range of coding numbers is 1-4, and the schematic diagram is shown in fig. 6 (a).
And (3) acquiring a decomposition subgraph, wherein the decomposition subgraph is mainly based on all the digital coding grid subgraphs, pixels with the same digital coding in each grid subgraph are taken out and are arranged and spliced according to the grid sequence to form a new subgraph, namely the decomposition subgraph, and the schematic diagram is shown in fig. 6 (b).
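As a concrete reading of the above steps, the NumPy sketch below decomposes a single-channel image. One hedge: the text states 4-pixel grid cells while the codes run 1 to 4 (which matches a 2 × 2 cell), so the cell size is left as a parameter.

```python
import numpy as np

def ifd_decompose(img, g=2):
    """Split a single-channel image into g*g lower-resolution subgraphs:
    pixels carrying the same position code inside every g x g grid cell
    are gathered and re-tiled in grid order (cf. FIG. 6)."""
    h, w = img.shape
    assert h % g == 0 and w % g == 0, "image must tile evenly into cells"
    # axes after reshape: (cell row, in-cell row, cell col, in-cell col)
    cells = img.reshape(h // g, g, w // g, g)
    # bring the in-cell offsets (codes 1..g*g, row-major) to the front
    return cells.transpose(1, 3, 0, 2).reshape(g * g, h // g, w // g)
```

A multi-channel image is handled by applying the routine to each channel separately, so an RGB frame yields 3·g² subgraphs; for g = 2 this coincides with the pixel-unshuffle (space-to-depth) operation available in common deep learning frameworks.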
1-2-2) Designing the jitter parameter estimation module of the deep neural network model. The jitter parameter estimation module estimates the jitter parameters of the current frame image by jointly using the deep abstract features of a plurality of adjacent frames. The specific network structure is shown in FIG. 5, wherein the concat layer is a feature splicing layer that concatenates several input feature maps into one output feature map along the channel dimension; conv4, conv5_0 and conv5_1 are convolution layers, each with a kernel size of 3 × 3 and a span of 1 × 1; sum2 is a feature map addition layer that adds two input feature maps pixel by pixel; the gavepool layer is a global mean pooling layer; and the fc layer is a fully connected layer that outputs a 3-dimensional feature representing the jitter parameters of the current frame image.
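Reading FIGS. 4 and 5 together, a minimal PyTorch sketch of the two modules might look as follows. The layer names, kernel sizes, spans, and connectivity come from the description above; the channel widths, the 2 × 2 IFD grid, and the padding choices are illustrative assumptions, since the exact feature sizes annotated in the figures are not reproduced here.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, k, s):
    # every convolution layer is followed by batch normalization and a
    # nonlinear activation, as stated in the design notes above
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, stride=s, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class FeatureExtractor(nn.Module):
    """Sketch of FIG. 4; channel width c is an assumed value."""
    def __init__(self, c=32):
        super().__init__()
        self.ifd = nn.PixelUnshuffle(2)        # assumed 2x2 grid: 3 -> 12 channels
        self.conv0 = conv_bn_relu(12, c, 5, 2)
        self.maxpool0 = nn.MaxPool2d(2, 2)
        self.conv1_0 = conv_bn_relu(c, c, 3, 1)
        self.conv1_1 = conv_bn_relu(c, c, 3, 1)
        self.conv2_0 = conv_bn_relu(c, c, 3, 1)
        self.conv2_1 = conv_bn_relu(c, 2 * c, 3, 2)
        self.conv2_2 = conv_bn_relu(c, 2 * c, 1, 2)
        self.conv3 = conv_bn_relu(2 * c, 2 * c, 3, 2)

    def forward(self, x):                      # x: (B, 3, 256, 256)
        x = self.maxpool0(self.conv0(self.ifd(x)))
        s0 = x + self.conv1_1(self.conv1_0(x))                   # sum0
        s1 = self.conv2_1(self.conv2_0(s0)) + self.conv2_2(s0)   # sum1
        return self.conv3(s1)                  # deep abstract features

class JitterHead(nn.Module):
    """Sketch of FIG. 5: concat -> conv4 -> residual block -> GAP -> fc."""
    def __init__(self, c=64, n_frames=20):
        super().__init__()
        self.conv4 = conv_bn_relu(n_frames * c, c, 3, 1)
        self.conv5_0 = conv_bn_relu(c, c, 3, 1)
        self.conv5_1 = conv_bn_relu(c, c, 3, 1)
        self.gavepool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(c, 3)              # (dx, dy, rotation angle)

    def forward(self, feats):                  # feats: list of 20 feature maps
        x = self.conv4(torch.cat(feats, dim=1))                  # concat
        x = x + self.conv5_1(self.conv5_0(x))                    # sum2
        return self.fc(self.gavepool(x).flatten(1))
```

A full model would run FeatureExtractor once per frame with shared weights and hand the resulting list of 20 feature maps to JitterHead.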
(2) Training the deep neural network model. The parameters of the deep neural network model are optimized on a large amount of labeled training sample data so that its prediction performance is optimal. The specific steps are as follows:
2-1) Obtaining training sample data. The quality of the training sample data directly determines the video anti-shake performance. The specific steps are:
2-1-1) Collecting a jittery video set: jittery videos, i.e., videos that exhibit picture shake, are collected under various backgrounds, lighting conditions, and shooting angles.
2-1-2) Calculating jitter parameters: for each jittery video, the jitter parameters of every frame are calculated with existing, mature techniques, i.e., the overall motion parameters of the jittery frame relative to the stable picture, specifically the x-coordinate displacement, the y-coordinate displacement, and the rotation angle about the image center.
2-1-3) Obtaining the input image sequences of the deep neural network model, i.e., the video image sequences: each jittery video is uniformly divided into shorter video segments according to different sampling durations t; for each segment, 20 consecutive images are uniformly sampled and together form one input image sample sequence of the deep neural network model, and the jitter parameters of the 15th sampled image are recorded as the label data of that input image sequence. In addition, among the 20 consecutive images, the pictures of any two consecutive images overlap by 50% or more.
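A sketch of this sampling scheme, assuming the per-frame jitter parameters are already available as an array (the function name and segmentation details are illustrative):

```python
import numpy as np

def make_samples(video, params, t_seconds, fps, n=20):
    """Cut one jittery video into segments of duration t_seconds and draw
    n evenly spaced frames per segment; the label of a sample is the
    jitter parameter triple (dx, dy, angle) of its 15th sampled frame.
    The >=50% inter-frame overlap check from the text is omitted here."""
    seg_len = int(t_seconds * fps)
    samples = []
    for start in range(0, len(video) - seg_len + 1, seg_len):
        idx = np.linspace(start, start + seg_len - 1, n).astype(int)
        seq = [video[i] for i in idx]
        samples.append((seq, params[idx[14]]))   # 15th sampled image
    return samples
```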
It should be noted that, because jittery videos can be difficult to collect, this embodiment may instead first collect stable, non-jittery videos and obtain jittery videos indirectly through a jitter transformation algorithm, specifically: a stable video set is collected under various backgrounds, lighting conditions, and shooting angles, i.e., videos essentially free of picture shake. Existing video jitter motion models are then applied to these stable videos to transform them into jittery videos.
2-2) Designing the target loss function of the deep neural network model. The task is formulated as a regression problem, and the target loss adopts the mean square error (MSE) loss function.
2-3) Training the deep neural network model: the labeled set of video image sequences is fed into the defined deep neural network model, and the model parameters are learned to obtain the trained deep neural network model.
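Steps 2-2) and 2-3) amount to a standard regression loop under the MSE loss; the sketch below assumes a PyTorch model and data loader, and the optimizer choice and hyper-parameters are assumptions (the patent specifies neither).

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=1e-3, device="cuda"):
    # mean square error regression of the 3 jitter parameters
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    model.to(device).train()
    for _ in range(epochs):
        for seq, label in loader:          # seq: (B, 20, 3, 256, 256)
            pred = model(seq.to(device))   # (B, 3): predicted (dx, dy, angle)
            loss = loss_fn(pred, label.to(device).float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```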
(3) Processing any given video image sequence with the trained deep neural network model, as follows:
3-1) After the forward pass of the deep neural network model, the jitter parameters of the current frame image (namely the 15th image of the video image sequence) are output directly;
3-2) The corresponding similarity transformation matrix is calculated from the obtained jitter parameters;
3-3) The current frame image is multiplied by the similarity transformation matrix to obtain a stable current frame image.
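Steps 3-2) and 3-3) can be realized with OpenCV as sketched below. How the three jitter parameters map into the similarity matrix, including the sign convention (whether the network predicts the jitter itself or its correction), is an assumption here; the description does not spell it out.

```python
import cv2

def stabilize(frame, dx, dy, angle_deg):
    """Build the similarity transform that undoes the estimated jitter
    (rotation about the image centre plus translation, unit scale) and
    warp the current frame with it. Flip the signs if the labels are
    defined the other way round."""
    h, w = frame.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), -angle_deg, 1.0)
    m[0, 2] -= dx   # remove the x-coordinate displacement
    m[1, 2] -= dy   # remove the y-coordinate displacement
    return cv2.warpAffine(frame, m, (w, h))
```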
As a further preferable embodiment, the method further includes:
and when the jitter parameters of the current frame image are predicted, storing the characteristic data of other images except the first image in the video image sequence to be processed for predicting the jitter parameters of the next frame image.
It should be noted that, when the deep neural network model is used for prediction, the feature extraction module is shared by all images of the input sequence. Predicting the jitter parameters of two successive frames therefore involves a large number of repeated feature extraction computations: the features of the last 19 images used for the current frame's prediction are identical to the features of the first 19 images used for the next frame's prediction. Consequently, when the jitter parameters of the current frame are predicted, the feature data of those last 19 images are stored and reused directly for the next frame's prediction. In this way, each prediction only needs to extract the features of a single new image, which greatly reduces the amount of computation.
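One way to realize this caching is a fixed-length buffer of per-frame features; the class below is an illustrative sketch, not the patent's implementation.

```python
from collections import deque

class FeatureCache:
    """Keep the most recent per-frame feature maps so that each new
    prediction runs the (shared) feature extractor on one image only."""
    def __init__(self, extractor, window=20):
        self.extractor = extractor
        self.feats = deque(maxlen=window)  # oldest entry drops automatically

    def push(self, frame):
        # extract features for the newly arrived frame; the other 19
        # entries are reused from the previous predictions
        self.feats.append(self.extractor(frame))
        return list(self.feats)            # feed to the estimation module once full
```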
It should be noted that the video jitter described in this embodiment means that the capture camera moves as a whole; that is, in the video picture, all pixels share the same motion trajectory except for moving objects.
Another embodiment of the present invention discloses a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the video anti-shake method described in the above embodiments is implemented.
Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing the related hardware. The program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the embodiments of the present invention are not limited to the details of the above embodiments, and various simple modifications can be made to the technical solution of the embodiments of the present invention within the technical idea of the embodiments of the present invention, and the simple modifications all belong to the protection scope of the embodiments of the present invention.
Claims (10)
1. A video anti-shake method, comprising:
acquiring a video image sequence to be processed as the input of a pre-trained deep neural network model, wherein the deep neural network model comprises a feature extraction module and a jitter parameter estimation module;
acquiring deep abstract features of each frame of image of a video image sequence to be processed by using a feature extraction module;
processing deep abstract features of a plurality of adjacent frame images by using a jitter parameter estimation module, and estimating jitter parameters of the current frame image;
and acquiring a stable current frame image according to the jitter parameters of the current frame.
2. The video anti-shake method of claim 1, wherein the feature extraction module includes an input image resolution fast-dropping layer IFD, convolution layers conv0, conv1_0, conv1_1, conv2_0, conv2_1, conv2_2 and conv3, feature map addition layers sum0 and sum1, and a maximum-value pooling layer maxpool0, wherein the video image sequence to be processed is used as an input of an input image resolution fast dropping layer IFD, an output of the input image resolution fast dropping layer IFD is connected with an input of a convolution layer conv0, an output of a convolution layer conv0 is connected with an input of a maximum value pooling layer maxpool0, an output of the maximum value pooling layer maxpool0 is connected with an input of a feature map additive layer sum0 and an input of a convolution layer conv1_0, an output of a convolution layer conv1_0 is connected with an input of a convolution layer conv1_1, and an output of the convolution layer conv1_1 is connected with an input of a feature map additive layer sum 0; the output of the feature map additive layer sum0 is connected to the input of the convolution layer conv2_0 and the input of the convolution layer conv2_2, respectively, the output of the convolution layer conv2_0 is connected to the input of the convolution layer conv2_1, the output of the convolution layer conv2_1 and the output of the convolution layer conv2_2 are both connected to the input of the feature map additive layer sum1, the output of the feature map additive layer sum1 is connected to the input of the convolution layer conv3, and the output of the convolution layer conv3 is a deep abstract feature corresponding to each frame of image of the video image sequence to be processed.
3. The video anti-shake method of claim 1, wherein the shake parameter estimation module includes a feature concatenation layer concat, a convolutional layer conv4, conv5_0, and conv5_1, a feature map addition layer sum2, a global mean pooling layer gavepool layer, and a full connection layer fc;
the input of the feature splicing layer concat is a deep abstract feature corresponding to each frame of image of the video image sequence to be processed, the output of the feature splicing layer concat is connected with the input of the convolution layer conv4, the output of the convolution layer conv4 is connected with the input of the convolution layer conv5_0 and the input of the feature map adding layer sum2, the output of the convolution layer conv5_0 is connected with the input of the feature map adding layer sum2 through the convolution layer conv5_1, the output of the feature map adding layer sum2 is connected with the input of the full connection layer fc through the global mean pooling layer gavepool, and the output of the full connection layer fc is a jitter parameter of the current frame image.
4. The video anti-shake method according to claim 2, wherein the input image resolution fast dropping layer IFD is configured to decompose each frame of image of the sequence of video images to be processed into a plurality of decomposed sub-images, specifically:
carrying out uniform grid division on each frame of image of the video image sequence to be processed to obtain a grid subgraph;
carrying out digital coding on each grid subgraph according to the row priority order to obtain a digital coding grid subgraph;
and taking out the pixels with the same digital code in each grid sub-graph, arranging and splicing the pixels according to the grid sequence to obtain the decomposed sub-graphs, and using the decomposed sub-graphs as the input of the convolutional layer conv 0.
5. The video anti-shake method of claim 1, wherein the training process of the deep neural network model comprises:
acquiring training sample data;
designing a target loss function of the deep neural network model as a mean square error loss function;
training the deep neural network model by using the training sample data, and learning the model parameters to obtain the pre-trained deep neural network model.
6. The video anti-shake method of claim 5, wherein the obtaining training sample data comprises:
collecting a jitter video set;
processing data in a jittering video set to obtain jittering parameters of the video;
and uniformly dividing the jittering video into a plurality of short video segments according to different sampling time lengths t, uniformly sampling a plurality of continuous images for each short video segment to form an input image sample sequence of the deep neural network model, and recording the jittering parameters of the current sampling image as the label data of the input image sequence.
7. The video anti-shake method of claim 6, wherein there is an overlapping area of more than 50% between the frames of any two consecutive images of the plurality of consecutive images.
8. The video anti-shake method according to claim 1, wherein the obtaining a stable current frame image according to the shake parameter of the current frame comprises:
calculating a corresponding similarity transformation matrix according to the jitter parameters;
and multiplying the current frame image by the similarity transformation matrix to obtain a stable current frame image.
9. The video anti-shake method of claim 1, further comprising:
and when the jitter parameters of the current frame image are predicted, storing the characteristic data of other images except the first image in the video image sequence to be processed for predicting the jitter parameters of the next frame image.
10. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program is executed by a processor to implement the video anti-shake method according to any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110431451.0A CN113163121A (en) | 2021-04-21 | 2021-04-21 | Video anti-shake method and readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110431451.0A CN113163121A (en) | 2021-04-21 | 2021-04-21 | Video anti-shake method and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113163121A (en) | 2021-07-23 |
Family
ID=76867703
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110431451.0A Pending CN113163121A (en) | 2021-04-21 | 2021-04-21 | Video anti-shake method and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113163121A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190147335A1 (en) * | 2017-11-15 | 2019-05-16 | Uber Technologies, Inc. | Continuous Convolution and Fusion in Neural Networks |
CN111371983A (en) * | 2018-12-26 | 2020-07-03 | 清华大学 | Video online stabilization method and system |
CN110070067A (en) * | 2019-04-29 | 2019-07-30 | 北京金山云网络技术有限公司 | The training method of video classification methods and its model, device and electronic equipment |
CN111314604A (en) * | 2020-02-19 | 2020-06-19 | Oppo广东移动通信有限公司 | Video anti-shake method and apparatus, electronic device, computer-readable storage medium |
CN111901532A (en) * | 2020-09-30 | 2020-11-06 | 南京理工大学 | Video stabilization method based on recurrent neural network iteration strategy |
CN112308000A (en) * | 2020-11-06 | 2021-02-02 | 安徽清新互联信息科技有限公司 | High-altitude parabolic detection method based on space-time information |
CN112330743A (en) * | 2020-11-06 | 2021-02-05 | 安徽清新互联信息科技有限公司 | High-altitude parabolic detection method based on deep learning |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115242966A (en) * | 2022-05-24 | 2022-10-25 | 浙江华感科技有限公司 | Anti-shake method and device for camera equipment and computer readable storage medium |
CN115242966B (en) * | 2022-05-24 | 2024-10-18 | 浙江华感科技有限公司 | Anti-shake method and device for image pickup equipment and computer readable storage medium |
CN117714875A (en) * | 2024-02-06 | 2024-03-15 | 博大视野(厦门)科技有限公司 | End-to-end video anti-shake method based on deep neural network |
CN117714875B (en) * | 2024-02-06 | 2024-04-30 | 博大视野(厦门)科技有限公司 | End-to-end video anti-shake method based on deep neural network |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210723 |