CN108765317B

CN108765317B - Joint optimization method for space-time consistency and feature center EMD self-adaptive video stabilization

Info

Publication number: CN108765317B
Application number: CN201810429065.6A
Authority: CN
Inventors: 郝爱民; 李晓; 李帅; 秦洪
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2018-05-08
Filing date: 2018-05-08
Publication date: 2021-08-27
Anticipated expiration: 2038-05-08
Also published as: CN108765317A

Abstract

The invention provides a joint optimization method of space-time consistency and feature center EMD self-adaptive video stabilization, which aims at video anti-shake on the basis of decomposing noise signals by using an EMD method, applies the technologies of saliency protection based on space structure consistency matrix estimation, visual elimination, self-adaptive smoothing, cutting area reduction, video completion and the like aiming at a shake video, improves the stability, the universality, the accuracy and the self-adaptability of video enhancement processing, and improves the integrity of the video.

Description

Joint optimization method for space-time consistency and feature center EMD self-adaptive video stabilization

Technical Field

The invention relates to a joint optimization method for space-time consistency and feature center EMD (Empirical Mode Decomposition) adaptive video stabilization, belonging to the technical field of computer vision enhancement.

Background

Hand-held devices used by amateurs, such as mobile phones, camcorders, tablet computers and commonly used cameras, have become fashionable, but because the stabilization devices of the devices are too simple, the videos captured by the hand-held devices are often erratic and uncomfortable for people to see. Video stabilization techniques aim to remove the jitter and shock between frames visible in jittered video. It is one of the most active research subjects in the field of computer vision and can be applied to many high-level video enhancement applications, such as artificial observation, video identification, video detection, video tracking, video compression, and the like.

Given videos captured by multiple handheld devices (such as cell phones or camcorders), most of the most advanced video stabilization methods, such as f.liu et al, a.goldstein et al, and c.morimoto et al, learn two-dimensional linear motion models by estimating and smoothing linear (affine or homography) transforms of consecutive frames, and many other most advanced video stabilization methods, such as c.buehler et al, f.liu et al, and s.liu et al, employ three-dimensional lens curve motion by processing parallax to generate strong stabilization results. Over the long term evolution, 2D and 3D approaches have had great success. However, some challenges have not been fully resolved in order to better integrate the traditional problems of feature detection, feature registration, and camera trajectory analysis into a unified stable framework.

The disadvantages and shortcomings of the prior art are summarized as follows: first, from the perspective of producing a stable, non-parallax camera trajectory based on saliency protection, parallax caused by non-trivial depth variations in the scene makes the estimation an ill-posed problem, since different regions may require different homographies of trajectory and spatial diversity. Although spatial multi-dimensional reconstruction can in principle handle parallax and produce more stable results, multi-dimensional motion model estimation is not robust enough due to variability and often neglecting significant feature protections such as fast rotation, tracking failures, camera zooming and motion blur. Therefore, how to simultaneously generate a uniform disparity-free homography matrix based on saliency protection by using spatially diverse homography matrices is urgently needed in the video stabilization problem. Second, the current approach suffers more or less from the following problems from the perspective of adaptively smoothing camera motion. Without sufficient adaptivity, more challenging situations (e.g. fast motion, fast scene transition, large occlusion) are difficult to handle by straightforward smoothing of the raw camera motion. As the motion of the camera tends to be smooth over/under in some cases. For example, unstable video tends to be excessively cut or jittery video tends to jitter. Therefore, considering the optimized quality of the original video, how to design an adaptive smoothing model to analyze and smooth the jittered video is crucial to the robust and high-quality result. Third, the current method has more or less of the following problems from the viewpoint of reducing the cropping area of the optimized video. Many techniques focus too much on the smoothing effect, ignoring the cropping area of the smoothed video. Therefore, in view of protection of original image content, an accurate and simple strategy is required to flexibly protect original video content while successfully suppressing high-frequency judder and low-frequency judder thereof. Fourth, while various high-order image interpolation and extrapolation methods have proven effective from the point of view of the integrity of the stabilized video, some unexpected results may occur if the effects of neighboring pixels are not adequately considered. For example, discontinuities in resample values may create significant mosaic and jagged portions. Therefore, in consideration of uncertainty of image interpolation and extrapolation, an accurate and simple strategy is required to flexibly complement the missing part of the original video.

Disclosure of Invention

The technical problem solved by the invention is as follows: the method overcomes the problems of stability and robustness of the existing video stabilization method, provides a joint optimization method of space-time consistency and feature center EMD adaptive video stabilization, adaptively processes video stabilization, image significance protection, parallax weakening, adaptive smoothing, cutting area reduction and video completion in a space-time consistent mode, improves the stability, universality, accuracy and adaptivity of video enhancement processing, and improves the integrity of the video.

The technical scheme adopted by the invention is as follows: a joint optimization method for space-time consistency and feature center EMD adaptive video stabilization comprises the following steps:

firstly, the invention provides a method for estimating a homography matrix of space structure consistency based on image salient region protection. This method effectively reveals the consistency of the intrinsic motion of different regions of the dithered video. While still being able to flexibly construct a uniform camera motion profile.

Secondly, the invention provides a method for self-adapting eigenmode function coefficients based on empirical mode decomposition. The method adaptively ensures that the jittering video is more stable, and promotes the analysis and optimization of the unstable video.

Thirdly, the invention provides a characteristic center strategy based on Gaussian distribution. This strategy can significantly reduce the cut-out region of the stabilized video while preserving the stability of the optimized video.

Finally, the present invention proposes an efficient video matching and completion method that adaptively interpolates missing regions and interpolating overlapping regions for the original video (image frame sequence).

The method specifically comprises the following four steps:

step (1), the consistency of the spatial structure based on the image significance protection: extracting image Scale Invariance (SIFT) feature points by an SIFT method, performing image Scale Invariance (SIFT) feature matching, deploying uniform grids on an image, acquiring saliency vectors of a plurality of grids, and performing space structure consistency deformation on the basis of protecting the saliency vectors of the image grids by taking the uniform grids as a reference to obtain a deformed image frame sequence;

step (2), self-adaptive eigenmode function: starting from a viewpoint position, re-acquiring an SIFT feature set from each image frame obtained by deformation in the step (1), constructing a space structure matrix based on SIFT, extracting rotation, translation and scaling motion information of the space structure matrix, constructing an original lens motion signal, decomposing and generating an eigenmode function through an EMD (empirical mode decomposition) algorithm, adaptively generating anisotropic coefficients of all the eigenmode functions according to an adaptive eigenmode function optimization algorithm, and acquiring a new lens adaptive motion signal through a weighted summation algorithm of the eigenmode function and the anisotropic coefficients;

step (3), EMD of the feature center: taking the lens self-adaptive motion signal obtained by calculation in the step (2) as a new input signal, weighting to obtain a new characteristic center motion signal according to a characteristic center algorithm, and protecting the original signal motion trend of the lens by adopting a weighting algorithm based on a Gaussian function while further inhibiting jitter components;

step (4), area extrapolation and interpolation: and (4) further generating a new stable video, namely an image frame sequence, on the basis of the characteristic central motion signal obtained by calculation in the step (3), complementing the blank formed by frame translation, rotation and stretching of the video, performing extrapolation of a missing region on the basis of an adaptive time domain algorithm, and performing interpolation of an overlapping region on the basis of a cubic spline interpolation algorithm.

The method for consistency of the spatial structure based on image significance protection in the step (1) is realized as follows:

(11) and (3) detection of extreme values in the scale space: firstly, constructing a scale space, searching image positions on all scales, and identifying potential interest points which are invariable in scale and rotation through a Gaussian differential function;

(12) key point positioning: determining the position and scale of each candidate position by fitting a fine model; the selection of key points depends on their degree of stability;

(13) direction determination: assigning one or more directions to each keypoint location based on the local gradient direction of the image, all subsequent operations on the image data being transformed with respect to the direction, scale and location of the keypoint, thereby providing invariance to these transformations;

(14) description of key points: measuring local gradient of the image on a selected scale in a neighborhood around each key point, and then converting the gradient into a representation, wherein the representation allows deformation of local shape and illumination change, and the obtained key points are scale-invariant feature points;

(15) matching the scale-invariant feature points by measuring Euclidean distances of the scale-invariant feature points in the adjacent frames;

(16) distributing a significance mapping vector for each pixel of the image, wherein the value range of the significance mapping vector is 0-1, deploying uniform grids on the image, averaging the significance mapping vectors of all pixels on each grid, and acquiring the significance vector of each grid;

(17) and taking the uniform grid as a reference, performing space structure consistency deformation on the basis of protecting the saliency vector of the image grid, preferentially protecting the area with the saliency vector value of the grid larger than the median, and concentrating distortion caused by the space structure consistency deformation into the area with the saliency vector value of the grid smaller than the median to obtain a deformed image frame sequence.

The self-adaptive eigenmode function optimization algorithm in the step (2) comprehensively considers a plurality of competing factors, wherein the factors comprise jitter elimination, excessive cutting elimination, maximum reduction of distortion and distortion, and solution of convex linear programming through CVX;

the concrete implementation is as follows:

(21) starting from the viewpoint position, re-acquiring an SIFT feature set from each image frame obtained by deformation in the step (1), and constructing a spatial structure matrix based on SIFT;

(22) converting the space structure matrix based on the SIFT into a model related to scale, rotation and translation, extracting rotation, translation and scaling motion information of the scale, rotation and translation related model, aggregating the rotation, translation and scaling motion information of all image frames, and constructing an original lens motion signal;

(23) decomposing and generating an eigenmode function through an EMD decomposition algorithm;

(24) and generating the anisotropy coefficients of all the eigenmode functions in a self-adaptive mode function optimization algorithm, and acquiring a new lens self-adaptive motion signal through a weighted summation algorithm of the eigenmode functions and the anisotropy coefficients.

The characteristic center algorithm in the step (3) further inhibits jitter components and simultaneously protects the motion trend of the original signal;

the concrete implementation is as follows:

(31) and taking the lens self-adaptive motion signal as a new input signal, and solving a Gaussian transformation signal of the new input signal through a Gaussian function.

(32) And according to the characteristic center algorithm, weighting the Gaussian transformation signal and the original motion signal of the lens to obtain a new characteristic center motion signal. The original signal motion tendency of the lens is protected as much as possible while further suppressing the shake component.

The adaptive time domain algorithm in the step (4) defines a matching rate, so that a combined optimization method of space-time consistency and feature center EMD adaptive video stabilization can adaptively select a library image for video frame extrapolation;

the concrete implementation is as follows:

(41) further generating a new stable video, namely an image frame sequence, by taking the characteristic center motion signal as an input parameter;

(42) in order to complement the blank formed by translation, rotation and stretching of the video frame, based on the adaptive time domain algorithm and the similarity transformation algorithm, the available library images, namely the extrapolated images, are adaptively selected and extrapolated, and the extrapolation of the missing area is executed.

(43) In order to complement the blank formed by frame translation, rotation, and stretching of the video frame sequence, the gray values of 16 adjacent pixels in the overlap region are weighted and processed based on a cubic spline interpolation algorithm, and interpolation of the overlap region is performed.

The principle of the invention is as follows: on the basis of decomposing noise signals by using an EMD method, the video anti-shake is taken as a target, and technologies such as saliency protection based on space structure consistency matrix estimation, visual elimination, self-adaption smoothness, cutting area reduction, video completion and the like are applied to a shake video. The invention mainly comprises the following four aspects:

(1) based on the consistency of the spatial structure of the image significance protection, the image significance is an important visual feature in the image, and the attention degree of human eyes to certain areas of the image is reflected. A large number of saliency mapping methods are widely applied to image compression, encoding, image edge and region reinforcement, saliency target segmentation and extraction and the like. In the spatial structure consistency method, a uniform grid is constructed on a natural image, a saliency map is constructed on an original image, and the saliency map is used for distributing 0-1 saliency values on each image pixel of the image. The saliency values of each cell of the grid of the original image are averaged to form a saliency vector for each cell of the grid. A new deformation mesh of the deformed image is calculated according to a corresponding stabilization method, the deformation process is used for saving the image area of the salient area, and the inevitable distortion deformation is concentrated in the unimportant area.

(2) The self-adaptive eigenmode function and the empirical mode decomposition are used for carrying out signal decomposition according to the time scale characteristics of data, and any basis function is not required to be preset. This is essentially different from the fourier decomposition and wavelet decomposition methods that are built on a priori harmonic basis functions and wavelet basis functions. Due to the characteristics, the empirical mode decomposition method can be theoretically applied to the decomposition of any type of signals, so that the method has very obvious advantages in processing non-stationary and non-linear data, is suitable for analyzing non-linear and non-stationary signal sequences, and has high signal-to-noise ratio. The key of the method is empirical mode decomposition, which can decompose a complex signal into a finite number of eigenmode functions, and each decomposed function component comprises local characteristic signals of different time scales of an original signal. The invention researches and realizes an eigenmode function decomposition method based on self-adaptation, and the core idea is as follows: the method comprises the steps of obtaining SIFT feature points of each frame of a jittering video, estimating a geometric deformation matrix on the basis, further calculating a lens motion signal, further decomposing the signal into a group of eigenmode functions, and calculating the optimal coefficient of each eigenmode function in a self-adaptive mode by using a CVX (constant frequency X) method. The method is used for carrying out stabilization processing on non-stationary data and then carrying out Hilbert transform to obtain a time-frequency spectrogram, so that the frequency with physical significance is obtained. The three-dimensional reconstruction of a video scene in a three-dimensional space is effectively avoided, and the robustness of the system is improved. Meanwhile, video processing is effectively converted into a signal processing problem, and the original video is conveniently analyzed and processed. The self-adaptability of the algorithm is improved, and the human intervention is avoided.

(3) The invention provides a characteristic center strategy based on Gaussian distribution, namely Gaussian distribution and normal distribution, and is a convenient model for quantitative phenomena in natural science and behavior science. Various psychological test scores and physical phenomena such as photon counts have been found to be approximately normal distributions. Although the root cause of these phenomena is often unknown, it can be shown theoretically that if many small effects are added up as one variable, then this variable follows a normal distribution. Normal distributions appear in many regions: for example, the mean of the sampling distribution is approximately normal, even though the original population distribution of the sampled samples does not follow a normal distribution. In addition, normal distribution information entropy is largest among all distributions with known mean and variance, which makes it a natural choice for a distribution with known mean and variance. A normal distribution is one of the most widely used distributions in statistics and many statistical tests. In probability theory, a normal distribution is the limiting distribution of several continuous as well as discrete distributions. The feature center strategy obviously reduces the clipping area of the stable video while maintaining the optimized video stabilization rate. The strategy keeps the characteristic centrality of the EMD, introduces the weighted Gaussian distribution, and keeps the motion trend of the original video track of the new video motion track while inhibiting the jitter.

(4) And (4) carrying out regional extrapolation and interpolation, and further generating a new stable video (image frame sequence) by taking the characteristic central motion signal as an input parameter. In order to complement the blank formed by frame translation, rotation and stretching of the video (image frame sequence), a matching rate is defined based on an adaptive time domain algorithm and a similar transformation algorithm, an available library image (an extrapolated image) is adaptively selected and extrapolated, and extrapolation of a missing area is performed. In order to complement the blank space formed by frame translation, rotation and stretching of the video (image frame sequence), the gray values of 16 adjacent pixels of the overlapping region are weighted and processed based on a cubic spline interpolation algorithm, and the interpolation of the overlapping region is performed. The adaptive time domain algorithm enables the space-time consistency and feature center EMD adaptive video stabilization combined optimization method to adaptively select a library image for video frame extrapolation.

Compared with the prior art, the invention has the advantages that:

(1) spatial structure consistency: the invention provides the spatial structure consistency based on the image significance protection, homography matrixes with different spaces are respectively constructed based on the spatial structure diversity of each area of the image, and the spatial consistency deformation is carried out based on the homography matrixes. Meanwhile, a uniform motion track of the camera can be flexibly constructed, and the universality and the accuracy of video enhancement processing are improved.

(2) Self-adaptability: the invention creates a self-adaptive eigenmode function optimization algorithm, self-adaptively calculates the video stability factor by using a CVX method, and self-adaptively generates an optimized camera motion track. The method adaptively ensures that the jittering video is more stable, promotes the analysis and optimization of the unstable video, and improves the stability and the adaptivity of video enhancement.

(3) Characteristic centrality: a characteristic center algorithm is constructed, the algorithm is a characteristic center strategy based on Gaussian distribution, the strategy can remarkably reduce the shearing area of the stable video, meanwhile, the stability of the optimized video is protected, and the accuracy and the integrity of the video are improved.

(4) Video integrity: in order to complement the blank formed by translation, rotation and stretching of the video frame, extrapolation of a missing area is executed based on an adaptive time domain algorithm, and interpolation of an overlapping area is executed based on a cubic spline interpolation algorithm. The method can adaptively interpolate the lost area and interpolate the overlapped area aiming at the original video (image frame sequence), thereby further improving the integrity of the video.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a graph of the spatial structure consistency homography matrix estimation and the relationship between S (t), D (t), and B (t); the upper graph is a homography matrix estimation graph of the consistency of the spatial structure, and the lower graph is a relation graph among S (t), D (t) and B (t);

FIG. 3 shows the original and optimized motion signals and the IMFs (eigenmode functions) decomposed from the original signals; the upper graph is a graph of the original and optimized motion signals, and the lower graph is an eigenmode function graph decomposed from the original signals;

FIG. 4 shows a square with or without feature centers EMD and its signals; the upper diagram is a feature screenshot of a square with or without feature center EMD, the lower left diagram is an EMD change diagram without the feature center, and the lower right diagram is an EMD change diagram with the feature center;

FIG. 5 is a diagram of adaptive time domain estimation;

FIG. 6 is an unknown pixel gray value estimate.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and examples.

As shown in FIG. 1, the method comprises the following steps:

(1) and extracting image features by an SIFT method, performing feature matching, obtaining a saliency vector, and performing space consistency deformation by taking the saliency vector as a reference.

(2) And (2) from the viewpoint position, re-acquiring a feature set of each image frame obtained by deformation in the step (1), constructing a space structure matrix based on SIFT, extracting rotation, translation and scaling information, constructing an original motion signal, and acquiring a new motion signal according to a self-adaptive eigenmode function optimization algorithm.

(3) And (3) taking the self-adaptive motion signal obtained by calculation in the step (2) as a new input signal, and protecting the motion trend of the original signal as far as possible while further inhibiting the jitter component according to a characteristic center algorithm.

(4) And (4) generating a new stable video by the characteristic center motion signal obtained by calculation in the step (3), performing extrapolation of a missing area based on a self-adaptive time domain algorithm for completing a blank formed by translation, rotation and stretching of a video frame, and performing interpolation of an overlapping area based on a cubic spline interpolation algorithm.

Specific implementations of the invention are described in detail.

1. Spatial structure consistency based on image saliency protection

The method extracts image features through an SIFT method, performs feature matching, obtains a saliency vector, and performs spatial consistency deformation by taking the saliency vector as a reference.

a. Spatial structure homography matrix construction based on SIFT

According to the method, SIFT features are extracted from a jittering video, and descriptor components and coordinate components are obtained from the SIFT features. The descriptor component is a K x 128 matrix where each row represents the invariant descriptor for one of the K keypoints. The SIFT descriptor is a vector of 128 values normalized to a single length. The coordinate component is a K × 4 matrix in which each row includes 4 values as the values of the key point coordinates (row, column, scale and direction). Oriented in [ -PI, PI [ ]]In the radian range. A SIFT feature match is accepted only if its euclidean distance is less than a distratito multiple of the second closest distance. In general, distratito is set to 0.1. The euclidean distance from point s to t is the length of the line segment connecting them. By the above-mentioned treatment, [ x, y ] can be obtained]An M x 2 matrix of coordinates. Outliers in the two frame match points are excluded by using the M-estimator SAmple Consensus (MSAC) algorithm. As shown in the second image of the first rectangle of figure 1,the geometric transformation maps the interior points in the matching points of the left video frame to the interior points in the matching points of the right video frame. Using a 3X 3 matrix T_tTo represent the geometric transformation:

wherein R is_tIs a 2 × 2 rotation matrix, O_tAre 2 x 1 translation vectors representing the direction and position of camera motion, respectively, in the global coordinate system. As shown in FIG. 2, the relative camera motion at time T may be transformed by a two-dimensional Euclidean transformation T_tIndicates that S is satisfied_t＝S_t-1T_t-1. St is represented as:

b. spatial structure consistency optimization

The invention overlays a uniform grid on the image, having

A column and

and (4) a row. The goal is to compute the warped mesh of the adjusted image. Consistent with common image redirection methods, saliency maps are used to assign each pixel of an image an important value between 0 and 1. A distortion that preserves as much as possible the salient regions of the image can be calculated, and the inevitable distortion will be concentrated in the less important regions. And calculating the average value of all the significance values in each unit of the square grids of the grid original image, and finally obtaining the significance vector.

Based on the significance vector, the influence of spatial grid inconsistency is further optimized, and the parallax is greatly reduced. The camera path and local homography matrix are acquired using the method proposed by Lowe et al. Then, a camera path with a spatial grid that is not uniform is defined for the entire video. Let S_i(t) as a framethe camera positions of the squares i in t are represented as follows:

S_i(t)＝S_i(t-1)T_i(t-1)

get S_i(1) For the cell matrix, the following formula is derived:

the video frame is uniformly divided into a plurality of meshes. As shown in FIG. 2 (top panel), each grid has a trajectory, denoted by S_i(t) represents. T is_i(t-1) denotes the same grid cell, from S_i(t-1) to S_i(t) estimated local homography matrix. As shown in fig. 2 (lower graph), d (t) represents the smooth path, and b (t) represents the transformation from the original path s (t) to the smooth path d (t). The spatial structure consistency trajectories of these cameras can be smoothed by the following formula:

s ═ S (t) } is the original path, and D ═ { D (t) } is the optimized path. Ω (i) represents eight neighbors of grid cell i. Data item | D_i(t)-S_i(t) | ensures that the new camera path approaches the original path to reduce clipping and distortion, and | | D_i(t)-D_j(t) | | may keep the current grid cell consistent with nearby neighbors. Parameter lambda_tFor balancing the two above. For an edge mesh cell, its value is set to be the same as that of its non-existent neighbors. I.e. when j is absent, it can be denoted as D_j(t)＝D_i(t) of (d). This optimization is quadratic, and the best results can be obtained by solving a large sparse linear system. The above solution is updated by a Jacobi-based iterative approach.

δ is an iteration factor. At initialization, D⁽⁰⁾(t) s (t), the optimal path D is obtained_i(t) of (d). Using B (t) ═ S^-1(t) d (t), the original video frame may be converted into a video frame having spatial structure consistency while protecting the salient region. With this technique, the present invention eliminates parallax between spatially distinct grid cells within each frame. However, it cannot eliminate jitter between different video frames. The video stabilization between different frames will be described in the next section.

2. Adaptive eigenmode function

And (3) from the viewpoint position, re-acquiring a feature set of each image frame obtained by deformation in the step (1), constructing a space structure matrix based on SIFT, extracting rotation, translation and scaling information, constructing an original motion signal, and acquiring a new motion signal according to a self-adaptive eigenmode function optimization algorithm.

EMD can decompose any complex signal to generate IMF through a process of screening that requires the following steps to be performed iteratively. The first step is to find local extreme points (maxima and minima). The second step is to generate the upper and lower envelopes by cubic spline interpolation. The third step is to judge whether the difference is IMF. The last step is to judge whether the residual is monotonous. Specifically, it can decompose the original signal, and the formula is as follows:

where f is_k(k-1, …, N) are IMFs, r_NIs the corresponding residual. IMF is shown in FIG. 3 (lower panel)_k(k-1, …,5), IMF6 denotes the residual. For ease of representation and calculation, the invention sets f_N+1＝r_NThat is, the present invention considers the residual as the last IMF. The original signal shown in fig. 3 (lower panel) is decomposed into IMFs and residuals. As in fig. 3 (lower panel), 6 IMFs represent the decomposed signal of the original signal. To stabilize the video, the high frequency signal should be smooth. The optimal trajectory of the camera, represented as a high gray line as in fig. 3 (upper graph), is obtained by minimizing the objective function, which is as follows.

Where α denotes the coefficient of IMF. Variables are denoted by X. When in use

When the temperature of the water is higher than the set temperature,

and

the L1 norm of the first, second and third derivatives of X, respectively.

And

the minimum of the sum smoothes the IMF (as shown in fig. 3 (lower panel)) to remove jitter in the unstable video.

Representing the original signal as shown in fig. 3 (upper panel).

And

the minimum value of the difference of (d) keeps the original signal close to the optimized signal to avoid excessive clipping. W is an adaptive balancing factor for balancing the four terms.

In summary, the optimization method of the present invention takes into account a variety of competing factors, such as eliminating vibration, eliminating excessive shear, and minimizing distortion. The above formula is a convex optimization problem that can be solved by a convex linear programming method (CVX). The optimal motion signal (shown in fig. 3 (top panel)) can be solved by the following equation:

is the optimal motion signal, represented as the high gray line in fig. 3 (upper panel).

Is a new coefficient of IMF. Fig. 3 (top) shows the camera trajectory before and after smoothing, represented by the low and high gray lines, respectively.

3. EMD of feature centers

As in fig. 4 (bottom left), the low gray lines represent the motion signal of the original path and the high gray lines represent the motion signal of the smooth path without the characteristic center EMD. As shown in the bottom left image in fig. 4, the original method is overly smooth, losing the original tendency to move. The right picture shows the feature-centric results. The left picture is over-cropped. In order to preserve the trend of the original EMD motion signal, extreme points of the original motion signal are defined as features. To maintain central characteristics for the EMD signal, the characteristic-centered modal equation is as follows:

here ω is_tRepresenting 60 adjacent frames. The invention introduces a Gaussian function G_t() And is provided with G_t() The standard deviation was 10. S_tRepresenting the original value of frame t without the feature center EMD.

Frames representing featureless central EMD

The original value of (a). S_tRepresenting the original value of frame t without the feature center EMD.

Representing the optimized value of frame t.

Representing frames

The optimum value of (c). As in figure 4 (lower right),

representing the value of frame t with a characteristic centrality EMD. This value ensures that the new path maintains the trend of the original path while successfully suppressing high frequency jitter and low frequency jitter of the original path.

4. Regional extrapolation and interpolation

In order to fill the blank caused by the translation, rotation and stretching of the video frame, a video extrapolation method based on an adaptive time domain algorithm and an interpolation method based on cubic spline interpolation are adopted to realize the extrapolation of the missing area and the interpolation of the overlapping area. Library images are selected by an adaptive time domain method and deformed by an As-Similar-As-Possible algorithm. According to the warped image of the adaptive temporal method of the present invention, the present invention extrapolates all video frames. Specifically, the features of the scale-invariant feature transform (SIFT) between the frame t and its neighbors are first detected and matched. And then calculating the matching rate by the number of the matched features and the total number of the features. As shown in fig. 5, the percentage is a value of the matching rate. As the matching rate decreases, the transformed library image may be distorted. Therefore, the threshold is set to 65%, and video frames having a matching rate of more than 65% are selected as library images. The adaptive time domain range is [ t-E +1, t + E ] in FIG. 5. The extrapolated overlapping part may have uneven transition, and is performed by a cubic spline interpolation method. Fig. 6 shows the interpolated gray value of the unknown pixel (x, y), which can be solved by a weighted interpolation of sixteen gray values in its vicinity.

The above examples are provided for the purpose of describing the present invention only, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.

Claims

1. A joint optimization method for space-time consistency and feature center EMD adaptive video stabilization is characterized by comprising the following steps:

extracting image Scale Invariance (SIFT) feature points by an SIFT method, performing SIFT feature matching, deploying uniform grids on an image, obtaining saliency vectors of a plurality of grids, and performing space structure consistency deformation on the basis of protecting the saliency vectors of the image grids by taking the uniform grids as a reference to obtain a deformed image frame sequence;

starting from the viewpoint position, re-acquiring an SIFT feature set of each image frame obtained by deformation in the step (1), constructing a space structure matrix based on SIFT, extracting rotation, translation and scaling motion information of the space structure matrix, constructing an original lens motion signal, decomposing and generating an eigenmode function through an EMD (empirical mode decomposition) algorithm, adaptively generating anisotropic coefficients of all the eigenmode functions according to an adaptive eigenmode function optimization algorithm, and acquiring a new lens adaptive motion signal through a weighted summation algorithm of the eigenmode function and the anisotropic coefficients;

step (3) taking the lens adaptive motion signal obtained by calculation in the step (2) as a new input signal, weighting to obtain a new characteristic center motion signal according to a characteristic center algorithm, and protecting the original signal motion trend of the lens by adopting a weighting algorithm based on a Gaussian function while further inhibiting jitter components;

step (4), based on the characteristic central motion signal obtained by calculation in step (3), further generating a new stable video, namely an image frame sequence, compensating blanks formed by frame translation, rotation and stretching of the video, performing extrapolation of a missing region based on an adaptive time domain algorithm, and performing interpolation of an overlapping region based on a cubic spline interpolation algorithm;

the method for obtaining the deformed image frame sequence by performing consistent deformation of the spatial structure on the basis of protecting the saliency vector of the image grid in the step (1) is realized as follows:

(17) taking a uniform grid as a reference, performing space structure consistency deformation on the basis of protecting a saliency vector of an image grid, preferentially protecting an area with a grid saliency vector value larger than a median value, and concentrating distortion caused by space structure consistency deformation in an area with a grid saliency vector value smaller than the median value to obtain a deformed image frame sequence;

the self-adaptive eigenmode function optimization algorithm in the step (2) is realized as follows:

(24) generating the anisotropic coefficients of all the eigenmode functions in a self-adaptive mode function optimization algorithm, and acquiring a new lens self-adaptive motion signal through a weighted summation algorithm of the eigenmode functions and the anisotropic coefficients thereof;

the feature center algorithm in the step (3) is realized as follows:

(31) taking the lens self-adaptive motion signal as a new input signal, and solving a Gaussian transformation signal of the new input signal through a Gaussian function;

(32) according to the characteristic center algorithm, weighting processing is carried out on the Gaussian transformation signal and the original motion signal of the lens, a new characteristic center motion signal is obtained, and the motion trend of the original signal of the lens is protected as far as possible while the jitter component is further suppressed;

the adaptive time domain algorithm in the step (4) is realized as follows:

(42) in order to complement the blank formed by translation, rotation and stretching of the video frame, based on the adaptive time domain algorithm and the similarity transformation algorithm, the available library image, namely the extrapolated image, is adaptively selected and extrapolated, and the extrapolation of the missing area is executed;