CN112001860A

CN112001860A - Video debounce algorithm based on content-aware blocking strategy

Info

Publication number: CN112001860A
Application number: CN202010810101.0A
Authority: CN
Inventors: 凌强; 赵敏达; 王健; 李峰
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2020-08-13
Filing date: 2020-08-13
Publication date: 2020-11-27

Abstract

The invention provides a video debounce algorithm based on a content-aware blocking strategy, which comprises the following steps: step1, extracting a characteristic track from the video, obtaining the characteristic point distribution of each frame according to the characteristic track, and performing triangle segmentation according to the characteristic points; step2, based on the track smooth constraint, the interframe similarity transformation constraint, the intraframe similarity transformation constraint and the regular constraint, and solving the stable position of the characteristic point through self-adaptive weight setting, and further solving the stable position of the edge control point through the interframe similarity transformation constraint and the intraframe similarity transformation constraint according to the obtained stable position of the characteristic point; step 3. mapping from a dithered view to a stabilized view is established based on the affine transformed frame-by-frame image. According to the invention, each frame in the video is divided into triangular mesh structures with different numbers and sizes according to the distribution of the track points, and the triangular meshes are utilized to carry out inter-frame motion estimation and smoothing processing, so that a more stable effect can be generated, and the phenomenon of local distortion in the generated video is avoided.

Description

Video debounce algorithm based on content-aware blocking strategy

Technical Field

The invention relates to the technical field of computer vision and video debouncing, in particular to a video debouncing algorithm based on a content-aware blocking strategy.

Background

In recent years, as the video debounce problem is studied more and more algorithms are proposed. Despite the numerous correlation algorithms, existing algorithms follow the following three steps: camera motion estimation, camera motion smoothing, and mapping of a shake perspective to a stable perspective. We can classify the debounce algorithm into a 2D algorithm, a 3D algorithm, and a 2.5D algorithm according to the method used in the above three steps.

The 2D algorithm refers to a method of representing inter-frame motion of a video frame using an inter-frame transformation matrix and smoothing camera motion by smoothing a series of inter-frame transformation matrices. While gaussian low-pass filtering, particle filtering, regularization, etc. are typically used when smoothing camera motion. Grundmann et al achieve smoothing of the camera path by constraining the derivatives of the motion changes between frames. Joshi et al use the video frame sequence in a certain neighborhood of the current frame to perform inter-frame feature point matching to obtain the maximum interior point set that fits the video frame sequence, thereby excluding the influence of part of the foreground region on jitter estimation. Liu et al, using a warping transform based on content preservation to divide the video frame into a grid-like structure and perform camera motion estimation and smoothing locally, can more robustly cope with scenes with discontinuous depth-of-field transformation.

The 3D algorithm is specially designed for the video of a three-dimensional scene, so that the video de-jittering problem of a scene with large parallax can be better solved. The method adopts an SFM method to recover the 3D attitude matrix of the camera, and then smoothes the 3D attitude matrix sequence to eliminate jitter. Liu et al reconstructs the three-dimensional motion pattern of the captured scene and employs a mesh-based warping transformation for generation of stable frames. Zhang et al performs estimation and smoothing of camera motion by establishing a smoothness-based optimization function.

In recent years, people design 2.5D algorithms by combining the advantages of the 2D algorithm and the 3D algorithm, firstly extract the characteristic tracks of the jitter video, then smooth the characteristic tracks through various algorithms, and finally realize the mapping from the jitter frame to the stable frame by utilizing the corresponding relation of the jitter characteristic tracks and the smooth characteristic tracks in each frame. Lee et al extract feature trajectories from the dithered video and design optimization functions to solve for smooth trajectories at stable viewing angles. The above approach does not perform well when processing video containing large foreground and parallax scenes. Liu et al impose well-known subspace constraints on the feature trajectories for motion estimation and smoothing. In order to ensure the performance of a video jitter removal algorithm in a scene with a large foreground, Ling et al propose an algorithm for separating foreground and background characteristic tracks and removing jitter of a video based on a feedback strategy.

As can be seen, the conventional method divides the entire video frame into several rectangles of fixed size, and then estimates a mapping relationship from a local jittered view to a stable view for each small block. This approach has the obvious drawback of ignoring the content of the video and considering all regions of the full picture equally. For example, for areas lacking useful information, such as sky, roads, and rich content areas, such as crowd, the method divides them into rectangular areas of the same size and performs an estimation of the transformation matrix within each rectangle. For the former region, such a blocking operation may be unnecessary since the same motion pattern may be satisfied over a larger range, but for the latter region, such a blocking operation is not fine enough since there is a more complicated motion pattern and a discontinuous disparity transform.

In conclusion, the method has poor applicability to scenes with large foreground shielding, complex parallax change and the like, and has poor de-jittering capability to videos of complex scenes.

Disclosure of Invention

In order to solve the technical problem, the invention adaptively blocks the video frame according to the scene content, simultaneously imposes constraints on the foreground and the background, and performs camera motion estimation by using the full-image information, thereby enhancing the jitter removal capability of the video containing the complex scene. The invention provides a video debounce algorithm based on a content-aware blocking strategy. Unlike existing de-jittering algorithms based on fixed blocking strategies, the present algorithm considers video content and adaptively partitions the video frame into blocks accordingly. And carrying out Delaunay triangulation on the feature points of the feature track in each frame to realize the self-adaptive block partitioning strategy and realize different partitioning results in each frame. By solving a two-stage optimization problem, the position of each triangle in the triangular mesh segmentation under the stable visual angle can be obtained, and the mapping from the jittering visual angle to the stable visual angle is carried out according to the position. The algorithm proposed by the invention does not distinguish foreground and background feature tracks any more and uses all feature tracks to estimate and smooth the camera motion. In order to further improve the robustness of the algorithm, two adaptive weight setting strategies are provided, and the performance of scenes with large foreground and parallax change is obviously improved.

The technical scheme of the invention is as follows: a video de-jittering algorithm based on a content-aware blocking strategy comprises the following steps:

and Step1, extracting a characteristic track from the video, obtaining the characteristic point distribution of each frame according to the characteristic track, and performing triangle segmentation according to the characteristic points.

And Step2, solving the stable position of the characteristic point through self-adaptive weight setting based on trajectory smoothing constraint, inter-frame similarity transformation constraint, intra-frame similarity transformation constraint and regular constraint. And solving the stable position of the edge control point according to the obtained stable position of the characteristic point, the interframe similarity transformation constraint and the intraframe similarity transformation constraint.

Step3 mapping of the frame-by-frame image from a dithered view to a stabilized view based on affine transformation.

Further, in the video debounce algorithm based on the content-aware blocking strategy, the feature trajectory extraction method in Step1 is as follows:

the method comprises the steps of extracting feature points from video frames by using a KLT algorithm and tracking to generate feature tracks, dividing the video frames into grids of 10x10 in size in order to avoid all the feature tracks from gathering in a middle region, extracting 200 corner points by using a uniform threshold, and ensuring that at least one feature point can be detected by reducing the threshold for a region without corner points.

Further, in the video debounce algorithm based on the content-aware blocking strategy, a delaunay algorithm is adopted for triangulation in Step 1.

Using standard Delaunay triangulation method to pair M_tPerforming segmentation and generating triangular mesh segmentation result

Wherein K_tRepresents Q_tThe number of medium triangles. M_tAnd Q_tAre respectively expressed as

And

to find the result at a stable viewing angle for these edge regions, 10 control points are set on each of the four sides of the video frame and defined as E for the set of t-th frames_t＝{E_t,1,...,E_t,36}. We then base the video frame on the new set of points M_t,E_tDelaunay triangulation operation is performed. The segmentation result is represented as

Wherein

Referred to as "interior triangles" meaning vertices all made up of M_tThe point in (1) constitutes a triangle.

Referred to as an "outer triangle," denotes a triangle whose vertices contain at least one control point. Q_t、B_tThe position under a stable viewing angle is represented as

Further, in the video debounce algorithm based on the content-aware blocking strategy, the Step of solving the stable position of the feature point based on the trajectory smoothing constraint, the inter-frame similarity transformation constraint, the intra-frame similarity transformation constraint and the regularization constraint in Step2 is as follows:

the method solves an optimization problem comprising three constraints on the characteristic track

Is estimated, the three constraints are:

(1) a characteristic track P is given_iPosition at stable viewing angle

Should change slowly between frames.

(2) In the t-th frame, the frame,

should each stabilized triangle maintain and Q_tThe corresponding original triangles in (a) are similar.

(3) In the t-th frame, the frame,

should maintain the transformation relationship between similar triangles in (1) and (Q)_tThe transformation relations between the corresponding triangles in the tree are consistent.

Based on the above constraints, an optimization function is designed that is minimized as follows:

wherein:

(1)

is a "smoothing term" used to smooth the feature trajectory by constraining the first and second derivatives of the feature trajectory to smooth the feature trajectory.

The definition is as follows:

where α and β are weighting coefficients, α ═ 2 and β ═ 10.

(2)

Is an "inter-frame similarity transformation constraint term" that ensures that the video frame after transformation and the original video frame remain similarly transformed, for Q_tK in (1)_tA Delaunay triangle whose stable viewing angle is defined as

Q_tAnd

the vertex of the ith triangle is defined as

And

is required and Q_tThe corresponding triangles in (a) are similar.

The definition is as follows:

where γ is a weight coefficient, γ is 10,

the definition is as follows:

wherein a is_BAnd b_BThis can be obtained by the following formula:

(3)

the method is an intra-frame similarity transformation constraint item which is used for ensuring that the transformation relation between local areas in a video frame obtained by transformation is similar to the transformation relation between original video frames.

The definition is as follows:

wherein:

is a weight coefficient, 20, phi (i) denotes all the adjoining triangles of triangle i, a, b, c can be obtained by the following formula:

a+b+c＝1

(4)

is a "regularization term" that limits the stabilized feature trajectory from remaining positionally close to the original feature trajectory, thereby avoiding excessive transformation resulting in substantial loss of video content. The definition is as follows:

further, in the video debounce algorithm based on the content-aware blocking strategy, the Step of solving the stable position of the edge control point in Step2 is as follows:

the following optimization problem is designed to find the positions of these control points at a stable viewing angle:

wherein:

(1)

is an interframe similarity transformation constraint term defined as follows:

wherein

Different optimization operations for the feature points and the control points are represented, and are specifically defined as follows:

representing the desired stable position, the solution is as follows:

indicating the stable position of the feature point. Gamma is a weight coefficient with a value equal to 1.

(2)

Is an intra-frame similarity transformation constraint term defined as follows:

wherein:

wherein a, b, c are solved by the following equations:

a+b+c＝1

indicating the stable position of the feature point.

Further, in the video debounce algorithm based on the content-aware blocking strategy, the adaptive weight setting strategy in Step2 is as follows:

(1) time-adaptive based weight setting

The term expects that the change between adjacent frames of the feature trajectory tends to 0. However, rapid camera motion often results in large frame-to-frame variations in the feature trajectories, which can lead to collapse of the video content. I.e., alpha, should be appropriately reduced when fast camera motion is detected, so improvements

In the form:

wherein the sigma is 10, the total weight of the powder,

and

representing the velocities in the x and y directions, respectively, is calculated as follows:

(2) weight setting based on spatial adaptation

Since handheld devices capture videos of real-world scenes, there are inevitably discontinuous depth variations in these videos and inconsistencies in local motion due to foreground occlusion. This may create a distortion phenomenon in the stabilized video frame. In order to solve the problems, whether a dynamic foreground object exists in each triangle divided by a triangular mesh is judged, and then a triangular area with the foreground object is added

The weight of the term. For the characteristic track i, an overdetermined equation is adopted to calculate C_t,iAnd C_t-1,iA transformation matrix between, wherein H_t,iThe definition is as follows:

the parameters are solved by the following equations:

the above equation is solved by the least square method, and the above formula is expressed asA_t,iβ_t,i＝B_t,iTo obtain

Residual is | | A_t,iβ-B_t,i||²The residual is specifically defined as:

wherein theta is_t,iIs a spatial scale defined as:

where ρ ═ (W/τ + H/τ)/2, and W and H denote the width and height of the video frame. τ is used to control the scale size of the partitions in the normalization, and is set to a value of 10 and remains equal to the number of control points on each side of the video frame.

Finally, it is

Is defined as:

wherein

Representing the set of three vertices of the triangle p.

By finding all P_i,tAnd

and performing affine transformation and mapping the jitter frame to the stable frame.

Compared with the prior art, the invention has the advantages that:

(1) the invention adaptively blocks the video frame according to the scene content, simultaneously imposes constraints on the foreground and the background, and utilizes the full-image information to carry out camera motion estimation, thereby enhancing the jitter removal capability of the video containing the complex scene.

(2) Further, adaptive weight setting based on time and space further improves the robustness of the algorithm.

(3) Further, the calculation of the stable position of the edge control point improves the stability of the image of the edge area.

Drawings

Fig. 1 is a flow chart of a video debounce algorithm method based on a content-aware blocking strategy according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, rather than all embodiments, and all other embodiments obtained by a person skilled in the art based on the embodiments of the present invention belong to the protection scope of the present invention without creative efforts.

As shown in fig. 1, according to an embodiment of the present invention, a video de-jittering algorithm method based on content-aware blocking strategy is provided, which includes the following steps:

step1, extracting and triangulating characteristic track

The method comprises the steps of extracting feature points from video frames by using a KLT algorithm and tracking to generate feature tracks, in order to avoid gathering all the feature tracks in a middle region, firstly dividing the video frames into grids of 10x10, and extracting 200 corner points in total by using a uniform threshold (considering the calculation amount, preferably, only 200 corner points are extracted), and for regions without corner points, ensuring that at least one feature point can be detected by reducing the threshold. And then generating a feature track by using the extracted feature points.

Suppose that N feature tracks are extracted in a jittered video, which is defined as

These trajectories may be fromBackground or foreground regions. For a feature trajectory i (i e [1, N ]]) The first and last frames of its occurrence are defined as s_iAnd e_i. Defining a characteristic track i at the t(s)_i≤t≤e_i) The position of the frame is P_i,t. The set of the feature points of which all the feature tracks appear in the t-th frame is M_t＝{P_i,t|i∈[1,N],t∈[s_i,e_i]}。

And

to find the result at stable viewing angles of these edge regions, 10 control points, for a total of 36 control points, are set on each of the four sides of the video frame and defined as E in the t-th frame set_t＝{E_t,1,...,E_t,36}. The video frame is then based on the new set of points M_t,E_tDelaunay triangulation operation is performed. The segmentation result is represented as

Wherein

Referred to as an "outer triangle," denotes a triangle whose vertices contain at least one control point. Q_t、B_tPosition under stable viewing angle is byIs shown as

Step2, smoothing the characteristic track by using an optimization function

2.1 calculation of Stable positions of feature points

The step is to solve an optimization problem pair characteristic track containing three constraints

Is estimated, the three constraints are:

(1) a characteristic track P is given_iPosition at stable viewing angle

Should change slowly between frames.

(2) In the t-th frame, the frame,

(3) In the t-th frame, the frame,

wherein:

(1)

The definition is as follows:

where α and β are weighting coefficients, α ═ 2 and β ═ 10.

(2)

Q_tAnd

the vertex of the ith triangle is defined as

And

is required and Q_tThe corresponding triangles in (a) are similar.

The definition is as follows:

where γ is a weight coefficient, γ is 10,

the definition is as follows:

wherein a is_BAnd b_BThis can be obtained by the following formula:

(3)

The definition is as follows:

wherein:

where is the weight factor, 20, phi (i) denotes all the adjoining triangles of triangle i, a, b, c can be obtained by the following formula:

a+b+c＝1

(4)

the adaptive weight setting strategy is as follows:

(1) time-adaptive based weight setting

In the form:

wherein the sigma is 10, the total weight of the powder,

and

the calculation method of (c) is as follows:

(2) weight setting based on spatial adaptation

The weight of the term. For the characteristic trajectory i, C is calculated by using the overdetermined equation_t,iAnd C_t-1,iA transformation matrix between, wherein H_t,iThe definition is as follows:

the parameters are solved by the following equations:

the above equation is solved by the least square method, and the above formula is expressed as A_t,iβ_t,i＝B_t,iTo obtain

Residual is | | A_t,iβ-B_t,i||²The residual is specifically defined as:

wherein theta is_t,iIs a spatial scale defined as:

Finally, it is

Is defined as:

wherein

Representing the set of three vertices of the triangle p.

2.2 calculation of Stable position of control Point

wherein:

(1)

is an interframe similarity transformation constraint term defined as follows:

wherein:

indicating the stable position of the feature point.

(2)

Is an intra-frame similarity transformation constraint term defined as follows:

wherein:

wherein a, b, c are solved by the following equations:

a+b+c＝1

step3, affine transformation from jittering visual angle to stable visual angle

And performing homography matrix calculation according to the characteristic point coordinates under the dithering visual angle of the t-th frame and the estimated characteristic point coordinates under the stable visual angle, and performing affine transformation.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A video debounce algorithm based on a content-aware blocking strategy is characterized by comprising the following steps:

step1, extracting a characteristic track from the video, obtaining the characteristic point distribution of each frame according to the characteristic track, and performing triangle segmentation according to the characteristic points;

step2, based on the track smooth constraint, the interframe similarity transformation constraint, the intraframe similarity transformation constraint and the regular constraint, and solving the stable position of the characteristic point through self-adaptive weight setting, and further solving the stable position of the edge control point through the interframe similarity transformation constraint and the intraframe similarity transformation constraint according to the obtained stable position of the characteristic point;

step 3. mapping from a dithered view to a stabilized view is established based on the affine transformed frame-by-frame image.

2. The video de-jittering algorithm based on content-aware blocking strategy according to claim 1,

the characteristic track extraction method in Step1 is as follows:

extracting feature points from a video frame by using a KLT algorithm and tracking to generate a feature track, firstly dividing the video frame into grids of 10x10, extracting 200 corner points by using a unified threshold, and ensuring that at least one feature point can be detected by reducing the threshold for an area without corner points;

These trajectories come from the background or foreground region, for a characteristic trajectory i, i ∈ [1, N ]]The first and last frames of its occurrence are defined as s_iAnd e_iDefining the position of the characteristic track i in the t frame as P_i,tWherein s is_i≤t≤e_iThe set of feature points of which all feature tracks appear in the t-th frame is M_t＝{P_i,t|i∈[1,N],t∈[s_i,e_i]}。

3. The video de-jittering algorithm based on content-aware blocking strategy according to claim 1, wherein said triangulation Step in Step1 is as follows:

set M of feature points by using standard Delaunay triangulation method_tPerforming segmentation and generating triangular mesh segmentation result

Wherein K_tRepresents Q_tNumber of medium triangles, M_tAnd Q_tAre respectively expressed as

And

to find the result at a stable view angle of the edge region, four at the video frameSetting 10 control points on each side of the edges, and defining the set of the control points in the t-th frame as E_t＝{E_t,1,...,E_t,36Then the video frame is put according to the new point set { M }_t,E_tCarry out Delaunay triangulation operation, the segmentation result is expressed as

Wherein

Referred to as "interior triangles" meaning vertices all made up of M_tThe point(s) in (b) constitute a triangle,

referred to as "outer triangles" and representing triangles having vertices containing at least one control point, the number L_t，Q_t、B_tThe position under a stable viewing angle is represented as

4. The video de-jittering algorithm based on content-aware blocking strategy according to claim 1, wherein said Step2 is based on trajectory smoothing constraint, inter-frame similarity transformation constraint, intra-frame similarity transformation constraint and regularization constraint, and the Step of solving stable positions of feature points is as follows:

by solving an optimization problem pair feature trajectory containing three constraints

Is estimated, the three constraints are:

(A) a characteristic track P is given_iPosition at stable viewing angle

Should change slowly between frames;

(B) in the t-th frame, the frame,

should each stabilized triangle maintain and Q_tThe corresponding original triangles in (1) are similar;

(C) in the t-th frame, the frame,

should maintain the transformation relation between adjacent triangles_tThe transformation relations between the corresponding adjacent triangles in the triangle are consistent;

wherein:

(1)

is a 'smoothing term' used to smooth the feature trajectory by constraining the first and second derivatives of the feature trajectory to smooth the feature trajectory;

the definition is as follows:

wherein α and β are weighting coefficients, α ═ 2, β ═ 10;

(2)

is an "inter-frame similarity transformation constraint term" that ensures that the video frame after transformation and the original video frame remain similarly transformed, for Q_tK in (1)_tDelaunay triangle, the desired view angle being defined as

Stable viewing angle is defined as

Q_tAnd

the vertex of the ith triangle is defined as

And

is required and Q_tThe corresponding triangles in (a) are similar,

the definition is as follows:

γ is a weight coefficient, γ ═ 10, where:

wherein a is_BAnd b_BIs the coefficient of the corresponding vector edge, solved by the following equation:

(3)

the video frame is an intra-frame similarity transformation constraint item which is used for ensuring that the transformation relation between local areas in the video frame obtained by transformation is similar to the transformation relation between the local areas in the original video frame;

the definition is as follows:

where γ is a weight coefficient, φ (i) represents all adjacent triangles of triangle i,

is the expected stable position of the intra-frame similarity transformation, defined as follows:

where is the weighting factor, 20, a, b, c are the coefficients corresponding to the three vertices, which is obtained by the following formula:

a+b+c＝1

(4)

is a "regularization term" used to limit the stabilized feature trajectory from remaining positionally close to the original feature trajectory, thereby avoiding a large loss of video content due to excessive transformation, defined as follows:

5. the video de-jittering algorithm based on content-aware blocking strategy according to claim 4, wherein the Step of solving the stable position of the edge control point in Step2 is as follows:

wherein:

(1)

is an interframe similarity transformation constraint term defined as follows:

wherein

representing the desired stable position, the solution is as follows:

representing the position where the feature point is stable, gamma is a weight coefficient with a value equal to 1.

(2)

Is an intra-frame similarity transformation constraint term, boundary B_tThe triangles in the frame are required to keep consistent transformation relation between adjacent triangles in the stable frame and the relation between corresponding triangles in the original video frame; the adjacent triangle j of the triangle i may belong to B_tAnd Q_tTriangle j with B_t,iDifferent vertices may belong to B_tOr Q_tA 1 to B_tAnd Q_tIs defined as

In the t-th frame, B_t,iIs defined as BQ_t,j，B_t,iAnd BQ_t,jThe different vertices in the set are defined as

And

using linear texture mapping

Is represented as BQ_t,jCombinations of vertices, i.e.

A triangle at a stable viewing angle is shown,

has a vertex of

The definition is as follows:

wherein

Representing different operations according to the control point and the feature point to which the vertex of the stabilized triangle belongs, is defined as follows:

wherein a, b, c are solved by the following equations:

a+b+c＝1。

6. the video de-jittering algorithm based on content-aware blocking strategy according to claim 1, wherein said adaptive weight setting method in Step2 adopts two methods as follows:

(1) time-adaptive based weight setting

The change between adjacent frames of the term expected feature trajectory tends to 0, i.e. alpha should be reduced appropriately when fast camera motion is detected, improvement

In the form:

wherein the sigma is 10, the total weight of the powder,

and

(2) weight setting based on spatial adaptation

For a video shot by a handheld device in a real scene, discontinuous depth change and inconsistency of local motion caused by foreground occlusion exist, and a distortion phenomenon is generated in a stabilized video frame, so that whether a dynamic foreground object exists in each triangle segmented by a triangular mesh is judged firstly, and then a triangular area with the foreground object is added

The weight of the item; for the characteristic trajectory i, C is calculated by using the overdetermined equation_t,iAnd C_t-1,iA transformation matrix between, wherein H_t,iThe definition is as follows:

the parameters are solved by the following equations:

Residual is | | A_t,iβ-B_t,i||²The residual is specifically defined as:

wherein theta is_t,iIs a spatial scale defined as:

where ρ ═ (W/τ + H/τ)/2, W and H denote the width and height of the video frame; tau is used for controlling the scale size of the blocks in the normalization, the value of tau is set to 10, and the tau is equal to the number of control points on each edge of the video frame;

finally, it is

Is defined as:

wherein

Representing the set of three vertices of the triangle p.

7. The video de-jittering algorithm based on content-aware blocking strategy according to claim 1, wherein said Step3 specifically comprises: