CN107341815B - Violent motion detection method based on multi-view stereoscopic vision scene stream - Google Patents

Violent motion detection method based on multi-view stereoscopic vision scene stream

Info

Publication number
CN107341815B
CN107341815B (application CN201710404056.7A)
Authority
CN
China
Prior art keywords
motion
scene
flow
dimensional
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710404056.7A
Other languages
Chinese (zh)
Other versions
CN107341815A (en)
Inventor
项学智
肖德广
宋凯
翟明亮
吕宁
尹力
郭鑫立
王帅
张荣芳
于泽婷
张玉琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN201710404056.7A priority Critical patent/CN107341815B/en
Publication of CN107341815A publication Critical patent/CN107341815A/en
Application granted granted Critical
Publication of CN107341815B publication Critical patent/CN107341815B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T7/00 Image analysis > G06T7/20 Analysis of motion
        • G06T7/215 Motion-based segmentation
        • G06T7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving models
        • G06T7/292 Multi-camera tracking

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a violent motion detection method based on multi-view stereoscopic vision scene flow. Step one: acquire multiple groups of image sequences with a calibrated multi-view camera. Step two: preprocess the image sequences. Step three: design the data term of the scene flow energy functional. Step four: design the smoothing term of the scene flow energy functional. Step five: optimize the energy functional, computing from the lowest-resolution image of the pyramid obtained in step two up to the full-resolution image. Step six: cluster the scene flow motion regions. Step seven: construct a motion-direction dispersion evaluation model and judge whether the motion is violent. Step eight: construct a kinetic energy evaluation model for the motion region. Step nine: set thresholds and trigger an alarm when n consecutive frames meet the evaluation conditions. The invention adopts scene flow estimation based on multi-view stereo vision; multiple groups of image sequences of the same scene are acquired by a calibrated multi-view camera. Violent motion can be detected efficiently using the 3-dimensional scene flow.

Description

Violent motion detection method based on multi-view stereoscopic vision scene stream
Technical Field
The invention relates to a method for detecting violent movement, in particular to a method for detecting violent movement based on multi-view stereoscopic vision scene flow.
Background
With the rapid development of information technology, and especially the breakthroughs in computer vision and artificial intelligence, much of the work that once had to be done manually can now be done by computer. In video surveillance, for example, the most common mode of operation is for a person to watch the monitoring display and react to abnormal events as they occur. Because no one can stay focused for long on every event occurring in a video, missed and false alarms inevitably happen. It is therefore very important to process video frames with a computer and determine automatically whether an abnormal event has occurred.
A video surveillance camera is typically fixed in position, so the task is object detection against a static background. The classical methods for object detection in a static background are background subtraction, inter-frame differencing, and optical flow. Background subtraction has a small computational cost and can update its background model as the background changes, but it is strongly affected by background variation. Inter-frame differencing is also computationally cheap, but performs poorly in stability and robustness. Both methods struggle to achieve good results for violent motion detection. The optical flow method computes an optical flow field from two adjacent frames; the computed field is 2-dimensional, i.e. it contains only planar motion information, and depth information is lost. Without depth information, violent motion is difficult to evaluate and judge, and false alarms are easily triggered.
The scene flow contains 3-dimensional motion information and 3-dimensional surface depth information; unlike the three methods above, it describes the true motion of object surfaces. The scene flow therefore provides enough information to judge whether motion is violent, i.e. it can effectively solve the problem of violent motion judgment.
Disclosure of Invention
The invention aims to provide a method for detecting violent motion based on multi-view stereoscopic scene flow, which has strong detection adaptability.
The purpose of the invention is realized as follows:
the method comprises the following steps: acquiring a plurality of groups of image sequences by using a calibrated multi-view camera;
step two: preprocessing an image sequence, performing multi-resolution down-sampling on the image sequence by adopting an image pyramid, performing coordinate system conversion according to internal and external parameters of a camera, and establishing a relation between an image coordinate system and a camera coordinate system;
step three: designing the scene flow energy functional data term: the data term directly fuses 3-dimensional scene flow information and 3-dimensional surface depth information, adopts a structure-tensor constancy assumption, and introduces a robust penalty function;
step four: designing the scene flow energy functional smoothing term: the smoothing term adopts flow-driven anisotropic smoothing that simultaneously constrains the 3-dimensional flow field V(u, v, w) and the 3-dimensional surface depth Z, and also introduces a robust penalty function;
step five: optimizing and solving the energy functional: the energy functional is minimized to obtain the Euler-Lagrange equations, which are then solved; computation starts from the lowest-resolution image of the pyramid obtained in step two and proceeds coarse-to-fine until the full-resolution image is reached;
step six: clustering the motion areas of the scene flow, clustering the motion areas by using a clustering algorithm, separating the motion areas from background areas, and removing the background areas;
step seven: constructing a motion direction discrete degree evaluation model, and judging whether the motion is violent;
step eight: constructing a kinetic energy size evaluation model of a motion area;
step nine: and setting a threshold value, and triggering an alarm when n continuous frames meet the evaluation condition.
The present invention may further comprise:
1. In step one, multiple groups of image sequences are acquired with a calibrated multi-view camera, from which the scene flow V(u, v, w) and the depth information Z are subsequently obtained.
2. In step two, in establishing the relationship between the image coordinate system and the camera coordinate system, the relationship between the 2-dimensional optical flow and the 3-dimensional scene flow is established as

[formula image in original]

where (u, v) is the 2-dimensional optical flow and (u0, v0) are the optical center coordinates.
3. The design of the data term described in step three specifically includes using the constancy assumption based on the structure tensor:

the structure-tensor constancy assumption for the N cameras between times t and t+1 is defined as:

[formula image in original]

the structure-tensor constancy assumption between the reference camera C0 and the other N-1 cameras at time t is defined as:

[formula image in original]

the structure-tensor constancy assumption between the reference camera C0 and the other N-1 cameras at time t+1 is defined as:

[formula image in original]

In the above data-term formulas, a robust penalty function with parameter 0.0001 is introduced so that the term approximates the L1 norm. A binary occlusion mask, obtained by an occlusion boundary region detection technique for stereo images, is also used: it is 0 when a pixel is an occlusion point and 1 for non-occluded points. I_T is the local structure tensor of the 2-dimensional image, whose formula is given as an image in the original.
4. The design of the scene flow energy functional smoothing term described in step four specifically includes directly regularizing the 3-dimensional flow field and the depth information and designing a flow-driven anisotropic smoothness assumption, with S_m(V) and S_d(Z) constraining the 3-dimensional flow field and the depth information respectively; the smoothing terms are designed as:

S_m(V) = ψ(|u_x|²) + ψ(|u_y|²) + ψ(|v_x|²) + ψ(|v_y|²) + ψ(|w_x|²) + ψ(|w_y|²)

S_d(Z) = ψ(|Z_x|²) + ψ(|Z_y|²)

where the subscripts x and y denote partial derivatives of u(x, y), v(x, y), w(x, y) and Z(x, y); the overall scene flow estimation energy functional is

[formula image in original]
5. The clustering of the scene flow motion regions in step six specifically includes clustering the scene flow V(u, v, w) obtained in step five with a clustering algorithm and separating the background from the motion regions; the feature information of the scene flow specifically includes: the three components u, v, w of each point's scene flow; each point's scene flow module value

|V| = √(u² + v² + w²);

and the included angles θx, θy, θz between each point's scene flow and the xoy, xoz and yoz planes; each point is represented by the 7-dimensional feature vector V_{i,j} = (u, v, w, |V|, θx, θy, θz);

the specific process is as follows: the input is the similarity matrix S_{N×N} formed by the pairwise similarities of all N data points; initially all samples are treated as potential cluster centers; then, to find suitable cluster centers x_k, the attraction degree r(i, k) and the reliability degree a(i, k) are continuously collected from the data samples according to the formulas

r(i, k) = s(i, k) − max_{k′≠k} {a(i, k′) + s(i, k′)}

a(i, k) = min{0, r(k, k) + Σ_{i′∉{i,k}} max{0, r(i′, k)}},  i ≠ k

which are iterated to update the attraction and reliability until m cluster center points are generated, where r(i, k) describes how well point k is suited as the cluster center of data point i, and a(i, k) describes how appropriate it is for point i to select point k as its cluster center;

a flag is then set: flag = 1 for a motion region and flag = 0 for a background region; the number of pixels in the motion region is counted as count, and the motion region is taken as the spatial neighborhood over which the sums in the following models run.
6. The construction of the motion-direction dispersion evaluation model described in step seven specifically includes defining the Z axis of the camera coordinate system as the reference vector direction and calculating the included angle φ_{i,j}(t) between each motion vector and the reference direction as

φ_{i,j}(t) = arccos( w_{i,j}(t) / √(u_{i,j}(t)² + v_{i,j}(t)² + w_{i,j}(t)²) )

and then calculating the variance D(φ_{i,j}(t)) of φ_{i,j}(t) over the pixels of each frame's motion region, where φ̄(t) is the average of all the included angles:

D(φ_{i,j}(t)) = (1/count) Σ_{(i,j)} ( φ_{i,j}(t) − φ̄(t) )²
7. The total kinetic energy W(t) of each frame's motion region is calculated from the computed scene flow:

[formula image in original]

and the average kinetic energy of the motion region is calculated from the total kinetic energy of each frame's motion region as

W̄(t) = W(t) / count
8. In step nine, an angle variance threshold φ_th and a kinetic energy threshold W_th are set; when D(φ_{i,j}(t)) > φ_th and W̄(t) > W_th are satisfied for n consecutive frames, the motion is judged to be violent and an alarm is triggered.
The invention adopts scene flow estimation based on multi-view stereo vision; multiple groups of image sequences of the same scene are acquired by a calibrated multi-view camera. The invention obtains the scene flow information of a multi-view scene sequence together with the 3-dimensional surface depth information of the scene, and can effectively detect violent motion using the 3-dimensional scene flow.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 shows a stereo correspondence relationship between image sequences acquired by the multi-view camera.
FIG. 3 is a flow chart of an algorithm for solving a scene flow.
Detailed Description
With reference to fig. 1, the violent motion detection based on multi-view stereoscopic vision scene flow of the present invention mainly comprises the following steps:
s1, a plurality of groups of image sequences are obtained by using a calibrated multi-view camera.
And S2, preprocessing the input image, and performing multi-resolution down-sampling on the image sequence by adopting an image pyramid. And converting a coordinate system according to the internal and external parameters of the camera, and establishing a relation between an image coordinate system and a camera coordinate system.
S3, designing the scene flow energy functional data term. Unlike most previous formulations, which couple optical flow with disparity, the method directly fuses 3-dimensional scene flow information with 3-dimensional surface depth information. The data term adopts the structure-tensor constancy assumption and introduces a robust penalty function.
S4, designing the scene flow energy functional smoothing term. The smoothing term adopts flow-driven anisotropic smoothing that simultaneously constrains the 3-dimensional flow field V(u, v, w) and the 3-dimensional surface depth Z, and a robust penalty function is introduced into the smoothing term.
S5, optimization of the energy functional. In order to solve for the 3-dimensional motion V(u, v, w) and the 3-dimensional surface depth Z, the energy functional must be minimized to obtain the Euler-Lagrange equations, which are then solved. A coarse-to-fine multi-resolution computation scheme is introduced to handle the large displacements present in a scene flow; computation starts from the lowest-resolution image of the pyramid obtained in S2 and proceeds until the full-resolution image is reached.

S6, clustering the scene flow motion regions. The motion regions are clustered with a clustering algorithm, separated from the background region, and the background region is removed, which facilitates the construction of the subsequent violent motion judgment model.
S7, compared with a steady motion state, the motion directions of the 3-dimensional scene flow obtained in the target region are disordered in a violent motion state. Based on this, a motion-direction dispersion evaluation model can be constructed to judge whether the motion is violent.
S8, compared with a steady motion state, the 3-dimensional scene flow values obtained in the target region are larger in a violent motion state. Based on this, a kinetic energy evaluation model of the motion region can be constructed.
S9, corresponding thresholds are set manually, and an alarm is triggered when n consecutive frames meet the evaluation conditions.
The invention will now be described in more detail by way of example with reference to the accompanying drawings.
S1, as shown in fig. 2, an image sequence is acquired with a calibrated multi-view camera. A point in the real scene moves from position P at time t to position P′ at time t+1; the corresponding points of these two positions in the imaging plane of each camera C_i are p_i and p_i′ respectively. The position at time t+1 is

P′ = P + V(u, v, w), i.e. P′ = (X + u, Y + v, Z + w)^T

where V(u, v, w) is the real-world 3-dimensional motion vector: u is the instantaneous velocity in the horizontal direction, v the instantaneous velocity in the vertical direction, and w the instantaneous velocity in the depth direction. The mapping of V(u, v, w) to 2 dimensions is the 2-dimensional optical flow (u, v).
S2, the method directly estimates the 3-dimensional scene flow from multi-view stereo vision: the real-world 3-dimensional motion flow field V(u, v, w) and the 3-dimensional surface depth Z are constrained directly in the energy functional. The scene flow energy functional is based on 2-dimensional plane images, so 3-dimensional space must be mapped to 2-dimensional space through a perspective projection transformation, establishing the mapping relation between the 2-dimensional optical flow and the 3-dimensional scene flow. Let I(x_i, y_i, t) be the pixel of camera C_i's image sequence at time t, and let M_i be the projection matrix of camera C_i. The real coordinate P(X, Y, Z)^T in the camera coordinate system at time t is mapped to the image sequence by the relational expression

(x_i, y_i)^T = [M_i]_{1,2} (X, Y, Z, 1)^T / [M_i]_3 (X, Y, Z, 1)^T    (1)

where M_i is a 3×4 projection matrix, [M_i]_{1,2} denotes its first two rows and [M_i]_3 its third row. The projection matrix is shown in equation (2): C is the camera intrinsic parameter matrix, determined only by the internal structure of the camera, and [R T] is the camera extrinsic parameter matrix, determined by the orientation of the camera relative to the world coordinate system,

M_i = C [R T]    (2)

The scene flow energy functional obtained from this relation has 6 unknowns, P(X, Y, Z)^T and V(u, v, w). As shown in equation (3), a relationship between X, Y and Z can be established, reducing the 6 unknowns to 4; Z and V are then solved from the N pairs of image sequences, where (o_x, o_y) is the camera principal point:

[formula (3): image in original]

The relationship between the 2-dimensional optical flow v(u, v) and the 3-dimensional scene flow V(u, v, w) is shown in equation (4):

[formula (4): image in original]
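Equations (3) and (4) are not reproduced in the source. Under a standard pinhole model they would plausibly take the following form, where f denotes the focal length; f is notation introduced here for illustration, not a symbol defined in the visible text:

```latex
% Plausible reconstruction of equations (3)-(4) under a pinhole model.
% f (focal length) is assumed notation; (o_x, o_y) is the principal point.
\begin{align*}
  X &= \frac{(x - o_x)\,Z}{f}, & Y &= \frac{(y - o_y)\,Z}{f} \tag{3$'$}\\
  u_{2D} &= \frac{f\,u - (x - o_x)\,w}{Z}, &
  v_{2D} &= \frac{f\,v - (y - o_y)\,w}{Z} \tag{4$'$}
\end{align*}
```

Equation (4′) follows by differentiating the projection in (3′) with respect to time, with (u, v, w) = (Ẋ, Ẏ, Ż).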
and performing image pyramid on the obtained N image sequences, performing multi-resolution down-sampling on the images, wherein the sampling factor eta is 0.9, and performing Gaussian filtering on the images obtained in each layer to filter partial noise.
S3, designing the scene flow energy functional data term. The structure-tensor constancy assumption is used in both the spatial and temporal domains, and from it the following equations are derived:

the structure-tensor constancy of the N cameras between times t and t+1 is defined by equation (5):

[formula (5): image in original]

the structure-tensor constancy assumption between the reference camera C0 and the other N-1 cameras at time t is defined by equation (6):

[formula (6): image in original]

the structure-tensor constancy assumption between the reference camera C0 and the other N-1 cameras at time t+1 is defined by equation (7):

[formula (7): image in original]

In the above data-term formulas, a robust penalty function with parameter 0.0001 is used so that the term approximates the TV-L1 norm, which reduces the influence of outliers on the solution of the functional. I_T is the local structure tensor of the 2-dimensional image, as shown in equation (8):

[formula (8): image in original]

A binary occlusion mask is introduced whose role is to ignore occlusion-point pixels: it is 0 when a pixel is an occlusion point and 1 for non-occluded points. It is computed by an occlusion boundary region detection technique for stereo images, adopting an occlusion boundary detection algorithm based on a confidence map, which can effectively detect the occlusion regions between the reference camera C0 and the other cameras.
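Equation (8) is only an image in the source; a common definition of the 2-dimensional local structure tensor, together with the robust penalty described above, is sketched here under the assumption that the tensor is the Gaussian-smoothed outer product of the image gradient:

```python
import cv2
import numpy as np

def robust_penalty(s2, eps=1e-4):
    """psi(s^2) = sqrt(s^2 + eps^2): a differentiable approximation of the
    (TV-)L1 norm; eps = 0.0001 follows the parameter named in the text."""
    return np.sqrt(s2 + eps ** 2)

def structure_tensor(image, sigma=1.0):
    """Per-pixel 2x2 structure tensor J = grad(I) grad(I)^T, smoothed at
    scale sigma (sigma is an assumption; equation (8) is not reproduced)."""
    ix = cv2.Sobel(image, cv2.CV_64F, 1, 0, ksize=3)
    iy = cv2.Sobel(image, cv2.CV_64F, 0, 1, ksize=3)
    jxx = cv2.GaussianBlur(ix * ix, (0, 0), sigma)
    jxy = cv2.GaussianBlur(ix * iy, (0, 0), sigma)
    jyy = cv2.GaussianBlur(iy * iy, (0, 0), sigma)
    return jxx, jxy, jyy   # tensor components at every pixel
```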
S4, designing the scene flow energy functional smoothing term. For the reference camera C0, the depth information Z and the 3-dimensional flow field V(u, v, w) are assumed to be piecewise smooth. The smoothing term directly regularizes the 3-dimensional flow field and the depth information; the flow field is smooth in 3-dimensional space, and a flow-driven anisotropic smoothness assumption is designed to guarantee the smoothness of the scene flow:

S_m(V) = ψ(|u_x|²) + ψ(|u_y|²) + ψ(|v_x|²) + ψ(|v_y|²) + ψ(|w_x|²) + ψ(|w_y|²)    (9)

S_d(Z) = ψ(|Z_x|²) + ψ(|Z_y|²)    (10)

S_m(V) and S_d(Z) constrain the 3-dimensional flow field and the depth information respectively, and anisotropic smoothing based on 3-dimensional flow driving is adopted for the flow field. The entire energy functional can be written as shown in equation (11):

[formula (11): image in original]
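A minimal numerical sketch of the smoothing terms (9) and (10), assuming simple forward differences for the partial derivatives:

```python
import numpy as np

def smoothness_terms(u, v, w, z, eps=1e-4):
    """S_m(V) and S_d(Z): robust-penalized squared spatial derivatives of
    the flow components and the depth (equations (9) and (10))."""
    psi = lambda s2: np.sqrt(s2 + eps ** 2)

    def grads(f):                      # forward differences in x and y
        fx = np.diff(f, axis=1, append=f[:, -1:])
        fy = np.diff(f, axis=0, append=f[-1:, :])
        return fx, fy

    s_m = 0.0
    for comp in (u, v, w):
        cx, cy = grads(comp)
        s_m += psi(cx ** 2).sum() + psi(cy ** 2).sum()
    zx, zy = grads(z)
    s_d = psi(zx ** 2).sum() + psi(zy ** 2).sum()
    return s_m, s_d
```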
s5, a solution scheme of the scene flow is adopted, namely the values of Z and V when the energy functional is minimized to the maximum extent are found. The common approach is to minimize the energy functional to obtain the Euler-Lagrange equation and then solve the Euler-Lagrange equation. The euler-lagrange equation after minimization of the energy functional can be written as:
Figure BDA0001310506470000079
before minimizing the energy functional, the structural tensor constancy assumption in the data item is abbreviated for simplicity as follows:
Figure BDA00013105064700000710
Δi=IT(p0,t)-IT(pi,t) (14)
Figure BDA00013105064700000711
according to the variation principle, the energy functional E (Z, V) is minimized, the partial derivatives of u and Z are respectively calculated and are equal to 0, and the following Euler-Lagrangian equation can be obtained:
Figure BDA0001310506470000081
Figure BDA0001310506470000082
for v, w minimizes the energy functional to obtain an Eulerian-Lagrangian equation similar to equation (16). The nonlinear problem exists in data items and smoothing items, and the most critical in the process of solving the scene stream is how to avoid trapping in a local minimum to obtain a global optimal solution.
Since violent motion is being detected, large-displacement motion will be present. To handle large displacements, a coarse-to-fine multi-resolution computation scheme is adopted. Using the image pyramid already obtained in S2, the initial value of the scene flow is set to 0; computation starts at the lowest resolution, and the result of each level is used as the initial value for the next resolution, until the full-resolution image is reached. This effectively removes the computational inaccuracy caused by large displacements. The specific solution process is shown in fig. 3, where L is the image pyramid level: only V(u, v, w) is computed when L ≥ K, and both the scene flow V(u, v, w) and Z are computed when 0 < L < K.
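A schematic of this coarse-to-fine strategy; `solve_level` stands in for the per-level Euler-Lagrange solver and is a hypothetical interface, and the upsampling details are assumptions:

```python
import cv2
import numpy as np

def coarse_to_fine(pyramids, K, solve_level):
    """Coarse-to-fine scene flow solving (fig. 3): start at the coarsest
    pyramid level with zero flow, solve, then upsample each result as the
    initial value for the next finer level until full resolution.
    `solve_level(images, init, with_depth)` is a hypothetical solver."""
    n_levels = len(pyramids[0])           # pyramids: one list per camera
    state = None                          # (u, v, w, Z)
    for L in range(n_levels - 1, -1, -1):
        images = [pyr[L] for pyr in pyramids]
        h, w = images[0].shape[:2]
        if state is None:
            state = tuple(np.zeros((h, w)) for _ in range(4))
        else:
            # upsample the previous solution; a full implementation would
            # also rescale u, v by the resolution ratio
            state = tuple(cv2.resize(s, (w, h)) for s in state)
        with_depth = L < K                # L >= K: flow only (per fig. 3)
        state = solve_level(images, state, with_depth)
    return state                          # full-resolution (u, v, w, Z)
```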
S6, clustering of the 3-dimensional scene flow motion V(u, v, w). Owing to noise and error, the scene flow computed in S5 is not necessarily zero-valued in the background region, and a nonzero background scene flow would affect the subsequent violent motion judgment. The motion regions are therefore clustered with a clustering algorithm, separating the background from the motion regions and excluding the background region, so that violent motion can be evaluated effectively.
The clustering algorithm aims to find an optimal set of class-representative points such that the sum of the similarities of all data points to their nearest class representative is maximized. Briefly, the input to the algorithm is the similarity matrix S_{N×N} formed by the pairwise similarities of all N data points, and at start-up all samples are considered potential cluster centers. For each sample point, the information establishing its degree of attraction to other sample points is defined as follows:

attraction degree: r(i, k) describes how well point k is suited as the cluster center for data point i;

reliability: a(i, k) describes how appropriate it is for point i to select point k as its cluster center.

To find suitable cluster centers x_k, the algorithm continuously gathers the evidence r(i, k) and a(i, k) from the data samples. The iterative formulas for r(i, k) and a(i, k) are as follows:

r(i, k) = s(i, k) − max_{k′≠k} {a(i, k′) + s(i, k′)}    (18)

a(i, k) = min{0, r(k, k) + Σ_{i′∉{i,k}} max{0, r(i′, k)}},  i ≠ k    (19)

The algorithm iterates equations (18) and (19) to update the attraction and reliability until m high-quality cluster center points are generated, and the remaining data points are assigned to the corresponding clusters.
The computed scene flow includes the scene flow of the background region and that of the moving objects, which differ significantly: the scene flow at each point differs in magnitude and direction. The algorithm therefore takes each point's scene flow direction information and magnitude information as that point's features, forms its feature vector, and feeds it into the clustering algorithm for classification.

The feature information of the scene flow specifically includes: the three components u, v, w of each point's scene flow; each point's scene flow module value

|V| = √(u² + v² + w²);

and the included angles θx, θy, θz between each point's scene flow and the xoy, xoz and yoz planes. Each point is represented by the 7-dimensional feature vector V_{i,j} = (u, v, w, |V|, θx, θy, θz). The scene flow is clustered with this 7-dimensional feature vector, and the resulting cluster regions comprise a background region and motion regions. In general, with a stationary camera, the cluster whose motion vectors are close to 0 is judged to belong to the background region, and the other cluster regions are motion regions.
After the motion region and the background region are separated, a flag is set: flag = 1 for the motion region and flag = 0 for the background region. The number of pixels in the motion region is counted as count, and the motion region is taken as the spatial neighborhood over which the sums in S7 and S8 run. A sketch of this clustering step is given below.
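A minimal sketch of the clustering step built on scikit-learn's affinity propagation implementation; the plane-angle computation, the negative-squared-Euclidean similarity implied by the default settings, and all parameter values are assumptions:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def segment_motion(u, v, w):
    """Cluster per-pixel 7-D scene flow features V_ij = (u, v, w, |V|,
    theta_x, theta_y, theta_z); the cluster with near-zero motion is
    treated as background (flag = 0), the rest as motion (flag = 1)."""
    mag = np.sqrt(u**2 + v**2 + w**2) + 1e-12
    # signed angle of the flow vector with the xoy, xoz and yoz planes
    tx = np.arcsin(np.clip(w / mag, -1, 1))   # vs xoy plane
    ty = np.arcsin(np.clip(v / mag, -1, 1))   # vs xoz plane
    tz = np.arcsin(np.clip(u / mag, -1, 1))   # vs yoz plane
    feats = np.stack([u, v, w, mag, tx, ty, tz], axis=-1).reshape(-1, 7)

    ap = AffinityPropagation(damping=0.9, random_state=0).fit(feats)
    labels = ap.labels_.reshape(u.shape)
    # background = cluster whose mean |V| is smallest (close to 0)
    bg = min(np.unique(labels), key=lambda c: mag[labels == c].mean())
    flag = (labels != bg).astype(np.uint8)
    return flag, int(flag.sum())      # mask and pixel count `count`
```

Affinity propagation is quadratic in the number of samples, so in practice the per-pixel features would be subsampled before clustering.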
S7, based on the motion region scene flow V(u, v, w) obtained in S6, a suitable evaluation model is built to evaluate whether the motion is violent.
A motion-direction evaluation model is established from the directions of the scene flow. The per-frame motion region of the moving object in the camera coordinate system has been separated in S6. For normal, steady motion, analysis of the motion-vector directions of the motion region shows that they are mainly concentrated in one direction, whereas the motion directions of violent motion are distributed relatively randomly. If a direction histogram of the motion vectors of the motion points is constructed, the histogram formed by violent motion is relatively dispersed, while the histogram formed by steady motion is relatively concentrated.
To evaluate the direction of each motion vector quantitatively, the Z axis of the camera coordinate system is defined as the reference vector direction, and the direction of a motion vector is determined by calculating the included angle between each motion point of the motion region and the reference direction. From S5, each pixel point of the nth frame has, in the camera coordinate system, a horizontal velocity u_{i,j}(t), a vertical velocity v_{i,j}(t) and a depth-direction velocity w_{i,j}(t). Its angle φ_{i,j}(t) with the reference vector is given by equation (20):

φ_{i,j}(t) = arccos( w_{i,j}(t) / √(u_{i,j}(t)² + v_{i,j}(t)² + w_{i,j}(t)²) )    (20)

To judge whether the motion is violent, the variance D(φ_{i,j}(t)) of φ_{i,j}(t) over all motion points is calculated as in equation (21), where φ̄(t) is the mean of all the included angles:

D(φ_{i,j}(t)) = (1/count) Σ_{(i,j)} ( φ_{i,j}(t) − φ̄(t) )²    (21)
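A short numerical sketch of equations (20) and (21):

```python
import numpy as np

def direction_variance(u, v, w, flag):
    """Equations (20)-(21): angle of each motion-region flow vector with
    the camera Z axis, then the variance of those angles over the region."""
    mag = np.sqrt(u**2 + v**2 + w**2) + 1e-12
    phi = np.arccos(np.clip(w / mag, -1.0, 1.0))  # angle with Z axis, eq. (20)
    phi_m = phi[flag == 1]                        # motion-region pixels only
    if phi_m.size == 0:
        return 0.0
    return float(phi_m.var())                     # D(phi) = mean (phi - mean)^2
```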
S8, an evaluation model of the kinetic energy of the motion is established from the motion energy of the motion region. The total kinetic energy W(t) of each frame's motion region of the scene flow is calculated as in equation (22):

[formula (22): image in original]

and the average kinetic energy of each pixel point in the motion region is calculated from the total kinetic energy of each frame's motion region as in equation (23):

W̄(t) = W(t) / count    (23)
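Equation (22) is not reproduced in the source; a plausible form, assuming unit mass per pixel so that each pixel contributes ½|V|², is sketched below together with the per-pixel average of equation (23):

```python
import numpy as np

def average_kinetic_energy(u, v, w, flag):
    """W(t) under a unit-mass-per-pixel assumption (0.5 * |V|^2 per pixel),
    then W_bar(t) = W(t) / count as in equation (23)."""
    m = flag == 1
    total = 0.5 * float((u[m]**2 + v[m]**2 + w[m]**2).sum())  # assumed eq. (22)
    count = int(m.sum())
    return total / max(count, 1)                              # W_bar(t)
```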
S9, an angle variance threshold φ_th and a per-pixel kinetic energy threshold W_th are set manually. From S7 and S8, when D(φ_{i,j}(t)) > φ_th and W̄(t) > W_th are satisfied for n consecutive frames, the motion is judged to be abnormal violent motion and an alarm is triggered.
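A sketch of the n-consecutive-frame decision rule; the threshold values in the usage comment are placeholders to be tuned per scene:

```python
def make_alarm_trigger(phi_th, w_th, n):
    """Per-frame decision rule of S9: fire once D(phi) > phi_th and
    W_bar > W_th have held for n consecutive frames."""
    streak = 0
    def update(d_phi, w_bar):
        nonlocal streak
        streak = streak + 1 if (d_phi > phi_th and w_bar > w_th) else 0
        return streak >= n            # True -> trigger the alarm
    return update

# usage (threshold values are placeholders, tuned per scene):
# alarm = make_alarm_trigger(phi_th=0.8, w_th=5.0, n=10)
# if alarm(direction_variance(u, v, w, flag),
#          average_kinetic_energy(u, v, w, flag)): raise_alert()
```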
The invention uses 3-dimensional scene flow for violent motion detection for the first time, and achieves effective violent motion detection and alarm.

Claims (8)

1. A violent motion detection method based on multi-view stereoscopic vision scene flow is characterized by comprising the following steps:
the method comprises the following steps: acquiring a plurality of groups of image sequences by using a calibrated multi-view camera;
step two: preprocessing an image sequence, performing multi-resolution down-sampling on the image sequence by adopting an image pyramid, performing coordinate system conversion according to internal and external parameters of a camera, and establishing a relation between an image coordinate system and a camera coordinate system;
step three: designing the scene flow energy functional data term: the data term directly fuses 3-dimensional scene flow information and 3-dimensional surface depth information, adopts a structure-tensor constancy assumption, and introduces a robust penalty function;
step four: designing the scene flow energy functional smoothing term: the smoothing term adopts flow-driven anisotropic smoothing that simultaneously constrains the 3-dimensional flow field V(u, v, w) and the 3-dimensional surface depth Z, and also introduces a robust penalty function;
step five: optimizing and solving the energy functional: the energy functional is minimized to obtain the Euler-Lagrange equations, which are then solved; computation starts from the lowest-resolution image of the pyramid obtained in step two and proceeds coarse-to-fine until the full-resolution image is reached;
step six: clustering the motion areas of the scene flow, clustering the motion areas by using a clustering algorithm, separating the motion areas from background areas, and removing the background areas;
step seven: constructing a motion direction discrete degree evaluation model, and judging whether the motion is violent;
step eight: constructing a kinetic energy size evaluation model of a motion area;
step nine: and setting a threshold value, and triggering an alarm when n continuous frames meet the evaluation condition.
2. The method of claim 1, wherein in step two, in establishing the relationship between the image coordinate system and the camera coordinate system, the relationship between the 2-dimensional optical flow and the 3-dimensional scene flow is established as

[formula image in original]

where (u, v) is the 2-dimensional optical flow and (u0, v0) are the optical center coordinates.
3. The method of claim 1, wherein the design of the data term described in step three specifically includes using the constancy assumption based on the structure tensor:

the structure-tensor constancy assumption for the N cameras between times t and t+1 is defined as:

[formula image in original]

the structure-tensor constancy assumption between the reference camera C0 and the other N-1 cameras at time t is defined as:

[formula image in original]

the structure-tensor constancy assumption between the reference camera C0 and the other N-1 cameras at time t+1 is defined as:

[formula image in original]

in the above data-term formulas, a robust penalty function is used to make the term approximate the L1 norm; a binary occlusion mask, obtained by an occlusion boundary region detection technique for stereo images, is 0 when a pixel is an occlusion point and 1 for non-occluded points; and I_T is the local structure tensor of the 2-dimensional image, whose formula is given as an image in the original.
4. The method of claim 1, wherein the design of the scene flow energy functional smoothing term described in step four specifically includes directly regularizing the 3-dimensional flow field and the depth information and designing a flow-driven anisotropic smoothness assumption, with S_m(V) and S_d(Z) constraining the 3-dimensional flow field and the depth information respectively; the smoothing terms are designed as:

S_m(V) = ψ(|u_x|²) + ψ(|u_y|²) + ψ(|v_x|²) + ψ(|v_y|²) + ψ(|w_x|²) + ψ(|w_y|²)

S_d(Z) = ψ(|Z_x|²) + ψ(|Z_y|²)

and the overall scene flow estimation energy functional is

[formula image in original]
5. the method of claim 1, wherein the method comprises: the clustering of the scene flow motion areas in the sixth step specifically includes clustering the scene flow V (u, V, w) obtained in the fifth step by using a clustering algorithm, and separating the background from the motion areas, wherein the feature information of the scene flow specifically includes: each point scene flow u, v, w three components, each point scene flow module value is
Figure FDA00013105064600000210
The included angle theta between each point scene flow and the xoy plane, the xoz plane and the yoz planex,θy,θzEach point represents V by a 7-dimensional feature vectori,j=(u,v,w,|V|,θx,θy,θz);
The specific process is as follows: the input is a similarity matrix S formed by the similarity between every two of all N data pointsN×NThe initial stage treats all samples as potential cluster centers and then x in order to find the appropriate cluster centerkContinuously collecting the attraction degree r (i, k) and the reliability degree a (i, k) from the data samples, and matching the formula
Figure FDA00013105064600000211
Figure FDA0001310506460000031
Continuously iterating to update the attraction degree and the reliability degree until m cluster center points are generated, wherein r (i, k) is used for describing the degree that the point k is suitable as the cluster center of the data point i, and a (i, k) is used for describing the degree that the point k is selected as the cluster center of the point i;
setting a flag for a motion area, wherein if the motion area is a motion area, the flag is 1, if the motion area is a background area, the flag is 0, counting the number of pixels in the motion area as count, and setting the motion area as a spatial neighborhood
Figure FDA0001310506460000032
6. The method of claim 1, wherein the construction of the motion-direction dispersion evaluation model described in step seven specifically includes defining the Z axis of the camera coordinate system as the reference vector direction and calculating the included angle φ_{i,j}(t) between each motion vector and the reference direction as

φ_{i,j}(t) = arccos( w_{i,j}(t) / √(u_{i,j}(t)² + v_{i,j}(t)² + w_{i,j}(t)²) )

and then calculating the variance D(φ_{i,j}(t)) of φ_{i,j}(t) over the pixels of each frame's motion region, where φ̄(t) is the average of all the included angles:

D(φ_{i,j}(t)) = (1/count) Σ_{(i,j)} ( φ_{i,j}(t) − φ̄(t) )²
7. The method of claim 1, wherein the total kinetic energy W(t) of each frame's motion region is calculated from the computed scene flow:

[formula image in original]

and the average kinetic energy of the motion region is calculated from the total kinetic energy of each frame's motion region as

W̄(t) = W(t) / count
8. The method of claim 1, wherein in step nine an angle variance threshold φ_th and a kinetic energy threshold W_th are set, and when D(φ_{i,j}(t)) > φ_th and W̄(t) > W_th are satisfied for n consecutive frames, the motion is judged to be violent and an alarm is triggered.
CN201710404056.7A 2017-06-01 2017-06-01 Violent motion detection method based on multi-view stereoscopic vision scene stream Active CN107341815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710404056.7A CN107341815B (en) 2017-06-01 2017-06-01 Violent motion detection method based on multi-view stereoscopic vision scene stream

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710404056.7A CN107341815B (en) 2017-06-01 2017-06-01 Violent motion detection method based on multi-view stereoscopic vision scene stream

Publications (2)

Publication Number Publication Date
CN107341815A CN107341815A (en) 2017-11-10
CN107341815B (en) 2020-10-16

Family

ID=60221390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710404056.7A Active CN107341815B (en) 2017-06-01 2017-06-01 Violent motion detection method based on multi-view stereoscopic vision scene stream

Country Status (1)

Country Link
CN (1) CN107341815B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102707596B1 (en) * 2018-08-07 2024-09-19 삼성전자주식회사 Device and method to estimate ego motion
CN109726718B (en) * 2019-01-03 2022-09-16 电子科技大学 Visual scene graph generation system and method based on relation regularization
CN109978968B (en) * 2019-04-10 2023-06-20 广州虎牙信息科技有限公司 Video drawing method, device and equipment of moving object and storage medium
CN112015170A (en) * 2019-05-29 2020-12-01 北京市商汤科技开发有限公司 Moving object detection and intelligent driving control method, device, medium and equipment
CN112581494B (en) * 2020-12-30 2023-05-02 南昌航空大学 Binocular scene flow calculation method based on pyramid block matching
CN112614151B (en) * 2021-03-08 2021-08-31 浙江大华技术股份有限公司 Motion event detection method, electronic device and computer-readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9659372B2 (en) * 2012-05-17 2017-05-23 The Regents Of The University Of California Video disparity estimate space-time refinement method and codec

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104680544A (en) * 2015-03-18 2015-06-03 哈尔滨工程大学 Method for estimating variational scene flow based on three-dimensional flow field regularization
CN106485675A (en) * 2016-09-27 2017-03-08 哈尔滨工程大学 A kind of scene flows method of estimation guiding anisotropy to smooth based on 3D local stiffness and depth map
CN106504202A (en) * 2016-09-27 2017-03-15 哈尔滨工程大学 A kind of based on the non local smooth 3D scene flows methods of estimation of self adaptation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Variational Method for Scene Flow Estimation from Stereo Sequences; Frédéric Huguet et al.; 2007 IEEE 11th International Conference on Computer Vision; 2007-10-21; pp. 1-17 *
Moving object detection and tracking based on binocular scene flow; Yang Wenkang; China Master's Theses Full-text Database, Information Science & Technology; 2017-02-15; I138-3458 *

Also Published As

Publication number Publication date
CN107341815A (en) 2017-11-10

Similar Documents

Publication Publication Date Title
CN107341815B (en) Violent motion detection method based on multi-view stereoscopic vision scene stream
CN111462200B (en) Cross-video pedestrian positioning and tracking method, system and equipment
CN110910421B (en) Weak and small moving object detection method based on block characterization and variable neighborhood clustering
CN102098440A (en) Electronic image stabilizing method and electronic image stabilizing system aiming at moving object detection under camera shake
Xu et al. Dynamic obstacle detection based on panoramic vision in the moving state of agricultural machineries
CN110599522A (en) Method for detecting and removing dynamic target in video sequence
CN111260691B (en) Space-time regular correlation filtering tracking method based on context awareness regression
CN106530407A (en) Three-dimensional panoramic splicing method, device and system for virtual reality
CN105957060B (en) A kind of TVS event cluster-dividing method based on optical flow analysis
CN111582036A (en) Cross-view-angle person identification method based on shape and posture under wearable device
Ellenfeld et al. Deep fusion of appearance and frame differencing for motion segmentation
CN112509014B (en) Robust interpolation light stream computing method matched with pyramid shielding detection block
CN109166079B (en) Mixed synthesis motion vector and brightness clustering occlusion removing method
Hu et al. An integrated background model for video surveillance based on primal sketch and 3D scene geometry
Li et al. Real-time action recognition by feature-level fusion of depth and inertial sensor
Rougier et al. 3D head trajectory using a single camera
CN111160255B (en) Fishing behavior identification method and system based on three-dimensional convolution network
Panagiotakis et al. Shape-based individual/group detection for sport videos categorization
CN108647589A (en) It is a kind of based on regularization form than fall down detection method
CN117876419B (en) Dual-view-field aerial target detection and tracking method
Briassouli et al. Fusion of frequency and spatial domain information for motion analysis
CN118314162B (en) Dynamic visual SLAM method and device for time sequence sparse reconstruction
Nagmode et al. A novel approach to detect and track moving object using Partitioning and Normalized Cross Correlation
Nagmode et al. Moving Object detection and tracking based on correlation and wavelet Transform Technique to optimize processing time
Hadfield et al. Go with the flow: Hand trajectories in 3D via clustered scene flow

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant