CN106485735A - Human body target recognition and tracking method based on stereovision technique - Google Patents
- Publication number
- CN106485735A (application CN201510550955.9A)
- Authority
- CN
- China
- Legal status: Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
- G06T2207/10021—Stereoscopic video; Stereoscopic image sequence
Abstract
The invention discloses a human body target recognition and tracking method based on stereovision, comprising: two cameras simultaneously capture the same scene from two different angles to form a stereo image pair; the intrinsic and extrinsic camera parameters are determined by camera calibration and an imaging model is established; using a window-based matching algorithm, a window is created centered on the point to be matched in one image and an identical sliding window is created on the other image; the sliding window is moved pixel by pixel along the epipolar line, the window matching measure is computed, and the best matching point is found; the three-dimensional geometric information of the target is then obtained from the parallax principle and a depth image is generated; head and shoulder information is distinguished by one-dimensional maximum-entropy threshold segmentation combined with gray features; and the human body target is tracked with an adaptive gate tracking method to obtain the target track. The invention achieves accurate identification, tracking and counting of human targets, effectively removes background noise, and avoids environmental interference.
Description
Technical Field
The invention relates to a human body target tracking method, in particular to a human body target identification and tracking method based on a stereoscopic vision technology.
Background
With the rapid improvement of computer storage and computing performance, computers are increasingly used to realize complex functions such as scene reconstruction, target recognition and human-computer interaction, expanding the scale and research directions of the computer application field and promoting the rapid development of related disciplines. Computer vision, an active research field at present, essentially uses a camera in place of the human eye and a computer in place of the brain to recognize and track a target, perform the corresponding graphic analysis and processing, and produce an image suitable for instrument detection or human observation. Video technology transmits images continuously over a period of time, contains rich detail, and has the advantages of being intuitive, concrete and easy to process. Identification of video objects has become an important issue in image processing, pattern recognition and human-computer interaction, and is widely used in intelligent systems in manufacturing, medical diagnosis, the military and other fields.
A traditional APC system mainly comprises a pressure-sensing system and an infrared-shielding system; with the rapid development of laser infrared technology, a pyroelectric infrared probe can detect the signal emitted by a human body for identification and counting. When the target moves, the infrared sensor detects the change caused by the infrared spectrum of the human body, i.e. the process of the human target's motion, and targets can be distinguished by signal processing. This approach is cheap and simple to operate, but suffers from inaccurate identification and counting results, limited application scenarios, and similar problems.
Image processing methods can also be applied to the human recognition problem. However, most methods only apply recognition algorithms to two-dimensional images, e.g. selecting some parts of the human body as features and trying to match them in the images. Several methods are commonly used for human body recognition at present: methods based on a human body model and structural elements place high demands on extracting image information of the whole person, have difficulty handling moving and deforming objects, and require highly real-time image acquisition; methods based on the wavelet-template principle must search the whole image at different scales, which is computationally expensive.
Disclosure of Invention
The invention aims to provide a human body target recognition and tracking method based on a stereoscopic vision technology.
The technical solution for realizing the purpose of the invention is as follows: a human body target recognition and tracking method based on a stereoscopic vision technology is characterized by comprising the following steps:
step 1, two cameras simultaneously acquire pictures of the same scene from two different angles to form a stereo image pair;
step 2, determining internal and external parameters of the camera through camera calibration, and establishing an imaging model;
step 3, adopting a window-based matching algorithm, creating a window centered on the point to be matched in one image, creating the same sliding window on the other image, moving the sliding window along the epipolar line pixel by pixel, calculating the window matching measure, finding the best matching point, obtaining the three-dimensional geometric information of the target from the parallax principle, and generating a depth image;
step 4, distinguishing head and shoulder information by adopting a one-dimensional maximum entropy threshold segmentation method and combining with gray features, and identifying a human body target;
and 5, tracking the human body target by adopting a self-adaptive wave gate tracking method and obtaining a target track.
Compared with the prior art, the invention has the following remarkable advantages:
(1) the invention requires little computation and can quickly and accurately identify human targets from simple images;
(2) in crowded conditions, the depth information of the image can be used for identification, effectively eliminating interference and distinguishing moving targets;
(3) the invention can effectively track and count human targets while avoiding duplicate counts.
Drawings
Fig. 1 is a flow chart of a human body target recognition and tracking method based on a stereoscopic vision technology.
FIG. 2 is an original image according to an embodiment of the present invention.
FIG. 3 is a segmented image of the shoulders of a human target according to an embodiment of the present invention.
Fig. 4 is a motion trail diagram of a human target according to an embodiment of the present invention.
Detailed Description
With reference to fig. 1, the human body target recognition and tracking method based on the stereoscopic vision technology of the present invention includes the following steps:
step 1, obtaining a stereo image pair:
two MTV-1881EX-3 cameras are placed in parallel, and pictures of the same scene are simultaneously obtained from two different angles to form a stereo image pair;
step 2, determining internal and external parameters of the camera through camera calibration, and establishing an imaging model, specifically:
step 2-1, calibrating the coordinates of the camera, wherein the calibration graph is a checkerboard, and the calibration principle is as follows:
assuming that the world coordinate system plane with z = 0 is the template plane, [r_1 r_2 r_3] is the rotation matrix of the camera coordinate system relative to the world coordinate system, t is the translation vector of the camera coordinate system relative to the world coordinate system, [X Y 1]^T is the homogeneous coordinate of a point on the template, [u v 1]^T is the homogeneous coordinate of the projection of that template point on the image plane, and K is the camera intrinsic matrix;
step 2-2, let the camera coordinate system Ox_c y_c z_c be a rectangular coordinate system fixed to the camera, with origin O at the camera's optical center, the x_c and y_c axes respectively parallel to the x and y axes of the image physical coordinate system, and the z_c axis coinciding with the optical axis, i.e. perpendicular to the camera's imaging plane; the distance OO_1 from the optical center to the image plane is the effective focal length f of the camera;
step 2-3, let (x_w, y_w, z_w) be the three-dimensional coordinates of a point P in the world coordinate system and (x_c, y_c, z_c) the three-dimensional coordinates of the same point P in the camera coordinate system; the transformation of the point from the world coordinate system into the camera coordinate system is expressed with an orthogonal rotation matrix R and a translation transformation matrix T as:

[x_c, y_c, z_c]^T = R [x_w, y_w, z_w]^T + T

where R is a 3 × 3 rotation matrix and T a 3 × 1 translation matrix;
The orthogonal matrix R is the combination of direction cosines of the optical axis with respect to the coordinate axes of the world coordinate system, and contains three independent angle variables (Euler angles): rotation by the angle ψ (yaw) about the x axis, rotation by the angle θ (pitch) about the y axis, and rotation by the angle φ (roll) about the z axis; together with the three variables of T they form six parameters, called the extrinsic parameters of the camera;
step 2-4, the rigid transformation between the world coordinate system and the camera coordinate system is simplified into homogeneous coordinate and matrix form:

[x_c, y_c, z_c, 1]^T = M_2 [x_w, y_w, z_w, 1]^T,  with M_2 = [R T; 0^T 1]

Thus the relation between the world coordinate system and the camera coordinate system is expressed by a matrix M_2; once M_2 is known, coordinates can be converted between the two coordinate systems;
the transformation from the camera coordinate system to the ideal image physical coordinate system, i.e. the ideal perspective projection transformation under the pinhole model, has the following formula:
x = f·x_c/z_c,  y = f·y_c/z_c
x and y are respectively the abscissa and the ordinate of an ideal image physical coordinate system;
the above formula is also expressed in terms of homogeneous coordinates and matrices as:
the transformation of the ideal image coordinate system to the image pixel coordinate system, expressed in homogeneous coordinates, is:
wherein u_0 and v_0 are the coordinates of the principal point (the intersection of the optical axis with the image plane) in the image pixel coordinate system;
the inverse relationship is:
by substituting the above relational expressions, the relation between the coordinates of the point P in the world coordinate system and the coordinates (u, v) of its projection P' is obtained:

where α = f/dx and β = f/dy are the scale factors along the u and v axes; M_1 is the intrinsic parameter matrix and M_2 the extrinsic parameter matrix, and M = M_1 M_2 is a 3 × 4 matrix called the projection matrix, which expresses the basic relation between two-dimensional image coordinates and three-dimensional world coordinates. Given the world coordinates of an object point, the corresponding ideal image coordinates can be solved; conversely, if the matrix M and the image coordinates of an image point are known, the spatial ray through the camera's optical center corresponding to that point can be solved;
Obtaining this basic relation between two-dimensional image coordinates and three-dimensional world coordinates completes the camera calibration; through the imaging model, coordinates in the images acquired by the cameras can be uniformly converted into coordinates in the three-dimensional world coordinate system, i.e. the imaging model relating the camera images to the three-dimensional world coordinate system is determined.
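The imaging model of steps 2-2 to 2-4 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the intrinsic values α, β, u_0, v_0 and the pose R, T below are assumed example numbers.

```python
import numpy as np

def project_point(Pw, R, T, alpha, beta, u0, v0):
    """Project a 3-D world point to pixel coordinates with the pinhole
    model of step 2: camera frame P_c = R·P_w + T (extrinsics M_2),
    then u = alpha*x_c/z_c + u0, v = beta*y_c/z_c + v0 (intrinsics M_1,
    with alpha = f/dx, beta = f/dy)."""
    Pc = R @ np.asarray(Pw, dtype=float) + T   # world -> camera coordinates
    x, y, z = Pc
    u = alpha * x / z + u0                     # perspective division + pixel scaling
    v = beta * y / z + v0
    return u, v

# illustrative pose: identity rotation, camera 2 m from the origin along z
R = np.eye(3)
T = np.array([0.0, 0.0, 2.0])
u, v = project_point([0.5, -0.25, 0.0], R, T, alpha=800, beta=800, u0=320, v0=240)
```

With these assumed values the point lands at (520, 140): the x offset of 0.5 m at 2 m depth maps to 800·0.25 = 200 pixels right of the principal point.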
step 3, adopting a window-based matching algorithm, creating a window centered on the point to be matched in one image, creating the same sliding window on the other image, moving the sliding window along the epipolar line pixel by pixel, calculating the window matching measure, finding the best matching point, obtaining the three-dimensional geometric information of the target from the parallax principle, and generating a depth image; specifically:
step 3-1, taking the right image as the reference, the background is subtracted from it to obtain the foreground image;
step 3-2, determining parallax:
the first step, in the foreground image, assuming that the right image is taken as a reference, calculating the gray difference value of each pixel at a given parallax with the corresponding point of the left image;
secondly, at each disparity, a narrow strip-shaped window perpendicular to the baseline direction is used instead, and the sum of absolute gray differences over the window centered on each pixel is computed with the window-based matching algorithm:

C(d) = Σ_γ Σ_η | I_right[x_e+γ, y_e+η] − I_left[x_e+γ+d, y_e+η] |

where m × n is the size of the template window, γ and η index positions along the window's length and width respectively, I_right[x_e+γ, y_e+η] is the gray value of the right view at coordinate [x_e+γ, y_e+η], I_left[x_e+γ+d, y_e+η] is the gray value of the left view at coordinate [x_e+γ+d, y_e+η], and d is the disparity;
thirdly, within the set disparity range, d is varied from the minimum disparity to the maximum disparity and the values of the expression are compared in turn; the point corresponding to the minimum value is the best matching point, and the corresponding disparity is taken as the disparity value of the pixel;
step 3-3, determining depth information of the target:
Binocular ranging exploits the inverse proportionality between the disparity (the difference in the horizontal image coordinates of a target point in the left and right views) and the distance Z from the target point to the imaging plane. With the camera focal length known, the depth of any point, i.e. its z-axis coordinate in the camera coordinate system, can be computed. Let b be the distance between the optical centers of the two cameras, H the perpendicular distance of the target Q from the cameras, f the common focal length, Q_1 and Q_2 the image points of the target Q in the two cameras, and d the disparity; assuming the optical axes of the two cameras are mutually parallel, similar triangles give:
H=(b×f)/d
the obtained vertical distance from the target Q to the camera is the depth information of the target;
Stereoscopic counting can therefore use two or more cameras with a positional offset to obtain scene depth by triangulation, provided each scene point has an image point in both the left and right images. The positions of the image points in the two views differ; this difference is the disparity. Points at different distances from the cameras have different disparities: the farther a point is from the cameras, the smaller its disparity. Binocular stereo vision relies on this disparity and uses trigonometric calculation to determine the distance of an object from the cameras.
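The matching and depth recovery of steps 3-2 and 3-3 can be sketched as follows. The window size, disparity range, and the baseline/focal values are illustrative assumptions, and a square SAD window stands in for the narrow strip window the patent describes.

```python
import numpy as np

def disparity_sad(right, left, x, y, win=2, d_max=16):
    """For pixel (y, x) of the right (reference) view, slide a
    (2*win+1)-square window along the epipolar line of the left view
    and return the disparity d minimising the sum of absolute gray
    differences (step 3-2)."""
    r = right[y-win:y+win+1, x-win:x+win+1].astype(float)
    best_d, best_cost = 0, np.inf
    for d in range(d_max + 1):
        if x + d + win >= left.shape[1]:       # stay inside the left view
            break
        l = left[y-win:y+win+1, x+d-win:x+d+win+1].astype(float)
        cost = np.abs(r - l).sum()             # SAD matching measure
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d

def depth_from_disparity(d, b, f):
    """Step 3-3: H = (b × f) / d for parallel optical axes."""
    return b * f / d
```

For a textured scene shifted by 4 pixels between the views, `disparity_sad` recovers d = 4, and with an assumed baseline b = 0.1 m and focal length f = 800 pixels, `depth_from_disparity(4, 0.1, 800)` gives 20.0.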
Step 4, distinguishing head and shoulder information by adopting a one-dimensional maximum entropy threshold segmentation method and combining with gray features, and identifying a human body target, wherein the method specifically comprises the following steps:
step 4-1, the depth image is divided into small cells of L × L pixels, L a positive integer, and scanned in units of a nine-cell (3 × 3) grid; the grid is moved by one L × L cell per comparison, from left to right and from top to bottom, and if the average gray level of the middle cell is higher than that of each of its eight neighbours, the middle cell is determined to be a head target area;
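Step 4-1's nine-cell scan can be sketched as follows; the cell size L and the overhead-camera assumption (the head is nearest the camera, hence brightest in the depth map) are illustrative.

```python
import numpy as np

def head_candidates(depth_img, L=8):
    """Tile the depth image into L x L cells and flag each cell whose
    mean gray level exceeds the means of all eight neighbouring cells
    (step 4-1's nine-cell comparison)."""
    h, w = depth_img.shape
    rows, cols = h // L, w // L
    # per-cell mean gray level, shape (rows, cols)
    means = depth_img[:rows*L, :cols*L].reshape(rows, L, cols, L).mean(axis=(1, 3))
    hits = []
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            neigh = means[i-1:i+2, j-1:j+2].copy()
            neigh[1, 1] = -np.inf              # exclude the centre cell itself
            if means[i, j] > neigh.max():
                hits.append((i, j))            # cell indices of head candidates
    return hits
```

A single bright 8 × 8 block in an otherwise dark 32 × 32 depth image is reported as the one candidate cell.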
step 4-2, setting a threshold value for the head target area, carrying out binarization, and segmenting the head target; the method specifically comprises the following steps:
a one-dimensional maximum-entropy threshold segmentation method is adopted to segment head and non-head regions; let p_i denote the proportion of pixels with gray value i in the image (0 ≤ i ≤ 255), and let the gray level t divide the head and shoulder regions: pixels with gray level above t form the head region and pixels below t form the non-head region. With the cumulative probability p_t = Σ_{i=0}^{t} p_i, the entropies of the non-head and head regions are respectively defined as:

H_B = −Σ_{i≤t} (p_i/p_t) lg(p_i/p_t) = lg p_t + H_t/p_t

H_O = −Σ_{i>t} [p_i/(1−p_t)] lg[p_i/(1−p_t)] = lg(1−p_t) + (H_E − H_t)/(1−p_t)

where H_t = −Σ_{i≤t} p_i lg p_i and H_E = −Σ_{i=0}^{255} p_i lg p_i. The gray level t at which the total entropy attains its maximum is taken as the threshold for segmenting the image:

t = arg max{H_B + H_O}
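The maximum-entropy threshold of step 4-2 can be sketched as follows; the natural logarithm is used in place of lg, which leaves the argmax unchanged.

```python
import numpy as np

def max_entropy_threshold(img):
    """One-dimensional maximum-entropy (Kapur) threshold: p_i is the
    fraction of pixels at gray level i, and the returned t maximises
    H_B(t) + H_O(t) for the below-t and above-t regions (step 4-2)."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    best_t, best_h = 0, -np.inf
    for t in range(1, 255):
        Pt = p[:t+1].sum()                     # cumulative probability p_t
        if Pt <= 0.0 or Pt >= 1.0:
            continue                           # one region would be empty
        pb = p[:t+1][p[:t+1] > 0] / Pt         # below-t distribution
        po = p[t+1:][p[t+1:] > 0] / (1.0 - Pt) # above-t distribution
        h = -(pb * np.log(pb)).sum() - (po * np.log(po)).sum()
        if h > best_h:
            best_h, best_t = h, t
    return best_t
```

On an image whose gray levels are uniformly distributed over 0..255, the entropy sum lg(t+1) + lg(255−t) peaks at t = 127, i.e. an even split.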
step 4-3, determining the average gray level and gray variance of each segmented head region:

let M and N be the numbers of rows and columns of the region, (ε, ζ) the row and column indices within it, and f(ε, ζ) the gray value at that point; the average gray level is m = (1/(MN)) Σ_ε Σ_ζ f(ε, ζ) and the gray variance is σ² = (1/(MN)) Σ_ε Σ_ζ [f(ε, ζ) − m]²; when the gray variance is larger than a set threshold, the region is filtered out;
step 4-4, filtering out false targets with a long, narrow shape by checking whether the ratio of the total pixel width of the human head to its height, at the given field-of-view height, falls within a set range;

the geometric characteristics of the head mainly include ellipticity, head area, length and width; repeated simulation tests give the range of the head's total pixel width at a given field height, and likewise simulation puts the admissible range of the width-to-height ratio w/h at [0.65, 1.5]; this threshold test effectively filters out false targets with a long, narrow shape.
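Steps 4-3 and 4-4 can be combined into one candidate filter, sketched below; the variance threshold is an assumed illustrative value, while the w/h range [0.65, 1.5] comes from the text.

```python
import numpy as np

def keep_head_region(region, var_thresh=2000.0, wh_range=(0.65, 1.5)):
    """Reject a head candidate whose gray variance exceeds var_thresh
    (step 4-3) or whose width/height ratio falls outside the simulated
    head range (step 4-4).  var_thresh is illustrative, not from the
    patent; wh_range = [0.65, 1.5] is the range given in the text."""
    region = np.asarray(region, dtype=float)
    h, w = region.shape
    if region.var() > var_thresh:      # step 4-3: gray-variance filter
        return False
    return wh_range[0] <= w / h <= wh_range[1]   # step 4-4: aspect filter
```

A uniform square region passes, while a long narrow region (w/h = 4) and a high-variance striped region are both rejected.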
Step 5, tracking the human body target by adopting a self-adaptive wave gate tracking method to obtain a target track, which specifically comprises the following steps:
step 5-1, the gate size is set to the bounding rectangle of the human head target expanded by 8 pixels in each of the four directions, and the gate is moved from left to right and from top to bottom;
step 5-2, during gate movement, the gate rectangle is corrected by the single-chip microcomputer; the target position in the next field is predicted from the target positions of the preceding five fields obtained by the camera, and the difference between the predicted value and the actual target value in the next field is the gate tracking error;
step 5-3, during gate movement, a 2-point linear prediction algorithm is adopted to improve the real-time performance and tracking accuracy of the system:

x̂_{k+1} = 2x_k − x_{k−1}

where t_k is the gate information of the k-th field, x_k is the x coordinate of the k-th field's gate information, and x̂_{k+1} is the predicted x coordinate of the gate center; the predicted y coordinate ŷ_{k+1} of the gate center is obtained in the same way;
Step 5-4, after target tracking with the adaptive gate is achieved, the track is acquired as follows:

firstly, the track is initialized;

secondly, in the ζ-th frame a target appears for the first time, and its gate, center coordinates and frame-number information are determined;

thirdly, in the (ζ+1)-th frame, the target of highest similarity is sought according to the prediction from the ζ-th frame target's gate coordinates, a temporary track is generated, and the target position in the (ζ+2)-th frame is predicted;

fourthly, in the (ζ+2)-th frame, the target track is confirmed from the information of the previous two frames, and the stable tracking stage begins.
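The adaptive gate tracking of steps 5-1 to 5-4 can be sketched, reduced to its core: 2-point linear prediction of the gate center and nearest-detection association inside the gate. The gate width and the (x, y) detection format are illustrative assumptions.

```python
def predict_gate(history):
    """Step 5-3: 2-point linear prediction of the gate centre.
    history holds (x, y) gate centres of past fields; the next centre
    is extrapolated as 2*p_k - p_{k-1}."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    return 2 * x1 - x0, 2 * y1 - y0

def track(detections_per_frame, gate=8):
    """Steps 5-1..5-4, simplified: initialise a track on the first
    detection, confirm it on the next frame, then follow the detection
    closest to the predicted centre inside the square gate (expanded
    by `gate` pixels in every direction)."""
    path = []
    for dets in detections_per_frame:
        if not dets:
            continue
        if len(path) < 2:
            path.append(dets[0])               # initialisation / temporary track
            continue
        px, py = predict_gate(path)            # predicted centre for this frame
        inside = [d for d in dets
                  if abs(d[0] - px) <= gate and abs(d[1] - py) <= gate]
        if inside:                             # nearest detection inside the gate
            path.append(min(inside, key=lambda d: (d[0]-px)**2 + (d[1]-py)**2))
    return path
```

A head moving steadily is followed through the frames while a distant clutter detection at (50, 50) never enters the gate and is ignored.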
Step 5-5, counting human body targets and recording images:
the number of people entering and exiting is recorded at intervals, so that monitoring personnel can conveniently check the passenger flow in each period; meanwhile, the generated depth images are stored on the computer's hard disk and read back from the hard disk when the camera is not working. On one hand this lets the monitor selectively review the passenger-flow video and restore the scene data; on the other hand, the experimental video can be replayed repeatedly to adjust the threshold parameters and improve recognition accuracy.
The present invention will be further described with reference to the following specific examples.
Examples
With reference to the original depth image of Fig. 2, moving human targets that are easily confused with the background (dark hair with dark clothing, or light-colored hair) are processed under visible light with the stereovision-based algorithm.

Fig. 3 is the segmented head-and-shoulder image. As the result of processing the moving targets under visible light clearly shows, even when dark hair with dark clothing, or light-colored hair, is easily confused with the background, the depth image output by the stereovision technique yields high recognition accuracy: the head target, marked with a circle, is separated from the shoulder target, is little affected by illumination and background, and target and background are clearly distinguished.

With reference to Fig. 4, moving human targets in a crowded scene, in close contact with one another, are recognized and their tracks traced; the three-dimensional information of the human body can be displayed, the head target is marked with a square frame, and the curve at the head represents the track of the body's motion. The limitations of two-dimensional human recognition are thus effectively overcome: human targets are accurately identified and tracked, the same target is distinguished and counted once, background noise is effectively removed, and environmental interference is avoided.
Claims (5)
1. A human body target recognition and tracking method based on a stereoscopic vision technology is characterized by comprising the following steps:
step 1, two cameras simultaneously acquire pictures of the same scene from two different angles to form a stereo image pair;
step 2, determining internal and external parameters of the camera through camera calibration, and establishing an imaging model;
step 3, adopting a window-based matching algorithm, creating a window centered on the point to be matched in one image, creating the same sliding window on the other image, moving the sliding window along the epipolar line pixel by pixel, calculating the window matching measure, finding the best matching point, obtaining the three-dimensional geometric information of the target from the parallax principle, and generating a depth image;
step 4, distinguishing head and shoulder information by adopting a one-dimensional maximum entropy threshold segmentation method and combining with gray features, and identifying a human body target;
and 5, tracking the human body target by adopting a self-adaptive wave gate tracking method and obtaining a target track.
2. The human body target recognition and tracking method based on the stereoscopic vision technology as claimed in claim 1, wherein the step 2 is specifically as follows:
step 2-1, calibrating the coordinates of the camera, wherein the calibration graph is a checkerboard, and the calibration principle is as follows:
assuming that the world coordinate system plane with z = 0 is the template plane, [r_1 r_2 r_3] is the rotation matrix of the camera coordinate system relative to the world coordinate system, t is the translation vector of the camera coordinate system relative to the world coordinate system, [X Y 1]^T is the homogeneous coordinate of a point on the template, [u v 1]^T is the homogeneous coordinate of the projection of that template point on the image plane, and K is the camera intrinsic matrix;
step 2-2, let the camera coordinate system Ox_c y_c z_c be a rectangular coordinate system fixed to the camera, with origin O at the camera's optical center, the x_c and y_c axes respectively parallel to the x and y axes of the image physical coordinate system, and the z_c axis coinciding with the optical axis, i.e. perpendicular to the camera's imaging plane; the distance OO_1 from the optical center to the image plane is the effective focal length f of the camera;
step 2-3, let (x_w, y_w, z_w) be the three-dimensional coordinates of a point P in the world coordinate system and (x_c, y_c, z_c) the three-dimensional coordinates of the same point P in the camera coordinate system; the transformation of the point from the world coordinate system into the camera coordinate system is expressed with an orthogonal rotation matrix R and a translation transformation matrix T as:

[x_c, y_c, z_c]^T = R [x_w, y_w, z_w]^T + T

where R is a 3 × 3 rotation matrix and T a 3 × 1 translation matrix;
The orthogonal matrix R is the combination of direction cosines of the optical axis with respect to the coordinate axes of the world coordinate system, and contains three independent angle variables: rotation by the angle ψ about the x axis, by the angle θ about the y axis, and by the angle φ about the z axis; together with the three variables of T these are collectively called the extrinsic parameters of the camera;
step 2-4, simplifying the rigid transformation between the world coordinate system and the camera coordinate system into homogeneous coordinate and matrix form:
the transformation from the camera coordinate system to the ideal image physical coordinate system, i.e. the ideal perspective projection transformation under the pinhole model, has the following formula:
x = f·x_c/z_c,  y = f·y_c/z_c
x and y are respectively the abscissa and the ordinate of an ideal image physical coordinate system;
the above formula is also expressed in terms of homogeneous coordinates and matrices as:
the transformation of the ideal image coordinate system to the image pixel coordinate system, expressed in homogeneous coordinates, is:
the inverse relationship is:
obtaining the relation between the coordinates of the point P represented by the world coordinate system and the coordinates (u, v) of the projection P' of the point P:
where α = f/dx and β = f/dy; M_1 is the intrinsic parameter matrix, M_2 the extrinsic parameter matrix, and M = M_1 M_2 is a 3 × 4 projection matrix expressing the basic relation between the two-dimensional image coordinates and the three-dimensional world coordinates.
3. The human body target recognition and tracking method based on the stereoscopic vision technology as claimed in claim 1, wherein the step 3 is specifically as follows:
step 3-1, taking the right image as the reference, the background is subtracted from it to obtain the foreground image;
step 3-2, determining parallax:
the first step, in the foreground image, assuming that the right image is taken as a reference, calculating the gray difference value of each pixel at a given parallax with the corresponding point of the left image;
secondly, at each disparity, a narrow strip-shaped window perpendicular to the baseline direction is used instead, and the sum of absolute gray differences over the window centered on each pixel is computed with the window-based matching algorithm:

C(d) = Σ_γ Σ_η | I_right[x_e+γ, y_e+η] − I_left[x_e+γ+d, y_e+η] |

where m × n is the size of the template window, γ and η index positions along the window's length and width respectively, I_right[x_e+γ, y_e+η] is the gray value of the right view at coordinate [x_e+γ, y_e+η], I_left[x_e+γ+d, y_e+η] is the gray value of the left view at coordinate [x_e+γ+d, y_e+η], and d is the disparity;
thirdly, within the set parallax range, d is varied from the minimum parallax to the maximum parallax, the values of the above expression are compared in turn, the point corresponding to the minimum value is the best matching point, and its parallax is taken as the parallax value of the pixel;
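The three matching steps above can be sketched in Python as follows (an illustrative implementation, not the patent's code; the narrow-window half-sizes and the synthetic test images are assumptions):

```python
import numpy as np

def sad_disparity(right, left, xe, ye, d_min, d_max, half_w=1, half_h=4):
    """Pick the disparity minimizing the sum of absolute gray differences
    over a window centered at (xe, ye) in the right (reference) image.
    half_w < half_h gives a strip narrow along the baseline (x) direction."""
    best_d, best_cost = d_min, float("inf")
    for d in range(d_min, d_max + 1):
        win_r = right[ye - half_h:ye + half_h + 1,
                      xe - half_w:xe + half_w + 1].astype(int)
        win_l = left[ye - half_h:ye + half_h + 1,
                     xe + d - half_w:xe + d + half_w + 1].astype(int)
        cost = np.abs(win_r - win_l).sum()  # window-based SAD
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

With a left view that is simply the right view shifted along the baseline, the search recovers the shift exactly.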
step 3-3, determining depth information of the target:
knowing the focal length f of the camera, let b be the distance between the optical centers of the two cameras; the depth information of any point is its coordinate on the Z axis of the camera coordinate system; the perpendicular distance from the target Q to the camera is H; the two cameras have the same focal length f, and Q_1, Q_2 are respectively the imaging points of the target Q in the two cameras; d is the parallax; assuming the optical axes of the two cameras are parallel to each other, similar triangles give:
H=(b×f)/d
and the obtained vertical distance from the target Q to the camera is the depth information of the target.
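The similar-triangle result H = (b × f)/d is a one-liner in code; the only care needed is the degenerate zero-disparity case (a hedged sketch; the choice of units for b is an assumption, H comes out in the same units):

```python
def depth_from_disparity(b, f, d):
    """Depth from disparity for parallel optical axes: H = (b * f) / d,
    where b is the optical-center distance (baseline), f the common focal
    length in pixels, and d the disparity in pixels."""
    if d <= 0:
        raise ValueError("zero or negative disparity gives no finite depth")
    return b * f / d
```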
4. The human body target recognition and tracking method based on the stereoscopic vision technology as claimed in claim 1, wherein the step 4 is specifically as follows:
step 4-1, dividing the depth image into small cells of L × L pixels, L being a positive integer; taking a nine-cell (3 × 3) block as a unit, the block is moved by one L × L cell per comparison, in order from left to right and from top to bottom; if the average gray level of the middle cell is higher than that of each of its eight surrounding neighbors, the middle cell is determined to be a head target area;
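The nine-cell scan of step 4-1 can be sketched as follows (illustrative, not the patent's code; it assumes the depth image is encoded so that nearer surfaces such as the head top have higher gray values):

```python
import numpy as np

def head_candidates(depth_img, L=8):
    """Scan 3x3 neighborhoods of LxL cells; a center cell whose mean gray
    level exceeds all eight neighbors is a head candidate (cell coords)."""
    h, w = depth_img.shape
    rows, cols = h // L, w // L
    # mean gray level of every LxL cell in one reshape
    means = depth_img[:rows * L, :cols * L].reshape(rows, L, cols, L).mean(axis=(1, 3))
    hits = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            neigh = means[r - 1:r + 2, c - 1:c + 2].flatten()
            if means[r, c] > np.max(np.delete(neigh, 4)):  # drop the center
                hits.append((r, c))
    return hits
```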
step 4-2, setting a threshold value for the head target area, carrying out binarization, and segmenting the head target; the method specifically comprises the following steps:
adopting the one-dimensional maximum-entropy threshold segmentation method to separate the head region from the non-head region: let p_i represent the proportion of pixels with gray value i in the image, and take the gray level t as the threshold; within the region, the pixel points with gray level higher than t form the head region and the pixel points with gray level not higher than t form the non-head region; writing p_t = Σ_{i=0}^{t} p_i, the entropies of the non-head region and the head region are respectively defined as:
H_B = −Σ_{i=0}^{t} (p_i/p_t)·lg(p_i/p_t) = lg p_t + H_t/p_t
H_O = −Σ_{i=t+1}^{255} [p_i/(1−p_t)]·lg[p_i/(1−p_t)] = lg(1−p_t) + (H_E − H_t)/(1−p_t)
wherein i represents a gray value (0 ≤ i ≤ 255), H_t = −Σ_{i=0}^{t} p_i·lg p_i and H_E = −Σ_{i=0}^{255} p_i·lg p_i; when the sum of the two entropy functions takes its maximum, the gray level t is the threshold for segmenting the image:
t = arg max{H_B + H_O}
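The maximum-entropy threshold can be computed directly from the gray-level histogram. An illustrative Python sketch (natural log is used instead of lg; the base only scales both entropies and does not change the arg-max):

```python
import numpy as np

def max_entropy_threshold(gray):
    """One-dimensional maximum-entropy threshold: choose t maximizing
    H_B(t) + H_O(t), the entropies of the below- and above-threshold classes."""
    hist = np.bincount(np.asarray(gray).ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    best_t, best_h = 0, -np.inf
    for t in range(255):
        pt = p[:t + 1].sum()
        if pt <= 0 or pt >= 1:      # one class empty -> entropy undefined
            continue
        pb = p[:t + 1] / pt          # below-threshold class distribution
        po = p[t + 1:] / (1 - pt)    # above-threshold class distribution
        hb = -np.sum(pb[pb > 0] * np.log(pb[pb > 0]))
        ho = -np.sum(po[po > 0] * np.log(po[po > 0]))
        if hb + ho > best_h:
            best_t, best_h = t, hb + ho
    return best_t
```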
step 4-3, determining the average gray level and the gray variance of the segmented head region:
m̄ = [1/(M·N)]·Σ_{ε=1}^{M} Σ_{η=1}^{N} f(ε, η)
σ² = [1/(M·N)]·Σ_{ε=1}^{M} Σ_{η=1}^{N} [f(ε, η) − m̄]²
wherein M and N respectively represent the numbers of rows and columns of the region, ε and η respectively represent the row and column indices of a unit, and f(ε, η) represents the gray value at that point; when the gray variance is larger than a set threshold, the pixel point is filtered out;
and 4-4, filtering out long, narrow false targets according to whether the ratio of the candidate head's pixel width to its pixel height, under different field-of-view heights, falls within a set range, thereby obtaining the human body target.
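Steps 4-3 and 4-4 amount to two simple filters. A hedged Python sketch (the width-to-height ratio bounds below are illustrative assumptions, since the patent only specifies "a set range"):

```python
import numpy as np

def region_stats(region):
    """Average gray level and gray variance of a segmented region (step 4-3)."""
    region = np.asarray(region, float)
    m = region.mean()
    return m, ((region - m) ** 2).mean()

def is_false_target(width_px, height_px, ratio_min=0.8, ratio_max=1.4):
    """Step 4-4 sketch: reject long, narrow blobs whose width/height ratio
    falls outside an assumed plausible range for a human head."""
    return not (ratio_min <= width_px / height_px <= ratio_max)
```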
5. The human body target recognition and tracking method based on the stereoscopic vision technology as claimed in claim 1, wherein the step 5 is specifically as follows:
step 5-1, setting the gate size as the circumscribed rectangular frame of the human head target expanded by 8 pixels in each of the four directions, the moving direction being from left to right and from top to bottom;
step 5-2, in the process of moving the gate, correcting the rectangular gate frame by means of the single-chip microcomputer, and predicting the target position of the next field from the target positions of the first five fields obtained by the camera; the difference between the predicted value and the accurate value of the target in the actual next field is the gate tracking error;
and 5-3, in the moving process of the gate, a 2-point linear prediction algorithm is adopted to improve the real-time performance and tracking accuracy of the system:
x̂_{k+1} = x_k + (x_k − x_{k−1})·(t_{k+1} − t_k)/(t_k − t_{k−1})
in the formula, t_k is the time of the k-th field of gate information, x_k is the x coordinate of the gate center in the k-th field of gate information, and x̂_{k+1} is the predicted value of the X coordinate of the gate center; for a constant field period this reduces to x̂_{k+1} = 2·x_k − x_{k−1}; the predicted value of the Y coordinate of the gate center can be obtained in the same way;
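Under a constant field period, the 2-point prediction of step 5-3 is one line of Python (an illustrative sketch, not the patent's code):

```python
def predict_next(x_k, x_km1):
    """Two-point linear prediction assuming constant velocity between
    equally spaced fields: x_hat_{k+1} = 2*x_k - x_{k-1}."""
    return 2 * x_k - x_km1
```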
and 5-4, after target tracking is realized by adopting the self-adaptive gate, the track is acquired as follows:
firstly, initializing the track;
secondly, when the target appears for the first time in the ζ-th frame, determining the gate, the center coordinates and the frame-number information of the target;
thirdly, in the (ζ+1)-th frame, searching for the target with the highest similarity according to the prediction from the gate coordinates of the ζ-th frame target, generating a temporary track, and predicting the position of the target in the (ζ+2)-th frame;
and fourthly, in the (ζ+2)-th frame, determining the target track according to the information of the previous two frames, and entering the stable tracking stage.
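The four track-acquisition steps above can be mirrored by a small state machine. A minimal Python sketch (the class name, the two-observation prediction, and the three-frame confirmation rule are illustrative assumptions consistent with steps one to four):

```python
class Track:
    """Track-initiation flow of step 5-4:
    first detection -> temporary track -> confirmed on the third frame."""

    def __init__(self, frame, center):
        self.history = [(frame, center)]   # first appearance (frame zeta)
        self.state = "tentative"

    def predict(self):
        # two-point linear prediction once two observations exist
        if len(self.history) < 2:
            return self.history[-1][1]
        (_, p1), (_, p2) = self.history[-2:]
        return (2 * p2[0] - p1[0], 2 * p2[1] - p1[1])

    def update(self, frame, center):
        self.history.append((frame, center))
        if len(self.history) >= 3:         # frame zeta+2: stable tracking
            self.state = "confirmed"
```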
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510550955.9A CN106485735A (en) | 2015-09-01 | 2015-09-01 | Human body target recognition and tracking method based on stereovision technique |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106485735A true CN106485735A (en) | 2017-03-08 |
Family
ID=58235758
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510550955.9A Pending CN106485735A (en) | 2015-09-01 | 2015-09-01 | Human body target recognition and tracking method based on stereovision technique |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106485735A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130236058A1 (en) * | 2007-07-03 | 2013-09-12 | Shoppertrak Rct Corporation | System And Process For Detecting, Tracking And Counting Human Objects Of Interest |
CN102222265A (en) * | 2010-04-13 | 2011-10-19 | 上海申腾三盛信息技术工程有限公司 | Binocular vision and laterally mounted video camera-based passenger flow counting method |
CN103839038A (en) * | 2012-11-23 | 2014-06-04 | 浙江大华技术股份有限公司 | People counting method and device |
CN103455792A (en) * | 2013-08-20 | 2013-12-18 | 深圳市飞瑞斯科技有限公司 | Guest flow statistics method and system |
CN104504688A (en) * | 2014-12-10 | 2015-04-08 | 上海大学 | Method and system based on binocular stereoscopic vision for passenger flow density estimation |
Non-Patent Citations (6)
Title |
---|
Liu Xin: "Bus Passenger Flow Statistics Method and Implementation Based on Stereo Vision", China Master's Theses Full-text Database, Engineering Science and Technology II * |
Sun Zhongxu: "Research on Distance and Speed Measurement of Moving Objects Based on Binocular Vision", China Master's Theses Full-text Database, Information Science and Technology * |
Yin Zhangqin: "Design of a Three-dimensional Automatic Passenger Flow Counting System", China Master's Theses Full-text Database, Information Science and Technology * |
Li Qiuzhu: "Research on Multi-target Tracking System Algorithms Based on DSP", China Master's Theses Full-text Database, Information Science and Technology * |
Yang Fan: "Digital Image Processing and Analysis", 31 May 2015, Beihang University Press * |
Cyganek and Siebert: "An Introduction to 3D Computer Vision Techniques and Algorithms", 1 October 2014, National Defense Industry Press * |
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108960014A (en) * | 2017-05-23 | 2018-12-07 | 北京旷视科技有限公司 | Image processing method, device and system and storage medium |
CN108960014B (en) * | 2017-05-23 | 2021-05-11 | 北京旷视科技有限公司 | Image processing method, device and system and storage medium |
CN109753858A (en) * | 2017-11-07 | 2019-05-14 | 北京中科慧眼科技有限公司 | A kind of road barricade object detecting method and device based on binocular vision |
US11367276B2 (en) | 2017-12-22 | 2022-06-21 | Hangzhou Ezviz Software Co., Ltd. | Target detection method and apparatus |
CN109961455A (en) * | 2017-12-22 | 2019-07-02 | 杭州萤石软件有限公司 | Target detection method and device |
CN109155067A (en) * | 2018-01-23 | 2019-01-04 | 深圳市大疆创新科技有限公司 | The control method of moveable platform, equipment, computer readable storage medium |
US12002221B2 (en) | 2018-01-23 | 2024-06-04 | SZ DJI Technology Co., Ltd. | Control method and device for mobile platform, and computer readable storage medium |
US11227388B2 (en) | 2018-01-23 | 2022-01-18 | SZ DJI Technology Co., Ltd. | Control method and device for mobile platform, and computer readable storage medium |
CN108413584A (en) * | 2018-02-11 | 2018-08-17 | 四川虹美智能科技有限公司 | A kind of air-conditioning and its control method |
CN109308718B (en) * | 2018-08-09 | 2022-09-23 | 上海青识智能科技有限公司 | Space personnel positioning device and method based on multiple depth cameras |
CN109308718A (en) * | 2018-08-09 | 2019-02-05 | 上海青识智能科技有限公司 | A kind of space personnel positioning apparatus and method based on more depth cameras |
CN109274871A (en) * | 2018-09-27 | 2019-01-25 | 维沃移动通信有限公司 | A kind of image imaging method and device of mobile terminal |
CN109104574A (en) * | 2018-10-09 | 2018-12-28 | 中国科学院深圳先进技术研究院 | A kind of method and system of cradle head camera masking privacy area |
CN109446999A (en) * | 2018-10-31 | 2019-03-08 | 中电科新型智慧城市研究院有限公司 | Quick sensing system and method based on the dynamic human body movement that statistics calculates |
CN109446999B (en) * | 2018-10-31 | 2021-08-31 | 中电科新型智慧城市研究院有限公司 | Rapid sensing system and method for dynamic human body movement based on statistical calculation |
CN109352654A (en) * | 2018-11-23 | 2019-02-19 | 武汉科技大学 | A kind of intelligent robot system for tracking and method based on ROS |
CN109684991A (en) * | 2018-12-24 | 2019-04-26 | 北京旷视科技有限公司 | Image processing method, device, electronic equipment and storage medium |
CN110342134A (en) * | 2019-07-23 | 2019-10-18 | 珠海市一微半导体有限公司 | A kind of garbage classification identifying system and its method based on binocular vision |
CN110443228A (en) * | 2019-08-20 | 2019-11-12 | 图谱未来(南京)人工智能研究院有限公司 | A kind of method for pedestrian matching, device, electronic equipment and storage medium |
WO2021098081A1 (en) * | 2019-11-22 | 2021-05-27 | 大连理工大学 | Trajectory feature alignment-based multispectral stereo camera self-calibration algorithm |
US11575873B2 (en) | 2019-11-22 | 2023-02-07 | Dalian University Of Technology | Multispectral stereo camera self-calibration algorithm based on track feature registration |
CN111028271B (en) * | 2019-12-06 | 2023-04-14 | 浩云科技股份有限公司 | Multi-camera personnel three-dimensional positioning and tracking system based on human skeleton detection |
CN111080712A (en) * | 2019-12-06 | 2020-04-28 | 浩云科技股份有限公司 | Multi-camera personnel positioning, tracking and displaying method based on human body skeleton detection |
CN111080712B (en) * | 2019-12-06 | 2023-04-18 | 浩云科技股份有限公司 | Multi-camera personnel positioning, tracking and displaying method based on human body skeleton detection |
CN111028271A (en) * | 2019-12-06 | 2020-04-17 | 浩云科技股份有限公司 | Multi-camera personnel three-dimensional positioning and tracking system based on human skeleton detection |
CN110900611A (en) * | 2019-12-13 | 2020-03-24 | 合肥工业大学 | Novel mechanical arm target positioning and path planning method |
CN111145220B (en) * | 2019-12-31 | 2022-11-18 | 东南大学 | Tunnel target track tracking method based on visual information |
CN111145220A (en) * | 2019-12-31 | 2020-05-12 | 东南大学 | Tunnel target track tracking method based on visual information |
CN111243274A (en) * | 2020-01-20 | 2020-06-05 | 陈俊言 | Road collision early warning system and method for non-internet traffic individuals |
WO2021139176A1 (en) * | 2020-07-30 | 2021-07-15 | 平安科技(深圳)有限公司 | Pedestrian trajectory tracking method and apparatus based on binocular camera calibration, computer device, and storage medium |
CN112396633A (en) * | 2020-10-19 | 2021-02-23 | 北京理工大学 | Target tracking and track three-dimensional reproduction method and device based on single camera |
CN112396633B (en) * | 2020-10-19 | 2023-02-28 | 北京理工大学 | Target tracking and track three-dimensional reproduction method and device based on single camera |
CN112396611A (en) * | 2020-10-27 | 2021-02-23 | 武汉理工大学 | Point-line visual odometer self-adaptive optimization method and device and storage medium |
CN112396611B (en) * | 2020-10-27 | 2024-02-13 | 武汉理工大学 | Self-adaptive optimization method, device and storage medium for point-line visual odometer |
CN112580426A (en) * | 2020-10-30 | 2021-03-30 | 江苏集萃未来城市应用技术研究所有限公司 | Monocular vision-based outdoor personnel dynamic tracking and positioning method |
CN113835074A (en) * | 2021-08-04 | 2021-12-24 | 南京常格科技发展有限公司 | People flow dynamic monitoring method based on millimeter wave radar |
CN113835074B (en) * | 2021-08-04 | 2024-01-16 | 南京常格科技发展有限公司 | Dynamic people flow monitoring method based on millimeter wave radar |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106485735A (en) | Human body target recognition and tracking method based on stereovision technique | |
CN106503605A (en) | Human body target recognition methods based on stereovision technique | |
CN108549873B (en) | Three-dimensional face recognition method and three-dimensional face recognition system | |
Zaarane et al. | Distance measurement system for autonomous vehicles using stereo camera | |
CN108416791B (en) | Binocular vision-based parallel mechanism moving platform pose monitoring and tracking method | |
CN108629946B (en) | Human body falling detection method based on RGBD sensor | |
WO2020172783A1 (en) | Head posture tracking system used for transcranial magnetic stimulation diagnosis and treatment | |
CN108628306B (en) | Robot walking obstacle detection method and device, computer equipment and storage medium | |
CN110910421B (en) | Weak and small moving object detection method based on block characterization and variable neighborhood clustering | |
CN109035307B (en) | Set area target tracking method and system based on natural light binocular vision | |
ES2764484T3 (en) | Asynchronous signal processing procedure | |
CN105913013A (en) | Binocular vision face recognition algorithm | |
CN102831427A (en) | Texture feature extraction method fused with visual significance and gray level co-occurrence matrix (GLCM) | |
CN105869166A (en) | Human body action identification method and system based on binocular vision | |
CN107203743B (en) | Face depth tracking device and implementation method | |
CN109961092B (en) | Binocular vision stereo matching method and system based on parallax anchor point | |
CN112257641A (en) | Face recognition living body detection method | |
Byeong-Ho | A Review on Image and Video processing | |
CN102510437B (en) | Method for detecting background of video image based on distribution of red, green and blue (RGB) components | |
CN109443319A (en) | Barrier range-measurement system and its distance measuring method based on monocular vision | |
CN107145820B (en) | Binocular positioning method based on HOG characteristics and FAST algorithm | |
Wang et al. | LBP-based edge detection method for depth images with low resolutions | |
CN105138979A (en) | Method for detecting the head of moving human body based on stereo visual sense | |
CN117113284A (en) | Multi-sensor fusion data processing method and device and multi-sensor fusion method | |
CN111489384B (en) | Method, device, system and medium for evaluating shielding based on mutual viewing angle |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2017-03-08