CN109035305B - Indoor human body detection and tracking method based on RGB-D low-visual-angle condition - Google Patents


Info

Publication number
CN109035305B
CN109035305B (granted); application number CN201810908661.2A
Authority
CN
China
Prior art keywords
cluster
point cloud
human body
points
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810908661.2A
Other languages
Chinese (zh)
Other versions
CN109035305A (en
Inventor
袁泽慧
段荣杰
安晓红
李世中
张亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North University of China
Original Assignee
North University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North University of China filed Critical North University of China
Priority to CN201810908661.2A
Publication of CN109035305A
Application granted
Publication of CN109035305B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251 Analysis of motion using feature-based methods involving models
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Abstract

The invention relates to an indoor human body detection and tracking method under RGB-D low-visual-angle conditions, belonging to the technical field of human body detection. The method comprises the following steps: collecting a 3D point cloud with an ASUS Xtion Pro; performing noise reduction and down-sampling on the 3D point cloud; detecting and removing the ground; performing 3D clustering using the Euclidean distance between points; computing the HOG features of each cluster and feeding them to a pre-trained SVM binary soft classifier, classifying clusters with high HOG confidence as people, thereby achieving human body detection; and finally realizing human body tracking using a joint likelihood probability formed by color consistency and distance consistency in the data association process. The invention has high precision and is widely applicable to indoor human body detection and tracking under low-visual-angle conditions.

Description

Indoor human body detection and tracking method based on RGB-D low-visual-angle condition
Technical Field
The invention relates to an indoor human body detection and tracking method based on RGB-D low visual angle, belonging to the technical field of human body detection.
Background
Human detection and tracking is key to the tasks of mobile robots in indoor environments. A mobile robot must be able to distinguish human bodies from other obstacles in order to adjust its trajectory according to its task. A service robot, for example, must provide assistance to a particular person in a particular environment.
Currently, there are methods that detect and track the human body using only an RGB-D depth camera or radar ranging. With the advent of RGB-D depth cameras such as the Microsoft Kinect or ASUS Xtion Pro, which can capture 640 × 480 pixel images at a rate of 30 frames/s, and owing to their low power consumption, these sensors have in recent years been widely used for robot 3D perception and indoor positioning and recognition.
Several of the more successful human detection algorithms are based on the whole human body, especially when the head is visible. In some cases, such as when the robot is particularly close to the subject, a large portion of the subject may be outside the sensor's sensing range. Also, when the RGB-D sensor is mounted on a small robot such as the Turtlebot 2, its line of sight is very close to the ground, so only the lower body or only the legs of a person are observed. Human detection and tracking is very difficult at low viewing angles, mainly because major features are lost and no obvious features remain to distinguish human bodies from other objects, such as table legs and chairs. Based on the observation that when an RGB-D sensor is installed on a Turtlebot 2 and a person is very close to the robot (distance < 100 cm), in most cases the person is visible from the feet to the waist, we propose a human detection and tracking algorithm for the low-viewing-angle case.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides an indoor human body detection and tracking method based on the RGB-D low-visual-angle condition, which clusters objects on the ground at a limited height, distinguishes people or objects by combining HOG characteristics and an SVM classifier, then tracks a human body by using combined likelihood probability formed by color consistency and distance consistency, and has higher accuracy.
In order to achieve the above object, the technical solution adopted by the present invention is an indoor human body detection and tracking method based on RGB-D low viewing angle, comprising the following steps,
step a. data acquisition and preprocessing
Performing point cloud collection using the ASUS Xtion Pro to obtain a dense 3D point cloud, performing noise reduction on the dense 3D point cloud through a pass-through filter, and down-sampling the noise-reduced 3D point cloud through a three-dimensional voxel grid filter;
step b, ground detection and filtering
Detecting the ground in the 3D point cloud processed in step a using a RANSAC-based least squares method, wherein for non-first-frame data the ground parameters obtained from the previous frame's detection can be used as the initial parameters of the next frame; performing ground detection on the current frame data using the RANSAC method according to the given initial ground parameters and a preset initial distance threshold, and filtering the ground data out of the 3D point cloud according to the indices of all three-dimensional points lying on the ground; for the first frame data, determining the relation to the ground according to the mounting position of the RGB-D sensor on the robot, thereby giving the initial ground parameters;
step c. clustering
Acquiring point cloud data within 130 cm of the ground from the 3D point cloud with the ground removed, then performing 3D clustering using the Euclidean distance between points, where two points are defined to belong to the same class when the Euclidean distance between them is smaller than a predefined distance threshold; clustering requires two initial thresholds: the distance below which two points are considered to belong to the same cluster, and the minimum number of points required to form a cluster;
step d. HOG + SVM classification
Projecting the point cloud inside each bounding box after 3D clustering onto the RGB image, computing the HOG descriptor of the resulting image patch, then feeding the obtained HOG descriptor to a pre-trained SVM classifier and computing the HOG confidence of each cluster; when the computed HOG confidence is higher than a set threshold, the cluster is judged to be a person, otherwise it is judged not to be a person;
step e. tracking
The obtained human body clusters are used as the input of the tracking module, i.e., as the objects to be tracked next; the human body clusters detected in each frame are then matched against the existing tracked objects, and the maximum likelihood probability between the currently detected human body and the known tracked objects is computed using a method combining distance consistency and color consistency.
Preferably, in step c, when clustering is performed: 1) a Kd-tree of the 3D point cloud is first created as the search method used in the subsequent point cloud extraction; 2) an empty cluster list C and an empty point queue Q are set up; 3) for each point p in the point cloud, all neighboring points within a sphere centered at p with a preset distance threshold as radius are searched; each neighbor found is first checked for membership in another cluster, and if it belongs to none, it is added to queue Q; 4) after all points in queue Q have been processed, queue Q is added to cluster list C; 5) clustering finishes once all points in the initial point cloud have been processed.
Preferably, in step c, after clustering is completed, the over-clustering and under-clustering problems need further treatment;
the process of over-clustering is as follows: for each resulting cluster CiFirst, the projection p of its center point on the XZ plane is calculatediIf p isiAnd cluster CjCentral projection point p ofjIs less than a set threshold, then cluster C is considered to beiAnd CjBelong to the same cluster and then combine the clusters;
the under-clustering process is as follows: for each cluster, calculating geometrical information of the cluster, wherein the geometrical information specifically comprises width, depth and height information; if the geometric information of some clusters is far larger than a set threshold value, further dividing the clusters by using color information, namely classifying the points with the same color into the same class; for clusters where there are too few points, discard directly.
Preferably, in step e,
Distance consistency is defined as follows: given a human detection cluster C_i, the nearest tracked object T_j is found by global nearest neighbor data association; if their distance is less than a threshold, the detected cluster is considered linked to the tracked object. Then, for each point p_{i,j} in the tracked object T_j, the nearest point p_{j,i} in the detection target C_i is found using an octree and the distance between them is computed. The distance consistency probability between points p_{i,j} and p_{j,i} is defined as follows:
[Equation rendered as an image in the original: the distance consistency probability L_d(p_{i,j}, p_{j,i})]
where α is a weight vector;
Color consistency is defined as follows: when comparing the color information of the current detection cluster C_i and the tracked object T_j, the color consistency between the nearest pair of points <p_{j,i}, p_{i,j}> is computed; it can be calculated in RGB, HSV, or other color spaces. Taking HSV space as an example, the color consistency probability between points p_{i,j} and p_{j,i} is defined as follows:
[Equation rendered as an image in the original: the color consistency probability L_c(p_{i,j}, p_{j,i})]
where c_{i,j} and c_{j,i} denote the HSV information of p_{i,j} and p_{j,i} respectively, and β denotes a weight;
The joint consistency probability between p_{i,j} and p_{j,i} is defined as:
L(p_{i,j}, p_{j,i}) = L_d(p_{i,j}, p_{j,i}) · L_c(p_{i,j}, p_{j,i});
For each tracked object T_j and detection cluster C_i, the maximum joint likelihood probability is defined as:
L(j, i) = max over point pairs <p_{i,j}, p_{j,i}> of L(p_{i,j}, p_{j,i});
If L(j, i) is higher than the set threshold, the current cluster C_i and the tracked object T_j are the same person; otherwise, if no tracked object associated with C_i is found, a new tracked object is created.
Compared with the prior art, the invention has the following technical effects. The invention provides a human body detection algorithm that takes the lower half of the human body as the main feature, targeting the case where only the lower half of the body is visible, i.e., when the RGB-D sensor sits at a low viewing angle close to the ground or when the detected subject is close to the sensor. The algorithm effectively improves the accuracy of human body detection. Based on the common-sense fact that humans move on the ground, the ground in the scene is first detected and removed, objects on the ground are clustered up to a limited height, HOG features are computed and fed to a pre-trained SVM binary soft classifier, and objects with high HOG confidence values are classified as people while the rest are classified as non-human. The detected human body results are then used as the input to the tracking module; the tracked object closest to the currently detected human body cluster is sought using the joint likelihood probability formed by color consistency and distance consistency, and when the maximum likelihood probability is greater than a set threshold, the currently detected human body cluster and the tracked object are considered to be the same person.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a diagram of the dense 3D point cloud data collected by the present invention.
FIG. 3 is a point cloud diagram of the dense 3D point cloud data of the present invention after being preprocessed.
Fig. 4 is a schematic diagram of the recognition situation under different environments.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, an indoor human body detection and tracking method based on RGB-D low viewing angle condition includes the following steps,
Step a. Data acquisition and preprocessing
As shown in fig. 2, a dense 3D point cloud is obtained by performing point cloud collection with the ASUS Xtion Pro. Owing to various error sources, in particular the discretization effect in the depth measurement and the fact that the camera is calibrated only within a certain range, a large amount of noise is present in the acquired initial RGB-D point cloud, and each acquired frame contains 307200 points, corresponding to the 640 × 480 size of its depth image. Therefore, to improve the processing speed and accuracy, points of no interest are first removed with a pass-through filter; for example, according to the parameters of the Xtion Pro, points beyond 5 m in the z direction are filtered out. A three-dimensional voxel grid filter is then used to down-sample the point cloud: the algorithm approximates all points inside a voxel by their center of gravity, realizing the down-sampling of the point cloud data and reducing the computational load. The side length of a voxel (i.e., a three-dimensional cube) is set to 0.06 m. Down-sampling compresses the point cloud data acquired by the RGB-D sensor by an order of magnitude, greatly shortening later point cloud processing, while making the density of the down-sampled data the same everywhere. The processed point cloud data is shown in fig. 3.
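The centroid-per-voxel down-sampling described above (0.06 m leaf size) can be sketched as follows. This is a minimal NumPy illustration, not the actual voxel grid filter the authors used (presumably PCL's VoxelGrid); `voxel_downsample` is a name chosen here.

```python
import numpy as np

def voxel_downsample(points, leaf=0.06):
    """Down-sample an (N, 3) point cloud by replacing all points in each
    voxel (cube of side `leaf`, 0.06 m in the text) with their centroid."""
    idx = np.floor(points / leaf).astype(np.int64)      # voxel index per point
    # Group points by voxel index and average each group.
    _, inverse = np.unique(idx, axis=0, return_inverse=True)
    n_voxels = inverse.max() + 1
    sums = np.zeros((n_voxels, 3))
    counts = np.zeros(n_voxels)
    np.add.at(sums, inverse, points)
    np.add.at(counts, inverse, 1)
    return sums / counts[:, None]
```

Because every occupied voxel contributes exactly one point, the output density is roughly uniform, which matches the remark that the down-sampled data has the same density everywhere.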
Step b, ground detection and filtering
This step is based on the assumption that a person walks on the ground, so we first detect the ground in the filtered point cloud and remove it. When the RGB-D sensor is fixedly mounted on the mobile robot, its position relative to the ground is approximately known; based on this, initial ground parameters are set when the first frame's point cloud is processed, the ground in the 3D point cloud is detected with a RANSAC-based least squares method starting from those initial parameters, and the updated ground parameters obtained are used as the initial parameters for the next frame. This allows the ground parameters to be updated in real time in the robot coordinate system. In theory, the ground is strictly a plane whose parameters are fixed relative to the robot, i.e., in the robot coordinate system; in practice, however, camera vibration during robot motion and slight inclinations of the actual ground make the ground parameters differ in the robot coordinate system at different times.
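The RANSAC plane detection above can be sketched as below. The threshold and iteration count are illustrative; a full implementation would also refine the plane by least squares over the inliers and warm-start from the previous frame's parameters, as the text describes.

```python
import numpy as np

def ransac_plane(points, threshold=0.05, iters=200, rng=None):
    """Fit a plane n·x + d = 0 to an (N, 3) point cloud with RANSAC.
    Returns (unit normal, d, inlier mask). `threshold` is the inlier
    distance; values here are illustrative, not the patent's."""
    rng = np.random.default_rng(rng)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_model = None
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:            # degenerate (collinear) sample
            continue
        n = n / norm
        d = -n.dot(sample[0])
        inliers = np.abs(points @ n + d) < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_model = inliers, (n, d)
    return best_model[0], best_model[1], best_inliers
```

Once the plane is found, the inlier mask gives exactly the "indices of all three-dimensional points lying on the ground" used to filter the cloud.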
Step c. clustering
Acquiring point cloud data within 130 cm of the ground from the 3D point cloud with the ground removed, then performing 3D clustering using the Euclidean distance between points, where two points are defined to belong to the same class when the Euclidean distance between them is smaller than a predefined distance threshold; clustering requires two initial thresholds: the distance below which two points are considered to belong to the same cluster, and the minimum number of points required to form a cluster.
Once the ground is detected, the three-dimensional points belonging to it can be removed, so that objects on the ground are no longer connected to it and appear as disconnected objects. Since our human body detection method uses the lower part of the human body as the feature description, the point cloud data to be analyzed is limited to points within 130 cm of the ground, and 3D clustering is then performed using the Euclidean distance between points.
In clustering: 1) a Kd-tree of the 3D point cloud is first created as the search method for the subsequent point cloud extraction; 2) an empty cluster list C and an empty point queue Q are set up; 3) for each point p in the point cloud, all neighboring points within a sphere centered at p with a preset distance threshold as radius are searched; each neighbor found is first checked for membership in another cluster, and if it belongs to none, it is added to queue Q; 4) after all points in queue Q have been processed, queue Q is added to cluster list C; 5) clustering finishes once all points in the initial point cloud have been processed.
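Steps 1)-5) can be sketched compactly with SciPy's Kd-tree; the tolerance and minimum cluster size below are illustrative values, not the patent's.

```python
import numpy as np
from scipy.spatial import cKDTree

def euclidean_cluster(points, tol=0.1, min_size=10):
    """Euclidean clustering as in steps 1)-5): a Kd-tree radius search
    grows each cluster through a queue of unprocessed points."""
    tree = cKDTree(points)                 # 1) build the Kd-tree
    processed = np.zeros(len(points), dtype=bool)
    clusters = []                          # 2) empty cluster list C
    for seed in range(len(points)):
        if processed[seed]:
            continue
        queue = [seed]                     # seed a new queue Q
        processed[seed] = True
        i = 0
        while i < len(queue):              # 3) expand via radius search
            for nb in tree.query_ball_point(points[queue[i]], tol):
                if not processed[nb]:
                    processed[nb] = True
                    queue.append(nb)
            i += 1
        if len(queue) >= min_size:         # 4) accept queue Q into list C
            clusters.append(queue)
    return clusters                        # 5) all points processed
```

Each returned cluster is a list of point indices; the minimum-size check plays the role of the second initial threshold described above.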
After Euclidean clustering, two typical problems arise in practice: over-clustering and under-clustering. (1) Over-clustering: a point cloud that should belong to a single object is segmented into multiple clusters, mainly due to noise, occlusion, and missing depth data. (2) Under-clustering, as the name implies: points that should belong to two or more different objects are improperly grouped into the same cluster. For example, when a person stands very close to a background wall, cabinet, or table, the person's points are often lumped in with the background or furniture. In our experiments, since only points within 130 cm of the ground are considered, it is not uncommon for the points of two different people to be merged together by the clustering operation.
To solve these two problems, the resulting clusters need further processing after the initial Euclidean clustering.
For the over-clustering problem, after clustering is completed, for each obtained cluster C_i the projection p_i of its center point on the XZ plane is first computed; if the distance between p_i and the center projection p_j of cluster C_j is less than a set threshold, clusters C_i and C_j are considered to belong to the same cluster and are merged.
For under-clustering, color information is used to separate a person and background that have been grouped into one class. This is based on the assumption that the color of the person's trousers differs from the background color, for example blue jeans against white walls. The specific algorithm is as follows: for each cluster, its geometric information is computed, specifically the width, depth, and height; if the geometric information of a cluster far exceeds a set threshold, the cluster is further divided using color information, i.e., points with the same color are grouped into the same class; likewise, clusters with too few points are discarded directly in this operation.
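The XZ-center merge used against over-clustering can be sketched as follows; the merge tolerance is an illustrative value, and the color-based split for under-clustering would follow the same pattern on HSV values.

```python
import numpy as np

def merge_overclusters(clusters, merge_tol=0.3):
    """Merge over-segmented clusters (lists/arrays of 3D points): if the
    XZ-plane projections of two cluster centers are closer than
    `merge_tol` (illustrative), treat them as one object."""
    # Project each cluster's center point onto the XZ plane.
    centers = [np.asarray(c).mean(axis=0)[[0, 2]] for c in clusters]
    merged, used = [], [False] * len(clusters)
    for i in range(len(clusters)):
        if used[i]:
            continue
        group = list(clusters[i])
        used[i] = True
        for j in range(i + 1, len(clusters)):
            if not used[j] and np.linalg.norm(centers[i] - centers[j]) < merge_tol:
                group.extend(clusters[j])   # C_i and C_j belong together
                used[j] = True
        merged.append(group)
    return merged
```

Projecting onto XZ deliberately ignores height, so an upper-body fragment and a leg fragment of the same person merge even though their Y extents differ.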
Step d. HOG + SVM classification
Projecting the point cloud inside each bounding box after 3D clustering onto the RGB image, computing the HOG descriptor of the resulting image patch, then feeding the obtained HOG descriptor to a pre-trained SVM classifier and computing the HOG confidence of each cluster; when the computed HOG confidence is higher than a set threshold, the cluster is judged to be a person, otherwise it is judged not to be a person;
step e. tracking
The obtained human body clusters are used as the input of the tracking module, i.e., as the objects to be tracked next; the detected human body clusters are then matched against the existing tracked objects, specifically by computing the maximum likelihood probability between the currently detected human body and the known tracked objects using a method combining distance consistency and color consistency.
The tracking algorithm estimates the trajectory of each target using a particle filter. Assuming that a person moves on the ground, the tracked state of each person is represented as a 2D transform, i.e., the position (x, y) of the center of gravity and the rotation angle θ. The motion model is set to constant-velocity motion, since this model copes well with full occlusion.
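A constant-velocity prediction step for the (x, y, θ) state described above might look like the sketch below. The particle layout (x, y, θ, vx, vy) and the noise levels are assumptions made here for illustration, not the patent's actual particle filter.

```python
import numpy as np

def predict_particles(particles, dt, pos_noise=0.05, ang_noise=0.02, rng=None):
    """Constant-velocity prediction for (M, 5) particles laid out as
    (x, y, theta, vx, vy); noise magnitudes are illustrative."""
    rng = np.random.default_rng(rng)
    particles = particles.copy()
    m = len(particles)
    particles[:, 0] += particles[:, 3] * dt + rng.normal(0, pos_noise, m)
    particles[:, 1] += particles[:, 4] * dt + rng.normal(0, pos_noise, m)
    particles[:, 2] += rng.normal(0, ang_noise, m)   # theta as a random walk
    return particles
```

During full occlusion no measurement update is available, and this prediction alone carries the hypothesis forward, which is why a constant-velocity model suits that case.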
When a cluster is classified as human but is not associated with any known tracked object, a new tracked object is created. Naturally, every human body detected in the first frame is initialized as a new tracked object.
To match the person detected in the current frame with the known tracked objects, consistency based on color and distance is computed as follows:
Distance consistency is defined as follows: given a human detection cluster C_i, the nearest human tracked object T_j is found by global nearest neighbor data association; if their distance is less than a threshold, the detected cluster is considered associated with the tracked object. Then, for each point p_{i,j} in the tracked object T_j, the nearest point p_{j,i} in the detection target C_i is found using an octree and the distance between them is computed. The distance consistency probability between points p_{i,j} and p_{j,i} is defined as follows:
[Equation rendered as an image in the original: the distance consistency probability L_d(p_{i,j}, p_{j,i})]
where α is a weight vector;
Color consistency is defined as follows: when comparing the color information of the current detection cluster C_i and the tracked object T_j, the color consistency between the nearest pair of points <p_{j,i}, p_{i,j}> is computed. Color consistency may be calculated in RGB, HSV, or other color spaces. Taking HSV space as an example, the color consistency probability between points p_{i,j} and p_{j,i} is defined as follows:
[Equation rendered as an image in the original: the color consistency probability L_c(p_{i,j}, p_{j,i})]
where c_{i,j} and c_{j,i} denote the HSV information of p_{i,j} and p_{j,i} respectively, and β denotes a weight;
The joint consistency probability between p_{i,j} and p_{j,i} is defined as:
L(p_{i,j}, p_{j,i}) = L_d(p_{i,j}, p_{j,i}) · L_c(p_{i,j}, p_{j,i});
For each tracked object T_j and detection cluster C_i, the maximum joint likelihood probability is defined as:
L(j, i) = max over point pairs <p_{i,j}, p_{j,i}> of L(p_{i,j}, p_{j,i});
If L(j, i) is higher than the set threshold, the current cluster C_i and the tracked object T_j are the same person; otherwise, if no tracked object associated with C_i is found, a new tracked object is created.
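Because the exact likelihood formulas appear only as equation images in the original, the sketch below assumes a simple exponential-decay form exp(-a·d) for both consistency terms; the structure (nearest-neighbor pairing, product of distance and color terms, maximum over pairs) follows the text.

```python
import numpy as np
from scipy.spatial import cKDTree

def joint_likelihood(track_pts, det_pts, track_hsv, det_hsv,
                     alpha=1.0, beta=1.0):
    """Maximum joint likelihood L(j, i) between a tracked object
    (track_pts, track_hsv) and a detection cluster (det_pts, det_hsv).
    The exponential forms of L_d and L_c are assumptions; the patent
    gives those formulas only as images."""
    tree = cKDTree(det_pts)
    dists, idx = tree.query(track_pts)        # nearest detection point per track point
    l_dist = np.exp(-alpha * dists)           # distance consistency L_d (assumed form)
    color_d = np.linalg.norm(track_hsv - det_hsv[idx], axis=1)
    l_color = np.exp(-beta * color_d)         # color consistency L_c (assumed form)
    return float(np.max(l_dist * l_color))    # L(j, i) = max over point pairs
```

A Kd-tree stands in here for the octree named in the text; either spatial index serves the same nearest-neighbor role.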
To validate the proposed algorithm, the following experiment was performed: an ASUS Xtion Pro and a laser radar were installed on a Turtlebot 2, where the Xtion Pro collects the raw RGB-D point cloud data of the environment and the laser radar is used for obstacle avoidance.
As shown in fig. 4, in order to verify the method proposed in the present invention, experiments were performed in three different scenarios, respectively:
simple environment: in an environment without obstacles, the sensor is fixed and two persons move with the same trajectory, as shown by a in fig. 4.
Moderate environment: in an environment without obstacles, the sensor is fixed, two or more persons move at random, and their motion trajectories cross, as shown by b in fig. 4.
Difficult environment: there are obstacles, the robot moves, and three or more people walk at random with their trajectories crossing one another, as shown by c in fig. 4.
For each scene, the video sequence contains about 250 frames, and the total test set includes 798 frames. There are 2698 person instances in total, manually marked on the RGB images as ground truth.
Human body detection results:
in order to verify the performance of the proposed method, frame-based metrics are used, and the advantages and disadvantages of the proposed method are measured by using three items of precision (p), recall (r) and f1 score. The three terms are defined as:
p = TP / (TP + FP)
r = TP / (TP + FN)
f1 = 2pr / (p + r)
where TP denotes true positives, FP false positives, and FN false negatives.
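The three frame-based metrics follow directly from the TP/FP/FN counts, for example:

```python
def detection_metrics(tp, fp, fn):
    """Frame-based precision, recall, and f1 score, reading TP, FP,
    and FN as counts of true positives, false positives, and false
    negatives."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1
```

The f1 score is the harmonic mean of precision and recall, which is why the table rows below always lie between the two.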
As shown in fig. 4, each detected human body is enclosed in a green frame. It is evident from the figure that the experimental results are better for the first two cases, especially in the simple environment. In the difficult scenario, however, the results are not as good as the first two, which is readily understood since that scene contains obstacles. In some cases, classification by geometric information and HOG + SVM is difficult, so some obstacles are erroneously recognized as human bodies. The human detection performance in the three environments is recorded in table one.
To reduce the false positive rate, the HOG confidence threshold is set strictly to -2.2. This helps reduce the false positive rate, but also increases the false negative rate.
Table one: Performance parameters of the human body detection model in three different environments

Environment             Precision   Recall   f1 score
Simple environment      0.97        0.91     0.94
Moderate environment    0.92        0.88     0.89
Difficult environment   0.82        0.78     0.80
Human body tracking results:
We evaluated our tracking in terms of false positive (FP) and false negative (FN) rates; table two records the tracking performance in the three environments. Results are better in the simple and moderate cases, whereas in the difficult case the FP and FN rates are somewhat higher, at 5.8% and 5.2% respectively. This is mainly because in that scene people move faster, are occluded by others, or leave the camera's field of view.
Table two: Performance parameters of the human body tracking model in three different environments

Environment             FP      FN
Simple environment      2.4%    1.8%
Moderate environment    4.6%    4.4%
Difficult environment   5.8%    5.2%
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principles of the present invention are intended to be included therein.

Claims (4)

1. An indoor human body detection and tracking method based on RGB-D low visual angle condition is characterized by comprising the following steps,
step a. data acquisition and preprocessing
Performing point cloud collection using the ASUS Xtion Pro to obtain a dense 3D point cloud, performing noise reduction on the dense 3D point cloud through a pass-through filter, and down-sampling the noise-reduced 3D point cloud through a three-dimensional voxel grid filter;
step b, ground detection and filtering
Detecting the ground in the 3D point cloud processed in step a using a RANSAC-based least squares method, wherein for non-first-frame data the ground parameters obtained from the previous frame's detection can be used as the initial parameters of the next frame; performing ground detection on the current frame data using the RANSAC method according to the given initial ground parameters and a preset initial distance threshold, and filtering the ground data out of the 3D point cloud according to the indices of all three-dimensional points lying on the ground; for the first frame data, determining the relation to the ground according to the mounting position of the RGB-D sensor on the robot, thereby giving the initial ground parameters;
step c. clustering
Acquiring point cloud data within 130 cm of the ground from the 3D point cloud with the ground removed, then performing 3D clustering using the Euclidean distance between points, where two points are defined to belong to the same class when the Euclidean distance between them is smaller than a predefined distance threshold; clustering requires two initial thresholds: the distance below which two points are considered to belong to the same cluster, and the minimum number of points required to form a cluster;
step d. HOG + SVM classification
Projecting the point cloud inside each bounding box after 3D clustering onto the RGB image, computing the HOG descriptor of the resulting image patch, then feeding the obtained HOG descriptor to a pre-trained SVM classifier and computing the HOG confidence of each cluster; when the computed HOG confidence is higher than a set threshold, the cluster is judged to be a person, otherwise it is judged not to be a person;
step e. tracking
The obtained human body clusters are used as the input of the tracking module, i.e., as the objects to be tracked next; the human body clusters detected in each frame are then matched against the existing tracked objects, and the maximum likelihood probability between the currently detected human body and the known tracked objects is computed using a method combining distance consistency and color consistency.
2. The method as claimed in claim 1, wherein in step c: 1) a Kd-tree of the 3D point cloud is first created as the search method used in the subsequent point cloud extraction; 2) an empty cluster list C and an empty point queue Q are set up; 3) for each point p in the point cloud, all neighboring points within a sphere centered at p with a preset distance threshold as radius are searched; each neighbor found is first checked for membership in another cluster, and if it belongs to none, it is added to queue Q; 4) after all points in queue Q have been processed, queue Q is added to cluster list C; 5) clustering finishes once all points in the initial point cloud have been processed.
3. The indoor human body detection and tracking method based on the RGB-D low-visual-angle condition as claimed in claim 2, wherein in step c, after clustering is completed, the problems of over-clustering and under-clustering are further handled;
the process of over-clustering is handled as follows: for each obtained cluster C_i, the projection p_i of its centre point on the XZ plane is first calculated; if the distance between p_i and the centre projection point p_j of a cluster C_j is less than a set threshold, clusters C_i and C_j are considered to belong to the same cluster and are merged;
the process of under-clustering is handled as follows: for each cluster, its geometric information, specifically width, depth and height, is calculated; if the geometric information of a cluster greatly exceeds a set threshold, the cluster is further divided using color information, i.e. points with the same color are classified into the same class; clusters with too few points are discarded directly.
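The merge-and-discard part of this post-processing can be sketched as below. The merge distance and minimum cluster size are assumptions for illustration; the color-based split of oversized clusters is omitted to keep the sketch short.

```python
import numpy as np

def postprocess(clusters, merge_dist=0.3, min_points=3):
    """Handle over-clustering and tiny clusters: discard clusters with
    fewer than `min_points` points, then merge clusters whose centre
    projections on the XZ plane lie closer than `merge_dist`.
    Each cluster is an (N, 3) array of XYZ points."""
    clusters = [c for c in clusters if len(c) >= min_points]
    merged = []
    for c in clusters:
        centre = c[:, [0, 2]].mean(axis=0)       # projection on XZ plane
        for k, (m, mc) in enumerate(merged):
            if np.linalg.norm(centre - mc) < merge_dist:
                m = np.vstack([m, c])            # same person: merge
                merged[k] = (m, m[:, [0, 2]].mean(axis=0))
                break
        else:
            merged.append((c, centre))
    return [m for m, _ in merged]

a = np.array([[0.0, 0.2, 1.0], [0.1, 1.0, 1.0], [0.0, 1.4, 1.1]])
b = a + np.array([0.05, 0.0, 0.05])   # a second fragment of the same person
tiny = np.array([[3.0, 0.5, 4.0]])    # noise cluster, discarded
out = postprocess([a, b, tiny])
print(len(out))                       # 1
```

Projecting centres onto the XZ plane deliberately ignores height, so a head fragment and a torso fragment of the same person merge even though their Y centres differ.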
4. The indoor human body detection and tracking method based on the RGB-D low-visual-angle condition as claimed in claim 1, wherein in step e,
distance consistency is defined as follows: given a human detection cluster C_i, the nearest tracking object T_j is found by processing the global nearest neighbour data; if their distance is less than a threshold, the detected cluster is considered linked to the tracked cluster; then, for each point p_{i,j} in the tracked object T_j, the corresponding point p_{j,i} in the detection target C_i is found using an octree method and the distance between them is calculated; the distance consistency probability between the points p_{i,j} and p_{j,i} is defined as follows:
[formula image FDA0003059271330000031: definition of the distance consistency probability L_d(p_{i,j}, p_{j,i})]
wherein α is a weight vector;
color consistency is defined as follows: when comparing the current detection cluster C_i and the tracking object T_j, the color consistency between the nearest pair of points <p_{j,i}, p_{i,j}> is calculated; it can be computed in RGB, HSV or another color space; in HSV space, the color consistency probability between the points p_{i,j} and p_{j,i} is defined as follows:
[formula image FDA0003059271330000032: definition of the color consistency probability L_c(p_{i,j}, p_{j,i})]
wherein c_{i,j} and c_{j,i} represent the HSV information of p_{i,j} and p_{j,i} respectively, and β represents a weight;
the joint consistency probability between p_{i,j} and p_{j,i} is defined as:
L(p_{i,j}, p_{j,i}) = L_d(p_{i,j}, p_{j,i}) · L_c(p_{i,j}, p_{j,i});
for each tracked object T_j and detection cluster C_i, the maximum joint likelihood probability is defined as:
[formula image FDA0003059271330000041: definition of the maximum joint likelihood probability L(j, i)]
if L(j, i) is higher than the set threshold, the current cluster C_i and the tracking object T_j are the same person; otherwise, if no tracking object associated with C_i is found, a new tracking object is created.
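The matching step can be sketched as below. The patent's consistency equations are published only as images, so the exponential forms L_d = exp(-α·‖p − q‖) and L_c = exp(-β·‖c − c'‖) used here are assumptions, chosen only to be consistent with the stated product L = L_d · L_c; the weights α, β are likewise illustrative, and brute-force search stands in for the octree.

```python
import numpy as np

def joint_likelihood(track_pts, track_hsv, det_pts, det_hsv,
                     alpha=5.0, beta=2.0):
    """Maximum joint distance/color consistency between a tracked
    object (track_pts, track_hsv) and a detection cluster
    (det_pts, det_hsv).  The exponential terms are assumed forms."""
    best = 0.0
    for p, c in zip(track_pts, track_hsv):
        # nearest detected point to this tracked point (brute force
        # stands in for the octree search of the claim)
        d = np.linalg.norm(det_pts - p, axis=1)
        j = int(d.argmin())
        l_d = np.exp(-alpha * d[j])                           # distance term
        l_c = np.exp(-beta * np.linalg.norm(det_hsv[j] - c))  # color term
        best = max(best, l_d * l_c)                           # max over pairs
    return best

track = np.array([[0.0, 1.0, 2.0], [0.1, 1.2, 2.0]])
hsv_t = np.array([[0.5, 0.8, 0.9], [0.5, 0.8, 0.9]])
det_same = track + 0.01                       # same person, barely moved
det_far = track + np.array([4.0, 0.0, 0.0])   # a different, distant cluster
same = joint_likelihood(track, hsv_t, det_same, hsv_t)
far = joint_likelihood(track, hsv_t, det_far, hsv_t)
```

A detection whose score exceeds the set threshold is associated with the tracked object; otherwise a new track is created, exactly as the claim states.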
CN201810908661.2A 2018-08-10 2018-08-10 Indoor human body detection and tracking method based on RGB-D low-visual-angle condition Active CN109035305B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810908661.2A CN109035305B (en) 2018-08-10 2018-08-10 Indoor human body detection and tracking method based on RGB-D low-visual-angle condition

Publications (2)

Publication Number Publication Date
CN109035305A CN109035305A (en) 2018-12-18
CN109035305B true CN109035305B (en) 2021-06-25

Family

ID=64632680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810908661.2A Active CN109035305B (en) 2018-08-10 2018-08-10 Indoor human body detection and tracking method based on RGB-D low-visual-angle condition

Country Status (1)

Country Link
CN (1) CN109035305B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175528B (en) * 2019-04-29 2021-10-26 北京百度网讯科技有限公司 Human body tracking method and device, computer equipment and readable medium
CN110110687B (en) * 2019-05-15 2020-11-17 江南大学 Method for automatically identifying fruits on tree based on color information and three-dimensional contour information
CN110456308B (en) * 2019-07-08 2021-05-04 广西工业职业技术学院 Three-dimensional space positioning rapid searching method
CN111582352B (en) * 2020-04-30 2023-06-27 上海高仙自动化科技发展有限公司 Object-based perception method, object-based perception device, robot and storage medium
CN112070840B (en) * 2020-09-11 2023-10-10 上海幻维数码创意科技股份有限公司 Human body space positioning and tracking method fused by multiple depth cameras
CN113033481B (en) * 2021-04-20 2023-06-02 湖北工业大学 Handheld stick detection method based on first-order full convolution target detection algorithm

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106384079A (en) * 2016-08-31 2017-02-08 东南大学 RGB-D information based real-time pedestrian tracking method
CN107016373A (en) * 2017-04-12 2017-08-04 广东工业大学 The detection method and device that a kind of safety cap is worn
CN107491712A (en) * 2016-06-09 2017-12-19 北京雷动云合智能技术有限公司 A kind of human body recognition method based on RGB D images
CN107833270A (en) * 2017-09-28 2018-03-23 浙江大学 Real-time object dimensional method for reconstructing based on depth camera

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9111351B2 (en) * 2011-12-15 2015-08-18 Sony Corporation Minimizing drift using depth camera images

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Target detection and tracking for ground mobile robots based on RGB-D sensors; Zhang Song; China Master's Theses Full-text Database; 20170815; full text *
Development and application of functional software for mobile service robots based on RGB-D sensors; Ju Qing; China Master's Theses Full-text Database; 20170315; full text *


Similar Documents

Publication Publication Date Title
CN109035305B (en) Indoor human body detection and tracking method based on RGB-D low-visual-angle condition
JP6288221B2 (en) Enhanced layer-based object detection by deep convolutional neural networks
Mittal et al. M2tracker: A multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo
CN109934848B (en) Method for accurately positioning moving object based on deep learning
CN104573614B (en) Apparatus and method for tracking human face
Zhou et al. Self‐supervised learning to visually detect terrain surfaces for autonomous robots operating in forested terrain
Gritti et al. Kinect-based people detection and tracking from small-footprint ground robots
Eppenberger et al. Leveraging stereo-camera data for real-time dynamic obstacle detection and tracking
Damen et al. Detecting carried objects from sequences of walking pedestrians
CN104966062B (en) Video monitoring method and device
Zhang et al. Multiple vehicle-like target tracking based on the velodyne lidar
Wang et al. An overview of 3d object detection
JP2016099982A (en) Behavior recognition device, behaviour learning device, method, and program
Huang et al. Fish tracking and segmentation from stereo videos on the wild sea surface for electronic monitoring of rail fishing
Volkhardt et al. Fallen person detection for mobile robots using 3D depth data
Hsieh et al. Abnormal scene change detection from a moving camera using bags of patches and spider-web map
Batool et al. Telemonitoring of daily activities based on multi-sensors data fusion
Fang et al. Real-time RGB-D based people detection and tracking system for mobile robots
Saisan et al. Multi-view classifier swarms for pedestrian detection and tracking
KR101542206B1 (en) Method and system for tracking with extraction object using coarse to fine techniques
US20200258237A1 (en) Method for real time surface tracking in unstructured environments
Kuo et al. People counting base on head and shoulder information
Behendi et al. Non-invasive performance measurement in combat sports
Pane et al. A people counting system for business analytics
Rougier et al. 3D head trajectory using a single camera

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant