CN111144207A - Human body detection and tracking method based on multi-mode information perception

Human body detection and tracking method based on multi-mode information perception

Info

Publication number
CN111144207A
CN111144207A (application CN201911146615.4A; granted as CN111144207B)
Authority
CN
China
Prior art keywords
tracking
head
depth
color
image
Prior art date
Legal status
Granted
Application number
CN201911146615.4A
Other languages
Chinese (zh)
Other versions
CN111144207B (en)
Inventor
周波
黄文超
甘亚辉
房芳
钱堃
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201911146615.4A
Publication of CN111144207A
Application granted
Publication of CN111144207B
Active legal status
Anticipated expiration

Classifications

    • G06V 40/103: Human or animal bodies; static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06F 18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06T 5/40: Image enhancement or restoration using histogram techniques
    • G06T 5/70: Image enhancement or restoration; denoising; smoothing
    • G06T 7/50: Image analysis; depth or shape recovery
    • G06T 7/80: Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T 7/90: Image analysis; determination of colour characteristics
    • G06T 2207/20032: Indexing scheme for image analysis; median filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body detection and tracking method based on multi-mode information perception, which comprises the following steps: calibrating a color camera and a depth camera and performing data filtering; detecting the body and the head of a person in the color image and the depth image respectively, and fusing the two detection results according to the spatial proportion between head and body; tracking the body and the head in the color image and the depth image respectively with a kernelized correlation filter (KCF) tracking algorithm, and establishing a model of the tracked object; and perfecting the tracking mechanism with the tracked-object model and the spatial constraint of the head-to-body ratio. Based on multi-modal information perception, the disclosed method overcomes the shortcomings of purely vision-based target detection and tracking, has wide application in the field of indoor service robots, and supports functions such as human-robot interaction and user following.

Description

Human body detection and tracking method based on multi-mode information perception
Technical Field
The invention belongs to the field of indoor service robot applications, and particularly relates to a human body detection and tracking method based on multi-mode information perception, in particular to a method for long-term robust detection and tracking in unstructured indoor environments and scenes with changing illumination.
Background
With the development of computer vision technology and the rise of artificial intelligence, intelligent service robots, especially indoor mobile service robots, are finding ever wider application. In an indoor environment, a robot needs to perceive a complex, unstructured scene and interact with humans, and visual information alone is not sufficient to cope with dynamic changes in ambient lighting. The RGB-D camera is a new type of vision sensor that simultaneously provides high-resolution color and depth images, making it an excellent tool for human-robot interaction. An efficient method is therefore needed that fully exploits this multimodal information for detection and tracking.
Most existing target detection and tracking methods adopt a camera-plus-laser solution. A two-dimensional laser directly acquires the geometric information of the environment with high precision and fast processing. However, the amount of usable information is small: only simple shape features can be extracted, and these are easily confused with similar objects in the environment. Camera-based detection and tracking methods can be further divided into methods based on hand-crafted features and methods based on deep learning. Methods based on hand-crafted features extract predefined features, train a classifier, and detect with a sliding window; their computational cost is controllable and the extracted features have a clear meaning, but their accuracy is insufficient. Methods based on deep learning achieve higher precision, but their computational load is large and they cannot run in real time on an ordinary computing platform.
In summary, the conventional target detection and tracking methods above have the following problems: 1) it is difficult to reach a satisfactory balance between accuracy and real-time performance; 2) using only a single information source, color or depth, cannot achieve detection and tracking in complex environments; 3) they lack analysis and handling of short-term algorithm failure, so tracking is easily lost and robustness is poor.
Disclosure of Invention
The purpose of the invention is as follows: in order to overcome the defects of the prior art, a human body detection and tracking method based on multi-mode information perception is provided, so as to solve the problem of real-time and robust human body detection and tracking in complex environments.
The technical scheme is as follows: in order to achieve the above purpose, the invention adopts the following technical scheme:
A human body detection and tracking method based on multi-modal information perception, which comprises the following steps:
(1) calibrating a color camera and a depth camera and performing data filtering, aligning the color image and the depth image through calibration, and then performing filtering respectively;
(2) human body detection based on multi-modal information perception: detecting a body in the color image, detecting a head in the depth image, and fusing according to the spatial proportion information;
(3) human body tracking based on multi-modal information perception: respectively tracking the body and the head in the color image and the depth image with a kernelized correlation filter (KCF) target tracking algorithm, and establishing a tracked-object model to check the tracking result;
(4) perfecting the tracking mechanism with the tracked-object model and the spatial constraint of the head-to-body ratio: if a single tracker fails during tracking, tracking stability is maintained according to the spatial position constraint between head and body.
Further, the step (1) comprises the following steps:
(11) a color camera and a depth camera are respectively used for shooting a plurality of checkerboard pictures at different angles and different distances, so that each position of an image can be covered;
(12) detecting and matching corner points in different images, and calculating the internal and external parameter matrices of the color camera from the matched corner-point pairs, wherein the internal parameter matrix of the color camera is:

$$K = \begin{bmatrix} f_x & s & x_0 \\ 0 & f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix}$$

where $f_x$ and $f_y$ are the focal lengths, $(x_0, y_0)$ are the principal-point coordinates relative to the imaging plane, and $s$ is the axis skew parameter;

the external parameter matrix of the color camera is:

$$[\,R \mid t\,]$$

where $R$ is a 3×3 rotation matrix and $t$ is a 3×1 translation vector, both relative to the world coordinate system;
(13) mapping the depth values onto the color image;
let $P$ be a spatial point, $p_{rgb}$ and $p_{ir}$ its coordinates in the color image and the depth image, $K_{rgb}$ and $K_{ir}$ the internal parameter matrices of the color and depth cameras, and $R_{ir}$ and $t_{ir}$ the external parameters of the depth camera with the color camera as the reference frame; the depth value at point $P$ is mapped to color-image coordinates by:

$$p_{rgb} = K_{rgb} R_{ir} K_{ir}^{-1} p_{ir} + t_{ir}$$
(14) simultaneously shrinking the registered color image and the depth image, removing high-frequency noise in the color image with Gaussian filtering, and removing missing-depth points in the depth image with median filtering.
Further, the step (2) comprises the following steps:
(21) scanning the color image with a sliding window, extracting the HOG features within the window, and judging with a trained SVM classifier whether the window contains a human body, thereby obtaining all windows in the color image that may contain a human body;
(22) likewise using a sliding window in the depth image, extracting the Haar features within the window, and classifying with an Adaboost classifier whether the window is a human head, thereby obtaining all windows in the depth image that may contain a human head;
(23) fusion detection according to the spatial proportion information: fusing the results of step (21) and step (22) according to the head-to-body ratio to obtain a detection result fusing the multi-modal information.
Further, the step (21) includes the steps of:
(211) scanning the color image with a multi-scale sliding window: first enlarging and reducing the original color image by a preset ratio to obtain color images at multiple scales, then sliding a fixed-size window over each color image and checking whether the window contains a human body;
(212) extracting the HOG features in the window, wherein the HOG features are extracted by the following steps:
(2121) graying and gray level normalization;
firstly, graying the whole color image, and then normalizing;
(2122) calculating the gradient of each pixel in the color image;
the gradients $G_x, G_y$ of the color image in the x and y directions at $(x, y)$ are:

$$G_x(x,y) = I(x+1,y) - I(x-1,y)$$

$$G_y(x,y) = I(x,y+1) - I(x,y-1)$$

the gradient magnitude is then

$$G(x,y) = \sqrt{G_x(x,y)^2 + G_y(x,y)^2}$$

and its direction is

$$\theta(x,y) = \arctan\left(\frac{G_y(x,y)}{G_x(x,y)}\right);$$
(2123) dividing the pixels into 8×8 cells, counting the gradient information of all pixels within a cell, and representing the result by a histogram over gradient directions;
(2124) dividing the image into 16×16 blocks and contrast-normalizing the gradient histograms within each block;
(2125) setting a detection window of size 64×128 and generating its feature vector by concatenating the feature vectors of all blocks within the window for subsequent classification;
zooming the original color image to form an image pyramid, sliding the detection window over the color image at the current scale, classifying each position with the trained SVM classifier, and judging whether a human body is present at that position; finally applying non-maximum suppression to the obtained results to eliminate multiple detection windows on the same target.
Further, the step (22) includes the steps of:
(221) scanning the depth image with a multi-scale sliding window: first enlarging and reducing the original depth image by a preset ratio to obtain depth images at multiple scales, then sliding a fixed-size window over each depth image and checking whether the window contains a human head;
(222) extracting Haar features in the window;
the Haar features are simple rectangular-block features of three types, namely edge features, linear features and diagonal features; the feature value of each rectangular region is the sum of the pixels in the white region minus the sum of the pixels in the black region;
(223) adaboost classification, which is to train a classifier by using an AdaBoost algorithm;
the AdaBoost algorithm boosts a weak learner with sufficient data so as to generate a high-precision strong learner; a weak classifier $h_j(x)$ is given by:

$$h_j(x) = \begin{cases} 1, & p_j f_j(x) < p_j \theta_j \\ 0, & \text{otherwise} \end{cases}$$

where $f_j$ is a feature, $\theta_j$ is a threshold, $p_j$ controls the direction of the inequality, and $x$ is a 24×24 image sub-window; training is run N rounds to obtain N weak classifiers, with a normalized weight, which is a probability distribution, maintained over the samples in the n-th round; for each feature $j$ a classifier $h_j$ using only that single feature is trained, the classifier $h_n$ with the lowest error is selected, and the weights are updated, yielding a strong classifier;
classifying the Haar feature vectors in the detection windows obtained in step (222) with the Adaboost classifier, which gives a probability score for the presence of a human head in each detection window;
(224) synthesizing the human head detection results in the depth image: performing non-maximum suppression according to the probability score of each window to obtain the human head detection result in the depth image.
Further, the step (23) includes the steps of:
(231) acquiring the head and body detection results: obtaining the body frames in the color image from step (21) and the head frames in the depth image from step (22), then traversing the set of body frames and performing the following operations on each body frame;
(232) judging whether the body frame contains a head frame; if not, deleting the body frame and returning to step (231); if so, performing step (233);
(233) judging whether the number of head frames within the body frame is 1; if so, associating the body frame and the head frame to form a multi-modal combined human body detection; if the body frame contains more than one head frame, selecting the optimal head frame according to the head frames' positions and respective confidences, and then associating the optimal head frame with the current body frame.
Further, the step (3) comprises the following steps:
(31) establishing models of the tracked object in the color image and the depth image;
in the color image the model is a color histogram, and in the depth image it is a depth template picture; the color histogram is extracted as follows: first convert RGB to the HSV color space, where H is hue, S is saturation and V is brightness; then extract the H channel according to the following formula and count the distribution of H values within the window to form the color histogram; the depth template picture is extracted by cropping the head bounding box before tracking starts and scaling it to a standard size, which serves as the template picture for head tracking in the depth image;
R′=R/255,G′=G/255,B′=B/255
Cmax=max(R′,G′,B′),Cmin=min(R′,G′,B′)
Δ=Cmax-Cmin
$$H = \begin{cases} 0^\circ, & \Delta = 0 \\ 60^\circ \times \left(\dfrac{G'-B'}{\Delta} \bmod 6\right), & C_{max} = R' \\ 60^\circ \times \left(\dfrac{B'-R'}{\Delta} + 2\right), & C_{max} = G' \\ 60^\circ \times \left(\dfrac{R'-G'}{\Delta} + 4\right), & C_{max} = B' \end{cases}$$
(32) tracking the body in the color image and the head in the depth image simultaneously with the kernelized correlation filter (KCF) algorithm, as follows: pixel values around the tracked object are extracted with a circulant matrix as training samples, a discriminant function is trained by ridge regression, and the samples are transformed to a kernel space by a kernel transformation to handle samples that are not linearly separable;
(33) matching and updating the object model during tracking; the matching method computes the normalized correlation coefficient between the tracked object and the initial model:

$$d(H_1,H_2) = \frac{\sum_i H_1(i)\,H_2(i)}{\sqrt{\sum_i H_1(i)^2 \,\sum_i H_2(i)^2}}$$

$$R(T,I) = \frac{\sum_{x,y} T(x,y)\,I(x,y)}{\sqrt{\sum_{x,y} T(x,y)^2 \,\sum_{x,y} I(x,y)^2}}$$

where $d$ is the normalized correlation coefficient of the color histograms $H_1$ and $H_2$, and $R$ is the normalized correlation coefficient of the depth template pictures $T$ and $I$; both values lie in the range [0,1], where a larger value indicates a better match and 0 the worst match; if the matching value is greater than 0.9, i.e. the algorithm's confidence in the tracking result is high, the model is updated by weighting: the initial model has weight $1-w$ and the current model of the tracked object weight $w$, where $w = 0.5 \times d$ or $w = 0.5 \times R$.
Further, the step (4) comprises the following steps:
(41) judging tracking validity during tracking according to the normalized correlation coefficients given in step (3): first judging whether head tracking is valid, i.e. whether the normalized correlation coefficient $R$ of the depth template pictures $T$ and $I$ is greater than 0.5; if so, going to step (42), otherwise going to step (43);
(42) judging whether body tracking is valid, i.e. whether the normalized correlation coefficient $d$ of the color histograms $H_1$ and $H_2$ is greater than 0.5; if so, the body tracking result in the current color image and the head tracking result in the depth image are both valid and the normal tracking process continues; otherwise going to step (44);
(43) judging whether body tracking is valid, i.e. whether the normalized correlation coefficient $d$ of the color histograms $H_1$ and $H_2$ is greater than 0.5; if so, head tracking in the depth image has failed while body tracking in the color image is still valid, so the head position is estimated according to the spatial position constraint between head and body, head-model matching is performed continuously, and head tracking is recovered once a match succeeds; otherwise going to step (45);
(44) in this case head tracking in the depth image is valid while body tracking in the color image has failed; the approximate body position is estimated from the head position according to the spatial position constraint between head and body, matching of the body color histogram is performed continuously, and body tracking is recovered once a match succeeds;
(45) at this point both head tracking and body tracking have failed, indicating that the tracked object is absent from the color image and the depth image due to occlusion or fast motion; in this case the tracking algorithm stops and a warning must be given to the user so that an appropriate response can be made.
Beneficial effects: compared with the prior art, the invention, based on the multi-modal information acquired by an RGB-D camera, effectively solves the problem of real-time and robust human detection and tracking in complex environments. Using multi-modal information for human detection and tracking improves the algorithm's adaptability to different ambient lighting conditions compared with using color or depth information alone; fusing the detection results of the color image and the depth image with the spatial proportion information raises recall, lowers the false-detection rate, and improves accuracy; combining the tracking results on the color image and the depth image makes comprehensive use of the model features of the tracked object, so results can be verified and recovered during tracking, giving the overall algorithm higher robustness. The method is simple and efficient, supports functions of the indoor service robot such as human-robot interaction and user following, and has a wide application range and good economic benefit.
Drawings
FIG. 1 is a general flow chart of the algorithm;
FIG. 2 is a flow chart of the color image human detection in step (2) of the present invention;
FIG. 3 is a flowchart of the human head detection of the depth image in step (2) of the present invention;
FIG. 4 is a flow chart of fusion detection according to head-to-body ratio in step (2) of the present invention;
FIG. 5 is a flow chart of step (3) of the present invention;
FIG. 6 is a flow chart of step (4) of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and a specific embodiment.
Fig. 1 is a general flowchart of a human body detection and tracking method based on multi-modal information perception, specifically including the following steps:
(1) calibrating a color camera and a depth camera and carrying out data filtering processing;
Firstly, an RGB-D camera is used to collect color and depth data of the surrounding environment; it comprises a color camera, which acquires color images containing the three color values R, G, B, and a depth camera, which acquires depth images containing distance (D) values. Secondly, because there is a positional offset between the color camera and the depth camera inside the RGB-D camera, the cameras are calibrated to obtain the internal and external parameter matrices of both, so that the depth value and the color value of each point correspond one to one. Finally, the color image and the depth image are filtered separately to remove bright spots and noise.
The depth camera measures depth by having an infrared speckle emitter project infrared beams; the beams are reflected back to the depth camera after hitting an obstacle, and the distance is then computed from the geometric relationship of the returned speckles. The depth camera is effectively an ordinary camera fitted with a filter so that it images only infrared light, and it can be calibrated simply by illuminating an object with an infrared light source. The color camera is calibrated with the checkerboard method: the camera to be calibrated shoots several checkerboard pictures from different viewpoints, corner detection is used to match the different pictures, and the camera's internal and external parameter matrices are solved from the resulting equations. The specific steps for calibrating the color camera are:
(11) the RGB-D camera is held in place using a tripod and then a checkerboard picture is taken using a color camera at multiple angles and distances, ensuring that every position of the image can be covered.
(12) Detecting and matching corner points in the different images, and calculating the internal and external parameter matrices of the camera from the matched corner-point pairs. The internal parameter matrix is shown as formula (1), where $f_x$ and $f_y$ are the focal lengths along the x and y axes, $(x_0, y_0)$ are the principal-point coordinates relative to the imaging plane, and $s$ is the axis skew parameter, equal to 0 in the ideal case; the external parameter matrix is shown as formula (2), where $R$ is a 3×3 rotation matrix and $t$ is a 3×1 translation vector, both relative to the world coordinate system;

$$K = \begin{bmatrix} f_x & s & x_0 \\ 0 & f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix} \quad (1)$$

$$[\,R \mid t\,] \quad (2)$$
(13) The depth values in the depth image are mapped onto the color image. Let $P$ be a spatial point, $p_{rgb}$ and $p_{ir}$ its coordinates in the color image and the depth image, $K_{rgb}$ and $K_{ir}$ the internal parameter matrices of the color and depth cameras, and $R_{ir}$ and $t_{ir}$ the external parameters of the depth camera with the color camera as the reference frame; the depth value at point $P$ can then be mapped to color-image coordinates by formula (3);

$$p_{rgb} = K_{rgb} R_{ir} K_{ir}^{-1} p_{ir} + t_{ir} \quad (3)$$
(14) Both the registered color image and the depth image are scaled to 480×270; Gaussian filtering removes high-frequency noise in the color image and median filtering removes missing-depth points in the depth image.
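To make this registration concrete, the following Python sketch implements the back-projection and re-projection that formula (3) abbreviates, under the standard pinhole model; the function name, the use of NumPy, and the assumption that the depth map holds metric depth values are ours, not the patent's.

```python
import numpy as np

def register_depth_to_color(depth, K_ir, K_rgb, R_ir, t_ir):
    """Project every depth pixel into the color image (cf. formula (3)).

    depth: HxW array of metric depth values from the depth camera;
    K_ir, K_rgb: 3x3 intrinsics; R_ir: 3x3 rotation; t_ir: 3-vector.
    Returns an HxW depth map aligned with the color image.
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([us.ravel(), vs.ravel(), np.ones(h * w)])  # 3xN homogeneous
    # back-project each pixel to a 3-D point in the depth-camera frame
    pts_ir = (np.linalg.inv(K_ir) @ pix) * depth.ravel()
    # move the points into the color-camera frame
    pts_rgb = R_ir @ pts_ir + np.asarray(t_ir, dtype=float).reshape(3, 1)
    # project into the color image and round to integer pixel coordinates
    proj = K_rgb @ pts_rgb
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    out = np.zeros((h, w))
    ok = (proj[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out[v[ok], u[ok]] = pts_rgb[2, ok]  # registered depth in the color frame
    return out
```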
(2) Human body detection based on multi-modal information perception: firstly, the color image is scanned with a sliding window, HOG (Histogram of Oriented Gradients) features are extracted within the window, and a trained SVM (Support Vector Machine) classifier judges whether the window contains a human body, yielding all windows in the color image that may contain a human body; secondly, a sliding window is likewise used in the depth image, Haar features within the window are extracted, and an Adaboost classifier classifies whether the window is a human head, yielding all windows in the depth image that may contain a human head; finally, fusion detection according to the spatial proportion information fuses the two detection results according to the person's head-to-body ratio (about 1:7) to obtain a detection result fusing the multi-modal information;
(21) Firstly, human body detection based on HOG features and an SVM classifier is carried out in the color image. The flow chart of this operation is shown in fig. 2, and the specific steps are:
(211) the multi-scale sliding window scans the color image. Firstly, enlarging and reducing an original color image according to the proportion of 1.05 (generally in the interval of 1.01-1.5) to obtain color images with multiple scales; then, a sliding window of a fixed size (64 × 128) is slid over each color image to check whether a human body is included in the window.
(212) HOG features within the window are extracted. The extraction steps of the HOG features are as follows:
(2121) Graying and gray-level normalization. Since HOG features mainly describe edge and gradient structure, color information contributes little; to reduce the influence of dim illumination, the whole color image is first grayed and then normalized.
(2122) The gradient of each pixel in the color image is calculated. The gradients $G_x, G_y$ in the x and y directions at $(x, y)$ are given by formulas (4) and (5); the gradient magnitude at that position is

$$G(x,y) = \sqrt{G_x(x,y)^2 + G_y(x,y)^2}$$

and its direction is

$$\theta(x,y) = \arctan\left(\frac{G_y(x,y)}{G_x(x,y)}\right)$$
$$G_x(x,y) = I(x+1,y) - I(x-1,y) \quad (4)$$

$$G_y(x,y) = I(x,y+1) - I(x,y-1) \quad (5)$$
(2123) The pixels are divided into 8×8 cells, the gradient information of all pixels in a cell is accumulated, and the result is represented by a histogram over gradient directions. The orientation channels of the histogram are evenly distributed over 0°-180° (unsigned gradient) or 0°-360° (signed gradient). To reduce aliasing, the votes are bilinearly interpolated between neighboring channels in both orientation and position.
(2124) The image is divided into 16×16 blocks and the gradient histograms are contrast-normalized within each block.
(2125) A detection window of size 64×128 is set, and its feature vector is generated by concatenating the feature vectors of all blocks inside the window for subsequent classification.
(213) SVM classification. The feature vectors of the detection windows obtained in step (212) are classified with the SVM classifier, which gives a probability score (range 0-1) that a human body is present in the window.
(214) Human body detection in the color image. The classification results of all detection windows are combined, and non-maximum suppression is performed according to each window's probability score to obtain the human body detection result in the color image.
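As a concrete counterpart of steps (211)-(214), the sketch below uses OpenCV's HOGDescriptor, whose defaults match the 64×128 window, 8×8 cells and 16×16 blocks described above, together with the library's pre-trained pedestrian SVM; the patent trains its own classifier, and the input file name is a placeholder.

```python
import cv2

hog = cv2.HOGDescriptor()  # defaults: 64x128 window, 16x16 blocks, 8x8 cells
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

frame = cv2.imread("frame_color.png")           # placeholder input frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # step (2121): graying
# scale=1.05 builds the image pyramid of step (211); overlapping hits are
# grouped internally, playing the role of the suppression in step (214)
boxes, weights = hog.detectMultiScale(gray, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in boxes:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```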
(22) Secondly, human head detection based on Haar features and an Adaboost cascade classifier is adopted in the depth image. A flow chart of this operation is shown in fig. 3. The method comprises the following specific steps:
(221) the multi-scale sliding window scans the depth image. Firstly, enlarging and reducing an original depth image according to the proportion of 1.05 (generally in the interval of 1.01-1.5) to obtain depth images with multiple scales; then, a sliding window of a fixed size (30 × 30) is slid over each depth image to check whether the window contains a human head.
(222) Haar features within the window are extracted. Haar features are simple rectangular-block features of three types, namely edge features, linear features and diagonal features; the feature value of each rectangular region is the sum of the pixels in the white region minus the sum of the pixels in the black region. Feature computation can be accelerated with an integral image;
(223) Adaboost classification. The classifier is trained with the AdaBoost algorithm, which boosts a weak learner with sufficient data so as to generate a high-precision strong learner. A weak classifier $h_j(x)$ is shown in formula (6), where $f_j$ is a feature, $\theta_j$ is a threshold, $p_j$ controls the direction of the inequality, and $x$ is a 24×24 depth-image sub-window. Training is run N rounds to obtain N weak classifiers, with a normalized weight, which is a probability distribution, maintained over the samples in the n-th round. For each feature $j$ a classifier $h_j$ using only that single feature is trained, the classifier $h_n$ with the lowest error is selected, and the weights are updated, yielding a strong classifier;

$$h_j(x) = \begin{cases} 1, & p_j f_j(x) < p_j \theta_j \\ 0, & \text{otherwise} \end{cases} \quad (6)$$
The classifiers are then combined into a more complex classifier with a cascaded structure. A cascade classifier is a combination of a series of strong classifiers; the classifier at each level is thresholded to minimize false negatives so that most targets pass through while non-target regions are rejected. Front-end classifiers use few features and compute quickly, back-end classifiers use many features, and very few windows ever reach the back end, so the overall computation is very fast. The Haar feature vectors of the detection windows obtained in step (222) are classified with the Adaboost classifier, which gives a probability score (range 0-1) that a human head is present in the window.
(224) Human head detection in the depth image. The classification results of all detection windows are combined, and non-maximum suppression is performed according to each window's probability score to obtain the human head detection result in the depth image.
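Steps (221)-(224) map directly onto OpenCV's cascade-classifier pipeline. The sketch below assumes a cascade trained on depth images of human heads; the XML file name is hypothetical (OpenCV ships only face cascades), and the depth image is assumed to be 8-bit.

```python
import cv2

# hypothetical cascade file trained on depth images of human heads
cascade = cv2.CascadeClassifier("haar_head_depth.xml")

depth8 = cv2.imread("frame_depth.png", cv2.IMREAD_GRAYSCALE)  # 8-bit depth map
heads = cascade.detectMultiScale(
    depth8,
    scaleFactor=1.05,   # pyramid ratio of step (221)
    minNeighbors=3,     # grouping threshold that rejects isolated hits
    minSize=(30, 30),   # the fixed 30x30 sliding-window size
)
```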
(23) According to the fusion detection of the spatial proportion information, the two detection results are fused according to the head-body ratio of the person to obtain the detection result of the fused multi-mode information, and a flow chart is shown in fig. 4 and specifically comprises the following steps:
(231) Acquiring the head and body detection results. The body frames in the color image are obtained from step (21) and the head frames in the depth image from step (22); the set of body frames is traversed and the following operations are performed on each body frame;
(232) judging whether the body frame contains a head frame; if not, the body frame is deleted; if so, the next step is executed;
(233) judging whether the number of head frames within the body frame is 1; if so, the body frame and the head frame are associated to form a multi-modal combined human body detection; if the body frame contains more than one head frame, the optimal head frame is selected according to the head frames' positions and respective confidences and then associated with the current body frame;
The body detected in the color image alone and the head detected in the depth image alone may suffer false detections (a non-target region detected as a target) or missed detections (a target not detected). To make the detection result more reliable, the RGB-D information, i.e. the body frames in the color image and the head frames in the depth image, must be fused. By tuning parameters, the independent detection stage detects as many targets as possible, i.e. reduces missed detections; in the fusion stage, body frames and head frames are screened according to the head-to-body proportion of most normal people (about 1:7), and the final requirement that each body frame contain exactly one head frame eliminates most false detections, greatly reducing the false-detection probability and improving accuracy.
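The screening of steps (231)-(233) reduces to a few lines. The following sketch is under our own assumptions: boxes are (x, y, w, h) tuples, and a simple upper-third containment test stands in for the 1:7 head-to-body constraint, which the patent does not spell out beyond the ratio.

```python
def fuse_detections(body_boxes, head_boxes, head_scores):
    """Keep each body frame only with its single best head frame."""
    def head_in_body(head, body):
        hx, hy, hw, hh = head
        bx, by, bw, bh = body
        # head frame inside the body frame, within its upper third
        return (bx <= hx and hx + hw <= bx + bw and
                by <= hy and hy + hh <= by + bh / 3.0)

    fused = []
    for body in body_boxes:
        inside = [(h, s) for h, s in zip(head_boxes, head_scores)
                  if head_in_body(h, body)]
        if not inside:
            continue                     # step (232): discard the body frame
        best_head, _ = max(inside, key=lambda hs: hs[1])
        fused.append((body, best_head))  # step (233): associate the best head
    return fused
```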
(3) Human body tracking based on multi-modal information perception: firstly, models of the tracked object are initialized in the color image and the depth image respectively; secondly, the body and the head are tracked in the color image and the depth image respectively with the kernelized correlation filter algorithm; finally, if confidence is high during tracking, the tracked-object model is updated to adapt to changes of the tracked object. The flow chart of this process is shown in fig. 5, and the specific steps are:
(31) Models of the tracked object are established in the color image and the depth image. In the color image the model is a color histogram; in the depth image it is a depth template picture. The color histogram is extracted as follows: RGB color is first converted to the HSV color space, where H is hue, S is saturation and V is brightness; the H channel is then extracted according to formula (7), and the distribution of H values within the window is counted to form the color histogram. The depth template picture is extracted by cropping the head bounding box before tracking starts and scaling it to a standard size, which serves as the template picture for head tracking in the depth image;
R′=R/255,G′=G/255,B′=B/255
Cmax=max(R′,G′,B′),Cmin=min(R′,G′,B′)
Δ=Cmax-Cmin
$$H = \begin{cases} 0^\circ, & \Delta = 0 \\ 60^\circ \times \left(\dfrac{G'-B'}{\Delta} \bmod 6\right), & C_{max} = R' \\ 60^\circ \times \left(\dfrac{B'-R'}{\Delta} + 2\right), & C_{max} = G' \\ 60^\circ \times \left(\dfrac{R'-G'}{\Delta} + 4\right), & C_{max} = B' \end{cases} \quad (7)$$
(32) The body is tracked in the color image and the head in the depth image simultaneously with the KCF (Kernelized Correlation Filter) algorithm, as follows: pixel values around the tracked object are extracted with a circulant matrix as training samples, a discriminant function is trained by ridge regression, and the samples are transformed to a kernel space by a kernel transformation to handle samples that are not linearly separable. In this computation the circulant sample matrix can be diagonalized by the discrete Fourier transform, so matrix operations, in particular matrix inversion, can be replaced by element-wise products of vectors, which greatly improves the computation speed.
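Step (32) can be sketched with the KCF implementation shipped in OpenCV's contrib module: two independent trackers, one per modality. File names and boxes are placeholders; on recent OpenCV builds the constructor may live under cv2.legacy.TrackerKCF_create instead.

```python
import cv2

color_frame = cv2.imread("color_000.png")  # placeholder frame pair
depth_gray = cv2.imread("depth_000.png", cv2.IMREAD_GRAYSCALE)
# KCF's default color-names features expect 3 channels, so the 8-bit
# depth map is replicated to BGR before tracking
depth_frame = cv2.cvtColor(depth_gray, cv2.COLOR_GRAY2BGR)

body_tracker = cv2.TrackerKCF_create()     # requires opencv-contrib-python
head_tracker = cv2.TrackerKCF_create()
body_tracker.init(color_frame, (100, 50, 64, 128))  # (x, y, w, h) from detection
head_tracker.init(depth_frame, (115, 55, 30, 30))

# per frame pair: update each tracker on its own modality
next_color = cv2.imread("color_001.png")
next_depth = cv2.cvtColor(cv2.imread("depth_001.png", cv2.IMREAD_GRAYSCALE),
                          cv2.COLOR_GRAY2BGR)
ok_body, body_box = body_tracker.update(next_color)
ok_head, head_box = head_tracker.update(next_depth)
```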
(33) The model of the tracked object is matched and updated during tracking. The matching method computes the normalized correlation coefficient between the tracked object and the initial tracked-object model, as shown in formulas (8) and (9), where $d$ is the normalized correlation coefficient of the color histograms $H_1$ and $H_2$, and $R$ that of the depth template pictures $T$ and $I$. Both values lie in the range [0,1]; a larger value indicates a better match and 0 the worst match. If the matching value is greater than 0.9, i.e. the algorithm's confidence in the tracking result is high, the tracked-object model is updated by weighting: the initial model has weight $1-w$ and the current model weight $w$, where $w = 0.5 \times d$ or $w = 0.5 \times R$.
$$d(H_1,H_2) = \frac{\sum_i H_1(i)\,H_2(i)}{\sqrt{\sum_i H_1(i)^2 \,\sum_i H_2(i)^2}} \quad (8)$$

$$R(T,I) = \frac{\sum_{x,y} T(x,y)\,I(x,y)}{\sqrt{\sum_{x,y} T(x,y)^2 \,\sum_{x,y} I(x,y)^2}} \quad (9)$$
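Formulas (8) and (9) and the weighted update of step (33) have close counterparts in OpenCV. The sketch below is ours: cv2.HISTCMP_CORREL is OpenCV's mean-subtracted correlation and so only approximates formula (8), while TM_CCORR_NORMED has exactly the form of formula (9); the patch crops are placeholders for the regions returned by the trackers.

```python
import cv2

def body_match(patch_bgr, ref_hist):
    """d of formula (8): hue-histogram correlation with the initial model."""
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [32], [0, 180])  # H channel only
    cv2.normalize(hist, hist)
    return cv2.compareHist(ref_hist, hist, cv2.HISTCMP_CORREL), hist

def head_match(patch_depth, template):
    """R of formula (9): normalized cross-correlation with the template."""
    patch = cv2.resize(patch_depth, (template.shape[1], template.shape[0]))
    return float(cv2.matchTemplate(patch, template, cv2.TM_CCORR_NORMED)[0, 0])

# initial body model from the first tracked crop (placeholder coordinates)
first = cv2.imread("color_000.png")
hsv0 = cv2.cvtColor(first[50:178, 100:164], cv2.COLOR_BGR2HSV)
ref_hist = cv2.calcHist([hsv0], [0], None, [32], [0, 180])
cv2.normalize(ref_hist, ref_hist)

# weighted update of step (33), applied only under high confidence
cur = cv2.imread("color_001.png")
d, cur_hist = body_match(cur[52:180, 102:166], ref_hist)
if d > 0.9:
    w = 0.5 * d
    ref_hist = (1 - w) * ref_hist + w * cur_hist  # initial model keeps 1 - w
```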
(4) The tracking mechanism is perfected with the tracked-object model and the spatial constraint of the head-to-body ratio: firstly, model features of the tracked object are continuously extracted during tracking and matched against the initial tracked-object model to judge whether tracking is valid; secondly, if one tracker fails while the other remains valid, the still-valid tracking result is used for a short time, the position of the lost object is searched within a range specified by the head-to-body spatial constraint, and tracking is recovered in time; finally, if both trackers fail, the algorithm must stop and a warning is given to the user. The flow chart of this step is shown in fig. 6, and the specific steps are:
(41) Tracking validity is judged during tracking according to the normalized correlation coefficients given in step (33). First it is judged whether head tracking is valid, i.e. whether the value $R$ in formula (9) is greater than 0.5; if so, go to (42), otherwise go to (43);
(42) It is judged whether body tracking is valid, i.e. whether the value $d$ in formula (8) is greater than 0.5; if so, body tracking in the current color image and head tracking in the depth image are both valid and the normal tracking process continues; otherwise go to (44);
(43) It is judged whether body tracking is valid, i.e. whether the value $d$ in formula (8) is greater than 0.5; if so, head tracking in the depth image has failed while body tracking in the color image is still valid, so the head position is estimated from the body position according to the spatial position constraint between head and body, head-model matching is performed continuously, and head tracking is recovered once a match succeeds; otherwise go to (45);
(44) In this case head tracking in the depth image is valid while body tracking in the color image has failed; the approximate body position is estimated from the head position according to the spatial position constraint between head and body, matching of the body color histogram is performed continuously, and body tracking is recovered once a match succeeds;
(45) At this point both head tracking and body tracking have failed, indicating that the tracked object is absent from the color image and the depth image due to occlusion or fast motion; in this case the tracking algorithm stops and a warning must be given to the user so that an appropriate response can be made.
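The branching of steps (41)-(45) amounts to a small decision function over the two coefficients; the sketch below (names ours) returns the action to take on each frame.

```python
def tracking_state(R, d, thresh=0.5):
    """Decision logic of steps (41)-(45).

    R: head coefficient, formula (9); d: body coefficient, formula (8).
    """
    head_ok, body_ok = R > thresh, d > thresh
    if head_ok and body_ok:
        return "track_both"              # (42): normal tracking continues
    if body_ok:
        return "recover_head_from_body"  # (43): estimate head from body
    if head_ok:
        return "recover_body_from_head"  # (44): estimate body from head
    return "stop_and_warn"               # (45): both lost, warn the user
```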

Claims (8)

1. A human body detection and tracking method based on multi-modal information perception, characterized in that it comprises the following steps:
(1) calibrating a color camera and a depth camera and performing data filtering, aligning the color image and the depth image through calibration, and then performing filtering respectively;
(2) human body detection based on multi-modal information perception: detecting a body in the color image, detecting a head in the depth image, and fusing according to the spatial proportion information;
(3) human body tracking based on multi-modal information perception: respectively tracking the body and the head in the color image and the depth image with a kernelized correlation filter (KCF) target tracking algorithm, and establishing a tracked-object model to check the tracking result;
(4) perfecting the tracking mechanism with the tracked-object model and the spatial constraint of the head-to-body ratio: if a single tracker fails during tracking, tracking stability is maintained according to the spatial position constraint between head and body.
2. The human body detection and tracking method based on multi-modal information perception according to claim 1, wherein the step (1) comprises the following steps:
(11) a color camera and a depth camera are respectively used for shooting a plurality of checkerboard pictures at different angles and different distances, so that each position of an image can be covered;
(12) detecting and matching corner points in different images, and calculating the internal and external parameter matrices of the color camera from the matched corner-point pairs, wherein the internal parameter matrix of the color camera is:

$$K = \begin{bmatrix} f_x & s & x_0 \\ 0 & f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix}$$

where $f_x$ and $f_y$ are the focal lengths, $(x_0, y_0)$ are the principal-point coordinates relative to the imaging plane, and $s$ is the axis skew parameter;

the external parameter matrix of the color camera is:

$$[\,R \mid t\,]$$

where $R$ is a 3×3 rotation matrix and $t$ is a 3×1 translation vector, both relative to the world coordinate system;
(13) mapping the depth values onto the color image;
let $P$ be a spatial point, $p_{rgb}$ and $p_{ir}$ its coordinates in the color image and the depth image, $K_{rgb}$ and $K_{ir}$ the internal parameter matrices of the color and depth cameras, and $R_{ir}$ and $t_{ir}$ the external parameters of the depth camera with the color camera as the reference frame; the depth value at point $P$ is mapped to color-image coordinates by:

$$p_{rgb} = K_{rgb} R_{ir} K_{ir}^{-1} p_{ir} + t_{ir}$$
(14) simultaneously shrinking the registered color image and the depth image, removing high-frequency noise in the color image with Gaussian filtering, and removing missing-depth points in the depth image with median filtering.
3. The human body detection and tracking method based on multi-modal information perception according to claim 1, wherein the step (2) comprises the steps of:
(21) scanning the color image with a sliding window, extracting the HOG features within the window, and judging with a trained SVM classifier whether the window contains a human body, thereby obtaining all windows in the color image that may contain a human body;
(22) likewise using a sliding window in the depth image, extracting the Haar features within the window, and classifying with an Adaboost classifier whether the window is a human head, thereby obtaining all windows in the depth image that may contain a human head;
(23) fusion detection according to the spatial proportion information: fusing the results of step (21) and step (22) according to the head-to-body ratio to obtain a detection result fusing the multi-modal information.
4. The human body detection and tracking method based on multi-modal information perception according to claim 3, wherein the step (21) comprises the steps of:
(211) scanning the color image with a multi-scale sliding window: first enlarging and reducing the original color image by a preset ratio to obtain color images at multiple scales, then sliding a fixed-size window over each color image and checking whether the window contains a human body;
(212) extracting the HOG features in the window, wherein the HOG features are extracted by the following steps:
(2121) graying and gray level normalization;
firstly, graying the whole color image, and then normalizing;
(2122) calculating the gradient of each pixel in the color image;
the gradients $G_x, G_y$ of the color image in the x and y directions at $(x, y)$ are:

$$G_x(x,y) = I(x+1,y) - I(x-1,y)$$

$$G_y(x,y) = I(x,y+1) - I(x,y-1)$$

the gradient magnitude is then

$$G(x,y) = \sqrt{G_x(x,y)^2 + G_y(x,y)^2}$$

and its direction is

$$\theta(x,y) = \arctan\left(\frac{G_y(x,y)}{G_x(x,y)}\right);$$
(2123) dividing the pixels into 8×8 cells, counting the gradient information of all pixels within a cell, and representing the result by a histogram over gradient directions;
(2124) dividing the image into 16×16 blocks and contrast-normalizing the gradient histograms within each block;
(2125) setting a detection window of size 64×128 and generating its feature vector by concatenating the feature vectors of all blocks within the window for subsequent classification;
zooming the original color image to form an image pyramid, sliding the detection window over the color image at the current scale, classifying each position with the trained SVM classifier, and judging whether a human body is present at that position; finally applying non-maximum suppression to the obtained results to eliminate multiple detection windows on the same target.
5. The method for detecting and tracking human body based on multi-modal information perception according to claim 3, wherein the step (22) comprises the steps of:
(221) scanning the depth image with a multi-scale sliding window: first enlarging and reducing the original depth image by a preset ratio to obtain depth images at multiple scales, then sliding a fixed-size window over each depth image and checking whether the window contains a human head;
(222) extracting Haar features in the window;
the Haar features are simple rectangular-block features of three types, namely edge features, linear features and diagonal features; the feature value of each rectangular region is the sum of the pixels in the white region minus the sum of the pixels in the black region;
(223) adaboost classification, which is to train a classifier by using an AdaBoost algorithm;
the AdaBoost algorithm boosts a weak learner with sufficient data so as to generate a high-precision strong learner; a weak classifier $h_j(x)$ is given by:

$$h_j(x) = \begin{cases} 1, & p_j f_j(x) < p_j \theta_j \\ 0, & \text{otherwise} \end{cases}$$

where $f_j$ is a feature, $\theta_j$ is a threshold, $p_j$ controls the direction of the inequality, and $x$ is a 24×24 image sub-window; training is run N rounds to obtain N weak classifiers, with a normalized weight, which is a probability distribution, maintained over the samples in the n-th round; for each feature $j$ a classifier $h_j$ using only that single feature is trained, the classifier $h_n$ with the lowest error is selected, and the weights are updated, yielding a strong classifier;
classifying the Haar feature vectors in the detection windows obtained in step (222) with the Adaboost classifier, which gives a probability score for the presence of a human head in each detection window;
(224) synthesizing the human head detection results in the depth image: performing non-maximum suppression according to the probability score of each window to obtain the human head detection result in the depth image.
6. The human body detection and tracking method based on multi-modal information perception according to claim 3, wherein the step (23) comprises the steps of:
(231) acquiring the head and body detection results: obtaining the body frames in the color image from step (21) and the head frames in the depth image from step (22), then traversing the set of body frames and performing the following operations on each body frame;
(232) judging whether the body frame contains a head frame; if not, deleting the body frame and returning to step (231); if so, performing step (233);
(233) judging whether the number of head frames within the body frame is 1; if so, associating the body frame and the head frame to form a multi-modal combined human body detection; if the body frame contains more than one head frame, selecting the optimal head frame according to the head frames' positions and respective confidences, and then associating the optimal head frame with the current body frame.
7. The human body detection and tracking method based on multi-modal information perception according to claim 1, wherein the step (3) comprises the following steps:
(31) establishing models of the tracked object in the color image and the depth image;
in the color image the model is a color histogram, and in the depth image it is a depth template picture; the color histogram is extracted as follows: first convert RGB to the HSV color space, where H is hue, S is saturation and V is brightness; then extract the H channel according to the following formula and count the distribution of H values within the window to form the color histogram; the depth template picture is extracted by cropping the head bounding box before tracking starts and scaling it to a standard size, which serves as the template picture for head tracking in the depth image;
R′=R/255,G′=G/255,B′=B/255
Cmax=max(R′,G′,B′),Cmin=min(R′,G′,B′)
Δ=Cmax-Cmin
$$H = \begin{cases} 0^\circ, & \Delta = 0 \\ 60^\circ \times \left(\dfrac{G'-B'}{\Delta} \bmod 6\right), & C_{max} = R' \\ 60^\circ \times \left(\dfrac{B'-R'}{\Delta} + 2\right), & C_{max} = G' \\ 60^\circ \times \left(\dfrac{R'-G'}{\Delta} + 4\right), & C_{max} = B' \end{cases}$$
(32) tracking the body in the color image and the head in the depth image simultaneously with the kernelized correlation filter (KCF) algorithm, as follows: pixel values around the tracked object are extracted with a circulant matrix as training samples, a discriminant function is trained by ridge regression, and the samples are transformed to a kernel space by a kernel transformation to handle samples that are not linearly separable;
(33) matching and updating the object model during tracking; the matching method computes the normalized correlation coefficient between the tracked object and the initial model:

$$d(H_1,H_2) = \frac{\sum_i H_1(i)\,H_2(i)}{\sqrt{\sum_i H_1(i)^2 \,\sum_i H_2(i)^2}}$$

$$R(T,I) = \frac{\sum_{x,y} T(x,y)\,I(x,y)}{\sqrt{\sum_{x,y} T(x,y)^2 \,\sum_{x,y} I(x,y)^2}}$$

where $d$ is the normalized correlation coefficient of the color histograms $H_1$ and $H_2$, and $R$ is the normalized correlation coefficient of the depth template pictures $T$ and $I$; both values lie in the range [0,1], where a larger value indicates a better match and 0 the worst match; if the matching value is greater than 0.9, i.e. the algorithm's confidence in the tracking result is high, the model is updated by weighting: the initial model has weight $1-w$ and the current model of the tracked object weight $w$, where $w = 0.5 \times d$ or $w = 0.5 \times R$.
8. The human body detection and tracking method based on multi-modal information perception according to claim 1, wherein the step (4) comprises the steps of:
(41) judging tracking validity during tracking according to the normalized correlation coefficients given in step (3): first judging whether head tracking is valid, i.e. whether the normalized correlation coefficient $R$ of the depth template pictures $T$ and $I$ is greater than 0.5; if so, going to step (42), otherwise going to step (43);
(42) judging whether body tracking is valid, i.e. whether the normalized correlation coefficient $d$ of the color histograms $H_1$ and $H_2$ is greater than 0.5; if so, the body tracking result in the current color image and the head tracking result in the depth image are both valid and the normal tracking process continues; otherwise going to step (44);
(43) judging whether body tracking is valid, i.e. whether the normalized correlation coefficient $d$ of the color histograms $H_1$ and $H_2$ is greater than 0.5; if so, head tracking in the depth image has failed while body tracking in the color image is still valid, so the head position is estimated according to the spatial position constraint between head and body, head-model matching is performed continuously, and head tracking is recovered once a match succeeds; otherwise going to step (45);
(44) in this case head tracking in the depth image is valid while body tracking in the color image has failed; the approximate body position is estimated from the head position according to the spatial position constraint between head and body, matching of the body color histogram is performed continuously, and body tracking is recovered once a match succeeds;
(45) at this point both head tracking and body tracking have failed, indicating that the tracked object is absent from the color image and the depth image due to occlusion or fast motion; in this case the tracking algorithm stops and a warning must be given to the user so that an appropriate response can be made.
CN201911146615.4A 2019-11-21 2019-11-21 Human body detection and tracking method based on multi-mode information perception Active CN111144207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911146615.4A CN111144207B (en) 2019-11-21 2019-11-21 Human body detection and tracking method based on multi-mode information perception

Publications (2)

Publication Number Publication Date
CN111144207A (en) 2020-05-12
CN111144207B CN111144207B (en) 2023-07-07

Family

ID=70517199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911146615.4A Active CN111144207B (en) 2019-11-21 2019-11-21 Human body detection and tracking method based on multi-mode information perception

Country Status (1)

Country Link
CN (1) CN111144207B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102136139A (en) * 2010-01-22 2011-07-27 Samsung Electronics Co., Ltd. Target attitude analyzing device and target attitude analyzing method thereof
CN102800126A (en) * 2012-07-04 2012-11-28 Zhejiang University Method for recovering real-time three-dimensional body posture based on multimodal fusion
US20160180195A1 (en) * 2013-09-06 2016-06-23 Toyota Jidosha Kabushiki Kaisha Augmenting Layer-Based Object Detection With Deep Convolutional Neural Networks
CN106503615A (en) * 2016-09-20 2017-03-15 Beijing University of Technology Indoor human body detection, tracking and identification system based on multiple sensors
CN107093182A (en) * 2017-03-23 2017-08-25 Southeast University Human body height estimation method based on feature inflection points
CN107197384A (en) * 2017-05-27 2017-09-22 Beijing Guangnian Wuxian Technology Co., Ltd. Multi-modal interaction method and system for a virtual robot applied to a live video streaming platform

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667518A (en) * 2020-06-24 2020-09-15 Beijing Baidu Netcom Science and Technology Co., Ltd. Display method and device of face image, electronic equipment and storage medium
CN111667518B (en) * 2020-06-24 2023-10-31 Beijing Baidu Netcom Science and Technology Co., Ltd. Face image display method and device, electronic equipment and storage medium
CN111968087A (en) * 2020-08-13 2020-11-20 Agricultural Information Institute, Chinese Academy of Agricultural Sciences Plant disease area detection method
CN111968087B (en) * 2020-08-13 2023-11-07 Agricultural Information Institute, Chinese Academy of Agricultural Sciences Plant disease area detection method
CN112150448A (en) * 2020-09-28 2020-12-29 Hangzhou Hikvision Digital Technology Co., Ltd. Image processing method, device and equipment and storage medium
CN112150448B (en) * 2020-09-28 2023-09-26 Hangzhou Hikvision Digital Technology Co., Ltd. Image processing method, device and equipment and storage medium
TWI798663B (en) * 2021-03-22 2023-04-11 伍碩科技股份有限公司 Depth image compensation method and system
WO2022228019A1 (en) * 2021-04-25 2022-11-03 UBTECH Robotics Corp., Ltd. Moving target following method, robot, and computer-readable storage medium
CN113393401A (en) * 2021-06-24 2021-09-14 ShanghaiTech University Object detection hardware accelerators, systems, methods, apparatus, and media
CN113393401B (en) * 2021-06-24 2023-09-05 ShanghaiTech University Object detection hardware accelerator, system, method, apparatus and medium
CN116580828A (en) * 2023-05-16 2023-08-11 深圳弗瑞奇科技有限公司 Visual monitoring method for full-automatic induction identification of cat health
CN116580828B (en) * 2023-05-16 2024-04-02 深圳弗瑞奇科技有限公司 Visual monitoring method for full-automatic induction identification of cat health

Similar Documents

Publication Publication Date Title
CN111144207B (en) Human body detection and tracking method based on multi-mode information perception
US11699293B2 (en) Neural network image processing apparatus
JP6305171B2 (en) How to detect objects in a scene
TWI383325B (en) Face expressions identification
Chang et al. Tracking Multiple People Under Occlusion Using Multiple Cameras.
CN111967498A (en) Night target detection and tracking method based on millimeter wave radar and vision fusion
CN108416291B (en) Face detection and recognition method, device and system
CN111965636A (en) Night target detection method based on millimeter wave radar and vision fusion
KR20110064117A (en) Method for determining frontal pose of face
CN113850865A (en) Human body posture positioning method and system based on binocular vision and storage medium
CN112784712B (en) Missing child early warning implementation method and device based on real-time monitoring
CN109886195B (en) Skin identification method based on near-infrared monochromatic gray-scale image of depth camera
CN114399675A (en) Target detection method and device based on machine vision and laser radar fusion
CN110909561A (en) Eye state detection system and operation method thereof
CN108021926A (en) A kind of vehicle scratch detection method and system based on panoramic looking-around system
Fang et al. Laser stripe image denoising using convolutional autoencoder
CN115375991A (en) Strong/weak illumination and fog environment self-adaptive target detection method
CN112749664A (en) Gesture recognition method, device, equipment, system and storage medium
CN109410272B (en) Transformer nut recognition and positioning device and method
CN113723432B (en) Intelligent identification and positioning tracking method and system based on deep learning
CN108694348B (en) Tracking registration method and device based on natural features
Hadi et al. Fusion of thermal and depth images for occlusion handling for human detection from mobile robot
EP3469517A1 (en) Curvature-based face detector
Zarkasi et al. Implementation color filtering and Harris corner method on pattern recognition system
CN111950549A (en) Sea surface obstacle detection method based on fusion of sea antennas and visual saliency

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant