CN111144207B - Human body detection and tracking method based on multi-mode information perception - Google Patents

Human body detection and tracking method based on multi-mode information perception

Info

Publication number
CN111144207B
CN111144207B (application CN201911146615.4A; also published as CN111144207A)
Authority
CN
China
Prior art keywords
tracking
head
depth
color
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911146615.4A
Other languages
Chinese (zh)
Other versions
CN111144207A (en
Inventor
周波 (Zhou Bo)
黄文超 (Huang Wenchao)
甘亚辉 (Gan Yahui)
房芳 (Fang Fang)
钱堃 (Qian Kun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201911146615.4A priority Critical patent/CN111144207B/en
Publication of CN111144207A publication Critical patent/CN111144207A/en
Application granted granted Critical
Publication of CN111144207B publication Critical patent/CN111144207B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/40Image enhancement or restoration by the use of histogram techniques
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20024Filtering details
    • G06T2207/20032Median filtering

Abstract

The invention discloses a human body detection and tracking method based on multi-mode information perception, which comprises the following steps: calibrating a color camera and a depth camera and filtering their data; detecting the human body and the human head in the color image and the depth image respectively, and fusing the two detection results according to the spatial proportion of head to body; tracking the body and the head in the color image and the depth image respectively with a kernelized correlation filter (KCF) tracking algorithm, and establishing a model of the tracked object; and refining the tracking mechanism with the tracked-object model and the spatial constraint of the head-to-body ratio. Based on multi-mode information perception, the method overcomes the shortcomings of purely visual target detection and tracking, has wide application in the field of indoor service robots, and supports functions such as human-robot interaction and user following.

Description

Human body detection and tracking method based on multi-mode information perception
Technical Field
The invention belongs to the field of indoor service robot applications, and particularly relates to a human body detection and tracking method based on multi-mode information perception, in particular a long-term, robust detection and tracking method for unstructured indoor environments with changing illumination.
Background
With the development and maturation of computer vision technology and the rise of artificial intelligence, intelligent service robots, especially indoor mobile service robots, are applied ever more widely. In indoor environments, robots need to perceive complex unstructured scenes and interact with humans; visual information alone is not sufficient to cope with dynamic changes of ambient lighting. The RGB-D camera, a novel vision sensor, can simultaneously provide high-resolution color and depth images and is an excellent tool for human-robot interaction. An efficient method is therefore needed to make full use of multi-modal information for detection and tracking.
Existing target detection and tracking methods mostly rely on cameras or lasers. A two-dimensional laser directly acquires the geometric information of the environment with high precision and fast processing, but it provides only a small amount of information, can only extract simple shape features, and is easily confused by similar objects in the environment. Camera-based detection and tracking methods can be further divided into methods based on hand-crafted features and methods based on deep learning. Hand-crafted-feature methods extract predefined features, train a classifier, and detect with a sliding window; their computational cost is controllable and the extracted features have clear meaning, but their accuracy is limited. Deep-learning methods achieve higher precision but are computationally heavy and cannot run in real time on ordinary computing platforms.
In general, the conventional target detection and tracking methods above have the following problems: 1) it is difficult to strike a satisfactory balance between accuracy and real-time performance; 2) using only a single information source, color or depth, cannot achieve detection and tracking in complex environments; 3) they lack analysis and handling of short-term algorithm failure, so the target is easily lost and robustness is poor.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a human body detection and tracking method based on multi-mode information perception, solving the problem of real-time, robust human body detection and tracking in complex environments.
The technical scheme is as follows: in order to achieve the above purpose, the present invention adopts the following technical scheme:
human body detection and tracking method based on multi-mode information perception: the method comprises the following steps:
(1) Calibrating a color camera and a depth camera, carrying out data filtering treatment, aligning a color image and a depth image through calibration, and then respectively carrying out filtering treatment;
(2) Human body detection based on multi-modal information perception: detecting a body in the color image, detecting a head in the depth image, and fusing according to the space proportion information;
(3) Human body tracking based on multi-modal information awareness: tracking the body and the head in the color image and the depth image respectively with a kernelized correlation filter (KCF) target tracking algorithm, and establishing a tracked-object model to check the tracking result;
(4) And (3) perfecting a tracking mechanism by utilizing the space constraint of the tracking object model and the head-body ratio, and maintaining the tracking stability according to the space position constraint of the head and the body if a single tracker fails in the tracking process.
Further, the step (1) includes the steps of:
(11) Shooting a plurality of checkerboard pictures with different angles and different distances by using a color camera and a depth camera respectively, so as to ensure that each position of an image can be covered;
(12) Detecting and matching corner points across the different images, and calculating the intrinsic and extrinsic parameter matrices of the color camera from the matched corner pairs, wherein the color camera intrinsic matrix is:

$$K = \begin{bmatrix} f_x & s & x_0 \\ 0 & f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix}$$

wherein fx, fy are the focal lengths, (x0, y0) are the principal point coordinates relative to the imaging plane, and s is the axis skew parameter;

the color camera extrinsic matrix is:

$$\begin{bmatrix} R & t \end{bmatrix}$$

wherein R is a 3×3 rotation matrix and t is a 3×1 translation vector, both in the world coordinate system;
(13) Mapping the depth values onto the color image;

let P be a spatial point, $p_{rgb}$ and $p_{ir}$ its coordinates in the color image and the depth image, $K_{rgb}$ and $K_{ir}$ the intrinsic matrices of the color camera and the depth camera, and $R_{ir}$ and $t_{ir}$ the extrinsic parameters of the depth camera with the color camera as the reference frame; the depth value of point P is mapped into color-image coordinates by:

$$p_{rgb} = K_{rgb} R_{ir} K_{ir}^{-1} p_{ir} + t_{ir}$$
(14) Downscaling the registered color image and depth image simultaneously, removing high-frequency noise from the color image with Gaussian filtering, and removing depth-missing points from the depth image with median filtering.
Further, the step (2) includes the steps of:
(21) Scanning the color image with a sliding window, extracting HOG features within the window, and judging with a trained SVM classifier whether the window contains a human body, to obtain all windows in the color image that may contain a human body;
(22) Using a sliding window likewise in the depth image, extracting Haar features within the window, and classifying with an Adaboost classifier whether the window is a human head, to obtain all windows in the depth image that may contain a human head;
(23) Fusion detection according to the spatial proportion information: fusing the two detection results of step (21) and step (22) according to the head-to-body ratio of a person, to obtain a detection result that fuses the multi-modal information.
Still further, step (21) includes the steps of:
(211) The method comprises the steps of scanning a color image through a multi-scale sliding window, and firstly enlarging and reducing an original color image according to a preset proportion to obtain a multi-scale color image; then sliding on each color image with a sliding window with a fixed size, and checking whether the window contains a human body or not;
(212) The HOG features in the window are extracted, and the HOG features are extracted as follows:
(2121) Graying and gray normalization;
firstly, graying the whole color image, and then normalizing;
(2122) Calculating the gradient of each pixel in the color image;

the gradients of the color image in the x and y directions at (x, y) are:

$$G_x(x,y) = I(x+1,y) - I(x-1,y)$$

$$G_y(x,y) = I(x,y+1) - I(x,y-1)$$

the gradient magnitude is $G(x,y) = \sqrt{G_x(x,y)^2 + G_y(x,y)^2}$ and the gradient direction is $\theta(x,y) = \arctan\big(G_y(x,y)/G_x(x,y)\big)$;
(2123) Dividing the pixels into cells of 8×8, counting the gradient information of all pixels in each cell, and representing the result with a histogram of gradient directions;
(2124) Dividing blocks of 16×16 and contrast-normalizing the gradient histograms within each block;
(2125) Setting a detection window of 64×128 and generating its feature vector by combining the feature vectors of all blocks lying within the window, for subsequent classification;
scaling the original color image to form an image pyramid, sliding the detection window over the image at each scale, and classifying each position with the trained SVM classifier to judge whether a human body is present there; finally, applying non-maximum suppression to the results to eliminate multiple detection windows on the same target.
Further, step (22) includes the steps of:
(221) The method comprises the steps of scanning a depth image by a multi-scale sliding window, and firstly enlarging and reducing an original depth image according to a preset proportion to obtain a depth image with multiple scales; then sliding on each depth image by a sliding window with a fixed size, and checking whether the window contains a human head or not;
(222) Extracting Haar features in the window;
Haar features are simple rectangular block features, divided into edge features, linear features and diagonal features; the feature value of each rectangular region is the sum of the pixels in the white region minus the sum of the pixels in the black region;
(223) Adaboost classification, training a classifier by using an AdaBoost algorithm;
the AdaBoost algorithm is a way to boost the weak learner with enough data to generate a high-precision strong learner; weak classifier h j (x) The formula is as follows:
Figure BDA0002282381700000041
wherein f j Is characterized by theta j Is threshold, p j The function of (2) is to control the direction of the inequality, x is the 24 x 24 image sub-window; training for N times, training N weak classifiers together, and adding normalized weights to the N-th training, wherein the weights are probability distribution; training a classifier h for each feature j j ,h j Using only a single feature, the one classifier h with the lowest error is selected n Updating the weight to finally obtain a strong classifier;
classifying Haar feature vectors in the detection window obtained in the step (222) by using an Adaboost classifier to give a likelihood score of the existence of a human head in the detection window;
(224) And (3) integrating the detection results of the human head in the depth image, and carrying out non-maximum suppression according to the probability score of each window to obtain the detection result of the human head in the depth image.
Still further, step (23) includes the steps of:
(231) Acquiring a head and body detection result, acquiring a body frame in the color image from the step (21), acquiring a head frame in the depth image from the step (22), traversing a set of body frames, and executing the following operation on each body frame;
(232) Judging whether a head frame exists within the body frame; if not, deleting the body frame and returning to step (231); if so, executing step (233);
(233) Judging whether the number of head frames in the body frame is 1, if so, associating the body frame and the head frame to form a multi-mode combined human body detection; if the number of head frames in the body frame exceeds one, an optimal head frame is selected according to the position of the head frame and the respective confidence level, and then the optimal head frame is associated with the current body frame.
Further, the step (3) includes the steps of:
(31) Establishing a model of the tracking object in the color image and the depth image;
in the color map the model is a color histogram, and in the depth map it is a depth template picture; the color histogram is extracted as follows: first convert RGB to the HSV color space, where H is hue, S is saturation and V is brightness, then extract the H channel according to the following formula and count the distribution of H values in the window to form the color histogram; the depth template picture is extracted by cropping the head bounding box before tracking starts and scaling it to a standard size, to serve as the template picture for head tracking in the depth image;
$$R' = R/255,\qquad G' = G/255,\qquad B' = B/255$$

$$C_{max} = \max(R',G',B'),\qquad C_{min} = \min(R',G',B'),\qquad \Delta = C_{max} - C_{min}$$

$$H = \begin{cases} 0^\circ, & \Delta = 0 \\ 60^\circ \times \left(\dfrac{G'-B'}{\Delta} \bmod 6\right), & C_{max} = R' \\ 60^\circ \times \left(\dfrac{B'-R'}{\Delta} + 2\right), & C_{max} = G' \\ 60^\circ \times \left(\dfrac{R'-G'}{\Delta} + 4\right), & C_{max} = B' \end{cases}$$
(32) Tracking the body in the color map and the head in the depth map simultaneously with the KCF (kernelized correlation filter) algorithm; specifically: pixel values around the tracked object are extracted with a circulant matrix as training samples, a discriminant function is trained by ridge regression, and the samples are mapped into a kernel space by a kernel transform to handle linearly inseparable samples;
(33) Matching and updating the object model during tracking; matching computes the normalized correlation coefficient between the tracked object and the initial model:

$$d(H_1,H_2) = \frac{\sum_i \big(H_1(i)-\bar{H}_1\big)\big(H_2(i)-\bar{H}_2\big)}{\sqrt{\sum_i \big(H_1(i)-\bar{H}_1\big)^2 \sum_i \big(H_2(i)-\bar{H}_2\big)^2}}$$

$$R(T,I) = \frac{\sum_{x,y} T'(x,y)\, I'(x,y)}{\sqrt{\sum_{x,y} T'(x,y)^2 \sum_{x,y} I'(x,y)^2}},\qquad T' = T-\bar{T},\ I' = I-\bar{I}$$

wherein d is the normalized correlation coefficient of the color histograms $H_1$ and $H_2$, and R is that of the depth template pictures T and I; both values lie in the range [0, 1], a larger value meaning a better match and 0 the worst; if the matching value is larger than 0.9, i.e. the algorithm has high confidence in the tracking result, the model is updated by weighting: the initial model has weight 1-w and the current tracked-object model weight w, where w = 0.5×d or w = 0.5×R.
Further, the step (4) includes the steps of:
(41) During tracking, judging the validity of tracking from the normalized correlation coefficients of step (3); first judging whether head tracking is valid, i.e. whether the normalized correlation coefficient R of the depth template pictures T and I is larger than 0.5; if so, going to step (42), otherwise going to step (43);
(42) Judging whether body tracking is valid, i.e. whether the normalized correlation coefficient d of the color histograms $H_1$ and $H_2$ is larger than 0.5; if so, body tracking in the current color image and head tracking in the depth image are both valid and the normal tracking process continues; otherwise, going to step (44);
(43) Judging whether body tracking is valid, i.e. whether d is larger than 0.5; if so, head tracking in the depth image has failed while body tracking in the color image is still valid, so the head position is estimated from the body position using the spatial position constraint of head and body, matching of the head model continues, and head tracking is recovered once a match succeeds; otherwise, going to step (45);
(44) Here head tracking in the depth image is valid while body tracking in the color image has failed; the approximate body position is estimated from the head position using the spatial position constraint of head and body, matching of the body color histogram continues, and body tracking is recovered once a match succeeds;
(45) Here both head tracking and body tracking have failed, indicating that the tracked object is absent from the color image and the depth image due to occlusion or rapid movement; in this case the tracking algorithm stops and a warning must be given so the user can respond appropriately.
The beneficial effects are that: compared with the prior art, the invention effectively solves real-time, robust human detection and tracking in complex environments based on the multi-modal information acquired by an RGB-D camera. Detecting and tracking the human body with multi-modal information improves the algorithm's adaptability to different ambient lighting conditions compared with using color or depth information alone; fusing the detection results of the color and depth images with the spatial proportion information raises recall, lowers the false-detection rate and improves accuracy; combining the tracking results on the color and depth images and comprehensively exploiting the model features of the tracked object allows results to be verified and recovered during tracking, giving the overall algorithm high robustness. The method is simple and efficient, supports human-robot interaction and user following for indoor service robots, and has a wide application range and good economic benefit.
Drawings
FIG. 1 is an overall flow chart of an algorithm;
FIG. 2 is a flow chart of the color image person detection in step (2) of the present invention;
FIG. 3 is a flow chart of the detection of the human head of the depth image in the step (2) of the present invention;
FIG. 4 is a flow chart of the fusion detection according to the head-to-body ratio in the step (2) of the present invention;
FIG. 5 is a flow chart of step (3) of the present invention;
FIG. 6 is a flow chart of step (4) of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and detailed description.
Fig. 1 is a general flowchart of a human body detecting and tracking method based on multi-mode information sensing according to the present invention, and the implementation steps are as follows:
(1) Calibrating a color camera and a depth camera and performing data filtering;
First, an RGB-D camera acquires color and depth data of the surrounding environment; it comprises a color camera, which acquires a color image containing the R, G, B values of the three colors, and a depth camera, which acquires a depth image containing distance (D) values. Second, because there is a positional offset between the color camera and the depth camera inside the RGB-D camera, camera calibration yields their intrinsic and extrinsic parameter matrices, so that each depth value corresponds one-to-one with a color value. Finally, the color image and the depth image are each filtered to remove bright spots and noise.
The depth camera measures depth as follows: an infrared speckle emitter projects an infrared beam, which is reflected back to the depth camera after hitting an obstacle, and the distance is computed from the geometric relationship of the returned speckle pattern. The depth camera is essentially an ordinary camera with a filter that images only infrared light, so it can be calibrated simply by illuminating an object with an infrared light source. The color camera is calibrated with the checkerboard method: the camera to be calibrated shoots checkerboard pictures from several viewpoints, corner points are detected and matched across the pictures, and the camera intrinsic and extrinsic parameter matrices are solved from the resulting equations. The specific calibration steps for the color camera are:
(11) The RGB-D camera is fixed by using a tripod, and then a color camera is used for shooting checkerboard pictures at a plurality of angles and distances, so that each position of an image can be covered.
(12) Corner points are detected and matched across the different images, and the camera intrinsic and extrinsic parameter matrices are computed from the matched corner pairs. The intrinsic matrix is shown in formula (1), where fx and fy are the focal lengths along the x and y axes, (x0, y0) is the principal point relative to the imaging plane, and s is the axis skew parameter, ideally 0; the extrinsic matrix is shown in formula (2), where R is a 3×3 rotation matrix and t is a 3×1 translation vector, both in the world coordinate system;

$$K = \begin{bmatrix} f_x & s & x_0 \\ 0 & f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix} \tag{1}$$

$$\begin{bmatrix} R & t \end{bmatrix} \tag{2}$$

(13) The depth values in the depth image are mapped onto the color image. Let P be a spatial point, $p_{rgb}$ and $p_{ir}$ its coordinates in the color image and the depth image, $K_{rgb}$ and $K_{ir}$ the intrinsic matrices of the color camera and the depth camera, and $R_{ir}$ and $t_{ir}$ the extrinsic parameters of the depth camera with the color camera as the reference frame; the depth value of point P is mapped into color-image coordinates by formula (3);

$$p_{rgb} = K_{rgb} R_{ir} K_{ir}^{-1} p_{ir} + t_{ir} \tag{3}$$
(14) The registered color image and depth image are simultaneously scaled to 480×270; high-frequency noise is removed from the color image with Gaussian filtering, and depth-missing points are removed from the depth image with median filtering.
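As an illustration of steps (13)-(14), the following is a minimal NumPy/OpenCV sketch of depth-to-color registration and filtering, assuming the calibration matrices K_ir, K_rgb, R_ir, t_ir are already known. Unlike the compact homogeneous form of formula (3), the sketch makes the depth value z explicit and applies the translation in camera coordinates, which is the usual way such a mapping is implemented; all names are illustrative.

```python
import cv2
import numpy as np

def register_depth_to_color(depth, K_ir, K_rgb, R_ir, t_ir, color_shape):
    """Map each depth pixel into color-image coordinates (cf. formula (3))."""
    h, w = depth.shape
    registered = np.zeros(color_shape[:2], dtype=depth.dtype)
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3).T
    z = depth.reshape(-1).astype(np.float64)
    valid = z > 0
    # Back-project depth pixels to 3-D points, move them into the color
    # camera frame, then project with the color intrinsics.
    P_ir = (np.linalg.inv(K_ir) @ pix[:, valid]) * z[valid]
    P_rgb = R_ir @ P_ir + t_ir.reshape(3, 1)
    p = K_rgb @ P_rgb
    u = np.round(p[0] / p[2]).astype(int)
    v = np.round(p[1] / p[2]).astype(int)
    ok = (0 <= u) & (u < color_shape[1]) & (0 <= v) & (v < color_shape[0])
    registered[v[ok], u[ok]] = z[valid][ok]
    return registered

# step (14): shrink to 480x270, then filter each modality
# color = cv2.GaussianBlur(cv2.resize(color, (480, 270)), (5, 5), 0)
# depth = cv2.medianBlur(cv2.resize(registered, (480, 270),
#                                   interpolation=cv2.INTER_NEAREST), 5)
```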
(2) Human body detection based on multi-modal information perception: first, a sliding window scans the color image, HOG (Histogram of Oriented Gradients) features are extracted within the window, and a trained SVM (Support Vector Machine) classifier judges whether the window contains a human body, yielding all windows in the color image that may contain a human body; second, a sliding window is likewise used in the depth image, Haar features are extracted within the window, and an Adaboost classifier classifies whether the window is a human head, yielding all windows in the depth image that may contain a human head; finally, fusion detection according to the spatial proportion information fuses the two detection results according to the head-to-body ratio of a person (about 1:7), obtaining a detection result that fuses the multi-modal information;
(21) Human body detection based on HOG features and SVM classifiers is first performed in a color image. The flow chart of this operation is shown in fig. 2, and the specific steps are as follows:
(211) A multi-scale sliding window scans the color image. First, the original color image is enlarged and reduced at a ratio of 1.05 (generally within the interval 1.01-1.5) to obtain color images at multiple scales; then a sliding window of fixed size (64×128) slides over each color image to check whether the window contains a human body.
(212) HOG features within the window are extracted. The HOG features are extracted as follows:
(2121) Graying and gray normalization. Since HOG features mainly describe edge gradients, color information contributes little; to reduce the effect of illumination changes, the whole color image is first grayed and then normalized.
(2122) The gradient of each pixel in the color image is calculated. The gradients of the color image in the x and y directions at (x, y) are given by formulas (4) and (5); the gradient magnitude is $G(x,y)=\sqrt{G_x(x,y)^2+G_y(x,y)^2}$ and the gradient direction is $\theta(x,y)=\arctan\big(G_y(x,y)/G_x(x,y)\big)$.

$$G_x(x,y) = I(x+1,y) - I(x-1,y) \tag{4}$$

$$G_y(x,y) = I(x,y+1) - I(x,y-1) \tag{5}$$
(2123) The pixels are divided into 8×8 cells, the gradient information of all pixels in one cell is counted, and the result is represented by a histogram of gradient directions. The orientation channels of the histogram are uniformly distributed between 0°-180° (unsigned gradient) or 0°-360° (signed gradient). To reduce aliasing, the votes are bilinearly interpolated between neighboring channels in both orientation and position.
(2124) Blocks of 16×16 are divided, and the gradient histograms are contrast-normalized within each block.
(2125) A detection window of 64×128 is set, and its feature vector is generated by combining the feature vectors of all blocks within the window, for subsequent classification.
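The cell, block and window sizes above (8×8 cells, 16×16 blocks, 64×128 window, 9 bins) match the defaults of OpenCV's HOGDescriptor, so the per-window feature of steps (2121)-(2125) can be sketched as follows; the random patch is only a stand-in for a real window:

```python
import cv2
import numpy as np

# 64x128 window, 16x16 blocks, 8x8 block stride, 8x8 cells, 9 orientation bins
hog = cv2.HOGDescriptor((64, 128), (16, 16), (8, 8), (8, 8), 9)

window = np.random.randint(0, 256, (128, 64), dtype=np.uint8)  # stand-in patch
feature = hog.compute(window)
# 7 x 15 block positions x 4 cells x 9 bins = 3780-dimensional vector
print(feature.size)  # 3780
```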
(213) SVM classification. The feature vectors in the detection windows obtained in step (212) are classified with the SVM classifier, giving a probability score (range 0-1) that a human body is present in the detection window.
(214) Human body detection result in the color image. The classification results of all detection windows are combined and non-maximum suppression is performed according to each window's likelihood score, yielding the human body detection result in the color image.
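Steps (211)-(214) together correspond closely to OpenCV's built-in multi-scale HOG detector, which bundles the image pyramid, sliding window, SVM classification and grouping (an NMS-like step). A sketch using the stock pedestrian SVM rather than a self-trained one; frame.png is a placeholder input:

```python
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

color = cv2.imread("frame.png")
# scale=1.05 mirrors the pyramid ratio of step (211)
boxes, weights = hog.detectMultiScale(color, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in boxes:
    cv2.rectangle(color, (x, y), (x + w, y + h), (0, 255, 0), 2)
```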
(22) Secondly, human head detection based on Haar features and an Adaboost cascade classifier is adopted in the depth image. A flowchart of this operation is shown in fig. 3. The method comprises the following specific steps:
(221) A multi-scale sliding window scans the depth image. First, the original depth image is enlarged and reduced at a ratio of 1.05 (generally within the interval 1.01-1.5) to obtain depth images at multiple scales; then a sliding window of fixed size (30×30) slides over each depth image to check whether the window contains a human head.
(222) Haar features within the window are extracted. Haar features are simple rectangular block features of three types: edge features, linear features and diagonal features; the feature value of each rectangular region is the sum of the pixels in the white region minus the sum of the pixels in the black region. Feature computation can be accelerated with an integral image;
(223) Adaboost classification. The classifier is trained with the AdaBoost algorithm, which, given enough data, boosts weak learners into a high-precision strong learner. The weak classifier $h_j(x)$ is shown in formula (6), where $f_j$ is the feature, $\theta_j$ is the threshold, $p_j$ is a parity controlling the direction of the inequality, and x is a 24×24 depth image sub-window. Training runs N rounds and trains N weak classifiers in total; before the n-th round the weights are normalized so that they form a probability distribution. A classifier $h_j$ is trained for each feature $f_j$, each $h_j$ using only a single feature; the classifier $h_n$ with the lowest error is selected and the weights are updated, finally yielding a strong classifier;

$$h_j(x) = \begin{cases} 1, & p_j f_j(x) < p_j \theta_j \\ 0, & \text{otherwise} \end{cases} \tag{6}$$
The cascaded structure then combines these classifiers into a more complex one. A classifier cascade is a chain of strong classifiers; the threshold of each layer is set to minimize false negatives, so that nearly all true targets pass through while most non-target regions are rejected. Front-end classifiers use few features and are fast; back-end classifiers use more features and are slower, but very few windows ever reach the back end, so the overall computation is very fast. The Haar feature vectors in the detection windows obtained in step (222) are classified with the Adaboost classifier, giving a probability score (range 0-1) that a human head is present in the detection window.
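The weight-update loop of step (223) can be made concrete with a schematic decision-stump AdaBoost; this is a sketch of formula (6) and the Viola-Jones style update, not the integral-image-optimized trainer, and feature values and labels are assumed precomputed:

```python
import numpy as np

def adaboost_train(features, labels, n_rounds):
    """Schematic AdaBoost over decision stumps h_j (formula (6)).
    features: (n_samples, n_features) Haar feature values; labels: 0/1.
    Returns a list of (alpha, feature_index, theta, parity)."""
    n = len(labels)
    w = np.full(n, 1.0 / n)
    strong = []
    for _ in range(n_rounds):
        w /= w.sum()  # normalize: weights form a probability distribution
        best = None
        for j in range(features.shape[1]):           # one stump per feature
            for theta in np.unique(features[:, j]):  # candidate thresholds
                for parity in (1, -1):               # direction of inequality
                    pred = (parity * features[:, j] < parity * theta).astype(int)
                    err = float(np.sum(w * (pred != labels)))
                    if best is None or err < best[0]:
                        best = (err, j, theta, parity, pred)
        err, j, theta, parity, pred = best
        beta = err / (1.0 - err + 1e-12)
        w *= np.where(pred == labels, beta, 1.0)  # shrink correctly classified
        strong.append((np.log(1.0 / (beta + 1e-12)), j, theta, parity))
    return strong
```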
(224) Human head detection result in the depth image. The classification results of all detection windows are combined and non-maximum suppression is performed according to each window's likelihood score, yielding the human head detection result in the depth image.
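In practice this detector maps onto OpenCV's CascadeClassifier; a sketch, where head_cascade.xml is a hypothetical cascade trained on depth-image heads (OpenCV only bundles face and body cascades) and the depth image is compressed to 8 bit first:

```python
import cv2

depth = cv2.imread("depth.png", cv2.IMREAD_UNCHANGED)  # 16-bit depth frame
depth8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")

cascade = cv2.CascadeClassifier("head_cascade.xml")  # hypothetical head model
heads = cascade.detectMultiScale(depth8, scaleFactor=1.05, minSize=(30, 30))
```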
(23) According to the fusion detection of the space proportion information, the two detection results are fused according to the head-body ratio of the person to obtain a detection result of fusing the multi-mode information, and the flow chart is shown in fig. 4, and the specific steps are as follows:
(231) And obtaining a head and body detection result. Acquiring body frames in the color image from the step (21), acquiring head frames in the depth image from the step (22), then traversing the set of body frames, and performing the following operation on each body frame;
(232) Judging whether a head frame exists in the body frame, and deleting the body frame if the head frame does not exist in the body frame; if so, executing the next step;
(233) Judging whether the number of head frames in the body frame is 1, if so, associating the body frame and the head frame to form a multi-mode combined human body detection; if the number of the head frames in the body frame exceeds one, selecting an optimal head frame according to the position of the head frame and the respective confidence level, and then associating the optimal head frame with the current body frame;
The body detected in the color image alone and the head detected in the depth image alone may be false detections (a non-target region detected as the target) or missed detections (a target not detected). To make the results more reliable, the RGB-D information must be fused, i.e. the body frames in the color image with the head frames in the depth image. By tuning parameters, the independent detection stage keeps as many candidate targets as possible, reducing misses; the fusion stage then screens body and head frames by the head-to-body proportion of most normal people (about 1:7), requiring exactly one head frame per body frame in the final result. This eliminates most false detections, greatly lowering the false-detection probability while improving accuracy.
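A sketch of this fusion rule (steps (231)-(233)); the box format (x, y, w, h, score) and the containment test are illustrative:

```python
def fuse_detections(body_boxes, head_boxes):
    """Keep only body boxes containing exactly one (or one best) head box."""
    def contains(body, head):
        bx, by, bw, bh, _ = body
        hx, hy, hw, hh, _ = head
        return bx <= hx and by <= hy and hx + hw <= bx + bw and hy + hh <= by + bh

    fused = []
    for body in body_boxes:
        inside = [h for h in head_boxes if contains(body, h)]
        if not inside:
            continue  # no head inside: discard the body box (step (232))
        # several heads: prefer higher confidence, then higher placement
        best = max(inside, key=lambda h: (h[4], -h[1]))
        fused.append((body, best))  # one head per body (step (233))
    return fused
```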
(3) Human body tracking based on multi-modal information awareness: first, models of the tracked object are initialized in the color image and the depth image; second, the body and the head are tracked in the color image and the depth image respectively with the kernelized correlation filter algorithm; finally, when confidence is high during tracking, the tracked-object model is updated to adapt to changes of the tracked object. The flowchart of this process is shown in fig. 5; the specific steps are:
(31) A model of the tracked object is built in the color image and in the depth image. In the color image the model is a color histogram; in the depth image it is a depth template picture. The color histogram is extracted as follows: first convert RGB to the HSV color space, where H is hue, S is saturation and V is brightness; then extract the H channel according to formula (7) and count the distribution of H values in the window to form the color histogram. The depth template picture is extracted by cropping the head bounding box before tracking starts and scaling it to a standard size, as the template picture for head tracking in the depth image;
$$R' = R/255,\qquad G' = G/255,\qquad B' = B/255$$

$$C_{max} = \max(R',G',B'),\qquad C_{min} = \min(R',G',B'),\qquad \Delta = C_{max} - C_{min}$$

$$H = \begin{cases} 0^\circ, & \Delta = 0 \\ 60^\circ \times \left(\dfrac{G'-B'}{\Delta} \bmod 6\right), & C_{max} = R' \\ 60^\circ \times \left(\dfrac{B'-R'}{\Delta} + 2\right), & C_{max} = G' \\ 60^\circ \times \left(\dfrac{R'-G'}{\Delta} + 4\right), & C_{max} = B' \end{cases} \tag{7}$$
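A sketch of the model initialization of step (31) with OpenCV; the 30×30 template size is an assumption, since the patent only says "standard size":

```python
import cv2

def init_models(color, depth, body_box, head_box):
    """Build the color-histogram and depth-template models (step (31))."""
    x, y, w, h = body_box
    hsv = cv2.cvtColor(color[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [180], [0, 180])  # H-channel histogram
    cv2.normalize(hist, hist, 0, 1, cv2.NORM_MINMAX)
    hx, hy, hw, hh = head_box
    template = cv2.resize(depth[hy:hy + hh, hx:hx + hw], (30, 30))  # assumed size
    return hist, template
```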
(32) The body in the color image and the head in the depth image are tracked simultaneously with the KCF (Kernelized Correlation Filter) algorithm. Specifically: pixel values around the tracked object are extracted with a circulant matrix as training samples, a discriminant function is trained by ridge regression, and the samples are mapped into a kernel space by a kernel transform to handle linearly inseparable samples. The circulant sample matrix can be diagonalized with the discrete Fourier transform in Fourier space, so that element-wise vector products replace matrix operations, especially matrix inversion, greatly increasing computation speed.
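KCF is available in opencv-contrib, so the two trackers of step (32) can be sketched as below; depending on the OpenCV build the factory may live under cv2.legacy, and the depth image is assumed already converted to 8 bit:

```python
import cv2

def start_trackers(color, depth8, body_box, head_box):
    """One KCF tracker per modality (step (32)); boxes are (x, y, w, h)."""
    body_tracker = cv2.TrackerKCF_create()  # or cv2.legacy.TrackerKCF_create()
    head_tracker = cv2.TrackerKCF_create()
    body_tracker.init(color, tuple(body_box))
    head_tracker.init(depth8, tuple(head_box))
    return body_tracker, head_tracker

# per frame:
#   ok_body, body_box = body_tracker.update(color)
#   ok_head, head_box = head_tracker.update(depth8)
```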
(33) The model of the tracked object is matched and updated during tracking. Matching computes the normalized correlation coefficient between the tracked object and the initial tracked-object model, as shown in formulas (8) and (9), where d is the normalized correlation coefficient of the color histograms $H_1$ and $H_2$, and R is that of the depth template pictures T and I, with $T' = T - \bar{T}$ and $I' = I - \bar{I}$. Both values lie in the range [0, 1]; the larger the value, the better the match, with 0 the worst. If the matching value exceeds 0.9, i.e. the algorithm has high confidence in the tracking result, the tracked-object model is updated by weighting: the initial model has weight 1-w and the current tracked-object model weight w, where w = 0.5×d or w = 0.5×R.

$$d(H_1,H_2) = \frac{\sum_i \big(H_1(i)-\bar{H}_1\big)\big(H_2(i)-\bar{H}_2\big)}{\sqrt{\sum_i \big(H_1(i)-\bar{H}_1\big)^2 \sum_i \big(H_2(i)-\bar{H}_2\big)^2}} \tag{8}$$

$$R(T,I) = \frac{\sum_{x,y} T'(x,y)\, I'(x,y)}{\sqrt{\sum_{x,y} T'(x,y)^2 \sum_{x,y} I'(x,y)^2}} \tag{9}$$
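Formulas (8) and (9) correspond to OpenCV's HISTCMP_CORREL histogram comparison and TM_CCOEFF_NORMED template matching, so the matching-and-update rule can be sketched as follows; blending the depth template this way is an assumption, since the patent describes the weighted update generically:

```python
import cv2

def match_and_update(hist0, hist, template0, patch, w_max=0.5):
    """Normalized-correlation matching (formulas (8), (9)) plus the
    weighted model update of step (33)."""
    d = cv2.compareHist(hist0, hist, cv2.HISTCMP_CORREL)  # formula (8)
    patch = cv2.resize(patch, template0.shape[::-1])
    r = cv2.matchTemplate(patch.astype("float32"),
                          template0.astype("float32"),
                          cv2.TM_CCOEFF_NORMED)[0, 0]     # formula (9)
    if d > 0.9:  # high confidence: blend histograms with w = 0.5 * d
        w = w_max * d
        hist0 = (1 - w) * hist0 + w * hist
    if r > 0.9:  # high confidence: blend depth templates with w = 0.5 * R
        w = w_max * r
        template0 = ((1 - w) * template0 + w * patch).astype(template0.dtype)
    return d, r, hist0, template0
```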
(4) The tracking mechanism is refined with the tracked-object model and the spatial constraint of the head-to-body ratio: first, model features of the tracked object are continuously extracted during tracking and matched against the initial tracked-object model to judge whether tracking is valid; second, if one tracker fails while the other is still valid, the still-valid tracking result is used for a short time, and the position of the lost object is searched within a range given by the head-to-body spatial constraint so that tracking is recovered in time; finally, if both trackers fail, the algorithm must stop and a warning is issued to the user. The flowchart of this step is shown in fig. 6; the specific steps are:
(41) During tracking, the validity of tracking is judged from the normalized correlation coefficients described in step (33). First judge whether head tracking is valid, i.e. whether R in formula (9) is larger than 0.5; if so, go to step (42), otherwise go to step (43);
(42) Judge whether body tracking is valid, i.e. whether d in formula (8) is larger than 0.5; if so, body tracking in the current color image and head tracking in the depth image are both valid and the normal tracking process continues; otherwise, go to step (44);
(43) Judge whether body tracking is valid, i.e. whether d in formula (8) is larger than 0.5; if so, head tracking in the depth image has failed while body tracking in the color image is still valid: the head position is estimated from the body position using the spatial position constraint of head and body, matching of the head model continues, and head tracking is recovered once a match succeeds; otherwise, go to step (45);
(44) Here head tracking in the depth image is valid while body tracking in the color image has failed: the approximate body position is estimated from the head position using the spatial position constraint of head and body, matching of the body color histogram continues, and body tracking is recovered once a match succeeds;
(45) Here both head tracking and body tracking have failed, indicating that the tracked object is absent from the color image and depth image due to occlusion or rapid movement; in this case the tracking algorithm stops and a warning must be given so the user can respond appropriately.
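The four-way decision of steps (41)-(45) reduces to a small state machine; a sketch, where head_from_body and body_from_head stand for hypothetical re-estimation routines applying the ~1:7 head-to-body spatial constraint:

```python
def tracking_step(r_head, d_body, head_from_body, body_from_head):
    """Validity logic of steps (41)-(45); thresholds follow the patent."""
    head_ok, body_ok = r_head > 0.5, d_body > 0.5
    if head_ok and body_ok:
        return "track"            # (42): both trackers valid, continue normally
    if body_ok:
        head_from_body()          # (43): head lost, search near body-predicted spot
        return "recover_head"
    if head_ok:
        body_from_head()          # (44): body lost, estimate from head position
        return "recover_body"
    return "stop_and_warn"        # (45): both lost, stop and warn the user
```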

Claims (7)

1. A human body detection and tracking method based on multi-mode information perception is characterized in that: the method comprises the following steps:
(1) Calibrating a color camera and a depth camera, carrying out data filtering treatment, aligning a color image and a depth image through calibration, and then respectively carrying out filtering treatment;
(2) Human body detection based on multi-modal information perception: detecting a body in the color image, detecting a head in the depth image, and fusing according to the space proportion information; the method comprises the following steps:
(21) Scanning the color image with a sliding window, extracting HOG features within the window, and judging with a trained SVM classifier whether the window contains a human body, to obtain all windows in the color image that may contain a human body;
(22) Using a sliding window likewise in the depth image, extracting Haar features within the window, and classifying with an Adaboost classifier whether the window is a human head, to obtain all windows in the depth image that may contain a human head;
(23) Fusion detection according to the spatial proportion information: fusing the two detection results of step (21) and step (22) according to the head-to-body ratio of a person, to obtain a detection result that fuses the multi-modal information;
(3) Human body tracking based on multi-modal information awareness: tracking the body and the head in the color image and the depth image respectively with a kernelized correlation filter (KCF) target tracking algorithm, and establishing a tracked-object model to check the tracking result;
(4) And (3) perfecting a tracking mechanism by utilizing the space constraint of the tracking object model and the head-body ratio, and maintaining the tracking stability according to the space position constraint of the head and the body if a single tracker fails in the tracking process.
2. The human body detecting and tracking method based on multi-modal information sensing as set forth in claim 1, wherein the step (1) includes the steps of:
(11) Shooting a plurality of checkerboard pictures with different angles and different distances by using a color camera and a depth camera respectively, so as to ensure that each position of an image can be covered;
(12) Detecting and matching corner points across the different images, and calculating the intrinsic and extrinsic parameter matrices of the color camera from the matched corner pairs, wherein the color camera intrinsic matrix is:

$$K = \begin{bmatrix} f_x & s & x_0 \\ 0 & f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix}$$

wherein fx, fy are the focal lengths, (x0, y0) are the principal point coordinates relative to the imaging plane, and s is the axis skew parameter;

the color camera extrinsic matrix is:

$$\begin{bmatrix} R & t \end{bmatrix}$$

wherein R is a 3×3 rotation matrix and t is a 3×1 translation vector, both in the world coordinate system;
(13) Mapping the depth values onto the color image;

let P be a spatial point, $p_{rgb}$ and $p_{ir}$ its coordinates in the color image and the depth image, $K_{rgb}$ and $K_{ir}$ the intrinsic matrices of the color camera and the depth camera, and $R_{ir}$ and $t_{ir}$ the extrinsic parameters of the depth camera with the color camera as the reference frame; the depth value of point P is mapped into color-image coordinates by:

$$p_{rgb} = K_{rgb} R_{ir} K_{ir}^{-1} p_{ir} + t_{ir}$$
(14) Downscaling the registered color image and depth image simultaneously, removing high-frequency noise from the color image with Gaussian filtering, and removing depth-missing points from the depth image with median filtering.
3. The human body detection and tracking method based on multi-modal information sensing as claimed in claim 1, wherein the step (21) includes the steps of:
(211) The method comprises the steps of scanning a color image through a multi-scale sliding window, and firstly enlarging and reducing an original color image according to a preset proportion to obtain a multi-scale color image; then sliding on each color image with a sliding window with a fixed size, and checking whether the window contains a human body or not;
(212) The HOG features in the window are extracted, and the HOG features are extracted as follows:
(2121) Graying and gray normalization;
firstly, graying the whole color image, and then normalizing;
(2122) Calculating the gradient of each pixel in the color image;

the gradients of the color image in the x and y directions at (x, y) are:

$$G_x(x,y) = I(x+1,y) - I(x-1,y)$$

$$G_y(x,y) = I(x,y+1) - I(x,y-1)$$

the gradient magnitude is $G(x,y) = \sqrt{G_x(x,y)^2 + G_y(x,y)^2}$ and the gradient direction is $\theta(x,y) = \arctan\big(G_y(x,y)/G_x(x,y)\big)$;
(2123) Dividing the pixels into cells of 8×8, counting the gradient information of all pixels in each cell, and representing the result with a histogram of gradient directions;
(2124) Dividing blocks of 16×16 and contrast-normalizing the gradient histograms within each block;
(2125) Setting a detection window of 64×128 and generating its feature vector by combining the feature vectors of all blocks lying within the window, for subsequent classification;
scaling the original color image to form an image pyramid, sliding the detection window over the image at each scale, and classifying each position with the trained SVM classifier to judge whether a human body is present there; finally, applying non-maximum suppression to the results to eliminate multiple detection windows on the same target.
4. The human detection and tracking method based on multi-modal information sensing as set forth in claim 1, wherein the step (22) includes the steps of:
(221) The method comprises the steps of scanning a depth image by a multi-scale sliding window, and firstly enlarging and reducing an original depth image according to a preset proportion to obtain a depth image with multiple scales; then sliding on each depth image by a sliding window with a fixed size, and checking whether the window contains a human head or not;
(222) Extracting Haar features in the window;
Haar features are simple rectangular block features, divided into edge features, linear features and diagonal features; the feature value of each rectangular region is the sum of the pixels in the white region minus the sum of the pixels in the black region;
(223) Adaboost classification, training a classifier by using an AdaBoost algorithm;
The AdaBoost algorithm, given enough data, boosts weak learners into a high-precision strong learner; the weak classifier $h_j(x)$ is:

$$h_j(x) = \begin{cases} 1, & p_j f_j(x) < p_j \theta_j \\ 0, & \text{otherwise} \end{cases}$$

wherein $f_j$ is the feature, $\theta_j$ is the threshold, $p_j$ controls the direction of the inequality, and x is a 24×24 image sub-window; training runs N rounds and trains N weak classifiers in total, the weights being normalized into a probability distribution before the n-th round; a classifier $h_j$ is trained for each feature $f_j$, each $h_j$ using only a single feature; the classifier $h_n$ with the lowest error is selected and the weights are updated, finally yielding a strong classifier;
classifying Haar feature vectors in the detection window obtained in the step (222) by using an Adaboost classifier to give a likelihood score of the existence of a human head in the detection window;
(224) And (3) integrating the detection results of the human head in the depth image, and carrying out non-maximum suppression according to the probability score of each window to obtain the detection result of the human head in the depth image.
5. The human detection and tracking method based on multi-modal information sensing as claimed in claim 1, wherein the step (23) includes the steps of:
(231) Acquiring a head and body detection result, acquiring a body frame in the color image from the step (21), acquiring a head frame in the depth image from the step (22), traversing a set of body frames, and executing the following operation on each body frame;
(232) Judging whether a head frame exists within the body frame; if not, deleting the body frame and returning to step (231); if so, executing step (233);
(233) Judging whether the number of head frames in the body frame is 1, if so, associating the body frame and the head frame to form a multi-mode combined human body detection; if the number of head frames in the body frame exceeds one, an optimal head frame is selected according to the position of the head frame and the respective confidence level, and then the optimal head frame is associated with the current body frame.
6. The human body detecting and tracking method based on multi-modal information sensing as set forth in claim 1, wherein the step (3) includes the steps of:
(31) Establishing a model of the tracking object in the color image and the depth image;
in the color map the model is a color histogram, and in the depth map it is a depth template picture; the color histogram is extracted as follows: first convert RGB to the HSV color space, where H is hue, S is saturation and V is brightness, then extract the H channel according to the following formula and count the distribution of H values in the window to form the color histogram; the depth template picture is extracted by cropping the head bounding box before tracking starts and scaling it to a standard size, to serve as the template picture for head tracking in the depth image;
$$R' = R/255,\qquad G' = G/255,\qquad B' = B/255$$

$$C_{max} = \max(R',G',B'),\qquad C_{min} = \min(R',G',B'),\qquad \Delta = C_{max} - C_{min}$$

$$H = \begin{cases} 0^\circ, & \Delta = 0 \\ 60^\circ \times \left(\dfrac{G'-B'}{\Delta} \bmod 6\right), & C_{max} = R' \\ 60^\circ \times \left(\dfrac{B'-R'}{\Delta} + 2\right), & C_{max} = G' \\ 60^\circ \times \left(\dfrac{R'-G'}{\Delta} + 4\right), & C_{max} = B' \end{cases}$$
(32) Tracking the body in the color map and the head in the depth map simultaneously with the KCF (kernelized correlation filter) algorithm; specifically: pixel values around the tracked object are extracted with a circulant matrix as training samples, a discriminant function is trained by ridge regression, and the samples are mapped into a kernel space by a kernel transform to handle linearly inseparable samples;
(33) Matching and updating the object model during tracking; matching computes the normalized correlation coefficient between the tracked object and the initial model:

$$d(H_1,H_2) = \frac{\sum_i \big(H_1(i)-\bar{H}_1\big)\big(H_2(i)-\bar{H}_2\big)}{\sqrt{\sum_i \big(H_1(i)-\bar{H}_1\big)^2 \sum_i \big(H_2(i)-\bar{H}_2\big)^2}}$$

$$R(T,I) = \frac{\sum_{x,y} T'(x,y)\, I'(x,y)}{\sqrt{\sum_{x,y} T'(x,y)^2 \sum_{x,y} I'(x,y)^2}},\qquad T' = T-\bar{T},\ I' = I-\bar{I}$$

wherein d is the normalized correlation coefficient of the color histograms $H_1$ and $H_2$, and R is that of the depth template pictures T and I; both values lie in the range [0, 1], a larger value meaning a better match and 0 the worst; if the matching value is larger than 0.9, i.e. the algorithm has high confidence in the tracking result, the model is updated by weighting: the initial model has weight 1-w and the current tracked-object model weight w, where w = 0.5×d or w = 0.5×R.
7. The human body detecting and tracking method based on multi-modal information sensing as set forth in claim 1, wherein the step (4) includes the steps of:
(41) During tracking, judging the validity of tracking from the normalized correlation coefficients of step (3); first judging whether head tracking is valid, i.e. whether the normalized correlation coefficient R of the depth template pictures T and I is larger than 0.5; if so, going to step (42), otherwise going to step (43);
(42) Judging whether body tracking is valid, i.e. whether the normalized correlation coefficient d of the color histograms $H_1$ and $H_2$ is larger than 0.5; if so, body tracking in the current color image and head tracking in the depth image are both valid and the normal tracking process continues; otherwise, going to step (44);
(43) Judging whether body tracking is valid, i.e. whether d is larger than 0.5; if so, head tracking in the depth image has failed while body tracking in the color image is still valid, so the head position is estimated from the body position using the spatial position constraint of head and body, matching of the head model continues, and head tracking is recovered once a match succeeds; otherwise, going to step (45);
(44) Here head tracking in the depth image is valid while body tracking in the color image has failed; the approximate body position is estimated from the head position using the spatial position constraint of head and body, matching of the body color histogram continues, and body tracking is recovered once a match succeeds;
(45) Here both head tracking and body tracking have failed, indicating that the tracked object is absent from the color image and depth image due to occlusion or rapid movement; in this case the tracking algorithm stops and a warning must be given so the user can respond appropriately.
CN201911146615.4A 2019-11-21 2019-11-21 Human body detection and tracking method based on multi-mode information perception Active CN111144207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911146615.4A CN111144207B (en) 2019-11-21 2019-11-21 Human body detection and tracking method based on multi-mode information perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911146615.4A CN111144207B (en) 2019-11-21 2019-11-21 Human body detection and tracking method based on multi-mode information perception

Publications (2)

Publication Number Publication Date
CN111144207A CN111144207A (en) 2020-05-12
CN111144207B (en) 2023-07-07

Family

ID=70517199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911146615.4A Active CN111144207B (en) 2019-11-21 2019-11-21 Human body detection and tracking method based on multi-mode information perception

Country Status (1)

Country Link
CN (1) CN111144207B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667518B (en) * 2020-06-24 2023-10-31 北京百度网讯科技有限公司 Face image display method and device, electronic equipment and storage medium
CN111968087B (en) * 2020-08-13 2023-11-07 中国农业科学院农业信息研究所 Plant disease area detection method
CN112150448B (en) * 2020-09-28 2023-09-26 杭州海康威视数字技术股份有限公司 Image processing method, device and equipment and storage medium
TWI798663B (en) * 2021-03-22 2023-04-11 伍碩科技股份有限公司 Depth image compensation method and system
US20220350342A1 (en) * 2021-04-25 2022-11-03 Ubtech North America Research And Development Center Corp Moving target following method, robot and computer-readable storage medium
CN113393401B (en) * 2021-06-24 2023-09-05 上海科技大学 Object detection hardware accelerator, system, method, apparatus and medium
CN116580828B (en) * 2023-05-16 2024-04-02 深圳弗瑞奇科技有限公司 Visual monitoring method for full-automatic induction identification of cat health

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503615A (en) * 2016-09-20 2017-03-15 北京工业大学 Indoor human body detecting and tracking and identification system based on multisensor

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102136139B (en) * 2010-01-22 2016-01-27 三星电子株式会社 Targeted attitude analytical equipment and targeted attitude analytical approach thereof
CN102800126A (en) * 2012-07-04 2012-11-28 浙江大学 Method for recovering real-time three-dimensional body posture based on multimodal fusion
US9542626B2 (en) * 2013-09-06 2017-01-10 Toyota Jidosha Kabushiki Kaisha Augmenting layer-based object detection with deep convolutional neural networks
CN107093182B (en) * 2017-03-23 2019-10-11 东南大学 A kind of human height's estimation method based on feature corners
CN107197384B (en) * 2017-05-27 2019-08-02 北京光年无限科技有限公司 The multi-modal exchange method of virtual robot and system applied to net cast platform

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503615A (en) * 2016-09-20 2017-03-15 北京工业大学 Indoor human body detecting and tracking and identification system based on multisensor

Also Published As

Publication number Publication date
CN111144207A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN111144207B (en) Human body detection and tracking method based on multi-mode information perception
US10684681B2 (en) Neural network image processing apparatus
JP6305171B2 (en) How to detect objects in a scene
WO2016015547A1 (en) Machine vision-based method and system for aircraft docking guidance and aircraft type identification
CN106682603B (en) Real-time driver fatigue early warning system based on multi-source information fusion
Liu et al. A novel distance estimation method leading a forward collision avoidance assist system for vehicles on highways
US20220180534A1 (en) Pedestrian tracking method, computing device, pedestrian tracking system and storage medium
KR20110064117A (en) Method for determining frontal pose of face
TW201405486A (en) Real time detecting and tracing objects apparatus using computer vision and method thereof
KR101903127B1 (en) Gaze estimation method and apparatus
CN107203743B (en) Face depth tracking device and implementation method
CN112784712B (en) Missing child early warning implementation method and device based on real-time monitoring
CN113850865A (en) Human body posture positioning method and system based on binocular vision and storage medium
CN114399675A (en) Target detection method and device based on machine vision and laser radar fusion
CN111965636A (en) Night target detection method based on millimeter wave radar and vision fusion
CN109886195B (en) Skin identification method based on near-infrared monochromatic gray-scale image of depth camera
CN108021926A (en) A kind of vehicle scratch detection method and system based on panoramic looking-around system
CN110909561A (en) Eye state detection system and operation method thereof
CN111524183A (en) Target row and column positioning method based on perspective projection transformation
CN114612933B (en) Monocular social distance detection tracking method
Spremolla et al. RGB-D and thermal sensor fusion-application in person tracking
Sun et al. Automatic targetless calibration for LiDAR and camera based on instance segmentation
CN112183287A (en) People counting method of mobile robot under complex background
CN113723432B (en) Intelligent identification and positioning tracking method and system based on deep learning
CN110826495A (en) Body left and right limb consistency tracking and distinguishing method and system based on face orientation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant