CN111144207B - Human body detection and tracking method based on multi-mode information perception - Google Patents

Human body detection and tracking method based on multi-mode information perception

Info

Publication number
CN111144207B
CN111144207B (application CN201911146615.4A; also published as CN111144207A)
Authority
CN
China
Prior art keywords
tracking
head
depth
color
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911146615.4A
Other languages
Chinese (zh)
Other versions
CN111144207A (en
Inventor
周波 (Zhou Bo)
黄文超 (Huang Wenchao)
甘亚辉 (Gan Yahui)
房芳 (Fang Fang)
钱堃 (Qian Kun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201911146615.4A priority Critical patent/CN111144207B/en
Publication of CN111144207A publication Critical patent/CN111144207A/en
Application granted granted Critical
Publication of CN111144207B publication Critical patent/CN111144207B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/40Image enhancement or restoration by the use of histogram techniques
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20024Filtering details
    • G06T2207/20032Median filtering

Abstract

The invention discloses a human body detection and tracking method based on multi-mode information perception, which comprises the following steps: calibrating a color camera and a depth camera and filtering their data; detecting the human body and the human head in the color image and the depth image respectively, and fusing the two detection results according to the spatial proportion of head to body; tracking the body and the head in the color image and the depth image respectively with a kernelized correlation filter (KCF) tracking algorithm, and establishing a model of the tracked object; and refining the tracking mechanism with the tracked-object model and the spatial constraint of the head-to-body ratio. Based on multi-mode information perception, the method overcomes the shortcomings of purely visual target detection and tracking, has wide application in the field of indoor service robots, and supports functions such as human-robot interaction and user following.

Description

Human body detection and tracking method based on multi-mode information perception
Technical Field
The invention belongs to the field of indoor service robot applications, and particularly relates to a human body detection and tracking method based on multi-mode information perception, in particular a long-term, robust detection and tracking method for unstructured indoor environments with changing illumination.
Background
With the development and maturation of computer vision technology and the rise of artificial intelligence, intelligent service robots, especially indoor mobile service robots, are applied ever more widely. In indoor environments, robots need to perceive complex unstructured scenes and interact with humans; visual information alone is not sufficient to cope with dynamic changes of ambient lighting. The RGB-D camera, a novel vision sensor, can simultaneously provide high-resolution color and depth images and is an excellent tool for human-robot interaction. An efficient method is therefore needed to make full use of multi-modal information for detection and tracking.
Existing target detection and tracking methods mostly rely on cameras or lasers. A two-dimensional laser directly acquires the geometric information of the environment with high precision and fast processing, but it provides only a small amount of information, can only extract simple shape features, and is easily confused by similar objects in the environment. Camera-based detection and tracking methods can be further divided into methods based on hand-crafted features and methods based on deep learning. Hand-crafted-feature methods extract predefined features, train a classifier, and detect with a sliding window; their computational cost is controllable and the extracted features have clear meaning, but their accuracy is limited. Deep-learning methods achieve higher precision but are computationally heavy and cannot run in real time on ordinary computing platforms.
In general, the conventional target detection and tracking methods above have the following problems: 1) it is difficult to strike a satisfactory balance between accuracy and real-time performance; 2) using only a single information source, color or depth, cannot achieve detection and tracking in complex environments; 3) they lack analysis and handling of short-term algorithm failure, so the target is easily lost and robustness is poor.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a human body detection and tracking method based on multi-mode information perception, solving the problem of real-time, robust human body detection and tracking in complex environments.
The technical scheme is as follows: in order to achieve the above purpose, the present invention adopts the following technical scheme:
human body detection and tracking method based on multi-mode information perception: the method comprises the following steps:
(1) Calibrating a color camera and a depth camera, carrying out data filtering treatment, aligning a color image and a depth image through calibration, and then respectively carrying out filtering treatment;
(2) Human body detection based on multi-modal information perception: detecting a body in the color image, detecting a head in the depth image, and fusing according to the space proportion information;
(3) Human body tracking based on multi-modal information awareness: tracking the body and the head in the color image and the depth image respectively with a kernelized correlation filter (KCF) target tracking algorithm, and establishing a tracked-object model to check the tracking result;
(4) And (3) perfecting a tracking mechanism by utilizing the space constraint of the tracking object model and the head-body ratio, and maintaining the tracking stability according to the space position constraint of the head and the body if a single tracker fails in the tracking process.
Further, the step (1) includes the steps of:
(11) Shooting a plurality of checkerboard pictures with different angles and different distances by using a color camera and a depth camera respectively, so as to ensure that each position of an image can be covered;
(12) Detecting and matching corner points across the different images, and calculating the intrinsic and extrinsic parameter matrices of the color camera from the matched corner pairs, wherein the color camera intrinsic matrix is:

$$K = \begin{bmatrix} f_x & s & x_0 \\ 0 & f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix}$$

wherein fx, fy are the focal lengths, (x0, y0) are the principal point coordinates relative to the imaging plane, and s is the axis skew parameter;

the color camera extrinsic matrix is:

$$\begin{bmatrix} R & t \end{bmatrix}$$

wherein R is a 3×3 rotation matrix and t is a 3×1 translation vector, both in the world coordinate system;
(13) Mapping the depth values onto the color image;

let P be a spatial point, $p_{rgb}$ and $p_{ir}$ its coordinates in the color image and the depth image, $K_{rgb}$ and $K_{ir}$ the intrinsic matrices of the color camera and the depth camera, and $R_{ir}$ and $t_{ir}$ the extrinsic parameters of the depth camera with the color camera as the reference frame; the depth value of point P is mapped into color-image coordinates by:

$$p_{rgb} = K_{rgb} R_{ir} K_{ir}^{-1} p_{ir} + t_{ir}$$
(14) Downscaling the registered color image and depth image simultaneously, removing high-frequency noise from the color image with Gaussian filtering, and removing depth-missing points from the depth image with median filtering.
Further, the step (2) includes the steps of:
(21) Scanning the color image with a sliding window, extracting HOG features within the window, and judging with a trained SVM classifier whether the window contains a human body, to obtain all windows in the color image that may contain a human body;
(22) Using a sliding window likewise in the depth image, extracting Haar features within the window, and classifying with an Adaboost classifier whether the window is a human head, to obtain all windows in the depth image that may contain a human head;
(23) Fusion detection according to the spatial proportion information: fusing the two detection results of step (21) and step (22) according to the head-to-body ratio of a person, to obtain a detection result that fuses the multi-modal information.
Still further, step (21) includes the steps of:
(211) The method comprises the steps of scanning a color image through a multi-scale sliding window, and firstly enlarging and reducing an original color image according to a preset proportion to obtain a multi-scale color image; then sliding on each color image with a sliding window with a fixed size, and checking whether the window contains a human body or not;
(212) The HOG features in the window are extracted, and the HOG features are extracted as follows:
(2121) Graying and gray normalization;
firstly, graying the whole color image, and then normalizing;
(2122) Calculating the gradient of each pixel in the color image;

the gradients of the color image in the x and y directions at (x, y) are:

$$G_x(x,y) = I(x+1,y) - I(x-1,y)$$

$$G_y(x,y) = I(x,y+1) - I(x,y-1)$$

the gradient magnitude is $G(x,y) = \sqrt{G_x(x,y)^2 + G_y(x,y)^2}$ and the gradient direction is $\theta(x,y) = \arctan\big(G_y(x,y)/G_x(x,y)\big)$;
(2123) Dividing the pixels into cells of 8×8, counting the gradient information of all pixels in each cell, and representing the result with a histogram of gradient directions;
(2124) Dividing blocks of 16×16 and contrast-normalizing the gradient histograms within each block;
(2125) Setting a detection window of 64×128 and generating its feature vector by combining the feature vectors of all blocks lying within the window, for subsequent classification;
scaling the original color image to form an image pyramid, sliding the detection window over the image at each scale, and classifying each position with the trained SVM classifier to judge whether a human body is present there; finally, applying non-maximum suppression to the results to eliminate multiple detection windows on the same target.
Further, step (22) includes the steps of:
(221) The method comprises the steps of scanning a depth image by a multi-scale sliding window, and firstly enlarging and reducing an original depth image according to a preset proportion to obtain a depth image with multiple scales; then sliding on each depth image by a sliding window with a fixed size, and checking whether the window contains a human head or not;
(222) Extracting Haar features in the window;
Haar features are simple rectangular block features, divided into edge features, linear features and diagonal features; the feature value of each rectangular region is the sum of the pixels in the white region minus the sum of the pixels in the black region;
(223) Adaboost classification, training a classifier by using an AdaBoost algorithm;
the AdaBoost algorithm is a way to boost the weak learner with enough data to generate a high-precision strong learner; weak classifier h j (x) The formula is as follows:
Figure BDA0002282381700000041
wherein f j Is characterized by theta j Is threshold, p j The function of (2) is to control the direction of the inequality, x is the 24 x 24 image sub-window; training for N times, training N weak classifiers together, and adding normalized weights to the N-th training, wherein the weights are probability distribution; training a classifier h for each feature j j ,h j Using only a single feature, the one classifier h with the lowest error is selected n Updating the weight to finally obtain a strong classifier;
classifying Haar feature vectors in the detection window obtained in the step (222) by using an Adaboost classifier to give a likelihood score of the existence of a human head in the detection window;
(224) And (3) integrating the detection results of the human head in the depth image, and carrying out non-maximum suppression according to the probability score of each window to obtain the detection result of the human head in the depth image.
Still further, step (23) includes the steps of:
(231) Acquiring a head and body detection result, acquiring a body frame in the color image from the step (21), acquiring a head frame in the depth image from the step (22), traversing a set of body frames, and executing the following operation on each body frame;
(232) Judging whether a head frame exists within the body frame; if not, deleting the body frame and returning to step (231); if so, executing step (233);
(233) Judging whether the number of head frames in the body frame is 1, if so, associating the body frame and the head frame to form a multi-mode combined human body detection; if the number of head frames in the body frame exceeds one, an optimal head frame is selected according to the position of the head frame and the respective confidence level, and then the optimal head frame is associated with the current body frame.
Further, the step (3) includes the steps of:
(31) Establishing a model of the tracking object in the color image and the depth image;
in the color map the model is a color histogram, and in the depth map it is a depth template picture; the color histogram is extracted as follows: first convert RGB to the HSV color space, where H is hue, S is saturation and V is brightness, then extract the H channel according to the following formula and count the distribution of H values in the window to form the color histogram; the depth template picture is extracted by cropping the head bounding box before tracking starts and scaling it to a standard size, to serve as the template picture for head tracking in the depth image;
$$R' = R/255,\qquad G' = G/255,\qquad B' = B/255$$

$$C_{max} = \max(R',G',B'),\qquad C_{min} = \min(R',G',B'),\qquad \Delta = C_{max} - C_{min}$$

$$H = \begin{cases} 0^\circ, & \Delta = 0 \\ 60^\circ \times \left(\dfrac{G'-B'}{\Delta} \bmod 6\right), & C_{max} = R' \\ 60^\circ \times \left(\dfrac{B'-R'}{\Delta} + 2\right), & C_{max} = G' \\ 60^\circ \times \left(\dfrac{R'-G'}{\Delta} + 4\right), & C_{max} = B' \end{cases}$$
(32) Tracking the body in the color map and the head in the depth map simultaneously with the KCF (kernelized correlation filter) algorithm; specifically: pixel values around the tracked object are extracted with a circulant matrix as training samples, a discriminant function is trained by ridge regression, and the samples are mapped into a kernel space by a kernel transform to handle linearly inseparable samples;
(33) Matching and updating the object model during tracking; matching computes the normalized correlation coefficient between the tracked object and the initial model:

$$d(H_1,H_2) = \frac{\sum_i \big(H_1(i)-\bar{H}_1\big)\big(H_2(i)-\bar{H}_2\big)}{\sqrt{\sum_i \big(H_1(i)-\bar{H}_1\big)^2 \sum_i \big(H_2(i)-\bar{H}_2\big)^2}}$$

$$R(T,I) = \frac{\sum_{x,y} T'(x,y)\, I'(x,y)}{\sqrt{\sum_{x,y} T'(x,y)^2 \sum_{x,y} I'(x,y)^2}},\qquad T' = T-\bar{T},\ I' = I-\bar{I}$$

wherein d is the normalized correlation coefficient of the color histograms $H_1$ and $H_2$, and R is that of the depth template pictures T and I; both values lie in the range [0, 1], a larger value meaning a better match and 0 the worst; if the matching value is larger than 0.9, i.e. the algorithm has high confidence in the tracking result, the model is updated by weighting: the initial model has weight 1-w and the current tracked-object model weight w, where w = 0.5×d or w = 0.5×R.
Further, the step (4) includes the steps of:
(41) During tracking, judging the validity of tracking from the normalized correlation coefficients of step (3); first judging whether head tracking is valid, i.e. whether the normalized correlation coefficient R of the depth template pictures T and I is larger than 0.5; if so, going to step (42), otherwise going to step (43);
(42) Judging whether body tracking is valid, i.e. whether the normalized correlation coefficient d of the color histograms $H_1$ and $H_2$ is larger than 0.5; if so, body tracking in the current color image and head tracking in the depth image are both valid and the normal tracking process continues; otherwise, going to step (44);
(43) Judging whether body tracking is valid, i.e. whether d is larger than 0.5; if so, head tracking in the depth image has failed while body tracking in the color image is still valid, so the head position is estimated from the body position using the spatial position constraint of head and body, matching of the head model continues, and head tracking is recovered once a match succeeds; otherwise, going to step (45);
(44) Here head tracking in the depth image is valid while body tracking in the color image has failed; the approximate body position is estimated from the head position using the spatial position constraint of head and body, matching of the body color histogram continues, and body tracking is recovered once a match succeeds;
(45) Here both head tracking and body tracking have failed, indicating that the tracked object is absent from the color image and the depth image due to occlusion or rapid movement; in this case the tracking algorithm stops and a warning must be given so the user can respond appropriately.
The beneficial effects are that: compared with the prior art, the invention effectively solves real-time, robust human detection and tracking in complex environments based on the multi-modal information acquired by an RGB-D camera. Detecting and tracking the human body with multi-modal information improves the algorithm's adaptability to different ambient lighting conditions compared with using color or depth information alone; fusing the detection results of the color and depth images with the spatial proportion information raises recall, lowers the false-detection rate and improves accuracy; combining the tracking results on the color and depth images and comprehensively exploiting the model features of the tracked object allows results to be verified and recovered during tracking, giving the overall algorithm high robustness. The method is simple and efficient, supports human-robot interaction and user following for indoor service robots, and has a wide application range and good economic benefit.
Drawings
FIG. 1 is an overall flow chart of an algorithm;
FIG. 2 is a flow chart of the color image person detection in step (2) of the present invention;
FIG. 3 is a flow chart of the detection of the human head of the depth image in the step (2) of the present invention;
FIG. 4 is a flow chart of the fusion detection according to the head-to-body ratio in the step (2) of the present invention;
FIG. 5 is a flow chart of step (3) of the present invention;
FIG. 6 is a flow chart of step (4) of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and detailed description.
Fig. 1 is a general flowchart of a human body detecting and tracking method based on multi-mode information sensing according to the present invention, and the implementation steps are as follows:
(1) Calibrating a color camera and a depth camera and performing data filtering;
First, an RGB-D camera acquires color and depth data of the surrounding environment; it comprises a color camera, which acquires a color image containing the R, G, B values of the three colors, and a depth camera, which acquires a depth image containing distance (D) values. Second, because there is a positional offset between the color camera and the depth camera inside the RGB-D camera, camera calibration yields their intrinsic and extrinsic parameter matrices, so that each depth value corresponds one-to-one with a color value. Finally, the color image and the depth image are each filtered to remove bright spots and noise.
The depth camera measures depth as follows: an infrared speckle emitter projects an infrared beam, which is reflected back to the depth camera after hitting an obstacle, and the distance is computed from the geometric relationship of the returned speckle pattern. The depth camera is essentially an ordinary camera with a filter that images only infrared light, so it can be calibrated simply by illuminating an object with an infrared light source. The color camera is calibrated with the checkerboard method: the camera to be calibrated shoots checkerboard pictures from several viewpoints, corner points are detected and matched across the pictures, and the camera intrinsic and extrinsic parameter matrices are solved from the resulting equations. The specific calibration steps for the color camera are:
(11) The RGB-D camera is fixed by using a tripod, and then a color camera is used for shooting checkerboard pictures at a plurality of angles and distances, so that each position of an image can be covered.
(12) Corner points are detected and matched across the different images, and the camera intrinsic and extrinsic parameter matrices are computed from the matched corner pairs. The intrinsic matrix is shown in formula (1), where fx and fy are the focal lengths along the x and y axes, (x0, y0) is the principal point relative to the imaging plane, and s is the axis skew parameter, ideally 0; the extrinsic matrix is shown in formula (2), where R is a 3×3 rotation matrix and t is a 3×1 translation vector, both in the world coordinate system;

$$K = \begin{bmatrix} f_x & s & x_0 \\ 0 & f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix} \tag{1}$$

$$\begin{bmatrix} R & t \end{bmatrix} \tag{2}$$

(13) The depth values in the depth image are mapped onto the color image. Let P be a spatial point, $p_{rgb}$ and $p_{ir}$ its coordinates in the color image and the depth image, $K_{rgb}$ and $K_{ir}$ the intrinsic matrices of the color camera and the depth camera, and $R_{ir}$ and $t_{ir}$ the extrinsic parameters of the depth camera with the color camera as the reference frame; the depth value of point P is mapped into color-image coordinates by formula (3);

$$p_{rgb} = K_{rgb} R_{ir} K_{ir}^{-1} p_{ir} + t_{ir} \tag{3}$$
(14) The registered color image and depth image are simultaneously scaled to 480×270; high-frequency noise is removed from the color image with Gaussian filtering, and depth-missing points are removed from the depth image with median filtering.
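As an illustration of steps (13)-(14), the following is a minimal NumPy/OpenCV sketch of depth-to-color registration and filtering, assuming the calibration matrices K_ir, K_rgb, R_ir, t_ir are already known. Unlike the compact homogeneous form of formula (3), the sketch makes the depth value z explicit and applies the translation in camera coordinates, which is the usual way such a mapping is implemented; all names are illustrative.

```python
import cv2
import numpy as np

def register_depth_to_color(depth, K_ir, K_rgb, R_ir, t_ir, color_shape):
    """Map each depth pixel into color-image coordinates (cf. formula (3))."""
    h, w = depth.shape
    registered = np.zeros(color_shape[:2], dtype=depth.dtype)
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3).T
    z = depth.reshape(-1).astype(np.float64)
    valid = z > 0
    # Back-project depth pixels to 3-D points, move them into the color
    # camera frame, then project with the color intrinsics.
    P_ir = (np.linalg.inv(K_ir) @ pix[:, valid]) * z[valid]
    P_rgb = R_ir @ P_ir + t_ir.reshape(3, 1)
    p = K_rgb @ P_rgb
    u = np.round(p[0] / p[2]).astype(int)
    v = np.round(p[1] / p[2]).astype(int)
    ok = (0 <= u) & (u < color_shape[1]) & (0 <= v) & (v < color_shape[0])
    registered[v[ok], u[ok]] = z[valid][ok]
    return registered

# step (14): shrink to 480x270, then filter each modality
# color = cv2.GaussianBlur(cv2.resize(color, (480, 270)), (5, 5), 0)
# depth = cv2.medianBlur(cv2.resize(registered, (480, 270),
#                                   interpolation=cv2.INTER_NEAREST), 5)
```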
(2) Human body detection based on multi-modal information perception: first, a sliding window scans the color image, HOG (Histogram of Oriented Gradients) features are extracted within the window, and a trained SVM (Support Vector Machine) classifier judges whether the window contains a human body, yielding all windows in the color image that may contain a human body; second, a sliding window is likewise used in the depth image, Haar features are extracted within the window, and an Adaboost classifier classifies whether the window is a human head, yielding all windows in the depth image that may contain a human head; finally, fusion detection according to the spatial proportion information fuses the two detection results according to the head-to-body ratio of a person (about 1:7), obtaining a detection result that fuses the multi-modal information;
(21) Human body detection based on HOG features and SVM classifiers is first performed in a color image. The flow chart of this operation is shown in fig. 2, and the specific steps are as follows:
(211) A multi-scale sliding window scans the color image. First, the original color image is enlarged and reduced at a ratio of 1.05 (generally within the interval 1.01-1.5) to obtain color images at multiple scales; then a sliding window of fixed size (64×128) slides over each color image to check whether the window contains a human body.
(212) HOG features within the window are extracted. The HOG features are extracted as follows:
(2121) Graying and gray normalization. Since HOG features mainly describe edge gradients, color information contributes little; to reduce the effect of illumination changes, the whole color image is first grayed and then normalized.
(2122) The gradient of each pixel in the color image is calculated. The gradients of the color image in the x and y directions at (x, y) are given by formulas (4) and (5); the gradient magnitude is $G(x,y)=\sqrt{G_x(x,y)^2+G_y(x,y)^2}$ and the gradient direction is $\theta(x,y)=\arctan\big(G_y(x,y)/G_x(x,y)\big)$.

$$G_x(x,y) = I(x+1,y) - I(x-1,y) \tag{4}$$

$$G_y(x,y) = I(x,y+1) - I(x,y-1) \tag{5}$$
(2123) The pixels are divided into 8×8 cells, the gradient information of all pixels in one cell is counted, and the result is represented by a histogram of gradient directions. The orientation channels of the histogram are uniformly distributed between 0°-180° (unsigned gradient) or 0°-360° (signed gradient). To reduce aliasing, the votes are bilinearly interpolated between neighboring channels in both orientation and position.
(2124) Blocks of 16×16 are divided, and the gradient histograms are contrast-normalized within each block.
(2125) A detection window of 64×128 is set, and its feature vector is generated by combining the feature vectors of all blocks within the window, for subsequent classification.
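The cell, block and window sizes above (8×8 cells, 16×16 blocks, 64×128 window, 9 bins) match the defaults of OpenCV's HOGDescriptor, so the per-window feature of steps (2121)-(2125) can be sketched as follows; the random patch is only a stand-in for a real window:

```python
import cv2
import numpy as np

# 64x128 window, 16x16 blocks, 8x8 block stride, 8x8 cells, 9 orientation bins
hog = cv2.HOGDescriptor((64, 128), (16, 16), (8, 8), (8, 8), 9)

window = np.random.randint(0, 256, (128, 64), dtype=np.uint8)  # stand-in patch
feature = hog.compute(window)
# 7 x 15 block positions x 4 cells x 9 bins = 3780-dimensional vector
print(feature.size)  # 3780
```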
(213) SVM classification. The feature vectors in the detection windows obtained in step (212) are classified with the SVM classifier, giving a probability score (range 0-1) that a human body is present in the detection window.
(214) Human body detection result in the color image. The classification results of all detection windows are combined and non-maximum suppression is performed according to each window's likelihood score, yielding the human body detection result in the color image.
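Steps (211)-(214) together correspond closely to OpenCV's built-in multi-scale HOG detector, which bundles the image pyramid, sliding window, SVM classification and grouping (an NMS-like step). A sketch using the stock pedestrian SVM rather than a self-trained one; frame.png is a placeholder input:

```python
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

color = cv2.imread("frame.png")
# scale=1.05 mirrors the pyramid ratio of step (211)
boxes, weights = hog.detectMultiScale(color, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in boxes:
    cv2.rectangle(color, (x, y), (x + w, y + h), (0, 255, 0), 2)
```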
(22) Secondly, human head detection based on Haar features and an Adaboost cascade classifier is adopted in the depth image. A flowchart of this operation is shown in fig. 3. The method comprises the following specific steps:
(221) A multi-scale sliding window scans the depth image. First, the original depth image is enlarged and reduced at a ratio of 1.05 (generally within the interval 1.01-1.5) to obtain depth images at multiple scales; then a sliding window of fixed size (30×30) slides over each depth image to check whether the window contains a human head.
(222) Haar features within the window are extracted. Haar features are simple rectangular block features of three types: edge features, linear features and diagonal features; the feature value of each rectangular region is the sum of the pixels in the white region minus the sum of the pixels in the black region. Feature computation can be accelerated with an integral image;
(223) Adaboost classification. The classifier is trained with the AdaBoost algorithm, which, given enough data, boosts weak learners into a high-precision strong learner. The weak classifier $h_j(x)$ is shown in formula (6), where $f_j$ is the feature, $\theta_j$ is the threshold, $p_j$ is a parity controlling the direction of the inequality, and x is a 24×24 depth image sub-window. Training runs N rounds and trains N weak classifiers in total; before the n-th round the weights are normalized so that they form a probability distribution. A classifier $h_j$ is trained for each feature $f_j$, each $h_j$ using only a single feature; the classifier $h_n$ with the lowest error is selected and the weights are updated, finally yielding a strong classifier;

$$h_j(x) = \begin{cases} 1, & p_j f_j(x) < p_j \theta_j \\ 0, & \text{otherwise} \end{cases} \tag{6}$$
The cascaded structure then combines these classifiers into a more complex one. A classifier cascade is a chain of strong classifiers; the threshold of each layer is set to minimize false negatives, so that nearly all true targets pass through while most non-target regions are rejected. Front-end classifiers use few features and are fast; back-end classifiers use more features and are slower, but very few windows ever reach the back end, so the overall computation is very fast. The Haar feature vectors in the detection windows obtained in step (222) are classified with the Adaboost classifier, giving a probability score (range 0-1) that a human head is present in the detection window.
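The weight-update loop of step (223) can be made concrete with a schematic decision-stump AdaBoost; this is a sketch of formula (6) and the Viola-Jones style update, not the integral-image-optimized trainer, and feature values and labels are assumed precomputed:

```python
import numpy as np

def adaboost_train(features, labels, n_rounds):
    """Schematic AdaBoost over decision stumps h_j (formula (6)).
    features: (n_samples, n_features) Haar feature values; labels: 0/1.
    Returns a list of (alpha, feature_index, theta, parity)."""
    n = len(labels)
    w = np.full(n, 1.0 / n)
    strong = []
    for _ in range(n_rounds):
        w /= w.sum()  # normalize: weights form a probability distribution
        best = None
        for j in range(features.shape[1]):           # one stump per feature
            for theta in np.unique(features[:, j]):  # candidate thresholds
                for parity in (1, -1):               # direction of inequality
                    pred = (parity * features[:, j] < parity * theta).astype(int)
                    err = float(np.sum(w * (pred != labels)))
                    if best is None or err < best[0]:
                        best = (err, j, theta, parity, pred)
        err, j, theta, parity, pred = best
        beta = err / (1.0 - err + 1e-12)
        w *= np.where(pred == labels, beta, 1.0)  # shrink correctly classified
        strong.append((np.log(1.0 / (beta + 1e-12)), j, theta, parity))
    return strong
```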
(224) Human head detection result in the depth image. The classification results of all detection windows are combined and non-maximum suppression is performed according to each window's likelihood score, yielding the human head detection result in the depth image.
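In practice this detector maps onto OpenCV's CascadeClassifier; a sketch, where head_cascade.xml is a hypothetical cascade trained on depth-image heads (OpenCV only bundles face and body cascades) and the depth image is compressed to 8 bit first:

```python
import cv2

depth = cv2.imread("depth.png", cv2.IMREAD_UNCHANGED)  # 16-bit depth frame
depth8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")

cascade = cv2.CascadeClassifier("head_cascade.xml")  # hypothetical head model
heads = cascade.detectMultiScale(depth8, scaleFactor=1.05, minSize=(30, 30))
```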
(23) According to the fusion detection of the space proportion information, the two detection results are fused according to the head-body ratio of the person to obtain a detection result of fusing the multi-mode information, and the flow chart is shown in fig. 4, and the specific steps are as follows:
(231) And obtaining a head and body detection result. Acquiring body frames in the color image from the step (21), acquiring head frames in the depth image from the step (22), then traversing the set of body frames, and performing the following operation on each body frame;
(232) Judging whether a head frame exists in the body frame, and deleting the body frame if the head frame does not exist in the body frame; if so, executing the next step;
(233) Judging whether the number of head frames in the body frame is 1, if so, associating the body frame and the head frame to form a multi-mode combined human body detection; if the number of the head frames in the body frame exceeds one, selecting an optimal head frame according to the position of the head frame and the respective confidence level, and then associating the optimal head frame with the current body frame;
The body detected in the color image alone and the head detected in the depth image alone may be false detections (a non-target region detected as the target) or missed detections (a target not detected). To make the results more reliable, the RGB-D information must be fused, i.e. the body frames in the color image with the head frames in the depth image. By tuning parameters, the independent detection stage keeps as many candidate targets as possible, reducing misses; the fusion stage then screens body and head frames by the head-to-body proportion of most normal people (about 1:7), requiring exactly one head frame per body frame in the final result. This eliminates most false detections, greatly lowering the false-detection probability while improving accuracy.
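A sketch of this fusion rule (steps (231)-(233)); the box format (x, y, w, h, score) and the containment test are illustrative:

```python
def fuse_detections(body_boxes, head_boxes):
    """Keep only body boxes containing exactly one (or one best) head box."""
    def contains(body, head):
        bx, by, bw, bh, _ = body
        hx, hy, hw, hh, _ = head
        return bx <= hx and by <= hy and hx + hw <= bx + bw and hy + hh <= by + bh

    fused = []
    for body in body_boxes:
        inside = [h for h in head_boxes if contains(body, h)]
        if not inside:
            continue  # no head inside: discard the body box (step (232))
        # several heads: prefer higher confidence, then higher placement
        best = max(inside, key=lambda h: (h[4], -h[1]))
        fused.append((body, best))  # one head per body (step (233))
    return fused
```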
(3) Human body tracking based on multi-modal information awareness: first, models of the tracked object are initialized in the color image and the depth image; second, the body and the head are tracked in the color image and the depth image respectively with the kernelized correlation filter algorithm; finally, when confidence is high during tracking, the tracked-object model is updated to adapt to changes of the tracked object. The flowchart of this process is shown in fig. 5; the specific steps are:
(31) A model of the tracked object is built in the color image and in the depth image. In the color image the model is a color histogram; in the depth image it is a depth template picture. The color histogram is extracted as follows: first convert RGB to the HSV color space, where H is hue, S is saturation and V is brightness; then extract the H channel according to formula (7) and count the distribution of H values in the window to form the color histogram. The depth template picture is extracted by cropping the head bounding box before tracking starts and scaling it to a standard size, as the template picture for head tracking in the depth image;
$$R' = R/255,\qquad G' = G/255,\qquad B' = B/255$$

$$C_{max} = \max(R',G',B'),\qquad C_{min} = \min(R',G',B'),\qquad \Delta = C_{max} - C_{min}$$

$$H = \begin{cases} 0^\circ, & \Delta = 0 \\ 60^\circ \times \left(\dfrac{G'-B'}{\Delta} \bmod 6\right), & C_{max} = R' \\ 60^\circ \times \left(\dfrac{B'-R'}{\Delta} + 2\right), & C_{max} = G' \\ 60^\circ \times \left(\dfrac{R'-G'}{\Delta} + 4\right), & C_{max} = B' \end{cases} \tag{7}$$
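A sketch of the model initialization of step (31) with OpenCV; the 30×30 template size is an assumption, since the patent only says "standard size":

```python
import cv2

def init_models(color, depth, body_box, head_box):
    """Build the color-histogram and depth-template models (step (31))."""
    x, y, w, h = body_box
    hsv = cv2.cvtColor(color[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [180], [0, 180])  # H-channel histogram
    cv2.normalize(hist, hist, 0, 1, cv2.NORM_MINMAX)
    hx, hy, hw, hh = head_box
    template = cv2.resize(depth[hy:hy + hh, hx:hx + hw], (30, 30))  # assumed size
    return hist, template
```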
(32) The body in the color image and the head in the depth image are tracked simultaneously with the KCF (Kernelized Correlation Filter) algorithm. Specifically: pixel values around the tracked object are extracted with a circulant matrix as training samples, a discriminant function is trained by ridge regression, and the samples are mapped into a kernel space by a kernel transform to handle linearly inseparable samples. The circulant sample matrix can be diagonalized with the discrete Fourier transform in Fourier space, so that element-wise vector products replace matrix operations, especially matrix inversion, greatly increasing computation speed.
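KCF is available in opencv-contrib, so the two trackers of step (32) can be sketched as below; depending on the OpenCV build the factory may live under cv2.legacy, and the depth image is assumed already converted to 8 bit:

```python
import cv2

def start_trackers(color, depth8, body_box, head_box):
    """One KCF tracker per modality (step (32)); boxes are (x, y, w, h)."""
    body_tracker = cv2.TrackerKCF_create()  # or cv2.legacy.TrackerKCF_create()
    head_tracker = cv2.TrackerKCF_create()
    body_tracker.init(color, tuple(body_box))
    head_tracker.init(depth8, tuple(head_box))
    return body_tracker, head_tracker

# per frame:
#   ok_body, body_box = body_tracker.update(color)
#   ok_head, head_box = head_tracker.update(depth8)
```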
(33) The model of the tracked object is matched and updated during tracking. Matching computes the normalized correlation coefficient between the tracked object and the initial tracked-object model, as shown in formulas (8) and (9), where d is the normalized correlation coefficient of the color histograms $H_1$ and $H_2$, and R is that of the depth template pictures T and I, with $T' = T - \bar{T}$ and $I' = I - \bar{I}$. Both values lie in the range [0, 1]; the larger the value, the better the match, with 0 the worst. If the matching value exceeds 0.9, i.e. the algorithm has high confidence in the tracking result, the tracked-object model is updated by weighting: the initial model has weight 1-w and the current tracked-object model weight w, where w = 0.5×d or w = 0.5×R.

$$d(H_1,H_2) = \frac{\sum_i \big(H_1(i)-\bar{H}_1\big)\big(H_2(i)-\bar{H}_2\big)}{\sqrt{\sum_i \big(H_1(i)-\bar{H}_1\big)^2 \sum_i \big(H_2(i)-\bar{H}_2\big)^2}} \tag{8}$$

$$R(T,I) = \frac{\sum_{x,y} T'(x,y)\, I'(x,y)}{\sqrt{\sum_{x,y} T'(x,y)^2 \sum_{x,y} I'(x,y)^2}} \tag{9}$$
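Formulas (8) and (9) correspond to OpenCV's HISTCMP_CORREL histogram comparison and TM_CCOEFF_NORMED template matching, so the matching-and-update rule can be sketched as follows; blending the depth template this way is an assumption, since the patent describes the weighted update generically:

```python
import cv2

def match_and_update(hist0, hist, template0, patch, w_max=0.5):
    """Normalized-correlation matching (formulas (8), (9)) plus the
    weighted model update of step (33)."""
    d = cv2.compareHist(hist0, hist, cv2.HISTCMP_CORREL)  # formula (8)
    patch = cv2.resize(patch, template0.shape[::-1])
    r = cv2.matchTemplate(patch.astype("float32"),
                          template0.astype("float32"),
                          cv2.TM_CCOEFF_NORMED)[0, 0]     # formula (9)
    if d > 0.9:  # high confidence: blend histograms with w = 0.5 * d
        w = w_max * d
        hist0 = (1 - w) * hist0 + w * hist
    if r > 0.9:  # high confidence: blend depth templates with w = 0.5 * R
        w = w_max * r
        template0 = ((1 - w) * template0 + w * patch).astype(template0.dtype)
    return d, r, hist0, template0
```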
(4) The tracking mechanism is refined with the tracked-object model and the spatial constraint of the head-to-body ratio: first, model features of the tracked object are continuously extracted during tracking and matched against the initial tracked-object model to judge whether tracking is valid; second, if one tracker fails while the other is still valid, the still-valid tracking result is used for a short time, and the position of the lost object is searched within a range given by the head-to-body spatial constraint so that tracking is recovered in time; finally, if both trackers fail, the algorithm must stop and a warning is issued to the user. The flowchart of this step is shown in fig. 6; the specific steps are:
(41) During tracking, the validity of tracking is judged from the normalized correlation coefficients described in step (33). First judge whether head tracking is valid, i.e. whether R in formula (9) is larger than 0.5; if so, go to step (42), otherwise go to step (43);
(42) Judge whether body tracking is valid, i.e. whether d in formula (8) is larger than 0.5; if so, body tracking in the current color image and head tracking in the depth image are both valid and the normal tracking process continues; otherwise, go to step (44);
(43) Judge whether body tracking is valid, i.e. whether d in formula (8) is larger than 0.5; if so, head tracking in the depth image has failed while body tracking in the color image is still valid: the head position is estimated from the body position using the spatial position constraint of head and body, matching of the head model continues, and head tracking is recovered once a match succeeds; otherwise, go to step (45);
(44) Here head tracking in the depth image is valid while body tracking in the color image has failed: the approximate body position is estimated from the head position using the spatial position constraint of head and body, matching of the body color histogram continues, and body tracking is recovered once a match succeeds;
(45) Here both head tracking and body tracking have failed, indicating that the tracked object is absent from the color image and depth image due to occlusion or rapid movement; in this case the tracking algorithm stops and a warning must be given so the user can respond appropriately.
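The four-way decision of steps (41)-(45) reduces to a small state machine; a sketch, where head_from_body and body_from_head stand for hypothetical re-estimation routines applying the ~1:7 head-to-body spatial constraint:

```python
def tracking_step(r_head, d_body, head_from_body, body_from_head):
    """Validity logic of steps (41)-(45); thresholds follow the patent."""
    head_ok, body_ok = r_head > 0.5, d_body > 0.5
    if head_ok and body_ok:
        return "track"            # (42): both trackers valid, continue normally
    if body_ok:
        head_from_body()          # (43): head lost, search near body-predicted spot
        return "recover_head"
    if head_ok:
        body_from_head()          # (44): body lost, estimate from head position
        return "recover_body"
    return "stop_and_warn"        # (45): both lost, stop and warn the user
```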

Claims (7)

1. A human body detection and tracking method based on multi-mode information perception is characterized in that: the method comprises the following steps:
(1) Calibrating a color camera and a depth camera, carrying out data filtering treatment, aligning a color image and a depth image through calibration, and then respectively carrying out filtering treatment;
(2) Human body detection based on multi-modal information perception: detecting a body in the color image, detecting a head in the depth image, and fusing according to the space proportion information; the method comprises the following steps:
(21) Scanning the color image with a sliding window, extracting HOG features within the window, and judging with a trained SVM classifier whether the window contains a human body, to obtain all windows in the color image that may contain a human body;
(22) Using a sliding window likewise in the depth image, extracting Haar features within the window, and classifying with an Adaboost classifier whether the window is a human head, to obtain all windows in the depth image that may contain a human head;
(23) Fusion detection according to the spatial proportion information: fusing the two detection results of step (21) and step (22) according to the head-to-body ratio of a person, to obtain a detection result that fuses the multi-modal information;
(3) Human body tracking based on multi-modal information awareness: tracking the body and the head in the color image and the depth image respectively with a kernelized correlation filter (KCF) target tracking algorithm, and establishing a tracked-object model to check the tracking result;
(4) And (3) perfecting a tracking mechanism by utilizing the space constraint of the tracking object model and the head-body ratio, and maintaining the tracking stability according to the space position constraint of the head and the body if a single tracker fails in the tracking process.
2. The human body detecting and tracking method based on multi-modal information sensing as set forth in claim 1, wherein the step (1) includes the steps of:
(11) Shooting a plurality of checkerboard pictures with different angles and different distances by using a color camera and a depth camera respectively, so as to ensure that each position of an image can be covered;
(12) Detecting and matching corner points across the different images, and calculating the intrinsic and extrinsic parameter matrices of the color camera from the matched corner pairs, wherein the color camera intrinsic matrix is:

$$K = \begin{bmatrix} f_x & s & x_0 \\ 0 & f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix}$$

wherein fx, fy are the focal lengths, (x0, y0) are the principal point coordinates relative to the imaging plane, and s is the axis skew parameter;

the color camera extrinsic matrix is:

$$\begin{bmatrix} R & t \end{bmatrix}$$

wherein R is a 3×3 rotation matrix and t is a 3×1 translation vector, both in the world coordinate system;
(13) Mapping the depth values onto the color image;

let P be a spatial point, $p_{rgb}$ and $p_{ir}$ its coordinates in the color image and the depth image, $K_{rgb}$ and $K_{ir}$ the intrinsic matrices of the color camera and the depth camera, and $R_{ir}$ and $t_{ir}$ the extrinsic parameters of the depth camera with the color camera as the reference frame; the depth value of point P is mapped into color-image coordinates by:

$$p_{rgb} = K_{rgb} R_{ir} K_{ir}^{-1} p_{ir} + t_{ir}$$
(14) Downscaling the registered color image and depth image simultaneously, removing high-frequency noise from the color image with Gaussian filtering, and removing depth-missing points from the depth image with median filtering.
3. The human body detection and tracking method based on multi-modal information sensing as claimed in claim 1, wherein the step (21) includes the steps of:
(211) The method comprises the steps of scanning a color image through a multi-scale sliding window, and firstly enlarging and reducing an original color image according to a preset proportion to obtain a multi-scale color image; then sliding on each color image with a sliding window with a fixed size, and checking whether the window contains a human body or not;
(212) The HOG features in the window are extracted, and the HOG features are extracted as follows:
(2121) Graying and gray normalization;
firstly, graying the whole color image, and then normalizing;
(2122) Calculating the gradient of each pixel in the color image;

the gradients of the color image in the x and y directions at (x, y) are:

$$G_x(x,y) = I(x+1,y) - I(x-1,y)$$

$$G_y(x,y) = I(x,y+1) - I(x,y-1)$$

the gradient magnitude is $G(x,y) = \sqrt{G_x(x,y)^2 + G_y(x,y)^2}$ and the gradient direction is $\theta(x,y) = \arctan\big(G_y(x,y)/G_x(x,y)\big)$;
(2123) Dividing the pixels into cells of 8×8, counting the gradient information of all pixels in each cell, and representing the result with a histogram of gradient directions;
(2124) Dividing blocks of 16×16 and contrast-normalizing the gradient histograms within each block;
(2125) Setting a detection window of 64×128 and generating its feature vector by combining the feature vectors of all blocks lying within the window, for subsequent classification;
scaling the original color image to form an image pyramid, sliding the detection window over the image at each scale, and classifying each position with the trained SVM classifier to judge whether a human body is present there; finally, applying non-maximum suppression to the results to eliminate multiple detection windows on the same target.
4. The human detection and tracking method based on multi-modal information sensing as set forth in claim 1, wherein the step (22) includes the steps of:
(221) The method comprises the steps of scanning a depth image by a multi-scale sliding window, and firstly enlarging and reducing an original depth image according to a preset proportion to obtain a depth image with multiple scales; then sliding on each depth image by a sliding window with a fixed size, and checking whether the window contains a human head or not;
(222) Extracting Haar features in the window;
Haar features are simple rectangular block features, divided into edge features, linear features and diagonal features; the feature value of each rectangular region is the sum of the pixels in the white region minus the sum of the pixels in the black region;
(223) Adaboost classification, training a classifier by using an AdaBoost algorithm;
The AdaBoost algorithm, given enough data, boosts weak learners into a high-precision strong learner; the weak classifier $h_j(x)$ is:

$$h_j(x) = \begin{cases} 1, & p_j f_j(x) < p_j \theta_j \\ 0, & \text{otherwise} \end{cases}$$

wherein $f_j$ is the feature, $\theta_j$ is the threshold, $p_j$ controls the direction of the inequality, and x is a 24×24 image sub-window; training runs N rounds and trains N weak classifiers in total, the weights being normalized into a probability distribution before the n-th round; a classifier $h_j$ is trained for each feature $f_j$, each $h_j$ using only a single feature; the classifier $h_n$ with the lowest error is selected and the weights are updated, finally yielding a strong classifier;
classifying Haar feature vectors in the detection window obtained in the step (222) by using an Adaboost classifier to give a likelihood score of the existence of a human head in the detection window;
(224) And (3) integrating the detection results of the human head in the depth image, and carrying out non-maximum suppression according to the probability score of each window to obtain the detection result of the human head in the depth image.
5. The human detection and tracking method based on multi-modal information sensing as claimed in claim 1, wherein the step (23) includes the steps of:
(231) Acquiring a head and body detection result, acquiring a body frame in the color image from the step (21), acquiring a head frame in the depth image from the step (22), traversing a set of body frames, and executing the following operation on each body frame;
(232) Judging whether a head frame exists within the body frame; if not, deleting the body frame and returning to step (231); if so, executing step (233);
(233) Judging whether the number of head frames in the body frame is 1, if so, associating the body frame and the head frame to form a multi-mode combined human body detection; if the number of head frames in the body frame exceeds one, an optimal head frame is selected according to the position of the head frame and the respective confidence level, and then the optimal head frame is associated with the current body frame.
6. The human body detecting and tracking method based on multi-modal information sensing as set forth in claim 1, wherein the step (3) includes the steps of:
(31) Establishing a model of the tracking object in the color image and the depth image;
in the color map the model is a color histogram, and in the depth map it is a depth template picture; the color histogram is extracted as follows: first convert RGB to the HSV color space, where H is hue, S is saturation and V is brightness, then extract the H channel according to the following formula and count the distribution of H values in the window to form the color histogram; the depth template picture is extracted by cropping the head bounding box before tracking starts and scaling it to a standard size, to serve as the template picture for head tracking in the depth image;
$$R' = R/255,\qquad G' = G/255,\qquad B' = B/255$$

$$C_{max} = \max(R',G',B'),\qquad C_{min} = \min(R',G',B'),\qquad \Delta = C_{max} - C_{min}$$

$$H = \begin{cases} 0^\circ, & \Delta = 0 \\ 60^\circ \times \left(\dfrac{G'-B'}{\Delta} \bmod 6\right), & C_{max} = R' \\ 60^\circ \times \left(\dfrac{B'-R'}{\Delta} + 2\right), & C_{max} = G' \\ 60^\circ \times \left(\dfrac{R'-G'}{\Delta} + 4\right), & C_{max} = B' \end{cases}$$
(32) Tracking the body in the color map and the head in the depth map simultaneously with the KCF (kernelized correlation filter) algorithm; specifically: pixel values around the tracked object are extracted with a circulant matrix as training samples, a discriminant function is trained by ridge regression, and the samples are mapped into a kernel space by a kernel transform to handle linearly inseparable samples;
(33) Matching and updating the object model during tracking; matching computes the normalized correlation coefficient between the tracked object and the initial model:

$$d(H_1,H_2) = \frac{\sum_i \big(H_1(i)-\bar{H}_1\big)\big(H_2(i)-\bar{H}_2\big)}{\sqrt{\sum_i \big(H_1(i)-\bar{H}_1\big)^2 \sum_i \big(H_2(i)-\bar{H}_2\big)^2}}$$

$$R(T,I) = \frac{\sum_{x,y} T'(x,y)\, I'(x,y)}{\sqrt{\sum_{x,y} T'(x,y)^2 \sum_{x,y} I'(x,y)^2}},\qquad T' = T-\bar{T},\ I' = I-\bar{I}$$

wherein d is the normalized correlation coefficient of the color histograms $H_1$ and $H_2$, and R is that of the depth template pictures T and I; both values lie in the range [0, 1], a larger value meaning a better match and 0 the worst; if the matching value is larger than 0.9, i.e. the algorithm has high confidence in the tracking result, the model is updated by weighting: the initial model has weight 1-w and the current tracked-object model weight w, where w = 0.5×d or w = 0.5×R.
7. The human body detecting and tracking method based on multi-modal information sensing as set forth in claim 1, wherein the step (4) includes the steps of:
(41) During tracking, judging the validity of tracking from the normalized correlation coefficients of step (3); first judging whether head tracking is valid, i.e. whether the normalized correlation coefficient R of the depth template pictures T and I is larger than 0.5; if so, going to step (42), otherwise going to step (43);
(42) Judging whether body tracking is valid, i.e. whether the normalized correlation coefficient d of the color histograms $H_1$ and $H_2$ is larger than 0.5; if so, body tracking in the current color image and head tracking in the depth image are both valid and the normal tracking process continues; otherwise, going to step (44);
(43) Judging whether body tracking is valid, i.e. whether d is larger than 0.5; if so, head tracking in the depth image has failed while body tracking in the color image is still valid, so the head position is estimated from the body position using the spatial position constraint of head and body, matching of the head model continues, and head tracking is recovered once a match succeeds; otherwise, going to step (45);
(44) Here head tracking in the depth image is valid while body tracking in the color image has failed; the approximate body position is estimated from the head position using the spatial position constraint of head and body, matching of the body color histogram continues, and body tracking is recovered once a match succeeds;
(45) Here both head tracking and body tracking have failed, indicating that the tracked object is absent from the color image and depth image due to occlusion or rapid movement; in this case the tracking algorithm stops and a warning must be given so the user can respond appropriately.
CN201911146615.4A 2019-11-21 2019-11-21 Human body detection and tracking method based on multi-mode information perception Active CN111144207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911146615.4A CN111144207B (en) 2019-11-21 2019-11-21 Human body detection and tracking method based on multi-mode information perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911146615.4A CN111144207B (en) 2019-11-21 2019-11-21 Human body detection and tracking method based on multi-mode information perception

Publications (2)

Publication Number Publication Date
CN111144207A CN111144207A (en) 2020-05-12
CN111144207B (en) 2023-07-07

Family

ID=70517199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911146615.4A Active CN111144207B (en) 2019-11-21 2019-11-21 Human body detection and tracking method based on multi-mode information perception

Country Status (1)

Country Link
CN (1) CN111144207B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667518B (en) * 2020-06-24 2023-10-31 北京百度网讯科技有限公司 Face image display method and device, electronic equipment and storage medium
CN111968087B (en) * 2020-08-13 2023-11-07 中国农业科学院农业信息研究所 Plant disease area detection method
CN112150448B (en) * 2020-09-28 2023-09-26 杭州海康威视数字技术股份有限公司 Image processing method, device and equipment and storage medium
TWI798663B (en) * 2021-03-22 2023-04-11 伍碩科技股份有限公司 Depth image compensation method and system
US20220350342A1 (en) * 2021-04-25 2022-11-03 Ubtech North America Research And Development Center Corp Moving target following method, robot and computer-readable storage medium
CN113393401B (en) * 2021-06-24 2023-09-05 上海科技大学 Object detection hardware accelerator, system, method, apparatus and medium
CN116580828B (en) * 2023-05-16 2024-04-02 深圳弗瑞奇科技有限公司 Visual monitoring method for full-automatic induction identification of cat health

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503615A (en) * 2016-09-20 2017-03-15 北京工业大学 Indoor human body detecting and tracking and identification system based on multisensor

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102136139B (en) * 2010-01-22 2016-01-27 三星电子株式会社 Targeted attitude analytical equipment and targeted attitude analytical approach thereof
CN102800126A (en) * 2012-07-04 2012-11-28 浙江大学 Method for recovering real-time three-dimensional body posture based on multimodal fusion
US9542626B2 (en) * 2013-09-06 2017-01-10 Toyota Jidosha Kabushiki Kaisha Augmenting layer-based object detection with deep convolutional neural networks
CN107093182B (en) * 2017-03-23 2019-10-11 东南大学 A kind of human height's estimation method based on feature corners
CN107197384B (en) * 2017-05-27 2019-08-02 北京光年无限科技有限公司 The multi-modal exchange method of virtual robot and system applied to net cast platform

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503615A (en) * 2016-09-20 2017-03-15 北京工业大学 Indoor human body detecting and tracking and identification system based on multisensor

Also Published As

Publication number Publication date
CN111144207A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN111144207B (en) Human body detection and tracking method based on multi-mode information perception
US10684681B2 (en) Neural network image processing apparatus
JP6305171B2 (en) How to detect objects in a scene
WO2016015547A1 (en) Machine vision-based method and system for aircraft docking guidance and aircraft type identification
CN106682603B (en) Real-time driver fatigue early warning system based on multi-source information fusion
Liu et al. A novel distance estimation method leading a forward collision avoidance assist system for vehicles on highways
US20220180534A1 (en) Pedestrian tracking method, computing device, pedestrian tracking system and storage medium
KR20110064117A (en) Method for determining frontal pose of face
TW201405486A (en) Real time detecting and tracing objects apparatus using computer vision and method thereof
KR101903127B1 (en) Gaze estimation method and apparatus
CN107203743B (en) Face depth tracking device and implementation method
CN112784712B (en) Missing child early warning implementation method and device based on real-time monitoring
CN113850865A (en) Human body posture positioning method and system based on binocular vision and storage medium
CN114399675A (en) Target detection method and device based on machine vision and laser radar fusion
CN111965636A (en) Night target detection method based on millimeter wave radar and vision fusion
CN109886195B (en) Skin identification method based on near-infrared monochromatic gray-scale image of depth camera
CN108021926A (en) A kind of vehicle scratch detection method and system based on panoramic looking-around system
CN110909561A (en) Eye state detection system and operation method thereof
CN111524183A (en) Target row and column positioning method based on perspective projection transformation
CN114612933B (en) Monocular social distance detection tracking method
Spremolla et al. RGB-D and thermal sensor fusion-application in person tracking
Sun et al. Automatic targetless calibration for LiDAR and camera based on instance segmentation
CN112183287A (en) People counting method of mobile robot under complex background
CN113723432B (en) Intelligent identification and positioning tracking method and system based on deep learning
CN110826495A (en) Body left and right limb consistency tracking and distinguishing method and system based on face orientation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant