CN112668432A

CN112668432A - Human body detection tracking method in ground interactive projection system based on YoloV5 and Deepsort

Info

Publication number: CN112668432A
Application number: CN202011531976.3A
Authority: CN
Inventors: 吴强; 季晓枫; 冯育; 胡瑞闻; 唐昊
Original assignee: Shanghai Magic Digital Creative Technology Co ltd
Current assignee: Shanghai Magic Digital Creative Technology Co ltd
Priority date: 2020-12-22
Filing date: 2020-12-22
Publication date: 2021-04-16

Abstract

The invention relates to a human body detection and tracking method in a ground interactive projection system based on YoloV5 and Deepsort, which comprises the following steps: step 1: acquiring an image to be identified and preprocessing the image; step 2: constructing a human body detection and tracking model based on YoloV5 and Deepsort; step 3: training the human body detection and tracking model; step 4: calculating the accuracy of the model and judging whether the accuracy is greater than a preset threshold; if so, executing step 5, otherwise returning to step 3; step 5: detecting and tracking the human body in the ground interactive projection system in real time by using the trained human body detection and tracking model. Compared with the prior art, the method has advantages such as high precision and good real-time performance.

Description

Human body detection tracking method in ground interactive projection system based on YoloV5 and Deepsort

Technical Field

The invention relates to the technical field of human body detection and tracking for ground interaction, and in particular to a human body detection and tracking method in a ground interactive projection system based on YoloV5 and Deepsort.

Background

Ground interactive projection can turn any flat, light-colored floor area into a real-time interactive surface. Participants interact directly with the images projected on the ground, so that several participants can be immersed in the scene and take part in a game at the same time. Rich content can be displayed, including a variety of interactive mini-games. Spectators can interact with the projected content using their limbs, which combines advertising with entertainment, enlivens the atmosphere and raises the technological appeal of an exhibition. It also gives enterprise users a way to present creative content, draws visitors to the site and improves the image of the organizations involved, and therefore has a good publicity effect.

Human body interaction is the key to a ground interactive projection system: for the projection system to interact with a user, the human body must first be detected and tracked. At present, algorithms for detecting and tracking human bodies in ground interactive projection have received little study. Chinese patent CN110516556A discloses a multi-target tracking detection method based on Darkflow-Deepsort, which detects and tracks multiple targets by using the YoloV3 and Deepsort algorithms; however, its target detection precision and tracking precision are low and it runs slowly, so it does not meet the requirements of a ground interactive projection system.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a human body detection and tracking method in a ground interactive projection system based on YoloV5 and Deepsort, which has high precision and good real-time performance.

The purpose of the invention can be realized by the following technical scheme:

a human body detection tracking method in a ground interactive projection system based on YoloV5 and Deepsort comprises the following steps:

step 1: acquiring an image to be identified, and preprocessing the image;

step 2: constructing a human body detection tracking model based on YoloV5 and Deepsort;

step 3: training the human body detection and tracking model;

step 4: calculating the accuracy of the model and judging whether the accuracy is greater than a preset threshold; if so, executing step 5, otherwise returning to step 3;

step 5: detecting and tracking the human body in the ground interactive projection system in real time by using the trained human body detection and tracking model.
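By way of non-limiting illustration, the control flow of steps 3 to 5 can be sketched as a loop that keeps training until the measured accuracy exceeds the preset threshold. The names `train_one_round` and `evaluate` and the safety cap `max_rounds` are hypothetical and introduced here only for illustration; the text does not specify how the accuracy is computed or whether training is capped.

```python
from typing import Callable, Tuple


def train_until_accurate(
    train_one_round: Callable[[], None],
    evaluate: Callable[[], float],
    threshold: float,
    max_rounds: int = 100,  # assumption: a cap so the loop always terminates
) -> Tuple[int, float]:
    """Repeat step 3 (training) until the step 4 accuracy check passes."""
    for round_index in range(1, max_rounds + 1):
        train_one_round()          # step 3: train the model
        accuracy = evaluate()      # step 4: compute the model accuracy
        if accuracy > threshold:   # strictly greater, as stated in step 4
            return round_index, accuracy  # the model is ready for step 5
    raise RuntimeError("accuracy threshold not reached within max_rounds")
```

The function returns the number of training rounds used together with the final accuracy, after which the trained model would be deployed for real-time detection and tracking as in step 5.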

Preferably, the step 1 specifically comprises:

acquiring a set of images to be identified, randomly dividing the images into a training set and a testing set, performing Mosaic data enhancement, self-adaptive anchor frame calculation and self-adaptive picture scaling on all the images, and expanding the training set and the testing set.
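As a rough, non-limiting sketch of the random division and of Mosaic data enhancement, the following pure-Python example shuffles a data set into a training set and a test set and stitches four images into one using random scaling, random cropping and random arrangement. The test fraction of 0.2, the scaling range 0.5 to 1.5, the nearest-neighbour resampling and the function names are assumptions made for illustration and are not taken from the text.

```python
import random


def split_dataset(items, test_fraction=0.2, rng=random):
    """Randomly divide items into (training set, test set)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mosaic(images, size, rng=random):
    """Stitch four images (2-D lists of pixel values) into one size x size image."""
    if len(images) != 4:
        raise ValueError("Mosaic needs exactly four images")
    # Random split point, kept away from the border so every tile is non-empty.
    cx = rng.randint(size // 4, 3 * size // 4)
    cy = rng.randint(size // 4, 3 * size // 4)
    tiles = [(0, 0, cx, cy), (cx, 0, size, cy),
             (0, cy, cx, size), (cx, cy, size, size)]
    order = list(images)
    rng.shuffle(order)                         # random arrangement
    canvas = [[0] * size for _ in range(size)]
    for img, (x0, y0, x1, y1) in zip(order, tiles):
        src_h, src_w = len(img), len(img[0])
        tile_w, tile_h = x1 - x0, y1 - y0
        scale = rng.uniform(0.5, 1.5)          # random scaling (assumed range)
        # The scaled image must at least cover its tile.
        scaled_w = max(tile_w, int(src_w * scale))
        scaled_h = max(tile_h, int(src_h * scale))
        off_x = rng.randint(0, scaled_w - tile_w)  # random cropping
        off_y = rng.randint(0, scaled_h - tile_h)
        for ty in range(tile_h):
            for tx in range(tile_w):
                # Nearest-neighbour lookup into the source image.
                sy = min(src_h - 1, (off_y + ty) * src_h // scaled_h)
                sx = min(src_w - 1, (off_x + tx) * src_w // scaled_w)
                canvas[y0 + ty][x0 + tx] = img[sy][sx]
    return canvas
```

In a detection setting the bounding-box labels of the four source images would have to be scaled, shifted and clipped in the same way as the pixels; that bookkeeping is omitted here for brevity.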

Preferably, the human body detection tracking model in step 2 is specifically:

step 2-1: building a YoloV5 submodel for detecting a human body;

step 2-2: building a Deepsort submodel for tracking the detected human body.

More preferably, the YoloV5 submodel is specifically:

the YoloV5 submodel comprises a Backbone network, a Neck network and a Head network which are connected in sequence;

the Backbone network is formed by combining a Focus structure and a CSP structure and is used for extracting feature maps from the images;

the Neck network is formed by combining an FPN structure and a PAN structure and is used for mixing and combining the image features and passing the feature maps to the prediction layer;

the Head network is used for making predictions from the image features, obtaining bounding boxes and predicting categories; the Head network uses GIOU_Loss as the loss function of the bounding box, and the anchor boxes are screened using weighted NMS.
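For illustration only, the generalized IoU (GIoU) on which a GIOU_Loss bounding-box loss is based can be computed as IoU minus the fraction of the smallest enclosing box that is not covered by the union, with the loss taken as 1 minus GIoU. The sketch below assumes axis-aligned boxes of non-zero area given as `(x1, y1, x2, y2)`; the function names are illustrative and are not those of the YoloV5 code.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (enclose - union) / enclose


def giou_loss(predicted, target):
    """Loss in [0, 2]: 0 for identical boxes, larger for distant boxes."""
    return 1.0 - giou(predicted, target)
```

Unlike plain IoU, the GIoU term still varies when the boxes do not overlap, so the loss keeps providing a training signal for poor initial predictions.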

More preferably, the Deepsort submodel specifically comprises:

first, state estimation is performed on the detected human body target frames to predict the state of each target, and then the detection results and the prediction results are associated and matched.

More preferably, the state estimation method specifically includes:

using an eight-dimensional state space $(u, v, r, h, \dot{u}, \dot{v}, \dot{r}, \dot{h})$ to describe the state of the trajectory at a certain time, wherein $(u, v)$ is the center coordinates of the target frame, $r$ is the aspect ratio of the target frame, $h$ is the height of the target frame, and $(\dot{u}, \dot{v}, \dot{r}, \dot{h})$ is the corresponding speed information in the image coordinate system;

predicting the motion state of the target by using a standard Kalman filter based on a constant velocity model and a linear observation model, wherein the predicted result is (u, v, r, h);

for each tracked target, recording the number of frames $\alpha_k$ elapsed since a detection result was last matched with the tracking result, and resetting $\alpha_k$ to 0 once a detection result of the target is correctly associated with the tracking result; if $\alpha_k$ exceeds the set maximum threshold $A_{max}$, determining that the tracking process of the target is finished;

the judgment of the new target is: if the prediction result of the target position of the potential new tracker in three continuous frames can be correctly associated with the detection result, the new moving target is confirmed to be present; if the requirement cannot be met, the moving object needs to be deleted.
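The state estimation and track management described above can be sketched as follows, as a simplified illustration rather than the implementation of the invention. The sketch assumes diagonal process noise, diagonal measurement noise and a diagonal initial covariance, in which case the eight-dimensional constant velocity filter separates into four independent position/velocity filters, one for each of u, v, r and h. The noise values, the default maximum age of 30 frames and all class and attribute names are assumptions made for illustration.

```python
class AxisFilter:
    """Kalman filter for one coordinate (u, v, r or h) and its velocity."""

    def __init__(self, position, q_pos=0.01, q_vel=0.01, r_meas=0.1):
        self.pos, self.vel = float(position), 0.0
        # Covariance entries: var(pos), cov(pos, vel), var(vel).
        self.p00, self.p01, self.p11 = 1.0, 0.0, 1.0
        self.q_pos, self.q_vel, self.r_meas = q_pos, q_vel, r_meas

    def predict(self):
        # Constant velocity over one frame: pos += vel, vel unchanged.
        self.pos += self.vel
        self.p00 += 2 * self.p01 + self.p11 + self.q_pos
        self.p01 += self.p11
        self.p11 += self.q_vel

    def update(self, measurement):
        # Linear observation model: only the position is observed.
        innovation = measurement - self.pos
        s = self.p00 + self.r_meas
        k_pos, k_vel = self.p00 / s, self.p01 / s
        self.pos += k_pos * innovation
        self.vel += k_vel * innovation
        p00, p01 = self.p00, self.p01
        self.p00 = (1 - k_pos) * p00
        self.p01 = (1 - k_pos) * p01
        self.p11 -= k_vel * p01


class Track:
    """One tracked target with the frame counter and the three-frame rule."""

    def __init__(self, box, n_init=3, max_age=30):
        self.filters = [AxisFilter(x) for x in box]  # box = (u, v, r, h)
        self.n_init, self.max_age = n_init, max_age  # max_age plays the role of A_max
        self.hits = 0
        self.frames_since_match = 0                  # the counter alpha_k
        self.state = "tentative"

    def predict(self):
        for f in self.filters:
            f.predict()
        self.frames_since_match += 1
        return tuple(f.pos for f in self.filters)    # predicted (u, v, r, h)

    def mark_matched(self, box):
        for f, z in zip(self.filters, box):
            f.update(z)
        self.hits += 1
        self.frames_since_match = 0                  # reset on correct association
        if self.state == "tentative" and self.hits >= self.n_init:
            self.state = "confirmed"                 # associated in three frames

    def mark_missed(self):
        # A tentative track that misses a frame, or a track unmatched for more
        # than max_age frames, is removed.
        if self.state == "tentative" or self.frames_since_match > self.max_age:
            self.state = "deleted"
```

Each frame, `predict` would be called on every track before association, followed by `mark_matched` for associated tracks and `mark_missed` for the rest.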

More preferably, the association matching method includes motion information association matching and appearance information association matching; a linear weighting of the motion information association matching degree $d^{(1)}(i, j)$ and the appearance information association matching degree $d^{(2)}(i, j)$ is used as the final association matching degree $c_{i,j}$, specifically:

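using $c_{i,j}$ as the weight of each connecting line between track $i$ and detection $j$, and using

$$b_{i,j} = \prod_{m=1}^{2} b_{i,j}^{(m)}$$

to determine the connecting lines of the initial match, wherein $b_{i,j}^{(1)}$ and $b_{i,j}^{(2)}$ are the indicators of the motion information match and the appearance information match respectively; $b_{i,j}$ takes the value 0 or 1: $b_{i,j} = 1$ means that there is a connection between $i$ and $j$, and $b_{i,j} = 0$ means that there is no connection between $i$ and $j$;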

$c_{i,j}$ is calculated as:

$$c_{i,j} = \lambda d^{(1)}(i,j) + (1-\lambda) d^{(2)}(i,j)$$

wherein $\lambda$ is a hyper-parameter.
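As a minimal sketch of the weighting above, the following functions compute the final association matching degree from the motion term and the appearance term, together with a 0/1 gate. The gate is assumed here to be the product of the two per-metric indicators, as in the original Deep SORT formulation, and the threshold parameters `t_motion` and `t_appearance` are illustrative names whose values are not given in the text.

```python
def association_cost(d_motion, d_appearance, lam):
    """c = lam * d1 + (1 - lam) * d2, with lam the weighting hyper-parameter."""
    return lam * d_motion + (1.0 - lam) * d_appearance


def gate(d_motion, d_appearance, t_motion, t_appearance):
    """Return 1 if the pair may be connected, 0 otherwise.

    Assumption: a connection is admissible only when both the motion and the
    appearance indicator signal a successful match.
    """
    b_motion = 1 if d_motion < t_motion else 0
    b_appearance = 1 if d_appearance < t_appearance else 0
    return b_motion * b_appearance
```

With `lam = 0` only appearance information contributes to the cost, and with `lam = 1` only motion information does; the gate is applied independently of the weighting.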

More preferably, the motion information association matching specifically includes:

measuring the distance between the Track and the detection result (Detection) by using the squared Mahalanobis distance, calculated as:

$$d^{(1)}(i,j) = (d_j - y_i)^{T} S_i^{-1} (d_j - y_i)$$

wherein $d_j$ is the jth detection result (Detection); $y_i$ is the ith Track; $S_i$ is the covariance matrix of $d$ and $y$;

$b_{i,j}^{(1)}$ is an indicator that equals 1, representing a successful match, if the Mahalanobis distance is less than the threshold $t^{(1)}$, and equals 0 otherwise.
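For illustration, the squared Mahalanobis distance between a detection d and a track prediction y with covariance matrix S is (d - y)^T S^{-1} (d - y). The pure-Python sketch below avoids forming the inverse explicitly by solving a small linear system instead. The gate value 9.4877 is the 0.95 quantile of the chi-square distribution with four degrees of freedom, a common choice for four-dimensional measurements; the text does not state a threshold value, so this constant and the function names are assumptions.

```python
T_MOTION = 9.4877  # assumption: chi-square 0.95 quantile, 4 degrees of freedom


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def squared_mahalanobis(detection, track_mean, track_cov):
    """(d - y)^T S^-1 (d - y) for a detection d and a track (y, S)."""
    diff = [d - y for d, y in zip(detection, track_mean)]
    weighted = solve(track_cov, diff)        # equals S^-1 (d - y)
    return sum(a * b for a, b in zip(diff, weighted))


def motion_match(detection, track_mean, track_cov, threshold=T_MOTION):
    """Indicator: 1 if the distance is below the threshold, else 0."""
    return 1 if squared_mahalanobis(detection, track_mean, track_cov) < threshold else 0
```

Because the distance is scaled by the track covariance, a track whose Kalman prediction is uncertain tolerates a larger positional deviation than a track that has been observed recently.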

More preferably, the appearance information association matching specifically includes:

for each detection result $d_j$, computing a feature vector $r_j$ of unit norm;

calculating the minimum cosine distance between the set $R_i$ of the latest 100 successfully associated features of the ith track and the feature vector $r_j$ of the jth detection result $d_j$ of the current frame:

$$d^{(2)}(i,j) = \min\{\, 1 - r_j^{T} r_k^{(i)} \mid r_k^{(i)} \in R_i \,\}$$

$b_{i,j}^{(2)}$ is an indicator that equals 1, representing a successful match, if the minimum cosine distance is less than the threshold $t^{(2)}$, and equals 0 otherwise.
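The appearance term can be sketched in a few lines of pure Python, assuming that every feature vector is normalized to unit length so that the cosine distance reduces to 1 minus a dot product. The bounded gallery is modelled with `collections.deque(maxlen=100)`, which discards the oldest feature automatically once 100 are stored; the function names are illustrative, and the network that produces the feature vectors is outside the scope of the sketch.

```python
import math
from collections import deque

GALLERY_SIZE = 100  # the latest 100 successfully associated features per track


def unit(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("a zero vector cannot be normalized")
    return [x / norm for x in vector]


def new_gallery():
    """Feature store of one track; old entries drop out automatically."""
    return deque(maxlen=GALLERY_SIZE)


def min_cosine_distance(gallery, feature):
    """Smallest 1 - r_j . r_k over the stored unit-norm features r_k."""
    if not gallery:
        raise ValueError("the track has no stored features yet")
    return min(
        1.0 - sum(a * b for a, b in zip(stored, feature)) for stored in gallery
    )
```

Taking the minimum over the gallery lets a detection match a track if it resembles any recent appearance of the target, which helps when the person turns or is partly occluded.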

More preferably, the association matching method further includes cascade matching, specifically:

when a target is occluded for a long time, the uncertainty of the Kalman filter prediction increases greatly; if two Kalman filters then compete for the matching right of the same detection result, the detection result tends to be associated with the track that has been occluded for the longer time; cascade matching is therefore introduced to give matching priority to the targets that appear more frequently, that is, to the tracks that were matched most recently.
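A possible reading of cascade matching is sketched below: tracks are processed in order of the number of frames since their last successful match, so that recently seen tracks get the first choice of detections. For brevity a greedy lowest-cost assignment is used here in place of an optimal bipartite assignment solver; this substitution, the single gating value `max_cost` and the function names are assumptions made for illustration.

```python
def greedy_match(cost, track_ids, detection_ids, max_cost):
    """Pair tracks and detections, cheapest admissible pairs first."""
    candidates = sorted(
        (cost[t][d], t, d)
        for t in track_ids
        for d in detection_ids
        if cost[t][d] <= max_cost
    )
    used_tracks, used_detections, matches = set(), set(), []
    for _, t, d in candidates:
        if t not in used_tracks and d not in used_detections:
            matches.append((t, d))
            used_tracks.add(t)
            used_detections.add(d)
    return matches


def matching_cascade(cost, ages, num_detections, max_age, max_cost):
    """Match tracks level by level, youngest (most recently matched) first.

    cost[t][d] is the association cost of track t and detection d;
    ages[t] is the number of frames since track t was last matched (>= 1).
    """
    unmatched_detections = set(range(num_detections))
    matches = []
    for age in range(1, max_age + 1):
        if not unmatched_detections:
            break
        level = [t for t, a in enumerate(ages) if a == age]
        if not level:
            continue
        found = greedy_match(cost, level, sorted(unmatched_detections), max_cost)
        matches.extend(found)
        unmatched_detections -= {d for _, d in found}
    matched_tracks = {t for t, _ in matches}
    unmatched_tracks = [t for t in range(len(ages)) if t not in matched_tracks]
    return matches, unmatched_tracks, sorted(unmatched_detections)
```

In the first test case below, the long-occluded track has the lower cost for the only detection, yet the detection goes to the track that was matched in the previous frame, which is the priority behaviour described above.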

Compared with the prior art, the invention has the following beneficial effects:

Firstly, the human body detection and tracking precision is high: the human body detection and tracking method in the ground interactive projection system combines a YoloV5 model with a Deepsort model, using the YoloV5 model to detect human bodies in the ground interactive system and the Deepsort model to track the detected human bodies, which greatly improves the detection and tracking precision and provides a foundation for realizing the ground interactive projection system.

Secondly, the algorithm is fast and has good real-time performance: the image processing speed of the YoloV5 human body detection model used by the method is more than 100 times that of the human body detection model in the prior art, which greatly improves the real-time performance of human body detection and tracking in the ground interactive projection system.

Drawings

FIG. 1 is a schematic flow chart of a human body detection and tracking method in a ground interactive projection system according to the present invention;

FIG. 2 is a schematic structural diagram of the YoloV5 model in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.

A human body detection and tracking method in a ground interactive projection system based on YoloV5 and Deepsort, the flow of which is shown in FIG. 1, includes:

step 1: acquiring an image to be identified, and preprocessing the image, specifically:

randomly dividing the images into a training set and a test set, performing Mosaic data enhancement, adaptive anchor frame calculation and adaptive picture scaling on all the images, and expanding the training set and the test set;

(1) Mosaic data enhancement

Images are spliced together by means of random scaling, random cropping and random arrangement, which gives a good detection effect on small targets.

(2) Adaptive anchor frame calculation

For each data set, anchor frames with initially set lengths and widths are provided. During network training, the network outputs prediction frames on the basis of the initial anchor frames, compares them with the real frames (ground truth), calculates the difference between the two, and then updates the network parameters iteratively by back-propagation.

(3) Adaptive picture scaling

Since different pictures have different lengths and widths, the common approach is to scale the original picture to a standard size before sending it to the detection network; for example, the sizes 416 × 416 and 608 × 608 are commonly used in the Yolo algorithm. In actual use, many pictures have different aspect ratios, so after scaling and padding the black edges at the two ends differ in size; if there is too much padding, the information is redundant and the inference speed is affected. Thus, the letterbox function of datasets.py in the YoloV5 code is modified to adaptively add the fewest black edges to the original image. Reducing the black edges at the two ends of the image height reduces the amount of computation during inference, that is, the target detection speed is improved. The inference speed is improved by 37%, so the effect is significant.
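The effect of adaptive picture scaling can be illustrated by computing the resized size and the padding with and without the minimal-padding rule. This is a sketch under stated assumptions and not the modified letterbox function itself: `letterbox_shape` is a hypothetical helper, and it assumes that the padded size only needs to be a multiple of the network stride, taken here as 32.

```python
def letterbox_shape(width, height, target=416, stride=32, minimal=True):
    """Return (new_w, new_h, pad_w, pad_h).

    new_w, new_h: size after aspect-preserving scaling into target x target.
    pad_w, pad_h: total black-edge padding per axis (split over both ends).
    """
    ratio = min(target / width, target / height)
    new_w, new_h = round(width * ratio), round(height * ratio)
    pad_w, pad_h = target - new_w, target - new_h
    if minimal:
        # Pad only up to the next multiple of the stride, not to a full square.
        pad_w, pad_h = pad_w % stride, pad_h % stride
    return new_w, new_h, pad_w, pad_h
```

For an 800 × 600 picture and a 416 target, the scaled picture is 416 × 312. Full padding adds 104 rows to reach 416 × 416, whereas minimal padding adds only 8 rows to reach 416 × 320, so fewer pixels pass through the network at inference time.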

Step 2: constructing a human body detection tracking model based on YoloV5 and Deepsort, which specifically comprises the following steps:

step 2-1: building a YoloV5 submodel for detecting human bodies, the structure of which is shown in FIG. 2;

the Head network is connected to the output end and is used for making predictions from the image features, obtaining bounding boxes and predicting categories; the Head network uses GIOU_Loss as the loss function of the bounding box, and the anchor frames are screened using weighted NMS;

step 2-2: building a Deepsort submodel for tracking the detected human bodies;

the Deepsort submodel specifically comprises: first performing state estimation on the detected human body target frames to predict the state of each target, and then associating and matching the detection results with the prediction results;

the state estimation method specifically comprises the following steps:

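using an eight-dimensional state space $(u, v, r, h, \dot{u}, \dot{v}, \dot{r}, \dot{h})$ to describe the state of the trajectory at a certain time, wherein $(u, v)$ is the center coordinates of the target frame, $r$ is the aspect ratio of the target frame, $h$ is the height of the target frame, and $(\dot{u}, \dot{v}, \dot{r}, \dot{h})$ is the corresponding speed information in the image coordinate system;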

the judgment of the new target is: if the prediction result of the target position of the potential new tracker in three continuous frames can be correctly associated with the detection result, the new moving target is confirmed to be present; if the requirement can not be met, the moving target needs to be deleted;

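the association matching method comprises motion information association matching and appearance information association matching; a linear weighting of the motion information association matching degree $d^{(1)}(i, j)$ and the appearance information association matching degree $d^{(2)}(i, j)$ is used as the final association matching degree $c_{i,j}$, specifically:

using $c_{i,j}$ as the weight of each connecting line between track $i$ and detection $j$, and using

$$b_{i,j} = \prod_{m=1}^{2} b_{i,j}^{(m)}$$

to determine the connecting lines of the initial match, wherein $b_{i,j}^{(1)}$ and $b_{i,j}^{(2)}$ are the indicators of the motion information match and the appearance information match respectively, and $b_{i,j}$ takes the value 1 if there is a connection between $i$ and $j$ and 0 otherwise; $c_{i,j}$ is calculated as:

$$c_{i,j} = \lambda d^{(1)}(i,j) + (1-\lambda) d^{(2)}(i,j)$$

wherein $\lambda$ is a hyper-parameter;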

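the motion information association matching specifically comprises the following steps:

measuring the distance between the Track and the detection result (Detection) by using the squared Mahalanobis distance, calculated as:

$$d^{(1)}(i,j) = (d_j - y_i)^{T} S_i^{-1} (d_j - y_i)$$

wherein $d_j$ is the jth detection result (Detection); $y_i$ is the ith Track; $S_i$ is the covariance matrix of $d$ and $y$;

$b_{i,j}^{(1)}$ is an indicator that equals 1, representing a successful match, if the Mahalanobis distance is less than the threshold $t^{(1)}$, and equals 0 otherwise;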

the appearance information association matching specifically comprises the following steps:

for each detection result $d_j$, computing a feature vector $r_j$ of unit norm;

The association matching method in this embodiment further includes cascade matching, and the application timing specifically is:

when a target is occluded for a long time, the uncertainty of the Kalman filter prediction increases greatly; if two Kalman filters then compete for the matching right of the same detection result, the detection result tends to be associated with the track that has been occluded for the longer time; cascade matching is therefore introduced to give matching priority to the targets that appear more frequently;

step 3: training the human body detection and tracking model;

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A human body detection and tracking method in a ground interactive projection system based on YoloV5 and Deepsort, characterized in that the detection and tracking method comprises the following steps:

step 1: acquiring an image to be identified, and preprocessing the image;

step 3: training the human body detection and tracking model;

2. The human body detection and tracking method in the ground interactive projection system based on YoloV5 and Deepsort as claimed in claim 1, wherein the step 1 specifically comprises:

3. The method as claimed in claim 1, wherein the human body detection tracking model in the step 2 is specifically:

step 2-1: building a YoloV5 submodel for detecting a human body;

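4. The human body detection and tracking method in the ground interactive projection system based on YoloV5 and Deepsort as claimed in claim 3, wherein the YoloV5 submodel is specifically:

5. The method as claimed in claim 3, wherein the Deepsort submodel specifically comprises:

6. The human body detection and tracking method in the ground interactive projection system based on YoloV5 and Deepsort as claimed in claim 5, wherein the state estimation method specifically comprises:

using an eight-dimensional state space $(u, v, r, h, \dot{u}, \dot{v}, \dot{r}, \dot{h})$ to describe the state of the trajectory at a certain time, wherein $(u, v)$ is the center coordinates of the target frame, $r$ is the aspect ratio of the target frame, $h$ is the height of the target frame, and $(\dot{u}, \dot{v}, \dot{r}, \dot{h})$ is the corresponding speed information in the image coordinate system;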

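7. The human body detection and tracking method in the ground interactive projection system based on YoloV5 and Deepsort according to claim 5, wherein the association matching method comprises motion information association matching and appearance information association matching; a linear weighting of the motion information association matching degree $d^{(1)}(i, j)$ and the appearance information association matching degree $d^{(2)}(i, j)$ is used as the final association matching degree $c_{i,j}$, specifically:

using $c_{i,j}$ as the weight of each connecting line between track $i$ and detection $j$, and using

$$b_{i,j} = \prod_{m=1}^{2} b_{i,j}^{(m)}$$

to determine the connecting lines of the initial match, wherein $b_{i,j}^{(1)}$ and $b_{i,j}^{(2)}$ are the indicators of the motion information match and the appearance information match respectively, and $b_{i,j}$ takes the value 1 if there is a connection between $i$ and $j$ and 0 otherwise;

$c_{i,j}$ is calculated as:

$$c_{i,j} = \lambda d^{(1)}(i,j) + (1-\lambda) d^{(2)}(i,j)$$

wherein $\lambda$ is a hyper-parameter.

8. The method as claimed in claim 7, wherein the motion information association matching specifically comprises:

measuring the distance between the Track and the detection result (Detection) by using the squared Mahalanobis distance, calculated as:

$$d^{(1)}(i,j) = (d_j - y_i)^{T} S_i^{-1} (d_j - y_i)$$

wherein $d_j$ is the jth detection result (Detection); $y_i$ is the ith Track;

$S_i$ is the covariance matrix of $d$ and $y$;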

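9. The human body detection and tracking method in the ground interactive projection system based on YoloV5 and Deepsort as claimed in claim 7, wherein the appearance information association matching specifically comprises:

for each detection result $d_j$, computing a feature vector $r_j$ of unit norm;

calculating the minimum cosine distance between the set $R_i$ of the latest 100 successfully associated features of the ith track and the feature vector $r_j$ of the jth detection result $d_j$ of the current frame:

$$d^{(2)}(i,j) = \min\{\, 1 - r_j^{T} r_k^{(i)} \mid r_k^{(i)} \in R_i \,\}$$

$b_{i,j}^{(2)}$ is an indicator that equals 1, representing a successful match, if the minimum cosine distance is less than the threshold $t^{(2)}$, and equals 0 otherwise.

10. The human body detection and tracking method in the ground interactive projection system based on YoloV5 and Deepsort as claimed in claim 7, wherein the association matching method further comprises cascade matching, specifically: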