Background
Multi-Object Tracking (MOT). Given an image sequence, the task is to find the moving objects in it, associate the same object across different frames (identity), and output the motion track of each object. The mainstream framework currently adopted by academia for the MOT problem is TBD (Tracking-by-Detection), i.e., tracking based on detection. Under this framework, the multi-target tracking problem is expressed as an association matching problem: if a detection result obtained in one frame matches a detection result obtained in the previous frame, the two are identified as the same target.
With the continuous development of single-target trackers in recent years, a large number of trackers with good tracking performance and high running speed have appeared. In previous work, single-target trackers have been applied to the multi-target tracking task with some success, but their performance in complex scenes (such as the MOT19 data set) is not ideal, because in such scenes a detection box contains a large amount of redundant features and interfering information, and initializing the tracker with these features and this information greatly degrades the tracking result.
Disclosure of Invention
In view of this, the present invention provides a method for multi-target tracking based on head and torso features, which can effectively extract the head and torso features within a detection box, so as to maximize the proportion of effective information obtained by the tracker during initialization.
In order to achieve this purpose, the invention adopts the following technical scheme:
A method for multi-target tracking based on head and torso characteristics comprises the following steps:
step S1: acquiring the pedestrian detection results in a video, screening the results, and deleting erroneous detection results;
step S2: preprocessing the screened detection results, and inputting the preprocessed detection results into a human body key point detection network to obtain all human body key points;
step S3: screening the obtained key points of each pedestrian, and combining the selected head and shoulder key points to obtain the head and torso features;
step S4: inputting the obtained head and torso features of each single pedestrian into a tracker for initialization, and then tracking the target. A minimal sketch of this overall pipeline is given below.
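The following is a minimal, non-limiting Python sketch of the S1-S4 pipeline. The detector, key point network, and single-target tracker are passed in as callables (detect, estimate_keypoints, make_tracker), and the matches method on tracker objects stands in for the association of steps S42-S43; all names and threshold values here are illustrative assumptions, not part of the claims.

def track_video(frames, detect, estimate_keypoints, make_tracker,
                T_d=0.5, T_r=0.8, T_kp=0.3):
    o_track = []  # O_track: targets currently in the tracked state
    for frame in frames:
        # step S1: detect pedestrians; drop low-confidence or badly shaped boxes
        detections = [d for d in detect(frame)
                      if d["det_c"] >= T_d and d["det_w"] / d["det_h"] <= T_r]
        for det in detections:
            # step S2: estimate human key points inside the detection box
            kps = estimate_keypoints(frame, det)
            # step S3: confident head/shoulder key points form the head-torso feature
            feature = [(x, y, c) for (x, y, c) in kps if c >= T_kp]
            # step S4: if the feature matches no tracked target, start a new tracker
            if feature and not any(t.matches(feature) for t in o_track):
                o_track.append(make_tracker(frame, feature))
    return o_track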
Further, step S1 specifically comprises:
step S11: after each frame of image of the video is preprocessed, it is detected using a target detection network;
step S12: detecting each frame of the video with a pedestrian detector to obtain a detection result set R. Let R = {K_i, P_j, det_x, det_y, det_w, det_h, det_c}, i = 1, 2, ..., M, j = 1, 2, ..., N, represent the set of all detection results in a video sequence, where M represents the number of image frames in the sequence, N represents the number of pedestrians detected in one frame image, K_i represents the i-th frame of the sequence, P_j represents the j-th pedestrian in that frame, det_x and det_y represent the x and y coordinates of the upper-left corner of the pedestrian's detection box, det_w and det_h represent the width and height of the detection box, and det_c represents the confidence of the detection box;
step S13: let the pedestrian detection confidence threshold be T_d and the pedestrian aspect ratio threshold be T_r, and delete every detection result satisfying:
det_c < T_d or det_w / det_h > T_r.
Further, step S2 specifically comprises:
step S21: resizing the screened detection boxes to a preset size;
step S22: preprocessing the resized image, copying the preprocessed image, horizontally flipping the copy, and inputting both the original image and the flipped image into a human body key point detection network;
step S23: obtaining the output result S of the human body key point detection network. Let S = {J_z}, z = 1, 2, ..., Z, where Z represents the number of human key points in the image, and J_z = {joint_x, joint_y, joint_c} denotes the z-th key point, where joint_x represents the x coordinate of the human key point, joint_y represents its y coordinate, and joint_c represents its confidence;
step S24: let the human key point detection result of the original image be S_src and that of the flipped image be S_flip, and fuse the two detection results.
Further, the preprocessing specifically comprises: first removing sharp image noise with Gaussian filtering, and then removing fine interfering details with the USM sharpening enhancement algorithm, calculated as:
output = orign_image + ω · (orign_image − gaus_image)
where output represents the output image, orign_image represents the original image, gaus_image represents the image after Gaussian filtering, and ω represents the USM coefficient.
Further, the fusion method is as follows:
where c_src represents the confidence of a human key point in the original image, x_src and y_src represent its x and y coordinates in the original image; c_flip represents the confidence of the corresponding key point in the flipped image, x_flip and y_flip represent its x and y coordinates in the flipped image; final_x and final_y represent the x and y coordinates of the finally fused human key point, and final_c represents its confidence.
Further, step S3 specifically comprises:
step S31: screening the selected key points with a confidence-based scheme: let the human key point detection confidence threshold be T_kp, and delete every human key point satisfying:
joint_c < T_kp.
step S32: combining the screened key points: let the screened human key point set be Q; traverse Q, sort it by x coordinate and by y coordinate, find the topmost, bottommost, leftmost, and rightmost key points in the set, and obtain the minimum rectangular convex hull of the set (i.e., its axis-aligned bounding rectangle); the image content within this rectangle is the head and torso feature of the target.
Further, step S4 specifically comprises:
step S41: let the set of targets in the tracked state be O_track; the set contains all targets in the tracking state from the first frame of the video to the current frame;
step S42: traversing the set O_track of targets in the tracked state, and performing IOU and OKS calculations between the newly obtained head and torso feature and every target in the set to confirm whether the target is already in the tracking state;
step S43: calculating the fusion metric value FF between the bounding box of each tracked target in the set and the newly obtained head and torso feature bounding box, FF fusing the IOU and OKS scores computed in step S42; if FF is greater than 0.5, the target is considered to already exist, it need not be initialized as a new target, and the process returns to step S1; otherwise the feature is considered to belong to a new target, and the process proceeds to the next step;
step S44: inputting the new head and torso feature into a tracker for initialization, adding it to the set O_track of targets in the tracked state, and returning to step S1.
Further, the IOU and OKS are calculated as follows (δ(·) is the indicator function, equal to 1 when its condition holds and 0 otherwise):
IOU = area(A ∩ B) / area(A ∪ B)
OKS = Σ_z [ exp(−dis_z² / (2 · scale² · σ_z²)) · δ(Vis_z > 0) ] / Σ_z δ(Vis_z > 0)
where A is the bounding box of the first target and B is the bounding box of the second target, a bounding box area being the rectangle's length multiplied by its width; Vis_z is the visibility of the z-th key point (greater than 0 meaning visible); dis_z is the Euclidean distance between the existing and the newly detected z-th human key point; scale² is the size of the area occupied by these key points; and σ_z is the normalization factor of the z-th human body key point.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention does not depend on the quality of the detection results; even if a detection box deviates from the Ground Truth in the data set, it can be corrected through the human body key point detection network;
2. The invention uses the head and torso features acquired through the human body key point detection network for target tracking; considering the shooting angle of surveillance scenes, the head and torso are not easily occluded, so effective features can be extracted for tracking even in crowded surveillance scenes;
3. The method uses the human body key point detection network to obtain the effective information contained in the head and torso features; this effective information occupies a larger average proportion of the image, which is more conducive to tracker initialization and subsequent tracking.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
Referring to fig. 1, the present invention provides a method for multi-target tracking based on head and torso features, comprising the following steps:
step S1: acquiring the pedestrian detection results in a video, screening the results, and deleting erroneous detection results;
step S2: inputting the screened detection results into a human body key point detection network to obtain all human body key points; this reduces the dependence on detection quality, since even a poor detection result can be corrected through the human body key points;
step S3: considering the shooting angle of surveillance scenes, the head and torso are not easily occluded, so effective features can be extracted for tracking even in crowded surveillance scenes; the obtained key points of each pedestrian are therefore screened, and the head and shoulder key points are then selected and combined to obtain the head and torso features;
step S4: the obtained head and torso features of each single pedestrian are input into the tracker for initialization, and the target is then tracked.
In this embodiment, step S1 includes the following steps:
step S11: after each frame of image of the video is preprocessed, it is detected using a target detection network;
step S12: detecting each frame of the video with a pedestrian detector to obtain a detection result set R. Let R = {K_i, P_j, det_x, det_y, det_w, det_h, det_c}, i = 1, 2, ..., M, j = 1, 2, ..., N, represent the set of all detection results in a video sequence, where M represents the number of image frames in the sequence, N represents the number of pedestrians detected in one frame image, K_i represents the i-th frame of the sequence, P_j represents the j-th pedestrian in that frame, det_x and det_y represent the x and y coordinates of the upper-left corner of the pedestrian's detection box, det_w and det_h represent the width and height of the detection box, and det_c represents the confidence of the detection box;
step S13: let the pedestrian detection confidence threshold be T_d and the pedestrian aspect ratio threshold be T_r, and delete every detection result satisfying (a sketch of this screening is given below):
det_c < T_d or det_w / det_h > T_r.
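A minimal Python sketch of this screening, assuming detections are dictionaries with the det_* fields of step S12; the threshold values are illustrative only:

def screen_detections(detections, T_d=0.5, T_r=0.8):
    # step S13: keep detections with det_c >= T_d and det_w / det_h <= T_r
    return [d for d in detections
            if d["det_c"] >= T_d and d["det_w"] / d["det_h"] <= T_r]

# example: a wide, low box (e.g. two merged pedestrians) is screened out
dets = [{"det_x": 10, "det_y": 20, "det_w": 40, "det_h": 120, "det_c": 0.90},
        {"det_x": 50, "det_y": 60, "det_w": 90, "det_h": 80,  "det_c": 0.95}]
print(screen_detections(dets))  # only the first, upright box survives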
In this embodiment, step S2 specifically includes the following steps:
step S21: resizing the screened detection boxes to a 4:3 aspect ratio, specifically 344 × 258;
step S22: preprocessing the resized image: first, Gaussian filtering is used to remove sharp image noise (Gaussian filtering is chosen because it better preserves boundary information); then the USM sharpening enhancement algorithm is used to remove fine interfering details, calculated as:
output = orign_image + ω · (orign_image − gaus_image)
where output represents the output image, orign_image represents the original image, gaus_image represents the image after Gaussian filtering, and ω represents the USM coefficient. The preprocessed output image is then copied, the copy is horizontally flipped, and both the original image and the flipped image are input into the human body key point detection network; a sketch of this preprocessing is given below;
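A sketch of the preprocessing using OpenCV; it assumes 344 is the height and 258 the width of the upright pedestrian patch, and the kernel size (5, 5) and USM coefficient ω = 1.5 are illustrative defaults, not values fixed by the embodiment:

import cv2

def preprocess_patch(patch, omega=1.5):
    resized = cv2.resize(patch, (258, 344))          # (width, height), 4:3 upright
    blurred = cv2.GaussianBlur(resized, (5, 5), 0)   # remove sharp noise
    # USM: output = orign_image + omega * (orign_image - gaus_image)
    sharpened = cv2.addWeighted(resized, 1.0 + omega, blurred, -omega, 0)
    flipped = cv2.flip(sharpened, 1)                 # horizontally flipped copy
    return sharpened, flipped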
step S23: obtaining the output result S of the human body key point detection network. Let S = {J_z}, z = 1, 2, ..., Z, where Z represents the number of human key points in the image, and J_z = {joint_x, joint_y, joint_c} denotes the z-th key point, where joint_x represents the x coordinate of the human key point, joint_y represents its y coordinate, and joint_c represents its confidence;
step S24: let the human key point detection result of the original image be S_src and that of the flipped image be S_flip, and fuse the two detection results to obtain more accurate human key point coordinates; the fusion method is as follows:
where c_src represents the confidence of a human key point in the original image, x_src and y_src represent its x and y coordinates in the original image; c_flip represents the confidence of the corresponding key point in the flipped image, x_flip and y_flip represent its x and y coordinates in the flipped image; final_x and final_y represent the x and y coordinates of the finally fused human key point, and final_c represents its confidence. A hedged sketch of one plausible fusion is given below.
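The fusion formula itself is not reproduced above, so the following Python sketch only assumes a common test-time-augmentation scheme consistent with the variables defined: the flipped x coordinates are mirrored back, and the two estimates are averaged with confidence weights. A full implementation would also swap symmetric joints (left/right eyes, ears, shoulders) between the two passes.

def fuse_keypoints(src_kps, flip_kps, image_width):
    # src_kps and flip_kps hold (x, y, c) triples, index-aligned per joint
    fused = []
    for (x_s, y_s, c_s), (x_f, y_f, c_f) in zip(src_kps, flip_kps):
        x_f = image_width - 1 - x_f          # undo the horizontal flip
        total = c_s + c_f
        if total == 0:                       # neither pass saw this joint
            fused.append((x_s, y_s, 0.0))
            continue
        final_x = (c_s * x_s + c_f * x_f) / total
        final_y = (c_s * y_s + c_f * y_f) / total
        final_c = total / 2.0                # mean confidence of the two passes
        fused.append((final_x, final_y, final_c))
    return fused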
In this embodiment, the human body key points selected as having good tracking characteristics are the key points of the head (eyes, nose, ears) and of the two shoulders. Step S3 specifically includes the following steps:
step S31: screening the selected key points with a confidence-based scheme: let the human key point detection confidence threshold be T_kp, and delete every human key point satisfying:
joint_c < T_kp.
step S32: combining the screened key points: let the screened human key point set be Q; traverse Q, sort it by x coordinate and by y coordinate, find the topmost, bottommost, leftmost, and rightmost key points in the set, and obtain the minimum rectangular convex hull of the set (i.e., its axis-aligned bounding rectangle). The image content within this rectangle is the head and torso feature of the target; a sketch of this step is given below.
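A minimal Python sketch of steps S31-S32, with key points given as (x, y, c) triples and T_kp an illustrative threshold:

def head_torso_box(keypoints, T_kp=0.3):
    # step S31: drop key points whose confidence is below T_kp
    Q = [(x, y) for (x, y, c) in keypoints if c >= T_kp]
    if not Q:
        return None                          # no reliable head/shoulder evidence
    xs = [x for x, _ in Q]
    ys = [y for _, y in Q]
    # step S32: axis-aligned bounding rectangle as (left, top, width, height)
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))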
In this embodiment, step S4 specifically includes the following steps:
step S41: let the set of targets in the tracked state be O_track; the set contains all targets in the tracking state from the first frame of the video to the current frame.
Step S42: traversing the set O_track of targets in the tracked state, and performing IOU (intersection over union) and OKS (object key point similarity) calculations between the newly obtained head and torso feature and every target in the set to confirm whether the target is already in the tracking state; the IOU and OKS are calculated as follows (δ(·) is the indicator function):
IOU = area(A ∩ B) / area(A ∪ B)
OKS = Σ_z [ exp(−dis_z² / (2 · scale² · σ_z²)) · δ(Vis_z > 0) ] / Σ_z δ(Vis_z > 0)
where A is the bounding box of the first target and B is the bounding box of the second target, a bounding box area being the rectangle's length multiplied by its width; Vis_z is the visibility of the z-th key point (greater than 0 meaning visible); dis_z is the Euclidean distance between the existing and the newly detected z-th human key point; scale² is the size of the area occupied by these key points; and σ_z is the normalization factor of the z-th human body key point (the factor is obtained by calculating the standard deviation over all Ground Truth annotations in the existing data set, and reflects the degree of influence of the current key point on the whole body). A sketch of both metrics is given below.
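A Python sketch of both metrics; the OKS follows the standard COCO-style formulation matching the variables above, with boxes given as (x, y, w, h):

import math

def iou(box_a, box_b):
    # intersection over union of two axis-aligned boxes
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def oks(kps_a, kps_b, scale_sq, sigmas, vis):
    # object key point similarity over index-aligned (x, y) joints
    num, den = 0.0, 0
    for (xa, ya), (xb, yb), sigma, v in zip(kps_a, kps_b, sigmas, vis):
        if v > 0:                            # visible joints only
            dis_sq = (xa - xb) ** 2 + (ya - yb) ** 2
            num += math.exp(-dis_sq / (2.0 * scale_sq * sigma ** 2))
            den += 1
    return num / den if den > 0 else 0.0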
Step S43: calculating the fusion metric value FF between the bounding box of each tracked target in the set and the newly obtained head and torso feature bounding box, FF fusing the IOU and OKS scores computed in step S42. If FF is greater than 0.5, the target is considered to already exist, it need not be re-initialized as a new target, and the process returns to step S11; otherwise the feature is deemed to belong to a new target, and the process proceeds to the next step.
Step S44: inputting the new head and torso feature into a tracker for initialization, adding it to the set O_track of targets in the tracked state, and returning to step S11. A sketch of this association step is given below.
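Since the FF formula is not reproduced above, the following sketch assumes the simplest fusion consistent with the text, an equal-weight average of IOU and OKS thresholded at 0.5 as in step S43; iou() and oks() are the sketches given earlier, and make_tracker stands in for initializing the single-target tracker:

def fused_metric(track, feature, sigmas):
    # assumed FF: equal-weight average of IOU and OKS (weights hypothetical)
    box = feature["box"]
    ff_iou = iou(track["box"], box)
    ff_oks = oks(track["kps"], feature["kps"],
                 scale_sq=box[2] * box[3],   # area of the head-torso box
                 sigmas=sigmas, vis=feature["vis"])
    return 0.5 * (ff_iou + ff_oks)

def associate(o_track, feature, sigmas, make_tracker):
    # steps S42-S44: match a new head-torso feature against O_track
    for track in o_track:
        if fused_metric(track, feature, sigmas) > 0.5:
            return                           # target already exists (step S43)
    o_track.append(make_tracker(feature))    # new target: initialize a tracker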
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.