Background
Multi-Object Tracking (MOT). Given an image sequence, the task is to find the moving objects in it, associate the same object across different frames (identity), and output the motion track of each object. The mainstream framework currently adopted by academia for the MOT problem is TBD (Tracking-by-Detection), i.e., tracking based on detection. Under this framework, the multi-target tracking problem is expressed as an association matching problem: if a detection result obtained in one frame matches a detection result obtained in the previous frame, the two are identified as the same target.
With the continuous development of single-target trackers in recent years, a large number of trackers with good tracking performance and high running speed have appeared. In previous work, single-target trackers have been applied to the multi-target tracking task with some success, but their performance in complex scenes (such as the MOT19 data set) is not ideal, because in such scenes a detection box contains a large amount of redundant features and interfering information, and initializing the tracker with these features and this information greatly degrades the tracking result.
Disclosure of Invention
In view of this, the present invention provides a method for multi-target tracking based on head and torso features, which can effectively extract the head and torso features within a detection box, so as to maximize the proportion of effective information obtained by the tracker during initialization.
In order to achieve this purpose, the invention adopts the following technical scheme:
A method for multi-target tracking based on head and torso characteristics comprises the following steps:
step S1: acquiring the pedestrian detection results in a video, screening the results, and deleting erroneous detection results;
step S2: preprocessing the screened detection results, and inputting the preprocessed detection results into a human body key point detection network to obtain all human body key points;
step S3: screening the obtained key points of each pedestrian, and combining the selected head and shoulder key points to obtain the head and torso features;
step S4: inputting the obtained head and torso features of each single pedestrian into a tracker for initialization, and then tracking the target. A minimal sketch of this overall pipeline is given below.
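The following is a minimal, non-limiting Python sketch of the S1-S4 pipeline. The detector, key point network, and single-target tracker are passed in as callables (detect, estimate_keypoints, make_tracker), and the matches method on tracker objects stands in for the association of steps S42-S43; all names and threshold values here are illustrative assumptions, not part of the claims.

def track_video(frames, detect, estimate_keypoints, make_tracker,
                T_d=0.5, T_r=0.8, T_kp=0.3):
    o_track = []  # O_track: targets currently in the tracked state
    for frame in frames:
        # step S1: detect pedestrians; drop low-confidence or badly shaped boxes
        detections = [d for d in detect(frame)
                      if d["det_c"] >= T_d and d["det_w"] / d["det_h"] <= T_r]
        for det in detections:
            # step S2: estimate human key points inside the detection box
            kps = estimate_keypoints(frame, det)
            # step S3: confident head/shoulder key points form the head-torso feature
            feature = [(x, y, c) for (x, y, c) in kps if c >= T_kp]
            # step S4: if the feature matches no tracked target, start a new tracker
            if feature and not any(t.matches(feature) for t in o_track):
                o_track.append(make_tracker(frame, feature))
    return o_track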
Further, step S1 specifically comprises:
step S11: after each frame of image of the video is preprocessed, it is detected using a target detection network;
step S12: detecting each frame of the video with a pedestrian detector to obtain a detection result set R. Let R = {K_i, P_j, det_x, det_y, det_w, det_h, det_c}, i = 1, 2, ..., M, j = 1, 2, ..., N, represent the set of all detection results in a video sequence, where M represents the number of image frames in the sequence, N represents the number of pedestrians detected in one frame image, K_i represents the i-th frame of the sequence, P_j represents the j-th pedestrian in that frame, det_x and det_y represent the x and y coordinates of the upper-left corner of the pedestrian's detection box, det_w and det_h represent the width and height of the detection box, and det_c represents the confidence of the detection box;
step S13: let the pedestrian detection confidence threshold be T_d and the pedestrian aspect ratio threshold be T_r, and delete every detection result satisfying:
det_c < T_d or det_w / det_h > T_r.
Further, step S2 specifically comprises:
step S21: resizing the screened detection boxes to a preset size;
step S22: preprocessing the resized image, copying the preprocessed image, horizontally flipping the copy, and inputting both the original image and the flipped image into a human body key point detection network;
step S23: obtaining the output result S of the human body key point detection network. Let S = {J_z}, z = 1, 2, ..., Z, where Z represents the number of human key points in the image, and J_z = {joint_x, joint_y, joint_c} denotes the z-th key point, where joint_x represents the x coordinate of the human key point, joint_y represents its y coordinate, and joint_c represents its confidence;
step S24: let the human key point detection result of the original image be S_src and that of the flipped image be S_flip, and fuse the two detection results.
Further, the preprocessing specifically comprises: first removing sharp image noise with Gaussian filtering, and then removing fine interfering details with the USM sharpening enhancement algorithm, calculated as:
output = orign_image + ω · (orign_image − gaus_image)
where output represents the output image, orign_image represents the original image, gaus_image represents the image after Gaussian filtering, and ω represents the USM coefficient.
Further, the fusion method is as follows:
where c_src represents the confidence of a human key point in the original image, x_src and y_src represent its x and y coordinates in the original image; c_flip represents the confidence of the corresponding key point in the flipped image, x_flip and y_flip represent its x and y coordinates in the flipped image; final_x and final_y represent the x and y coordinates of the finally fused human key point, and final_c represents its confidence.
Further, step S3 specifically comprises:
step S31: screening the selected key points with a confidence-based scheme: let the human key point detection confidence threshold be T_kp, and delete every human key point satisfying:
joint_c < T_kp.
step S32: combining the screened key points: let the screened human key point set be Q; traverse Q, sort it by x coordinate and by y coordinate, find the topmost, bottommost, leftmost, and rightmost key points in the set, and obtain the minimum rectangular convex hull of the set (i.e., its axis-aligned bounding rectangle); the image content within this rectangle is the head and torso feature of the target.
Further, step S4 specifically comprises:
step S41: let the set of targets in the tracked state be O_track; the set contains all targets in the tracking state from the first frame of the video to the current frame;
step S42: traversing the set O_track of targets in the tracked state, and performing IOU and OKS calculations between the newly obtained head and torso feature and every target in the set to confirm whether the target is already in the tracking state;
step S43: calculating the fusion metric value FF between the bounding box of each tracked target in the set and the newly obtained head and torso feature bounding box, FF fusing the IOU and OKS scores computed in step S42; if FF is greater than 0.5, the target is considered to already exist, it need not be initialized as a new target, and the process returns to step S1; otherwise the feature is considered to belong to a new target, and the process proceeds to the next step;
step S44: inputting the new head and torso feature into a tracker for initialization, adding it to the set O_track of targets in the tracked state, and returning to step S1.
Further, the IOU and OKS are calculated as follows (δ(·) is the indicator function, equal to 1 when its condition holds and 0 otherwise):
IOU = area(A ∩ B) / area(A ∪ B)
OKS = Σ_z [ exp(−dis_z² / (2 · scale² · σ_z²)) · δ(Vis_z > 0) ] / Σ_z δ(Vis_z > 0)
where A is the bounding box of the first target and B is the bounding box of the second target, a bounding box area being the rectangle's length multiplied by its width; Vis_z is the visibility of the z-th key point (greater than 0 meaning visible); dis_z is the Euclidean distance between the existing and the newly detected z-th human key point; scale² is the size of the area occupied by these key points; and σ_z is the normalization factor of the z-th human body key point.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention does not depend on the quality of the detection results; even if a detection box deviates from the Ground Truth in the data set, it can be corrected through the human body key point detection network;
2. The invention uses the head and torso features acquired through the human body key point detection network for target tracking; considering the shooting angle of surveillance scenes, the head and torso are not easily occluded, so effective features can be extracted for tracking even in crowded surveillance scenes;
3. The method uses the human body key point detection network to obtain the effective information contained in the head and torso features; this effective information occupies a larger average proportion of the image, which is more conducive to tracker initialization and subsequent tracking.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
Referring to fig. 1, the present invention provides a method for multi-target tracking based on head and torso features, comprising the following steps:
step S1: acquiring the pedestrian detection results in a video, screening the results, and deleting erroneous detection results;
step S2: inputting the screened detection results into a human body key point detection network to obtain all human body key points; this reduces the dependence on detection quality, since even a poor detection result can be corrected through the human body key points;
step S3: considering the shooting angle of surveillance scenes, the head and torso are not easily occluded, so effective features can be extracted for tracking even in crowded surveillance scenes; the obtained key points of each pedestrian are therefore screened, and the head and shoulder key points are then selected and combined to obtain the head and torso features;
step S4: the obtained head and torso features of each single pedestrian are input into the tracker for initialization, and the target is then tracked.
In this embodiment, step S1 includes the following steps:
step S11: after each frame of image of the video is preprocessed, it is detected using a target detection network;
step S12: detecting each frame of the video with a pedestrian detector to obtain a detection result set R. Let R = {K_i, P_j, det_x, det_y, det_w, det_h, det_c}, i = 1, 2, ..., M, j = 1, 2, ..., N, represent the set of all detection results in a video sequence, where M represents the number of image frames in the sequence, N represents the number of pedestrians detected in one frame image, K_i represents the i-th frame of the sequence, P_j represents the j-th pedestrian in that frame, det_x and det_y represent the x and y coordinates of the upper-left corner of the pedestrian's detection box, det_w and det_h represent the width and height of the detection box, and det_c represents the confidence of the detection box;
step S13: let the pedestrian detection confidence threshold be T_d and the pedestrian aspect ratio threshold be T_r, and delete every detection result satisfying (a sketch of this screening is given below):
det_c < T_d or det_w / det_h > T_r.
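A minimal Python sketch of this screening, assuming detections are dictionaries with the det_* fields of step S12; the threshold values are illustrative only:

def screen_detections(detections, T_d=0.5, T_r=0.8):
    # step S13: keep detections with det_c >= T_d and det_w / det_h <= T_r
    return [d for d in detections
            if d["det_c"] >= T_d and d["det_w"] / d["det_h"] <= T_r]

# example: a wide, low box (e.g. two merged pedestrians) is screened out
dets = [{"det_x": 10, "det_y": 20, "det_w": 40, "det_h": 120, "det_c": 0.90},
        {"det_x": 50, "det_y": 60, "det_w": 90, "det_h": 80,  "det_c": 0.95}]
print(screen_detections(dets))  # only the first, upright box survives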
In this embodiment, step S2 specifically includes the following steps:
step S21: resizing the screened detection boxes to a 4:3 aspect ratio, specifically 344 × 258;
step S22: preprocessing the resized image: first, Gaussian filtering is used to remove sharp image noise (Gaussian filtering is chosen because it better preserves boundary information); then the USM sharpening enhancement algorithm is used to remove fine interfering details, calculated as:
output = orign_image + ω · (orign_image − gaus_image)
where output represents the output image, orign_image represents the original image, gaus_image represents the image after Gaussian filtering, and ω represents the USM coefficient. The preprocessed output image is then copied, the copy is horizontally flipped, and both the original image and the flipped image are input into the human body key point detection network; a sketch of this preprocessing is given below;
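A sketch of the preprocessing using OpenCV; it assumes 344 is the height and 258 the width of the upright pedestrian patch, and the kernel size (5, 5) and USM coefficient ω = 1.5 are illustrative defaults, not values fixed by the embodiment:

import cv2

def preprocess_patch(patch, omega=1.5):
    resized = cv2.resize(patch, (258, 344))          # (width, height), 4:3 upright
    blurred = cv2.GaussianBlur(resized, (5, 5), 0)   # remove sharp noise
    # USM: output = orign_image + omega * (orign_image - gaus_image)
    sharpened = cv2.addWeighted(resized, 1.0 + omega, blurred, -omega, 0)
    flipped = cv2.flip(sharpened, 1)                 # horizontally flipped copy
    return sharpened, flipped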
step S23: obtaining the output result S of the human body key point detection network. Let S = {J_z}, z = 1, 2, ..., Z, where Z represents the number of human key points in the image, and J_z = {joint_x, joint_y, joint_c} denotes the z-th key point, where joint_x represents the x coordinate of the human key point, joint_y represents its y coordinate, and joint_c represents its confidence;
step S24: let the human key point detection result of the original image be S_src and that of the flipped image be S_flip, and fuse the two detection results to obtain more accurate human key point coordinates; the fusion method is as follows:
where c_src represents the confidence of a human key point in the original image, x_src and y_src represent its x and y coordinates in the original image; c_flip represents the confidence of the corresponding key point in the flipped image, x_flip and y_flip represent its x and y coordinates in the flipped image; final_x and final_y represent the x and y coordinates of the finally fused human key point, and final_c represents its confidence. A hedged sketch of one plausible fusion is given below.
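The fusion formula itself is not reproduced above, so the following Python sketch only assumes a common test-time-augmentation scheme consistent with the variables defined: the flipped x coordinates are mirrored back, and the two estimates are averaged with confidence weights. A full implementation would also swap symmetric joints (left/right eyes, ears, shoulders) between the two passes.

def fuse_keypoints(src_kps, flip_kps, image_width):
    # src_kps and flip_kps hold (x, y, c) triples, index-aligned per joint
    fused = []
    for (x_s, y_s, c_s), (x_f, y_f, c_f) in zip(src_kps, flip_kps):
        x_f = image_width - 1 - x_f          # undo the horizontal flip
        total = c_s + c_f
        if total == 0:                       # neither pass saw this joint
            fused.append((x_s, y_s, 0.0))
            continue
        final_x = (c_s * x_s + c_f * x_f) / total
        final_y = (c_s * y_s + c_f * y_f) / total
        final_c = total / 2.0                # mean confidence of the two passes
        fused.append((final_x, final_y, final_c))
    return fused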
In this embodiment, the human body key points selected as having good tracking characteristics are the key points of the head (eyes, nose, ears) and of the two shoulders. Step S3 specifically includes the following steps:
step S31: screening the selected key points with a confidence-based scheme: let the human key point detection confidence threshold be T_kp, and delete every human key point satisfying:
joint_c < T_kp.
step S32: combining the screened key points: let the screened human key point set be Q; traverse Q, sort it by x coordinate and by y coordinate, find the topmost, bottommost, leftmost, and rightmost key points in the set, and obtain the minimum rectangular convex hull of the set (i.e., its axis-aligned bounding rectangle). The image content within this rectangle is the head and torso feature of the target; a sketch of this step is given below.
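A minimal Python sketch of steps S31-S32, with key points given as (x, y, c) triples and T_kp an illustrative threshold:

def head_torso_box(keypoints, T_kp=0.3):
    # step S31: drop key points whose confidence is below T_kp
    Q = [(x, y) for (x, y, c) in keypoints if c >= T_kp]
    if not Q:
        return None                          # no reliable head/shoulder evidence
    xs = [x for x, _ in Q]
    ys = [y for _, y in Q]
    # step S32: axis-aligned bounding rectangle as (left, top, width, height)
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))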
In this embodiment, step S4 specifically includes the following steps:
step S41: let the set of targets in the tracked state be O_track; the set contains all targets in the tracking state from the first frame of the video to the current frame.
Step S42: traversing the set O_track of targets in the tracked state, and performing IOU (intersection over union) and OKS (object key point similarity) calculations between the newly obtained head and torso feature and every target in the set to confirm whether the target is already in the tracking state; the IOU and OKS are calculated as follows (δ(·) is the indicator function):
IOU = area(A ∩ B) / area(A ∪ B)
OKS = Σ_z [ exp(−dis_z² / (2 · scale² · σ_z²)) · δ(Vis_z > 0) ] / Σ_z δ(Vis_z > 0)
where A is the bounding box of the first target and B is the bounding box of the second target, a bounding box area being the rectangle's length multiplied by its width; Vis_z is the visibility of the z-th key point (greater than 0 meaning visible); dis_z is the Euclidean distance between the existing and the newly detected z-th human key point; scale² is the size of the area occupied by these key points; and σ_z is the normalization factor of the z-th human body key point (the factor is obtained by calculating the standard deviation over all Ground Truth annotations in the existing data set, and reflects the degree of influence of the current key point on the whole body). A sketch of both metrics is given below.
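A Python sketch of both metrics; the OKS follows the standard COCO-style formulation matching the variables above, with boxes given as (x, y, w, h):

import math

def iou(box_a, box_b):
    # intersection over union of two axis-aligned boxes
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def oks(kps_a, kps_b, scale_sq, sigmas, vis):
    # object key point similarity over index-aligned (x, y) joints
    num, den = 0.0, 0
    for (xa, ya), (xb, yb), sigma, v in zip(kps_a, kps_b, sigmas, vis):
        if v > 0:                            # visible joints only
            dis_sq = (xa - xb) ** 2 + (ya - yb) ** 2
            num += math.exp(-dis_sq / (2.0 * scale_sq * sigma ** 2))
            den += 1
    return num / den if den > 0 else 0.0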
Step S43: calculating the fusion metric value FF between the bounding box of each tracked target in the set and the newly obtained head and torso feature bounding box, FF fusing the IOU and OKS scores computed in step S42. If FF is greater than 0.5, the target is considered to already exist, it need not be re-initialized as a new target, and the process returns to step S11; otherwise the feature is deemed to belong to a new target, and the process proceeds to the next step.
Step S44: inputting the new head and torso feature into a tracker for initialization, adding it to the set O_track of targets in the tracked state, and returning to step S11. A sketch of this association step is given below.
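Since the FF formula is not reproduced above, the following sketch assumes the simplest fusion consistent with the text, an equal-weight average of IOU and OKS thresholded at 0.5 as in step S43; iou() and oks() are the sketches given earlier, and make_tracker stands in for initializing the single-target tracker:

def fused_metric(track, feature, sigmas):
    # assumed FF: equal-weight average of IOU and OKS (weights hypothetical)
    box = feature["box"]
    ff_iou = iou(track["box"], box)
    ff_oks = oks(track["kps"], feature["kps"],
                 scale_sq=box[2] * box[3],   # area of the head-torso box
                 sigmas=sigmas, vis=feature["vis"])
    return 0.5 * (ff_iou + ff_oks)

def associate(o_track, feature, sigmas, make_tracker):
    # steps S42-S44: match a new head-torso feature against O_track
    for track in o_track:
        if fused_metric(track, feature, sigmas) > 0.5:
            return                           # target already exists (step S43)
    o_track.append(make_tracker(feature))    # new target: initialize a tracker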
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.