CN110910428B - Real-time multi-target tracking method based on neural network - Google Patents

Real-time multi-target tracking method based on neural network

Info

Publication number
CN110910428B
CN110910428B (application CN201911236736.8A)
Authority
CN
China
Prior art keywords
target
tracking
tracked
matched
position coordinate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911236736.8A
Other languages
Chinese (zh)
Other versions
CN110910428A (en)
Inventor
吴圣杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Zhongyun Smart Data Technology Co ltd
Original Assignee
Jiangsu Zhongyun Smart Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Zhongyun Smart Data Technology Co ltd filed Critical Jiangsu Zhongyun Smart Data Technology Co ltd
Priority to CN201911236736.8A
Publication of CN110910428A
Application granted
Publication of CN110910428B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a real-time multi-target tracking method based on a neural network, which comprises the following steps. S1: establish a pre-training target detection model. S2: initialize a global tracker. S3: perform target detection on the video stream to be tracked using the pre-training target detection model. S4: obtain the actual position coordinates and the predicted position coordinates of each tracked target in the set of targets with an established tracking state in the global tracker. S5: cross-compute the position coordinates of each target to be matched against the actual and predicted position coordinates of each tracked target. S6: analyze the maximum matching coefficients from S5 and update the robust state of each tracked target to obtain tracking information. S7: repeat S3 through S6. The method effectively improves detection of tracked objects under difficult conditions such as occlusion, deformation, environmental change and rapid motion.

Description

Real-time multi-target tracking method based on neural network
Technical Field
The invention relates to the field of deep learning, in particular to a real-time multi-target tracking method based on a neural network.
Background
For video target tracking, a number of strong traditional methods have emerged over the years, such as the KCF, TLD and CSRT tracking algorithms, as well as basic approaches based on optical flow, Mean-Shift and template matching. However, traditional methods require manual calibration of the tracking target and cannot fully automatically track targets of a specific category. When occlusion, deformation, environmental change or rapid motion occurs, traditional methods easily lose the tracked target.
To overcome these shortcomings of the traditional methods, several neural-network-based tracking algorithms have been proposed in recent years and achieve better tracking performance than the traditional algorithms, but they still have limitations. For example, Chinese patent application No. 201611271054.7 discloses a neural-network-based target tracking algorithm that trains a neural-network auto-encoder to extract target features for object detection in each frame of image. Its drawback is that when several similar targets exist in the scene, targets cannot be reliably matched between consecutive frames, so the tracked target is lost. Moreover, when a target is present in the video sequence for only a short time, not enough training samples can be collected to train the auto-encoder to learn its features online. To address these problems, Chinese patent application No. 201811532811.0 proposes a tracking method similar to the SORT algorithm, which matches and tracks targets according to the IOU values between detected targets; however, this method depends too heavily on the IOU between the targets detected in the current frame and the targets being tracked, and fails to match targets when the detection interval is too long, a detection is missed, or the object moves too fast.
Disclosure of Invention
In view of the above, to remedy the defects of the prior art, the invention provides a real-time multi-target tracking method based on a neural network. It can robustly track targets in a video stream in real time, effectively overcomes the poor performance of existing methods under difficult conditions such as occlusion, deformation, environmental change and rapid motion, and offers good target matching capability. The specific content is as follows: a real-time multi-target tracking method based on a neural network comprises the following steps:
s1: establishing a pre-training target detection model, wherein the pre-training target detection model is used for detecting a tracking target in a video stream to be tracked;
s2: initializing a global tracker;
s3: performing target detection on the video stream to be tracked by using the pre-training target detection model to obtain a target set to be matched and a corresponding position coordinate set;
s4: acquiring the actual position coordinates and the predicted position coordinates of each tracked target in a target set with an established tracking state in the global tracker;
s5: respectively carrying out cross calculation on the position coordinate of each target to be matched in the target set to be matched in S3, the actual position coordinate and the predicted position coordinate of each tracked target in S4 to obtain a Jaccard coefficient, and comparing to obtain a maximum matching coefficient of each target to be matched and each tracked target;
s6: analyzing the maximum matching coefficient in S5, updating the robust state of each tracked target in the global tracker to obtain tracking information, and establishing a tracking state of a target to be matched which is not matched with the tracked target;
s7: repeating S3-S6 until the video stream ends.
Further, the step S1: establishing a pre-training target detection model includes: pre-training on the object type to be detected using a deep learning neural network target detection model to obtain the pre-training target detection model.
Further, the detection time of the target detection model is less than the time of generating each frame of image by the video stream.
Further, in the step S2 the global tracker includes: an actual tracking target set and a buffered tracking target set.
Further, when the global tracker is initialized, the actual tracking target set and the buffered tracking target set are both empty sets.
Further, the actual tracking target set and the buffered tracking target set store tracking information of the tracked targets, the tracking information including one or more of: the current position coordinates of the tracked target in the image, the historical position coordinates, the accumulated number of frames in the disappeared state, and the number of consecutive frames in the continuous tracking state.
Further, the obtaining of the predicted position coordinates in S4 includes: calculating the actual position coordinates using an elastic offset updating method to obtain the corresponding predicted position coordinates in the current frame image.
Further, the step S6: analyzing the maximum matching coefficient from S5, updating the robust state of each tracked target in the global tracker to obtain tracking information, and establishing a tracking state for a target to be matched that is not matched with a tracked target, includes:
a: when the maximum matching coefficient is smaller than a threshold n, updating the tracked target to the disappeared state, and then judging whether the accumulated number of frames in which the tracked target has disappeared is greater than or equal to a threshold L:
a1: if it is greater than or equal to the threshold L, deleting the tracked target from the global tracker;
a2: if it is less than the threshold L, incrementing by 1 the count of accumulated frames in which the tracked target has disappeared.
Further, the step S6: analyzing the maximum matching coefficient from S5, updating the robust state of each tracked target in the global tracker to obtain tracking information, and establishing a tracking state for a target to be matched that is not matched with a tracked target, further includes:
b: when the maximum matching coefficient is greater than or equal to the threshold n, judging whether the tracked target is located in the actual tracking target set or the buffered tracking target set of the global tracker:
b1: if the tracked target is located in the actual tracking target set, updating the current position coordinates and the historical position coordinates of the tracked target;
b2: if the tracked target is located in the buffered tracking target set, judging whether the number of consecutive frames in which the tracked target has been in the continuous tracking state is greater than or equal to a threshold of K frames:
b21: if it is greater than or equal to the threshold K frames, updating the current position coordinates and the historical position coordinates of the tracked target, and moving the tracked target from the buffered tracking target set to the actual tracking target set;
b22: if it is less than the threshold K frames, updating the current position coordinates and the historical position coordinates of the tracked target, and incrementing by 1 the count of consecutive frames in the continuous tracking state.
Further, the step S6: analyzing the maximum matching coefficient from S5, updating the robust state of each tracked target in the global tracker to obtain tracking information, and establishing a tracking state for a target to be matched that is not matched with a tracked target, further includes:
c: if a target to be matched is not matched with any tracked target in the global tracker, establishing a new tracked target in the buffered tracking target set of the global tracker using the current position coordinate information of that target to be matched.
The invention has the beneficial effects that:
the detection efficiency of the tracked object under special conditions such as shielding, deformation, environmental change, rapid movement and the like can be effectively improved, meanwhile, after a soft-shift elastic offset updating algorithm and a tracked object robust state updating algorithm are added, the phenomena of false detection and missing detection are restrained, and good target matching and tracking capability is obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 shows a flowchart of a multi-target tracking method according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As shown in fig. 1, this embodiment shows an implementation of tracking a pedestrian target based on the present algorithm.
S1: establishing a pre-training target detection model, wherein the pre-training target detection model is used for detecting a tracking target in a video stream to be tracked.
A video stream is composed of a time sequence of images; one second of video can be decomposed into multiple images, and each decomposed image is called a frame image. In this embodiment, the video stream is recorded at 30 frames per second on the hardware platform, so the time to generate each frame image is about 1000 ms / 30 ≈ 33 ms. When the pre-training target detection model takes less than 33 ms to detect targets in each frame image, every generated frame can be processed in time, achieving a real-time effect.
First, a pedestrian annotation data set is collected, comprising a training set of 10,000 pictures and a test set of 1,000 pictures, with the human body annotated on each picture. A deep learning neural network target detection model is then built and trained on the training set; data augmentation methods such as mirroring, scaling, rotation and contrast enhancement are used during training to improve the generalization of the pre-training target detection model for pedestrian detection.
The lightweight target detection model designed for this embodiment, H-SSD (Half-Shot multibox Detector), takes the SSD detection framework as its basis and combines depthwise separable convolution with half-precision quantization. After training, it achieves an mAP (mean Average Precision, i.e., the average of the AP of each category) of 97.6% on the test set. Deployed on an NVIDIA Jetson TX2 hardware platform, its average detection time per frame image is about 22 ms, which is less than 33 ms, so the real-time requirement is satisfied. After obtaining the H-SSD pre-trained pedestrian detection model, the pedestrians in the video stream are tracked through the following steps.
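As an illustration of the real-time budget check described above, here is a minimal sketch that measures average per-frame detection latency against the roughly 33 ms budget. The `detect_pedestrians` callable and the OpenCV-based frame reading are assumptions made for illustration; the patent's H-SSD model itself is not published.

```python
import time

import cv2  # OpenCV is assumed here only as a convenient way to read the video stream

FRAME_BUDGET_MS = 1000.0 / 30  # ~33 ms per frame at 30 FPS

def average_detection_time(video_path, detect_pedestrians, max_frames=100):
    """Measure the average per-frame latency of a detector callable, in milliseconds."""
    cap = cv2.VideoCapture(video_path)
    times = []
    while len(times) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        start = time.perf_counter()
        detect_pedestrians(frame)  # hypothetical detector call; only the timing matters here
        times.append((time.perf_counter() - start) * 1000.0)
    cap.release()
    return sum(times) / len(times) if times else float("inf")

# Real-time operation is feasible when the measured average stays under the budget:
# average_detection_time("pedestrians.mp4", detect_pedestrians) < FRAME_BUDGET_MS
```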
S2: the global tracker is initialized.
Before the frames of the video stream are fed in, a global pedestrian tracker is initialized. The global pedestrian tracker contains an actual pedestrian tracking target set and a buffered pedestrian tracking target set; both sets are empty at initialization.
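A minimal sketch of such a global tracker, assuming one record per target; the class and field names (`GlobalTracker`, `TrackedTarget`, `actual`, `buffered`, `lost_frames`, `tracked_frames`) are illustrative choices, not terms from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class TrackedTarget:
    """Tracking information kept for one pedestrian target."""
    current_pos: tuple                           # (x1, y1, x2, y2) in the latest frame
    history: list = field(default_factory=list)  # historical position coordinates
    lost_frames: int = 0                         # accumulated frames in the disappeared state
    tracked_frames: int = 0                      # consecutive frames in the continuous tracking state

class GlobalTracker:
    """Global pedestrian tracker holding an actual set and a buffered set."""
    def __init__(self):
        self.actual = {}    # id -> TrackedTarget, confirmed tracking targets
        self.buffered = {}  # id -> TrackedTarget, newly detected, not yet confirmed
        self._next_id = 0

    def add_buffered(self, pos):
        """Create a new tracked target in the buffered set from a fresh detection."""
        self.buffered[self._next_id] = TrackedTarget(current_pos=pos, history=[pos])
        self._next_id += 1

# At initialization (step S2) both sets are empty:
# tracker = GlobalTracker()
```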
S3: and performing target detection on the current frame image in the video stream to be tracked by using the pre-training target detection model to obtain a target set to be matched and a corresponding position coordinate set detected in the current frame image.
Pedestrian detection is performed on the current frame image of the video stream to be tracked using the H-SSD pre-trained pedestrian detection model, yielding the set of all detected pedestrian targets in the current frame image, {x_i}, and the corresponding set of position coordinates, {Loc(x_i)}. Here {·} denotes a set: {x_i} is the set of all detected pedestrian targets, and x_i is the i-th detected pedestrian target in the set. Similarly, {Loc(x_i)} stores the position coordinates of all detected pedestrian targets, and Loc(x_i) is the position coordinates of x_i. In the following steps, {x_i} must be matched against the set of pedestrian targets with an established tracking state in the global tracker from step S2, so {x_i} is called the set of pedestrian targets to be matched.
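A hedged sketch of step S3, assuming the hypothetical `detect_pedestrians` callable returns (box, score) pairs with (x1, y1, x2, y2) boxes; the confidence filtering shown is an added assumption, not part of the patent text.

```python
def run_detection(frame, detect_pedestrians, min_score=0.5):
    """S3: detect pedestrians in the current frame and return {x_i} with {Loc(x_i)}.

    `detect_pedestrians` is assumed to return (box, score) pairs; the box format
    and the score threshold are illustrative assumptions only.
    """
    detections = detect_pedestrians(frame)
    loc_x = [tuple(box) for box, score in detections if score >= min_score]  # {Loc(x_i)}
    targets_to_match = list(range(len(loc_x)))                               # indices i of the x_i
    return targets_to_match, loc_x
```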
S4: and acquiring the actual position coordinates of each tracked target in the target set with the established tracking state in the global tracker in the previous frame image, and calculating the actual position coordinates by using a soft-shift elastic offset updating algorithm to obtain the corresponding predicted position coordinates in the current frame image.
The set of pedestrian targets with an established tracking state in the global tracker, {y_i}, is acquired, and the actual position coordinates Loc(y_i) of each tracked pedestrian target y_i in the previous frame image are stored in the tracked pedestrian target actual position coordinate set {Loc(y_i)}. Then, for each actual position coordinate Loc(y_i) in {Loc(y_i)}, the soft-shift elastic offset updating algorithm is used to compute the corresponding predicted position coordinate Loc(Sy_i) in the current frame image, which is stored in the tracked pedestrian target predicted position coordinate set {Loc(Sy_i)}. The formula of the soft-shift elastic offset updating algorithm is Loc(Sy_i) = Loc(y_i) + Loc(Δy_i), where Loc(Δy_i) is the estimated position offset of y_i, computed from the historical position coordinate information of y_i.
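A sketch of the soft-shift prediction Loc(Sy_i) = Loc(y_i) + Loc(Δy_i). The patent states only that the offset Loc(Δy_i) is computed from the target's historical position coordinates; estimating it as the mean frame-to-frame displacement over a small window is an assumption made here for illustration.

```python
def soft_shift_predict(history, window=3):
    """Predict Loc(Sy_i) from a target's position history (oldest first, newest last).

    Boxes are (x1, y1, x2, y2). The offset estimator below (mean recent displacement)
    is an illustrative assumption; the patent does not specify the exact formula.
    """
    current = history[-1]              # Loc(y_i): actual position in the previous frame
    if len(history) < 2:
        return current                 # no motion history: assume zero offset
    recent = history[-(window + 1):]
    deltas = [tuple(b - a for a, b in zip(p, q))
              for p, q in zip(recent[:-1], recent[1:])]
    mean_delta = tuple(sum(d[k] for d in deltas) / len(deltas) for k in range(4))
    return tuple(c + d for c, d in zip(current, mean_delta))  # Loc(y_i) + Loc(Δy_i)
```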
S5: and respectively performing cross calculation on the position coordinates of each target to be matched in the target set to be matched in the current frame image in the S3, the actual position coordinates of each tracked target in the target set with the established tracking state in the global tracker in the last frame image and the corresponding predicted position coordinates in the current frame image in the S4 to obtain the maximum matching coefficient of each target to be matched and each tracked target.
The pedestrian target position coordinate set to be matched, {Loc(x_i)}, is cross-computed element by element against both the tracked pedestrian target actual position coordinate set {Loc(y_i)} and the tracked pedestrian target predicted position coordinate set {Loc(Sy_i)} using the Jaccard coefficient. Through computation and comparison, the maximum matching coefficient Jmax(x_i, y_i) between each pedestrian target to be matched x_i and each tracked pedestrian target y_i is obtained as Jmax(x_i, y_i) = Max(J(Loc(x_i), Loc(Sy_i)), J(Loc(x_i), Loc(y_i))). Here J(·,·) is the Jaccard coefficient, which quantifies the degree of overlap between the positions of two objects on the image and is computed as J(A, B) = |A ∩ B| / |A ∪ B|, where |A ∩ B| is the intersection region of the positions of targets A and B, and |A ∪ B| is their union region. The Jaccard coefficient lies in the interval [0, 1]; the larger the value, the more the positions of the two objects overlap on the image.
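The Jaccard coefficient and the maximum matching coefficient of this step translate directly into code for axis-aligned bounding boxes; only the (x1, y1, x2, y2) box format is an assumption.

```python
def jaccard(box_a, box_b):
    """J(A, B) = |A ∩ B| / |A ∪ B| for two (x1, y1, x2, y2) boxes; result in [0, 1]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def max_matching_coefficient(loc_x, loc_y, loc_sy):
    """Jmax(x_i, y_i) = max(J(Loc(x_i), Loc(Sy_i)), J(Loc(x_i), Loc(y_i)))."""
    return max(jaccard(loc_x, loc_sy), jaccard(loc_x, loc_y))
```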
S6: and analyzing S5 the maximum matching coefficient, updating the robust state of each tracked target in the global tracker to obtain the tracking information of the tracked target in the current frame image, and establishing the tracking state of the target to be matched which is not matched with the tracked target.
Analyzing the maximum matching coefficient Jmax(x_i, y_i) calculated in S5, there are the following cases:
A: when Jmax(x_i, y_i) is smaller than the threshold n = 0.1, the tracked pedestrian target y_i is updated to the disappeared state, and it is then judged whether the accumulated number of frames in which y_i has disappeared is greater than or equal to the threshold L = 2. A1: if it is greater than or equal to the threshold L, the tracked pedestrian target y_i is deleted from the global pedestrian tracker; A2: if it is less than the threshold L, the count of accumulated disappeared frames of the tracked pedestrian target y_i is incremented by 1.
B: when Jmax(x_i, y_i) is greater than or equal to the threshold n = 0.1, the tracked pedestrian target y_i is matched with the corresponding pedestrian target to be matched x_i, and y_i is updated to the continuous tracking state. It is then judged whether y_i is located in the actual pedestrian tracking target set or the buffered pedestrian tracking target set of the global pedestrian tracker:
B1: if y_i is located in the actual pedestrian tracking target set, the current and historical position coordinates of y_i are updated.
B2: if y_i is located in the buffered pedestrian tracking target set, it is judged whether the number of consecutive frames in which y_i has been in the continuous tracking state is greater than or equal to the threshold K = 4 frames. B21: if it is greater than or equal to the threshold K frames, the current and historical position coordinates of y_i are updated and y_i is moved from the buffered pedestrian tracking target set to the actual pedestrian tracking target set; B22: if it is less than the threshold K frames, the current and historical position coordinates of y_i are updated and the count of consecutive frames in the continuous tracking state is incremented by 1.
C: if a pedestrian target to be matched x_i is not matched with any tracked pedestrian target y_i in the global pedestrian tracker, a new tracking target is created and added to the buffered pedestrian tracking target set using the current position coordinate information of x_i.
After the analysis of S6 is completed, each tracked pedestrian target in the global pedestrian tracker obtains its new position coordinates in the current frame image, and each newly detected pedestrian target establishes a tracking state.
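A sketch of the state-update rules A, B and C above, reusing the `GlobalTracker` and `TrackedTarget` records sketched earlier and the thresholds of this embodiment (n = 0.1, L = 2, K = 4). The control flow follows the stated rules; the data structures, and the resetting of the lost-frame count on a successful match, are illustrative assumptions.

```python
THRESH_N, THRESH_L, THRESH_K = 0.1, 2, 4   # thresholds n, L, K from this embodiment

def update_tracked_target(tracker, tid, j_max, matched_box):
    """Apply rules A and B to one tracked target given its maximum matching coefficient."""
    pool = tracker.actual if tid in tracker.actual else tracker.buffered
    target = pool[tid]
    if j_max < THRESH_N:                        # Rule A: target is in the disappeared state
        if target.lost_frames >= THRESH_L:      # A1: disappeared for too long, delete it
            del pool[tid]
        else:                                   # A2: count one more disappeared frame
            target.lost_frames += 1
        return
    # Rule B: matched, continuous tracking state
    target.history.append(target.current_pos)   # keep historical position coordinates
    target.current_pos = matched_box             # update current position coordinates
    target.lost_frames = 0                       # assumed: re-matching clears the lost count
    if tid in tracker.actual:                    # B1: already confirmed, nothing more to do
        return
    if target.tracked_frames >= THRESH_K:        # B21: promote from buffered set to actual set
        tracker.actual[tid] = tracker.buffered.pop(tid)
    else:                                        # B22: keep buffering, count one more tracked frame
        target.tracked_frames += 1

def add_unmatched_detection(tracker, box):
    """Rule C: an unmatched detection starts a new target in the buffered set."""
    tracker.add_buffered(box)
```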
S7: s3 through S6 are repeated until the video stream ends.
A pedestrian tracking test set consisting of 1,000 test videos is collected. The performance of this embodiment and of other methods on the test set is compared in the following table:
(Comparison table provided as an image in the original publication; the values are not reproduced in this text.)
Here, MOTA (Multi-Object Tracking Accuracy) represents accuracy after combining the missed-detection rate and the false-alarm rate, and MOTP (Multi-Object Tracking Precision) represents the average bounding-box overlap rate over all tracked targets; for both evaluation indexes, higher is better.
The experimental results show that, compared with traditional methods, the proposed method greatly improves the tracking effect. Compared with the deep-learning-based target tracking algorithm SORT, thanks to the introduction of the elastic optimization strategy (the soft-shift elastic offset updating algorithm and the tracker robust state updating algorithm), the method shows a clear improvement on both evaluation indexes.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element recited by the phrase "comprising a" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises that element.
Finally, it is to be noted that: the above description is only a preferred embodiment of the present invention, and is only used to illustrate the technical solutions of the present invention, and not to limit the protection scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A real-time multi-target tracking method based on a neural network is characterized by comprising the following steps:
s1: establishing a pre-training target detection model, wherein the pre-training target detection model is used for detecting a tracking target in a video stream to be tracked;
s2: initializing a global tracker;
s3: performing target detection on the video stream to be tracked by using the pre-training target detection model to obtain a target set to be matched and a corresponding position coordinate set;
s4: acquiring the actual position coordinates and the predicted position coordinates of each tracked target in a target set with an established tracking state in the global tracker;
s5: respectively carrying out cross calculation on the position coordinate of each target to be matched in the target set to be matched in S3, the actual position coordinate and the predicted position coordinate of each tracked target in S4 to obtain a Jaccard coefficient, and comparing to obtain a maximum matching coefficient of each target to be matched and each tracked target;
s6: analyzing the maximum matching coefficient in S5, updating the robust state of each tracked target in the global tracker to obtain tracking information, and establishing a tracking state of a target to be matched which is not matched with the tracked target;
s7: repeating S3-S6 until the video stream ends.
2. The real-time multi-target tracking method based on neural network as claimed in claim 1, wherein said S1: establishing a pre-training target detection model comprises: pre-training on the object type to be detected using a deep learning neural network target detection model to obtain the pre-training target detection model.
3. The real-time multi-target tracking method based on the neural network as claimed in claim 2, wherein the detection time of the target detection model is less than the time taken by the video stream to generate each frame of image.
4. The real-time multi-target tracking method based on neural network as claimed in claim 1, wherein in said S2 the global tracker includes: an actual tracking target set and a buffered tracking target set.
5. The real-time multi-target tracking method based on the neural network as claimed in claim 4, wherein the actual tracking target set and the buffered tracking target set are both empty sets when the global tracker is initialized.
6. The real-time multi-target tracking method based on neural network as claimed in claim 5, wherein the actual tracking target set and the buffered tracking target set store tracking information of the tracked targets, the tracking information comprising one or more of: the current position coordinates of the tracked target in the image, the historical position coordinates, the accumulated number of frames in the disappeared state, and the number of consecutive frames in the continuous tracking state.
7. The real-time multi-target tracking method based on neural network as claimed in claim 1, wherein the obtaining of the predicted position coordinates in S4 comprises: calculating the actual position coordinates using an elastic offset updating method to obtain the corresponding predicted position coordinates in the current frame image.
8. The real-time multi-target tracking method based on neural network as claimed in claim 1, wherein said S6: analyzing the maximum matching coefficient from S5, updating the robust state of each tracked target in the global tracker to obtain tracking information, and establishing a tracking state for a target to be matched that is not matched with a tracked target, includes:
a: when the maximum matching coefficient is smaller than a threshold n, updating the tracked target to the disappeared state, and then judging whether the accumulated number of frames in which the tracked target has disappeared is greater than or equal to a threshold L:
a1: if it is greater than or equal to the threshold L, deleting the tracked target from the global tracker;
a2: if it is less than the threshold L, incrementing by 1 the count of accumulated frames in which the tracked target has disappeared.
9. The real-time multi-target tracking method based on neural network as claimed in claim 1, wherein said S6: analyzing the maximum matching coefficient from S5, updating the robust state of each tracked target in the global tracker to obtain tracking information, and establishing a tracking state for a target to be matched that is not matched with a tracked target, further includes:
b: when the maximum matching coefficient is greater than or equal to the threshold n, judging whether the tracked target is located in the actual tracking target set or the buffered tracking target set of the global tracker:
b1: if the tracked target is located in the actual tracking target set, updating the current position coordinates and the historical position coordinates of the tracked target;
b2: if the tracked target is located in the buffered tracking target set, judging whether the number of consecutive frames in which the tracked target has been in the continuous tracking state is greater than or equal to a threshold of K frames:
b21: if it is greater than or equal to the threshold K frames, updating the current position coordinates and the historical position coordinates of the tracked target, and moving the tracked target from the buffered tracking target set to the actual tracking target set;
b22: if it is less than the threshold K frames, updating the current position coordinates and the historical position coordinates of the tracked target, and incrementing by 1 the count of consecutive frames in the continuous tracking state.
10. The real-time multi-target tracking method based on neural network as claimed in claim 1, wherein said S6: analyzing the maximum matching coefficient from S5, updating the robust state of each tracked target in the global tracker to obtain tracking information, and establishing a tracking state for a target to be matched that is not matched with a tracked target, further includes:
c: if a target to be matched is not matched with any tracked target in the global tracker, establishing a new tracked target in the buffered tracking target set of the global tracker using the current position coordinate information of that target to be matched.
CN201911236736.8A 2019-12-05 2019-12-05 Real-time multi-target tracking method based on neural network Active CN110910428B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911236736.8A CN110910428B (en) 2019-12-05 2019-12-05 Real-time multi-target tracking method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911236736.8A CN110910428B (en) 2019-12-05 2019-12-05 Real-time multi-target tracking method based on neural network

Publications (2)

Publication Number Publication Date
CN110910428A CN110910428A (en) 2020-03-24
CN110910428B true CN110910428B (en) 2022-04-01

Family

ID=69823202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911236736.8A Active CN110910428B (en) 2019-12-05 2019-12-05 Real-time multi-target tracking method based on neural network

Country Status (1)

Country Link
CN (1) CN110910428B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445501B (en) * 2020-03-25 2023-03-24 苏州科达科技股份有限公司 Multi-target tracking method, device and storage medium
CN114067564B (en) * 2021-11-15 2023-08-29 武汉理工大学 Traffic condition comprehensive monitoring method based on YOLO

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446942A (en) * 2018-10-12 2019-03-08 北京旷视科技有限公司 Method for tracking target, device and system
CN109784173A (en) * 2018-12-14 2019-05-21 合肥阿巴赛信息科技有限公司 A kind of shop guest's on-line tracking of single camera
CN110751674A (en) * 2018-07-24 2020-02-04 北京深鉴智能科技有限公司 Multi-target tracking method and corresponding video analysis system
CN111598066A (en) * 2020-07-24 2020-08-28 之江实验室 Helmet wearing identification method based on cascade prediction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11216954B2 (en) * 2018-04-18 2022-01-04 Tg-17, Inc. Systems and methods for real-time adjustment of neural networks for autonomous tracking and localization of moving subject

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110751674A (en) * 2018-07-24 2020-02-04 北京深鉴智能科技有限公司 Multi-target tracking method and corresponding video analysis system
CN109446942A (en) * 2018-10-12 2019-03-08 北京旷视科技有限公司 Method for tracking target, device and system
CN109784173A (en) * 2018-12-14 2019-05-21 合肥阿巴赛信息科技有限公司 A kind of shop guest's on-line tracking of single camera
CN111598066A (en) * 2020-07-24 2020-08-28 之江实验室 Helmet wearing identification method based on cascade prediction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Learning-Based Hand Motion Capture and Understanding in Assembly Process; Liang Liu et al.; IEEE Transactions on Industrial Electronics; 2018-12-07; full text *
Improved CamShift Target Tracking Based on a Kalman Predictor; Long Tao; China Master's Theses Full-text Database, Information Science and Technology; 2017-03-15; full text *

Also Published As

Publication number Publication date
CN110910428A (en) 2020-03-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant