CN111353448A - Pedestrian multi-target tracking method based on relevance clustering and space-time constraint - Google Patents

Pedestrian multi-target tracking method based on relevance clustering and space-time constraint

Info

Publication number
CN111353448A
CN111353448A (application CN202010148317.5A)
Authority
CN
China
Prior art keywords
pedestrian
matrix
space
frame
time constraint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010148317.5A
Other languages
Chinese (zh)
Inventor
李旻先
桑毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202010148317.5A priority Critical patent/CN111353448A/en
Publication of CN111353448A publication Critical patent/CN111353448A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian multi-target tracking method based on relevance clustering and space-time constraint, comprising: pedestrian visual feature extraction; pedestrian track association under a single camera based on relevance clustering of visual features; and pedestrian track matching across cameras using a space-time constraint method. To address the problem that pedestrian tracking under a single camera is easily interrupted, the invention introduces a spatio-temporal sliding window; meanwhile, a space-time constraint method is introduced to associate the same pedestrian across cameras, realizing multi-target pedestrian tracking in cross-camera scenes. The proposed method consistently improves tracking indexes such as MOTA, MOTP and recall rate in pedestrian tracking.

Description

Pedestrian multi-target tracking method based on relevance clustering and space-time constraint
Technical Field
The invention relates to the field of computer vision and pattern recognition, in particular to a pedestrian multi-target tracking method based on relevance clustering and space-time constraint.
Background
Computer Vision is the science of how to make machines "see": more specifically, the computer technology that uses cameras and computers, in place of the human eye, to identify, track and measure targets. Computer vision aims at creating artificial intelligence systems that can extract information from images or multidimensional data, and has therefore been a hotspot and a difficulty of research in computer science in recent years.
The multi-target tracking problem, as the basis of higher-level visual tasks, has become a crucial research problem in the field of computer vision. The main task of Multiple Object Tracking (MOT) is, given an image sequence, to find the moving objects in it, to associate the moving objects in different frames one to one, and then to output the motion trajectory of each object. These objects may be arbitrary, such as pedestrians, vehicles, athletes, various animals, etc., and the most studied case is pedestrian tracking. This is because, firstly, a pedestrian is a typical non-rigid target, which is harder to track than a rigid one, and secondly, pedestrian detection and tracking has greater commercial value in practical applications. According to incomplete statistics, at least 75% of multi-target tracking studies concern pedestrian tracking.
In addition, video surveillance is now deployed very widely, and effectively extracting pedestrian trajectory information from multiple surveillance feeds is of great value to public security systems. Cross-camera pedestrian tracking has therefore become an important research topic in computer vision. However, it faces problems that are hard to handle in practice, mainly in two respects. On the one hand, most existing tracking algorithms judge whether two adjacent pedestrian detection frames belong to the same pedestrian target by computing their overlapping area, but because surveillance video contains many objects, occlusion between them interrupts the tracking computation. On the other hand, for cross-camera pedestrian tracking, most current methods resemble the retrieval methods used in pedestrian re-identification: they rely on the appearance features of pedestrian targets but do not use the spatio-temporal information in the data set.
Disclosure of Invention
The invention aims to provide a pedestrian multi-target tracking method based on relevance clustering and space-time constraint, which is used for completing the tracking of pedestrians in different scenes.
The technical solution for realizing the purpose of the invention is as follows: a pedestrian multi-target tracking method based on relevance clustering and space-time constraint comprises the following steps:
1) decompressing the video into frames by inputting video streams and formulating a video data set according to the selected interval;
2) detecting each image of the video data set with a pedestrian detection algorithm to obtain pedestrian detection data; the detection data are the minimum matrix frame (bounding box) information containing each pedestrian;
3) cutting the video data set according to the matrix frame information of the pedestrian detection data to generate a pedestrian picture set, and formulating a training set and a test set according to a selected interval;
4) training a deep convolutional neural network for pedestrian re-identification by utilizing a training set, and outputting the trained deep convolutional neural network for extracting the visual characteristics of pedestrians, wherein the loss of the deep convolutional neural network is formed by triple loss; inputting the test set picture into a trained deep convolution neural network for visual feature extraction to obtain a pedestrian visual feature matrix;
5) calculating pedestrian appearance characteristic correlation and motion correlation according to the pedestrian visual characteristic matrix and pedestrian detection information, finishing correlation clustering, and realizing pedestrian track correlation under a single camera by using a correlation clustering result so as to finish pedestrian multi-target tracking under the single camera;
6) and according to the multi-target tracking result of the pedestrian under the single camera, correlating the pedestrian track under the cross camera by using a space-time constraint method so as to complete the multi-target tracking of the pedestrian under the cross camera.
Compared with the prior art, the invention has the following remarkable advantages: (1) the method of the relevance clustering can solve the problem of tracking interruption caused by the shielding of objects or pedestrians, is accurate and stable, and has obviously better performance in a test data set than other algorithms; (2) the pedestrian tracking under the cross-camera is completed by using a space-time constraint method, the characteristic information in the pedestrian detection data is used, and the space-time information of the data set is fully used, so that the pedestrian tracking result under the cross-camera is more accurate, and the evaluation index is higher than that of other algorithms. Similarly, the method is also suitable for pedestrian re-identification, and the accuracy of pedestrian retrieval is improved.
Drawings
FIG. 1 is a flow chart of a pedestrian multi-target tracking method based on relevance clustering and space-time constraint.
Fig. 2 is an exemplary cross-camera pedestrian tracking diagram.
Detailed Description
The invention provides a pedestrian multi-target tracking method based on relevance clustering and space-time constraint, consisting of three main parts: feature extraction using a deep convolutional neural network, relevance clustering to complete pedestrian multi-target tracking under a single camera, and a space-time constraint method to complete pedestrian multi-target tracking across cameras. With reference to FIG. 1, the method comprises the following steps:
1) decompressing the video into frames by inputting video streams and formulating a video data set according to the selected interval;
2) detecting each image of the video data set with a pedestrian detection algorithm to obtain pedestrian detection data; the detection data are the minimum matrix frame (bounding box) information containing each pedestrian;
3) cutting the video data set according to the matrix frame information of the pedestrian detection data to generate a pedestrian picture set, and formulating a training set and a test set according to a selected interval;
4) training a deep convolutional neural network for pedestrian re-identification by utilizing a training set, and outputting the trained deep convolutional neural network for extracting the visual characteristics of pedestrians, wherein the loss of the deep convolutional neural network is formed by triple loss; inputting the test set picture into a trained deep convolution neural network for visual feature extraction to obtain a pedestrian visual feature matrix;
5) calculating pedestrian appearance characteristic correlation and motion correlation according to the pedestrian visual characteristic matrix and pedestrian detection information, finishing correlation clustering, and realizing pedestrian track correlation under a single camera by using a correlation clustering result so as to finish pedestrian multi-target tracking under the single camera;
6) and according to the multi-target tracking result of the pedestrian under the single camera, correlating the pedestrian track under the cross camera by using a space-time constraint method so as to complete the multi-target tracking of the pedestrian under the cross camera.
Further, in step 1), the input video stream is decompressed into frames according to a frame rate of 60fps, and the decompressed pictures are named according to a specified naming rule to form a video data set.
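As a concrete illustration, the decompression and naming step might be sketched as below. The `cam<id>_<frame>.jpg` naming rule and the `sample_frames` helper are assumptions for illustration only; the patent states only that frames are named "according to a specified naming rule" without giving it.

```python
def frame_name(camera_id: int, frame_idx: int) -> str:
    """Hypothetical naming rule: cam<id>_<6-digit frame index>.jpg.
    The exact format is an assumption, not taken from the patent."""
    return f"cam{camera_id}_{frame_idx:06d}.jpg"

def sample_frames(total_frames: int, interval: int) -> list:
    """Indices of the frames kept when forming the video data set
    at a selected sampling interval."""
    return list(range(0, total_frames, interval))

print(frame_name(1, 42))     # cam1_000042.jpg
print(sample_frames(10, 3))  # [0, 3, 6, 9]
```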
Further, the specific method in step 2) is as follows: pedestrian detection is performed on the video data set with the pedestrian detection algorithm OpenPose to obtain pedestrian keypoint data. The keypoint information is then converted into the matrix frame (bounding box) information used in pedestrian detection, giving the required pedestrian detection information; the detection result is the coordinate information of the upper-left and lower-right corners of the rectangular frame.
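The keypoint-to-box conversion can be sketched as follows, assuming each pedestrian's keypoints are given as (x, y) pairs; real OpenPose output also carries a per-keypoint confidence score, which a practical version would threshold before taking the extremes.

```python
import numpy as np

def keypoints_to_box(keypoints):
    """Minimum bounding rectangle of a set of (x, y) keypoints:
    returns the upper-left and lower-right corner coordinates,
    matching the detection format described in the text."""
    pts = np.asarray(keypoints, dtype=float)
    x1, y1 = pts.min(axis=0)   # upper-left corner
    x2, y2 = pts.max(axis=0)   # lower-right corner
    return (float(x1), float(y1)), (float(x2), float(y2))

top_left, bottom_right = keypoints_to_box([(10, 20), (30, 5), (25, 40)])
print(top_left, bottom_right)  # (10.0, 5.0) (30.0, 40.0)
```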
Further, in the step 4), the deep convolutional neural network framework is ResNet50, and each parameter in the network is iteratively updated by using Adam in the adaptive learning rate gradient descent optimization algorithm until the parameter converges, so as to obtain a trained feature learning network.
The deep convolutional neural network uses a triplet loss function, and the expression of the triplet loss L is:

L = [ d_{a,p} - d_{a,n} + α ]_+

where d_{a,p} is the distance between the anchor and the positive sample, d_{a,n} is the distance between the anchor and the negative sample, α denotes a minimum margin between the two distances, and the subscript + indicates that the bracketed value is taken as the loss when it is greater than zero, and the loss is zero otherwise.
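The triplet loss can be sketched numerically as follows (a NumPy sketch for a single triplet; during training it would be applied batch-wise inside the network framework, and the margin value here is an illustrative assumption):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.3):
    """L = [d(a, p) - d(a, n) + alpha]_+ with Euclidean distances.
    alpha is the minimum margin; 0.3 is an illustrative choice."""
    d_ap = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(d_ap - d_an + alpha, 0.0)      # hinge: zero when the negative is far enough

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # positive close to the anchor
n = np.array([1.0, 0.0])   # negative far from the anchor
print(triplet_loss(a, p, n))  # 0.0: the margin is already satisfied
```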
According to the trained deep convolutional neural network model, pedestrian features are extracted from the test set, the appearance feature matrix of the pedestrians is obtained, and the feature correlation is calculated as:

W_a(i, j) = t_a - d(x_i, x_j)

where d(x_i, x_j) is the feature distance between the i-th and j-th pedestrians, calculated here using the Euclidean distance, and t_a is the average of the distances between positive and negative samples in the training set.
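A sketch of the feature correlation computation, assuming a threshold-minus-distance form W_a(i, j) = t_a - d(x_i, x_j); this form is consistent with the description of d(x_i, x_j) and t_a but is an assumption about the exact published formula:

```python
import numpy as np

def feature_correlation(features, t_a):
    """Pairwise appearance correlation under the assumed form
    W_a(i, j) = t_a - d(x_i, x_j), with d the Euclidean distance and
    t_a the mean positive/negative sample distance from the training set.
    Positive entries suggest the same pedestrian."""
    diff = features[:, None, :] - features[None, :, :]
    d = np.linalg.norm(diff, axis=-1)  # pairwise Euclidean distances
    return t_a - d

X = np.array([[0.0, 0.0],   # pedestrian 0
              [0.1, 0.0],   # pedestrian 1 (visually similar to 0)
              [2.0, 0.0]])  # pedestrian 2 (dissimilar)
W_a = feature_correlation(X, t_a=1.0)
```

A positive entry W_a[i, j] marks the pair as a candidate for the same identity; a negative entry rules it out.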
Further, in the step 5), a pedestrian motion correlation is calculated using a linear motion model according to the pedestrian detection data.
First, the per-frame offset of a pedestrian's matrix frame, which we call the velocity v, is calculated as:

v = (c_j^b - c_i^a) / (j - i),  0 < i < j < t_w

where t_w is the set window size, c_j^b is the center-point coordinate of the b-th matrix frame in the j-th frame, and c_i^a is the center-point coordinate of the a-th matrix frame in the i-th frame.
Secondly, the motion correlation error e_m is calculated; it represents the error between the predicted position of a nearby matrix frame, obtained from a matrix frame's velocity, and its actual position:

e_m = e_f + e_b

where e_f is the forward error and e_b is the backward error.

The forward error e_f is expressed as:

e_f(i, j) = c_i + v_i f_ij - c_j  (j > i)

where c_i is the center-point coordinate of the i-th matrix frame in a given frame, v_i is the velocity of the i-th matrix frame, f_ij is the gap between the frame numbers of the i-th and j-th matrix frames, and c_j is the center-point coordinate of the j-th matrix frame.

The backward error is expressed as:

e_b(i, j) = c_i - v_i f_ij - c_j  (j < i)

The motion correlation error is expressed as:

e_m = e_f + e_b
finally, a motion correlation W is calculatedmThe meaning of the method represents the similarity degree between two adjacent matrix boxes, and the formula is as follows:
wm=α(tm-em)
wherein, α, tmParameters set for the algorithm, emIs a motion dependent error.
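The linear motion model can be sketched as follows; the error terms are taken here as vector-norm magnitudes, and the parameter values passed to `motion_correlation` are illustrative assumptions:

```python
import numpy as np

def velocity(c_i, c_j, frame_i, frame_j):
    """Per-frame offset (velocity) of a box centre between frames i < j."""
    return (c_j - c_i) / (frame_j - frame_i)

def forward_error(c_i, v_i, f_ij, c_j):
    """|c_i + v_i * f_ij - c_j|: predict box j forward from box i."""
    return np.linalg.norm(c_i + v_i * f_ij - c_j)

def backward_error(c_i, v_i, f_ij, c_j):
    """|c_i - v_i * f_ij - c_j|: predict backward when j < i."""
    return np.linalg.norm(c_i - v_i * f_ij - c_j)

def motion_correlation(e_m, alpha=1.0, t_m=5.0):
    """w_m = alpha * (t_m - e_m); alpha and t_m are algorithm parameters
    (these defaults are illustrative, not from the patent)."""
    return alpha * (t_m - e_m)

# A box moving one pixel per frame along x:
v = velocity(np.array([0.0, 0.0]), np.array([3.0, 0.0]), 0, 3)
e_f = forward_error(np.array([0.0, 0.0]), v, 3, np.array([3.0, 0.0]))
print(motion_correlation(e_f))  # 5.0: zero prediction error, maximal correlation
```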
Further, the relevance clustering is determined by combining the feature correlation and the motion correlation:

W = (W_a + W_m) ⊙ D

where W is the correlation matrix, W_a the feature correlation matrix, W_m the motion correlation matrix, D the discount matrix, and ⊙ the element-wise (Hadamard) product. The discount matrix is:

D = e^{-Δt} ∈ [0, 1]

where Δt indicates the degree of attenuation; as Δt increases, the correlation tends to 0.

According to the correlation matrix, a graph G = (V, E, W) is constructed, where V is the node set (in this patent, each matrix frame is regarded as a node), E is the edge set (an edge is constructed between two positively correlated matrix frames), and W is the edge weight, i.e., the correlation value. After the graph is constructed, its connected components are computed, and all nodes within one connected component are assigned the same ID: they are the same pedestrian.
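The ID-assignment step can be sketched with a small union-find: each matrix frame (detection box) is a node, positively correlated pairs are joined, and each connected component receives one pedestrian ID. This is a minimal stand-in for the graph construction described in the text.

```python
def assign_ids(n_boxes, correlations):
    """Union-find over detection boxes: edges exist only between
    positively correlated pairs; each connected component gets one ID."""
    parent = list(range(n_boxes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, w in correlations:
        if w > 0:                          # edge only for positive correlation
            parent[find(i)] = find(j)

    roots, ids = {}, []
    for x in range(n_boxes):
        ids.append(roots.setdefault(find(x), len(roots)))  # compact IDs 0, 1, ...
    return ids

print(assign_ids(4, [(0, 1, 0.9), (1, 2, 0.4), (2, 3, -0.7)]))  # [0, 0, 0, 1]
```

Boxes 0, 1 and 2 are chained by positive correlations and share one ID; box 3 is only negatively correlated and keeps its own.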
Further, the short and long tracks of pedestrians under a single camera are generated by setting the window size. Short tracks are produced by applying the relevance clustering algorithm within a window of m_1 frames. Long tracks are built on the short tracks: the window size is set to m_2 frames with an overlap of m_3 frames between adjacent windows, and relevance clustering is again applied within each window.
Table 1 shows the index results of the relevance clustering algorithm in 8 single-camera scenes, where MOTA is the multi-target tracking accuracy, and IDF1, IDP and IDR respectively represent the F value, precision and recall of pedestrian ID prediction. IDS represents the number of identity switches caused by interaction or interruption, and FM represents the number of fragmentations of the tracking trajectory.
TABLE 1 Index results of the relevance clustering algorithm in 8 single-camera scenes
(The table data appear as images in the original publication and are not reproduced here.)
According to Table 1, the MOTA values of most of the 8 cameras exceed 80 (camera 4, which has few pedestrians, even exceeds 90), the three ID-prediction values exceed 80, and IDS and FM are only around 100 even though the number of tracks exceeds 10,000. Taken together, the relevance clustering algorithm achieves very high accuracy and tracking efficiency in single-camera scenes.
Further, the cross-camera pedestrian tracking algorithm is implemented in the following steps:
1) On the basis of the pedestrian tracks under a single camera, n images are selected from each pedestrian track as representatives and put into the deep convolutional neural network to extract features, and their mean is taken as the pedestrian's appearance feature. Then each pedestrian is selected in turn as the query target, and the tracks of the remaining pedestrians form the pedestrian pool.
2) For each query target, the distances to the other targets in the pedestrian pool are calculated using the cosine distance and ranked.
3) The ranked screening result is then corrected using the space-time constraint method.
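Steps 1) and 2) above can be sketched as follows, assuming each track is represented by the mean of its per-image features:

```python
import numpy as np

def track_feature(image_features):
    """Mean of the n representative per-image features:
    the track's appearance feature."""
    return np.mean(image_features, axis=0)

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_pool(query, pool):
    """Indices of pool tracks sorted by cosine distance to the query."""
    d = [cosine_distance(query, p) for p in pool]
    return sorted(range(len(pool)), key=lambda k: d[k])

query = track_feature(np.array([[1.0, 0.0], [1.0, 0.1]]))
pool = [np.array([0.9, 0.1]), np.array([0.0, 1.0]), np.array([1.0, 0.05])]
print(rank_pool(query, pool))  # [2, 0, 1]: pool track 2 is the closest match
```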
Further, for the temporal relation in the space-time constraint, experiments show that pedestrian movement between different areas in the training set is concentrated within a certain time period. For this case we propose the concept of spatio-temporal similarity and use a spatio-temporal model to represent the spatio-temporal similarity of pedestrians.
Further, for the spatio-temporal model, a time difference variable t̂ is set up to represent the time gap between two new pictures, and maximum likelihood estimation is then used: within the m_4 frames before and after t̂, the probability P of the target appearing is computed. P is calculated over the training set, and the value of t̂ that maximizes P is taken.
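A histogram-style stand-in for the maximum likelihood estimate just described: over the training set, pick the time difference whose surrounding window of m_4 frames captures the largest fraction of observed cross-camera reappearance gaps. The function and its inputs are illustrative assumptions about the estimation procedure.

```python
def estimate_transition_time(observed_gaps, m4):
    """Choose t_hat maximizing P = fraction of observed reappearance
    gaps (in frames) within [t_hat - m4, t_hat + m4]."""
    def prob(t):
        return sum(1 for g in observed_gaps if abs(g - t) <= m4) / len(observed_gaps)
    return max(sorted(set(observed_gaps)), key=prob)

# Gaps between a pedestrian leaving camera A and appearing in camera B:
print(estimate_transition_time([100, 105, 110, 400], m4=5))  # 105
```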
Further, from the screening result of the query target, the same target pedestrian in the scenes connected with the query target can be obtained according to the spatio-temporal model and the query target's spatio-temporal information. This result is then taken as a new query target, and target pedestrians connected to it continue to be obtained under the space-time constraint until the complete track of the pedestrian under different cameras is obtained; the tracking effect is shown in the cross-camera tracking example of FIG. 2.
Table 2 shows the index results of the space-time constraint algorithm in the cross-camera scene; in cross-camera pedestrian multi-target tracking, the ID-related indexes are the most important.
TABLE 2 index results of space-time constraint algorithm in cross-camera scene
IDF1 IDP IDR
75.4 76.0 74.8
According to Table 2, IDF1, IDP and IDR all score around 75, indicating that the space-time constraint algorithm achieves good pedestrian tracking performance in the cross-camera scene.
In summary, the invention discloses a pedestrian multi-target tracking method based on relevance clustering and space-time constraint. It starts from pedestrian multi-target tracking under a single camera, completes the matching of pedestrian tracks under a single camera through relevance clustering, and on that basis completes cross-camera pedestrian multi-target tracking using the space-time constraint method. Combining the appearance and motion features of pedestrians improves the tracking effect under a single camera and raises the accuracy of pedestrian matching in the cross-camera scene.
Although the present invention has been described with reference to the preferred embodiments, it is not intended to be limited thereto. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention should be determined by the appended claims.

Claims (9)

1. A pedestrian multi-target tracking method based on relevance clustering and space-time constraint is characterized by comprising the following steps:
1) decompressing the video into frames by inputting video streams and formulating a video data set according to the selected interval;
2) detecting each image of the video data set with a pedestrian detection algorithm to obtain pedestrian detection data; the detection data are the minimum matrix frame (bounding box) information containing each pedestrian;
3) cutting the video data set according to the matrix frame information of the pedestrian detection data to generate a pedestrian picture set, and formulating a training set and a test set according to a selected interval;
4) training a deep convolutional neural network for pedestrian re-identification by utilizing a training set, and outputting the trained deep convolutional neural network for extracting the visual characteristics of pedestrians, wherein the loss of the deep convolutional neural network is formed by triple loss; inputting the test set picture into a trained deep convolution neural network for visual feature extraction to obtain a pedestrian visual feature matrix;
5) calculating pedestrian appearance characteristic correlation and motion correlation according to the pedestrian visual characteristic matrix and pedestrian detection information, finishing correlation clustering, and realizing pedestrian track correlation under a single camera by using a correlation clustering result so as to finish pedestrian multi-target tracking under the single camera;
6) and according to the multi-target tracking result of the pedestrian under the single camera, correlating the pedestrian track under the cross camera by using a space-time constraint method so as to complete the multi-target tracking of the pedestrian under the cross camera.
2. The pedestrian multi-target tracking method based on relevance clustering and spatio-temporal constraints as claimed in claim 1, wherein in step 1), the input video stream is decompressed into pictures at a frame rate of 60fps, and the decompressed pictures are named according to a specified naming rule to form a video data set.
3. The pedestrian multi-target tracking method based on relevance clustering and space-time constraint according to claim 1, wherein the step 2) specifically comprises the following steps:
21) carrying out pedestrian detection on the video data set based on the pedestrian detection algorithm OpenPose to obtain pedestrian keypoint data;
22) after the key point data is obtained, converting the key point data into matrix frame information containing pedestrians, and thus obtaining required pedestrian detection data; the detection information is specifically coordinate information of the upper left corner and the lower right corner of the rectangular frame.
4. The pedestrian multi-target tracking method based on relevance clustering and space-time constraint according to claim 1, characterized in that in step 4), the deep convolutional neural network framework is ResNet50, and each parameter in the network is iteratively updated by using Adam in an adaptive learning rate gradient descent optimization algorithm until the parameters converge, so as to obtain a trained feature learning network; the method specifically comprises the following steps:
41) the deep convolutional neural network uses a triplet loss function, and the expression of the triplet loss L is:

L = [ d_{a,p} - d_{a,n} + α ]_+

where d_{a,p} is the distance between the anchor and the positive sample, d_{a,n} is the distance between the anchor and the negative sample, α denotes a minimum margin between the two distances, and the subscript + indicates that the bracketed value is taken as the final value of the loss function when it is greater than zero, and the loss is zero otherwise;
42) according to the deep convolutional neural network model, pedestrian appearance features are extracted using the test set, the appearance feature matrix of the pedestrians is obtained, and the feature correlation is calculated as:

W_a(i, j) = t_a - d(x_i, x_j)

where d(x_i, x_j) is the feature distance between the i-th and j-th pedestrians, calculated using the Euclidean distance, and t_a is the average of the distances between positive and negative samples in the training set.
5. The pedestrian multi-target tracking method based on relevance clustering and space-time constraint according to claim 4, wherein in the step 5), the pedestrian motion relevance is calculated by using a linear motion model according to the pedestrian detection data, and the method specifically comprises the following steps:
51) calculating the per-frame offset of the pedestrian matrix frame, called the velocity v, as:

v = (c_j^b - c_i^a) / (j - i),  0 < i < j < t_w

where t_w is the set window size, c_j^b is the center-point coordinate of the b-th matrix frame in the j-th frame, and c_i^a is the center-point coordinate of the a-th matrix frame in the i-th frame;
52) calculating the motion correlation error e_m, which represents the error between the predicted position of a nearby matrix frame, obtained from a matrix frame's velocity, and its actual position:

e_m = e_f + e_b

where e_f is the forward error and e_b is the backward error;

the forward error e_f is expressed as:

e_f(i, j) = c_i + v_i f_ij - c_j  (j > i)

where c_i is the center-point coordinate of the i-th matrix frame in a given frame, v_i is the velocity of the i-th matrix frame, f_ij is the gap between the frame numbers of the i-th and j-th matrix frames, and c_j is the center-point coordinate of the j-th matrix frame;

the backward error is expressed as:

e_b(i, j) = c_i - v_i f_ij - c_j  (j < i)

the motion correlation error is expressed as:

e_m = e_f + e_b

53) calculating the motion correlation w_m, which represents the degree of similarity between two adjacent matrix frames:

w_m = α(t_m - e_m)

where α and t_m are parameters set by the algorithm and e_m is the motion correlation error.
6. The pedestrian multi-target tracking method based on relevance clustering and space-time constraint according to claim 1 or 5, wherein the relevance clustering is determined by feature relevance and motion relevance, and the formula is as follows:
W = (W_a + W_m) ⊙ D

where W is the correlation matrix, W_a the feature correlation matrix, W_m the motion correlation matrix, D the discount matrix, and ⊙ the element-wise (Hadamard) product; the discount matrix is:

D = e^{-Δt} ∈ [0, 1]

where Δt denotes the degree of attenuation;

a graph G = (V, E, W) is constructed from the correlation matrix, where V is the node set (each matrix frame is regarded as a node), E is the edge set (an edge is constructed between two positively correlated matrix frames), and W is the edge weight, i.e., the correlation value; after the graph is constructed, its connected components are computed, and all nodes within one connected component are assigned the same ID: they are the same pedestrian.
7. The pedestrian multi-target tracking method based on relevance clustering and space-time constraint according to claim 1, characterized in that the short and long tracks of pedestrians under a single camera are generated by setting the window size, specifically: short tracks are produced by applying the relevance clustering algorithm within a window of m_1 frames; long tracks are built on the short tracks, with the window size set to m_2 frames and an overlap of m_3 frames between adjacent windows, and relevance clustering is again applied within each window.
8. The pedestrian multi-target tracking method based on relevance clustering and space-time constraint according to claim 1, wherein the step 6) specifically comprises the following steps:
61) on the basis of the pedestrian tracks under a single camera, selecting n images from each pedestrian track as representatives, feeding them into a deep convolutional neural network to extract features, and taking the mean of the features as the pedestrian's appearance feature; then taking each pedestrian in turn as the query target, with the tracks of the remaining pedestrians forming the pedestrian pool;
62) for each query target, computing its distance to every other target in the pedestrian pool using the cosine distance, and ranking the results;
63) refining the ranked screening result using the space-time constraint method.
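Steps 61) and 62) can be sketched as follows, assuming per-frame CNN features are already available; the helper names are hypothetical:

```python
import numpy as np

def track_feature(frame_feats, n=4):
    """Pick n evenly spaced representative frames of a track, average
    their CNN features, and L2-normalise the mean (appearance feature)."""
    feats = np.asarray(frame_feats, dtype=float)
    idx = np.linspace(0, len(feats) - 1, num=min(n, len(feats)), dtype=int)
    f = feats[idx].mean(axis=0)
    return f / np.linalg.norm(f)

def rank_gallery(query, gallery):
    """Rank gallery tracks by cosine distance to the query (ascending).
    Features are L2-normalised, so 1 - dot product = cosine distance."""
    dists = [1.0 - float(query @ g) for g in gallery]
    return np.argsort(dists)
```

Normalising once up front means the ranking step needs only dot products, which is the usual trick for cosine-distance retrieval.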
9. The pedestrian multi-target tracking method based on relevance clustering and space-time constraint according to claim 8, wherein the space-time constraint method specifically comprises:
for the temporal relation in the space-time constraint, experiments on the training set show that the movements of pedestrians between different regions are concentrated within a time period, so a spatio-temporal model is used to represent the spatio-temporal similarity of pedestrians;
for the spatio-temporal model, a time-difference variable τ is set up to represent the time difference between two images, and maximum likelihood estimation is then used: the probability that the target appears in the earlier and later frames at τ is calculated and denoted P; P is computed over the training set, and the value of τ at which P is maximal is obtained;
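One way to realise the maximum-likelihood step is a histogram estimate over the observed transfer times of matched training pairs; this is a sketch under that assumption, with hypothetical names:

```python
import numpy as np

def fit_transition_time(deltas, bin_width=1.0):
    """Histogram-based maximum-likelihood estimate of the most probable
    transfer time between two connected camera regions.

    deltas: observed time gaps between a pedestrian leaving one region
    and appearing in the connected one (training set).
    Returns (bin centre with the highest empirical probability, that P).
    """
    deltas = np.asarray(deltas, dtype=float)
    bins = np.arange(deltas.min(), deltas.max() + bin_width, bin_width)
    counts, edges = np.histogram(deltas, bins=bins)
    P = counts / counts.sum()
    k = int(np.argmax(P))
    return 0.5 * (edges[k] + edges[k + 1]), float(P[k])
```

At query time, a candidate whose observed time gap falls far from the high-probability bins can be rejected, which is the "correction" of step 63).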
for the screening result of the query target, the same target pedestrian in the scenes connected to the query target is obtained from the spatio-temporal model and the spatio-temporal information of the query target; this result is then taken as the new query target, and the connected target pedestrian is obtained again under the space-time constraint, until the complete tracks of the pedestrian under the different cameras are obtained;
and outputting the cross-camera pedestrian tracking result.
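The query-and-chain loop above can be sketched as follows; `match_next` stands in for the spatio-temporally constrained re-identification step (all names hypothetical):

```python
def chain_tracks(start_id, match_next):
    """Follow space-time-constrained matches camera by camera.

    match_next(track_id) -> the matched track id in a connected camera,
    or None when no candidate passes the spatio-temporal model.
    Stops on None or on a repeat, so the loop always terminates.
    """
    chain, seen = [start_id], {start_id}
    nxt = match_next(start_id)
    while nxt is not None and nxt not in seen:
        chain.append(nxt)
        seen.add(nxt)
        nxt = match_next(nxt)
    return chain
```

The `seen` set guards against a pair of cameras matching each other back and forth forever.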
CN202010148317.5A 2020-03-05 2020-03-05 Pedestrian multi-target tracking method based on relevance clustering and space-time constraint Withdrawn CN111353448A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010148317.5A CN111353448A (en) 2020-03-05 2020-03-05 Pedestrian multi-target tracking method based on relevance clustering and space-time constraint

Publications (1)

Publication Number Publication Date
CN111353448A true CN111353448A (en) 2020-06-30

Family

ID=71196034


Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380461A (en) * 2020-11-20 2021-02-19 华南理工大学 Pedestrian retrieval method based on GPS track
CN114372996A (en) * 2021-12-02 2022-04-19 北京航空航天大学 Pedestrian track generation method oriented to indoor scene
CN114530043A (en) * 2022-03-03 2022-05-24 上海闪马智能科技有限公司 Event detection method and device, storage medium and electronic device
CN114820688A (en) * 2021-01-21 2022-07-29 四川大学 Public space social distance measuring and analyzing method based on space-time trajectory
CN117495913A (en) * 2023-12-28 2024-02-02 中电科新型智慧城市研究院有限公司 Cross-space-time correlation method and device for night target track
CN117576146A (en) * 2023-11-09 2024-02-20 中国矿业大学(北京) Method and system for restoring inter-view pedestrian track of multi-path camera in building
CN113627497B (en) * 2021-07-27 2024-03-12 武汉大学 Space-time constraint-based cross-camera pedestrian track matching method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200630