CN112270310A - Cross-camera pedestrian multi-target tracking method and device based on deep learning - Google Patents


Info

Publication number
CN112270310A
CN112270310A (application CN202011333900.XA)
Authority
CN
China
Prior art keywords
pedestrian
tracking
module
video
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011333900.XA
Other languages
Chinese (zh)
Inventor
张旭
徐振国
丁亚男
汤健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai University of Engineering Science
Original Assignee
Shanghai University of Engineering Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai University of Engineering Science filed Critical Shanghai University of Engineering Science
Priority to CN202011333900.XA priority Critical patent/CN112270310A/en
Publication of CN112270310A publication Critical patent/CN112270310A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Abstract

The invention relates to a method and a device for cross-camera pedestrian multi-target tracking based on deep learning. The method comprises the following steps: acquiring video data from multiple data sources; inputting the video frames into a trained encoder-decoder deep network that performs pedestrian detection and pedestrian apparent feature extraction simultaneously, detecting each pedestrian as a center point and outputting the pedestrian apparent features at that center point; performing correlation matching on all pedestrian detection frames based on all pedestrian apparent features to obtain the tracking trajectory of each pedestrian; selecting high-quality frames from the tracking trajectory of each pedestrian across the multiple videos; extracting the face features and gait features of each pedestrian with trained deep neural networks; fusing or matching the multiple features; and finally performing cross-camera trajectory association clustering. The method and the device can be applied directly to an old monitoring system, have good robustness, and can realize accurate and efficient cross-camera pedestrian tracking.

Description

Cross-camera pedestrian multi-target tracking method and device based on deep learning
Technical Field
The invention relates to the technical field of computer vision, in particular to a method and a device for multi-target tracking of pedestrians across cameras based on deep learning.
Background
With advances in computing performance and the development of deep learning, many computer vision technologies are widely applied in daily life. In a traditional video monitoring system, determining whether a specified target appears, and where and when it appears, requires substantial human labor, so such systems mainly serve to obtain evidence after the fact. Researchers and developers in related fields have integrated technologies from multiple disciplines, such as computer vision, image processing, pattern recognition, and artificial intelligence, to develop intelligent video monitoring systems that can automatically recognize and track pedestrians in a video sequence by leveraging the powerful data processing capacity of computers.
In existing pedestrian multi-target tracking methods, pedestrians are first detected in a video sequence, and detection frames belonging to the same pedestrian are then associated across video frames to obtain the pedestrian's tracking trajectory. The core of pedestrian tracking is to improve the correctness of detection-frame matching. Conventional pedestrian tracking methods typically use hand-crafted features such as color histograms, histograms of oriented gradients (HOG), and local binary patterns (LBP). However, these hand-crafted features are complicated to design, cannot distinguish different people well, and have great limitations. In recent years, deep learning has made good progress on many tasks in the field of computer vision; because of its strong feature-expression capability, it can describe image features well, and a large number of deep-learning-based target detection and tracking methods have been developed.
Current deep-learning-based target detection methods can be divided into two categories: candidate-region-based R-CNN (Region-CNN) series methods, such as R-CNN, Fast R-CNN, Faster R-CNN, and R-FCN; and regression-based detection methods, such as YOLO and SSD.
At present, pedestrian multi-target tracking follows two main ideas: detection-independent tracking and detection-based tracking. Detection-independent tracking methods require manually tagging a certain number of objects of interest in the first frame and then tracking the tagged objects in subsequent video frames, but they cannot handle the appearance of new targets or automatically terminate lost targets. By comparison, detection-based tracking is currently the mainstream solution to the MOT problem: it automatically extracts a series of detected bounding boxes for each target in a video and then assigns the same ID to detections of the same target according to the structure of the video sequence.
In practical application, enterprises or units that need an intelligent monitoring system could simply replace their old monitoring system with new intelligent monitoring equipment, but such replacement requires substantial extra funds, and discarding the old equipment causes unnecessary waste of resources. However, video frames in old surveillance video suffer from low resolution, obvious illumination change, severe occlusion, and similar problems, and mitigating the influence of these defects is the key to solving the pedestrian tracking problem.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a method and a device for cross-camera pedestrian multi-target tracking based on deep learning.
The purpose of the invention can be realized by the following technical scheme:
a single-camera pedestrian multi-target tracking method based on deep learning comprises the following steps:
Step 1: acquiring video stream data with a video acquisition module;
Step 2: inputting the video frames acquired by the video acquisition module into a trained encoder-decoder deep network, performing parallel multi-scale feature extraction, and producing outputs for two tasks, pedestrian detection and pedestrian apparent feature extraction, simultaneously at the network output;
Step 3: for the pedestrian detection part, outputting a candidate target center-point heat map, the size of each pedestrian detection frame, and the offset of each pedestrian center relative to its actual position in the original image;
Step 4: for the pedestrian apparent feature extraction part, outputting the pedestrian apparent features at the center points of all candidate targets;
Step 5: associating all pedestrian detection frames based on the pedestrian apparent features within them to obtain the tracking trajectories of all pedestrians.
Further, step 2 comprises: inputting the obtained video frames into a trained encoder-decoder network for parallel multi-scale feature extraction and fusion.
Specifically, to learn richer features, low-level and high-level features need to be merged in the network so that targets of different sizes appearing in the video can be recognized.
Specifically, the invention adopts a trained encoder-decoder network similar to U-NET to fuse different feature maps, addressing the scale-variation problem that frequently occurs in multi-target tracking.
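As a rough illustration of the multi-scale wiring described above (not the patent's actual network), the sketch below shows how a decoder path can fuse high-resolution low-level features back in via a skip connection; convolutions and learned weights are omitted, so only the fuse-by-addition structure is shown:

```python
import numpy as np

def downsample(x):
    """2x2 average pooling: halves resolution, standing in for an encoder stage."""
    h, w = x.shape
    return x[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour 2x upsampling, standing in for a decoder stage."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def encode_decode(x):
    """U-NET-like fusion skeleton (illustrative only): low-level,
    high-resolution features are merged back in on the decoder path
    so that targets of different sizes stay visible in the final map."""
    e1 = x                 # high-resolution, low-level feature map
    e2 = downsample(e1)    # low-resolution, higher-level feature map
    d1 = upsample(e2)      # decoder path back to full resolution
    return d1 + e1         # skip connection fuses the two scales
```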
Further, step 3 comprises:
outputting a candidate target center-point heat map for estimating the positions of target centers, with the response decaying exponentially with the distance between a heat-map position and an object center;
outputting the offset of each pedestrian center relative to its actual position in the original image, estimating a continuous offset for each center to reduce the quantization error introduced by down-sampling;
outputting the size of each pedestrian detection frame, estimating the height and width of the detection frame at each position, mainly to improve detection precision.
Specifically, current pedestrian tracking methods use anchor-based target detection and extract pedestrian apparent features within the anchor region; because the anchor deviates from the pedestrian's actual area and the features of a coarse anchor are not aligned with the pedestrian center, network training may suffer severe ambiguity.
The invention uses a target detection method that does not depend on anchor frames: target detection is treated as target center-point detection, and pedestrian features are output at the center point. Eliminating the anchor frame alleviates the ambiguity problem, and using a high-resolution feature map allows the pedestrian features to align better with the target center.
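The center-point representation described above can be sketched as follows. This is a hedged, CenterNet-style illustration; the Gaussian decay parameter `sigma` and the peak threshold are illustrative assumptions, not values from the patent:

```python
import numpy as np

def make_center_heatmap(centers, shape, sigma=2.0):
    """Render a target heat map whose response decays with distance
    from each annotated pedestrian center (Gaussian decay, as in
    anchor-free center-point detectors)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for cx, cy in centers:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heat = np.maximum(heat, g)  # overlapping people keep the stronger response
    return heat

def decode_peaks(heat, threshold=0.5):
    """Recover candidate center points as local maxima above a threshold
    (border pixels are skipped for simplicity)."""
    peaks = []
    h, w = heat.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = heat[y, x]
            if v >= threshold and v == heat[y - 1:y + 2, x - 1:x + 2].max():
                peaks.append((x, y))
    return peaks
```

In a trained network the heat map is predicted by a head on the fused feature map; here it is rendered directly from ground-truth centers to show the encoding and decoding.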
Further, step 4 comprises:
learning the apparent features of pedestrians as feature vectors of lower dimension. Conventional appearance-feature extraction generally learns high-dimensional features; low-dimensional features require fewer training samples, and learning them reduces the risk of over-fitting on small data and improves tracking robustness.
Further, step 5 comprises:
comparing the feature vectors of pairs of pedestrian detection frames, associating the detection frames that belong to the same person, and finally connecting them into a pedestrian trajectory.
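The feature-vector comparison and association might be sketched as below. The cosine-distance metric, the Hungarian one-to-one assignment, and the `max_cost` gate are illustrative assumptions standing in for whichever matching scheme the patent's implementation uses:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats, det_feats, max_cost=0.4):
    """Match current detections to existing tracks by comparing
    appearance feature vectors (cosine distance), one-to-one via the
    Hungarian algorithm; unmatched detections start new tracks."""
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    cost = 1.0 - t @ d.T                      # cosine distance matrix
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    matched_dets = {c for _, c in matches}
    new_dets = [c for c in range(len(det_feats)) if c not in matched_dets]
    return matches, new_dets
```

Running this per frame and chaining the matches yields the per-pedestrian trajectories described in step 5.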
The single-camera pedestrian multi-target tracking method based on deep learning further comprises the following steps: the pedestrian trajectories are counted to determine the number of pedestrians in the video.
The invention provides a cross-camera multi-target pedestrian tracking method based on deep learning, comprising the following steps:
Step 1: acquiring multiple video stream data collected by multiple cameras;
Step 2: processing the multiple videos respectively with the above deep-learning-based single-camera multi-target pedestrian tracking method to obtain the tracking trajectories of all pedestrians in the multiple video streams;
Step 3: for the tracking trajectory of each pedestrian in the multiple videos, selecting high-quality frames so as to compute the apparent pedestrian features of the whole trajectory;
Step 4: for the selected video frames, extracting the face features and the gait features of the pedestrian with trained deep neural networks, then executing either step 5 or step 6;
Step 5: performing heterogeneous feature fusion of the pedestrian apparent features extracted under a single camera with the face and gait features extracted in the previous step, then proceeding directly to step 7;
Step 6: matching the pedestrian apparent features extracted from the single-camera video of the same pedestrian against the face and gait features extracted in step 4, then proceeding directly to step 7;
Step 7: performing cross-camera trajectory association clustering according to the pedestrian features fused in step 5, or cross-camera trajectory hierarchical clustering according to the multiple pedestrian features matched in step 6.
Further, step 3 comprises:
first preliminarily screening high-quality video frames using the apparent features and trajectory-association results from step 2, then screening further with a video-frame quality prediction network;
specifically, the preliminary screening selects, for each pedestrian trajectory, the frames with high correlation for clustering and removes wrongly matched video frames;
specifically, the video-frame quality prediction network is a binary classification network whose training depends on a self-labeled data set;
more specifically, the preliminary screening and clustering of highly correlated frames relies on the fact that a low-quality frame's feature lies at a large feature distance from the features of the other frames;
more specifically, in the self-labeled data set the negative samples come from two sources: one part is the result of the preliminary screening, and the other part is generated by processing normal pictures with digital image processing to produce blurred, noisy samples; testing is carried out with a test set together with manual inspection.
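The preliminary screening rule above can be sketched as follows: frames whose features sit far from the rest of the track are treated as low quality. The z-score cutoff is an illustrative assumption; the patent only specifies that low-quality frames have large feature distances:

```python
import numpy as np

def screen_frames(track_feats, z_thresh=1.5):
    """Preliminary high-quality frame screening: a low-quality frame's
    feature lies at a large distance from the features of the track's
    other frames, so drop frames whose mean distance to the rest is
    anomalously large."""
    f = np.asarray(track_feats, dtype=float)
    n = len(f)
    dists = np.linalg.norm(f[:, None, :] - f[None, :, :], axis=-1)
    mean_d = dists.sum(axis=1) / (n - 1)   # mean distance to the other frames
    cutoff = mean_d.mean() + z_thresh * mean_d.std()
    return np.where(mean_d <= cutoff)[0]   # indices of frames kept
```

The surviving frames would then be passed to the binary quality prediction network for the second screening stage.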
Further, step 4 comprises:
specifically, for face feature extraction, using the FaceNet face recognition framework, which achieves advanced results on public data sets;
specifically, for gait feature extraction, using the GaitPart gait recognition model, which achieves advanced results on public data sets.
Further, step 5 comprises:
specifically, using a loosely coupled scheme in the fusion of the multiple extracted features.
Further, step 6 comprises:
applying the operation of step 3 to the pedestrian apparent features extracted from the single-camera video, then performing single-frame matching, and then trajectory-level matching based on the single-frame matching results.
Further, step 7 comprises:
specifically, performing cross-camera trajectory clustering according to the fused features, or according to the matched multiple pedestrian features. More specifically, the pedestrian apparent features extracted under a single camera serve as the main body, while the face and gait features extracted in step 4 carry dynamically changing weights.
On the one hand, owing to camera viewing angle and pedestrian orientation, far fewer face samples are detected in practice than apparent-feature samples, and no face can be detected at all when a pedestrian faces away from the camera or is occluded, so proper weights are difficult to set with a fixed weighted-average scheme.
On the other hand, although gait recognition avoids the problems of camera viewing angle and pedestrian orientation, for technical reasons it has not yet reached the accuracy of other recognition methods.
Based on this analysis, taking the pedestrian apparent features extracted under a single camera as the main body, with the face and gait features extracted in step 4 carrying dynamically changing weights, makes the cross-camera trajectory clustering more robust.
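A minimal sketch of this dynamically weighted, loosely coupled fusion follows. The weight values are illustrative assumptions, the face and gait vectors are assumed already projected to the appearance feature's dimensionality, and a missing cue (e.g. no face detected because the pedestrian faces away) simply contributes zero weight:

```python
import numpy as np

def fuse_features(appearance, face=None, gait=None, face_w=0.3, gait_w=0.2):
    """Loosely coupled fusion: the single-camera appearance feature is
    the main body; face and gait features contribute with weights that
    drop to zero when the cue is unavailable. Weight values are
    illustrative, not taken from the patent."""
    parts = [(np.asarray(appearance, float), 1.0)]
    if face is not None:
        parts.append((np.asarray(face, float), face_w))
    if gait is not None:
        parts.append((np.asarray(gait, float), gait_w))
    total = sum(w for _, w in parts)
    fused = sum(w * f for f, w in parts) / total   # weighted mean of available cues
    return fused / np.linalg.norm(fused)           # unit-normalize for cosine matching
```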
The invention provides a hardware device for the above deep-learning-based single-camera pedestrian multi-target tracking method, comprising:
the video acquisition module is used for acquiring video data acquired by the camera;
the pedestrian detection module is used for carrying out pedestrian detection on the video frames acquired by the video acquisition module so as to acquire detection frames of all pedestrians in each video frame;
the apparent feature acquisition module is used for performing parallel multi-scale feature extraction on the video frames acquired by the video acquisition module, so as to acquire the apparent features of all pedestrians;
the data correlation module is used for matching the target detection frames of all the pedestrians based on the apparent pedestrian features so as to obtain a pedestrian tracking result;
Specifically, the data association module includes a similarity measurement module and a trajectory merging module. The similarity measurement module calculates the similarity between the apparent features of all pedestrians; the trajectory merging module merges the detection results of the same pedestrian into the same trajectory according to the calculated similarities and assigns that trajectory a single identity.
The hardware device of the deep learning-based single-camera pedestrian multi-target tracking method further comprises: and the counting module is used for counting at least one pedestrian track so as to determine the number of pedestrians in the video.
The invention provides a hardware device for the above cross-camera pedestrian multi-target tracking method based on deep learning, comprising:
the multi-data-source video acquisition module is used for acquiring a plurality of video data acquired by a plurality of cameras respectively;
the intelligent video analysis module is used for respectively processing a plurality of video data by utilizing the hardware device of the single-camera pedestrian multi-target tracking method based on the deep learning to obtain the tracking results of all pedestrians in a plurality of videos;
the trajectory frame selection module is used for selecting high-quality video frames from the pedestrian tracking trajectories obtained by the intelligent video analysis module, removing pedestrian frames in which the target is blurred, occluded, or deformed, so as to compute the apparent pedestrian features of the whole trajectory;
the face recognition module is used for acquiring the face features of the pedestrians in the video frames selected by the track frame selection module;
the gait recognition module is used for acquiring the gait features of the pedestrian in the video frames selected by the trajectory frame selection module;
the feature fusion module is used for performing heterogeneous feature fusion of the pedestrian apparent features extracted under a single camera with the face features extracted by the face recognition module and the gait features extracted by the gait recognition module;
in particular, the feature fusion module introduces an attention mechanism for directing the mixing of information from different modules;
the feature matching module is used for matching the pedestrian apparent features extracted from the single-camera video of the same pedestrian with the face features extracted by the face recognition module and the gait features extracted by the gait recognition module;
and the multi-data-source track association module is used for performing association clustering on the tracks of all pedestrians in the videos shot by different cameras to obtain a pedestrian tracking result crossing the cameras.
Compared with the prior art, the invention has the following advantages:
(1) The invention provides a cross-camera pedestrian multi-target tracking method and device based on the fusion of pedestrian face, gait, and apparent features, with a deep-learning-based pedestrian detection method and device that does not depend on, and is not limited by, anchor frames. Eliminating the anchor frame alleviates the ambiguity problem, and using a high-resolution feature map allows the pedestrian features to align better with the target center.
(2) In practical application, the deep-learning-based cross-camera pedestrian multi-target tracking method and device need no replacement of the monitoring system with new intelligent equipment and can be applied directly to an old monitoring system, avoiding unnecessary waste of resources. They have good robustness, can alleviate defects of old surveillance video such as low frame resolution and illumination change, and can effectively resist problems such as mutual occlusion between pedestrians and deviations in pedestrian detection results.
(3) The deep-learning-based cross-camera pedestrian multi-target tracking method and device achieve accurate and efficient tracking.
Drawings
To more clearly illustrate the above and other objects, features and advantages of the present invention, embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. The accompanying drawings are included to provide a further description of the embodiments of the invention, and together with the description serve to explain the invention and not to limit it. For a person skilled in the art, without inventive effort, further figures can be obtained from these figures.
FIG. 1 is a flowchart of a deep learning-based single-camera pedestrian multi-target tracking method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a cross-camera pedestrian multi-target tracking method based on deep learning according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a hardware device of a deep learning-based single-camera pedestrian multi-target tracking method according to an embodiment of the present invention;
fig. 4 is a schematic block diagram of a hardware device of a cross-camera pedestrian multi-target tracking method based on deep learning according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
As shown in fig. 1, according to an embodiment of the present invention, the present invention provides a method for tracking multiple targets of a pedestrian by using a single camera based on deep learning, including the following steps:
step S101, video data is obtained;
the video data may be an original video captured by a camera or the like, or may be video data obtained after preprocessing on the original video.
Step S102, inputting the video into a trained encoder-decoder deep network, performing parallel multi-scale feature extraction, and producing outputs for two tasks, pedestrian detection and pedestrian apparent feature extraction, simultaneously at the network output;
To learn richer features, low-level and high-level features need to be merged in the network so that targets of different sizes appearing in the video can be recognized. Step S102 can adopt a trained encoder-decoder network similar to U-NET to fuse different feature maps, addressing the scale-variation problem that frequently occurs in multi-target tracking.
Step S103, outputting a candidate target center point heat map, the size of a pedestrian detection frame and the offset of a pedestrian center relative to the actual position of the original image for a pedestrian detection part;
the pedestrian detection part uses a target detection method independent of an anchor frame, and regards target detection as the problem of target central point detection, and outputs pedestrian characteristics according to the central point. The elimination of the anchor box alleviates the ambiguity problem and the use of a high resolution feature map allows the pedestrian features to better align with the target center.
Step S104, outputting the pedestrian apparent features of the center points of all the candidate targets for the pedestrian apparent feature extraction part;
the pedestrian apparent features are extracted by using the trained encoder-decoder deep network, and the low-dimensional features need less training samples, so that the learning of the low-dimensional features is beneficial to reducing the risk of over-fitting of small data, and the tracking robustness can be improved.
In step S105, all the pedestrian detection frames are correlated based on the pedestrian apparent features in all the pedestrian detection frames to obtain the tracking trajectories of all the pedestrians.
And comparing the feature vectors of the two pedestrian detection frames, associating the detection frames belonging to the same person, and finally connecting into a pedestrian track.
The deep-learning-based cross-camera pedestrian multi-target tracking method of this embodiment does not rely on an anchor-frame target detection method and is not limited by anchor frames. It treats target detection as target center-point detection and outputs pedestrian features at the center point. Eliminating the anchor frame effectively resists problems such as mutual occlusion between pedestrians and detection offset, and effectively reduces pedestrian identity switches. The method can achieve accurate and efficient pedestrian multi-target tracking.
As shown in fig. 2, according to an embodiment of the present invention, the present invention provides a cross-camera pedestrian multi-target tracking method based on deep learning, including the following steps:
step S201, acquiring a plurality of video stream data collected by a plurality of cameras;
step S202, respectively processing a plurality of videos by using the single-camera multi-target pedestrian tracking method based on deep learning in the embodiment to obtain the tracking tracks of all pedestrians corresponding to a plurality of video streams;
step S203, selecting high-quality frames for the tracking track of each pedestrian in a plurality of videos so as to calculate the pedestrian apparent characteristics of the whole track;
Step S203 includes: first, using the apparent features extracted in step S202 and the trajectory-association results, selecting for each pedestrian trajectory the frames with high correlation for clustering and removing wrongly matched video frames; then screening further with a video-frame quality prediction network. The quality prediction network is a binary classification network whose training depends on a self-labeled data set; in that data set, negative samples come from two sources: the results of the preliminary screening, and blurred, noisy samples generated from normal pictures by digital image processing. Testing is carried out with a test set together with manual inspection.
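Generating the blurred, noisy negative samples for the quality network can be sketched with plain digital image processing, as described above. The box-blur kernel size and Gaussian noise level are illustrative assumptions; the patent does not specify the exact operations:

```python
import numpy as np

def make_negative_sample(img, kernel=5, noise_std=10.0, seed=0):
    """Produce a blurred + noisy negative training sample for the
    two-class frame-quality network from a normal grayscale frame
    (box blur followed by additive Gaussian noise)."""
    img = np.asarray(img, float)
    pad = kernel // 2
    padded = np.pad(img, pad, mode="edge")
    blurred = np.zeros_like(img)
    h, w = img.shape
    for y in range(h):            # box blur: mean over a kernel x kernel window
        for x in range(w):
            blurred[y, x] = padded[y:y + kernel, x:x + kernel].mean()
    rng = np.random.default_rng(seed)
    noisy = blurred + rng.normal(0.0, noise_std, img.shape)
    return np.clip(noisy, 0, 255)  # keep valid 8-bit intensity range
```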
step S204, for the selected video frame, the face feature and the gait feature of the pedestrian are respectively extracted by using the trained deep neural network, and step S205 or step S206 can be optionally executed after step S204 is executed;
for the extraction of the face features, a FaceNet face recognition framework which obtains advanced results on a public data set is used for carrying out feature extraction to obtain a current pedestrian face feature vector; for the Gait feature extraction, a Gait-Part Gait recognition model which obtains advanced results on a public data set is used for feature extraction to obtain a current pedestrian Gait feature vector;
step S205, heterogeneous feature fusion is carried out on the pedestrian appearance feature extracted under the single camera and the pedestrian face feature and gait feature extracted in the previous step, and step S207 is directly switched to after step S205 is executed;
performing multi-feature fusion on the pedestrian appearance features extracted under the condition of a single camera and the pedestrian face features and gait features extracted in the previous step, and using a loose coupling scheme;
Step S206, matching the pedestrian apparent features extracted from the single-camera video belonging to the same pedestrian against the pedestrian face and gait features extracted in step S204; after step S206 is executed, go directly to step S207;
For the pedestrian apparent features extracted from the single-camera video, the operation of step S203 is executed; single-frame matching is then performed, followed by trajectory-level matching according to the single-frame matching results.
step S207, cross-camera trajectory association clustering is performed according to the pedestrian features fused in step S205, or cross-camera trajectory hierarchical clustering is performed according to the multiple pedestrian features matched in step S206.
The cross-camera trajectory association clustering takes the pedestrian apparent features extracted under a single camera as the main cue, and uses the pedestrian face features and gait features extracted in step S204 with dynamically varying weights, which makes the cross-camera trajectory association clustering more robust.
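The dynamically weighted combination can be sketched as a convex blend of per-modality distances in which a modality's weight drops out whenever that cue was not observed in the track (e.g. the face was never visible). The weight values below are illustrative assumptions:

```python
def combined_distance(d_app, d_face=None, d_gait=None,
                      w_face=0.3, w_gait=0.2):
    """Appearance distance is the backbone; face and gait distances are
    blended in with weights that vanish whenever that modality was not
    observed, so missing cues never penalize the match."""
    weights, dists = [1.0], [d_app]
    for d, w in ((d_face, w_face), (d_gait, w_gait)):
        if d is not None:
            weights.append(w)
            dists.append(d)
    total = sum(weights)
    return sum(w * d for w, d in zip(weights, dists)) / total

# both auxiliary cues available vs. appearance only
print(round(combined_distance(0.4, d_face=0.1, d_gait=0.2), 3))  # 0.313
print(combined_distance(0.4))                                    # 0.4
```

A distance matrix built from this score could then feed any standard agglomerative clustering routine to merge trajectories across cameras.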
The deep-learning-based cross-camera pedestrian multi-target tracking method can effectively handle problems such as mutual occlusion among pedestrians and detection offset, effectively reduce pedestrian identity switches, exhibits good robustness, and achieves accurate and efficient cross-camera pedestrian tracking.
As shown in fig. 3, according to an embodiment of the present invention, an apparatus 300 implementing the deep-learning-based single-camera pedestrian multi-target tracking method is provided, including a video acquisition module 301, a pedestrian tracking module 302 and a data association module 305, where the pedestrian tracking module 302 includes a pedestrian detection module 303 and an apparent feature acquisition module 304.
The modules can respectively execute the steps of the deep-learning-based single-camera pedestrian multi-target tracking method described in conjunction with fig. 1. Only the main functions of the modules of the apparatus 300 are described below.
The video acquisition module 301 is configured to acquire video data acquired by a camera;
the pedestrian tracking module 302 is used for detecting pedestrians, extracting features and measuring the pedestrian similarity of the tracks;
specifically, the pedestrian tracking module 302 includes a pedestrian detection module 303 and an apparent feature acquisition module 304;
the pedestrian detection module 303 is configured to perform pedestrian detection on the video frames acquired by the video acquisition module to acquire detection frames of all pedestrians in each video frame;
the apparent feature acquisition module 304 is used for performing parallel multi-scale feature extraction on the video frames acquired by the video acquisition module to obtain the apparent features of all pedestrians;
the data association module 305 is used for matching the target detection frames of all pedestrians based on the apparent pedestrian features to obtain a pedestrian tracking result;
according to an embodiment of the present invention, the apparatus 300 implementing the deep-learning-based single-camera pedestrian multi-target tracking method further includes a counting module, which is used for counting the pedestrian trajectories in the video and determining the number of pedestrians in the video.
As shown in fig. 4, according to an embodiment of the present invention, an apparatus 400 implementing the deep-learning-based cross-camera pedestrian multi-target tracking method is provided, including a multi-data-source video acquisition module 401, an intelligent video analysis module 402, a track frame selection module 403, a face recognition module 404, a gait recognition module 405, a feature fusion module 406, a feature matching module 407, and a multi-data-source track association module 408. The modules can respectively execute the steps of the deep-learning-based cross-camera pedestrian multi-target tracking method described in conjunction with figs. 1 and 2. Only the main functions of the modules of the apparatus 400 are described below.
The multi-data-source video acquisition module 401 is configured to obtain multiple video data collected respectively by multiple cameras;
the intelligent video analysis module 402 is configured to respectively process a plurality of video data by using the hardware device of the deep learning-based single-camera pedestrian multi-target tracking method, and obtain tracking results of all pedestrians in a plurality of videos;
the track frame selection module 403 is configured to select a high-quality video frame from the pedestrian tracking tracks obtained by the intelligent video analysis module, and remove a pedestrian frame with a blurred, blocked, or deformed target, so as to calculate an apparent pedestrian feature of the whole track;
the face recognition module 404 is configured to obtain face features of all pedestrians in the video frame selected by the track frame selection module 403;
the gait recognition module 405 is configured to obtain gait features of all pedestrians in the video frames selected by the track frame selection module 403;
the feature fusion module 406 performs heterogeneous feature fusion on the pedestrian apparent features extracted under the single camera, the pedestrian face features extracted by the face recognition module, and the gait features extracted by the gait recognition module, and introduces an attention mechanism for guiding the mixing of information from the different modalities;
the feature matching module 407 is configured to match, for the same pedestrian, the apparent features extracted from a single-camera video with the face features extracted by the face recognition module and the gait features extracted by the gait recognition module;
the multiple data source trajectory association module 408 is configured to perform association clustering on trajectories of all pedestrians in videos shot by different cameras to obtain a pedestrian tracking result across the cameras.
The method and the apparatus for deep-learning-based cross-camera pedestrian multi-target tracking do not depend on anchor-frame-based target detection and are not limited by anchor frames. The invention treats target detection as a target center-point detection problem and outputs pedestrian features at the center points. In practical applications, the method and the apparatus do not require replacing an existing intelligent monitoring system with a new one; they can be applied directly to legacy monitoring systems, avoiding unnecessary resource waste. They exhibit good robustness, mitigate shortcomings of legacy surveillance video such as low frame resolution and illumination changes, and effectively handle problems such as mutual occlusion among pedestrians and deviations in pedestrian detection results. The method and the apparatus achieve accurate and efficient tracking results.
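The center-point view of detection can be sketched as decoding a heatmap: each local maximum above a confidence threshold becomes one detection, refined by the predicted sub-pixel offset and box size at that cell. The toy 3x3 heatmap and the threshold below are illustrative assumptions:

```python
def decode_centers(heatmap, offsets, sizes, threshold=0.5):
    """Decode a center-point heatmap into boxes: every local maximum
    above the threshold becomes one detection, refined by the predicted
    sub-pixel offset (dx, dy) and box size (w, h) at that cell."""
    h, w = len(heatmap), len(heatmap[0])
    boxes = []
    for y in range(h):
        for x in range(w):
            s = heatmap[y][x]
            if s < threshold:
                continue
            neighbours = [heatmap[ny][nx]
                          for ny in (y - 1, y, y + 1)
                          for nx in (x - 1, x, x + 1)
                          if 0 <= ny < h and 0 <= nx < w]
            if s < max(neighbours):
                continue  # not a local maximum
            dx, dy = offsets[y][x]
            bw, bh = sizes[y][x]
            cx, cy = x + dx, y + dy
            boxes.append((cx - bw / 2, cy - bh / 2,
                          cx + bw / 2, cy + bh / 2, s))
    return boxes

heat = [[0.1, 0.2, 0.1],
        [0.2, 0.9, 0.2],
        [0.1, 0.2, 0.1]]
offs = [[(0.0, 0.0)] * 3 for _ in range(3)]
offs[1][1] = (0.2, -0.1)
szs = [[(2.0, 4.0)] * 3 for _ in range(3)]
print(decode_centers(heat, offs, szs))  # one detection centred near (1.2, 0.9)
```

In the joint detection-and-ReID setting described above, the apparent feature vector would simply be read out of the feature map at the same center cell.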
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (15)

1. A cross-camera pedestrian multi-target tracking method based on deep learning is characterized by comprising the following steps:
step 1: acquiring video stream data acquired by a plurality of cameras;
step 2: processing video stream data acquired by each camera by using a single-camera pedestrian multi-target tracking method based on deep learning to obtain tracking tracks of all corresponding pedestrians;
step 3: selecting higher-quality video frames from the tracking trajectories of all the pedestrians so as to calculate the pedestrian apparent features of the whole trajectory;
step 4: for the selected video frames, respectively extracting the face features and the gait features of the pedestrians by using trained deep neural networks, and then executing step 5 or step 6;
step 5: performing heterogeneous feature fusion on all the pedestrian apparent features extracted under a single camera and the pedestrian face features and gait features extracted in step 4, and then executing step 7;
step 6: matching the pedestrian apparent features extracted from the video of the same pedestrian under a single camera with the pedestrian face features and gait features extracted in step 4, and then executing step 7;
step 7: performing cross-camera trajectory association clustering according to the pedestrian features fused in step 5, or performing cross-camera trajectory hierarchical clustering according to the multiple pedestrian features matched in step 6.
2. The method for multi-target pedestrian tracking across cameras based on deep learning of claim 1, wherein the method for multi-target pedestrian tracking across single camera based on deep learning in step 2 comprises the following steps:
step 01: acquiring video stream data acquired by a single camera;
step 02: inputting video frames acquired by video streaming data acquired by a single camera correspondingly to a trained encoder-decoder deep network, performing parallel multi-scale feature extraction, and simultaneously outputting two tasks at a network output end, namely pedestrian detection and pedestrian apparent feature extraction;
step 03: for the pedestrian detection part, outputting a candidate target center point heat map, the size of a pedestrian detection frame and the offset of the pedestrian center relative to the actual position of the original image;
step 04: outputting the pedestrian apparent features of the center points of all the candidate targets to a pedestrian apparent feature extraction part;
step 05: and correlating all the pedestrian detection frames based on the pedestrian apparent features in all the pedestrian detection frames to obtain the tracking tracks of all the pedestrians.
3. The method for multi-target pedestrian tracking across cameras based on deep learning according to claim 1, wherein the step 3 specifically comprises: preliminarily screening high-quality video frames according to the extracted apparent features in the tracking trajectories of all pedestrians and the trajectory association results, and then further screening through a video frame quality prediction network so as to calculate the pedestrian apparent features of the whole trajectory, wherein the preliminary screening process comprises: selecting frames with high correlation for the trajectory of each pedestrian, namely clustering according to the feature distances between the features of low-quality frames and the features of other frames, and removing mismatched video frames; the video frame quality prediction network adopts a binary classification network, and its training relies on a self-labeled data set; the negative samples in the self-labeled data set come from two sources, one part is the result of the preliminary screening, and the other part consists of blurred and noisy samples generated by processing normal pictures through digital image processing techniques.
4. The method for multi-target pedestrian tracking across cameras based on deep learning according to claim 1, wherein the step 4 specifically comprises: for the selected video frames, a FaceNet face recognition framework is adopted to extract the face features of the pedestrians, and a Gait-Part gait recognition model is adopted to extract the gait features of the pedestrians.
5. The method for multi-target pedestrian tracking across cameras based on deep learning according to claim 1, wherein the step 5 specifically comprises: and (4) performing heterogeneous feature fusion on all the pedestrian apparent features extracted under the single camera and the pedestrian face features and gait features extracted in the step (4) by using a loose coupling mode, and then executing the step (7).
6. The method for multi-target pedestrian tracking across cameras based on deep learning according to claim 1, wherein the step 6 specifically comprises: performing the operation of step 3 on the pedestrian apparent features extracted from the videos of all the single cameras, then performing trajectory-level matching based on the single-frame matching results, and executing step 7 after the trajectory-level matching is completed.
7. The method for multi-target pedestrian tracking across cameras based on deep learning according to claim 1, wherein the step 7 specifically comprises: performing cross-camera trajectory clustering according to the fused features, or performing cross-camera trajectory clustering according to the multiple matched pedestrian features; in both processes, the pedestrian apparent features extracted by the deep-learning-based single-camera pedestrian multi-target tracking method are adopted as the main cue, and the pedestrian face features and gait features extracted in step 4 are used with dynamically varying weights.
8. The method for multi-target pedestrian tracking across cameras based on deep learning according to claim 2, wherein the step 02 specifically comprises: inputting the acquired video frames into a trained encoder-decoder network and fusing the bottom-layer and top-layer features so as to fuse different feature maps; fusing the features of different feature maps addresses the scale-variation problem that frequently occurs in multi-target tracking, and the encoder-decoder network adopts U-NET.
9. The method for multi-target pedestrian tracking across cameras based on deep learning according to claim 2, wherein the step 03 specifically comprises: outputting a candidate target center point heat map for estimating a position of a target center, the response decaying exponentially with distance between the position on the heat map and the object center;
outputting the offset of the center of the pedestrian relative to the actual position of the original image, and estimating continuous offset relative to the center of the pedestrian for each pixel to reduce the influence of down-sampling;
and outputting the size of the pedestrian detection frame for estimating the height and width of the pedestrian detection frame at each position so as to improve the detection precision.
10. The method for multi-target pedestrian tracking across cameras based on deep learning according to claim 2, wherein the step 04 specifically comprises: for the pedestrian apparent feature extraction part, learning a low-dimensional feature vector corresponding to the pedestrian apparent features by using the trained encoder-decoder deep network, and outputting the pedestrian apparent features of all candidate target center points.
11. The method for multi-target pedestrian tracking across cameras based on deep learning of claim 2, wherein the step 05 specifically comprises: and comparing the feature vectors of the two pedestrian detection frames, associating the detection frames belonging to the same person, and finally connecting into a pedestrian track.
12. An apparatus for the deep learning based cross-camera pedestrian multi-target tracking method according to any one of claims 1 to 11, characterized in that the apparatus comprises:
the multi-data-source video acquisition module is used for acquiring a plurality of video data acquired by a plurality of cameras respectively;
the intelligent video analysis module is used for respectively processing a plurality of video data by utilizing a hardware device corresponding to the single-camera pedestrian multi-target tracking method based on the deep learning to obtain the tracking results of all pedestrians in a plurality of videos;
the track frame selection module is used for selecting high-quality video frames from the pedestrian tracking tracks obtained by the intelligent video analysis module, and removing pedestrian frames with fuzzy, shielded or deformed targets so as to calculate the pedestrian apparent characteristics of the whole track;
the face recognition module is used for acquiring the face features of the pedestrians in the video frames selected by the track frame selection module;
the gait recognition module is used for acquiring the gait characteristics of the pedestrian in the video frame selected by the track frame selection module;
the characteristic fusion module is used for carrying out heterogeneous characteristic fusion on the pedestrian apparent characteristic extracted under the single camera, the pedestrian face characteristic extracted from the face recognition module and the gait characteristic extracted from the gait recognition module;
the feature matching module is used for matching, for the same pedestrian, the pedestrian apparent features extracted from the video of a single camera with the face features extracted by the face recognition module and the pedestrian gait features extracted by the gait recognition module;
and the multi-data-source track association module is used for performing association clustering on the tracks of all pedestrians in the videos shot by different cameras to obtain a pedestrian tracking result crossing the cameras.
13. The apparatus of claim 12, wherein the feature fusion module employs an attention mechanism for guiding the mixing of information from the different modalities, and performs the fusion of the extracted multiple features in a loosely coupled manner.
14. An apparatus for the deep learning based single-camera pedestrian multi-target tracking method as claimed in claim 2, characterized in that the apparatus comprises:
the video acquisition module is used for acquiring video data acquired by the camera;
the pedestrian tracking module comprises a pedestrian detection module and an apparent feature acquisition module and is used for detecting pedestrians, extracting features and calculating the pedestrian similarity of the track;
the pedestrian detection module is used for carrying out pedestrian detection on the video frames acquired by the video acquisition module so as to acquire detection frames of all pedestrians in each video frame;
the apparent feature acquisition module is used for performing parallel multi-scale feature extraction on the video frames acquired by the video acquisition module so as to acquire the apparent features of all pedestrians;
and the data correlation module is used for matching the target detection frames of all the pedestrians based on the apparent pedestrian features so as to obtain a pedestrian tracking result.
15. The apparatus of claim 14, wherein the data association module comprises a similarity measurement module and a trajectory merging module, wherein:
the similarity measurement module is used for calculating the similarity between the apparent features of all the pedestrians;
and the track merging module is used for merging the detection results belonging to the same pedestrian into the same track according to the pedestrian similarity calculated by the similarity measurement module and marking the same identity.
CN202011333900.XA 2020-11-24 2020-11-24 Cross-camera pedestrian multi-target tracking method and device based on deep learning Pending CN112270310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011333900.XA CN112270310A (en) 2020-11-24 2020-11-24 Cross-camera pedestrian multi-target tracking method and device based on deep learning


Publications (1)

Publication Number Publication Date
CN112270310A true CN112270310A (en) 2021-01-26


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818867A (en) * 2021-02-02 2021-05-18 浙江大华技术股份有限公司 Portrait clustering method, equipment and storage medium
CN113012190A (en) * 2021-02-01 2021-06-22 河南省肿瘤医院 Hand hygiene compliance monitoring method, device, equipment and storage medium
CN113052139A (en) * 2021-04-25 2021-06-29 合肥中科类脑智能技术有限公司 Deep learning double-flow network-based climbing behavior detection method and system
CN113096155A (en) * 2021-04-21 2021-07-09 青岛海信智慧生活科技股份有限公司 Community multi-feature fusion target tracking method and device
CN113239885A (en) * 2021-06-04 2021-08-10 新大陆数字技术股份有限公司 Face detection and recognition method and system
CN113256690A (en) * 2021-06-16 2021-08-13 中国人民解放军国防科技大学 Pedestrian multi-target tracking method based on video monitoring
CN114972814A (en) * 2022-07-11 2022-08-30 浙江大华技术股份有限公司 Target matching method, device and storage medium
CN115272982A (en) * 2022-09-28 2022-11-01 汇纳科技股份有限公司 Passenger flow volume statistical method, system, equipment and medium based on pedestrian re-identification
CN117333904A (en) * 2023-10-18 2024-01-02 杭州锐颖科技有限公司 Pedestrian tracking method based on multi-feature fusion
CN117333904B (en) * 2023-10-18 2024-04-23 杭州锐颖科技有限公司 Pedestrian tracking method based on multi-feature fusion

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104539874A (en) * 2014-06-17 2015-04-22 武汉理工大学 Human body mixed monitoring system and method fusing pyroelectric sensing with cameras
CN107590452A (en) * 2017-09-04 2018-01-16 武汉神目信息技术有限公司 A kind of personal identification method and device based on gait and face fusion


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YIFU ZHANG ET AL.: "FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking", 《ARXIV:2004.01888V5 [CS.CV]》 *
纪德益: "基于人脸联合识别和全局轨迹模式一致性的跨摄像头多目标追踪算法", 《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑》 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination