CN112132873A - Multi-lens pedestrian recognition and tracking based on computer vision - Google Patents
Multi-lens pedestrian recognition and tracking based on computer vision
- Publication number
- CN112132873A (application CN202011013830.XA)
- Authority
- CN
- China
- Prior art keywords
- tracking
- pedestrian
- picture
- camera
- gait
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/292—Multi-camera tracking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/277—Analysis of motion involving stochastic approaches, e.g. using Kalman filters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
- G06V40/25—Recognition of walking or running movements, e.g. gait recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30232—Surveillance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30241—Trajectory
Abstract
The invention relates to the technical field of video monitoring, and in particular to multi-lens pedestrian recognition and tracking based on computer vision, comprising the following steps: step 1: deployment of camera equipment and video stream acquisition; step 2: a tracking module under a single camera; step 4: cross-camera tracking; step 5: a re_label module that resolves pedestrian ID-SWITCH. For the problem of pedestrian IDs being exchanged when people cross paths, the invention provides an ID error-correction module that distinguishes trajectories by analysing the gait features of the pedestrians at the moment of the exchange, thereby achieving ID correction; the method can be used for community security, early warning of children going missing or of dangerous behaviour, crowd-trajectory analysis in supermarkets, and similar applications. Because the computer-vision pipeline can run uninterrupted in the background, efficiency and accuracy are markedly improved.
Description
Technical Field
The invention relates to the technical field of video monitoring, in particular to multi-lens pedestrian identification and tracking based on computer vision.
Background
At present, road monitoring systems around the world are entering a stage of rapid expansion and change, and under these shifting demands a security monitoring system needs more integrated solutions incorporating artificial intelligence (AI). Modern public security is no longer limited to endlessly expanding the density and breadth of image-monitoring coverage or pursuing ultra-high definition; rather, the traditional era of security monitoring is being advanced by AI tools and methods, turning into an AI security-monitoring era focused on data acquisition, application, and management. As the number of monitoring devices grows and image resolution continues to improve, the volume of image data collected for public security increases geometrically, and higher resolution also raises the processing load and utilisation of servers. Security image monitoring therefore faces great challenges in image retrieval, access control, data storage, data computation, and related technologies.
Cross-camera multi-target tracking, hereinafter referred to as MTMC, is a very important research topic in the field of surveillance video. At present there are some good solutions for single-target tracking and for multi-target tracking under a single camera, but the MTMC field has not yet formed a settled set of solutions and leaves very large research space.
Therefore, the invention discloses a cross-camera pedestrian tracking method. Behaviour trajectories of persons are generated from the data collected by different cameras; behaviour analysis is performed on persons of special concern and early warnings are issued; behaviour habits can be derived from the trajectory analysis; sequences of a target person can be retrieved; and manual work is reduced.
Disclosure of Invention
The invention aims to provide multi-lens pedestrian recognition and tracking based on computer vision.
In order to achieve the purpose, the invention adopts the following technical scheme:
Provided is a method for multi-lens pedestrian recognition and tracking based on computer vision, comprising the following steps:
step 1: deployment of camera devices and video stream capture
Cameras are arranged at the monitored area's important entrances, along its paths, at fork junctions, and at other key locations, and are used for tracking and identifying pedestrians; when a given camera or video stream is accessed, streams are pulled over the intranet's internal switching network using the RTSP access protocol, with a limit set on the number of access connections;
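As a concrete illustration of this deployment step, the sketch below registers cameras by RTSP URL and enforces a cap on access connections. All names here (CameraHub, max_connections, the URL path) are illustrative assumptions, not taken from the patent; a real system would open each registered URL with a video-capture API such as OpenCV's cv2.VideoCapture.

```python
class CameraHub:
    """Toy sketch of step 1: registering RTSP cameras with a connection cap.

    The class and parameter names are hypothetical; the patent only states
    that RTSP is the access protocol and that access connections are limited.
    """

    def __init__(self, max_connections=4):
        self.max_connections = max_connections  # limit on access connections
        self.streams = {}                       # camera id -> RTSP URL

    def rtsp_url(self, host, port=554, path="stream1"):
        # RTSP is the access protocol specified for pulling video streams
        return f"rtsp://{host}:{port}/{path}"

    def register(self, cam_id, host):
        if len(self.streams) >= self.max_connections:
            raise RuntimeError("connection limit reached")
        self.streams[cam_id] = self.rtsp_url(host)
        return self.streams[cam_id]

hub = CameraHub(max_connections=2)
url = hub.register("entrance", "192.168.1.10")
```

In a real deployment each stored URL would then be opened on the intranet's internal switching network and decoded frame by frame.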
step 2: tracking module under single camera
Given a segment of video, the JDE model processes each frame and outputs bounding boxes together with their corresponding appearance embeddings; an association matrix is computed between the embeddings of the observations and the embeddings in the pool of pre-existing tracklets; the observations are assigned to the tracklets using the Hungarian algorithm; a Kalman filter smooths the trajectories and predicts the position of each previous tracklet in the current frame; if an assigned observation is spatially too far from the predicted location, the assignment is rejected; the embedding of each matched tracklet is then updated, and a tracklet with no assigned observation is flagged as missing; if a tracklet stays missing for longer than a given threshold, it is deleted from the current tracklet pool, otherwise it may be found again in a later assignment step. JDE is an early attempt to jointly learn the detector and the embedding model in a single deep network: the network simultaneously outputs the detection results and the corresponding appearance embeddings of the detection boxes. By contrast, the SDE method and the two-stage method resample pixels (bounding boxes) and feature maps, respectively, and feed them into a separate re-ID model to extract appearance features. The loss function of JDE adopts the trihard loss, which builds on the triplet loss by accounting for the contribution of hard samples to the final loss. Suppose two input pictures I1 and I2 yield features f1 and f2 through forward propagation of the network; the Euclidean distance between the feature vectors of the two pictures is:
d_{I1,I2} = ||f1 - f2||_2
Picture a and picture p form a positive sample pair, while picture a and picture n form a negative sample pair; the triplet loss is:
L_t = (d_{a,p} - d_{a,n} + α)_+
where (x)_+ denotes max(x, 0), and α is a manually set margin (threshold) parameter.
Trihard loss:
For each training batch, P pedestrian IDs are selected at random, and K different pictures are selected at random for each of these pedestrians, so one batch contains P×K pictures; A is the set of pictures with the same ID as anchor a, and B is the set of pictures with the remaining IDs;
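The trihard loss described above can be sketched numerically. The NumPy implementation below is an interpretation of the patent's description rather than its code: for each anchor a it takes the hardest positive from set A and the hardest negative from set B, then applies L_t = (d_{a,p} - d_{a,n} + α)_+ and averages over the batch.

```python
import numpy as np

def trihard_loss(features, ids, alpha=0.3):
    """Batch-hard ("trihard") triplet loss sketch for a P*K batch.

    features: (N, D) embeddings; ids: (N,) pedestrian IDs.
    For each anchor a, the hardest positive is the largest distance within
    A (same ID), the hardest negative the smallest distance within B
    (other IDs); alpha is the manually set margin from the formula above.
    """
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)        # Euclidean d_{Ii,Ij}
    same = ids[:, None] == ids[None, :]
    losses = []
    for a in range(len(ids)):
        pos = same[a].copy()
        pos[a] = False                                  # A: same ID, not anchor
        neg = ~same[a]                                  # B: different IDs
        d_ap = dist[a][pos].max()                       # hardest positive
        d_an = dist[a][neg].min()                       # hardest negative
        losses.append(max(d_ap - d_an + alpha, 0.0))    # (x)_+ = max(x, 0)
    return float(np.mean(losses))
```

The alpha value of 0.3 is an illustrative default, not a value specified by the patent.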
and step 3: re _ ID combined single-lens pedestrian with same ID
Global features are combined with multi-granularity local features: the global feature is responsible for extracting the common, macro-level characteristics of the whole image, and the image is then cut into different blocks, each with a different granularity, responsible for extracting features at a different level or scale; combining the global and local features provides rich information and detail to represent the complete condition of an input picture;
step 4: Cross-camera tracking
The mean of the pedestrian features over a whole track is taken as that track's feature, and re-ranking is then adopted to further optimise the ordering; for the cross-camera tracking problem, a binary-tree search traversal is adopted, in which the feature matching uses the extracted re-ID features, whose quality depends on the robustness of the model;
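The track-level feature and the initial ranking can be sketched as below. The re-ranking refinement (e.g. k-reciprocal re-ranking) is deliberately omitted, and plain Euclidean distance stands in for whatever metric the trained model would induce; both simplifications are assumptions of this sketch.

```python
import numpy as np

def track_feature(frame_feats):
    """Average the per-frame re-ID features over a whole track (step 4)."""
    return np.mean(frame_feats, axis=0)

def rank_gallery(query, gallery):
    """Rank gallery track features by Euclidean distance to the query.

    A real system would refine this initial ordering with re-ranking;
    that refinement is omitted here for brevity.
    """
    d = np.linalg.norm(gallery - query, axis=1)
    return np.argsort(d)
```

Tracks whose features rank closest across cameras are candidates for the same pedestrian ID.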
step 5: Re_label module for resolving pedestrian ID-SWITCH
Pedestrians at different positions monitored by multiple surveillance cameras are linked; when visual appearance features are unreliable, for example because a pedestrian wears a mask or changes make-up, analysis based on behavioural features such as gait becomes an alternative solution to the person re-identification problem. Analysis of the single-shot tracking results shows that, after tracks with the same ID are merged through the re-ID features, a portion of ID-SWITCH cases remain in which the IDs of two people are interchanged when they meet; these IDs must be resolved by a correction module. The module comprises two sub-modules, target-object gait-feature extraction and object re-identification: the gait-feature extraction sub-module detects the object by foreground detection and extracts the target object's silhouette, gait cycle, and walking angle, and finally the fused gait features are sent to the object re-identification sub-module; the re-identification sub-module finds, within the candidate object set, the top three re-identification results that most closely match the target object.
Further, the cameras deployed in step 1 comprise a plurality of horizontally arranged cameras that can rotate through a wide angle without interfering with one another.
Further, in step 3 the image is cut into different blocks, each with a different granularity; each block is assigned a different value n (n = 1, 2, 3, …), with different values of n corresponding to features at different levels, and the level of detail captured by the model increasing as n increases.
Further, in step 4, in addition to the three overlapping camera pairs, a merging operation over some single-camera tracks is also performed during cross-camera tracking.
Further, the implementation method of gait feature extraction in step 5 is as follows:
1) preprocess the picture, normalising the pedestrian's ROI and scale;
2) extract the GEI (gait energy image) of the picture;
3) extract HOG features from the GEI;
4) construct a training set and a test set and train an SVM classifier;
5) label the pedestrian with the trained model.
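Step 2) of this pipeline, the gait energy image, is simply the per-pixel average of the binary silhouettes over one gait cycle, and can be sketched as follows. Silhouette extraction, the HOG descriptor, and the SVM classifier of the other steps are assumed to come from standard tools (e.g. scikit-image and scikit-learn) and are not reproduced here.

```python
import numpy as np

def gait_energy_image(silhouettes):
    """GEI sketch for step 5's gait pipeline.

    silhouettes: sequence of (H, W) binary masks covering one gait cycle.
    Returns the per-pixel mean, a float image with values in [0, 1] where
    bright pixels mark body regions that stay occupied across the cycle.
    """
    stack = np.asarray(silhouettes, dtype=float)  # (T, H, W)
    return stack.mean(axis=0)
```

HOG features would then be computed on this averaged image before training the SVM.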
Further, in the ID-correction implementation of step 5, the target's spatial position in each frame is predicted by Kalman filtering; when a target is occluded, this brings a large prediction error. A reasonable-region search method is adopted to correct part of the IDs: assume the centre of the search region is (x_c, y_c) with radius r, and that (x_d, y_d) is the centre position of the rectangular region of any detection r_d in the current frame; the detection is considered if the condition is satisfied:
where N_r is the number of frames for which the target has been in the unassociated state; the similarity is then computed from the gait features of the pedestrians within the region, and the subsequent re-matching is performed.
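Since the condition formula itself is not reproduced in the text above, the gating sketch below encodes one natural reading, stated explicitly as an assumption: a detection r_d is a candidate when the Euclidean distance between the search centre (x_c, y_c) and the detection's rectangle centre is at most the radius r.

```python
import numpy as np

def in_search_region(center, det_center, radius):
    """Hypothetical gating test for the ID-correction region search.

    This distance-within-radius condition is an assumed reading of the
    patent's (unreproduced) condition formula, not the formula itself.
    """
    xc, yc = center
    xd, yd = det_center
    return float(np.hypot(xd - xc, yd - yc)) <= radius
```

Detections passing the gate would then be compared by gait-feature similarity for re-matching.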
The invention has the beneficial effects that:
1. The method can be used for community security: tracking the elderly enables early warning of falls, and tracking children enables early warning of their going missing or of dangerous behaviour; it can likewise be used for crowd-trajectory analysis in supermarkets, so that a supermarket can analyse the data to achieve intelligent shelf management.
2. When video data volumes are large, near-manual video retrieval is abandoned while high efficiency and correctness are ensured; by means of computer vision, processing can run uninterrupted in the background, markedly improving efficiency and accuracy.
3. Pedestrian ID-SWITCH is resolved through the re_label module: when two people meet and their IDs are exchanged, a series of features such as the objects' gait features is used to identify them and re-associate the earlier tracks, avoiding interference and errors, so the overall accuracy is higher.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required by the embodiments are briefly described below.
FIG. 1 is an overall framework flow of the present invention;
fig. 2 is a re _ label module provided by the present invention.
Detailed Description
The technical scheme of the invention is further explained below through specific embodiments in combination with the drawings.
The drawings are for illustration only, show the schematic rather than the actual form, and are not to be construed as limiting the patent; to better illustrate the embodiments of the invention, some components in the drawings may be omitted, enlarged, or reduced, and do not represent the size of an actual product.
Referring to fig. 1 and 2, a computer vision based multi-lens pedestrian recognition and tracking includes the following steps:
step 1: deployment of camera devices and video stream capture
Cameras are arranged at the monitored area's important entrances, along its paths, at fork junctions, and at other key locations, and are used for tracking and identifying pedestrians; when a given camera or video stream is accessed, streams are pulled over the intranet's internal switching network using the RTSP access protocol, with a limit set on the number of access connections;
step 2: tracking module under single camera
Given a segment of video, the JDE model processes each frame and outputs bounding boxes together with their corresponding appearance embeddings; an association matrix is computed between the embeddings of the observations and the embeddings in the pool of pre-existing tracklets; the observations are assigned to the tracklets using the Hungarian algorithm; a Kalman filter smooths the trajectories and predicts the position of each previous tracklet in the current frame; if an assigned observation is spatially too far from the predicted location, the assignment is rejected; the embedding of each matched tracklet is then updated, and a tracklet with no assigned observation is flagged as missing; if a tracklet stays missing for longer than a given threshold, it is deleted from the current tracklet pool, otherwise it may be found again in a later assignment step. JDE is an early attempt to jointly learn the detector and the embedding model in a single deep network: the network simultaneously outputs the detection results and the corresponding appearance embeddings of the detection boxes. By contrast, the SDE method and the two-stage method resample pixels (bounding boxes) and feature maps, respectively, and feed them into a separate re-ID model to extract appearance features. The loss function of JDE adopts the trihard loss, which builds on the triplet loss by accounting for the contribution of hard samples to the final loss. Suppose two input pictures I1 and I2 yield features f1 and f2 through forward propagation of the network; the Euclidean distance between the feature vectors of the two pictures is:
d_{I1,I2} = ||f1 - f2||_2
Picture a and picture p form a positive sample pair, while picture a and picture n form a negative sample pair; the triplet loss is:
L_t = (d_{a,p} - d_{a,n} + α)_+
where (x)_+ denotes max(x, 0), and α is a manually set margin (threshold) parameter.
Trihard loss:
For each training batch, P pedestrian IDs are selected at random, and K different pictures are selected at random for each of these pedestrians, so one batch contains P×K pictures; A is the set of pictures with the same ID as anchor a, and B is the set of pictures with the remaining IDs;
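The association loop of this step can be sketched end to end. The toy version below substitutes a greedy assignment for the Hungarian algorithm and a plain Euclidean gate for the Kalman-predicted gating, purely to show the data flow; every parameter name and the gate value are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def associate(track_embs, det_embs, track_pred, det_pos, gate=50.0):
    """Simplified sketch of the step-2 association between tracklets and
    detections: embedding distance builds the cost matrix, and a match is
    rejected when the detection lies too far from the predicted position.
    """
    # cost[t, d] = Euclidean distance between tracklet and detection embeddings
    cost = np.linalg.norm(track_embs[:, None] - det_embs[None, :], axis=2)
    matches, used = {}, set()
    for t in np.argsort(cost.min(axis=1)):            # greedy, cheapest first
        for d in np.argsort(cost[t]):
            too_far = np.linalg.norm(track_pred[t] - det_pos[d]) > gate
            if d in used or too_far:                   # spatial gating
                continue
            matches[int(t)] = int(d)
            used.add(d)
            break
    return matches  # tracklets absent from matches would be flagged missing
```

A production tracker would replace the greedy loop with an optimal assignment solver and update each matched tracklet's embedding and Kalman state afterwards.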
and step 3: re _ ID combined single-lens pedestrian with same ID
Global features are combined with multi-granularity local features: the global feature is responsible for extracting the common, macro-level characteristics of the whole image, and the image is then cut into different blocks, each with a different granularity, responsible for extracting features at a different level or scale; combining the global and local features provides rich information and detail to represent the complete condition of an input picture;
step 4: Cross-camera tracking
The mean of the pedestrian features over a whole track is taken as that track's feature, and re-ranking is then adopted to further optimise the ordering; for the cross-camera tracking problem, a binary-tree search traversal is adopted, in which the feature matching uses the extracted re-ID features, whose quality depends on the robustness of the model;
step 5: Re_label module for resolving pedestrian ID-SWITCH
Pedestrians at different positions monitored by multiple surveillance cameras are linked; when visual appearance features are unreliable, for example because a pedestrian wears a mask or changes make-up, analysis based on behavioural features such as gait becomes an alternative solution to the person re-identification problem. Analysis of the single-shot tracking results shows that, after tracks with the same ID are merged through the re-ID features, a portion of ID-SWITCH cases remain in which the IDs of two people are interchanged when they meet; these IDs must be resolved by a correction module. The module comprises two sub-modules, target-object gait-feature extraction and object re-identification: the gait-feature extraction sub-module detects the object by foreground detection and extracts the target object's silhouette, gait cycle, and walking angle, and finally the fused gait features are sent to the object re-identification sub-module; the re-identification sub-module finds, within the candidate object set, the top three re-identification results that most closely match the target object.
The cameras deployed in step 1 comprise a plurality of horizontally arranged cameras that can rotate through a wide angle without interfering with one another.
In step 3, the image is cut into different blocks, each with a different granularity; each block is assigned a different value n (n = 1, 2, 3, …), with different values of n corresponding to features at different levels, and the level of detail captured by the model increasing as n increases.
In step 4, in addition to the three overlapping camera pairs, a merging operation over some single-camera tracks is also performed during cross-camera tracking.
The gait feature extraction implementation method in the step 5 comprises the following steps:
1) preprocess the picture, normalising the pedestrian's ROI and scale;
2) extract the GEI (gait energy image) of the picture;
3) extract HOG features from the GEI;
4) construct a training set and a test set and train an SVM classifier;
5) label the pedestrian with the trained model.
In the ID-correction implementation of step 5, the target's spatial position in each frame is predicted by Kalman filtering; when a target is occluded, this brings a large prediction error. A reasonable-region search method is adopted to correct part of the IDs: assume the centre of the search region is (x_c, y_c) with radius r, and that (x_d, y_d) is the centre position of the rectangular region of any detection r_d in the current frame; the detection is considered if the condition is satisfied:
where N_r is the number of frames for which the target has been in the unassociated state; the similarity is then computed from the gait features of the pedestrians within the region, and the subsequent re-matching is performed.
The foregoing is merely exemplary and illustrative of the present invention, and various modifications, additions, and substitutions may be made by those skilled in the art to the specific embodiments described without departing from the scope of the invention as defined in the appended claims.
Claims (6)
1. A method of multi-lens pedestrian recognition and tracking based on computer vision, comprising the following steps:
step 1: deployment of camera devices and video stream capture
Cameras are arranged at the monitored area's important entrances, along its paths, at fork junctions, and at other key locations, and are used for tracking and identifying pedestrians; when a given camera or video stream is accessed, streams are pulled over the intranet's internal switching network using the RTSP access protocol, with a limit set on the number of access connections;
step 2: tracking module under single camera
Given a segment of video, the JDE model processes each frame and outputs bounding boxes together with their corresponding appearance embeddings; an association matrix is computed between the embeddings of the observations and the embeddings in the pool of pre-existing tracklets; the observations are assigned to the tracklets using the Hungarian algorithm; a Kalman filter smooths the trajectories and predicts the position of each previous tracklet in the current frame; if an assigned observation is spatially too far from the predicted location, the assignment is rejected; the embedding of each matched tracklet is then updated, and a tracklet with no assigned observation is flagged as missing; if a tracklet stays missing for longer than a given threshold, it is deleted from the current tracklet pool, otherwise it may be found again in a later assignment step. JDE is an early attempt to jointly learn the detector and the embedding model in a single deep network: the network simultaneously outputs the detection results and the corresponding appearance embeddings of the detection boxes. By contrast, the SDE method and the two-stage method resample pixels (bounding boxes) and feature maps, respectively, and feed them into a separate re-ID model to extract appearance features. The loss function of JDE adopts the trihard loss, which builds on the triplet loss by accounting for the contribution of hard samples to the final loss. Suppose two input pictures I1 and I2 yield features f1 and f2 through forward propagation of the network; the Euclidean distance between the feature vectors of the two pictures is:
d_{I1,I2} = ||f1 - f2||_2
Picture a and picture p form a positive sample pair, while picture a and picture n form a negative sample pair; the triplet loss is:
L_t = (d_{a,p} - d_{a,n} + α)_+
where (x)_+ denotes max(x, 0), and α is a manually set margin (threshold) parameter.
Trihard loss:
For each training batch, P pedestrian IDs are selected at random, and K different pictures are selected at random for each of these pedestrians, so one batch contains P×K pictures; A is the set of pictures with the same ID as anchor a, and B is the set of pictures with the remaining IDs;
and step 3: re _ ID combined single-lens pedestrian with same ID
Global features are combined with multi-granularity local features: the global feature is responsible for extracting the common, macro-level characteristics of the whole image, and the image is then cut into different blocks, each with a different granularity, responsible for extracting features at a different level or scale; combining the global and local features provides rich information and detail to represent the complete condition of an input picture;
step 4: Cross-camera tracking
The mean of the pedestrian features over a whole track is taken as that track's feature, and re-ranking is then adopted to further optimise the ordering; for the cross-camera tracking problem, a binary-tree search traversal is adopted, in which the feature matching uses the extracted re-ID features, whose quality depends on the robustness of the model;
step 5: Re_label module for resolving pedestrian ID-SWITCH
Pedestrians at different positions monitored by multiple surveillance cameras are linked; when visual appearance features are unreliable, for example because a pedestrian wears a mask or changes make-up, analysis based on behavioural features such as gait becomes an alternative solution to the person re-identification problem. Analysis of the single-shot tracking results shows that, after tracks with the same ID are merged through the re-ID features, a portion of ID-SWITCH cases remain in which the IDs of two people are interchanged when they meet; these IDs must be resolved by a correction module. The module comprises two sub-modules, target-object gait-feature extraction and object re-identification: the gait-feature extraction sub-module detects the object by foreground detection and extracts the target object's silhouette, gait cycle, and walking angle, and finally the fused gait features are sent to the object re-identification sub-module; the re-identification sub-module finds, within the candidate object set, the top three re-identification results that most closely match the target object.
2. The computer vision based multi-lens pedestrian recognition and tracking method as claimed in claim 1, wherein the cameras deployed in step 1 comprise a plurality of horizontally arranged cameras, and the cameras can rotate at a wide angle and do not interfere with each other.
3. The computer-vision-based multi-lens pedestrian recognition and tracking method according to claim 1, wherein in step 3 the image is cut into different blocks, each with a different granularity, each block being assigned a different value n (n = 1, 2, 3, …), with different values of n corresponding to features at different levels, and the level of detail increasing as n increases.
4. A computer vision based multi-lens pedestrian recognition and tracking method in accordance with claim 1, wherein in step 4, in addition to three overlapping camera pairs, a plurality of single camera tracks are merged during cross-camera tracking.
5. The computer vision-based multi-lens pedestrian recognition and tracking method according to claim 1, wherein the gait feature extraction in step 5 is implemented as follows:
1) preprocess the picture, normalizing the pedestrian ROI and scale;
2) extract the GEI (gait energy image) of the picture;
3) extract HOG features from the GEI;
4) construct training and test sets and train an SVM classifier;
5) label pedestrians using the trained model.
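Step 2) above, computing the GEI, can be sketched as the per-pixel mean of aligned binary silhouettes over one gait cycle; representing silhouettes as nested 0/1 lists is an illustrative simplification (steps 3)-5) would typically use library routines such as `skimage.feature.hog` and `sklearn.svm.SVC`, which are assumptions here, not named in the claim):

```python
def gait_energy_image(silhouettes):
    """GEI: the per-pixel mean of size-normalised binary silhouettes
    (equal-sized 2-D 0/1 grids) accumulated over one gait cycle."""
    n = len(silhouettes)
    h, w = len(silhouettes[0]), len(silhouettes[0][0])
    gei = [[0.0] * w for _ in range(h)]
    for s in silhouettes:
        for y in range(h):
            for x in range(w):
                gei[y][x] += s[y][x] / n
    return gei
```

Pixels that stay foreground through the whole cycle approach 1.0, while pixels swept only briefly by the limbs take intermediate values, which is what makes the GEI a compact motion descriptor.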
6. The computer vision based multi-lens pedestrian recognition and tracking method according to claim 1, wherein the ID correction in step 5 is implemented by predicting the spatial position of each target in every frame through Kalman filtering; when the target is occluded, the prediction error grows large, so partial ID correction is performed by searching a reasonable area. Assume the center of the search area is (x_c, y_c) with radius r, and (x_d, y_d) is the center of the rectangular region of any detection r_d in the current frame; if the corresponding condition is satisfied (N_r being the number of frames for which the target has been in an unassociated state), the similarity is then calculated from the gait features of the pedestrians in the area and the subsequent re-matching is performed.
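The search-area test described above can be sketched as follows; since the claim does not publish the exact condition, both the Euclidean-distance form and the idea of widening the radius with the number of unassociated frames N_r are assumptions for illustration:

```python
import math

def in_search_region(xc, yc, xd, yd, base_radius, n_r, growth=1.1):
    """Decide whether a detection centre (xd, yd) lies inside the search
    region around the Kalman-predicted position (xc, yc).  Growing the
    radius with n_r (frames spent unassociated) is an assumed heuristic:
    the longer a track has been lost, the wider the plausible region."""
    r = base_radius * growth ** n_r  # widen the region as the track ages
    return math.hypot(xd - xc, yd - yc) <= r
```

Detections passing this gate would then be compared by gait-feature similarity to decide the re-match.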
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011013830.XA CN112132873A (en) | 2020-09-24 | 2020-09-24 | Multi-lens pedestrian recognition and tracking based on computer vision |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112132873A true CN112132873A (en) | 2020-12-25 |
Family
ID=73839188
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011013830.XA Pending CN112132873A (en) | 2020-09-24 | 2020-09-24 | Multi-lens pedestrian recognition and tracking based on computer vision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112132873A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113674321A (en) * | 2021-08-25 | 2021-11-19 | Yanshan University | Cloud-based multi-target tracking method under surveillance video
CN114120373A (en) * | 2022-01-24 | 2022-03-01 | Suzhou Inspur Intelligent Technology Co., Ltd. | Model training method, device, equipment and storage medium
CN114642863A (en) * | 2022-03-16 | 2022-06-21 | Wenzhou University | Outdoor sports game system for kindergarten
CN115100591A (en) * | 2022-06-17 | 2022-09-23 | Harbin Institute of Technology | Multi-target tracking and target re-identification system and method based on joint learning
CN117253283A (en) * | 2023-08-09 | 2023-12-19 | China Three Gorges University | Wheelchair following method based on fusion of image information and electromagnetic positioning information data
Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101488185A (en) * | 2009-01-16 | 2009-07-22 | Harbin Engineering University | Partitioned matrix-based gait recognition method
CN104794449A (en) * | 2015-04-27 | 2015-07-22 | Qingdao University of Science and Technology | Gait energy image acquisition method based on human body HOG (histogram of oriented gradients) features and identity recognition method
CN204706039U (en) * | 2015-05-29 | 2015-10-14 | Hangzhou Shengyuan Chip Technology Co., Ltd. | Bar code recognition device based on multiple lenses
CN108509859A (en) * | 2018-03-09 | 2018-09-07 | Nanjing University of Posts and Telecommunications | Non-overlapping-region pedestrian tracking method based on a deep neural network
CN108875588A (en) * | 2018-05-25 | 2018-11-23 | Wuhan University | Cross-camera pedestrian detection and tracking based on deep learning
CN108921019A (en) * | 2018-05-27 | 2018-11-30 | Beijing University of Technology | Gait recognition method based on GEI and TripletLoss-DenseNet
CN109271888A (en) * | 2018-08-29 | 2019-01-25 | Hanwang Technology Co., Ltd. | Gait-based identity recognition method, apparatus and electronic device
CN109389017A (en) * | 2017-08-11 | 2019-02-26 | Suzhou Institute of Trade and Commerce | Pedestrian re-identification method
CN109472191A (en) * | 2018-09-17 | 2019-03-15 | Xidian University | Pedestrian re-identification and tracking method based on spatio-temporal context
CN109635695A (en) * | 2018-11-28 | 2019-04-16 | Xi'an University of Technology | Pedestrian re-identification method based on triplet convolutional neural networks
CN109903312A (en) * | 2019-01-25 | 2019-06-18 | Beijing University of Technology | Method for counting the running distance of football players based on video multi-target tracking
AU2017279658A1 (en) * | 2017-12-20 | 2019-07-04 | Canon Kabushiki Kaisha | Pose-aligned descriptor for person re-id with geometric and orientation information
CN109993116A (en) * | 2019-03-29 | 2019-07-09 | Shanghai University of Engineering Science | Pedestrian re-identification method based on skeleton mutual learning
CN110223329A (en) * | 2019-05-10 | 2019-09-10 | Huazhong University of Science and Technology | Multi-camera multi-object tracking method
CN110619268A (en) * | 2019-08-07 | 2019-12-27 | Beijing Institute of New Technology Applications | Pedestrian re-identification method and device based on spatio-temporal analysis and depth features
WO2020098158A1 (en) * | 2018-11-14 | 2020-05-22 | Ping An Technology (Shenzhen) Co., Ltd. | Pedestrian re-identification method and apparatus, and computer-readable storage medium
CN111259786A (en) * | 2020-01-14 | 2020-06-09 | Zhejiang University | Pedestrian re-identification method based on synchronous enhancement of video appearance and motion information
Non-Patent Citations (10)
Title |
---|
ALEXANDER HERMANS et al.: "In Defense of the Triplet Loss for Person Re-Identification", ARXIV *
GUANSHUO WANG et al.: "Learning Discriminative Features with Multiple Granularities", ARXIV *
WANG T Q et al.: "Person Re-identification by Video Ranking", EUROPEAN CONFERENCE ON COMPUTER VISION (ECCV 2014), 31 December 2014, pages 688-703 *
XINLIANG TANG et al.: "Research on the pedestrian re-identification method based on local features and gait energy images", COMPUTERS, MATERIALS & CONTINUA, vol. 64, no. 2 *
ZHIMENG ZHANG et al.: "Multi-Target, Multi-Camera Tracking by Hierarchical Clustering", ARXIV:1712.09531V1, pages 1-4 *
ZHONGDAO WANG, LIANG ZHENG: "Towards Real-Time Multi-Object Tracking", ARXIV, pages 1-9 *
ZHANG LIANG et al.: "Research on person re-identification with multi-granularity feature fusion", Chinese Journal of Liquid Crystals and Displays, vol. 35, no. 6 *
ZHAO LINGZHEN et al.: "A survey of person re-identification technology", Journal of Guizhou Normal University (Natural Science Edition), vol. 37, no. 6, 31 December 2019, pages 114-122 *
SHAO JIAYAO, SONG CHUNLIN: "Research on a cross-camera person re-identification algorithm based on gait recognition", Information Technology and Informatization, vol. 2019, no. 12, pages 85-89 *
CHEN LIANGYU, LI WEIJIANG: "Person re-identification with multi-shape local-region neural network structures", Journal of Image and Graphics, vol. 24, no. 11, pages 1932-1941 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112132873A (en) | Multi-lens pedestrian recognition and tracking based on computer vision | |
Zhang et al. | Learning semantic scene models by object classification and trajectory clustering | |
CN103824070B (en) | A kind of rapid pedestrian detection method based on computer vision | |
Nascimento et al. | Trajectory classification using switched dynamical hidden Markov models | |
CN103246896B (en) | A kind of real-time detection and tracking method of robustness vehicle | |
CN102243765A (en) | Multi-camera-based multi-objective positioning tracking method and system | |
Nguyen et al. | Lmgp: Lifted multicut meets geometry projections for multi-camera multi-object tracking | |
Amosa et al. | Multi-camera multi-object tracking: a review of current trends and future advances | |
Choe et al. | Traffic analysis with low frame rate camera networks | |
Bhola et al. | Real-time pedestrian tracking based on deep features | |
Xu et al. | Smart video surveillance system | |
Vora et al. | Bringing generalization to deep multi-view pedestrian detection | |
Sio et al. | Multiple fisheye camera tracking via real-time feature clustering | |
CN106023252A (en) | Multi-camera human body tracking method based on OAB algorithm | |
CN113361392B (en) | Unsupervised multi-mode pedestrian re-identification method based on camera and wireless positioning | |
CN115188081A (en) | Complex scene-oriented detection and tracking integrated method | |
Jiang et al. | Vehicle tracking with non-overlapping views for multi-camera surveillance system | |
Peng et al. | Continuous vehicle detection and tracking for non-overlapping multi-camera surveillance system | |
Streib et al. | Extracting Pathlets FromWeak Tracking Data | |
Xiang et al. | Action recognition for videos by long-term point trajectory analysis with background removal | |
Moustafa et al. | Gate and common pathway detection in crowd scenes using motion units and meta-tracking | |
Prezioso et al. | Integrating Object Detection and Advanced Analytics for Smart City Crowd Management | |
Li et al. | Robust Construction of Spatial-Temporal Scene Graph Considering Perception Failures for Autonomous Driving | |
Zou et al. | A moving vehicle segmentation method based on clustering of feature points for tracking at urban intersection | |
Khel et al. | Realtime Crowd Monitoring—Estimating Count, Speed and Direction of People Using Hybridized YOLOv4 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||