CN117152826B - Real-time cross-mirror tracking method based on target tracking and anomaly detection - Google Patents

Info

Publication number
CN117152826B
CN117152826B
Authority
CN
China
Prior art keywords
image
pedestrian
target
sequence
pedestrians
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311440641.4A
Other languages
Chinese (zh)
Other versions
CN117152826A (en)
Inventor
朱博
盛智标
杨宝赢
李涛
李懋卿
丁硕
高志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Zhongke Tongda High New Technology Co Ltd
Original Assignee
Wuhan Zhongke Tongda High New Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Zhongke Tongda High New Technology Co Ltd filed Critical Wuhan Zhongke Tongda High New Technology Co Ltd
Priority to CN202311440641.4A priority Critical patent/CN117152826B/en
Publication of CN117152826A publication Critical patent/CN117152826A/en
Application granted granted Critical
Publication of CN117152826B publication Critical patent/CN117152826B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention relates to a real-time cross-mirror tracking method based on target tracking and anomaly detection. The method first applies the YOLOv3 object detection algorithm to decide whether each video frame contains the target person, and processes the first such frame to generate an initial bounding box. The Re3 object tracking algorithm then identifies the image sequence (the consecutive bounding boxes of the same target person), the DOC anomaly detection algorithm (from "Learning Deep Features for One-Class Classification") selects the best representative image for each group of image sequences, and finally the SiamIDL method performs classical Re-ID detection on that representative image. The method reduces the size of the gallery while improving the image quality of the gallery, and significantly improves the detection efficiency of real-time Re-ID.

Description

Real-time cross-mirror tracking method based on target tracking and anomaly detection
Technical Field
The invention relates to the technical field of computer vision image processing, in particular to a real-time cross-mirror tracking method based on target tracking and anomaly detection.
Background
Cross-mirror tracking is a popular term for pedestrian re-identification (Re-ID) technology. The technology can rapidly judge and analyze whether a target appears in the pictures of multiple camera lenses, while recording the time and position of each appearance. It is widely applied in scenes such as security event detection, evidence collection in public security cases, and tracking of specific target personnel.
A typical real-time Re-ID system consists of two main modules: (1) a gallery generator, which extracts bounding boxes (the marks produced during human body recognition) of target persons to compose an image gallery; (2) a classical Re-ID module, which identifies and queries the cropped images in the gallery.
However, current real-time Re-ID approaches suffer from two drawbacks: (1) the generated gallery is too large and consumes excessive resources; (2) gallery generation easily produces many poor-quality images, and the small errors these images introduce can strongly affect the detection result.
Disclosure of Invention
In view of the above, the invention provides a real-time cross-mirror tracking method based on target tracking and anomaly detection, which can reduce the size of the gallery, improve the image quality of the gallery, and improve the detection efficiency of real-time Re-ID.
The technical scheme for solving the technical problems is as follows:
the real-time cross-mirror tracking method based on target tracking and anomaly detection comprises the following steps:
s1, decomposing an original video stream to generate a group of continuous image sequences possibly containing the same person;
s2, judging pedestrians in each image in the image sequence to judge whether target personnel exist in each image, and if the target personnel exist in the current image, marking the image as an effective image;
s3, selecting target personnel in each effective image by a frame, generating an image boundary frame sequence, outputting all images with boundary frames into a group of track image fragment lists in a list form, and storing all complete fragments;
s4, analyzing the track segments, selecting an optimal representative image for each image sequence, and putting the optimal representative image into an image search library;
s5, analyzing the best representative image, calculating and inquiring similarity scores between the target person in the image and the target person in the gallery image, and outputting a detection conclusion.
As a preferable scheme: in step S2, a YOLOv3 target detection module processes the first image of the searched image sequence to generate an initial bounding box, which is used to initialize the target tracker; an image judged valid enters the target tracking module for continued analysis, otherwise the target detection module continues to judge the next image.
As a preferable scheme: in step S3, a target tracking module using the Re3 algorithm operates on the valid images to generate a sequence of image bounding boxes, then outputs the result in list form as a set of track image fragment lists and saves all complete fragments.
As a preferable scheme: in step S4, the track fragments are analyzed by a DOC-based anomaly detection module, and an optimal representative image is selected for each image sequence and placed into the image search gallery for detection by the subsequent Re-ID module.
Preferably, in step S1:
(1) Decomposing an original video stream to generate an image sequence;
(2) Identifying and judging each image in the image sequence in turn to determine whether pedestrians are present in each image, and if a pedestrian is present, extracting the pedestrian's identification features;
(3) Separating all images containing pedestrians from the image sequence and collecting them according to each pedestrian's identification features to obtain a sample image sequence for each pedestrian; images without pedestrians are not collected.
As a preferable scheme: in step (2), the human body features in each image of the obtained image sequence are identified to judge whether pedestrians are present. If a pedestrian is present, the pedestrian's identification features are extracted: if the pedestrian's face can be identified, the facial features and gait features are extracted; if the face cannot be identified, the back-shape features and gait features are extracted. If no pedestrian is present in the group of image sequences, the next group of image sequences is judged.
As a preferable scheme: in step (3), when collecting images, the same pedestrian is identified by facial features and by back-shape features; the gait features of a pedestrian facing away from the camera are compared with those of a pedestrian facing the camera, and if they are consistent, the two are regarded as the same pedestrian, so that the front and back images of the same pedestrian are collected together.
As a preferable scheme: the back-shape features include the ratio of head width to shoulder width.
As a preferable scheme: the gait features include the ratio of the distance between the center of the pedestrian's head and the center of the shoulders to the vertical bobbing distance of the crown while walking, and the period of the walking bob.
Compared with the prior art, the technical scheme of the application has the following beneficial technical effects: the method first applies the YOLOv3 object detection algorithm to decide whether each video frame contains the target person and processes the first such frame to generate an initial bounding box. The Re3 object tracking algorithm then identifies the image sequence (the consecutive bounding boxes of the same target person), the DOC anomaly detection algorithm (from "Learning Deep Features for One-Class Classification") selects the best representative image for each group of image sequences, and finally the SiamIDL method performs classical Re-ID detection on that representative image. The method reduces the size of the gallery while improving the image quality of the gallery, and significantly improves the detection efficiency of real-time Re-ID.
Drawings
FIG. 1 is a schematic flow chart of a method in a first embodiment;
FIG. 2 is a schematic rear view of a pedestrian in the second embodiment.
Detailed Description
Referring to fig. 1, a real-time cross-mirror tracking method based on target tracking and anomaly detection includes the following steps:
step one, the original video stream is decomposed to generate a group of continuous image sequences possibly containing the same pedestrian.
An image sequence is a set of images separated from the video stream at a certain time interval, one image per sampled video frame; the number of pictures in an image sequence may be 10, 20, etc., determined by the specific requirements. In this embodiment, a person may roughly judge the video frames in an image sequence to decide whether the group of image sequences contains the same pedestrian.
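As a rough illustration, this decomposition can be written with OpenCV as below; the sampling stride and group size are illustrative values to be set per deployment (the text suggests 10 or 20 images per sequence).

import cv2

def split_video(path, stride=5, group_size=20):
    """Decompose a video stream into consecutive image sequences (step one)."""
    cap = cv2.VideoCapture(path)
    sequences, current, index = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:            # keep one frame per time interval
            current.append(frame)
            if len(current) == group_size: # a full sequence is complete
                sequences.append(current)
                current = []
        index += 1
    cap.release()
    if current:
        sequences.append(current)          # keep the trailing, shorter sequence
    return sequences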
And secondly, judging the pedestrians in each image of the image sequence to determine whether the target person is present; if the target person is present in the current image, the image is marked as a valid image.
In this embodiment, a YOLOv3 pedestrian detection model is used to determine pedestrians in an image sequence.
The role of the YOLOv3 pedestrian detection model is to find a predefined class (referred to as the target person in this method) in the image and to locate the target object among many candidate targets. The method uses a pre-trained YOLOv3 model with a Darknet-53 backbone implemented in TensorFlow, whose multi-scale prediction capability allows it to detect small objects.
In this embodiment, the YOLOv3 pedestrian detection model is predefined and the category it detects is defined as the target person. Each image of the consecutive image sequence is input into the model, which searches for and identifies the pedestrians in each image to detect whether the target person is present. If the target person is present in an image, the image is marked as a valid image; otherwise the YOLOv3 pedestrian detection model continues to judge the next image, and after several iterations the model has selected all images judged valid. The YOLOv3 pedestrian detection model then frame-selects the target person in the first valid image to generate an initial bounding box, which is used to initialize the target tracking model.
The YOLO model is called a unified or single-stage detector: it predicts the bounding boxes and class probabilities of a complete image directly, in a single forward pass through convolutional neural networks. All versions of YOLO divide the image into grids, assign each cell a class probability and an associated confidence score, and predict the locations of the bounding boxes.
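A minimal sketch of the per-image person check, assuming a Darknet YOLOv3 loaded through OpenCV's DNN module and trained on COCO (class id 0 = 'person'); the file names and thresholds are placeholders, and the COCO person class stands in for the patent's predefined target-person class.

import cv2
import numpy as np

def detect_person(image, net, out_names, conf_thr=0.5):
    """Return the highest-confidence 'person' box as (x, y, w, h), or None."""
    h, w = image.shape[:2]
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    best, best_conf = None, conf_thr
    for out in net.forward(out_names):
        for det in out:                    # det = [cx, cy, bw, bh, objectness, class scores...]
            scores = det[5:]
            if np.argmax(scores) == 0:     # class 0 is 'person' in COCO
                conf = det[4] * scores[0]
                if conf > best_conf:
                    cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                    best = (int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh))
                    best_conf = conf
    return best

# Illustrative setup (weight/config file names assumed):
# net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
# out_names = net.getUnconnectedOutLayersNames()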
Thirdly, frame-selecting the target person in the valid images, generating a sequence of image bounding boxes, outputting all images with bounding boxes in list form as a set of track image fragment lists, and saving all complete fragments.
A target tracking model using the Re3 algorithm (hereinafter the Re3 target tracking model) is used in this embodiment. The Re3 algorithm (a regression-based tracker) is an accurate, general-purpose target tracking algorithm. It uses convolutional layers to embed the appearance of the object, recurrent layers to recall the object's appearance and motion, and a regression layer to output the object's position. The Re3 algorithm initially requires a bounding box around the object to be tracked and generates bounding boxes in subsequent frames.
In this step, the initialization of the Re3 target tracking model is completed using the initial bounding box; the Re3 model then runs on the subsequent valid images and automatically generates a sequence of image bounding boxes. The group of framed track images anchored by the Re3 target tracking model is then output in list form as a set of track image fragment lists, and all complete fragments are saved.
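A sketch of this step, assuming the tracker follows the convention of the public Re3 reference implementation — track(id, image, box) to initialize and track(id, image) thereafter; that interface is an assumption here, not a confirmed API.

def track_segment(valid_images, initial_bbox, tracker, track_id="person-0"):
    """Build one track image fragment: a list of (image, bbox) pairs."""
    boxes = [tracker.track(track_id, valid_images[0], initial_bbox)]  # init with YOLOv3 box
    for image in valid_images[1:]:
        boxes.append(tracker.track(track_id, image))                  # Re3 predicts the next box
    return list(zip(valid_images, boxes))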
And fourthly, analyzing the track image segments, selecting an optimal representative image for each image sequence, and putting the optimal representative image into an image search library for subsequent detection.
In this embodiment, an abnormality detection model is used to analyze the trajectory image segment.
The anomaly detection model in this embodiment is based on the DOC anomaly detection method ("Learning Deep Features for One-Class Classification"). The DOC anomaly detection model is trained with images of the CUHK03 dataset as the target class and images of the VOC 2012 dataset as anomalies. The backbone of the DOC model's feature extractor is Inception-ResNet-v2 pre-trained on ImageNet. The DOC anomaly detection model is able to identify data that deviates significantly from normal values.
The DOC method is used for anomaly detection because normal data follows a Gaussian distribution while abnormal data does not; the goal of model training is therefore to represent 'normal' accurately, points deviating strongly from the model being anomalies. Only normal data is used for training, and the gap between an object and the normal data is then measured; this process can be regarded as a one-class classification problem.
The DOC anomaly detection model is therefore a single-class classifier that, through training, can distinguish good Re-ID images from bad ones. For each image, the DOC method generates a score indicating its suitability for Re-ID: the closer the image is to a normal image, the higher the score.
In this step, the anomaly detection module is run on every image of a track picture fragment; it generates a score for each image, and the score is then used to select the best image of the image sequence, i.e., the image with the highest score.
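Selecting the representative image then reduces to an argmax over the DOC scores; doc_score below is a placeholder for the trained DOC model, mapping a cropped pedestrian image to a normality score.

import numpy as np

def best_representative(segment, doc_score):
    """Pick the highest-scoring crop of one track fragment (step four)."""
    crops = [img[y:y + h, x:x + w] for img, (x, y, w, h) in segment]
    scores = np.array([doc_score(c) for c in crops])
    return crops[int(scores.argmax())]     # highest score = closest to a normal image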
And fifthly, analyzing the optimal images and calculating their similarity scores against the gallery images to obtain a detection conclusion.
Before the query command starts, a picture confirmed to show the target person is input as the detection reference. The query action then begins: the classical Re-ID method SiamIDL analyzes the best representative images, the similarity scores between the query image and the gallery images are calculated, and a detection conclusion is output.
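The text does not spell out SiamIDL's scoring head, so the sketch below uses generic cosine similarity between Re-ID embeddings as a stand-in: both images are assumed already embedded by the Re-ID backbone.

import numpy as np

def similarity_score(query_feat, gallery_feat):
    """Cosine similarity between a query embedding and a gallery embedding."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feat / np.linalg.norm(gallery_feat)
    return float(q @ g)                    # 1.0 = identical direction, -1.0 = opposite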
The scheme can reduce the size of the gallery and improve the image quality of the gallery, and remarkably improve the detection efficiency of the real-time Re-ID.
Effect verification:
(1) Description of evaluation index
To evaluate the performance of TrADe Re-ID on the real-time PRID dataset, we used the following evaluation criteria:
Index 1, Finding Rate (FR): the proportion of short videos for which a query request is correctly returned. A low FR occurs when queries frequently fail (return no results).
Index 2, True Validation Rate (TVR): the proportion of raised alarms in which the queried person is among the presented candidates. A low TVR occurs when monitoring personnel are frequently subjected to unreasonable interference.
Index 3, mean average precision (mAP), defined by calculating the area under the TVR-versus-FR curve.
Index 4, the optimal F1 score of FR and TVR. The score is calculated in the same way as the F-score of precision and recall commonly used for machine learning algorithms; the difference is that F1 here denotes the optimal F1 score, the value at which the real-time Re-ID procedure works best.
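Assuming the F1 score is computed like the familiar precision/recall F-score, as the text states, index 4 at a given operating threshold is the harmonic mean of FR and TVR; the optimal F1 is this value maximized over thresholds.

def f1_of(fr, tvr):
    """Harmonic mean of Finding Rate and True Validation Rate (index 4)."""
    return 0.0 if fr + tvr == 0 else 2 * fr * tvr / (fr + tvr)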
(2) Description of the Experimental results
Experimental results based on the real-time PRID dataset are as follows.
It can be seen that the reduced gallery size of this method, compared with using only the classical Re-ID method (SiamIDL), brings a significant performance improvement. The performance of the present method is also almost always better than that of SiamIDL used alone, which means that selecting the best image to represent each image sequence with the anomaly detection module has a significant effect on real-time Re-ID. Overall, these results demonstrate that the present method is a viable way to solve the real-time Re-ID problem and optimize its performance.
Example two
This embodiment differs from the first embodiment in the implementation of step S1.
Specifically, in this embodiment, whether a group of image sequences contains the same pedestrian is determined by an automated procedure rather than by rough manual judgment.
Manual judgment is not only laborious; because it usually identifies each pedestrian by facial features, the operator often loses track of a pedestrian when pedestrians in the images are dense and highly mobile (e.g., surveillance video of stations, airports and other public places) and the pedestrian turns around. As a result, an image sequence assembled by rough manual judgment often loses many key images that may contain the same pedestrian.
In this embodiment, in step S1:
(1) The original video stream is decomposed to generate a sequence of images.
(2) And sequentially identifying and judging each image in the image sequence to judge whether pedestrians exist in each image, and extracting identification features if pedestrians exist.
(3) And separating all images with pedestrians from the image sequence, collecting the images with pedestrians according to the identification characteristics of each pedestrian so as to obtain a sample image sequence of each pedestrian, wherein the images without the pedestrians are not collected.
The sample image sequence of each pedestrian can then be input into the YOLOv3 pedestrian detection model to judge whether the pedestrian in the sample image sequence is the target person.
Specifically, in step (2), the human body features in each image of the obtained image sequence are identified to determine whether a pedestrian is present. If a pedestrian is present, the pedestrian's identification features are extracted: if the pedestrian's face can be identified, the facial features (including the front and side of the face) and gait features are extracted; if the face cannot be identified, the back-shape features and gait features are extracted. If no pedestrian is present in the group of image sequences, the next group of image sequences is judged.
As is apparent from the above description, the identification features of a pedestrian in this embodiment include facial features, back-shape features, and gait features.
Referring to fig. 2, the back-shape feature in this embodiment includes the ratio K of head width W1 to shoulder width W2; the gait features include the ratio P of the distance from the pedestrian's head center O1 to the shoulder center O2 to the vertical bobbing distance of the crown while walking, and the walking bob period T.
A person's height bobs up and down while walking; each person has their own walking habits, and each person's body proportions are unique. For the same pedestrian, the ratio P of the head-center-to-shoulder-center distance to the vertical bobbing distance while walking, and the duration T of one bobbing period, are essentially constant. In a video picture, the true head width, shoulder width, head-shoulder center distance and height bob are difficult to measure because they change with the distance between the pedestrian and the camera; but for the same pedestrian, the K, P and T values across consecutive video frames are fixed, so the walking K, P and T values can serve as a means of identifying the same pedestrian across consecutive video pictures.
The length and width of each pixel point are defined as 1.
For a pedestrian facing away from the camera, the head can be recognized when the pedestrian is first identified. After recognition, the head contour is extracted, the row of pixels with the largest lateral span within the head region is selected (e.g., the row at the horizontal dashed line across the head in fig. 2), and that row is scanned to obtain its pixel count n1. Similarly, the pedestrian's shoulders are identified; after recognition, the shoulder contour is extracted, the row of pixels with the largest lateral span within the shoulder region is selected (e.g., the row at the horizontal line across the shoulders in fig. 2), and that row is scanned to obtain its pixel count n2.
Assuming the pedestrian's head width is W1 and shoulder width is W2, then W1/W2 = n1/n2, i.e., K = n1/n2.
Meanwhile, the midpoint O1 of the head's horizontal dashed line and the midpoint O2 of the shoulders' horizontal dashed line are selected, and the pixel distance between O1 and O2, X pixels, is calculated from their pixel coordinates. The coordinates of O1 are located over several consecutive frames to obtain the maximum vertical displacement of O1, i.e., its vertical pixel span. For example, if the maximum vertical pixel span of O1 is S pixels, and the distance between the pedestrian's head center and shoulder center is L1 while the height bobbing distance during walking is L2, then P = L1/L2 = X/S.
In consecutive frames, the two adjacent time points t1 and t2 at which O1 reaches its highest position are acquired; then T = t2 − t1.
The K, P and T values of each pedestrian in the video picture can be obtained by the above means.
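The three identifiers reduce to simple ratios of pixel measurements. The sketch below assumes the quantities n1, n2, X, S and the peak times t1, t2 have already been extracted from the frames as described above.

def gait_signature(n1, n2, x_px, s_px, t1, t2):
    """Compute the K, P, T identifiers from pixel measurements (fig. 2)."""
    K = n1 / n2        # K = W1/W2 = n1/n2 (head width over shoulder width)
    P = x_px / s_px    # P = L1/L2 = X/S (head-shoulder distance over crown bob)
    T = t2 - t1        # one walking bob period
    return K, P, T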
After the video stream is decomposed into consecutive image sequences, all images containing pedestrians are separated from the image sequences by identifying the human body features in each image, and the identification features of every pedestrian in the separated images are then extracted. For any pedestrian whose facial features can be extracted, the gait features are extracted at the same time (corresponding to a pedestrian facing the camera); for any pedestrian whose back-shape features can be extracted, the gait features are likewise extracted at the same time (corresponding to a pedestrian facing away from the camera).
In step (3), when collecting images, the same pedestrian is identified by facial features and by back-shape features. The gait features of a pedestrian facing away from the camera are compared with those of a pedestrian facing the camera; if they are consistent, the two are regarded as the same pedestrian, so that the front and back images of the same pedestrian are collected together.
Combining the facial features, back-shape features and gait features in a joint judgment, all images containing a given pedestrian (both front and back images) are collected along a timeline to obtain that pedestrian's sample image sequence. In this way all images of every pedestrian are obtained, missed judgments are unlikely, and the resulting image sequences carry a more complete amount of information about each pedestrian.
After the obtained sample image sequence is input into the pedestrian detection model, it is easier for the model to judge whether the pedestrian in the sample image sequence is the target person, achieving a better cross-mirror tracking effect.
By the above means, each pedestrian is in effect taken as a unit, and the images containing that pedestrian are automatically separated from the original image sequence and collected as a sample image sequence. The image sequences generated in this way are targeted and more accurate; the traditional mode of manual judgment is replaced, operating efficiency is improved, manual workload is reduced, and missed judgments are avoided.
In this embodiment, in order to further improve the recognition efficiency of the subsequent pedestrian detection module for pedestrians in the sample image sequence, the following measures may be taken:
before the sample image sequence is input into the pedestrian detection model, one image containing the facial features of the pedestrian is extracted from the sample image sequence, and the image is taken as the first image of the sample image sequence, so that a new sample image sequence is obtained.
And inputting the new sample image sequence into a pedestrian detection model for detection and identification.
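A sketch of this reordering, with has_face standing in for any face-detection predicate (a hypothetical helper, not one of the method's named models):

def face_first(sample_sequence, has_face):
    """Move one face-bearing image to the front of a sample sequence."""
    for i, image in enumerate(sample_sequence):
        if has_face(image):
            return [image] + sample_sequence[:i] + sample_sequence[i + 1:]
    return sample_sequence                 # no face found: keep the original order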
After this new sample image sequence is input into the pedestrian detection model, the model can recognize the pedestrian's most distinctive identification feature (the facial features) when recognizing the first image, and can thus judge at the first moment whether the pedestrian is the target person predefined in the model.
The foregoing describes preferred embodiments of the invention and is not intended to limit the invention; any modifications, equivalent substitutions and improvements made within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (3)

1. The real-time cross-mirror tracking method based on target tracking and anomaly detection is characterized by comprising the following steps of:
s1, (1) decomposing an original video stream to generate an image sequence; (2) Identifying human body characteristics in each image in the image sequence in sequence to judge whether pedestrians exist in each image; extracting the identification characteristics of each pedestrian if the pedestrian exists, extracting the facial characteristics and the gait characteristics of the pedestrian if the face of the pedestrian can be identified, and extracting the back face body characteristics and the gait characteristics of the pedestrian if the face of the pedestrian cannot be identified, wherein the back face body characteristics comprise the ratio of the head width to the shoulder width, and the gait characteristics comprise the ratio of the distance between the head center and the shoulder center of the pedestrian and the head top fluctuation distance and the walking fluctuation period when walking; if no pedestrian exists in the image sequences of the group, judging the image sequences of the next group; (3) Separating all images with pedestrians from the image sequence, collecting the images with pedestrians according to the identification characteristics of each pedestrian so as to obtain a sample image sequence of each pedestrian, wherein the images without the pedestrians are not collected; identifying the same pedestrian according to the facial features when collecting the images, identifying the same pedestrian according to the back face shape features, comparing the gait features of the pedestrian facing away from the camera with the gait features of the pedestrian facing towards the camera, and if the gait features are consistent, considering the pedestrian as the same pedestrian, so as to collect the front face image and the back face image of the same pedestrian together; generating a set of consecutive image sequences that may contain the same person in the manner described above;
s2, judging pedestrians in each image in the image sequence to judge whether target personnel exist in each image, and if the target personnel exist in the current image, marking the image as an effective image;
s3, selecting target personnel in each effective image by a frame, generating an image boundary frame sequence, outputting all images with boundary frames into a group of track image fragment lists in a list form, and storing all complete fragments;
s4, analyzing the track fragments by using an anomaly detection module of the DOC, generating a score for each image anomaly detection module, wherein the score is used for evaluating the degree of approach to a normal image, the score is higher as the degree of approach is higher, and then selecting an optimal image of the image sequence by using the score, wherein the optimal image is the highest-scoring image; selecting an optimal representative image for each image sequence to be placed into an image search library for subsequent detection;
s5, analyzing the best representative image, calculating and inquiring similarity scores between the target person in the image and the target person in the gallery image, and outputting a detection conclusion.
2. The real-time cross-mirror tracking method based on object tracking and anomaly detection of claim 1, wherein the method is characterized by: in step S2, the YOLOv3 target detection module is used to process the first image of the searched image sequence to generate an initial bounding box, the initial bounding box is used to initialize the target tracker, the valid image is determined to enter the target tracking module for continuous analysis, otherwise, the target detection module continues to determine the next image.
3. The real-time cross-mirror tracking method based on target tracking and anomaly detection of claim 1, wherein the method is characterized by: the object tracking module using Re3 algorithm in step S3 operates in the valid image to generate a sequence of image bounding boxes, and then outputs the result in the form of a list as a set of track image fragment lists, and saves all the complete fragments.
CN202311440641.4A 2023-11-01 2023-11-01 Real-time cross-mirror tracking method based on target tracking and anomaly detection Active CN117152826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311440641.4A CN117152826B (en) 2023-11-01 2023-11-01 Real-time cross-mirror tracking method based on target tracking and anomaly detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311440641.4A CN117152826B (en) 2023-11-01 2023-11-01 Real-time cross-mirror tracking method based on target tracking and anomaly detection

Publications (2)

Publication Number Publication Date
CN117152826A CN117152826A (en) 2023-12-01
CN117152826B (en) 2024-03-22

Family

ID=88906651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311440641.4A Active CN117152826B (en) 2023-11-01 2023-11-01 Real-time cross-mirror tracking method based on target tracking and anomaly detection

Country Status (1)

Country Link
CN (1) CN117152826B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049751A (en) * 2013-01-24 2013-04-17 苏州大学 Improved weighting region matching high-altitude video pedestrian recognizing method
CN104361327A (en) * 2014-11-20 2015-02-18 苏州科达科技股份有限公司 Pedestrian detection method and system
CN109359577A (en) * 2018-10-08 2019-02-19 福州大学 A kind of Complex Background number detection system based on machine learning
CN109829356A (en) * 2018-12-05 2019-05-31 科大讯飞股份有限公司 The training method of neural network and pedestrian's attribute recognition approach neural network based
WO2022067606A1 (en) * 2020-09-30 2022-04-07 中国科学院深圳先进技术研究院 Method and system for detecting abnormal behavior of pedestrian, and terminal and storage medium
CN113221807A (en) * 2021-05-26 2021-08-06 新疆爱华盈通信息技术有限公司 Pedestrian re-identification method and system with multiple cameras
CN113989333A (en) * 2021-11-29 2022-01-28 之江实验室 Pedestrian tracking method based on face and head and shoulder information
CN114783037A (en) * 2022-06-17 2022-07-22 浙江大华技术股份有限公司 Object re-recognition method, object re-recognition apparatus, and computer-readable storage medium
CN115205559A (en) * 2022-06-29 2022-10-18 同济大学 Cross-domain vehicle weight recognition and continuous track construction method
CN115482569A (en) * 2022-08-29 2022-12-16 浙江大华技术股份有限公司 Target passenger flow statistical method, electronic device and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TrADe Re-ID – Live Person Re-Identification using Tracking and Anomaly Detection; Luigy Machaca et al.; arXiv:2209.06452v1 [cs.CV]; 1-6 *
Research on clothing-change-invariant pedestrian re-identification methods; Lu Riyu; China Master's Theses Full-text Database, Information Science and Technology Series; I138-2278 *

Also Published As

Publication number Publication date
CN117152826A (en) 2023-12-01

Similar Documents

Publication Publication Date Title
Aytekin et al. Railway fastener inspection by real-time machine vision
US8116534B2 (en) Face recognition apparatus and face recognition method
CN104751136B (en) A kind of multi-camera video event back jump tracking method based on recognition of face
JP5261312B2 (en) Image analysis apparatus, image analysis method, and program
CN107145862B (en) Multi-feature matching multi-target tracking method based on Hough forest
D'Angelo et al. People re-identification in camera networks based on probabilistic color histograms
CN110751022A (en) Urban pet activity track monitoring method based on image recognition and related equipment
US8855363B2 (en) Efficient method for tracking people
Kaâniche et al. Recognizing gestures by learning local motion signatures of HOG descriptors
CN108564598B (en) Improved online Boosting target tracking method
CN111696128A (en) High-speed multi-target detection tracking and target image optimization method and storage medium
CN111145223A (en) Multi-camera personnel behavior track identification analysis method
CN112257660B (en) Method, system, equipment and computer readable storage medium for removing invalid passenger flow
García-Martín et al. Robust real time moving people detection in surveillance scenarios
US11256945B2 (en) Automatic extraction of attributes of an object within a set of digital images
JP2011059898A (en) Image analysis apparatus and method, and program
JP2010231254A (en) Image analyzing device, method of analyzing image, and program
CN111401308B (en) Fish behavior video identification method based on optical flow effect
CN112132873A (en) Multi-lens pedestrian recognition and tracking based on computer vision
CN113436231B (en) Pedestrian track generation method, device, equipment and storage medium
Alagarsamy et al. Identifying the Missing People using Deep Learning Method
CN117152826B (en) Real-time cross-mirror tracking method based on target tracking and anomaly detection
CN111862147A (en) Method for tracking multiple vehicles and multiple human targets in video
CN113657169A (en) Gait recognition method, device, system and computer readable storage medium
JP2011059897A (en) Image analysis apparatus and method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant