CN112381132A - Target object tracking method and system based on fusion of multiple cameras

Info

Publication number: CN112381132A
Application number: CN202011253000.4A
Authority: CN
Inventors: 赖哲渊; 姚明江
Original assignee: SAIC Volkswagen Automotive Co Ltd
Current assignee: SAIC Volkswagen Automotive Co Ltd
Priority date: 2020-11-11
Filing date: 2020-11-11
Publication date: 2021-02-19

Abstract

The invention discloses a target object tracking method based on the fusion of a plurality of cameras, comprising the following steps: 100: extracting, in real time, information on the target objects to be tracked from images captured by a plurality of cameras; 200: inputting the image within the target object detection frame into a trained depth residual encoder to obtain a target object appearance feature code; 300: storing the position information of the target object detection frame in the image, the target object type, the target object ID, the target object appearance feature code and the corresponding timestamp as corresponding historical data; 400: predicting the current position of the target object detection frame in the image from the historical position information of the target object detection frame in the image; 500: selecting, based on a set first threshold, the candidate target objects whose Euclidean distance between detected and predicted position is smaller than the first threshold; 600: selecting, based on a set second threshold, the candidate matching target objects whose appearance cosine distance is smaller than the second threshold; 700: using the Hungarian algorithm to assign the currently detected target object to one of the candidate matching target objects, so as to realize tracking.

Description

Target object tracking method and system based on fusion of multiple cameras

Technical Field

The present invention relates to a target tracking method and system, and more particularly, to a target tracking method and system based on a camera.

Background

In recent years, with the rapid development of automatic driving technology, automatic driving automobiles have become increasingly likely to be used in daily life. Detecting and tracking objects with vehicle-mounted cameras is an important part of the perception stage of an automatic driving automobile.

At present, existing multi-object tracking methods are generally built on top of detection, and almost all of them rely on a vehicle-mounted front-view camera. The current mainstream tracking methods include optical flow tracking, and predicting object positions under a linear velocity assumption and then matching them by intersection over union (IoU).

However, the above methods have the following problems. On the one hand, when a detected object is occluded for a long time, a large estimation deviation arises, an object that has already appeared is given a new tag (ID), and IDs of different objects are likely to be exchanged. On the other hand, the field of view of a single camera is limited; in a scene with multiple cameras, such as a surround view, tracked objects are easily lost and the algorithm is hardly applicable, which in turn affects prediction and planning.

In view of this situation, and considering that an automatic driving vehicle carries a large number of cameras, it is desirable to obtain a target object tracking method based on the fusion of a plurality of cameras for the automatic driving scene of a vehicle.

Disclosure of Invention

One of the objectives of the present invention is to provide a target tracking method based on multiple camera fusion, in which re-identification models are trained for target objects encountered during automatic driving, such as vehicles and pedestrians, so as to extract appearance features of the target objects and use the appearance similarity as a matching reference, thereby improving the tracking accuracy of the target objects.

In order to achieve the above object, the present invention provides a target tracking method based on fusion of multiple cameras, which includes the steps of:

100: extracting target object information to be tracked in real time from images shot by a plurality of cameras, wherein the target object information at least comprises: the position information of the target object detection frame in the image, the image in the target object detection frame, the target object type and the target object ID;

200: inputting the image within the target object detection frame into a trained depth residual encoder so as to output a corresponding target object appearance feature code, the number of depth residual encoders corresponding to the number of target object types;

300: storing the position information of the target object detection frame in the image, the target object type, the target object ID, the target object appearance feature code and the corresponding timestamp as corresponding historical data;

400: predicting the current position of the target object detection frame in the image according to the historical position information of the target object detection frame in the image to obtain the predicted position of the target object detection frame;

500: calculating, based on the position of the currently detected target object detection frame, the Euclidean distance between that detection frame and the predicted position of each target object detection frame, and selecting as candidate target objects those whose Euclidean distance is smaller than a set first threshold;

600: calculating the cosine distance between the appearance feature code of the currently detected target object and the appearance feature code of each candidate target object, and selecting as candidate matching target objects those whose cosine distance is smaller than a set second threshold;

700: using the Hungarian algorithm to assign the currently detected target object to one of the candidate matching target objects, so as to realize tracking.

The target object tracking method based on the fusion of a plurality of cameras differs from prior tracking technology as follows. Traditional tracking technology mostly predicts and judges based on the position of a target object; across a plurality of different cameras it is difficult to determine whether two detections are the same target object, and the ID of the target object is easily lost. The method of the present invention trains re-identification separately for vehicles and pedestrians, extracts the appearance features of the target object, and uses the appearance similarity as a matching reference, so that the tracking accuracy of the target object is effectively improved.

Further, in the target tracking method based on the fusion of multiple cameras of the present invention, the target at least includes a pedestrian and a vehicle.

Further, in the target tracking method based on the fusion of multiple cameras according to the present invention, in step 400, a kalman filter is used to predict the current position of the target detection frame in the image.

Further, in the target tracking method based on the fusion of multiple cameras according to the present invention, a preprocessing step is further included between step 100 and step 200: the image within the object detection box is scaled to the input size of the depth residual encoder.

Further, in the target tracking method based on the fusion of multiple cameras of the present invention, the position information of the target detection frame in the image includes the pixel position of the center point of the detection frame and the length and width of the detection frame.

Further, in the target tracking method based on the fusion of multiple cameras according to the present invention, the method further includes step 800: when the currently detected target object is not matched with the corresponding target object from the candidate matching target objects, a new ID is given to the currently detected target object, and the currently detected target object is stored as historical data.

Further, in the target tracking method based on the fusion of the plurality of cameras, the depth residual encoder is trained by adopting an MOT pedestrian re-identification data set and a vehicle-mounted camera acquisition data set.

Accordingly, another object of the present invention is to provide a target tracking system based on fusion of multiple cameras, which can be used to implement the above-mentioned target tracking method of the present invention.

In order to achieve the above object, the present invention provides a target tracking system based on fusion of multiple cameras, which includes:

the target object detection module extracts target object information to be tracked in real time from images shot by a plurality of cameras, and the target object information at least comprises: the position information of the target object detection frame in the image, the image in the target object detection frame, the target object type and the target object ID;

a target re-identification encoding module including depth residual encoders corresponding to the number of types of the target, each depth residual encoder outputting a corresponding target appearance feature code based on the input image in the target detection frame;

the database receives the position information of the target object detection frame in the image, the target object type, the target object ID, the target object appearance feature code and the corresponding timestamp, and stores them as corresponding historical data;

the online matching and tracking module comprises a position prediction submodule, a distance calculation submodule and a Hungarian matching submodule, wherein:

the position prediction sub-module predicts the current position of the target object detection frame in the image based on the historical position information of the target object detection frame in the image stored in the database so as to obtain the predicted position of the target object detection frame;

the distance calculation sub-module calculates the Euclidean distance between the position of the currently detected target object detection frame and the predicted position of each target object detection frame, and selects from the database, as candidate target objects, those whose Euclidean distance is smaller than a set first threshold; it then calculates the cosine distance between the appearance feature code of the currently detected target object and the appearance feature code of each candidate target object, and selects from the database, as candidate matching target objects, those whose cosine distance is smaller than a set second threshold;

and the Hungarian matching submodule performs matching assignment on the current detected target object from the candidate matching target objects by adopting a Hungarian algorithm so as to realize the tracking work.

Further, in the target tracking system based on the fusion of the plurality of cameras, the database includes a matching database and a cloud database, the cloud database stores all historical data of the target, and the matching database stores historical data of a set number of frames of the target.

Further, in the target tracking system based on the fusion of multiple cameras of the present invention, the target re-identification encoding module further includes a preprocessing sub-module, and the preprocessing sub-module scales the image in the target detection frame to the input size of the depth residual encoder.

Compared with the prior art, the target object tracking method and system based on the fusion of the plurality of cameras have the following advantages and beneficial effects:

(1) the invention provides a cross-camera multi-object tracking method suitable for an automatic driving scene of a vehicle, which can realize tracking of a wider field of view by utilizing the spatial arrangement of a plurality of cameras;

(2) the depth residual encoders in the target object re-identification encoding module can be trained separately on pedestrian and vehicle re-identification data sets, so that each encoder takes the image within the target object detection frame as input and outputs the corresponding target object appearance feature code, which improves the tracking accuracy of the target object.

Drawings

Fig. 1 schematically shows the overall modules of the tracking algorithm of a target tracking system based on multiple camera fusion in an embodiment of the invention.

Fig. 2 schematically shows a neural network training diagram of a target re-identification coding module of the target tracking system based on fusion of multiple cameras in an embodiment of the present invention.

Fig. 3 schematically shows a database module diagram of a target object tracking system based on multiple camera fusion according to an embodiment of the present invention.

Fig. 4 schematically shows a flowchart of steps of a target tracking method based on multi-camera fusion according to an embodiment of the present invention.

Detailed Description

The target tracking method and system based on multi-camera fusion according to the present invention will be further explained below with reference to the drawings and specific embodiments; however, this explanation does not unduly limit the technical solution of the present invention.

As shown in fig. 1, in the present embodiment, the target tracking system according to the present invention may include: a target object detection module, a target object re-identification coding module, a database and an online matching tracking module.

In the target tracking system of the present invention, the target detection module can extract target information to be tracked in real time from images taken by a plurality of cameras, and the target information at least includes: position information of the object detection frame in the image, the image within the object detection frame, the object type, and the object ID. The target object detection module transmits the target object information to be tracked, which is extracted in real time, to the target object re-identification coding module, and the target object re-identification coding module may include depth residual encoders corresponding to the number of types of the target object, and each depth residual encoder outputs a corresponding target object appearance feature code based on the input image in the target object detection frame.

Correspondingly, the database in the target object tracking system stores the position information of the target object detection frame in the image, the target object type, the target object ID, the target object appearance feature code and the corresponding timestamp as corresponding historical data.

The online matching and tracking module can receive the output of the current target object re-identification coding module and perform feature matching with the database to complete re-identification and tracking of the target object.

It should be noted that, in the present invention, the position prediction sub-module in the online matching and tracking module predicts the current position of the target object detection frame in the image based on the historical position information of the target object detection frame stored in the database, so as to obtain the predicted position of the target object detection frame. The distance calculation sub-module calculates the Euclidean distance between the position of the currently detected target object detection frame and the predicted position of each target object detection frame, and selects from the database, as candidate target objects, those whose Euclidean distance is smaller than a set first threshold; it then calculates the cosine distance between the appearance feature code of the currently detected target object and the appearance feature code of each candidate target object, and selects from the database, as candidate matching target objects, those whose cosine distance is smaller than a set second threshold. Once the candidate matching target objects have been selected, the Hungarian matching sub-module uses the Hungarian algorithm to assign the currently detected target object to one of the candidate matching target objects, and tracking is achieved.

In addition, in this embodiment, the target re-identification encoding module in the target tracking system further includes a pre-processing sub-module, and the pre-processing sub-module may scale the image in the target detection frame to the input size of the depth residual encoder.

In addition, it should be noted that, in the present embodiment, the target object in the target object tracking system based on multiple camera fusion according to the present invention may include at least a pedestrian and a vehicle.

As shown in fig. 2, in the present embodiment, the target object re-identification coding module of the target object tracking system according to the present invention introduces a depth residual encoder (also called a depth residual network) to extract the appearance feature code of the target object. The depth residual encoder consists of 2 convolutional layers, 1 pooling layer, 6 residual modules and 1 fully-connected layer. The depth residual encoder is trained on two sources of data: a public MOT pedestrian re-identification data set and a data set acquired by vehicle-mounted cameras.

In the present embodiment, the postures and the completeness of the same target object differ greatly between cameras, so in the training phase of the depth residual encoder the data set is extended by data augmentation in the preprocessing module. Specifically, 1/3 of the pedestrians and vehicles in the data set are picked at random, some pixels at the bottom and top of their images are randomly cropped, and the cropped images are then scaled back to the original size. The augmented data set better simulates the misalignment problem that exists in a multi-camera scene. The target object re-identification coding module trains 2 separate sets of weights, one for pedestrians and one for vehicles, so that it can better distinguish the features of target objects of the same type. The output of the depth residual network is a 128-dimensional feature vector, which serves as the appearance feature code of the target object.
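The augmentation described above can be illustrated with a minimal sketch. It is not the implementation of the embodiment: images are represented as plain lists of pixel rows, the vertical rescaling uses nearest-neighbour sampling, and the names `augment_dataset`, `crop_top_bottom` and the parameter `max_fraction` (the largest share of rows removed from each edge) are hypothetical, since the text does not state how many pixels are cropped.

```python
import random


def resize_rows_nearest(image, out_height):
    """Nearest-neighbour vertical resize of an image stored as a list of rows."""
    in_height = len(image)
    return [image[(r * in_height) // out_height] for r in range(out_height)]


def crop_top_bottom(image, max_fraction, rng):
    """Randomly crop rows from the top and bottom, then scale back to the original height."""
    height = len(image)
    max_cut = int(height * max_fraction)  # assumed upper bound per edge
    top = rng.randint(0, max_cut)
    bottom = rng.randint(0, max_cut)
    cropped = image[top:height - bottom]
    return resize_rows_nearest(cropped, height)


def augment_dataset(images, fraction=1 / 3, max_fraction=0.2, seed=0):
    """Extend the data set with cropped-and-rescaled copies of a random 1/3 of the samples."""
    rng = random.Random(seed)
    n_pick = round(len(images) * fraction)
    picked = rng.sample(range(len(images)), n_pick)
    augmented = [crop_top_bottom(images[i], max_fraction, rng) for i in picked]
    return images + augmented
```

Keeping `max_fraction` below 0.5 guarantees that at least one row survives the crop; a real pipeline would normally use an image library with proper interpolation instead of row duplication.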

Fig. 3 schematically shows a database module of the target tracking system based on multi-camera fusion according to an embodiment of the present invention.

As shown in fig. 3, in this embodiment, the database in the target tracking system based on multiple camera fusion according to the present invention may include: the system comprises a matching database and a cloud database.

In the invention, the position information of the target object detection frame in the image, the target object type, the target object ID, the target object appearance feature code and the corresponding timestamp are uploaded to both the cloud database and the matching database. The cloud database records the feature codes of all target objects. On the one hand, it can be used to adjust the number of frames stored in the matching database, which increases the robustness of the algorithm; on the other hand, in scenes with monitoring requirements, it allows the features of a given target object and the time sequence of its appearances to be found quickly.

Accordingly, the matching database stores only the historical data of a set number of frames for each target object (also called a tracker), and has a mechanism for updating and deleting data. In the present embodiment, the matching database may store only the records of the past 100 frames of each target object. When a tracker is not matched with any target object from any of the cameras in a new frame, its loss time is accumulated, and when the loss time exceeds a threshold, the tracker is deleted from the matching database.
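A matching database with a bounded history and a loss-time deletion rule might be sketched as follows. This is an illustration under stated assumptions: the class and field names (`MatchingDatabase`, `Tracker`, `Record`) are hypothetical, loss time is counted in frames, and the default `max_lost_frames=30` is an invented value, because the text gives only the 100-frame history length and not the loss threshold.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Record:
    timestamp: float
    box: tuple       # (cx, cy, w, h) in pixels
    feature: tuple   # appearance feature code


@dataclass
class Tracker:
    track_id: int
    object_type: str
    history: deque = field(default_factory=deque)
    lost_frames: int = 0


class MatchingDatabase:
    def __init__(self, max_history=100, max_lost_frames=30):
        self.max_history = max_history          # frames kept per tracker
        self.max_lost_frames = max_lost_frames  # assumed loss threshold
        self.trackers = {}

    def update(self, track_id, object_type, record):
        """Append a record for a matched (or newly created) tracker and reset its loss time."""
        tracker = self.trackers.get(track_id)
        if tracker is None:
            tracker = Tracker(track_id, object_type, deque(maxlen=self.max_history))
            self.trackers[track_id] = tracker
        tracker.history.append(record)  # the deque drops the oldest record when full
        tracker.lost_frames = 0

    def end_frame(self, matched_ids):
        """Accumulate loss time for unmatched trackers and delete those lost for too long."""
        for track_id in list(self.trackers):
            if track_id in matched_ids:
                continue
            tracker = self.trackers[track_id]
            tracker.lost_frames += 1
            if tracker.lost_frames > self.max_lost_frames:
                del self.trackers[track_id]
```

Using `deque(maxlen=...)` makes the "past 100 frames" rule automatic, so no explicit pruning of old records is needed.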

It should be noted that the present invention also discloses a target object tracking method based on the fusion of multiple cameras. As shown in fig. 4, and with reference to figs. 1 to 3, the target tracking method may include steps 100 to 700 as set out above.

In the target tracking method according to the present invention, in step 400, a kalman filter may be used to predict the current position of the target detection frame in the image.

In addition, in this embodiment, a preprocessing step may be further included between the step 100 and the step 200: the image within the object detection box is scaled to the input size of the depth residual encoder.

In addition, in some other embodiments, in the target tracking system based on multiple camera fusion according to the present invention, the position information of the target detection frame in the image may include the pixel position of the center point of the detection frame and the length and width of the detection frame.

In the target tracking method based on multi-camera fusion according to the present invention, in step 800, when a currently detected target is not matched with a corresponding target from among candidate matching targets, a new ID is assigned to the currently detected target, and the currently detected target is stored as history data.

Referring to fig. 4 together with steps 100 to 700 of the target tracking method, it can be seen that the target tracking method of the present invention is implemented based on the target tracking system of the present invention.

In the embodiment shown in fig. 4, the target to be tracked by the target tracking method according to the present invention is referred to as a tracker. In the process shown in fig. 4, the target detection module detects an input image sequence, obtains a tracker detection frame from an image of a current frame, and inputs the tracker detection frame into a trained depth residual encoder, thereby outputting a corresponding tracker appearance feature code.

While processing the previous frame, the online matching tracking module in the system uses a Kalman filter predictor to estimate, from the historical center positions of each tracker detection frame, where the tracker is likely to appear in the current frame, thereby obtaining the predicted position of the tracker detection frame.
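The prediction of the detection frame center can be illustrated with a small constant-velocity Kalman filter. The text does not specify the motion model or the noise parameters, so the following is only a sketch under assumptions: each pixel coordinate is filtered independently with state (position, velocity), and the names `AxisKalman`, `CenterPredictor`, the process noise `q`, the measurement noise `r` and the initial covariance are all hypothetical choices.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one pixel coordinate (state: position, velocity)."""

    def __init__(self, position, q=0.01, r=1.0):
        self.p, self.v = float(position), 0.0
        self.P = [[1.0, 0.0], [0.0, 100.0]]  # assumed: velocity initially very uncertain
        self.q, self.r = q, r

    def predict(self, dt=1.0):
        """Propagate the state by dt frames: x = F x, P = F P F^T + Q with F = [[1, dt], [0, 1]]."""
        (a, b), (c, d) = self.P
        self.p += dt * self.v
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, z):
        """Correct the state with a measured position z (measurement matrix H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r                # innovation covariance
        k0, k1 = a / s, c / s         # Kalman gain
        y = z - self.p                # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


class CenterPredictor:
    """Predicts the center (cx, cy) of a tracker detection frame in the next frame."""

    def __init__(self, cx, cy):
        self.fx, self.fy = AxisKalman(cx), AxisKalman(cy)

    def predict(self, dt=1.0):
        return self.fx.predict(dt), self.fy.predict(dt)

    def update(self, cx, cy):
        self.fx.update(cx)
        self.fy.update(cy)
```

In use, `predict()` would be called once per frame for every tracker to obtain the predicted position, and `update()` would be called only for trackers that were matched to a detection in that frame.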

Accordingly, in the present embodiment, the distance calculation part of the online matching tracking module performs two screening steps:

Step one: calculating the Euclidean distance between the detection frame of the current frame and the predicted position of each tracker detection frame. Step two: for each tracker whose Euclidean distance in step one is smaller than the first threshold, calculating the minimum cosine distance between the appearance feature codes of that tracker and the appearance feature code of the detected target object, and taking this minimum cosine distance as the similarity between the tracker and the detected target object; the trackers whose similarity to the detected target object is smaller than the second threshold are selected as potential candidate matching trackers. The purpose of the first screening step is to constrain the tracker positions, which reduces the number of candidate trackers and hence the computational burden of the second screening step.
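The two screening steps can be sketched as follows. This is an illustration, not the embodiment's code: the function name `gate_candidates`, the dictionary keys `predicted_center` and `features`, and the threshold values used by a caller are hypothetical, and the sketch assumes that the minimum is taken over the feature codes stored in each tracker's history and that no feature code is the zero vector.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def gate_candidates(det_center, det_feature, trackers, first_threshold, second_threshold):
    """Return {tracker_id: minimum cosine distance} for trackers that pass both screens."""
    candidates = {}
    for tracker_id, tracker in trackers.items():
        # Step one: position screen on the Euclidean distance to the predicted center.
        if math.dist(det_center, tracker["predicted_center"]) >= first_threshold:
            continue
        # Step two: appearance screen on the minimum cosine distance over stored feature codes.
        similarity = min(cosine_distance(det_feature, f) for f in tracker["features"])
        if similarity < second_threshold:
            candidates[tracker_id] = similarity
    return candidates
```

Because the cheap position screen runs first, the cosine distance is computed only for trackers that are already close to the detection, which mirrors the stated purpose of the first screening step.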

In this embodiment, the Hungarian algorithm is used to assign detected target objects to the potential matching trackers. If a detected target object is matched with a tracker, the ID of that tracker is given to the target object and the information in the matching database is updated, which completes the tracking of the current target object. If the currently detected target object is not matched with any tracker, it is treated as a new target object: a new ID is given to it and it is uploaded to the matching database as a tracker.

In the above technical solution, a tracker that remains unmatched with any target object has disappeared in the current frame, and its loss time is updated. If the loss time exceeds the threshold, the tracker is considered to be no longer observed by the cameras and is deleted from the matching database.
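The assignment step can be sketched with a compact implementation of the Hungarian algorithm in its O(n³) potentials form. Several details are assumptions rather than facts from the text: the cost of a detection and tracker pair is taken to be their appearance distance, pairs that failed the screening carry the sentinel cost `BIG`, the cost matrix is padded to a square with `BIG`, and the names `hungarian`, `assign_ids` and `next_id` are hypothetical.

```python
BIG = 1e6  # assumed sentinel cost for pairs that failed the screening


def hungarian(cost):
    """Minimum-cost assignment for a square matrix; returns the column chosen for each row."""
    n = len(cost)
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (n + 1)   # row and column potentials
    p, way = [0] * (n + 1), [0] * (n + 1)     # p[j]: 1-based row assigned to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (n + 1), [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the augmenting path back to the virtual column 0
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    row_to_col = [0] * n
    for j in range(1, n + 1):
        row_to_col[p[j] - 1] = j - 1
    return row_to_col


def assign_ids(cost, tracker_ids, next_id):
    """cost[d][t]: distance between detection d and tracker t, or BIG if screened out."""
    n_det, n_trk = len(cost), len(tracker_ids)
    size = max(n_det, n_trk)
    padded = [[cost[d][t] if d < n_det and t < n_trk else BIG for t in range(size)]
              for d in range(size)]
    row_to_col = hungarian(padded)
    ids = []
    for d in range(n_det):
        t = row_to_col[d]
        if t < n_trk and cost[d][t] < BIG:
            ids.append(tracker_ids[t])   # matched: inherit the tracker ID
        else:
            ids.append(next_id)          # unmatched: new target object, new ID
            next_id += 1
    return ids, next_id
```

A greedy nearest-match rule would give detection 0 to tracker 0 whenever that pair has the smallest cost, even if this forces a poor match for another detection; the optimal assignment minimises the total cost instead. In a project that already depends on SciPy, `scipy.optimize.linear_sum_assignment` solves the same assignment problem.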

The scope of the present invention is not limited to the examples given herein, and all prior art that does not contradict the inventive concept, including but not limited to prior patent documents, prior publications, and the like, are intended to be encompassed by the present invention.

In addition, the combination of the features in the present application is not limited to the combination described in the claims of the present application or the combination described in the embodiments, and all the features described in the present application may be freely combined or combined in any manner unless contradictory to each other.

It should also be noted that the above-mentioned embodiments are only specific embodiments of the present invention. It is apparent that the present invention is not limited to the above embodiments and similar changes or modifications can be easily made by those skilled in the art from the disclosure of the present invention and shall fall within the scope of the present invention.

Claims

1. A target object tracking method based on fusion of a plurality of cameras is characterized by comprising the following steps:

2. The multi-camera fusion based object tracking method of claim 1, wherein the object comprises at least a pedestrian and a vehicle.

3. The method for tracking the target object based on the fusion of the plurality of cameras according to claim 1, wherein in step 400, a kalman filter is used to predict the current position of the target object detection frame in the image.

4. The target tracking method based on multi-camera fusion according to claim 1, further comprising a preprocessing step between the step 100 and the step 200: the image within the object detection box is scaled to the input size of the depth residual encoder.

5. The target tracking method based on the fusion of multiple cameras according to claim 1, wherein the position information of the target detection frame in the image comprises the pixel position of the center point of the detection frame and the length and width of the detection frame.

6. The method for tracking the target object based on the fusion of the plurality of cameras according to claim 1, further comprising the step 800 of: when the currently detected target object is not matched with the corresponding target object from the candidate matching target objects, a new ID is given to the currently detected target object, and the currently detected target object is stored as historical data.

7. The method of claim 1, wherein the depth residual encoder is trained using MOT pedestrian re-identification data sets and vehicle-mounted camera acquisition data sets.

8. A target tracking system based on fusion of a plurality of cameras, comprising:

and the Hungarian matching submodule performs matching assignment on the current detected target object from the candidate matching target objects by adopting a Hungarian algorithm so as to realize tracking.

9. The multi-camera fusion-based target tracking system of claim 8, wherein the database comprises a matching database and a cloud database, the cloud database stores all historical data of the target, and the matching database stores historical data of a set number of frames of the target.

10. The multi-camera fusion based object tracking system of claim 8 wherein the object re-identification encoding module further comprises a pre-processing sub-module that scales the image within the object detection box to the input size of a depth residual encoder.