CN112001252A - Multi-target tracking method based on heteromorphic graph network - Google Patents


Info

Publication number
CN112001252A
CN112001252A (application CN202010712454.7A)
Authority
CN
China
Prior art keywords
target
frame
detection
detection frame
characteristic
Prior art date
Legal status
Granted
Application number
CN202010712454.7A
Other languages
Chinese (zh)
Other versions
CN112001252B (en)
Inventor
Zhang Baopeng (张宝鹏)
Li Rui (李芮)
Teng Zhu (滕竹)
Liu Wei (刘炜)
Li Yidong (李浥东)
Current Assignee
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202010712454.7A priority Critical patent/CN112001252B/en
Publication of CN112001252A publication Critical patent/CN112001252A/en
Application granted granted Critical
Publication of CN112001252B publication Critical patent/CN112001252B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 20/48 (Physics; Computing; Image or video recognition or understanding; Scenes; scene-specific elements in video content): Matching video sequences
    • G06F 18/22 (Physics; Computing; Electric digital data processing; Pattern recognition; Analysing): Matching criteria, e.g. proximity measures
    • G06F 18/253 (Pattern recognition; Analysing; Fusion techniques): Fusion techniques of extracted features
    • G06N 3/045 (Computing arrangements based on specific computational models; Neural networks; Architecture): Combinations of networks
    • G06N 3/08 (Neural networks): Learning methods
    • G06V 2201/07 (Indexing scheme relating to image or video recognition or understanding): Target detection

Abstract

The invention provides a multi-target tracking method based on a heterogeneous graph network, applied to multi-target tracking. First, target detection frames are obtained with a target detection algorithm, and data association between video frames is then performed using optical flow computation and a linear regression operation. To address target occlusion, after data association the model uses a heterogeneous graph network to extract features of detection frames and tracked targets for similarity measurement, judging whether a newly appearing detection frame belongs to an existing target. The heterogeneous graph network comprises three parts, appearance feature extraction, spatial relationship extraction and temporal relationship extraction, and learns discriminative features that encode a target's appearance, spatial position and temporal relationships, improving the representation and discrimination capability of the features and thereby the performance of multi-target tracking.

Description

Multi-target tracking method based on a heterogeneous graph network
Technical Field
The invention relates to the technical field of computer vision, and in particular to a multi-target tracking method based on a heterogeneous graph network.
Background
With the development of deep learning, convolutional neural networks are applied in more and more scenarios, and multi-target tracking has received increasing attention in computer vision owing to its wide application in video surveillance, human-computer interaction and virtual reality. Multi-target tracking aims to locate multiple target objects in a given video sequence, assign a distinct identity (ID) to each object, and record the trajectory of each ID through the video. With the continuing development of target detection technology based on convolutional neural networks, tracking-by-detection has become the mainstream direction of multi-target tracking. A detection-based tracking algorithm first performs target detection on each video frame to obtain per-frame detection results, then performs data association on those results to build the trajectory of each object in the video.
In detection-based tracking algorithms, learning discriminative feature representations of targets is crucial: it determines whether the tracker can correctly separate the trajectories of different objects. However, because targets in camera footage often appear blurred, most existing methods, which consider only the appearance features of targets, cannot accurately identify and distinguish different targets. These methods also concentrate on the data association problem and cannot handle the frequent occlusions present in video, which directly limits algorithm performance.
In detection-based multi-target tracking, most methods mainly study the data association problem, i.e. designing a robust model to associate the same target across adjacent video frames and obtain the trajectories of all targets in a video sequence. However, such methods neglect the influence of occlusion on target trajectories and treat an occluded target as one whose trajectory has terminated, which directly affects multi-target tracking performance. Recently, a multi-target tracking method that uses the regressor of a target detection algorithm for data association has achieved good tracking results and further handles target occlusion. As shown in fig. 5, it takes as input a video sequence and the positions of all target detection frames obtained with a common target detection algorithm, uses the regressor of the Faster R-CNN target detection algorithm for data association between video frames, extracts appearance features of targets and detection frames with a ResNet-50 convolutional neural network, performs target re-identification to judge whether a detection frame newly appearing in a video frame belongs to a terminated target trajectory, and finally outputs the resulting target trajectories. This method outperforms most existing multi-target tracking methods, but it considers only the appearance features of targets and ignores information such as the spatial topology and temporal relationships of the multi-target tracking scene.
Disclosure of Invention
The embodiments of the invention provide a multi-target tracking method based on a heterogeneous graph network, to solve the prior-art problem of finding a feature representation that enhances target discriminability while fully exploiting multi-target tracking video data. Under low resolution, blurred targets, varying illumination, varying viewpoints and target occlusion, the method improves the discriminative power of target feature representations beyond appearance features alone and solves the technical problem of occlusion.
In order to achieve the purpose, the invention adopts the following technical scheme.
A multi-target tracking method based on a heterogeneous graph network, characterized by comprising the following steps:
s1, extracting a detection frame of each frame through a common target detection algorithm based on the original video sequence;
s2, obtaining the position of the target in each frame of image through data association processing based on the original video sequence and the detection frame;
s3, based on the position and the detection frame of the target, obtaining target characteristics and detection characteristics through the network processing of the heteromorphic graph;
s4, similarity measurement processing is carried out on the target characteristic and the detection characteristic, whether the detection frame belongs to a certain termination target or not is judged, if yes, the detection frame is added into the termination target, and the termination target is set to be in an active state; otherwise, initializing a new target for the detection frame; judging whether the current frame is the last frame of the video, if so, ending the execution of the method; otherwise, the step S2 is executed.
Preferably, obtaining the position of the target in each frame image through data association processing, based on the original video sequence and the detection frames, comprises:
letting t be the frame index of the original video sequence;
when t = 1, initializing targets with all detection frames of the 1st frame to obtain the targets' initial positions;
when t > 1, performing data association according to the target positions in frame t-1 to obtain the target positions in frame t;
and performing position discrimination on the target positions of frame t with a binary classifier.
Preferably, when t > 1, performing data association according to the target positions in frame t-1 comprises: adjusting the positions from frame t-1 using the optical flow map between the adjacent video frames, then applying a linear regressor to the adjusted positions to obtain the target positions in frame t;
performing position discrimination on the target positions of frame t with the classifier comprises the following steps:
scoring the target positions of frame t with a binary classifier; if a target's position score is below a preset threshold, the target is judged as terminated, and otherwise as active;
and computing the intersection-over-union (IoU) between the detection frames of frame t and the active target positions to find the detection frames that cannot be matched to any target position.
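The score thresholding and IoU matching above can be read concretely as the sketch below. The box convention `(x1, y1, x2, y2)` and the threshold values are illustrative assumptions; the patent does not fix them at this point.

```python
# Sketch of the position scoring and IoU matching steps described above.
# Boxes are (x1, y1, x2, y2); threshold values are illustrative assumptions.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def split_targets(position_scores, score_threshold=0.5):
    """Binary-classifier position scores -> (active ids, terminated ids)."""
    active = {tid for tid, s in position_scores.items() if s >= score_threshold}
    return active, set(position_scores) - active

def unmatched_detections(detections, active_boxes, iou_threshold=0.5):
    """Detection frames whose IoU with every active target stays below threshold."""
    return [d for d in detections
            if all(iou(d, b) < iou_threshold for b in active_boxes)]
```

The boxes returned by `unmatched_detections` are the candidates that proceed to heterogeneous feature extraction in the next step.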
Preferably, obtaining the target features and detection features through heterogeneous graph network processing, based on the target positions and the detection frames, comprises:
extracting, through the heterogeneous graph network, the appearance, spatial relationship and temporal relationship features of the targets judged in frame t as track-terminated;
fusing the appearance, spatial relationship and temporal relationship features of those targets to obtain the target features;
extracting, through the heterogeneous graph network, the appearance and spatial relationship features of the detection frames that cannot be matched to any active target;
and fusing the appearance and spatial relationship features of those detection frames to obtain the detection features.
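The fusion step can be sketched as below. The patent does not specify the fusion operator or embedding sizes here, so concatenation and the dimensions used are illustrative assumptions.

```python
import numpy as np

# Sketch of the feature fusion step. Concatenation and the embedding sizes
# below are illustrative assumptions, not specified by the patent.

def fuse(*branch_features):
    """Fuse per-branch features by flattening and concatenating them."""
    return np.concatenate(
        [np.asarray(f, dtype=np.float32).ravel() for f in branch_features])

appearance = np.zeros(128)   # hypothetical appearance embedding
spatial = np.zeros(32)       # hypothetical spatial-relation embedding
temporal = np.zeros(32)      # hypothetical temporal-relation embedding

# Track-terminated targets fuse all three branches; an unmatched detection
# frame has no track history, so only appearance and spatial are fused.
target_feature = fuse(appearance, spatial, temporal)
detection_feature = fuse(appearance, spatial)
```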
Preferably, performing similarity measurement on the target features and the detection features and judging whether a detection frame belongs to a terminated target specifically comprises:
computing the Euclidean distance between a detection feature and a target feature; if the distance is below a preset threshold, adding the unmatched detection frame to that terminated target's trajectory, setting the target as active, and performing the data association process on it in frame t+1; otherwise, initializing the unmatched detection frame as a new target with a new ID;
and judging whether the original video sequence has ended; if so, outputting the target trajectories; otherwise continuing with step S2 for frame t+1.
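The distance test can be sketched as follows. The patent only states the below-threshold test; picking the closest terminated target when several fall under the threshold is an assumed tie-break, and the threshold value is illustrative.

```python
import numpy as np

def reidentify(detection_feature, terminated_features, threshold=1.0):
    """Return the id of the closest terminated target whose Euclidean distance
    to the detection feature is below threshold, or None if there is none.
    (Choosing the closest among several candidates is an assumed tie-break;
    the patent only states the below-threshold test.)"""
    best_id, best_dist = None, threshold
    for tid, feat in terminated_features.items():
        dist = float(np.linalg.norm(np.asarray(detection_feature) - np.asarray(feat)))
        if dist < best_dist:
            best_id, best_dist = tid, dist
    return best_id
```

A returned id means the detection frame is appended to that target's trajectory and the target is reactivated; `None` means a new target ID is initialized.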
According to the technical solution provided by the embodiments of the invention, the multi-target tracking method based on the heterogeneous graph network is applied to multi-target tracking. First, target detection frames are obtained with a target detection algorithm, and data association between video frames is then performed using optical flow computation and a linear regression operation. To address target occlusion, after data association the model uses a heterogeneous graph network to extract features of detection frames and tracked targets for similarity measurement, judging whether a newly appearing detection frame belongs to an existing target. The heterogeneous graph network comprises three parts, appearance feature extraction, spatial relationship extraction and temporal relationship extraction, and learns discriminative features that encode a target's appearance, spatial position and temporal relationships, improving the representation and discrimination capability of the features and thereby the performance of multi-target tracking.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a processing flow chart of the multi-target tracking method based on a heterogeneous graph network provided by the present invention;
FIG. 2 is an overall framework diagram of the multi-target tracking model based on a heterogeneous graph network in the method provided by the present invention;
FIG. 3 is a flowchart of a specific implementation of the multi-target tracking method based on a heterogeneous graph network provided by the present invention;
FIG. 4 is a logic diagram of the heterogeneous graph network processing in the method provided by the present invention;
FIG. 5 is an overall framework diagram of a related method in the prior art.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
Referring to fig. 1, the multi-target tracking method based on the heterogeneous graph network provided by the invention comprises the following steps:
s1, extracting a detection frame of each frame through a common target detection algorithm based on the original video sequence;
s2, obtaining the position of the target in each frame of image through data association processing based on the original video sequence and the detection frame;
s3, based on the position and the detection frame of the target, obtaining target characteristics and detection characteristics through the network processing of the heteromorphic graph;
s4, similarity measurement processing is carried out on the target characteristic and the detection characteristic, whether the detection frame belongs to a certain termination target or not is judged, if yes, the detection frame is added into the termination target, and the termination target is set to be in an active state; otherwise, initializing a new target for the detection frame; judging whether the current frame is the last frame of the video, if so, ending the execution of the method; otherwise, the above step S2 is executed.
In the embodiments provided by the present invention, a detection frame (also called a detection response, "detection hypothesis" or "detection observation") is the output of the detection process. Targets are closed regions in an image that are clearly distinguished from their surroundings, often called objects; they generally have physical significance, such as pedestrians or cars. Trajectories are the output of multi-target tracking; one trajectory corresponds to the position sequence of one target over a time period. Multi-target tracking aims to locate multiple targets of interest simultaneously in a given video, maintain their IDs, and record their trajectories.
In the embodiment of the invention, a multi-target tracking model based on a heterogeneous graph network is provided; the model consists of three modules: data association, heterogeneous feature extraction and similarity measurement. As shown in fig. 2, the input comprises two parts, the original video sequence and the detection frame positions obtained by a common target detection algorithm; the input data sequentially undergoes data association, extraction of heterogeneous features (appearance, spatial and temporal relationships) and similarity measurement to obtain the final target trajectories. Each module works as follows:
data association module
The original video sequence and the detection frame positions obtained by a common target detection algorithm are input into the model and first pass through the data association module. This module fine-tunes each target's previous-frame position using the optical flow map, then regresses the fine-tuned position with a linear regressor to obtain the target's position in the current frame image; whether a trajectory has terminated is judged from the score produced by a classifier; the intersection-over-union between the current frame's detection frame positions and the target positions is then computed to find the detection frames that cannot be matched to a target. The data association module is connected to the heterogeneous feature extraction module, and the targets obtained by data association are input to it for heterogeneous feature extraction;
heterogeneous feature extraction module
The input of the heterogeneous feature extraction module is the detection frames that cannot be matched to a target after data association, together with the targets whose trajectories have terminated as of the current frame. The module comprises an appearance feature extraction sub-network, a spatial relationship extraction sub-network and a temporal relationship extraction sub-network; it extracts the appearance, spatial and temporal relationships of the targets and fuses them to obtain the target features, and extracts the appearance and spatial relationships of the detection frames and fuses them to obtain the detection features. The fused target features and detection features are passed to the similarity measurement module;
similarity measurement module
This module processes the target features and detection features produced by the heterogeneous feature extraction module: it computes the Euclidean distance between a target feature and a detection feature and compares it with a given threshold to judge whether an unmatched detection frame from the data association module belongs to a terminated trajectory. If the distance is below the threshold, the detection frame is added to that target's terminated trajectory; otherwise it is initialized as a new target.
In the embodiment provided by the present invention, as shown in fig. 3, the overall processing flow is as follows.
In step S1 above, the specific process is: given a video sequence requiring multi-target tracking, extract the detection frames present in each frame with a common target detection algorithm, and input the video sequence and the obtained detection frames into the model.
In the step S2, the specific process includes:
letting t be the frame index of the original video sequence, starting from t = 1 until all video frames of the sequence have been processed;
when t is equal to 1, initializing by using all detection frames in the 1 st frame, and acquiring the initial position of the target;
when t > 1, performing data association according to the target positions in frame t-1: first fine-tuning the target positions of frame t-1 using the optical flow map, then regressing the fine-tuned positions with a linear regressor to obtain the target positions in frame t;
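The optical-flow fine-tuning step can be sketched as below. Averaging the flow over the box region is one simple choice of adjustment; the patent only states that the previous-frame position is adjusted via the flow map, so this is an illustrative sketch, not the patent's implementation.

```python
import numpy as np

def shift_box_by_flow(box, flow):
    """Fine-tune a frame t-1 box (x1, y1, x2, y2) by the mean optical flow
    inside it. Averaging over the box region is an assumed, simple choice;
    the patent only says the position is adjusted using the flow map."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    patch = flow[y1:y2, x1:x2]            # flow has shape (H, W, 2): per-pixel (dx, dy)
    dx = float(patch[..., 0].mean())
    dy = float(patch[..., 1].mean())
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)

# A constant flow of (2, 3) pixels moves the whole box by (2, 3).
flow = np.zeros((20, 20, 2))
flow[..., 0] = 2.0
flow[..., 1] = 3.0
shifted = shift_box_by_flow((2, 2, 10, 10), flow)
```

In the pipeline described here, the shifted box would then be refined by the linear regressor before classifier scoring.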
and carrying out position discrimination processing on the position of the target of the t-th frame by using a binary classifier.
Further, the discrimination process specifically comprises:
scoring the target positions of frame t with a binary classifier; if a target's position score is below a preset judgment threshold, the target is judged as terminated and added to the set of track-terminated targets; otherwise it is judged as active, added to the active target set, and its data association continues in the next frame;
and computing the intersection-over-union between the detection frames of frame t and the active target positions to find the detection frames that cannot be matched to any active target.
In experiments, the applicant found that multi-target tracking videos have complex environments and many targets, with frequent occlusion between targets or by surrounding buildings. Over a full video sequence, an occluded target's trajectory is therefore terminated and disappears, yet the target may reappear after a period of time; if it is then given a new ID, the overall tracking performance suffers. To solve this, the invention adds the heterogeneous feature extraction and similarity measurement modules: after data association it checks whether new detections appear in a video frame, extracts heterogeneous features of the new detections and of the track-terminated targets for similarity measurement, and judges whether a new detection belongs to a target whose trajectory was terminated by earlier occlusion.
As shown in fig. 4, the heterogeneous graph network consists of a convolutional neural network that extracts the appearance features of targets and detections, a spatial relation graph network and a temporal relation graph network; the spatial and temporal relation graphs encode the spatial and temporal relationships, respectively.
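A relation graph network of the kind shown in fig. 4 propagates information between node features over graph edges. The minimal sketch below uses one round of mean-neighbor aggregation as a stand-in; the actual learned update rule of the spatial and temporal relation graphs is not specified in this part of the text, so this is an assumption for illustration only.

```python
import numpy as np

def message_pass(node_features, adjacency):
    """One round of mean-neighbor aggregation over a relation graph.
    A minimal stand-in for the spatial/temporal relation graph networks;
    the real learned update rule is not specified here."""
    feats = np.asarray(node_features, dtype=np.float64)
    out = feats.copy()
    for i in range(len(feats)):
        neighbors = [feats[j] for j in range(len(feats)) if adjacency[i][j]]
        if neighbors:
            # blend each node's own feature with the mean of its neighbors
            out[i] = 0.5 * feats[i] + 0.5 * np.mean(neighbors, axis=0)
    return out

# Two mutually connected nodes move toward each other after one round.
updated = message_pass([[0.0, 0.0], [2.0, 2.0]], [[0, 1], [1, 0]])
```

In this sketch the nodes would be targets and detection frames, with edges built from spatial proximity (spatial graph) or track history (temporal graph).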
In the heterogeneous feature extraction step, the specific process is as follows:
extracting, through the heterogeneous graph network, the appearance, spatial relationship and temporal relationship features of the targets judged in frame t as track-terminated;
fusing the appearance, spatial relationship and temporal relationship features of those targets to obtain the target features;
after the IoU computation, if every detection frame is matched one-to-one with a target position, the processing of the current frame is complete; otherwise, extracting through the heterogeneous graph network the appearance and spatial relationship features of the detection frames that cannot be matched to any active target;
and fusing the appearance and spatial relationship features of those detection frames to obtain the detection features.
Compared with the prior art, the invention adds a heterogeneous feature extraction module and a similarity measurement module: the proposed heterogeneous graph network extracts heterogeneous features of targets and detections, similarity measurement is then performed on these features, and it is judged whether a detection belongs to a target whose trajectory has already terminated. Multi-target tracking videos have complex environments and many targets, with frequent occlusion between targets or by surrounding buildings, so target trajectories may disappear for a period and then reappear within a video sequence. The two modules are added after data association mainly to handle this occlusion problem: the data association provided by the invention achieves good tracking when targets are unoccluded, but once occlusion occurs the occluded trajectory is terminated and the target receives a new ID when it reappears, which degrades the overall multi-target tracking performance.
Further, the similarity measurement process specifically comprises:
computing the Euclidean distance between a detection feature and a target feature; if the distance is below a preset threshold, the unmatched detection frame is judged to belong to a track-terminated target; it is added to that target's trajectory, the target is set as active, and the data association process is performed on it in frame t+1; otherwise, the unmatched detection frame is judged not to belong to any terminated target and is initialized with a new target ID;
and finally, judging whether the original video sequence has ended; if so, outputting the trajectories of all targets; otherwise returning to step S2 for frame t+1.
In summary, the multi-target tracking method based on the heterogeneous graph network provided by the invention uses discriminative feature learning. First, a common target detection algorithm, Faster R-CNN, extracts the possible target positions (detection frames) in all frames of the video sequence. Data association is then performed with the optical flow map and the linear regressor; this association scheme is simple in structure, convenient to apply, and accurate for associating targets across adjacent frames. Because occlusion occurs frequently in multi-target tracking, after data association the invention checks whether new detection frames appear in a video frame, extracts and fuses appearance, spatial and temporal relationship features of the new detection frames and of the track-terminated targets with the heterogeneous graph network, and finally measures the similarity between detection features and target features to judge whether a newly appearing detection belongs to a track-terminated target. In the invention, data association and heterogeneous feature extraction complement each other, jointly improving the representation and discrimination capability of the model and thus the performance of multi-target tracking.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments in this specification are described in a progressive manner; the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, since the apparatus and system embodiments are substantially similar to the method embodiments, they are described relatively simply; for relevant details, refer to the description of the method embodiments. The described apparatus and system embodiments are merely illustrative: units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement them without inventive effort.
The above description covers only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any changes or substitutions that can be readily conceived by those skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A multi-target tracking method based on a heterogeneous graph network, characterized by comprising the following steps:
S1, extracting the detection frames of each frame through a general object detection algorithm, based on the original video sequence;
S2, obtaining the position of each target in each frame of the image through data association processing, based on the original video sequence and the detection frames;
S3, obtaining target features and detection features through heterogeneous graph network processing, based on the positions of the targets and the detection frames;
S4, performing similarity measurement processing on the target features and the detection features to judge whether a detection frame belongs to a certain terminated target; if so, adding the detection frame to the terminated target and setting the terminated target to an active state; otherwise, initializing a new target for the detection frame; judging whether the current frame is the last frame of the video; if so, ending execution of the method; otherwise, returning to step S2.
2. The method of claim 1, wherein obtaining the position of each target in each frame of the image through data association processing, based on the original video sequence and the detection frames, comprises:
letting t be the frame index of the original video sequence;
when t = 1, initializing with all detection frames in the 1st frame to obtain the initial positions of the targets;
when t > 1, performing data association according to the positions of the detection frames in the (t-1)-th frame to obtain the positions of the targets in the t-th frame;
and performing position discrimination processing on the positions of the targets in the t-th frame using a binary classifier.
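As an illustration of the association step, a detection frame from the previous frame can be shifted by the mean optical flow inside it before a regressor refines its position. This is a minimal sketch under stated assumptions: it takes a dense flow field of shape H x W x 2 (per-pixel dx, dy), and it omits the linear regressor refinement entirely.

```python
import numpy as np

def shift_box_by_flow(box, flow):
    """Shift an (x1, y1, x2, y2) box by the mean optical flow inside it.

    flow: dense flow field of shape (H, W, 2) holding per-pixel (dx, dy).
    Illustrative only; the patent additionally refines the shifted box
    with a linear regressor.
    """
    x1, y1, x2, y2 = [int(v) for v in box]
    region = flow[y1:y2, x1:x2]                 # flow vectors inside the box
    dx, dy = region.reshape(-1, 2).mean(axis=0) # average displacement
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)
```

For example, a uniform flow of (1, 1) everywhere moves the box one pixel right and one pixel down.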
3. The method according to claim 2, wherein, when t > 1, performing data association according to the positions of the detection frames in the (t-1)-th frame comprises: adjusting the positions of the detection frames of the (t-1)-th frame through the optical flow map of the adjacent video frames, and then performing a regression operation on the adjusted detection frames with a linear regressor to obtain the positions of the targets in the t-th frame;
and wherein performing position discrimination processing on the positions of the targets in the t-th frame with the binary classifier comprises:
scoring the position of each target in the t-th frame with the binary classifier; if the position score of a target is smaller than a preset threshold, judging the target as a terminated target, and otherwise judging it as an active target;
and calculating the intersection-over-union between the detection frames of the t-th frame and the positions of the active targets to obtain the detection frames that cannot be matched with any target position.
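The intersection-over-union used for matching detection frames against active target positions is a standard computation; a self-contained version for boxes given as (x1, y1, x2, y2) corner coordinates might look like this:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Detection frames whose IoU with every active target falls below a chosen threshold are the ones passed on to the heterogeneous graph network as potentially re-appearing targets.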
4. The method of claim 3, wherein obtaining the target features and the detection features through heterogeneous graph network processing, based on the positions of the targets and the detection frames, comprises:
extracting, through the heterogeneous graph network, the appearance features, spatial relation features and temporal relation features of the targets of the t-th frame judged as trajectory-terminated;
fusing the appearance features, spatial relation features and temporal relation features of the targets of the t-th frame judged as trajectory-terminated to obtain the target features;
extracting, through the heterogeneous graph network, the appearance features and spatial relation features of the detection frames that cannot be matched with the active targets;
and fusing the appearance features and spatial relation features of the detection frames that cannot be matched with the active targets to obtain the detection features.
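A minimal stand-in for the fusion step is plain concatenation of whichever relation features are available. The patent performs this fusion inside the heterogeneous graph network, so the function below is only illustrative: terminated targets contribute all three features, while unmatched detection frames have no temporal relation feature.

```python
import numpy as np

def fuse_features(appearance, spatial, temporal=None):
    """Concatenate the available relation features into one vector.

    Illustrative substitute for the graph-network fusion: targets pass all
    three feature vectors; detections pass only appearance and spatial ones.
    """
    parts = [appearance, spatial]
    if temporal is not None:
        parts.append(temporal)
    return np.concatenate(parts)
```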
5. The method according to claim 4, wherein performing the similarity measurement processing on the target features and the detection features to judge whether a detection frame belongs to a certain terminated target specifically comprises:
calculating the Euclidean distance between a detection feature and a target feature; if the Euclidean distance is smaller than a preset threshold, adding the detection frame that cannot be matched with the active targets into the trajectory of the terminated target whose distance is below the threshold, setting that target as an active target, and performing the data association processing on the target in the (t+1)-th frame; otherwise, initializing the detection frame that cannot be matched with the active targets to obtain a new target ID;
and judging whether the original video sequence has ended; if so, outputting the tracking trajectories of the targets, and otherwise continuing to execute step S2 for the (t+1)-th frame.
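The Euclidean-distance matching against terminated targets can be sketched as a nearest-neighbor search under a threshold; this hypothetical helper returns the index of the best match, or None when the detection should start a new target ID.

```python
import numpy as np

def match_terminated(det_feature, terminated_features, threshold):
    """Return the index of the closest terminated target whose Euclidean
    distance to the detection feature is below the threshold, else None."""
    best_idx, best_dist = None, threshold
    for i, target_feature in enumerate(terminated_features):
        dist = float(np.linalg.norm(det_feature - target_feature))
        if dist < best_dist:                  # strictly closer than current best
            best_idx, best_dist = i, dist
    return best_idx
```

Initializing `best_dist` to the threshold makes the loop accept only matches below the threshold and, among those, keep the closest one.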
CN202010712454.7A 2020-07-22 2020-07-22 Multi-target tracking method based on heterogeneous graph network Active CN112001252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010712454.7A CN112001252B (en) 2020-07-22 2020-07-22 Multi-target tracking method based on heterogeneous graph network


Publications (2)

Publication Number Publication Date
CN112001252A true CN112001252A (en) 2020-11-27
CN112001252B CN112001252B (en) 2024-04-12

Family

ID=73468031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010712454.7A Active CN112001252B (en) Multi-target tracking method based on heterogeneous graph network

Country Status (1)

Country Link
CN (1) CN112001252B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744316A (en) * 2021-09-08 2021-12-03 电子科技大学 Multi-target tracking method based on deep neural network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160343146A1 (en) * 2015-05-22 2016-11-24 International Business Machines Corporation Real-time object analysis with occlusion handling
CN107330920A (en) * 2017-06-28 2017-11-07 华中科技大学 A kind of monitor video multi-target tracking method based on deep learning
CN109360226A (en) * 2018-10-17 2019-02-19 武汉大学 A kind of multi-object tracking method based on time series multiple features fusion
CN109800689A (en) * 2019-01-04 2019-05-24 西南交通大学 A kind of method for tracking target based on space-time characteristic fusion study
CN109993770A (en) * 2019-04-09 2019-07-09 西南交通大学 A kind of method for tracking target of adaptive space-time study and state recognition
CN111161311A (en) * 2019-12-09 2020-05-15 中车工业研究院有限公司 Visual multi-target tracking method and device based on deep learning


Also Published As

Publication number Publication date
CN112001252B (en) 2024-04-12

Similar Documents

Publication Publication Date Title
Fradi et al. Crowd behavior analysis using local mid-level visual descriptors
Liu et al. Context-aware three-dimensional mean-shift with occlusion handling for robust object tracking in RGB-D videos
WO2021017291A1 (en) Darkflow-deepsort-based multi-target tracking detection method, device, and storage medium
Huang et al. Robust object tracking by hierarchical association of detection responses
Wang et al. Tracklet association by online target-specific metric learning and coherent dynamics estimation
KR20190023389A (en) Multi-Class Multi-Object Tracking Method using Changing Point Detection
Denman et al. Multi-spectral fusion for surveillance systems
Kim et al. Online tracker optimization for multi-pedestrian tracking using a moving vehicle camera
Al-Shakarji et al. Robust multi-object tracking with semantic color correlation
Iraei et al. Object tracking with occlusion handling using mean shift, Kalman filter and edge histogram
Yang et al. 3D multiview basketball players detection and localization based on probabilistic occupancy
CN111192297A (en) Multi-camera target association tracking method based on metric learning
Liu et al. Accelerating vanishing point-based line sampling scheme for real-time people localization
CN114220061A (en) Multi-target tracking method based on deep learning
Li et al. Smot: Single-shot multi object tracking
Atghaei et al. Abnormal event detection in urban surveillance videos using GAN and transfer learning
Liu et al. Semantic superpixel based vehicle tracking
CN112001252B (en) Multi-target tracking method based on heterogeneous graph network
Liu et al. Multi-view vehicle detection and tracking in crossroads
CN115188081B (en) Complex scene-oriented detection and tracking integrated method
Roopchand et al. Bat detection and tracking toward batsman stroke recognition
Avgerinakis et al. Moving camera human activity localization and recognition with motionplanes and multiple homographies
Luo et al. Crowd counting for static images: a survey of methodology
Truong et al. Single object tracking using particle filter framework and saliency-based weighted color histogram
Han et al. Multi-target tracking based on high-order appearance feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant