CN112001252B - Multi-target tracking method based on heterogeneous graph network - Google Patents

Multi-target tracking method based on heterogeneous graph network

Info

Publication number
CN112001252B
CN112001252B (application CN202010712454.7A)
Authority
CN
China
Prior art keywords
target
frame
detection
detection frame
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010712454.7A
Other languages
Chinese (zh)
Other versions
CN112001252A (en)
Inventor
张宝鹏
李芮
滕竹
刘炜
李浥东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University
Priority to CN202010712454.7A
Publication of CN112001252A
Application granted
Publication of CN112001252B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/48: Matching video sequences
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-target tracking method based on a heterogeneous graph network. The method first obtains target detection boxes with an object detection algorithm, and then performs data association between video frames using optical flow computation and a linear regression operation. To handle target occlusion, after data association the model uses the heterogeneous graph network to extract features of the detection boxes and the tracked targets for similarity measurement, and judges whether a newly appearing detection box belongs to an existing target. The heterogeneous graph network comprises three parts, namely appearance feature extraction, spatial relation extraction and temporal relation extraction, and learns discriminative features that encode the appearance, spatial position and temporal relations of targets, improving the representation and discrimination ability of the features and thereby the performance of multi-target tracking.

Description

Multi-target tracking method based on heterogeneous graph network
Technical Field
The invention relates to the technical field of computer vision, and in particular to a multi-target tracking method based on a heterogeneous graph network.
Background
With the development of deep learning, convolutional neural networks are being applied in more and more scenarios, and multi-target tracking has received growing attention in computer vision owing to its wide application in video surveillance, human-computer interaction and virtual reality. Multi-target tracking aims to locate multiple target objects in a given video sequence, assign a distinct identity (ID) to each object, and record the trajectory of each ID through the video. With the continuing progress of target detection based on convolutional neural networks, tracking-by-detection has become the mainstream approach to multi-target tracking. A tracking-by-detection algorithm first performs target detection on every video frame to obtain per-frame detection results, and then performs data association over those detections to build the trajectory of each object in the video.
In tracking-by-detection algorithms, learning a discriminative representation of target features is essential, since it determines whether the tracker can correctly separate the trajectories of different targets. However, because targets in camera footage often appear blurred, most existing methods, which consider only appearance features, cannot reliably identify and distinguish different targets. Such methods mainly address the data association problem and cannot handle the frequent occlusions in video, which directly limits the achievable tracking performance.
In detection-based multi-target tracking, most methods focus on the data association problem, i.e., designing a robust model that associates the same target across adjacent video frames to obtain the trajectories of all targets in a video sequence. However, these methods ignore the effect of occlusion on target trajectories and simply treat an occluded target as one whose trajectory has terminated, which directly harms tracking performance. Recently, a multi-target tracking method that reuses the regressor of a target detection algorithm for data association has achieved a good tracking effect and further addresses target occlusion. As shown in FIG. 5, it takes as input a video sequence and the positions of all target detection boxes obtained by a public detection algorithm, uses the regressor of the detector Faster R-CNN to perform data association between video frames, extracts appearance features of targets and detection boxes with the convolutional neural network ResNet-50, performs target re-identification to determine whether a newly appearing detection box in a frame belongs to a terminated target trajectory, and finally outputs the resulting target trajectories. This method outperforms most existing multi-target tracking methods, but it considers only appearance features and ignores information such as the spatial topology and temporal relations present in multi-target tracking scenes.
Disclosure of Invention
The embodiment of the invention provides a multi-target tracking method based on a heterogeneous graph network, to address the problem in the prior art of finding a feature representation that strengthens target discriminability while making full use of multi-target tracking video data. Under the conditions typical of such data, namely low resolution, blurred targets, varying lighting, varying viewpoints and occluded targets, the method improves the discriminability of the target feature representation beyond appearance features alone and solves the technical problem of occlusion.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A multi-target tracking method based on a heterogeneous graph network, comprising:
S1, extracting the detection boxes of each frame with a public target detection algorithm, based on the original video sequence;
S2, obtaining the position of each target in each frame image through data association processing, based on the original video sequence and the detection boxes;
S3, obtaining target features and detection features through heterogeneous graph network processing, based on the target positions and the detection boxes;
S4, performing similarity measurement on the target features and the detection features and judging whether a detection box belongs to some terminated target; if so, adding the detection box to that terminated target and setting the terminated target to the active state; otherwise, initializing a new target for the detection box; then judging whether the current frame is the last frame of the video; if so, ending the method, otherwise returning to step S2 (a control-flow sketch of these steps is given below).
Preferably, obtaining the position of each target in each frame image through data association processing, based on the original video sequence and the detection boxes, comprises:
letting t denote the frame index of the original video sequence;
when t = 1, initializing targets with all detection boxes of frame 1 to obtain the initial target positions;
when t > 1, performing data association based on the box positions in frame t-1 to obtain the target positions in frame t;
and performing position discrimination on the target positions of frame t with a binary classifier.
Preferably, when t > 1, performing data association based on the box positions in frame t-1 comprises: adjusting the box positions from frame t-1 using the optical flow map between the adjacent video frames, and then applying a linear regressor to the adjusted boxes to obtain the target positions in frame t;
the position discrimination on the target positions of frame t with the classifier comprises the following steps:
scoring the target positions of frame t with the binary classifier; if the position score of a target is below a preset threshold, judging that target terminated, otherwise judging it active;
and computing the intersection-over-union (IoU) between the detection boxes of frame t and the positions of the active targets, to obtain the detection boxes that cannot be matched to any target position.
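The intersection-over-union used here is the standard box-overlap measure. A minimal sketch follows; the 0.5 matching threshold is an assumption, since the patent does not state a value.

    def iou(box_a, box_b):
        # boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def unmatched_detections(dets, active_boxes, thresh=0.5):
        # a detection box is "unmatched" if its IoU with every active target is low
        return [d for d in dets if all(iou(d, b) < thresh for b in active_boxes)]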
Preferably, obtaining target features and detection features through heterogeneous graph network processing, based on the target positions and the detection boxes, comprises:
extracting, through the heterogeneous graph network, the appearance features, spatial relation features and temporal relation features of the targets of frame t judged to have terminated trajectories;
fusing the appearance, spatial relation and temporal relation features of those targets to obtain the target features;
extracting, through the heterogeneous graph network, the appearance features and spatial relation features of the detection boxes that cannot be matched to an active target;
and fusing the appearance and spatial relation features of those unmatched detection boxes to obtain the detection features.
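The patent does not fix the fusion operator. The sketch below averages the per-branch feature vectors; averaging rather than concatenating is an assumption, made so that target features (three branches) and detection features (two branches) end up with the same length and can be compared directly in the similarity step.

    import numpy as np

    def fuse_features(parts):
        # parts: equal-length 1-D feature vectors, e.g. [appearance, spatial,
        # temporal] for a terminated target, or [appearance, spatial] for an
        # unmatched detection box (fusion by mean is an assumption).
        stacked = np.stack([np.asarray(p, dtype=np.float64) for p in parts])
        fused = stacked.mean(axis=0)
        return fused / (np.linalg.norm(fused) + 1e-12)  # L2-normalize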
Preferably, performing similarity measurement on the target features and the detection features and judging whether a detection box belongs to some terminated target specifically comprises:
computing the Euclidean distance between the detection feature and the target feature; if the distance is below a preset threshold, adding the unmatched detection box to the terminated target trajectory whose distance falls below the threshold, setting that target active, and performing the data association processing on it at frame t+1; otherwise, initializing the unmatched detection box to obtain a new target ID;
and judging whether the original video sequence has ended; if so, outputting the tracking trajectories of the targets, otherwise continuing with step S2 for frame t+1.
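A minimal sketch of this decision rule follows. Returning the closest terminated target when several fall below the threshold is an assumption; the patent only states the threshold test.

    import numpy as np

    def match_terminated_target(det_feat, terminated_feats, thresh):
        # det_feat: fused feature of an unmatched detection box;
        # terminated_feats: {target_id: fused feature} for terminated tracks.
        # Returns the ID of the closest terminated target if its Euclidean
        # distance is below the preset threshold, else None (a new ID is then
        # initialized for the detection).
        best_id, best_dist = None, float("inf")
        for tid, feat in terminated_feats.items():
            dist = float(np.linalg.norm(np.asarray(det_feat) - np.asarray(feat)))
            if dist < best_dist:
                best_id, best_dist = tid, dist
        return best_id if best_dist < thresh else None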
It can be seen from the technical scheme provided by the embodiment that the invention proposes a multi-target tracking method based on a heterogeneous graph network. The method first obtains target detection boxes with a target detection algorithm, then performs data association between video frames using optical flow computation and a linear regression operation. To handle target occlusion, after data association the model uses the heterogeneous graph network to extract features of the detection boxes and the tracked targets for similarity measurement, and judges whether a newly appearing detection box belongs to an existing target. The heterogeneous graph network comprises three parts, namely appearance feature extraction, spatial relation extraction and temporal relation extraction, and learns discriminative features that encode the appearance, spatial position and temporal relations of targets, improving the representation and discrimination ability of the features and thereby the performance of multi-target tracking.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art may derive other drawings from them without inventive effort.
FIG. 1 is a process flow diagram of the multi-target tracking method based on a heterogeneous graph network provided by the invention;
FIG. 2 is an overall framework diagram of the multi-target tracking model based on a heterogeneous graph network in the method provided by the invention;
FIG. 3 is a flowchart of a specific implementation of the multi-target tracking method based on a heterogeneous graph network provided by the invention;
FIG. 4 is a logic block diagram of the heterogeneous graph network processing in the method provided by the invention;
FIG. 5 is an overall framework diagram of a closely related prior-art method.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
To facilitate an understanding of the embodiments of the invention, several specific embodiments are further explained below with reference to the accompanying drawings; the drawings should in no way be taken as limiting the embodiments of the invention.
Referring to FIG. 1, the multi-target tracking method based on a heterogeneous graph network provided by the invention comprises the following steps:
S1, extracting the detection boxes of each frame with a public target detection algorithm, based on the original video sequence;
S2, obtaining the position of each target in each frame image through data association processing, based on the original video sequence and the detection boxes;
S3, obtaining target features and detection features through heterogeneous graph network processing, based on the target positions and the detection boxes;
S4, performing similarity measurement on the target features and the detection features and judging whether a detection box belongs to some terminated target; if so, adding the detection box to that terminated target and setting the terminated target to the active state; otherwise, initializing a new target for the detection box; then judging whether the current frame is the last frame of the video; if so, ending the method, otherwise returning to step S2 above.
In the embodiments provided by the invention, a detection box, i.e., a detection response, also called a "detection hypothesis" or "detection observation", is the output of the detection stage. A target is a closed region in an image that is clearly distinguished from its surroundings; such regions usually carry a physical meaning, for example pedestrians or vehicles. A trajectory is the output of multi-target tracking; one trajectory corresponds to the position sequence of one target over a period of time. The basis of multi-target tracking is to locate multiple objects of interest simultaneously in a given video, maintain their IDs, and record their trajectories.
The embodiment of the invention provides a multi-target tracking model based on a heterogeneous graph network, consisting of three modules: data association, heterogeneous feature extraction and similarity measurement. As shown in FIG. 2, the input of the invention comprises two parts, the original video sequence and the detection box positions obtained by a public target detection algorithm; the input data pass sequentially through data association, extraction of heterogeneous features such as appearance, spatial and temporal relations, and similarity measurement to produce the final target trajectories. Each module works as follows:
data association module
The model takes as input the original video sequence and the detection box positions obtained by a public target detection algorithm. This module first fine-tunes the position of each target from the previous frame using an optical flow map, then regresses the fine-tuned positions with a linear regressor to obtain the target positions in the current frame image, judges from the classifier score whether each trajectory has terminated, and then computes the intersection-over-union between the detection box positions of the current frame and the target positions to obtain the detection boxes that cannot be matched to a target. The data association module is connected to the heterogeneous feature extraction module; the targets obtained by data association are fed into the heterogeneous feature extraction module for heterogeneous feature extraction.
heterogeneous feature extraction module
The heterogeneous feature extraction module takes as input the detection boxes left unmatched after data association and the targets whose trajectories terminated at the current frame. The module comprises an appearance feature extraction sub-network, a spatial relation extraction sub-network and a temporal relation extraction sub-network; it extracts the appearance, spatial and temporal relations of each target and fuses them to obtain the target features, and extracts the appearance and spatial relations of each detection box and fuses them to obtain the detection features. The fused target features and detection features are passed to the similarity measurement module.
similarity measurement module
This module processes the target features and detection features produced by the heterogeneous feature extraction module: it computes the Euclidean distance between a target feature and a detection feature, compares it with a given threshold, and judges whether an unmatched detection box from the data association module belongs to a terminated trajectory. If the distance is below the threshold, the detection is added to that target's terminated trajectory; otherwise, the detection box is initialized as a new target.
In the embodiment provided by the invention, as shown in FIG. 3, the overall processing flow is as follows.
In step S1, the specific process is: given a video sequence on which multi-target tracking is to be performed, extract the detection boxes present in each frame of the sequence with a public target detection algorithm, and input the video sequence together with the obtained detection boxes into the model.
In step S2, the specific process includes:
letting t denote the frame index of the original video sequence, starting from t = 1 until all frames of the sequence have been processed;
when t = 1, initializing targets with all detection boxes of frame 1 to obtain the initial target positions;
when t > 1, performing data association based on the target positions in frame t-1: fine-tuning the target positions from frame t-1 with an optical flow map, then regressing the fine-tuned positions with a linear regressor to obtain the target positions in frame t (a sketch of this fine-tuning step is given after this list);
and performing position discrimination on the target positions of frame t with a binary classifier.
Further, the discrimination process specifically includes:
scoring the target positions of frame t with the binary classifier; if the position score of a target is below a preset decision threshold, the target is judged terminated and added to the set of targets with terminated trajectories; otherwise it is judged active and added to the active target set, and the data association of the active targets continues at the next frame;
and computing the intersection-over-union between the detection boxes of frame t and the positions of the active targets, to obtain the detection boxes that cannot be matched to an active target.
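One simple way to realize the optical-flow fine-tuning above is to translate each box from frame t-1 by the mean flow inside it, as sketched below; the exact adjustment rule is not specified in the patent, so this is an assumption. The shifted box would then be refined by the linear regressor, e.g. the bounding-box regression head of the detector.

    import numpy as np

    def shift_box_by_flow(box, flow):
        # box: (x1, y1, x2, y2) target position in frame t-1;
        # flow: (H, W, 2) dense optical flow from frame t-1 to frame t.
        x1, y1, x2, y2 = (int(round(v)) for v in box)
        h, w = flow.shape[:2]
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2, w), min(y2, h)
        if x2 <= x1 or y2 <= y1:
            return box                    # box fell outside the frame; keep as-is
        dx, dy = flow[y1:y2, x1:x2].reshape(-1, 2).mean(axis=0)
        return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)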
The applicant found in experiments that the environment in multi-target tracking videos is complex and the number of targets large, with frequent occlusions between targets or by surrounding buildings. As a result, an occluded target's trajectory may be terminated and vanish for the remainder of the video sequence, or the target may disappear for a period of time and then reappear, receiving a new ID on reappearance, which harms the overall multi-target tracking effect. To solve these problems, the invention adds the heterogeneous feature extraction and similarity measurement modules: after data association is finished, it judges whether new detections appear in the video frame, extracts the heterogeneous features of the new detections and of the targets with terminated trajectories for similarity measurement, and judges whether a new detection belongs to a previously occluded target whose trajectory was terminated.
As shown in FIG. 4, the heterogeneous graph network consists of a convolutional neural network, a spatial relation graph network and a temporal relation graph network; the convolutional neural network extracts the appearance features of targets and detections, while the spatial and temporal relation graph networks encode the spatial and temporal relations.
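To make the structure of FIG. 4 concrete, the following is a minimal PyTorch sketch of the three-branch idea. The branch widths, the single round of message passing, and the mean fusion are illustrative assumptions; the patent names the components (a CNN for appearance, spatial and temporal relation graph networks) but not these details. For detection boxes, which carry no temporal relation, the temporal branch would be dropped and the remaining two averaged.

    import torch
    import torch.nn as nn

    class HeteroGraphFeatures(nn.Module):
        def __init__(self, dim=128):
            super().__init__()
            self.cnn = nn.Sequential(                 # appearance branch
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
            self.spatial = nn.Linear(dim, dim)        # spatial relation branch
            self.temporal = nn.Linear(dim, dim)       # temporal relation branch

        def forward(self, crops, adj_s, adj_t):
            # crops: (N, 3, H, W) image patches of the N nodes (targets/boxes);
            # adj_s / adj_t: (N, N) row-normalized spatial / temporal adjacency.
            h = self.cnn(crops)                       # appearance features
            hs = torch.relu(self.spatial(adj_s @ h))  # spatial message passing
            ht = torch.relu(self.temporal(adj_t @ h)) # temporal message passing
            return (h + hs + ht) / 3.0                # assumed mean fusion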
In the heterogeneous feature extraction step, the specific process is as follows:
extracting, through the heterogeneous graph network, the appearance features, spatial relation features and temporal relation features of the targets of frame t judged to have terminated trajectories;
fusing the appearance, spatial relation and temporal relation features of those targets to obtain the target features;
after the intersection-over-union computation is completed, if every detection box has been matched one-to-one to a target position, the processing of the current frame is complete; otherwise, extracting through the heterogeneous graph network the appearance features and spatial relation features of the detection boxes that cannot be matched to an active target;
and fusing the appearance and spatial relation features of those unmatched detection boxes to obtain the detection features.
Compared with the prior art, the invention adds heterogeneous feature extraction and similarity measurement modules: the proposed heterogeneous graph network extracts the heterogeneous features of targets and detections, and similarity measurement over those features then judges whether a detection belongs to a target whose trajectory has terminated. The environment in multi-target tracking videos is complex and the number of targets large, and frequent occlusion between targets or by surrounding buildings causes a target trajectory to disappear for a period of time and then reappear within the video sequence. The two modules are added after data association mainly to solve this target occlusion problem: the data association provided by the invention already achieves a good tracking effect when targets are not occluded, but once a target is occluded, a new target ID would otherwise be assigned when its trajectory reappears, which harms the overall multi-target tracking effect.
Further, the similarity measurement process specifically includes:
computing the Euclidean distance between a detection feature and a target feature; if the distance is below a preset threshold, the unmatched detection box is considered to belong to the target with the terminated trajectory, is added to that target's trajectory, and the target is set active and subjected to the data association processing at frame t+1; otherwise, the detection box that cannot be matched to an active target position is considered not to belong to any previously terminated target, and is initialized to obtain a new target ID;
and finally, judging whether the original video sequence has ended; if so, outputting the tracking trajectories of all targets, otherwise returning to step S2 for frame t+1.
In summary, the multi-target tracking method based on a heterogeneous graph network provided by the invention uses discriminative feature learning. It first extracts the possible target positions, i.e., the detection boxes, in all frames of the video sequence using the public target detection algorithm Faster R-CNN. It then performs data association with an optical flow map and a linear regressor; this association scheme is simple in structure, convenient to operate, and achieves high accuracy when associating targets across adjacent frames. To further improve tracking performance, the invention judges after data association whether new detection boxes appear in a video frame, extracts and fuses the appearance, spatial and temporal relation features of the new detection boxes and of the targets with terminated trajectories using the heterogeneous graph network, and finally measures the similarity between detection features and target features to judge whether a new detection belongs to a target whose trajectory has terminated. Data association and heterogeneous feature extraction complement each other, jointly improving the representation and discrimination ability of the model and thus the performance of multi-target tracking.
Those of ordinary skill in the art will appreciate that: the drawing is a schematic diagram of one embodiment and the modules or flows in the drawing are not necessarily required to practice the invention.
From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.
In this specification, the embodiments are described progressively; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described only briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment's solution. Those of ordinary skill in the art can understand and implement this without inventive effort.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (4)

1. A multi-target tracking method based on a heterogeneous graph network, comprising:
S1, extracting the detection boxes of each frame with a public target detection algorithm, based on the original video sequence;
S2, obtaining the position of each target in each frame image through data association processing, based on the original video sequence and the detection boxes;
S3, obtaining target features and detection features through heterogeneous graph network processing, based on the target positions and the detection boxes, which specifically comprises:
extracting, through the heterogeneous graph network, the appearance features, spatial relation features and temporal relation features of the targets of frame t judged to have terminated trajectories;
fusing the appearance, spatial relation and temporal relation features of those targets to obtain the target features;
extracting, through the heterogeneous graph network, the appearance features and spatial relation features of the detection boxes that cannot be matched to an active target;
fusing the appearance and spatial relation features of those unmatched detection boxes to obtain the detection features;
S4, performing similarity measurement on the target features and the detection features and judging whether a detection box belongs to some terminated target; if so, adding the detection box to that terminated target and setting the terminated target to the active state; otherwise, initializing a new target for the detection box; then judging whether the current frame is the last frame of the video; if so, ending the method, otherwise returning to step S2.
2. The method of claim 1, wherein obtaining the position of each target in each frame image through data association processing, based on the original video sequence and the detection boxes, comprises:
letting t denote the frame index of the original video sequence;
when t = 1, initializing targets with all detection boxes of frame 1 to obtain the initial target positions;
when t > 1, performing data association based on the box positions in frame t-1 to obtain the target positions in frame t;
and performing position discrimination on the target positions of frame t with a binary classifier.
3. The method of claim 2, wherein when t > 1, performing data association based on the box positions in frame t-1 comprises: adjusting the box positions from frame t-1 using the optical flow map between the adjacent video frames, and then applying a linear regressor to the adjusted boxes to obtain the target positions in frame t;
and wherein the position discrimination on the target positions of frame t with the binary classifier comprises:
scoring the target positions of frame t with the binary classifier; if the position score of a target is below a preset threshold, judging that target terminated, otherwise judging it active;
and computing the intersection-over-union between the detection boxes of frame t and the positions of the active targets, to obtain the detection boxes that cannot be matched to any target position.
4. The method of claim 1, wherein performing similarity measurement on the target features and the detection features and judging whether a detection box belongs to some terminated target specifically comprises:
computing the Euclidean distance between the detection feature and the target feature; if the distance is below a preset threshold, adding the unmatched detection box to the terminated target trajectory whose distance falls below the threshold, setting that target active, and performing the data association processing on it at frame t+1; otherwise, initializing the unmatched detection box to obtain a new target ID;
and judging whether the original video sequence has ended; if so, outputting the tracking trajectory of the target, otherwise continuing with step S2 for frame t+1.
CN202010712454.7A 2020-07-22 2020-07-22 Multi-target tracking method based on heterogeneous graph network Active CN112001252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010712454.7A CN112001252B (en) 2020-07-22 2020-07-22 Multi-target tracking method based on heterogeneous graph network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010712454.7A CN112001252B (en) 2020-07-22 2020-07-22 Multi-target tracking method based on heterogeneous graph network

Publications (2)

Publication Number Publication Date
CN112001252A CN112001252A (en) 2020-11-27
CN112001252B (en) 2024-04-12

Family

ID=73468031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010712454.7A Active CN112001252B (en) 2020-07-22 Multi-target tracking method based on heterogeneous graph network

Country Status (1)

Country Link
CN (1) CN112001252B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113744316A (en) * 2021-09-08 2021-12-03 电子科技大学 Multi-target tracking method based on deep neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330920A * 2017-06-28 2017-11-07 华中科技大学 Surveillance video multi-target tracking method based on deep learning
CN109360226A * 2018-10-17 2019-02-19 武汉大学 Multi-object tracking method based on time-series multi-feature fusion
CN109800689A * 2019-01-04 2019-05-24 西南交通大学 Target tracking method based on spatio-temporal feature fusion learning
CN109993770A * 2019-04-09 2019-07-09 西南交通大学 Target tracking method with adaptive spatio-temporal learning and state recognition
CN111161311A * 2019-12-09 2020-05-15 中车工业研究院有限公司 Visual multi-target tracking method and device based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9582895B2 (en) * 2015-05-22 2017-02-28 International Business Machines Corporation Real-time object analysis with occlusion handling

Also Published As

Publication number Publication date
CN112001252A (en) 2020-11-27

Similar Documents

Publication Publication Date Title
EP3667557B1 (en) Method and device for tracking an object
WO2021017291A1 (en) Darkflow-deepsort-based multi-target tracking detection method, device, and storage medium
Huang et al. Robust object tracking by hierarchical association of detection responses
Leal-Taixé et al. Learning an image-based motion context for multiple people tracking
Marsden et al. Holistic features for real-time crowd behaviour anomaly detection
CN112634326A (en) Target tracking method and device, electronic equipment and storage medium
Tay et al. A robust abnormal behavior detection method using convolutional neural network
CN113011367A (en) Abnormal behavior analysis method based on target track
CN111626194A (en) Pedestrian multi-target tracking method using depth correlation measurement
CN104992453A (en) Target tracking method under complicated background based on extreme learning machine
KR20190023389A (en) Multi-Class Multi-Object Tracking Method using Changing Point Detection
Denman et al. Multi-spectral fusion for surveillance systems
Al-Shakarji et al. Robust multi-object tracking with semantic color correlation
CN111767847A (en) Pedestrian multi-target tracking method integrating target detection and association
CN108830204B (en) Method for detecting abnormality in target-oriented surveillance video
Han et al. A method based on multi-convolution layers joint and generative adversarial networks for vehicle detection
CN112001252B (en) Multi-target tracking method based on different composition network
Atghaei et al. Abnormal event detection in urban surveillance videos using GAN and transfer learning
CN114332157A (en) Long-term tracking method controlled by double thresholds
Xu et al. Improved anomaly detection in surveillance videos with multiple probabilistic models inference
Leyva et al. Video anomaly detection based on wake motion descriptors and perspective grids
CN115188081B (en) Complex scene-oriented detection and tracking integrated method
Peng et al. Tracklet siamese network with constrained clustering for multiple object tracking
Avgerinakis et al. Moving camera human activity localization and recognition with motionplanes and multiple homographies
Han et al. Multi-target tracking based on high-order appearance feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant