CN115457082A - Pedestrian multi-target tracking algorithm based on multi-feature fusion enhancement

Pedestrian multi-target tracking algorithm based on multi-feature fusion enhancement

Info

Publication number: CN115457082A
Application number: CN202211067913.6A
Authority: CN (China)
Prior art keywords: pedestrian, feature, target, feature fusion, target tracking
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 周彦, 陈俊宇, 王冬丽, 杜镇楠
Original and current assignee: Xiangtan University
Priority and filing date: 2022-09-01
Publication date: 2022-12-09
Application filed by Xiangtan University


Classifications

    • G06T7/248: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V10/762: Image or video recognition or understanding using pattern recognition or machine learning; clustering, e.g. of similar faces in social networks
    • G06V10/82: Image or video recognition or understanding using pattern recognition or machine learning; neural networks
    • G06T2207/20081: Indexing scheme for image analysis or enhancement; training, learning
    • G06T2207/20084: Indexing scheme for image analysis or enhancement; artificial neural networks [ANN]
    • G06T2207/30196: Subject of image; human being, person
    • G06T2207/30241: Subject of image; trajectory

Abstract

The invention discloses a pedestrian multi-target tracking algorithm based on multi-feature fusion enhancement, which integrates the three modules of feature extraction, target detection and data association into a single network architecture. The algorithm takes two consecutive frames, linked as one node, as input. To strengthen the ResNet50 backbone network, the conventional convolution is replaced by the Inception convolution, enlarging the receptive field of the feature extraction network. To improve adaptability to target deformation, the extracted features undergo weighted bidirectional pyramid fusion, directing more attention to important feature information. To let the network better handle pedestrian targets in crowded scenes, a context-sensitive prediction module is added: wider and deeper convolutional prediction modules are placed on top of layers with different strides to enhance the information. Finally, the bounding-box pairs of the same target in adjacent frames are regressed, similarity is compared in the common frame, data association is performed with IOU matching, and the tracking trajectories are output. The algorithm improves the accuracy of multi-target tracking and can meet the requirements of pedestrian target tracking in the video surveillance field.

Description

Pedestrian multi-target tracking algorithm based on multi-feature fusion enhancement
Technical Field
The invention relates to the technical field of intelligent monitoring, and in particular to a pedestrian multi-target tracking algorithm based on multi-feature fusion enhancement.
Background
With the continuous improvement of living standards and the growth of the population, urban surveillance in China faces many problems; for example, criminals may hide in crowds in shopping malls and must be tracked afterwards, so maintaining social stability has become an important task for urban monitoring departments. Pedestrian tracking is a key enabling technology for urban safety, with rich research value in real scenes: by detecting pedestrian targets and tracking them over time, it can provide important information for urban security. Pedestrian multi-target tracking has therefore become a key problem to be solved for urban safety.
Multi-target tracking has developed rapidly and made great progress, but most existing solutions in the field follow a two-stage tracking-by-detection paradigm, which splits multi-target tracking into two independent tasks: detection and association. An existing detector first obtains the bounding boxes of the objects in each frame; data association then links the boxes across frames into trajectories, embedding an identity during association to distinguish the objects. This two-step pipeline suggests two ways to improve tracking performance: enhancing detection, or enhancing data association. Occlusion during tracking causes objects to overlap, leading to missed detections and increasing the difficulty of data association. Object detection in consecutive video frames struggles to produce stable, reliable results, and the temporal connection between adjacent frames is easily lost. Separating the two tasks achieves good accuracy but poor real-time performance, which does not satisfy practical scenarios. With a simple model and low computational cost, a deep-learning-based multi-target tracking method can effectively meet real-time requirements while improving tracking performance.
Disclosure of Invention
In view of this, the present invention provides a pedestrian multi-target tracking algorithm based on multi-feature fusion enhancement that avoids the disadvantages of the prior art, improves the efficiency of pedestrian multi-target tracking, and can meet the requirements of the city monitoring field.
The purpose of the invention is realized by the following technical scheme. The pedestrian multi-target tracking algorithm with multi-feature fusion enhancement comprises the following steps:
S1, performing image preprocessing on a video frame sequence of a pedestrian video set and extracting feature information;
S2, applying further feature fusion to the extracted feature information to obtain richer target features;
S3, using a context-sensitive prediction module so that pedestrian targets in crowded scenes can be handled well and the information is enhanced;
S4, calculating the regression box pairs of the same target in adjacent frames, performing IOU similarity matching, and outputting the tracking trajectories.
As a further improvement, the image preprocessing in step S1 includes an improvement to the feature extraction: an Inception convolution is added to replace the conventional convolution, strengthening the perception region with a more flexible search space, which improves the search capability of the network and gives the model an effective receptive field.
As a further improvement, the improved feature extraction in this step adopts the Inception convolution, whose dilation pattern space is:

$$\Delta = \left\{ (d_x^1, d_y^1),\, (d_x^2, d_y^2),\, \ldots,\, (d_x^{C_{out}}, d_y^{C_{out}}) \right\}, \quad d_x^i, d_y^i \in \{1, 2, \ldots, d_{\max}\}$$

where $d_x^i$ and $d_y^i$ are the dilations of the filter along the x-axis and y-axis of the i-th output channel, ranging from 1 to $d_{\max}$, and $C_{out}$ is the number of output channels.
As a further improvement, the feature fusion in step S2 uses a weighted bidirectional feature pyramid structure. First, nodes with only one input are deleted: a node with a single input edge contributes little to feature fusion, and deleting such redundant nodes simplifies the network. Second, an extra edge is added directly from each original input node to the corresponding output node, fusing more useful feature information without adding much computational cost. Third, by repeatedly stacking top-down and bottom-up multi-scale fusion, the importance of different features can be learned effectively, realizing higher-level feature fusion.
As a further improvement, this step improves the feature fusion by adopting a weighted bidirectional feature pyramid. Given the multi-scale feature layers $\vec{P}^{in} = (P^{in}_{l_1}, P^{in}_{l_2}, \ldots)$, where $P^{in}_{l_i}$ represents the feature at level $l_i$, an operation $f$ aggregates the different features and outputs the new features $\vec{P}^{out} = f(\vec{P}^{in})$. Taking layer 6 as an example, the calculation formulas are:

$$P^{td}_6 = \mathrm{Conv}\!\left(\frac{w_1 \cdot P^{in}_6 + w_2 \cdot \mathrm{Resize}(P^{in}_7)}{w_1 + w_2 + \varepsilon}\right)$$

$$P^{out}_6 = \mathrm{Conv}\!\left(\frac{w'_1 \cdot P^{in}_6 + w'_2 \cdot P^{td}_6 + w'_3 \cdot \mathrm{Resize}(P^{out}_5)}{w'_1 + w'_2 + w'_3 + \varepsilon}\right)$$

where $P^{td}_6$ is the intermediate top-down feature, $P^{out}_6$ is the bottom-up output feature, and $\varepsilon = 0.0001$ is set to avoid numerical instability during the calculation; the remaining features are computed in a similar manner.
As a further improvement, step S3 includes the context-sensitive prediction module. In crowded scenes, pedestrian targets are hard to distinguish by appearance; placing wider and deeper convolutional prediction modules on top of layers with different strides increases the receptive field, so the prediction modules obtain better classification and localization and enhance the information.
As a further improvement, step S4 forms the final tracking trajectory: IOU matching is performed on the regression box pairs of the same target in adjacent frames, pairs with high similarity are confirmed as the same target, and the tracking trajectory is output.
In view of the above, the invention provides a pedestrian multi-target tracking algorithm enhanced by multi-feature fusion. First, feature information is extracted with a ResNet50 backbone based on Inception convolution; the extracted features are fed into a weighted bidirectional feature pyramid to generate a multi-scale feature representation, and a context-sensitive prediction enhancement operation is then applied to the multi-scale features. To associate the same target across adjacent frames, the adjacent multi-scale feature maps generated by the backbone are concatenated and fed into a prediction network that regresses bounding-box pairs. K-means clustering is performed on all ground-truth bounding boxes, and each cluster is assigned to a corresponding pyramid level for subsequent scale-specific prediction. The prediction network comprises three branches: a classification branch, an identity verification branch, and a box-pair regression branch. The classification branch predicts a confidence score for each foreground region to decide whether it is target or background; the identity verification branch judges whether a detection box pair belongs to the same target; and the regression branch predicts the coordinates of the box pair of the same target. The three branches use joint attention: the prediction confidence maps of the classification branch and the identity verification branch serve as complementary attention maps focusing on the effective information, are multiplied with the combined features, and are fed into the regression branch. This makes the regression pay more attention to the pedestrian targets, avoids interference from irrelevant information when matching regression boxes, and to some extent ensures the accuracy of the detection box pairs of the same target. The algorithm improves the accuracy of pedestrian multi-target tracking and can be applied to tracking pedestrians in daily scenes.
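To make the clustering step concrete, here is a minimal sketch of K-means over ground-truth box sizes, with each cluster assigned to a pyramid level by scale. The box sizes, the number of levels (P3 to P7), and the scale-based assignment rule are illustrative assumptions, not details taken from the patent.

```python
# Hypothetical sketch: cluster ground-truth [width, height] box sizes with
# K-means and map each cluster to a pyramid level by scale (assumed rule).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
wh = rng.uniform(8, 256, size=(500, 2))  # stand-in ground-truth box sizes

k = 5  # one cluster per pyramid level, assumed P3..P7
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(wh)

# Rank clusters by mean scale (sqrt of area); smaller boxes are assigned to
# the higher-resolution levels (P3), larger boxes to the coarser ones (P7).
scale = np.sqrt(km.cluster_centers_.prod(axis=1))
level_of_cluster = {c: f"P{3 + rank}" for rank, c in enumerate(np.argsort(scale))}

for box_wh, c in zip(wh[:3], km.labels_[:3]):
    print(box_wh.round(1), "->", level_of_cluster[c])
```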
Drawings
The invention is further illustrated by the attached drawings; the embodiments in the drawings do not constitute any limitation to the invention, and a person skilled in the art can obtain other drawings from the following drawings without inventive effort.
FIG. 1 is an overall framework diagram of the multi-feature fusion enhanced pedestrian multi-target tracking algorithm;
FIG. 2 is a structural diagram of the Inception convolution;
FIG. 3 is a structural diagram of the weighted bidirectional feature pyramid;
FIG. 4 is a structural diagram of the context-sensitive prediction module;
FIG. 5 shows the tracking results of the algorithm of the present invention under a fixed camera;
FIG. 6 shows the tracking results of the algorithm of the present invention under a moving camera.
Detailed Description
To help those skilled in the art better understand the technical solution of the present invention, the invention is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present application, and the features within the embodiments, can be combined with each other provided they do not conflict.
Referring to fig. 1, the overall framework of the algorithm, an implementation of the present invention provides a multi-feature fusion enhanced pedestrian multi-target tracking algorithm, which includes the following steps:
S1, performing image preprocessing on a video frame sequence of a pedestrian video set and extracting feature information;
The image preprocessing of this step includes an improvement to feature extraction: the Inception convolution is added to replace the conventional convolution, and its structure is shown in fig. 2. This convolution strengthens the perception region with a more flexible search space and improves the search capability of the network, so that the model can have an effective receptive field.
Specifically, the Inception convolution space contains multiple dilation patterns, with the dilation of each axis, each channel, and each convolution layer being independent, providing a dense range of effective receptive fields. The Inception convolution has independent dilations for the two axes in each channel, expressed in the form:
$$\Delta = \left\{ (d_x^1, d_y^1),\, (d_x^2, d_y^2),\, \ldots,\, (d_x^{C_{out}}, d_y^{C_{out}}) \right\}, \quad d_x^i, d_y^i \in \{1, 2, \ldots, d_{\max}\}$$

where $d_x^i$ and $d_y^i$ are the dilations of the filter along the x-axis and y-axis of the i-th output channel, ranging from 1 to $d_{\max}$, and $C_{out}$ is the number of output channels.
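As an illustration of the formula above, the following PyTorch sketch applies a 3×3 convolution whose (d_x, d_y) dilation is chosen independently for each output channel. The random dilation assignment stands in for the searched dilation pattern, and the per-channel loop is written for clarity rather than speed; none of this is the patent's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InceptionConv2d(nn.Module):
    """3x3 convolution with an independent (d_x, d_y) dilation per output channel."""

    def __init__(self, in_channels, out_channels, d_max=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_channels, in_channels, 3, 3) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_channels))
        # One (d_y, d_x) pair per output channel, each drawn from {1, ..., d_max};
        # random here, standing in for the searched dilation pattern.
        self.dilations = [
            (int(torch.randint(1, d_max + 1, (1,))), int(torch.randint(1, d_max + 1, (1,))))
            for _ in range(out_channels)
        ]

    def forward(self, x):
        outs = []
        for i, (dy, dx) in enumerate(self.dilations):
            # padding == dilation keeps the "same" spatial size for a 3x3 kernel
            outs.append(F.conv2d(x, self.weight[i:i + 1], self.bias[i:i + 1],
                                 padding=(dy, dx), dilation=(dy, dx)))
        return torch.cat(outs, dim=1)

feat = InceptionConv2d(64, 128)(torch.randn(1, 64, 56, 56))
print(feat.shape)  # torch.Size([1, 128, 56, 56])
```

In a real network the per-channel convolutions would be fused into grouped kernels for efficiency; the loop form simply makes the independent per-channel dilations visible.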
S2, applying further feature fusion to the extracted feature information to obtain richer target features;
The feature fusion of this step uses a weighted bidirectional feature pyramid structure, shown in fig. 3. First, nodes with only one input are deleted: a node with a single input edge contributes little to feature fusion, and deleting such redundant nodes simplifies the network. Second, an extra edge is added directly from each original input node to the corresponding output node, fusing more useful feature information without adding much computational cost. Third, by repeatedly stacking top-down and bottom-up multi-scale fusion, the importance of different features can be learned effectively, realizing higher-level feature fusion.
Specifically, this step improves the feature fusion by repeatedly stacking the weighted bidirectional feature pyramid three times. Given the multi-scale feature layers $\vec{P}^{in} = (P^{in}_{l_1}, P^{in}_{l_2}, \ldots)$, where $P^{in}_{l_i}$ represents the feature at level $l_i$, an operation $f$ aggregates the different features and outputs the new features $\vec{P}^{out} = f(\vec{P}^{in})$. Taking layer 6 as an example, the calculation formulas are:

$$P^{td}_6 = \mathrm{Conv}\!\left(\frac{w_1 \cdot P^{in}_6 + w_2 \cdot \mathrm{Resize}(P^{in}_7)}{w_1 + w_2 + \varepsilon}\right)$$

$$P^{out}_6 = \mathrm{Conv}\!\left(\frac{w'_1 \cdot P^{in}_6 + w'_2 \cdot P^{td}_6 + w'_3 \cdot \mathrm{Resize}(P^{out}_5)}{w'_1 + w'_2 + w'_3 + \varepsilon}\right)$$

where $P^{td}_6$ is the intermediate top-down feature, $P^{out}_6$ is the bottom-up output feature, and $\varepsilon = 0.0001$ is set to avoid numerical instability during the calculation; the remaining features are computed in a similar manner.
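The weighted fusion of the two formulas above can be sketched as follows for the level-6 node. This is a minimal illustration: the channel count and feature sizes are assumptions, the Conv applied after each fusion is omitted, and interpolation and max-pooling stand in for the Resize operations.

```python
import torch
import torch.nn.functional as F

eps = 1e-4  # the epsilon = 0.0001 from the text

def fuse(features, weights):
    """Normalized weighted fusion: sum_i(w_i * F_i) / (sum_i w_i + eps), w_i >= 0."""
    w = F.relu(weights)
    return sum(wi * fi for wi, fi in zip(w, features)) / (w.sum() + eps)

# Toy P5/P6/P7 inputs at strides 32/64/128 of a 512x512 image (assumed sizes).
p5_in, p6_in, p7_in = (torch.randn(1, 64, s, s) for s in (16, 8, 4))

w_td = torch.nn.Parameter(torch.ones(2))   # learnable top-down fusion weights
w_out = torch.nn.Parameter(torch.ones(3))  # learnable bottom-up fusion weights

# Intermediate top-down feature at level 6: fuse P6_in with upsampled P7_in.
p6_td = fuse([p6_in, F.interpolate(p7_in, scale_factor=2)], w_td)

# Bottom-up output at level 6: fuse P6_in, P6_td and the downsampled level-5 output.
p5_out = p5_in  # stand-in for the already-computed level-5 output feature
p6_out = fuse([p6_in, p6_td, F.max_pool2d(p5_out, 2)], w_out)
print(p6_out.shape)  # torch.Size([1, 64, 8, 8])
```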
S3, using a context-sensitive prediction module so that pedestrian targets in crowded scenes can be handled well and the information is enhanced;
This step includes the context-sensitive prediction module, whose architecture is shown in fig. 4. In crowded scenes, pedestrian targets are hard to distinguish by appearance; placing wider and deeper convolutional prediction modules on top of layers with different strides increases the receptive field, so the prediction modules obtain better classification and localization and enhance the information.
Specifically, several 3×3 and 1×1 convolution operations are added in the module, increasing the receptive fields corresponding to the different strides of the network. By letting the network outputs share the benefits of a wider and deeper network, the prediction module becomes deeper and wider, and better features can be obtained for the subsequent subtasks.
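A hedged sketch of such a wider and deeper prediction head follows; the branch widths and depths are illustrative assumptions rather than the patent's exact configuration. Two parallel 3×3 branches of different depth widen the head, and a 1×1 convolution fuses them.

```python
import torch
import torch.nn as nn

class ContextSensitiveHead(nn.Module):
    """Wider (parallel branches) and deeper (stacked 3x3 convs) prediction head."""

    def __init__(self, channels=256):
        super().__init__()
        self.branch1 = nn.Sequential(  # shallow branch
            nn.Conv2d(channels, channels // 2, 3, padding=1), nn.ReLU(inplace=True))
        self.branch2 = nn.Sequential(  # deeper branch with a larger receptive field
            nn.Conv2d(channels, channels // 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, channels // 2, 3, padding=1), nn.ReLU(inplace=True))
        self.project = nn.Conv2d(channels, channels, 1)  # 1x1 fusion of the branches

    def forward(self, x):
        y = torch.cat([self.branch1(x), self.branch2(x)], dim=1)
        return self.project(y)

head = ContextSensitiveHead()
print(head(torch.randn(1, 256, 32, 32)).shape)  # torch.Size([1, 256, 32, 32])
```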
S4, calculating the regression box pairs of the same target in adjacent frames, performing IOU similarity matching, and outputting the tracking trajectories.
This step forms the final tracking trajectory: IOU matching is performed on the regression box pairs of the same target in adjacent frames, pairs with high similarity are confirmed as the same target, and the tracking trajectory is output.
Specifically, the bounding boxes of the same target change only slightly between adjacent frames; an affinity matrix is obtained by computing the IOU, the Hungarian algorithm is applied to find the optimal matching of the detection boxes of the same target, and the trajectories of successfully matched targets are updated.
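This matching step can be sketched as follows, assuming axis-aligned [x1, y1, x2, y2] boxes and an illustrative IOU threshold of 0.3; the Hungarian step uses SciPy's linear_sum_assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IOU between two (N, 4) arrays of [x1, y1, x2, y2] boxes."""
    a, b = boxes_a[:, None, :], boxes_b[None, :, :]
    lt = np.maximum(a[..., :2], b[..., :2])          # intersection top-left
    rb = np.minimum(a[..., 2:], b[..., 2:])          # intersection bottom-right
    inter = np.clip(rb - lt, 0, None).prod(axis=-1)  # intersection area
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area_a + area_b - inter + 1e-9)

def match(prev_boxes, curr_boxes, iou_threshold=0.3):
    """Hungarian matching on the IOU affinity; low-IOU pairs are discarded."""
    affinity = iou_matrix(prev_boxes, curr_boxes)
    rows, cols = linear_sum_assignment(-affinity)    # maximize total IOU
    return [(r, c) for r, c in zip(rows, cols) if affinity[r, c] >= iou_threshold]

prev = np.array([[10, 10, 50, 90], [60, 20, 100, 100]], dtype=float)
curr = np.array([[62, 22, 102, 102], [12, 12, 52, 92]], dtype=float)
print(match(prev, curr))  # [(0, 1), (1, 0)]
```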
In summary, the multi-feature fusion enhanced pedestrian multi-target tracking algorithm has the following advantages:
1) In the image preprocessing step, the conventional convolution is replaced by the Inception convolution, improving the search capability of the backbone network and allowing the network to fit an effective receptive field to the video dataset.
2) In the image feature fusion step, an efficient weighted bidirectional feature pyramid network is adopted, effectively alleviating problems such as the loss of feature information during propagation and poor target identification and localization, and realizing higher-level feature fusion.
3) In the step of processing pedestrian targets in crowded scenes, a context-sensitive prediction module is adopted. Since the nature and size of detected objects are easily influenced by environmental factors, and pedestrian targets are especially hard to distinguish by appearance in crowded scenes, this module enhances the information well and alleviates the problem.
In the description above, numerous specific details are set forth to provide a thorough understanding of the present invention; however, the present invention may be practiced in other ways than those specifically described herein, and the description should therefore not be construed as limiting the scope of the present invention.
In conclusion, although the present invention has been described with reference to the preferred embodiments, various changes and modifications may be made by those skilled in the art without departing from the scope of the present invention, and such changes and modifications shall be included in the scope of the present invention.

Claims (7)

1. A pedestrian multi-target tracking algorithm based on multi-feature fusion enhancement, characterized by comprising the following steps:
S1, performing image preprocessing on a video frame sequence of a pedestrian video set and extracting feature information;
S2, applying further feature fusion to the extracted feature information to obtain richer target features;
S3, using a context-sensitive prediction module so that pedestrian targets in crowded scenes can be handled well and the information is enhanced;
S4, calculating the regression box pairs of the same target in adjacent frames, performing IOU similarity matching, and outputting the tracking trajectories.
2. The multi-feature fusion enhanced pedestrian multi-target tracking algorithm of claim 1, wherein the image preprocessing in step S1 includes an improvement to the feature extraction: an Inception convolution is added to replace the conventional convolution, strengthening the perception region with a more flexible search space, improving the search capability of the network, and giving the model an effective receptive field.
3. The pedestrian multi-target tracking algorithm of claim 2, wherein the improved feature extraction adopts the Inception convolution, whose dilation pattern space is:

$$\Delta = \left\{ (d_x^1, d_y^1),\, (d_x^2, d_y^2),\, \ldots,\, (d_x^{C_{out}}, d_y^{C_{out}}) \right\}, \quad d_x^i, d_y^i \in \{1, 2, \ldots, d_{\max}\}$$

where $d_x^i$ and $d_y^i$ are the dilations of the filter along the x-axis and y-axis of the i-th output channel, ranging from 1 to $d_{\max}$, and $C_{out}$ is the number of output channels.
4. The multi-feature fusion enhanced pedestrian multi-target tracking algorithm of claim 1, wherein the feature fusion in step S2 uses a weighted bidirectional feature pyramid structure: first, nodes with only one input are deleted, since a node with a single input edge contributes little to feature fusion and deleting such redundant nodes simplifies the network; second, an extra edge is added directly from each original input node to the corresponding output node, fusing more useful feature information without adding much computational cost; third, by repeatedly stacking top-down and bottom-up multi-scale fusion, the importance of different features can be learned effectively, realizing higher-level feature fusion.
5. The pedestrian multi-target tracking algorithm of claim 4, wherein the improved feature fusion adopts a weighted bidirectional feature pyramid: given the multi-scale feature layers $\vec{P}^{in} = (P^{in}_{l_1}, P^{in}_{l_2}, \ldots)$, where $P^{in}_{l_i}$ represents the feature at level $l_i$, an operation $f$ aggregates the different features and outputs the new features $\vec{P}^{out} = f(\vec{P}^{in})$; taking layer 6 as an example, the calculation formulas are:

$$P^{td}_6 = \mathrm{Conv}\!\left(\frac{w_1 \cdot P^{in}_6 + w_2 \cdot \mathrm{Resize}(P^{in}_7)}{w_1 + w_2 + \varepsilon}\right)$$

$$P^{out}_6 = \mathrm{Conv}\!\left(\frac{w'_1 \cdot P^{in}_6 + w'_2 \cdot P^{td}_6 + w'_3 \cdot \mathrm{Resize}(P^{out}_5)}{w'_1 + w'_2 + w'_3 + \varepsilon}\right)$$

where $P^{td}_6$ is the intermediate top-down feature, $P^{out}_6$ is the bottom-up output feature, and $\varepsilon = 0.0001$ is set to avoid numerical instability during the calculation; the remaining features are computed in a similar manner.
6. The multi-feature fusion enhanced pedestrian multi-target tracking algorithm of claim 1, wherein step S3 comprises a context-sensitive prediction module: in crowded scenes, pedestrian targets are hard to distinguish by appearance, and placing wider and deeper convolutional prediction modules on top of layers with different strides increases the receptive field, so the prediction modules obtain better classification and localization and enhance the information.
7. The multi-feature fusion enhanced pedestrian multi-target tracking algorithm of claim 1, wherein step S4 forms the final tracking trajectory: IOU matching is performed on the regression box pairs of the same target in adjacent frames, pairs with high similarity are confirmed as the same target, and the tracking trajectory is output.
CN202211067913.6A, filed 2022-09-01 (priority date 2022-09-01): Pedestrian multi-target tracking algorithm based on multi-feature fusion enhancement. Published as CN115457082A (Pending).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211067913.6A (CN115457082A) | 2022-09-01 | 2022-09-01 | Pedestrian multi-target tracking algorithm based on multi-feature fusion enhancement

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202211067913.6A (CN115457082A) | 2022-09-01 | 2022-09-01 | Pedestrian multi-target tracking algorithm based on multi-feature fusion enhancement

Publications (1)

Publication Number | Publication Date
CN115457082A | 2022-12-09

Family

ID=84301808

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202211067913.6A (Pending, CN115457082A) | Pedestrian multi-target tracking algorithm based on multi-feature fusion enhancement | 2022-09-01 | 2022-09-01

Country Status (1)

Country | Link
CN (1) | CN115457082A (en)

Cited By (3)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN115620242A * | 2022-12-19 | 2023-01-17 | 城云科技(中国)有限公司 | Multi-pedestrian target re-identification method, device and application
CN117853759A * | 2024-03-08 | 2024-04-09 | 山东海润数聚科技有限公司 | Multi-target tracking method, system, equipment and storage medium
CN117853759B * | 2024-03-08 | 2024-05-10 | 山东海润数聚科技有限公司 | Multi-target tracking method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination