CN115294490A - Dynamic multi-target identification method under intermittent shielding - Google Patents

Dynamic multi-target identification method under intermittent shielding

Info

Publication number
CN115294490A
CN115294490A (application CN202210782158.3A)
Authority
CN
China
Prior art keywords
target
frame
network
representing
motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210782158.3A
Other languages
Chinese (zh)
Inventor
曹政才
李俊年
张东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Chemical Technology
Original Assignee
Beijing University of Chemical Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Chemical Technology filed Critical Beijing University of Chemical Technology
Priority to CN202210782158.3A priority Critical patent/CN115294490A/en
Publication of CN115294490A publication Critical patent/CN115294490A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a dynamic multi-target identification method under intermittent shielding (occlusion). First, continuous image sequences and video data sets for target recognition affected by intermittent shielding are collected and constructed, and are used to train and test the dynamic multi-target recognition algorithm. Second, a target identification module based on a central network (CenterNet) is designed to obtain the bounding-box, motion-offset and heatmap feature information of the targets in two adjacent frames of the input video. Third, a feature extraction module based on a gated recurrent unit (GRU) network is used to update the feature information of the targets in the current frame and, combined with the feature information of the targets in the previous frame, to obtain the motion track of each target. Finally, a target track matching module based on the Hungarian algorithm matches the targets with their motion tracks and assigns a corresponding ID value to each target, thereby realizing dynamic multi-target identification. The method can be applied to the dynamic multi-target identification problem under intermittent shielding conditions.

Description

Dynamic multi-target identification method under intermittent shielding
Technical Field
The invention relates to the field of image processing, in particular to a dynamic multi-target identification method under intermittent shielding (occlusion) of targets.
Background
Multi-target identification is a fundamental research problem in computer vision and is widely applied in fields such as intelligent monitoring, industrial inspection and human-computer interaction, so it has important research and application value; however, it still faces many challenges in complex scenes involving multiple targets, mutual shielding, obvious changes in ambient illumination and background interference.
In recent years, deep learning algorithms have further improved environment perception performance and have reached or even exceeded human recognition accuracy on tasks such as target classification, and many deep-learning-based target recognition methods have therefore been proposed. Compared with traditional methods, deep-learning-based target identification has attracted wide attention from researchers because of its stronger generalization and robustness.
Mutual shielding between targets occurs very easily during dynamic multi-target identification. Traditional methods, built on single-frame target detection, mitigate the low confidence of single-target identification by combining the target features of the preceding and following frames of the input video, which further improves recognition accuracy. However, such methods cannot adequately capture the feature information of targets between adjacent image frames, so the targets cannot be effectively matched with their motion tracks, and they perform poorly when targets are shielded for a long time.
In recent years, a series of deep-learning-based multi-target recognition methods have been proposed. For example, Zhou et al. proposed using a central network (CenterNet) for dynamic multi-target recognition at the European Conference on Computer Vision; by extracting the bounding-box, heatmap and motion-offset feature information of targets in two consecutive frames of the input video and matching the targets with their motion tracks, the method achieves high dynamic target recognition accuracy with a comparatively simple network structure. Experiments show that the method obtains excellent dynamic target recognition performance, demonstrating that deep learning algorithms perform well on dynamic target recognition. However, the method cannot adequately extract target feature information across multiple frames of the input video and has difficulty accurately identifying dynamic targets affected by intermittent shielding. A search of related technologies shows that no deep-learning-based dynamic multi-target identification method under the influence of intermittent shielding of targets has yet been reported.
Disclosure of Invention
In order to solve the problem that dynamic multi-target identification methods based on a central network cannot handle dynamic multi-target identification under intermittent shielding, the invention provides a new method for identifying dynamic multiple targets under intermittent shielding.
The invention provides a dynamic multi-target identification method under intermittent shielding, which comprises two stages of training and testing, wherein,
the training stage is realized by a target identification module based on a central network, a feature extraction module based on a gated cycle unit network and a target track matching module based on a Hungarian algorithm. The target identification module based on the central network comprises a reference network and a feature extraction module, wherein the reference network adopts a DLA-34 network structure, and the feature extraction module adopts the central network.
In the testing stage, after the training stage is finished, the video to be recognized is input into the target identification module based on the central network; after processing, the targets and their motion tracks are matched through the target track matching module based on the Hungarian algorithm, thereby realizing dynamic multi-target identification. The method comprises the following steps:
step 1: training data set preparation.
Step 2: inputting training video data with total length of T frames into a reference network, and acquiring adjacent frame sequence (x) marked with target type and bounding box information through the reference network t-1 ,x t ) T =2, \ 8230;, T, where x t And the image of the t-th frame marked with the target type and the boundary frame information is represented, the target type information is used for identifying the target, the type information of the same type of target is the same, and the boundary frame information is used for calculating the position of the central point of the target.
And step 3: sequence of adjacent frames (x) to be acquired by reference network t-1 ,x t ) An input feature extraction module for predicting x t+1 The motion offset of all target center points on the frame sequence is used for calculating a motion track, and the hot spot graph is used for judging whether the marked target type is quasi-standard or notAnd (8) determining.
Step 4: Calculate the loss function. The overall loss function is:

L = (1/T) Σ_t (λ_f L_f + λ_s L_s + λ_o L_o)

where T denotes the sequence length of the input video, and λ_f, λ_s and λ_o are hyper-parameters defining the weight of each branch in the overall loss function, with λ_f = 1, λ_s = 0.1, λ_o = 0.1;
L_f represents the prediction loss of the target heatmap and adopts a focal-loss-based loss function:

L_f = -(1/N) Σ_{c,a,b} { (1 - Ŷ_cab)^α log(Ŷ_cab),                     if Y_cab = 1;
                         (1 - Y_cab)^β (Ŷ_cab)^α log(1 - Ŷ_cab),       otherwise }

where Y_cab is the ground-truth heatmap value of a target belonging to category c at position (a, b) of the t-th frame, obtained directly from the input video; Ŷ_cab is the heatmap of the target at the corresponding position of the t-th frame predicted from the adjacent frame sequence (x_{t-1}, x_t); N is the number of targets in the t-th frame image; α and β are hyper-parameters; and the summation Σ_{c,a,b} runs over all categories of targets at all positions of the t-th frame image.
L_s represents the prediction loss of the target bounding-box position and adopts an L1-based loss function:

L_s = (1/N) Σ_{i=1}^{N} | ŝ_i^t - s_i^t |

where ŝ_i^t is the bounding-box position of the i-th target center point on the t-th frame image predicted by the reference network, and s_i^t is the ground-truth bounding-box position of the i-th target center point on the t-th frame image, obtained directly from the input video.
L_o represents the prediction loss of the target motion offset and adopts a regression-based loss function; its main term is an L1 regression between the predicted and true center-point displacements, of the form

L_o = (1/N) Σ_{i=1}^{N} | ô_i^t - (p_i^{t-1} - p_i^t) |

where ô_i^t denotes the motion track (offset) of target i predicted for each bounding box on the t-th frame image, and p_i^t and p_i^{t-1} denote the true center-point positions of target i detected on the t-th and (t-1)-th frame images, obtained directly from the input video. In addition, based on the updated target feature information M_t, a positive alignment head P_t is used on the input t-th frame image for supervised learning of the target center-point position, and a negative alignment head V_t is likewise used on the input t-th frame image for supervised learning of the target center-point position. The update of M_t is completed by the feature extraction module based on the gated recurrent unit (GRU) network: the update gate parameter z_t, the hidden-layer recursive representation M̃_t and the reset gate parameter r_t of the GRU corresponding to the t-th frame are calculated in turn, and the updated target feature information M_t of the t-th frame image is then obtained; the target feature information comprises the target bounding-box size, motion offset and heatmap. The specific calculation formula is:
M_t = (1 - z_t) ⊙ M_{t-1} + z_t ⊙ M̃_t

where z_t is the update gate parameter; M_{t-1} is the updated target bounding-box size, motion offset and heatmap of the (t-1)-th frame, with M_1 = 0 for the input 1st frame image; and M̃_t is the hidden-layer recursive representation (candidate state). The update gate parameter z_t is calculated as:
z_t = δ(W_z F_t + U_z M_{t-1} + b_z)

where δ(·) denotes the Logistic (sigmoid) function with output interval (0, 1); W_z, U_z and b_z are learnable network parameters of the GRU update gate; F_t denotes the target bounding-box size, motion offset and heatmap of the t-th frame before updating; F_1 and F_2 take the ground-truth values of the target bounding-box size, motion offset and heatmap on the 1st and 2nd frame images, obtained directly from the input video, and predicted values are used from the third frame onward.
The hidden-layer recursive representation M̃_t is calculated as:

M̃_t = tanh(W_M ⊛ F_t + U_M ⊛ (r_t ⊙ M_{t-1}) + b_M)

where ⊛ denotes the convolution operation, ⊙ denotes element-wise multiplication, and W_M, U_M and b_M are learnable network parameters of the GRU hidden layer.
The reset gate parameter r_t is calculated as:

r_t = δ(W_r F_t + U_r M_{t-1} + b_r)

where W_r, U_r and b_r are learnable network parameters of the GRU reset gate.
And 5: based on the intersection ratio of the true value and the predicted value of the target motion trail obtained by the target motion offset, matching a plurality of targets and motion trails which still appear after shielding by adopting a Hungarian algorithm based on the intersection ratio, wherein the calculation process of the intersection ratio is as follows:
definition G = { G 1 ,…,g T The real value of the target motion track is directly obtained from the input video, and D = { D = is defined 1 ,…,d T The predicted value of the target motion track is expressed as the following relation function:
Figure BDA0003726444370000051
and for the targets which do not appear any more from the t-th frame, matching the targets with the predicted track by adopting a pedestrian re-recognition method, and traversing the training video to finish training by analogy.
After training is finished, the video to be recognized is input into the target identification module based on the central network; after processing, the targets and their motion tracks are matched through the target track matching module based on the Hungarian algorithm, thereby realizing dynamic multi-target identification.
Beneficial effects:
The method can accurately identify dynamic multiple targets affected by intermittent shielding (less than or equal to 30 FPS) with high robustness; it can identify dynamic multiple targets affected by intermittent shielding in different scenes (campus, shopping mall, street, etc.), with a recognition accuracy of 70% or higher.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention.
Fig. 1 is a flow chart of a dynamic multi-target identification method in the present invention.
Fig. 2 is a schematic diagram of a dynamic multi-target identification method in the present invention.
Fig. 3 is a schematic diagram of a feature extraction module based on a central network in the present invention.
FIG. 4 is a diagram illustrating the effect of the present invention on the constructed data set for dynamic multi-target recognition affected by intermittent occlusion.
Detailed Description
For a better understanding of the technical solutions of the present invention, the following further describes embodiments of the present invention with reference to the accompanying drawings and specific examples. It is noted that the aspects described below in connection with the figures and the specific embodiments are only illustrative and should not be construed as imposing any limitation on the scope of the present invention.
A flow chart of the dynamic multi-target identification method under the influence of intermittent shielding is shown in Fig. 1, and a schematic diagram of the method is shown in Fig. 2. The method comprises two stages, training and testing. The training stage is realized by a target identification module based on a central network, a feature extraction module based on a gated recurrent unit (GRU) network and a target track matching module based on the Hungarian algorithm, and specifically comprises the following steps:
step 1: the method comprises the steps of collecting continuous image sequences and video data used for dynamic multi-target recognition in a plurality of different scenes (such as schools, shopping malls, streets and the like) by adopting a Logitech C920 camera, selecting the continuous image sequences and the video data influenced by intermittent shielding, and finishing data set manufacturing, wherein the data set comprises a training set, a verification set and a test set.
Step 2: object identification module based on central networkIn the method, a DLA-34 network structure is adopted as a reference network, and the network comprises an encoder-decoder structure, so that the accuracy of target identification can be improved. Inputting training image data of total T frames into a reference network, and acquiring adjacent frame sequence (x) marked with target type and bounding box information thereof through the reference network t-1 ,x t ) T =2, \ 8230;, T, where x t The method comprises the steps that a t frame image marked with target types and boundary frame information is represented, the target type information is used for identifying targets, the type information of the same type of targets is the same, and the boundary frame information is used for calculating the position of a target center point;
Step 3: In the target identification module based on the central network, the feature extraction module adopts the central network. The adjacent frame sequences (x_{t-1}, x_t) acquired by the reference network are input into the feature extraction module to predict the motion offsets of all target center points in frame x_{t+1}, which are used to calculate the motion tracks; the heatmap is used to judge whether the marked target category is accurate. This module can accelerate target feature extraction, reduce network parameters and improve the running speed of the algorithm; a schematic diagram of the module is shown in Fig. 3.
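The feature extraction described above predicts, per frame, a heatmap, a bounding-box size and a motion offset. The following PyTorch sketch illustrates one possible form of these three prediction heads; the channel widths and layer arrangement are assumptions for illustration, not the exact architecture of the embodiment.

# Sketch of the three per-frame prediction heads (heatmap, bounding-box size,
# motion offset). Channel widths and the shared 3x3 convolution are assumed.
import torch
import torch.nn as nn


class PredictionHeads(nn.Module):
    def __init__(self, in_channels: int = 64, num_classes: int = 1):
        super().__init__()

        def head(out_channels: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, out_channels, kernel_size=1),
            )

        self.heatmap = head(num_classes)   # one channel per target category
        self.size = head(2)                # bounding-box width and height
        self.offset = head(2)              # center-point motion offset (dx, dy)

    def forward(self, feat: torch.Tensor) -> dict:
        return {
            "heatmap": torch.sigmoid(self.heatmap(feat)),
            "size": self.size(feat),
            "offset": self.offset(feat),
        }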
Step 4: Calculate the loss function. The overall loss function is:

L = (1/T) Σ_t (λ_f L_f + λ_s L_s + λ_o L_o)

where T denotes the sequence length of the input video, and λ_f, λ_s and λ_o are hyper-parameters defining the weight of each branch in the overall loss function, with λ_f = 1, λ_s = 0.1, λ_o = 0.1.
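As an illustrative sketch of how the three branch losses might be combined with the stated weights (the per-frame averaging over T reflects the reconstruction of the garbled formula above and is an assumption):

# Sketch of the weighted total loss with lambda_f = 1, lambda_s = 0.1,
# lambda_o = 0.1; averaging over the number of frames is assumed.
def total_loss(per_frame_losses):
    """per_frame_losses: list of (L_f, L_s, L_o) tuples, one per frame."""
    lam_f, lam_s, lam_o = 1.0, 0.1, 0.1
    total = sum(lam_f * lf + lam_s * ls + lam_o * lo
                for lf, ls, lo in per_frame_losses)
    return total / max(len(per_frame_losses), 1)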
L_f represents the prediction loss of the target heatmap and adopts a focal-loss-based loss function:

L_f = -(1/N) Σ_{c,a,b} { (1 - Ŷ_cab)^α log(Ŷ_cab),                     if Y_cab = 1;
                         (1 - Y_cab)^β (Ŷ_cab)^α log(1 - Ŷ_cab),       otherwise }

where Y_cab is the ground-truth heatmap value of a target belonging to category c at position (a, b) of the t-th frame, obtained directly from the input video; Ŷ_cab is the heatmap of the target at the corresponding position of the t-th frame predicted from the adjacent frame sequence (x_{t-1}, x_t); N is the number of targets in the t-th frame of the training data; α and β are hyper-parameters; and the summation Σ_{c,a,b} runs over all categories of targets at all positions of the t-th frame image.
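An illustrative PyTorch sketch of a CenterNet-style focal loss consistent with the symbols above (Y, Ŷ, N, α, β); it is a sketch of the standard formulation, not code taken from the embodiment:

# Heatmap focal loss sketch; pred and gt are heatmaps of shape (C, H, W)
# with values in (0, 1); positions where gt == 1 are ground-truth centers.
import torch


def heatmap_focal_loss(pred: torch.Tensor, gt: torch.Tensor,
                       alpha: float = 2.0, beta: float = 4.0) -> torch.Tensor:
    eps = 1e-6
    pred = pred.clamp(eps, 1.0 - eps)
    pos = gt.eq(1.0).float()              # ground-truth center points
    neg = 1.0 - pos

    pos_loss = ((1.0 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1.0 - gt) ** beta) * (pred ** alpha) * torch.log(1.0 - pred) * neg

    num_pos = pos.sum().clamp(min=1.0)    # N: number of targets in the frame
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos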
L_s represents the prediction loss of the target bounding-box position and adopts an L1-based loss function:

L_s = (1/N) Σ_{i=1}^{N} | ŝ_i^t - s_i^t |

where ŝ_i^t is the bounding-box position of the i-th target center point on the t-th frame image predicted by the reference network, and s_i^t is the ground-truth bounding-box position of the i-th target center point on the t-th frame image, obtained directly from the input video.
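A corresponding sketch of the L1 bounding-box loss, averaging over the N targets of a frame (the averaging is an assumption consistent with the heatmap loss):

# L1 loss on predicted bounding-box positions for the N targets of one frame.
import torch


def bbox_l1_loss(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    """pred_boxes, gt_boxes: tensors of shape (N, 4)."""
    return torch.abs(pred_boxes - gt_boxes).sum() / max(pred_boxes.shape[0], 1)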
L_o represents the prediction loss of the target motion offset and adopts a regression-based loss function; its main term is an L1 regression between the predicted and true center-point displacements, of the form

L_o = (1/N) Σ_{i=1}^{N} | ô_i^t - (p_i^{t-1} - p_i^t) |

where ô_i^t denotes the motion track (offset) of target i predicted for each bounding box on the t-th frame image, and p_i^t and p_i^{t-1} denote the true center-point positions of target i on the t-th and (t-1)-th frame images, obtained directly from the input video. In addition, based on the updated target feature information M_t, a positive alignment head P_t is used on the input t-th frame image for supervised learning of the target center-point position, and a negative alignment head V_t is likewise used on the input t-th frame image for supervised learning of the target center-point position. The update of M_t is completed by the feature extraction module based on the gated recurrent unit (GRU) network: the update gate parameter z_t, the hidden-layer recursive representation M̃_t and the reset gate parameter r_t of the GRU corresponding to the t-th frame are calculated in turn, and the updated target feature information M_t of the t-th frame is then obtained; the target feature information comprises the target bounding-box size, motion offset and heatmap. The specific calculation formula is:
M_t = (1 - z_t) ⊙ M_{t-1} + z_t ⊙ M̃_t

where z_t is the update gate parameter; M_{t-1} is the updated target bounding-box size, motion offset and heatmap of the (t-1)-th frame, with M_1 = 0 for the input 1st frame image; and M̃_t is the hidden-layer recursive representation (candidate state). The update gate parameter z_t is calculated as:
z_t = δ(W_z F_t + U_z M_{t-1} + b_z)

where W_z, U_z and b_z are learnable network parameters of the GRU update gate; F_t denotes the target bounding-box size, motion offset and heatmap of the t-th frame before updating; F_1 and F_2 take the ground-truth values of the target bounding-box size, motion offset and heatmap on the 1st and 2nd frame images, obtained directly from the input video, and predicted values are used from the third frame onward.
The hidden-layer recursive representation M̃_t is calculated as:

M̃_t = tanh(W_M ⊛ F_t + U_M ⊛ (r_t ⊙ M_{t-1}) + b_M)

where ⊛ denotes the convolution operation, ⊙ denotes element-wise multiplication, and W_M, U_M and b_M are learnable network parameters of the GRU hidden layer.
The reset gate parameter r_t is calculated as:

r_t = δ(W_r F_t + U_r M_{t-1} + b_r)

where W_r, U_r and b_r are learnable network parameters of the GRU reset gate.
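For illustration, a minimal ConvGRU-style cell consistent with the gate equations above can be sketched in PyTorch as follows; the tanh non-linearity of the candidate state and the use of a single convolution over concatenated inputs (in place of separate W and U transforms) are implementation assumptions.

# ConvGRU-style feature update cell: z_t and r_t use a sigmoid (the Logistic
# function), the candidate state uses tanh, and the blend
# M_t = (1 - z_t) * M_{t-1} + z_t * M~_t follows the standard GRU form.
import torch
import torch.nn as nn


class ConvGRUCell(nn.Module):
    def __init__(self, channels: int = 256, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # one convolution over concatenated [F_t, M_{t-1}] plays the role of
        # the separate W and U transforms in the equations above
        self.conv_z = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)
        self.conv_r = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)
        self.conv_m = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)

    def forward(self, f_t: torch.Tensor, m_prev: torch.Tensor) -> torch.Tensor:
        z_t = torch.sigmoid(self.conv_z(torch.cat([f_t, m_prev], dim=1)))   # update gate
        r_t = torch.sigmoid(self.conv_r(torch.cat([f_t, m_prev], dim=1)))   # reset gate
        m_cand = torch.tanh(self.conv_m(torch.cat([f_t, r_t * m_prev], dim=1)))
        return (1.0 - z_t) * m_prev + z_t * m_cand                          # updated M_t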
And 5: based on the intersection and combination ratio of the real value and the predicted value of the target motion track obtained by the target motion offset, matching a plurality of predicted targets with corresponding tracks by adopting Hungarian algorithm, and matching targets which still appear after the t-th frame is shielded based on the intersection and combination ratio, wherein the calculation process of the intersection and combination ratio is as follows:
definition G = { G 1 ,…,g T The real value of the target motion track is directly obtained from the input video, and D = { D = is defined 1 ,…,d T The predicted value of the target motion track is expressed as the following relation function:
Figure BDA0003726444370000091
when the temperature is higher than the set temperature
Figure BDA0003726444370000092
And matching the target and the corresponding track. And for the target which does not appear any more from the t-th frame, matching the target with the predicted track by adopting a pedestrian re-recognition method, and repeating the training video to finish the training.
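An illustrative sketch of IoU-based Hungarian matching between true and predicted track boxes, using scipy's linear_sum_assignment; the 0.5 acceptance threshold is an assumed example value, since the threshold is not stated here.

# IoU-based Hungarian matching sketch; boxes are (x1, y1, x2, y2).
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def match_tracks(gt_boxes: np.ndarray, pred_boxes: np.ndarray,
                 iou_threshold: float = 0.5):
    """Return (gt_index, pred_index) pairs whose IoU exceeds the threshold."""
    cost = np.zeros((len(gt_boxes), len(pred_boxes)))
    for i, g in enumerate(gt_boxes):
        for j, d in enumerate(pred_boxes):
            cost[i, j] = 1.0 - iou(g, d)      # Hungarian algorithm minimizes cost
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols)
            if 1.0 - cost[i, j] >= iou_threshold]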
After training is finished, the video to be recognized is input into the target identification module based on the central network; after processing, the targets and their motion tracks are matched through the target track matching module based on the Hungarian algorithm, thereby realizing dynamic multi-target identification.
The dynamic multi-target recognition algorithm is trained on the public data sets MOT17, KITTI and COCO. The server is equipped with an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz × 16 and a 4-core NVIDIA GeForce GTX 2080 graphics card, with 128 GB of memory; the operating system is Ubuntu 16.04, and the algorithm is implemented with the Python 3.6 programming language and the PyTorch deep learning framework. The training hyper-parameters are: batch size = 2, epochs = 100, iterations = 1600, optimizer = Adam. The feature dimension of the gated recurrent unit network is set to 256 with 3 × 3 filters. The results of running the algorithm are shown in Fig. 4.
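A hedged sketch of a training loop with the stated hyper-parameters (batch size 2, 100 epochs, Adam); the learning rate, the data loader and the model's compute_loss method are placeholders for illustration, not elements taken from the embodiment.

# Training loop sketch; "model.compute_loss" and the loader format are
# hypothetical placeholders.
import torch


def train(model, train_loader, epochs: int = 100, lr: float = 1e-4):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        for frame_pair, targets in train_loader:   # loader built with batch size 2
            loss = model.compute_loss(frame_pair.to(device), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()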
The method can accurately identify dynamic multiple targets affected by intermittent shielding (less than or equal to 30 FPS) with high robustness; it can identify dynamic multiple targets affected by intermittent shielding in different scenes (such as campuses, shopping malls and streets), with a recognition accuracy of 70% or higher.
While, for purposes of simplicity of explanation, the method is shown and described as a series of acts, it is to be understood and appreciated that the method is not limited by the order of acts, as some acts may occur in different orders and/or concurrently with other acts from those shown and described herein, as will be understood and appreciated by those skilled in the art.
Although illustrative embodiments of the present invention have been described above in some detail so that those skilled in the art can understand them, the invention is not limited thereto, and various changes and modifications can be made within the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. A dynamic multi-target identification method under intermittent shielding is characterized in that: the method comprises two stages of training and testing, wherein,
the training stage is realized by a target identification module based on a central network, a feature extraction module based on a gated recurrent unit (GRU) network and a target track matching module based on the Hungarian algorithm;
a testing stage: after training is finished, the video to be recognized is input into the target identification module based on the central network; after processing, the targets and their motion tracks are matched through the target track matching module based on the Hungarian algorithm, thereby realizing dynamic multi-target identification.
2. The dynamic multi-target identification method under intermittent shielding according to claim 1, characterized in that: the target identification module based on the central network comprises a reference network and a feature extraction module, wherein the reference network adopts a DLA-34 network structure, and the feature extraction module adopts the central network.
3. The method for dynamic multi-target recognition under intermittent shielding according to claim 1, wherein the training process comprises the following steps:
step 1, training data set preparation.
Step 2: total length of the tubeInputting training video data of T frames into a reference network, and acquiring adjacent frame sequences (x) marked with target types and bounding box information thereof through the reference network t-1 ,x t ) T =2, \8230, T, where x t And the t frame image marked with the target type and the boundary frame information thereof is represented, the target type information is used for identifying the target, and the boundary frame information is used for calculating the position of the central point of the target.
And step 3: sequence of adjacent frames (x) to be acquired by reference network t-1 ,x t ) An input feature extraction module for predicting x t+1 The motion offset of the center point of all the targets in the frame sequence is used for calculating a motion track, and the hot spot graph is used for judging whether the marked target type is accurate or not.
Step 4: Calculate the loss function. The overall loss function is:

L = (1/T) Σ_t (λ_f L_f + λ_s L_s + λ_o L_o)

where T denotes the sequence length of the input video, and λ_f, λ_s and λ_o are hyper-parameters defining the weight of each branch in the overall loss function;
L_f represents the prediction loss of the target heatmap and adopts a focal-loss-based loss function:

L_f = -(1/N) Σ_{c,a,b} { (1 - Ŷ_cab)^α log(Ŷ_cab),                     if Y_cab = 1;
                         (1 - Y_cab)^β (Ŷ_cab)^α log(1 - Ŷ_cab),       otherwise }

where Y_cab is the value of the real heatmap for a target belonging to category c at position (a, b) of the t-th frame; Ŷ_cab is the heatmap of the target at the corresponding position of the t-th frame predicted from the adjacent frame sequence (x_{t-1}, x_t); N is the number of targets in the t-th frame of the training data; α and β are hyper-parameters; and the summation runs over all categories of targets at all positions of the t-th frame image;
L_s represents the prediction loss of the target bounding-box position and adopts an L1-based loss function:

L_s = (1/N) Σ_{i=1}^{N} | ŝ_i^t - s_i^t |

where ŝ_i^t is the bounding-box position of the i-th target center point on the t-th frame image predicted by the reference network, and s_i^t is the true bounding-box position of the i-th target center point on the t-th frame image;
L_o represents the prediction loss of the target motion offset and adopts a regression-based loss function; its main term is an L1 regression between the predicted and true center-point displacements, of the form

L_o = (1/N) Σ_{i=1}^{N} | ô_i^t - (p_i^{t-1} - p_i^t) |

where ô_i^t denotes the motion track (offset) of target i predicted for each bounding box on the t-th frame image, and p_i^t and p_i^{t-1} denote the true center-point positions of target i on the t-th and (t-1)-th frame images; in addition, based on the updated target feature information M_t, a positive alignment head P_t is used on the input t-th frame image for supervised learning of the target center-point position, and a negative alignment head V_t is likewise used on the input t-th frame image for supervised learning of the target center-point position; the update of M_t is completed by the feature extraction module based on the gated recurrent unit (GRU) network: the update gate parameter z_t, the hidden-layer recursive representation M̃_t and the reset gate parameter r_t of the GRU corresponding to the t-th frame are calculated in turn, and the updated target feature information M_t on the t-th frame image is then obtained; the target feature information comprises the target bounding-box size, motion offset and heatmap; the specific calculation formula is:
M_t = (1 - z_t) ⊙ M_{t-1} + z_t ⊙ M̃_t

where z_t is the update gate parameter, M_{t-1} is the updated target bounding-box size, motion offset and heatmap of the (t-1)-th frame, and M̃_t is the hidden-layer recursive representation; the update gate parameter z_t is calculated as:
z_t = δ(W_z F_t + U_z M_{t-1} + b_z)

where δ(·) denotes the Logistic function with output interval (0, 1); W_z, U_z and b_z are learnable network parameters of the GRU update gate; and F_t denotes the target bounding-box size, motion offset and heatmap of the t-th frame before updating;
the hidden-layer recursive representation M̃_t is calculated as:

M̃_t = tanh(W_M ⊛ F_t + U_M ⊛ (r_t ⊙ M_{t-1}) + b_M)

where ⊛ denotes the convolution operation, ⊙ denotes element-wise multiplication, and W_M, U_M and b_M are learnable network parameters of the GRU hidden layer;
the reset gate parameter r_t is calculated as:

r_t = δ(W_r F_t + U_r M_{t-1} + b_r)

where W_r, U_r and b_r are learnable network parameters of the GRU reset gate.
And 5: based on the intersection and combination ratio of the true value and the predicted value of the target motion trajectory obtained by the target motion offset, matching a plurality of predicted targets with corresponding trajectories by adopting a Hungarian algorithm, and matching targets which appear after the t-th frame is shielded based on the intersection and combination ratio, wherein the calculation process of the intersection and combination ratio is as follows:
definition G = { G 1 ,…,g T D = { D } is the true value of the target motion trajectory 1 ,…,d T The predicted value of the target motion track is expressed as the following relation function:
Figure FDA0003726444360000041
and for the targets which do not appear any more from the t-th frame, matching the targets with the predicted track by adopting a pedestrian re-recognition method, and traversing the training video to finish training by analogy.
After training is finished, the video to be recognized is input into the target identification module based on the central network; after processing, the targets and their motion tracks are matched through the target track matching module based on the Hungarian algorithm, thereby realizing dynamic multi-target identification.
CN202210782158.3A 2022-07-01 2022-07-01 Dynamic multi-target identification method under intermittent shielding Pending CN115294490A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210782158.3A CN115294490A (en) 2022-07-01 2022-07-01 Dynamic multi-target identification method under intermittent shielding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210782158.3A CN115294490A (en) 2022-07-01 2022-07-01 Dynamic multi-target identification method under intermittent shielding

Publications (1)

Publication Number Publication Date
CN115294490A true CN115294490A (en) 2022-11-04

Family

ID=83821460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210782158.3A Pending CN115294490A (en) 2022-07-01 2022-07-01 Dynamic multi-target identification method under intermittent shielding

Country Status (1)

Country Link
CN (1) CN115294490A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726821A (en) * 2024-02-05 2024-03-19 武汉理工大学 Medical behavior identification method for region shielding in medical video
CN117726821B (en) * 2024-02-05 2024-05-10 武汉理工大学 Medical behavior identification method for region shielding in medical video

Similar Documents

Publication Publication Date Title
CN110070074B (en) Method for constructing pedestrian detection model
Feichtenhofer et al. Detect to track and track to detect
CN109816689B (en) Moving target tracking method based on adaptive fusion of multilayer convolution characteristics
CN111476302B (en) fast-RCNN target object detection method based on deep reinforcement learning
CN112836640B (en) Single-camera multi-target pedestrian tracking method
Xu et al. Deepmot: A differentiable framework for training multiple object trackers
CN112884742B (en) Multi-target real-time detection, identification and tracking method based on multi-algorithm fusion
CN111582349B (en) Improved target tracking algorithm based on YOLOv3 and kernel correlation filtering
CN112489081B (en) Visual target tracking method and device
CN112085765B (en) Video target tracking method combining particle filtering and metric learning
CN111476161A (en) Somatosensory dynamic gesture recognition method fusing image and physiological signal dual channels
CN113361370B (en) Abnormal behavior detection method based on deep learning
CN116402850A (en) Multi-target tracking method for intelligent driving
Xiao et al. MeMu: Metric correlation Siamese network and multi-class negative sampling for visual tracking
CN112634368A (en) Method and device for generating space and OR graph model of scene target and electronic equipment
CN111340842A (en) Correlation filtering target tracking algorithm based on joint model
WO2024093209A1 (en) Method for dynamic target tracking by legged robot
CN113256690A (en) Pedestrian multi-target tracking method based on video monitoring
Chebli et al. Pedestrian detection based on background compensation with block-matching algorithm
Lee et al. Online multiple object tracking using rule distillated siamese random forest
Shen et al. Infrared multi-pedestrian tracking in vertical view via siamese convolution network
CN116311063A (en) Personnel fine granularity tracking method and system based on face recognition under monitoring video
CN112288778A (en) Infrared small target detection method based on multi-frame regression depth network
CN115294490A (en) Dynamic multi-target identification method under intermittent shielding
Abdullah et al. Vehicle counting using deep learning models: a comparative study

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination