CN112784756B - Human body identification tracking method - Google Patents

Human body identification tracking method

Info

Publication number
CN112784756B
CN112784756B (application CN202110095729.1A)
Authority
CN
China
Prior art keywords
human body
training
network
centernet
tracking method
Prior art date
Legal status
Active
Application number
CN202110095729.1A
Other languages
Chinese (zh)
Other versions
CN112784756A (en)
Inventor
王堃
刘耀辉
戴旺
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202110095729.1A
Publication of CN112784756A
Application granted
Publication of CN112784756B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/08 Neural networks; Learning methods
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/66 Analysis of geometric attributes of image moments or centre of gravity
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body identification tracking method, which comprises the following steps: step 100: collecting original video stream data and converting it into pictures to establish an initial data set; step 200: performing enhancement processing and screening on the initial data set to obtain a training set, a verification set and a test set; step 300: constructing a Centernet network structure consisting of a backbone network, an upsampling path and a top convolution, wherein the top convolution adopts a depthwise separable convolution; step 400: designing a BOX matching mechanism and a loss function to construct a complete Centernet network structure; step 500: training, verifying and testing the complete Centernet network structure with the training, verification and test sets to obtain a Centernet network model; step 600: identifying and tracking human bodies in real-time video stream data using the Centernet network model. The human body identification tracking method optimizes the structure of the Centernet network, improving detection speed without reducing detection accuracy and achieving a better balance between accuracy and speed.

Description

Human body identification tracking method
Technical Field
The invention relates to the field of machine vision, in particular to a human body identification tracking method.
Background
Multi-object tracking (MOT) is a research hotspot in the field of computer vision. Its task is to determine, in a specific or real-time video sequence, information such as the position, size and complete motion trajectory of each independent target that meets given requirements or exhibits a certain visual characteristic. In recent years, with the rapid growth of data processing capacity and the development of image analysis technology, target monitoring and real-time tracking technology has come to the fore; it has great practical value in video surveillance, positioning and navigation, intelligent human-computer interaction, virtual reality and other fields, and multi-target tracking based on video streams has become a popular research direction among experts and scholars.
As a target detection and tracking algorithm, the Centernet network does not need to generate regions of interest in advance, which greatly improves speed; however, there is still room for optimization in the balance between detection accuracy and detection speed.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a human body identification tracking method, which can further improve the detection speed and enlarge the receptive field while ensuring the detection accuracy.
The technical scheme is as follows: the human body identification tracking method specifically comprises the following steps:
step 100: collecting original video stream data, and converting the original video stream data into pictures to establish an initial data set;
step 200: performing enhancement processing and screening on the initial data set to obtain a training set, a verification set and a test set;
step 300: constructing a Centernet network structure consisting of a backbone network, an upsampling path and a top convolution, wherein the top convolution adopts a depthwise separable convolution;
step 400: designing a BOX matching mechanism and a loss function to construct a complete Centernet network structure;
step 500: training, verifying and testing the complete Centernet network structure by using the training set, verification set and test set to obtain a Centernet network model;
step 600: identifying and tracking human bodies in real-time video stream data using the Centernet network model.
Further, the BOX matching mechanism in step 400 is: if the Bbox containing the central point of the object predicted by the characteristic point is occupied, the Bbox closest to the central point of the object is selected as the Anchor.
Further, the loss function in step 400 is expressed as:
L_det = L_k + L_size + L_off
[The expressions for L_size and L_off appear in the original only as equation images and are not reproduced here.]
where L_det is the total loss, L_k the confidence loss, L_size the target-box size loss, and L_off the center-offset loss. The predicted Bbox parameters are set to (b_x, b_y, b_w, b_h), where b_x and b_y give the position of the Box center and b_w and b_h the width and height of the Box. Three influence factors ξ, δ and ζ are added to the confidence loss, namely:
L_k = ξ_1 · L_nt + ξ_2 · L_pt
L_nt = -(1 - b_y^)^(δ_1) · log(b_y^ + ζ)
L_pt = -(1 - b_y^)^(δ_2) · log(b_y^)
where L_nt is the negative-sample loss, L_pt the positive-sample loss, and the optimal values of ξ_1, ξ_2, δ_1, δ_2 and ζ are obtained by grid search.
Further, the original video stream data in step 100 is obtained by real-time video recording through a camera and by means of an internet crawler.
Further, the enhancement processing in step 200 includes geometric transformation and color transformation.
Further, the backbone network in step 300 is one of ResNet-18, MobileNet, Xception, ShuffleNet, ResNet101, and DenseNet.
Further, the upsampling path in step 300 includes a CBAM module and a feature fusion module, where the CBAM module is configured to optimize the extracted image features, and the feature fusion module is configured to fuse shallow features and deep features.
Further, the activation functions of the Centernet network in the step 300 are h-swish and h-sigmoid.
The step 500 comprises:
step 510: giving a model training mode and parameters, and sending a training set into a complete Centernet network structure for training to obtain a first characteristic data set;
step 520: continuing the training to obtain the Centernet network model.
Beneficial effects: compared with the prior art, the invention has the following advantages:
1. The backbone network of the Centernet network is replaced with a lightweight network, making the method suitable for embedded devices and improving detection speed.
2. A feature fusion module is introduced in the upsampling path, fusing low-level spatial information with high-level semantic information and reducing the missed and false detections caused by pedestrians occluding each other and by changes in illumination and viewing angle.
3. An attention module is introduced and the activation functions are replaced with ones of lower computational cost, keeping computation fast while preserving the practicality of the algorithm.
4. The convolution operations in the Centernet network are replaced with depthwise separable convolutions, enlarging the receptive field without reducing resolution or increasing computation, so that large targets are better detected, located and segmented.
Drawings
FIG. 1 is a flow chart of a human identification tracking method of the present invention;
FIG. 2 is a diagram comparing the architecture of the Centernet of the present invention with that of a conventional Centernet.
Detailed Description
The technical scheme of the invention is further explained below with reference to the accompanying drawings.
Referring to fig. 1, the human body recognition and tracking method according to the embodiment of the invention comprises the following steps:
step 100: collecting original video stream data, and converting the original video stream data into pictures to establish an initial data set;
step 200: performing enhancement processing and screening on the initial data set to obtain a training set, a verification set and a test set;
step 300: building a Centernet network structure consisting of a backbone network, an upsampling path and a top convolution, wherein the top convolution adopts a depthwise separable convolution;
step 400: designing a BOX matching mechanism and a loss function to construct a complete Centernet network structure;
step 500: training, verifying and testing the complete Centernet network structure by using the training set, verification set and test set to obtain a Centernet network model;
step 600: and identifying and tracking human bodies in the real-time video stream data by using the Centernet network model.
According to the human body identification tracking method of the above technical scheme, depthwise separable convolution is adopted in the Centernet network structure. This significantly compresses the number of parameters and the amount of computation and improves the running performance of the model, while also enlarging the receptive field without lowering the image resolution or adding extra computation, so that large targets can be detected, segmented and accurately located. At the same time, using convolutions with different dilation rates yields features with different receptive fields, capturing pedestrians at multiple scales. The designed Box matching mechanism and loss function address, respectively, the coinciding-center-point problem and the positive/negative sample imbalance problem that often arise in pedestrian detection.
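The depthwise separable convolution mentioned above can be sketched as follows in PyTorch; the channel counts, kernel size and the dilation used to enlarge the receptive field are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        # Depthwise: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding,
                                   dilation=dilation, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution to mix channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Usage: a 64-channel feature map processed with dilation 2 to enlarge the receptive field.
feat = torch.randn(1, 64, 128, 128)
head = DepthwiseSeparableConv(64, 64, dilation=2)
out = head(feat)  # same spatial size, larger receptive field
```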
In the Centernet network, a series of fixed BBoxes in the feature map are judged to be positive or negative samples by computing the intersection-over-union: samples with an overlap greater than 0.7 are marked positive and those below 0.3 are marked negative. The BBox of a positive sample contains the center point of an object; at low resolution each cell can detect only one object, and the network predicts the BBox by predicting an offset within that cell. Under this design one feature point can predict only one object, so if the center points of more than one object in an image coincide, detections are missed, which is common in pedestrian detection. Therefore, in some embodiments, the Box matching mechanism of step 400 is: when selecting the Anchor, if the BBox corresponding to a feature point's center is already occupied, the BBox closest to the center point is selected as the Anchor to predict the object, avoiding the problem of repeated center points.
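The matching rule can be illustrated with a short sketch. This is not the patent's implementation; the grid size and the brute-force search for the nearest free cell are assumptions made for clarity.

```python
# Each object is assigned to the output-grid cell containing its center; if that
# cell is already occupied, the nearest free cell is used instead.
import math

def assign_centers(centers, grid_w, grid_h):
    """centers: list of (cx, cy) in grid coordinates. Returns {cell: object index}."""
    occupied = {}
    for idx, (cx, cy) in enumerate(centers):
        cell = (int(cx), int(cy))
        if cell not in occupied:
            occupied[cell] = idx
            continue
        # Center cell taken: fall back to the nearest unoccupied cell.
        best, best_d = None, float("inf")
        for gx in range(grid_w):
            for gy in range(grid_h):
                if (gx, gy) in occupied:
                    continue
                d = math.hypot(gx + 0.5 - cx, gy + 0.5 - cy)
                if d < best_d:
                    best, best_d = (gx, gy), d
        if best is not None:
            occupied[best] = idx
    return occupied
```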
In some embodiments, the loss function consists of three parts and can be written overall as:
L_det = L_k + L_size + L_off
[The expressions for L_size and L_off appear in the original only as equation images and are not reproduced here.]
where L_det is the total loss, L_k the confidence loss, L_size the target-box size loss, and L_off the center-offset loss. The predicted Bbox parameters are set to (b_x, b_y, b_w, b_h), where b_x and b_y give the position of the Box center and b_w and b_h the width and height of the Box. When the input to the model is 512 × 512 and the output is a 28 × 28 feature map, the fact that one feature point predicts only one object can, in extreme cases, lead to a severe imbalance between positive and negative samples. To solve this problem, three influence factors ξ, δ and ζ are added to the confidence loss, raising the loss on positive samples and lowering the loss on negative samples, namely:
L_k = ξ_1 · L_nt + ξ_2 · L_pt
L_nt = -(1 - b_y^)^(δ_1) · log(b_y^ + ζ)
L_pt = -(1 - b_y^)^(δ_2) · log(b_y^)
In the negative-sample loss L_nt, the two factors ζ and δ_1 reduce the loss of negative samples; in the positive-sample loss L_pt, the adjustment is made through δ_2; finally, the ξ factors control the relative contribution of the positive- and negative-sample losses. The optimal set of parameters ξ_1, ξ_2, δ_1, δ_2, ζ in the loss function is found by grid search. In this embodiment, ξ_1 is 0.25, ξ_2 is 1, δ_1 is 3, δ_2 is 1.5 and ζ is 0.2.
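As an illustration only, the confidence loss above can be transcribed directly into PyTorch with the parameter values reported for this embodiment. The interpretation of b_y^ as the predicted confidence heatmap value, and the normalization by the number of positive samples, are assumptions.

```python
import torch

def confidence_loss(pred, target, xi1=0.25, xi2=1.0, delta1=3.0, delta2=1.5, zeta=0.2):
    """pred: predicted confidence heatmap in (0, 1); target: 1 at object centers, 0 elsewhere."""
    pos = target.eq(1).float()
    neg = 1.0 - pos
    eps = 1e-6
    # Negative-sample loss: damped by the zeta term inside the log and the delta1 exponent.
    l_nt = -((1.0 - pred) ** delta1) * torch.log(pred + zeta + eps)
    # Positive-sample loss: adjusted through the delta2 exponent.
    l_pt = -((1.0 - pred) ** delta2) * torch.log(pred + eps)
    n_pos = pos.sum().clamp(min=1.0)
    return (xi1 * (l_nt * neg).sum() + xi2 * (l_pt * pos).sum()) / n_pos
```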
In some embodiments, the raw video stream data in step 100 may be obtained by real-time recording of ground pedestrian scenes, with the database augmented by an internet crawler. At present, most public pedestrian-detection data sets such as MIT and ImageNet are shot at eye level and are not suitable for surveillance cameras installed at a top-down angle, so top-down pedestrian data needs to be shot and collected in the field, with an internet crawler used to supplement the amount of data.
In some embodiments, the original video stream data is converted into pictures by a script: by calling the imencode function in cv2, cyclically reading the video and performing a save operation every several frames, the video stream can be converted into a group of pictures.
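A minimal sketch of such a script with OpenCV follows; the file paths, the sampling interval, and the use of cv2.imwrite for the storage step are placeholders and assumptions.

```python
import cv2

def video_to_frames(video_path, out_dir, every_n=10):
    """Read a video cyclically and save one frame every `every_n` frames."""
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            # cv2.imencode plus tofile also works for non-ASCII paths; imwrite covers the simple case.
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```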
In some embodiments, the data enhancement in step 200 mainly uses two kinds of means, geometric transformation and color transformation. Geometric transformations include operations such as random flipping, rotation, cropping, deformation and scaling; color transformations include noise, Gaussian blur, color shifts, erasing, padding and the like. In this embodiment, random rotation and scaling from the geometric transformations and Gaussian blur from the color transformations are mainly employed.
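A sketch of these augmentations with OpenCV; the rotation and scaling ranges and the blur kernel are illustrative assumptions. When geometric transforms are applied, the bounding-box annotations must be transformed accordingly.

```python
import random
import cv2

def augment(img):
    h, w = img.shape[:2]
    angle = random.uniform(-15, 15)        # random rotation
    scale = random.uniform(0.8, 1.2)       # random scaling
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    img = cv2.warpAffine(img, m, (w, h))
    if random.random() < 0.5:              # Gaussian blur
        img = cv2.GaussianBlur(img, (5, 5), 0)
    return img
```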
In some embodiments, the enhanced pictures need to be screened manually; the scene types and the number of pedestrians are controlled through manual screening so that different kinds of data are distributed as evenly as possible, which improves the generalization of the model and prevents overfitting. In this example, samples are annotated manually in the PASCAL VOC format. The PASCAL VOC format is used because most databases currently use it, which makes it convenient to train on other kinds of data. The labeling tool is LabelImg, a cross-platform image annotation tool written in Python; sample information is labeled through an interactive visual interface, producing one xml annotation file per sample. The labeled object information is the pedestrian category attribute (Person) and the coordinates of the target pedestrian's bounding box. The result is a complete data set comprising a training set, a verification set and a test set.
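For illustration, a minimal sketch of reading one of the LabelImg-produced PASCAL VOC annotation files with the Python standard library; the tag names follow the standard VOC layout, the category value "Person" matches the labeling described above, and the file path is an assumption.

```python
import xml.etree.ElementTree as ET

def read_voc(xml_path):
    """Return (class name, xmin, ymin, xmax, ymax) for every object in one VOC xml file."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")                 # e.g. "Person"
        bb = obj.find("bndbox")
        boxes.append((name,
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes
```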
The left diagram in FIG. 2 shows a conventional Centernet network structure using an Hourglass backbone, and the right diagram in FIG. 2 shows the Centernet network structure according to an embodiment of the present invention. In some embodiments, the Centernet network structure in step 300 adopts a lightweight backbone better suited to embedded devices, such as ResNet-18, MobileNet, Xception or ShuffleNet; it is understood that the backbone network may be switched to a larger network such as ResNet101 or DenseNet to obtain higher accuracy.
In this embodiment, the backbone network of the Centernet network adopts the lightweight residual network ResNet-18 to increase the detection speed; the network structure is shown in Table 1.
TABLE 1 ResNet-18 network architecture Table
[Table 1 is provided in the original as an image and is not reproduced here.]
In this embodiment, the upsampling path uses deformable convolution to change the number of convolution kernels and transposed convolution to upsample the feature maps. The outputs of "layer2", "layer3" and "layer4" of the ResNet are taken as the "8x", "16x" and "32x" feature maps; the three feature maps are fused by the feature fusion module, the fused "8x" feature map is deconvolved to obtain a "4x" map, and finally category confidence and BBox prediction are performed by two convolutions at the top of the network.
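A rough PyTorch sketch of this path, assuming a ResNet-18 backbone from torchvision: the layer2/layer3/layer4 outputs serve as the 8x/16x/32x maps, a simple additive fusion stands in for the feature fusion module, and a transposed convolution produces the 4x map. The channel widths and the fusion scheme are assumptions, not the patent's exact modules.

```python
import torch
import torch.nn as nn
import torchvision

class UpsamplePath(nn.Module):
    def __init__(self, mid=64):
        super().__init__()
        backbone = torchvision.models.resnet18()
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1)
        self.layer2, self.layer3, self.layer4 = backbone.layer2, backbone.layer3, backbone.layer4
        # 1x1 convolutions to align channel counts before fusion.
        self.l2_proj = nn.Conv2d(128, mid, 1)
        self.l3_proj = nn.Conv2d(256, mid, 1)
        self.l4_proj = nn.Conv2d(512, mid, 1)
        # Transposed convolution from the fused 8x map to the 4x output map.
        self.up = nn.ConvTranspose2d(mid, mid, 4, stride=2, padding=1)

    def forward(self, x):
        c2 = self.layer2(self.stem(x))   # 8x feature map
        c3 = self.layer3(c2)             # 16x feature map
        c4 = self.layer4(c3)             # 32x feature map
        size = c2.shape[-2:]
        fused = (self.l2_proj(c2)
                 + nn.functional.interpolate(self.l3_proj(c3), size=size, mode="bilinear", align_corners=False)
                 + nn.functional.interpolate(self.l4_proj(c4), size=size, mode="bilinear", align_corners=False))
        return self.up(fused)            # 4x feature map for the prediction heads

feat = UpsamplePath()(torch.randn(1, 3, 512, 512))  # -> (1, 64, 128, 128)
```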
Many convolution and pooling operations lose a large amount of feature information, which lowers detection accuracy. At the same time, shallow feature maps are usually large, so introducing shallow features in bulk reduces the real-time performance of the network; moreover, at the level of feature representation, low-level features differ from high-level ones, and merely concatenating them along the channel dimension introduces a lot of noise. To solve these problems, in some embodiments a feature fusion module is added to the upsampling path. The feature fusion module fuses shallow and deep features, combining rich low-level spatial information with high-level semantic information, which increases detection accuracy for small and occluded targets and is a great advantage when detecting and tracking dense crowds.
In some embodiments, an attention module (CBAM) is added to the upsampling path in order to optimize the extracted image features, avoid a large number of redundant features, further increase detection speed and obtain better feature expression.
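A compact sketch of a CBAM-style block (channel attention followed by spatial attention) that could be inserted in the upsampling path; the reduction ratio and spatial kernel size are the commonly used defaults and are assumed here.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        # Channel attention from average- and max-pooled descriptors.
        avg = torch.mean(x, dim=(2, 3), keepdim=True)
        mx = torch.amax(x, dim=(2, 3), keepdim=True)
        x = x * torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```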
In some embodiments, on top of the added attention module, the Centernet network structure also uses h-swish and h-sigmoid activation functions in place of the traditional ReLU and Sigmoid, which further reduces computation while effectively avoiding precision loss in model calculation.
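These activations have standard hard-approximation forms, h-sigmoid(x) = ReLU6(x + 3)/6 and h-swish(x) = x · h-sigmoid(x), which can be written as:

```python
import torch.nn as nn
import torch.nn.functional as F

class HSigmoid(nn.Module):
    def forward(self, x):
        return F.relu6(x + 3.0) / 6.0

class HSwish(nn.Module):
    def forward(self, x):
        return x * F.relu6(x + 3.0) / 6.0
```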
In some embodiments, step 500 comprises:
step 510: giving a model training mode and parameters, and sending a training set into a complete Centernet network structure for training to obtain a first characteristic data set;
step 520: continuing the training to obtain the Centernet network model.
In this embodiment, the training proceeds in the order full network structure, partial structure, head structure, then full network structure again. The specific training modes and parameters in step 510 are as follows: because the loss is large in the early stage of training, a step learning-rate strategy with a large learning rate is used to accelerate convergence of the model; in the later stage, cosine learning-rate decay provides a smaller learning rate to keep convergence stable. Throughout training, the sparsity rate is 0.01, the gamma of the learning-rate schedule is 0.1, the learning rate is 0.0001 with a step size of 100 (the learning rate drops to one tenth of its previous value every 100 iteration steps), the number of training epochs is 140, and the batch size is 16.
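A sketch of this schedule in PyTorch, using the stated base learning rate 0.0001, gamma 0.1, step size 100, 140 epochs and batch size 16; the choice of Adam, the switch-over point to cosine decay, and stepping the scheduler once per epoch are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, 3, padding=1)  # placeholder; in practice the complete Centernet structure

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)          # base learning rate 0.0001
# Early stage: step decay, learning rate drops to one tenth every 100 steps (gamma = 0.1).
step_lr = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)
# Later stage: cosine decay over the remaining epochs.
cosine_lr = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=40)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[step_lr, cosine_lr], milestones=[100])

for epoch in range(140):                # 140 training epochs
    # ... one epoch of training with batch size 16 ...
    scheduler.step()
```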
In step 520, the weight file of the model is saved once per training epoch; training is then continued by selecting the resume (continuous training) mode and inheriting the weight file of the chosen epoch.
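A minimal sketch of this per-epoch checkpointing and resuming; the file naming and checkpoint contents are assumptions.

```python
import torch

def save_checkpoint(model, optimizer, epoch, path_fmt="centernet_epoch_{:03d}.pth"):
    # Save one weight file per training epoch.
    torch.save({"epoch": epoch,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()},
               path_fmt.format(epoch))

def resume_checkpoint(model, optimizer, path):
    # Inherit the weights of a chosen epoch and continue training from the next one.
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1
```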

Claims (9)

1. A human body identification tracking method is characterized by comprising the following steps:
step 100: collecting original video stream data, and converting the original video stream data into pictures to establish an initial data set;
step 200: performing enhancement processing and screening on the initial data set to obtain a training set, a verification set and a test set;
step 300: constructing a Centernet network structure consisting of a backbone network, an upsampling path and a top convolution, wherein the top convolution adopts a depthwise separable convolution;
step 400: designing a BOX matching mechanism and a loss function to construct a complete Centernet network structure;
step 500: training, verifying and testing the complete Centernet network structure by using the training set, verification set and test set to obtain a Centernet network model;
step 600: and identifying and tracking human bodies in the real-time video stream data by using the Centernet network model.
2. The human body identification tracking method according to claim 1, wherein the BOX matching mechanism in the step 400 is: if the Bbox containing the center point of the object predicted by the characteristic point is occupied, selecting the Bbox closest to the center point of the object as Anchor.
3. The method for recognizing and tracking the human body as claimed in claim 1, wherein the loss function in the step 400 is expressed as:
L_det = L_k + L_size + L_off
[The expressions for L_size and L_off appear in the original only as equation images and are not reproduced here.]
where L_det is the total loss, L_k the confidence loss, L_size the target-box size loss, and L_off the center-offset loss; the predicted Bbox parameters are set to (b_x, b_y, b_w, b_h), where b_x and b_y are respectively the position of the Box center and b_w and b_h the width and height of the Box; three influence factors ξ, δ and ζ are added to the confidence loss, namely:
L_k = ξ_1 · L_nt + ξ_2 · L_pt
L_nt = -(1 - b_y^)^(δ_1) · log(b_y^ + ζ)
L_pt = -(1 - b_y^)^(δ_2) · log(b_y^)
where L_nt is the negative-sample loss, L_pt the positive-sample loss, and the optimal values of ξ_1, ξ_2, δ_1, δ_2 and ζ are obtained by grid search.
4. The human body identification tracking method according to claim 1, wherein the original video stream data in the step 100 is obtained by a camera real-time video recording assisted by an internet crawler.
5. The method for recognizing and tracking human body according to claim 1, wherein the enhancement processing in step 200 includes geometric transformation and color transformation.
6. The human body identification tracking method according to claim 1, wherein the backbone network in step 300 is one of ResNet-18, MobileNet, Xception, ShuffleNet, ResNet101 and DenseNet.
7. The human body recognition and tracking method according to claim 1, wherein the up-sampling path in step 300 comprises a CBAM module and a feature fusion module, the CBAM module is used for optimizing the extracted image features, and the feature fusion module is used for fusing shallow features and deep features.
8. The human body identification tracking method according to claim 7, wherein the activation functions of the Centernet network in the step 300 are h-swish and h-sigmoid.
9. The human body recognition tracking method of claim 1, wherein the step 500 comprises:
step 510: giving a model training mode and parameters, and sending a training set into a complete Centernet network structure for training to obtain a first characteristic data set;
step 520: continuing the training to obtain the Centernet network model.
CN202110095729.1A 2021-01-25 2021-01-25 Human body identification tracking method Active CN112784756B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110095729.1A CN112784756B (en) 2021-01-25 2021-01-25 Human body identification tracking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110095729.1A CN112784756B (en) 2021-01-25 2021-01-25 Human body identification tracking method

Publications (2)

Publication Number Publication Date
CN112784756A (en) 2021-05-11
CN112784756B (en) 2022-08-26

Family

ID=75758905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110095729.1A Active CN112784756B (en) 2021-01-25 2021-01-25 Human body identification tracking method

Country Status (1)

Country Link
CN (1) CN112784756B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113191334B (en) * 2021-05-31 2022-07-01 广西师范大学 Plant canopy dense leaf counting method based on improved CenterNet
CN113313736B (en) * 2021-06-10 2022-05-17 厦门大学 Online multi-target tracking method for unified target motion perception and re-identification network
CN113569727B (en) * 2021-07-27 2022-10-21 广东电网有限责任公司 Method, system, terminal and medium for identifying construction site in remote sensing image
CN113808170B (en) * 2021-09-24 2023-06-27 电子科技大学长三角研究院(湖州) Anti-unmanned aerial vehicle tracking method based on deep learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321874A (en) * 2019-07-12 2019-10-11 南京航空航天大学 A kind of light-weighted convolutional neural networks pedestrian recognition method
CN111582213A (en) * 2020-05-15 2020-08-25 北京铁科时代科技有限公司 Automobile identification method based on Centernet

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321874A (en) * 2019-07-12 2019-10-11 南京航空航天大学 A kind of light-weighted convolutional neural networks pedestrian recognition method
CN111582213A (en) * 2020-05-15 2020-08-25 北京铁科时代科技有限公司 Automobile identification method based on Centernet

Also Published As

Publication number Publication date
CN112784756A (en) 2021-05-11

Similar Documents

Publication Publication Date Title
CN112784756B (en) Human body identification tracking method
CN114202672A (en) Small target detection method based on attention mechanism
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN110533041B (en) Regression-based multi-scale scene text detection method
CN111353544B (en) Improved Mixed Pooling-YOLOV 3-based target detection method
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN112036447A (en) Zero-sample target detection system and learnable semantic and fixed semantic fusion method
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN111882620A (en) Road drivable area segmentation method based on multi-scale information
CN112070040A (en) Text line detection method for video subtitles
CN113239753A (en) Improved traffic sign detection and identification method based on YOLOv4
CN114782798A (en) Underwater target detection method based on attention fusion
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S
CN112507904A (en) Real-time classroom human body posture detection method based on multi-scale features
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
CN110633706B (en) Semantic segmentation method based on pyramid network
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN115908793A (en) Coding and decoding structure semantic segmentation model based on position attention mechanism
CN117710841A (en) Small target detection method and device for aerial image of unmanned aerial vehicle
CN116597267B (en) Image recognition method, device, computer equipment and storage medium
CN117011515A (en) Interactive image segmentation model based on attention mechanism and segmentation method thereof
CN116403133A (en) Improved vehicle detection algorithm based on YOLO v7
CN116524596A (en) Sports video action recognition method based on action granularity grouping structure
CN111339950A (en) Remote sensing image target detection method
Rao et al. Roads detection of aerial image with FCN-CRF model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant