CN113591532A - Real-time pedestrian detection and feature extraction module based on self-selection mechanism - Google Patents
- Publication number
- CN113591532A (application CN202110391719.2A)
- Authority
- CN
- China
- Prior art keywords
- module
- pedestrian
- layer
- network
- residual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/22 — Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
- G06F18/24 — Pattern recognition; Analysing; Classification techniques
- G06N3/045 — Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
- G06N3/08 — Neural networks; Learning methods
Abstract
The invention discloses a real-time pedestrian detection and feature extraction module based on a self-selection mechanism, comprising a pedestrian detection module and a feature extraction module. The pedestrian detection module transmits the number of detected pedestrians A, the pedestrian detection frames and the corresponding pedestrian features B in a picture to the feature extraction module. The feature extraction module screens the backbone networks in the residual backbone network module according to the detected pedestrian number A, sends each person's pedestrian features B into the corresponding residual network for feature extraction, distinguishes the pedestrian features B by combining orientation information, and then compares each pedestrian's global information and orientation information against the pedestrian records in the database for identification. The module adapts to pedestrian re-identification in complex monitoring scenes, can be used plug-and-play, and effectively addresses the large parameter counts, heavy redundant computation, poor applicability to real conditions, low efficiency and weak generalization of existing pedestrian re-identification methods.
Description
Technical Field
The invention relates to the technical field of pedestrian re-identification in the field of computer vision, in particular to a real-time pedestrian detection and feature extraction module based on a self-selection mechanism.
Background
The key to rapidly and accurately re-identifying pedestrians is determining, through machine learning and computer-vision analysis, whether the same pedestrian appears across consecutive videos or images. The traditional pedestrian re-identification task can be simplified into feature processing and feature comparison/identification, two relatively independent tasks and two necessary steps for realizing pedestrian re-identification in a monitoring scene. The feature processing stage can itself be divided into two relatively independent modules: pedestrian detection and feature extraction. The task flow division is shown in fig. 1.
However, most current patents and research one-sidedly pursue accuracy in the feature extraction module alone, and such work is difficult to deploy in practice. Although some scholars have studied the dual tasks of pedestrian detection and feature extraction, the re-identification accuracy of the proposed methods is low. The main reason is that existing pedestrian re-identification algorithms follow a "detection-identification" framework, which is reasonable but not optimal and cannot self-select a network according to the actual complex scene. Moreover, data association depends excessively on detection quality, so the accuracy of pedestrian detection severely affects the accuracy of re-identification; hence the idea of realizing pedestrian re-identification with a single network cannot meet practical requirements.
The existing network structures of ResNet-18, ResNet-34 and ResNet-50 are shown in Table 1:
TABLE 1
The patent with publication number CN105224912B discloses a "video pedestrian detection and tracking method based on motion information and track association". However, the proposed method performs no targeted training on surveillance images; it extracts features with a sliding-window search and then applies a classifier for detection, so the computation cost is large, the efficiency is low, and real-time tracking in practical applications cannot be achieved;
the patent with publication number CN108764338A proposes a "pedestrian tracking algorithm applied to video analysis" that relies on the optical-flow method, the color-histogram method and a logistic regression classifier, all of which have gradually been abandoned in the detection and tracking field; its performance cannot meet real-time monitoring in complex scenes.
The patent with publication number CN109871763A proposes a specific-target tracking method based on YOLO, which simply stacks existing "detection" and "tracking" methods; even if each part of the framework is locally optimal, end-to-end optimality cannot be achieved. Meanwhile, the algorithm is computationally heavy and unsuitable for deployment and porting to mobile devices.
Reference [1]: Han, Kai, et al. "GhostNet: More Features from Cheap Operations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Reference [2]: Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He. "Non-local Neural Networks." CVPR, 2018.
Reference [3]: Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, Wei Jiang. "Bag of Tricks and a Strong Baseline for Deep Person Re-identification." CVPR Workshops, 2019.
Reference [4]: Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-Excitation Networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Disclosure of Invention
The invention aims to provide a real-time pedestrian detection and feature extraction module based on a self-selection mechanism that adapts to pedestrian re-identification in complex monitoring scenes, can be used plug-and-play, and effectively addresses the large parameter counts, heavy redundant computation, poor applicability to real conditions, low efficiency and weak generalization of existing pedestrian re-identification methods.
The invention is realized by the following technical scheme: a real-time pedestrian detection and feature extraction module based on a self-selection mechanism comprises a pedestrian detection module and a feature extraction module, wherein,
the pedestrian detection module is used for transmitting the pedestrian number A, the pedestrian detection frame and the corresponding pedestrian feature B detected in one picture into the feature extraction module;
and the feature extraction module is used for screening the backbone networks in the residual backbone network module according to the detected pedestrian number A, sending each person's pedestrian features B into the corresponding residual network in the backbone network for feature extraction, distinguishing the pedestrian features B by combining orientation information, and then comparing each pedestrian's global information and orientation information against the pedestrian records in the database for identification.
In order to further realize the invention, the following arrangement mode is adopted: the pedestrian detection module adopts a structural mode of a single-stage detection network, is sequentially provided with a convolution layer and 3 bottleneck channel layers, is also respectively provided with an attention module behind each bottleneck channel layer, is also provided with a channel adding module behind the last attention module, and the output of each attention module is connected to the channel adding module, namely the pedestrian detection module is provided with the convolution layer, the bottleneck channel layers, the attention modules and the channel adding module according to the processing sequence of input pictures, and the output of each attention module is subjected to channel addition in the channel adding module to obtain the pedestrian number A, the pedestrian detection frame and the corresponding pedestrian characteristic B.
In order to further realize the invention, the following arrangement mode is adopted: the bottleneck channel layer is composed of redundant simplified modules of two different stride.
In order to further realize the invention, the following arrangement mode is adopted: the slice convolution layer of the pedestrian detection module is used for rearranging the width and height information of the picture, and specifically comprises the following steps: the slice convolution layer of the pedestrian detection module divides the width and height data of the picture into 4 parts by half, so as to obtain 4 parts of data, each part of data is obtained by 2 times of downsampling, then 4 parts of data are spliced in channel dimensionality, and finally convolution operation is carried out.
On the channel dimension: an ordinary color picture is usually a 3-channel RGB image. Assuming its width and height are both 4, it is stored in the computer as (4 × 4 × 3), where 3 is the channel dimension, which may also be understood as the number of channels. Passing this picture through the slicing operation halves its width and height (4 × 4 → 2 × 2), cutting it into 4 feature maps of size 2 × 2 × 3. A plain down-sampling from 4 × 4 to 2 × 2 would lose a large amount of feature information (without channel splicing the picture would shrink from 4 × 4 × 3 to 2 × 2 × 3). By instead splicing the four 2 × 2 × 3 maps along the channel dimension, the slice convolution layer down-samples 4 × 4 × 3 to 2 × 2 × 12, so no feature information of the original whole picture (4 × 4 × 3) is lost.
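A minimal NumPy sketch of the slicing step (the convolution that follows it in the actual layer is omitted here): for a 4 × 4 × 3 input it produces the 2 × 2 × 12 result described above, discarding no pixels.

```python
import numpy as np

def slice_rearrange(img):
    """Halve width and height by stride-2 sampling and stack the four
    sub-images along the channel axis: (H, W, C) -> (H/2, W/2, 4C).
    Every pixel of the input survives, so no information is lost."""
    parts = [img[0::2, 0::2, :], img[1::2, 0::2, :],
             img[0::2, 1::2, :], img[1::2, 1::2, :]]
    return np.concatenate(parts, axis=2)
```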
Since the stride (step size) of the redundancy-simplification module can be adjusted to control the output feature size, a stride-1 convolution is used when the feature size should remain unchanged through the module, and a stride-2 convolution is used when the output should become half of the input.
In the redundancy-simplification module with stride 2, the input first passes a stride-2 convolution that halves the feature size, and a subsequent stride-1 convolution yields feature information of consistent quantity, called the original information (consistent only in quantity and size; the feature values themselves change through convolution). The other part is produced by cheap linear operations on the original information (addition, subtraction, multiplication and division count as linear operations; this invention uses addition and subtraction). Compared with ordinary down-sampling, this fuses redundant information and loses relatively less feature information.
In the redundancy-simplification module with stride 1, the feature size is unchanged before and after the first convolution, and the final output features are fused with redundant information, thereby expanding the information content.
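The two stride variants can be sketched as follows. This is an illustrative approximation: the convolutions are replaced by simple stand-ins (average pooling / identity) and the cheap linear operation by an elementwise shift; only the data flow — original branch, cheap-linear branch, fusion by addition — mirrors the description.

```python
import numpy as np

def redundancy_simplified(x, stride=1):
    """Redundancy-simplification module sketch (cf. reference [1]).
    stride=2 halves the spatial size first; stride=1 keeps it."""
    if stride == 2:
        # stand-in for the stride-2 convolution: 2x2 average pooling
        x = (x[0::2, 0::2] + x[1::2, 0::2] +
             x[0::2, 1::2] + x[1::2, 1::2]) / 4.0
    original = x               # stand-in for the stride-1 convolution branch
    redundant = original + 1   # stand-in cheap linear op (addition)
    return original + redundant  # fuse at corresponding positions
```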
In order to further realize the invention, the following arrangement mode is adopted: the channel bottleneck layer is formed by serially connecting a stride-2 redundant simplified module and a stride-1 redundant simplified module; the channel bottleneck layer adopts down-sampling pictures to extract features, and specifically comprises the following steps: firstly, carrying out convolution operation by using a redundant simplification module with stride being 2 to generate a feature map with half size, combining original information generated after convolution of the feature map with the original information generated after linear transformation to complete one-time downsampling; and generating a feature map with the same size through a redundant simplification module with stride being 1, combining original information generated after the feature map with the same size is convolved with redundant information generated through linear transformation, reducing the loss of the feature information caused by convolution, and finally finishing the operation of extracting the feature of the down-sampling picture of the channel bottleneck layer.
The main purpose of the backbone network is to extract features from down-sampled pictures; similar but still usable features in the pictures (redundant information) are usually ignored. The backbone is therefore built from two redundancy-simplification modules with different strides (see reference [1]), which exploit the redundant information produced by similar feature maps. The original convolution is split into two batches: the output is first produced by an ordinary convolution operation, and then a series of simple linear operations generates more features. These two steps yield the same number of feature maps as ordinary convolution while fully utilizing redundant features; replacing convolution layers with simple linear operations greatly reduces the network parameters, and the learning capacity of the convolutional neural network is maintained or even improved while the computation is reduced by about 20%.
In order to further realize the invention, the following arrangement mode is adopted: the attention module obtains global compression characteristic quantity by executing global average pooling on the characteristic diagram, obtains the weight of each channel through two layers of full connection layers, and takes the weight as the input of the next layer after normalization and weighting.
Because the use of the bottleneck channel layer reduces the computation and model parameters, detection precision drops; an attention module (see reference [4]) is therefore added to offset this. At the cost of a slight increase in computation, it screens out channel-wise attention, strengthens the feature-learning capacity of the detection network, and reduces the negative effect that cutting network parameters may bring.
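A NumPy sketch of this squeeze-and-excitation-style channel attention; the weight matrices `w1` and `w2` are hypothetical parameters, and a sigmoid stands in for the normalization step:

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """feat: (H, W, C) feature map; w1: (C, C//r); w2: (C//r, C).
    Squeeze (global average pool) -> two FC layers -> sigmoid weights
    -> reweighted feature map used as the next layer's input."""
    squeeze = feat.mean(axis=(0, 1))                 # (C,) global descriptor
    hidden = np.maximum(squeeze @ w1, 0.0)           # FC 1 + ReLU
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # FC 2 + sigmoid, in (0, 1)
    return feat * weights                            # per-channel reweighting
```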
In order to further realize the invention, the following arrangement mode is adopted: the characteristic extraction module is provided with a network depth self-adaptive selection module and a residual backbone network module, wherein,
the network depth self-adaptive selection module is used for self-adaptively inputting the pedestrian characteristics B into a corresponding residual network in the residual backbone network module by judging the range of the interval where the pedestrian number A detected by the current picture is positioned and according to a corresponding judgment rule;
the residual backbone network module is used for extracting the features of the pedestrian features B and further separating pedestrian orientation information and pedestrian global information through the feature sharing layer so as to be input into a subsequent database for comparison.
In order to further realize the invention, the following arrangement mode is adopted: the network depth self-adaptive selection module is provided with an ifelse selector with a built-in judgment rule, and the judgment rule is as follows:
wherein, the modified residual 50 network is selected in case of strategy 1, the modified residual 34 network is selected in case of strategy 2, and the modified residual 18 network is selected in case of strategy 3.
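The selection logic can be sketched as a plain if-else chain. The patent gives its exact interval boundaries in a formula not reproduced in this text, so the thresholds below are illustrative assumptions; only the direction — fewer pedestrians, deeper network — follows the description:

```python
def select_backbone(num_pedestrians, few=4, many=10):
    """Pick a residual network by the detected pedestrian count A.
    'few' and 'many' are hypothetical interval boundaries."""
    if num_pedestrians <= few:       # strategy 1: few people -> deep net
        return "ResNet-50 Non-local"
    elif num_pedestrians <= many:    # strategy 2: moderate crowd
        return "ResNet-34 Non-local"
    else:                            # strategy 3: dense crowd -> shallow net
        return "ResNet-18 Non-local"
```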
In order to further realize the invention, the following arrangement mode is adopted: the residual error backbone network module is provided with a backbone network, a global average pooling layer, a BN layer, a first full-connection layer and a shared characteristic layer, wherein,
the backbone network is provided with a modified residual 18 network (residual 18Non Local), a modified residual 34 network (residual 34Non Local) and a modified residual 50 network (residual 50Non Local)3 residual networks, wherein the 3 residual networks are mainly different in the number of convolutional layers in each of the first layer to the fourth layer;
table 2 shows a network structure table of various residual error networks in the backbone network.
TABLE 2
In the table, "3×3, 64" denotes 64 filters of size 3×3 (other entries follow the same convention), and stride denotes the step size.
The global average pooling layer obtains the global compressed feature of the pedestrian feature B: all pixel values of each feature map are averaged into a single value representing that feature map, yielding a 2048-dimensional vector (one value per feature map);
the BN layer adopts a BNfeature structure, and features are constrained on the hypersphere, so that the classification hyperplane is clearer and the accuracy of orientation classification is improved; BNFeature is mainly to normalize the feature value (modulo 1) by adding a normalization layer, so that features with different orientations can be represented on a unit circle, and thus, the result calculated by the euclidean distance is more convenient to compare and classify.
Fully connected layer 1 obtains the weight of each residual network; it is connected to the global average pooling layer, plays the role of a classifier in the whole residual backbone network module, and consists of a convolution with a 1×1 kernel. Fully connected layer 1 converts the 2048-dimensional feature into a 512-dimensional feature vector that can represent the target's global features;
the shared characteristic layer can reduce the influence of characteristic difference on the identification rate under the same ID and different orientations.
To reduce the influence on the recognition rate of feature differences under different orientations of the same identity, a shared feature layer containing both orientation and identity (global) features is designed. The 512-dimensional feature vector produced by the BN Feature passes through two branches of the shared feature layer: one branch is fully connected layer 2 (512 → 3) for orientation attribute recognition (orientation classification), converting the 512-dimensional feature into a 3-dimensional one whose components represent the target's orientations (front, back, side), realizing orientation attribute judgment (orientation information); in the other branch, the same 512-dimensional feature vector represents the global features (pedestrian global information).
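A sketch of the two branches, with a hypothetical weight matrix `w2` and bias `b2` standing in for fully connected layer 2:

```python
import numpy as np

def shared_feature_layer(feat512, w2, b2):
    """One branch: FC layer 2 maps the 512-dim BN feature to 3
    orientation logits (front / back / side). Other branch: the same
    512-dim vector passes through as the global identity feature."""
    logits = feat512 @ w2 + b2            # (512,) @ (512, 3) -> (3,)
    orientation = int(np.argmax(logits))  # 0 front, 1 back, 2 side
    return orientation, feat512
```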
Finally, features and labels are obtained from the separated orientation information and pedestrian global information and input into the subsequent database for comparison.
In order to further realize the invention, the following arrangement mode is adopted: a Nonlocal layer is arranged between the first layer and the second layer, between the second layer and the third layer, and between the third layer and the fourth layer of any residual network, a BN layer of any residual network adopts BnFature, and a Pooling layer 2 of any residual network adopts Global Average Pooling.
To establish the relations between frames in a video and between two pixels a certain distance apart in an image, a Non-Local layer (see reference [2]) is adopted. Combined with the channel attention mechanism, it highlights commonality; its input and output scales are exactly the same, so no extra scale-transformation layer is needed, and several Non-Local layers can be added at shallow layers to improve precision.
To normalize the feature vectors and better perform orientation classification and feature extraction, the BN layers of the residual networks (improved ResNet-18, ResNet-34 and ResNet-50) adopt the BNFeature structure (see reference [3]), which constrains the features onto a hypersphere in Euclidean space so that the classification hyperplane is clearer and the orientation-classification accuracy improves.
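A minimal Non-Local sketch on a flattened feature map (positions × channels). The 1×1 embedding convolutions (θ, φ, g) of reference [2] are dropped here — identity embeddings are an assumption for brevity; the point illustrated is that every position attends to every other, and that the output shape equals the input shape, so the block inserts without any scale transformation:

```python
import numpy as np

def non_local(x):
    """x: (N, C) flattened feature map. Residual self-attention over
    all pairwise positions; output has the same shape as the input."""
    sim = x @ x.T                             # (N, N) pairwise similarities
    sim -= sim.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(sim)
    attn /= attn.sum(axis=1, keepdims=True)   # row-wise softmax
    return x + attn @ x                       # residual connection
```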
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The invention is a cascade of pedestrian detection and feature extraction; on the detection side it reduces the network parameter count while preserving detection precision, thereby raising detection speed. Specifically, it optimizes a single-stage detection network, redesigns the backbone structure, and adds a redundancy-simplification module and an attention module, reducing network parameters while ensuring accuracy.
(2) To guarantee real-time tracking under different scenes and pedestrian densities, the invention designs a self-selection mechanism for the feature extraction network that adaptively chooses a residual network according to the number of pedestrians in each frame: if pedestrians are many, they are sent into a shallow network; if few, into a deep network.
(3) The residual networks of the invention add a Non-Local structural layer on top of existing recognition network models, further improving recognition speed while keeping recognition stable.
(4) The invention redesigns part of the network structure, reducing network complexity and computation; meanwhile, it proposes a self-selection mechanism and introduces orientation auxiliary information, solving the application problem of pedestrian re-identification in complex scenes.
Drawings
Fig. 1 is a flowchart of a conventional pedestrian re-identification task.
Fig. 2 is a structural diagram of a pedestrian detection module according to the present invention.
Fig. 3 is a schematic structural diagram of a feature extraction module according to the present invention.
Fig. 4 is a diagram of a redundant simplified module refinement architecture.
Fig. 5 is a detailed structural diagram of the attention module.
Fig. 6 is a diagram of a bottleneck channel layer structure.
FIG. 7 shows the entire network structure described in example 1.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments are described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are only a part, not all, of the embodiments of the present invention; the detailed description is merely representative of selected embodiments and is not intended to limit the scope of the invention as claimed. All other embodiments obtained by a person skilled in the art without inventive effort based on these embodiments fall within the scope of the present invention.
Example 1:
the invention designs a real-time pedestrian detection and feature extraction module based on a self-selection mechanism, which comprises a pedestrian detection module and a feature extraction module, wherein,
the pedestrian detection module is used for transmitting the pedestrian number A, the pedestrian detection frame and the corresponding pedestrian feature B detected in one picture into the feature extraction module;
and the feature extraction module is used for screening the backbone networks in the residual backbone network module according to the detected pedestrian number A, sending each person's pedestrian features B into the corresponding residual network in the backbone network for feature extraction, distinguishing the pedestrian features B by combining orientation information, and then comparing each pedestrian's global information and orientation information against the pedestrian records in the database for identification.
Example 2:
the present embodiment is further optimized based on the above embodiment; parts identical to the foregoing technical solution are not repeated here. To better implement the present invention, the following arrangement is particularly adopted: the pedestrian detection module adopts a single-stage detection network structure, in which a convolution layer and 3 bottleneck channel layers are arranged in sequence, an attention module follows each bottleneck channel layer, and a channel adding module follows the last attention module, with the output of every attention module connected to it. That is, in the order in which an input picture is processed, the module comprises: convolution layer, bottleneck channel layer, attention module, bottleneck channel layer, attention module, bottleneck channel layer, attention module, channel adding module; the outputs of the three attention modules undergo channel addition in the channel adding module to obtain the pedestrian number A, the pedestrian detection frames and the corresponding pedestrian features B.
Example 3:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not repeated here. To better implement the present invention, the following arrangement is particularly adopted: the bottleneck channel layer is composed of two redundancy-simplification modules with different strides.
Example 4:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not repeated here. To better implement the present invention, the following arrangement is particularly adopted: the slice convolution layer of the pedestrian detection module rearranges the width and height information of the picture, specifically: it halves the picture's width and height to split the data into 4 parts, each equivalent to a 2x down-sampling of the original; the 4 parts are then spliced along the channel dimension, and finally a convolution operation is applied.
On the so-called channel dimension: a common color picture is a 3-channel picture in RGB format. Assuming the picture is 4 pixels wide and 4 pixels high, it is stored in the computer as a 4 × 4 × 3 array, where 3 is the channel dimension, which may also be understood as the number of channels, i.e. 3 channels. Passing this picture through the convolution operation of the slice convolution layer halves its width and height (4 × 4 becomes 2 × 2) and cuts it into 4 feature maps of size 2 × 2 × 3; splicing these 4 maps on the channel dimension yields one feature map of size 2 × 2 × 12. An ordinary downsampling operation (4 × 4 becomes 2 × 2; without splicing on the channel dimension, the picture changes from 4 × 4 × 3 to 2 × 2 × 3) loses a large amount of feature information, whereas downsampling 4 × 4 × 3 into 2 × 2 × 12 through the slice convolution layer loses none of the feature information of the original whole picture.
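The slice rearrangement described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the patent's implementation; the function name is hypothetical, and the trailing convolution of the slice convolution layer is omitted:

```python
import numpy as np

def slice_rearrange(img):
    """Split a (H, W, C) picture into 4 subsampled copies and splice them
    on the channel dimension: (H, W, C) -> (H/2, W/2, 4*C).
    Each copy takes every second pixel, offset by (0,0), (1,0), (0,1), (1,1),
    so no pixel of the original picture is discarded."""
    parts = [img[0::2, 0::2], img[1::2, 0::2], img[0::2, 1::2], img[1::2, 1::2]]
    return np.concatenate(parts, axis=-1)

# The 4 x 4 x 3 picture from the text becomes 2 x 2 x 12:
# half the width and height, four times the channels.
x = np.arange(4 * 4 * 3).reshape(4, 4, 3)
y = slice_rearrange(x)
```

Because the four slices partition the original pixels, `y` contains every value of `x` exactly once, which is the sense in which this downsampling loses no feature information.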
Example 5:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not described again. To better implement the present invention, the following arrangement is particularly adopted: since the stride (step size) of the redundancy simplification module can be adjusted to control the size of the output features, a convolution with stride 1 is used when the feature size before and after the module is unchanged, and a convolution with stride 2 is used when the desired output is half the input.
In the redundancy simplification module with stride 2, the input information is first convolved (stride 2), halving the feature size; a further convolution (stride 1) then yields feature information of consistent quantity, called the original information (the features are consistent only in quantity and size; their values change under convolution). The other part is produced by a linear operation (addition, subtraction, multiplication or division on the original features; this method uses addition and subtraction): the linear operation yields redundant features, which contain features similar to the original ones and can therefore be reused, and a convolution (stride 1) then produces the redundant information. The original information and the redundant information are fused by addition at corresponding positions, completing the convolution or downsampling operation. Compared with ordinary downsampling, this method fuses redundant information and loses relatively less feature information.
In the redundancy simplification module with stride 1, the feature size is unchanged before and after the first convolution step, and the final output features are fused with redundant information, expanding the information content.
The channel bottleneck layer is formed by connecting a stride-2 redundancy simplification module and a stride-1 redundancy simplification module in series. The channel bottleneck layer extracts features from downsampled pictures as follows: first, the stride-2 redundancy simplification module performs a convolution operation to generate a feature map of half size, and the original information generated by convolution is combined with the redundant information generated by linear transformation, completing one downsampling; then the stride-1 redundancy simplification module generates a feature map of the same size, again combining the original information from convolution with the redundant information from linear transformation, reducing the loss of feature information caused by convolution and finally completing the channel bottleneck layer's feature extraction from the downsampled picture.
The main purpose of the backbone network is to extract features from downsampled pictures; ordinarily, similar and reusable features in the pictures (redundant information) are ignored and left unused. Here the backbone is built from two redundancy simplification modules with different strides (see reference [1] for details), which exploit the redundant information generated by similar feature maps. The idea is to split the original convolution into two batches: the output is first produced by the ordinary convolution operation, and then a series of simple linear operations generates more features. These two steps yield the same number of feature maps as an ordinary convolution while fully exploiting the redundant features; replacing convolution layers with simple linear operations greatly reduces the network parameters, and the learning capacity of the convolutional neural network is maintained or even improved while the amount of calculation is reduced by 20%.
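The two-batch idea of reference [1] can be sketched as follows. This is a simplified illustration, not the patent's module: the costly convolution is reduced to a 1x1 channel mixing, and the "cheap linear operation" is stood in for by a per-channel scaling; all names and weight shapes are assumptions:

```python
import numpy as np

def redundancy_simplified(x, w_primary, scale):
    """Redundancy-simplification sketch: the costly convolution produces only
    half of the output channels ("original information"); a cheap per-channel
    linear operation derives the other half ("redundant information").
    x: (H, W, C_in); w_primary: (C_in, C_out//2) 1x1-conv weights;
    scale: (C_out//2,) per-channel factor standing in for the linear op."""
    primary = np.einsum('hwc,cd->hwd', x, w_primary)  # ordinary convolution batch
    cheap = primary * scale                           # simple linear operation batch
    return np.concatenate([primary, cheap], axis=-1)  # same channel count as a full conv

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
w_p = rng.standard_normal((16, 16))   # 256 weights instead of the 512 a full
w_c = rng.standard_normal(16)         # 16 -> 32 convolution would need
out = redundancy_simplified(x, w_p, w_c)
```

The output has as many channels as a full 16-to-32 convolution would produce, but roughly half the channels cost only one multiply per pixel, which is where the parameter and calculation savings come from.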
Example 6:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not described again. To better implement the present invention, the following arrangement is particularly adopted: the attention module obtains the global compression feature quantity by performing global average pooling on the feature map, obtains the weight of each channel through two fully connected layers, and takes the normalized, weighted result as the input of the next layer.
Since the use of the channel bottleneck layer reduces the amount of calculation and the model parameters, the detection precision drops; an attention module (reference [4]) is therefore added to reduce this influence. At the cost of a slight increase in calculation, it screens out channel-wise attention, enhancing the detection network's feature learning capability and reducing the negative influence that cutting network parameters may bring.
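The pooling, two fully connected layers and channel reweighting described above can be sketched in squeeze-and-excitation style. This is an illustrative reading of reference [4], not the patent's exact module; the function names, the ReLU between the two layers, and the weight shapes are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Channel attention sketch: global average pooling -> two fully
    connected layers -> per-channel weights in (0, 1) -> reweight the input.
    x: (H, W, C); w1: (C, C//r) and w2: (C//r, C) with reduction ratio r."""
    squeeze = x.mean(axis=(0, 1))                        # global compression feature, shape (C,)
    excite = sigmoid(np.maximum(squeeze @ w1, 0) @ w2)   # FC -> ReLU -> FC -> sigmoid
    return x * excite                                    # channel-weighted input for the next layer

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 32))
out = channel_attention(x, rng.standard_normal((32, 8)), rng.standard_normal((8, 32)))
```

Only the two small fully connected layers add parameters, which matches the text's claim of a slight increase in calculation.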
Example 7:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not described again. To better implement the present invention, the following arrangement is particularly adopted: the feature extraction module is provided with a network depth self-adaptive selection module and a residual backbone network module, wherein,
the network depth self-adaptive selection module is used for judging the interval in which the pedestrian number A detected in the current picture lies and, according to the corresponding judgment rule, self-adaptively inputting the pedestrian features B into the corresponding residual network in the residual backbone network module;
the residual backbone network module is used for extracting features from the pedestrian features B and further separating pedestrian orientation information and pedestrian global information through the feature sharing layer, for input into a subsequent database for comparison.
Example 8:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not described again. To better implement the present invention, the following arrangement is particularly adopted: the network depth self-adaptive selection module is provided with an if-else selector with a built-in judgment rule, the judgment rule being as follows:
wherein the improved residual 50 network is selected under strategy 1, the improved residual 34 network under strategy 2, and the improved residual 18 network under strategy 3.
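The if-else selector can be sketched as follows. The interval bounds `low` and `high` are illustrative assumptions, since the patent's judgment-rule formula is not reproduced in this text; they are chosen only so that 15 detected pedestrians maps to the improved residual 34 network, matching the worked example given later:

```python
def select_backbone(num_pedestrians, low=10, high=20):
    """If-else selector sketch for the network depth self-adaptive
    selection module.  `low` and `high` are hypothetical interval bounds;
    the actual rule is given by a formula not shown in the text."""
    if num_pedestrians > high:    # strategy 1: many targets, deepest network
        return 'improved residual 50 + Non Local'
    elif num_pedestrians > low:   # strategy 2: medium crowd
        return 'improved residual 34 + Non Local'
    else:                         # strategy 3: few targets, shallowest network
        return 'improved residual 18 + Non Local'
```

With these assumed bounds, `select_backbone(15)` selects the improved residual 34 network, while a nearly empty frame falls through to the cheapest improved residual 18 network.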
Example 9:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not described again. To better implement the present invention, the following arrangement is particularly adopted: the residual backbone network module is provided with a backbone network, a global average pooling layer, a BN layer, a first fully connected layer and a shared feature layer, wherein,
the backbone network is provided with 3 residual networks: an improved residual 18 network (residual 18 Non Local), an improved residual 34 network (residual 34 Non Local) and an improved residual 50 network (residual 50 Non Local); the 3 residual networks differ mainly in the number of convolution layers in each of the first to fourth layers;
the global average pooling layer is used to obtain the global compression feature quantity of the pedestrian features B: all pixel values of each feature map are averaged into one numerical value representing that feature map (with global average pooling, a 2048-channel input yields a 2048-dimensional vector, one value per feature map);
the BN layer adopts the BNFeature structure, constraining features onto a hypersphere so that the classification hyperplane is clearer and the accuracy of orientation classification is improved. BNFeature mainly adds a normalization layer that normalizes the feature vector (to modulus 1), so that features of different orientations can be represented on a unit circle and results calculated by Euclidean distance are easier to compare and classify.
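The effect of constraining features to modulus 1 can be shown with a tiny numeric example. This is a sketch of the normalization step only, under the assumption that BNFeature's output is an L2-normalized vector; on the unit sphere, squared Euclidean distance becomes a monotone function of cosine similarity, which is why comparison and classification become more convenient:

```python
import numpy as np

def bn_feature(f, eps=1e-12):
    """Constrain a feature vector to the unit hypersphere (modulus 1)."""
    return f / (np.linalg.norm(f) + eps)

# On the unit sphere: ||a - b||^2 = 2 - 2 * (a . b), so ranking by
# Euclidean distance is the same as ranking by cosine similarity.
a = bn_feature(np.array([3.0, 4.0]))   # -> [0.6, 0.8]
b = bn_feature(np.array([5.0, 0.0]))   # -> [1.0, 0.0]
d2 = float(np.sum((a - b) ** 2))
cos = float(a @ b)
```

Here `d2` equals `2 - 2 * cos` exactly, so a distance threshold in Euclidean space corresponds directly to an angular threshold between orientations.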
The fully connected layer 1 is used to obtain the weight of each residual network; it is connected to the global average pooling layer, plays the role of a classifier in the whole residual backbone network module, and consists of a convolution with a 1x1 kernel. The fully connected layer 1 converts the 2048-dimensional feature into a 512-dimensional feature vector that can represent the target's global features;
the shared characteristic layer can reduce the influence of characteristic difference on the identification rate under the same ID and different orientations.
To reduce the influence of feature differences under different orientations of the same identity on the recognition rate, a shared feature layer containing orientation and identity (global feature) branches is designed. The 512-dimensional feature vector output by BNFeature passes through two branches of the shared feature layer: one branch feeds the fully connected layer 2 (512, 3) for orientation attribute recognition (orientation classification), converting the 512-dimensional feature into a 3-dimensional feature representing the orientation of the target (front, back, side) so as to realize orientation attribute judgment (orientation information); in the other branch, the same 512-dimensional feature vector is used to represent the global feature (pedestrian global information).
Finally, features and labels are obtained from the separated orientation information and pedestrian global information and input into a subsequent database for comparison.
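The two branches of the shared feature layer can be sketched as follows. This is an illustrative reading, with a hypothetical weight matrix for the fully connected layer 2 and a softmax added to turn the 3-dimensional output into orientation probabilities:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def shared_feature_layer(feat512, w_orient):
    """Shared feature layer sketch: one 512-dim BNFeature vector feeds two
    branches.  Branch 1: fully connected layer 2 (512 -> 3) gives the
    orientation (front / back / side).  Branch 2: the same vector is used
    unchanged as the pedestrian global feature.
    w_orient: hypothetical (512, 3) weights of fully connected layer 2."""
    orientation_probs = softmax(feat512 @ w_orient)  # orientation information
    global_feature = feat512                          # pedestrian global information
    return orientation_probs, global_feature

rng = np.random.default_rng(2)
f = rng.standard_normal(512)
probs, g = shared_feature_layer(f, rng.standard_normal((512, 3)))
```

The global branch deliberately shares the orientation branch's input rather than learning separate features, which is the mechanism the text credits with reducing the orientation-dependence of same-ID features.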
Example 10:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not described again. To better implement the present invention, the following arrangement is particularly adopted: a Non-Local layer is arranged between the first and second layers, the second and third layers, and the third and fourth layers of each residual network; the BN layer of each residual network adopts BNFeature, and pooling layer 2 of each residual network adopts global average pooling.
To establish the relation between frames in a video and between two pixels a certain distance apart on an image, a Non-Local layer (reference [2]) is adopted and combined with the channel attention mechanism to highlight commonality. Its input and output scales are exactly the same, so no extra scale-transformation layer needs to be added, and several Non-Local layers can be added to the shallow layers to improve precision.
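A minimal sketch of a Non-Local block, in the embedded-Gaussian form commonly used for such layers, shows why the input and output scales are identical. The flattening of the feature map to an (N, C) matrix and all weight shapes are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def non_local(x, w_theta, w_phi, w_g, w_out):
    """Non-Local block sketch: every spatial position attends to every
    other position, and the result is added back residually, so the
    output has exactly the input's shape.
    x: (N, C) with N = H*W flattened positions; the weight matrices play
    the role of 1x1 convolutions."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g   # (N, C') embeddings
    attn = theta @ phi.T                              # (N, N) pairwise relation
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)           # softmax over all positions
    y = (attn @ g) @ w_out                            # aggregate, project back to C
    return x + y                                      # residual: same scale in and out

rng = np.random.default_rng(3)
x = rng.standard_normal((16, 8))                      # a 4x4 feature map, 8 channels
out = non_local(x, rng.standard_normal((8, 4)), rng.standard_normal((8, 4)),
                rng.standard_normal((8, 4)), rng.standard_normal((4, 8)))
```

Because the block ends with a residual addition onto the unchanged input, it can be dropped between any two residual-network layers without a scale-transformation layer, which is the plug-in property the text relies on.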
To normalize the feature vector and better perform orientation classification and feature extraction, the BN layer of each residual network (the improved residual 18, 34 and 50 networks) adopts the BNFeature structure (see reference [3]), which constrains features onto a hypersphere in Euclidean space, making the classification hyperplane clearer and improving the accuracy of orientation classification.
Example 11:
with reference to fig. 2 and fig. 3, the object of this embodiment is to provide a real-time pedestrian detection and feature extraction module based on a self-selection mechanism that adapts to the pedestrian re-identification task in complex monitoring scenes, is plug-and-play, and effectively solves the problems of large model parameter quantity, large redundant computation, low applicability, low efficiency and poor generalization capability in existing pedestrian re-identification tasks.
A real-time pedestrian feature extraction module based on a self-selection mechanism is characterized in that the design process comprises the following steps:
1) building a single-stage detection network, adding a redundancy simplification module (reference [1]) and an attention module (reference [4]) to construct a new backbone network (pedestrian detection module); features are detected by this part and conveyed to a residual network combining Non Local (reference [2]) and BNFeature (reference [3]) layers for feature extraction;
2) combining the step 1), adding a network depth self-adaptive selection module, and self-adaptively selecting the corresponding re-recognition network depth according to the number of pedestrians detected in the step 1);
3) training the pedestrian detection module and the feature extraction module, and testing the real-time performance of the integrated network system on the corresponding test set.
The step 1) comprises the following specific steps:
1.1) analyzing a detection network:
adopting a single-stage detection network structure, redesigning a backbone network structure and arrangement, referring to the single-stage network structure, and applying a redundancy simplification module and an attention module in a backbone network:
The redundancy simplification module exploits the redundant information generated by similar feature maps. The original convolution is split into two batches: the output is first produced by the ordinary convolution operation, and then a series of simple linear operations generates more features. These two steps yield the same number of feature maps as an ordinary convolution while fully exploiting the redundant features; replacing convolution layers with simple linear operations greatly reduces the network parameters, and the learning capacity of the convolutional neural network is maintained or even improved while the amount of calculation is reduced by 20%. The refined structure of the redundancy simplification module is shown in fig. 4. Meanwhile, traditional downsampling loses information and destroys features, so the modified backbone network rearranges the input width and height information: simply put, the width and height data are halved and segmented into 4 parts, each part equivalent to 2× downsampling; the parts are then spliced in the channel dimension and finally a convolution operation is performed. The greatest benefit is that information loss is minimized during the downsampling operation.
The reduction of the network parameters may affect the detection accuracy, so an attention module (reference [4]) is added to reduce this influence; at the cost of a slight increase in calculation, it screens out channel-wise attention to enhance the detection network's feature learning capability and reduce the negative influence that cutting network parameters may bring. The refined structure of the attention module is shown in fig. 5: global average pooling is performed on the feature map to obtain the global compression feature quantity, the weight of each channel is obtained through two fully connected layers, and the normalized, weighted result is taken as the input of the next layer.
1.2) building a detection network
In the backbone network at the detection end, the convolution layer mainly adopts Conv convolution, and the bottleneck channel layer is designed using the redundancy simplification module. As shown in fig. 6, the bottleneck channel layer mainly consists of two stacked redundancy simplification modules (redundancy simplification module stride 2 and redundancy simplification module stride 1): the first module expands the number of channels, and the second reduces it back to the number of input channels. With stride 1 the layer functions like a residual block; with stride 2, a stride-2 convolution layer is added between the two redundancy simplification modules, reducing the feature map size to 1/2 of the input, and the skip connection undergoes the same downsampling so that the channel addition operation stays aligned. The pedestrian detection module thus implements the function of a downsampling layer in the traditional method. The redesigned pedestrian detection module structure is shown in fig. 2.
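The expand-then-shrink stacking with an aligned skip connection can be sketched as follows. This is a schematic composition, not the patent's layer: the redundancy simplification module is reduced to the earlier 1x1-mixing-plus-cheap-scaling form, and the intermediate stride-2 convolution is stood in for by strided slicing; all names and weight shapes are assumptions:

```python
import numpy as np

def ghost(x, w, scale):
    """Redundancy-simplification sketch: half the output channels by an
    ordinary 1x1 mixing, the other half by a cheap per-channel scaling."""
    p = np.einsum('hwc,cd->hwd', x, w)
    return np.concatenate([p, p * scale], axis=-1)

def bottleneck_channel_layer(x, w1, s1, w2, s2, stride=1):
    """Two stacked modules: the first expands the channels, the second
    shrinks them back to the input count.  With stride 2, both the main
    path and the skip connection are downsampled identically so the final
    channel addition stays aligned."""
    y = ghost(x, w1, s1)          # expand: C_in -> 2*C_in channels
    skip = x
    if stride == 2:
        y = y[0::2, 0::2]         # stand-in for the intermediate stride-2 convolution
        skip = skip[0::2, 0::2]   # skip path downsampled the same way
    y = ghost(y, w2, s2)          # shrink back to C_in channels
    return y + skip               # aligned channel-wise addition

rng = np.random.default_rng(4)
x = rng.standard_normal((8, 8, 16))
w1, s1 = rng.standard_normal((16, 16)), rng.standard_normal(16)   # 16 -> 32
w2, s2 = rng.standard_normal((32, 8)), rng.standard_normal(8)     # 32 -> 16
out1 = bottleneck_channel_layer(x, w1, s1, w2, s2, stride=1)
out2 = bottleneck_channel_layer(x, w1, s1, w2, s2, stride=2)
```

With stride 1 the layer behaves like a residual block (same shape in and out); with stride 2 it halves the spatial size while keeping the channel count, i.e. it plays the role of a downsampling layer.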
1.3) analyzing and re-identifying the network:
the re-identification network realizes pedestrian detection, feature extraction and comparison identification, and two problems must be considered. First, the features of the same identity (ID) differ across orientations: a general network assigns one person only one feature, but the same ID has different features in different orientations, and if the orientation factor is ignored, identification accuracy is low once a pedestrian's orientation changes. Second, the efficiency of feature extraction: as the network grows deeper and its structure more complex, its running time lengthens, affecting real-time performance. In view of these problems, the overall network structure adopted by this embodiment, shown in fig. 7, mainly includes two core parts: the pedestrian detection module and the feature extraction module.
1.4) building a re-identification network:
1.4.1) single network optimization module:
this embodiment studies the common residual networks residual 18, residual 34 and residual 50, combined with the Non Local layer and the BNFeature layer, tested with the network structure shown in table 2. To establish the relation between frames in a video and between two pixels a certain distance apart on an image, the Non Local structure is adopted and combined with the channel attention mechanism to highlight commonality; its input and output scales are guaranteed to be exactly the same, no extra scale-transformation layer needs to be added, and several Non Local modules can be added to the shallow network to improve precision. To normalize the feature vector and better perform orientation classification and feature extraction, the BNFeature structure (proposed in reference [3]) is adopted, constraining features onto a hypersphere in Euclidean space so that the classification hyperplane is clearer and the accuracy of orientation classification is improved. Table 2 shows the detailed structure of the networks combining residual networks of different depths with Non Local and BNFeature.
1.4.2) design shared feature Branch containing orientation and ID (shared feature layer)
To reduce the influence of feature differences under different orientations of the same identity on the recognition rate, a shared feature branch containing orientation and identity (global feature) is designed. The 512-dimensional feature vector output by BNFeature passes through two branches: one branch feeds the fully connected layer 2 (512, 3) for orientation attribute recognition, converting the 512-dimensional feature into a 3-dimensional feature representing the orientation of the target (front, back, side) so as to realize orientation attribute judgment; in the other branch, the same 512-dimensional feature vector is used to represent the global feature.
2) Design network depth adaptive selection module
Recognition accuracy and feature extraction efficiency trade off against each other: the deeper the network, the higher the recognition accuracy and the lower the FPS (frames per second, a measure of inference speed); the shallower the network, the lower the recognition accuracy and the higher the feature extraction efficiency. The whole system must therefore strike a balance, ensuring that feature extraction efficiency meets deployment requirements while maintaining high precision. Meanwhile, if few people appear in the current video frame, choosing a deeper network wastes hardware computing power and reduces operating efficiency; if a shallower network is chosen when many pedestrians enter, tracking accuracy suffers greatly. This embodiment therefore designs a network depth self-adaptive selection module (see the residual backbone network module in fig. 3), which adaptively selects the corresponding residual network structure according to the number of targets in each frame from the pedestrian detection part, improving feature extraction efficiency without reducing accuracy. There are three selection strategies: strategy 1 is residual 50 + Non Local, strategy 2 is residual 34 + Non Local, and strategy 3 is residual 18 + Non Local, expressed by the following formula:
For example, if 15 people are detected in the currently incoming video frame, the backbone network adaptively switches to the residual 34 + Non Local network, so that the detected features are sent to a network of the corresponding depth, increasing identification speed as much as possible while preserving identification accuracy.
Cascading the two parts yields a whole module with the functions of feature recognition, extraction and re-identification. During the overall operation of the detection and re-identification network, the detection network end (pedestrian detection module) designed in step 1) detects the pedestrian features in a video frame with low calculation cost and high accuracy; while conveying the features to the re-identification network, it also conveys the number of pedestrians in each frame. The re-identification network (feature extraction module) designed in step 2) then adaptively selects the network depth and sends the pictures into the backbone network combining the residual series network with Non Local. After the global average pooling layer, residual 50 must pass through the fully connected layer 1, changing the 2048-dimensional feature vector into a 512-dimensional feature vector, while residual 34 and residual 18 already output 512-dimensional feature vectors. The 512-dimensional feature vector serves as the pedestrian feature and simultaneously enters the fully connected layer 2, which extracts the output orientation; the feature vectors are then stored in a database, by orientation, for comparison and identification.
3) Network training and testing: the weights of the network modules built in steps 1) and 2) are trained using the monitoring data set, and the real-time performance of the module is tested on the corresponding test set.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.
Claims (10)
1. A real-time pedestrian detection and feature extraction module based on a self-selection mechanism, characterized in that it comprises a pedestrian detection module and a feature extraction module, wherein,
the pedestrian detection module is used for transmitting the pedestrian number A, the pedestrian detection frame and the corresponding pedestrian feature B detected in one picture into the feature extraction module;
and the feature extraction module is used for screening the backbone network in the residual backbone network module according to the detected pedestrian number A, sending the pedestrian features B specific to each person into the corresponding residual network in the backbone network for feature extraction, meanwhile, distinguishing the pedestrian features B by combining with the orientation information, and then comparing and identifying the pedestrian global information and the pedestrian orientation information of each pedestrian with the pedestrian related information in the database.
2. The module of claim 1, wherein the module is for detecting and extracting pedestrian features in real time based on a self-selection mechanism, and comprises: the pedestrian detection module is sequentially provided with a slice convolution layer and 3 bottleneck channel layers, an attention module is arranged behind each bottleneck channel layer, a channel addition module is arranged behind the last attention module, and the output of each attention module is connected to the channel addition module.
3. The module of claim 2, wherein the module is for detecting and extracting pedestrian features in real time based on a self-selection mechanism, and comprises: the bottleneck channel layer is composed of two redundancy simplification modules with different strides.
4. The module of claim 2, wherein the module is for detecting and extracting pedestrian features in real time based on a self-selection mechanism, and comprises: the slice convolution layer of the pedestrian detection module is used for rearranging the width and height information of the picture, and specifically comprises the following steps: the slice convolution layer of the pedestrian detection module divides the width and height data of the picture into 4 parts by half to obtain 4 parts of data, then splices the 4 parts of data in the channel dimension, and finally carries out convolution operation.
5. The module of claim 2, wherein the module is for detecting and extracting pedestrian features in real time based on a self-selection mechanism, and comprises: the channel bottleneck layer is formed by connecting a stride-2 redundancy simplification module and a stride-1 redundancy simplification module in series; the channel bottleneck layer extracts features from downsampled pictures as follows: first, the stride-2 redundancy simplification module performs a convolution operation to generate a feature map of half size, and the original information generated by convolution is combined with the redundant information generated by linear transformation, completing one downsampling; then the stride-1 redundancy simplification module generates a feature map of the same size, again combining the original information from convolution with the redundant information from linear transformation, reducing the loss of feature information caused by convolution and finally completing the channel bottleneck layer's feature extraction from the downsampled picture.
6. The module of claim 2, wherein the module is for detecting and extracting pedestrian features in real time based on a self-selection mechanism, and comprises: the attention module obtains global compression characteristic quantity by executing global average pooling on the characteristic diagram, obtains the weight of each channel through two layers of full connection layers, and takes the weight as the input of the next layer after normalization and weighting.
7. The module according to any one of claims 1 to 6, wherein the module comprises: the feature extraction module is provided with a network depth self-adaptive selection module and a residual backbone network module, wherein,
the network depth self-adaptive selection module is used for self-adaptively inputting the pedestrian characteristics B into the residual backbone network module by judging the range of the interval where the pedestrian number A detected by the current picture is positioned and according to the corresponding judgment rule;
and the residual backbone network module is used for extracting the characteristics of the pedestrian characteristics B and further separating pedestrian orientation information and pedestrian global information so as to be input into a subsequent database for comparison.
8. The module of claim 7, wherein the module is for detecting and extracting pedestrian features in real time based on a self-selection mechanism, and comprises: the network depth self-adaptive selection module is provided with an if-else selector with a built-in judgment rule, the judgment rule being as follows:
wherein the improved residual 50 network is selected under strategy 1, the improved residual 34 network under strategy 2, and the improved residual 18 network under strategy 3.
9. The real-time pedestrian detection and feature extraction module based on a self-selection mechanism according to claim 7, characterized in that: the residual backbone network module is provided with a backbone network, a global average pooling layer, a BN layer, a first fully connected layer and a shared feature layer, wherein
the backbone network is provided with three residual networks: an improved residual-18 network, an improved residual-34 network and an improved residual-50 network;
the global average pooling layer is used for obtaining a globally compressed feature of the pedestrian features B;
the first fully connected layer, connected after the global average pooling layer, acts as a classifier and is used for obtaining the weight of each residual network;
the BN layer adopts the BNFeature structure, which constrains the features onto a hypersphere so that the classification hyperplane is clearer and the accuracy of orientation classification is improved; and
the shared feature layer comprises processing branches for the orientation feature and the global feature, which reduces the influence on the recognition rate of feature differences between different orientations of the same ID.
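Claim 9 says the BNFeature layer "constrains the features onto a hypersphere." One common reading of such a step (it resembles the BNNeck used in re-identification pipelines) is batch normalization of the pooled feature vectors followed by projection onto the unit sphere; the sketch below assumes that reading and omits BN's learnable affine parameters.

```python
import numpy as np

def bn_feature(features, eps=1e-5):
    """Hedged sketch of the 'BNFeature' step in claim 9.

    features: (N, D) pooled feature vectors, one row per pedestrian crop.
    Batch-normalize each dimension, then L2-project each vector onto the
    unit hypersphere, so orientation classes separate by angle alone.
    """
    mu = features.mean(axis=0)
    var = features.var(axis=0)
    normed = (features - mu) / np.sqrt(var + eps)         # BN, no affine terms
    norms = np.linalg.norm(normed, axis=1, keepdims=True)
    return normed / np.maximum(norms, eps)                # points on the sphere
```

With every feature at unit norm, a linear classifier's decision boundary cuts the sphere along a great circle, which is the "clearer classification hyperplane" the claim refers to.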
10. The real-time pedestrian detection and feature extraction module based on a self-selection mechanism according to claim 7, characterized in that: a Non-local layer is arranged between the first and second layers, between the second and third layers, and between the third and fourth layers of each residual network; the BN layer of each residual network adopts the BNFeature structure; and the second pooling layer of each residual network adopts global average pooling.
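The Non-local layers inserted between residual stages in claim 10 compute pairwise affinities between all spatial positions. Below is a simplified embedded-Gaussian non-local operation in numpy; the 1x1 convolutions of the standard block are modeled as per-position matrix multiplies, and the weight shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(x, w_theta, w_phi, w_g, w_out):
    """Simplified non-local operation (sketch of the layer in claim 10).

    x: (C, H, W). w_theta, w_phi, w_g: (C, C_inner); w_out: (C_inner, C).
    Returns x plus a residual in which every position attends to all others.
    """
    c, h, w = x.shape
    flat = x.reshape(c, h * w).T              # (HW, C): one row per position
    theta = flat @ w_theta                    # queries
    phi = flat @ w_phi                        # keys
    g = flat @ w_g                            # values
    attn = softmax(theta @ phi.T, axis=-1)    # (HW, HW) pairwise affinities
    y = (attn @ g) @ w_out                    # aggregate, project back to C
    return x + y.T.reshape(c, h, w)           # residual connection
```

Because the affinity matrix is HW-by-HW, the block captures long-range dependencies that stacked 3x3 convolutions in the residual stages cannot, at a quadratic cost in the number of positions.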
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110391719.2A CN113591532A (en) | 2021-04-13 | 2021-04-13 | Real-time pedestrian detection and feature extraction module based on self-selection mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113591532A (en) | 2021-11-02 |
Family
ID=78242988
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110391719.2A Pending CN113591532A (en) | 2021-04-13 | 2021-04-13 | Real-time pedestrian detection and feature extraction module based on self-selection mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113591532A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111783576A (en) * | 2020-06-18 | 2020-10-16 | 西安电子科技大学 | Pedestrian re-identification method based on improved YOLOv3 network and feature fusion |
CN112183647A (en) * | 2020-09-30 | 2021-01-05 | 国网山西省电力公司大同供电公司 | Transformer substation equipment sound fault detection and positioning method based on deep learning |
Non-Patent Citations (1)
Title |
---|
YE LI ET AL.: "A Multi-task Joint Framework for Real-time Person Search" * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yeh et al. | Lightweight deep neural network for joint learning of underwater object detection and color conversion | |
CN111639692A (en) | Shadow detection method based on attention mechanism | |
CN110263712B (en) | Coarse and fine pedestrian detection method based on region candidates | |
CN113963032A (en) | Twin network structure target tracking method fusing target re-identification | |
CN116189281B (en) | End-to-end human behavior classification method and system based on space-time self-adaptive fusion | |
CN111415318A (en) | Unsupervised correlation filtering target tracking method and system based on jigsaw task | |
Tang et al. | Deep saliency quality assessment network with joint metric | |
CN115641632A (en) | Face counterfeiting detection method based on separation three-dimensional convolution neural network | |
Muddamsetty et al. | A performance evaluation of fusion techniques for spatio-temporal saliency detection in dynamic scenes | |
Luo et al. | LatRAIVF: An infrared and visible image fusion method based on latent regression and adversarial training | |
CN118212463A (en) | Target tracking method based on fractional order hybrid network | |
CN116934796B (en) | Visual target tracking method based on twinning residual error attention aggregation network | |
CN111539434B (en) | Infrared weak and small target detection method based on similarity | |
CN113591532A (en) | Real-time pedestrian detection and feature extraction module based on self-selection mechanism | |
CN114582002B (en) | Facial expression recognition method combining attention module and second-order pooling mechanism | |
CN117557923B (en) | Real-time traffic detection method for unmanned aerial vehicle vision sensing device | |
CN116110076B (en) | Power transmission aerial work personnel identity re-identification method and system based on mixed granularity network | |
Deng et al. | LP3DAM: Lightweight parallel 3D attention module for violence detection | |
Huang et al. | Multi-camshift for multi-view faces tracking and recognition | |
Tsai et al. | Combined 2D and 3D Convolution Residual Attention Network for Hand Gesture Recognition | |
Chihaoui et al. | Implementation of skin color selection prior to Gabor filter and neural network to reduce execution time of face detection | |
CN118247478B (en) | Child positioning method, device, equipment and storage medium based on optimized Yolov model | |
CN117557923A (en) | Real-time traffic detection method for unmanned aerial vehicle vision sensing device | |
Yang et al. | An Image Saliency Detection Method Based on Combining Global and Local Information | |
Erdem | A region covariances-based visual attention model for RGB-D images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20211102 |