CN113901962A - Method and system for identifying pedestrian in abnormal state based on deep learning

Info

Publication number: CN113901962A
Application number: CN202111471511.8A
Authority: CN (China)
Prior art keywords: block, convolution, convolution block, feature extraction, residual
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 李之红, 张晶, 王子男, 高秀丽
Current Assignee: Beijing University of Civil Engineering and Architecture
Original Assignee: Beijing University of Civil Engineering and Architecture
Application filed by Beijing University of Civil Engineering and Architecture

Classifications

    • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/253 — Pattern recognition; fusion techniques of extracted features
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/08 — Neural networks; learning methods

Abstract

The invention provides a deep learning-based method and system for recognizing pedestrians in an abnormal state. First, an initial recognition model is constructed with the YOLOV3 model on a deep learning framework; then a training set is input into the initial recognition model for training with a gradient descent algorithm, and the initial recognition model corresponding to the minimum total loss function is taken as the pedestrian recognition model in the abnormal state; finally, the target video image to be detected is input into the pedestrian recognition model in the abnormal state for pedestrian recognition and labeling, and redundant prior frames are eliminated by a non-maximum suppression method to obtain the final pedestrian recognition annotation image. Because the pedestrian recognition model in the abnormal state is constructed based on deep learning and redundant prior frames are then removed by non-maximum suppression, the pedestrian recognition accuracy in the abnormal state is improved.

Description

Method and system for identifying pedestrian in abnormal state based on deep learning
Technical Field
The invention relates to the technical field of image processing, in particular to a pedestrian identification method and system in an abnormal state based on deep learning.
Background
Object detection is an important branch of image processing and computer vision, is a core part of intelligent monitoring systems, and plays an important role in subsequent tasks such as target tracking and trajectory prediction. With the development of object detection technology, intelligent video monitoring can be applied widely. However, video monitoring is easily affected by external conditions such as background and lighting, and compared with pedestrian detection in a normal state, pedestrian detection in an abnormal environment is a very complex problem: pedestrians occlude one another, small targets are numerous, and actions are random, so the recognition accuracy is low.
Disclosure of Invention
The invention aims to provide a pedestrian recognition method and system in an abnormal state based on deep learning so as to improve the pedestrian recognition accuracy.
In order to achieve the above object, the present invention provides a method for identifying a pedestrian in an abnormal state based on deep learning, the method comprising:
step S1: constructing a sample training set;
step S2: constructing an initial recognition model by using a YOLOV3 model based on a deep learning framework;
step S3: inputting the training set into the initial recognition model for training by adopting a gradient descent algorithm, and taking the initial recognition model corresponding to the minimum total loss function as a pedestrian recognition model in an abnormal state; the pedestrian recognition model in the abnormal state comprises: a feature extraction network, a multi-scale extraction fusion network and a priori frame labeling network;
step S4: and inputting the target video image to be detected into the pedestrian recognition model in the abnormal state for pedestrian recognition and labeling, and eliminating redundant prior frames by adopting a non-maximum suppression method to obtain a final pedestrian recognition labeling image.
Optionally, the inputting the target video image to be detected into the pedestrian recognition model in the abnormal state for pedestrian recognition and labeling, and eliminating redundant prior frames by using a non-maximum suppression method to obtain a final pedestrian recognition labeling diagram specifically includes:
step S41: inputting a target video image to be detected into a feature extraction network for feature extraction to obtain initial feature maps with different scales;
step S42: inputting the initial feature maps of different scales into a multi-scale extraction fusion network for feature extraction and fusion to obtain fusion feature maps of different scales;
step S43: selecting a prior frame corresponding to each fusion characteristic graph;
step S44: inputting the fusion characteristic graphs of different scales into a priori frame marking network, marking pedestrians in the fusion characteristic graphs of different scales according to a priori frame corresponding to each fusion characteristic graph, and obtaining pedestrian identification initial marking graphs of different sizes;
step S45: overlapping and fusing the pedestrian identification initial labeling graphs of different sizes to obtain a total fusion characteristic graph;
step S46: and eliminating redundant prior frames on the total fusion characteristic graph by adopting a non-maximum value inhibition method, and outputting a final pedestrian identification labeled graph.
Optionally, the feature extraction network comprises a first feature extraction module, a second feature extraction module, a third feature extraction module, a fourth feature extraction module and a fifth feature extraction module, wherein the first feature extraction module is connected with the fifth feature extraction module sequentially through the second feature extraction module, the third feature extraction module and the fourth feature extraction module;
the first feature extraction module comprises 2 convolution blocks and 1 residual block, wherein the 1 st convolution block is connected with the residual block through the 2 nd convolution block; the residual block comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the 2 nd convolution block and the 2 nd convolution block in the first characteristic extraction module; adding the feature graph output by the 2 nd convolution block in the residual block and the feature graph output by the 2 nd convolution block in the first feature extraction module, and inputting the addition result to the second feature extraction module;
the second feature extraction module comprises 1 convolution block and 2 residual blocks, and the convolution blocks are connected through the 1 st residual block and the 2 nd residual block; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the 1 st convolution block in the second feature extraction module, the 1 st convolution block in the 2 nd residual block is connected with the 2 nd convolution block in the 1 st residual block, the feature map output by the 2 nd convolution block in the last residual block is added with the feature map output by the convolution block in the second feature extraction module, and the addition result is input to the third feature extraction module;
the third feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the third feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the third feature extraction module, and the addition result is input into the fourth feature extraction module and the third extraction fusion network;
the fourth feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fourth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fourth feature extraction module, and the addition result is input into the fifth feature extraction module and the second extraction fusion network;
the fifth feature extraction module comprises 1 convolution block and 4 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 4 th residual block sequentially through 2 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; and the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fifth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fifth feature extraction module, and the addition result is input into the first extraction fusion network.
Optionally, the multi-scale extraction fusion network includes a first extraction fusion network, a second extraction fusion network, and a third extraction fusion network; the first extraction fusion network is connected with the fifth feature extraction module, the second extraction fusion network is connected with the fourth feature extraction module, the third extraction fusion network is connected with the third feature extraction module, and the first extraction fusion network is connected with the third extraction fusion network through the second extraction fusion network;
the first extraction fusion network comprises 1 Convolutional Set layer; the Convolutional Set layer is respectively connected with the fifth feature extraction module, the convolution block in the second extraction fusion network and the 1 st convolution block in the first labeling module;
the second extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and with the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the second labeling module, the splicing layer, the convolution block in the third extraction fusion network and the fourth feature extraction module;
the third extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and with the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the third labeling module, the splicing layer and the third feature extraction module.
Optionally, the prior frame labeling network includes a first labeling module, a second labeling module, and a third labeling module; the first labeling module is connected with the first extraction fusion network, the second labeling module is connected with the second extraction fusion network, and the third labeling module is connected with the third extraction fusion network;
the first labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and with the 2 nd convolution block;
the second labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and with the 2 nd convolution block;
the third labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the third extraction fusion network and with the 2 nd convolution block.
The invention also provides a deep learning-based pedestrian recognition system in an abnormal state, which comprises:
the training set constructing module is used for constructing a sample training set;
the initial identification model building module is used for building an initial identification model by using a YOLOV3 model based on a deep learning framework;
the training module is used for inputting the training set into the initial recognition model for training by adopting a gradient descent algorithm, and taking the initial recognition model corresponding to the minimum total loss function as a pedestrian recognition model in an abnormal state; the pedestrian recognition model in the abnormal state comprises: a feature extraction network, a multi-scale extraction fusion network and a priori frame labeling network;
and the pedestrian identification and marking module is used for inputting the target video image to be detected into the pedestrian identification model in the abnormal state for pedestrian identification and marking, and eliminating redundant prior frames by adopting a non-maximum inhibition method to obtain a final pedestrian identification marking image.
Optionally, the pedestrian identification and marking module specifically includes:
the characteristic extraction unit is used for inputting a target video image to be detected into a characteristic extraction network for characteristic extraction to obtain initial characteristic graphs of different scales;
the characteristic extraction and fusion unit is used for inputting the initial characteristic graphs of different scales into a multi-scale extraction and fusion network to extract and fuse the characteristics to obtain fusion characteristic graphs of different scales;
the selection unit is used for selecting the prior frame corresponding to each fusion characteristic graph;
the marking unit is used for inputting the fusion characteristic graphs of different scales into the prior frame marking network, marking the pedestrians in the fusion characteristic graphs of different scales according to the prior frame corresponding to each fusion characteristic graph, and obtaining pedestrian identification initial marking graphs of different sizes;
the superposition fusion unit is used for carrying out superposition fusion on pedestrian identification initial labeling graphs of different sizes to obtain a total fusion characteristic graph;
and the eliminating unit is used for eliminating redundant prior frames on the total fusion characteristic graph by adopting a non-maximum value inhibition method and outputting a final pedestrian identification label graph.
Optionally, the feature extraction network comprises a first feature extraction module, a second feature extraction module, a third feature extraction module, a fourth feature extraction module and a fifth feature extraction module, wherein the first feature extraction module is connected with the fifth feature extraction module sequentially through the second feature extraction module, the third feature extraction module and the fourth feature extraction module;
the first feature extraction module comprises 2 convolution blocks and 1 residual block, wherein the 1 st convolution block is connected with the residual block through the 2 nd convolution block; the residual block comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the 2 nd convolution block and the 2 nd convolution block in the first characteristic extraction module; adding the feature graph output by the 2 nd convolution block in the residual block and the feature graph output by the 2 nd convolution block in the first feature extraction module, and inputting the addition result to the second feature extraction module;
the second feature extraction module comprises 1 convolution block and 2 residual blocks, and the convolution blocks are connected through the 1 st residual block and the 2 nd residual block; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the 1 st convolution block in the second feature extraction module, the 1 st convolution block in the 2 nd residual block is connected with the 2 nd convolution block in the 1 st residual block, the feature map output by the 2 nd convolution block in the last residual block is added with the feature map output by the convolution block in the second feature extraction module, and the addition result is input to the third feature extraction module;
the third feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the third feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the third feature extraction module, and the addition result is input into the fourth feature extraction module and the third extraction fusion network;
the fourth feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fourth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fourth feature extraction module, and the addition result is input into the fifth feature extraction module and the second extraction fusion network;
the fifth feature extraction module comprises 1 convolution block and 4 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 4 th residual block sequentially through 2 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; and the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fifth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fifth feature extraction module, and the addition result is input into the first extraction fusion network.
Optionally, the multi-scale extraction fusion network includes a first extraction fusion network, a second extraction fusion network, and a third extraction fusion network; the first extraction fusion network is connected with the fifth feature extraction module, the second extraction fusion network is connected with the fourth feature extraction module, the third extraction fusion network is connected with the third feature extraction module, and the first extraction fusion network is connected with the third extraction fusion network through the second extraction fusion network;
the first extraction fusion network comprises 1 Convolutional Set layer; the Convolutional Set layer is respectively connected with the fifth feature extraction module, the convolution block in the second extraction fusion network and the 1 st convolution block in the first labeling module;
the second extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and with the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the second labeling module, the splicing layer, the convolution block in the third extraction fusion network and the fourth feature extraction module;
the third extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and with the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the third labeling module, the splicing layer and the third feature extraction module.
Optionally, the prior frame labeling network includes a first labeling module, a second labeling module, and a third labeling module; the first labeling module is connected with the first extraction fusion network, the second labeling module is connected with the second extraction fusion network, and the third labeling module is connected with the third extraction fusion network;
the first labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and with the 2 nd convolution block;
the second labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and with the 2 nd convolution block;
the third labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the third extraction fusion network and with the 2 nd convolution block.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a pedestrian recognition method and system under abnormal conditions based on deep learning, which comprises the steps of firstly, constructing an initial recognition model by utilizing a YOLOV3 model based on a deep learning framework; then, inputting the training set into the initial recognition model for training by adopting a gradient descent algorithm, and taking the initial recognition model corresponding to the minimum total loss function as a pedestrian recognition model in an abnormal state; and finally, inputting the target video image to be detected into the pedestrian recognition model in the abnormal state for pedestrian recognition and labeling, and eliminating redundant prior frames by adopting a non-maximum suppression method to obtain a final pedestrian recognition labeling image. According to the method, the pedestrian recognition model under the abnormal state is constructed based on deep learning, and then redundant priori frames are removed by adopting a non-maximum suppression method, so that the pedestrian recognition precision under the abnormal state is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of a pedestrian recognition method based on deep learning in an abnormal state according to the present invention;
FIG. 2 is a schematic diagram of a pedestrian recognition model in an abnormal state according to the present invention;
FIG. 3 is an example of a VOC + APD data set sample of the present invention;
FIG. 4 is a diagram of a pedestrian recognition system based on deep learning in an abnormal state according to the present invention;
FIG. 5 shows the abnormal-state pedestrian recognition results of the present invention based on YOLOV3-VOC + APD;
FIG. 6 shows the abnormal-state pedestrian recognition results based on YOLOV3-VOC;
FIG. 7 shows the abnormal-state pedestrian recognition results based on SSD-VOC + APD;
FIG. 8 shows the abnormal-state pedestrian recognition results based on SSD-VOC.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a pedestrian recognition method and system in an abnormal state based on deep learning so as to improve the pedestrian recognition accuracy.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
With the rise of deep learning, artificial intelligence plays an important role in more and more fields such as image recognition and natural language processing. With the support of big data, deep neural networks perform remarkably in image recognition and object detection, and the deep convolutional neural network is the most representative network structure. This structure has two characteristics: (1) local receptive fields: each neuron is connected only to the adjacent neurons of the previous layer, and more abstract and essential global features are obtained by extracting local features; (2) weight sharing: when the same convolution kernel operates on different local receptive fields, the specific position of a feature in the picture need not be considered specially and the same weight parameters are used, which greatly reduces the amount of network parameter computation.
Example 1
As shown in fig. 1, the invention discloses a pedestrian recognition method in an abnormal state based on deep learning, which comprises the following steps:
step S1: and constructing a sample training set.
Step S2: based on a deep learning framework, an initial recognition model is constructed by using a Yolov3 model.
Step S3: inputting the training set into the initial recognition model for training by adopting a gradient descent algorithm, and taking the initial recognition model corresponding to the minimum total loss function as a pedestrian recognition model in an abnormal state; the pedestrian recognition model in the abnormal state comprises: the system comprises a feature extraction network, a multi-scale extraction fusion network and a priori frame labeling network.
Step S4: and inputting the target video image to be detected into the pedestrian recognition model in the abnormal state for pedestrian recognition and labeling, and eliminating redundant prior frames by adopting a non-maximum suppression method to obtain a final pedestrian recognition labeling image.
The individual steps are discussed in detail below:
step S1: constructing a sample training set, which specifically comprises the following steps:
step S11: producing a pedestrian data set in an abnormal state (APD data set for short) by web crawling and image labeling techniques. That is, aiming at the current lack of training data sets for pedestrians in abnormal states, 2500 sample images of abnormal-state pedestrian data are obtained through web crawler technology, and the APD data set is produced with the target detection annotation software LabelImg, thereby enriching the data set.
Step S12: and constructing a VOC + APD data set based on the APD data set and the VOC2007 data set.
Step S13: constructing a training set and a test set based on the VOC + APD data set; the training set contains 5011 sample images from the VOC data set and 1500 sample images from the APD data set; the test set contains 4952 VOC sample images and 1000 APD sample images.
Step S2: based on the deep learning framework TensorFlow, constructing an initial recognition model by using the YOLOV3 model. In this embodiment, the TensorFlow 1.2 deep learning framework is used with Python 3.6.0 in an Anaconda environment. The hardware configuration is: AMD Ryzen 9 3950X 16-Core Processor, 16 GB of memory, and an NVIDIA Quadro P620 GPU.
Step S3: inputting the training set into the initial recognition model for training by adopting a gradient descent algorithm, and taking the initial recognition model corresponding to the minimum total loss function as the pedestrian recognition model in the abnormal state. In this embodiment, the hyper-parameters of the initial recognition model are set as follows: the learning rate is 0.001, the learning rate decay is 0.94, the number of iterations is 3000, the GPU occupancy is 0.85 (the better the hardware, the higher this value can be set), the gradient-based Adam algorithm is adopted as the optimizer, and the batch size is 50.
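The following is a minimal, hedged sketch of this training configuration in TensorFlow 1.x style (matching the TensorFlow 1.2 framework of the embodiment); the model builder, the total loss and the data feeder are assumed placeholders, not the patent's own code.

```python
# Hedged sketch of the training setup above (TensorFlow 1.x style API).
# build_yolov3_model, total_loss and next_batch are assumed placeholders.
import tensorflow as tf

images = tf.placeholder(tf.float32, [None, 416, 416, 3])  # 416 x 416 input assumed from the 13/26/52 scales
labels = tf.placeholder(tf.float32, [None, None, 6])      # assumed label layout (box + class)

global_step = tf.Variable(0, trainable=False)
# Learning rate 0.001 with decay rate 0.94 (decay_steps is an assumption).
learning_rate = tf.train.exponential_decay(0.001, global_step,
                                           decay_steps=100, decay_rate=0.94)
loss = total_loss(build_yolov3_model(images), labels)     # total loss L = L1 + L2 + Lloc
train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)

# GPU occupancy 0.85, batch size 50, 3000 iterations, as in the embodiment.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.85)
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(3000):
        batch_images, batch_labels = next_batch(50)        # assumed data feeder
        sess.run(train_op, feed_dict={images: batch_images, labels: batch_labels})
```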
The losses over the samples used for one update of the network parameters are summed, and whether the model has converged is judged by observing how the total loss function value changes with the number of iterations; the smaller the total loss value, the better the model is trained, and the model with the lowest training loss value is output as the final pedestrian recognition model in the abnormal state.
In this embodiment, the total loss function includes a confidence loss function, a category probability loss function, and a target frame loss function, and the total loss function formula is calculated as follows:
L = L1 + L2 + Lloc
The class probability loss function L1 and the confidence loss function L2 are both expressed as binary cross entropy; the concrete formula is:

L_conf = -(1/n) · Σ_{i=1}^{n} [ ŷ_i · log(y_i) + (1 − ŷ_i) · log(1 − y_i) ]

where the index i represents the index of the sample, n represents the number of samples required for one update of all the network parameters, ŷ_i is the label value of the sample, y_i is the predicted value of the network, and L_conf represents the binary cross entropy.
L_loc = 1 − v/u + (A_c − u)/A_c

where L_loc represents the target box loss function, v represents the intersection area of the target real box and the predicted bounding box, u represents the area of the union of the target real box and the predicted bounding box, and A_c represents the area of the smallest rectangle surrounding the predicted bounding box and the target real box.
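As an illustration of the reconstructed loss terms, the following NumPy sketch computes the binary cross entropy and the target box loss for a toy example; the (x1, y1, x2, y2) box format and the example values are assumptions for illustration only, not taken from the patent.

```python
# Minimal NumPy sketch of the loss terms L1/L2 (binary cross entropy) and Lloc.
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """L1 / L2: mean binary cross entropy over the n samples of one update."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

def target_box_loss(box_true, box_pred):
    """Lloc = 1 - v/u + (Ac - u)/Ac for one ground-truth / predicted box pair."""
    ix1 = max(box_true[0], box_pred[0]); iy1 = max(box_true[1], box_pred[1])
    ix2 = min(box_true[2], box_pred[2]); iy2 = min(box_true[3], box_pred[3])
    v = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)                  # intersection area v
    area_t = (box_true[2] - box_true[0]) * (box_true[3] - box_true[1])
    area_p = (box_pred[2] - box_pred[0]) * (box_pred[3] - box_pred[1])
    u = area_t + area_p - v                                        # union area u
    cx1 = min(box_true[0], box_pred[0]); cy1 = min(box_true[1], box_pred[1])
    cx2 = max(box_true[2], box_pred[2]); cy2 = max(box_true[3], box_pred[3])
    a_c = (cx2 - cx1) * (cy2 - cy1)                                # enclosing rectangle area Ac
    return 1.0 - v / u + (a_c - u) / a_c

# Total loss L = L1 + L2 + Lloc for a toy example.
L = (binary_cross_entropy(np.array([1.0]), np.array([0.8]))       # class probability loss L1
     + binary_cross_entropy(np.array([1.0]), np.array([0.9]))     # confidence loss L2
     + target_box_loss([10, 10, 50, 80], [12, 14, 48, 82]))       # target box loss Lloc
```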
The pedestrian recognition model in the abnormal state is verified with the test set. The confidence loss function tends to be stable after about 500 parameter updates, with a sudden spike of 657 at around 1000 updates; the target frame loss floats steadily around a mean of 4.32; the class probability loss stabilizes after about 500 parameter updates; the total loss stabilizes after about 500 parameter updates and rises at about 1000 updates, which results from the contribution of the confidence loss to the total loss. In conclusion, the training meets the convergence requirement and the training effect is good.
As shown in fig. 2, the pedestrian recognition model in the abnormal state of the present invention includes: a feature extraction network Darknet-53, a multi-scale extraction fusion network and a prior frame labeling network. Darknet-53 comprises: a first feature extraction module, a second feature extraction module, a third feature extraction module, a fourth feature extraction module and a fifth feature extraction module, wherein the first feature extraction module is connected with the fifth feature extraction module sequentially through the second feature extraction module, the third feature extraction module and the fourth feature extraction module. In the following embodiments, each convolution block is the combination of a convolution layer (conv), a normalization layer (BN) and a Leaky ReLU layer; the excitation function is:

f(x) = x,      if x ≥ 0
f(x) = α · x,  if x < 0

where f(·) represents the excitation function and α is a small positive leak coefficient (typically 0.1 in Darknet implementations).
The first feature extraction module comprises 2 convolution blocks and 1 residual block, wherein the 1 st convolution block is connected with the residual block through the 2 nd convolution block; wherein, the convolution layer in the 1 st convolution block comprises 32 convolution kernels of 3 × 3, and the size of the output feature map is 256 × 256; the convolution layer in the 2 nd convolution block includes 64 convolution kernels of 3 × 3/2, and the size of the output feature map is 128 × 128; the residual block comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the 2 nd convolution block and the 2 nd convolution block in the first characteristic extraction module; the convolution layer in the 1 st convolution block includes 32 convolution kernels of 1 x 1, the convolution layer in the 2 nd convolution block includes 64 convolution kernels of 3 x 3, and the size of the residual block output feature map is 128 x 128; and adding the feature graph output by the 2 nd convolution block in the residual block and the feature graph output by the 2 nd convolution block in the first feature extraction module, and inputting the addition result to the second feature extraction module.
The second feature extraction module comprises 1 convolution block and 2 residual blocks, and the convolution blocks are connected through the 1 st residual block and the 2 nd residual block; wherein, the convolution layer in the convolution block comprises 128 convolution kernels of 3 × 3/2, and the size of the output feature map is 64 × 64; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the 1 st convolution block in the second feature extraction module, and the 1 st convolution block in the 2 nd residual block is connected with the 2 nd convolution block in the 1 st residual block; the convolutional layer in the 1 st convolutional block comprises 64 convolution kernels of 1 x 1, the convolutional layer in the 2 nd convolutional block comprises 128 convolution kernels of 3 x 3, and the size of the last residual block output feature map is 64 x 64; and adding the feature graph output by the 2 nd convolution block in the last residual block with the feature graph output by the convolution block in the second feature extraction module, and inputting the addition result to the third feature extraction module.
The third feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; wherein the convolution layers in the convolution block comprise 256 convolution kernels of 3 × 3/2, and the size of the output feature map is 32 × 32; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the third feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the convolution layer in the 1 st convolution block comprises 128 convolution kernels of 1 × 1, the convolution layer in the 2 nd convolution block comprises 256 convolution kernels of 3 × 3, and the size of the feature map output by the last residual block is 32 × 32; and adding the feature graph output by the 2 nd convolution block in the last residual block with the feature graph output by the convolution block in the third feature extraction module, and inputting the addition result to the fourth feature extraction module and the third extraction fusion network.
The fourth feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; wherein the convolution layers in the convolution block comprise 512 convolution kernels of 3 × 3/2, and the size of the output feature map is 16 × 16; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fourth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the convolution layer in the 1 st convolution block comprises 256 convolution kernels with 1 x 1 and the convolution layer in the 2 nd convolution block comprises 512 convolution kernels with 3 x 3, and the size of the feature map output by the last residual block is 16 x 16; and adding the feature graph output by the 2 nd convolution block in the last residual block with the feature graph output by the convolution block in the fourth feature extraction module, and inputting the addition result to the fifth feature extraction module and the second extraction fusion network.
The fifth feature extraction module comprises 1 convolution block and 4 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 4 th residual block through 2 residual blocks at a time; wherein the convolutional layers in the convolutional block comprise 1024 convolution kernels of 3 × 3/2, and the size of the output feature map is 8 × 8; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fifth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the convolution layer in the 1 st convolution block comprises 512 convolution kernels of 1 x 1, the convolution layer in the 2 nd convolution block comprises 1024 convolution kernels of 3 x 3, and the size of the feature map output by the last residual block is 8 x 8; and adding the feature graph output by the 2 nd convolution block in the last residual block with the feature graph output by the convolution block in the fifth feature extraction module, and inputting the addition result into the first extraction fusion network.
The third, fourth and fifth feature extraction modules output initial feature maps at three different scales: 52 × 52 × 256, 26 × 26 × 512 and 13 × 13 × 1024, respectively.
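Building on the conv_block() sketch above, one residual block and one feature extraction module can be sketched as follows; the function names are illustrative, and the first module additionally starts with a 32-kernel 3 × 3 convolution block as described above.

```python
def residual_block(x, filters):
    shortcut = x
    x = conv_block(x, filters // 2, 1)              # 1 x 1 convolution block
    x = conv_block(x, filters, 3)                   # 3 x 3 convolution block
    return tf.keras.layers.Add()([shortcut, x])     # add the two feature maps

def feature_extraction_module(x, filters, num_residual_blocks):
    x = conv_block(x, filters, 3, strides=2)        # 3 x 3/2 down-sampling convolution block
    for _ in range(num_residual_blocks):
        x = residual_block(x, filters)
    return x

# The five modules use 1, 2, 8, 8 and 4 residual blocks respectively; the outputs of
# the third, fourth and fifth modules feed the multi-scale extraction fusion network.
```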
The second feature extraction module, the third feature extraction module, the fourth feature extraction module and the fifth feature extraction module each comprise 1 convolution block, and the size change of the feature map after one convolution is:

H_{i+1} = (H_i − k + 2p) / s + 1,    W_{i+1} = (W_i − k + 2p) / s + 1

where H_{i+1} × W_{i+1} represents the size of the image output by the convolution block in each feature extraction module, H_i × W_i represents the size of the image input to that convolution block, k × k represents the size of the convolution kernel, p represents the edge padding, s represents the stride, and i represents the layer index in the convolutional neural network.
The invention realizes down-scaling through convolution kernels with stride s = 2; within a convolution block the network adopts the 'same' edge padding mode, i.e. the size of the output feature map is unchanged, and the number of channels of the feature map is determined by the number of convolution kernels.
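For concreteness, a small helper implementing the size formula above (integer division is an assumption for non-divisible cases):

```python
# Feature map size after one convolution: (H - k + 2p) / s + 1 in each dimension.
def conv_output_size(h, w, k, p, s):
    return (h - k + 2 * p) // s + 1, (w - k + 2 * p) // s + 1

assert conv_output_size(256, 256, k=3, p=1, s=2) == (128, 128)  # 3 x 3/2 halves the size
assert conv_output_size(256, 256, k=3, p=1, s=1) == (256, 256)  # "same" padding keeps the size
```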
The multi-scale extraction fusion network comprises a first extraction fusion network, a second extraction fusion network and a third extraction fusion network; the first extraction fusion network is connected with the fifth feature extraction module, the second extraction fusion network is connected with the fourth feature extraction module, the third extraction fusion network is connected with the third feature extraction module, and the first extraction fusion network is connected with the third extraction fusion network through the second extraction fusion network.
The first extraction fusion network comprises 1 Convolutional Set layer; the Convolutional Set layer is respectively connected with the fifth feature extraction module, the convolution block in the second extraction fusion network and the 1 st convolution block (i.e. 3 × 3) in the first labeling module.
The second extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer (UP Sampling), 1 splicing layer (Concatenate) and 1 convolution block; the convolution block (i.e. 1 × 1) is respectively connected with the Convolutional Set layer in the first extraction fusion network and with the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block (i.e. 3 × 3) in the second labeling module, the splicing layer, the convolution block (i.e. 1 × 1) in the third extraction fusion network and the fourth feature extraction module.
The third extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer (UP Sampling), 1 splicing layer (Concatenate) and 1 convolution block; the convolution block (i.e. 1 × 1) is respectively connected with the Convolutional Set layer in the second extraction fusion network and with the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block (i.e. 3 × 3) in the third labeling module, the splicing layer and the third feature extraction module.
The Convolutional Set layers in the first extraction fusion network, the second extraction fusion network and the third extraction fusion network output fusion feature maps of 13 × 13 × 512, 26 × 26 × 256 and 52 × 52 × 128, respectively.
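The wiring described above can be sketched as follows, again with tf.keras and the conv_block() helper; the internal composition of the Convolutional Set as five alternating 1 × 1 / 3 × 3 convolution blocks follows the standard YOLOV3 design and is an assumption here.

```python
def convolutional_set(x, filters):
    # Convolutional Set: alternating 1 x 1 and 3 x 3 convolution blocks (assumed layout).
    for k in (1, 3, 1, 3, 1):
        x = conv_block(x, filters if k == 1 else filters * 2, k)
    return x

def fusion_networks(feat5, feat4, feat3):
    set1 = convolutional_set(feat5, 512)                  # first extraction fusion network

    x = conv_block(set1, 256, 1)                          # 1 x 1 convolution block
    x = tf.keras.layers.UpSampling2D(2)(x)                # upsampling layer
    x = tf.keras.layers.Concatenate()([x, feat4])         # splicing layer
    set2 = convolutional_set(x, 256)                      # second extraction fusion network

    x = conv_block(set2, 128, 1)
    x = tf.keras.layers.UpSampling2D(2)(x)
    x = tf.keras.layers.Concatenate()([x, feat3])
    set3 = convolutional_set(x, 128)                      # third extraction fusion network
    return set1, set2, set3                               # 13x13x512, 26x26x256, 52x52x128
```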
The priori frame labeling network comprises a first labeling module, a second labeling module and a third labeling module; the first labeling module is connected with the first extraction fusion network, the second labeling module is connected with the second extraction fusion network, and the third labeling module is connected with the third extraction fusion network.
The first labeling module comprises 2 convolution blocks, and the 1 st convolution block (i.e. 3 × 3) is respectively connected with the Convolutional Set layer in the first extraction fusion network and with the 2 nd convolution block (i.e. 1 × 1).
The second labeling module comprises 2 convolution blocks, and the 1 st convolution block (i.e. 3 × 3) is respectively connected with the Convolutional Set layer in the second extraction fusion network and with the 2 nd convolution block (i.e. 1 × 1).
The third labeling module comprises 2 convolution blocks, and the 1 st convolution block (i.e. 3 × 3) is respectively connected with the Convolutional Set layer in the third extraction fusion network and with the 2 nd convolution block (i.e. 1 × 1).
The 2 nd convolution blocks of the first labeling module, the second labeling module and the third labeling module output pedestrian recognition initial labeling maps with 13 × 13 × 3, 26 × 26 × 3 and 52 × 52 × 3 candidate frames, respectively.
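A sketch of one labeling module follows; the per-anchor output layout of (5 + number of classes) channels is the standard YOLOV3 convention and is an assumption here, with 20 classes matching the VOC categories used in Example 3.

```python
def labeling_module(x, filters, num_classes=20):
    x = conv_block(x, filters, 3)                                  # 1st (3 x 3) convolution block
    # 2nd (1 x 1) convolution block: 3 candidate frames per cell, each with
    # (tx, ty, tw, th, confidence) plus the class probabilities.
    return tf.keras.layers.Conv2D(3 * (5 + num_classes), 1)(x)
```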
The feature extraction network Darknet-53 contains neither the fully connected layers nor the pooling layers of a basic network; the size of the feature map is changed by adding a plurality of convolution blocks and by performing upsampling operations.
Step S4: inputting a target video image to be detected into the pedestrian recognition model in the abnormal state for pedestrian recognition and labeling, and eliminating redundant prior frames by adopting a non-maximum suppression method to obtain a final pedestrian recognition labeling image, which specifically comprises the following steps:
step S41: and inputting the target video image to be detected into a feature extraction network for feature extraction to obtain initial feature maps with different scales.
Step S42: inputting the initial feature maps of different scales into a multi-scale extraction fusion network for feature extraction and fusion to obtain fusion feature maps of different scales.
Step S43: selecting a prior frame corresponding to each fusion characteristic graph; fused feature maps of different scales correspond to different prior frames.
The invention adopts K-means clustering to obtain 9 prior frames of different scales; the specific dimensions of the 9 boxes are 116 × 90, 156 × 198, 373 × 326, 30 × 61, 62 × 45, 59 × 119, 10 × 13, 16 × 30 and 33 × 23. The image is divided into S × S grids, and when the center of a target falls within a grid, that grid is responsible for detecting 3 anchors (one set for each of the 3 scales), so for each scale the output contains S × S × 3 prior frames.
The prior frames of the 9 sizes are obtained by K-means clustering, the predicted feature maps of 3 sizes (13 × 13, 26 × 26 and 52 × 52) are obtained through the scaling convolution blocks, and 3 prior frames are established for each cell of the fused feature maps of the different sizes. Following the principle that large-scale output feature layers predict small objects and small-scale output feature layers predict large objects, large prior frames (116 × 90, 156 × 198, 373 × 326) are used for the 13 × 13 fused feature map receptive field, medium prior frames (30 × 61, 62 × 45, 59 × 119) for the 26 × 26 receptive field, and small prior frames (10 × 13, 16 × 30, 33 × 23) for the 52 × 52 receptive field.
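A hypothetical sketch of such K-means prior frame clustering is shown below; the 1 − IoU distance is the metric commonly used for YOLO-style anchor clustering and is an assumption, since the patent only states that K-means clustering is used.

```python
# Hypothetical K-means clustering of ground-truth box widths/heights into 9 prior frames.
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """wh: (N, 2) array of ground-truth box widths and heights."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
                np.minimum(wh[:, None, 1], centers[None, :, 1])
        union = wh[:, 0:1] * wh[:, 1:2] + centers[:, 0] * centers[:, 1] - inter
        assign = np.argmax(inter / union, axis=1)          # nearest center = highest IoU
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]       # sorted small -> large

# Example: cluster randomly generated box sizes into 9 prior frames.
boxes_wh = np.abs(np.random.default_rng(1).normal(80, 40, size=(500, 2)))
anchors = kmeans_anchors(boxes_wh)
```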
Step S44: inputting the fused feature maps of different scales into the prior frame labeling network, and labeling the pedestrians in the fused feature maps of different scales according to the prior frame corresponding to each fused feature map, obtaining pedestrian recognition initial labeling maps of different sizes. In this embodiment, 3 prior frames are used on each cell of each fused feature map; for example, 13 × 13 × 3 prior frames are generated for the 13 × 13 fused feature map.
The specific formula for calculating the position information marked by the prior frame is:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}

where (b_x, b_y, b_w, b_h) represents the position information of the prior frame label, t_x and t_y represent the horizontal and vertical translation values of the prior frame prediction coordinates, t_w and t_h represent the scaling factors, b_x and b_y represent the coordinates of the center of the bounding box, b_w and b_h represent the width and height of the bounding box, c_x and c_y represent the coordinates of the upper left corner of the feature map cell to which the center point of the bounding box belongs, and p_w and p_h represent the width and height of the prior box mapped onto the feature map.
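A direct transcription of these decoding formulas for a single cell and prior frame; sigmoid() corresponds to the σ(·) function introduced later in this section.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    bx = sigmoid(tx) + cx          # bounding box center x
    by = sigmoid(ty) + cy          # bounding box center y
    bw = pw * math.exp(tw)         # bounding box width
    bh = ph * math.exp(th)         # bounding box height
    return bx, by, bw, bh
```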
Step S45: and overlapping and fusing the pedestrian identification initial labeling graphs of different sizes to obtain a total fusion characteristic graph.
Step S46: eliminating redundant prior frames on the total fused feature map by a non-maximum suppression method, and outputting the final pedestrian recognition annotation map. Overlapping prior frames and prior frames with too low a confidence are referred to as redundant prior frames. The total fused feature map obtained in step S45 contains a large number of overlapping prior frames and low-confidence prior frames; to improve recognition accuracy, a non-maximum suppression method is therefore adopted to remove the redundant prior frames, the prediction frames are screened and drawn, and the final pedestrian recognition annotation map is predicted and output.
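A minimal NumPy sketch of the non-maximum suppression step, assuming (x1, y1, x2, y2) boxes with one confidence score each; the thresholds are illustrative values, not values stated in the patent.

```python
import numpy as np

def iou(box, boxes):
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    keep = []
    idx = np.argsort(scores)[::-1]
    idx = idx[scores[idx] >= conf_thresh]                        # drop low-confidence prior frames
    while idx.size:
        best = idx[0]
        keep.append(best)
        rest = idx[1:]
        idx = rest[iou(boxes[best], boxes[rest]) < iou_thresh]   # drop overlapping prior frames
    return keep
```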
The framework of the invention uses 9 prior frames of different sizes; feature maps of different sizes establish prior frames of different sizes, which are applied to the target detection of each cell. σ(·) is the sigmoid function:

σ(x) = 1 / (1 + e^{−x})
the function of the method can map the input to the (0, 1) interval, and can effectively ensure that the central coordinate of the bounding box is in the cell responsible for predicting the target.
The pedestrian recognition model in the abnormal state provided by the invention takes Darknet-53 as the feature extraction backbone. The main characteristic of this network is that a large number of easy-to-optimize residual networks (i.e. residual blocks) are added; accuracy is improved by increasing the network depth, the amount of parameter training is reduced, and the problem of vanishing gradients is alleviated.
Example 2
As shown in fig. 4, the present invention also discloses a deep learning-based pedestrian recognition system in an abnormal state, which comprises:
and a training set constructing module 401, configured to construct a sample training set.
An initial recognition model building module 402, configured to build an initial recognition model using the YOLOV3 model based on the deep learning framework.
A training module 403, configured to input the training set into the initial recognition model for training by using a gradient descent algorithm, and use the initial recognition model corresponding to the minimum total loss function as a pedestrian recognition model in an abnormal state; the pedestrian recognition model in the abnormal state comprises: the system comprises a feature extraction network, a multi-scale extraction fusion network and a priori frame labeling network.
And the pedestrian identification and marking module 404 is configured to input the target video image to be detected into the pedestrian identification model in the abnormal state for pedestrian identification and marking, and remove redundant prior frames by using a non-maximum suppression method to obtain a final pedestrian identification marking map.
As an optional implementation manner, the pedestrian identification and marking module 404 of the present invention specifically includes:
The feature extraction unit is used for inputting the target video image to be detected into the feature extraction network for feature extraction to obtain initial feature maps of different scales.
The feature extraction and fusion unit is used for inputting the initial feature maps of different scales into the multi-scale extraction fusion network for feature extraction and fusion to obtain fusion feature maps of different scales.
The selection unit is used for selecting the prior frame corresponding to each fusion feature map.
The marking unit is used for inputting the fusion feature maps of different scales into the prior frame labeling network and marking the pedestrians in the fusion feature maps of different scales according to the prior frame corresponding to each fusion feature map, so as to obtain pedestrian identification initial labeled maps of different sizes.
The superposition and fusion unit is used for superposing and fusing the pedestrian identification initial labeled maps of different sizes to obtain a total fusion feature map.
The elimination unit is used for eliminating redundant prior frames on the total fusion feature map by a non-maximum suppression method and outputting the final pedestrian identification labeled map.
The same parts as those in embodiment 1 will not be described in detail.
Example 3
To verify the generalization performance of the YOLOV3-VOC + APD training model, the model was evaluated on a test data set. The classification categories comprise 20 types of targets, such as people, automobiles and bicycles, and the AP value is adopted for evaluation. The framework achieves higher AP values for large targets such as buses, automobiles and airplanes; in contrast, the model has lower AP values for indoor objects such as potted plants, chairs and tables, owing to weak indoor light. The overall accuracy mAP of YOLOV3-VOC + APD is 0.83, which shows that the framework of the invention has good target detection performance.
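As context for how AP and mAP figures of this kind are typically obtained (a generic sketch under assumed conventions, not the patent's evaluation code; the function names and the simple step-wise integration are illustrative choices), average precision is the area under the precision-recall curve of one class, and mAP is its mean over all classes:

    import numpy as np

    def average_precision(scores, is_true_positive, num_ground_truth):
        """Area under the precision-recall curve for one class.

        scores           : confidence of each detection
        is_true_positive : 1 if the detection matches a ground-truth box, else 0
        num_ground_truth : number of ground-truth objects of this class
        """
        order = np.argsort(scores)[::-1]                 # rank detections by confidence
        tp = np.cumsum(np.asarray(is_true_positive)[order])
        fp = np.cumsum(1 - np.asarray(is_true_positive)[order])
        recall = tp / num_ground_truth
        precision = tp / (tp + fp)
        # simple step-wise integration of precision over recall
        return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]) + recall[0] * precision[0])

    def mean_average_precision(per_class_ap):
        # mAP is the mean of the per-class AP values (e.g. over the 20 VOC classes)
        return float(np.mean(per_class_ap))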
On the other hand, to prove that YOLOV3-VOC + APD can detect pedestrians in an abnormal state, three different methods are implemented for comparison, namely SSD-VOC (i.e. the SSD model trained on the VOC data set), SSD-VOC + APD (i.e. the SSD model trained on the VOC + APD data set), and YOLOV3-VOC (i.e. the YOLOV3 model trained on the VOC data set).
The cases used are quite challenging; four typical non-steady-state cases are listed (i.e. the four scenes shown in fig. 5 to fig. 8), where: (a) is characterized by unusual human behavior, such as lying prone, squatting and other atypical actions; (b) is characterized by human-like objects in the scene interfering with detection; (c) is characterized by camera shake, which blurs part of the pedestrians in the image; (d) is characterized by crowded conditions in which pedestrians severely occlude one another.
Fig. 5 to fig. 8 list the detection results of each model for pedestrians in the above four typical unusual states. FIG. 5 shows the non-steady-state pedestrian recognition results based on YOLOV3-VOC + APD, FIG. 6 those based on YOLOV3-VOC, FIG. 7 those based on SSD-VOC + APD, and FIG. 8 those based on SSD-VOC. For ease of distinction, the square detection boxes represent the detection results, and the circles mark the content missed by the corresponding framework. Regarding pedestrian detection accuracy: the AP value of the YOLOV3-VOC + APD framework is 0.88, that of the YOLOV3-VOC framework is 0.82, that of the SSD-VOC + APD framework is 0.83, and that of the SSD-VOC framework is 0.78, which demonstrates the reliability of the model framework of the invention.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A pedestrian recognition method in an abnormal state based on deep learning is characterized by comprising the following steps:
step S1: constructing a sample training set;
step S2: constructing an initial recognition model by using a YOLOV3 model based on a deep learning framework;
step S3: inputting the training set into the initial recognition model for training by adopting a gradient descent algorithm, and taking the initial recognition model corresponding to the minimum total loss function as a pedestrian recognition model in an abnormal state; the pedestrian recognition model in the abnormal state comprises: a feature extraction network, a multi-scale extraction fusion network and a priori frame labeling network;
step S4: and inputting the target video image to be detected into the pedestrian recognition model in the abnormal state for pedestrian recognition and labeling, and eliminating redundant prior frames by adopting a non-maximum suppression method to obtain a final pedestrian recognition labeling image.
2. The method for identifying pedestrians in an abnormal state based on deep learning of claim 1, wherein the step of inputting the target video image to be detected into the pedestrian identification model in the abnormal state for pedestrian identification and labeling, and eliminating redundant prior frames by using a non-maximum suppression method to obtain a final pedestrian identification label map specifically comprises:
step S41: inputting a target video image to be detected into a feature extraction network for feature extraction to obtain initial feature maps with different scales;
step S42: inputting the initial feature maps of different scales into a multi-scale extraction fusion network for feature extraction and fusion to obtain fusion feature maps of different scales;
step S43: selecting a prior frame corresponding to each fusion characteristic graph;
step S44: inputting the fusion characteristic graphs of different scales into a priori frame marking network, marking pedestrians in the fusion characteristic graphs of different scales according to a priori frame corresponding to each fusion characteristic graph, and obtaining pedestrian identification initial marking graphs of different sizes;
step S45: overlapping and fusing the pedestrian identification initial labeling graphs of different sizes to obtain a total fusion characteristic graph;
step S46: and eliminating redundant prior frames on the total fusion characteristic graph by adopting a non-maximum value inhibition method, and outputting a final pedestrian identification labeled graph.
3. The deep learning-based pedestrian recognition method in an abnormal state according to claim 1, wherein the feature extraction network comprises: a first feature extraction module, a second feature extraction module, a third feature extraction module, a fourth feature extraction module and a fifth feature extraction module, wherein the first feature extraction module is connected with the fifth feature extraction module sequentially through the second feature extraction module, the third feature extraction module and the fourth feature extraction module;
the first feature extraction module comprises 2 convolution blocks and 1 residual block, wherein the 1 st convolution block is connected with the residual block through the 2 nd convolution block; the residual block comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the 2 nd convolution block and the 2 nd convolution block in the first characteristic extraction module; adding the feature graph output by the 2 nd convolution block in the residual block and the feature graph output by the 2 nd convolution block in the first feature extraction module, and inputting the addition result to the second feature extraction module;
the second feature extraction module comprises 1 convolution block and 2 residual blocks, and the convolution block is connected with the 2 nd residual block through the 1 st residual block; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the second feature extraction module, the 1 st convolution block in the 2 nd residual block is connected with the 2 nd convolution block in the 1 st residual block, the feature map output by the 2 nd convolution block in the last residual block is added with the feature map output by the convolution block in the second feature extraction module, and the addition result is input to the third feature extraction module;
the third feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the third feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the third feature extraction module, and the addition result is input into the fourth feature extraction module and the third extraction fusion network;
the fourth feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fourth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fourth feature extraction module, and the addition result is input into the fifth feature extraction module and the second extraction fusion network;
the fifth feature extraction module comprises 1 convolution block and 4 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 4 th residual block sequentially through 2 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; and the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fifth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fifth feature extraction module, and the addition result is input into the first extraction fusion network.
4. The deep learning-based pedestrian recognition method in an abnormal state according to claim 3, wherein the multi-scale extraction fusion network comprises a first extraction fusion network, a second extraction fusion network and a third extraction fusion network; the first extraction fusion network is connected with the fifth feature extraction module, the second extraction fusion network is connected with the fourth feature extraction module, the third extraction fusion network is connected with the third feature extraction module, and the first extraction fusion network is connected with the third extraction fusion network through the second extraction fusion network;
the first extraction fusion network comprises 1 Convolutional Set layer; the Convolutional Set layer is respectively connected with the fifth feature extraction module, the convolution block in the second extraction fusion network and the 1 st convolution block in the first labeling module;
the second extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the second labeling module, the splicing layer, the convolution block in the third extraction fusion network and the fourth feature extraction module;
the third extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the third labeling module, the splicing layer and the third feature extraction module.
5. The deep learning-based pedestrian recognition method in an abnormal state according to claim 4, wherein the prior frame labeling network comprises a first labeling module, a second labeling module and a third labeling module; the first labeling module is connected with the first extraction fusion network, the second labeling module is connected with the second extraction fusion network, and the third labeling module is connected with the third extraction fusion network;
the first labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and the 2 nd convolution block;
the second labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and the 2 nd convolution block;
the third labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the third extraction fusion network and the 2 nd convolution block.
6. A pedestrian recognition system in an abnormal state based on deep learning, the system comprising:
the training set constructing module is used for constructing a sample training set;
the initial identification model building module is used for building an initial identification model by using a YOLOV3 model based on a deep learning framework;
the training module is used for inputting the training set into the initial recognition model for training by adopting a gradient descent algorithm, and taking the initial recognition model corresponding to the minimum total loss function as a pedestrian recognition model in an abnormal state; the pedestrian recognition model in the abnormal state comprises: a feature extraction network, a multi-scale extraction fusion network and a priori frame labeling network;
and the pedestrian identification and marking module is used for inputting the target video image to be detected into the pedestrian identification model in the abnormal state for pedestrian identification and marking, and eliminating redundant prior frames by adopting a non-maximum inhibition method to obtain a final pedestrian identification marking image.
7. The system for identifying pedestrians in an abnormal state based on deep learning of claim 6, wherein the pedestrian identification and marking module specifically comprises:
the characteristic extraction unit is used for inputting a target video image to be detected into a characteristic extraction network for characteristic extraction to obtain initial characteristic graphs of different scales;
the characteristic extraction and fusion unit is used for inputting the initial characteristic graphs of different scales into a multi-scale extraction and fusion network to extract and fuse the characteristics to obtain fusion characteristic graphs of different scales;
the selection unit is used for selecting the prior frame corresponding to each fusion characteristic graph;
the marking unit is used for inputting the fusion characteristic graphs of different scales into the prior frame marking network, marking the pedestrians in the fusion characteristic graphs of different scales according to the prior frame corresponding to each fusion characteristic graph, and obtaining pedestrian identification initial marking graphs of different sizes;
the superposition fusion unit is used for carrying out superposition fusion on pedestrian identification initial labeling graphs of different sizes to obtain a total fusion characteristic graph;
and the eliminating unit is used for eliminating redundant prior frames on the total fusion characteristic graph by adopting a non-maximum value inhibition method and outputting a final pedestrian identification label graph.
8. The deep learning-based pedestrian recognition system in an abnormal state of claim 6, wherein the feature extraction network comprises: a first feature extraction module, a second feature extraction module, a third feature extraction module, a fourth feature extraction module and a fifth feature extraction module, wherein the first feature extraction module is connected with the fifth feature extraction module sequentially through the second feature extraction module, the third feature extraction module and the fourth feature extraction module;
the first feature extraction module comprises 2 convolution blocks and 1 residual block, wherein the 1 st convolution block is connected with the residual block through the 2 nd convolution block; the residual block comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the 2 nd convolution block and the 2 nd convolution block in the first characteristic extraction module; adding the feature graph output by the 2 nd convolution block in the residual block and the feature graph output by the 2 nd convolution block in the first feature extraction module, and inputting the addition result to the second feature extraction module;
the second feature extraction module comprises 1 convolution block and 2 residual blocks, and the convolution block is connected with the 2 nd residual block through the 1 st residual block; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the second feature extraction module, the 1 st convolution block in the 2 nd residual block is connected with the 2 nd convolution block in the 1 st residual block, the feature map output by the 2 nd convolution block in the last residual block is added with the feature map output by the convolution block in the second feature extraction module, and the addition result is input to the third feature extraction module;
the third feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the third feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the third feature extraction module, and the addition result is input into the fourth feature extraction module and the third extraction fusion network;
the fourth feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fourth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fourth feature extraction module, and the addition result is input into the fifth feature extraction module and the second extraction fusion network;
the fifth feature extraction module comprises 1 convolution block and 4 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 4 th residual block sequentially through 2 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; and the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fifth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fifth feature extraction module, and the addition result is input into the first extraction fusion network.
9. The deep learning-based pedestrian recognition system in an abnormal state of claim 8, wherein the multi-scale extraction fusion network comprises a first extraction fusion network, a second extraction fusion network, and a third extraction fusion network; the first extraction fusion network is connected with the fifth feature extraction module, the second extraction fusion network is connected with the fourth feature extraction module, the third extraction fusion network is connected with the third feature extraction module, and the first extraction fusion network is connected with the third extraction fusion network through the second extraction fusion network;
the first extraction fusion network comprises 1 Convolutional Set layer; the Convolutional Set layer is respectively connected with the fifth feature extraction module, the convolution block in the second extraction fusion network and the 1 st convolution block in the first labeling module;
the second extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the second labeling module, the splicing layer, the convolution block in the third extraction fusion network and the fourth feature extraction module;
the third extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the third labeling module, the splicing layer and the third feature extraction module.
10. The deep learning-based pedestrian recognition system in an abnormal state of claim 9, wherein the prior frame labeling network comprises a first labeling module, a second labeling module, and a third labeling module; the first labeling module is connected with the first extraction fusion network, the second labeling module is connected with the second extraction fusion network, and the third labeling module is connected with the third extraction fusion network;
the first labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and the 2 nd convolution block;
the second labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and the 2 nd convolution block;
the third labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the third extraction fusion network and the 2 nd convolution block.
CN202111471511.8A 2021-12-06 2021-12-06 Method and system for identifying pedestrian in abnormal state based on deep learning Pending CN113901962A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111471511.8A CN113901962A (en) 2021-12-06 2021-12-06 Method and system for identifying pedestrian in abnormal state based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111471511.8A CN113901962A (en) 2021-12-06 2021-12-06 Method and system for identifying pedestrian in abnormal state based on deep learning

Publications (1)

Publication Number Publication Date
CN113901962A true CN113901962A (en) 2022-01-07

Family

ID=79195281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111471511.8A Pending CN113901962A (en) 2021-12-06 2021-12-06 Method and system for identifying pedestrian in abnormal state based on deep learning

Country Status (1)

Country Link
CN (1) CN113901962A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275010A (en) * 2020-02-25 2020-06-12 福建师范大学 Pedestrian re-identification method based on computer vision
CN111444809A (en) * 2020-03-23 2020-07-24 华南理工大学 Power transmission line abnormal target detection method based on improved YO L Ov3
CN111461039A (en) * 2020-04-07 2020-07-28 电子科技大学 Landmark identification method based on multi-scale feature fusion
CN111723863A (en) * 2020-06-19 2020-09-29 中国农业科学院农业信息研究所 Fruit tree flower identification and position acquisition method and device, computer equipment and storage medium
CN112529065A (en) * 2020-12-04 2021-03-19 浙江工业大学 Target detection method based on feature alignment and key point auxiliary excitation
CN112949508A (en) * 2021-03-08 2021-06-11 咪咕文化科技有限公司 Model training method, pedestrian detection method, electronic device and readable storage medium
US20210370993A1 (en) * 2020-05-27 2021-12-02 University Of South Carolina Computer vision based real-time pixel-level railroad track components detection system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
He Weixin et al.: "Research on a target recognition method for ammunition depots based on the improved YOLOV3 algorithm", Modern Electronics Technique *
Wu Weihao et al.: "Defect detection of electrical connectors based on improved Yolov3", Chinese Journal of Sensors and Actuators *
Yang Bo et al.: "Bird's nest detection in power line inspection with an improved real-time object detection algorithm", Electrical Engineering *
Jiang Yi et al.: "Application of deep transfer learning in the detection of Ageratina adenophora (Crofton weed)", Computer Systems & Applications *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination