CN113901962A - Method and system for identifying pedestrian in abnormal state based on deep learning

Info

Publication number: CN113901962A
Application number: CN202111471511.8A
Authority: CN (China)
Prior art keywords: block, convolution, convolution block, feature extraction, residual
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 李之红, 张晶, 王子男, 高秀丽
Current Assignee: Beijing University of Civil Engineering and Architecture
Original Assignee: Beijing University of Civil Engineering and Architecture
Application filed by Beijing University of Civil Engineering and Architecture

Classifications

    • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/253 — Pattern recognition; fusion techniques of extracted features
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/08 — Neural networks; learning methods

Abstract

The invention provides a deep learning-based method and system for recognizing pedestrians in an abnormal state. First, an initial recognition model is constructed with the YOLOV3 model on a deep learning framework; then a training set is input into the initial recognition model for training with a gradient descent algorithm, and the initial recognition model corresponding to the minimum total loss function is taken as the pedestrian recognition model in the abnormal state; finally, the target video image to be detected is input into the pedestrian recognition model in the abnormal state for pedestrian recognition and labeling, and redundant prior frames are eliminated by a non-maximum suppression method to obtain the final pedestrian recognition annotation image. Because the pedestrian recognition model in the abnormal state is constructed based on deep learning and redundant prior frames are then removed by non-maximum suppression, the pedestrian recognition accuracy in the abnormal state is improved.

Description

Method and system for identifying pedestrian in abnormal state based on deep learning
Technical Field
The invention relates to the technical field of image processing, in particular to a pedestrian identification method and system in an abnormal state based on deep learning.
Background
Object detection is an important branch of image processing and computer vision, is a core part of intelligent monitoring systems, and plays an important role in subsequent tasks such as target tracking and trajectory prediction. With the development of object detection technology, intelligent video monitoring can be applied widely. However, video monitoring is easily affected by external conditions such as background and lighting, and compared with pedestrian detection in a normal state, pedestrian detection in an abnormal environment is a very complex problem: pedestrians occlude one another, small targets are numerous, and actions are random, so the recognition accuracy is low.
Disclosure of Invention
The invention aims to provide a pedestrian recognition method and system in an abnormal state based on deep learning so as to improve the pedestrian recognition accuracy.
In order to achieve the above object, the present invention provides a method for identifying a pedestrian in an abnormal state based on deep learning, the method comprising:
step S1: constructing a sample training set;
step S2: constructing an initial recognition model by using a YOLOV3 model based on a deep learning framework;
step S3: inputting the training set into the initial recognition model for training by adopting a gradient descent algorithm, and taking the initial recognition model corresponding to the minimum total loss function as a pedestrian recognition model in an abnormal state; the pedestrian recognition model in the abnormal state comprises: a feature extraction network, a multi-scale extraction fusion network and a priori frame labeling network;
step S4: and inputting the target video image to be detected into the pedestrian recognition model in the abnormal state for pedestrian recognition and labeling, and eliminating redundant prior frames by adopting a non-maximum suppression method to obtain a final pedestrian recognition labeling image.
Optionally, the inputting the target video image to be detected into the pedestrian recognition model in the abnormal state for pedestrian recognition and labeling, and eliminating redundant prior frames by using a non-maximum suppression method to obtain a final pedestrian recognition labeling diagram specifically includes:
step S41: inputting a target video image to be detected into a feature extraction network for feature extraction to obtain initial feature maps with different scales;
step S42: inputting the initial feature maps of different scales into a multi-scale extraction fusion network for feature extraction and fusion to obtain fusion feature maps of different scales;
step S43: selecting a prior frame corresponding to each fusion characteristic graph;
step S44: inputting the fusion characteristic graphs of different scales into a priori frame marking network, marking pedestrians in the fusion characteristic graphs of different scales according to a priori frame corresponding to each fusion characteristic graph, and obtaining pedestrian identification initial marking graphs of different sizes;
step S45: overlapping and fusing the pedestrian identification initial labeling graphs of different sizes to obtain a total fusion characteristic graph;
step S46: and eliminating redundant prior frames on the total fusion characteristic graph by adopting a non-maximum value inhibition method, and outputting a final pedestrian identification labeled graph.
Optionally, the feature extraction network comprises a first feature extraction module, a second feature extraction module, a third feature extraction module, a fourth feature extraction module and a fifth feature extraction module, wherein the first feature extraction module is connected with the fifth feature extraction module sequentially through the second feature extraction module, the third feature extraction module and the fourth feature extraction module;
the first feature extraction module comprises 2 convolution blocks and 1 residual block, wherein the 1 st convolution block is connected with the residual block through the 2 nd convolution block; the residual block comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the 2 nd convolution block and the 2 nd convolution block in the first characteristic extraction module; adding the feature graph output by the 2 nd convolution block in the residual block and the feature graph output by the 2 nd convolution block in the first feature extraction module, and inputting the addition result to the second feature extraction module;
the second feature extraction module comprises 1 convolution block and 2 residual blocks, and the convolution blocks are connected through the 1 st residual block and the 2 nd residual block; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the 1 st convolution block in the second feature extraction module, the 1 st convolution block in the 2 nd residual block is connected with the 2 nd convolution block in the 1 st residual block, the feature map output by the 2 nd convolution block in the last residual block is added with the feature map output by the convolution block in the second feature extraction module, and the addition result is input to the third feature extraction module;
the third feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the third feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the third feature extraction module, and the addition result is input into the fourth feature extraction module and the third extraction fusion network;
the fourth feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fourth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fourth feature extraction module, and the addition result is input into the fifth feature extraction module and the second extraction fusion network;
the fifth feature extraction module comprises 1 convolution block and 4 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 4 th residual block sequentially through 2 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; and the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fifth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fifth feature extraction module, and the addition result is input into the first extraction fusion network.
Optionally, the multi-scale extraction fusion network includes a first extraction fusion network, a second extraction fusion network, and a third extraction fusion network; the first extraction fusion network is connected with the fifth feature extraction module, the second extraction fusion network is connected with the fourth feature extraction module, the third extraction fusion network is connected with the third feature extraction module, and the first extraction fusion network is connected with the third extraction fusion network through the second extraction fusion network;
the first extraction fusion network comprises 1 Convolutional Set layer; the Convolutional Set layer is respectively connected with the fifth feature extraction module, the convolution block in the second extraction fusion network and the 1 st convolution block in the first labeling module;
the second extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and with the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the second labeling module, the splicing layer, the convolution block in the third extraction fusion network and the fourth feature extraction module;
the third extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and with the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the third labeling module, the splicing layer and the third feature extraction module.
Optionally, the prior frame labeling network includes a first labeling module, a second labeling module, and a third labeling module; the first labeling module is connected with the first extraction fusion network, the second labeling module is connected with the second extraction fusion network, and the third labeling module is connected with the third extraction fusion network;
the first labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and with the 2 nd convolution block;
the second labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and with the 2 nd convolution block;
the third labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the third extraction fusion network and with the 2 nd convolution block.
The invention also provides a deep learning-based pedestrian recognition system in an abnormal state, which comprises:
the training set constructing module is used for constructing a sample training set;
the initial identification model building module is used for building an initial identification model by using a YOLOV3 model based on a deep learning framework;
the training module is used for inputting the training set into the initial recognition model for training by adopting a gradient descent algorithm, and taking the initial recognition model corresponding to the minimum total loss function as a pedestrian recognition model in an abnormal state; the pedestrian recognition model in the abnormal state comprises: a feature extraction network, a multi-scale extraction fusion network and a priori frame labeling network;
and the pedestrian identification and marking module is used for inputting the target video image to be detected into the pedestrian identification model in the abnormal state for pedestrian identification and marking, and eliminating redundant prior frames by adopting a non-maximum inhibition method to obtain a final pedestrian identification marking image.
Optionally, the pedestrian identification and marking module specifically includes:
the characteristic extraction unit is used for inputting a target video image to be detected into a characteristic extraction network for characteristic extraction to obtain initial characteristic graphs of different scales;
the characteristic extraction and fusion unit is used for inputting the initial characteristic graphs of different scales into a multi-scale extraction and fusion network to extract and fuse the characteristics to obtain fusion characteristic graphs of different scales;
the selection unit is used for selecting the prior frame corresponding to each fusion characteristic graph;
the marking unit is used for inputting the fusion characteristic graphs of different scales into the prior frame marking network, marking the pedestrians in the fusion characteristic graphs of different scales according to the prior frame corresponding to each fusion characteristic graph, and obtaining pedestrian identification initial marking graphs of different sizes;
the superposition fusion unit is used for carrying out superposition fusion on pedestrian identification initial labeling graphs of different sizes to obtain a total fusion characteristic graph;
and the eliminating unit is used for eliminating redundant prior frames on the total fusion characteristic graph by adopting a non-maximum value inhibition method and outputting a final pedestrian identification label graph.
Optionally, the feature extraction network comprises a first feature extraction module, a second feature extraction module, a third feature extraction module, a fourth feature extraction module and a fifth feature extraction module, wherein the first feature extraction module is connected with the fifth feature extraction module sequentially through the second feature extraction module, the third feature extraction module and the fourth feature extraction module;
the first feature extraction module comprises 2 convolution blocks and 1 residual block, wherein the 1 st convolution block is connected with the residual block through the 2 nd convolution block; the residual block comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the 2 nd convolution block and the 2 nd convolution block in the first characteristic extraction module; adding the feature graph output by the 2 nd convolution block in the residual block and the feature graph output by the 2 nd convolution block in the first feature extraction module, and inputting the addition result to the second feature extraction module;
the second feature extraction module comprises 1 convolution block and 2 residual blocks, and the convolution blocks are connected through the 1 st residual block and the 2 nd residual block; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the 1 st convolution block in the second feature extraction module, the 1 st convolution block in the 2 nd residual block is connected with the 2 nd convolution block in the 1 st residual block, the feature map output by the 2 nd convolution block in the last residual block is added with the feature map output by the convolution block in the second feature extraction module, and the addition result is input to the third feature extraction module;
the third feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the third feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the third feature extraction module, and the addition result is input into the fourth feature extraction module and the third extraction fusion network;
the fourth feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fourth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fourth feature extraction module, and the addition result is input into the fifth feature extraction module and the second extraction fusion network;
the fifth feature extraction module comprises 1 convolution block and 4 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 4 th residual block sequentially through 2 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; and the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fifth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fifth feature extraction module, and the addition result is input into the first extraction fusion network.
Optionally, the multi-scale extraction fusion network includes a first extraction fusion network, a second extraction fusion network, and a third extraction fusion network; the first extraction fusion network is connected with the fifth feature extraction module, the second extraction fusion network is connected with the fourth feature extraction module, the third extraction fusion network is connected with the third feature extraction module, and the first extraction fusion network is connected with the third extraction fusion network through the second extraction fusion network;
the first extraction fusion network comprises 1 Convolutional Set layer; the Convolutional Set layer is respectively connected with the fifth feature extraction module, the convolution block in the second extraction fusion network and the 1 st convolution block in the first labeling module;
the second extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and with the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the second labeling module, the splicing layer, the convolution block in the third extraction fusion network and the fourth feature extraction module;
the third extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and with the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the third labeling module, the splicing layer and the third feature extraction module.
Optionally, the prior frame labeling network includes a first labeling module, a second labeling module, and a third labeling module; the first labeling module is connected with the first extraction fusion network, the second labeling module is connected with the second extraction fusion network, and the third labeling module is connected with the third extraction fusion network;
the first labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and with the 2 nd convolution block;
the second labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and with the 2 nd convolution block;
the third labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the third extraction fusion network and with the 2 nd convolution block.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a pedestrian recognition method and system under abnormal conditions based on deep learning, which comprises the steps of firstly, constructing an initial recognition model by utilizing a YOLOV3 model based on a deep learning framework; then, inputting the training set into the initial recognition model for training by adopting a gradient descent algorithm, and taking the initial recognition model corresponding to the minimum total loss function as a pedestrian recognition model in an abnormal state; and finally, inputting the target video image to be detected into the pedestrian recognition model in the abnormal state for pedestrian recognition and labeling, and eliminating redundant prior frames by adopting a non-maximum suppression method to obtain a final pedestrian recognition labeling image. According to the method, the pedestrian recognition model under the abnormal state is constructed based on deep learning, and then redundant priori frames are removed by adopting a non-maximum suppression method, so that the pedestrian recognition precision under the abnormal state is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of a pedestrian recognition method based on deep learning in an abnormal state according to the present invention;
FIG. 2 is a schematic diagram of a pedestrian recognition model in an abnormal state according to the present invention;
FIG. 3 is an example of a VOC + APD data set sample of the present invention;
FIG. 4 is a diagram of a pedestrian recognition system based on deep learning in an abnormal state according to the present invention;
FIG. 5 shows the abnormal-state pedestrian recognition results of the present invention based on YOLOV3-VOC + APD;
FIG. 6 shows the abnormal-state pedestrian recognition results based on YOLOV3-VOC;
FIG. 7 shows the abnormal-state pedestrian recognition results based on SSD-VOC + APD;
FIG. 8 shows the abnormal-state pedestrian recognition results based on SSD-VOC.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a pedestrian recognition method and system in an abnormal state based on deep learning so as to improve the pedestrian recognition accuracy.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
With the rise of deep learning, artificial intelligence plays an important role in more and more fields such as image recognition and natural language processing. With the support of big data, deep neural networks perform remarkably in image recognition and object detection, and the deep convolutional neural network is the most representative network structure. This structure has two characteristics: (1) local receptive fields: each neuron is connected only to the adjacent neurons of the previous layer, and more abstract and essential global features are obtained by extracting local features; (2) weight sharing: when the same convolution kernel operates on different local receptive fields, the specific position of a feature in the picture need not be considered specially and the same weight parameters are used, which greatly reduces the amount of network parameter computation.
Example 1
As shown in fig. 1, the invention discloses a pedestrian recognition method in an abnormal state based on deep learning, which comprises the following steps:
step S1: and constructing a sample training set.
Step S2: based on a deep learning framework, an initial recognition model is constructed by using a Yolov3 model.
Step S3: inputting the training set into the initial recognition model for training by adopting a gradient descent algorithm, and taking the initial recognition model corresponding to the minimum total loss function as a pedestrian recognition model in an abnormal state; the pedestrian recognition model in the abnormal state comprises: the system comprises a feature extraction network, a multi-scale extraction fusion network and a priori frame labeling network.
Step S4: and inputting the target video image to be detected into the pedestrian recognition model in the abnormal state for pedestrian recognition and labeling, and eliminating redundant prior frames by adopting a non-maximum suppression method to obtain a final pedestrian recognition labeling image.
The individual steps are discussed in detail below:
step S1: constructing a sample training set, which specifically comprises the following steps:
step S11: producing a pedestrian data set in an abnormal state (APD data set for short) by web crawling and image labeling techniques. That is, aiming at the current lack of training data sets for pedestrians in abnormal states, 2500 sample images of abnormal-state pedestrian data are obtained through web crawler technology, and the APD data set is produced with the target detection annotation software LabelImg, thereby enriching the data set.
Step S12: and constructing a VOC + APD data set based on the APD data set and the VOC2007 data set.
Step S13: constructing a training set and a test set based on the VOC + APD data set; the training set contains 5011 sample images from the VOC data set and 1500 sample images from the APD data set; the test set contains 4952 VOC sample images and 1000 APD sample images.
Step S2: based on the deep learning framework TensorFlow, constructing an initial recognition model by using the YOLOV3 model. In this embodiment, the TensorFlow 1.2 deep learning framework is used with Python 3.6.0 in an Anaconda environment. The hardware configuration is: AMD Ryzen 9 3950X 16-Core Processor, 16 GB of memory, and an NVIDIA Quadro P620 GPU.
Step S3: inputting the training set into the initial recognition model for training by adopting a gradient descent algorithm, and taking the initial recognition model corresponding to the minimum total loss function as the pedestrian recognition model in the abnormal state. In this embodiment, the hyper-parameters of the initial recognition model are set as follows: the learning rate is 0.001, the learning rate decay is 0.94, the number of iterations is 3000, the GPU occupancy is 0.85 (the better the hardware, the higher this value can be set), the gradient-based Adam algorithm is adopted as the optimizer, and the batch size is 50.
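The following is a minimal, hedged sketch of this training configuration in TensorFlow 1.x style (matching the TensorFlow 1.2 framework of the embodiment); the model builder, the total loss and the data feeder are assumed placeholders, not the patent's own code.

```python
# Hedged sketch of the training setup above (TensorFlow 1.x style API).
# build_yolov3_model, total_loss and next_batch are assumed placeholders.
import tensorflow as tf

images = tf.placeholder(tf.float32, [None, 416, 416, 3])  # 416 x 416 input assumed from the 13/26/52 scales
labels = tf.placeholder(tf.float32, [None, None, 6])      # assumed label layout (box + class)

global_step = tf.Variable(0, trainable=False)
# Learning rate 0.001 with decay rate 0.94 (decay_steps is an assumption).
learning_rate = tf.train.exponential_decay(0.001, global_step,
                                           decay_steps=100, decay_rate=0.94)
loss = total_loss(build_yolov3_model(images), labels)     # total loss L = L1 + L2 + Lloc
train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)

# GPU occupancy 0.85, batch size 50, 3000 iterations, as in the embodiment.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.85)
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(3000):
        batch_images, batch_labels = next_batch(50)        # assumed data feeder
        sess.run(train_op, feed_dict={images: batch_images, labels: batch_labels})
```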
The losses over the samples used for one update of the network parameters are summed, and whether the model has converged is judged by observing how the total loss function value changes with the number of iterations; the smaller the total loss value, the better the model is trained, and the model with the lowest training loss value is output as the final pedestrian recognition model in the abnormal state.
In this embodiment, the total loss function includes a confidence loss function, a category probability loss function, and a target frame loss function, and the total loss function formula is calculated as follows:
L = L1 + L2 + Lloc
The class probability loss function L1 and the confidence loss function L2 are both expressed as binary cross entropy; the concrete formula is:

L_conf = -(1/n) · Σ_{i=1}^{n} [ ŷ_i · log(y_i) + (1 − ŷ_i) · log(1 − y_i) ]

where the index i represents the index of the sample, n represents the number of samples required for one update of all the network parameters, ŷ_i is the label value of the sample, y_i is the predicted value of the network, and L_conf represents the binary cross entropy.
L_loc = 1 − v/u + (A_c − u)/A_c

where L_loc represents the target box loss function, v represents the intersection area of the target real box and the predicted bounding box, u represents the area of the union of the target real box and the predicted bounding box, and A_c represents the area of the smallest rectangle surrounding the predicted bounding box and the target real box.
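As an illustration of the reconstructed loss terms, the following NumPy sketch computes the binary cross entropy and the target box loss for a toy example; the (x1, y1, x2, y2) box format and the example values are assumptions for illustration only, not taken from the patent.

```python
# Minimal NumPy sketch of the loss terms L1/L2 (binary cross entropy) and Lloc.
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """L1 / L2: mean binary cross entropy over the n samples of one update."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

def target_box_loss(box_true, box_pred):
    """Lloc = 1 - v/u + (Ac - u)/Ac for one ground-truth / predicted box pair."""
    ix1 = max(box_true[0], box_pred[0]); iy1 = max(box_true[1], box_pred[1])
    ix2 = min(box_true[2], box_pred[2]); iy2 = min(box_true[3], box_pred[3])
    v = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)                  # intersection area v
    area_t = (box_true[2] - box_true[0]) * (box_true[3] - box_true[1])
    area_p = (box_pred[2] - box_pred[0]) * (box_pred[3] - box_pred[1])
    u = area_t + area_p - v                                        # union area u
    cx1 = min(box_true[0], box_pred[0]); cy1 = min(box_true[1], box_pred[1])
    cx2 = max(box_true[2], box_pred[2]); cy2 = max(box_true[3], box_pred[3])
    a_c = (cx2 - cx1) * (cy2 - cy1)                                # enclosing rectangle area Ac
    return 1.0 - v / u + (a_c - u) / a_c

# Total loss L = L1 + L2 + Lloc for a toy example.
L = (binary_cross_entropy(np.array([1.0]), np.array([0.8]))       # class probability loss L1
     + binary_cross_entropy(np.array([1.0]), np.array([0.9]))     # confidence loss L2
     + target_box_loss([10, 10, 50, 80], [12, 14, 48, 82]))       # target box loss Lloc
```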
The pedestrian recognition model in the abnormal state is verified with the test set. The confidence loss function tends to be stable after about 500 parameter updates, with a sudden spike of 657 at around 1000 updates; the target frame loss floats steadily around a mean of 4.32; the class probability loss stabilizes after about 500 parameter updates; the total loss stabilizes after about 500 parameter updates and rises at about 1000 updates, which results from the contribution of the confidence loss to the total loss. In conclusion, the training meets the convergence requirement and the training effect is good.
As shown in fig. 2, the pedestrian recognition model in the abnormal state of the present invention includes: a feature extraction network Darknet-53, a multi-scale extraction fusion network and a prior frame labeling network. Darknet-53 comprises: a first feature extraction module, a second feature extraction module, a third feature extraction module, a fourth feature extraction module and a fifth feature extraction module, wherein the first feature extraction module is connected with the fifth feature extraction module sequentially through the second feature extraction module, the third feature extraction module and the fourth feature extraction module. In the following embodiments, each convolution block is the combination of a convolution layer (conv), a normalization layer (BN) and a Leaky ReLU layer; the excitation function is:

f(x) = x,      if x ≥ 0
f(x) = α · x,  if x < 0

where f(·) represents the excitation function and α is a small positive leak coefficient (typically 0.1 in Darknet implementations).
The first feature extraction module comprises 2 convolution blocks and 1 residual block, wherein the 1 st convolution block is connected with the residual block through the 2 nd convolution block; wherein, the convolution layer in the 1 st convolution block comprises 32 convolution kernels of 3 × 3, and the size of the output feature map is 256 × 256; the convolution layer in the 2 nd convolution block includes 64 convolution kernels of 3 × 3/2, and the size of the output feature map is 128 × 128; the residual block comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the 2 nd convolution block and the 2 nd convolution block in the first characteristic extraction module; the convolution layer in the 1 st convolution block includes 32 convolution kernels of 1 x 1, the convolution layer in the 2 nd convolution block includes 64 convolution kernels of 3 x 3, and the size of the residual block output feature map is 128 x 128; and adding the feature graph output by the 2 nd convolution block in the residual block and the feature graph output by the 2 nd convolution block in the first feature extraction module, and inputting the addition result to the second feature extraction module.
The second feature extraction module comprises 1 convolution block and 2 residual blocks, and the convolution blocks are connected through the 1 st residual block and the 2 nd residual block; wherein, the convolution layer in the convolution block comprises 128 convolution kernels of 3 × 3/2, and the size of the output feature map is 64 × 64; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the 1 st convolution block in the second feature extraction module, and the 1 st convolution block in the 2 nd residual block is connected with the 2 nd convolution block in the 1 st residual block; the convolutional layer in the 1 st convolutional block comprises 64 convolution kernels of 1 x 1, the convolutional layer in the 2 nd convolutional block comprises 128 convolution kernels of 3 x 3, and the size of the last residual block output feature map is 64 x 64; and adding the feature graph output by the 2 nd convolution block in the last residual block with the feature graph output by the convolution block in the second feature extraction module, and inputting the addition result to the third feature extraction module.
The third feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; wherein the convolution layers in the convolution block comprise 256 convolution kernels of 3 × 3/2, and the size of the output feature map is 32 × 32; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the third feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the convolution layer in the 1 st convolution block comprises 128 convolution kernels of 1 × 1, the convolution layer in the 2 nd convolution block comprises 256 convolution kernels of 3 × 3, and the size of the feature map output by the last residual block is 32 × 32; and adding the feature graph output by the 2 nd convolution block in the last residual block with the feature graph output by the convolution block in the third feature extraction module, and inputting the addition result to the fourth feature extraction module and the third extraction fusion network.
The fourth feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; wherein the convolution layers in the convolution block comprise 512 convolution kernels of 3 × 3/2, and the size of the output feature map is 16 × 16; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fourth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the convolution layer in the 1 st convolution block comprises 256 convolution kernels with 1 x 1 and the convolution layer in the 2 nd convolution block comprises 512 convolution kernels with 3 x 3, and the size of the feature map output by the last residual block is 16 x 16; and adding the feature graph output by the 2 nd convolution block in the last residual block with the feature graph output by the convolution block in the fourth feature extraction module, and inputting the addition result to the fifth feature extraction module and the second extraction fusion network.
The fifth feature extraction module comprises 1 convolution block and 4 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 4 th residual block through 2 residual blocks at a time; wherein the convolutional layers in the convolutional block comprise 1024 convolution kernels of 3 × 3/2, and the size of the output feature map is 8 × 8; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fifth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the convolution layer in the 1 st convolution block comprises 512 convolution kernels of 1 x 1, the convolution layer in the 2 nd convolution block comprises 1024 convolution kernels of 3 x 3, and the size of the feature map output by the last residual block is 8 x 8; and adding the feature graph output by the 2 nd convolution block in the last residual block with the feature graph output by the convolution block in the fifth feature extraction module, and inputting the addition result into the first extraction fusion network.
The third, fourth and fifth feature extraction modules output initial feature maps at three different scales: 52 × 52 × 256, 26 × 26 × 512 and 13 × 13 × 1024, respectively.
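Building on the conv_block() sketch above, one residual block and one feature extraction module can be sketched as follows; the function names are illustrative, and the first module additionally starts with a 32-kernel 3 × 3 convolution block as described above.

```python
def residual_block(x, filters):
    shortcut = x
    x = conv_block(x, filters // 2, 1)              # 1 x 1 convolution block
    x = conv_block(x, filters, 3)                   # 3 x 3 convolution block
    return tf.keras.layers.Add()([shortcut, x])     # add the two feature maps

def feature_extraction_module(x, filters, num_residual_blocks):
    x = conv_block(x, filters, 3, strides=2)        # 3 x 3/2 down-sampling convolution block
    for _ in range(num_residual_blocks):
        x = residual_block(x, filters)
    return x

# The five modules use 1, 2, 8, 8 and 4 residual blocks respectively; the outputs of
# the third, fourth and fifth modules feed the multi-scale extraction fusion network.
```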
The second feature extraction module, the third feature extraction module, the fourth feature extraction module and the fifth feature extraction module each comprise 1 convolution block, and the size change of the feature map after one convolution is:

H_{i+1} = (H_i − k + 2p) / s + 1,    W_{i+1} = (W_i − k + 2p) / s + 1

where H_{i+1} × W_{i+1} represents the size of the image output by the convolution block in each feature extraction module, H_i × W_i represents the size of the image input to that convolution block, k × k represents the size of the convolution kernel, p represents the edge padding, s represents the stride, and i represents the layer index in the convolutional neural network.
The invention realizes down-scaling through convolution kernels with stride s = 2; within a convolution block the network adopts the 'same' edge padding mode, i.e. the size of the output feature map is unchanged, and the number of channels of the feature map is determined by the number of convolution kernels.
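For concreteness, a small helper implementing the size formula above (integer division is an assumption for non-divisible cases):

```python
# Feature map size after one convolution: (H - k + 2p) / s + 1 in each dimension.
def conv_output_size(h, w, k, p, s):
    return (h - k + 2 * p) // s + 1, (w - k + 2 * p) // s + 1

assert conv_output_size(256, 256, k=3, p=1, s=2) == (128, 128)  # 3 x 3/2 halves the size
assert conv_output_size(256, 256, k=3, p=1, s=1) == (256, 256)  # "same" padding keeps the size
```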
The multi-scale extraction fusion network comprises a first extraction fusion network, a second extraction fusion network and a third extraction fusion network; the first extraction fusion network is connected with the fifth feature extraction module, the second extraction fusion network is connected with the fourth feature extraction module, the third extraction fusion network is connected with the third feature extraction module, and the first extraction fusion network is connected with the third extraction fusion network through the second extraction fusion network.
The first extraction fusion network comprises 1 Convolutional Set layer; the Convolutional Set layer is respectively connected with the fifth feature extraction module, the convolution block in the second extraction fusion network and the 1 st convolution block (i.e. 3 × 3) in the first labeling module.
The second extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer (UP Sampling), 1 splicing layer (Concatenate) and 1 convolution block; the convolution block (i.e. 1 × 1) is respectively connected with the Convolutional Set layer in the first extraction fusion network and with the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block (i.e. 3 × 3) in the second labeling module, the splicing layer, the convolution block (i.e. 1 × 1) in the third extraction fusion network and the fourth feature extraction module.
The third extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer (UP Sampling), 1 splicing layer (Concatenate) and 1 convolution block; the convolution block (i.e. 1 × 1) is respectively connected with the Convolutional Set layer in the second extraction fusion network and with the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block (i.e. 3 × 3) in the third labeling module, the splicing layer and the third feature extraction module.
The Convolutional Set layers in the first extraction fusion network, the second extraction fusion network and the third extraction fusion network output fusion feature maps of 13 × 13 × 512, 26 × 26 × 256 and 52 × 52 × 128, respectively.
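The wiring described above can be sketched as follows, again with tf.keras and the conv_block() helper; the internal composition of the Convolutional Set as five alternating 1 × 1 / 3 × 3 convolution blocks follows the standard YOLOV3 design and is an assumption here.

```python
def convolutional_set(x, filters):
    # Convolutional Set: alternating 1 x 1 and 3 x 3 convolution blocks (assumed layout).
    for k in (1, 3, 1, 3, 1):
        x = conv_block(x, filters if k == 1 else filters * 2, k)
    return x

def fusion_networks(feat5, feat4, feat3):
    set1 = convolutional_set(feat5, 512)                  # first extraction fusion network

    x = conv_block(set1, 256, 1)                          # 1 x 1 convolution block
    x = tf.keras.layers.UpSampling2D(2)(x)                # upsampling layer
    x = tf.keras.layers.Concatenate()([x, feat4])         # splicing layer
    set2 = convolutional_set(x, 256)                      # second extraction fusion network

    x = conv_block(set2, 128, 1)
    x = tf.keras.layers.UpSampling2D(2)(x)
    x = tf.keras.layers.Concatenate()([x, feat3])
    set3 = convolutional_set(x, 128)                      # third extraction fusion network
    return set1, set2, set3                               # 13x13x512, 26x26x256, 52x52x128
```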
The priori frame labeling network comprises a first labeling module, a second labeling module and a third labeling module; the first labeling module is connected with the first extraction fusion network, the second labeling module is connected with the second extraction fusion network, and the third labeling module is connected with the third extraction fusion network.
The first labeling module comprises 2 convolution blocks, and the 1 st convolution block (i.e. 3 × 3) is respectively connected with the Convolutional Set layer in the first extraction fusion network and with the 2 nd convolution block (i.e. 1 × 1).
The second labeling module comprises 2 convolution blocks, and the 1 st convolution block (i.e. 3 × 3) is respectively connected with the Convolutional Set layer in the second extraction fusion network and with the 2 nd convolution block (i.e. 1 × 1).
The third labeling module comprises 2 convolution blocks, and the 1 st convolution block (i.e. 3 × 3) is respectively connected with the Convolutional Set layer in the third extraction fusion network and with the 2 nd convolution block (i.e. 1 × 1).
The 2 nd convolution blocks of the first labeling module, the second labeling module and the third labeling module output pedestrian recognition initial labeling maps with 13 × 13 × 3, 26 × 26 × 3 and 52 × 52 × 3 candidate frames, respectively.
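A sketch of one labeling module follows; the per-anchor output layout of (5 + number of classes) channels is the standard YOLOV3 convention and is an assumption here, with 20 classes matching the VOC categories used in Example 3.

```python
def labeling_module(x, filters, num_classes=20):
    x = conv_block(x, filters, 3)                                  # 1st (3 x 3) convolution block
    # 2nd (1 x 1) convolution block: 3 candidate frames per cell, each with
    # (tx, ty, tw, th, confidence) plus the class probabilities.
    return tf.keras.layers.Conv2D(3 * (5 + num_classes), 1)(x)
```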
The feature extraction network Darknet-53 contains neither the fully connected layers nor the pooling layers of a basic network; the size of the feature map is changed by adding a plurality of convolution blocks and by performing upsampling operations.
Step S4: inputting a target video image to be detected into the pedestrian recognition model in the abnormal state for pedestrian recognition and labeling, and eliminating redundant prior frames by adopting a non-maximum suppression method to obtain a final pedestrian recognition labeling image, which specifically comprises the following steps:
step S41: and inputting the target video image to be detected into a feature extraction network for feature extraction to obtain initial feature maps with different scales.
Step S42: inputting the initial feature maps of different scales into a multi-scale extraction fusion network for feature extraction and fusion to obtain fusion feature maps of different scales.
Step S43: selecting a prior frame corresponding to each fusion characteristic graph; fused feature maps of different scales correspond to different prior frames.
The invention adopts K-means clustering to obtain 9 prior frames of different scales; the specific dimensions of the 9 boxes are 116 × 90, 156 × 198, 373 × 326, 30 × 61, 62 × 45, 59 × 119, 10 × 13, 16 × 30 and 33 × 23. The image is divided into S × S grids, and when the center of a target falls within a grid, that grid is responsible for detecting 3 anchors (one set for each of the 3 scales), so for each scale the output contains S × S × 3 prior frames.
The prior frames of the 9 sizes are obtained by K-means clustering, the predicted feature maps of 3 sizes (13 × 13, 26 × 26 and 52 × 52) are obtained through the scaling convolution blocks, and 3 prior frames are established for each cell of the fused feature maps of the different sizes. Following the principle that large-scale output feature layers predict small objects and small-scale output feature layers predict large objects, large prior frames (116 × 90, 156 × 198, 373 × 326) are used for the 13 × 13 fused feature map receptive field, medium prior frames (30 × 61, 62 × 45, 59 × 119) for the 26 × 26 receptive field, and small prior frames (10 × 13, 16 × 30, 33 × 23) for the 52 × 52 receptive field.
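A hypothetical sketch of such K-means prior frame clustering is shown below; the 1 − IoU distance is the metric commonly used for YOLO-style anchor clustering and is an assumption, since the patent only states that K-means clustering is used.

```python
# Hypothetical K-means clustering of ground-truth box widths/heights into 9 prior frames.
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """wh: (N, 2) array of ground-truth box widths and heights."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
                np.minimum(wh[:, None, 1], centers[None, :, 1])
        union = wh[:, 0:1] * wh[:, 1:2] + centers[:, 0] * centers[:, 1] - inter
        assign = np.argmax(inter / union, axis=1)          # nearest center = highest IoU
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]       # sorted small -> large

# Example: cluster randomly generated box sizes into 9 prior frames.
boxes_wh = np.abs(np.random.default_rng(1).normal(80, 40, size=(500, 2)))
anchors = kmeans_anchors(boxes_wh)
```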
Step S44: inputting the fused feature maps of different scales into the prior frame labeling network, and labeling the pedestrians in the fused feature maps of different scales according to the prior frame corresponding to each fused feature map, obtaining pedestrian recognition initial labeling maps of different sizes. In this embodiment, 3 prior frames are used on each cell of each fused feature map; for example, 13 × 13 × 3 prior frames are generated for the 13 × 13 fused feature map.
The specific formula for calculating the position information marked by the prior frame is:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_h = p_h · e^{t_h}

where (b_x, b_y, b_w, b_h) represents the position information of the prior frame label, t_x and t_y represent the horizontal and vertical translation values of the prior frame prediction coordinates, t_w and t_h represent the scaling factors, b_x and b_y represent the coordinates of the center of the bounding box, b_w and b_h represent the width and height of the bounding box, c_x and c_y represent the coordinates of the upper left corner of the feature map cell to which the center point of the bounding box belongs, and p_w and p_h represent the width and height of the prior box mapped onto the feature map.
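A direct transcription of these decoding formulas for a single cell and prior frame; sigmoid() corresponds to the σ(·) function introduced later in this section.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    bx = sigmoid(tx) + cx          # bounding box center x
    by = sigmoid(ty) + cy          # bounding box center y
    bw = pw * math.exp(tw)         # bounding box width
    bh = ph * math.exp(th)         # bounding box height
    return bx, by, bw, bh
```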
Step S45: and overlapping and fusing the pedestrian identification initial labeling graphs of different sizes to obtain a total fusion characteristic graph.
Step S46: eliminating redundant prior frames on the total fused feature map by a non-maximum suppression method, and outputting the final pedestrian recognition annotation map. Overlapping prior frames and prior frames with too low a confidence are referred to as redundant prior frames. The total fused feature map obtained in step S45 contains a large number of overlapping prior frames and low-confidence prior frames; to improve recognition accuracy, a non-maximum suppression method is therefore adopted to remove the redundant prior frames, the prediction frames are screened and drawn, and the final pedestrian recognition annotation map is predicted and output.
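A minimal NumPy sketch of the non-maximum suppression step, assuming (x1, y1, x2, y2) boxes with one confidence score each; the thresholds are illustrative values, not values stated in the patent.

```python
import numpy as np

def iou(box, boxes):
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    keep = []
    idx = np.argsort(scores)[::-1]
    idx = idx[scores[idx] >= conf_thresh]                        # drop low-confidence prior frames
    while idx.size:
        best = idx[0]
        keep.append(best)
        rest = idx[1:]
        idx = rest[iou(boxes[best], boxes[rest]) < iou_thresh]   # drop overlapping prior frames
    return keep
```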
The framework of the invention uses 9 prior frames of different sizes; feature maps of different sizes establish prior frames of different sizes, which are applied to the target detection of each cell. σ(·) is the sigmoid function:

σ(x) = 1 / (1 + e^{−x})
the function of the method can map the input to the (0, 1) interval, and can effectively ensure that the central coordinate of the bounding box is in the cell responsible for predicting the target.
The pedestrian recognition model in the abnormal state provided by the invention takes Darknet-53 as the feature extraction backbone. The main characteristic of this network is that a large number of easy-to-optimize residual networks (i.e. residual blocks) are added; accuracy is improved by increasing the network depth, the amount of parameter training is reduced, and the problem of vanishing gradients is alleviated.
Example 2
As shown in fig. 4, the present invention also discloses a deep learning-based pedestrian recognition system in an abnormal state, which comprises:
and a training set constructing module 401, configured to construct a sample training set.
An initial recognition model building module 402, configured to build an initial recognition model using the YOLOV3 model based on the deep learning framework.
A training module 403, configured to input the training set into the initial recognition model for training by using a gradient descent algorithm, and use the initial recognition model corresponding to the minimum total loss function as a pedestrian recognition model in an abnormal state; the pedestrian recognition model in the abnormal state comprises: the system comprises a feature extraction network, a multi-scale extraction fusion network and a priori frame labeling network.
And the pedestrian identification and marking module 404 is configured to input the target video image to be detected into the pedestrian identification model in the abnormal state for pedestrian identification and marking, and remove redundant prior frames by using a non-maximum suppression method to obtain a final pedestrian identification marking map.
As an optional implementation manner, the pedestrian identification and marking module 404 of the present invention specifically includes:
The feature extraction unit is used for inputting the target video image to be detected into the feature extraction network for feature extraction to obtain initial feature maps of different scales.
The feature extraction and fusion unit is used for inputting the initial feature maps of different scales into the multi-scale extraction fusion network for feature extraction and fusion to obtain fusion feature maps of different scales.
The selection unit is used for selecting the prior frame corresponding to each fusion feature map.
The marking unit is used for inputting the fusion feature maps of different scales into the prior frame labeling network and marking the pedestrians in the fusion feature maps of different scales according to the prior frame corresponding to each fusion feature map, so as to obtain pedestrian identification initial labeled maps of different sizes.
The superposition and fusion unit is used for superposing and fusing the pedestrian identification initial labeled maps of different sizes to obtain a total fusion feature map.
The elimination unit is used for eliminating redundant prior frames on the total fusion feature map by a non-maximum suppression method and outputting the final pedestrian identification labeled map.
The same parts as those in embodiment 1 will not be described in detail.
Example 3
To verify the generalization performance of the YOLOV3-VOC + APD training model, the model was evaluated on a test data set. The classification categories comprise 20 types of targets, such as people, automobiles and bicycles, and the AP value is adopted for evaluation. The framework achieves higher AP values for large targets such as buses, automobiles and airplanes; in contrast, the model has lower AP values for indoor objects such as potted plants, chairs and tables, owing to weak indoor light. The overall accuracy mAP of YOLOV3-VOC + APD is 0.83, which shows that the framework of the invention has good target detection performance.
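As context for how AP and mAP figures of this kind are typically obtained (a generic sketch under assumed conventions, not the patent's evaluation code; the function names and the simple step-wise integration are illustrative choices), average precision is the area under the precision-recall curve of one class, and mAP is its mean over all classes:

    import numpy as np

    def average_precision(scores, is_true_positive, num_ground_truth):
        """Area under the precision-recall curve for one class.

        scores           : confidence of each detection
        is_true_positive : 1 if the detection matches a ground-truth box, else 0
        num_ground_truth : number of ground-truth objects of this class
        """
        order = np.argsort(scores)[::-1]                 # rank detections by confidence
        tp = np.cumsum(np.asarray(is_true_positive)[order])
        fp = np.cumsum(1 - np.asarray(is_true_positive)[order])
        recall = tp / num_ground_truth
        precision = tp / (tp + fp)
        # simple step-wise integration of precision over recall
        return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]) + recall[0] * precision[0])

    def mean_average_precision(per_class_ap):
        # mAP is the mean of the per-class AP values (e.g. over the 20 VOC classes)
        return float(np.mean(per_class_ap))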
On the other hand, to prove that YOLOV3-VOC + APD can detect pedestrians in an abnormal state, three different methods are implemented for comparison, namely SSD-VOC (i.e. the SSD model trained on the VOC data set), SSD-VOC + APD (i.e. the SSD model trained on the VOC + APD data set), and YOLOV3-VOC (i.e. the YOLOV3 model trained on the VOC data set).
The cases used are quite challenging; four typical non-steady-state cases are listed (i.e. the four scenes shown in fig. 5 to fig. 8), where: (a) is characterized by unusual human behavior, such as lying prone, squatting and other atypical actions; (b) is characterized by human-like objects in the scene interfering with detection; (c) is characterized by camera shake, which blurs part of the pedestrians in the image; (d) is characterized by crowded conditions in which pedestrians severely occlude one another.
Fig. 5 to fig. 8 list the detection results of each model for pedestrians in the above four typical unusual states. FIG. 5 shows the non-steady-state pedestrian recognition results based on YOLOV3-VOC + APD, FIG. 6 those based on YOLOV3-VOC, FIG. 7 those based on SSD-VOC + APD, and FIG. 8 those based on SSD-VOC. For ease of distinction, the square detection boxes represent the detection results, and the circles mark the content missed by the corresponding framework. Regarding pedestrian detection accuracy: the AP value of the YOLOV3-VOC + APD framework is 0.88, that of the YOLOV3-VOC framework is 0.82, that of the SSD-VOC + APD framework is 0.83, and that of the SSD-VOC framework is 0.78, which demonstrates the reliability of the model framework of the invention.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A pedestrian recognition method in an abnormal state based on deep learning is characterized by comprising the following steps:
step S1: constructing a sample training set;
step S2: constructing an initial recognition model by using a YOLOV3 model based on a deep learning framework;
step S3: inputting the training set into the initial recognition model for training by adopting a gradient descent algorithm, and taking the initial recognition model corresponding to the minimum total loss function as a pedestrian recognition model in an abnormal state; the pedestrian recognition model in the abnormal state comprises: a feature extraction network, a multi-scale extraction fusion network and a priori frame labeling network;
step S4: and inputting the target video image to be detected into the pedestrian recognition model in the abnormal state for pedestrian recognition and labeling, and eliminating redundant prior frames by adopting a non-maximum suppression method to obtain a final pedestrian recognition labeling image.
2. The method for identifying pedestrians in an abnormal state based on deep learning of claim 1, wherein the step of inputting the target video image to be detected into the pedestrian identification model in the abnormal state for pedestrian identification and labeling, and eliminating redundant prior frames by using a non-maximum suppression method to obtain a final pedestrian identification label map specifically comprises:
step S41: inputting a target video image to be detected into a feature extraction network for feature extraction to obtain initial feature maps with different scales;
step S42: inputting the initial feature maps of different scales into a multi-scale extraction fusion network for feature extraction and fusion to obtain fusion feature maps of different scales;
step S43: selecting a prior frame corresponding to each fusion characteristic graph;
step S44: inputting the fusion characteristic graphs of different scales into a priori frame marking network, marking pedestrians in the fusion characteristic graphs of different scales according to a priori frame corresponding to each fusion characteristic graph, and obtaining pedestrian identification initial marking graphs of different sizes;
step S45: overlapping and fusing the pedestrian identification initial labeling graphs of different sizes to obtain a total fusion characteristic graph;
step S46: and eliminating redundant prior frames on the total fusion characteristic graph by adopting a non-maximum value inhibition method, and outputting a final pedestrian identification labeled graph.
3. The deep learning-based pedestrian recognition method in an abnormal state according to claim 1, wherein the feature extraction network comprises: a first feature extraction module, a second feature extraction module, a third feature extraction module, a fourth feature extraction module and a fifth feature extraction module, wherein the first feature extraction module is connected with the fifth feature extraction module sequentially through the second feature extraction module, the third feature extraction module and the fourth feature extraction module;
the first feature extraction module comprises 2 convolution blocks and 1 residual block, wherein the 1 st convolution block is connected with the residual block through the 2 nd convolution block; the residual block comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the 2 nd convolution block and the 2 nd convolution block in the first characteristic extraction module; adding the feature graph output by the 2 nd convolution block in the residual block and the feature graph output by the 2 nd convolution block in the first feature extraction module, and inputting the addition result to the second feature extraction module;
the second feature extraction module comprises 1 convolution block and 2 residual blocks, and the convolution block is connected with the 2 nd residual block through the 1 st residual block; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the second feature extraction module, the 1 st convolution block in the 2 nd residual block is connected with the 2 nd convolution block in the 1 st residual block, the feature map output by the 2 nd convolution block in the last residual block is added with the feature map output by the convolution block in the second feature extraction module, and the addition result is input to the third feature extraction module;
the third feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the third feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the third feature extraction module, and the addition result is input into the fourth feature extraction module and the third extraction fusion network;
the fourth feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fourth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fourth feature extraction module, and the addition result is input into the fifth feature extraction module and the second extraction fusion network;
the fifth feature extraction module comprises 1 convolution block and 4 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 4 th residual block sequentially through 2 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; and the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fifth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fifth feature extraction module, and the addition result is input into the first extraction fusion network.
4. The deep learning-based pedestrian recognition method in an abnormal state according to claim 3, wherein the multi-scale extraction fusion network comprises a first extraction fusion network, a second extraction fusion network and a third extraction fusion network; the first extraction fusion network is connected with the fifth feature extraction module, the second extraction fusion network is connected with the fourth feature extraction module, the third extraction fusion network is connected with the third feature extraction module, and the first extraction fusion network is connected with the third extraction fusion network through the second extraction fusion network;
the first extraction fusion network comprises 1 Convolutional Set layer; the Convolutional Set layer is respectively connected with the fifth feature extraction module, the convolution block in the second extraction fusion network and the 1 st convolution block in the first labeling module;
the second extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the second labeling module, the splicing layer, the convolution block in the third extraction fusion network and the fourth feature extraction module;
the third extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the third labeling module, the splicing layer and the third feature extraction module.
5. The deep learning-based pedestrian recognition method in an abnormal state according to claim 4, wherein the prior frame labeling network comprises a first labeling module, a second labeling module and a third labeling module; the first labeling module is connected with the first extraction fusion network, the second labeling module is connected with the second extraction fusion network, and the third labeling module is connected with the third extraction fusion network;
the first labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and the 2 nd convolution block;
the second labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and the 2 nd convolution block;
the third labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the third extraction fusion network and the 2 nd convolution block.
6. A pedestrian recognition system in an abnormal state based on deep learning, the system comprising:
the training set constructing module is used for constructing a sample training set;
the initial identification model building module is used for building an initial identification model by using a YOLOV3 model based on a deep learning framework;
the training module is used for inputting the training set into the initial recognition model for training by adopting a gradient descent algorithm, and taking the initial recognition model corresponding to the minimum total loss function as a pedestrian recognition model in an abnormal state; the pedestrian recognition model in the abnormal state comprises: a feature extraction network, a multi-scale extraction fusion network and a priori frame labeling network;
and the pedestrian identification and marking module is used for inputting the target video image to be detected into the pedestrian identification model in the abnormal state for pedestrian identification and marking, and eliminating redundant prior frames by adopting a non-maximum inhibition method to obtain a final pedestrian identification marking image.
7. The system for identifying pedestrians in an abnormal state based on deep learning of claim 6, wherein the pedestrian identification and marking module specifically comprises:
the characteristic extraction unit is used for inputting a target video image to be detected into a characteristic extraction network for characteristic extraction to obtain initial characteristic graphs of different scales;
the characteristic extraction and fusion unit is used for inputting the initial characteristic graphs of different scales into a multi-scale extraction and fusion network to extract and fuse the characteristics to obtain fusion characteristic graphs of different scales;
the selection unit is used for selecting the prior frame corresponding to each fusion characteristic graph;
the marking unit is used for inputting the fusion characteristic graphs of different scales into the prior frame marking network, marking the pedestrians in the fusion characteristic graphs of different scales according to the prior frame corresponding to each fusion characteristic graph, and obtaining pedestrian identification initial marking graphs of different sizes;
the superposition fusion unit is used for carrying out superposition fusion on pedestrian identification initial labeling graphs of different sizes to obtain a total fusion characteristic graph;
and the eliminating unit is used for eliminating redundant prior frames on the total fusion characteristic graph by adopting a non-maximum value inhibition method and outputting a final pedestrian identification label graph.
8. The deep learning-based pedestrian recognition system in an abnormal state of claim 6, wherein the feature extraction network comprises: a first feature extraction module, a second feature extraction module, a third feature extraction module, a fourth feature extraction module and a fifth feature extraction module, wherein the first feature extraction module is connected with the fifth feature extraction module sequentially through the second feature extraction module, the third feature extraction module and the fourth feature extraction module;
the first feature extraction module comprises 2 convolution blocks and 1 residual block, wherein the 1 st convolution block is connected with the residual block through the 2 nd convolution block; the residual block comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the 2 nd convolution block and the 2 nd convolution block in the first characteristic extraction module; adding the feature graph output by the 2 nd convolution block in the residual block and the feature graph output by the 2 nd convolution block in the first feature extraction module, and inputting the addition result to the second feature extraction module;
the second feature extraction module comprises 1 convolution block and 2 residual blocks, and the convolution block is connected with the 2 nd residual block through the 1 st residual block; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the second feature extraction module, the 1 st convolution block in the 2 nd residual block is connected with the 2 nd convolution block in the 1 st residual block, the feature map output by the 2 nd convolution block in the last residual block is added with the feature map output by the convolution block in the second feature extraction module, and the addition result is input to the third feature extraction module;
the third feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the third feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the third feature extraction module, and the addition result is input into the fourth feature extraction module and the third extraction fusion network;
the fourth feature extraction module comprises 1 convolution block and 8 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 8 th residual block sequentially through 6 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fourth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fourth feature extraction module, and the addition result is input into the fifth feature extraction module and the second extraction fusion network;
the fifth feature extraction module comprises 1 convolution block and 4 residual blocks, the convolution block is connected with the 1 st residual block, and the 1 st residual block is connected with the 4 th residual block sequentially through 2 residual blocks; each residual block comprises 2 convolution blocks, and the 1 st convolution block is connected with the 2 nd convolution block; and the 1 st convolution block in the 1 st residual block is connected with the convolution block in the fifth feature extraction module, the 1 st convolution block in the (i + 1) th residual block is connected with the 2 nd convolution block in the i th residual block, the feature graph output by the 2 nd convolution block in the last residual block is added with the feature graph output by the convolution block in the fifth feature extraction module, and the addition result is input into the first extraction fusion network.
9. The deep learning-based pedestrian recognition system in an abnormal state of claim 8, wherein the multi-scale extraction fusion network comprises a first extraction fusion network, a second extraction fusion network, and a third extraction fusion network; the first extraction fusion network is connected with the fifth feature extraction module, the second extraction fusion network is connected with the fourth feature extraction module, the third extraction fusion network is connected with the third feature extraction module, and the first extraction fusion network is connected with the third extraction fusion network through the second extraction fusion network;
the first extraction fusion network comprises 1 Convolutional Set layer; the Convolutional Set layer is respectively connected with the fifth feature extraction module, the convolution block in the second extraction fusion network and the 1 st convolution block in the first labeling module;
the second extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the second labeling module, the splicing layer, the convolution block in the third extraction fusion network and the fourth feature extraction module;
the third extraction fusion network comprises 1 Convolutional Set layer, 1 upsampling layer, 1 splicing layer and 1 convolution block; the convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and the upsampling layer, the upsampling layer is connected with the splicing layer, and the Convolutional Set layer is respectively connected with the 1 st convolution block in the third labeling module, the splicing layer and the third feature extraction module.
10. The deep learning-based pedestrian recognition system in an abnormal state of claim 9, wherein the prior frame labeling network comprises a first labeling module, a second labeling module, and a third labeling module; the first labeling module is connected with the first extraction fusion network, the second labeling module is connected with the second extraction fusion network, and the third labeling module is connected with the third extraction fusion network;
the first labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the first extraction fusion network and the 2 nd convolution block;
the second labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the second extraction fusion network and the 2 nd convolution block;
the third labeling module comprises 2 convolution blocks, and the 1 st convolution block is respectively connected with the Convolutional Set layer in the third extraction fusion network and the 2 nd convolution block.
CN202111471511.8A 2021-12-06 2021-12-06 Method and system for identifying pedestrian in abnormal state based on deep learning Pending CN113901962A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111471511.8A CN113901962A (en) 2021-12-06 2021-12-06 Method and system for identifying pedestrian in abnormal state based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111471511.8A CN113901962A (en) 2021-12-06 2021-12-06 Method and system for identifying pedestrian in abnormal state based on deep learning

Publications (1)

Publication Number Publication Date
CN113901962A true CN113901962A (en) 2022-01-07

Family

ID=79195281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111471511.8A Pending CN113901962A (en) 2021-12-06 2021-12-06 Method and system for identifying pedestrian in abnormal state based on deep learning

Country Status (1)

Country Link
CN (1) CN113901962A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275010A (en) * 2020-02-25 2020-06-12 福建师范大学 Pedestrian re-identification method based on computer vision
CN111444809A (en) * 2020-03-23 2020-07-24 华南理工大学 Power transmission line abnormal target detection method based on improved YO L Ov3
CN111461039A (en) * 2020-04-07 2020-07-28 电子科技大学 Landmark identification method based on multi-scale feature fusion
CN111723863A (en) * 2020-06-19 2020-09-29 中国农业科学院农业信息研究所 Fruit tree flower identification and position acquisition method and device, computer equipment and storage medium
CN112529065A (en) * 2020-12-04 2021-03-19 浙江工业大学 Target detection method based on feature alignment and key point auxiliary excitation
CN112949508A (en) * 2021-03-08 2021-06-11 咪咕文化科技有限公司 Model training method, pedestrian detection method, electronic device and readable storage medium
US20210370993A1 (en) * 2020-05-27 2021-12-02 University Of South Carolina Computer vision based real-time pixel-level railroad track components detection system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
He Weixin et al.: "Research on a target recognition method for ammunition depots based on the improved YOLOV3 algorithm", Modern Electronics Technique *
Wu Weihao et al.: "Defect detection of electrical connectors based on improved Yolov3", Chinese Journal of Sensors and Actuators *
Yang Bo et al.: "Bird's nest detection in power line inspection with an improved real-time object detection algorithm", Electrical Engineering *
Jiang Yi et al.: "Application of deep transfer learning in the detection of Ageratina adenophora (Crofton weed)", Computer Systems & Applications *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination