CN113591532A - Real-time pedestrian detection and feature extraction module based on self-selection mechanism - Google Patents
- Publication number
- CN113591532A (application CN202110391719.2A)
- Authority
- CN
- China
- Prior art keywords
- module
- pedestrian
- layer
- network
- residual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/22 — Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Matching criteria, e.g. proximity measures
- G06F18/24 — Pattern recognition; Analysing; Classification techniques
- G06N3/045 — Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
- G06N3/08 — Neural networks; Learning methods
Abstract
The invention discloses a real-time pedestrian detection and feature extraction module based on a self-selection mechanism, comprising a pedestrian detection module and a feature extraction module. The pedestrian detection module transmits the number of detected pedestrians A, the pedestrian detection frames and the corresponding pedestrian features B in a picture to the feature extraction module. The feature extraction module screens the backbone networks in the residual backbone network module according to the detected pedestrian number A, sends each person's pedestrian features B into the corresponding residual network for feature extraction, distinguishes the pedestrian features B by combining orientation information, and then compares each pedestrian's global information and orientation information against the pedestrian records in the database for identification. The module adapts to pedestrian re-identification in complex monitoring scenes, can be used plug-and-play, and effectively addresses the large parameter counts, heavy redundant computation, poor applicability to real conditions, low efficiency and weak generalization of existing pedestrian re-identification methods.
Description
Technical Field
The invention relates to the technical field of pedestrian re-identification in the field of computer vision, in particular to a real-time pedestrian detection and feature extraction module based on a self-selection mechanism.
Background
The key to rapidly and accurately re-identifying pedestrians is determining, through machine learning and computer-vision analysis, whether the same pedestrian appears across consecutive videos or images. The traditional pedestrian re-identification task can be simplified into feature processing and feature comparison/identification, two relatively independent tasks and two necessary steps for realizing pedestrian re-identification in a monitoring scene. The feature processing stage can itself be divided into two relatively independent modules: pedestrian detection and feature extraction. The task flow division is shown in fig. 1.
However, most current patents and research one-sidedly pursue accuracy in the feature extraction module alone, and such work is difficult to deploy in practice. Although some scholars have studied the dual tasks of pedestrian detection and feature extraction, the re-identification accuracy of the proposed methods is low. The main reason is that existing pedestrian re-identification algorithms follow a "detection-identification" framework, which is reasonable but not optimal and cannot self-select a network according to the actual complex scene. Moreover, data association depends excessively on detection quality, so the accuracy of pedestrian detection severely affects the accuracy of re-identification; hence the idea of realizing pedestrian re-identification with a single network cannot meet practical requirements.
The existing network structures of ResNet-18, ResNet-34 and ResNet-50 are shown in Table 1:
TABLE 1
The patent with publication number CN105224912B discloses a "video pedestrian detection and tracking method based on motion information and track association". However, the proposed method performs no targeted training on surveillance images; it extracts features with a sliding-window search and then applies a classifier for detection, so the computation cost is large, the efficiency is low, and real-time tracking in practical applications cannot be achieved;
the patent with publication number CN108764338A proposes a "pedestrian tracking algorithm applied to video analysis" that relies on the optical-flow method, the color-histogram method and a logistic regression classifier, all of which have gradually been abandoned in the detection and tracking field; its performance cannot meet real-time monitoring in complex scenes.
The patent with publication number CN109871763A proposes a specific-target tracking method based on YOLO, which simply stacks existing "detection" and "tracking" methods; even if each part of the framework is locally optimal, end-to-end optimality cannot be achieved. Meanwhile, the algorithm is computationally heavy and unsuitable for deployment and porting to mobile devices.
Reference [1]: Han, Kai, et al. "GhostNet: More Features from Cheap Operations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Reference [2]: Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He. "Non-local Neural Networks." CVPR, 2018.
Reference [3]: Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, Wei Jiang. "Bag of Tricks and a Strong Baseline for Deep Person Re-identification." CVPR Workshops, 2019.
Reference [4]: Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-Excitation Networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Disclosure of Invention
The invention aims to provide a real-time pedestrian detection and feature extraction module based on a self-selection mechanism that adapts to pedestrian re-identification in complex monitoring scenes, can be used plug-and-play, and effectively addresses the large parameter counts, heavy redundant computation, poor applicability to real conditions, low efficiency and weak generalization of existing pedestrian re-identification methods.
The invention is realized by the following technical scheme: a real-time pedestrian detection and feature extraction module based on a self-selection mechanism comprises a pedestrian detection module and a feature extraction module, wherein,
the pedestrian detection module is used for transmitting the pedestrian number A, the pedestrian detection frame and the corresponding pedestrian feature B detected in one picture into the feature extraction module;
and the feature extraction module is used for screening the backbone networks in the residual backbone network module according to the detected pedestrian number A, sending each person's pedestrian features B into the corresponding residual network in the backbone network for feature extraction, distinguishing the pedestrian features B by combining orientation information, and then comparing each pedestrian's global information and orientation information against the pedestrian records in the database for identification.
In order to further realize the invention, the following arrangement mode is adopted: the pedestrian detection module adopts a structural mode of a single-stage detection network, is sequentially provided with a convolution layer and 3 bottleneck channel layers, is also respectively provided with an attention module behind each bottleneck channel layer, is also provided with a channel adding module behind the last attention module, and the output of each attention module is connected to the channel adding module, namely the pedestrian detection module is provided with the convolution layer, the bottleneck channel layers, the attention modules and the channel adding module according to the processing sequence of input pictures, and the output of each attention module is subjected to channel addition in the channel adding module to obtain the pedestrian number A, the pedestrian detection frame and the corresponding pedestrian characteristic B.
In order to further realize the invention, the following arrangement mode is adopted: the bottleneck channel layer is composed of redundant simplified modules of two different stride.
In order to further realize the invention, the following arrangement mode is adopted: the slice convolution layer of the pedestrian detection module is used for rearranging the width and height information of the picture, and specifically comprises the following steps: the slice convolution layer of the pedestrian detection module divides the width and height data of the picture into 4 parts by half, so as to obtain 4 parts of data, each part of data is obtained by 2 times of downsampling, then 4 parts of data are spliced in channel dimensionality, and finally convolution operation is carried out.
On the channel dimension: an ordinary color picture is usually a 3-channel RGB image. Assuming its width and height are both 4, it is stored in the computer as (4 × 4 × 3), where 3 is the channel dimension, which may also be understood as the number of channels. Passing this picture through the slicing operation halves its width and height (4 × 4 → 2 × 2), cutting it into 4 feature maps of size 2 × 2 × 3. A plain down-sampling from 4 × 4 to 2 × 2 would lose a large amount of feature information (without channel splicing the picture would shrink from 4 × 4 × 3 to 2 × 2 × 3). By instead splicing the four 2 × 2 × 3 maps along the channel dimension, the slice convolution layer down-samples 4 × 4 × 3 to 2 × 2 × 12, so no feature information of the original whole picture (4 × 4 × 3) is lost.
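A minimal NumPy sketch of the slicing step (the convolution that follows it in the actual layer is omitted here): for a 4 × 4 × 3 input it produces the 2 × 2 × 12 result described above, discarding no pixels.

```python
import numpy as np

def slice_rearrange(img):
    """Halve width and height by stride-2 sampling and stack the four
    sub-images along the channel axis: (H, W, C) -> (H/2, W/2, 4C).
    Every pixel of the input survives, so no information is lost."""
    parts = [img[0::2, 0::2, :], img[1::2, 0::2, :],
             img[0::2, 1::2, :], img[1::2, 1::2, :]]
    return np.concatenate(parts, axis=2)
```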
Since the stride (step size) of the redundancy-simplification module can be adjusted to control the output feature size, a stride-1 convolution is used when the feature size should remain unchanged through the module, and a stride-2 convolution is used when the output should become half of the input.
In the redundancy-simplification module with stride 2, the input first passes a stride-2 convolution that halves the feature size, and a subsequent stride-1 convolution yields feature information of consistent quantity, called the original information (consistent only in quantity and size; the feature values themselves change through convolution). The other part is produced by cheap linear operations on the original information (addition, subtraction, multiplication and division count as linear operations; this invention uses addition and subtraction). Compared with ordinary down-sampling, this fuses redundant information and loses relatively less feature information.
In the redundancy-simplification module with stride 1, the feature size is unchanged before and after the first convolution, and the final output features are fused with redundant information, thereby expanding the information content.
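The two stride variants can be sketched as follows. This is an illustrative approximation: the convolutions are replaced by simple stand-ins (average pooling / identity) and the cheap linear operation by an elementwise shift; only the data flow — original branch, cheap-linear branch, fusion by addition — mirrors the description.

```python
import numpy as np

def redundancy_simplified(x, stride=1):
    """Redundancy-simplification module sketch (cf. reference [1]).
    stride=2 halves the spatial size first; stride=1 keeps it."""
    if stride == 2:
        # stand-in for the stride-2 convolution: 2x2 average pooling
        x = (x[0::2, 0::2] + x[1::2, 0::2] +
             x[0::2, 1::2] + x[1::2, 1::2]) / 4.0
    original = x               # stand-in for the stride-1 convolution branch
    redundant = original + 1   # stand-in cheap linear op (addition)
    return original + redundant  # fuse at corresponding positions
```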
In order to further realize the invention, the following arrangement mode is adopted: the channel bottleneck layer is formed by serially connecting a stride-2 redundant simplified module and a stride-1 redundant simplified module; the channel bottleneck layer adopts down-sampling pictures to extract features, and specifically comprises the following steps: firstly, carrying out convolution operation by using a redundant simplification module with stride being 2 to generate a feature map with half size, combining original information generated after convolution of the feature map with the original information generated after linear transformation to complete one-time downsampling; and generating a feature map with the same size through a redundant simplification module with stride being 1, combining original information generated after the feature map with the same size is convolved with redundant information generated through linear transformation, reducing the loss of the feature information caused by convolution, and finally finishing the operation of extracting the feature of the down-sampling picture of the channel bottleneck layer.
The main purpose of the backbone network is to extract features from down-sampled pictures; similar but still usable features in the pictures (redundant information) are usually ignored. The backbone is therefore built from two redundancy-simplification modules with different strides (see reference [1]), which exploit the redundant information produced by similar feature maps. The original convolution is split into two batches: the output is first produced by an ordinary convolution operation, and then a series of simple linear operations generates more features. These two steps yield the same number of feature maps as ordinary convolution while fully utilizing redundant features; replacing convolution layers with simple linear operations greatly reduces the network parameters, and the learning capacity of the convolutional neural network is maintained or even improved while the computation is reduced by about 20%.
In order to further realize the invention, the following arrangement mode is adopted: the attention module obtains global compression characteristic quantity by executing global average pooling on the characteristic diagram, obtains the weight of each channel through two layers of full connection layers, and takes the weight as the input of the next layer after normalization and weighting.
Because the use of the bottleneck channel layer reduces the computation and model parameters, detection precision drops; an attention module (see reference [4]) is therefore added to offset this. At the cost of a slight increase in computation, it screens out channel-wise attention, strengthens the feature-learning capacity of the detection network, and reduces the negative effect that cutting network parameters may bring.
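A NumPy sketch of this squeeze-and-excitation-style channel attention; the weight matrices `w1` and `w2` are hypothetical parameters, and a sigmoid stands in for the normalization step:

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """feat: (H, W, C) feature map; w1: (C, C//r); w2: (C//r, C).
    Squeeze (global average pool) -> two FC layers -> sigmoid weights
    -> reweighted feature map used as the next layer's input."""
    squeeze = feat.mean(axis=(0, 1))                 # (C,) global descriptor
    hidden = np.maximum(squeeze @ w1, 0.0)           # FC 1 + ReLU
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # FC 2 + sigmoid, in (0, 1)
    return feat * weights                            # per-channel reweighting
```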
In order to further realize the invention, the following arrangement mode is adopted: the characteristic extraction module is provided with a network depth self-adaptive selection module and a residual backbone network module, wherein,
the network depth self-adaptive selection module is used for self-adaptively inputting the pedestrian characteristics B into a corresponding residual network in the residual backbone network module by judging the range of the interval where the pedestrian number A detected by the current picture is positioned and according to a corresponding judgment rule;
the residual backbone network module is used for extracting the features of the pedestrian features B and further separating pedestrian orientation information and pedestrian global information through the feature sharing layer so as to be input into a subsequent database for comparison.
In order to further realize the invention, the following arrangement mode is adopted: the network depth self-adaptive selection module is provided with an ifelse selector with a built-in judgment rule, and the judgment rule is as follows:
wherein, the modified residual 50 network is selected in case of strategy 1, the modified residual 34 network is selected in case of strategy 2, and the modified residual 18 network is selected in case of strategy 3.
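The selection logic can be sketched as a plain if-else chain. The patent gives its exact interval boundaries in a formula not reproduced in this text, so the thresholds below are illustrative assumptions; only the direction — fewer pedestrians, deeper network — follows the description:

```python
def select_backbone(num_pedestrians, few=4, many=10):
    """Pick a residual network by the detected pedestrian count A.
    'few' and 'many' are hypothetical interval boundaries."""
    if num_pedestrians <= few:       # strategy 1: few people -> deep net
        return "ResNet-50 Non-local"
    elif num_pedestrians <= many:    # strategy 2: moderate crowd
        return "ResNet-34 Non-local"
    else:                            # strategy 3: dense crowd -> shallow net
        return "ResNet-18 Non-local"
```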
In order to further realize the invention, the following arrangement mode is adopted: the residual error backbone network module is provided with a backbone network, a global average pooling layer, a BN layer, a first full-connection layer and a shared characteristic layer, wherein,
the backbone network is provided with a modified residual 18 network (residual 18Non Local), a modified residual 34 network (residual 34Non Local) and a modified residual 50 network (residual 50Non Local)3 residual networks, wherein the 3 residual networks are mainly different in the number of convolutional layers in each of the first layer to the fourth layer;
table 2 shows a network structure table of various residual error networks in the backbone network.
TABLE 2
In the table, "3×3, 64" denotes 64 filters of size 3×3 (other entries follow the same convention), and stride denotes the step size.
The global average pooling layer obtains the global compressed feature of the pedestrian feature B: all pixel values of each feature map are averaged into a single value representing that feature map, yielding a 2048-dimensional vector (one value per feature map);
the BN layer adopts a BNfeature structure, and features are constrained on the hypersphere, so that the classification hyperplane is clearer and the accuracy of orientation classification is improved; BNFeature is mainly to normalize the feature value (modulo 1) by adding a normalization layer, so that features with different orientations can be represented on a unit circle, and thus, the result calculated by the euclidean distance is more convenient to compare and classify.
Fully connected layer 1 obtains the weight of each residual network; it is connected to the global average pooling layer, plays the role of a classifier in the whole residual backbone network module, and consists of a convolution with a 1×1 kernel. Fully connected layer 1 converts the 2048-dimensional feature into a 512-dimensional feature vector that can represent the target's global features;
the shared characteristic layer can reduce the influence of characteristic difference on the identification rate under the same ID and different orientations.
To reduce the influence on the recognition rate of feature differences under different orientations of the same identity, a shared feature layer containing both orientation and identity (global) features is designed. The 512-dimensional feature vector produced by the BN Feature passes through two branches of the shared feature layer: one branch is fully connected layer 2 (512 → 3) for orientation attribute recognition (orientation classification), converting the 512-dimensional feature into a 3-dimensional one whose components represent the target's orientations (front, back, side), realizing orientation attribute judgment (orientation information); in the other branch, the same 512-dimensional feature vector represents the global features (pedestrian global information).
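A sketch of the two branches, with a hypothetical weight matrix `w2` and bias `b2` standing in for fully connected layer 2:

```python
import numpy as np

def shared_feature_layer(feat512, w2, b2):
    """One branch: FC layer 2 maps the 512-dim BN feature to 3
    orientation logits (front / back / side). Other branch: the same
    512-dim vector passes through as the global identity feature."""
    logits = feat512 @ w2 + b2            # (512,) @ (512, 3) -> (3,)
    orientation = int(np.argmax(logits))  # 0 front, 1 back, 2 side
    return orientation, feat512
```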
Finally, features and labels are obtained from the separated orientation information and pedestrian global information and input into the subsequent database for comparison.
In order to further realize the invention, the following arrangement mode is adopted: a Nonlocal layer is arranged between the first layer and the second layer, between the second layer and the third layer, and between the third layer and the fourth layer of any residual network, a BN layer of any residual network adopts BnFature, and a Pooling layer 2 of any residual network adopts Global Average Pooling.
To establish the relations between frames in a video and between two pixels a certain distance apart in an image, a Non-Local layer (see reference [2]) is adopted. Combined with the channel attention mechanism, it highlights commonality; its input and output scales are exactly the same, so no extra scale-transformation layer is needed, and several Non-Local layers can be added at shallow layers to improve precision.
To normalize the feature vectors and better perform orientation classification and feature extraction, the BN layers of the residual networks (improved ResNet-18, ResNet-34 and ResNet-50) adopt the BNFeature structure (see reference [3]), which constrains the features onto a hypersphere in Euclidean space so that the classification hyperplane is clearer and the orientation-classification accuracy improves.
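A minimal Non-Local sketch on a flattened feature map (positions × channels). The 1×1 embedding convolutions (θ, φ, g) of reference [2] are dropped here — identity embeddings are an assumption for brevity; the point illustrated is that every position attends to every other, and that the output shape equals the input shape, so the block inserts without any scale transformation:

```python
import numpy as np

def non_local(x):
    """x: (N, C) flattened feature map. Residual self-attention over
    all pairwise positions; output has the same shape as the input."""
    sim = x @ x.T                             # (N, N) pairwise similarities
    sim -= sim.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(sim)
    attn /= attn.sum(axis=1, keepdims=True)   # row-wise softmax
    return x + attn @ x                       # residual connection
```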
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The invention is a cascade of pedestrian detection and feature extraction; on the detection side it reduces the network parameter count while preserving detection precision, thereby raising detection speed. Specifically, it optimizes a single-stage detection network, redesigns the backbone structure, and adds a redundancy-simplification module and an attention module, reducing network parameters while ensuring accuracy.
(2) To guarantee real-time tracking under different scenes and pedestrian densities, the invention designs a self-selection mechanism for the feature extraction network that adaptively chooses a residual network according to the number of pedestrians in each frame: if pedestrians are many, they are sent into a shallow network; if few, into a deep network.
(3) The residual networks of the invention add a Non-Local structural layer on top of existing recognition network models, further improving recognition speed while keeping recognition stable.
(4) The invention redesigns part of the network structure, reducing network complexity and computation; meanwhile, it proposes a self-selection mechanism and introduces orientation auxiliary information, solving the application problem of pedestrian re-identification in complex scenes.
Drawings
Fig. 1 is a flowchart of a conventional pedestrian re-identification task.
Fig. 2 is a structural diagram of a pedestrian detection module according to the present invention.
Fig. 3 is a schematic structural diagram of a feature extraction module according to the present invention.
Fig. 4 is a diagram of a redundant simplified module refinement architecture.
Fig. 5 is a detailed structural diagram of the attention module.
Fig. 6 is a diagram of a bottleneck channel layer structure.
FIG. 7 shows the entire network structure described in example 1.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments are described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are only a part, not all, of the embodiments of the present invention; the detailed description is merely representative of selected embodiments and is not intended to limit the scope of the invention as claimed. All other embodiments obtained by a person skilled in the art without inventive effort based on these embodiments fall within the scope of the present invention.
Example 1:
the invention designs a real-time pedestrian detection and feature extraction module based on a self-selection mechanism, which comprises a pedestrian detection module and a feature extraction module, wherein,
the pedestrian detection module is used for transmitting the pedestrian number A, the pedestrian detection frame and the corresponding pedestrian feature B detected in one picture into the feature extraction module;
and the feature extraction module is used for screening the backbone networks in the residual backbone network module according to the detected pedestrian number A, sending each person's pedestrian features B into the corresponding residual network in the backbone network for feature extraction, distinguishing the pedestrian features B by combining orientation information, and then comparing each pedestrian's global information and orientation information against the pedestrian records in the database for identification.
Example 2:
the present embodiment is further optimized based on the above embodiment; parts identical to the foregoing technical solution are not repeated here. To better implement the present invention, the following arrangement is particularly adopted: the pedestrian detection module adopts a single-stage detection network structure, in which a convolution layer and 3 bottleneck channel layers are arranged in sequence, an attention module follows each bottleneck channel layer, and a channel adding module follows the last attention module, with the output of every attention module connected to it. That is, in the order in which an input picture is processed, the module comprises: convolution layer, bottleneck channel layer, attention module, bottleneck channel layer, attention module, bottleneck channel layer, attention module, channel adding module; the outputs of the three attention modules undergo channel addition in the channel adding module to obtain the pedestrian number A, the pedestrian detection frames and the corresponding pedestrian features B.
Example 3:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not repeated here. To better implement the present invention, the following arrangement is particularly adopted: the bottleneck channel layer is composed of two redundancy-simplification modules with different strides.
Example 4:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not repeated here. To better implement the present invention, the following arrangement is particularly adopted: the slice convolution layer of the pedestrian detection module rearranges the width and height information of the picture, specifically: it halves the picture's width and height to split the data into 4 parts, each equivalent to a 2x down-sampling of the original; the 4 parts are then spliced along the channel dimension, and finally a convolution operation is applied.
On the so-called channel dimension: a common color picture is a 3-channel picture in RGB format. Assuming the picture is 4 pixels wide and 4 pixels high, it is stored in the computer as a 4 × 4 × 3 array, where 3 is the channel dimension, which may also be understood as the number of channels, i.e. 3 channels. Passing this picture through the convolution operation of the slice convolution layer halves its width and height (4 × 4 becomes 2 × 2) and cuts it into 4 feature maps of size 2 × 2 × 3; splicing these 4 maps on the channel dimension yields one feature map of size 2 × 2 × 12. An ordinary downsampling operation (4 × 4 becomes 2 × 2; without splicing on the channel dimension, the picture changes from 4 × 4 × 3 to 2 × 2 × 3) loses a large amount of feature information, whereas downsampling 4 × 4 × 3 into 2 × 2 × 12 through the slice convolution layer loses none of the feature information of the original whole picture.
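The slice rearrangement described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the patent's implementation; the function name is hypothetical, and the trailing convolution of the slice convolution layer is omitted:

```python
import numpy as np

def slice_rearrange(img):
    """Split a (H, W, C) picture into 4 subsampled copies and splice them
    on the channel dimension: (H, W, C) -> (H/2, W/2, 4*C).
    Each copy takes every second pixel, offset by (0,0), (1,0), (0,1), (1,1),
    so no pixel of the original picture is discarded."""
    parts = [img[0::2, 0::2], img[1::2, 0::2], img[0::2, 1::2], img[1::2, 1::2]]
    return np.concatenate(parts, axis=-1)

# The 4 x 4 x 3 picture from the text becomes 2 x 2 x 12:
# half the width and height, four times the channels.
x = np.arange(4 * 4 * 3).reshape(4, 4, 3)
y = slice_rearrange(x)
```

Because the four slices partition the original pixels, `y` contains every value of `x` exactly once, which is the sense in which this downsampling loses no feature information.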
Example 5:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not described again. To better implement the present invention, the following arrangement is particularly adopted: since the stride (step size) of the redundancy simplification module can be adjusted to control the size of the output features, a convolution with stride 1 is used when the feature size before and after the module is unchanged, and a convolution with stride 2 is used when the desired output is half the input.
In the redundancy simplification module with stride 2, the input information is first convolved (stride 2), halving the feature size; a further convolution (stride 1) then yields feature information of consistent quantity, called the original information (the features are consistent only in quantity and size; their values change under convolution). The other part is produced by a linear operation (addition, subtraction, multiplication or division on the original features; this method uses addition and subtraction): the linear operation yields redundant features, which contain features similar to the original ones and can therefore be reused, and a convolution (stride 1) then produces the redundant information. The original information and the redundant information are fused by addition at corresponding positions, completing the convolution or downsampling operation. Compared with ordinary downsampling, this method fuses redundant information and loses relatively less feature information.
In the redundancy simplification module with stride 1, the feature size is unchanged before and after the first convolution step, and the final output features are fused with redundant information, expanding the information content.
The channel bottleneck layer is formed by connecting a stride-2 redundancy simplification module and a stride-1 redundancy simplification module in series. The channel bottleneck layer extracts features from downsampled pictures as follows: first, the stride-2 redundancy simplification module performs a convolution operation to generate a feature map of half size, and the original information generated by convolution is combined with the redundant information generated by linear transformation, completing one downsampling; then the stride-1 redundancy simplification module generates a feature map of the same size, again combining the original information from convolution with the redundant information from linear transformation, reducing the loss of feature information caused by convolution and finally completing the channel bottleneck layer's feature extraction from the downsampled picture.
The main purpose of the backbone network is to extract features from downsampled pictures; ordinarily, similar and reusable features in the pictures (redundant information) are ignored and left unused. Here the backbone is built from two redundancy simplification modules with different strides (see reference [1] for details), which exploit the redundant information generated by similar feature maps. The idea is to split the original convolution into two batches: the output is first produced by the ordinary convolution operation, and then a series of simple linear operations generates more features. These two steps yield the same number of feature maps as an ordinary convolution while fully exploiting the redundant features; replacing convolution layers with simple linear operations greatly reduces the network parameters, and the learning capacity of the convolutional neural network is maintained or even improved while the amount of calculation is reduced by 20%.
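The two-batch idea of reference [1] can be sketched as follows. This is a simplified illustration, not the patent's module: the costly convolution is reduced to a 1x1 channel mixing, and the "cheap linear operation" is stood in for by a per-channel scaling; all names and weight shapes are assumptions:

```python
import numpy as np

def redundancy_simplified(x, w_primary, scale):
    """Redundancy-simplification sketch: the costly convolution produces only
    half of the output channels ("original information"); a cheap per-channel
    linear operation derives the other half ("redundant information").
    x: (H, W, C_in); w_primary: (C_in, C_out//2) 1x1-conv weights;
    scale: (C_out//2,) per-channel factor standing in for the linear op."""
    primary = np.einsum('hwc,cd->hwd', x, w_primary)  # ordinary convolution batch
    cheap = primary * scale                           # simple linear operation batch
    return np.concatenate([primary, cheap], axis=-1)  # same channel count as a full conv

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
w_p = rng.standard_normal((16, 16))   # 256 weights instead of the 512 a full
w_c = rng.standard_normal(16)         # 16 -> 32 convolution would need
out = redundancy_simplified(x, w_p, w_c)
```

The output has as many channels as a full 16-to-32 convolution would produce, but roughly half the channels cost only one multiply per pixel, which is where the parameter and calculation savings come from.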
Example 6:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not described again. To better implement the present invention, the following arrangement is particularly adopted: the attention module obtains the global compression feature quantity by performing global average pooling on the feature map, obtains the weight of each channel through two fully connected layers, and takes the normalized, weighted result as the input of the next layer.
Since the use of the channel bottleneck layer reduces the amount of calculation and the model parameters, the detection precision drops; an attention module (reference [4]) is therefore added to reduce this influence. At the cost of a slight increase in calculation, it screens out channel-wise attention, enhancing the detection network's feature learning capability and reducing the negative influence that cutting network parameters may bring.
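The pooling, two fully connected layers and channel reweighting described above can be sketched in squeeze-and-excitation style. This is an illustrative reading of reference [4], not the patent's exact module; the function names, the ReLU between the two layers, and the weight shapes are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Channel attention sketch: global average pooling -> two fully
    connected layers -> per-channel weights in (0, 1) -> reweight the input.
    x: (H, W, C); w1: (C, C//r) and w2: (C//r, C) with reduction ratio r."""
    squeeze = x.mean(axis=(0, 1))                        # global compression feature, shape (C,)
    excite = sigmoid(np.maximum(squeeze @ w1, 0) @ w2)   # FC -> ReLU -> FC -> sigmoid
    return x * excite                                    # channel-weighted input for the next layer

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 32))
out = channel_attention(x, rng.standard_normal((32, 8)), rng.standard_normal((8, 32)))
```

Only the two small fully connected layers add parameters, which matches the text's claim of a slight increase in calculation.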
Example 7:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not described again. To better implement the present invention, the following arrangement is particularly adopted: the feature extraction module is provided with a network depth self-adaptive selection module and a residual backbone network module, wherein,
the network depth self-adaptive selection module is used for judging the interval in which the pedestrian number A detected in the current picture lies and, according to the corresponding judgment rule, self-adaptively inputting the pedestrian features B into the corresponding residual network in the residual backbone network module;
the residual backbone network module is used for extracting features from the pedestrian features B and further separating pedestrian orientation information and pedestrian global information through the feature sharing layer, for input into a subsequent database for comparison.
Example 8:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not described again. To better implement the present invention, the following arrangement is particularly adopted: the network depth self-adaptive selection module is provided with an if-else selector with a built-in judgment rule, the judgment rule being as follows:
wherein the improved residual 50 network is selected under strategy 1, the improved residual 34 network under strategy 2, and the improved residual 18 network under strategy 3.
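The if-else selector can be sketched as follows. The interval bounds `low` and `high` are illustrative assumptions, since the patent's judgment-rule formula is not reproduced in this text; they are chosen only so that 15 detected pedestrians maps to the improved residual 34 network, matching the worked example given later:

```python
def select_backbone(num_pedestrians, low=10, high=20):
    """If-else selector sketch for the network depth self-adaptive
    selection module.  `low` and `high` are hypothetical interval bounds;
    the actual rule is given by a formula not shown in the text."""
    if num_pedestrians > high:    # strategy 1: many targets, deepest network
        return 'improved residual 50 + Non Local'
    elif num_pedestrians > low:   # strategy 2: medium crowd
        return 'improved residual 34 + Non Local'
    else:                         # strategy 3: few targets, shallowest network
        return 'improved residual 18 + Non Local'
```

With these assumed bounds, `select_backbone(15)` selects the improved residual 34 network, while a nearly empty frame falls through to the cheapest improved residual 18 network.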
Example 9:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not described again. To better implement the present invention, the following arrangement is particularly adopted: the residual backbone network module is provided with a backbone network, a global average pooling layer, a BN layer, a first fully connected layer and a shared feature layer, wherein,
the backbone network is provided with 3 residual networks: an improved residual 18 network (residual 18 Non Local), an improved residual 34 network (residual 34 Non Local) and an improved residual 50 network (residual 50 Non Local); the 3 residual networks differ mainly in the number of convolution layers in each of the first to fourth layers;
the global average pooling layer is used to obtain the global compression feature quantity of the pedestrian features B: all pixel values of each feature map are averaged into one numerical value representing that feature map (with global average pooling, a 2048-channel input yields a 2048-dimensional vector, one value per feature map);
the BN layer adopts the BNFeature structure, constraining features onto a hypersphere so that the classification hyperplane is clearer and the accuracy of orientation classification is improved. BNFeature mainly adds a normalization layer that normalizes the feature vector (to modulus 1), so that features of different orientations can be represented on a unit circle and results calculated by Euclidean distance are easier to compare and classify.
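The effect of constraining features to modulus 1 can be shown with a tiny numeric example. This is a sketch of the normalization step only, under the assumption that BNFeature's output is an L2-normalized vector; on the unit sphere, squared Euclidean distance becomes a monotone function of cosine similarity, which is why comparison and classification become more convenient:

```python
import numpy as np

def bn_feature(f, eps=1e-12):
    """Constrain a feature vector to the unit hypersphere (modulus 1)."""
    return f / (np.linalg.norm(f) + eps)

# On the unit sphere: ||a - b||^2 = 2 - 2 * (a . b), so ranking by
# Euclidean distance is the same as ranking by cosine similarity.
a = bn_feature(np.array([3.0, 4.0]))   # -> [0.6, 0.8]
b = bn_feature(np.array([5.0, 0.0]))   # -> [1.0, 0.0]
d2 = float(np.sum((a - b) ** 2))
cos = float(a @ b)
```

Here `d2` equals `2 - 2 * cos` exactly, so a distance threshold in Euclidean space corresponds directly to an angular threshold between orientations.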
The fully connected layer 1 is used to obtain the weight of each residual network; it is connected to the global average pooling layer, plays the role of a classifier in the whole residual backbone network module, and consists of a convolution with a 1x1 kernel. The fully connected layer 1 converts the 2048-dimensional feature into a 512-dimensional feature vector that can represent the target's global features;
the shared characteristic layer can reduce the influence of characteristic difference on the identification rate under the same ID and different orientations.
To reduce the influence of feature differences under different orientations of the same identity on the recognition rate, a shared feature layer containing orientation and identity (global feature) branches is designed. The 512-dimensional feature vector output by BNFeature passes through two branches of the shared feature layer: one branch feeds the fully connected layer 2 (512, 3) for orientation attribute recognition (orientation classification), converting the 512-dimensional feature into a 3-dimensional feature representing the orientation of the target (front, back, side) so as to realize orientation attribute judgment (orientation information); in the other branch, the same 512-dimensional feature vector is used to represent the global feature (pedestrian global information).
Finally, features and labels are obtained from the separated orientation information and pedestrian global information and input into a subsequent database for comparison.
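The two branches of the shared feature layer can be sketched as follows. This is an illustrative reading, with a hypothetical weight matrix for the fully connected layer 2 and a softmax added to turn the 3-dimensional output into orientation probabilities:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def shared_feature_layer(feat512, w_orient):
    """Shared feature layer sketch: one 512-dim BNFeature vector feeds two
    branches.  Branch 1: fully connected layer 2 (512 -> 3) gives the
    orientation (front / back / side).  Branch 2: the same vector is used
    unchanged as the pedestrian global feature.
    w_orient: hypothetical (512, 3) weights of fully connected layer 2."""
    orientation_probs = softmax(feat512 @ w_orient)  # orientation information
    global_feature = feat512                          # pedestrian global information
    return orientation_probs, global_feature

rng = np.random.default_rng(2)
f = rng.standard_normal(512)
probs, g = shared_feature_layer(f, rng.standard_normal((512, 3)))
```

The global branch deliberately shares the orientation branch's input rather than learning separate features, which is the mechanism the text credits with reducing the orientation-dependence of same-ID features.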
Example 10:
the present embodiment is further optimized based on any of the above embodiments; parts identical to the foregoing technical solutions are not described again. To better implement the present invention, the following arrangement is particularly adopted: a Non-Local layer is arranged between the first and second layers, the second and third layers, and the third and fourth layers of each residual network; the BN layer of each residual network adopts BNFeature, and pooling layer 2 of each residual network adopts global average pooling.
To establish the relation between frames in a video and between two pixels a certain distance apart on an image, a Non-Local layer (reference [2]) is adopted and combined with the channel attention mechanism to highlight commonality. Its input and output scales are exactly the same, so no extra scale-transformation layer needs to be added, and several Non-Local layers can be added to the shallow layers to improve precision.
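A minimal sketch of a Non-Local block, in the embedded-Gaussian form commonly used for such layers, shows why the input and output scales are identical. The flattening of the feature map to an (N, C) matrix and all weight shapes are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def non_local(x, w_theta, w_phi, w_g, w_out):
    """Non-Local block sketch: every spatial position attends to every
    other position, and the result is added back residually, so the
    output has exactly the input's shape.
    x: (N, C) with N = H*W flattened positions; the weight matrices play
    the role of 1x1 convolutions."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g   # (N, C') embeddings
    attn = theta @ phi.T                              # (N, N) pairwise relation
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)           # softmax over all positions
    y = (attn @ g) @ w_out                            # aggregate, project back to C
    return x + y                                      # residual: same scale in and out

rng = np.random.default_rng(3)
x = rng.standard_normal((16, 8))                      # a 4x4 feature map, 8 channels
out = non_local(x, rng.standard_normal((8, 4)), rng.standard_normal((8, 4)),
                rng.standard_normal((8, 4)), rng.standard_normal((4, 8)))
```

Because the block ends with a residual addition onto the unchanged input, it can be dropped between any two residual-network layers without a scale-transformation layer, which is the plug-in property the text relies on.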
To normalize the feature vector and better perform orientation classification and feature extraction, the BN layer of each residual network (the improved residual 18, 34 and 50 networks) adopts the BNFeature structure (see reference [3]), which constrains features onto a hypersphere in Euclidean space, making the classification hyperplane clearer and improving the accuracy of orientation classification.
Example 11:
with reference to fig. 2 and fig. 3, the object of this embodiment is to provide a real-time pedestrian detection and feature extraction module based on a self-selection mechanism that adapts to the pedestrian re-identification task in complex monitoring scenes, is plug-and-play, and effectively solves the problems of large model parameter quantity, large redundant computation, low applicability, low efficiency and poor generalization capability in existing pedestrian re-identification tasks.
A real-time pedestrian feature extraction module based on a self-selection mechanism is characterized in that the design process comprises the following steps:
1) building a single-stage detection network, adding a redundancy simplification module (reference [1]) and an attention module (reference [4]) to construct a new backbone network (pedestrian detection module); features are detected by this part and conveyed to a residual network combining Non Local (reference [2]) and BNFeature (reference [3]) layers for feature extraction;
2) combining the step 1), adding a network depth self-adaptive selection module, and self-adaptively selecting the corresponding re-recognition network depth according to the number of pedestrians detected in the step 1);
3) training the pedestrian detection module and the feature extraction module, and testing the real-time performance of the integrated network system on the corresponding test set.
The step 1) comprises the following specific steps:
1.1) analyzing a detection network:
adopting a single-stage detection network structure, redesigning a backbone network structure and arrangement, referring to the single-stage network structure, and applying a redundancy simplification module and an attention module in a backbone network:
The redundancy simplification module exploits the redundant information generated by similar feature maps. The original convolution is split into two batches: the output is first produced by the ordinary convolution operation, and then a series of simple linear operations generates more features. These two steps yield the same number of feature maps as an ordinary convolution while fully exploiting the redundant features; replacing convolution layers with simple linear operations greatly reduces the network parameters, and the learning capacity of the convolutional neural network is maintained or even improved while the amount of calculation is reduced by 20%. The refined structure of the redundancy simplification module is shown in fig. 4. Meanwhile, traditional downsampling loses information and destroys features, so the modified backbone network rearranges the input width and height information: simply put, the width and height data are halved and segmented into 4 parts, each part equivalent to 2× downsampling; the parts are then spliced in the channel dimension and finally a convolution operation is performed. The greatest benefit is that information loss is minimized during the downsampling operation.
The reduction of the network parameters may affect the detection accuracy, so an attention module (reference [4]) is added to reduce this influence; at the cost of a slight increase in calculation, it screens out channel-wise attention to enhance the detection network's feature learning capability and reduce the negative influence that cutting network parameters may bring. The refined structure of the attention module is shown in fig. 5: global average pooling is performed on the feature map to obtain the global compression feature quantity, the weight of each channel is obtained through two fully connected layers, and the normalized, weighted result is taken as the input of the next layer.
1.2) building a detection network
In the backbone network at the detection end, the convolution layer mainly adopts Conv convolution, and the bottleneck channel layer is designed using the redundancy simplification module. As shown in fig. 6, the bottleneck channel layer mainly consists of two stacked redundancy simplification modules (redundancy simplification module stride 2 and redundancy simplification module stride 1): the first module expands the number of channels, and the second reduces it back to the number of input channels. With stride 1 the layer functions like a residual block; with stride 2, a stride-2 convolution layer is added between the two redundancy simplification modules, reducing the feature map size to 1/2 of the input, and the skip connection undergoes the same downsampling so that the channel addition operation stays aligned. The pedestrian detection module thus implements the function of a downsampling layer in the traditional method. The redesigned pedestrian detection module structure is shown in fig. 2.
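The expand-then-shrink stacking with an aligned skip connection can be sketched as follows. This is a schematic composition, not the patent's layer: the redundancy simplification module is reduced to the earlier 1x1-mixing-plus-cheap-scaling form, and the intermediate stride-2 convolution is stood in for by strided slicing; all names and weight shapes are assumptions:

```python
import numpy as np

def ghost(x, w, scale):
    """Redundancy-simplification sketch: half the output channels by an
    ordinary 1x1 mixing, the other half by a cheap per-channel scaling."""
    p = np.einsum('hwc,cd->hwd', x, w)
    return np.concatenate([p, p * scale], axis=-1)

def bottleneck_channel_layer(x, w1, s1, w2, s2, stride=1):
    """Two stacked modules: the first expands the channels, the second
    shrinks them back to the input count.  With stride 2, both the main
    path and the skip connection are downsampled identically so the final
    channel addition stays aligned."""
    y = ghost(x, w1, s1)          # expand: C_in -> 2*C_in channels
    skip = x
    if stride == 2:
        y = y[0::2, 0::2]         # stand-in for the intermediate stride-2 convolution
        skip = skip[0::2, 0::2]   # skip path downsampled the same way
    y = ghost(y, w2, s2)          # shrink back to C_in channels
    return y + skip               # aligned channel-wise addition

rng = np.random.default_rng(4)
x = rng.standard_normal((8, 8, 16))
w1, s1 = rng.standard_normal((16, 16)), rng.standard_normal(16)   # 16 -> 32
w2, s2 = rng.standard_normal((32, 8)), rng.standard_normal(8)     # 32 -> 16
out1 = bottleneck_channel_layer(x, w1, s1, w2, s2, stride=1)
out2 = bottleneck_channel_layer(x, w1, s1, w2, s2, stride=2)
```

With stride 1 the layer behaves like a residual block (same shape in and out); with stride 2 it halves the spatial size while keeping the channel count, i.e. it plays the role of a downsampling layer.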
1.3) analyzing and re-identifying the network:
the re-identification network realizes pedestrian detection, feature extraction and comparison identification, and two problems must be considered. First, the features of the same identity (ID) differ across orientations: a general network assigns one person only one feature, but the same ID has different features in different orientations, and if the orientation factor is ignored, identification accuracy is low once a pedestrian's orientation changes. Second, the efficiency of feature extraction: as the network grows deeper and its structure more complex, its running time lengthens, affecting real-time performance. In view of these problems, the overall network structure adopted by this embodiment, shown in fig. 7, mainly includes two core parts: the pedestrian detection module and the feature extraction module.
1.4) building a re-identification network:
1.4.1) single network optimization module:
this embodiment studies the common residual networks residual 18, residual 34 and residual 50, combined with the Non Local layer and the BNFeature layer, tested with the network structure shown in table 2. To establish the relation between frames in a video and between two pixels a certain distance apart on an image, the Non Local structure is adopted and combined with the channel attention mechanism to highlight commonality; its input and output scales are guaranteed to be exactly the same, no extra scale-transformation layer needs to be added, and several Non Local modules can be added to the shallow network to improve precision. To normalize the feature vector and better perform orientation classification and feature extraction, the BNFeature structure (proposed in reference [3]) is adopted, constraining features onto a hypersphere in Euclidean space so that the classification hyperplane is clearer and the accuracy of orientation classification is improved. Table 2 shows the detailed structure of the networks combining residual networks of different depths with Non Local and BNFeature.
1.4.2) design shared feature Branch containing orientation and ID (shared feature layer)
To reduce the influence of feature differences under different orientations of the same identity on the recognition rate, a shared feature branch containing orientation and identity (global feature) is designed. The 512-dimensional feature vector output by BNFeature passes through two branches: one branch feeds the fully connected layer 2 (512, 3) for orientation attribute recognition, converting the 512-dimensional feature into a 3-dimensional feature representing the orientation of the target (front, back, side) so as to realize orientation attribute judgment; in the other branch, the same 512-dimensional feature vector is used to represent the global feature.
2) Design network depth adaptive selection module
Recognition accuracy and feature extraction efficiency trade off against each other: the deeper the network, the higher the recognition accuracy and the lower the FPS (frames per second, a measure of inference speed); the shallower the network, the lower the recognition accuracy and the higher the feature extraction efficiency. The whole system must therefore strike a balance, ensuring that feature extraction efficiency meets deployment requirements while maintaining high precision. Meanwhile, if few people appear in the current video frame, choosing a deeper network wastes hardware computing power and reduces operating efficiency; if a shallower network is chosen when many pedestrians enter, tracking accuracy suffers greatly. This embodiment therefore designs a network depth self-adaptive selection module (see the residual backbone network module in fig. 3), which adaptively selects the corresponding residual network structure according to the number of targets in each frame from the pedestrian detection part, improving feature extraction efficiency without reducing accuracy. There are three selection strategies: strategy 1 is residual 50 + Non Local, strategy 2 is residual 34 + Non Local, and strategy 3 is residual 18 + Non Local, expressed by the following formula:
For example, if 15 people are detected in the currently incoming video frame, the backbone network adaptively switches to the residual 34 + Non Local network, so that the detected features are sent to a network of the corresponding depth, increasing identification speed as much as possible while preserving identification accuracy.
Cascading the two parts yields a whole module with the functions of feature recognition, extraction and re-identification. During the overall operation of the detection and re-identification network, the detection network end (pedestrian detection module) designed in step 1) detects the pedestrian features in a video frame with low calculation cost and high accuracy; while conveying the features to the re-identification network, it also conveys the number of pedestrians in each frame. The re-identification network (feature extraction module) designed in step 2) then adaptively selects the network depth and sends the pictures into the backbone network combining the residual series network with Non Local. After the global average pooling layer, residual 50 must pass through the fully connected layer 1, changing the 2048-dimensional feature vector into a 512-dimensional feature vector, while residual 34 and residual 18 already output 512-dimensional feature vectors. The 512-dimensional feature vector serves as the pedestrian feature and simultaneously enters the fully connected layer 2, which extracts the output orientation; the feature vectors are then stored in a database, by orientation, for comparison and identification.
3) Network training and testing: the weights of the network modules built in steps 1) and 2) are trained using the monitoring data set, and the real-time performance of the module is tested on the corresponding test set.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.
Claims (10)
1. A real-time pedestrian detection and feature extraction module based on a self-selection mechanism, characterized in that it comprises a pedestrian detection module and a feature extraction module, wherein,
the pedestrian detection module is used for transmitting the pedestrian number A, the pedestrian detection frame and the corresponding pedestrian feature B detected in one picture into the feature extraction module;
and the feature extraction module is used for screening the backbone network in the residual backbone network module according to the detected pedestrian number A, sending the pedestrian features B specific to each person into the corresponding residual network in the backbone network for feature extraction, meanwhile, distinguishing the pedestrian features B by combining with the orientation information, and then comparing and identifying the pedestrian global information and the pedestrian orientation information of each pedestrian with the pedestrian related information in the database.
2. The module of claim 1, wherein the module is for detecting and extracting pedestrian features in real time based on a self-selection mechanism, and comprises: the pedestrian detection module is sequentially provided with a slice convolution layer and 3 bottleneck channel layers, an attention module is arranged behind each bottleneck channel layer, a channel addition module is arranged behind the last attention module, and the output of each attention module is connected to the channel addition module.
3. The module of claim 2, wherein the module is for detecting and extracting pedestrian features in real time based on a self-selection mechanism, and comprises: the bottleneck channel layer is composed of two redundancy simplification modules with different strides.
4. The module of claim 2, wherein the module is for detecting and extracting pedestrian features in real time based on a self-selection mechanism, and comprises: the slice convolution layer of the pedestrian detection module is used for rearranging the width and height information of the picture, and specifically comprises the following steps: the slice convolution layer of the pedestrian detection module divides the width and height data of the picture into 4 parts by half to obtain 4 parts of data, then splices the 4 parts of data in the channel dimension, and finally carries out convolution operation.
5. The module of claim 2, wherein the module is for detecting and extracting pedestrian features in real time based on a self-selection mechanism, and comprises: the channel bottleneck layer is formed by connecting a stride-2 redundancy simplification module and a stride-1 redundancy simplification module in series; the channel bottleneck layer extracts features from downsampled pictures as follows: first, the stride-2 redundancy simplification module performs a convolution operation to generate a feature map of half size, and the original information generated by convolution is combined with the redundant information generated by linear transformation, completing one downsampling; then the stride-1 redundancy simplification module generates a feature map of the same size, again combining the original information from convolution with the redundant information from linear transformation, reducing the loss of feature information caused by convolution and finally completing the channel bottleneck layer's feature extraction from the downsampled picture.
6. The module of claim 2, wherein the module is for detecting and extracting pedestrian features in real time based on a self-selection mechanism, and comprises: the attention module obtains global compression characteristic quantity by executing global average pooling on the characteristic diagram, obtains the weight of each channel through two layers of full connection layers, and takes the weight as the input of the next layer after normalization and weighting.
7. The module according to any one of claims 1 to 6, wherein the module comprises: the feature extraction module is provided with a network depth self-adaptive selection module and a residual backbone network module, wherein,
the network depth self-adaptive selection module is used for self-adaptively inputting the pedestrian characteristics B into the residual backbone network module by judging the range of the interval where the pedestrian number A detected by the current picture is positioned and according to the corresponding judgment rule;
and the residual backbone network module is used for extracting the characteristics of the pedestrian characteristics B and further separating pedestrian orientation information and pedestrian global information so as to be input into a subsequent database for comparison.
8. The module of claim 7, wherein the module is for detecting and extracting pedestrian features in real time based on a self-selection mechanism, and comprises: the network depth self-adaptive selection module is provided with an if-else selector with a built-in judgment rule, the judgment rule being as follows:
wherein the improved residual 50 network is selected under strategy 1, the improved residual 34 network under strategy 2, and the improved residual 18 network under strategy 3.
9. The real-time pedestrian detection and feature extraction module based on a self-selection mechanism according to claim 7, characterized in that: the residual backbone network module is provided with a backbone network, a global average pooling layer, a BN layer, a first fully connected layer and a shared feature layer, wherein
the backbone network is provided with three residual networks: an improved residual-18 network, an improved residual-34 network and an improved residual-50 network;
the global average pooling layer is used for obtaining a globally compressed feature of the pedestrian features B;
the first fully connected layer, connected after the global average pooling layer, acts as a classifier and is used for obtaining the weight of each residual network;
the BN layer adopts the BNFeature structure, which constrains the features onto a hypersphere so that the classification hyperplane is clearer and the accuracy of orientation classification is improved; and
the shared feature layer comprises processing branches for the orientation feature and the global feature, which reduces the influence on the recognition rate of feature differences between different orientations of the same ID.
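Claim 9 says the BNFeature layer "constrains the features onto a hypersphere." One common reading of such a step (it resembles the BNNeck used in re-identification pipelines) is batch normalization of the pooled feature vectors followed by projection onto the unit sphere; the sketch below assumes that reading and omits BN's learnable affine parameters.

```python
import numpy as np

def bn_feature(features, eps=1e-5):
    """Hedged sketch of the 'BNFeature' step in claim 9.

    features: (N, D) pooled feature vectors, one row per pedestrian crop.
    Batch-normalize each dimension, then L2-project each vector onto the
    unit hypersphere, so orientation classes separate by angle alone.
    """
    mu = features.mean(axis=0)
    var = features.var(axis=0)
    normed = (features - mu) / np.sqrt(var + eps)         # BN, no affine terms
    norms = np.linalg.norm(normed, axis=1, keepdims=True)
    return normed / np.maximum(norms, eps)                # points on the sphere
```

With every feature at unit norm, a linear classifier's decision boundary cuts the sphere along a great circle, which is the "clearer classification hyperplane" the claim refers to.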
10. The real-time pedestrian detection and feature extraction module based on a self-selection mechanism according to claim 7, characterized in that: a Non-local layer is arranged between the first and second layers, between the second and third layers, and between the third and fourth layers of each residual network; the BN layer of each residual network adopts the BNFeature structure; and the second pooling layer of each residual network adopts global average pooling.
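The Non-local layers inserted between residual stages in claim 10 compute pairwise affinities between all spatial positions. Below is a simplified embedded-Gaussian non-local operation in numpy; the 1x1 convolutions of the standard block are modeled as per-position matrix multiplies, and the weight shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(x, w_theta, w_phi, w_g, w_out):
    """Simplified non-local operation (sketch of the layer in claim 10).

    x: (C, H, W). w_theta, w_phi, w_g: (C, C_inner); w_out: (C_inner, C).
    Returns x plus a residual in which every position attends to all others.
    """
    c, h, w = x.shape
    flat = x.reshape(c, h * w).T              # (HW, C): one row per position
    theta = flat @ w_theta                    # queries
    phi = flat @ w_phi                        # keys
    g = flat @ w_g                            # values
    attn = softmax(theta @ phi.T, axis=-1)    # (HW, HW) pairwise affinities
    y = (attn @ g) @ w_out                    # aggregate, project back to C
    return x + y.T.reshape(c, h, w)           # residual connection
```

Because the affinity matrix is HW-by-HW, the block captures long-range dependencies that stacked 3x3 convolutions in the residual stages cannot, at a quadratic cost in the number of positions.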
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110391719.2A CN113591532A (en) | 2021-04-13 | 2021-04-13 | Real-time pedestrian detection and feature extraction module based on self-selection mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113591532A (en) | 2021-11-02 |
Family
ID=78242988
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110391719.2A Pending CN113591532A (en) | 2021-04-13 | 2021-04-13 | Real-time pedestrian detection and feature extraction module based on self-selection mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113591532A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111783576A (en) * | 2020-06-18 | 2020-10-16 | 西安电子科技大学 | Pedestrian re-identification method based on improved YOLOv3 network and feature fusion |
CN112183647A (en) * | 2020-09-30 | 2021-01-05 | 国网山西省电力公司大同供电公司 | Transformer substation equipment sound fault detection and positioning method based on deep learning |
Non-Patent Citations (1)
Title |
---|
YE LI ET AL.: "A Multi-task Joint Framework for Real-time Person Search" * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yeh et al. | Lightweight deep neural network for joint learning of underwater object detection and color conversion | |
CN111639692A (en) | Shadow detection method based on attention mechanism | |
CN110263712B (en) | Coarse and fine pedestrian detection method based on region candidates | |
CN113963032A (en) | Twin network structure target tracking method fusing target re-identification | |
CN116189281B (en) | End-to-end human behavior classification method and system based on space-time self-adaptive fusion | |
CN111415318A (en) | Unsupervised correlation filtering target tracking method and system based on jigsaw task | |
Tang et al. | Deep saliency quality assessment network with joint metric | |
CN115641632A (en) | Face counterfeiting detection method based on separation three-dimensional convolution neural network | |
Muddamsetty et al. | A performance evaluation of fusion techniques for spatio-temporal saliency detection in dynamic scenes | |
Luo et al. | LatRAIVF: An infrared and visible image fusion method based on latent regression and adversarial training | |
CN118212463A (en) | Target tracking method based on fractional order hybrid network | |
CN116934796B (en) | Visual target tracking method based on twinning residual error attention aggregation network | |
CN111539434B (en) | Infrared weak and small target detection method based on similarity | |
CN113591532A (en) | Real-time pedestrian detection and feature extraction module based on self-selection mechanism | |
CN114582002B (en) | Facial expression recognition method combining attention module and second-order pooling mechanism | |
CN117557923B (en) | Real-time traffic detection method for unmanned aerial vehicle vision sensing device | |
CN116110076B (en) | Power transmission aerial work personnel identity re-identification method and system based on mixed granularity network | |
Deng et al. | LP3DAM: Lightweight parallel 3D attention module for violence detection | |
Huang et al. | Multi-camshift for multi-view faces tracking and recognition | |
Tsai et al. | Combined 2D and 3D Convolution Residual Attention Network for Hand Gesture Recognition | |
Chihaoui et al. | Implementation of skin color selection prior to Gabor filter and neural network to reduce execution time of face detection | |
CN118247478B (en) | Child positioning method, device, equipment and storage medium based on optimized Yolov model | |
CN117557923A (en) | Real-time traffic detection method for unmanned aerial vehicle vision sensing device | |
Yang et al. | An Image Saliency Detection Method Based on Combining Global and Local Information | |
Erdem | A region covariances-based visual attention model for RGB-D images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20211102 |