CN111460980A - Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion - Google Patents

Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion

Info

Publication number
CN111460980A
CN111460980A (application CN202010237758.2A; granted as CN111460980B)
Authority
CN
China
Prior art keywords
pedestrian
small
feature
network
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010237758.2A
Other languages
Chinese (zh)
Other versions
CN111460980B (en)
Inventor
Xue Tao (薛涛)
Guo Weixia (郭卫霞)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongfu Software (Xi'an) Co.,Ltd.
Original Assignee
Xian Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Polytechnic University filed Critical Xian Polytechnic University
Priority to CN202010237758.2A priority Critical patent/CN111460980B/en
Publication of CN111460980A publication Critical patent/CN111460980A/en
Application granted granted Critical
Publication of CN111460980B publication Critical patent/CN111460980B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion, which comprises the following steps. Step 1: preprocessing the selected public pedestrian data set and dividing it into a training set and a test set. Step 2: improving the Faster R-CNN network model: extracting and fusing shallow features and deep abstract features to obtain feature maps, activating the feature maps, sending them into a P-RPN network to generate candidate frames, optimizing the anchor boxes in the RPN network, and then performing a dimensionality-reduction operation on the feature vector obtained by ROI Pooling to obtain a multi-scale detection model of small-target pedestrians with fused multi-semantic features. Step 3: training the multi-scale detection model of small-target pedestrians with multi-semantic feature fusion. Step 4: detecting small-target pedestrians. The invention improves the detection effect of the network model on small-target pedestrians.

Description

Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion
Technical Field
The invention belongs to the technical field of computer vision based on deep learning, and particularly relates to a multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion.
Background
Conventional pedestrian detection methods are based on manually extracted features, such as Histogram of Oriented Gradients (HOG) features, Local Binary Pattern (LBP) features, and Aggregated Channel Features (ACF); the extracted pedestrian features are then input into a classifier model to perform target detection. In 2003, Viola et al. proposed the VJ algorithm, which combines Haar features with an AdaBoost classifier to achieve fast pedestrian detection. In 2005, Dalal et al. proposed a detection method combining HOG features with an SVM classifier, which effectively improved the accuracy of pedestrian detection.
In recent years, with the rapid development of deep learning, image target detection algorithms based on neural networks have gradually become mainstream. They fall into two main categories. The first is single-stage detection algorithms, including YOLO (You Only Look Once), YOLOv2, SSD (Single Shot MultiBox Detector) and the like; their core idea is to cast target detection as a regression problem, taking the original image as input and directly outputting the position and class judgment results. Single-stage algorithms therefore have a certain advantage in detection speed, but their detection effect on small targets and on objects close to one another is poor. The second category is two-stage detection algorithms, including the Region-based Convolutional Neural Network (R-CNN) and its series of optimizations such as Fast R-CNN and Faster R-CNN; these algorithms first generate a set of candidate regions by a region proposal method and then use a convolutional neural network to classify the regions and regress their bounding boxes. Two-stage algorithms achieve higher detection accuracy at the cost of detection speed and are widely applied in tasks where accuracy matters more than real-time performance.
At present, a large number of research results have been obtained in pedestrian detection. Some works introduce multiple built-in subnetworks on the basis of the Fast R-CNN model to detect multi-scale pedestrians in disjoint ranges; the HyperLearner framework fuses pedestrian features with additional channel features to improve pedestrian detection quality; a new data set, PRW, has been introduced together with a Confidence Weighted Similarity (CWS) measure, which combines model detection scores with a similarity metric to evaluate pedestrian re-identification in the original image; and a pyramid RPN structure has been proposed on the basis of Faster R-CNN to address the multi-scale problem of underground pedestrians, with feature fusion added to the algorithm to enhance the detection performance on small-target underground pedestrians.
These methods perform well in certain scenes, but most pedestrian images come from surveillance videos, vehicle-mounted cameras and the like, so pedestrian detection suffers from low resolution, small target size and large scale variation, which makes accurate detection more difficult.
Disclosure of Invention
The invention aims to provide a multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion, which addresses the low-resolution, small-size and multi-scale problems of pedestrian images obtained in the prior art that make accurate detection difficult.
The technical scheme adopted by the invention is as follows.
the multi-scale detection method of the small target pedestrian based on the multi-semantic feature fusion specifically comprises the following steps:
step 1: selecting a pedestrian public data set, preprocessing the pedestrian public data set, converting a data file format, expanding an image data set, and dividing the pedestrian public data set into a training set and a test set;
step 2: improving the Faster R-CNN network model: constructing a shallow feature extraction network LFMM; extracting deep abstract features of pedestrians with a VGG16 network; fusing the shallow features extracted by LFMM and the deep abstract features extracted by VGG16 to obtain feature maps; sending the feature maps into an activation block for activation and into a P-RPN network to generate candidate frames, optimizing the anchor boxes in the RPN network; and then performing a dimensionality-reduction operation on the feature vector obtained by ROI Pooling to obtain a multi-scale detection model of small-target pedestrians with fused multi-semantic features;
step 3: inputting the training set obtained in step 1 into the multi-scale detection model of the multi-semantic-feature-fused small-target pedestrian obtained in step 2, and optimizing the loss function until it converges, finishing the training of the multi-scale detection model of the multi-semantic-feature-fused small-target pedestrian;
step 4: inputting the test set from step 1 into the multi-scale detection model of the small-target pedestrian with multi-semantic feature fusion trained in step 3, and outputting the detection result, completing the detection of the small-target pedestrian.
The present invention is also characterized in that,
in step 1, the training set and the test set are divided according to the front and back sequence of the pedestrian public data set, a video file of the pedestrian public data set is converted into an image in a png format, a description file of the pedestrian public data set is converted into an xml format, the training set is stored in every 10 frames, the test set is stored in every 30 frames, the data set is expanded by turning left and right, and the test set is divided according to the difference of pedestrian heights to obtain the preprocessed training set and test set.
In step 2, the constructed shallow feature extraction network LFMM selects Conv2_2, Conv3_3 and Conv4_3 of the VGG16 network, performs convolution operations with channel numbers 32, 48 and 64 and 3 × 3 convolution kernels on Conv2_2, Conv3_3 and Conv4_3 respectively, connects the results pairwise through the Concat method together with 3 pooling layers, and finally extracts the shallow features.
In step 2, the shallow features extracted by LFMM and the deep abstract features extracted by VGG16 are fused using the Concat method.
In step 2, the dimensionality reduction operation is to add a t-SNE dimensionality reduction module after an ROI Pooling layer.
In step 2, the P-RPN network optimizes the ratios and scales used to generate anchor boxes in the RPN.
In step 2, the activation block includes a fully connected layer, a ReLU layer and a Dropout layer.
The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion has the following advantages. First, a series of preprocessing measures on the public data set enhances the diversity of training samples and improves the effectiveness of detection. Second, the designed LFMM module extracts shallow features that stay closer to the image appearance, and the feature fusion technique fuses the deep and shallow features, enhancing the network model's feature extraction for small-target pedestrians; meanwhile, the dimensionality-reduction operation reduces the influence of the increased feature parameters on detection speed. In addition, the RPN structure is improved for pedestrian characteristics to address the multi-scale problem of pedestrians. Together these finally improve the detection effect of the network model on small-target pedestrians.
Drawings
FIG. 1 is a flow chart of the multi-scale detection method of small target pedestrians based on multi-semantic feature fusion of the present invention;
FIG. 2 is a network structure diagram of the multi-scale detection method for small target pedestrians based on multi-semantic feature fusion according to the present invention;
FIG. 3 is a block diagram of the LFMM module in the network of FIG. 2 for the multi-scale detection method of small-target pedestrians based on multi-semantic feature fusion in accordance with the present invention;
FIG. 4 is a diagram of a fusion structure of deep and shallow features in the network of FIG. 2 according to the multi-scale detection method of small target pedestrians based on multi-semantic feature fusion of the present invention;
FIG. 5 is a schematic comparison of the P-R curves of the constructed network model and the original model on the Small test set for the multi-scale detection method of small-target pedestrians based on multi-semantic feature fusion of the present invention;
FIG. 6 is a schematic comparison of the P-R curves of the constructed network model and the original model on the Reasonable test set for the multi-scale detection method of small-target pedestrians based on multi-semantic feature fusion of the present invention;
FIG. 7 is a schematic comparison of the P-R curves of the constructed network model and the original model on the All test set for the multi-scale detection method of small-target pedestrians based on multi-semantic feature fusion of the present invention.
Detailed Description
The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion is described in detail below with reference to the accompanying drawings and the detailed description.
The multi-scale detection method of the small target pedestrian based on the multi-semantic feature fusion specifically comprises the following steps:
step 1: selecting a pedestrian public data set, preprocessing the pedestrian public data set, converting a data file format, expanding an image data set, and dividing the pedestrian public data set into a training set and a test set;
step 2: improving the Faster R-CNN network model: constructing a shallow feature extraction network LFMM; extracting deep abstract features of pedestrians with a VGG16 network; fusing the shallow features extracted by LFMM and the deep abstract features extracted by VGG16 to obtain feature maps; sending the feature maps into an activation block for activation and into a P-RPN network to generate candidate frames, optimizing the anchor boxes in the RPN network; and then performing a dimensionality-reduction operation on the feature vector obtained by ROI Pooling to obtain a multi-scale detection model of small-target pedestrians with fused multi-semantic features;
step 3: inputting the training set obtained in step 1 into the multi-scale detection model of the multi-semantic-feature-fused small-target pedestrian obtained in step 2, and optimizing the loss function until it converges, finishing the training of the multi-scale detection model of the multi-semantic-feature-fused small-target pedestrian;
step 4: inputting the test set from step 1 into the multi-scale detection model of the small-target pedestrian with multi-semantic feature fusion trained in step 3, and outputting the detection result, completing the detection of the small-target pedestrian.
Further, in step 1, the training set and the test set are divided according to the order of the pedestrian public data set; the video files of the pedestrian public data set are converted into png-format images and the description files into xml format; one image in every 10 frames is stored for the training set and one in every 30 frames for the test set; the data set is expanded by left-right flipping; and the test set is divided according to pedestrian height, giving the preprocessed training set and test set.
Further, in step 2, the constructed shallow feature extraction network LFMM selects Conv2_2, Conv3_3 and Conv4_3 of the VGG16 network, performs convolution operations with channel numbers 32, 48 and 64 and 3 × 3 convolution kernels on Conv2_2, Conv3_3 and Conv4_3 respectively, connects the results pairwise through the Concat method together with 3 pooling layers, and finally extracts the shallow features.
Further, in step 2, the shallow features extracted by LFMM and the deep abstract features extracted by VGG16 are fused using the Concat method.
Further, in step 2, the dimensionality reduction operation is to add a t-SNE dimensionality reduction module after the ROI Pooling layer.
Further, in step 2, the P-RPN network optimizes the ratios and scales used to generate anchor boxes in the RPN.
Further, in step 2, the activation block includes a fully connected layer, a ReLU layer and a Dropout layer.
The multi-scale detection method of small-target pedestrians based on multi-semantic feature fusion of the invention is further described in detail by specific embodiments.
Examples
The invention discloses a multi-scale detection method for small-target pedestrians based on Faster R-CNN, which comprises the following specific steps, as shown in FIG. 1:
step 1: preparation of the Experimental data set
The experimental data adopts the Caltech Pedestrian public data set from the California Institute of Technology. The data set was captured with a vehicle-mounted camera in an urban road environment and consists of 11 video sets totalling about 10 h; it comprises about 250,000 frames (about 137 minutes), 350,000 pedestrian bounding boxes and 2,300 distinct pedestrians, at an image resolution of 640 × 480 pixels.
Step 2: experimental data set preprocessing
Caltech Pedestrian has 11 video sets in total; the first six, Set00-Set05, are selected as the training set, and the last five, Set06-Set10, as the test set.
Each video set of the source data set is divided into two parts, one part is a video file in seq format, and the other part is a description file in vbb format. The invention converts the video file into the image with the png format and converts the description file into the xml format.
Because consecutive video frames are strongly correlated and their features differ little, to improve training effectiveness the invention stores one image in every 10 frames, giving a training set of 12,963 images; for the test set, one image in every 30 frames is stored, 4,088 images in total. To expand the data set, the resulting images were doubled by left-right flipping, giving final training and test sets of 25,926 and 8,176 images, respectively.
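A minimal sketch of this frame-sampling and flipping scheme, assuming videos already decoded to a format OpenCV can read (the original seq files need a dedicated Caltech decoder) and hypothetical file paths:

```python
# Sketch of the sampling-and-flipping preprocessing described above.
import cv2

def sample_frames(video_path, out_dir, step):
    """Save every `step`-th frame as a png, plus its horizontal flip."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{idx:06d}.png", frame)
            # Left-right flipping doubles the data set, as in step 2.
            cv2.imwrite(f"{out_dir}/frame_{idx:06d}_flip.png",
                        cv2.flip(frame, 1))
        idx += 1
    cap.release()

# Training videos: every 10th frame; test videos: every 30th frame.
# sample_frames("set00_V000.avi", "train", step=10)
# sample_frames("set06_V000.avi", "test",  step=30)
```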
According to the invention, the test set is divided into different levels according to pedestrian size so that detection effects can be compared conveniently; the test-set attributes are shown in Table 1.
TABLE 1 test set Attribute Table
Test set | Test set attributes | Number of images
All | Test set of all images | 8176
Small | Pedestrian height ≤ 50 pixels | 4498
Reasonable | Pedestrian height > 50 pixels | 3678
And step 3: network model improvement
The improved pedestrian detection network structure is shown in FIG. 2. The input image passes through VGG16 and LFMM (the shallow feature extraction network), which extract the deep and shallow features of the image respectively; the features are fused and sent into the P-RPN network to generate candidate frames, and at the same time into an activation block (Activate Block, AB) for activation, which increases the nonlinear expression capability of the network and prevents the over-fitting problem.
Step 3.1: constructing shallow feature extraction modules
The invention selects VGG16 as the feature extraction network; the concrete network parameters are shown in Table 2. The whole network comprises 13 convolutional layers; its depth and the large number of channels per layer allow it to extract rich, abstract high-level semantic features. The network structure is very regular: every convolutional layer systematically uses 3 × 3 convolution kernels, so the network converges quickly. The network is divided into 5 stages and includes 5 pooling layers, which reduce the feature parameters and improve efficiency, but excessive pooling also loses the features of small targets in the image. A shallow feature extraction module is therefore designed to obtain basic shallow features that retain the image appearance while the VGG16 network extracts the deep abstract features; the deep and shallow features are fused (the invention fuses with the feature map after the Conv5_3 convolution, with the 5th-stage pooling operation removed) and sent onward together, so that the detection result for small targets is more accurate.
Table 2 VGG16 network architecture parameters table
[Table 2 (VGG16 network architecture parameters) is provided only as an image in the original document.]
The Low-level Feature Map Module (LFMM) designed by the present invention does not use the feature map of the last Conv5_3 layer of VGG16; instead it selects the three layers Conv2_2, Conv3_3 and Conv4_3, whose feature maps reflect the low-level information of the image, and then concatenates the shallow features, as shown in FIG. 3. Concretely, the LFMM module applies convolution operations with channel numbers 32, 48 and 64 and 3 × 3 kernels to Conv2_2, Conv3_3 and Conv4_3 respectively, connects the results pairwise through Concat with pooling in between, and outputs the final shallow feature map of the image.
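The PyTorch sketch below illustrates one plausible wiring of the LFMM under this description. The VGG16 tap channel widths (128/256/512) are standard, but the exact placement of the three pooling layers, the joint output scale and the resulting 144-channel width are assumptions (the patent later quotes 128 shallow maps), not a verified implementation:

```python
import torch
import torch.nn as nn

class LFMM(nn.Module):
    """Sketch of the Low-level Feature Map Module described above."""
    def __init__(self):
        super().__init__()
        # 3x3 convolutions over the three VGG16 side outputs
        # (Conv2_2: 128 ch, Conv3_3: 256 ch, Conv4_3: 512 ch).
        self.conv2 = nn.Conv2d(128, 32, 3, padding=1)
        self.conv3 = nn.Conv2d(256, 48, 3, padding=1)
        self.conv4 = nn.Conv2d(512, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, c2, c3, c4):
        # Pairwise Concat with a pooling layer after each join, so the
        # output lands at Conv5_3's 1/16 spatial scale (3 pools total).
        x = self.pool(self.conv2(c2))                     # 1/2 -> 1/4
        x = self.pool(torch.cat([x, self.conv3(c3)], 1))  # 1/4 -> 1/8
        x = self.pool(torch.cat([x, self.conv4(c4)], 1))  # 1/8 -> 1/16
        return x                                          # 144 channels

# Example with VGG16-sized taps for a 640x480 image:
lfmm = LFMM()
c2 = torch.randn(1, 128, 320, 240)
c3 = torch.randn(1, 256, 160, 120)
c4 = torch.randn(1, 512, 80, 60)
print(lfmm(c2, c3, c4).shape)  # torch.Size([1, 144, 40, 30])
```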
Step 3.2: feature fusion
In order to enable the detection network to use the shallow and deep features at the same time, the shallow features extracted by LFMM and the deep features obtained from Conv5_3 must be fused. Feature fusion aggregates the information obtained from different convolutional layers in some specific way; as shown in FIG. 4, the invention uses Concat for feature fusion, that is, it stacks the deep and shallow features in the channel dimension. The specific operation is as follows:
The function of the Concat layer is to splice two or more feature maps in the channel dimension. There is no Eltwise-style element-wise operation, so the input feature maps may differ in the channel dimension but must be consistent in all other dimensions (that is, N, H and W must match). For example, if the channels of the two feature maps to be concatenated are k1 and k2, the output of the Concat operation can be expressed as:
N*(k1+k2)*H*W
where N is the number of feature maps (usually the minibatch size), H and W are the height and width of the feature maps, and the channel count of a feature map equals the number of filters that produced it.
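A short sketch of the Concat rule just stated; the channel counts (512 deep, 128 shallow) are illustrative, taken from the dimensionality discussion later in this document:

```python
# Concat fusion: stacking deep and shallow maps along the channel axis.
# Output shape follows the N*(k1+k2)*H*W rule given above.
import torch

deep = torch.randn(4, 512, 20, 15)     # k1 = 512 channels (Conv5_3)
shallow = torch.randn(4, 128, 20, 15)  # k2 = 128 channels (LFMM output)
fused = torch.cat([deep, shallow], dim=1)
print(fused.shape)  # torch.Size([4, 640, 20, 15]) = N*(k1+k2)*H*W
```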
Step 3.3: building an activation Block
As shown in FIG. 4, an activation block (Activate Block, AB) is added after feature fusion; it comprises a fully connected layer, a ReLU layer and a Dropout layer. Adding a large amount of nonlinear expression to the network in this way suppresses the over-fitting problem and improves the network's generalization capability.
The core operation of the fully connected layer is a matrix-vector product, which can be converted into a convolution with a 1 × 1 convolution kernel; the calculation formula is:
y = f( Σ_i w_i · x_i + b )
where x_i is the input of the model, w and b are the neuron parameters (w_i the weight and b the bias), and f(·) is the activation function, which determines the output range.
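A small numerical check of this equivalence (a sketch, not code from the patent): a fully connected layer and a 1 × 1 convolution sharing the same parameters produce identical outputs.

```python
import torch
import torch.nn as nn

fc = nn.Linear(512, 4096)
conv = nn.Conv2d(512, 4096, kernel_size=1)
# Share parameters: reshape the FC weight into a 1x1 conv kernel.
conv.weight.data = fc.weight.data.view(4096, 512, 1, 1)
conv.bias.data = fc.bias.data

x = torch.randn(1, 512)
y_fc = fc(x)
y_conv = conv(x.view(1, 512, 1, 1)).view(1, 4096)
print(torch.allclose(y_fc, y_conv, atol=1e-5))  # True
```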
The role of the ReLU function is to increase the nonlinear relationship between the layers of the neural network; the function is defined as:
f(x) = max(0, x)
In a deep neural network model, the activation rate of the neurons is inversely related to the number of model layers; for example, when the model adds N layers, the activation rate of the ReLU neurons correspondingly decreases by a factor of 2^N.
Dropout discards neurons during forward propagation, with a rejection probability of p and a retention probability of 1-p. The discarded neurons output zero; the loss obtained in this way is back-propagated through the remaining neurons, and the parameters (w, b) are updated using stochastic gradient descent (SGD). Randomly discarding different hidden neurons amounts to training different networks, which over-fit in different ways; their opposite fits cancel each other out, reducing the over-fitting phenomenon as a whole.
Each neuron may be discarded randomly in the training phase, but every neuron must be present at test time, so the weights need to be rescaled:
W_test^(l) = (1 - p) · W^(l)
where p is the discard probability, W^(l) is the weight of layer l obtained in the training phase, and W_test^(l) is the weight used in the test stage.
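A minimal NumPy sketch of this train/test asymmetry, with discard probability p and the (1 - p) weight rescaling at test time (array sizes here are arbitrary):

```python
import numpy as np

p = 0.5                               # discard probability
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))   # trained weights W^(l)
x = np.ones((1, 256))

def forward_train(x, W, p, rng):
    mask = (rng.random(x.shape) >= p).astype(x.dtype)  # keep w.p. 1-p
    return (x * mask) @ W

def forward_test(x, W, p):
    return x @ ((1.0 - p) * W)        # W_test^(l) = (1 - p) * W^(l)

# Averaged over many dropout masks, the training output matches the
# rescaled test output in expectation.
avg = np.mean([forward_train(x, W, p, rng) for _ in range(2000)], axis=0)
print(np.abs(avg - forward_test(x, W, p)).max())  # small residual
```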
Step 3.4: constructing P-RPN networks
In Faster R-CNN, the RPN network convolves the feature map with a 3 × 3 sliding window to obtain a set of target candidate frames. With each pixel of the feature map as a center, three aspect ratios (1:1, 1:2, 2:1) and three scales (128, 256, 512) are used to generate 9 anchors of different sizes. In the pedestrian detection task, however, these generic anchor-box sizes cannot generate candidate frames that fit target pedestrians particularly accurately.
For the Caltech data set, the aspect ratio of the pedestrian anchor boxes is determined by the K-means clustering algorithm, improving the accuracy of the generated candidate boxes. First, the widths and heights of the anchor boxes are computed by K-means; both are proportions relative to the whole input picture. Since the convolutional neural network is translation-invariant, they can be converted directly into proportions relative to the feature map with the conversion formulas:
w = width_anchor × width_input / downsample
h = height_anchor × height_input / downsample
where downsample is the downsampling magnification, width_input and height_input are the width and height of the input image, width_anchor and height_anchor are the width and height of the anchor box relative to the input image, and w and h are the width and height of the anchor box relative to the feature map.
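The two conversion formulas as a small helper (a sketch; the names mirror the notation above, and downsample = 16 assumes VGG16 with four active pooling stages):

```python
def anchor_on_feature_map(width_anchor, height_anchor,
                          width_input, height_input, downsample=16):
    """width_anchor/height_anchor: K-means output, as proportions of
    the input image; returns (w, h) relative to the feature map."""
    w = width_anchor * width_input / downsample
    h = height_anchor * height_input / downsample
    return w, h

# A pedestrian-shaped box on a 640x480 Caltech frame:
print(anchor_on_feature_map(0.05, 0.15, 640, 480))  # (2.0, 4.5)
```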
The distance measure originally used in K-means is the Euclidean distance, which makes the error larger for large anchor boxes during bounding-box regression. To make the error independent of the anchor-box size, the following distance formula is used here instead of the Euclidean distance:
d(box,centroid)=1-IoU(box,centroid)
where IoU(box, centroid) is the intersection-over-union between the generated anchor box and the reference box, and d(box, centroid) measures the dissimilarity between the anchor box and the reference box.
Finally, the aspect ratios of the anchors are determined to be 1:1, 1:2 and 1:3.
In order to make the network more sensitive to small-target pedestrians, the anchor scales are modified to 64, 128 and 256, so that 9 candidate frames better matching the characteristics of small-target pedestrians are finally generated for each sliding window. This RPN structure is referred to herein as P-RPN (RPN for Pedestrian).
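For illustration, a compact K-means sketch using the 1 - IoU distance above to cluster ground-truth (w, h) pairs into k anchor shapes; the boxes are treated as centered at the origin, and the initialization details are assumptions:

```python
import numpy as np

def iou_wh(box, centroids):
    inter = (np.minimum(box[0], centroids[:, 0])
             * np.minimum(box[1], centroids[:, 1]))
    union = box[0] * box[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # d(box, centroid) = 1 - IoU(box, centroid)
        assign = np.array([np.argmin(1 - iou_wh(b, centroids))
                           for b in boxes])
        new = np.array([boxes[assign == i].mean(axis=0)
                        if np.any(assign == i) else centroids[i]
                        for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# boxes: an (N, 2) array of ground-truth pedestrian (w, h); k = 9.
boxes = np.abs(np.random.default_rng(1).normal(50, 20, size=(500, 2)))
print(kmeans_anchors(boxes, k=9).round(1))
```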
Step 3.5: reducing vitamin
The addition of shallow features increases the computational parameters of the network. The original Faster R-CNN pools only the feature map output by Conv5_3 and feeds it into the fully connected layer; after feature fusion, 128 shallow feature maps are added to the network. With a feature map size of 20 × 15 after ROI Pooling, the fully connected layer gains 4096 × 128 × 20 × 15 = 157,286,400 parameters, about 157 million in total. A t-SNE dimensionality reduction module is therefore added after the ROI Pooling layer to reduce the feature dimension before the fully connected layer.
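The arithmetic behind that parameter count, for reference (a check, not patent code):

```python
# 128 added shallow feature maps of spatial size 20 x 15 feeding a
# 4096-unit fully connected layer.
added = 4096 * 128 * 20 * 15
print(added)  # 157286400 extra weights, roughly 1.57e8; this growth is
# what the t-SNE dimensionality-reduction module is meant to offset.
```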
And 4, step 4: training network model
The image training set, the test set and the whole network model have been constructed in steps 1 to 3. In this step, the weights of the network model obtained in step 3 are trained and adjusted on the training data set to optimize the loss until the training loss converges; the final weights yield the trained model, as shown by "1 → 2 → 3 → 4 → 5" in FIG. 1.
The invention selects the deep learning framework TensorFlow 1.8.0 as the experimental platform. The software and hardware configuration of the platform is as follows: Ubuntu 16.04 operating system, 16 GB of memory, an NVIDIA GeForce GTX Titan Xp GPU, and the GPU acceleration libraries CUDA 9.0 and cuDNN 7.6.
The VGG16 network pre-trained on ImageNet is loaded to initialize the weights of the feature extraction network. The SGD stochastic gradient descent algorithm is selected to optimize the network model during training; the initial learning rate is set to 0.001 with a momentum coefficient of 0.9, and the learning rate decays to 0.0001 after 50,000 iterations.
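These optimizer settings, sketched below in PyTorch for brevity (the original experiments used TensorFlow 1.8; the stand-in model and loss are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Decay the learning rate from 0.001 to 0.0001 after 50,000 iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50_000], gamma=0.1)

for step in range(100):            # the paper trains past 50k iterations
    x = torch.randn(8, 10)
    loss = model(x).pow(2).mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()               # stepped per iteration, not per epoch
```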
And 5: model detection effect
In step 5, the trained network model obtained in step 4 is applied to unlabeled test image samples; forward propagation yields the class label and probability estimated for each image, achieving the goal of image recognition. The flow of this step is shown as "1 → 6 → 7 → 8 → 9" in FIG. 1: the image is input into the network model for forward propagation, the probability that it contains a pedestrian is output, and the pedestrian detection task is completed.
The effectiveness of the improved model is verified in two respects: detection effect and detection speed on pedestrians.
The test results are shown in Table 3. They show that the improved LFMM and P-RPN modules of the invention each improve the detection effect, and that combining both LFMM and P-RPN with the original model achieves the highest detection accuracy.
TABLE 3 detection Effect of different models on Caltech dataset
[Table 3 (detection effect of different models on the Caltech data set) is provided only as an image in the original document.]
The detection performance of each model can be seen more intuitively by comparing the P-R curves of the different models on the different test sets. The P-R curves on the Small, Reasonable and All test sets are shown in FIG. 5, FIG. 6 and FIG. 7, respectively. The improved model of the invention clearly improves the detection of small-target pedestrians, with detection performance about 4.12% higher than the original model; its detection of ordinary pedestrians and clearly visible targets is also slightly improved.
To test the detection speed of the improved algorithm, the average single-image detection time of the different models is compared; the results are shown in Table 4. Adding the LFMM module slows down the whole network, because the shallow feature extraction network increases the number of feature maps and thus the computational cost of the network. After feature dimensionality reduction through t-SNE, the detection speed remains essentially unchanged compared with the original model. The improved algorithm therefore improves pedestrian detection accuracy without reducing detection speed.
TABLE 4 comparison of the detection speeds of the different models
[Table 4 (comparison of the detection speeds of the different models) is provided only as an image in the original document.]
The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion of the invention enhances the diversity of training samples and the effectiveness of testing through a series of preprocessing measures on the public data set; the designed LFMM module extracts shallow features closer to the image appearance, and the Concat feature fusion technique fuses the deep and shallow features, enhancing the network model's feature extraction for small-target pedestrians; the t-SNE dimensionality reduction operation reduces the influence of the increased feature parameters on detection speed; and an activation block is designed in the network model to increase its nonlinear expression capability and prevent the over-fitting problem.

Claims (7)

1. The multi-scale detection method of the small target pedestrian based on the multi-semantic feature fusion is characterized by comprising the following steps:
step 1: selecting a pedestrian public data set, preprocessing the pedestrian public data set, converting a data file format, expanding an image data set, and dividing the pedestrian public data set into a training set and a test set;
step 2: improving the Faster R-CNN network model: constructing a shallow feature extraction network LFMM; extracting deep abstract features of pedestrians with a VGG16 network; fusing the shallow features extracted by LFMM and the deep abstract features extracted by VGG16 to obtain feature maps; sending the feature maps into an activation block for activation and into a P-RPN network to generate candidate frames, optimizing the anchor boxes in the RPN network; and then performing a dimensionality-reduction operation on the feature vectors obtained by ROI Pooling to obtain a multi-scale detection model of small-target pedestrians with fused multi-semantic features;
step 3: inputting the training set obtained in step 1 into the multi-scale detection model of the multi-semantic-feature-fused small-target pedestrian obtained in step 2, and optimizing the loss function until it converges, finishing the training of the multi-scale detection model of the multi-semantic-feature-fused small-target pedestrian;
step 4: inputting the test set from step 1 into the multi-scale detection model of the small-target pedestrian with multi-semantic feature fusion trained in step 3, and outputting the detection result, completing the detection of the small-target pedestrian.
2. The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion of claim 1, wherein in step 1, the training set and the test set are divided according to the order of the pedestrian public data set; the video files of the pedestrian public data set are converted into png-format images and the description files into xml format; one image in every 10 frames is stored for the training set and one in every 30 frames for the test set; the data set is expanded by left-right flipping; and the test set is divided according to pedestrian height to obtain the preprocessed training set and test set.
3. The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion of claim 1, wherein in step 2, the constructed shallow feature extraction network LFMM selects Conv2_2, Conv3_3 and Conv4_3 of the VGG16 network, performs convolution operations with channel numbers 32, 48 and 64 and 3 × 3 convolution kernels on Conv2_2, Conv3_3 and Conv4_3 respectively, connects the results pairwise through the Concat method together with 3 pooling layers, and finally extracts the shallow features.
4. The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion of claim 1, wherein in step 2, the shallow features extracted by LFMM and the deep abstract features extracted by VGG16 are fused using the Concat method.
5. The method for multi-scale detection of small target pedestrians based on multi-semantic feature fusion as claimed in claim 1, wherein in step 2, the dimensionality reduction operation is to add a t-SNE dimensionality reduction module after the ROI Pooling layer.
6. The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion of claim 1, wherein in step 2, the P-RPN network optimizes the ratios and scales used to generate anchor boxes in the RPN.
7. The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion of claim 1, wherein in step 2, the activation block comprises a fully connected layer, a ReLU layer and a Dropout layer.
CN202010237758.2A 2020-03-30 2020-03-30 Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion Active CN111460980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010237758.2A CN111460980B (en) 2020-03-30 2020-03-30 Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010237758.2A CN111460980B (en) 2020-03-30 2020-03-30 Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion

Publications (2)

Publication Number Publication Date
CN111460980A true CN111460980A (en) 2020-07-28
CN111460980B CN111460980B (en) 2023-04-07

Family

ID=71685075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010237758.2A Active CN111460980B (en) 2020-03-30 2020-03-30 Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion

Country Status (1)

Country Link
CN (1) CN111460980B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101221A (en) * 2020-09-15 2020-12-18 哈尔滨理工大学 Method for real-time detection and identification of traffic signal lamp
CN112270279A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Multi-dimensional-based remote sensing image micro-target detection method
CN112446308A (en) * 2020-11-16 2021-03-05 北京科技大学 Semantic enhancement-based pedestrian detection method based on multi-scale feature pyramid fusion
CN113052187A (en) * 2021-03-23 2021-06-29 电子科技大学 Global feature alignment target detection method based on multi-scale feature fusion
CN113095418A (en) * 2021-04-19 2021-07-09 航天新气象科技有限公司 Target detection method and system
CN113273992A (en) * 2021-05-11 2021-08-20 清华大学深圳国际研究生院 Signal processing method and device
CN113505640A (en) * 2021-05-31 2021-10-15 东南大学 Small-scale pedestrian detection method based on multi-scale feature fusion
CN116509357A (en) * 2023-05-16 2023-08-01 长春理工大学 Continuous blood pressure estimation method based on multi-scale convolution

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508675A (en) * 2018-11-14 2019-03-22 广州广电银通金融电子科技有限公司 A kind of pedestrian detection method for complex scene
CN109800628A (en) * 2018-12-04 2019-05-24 华南理工大学 A kind of network structure and detection method for reinforcing SSD Small object pedestrian detection performance
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109508675A (en) * 2018-11-14 2019-03-22 广州广电银通金融电子科技有限公司 A kind of pedestrian detection method for complex scene
CN109800628A (en) * 2018-12-04 2019-05-24 华南理工大学 A kind of network structure and detection method for reinforcing SSD Small object pedestrian detection performance

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zheng Dong et al., "Vehicle and pedestrian detection network based on lightweight SSD", Journal of Nanjing Normal University (Natural Science Edition) *
Han Songchen et al., "Small-target object detection algorithm for airport surface based on improved Faster-RCNN", Journal of Nanjing University of Aeronautics & Astronautics *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101221A (en) * 2020-09-15 2020-12-18 哈尔滨理工大学 Method for real-time detection and identification of traffic signal lamp
CN112101221B (en) * 2020-09-15 2022-06-21 哈尔滨理工大学 Method for real-time detection and identification of traffic signal lamp
CN112270279A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Multi-dimensional-based remote sensing image micro-target detection method
CN112270279B (en) * 2020-11-02 2022-04-12 重庆邮电大学 Multi-dimensional-based remote sensing image micro-target detection method
CN112446308A (en) * 2020-11-16 2021-03-05 北京科技大学 Semantic enhancement-based pedestrian detection method based on multi-scale feature pyramid fusion
CN112446308B (en) * 2020-11-16 2024-09-13 北京科技大学 Pedestrian detection method based on semantic enhancement multi-scale feature pyramid fusion
CN113052187A (en) * 2021-03-23 2021-06-29 电子科技大学 Global feature alignment target detection method based on multi-scale feature fusion
CN113095418A (en) * 2021-04-19 2021-07-09 航天新气象科技有限公司 Target detection method and system
CN113095418B (en) * 2021-04-19 2022-02-18 航天新气象科技有限公司 Target detection method and system
CN113273992A (en) * 2021-05-11 2021-08-20 清华大学深圳国际研究生院 Signal processing method and device
CN113505640A (en) * 2021-05-31 2021-10-15 东南大学 Small-scale pedestrian detection method based on multi-scale feature fusion
CN116509357A (en) * 2023-05-16 2023-08-01 长春理工大学 Continuous blood pressure estimation method based on multi-scale convolution

Also Published As

Publication number Publication date
CN111460980B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN111460980B (en) Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion
US11195051B2 (en) Method for person re-identification based on deep model with multi-loss fusion training strategy
CN111126472B (en) SSD (solid State disk) -based improved target detection method
Li et al. Scale-aware fast R-CNN for pedestrian detection
CN110427867B (en) Facial expression recognition method and system based on residual attention mechanism
US20220067335A1 (en) Method for dim and small object detection based on discriminant feature of video satellite data
CN111652236B (en) Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
Lu et al. Deep multi-patch aggregation network for image style, aesthetics, and quality estimation
CN111460914B (en) Pedestrian re-identification method based on global and local fine granularity characteristics
CN111046821B (en) Video behavior recognition method and system and electronic equipment
CN107463892A (en) Pedestrian detection method in a kind of image of combination contextual information and multi-stage characteristics
CN107292246A (en) Infrared human body target identification method based on HOG PCA and transfer learning
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
Yang et al. Real-time pedestrian and vehicle detection for autonomous driving
Wei et al. Pedestrian detection in underground mines via parallel feature transfer network
CN108280421A (en) Human bodys' response method based on multiple features Depth Motion figure
Yuan et al. Few-shot scene classification with multi-attention deepemd network in remote sensing
Buenaposada et al. Improving multi-class Boosting-based object detection
CN113591545B (en) Deep learning-based multi-level feature extraction network pedestrian re-identification method
CN117593794A (en) Improved YOLOv7-tiny model and human face detection method and system based on model
Yang et al. Real-time pedestrian detection for autonomous driving
Sajib et al. A feature based method for real time vehicle detection and classification from on-road videos
CN115761220A (en) Target detection method for enhancing detection of occluded target based on deep learning
Lin et al. Stop line detection and distance measurement for road intersection based on deep learning neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230714

Address after: 710075 Zone C, 3rd Floor, Synergy Building, No. 12 Gaoxin Second Road, High tech Zone, Xi'an City, Shaanxi Province

Patentee after: Zhongfu Software (Xi'an) Co.,Ltd.

Address before: 710048 Shaanxi province Xi'an Beilin District Jinhua Road No. 19

Patentee before: XI'AN POLYTECHNIC University

TR01 Transfer of patent right