CN111460980A - Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion - Google Patents

Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion

Info

Publication number
CN111460980A
CN111460980A (application CN202010237758.2A; granted as CN111460980B)
Authority
CN
China
Prior art keywords
pedestrian
small
feature
network
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010237758.2A
Other languages
Chinese (zh)
Other versions
CN111460980B (en)
Inventor
Xue Tao (薛涛)
Guo Weixia (郭卫霞)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongfu Software (Xi'an) Co.,Ltd.
Original Assignee
Xian Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Polytechnic University filed Critical Xian Polytechnic University
Priority to CN202010237758.2A priority Critical patent/CN111460980B/en
Publication of CN111460980A publication Critical patent/CN111460980A/en
Application granted granted Critical
Publication of CN111460980B publication Critical patent/CN111460980B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion, which comprises the following steps. Step 1: preprocessing the selected public pedestrian data set and dividing it into a training set and a test set. Step 2: improving the Faster R-CNN network model: extracting and fusing shallow features and deep abstract features to obtain feature maps, activating the feature maps, sending them into a P-RPN network to generate candidate frames, optimizing the anchor boxes in the RPN network, and then performing a dimensionality-reduction operation on the feature vector obtained by ROI Pooling to obtain a multi-scale detection model of small-target pedestrians with fused multi-semantic features. Step 3: training the multi-scale detection model of small-target pedestrians with multi-semantic feature fusion. Step 4: detecting small-target pedestrians. The invention improves the detection effect of the network model on small-target pedestrians.

Description

Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion
Technical Field
The invention belongs to the technical field of computer vision based on deep learning, and particularly relates to a multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion.
Background
Conventional pedestrian detection methods are based on manually extracted features, such as Histogram of Oriented Gradients (HOG) features, Local Binary Pattern (LBP) features, and Aggregated Channel Features (ACF); the extracted pedestrian features are then input into a classifier model to perform target detection. In 2003, Viola et al. proposed the VJ algorithm, which combines Haar features with an AdaBoost classifier to achieve fast pedestrian detection. In 2005, Dalal et al. proposed a detection method combining HOG features with an SVM classifier, which effectively improved the accuracy of pedestrian detection.
In recent years, with the rapid development of deep learning, image target detection algorithms based on neural networks have gradually become mainstream. They fall into two main categories. The first is single-stage detection algorithms, including YOLO (You Only Look Once), YOLOv2, SSD (Single Shot MultiBox Detector) and the like; their core idea is to cast target detection as a regression problem, taking the original image as input and directly outputting the position and class judgment results. Single-stage algorithms therefore have a certain advantage in detection speed, but their detection effect on small targets and on objects close to one another is poor. The second category is two-stage detection algorithms, including the Region-based Convolutional Neural Network (R-CNN) and its series of optimizations such as Fast R-CNN and Faster R-CNN; these algorithms first generate a set of candidate regions by a region proposal method and then use a convolutional neural network to classify the regions and regress their bounding boxes. Two-stage algorithms achieve higher detection accuracy at the cost of detection speed and are widely applied in tasks where accuracy matters more than real-time performance.
At present, a large number of research results have been obtained in pedestrian detection. Some works introduce multiple built-in subnetworks on the basis of the Fast R-CNN model to detect multi-scale pedestrians in disjoint ranges; the HyperLearner framework fuses pedestrian features with additional channel features to improve pedestrian detection quality; a new data set, PRW, has been introduced together with a Confidence Weighted Similarity (CWS) measure, which combines model detection scores with a similarity metric to evaluate pedestrian re-identification in the original image; and a pyramid RPN structure has been proposed on the basis of Faster R-CNN to address the multi-scale problem of underground pedestrians, with feature fusion added to the algorithm to enhance the detection performance on small-target underground pedestrians.
These methods perform well in certain scenes, but most pedestrian images come from surveillance videos, vehicle-mounted cameras and the like, so pedestrian detection suffers from low resolution, small target size and large scale variation, which makes accurate detection more difficult.
Disclosure of Invention
The invention aims to provide a multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion, which addresses the low-resolution, small-size and multi-scale problems of pedestrian images obtained in the prior art that make accurate detection difficult.
The technical scheme adopted by the invention is as follows.
the multi-scale detection method of the small target pedestrian based on the multi-semantic feature fusion specifically comprises the following steps:
step 1: selecting a pedestrian public data set, preprocessing the pedestrian public data set, converting a data file format, expanding an image data set, and dividing the pedestrian public data set into a training set and a test set;
step 2: improving the Faster R-CNN network model: constructing a shallow feature extraction network LFMM; extracting deep abstract features of pedestrians with a VGG16 network; fusing the shallow features extracted by LFMM and the deep abstract features extracted by VGG16 to obtain feature maps; sending the feature maps into an activation block for activation and into a P-RPN network to generate candidate frames, optimizing the anchor boxes in the RPN network; and then performing a dimensionality-reduction operation on the feature vector obtained by ROI Pooling to obtain a multi-scale detection model of small-target pedestrians with fused multi-semantic features;
step 3: inputting the training set obtained in step 1 into the multi-scale detection model of the multi-semantic-feature-fused small-target pedestrian obtained in step 2, and optimizing the loss function until it converges, finishing the training of the multi-scale detection model of the multi-semantic-feature-fused small-target pedestrian;
step 4: inputting the test set from step 1 into the multi-scale detection model of the small-target pedestrian with multi-semantic feature fusion trained in step 3, and outputting the detection result, completing the detection of the small-target pedestrian.
The present invention is also characterized in that,
in step 1, the training set and the test set are divided according to the front and back sequence of the pedestrian public data set, a video file of the pedestrian public data set is converted into an image in a png format, a description file of the pedestrian public data set is converted into an xml format, the training set is stored in every 10 frames, the test set is stored in every 30 frames, the data set is expanded by turning left and right, and the test set is divided according to the difference of pedestrian heights to obtain the preprocessed training set and test set.
In step 2, the constructed shallow feature extraction network LFMM selects Conv2_2, Conv3_3 and Conv4_3 of the VGG16 network, performs convolution operations with channel numbers 32, 48 and 64 and 3 × 3 convolution kernels on Conv2_2, Conv3_3 and Conv4_3 respectively, connects the results pairwise through the Concat method together with 3 pooling layers, and finally extracts the shallow features.
In step 2, the shallow features extracted by LFMM and the deep abstract features extracted by VGG16 are fused using the Concat method.
In step 2, the dimensionality reduction operation is to add a t-SNE dimensionality reduction module after an ROI Pooling layer.
In step 2, the P-RPN network optimizes the ratios and scales used to generate anchor boxes in the RPN.
In step 2, the activation block includes a fully connected layer, a ReLU layer and a Dropout layer.
The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion has the following advantages. First, a series of preprocessing measures on the public data set enhances the diversity of training samples and improves the effectiveness of detection. Second, the designed LFMM module extracts shallow features that stay closer to the image appearance, and the feature fusion technique fuses the deep and shallow features, enhancing the network model's feature extraction for small-target pedestrians; meanwhile, the dimensionality-reduction operation reduces the influence of the increased feature parameters on detection speed. In addition, the RPN structure is improved for pedestrian characteristics to address the multi-scale problem of pedestrians. Together these finally improve the detection effect of the network model on small-target pedestrians.
Drawings
FIG. 1 is a flow chart of the multi-scale detection method of small target pedestrians based on multi-semantic feature fusion of the present invention;
FIG. 2 is a network structure diagram of the multi-scale detection method for small target pedestrians based on multi-semantic feature fusion according to the present invention;
FIG. 3 is a block diagram of the LFMM module in the network of FIG. 2 for the multi-scale detection method of small-target pedestrians based on multi-semantic feature fusion in accordance with the present invention;
FIG. 4 is a diagram of a fusion structure of deep and shallow features in the network of FIG. 2 according to the multi-scale detection method of small target pedestrians based on multi-semantic feature fusion of the present invention;
FIG. 5 is a schematic comparison of the P-R curves of the constructed network model and the original model on the Small test set for the multi-scale detection method of small-target pedestrians based on multi-semantic feature fusion of the present invention;
FIG. 6 is a schematic comparison of the P-R curves of the constructed network model and the original model on the Reasonable test set for the multi-scale detection method of small-target pedestrians based on multi-semantic feature fusion of the present invention;
FIG. 7 is a schematic comparison of the P-R curves of the constructed network model and the original model on the All test set for the multi-scale detection method of small-target pedestrians based on multi-semantic feature fusion of the present invention.
Detailed Description
The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion is described in detail below with reference to the accompanying drawings and the detailed description.
The multi-scale detection method of the small target pedestrian based on the multi-semantic feature fusion specifically comprises the following steps:
step 1: selecting a pedestrian public data set, preprocessing the pedestrian public data set, converting a data file format, expanding an image data set, and dividing the pedestrian public data set into a training set and a test set;
step 2: improving the Faster R-CNN network model: constructing a shallow feature extraction network LFMM; extracting deep abstract features of pedestrians with a VGG16 network; fusing the shallow features extracted by LFMM and the deep abstract features extracted by VGG16 to obtain feature maps; sending the feature maps into an activation block for activation and into a P-RPN network to generate candidate frames, optimizing the anchor boxes in the RPN network; and then performing a dimensionality-reduction operation on the feature vector obtained by ROI Pooling to obtain a multi-scale detection model of small-target pedestrians with fused multi-semantic features;
step 3: inputting the training set obtained in step 1 into the multi-scale detection model of the multi-semantic-feature-fused small-target pedestrian obtained in step 2, and optimizing the loss function until it converges, finishing the training of the multi-scale detection model of the multi-semantic-feature-fused small-target pedestrian;
step 4: inputting the test set from step 1 into the multi-scale detection model of the small-target pedestrian with multi-semantic feature fusion trained in step 3, and outputting the detection result, completing the detection of the small-target pedestrian.
Further, in step 1, the training set and the test set are divided according to the order of the pedestrian public data set; the video files of the pedestrian public data set are converted into png-format images and the description files into xml format; one image in every 10 frames is stored for the training set and one in every 30 frames for the test set; the data set is expanded by left-right flipping; and the test set is divided according to pedestrian height, giving the preprocessed training set and test set.
Further, in step 2, the constructed shallow feature extraction network LFMM selects Conv2_2, Conv3_3 and Conv4_3 of the VGG16 network, performs convolution operations with channel numbers 32, 48 and 64 and 3 × 3 convolution kernels on Conv2_2, Conv3_3 and Conv4_3 respectively, connects the results pairwise through the Concat method together with 3 pooling layers, and finally extracts the shallow features.
Further, in step 2, the shallow features extracted by LFMM and the deep abstract features extracted by VGG16 are fused using the Concat method.
Further, in step 2, the dimensionality reduction operation is to add a t-SNE dimensionality reduction module after the ROI Pooling layer.
Further, in step 2, the P-RPN network optimizes the ratios and scales used to generate anchor boxes in the RPN.
Further, in step 2, the activation block includes a fully connected layer, a ReLU layer and a Dropout layer.
The multi-scale detection method of small-target pedestrians based on multi-semantic feature fusion of the invention is further described in detail by specific embodiments.
Examples
The invention discloses a multi-scale detection method for small-target pedestrians based on Faster R-CNN, which comprises the following specific steps, as shown in FIG. 1:
step 1: preparation of the Experimental data set
The experimental data adopts the Caltech Pedestrian public data set from the California Institute of Technology. The data set was captured with a vehicle-mounted camera in an urban road environment and consists of 11 video sets totalling about 10 h; it comprises about 250,000 frames (about 137 minutes), 350,000 pedestrian bounding boxes and 2,300 distinct pedestrians, at an image resolution of 640 × 480 pixels.
Step 2: experimental data set preprocessing
Caltech Pedestrian has 11 video sets in total; the first six, Set00-Set05, are selected as the training set, and the last five, Set06-Set10, as the test set.
Each video set of the source data set is divided into two parts, one part is a video file in seq format, and the other part is a description file in vbb format. The invention converts the video file into the image with the png format and converts the description file into the xml format.
Because consecutive video frames are strongly correlated and their features differ little, to improve training effectiveness the invention stores one image in every 10 frames, giving a training set of 12,963 images; for the test set, one image in every 30 frames is stored, 4,088 images in total. To expand the data set, the resulting images were doubled by left-right flipping, giving final training and test sets of 25,926 and 8,176 images, respectively.
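A minimal sketch of this frame-sampling and flipping scheme, assuming videos already decoded to a format OpenCV can read (the original seq files need a dedicated Caltech decoder) and hypothetical file paths:

```python
# Sketch of the sampling-and-flipping preprocessing described above.
import cv2

def sample_frames(video_path, out_dir, step):
    """Save every `step`-th frame as a png, plus its horizontal flip."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{idx:06d}.png", frame)
            # Left-right flipping doubles the data set, as in step 2.
            cv2.imwrite(f"{out_dir}/frame_{idx:06d}_flip.png",
                        cv2.flip(frame, 1))
        idx += 1
    cap.release()

# Training videos: every 10th frame; test videos: every 30th frame.
# sample_frames("set00_V000.avi", "train", step=10)
# sample_frames("set06_V000.avi", "test",  step=30)
```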
According to the invention, the test set is divided into different levels according to pedestrian size so that detection effects can be compared conveniently; the test-set attributes are shown in Table 1.
TABLE 1 test set Attribute Table
Test set | Test set attributes | Number of images
All | Test set of all images | 8176
Small | Pedestrian height ≤ 50 pixels | 4498
Reasonable | Pedestrian height > 50 pixels | 3678
And step 3: network model improvement
The improved pedestrian detection network structure is shown in FIG. 2. The input image passes through VGG16 and LFMM (the shallow feature extraction network), which extract the deep and shallow features of the image respectively; the features are fused and sent into the P-RPN network to generate candidate frames, and at the same time into an activation block (Activate Block, AB) for activation, which increases the nonlinear expression capability of the network and prevents the over-fitting problem.
Step 3.1: constructing shallow feature extraction modules
The invention selects VGG16 as the feature extraction network; the concrete network parameters are shown in Table 2. The whole network comprises 13 convolutional layers; its depth and the large number of channels per layer allow it to extract rich, abstract high-level semantic features. The network structure is very regular: every convolutional layer systematically uses 3 × 3 convolution kernels, so the network converges quickly. The network is divided into 5 stages and includes 5 pooling layers, which reduce the feature parameters and improve efficiency, but excessive pooling also loses the features of small targets in the image. A shallow feature extraction module is therefore designed to obtain basic shallow features that retain the image appearance while the VGG16 network extracts the deep abstract features; the deep and shallow features are fused (the invention fuses with the feature map after the Conv5_3 convolution, with the 5th-stage pooling operation removed) and sent onward together, so that the detection result for small targets is more accurate.
Table 2 VGG16 network architecture parameters table
[Table 2 (VGG16 network architecture parameters) is provided only as an image in the original document.]
The Low-level Feature Map Module (LFMM) designed by the present invention does not use the feature map of the last Conv5_3 layer of VGG16; instead it selects the three layers Conv2_2, Conv3_3 and Conv4_3, whose feature maps reflect the low-level information of the image, and then concatenates the shallow features, as shown in FIG. 3. Concretely, the LFMM module applies convolution operations with channel numbers 32, 48 and 64 and 3 × 3 kernels to Conv2_2, Conv3_3 and Conv4_3 respectively, connects the results pairwise through Concat with pooling in between, and outputs the final shallow feature map of the image.
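The PyTorch sketch below illustrates one plausible wiring of the LFMM under this description. The VGG16 tap channel widths (128/256/512) are standard, but the exact placement of the three pooling layers, the joint output scale and the resulting 144-channel width are assumptions (the patent later quotes 128 shallow maps), not a verified implementation:

```python
import torch
import torch.nn as nn

class LFMM(nn.Module):
    """Sketch of the Low-level Feature Map Module described above."""
    def __init__(self):
        super().__init__()
        # 3x3 convolutions over the three VGG16 side outputs
        # (Conv2_2: 128 ch, Conv3_3: 256 ch, Conv4_3: 512 ch).
        self.conv2 = nn.Conv2d(128, 32, 3, padding=1)
        self.conv3 = nn.Conv2d(256, 48, 3, padding=1)
        self.conv4 = nn.Conv2d(512, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, c2, c3, c4):
        # Pairwise Concat with a pooling layer after each join, so the
        # output lands at Conv5_3's 1/16 spatial scale (3 pools total).
        x = self.pool(self.conv2(c2))                     # 1/2 -> 1/4
        x = self.pool(torch.cat([x, self.conv3(c3)], 1))  # 1/4 -> 1/8
        x = self.pool(torch.cat([x, self.conv4(c4)], 1))  # 1/8 -> 1/16
        return x                                          # 144 channels

# Example with VGG16-sized taps for a 640x480 image:
lfmm = LFMM()
c2 = torch.randn(1, 128, 320, 240)
c3 = torch.randn(1, 256, 160, 120)
c4 = torch.randn(1, 512, 80, 60)
print(lfmm(c2, c3, c4).shape)  # torch.Size([1, 144, 40, 30])
```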
Step 3.2: feature fusion
In order to enable the detection network to use the shallow and deep features at the same time, the shallow features extracted by LFMM and the deep features obtained from Conv5_3 must be fused. Feature fusion aggregates the information obtained from different convolutional layers in some specific way; as shown in FIG. 4, the invention uses Concat for feature fusion, that is, it stacks the deep and shallow features in the channel dimension. The specific operation is as follows:
The function of the Concat layer is to splice two or more feature maps in the channel dimension. There is no Eltwise-style element-wise operation, so the input feature maps may differ in the channel dimension but must be consistent in all other dimensions (that is, N, H and W must match). For example, if the channels of the two feature maps to be concatenated are k1 and k2, the output of the Concat operation can be expressed as:
N*(k1+k2)*H*W
where N is the number of feature maps (usually the minibatch size), H and W are the height and width of the feature maps, and the channel count of a feature map equals the number of filters that produced it.
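A short sketch of the Concat rule just stated; the channel counts (512 deep, 128 shallow) are illustrative, taken from the dimensionality discussion later in this document:

```python
# Concat fusion: stacking deep and shallow maps along the channel axis.
# Output shape follows the N*(k1+k2)*H*W rule given above.
import torch

deep = torch.randn(4, 512, 20, 15)     # k1 = 512 channels (Conv5_3)
shallow = torch.randn(4, 128, 20, 15)  # k2 = 128 channels (LFMM output)
fused = torch.cat([deep, shallow], dim=1)
print(fused.shape)  # torch.Size([4, 640, 20, 15]) = N*(k1+k2)*H*W
```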
Step 3.3: building an activation Block
As shown in FIG. 4, an activation block (Activate Block, AB) is added after feature fusion; it comprises a fully connected layer, a ReLU layer and a Dropout layer. Adding a large amount of nonlinear expression to the network in this way suppresses the over-fitting problem and improves the network's generalization capability.
The core operation of the fully connected layer is a matrix-vector product, which can be converted into a convolution with a 1 × 1 convolution kernel; the calculation formula is:
y = f( Σ_i w_i · x_i + b )
where x_i is the input of the model, w and b are the neuron parameters (w_i the weight and b the bias), and f(·) is the activation function, which determines the output range.
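A small numerical check of this equivalence (a sketch, not code from the patent): a fully connected layer and a 1 × 1 convolution sharing the same parameters produce identical outputs.

```python
import torch
import torch.nn as nn

fc = nn.Linear(512, 4096)
conv = nn.Conv2d(512, 4096, kernel_size=1)
# Share parameters: reshape the FC weight into a 1x1 conv kernel.
conv.weight.data = fc.weight.data.view(4096, 512, 1, 1)
conv.bias.data = fc.bias.data

x = torch.randn(1, 512)
y_fc = fc(x)
y_conv = conv(x.view(1, 512, 1, 1)).view(1, 4096)
print(torch.allclose(y_fc, y_conv, atol=1e-5))  # True
```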
The role of the ReLU function is to increase the nonlinear relationship between the layers of the neural network; the function is defined as:
f(x) = max(0, x)
In a deep neural network model, the activation rate of the neurons is inversely related to the number of model layers; for example, when the model adds N layers, the activation rate of the ReLU neurons correspondingly decreases by a factor of 2^N.
Dropout discards neurons during forward propagation, with a rejection probability of p and a retention probability of 1-p. The discarded neurons output zero; the loss obtained in this way is back-propagated through the remaining neurons, and the parameters (w, b) are updated using stochastic gradient descent (SGD). Randomly discarding different hidden neurons amounts to training different networks, which over-fit in different ways; their opposite fits cancel each other out, reducing the over-fitting phenomenon as a whole.
Each neuron may be discarded randomly in the training phase, but every neuron must be present at test time, so the weights need to be rescaled:
W_test^(l) = (1 - p) · W^(l)
where p is the discard probability, W^(l) is the weight of layer l obtained in the training phase, and W_test^(l) is the weight used in the test stage.
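A minimal NumPy sketch of this train/test asymmetry, with discard probability p and the (1 - p) weight rescaling at test time (array sizes here are arbitrary):

```python
import numpy as np

p = 0.5                               # discard probability
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 128))   # trained weights W^(l)
x = np.ones((1, 256))

def forward_train(x, W, p, rng):
    mask = (rng.random(x.shape) >= p).astype(x.dtype)  # keep w.p. 1-p
    return (x * mask) @ W

def forward_test(x, W, p):
    return x @ ((1.0 - p) * W)        # W_test^(l) = (1 - p) * W^(l)

# Averaged over many dropout masks, the training output matches the
# rescaled test output in expectation.
avg = np.mean([forward_train(x, W, p, rng) for _ in range(2000)], axis=0)
print(np.abs(avg - forward_test(x, W, p)).max())  # small residual
```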
Step 3.4: constructing P-RPN networks
In Faster R-CNN, the RPN network convolves the feature map with a 3 × 3 sliding window to obtain a set of target candidate frames. With each pixel of the feature map as a center, three aspect ratios (1:1, 1:2, 2:1) and three scales (128, 256, 512) are used to generate 9 anchors of different sizes. In the pedestrian detection task, however, these generic anchor-box sizes cannot generate candidate frames that fit target pedestrians particularly accurately.
For the Caltech data set, the aspect ratio of the pedestrian anchor boxes is determined by the K-means clustering algorithm, improving the accuracy of the generated candidate boxes. First, the widths and heights of the anchor boxes are computed by K-means; both are proportions relative to the whole input picture. Since the convolutional neural network is translation-invariant, they can be converted directly into proportions relative to the feature map with the conversion formulas:
w = width_anchor × width_input / downsample
h = height_anchor × height_input / downsample
where downsample is the downsampling magnification, width_input and height_input are the width and height of the input image, width_anchor and height_anchor are the width and height of the anchor box relative to the input image, and w and h are the width and height of the anchor box relative to the feature map.
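The two conversion formulas as a small helper (a sketch; the names mirror the notation above, and downsample = 16 assumes VGG16 with four active pooling stages):

```python
def anchor_on_feature_map(width_anchor, height_anchor,
                          width_input, height_input, downsample=16):
    """width_anchor/height_anchor: K-means output, as proportions of
    the input image; returns (w, h) relative to the feature map."""
    w = width_anchor * width_input / downsample
    h = height_anchor * height_input / downsample
    return w, h

# A pedestrian-shaped box on a 640x480 Caltech frame:
print(anchor_on_feature_map(0.05, 0.15, 640, 480))  # (2.0, 4.5)
```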
The distance measure originally used in K-means is the Euclidean distance, which makes the error larger for large anchor boxes during bounding-box regression. To make the error independent of the anchor-box size, the following distance formula is used here instead of the Euclidean distance:
d(box,centroid)=1-IoU(box,centroid)
where IoU(box, centroid) is the intersection-over-union between the generated anchor box and the reference box, and d(box, centroid) measures the dissimilarity between the anchor box and the reference box.
Finally, the aspect ratios of the anchors are determined to be 1:1, 1:2 and 1:3.
In order to make the network more sensitive to small-target pedestrians, the anchor scales are modified to 64, 128 and 256, so that 9 candidate frames better matching the characteristics of small-target pedestrians are finally generated for each sliding window. This RPN structure is referred to herein as P-RPN (RPN for Pedestrian).
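For illustration, a compact K-means sketch using the 1 - IoU distance above to cluster ground-truth (w, h) pairs into k anchor shapes; the boxes are treated as centered at the origin, and the initialization details are assumptions:

```python
import numpy as np

def iou_wh(box, centroids):
    inter = (np.minimum(box[0], centroids[:, 0])
             * np.minimum(box[1], centroids[:, 1]))
    union = box[0] * box[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # d(box, centroid) = 1 - IoU(box, centroid)
        assign = np.array([np.argmin(1 - iou_wh(b, centroids))
                           for b in boxes])
        new = np.array([boxes[assign == i].mean(axis=0)
                        if np.any(assign == i) else centroids[i]
                        for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# boxes: an (N, 2) array of ground-truth pedestrian (w, h); k = 9.
boxes = np.abs(np.random.default_rng(1).normal(50, 20, size=(500, 2)))
print(kmeans_anchors(boxes, k=9).round(1))
```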
Step 3.5: reducing vitamin
The addition of shallow features increases the computational parameters of the network. The original Faster R-CNN pools only the feature map output by Conv5_3 and feeds it into the fully connected layer; after feature fusion, 128 shallow feature maps are added to the network. With a feature map size of 20 × 15 after ROI Pooling, the fully connected layer gains 4096 × 128 × 20 × 15 = 157,286,400 parameters, about 157 million in total. A t-SNE dimensionality reduction module is therefore added after the ROI Pooling layer to reduce the feature dimension before the fully connected layer.
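The arithmetic behind that parameter count, for reference (a check, not patent code):

```python
# 128 added shallow feature maps of spatial size 20 x 15 feeding a
# 4096-unit fully connected layer.
added = 4096 * 128 * 20 * 15
print(added)  # 157286400 extra weights, roughly 1.57e8; this growth is
# what the t-SNE dimensionality-reduction module is meant to offset.
```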
And 4, step 4: training network model
The image training set, the test set and the whole network model have been constructed in steps 1 to 3. In this step, the weights of the network model obtained in step 3 are trained and adjusted on the training data set to optimize the loss until the training loss converges; the final weights yield the trained model, as shown by "1 → 2 → 3 → 4 → 5" in FIG. 1.
The invention selects the deep learning framework TensorFlow 1.8.0 as the experimental platform. The software and hardware configuration of the platform is as follows: Ubuntu 16.04 operating system, 16 GB of memory, an NVIDIA GeForce GTX Titan Xp GPU, and the GPU acceleration libraries CUDA 9.0 and cuDNN 7.6.
The VGG16 network pre-trained on ImageNet is loaded to initialize the weights of the feature extraction network. The SGD stochastic gradient descent algorithm is selected to optimize the network model during training; the initial learning rate is set to 0.001 with a momentum coefficient of 0.9, and the learning rate decays to 0.0001 after 50,000 iterations.
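These optimizer settings, sketched below in PyTorch for brevity (the original experiments used TensorFlow 1.8; the stand-in model and loss are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Decay the learning rate from 0.001 to 0.0001 after 50,000 iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50_000], gamma=0.1)

for step in range(100):            # the paper trains past 50k iterations
    x = torch.randn(8, 10)
    loss = model(x).pow(2).mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()               # stepped per iteration, not per epoch
```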
And 5: model detection effect
In step 5, the trained network model obtained in step 4 is applied to unlabeled test image samples; forward propagation yields the class label and probability estimated for each image, achieving the goal of image recognition. The flow of this step is shown as "1 → 6 → 7 → 8 → 9" in FIG. 1: the image is input into the network model for forward propagation, the probability that it contains a pedestrian is output, and the pedestrian detection task is completed.
The effectiveness of the improved model is verified in two respects: detection effect and detection speed on pedestrians.
The test results are shown in Table 3. They show that the improved LFMM and P-RPN modules of the invention each improve the detection effect, and that combining both LFMM and P-RPN with the original model achieves the highest detection accuracy.
TABLE 3 detection Effect of different models on Caltech dataset
[Table 3 (detection effect of different models on the Caltech data set) is provided only as an image in the original document.]
The detection performance of each model can be seen more intuitively by comparing the P-R curves of the different models on the different test sets. The P-R curves on the Small, Reasonable and All test sets are shown in FIG. 5, FIG. 6 and FIG. 7, respectively. The improved model of the invention clearly improves the detection of small-target pedestrians, with detection performance about 4.12% higher than the original model; its detection of ordinary pedestrians and clearly visible targets is also slightly improved.
To test the detection speed of the improved algorithm, the average single-image detection time of the different models is compared; the results are shown in Table 4. Adding the LFMM module slows down the whole network, because the shallow feature extraction network increases the number of feature maps and thus the computational cost of the network. After feature dimensionality reduction through t-SNE, the detection speed remains essentially unchanged compared with the original model. The improved algorithm therefore improves pedestrian detection accuracy without reducing detection speed.
TABLE 4 comparison of the detection speeds of the different models
[Table 4 (comparison of the detection speeds of the different models) is provided only as an image in the original document.]
The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion of the invention enhances the diversity of training samples and the effectiveness of testing through a series of preprocessing measures on the public data set; the designed LFMM module extracts shallow features closer to the image appearance, and the Concat feature fusion technique fuses the deep and shallow features, enhancing the network model's feature extraction for small-target pedestrians; the t-SNE dimensionality reduction operation reduces the influence of the increased feature parameters on detection speed; and an activation block is designed in the network model to increase its nonlinear expression capability and prevent the over-fitting problem.

Claims (7)

1. The multi-scale detection method of the small target pedestrian based on the multi-semantic feature fusion is characterized by comprising the following steps:
step 1: selecting a pedestrian public data set, preprocessing the pedestrian public data set, converting a data file format, expanding an image data set, and dividing the pedestrian public data set into a training set and a test set;
step 2: improving the Faster R-CNN network model: constructing a shallow feature extraction network LFMM; extracting deep abstract features of pedestrians with a VGG16 network; fusing the shallow features extracted by LFMM and the deep abstract features extracted by VGG16 to obtain feature maps; sending the feature maps into an activation block for activation and into a P-RPN network to generate candidate frames, optimizing the anchor boxes in the RPN network; and then performing a dimensionality-reduction operation on the feature vectors obtained by ROI Pooling to obtain a multi-scale detection model of small-target pedestrians with fused multi-semantic features;
step 3: inputting the training set obtained in step 1 into the multi-scale detection model of the multi-semantic-feature-fused small-target pedestrian obtained in step 2, and optimizing the loss function until it converges, finishing the training of the multi-scale detection model of the multi-semantic-feature-fused small-target pedestrian;
step 4: inputting the test set from step 1 into the multi-scale detection model of the small-target pedestrian with multi-semantic feature fusion trained in step 3, and outputting the detection result, completing the detection of the small-target pedestrian.
2. The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion of claim 1, wherein in step 1, the training set and the test set are divided according to the order of the pedestrian public data set; the video files of the pedestrian public data set are converted into png-format images and the description files into xml format; one image in every 10 frames is stored for the training set and one in every 30 frames for the test set; the data set is expanded by left-right flipping; and the test set is divided according to pedestrian height to obtain the preprocessed training set and test set.
3. The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion of claim 1, wherein in step 2, the constructed shallow feature extraction network LFMM selects Conv2_2, Conv3_3 and Conv4_3 of the VGG16 network, performs convolution operations with channel numbers 32, 48 and 64 and 3 × 3 convolution kernels on Conv2_2, Conv3_3 and Conv4_3 respectively, connects the results pairwise through the Concat method together with 3 pooling layers, and finally extracts the shallow features.
4. The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion of claim 1, wherein in step 2, the shallow features extracted by LFMM and the deep abstract features extracted by VGG16 are fused using the Concat method.
5. The method for multi-scale detection of small target pedestrians based on multi-semantic feature fusion as claimed in claim 1, wherein in step 2, the dimensionality reduction operation is to add a t-SNE dimensionality reduction module after the ROI Pooling layer.
6. The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion of claim 1, wherein in step 2, the P-RPN network optimizes the ratios and scales used to generate anchor boxes in the RPN.
7. The multi-scale detection method for small-target pedestrians based on multi-semantic feature fusion of claim 1, wherein in step 2, the activation block comprises a fully connected layer, a ReLU layer and a Dropout layer.
CN202010237758.2A 2020-03-30 2020-03-30 Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion Active CN111460980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010237758.2A CN111460980B (en) 2020-03-30 2020-03-30 Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010237758.2A CN111460980B (en) 2020-03-30 2020-03-30 Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion

Publications (2)

Publication Number Publication Date
CN111460980A true CN111460980A (en) 2020-07-28
CN111460980B CN111460980B (en) 2023-04-07

Family

ID=71685075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010237758.2A Active CN111460980B (en) 2020-03-30 2020-03-30 Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion

Country Status (1)

Country Link
CN (1) CN111460980B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101221A (en) * 2020-09-15 2020-12-18 哈尔滨理工大学 Method for real-time detection and identification of traffic signal lamp
CN112270279A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Multi-dimensional-based remote sensing image micro-target detection method
CN112446308A (en) * 2020-11-16 2021-03-05 北京科技大学 Semantic enhancement-based pedestrian detection method based on multi-scale feature pyramid fusion
CN113052187A (en) * 2021-03-23 2021-06-29 电子科技大学 Global feature alignment target detection method based on multi-scale feature fusion
CN113095418A (en) * 2021-04-19 2021-07-09 航天新气象科技有限公司 Target detection method and system
CN113273992A (en) * 2021-05-11 2021-08-20 清华大学深圳国际研究生院 Signal processing method and device
CN113505640A (en) * 2021-05-31 2021-10-15 东南大学 Small-scale pedestrian detection method based on multi-scale feature fusion
CN116509357A (en) * 2023-05-16 2023-08-01 长春理工大学 Continuous blood pressure estimation method based on multi-scale convolution

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508675A (en) * 2018-11-14 2019-03-22 广州广电银通金融电子科技有限公司 A kind of pedestrian detection method for complex scene
CN109800628A (en) * 2018-12-04 2019-05-24 华南理工大学 A kind of network structure and detection method for reinforcing SSD Small object pedestrian detection performance
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109508675A (en) * 2018-11-14 2019-03-22 广州广电银通金融电子科技有限公司 A kind of pedestrian detection method for complex scene
CN109800628A (en) * 2018-12-04 2019-05-24 华南理工大学 A kind of network structure and detection method for reinforcing SSD Small object pedestrian detection performance

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zheng Dong et al., "Vehicle and pedestrian detection network based on lightweight SSD", Journal of Nanjing Normal University (Natural Science Edition) *
Han Songchen et al., "Small-target object detection algorithm for airport surface based on improved Faster-RCNN", Journal of Nanjing University of Aeronautics & Astronautics *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101221A (en) * 2020-09-15 2020-12-18 哈尔滨理工大学 Method for real-time detection and identification of traffic signal lamp
CN112101221B (en) * 2020-09-15 2022-06-21 哈尔滨理工大学 Method for real-time detection and identification of traffic signal lamp
CN112270279A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Multi-dimensional-based remote sensing image micro-target detection method
CN112270279B (en) * 2020-11-02 2022-04-12 重庆邮电大学 Multi-dimensional-based remote sensing image micro-target detection method
CN112446308A (en) * 2020-11-16 2021-03-05 北京科技大学 Semantic enhancement-based pedestrian detection method based on multi-scale feature pyramid fusion
CN112446308B (en) * 2020-11-16 2024-09-13 北京科技大学 Pedestrian detection method based on semantic enhancement multi-scale feature pyramid fusion
CN113052187A (en) * 2021-03-23 2021-06-29 电子科技大学 Global feature alignment target detection method based on multi-scale feature fusion
CN113095418A (en) * 2021-04-19 2021-07-09 航天新气象科技有限公司 Target detection method and system
CN113095418B (en) * 2021-04-19 2022-02-18 航天新气象科技有限公司 Target detection method and system
CN113273992A (en) * 2021-05-11 2021-08-20 清华大学深圳国际研究生院 Signal processing method and device
CN113505640A (en) * 2021-05-31 2021-10-15 东南大学 Small-scale pedestrian detection method based on multi-scale feature fusion
CN116509357A (en) * 2023-05-16 2023-08-01 长春理工大学 Continuous blood pressure estimation method based on multi-scale convolution

Also Published As

Publication number Publication date
CN111460980B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN111460980B (en) Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion
US11195051B2 (en) Method for person re-identification based on deep model with multi-loss fusion training strategy
CN111126472B (en) SSD (solid State disk) -based improved target detection method
Li et al. Scale-aware fast R-CNN for pedestrian detection
CN110427867B (en) Facial expression recognition method and system based on residual attention mechanism
US20220067335A1 (en) Method for dim and small object detection based on discriminant feature of video satellite data
CN111652236B (en) Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
Lu et al. Deep multi-patch aggregation network for image style, aesthetics, and quality estimation
CN111460914B (en) Pedestrian re-identification method based on global and local fine granularity characteristics
CN111046821B (en) Video behavior recognition method and system and electronic equipment
CN107463892A (en) Pedestrian detection method in a kind of image of combination contextual information and multi-stage characteristics
CN107292246A (en) Infrared human body target identification method based on HOG PCA and transfer learning
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
Yang et al. Real-time pedestrian and vehicle detection for autonomous driving
Wei et al. Pedestrian detection in underground mines via parallel feature transfer network
CN108280421A (en) Human bodys' response method based on multiple features Depth Motion figure
Yuan et al. Few-shot scene classification with multi-attention deepemd network in remote sensing
Buenaposada et al. Improving multi-class Boosting-based object detection
CN113591545B (en) Deep learning-based multi-level feature extraction network pedestrian re-identification method
CN117593794A (en) Improved YOLOv7-tiny model and human face detection method and system based on model
Yang et al. Real-time pedestrian detection for autonomous driving
Sajib et al. A feature based method for real time vehicle detection and classification from on-road videos
CN115761220A (en) Target detection method for enhancing detection of occluded target based on deep learning
Lin et al. Stop line detection and distance measurement for road intersection based on deep learning neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230714

Address after: 710075 Zone C, 3rd Floor, Synergy Building, No. 12 Gaoxin Second Road, High tech Zone, Xi'an City, Shaanxi Province

Patentee after: Zhongfu Software (Xi'an) Co.,Ltd.

Address before: 710048 Shaanxi province Xi'an Beilin District Jinhua Road No. 19

Patentee before: XI'AN POLYTECHNIC University

TR01 Transfer of patent right