CN113537013A - Multi-scale self-attention feature fusion pedestrian detection method

Multi-scale self-attention feature fusion pedestrian detection method

Info

Publication number: CN113537013A
Application number: CN202110761053.5A
Authority: CN (China)
Prior art keywords: pedestrian, network, pedestrian detection, pedestrians, image
Priority date: 2021-07-06
Filing date: 2021-07-06
Publication date: 2021-10-22
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other language: Chinese (zh)
Inventor: 张凯 (Zhang Kai)
Current and original assignee: Harbin University of Science and Technology (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Application filed by Harbin University of Science and Technology

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Abstract

The invention discloses a multi-scale self-attention feature fusion pedestrian detection method, which belongs to the field of artificial intelligence and comprises the following steps: (1) acquiring pedestrian detection image data; (2) designing anchor sizes for pedestrians in the image; (3) dividing the pedestrian data set into positive and negative samples; (4) building a pedestrian detection model. The invention adopts Faster R-CNN as the pedestrian detection framework and proposes a multi-scale feature fusion network model; the model extracts more, and more effective, feature information while avoiding overfitting; training on a higher-performance GPU greatly increases training speed; and the receptive field is enlarged so that small targets can be detected without reducing resolution, making the method well suited to accurate and rapid pedestrian detection.

Description

Multi-scale self-attention feature fusion pedestrian detection method
Technical field:
The invention relates to a multi-scale self-attention feature fusion pedestrian detection method and belongs to the field of artificial intelligence.
Background art:
Pedestrian detection, an area of great research value in computer vision, is widely applied in fields such as autonomous driving, human behavior analysis, intelligent transportation, and intelligent video surveillance; it is the first step in applications such as driver assistance, intelligent video surveillance, and human behavior analysis, and in recent years has also been applied in emerging fields such as aerial imagery and victim rescue; because human postures are random, body shapes are complex and changeable, and problems such as adhesion and occlusion exist, the theory and technology for accurately detecting pedestrians in various scenes still require deep exploration and research; the pedestrian detection process takes an input picture or video, judges whether it contains pedestrians, and, if it does, outputs the pedestrian information.
The core of traditional pedestrian detection algorithms is to extract and classify features of the input information, and a manually designed feature extractor can hardly analyze pedestrian information comprehensively, so the detection effect of such methods is not ideal; with the improvement of computing power, deep learning has developed rapidly in recent years, and deep-learning-based methods have markedly improved target detection; accordingly, a multi-scale self-attention feature fusion pedestrian detection method is presented herein; taking Faster R-CNN as the core, the feature extraction network of the framework is redesigned, reducing the complexity of the framework and improving the detection efficiency of the model while maintaining detection accuracy.
Summary of the invention:
To solve the problems of low detection efficiency and poor detection effect in traditional pedestrian target detection algorithms, a multi-scale self-attention feature fusion pedestrian detection algorithm is provided.
Therefore, the invention provides the following technical scheme:
1. a multi-scale self-attention feature fusion pedestrian detection method is characterized by comprising the following steps:
step 1: acquiring pedestrian detection image data;
step 2: detecting the design of the sizes of the pedestrians in the image;
and step 3: dividing positive and negative samples of a pedestrian data set;
and 4, step 4: and (5) building a pedestrian detection model.
2. The method as claimed in claim 1, wherein in step 1, pedestrian detection image data are acquired; the pedestrian detection data set was collected for research on detecting upright pedestrians in images and videos; pedestrians in the data set are mainly upright and taller than 100 pixels; the images have high definition and are mostly selected from personal photos, Google, and camera shots.
3. The multi-scale self-attention feature fusion pedestrian detection method as claimed in claim 1, wherein in step 2, the anchor sizes for pedestrians in the image are designed as follows: the pedestrian data set is re-clustered with the K-means algorithm to obtain anchor sizes better suited to the model; to iterate to a final result more quickly and accurately, IoU (the overlap between a ground-truth box and a cluster-center box) is used as the measure, i.e., Distance = 1 - IoU(Box, Center), where Box denotes the set of ground-truth bounding boxes and Center denotes the set of cluster-center boxes; clustering the pedestrian data finally yields 4 center points: (12, 22), (25, 38), (41, 77), (48, 91).
4. The multi-scale self-attention feature fusion pedestrian detection method as claimed in claim 1, wherein in step 3, the pedestrian data set is divided into positive and negative samples as follows: the images are labeled with a target detection labeling tool and formatted to a fixed size; pedestrian image information is then obtained and divided into two categories, containing pedestrians and not containing pedestrians; and the collected pedestrian image data are divided by random sampling, in a fixed proportion, into disjoint training and test sets.
5. The multi-scale self-attention feature fusion pedestrian detection method as claimed in claim 1, wherein in step 4, the pedestrian detection model is built as follows:
Step 4-1: using a lightweight network model reduces network parameters and computation, effectively improving pedestrian detection efficiency; MobileNetV2 is a lightweight network and is applied to the feature information extraction task;
Step 4-2: a deep network has a large receptive field and strong semantic abstraction capability, but its feature map has few pixels, loses much detail information, and lacks the ability to express spatial geometric features; a shallow network has a small receptive field and strong spatial representation capability, but despite its high pixel count its semantic abstraction capability is weak; fusing deep semantic information with shallow high-resolution features forms multi-scale feature maps for the coordinate-box prediction and classification tasks, thereby improving detection precision;
Step 4-3: SENet is a simple and effective attention-mechanism network model with low complexity and a small computational load; the SENet module lets the network fit on the channel dimension of the feature map, strengthening discriminative features in the image and suppressing non-salient features; SENet modules are embedded before the multi-scale feature fusion to apply convolution processing to the feature maps.
Advantageous effects:
Through the above technical scheme, the invention has the following beneficial effects: a multi-scale self-attention feature fusion pedestrian detection method is provided; first, the lightweight model MobileNetV2 is used as the feature extraction network of Faster R-CNN, improving pedestrian detection efficiency; second, multi-scale feature fusion is performed on feature layers of different scales of the MobileNetV2 network, making full use of high-level semantic information and low-level high-resolution features; finally, the multi-scale feature layers are used to design a scale-adaptive feature fusion network, with SENet modules connected between the multi-scale feature layers to assign different weights to different channels of the feature layers; for the specific application of pedestrian target detection, the anchors are re-clustered with the K-means method to determine their sizes; compared with the prior art, the invention has the following advantages:
1. The method takes Faster R-CNN as the framework and uses MobileNetV2 in place of VGG16 as the feature extraction module of the Faster R-CNN network, reducing the computation and model size of the network and improving the detection efficiency of Faster R-CNN.
2. Multi-scale feature fusion is introduced: feature layers of different scales of the MobileNetV2 network are fused to obtain feature maps combining high-resolution features with rich semantic information, providing more detailed information to the target prediction part of the algorithm and alleviating the limitation of a single pedestrian detection scale.
3. SENet modules join the feature maps of four different scales, adding an attention mechanism on the channel dimension of the network and assigning different weights to different channels of the feature layers, further improving detection efficiency.
4. For the specific application of pedestrian target detection, the anchors are re-clustered with the K-means method to determine their sizes, and new sizes are designed for the anchors of the Faster R-CNN network, so that the network better fits the structural characteristics of pedestrians and predicts target positions more easily and accurately.
The Faster R-CNN is improved mainly in the following aspects:
1. Using a lightweight network model reduces network parameters and computation, effectively improving pedestrian detection efficiency; MobileNetV2 is a lightweight network and is applied to the feature information extraction task; unlike the traditional convolution structure, MobileNetV2 uses a convolution scheme combining an inverted residual structure with a bottleneck layer, effectively reducing the computation and model parameters of the network and improving pedestrian target detection efficiency while maintaining detection precision; the residual structure in ResNet first reduces the dimension of the input, then convolves it, and finally raises the dimension again; drawing on the idea of ResNet, MobileNetV2 proposes the inverted residual structure: a 1×1 convolution first expands the channels of the input feature map, a 3×3 depthwise separable convolution completes feature extraction, and a 1×1 convolution then compresses the channels; this process is the opposite of the residual structure and lets the network learn target features more fully (a code sketch of this block follows this list).
2. The working principle of a convolutional neural network is to extract target features by fitting convolution kernels layer by layer, and one important concept here is the receptive field; a deep network has a large receptive field and strong semantic abstraction capability, but its feature map has few pixels, so much detail information is lost and the ability to express spatial geometric features is lacking; a shallow network has a small receptive field and strong spatial representation capability, but despite its high pixel count its semantic abstraction capability is weak; fusing deep semantic information with shallow high-resolution features forms multi-scale feature maps for the coordinate-box prediction and classification tasks, improving detection precision.
3. SENet modules connect the feature maps of four different scales, adding an attention mechanism on the channel dimension of the network; the SENet module lets the network fit on the channel dimension of the feature maps, strengthening discriminative features in the image and suppressing non-salient features.
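As referenced in item 1 above, the inverted residual block can be sketched as follows; this is a minimal PyTorch rendering (the framework choice, the expansion factor of 6, and the ReLU6 activations are assumptions taken from the published MobileNetV2 design, not details fixed by this patent):

import torch.nn as nn

class InvertedResidual(nn.Module):
    """Inverted residual block: 1x1 expand -> 3x3 depthwise -> 1x1 compress."""
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        # Shortcut only when the block preserves resolution and channel count
        self.use_shortcut = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 convolution expands the channels (opposite of ResNet's reduce-first order)
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution performs the feature extraction
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 convolution compresses the channels back (linear, no activation)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out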
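The squeeze-and-excitation operation referenced in item 3 can be sketched as follows; this is a minimal PyTorch rendering assuming the standard SENet design (the reduction ratio of 16 is an assumption, not a value stated in the patent):

import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: reweight feature-map channels by learned importance."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        # Squeeze: global average pooling over the spatial dimensions
        w = x.mean(dim=(2, 3))
        # Excitation: per-channel weights in (0, 1)
        w = self.fc(w).view(b, c, 1, 1)
        # Strengthen discriminative channels, suppress non-salient ones
        return x * w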
Description of the drawings:
FIG. 1 is a flow chart of a pedestrian detection method implementing multi-scale self-attention feature fusion in accordance with the present invention;
FIG. 2 is a simplified diagram of the network architecture for extracting image features before the improvement of the present invention;
FIG. 3 is a simplified diagram of the network architecture for extracting image features after the improvement of the present invention;
FIG. 4 and FIG. 6 are diagrams illustrating the detection effect before the improvement of the present invention;
FIG. 5 and FIG. 7 are diagrams illustrating the detection effect after the improvement of the present invention;
FIG. 8 is a graph comparing the performance of the present invention with other methods.
Detailed Description
The present invention is further described below in conjunction with the appended drawings to enable one skilled in the art to practice the invention with reference to the description.
The technical scheme adopted by the invention is as follows: a multi-scale self-attention feature fusion pedestrian detection method comprises the following steps:
(1) acquiring pedestrian detection image data;
(2) designing anchor sizes for pedestrians in the image;
(3) dividing the pedestrian data set into positive and negative samples;
(4) building a pedestrian detection model.
The invention is described in further detail below with reference to the accompanying drawings, and provides a multi-scale self-attention feature fusion pedestrian detection method, wherein the training steps are as shown in fig. 1:
the acquisition of pedestrian detection image data, pedestrian detection data sets collected in research work to detect upright pedestrians in images and videos, pedestrians in data sets being mainly in an upright state and higher than 100 pixels, high in picture definition, and pictures mostly selected from personal photos, google, and camera shots.
The anchor sizes for pedestrians in the image are designed as follows: the pedestrian data set is re-clustered with the K-means algorithm to obtain anchor sizes better suited to the model; to iterate to a final result more quickly and accurately, IoU (the overlap between a ground-truth box and a cluster-center box) is used as the measure, i.e., Distance = 1 - IoU(Box, Center), where Box denotes the set of ground-truth bounding boxes and Center denotes the set of cluster-center boxes; clustering the pedestrian data finally yields 4 center points: (12, 22), (25, 38), (41, 77), (48, 91).
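A minimal sketch of this re-clustering step, assuming each ground-truth box is reduced to a (width, height) pair and boxes are compared corner-aligned, as is usual for anchor clustering (the data-loading code is omitted):

import numpy as np

def iou_wh(boxes, centers):
    """IoU between (w, h) pairs, with boxes and centers corner-aligned."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0])
             * np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
        + centers[None, :, 0] * centers[None, :, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=4, iters=300, seed=0):
    """K-means with Distance = 1 - IoU(Box, Center), as defined above."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes, centers), axis=1)
        # Empty clusters keep their previous center
        new_centers = np.array([
            boxes[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

# boxes: (N, 2) array of ground-truth (width, height) pairs from the pedestrian set;
# with k=4 the centers should approach sizes such as (12, 22), ..., (48, 91).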
The pedestrian data set is divided into positive and negative samples as follows: the images are labeled with a target detection labeling tool and formatted to a fixed size; pedestrian image information is then obtained and divided into two categories, containing pedestrians and not containing pedestrians; and the collected pedestrian image data are divided by random sampling, in a fixed proportion, into disjoint training and test sets.
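A minimal sketch of the random, non-overlapping split described above (the 4:1 ratio and the list-of-file-paths representation are illustrative assumptions; the patent only specifies "a certain proportion"):

import random

def split_dataset(image_paths, train_ratio=0.8, seed=42):
    """Randomly divide labeled pedestrian images into disjoint train/test sets."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train = int(len(paths) * train_ratio)
    # The two slices share no elements, so the sets are independent and non-repeating
    return paths[:n_train], paths[n_train:]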
The pedestrian detection model is built as follows:
Step 1: using a lightweight network model reduces network parameters and computation, effectively improving pedestrian detection efficiency; MobileNetV2 is a lightweight network and is applied to the feature information extraction task;
Step 2: a deep network has a large receptive field and strong semantic abstraction capability, but its feature map has few pixels, loses much detail information, and lacks the ability to express spatial geometric features; a shallow network has a small receptive field and strong spatial representation capability, but despite its high pixel count its semantic abstraction capability is weak; fusing deep semantic information with shallow high-resolution features forms multi-scale feature maps for the coordinate-box prediction and classification tasks, thereby improving detection precision;
Step 3: SENet is a simple and effective attention-mechanism network model with low complexity and a small computational load; the SENet module lets the network fit on the channel dimension of the feature map, strengthening discriminative features in the image and suppressing non-salient features; SENet modules are embedded before the multi-scale feature fusion to apply convolution processing to the feature maps.
Compared with the traditional model of FIG. 2, the built model is shown in FIG. 3; the main processing flow is as follows (a code sketch follows the list):
1) inputting the original data into the MobileNetV2 network to obtain a preliminary feature map;
2) selecting the outputs of bottleneck 4, bottleneck 5, and bottleneck 7 for multi-scale feature fusion, connecting a SENet module after each of the three feature extraction layers, and fitting the outputs of the three layers with the SENet modules;
3) expanding the channel dimension of the SENet-fitted feature maps to 320 with 1×1 convolutions;
4) processing the output of bottleneck 7 with SENet + 1×1 convolution to obtain the first prediction layer P1;
5) upsampling P1 by 2× and adding it to the SENet + 1×1 convolution output of bottleneck 5 to obtain P2;
6) upsampling P2 by 2× and adding it to the SENet + 1×1 convolution output of bottleneck 4 to obtain P3;
7) feeding P1, P2, and P3 into the RPN to generate target candidate boxes, producing target candidate regions for P1, P2, and P3 according to the candidate-box sizes;
8) post-processing the target candidate regions to predict and localize.
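A minimal sketch of steps 1) to 8), reusing the SEBlock class sketched earlier; the channel counts of the three bottleneck outputs, and the assumption that each selected layer has twice the spatial resolution of the next (implied by the 2× upsampling above), are illustrative, and the RPN and post-processing stages are omitted:

import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """SENet + 1x1 conv on bottleneck 4/5/7 outputs, fused top-down into P1-P3."""
    def __init__(self, ch4, ch5, ch7, out_ch=320):
        super().__init__()
        # One SENet module per feature extraction layer (steps 2-3);
        # SEBlock is the squeeze-and-excitation module sketched earlier
        self.se4, self.se5, self.se7 = SEBlock(ch4), SEBlock(ch5), SEBlock(ch7)
        # 1x1 convolutions expand each channel dimension to 320
        self.proj4 = nn.Conv2d(ch4, out_ch, 1)
        self.proj5 = nn.Conv2d(ch5, out_ch, 1)
        self.proj7 = nn.Conv2d(ch7, out_ch, 1)

    def forward(self, c4, c5, c7):
        p1 = self.proj7(self.se7(c7))                                      # step 4
        p2 = F.interpolate(p1, scale_factor=2) + self.proj5(self.se5(c5))  # step 5
        p3 = F.interpolate(p2, scale_factor=2) + self.proj4(self.se4(c4))  # step 6
        return p1, p2, p3  # handed to the RPN in step 7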
FIGS. 4, 5, 6 and 7 show the detection effect before and after the improvement; two groups were selected for comparison, with two street scenes taken as study samples; FIGS. 4 and 6 show detection results without the multi-scale self-attention feature fusion improvement, and FIGS. 5 and 7 show detection results with it; items detected with low precision or missed before the improvement are detected with clearly higher precision afterwards, and missed detections are alleviated; overall, the multi-scale self-attention feature fusion improves detection precision.
Precision, Recall, and Average Precision (AP) are adopted in the experiments as quantitative evaluation indexes of algorithm performance; they are expressed as follows:
Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

AP = ∫₀¹ Precision(Recall) dRecall
where TP (True Positive) is the number of prediction boxes with IoU > 0.7 (each Ground Truth participates only once); FP (False Positive) is the number of prediction boxes with IoU < 0.7 (or redundant prediction boxes for an already-predicted GT); FN (False Negative) is the number of GTs not detected; and AP is the area under the Precision-Recall curve; to better assess the feasibility of the algorithm, we also follow the log-average miss rate over false positives per image (denoted MR⁻²) evaluation strategy.
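A minimal sketch of the Precision/Recall/AP computation defined above (it assumes the IoU > 0.7 matching against ground truths has already marked each prediction as TP or FP):

import numpy as np

def precision_recall_ap(matched, scores, num_gt):
    """matched[i] = 1 if prediction i matched an unused GT at IoU > 0.7, else 0;
    scores[i] = its confidence; num_gt = total number of ground-truth boxes."""
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(matched)[order])
    fp = np.cumsum(1 - np.asarray(matched)[order])
    precision = tp / (tp + fp)
    recall = tp / num_gt
    # AP: area under the Precision-Recall curve (trapezoidal integration)
    ap = np.trapz(precision, recall)
    return precision, recall, ap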
To illustrate the effectiveness of the algorithm herein, we compare it with current high-performance algorithms, including Faster R-CNN, BF+RPN, CSP, PBM, and OR-CNN, as well as the one-stage algorithms YOLO9000, SSD, and ALFNet; the same experimental configuration is adopted, and the algorithms are evaluated on the INRIA and CityPersons data sets; FIG. 8 shows the MR⁻² performance of the currently popular algorithms on CityPersons; the comparison of detection precision and computation speed for each algorithm is shown in Table 1.
TABLE 1 Evaluation results of different methods [table provided as an image in the original publication]
The Faster R-CNNK is an innovative algorithm structure: the MobileNetV2 network is used for the feature extraction module, a SENet attention mechanism is added to the outputs of bottleneck 4, bottleneck 5, and bottleneck 7 of the network, multi-scale feature fusion is performed, and the anchor-box dimensions are re-clustered; as can be seen from the table, the detection precision and MR⁻² of the improved Faster R-CNNK structure are better than those of the other algorithms; although the detection efficiency of Faster R-CNNK is slightly lower than that of YOLO9000, its detection effect is far better; the CSP, ALFNet, and Faster R-CNNK models were trained on the INRIA data set and then evaluated with the test set; taken together, the Faster R-CNNK herein achieves the best trade-off between detection efficiency and detection effect; the experimental data demonstrate that the model is sound and can be applied to practical situations; in addition, this work only verifies that multi-scale self-attention feature fusion is effective for pedestrian detection; whether it is equally effective for other detection tasks is not addressed and requires further follow-up work.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof; the present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein; any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted for clarity only; those skilled in the art should take the description as a whole, and the embodiments may be combined as appropriate to form other embodiments understandable to those skilled in the art.

Claims (5)

1. A multi-scale self-attention feature fusion pedestrian detection method is characterized by comprising the following steps:
Step 1: acquiring pedestrian detection image data;
Step 2: designing anchor sizes for pedestrians in the image;
Step 3: dividing the pedestrian data set into positive and negative samples;
Step 4: building a pedestrian detection model.
2. The multi-scale self-attention feature fusion pedestrian detection method of claim 1, wherein in step 1, pedestrian detection image data are acquired; the pedestrian detection data set was collected for research on detecting upright pedestrians in images and videos; pedestrians in the data set are mainly upright and taller than 100 pixels; the images have high definition and are mostly selected from personal photos, Google, and camera shots.
3. The multi-scale self-attention feature fusion pedestrian detection method of claim 1, wherein in step 2, the anchor sizes for pedestrians in the image are designed as follows: the pedestrian data set is re-clustered with the K-means algorithm to obtain anchor sizes better suited to the model; to iterate to a final result more quickly and accurately, IoU, i.e., the overlap between a ground-truth box and a cluster-center box, is used as the measure, with the objective function Distance = 1 - IoU(Box, Center), where Box denotes the set of ground-truth bounding boxes and Center denotes the set of cluster-center boxes; clustering the pedestrian data finally yields 4 center points: (12, 22), (25, 38), (41, 77), (48, 91).
4. The multi-scale self-attention feature fusion pedestrian detection method of claim 1, wherein in step 3, the pedestrian data set is divided into positive and negative samples as follows: the images are labeled with a target detection labeling tool and formatted to a fixed size; pedestrian image information is then obtained and divided into two categories, containing pedestrians and not containing pedestrians; and the collected pedestrian image data are divided by random sampling, in a fixed proportion, into disjoint training and test sets.
5. The multi-scale self-attention feature fusion pedestrian detection method of claim 1, wherein in step 4, the pedestrian detection model is built as follows:
Step 4-1: using a lightweight network model reduces network parameters and computation, effectively improving pedestrian detection efficiency; MobileNetV2 is a lightweight network and is applied to the feature information extraction task;
Step 4-2: a deep network has a large receptive field and strong semantic abstraction capability, but its feature map has few pixels, loses much detail information, and lacks the ability to express spatial geometric features; a shallow network has a small receptive field and strong spatial representation capability, but despite its high pixel count its semantic abstraction capability is weak; fusing deep semantic information with shallow high-resolution features forms multi-scale feature maps for the coordinate-box prediction and classification tasks, thereby improving detection precision;
Step 4-3: the SENet module is a simple and effective attention-mechanism network model with low complexity and a small computational load; it lets the network fit on the channel dimension of the feature map, strengthening discriminative features in the image and suppressing non-salient features; SENet modules are embedded before the multi-scale feature fusion to apply convolution processing to the feature maps.
CN202110761053.5A 2021-07-06 2021-07-06 Multi-scale self-attention feature fusion pedestrian detection method Pending CN113537013A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110761053.5A CN113537013A (en) 2021-07-06 2021-07-06 Multi-scale self-attention feature fusion pedestrian detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110761053.5A CN113537013A (en) 2021-07-06 2021-07-06 Multi-scale self-attention feature fusion pedestrian detection method

Publications (1)

Publication Number Publication Date
CN113537013A true CN113537013A (en) 2021-10-22

Family

ID=78097827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110761053.5A Pending CN113537013A (en) 2021-07-06 2021-07-06 Multi-scale self-attention feature fusion pedestrian detection method

Country Status (1)

Country Link
CN (1) CN113537013A (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188807A (en) * 2019-05-21 2019-08-30 重庆大学 Tunnel pedestrian target detection method based on cascade super-resolution network and improvement Faster R-CNN
CN110533084A (en) * 2019-08-12 2019-12-03 长安大学 A kind of multiscale target detection method based on from attention mechanism
CN110490252A (en) * 2019-08-19 2019-11-22 西安工业大学 A kind of occupancy detection method and system based on deep learning
AU2019101142A4 (en) * 2019-09-30 2019-10-31 Dong, Qirui MR A pedestrian detection method with lightweight backbone based on yolov3 network
CN110929577A (en) * 2019-10-23 2020-03-27 桂林电子科技大学 Improved target identification method based on YOLOv3 lightweight framework
CN110929578A (en) * 2019-10-25 2020-03-27 南京航空航天大学 Anti-blocking pedestrian detection method based on attention mechanism
CN111191531A (en) * 2019-12-17 2020-05-22 中南大学 Rapid pedestrian detection method and system
CN112200161A (en) * 2020-12-03 2021-01-08 北京电信易通信息技术股份有限公司 Face recognition detection method based on mixed attention mechanism
CN112733749A (en) * 2021-01-14 2021-04-30 青岛科技大学 Real-time pedestrian detection method integrating attention mechanism
CN112949507A (en) * 2021-03-08 2021-06-11 平安科技(深圳)有限公司 Face detection method and device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114529825A (en) * 2022-04-24 2022-05-24 城云科技(中国)有限公司 Target detection model, method and application for fire fighting channel occupation target detection
CN114529825B (en) * 2022-04-24 2022-07-22 城云科技(中国)有限公司 Target detection model, method and application for fire fighting access occupied target detection

Similar Documents

Publication Publication Date Title
CN108764063B (en) Remote sensing image time-sensitive target identification system and method based on characteristic pyramid
CN108171112A (en) Vehicle identification and tracking based on convolutional neural networks
CN111461083A (en) Rapid vehicle detection method based on deep learning
Bo et al. Particle pollution estimation from images using convolutional neural network and weather features
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN106023257A (en) Target tracking method based on rotor UAV platform
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
CN109636846B (en) Target positioning method based on cyclic attention convolution neural network
Zhuang et al. Illumination and temperature-aware multispectral networks for edge-computing-enabled pedestrian detection
CN112801027A (en) Vehicle target detection method based on event camera
CN113052185A (en) Small sample target detection method based on fast R-CNN
CN114049572A (en) Detection method for identifying small target
CN110909656B (en) Pedestrian detection method and system integrating radar and camera
Lian et al. Towards unified on-road object detection and depth estimation from a single image
Wu et al. Single shot multibox detector for vehicles and pedestrians detection and classification
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
Tang et al. HIC-YOLOv5: Improved YOLOv5 For Small Object Detection
CN113537013A (en) Multi-scale self-attention feature fusion pedestrian detection method
Yang et al. DPCIAN: A novel dual-channel pedestrian crossing intention anticipation network
Zhu et al. (Retracted) Transfer learning-based YOLOv3 model for road dense object detection
CN114743045B (en) Small sample target detection method based on double-branch area suggestion network
Jiangzhou et al. Research on real-time object detection algorithm in traffic monitoring scene
Qin et al. An Implementation of Face Mask Detection System Based on YOLOv4 Architecture
You et al. Soldered dots detection of automobile door panels based on faster R-CNN model
Sun et al. Object Detection in Urban Aerial Image Based on Advanced YOLO v3 Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 2021-10-22)