CN116630876A - Airport scene monitoring image target detection method based on YOLO framework - Google Patents

Airport scene monitoring image target detection method based on YOLO framework

Info

Publication number
CN116630876A
Authority
CN
China
Prior art keywords: airport scene, scene monitoring, yolo, target detection, module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310392103.6A
Other languages
Chinese (zh)
Inventor
蔡成涛
周文涛
郑丽颖
李晨铭
曹一乾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202310392103.6A
Publication of CN116630876A
Current legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

The invention discloses an airport scene monitoring image target detection method based on a YOLO framework, which comprises the following steps: collecting airport scene monitoring images, constructing based on them a data set conforming to the characteristics of airport scene monitoring images, labeling the data set to obtain an annotated data set, and randomly dividing the annotated data set into a training set and a test set; constructing an AS-YOLO airport scene monitoring image target detection model; training the AS-YOLO airport scene monitoring image target detection model with the training set; and performing target detection on the test set with the trained AS-YOLO airport scene monitoring image target detection model, verifying the results against target detection evaluation indexes. The method addresses the lack of data for research on the characteristics of airport scene monitoring images, and the inaccurate detection of the undersized pedestrian and vehicle targets in such images.

Description

Airport scene monitoring image target detection method based on YOLO framework
Technical Field
The invention belongs to the technical field of target detection in images, and particularly relates to an airport scene monitoring image target detection method based on a YOLO framework.
Background
Today the air transportation industry keeps developing, flight volumes keep increasing, and the pressure on airport security guarantees grows accordingly; the rapid development of general civil aviation brings new challenges to airport security management. Manual processing of airport surveillance images or videos is unstable, cumbersome, time-consuming and expensive, so it is significant to develop and design a target detection algorithm for airport scene surveillance images using computer vision techniques.
Current typical target detection techniques fall mainly into two-stage and one-stage detection methods. Two-stage detection first generates detection-region candidate boxes and then predicts position boxes and categories, for example the R-CNN and Fast R-CNN target detection algorithms; one-stage detection generates the predicted position boxes and categories directly in the detection network, for example the SSD and YOLO target detection methods.
The above approaches have the following problems when applied to airport scene surveillance image target detection. First, existing target detection methods are improved and optimized on common data sets such as PASCAL VOC and MS COCO, and data sets for airport scene monitoring images are lacking. Second, the detection accuracy of the existing YOLO detection series is relatively low here: airport scene monitoring images are characterized by oversized airplane targets alongside undersized pedestrian and vehicle targets, so the presence of large targets such as airplanes means the task cannot simply be treated as small-target detection, while the presence of small targets such as pedestrians and cars degrades the accuracy of airport scene monitoring image target detection.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an airport scene monitoring image target detection method based on a YOLO framework: an airport scene image characteristic data set is constructed from images collected by airport scene monitoring, a target detection method AS-YOLO for airport scene monitoring images is constructed on the YOLO framework, and the data set is finally used to verify the improved target detection method.
In order to achieve the above object, the present invention provides the following solutions:
an airport scene monitoring image target detection method based on a YOLO framework comprises the following steps:
collecting airport scene monitoring images, constructing based on them a data set conforming to the characteristics of airport scene monitoring images, labeling the data set to obtain an annotated data set, and randomly dividing the annotated data set into a training set and a test set;
constructing an AS-YOLO airport scene monitoring image target detection model;
training the AS-YOLO airport scene monitoring image target detection model with the training set;
and performing target detection on the test set with the trained AS-YOLO airport scene monitoring image target detection model, verifying the results against target detection evaluation indexes.
Preferably, the method for labeling the data set comprises the following steps:
classifying and labeling the airplanes, pedestrians and vehicles in each picture of the data set to construct the annotated data set.
Preferably, the construction method of the AS-YOLO airport scene monitoring image target detection model comprises the following steps:
constructing a backbone network by using the designed CFEAM structure, CBF structure and MP1 structure;
based on combining the FPN structure and the PAN structure, adding a designed F-SPPF structure and FEAM structure to construct a neck network;
constructing 4 head networks using the CBF structure.
Preferably, the CBF structure is composed of a convolution module, a normalization module and a FReLU activation function;
the CFEAM structure cross-uses a plurality of FEAM structures;
the MP1 structure is composed of a maximum pooling module and a convolution with stride 2, recombining the two downsampling modes.
Preferably, the extraction process of the spatial features of the CFEAM structure is as follows:
the input undergoes a CBF operation with convolution kernel 1 and is transmitted to the FEAM structure, four times in parallel, generating 4 different feature maps;
the first feature map and the second feature map are concat-combined and passed to a CBF convolution layer with convolution kernel 3 to obtain a first convolution result;
the first convolution result and the third feature map are concat-combined and passed to a CBF convolution layer with convolution kernel 3 to obtain a second convolution result;
and the second convolution result and the fourth feature map are concat-combined, and the result is output.
Preferably, the FPN structure is a top-down feature pyramid;
the PAN structure is a bottom-up feature pyramid;
the F-SPPF structure combines the results of three maximum pooling operations and transmits them to the CBF structure for feature extraction;
the input end of the FEAM structure is changed from a single main input to the combined action of a main input and a residual input; the main input follows the CBAM attention module, with the parameter-heavy CAM module replaced by a lightweight ECA module and the sigmoid activation function replaced by a FReLU activation function designed specifically for visual tasks.
Preferably, the FEAM structure acquires the weight matrices of different dimensions as follows:
average-pooling the main input with the lightweight ECA module to obtain the feature value of each channel;
verifying the channel weights between feature vectors from the channel feature values using a 1 × 1 convolution;
passing the channel weights through a sigmoid activation function to obtain the feature map channel weights;
multiplying the feature map channel weights with the feature map initially input to the FEAM module to obtain the output features of the lightweight ECA module;
activating the output features with a FReLU activation function and transmitting them to the SAM module;
compressing the feature map Yc in the channel dimension in the SAM module using average pooling and maximum pooling to obtain two-dimensional feature maps;
splicing the two-dimensional feature maps by channel Concat to obtain a feature map with 2 channels;
convolving the spliced feature map with a hidden layer containing a single convolution kernel;
generating a spatial attention weight from the convolution result by a sigmoid operation;
multiplying the spatial attention weight with the feature map initially input to the FEAM module to obtain a feature map containing both channel and spatial attention weights;
and activating this feature map with a FReLU activation function and adding it to the residual input to obtain the output with multi-scale information expressive capability.
Preferably, the output with multi-scale information expressive capability is expressed as:
wherein Y is the air traffic control image feature map.
Compared with the prior art, the invention has the beneficial effects that:
the invention constructs the data set which accords with the airport scene monitoring image characteristics based on the airport real acquisition image. The YOLO target detection framework comprises a backbone network, a neck network and a head network, and the backbone network is constructed by using a designed CFEAM module and a CBF module; based on the FPN structure and PAN, a designed F-SPPF module and FEAM module are added, and the network feature fusion capability is enhanced to construct a neck network; the CBF module constructs 4 head networks and a target detection method for airport scene monitoring images. The advantages are that: (1) The method solves the problem of lack of data in airport scene monitoring image characteristic research. (2) The method solves the problems of undersize and inaccurate detection of airport scene monitoring image pedestrians and automobile targets.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the embodiments are briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an airport scene monitoring image target detection method based on a YOLO frame;
FIG. 2 is a block diagram of a FEAM module of the present invention;
FIG. 3 is a schematic diagram of a CAM module according to the present invention;
FIG. 4 is a schematic diagram of an ECA module network architecture according to the present invention;
FIG. 5 is a diagram showing a network structure of a SAM module according to the present invention;
FIG. 6 is a structural diagram of the FReLU activation function, FReLU = max(x, T(x)), of the present invention;
FIG. 7 is a schematic diagram of an AS-YOLO object detection network according to the present invention;
FIG. 8 is a schematic diagram of the basic convolution structure of a CBF of the present invention;
FIG. 9 is a schematic view of the CFEAM structure of the present invention;
FIG. 10 is a schematic view of MP structure according to the present invention;
FIG. 11 is a schematic diagram of the F-SPPF structure of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
As shown in fig. 1, the invention discloses an airport scene monitoring image target detection method based on a YOLO framework, which comprises the following steps:
collecting airport scene monitoring images, constructing based on them a data set conforming to the characteristics of airport scene monitoring images, labeling the data set to obtain an annotated data set, and randomly dividing the annotated data set into a training set and a test set;
constructing an AS-YOLO airport scene monitoring image target detection model;
training the AS-YOLO airport scene monitoring image target detection model with the training set;
performing target detection on the test set with the trained AS-YOLO airport scene monitoring image target detection model, and verifying the results against target detection evaluation indexes;
and writing the weight parameters obtained by training into detect.py, building and running the improved model with Python, and verifying that the AS-YOLO model is more accurate and meets the target detection requirements of airport scene monitoring images.
In this embodiment, the method for labeling a data set includes:
and classifying and labeling the planes, pedestrians and vehicles in each picture in the data set, and constructing a labeling data set.
The experimental data of the invention come from real images collected at an airport. To ensure data diversity, images of various representative airport environments are used, such as sunny days, late night, and cloudy and rainy conditions. The invention uses the common objects of airports, mainly airplanes, pedestrians and vehicles, and establishes a target label data set according to the characteristics of the airport targets. The position labels are stored in txt format, recording the center point, height and width of each target. The data set was annotated with the labeling program LabelImg, for a total of 2000 annotated pictures. The data set of the invention contains 3466 airplane targets, 2994 pedestrian targets and 1911 vehicle targets, 8371 targets in total; following the definition of small targets in the COCO data set, targets smaller than 32 × 32 pixels are absolute small targets, and small targets account for 39.42% of the data set.
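A minimal sketch of this preparation step, assuming the standard YOLO txt label convention (class id followed by normalized center x, center y, width, height) implied by the description; the directory layout and file names are hypothetical.

```python
import random
from pathlib import Path

CLASSES = ["airplane", "pedestrian", "vehicle"]  # the three labeled categories

def read_yolo_labels(txt_path):
    """Parse one LabelImg-exported txt file into (class_name, box) tuples."""
    boxes = []
    for line in Path(txt_path).read_text().splitlines():
        cls, xc, yc, w, h = line.split()
        boxes.append((CLASSES[int(cls)], (float(xc), float(yc), float(w), float(h))))
    return boxes

def split_dataset(label_dir, n_train=1600, seed=0):
    """Randomly divide the 2000 annotated images into training and test sets."""
    files = sorted(Path(label_dir).glob("*.txt"))
    random.Random(seed).shuffle(files)
    return files[:n_train], files[n_train:]

if __name__ == "__main__":
    train, test = split_dataset("labels/")  # hypothetical directory layout
    print(len(train), "training labels,", len(test), "test labels")
```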
In this embodiment, the attention mechanism reduces irrelevant information and increases detection capability by making the model pay more attention to certain classes of targets. The FEAM attention module mainly comprises an ECA module, a SAM module and a FReLU activation function. For an air traffic control image, a larger assigned weight means the area contains more detection targets, and the model devotes more attention to learning the features of that area, improving the accuracy of the detection algorithm and better assisting airport managers in decision-making.
Inspired by the residual attention module and the CBAM attention module, the core idea of FEAM adopts a residual module structure: the input air traffic control image feature map Y acquires weight matrices of different dimensions along the channel and spatial dimensions respectively. The input end of the FEAM attention module is changed from a single main input to the combined action of a main input and a residual input; the main input follows the CBAM attention module, with the parameter-heavy CAM module replaced by a lightweight ECA module and the sigmoid activation function replaced by a FReLU activation function designed specifically for visual tasks. The structure is shown in fig. 2.
Specifically, the FEAM structure acquires the weight matrices of different dimensions as follows. First, the ECA module average-pools the main input to obtain each channel feature value; the channel feature values verify the channel weights between feature vectors using a 1 × 1 convolution; a sigmoid activation function then yields the feature map channel weights, which are multiplied with the feature map initially input to the FEAM module to obtain the ECA module output features; these output features are activated by a FReLU activation function and transmitted to the SAM module. The SAM module compresses the feature map Yc in the channel dimension using average pooling and maximum pooling to obtain two-dimensional feature maps, which are combined by channel Concat into a feature map with 2 channels. To keep the finally obtained features consistent with the input Yc in the spatial dimension, a hidden layer containing a single convolution kernel convolves the spliced feature map; finally, a sigmoid operation generates the spatial attention weight, which is multiplied with the feature map initially input to the FEAM module to obtain a feature map containing both channel and spatial attention weights; this is activated by a FReLU activation function and added to the residual input to obtain an output with better multi-scale information expressive capability.
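A minimal PyTorch sketch of the FEAM module under the description above: an ECA channel-attention branch, FReLU activations, a SAM spatial-attention branch, and a residual connection. The kernel sizes (length-3 1D convolution in ECA, 7 × 7 convolution in SAM) and the 3 × 3 depthwise form of the FReLU funnel condition T(x) are the usual choices from the ECA/CBAM/FReLU literature and are assumptions where the text does not fix them; following the text, both attention weights are applied to the initial input Y.

```python
import torch
import torch.nn as nn

class FReLU(nn.Module):
    """Funnel activation max(x, T(x)); T is a 3x3 depthwise conv + BN (assumed)."""
    def __init__(self, channels):
        super().__init__()
        self.t = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels))
    def forward(self, x):
        return torch.max(x, self.t(x))

class ECA(nn.Module):
    """Lightweight channel attention: GAP -> 1D conv over channels -> sigmoid."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)
    def forward(self, x):
        w = x.mean((2, 3))                        # global average pooling, (B, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)  # integrate over 3-channel neighborhood
        return torch.sigmoid(w)[..., None, None]  # (B, C, 1, 1) channel weights

class SAM(nn.Module):
    """Spatial attention: channel-wise avg+max pooling -> concat -> conv -> sigmoid."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)
    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True).values], 1)
        return torch.sigmoid(self.conv(pooled))   # (B, 1, H, W) spatial weights

class FEAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.eca, self.sam = ECA(), SAM()
        self.act1, self.act2 = FReLU(channels), FReLU(channels)
    def forward(self, y):
        yc = self.eca(y) * y                # channel weights applied to the initial input
        ys = self.sam(self.act1(yc)) * y    # spatial weights, again applied to the initial input (per the text)
        return self.act2(ys) + y            # FReLU activation, then residual addition

x = torch.randn(1, 64, 80, 80)
print(FEAM(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```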
Specifically, the CAM module focuses more on the semantic information of the feature map. For the input feature map Y it uses average pooling to integrate information and maximum pooling to collect detailed information in the spatial dimension; the two pooling operations reduce network computation while improving network expressive capability. The two pooled one-dimensional vectors are sent to a fully connected layer operation; the invention uses a 1 × 1 convolution to verify weight sharing between the feature vectors. Finally, the channel attention Oc is generated by an addition operation and a sigmoid activation operation, with the structure shown in fig. 3.
O_c = sigmoid(MLP(MaxPool(Y)) + MLP(AvgPool(Y)))    (2)
Since the CAM uses fully connected operations to map features, as shown in fig. 3, the network parameter count is large when multiple CBAMs are inserted into the network, so the CAM is improved to address the heavy computation caused by its large parameter count. FEAM selects the ECA channel attention module to replace the traditional CAM: a one-dimensional convolution with kernel length 3 integrates features over the 3 neighboring channels, and the channel attention is finally generated by a sigmoid activation operation. The structure is shown in fig. 4.
Specifically, the SAM spatial attention module attends to the position information of features, focusing on the regions of the feature map with more effective features and complementing the channel attention. The feature map Yc is compressed in the channel dimension using average pooling and maximum pooling to obtain two-dimensional feature maps, and a feature map with 2 channels is obtained by channel Concat. To keep the finally obtained features consistent with the input Yc in the spatial dimension, a hidden layer containing a single convolution kernel convolves the spliced feature map, and the spatial attention weight Os is finally generated by a sigmoid operation. The structure is shown in fig. 5.
O_s = sigmoid(Conv2(AvgPool(Y), MaxPool(Y)))    (4)
Specifically, in a convolutional neural network the activation layer provides the network with nonlinear capability. The FReLU activation function is designed specifically for visual tasks and realizes its funnel condition with an ordinary convolution, giving it a better capability to complete visual image tasks. The most widely used activation function at present is still ReLU, expressed as follows:
ReLU = max(0, x)    (5)
The ReLU activation function achieves high accuracy in many tasks. FReLU inherits the idea of ReLU and extends it into space, relying on the 2D funnel condition T(x) over each pixel's spatial context; it is simple to implement, adds only a small amount of computation, and completes visual tasks better. It is expressed as follows, with the structure shown in fig. 6:
FReLU = max(x, T(x))    (6)
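A small numerical illustration of equation (6) against equation (5): with the funnel condition T(x) fixed at zero, max(x, T(x)) reduces to ReLU, while a learned non-zero T(x) makes the activation depend on the spatial context (the T(x) values below are hypothetical).

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
t_zero = torch.zeros_like(x)
print(torch.max(x, t_zero))      # tensor([0.0000, 0.0000, 0.0000, 1.5000]) == ReLU(x)
t_learned = torch.tensor([-1.0, 0.3, 0.2, 0.5])  # hypothetical T(x) outputs
print(torch.max(x, t_learned))   # tensor([-1.0000, 0.3000, 0.2000, 1.5000])
```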
In this embodiment, inspired by the self-attention structure and cross-combination structures, the invention develops the multimode spatial feature extraction structure CFEAM in conjunction with the FEAM module. The spatial features of the CFEAM structure are extracted as follows: the input undergoes a CBF operation with convolution kernel 1 and is transmitted to the FEAM structure, four times in parallel, generating 4 different feature maps; the first and second feature maps are concat-combined and passed to a CBF convolution layer with kernel 3; the result is concat-combined with the third feature map and passed to another CBF convolution layer with kernel 3; the result is concat-combined with the fourth feature map and output, as sketched below. Fusing images of different depths with this structure improves the feature extraction capability of the backbone network, and the FEAM attention module reduces environmental interference while increasing the detection capability for small airport targets.
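A minimal sketch of this cross-fusion, reusing the FReLU and FEAM classes from the FEAM sketch above. The text fixes the kernel sizes and the concat order; the channel bookkeeping (each kernel-3 CBF halving the concatenated channels back to C) is an assumption.

```python
import torch
import torch.nn as nn

class CBF(nn.Module):
    """Conv + BatchNorm + FReLU, the basic convolution block (fig. 8)."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = FReLU(c_out)   # FReLU from the FEAM sketch above
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CFEAM(nn.Module):
    def __init__(self, c):
        super().__init__()
        # four parallel CBF(k=1) -> FEAM branches
        self.branches = nn.ModuleList(
            nn.Sequential(CBF(c, c, 1), FEAM(c)) for _ in range(4))
        self.fuse1 = CBF(2 * c, c, 3)   # after concat(f1, f2)
        self.fuse2 = CBF(2 * c, c, 3)   # after concat(fuse1 result, f3)
    def forward(self, x):
        f1, f2, f3, f4 = (b(x) for b in self.branches)
        r1 = self.fuse1(torch.cat([f1, f2], 1))
        r2 = self.fuse2(torch.cat([r1, f3], 1))
        return torch.cat([r2, f4], 1)   # the final concat is the module output

print(CFEAM(32)(torch.randn(1, 32, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```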
In this embodiment, the method for constructing the AS-YOLO airport scene monitoring image target detection model includes:
the method comprises the steps of firstly constructing an airport scene monitoring image data set through airport scene monitoring acquisition images, and constructing a backbone network by using a designed CFEAM module and a CBF module based on a YOLO target detection frame comprising the backbone network, the neck network and the head network; based on the FPN structure and PAN, a designed F-SPPF module and FEAM module are added, and the network feature fusion capability is enhanced to construct a neck network; the CBF module constructs 4 head networks and a target detection method for airport scene monitoring images; verifying a training network model by using the established airport scene monitoring image data set; and verifying through target detection evaluation indexes.
In this embodiment, AS-YOLO shares the overall YOLO network framework: the network consists of a backbone network, a neck network and a head network, with the overall structure shown in fig. 7.
The backbone network comprises the CBF structure (fig. 8), the CFEAM structure (fig. 9) and the MP1 structure (fig. 10). The CBF structure is composed of a convolution module, a normalization module and a FReLU activation function. The CFEAM structure, inspired mainly by the self-attention structure and cross-combination structures, obtains good feature learning capability by cross-using several FEAM structures. The MP1 structure is composed of a maximum pooling module and a convolution with stride 2, recombining the two downsampling modes to strengthen network learning capability without breaking the structure; a sketch follows. The SPPCSPC structure splits the features into two parts: one part passes through four different maximum pooling operations to obtain different receptive fields that distinguish large targets from small targets, the other part extracts features with the CBF structure, and the two parts are finally merged to improve network precision.
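A minimal sketch of the MP1 block under this description; the 1 × 1 channel-reduction convolutions on each path follow the familiar YOLOv7 MP design and are an assumption here.

```python
import torch
import torch.nn as nn

class MP1(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.pool_path = nn.Sequential(
            nn.MaxPool2d(2, 2),                   # downsampling mode 1: max pooling
            nn.Conv2d(c, c // 2, 1, bias=False))
        self.conv_path = nn.Sequential(
            nn.Conv2d(c, c // 2, 1, bias=False),
            nn.Conv2d(c // 2, c // 2, 3, 2, 1, bias=False))  # downsampling mode 2: stride-2 conv
    def forward(self, x):
        # recombine the two downsampling modes by concatenation
        return torch.cat([self.pool_path(x), self.conv_path(x)], 1)

print(MP1(64)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 64, 40, 40])
```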
The neck network and head network contain the FPN structure, the PAN structure, the FEAM structure and the F-SPPF structure (fig. 11). The FPN structure is a top-down feature pyramid that improves small-target detection capability through upsampling. The PAN structure passes lower-layer information upward, bottom-up, improving the detection capability for occluded targets.
The FEAM structure is the attention module designed by the invention; adding only a small amount of computation strengthens small-target detection capability. The F-SPPF structure combines the results of three maximum pooling operations and transmits them to a CBF structure for feature extraction, reducing computation cost and effectively avoiding problems such as image distortion; a sketch follows.
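A minimal sketch of F-SPPF under the description above; the 5 × 5 pooling kernel and the concatenation of the input with all three pooled tensors follow the common SPPF pattern and are assumptions, and CBF is reused from the CFEAM sketch above.

```python
import torch
import torch.nn as nn

class FSPPF(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.pool = nn.MaxPool2d(5, 1, 2)   # stride 1, so spatial size is kept
        self.cbf = CBF(4 * c, c, 1)         # CBF from the CFEAM sketch above
    def forward(self, x):
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)                  # three successive max pooling operations
        return self.cbf(torch.cat([x, p1, p2, p3], 1))  # combine, then extract features

print(FSPPF(32)(torch.randn(1, 32, 20, 20)).shape)  # torch.Size([1, 32, 20, 20])
```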
For the head network, the classical YOLO framework downsamples the original image 8, 16 and 32 times before the feature detection network, obtaining a 20 × 20 large-target detection feature map, a 40 × 40 target detection feature map and an 80 × 80 small-target detection feature map. In target detection algorithms, deep convolutions are considered to carry rich semantic information but little position information, losing part of the small-target information. Since airport scene monitoring images contain large numbers of small pedestrian and vehicle targets, a new scale feature map is added to the original algorithm: the original image is downsampled 4, 8, 16 and 32 times to obtain a 20 × 20 large-target detection feature map, a 40 × 40 target detection feature map, an 80 × 80 small-target detection feature map and a 160 × 160 small-target detection feature map before being sent into the detection network. This network structure improvement, matched to the target characteristics of airport scene monitoring images, benefits pedestrian and vehicle detection and improves the detection effect.
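As a quick check of the scales involved, the grid sizes follow directly from the downsampling rates (a 640 × 640 input is assumed, matching the resolution used in the experiments below):

```python
# For a 640 x 640 input, the four downsampling rates give the four detection
# grids named above; the mapping of grids to target sizes follows the text.
for stride in (4, 8, 16, 32):
    size = 640 // stride
    print(f"stride {stride:2d}: {size} x {size} feature map")
# stride  4: 160 x 160 feature map  (the new small-target scale)
# stride  8: 80 x 80 feature map
# stride 16: 40 x 40 feature map
# stride 32: 20 x 20 feature map
```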
In this embodiment, for the airport scene monitoring data set and the AS-YOLO airport scene monitoring image target detection model constructed by the present invention, verification of the airport scene monitoring image target detection model is completed, and specific steps are AS follows:
step1: configuring a network training profile
1600 pictures are selected from the data set as the training set and 400 as the test set. Before training, the AS-YOLO data and model configuration files must be changed: in the data file the number of object categories is set to 3 and the object category names are modified in the category name list; in the model configuration file the anchor boxes are set to the sizes (5, 9), (6, 14), (43, 15), (12, 16), (19, 36), (40, 28), (36, 75), (76, 55), (72, 146), (142, 110), (192, 243), (459, 401); the feature scale and attention modules are added according to the improved network structure, the CFEAM model code is added to the common.py file, the CFEAM structure is imported in the yolo.py file, and yolo.py is run to check that the network modification is correct. A configuration sketch follows.
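The settings of Step 1 written out as Python literals in the style of a YOLO model configuration; grouping the twelve anchors three per detection scale, smallest to largest, is an assumed reading of the flat list above that matches the four detection heads.

```python
# Data-file settings from Step 1 (nc and names follow the dataset section).
nc = 3
names = ["airplane", "pedestrian", "vehicle"]

# Model-file anchor boxes from Step 1, grouped 3 per scale (assumed grouping).
anchors = [
    [(5, 9), (6, 14), (43, 15)],           # 160 x 160 grid, stride 4 (smallest targets)
    [(12, 16), (19, 36), (40, 28)],        # 80 x 80 grid, stride 8
    [(36, 75), (76, 55), (72, 146)],       # 40 x 40 grid, stride 16
    [(142, 110), (192, 243), (459, 401)],  # 20 x 20 grid, stride 32 (largest targets)
]
```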
Step2: configuring a network training environment
The experimental environment is a Sugon cloud computing service system. Each node is equipped with one 32-core x86 processor with a main frequency of 2.5 GHz and one NVIDIA Tesla V100 accelerator card, plus two 16 GB DDR4-2666 ECC REG memory modules; two sets of Sugon ParaStor300S parallel storage systems provide high-capacity data storage. For network communication the cluster adopts a full-line-rate, non-blocking 200 Gb HDR InfiniBand dedicated computing network, and the PyTorch 1.9.0 deep learning framework is used. The specific configuration information is shown in Table 1.
TABLE 1
Network test software environment configuration
AS-YOLO model parameter settings
Step3: target detection evaluation index selection
All models were trained and tested on the airport scene surveillance image data set built by the invention. Recall, Precision, F1 and mAP were used as the evaluation indexes of the model in the experiments, with the IOU threshold set to 0.5. F1 is the harmonic mean of Recall and Precision and reflects the model more accurately. mAP is the mean, over all classes, of the average precision under different Recall conditions. Recall, Precision, F1 and mAP are defined as follows:
Recall = TP/(TP + FN)
Precision = TP/(TP + FP)
F1 = 2 × Precision × Recall/(Precision + Recall)
mAP = (1/m) × Σ AP_i, summed over the m classes
where TP represents the number of correctly identified positive samples, FP the number of incorrectly identified positive samples, FN the number of missed positive samples, and m the number of identified categories.
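The same indexes written out as code, directly from these definitions; the per-class AP values are assumed to be computed elsewhere (e.g. as the area under the precision-recall curve at IOU 0.5).

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * p * r / (p + r)   # harmonic mean of Precision and Recall

def mean_ap(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)   # average over the m classes

p, r = precision(tp=90, fp=10), recall(tp=90, fn=30)
print(f"P={p:.2f}, R={r:.2f}, F1={f1(p, r):.2f}")  # P=0.90, R=0.75, F1=0.82
```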
Step4: analysis of experimental results
To verify the effectiveness of the improved target detection algorithm provided by the invention, the experimental results are shown in Table 2 below:
TABLE 2
The input image resolution is 640 × 640 × 3, and YOLOv3, YOLOv5l, YOLOv7 and AS-YOLO were each trained for 300 epochs for the detection accuracy comparison. The experimental results show that the proposed AS-YOLO target detection algorithm performs better: its mAP@0.5 is 5.7 points higher than that of YOLOv3, 1.5 points higher than that of YOLOv5l, and 3.2 points higher than that of YOLOv7. Detecting airport monitoring images with the YOLOv3, YOLOv5l, YOLOv7 and AS-YOLO target detection algorithms shows that targets the baseline YOLO algorithms fail to detect are detected by AS-YOLO, with higher detection accuracy.
The above embodiments merely illustrate preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art without departing from the spirit of the invention all fall within the scope defined by the appended claims.

Claims (8)

1. An airport scene monitoring image target detection method based on a YOLO framework, characterized by comprising the following steps:
collecting airport scene monitoring images, constructing based on them a data set conforming to the characteristics of airport scene monitoring images, labeling the data set to obtain an annotated data set, and randomly dividing the annotated data set into a training set and a test set;
constructing an AS-YOLO airport scene monitoring image target detection model;
training the AS-YOLO airport scene monitoring image target detection model with the training set;
and performing target detection on the test set with the trained AS-YOLO airport scene monitoring image target detection model, verifying the results against target detection evaluation indexes.
2. The YOLO framework-based airport scene monitoring image target detection method of claim 1, wherein the method of labeling the data set comprises:
classifying and labeling the airplanes, pedestrians and vehicles in each picture of the data set to construct the annotated data set.
3. The YOLO framework-based airport scene monitoring image target detection method of claim 1, wherein the AS-YOLO airport scene monitoring image target detection model is constructed as follows:
constructing a backbone network by using the designed CFEAM structure, CBF structure and MP1 structure;
combining the FPN structure and the PAN structure and adding the designed F-SPPF structure and FEAM structure to construct a neck network;
constructing 4 head networks using the CBF structure.
4. The YOLO framework-based airport scene monitoring image target detection method of claim 3, wherein
the CBF structure is composed of a convolution module, a normalization module and a FReLU activation function;
the CFEAM structure cross-uses a plurality of FEAM structures;
and the MP1 structure is composed of a maximum pooling module and a convolution with stride 2, recombining the two downsampling modes.
5. The YOLO framework-based airport scene monitoring image target detection method of claim 4, wherein the spatial features of the CFEAM structure are extracted as follows:
the input undergoes a CBF operation with convolution kernel 1 and is transmitted to the FEAM structure, four times in parallel, generating 4 different feature maps;
the first feature map and the second feature map are concat-combined and passed to a CBF convolution layer with convolution kernel 3 to obtain a first convolution result;
the first convolution result and the third feature map are concat-combined and passed to a CBF convolution layer with convolution kernel 3 to obtain a second convolution result;
and the second convolution result and the fourth feature map are concat-combined, and the result is output.
6. The YOLO framework-based airport scene monitoring image target detection method of claim 3, wherein
the FPN structure is a top-down feature pyramid;
the PAN structure is a bottom-up feature pyramid;
the F-SPPF structure combines the results of three maximum pooling operations and transmits them to the CBF structure for feature extraction;
and the input end of the FEAM structure is changed from a single main input to the combined action of a main input and a residual input, the main input following the CBAM attention module, with the parameter-heavy CAM module replaced by a lightweight ECA module and the sigmoid activation function replaced by a FReLU activation function designed specifically for visual tasks.
7. The YOLO framework-based airport scene monitoring image target detection method of claim 6, wherein the FEAM structure acquires the weight matrices of different dimensions as follows:
average-pooling the main input with the lightweight ECA module to obtain the feature value of each channel;
verifying the channel weights between feature vectors from the channel feature values using a 1 × 1 convolution;
passing the channel weights through a sigmoid activation function to obtain the feature map channel weights;
multiplying the feature map channel weights with the feature map initially input to the FEAM module to obtain the output features of the lightweight ECA module;
activating the output features with a FReLU activation function and transmitting them to the SAM module;
compressing the feature map Yc in the channel dimension in the SAM module using average pooling and maximum pooling to obtain two-dimensional feature maps;
splicing the two-dimensional feature maps by channel Concat to obtain a feature map with 2 channels;
convolving the spliced feature map with a hidden layer containing a single convolution kernel;
generating a spatial attention weight from the convolution result by a sigmoid operation;
multiplying the spatial attention weight with the feature map initially input to the FEAM module to obtain a feature map containing both channel and spatial attention weights;
and activating this feature map with a FReLU activation function and adding it to the residual input to obtain the output with multi-scale information expressive capability.
8. The YOLO framework-based airport scene monitoring image target detection method of claim 7, wherein the output with multi-scale information expressive capability is expressed as:
wherein Y is the air traffic control image feature map.
CN202310392103.6A 2023-04-13 2023-04-13 Airport scene monitoring image target detection method based on YOLO framework Pending CN116630876A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310392103.6A CN116630876A (en) 2023-04-13 2023-04-13 Airport scene monitoring image target detection method based on YOLO framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310392103.6A CN116630876A (en) 2023-04-13 2023-04-13 Airport scene monitoring image target detection method based on YOLO framework

Publications (1)

Publication Number Publication Date
CN116630876A 2023-08-22

Family

ID=87635527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310392103.6A Pending CN116630876A (en) 2023-04-13 2023-04-13 Airport scene monitoring image target detection method based on YOLO frame

Country Status (1)

Country Link
CN (1) CN116630876A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination