CN113255589A - Target detection method and system based on multi-convolution fusion network - Google Patents

Target detection method and system based on multi-convolution fusion network

Info

Publication number
CN113255589A
CN113255589A (application CN202110707169.0A)
Authority
CN
China
Prior art keywords
convolution
module
output
fusion
feature map
Prior art date
Legal status
Granted
Application number
CN202110707169.0A
Other languages
Chinese (zh)
Other versions
CN113255589B (en)
Inventor
陈克鹏 (Chen Kepeng)
Current Assignee
Beijing Telecom Easiness Information Technology Co Ltd
Original Assignee
Beijing Telecom Easiness Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Telecom Easiness Information Technology Co Ltd
Priority to CN202110707169.0A
Publication of CN113255589A
Application granted
Publication of CN113255589B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles


Abstract

The invention relates to a target detection method and a system based on a multi-convolution fusion network, wherein the method comprises the following steps: taking image data of vehicles coming and going in a traffic junction acquired by a camera carried by an unmanned aerial vehicle as a data set; constructing a network structure for image target detection; training the network structure for image target detection according to the data set to obtain an image target detection model; carrying out target detection on image data to be detected by using the image target detection model; the network structure for image target detection comprises: a ResNet101 network, a multi-convolution fusion network, a region generation network, an ROI pooling layer and a detection head. The method enhances the representation capability of the image target, and further improves the detection accuracy.

Description

Target detection method and system based on multi-convolution fusion network
Technical Field
The invention relates to the field of image processing, in particular to a target detection method and a target detection system based on a multi-convolution fusion network.
Background
In recent years, the unmanned aerial vehicle industry has developed rapidly and is widely applied to rescue, surveying and mapping, freight transportation, reconnaissance, traffic supervision and the like. Accurate detection of targets in aerial images is a precondition for an unmanned aerial vehicle to successfully complete such tasks. However, due to the influence of imaging angle and height, targets in aerial images often have a small visible area, low resolution and much background interference, and carry little characteristic information of their own; compared with targets in natural scene images they are more difficult to detect, and at present the detection accuracy on aerial images still needs to be improved.
Disclosure of Invention
The invention aims to provide a target detection method and a target detection system based on a multi-convolution fusion network, which improve the detection accuracy.
In order to achieve the purpose, the invention provides the following scheme:
a target detection method based on a multi-convolution fusion network comprises the following steps:
taking image data of vehicles coming and going in a traffic junction acquired by a camera carried by an unmanned aerial vehicle as a data set;
constructing a network structure for image target detection;
training the network structure for image target detection according to the data set to obtain an image target detection model;
carrying out target detection on image data to be detected by using the image target detection model;
the network structure for image target detection comprises: a ResNet101 network, a multi-convolution fusion network, a region generation network, an ROI pooling layer and a detection head;
the ResNet101 network comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module which are connected in sequence; the multi-convolution fusion network comprises a first multi-convolution fusion module, a second multi-convolution fusion module, a third multi-convolution fusion module, a fourth multi-convolution fusion module and a fifth multi-convolution fusion module;
the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are all used for performing multi-convolution feature fusion on an input image;
an output of the fifth convolution module is connected to an input of the fifth multi-convolution fusion module, an output of the fourth convolution module is connected to an input of the fourth multi-convolution fusion module, an output of the third convolution module is connected to an input of the third multi-convolution fusion module, an output of the second convolution module is connected to an input of the second multi-convolution fusion module, and an output of the first convolution module is connected to an input of the first multi-convolution fusion module; the output of the fifth multi-convolution fusion module is a fifth feature map, the fifth feature map outputs a fourth feature map through 2-time upsampling and element-by-element addition with the output of the fourth multi-convolution fusion module, the fourth feature map outputs a third feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the third multi-convolution fusion module, the third feature map outputs a second feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the second multi-convolution fusion module, and the second feature map outputs a first feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the first multi-convolution fusion module; inputting the first feature map, the second feature map, the third feature map, the fourth feature map and the fifth feature map into the area generation network; the region generation network is connected with the ROI pooling layer, the ROI pooling layer is connected with the detection head, and the detection head is used for outputting a detection result.
Optionally, the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are identical in structure, and each includes a first convolution branch, a second convolution branch, a third convolution branch, a fourth convolution branch, a first SEnet attention mechanism module, a second SEnet attention mechanism module, a third SEnet attention mechanism module, and a fourth SEnet attention mechanism module;
the first convolution branch comprises convolution operations with convolution kernels of 1 × 1, step sizes of 3 and pixel padding of 0, the second convolution branch comprises convolution operations with convolution kernels of 3 × 3, step sizes of 2 and pixel padding of 1, the third convolution branch comprises convolution operations with convolution kernels of 5 × 5, step sizes of 2 and pixel padding of 2, and the fourth convolution branch comprises convolution operations with convolution kernels of 7 × 7, step sizes of 2 and pixel padding of 3; the feature map of the first convolution branch output is input to the first SEnet attention mechanism module, the feature map of the second convolution branch output is input to the second SEnet attention mechanism module, the feature map of the third convolution branch output is input to the third SEnet attention mechanism module, and the feature map of the fourth convolution branch output is input to the fourth SEnet attention mechanism module;
the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module perform global average pooling on input feature maps based on channel dimensions to obtain feature maps with the size of 1 × 1 × 512, the feature maps with the size of 1 × 1 × 512 are input into a first full connection layer, the first full connection layer outputs the feature maps with the size of 1 × 1 × 512/r, a ReLU activation function is adopted to perform activation operation on the feature maps with the size of 1 × 1 × 512/r, the feature maps with the size of 1 × 1 × 512/r are expanded into 1 × 1 × 512 through the second full connection layer, and then the feature maps containing channel attention information are output through a Sigmoid function; r is a set value;
the four feature graphs which are output by the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module and contain channel attention information are subjected to element-level addition operation to obtain a feature fusion feature graph, and the feature fusion feature graph is subjected to convolution operation with convolution kernel of 1 × 1, step length of 1 and pixel filling of 0 and then output.
Optionally, the sizes of the features output by the first convolution branch, the second convolution branch, the third convolution branch and the fourth convolution branch are the same, and are all 64 × 64 × 512.
Optionally, the detection head comprises a regression branch and a classification branch; the classification branch determines the category of the detection target by using the classification loss, and the regression branch determines the position information of the detection target by using the regression loss.
Optionally, the image data of vehicles coming and going in the transportation junction collected by the camera carried by the unmanned aerial vehicle is used as a data set, and the method specifically includes:
the method comprises the steps that image data of vehicles coming in and going out of a traffic junction are collected through a camera carried by an unmanned aerial vehicle;
carrying out random adjustment on brightness, saturation and contrast of the image data to obtain preprocessed image data;
dividing the preprocessed image data into a training set and a test set;
adopting Labelme software to label the vehicle targets in the images in the training set according to their categories to obtain the labeled training set; the test set and the class-labeled training set form the data set.
Optionally, the training of the network structure for image target detection according to the data set to obtain an image target detection model specifically includes:
when a network structure for detecting the image target is trained according to the data set, calculating a loss function, and adjusting parameters in the network structure according to the loss function to obtain an image target detection model; the loss function includes a classification loss and a regression loss.
Optionally, the loss function is expressed as:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*)

wherein L denotes the loss function, i denotes the i-th sample, 1/N_cls is a first normalization parameter, 1/N_reg is a second normalization parameter, λ is a weight balance parameter, L_cls denotes the classification loss, L_reg denotes the regression loss, p_i denotes the probability that the i-th sample is predicted as a vehicle, p_i* is the labeled label of the i-th sample, t_i denotes the translation scaling parameters of the predicted bounding box, and t_i* denotes the translation scaling parameters of the real bounding box.
The invention also discloses a target detection system based on the multi-convolution fusion network, which comprises the following steps:
the data set acquisition module is used for taking image data of vehicles coming and going in the traffic junction acquired by a camera carried by the unmanned aerial vehicle as a data set;
the network construction module is used for constructing a network structure for detecting the image target;
the image target detection model training module is used for training the network structure for image target detection according to the data set to obtain an image target detection model;
the target detection module is used for carrying out target detection on the image data to be detected by utilizing the image target detection model;
the network structure for image target detection comprises: a ResNet101 network, a multi-convolution fusion network, a region generation network, an ROI pooling layer and a detection head;
the ResNet101 network comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module which are connected in sequence; the multi-convolution fusion network comprises a first multi-convolution fusion module, a second multi-convolution fusion module, a third multi-convolution fusion module, a fourth multi-convolution fusion module and a fifth multi-convolution fusion module;
the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are all used for performing multi-convolution feature fusion on an input image;
an output of the fifth convolution module is connected to an input of the fifth multi-convolution fusion module, an output of the fourth convolution module is connected to an input of the fourth multi-convolution fusion module, an output of the third convolution module is connected to an input of the third multi-convolution fusion module, an output of the second convolution module is connected to an input of the second multi-convolution fusion module, and an output of the first convolution module is connected to an input of the first multi-convolution fusion module; the output of the fifth multi-convolution fusion module is a fifth feature map, the fifth feature map outputs a fourth feature map through 2-time upsampling and element-by-element addition with the output of the fourth multi-convolution fusion module, the fourth feature map outputs a third feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the third multi-convolution fusion module, the third feature map outputs a second feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the second multi-convolution fusion module, and the second feature map outputs a first feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the first multi-convolution fusion module; inputting the first feature map, the second feature map, the third feature map, the fourth feature map and the fifth feature map into the area generation network; the region generation network is connected with the ROI pooling layer, the ROI pooling layer is connected with the detection head, and the detection head is used for outputting a detection result.
Optionally, the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are identical in structure, and each includes a first convolution branch, a second convolution branch, a third convolution branch, a fourth convolution branch, a first SEnet attention mechanism module, a second SEnet attention mechanism module, a third SEnet attention mechanism module, and a fourth SEnet attention mechanism module;
the first convolution branch comprises convolution operations with convolution kernels of 1 × 1, step sizes of 3 and pixel padding of 0, the second convolution branch comprises convolution operations with convolution kernels of 3 × 3, step sizes of 2 and pixel padding of 1, the third convolution branch comprises convolution operations with convolution kernels of 5 × 5, step sizes of 2 and pixel padding of 2, and the fourth convolution branch comprises convolution operations with convolution kernels of 7 × 7, step sizes of 2 and pixel padding of 3; the feature map of the first convolution branch output is input to the first SEnet attention mechanism module, the feature map of the second convolution branch output is input to the second SEnet attention mechanism module, the feature map of the third convolution branch output is input to the third SEnet attention mechanism module, and the feature map of the fourth convolution branch output is input to the fourth SEnet attention mechanism module;
the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module perform global average pooling on input feature maps based on channel dimensions to obtain feature maps with the size of 1 × 1 × 512, the feature maps with the size of 1 × 1 × 512 are input into a first full connection layer, the first full connection layer outputs the feature maps with the size of 1 × 1 × 512/r, a ReLU activation function is adopted to perform activation operation on the feature maps with the size of 1 × 1 × 512/r, the feature maps with the size of 1 × 1 × 512/r are expanded into 1 × 1 × 512 through the second full connection layer, and then the feature maps containing channel attention information are output through a Sigmoid function; r is a set value;
the four feature graphs which are output by the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module and contain channel attention information are subjected to element-level addition operation to obtain a feature fusion feature graph, and the feature fusion feature graph is subjected to convolution operation with convolution kernel of 1 × 1, step length of 1 and pixel filling of 0 and then output.
Optionally, the sizes of the features output by the first convolution branch, the second convolution branch, the third convolution branch and the fourth convolution branch are the same, and are all 64 × 64 × 512.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the invention, different characteristic information is fused through each multi-convolution fusion area module of the multi-convolution fusion network, and multi-scale fusion is carried out on different characteristic information, so that the representation capability of the image target is enhanced, and the detection accuracy is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a target detection method based on a multi-convolution fusion network according to the present invention;
FIG. 2 is a diagram illustrating a network structure for image target detection according to the present invention;
FIG. 3 is a schematic diagram of a network structure for image target detection according to the present invention;
FIG. 4 is a multi-convolution fusion module configuration of the present invention;
FIG. 5 is a schematic diagram of a target detection method based on a multi-convolution fusion network according to the present invention;
fig. 6 is a schematic structural diagram of a target detection system based on a multi-convolution fusion network according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a target detection method and system based on a multi-convolution fusion network, which improve the detection accuracy.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic flow chart of a target detection method based on a multi-convolution fusion network according to the present invention, and as shown in fig. 1, the target detection method based on the multi-convolution fusion network includes the following steps:
step 101: the image data of vehicles coming and going in the traffic junction collected through a camera carried by the unmanned aerial vehicle is used as a data set.
Wherein, step 101 specifically includes:
the camera that carries through unmanned aerial vehicle gathers the image data of coming and going vehicle in the traffic hub.
And carrying out random adjustment on brightness, saturation and contrast of the image data to obtain preprocessed image data.
And dividing the preprocessed image data into a training set and a testing set.
Adopting Labelme software to label the types of the vehicle targets in the images in the training set to obtain a labeled training set; the test set and the class labeled training set constitute a data set.
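By way of illustration only, the following sketch shows how the random brightness, saturation and contrast adjustment and the training/test split described above could be implemented. PyTorch/torchvision and the specific jitter ranges and split ratio are assumptions, since the patent does not name a framework or parameter values.

```python
import random
from torchvision import transforms
from PIL import Image

def preprocess(image_path: str) -> Image.Image:
    """Randomly adjust brightness, saturation and contrast of an aerial image."""
    image = Image.open(image_path).convert("RGB")
    # Illustrative jitter ranges; the patent only states the adjustment is random.
    jitter = transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3)
    return jitter(image)

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Divide the preprocessed images into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```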
Step 102: and constructing a network structure for image target detection.
Step 103: and training a network structure for image target detection according to the data set to obtain an image target detection model.
Wherein, step 103 specifically comprises:
when a network structure for image target detection is trained according to the data set, calculating a loss function, and adjusting parameters in the network structure according to the loss function to obtain an image target detection model; the loss function includes classification loss and regression loss.
The loss function is expressed as:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*)

wherein L denotes the loss function, i denotes the i-th sample, 1/N_cls is a first normalization parameter, 1/N_reg is a second normalization parameter, λ is a weight balance parameter, L_cls denotes the classification loss, L_reg denotes the regression loss, p_i denotes the probability that the i-th sample is predicted as a vehicle, p_i* is the labeled label of the i-th sample, t_i denotes the translation scaling parameters of the predicted bounding box, and t_i* denotes the translation scaling parameters of the real bounding box.
Step 104: and carrying out target detection on the image data to be detected by using the image target detection model.
Fig. 2-3 are schematic diagrams of the network structure for image target detection according to the present invention, and as shown in fig. 2 and 3, the network structure for image target detection includes: a ResNet101 network 201, a multi-convolution fusion network 202, a region generation network 203, a ROI (region of interest) pooling layer 204, and a detection head 205.
The ResNet101 network 201 comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module which are connected in sequence; the multi-convolution fusion network 202 includes a first multi-convolution fusion module, a second multi-convolution fusion module, a third multi-convolution fusion module, a fourth multi-convolution fusion module, and a fifth multi-convolution fusion module.
The first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module and the fifth multi-convolution fusion module are all used for carrying out multi-convolution feature fusion on the input image.
The output of the fifth convolution module is connected with the input of the fifth multi-convolution fusion module, the output of the fourth convolution module is connected with the input of the fourth multi-convolution fusion module, the output of the third convolution module is connected with the input of the third multi-convolution fusion module, the output of the second convolution module is connected with the input of the second multi-convolution fusion module, and the output of the first convolution module is connected with the input of the first multi-convolution fusion module; the fifth multi-convolution fusion module outputs a fifth feature map, the fifth feature map outputs a fourth feature map through 2 times of upsampling and element-wise addition with the output of the fourth multi-convolution fusion module, the fourth feature map outputs a third feature map through a 3 x 3 convolution operation after 2 times of upsampling and element-wise addition with the output of the third multi-convolution fusion module, the third feature map outputs a second feature map through a 3 x 3 convolution operation after 2 times of upsampling and element-wise addition with the output of the second multi-convolution fusion module, and the second feature map outputs a first feature map through a 3 x 3 convolution operation after 2 times of upsampling and element-wise addition with the output of the first multi-convolution fusion module; the first feature map, the second feature map, the third feature map, the fourth feature map and the fifth feature map are input into the region generation network 203; the region generation network 203 is connected to the ROI pooling layer 204, the ROI pooling layer 204 is connected to the detection head 205, and the detection head 205 is used for outputting a detection result. The region generation network 203 is used to generate a series of candidate target regions.
Specifically, the ROI pooling layer 204 extracts, from the first feature map, the second feature map, the third feature map and the fourth feature map, the feature maps of the candidate target regions generated by the region generation network 203.
Fig. 4 is a diagram of a multi-convolution fusion module according to the present invention, and as shown in fig. 4, the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module have the same structure, and each of the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module includes a first convolution branch, a second convolution branch, a third convolution branch, a fourth convolution branch, a first SEnet attention mechanism module, a second SEnet attention mechanism module, a third SEnet attention mechanism module, and a fourth SEnet attention mechanism module.
The first convolution branch comprises convolution operations with convolution kernels of 1 × 1, step size of 3 and pixel filling of 0, the second convolution branch comprises convolution operations with convolution kernels of 3 × 3, step size of 2 and pixel filling of 1, the third convolution branch comprises convolution operations with convolution kernels of 5 × 5, step size of 2 and pixel filling of 2, and the fourth convolution branch comprises convolution operations with convolution kernels of 7 × 7, step size of 2 and pixel filling of 3; the characteristic diagram output by the first convolution branch is input into a first SEnet attention mechanism module, the characteristic diagram output by the second convolution branch is input into a second SEnet attention mechanism module, the characteristic diagram output by the third convolution branch is input into a third SEnet attention mechanism module, and the characteristic diagram output by the fourth convolution branch is input into a fourth SEnet attention mechanism module.
The first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module perform global average pooling on input feature maps based on channel dimensions to obtain feature maps with the size of 1 × 1 × 512, the feature maps with the size of 1 × 1 × 512 are input into a first full connection layer, the first full connection layer outputs the feature maps with the size of 1 × 1 × 512/r, a ReLU activation function is adopted to perform activation operation on the feature maps with the size of 1 × 1 × 512/r, the feature maps with the size of 1 × 1 × 512/r are expanded into 1 × 1 × 512 through the second full connection layer, and then the feature maps containing channel attention information are output through a Sigmoid function; r is a set value.
And four feature graphs containing channel attention information output by the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module are subjected to element-level addition operation to obtain a feature fusion feature graph, and the feature fusion feature graph is subjected to convolution operation with a convolution kernel of 1 × 1, a step length of 1 and pixel filling of 0 and then output.
The features output by the first, second, third and fourth convolution branches are the same size, all 64 × 64 × 512.
The detection head 205 includes a regression branch and a classification branch; the classification branch determines the category of the detection target by using the classification loss, and the regression branch determines the position information of the detection target by using the regression loss.
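As an illustration of this two-branch detection head, the following is a minimal sketch assuming a PyTorch-style implementation, the 7 × 7 pooled ROI features and the two 1024-dimensional fully connected layers described later in the embodiment; the class count and layer sizes are otherwise assumptions.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of the detection head: two 1024-d fully connected layers over the
    7x7 ROI features, then a classification branch and a regression branch."""
    def __init__(self, in_channels: int = 256, num_classes: int = 2):  # assumed: background + vehicle
        super().__init__()
        self.fc1 = nn.Linear(in_channels * 7 * 7, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.cls_branch = nn.Linear(1024, num_classes)      # class scores
        self.reg_branch = nn.Linear(1024, num_classes * 4)  # per-class box deltas

    def forward(self, roi_feats: torch.Tensor):
        x = roi_feats.flatten(start_dim=1)                  # [num_rois, in_channels*7*7]
        x = torch.relu(self.fc2(torch.relu(self.fc1(x))))
        return self.cls_branch(x), self.reg_branch(x)
```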
And training and optimizing parameters of a network structure for image target detection by adopting an aerial image data set, finally performing model test, and performing target detection on the vehicle image to be detected by utilizing an image target detection model.
The invention discloses a multi-convolution fusion module, combines it with a multi-scale feature fusion technology, and provides a network structure for image target detection.
The following describes a target detection method based on a multi-convolution fusion network in detail.
As shown in fig. 5, a target detection method based on a multi-convolution fusion network specifically includes the following steps.
Step1, constructing an aerial image data set. The specific process is as follows: firstly, acquiring image data of vehicles coming from and going to a traffic junction through an unmanned aerial vehicle camera; secondly, randomly adjusting the brightness, the saturation and the contrast of the acquired original image through a preprocessing operation; thirdly, carrying out category labeling on the aerial vehicle target in the image based on Labelme software, thereby obtaining a labeling file in an Extensible Markup Language (XML) format; and finally, carrying out training set and test set division, making labels for the data in the training set, and carrying out no processing on the data in the test set.
Step2, building a deep neural network (the network structure for image target detection) and training the deep neural network model with the training set of the aerial image data set to obtain an aerial image detection model; taking a 1024 × 1024 input aerial image as an example, the specific process is described as follows:
A multi-convolution fusion module is designed (comprising a first multi-convolution fusion module, a second multi-convolution fusion module, a third multi-convolution fusion module, a fourth multi-convolution fusion module and a fifth multi-convolution fusion module) and embedded into the backbone network ResNet101 of the Faster RCNN network. The backbone network used by Faster RCNN in the present invention is ResNet101, which is used to extract the features of aerial images; the ResNet101 network 201 is composed of 5 convolution modules (conv1, conv2, conv3, conv4, conv5). As shown in fig. 3, a multi-convolution fusion module is designed and embedded after each of the 5 convolution modules, so that the subsequent feature maps all contain the extracted target key information with different attributes. As shown in fig. 3, taking the 1024 × 1024 input aerial image as an example, the size of the output feature map C_3 after the first three convolution modules (conv1, conv2, conv3) is 128 × 128 × 512, and the design process of the multi-convolution fusion module is illustrated by taking this feature map as the input of the multi-convolution fusion module (third multi-convolution fusion module):
as shown in fig. 4, a multi-convolution branch structure is first designed, and a feature map output after conv3 (third convolution module) is used as an input feature map of the structure. Inputting the feature map into different convolution branches, namely performing four different convolution operations on the feature map respectively, wherein the convolution operations comprise convolution operations with convolution kernels of 1 × 1, step sizes of 3 and pixel filling of 0, convolution operations with convolution kernels of 3 × 3, step sizes of 2 and pixel filling of 1, convolution operations with convolution kernels of 5 × 5, step sizes of 2 and pixel filling of 2, and convolution operations with convolution kernels of 7 × 7, step sizes of 2 and pixel filling of 3, and thus obtaining four feature maps (the size is 64 × 64 × 512) with the same size and containing different feature information.
Next, a SEnet attention mechanism is constructed and embedded behind the multi-convolution branch structure. As shown in FIG. 4, the four output feature maps of the multi-convolution branch structure are used as the input of the SEnet attention mechanism, which is designed through the following process: taking the four feature maps (each of size 64 × 64 × 512) produced by the multi-convolution branch structure as input feature maps of the module, global average pooling based on the channel dimension is first performed on the input feature maps, yielding four feature maps of size 1 × 1 × 512. Then, the four feature maps are input into the first full connection layer, which reduces the number of channels of each 1 × 1 × 512 feature map to 1/r of the original number to lower the computation cost of the full connection layer, and outputs four feature maps of size 1 × 1 × 512/r. A ReLU activation function is applied to the four feature maps respectively, a second full connection layer expands them from 1 × 1 × 512/r back to 1 × 1 × 512, and finally a Sigmoid function limits the weights of the 512-channel feature maps to the range [0, 1]. The 512 channels of the four feature maps are multiplied by the output weights 1 × 1 × 512, thereby outputting four feature maps (of size 64 × 64 × 512) containing channel attention information. The calculation formula of the SEnet attention mechanism is as follows:
B=σ(FC(ReLu(FC(Avgpool(A)))));
where A denotes an input feature map of the attention module, B denotes the output feature map, FC denotes a fully-connected layer (the first and second fully-connected layers), Avgpool denotes global average pooling, and σ denotes the Sigmoid activation function.
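A minimal sketch of this SEnet attention computation, assuming a PyTorch implementation with a 512-channel input as in the text; the reduction ratio r is left as a parameter (the value r = 16 below is only a placeholder, since the patent treats r as a set value).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: B = sigmoid(FC(ReLU(FC(Avgpool(A))))), applied back to A."""
    def __init__(self, channels: int = 512, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # global average pooling -> 1x1xC
        self.fc1 = nn.Linear(channels, channels // r)   # squeeze to C/r
        self.fc2 = nn.Linear(channels // r, channels)   # expand back to C
        self.act = nn.ReLU(inplace=True)
        self.gate = nn.Sigmoid()

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = a.shape
        w = self.pool(a).view(n, c)                         # N x C
        w = self.gate(self.fc2(self.act(self.fc1(w))))      # channel weights in [0, 1]
        return a * w.view(n, c, 1, 1)                       # reweight the channels of A
```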
Since the SEnet attention mechanism does not change the resolution of the feature map, as shown in FIG. 4, embedding an attention mechanism behind each of the multi-convolution branch structures helps the network screen the abundant feature information extracted by the branches and pass the screened key features on to the subsequent feature layers, thereby improving the detection accuracy of aerial image targets.
Finally, a multi-convolution fusion structure is designed. Element-level addition is carried out on the four feature maps output by the SEnet attention mechanism to obtain a feature map (of size 64 × 64 × 512) fusing different feature attributes. A convolution operation with convolution kernel 1 × 1, step size 1 and pixel padding 0 is then applied to refine the number of channels to 256 and eliminate the feature aliasing effect, finally yielding a feature map of size 64 × 64 × 256.
The multi-convolution fusion module is respectively formed by connecting a multi-convolution branch structure, an SEnet attention mechanism and a multi-convolution fusion structure in series, as shown in fig. 3, the multi-convolution fusion module is respectively embedded into 5 convolution modules of the ResNet101 network 201, so that the network can extract and refine more abundant key feature information based on different convolution operations, and the key feature information is transmitted to a subsequent layer, and the detection accuracy of an aerial image target is improved. In addition, the multi-convolution fusion module can reduce the space dimension and the channel number of the feature map to half of the original number through key feature extraction, thereby reducing the calculation cost.
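Putting the pieces together, the following sketch assembles one multi-convolution fusion module from the four convolution branches, the per-branch SEnet attention (reusing the SEBlock sketched above) and the 1 × 1 refinement convolution, assuming PyTorch and following the C_3 example dimensions. Note that the text specifies a stride of 3 for the 1 × 1 branch; stride 2 is used here as an interpretation so that all four branches produce feature maps of the same spatial size and can be added element-wise.

```python
import torch
import torch.nn as nn

class MultiConvFusionModule(nn.Module):
    """Sketch of one multi-convolution fusion module: four parallel convolution
    branches, an SE attention block per branch, element-level fusion, and a 1x1
    refinement convolution (channel counts follow the C_3 example in the text)."""
    def __init__(self, in_channels: int = 512, branch_channels: int = 512,
                 out_channels: int = 256, r: int = 16):
        super().__init__()
        # The text lists stride 3 for the 1x1 branch; stride 2 is used here so
        # all four branches produce feature maps of the same spatial size.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, branch_channels, kernel_size=1, stride=2, padding=0),
            nn.Conv2d(in_channels, branch_channels, kernel_size=3, stride=2, padding=1),
            nn.Conv2d(in_channels, branch_channels, kernel_size=5, stride=2, padding=2),
            nn.Conv2d(in_channels, branch_channels, kernel_size=7, stride=2, padding=3),
        ])
        self.attentions = nn.ModuleList([SEBlock(branch_channels, r) for _ in range(4)])
        self.refine = nn.Conv2d(branch_channels, out_channels, kernel_size=1, stride=1, padding=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = [att(branch(x)) for branch, att in zip(self.branches, self.attentions)]
        fused = torch.stack(outs, dim=0).sum(dim=0)   # element-level addition of the 4 maps
        return self.refine(fused)                     # e.g. 64x64x512 -> 64x64x256
```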
A Faster R-CNN structure based on a Feature Pyramid Network (FPN) is designed. The concrete structure (as shown in fig. 3) is as follows: the backbone network ResNet101 mainly consists of five convolution modules (conv1, conv2, conv3, conv4, conv5), whose output feature maps are denoted C_1, C_2, C_3, C_4 and C_5 respectively. Taking the 1024 × 1024 input aerial image as an example, the sizes of the feature maps C_1 to C_5 are, in order: 512 × 512 × 128, 256 × 256 × 256, 128 × 128 × 512, 64 × 64 × 1024 and 32 × 32 × 2048. C_1, C_2, C_3, C_4 and C_5 are passed through the five multi-convolution fusion modules respectively to obtain rich feature information while unifying the number of channels to 256, i.e. the sizes become: 256 × 256 × 256, 128 × 128 × 256, 64 × 64 × 256, 32 × 32 × 256 and 16 × 16 × 256. The output feature map of C_5 after the multi-convolution fusion module (fifth multi-convolution fusion module) is named P_6 (fifth feature map). In a multi-scale feature fusion manner, the upper-layer feature map carrying high-level semantic information is successively upsampled by a factor of 2 to the same size as the layer below and added element-wise to the higher-resolution feature map of that lower layer, yielding the P_2, P_3, P_4 and P_5 (fourth feature map) layers. A 3 × 3 convolution is then applied to the P_2, P_3 and P_4 layers to eliminate the aliasing effect of the lower layers, giving the final P_2 (first feature map), P_3 (second feature map) and P_4 (third feature map) layers.
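A minimal sketch of this top-down multi-scale fusion, assuming PyTorch; nearest-neighbour upsampling and the placement of the 3 × 3 smoothing convolutions follow the description above and are otherwise assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Sketch of the top-down multi-scale fusion over the five 256-channel
    multi-convolution fusion outputs (m2 largest ... m6 smallest)."""
    def __init__(self, channels: int = 256):
        super().__init__()
        # 3x3 convolutions that remove the aliasing effect on P_2, P_3, P_4.
        self.smooth = nn.ModuleDict({
            name: nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for name in ("p2", "p3", "p4")
        })

    def forward(self, m2, m3, m4, m5, m6):
        p6 = m6
        p5 = F.interpolate(p6, scale_factor=2, mode="nearest") + m5
        p4 = self.smooth["p4"](F.interpolate(p5, scale_factor=2, mode="nearest") + m4)
        p3 = self.smooth["p3"](F.interpolate(p4, scale_factor=2, mode="nearest") + m3)
        p2 = self.smooth["p2"](F.interpolate(p3, scale_factor=2, mode="nearest") + m2)
        return p2, p3, p4, p5, p6
```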
As shown in fig. 5, ResNet101, the multi-convolution fusion module, and FPN form a feature extraction network for extracting features in an input image.
Next, an RPN (Region Proposal Network) structure is established. The RPN consists of a 3 × 3 convolutional layer and two output branches: the first branch outputs the probability that the candidate region belongs to each target class; the second branch outputs the top-left corner coordinates and the width and height of the candidate region border (bounding box). The RPN traverses the feature maps of the five feature layers P_2 to P_6 with a sliding 3 × 3 anchor frame, generating a number of anchor boxes and a series of Proposals (candidate target regions), with target candidate box prediction performed on each layer. Finally, the predictions of all layers are concatenated and fused. In the RPN training process, a target whose IoU (intersection over union) with the real label box is greater than 0.7 is given a positive label (vehicle target), and one whose IoU is less than 0.3 is given a negative label (background).
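A small sketch of this anchor labelling rule (IoU > 0.7 positive, IoU < 0.3 negative), assuming PyTorch tensors; treating anchors between the two thresholds as ignored is the usual Faster R-CNN convention rather than something the patent spells out.

```python
import torch

def assign_rpn_labels(ious: torch.Tensor) -> torch.Tensor:
    """Assign RPN training labels from an IoU matrix.
    ious: [num_anchors, num_gt] IoU between each anchor and each ground-truth box.
    Returns 1 (vehicle) for IoU > 0.7, 0 (background) for IoU < 0.3, -1 (ignored)."""
    labels = torch.full((ious.shape[0],), -1, dtype=torch.long)
    max_iou, _ = ious.max(dim=1)          # best ground-truth overlap per anchor
    labels[max_iou < 0.3] = 0             # negative label: background
    labels[max_iou > 0.7] = 1             # positive label: vehicle target
    return labels
```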
According to the area (w × h) of each Proposal box generated by the RPN, the Proposal is mapped to the corresponding feature layer P_k, on which the subsequent ROI Pooling is performed. The value of k is calculated as:

k = ⌊ k_0 + log2( √(w·h) / 224 ) ⌋

where k_0 = 4, w and h are the width and height of the bounding box, respectively, and k takes values of 2, 3, 4 and 5.
P_2 denotes the first feature map, P_3 denotes the second feature map, P_4 denotes the third feature map, and P_5 denotes the fourth feature map.
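A sketch of this level-assignment rule as a standalone function; the constants k_0 = 4 and the 224 reference size follow the standard FPN assignment formula and are assumptions insofar as the patent does not state them explicitly.

```python
import math

def assign_fpn_level(w: float, h: float, k0: int = 4, k_min: int = 2, k_max: int = 5) -> int:
    """Map a Proposal of size w x h to a pyramid level P_k."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224.0))
    return max(k_min, min(k_max, k))      # clamp so k stays in {2, 3, 4, 5}
```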
And inputting the obtained Proposals into the ROI Pooling layer for feature extraction, and outputting Proposal feature maps with a uniform size of 7 × 7, so that they can be fed into the fully connected layers in the next step. After each feature map sample passes through two 1024-dimensional fully connected layers, the two detection branches of Faster RCNN compute respectively: the classification loss function classifies background versus vehicle target and determines the vehicle class to which the Proposal region belongs; the regression loss yields the positioning information of the vehicle target after the bounding-box regression operation is completed. The network model is trained, the loss function is calculated, and the parameters of the whole network are updated to finally obtain the trained model; the training loss comprises two parts, namely the classification loss and the regression loss, and the calculation formula is as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*)

In the formula, i denotes the index of each sample, N_cls and N_reg are normalization parameters, and λ is a weight balance parameter. L_cls(p_i, p_i*) denotes the classification loss, p_i denotes the probability that the sample is predicted to be a vehicle, and p_i* is its labeled real data label. L_reg(t_i, t_i*) denotes the bounding-box regression loss, defined as smooth_L1(t_i − t_i*), where the smooth_L1 function is defined as

smooth_L1(x) = 0.5 x² if |x| < 1; |x| − 0.5 otherwise,

and x denotes the function input. t_i denotes the translation scaling parameters of the target box predicted from the Proposal, and t_i* denotes the translation scaling parameters of the real data corresponding to the Proposal; the factor p_i* means the regression term is activated only when the sample is a positive sample (p_i* = 1). Concretely, t = (t_x, t_y, t_w, t_h), where t_x and t_y are the translation scaling parameters of the upper-left-corner coordinates x and y of the predicted target box, and t_w and t_h are the scaling parameters of the predicted box width w and height h; correspondingly, t* = (t_x*, t_y*, t_w*, t_h*) are the translation scaling parameters of the upper-left-corner coordinates, width and height of the real target box.
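A minimal sketch of computing this training loss, assuming a PyTorch implementation with cross-entropy as the classification loss and the built-in smooth L1 for the regression term; the normalization and the λ weighting follow the formula above, and the default λ = 1 is only a placeholder.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, labels, box_deltas, box_targets, lam: float = 1.0):
    """Classification loss plus regression loss, with the regression term
    active only for positive (vehicle) samples."""
    n_cls = max(labels.numel(), 1)
    n_reg = max(int((labels == 1).sum()), 1)
    l_cls = F.cross_entropy(cls_logits, labels, reduction="sum") / n_cls
    pos = labels == 1
    # smooth L1 over (t - t*), summed over the positive samples only
    l_reg = F.smooth_l1_loss(box_deltas[pos], box_targets[pos], reduction="sum") / n_reg
    return l_cls + lam * l_reg
```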
Step3, the overall structure of the deep neural network is completed based on the above steps; the model is trained and its parameters are optimized on the aerial image data set, and finally the model is tested. Specifically, end-to-end training is carried out on the deep neural network obtained in the above steps on the training set of the aerial image data set; forward propagation and backward propagation are performed for each picture input into the network, and the internal parameters of the model are updated according to the loss function L, so as to obtain the aerial image target detection model.
The method comprises the following steps of adopting a test set of an aerial image data set as a test example, inputting the test set into a trained deep neural network model (an image target detection model), and detecting a vehicle target in an aerial image, wherein the specific process comprises the following steps:
(1) A group of aerial images to be tested is input, with the maximum side length of the input images limited to 1024; after feature extraction by the ResNet network, the multi-convolution fusion modules and the Feature Pyramid Network (FPN), 400 candidate target regions (Proposals) are obtained in the images through the RPN.
(2) And the ROI Pooling takes the original image feature map and each candidate target area as input, extracts the feature maps of the candidate target areas and outputs a 7 x 7 feature map with uniform size for next detection frame regression and aerial photography vehicle category classification.
(3) The feature information of each Proposal passes through the fully connected layers, and bounding-box regression and category judgment yield the rectangular position information of the detection box of each aerial vehicle target. Finally, all bounding rectangles identified as aerial vehicle targets are marked in the original image.
(4) The indexes used for evaluating the result are the average precision (AP) and the mean average precision (mAP). True Negative (TN): judged to be a negative sample, and in fact a negative sample; True Positive (TP): judged to be a positive sample, and in fact a positive sample; False Negative (FN): judged to be a negative sample, but actually a positive sample; False Positive (FP): judged to be a positive sample, but actually a negative sample. Recall = TP/(TP + FN), Precision = TP/(TP + FP), and the Precision-Recall (P-R) curve is a two-dimensional curve with precision and recall as the vertical and horizontal axis coordinates. The average precision AP is the area enclosed under the P-R curve of each category, and the mean average precision mAP is the average of the AP values over all categories.
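For illustration, a small sketch of these evaluation quantities, assuming NumPy; the AP is approximated here as the trapezoidal area under the P-R curve, which is one common convention rather than the only one.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Recall = TP / (TP + FN), Precision = TP / (TP + FP)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return precision, recall

def average_precision(precisions: np.ndarray, recalls: np.ndarray) -> float:
    """Area under the P-R curve for one category (AP); mAP is the mean of the
    per-category AP values."""
    order = np.argsort(recalls)
    return float(np.trapz(precisions[order], recalls[order]))
```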
The target detection method based on the multi-convolution fusion network disclosed by the invention has the beneficial effects that:
(1) according to the method, a multi-convolution fusion module is adopted to further extract a plurality of different potential feature information contained in conv1-conv5, key detection features are refined based on an SEnet attention mechanism in the module, and the key features are transmitted to a later layer, so that the detection accuracy of aerial image targets is improved.
(2) Through the detection network based on the Feature Pyramid Network (FPN), the multi-convolution fusion module and Faster RCNN, the network combines the multi-convolution fusion module with the multi-scale feature fusion technology, so that the feature characterization capability of the network for aerial image targets is jointly enhanced.
Fig. 6 is a schematic structural diagram of a target detection system based on a multi-convolution fusion network, and as shown in fig. 6, the target detection system based on the multi-convolution fusion network includes:
the data set acquisition module 301 is configured to use image data of vehicles coming and going in the transportation junction acquired by a camera carried by the unmanned aerial vehicle as a data set.
A network construction module 302, configured to construct a network structure for image target detection.
And the image target detection model training module 303 is configured to train a network structure for image target detection according to the data set to obtain an image target detection model.
And the target detection module 304 is configured to perform target detection on the image data to be detected by using the image target detection model.
The network structure for image target detection comprises: a ResNet101 network 201, a multi-convolution fusion network 202, a region generation network 203, an ROI pooling layer 204, and a detection head 205.
The ResNet101 network 201 comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module which are connected in sequence; the multi-convolution fusion network 202 includes a first multi-convolution fusion module, a second multi-convolution fusion module, a third multi-convolution fusion module, a fourth multi-convolution fusion module, and a fifth multi-convolution fusion module.
The first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module and the fifth multi-convolution fusion module are all used for carrying out multi-convolution feature fusion on the input image.
The output of the fifth convolution module is connected with the input of the fifth multi-convolution fusion module, the output of the fourth convolution module is connected with the input of the fourth multi-convolution fusion module, the output of the third convolution module is connected with the input of the third multi-convolution fusion module, the output of the second convolution module is connected with the input of the second multi-convolution fusion module, and the output of the first convolution module is connected with the input of the first multi-convolution fusion module; the fifth multi-convolution fusion module outputs a fifth feature map, the fifth feature map outputs a fourth feature map through 2 times of upsampling and element-wise addition with the output of the fourth multi-convolution fusion module, the fourth feature map outputs a third feature map through a 3 x 3 convolution operation after 2 times of upsampling and element-wise addition with the output of the third multi-convolution fusion module, the third feature map outputs a second feature map through a 3 x 3 convolution operation after 2 times of upsampling and element-wise addition with the output of the second multi-convolution fusion module, and the second feature map outputs a first feature map through a 3 x 3 convolution operation after 2 times of upsampling and element-wise addition with the output of the first multi-convolution fusion module; the first feature map, the second feature map, the third feature map, the fourth feature map and the fifth feature map are input into the region generation network 203; the region generation network 203 is connected to the ROI pooling layer 204, the ROI pooling layer 204 is connected to the detection head 205, and the detection head 205 is used for outputting a detection result.
The first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module and the fifth multi-convolution fusion module have the same structure and respectively comprise a first convolution branch, a second convolution branch, a third convolution branch, a fourth convolution branch, a first SEnet attention mechanism module, a second SEnet attention mechanism module, a third SEnet attention mechanism module and a fourth SEnet attention mechanism module.
The first convolution branch comprises a convolution operation with a convolution kernel of 1 × 1, a step size of 3 and pixel padding of 0; the second convolution branch comprises a convolution operation with a convolution kernel of 3 × 3, a step size of 2 and pixel padding of 1; the third convolution branch comprises a convolution operation with a convolution kernel of 5 × 5, a step size of 2 and pixel padding of 2; and the fourth convolution branch comprises a convolution operation with a convolution kernel of 7 × 7, a step size of 2 and pixel padding of 3. The feature map output by the first convolution branch is input into the first SEnet attention mechanism module, the feature map output by the second convolution branch is input into the second SEnet attention mechanism module, the feature map output by the third convolution branch is input into the third SEnet attention mechanism module, and the feature map output by the fourth convolution branch is input into the fourth SEnet attention mechanism module;
the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module each perform channel-wise global average pooling on the input feature map to obtain a feature map with a size of 1 × 1 × 512; the feature map with the size of 1 × 1 × 512 is input into a first fully connected layer, which outputs a feature map with a size of 1 × 1 × 512/r; a ReLU activation function is applied to the feature map with the size of 1 × 1 × 512/r; the feature map with the size of 1 × 1 × 512/r is then expanded back to 1 × 1 × 512 by a second fully connected layer, and a feature map containing channel attention information is output through a Sigmoid function; r is a set value.
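As an illustration, a minimal sketch of one such SEnet attention mechanism module is given below, assuming 512 input channels and an unspecified reduction value r (here defaulted to 16); the class name SEBlock and the default value are assumptions of the sketch.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Sketch of a squeeze-and-excitation attention module: channel-wise global
    average pooling, a 512 -> 512/r -> 512 bottleneck of two fully connected
    layers (ReLU then Sigmoid), and channel-wise re-weighting of the input."""

    def __init__(self, channels=512, r=16):  # r is the set value; 16 is an assumed default
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)
        self.fc2 = nn.Linear(channels // r, channels)
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))         # global average pooling -> one value per channel
        w = self.relu(self.fc1(w))     # squeeze to 512/r
        w = self.sigmoid(self.fc2(w))  # expand back to 512, gate to (0, 1)
        return x * w.view(b, c, 1, 1)  # feature map containing channel attention information
```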
The four feature maps containing channel attention information output by the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module are subjected to an element-level addition operation to obtain a fused feature map, and the fused feature map is output after a convolution operation with a convolution kernel of 1 × 1, a step size of 1 and pixel padding of 0.
The features output by the first, second, third and fourth convolution branches are the same size, all 64 × 64 × 512.
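The following sketch puts the pieces together into one multi-convolution fusion module, reusing the SEBlock sketch above. Because the four branch outputs must share the size 64 × 64 × 512 before element-level addition, the sketch assumes a step size of 2 for all four branches (the text lists a step size of 3 for the 1 × 1 branch, which would give a different output size); the class name MultiConvFusion and the channel counts are likewise assumptions.

```python
import torch.nn as nn

class MultiConvFusion(nn.Module):
    """Sketch of a multi-convolution fusion module: four parallel convolution branches
    (1x1, 3x3, 5x5, 7x7), one SE attention module per branch, element-level addition
    of the four attention-weighted maps, and a final 1x1, stride-1 convolution."""

    def __init__(self, in_channels, out_channels=512, r=16):
        super().__init__()
        # Assumed stride 2 for every branch so that all outputs share one spatial size.
        self.branch1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=2, padding=0)
        self.branch2 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1)
        self.branch3 = nn.Conv2d(in_channels, out_channels, kernel_size=5, stride=2, padding=2)
        self.branch4 = nn.Conv2d(in_channels, out_channels, kernel_size=7, stride=2, padding=3)
        self.attention = nn.ModuleList([SEBlock(out_channels, r) for _ in range(4)])
        self.fuse = nn.Conv2d(out_channels, out_channels, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        branches = [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)]
        weighted = [se(f) for se, f in zip(self.attention, branches)]  # per-branch channel attention
        fused = weighted[0] + weighted[1] + weighted[2] + weighted[3]  # element-level addition
        return self.fuse(fused)
```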
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention. Meanwhile, a person skilled in the art may, according to the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. A target detection method based on a multi-convolution fusion network is characterized by comprising the following steps:
taking image data of vehicles coming and going in a traffic junction acquired by a camera carried by an unmanned aerial vehicle as a data set;
constructing a network structure for image target detection;
training the network structure for image target detection according to the data set to obtain an image target detection model;
carrying out target detection on image data to be detected by using the image target detection model;
the network structure for image target detection comprises: a ResNet101 network, a multi-convolution fusion network, a region generation network, an ROI pooling layer and a detection head;
the ResNet101 network comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module which are connected in sequence; the multi-convolution fusion network comprises a first multi-convolution fusion module, a second multi-convolution fusion module, a third multi-convolution fusion module, a fourth multi-convolution fusion module and a fifth multi-convolution fusion module;
the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are all used for performing multi-convolution feature fusion on an input image;
an output of the fifth convolution module is connected to an input of the fifth multi-convolution fusion module, an output of the fourth convolution module is connected to an input of the fourth multi-convolution fusion module, an output of the third convolution module is connected to an input of the third multi-convolution fusion module, an output of the second convolution module is connected to an input of the second multi-convolution fusion module, and an output of the first convolution module is connected to an input of the first multi-convolution fusion module; the output of the fifth multi-convolution fusion module is a fifth feature map, the fifth feature map outputs a fourth feature map through 2-time upsampling and element-by-element addition with the output of the fourth multi-convolution fusion module, the fourth feature map outputs a third feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the third multi-convolution fusion module, the third feature map outputs a second feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the second multi-convolution fusion module, and the second feature map outputs a first feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the first multi-convolution fusion module; inputting the first feature map, the second feature map, the third feature map, the fourth feature map and the fifth feature map into the area generation network; the region generation network is connected with the ROI pooling layer, the ROI pooling layer is connected with the detection head, and the detection head is used for outputting a detection result.
2. The multi-convolution fusion network-based object detection method according to claim 1, wherein the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are identical in structure and each include a first convolution branch, a second convolution branch, a third convolution branch, a fourth convolution branch, a first SEnet attention mechanism module, a second SEnet attention mechanism module, a third SEnet attention mechanism module, and a fourth SEnet attention mechanism module;
the first convolution branch comprises convolution operations with convolution kernels of 1 × 1, step sizes of 3 and pixel padding of 0, the second convolution branch comprises convolution operations with convolution kernels of 3 × 3, step sizes of 2 and pixel padding of 1, the third convolution branch comprises convolution operations with convolution kernels of 5 × 5, step sizes of 2 and pixel padding of 2, and the fourth convolution branch comprises convolution operations with convolution kernels of 7 × 7, step sizes of 2 and pixel padding of 3; the feature map of the first convolution branch output is input to the first SEnet attention mechanism module, the feature map of the second convolution branch output is input to the second SEnet attention mechanism module, the feature map of the third convolution branch output is input to the third SEnet attention mechanism module, and the feature map of the fourth convolution branch output is input to the fourth SEnet attention mechanism module;
the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module each perform channel-wise global average pooling on the input feature map to obtain a feature map with a size of 1 × 1 × 512; the feature map with the size of 1 × 1 × 512 is input into a first fully connected layer, which outputs a feature map with a size of 1 × 1 × 512/r; a ReLU activation function is applied to the feature map with the size of 1 × 1 × 512/r; the feature map with the size of 1 × 1 × 512/r is then expanded back to 1 × 1 × 512 by a second fully connected layer, and a feature map containing channel attention information is output through a Sigmoid function; r is a set value;
the four feature maps containing channel attention information output by the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module are subjected to an element-level addition operation to obtain a fused feature map, and the fused feature map is output after a convolution operation with a convolution kernel of 1 × 1, a step size of 1 and pixel padding of 0.
3. The target detection method based on the multi-convolution fusion network according to claim 2, wherein the first convolution branch, the second convolution branch, the third convolution branch and the fourth convolution branch output features of the same size, namely 64 × 64 × 512.
4. The method for detecting the target based on the multi-convolution fusion network is characterized in that the detection head comprises a regression branch and a classification branch; the classification branch determines the category of the detection target by using the classification loss, and the regression branch determines the position information of the detection target by using the regression loss.
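By way of illustration, a minimal sketch of such a detection head is given below: a shared trunk of fully connected layers followed by a classification branch and a regression branch. The ROI feature size (512 × 7 × 7), the hidden width, the class count and the names are assumptions of the sketch, not details recited in the claim.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of a two-branch detection head: the classification branch predicts the
    category of the detection target, the regression branch predicts its position."""

    def __init__(self, in_features=512 * 7 * 7, num_classes=2):  # assumed ROI size and class count
        super().__init__()
        self.shared = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
        )
        self.cls_branch = nn.Linear(1024, num_classes)      # class scores (classification loss)
        self.reg_branch = nn.Linear(1024, num_classes * 4)  # box deltas (regression loss)

    def forward(self, roi_features):
        h = self.shared(roi_features)
        return self.cls_branch(h), self.reg_branch(h)
```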
5. The target detection method based on the multi-convolution fusion network according to claim 1, wherein the taking of image data of vehicles coming and going in a transportation junction collected by a camera carried by an unmanned aerial vehicle as a data set specifically includes:
collecting image data of vehicles coming and going in a traffic junction through a camera carried by an unmanned aerial vehicle;
carrying out random adjustment on brightness, saturation and contrast of the image data to obtain preprocessed image data;
dividing the preprocessed image data into a training set and a test set;
adopting Labelme software to label the vehicle targets in the images of the training set according to their categories to obtain a category-labeled training set; the test set and the category-labeled training set form the data set.
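A brief sketch of the preprocessing and splitting step is given below; it uses torchvision's ColorJitter for the random brightness, saturation and contrast adjustment. The jitter ranges, the 8:2 split ratio, the file pattern and the function name are assumptions of the sketch, and the Labelme category labelling itself takes place outside this code.

```python
import random
from pathlib import Path
from PIL import Image
from torchvision import transforms

# Random photometric adjustment of brightness, contrast and saturation (ranges are assumptions).
augment = transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3)

def preprocess_and_split(image_dir, train_ratio=0.8, seed=0):
    """Apply random brightness/saturation/contrast adjustment to each collected image
    and split the preprocessed images into a training set and a test set."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    images = [augment(Image.open(p).convert("RGB")) for p in paths]
    random.seed(seed)
    random.shuffle(images)
    split = int(len(images) * train_ratio)
    return images[:split], images[split:]  # training set, test set
```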
6. The method for detecting the target based on the multi-convolution fusion network according to claim 1, wherein training a network structure of image target detection according to the data set to obtain an image target detection model specifically includes:
when the network structure for image target detection is trained according to the data set, calculating a loss function, and adjusting parameters in the network structure according to the loss function to obtain the image target detection model; the loss function includes a classification loss and a regression loss.
7. The method for detecting the target based on the multi-convolution fusion network according to claim 6, wherein the loss function is expressed as:
$$
L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^{*}) + \lambda \frac{1}{N_{reg}} \sum_i p_i^{*} L_{reg}(t_i, t_i^{*})
$$

wherein $L(\{p_i\},\{t_i\})$ represents the loss function, $i$ denotes the $i$-th sample, $N_{cls}$ is a first normalization parameter, $N_{reg}$ is a second normalization parameter, $\lambda$ is a weight balance parameter, $L_{cls}$ represents the classification loss, $L_{reg}$ represents the regression loss, $p_i$ denotes the probability that the $i$-th sample is predicted as a vehicle, $p_i^{*}$ is the annotated label of the $i$-th sample, $t_i$ represents the translation and scaling parameters of the predicted bounding box, and $t_i^{*}$ represents the translation and scaling parameters of the real bounding box.
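For illustration, a minimal sketch of this loss is shown below, using binary cross-entropy for the classification term and smooth L1 for the regression term; the concrete choice of the two loss terms and the function name are assumptions of the sketch, and only the two-term, normalized and λ-weighted structure is taken from the claim.

```python
import torch.nn.functional as F

def detection_loss(p, p_star, t, t_star, lam=1.0):
    """p:      predicted probability that each sample is a vehicle, shape (N,)
       p_star: annotated label of each sample (1 = vehicle, 0 = background), shape (N,)
       t:      translation/scaling parameters of the predicted bounding boxes, shape (N, 4)
       t_star: translation/scaling parameters of the real bounding boxes, shape (N, 4)
       lam:    weight balance parameter between classification and regression."""
    n_cls = p.numel()                             # first normalization parameter
    n_reg = max(int(p_star.sum().item()), 1)      # second normalization parameter
    cls_loss = F.binary_cross_entropy(p, p_star.float(), reduction="sum") / n_cls
    reg_loss = (p_star.float().unsqueeze(1) *     # regression applied to positive samples only
                F.smooth_l1_loss(t, t_star, reduction="none")).sum() / n_reg
    return cls_loss + lam * reg_loss
```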
8. A target detection system based on a multi-convolution fusion network is characterized by comprising:
the data set acquisition module is used for taking image data of vehicles coming and going in the traffic junction acquired by a camera carried by the unmanned aerial vehicle as a data set;
the network construction module is used for constructing a network structure for detecting the image target;
the image target detection model training module is used for training the network structure for image target detection according to the data set to obtain an image target detection model;
the target detection module is used for carrying out target detection on the image data to be detected by utilizing the image target detection model;
the network structure for image target detection comprises: a ResNet101 network, a multi-convolution fusion network, a region generation network, an ROI pooling layer and a detection head;
the ResNet101 network comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module which are connected in sequence; the multi-convolution fusion network comprises a first multi-convolution fusion module, a second multi-convolution fusion module, a third multi-convolution fusion module, a fourth multi-convolution fusion module and a fifth multi-convolution fusion module;
the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are all used for performing multi-convolution feature fusion on an input image;
an output of the fifth convolution module is connected to an input of the fifth multi-convolution fusion module, an output of the fourth convolution module is connected to an input of the fourth multi-convolution fusion module, an output of the third convolution module is connected to an input of the third multi-convolution fusion module, an output of the second convolution module is connected to an input of the second multi-convolution fusion module, and an output of the first convolution module is connected to an input of the first multi-convolution fusion module; the output of the fifth multi-convolution fusion module is a fifth feature map, the fifth feature map outputs a fourth feature map through 2-time upsampling and element-by-element addition with the output of the fourth multi-convolution fusion module, the fourth feature map outputs a third feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the third multi-convolution fusion module, the third feature map outputs a second feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the second multi-convolution fusion module, and the second feature map outputs a first feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the first multi-convolution fusion module; inputting the first feature map, the second feature map, the third feature map, the fourth feature map and the fifth feature map into the area generation network; the region generation network is connected with the ROI pooling layer, the ROI pooling layer is connected with the detection head, and the detection head is used for outputting a detection result.
9. The multi-convolution fusion network based object detection system of claim 8, wherein the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are identical in structure and each include a first convolution branch, a second convolution branch, a third convolution branch, a fourth convolution branch, a first SEnet attention mechanism module, a second SEnet attention mechanism module, a third SEnet attention mechanism module, and a fourth SEnet attention mechanism module;
the first convolution branch comprises convolution operations with convolution kernels of 1 × 1, step sizes of 3 and pixel padding of 0, the second convolution branch comprises convolution operations with convolution kernels of 3 × 3, step sizes of 2 and pixel padding of 1, the third convolution branch comprises convolution operations with convolution kernels of 5 × 5, step sizes of 2 and pixel padding of 2, and the fourth convolution branch comprises convolution operations with convolution kernels of 7 × 7, step sizes of 2 and pixel padding of 3; the feature map of the first convolution branch output is input to the first SEnet attention mechanism module, the feature map of the second convolution branch output is input to the second SEnet attention mechanism module, the feature map of the third convolution branch output is input to the third SEnet attention mechanism module, and the feature map of the fourth convolution branch output is input to the fourth SEnet attention mechanism module;
the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module each perform channel-wise global average pooling on the input feature map to obtain a feature map with a size of 1 × 1 × 512; the feature map with the size of 1 × 1 × 512 is input into a first fully connected layer, which outputs a feature map with a size of 1 × 1 × 512/r; a ReLU activation function is applied to the feature map with the size of 1 × 1 × 512/r; the feature map with the size of 1 × 1 × 512/r is then expanded back to 1 × 1 × 512 by a second fully connected layer, and a feature map containing channel attention information is output through a Sigmoid function; r is a set value;
the four feature maps containing channel attention information output by the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module are subjected to an element-level addition operation to obtain a fused feature map, and the fused feature map is output after a convolution operation with a convolution kernel of 1 × 1, a step size of 1 and pixel padding of 0.
10. The system according to claim 9, wherein the first, second, third and fourth convolution branches output features of the same size, namely 64 × 64 × 512.
CN202110707169.0A 2021-06-25 2021-06-25 Target detection method and system based on multi-convolution fusion network Active CN113255589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110707169.0A CN113255589B (en) 2021-06-25 2021-06-25 Target detection method and system based on multi-convolution fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110707169.0A CN113255589B (en) 2021-06-25 2021-06-25 Target detection method and system based on multi-convolution fusion network

Publications (2)

Publication Number Publication Date
CN113255589A true CN113255589A (en) 2021-08-13
CN113255589B CN113255589B (en) 2021-10-15

Family

ID=77189569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110707169.0A Active CN113255589B (en) 2021-06-25 2021-06-25 Target detection method and system based on multi-convolution fusion network

Country Status (1)

Country Link
CN (1) CN113255589B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210056351A1 (en) * 2018-06-04 2021-02-25 Jiangnan University Multi-scale aware pedestrian detection method based on improved full convolutional network
CN112101373A (en) * 2019-06-18 2020-12-18 富士通株式会社 Object detection method and device based on deep learning network and electronic equipment
CN111951212A (en) * 2020-04-08 2020-11-17 北京交通大学 Method for identifying defects of contact network image of railway
CN112465746A (en) * 2020-11-02 2021-03-09 新疆天维无损检测有限公司 Method for detecting small defects in radiographic film
CN112364855A (en) * 2021-01-14 2021-02-12 北京电信易通信息技术股份有限公司 Video target detection method and system based on multi-scale feature fusion
CN112766409A (en) * 2021-02-01 2021-05-07 西北工业大学 Feature fusion method for remote sensing image target detection

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114511515A (en) * 2022-01-17 2022-05-17 山东高速路桥国际工程有限公司 Bolt corrosion detection system and detection method based on BoltCorrDetNet network
CN114332849A (en) * 2022-03-16 2022-04-12 科大天工智能装备技术(天津)有限公司 Crop growth state combined monitoring method and device and storage medium
CN114332849B (en) * 2022-03-16 2022-08-16 科大天工智能装备技术(天津)有限公司 Crop growth state combined monitoring method and device and storage medium
CN114943903A (en) * 2022-05-25 2022-08-26 广西财经学院 Self-adaptive clustering target detection method for aerial image of unmanned aerial vehicle
CN114943903B (en) * 2022-05-25 2023-04-07 广西财经学院 Self-adaptive clustering target detection method for aerial image of unmanned aerial vehicle
CN115272992A (en) * 2022-09-30 2022-11-01 松立控股集团股份有限公司 Vehicle attitude estimation method
CN115861938A (en) * 2023-02-06 2023-03-28 北京中超伟业信息安全技术股份有限公司 Unmanned aerial vehicle counter-braking method and system based on unmanned aerial vehicle identification
CN117952977A (en) * 2024-03-27 2024-04-30 山东泉海汽车科技有限公司 Pavement crack identification method, device and medium based on improvement yolov s
CN117952977B (en) * 2024-03-27 2024-06-04 山东泉海汽车科技有限公司 Pavement crack identification method, device and medium based on improvement yolov s

Also Published As

Publication number Publication date
CN113255589B (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN113255589B (en) Target detection method and system based on multi-convolution fusion network
CN108596101B (en) Remote sensing image multi-target detection method based on convolutional neural network
CN108764063B (en) Remote sensing image time-sensitive target identification system and method based on characteristic pyramid
CN111738110A (en) Remote sensing image vehicle target detection method based on multi-scale attention mechanism
CN111222396A (en) All-weather multispectral pedestrian detection method
CN113313082B (en) Target detection method and system based on multitask loss function
CN113361528B (en) Multi-scale target detection method and system
CN112084869A (en) Compact quadrilateral representation-based building target detection method
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN111144418B (en) Railway track area segmentation and extraction method
CN113313094B (en) Vehicle-mounted image target detection method and system based on convolutional neural network
CN112800906A (en) Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile
CN114820655B (en) Weak supervision building segmentation method taking reliable area as attention mechanism supervision
CN116229452B (en) Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN112766409A (en) Feature fusion method for remote sensing image target detection
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S
CN113052108A (en) Multi-scale cascade aerial photography target detection method and system based on deep neural network
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN113111740A (en) Characteristic weaving method for remote sensing image target detection
CN117853955A (en) Unmanned aerial vehicle small target detection method based on improved YOLOv5
CN118015490A (en) Unmanned aerial vehicle aerial image small target detection method, system and electronic equipment
CN114550016B (en) Unmanned aerial vehicle positioning method and system based on context information perception
CN112801195A (en) Deep learning-based fog visibility prediction method, storage device and server
CN111726535A (en) Smart city CIM video big data image quality control method based on vehicle perception

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant