CN113255589A - Target detection method and system based on multi-convolution fusion network - Google Patents

Target detection method and system based on multi-convolution fusion network

Info

Publication number
CN113255589A
CN113255589A (application CN202110707169.0A)
Authority
CN
China
Prior art keywords
convolution
module
output
fusion
feature map
Prior art date
Legal status
Granted
Application number
CN202110707169.0A
Other languages
Chinese (zh)
Other versions
CN113255589B (en)
Inventor
陈克鹏 (Chen Kepeng)
Current Assignee
Beijing Telecom Easiness Information Technology Co Ltd
Original Assignee
Beijing Telecom Easiness Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Telecom Easiness Information Technology Co Ltd
Priority to CN202110707169.0A
Publication of CN113255589A
Application granted
Publication of CN113255589B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles


Abstract

The invention relates to a target detection method and a system based on a multi-convolution fusion network, wherein the method comprises the following steps: taking image data of vehicles coming and going in a traffic junction acquired by a camera carried by an unmanned aerial vehicle as a data set; constructing a network structure for image target detection; training the network structure for image target detection according to the data set to obtain an image target detection model; carrying out target detection on image data to be detected by using the image target detection model; the network structure for image target detection comprises: a ResNet101 network, a multi-convolution fusion network, a region generation network, an ROI pooling layer and a detection head. The method enhances the representation capability of the image target, and further improves the detection accuracy.

Description

Target detection method and system based on multi-convolution fusion network
Technical Field
The invention relates to the field of image processing, in particular to a target detection method and a target detection system based on a multi-convolution fusion network.
Background
In recent years, the unmanned aerial vehicle industry has developed rapidly and is widely applied to rescue, surveying and mapping, freight transportation, reconnaissance, traffic supervision and the like. Accurate detection of targets in aerial images is a precondition for an unmanned aerial vehicle to successfully complete such tasks. However, due to the influence of imaging angle and height, targets in aerial images often have a small visible area, low resolution and much background interference, and carry little characteristic information of their own; compared with targets in natural scene images they are more difficult to detect, and at present the detection accuracy on aerial images still needs to be improved.
Disclosure of Invention
The invention aims to provide a target detection method and a target detection system based on a multi-convolution fusion network, which improve the detection accuracy.
In order to achieve the purpose, the invention provides the following scheme:
a target detection method based on a multi-convolution fusion network comprises the following steps:
taking image data of vehicles coming and going in a traffic junction acquired by a camera carried by an unmanned aerial vehicle as a data set;
constructing a network structure for image target detection;
training the network structure for image target detection according to the data set to obtain an image target detection model;
carrying out target detection on image data to be detected by using the image target detection model;
the network structure for image target detection comprises: a ResNet101 network, a multi-convolution fusion network, a region generation network, an ROI pooling layer and a detection head;
the ResNet101 network comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module which are connected in sequence; the multi-convolution fusion network comprises a first multi-convolution fusion module, a second multi-convolution fusion module, a third multi-convolution fusion module, a fourth multi-convolution fusion module and a fifth multi-convolution fusion module;
the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are all used for performing multi-convolution feature fusion on an input image;
an output of the fifth convolution module is connected to an input of the fifth multi-convolution fusion module, an output of the fourth convolution module is connected to an input of the fourth multi-convolution fusion module, an output of the third convolution module is connected to an input of the third multi-convolution fusion module, an output of the second convolution module is connected to an input of the second multi-convolution fusion module, and an output of the first convolution module is connected to an input of the first multi-convolution fusion module; the output of the fifth multi-convolution fusion module is a fifth feature map, the fifth feature map outputs a fourth feature map through 2-time upsampling and element-by-element addition with the output of the fourth multi-convolution fusion module, the fourth feature map outputs a third feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the third multi-convolution fusion module, the third feature map outputs a second feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the second multi-convolution fusion module, and the second feature map outputs a first feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the first multi-convolution fusion module; inputting the first feature map, the second feature map, the third feature map, the fourth feature map and the fifth feature map into the area generation network; the region generation network is connected with the ROI pooling layer, the ROI pooling layer is connected with the detection head, and the detection head is used for outputting a detection result.
Optionally, the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are identical in structure, and each includes a first convolution branch, a second convolution branch, a third convolution branch, a fourth convolution branch, a first SEnet attention mechanism module, a second SEnet attention mechanism module, a third SEnet attention mechanism module, and a fourth SEnet attention mechanism module;
the first convolution branch comprises convolution operations with convolution kernels of 1 × 1, step sizes of 3 and pixel padding of 0, the second convolution branch comprises convolution operations with convolution kernels of 3 × 3, step sizes of 2 and pixel padding of 1, the third convolution branch comprises convolution operations with convolution kernels of 5 × 5, step sizes of 2 and pixel padding of 2, and the fourth convolution branch comprises convolution operations with convolution kernels of 7 × 7, step sizes of 2 and pixel padding of 3; the feature map of the first convolution branch output is input to the first SEnet attention mechanism module, the feature map of the second convolution branch output is input to the second SEnet attention mechanism module, the feature map of the third convolution branch output is input to the third SEnet attention mechanism module, and the feature map of the fourth convolution branch output is input to the fourth SEnet attention mechanism module;
the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module perform global average pooling on input feature maps based on channel dimensions to obtain feature maps with the size of 1 × 1 × 512, the feature maps with the size of 1 × 1 × 512 are input into a first full connection layer, the first full connection layer outputs the feature maps with the size of 1 × 1 × 512/r, a ReLU activation function is adopted to perform activation operation on the feature maps with the size of 1 × 1 × 512/r, the feature maps with the size of 1 × 1 × 512/r are expanded into 1 × 1 × 512 through the second full connection layer, and then the feature maps containing channel attention information are output through a Sigmoid function; r is a set value;
the four feature graphs which are output by the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module and contain channel attention information are subjected to element-level addition operation to obtain a feature fusion feature graph, and the feature fusion feature graph is subjected to convolution operation with convolution kernel of 1 × 1, step length of 1 and pixel filling of 0 and then output.
Optionally, the sizes of the features output by the first convolution branch, the second convolution branch, the third convolution branch and the fourth convolution branch are the same, and are all 64 × 64 × 512.
Optionally, the detection head comprises a regression branch and a classification branch; the classification branch determines the category of the detection target by using the classification loss, and the regression branch determines the position information of the detection target by using the regression loss.
Optionally, the image data of vehicles coming and going in the transportation junction collected by the camera carried by the unmanned aerial vehicle is used as a data set, and the method specifically includes:
the method comprises the steps that image data of vehicles coming in and going out of a traffic junction are collected through a camera carried by an unmanned aerial vehicle;
carrying out random adjustment on brightness, saturation and contrast of the image data to obtain preprocessed image data;
dividing the preprocessed image data into a training set and a test set;
adopting Labelme software to label the vehicle targets in the images in the training set according to their categories to obtain the labeled training set; the test set and the class-labeled training set form the data set.
Optionally, the training of the network structure for image target detection according to the data set to obtain an image target detection model specifically includes:
when a network structure for detecting the image target is trained according to the data set, calculating a loss function, and adjusting parameters in the network structure according to the loss function to obtain an image target detection model; the loss function includes a classification loss and a regression loss.
Optionally, the loss function is expressed as:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*)

wherein L denotes the loss function, i denotes the i-th sample, 1/N_cls is a first normalization parameter, 1/N_reg is a second normalization parameter, λ is a weight balance parameter, L_cls denotes the classification loss, L_reg denotes the regression loss, p_i denotes the probability that the i-th sample is predicted as a vehicle, p_i* is the labeled label of the i-th sample, t_i denotes the translation scaling parameters of the predicted bounding box, and t_i* denotes the translation scaling parameters of the real bounding box.
The invention also discloses a target detection system based on the multi-convolution fusion network, which comprises the following steps:
the data set acquisition module is used for taking image data of vehicles coming and going in the traffic junction acquired by a camera carried by the unmanned aerial vehicle as a data set;
the network construction module is used for constructing a network structure for detecting the image target;
the image target detection model training module is used for training the network structure for image target detection according to the data set to obtain an image target detection model;
the target detection module is used for carrying out target detection on the image data to be detected by utilizing the image target detection model;
the network structure for image target detection comprises: a ResNet101 network, a multi-convolution fusion network, a region generation network, an ROI pooling layer and a detection head;
the ResNet101 network comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module which are connected in sequence; the multi-convolution fusion network comprises a first multi-convolution fusion module, a second multi-convolution fusion module, a third multi-convolution fusion module, a fourth multi-convolution fusion module and a fifth multi-convolution fusion module;
the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are all used for performing multi-convolution feature fusion on an input image;
an output of the fifth convolution module is connected to an input of the fifth multi-convolution fusion module, an output of the fourth convolution module is connected to an input of the fourth multi-convolution fusion module, an output of the third convolution module is connected to an input of the third multi-convolution fusion module, an output of the second convolution module is connected to an input of the second multi-convolution fusion module, and an output of the first convolution module is connected to an input of the first multi-convolution fusion module; the output of the fifth multi-convolution fusion module is a fifth feature map, the fifth feature map outputs a fourth feature map through 2-time upsampling and element-by-element addition with the output of the fourth multi-convolution fusion module, the fourth feature map outputs a third feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the third multi-convolution fusion module, the third feature map outputs a second feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the second multi-convolution fusion module, and the second feature map outputs a first feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the first multi-convolution fusion module; inputting the first feature map, the second feature map, the third feature map, the fourth feature map and the fifth feature map into the area generation network; the region generation network is connected with the ROI pooling layer, the ROI pooling layer is connected with the detection head, and the detection head is used for outputting a detection result.
Optionally, the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are identical in structure, and each includes a first convolution branch, a second convolution branch, a third convolution branch, a fourth convolution branch, a first SEnet attention mechanism module, a second SEnet attention mechanism module, a third SEnet attention mechanism module, and a fourth SEnet attention mechanism module;
the first convolution branch comprises convolution operations with convolution kernels of 1 × 1, step sizes of 3 and pixel padding of 0, the second convolution branch comprises convolution operations with convolution kernels of 3 × 3, step sizes of 2 and pixel padding of 1, the third convolution branch comprises convolution operations with convolution kernels of 5 × 5, step sizes of 2 and pixel padding of 2, and the fourth convolution branch comprises convolution operations with convolution kernels of 7 × 7, step sizes of 2 and pixel padding of 3; the feature map of the first convolution branch output is input to the first SEnet attention mechanism module, the feature map of the second convolution branch output is input to the second SEnet attention mechanism module, the feature map of the third convolution branch output is input to the third SEnet attention mechanism module, and the feature map of the fourth convolution branch output is input to the fourth SEnet attention mechanism module;
the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module perform global average pooling on input feature maps based on channel dimensions to obtain feature maps with the size of 1 × 1 × 512, the feature maps with the size of 1 × 1 × 512 are input into a first full connection layer, the first full connection layer outputs the feature maps with the size of 1 × 1 × 512/r, a ReLU activation function is adopted to perform activation operation on the feature maps with the size of 1 × 1 × 512/r, the feature maps with the size of 1 × 1 × 512/r are expanded into 1 × 1 × 512 through the second full connection layer, and then the feature maps containing channel attention information are output through a Sigmoid function; r is a set value;
the four feature graphs which are output by the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module and contain channel attention information are subjected to element-level addition operation to obtain a feature fusion feature graph, and the feature fusion feature graph is subjected to convolution operation with convolution kernel of 1 × 1, step length of 1 and pixel filling of 0 and then output.
Optionally, the sizes of the features output by the first convolution branch, the second convolution branch, the third convolution branch and the fourth convolution branch are the same, and are all 64 × 64 × 512.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the invention, different characteristic information is fused through each multi-convolution fusion area module of the multi-convolution fusion network, and multi-scale fusion is carried out on different characteristic information, so that the representation capability of the image target is enhanced, and the detection accuracy is further improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a target detection method based on a multi-convolution fusion network according to the present invention;
FIG. 2 is a diagram illustrating a network structure for image target detection according to the present invention;
FIG. 3 is a schematic diagram of a network structure for image target detection according to the present invention;
FIG. 4 is a multi-convolution fusion module configuration of the present invention;
FIG. 5 is a schematic diagram of a target detection method based on a multi-convolution fusion network according to the present invention;
fig. 6 is a schematic structural diagram of a target detection system based on a multi-convolution fusion network according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a target detection method and system based on a multi-convolution fusion network, which improve the detection accuracy.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic flow chart of a target detection method based on a multi-convolution fusion network according to the present invention, and as shown in fig. 1, the target detection method based on the multi-convolution fusion network includes the following steps:
step 101: the image data of vehicles coming and going in the traffic junction collected through a camera carried by the unmanned aerial vehicle is used as a data set.
Wherein, step 101 specifically includes:
the camera that carries through unmanned aerial vehicle gathers the image data of coming and going vehicle in the traffic hub.
And carrying out random adjustment on brightness, saturation and contrast of the image data to obtain preprocessed image data.
And dividing the preprocessed image data into a training set and a testing set.
Adopting Labelme software to label the types of the vehicle targets in the images in the training set to obtain a labeled training set; the test set and the class labeled training set constitute a data set.
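By way of illustration only, the following sketch shows how the random brightness, saturation and contrast adjustment and the training/test split described above could be implemented. PyTorch/torchvision and the specific jitter ranges and split ratio are assumptions, since the patent does not name a framework or parameter values.

```python
import random
from torchvision import transforms
from PIL import Image

def preprocess(image_path: str) -> Image.Image:
    """Randomly adjust brightness, saturation and contrast of an aerial image."""
    image = Image.open(image_path).convert("RGB")
    # Illustrative jitter ranges; the patent only states the adjustment is random.
    jitter = transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3)
    return jitter(image)

def split_dataset(samples, train_ratio=0.8, seed=42):
    """Divide the preprocessed images into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```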
Step 102: and constructing a network structure for image target detection.
Step 103: and training a network structure for image target detection according to the data set to obtain an image target detection model.
Wherein, step 103 specifically comprises:
when a network structure for image target detection is trained according to the data set, calculating a loss function, and adjusting parameters in the network structure according to the loss function to obtain an image target detection model; the loss function includes classification loss and regression loss.
The loss function is expressed as:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*)

wherein L denotes the loss function, i denotes the i-th sample, 1/N_cls is a first normalization parameter, 1/N_reg is a second normalization parameter, λ is a weight balance parameter, L_cls denotes the classification loss, L_reg denotes the regression loss, p_i denotes the probability that the i-th sample is predicted as a vehicle, p_i* is the labeled label of the i-th sample, t_i denotes the translation scaling parameters of the predicted bounding box, and t_i* denotes the translation scaling parameters of the real bounding box.
Step 104: and carrying out target detection on the image data to be detected by using the image target detection model.
Fig. 2-3 are schematic diagrams of the network structure for image target detection according to the present invention, and as shown in fig. 2 and 3, the network structure for image target detection includes: a ResNet101 network 201, a multi-convolution fusion network 202, a region generation network 203, a ROI (region of interest) pooling layer 204, and a detection head 205.
The ResNet101 network 201 comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module which are connected in sequence; the multi-convolution fusion network 202 includes a first multi-convolution fusion module, a second multi-convolution fusion module, a third multi-convolution fusion module, a fourth multi-convolution fusion module, and a fifth multi-convolution fusion module.
The first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module and the fifth multi-convolution fusion module are all used for carrying out multi-convolution feature fusion on the input image.
The output of the fifth convolution module is connected with the input of the fifth multi-convolution fusion module, the output of the fourth convolution module is connected with the input of the fourth multi-convolution fusion module, the output of the third convolution module is connected with the input of the third multi-convolution fusion module, the output of the second convolution module is connected with the input of the second multi-convolution fusion module, and the output of the first convolution module is connected with the input of the first multi-convolution fusion module; the fifth multi-convolution fusion module outputs a fifth feature map, the fifth feature map outputs a fourth feature map through 2 times of upsampling and element-wise addition with the output of the fourth multi-convolution fusion module, the fourth feature map outputs a third feature map through a 3 x 3 convolution operation after 2 times of upsampling and element-wise addition with the output of the third multi-convolution fusion module, the third feature map outputs a second feature map through a 3 x 3 convolution operation after 2 times of upsampling and element-wise addition with the output of the second multi-convolution fusion module, and the second feature map outputs a first feature map through a 3 x 3 convolution operation after 2 times of upsampling and element-wise addition with the output of the first multi-convolution fusion module; the first feature map, the second feature map, the third feature map, the fourth feature map and the fifth feature map are input into the region generation network 203; the region generation network 203 is connected to the ROI pooling layer 204, the ROI pooling layer 204 is connected to the detection head 205, and the detection head 205 is used for outputting a detection result. The region generation network 203 is used to generate a series of candidate target regions.
Specifically, the ROI pooling layer 204 extracts, from the first feature map, the second feature map, the third feature map and the fourth feature map, the feature maps of the candidate target regions generated by the region generation network 203.
Fig. 4 is a diagram of a multi-convolution fusion module according to the present invention, and as shown in fig. 4, the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module have the same structure, and each of the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module includes a first convolution branch, a second convolution branch, a third convolution branch, a fourth convolution branch, a first SEnet attention mechanism module, a second SEnet attention mechanism module, a third SEnet attention mechanism module, and a fourth SEnet attention mechanism module.
The first convolution branch comprises convolution operations with convolution kernels of 1 × 1, step size of 3 and pixel filling of 0, the second convolution branch comprises convolution operations with convolution kernels of 3 × 3, step size of 2 and pixel filling of 1, the third convolution branch comprises convolution operations with convolution kernels of 5 × 5, step size of 2 and pixel filling of 2, and the fourth convolution branch comprises convolution operations with convolution kernels of 7 × 7, step size of 2 and pixel filling of 3; the characteristic diagram output by the first convolution branch is input into a first SEnet attention mechanism module, the characteristic diagram output by the second convolution branch is input into a second SEnet attention mechanism module, the characteristic diagram output by the third convolution branch is input into a third SEnet attention mechanism module, and the characteristic diagram output by the fourth convolution branch is input into a fourth SEnet attention mechanism module.
The first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module perform global average pooling on input feature maps based on channel dimensions to obtain feature maps with the size of 1 × 1 × 512, the feature maps with the size of 1 × 1 × 512 are input into a first full connection layer, the first full connection layer outputs the feature maps with the size of 1 × 1 × 512/r, a ReLU activation function is adopted to perform activation operation on the feature maps with the size of 1 × 1 × 512/r, the feature maps with the size of 1 × 1 × 512/r are expanded into 1 × 1 × 512 through the second full connection layer, and then the feature maps containing channel attention information are output through a Sigmoid function; r is a set value.
And four feature graphs containing channel attention information output by the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module are subjected to element-level addition operation to obtain a feature fusion feature graph, and the feature fusion feature graph is subjected to convolution operation with a convolution kernel of 1 × 1, a step length of 1 and pixel filling of 0 and then output.
The features output by the first, second, third and fourth convolution branches are the same size, all 64 × 64 × 512.
The detection head 205 includes a regression branch and a classification branch; the classification branch determines the category of the detection target by using the classification loss, and the regression branch determines the position information of the detection target by using the regression loss.
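As an illustration of this two-branch detection head, the following is a minimal sketch assuming a PyTorch-style implementation, the 7 × 7 pooled ROI features and the two 1024-dimensional fully connected layers described later in the embodiment; the class count and layer sizes are otherwise assumptions.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of the detection head: two 1024-d fully connected layers over the
    7x7 ROI features, then a classification branch and a regression branch."""
    def __init__(self, in_channels: int = 256, num_classes: int = 2):  # assumed: background + vehicle
        super().__init__()
        self.fc1 = nn.Linear(in_channels * 7 * 7, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.cls_branch = nn.Linear(1024, num_classes)      # class scores
        self.reg_branch = nn.Linear(1024, num_classes * 4)  # per-class box deltas

    def forward(self, roi_feats: torch.Tensor):
        x = roi_feats.flatten(start_dim=1)                  # [num_rois, in_channels*7*7]
        x = torch.relu(self.fc2(torch.relu(self.fc1(x))))
        return self.cls_branch(x), self.reg_branch(x)
```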
And training and optimizing parameters of a network structure for image target detection by adopting an aerial image data set, finally performing model test, and performing target detection on the vehicle image to be detected by utilizing an image target detection model.
The invention discloses a multi-convolution fusion module, combines it with a multi-scale feature fusion technology, and provides a network structure for image target detection.
The following describes a target detection method based on a multi-convolution fusion network in detail.
As shown in fig. 5, a target detection method based on a multi-convolution fusion network specifically includes the following steps.
Step1, constructing an aerial image data set. The specific process is as follows: firstly, acquiring image data of vehicles coming from and going to a traffic junction through an unmanned aerial vehicle camera; secondly, randomly adjusting the brightness, the saturation and the contrast of the acquired original image through a preprocessing operation; thirdly, carrying out category labeling on the aerial vehicle target in the image based on Labelme software, thereby obtaining a labeling file in an Extensible Markup Language (XML) format; and finally, carrying out training set and test set division, making labels for the data in the training set, and carrying out no processing on the data in the test set.
Step2, building a deep neural network (the network structure for image target detection) and training the deep neural network model with the training set of the aerial image data set to obtain an aerial image detection model; taking a 1024 × 1024 input aerial image as an example, the specific process is described as follows:
A multi-convolution fusion module is designed (comprising a first multi-convolution fusion module, a second multi-convolution fusion module, a third multi-convolution fusion module, a fourth multi-convolution fusion module and a fifth multi-convolution fusion module) and embedded into the backbone network ResNet101 of the Faster RCNN network. The backbone network used by Faster RCNN in the present invention is ResNet101, which is used to extract the features of aerial images; the ResNet101 network 201 is composed of 5 convolution modules (conv1, conv2, conv3, conv4, conv5). As shown in fig. 3, a multi-convolution fusion module is designed and embedded after each of the 5 convolution modules, so that the subsequent feature maps all contain the extracted target key information with different attributes. As shown in fig. 3, taking the 1024 × 1024 input aerial image as an example, the size of the output feature map C_3 after the first three convolution modules (conv1, conv2, conv3) is 128 × 128 × 512, and the design process of the multi-convolution fusion module is illustrated by taking this feature map as the input of the multi-convolution fusion module (third multi-convolution fusion module):
as shown in fig. 4, a multi-convolution branch structure is first designed, and a feature map output after conv3 (third convolution module) is used as an input feature map of the structure. Inputting the feature map into different convolution branches, namely performing four different convolution operations on the feature map respectively, wherein the convolution operations comprise convolution operations with convolution kernels of 1 × 1, step sizes of 3 and pixel filling of 0, convolution operations with convolution kernels of 3 × 3, step sizes of 2 and pixel filling of 1, convolution operations with convolution kernels of 5 × 5, step sizes of 2 and pixel filling of 2, and convolution operations with convolution kernels of 7 × 7, step sizes of 2 and pixel filling of 3, and thus obtaining four feature maps (the size is 64 × 64 × 512) with the same size and containing different feature information.
Next, a SEnet attention mechanism is constructed and embedded behind the multi-convolution branch structure. As shown in FIG. 4, the four output feature maps of the multi-convolution branch structure are used as the input of the SEnet attention mechanism, which is designed through the following process: taking the four feature maps (each of size 64 × 64 × 512) produced by the multi-convolution branch structure as input feature maps of the module, global average pooling based on the channel dimension is first performed on the input feature maps, yielding four feature maps of size 1 × 1 × 512. Then, the four feature maps are input into the first full connection layer, which reduces the number of channels of each 1 × 1 × 512 feature map to 1/r of the original number to lower the computation cost of the full connection layer, and outputs four feature maps of size 1 × 1 × 512/r. A ReLU activation function is applied to the four feature maps respectively, a second full connection layer expands them from 1 × 1 × 512/r back to 1 × 1 × 512, and finally a Sigmoid function limits the weights of the 512-channel feature maps to the range [0, 1]. The 512 channels of the four feature maps are multiplied by the output weights 1 × 1 × 512, thereby outputting four feature maps (of size 64 × 64 × 512) containing channel attention information. The calculation formula of the SEnet attention mechanism is as follows:
B=σ(FC(ReLu(FC(Avgpool(A)))));
where A denotes an input feature map of the attention module, B denotes the output feature map, FC denotes a fully-connected layer (the first and second fully-connected layers), Avgpool denotes global average pooling, and σ denotes the Sigmoid activation function.
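A minimal sketch of this SEnet attention computation, assuming a PyTorch implementation with a 512-channel input as in the text; the reduction ratio r is left as a parameter (the value r = 16 below is only a placeholder, since the patent treats r as a set value).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention: B = sigmoid(FC(ReLU(FC(Avgpool(A))))), applied back to A."""
    def __init__(self, channels: int = 512, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # global average pooling -> 1x1xC
        self.fc1 = nn.Linear(channels, channels // r)   # squeeze to C/r
        self.fc2 = nn.Linear(channels // r, channels)   # expand back to C
        self.act = nn.ReLU(inplace=True)
        self.gate = nn.Sigmoid()

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = a.shape
        w = self.pool(a).view(n, c)                         # N x C
        w = self.gate(self.fc2(self.act(self.fc1(w))))      # channel weights in [0, 1]
        return a * w.view(n, c, 1, 1)                       # reweight the channels of A
```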
Since the SEnet attention mechanism does not change the resolution of the feature map, as shown in FIG. 4, embedding an attention mechanism behind each of the multi-convolution branch structures helps the network screen the abundant feature information extracted by the branches and pass the screened key features on to the subsequent feature layers, thereby improving the detection accuracy of aerial image targets.
Finally, a multi-convolution fusion structure is designed. Element-level addition is carried out on the four feature maps output by the SEnet attention mechanism to obtain a feature map (of size 64 × 64 × 512) fusing different feature attributes. A convolution operation with convolution kernel 1 × 1, step size 1 and pixel padding 0 is then applied to refine the number of channels to 256 and eliminate the feature aliasing effect, finally yielding a feature map of size 64 × 64 × 256.
The multi-convolution fusion module is respectively formed by connecting a multi-convolution branch structure, an SEnet attention mechanism and a multi-convolution fusion structure in series, as shown in fig. 3, the multi-convolution fusion module is respectively embedded into 5 convolution modules of the ResNet101 network 201, so that the network can extract and refine more abundant key feature information based on different convolution operations, and the key feature information is transmitted to a subsequent layer, and the detection accuracy of an aerial image target is improved. In addition, the multi-convolution fusion module can reduce the space dimension and the channel number of the feature map to half of the original number through key feature extraction, thereby reducing the calculation cost.
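Putting the pieces together, the following sketch assembles one multi-convolution fusion module from the four convolution branches, the per-branch SEnet attention (reusing the SEBlock sketched above) and the 1 × 1 refinement convolution, assuming PyTorch and following the C_3 example dimensions. Note that the text specifies a stride of 3 for the 1 × 1 branch; stride 2 is used here as an interpretation so that all four branches produce feature maps of the same spatial size and can be added element-wise.

```python
import torch
import torch.nn as nn

class MultiConvFusionModule(nn.Module):
    """Sketch of one multi-convolution fusion module: four parallel convolution
    branches, an SE attention block per branch, element-level fusion, and a 1x1
    refinement convolution (channel counts follow the C_3 example in the text)."""
    def __init__(self, in_channels: int = 512, branch_channels: int = 512,
                 out_channels: int = 256, r: int = 16):
        super().__init__()
        # The text lists stride 3 for the 1x1 branch; stride 2 is used here so
        # all four branches produce feature maps of the same spatial size.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, branch_channels, kernel_size=1, stride=2, padding=0),
            nn.Conv2d(in_channels, branch_channels, kernel_size=3, stride=2, padding=1),
            nn.Conv2d(in_channels, branch_channels, kernel_size=5, stride=2, padding=2),
            nn.Conv2d(in_channels, branch_channels, kernel_size=7, stride=2, padding=3),
        ])
        self.attentions = nn.ModuleList([SEBlock(branch_channels, r) for _ in range(4)])
        self.refine = nn.Conv2d(branch_channels, out_channels, kernel_size=1, stride=1, padding=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = [att(branch(x)) for branch, att in zip(self.branches, self.attentions)]
        fused = torch.stack(outs, dim=0).sum(dim=0)   # element-level addition of the 4 maps
        return self.refine(fused)                     # e.g. 64x64x512 -> 64x64x256
```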
A Faster R-CNN structure based on a Feature Pyramid Network (FPN) is designed. The concrete structure (as shown in fig. 3) is as follows: the backbone network ResNet101 mainly consists of five convolution modules (conv1, conv2, conv3, conv4, conv5), whose output feature maps are denoted C_1, C_2, C_3, C_4 and C_5 respectively. Taking the 1024 × 1024 input aerial image as an example, the sizes of the feature maps C_1 to C_5 are, in order: 512 × 512 × 128, 256 × 256 × 256, 128 × 128 × 512, 64 × 64 × 1024 and 32 × 32 × 2048. C_1, C_2, C_3, C_4 and C_5 are passed through the five multi-convolution fusion modules respectively to obtain rich feature information while unifying the number of channels to 256, i.e. the sizes become: 256 × 256 × 256, 128 × 128 × 256, 64 × 64 × 256, 32 × 32 × 256 and 16 × 16 × 256. The output feature map of C_5 after the multi-convolution fusion module (fifth multi-convolution fusion module) is named P_6 (fifth feature map). In a multi-scale feature fusion manner, the upper-layer feature map carrying high-level semantic information is successively upsampled by a factor of 2 to the same size as the layer below and added element-wise to the higher-resolution feature map of that lower layer, yielding the P_2, P_3, P_4 and P_5 (fourth feature map) layers. A 3 × 3 convolution is then applied to the P_2, P_3 and P_4 layers to eliminate the aliasing effect of the lower layers, giving the final P_2 (first feature map), P_3 (second feature map) and P_4 (third feature map) layers.
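A minimal sketch of this top-down multi-scale fusion, assuming PyTorch; nearest-neighbour upsampling and the placement of the 3 × 3 smoothing convolutions follow the description above and are otherwise assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Sketch of the top-down multi-scale fusion over the five 256-channel
    multi-convolution fusion outputs (m2 largest ... m6 smallest)."""
    def __init__(self, channels: int = 256):
        super().__init__()
        # 3x3 convolutions that remove the aliasing effect on P_2, P_3, P_4.
        self.smooth = nn.ModuleDict({
            name: nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for name in ("p2", "p3", "p4")
        })

    def forward(self, m2, m3, m4, m5, m6):
        p6 = m6
        p5 = F.interpolate(p6, scale_factor=2, mode="nearest") + m5
        p4 = self.smooth["p4"](F.interpolate(p5, scale_factor=2, mode="nearest") + m4)
        p3 = self.smooth["p3"](F.interpolate(p4, scale_factor=2, mode="nearest") + m3)
        p2 = self.smooth["p2"](F.interpolate(p3, scale_factor=2, mode="nearest") + m2)
        return p2, p3, p4, p5, p6
```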
As shown in fig. 5, ResNet101, the multi-convolution fusion module, and FPN form a feature extraction network for extracting features in an input image.
Next, an RPN (Region Proposal Network) structure is established. The RPN consists of a 3 × 3 convolutional layer and two output branches: the first branch outputs the probability that the candidate region belongs to each target class; the second branch outputs the top-left corner coordinates and the width and height of the candidate region border (bounding box). The RPN traverses the feature maps of the five feature layers P_2 to P_6 with a sliding 3 × 3 anchor frame, generating a number of anchor boxes and a series of Proposals (candidate target regions), with target candidate box prediction performed on each layer. Finally, the predictions of all layers are concatenated and fused. In the RPN training process, a target whose IoU (intersection over union) with the real label box is greater than 0.7 is given a positive label (vehicle target), and one whose IoU is less than 0.3 is given a negative label (background).
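A small sketch of this anchor labelling rule (IoU > 0.7 positive, IoU < 0.3 negative), assuming PyTorch tensors; treating anchors between the two thresholds as ignored is the usual Faster R-CNN convention rather than something the patent spells out.

```python
import torch

def assign_rpn_labels(ious: torch.Tensor) -> torch.Tensor:
    """Assign RPN training labels from an IoU matrix.
    ious: [num_anchors, num_gt] IoU between each anchor and each ground-truth box.
    Returns 1 (vehicle) for IoU > 0.7, 0 (background) for IoU < 0.3, -1 (ignored)."""
    labels = torch.full((ious.shape[0],), -1, dtype=torch.long)
    max_iou, _ = ious.max(dim=1)          # best ground-truth overlap per anchor
    labels[max_iou < 0.3] = 0             # negative label: background
    labels[max_iou > 0.7] = 1             # positive label: vehicle target
    return labels
```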
According to the area (w × h) of each Proposal box generated by the RPN, the Proposal is mapped to the corresponding feature layer P_k, on which the subsequent ROI Pooling is performed. The value of k is calculated as:

k = ⌊ k_0 + log2( √(w·h) / 224 ) ⌋

where k_0 = 4, w and h are the width and height of the bounding box, respectively, and k takes values of 2, 3, 4 and 5.
P_2 denotes the first feature map, P_3 denotes the second feature map, P_4 denotes the third feature map, and P_5 denotes the fourth feature map.
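A sketch of this level-assignment rule as a standalone function; the constants k_0 = 4 and the 224 reference size follow the standard FPN assignment formula and are assumptions insofar as the patent does not state them explicitly.

```python
import math

def assign_fpn_level(w: float, h: float, k0: int = 4, k_min: int = 2, k_max: int = 5) -> int:
    """Map a Proposal of size w x h to a pyramid level P_k."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224.0))
    return max(k_min, min(k_max, k))      # clamp so k stays in {2, 3, 4, 5}
```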
And inputting the obtained Proposals into the ROI Pooling layer for feature extraction, and outputting Proposal feature maps with a uniform size of 7 × 7, so that they can be fed into the fully connected layers in the next step. After each feature map sample passes through two 1024-dimensional fully connected layers, the two detection branches of Faster RCNN compute respectively: the classification loss function classifies background versus vehicle target and determines the vehicle class to which the Proposal region belongs; the regression loss yields the positioning information of the vehicle target after the bounding-box regression operation is completed. The network model is trained, the loss function is calculated, and the parameters of the whole network are updated to finally obtain the trained model; the training loss comprises two parts, namely the classification loss and the regression loss, and the calculation formula is as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* · L_reg(t_i, t_i*)

In the formula, i denotes the index of each sample, N_cls and N_reg are normalization parameters, and λ is a weight balance parameter. L_cls(p_i, p_i*) denotes the classification loss, p_i denotes the probability that the sample is predicted to be a vehicle, and p_i* is its labeled real data label. L_reg(t_i, t_i*) denotes the bounding-box regression loss, defined as smooth_L1(t_i − t_i*), where the smooth_L1 function is defined as

smooth_L1(x) = 0.5 x² if |x| < 1; |x| − 0.5 otherwise,

and x denotes the function input. t_i denotes the translation scaling parameters of the target box predicted from the Proposal, and t_i* denotes the translation scaling parameters of the real data corresponding to the Proposal; the factor p_i* means the regression term is activated only when the sample is a positive sample (p_i* = 1). Concretely, t = (t_x, t_y, t_w, t_h), where t_x and t_y are the translation scaling parameters of the upper-left-corner coordinates x and y of the predicted target box, and t_w and t_h are the scaling parameters of the predicted box width w and height h; correspondingly, t* = (t_x*, t_y*, t_w*, t_h*) are the translation scaling parameters of the upper-left-corner coordinates, width and height of the real target box.
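A minimal sketch of computing this training loss, assuming a PyTorch implementation with cross-entropy as the classification loss and the built-in smooth L1 for the regression term; the normalization and the λ weighting follow the formula above, and the default λ = 1 is only a placeholder.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, labels, box_deltas, box_targets, lam: float = 1.0):
    """Classification loss plus regression loss, with the regression term
    active only for positive (vehicle) samples."""
    n_cls = max(labels.numel(), 1)
    n_reg = max(int((labels == 1).sum()), 1)
    l_cls = F.cross_entropy(cls_logits, labels, reduction="sum") / n_cls
    pos = labels == 1
    # smooth L1 over (t - t*), summed over the positive samples only
    l_reg = F.smooth_l1_loss(box_deltas[pos], box_targets[pos], reduction="sum") / n_reg
    return l_cls + lam * l_reg
```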
Step3, the overall structure of the deep neural network is completed based on the above steps; the model is trained and its parameters are optimized on the aerial image data set, and finally the model is tested. Specifically, end-to-end training is carried out on the deep neural network obtained in the above steps on the training set of the aerial image data set; forward propagation and backward propagation are performed for each picture input into the network, and the internal parameters of the model are updated according to the loss function L, so as to obtain the aerial image target detection model.
The method comprises the following steps of adopting a test set of an aerial image data set as a test example, inputting the test set into a trained deep neural network model (an image target detection model), and detecting a vehicle target in an aerial image, wherein the specific process comprises the following steps:
(1) A group of aerial images to be tested is input, with the maximum side length of the input images limited to 1024; after feature extraction by the ResNet network, the multi-convolution fusion modules and the Feature Pyramid Network (FPN), 400 candidate target regions (Proposals) are obtained in the images through the RPN.
(2) And the ROI Pooling takes the original image feature map and each candidate target area as input, extracts the feature maps of the candidate target areas and outputs a 7 x 7 feature map with uniform size for next detection frame regression and aerial photography vehicle category classification.
(3) The feature information of each Proposal passes through the fully connected layers, and bounding-box regression and category judgment yield the rectangular position information of the detection box of each aerial vehicle target. Finally, all bounding rectangles identified as aerial vehicle targets are marked in the original image.
(4) The indexes used for evaluating the result are the average precision (AP) and the mean average precision (mAP). True Negative (TN): judged to be a negative sample, and in fact a negative sample; True Positive (TP): judged to be a positive sample, and in fact a positive sample; False Negative (FN): judged to be a negative sample, but actually a positive sample; False Positive (FP): judged to be a positive sample, but actually a negative sample. Recall = TP/(TP + FN), Precision = TP/(TP + FP), and the Precision-Recall (P-R) curve is a two-dimensional curve with precision and recall as the vertical and horizontal axis coordinates. The average precision AP is the area enclosed under the P-R curve of each category, and the mean average precision mAP is the average of the AP values over all categories.
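For illustration, a small sketch of these evaluation quantities, assuming NumPy; the AP is approximated here as the trapezoidal area under the P-R curve, which is one common convention rather than the only one.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Recall = TP / (TP + FN), Precision = TP / (TP + FP)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return precision, recall

def average_precision(precisions: np.ndarray, recalls: np.ndarray) -> float:
    """Area under the P-R curve for one category (AP); mAP is the mean of the
    per-category AP values."""
    order = np.argsort(recalls)
    return float(np.trapz(precisions[order], recalls[order]))
```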
The target detection method based on the multi-convolution fusion network disclosed by the invention has the beneficial effects that:
(1) according to the method, a multi-convolution fusion module is adopted to further extract a plurality of different potential feature information contained in conv1-conv5, key detection features are refined based on an SEnet attention mechanism in the module, and the key features are transmitted to a later layer, so that the detection accuracy of aerial image targets is improved.
(2) Through the detection network based on the Feature Pyramid Network (FPN), the multi-convolution fusion module and Faster RCNN, the network combines the multi-convolution fusion module with the multi-scale feature fusion technology, so that the feature characterization capability of the network for aerial image targets is jointly enhanced.
Fig. 6 is a schematic structural diagram of a target detection system based on a multi-convolution fusion network, and as shown in fig. 6, the target detection system based on the multi-convolution fusion network includes:
the data set acquisition module 301 is configured to use image data of vehicles coming and going in the transportation junction acquired by a camera carried by the unmanned aerial vehicle as a data set.
A network construction module 302, configured to construct a network structure for image target detection.
And the image target detection model training module 303 is configured to train a network structure for image target detection according to the data set to obtain an image target detection model.
And the target detection module 304 is configured to perform target detection on the image data to be detected by using the image target detection model.
The network structure for image target detection comprises: a ResNet101 network 201, a multi-convolution fusion network 202, a region generation network 203, an ROI pooling layer 204, and a detection head 205.
The ResNet101 network 201 comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module which are connected in sequence; the multi-convolution fusion network 202 includes a first multi-convolution fusion module, a second multi-convolution fusion module, a third multi-convolution fusion module, a fourth multi-convolution fusion module, and a fifth multi-convolution fusion module.
The first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module and the fifth multi-convolution fusion module are all used for carrying out multi-convolution feature fusion on the input image.
The output of the fifth convolution module is connected with the input of the fifth multi-convolution fusion module, the output of the fourth convolution module is connected with the input of the fourth multi-convolution fusion module, the output of the third convolution module is connected with the input of the third multi-convolution fusion module, the output of the second convolution module is connected with the input of the second multi-convolution fusion module, and the output of the first convolution module is connected with the input of the first multi-convolution fusion module; the fifth multi-convolution fusion module outputs a fifth feature map, the fifth feature map outputs a fourth feature map through 2 times of upsampling and element-wise addition with the output of the fourth multi-convolution fusion module, the fourth feature map outputs a third feature map through a 3 x 3 convolution operation after 2 times of upsampling and element-wise addition with the output of the third multi-convolution fusion module, the third feature map outputs a second feature map through a 3 x 3 convolution operation after 2 times of upsampling and element-wise addition with the output of the second multi-convolution fusion module, and the second feature map outputs a first feature map through a 3 x 3 convolution operation after 2 times of upsampling and element-wise addition with the output of the first multi-convolution fusion module; the first feature map, the second feature map, the third feature map, the fourth feature map and the fifth feature map are input into the region generation network 203; the region generation network 203 is connected to the ROI pooling layer 204, the ROI pooling layer 204 is connected to the detection head 205, and the detection head 205 is used for outputting a detection result.
The first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module and the fifth multi-convolution fusion module have the same structure and respectively comprise a first convolution branch, a second convolution branch, a third convolution branch, a fourth convolution branch, a first SEnet attention mechanism module, a second SEnet attention mechanism module, a third SEnet attention mechanism module and a fourth SEnet attention mechanism module.
The first convolution branch comprises a convolution operation with a convolution kernel of 1 × 1, a step size of 3 and pixel padding of 0; the second convolution branch comprises a convolution operation with a convolution kernel of 3 × 3, a step size of 2 and pixel padding of 1; the third convolution branch comprises a convolution operation with a convolution kernel of 5 × 5, a step size of 2 and pixel padding of 2; and the fourth convolution branch comprises a convolution operation with a convolution kernel of 7 × 7, a step size of 2 and pixel padding of 3. The feature map output by the first convolution branch is input into the first SEnet attention mechanism module, the feature map output by the second convolution branch is input into the second SEnet attention mechanism module, the feature map output by the third convolution branch is input into the third SEnet attention mechanism module, and the feature map output by the fourth convolution branch is input into the fourth SEnet attention mechanism module;
the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module each perform channel-wise global average pooling on the input feature map to obtain a feature map with a size of 1 × 1 × 512; the feature map with the size of 1 × 1 × 512 is input into a first fully connected layer, which outputs a feature map with a size of 1 × 1 × 512/r; a ReLU activation function is applied to the feature map with the size of 1 × 1 × 512/r; the feature map with the size of 1 × 1 × 512/r is then expanded back to 1 × 1 × 512 by a second fully connected layer, and a feature map containing channel attention information is output through a Sigmoid function; r is a set value.
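As an illustration, a minimal sketch of one such SEnet attention mechanism module is given below, assuming 512 input channels and an unspecified reduction value r (here defaulted to 16); the class name SEBlock and the default value are assumptions of the sketch.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Sketch of a squeeze-and-excitation attention module: channel-wise global
    average pooling, a 512 -> 512/r -> 512 bottleneck of two fully connected
    layers (ReLU then Sigmoid), and channel-wise re-weighting of the input."""

    def __init__(self, channels=512, r=16):  # r is the set value; 16 is an assumed default
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)
        self.fc2 = nn.Linear(channels // r, channels)
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))         # global average pooling -> one value per channel
        w = self.relu(self.fc1(w))     # squeeze to 512/r
        w = self.sigmoid(self.fc2(w))  # expand back to 512, gate to (0, 1)
        return x * w.view(b, c, 1, 1)  # feature map containing channel attention information
```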
The four feature maps containing channel attention information output by the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module are subjected to an element-level addition operation to obtain a fused feature map, and the fused feature map is output after a convolution operation with a convolution kernel of 1 × 1, a step size of 1 and pixel padding of 0.
The features output by the first, second, third and fourth convolution branches are the same size, all 64 × 64 × 512.
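The following sketch puts the pieces together into one multi-convolution fusion module, reusing the SEBlock sketch above. Because the four branch outputs must share the size 64 × 64 × 512 before element-level addition, the sketch assumes a step size of 2 for all four branches (the text lists a step size of 3 for the 1 × 1 branch, which would give a different output size); the class name MultiConvFusion and the channel counts are likewise assumptions.

```python
import torch.nn as nn

class MultiConvFusion(nn.Module):
    """Sketch of a multi-convolution fusion module: four parallel convolution branches
    (1x1, 3x3, 5x5, 7x7), one SE attention module per branch, element-level addition
    of the four attention-weighted maps, and a final 1x1, stride-1 convolution."""

    def __init__(self, in_channels, out_channels=512, r=16):
        super().__init__()
        # Assumed stride 2 for every branch so that all outputs share one spatial size.
        self.branch1 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=2, padding=0)
        self.branch2 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1)
        self.branch3 = nn.Conv2d(in_channels, out_channels, kernel_size=5, stride=2, padding=2)
        self.branch4 = nn.Conv2d(in_channels, out_channels, kernel_size=7, stride=2, padding=3)
        self.attention = nn.ModuleList([SEBlock(out_channels, r) for _ in range(4)])
        self.fuse = nn.Conv2d(out_channels, out_channels, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        branches = [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)]
        weighted = [se(f) for se, f in zip(self.attention, branches)]  # per-branch channel attention
        fused = weighted[0] + weighted[1] + weighted[2] + weighted[3]  # element-level addition
        return self.fuse(fused)
```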
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention. Meanwhile, a person skilled in the art may, according to the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. A target detection method based on a multi-convolution fusion network is characterized by comprising the following steps:
taking image data of vehicles coming and going in a traffic junction acquired by a camera carried by an unmanned aerial vehicle as a data set;
constructing a network structure for image target detection;
training the network structure for image target detection according to the data set to obtain an image target detection model;
carrying out target detection on image data to be detected by using the image target detection model;
the network structure for image target detection comprises: a ResNet101 network, a multi-convolution fusion network, a region generation network, an ROI pooling layer and a detection head;
the ResNet101 network comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module which are connected in sequence; the multi-convolution fusion network comprises a first multi-convolution fusion module, a second multi-convolution fusion module, a third multi-convolution fusion module, a fourth multi-convolution fusion module and a fifth multi-convolution fusion module;
the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are all used for performing multi-convolution feature fusion on an input image;
an output of the fifth convolution module is connected to an input of the fifth multi-convolution fusion module, an output of the fourth convolution module is connected to an input of the fourth multi-convolution fusion module, an output of the third convolution module is connected to an input of the third multi-convolution fusion module, an output of the second convolution module is connected to an input of the second multi-convolution fusion module, and an output of the first convolution module is connected to an input of the first multi-convolution fusion module; the output of the fifth multi-convolution fusion module is a fifth feature map, the fifth feature map outputs a fourth feature map through 2-time upsampling and element-by-element addition with the output of the fourth multi-convolution fusion module, the fourth feature map outputs a third feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the third multi-convolution fusion module, the third feature map outputs a second feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the second multi-convolution fusion module, and the second feature map outputs a first feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the first multi-convolution fusion module; inputting the first feature map, the second feature map, the third feature map, the fourth feature map and the fifth feature map into the area generation network; the region generation network is connected with the ROI pooling layer, the ROI pooling layer is connected with the detection head, and the detection head is used for outputting a detection result.
2. The multi-convolution fusion network-based object detection method according to claim 1, wherein the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are identical in structure and each include a first convolution branch, a second convolution branch, a third convolution branch, a fourth convolution branch, a first SEnet attention mechanism module, a second SEnet attention mechanism module, a third SEnet attention mechanism module, and a fourth SEnet attention mechanism module;
the first convolution branch comprises convolution operations with convolution kernels of 1 × 1, step sizes of 3 and pixel padding of 0, the second convolution branch comprises convolution operations with convolution kernels of 3 × 3, step sizes of 2 and pixel padding of 1, the third convolution branch comprises convolution operations with convolution kernels of 5 × 5, step sizes of 2 and pixel padding of 2, and the fourth convolution branch comprises convolution operations with convolution kernels of 7 × 7, step sizes of 2 and pixel padding of 3; the feature map of the first convolution branch output is input to the first SEnet attention mechanism module, the feature map of the second convolution branch output is input to the second SEnet attention mechanism module, the feature map of the third convolution branch output is input to the third SEnet attention mechanism module, and the feature map of the fourth convolution branch output is input to the fourth SEnet attention mechanism module;
the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module each perform channel-wise global average pooling on the input feature map to obtain a feature map with a size of 1 × 1 × 512; the feature map with the size of 1 × 1 × 512 is input into a first fully connected layer, which outputs a feature map with a size of 1 × 1 × 512/r; a ReLU activation function is applied to the feature map with the size of 1 × 1 × 512/r; the feature map with the size of 1 × 1 × 512/r is then expanded back to 1 × 1 × 512 by a second fully connected layer, and a feature map containing channel attention information is output through a Sigmoid function; r is a set value;
the four feature maps containing channel attention information output by the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module are subjected to an element-level addition operation to obtain a fused feature map, and the fused feature map is output after a convolution operation with a convolution kernel of 1 × 1, a step size of 1 and pixel padding of 0.
3. The target detection method based on the multi-convolution fusion network according to claim 2, wherein the first convolution branch, the second convolution branch, the third convolution branch and the fourth convolution branch output features of the same size, namely 64 × 64 × 512.
4. The method for detecting the target based on the multi-convolution fusion network is characterized in that the detection head comprises a regression branch and a classification branch; the classification branch determines the category of the detection target by using the classification loss, and the regression branch determines the position information of the detection target by using the regression loss.
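By way of illustration, a minimal sketch of such a detection head is given below: a shared trunk of fully connected layers followed by a classification branch and a regression branch. The ROI feature size (512 × 7 × 7), the hidden width, the class count and the names are assumptions of the sketch, not details recited in the claim.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of a two-branch detection head: the classification branch predicts the
    category of the detection target, the regression branch predicts its position."""

    def __init__(self, in_features=512 * 7 * 7, num_classes=2):  # assumed ROI size and class count
        super().__init__()
        self.shared = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
        )
        self.cls_branch = nn.Linear(1024, num_classes)      # class scores (classification loss)
        self.reg_branch = nn.Linear(1024, num_classes * 4)  # box deltas (regression loss)

    def forward(self, roi_features):
        h = self.shared(roi_features)
        return self.cls_branch(h), self.reg_branch(h)
```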
5. The target detection method based on the multi-convolution fusion network according to claim 1, wherein the taking of image data of vehicles coming and going in a transportation junction collected by a camera carried by an unmanned aerial vehicle as a data set specifically includes:
collecting image data of vehicles coming and going in a traffic junction through a camera carried by an unmanned aerial vehicle;
carrying out random adjustment on brightness, saturation and contrast of the image data to obtain preprocessed image data;
dividing the preprocessed image data into a training set and a test set;
adopting Labelme software to label the vehicle targets in the images of the training set according to their categories to obtain a category-labeled training set; the test set and the category-labeled training set form the data set.
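A brief sketch of the preprocessing and splitting step is given below; it uses torchvision's ColorJitter for the random brightness, saturation and contrast adjustment. The jitter ranges, the 8:2 split ratio, the file pattern and the function name are assumptions of the sketch, and the Labelme category labelling itself takes place outside this code.

```python
import random
from pathlib import Path
from PIL import Image
from torchvision import transforms

# Random photometric adjustment of brightness, contrast and saturation (ranges are assumptions).
augment = transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3)

def preprocess_and_split(image_dir, train_ratio=0.8, seed=0):
    """Apply random brightness/saturation/contrast adjustment to each collected image
    and split the preprocessed images into a training set and a test set."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    images = [augment(Image.open(p).convert("RGB")) for p in paths]
    random.seed(seed)
    random.shuffle(images)
    split = int(len(images) * train_ratio)
    return images[:split], images[split:]  # training set, test set
```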
6. The method for detecting the target based on the multi-convolution fusion network according to claim 1, wherein training a network structure of image target detection according to the data set to obtain an image target detection model specifically includes:
when the network structure for image target detection is trained according to the data set, calculating a loss function, and adjusting parameters in the network structure according to the loss function to obtain the image target detection model; the loss function includes a classification loss and a regression loss.
7. The method for detecting the target based on the multi-convolution fusion network according to claim 6, wherein the loss function is expressed as:
$$
L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^{*}) + \lambda \frac{1}{N_{reg}} \sum_i p_i^{*} L_{reg}(t_i, t_i^{*})
$$

wherein $L(\{p_i\},\{t_i\})$ represents the loss function, $i$ denotes the $i$-th sample, $N_{cls}$ is a first normalization parameter, $N_{reg}$ is a second normalization parameter, $\lambda$ is a weight balance parameter, $L_{cls}$ represents the classification loss, $L_{reg}$ represents the regression loss, $p_i$ denotes the probability that the $i$-th sample is predicted as a vehicle, $p_i^{*}$ is the annotated label of the $i$-th sample, $t_i$ represents the translation and scaling parameters of the predicted bounding box, and $t_i^{*}$ represents the translation and scaling parameters of the real bounding box.
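For illustration, a minimal sketch of this loss is shown below, using binary cross-entropy for the classification term and smooth L1 for the regression term; the concrete choice of the two loss terms and the function name are assumptions of the sketch, and only the two-term, normalized and λ-weighted structure is taken from the claim.

```python
import torch.nn.functional as F

def detection_loss(p, p_star, t, t_star, lam=1.0):
    """p:      predicted probability that each sample is a vehicle, shape (N,)
       p_star: annotated label of each sample (1 = vehicle, 0 = background), shape (N,)
       t:      translation/scaling parameters of the predicted bounding boxes, shape (N, 4)
       t_star: translation/scaling parameters of the real bounding boxes, shape (N, 4)
       lam:    weight balance parameter between classification and regression."""
    n_cls = p.numel()                             # first normalization parameter
    n_reg = max(int(p_star.sum().item()), 1)      # second normalization parameter
    cls_loss = F.binary_cross_entropy(p, p_star.float(), reduction="sum") / n_cls
    reg_loss = (p_star.float().unsqueeze(1) *     # regression applied to positive samples only
                F.smooth_l1_loss(t, t_star, reduction="none")).sum() / n_reg
    return cls_loss + lam * reg_loss
```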
8. A target detection system based on a multi-convolution fusion network is characterized by comprising:
the data set acquisition module is used for taking image data of vehicles coming and going in the traffic junction acquired by a camera carried by the unmanned aerial vehicle as a data set;
the network construction module is used for constructing a network structure for detecting the image target;
the image target detection model training module is used for training the network structure for image target detection according to the data set to obtain an image target detection model;
the target detection module is used for carrying out target detection on the image data to be detected by utilizing the image target detection model;
the network structure for image target detection comprises: a ResNet101 network, a multi-convolution fusion network, a region generation network, an ROI pooling layer and a detection head;
the ResNet101 network comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module which are connected in sequence; the multi-convolution fusion network comprises a first multi-convolution fusion module, a second multi-convolution fusion module, a third multi-convolution fusion module, a fourth multi-convolution fusion module and a fifth multi-convolution fusion module;
the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are all used for performing multi-convolution feature fusion on an input image;
an output of the fifth convolution module is connected to an input of the fifth multi-convolution fusion module, an output of the fourth convolution module is connected to an input of the fourth multi-convolution fusion module, an output of the third convolution module is connected to an input of the third multi-convolution fusion module, an output of the second convolution module is connected to an input of the second multi-convolution fusion module, and an output of the first convolution module is connected to an input of the first multi-convolution fusion module; the output of the fifth multi-convolution fusion module is a fifth feature map, the fifth feature map outputs a fourth feature map through 2-time upsampling and element-by-element addition with the output of the fourth multi-convolution fusion module, the fourth feature map outputs a third feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the third multi-convolution fusion module, the third feature map outputs a second feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the second multi-convolution fusion module, and the second feature map outputs a first feature map through 3 × 3 convolution operation after 2-time upsampling and element-by-element addition with the output of the first multi-convolution fusion module; inputting the first feature map, the second feature map, the third feature map, the fourth feature map and the fifth feature map into the area generation network; the region generation network is connected with the ROI pooling layer, the ROI pooling layer is connected with the detection head, and the detection head is used for outputting a detection result.
9. The multi-convolution fusion network based object detection system of claim 8, wherein the first multi-convolution fusion module, the second multi-convolution fusion module, the third multi-convolution fusion module, the fourth multi-convolution fusion module, and the fifth multi-convolution fusion module are identical in structure and each include a first convolution branch, a second convolution branch, a third convolution branch, a fourth convolution branch, a first SEnet attention mechanism module, a second SEnet attention mechanism module, a third SEnet attention mechanism module, and a fourth SEnet attention mechanism module;
the first convolution branch comprises convolution operations with convolution kernels of 1 × 1, step sizes of 3 and pixel padding of 0, the second convolution branch comprises convolution operations with convolution kernels of 3 × 3, step sizes of 2 and pixel padding of 1, the third convolution branch comprises convolution operations with convolution kernels of 5 × 5, step sizes of 2 and pixel padding of 2, and the fourth convolution branch comprises convolution operations with convolution kernels of 7 × 7, step sizes of 2 and pixel padding of 3; the feature map of the first convolution branch output is input to the first SEnet attention mechanism module, the feature map of the second convolution branch output is input to the second SEnet attention mechanism module, the feature map of the third convolution branch output is input to the third SEnet attention mechanism module, and the feature map of the fourth convolution branch output is input to the fourth SEnet attention mechanism module;
the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module each perform channel-wise global average pooling on the input feature map to obtain a feature map with a size of 1 × 1 × 512; the feature map with the size of 1 × 1 × 512 is input into a first fully connected layer, which outputs a feature map with a size of 1 × 1 × 512/r; a ReLU activation function is applied to the feature map with the size of 1 × 1 × 512/r; the feature map with the size of 1 × 1 × 512/r is then expanded back to 1 × 1 × 512 by a second fully connected layer, and a feature map containing channel attention information is output through a Sigmoid function; r is a set value;
the four feature maps containing channel attention information output by the first SEnet attention mechanism module, the second SEnet attention mechanism module, the third SEnet attention mechanism module and the fourth SEnet attention mechanism module are subjected to an element-level addition operation to obtain a fused feature map, and the fused feature map is output after a convolution operation with a convolution kernel of 1 × 1, a step size of 1 and pixel padding of 0.
10. The system according to claim 9, wherein the first, second, third and fourth convolution branches output features of the same size, namely 64 × 64 × 512.
CN202110707169.0A 2021-06-25 2021-06-25 Target detection method and system based on multi-convolution fusion network Active CN113255589B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110707169.0A CN113255589B (en) 2021-06-25 2021-06-25 Target detection method and system based on multi-convolution fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110707169.0A CN113255589B (en) 2021-06-25 2021-06-25 Target detection method and system based on multi-convolution fusion network

Publications (2)

Publication Number Publication Date
CN113255589A true CN113255589A (en) 2021-08-13
CN113255589B CN113255589B (en) 2021-10-15

Family

ID=77189569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110707169.0A Active CN113255589B (en) 2021-06-25 2021-06-25 Target detection method and system based on multi-convolution fusion network

Country Status (1)

Country Link
CN (1) CN113255589B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210056351A1 (en) * 2018-06-04 2021-02-25 Jiangnan University Multi-scale aware pedestrian detection method based on improved full convolutional network
CN112101373A (en) * 2019-06-18 2020-12-18 富士通株式会社 Object detection method and device based on deep learning network and electronic equipment
CN111951212A (en) * 2020-04-08 2020-11-17 北京交通大学 Method for identifying defects of contact network image of railway
CN112465746A (en) * 2020-11-02 2021-03-09 新疆天维无损检测有限公司 Method for detecting small defects in radiographic film
CN112364855A (en) * 2021-01-14 2021-02-12 北京电信易通信息技术股份有限公司 Video target detection method and system based on multi-scale feature fusion
CN112766409A (en) * 2021-02-01 2021-05-07 西北工业大学 Feature fusion method for remote sensing image target detection

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114511515A (en) * 2022-01-17 2022-05-17 山东高速路桥国际工程有限公司 Bolt corrosion detection system and detection method based on BoltCorrDetNet network
CN114332849A (en) * 2022-03-16 2022-04-12 科大天工智能装备技术(天津)有限公司 Crop growth state combined monitoring method and device and storage medium
CN114332849B (en) * 2022-03-16 2022-08-16 科大天工智能装备技术(天津)有限公司 Crop growth state combined monitoring method and device and storage medium
CN114943903A (en) * 2022-05-25 2022-08-26 广西财经学院 Self-adaptive clustering target detection method for aerial image of unmanned aerial vehicle
CN114943903B (en) * 2022-05-25 2023-04-07 广西财经学院 Self-adaptive clustering target detection method for aerial image of unmanned aerial vehicle
CN115272992A (en) * 2022-09-30 2022-11-01 松立控股集团股份有限公司 Vehicle attitude estimation method
CN115861938A (en) * 2023-02-06 2023-03-28 北京中超伟业信息安全技术股份有限公司 Unmanned aerial vehicle counter-braking method and system based on unmanned aerial vehicle identification
CN117952977A (en) * 2024-03-27 2024-04-30 山东泉海汽车科技有限公司 Pavement crack identification method, device and medium based on improvement yolov s
CN117952977B (en) * 2024-03-27 2024-06-04 山东泉海汽车科技有限公司 Pavement crack identification method, device and medium based on improvement yolov s

Also Published As

Publication number Publication date
CN113255589B (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN113255589B (en) Target detection method and system based on multi-convolution fusion network
CN108596101B (en) Remote sensing image multi-target detection method based on convolutional neural network
CN108764063B (en) Remote sensing image time-sensitive target identification system and method based on characteristic pyramid
CN111738110A (en) Remote sensing image vehicle target detection method based on multi-scale attention mechanism
CN111222396A (en) All-weather multispectral pedestrian detection method
CN113313082B (en) Target detection method and system based on multitask loss function
CN113361528B (en) Multi-scale target detection method and system
CN112084869A (en) Compact quadrilateral representation-based building target detection method
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN111144418B (en) Railway track area segmentation and extraction method
CN113313094B (en) Vehicle-mounted image target detection method and system based on convolutional neural network
CN112800906A (en) Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile
CN114820655B (en) Weak supervision building segmentation method taking reliable area as attention mechanism supervision
CN116229452B (en) Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN112766409A (en) Feature fusion method for remote sensing image target detection
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S
CN113052108A (en) Multi-scale cascade aerial photography target detection method and system based on deep neural network
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN113111740A (en) Characteristic weaving method for remote sensing image target detection
CN117853955A (en) Unmanned aerial vehicle small target detection method based on improved YOLOv5
CN118015490A (en) Unmanned aerial vehicle aerial image small target detection method, system and electronic equipment
CN114550016B (en) Unmanned aerial vehicle positioning method and system based on context information perception
CN112801195A (en) Deep learning-based fog visibility prediction method, storage device and server
CN111726535A (en) Smart city CIM video big data image quality control method based on vehicle perception

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant