CN113361528A - Multi-scale target detection method and system - Google Patents


Info

Publication number
CN113361528A
Authority
CN
China
Prior art keywords
convolution
cavity
target detection
hole
branch
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110910802.6A
Other languages
Chinese (zh)
Other versions
CN113361528B (en)
Inventor
朱敏
严凡
王帅
赵文登
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Telecom Easiness Information Technology Co Ltd
Original Assignee
Beijing Telecom Easiness Information Technology Co Ltd
Application filed by Beijing Telecom Easiness Information Technology Co Ltd
Priority to CN202110910802.6A
Publication of CN113361528A
Application granted
Publication of CN113361528B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-scale target detection method and a system, wherein the method comprises the following steps: constructing a cavity pyramid network model; the cavity pyramid network model comprises a plurality of convolution modules and a plurality of convolution branches which are sequentially connected, the output of each convolution module is respectively connected with one convolution branch, each convolution branch comprises a convolution operation and a plurality of cavity convolution operations, and the convolution operations in the convolution branches and the cavity convolution operations are in a parallel relation; the expansion rates of a plurality of cavity convolution operations in one convolution branch are the same, and the sizes of convolution kernels are different; sequentially performing up-sampling operation on each first feature map and performing element-level addition on adjacent first feature maps with the same size according to the output of the convolution branch from low to high resolution to obtain a plurality of fusion feature maps; performing model training on the cavity pyramid network model according to the target detection data set to obtain a target detection model; and carrying out target detection on the image to be detected by using the target detection model. The invention improves the accuracy of target detection.

Description

Multi-scale target detection method and system
Technical Field
The invention relates to the technical field of target detection, in particular to a multi-scale target detection method and system.
Background
Target detection is a core research direction in the field of computer vision; it aims to determine the category and position of each target of interest in an image. The technology is the research foundation of many computer vision tasks such as target tracking and semantic segmentation, and it is also widely applied in civil and military fields such as medical diagnosis, automatic driving, intelligent video monitoring and military target monitoring. As application scenes become more diverse and complex, the targets to be detected often span many different scales, so current target detection tasks face a serious challenge caused by scale differences. Multi-scale target detection has therefore become one of the research hotspots in the field of target detection.
As a mainstream approach to the scale problem, multi-scale feature fusion constructs a feature pyramid network that fuses shallow feature maps, which contain more detail information, with deep feature maps, which contain more semantic information, in a convolutional neural network. Each feature layer then carries both rich detail features and rich semantic features, which effectively improves the feature expression capability of the neural network. However, existing multi-scale feature fusion methods have a disadvantage: although each feature layer contains abundant feature information, each layer is sensitive only to targets within a fixed scale range. As a result, the feature information is under-used and the ability of each feature layer to detect targets of different scales is limited.
Disclosure of Invention
The invention aims to provide a multi-scale target detection method and a multi-scale target detection system, which improve the accuracy of target detection.
In order to achieve the purpose, the invention provides the following scheme:
a multi-scale target detection method, comprising:
collecting a target detection data set;
constructing a cavity pyramid network model; the cavity pyramid network model comprises a plurality of sequentially connected convolution modules and a plurality of convolution branches; each convolution module comprises a convolution operation, and the output of each convolution module is connected to a respective convolution branch; each convolution branch comprises one convolution operation and a plurality of cavity convolution operations, the convolution operation and the cavity convolution operations within a branch being arranged in parallel; the cavity convolution operations within one convolution branch have the same expansion rate and different convolution kernel sizes; the feature maps output by the convolution operation and the cavity convolution operations of a convolution branch are combined by element-level addition to obtain a first feature map; in order from low resolution to high resolution, each first feature map is up-sampled and then added element by element to the adjacent first feature map of the same size, yielding a plurality of fused feature maps;
performing model training on the cavity pyramid network model according to the target detection data set to obtain a target detection model;
and carrying out target detection on the image to be detected by utilizing the target detection model.
Optionally, the output of each convolution module is input into the convolution branch after passing through a convolution layer with a convolution kernel of 1 × 1.
Optionally, the cavity pyramid network model further includes a regional suggestion network, each of the fusion feature maps is input into the regional suggestion network, and the regional suggestion network outputs a candidate region corresponding to each of the fusion feature maps.
Optionally, the cavity pyramid network model further includes an ROI pooling layer and a detection head, an input of the ROI pooling layer is connected to an output of the region suggestion network, an output of the ROI pooling layer is connected to the detection head, and the detection head is configured to output a detection result.
Optionally, the convolution branch comprises a convolution operation with a convolution kernel of 1 × 1 and 2 hole convolution operations.
Optionally, the 2 hole convolution operations are a first hole convolution operation and a second hole convolution operation, respectively; the first hole convolution operation is a hole convolution operation with a convolution kernel of 3 x 3 and an expansion rate of 2; the second hole convolution operation is a hole convolution operation with a convolution kernel of 5 x 5 and an expansion rate of 2.
The invention also discloses a multi-scale target detection system, which comprises:
the data set acquisition module is used for acquiring a target detection data set;
the cavity pyramid network model building module is used for building a cavity pyramid network model; the cavity pyramid network model comprises a plurality of sequentially connected convolution modules and a plurality of convolution branches; each convolution module comprises a convolution operation, and the output of each convolution module is connected to a respective convolution branch; each convolution branch comprises one convolution operation and a plurality of cavity convolution operations, the convolution operation and the cavity convolution operations within a branch being arranged in parallel; the cavity convolution operations within one convolution branch have the same expansion rate and different convolution kernel sizes; the feature maps output by the convolution operation and the cavity convolution operations of a convolution branch are combined by element-level addition to obtain a first feature map; in order from low resolution to high resolution, each first feature map is up-sampled and then added element by element to the adjacent first feature map of the same size, yielding a plurality of fused feature maps;
the cavity pyramid network model training module is used for carrying out model training on the cavity pyramid network model according to the target detection data set to obtain a target detection model;
and the target detection module is used for carrying out target detection on the image to be detected by utilizing the target detection model.
Optionally, the output of each convolution module is input into the convolution branch after passing through a convolution layer with a convolution kernel of 1 × 1.
Optionally, the convolution branch includes a convolution operation with a convolution kernel of 1 × 1 and 2 hole convolution operations, where the 2 hole convolution operations are the first hole convolution operation and the second hole convolution operation, respectively; the first hole convolution operation is a hole convolution operation with a convolution kernel of 3 x 3 and an expansion rate of 2; the second hole convolution operation is a hole convolution operation with a convolution kernel of 5 x 5 and an expansion rate of 2.
Optionally, the cavity pyramid network model further includes a regional suggestion network, each of the fusion feature maps is input into the regional suggestion network, and the regional suggestion network outputs a candidate region corresponding to each of the fusion feature maps;
the cavity pyramid network model further comprises an ROI (region of interest) pooling layer and a detection head, wherein the input of the ROI pooling layer is connected with the output of the region suggestion network, the output of the ROI pooling layer is connected with the detection head, and the detection head is used for outputting a detection result.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention combines the cavity convolution multi-branch structure with the multi-scale feature fusion technology and extracts feature information using cavity convolutions with different convolution kernels, so that the convolution layer has receptive fields of different sizes. This helps a single feature layer acquire richer multi-scale context feature information, enhances the sensitivity of each feature layer to targets of different scales, and improves the accuracy of target detection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a multi-scale target detection method according to the present invention;
FIG. 2 is a schematic view of a specific flow chart of a multi-scale target detection method according to the present invention;
FIG. 3 is a schematic diagram of a cavity pyramid network model structure according to the present invention;
FIG. 4 is a schematic structural diagram of a multi-scale target detection system according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a multi-scale target detection method and a multi-scale target detection system, which improve the accuracy of target detection.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
FIG. 1 is a schematic flow diagram of the multi-scale target detection method of the present invention. As shown in FIG. 1, the multi-scale target detection method specifically includes the following steps:
Step 101: A target detection data set is collected.
First, a vehicle-mounted camera is used to capture image data in different traffic scenes while the vehicle is moving, and the images are preprocessed. Second, image annotation software is used to label the category and position of each target in the images (including various vehicles, pedestrians, traffic signs and road barriers), producing an annotation file for each image. Finally, the data are divided into a training set and a test set, and the image data and annotation files are organized in the VOC2007 data set format, yielding the target detection data set.
Image preprocessing is a data augmentation operation comprising horizontal flipping and brightness and contrast adjustment, which improves the robustness of the network to lighting changes.
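The two preprocessing operations named above can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of pixel rows and boxes given as (xmin, ymin, xmax, ymax) in pixel-edge coordinates; the function names and the linear brightness and contrast model (pixel' = alpha × pixel + beta) are illustrative assumptions, not details taken from the patent.

```python
def horizontal_flip(image, boxes):
    """Mirror an image left-to-right and mirror its boxes to match.

    image: list of rows, each a list of pixel values (assumed layout).
    boxes: list of (xmin, ymin, xmax, ymax) tuples (assumed convention).
    """
    width = len(image[0])
    flipped = [list(reversed(row)) for row in image]
    # After mirroring, the old right edge becomes the new left edge.
    flipped_boxes = [(width - xmax, ymin, width - xmin, ymax)
                     for (xmin, ymin, xmax, ymax) in boxes]
    return flipped, flipped_boxes


def adjust_brightness_contrast(image, alpha=1.0, beta=0.0):
    """Apply pixel' = alpha * pixel + beta, clipped to [0, 255].

    alpha scales contrast and beta shifts brightness (assumed model).
    """
    return [[int(min(255, max(0, round(alpha * p + beta)))) for p in row]
            for row in image]


image = [[0, 10, 20, 30]]
flipped, boxes = horizontal_flip(image, [(0, 0, 1, 1)])
brighter = adjust_brightness_contrast([[0, 100, 200]], alpha=2.0, beta=10.0)
```

Note that the annotation boxes have to be flipped together with the image, otherwise the labels no longer match the targets.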
As a specific embodiment, the image annotation software is LabelImg software.
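As an illustration of the annotation files, the sketch below reads one Pascal VOC style XML annotation with the Python standard library. The sample file content (file name, image size, class names and coordinates) is hypothetical and only shows the general layout of such a file.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in Pascal VOC layout (values are made up).
SAMPLE_ANNOTATION = """<annotation>
  <filename>000001.jpg</filename>
  <size><width>1024</width><height>768</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>300</xmax><ymax>400</ymax></bndbox>
  </object>
  <object>
    <name>pedestrian</name>
    <bndbox><xmin>500</xmin><ymin>250</ymin><xmax>560</xmax><ymax>420</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc_annotation(xml_text):
    """Return (filename, [(class_name, (xmin, ymin, xmax, ymax)), ...])."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        coords = tuple(int(box.findtext(tag))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((obj.findtext("name"), coords))
    return root.findtext("filename"), objects


filename, objects = parse_voc_annotation(SAMPLE_ANNOTATION)
```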
Step 102: A cavity pyramid network model is constructed. The cavity pyramid network model comprises a plurality of sequentially connected convolution modules and a plurality of convolution branches; each convolution module comprises a convolution operation, and the feature maps output by successive convolution modules decrease in size along the image input direction; the output of each convolution module is connected to a respective convolution branch; each convolution branch comprises one convolution operation and a plurality of cavity convolution operations, the convolution operation and the cavity convolution operations within a branch being arranged in parallel; the cavity convolution operations within one convolution branch have the same expansion rate and different convolution kernel sizes; the feature maps output by the convolution operation and the cavity convolution operations of a convolution branch are combined by element-level addition to obtain a first feature map; in order from low resolution to high resolution, each first feature map is up-sampled and then added element by element to the adjacent first feature map of the same size, yielding a plurality of fused feature maps.
The plurality of convolution modules are a first convolution module, a second convolution module, a third convolution module, a fourth convolution module and a fifth convolution module, connected in sequence. The outputs of the first to fifth convolution modules are C_1, C_2, C_3, C_4 and C_5, respectively, and the output feature map of the fifth convolution module is down-sampled by a factor of 0.5 to obtain a feature map C_6.
Each convolution module is a ResNet101 convolution module.
A convolution layer with a 1 × 1 convolution kernel is arranged between each convolution module and the cavity pyramid network, and the output of each convolution module is input into the cavity pyramid network after passing through this 1 × 1 convolution layer. Specifically, C_1, C_2, C_3, C_4, C_5 and C_6 are input to convolution layers with a 1 × 1 convolution kernel and then into the cavity pyramid network, and the corresponding outputs are the feature maps P_1, P_2, P_3, P_4, P_5 and P_6, respectively.
The feature map P_6 is up-sampled by a factor of 2 to obtain a feature map with the same scale as P_5, which is added element by element to P_5 to obtain the feature map F_5; P_5 is up-sampled by a factor of 2 to the scale of P_4 and added element by element to P_4 to obtain F_4; P_4 is up-sampled by a factor of 2 to the scale of P_3 and added element by element to P_3 to obtain F_3; P_3 is up-sampled by a factor of 2 to the scale of P_2 and added element by element to P_2 to obtain F_2; P_2 is up-sampled by a factor of 2 to the scale of P_1 and added element by element to P_1 to obtain F_1. A convolution operation with a 3 × 3 convolution kernel is then applied to each of the feature maps F_1, F_2, F_3 and F_4 to obtain the updated feature maps F_1, F_2, F_3 and F_4, which eliminates the feature aliasing effect of the lower layers.
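The fusion rule described above, in which each P_(k+1) is up-sampled by a factor of 2 and added to P_k, can be sketched on single-channel maps stored as nested lists. Nearest-neighbour up-sampling is an assumption made here for simplicity (the text does not name the interpolation method), and the 3 × 3 smoothing convolution is omitted.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x up-sampling of a 2-D map (assumed method)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out


def elementwise_add(a, b):
    """Element-level addition of two maps of identical size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse(p_maps):
    """p_maps = [P_1, ..., P_n], highest resolution first.

    Returns [F_1, ..., F_n] with F_n = P_n and
    F_k = P_k + upsample2x(P_(k+1)) for k < n.
    """
    fused = [None] * len(p_maps)
    fused[-1] = p_maps[-1]
    for k in range(len(p_maps) - 2, -1, -1):
        fused[k] = elementwise_add(p_maps[k], upsample2x(p_maps[k + 1]))
    return fused


p1 = [[1] * 4 for _ in range(4)]
p2 = [[10, 20], [30, 40]]
p3 = [[100]]
f1, f2, f3 = fuse([p1, p2, p3])
```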
The backbone network of the cavity pyramid network model (a convolutional neural network) is ResNet101.
The output of each convolution module passes through a convolution layer with a 1 × 1 convolution kernel and is then input into a convolution branch.
The convolution branch comprises a conventional convolution operation with a 1 × 1 convolution kernel and 2 hole convolution operations, namely a first hole convolution operation and a second hole convolution operation; the first hole convolution operation has a 3 × 3 convolution kernel and an expansion rate of 2; the second hole convolution operation has a 5 × 5 convolution kernel and an expansion rate of 2.
The cavity pyramid network model further includes a region suggestion network (the region candidate network in FIG. 2); each fused feature map is input into the region suggestion network, and the region suggestion network outputs a candidate region corresponding to each fused feature map.
The cavity pyramid network model further comprises an ROI (region of interest) pooling layer (target region pooling in FIG. 2) and a detection head. The input of the ROI pooling layer is connected to the output of the region suggestion network, the output of the ROI pooling layer is connected to the detection head, and the detection head outputs the detection result. The detection head comprises a regression branch and a classification branch.
As shown in FIGS. 2 and 3, taking a 1024 × 1024 input image as an example, the specific process of multi-scale target detection according to the present invention includes the following steps:
(1) Design a cavity pyramid network and embed it into the backbone network ResNet101 of a Faster R-CNN network. The backbone network used by Faster R-CNN in the invention is ResNet101, which extracts feature information from the input image. The ResNet101 network is composed of 5 convolution modules (conv1, conv2, conv3, conv4 and conv5), whose output feature maps are C_1, C_2, C_3, C_4 and C_5, respectively. The cavity pyramid network is embedded after the ResNet101 convolution modules so that subsequent feature maps obtain rich multi-scale context information. As shown in FIG. 3, taking a 1024 × 1024 input image as an example, with C_1, C_2, C_3, C_4 and C_5 as the input of the cavity pyramid network, the design process of the cavity pyramid network is as follows:
First, to enable the detection of large-scale targets, the output feature map C_5 of the 5th convolution module of ResNet101 is down-sampled by a factor of 0.5 to obtain C_6, giving a group of feature maps C_1 to C_6 whose sizes are 512 × 512 × 128, 256 × 256 × 256, 128 × 128 × 512, 64 × 64 × 1024, 32 × 32 × 2048 and 16 × 16 × 2048 in sequence. Next, these 6 feature maps are input into convolution layers with a 1 × 1 convolution kernel. This operation unifies the number of channels of the 6 feature maps to a fixed value of 256 while leaving their spatial size unchanged, giving sizes of 512 × 512 × 256, 256 × 256 × 256, 128 × 128 × 256, 64 × 64 × 256, 32 × 32 × 256 and 16 × 16 × 256.
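The spatial sizes listed above follow from halving the resolution at each level, starting from the 1024 × 1024 input. A minimal sketch of that arithmetic, assuming each of the six levels halves the side length of the previous one and the 1 × 1 convolution sets the channel count to 256 (the function name is illustrative):

```python
def pyramid_shapes(input_size=1024, levels=6, unified_channels=256):
    """Return (height, width, channels) for each level after the 1x1 convolution.

    Assumes every level halves the spatial side length of the previous one.
    """
    shapes = []
    size = input_size
    for _ in range(levels):
        size //= 2
        shapes.append((size, size, unified_channels))
    return shapes


shapes = pyramid_shapes()
```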
Then, a hole convolution multi-branch structure is constructed. As shown in FIG. 3, the same operations are performed on all 6 feature maps; C_5 is taken as an example. The first branch is a convolution operation with a 1 × 1 convolution kernel; this branch retains the original feature information of the feature map and does not change its size. The second branch is a hole convolution operation with a 3 × 3 convolution kernel and an expansion rate of 2 (indicated by rate = 2 in FIG. 3); to keep the feature map size constant, the pixel padding is set to 2. The third branch is a hole convolution operation with a 5 × 5 convolution kernel and an expansion rate of 2; to keep the feature map size constant, the pixel padding is set to 4. C_5 is input into each of the three branches, and three feature maps of size 32 × 32 × 256 are obtained. The three feature maps are fused by element-level addition to obtain a feature map P_5 of size 32 × 32 × 256. After the hole convolution multi-branch structure, the feature maps P_1, P_2, P_3, P_4, P_5 and P_6 are obtained, with sizes of 512 × 512 × 256, 256 × 256 × 256, 128 × 128 × 256, 64 × 64 × 256, 32 × 32 × 256 and 16 × 16 × 256 in sequence. In FIG. 3, Conv denotes a convolution operation and D-Conv denotes a hole convolution operation.
The cavity convolution multi-branch structure adopts the cavity convolution of different convolution kernels to extract characteristic information, so that the convolution layer can obtain the receptive fields of different sizes, and the extraction of the characteristic information of targets of different sizes is facilitated. Meanwhile, in order to avoid the loss of detail information which is helpful for small target detection, the module adopts a first branch to retain the original characteristic information of the characteristic diagram. In addition, the hole convolution operation does not change the resolution of the feature map, and is beneficial to accurate positioning of the target.
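The padding values of 2 and 4 quoted for the two hole convolution branches can be checked with the usual dilated convolution arithmetic (hole convolution is commonly called dilated or atrous convolution, and the expansion rate the dilation rate). The sketch below assumes stride 1 and the standard output-size formula; the function names are illustrative.

```python
def effective_kernel(kernel, rate):
    """Span covered by a kernel whose taps are spaced `rate` pixels apart."""
    return kernel + (kernel - 1) * (rate - 1)


def same_padding(kernel, rate):
    """Padding per side that keeps the output size unchanged at stride 1."""
    return rate * (kernel - 1) // 2


def output_size(size, kernel, rate, padding, stride=1):
    """Standard convolution output-size formula with dilation."""
    return (size + 2 * padding - effective_kernel(kernel, rate)) // stride + 1


# The three branches applied to a 32 x 32 map: (kernel, expansion rate)
branches = [(1, 1), (3, 2), (5, 2)]
summary = [(k, r, effective_kernel(k, r), same_padding(k, r),
            output_size(32, k, r, same_padding(k, r)))
           for k, r in branches]
```

With an expansion rate of 2, the 3 × 3 kernel covers a 5 × 5 window and the 5 × 5 kernel covers a 9 × 9 window, which is how the branches obtain different receptive fields without reducing resolution.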
Then, a cavity pyramid structure is constructed based on the multi-scale feature fusion operation. As shown in FIG. 3, F_6 is obtained directly from the feature map P_6; F_6 is up-sampled by a factor of 2 to obtain a feature map with the same scale as P_5, which is added element by element to P_5 to obtain F_5. In the same way, the upper-layer feature map, which carries high-level semantic information, is up-sampled by a factor of 2 to the size of the layer below and added element by element to that higher-resolution lower-layer feature map, giving F_4, F_3, F_2 and F_1 in sequence. A convolution operation with a 3 × 3 convolution kernel is then applied to each of F_1, F_2, F_3 and F_4 to eliminate the feature aliasing effect of the lower layers, giving the final F_1, F_2, F_3 and F_4 layers.
The cavity pyramid network is constructed based on three parts of a backbone network ResNet101, a cavity convolution multi-branch structure and multi-scale feature fusion operation, as shown in FIG. 3. By introducing the multi-branch structure of the cavity convolution into the multilevel pyramid network, a single feature layer can acquire richer multi-scale context feature information, and the feature information of the targets with different sizes is transmitted to a subsequent layer, so that the detection accuracy of the network on the targets with different sizes is further improved.
(2) Designing a Faster R-CNN structure based on the hole pyramid network. The concrete structure (as shown in fig. 3) is as follows: in the last step, a cavity pyramid network structure is obtained based on the backbone network ResNet101, the cavity convolution multi-branch structure and the multi-scale feature fusion operation. Taking 1024 × 1024 as an example of the input image of the invention, the sizes of 6 feature maps F _1 to F _6 output by the cavity pyramid network are as follows: 512 × 512 × 256, 256 × 256 × 256, 128 × 128 × 256, 64 × 64 × 256, 32 × 32 × 256, and 16 × 16 × 256.
Next, a region suggestion network (region proposal network, RPN) is constructed. The RPN takes the 6 feature maps F_1 to F_6 as input and consists of a convolution layer with a 3 × 3 convolution kernel followed by two output branches: branch one outputs the probability that a candidate region is a foreground target; branch two outputs the upper-left corner coordinates, the width and the height of the candidate region's bounding box. The RPN slides anchor boxes over each of the six feature maps F_1 to F_6 to generate a series of candidate regions. Finally, the prediction results on the six feature maps are concatenated. During RPN training, a candidate whose IoU (intersection over union) with a ground-truth box is greater than 0.7 is a positive sample (target), and a candidate whose IoU is less than 0.3 is a negative sample (background).
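The positive and negative sample rule can be sketched as follows, with boxes written as (x, y, w, h) using the upper-left corner, width and height as in the description. The text does not state how candidates with an IoU between 0.3 and 0.7 are handled; marking them as ignored (label -1) is an assumption of this sketch, as are the function names.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored, an assumption)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best > pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1


anchor = (0, 0, 10, 10)
labels = [
    label_anchor(anchor, [(0, 0, 10, 10)]),   # identical box
    label_anchor(anchor, [(20, 20, 5, 5)]),   # no overlap
    label_anchor(anchor, [(5, 0, 10, 10)]),   # partial overlap, IoU = 1/3
]
```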
According to the area of each candidate region box generated by the RPN, the box is mapped to the corresponding feature layer F_k, where the ROI Align operation is performed; the ROI Align layer outputs a batch of candidate region feature maps of size 7 × 7. The ROI Align operation unifies the size of the candidate region feature maps so that they can be input to the subsequent fully connected layers for feature extraction and classification.
After passing through two fully connected layers, the candidate region feature map is input into the two detection branches (regression branch and classification branch) of Faster R-CNN: the classification branch uses a classification loss function to distinguish background from foreground targets and determine the target class of the candidate region; the regression branch uses a regression loss to perform bounding-box regression and obtain the position of the target. The network model is trained by computing the loss function and updating the parameters of the whole network, finally yielding the trained model. The training loss consists of two components, the classification loss and the regression loss, and is calculated as follows:
$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\, L_{reg}(t_i, t_i^*)$$

In the formula, $i$ denotes the index of each sample, $N_{cls}$ and $N_{reg}$ are normalization parameters, and $\lambda$ is a weight-balancing parameter. $L_{cls}$ denotes the classification loss, $p_i$ denotes the probability that the sample is predicted to be a vehicle, and $p_i^*$ is the annotated ground-truth label. $L_{reg}$ denotes the bounding-box regression loss and is defined as $\mathrm{smooth}_{L1}(t - t^*)$, where $t$ denotes the translation and scaling parameters of the target box predicted for the Proposal and $t^*$ denotes the translation and scaling parameters of the ground truth corresponding to the Proposal. The $\mathrm{smooth}_{L1}$ function is defined as

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

The term $p_i^* L_{reg}$ means that the regression loss is activated only when the sample is a positive sample, i.e. $p_i^* = 1$. $t_i = (t_x, t_y, t_w, t_h)$ denotes the translation and scaling parameters of the box predicted for the Proposal, and $t_i^* = (t_x^*, t_y^*, t_w^*, t_h^*)$ denotes the translation and scaling parameters of the ground truth corresponding to the Proposal:

$t_x$: translation scaling parameter of the x coordinate of the upper left corner of the predicted target box;
$t_y$: translation scaling parameter of the y coordinate of the upper left corner of the predicted target box;
$t_w$: translation scaling parameter of the width w of the predicted target box;
$t_h$: translation scaling parameter of the height h of the predicted target box;
$t_x^*$: translation scaling parameter of the x coordinate of the upper left corner of the real target box;
$t_y^*$: translation scaling parameter of the y coordinate of the upper left corner of the real target box;
$t_w^*$: translation scaling parameter of the width w of the real target box;
$t_h^*$: translation scaling parameter of the height h of the real target box.
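The two-part training loss described above (a classification term plus a regression term that is active only for positive samples) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patented implementation: the classification term is taken to be binary cross-entropy, the classification normalizer is the number of samples, and the regression normalizer is the number of positive samples. The names `smooth_l1` and `detection_loss` are hypothetical.

```python
import math

def smooth_l1(x):
    """Smooth L1 penalty: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def detection_loss(p, p_star, t, t_star, lam=1.0):
    """Classification loss plus regression loss gated by the ground-truth label.

    p      : predicted foreground probability per sample
    p_star : ground-truth label per sample (1 = positive, 0 = negative)
    t      : predicted (tx, ty, tw, th) per sample
    t_star : ground-truth (tx*, ty*, tw*, th*) per sample
    Assumptions: binary cross-entropy as the classification loss,
    N_cls = number of samples, N_reg = number of positive samples.
    """
    eps = 1e-12                      # avoids log(0)
    n_cls = len(p)
    n_reg = max(1, sum(p_star))
    cls_sum, reg_sum = 0.0, 0.0
    for pi, yi, ti, tsi in zip(p, p_star, t, t_star):
        cls_sum += -(yi * math.log(pi + eps) + (1 - yi) * math.log(1 - pi + eps))
        if yi == 1:                  # regression term is active for positives only
            reg_sum += sum(smooth_l1(a - b) for a, b in zip(ti, tsi))
    return cls_sum / n_cls + lam * reg_sum / n_reg
```

Because the regression term is multiplied by the label, a negative sample contributes nothing to it, however far its predicted box parameters are from any ground truth.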
(3) The deep neural network obtained in the above steps is trained and its parameters are optimized on the training set of the target detection data set: forward propagation and back propagation are carried out for each input image, and the internal parameters of the model are updated based on the loss function described above, yielding the final target detection model.
Step 103: performing model training on the cavity pyramid network model according to the target detection data set to obtain a target detection model.
Step 104: performing target detection on the image to be detected by using the target detection model.
A test set of a target detection data set is used as a test example and is input into a trained deep neural network model to detect a target in an image, and the specific process is as follows:
(1) a group of images to be tested is input, with the maximum side length of each input image limited to 1024; after feature extraction by the backbone network and the cavity pyramid network, feature maps are output and fed into the region proposal network (RPN), which yields 400 candidate target regions in the image, namely Proposals;
(2) the feature map of the original image and each candidate target region are input into the ROI Align layer, which extracts the feature map of each candidate target region and outputs it at a uniform size (7 × 7) for the subsequent bounding-box regression and classification of the target;
(3) the feature information of each Proposal passes through the fully connected layers, the regression branch and the classification branch to obtain the rectangular position and the category of the detection box of each target. Finally, the circumscribed rectangles and categories of all detected targets are marked in the original image;
(4) the indexes used to evaluate the results are the average precision (AP) and the mean average precision (mAP). True Negative (TN): judged to be a negative sample and actually a negative sample; True Positive (TP): judged to be a positive sample and actually a positive sample; False Negative (FN): judged to be a negative sample but actually a positive sample; False Positive (FP): judged to be a positive sample but actually a negative sample. Recall = TP/(TP + FN) and Precision = TP/(TP + FP); the Precision-Recall (P-R) curve is a two-dimensional curve with Precision on the vertical axis and Recall on the horizontal axis. The AP of a category is the area enclosed under its P-R curve, and the mAP is the mean of the AP values over all categories.
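The evaluation indexes in item (4) can be computed as in the following sketch. The text defines AP as the area under the P-R curve without naming an interpolation scheme, so the sketch assumes the common all-point interpolation (precision is made non-increasing in recall before summing). The function names, and the convention that each detection arrives already flagged as true or false positive, are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the P-R curve of one category.

    Assumption: all-point interpolation; each detection is already
    matched to the ground truth and flagged as TP (True) or FP (False).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:                       # sweep the score threshold downward
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Replace each precision by the maximum precision at any higher recall.
    for j in range(len(precisions) - 2, -1, -1):
        precisions[j] = max(precisions[j], precisions[j + 1])
    ap, prev_recall = 0.0, 0.0
    for prec, rec in zip(precisions, recalls):
        ap += (rec - prev_recall) * prec  # rectangle under the curve
        prev_recall = rec
    return ap

def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-category AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

For example, three detections with scores 0.9, 0.8 and 0.7, of which the first and third are true positives against two ground-truth targets, give an AP of 5/6 under this interpolation.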
The method of the invention has the following beneficial effects:
(1) a hole pyramid network is designed in which hole convolutions with different convolution kernels extract the feature information, so that the convolution layer has receptive fields of different sizes; this helps a single feature layer acquire richer multi-scale context information and improves the detection accuracy of the network for targets of different sizes. In addition, the hole convolution operation does not change the resolution of the feature map, which benefits accurate localization of the target.
(2) a Fast RCNN detection network based on the cavity pyramid network is constructed; the whole detection network combines the multi-branch hole convolution structure with multi-scale feature fusion, which strengthens the sensitivity of each feature layer to targets of different scales and thereby jointly enhances the ability of the network to detect multi-scale targets.
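The receptive-field argument in item (1) can be checked with simple arithmetic. A k × k kernel with expansion (dilation) rate d covers k + (k − 1)(d − 1) pixels per side, and at stride 1 the feature map keeps its resolution when d(k − 1)/2 zeros are padded on each side. The padding value is an assumption needed for the resolution to stay fixed; the text does not state it. The helper names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Side length covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def same_padding(kernel_size, dilation):
    """Zero padding per side that keeps the size at stride 1 (odd kernels)."""
    return dilation * (kernel_size - 1) // 2

def output_size(input_size, kernel_size, dilation, padding, stride=1):
    """Standard convolution output-size formula with dilation."""
    k_eff = effective_kernel_size(kernel_size, dilation)
    return (input_size + 2 * padding - k_eff) // stride + 1

# Kernel sizes and expansion rate used by the convolution branch in this document.
for k in (3, 5):
    d = 2
    pad = same_padding(k, d)
    print(k, "x", k, "kernel, rate", d,
          "-> covers", effective_kernel_size(k, d),
          "pixels per side, padding", pad,
          ", 64 ->", output_size(64, k, d, pad))
```

With an expansion rate of 2, the 3 × 3 and 5 × 5 kernels cover 5 and 9 pixels per side respectively, while a 64-pixel-wide map stays 64 pixels wide under the assumed padding.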
Fig. 4 is a schematic structural diagram of a multi-scale target detection system of the present invention, which includes:
a data set acquisition module 201, configured to acquire a target detection data set;
a cavity pyramid network model building module 202, configured to build a cavity pyramid network model; the cavity pyramid network model comprises a plurality of sequentially connected convolution modules and a plurality of convolution branches; each convolution module comprises a convolution operation, and the images output by the convolution modules decrease in size successively along the image input direction; the output of each convolution module is connected to one convolution branch; each convolution branch comprises one convolution operation and a plurality of cavity convolution operations, which are arranged in parallel; the cavity convolution operations within one convolution branch have the same expansion rate and different convolution kernel sizes; the feature maps output by the convolution operation and the cavity convolution operations of a convolution branch are added element-wise to obtain a first feature map; the first feature maps are up-sampled in order from low resolution to high resolution, and each up-sampled map is added element-wise to the adjacent first feature map of the same size, yielding a plurality of fused feature maps;
the cavity pyramid network model training module 203 is used for performing model training on the cavity pyramid network model according to the target detection data set to obtain a target detection model;
and the target detection module 204 is configured to perform target detection on the image to be detected by using the target detection model.
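The top-down fusion performed by the model building module can be sketched on plain nested lists. The text does not say whether the map that is up-sampled at each step is the already fused map or the raw first feature map; the sketch assumes the cascaded, feature-pyramid-style variant, nearest-neighbour up-sampling, and a factor of 2 between adjacent levels. All function names are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Double the height and width of a 2D map by repeating each value."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def add_maps(a, b):
    """Element-wise addition of two 2D maps of identical size."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("feature maps must have the same size")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def top_down_fuse(first_maps):
    """Fuse first feature maps ordered from high to low resolution.

    Assumption: each level is half the size of the previous one, and the
    fused (not the raw) lower-resolution map is up-sampled at each step.
    """
    fused = [None] * len(first_maps)
    fused[-1] = first_maps[-1]            # lowest resolution passes through
    for i in range(len(first_maps) - 2, -1, -1):
        fused[i] = add_maps(first_maps[i], upsample_nearest_2x(fused[i + 1]))
    return fused

# Three levels: 4x4, 2x2 and 1x1 maps filled with 1, 2 and 3.
levels = [[[1] * 4 for _ in range(4)], [[2] * 2 for _ in range(2)], [[3]]]
for fmap in top_down_fuse(levels):
    print(fmap)
```

In this toy example the 1 × 1 map is carried up through every level, so the highest-resolution fused map holds the sum of all three inputs at each position.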
The output of each convolution module passes through a convolution layer with convolution kernel of 1 x 1 and then is input into a convolution branch.
The convolution branch comprises a convolution operation with a convolution kernel of 1 x 1 and 2 hole convolution operations, wherein the 2 hole convolution operations are a first hole convolution operation and a second hole convolution operation respectively; the first hole convolution operation is a hole convolution operation with a convolution kernel of 3 x 3 and an expansion rate of 2; the second hole convolution operation is a hole convolution operation with a convolution kernel of 5 x 5 and an expansion rate of 2.
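A single-channel sketch of one convolution branch may help make the parallel structure concrete: a 1 × 1 convolution, a 3 × 3 hole convolution and a 5 × 5 hole convolution, both with expansion rate 2, are applied to the same input and their outputs are added element-wise. Zero padding that preserves the input size, stride 1, and the absence of biases and multiple channels are simplifying assumptions; `dilated_conv2d` and `conv_branch` are hypothetical names.

```python
def dilated_conv2d(image, kernel, dilation=1):
    """Single-channel 2D convolution (cross-correlation form), stride 1.

    Zero padding of dilation * (k - 1) // 2 per side keeps the output the
    same size as the input (assumes an odd, square kernel).
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy = y + i * dilation - pad   # taps are spaced `dilation` apart
                    xx = x + j * dilation - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out

def conv_branch(image, w1, w3, w5, dilation=2):
    """Parallel 1x1 conv, 3x3 hole conv and 5x5 hole conv, summed element-wise."""
    a = dilated_conv2d(image, w1, 1)
    b = dilated_conv2d(image, w3, dilation)
    c = dilated_conv2d(image, w5, dilation)
    return [[a[y][x] + b[y][x] + c[y][x] for x in range(len(image[0]))]
            for y in range(len(image))]

# A single bright pixel shows which output positions each kernel reaches.
image = [[0.0] * 9 for _ in range(9)]
image[4][4] = 1.0
ones = lambda k: [[1.0] * k for _ in range(k)]
result = conv_branch(image, ones(1), ones(3), ones(5))
for row in result:
    print(row)
```

The printed response is non-zero only at even offsets from the bright pixel, which shows the gaps introduced by the expansion rate, and the 5 × 5 kernel reaches twice as far as the 3 × 3 kernel.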
The cavity pyramid network model further comprises a regional suggestion network (region proposal network, RPN); each fused feature map is input into the regional suggestion network, which outputs the candidate regions corresponding to each fused feature map;
the cavity pyramid network model further comprises an ROI (region of interest) pooling layer and a detection head, wherein the input of the ROI pooling layer is connected with the output of the regional suggestion network, the output of the ROI pooling layer is connected with the detection head, and the detection head is used for outputting a detection result.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A multi-scale target detection method is characterized by comprising the following steps:
collecting a target detection data set;
constructing a cavity pyramid network model; the cavity pyramid network model comprises a plurality of convolution modules and a plurality of convolution branches which are sequentially connected, each convolution module comprises convolution operation, the output of each convolution module is respectively connected with one convolution branch, each convolution branch comprises one convolution operation and a plurality of cavity convolution operations, and the convolution operations in the convolution branches and the cavity convolution operations are in parallel relation; the expansion rates of a plurality of cavity convolution operations in one convolution branch are the same, and the sizes of convolution kernels are different; the feature graph output by one convolution operation and a plurality of hole convolution operations in the convolution branch adopts element level addition operation to obtain a first feature graph; sequentially carrying out up-sampling operation on each first feature map according to the sequence from low resolution to high resolution, and then carrying out element-level addition on the first feature maps and adjacent first feature maps with the same size to obtain a plurality of fused feature maps;
performing model training on the cavity pyramid network model according to the target detection data set to obtain a target detection model;
and carrying out target detection on the image to be detected by utilizing the target detection model.
2. The method according to claim 1, wherein the output of each convolution module is input to the convolution branch after passing through a convolution layer with a convolution kernel of 1 x 1.
3. The method according to claim 1, wherein the void pyramid network model further includes a regional suggestion network, each of the fused feature maps is input to the regional suggestion network, and the regional suggestion network outputs candidate regions corresponding to each of the fused feature maps.
4. The multi-scale object detection method according to claim 3, wherein the void pyramid network model further comprises an ROI pooling layer and a detection head, wherein an input of the ROI pooling layer is connected with an output of the region suggestion network, an output of the ROI pooling layer is connected with the detection head, and the detection head is used for outputting a detection result.
5. The method of claim 1, wherein the convolution branch comprises a convolution operation with a convolution kernel of 1 x 1 and 2 hole convolution operations.
6. The multi-scale object detection method according to claim 5, wherein the 2 hole convolution operations are a first hole convolution operation and a second hole convolution operation, respectively; the first hole convolution operation is a hole convolution operation with a convolution kernel of 3 x 3 and an expansion rate of 2; the second hole convolution operation is a hole convolution operation with a convolution kernel of 5 x 5 and an expansion rate of 2.
7. A multi-scale object detection system, comprising:
the data set acquisition module is used for acquiring a target detection data set;
the cavity pyramid network model building module is used for building a cavity pyramid network model; the cavity pyramid network model comprises a plurality of convolution modules and a plurality of convolution branches which are sequentially connected, each convolution module comprises convolution operation, the output of each convolution module is respectively connected with one convolution branch, each convolution branch comprises one convolution operation and a plurality of cavity convolution operations, and the convolution operations in the convolution branches and the cavity convolution operations are in parallel relation; the expansion rates of a plurality of cavity convolution operations in one convolution branch are the same, and the sizes of convolution kernels are different; the feature graph output by one convolution operation and a plurality of hole convolution operations in the convolution branch adopts element level addition operation to obtain a first feature graph; sequentially carrying out up-sampling operation on each first feature map according to the sequence from low resolution to high resolution, and then carrying out element-level addition on the first feature maps and adjacent first feature maps with the same size to obtain a plurality of fused feature maps;
the cavity pyramid network model training module is used for carrying out model training on the cavity pyramid network model according to the target detection data set to obtain a target detection model;
and the target detection module is used for carrying out target detection on the image to be detected by utilizing the target detection model.
8. The multi-scale object detection system of claim 7, wherein the output of each convolution module is input to the convolution branch after passing through a convolution layer with a convolution kernel of 1 x 1.
9. The multi-scale target detection system of claim 7, wherein the convolution branch comprises a convolution operation with a convolution kernel of 1 x 1 and 2 hole convolution operations, the 2 hole convolution operations being a first hole convolution operation and a second hole convolution operation, respectively; the first hole convolution operation is a hole convolution operation with a convolution kernel of 3 x 3 and an expansion rate of 2; the second hole convolution operation is a hole convolution operation with a convolution kernel of 5 x 5 and an expansion rate of 2.
10. The multi-scale object detection system according to claim 7, wherein the void pyramid network model further includes a regional suggestion network, each of the fused feature maps is input to the regional suggestion network, and the regional suggestion network outputs a candidate region corresponding to each of the fused feature maps;
the cavity pyramid network model further comprises an ROI (region of interest) pooling layer and a detection head, wherein the input of the ROI pooling layer is connected with the output of the region suggestion network, the output of the ROI pooling layer is connected with the detection head, and the detection head is used for outputting a detection result.
CN202110910802.6A 2021-08-10 2021-08-10 Multi-scale target detection method and system Active CN113361528B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110910802.6A CN113361528B (en) 2021-08-10 2021-08-10 Multi-scale target detection method and system

Publications (2)

Publication Number Publication Date
CN113361528A (en) 2021-09-07
CN113361528B (en) 2021-10-29

Family

ID=77540829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110910802.6A Active CN113361528B (en) 2021-08-10 2021-08-10 Multi-scale target detection method and system

Country Status (1)

Country Link
CN (1) CN113361528B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114005017A (en) * 2021-09-18 2022-02-01 北京旷视科技有限公司 Target detection method and device, electronic equipment and storage medium
CN114022746A (en) * 2021-11-03 2022-02-08 合肥工业大学 Polynomial multi-scale spatial feature learning method
CN116206248A (en) * 2023-04-28 2023-06-02 江西省水利科学院(江西省大坝安全管理中心、江西省水资源管理中心) Target detection method based on machine learning guide deep learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180018524A1 (en) * 2015-12-16 2018-01-18 Intel Corporation Fully convolutional pyramid networks for pedestrian detection
CN109325534A (en) * 2018-09-22 2019-02-12 天津大学 A kind of semantic segmentation method based on two-way multi-Scale Pyramid
CN110717527A (en) * 2019-09-24 2020-01-21 东南大学 Method for determining target detection model by combining void space pyramid structure
CN111126202A (en) * 2019-12-12 2020-05-08 天津大学 Optical remote sensing image target detection method based on void feature pyramid network
CN112364855A (en) * 2021-01-14 2021-02-12 北京电信易通信息技术股份有限公司 Video target detection method and system based on multi-scale feature fusion
CN113313094A (en) * 2021-07-30 2021-08-27 北京电信易通信息技术股份有限公司 Vehicle-mounted image target detection method and system based on convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant