CN113361528B - Multi-scale target detection method and system

Multi-scale target detection method and system

Info

Publication number
CN113361528B
CN113361528B
Authority
CN
China
Prior art keywords
dilated convolution
convolution
target detection
branch
Prior art date
Legal status
Active
Application number
CN202110910802.6A
Other languages
Chinese (zh)
Other versions
CN113361528A (en)
Inventor
朱敏
严凡
王帅
赵文登
Current Assignee
Beijing Telecom Easiness Information Technology Co Ltd
Original Assignee
Beijing Telecom Easiness Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Telecom Easiness Information Technology Co Ltd
Priority to CN202110910802.6A
Publication of CN113361528A
Application granted
Publication of CN113361528B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Abstract

The invention relates to a multi-scale target detection method and system. The method comprises: constructing a dilated pyramid network model, which comprises a plurality of sequentially connected convolution modules and a plurality of convolution branches, the output of each convolution module being connected to one convolution branch; each convolution branch comprises one convolution operation and a plurality of dilated convolution operations arranged in parallel, and the dilated convolutions within one branch share the same dilation rate but use convolution kernels of different sizes; the feature maps output by the operations of a branch are fused by element-wise addition into a first feature map; then, proceeding from low to high resolution, each first feature map is upsampled and added element-wise to the adjacent first feature map of the same size, yielding a plurality of fused feature maps; the dilated pyramid network model is trained on a target detection data set to obtain a target detection model; and the target detection model is used to perform target detection on an image to be detected. The invention improves the accuracy of target detection.

Description

Multi-scale target detection method and system
Technical Field
The invention relates to the technical field of target detection, in particular to a multi-scale target detection method and system.
Background
Target detection is a core research direction in the field of computer vision; its goal is to obtain the class and position of each target of interest in an image. The technology is not only the research foundation of many computer vision tasks such as target tracking and semantic segmentation, but is also widely applied in civil and military fields such as medical diagnosis, automatic driving, intelligent video surveillance and military target monitoring. As application scenes become more diverse and complex, the objects to be detected often span several different scales, so current target detection tasks face a serious challenge caused by scale differences. Multi-scale target detection has therefore become one of the research hotspots in the field of target detection.
As the mainstream approach to the scale problem, multi-scale feature fusion builds a feature pyramid network that fuses the shallow feature maps of a convolutional neural network, which carry more detailed information, with its deep feature maps, which carry more semantic information, so that every feature layer holds rich detail features and semantic features at the same time, effectively improving the feature expression capability of the network. However, although each feature layer in existing multi-scale feature fusion methods contains rich feature information, each layer is sensitive only to targets within a fixed scale range; the feature information is therefore under-utilized, and the ability of each feature layer to detect targets of different scales remains limited.
Disclosure of Invention
The invention aims to provide a multi-scale target detection method and system that improve the accuracy of target detection.
In order to achieve the purpose, the invention provides the following scheme:
a multi-scale target detection method, comprising:
collecting a target detection data set;
constructing a dilated pyramid network model; the dilated pyramid network model comprises a plurality of sequentially connected convolution modules and a plurality of convolution branches, each convolution module comprising a convolution operation; the output of each convolution module is connected to one convolution branch, each convolution branch comprises one convolution operation and a plurality of dilated convolution operations, and the convolution operation and the dilated convolution operations within a branch are arranged in parallel; the dilated convolution operations within one convolution branch share the same dilation rate but use convolution kernels of different sizes; the feature maps output by the convolution operation and the dilated convolution operations of a branch are combined by element-wise addition to obtain a first feature map; each first feature map is upsampled in order from low resolution to high resolution and then added element-wise to the adjacent first feature map of the same size, yielding a plurality of fused feature maps;
performing model training on the dilated pyramid network model according to the target detection data set to obtain a target detection model;
and performing target detection on an image to be detected using the target detection model.
Optionally, the output of each convolution module passes through a convolution layer with a 1 × 1 convolution kernel before being input into the convolution branch.
Optionally, the dilated pyramid network model further includes a region proposal network; each of the fused feature maps is input into the region proposal network, which outputs the candidate regions corresponding to each fused feature map.
Optionally, the dilated pyramid network model further includes an ROI pooling layer and a detection head; the input of the ROI pooling layer is connected to the output of the region proposal network, the output of the ROI pooling layer is connected to the detection head, and the detection head is used to output the detection result.
Optionally, the convolution branch comprises a convolution operation with a 1 × 1 convolution kernel and 2 dilated convolution operations.
Optionally, the 2 dilated convolution operations are a first dilated convolution operation and a second dilated convolution operation respectively; the first dilated convolution operation has a 3 × 3 convolution kernel and a dilation rate of 2; the second dilated convolution operation has a 5 × 5 convolution kernel and a dilation rate of 2.
The invention also discloses a multi-scale target detection system, which comprises:
the data set acquisition module is used for acquiring a target detection data set;
a dilated pyramid network model building module, used for building a dilated pyramid network model; the dilated pyramid network model comprises a plurality of sequentially connected convolution modules and a plurality of convolution branches, each convolution module comprising a convolution operation; the output of each convolution module is connected to one convolution branch, each convolution branch comprises one convolution operation and a plurality of dilated convolution operations, and the convolution operation and the dilated convolution operations within a branch are arranged in parallel; the dilated convolution operations within one convolution branch share the same dilation rate but use convolution kernels of different sizes; the feature maps output by the convolution operation and the dilated convolution operations of a branch are combined by element-wise addition to obtain a first feature map; each first feature map is upsampled in order from low resolution to high resolution and then added element-wise to the adjacent first feature map of the same size, yielding a plurality of fused feature maps;
a dilated pyramid network model training module, used for training the dilated pyramid network model according to the target detection data set to obtain a target detection model;
and a target detection module, used for performing target detection on an image to be detected using the target detection model.
Optionally, the output of each convolution module passes through a convolution layer with a 1 × 1 convolution kernel before being input into the convolution branch.
Optionally, the convolution branch includes a convolution operation with a 1 × 1 convolution kernel and 2 dilated convolution operations, the 2 dilated convolution operations being a first dilated convolution operation and a second dilated convolution operation respectively; the first dilated convolution operation has a 3 × 3 convolution kernel and a dilation rate of 2; the second dilated convolution operation has a 5 × 5 convolution kernel and a dilation rate of 2.
Optionally, the dilated pyramid network model further includes a region proposal network; each of the fused feature maps is input into the region proposal network, which outputs the candidate regions corresponding to each fused feature map;
the dilated pyramid network model further comprises an ROI (region of interest) pooling layer and a detection head; the input of the ROI pooling layer is connected to the output of the region proposal network, the output of the ROI pooling layer is connected to the detection head, and the detection head is used to output the detection result.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects:
the invention combines the cavity convolution multi-branch structure with the multi-scale feature fusion technology, and extracts feature information by adopting the cavity convolution of different convolution kernels, so that the convolution layer has different sizes of receptive fields, a single feature layer is facilitated to acquire richer multi-scale context feature information, the sensitivity of each feature layer to different-scale targets is enhanced, and the accuracy of target detection is improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of the multi-scale target detection method of the present invention;
FIG. 2 is a detailed flow chart of the multi-scale target detection method of the present invention;
FIG. 3 is a schematic diagram of the structure of the dilated pyramid network model of the present invention;
FIG. 4 is a schematic structural diagram of the multi-scale target detection system of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a multi-scale target detection method and system that improve the accuracy of target detection.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic flow chart of the multi-scale target detection method of the present invention. As shown in fig. 1, the method specifically includes the following steps:
step 101: a target detection data set is collected.
First, image data are acquired with a vehicle-mounted camera while the vehicle is moving through different traffic scenes, and the images are preprocessed. Second, the targets in the images (various vehicles, pedestrians, traffic signs and road barriers) are annotated with their classes and positions using image annotation software, producing an annotation file for each image. Finally, the data are divided into a training set and a test set, and the image data and annotation files are converted into the VOC2007 data set format, yielding the target detection data set.
Image preprocessing is a data augmentation operation that includes horizontal flipping and brightness/contrast adjustment, which enhance the robustness of the network to lighting changes.
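As a minimal sketch of this preprocessing step with PyTorch/torchvision (the flip probability and jitter strengths are assumptions, not values from the patent; in a real detection pipeline the horizontal flip must also be applied to the box annotations):

    import torchvision.transforms as T

    # Horizontal flipping plus brightness/contrast adjustment, as described above.
    train_transform = T.Compose([
        T.RandomHorizontalFlip(p=0.5),                # horizontal flipping
        T.ColorJitter(brightness=0.3, contrast=0.3),  # lighting robustness
        T.ToTensor(),
    ])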
As a specific embodiment, the image annotation software is LabelImg.
Step 102: construct a dilated pyramid network model. The dilated pyramid network model comprises a plurality of sequentially connected convolution modules and a plurality of convolution branches; each convolution module comprises a convolution operation, and the feature maps output by successive convolution modules shrink progressively along the input direction. The output of each convolution module is connected to one convolution branch; each convolution branch comprises one convolution operation and a plurality of dilated convolution operations, the convolution operation and the dilated convolution operations within a branch being arranged in parallel. The dilated convolution operations within one convolution branch share the same dilation rate but use convolution kernels of different sizes. The feature maps output by the convolution operation and the dilated convolution operations of a branch are combined by element-wise addition to obtain a first feature map. Each first feature map is upsampled in order from low resolution to high resolution and then added element-wise to the adjacent first feature map of the same size, yielding a plurality of fused feature maps.
The plurality of convolution modules are a first, second, third, fourth and fifth convolution module connected in sequence, whose outputs are C_1, C_2, C_3, C_4 and C_5 respectively; the output feature map of the fifth convolution module is downsampled by a factor of 2 (0.5× scale) to obtain the feature map C_6.
Each convolution module is a ResNet-101 convolution module.
A convolution layer with a 1 × 1 kernel is arranged between each convolution module and the dilated pyramid network, and the output of each convolution module passes through this 1 × 1 convolution layer before entering the pyramid. Specifically, C_1, C_2, C_3, C_4, C_5 and C_6 are passed through 1 × 1 convolution layers and then input into the dilated pyramid network, whose corresponding outputs are the feature maps P_1, P_2, P_3, P_4, P_5 and P_6.
Feature map P_6 is upsampled by a factor of 2 to the scale of P_5 and added element-wise to P_5 to obtain feature map F_5; P_5 is likewise upsampled 2× and added element-wise to P_4 to obtain F_4; P_4 is upsampled and added to P_3 to obtain F_3; P_3 is upsampled and added to P_2 to obtain F_2; and P_2 is upsampled and added to P_1 to obtain F_1. The feature maps F_1, F_2, F_3 and F_4 then each pass through one 3 × 3 convolution layer to obtain the updated F_1, F_2, F_3 and F_4, which eliminates the feature aliasing effect of the lower layers.
The backbone of the dilated pyramid network model (a convolutional neural network) is ResNet-101.
The output of each convolution module passes through a convolution layer with a 1 × 1 kernel and is then input into its convolution branch.
Each convolution branch comprises a conventional convolution operation with a 1 × 1 kernel and two dilated convolution operations: a first dilated convolution with a 3 × 3 kernel and a dilation rate of 2, and a second dilated convolution with a 5 × 5 kernel and a dilation rate of 2.
The dilated pyramid network model further includes a region proposal network (the region candidate network in fig. 2); each fused feature map is input into the region proposal network, which outputs the candidate regions corresponding to each fused feature map.
The dilated pyramid network model further comprises an ROI (region of interest) pooling layer (the target region pooling in fig. 2), whose input is connected to the output of the region proposal network, and a detection head connected to the output of the ROI pooling layer, which outputs the detection result. The detection head comprises a regression branch and a classification branch.
As shown in figs. 2-3, the specific multi-scale target detection process of the present invention is described below, taking a 1024 × 1024 input image as an example. It comprises the following steps:
(1) Design the dilated pyramid network and embed it into the backbone ResNet-101 of a Faster R-CNN network. The backbone used by Faster R-CNN in the invention is ResNet-101, which extracts the feature information of the input image; it consists of 5 convolution modules (conv1, conv2, conv3, conv4 and conv5), whose output feature maps are C_1, C_2, C_3, C_4 and C_5 respectively. The dilated pyramid network is designed and embedded after the ResNet-101 convolution modules so that the subsequent feature maps obtain rich multi-scale context information. As shown in fig. 3, taking the 1024 × 1024 input image as an example and C_1, C_2, C_3, C_4 and C_5 as the input of the dilated pyramid network, the design process is as follows.
firstly, in order to realize the detection of large-scale targets, 0.5-fold down-sampling is carried out on the output feature map C _5 of the 5 th convolution module of the ResNet101 to obtain C _6, so as to obtain a group of feature maps C _ 1-C _6, wherein the feature map sizes are 512 multiplied by 128, 256 multiplied by 256, 128 multiplied by 512, 64 multiplied by 1024, 32 multiplied by 2048 and 16 multiplied by 2048 in sequence. Next, these 6 feature maps are input into convolutional layers with a convolutional kernel of 1 × 1, and this operation is to unify the number of channels of the 6 feature maps into a fixed value of 256, that is, 512 × 512 × 256, 256 × 256 × 256, 128 × 128 × 256, 64 × 64 × 256, 32 × 32 × 256, and 16 × 16 × 256, while ensuring that the spatial size of the feature maps is not changed.
Then, the dilated-convolution multi-branch structure is constructed. As shown in fig. 3, the same operations are applied to all 6 feature maps; C_5 is taken as an example here. The first branch is a convolution with a 1 × 1 kernel, which preserves the original feature information of the feature map without changing its size. The second branch is a dilated convolution with a 3 × 3 kernel and a dilation rate of 2 (denoted rate = 2 in fig. 3); to keep the feature map size unchanged, the pixel padding is set to 2. The third branch is a dilated convolution with a 5 × 5 kernel and a dilation rate of 2; to keep the feature map size unchanged, the padding is set to 4. C_5 is fed into all three branches, producing three feature maps of size 32 × 32 × 256, which are fused by element-wise addition into the feature map P_5 of size 32 × 32 × 256. After the multi-branch structure, the feature maps P_1, P_2, P_3, P_4, P_5 and P_6 are obtained in sequence, with sizes 512 × 512 × 256, 256 × 256 × 256, 128 × 128 × 256, 64 × 64 × 256, 32 × 32 × 256 and 16 × 16 × 256. In fig. 3, Conv denotes a convolution operation and D-Conv denotes a dilated convolution operation.
The dilated-convolution multi-branch structure extracts feature information with dilated convolutions of different kernel sizes, so that the convolution layers obtain receptive fields of different sizes, which helps to extract the feature information of targets of different sizes. Meanwhile, to avoid losing the detail information that helps small-target detection, the module keeps the first branch to preserve the original feature information of the feature map. In addition, the dilated convolution operation does not change the resolution of the feature map, which benefits accurate target localization.
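To make the branch construction concrete, the following is a minimal PyTorch sketch of one dilated-convolution multi-branch group under the kernel, dilation and padding settings stated above; the class and variable names are illustrative, not from the patent:

    import torch
    import torch.nn as nn

    class DilatedBranch(nn.Module):
        # A 1x1 convolution preserving the original features, in parallel with
        # 3x3 and 5x5 dilated convolutions (dilation 2, padding 2 and 4), fused
        # by element-wise addition so the spatial size never changes.
        def __init__(self, channels: int = 256):
            super().__init__()
            self.identity = nn.Conv2d(channels, channels, kernel_size=1)
            self.dilated3 = nn.Conv2d(channels, channels, kernel_size=3,
                                      dilation=2, padding=2)
            self.dilated5 = nn.Conv2d(channels, channels, kernel_size=5,
                                      dilation=2, padding=4)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Element-level addition of the three parallel feature maps.
            return self.identity(x) + self.dilated3(x) + self.dilated5(x)

    # e.g. C_5 after channel reduction, 32 x 32 x 256 -> P_5, 32 x 32 x 256:
    p5 = DilatedBranch(256)(torch.randn(1, 256, 32, 32))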
Then, the dilated pyramid structure is built on the multi-scale feature fusion operation. As shown in fig. 3, F_6 is obtained directly from P_6; F_6 is upsampled by a factor of 2 to the scale of P_5 and added element-wise to P_5 to obtain F_5. Each upper-layer (lower-resolution but semantically stronger) feature map is in turn upsampled 2× to the size of the layer below and added element-wise to that higher-resolution map, yielding the F_4, F_3, F_2 and F_1 layers in sequence. Finally, the F_1, F_2, F_3 and F_4 layers each pass through one convolution with a 3 × 3 kernel to eliminate the feature aliasing effect of the lower layers, giving the final F_1, F_2, F_3 and F_4 layers.
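The top-down fusion can be sketched as follows; nearest-neighbour upsampling is an assumption (the text specifies only 2× upsampling), and the final 3 × 3 smoothing convolutions are omitted for brevity:

    import torch.nn.functional as F

    def top_down_fuse(p_maps):
        # p_maps = [P_1, ..., P_6], finest to coarsest, all with 256 channels.
        f = [None] * len(p_maps)
        f[-1] = p_maps[-1]                        # F_6 = P_6
        for k in range(len(p_maps) - 2, -1, -1):  # F_5, F_4, ..., F_1
            up = F.interpolate(f[k + 1], scale_factor=2, mode="nearest")
            f[k] = p_maps[k] + up                 # element-level addition
        return f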
The dilated pyramid network is thus built from three parts: the backbone ResNet-101, the dilated-convolution multi-branch structure and the multi-scale feature fusion operation, as shown in fig. 3. By introducing the dilated-convolution multi-branch structure into the multi-level pyramid network, a single feature layer can acquire richer multi-scale context information and pass the feature information of targets of different sizes to subsequent layers, further improving the network's detection accuracy for targets of different sizes.
(2) Design the Faster R-CNN structure based on the dilated pyramid network. The concrete structure (shown in fig. 3) is as follows. The previous step produced the dilated pyramid network from the backbone ResNet-101, the dilated-convolution multi-branch structure and the multi-scale feature fusion operation. With the 1024 × 1024 input image as an example, the 6 feature maps F_1 to F_6 output by the dilated pyramid network have sizes 512 × 512 × 256, 256 × 256 × 256, 128 × 128 × 256, 64 × 64 × 256, 32 × 32 × 256 and 16 × 16 × 256.
Next, the Region Proposal Network (RPN) is constructed. The RPN takes the 6 feature maps F_1 to F_6 as input; its structure consists of a convolution layer with a 3 × 3 kernel followed by two output branches: branch one outputs the probability that a candidate region is a foreground target, and branch two outputs the top-left coordinates, the width and the height of the candidate region's bounding box. The RPN slides anchor boxes over each of the six feature maps F_1 to F_6 to generate a series of candidate regions, and the predictions from the six feature maps are finally concatenated and fused. During RPN training, targets whose IoU (intersection over union) with a ground-truth box exceeds 0.7 are positive samples (targets), and those below 0.3 are negative samples (background).
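A minimal sketch of such an RPN head in PyTorch; the anchor count and the ReLU are assumptions, since the text specifies only the shared 3 × 3 convolution and the two output branches:

    import torch.nn as nn

    class RPNHead(nn.Module):
        def __init__(self, channels: int = 256, num_anchors: int = 3):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, 3, padding=1)  # shared 3x3
            self.cls = nn.Conv2d(channels, num_anchors, 1)      # foreground score
            self.reg = nn.Conv2d(channels, num_anchors * 4, 1)  # box coordinates

        def forward(self, feature):
            t = self.conv(feature).relu()
            return self.cls(t), self.reg(t)  # applied to each of F_1..F_6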
According to the area of each candidate box generated by the RPN, the boxes are mapped to the corresponding feature layer F_k for the subsequent ROI Align operation; the ROI Align layer outputs a batch of candidate-region feature maps of size 7 × 7. The ROI Align operation unifies the sizes of the candidate-region feature maps so that they can be fed into the final fully connected layers for feature extraction and classification.
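The patent does not spell out the area-to-level mapping rule; the sketch below assumes the standard FPN heuristic together with torchvision's roi_align, so the constants (k0 = 4, canonical size 224, stride 16) are illustrative assumptions:

    import math
    import torch
    from torchvision.ops import roi_align

    def assign_level(w, h, k0=4, k_min=1, k_max=6):
        # Map a proposal of width w and height h to pyramid level F_k by area.
        k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
        return max(k_min, min(k_max, k))

    # ROI Align to a fixed 7 x 7 output for one level (stride 16 assumed);
    # boxes is an (N, 5) tensor of [batch_index, x1, y1, x2, y2].
    feats = torch.randn(1, 256, 64, 64)
    boxes = torch.tensor([[0, 32.0, 32.0, 256.0, 256.0]])
    pooled = roi_align(feats, boxes, output_size=(7, 7), spatial_scale=1 / 16)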
After passing through two fully connected layers, the candidate-region feature maps are fed into the two detection branches of Faster R-CNN (a regression branch and a classification branch): the classification loss function separates background from foreground targets and determines the target class of each candidate region, and the regression loss yields the position information of the target after the bounding-box regression. The network model is trained, the loss function is computed, and the parameters of the whole network are updated to obtain the final trained model. The training loss consists of two components, the classification loss and the regression loss, and is calculated as follows:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

where $i$ is the index of a sample; $N_{cls}$ and $N_{reg}$ are normalization parameters; and $\lambda$ is a weight-balancing parameter. $L_{cls}$ denotes the classification loss; $p_i$ is the predicted probability that sample $i$ is a target (e.g. a vehicle), and $p_i^*$ is its ground-truth label. $L_{reg}$ denotes the bounding-box regression loss, defined as $L_{reg}(t_i, t_i^*) = \mathrm{smooth}_{L1}(t_i - t_i^*)$, where $t_i$ is the translation-scaling parameter of the box predicted for a Proposal and $t_i^*$ is the translation-scaling parameter of the real data corresponding to the Proposal. The $\mathrm{smooth}_{L1}$ function is defined as

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$

The factor $p_i^*$ means that the regression term is activated only when a sample is a positive sample ($p_i^* = 1$). Concretely, $t = (t_x, t_y, t_w, t_h)$ collects the translation-scaling parameters of the predicted target box: $t_x$ and $t_y$ for the top-left coordinates $x$ and $y$, $t_w$ for the width $w$, and $t_h$ for the height $h$. Likewise, $t^* = (t_x^*, t_y^*, t_w^*, t_h^*)$ collects the corresponding translation-scaling parameters of the real target box.
(3) Training and parameter optimization are carried out on the deep neural network obtained in the above steps, using the training set of the target detection data set: forward propagation and backward propagation are performed for each input image, and the internal parameters of the model are updated according to the loss function $L$ above, yielding the final target detection model.
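As a sketch, the two-part loss above can be written in PyTorch as follows; the normalizers and the balance weight follow the formula, but their concrete values and the use of cross-entropy for $L_{cls}$ are assumptions:

    import torch.nn.functional as F

    def detection_loss(cls_logits, labels, box_pred, box_target, lam=1.0):
        # Classification term, normalized by N_cls (the number of samples).
        n_cls = labels.numel()
        loss_cls = F.cross_entropy(cls_logits, labels, reduction="sum") / n_cls
        # Regression term: smooth-L1, counted only for positive samples
        # (labels == 1), normalized by N_reg.
        pos = labels == 1
        n_reg = max(int(pos.sum()), 1)
        loss_reg = F.smooth_l1_loss(box_pred[pos], box_target[pos],
                                    reduction="sum") / n_reg
        return loss_cls + lam * loss_reg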
Step 103: perform model training on the dilated pyramid network model according to the target detection data set to obtain the target detection model.
Step 104: perform target detection on the image to be detected using the target detection model.
The test set of the target detection data set is used as the test example and input into the trained deep neural network model to detect the targets in the images. The specific process is as follows:
(1) A group of images to be tested is input, with the maximum side length of an input image limited to 1024. After feature extraction by the backbone and the dilated pyramid network, the feature maps are output and fed into the region proposal network RPN, yielding 400 candidate target regions (Proposals) per image;
(2) The original image feature maps and each candidate target region are input into the ROI Align layer, which extracts the feature map of each candidate region and outputs them at a uniform size (7 × 7) for the subsequent bounding-box regression and class classification of each target;
(3) The feature information of each Proposal passes through the fully connected layers, the regression branch and the classification branch to obtain the rectangular position and the class of each target's detection box. Finally, all bounding rectangles and class labels of the detected targets are drawn on the original image;
(4) The indexes used to evaluate the results are the average precision (AP) and the mean average precision (mAP). True Negative (TN): judged to be a negative sample and in fact a negative sample; True Positive (TP): judged to be a positive sample and in fact a positive sample; False Negative (FN): judged to be a negative sample but in fact a positive sample; False Positive (FP): judged to be a positive sample but in fact a negative sample. Recall = TP / (TP + FN) and Precision = TP / (TP + FP); the precision-recall (P-R) curve is a two-dimensional curve with precision and recall as its vertical and horizontal coordinates. The AP of a category is the area under its P-R curve, and the mAP is the mean of the AP values over all categories.
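A simplified NumPy sketch of the AP computation described here, assuming each detection has already been matched to ground truth (the IoU matching and the VOC-style precision interpolation are omitted):

    import numpy as np

    def average_precision(scores, is_positive, num_gt):
        # Sort detections by confidence, accumulate TP/FP counts, build the
        # precision-recall curve, and integrate the area under it.
        order = np.argsort(-np.asarray(scores))
        hit = np.asarray(is_positive, dtype=bool)[order]
        tp = np.cumsum(hit)
        fp = np.cumsum(~hit)
        recall = tp / num_gt
        precision = tp / (tp + fp)
        # Rectangle-rule area under the P-R curve over the recall steps.
        ap = recall[0] * precision[0]
        ap += float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))
        return float(ap)

    # mAP is then the mean of the per-class AP values.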
The method of the invention has the following beneficial effects:
(1) A dilated pyramid network is designed that extracts feature information with dilated convolutions of different kernel sizes, so that the convolution layers have receptive fields of different sizes; this helps a single feature layer acquire richer multi-scale context information and improves the network's detection accuracy for targets of different sizes. In addition, the dilated convolution operation does not change the resolution of the feature map, which benefits accurate target localization.
(2) A Faster R-CNN detection network based on the dilated pyramid network is constructed. The whole detection network combines the dilated-convolution multi-branch structure with multi-scale feature fusion, enhancing the sensitivity of each feature layer to targets of different scales and thereby jointly strengthening the network's ability to detect multi-scale targets.
Fig. 4 is a schematic structural diagram of a multi-scale target detection system of the present invention, which includes:
a data set acquisition module 201, configured to acquire a target detection data set;
a dilated pyramid network model building module 202, configured to build a dilated pyramid network model; the dilated pyramid network model comprises a plurality of sequentially connected convolution modules and a plurality of convolution branches; each convolution module comprises a convolution operation, and the feature maps output by successive convolution modules shrink progressively along the input direction; the output of each convolution module is connected to one convolution branch, each convolution branch comprises one convolution operation and a plurality of dilated convolution operations, and the convolution operation and the dilated convolution operations within a branch are arranged in parallel; the dilated convolution operations within one convolution branch share the same dilation rate but use convolution kernels of different sizes; the feature maps output by the convolution operation and the dilated convolution operations of a branch are combined by element-wise addition to obtain a first feature map; each first feature map is upsampled in order from low resolution to high resolution and then added element-wise to the adjacent first feature map of the same size, yielding a plurality of fused feature maps;
the cavity pyramid network model training module 203 is used for performing model training on the cavity pyramid network model according to the target detection data set to obtain a target detection model;
and a target detection module 204, configured to perform target detection on the image to be detected using the target detection model.
The output of each convolution module passes through a convolution layer with a 1 × 1 kernel and is then input into its convolution branch.
The convolution branch comprises a convolution operation with a 1 × 1 kernel and 2 dilated convolution operations, the 2 dilated convolution operations being a first dilated convolution operation and a second dilated convolution operation respectively; the first dilated convolution operation has a 3 × 3 convolution kernel and a dilation rate of 2; the second dilated convolution operation has a 5 × 5 convolution kernel and a dilation rate of 2.
The dilated pyramid network model further comprises a region proposal network; each fused feature map is input into the region proposal network, which outputs the candidate regions corresponding to each fused feature map;
the dilated pyramid network model further comprises an ROI (region of interest) pooling layer and a detection head; the input of the ROI pooling layer is connected to the output of the region proposal network, the output of the ROI pooling layer is connected to the detection head, and the detection head is used to output the detection result.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (9)

1. A multi-scale target detection method is characterized by comprising the following steps:
collecting a target detection data set;
constructing a dilated pyramid network model; the dilated pyramid network model comprises a plurality of sequentially connected convolution modules and a plurality of convolution branches, each convolution module comprising a convolution operation; the output of each convolution module is connected to one convolution branch, each convolution branch comprises one convolution operation and a plurality of dilated convolution operations, and the convolution operation and the dilated convolution operations within a branch are arranged in parallel; the dilated convolution operations within one convolution branch share the same dilation rate but use convolution kernels of different sizes; the feature maps output by the convolution operation and the dilated convolution operations of a branch are combined by element-wise addition to obtain a first feature map; each first feature map is upsampled in order from low resolution to high resolution and then added element-wise to the adjacent first feature map of the same size, yielding a plurality of fused feature maps; the convolution branch comprises a convolution operation with a 1 × 1 convolution kernel and 2 dilated convolution operations;
performing model training on the dilated pyramid network model according to the target detection data set to obtain a target detection model;
and performing target detection on an image to be detected using the target detection model.
2. The method according to claim 1, wherein the output of each convolution module passes through a convolution layer with a 1 × 1 convolution kernel before being input into the convolution branch.
3. The method according to claim 1, wherein the dilated pyramid network model further includes a region proposal network; each of the fused feature maps is input into the region proposal network, which outputs the candidate regions corresponding to each fused feature map.
4. The multi-scale target detection method according to claim 3, wherein the dilated pyramid network model further comprises an ROI pooling layer and a detection head; the input of the ROI pooling layer is connected to the output of the region proposal network, the output of the ROI pooling layer is connected to the detection head, and the detection head is used to output the detection result.
5. The multi-scale target detection method according to claim 1, wherein the 2 dilated convolution operations are a first dilated convolution operation and a second dilated convolution operation respectively; the first dilated convolution operation has a 3 × 3 convolution kernel and a dilation rate of 2; the second dilated convolution operation has a 5 × 5 convolution kernel and a dilation rate of 2.
6. A multi-scale target detection system, comprising:
the data set acquisition module is used for acquiring a target detection data set;
a dilated pyramid network model building module, used for building a dilated pyramid network model; the dilated pyramid network model comprises a plurality of sequentially connected convolution modules and a plurality of convolution branches, each convolution module comprising a convolution operation; the output of each convolution module is connected to one convolution branch, each convolution branch comprises one convolution operation and a plurality of dilated convolution operations, and the convolution operation and the dilated convolution operations within a branch are arranged in parallel; the dilated convolution operations within one convolution branch share the same dilation rate but use convolution kernels of different sizes; the feature maps output by the convolution operation and the dilated convolution operations of a branch are combined by element-wise addition to obtain a first feature map; each first feature map is upsampled in order from low resolution to high resolution and then added element-wise to the adjacent first feature map of the same size, yielding a plurality of fused feature maps; the convolution branch comprises a convolution operation with a 1 × 1 convolution kernel and 2 dilated convolution operations;
a dilated pyramid network model training module, used for training the dilated pyramid network model according to the target detection data set to obtain a target detection model;
and a target detection module, used for performing target detection on an image to be detected using the target detection model.
7. The multi-scale target detection system according to claim 6, wherein the output of each convolution module passes through a convolution layer with a 1 × 1 convolution kernel before being input into the convolution branch.
8. The multi-scale target detection system according to claim 6, wherein the 2 dilated convolution operations are a first dilated convolution operation and a second dilated convolution operation respectively; the first dilated convolution operation has a 3 × 3 convolution kernel and a dilation rate of 2; the second dilated convolution operation has a 5 × 5 convolution kernel and a dilation rate of 2.
9. The multi-scale target detection system according to claim 6, wherein the dilated pyramid network model further includes a region proposal network; each of the fused feature maps is input into the region proposal network, which outputs the candidate regions corresponding to each fused feature map;
the dilated pyramid network model further comprises an ROI (region of interest) pooling layer and a detection head; the input of the ROI pooling layer is connected to the output of the region proposal network, the output of the ROI pooling layer is connected to the detection head, and the detection head is used to output the detection result.
CN202110910802.6A 2021-08-10 2021-08-10 Multi-scale target detection method and system Active CN113361528B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110910802.6A CN113361528B (en) 2021-08-10 2021-08-10 Multi-scale target detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110910802.6A CN113361528B (en) 2021-08-10 2021-08-10 Multi-scale target detection method and system

Publications (2)

Publication Number Publication Date
CN113361528A CN113361528A (en) 2021-09-07
CN113361528B true CN113361528B (en) 2021-10-29

Family

ID=77540829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110910802.6A Active CN113361528B (en) 2021-08-10 2021-08-10 Multi-scale target detection method and system

Country Status (1)

Country Link
CN (1) CN113361528B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114022746A (en) * 2021-11-03 2022-02-08 合肥工业大学 Polynomial multi-scale spatial feature learning method
CN116206248B (en) * 2023-04-28 2023-07-18 江西省水利科学院(江西省大坝安全管理中心、江西省水资源管理中心) Target detection method based on machine learning guide deep learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325534A (en) * 2018-09-22 2019-02-12 天津大学 A kind of semantic segmentation method based on two-way multi-Scale Pyramid
CN110717527A (en) * 2019-09-24 2020-01-21 东南大学 Method for determining target detection model by combining void space pyramid structure
CN111126202A (en) * 2019-12-12 2020-05-08 天津大学 Optical remote sensing image target detection method based on void feature pyramid network
CN113313094A (en) * 2021-07-30 2021-08-27 北京电信易通信息技术股份有限公司 Vehicle-mounted image target detection method and system based on convolutional neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3391290A4 (en) * 2015-12-16 2019-08-07 Intel Corporation Fully convolutional pyramid networks for pedestrian detection
CN112364855B (en) * 2021-01-14 2021-04-06 北京电信易通信息技术股份有限公司 Video target detection method and system based on multi-scale feature fusion

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325534A (en) * 2018-09-22 2019-02-12 天津大学 A kind of semantic segmentation method based on two-way multi-Scale Pyramid
CN110717527A (en) * 2019-09-24 2020-01-21 东南大学 Method for determining target detection model by combining void space pyramid structure
CN111126202A (en) * 2019-12-12 2020-05-08 天津大学 Optical remote sensing image target detection method based on void feature pyramid network
CN113313094A (en) * 2021-07-30 2021-08-27 北京电信易通信息技术股份有限公司 Vehicle-mounted image target detection method and system based on convolutional neural network

Also Published As

Publication number Publication date
CN113361528A (en) 2021-09-07

Similar Documents

Publication Publication Date Title
CN112200161B (en) Face recognition detection method based on mixed attention mechanism
CN111783590A (en) Multi-class small target detection method based on metric learning
CN113255589B (en) Target detection method and system based on multi-convolution fusion network
CN111461083A (en) Rapid vehicle detection method based on deep learning
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN113313094B (en) Vehicle-mounted image target detection method and system based on convolutional neural network
CN113361528B (en) Multi-scale target detection method and system
CN112364855B (en) Video target detection method and system based on multi-scale feature fusion
CN114187268A (en) Obstacle detection method based on target detection and semantic segmentation fusion
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN110659601B (en) Depth full convolution network remote sensing image dense vehicle detection method based on central point
CN111860072A (en) Parking control method and device, computer equipment and computer readable storage medium
CN112861619A (en) Model training method, lane line detection method, equipment and device
CN114267025A (en) Traffic sign detection method based on high-resolution network and light-weight attention mechanism
CN113436210B (en) Road image segmentation method fusing context progressive sampling
Zhang et al. Vehicle detection in UAV aerial images based on improved YOLOv3
CN112597996A (en) Task-driven natural scene-based traffic sign significance detection method
CN117152414A (en) Target detection method and system based on scale attention auxiliary learning method
CN114550016B (en) Unmanned aerial vehicle positioning method and system based on context information perception
CN111881914A (en) License plate character segmentation method and system based on self-learning threshold
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN112633162B (en) Pedestrian rapid detection and tracking method suitable for expressway external field shielding condition
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN114882205A (en) Target detection method based on attention mechanism
JP7320307B1 (en) A Complex Multi-Target Precise Hierarchical Gradient Joint Detection Method for Intelligent Traffic

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant