CN111738237A - Target detection method of multi-core iteration RPN based on heterogeneous convolution - Google Patents

Target detection method of multi-core iteration RPN based on heterogeneous convolution

Info

Publication number
CN111738237A
CN111738237A (application CN202010817648.3A)
Authority
CN
China
Prior art keywords
network
convolution
target
image data
heterogeneous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010817648.3A
Other languages
Chinese (zh)
Other versions
CN111738237B (en)
Inventor
刘晋
尚圣杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University filed Critical Shanghai Maritime University
Publication of CN111738237A publication Critical patent/CN111738237A/en
Application granted granted Critical
Publication of CN111738237B publication Critical patent/CN111738237B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
      • G06 — COMPUTING; CALCULATING OR COUNTING
        • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 20/00 — Scenes; Scene-specific elements
          • G06V 2201/00 — Indexing scheme relating to image or video recognition or understanding
            • G06V 2201/07 — Target detection
        • G06F — ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00 — Pattern recognition
            • G06F 18/20 — Analysing
              • G06F 18/24 — Classification techniques
                • G06F 18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
                  • G06F 18/2415 — based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
        • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 — Computing arrangements based on biological models
            • G06N 3/02 — Neural networks
              • G06N 3/04 — Architecture, e.g. interconnection topology
                • G06N 3/045 — Combinations of networks
                • G06N 3/047 — Probabilistic or stochastic networks
              • G06N 3/08 — Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method of a multi-core iterative RPN based on heterogeneous convolution, which comprises the following steps: receiving image data to be detected; carrying out graying, local binarization and data enhancement processing on the image data to obtain processed image data; inputting the processed image data into a heterogeneous convolution network for feature extraction to obtain a feature map; inputting the feature map into a multi-scale feature extraction network to extract features at different scales and obtain a target feature map; inputting the target feature map into an RIR target detection network to obtain a plurality of region candidate frames; obtaining a target score corresponding to each region candidate frame according to the non-maximum suppression function and obtaining region proposal windows according to a preset score threshold; and classifying the region proposal windows according to the fully convolutional network layer and the normalized exponential function (softmax) classifier to obtain a classification result, namely the image category and a confidence score. The embodiments of the invention effectively solve problems such as slow running speed and poor detection of small targets.

Description

Target detection method of multi-core iteration RPN based on heterogeneous convolution
Technical Field
The invention relates to the technical field of computer vision and image processing, and in particular to a target detection method using a multi-core iterative RPN based on heterogeneous convolution.
Background
Target detection is one of the most challenging tasks in computer vision and is widely applied in fields such as autonomous driving and security systems. In natural scenes, however, target detection is affected by factors such as illumination, the orientation of objects and object occlusion. Given the growing practical demand, improving target detection performance in natural scenes has become an urgent need.
Currently, target detection methods fall into two main categories: two-stage and one-stage detection networks. Two-stage target detection proceeds in two stages: (1) a region proposal network (RPN) first generates candidate regions for the image, and (2) a deep learning network then classifies the generated candidate regions. A one-stage network has only one stage, in which a deep learning network directly produces the category probabilities and position information of the objects. Consequently, two-stage networks are characterized by high detection accuracy but low detection speed, whereas one-stage networks are fast but less accurate.
The traditional two-stage network has a good detection effect on common targets and mainly comprises the following steps: (1) extract features of the target image with a feature extraction network such as a residual network (ResNet) or a convolutional neural network (CNN); (2) perform a preliminary detection on the target image with a region proposal network (RPN), which simply separates the foreground from the irrelevant background and generates candidate region frames for the targets; (3) classify the image targets with an image classification network according to the candidate region frames generated by the RPN, and output the final position and category of each target.
Aiming at the problems that target detection is insensitive to targets of different sizes, performs poorly on small targets and is time-consuming, a multi-core iterative RPN target detection network based on heterogeneous convolution is designed. In feature extraction, 1 × 1 convolution kernels are used in place of 3 × 3 convolution kernels, which reduces the computation and the network parameters while maintaining accuracy, greatly shortens the computation time of feature extraction and improves the detection speed. Following the Inception idea proposed by Google, a multi-scale feature extraction network is proposed: convolution kernels of different sizes attend to targets of different sizes in the image, so the multi-scale extraction network improves the detection accuracy for targets of different sizes. Based on the existing RPN mechanism, an iterative RPN-in-RPN (RIR) scheme is designed: on the basis of the region candidate frames generated by the first RPN layer, the second RPN layer screens the generated candidate frames more finely. This screening not only further improves the accuracy of the classification network but also further enhances the detection precision for small targets, thereby addressing the incomplete detection and long detection time of other methods.
The region proposal network (RPN) is a fully convolutional network that can detect targets at every position of the feature map and assign them target scores, generating high-quality region candidate frames. It is the auxiliary detection network introduced in the Faster R-CNN framework proposed by Ross B. Girshick et al. in 2016.
Inception, also known as GoogLeNet, is a CNN classification network model proposed by Google in 2014. Through convolution kernels of different sizes, the Inception network adapts to images of different scales, and the network is widened rather than deepened, which greatly reduces the number of parameters and the amount of computation.
Disclosure of Invention
The invention aims to provide a target detection method using a multi-core iterative RPN based on heterogeneous convolution, which overcomes the defects of the prior art and effectively solves problems in target detection such as insensitivity to targets of different sizes, slow running speed and poor detection of small targets.
In order to achieve the above object, the present invention provides a target detection method for multi-core iteration RPN based on heterogeneous convolution, including:
receiving image data to be detected;
carrying out graying, local binarization and data enhancement processing on the image data to obtain processed image data;
inputting the processed image data into a heterogeneous convolution network for feature extraction to obtain a feature map;
inputting the feature map into a pre-constructed multi-scale feature extraction network to realize feature extraction of different scales and obtain a target feature map;
inputting the target feature map into an RIR network to obtain a plurality of region candidate frames;
obtaining a target score corresponding to each region candidate frame according to the non-maximum suppression function; and screening the plurality of region candidate frames according to a preset score threshold to obtain region proposal windows;
and classifying the region proposal windows according to the fully convolutional network layer and the normalized exponential function (softmax) classifier to obtain a classification result, namely the image category and a confidence score.
Preferably, the step of carrying out graying, local binarization and data enhancement processing on the image data to obtain processed image data includes:
carrying out graying processing on the received image data;
carrying out local binarization processing on the image subjected to the graying processing to obtain a binarized image;
and carrying out noise addition, rotation and flipping on the binarized image by adopting a data enhancement algorithm to obtain the processed image data.
Preferably, the step of inputting the processed image data into a heterogeneous convolutional network for feature extraction to obtain a feature map includes:
constructing a heterogeneous convolutional network, wherein the heterogeneous convolutional network is formed by arranging and combining convolution kernels of sizes 3 × 3 and 1 × 1 in a heterogeneous-kernel pattern;
inputting the processed image data into the constructed heterogeneous convolution network, and extracting image features;
and carrying out convolution operation on the obtained image feature map through a convolution kernel of 1 × 1 to output the feature map with reduced dimensionality.
In one implementation manner, the step of inputting the feature map into a pre-constructed multi-scale feature extraction network to implement feature extraction of different scales and obtain a target feature map includes:
and inputting the feature map into a multi-scale feature extraction network, convolving the targets with different proportions in the image by adopting convolution kernels with three different sizes in the multi-scale feature extraction network, and generating corresponding target feature maps according to different sensitivity degrees of the convolution kernels with each size to the targets with different sizes.
Inputting the target feature map into a RIR network, and acquiring a plurality of region candidate frames, wherein the step comprises the following steps:
constructing an RIR network structure, wherein the RIR network structure is as follows: the two RPN layers form a network structure in a full connection mode;
inputting the target feature map into the RIR network, and generating a preset number n of region candidate frames with the first RPN layer according to the targets in the feature map;
and screening the generated n region candidate frames through the second layer RPN.
Unlike traditional approaches such as morphological processing for target detection, the heterogeneous-convolution-based multi-core iterative RPN target detection method provided by the embodiments of the invention uses 1 × 1 convolution kernels in place of 3 × 3 convolution kernels during feature extraction, reducing the computation and the network parameters while maintaining accuracy, greatly shortening the feature-extraction time and improving the detection speed. Following the Inception idea proposed by Google, a multi-scale feature extraction network is proposed in which convolution kernels of different sizes attend to targets of different sizes in the image, so the multi-scale extraction network improves the detection accuracy of the network. An iterative RPN-in-RPN (RIR) scheme is designed based on the existing RPN mechanism: on the basis of the region candidate frames generated by the first RPN layer, the second RPN layer screens the generated candidate frames more finely, which further improves the accuracy of the classification network and further enhances the detection precision for small targets. The method thus effectively addresses the poor detection quality and long running time of other methods as well as the insensitivity to targets of different sizes, slow running speed and poor small-target detection in target detection, and it has a wide application range and strong robustness.
Drawings
Fig. 1 is a schematic flowchart of a target detection method of multi-core iterative RPN based on heterogeneous convolution according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a heterogeneous convolutional network according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a multi-scale feature extraction network according to an embodiment of the present invention.
Fig. 4 is a schematic diagram comparing the detection effect of the different multi-scale branches according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a multi-core iterative RPN network architecture based on heterogeneous convolution according to an embodiment of the present invention.
Fig. 6 is a sample picture of a real-time example of a target detection network according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
Please refer to fig. 1-6. It should be noted that the drawings provided in the present embodiment are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
As shown in fig. 1, the present invention provides a target detection method for multi-core iterative RPN based on heterogeneous convolution, where the method includes:
s110, receiving image data to be detected;
s120, carrying out graying and local binarization data enhancement processing on the image data to obtain processed image data;
It can be understood that the received image data is converted to grayscale by traversing the pixels one by one and expressing each as a value of 0-255; the grayed image is then locally binarized, and transformations such as noise addition, rotation and flipping are applied through a data enhancement algorithm to enrich the original image data; finally, the processed images are resized to the input size required by the network.
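For illustration, a minimal preprocessing sketch (assuming a Python implementation with OpenCV and NumPy; the function name, output size and noise level below are illustrative assumptions and are not specified by the invention) may look as follows:

```python
import cv2
import numpy as np

def preprocess(image_bgr, out_size=(600, 600), noise_sigma=8.0):
    # Graying: map every pixel to a 0-255 intensity value.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # Local binarization: threshold each pixel against its neighborhood mean.
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY, 15, 5)

    # Data enhancement: noise addition, rotation and flipping enrich the data.
    noisy = np.clip(binary.astype(np.float32)
                    + np.random.normal(0, noise_sigma, binary.shape),
                    0, 255).astype(np.uint8)
    rotated = cv2.rotate(binary, cv2.ROTATE_90_CLOCKWISE)
    flipped = cv2.flip(binary, 1)  # horizontal flip

    # Resize every variant to the input size required by the network.
    return [cv2.resize(img, out_size) for img in (noisy, rotated, flipped)]
```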
S130, inputting the processed image data into a heterogeneous convolution network for feature extraction to obtain a feature map;
It should be noted that the heterogeneous convolutional network is constructed as follows: homogeneous convolution means that every layer of a convolution network, from the first layer to the last, uses convolution kernels of the same size; heterogeneous convolution instead combines convolution kernels of sizes 1 × 1 and 3 × 3 in a certain arrangement, where P is the number of kernels of different types in the convolution network and M is the depth of the network input. Fig. 2 shows a heterogeneous convolutional neural network with P = 4 and M = 16.
Further, the image data after graying and local binarization is fed into the constructed heterogeneous network; a single layer of 3 × 3 convolution kernels reduces the size of the image, and three subsequent layers of 1 × 1 convolution kernels learn only the image features, so the feature map is not shrunk further while the computational complexity is reduced. The output image matrix can be formally expressed as follows:
h_o = (h_i − h_k + 2p)/s + 1,  w_o = (w_i − w_k + 2p)/s + 1   (1)

where h_o, h_i and h_k are the heights of the convolved output image matrix, the input convolution-network image matrix and the convolution kernel, respectively, and w_o, w_i and w_k are the corresponding widths; p is the padding, which equals 0 in the heterogeneous convolution, and s is the stride with which the convolution kernel moves over the image, which is set to 2.
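As a sketch only, the heterogeneous stem described above could be written in PyTorch as follows (the channel widths and the input size are illustrative assumptions, not values fixed by the invention):

```python
import torch
import torch.nn as nn

class HeteroConvStem(nn.Module):
    """One 3x3 stride-2 convolution that shrinks the image, followed by three
    1x1 convolutions that learn features without changing the spatial size."""
    def __init__(self, in_ch=1, ch=64):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, ch, kernel_size=3, stride=2, padding=0)
        self.refine = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, x):
        # With p = 0 and s = 2, the 3x3 layer maps h_i to (h_i - 3)/2 + 1 as in
        # formula (1); the 1x1 layers keep the spatial size unchanged.
        return self.refine(self.reduce(x))

# For example, a 600 x 600 single-channel input yields a 299 x 299 feature map:
feat = HeteroConvStem()(torch.randn(1, 1, 600, 600))  # shape [1, 64, 299, 299]
```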
S140, inputting the feature map into a pre-constructed multi-scale feature extraction network to realize feature extraction of different scales and obtain a target feature map;
It can be understood that the multi-scale feature extraction network is constructed with convolution kernels of sizes 1 × 1, 3 × 3 and 5 × 5: feature extraction is first performed on the image by heterogeneous convolution, the feature map of the image is output, and its dimensionality is reduced by a 1 × 1 convolution kernel. The 3 × 3 convolution kernels are replaced by two layers of 3 × 1 and 1 × 3 kernels, and similarly the 5 × 5 convolution kernels are replaced by 5 × 1 and 1 × 5 kernels. The constructed multi-scale feature extraction network is shown in fig. 3.
Further, feature extraction is performed on the dimension-reduced feature map by the constructed multi-scale feature extraction network. As shown in fig. 4, (a) is the original input picture, (b) is the final result when all three convolution branch sizes of the multi-scale feature extraction network are retained, and (c), (d) and (e) are the final results when the 1 × 1, 3 × 3 and 5 × 5 convolution branches are removed, respectively. The comparison shows that the 1 × 1 convolution kernel is sensitive to large, medium and small targets, the 3 × 3 kernel is sensitive to large and medium targets, and the 5 × 5 kernel is sensitive only to large targets and cannot detect smaller ones. The multi-scale feature extraction network can therefore extract features from image targets at multiple scales and obtain more accurate information.
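A minimal sketch of such an Inception-style multi-scale block (assuming PyTorch; the branch channel width is an illustrative assumption) is:

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, in_ch, branch_ch=64):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, 1)        # 1x1 branch
        self.b3 = nn.Sequential(                        # 3x3 factorized as 3x1 + 1x3
            nn.Conv2d(in_ch, branch_ch, 1),
            nn.Conv2d(branch_ch, branch_ch, (3, 1), padding=(1, 0)),
            nn.Conv2d(branch_ch, branch_ch, (1, 3), padding=(0, 1)))
        self.b5 = nn.Sequential(                        # 5x5 factorized as 5x1 + 1x5
            nn.Conv2d(in_ch, branch_ch, 1),
            nn.Conv2d(branch_ch, branch_ch, (5, 1), padding=(2, 0)),
            nn.Conv2d(branch_ch, branch_ch, (1, 5), padding=(0, 2)))

    def forward(self, x):
        # Each branch is sensitive to a different object scale; concatenating
        # them yields the multi-scale target feature map.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)
```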
S150, inputting the target feature map into the RIR network to obtain a plurality of region candidate frames;
It should be noted that two RPN network layers are connected in sequence to form an RPN-in-RPN (RIR) network layer; its specific structure is shown in fig. 5.
Further, the feature maps of targets of different sizes extracted by the multi-scale network are fed into the constructed RIR network. The feature map is convolved by a 3 × 3 sliding window to obtain a feature map with 256 channels whose height H and width W are the same as those of the input feature map; it can approximately be regarded as H × W vectors, each 256-dimensional. Two fully connected operations are applied to each feature vector to obtain outputs of sizes 2 × H × W and 4 × H × W, representing respectively the foreground and background scores and the four coordinate values of the foreground. During the sliding-window convolution of the RPN, K region candidate frames of different sizes are generated at every pixel. Experiments show that the best results are achieved when the candidate frame sizes are set to 128 × 128, 256 × 256 and 512 × 512 with aspect ratios 1:1, 2:1 and 1:2, i.e., K = 9.
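The sliding-window stage of one RPN layer can be sketched as follows (assuming PyTorch; this follows the standard RPN head, producing for each of the K anchors two objectness scores and four box offsets at every position, which is an interpretation of the 2 × H × W and 4 × H × W outputs described above):

```python
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_ch, mid_ch=256, num_anchors=9):
        super().__init__()
        # 3x3 sliding-window convolution; padding 1 keeps the H x W size.
        self.slide = nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1)
        # Per-position predictions over the 256-d vectors: 2 scores and
        # 4 coordinates for each of the K anchors.
        self.cls = nn.Conv2d(mid_ch, 2 * num_anchors, kernel_size=1)
        self.reg = nn.Conv2d(mid_ch, 4 * num_anchors, kernel_size=1)

    def forward(self, feat):
        h = self.slide(feat).relu()
        return self.cls(h), self.reg(h)  # (2K, H, W) scores, (4K, H, W) offsets
```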
S160, obtaining a target score corresponding to each region candidate frame according to the non-maximum suppression function; and screening the plurality of region candidate frames according to a preset score threshold to obtain region proposal windows;
It should be noted that the region candidate frames and scores generated by the RIR network are screened by the non-maximum suppression function. The scores of all candidate region frames are first ranked from high to low and the frame with the highest score is selected. The intersection-over-union (IoU) between the highest-scoring frame and every other frame is then computed; any frame whose IoU with it exceeds the set threshold is suppressed and only the highest-scoring frame is kept, while frames whose IoU is below the threshold are all retained. This is repeated until all region candidate frames have been compared.
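A plain-NumPy sketch of this non-maximum suppression step (the 0.7 IoU threshold is an illustrative value, not one fixed by the invention):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """boxes: (N, 4) as [x1, y1, x2, y2]; returns indices of kept boxes."""
    order = scores.argsort()[::-1]          # rank candidate frames by score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)                      # keep the current highest-scoring frame
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= iou_thresh]  # drop frames that overlap too much
    return keep
```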
Specifically, region candidate frames that meet the set threshold are selected by the non-maximum suppression function. After the first RPN layer, candidate frames and scores that roughly match the actual positions of targets in the image are selected; the feature maps inside these first-layer candidate frames are then fed into the next RPN layer, which performs a more accurate detection on each candidate region and assigns a corresponding score. Regions that better match the label positions are thus found, which reduces the influence of irrelevant or weakly relevant regions on detection and classification. The best-performing candidates are retained by comparison with the first-generated region candidate frames. The loss function used by the RIR network formed by the two RPN layers can be formally expressed as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)   (2)

t_x = (x − x_box)/w_box,  t_y = (y − y_box)/h_box,  t_w = log(w/w_box),  t_h = log(h/h_box)   (3)

t_x* = (x* − x_box)/w_box,  t_y* = (y* − y_box)/h_box,  t_w* = log(w*/w_box),  t_h* = log(h*/h_box)   (4)

where x, y, w, h denote the center coordinates and the width and height of the region candidate frame detected in each RPN layer; x_box, y_box, w_box, h_box are the center coordinates and the width and height of the 9 region candidate frames (anchors) generated by the RPN; and x*, y*, w*, h* are the center coordinates and the width and height of the image label. N_reg and N_cls are, respectively, the normalization over the number of region candidate frames generated by the RPN network and the normalization over the dimension of the feature-map vector. The label p_i* = 1 when an anchor is marked as foreground and p_i* = 0 when it is marked as background, and p_i is the probability that the i-th region candidate frame contains an image target. λ is a balancing parameter; experiments show that the loss function gives the greatest positive-feedback effect when λ = 10.
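A sketch of the two-term loss in formulas (2)-(4) (assuming PyTorch; using cross-entropy and smooth-L1 for L_cls and L_reg and taking N_reg as the number of foreground anchors are common simplifications, not requirements of the invention):

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, box_deltas, labels, target_deltas, lam=10.0):
    """cls_logits: (N, 2); box_deltas, target_deltas: (N, 4) in the t-parameterization;
    labels: (N,) long tensor, 1 for foreground anchors and 0 for background."""
    n_cls = labels.numel()
    n_reg = max(int(labels.sum().item()), 1)
    # Classification term: objectness cross-entropy averaged over all anchors.
    cls_term = F.cross_entropy(cls_logits, labels, reduction="sum") / n_cls
    # Regression term: smooth-L1 on box offsets, only for foreground (p_i* = 1).
    fg = labels == 1
    reg_term = F.smooth_l1_loss(box_deltas[fg], target_deltas[fg],
                                reduction="sum") / n_reg
    return cls_term + lam * reg_term
```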
S170, classifying the region proposal windows according to the fully convolutional network layer and the normalized exponential function (softmax) classifier to obtain the classification result, namely the image category and a confidence score.
It can be understood that the region proposal windows obtained on the image are each fed into a normalized exponential function (softmax) classifier, which classifies the target regions according to the set object categories and the learned features. The detection network receives forward feedback through the Focal Loss function. Focal Loss is an improvement of the cross-entropy function: by reducing the weight of easily classified samples and of the far more numerous background samples, the model concentrates on samples that are hard to classify. The loss function can be formally expressed as:
L(p_i) = −β_i (1 − p_i)^γ log(p_i)   (5)
where β_i and γ are preset loss parameters; in the experiments, the model trained with γ = 2 and β_i = 0.25 performed best, and p_i is the probability that the i-th target is detected as a given class. The accuracy of the detection and classification network is continuously improved through the Focal Loss mechanism, and the final image category and confidence score are output. Fig. 5 is a schematic diagram of the multi-core iterative RPN structure based on heterogeneous convolution; the network finally obtains the category and confidence score of the objects in the image. Fig. 6 shows a sample result of an embodiment of the present invention.
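A minimal sketch of the focal loss in formula (5) (assuming PyTorch; p holds the predicted probability of the ground-truth class for each proposal):

```python
import torch

def focal_loss(p, gamma=2.0, beta=0.25):
    # (1 - p)^gamma down-weights easily classified samples so that training
    # concentrates on hard examples; beta further down-weights background.
    return (-beta * (1.0 - p).pow(gamma) * torch.log(p.clamp_min(1e-8))).mean()
```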
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas of the present invention shall be covered by the claims of the present invention.

Claims (5)

1. A target detection method of multi-core iteration RPN based on heterogeneous convolution is characterized by comprising the following steps:
receiving image data to be detected;
carrying out graying, local binarization and data enhancement processing on the image data to obtain processed image data;
inputting the processed image data into a heterogeneous convolution network for feature extraction to obtain a feature map;
inputting the feature map into a pre-constructed multi-scale feature extraction network to realize feature extraction of different scales and obtain a target feature map;
inputting the target feature map into an RIR network to obtain a plurality of region candidate frames;
obtaining a target score corresponding to each region candidate frame according to the non-maximum suppression function; and screening the plurality of region candidate frames according to a preset score threshold to obtain region proposal windows;
and classifying the region proposal windows according to the fully convolutional network layer and the normalized exponential function (softmax) classifier to obtain a classification result, namely the image category and a confidence score.
2. The target detection method of multi-core iterative RPN based on heterogeneous convolution according to claim 1, wherein the step of carrying out graying, local binarization and data enhancement processing on the image data to obtain processed image data includes:
carrying out graying processing on the received image data;
carrying out local binarization processing on the image subjected to the graying processing to obtain a binarized image;
and carrying out noise addition, rotation and flipping on the binarized image by adopting a data enhancement algorithm to obtain the processed image data.
3. The target detection method of multi-core iterative RPN based on heterogeneous convolution according to claim 1, wherein the step of inputting the processed image data into a heterogeneous convolution network for feature extraction to obtain a feature map includes:
constructing a heterogeneous convolutional network, wherein the heterogeneous convolutional network is formed by arranging and combining convolution kernels of sizes 3 × 3 and 1 × 1 in a heterogeneous-kernel pattern;
inputting the processed image data into the constructed heterogeneous convolution network, and extracting image features;
and carrying out convolution operation on the obtained image feature map through a convolution kernel with the size of 1 × 1 to output the feature map with reduced dimensionality.
4. The target detection method of multi-core iterative RPN based on heterogeneous convolution according to claim 1, wherein the step of transmitting the feature map into a pre-constructed multi-scale feature extraction network to realize feature extraction of different scales and obtain a target feature map includes:
and inputting the feature map into a multi-scale feature extraction network, convolving the targets with different proportions in the image by adopting convolution kernels with three different sizes in the multi-scale feature extraction network, and generating corresponding target feature maps according to different sensitivity degrees of the convolution kernels with each size to the targets with different sizes.
5. The target detection method of multi-core iterative RPN based on heterogeneous convolution according to claim 1, wherein the step of inputting the target feature map into an RIR network to obtain a plurality of region candidate boxes includes:
constructing an RIR network structure, wherein the RIR network structure is as follows: the two RPN layers form a network structure in a full connection mode;
inputting the target feature map into the RIR network, and generating a preset number n of region candidate frames with the first RPN layer according to the targets in the feature map;
and screening the generated n region candidate frames through the second layer RPN.
CN202010817648.3A 2020-04-29 2020-08-14 Heterogeneous convolution-based target detection method for multi-core iteration RPN Active CN111738237B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010357545.3A CN111563440A (en) 2020-04-29 2020-04-29 Target detection method of multi-core iteration RPN based on heterogeneous convolution
CN2020103575453 2020-04-29

Publications (2)

Publication Number Publication Date
CN111738237A true CN111738237A (en) 2020-10-02
CN111738237B CN111738237B (en) 2024-03-15

Family

ID=72071825

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010357545.3A Pending CN111563440A (en) 2020-04-29 2020-04-29 Target detection method of multi-core iteration RPN based on heterogeneous convolution
CN202010817648.3A Active CN111738237B (en) 2020-04-29 2020-08-14 Heterogeneous convolution-based target detection method for multi-core iteration RPN

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202010357545.3A Pending CN111563440A (en) 2020-04-29 2020-04-29 Target detection method of multi-core iteration RPN based on heterogeneous convolution

Country Status (1)

Country Link
CN (2) CN111563440A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949498A (en) * 2021-03-04 2021-06-11 北京联合大学 Target key point detection method based on heterogeneous convolutional neural network

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780254A (en) * 2021-11-12 2021-12-10 阿里巴巴达摩院(杭州)科技有限公司 Picture processing method and device, electronic equipment and computer storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109584248A (en) * 2018-11-20 2019-04-05 西安电子科技大学 Infrared surface object instance dividing method based on Fusion Features and dense connection network
US10282834B1 (en) * 2018-06-22 2019-05-07 Caterpillar Inc. Measurement platform that automatically determines wear of machine components based on images
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN110210463A (en) * 2019-07-03 2019-09-06 中国人民解放军海军航空大学 Radar target image detecting method based on Precise ROI-Faster R-CNN

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
US10282834B1 (en) * 2018-06-22 2019-05-07 Caterpillar Inc. Measurement platform that automatically determines wear of machine components based on images
CN109584248A (en) * 2018-11-20 2019-04-05 西安电子科技大学 Infrared surface object instance dividing method based on Fusion Features and dense connection network
CN110210463A (en) * 2019-07-03 2019-09-06 中国人民解放军海军航空大学 Radar target image detecting method based on Precise ROI-Faster R-CNN

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
杨娟; 曹浩宇; 汪荣贵; 薛丽霞; 胡敏: "Fine-grained vehicle type recognition with a region proposal network" (区域建议网络的细粒度车型识别), Journal of Image and Graphics (中国图象图形学报), no. 06
瑚敏君; 冯德俊; 李强: "Automatic building extraction based on an instance segmentation model" (基于实例分割模型的建筑物自动提取), Bulletin of Surveying and Mapping (测绘通报), no. 04

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949498A (en) * 2021-03-04 2021-06-11 北京联合大学 Target key point detection method based on heterogeneous convolutional neural network
CN112949498B (en) * 2021-03-04 2023-11-14 北京联合大学 Target key point detection method based on heterogeneous convolutional neural network

Also Published As

Publication number Publication date
CN111738237B (en) 2024-03-15
CN111563440A (en) 2020-08-21

Similar Documents

Publication Publication Date Title
CN108304873B (en) Target detection method and system based on high-resolution optical satellite remote sensing image
Khodabandeh et al. A robust learning approach to domain adaptive object detection
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
KR102516360B1 (en) A method and apparatus for detecting a target
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
Bascón et al. An optimization on pictogram identification for the road-sign recognition task using SVMs
US8989442B2 (en) Robust feature fusion for multi-view object tracking
Xu et al. Learning-based shadow recognition and removal from monochromatic natural images
US8433101B2 (en) System and method for waving detection based on object trajectory
Zhang et al. Long-range terrain perception using convolutional neural networks
Hong et al. Tracking using multilevel quantizations
Kang et al. Deep learning-based weather image recognition
CN110322445B (en) Semantic segmentation method based on maximum prediction and inter-label correlation loss function
CN110569782A (en) Target detection method based on deep learning
CN109242032B (en) Target detection method based on deep learning
CN112733614B (en) Pest image detection method with similar size enhanced identification
CN114037674B (en) Industrial defect image segmentation detection method and device based on semantic context
CN111738237B (en) Heterogeneous convolution-based target detection method for multi-core iteration RPN
Huo et al. Semisupervised learning based on a novel iterative optimization model for saliency detection
CN115661777A (en) Semantic-combined foggy road target detection algorithm
Zhang et al. Weighted smallest deformation similarity for NN-based template matching
Pham et al. Biseg: Simultaneous instance segmentation and semantic segmentation with fully convolutional networks
Lee et al. License plate detection via information maximization
Zhang et al. Spatial contextual superpixel model for natural roadside vegetation classification
Maldonado-Ramírez et al. Robotic visual tracking of relevant cues in underwater environments with poor visibility conditions

Legal Events

Code — Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant