CN111738237B - Heterogeneous convolution-based target detection method for multi-core iteration RPN - Google Patents

Heterogeneous convolution-based target detection method for multi-core iteration RPN

Info

Publication number
CN111738237B
CN111738237B (application CN202010817648.3A)
Authority
CN
China
Prior art keywords
network
convolution
feature map
image data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010817648.3A
Other languages
Chinese (zh)
Other versions
CN111738237A (en)
Inventor
刘晋
尚圣杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University filed Critical Shanghai Maritime University
Publication of CN111738237A publication Critical patent/CN111738237A/en
Application granted granted Critical
Publication of CN111738237B publication Critical patent/CN111738237B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method of multi-core iteration RPN based on heterogeneous convolution, which comprises the following steps: receiving image data to be detected; carrying out graying and local binarization data enhancement processing on the image data to obtain processed image data; inputting the processed image data into a heterogeneous convolution network for feature extraction to obtain a feature map; inputting the feature map into a multi-scale feature extraction network to extract features at different scales and obtain a target feature map; inputting the target feature map into an RIR target detection network to obtain a plurality of region candidate frames; obtaining a target score for each region candidate frame according to a non-maximum suppression function, and obtaining region suggestion windows according to a preset score threshold; and classifying the region suggestion windows with a full convolution network layer and a normalized exponential function classifier to obtain the classification result, image category, and confidence score. The embodiments of the invention effectively address problems such as low running speed and poor detection of small targets.

Description

Heterogeneous convolution-based target detection method for multi-core iteration RPN
Technical Field
The invention relates to the technical field of computer vision image processing, in particular to a target detection method of multi-core iteration RPN based on heterogeneous convolution.
Background
Target detection is one of the most challenging tasks in computer vision and is widely applied in fields such as unmanned driving and security systems. Detection in natural scenes, however, is affected by uncontrolled factors such as illumination, object orientation, and occlusion. With people's growing application demands, improving target detection performance in natural scenes has become an urgent need.
At present, target detection methods fall into two main categories: two-stage and one-stage network detection. A two-stage method divides target detection into two stages: (1) a region generation network (RPN) first generates candidate regions for the image, and (2) a deep learning network then classifies the generated candidate regions. A one-stage network comprises only one stage and directly uses a deep learning network to generate category probabilities and position information for the target. Consequently, two-stage networks offer high detection accuracy but low detection speed, while one-stage networks offer high detection speed but lower accuracy.
The traditional two-stage network detects common targets well and mainly comprises the following steps: (1) extracting features of the target image with a feature extraction network (such as a residual network (ResNet) or a convolutional neural network (CNN)); (2) performing preliminary detection on the target image with a region generation network (RPN), roughly distinguishing foreground from irrelevant background and generating candidate region frames for the target; and (3) classifying the image target with an image classification network according to the candidate region frames generated by the RPN, so as to output the final position and category of the target.
To address insensitivity to targets of different sizes, poor detection of small targets, and high time consumption in target detection, a multi-core iteration RPN target detection network based on heterogeneous convolution is designed. In feature extraction, the heterogeneous-kernel convolution replaces part of the 3 × 3 convolution kernels with 1 × 1 kernels, which reduces the amount of computation and the number of network parameters while maintaining accuracy, greatly shortens the computation time in feature extraction, and improves detection speed. Inspired by the Inception architecture proposed by Google, we propose a multi-scale feature extraction network in which convolution kernels of different sizes attend to target images of different sizes, so that the network's detection precision for the different sizes is improved. Based on the existing RPN mechanism, an iterative RPN-in-RPN (RIR) mode is designed: on the basis of the region candidate frames generated by the first-layer RPN, the second-layer RPN performs a finer screening of those frames, which further increases the accuracy of the classification network and strengthens the detection accuracy for small targets, thereby effectively overcoming the incomplete detection and excessive time consumption of other methods.
The region generation network (RPN) is a fully convolutional network that simultaneously detects targets and assigns target scores at each location of the feature map, generating high-quality region candidate boxes. This network is an auxiliary target detection network proposed by Ross B. Girshick et al. in the Faster R-CNN work (2016).
Inception, also known as GoogLeNet, is a CNN classification network model proposed by Google in 2014. Its convolution kernels of different sizes give the network adaptability to images of different scales, and because the network grows in width rather than depth, the number of parameters and the amount of computation are greatly reduced.
Disclosure of Invention
The invention aims to provide a target detection method of multi-core iteration RPN based on heterogeneous convolution that overcomes the defects of the prior art and effectively solves problems in target detection such as insensitivity to targets of different sizes, low running speed, and poor detection of small targets.
In order to achieve the above object, the present invention provides a target detection method of multi-core iteration RPN based on heterogeneous convolution, the method comprising:
receiving image data to be detected;
carrying out graying and local binarization data enhancement processing on the image data to obtain processed image data;
inputting the processed image data into a heterogeneous convolution network for feature extraction to obtain a feature map;
inputting the feature map into a pre-constructed multi-scale feature extraction network to realize feature extraction of different scales and obtain a target feature map;
inputting the target feature map into a RIR network to obtain a plurality of region candidate frames;
obtaining a target score corresponding to each region candidate frame according to the non-maximum suppression function; screening the multiple region candidate frames according to a preset score threshold value to obtain a region suggestion window;
and classifying the region suggestion window according to the full convolution network layer and the normalized exponential function classifier to obtain a classification result, an image category, and a confidence score.
Preferably, the step of performing graying and local binarization data enhancement processing on the image data to obtain processed image data includes:
graying processing is carried out on the received image data;
carrying out local binarization processing on the image subjected to the grey scale processing to obtain a binarized image;
and carrying out noise addition, rotation and overturn on the binarized image by adopting a data enhancement algorithm to obtain processed image data.
Preferably, the step of inputting the processed image data into a heterogeneous convolution network to perform feature extraction to obtain a feature map includes:
constructing a heterogeneous convolution network, wherein the heterogeneous convolution network is formed by arranging and combining convolution kernels of sizes 3 × 3 and 1 × 1 in a heterogeneous-kernel manner;
inputting the processed image data into the constructed heterogeneous convolution network and extracting image features;
and performing a convolution operation on the obtained image feature map with a 1 × 1 convolution kernel to output a dimension-reduced feature map.
In one implementation manner, the step of inputting the feature map into a pre-built multi-scale feature extraction network to achieve feature extraction of different scales and obtain a target feature map includes:
inputting the feature map into the multi-scale feature extraction network, convolving targets of different proportions in the image with convolution kernels of three different sizes, and generating corresponding target feature maps according to the different sensitivities of each kernel size to targets of different sizes.
Inputting the target feature map into a RIR network, and obtaining a plurality of region candidate frames, wherein the method comprises the following steps of:
building a RIR network structure, wherein the RIR network structure is as follows: the two RPN layers form a network structure in a full connection mode;
inputting the target feature map into a RIR network, and generating n set region candidate frames by the first layer RPN according to the targets in the feature map;
and screening the generated n region candidate frames through the second layer RPN.
The target detection method of multi-core iteration RPN based on heterogeneous convolution differs from traditional target detection methods such as morphological processing. In feature extraction, the heterogeneous-kernel convolution replaces part of the 3 × 3 convolution kernels with 1 × 1 kernels, reducing the amount of computation and the number of network parameters while maintaining accuracy, which greatly shortens the computation time in feature extraction and improves detection speed. Inspired by the Inception architecture proposed by Google, a multi-scale feature extraction network is proposed in which convolution kernels of different sizes attend to target images of different sizes within an image, so that the multi-scale extraction network improves the detection precision of the network. Based on the existing RPN mechanism, an iterative RPN-in-RPN (RIR) mode is designed: on the basis of the region candidate frames generated by the first-layer RPN, the second-layer RPN performs a finer screening of those frames, which further increases the accuracy of the classification network and strengthens detection of small targets. The method thereby effectively overcomes the poor detection effect and long time consumption of other methods, solves problems in target detection such as insensitivity to targets of different sizes, low running speed, and poor detection of small targets, and has a wide application range and strong robustness.
Drawings
Fig. 1 is a schematic flow chart of a target detection method of multi-core iteration RPN based on heterogeneous convolution according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a heterogeneous convolutional network according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a multi-scale feature extraction network according to an embodiment of the invention.
FIG. 4 is a schematic diagram showing the comparison of detection effects of different branches at multiple scales according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a multi-core iterative RPN network architecture based on heterogeneous convolution according to an embodiment of the present invention.
Fig. 6 is a sample picture of a real-time example of an object detection network according to an embodiment of the present invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied through other, different embodiments, and the details of this description may be modified or varied in various respects without departing from the spirit and scope of the present invention.
Please refer to fig. 1-6. It should be noted that the illustrations provided in this embodiment merely illustrate the basic concept of the invention by way of example; the drawings show only the components related to the invention rather than the number, shape, and size of components in an actual implementation. In practice, the form, number, and proportion of the components may be changed arbitrarily, and the component layout may be more complex.
The invention provides a target detection method of multi-core iteration RPN based on heterogeneous convolution as shown in figure 1, which comprises the following steps:
s110, receiving image data to be detected;
s120, carrying out grey-scale and local binary data enhancement processing on the image data to obtain processed image data;
It can be understood that the received image data is subjected to graying: the pixels of the image are traversed one by one and represented by values from 0 to 255. Local binarization is then applied to the gray-scale image, and the image is transformed by a data enhancement algorithm through noise addition, rotation, flipping, and similar operations to enrich the original image data. Finally, the processed image is resized to the input size required by the network.
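As an illustration of this preprocessing step, the following is a minimal Python sketch assuming OpenCV and NumPy; the adaptive-threshold block size and offset, the noise level, and the 600 × 600 network input size are illustrative assumptions rather than values fixed by the invention.

```python
import cv2
import numpy as np

def preprocess(img_bgr, out_size=(600, 600)):
    """Graying, local (adaptive) binarization and simple data enhancement.

    A sketch of the preprocessing described above; threshold parameters,
    noise level and output size are illustrative assumptions.
    """
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)              # 0-255 gray values
    binary = cv2.adaptiveThreshold(gray, 255,
                                   cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY, 15, 5)       # local binarization

    augmented = []
    noisy = np.clip(binary + np.random.normal(0, 10, binary.shape),
                    0, 255).astype(np.uint8)
    augmented.append(noisy)                                         # noise addition
    augmented.append(cv2.rotate(binary, cv2.ROTATE_90_CLOCKWISE))   # rotation
    augmented.append(cv2.flip(binary, 1))                           # horizontal flip

    # resize every processed image to the network input size
    return [cv2.resize(im, out_size) for im in [binary] + augmented]
```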
S130, inputting the processed image data into a heterogeneous convolution network for feature extraction to obtain a feature map;
it should be noted that, the method for constructing heterogeneous convolutional network comprises the following steps: the isomorphic convolution refers to convolution kernels with the same size from the first layer to the bottom layer of the convolution network, unlike the first layer of the convolution network, the convolution kernels with the different sizes of the isomorphic convolution are firstly combined according to a certain arrangement sequence, wherein P is set as the number of kernels with different types in the convolution network, and M is the depth of the input of the set network. Fig. 2 shows a heterogeneous convolutional neural network of p=4 and m=16.
Further, the image data after graying and local binarization are fed into the constructed heterogeneous network. After the image is downsized by one layer of 3 × 3 convolution kernels, the image features are learned through three layers of 1 × 1 convolution kernels, which do not reduce the size of the feature map and at the same time lower the computational complexity. The output image matrix is formally expressed as follows:

h_o = (h_i − h_k + 2p) / s + 1
w_o = (w_i − w_k + 2p) / s + 1

wherein h_o, h_i, and h_k are, respectively, the height of the output image matrix after convolution, the height of the image matrix input to the convolution network, and the height of the convolution kernel; w_o, w_i, and w_k are, respectively, the width of the output image after convolution, the width of the input image matrix, and the width of the convolution kernel; p is the padding applied to the target image before convolution; and s is the stride with which the convolution kernel is moved, set here to s = 2.
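The heterogeneous-kernel idea can be sketched in PyTorch as below. This is a simplified channel-group approximation, assuming each output combines a grouped 3 × 3 path (each filter sees 1/P of the input channels) with a cheap 1 × 1 path over all channels; the class and parameter names are illustrative, and the input/output channel counts are assumed divisible by P. It is not the exact filter-level arrangement of the patent.

```python
import torch
import torch.nn as nn

class HetConv(nn.Module):
    """Heterogeneous convolution sketch: a grouped 3x3 path plus a 1x1 path.

    in_ch and out_ch are assumed divisible by p (the number of kernel groups).
    """
    def __init__(self, in_ch, out_ch, p=4):
        super().__init__()
        # 3x3 kernels applied to 1/P of the input channels per filter (grouped conv)
        self.conv3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1,
                               groups=p, bias=False)
        # 1x1 kernels over all channels, complementing the 3x3 path cheaply
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x)

# usage: x = torch.randn(1, 16, 224, 224); HetConv(16, 64, p=4)(x).shape == (1, 64, 224, 224)
```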
S140, inputting the feature map into a pre-constructed multi-scale feature extraction network to realize feature extraction of different scales, and obtaining a target feature map;
it will be appreciated that the method of constructing a multi-scale feature extraction network using convolution kernels of 1 x 1,3 x 3,5 x 5 dimensions: the feature extraction is carried out on the image through the deconvolution, the feature image of the image is output, and the dimension is reduced through the convolution check feature image with the size of 1 multiplied by 1. And a 3×3 convolution kernel is replaced by a 3×1 and 1×3 two-layer convolution kernel, and a 5×5 convolution kernel is replaced by a 5×1 and 1×5 convolution kernel, so that the multi-scale feature extraction network is constructed as shown in fig. 3.
Further, feature extraction is performed on the dimension-reduced feature map by the constructed multi-scale feature extraction network. As shown in fig. 4, (a) is the original input image, and (b), (c), (d), and (e) are, respectively, the result with all three convolution branches, the result with the 1 × 1 convolution branch removed, the result with the 3 × 3 convolution branch removed, and the result with only the 5 × 5 convolution branch retained. The comparison shows that the 1 × 1 convolution kernel is sensitive to large, medium, and small targets, the 3 × 3 kernel to large and medium targets, and the 5 × 5 kernel only to large targets, failing to detect the smaller ones. The multi-scale feature extraction network can therefore extract features at multiple scales for different image targets and obtain more accurate information.
S150, inputting the target feature map into a RIR network to obtain a plurality of region candidate frames;
It should be noted that two RPN network layers are connected in sequence to form the RPN-in-RPN (RIR) network layer; its specific structure is shown in fig. 5.
Further, the feature maps of targets of different sizes extracted by the multi-scale network are fed into the constructed RIR network. The feature map is convolved with a 3 × 3 sliding window to obtain a 256-channel feature map whose height H and width W equal those of the input feature map; this map can be regarded as H × W vectors of 256 dimensions each. Two fully connected operations are performed on each feature vector to obtain feature maps of size 2 × H × W and 4 × H × W, representing the foreground/background scores and the four coordinate values of the foreground box, respectively. During the RPN sliding-window convolution, K region candidate frames of different sizes are generated at each pixel location. Experiments show that the best results are obtained when the candidate box sizes are set to 128 × 128, 256 × 256, and 512 × 512 with aspect ratios 1:1, 2:1, and 1:2, i.e. K = 9.
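The RPN head described above can be sketched as follows. Here the 2 × H × W and 4 × H × W outputs are produced per anchor, so the head emits 2K and 4K channels in total — an assumption consistent with the standard RPN formulation — and the layer and class names are illustrative.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """3x3 'sliding window' convolution to 256 channels, then 1x1 convolutions
    producing 2*K foreground/background scores and 4*K box offsets per location,
    with K = 9 anchors (3 sizes x 3 aspect ratios)."""
    def __init__(self, in_ch, k=9):
        super().__init__()
        self.sliding = nn.Conv2d(in_ch, 256, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(256, 2 * k, kernel_size=1)   # foreground/background scores
        self.reg = nn.Conv2d(256, 4 * k, kernel_size=1)   # box coordinate offsets

    def forward(self, feat):
        h = torch.relu(self.sliding(feat))
        return self.cls(h), self.reg(h)
```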
S160, obtaining a target score corresponding to each region candidate frame according to the non-maximum suppression function; screening the multiple region candidate frames according to a preset score threshold value to obtain a region suggestion window;
the region candidate boxes and scores generated by the RIR network are screened through a non-maximal inhibition function. The scores of all candidate region boxes are first ranked from high to low, and the box with the highest score is selected. And simultaneously calculating the overlapping area (IOU) of the highest-score frame and other frames, and if the IOU is larger than a set threshold value, only reserving the highest-score frame. If the IOU is smaller than the set threshold value, all the regions are reserved until all the region candidate frames are compared.
Specifically, the region candidate frames meeting the set threshold are selected through the non-maximum suppression function. After the first-layer RPN, the region candidate frames that fit the actual position of the target in the image, together with their scores, are selected, and the feature maps inside the candidate frames generated by the first layer are passed to the next RPN layer. The second-layer RPN performs more accurate target detection on each candidate frame region and gives the corresponding score, finding the regions that better match the label position and reducing the influence of uncorrelated or weakly correlated image content on detection and classification. The best candidates are retained by comparison with the region candidate frames generated first. The loss function used by the RIR network formed by the two RPN layers can be formally expressed as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)
t_x = (x − x_box)/w_box, t_y = (y − y_box)/h_box, t_w = log(w/w_box), t_h = log(h/h_box)
t_x* = (x* − x_box)/w_box, t_y* = (y* − y_box)/h_box, t_w* = log(w*/w_box), t_h* = log(h*/h_box)

wherein x, y, w, and h denote the center coordinates and the width and height of the region candidate frame detected in each RPN layer; x_box, y_box, w_box, and h_box are the center coordinates and the width and height of the 9 region candidate frames generated by the RPN; x*, y*, w*, and h* are the center coordinates and the width and height of the image label; N_reg and N_cls are the normalization terms given by the number of region candidate frames generated by the RPN network and by the dimension of the feature-map vector, respectively; p_i* = 1 when the label marks the foreground and p_i* = 0 when it marks the background; p_i is the probability that the i-th region candidate box is an image target; and λ is a tuning parameter, with experiments showing that the loss function provides the greatest positive feedback when λ = 10.
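A PyTorch sketch of a two-term RPN loss of this form follows, assuming cross-entropy for classification and smooth L1 for box regression; the normalization choices, function name, and argument names are illustrative assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, reg_pred, labels, reg_targets, lam=10.0):
    """Classification + box-regression loss for one RPN layer.

    cls_logits : (N, 2) foreground/background logits per anchor
    reg_pred   : (N, 4) predicted offsets (t_x, t_y, t_w, t_h)
    labels     : (N,)  1 = foreground, 0 = background
    reg_targets: (N, 4) ground-truth offsets computed from the image label
    """
    n_cls = labels.numel()                       # normalize classification term
    n_reg = max(int((labels == 1).sum()), 1)     # normalize regression term

    cls_loss = F.cross_entropy(cls_logits, labels.long(), reduction="sum") / n_cls
    fg = labels == 1                             # regression applies to foreground anchors only
    reg_loss = F.smooth_l1_loss(reg_pred[fg], reg_targets[fg], reduction="sum") / n_reg
    return cls_loss + lam * reg_loss
```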
S170, classifying the region suggestion window according to the full convolution network layer and the normalized exponential function classifier to obtain a classification result, an image category, and a confidence score.
It can be appreciated that the region suggestion windows on the acquired image are each passed to the normalized exponential function (softmax), which classifies the target region according to the set object classes and the learned features. Forward feedback is applied to the detection network through a Focal Loss function. Focal Loss is an improvement on the cross-entropy function: by down-weighting easily classified samples and the abundant background, it makes the model focus more on samples that are difficult to classify. The loss function can be formally expressed as:
L(p_i) = −β_i (1 − p_i)^γ log(p_i) (5)
wherein β_i and γ are the set loss parameters; in the experiments, the trained model performs best when γ = 2 and β_i = 0.25. p_i is the probability that the i-th object is detected as a certain class. The accuracy of the detection-classification network is continuously improved through the Focal Loss mechanism, and the final image classification category and confidence score are output. Fig. 5 is a schematic diagram of the multi-core iterative RPN structure based on heterogeneous convolution; the network finally obtains the classification and confidence score of the target in the image. Fig. 6 shows a sample of an embodiment of the present invention.
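The Focal Loss of formula (5) can be sketched in PyTorch as below; β = 0.25 and γ = 2 follow the values reported above, while the mean reduction over samples and the clamp for numerical stability are illustrative assumptions.

```python
import torch

def focal_loss(probs, beta=0.25, gamma=2.0):
    """Focal loss of formula (5): L(p_i) = -beta * (1 - p_i)^gamma * log(p_i).

    probs: (N,) predicted probabilities of the true class for each sample,
    e.g. taken from the softmax output of the classifier.
    """
    probs = probs.clamp(min=1e-7)                          # avoid log(0)
    return (-beta * (1.0 - probs) ** gamma * torch.log(probs)).mean()
```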
The above embodiments merely illustrate the principles of the present invention and its effects and are not intended to limit the invention. Modifications and variations may be made to the above embodiments by those skilled in the art without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations completed by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (2)

1. A target detection method of multi-core iteration RPN based on heterogeneous convolution, the method comprising:
receiving image data to be detected;
carrying out graying and local binarization data enhancement processing on the image data to obtain processed image data;
inputting the processed image data into a heterogeneous convolution network for feature extraction to obtain a feature map;
inputting the feature map into a pre-constructed multi-scale feature extraction network to realize feature extraction of different scales and obtain a target feature map;
inputting the target feature map into a RIR network to obtain a plurality of region candidate frames;
obtaining a target score corresponding to each region candidate frame according to the non-maximum suppression function; screening the multiple region candidate frames according to a preset score threshold value to obtain a region suggestion window;
classifying the region suggestion window according to the full convolution network layer and the normalized exponential function classifier to obtain a classification result, an image category, and a confidence score;
the step of inputting the processed image data into a heterogeneous convolution network for feature extraction to obtain a feature map comprises the following steps:
constructing a heterogeneous convolution network, wherein the heterogeneous convolution network is formed by arranging and combining convolution kernels of sizes 3 × 3 and 1 × 1 in a heterogeneous-kernel manner;
inputting the processed image data into the constructed heterogeneous convolution network, and extracting image features;
performing a convolution operation on the obtained image feature map with a 1 × 1 convolution kernel to output a dimension-reduced feature map;
transmitting the feature map to a pre-constructed multi-scale feature extraction network to realize feature extraction of different scales, and obtaining a target feature map, wherein the method comprises the following steps of:
inputting the feature map into the multi-scale feature extraction network, convolving targets of different proportions in the image with convolution kernels of three different sizes, and generating corresponding target feature maps according to the different sensitivities of each kernel size to targets of different sizes;
inputting the target feature map into a RIR network, and obtaining a plurality of region candidate frames, wherein the method comprises the following steps of:
building a RIR network structure, wherein the RIR network structure is as follows: the two RPN layers form a network structure in a full connection mode;
inputting the target feature map into a RIR network, and generating n set region candidate frames by the first layer RPN according to the targets in the feature map;
and screening the generated n region candidate frames through the second layer RPN.
2. The method for target detection of multi-core iterative RPN based on heterogeneous convolution according to claim 1, wherein the step of performing graying and local binarization data enhancement processing on the image data to obtain processed image data comprises:
graying processing is carried out on the received image data;
carrying out local binarization processing on the image subjected to the grey scale processing to obtain a binarized image;
and carrying out noise addition, rotation and overturn on the binarized image by adopting a data enhancement algorithm to obtain processed image data.
CN202010817648.3A 2020-04-29 2020-08-14 Heterogeneous convolution-based target detection method for multi-core iteration RPN Active CN111738237B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010357545.3A CN111563440A (en) 2020-04-29 2020-04-29 Target detection method of multi-core iteration RPN based on heterogeneous convolution
CN2020103575453 2020-04-29

Publications (2)

Publication Number Publication Date
CN111738237A CN111738237A (en) 2020-10-02
CN111738237B true CN111738237B (en) 2024-03-15

Family

ID=72071825

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010357545.3A Pending CN111563440A (en) 2020-04-29 2020-04-29 Target detection method of multi-core iteration RPN based on heterogeneous convolution
CN202010817648.3A Active CN111738237B (en) 2020-04-29 2020-08-14 Heterogeneous convolution-based target detection method for multi-core iteration RPN

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202010357545.3A Pending CN111563440A (en) 2020-04-29 2020-04-29 Target detection method of multi-core iteration RPN based on heterogeneous convolution

Country Status (1)

Country Link
CN (2) CN111563440A (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949498B (en) * 2021-03-04 2023-11-14 北京联合大学 Target key point detection method based on heterogeneous convolutional neural network
CN113780254A (en) * 2021-11-12 2021-12-10 阿里巴巴达摩院(杭州)科技有限公司 Picture processing method and device, electronic equipment and computer storage medium


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
US10282834B1 (en) * 2018-06-22 2019-05-07 Caterpillar Inc. Measurement platform that automatically determines wear of machine components based on images
CN109584248A (en) * 2018-11-20 2019-04-05 西安电子科技大学 Infrared surface object instance dividing method based on Fusion Features and dense connection network
CN110210463A (en) * 2019-07-03 2019-09-06 中国人民解放军海军航空大学 Radar target image detecting method based on Precise ROI-Faster R-CNN

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fine-grained vehicle type recognition with a region proposal network; 杨娟; 曹浩宇; 汪荣贵; 薛丽霞; 胡敏; Journal of Image and Graphics (06); full text *
Automatic building extraction based on an instance segmentation model; 瑚敏君; 冯德俊; 李强; Bulletin of Surveying and Mapping (04); full text *

Also Published As

Publication number Publication date
CN111738237A (en) 2020-10-02
CN111563440A (en) 2020-08-21

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN108304873B (en) Target detection method and system based on high-resolution optical satellite remote sensing image
Khodabandeh et al. A robust learning approach to domain adaptive object detection
US10691952B2 (en) Adapting to appearance variations when tracking a target object in video sequence
KR102516360B1 (en) A method and apparatus for detecting a target
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
Anagnostopoulos et al. A license plate-recognition algorithm for intelligent transportation system applications
Liu et al. Convolutional neural networks-based intelligent recognition of Chinese license plates
CN111753828B (en) Natural scene horizontal character detection method based on deep convolutional neural network
CN111027493A (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN110322445B (en) Semantic segmentation method based on maximum prediction and inter-label correlation loss function
CN111191583A (en) Space target identification system and method based on convolutional neural network
Kang et al. Deep learning-based weather image recognition
CN110766020A (en) System and method for detecting and identifying multi-language natural scene text
CN109242032B (en) Target detection method based on deep learning
CN111126127B (en) High-resolution remote sensing image classification method guided by multi-level spatial context characteristics
CN110569782A (en) Target detection method based on deep learning
Wang et al. Night‐Time Vehicle Sensing in Far Infrared Image with Deep Learning
CN106372624B (en) Face recognition method and system
CN112733614B (en) Pest image detection method with similar size enhanced identification
CN109165658B (en) Strong negative sample underwater target detection method based on fast-RCNN
CN109801305B (en) SAR image change detection method based on deep capsule network
CN111738237B (en) Heterogeneous convolution-based target detection method for multi-core iteration RPN
CN114882423A (en) Truck warehousing goods identification method based on improved Yolov5m model and Deepsort
CN116934762B (en) System and method for detecting surface defects of lithium battery pole piece

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant