WO2020113412A1

WO2020113412A1 - Target detection method and system

Info

Publication number: WO2020113412A1
Application number: PCT/CN2018/119132
Authority: WO
Inventors: 王娜; 许康
Original assignee: 深圳大学
Priority date: 2018-12-04
Filing date: 2018-12-04
Publication date: 2020-06-11

Abstract

The present invention discloses a target detection method in the field of computer vision. The method comprises: extracting image features of an input image by using a convolutional neural network (CNN) to obtain a plurality of feature maps of the image with different spatial resolutions; adjusting the spatial resolutions of the plurality of feature maps to be consistent and stitching the plurality of feature maps into an overall feature map; and processing the overall feature map to obtain detection information on a predicted target in the image, the detection information including target position information and target class information. The CNN structure reduces the amount of calculation and improves the processing speed of target detection. Meanwhile, the overall feature map is a fusion of the plurality of feature maps with different spatial resolutions, so it contains more image information, and the precision of detecting the target according to the overall feature map is high.

Description

Target detection method and system

Technical field

The invention relates to the field of computer vision, in particular to a method and system for target detection.

Background technique

As the world enters the information age, it faces an information explosion and a persistent problem of information surplus. In 2001 alone, the global data volume reached 1.8 ZB, equivalent to more than 200 GB of data for every person in the world, and this growth is still accelerating. Experts predict that data will maintain a 50% annual growth rate over the next few years. Users of major e-commerce, video playback, and other platforms generate massive amounts of data every day, most of which is picture data. Picture data can be divided into single-target picture data and multi-target picture data: single-target picture data can generally be processed by image classification and recognition, while multi-target picture data is generally processed using multi-target detection methods. With the development of computer technology and the wide application of computer vision in daily life, target detection technology has wide application value in many fields, such as intelligent transportation systems, intelligent monitoring systems, automatic driving, military target detection, and surgical navigation.

Existing target detection algorithms directly predict the location and type of targets, and establish real-time monitoring networks. The disadvantages of such algorithms are: lower model accuracy, lower recall, and poorer detection of small targets. Existing target detection algorithms cannot simultaneously achieve high accuracy and real-time detection.

Technical problem

The main purpose of the present invention is to provide a target detection method and system for solving the technical problem that the existing target detection technology cannot simultaneously achieve high precision and real-time detection.

Technical solution

To achieve the above objective, a first aspect of the present invention provides a method for target detection, the method including:

A convolutional neural network is used to extract image features of the input image to obtain multiple feature maps of the image with different spatial resolutions;

Adjusting the spatial resolution of the multiple feature maps uniformly, and stitching the multiple feature maps into an overall feature map;

Processing the overall feature map to obtain the detection information of the predicted target in the image, where the detection information includes target position information and target category information.

A second aspect of the present invention provides a system for target detection. The system includes: an image feature extraction module, a feature processing module, and an acquisition result module;

The image feature extraction module is used to extract image features of the input image by using a convolutional neural network to obtain a plurality of feature maps with different spatial resolutions of the image;

The feature processing module is used to adjust the spatial resolution of the multiple feature maps uniformly, and splice the multiple feature maps into an overall feature map;

The acquisition result module is configured to process the overall feature map to obtain detection information of a predicted target in the image, where the detection information includes target position information and target category information.

Beneficial effect

It can be seen from the above technical solution that, in the first aspect, the technical solution uses a convolutional neural network to extract image features of the input image. The structure adopted by the convolutional neural network reduces the amount of calculation without affecting the extraction of image features, which improves the processing speed of target detection and enables real-time detection of target positions and categories. In the second aspect, the feature map obtained by the technical solution is a fusion of multiple feature maps with different spatial resolutions. The obtained feature map contains more image information, so the accuracy of detecting the target according to the feature map is high; that is, detection precision is guaranteed. Therefore, the technical solution simultaneously achieves high-precision and real-time detection of the target position and category.

Brief description of the drawings

FIG. 1 is a schematic flowchart of a target detection method provided by an embodiment of the present invention;

FIG. 2 is a schematic flowchart of the detailed steps of step 101;

FIG. 3 is a schematic flowchart of the detailed steps of step 102;

FIG. 4 is a schematic flowchart of the detailed steps of step 103;

FIG. 5 is a schematic flowchart of a target detection method according to another embodiment of the present invention;

FIG. 6 is a schematic flowchart of the detailed steps of step 104;

FIG. 7 is a schematic diagram showing multiple pixels on the coordinate axes;

FIG. 8 is a schematic structural diagram of a target detection system according to another embodiment of the present invention;

FIG. 9 is the overall network framework of the target detection system;

FIG. 10 is a schematic structural diagram of a refinement module of the image feature extraction module;

FIG. 11 is a schematic diagram of the downsampling module;

FIG. 12 is a schematic diagram of the convolution feature extraction A module;

FIG. 13 is a schematic diagram of the convolution feature extraction B module;

FIG. 14 is a schematic diagram of the convolution feature extraction C module;

FIG. 15 is a schematic structural diagram of a detailed module of the feature processing module;

FIG. 16 is a schematic diagram of multi-scale feature map fusion in an embodiment of the present invention;

FIG. 17 is a schematic structural diagram of the average pooling module;

FIG. 18 is a schematic structural diagram of the deconvolution module;

FIG. 19 is a schematic structural diagram of a detailed module of the acquisition result module;

FIG. 20 is a schematic structural diagram of the Yolo_predict prediction module;

FIG. 21 is a schematic structural diagram of the residual prediction module;

FIG. 22 is a schematic structural diagram of a target detection system according to another embodiment of the present invention;

FIG. 23 is a schematic structural diagram of a detailed module of the adjustment module.

Embodiments of the invention

In order to make the purpose, features, and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be described clearly and completely in conjunction with the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present invention.

Because the prior art cannot simultaneously achieve high precision and real-time detection, the present invention proposes a target detection method to solve this technical problem.

Please refer to FIG. 1, which is a schematic flowchart of a target detection method according to an embodiment of the present invention. The target detection method includes:

Step 101: A convolutional neural network is used to extract image features of an input image to obtain a plurality of feature maps with different spatial resolutions of the image.

Before the image features are extracted, the image must be input into the system. Multiple convolutional neural networks can then be used to obtain multiple feature maps of the image with different spatial resolutions. Each feature map carries image information, and feature maps with different spatial resolutions carry image information at different levels of the image.

Step 102: Adjust the spatial resolution of multiple feature maps to be consistent, and stitch the multiple feature maps into an overall feature map.

Because the spatial resolutions of the multiple feature maps are inconsistent, the feature maps cannot be stitched directly. After the spatial resolutions of the multiple feature maps are adjusted to be consistent, the feature maps can be stitched into an overall feature map.

It should be noted that a feature map can have multiple feature channels, and the three feature maps with different spatial resolutions have different numbers of image channels. The three feature maps are stitched together along the feature map channels to form an overall feature map, and subsequent processing of the overall feature map yields the location and category information of the predicted target in the image.

Step 103: Process the overall feature map to obtain detection information of the predicted target in the image, where the detection information includes target position information and target category information.

Here, the predicted target is the target in the image to which the detection information refers. Processing the feature map means performing multi-channel feature fusion on the overall feature map and using convolution to extract effective features and filter out impurities, so as to obtain the detection information of the predicted target. The detection information includes target location information and target category information.

As shown in FIG. 2, FIG. 2 is a schematic flowchart of the refinement step of step 101. The specific steps of using a convolutional neural network to extract image features of an input image, and obtaining multiple feature maps with different spatial resolutions of the image include:

Step 201: Acquire the first feature map of the image;

Step 202: Acquire a second feature map of the image, the spatial resolution of the second feature map is smaller than the spatial resolution of the first feature map;

Step 203: Acquire a third feature map of the image, the spatial resolution of the third feature map is smaller than the spatial resolution of the second feature map.

In each feature map acquisition process, the right branch uses multiple max-pooling (maxpool) structures of different sizes to extract image features of different sizes, while the left branch uses a 3*3 convolution to extract image features and then a 1*1 convolution to filter features and further exchange information between image channels. Finally, the image features extracted by the left branch and the image features extracted by the right branch are concatenated. On the one hand, more features can be extracted; on the other hand, the number of image channels of the feature map is increased while the number of parameters is reduced, which increases the processing speed.

The concatenate structure increases the number of image channels, and the maxpool structure effectively reduces the number of parameters. In the network structure used, a concatenate structure is also placed between the 1*1 layers of two different network sub-modules (blocks). Each feature map passes through different network layers during extraction, so the extracted feature maps contain different image information and have different spatial resolutions.

In the present invention, the three feature maps are acquired in sequence. As the number of network layers increases, the spatial resolution of the acquired image feature maps decreases.

It should be noted that, in the present invention, the concatenate structure is used after the 3*3 convolution to increase the number of network channels and retain more image features. The fusion of low-order features and high-order features yields richer semantic features.

At the same time, fusing multiple max-pooling structures of different scales with multiple feature maps of different spatial resolutions yields rich image semantic information, which benefits the prediction of targets of different sizes and improves detection accuracy.

As shown in FIG. 3, FIG. 3 is a schematic flowchart of the refinement step of step 102. The specific steps of adjusting the spatial resolution of a plurality of the feature maps uniformly and stitching the multiple feature maps into an overall feature map include:

Step 301: Adopt the method of average pooling to reduce the spatial resolution of the first feature map, so that the spatial resolution of the first feature map is consistent with the spatial resolution of the second feature map.

Because the spatial resolution of the first feature map is greater than that of the second feature map, the spatial resolution of the first feature map is reduced until it is consistent with that of the second feature map. In the present invention, the spatial resolution is reduced by average pooling.

Step 302: Use deconvolution to increase the spatial resolution of the third feature map, so that the spatial resolution of the third feature map is consistent with the spatial resolution of the second feature map.

Because the spatial resolution of the third feature map is smaller than that of the second feature map, the spatial resolution of the third feature map is increased until it is consistent with that of the second feature map. In the present invention, the spatial resolution is increased by deconvolution.

Step 303: Perform stitching processing on the first feature map, the second feature map, and the third feature map to fuse into an overall feature map.

The first feature map, the second feature map, and the third feature map have different numbers of image channels. The three feature maps are concatenated along the feature map channel to form an overall feature map. The overall feature map contains both high-order features and low-order features, which effectively improves the prediction results.

In the embodiment of the present invention, the input image size is a*a*3 (the value of a is a preset value), and the sizes of the output first feature map, second feature map, and third feature map are (a/8)*(a/8)*256, (a/16)*(a/16)*512, and (a/32)*(a/32)*1024, respectively.
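The shape arithmetic of steps 301 to 303 can be illustrated with a minimal NumPy sketch. It assumes a = 416 and channels-last arrays, and it substitutes nearest-neighbor pixel repetition for the learned deconvolution of step 302, because the deconvolution weights are not given in the text; the function names are hypothetical.

```python
import numpy as np

def avg_pool_2x2(fmap):
    """Halve height and width by averaging each non-overlapping 2x2 window."""
    h, w, c = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample_2x(fmap):
    """Double height and width by repeating pixels.

    Assumption: this is only a stand-in for the learned deconvolution.
    """
    return fmap.repeat(2, axis=0).repeat(2, axis=1)

a = 416  # preset input size used in the embodiment
rng = np.random.default_rng(0)
first = rng.random((a // 8, a // 8, 256), dtype=np.float32)
second = rng.random((a // 16, a // 16, 512), dtype=np.float32)
third = rng.random((a // 32, a // 32, 1024), dtype=np.float32)

# Bring all three maps to the resolution of the second map, then
# concatenate along the channel axis to form the overall feature map.
overall = np.concatenate(
    [avg_pool_2x2(first), second, upsample_2x(third)], axis=-1
)
print(overall.shape)  # (26, 26, 1792)
```

Under these assumptions the overall feature map has 256 + 512 + 1024 = 1792 channels before the prediction convolutions are applied.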

As shown in FIG. 4, FIG. 4 is a schematic flowchart of the refinement step of step 103. The steps of processing the overall feature map to obtain detection information of the predicted target in the image, where the detection information includes target position information and target category information, include:

Step 401: Perform multi-channel feature fusion on the overall feature map, then perform feature extraction and filter impurities on the fused multi-channel feature map to obtain prediction information of the predicted target, which includes the predicted location information and the predicted category information of the target.

Specifically, a 1*1 convolution is used to perform multi-channel feature fusion on the overall feature map, and a 3*3 convolution is then used to further extract effective image features and filter out impurities to obtain the target prediction information, which includes location information, confidence, and category information. Using the convolutional neural network to predict the target information reduces the number of parameters.
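The multi-channel fusion performed by a 1*1 convolution can be sketched as follows: a 1*1 convolution applies the same linear map over the channels at every pixel. The function name `conv1x1` and the output width of 512 channels are assumptions for illustration; the text does not state the layer widths.

```python
import numpy as np

def conv1x1(fmap, weights, bias=None):
    """Apply a 1x1 convolution to a channels-last feature map.

    fmap:    array of shape (H, W, C_in)
    weights: array of shape (C_in, C_out)
    """
    # Every output channel is a weighted sum of all input channels
    # at the same spatial position.
    out = np.einsum("hwc,cd->hwd", fmap, weights)
    if bias is not None:
        out = out + bias
    return out

rng = np.random.default_rng(0)
overall = rng.standard_normal((26, 26, 1792)).astype(np.float32)
# Hypothetical output width; not specified in the text.
w = (0.01 * rng.standard_normal((1792, 512))).astype(np.float32)
fused = conv1x1(overall, w)
print(fused.shape)  # (26, 26, 512)
```

Because the kernel covers a single pixel, the spatial resolution is unchanged and only the channel dimension is mixed.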

In the embodiment of the present invention, after the above convolutional neural network prediction process, the size of the overall feature map is (a/16)*(a/16)*H, where the value of H is determined by the number of target categories to be predicted, and c is the number of target categories. Let a1 = a/16. The image is divided into a1*a1 blocks, corresponding to a1*a1 vectors of length H. Each block generates 9 candidate boxes, and each candidate box includes location information (x, y, w, h), a confidence (representing the probability that an object falls in this candidate box), and category information. Since the number of target categories is c, the category information is a c-dimensional vector, and the target in the candidate box belongs to the category whose component in this vector is the largest.
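A minimal sketch of how such a prediction tensor could be unpacked is given below. It assumes H = 9 * (5 + c) and a per-box layout of [x, y, w, h, confidence, c category values]; neither the exact formula for H nor the memory layout is stated in the text, and the value c = 20 is hypothetical.

```python
import numpy as np

NUM_BOXES = 9  # candidate boxes per block, as stated in the text

def decode_predictions(pred, num_classes):
    """Split an (a1, a1, H) tensor into location, confidence, and category.

    Assumption: H == NUM_BOXES * (5 + num_classes), with each box stored
    as [x, y, w, h, confidence, class_0, ..., class_{c-1}].
    """
    a1 = pred.shape[0]
    boxes = pred.reshape(a1, a1, NUM_BOXES, 5 + num_classes)
    location = boxes[..., 0:4]      # (x, y, w, h)
    confidence = boxes[..., 4]      # probability that an object is in the box
    class_scores = boxes[..., 5:]   # c-dimensional category vector
    # The category is the index of the largest component.
    category = class_scores.argmax(axis=-1)
    return location, confidence, category

c = 20          # hypothetical number of target categories
a1 = 416 // 16  # 26 blocks per side when a = 416
pred = np.random.default_rng(0).random((a1, a1, NUM_BOXES * (5 + c)))
location, confidence, category = decode_predictions(pred, c)
print(location.shape, confidence.shape, category.shape)
# (26, 26, 9, 4) (26, 26, 9) (26, 26, 9)
```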

P(object) represents the probability value of the target falling into a block, the value of this value is equal to the predicted confidence (ie confidence), namely:

P(object)=confidence

Because the confidence value is used to determine whether an object is in the candidate box, the IOU value also needs to be considered in order to make a better judgment. The position confidence is therefore calculated as follows:

$$P_{confidence} = P(object) \times IOU_{pred}^{truth}$$

where $IOU_{pred}^{truth}$ represents the intersection-over-union of the candidate box and the real box, and $P_{confidence}$ is the value of the position confidence.

If an object falls in a block, then P(object)=1; otherwise P(object)=0. Each candidate box needs to predict the four parameters x, y, w, h of the location information, one confidence, and c category values.

Here, x, y, w, h, confidence ∈ (0, 1); x and y represent the offset of the center point of the candidate box relative to the upper-left corner of the block, and w and h represent the width and height of the candidate box relative to the value of a (the preset image input size).

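The class-specific confidence score of each candidate box for a category is calculated as:

$$P(class_i \mid object) \times P(object) \times IOU_{pred}^{truth}$$

where $P(class_i \mid object)$ is the predicted probability that the target in the candidate box belongs to category i.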

It should be noted that during the training process the value of $IOU_{pred}^{truth}$ is computable, but in the process of prediction this value is not calculated, so in the prediction process $IOU_{pred}^{truth}$ is set to a default value of 1.
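The intersection-over-union and its use in the position confidence can be sketched as follows. The sketch assumes boxes in corner format (x1, y1, x2, y2) and assumes that the position confidence is the product of P(object) and the IOU, which is how the surrounding text reads; the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def position_confidence(p_object, iou_value=1.0):
    """Assumed form: P(object) * IOU; the IOU defaults to 1 at prediction time."""
    return p_object * iou_value

candidate = (0.0, 0.0, 2.0, 2.0)
truth = (1.0, 1.0, 3.0, 3.0)
print(iou(candidate, truth))                            # 1/7, about 0.143
print(position_confidence(1.0, iou(candidate, truth)))  # training: uses the IOU
print(position_confidence(1.0))                         # prediction: IOU set to 1
```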

In the embodiment of the present invention, in order to make the prediction more accurate, a loss function is used to optimize the prediction process. The loss function is as follows:

$$loss = coordError + confidenceError + classError$$

where coordError is the coordinate error, confidenceError is the confidence loss error, and classError is the classification error.

Among them, the expression of coordinate error coordError is as follows:

Here, f_obj(i,j) indicates that the measurement target falls into the j-th box of the i-th block, and f_noobj(i,j) indicates that the measurement target does not fall into the j-th box of the i-th block. When the measurement target falls into the j-th box of the i-th block, f_obj(i,j)=1 and f_noobj(i,j)=0; when the measurement target does not fall into the j-th box of the i-th block, f_obj(i,j)=0 and f_noobj(i,j)=1. λ_coord is a weight coefficient whose value can be set.

The expression of confidence error is as follows:

The λ_noobj parameter is used to reduce the influence of the background of the detected picture on the loss function.

The expression of classification error classError is as follows:

After optimization based on the above loss function, the predicted results are more accurate than those before the optimization.
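How the three error terms and the indicators f_obj and f_noobj could combine is sketched below. This is not the loss of the invention: the individual expressions are not reproduced in the text above, so the sketch assumes a sum-of-squares form in the style of YOLO, and the default weights 5.0 and 0.5 are hypothetical values chosen only for illustration.

```python
import numpy as np

def detection_loss(pred, truth, f_obj, lambda_coord=5.0, lambda_noobj=0.5):
    """Assumed sum-of-squares detection loss.

    pred, truth: arrays of shape (blocks, boxes, 5 + c) laid out as
                 [x, y, w, h, confidence, class values...]
    f_obj:       array of shape (blocks, boxes); 1 where the target falls
                 into the j-th box of the i-th block, else 0
    """
    f_noobj = 1.0 - f_obj
    # Coordinate error: only boxes responsible for a target contribute.
    coord_error = lambda_coord * np.sum(
        f_obj[..., None] * (pred[..., :4] - truth[..., :4]) ** 2
    )
    # Confidence error: background boxes are down-weighted by lambda_noobj.
    conf_sq = (pred[..., 4] - truth[..., 4]) ** 2
    confidence_error = np.sum(f_obj * conf_sq) + lambda_noobj * np.sum(
        f_noobj * conf_sq
    )
    # Classification error: only boxes responsible for a target contribute.
    class_error = np.sum(
        f_obj[..., None] * (pred[..., 5:] - truth[..., 5:]) ** 2
    )
    return coord_error + confidence_error + class_error

pred = np.array([[[0.5, 0.5, 0.5, 0.5, 0.5, 0.5]]])
truth = np.array([[[0.5, 0.5, 0.5, 0.5, 1.0, 1.0]]])
print(detection_loss(pred, truth, f_obj=np.array([[1.0]])))  # 0.5
print(detection_loss(pred, truth, f_obj=np.array([[0.0]])))  # 0.125
```

In the second call the box is treated as background, so only the down-weighted confidence term contributes.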

Step 402: Use the non-maximum suppression method to filter out redundant information and output the detection information of the predicted target, where the detection information includes target position information and target category information.

The specific steps of filtering out redundant information using non-maximum suppression are as follows. The positions of the candidate boxes b_1, b_2, b_3, ... are placed into a set B, and the corresponding class-specific confidence score values are placed into a set S.

The maximum value s* in S and its corresponding prediction box b* are selected, and the remaining candidate boxes are traversed to check whether a traversed candidate box b_i satisfies the condition:

IOU(b*, b_i) > N_t

where b* is the candidate box with the highest probability value for one of the categories, b_i is a candidate box with a probability value for any category, and N_t is the preset intersection-over-union threshold, that is, the threshold of non-maximum suppression.

If the condition is satisfied, that is, the intersection-over-union of b* and b_i is greater than the non-maximum suppression threshold, s_i is removed from S and b_i is removed from B. The maximum value s* among the values in S not yet selected and its corresponding prediction box b* are then selected again, and the remaining candidate boxes b_i are traversed again; whenever a traversed candidate box b_i satisfies the above condition, s_i is removed from S and b_i is removed from B. The remaining candidate boxes and their class-specific confidence score values are output. The output candidate boxes and the information they carry are the detection results.
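The suppression loop described above can be sketched for a single category as follows. The sketch assumes boxes in corner format (x1, y1, x2, y2) and a threshold N_t of 0.5; both are illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, n_t=0.5):
    """Return the indices of the boxes kept, highest score first."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Indices of the candidate set B, sorted by descending score in S.
    remaining = [int(i) for i in np.argsort(-scores)]
    keep = []
    while remaining:
        best = remaining.pop(0)  # b* with the maximum score s*
        keep.append(best)
        # Remove every b_i whose IOU with b* exceeds the threshold N_t.
        remaining = [
            i for i in remaining if box_iou(boxes[best], boxes[i]) <= n_t
        ]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IOU of 81/119, about 0.68, so it is suppressed, while the third box does not overlap and is kept. For multiple categories, the same loop would be run once per category.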

As can be seen from the target detection method shown in FIG. 1, in the first aspect, the method uses a convolutional neural network to extract image features of the input image. The structure adopted by the convolutional neural network reduces the amount of calculation without affecting the extraction of image features, which improves the processing speed of target detection and realizes real-time detection of target position and category. In the second aspect, the feature map obtained by the technical solution is a fusion of multiple feature maps with different spatial resolutions. The obtained feature map contains more image information, so the accuracy of detecting the target according to the feature map is high; that is, detection precision is guaranteed. Therefore, the technical solution simultaneously achieves high-precision and real-time detection of the target position and category.

Please refer to FIG. 5. FIG. 5 is a schematic flowchart of a target detection method according to another embodiment of the present invention. The difference from the previous embodiment is that the method further includes, before step 101, step 104: adjust the size of the input image.

Among them, the size of the input image is adjusted so that when the image features of the image are subsequently extracted, the size of the image is the same, which is convenient for operation. This method first adjusts the size of the input image so that images of any size can be processed under the same conditions. When adjusting the size of the image, the image specifications of the convolutional neural network must also be set, and the size of the image can be adjusted to a preset value.

In the embodiment of the present invention, the image size of the convolutional neural network is set to 416*416, so the input image size should be set to 416*416.

As shown in FIG. 6, FIG. 6 is a schematic flowchart of the refinement step of step 104. The specific steps of adjusting the size of the input image in step 104 include:

Step 501: Reduce the size of the image whose size is larger than the preset value according to the bilinear interpolation algorithm.

For an input image whose original size is greater than the preset value, a bilinear interpolation algorithm is used to reduce the image.

In the embodiment of the present invention, for an image whose original size is larger than 416*416*3, a bilinear interpolation algorithm is used to reduce the image. As shown in FIG. 7, which is a schematic diagram of multiple pixels on the coordinate axes, f is the pixel value of a pixel, and the values of the function f are known at the four points $Q_{11} = (x_1, y_1)$, $Q_{12} = (x_1, y_2)$, $Q_{21} = (x_2, y_1)$, and $Q_{22} = (x_2, y_2)$. First, linear interpolation is performed in the x direction to obtain:

$$f(R_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}), \quad \text{where } R_1 = (x, y_1)$$

$$f(R_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}), \quad \text{where } R_2 = (x, y_2)$$

Then linear interpolation is performed in the y direction to obtain:

$$f(P) \approx \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2), \quad \text{where } P = (x, y)$$

Substituting the x-direction results into the y-direction interpolation gives the final result:

$$f(x, y) \approx \frac{f(Q_{11})(x_2 - x)(y_2 - y) + f(Q_{21})(x - x_1)(y_2 - y) + f(Q_{12})(x_2 - x)(y - y_1) + f(Q_{22})(x - x_1)(y - y_1)}{(x_2 - x_1)(y_2 - y_1)}$$

Because the bilinear interpolation of an image uses only the four adjacent pixels, $x_2 - x_1 = 1$ and $y_2 - y_1 = 1$, so the denominators in the above formulas are all 1.
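The interpolation can be sketched in NumPy as follows, using the four adjacent pixels so that all denominators equal 1. The sketch assumes a single-channel image of at least 2*2 pixels and a corner-aligned sampling grid; the grid choice and the function names are assumptions, since the text does not specify how output pixels map to input coordinates.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Interpolate a 2-D image at real-valued (x, y); x is the column, y the row."""
    h, w = img.shape
    # Four adjacent pixels: x2 - x1 == 1 and y2 - y1 == 1.
    x1 = int(min(max(np.floor(x), 0), w - 2))
    y1 = int(min(max(np.floor(y), 0), h - 2))
    x2, y2 = x1 + 1, y1 + 1
    # Linear interpolation in the x direction (R1 on row y1, R2 on row y2).
    f_r1 = (x2 - x) * img[y1, x1] + (x - x1) * img[y1, x2]
    f_r2 = (x2 - x) * img[y2, x1] + (x - x1) * img[y2, x2]
    # Linear interpolation in the y direction.
    return (y2 - y) * f_r1 + (y - y1) * f_r2

def resize_bilinear(img, out_h, out_w):
    """Resize a 2-D image by sampling on a corner-aligned grid (an assumption)."""
    h, w = img.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    return np.array([[bilinear_sample(img, x, y) for x in xs] for y in ys])

img = np.array([[0.0, 10.0], [20.0, 30.0]])
print(bilinear_sample(img, 0.5, 0.5))  # 15.0
```

For a color image, the same function could be applied to each channel separately.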

Step 502: Increase the size of the image whose size is smaller than the preset value according to the zero-filling method.

For an input image whose original size is smaller than the preset value, the image is enlarged to the preset value by zero padding.

It can be known from the target detection method provided in the embodiment of the present invention that, in the first aspect, the scheme uses a convolutional neural network to extract image features of the input image. The structure adopted by the convolutional neural network reduces the amount of calculation without affecting the extraction of image features, so the processing speed of target detection is improved and real-time detection of target position and category is realized. In the second aspect, the feature map obtained by the technical solution is a fusion of multiple feature maps with different spatial resolutions. The obtained feature map contains more image information, so the accuracy of detecting the target according to the feature map is high; that is, the detection accuracy is guaranteed. Therefore, the technical solution simultaneously achieves high-precision and real-time detection of the target position and category. In the third aspect, the size of the input image is adjusted so that the method can process images of any size and perform target detection on images of any size, which increases the detection range.
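Zero padding to the preset size can be sketched as follows. The sketch places the original image in the upper-left corner of the padded canvas, which is an assumption; the text does not say where the original pixels sit inside the padded image. The function name is hypothetical.

```python
import numpy as np

def zero_pad_to(img, size):
    """Pad an (H, W) or (H, W, C) image with zeros up to size x size."""
    h, w = img.shape[:2]
    if h > size or w > size:
        raise ValueError("image is larger than the preset size")
    out = np.zeros((size, size) + img.shape[2:], dtype=img.dtype)
    # Assumption: the original image is anchored at the upper-left corner.
    out[:h, :w] = img
    return out

small = np.ones((300, 200, 3), dtype=np.uint8)
padded = zero_pad_to(small, 416)
print(padded.shape)  # (416, 416, 3)
```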

Please refer to FIG. 8, which is a schematic structural diagram of a target detection system according to another embodiment of the present invention. The target detection system includes: an image feature extraction module 601, a feature processing module 602, and an acquisition result module 603.

The image feature extraction module 601 is used to extract the image features of the input image by using a convolutional neural network to obtain a plurality of feature maps with different spatial resolutions of the image.

Before the image feature extraction module 601 extracts the image features of the image, the image must be input into the system. The image feature extraction module 601 can use different convolutional neural networks to obtain multiple feature maps of the image with different spatial resolutions. Each feature map carries image information, and feature maps with different spatial resolutions carry image information at different levels of the image.

The feature processing module 602 is used to adjust the spatial resolution of multiple feature maps uniformly, and splice the multiple feature maps into an overall feature map.

Because the spatial resolutions of the multiple feature maps are inconsistent, the feature maps cannot be stitched directly. After the feature processing module 602 adjusts the spatial resolutions of the multiple feature maps to be consistent, the multiple feature maps can be stitched into one overall feature map.

It should be noted that a feature map can have multiple feature channels, and the three feature maps with different spatial resolutions have different numbers of image channels. The feature processing module 602 stitches the three feature maps along the feature map channel to form an overall feature map, which is subsequently processed to obtain the location and category information of the predicted target in the image.

The acquisition result module 603 is used to process the overall feature map to obtain the detection information of the predicted target in the image. The detection information includes target position information and target category information.

Here, the predicted target is the target in the image to which the detection information refers. The acquisition result module 603 processes the feature map by performing multi-channel feature fusion on the overall feature map and using convolution to extract effective features and filter out impurities, so as to obtain the detection information of the predicted target, which includes target location information and target category information.

It should be noted that the overall network framework of the target detection system is shown in FIG. 9. The system includes a downsampling module, a convolution feature extraction A module, a convolution feature extraction B module, a convolution feature extraction C module, a deconvolution module, an average pooling module, a Concate splicing module, a prediction (Yolo_predict) module, and a non-maximum suppression (NMS) algorithm filtering module. The combination of these network modules constitutes the target detection system.

It should be noted that the value a in FIG. 9 is an initially set value. Within a certain range, the larger the value of a, the better the extraction effect of the entire network.

Further, as shown in FIG. 10, FIG. 10 is a schematic structural diagram of a refinement module of the image feature extraction module. The refinement module of the image feature extraction module 601 includes: a first sampling module 701, a second sampling module 702, and a third sampling module 703;

The first sampling module 701 is used to obtain a first feature map of the image;

The second sampling module 702 is used to obtain a second feature map of the image, and the spatial resolution of the second feature map is smaller than that of the first feature map;

The third sampling module 703 is used to obtain a third feature map of the image, and the spatial resolution of the third feature map is smaller than that of the second feature map.

The network structures of the first sampling module 701, the second sampling module 702, and the third sampling module 703 all have a concatenate structure. 9 shows the positions of the first sampling module 701, the second sampling module 702, and the third sampling module 703 in the overall network frame diagram. The first sampling module 701, the second sampling module 702, and the third sampling module 703 are used The structure of the module is shown in Figure 11, Figure 12, Figure 13, and Figure 14. Figure 11 is a schematic diagram of the downsampling module, Figure 12 is a schematic diagram of the convolution feature extraction module A, and Figure 13 is a schematic diagram of the convolution feature extraction module B Figure 14 is a schematic diagram of the C module for convolution feature extraction.

It should be noted that the left side of FIG. 11 shows the detailed structure of the downsampling module, and the right side is a simplified diagram of the module. In the detailed structure, the right branch uses several maximum pooling (maxpool) layers of different sizes to extract image features at different scales, while the left branch uses 3*3 convolution to extract image features and then 1*1 convolution for feature filtering and further information exchange between image channels. Finally, the image features extracted by the left branch and those extracted by the right branch are concatenated. On the one hand, more features can be extracted; on the other hand, the downsampling module increases the number of image channels of the feature map while reducing the number of parameters, which increases the processing speed.

At the same time, it should also be noted that, as shown in FIGS. 12, 13 and 14, the convolution feature extraction A module, the convolution feature extraction B module and the convolution feature extraction C module are the same type of module. Each of the three modules contains two different network sub-modules (blocks), which are likewise connected by a concatenate structure. Each feature map passes through different network layers during extraction, so the extracted feature maps contain different image information and have different spatial resolutions.

It can be seen from FIG. 9 that each of the first sampling module 701, the second sampling module 702, and the third sampling module 703 is a combination of the downsampling module, the convolution feature extraction A module, the convolution feature extraction B module, and the convolution feature extraction C module. The first sampling module 701, the second sampling module 702, and the third sampling module 703 are connected in sequence: the first sampling module first obtains the first feature map of the image, the second sampling module then obtains the second feature map of the image, and finally the third sampling module obtains the third feature map of the image. As the number of network layers increases, the spatial resolution of the acquired feature maps decreases.

It should also be noted that, when setting parameters in this model, the number of channels of the 3*3 convolutional layer is set to 6 times the number of channels of the 1*1 convolutional layer. The advantage is twofold: the larger number of channels in the 3*3 convolutional layer makes the extracted image features richer, while the smaller number of channels in the 1*1 convolutional layer gives it a channel compression effect, which ensures that the number of channels does not grow too large as the network deepens and also provides a certain feature extraction effect.
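As a rough illustration of the channel ratio described above, the following sketch counts the weights of a 3*3 layer followed by a 1*1 layer. The function names, the layer ordering, and the base channel count of 32 are hypothetical choices for illustration; only the 6:1 channel ratio comes from the text.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Number of weights in a convolution layer (bias terms ignored)."""
    return in_channels * out_channels * kernel_size * kernel_size


def block_params(base_channels, ratio=6):
    """Weights of a 3*3 layer that widens to ratio*base_channels,
    followed by a 1*1 layer that compresses back to base_channels."""
    wide = ratio * base_channels
    expand = conv_params(base_channels, wide, 3)    # 3*3: richer features
    compress = conv_params(wide, base_channels, 1)  # 1*1: channel compression
    return expand, compress


expand, compress = block_params(32)
print(expand, compress)
```

Under these assumptions the 1*1 layer returns the channel count to its base value, so stacking more such blocks does not by itself make the channel count grow with depth.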

As shown in FIG. 15, FIG. 15 is a schematic structural diagram of a refinement module of a feature processing module. The feature processing module 602 includes: an average pooling module 801, a deconvolution module 802, and a stitching module 803.

The average pooling module 801 is used to reduce the spatial resolution of the first feature map by using the average pooling method, so that the spatial resolution of the first feature map is consistent with the spatial resolution of the second feature map.

The spatial resolution of the first feature map is greater than that of the second feature map. Therefore, the average pooling module 801 reduces the spatial resolution of the first feature map so that it is consistent with the spatial resolution of the second feature map. In the present invention, the average pooling module 801 reduces the spatial resolution by average pooling.
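A minimal sketch of average pooling on a single channel is given below. The 2*2 window with stride 2 is an assumption chosen because it halves the height and width, which matches the step from a/8 to a/16; the source does not state the window size, and the function name is hypothetical.

```python
def avg_pool_2x2(channel):
    """Average-pool one channel (a list of rows) with a 2*2 window and stride 2."""
    rows, cols = len(channel), len(channel[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            window_sum = (channel[r][c] + channel[r][c + 1]
                          + channel[r + 1][c] + channel[r + 1][c + 1])
            pooled_row.append(window_sum / 4.0)
        pooled.append(pooled_row)
    return pooled


grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(avg_pool_2x2(grid))
```

Pooling is applied to each channel independently, so the number of channels is unchanged while the height and width are halved.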

The deconvolution module 802 is configured to use deconvolution to increase the spatial resolution of the third feature map, so that the spatial resolution of the third feature map is consistent with the spatial resolution of the second feature map.

The spatial resolution of the third feature map is smaller than that of the second feature map. Therefore, the deconvolution module 802 increases the spatial resolution of the third feature map so that it is consistent with the spatial resolution of the second feature map. In the present invention, the deconvolution module 802 increases the spatial resolution by deconvolution.
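The upsampling effect of deconvolution (transposed convolution) can be sketched on a single channel as follows. The 2*2 kernel with stride 2 is an assumption chosen because it doubles the height and width, matching the step from a/32 to a/16; the kernel values are placeholders for weights that would be learned, and the function name is hypothetical.

```python
def deconv_2x2(channel, kernel):
    """Transposed convolution of one channel with a 2*2 kernel and stride 2.

    Each input value is multiplied by the kernel and written to its own
    2*2 patch of the output, so height and width both double.
    """
    rows, cols = len(channel), len(channel[0])
    out = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for r in range(rows):
        for c in range(cols):
            for kr in range(2):
                for kc in range(2):
                    out[2 * r + kr][2 * c + kc] += channel[r][c] * kernel[kr][kc]
    return out


upsampled = deconv_2x2([[1, 2], [3, 4]], [[1, 0], [0, 1]])
print(len(upsampled), len(upsampled[0]))
```

Unlike fixed interpolation, the kernel here is a trainable parameter, so the network can learn how to fill in the enlarged feature map.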

The stitching module 803 is used to stitch the first feature map, the second feature map and the third feature map into a whole feature map.

Among them, the numbers of image channels of the first feature map, the second feature map, and the third feature map are all different, and the stitching module 803 concatenates the three feature maps along the feature channel to form an overall feature map. The overall feature map contains both high-order features and low-order features, which effectively improves the prediction results.

In the embodiment of the present invention, FIG. 16 is a schematic diagram of multi-scale feature map fusion. As shown in FIG. 16, the size of the input image is a*a*3 (a is a preset value), and the sizes of the output first feature map, second feature map, and third feature map are (a/8)*(a/8)*256, (a/16)*(a/16)*512, and (a/32)*(a/32)*1024, respectively. The (a/8)*(a/8)*256 feature map is average-pooled by the average pooling module 801 to reduce its spatial resolution, the (a/32)*(a/32)*1024 feature map is deconvolved by the deconvolution module 802 to increase its spatial resolution, and finally the stitching module 803 stitches the three feature maps together into an overall feature map.
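The size bookkeeping for this fusion can be checked with a short sketch. It assumes that the deconvolution keeps the channel count of the third feature map, which the source does not state; the function name is hypothetical.

```python
def fused_shape(a):
    """Shape (height, width, channels) of the overall feature map for an a*a*3 input."""
    first = (a // 8, a // 8, 256)
    second = (a // 16, a // 16, 512)
    third = (a // 32, a // 32, 1024)

    # Average pooling halves height and width and keeps the channel count.
    first = (first[0] // 2, first[1] // 2, first[2])
    # Assumption: deconvolution doubles height and width and keeps the channel count.
    third = (third[0] * 2, third[1] * 2, third[2])

    assert first[:2] == second[:2] == third[:2], "spatial sizes must match"
    # Stitching concatenates along the channel axis, so the channel counts add up.
    return (second[0], second[1], first[2] + second[2] + third[2])


print(fused_shape(416))
```

For a = 416, all three maps end up at 26*26, and under the stated assumption the stitched map has 256 + 512 + 1024 = 1792 channels.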

It should be noted that FIG. 17 is a schematic structural diagram of the average pooling module, and FIG. 18 is a schematic structural diagram of the deconvolution module. Both modules have a convolution structure. The average pooling module 801 has an average pooling structure (Avgpool), which reduces the spatial resolution of the first feature map, and the deconvolution module 802 has a deconvolution structure (Deconvolution), which increases the spatial resolution of the third feature map.

Further, FIG. 19 is a schematic structural diagram of the refinement modules of the acquisition result module. As shown in FIG. 19, the acquisition result module 603 includes a prediction processing module 901 and a filtering module 902:

The prediction processing module 901 is used to perform multi-channel feature fusion on the overall feature map, and then to perform feature extraction on the fused multi-channel feature map and filter out impurities, so as to obtain prediction information of the prediction target, where the prediction information includes the predicted location information and the predicted category information of the target.

Specifically, as shown in FIG. 20, FIG. 20 is a schematic structural diagram of a Yolo_predict prediction module. The Yolo_predict prediction module includes a residual prediction module (res_predict_block) and a 1*1 convolutional layer. The specific structure of the residual prediction module is shown in FIG. 21, and FIG. 21 is a schematic structural diagram of the residual prediction module. In this residual prediction module, a convolutional neural network is used to predict the target.

Among them, the steps in the residual prediction module include: using 1*1 convolution to perform multi-channel feature fusion on the overall feature map, and then using 3*3 convolution to further extract effective image features and filter out impurities to obtain the target prediction information. The prediction information includes location information, confidence, and category information. The convolutional neural network is used to predict the target information, which reduces the parameters.

In the embodiment of the present invention, after the prediction processing of the convolutional neural network of the feature processing module 602, the size of the overall feature map is (a/16)*(a/16)*H, where the value of H is determined by the number c of target categories to be predicted. Let a1 = a/16 and divide the image into a1*a1 blocks, which correspond to a1*a1 vectors of length H. Each block generates 9 candidate boxes, and each candidate box includes location information (x, y, w, h), a confidence (indicating the probability that an object falls in this candidate box), and category information. Since the number of target categories is c, the category information is a c-dimensional vector; the component with the largest value determines the category of the target in the candidate box.
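To make the per-block vector concrete, the sketch below splits one H-length vector into its 9 candidate boxes. The memory layout (x, y, w, h, confidence, then c category scores for each box, giving H = 9*(5 + c)) is an assumption inferred from the description, not something the source states, and the names are hypothetical.

```python
BOXES_PER_BLOCK = 9  # from the text: each block generates 9 candidate boxes


def decode_block(vector, num_classes):
    """Split one H-length block vector into per-box predictions.

    Assumed layout (not stated in the source): for each box,
    x, y, w, h, confidence, then num_classes category scores.
    """
    per_box = 5 + num_classes
    if len(vector) != BOXES_PER_BLOCK * per_box:
        raise ValueError("unexpected vector length")
    boxes = []
    for b in range(BOXES_PER_BLOCK):
        chunk = vector[b * per_box:(b + 1) * per_box]
        x, y, w, h, confidence = chunk[:5]
        scores = chunk[5:]
        # The category is the index of the largest component.
        category = max(range(num_classes), key=lambda k: scores[k])
        boxes.append({"box": (x, y, w, h),
                      "confidence": confidence,
                      "category": category})
    return boxes
```

With c = 2, for example, each box occupies 7 values and the block vector has length 63 under this layout.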

P(object)=confidence

among them,

Among them, if an object falls in a block, P(object)=1; otherwise P(object)=0. Each candidate box needs to predict the four location parameters x, y, w, h, one confidence value, and c category values.

It should be noted that during the training process,

The default value is 1.

In the embodiment of the present invention, in order to make the prediction more accurate, the prediction process of the prediction processing module 901 is optimized using a loss function. Among them, the loss function is as follows:

Among them, the expression of coordinate error coordError is as follows:

Where f_obj(i,j) indicates that the measurement target falls into the j-th box of the i-th block, and f_noobj(i,j) indicates that the measurement target does not fall into the j-th box of the i-th block. When the measurement target falls into the j-th box of the i-th block, f_obj(i,j)=1 and f_noobj(i,j)=0; when the measurement target does not fall into the j-th box of the i-th block, f_obj(i,j)=0 and f_noobj(i,j)=1. λ_coord is a weight coefficient whose value can be set.

The expression of confidence error is as follows:

Among them, the λ_noobj parameter is used to reduce the impact of the background of the detected picture on the loss function.

The expression of classification error classError is as follows:
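The formula images for the loss terms are not reproduced in this text. The block below is only a sketch of how the three terms could be written, assuming the widely used YOLO-style sum-of-squared-errors form with the symbols f_obj, f_noobj, λ_coord and λ_noobj defined above; the exact terms in the original may differ. Hatted symbols denote predicted values, C denotes the confidence, p(k) denotes the score of category k, and the name confError is a hypothetical label for the confidence error.

```latex
\begin{aligned}
\text{loss} &= \text{coordError} + \text{confError} + \text{classError} \\
\text{coordError} &= \lambda_{coord} \sum_{i} \sum_{j} f_{obj}(i,j)
  \left[ (x_{ij}-\hat{x}_{ij})^2 + (y_{ij}-\hat{y}_{ij})^2
       + (w_{ij}-\hat{w}_{ij})^2 + (h_{ij}-\hat{h}_{ij})^2 \right] \\
\text{confError} &= \sum_{i} \sum_{j} f_{obj}(i,j)\,(C_{ij}-\hat{C}_{ij})^2
  + \lambda_{noobj} \sum_{i} \sum_{j} f_{noobj}(i,j)\,(C_{ij}-\hat{C}_{ij})^2 \\
\text{classError} &= \sum_{i} \sum_{j} f_{obj}(i,j) \sum_{k=1}^{c}
  \left( p_{ij}(k)-\hat{p}_{ij}(k) \right)^2
\end{aligned}
```

In this form, a value of λ_noobj below 1 down-weights the many boxes that contain only background, which is the role the text assigns to that parameter.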

The filtering module 902 is used to filter out redundant prediction information by using a non-maximum suppression method, and output detection information of the prediction target, where the detection information includes target position information and target category information.

The specific steps by which the filtering module 902 filters out redundant information using the non-maximum suppression method include: placing the positions of the candidate boxes into a set B, which includes b₁, b₂, b₃, ..., and placing the class-specific confidence scores into a set S. A candidate box b_i that satisfies the following condition is treated as redundant and removed:

$$\mathrm{IOU}(b^{*}, b_i) > N_t$$

Where b* is the candidate box with the largest probability value of belonging to one of the categories, b_i is a candidate box with a probability value of belonging to any category, and N_t is the preset threshold for the intersection-over-union (IOU) ratio, that is, the threshold for non-maximum suppression.
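The suppression step can be sketched with the common greedy form of non-maximum suppression shown below. Boxes are given as corner coordinates (x1, y1, x2, y2), which is an assumption made for brevity (the text describes boxes by x, y, w, h), and the function names are hypothetical. In practice the routine would be run once per category on the class-specific scores.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold):
    """Return the indices of the boxes kept by greedy non-maximum suppression."""
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # b*: the highest-scoring remaining box
        kept.append(best)
        # Drop every b_i whose IOU with b* exceeds the threshold N_t.
        remaining = [i for i in remaining
                     if iou(boxes[best], boxes[i]) <= threshold]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, 0.5))
```

In this example the second box overlaps the first heavily and is suppressed, while the third box does not overlap and is kept.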

As can be seen from the target detection system provided in FIG. 8 of the present invention, in the first aspect, the image feature extraction module 601 uses a convolutional neural network to extract the image features of the input image. The structure adopted by the convolutional neural network reduces the amount of calculation without affecting the extraction of image features, improves the processing speed of target detection, and realizes real-time detection of target position and category. In the second aspect, the feature map obtained by the feature processing module 602 is a fusion of multiple feature maps with different spatial resolutions and therefore contains more image information, so the acquisition result module 603 detects the target with high accuracy according to the feature map, which guarantees the detection accuracy. Therefore, the target detection system simultaneously achieves high-precision and real-time detection of target positions and categories.

Please refer to FIG. 22. FIG. 22 is a schematic structural diagram of a target detection system according to another embodiment of the present invention. The difference from the previous embodiment is that the target detection system further includes: an adjustment module 604.

Among them, the adjustment module 604 adjusts the size of the input image so that all images have the same size when the subsequent image feature extraction module 601 extracts their image features, which is convenient to operate. The system first adjusts the size of the input image so that an image of any size can be processed under the same conditions. To adjust the image size, the image specification processed by the convolutional neural network must also be set for the entire system, so that an input image of any size can be adjusted to the preset value.

In the embodiment of the present invention, the image size of the convolutional neural network is set to 416*416, so the input image size is set to 416*416 by the adjustment module 604.

As shown in FIG. 23, FIG. 23 is a schematic structural diagram of a detailed module of the adjustment module. The adjustment module 604 includes a reduction module 1001 and an increase module 1002:

The reduction module 1001 is configured to reduce the size of an image whose size is larger than a preset value according to a bilinear interpolation algorithm.

Among the input images, for an image whose original size is greater than a preset value, the reduction module 1001 uses a bilinear interpolation algorithm to reduce the image.

In the embodiment of the present invention, for an image with an original size greater than 416*416*3, the reduction module 1001 uses a bilinear interpolation algorithm to reduce the image. FIG. 7 is a schematic diagram of a plurality of pixels represented on the coordinate axes. As shown in FIG. 7, f is the pixel value of a pixel, and the value of the function f is known at the four points Q₁₁ = (x₁, y₁), Q₁₂ = (x₁, y₂), Q₂₁ = (x₂, y₁) and Q₂₂ = (x₂, y₂). First perform linear interpolation in the x direction to obtain the following formulas:

$$f(R_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21})$$

where R₁ = (x, y₁), and

$$f(R_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22})$$

where R₂ = (x, y₂).

Then perform linear interpolation in the y direction to get

$$f(P) \approx \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2)$$

where P = (x, y) is the point whose pixel value is to be computed.
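The two-step interpolation described above (x direction first, then y direction) can be sketched as a small function. The function and parameter names are hypothetical; q11, q12, q21 and q22 stand for the known values of f at Q₁₁, Q₁₂, Q₂₁ and Q₂₂.

```python
def bilinear(x, y, x1, y1, x2, y2, q11, q12, q21, q22):
    """Interpolate f at (x, y) from its values at the four corner points.

    q11 = f(x1, y1), q12 = f(x1, y2), q21 = f(x2, y1), q22 = f(x2, y2).
    """
    # Linear interpolation in the x direction at y1 (R1) and at y2 (R2).
    r1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21
    r2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22
    # Linear interpolation in the y direction between R1 and R2.
    return (y2 - y) / (y2 - y1) * r1 + (y - y1) / (y2 - y1) * r2


print(bilinear(0.5, 0.5, 0, 0, 1, 1, 0, 10, 20, 30))
```

When reducing an image, each output pixel is mapped back to a non-integer position in the original image and its value is computed this way from the four surrounding pixels.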

The increasing module 1002 is used to increase the size of the image whose size is smaller than the preset value according to the zero-filling method.

Among the input images, for an image whose original size is smaller than the preset value, the increasing module 1002 uses a zero-filling method to increase the size of the image to the preset value.
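A minimal sketch of zero filling for a single-channel image follows. The source does not say where the zeros are placed, so keeping the original pixels in the top-left corner is an assumption, and the function name is hypothetical.

```python
def zero_pad(image, target):
    """Pad a single-channel image (a list of rows) with zeros to target*target.

    Assumption: the original pixels stay in the top-left corner; the source
    does not say where the zeros are placed.
    """
    rows, cols = len(image), len(image[0])
    if rows > target or cols > target:
        raise ValueError("image is larger than the target size")
    # Extend each existing row on the right, then append all-zero rows below.
    padded = [row + [0] * (target - cols) for row in image]
    padded += [[0] * target for _ in range(target - rows)]
    return padded


print(zero_pad([[1, 2], [3, 4]], 3))
```

In the embodiment described here the target would be 416, and the same padding would be applied to each of the three color channels.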

As can be seen from the target detection system provided by the embodiment of the present invention, in the first aspect, the image feature extraction module 601 uses a convolutional neural network to extract image features of the input image. The structure adopted by the convolutional neural network reduces the amount of calculation without affecting the extraction of image features, improves the processing speed of target detection, and realizes real-time detection of target position and category. In the second aspect, the feature map obtained by the feature processing module 602 is a fusion of multiple feature maps with different spatial resolutions and therefore contains more image information, so the acquisition result module 603 detects the target with high accuracy according to the feature map, which guarantees the detection accuracy. Therefore, the target detection system simultaneously achieves high-precision and real-time detection of target positions and categories. In the third aspect, the adjustment module 604 adjusts the size of the input image so that the system can process and perform target detection on images of any size, which widens the detection range.

The above is a description of the target detection method and system provided by the present invention. For those skilled in the art, there may be changes in the specific implementation and the scope of application according to the ideas of the embodiments of the present invention. In summary, the contents of this description should not be construed as limiting the invention.

Claims

A target detection method, characterized in that the method includes:

A convolutional neural network is used to extract image features of the input image to obtain multiple feature maps of the image with different spatial resolutions;

Adjusting the spatial resolutions of the multiple feature maps to be consistent, and stitching the multiple feature maps into an overall feature map;

Processing the overall feature map to obtain the detection information of the predicted target in the image, where the detection information includes target position information and target category information.
The method according to claim 1, wherein the step of extracting image features of the input image using a convolutional neural network to obtain a plurality of feature maps with different spatial resolutions of the image includes:

Acquiring the first feature map of the image;

Acquiring a second feature map of the image, the spatial resolution of the second feature map is smaller than the spatial resolution of the first feature map;

Acquiring a third feature map of the image, the spatial resolution of the third feature map is smaller than the spatial resolution of the second feature map.
The method according to claim 2, wherein the step of adjusting the spatial resolution of the plurality of feature maps to be consistent, and stitching the plurality of feature maps into an overall feature map includes:

Reducing the spatial resolution of the first feature map using the method of average pooling, so that the spatial resolution of the first feature map is consistent with the spatial resolution of the second feature map;

Using deconvolution to increase the spatial resolution of the third feature map, so that the spatial resolution of the third feature map is consistent with the spatial resolution of the second feature map;

The first feature map, the second feature map, and the third feature map are stitched together to form an overall feature map.
The method according to claim 3, wherein the step of processing the overall feature map to obtain detection information of the predicted target in the image, the detection information including target position information and target category information, includes:

Performing multi-channel feature fusion on the overall feature map, and then performing feature extraction and impurity filtering on the fused multi-channel feature map to obtain prediction information of the predicted target, the prediction information including predicted location information and predicted category information of the target;

Using the method of non-maximum suppression to filter out redundant prediction information and output detection information of the predicted target, where the detection information includes target position information and target category information.
The method according to claim 1, wherein the step before extracting the image features of the input image using the convolutional neural network further comprises: adjusting the size of the input image.
The method according to claim 5, wherein the step of adjusting the size of the input image comprises:

Reducing the size of an image whose size is larger than the preset value according to the bilinear interpolation algorithm;

Increasing the size of an image whose size is smaller than the preset value by zero padding.
A target detection system, characterized in that the system includes: an image feature extraction module, a feature processing module and an acquisition result module;

The image feature extraction module is used to extract image features of the input image by using a convolutional neural network to obtain a plurality of feature maps with different spatial resolutions of the image;

The feature processing module is used to adjust the spatial resolutions of the multiple feature maps to be consistent, and to splice the multiple feature maps into an overall feature map;

The acquisition result module is configured to process the overall feature map to obtain detection information of the predicted target in the image, where the detection information includes target position information and target category information.
The system according to claim 7, wherein the image feature extraction module includes: a first sampling module, a second sampling module, and a third sampling module;

The first sampling module is used to obtain a first feature map of the image;

The second sampling module is used to obtain a second feature map of the image, and the spatial resolution of the second feature map is smaller than the spatial resolution of the first feature map;

The third sampling module is configured to obtain a third feature map of the image, and the spatial resolution of the third feature map is smaller than that of the second feature map.