WO2020113412A1 - Target detection method and system - Google Patents

Target detection method and system

Info

Publication number
WO2020113412A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
image
feature
target
module
Prior art date
Application number
PCT/CN2018/119132
Other languages
French (fr)
Chinese (zh)
Inventor
王娜
许康
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学
Priority to PCT/CN2018/119132
Publication of WO2020113412A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition

Definitions

  • the invention relates to the field of computer vision, in particular to a method and system for target detection.
  • the main purpose of the present invention is to provide a target detection method and system for solving the technical problem that the existing target detection technology cannot simultaneously achieve high precision and real-time detection.
  • a first aspect of the present invention provides a method for target detection, the method including:
  • a convolutional neural network is used to extract image features of the input image to obtain multiple feature maps of the image with different spatial resolutions
  • a second aspect of the present invention provides a system for target detection.
  • the system includes: an image feature extraction module, a feature processing module, and an acquisition result module;
  • the image feature extraction module is used to extract image features of the input image by using a convolutional neural network to obtain a plurality of feature maps with different spatial resolutions of the image;
  • the feature processing module is used to adjust the spatial resolution of the multiple feature maps uniformly, and splice the multiple feature maps into an overall feature map;
  • the acquisition result module is configured to process the overall feature map to obtain detection information of a predicted target in the image, where the detection information includes target position information and target category information.
  • the technical solution uses a convolutional neural network to extract image features of the input image.
  • the structure adopted by the convolutional neural network reduces the amount of computation without affecting the quality of image feature extraction, which improves the processing speed of target detection and enables real-time detection of target positions and categories.
  • the feature map obtained by the technical solution is a fusion of multiple feature maps with different spatial resolutions.
  • the obtained feature map contains more image information, so the accuracy of detecting the target according to the feature map is high; that is, detection precision is guaranteed. Therefore, the technical solution simultaneously achieves high-precision and real-time detection of the target position and category.
  • FIG. 1 is a schematic flowchart of a target detection method provided by an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of the detailed steps of step 101;
  • FIG. 3 is a schematic flowchart of the detailed steps of step 102;
  • FIG. 4 is a schematic flowchart of the detailed steps of step 103;
  • FIG. 5 is a schematic flowchart of a target detection method according to another embodiment of the present invention.
  • FIG. 6 is a schematic flowchart of the detailed steps of step 104;
  • FIG. 7 is a schematic diagram showing multiple pixels on the coordinate axis;
  • FIG. 8 is a schematic structural diagram of a target detection system according to another embodiment of the present invention;
  • FIG. 9 is a schematic diagram of the overall network framework of the target detection system;
  • FIG. 10 is a schematic structural diagram of a refinement module of an image feature extraction module;
  • FIG. 11 is a schematic diagram of the downsampling module;
  • FIG. 12 is a schematic diagram of the A module for convolution feature extraction;
  • FIG. 13 is a schematic diagram of the B module for convolution feature extraction;
  • FIG. 14 is a schematic diagram of the C module for convolution feature extraction;
  • FIG. 15 is a schematic structural diagram of a detailed module of a feature processing module;
  • FIG. 16 is a schematic diagram of multi-scale feature map fusion in an embodiment of the present invention;
  • FIG. 17 is a schematic structural diagram of an average pooling module;
  • FIG. 18 is a schematic structural diagram of a deconvolution module;
  • FIG. 19 is a schematic structural diagram of a detailed module of the acquisition result module;
  • FIG. 20 is a schematic structural diagram of the Yolo_predict prediction module;
  • FIG. 21 is a schematic structural diagram of a residual prediction module;
  • FIG. 22 is a schematic structural diagram of a target detection system according to another embodiment of the present invention.
  • FIG. 23 is a schematic structural diagram of a detailed module of the adjustment module.
  • the present invention proposes a method for target detection.
  • FIG. 1 is a schematic flowchart of a target detection method according to an embodiment of the present invention.
  • the target detection method includes:
  • Step 101 A convolutional neural network is used to extract image features of an input image to obtain a plurality of feature maps with different spatial resolutions of the image.
  • Before the image features are extracted, the image should be input into the system. Multiple convolutional neural networks can be used to obtain multiple feature maps of the image with different spatial resolutions, where each feature map carries image information, and feature maps with different spatial resolutions carry image information at different levels of the image.
  • Step 102 Adjust the spatial resolution of multiple feature maps to be consistent, and stitch the multiple feature maps into an overall feature map.
  • The spatial resolutions of the multiple feature maps are inconsistent, which would affect the subsequent splicing of the feature maps.
  • After the spatial resolutions are adjusted to be consistent, the multiple feature maps can be spliced into an overall feature map.
  • A feature map can have multiple feature channels, and the three feature maps with different spatial resolutions have different numbers of image channels. The three feature maps are stitched together along the feature map channels to form an overall feature map, and subsequent processing of the overall feature map yields the location and category information of the predicted target in the image.
  • Step 103 Process the overall feature map to obtain detection information of the predicted target in the image, where the detection information includes target position information and target category information.
  • The predicted target is the target in the image for which the detection information is obtained.
  • Processing the overall feature map means performing multi-channel feature fusion on it and using convolution to extract effective features and filter out impurities, so as to obtain the detection information of the predicted target.
  • the detection information includes target location information and target category information.
  • FIG. 2 is a schematic flowchart of the refinement step of step 101.
  • the specific steps of using a convolutional neural network to extract image features of an input image, and obtaining multiple feature maps with different spatial resolutions of the image include:
  • Step 201 Acquire the first feature map of the image
  • Step 202 Acquire a second feature map of the image, the spatial resolution of the second feature map is smaller than the spatial resolution of the first feature map;
  • Step 203 Acquire a third feature map of the image, the spatial resolution of the third feature map is smaller than the spatial resolution of the second feature map.
  • In each feature map acquisition process, multiple max-pooling (maxpool) structures of different sizes are used on the right side to extract image features of different sizes. On the left side, a 3*3 convolution extracts image features, and a 1*1 convolution then performs feature filtering and further information exchange between image channels. Finally, the image features extracted on the left and the image features extracted on the right are concatenated.
  • In this way, more features can be extracted.
  • This structure increases the number of image channels of the feature map while reducing the number of parameters, which increases the processing speed.
  • each feature map has different network layers during the extraction process, so that the extracted feature map contains different image information, and the spatial resolution of each feature map is also different.
  • the three feature maps are acquired in sequence. As the number of network layers increases, the spatial resolution of the acquired image feature maps decreases.
  • the concatenate structure is used to increase the number of network channels and retain more image features.
  • the fusion of low-order features and high-order features brings richer semantic features.
  • the fusion of multiple max-pooling structures with different scales and multiple feature maps with different spatial resolutions yields rich image semantic information, which benefits the prediction of targets of different sizes and improves the accuracy of target detection.
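
The multi-scale pooling idea described above can be illustrated with a minimal pure-Python sketch. It pools one 2D channel with several max-pooling kernel sizes at the same stride and stacks the results as separate channels. The kernel sizes (2 and 3), the stride of 2, the border clipping, and the function names are assumptions chosen for illustration; the source does not state them.

```python
import math


def max_pool2d(channel, kernel, stride):
    """Max-pool one 2D channel; windows that overhang the border are clipped."""
    height, width = len(channel), len(channel[0])
    out_h = math.ceil(height / stride)
    out_w = math.ceil(width / stride)
    pooled = []
    for oy in range(out_h):
        row = []
        for ox in range(out_w):
            y0, x0 = oy * stride, ox * stride
            window = [
                channel[y][x]
                for y in range(y0, min(y0 + kernel, height))
                for x in range(x0, min(x0 + kernel, width))
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


def multi_scale_pool(channel, kernels=(2, 3), stride=2):
    """Pool the same channel with several kernel sizes (hypothetical values)
    and stack the results as channels, i.e. concatenate on the channel axis."""
    return [max_pool2d(channel, k, stride) for k in kernels]


feature = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
branches = multi_scale_pool(feature)
print(branches[0])  # 2*2 windows
print(branches[1])  # 3*3 windows, same output size because the stride is shared
```

Because every branch shares the stride, the pooled outputs have the same spatial size and can be concatenated with the convolution branch along the channel axis.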
  • FIG. 3 is a schematic flowchart of the refinement step of step 102.
  • the specific steps of adjusting the spatial resolution of a plurality of the feature maps uniformly and stitching the multiple feature maps into an overall feature map include:
  • Step 301 Adopt the method of average pooling to reduce the spatial resolution of the first feature map, so that the spatial resolution of the first feature map is consistent with the spatial resolution of the second feature map.
  • The spatial resolution of the first feature map is greater than that of the second feature map. Therefore, the spatial resolution of the first feature map is reduced so that it is consistent with the spatial resolution of the second feature map. In the present invention, the spatial resolution is reduced by average pooling.
  • Step 302 Use deconvolution to increase the spatial resolution of the third feature map, so that the spatial resolution of the third feature map is consistent with the spatial resolution of the second feature map.
  • The spatial resolution of the third feature map is smaller than that of the second feature map. Therefore, the spatial resolution of the third feature map is increased so that it is consistent with the spatial resolution of the second feature map. In the present invention, the spatial resolution is increased by deconvolution.
  • Step 303 Perform stitching processing on the first feature map, the second feature map, and the third feature map to fuse into an overall feature map.
  • The first feature map, the second feature map, and the third feature map have different numbers of image channels.
  • The first feature map, the second feature map, and the third feature map are concatenated along the feature map channel to form an overall feature map.
  • the overall feature map contains both high-order features and low-order features, effectively improving the prediction results.
  • the input image size is a*a*3 (the value of a is a preset value), and the sizes of the output first feature map, second feature map, and third feature map are (a/8)*(a/8)*256, (a/16)*(a/16)*512, and (a/32)*(a/32)*1024, respectively.
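
As a worked example of the sizes above, the following sketch computes the three feature map shapes and the shape of the fused map for a given input size. It assumes that average pooling and deconvolution each change the resolution by a factor of 2 and leave the channel counts unchanged; the source does not state the channel count of the fused map, so the summed value is an assumption. The function name is hypothetical.

```python
def fused_shape(a, strides=(8, 16, 32), channels=(256, 512, 1024)):
    """Return the three feature-map shapes and the fused shape for an a*a*3 input."""
    if a % strides[-1] != 0:
        raise ValueError("a must be divisible by the largest stride")
    maps = [(a // s, a // s, c) for s, c in zip(strides, channels)]
    target = maps[1][0]            # resolution of the second feature map
    pooled = maps[0][0] // 2       # 2x average pooling of the first map
    upsampled = maps[2][0] * 2     # 2x deconvolution of the third map
    assert pooled == target == upsampled
    # Assumption: channel counts are preserved, so concatenation sums them.
    return maps, (target, target, sum(channels))


maps, fused = fused_shape(416)
print(maps)   # [(52, 52, 256), (26, 26, 512), (13, 13, 1024)]
print(fused)  # (26, 26, 1792)
```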
  • FIG. 4 is a schematic flowchart of the refinement steps of step 103. The specific steps of processing the overall feature map to obtain detection information of the predicted target in the image, where the detection information includes target position information and target category information, include:
  • Step 401 Perform multi-channel feature fusion on the overall feature map, then perform feature extraction and filter impurities on the integrated multi-channel feature map to obtain prediction information of the predicted target, which includes the predicted location information of the target and the predicted category information.
  • The target prediction information includes location information, confidence, and category information.
  • the convolutional neural network is used to predict the target information, which reduces the parameters.
  • The size of the overall feature map is (a/16)*(a/16)*H, where the value of H is determined by c, the number of target categories to be predicted.
  • The number of target categories is c, so the category information is a c-dimensional vector. The target in a candidate box is assigned to the category whose component of the c-dimensional vector has the largest value.
  • P(object) represents the probability that the target falls into a block;
  • its value is equal to the predicted confidence, namely:
  • The confidence value is used to determine whether an object is in the candidate box. To make this judgment more reliable, the IOU value also needs to be considered, so the location confidence is calculated as follows:
  • Each candidate box needs to predict the four location parameters x, y, w, h, one confidence value, and c category values.
  • x, y, w, h, confidence ∈ (0, 1); x and y represent the offset of the center point of the candidate box relative to the upper left corner of the block, and w and h represent the width and height of the candidate box relative to a (the preset image input size):
  • The class-specific confidence score of each candidate box is calculated as follows:
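
The equations introduced by the sentences above are not reproduced in this text. As a hedged reconstruction only, the standard YOLO formulation, which the surrounding definitions appear to follow, would read as below. The symbols IOU^truth_pred, P(class_k | object), confidence_loc, and score_k are assumptions borrowed from that formulation and are not taken from the source.

```latex
\begin{aligned}
\text{confidence} &= P(\text{object}) \\
\text{confidence}_{\text{loc}} &= P(\text{object}) \cdot \mathrm{IOU}^{\text{truth}}_{\text{pred}} \\
\text{score}_k &= P(\text{class}_k \mid \text{object}) \cdot P(\text{object}) \cdot \mathrm{IOU}^{\text{truth}}_{\text{pred}},
\qquad k = 1, \dots, c
\end{aligned}
```

Under this reading, the category assigned to a candidate box is the index k with the largest score_k.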
  • a loss function is used to optimize the prediction process.
  • the loss function is as follows:
  • coordError is the coordinate error
  • confidenceError is the confidence loss error
  • classError is the classification error
  • f_obj(i,j) is the indicator value for the target falling into the j-th box of the i-th block
  • f_noobj(i,j) is the indicator value for the target not falling into the j-th box of the i-th block
  • the λ_noord parameter is used to reduce the influence of the background of the detected picture on the loss function.
  • classification error classError is as follows:
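
The loss equations themselves did not survive in this text. The block below is only a sketch of how the three named terms are combined in a YOLO-style sum-of-squared-errors loss, written with the indicators f_obj(i,j) and f_noobj(i,j) defined above. The weights λ_coord and λ_noobj, the hatted ground-truth symbols, C for confidence, and p(k) for class probabilities are assumptions taken from the standard YOLO formulation; the weight that the text describes as reducing the influence of the background corresponds to λ_noobj in that formulation. The actual terms in the source may differ.

```latex
\begin{aligned}
\text{loss} &= \text{coordError} + \text{confidenceError} + \text{classError} \\
\text{coordError} &= \lambda_{\text{coord}} \sum_{i} \sum_{j} f_{\text{obj}}(i,j)
  \left[ (x_{ij} - \hat{x}_{ij})^2 + (y_{ij} - \hat{y}_{ij})^2
       + (w_{ij} - \hat{w}_{ij})^2 + (h_{ij} - \hat{h}_{ij})^2 \right] \\
\text{confidenceError} &= \sum_{i} \sum_{j} f_{\text{obj}}(i,j) \, (C_{ij} - \hat{C}_{ij})^2
  + \lambda_{\text{noobj}} \sum_{i} \sum_{j} f_{\text{noobj}}(i,j) \, (C_{ij} - \hat{C}_{ij})^2 \\
\text{classError} &= \sum_{i} \sum_{j} f_{\text{obj}}(i,j) \sum_{k=1}^{c} \left( p_{ij}(k) - \hat{p}_{ij}(k) \right)^2
\end{aligned}
```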
  • Step 402 Using non-maximum suppression method to filter out redundant information and output detection information of the predicted target, the detection information includes target position information and target category information.
  • The specific steps of filtering out redundant information using non-maximum suppression include: placing the positions of the candidate boxes b_1, b_2, b_3, ... into B, and placing the corresponding class-specific confidence scores into S.
  • b* is the candidate box with the highest probability value of belonging to one of the categories
  • b_i is a candidate box with a probability value of belonging to any category
  • Nt is the preset threshold of the intersection-over-union (IOU) ratio, that is, the threshold for non-maximum suppression.
  • If the IOU of b* and b_i is greater than the non-maximum suppression threshold, S_i is removed from S and b_i is removed from B. The maximum value s* in S and the corresponding prediction box b* are then selected again, and the remaining candidate boxes b_i are traversed again; whenever a traversed candidate box b_i meets the above condition, S_i is removed from S and b_i is removed from B. The remaining H candidate boxes and their class-specific confidence scores are output.
  • the output candidate box and the information carried are the detection results.
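
The suppression loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: boxes are given as corner coordinates (x1, y1, x2, y2) rather than the (x, y, w, h) form used in the text, all boxes are assumed to belong to one category, and the threshold value 0.5 for Nt is an arbitrary choice.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, nt=0.5):
    """Return indices of the boxes kept after greedy non-maximum suppression."""
    # Candidates sorted by class-specific confidence score, highest first.
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # b*: highest remaining score s*
        kept.append(best)
        # Drop every b_i whose IOU with b* exceeds the threshold Nt.
        remaining = [i for i in remaining if iou(boxes[best], boxes[i]) <= nt]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]: the second box overlaps the first too strongly
```

For multiple categories, the same routine would typically be run once per category on that category's scores.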
  • the method uses a convolutional neural network to extract image features of the input image.
  • the structure adopted by the convolutional neural network reduces the amount of computation without affecting the quality of image feature extraction, which improves the processing speed of target detection and realizes real-time detection of target position and category.
  • the feature map obtained by the technical solution is a fusion of multiple feature maps with different spatial resolutions.
  • the obtained feature map contains more image information, so the accuracy of detecting the target according to the feature map is high; that is, detection precision is guaranteed. Therefore, the technical solution simultaneously achieves high-precision and real-time detection of the target position and category.
  • FIG. 5 is a schematic flowchart of a target detection method according to another embodiment of the present invention.
  • The difference from the previous embodiment is that, before step 101, the method further includes step 104: adjusting the size of the input image.
  • the size of the input image is adjusted so that when the image features of the image are subsequently extracted, the size of the image is the same, which is convenient for operation.
  • This method first adjusts the size of the input image so that images of any size can be processed under the same conditions.
  • the image specifications of the convolutional neural network must also be set, and the size of the image can be adjusted to a preset value.
  • the image size of the convolutional neural network is set to 416*416, so the input image size should be set to 416*416.
  • FIG. 6 is a schematic flowchart of the refinement step of step 104.
  • the specific steps of adjusting the size of the input image in step 104 include:
  • Step 501 Reduce the size of the image whose size is larger than the preset value according to the bilinear interpolation algorithm.
  • a bilinear interpolation algorithm is used to reduce the image.
  • Step 502 Increase the size of the image whose size is smaller than the preset value according to the zero-filling method.
  • the image is increased by zero padding to increase the size of the image to the preset value.
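
Steps 501 and 502 can be sketched as follows for a single-channel image stored as a list of rows. The sketch makes several assumptions that the source does not specify: each axis larger than the preset value is shrunk independently (so the aspect ratio may change), pixel centers are aligned when sampling, and zero padding is added on the right and bottom only. All function names are hypothetical.

```python
def bilinear_resize(img, out_h, out_w):
    """Resize a 2D image with bilinear interpolation (pixel centers aligned)."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for oy in range(out_h):
        sy = min(max((oy + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        fy = sy - y0
        row = []
        for ox in range(out_w):
            sx = min(max((ox + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            fx = sx - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def zero_pad(img, out_h, out_w):
    """Pad with zeros on the right and bottom (placement is an assumption)."""
    padded = [list(row) + [0] * (out_w - len(row)) for row in img]
    padded += [[0] * out_w for _ in range(out_h - len(img))]
    return padded


def adjust_size(img, preset):
    """Shrink axes larger than preset (step 501), then zero-pad (step 502)."""
    h, w = len(img), len(img[0])
    if h > preset or w > preset:
        img = bilinear_resize(img, min(h, preset), min(w, preset))
    return zero_pad(img, preset, preset)


small = adjust_size([[1, 2], [3, 4]], 4)
large = adjust_size([[5] * 4 for _ in range(4)], 2)
print(small)
print(large)
```

With the preset value of 416 used in the text, `adjust_size(img, 416)` would return a 416*416 image for any input size.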
  • the scheme uses a convolutional neural network to extract image features of the input image.
  • the structure adopted by the convolutional neural network reduces the amount of computation without affecting the extraction of image features, so the processing speed of target detection is improved and real-time detection of target position and category is realized.
  • the feature map obtained by the technical solution is a fusion of multiple feature maps with different spatial resolutions.
  • the obtained feature map contains more image information, and the accuracy of detecting the target according to the feature map is high, that is, the detection accuracy is guaranteed. Therefore, the technical solution simultaneously achieves high-precision and real-time detection of the target position and category.
  • the size of the input image is adjusted so that the method can process images of any size, and target detection is performed on images of any size, which increases the detection range.
  • FIG. 8 is a schematic structural diagram of a target detection system according to another embodiment of the present invention.
  • the target detection system includes: an image feature extraction module 601, a feature processing module 602, and an acquisition result module 603.
  • the image feature extraction module 601 is used to extract the image features of the input image by using a convolutional neural network to obtain a plurality of feature maps with different spatial resolutions of the image.
  • Before the image feature extraction module 601 extracts the image features of the image, the image should be input into the system.
  • The image feature extraction module 601 can use different convolutional neural networks to obtain multiple feature maps of the image with different spatial resolutions, where each feature map carries image information, and feature maps with different spatial resolutions carry image information at different levels of the image.
  • the feature processing module 602 is used to adjust the spatial resolution of multiple feature maps uniformly, and splice the multiple feature maps into an overall feature map.
  • After the feature processing module 602 adjusts the spatial resolutions of the multiple feature maps to be consistent, the multiple feature maps can be spliced into one overall feature map.
  • A feature map can have multiple feature channels, and the three feature maps with different spatial resolutions have different numbers of image channels.
  • The feature processing module 602 stitches the three feature maps along the feature map channel to form an overall feature map, which is subsequently processed to obtain the location and category information of the predicted target in the image.
  • the acquisition result module 603 is used to process the overall feature map to obtain the detection information of the predicted target in the image.
  • the detection information includes target position information and target category information.
  • The predicted target is the target in the image for which the detection information is obtained.
  • The acquisition result module 603 processes the overall feature map by performing multi-channel feature fusion on it and using convolution to extract effective features and filter out impurities, so as to obtain the detection information of the predicted target, which includes target location information and target category information.
  • FIG. 9 is the overall network framework of the target detection system.
  • The system includes a downsampling module, a convolution feature extraction A module, a convolution feature extraction B module, a convolution feature extraction C module, a deconvolution module, an average pooling module, a Concate splicing module, a prediction (Yolo_predict) module, and a non-maximum suppression (NMS) algorithm filtering module.
  • the combination of multiple network modules constitutes the target detection system.
  • The value a in FIG. 9 is an initially set value. Within a certain range, the larger the value of a, the better the extraction effect of the entire network.
  • FIG. 10 is a schematic structural diagram of a refinement module of the image feature extraction module.
  • the refinement module of the image feature extraction module 601 includes: a first sampling module 701, a second sampling module 702, and a third sampling module 703;
  • the first sampling module 701 is used to obtain a first feature map of the image
  • the second sampling module 702 is used to obtain a second feature map of the image, and the spatial resolution of the second feature map is smaller than that of the first feature map;
  • the third sampling module 703 is used to obtain a third feature map of the image, and the spatial resolution of the third feature map is smaller than that of the second feature map.
  • the network structures of the first sampling module 701, the second sampling module 702, and the third sampling module 703 all have a concatenate structure.
  • FIG. 9 shows the positions of the first sampling module 701, the second sampling module 702, and the third sampling module 703 in the overall network frame diagram.
  • The structures of the modules used by the first sampling module 701, the second sampling module 702, and the third sampling module 703 are shown in FIG. 11, FIG. 12, FIG. 13, and FIG. 14.
  • Figure 11 is a schematic diagram of the downsampling module
  • Figure 12 is a schematic diagram of the convolution feature extraction module A
  • Figure 13 is a schematic diagram of the convolution feature extraction module B
  • Figure 14 is a schematic diagram of the C module for convolution feature extraction.
  • The left side of FIG. 11 is the detailed structure of the downsampling module, and the right side is a simplified diagram of the module.
  • In the downsampling module, multiple max-pooling (maxpool) structures of different sizes are used on the right to extract image features of different sizes. On the left, a 3*3 convolution extracts image features, and a 1*1 convolution then performs feature filtering and further information exchange between image channels. Finally, the image features extracted on the left and the image features extracted on the right are concatenated.
  • In this way, more features can be extracted.
  • The downsampling module increases the number of image channels of the feature map while reducing the number of parameters, which increases the processing speed.
  • The convolution feature extraction A module, the convolution feature extraction B module, and the convolution feature extraction C module are the same type of module. Each of the three modules has two different network sub-modules (blocks), which are also connected by a concate connection structure.
  • Each feature map has different network layers during the extraction process, so that the extracted feature map contains different image information, and the spatial resolution of each feature map is also different.
  • Each of the first sampling module 701, the second sampling module 702, and the third sampling module 703 is built from a combination of the downsampling module, the convolution feature extraction A module, the convolution feature extraction B module, and the convolution feature extraction C module.
  • the first sampling module 701, the second sampling module 702, and the third sampling module 703 are connected in sequence.
  • The first sampling module first obtains the first feature map of the image, the second sampling module then obtains the second feature map of the image, and the third sampling module finally obtains the third feature map of the image.
  • As the number of network layers increases, the spatial resolution of the acquired feature maps decreases.
  • the concatenate structure is used to increase the number of network channels and retain more image features.
  • the fusion of low-order features and high-order features brings richer semantic features.
  • The number of channels of the 3*3 convolutional layer is set to 6 times the number of channels of the 1*1 convolutional layer.
  • The advantage of this is that the larger number of channels in the 3*3 convolutional layer makes the extracted image features richer, while the smaller number of channels in the 1*1 convolutional layer gives it a channel compression effect. This ensures that the number of channels does not grow excessively as the network deepens, and the 1*1 layer also provides a certain feature extraction effect.
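
A short parameter count illustrates the channel compression effect. The sketch assumes a block in which a 1*1 layer with n output channels is followed by a 3*3 layer with 6n output channels, fed by a 6n-channel input, and compares it with a single 3*3 layer that keeps 6n channels throughout. The layer ordering, the input width, the omission of bias terms, and the value n = 32 are assumptions for illustration only.

```python
def conv_params(kernel, c_in, c_out):
    """Weight count of a kernel*kernel convolution, ignoring bias terms."""
    return kernel * kernel * c_in * c_out


def compressed_block_params(n, ratio=6):
    """1*1 layer squeezes ratio*n channels to n, 3*3 layer expands back."""
    wide = ratio * n
    return conv_params(1, wide, n) + conv_params(3, n, wide)


def plain_block_params(n, ratio=6):
    """A single 3*3 layer mapping ratio*n channels to ratio*n channels."""
    wide = ratio * n
    return conv_params(3, wide, wide)


n = 32  # hypothetical channel count of the 1*1 layer
print(compressed_block_params(n))  # 61440
print(plain_block_params(n))       # 331776
```

In this hypothetical setting the compressed block uses 60n² weights against 324n² for the plain layer, which is the kind of saving the channel compression is meant to deliver.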
  • FIG. 15 is a schematic structural diagram of a refinement module of a feature processing module.
  • the feature processing module 602 includes: an average pooling module 801, a deconvolution module 802, and a stitching module 803.
  • the average pooling module 801 is used to reduce the spatial resolution of the first feature map by using the average pooling method, so that the spatial resolution of the first feature map is consistent with the spatial resolution of the second feature map.
  • The spatial resolution of the first feature map is greater than that of the second feature map. Therefore, the average pooling module 801 reduces the spatial resolution of the first feature map so that it is consistent with the spatial resolution of the second feature map.
  • the average pooling module 801 reduces the spatial resolution by average pooling.
  • the deconvolution module 802 is configured to use deconvolution to increase the spatial resolution of the third feature map, so that the spatial resolution of the third feature map is consistent with the spatial resolution of the second feature map.
  • The spatial resolution of the third feature map is smaller than that of the second feature map. Therefore, the deconvolution module 802 increases the spatial resolution of the third feature map so that it is consistent with the spatial resolution of the second feature map. In the present invention, the deconvolution module 802 increases the spatial resolution by deconvolution.
  • the stitching module 803 is used to stitch the first feature map, the second feature map and the third feature map into a whole feature map.
  • The first feature map, the second feature map, and the third feature map all have different numbers of image channels, and the stitching module 803 concatenates the first feature map, the second feature map, and the third feature map along the feature channel to form an overall feature map.
  • the overall feature map contains both high-order features and low-order features, effectively improving the prediction results.
  • FIG. 16 is a schematic diagram of multi-scale feature map fusion in the embodiment of the present invention.
  • The size of the input image is a*a*3 (the value of a is the preset value), and the sizes of the output first feature map, second feature map, and third feature map are (a/8)*(a/8)*256, (a/16)*(a/16)*512, and (a/32)*(a/32)*1024, respectively.
  • The (a/8)*(a/8)*256 feature map is average-pooled by the average pooling module 801 to reduce its spatial resolution, the (a/32)*(a/32)*1024 feature map is deconvolved by the deconvolution module 802 to increase its spatial resolution, and finally the stitching module 803 stitches the three feature maps together and merges them into an overall feature map.
  • FIG. 17 is a schematic structural diagram of the average pooling module, and FIG. 18 is a schematic structural diagram of the deconvolution module. Both modules have a convolution structure.
  • The average pooling module 801 has an average pooling structure (Avgpool), which reduces the spatial resolution of the first feature map, and the deconvolution module 802 has a deconvolution structure (Deconvolution), which increases the spatial resolution of the third feature map.
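
The two resolution adjustments and the final stitching can be sketched on tiny single-channel maps. The sketch assumes 2*2 average pooling with stride 2 and a 2*2 transposed convolution with stride 2; the all-ones deconvolution kernel stands in for learned weights, and the one-channel-per-map simplification is for illustration only. None of these values come from the source.

```python
def avg_pool2x2(channel):
    """2*2 average pooling with stride 2 (height and width must be even)."""
    h, w = len(channel), len(channel[0])
    return [
        [
            (channel[y][x] + channel[y][x + 1]
             + channel[y + 1][x] + channel[y + 1][x + 1]) / 4
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]


def deconv2x2(channel, kernel):
    """Transposed convolution, 2*2 kernel, stride 2: every input value writes
    a scaled copy of the kernel into its own 2*2 block of the output."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for y in range(h):
        for x in range(w):
            for ky in range(2):
                for kx in range(2):
                    out[2 * y + ky][2 * x + kx] += channel[y][x] * kernel[ky][kx]
    return out


first = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]  # highest resolution
second = [[0.5, 1.5], [2.5, 3.5]]                                        # target resolution
third = [[3.0]]                                                          # lowest resolution
ones = [[1.0, 1.0], [1.0, 1.0]]  # placeholder for a learned kernel

pooled = avg_pool2x2(first)
upsampled = deconv2x2(third, ones)
fused = [pooled, second, upsampled]  # concatenation along the channel axis
print(pooled)
print(upsampled)
```

After both adjustments, all three maps share the 2*2 resolution of the second map, so they can be stacked as channels of one overall feature map.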
  • FIG. 19 is a schematic structural diagram of the refinement modules of the acquisition result module.
  • The acquisition result module 603 includes a prediction processing module 901 and a filtering module 902:
  • The prediction processing module 901 is used for performing multi-channel feature fusion on the overall feature map, and then performing feature extraction on the fused multi-channel feature map and filtering out impurities to obtain prediction information of the predicted target, where the prediction information includes the predicted location information and the predicted category information of the target.
  • FIG. 20 is a schematic structural diagram of a Yolo_predict prediction module.
  • the Yolo_predict prediction module includes a residual prediction module (res_predict_block) and a 1*1 convolutional layer.
  • the specific structure of the residual prediction module is shown in FIG. 21, and FIG. 21 is a schematic structural diagram of the residual prediction module.
  • a convolutional neural network is used to predict the target.
  • the steps in the residual prediction module include: using 1*1 convolution to perform multi-channel feature fusion on the overall feature map, and then using 3*3 convolution to further extract effective image features and filter out impurities to obtain the target prediction information.
  • the prediction information includes location information, confidence, and category information.
  • a convolutional neural network is used to predict the target information, which reduces the number of parameters.
  • the size of the overall feature map is (a/16)*(a/16)*H, where the value of H is determined by the number of target categories to be predicted, and c is the number of target categories.
  • Let a1 = a/16 and divide the image into a1*a1 blocks, corresponding to a1*a1 vectors of length H. Each block generates 9 candidate boxes, and each candidate box includes: location information (x, y, w, h), a confidence (the probability that an object falls in this candidate box), and category information.
  • the number of target categories is c, so the category information is a c-dimensional vector; the dimension with the largest value gives the category of the target in the candidate box.
  • P(object) represents the probability that a target falls into a block
  • this value is equal to the predicted confidence, namely P(object) = confidence.
  • the confidence is used to decide whether an object is in the candidate box; to make this decision more reliable, the IOU value is also taken into account when the position confidence is calculated.
  • Each candidate box needs to predict the four location parameters x, y, w, h, one confidence value, and c category values.
  • x, y, w, h, confidence ∈ (0, 1); x and y represent the offset of the center point of the candidate box relative to the upper left corner of its block, and w and h represent the width and height of the candidate box relative to a (the preset image input size).
  • the class-specific confidence score, i.e. the probability that a candidate box belongs to a given category, is calculated for every candidate box.
  • the prediction process of the prediction processing module 901 is optimized using a loss function.
  • the loss function is as follows:
  • coordError is the coordinate error
  • confidenceError is the confidence loss error
  • classError is the classification error
  • f obj (i,j) indicates whether the target falls into the j-th box of the i-th block
  • f noobj (i,j) indicates whether the target does not fall into the j-th box of the i-th block
  • classification error classError is as follows:
  • the filtering module 902 is used to filter out redundant prediction information by using a non-maximum suppression method, and output detection information of the prediction target, where the detection information includes target position information and target category information.
  • the specific steps by which the filtering module 902 filters out redundant information using non-maximum suppression include: placing the positions of the candidate boxes b 1 , b 2 , b 3 , ... into a set B, and placing the corresponding class-specific confidence score values into a set S.
  • b * is the candidate box with the largest class-specific confidence score among all categories
  • b i is any other candidate box in B
  • Nt is the preset threshold on the intersection over union (IOU), that is, the non-maximum suppression threshold.
  • if the IOU of b * and b i is greater than the non-maximum suppression threshold Nt, remove s i from S and b i from B; then select the maximum value s * in S again together with its prediction box b * and traverse the remaining candidate boxes b i once more, removing s i from S and b i from B whenever a traversed box meets the above condition. The candidate boxes that remain, together with their class-specific confidence score values, are output.
  • the output candidate boxes and the information they carry are the detection results.
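The non-maximum suppression procedure described for the filtering module 902 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the corner box format `(x1, y1, x2, y2)`, the function names, and the default threshold `nt=0.5` are assumptions, and the sketch ignores categories (all boxes compete in one pool).

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, nt=0.5):
    """Greedy NMS: return the indices of the boxes that survive.

    boxes plays the role of the set B, scores the role of the set S,
    and nt is the IOU threshold Nt (0.5 is only an illustrative default).
    """
    # Indices sorted by descending score, so the first entry is always b*.
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # current b* with the maximum score s*
        kept.append(best)
        # Drop every b_i whose IOU with b* exceeds the threshold Nt.
        remaining = [i for i in remaining if iou(boxes[best], boxes[i]) <= nt]
    return kept
```

With three boxes where the second heavily overlaps the first, only the first and third indices are returned.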
  • the image feature extraction module 601 uses a convolutional neural network to extract the image features of the input image.
  • the structure adopted by the convolutional neural network reduces the amount of calculation without affecting the quality of image feature extraction, which improves the processing speed of target detection and enables real-time detection of target position and category.
  • the feature map obtained in the feature processing module 602 is a fusion of multiple feature maps with different spatial resolutions, so it contains more image information.
  • the obtaining result module 603 detects the target with high accuracy based on this feature map, which guarantees the detection accuracy. Therefore, the target detection system achieves high-precision and real-time detection of target positions and categories at the same time.
  • FIG. 22 is a schematic structural diagram of a target detection system according to another embodiment of the present invention. The difference from the previous embodiment is that the target detection system further includes: an adjustment module 604.
  • the adjustment module 604 adjusts the size of the input image so that all images have the same size when the subsequent image feature extraction module 601 extracts their features, which simplifies processing. Because the system first adjusts the size of the input image, images of any size can be processed under the same conditions.
  • to adjust the size of the image, the image size processed by the convolutional neural network must also be set for the entire system, so that an input image of any size can be adjusted to this preset value.
  • the input image size of the convolutional neural network is set to 416*416, so the adjustment module 604 adjusts the input image to 416*416.
  • FIG. 23 is a schematic structural diagram of a detailed module of the adjustment module.
  • the adjustment module 604 includes a reduction module 1001 and an increase module 1002:
  • the reduction module 1001 is configured to reduce the size of an image whose size is larger than a preset value according to a bilinear interpolation algorithm.
  • the reduction module 1001 uses a bilinear interpolation algorithm to reduce the image.
  • the increase module 1002 is used to enlarge an image whose size is smaller than the preset value by zero filling.
  • the increase module 1002 pads the image with zeros until its size reaches the preset value.
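The two branches of the adjustment module 604 can be sketched as below for a single-channel image stored as a list of rows. This is a sketch under stated assumptions: the function names are hypothetical, padding is placed at the right and bottom (the text does not say where the zeros go), and an image that exceeds the preset size in either dimension is resized as a whole.

```python
def zero_pad(image, size):
    """Copy a 2-D image into the top-left corner of a size x size canvas of zeros."""
    canvas = [[0.0] * size for _ in range(size)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            canvas[r][c] = value
    return canvas


def bilinear_resize(image, size):
    """Resize a 2-D image to size x size with bilinear interpolation."""
    src_h, src_w = len(image), len(image[0])
    out = []
    for r in range(size):
        # Map the centre of the output pixel back into source coordinates.
        y = min(max((r + 0.5) * src_h / size - 0.5, 0.0), src_h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, src_h - 1)
        fy = y - y0
        row = []
        for c in range(size):
            x = min(max((c + 0.5) * src_w / size - 0.5, 0.0), src_w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, src_w - 1)
            fx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def adjust(image, size=416):
    """Reduce larger images by bilinear interpolation, zero-fill smaller ones."""
    if len(image) > size or len(image[0]) > size:
        return bilinear_resize(image, size)
    return zero_pad(image, size)
```

A color image would apply the same operation to each channel; the preset value 416 comes from the text, everything else is illustrative.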
  • the image feature extraction module 601 uses a convolutional neural network to extract image features of the input image.
  • the structure adopted by the convolutional neural network reduces the amount of calculation without affecting the quality of image feature extraction, which improves the processing speed of target detection and enables real-time detection of target position and category.
  • the feature map obtained in the feature processing module 602 is a fusion of multiple feature maps with different spatial resolutions.
  • the fused feature map contains more image information, and the obtaining result module 603 detects the target with high accuracy based on it, which guarantees the detection accuracy. Therefore, the target detection system achieves high-precision and real-time detection of target positions and categories at the same time.
  • the adjustment module 604 adjusts the size of the input image so that the method can process images of any size, and performs target detection on images of any size, increasing the detection range.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a target detection method in the field of computer vision. The method comprises: extracting image features of an input image using a convolutional neural network (CNN) to obtain a plurality of feature maps of the image with different spatial resolutions; adjusting the spatial resolutions of the plurality of feature maps to be consistent and stitching them into an overall feature map; and processing the overall feature map to obtain detection information on a predicted target in the image, the detection information including target position information and target class information. The CNN structure reduces the amount of calculation and improves the processing speed of target detection. Meanwhile, the overall feature map is a fusion of the plurality of feature maps with different spatial resolutions, so it contains more image information, and targets are detected from it with high precision.

Description

Target detection method and system
Technical field
The invention relates to the field of computer vision, and in particular to a target detection method and system.
Background
As society moves into the information age, the world is experiencing an information explosion and constantly faces the problem of information overload. In 2001 alone, the global data volume reached 1.8 ZB, equivalent to more than 200 GB of data for every person in the world. This growth is still accelerating, and experts predict that data will keep growing by 50% per year over the next few years. Today, users of major e-commerce, video, and other platforms generate massive amounts of data every day, most of it picture data. Picture data can be divided into single-target picture data and multi-target picture data. Single-target picture data can generally be handled by image classification and recognition, while multi-target picture data is generally handled by multi-target detection methods. With the development of computer technology and the wide application of computer vision principles in daily life, target detection has broad application value in many fields, such as intelligent transportation systems, intelligent monitoring systems, autonomous driving, military target detection, and surgical instrument positioning in medical navigation surgery.
Existing target detection algorithms that predict the position and category of a target directly in order to build a real-time detection network have the following shortcomings: low model accuracy, low recall, and poor detection of small targets. Existing target detection algorithms cannot achieve high accuracy and real-time detection at the same time.
Technical problem
The main purpose of the present invention is to provide a target detection method and system that solve the technical problem that existing target detection technology cannot achieve high accuracy and real-time detection at the same time.
Technical solution
To achieve the above objective, a first aspect of the present invention provides a target detection method, the method including:
extracting image features of an input image using a convolutional neural network to obtain multiple feature maps of the image with different spatial resolutions;
adjusting the spatial resolutions of the multiple feature maps to be consistent, and stitching the multiple feature maps into one overall feature map;
processing the overall feature map to obtain detection information of a predicted target in the image, where the detection information includes target position information and target category information.
A second aspect of the present invention provides a target detection system. The system includes: an image feature extraction module, a feature processing module, and an obtaining result module;
the image feature extraction module is used to extract image features of an input image using a convolutional neural network to obtain multiple feature maps of the image with different spatial resolutions;
the feature processing module is used to adjust the spatial resolutions of the multiple feature maps to be consistent, and to stitch the multiple feature maps into one overall feature map;
the obtaining result module is used to process the overall feature map to obtain detection information of a predicted target in the image, where the detection information includes target position information and target category information.
Beneficial effects
As can be seen from the above technical solution, first, the solution uses a convolutional neural network to extract image features of the input image. The structure adopted by the convolutional neural network reduces the amount of calculation without affecting the quality of image feature extraction, which improves the processing speed of target detection and enables real-time detection of target position and category. Second, the feature map obtained by the solution is a fusion of multiple feature maps with different spatial resolutions, so it contains more image information, and targets are detected from it with high accuracy, which guarantees the detection accuracy. The solution therefore achieves high-precision and real-time detection of target position and category at the same time.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a target detection method provided by an embodiment of the present invention;
FIG. 2 is a schematic flowchart of the detailed steps of step 101;
FIG. 3 is a schematic flowchart of the detailed steps of step 102;
FIG. 4 is a schematic flowchart of the detailed steps of step 103;
FIG. 5 is a schematic flowchart of a target detection method provided by another embodiment of the present invention;
FIG. 6 is a schematic flowchart of the detailed steps of step 104;
FIG. 7 is a schematic diagram of multiple pixels represented on coordinate axes;
FIG. 8 is a schematic structural diagram of a target detection system provided by another embodiment of the present invention;
FIG. 9 is an overall network framework diagram of the target detection system;
FIG. 10 is a schematic structural diagram of the detailed modules of the image feature extraction module;
FIG. 11 is a schematic diagram of the downsampling module;
FIG. 12 is a schematic diagram of the convolution feature extraction A module;
FIG. 13 is a schematic diagram of the convolution feature extraction B module;
FIG. 14 is a schematic diagram of the convolution feature extraction C module;
FIG. 15 is a schematic structural diagram of the detailed modules of the feature processing module;
FIG. 16 is a schematic diagram of multi-scale feature map fusion in an embodiment of the present invention;
FIG. 17 is a schematic structural diagram of the average pooling module;
FIG. 18 is a schematic structural diagram of the deconvolution module;
FIG. 19 is a schematic structural diagram of the detailed modules of the obtaining result module;
FIG. 20 is a schematic structural diagram of the Yolo_predict prediction module;
FIG. 21 is a schematic structural diagram of the residual prediction module;
FIG. 22 is a schematic structural diagram of a target detection system provided by another embodiment of the present invention;
FIG. 23 is a schematic structural diagram of the detailed modules of the adjustment module.
Embodiments of the invention
To make the purpose, features, and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative work fall within the protection scope of the present invention.
The prior art cannot achieve high accuracy and real-time detection at the same time. To solve this technical problem, the present invention proposes a target detection method.
Please refer to FIG. 1, which is a schematic flowchart of a target detection method provided by an embodiment of the present invention. The target detection method includes:
Step 101: extract image features of an input image using a convolutional neural network to obtain multiple feature maps of the image with different spatial resolutions.
Before image features are extracted, the image is input into the system. Using several convolutional neural network structures, multiple feature maps of the image with different spatial resolutions can be obtained. The feature maps carry image information, and feature maps of different spatial resolutions carry image information at different levels.
Step 102: adjust the spatial resolutions of the multiple feature maps to be consistent, and stitch the multiple feature maps into one overall feature map.
Because inconsistent spatial resolutions would prevent the feature maps from being stitched, the spatial resolutions of the multiple feature maps are first adjusted to be consistent, after which the feature maps can be stitched into one overall feature map.
Note that a feature map can have multiple feature channels, and the three feature maps of different spatial resolutions have different numbers of channels. The three feature maps are stitched along the feature map channels to form one overall feature map, which is then processed to obtain the position and category information of the predicted target in the image.
Step 103: process the overall feature map to obtain detection information of the predicted target in the image, where the detection information includes target position information and target category information.
The predicted target is a target in the image for which detection information is produced. Processing the feature map means performing multi-channel feature fusion on the overall feature map and using convolution to extract effective features and filter out impurities, thereby obtaining the detection information of the predicted target, which includes target position information and target category information.
FIG. 2 is a schematic flowchart of the detailed steps of step 101. As shown in FIG. 2, the specific steps of extracting image features of the input image using a convolutional neural network to obtain multiple feature maps of the image with different spatial resolutions include:
Step 201: obtain a first feature map of the image;
Step 202: obtain a second feature map of the image, where the spatial resolution of the second feature map is smaller than that of the first feature map;
Step 203: obtain a third feature map of the image, where the spatial resolution of the third feature map is smaller than that of the first feature map.
Multiple max pooling (maxpool) structures of different sizes are used in obtaining each feature map, extracting image features at different scales. The left branch extracts image features with a 3*3 convolution and then applies a 1*1 convolution to filter the features and exchange information between channels; finally, the features extracted by the left branch and those extracted by the right branch are concatenated. On the one hand, more features can be extracted. On the other hand, the number of channels of the feature map increases while the number of parameters decreases, which increases the processing speed.
The concatenate structure increases the number of channels, and the maxpool structure effectively reduces the number of parameters. In the network structure used, the 1*1 layers of two different network sub-modules (blocks) are also connected by a concatenate structure. Because each feature map is extracted at a different network depth, the feature maps contain different image information and have different spatial resolutions.
In the present invention, the three feature maps are obtained in sequence; as the network gets deeper, the spatial resolution of the feature maps obtained decreases.
Note that, in the present invention, after the 3*3 convolution the concatenate structure is used to increase the number of network channels and retain more image features; fusing low-level and high-level features yields richer semantic features.
At the same time, combining multiple max pooling structures of different scales with the fusion of multiple feature maps of different spatial resolutions yields rich image semantic information, which helps predict targets of different sizes and improves detection accuracy.
FIG. 3 is a schematic flowchart of the detailed steps of step 102. As shown in FIG. 3, the specific steps of adjusting the spatial resolutions of the multiple feature maps to be consistent and stitching them into one overall feature map include:
Step 301: reduce the spatial resolution of the first feature map by average pooling, so that it matches the spatial resolution of the second feature map.
The spatial resolution of the first feature map is greater than that of the second feature map, so it is reduced until the two match. In the present invention, the spatial resolution is reduced by average pooling.
Step 302: increase the spatial resolution of the third feature map by deconvolution, so that it matches the spatial resolution of the second feature map.
The spatial resolution of the third feature map is smaller than that of the second feature map, so it is increased until the two match. In the present invention, the spatial resolution is increased by deconvolution.
Step 303: stitch the first feature map, the second feature map, and the third feature map together and fuse them into one overall feature map.
The first, second, and third feature maps have different numbers of channels. They are concatenated along the feature map channel dimension to form one overall feature map, which contains both high-level and low-level features and effectively improves the prediction results.
In the embodiment of the present invention, the input image size is a*a*3 (a is a preset value), and the sizes of the output first, second, and third feature maps are (a/8)*(a/8)*256, (a/16)*(a/16)*512, and (a/32)*(a/32)*1024, respectively.
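Steps 301 to 303 can be illustrated with a small sketch of the shape bookkeeping. It rests on assumptions not stated in the text: the average pooling uses a 2*2 window with stride 2, the deconvolution doubles height and width, and neither operation changes the number of channels. The function names are hypothetical.

```python
def avg_pool_2x2(channel):
    """Halve the height and width of one channel (a list of rows) by averaging 2x2 windows."""
    return [
        [
            (channel[r][c] + channel[r][c + 1]
             + channel[r + 1][c] + channel[r + 1][c + 1]) / 4.0
            for c in range(0, len(channel[0]) - 1, 2)
        ]
        for r in range(0, len(channel) - 1, 2)
    ]


def fused_shape(a):
    """Shape (height, width, channels) of the overall feature map for an a*a*3 input."""
    first = (a // 8, a // 8, 256)
    second = (a // 16, a // 16, 512)
    third = (a // 32, a // 32, 1024)
    # Step 301: average pooling halves the first map's height and width.
    pooled = (first[0] // 2, first[1] // 2, first[2])
    # Step 302: deconvolution doubles the third map's height and width
    # (assumed here to keep all 1024 channels).
    upsampled = (third[0] * 2, third[1] * 2, third[2])
    assert pooled[:2] == second[:2] == upsampled[:2]
    # Step 303: concatenation along the channel axis adds the channel counts.
    return (second[0], second[1], pooled[2] + second[2] + upsampled[2])
```

Under these assumptions, `fused_shape(416)` gives `(26, 26, 1792)`, i.e. the three maps meet at the (a/16)*(a/16) resolution of the second feature map.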
FIG. 4 is a schematic flowchart of the detailed steps of step 103. As shown in FIG. 4, the steps of processing the overall feature map to obtain detection information of the predicted target in the image, where the detection information includes target position information and target category information, include:
Step 401: perform multi-channel feature fusion on the overall feature map, then perform feature extraction on the fused feature map to filter out impurities and obtain prediction information of the predicted target, where the prediction information includes the predicted location information and the predicted category information of the target.
Specifically, a 1*1 convolution performs multi-channel feature fusion on the overall feature map, and a 3*3 convolution then further extracts effective image features and filters out impurities to obtain the prediction information of the target, which includes location information, confidence, and category information. Using a convolutional neural network to predict the target information reduces the number of parameters.
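To see why a 1*1 convolution acts as multi-channel feature fusion, note that it computes, at every spatial position, a weighted sum over the input channels. The sketch below shows only this operation; the function name and data layout are assumptions, and bias terms, activation functions, and the subsequent 3*3 convolution are omitted.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution.

    feature_map: nested lists of shape H x W x C_in.
    weights:     nested lists of shape C_out x C_in (one kernel per output channel).
    Returns nested lists of shape H x W x C_out.
    """
    return [
        [
            # Each output channel mixes all input channels of the same pixel.
            [sum(w * v for w, v in zip(kernel, pixel)) for kernel in weights]
            for pixel in row
        ]
        for row in feature_map
    ]
```

Because the kernel covers a single pixel, the spatial resolution is unchanged and only the channel dimension is remapped, which is what lets the layer merge the channels contributed by the three stitched feature maps.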
In the embodiment of the present invention, after the above convolutional neural network prediction processing, the size of the overall feature map is (a/16)*(a/16)*H, where the value of H is determined by the number of target categories to be predicted, c. Let a1 = a/16 and divide the image into a1*a1 blocks, corresponding to a1*a1 vectors of length H. Each block generates 9 candidate boxes, and each candidate box includes: location information (x, y, w, h), a confidence (the probability that an object falls in this candidate box), and category information. There are c target categories, so the category information is a c-dimensional vector; the dimension with the largest value gives the category of the target in the candidate box.
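The layout of each H-length vector can be sketched as follows. The text does not give a formula for H; the sketch assumes H = 9 * (5 + c), i.e. each of the 9 candidate boxes stores x, y, w, h, a confidence, and c category values in that order. The function names and the ordering inside the vector are assumptions.

```python
BOXES_PER_BLOCK = 9  # number of candidate boxes per block, as stated in the text


def vector_length(num_classes):
    """Assumed length H of one block's vector: 9 boxes x (4 location + 1 confidence + c classes)."""
    return BOXES_PER_BLOCK * (5 + num_classes)


def split_block_vector(vector, num_classes):
    """Split one block's H-length vector into its 9 candidate boxes."""
    per_box = 5 + num_classes
    boxes = []
    for k in range(BOXES_PER_BLOCK):
        chunk = vector[k * per_box:(k + 1) * per_box]
        x, y, w, h, confidence = chunk[:5]
        class_values = chunk[5:]
        # The category is the dimension of the c-dimensional vector with the largest value.
        best_class = max(range(num_classes), key=lambda i: class_values[i])
        boxes.append({"box": (x, y, w, h), "confidence": confidence, "class": best_class})
    return boxes
```

For example, with c = 20 categories this assumption gives H = 225, so a 416*416 input would produce a 26*26*225 prediction map.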
P(object) denotes the probability that a target falls in a block; its value equals the predicted confidence, that is:
P(object) = confidence
The confidence is used to decide whether an object is in the candidate box. To make this decision more reliable, the IOU value is also taken into account, so the position confidence is calculated as follows:
Figure PCTCN2018119132-appb-000001
Here,
Figure PCTCN2018119132-appb-000002
denotes the intersection over union (IOU) of the candidate box and the ground-truth box, and P confidence is the value of the position confidence.
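The formula itself is only available as an image, so the following is a sketch under an assumption: it uses the usual YOLO-style product, position confidence = P(object) * IOU, which matches the quantities named in the text. The box format `(center_x, center_y, width, height)` and the function names are illustrative.

```python
def iou_xywh(box_a, box_b):
    """IOU of two boxes given as (center_x, center_y, width, height)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2.0, box_a[1] - box_a[3] / 2.0
    ax2, ay2 = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx1, by1 = box_b[0] - box_b[2] / 2.0, box_b[1] - box_b[3] / 2.0
    bx2, by2 = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def position_confidence(p_object, candidate_box, truth_box):
    """Assumed form: P(object) multiplied by the IOU of candidate and ground-truth box."""
    return p_object * iou_xywh(candidate_box, truth_box)
```

A candidate box that coincides with the ground-truth box keeps its full P(object), while a box with no overlap gets a position confidence of 0.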
If a target (object) falls in a block, P(object) = 1; otherwise P(object) = 0. Each candidate box needs to predict the four location parameters x, y, w, h, one confidence value, and c category values.
Here x, y, w, h, confidence ∈ (0, 1); x and y are the offsets of the candidate box's center point relative to the upper left corner of its block, and w and h are the width and height of the candidate box relative to a (the preset input image size):
Figure PCTCN2018119132-appb-000003
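One way to read these definitions is sketched below: it converts the normalized predictions of one candidate box back into pixel coordinates. The exact formula is in the figure and is not reproduced here, so the sketch makes its own assumptions: x and y are fractions of the block side length a/a1, w and h are fractions of a, and the block is addressed by a hypothetical `(row, col)` index.

```python
def decode_box(x, y, w, h, row, col, a=416, a1=26):
    """Convert normalized predictions to pixel corners (x1, y1, x2, y2).

    x, y: offsets of the box center from the block's upper left corner,
          assumed to be fractions of the block side length.
    w, h: box width and height as fractions of the input size a.
    row, col: index of the block in the a1 x a1 grid (hypothetical addressing).
    """
    block = a / a1                # side length of one block in pixels
    center_x = (col + x) * block  # the block's upper left corner is (col * block, row * block)
    center_y = (row + y) * block
    width, height = w * a, h * a
    return (center_x - width / 2.0, center_y - height / 2.0,
            center_x + width / 2.0, center_y + height / 2.0)
```

With a = 416 and a1 = 26, each block is 16 pixels wide, so a box centered in block (12, 12) with w = h = 0.5 spans pixels 96 to 304 in both directions.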
每一个候选框属于哪一个类别的概率(class-specific confidence score)值的计算如下:The calculation of the class-specific confidence score of each candidate box belongs to:
Figure PCTCN2018119132-appb-000004
Figure PCTCN2018119132-appb-000004
Figure PCTCN2018119132-appb-000005
Figure PCTCN2018119132-appb-000005
It should be noted that during training the value of
Figure PCTCN2018119132-appb-000006
can be computed, but during prediction it cannot, so in the prediction process the value of
Figure PCTCN2018119132-appb-000007
is set to 1 by default.
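The role of this default can be illustrated with a short sketch. The equation images are not reproduced, so two things are assumed here: that the quantity fixed to 1 at prediction time is the IOU with the ground-truth box, and that the class-specific confidence score is the product of the class probability, P(object) and that IOU. The function name is hypothetical.

```python
def class_specific_scores(class_probs, p_object, iou=1.0):
    """Class-specific confidence scores for one candidate box.

    Assumed form: score_i = P(class_i | object) * P(object) * IOU.
    During training the IOU with the ground-truth box can be passed in;
    during prediction no ground truth exists, so iou keeps its default of 1.
    """
    return [p * p_object * iou for p in class_probs]
```

At prediction time the call is simply `class_specific_scores(class_probs, confidence)`, so the score reduces to the class probability scaled by the predicted confidence.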
In this embodiment of the present invention, a loss function is used to optimize the prediction process so that the predictions are more accurate. The loss function is as follows:
Figure PCTCN2018119132-appb-000008
where coordError is the coordinate error, confidenceError is the confidence error, and classError is the classification error.
The coordinate error coordError is expressed as follows:
Figure PCTCN2018119132-appb-000009
where f_obj(i,j) indicates that the target falls into the j-th box of the i-th block, and f_noobj(i,j) indicates that the target does not fall into the j-th box of the i-th block. When the target falls into the j-th box of the i-th block, f_obj(i,j) = 1 and f_noobj(i,j) = 0; when it does not, f_obj(i,j) = 0 and f_noobj(i,j) = 1. λ_coord is a weight coefficient whose value can be configured.
The confidence error confidenceError is expressed as follows:
Figure PCTCN2018119132-appb-000010
where the λ_noobj parameter is used to reduce the influence of detected image background on the loss function.
The classification error classError is expressed as follows:
Figure PCTCN2018119132-appb-000011
After optimization with the above loss function, the predictions are more accurate than before optimization.
Step 402: Filter out redundant information using non-maximum suppression and output the detection information of the predicted target, which includes target position information and target class information.
The specific steps for filtering out redundant information with non-maximum suppression are as follows. The candidate box positions are placed in module B, which contains b_1, b_2, b_3, …, and the class-specific confidence scores are placed in module S.
The prediction box b* corresponding to the largest value s* in module S is selected, and the remaining candidate boxes are traversed and checked against the following condition:
IOU(b*, b_i) > N_t
where b* is the candidate box with the largest probability of belonging to one of the classes, b_i is a candidate box with a probability of belonging to any class, and N_t is the preset IOU threshold, i.e. the non-maximum suppression threshold.
When the above condition is satisfied, that is, when the IOU of b* and b_i exceeds the non-maximum suppression threshold, the corresponding s_i is removed from module S and b_i is removed from module B. The largest value s* and its prediction box b* are then selected again from module S, and the remaining candidate boxes b_i are traversed again; if a traversed candidate box b_i satisfies the above condition, s_i is removed from S and b_i from B. The remaining candidate boxes and class-specific confidence scores are then output. The output candidate boxes and the information they carry constitute the detection result.
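The filtering procedure described above can be sketched as follows. This is an illustrative sketch rather than the implementation of the embodiment: it uses the common greedy form of non-maximum suppression, which repeats the select-and-traverse step until no candidates remain and sets each selected b* aside as a kept detection. The corner-coordinate box format and the function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, n_t):
    """Greedy NMS over candidate boxes B and scores S with IOU threshold N_t.

    Returns the kept (box, score) pairs in descending score order.
    """
    # Pair each score with its box and order by score, highest first.
    remaining = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    kept = []
    while remaining:
        s_star, b_star = remaining.pop(0)  # largest remaining score s* and its box b*
        kept.append((b_star, s_star))
        # Remove every b_i whose IOU with b* exceeds the threshold N_t.
        remaining = [(s, b) for s, b in remaining if iou(b_star, b) <= n_t]
    return kept
```

For example, with boxes (0, 0, 10, 10), (1, 1, 11, 11) and (20, 20, 30, 30), scores 0.9, 0.8 and 0.7, and N_t = 0.5, the second box overlaps the first with an IOU of about 0.68 and is removed, while the first and third boxes are output.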
As can be seen from the target detection method provided by the present invention in FIG. 1: first, the method uses a convolutional neural network to extract image features from the input image; the structure of this network reduces the amount of computation without impairing feature extraction, which increases the processing speed of target detection and enables real-time detection of target position and class. Second, the feature map obtained by this technical solution is a fusion of multiple feature maps with different spatial resolutions; it contains more image information, and targets are detected from it with high precision, so detection accuracy is ensured. The technical solution therefore achieves both high-precision and real-time detection of target position and class.
Please refer to FIG. 5, which is a schematic flowchart of a target detection method provided by another embodiment of the present invention. It differs from the previous embodiment in that the method further includes, before step 101: Step 104: Adjust the size of the input image.
Adjusting the size of the input image ensures that all images have the same size when their features are subsequently extracted, which simplifies processing. The method first resizes the input image so that images of any size can be processed under the same conditions. When resizing, the image specification of the convolutional neural network must also be set, and every image is resized to this preset value.
In this embodiment of the present invention, the image size of the convolutional neural network is set to 416*416, so every input image must be resized to 416*416.
As shown in FIG. 6, which is a flowchart of the sub-steps of step 104, the specific steps for adjusting the size of the input image in step 104 include:
Step 501: Reduce the size of an image that is larger than the preset value using bilinear interpolation.
Among the input images, an image whose original size is larger than the preset value is reduced using bilinear interpolation.
In this embodiment of the present invention, an image whose original size is larger than 416*416*3 is reduced using bilinear interpolation. As shown in FIG. 7, which is a schematic diagram of several pixels plotted on coordinate axes, f is the pixel value of a pixel, and the value of the function f is known at the four points Q_11 = (x_1, y_1), Q_12 = (x_1, y_2), Q_21 = (x_2, y_1) and Q_22 = (x_2, y_2). Linear interpolation is first performed in the x direction, which gives:
Figure PCTCN2018119132-appb-000012
where R_1 = (x, y_1),
Figure PCTCN2018119132-appb-000013
where R_2 = (x, y_2).
Linear interpolation is then performed in the y direction, which gives:
Figure PCTCN2018119132-appb-000014
Combining the above interpolation results in the x and y directions gives the final bilinear interpolation result:
Figure PCTCN2018119132-appb-000015
Because bilinear interpolation of an image uses only the four adjacent pixels, the denominators in the above formulas all equal 1.
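The two-stage interpolation described above can be sketched as follows. The equation images are not reproduced here, so the sketch uses the standard bilinear interpolation formulas, which match the definitions of Q_11, Q_12, Q_21, Q_22, R_1 and R_2 in the text; the function and argument names are illustrative.

```python
def bilinear_interpolate(f11, f12, f21, f22, x1, y1, x2, y2, x, y):
    """Interpolate f at (x, y) from its values at four surrounding points.

    f11 = f(Q_11) = f(x1, y1), f12 = f(Q_12) = f(x1, y2),
    f21 = f(Q_21) = f(x2, y1), f22 = f(Q_22) = f(x2, y2).
    """
    # Linear interpolation in the x direction.
    f_r1 = (x2 - x) / (x2 - x1) * f11 + (x - x1) / (x2 - x1) * f21  # at R_1 = (x, y1)
    f_r2 = (x2 - x) / (x2 - x1) * f12 + (x - x1) / (x2 - x1) * f22  # at R_2 = (x, y2)
    # Linear interpolation in the y direction between R_1 and R_2.
    return (y2 - y) / (y2 - y1) * f_r1 + (y - y1) / (y2 - y1) * f_r2
```

For adjacent pixels, x2 - x1 = 1 and y2 - y1 = 1, so every denominator equals 1, as noted above. For example, with pixel values 0, 10, 20 and 30 at Q_11, Q_12, Q_21 and Q_22 on the unit square, the value at the center (0.5, 0.5) is 15.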
Step 502: Increase the size of an image that is smaller than the preset value by zero padding.
Among the input images, an image whose original size is smaller than the preset value is enlarged by zero padding until its size reaches the preset value.
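Zero padding to the preset size can be sketched as follows for a single-channel image stored as a list of rows. The text does not state where the original image is placed inside the padded one; placing it in the top-left corner is an assumption of this sketch, and the function name is hypothetical.

```python
def zero_pad(image, size):
    """Pad a 2-D image (list of rows) with zeros to size x size.

    The original pixels are copied to the top-left corner (assumed placement).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    if height > size or width > size:
        raise ValueError("image is larger than the target size")
    padded = [[0] * size for _ in range(size)]
    for r in range(height):
        for c in range(width):
            padded[r][c] = image[r][c]
    return padded
```

A three-channel image would be padded channel by channel in the same way.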
As can be seen from the target detection method provided by this embodiment of the present invention: first, the solution uses a convolutional neural network to extract image features from the input image; the structure of this network reduces the amount of computation without impairing feature extraction, which increases the processing speed of target detection and enables real-time detection of target position and class. Second, the feature map obtained by this technical solution is a fusion of multiple feature maps with different spatial resolutions; it contains more image information, and targets are detected from it with high precision, so detection accuracy is ensured. The technical solution therefore achieves both high-precision and real-time detection of target position and class. Third, resizing the input image allows the method to process images of any size and perform target detection on them, which widens the range of images that can be detected.
Please refer to FIG. 8, which is a schematic structural diagram of a target detection system provided by another embodiment of the present invention. The target detection system includes an image feature extraction module 601, a feature processing module 602, and a result acquisition module 603.
The image feature extraction module 601 is configured to use a convolutional neural network to extract image features from the input image and obtain multiple feature maps of the image with different spatial resolutions.
Before the image feature extraction module 601 extracts the image features, the image must be input into the system. Using different convolutional neural networks, the image feature extraction module 601 can obtain multiple feature maps of the image with different spatial resolutions. The feature maps carry image information, and feature maps with different spatial resolutions carry image information at different levels of the image.
The feature processing module 602 is configured to make the spatial resolutions of the multiple feature maps consistent and to concatenate the feature maps into one overall feature map.
Because inconsistent spatial resolutions would interfere with the subsequent concatenation, the feature processing module 602 first makes the spatial resolutions of the feature maps consistent and then concatenates them into one overall feature map.
It should be noted that a feature map can have multiple feature channels, and the three feature maps with different spatial resolutions have different numbers of channels. The feature processing module 602 concatenates the three feature maps along the channel dimension to form one overall feature map, which is subsequently processed to obtain the position and class information of the predicted target in the image.
The result acquisition module 603 is configured to process the overall feature map and obtain the detection information of the predicted target in the image, which includes target position information and target class information.
The predicted target is a target in the image for which detection information is produced. The result acquisition module 603 processes the overall feature map by performing multi-channel feature fusion on it and using convolution to extract effective features and filter out noise, thereby obtaining the detection information of the predicted target, which includes target position information and target class information.
It should be noted that the overall network framework of the target detection system is shown in FIG. 9. The system includes a downsampling module, a convolution feature extraction A module, a convolution feature extraction B module, a convolution feature extraction C module, a deconvolution module, an average pooling module, a Concate concatenation module, a prediction (Yolo_predict) module, and a non-maximum suppression (NMS) filtering module. The combination of these network modules constitutes the target detection system.
It should be noted that a in FIG. 9 is an initially set value; within a certain range, the larger the value of a, the better the feature extraction of the whole network.
Further, as shown in FIG. 10, which is a schematic structural diagram of the sub-modules of the image feature extraction module, the image feature extraction module 601 includes a first sampling module 701, a second sampling module 702, and a third sampling module 703;
the first sampling module 701 is configured to obtain a first feature map of the image;
the second sampling module 702 is configured to obtain a second feature map of the image, the spatial resolution of which is lower than that of the first feature map;
the third sampling module 703 is configured to obtain a third feature map of the image, the spatial resolution of which is lower than that of the second feature map.
The network structures of the first sampling module 701, the second sampling module 702, and the third sampling module 703 all contain a concatenate structure. FIG. 9 marks the positions of the first sampling module 701, the second sampling module 702, and the third sampling module 703 in the overall network framework. The structures of the modules they use are shown in FIG. 11, FIG. 12, FIG. 13, and FIG. 14: FIG. 11 is a schematic diagram of the downsampling module, FIG. 12 of the convolution feature extraction A module, FIG. 13 of the convolution feature extraction B module, and FIG. 14 of the convolution feature extraction C module.
It should be noted that the left side of FIG. 11 shows the detailed structure of the downsampling module and the right side shows a simplified diagram of the module. In the detailed structure, the right branch uses several max pooling (maxpool) structures of different sizes to extract image features at different scales, while the left branch uses a 3*3 convolution to extract image features and then a 1*1 convolution for feature filtering and further information exchange between channels. Finally, the image features extracted by the left and right branches are concatenated. On the one hand, this extracts more features. On the other hand, the downsampling module increases the number of channels of the feature map while reducing the number of parameters and increasing processing speed.
It should also be noted that, as shown in FIG. 12, FIG. 13, and FIG. 14, the convolution feature extraction A, B, and C modules are modules of the same type. In each of the three, the 1*1 layers of two different network sub-modules (blocks) are also connected by a concate connection structure. Because each feature map passes through a different number of network layers during extraction, the extracted feature maps contain different image information and have different spatial resolutions.
As can be seen from FIG. 9, each of the first sampling module 701, the second sampling module 702, and the third sampling module 703 uses a combination of the downsampling module and the convolution feature extraction A, B, and C modules. The three sampling modules are connected in sequence: the first sampling module obtains the first feature map of the image, the second sampling module then obtains the second feature map, and finally the third sampling module obtains the third feature map. As the network deepens, the spatial resolution of the feature maps obtained decreases.
It should be noted that in the present invention, after the 3*3 convolution, a concatenate structure is used to increase the number of network channels and retain more image features; fusing low-level and high-level features yields richer semantic features.
It should also be noted that when the parameters of this model are set, the number of channels of a 3*3 convolutional layer is set to 6 times that of a 1*1 convolutional layer. The advantage is that the larger number of channels in the 3*3 convolutional layer makes the extracted image features richer, while the smaller number of channels in the 1*1 convolutional layer compresses the channels, so that the channel count does not grow excessively as the network deepens, while still providing some feature extraction.
As shown in FIG. 15, which is a schematic structural diagram of the sub-modules of the feature processing module, the feature processing module 602 includes an average pooling module 801, a deconvolution module 802, and a concatenation module 803.
The average pooling module 801 is configured to reduce the spatial resolution of the first feature map by average pooling so that it matches the spatial resolution of the second feature map.
Because the spatial resolution of the first feature map is higher than that of the second feature map, the average pooling module 801 reduces the spatial resolution of the first feature map to match that of the second feature map. In the present invention, the average pooling module 801 does so by average pooling.
The deconvolution module 802 is configured to increase the spatial resolution of the third feature map by deconvolution so that it matches the spatial resolution of the second feature map.
Because the spatial resolution of the third feature map is lower than that of the second feature map, the deconvolution module 802 increases the spatial resolution of the third feature map to match that of the second feature map. In the present invention, the deconvolution module 802 does so by deconvolution.
The concatenation module 803 is configured to concatenate the first, second, and third feature maps and fuse them into one overall feature map.
The first, second, and third feature maps have different numbers of channels. The concatenation module 803 concatenates them along the channel dimension to form one overall feature map. The overall feature map contains both high-level and low-level features, which effectively improves the prediction results.
In this embodiment of the present invention, as shown in FIG. 16, which is a schematic diagram of multi-scale feature map fusion, the input image has size a*a*3 (a is a preset value), and the output first, second, and third feature maps have sizes (a/8)*(a/8)*256, (a/16)*(a/16)*512, and (a/32)*(a/32)*1024, respectively. The (a/8)*(a/8)*256 feature map is average-pooled by the average pooling module 801 to reduce its spatial resolution, the (a/32)*(a/32)*1024 feature map is deconvolved by the deconvolution module 802 to increase its spatial resolution, and finally the concatenation module 803 concatenates the three feature maps and fuses them into one overall feature map.
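The shape arithmetic of this fusion can be checked with a short sketch. Two assumptions are made that the text does not state: average pooling and deconvolution each change the spatial size by a factor of 2 and leave the channel count unchanged, so the concatenated channel count is simply the sum of the three. The function name is hypothetical.

```python
def fused_feature_shape(a):
    """Return the (height, width, channels) of the concatenated feature map."""
    first = (a // 8, a // 8, 256)
    second = (a // 16, a // 16, 512)
    third = (a // 32, a // 32, 1024)
    # Average pooling halves the spatial size of the first feature map.
    pooled = (first[0] // 2, first[1] // 2, first[2])
    # Deconvolution doubles the spatial size of the third feature map.
    upsampled = (third[0] * 2, third[1] * 2, third[2])
    if not (pooled[:2] == second[:2] == upsampled[:2]):
        raise ValueError("spatial resolutions do not match; a should be a multiple of 32")
    # Concatenation along the channel dimension adds the channel counts.
    return (second[0], second[1], pooled[2] + second[2] + upsampled[2])
```

For a = 416 the three maps are 52*52*256, 26*26*512 and 13*13*1024, and under these assumptions the fused map is 26*26*1792.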
It should be noted that FIG. 17 is a schematic structural diagram of the average pooling module and FIG. 18 is a schematic structural diagram of the deconvolution module. Both modules contain convolution structures. The average pooling module 801 contains an average pooling structure (Avgpool), which reduces the spatial resolution of the first feature map, and the deconvolution module 802 contains a deconvolution structure (Deconvolution), which increases the spatial resolution of the third feature map.
Further, as shown in FIG. 19, which is a schematic structural diagram of the sub-modules of the result acquisition module, the result acquisition module 603 includes a prediction processing module 901 and a filtering module 902:
The prediction processing module 901 is configured to perform multi-channel feature fusion on the overall feature map and then perform feature extraction on the fused overall feature map to filter out noise, obtaining the prediction information of the predicted target, which includes the predicted position information and predicted class information of the target.
Specifically, as shown in FIG. 20, which is a schematic structural diagram of the Yolo_predict prediction module, the Yolo_predict prediction module includes a residual prediction module (res_predict_block) and a 1*1 convolutional layer. The specific structure of the residual prediction module is shown in FIG. 21, which is a schematic structural diagram of the residual prediction module. In the residual prediction module, a convolutional neural network is used to predict the target.
The steps in the residual prediction module include: performing multi-channel feature fusion on the overall feature map with a 1*1 convolution, then using a 3*3 convolution to further extract effective image features and filter out noise, obtaining the prediction information of the target, which includes position information, confidence, and class information. Using a convolutional neural network to predict the target information reduces the number of parameters.
In this embodiment of the present invention, after the convolutional neural network prediction processing of the feature processing module 602, the overall feature map has size (a/16)*(a/16)*H, where the value of H is determined by the number of target classes to be predicted and c is the number of target classes. Let a1 = a/16; the image is divided into a1*a1 blocks, corresponding to a1*a1 vectors of length H. Each block generates 9 candidate boxes, and each candidate box includes position information (x, y, w, h), a confidence (the probability that an object falls in this candidate box), and class information. With c target classes, the class information is a c-dimensional vector, and the target of the candidate box belongs to the class whose component in this vector is largest.
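How one block's vector might be split into its candidate boxes is sketched below. The text does not give H explicitly; the sketch assumes H = 9 * (4 + 1 + c), that is, nine boxes each holding four position values, one confidence and c class values, stored back to back in that order. Both the layout and the function name are assumptions.

```python
def split_block_vector(vector, num_classes, boxes_per_block=9):
    """Split one block's length-H vector into per-box predictions.

    Assumed layout per box: [x, y, w, h, confidence, class_1 ... class_c].
    """
    per_box = 4 + 1 + num_classes
    if len(vector) != boxes_per_block * per_box:
        raise ValueError("vector length does not match the assumed layout")
    boxes = []
    for k in range(boxes_per_block):
        chunk = vector[k * per_box:(k + 1) * per_box]
        class_values = chunk[5:]
        boxes.append({
            "position": tuple(chunk[:4]),
            "confidence": chunk[4],
            # The box is assigned to the class with the largest component.
            "class_index": max(range(num_classes), key=lambda i: class_values[i]),
        })
    return boxes
```

Under this assumption, 20 target classes would give H = 9 * 25 = 225.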
P(object) denotes the probability that the target falls within a block. Its value equals the predicted confidence, that is:
P(object) = confidence
Because the confidence is used to judge whether an object lies inside the candidate box, the IOU value must also be considered to make this judgment more reliable. The position confidence is therefore calculated as follows:
Figure PCTCN2018119132-appb-000016
where
Figure PCTCN2018119132-appb-000017
denotes the intersection over union (IOU) of the candidate box and the ground-truth box, and P_confidence is the value of the position confidence.
If an object falls within a block, then P(object) = 1; otherwise P(object) = 0. Each candidate box must predict the four position parameters x, y, w and h, one confidence value, and c class values.
Here x, y, w, h, confidence ∈ (0, 1). x and y are the offsets of the candidate box's center point relative to the top-left corner of the block containing it, and w and h are the width and height of the candidate box relative to the value a (the preset input image size):
Figure PCTCN2018119132-appb-000018
The probability that each candidate box belongs to a given class (the class-specific confidence score) is calculated as follows:
Figure PCTCN2018119132-appb-000019
Figure PCTCN2018119132-appb-000020
It should be noted that during training the value of
Figure PCTCN2018119132-appb-000021
can be computed, but during prediction it cannot, so in the prediction process the value of
Figure PCTCN2018119132-appb-000022
is set to 1 by default.
In this embodiment of the present invention, a loss function is used to optimize the prediction process of the prediction processing module 901 so that the predictions are more accurate. The loss function is as follows:
Figure PCTCN2018119132-appb-000023
where coordError is the coordinate error, confidenceError is the confidence error, and classError is the classification error.
The coordinate error coordError is expressed as follows:
Figure PCTCN2018119132-appb-000024
where f_obj(i,j) indicates that the target falls into the j-th box of the i-th block, and f_noobj(i,j) indicates that the target does not fall into the j-th box of the i-th block. When the target falls into the j-th box of the i-th block, f_obj(i,j) = 1 and f_noobj(i,j) = 0; when it does not, f_obj(i,j) = 0 and f_noobj(i,j) = 1. λ_coord is a weight coefficient whose value can be configured.
The expression for the confidence loss error confidenceError is as follows:
Figure PCTCN2018119132-appb-000025
where the parameter λ_noobj is used to reduce the influence of detected image background on the loss function.
The expression for the classification error classError is as follows:
Figure PCTCN2018119132-appb-000026
After optimization with the above loss function, the prediction results are more accurate than those obtained before optimization.
The filtering module 902 is configured to filter out redundant prediction information using non-maximum suppression and to output the detection information of the predicted target, the detection information including target position information and target category information.
The specific steps by which the filtering module 902 filters out redundant information using non-maximum suppression are as follows: the positions of the candidate boxes are placed in a set B, where B includes b_1, b_2, b_3, …, and the class-specific confidence scores are placed in a set S.
The prediction box b* corresponding to the largest value s* in S is selected, and the remaining candidate boxes are traversed to check whether a traversed candidate box b_i satisfies the following condition:
IOU(b*, b_i) > N_t
where b* is the candidate box whose probability of belonging to one of the categories is the largest, b_i is a candidate box with a probability of belonging to any category, and N_t is the preset intersection-over-union (IoU) threshold, i.e. the non-maximum suppression threshold.
If the condition is satisfied, i.e. the IoU of b* and b_i is greater than the non-maximum suppression threshold, the corresponding s_i is removed from S and b_i is removed from B. The largest remaining value s* in S and its corresponding prediction box b* are then selected again, the remaining candidate boxes b_i are traversed again, and every traversed candidate box b_i that satisfies the above condition is removed from B together with its s_i from S. The remaining candidate boxes and their class-specific confidence scores are output. The output candidate boxes and the information they carry are the detection result.
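The filtering steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of the filtering module 902: the box layout `(x_min, y_min, x_max, y_max)`, the function names `iou` and `non_max_suppression`, and the choice to return indices of the kept boxes are all assumptions made for the example. The sketch also assumes that each selected box b* is kept and not selected again, which the text implies but does not state.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: List[Box], scores: List[float], n_t: float) -> List[int]:
    """Return the indices of the boxes that survive greedy suppression."""
    # Indices into B and S, ordered by descending class-specific confidence score.
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # index of s* and its box b*
        kept.append(best)
        # Remove every b_i whose IoU with b* exceeds the threshold N_t.
        remaining = [i for i in remaining if iou(boxes[best], boxes[i]) <= n_t]
    return kept
```

In a multi-class detector this routine would typically be run once per category, using that category's scores.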
As can be seen from the target detection system provided by the present invention and shown in FIG. 8: first, the image feature extraction module 601 uses a convolutional neural network to extract the image features of the input image. The structure of this convolutional neural network reduces the amount of computation without impairing the extraction of image features, which increases the processing speed of target detection and enables real-time detection of target position and category. Second, the feature map obtained in the feature processing module 602 is a fusion of multiple feature maps with different spatial resolutions and therefore contains more image information, so the acquisition result module 603 detects targets from this feature map with high accuracy, which ensures the detection accuracy. The target detection system therefore achieves both high-accuracy and real-time detection of target position and category.
Please refer to FIG. 22, which is a schematic structural diagram of a target detection system provided by another embodiment of the present invention. It differs from the previous embodiment in that the target detection system further includes an adjustment module 604.
The adjustment module 604 adjusts the size of the input image so that all images have the same size when the image feature extraction module 601 subsequently extracts their image features, which simplifies processing. The system first resizes the input image so that images of any size can be processed under the same conditions. For this purpose, the image size handled by the convolutional neural network throughout the system must also be set, and any input image is resized to this preset value.
In this embodiment of the present invention, the image size of the convolutional neural network is set to 416*416, so every input image is resized to 416*416 by the adjustment module 604.
As shown in FIG. 23, which is a schematic structural diagram of the sub-modules of the adjustment module, the adjustment module 604 includes a reduction module 1001 and an increase module 1002:
The reduction module 1001 is configured to reduce, using a bilinear interpolation algorithm, the size of an image whose size is larger than the preset value.
That is, for an input image whose original size is larger than the preset value, the reduction module 1001 shrinks the image using bilinear interpolation.
In this embodiment of the present invention, for an image whose original size is larger than 416*416*3, the reduction module 1001 shrinks the image using bilinear interpolation. As shown in FIG. 7, which is a schematic diagram of several pixels plotted on coordinate axes, f is the pixel value of a pixel, and the values of the function f are known at the four points Q_11 = (x_1, y_1), Q_12 = (x_1, y_2), Q_21 = (x_2, y_1) and Q_22 = (x_2, y_2). Linear interpolation is first performed in the x direction, giving:
Figure PCTCN2018119132-appb-000027
where R_1 = (x, y_1)
Figure PCTCN2018119132-appb-000028
where R_2 = (x, y_2)
Linear interpolation is then performed in the y direction, giving:
Figure PCTCN2018119132-appb-000029
Combining the interpolation results in the x and y directions gives the final result:
Figure PCTCN2018119132-appb-000030
Because the bilinear interpolation of the image uses only four adjacent points, whose spacing is one pixel in each direction, the denominators in the above formulas are all 1.
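The interpolation steps can be illustrated with a short Python function. The formulas themselves appear only as figures in the source, so this sketch assumes the standard textbook form of bilinear interpolation: two linear interpolations along x at y_1 and y_2, followed by one along y. The function name `bilinear` and its argument order are hypothetical.

```python
def bilinear(f11, f21, f12, f22, x1, y1, x2, y2, x, y):
    """Interpolate f at (x, y) from its values at four surrounding points.

    f11 = f(Q_11) at (x1, y1), f21 = f(Q_21) at (x2, y1),
    f12 = f(Q_12) at (x1, y2), f22 = f(Q_22) at (x2, y2).
    """
    # Linear interpolation in the x direction at y1 (point R_1) and y2 (point R_2).
    f_r1 = ((x2 - x) * f11 + (x - x1) * f21) / (x2 - x1)
    f_r2 = ((x2 - x) * f12 + (x - x1) * f22) / (x2 - x1)
    # Linear interpolation in the y direction between R_1 and R_2.
    return ((y2 - y) * f_r1 + (y - y1) * f_r2) / (y2 - y1)
```

For adjacent pixels, x2 - x1 and y2 - y1 are both 1, so each division is by 1, which matches the remark about the denominators.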
The increase module 1002 is configured to increase, by zero padding, the size of an image whose size is smaller than the preset value.
That is, for an input image whose original size is smaller than the preset value, the increase module 1002 enlarges the image by zero padding until its size reaches the preset value.
As can be seen from the target detection system provided by this embodiment of the present invention: first, the image feature extraction module 601 uses a convolutional neural network to extract the image features of the input image. The structure of this convolutional neural network reduces the amount of computation without impairing the extraction of image features, which increases the processing speed of target detection and enables real-time detection of target position and category. Second, the feature map obtained in the feature processing module 602 is a fusion of multiple feature maps with different spatial resolutions and therefore contains more image information, so the acquisition result module 603 detects targets from this feature map with high accuracy, which ensures the detection accuracy. The target detection system therefore achieves both high-accuracy and real-time detection of target position and category. Third, the adjustment module 604 adjusts the size of the input image, so that images of any size can be processed and subjected to target detection, which widens the detection range.
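Zero padding can be sketched as follows. The source does not say where the zeros are placed, so this example assumes they are appended on the right and at the bottom. It also works on a single channel stored as a list of rows, so a three-channel image would be padded channel by channel. The function name `zero_pad` is hypothetical.

```python
def zero_pad(img, target_h, target_w):
    """Pad a 2-D image (list of rows) with zeros up to target_h x target_w."""
    h, w = len(img), len(img[0])
    if h > target_h or w > target_w:
        raise ValueError("image is larger than the target size")
    # Extend each row on the right, then append all-zero rows at the bottom.
    padded = [row + [0] * (target_w - w) for row in img]
    padded += [[0] * target_w for _ in range(target_h - h)]
    return padded
```

With the preset size used in this embodiment, the call would be `zero_pad(channel, 416, 416)` for each channel of an image smaller than 416*416.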
Sequence Listing Free Text
The above is a description of the target detection method and system provided by the present invention. Those skilled in the art may, following the ideas of the embodiments of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (8)

  1. A target detection method, characterized in that the method comprises:
    extracting image features of an input image using a convolutional neural network to obtain a plurality of feature maps of the image with different spatial resolutions;
    adjusting the spatial resolutions of the plurality of feature maps to be consistent, and concatenating the plurality of feature maps into one overall feature map;
    processing the overall feature map to obtain detection information of a predicted target in the image, the detection information comprising target position information and target category information.
  2. The method according to claim 1, characterized in that the step of extracting image features of the input image using a convolutional neural network to obtain a plurality of feature maps of the image with different spatial resolutions comprises:
    acquiring a first feature map of the image;
    acquiring a second feature map of the image, the spatial resolution of the second feature map being lower than the spatial resolution of the first feature map;
    acquiring a third feature map of the image, the spatial resolution of the third feature map being lower than the spatial resolution of the second feature map.
  3. The method according to claim 2, characterized in that the step of adjusting the spatial resolutions of the plurality of feature maps to be consistent, and concatenating the plurality of feature maps into one overall feature map comprises:
    reducing the spatial resolution of the first feature map by average pooling, so that the spatial resolution of the first feature map is consistent with the spatial resolution of the second feature map;
    increasing the spatial resolution of the third feature map by deconvolution, so that the spatial resolution of the third feature map is consistent with the spatial resolution of the second feature map;
    concatenating the first feature map, the second feature map and the third feature map, and fusing them into one overall feature map.
  4. The method according to claim 3, characterized in that the step of processing the overall feature map to obtain detection information of a predicted target in the image, the detection information comprising target position information and target category information, comprises:
    performing multi-channel feature fusion on the overall feature map, and then performing feature extraction on the overall feature map whose multi-channel features have been fused, so as to filter out impurities and obtain prediction information of the predicted target, the prediction information comprising predicted position information and predicted category information of the target;
    filtering out redundant information using non-maximum suppression, and outputting the detection information of the predicted target, the detection information comprising target position information and target category information.
  5. The method according to claim 1, characterized in that, before the step of extracting image features of the input image using a convolutional neural network, the method further comprises: adjusting the size of the input image.
  6. The method according to claim 5, characterized in that the step of adjusting the size of the input image comprises:
    reducing, using a bilinear interpolation algorithm, the size of an image whose size is larger than a preset value;
    increasing, by zero padding, the size of an image whose size is smaller than the preset value.
  7. A target detection system, characterized in that the system comprises: an image feature extraction module, a feature processing module and an acquisition result module;
    the image feature extraction module is configured to extract image features of an input image using a convolutional neural network to obtain a plurality of feature maps of the image with different spatial resolutions;
    the feature processing module is configured to adjust the spatial resolutions of the plurality of feature maps to be consistent, and to concatenate the plurality of feature maps into one overall feature map;
    the acquisition result module is configured to process the overall feature map to obtain detection information of a predicted target in the image, the detection information comprising target position information and target category information.
  8. The system according to claim 7, characterized in that the image feature extraction module comprises: a first sampling module, a second sampling module and a third sampling module;
    the first sampling module is configured to acquire a first feature map of the image;
    the second sampling module is configured to acquire a second feature map of the image, the spatial resolution of the second feature map being lower than the spatial resolution of the first feature map;
    the third sampling module is configured to acquire a third feature map of the image, the spatial resolution of the third feature map being lower than the spatial resolution of the second feature map.
PCT/CN2018/119132 2018-12-04 2018-12-04 Target detection method and system WO2020113412A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/119132 WO2020113412A1 (en) 2018-12-04 2018-12-04 Target detection method and system

Publications (1)

Publication Number Publication Date
WO2020113412A1 true WO2020113412A1 (en) 2020-06-11

Family

ID=70974847

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/119132 WO2020113412A1 (en) 2018-12-04 2018-12-04 Target detection method and system

Country Status (1)

Country Link
WO (1) WO2020113412A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018184195A1 (en) * 2017-04-07 2018-10-11 Intel Corporation Joint training of neural networks using multi-scale hard example mining
CN108647585A (en) * 2018-04-20 2018-10-12 浙江工商大学 A kind of traffic mark symbol detection method based on multiple dimensioned cycle attention network
CN108898078A (en) * 2018-06-15 2018-11-27 上海理工大学 A kind of traffic sign real-time detection recognition methods of multiple dimensioned deconvolution neural network

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832559A (en) * 2020-06-19 2020-10-27 浙江大华技术股份有限公司 Target detection method and device, storage medium and electronic device
CN111951268A (en) * 2020-08-11 2020-11-17 长沙大端信息科技有限公司 Parallel segmentation method and device for brain ultrasonic images
CN111951268B (en) * 2020-08-11 2024-06-07 深圳蓝湘智影科技有限公司 Brain ultrasound image parallel segmentation method and device
CN112085022A (en) * 2020-09-09 2020-12-15 上海蜜度信息技术有限公司 Method, system and equipment for recognizing characters
CN112085022B (en) * 2020-09-09 2024-02-13 上海蜜度科技股份有限公司 Method, system and equipment for recognizing characters
CN112435295A (en) * 2020-11-12 2021-03-02 浙江大华技术股份有限公司 Blackbody position detection method, electronic device and computer-readable storage medium
CN112529956A (en) * 2020-11-12 2021-03-19 浙江大华技术股份有限公司 Blackbody position detection method, electronic device and computer-readable storage medium
CN112989992A (en) * 2021-03-09 2021-06-18 北京百度网讯科技有限公司 Target detection method and device, road side equipment and cloud control platform
CN112989992B (en) * 2021-03-09 2023-12-15 阿波罗智联(北京)科技有限公司 Target detection method and device, road side equipment and cloud control platform
CN113269795A (en) * 2021-06-03 2021-08-17 南京耘瞳科技有限公司 Identification method based on scrap steel carriage area
CN114612770A (en) * 2022-03-21 2022-06-10 贵州大学 Article detection method based on convolutional neural network
CN114612770B (en) * 2022-03-21 2024-02-20 贵州大学 Article detection method based on convolutional neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18942444

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28/09/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18942444

Country of ref document: EP

Kind code of ref document: A1