WO2022217551A1 - Target detection method and apparatus - Google Patents

Target detection method and apparatus

Info

Publication number
WO2022217551A1
Authority
WO
WIPO (PCT)
Prior art keywords
candidate frame
candidate
feature
confidence
image
Prior art date
Application number
PCT/CN2021/087584
Other languages
English (en)
French (fr)
Inventor
黄梓钊
江立辉
周凯强
秘谧
王鑫
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to CN202180096547.4A (publication CN117203678A)
Priority to PCT/CN2021/087584
Publication of WO2022217551A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology

Definitions

  • the present application relates to the field of computer vision, and more particularly, to a method and apparatus for object detection.
  • Computer vision is an integral part of various intelligent or autonomous systems in various application fields such as manufacturing, inspection, document analysis, and medical diagnostics.
  • Object detection is a popular direction in computer vision and digital image processing. It is widely used in many fields such as robot navigation, intelligent video surveillance, industrial inspection, aerospace, and assisted driving, and reducing manual effort through computer vision is of great practical significance.
  • Object detection is an important branch of computer vision and image processing. With the wide application of deep learning, object detection algorithms have developed rapidly, and general-purpose object detection networks are based on deep learning.
  • However, a deep learning network relies on a large number of floating-point operations, generally billions per inference. The more floating-point operations a network model requires, the greater the challenge to the hardware. In addition, the hardware needs to complete dozens of model inferences per second, which imposes a heavy burden on it.
  • Hardware with strong computing power is expensive, which is unfavorable for commercial use, so high-computing-power hardware is rarely deployed in typical commercial applications. Moreover, memory is very limited: if high-resolution images are fed directly into the network model, memory may overflow and the model becomes unusable.
  • Meanwhile, the performance requirements on the target detection network are very high, especially in small-target scenarios such as assisted driving or security, which may require detecting, in a high-resolution image (for example, 1080p), small objects whose short side spans only 4 pixels.
  • To detect small targets, the resolution of the input image can be increased, but this causes a surge in computation and memory usage and thus higher hardware cost, which is not conducive to commercial applications. Alternatively, shallow feature information can be extracted from high-resolution feature maps, but this approach depends on the resolution of the input image: if the input resolution is low, it cannot solve the small-target detection problem.
  • the present application provides a target detection method and device, which can improve the performance of a target detection network for small target detection under the conditions of limited computing power and memory.
  • In a first aspect, a target detection method is provided, comprising: acquiring a feature image of an input image; detecting targets in the input image according to the feature image to obtain multiple candidate frames and the confidence of each of the multiple candidate frames; selecting one or more first candidate frames from the multiple candidate frames according to the confidence of each candidate frame; extracting a first feature from the original image corresponding to the input image according to the one or more first candidate frames; extracting a second feature from the feature image of the input image according to the one or more first candidate frames; fusing the first feature and the second feature to obtain a fused feature; and detecting the targets in the one or more first candidate frames according to the fused feature.
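  • As an illustration of this flow, the following is a minimal Python sketch under stated assumptions: all helper functions (extract_feature_image, first_stage_detect, crop_and_extract, fuse, second_stage_detect) are hypothetical placeholders for the network components described in this application, and the confidence-only screening is just one of the screening options discussed below.

```python
def detect(original_image, input_image, conf_threshold=0.5):
    """Sketch of the two-pass detection flow of the first aspect.

    All helpers below are hypothetical placeholders, not functions
    defined by this application.
    """
    # Acquire the feature image of the input image (e.g., CNN backbone/FPN).
    feature_image = extract_feature_image(input_image)

    # First detection: candidate frames plus a confidence for each frame.
    candidates = first_stage_detect(feature_image)  # [(box, confidence), ...]

    # Screen first candidate frames according to confidence.
    first_candidates = [box for box, conf in candidates if conf >= conf_threshold]

    results = []
    for box in first_candidates:
        # Extract the first feature from the corresponding original-image region
        # and the second feature from the corresponding feature-image region.
        first_feature = crop_and_extract(original_image, box)
        second_feature = crop_and_extract(feature_image, box)
        # Fuse the two features and run the second detection on the fused feature.
        fused_feature = fuse(first_feature, second_feature)
        results.append(second_stage_detect(fused_feature, box))
    return results
```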
  • In the target detection method of the present application, targets that have already been detected once are detected a second time, which can greatly improve the detection accuracy of small targets and complex targets.
  • Moreover, the target detection method of the present application screens the results of the first detection before performing the secondary detection, and performs secondary detection only on the targets that need it, which reduces the computational burden of the target detection network.
  • In addition, the target detection method of the present application fuses features from the original image with features from the feature image during secondary detection. Since the original image has a higher resolution, the accuracy of target detection can be improved. Furthermore, by the time the high-resolution original-image features are introduced in the secondary detection, the large memory occupied by the earlier stages of the target detection network has already been released, so introducing these features does not impose a heavy memory burden on the network.
  • the method further includes: detecting the target in the input image according to the feature image of the input image, so as to obtain the size and category of each candidate frame.
  • selecting one or more first candidate frames from the multiple candidate frames according to the confidence of each candidate frame includes: determining a candidate frame whose confidence is greater than or equal to a first confidence threshold as a first candidate frame, and/or determining a candidate frame whose size is smaller than a first size threshold as a first candidate frame, and/or determining a candidate frame whose category is a first preset category as a first candidate frame.
  • The target detection method of the present application can screen candidate frames whose confidence is greater than or equal to the first confidence threshold as first candidate frames: a target in a low-confidence candidate frame is most likely a false detection, so there is no need to perform the second detection on it and it can be discarded directly, while only the candidate frames whose confidence is greater than or equal to the first confidence threshold are retained for the second detection.
  • Since the candidate frames are obtained in the first detection, the size of each candidate frame can be calculated, and a candidate frame whose size is smaller than the first size threshold can be screened as a first candidate frame. This is because smaller targets are more difficult to detect, so a second detection is necessary.
  • A candidate frame whose category is the first preset category can also be screened as a first candidate frame, because targets of certain specific categories are difficult to detect, so a second detection is necessary.
  • In some implementations, the method further includes: determining a candidate frame whose confidence is greater than or equal to the first confidence threshold and smaller than a second confidence threshold as a first candidate frame, determining a candidate frame whose confidence is greater than or equal to the second confidence threshold as a second candidate frame, and outputting the target in the second candidate frame; and/or determining a candidate frame whose size is smaller than the first size threshold as a first candidate frame, determining a candidate frame whose size is greater than or equal to the first size threshold as a second candidate frame, and outputting the target in the second candidate frame; and/or determining a candidate frame whose category is the first preset category as a first candidate frame, determining a candidate frame whose category is a second preset category as a second candidate frame, and outputting the target in the second candidate frame.
  • the first preset category is a complex category that is difficult to detect.
  • the category "person” can be set as the first preset category, because the human body often has different postures, so detection is difficult;
  • The second preset category is a simple category that is easier to detect. For example, the category "sign" can be set as the second preset category, because roadside signs are still objects and generally have a simple shape, so detection is easier.
  • targets of the second preset category can be accurately detected in one detection, so the target detection method of the present application directly outputs targets that are easier to detect during the screening process, further reducing the computational burden of the target detection network.
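  • To make this screening logic concrete, here is a small Python sketch; the thresholds and the category set are arbitrary example values, and while the text combines the confidence, size, and category criteria with "and/or", this sketch requires all three to hold before outputting a target directly.

```python
def screen_candidates(candidates, conf_lo=0.3, conf_hi=0.9,
                      size_threshold=32 * 32,
                      simple_categories=frozenset({"sign"})):
    """Route each candidate frame to discard, direct output, or secondary detection.

    candidates: iterable of dicts with keys 'box' (x1, y1, x2, y2),
    'confidence', and 'category'. All threshold values and the category
    set here are illustrative, not values from this application.
    """
    first_candidates, direct_outputs = [], []
    for cand in candidates:
        x1, y1, x2, y2 = cand["box"]
        area = (x2 - x1) * (y2 - y1)
        if cand["confidence"] < conf_lo:
            continue  # likely a false detection: discard outright
        if (cand["confidence"] >= conf_hi
                and area >= size_threshold
                and cand["category"] in simple_categories):
            direct_outputs.append(cand)    # easy target: output directly
        else:
            first_candidates.append(cand)  # hard target: secondary detection
    return first_candidates, direct_outputs
```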
  • In some implementations, selecting one or more first candidate frames from the multiple candidate frames according to the confidence of each candidate frame includes: screening the multiple candidate frames according to a preset number, so as to obtain a preset number of candidate frames.
  • the target detection method of the present application can also set the number of candidate frames for secondary detection.
  • The fewer the candidate frames for secondary detection, the lower the burden on the target detection network; therefore, in practical applications, the number of candidate frames used for secondary detection can be set according to actual conditions.
  • the target detection method of the present application further includes: sorting a plurality of candidate frames according to the degree of confidence from high to low, and selecting the top K candidate frames, where K is the preset number.
  • the target detection method of the present application uses the TopK algorithm to select a preset number of candidate frames from multiple candidate frames, and the ranking can be based on the level of confidence.
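  • A minimal sketch of this confidence-based Top-K screening in plain Python, where k is the preset number:

```python
def top_k_candidates(candidates, k):
    """Keep the k candidate frames with the highest confidence.

    candidates: list of (box, confidence) pairs. Sorting by confidence
    in descending order and truncating is a simple form of the TopK
    selection mentioned above.
    """
    ranked = sorted(candidates, key=lambda cand: cand[1], reverse=True)
    return ranked[:k]
```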
  • In another aspect, a target detection device is provided, comprising: an acquisition module for acquiring a feature image of an input image; and a processing module for detecting targets in the input image according to the feature image of the input image, so as to obtain multiple candidate frames and the confidence of each of the multiple candidate frames. The processing module is further configured to select one or more first candidate frames from the multiple candidate frames according to the confidence of each candidate frame; the processing module is further configured to extract a first feature from the original image corresponding to the input image according to the one or more first candidate frames, and to extract a second feature from the feature image of the input image according to the one or more first candidate frames; the processing module is further configured to fuse the first feature and the second feature to obtain a fused feature; and the processing module is further configured to detect targets in the one or more first candidate frames according to the fused feature.
  • the processing module is further configured to detect the target in the input image according to the feature image of the input image, so as to obtain the size and category of each candidate frame.
  • In some implementations, the processing module selecting one or more first candidate frames from the multiple candidate frames according to the confidence of each candidate frame includes: determining a candidate frame whose confidence is greater than or equal to the first confidence threshold as a first candidate frame, and/or determining a candidate frame whose size is smaller than the first size threshold as a first candidate frame, and/or determining a candidate frame whose category is the first preset category as a first candidate frame.
  • In some implementations, the processing module is further configured to: determine a candidate frame whose confidence is greater than or equal to the first confidence threshold and smaller than the second confidence threshold as a first candidate frame, determine a candidate frame whose confidence is greater than or equal to the second confidence threshold as a second candidate frame, and output the target in the second candidate frame; and/or determine a candidate frame whose size is smaller than the first size threshold as a first candidate frame, determine a candidate frame whose size is greater than or equal to the first size threshold as a second candidate frame, and output the target in the second candidate frame; and/or determine a candidate frame whose category is the first preset category as a first candidate frame, determine a candidate frame whose category is the second preset category as a second candidate frame, and output the target in the second candidate frame.
  • the processing module is further configured to: screen a plurality of candidate frames according to a preset number to obtain a preset number of candidate frames.
  • In another aspect, a target detection device is provided, comprising a processor and a transmission interface.
  • The transmission interface is used to obtain the feature image of an input image; the processor is used to detect targets in the input image according to the feature image of the input image, so as to obtain multiple candidate frames and the confidence of each of the multiple candidate frames.
  • The processor is further configured to select one or more first candidate frames from the multiple candidate frames according to the confidence of each candidate frame; the processor is further configured to extract a first feature from the original image corresponding to the input image according to the one or more first candidate frames, and to extract a second feature from the feature image of the input image according to the one or more first candidate frames; the processor is further configured to fuse the first feature and the second feature to obtain a fused feature; and the processor is further configured to detect targets in the one or more first candidate frames according to the fused feature.
  • the processor is further configured to detect the target in the input image according to the feature image of the input image, so as to obtain the size and category of each candidate frame.
  • In some implementations, the processor selecting one or more first candidate frames from the multiple candidate frames according to the confidence of each candidate frame includes: determining a candidate frame whose confidence is greater than or equal to the first confidence threshold as a first candidate frame, and/or determining a candidate frame whose size is smaller than the first size threshold as a first candidate frame, and/or determining a candidate frame whose category is the first preset category as a first candidate frame.
  • In some implementations, the processor is further configured to: determine a candidate frame whose confidence is greater than or equal to the first confidence threshold and smaller than the second confidence threshold as a first candidate frame, determine a candidate frame whose confidence is greater than or equal to the second confidence threshold as a second candidate frame, and output the target in the second candidate frame; and/or determine a candidate frame whose size is smaller than the first size threshold as a first candidate frame, determine a candidate frame whose size is greater than or equal to the first size threshold as a second candidate frame, and output the target in the second candidate frame; and/or determine a candidate frame whose category is the first preset category as a first candidate frame, determine a candidate frame whose category is the second preset category as a second candidate frame, and output the target in the second candidate frame.
  • the processor is further configured to: filter a plurality of candidate frames according to a preset number to obtain a preset number of candidate frames.
  • In another aspect, a computer-readable storage medium is provided, in which a program is stored; when the program is run on a computer or a processor, the computer or processor is caused to execute the method of the first aspect or of any one of its implementations.
  • In another aspect, a computer program product is provided, comprising instructions that, when run on a computer or a processor, cause the computer or processor to execute the method of the first aspect or of any one of its implementations.
  • FIG. 1 is a schematic diagram of the system architecture of the present application.
  • FIG. 2 is a schematic block diagram of the convolutional neural network of the present application.
  • FIG. 3 is a schematic block diagram of the target detection system of the present application.
  • FIG. 4 is an overall flow chart of the target detection method of the present application.
  • FIG. 6 is a schematic block diagram of the present application performing convolution processing on an input image.
  • FIG. 7 is a schematic block diagram of the first-level detection of the present application.
  • FIG. 8 is a schematic block diagram of the present application screening first-level detection results.
  • FIG. 10 is a schematic block diagram of another first-level detection of the present application.
  • FIG. 11 is a schematic block diagram of the present application screening another first-level detection result.
  • FIG. 12 is a schematic block diagram of the target detection device of the present application.
  • FIG. 13 is a schematic diagram of the hardware structure of the target detection apparatus of the present application.
  • Object detection: finding all objects of interest in an image and determining their categories and locations.
  • Inference: the process of making decisions in the real environment; in this application, it refers to letting models and algorithms determine whether an image contains an object and, if so, its category and bounding box.
  • Floating-point operations per second (FLOPS): the number of floating-point operations that can be performed per second, an indicator of the computing power of hardware.
  • the backbone network is used to extract the underlying image information and is a common structure of vision-based deep neural network models.
  • the backbone network is usually fine-tuned based on the architecture of general deep convolutional neural networks.
  • the backbone network can be fine-tuned based on the architecture of the Visual Geometry Group (VGG) network, a network proposed by the Visual Geometry Group at Oxford University.
  • VGG: Visual Geometry Group network.
  • ResNet: deep residual network.
  • Feature pyramid network (FPN): a feature extractor designed around the feature-pyramid concept, aiming at both accuracy and speed. It consists of a bottom-up part and a top-down part. The bottom-up part is a conventional convolutional network performing feature extraction; as the convolution proceeds, the spatial resolution decreases and spatial information is lost, but more high-level semantic information is captured.
  • the target detection method of the present application can be applied to fields that require target detection, such as assisted driving, automatic driving, safe cities, and smart terminals.
  • ADAS: advanced driver assistance system.
  • ADS: autonomous driving system.
  • In assisted and autonomous driving, the image input to the target detection network is further compressed to a lower resolution (for example, 640*368), so objects such as traffic lights occupy even fewer pixels in the image, making accurate small-target detection more difficult.
  • In a safe-city scenario, target detection can be used to detect pedestrians or vehicles: the detection results are marked and fed into the analysis unit of the system, which can be used to find criminal suspects, missing persons, specific vehicles, and so on.
  • Network cameras are deployed all over the country, and security cameras may be installed on roads and in factories, office buildings, residences, and other places.
  • However, the computing power of the chips carried by security cameras is generally low, so how to achieve accurate target detection on low-computing-power chips is an urgent problem to be solved.
  • the embodiments of the present application provide a target detection method, which can improve the performance of a target detection network for small target detection under hardware conditions such as limited computing power and memory.
  • FIG. 1 is a schematic diagram of a system architecture of an embodiment of the present application.
  • the system architecture 100 includes an execution device 110 , a training device 120 , a database 130 , a client device 140 , a data storage system 150 , and a data acquisition system 160 .
  • the execution device 110 includes a calculation module 111 , an I/O interface 112 , a preprocessing module 113 and a preprocessing module 114 .
  • the calculation module 111 may include the target model/rule 101, and the preprocessing module 113 and the preprocessing module 114 are optional.
  • the data collection device 160 is used to collect training data.
  • In this embodiment of the present application, the training data may include training images (containing pedestrians) and labeling data, where the labeling data provides the coordinates and categories of the bounding boxes of the pedestrians in the training images.
  • the data collection device 160 stores the training data in the database 130 , and the training device 120 obtains the target model/rule 101 by training based on the training data maintained in the database 130 .
  • The training device 120 performs object detection on the input training image and compares the output target detection result (the bounding boxes of objects such as pedestrians and vehicles in the image and the confidences of their categories) with the labeled result, until the difference between the target detection result output by the training device 120 and the pre-labeled result is less than a certain threshold, at which point the training of the target model/rule 101 is completed.
  • The above target model/rule 101 can be used to implement the target detection method of the embodiment of the present application: the image to be processed (after relevant preprocessing) is input into the target model/rule 101, and the target detection result of the image to be processed is obtained.
  • the target model/rule 101 in this embodiment of the present application may specifically be a neural network.
  • the training data maintained in the database 130 does not necessarily come from the collection of the data collection device 160, and may also be received from other devices.
  • the training device 120 may not necessarily train the target model/rule 101 completely based on the training data maintained by the database 130, and may also obtain training data from the cloud or other places for model training.
  • The above description should not be construed as a limitation on the embodiments of the present application.
  • The target model/rule 101 trained by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1, which may be a terminal such as a laptop, an augmented reality (AR)/virtual reality (VR) device, or an in-vehicle terminal, and may also be a server or the cloud.
  • The execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and the user can input data to the I/O interface 112 through the client device 140. In this embodiment of the present application, the input data may include an image to be processed input by the client device.
  • the client device 140 here may specifically be a terminal device.
  • the preprocessing module 113 and the preprocessing module 114 are used to perform preprocessing according to the input data (such as the image to be processed) received by the I/O interface 112.
  • In this embodiment of the present application, the preprocessing module 113 and the preprocessing module 114 may be absent, or there may be only one preprocessing module.
  • the calculation module 111 can be directly used to process the input data.
  • When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing, the execution device 110 can call the data, code, and the like in the data storage system 150 for the corresponding processing, and the data and instructions obtained by that processing may also be stored in the data storage system 150.
  • the I/O interface 112 presents the processing result, such as the target detection result calculated by the target model/rule 101, to the client device 140, thereby providing it to the user.
  • The target detection result obtained through the processing of the target model/rule 101 in the calculation module 111 may be processed by the preprocessing module 113 (processing by the preprocessing module 114 may also be added), and the processing result is then sent to the I/O interface, which sends it to the client device 140 for display.
  • Alternatively, the calculation module 111 may transmit the processed target detection result directly to the I/O interface, which then sends the result to the client device 140 for display.
  • The training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired result.
  • the user can manually specify input data, which can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the user's authorization is required to request the client device 140 to automatically send the input data, the user can set the corresponding permission in the client device 140 .
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data as shown in the figure, and store them in the database 130 .
  • Alternatively, the I/O interface 112 may directly store the input data input into the I/O interface 112 and the output result of the I/O interface 112, as shown in the figure, in the database 130 as new sample data.
  • FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • For example, in FIG. 1, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed inside the execution device 110.
  • the target model/rule 101 obtained by training the training device 120 may be a neural network in the embodiment of the present application.
  • The neural network provided in the embodiment of the present application may be a CNN, a deep convolutional neural network (DCNN), or the like.
  • A convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture refers to a machine learning algorithm that performs multiple levels of learning. A CNN is a feed-forward artificial neural network in which individual neurons respond to the image fed into it.
  • a convolutional neural network (CNN) 200 may include an input layer 210 , a convolutional/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230 .
  • The convolutional/pooling layer 220 may include layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
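  • For illustration, the first layer arrangement above (convolution and pooling alternating as layers 221 to 226) can be written as a minimal PyTorch sketch; PyTorch itself and all channel and kernel sizes are assumptions of this example, not part of this application.

```python
import torch
import torch.nn as nn

# Layers 221-226 from the first example arrangement: alternating
# convolutional and pooling layers. Channel counts are illustrative.
layers_221_to_226 = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 221: convolution
    nn.MaxPool2d(kernel_size=2),                  # 222: pooling
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 223: convolution
    nn.MaxPool2d(kernel_size=2),                  # 224: pooling
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # 225: convolution
    nn.MaxPool2d(kernel_size=2),                  # 226: pooling
)

x = torch.randn(1, 3, 368, 640)    # dummy batch with one 640*368 input image
print(layers_221_to_226(x).shape)  # torch.Size([1, 64, 46, 80])
```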
  • the convolution layer 221 may include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator is essentially a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is typically moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time; this depends on the value of the stride), so as to extract specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • The feature maps produced by the weight matrices are stacked to form the depth dimension of the convolved image, where this dimension can be understood as being determined by the number of weight matrices described above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to extract unwanted noise in the image.
  • The multiple weight matrices have the same size (rows × columns), so the convolution feature maps extracted by them also have the same size, and the extracted feature maps of the same size are then combined to form the output of the convolution operation.
  • The weight values in these weight matrices need to be obtained through extensive training in practical applications, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions.
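  • The sliding-window behavior described above can be made concrete with a small NumPy sketch of a single-channel convolution; the kernel values and stride are arbitrary example choices:

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide a weight matrix over a 2-D image, as described above.

    image: (H, W) array; kernel: (kh, kw) weight matrix. Moving the
    kernel `stride` pixels at a time produces the convolution feature map.
    """
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # weighted sum over the window
    return out

edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)  # responds to vertical edges
feature_map = conv2d_single(np.random.rand(8, 8), edge_kernel, stride=1)
print(feature_map.shape)  # (6, 6)
```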
  • When the convolutional neural network has multiple convolutional layers, the initial convolutional layers (e.g., 221) often extract more general, low-level features, while the features extracted by the later convolutional layers (e.g., 226) become more and more complex, such as high-level semantic features.
  • features with higher semantics are more suitable for the problem to be solved.
  • A pooling layer may directly follow a convolutional layer, i.e., one convolutional layer followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a certain range to produce an average value as the result of average pooling.
  • the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
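  • A tiny NumPy example of the size reduction just described, applying 2x2 max and average pooling to a 4x4 image (the window size is an example choice):

```python
import numpy as np

x = np.array([[1.0, 3.0, 2.0, 4.0],
              [5.0, 7.0, 6.0, 8.0],
              [9.0, 2.0, 1.0, 3.0],
              [4.0, 6.0, 5.0, 7.0]])

# Split the 4x4 image into non-overlapping 2x2 windows.
windows = x.reshape(2, 2, 2, 2).swapaxes(1, 2)

max_pooled = windows.max(axis=(2, 3))   # each output pixel = max of its sub-region
avg_pooled = windows.mean(axis=(2, 3))  # each output pixel = mean of its sub-region
print(max_pooled)  # [[7. 8.] [9. 7.]]
print(avg_pooled)  # [[4. 5.] [5.25 4.]]
```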
  • After processing by the convolutional/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as mentioned above, the convolutional/pooling layer 220 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate one output or a set of outputs whose number equals the number of required classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 2) and an output layer 240, and the parameters contained in the multiple hidden layers may be pre-trained based on relevant training data for specific task types; for example, the task types may include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the fully connected layer 230, the last layer of the entire convolutional neural network 200 is the output layer 240, which has a loss function similar to categorical cross-entropy and is specifically used to calculate the prediction error.
  • Once the forward propagation of the entire convolutional neural network 200 (propagation from 210 to 240 in Figure 2) is completed, the back propagation (propagation from 240 to 210 in Figure 2) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output through the output layer and the ideal result.
  • the convolutional neural network 200 shown in FIG. 2 is only used as an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.
  • a convolutional neural network (CNN) 200 shown in FIG. 2 can be used to perform the target detection method of the embodiment of the present application.
  • After the image to be processed is processed by the input layer 210, the convolutional/pooling layer 220, and the fully connected layer 230, the detection result of the image (the bounding boxes of the targets in the image and the confidences that the bounding boxes contain targets) can be obtained.
  • FIG. 3 shows a schematic block diagram of a target detection system according to an embodiment of the present application.
  • The target detection system can be deployed on related equipment, such as terminal equipment like vehicle-mounted devices and security monitoring devices, so as to reduce the computing power requirement of the target detection network and improve its performance for small object detection.
  • The target detection system in Figure 3 mainly includes a convolutional neural network (CNN) module, a first-level detector, a screening module, a feature fusion module, a feature extraction module 1, a feature extraction module 2, and a detector.
  • the feature fusion module, the feature extraction module 1, the feature extraction module 2, and the detector together form a secondary detector.
  • The CNN module is used to perform a convolution operation on the input image using a CNN to obtain the feature image of the input image. The CNN here can be the convolutional neural network shown in Figure 2, specifically the backbone network or the feature pyramid network (FPN) in the CNN.
  • the first-level detector is used to detect the target for the first time according to the characteristic image of the image, and then input the result of the first detection into the screening module.
  • the screening module screens the results of the first detection according to the category, size, confidence, etc., and then according to the actual situation, a part of the screening results can be directly output as a simple target, and another part of the screening results can be input into the secondary detector as a candidate area.
  • The feature extraction module 1 extracts the corresponding features from the feature image output by the CNN module, and the feature extraction module 2 extracts original-image features from the original image, where the original image can be, for example, the original-size image directly obtained from the sensor, or an image whose original size has been adjusted to a preset size according to the computing power; the input image is generally obtained by downscaling the original image.
  • the feature fusion module fuses the two features to obtain fused features.
  • the detector predicts the category and coordinates of the image target according to the fusion features, so as to obtain the category of the target and the coordinates of the target in the image.
  • FIG. 4 shows an overall flow chart of the target detection method according to the embodiment of the present application, in which primary detection, candidate frame screening and secondary detection are the core processes, which will be briefly introduced below.
  • The first-level detection can be the first stage of a traditional two-stage network, i.e., predicting candidate regions through a region proposal network (RPN); it can also be a one-stage detection network, which directly predicts the class and location of the objects in the image to be detected.
  • the one-stage network in this application means that the category probability and position coordinate value of the target object are directly generated after only one detection without going through the region proposal stage, that is, the final detection result can be obtained after a single detection.
  • Such networks have high detection efficiency; examples include the YOLO, CenterNet, and RetinaNet networks.
  • The two-stage network divides the detection process into two stages (i.e., two detections are performed): the first detection generates candidate regions, and the second detection classifies them. The first detection only detects whether a candidate frame contains a positive sample, outputting the candidate frame region and the confidence of the positive sample, while the secondary detection detects the specific category of the target in the candidate frame.
  • Feature extraction is performed on the feature image, and then the category and coordinates of the target are detected according to the feature extraction.
  • the first-level detection can already complete most of the large target detection, but in the case of low resolution of the input image, there may be missed detection and false detection for some categories of small targets, so it is necessary to carry out the second-level detection.
  • the embodiment of the present application introduces a candidate frame screening step.
  • the results of the first-level detection are screened according to the information of different dimensions such as the category, size, and confidence of the target, and then the difficult targets obtained from the screening are subjected to the second-level detection.
  • For the screened simple targets, it can generally be assumed that the first-level detection detects them accurately, so the screening results corresponding to the simple targets can be output directly.
  • In the secondary detection, features are extracted from the original image and from the feature space obtained by the CNN operation, the two features are fused by lightweight computation to obtain the fused feature, and the candidate regions are classified and regressed a second time, thereby predicting the category and coordinates of the target. Since the fused feature combines features extracted from the high-resolution original image with the low-resolution high-level features obtained by the CNN operation, fusing the two improves the network's detection performance for small objects. In addition, since the original-image features are introduced during the secondary detection, the large memory occupied by the CNN operation has already been released by then, so introducing the original-image features does not cause a large memory burden.
  • The candidate frame screening step also effectively reduces the number of candidate regions for secondary detection, so the computation of the secondary detector can be kept at the MFLOPs or GFLOPs level, whereas most current low-cost chips reach the TFLOPs level (MFLOPs < GFLOPs < TFLOPs). The target detection method of the embodiment of the present application therefore does not bring a large computational burden and can fit the computing power of most current chips.
  • FIG. 5 shows a schematic flowchart of a target detection method according to an embodiment of the present application, including steps 501 to 506, which will be introduced separately below.
  • the target detection method in FIG. 5 can be applied to scenes that require target detection, such as assisted driving, automatic driving, safe cities, and smart terminals.
  • the image is input into the CNN network shown in Figure 2, the CNN performs convolution processing on the input image, and aggregates the feature information of the input image to obtain the feature image of the input image.
  • The feature information of the image includes the texture of the image, high-level semantics, spatial relationships, and so on.
  • The feature image is a collection of the image's feature information.
  • the CNN network may be a backbone network and/or a feature pyramid network, and the feature images may be output by the backbone network or the feature pyramid network, or may be output by both networks.
  • the output feature image may be a feature image of a preset specific size, or a plurality of feature images of different sizes may be output according to actual needs.
  • S502 Detect the target in the input image according to the feature image of the input image, so as to obtain multiple candidate frames and the confidence level of each candidate frame in the multiple candidate frames.
  • the target detection method of the embodiment of the present application first performs the first target detection on the input image according to the characteristic image of the input image, wherein the method for the first target detection may adopt the existing target detection method.
  • The region proposal network (RPN) in a two-stage network can be used to perform preliminary detection on the input image, detecting the candidate frames of possible targets in the input image and the confidence of each candidate frame, where the confidence of a candidate frame refers to the confidence that the detected object in the candidate frame belongs to a certain category.
  • the first target detection can also use a one-stage network to perform target detection on the input image.
  • The one-stage network can be, for example, a YOLO, CenterNet, or RetinaNet network. Unlike the RPN, the one-stage network can detect not only the candidate frames of possible targets in the image and the confidence of each candidate frame, but also the category of the target in each candidate frame.
  • S503 Select one or more first candidate frames from multiple candidate frames according to the confidence of each candidate frame.
  • A candidate frame whose confidence is greater than or equal to the first confidence threshold can be selected as a first candidate frame; a target in a low-confidence candidate frame is most likely a false detection, and there is no need to perform secondary detection on it, so it can be discarded directly.
  • Since the candidate frames are obtained in the first detection, the size of each candidate frame can be calculated, and a candidate frame whose size is smaller than the first size threshold can be selected as a first candidate frame. This is because smaller targets are more difficult to detect, so a second detection is necessary.
  • A candidate frame whose category is the first preset category can also be screened as a first candidate frame, because targets of certain specific categories are difficult to detect, so a second detection is necessary.
  • a part of the targets in the candidate boxes may also be screened for direct output.
  • For example, a candidate frame whose confidence is greater than or equal to the first confidence threshold and less than the second confidence threshold is screened as a first candidate frame, while a candidate frame whose confidence is greater than or equal to the second confidence threshold is screened as a second candidate frame, and the target in the second candidate frame is output directly, because a target with a sufficiently high confidence can be considered correctly detected.
  • Similarly, a candidate frame whose size is smaller than the first size threshold is screened as a first candidate frame, a candidate frame whose size is greater than or equal to the first size threshold is screened as a second candidate frame, and the target in the second candidate frame is output directly, because targets of a sufficiently large size are easier to detect and the first detection can generally detect them correctly.
  • Likewise, a candidate frame whose category is the first preset category is screened as a first candidate frame, a candidate frame whose category is the second preset category is screened as a second candidate frame, and the target in the second candidate frame is output directly, because targets of some specific categories are more difficult to detect while targets of other categories are easier, and for the latter the first detection is generally correct.
  • a plurality of candidate frames may be screened according to a preset number, so as to obtain a preset number of candidate frames.
  • According to the first candidate frames that need the second detection, the regions with the corresponding coordinates are first located in the original image corresponding to the input image and the first feature is extracted from them; the regions with the corresponding coordinates are then located in the feature image of the input image and the second feature is extracted from them.
  • the original image corresponding to the input image refers to the image of the original size directly obtained from the sensor, and the original image is not necessarily the image input to the neural network model.
  • For example, the original image may be a 1920*1080 image taken directly from the sensor while the image input to the neural network model is downscaled to 640*368; alternatively, if the original-size image has been adjusted to a preset size according to the computing power while the input image is still a 640*368 image, then the original image here refers to the size-adjusted image.
  • the feature image of the input image can be a feature image of a specific size, or it can be a plurality of feature images of different sizes.
  • the extraction of the second feature from the feature image may be to extract the second feature from a feature image of a specific size, or to extract features from multiple feature images of different sizes to obtain the second feature.
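  • Because the first candidate frames are produced on the (usually downscaled) input image, their coordinates must be scaled before the corresponding region can be cropped from the original image. A minimal sketch, assuming the input image is a uniform resize of the original image:

```python
def map_box_to_original(box, input_size, original_size):
    """Scale a candidate frame from input-image to original-image coordinates.

    box: (x1, y1, x2, y2) on the input image; input_size and original_size
    are (width, height) pairs. Assumes a uniform resize between the images.
    """
    sx = original_size[0] / input_size[0]
    sy = original_size[1] / input_size[1]
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)

# A frame found on the 640*368 input image, mapped onto the 1920*1080 original.
print(map_box_to_original((100, 50, 140, 90), (640, 368), (1920, 1080)))
# -> (300.0, 146.73..., 420.0, 264.13...)
```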
  • When fusing the first feature and the second feature, an existing image feature fusion method may be used. For example, operations such as convolution, pooling, or upsampling can be applied so that the two features have the same size, after which they are added or concatenated; alternatively, the two features can each be flattened directly and then concatenated.
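  • Both fusion options just mentioned can be sketched in a few lines of PyTorch; the feature shapes are example values, and this is only an illustration of the operations named above, not the application's implementation:

```python
import torch
import torch.nn.functional as F

first_feature = torch.randn(1, 8, 28, 28)   # feature from the original image
second_feature = torch.randn(1, 8, 7, 7)    # feature from the CNN feature image

# Option 1: upsample to a common size, then add or concatenate channel-wise.
up = F.interpolate(second_feature, size=first_feature.shape[-2:],
                   mode="bilinear", align_corners=False)
fused_add = first_feature + up                      # element-wise addition
fused_cat = torch.cat([first_feature, up], dim=1)   # shape (1, 16, 28, 28)

# Option 2: flatten ("straighten") each feature, then concatenate.
fused_flat = torch.cat([first_feature.flatten(1),
                        second_feature.flatten(1)], dim=1)
print(fused_cat.shape, fused_flat.shape)  # (1, 16, 28, 28) and (1, 6664)
```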
  • S506 Detect targets in one or more first candidate boxes according to the fused features.
  • The targets in the first candidate frames that need the second detection are detected according to the fused feature. Because the fused feature incorporates original-image features, the high-resolution original image greatly increases the available detail information, thereby improving the ability to detect objects (especially small objects).
  • In summary, in the target detection method of the present application, targets that have already been detected once are detected a second time, which can greatly improve the detection accuracy of small targets and complex targets.
  • Moreover, the target detection method of the present application screens the targets detected the first time before performing the secondary detection, and performs secondary detection only on the targets that need it, which reduces the computational burden of the target detection network.
  • additional conditions can be added for screening according to requirements, such as confidence, category and candidate frame size.
  • the targets that are easier to detect can be directly output, which further reduces the computational burden of the target detection network.
  • The target detection method of the present application fuses features from the original image with features from the feature image during secondary detection. Since the original image has a higher resolution, the accuracy of target detection can be improved. Furthermore, by the time the high-resolution original-image features are introduced in the secondary detection, the large memory occupied by the earlier stages of the target detection network has already been released, so introducing these features does not impose a heavy memory burden on the network.
  • the CNN operation extracts the features of the input image, and specifically, performs convolution processing on the input image to obtain the convolution feature map of the input image.
  • the network for performing the CNN operation in the embodiment of the present application may adopt various structures, including a backbone network and a feature pyramid network.
  • Fig. 6 shows a schematic block diagram of CNN processing on the input image.
  • the feature information of the input image is aggregated through operations such as convolution, so as to obtain high-level information.
  • the feature image can be output directly by the backbone network, or the feature image can be output by the feature pyramid network, or the feature image can be jointly output by the two networks.
  • the first-level detection in the embodiment of the present application may use a region proposal network (RPN) in a two-stage network, and predict the candidate region according to the feature image obtained by the above-mentioned CNN operation, and obtain a positive sample in the candidate region and its confidence.
  • the main function of RPN is to generate region candidates, which can be regarded as many potential candidate frames.
  • Since the network does not know in advance how many target objects exist in the image, the RPN usually generates a number of candidate frames on the image in advance and outputs the candidate frames that are most likely to contain target objects.
  • the feature image is input to the RPN, and the RPN detects the feature image to obtain the coordinates of each candidate frame and the confidence level of each candidate frame in multiple candidate frames, which are represented as multiple candidate frames on the input image.
  • The multiple candidate frames are then screened. Specifically, candidate frames with a confidence greater than or equal to a preset threshold can be kept, and candidate frames with a confidence less than the preset threshold are directly discarded. According to actual needs, a screening quantity can also be set during the screening process, so that no more than a preset number of candidate frames are output.
  • the results of the first-level detection are screened according to the candidate frame confidence, and a candidate frame with high confidence is selected. If there is a preset number requirement, the TopK algorithm can also be used for screening to select the top K candidate boxes with the highest confidence. As shown in Figure 7, after the candidate frame screening is performed, the number of candidate frames on the input image can be effectively reduced, thereby reducing the computational complexity of secondary detection.
  • feature extraction is performed from the feature map obtained by the CNN operation and the corresponding position in the original image.
  • The feature extraction can use operations such as RoI pooling, RoI align, PS RoI pooling, and PS RoI align.
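  • As one concrete realization of this step, torchvision's RoI Align can pool a fixed-size feature for each candidate frame; the sketch below is an example under assumptions (feature-map size, stride, and box values are illustrative), not the application's implementation:

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 64, 46, 80)  # e.g., a 640*368 image downsampled by 8
# Boxes in image coordinates: (batch_index, x1, y1, x2, y2).
boxes = torch.tensor([[0.0, 100.0, 50.0, 140.0, 90.0]])

# RoI Align extracts a fixed 7x7 feature for each candidate frame;
# spatial_scale = 1/8 maps image coordinates onto the feature map.
roi_feature = roi_align(feature_map, boxes, output_size=(7, 7),
                        spatial_scale=1.0 / 8, sampling_ratio=2)
print(roi_feature.shape)  # torch.Size([1, 64, 7, 7])
```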
  • The fusion result is input to the detector, and the detector predicts the coordinates and category of the target according to the fused feature. Specifically, according to actual needs, the detector may perform only category prediction, or may complete category prediction and coordinate prediction at the same time.
  • The target detection method of the embodiment of the present application imposes no large computational or memory burden on the target detection network while effectively improving small-target detection performance; it can achieve accurate small-target detection under limited computing power and memory conditions and therefore has high practical value.
  • Table 1 shows that, compared with the existing target detection method Faster R-CNN, the target detection method of the embodiment of the present application achieves significantly better detection of target 1 and target 2 on the test set.
  • the precision rate refers to the proportion of correctly detected targets among the detected targets; the recall rate refers to the ratio of the correctly detected targets to the number of all the targets in the test set.
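  • These two definitions translate directly into code; a minimal sketch, assuming the matching of detections to ground truth has already been done:

```python
def precision_recall(num_correct, num_detected, num_ground_truth):
    """Precision: correct detections / all detections.
    Recall: correct detections / all ground-truth targets in the test set."""
    precision = num_correct / num_detected if num_detected else 0.0
    recall = num_correct / num_ground_truth if num_ground_truth else 0.0
    return precision, recall

# Example: 8 of 10 detections are correct; the test set contains 16 targets.
print(precision_recall(8, 10, 16))  # (0.8, 0.5)
```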
  • A one-stage network can also be used to detect the feature images obtained through the CNN operation, where the one-stage network can be a YOLO, CenterNet, or RetinaNet network, among others. The one-stage network performs the first-level detection according to the feature image obtained by the CNN operation, so as to obtain the coordinates, confidence, and category of each of the multiple candidate frames, i.e., the first-level detection result.
  • the candidate frame screening step in this embodiment of the present application further includes screening the first-level detection results according to the size, confidence, and category of the candidate frame, and selects candidate frames that meet certain conditions.
  • the candidate frame that meets certain conditions is used as the result of difficult target detection for secondary detection.
  • for example, since smaller targets are harder to recognize, a candidate frame whose size is greater than or equal to a threshold a can be judged a simple target detection result, and a candidate frame whose size is smaller than the threshold a can be judged a difficult target detection result. For another example, a candidate frame with a confidence greater than or equal to a threshold b can be judged a simple target detection result, and a candidate frame with a confidence in the range [threshold c, threshold b) can be judged a difficult target detection result, where the threshold c is less than the threshold b; a candidate frame whose confidence is less than the threshold c can be judged an erroneous result and discarded directly. For yet another example, a candidate frame of a complex category can be judged a difficult target detection result, while a candidate frame of a simple category can be judged a simple target detection result, where the complex and simple categories can be preset manually.
  • for instance, the category "person" can be set as a complex category, because the human body often takes different postures and is difficult to detect, while the category "sign" can be set as a simple category, because roadside signs are static objects of generally simple shape and are easier to detect.
  • the above threshold a, threshold b and threshold c are all preset thresholds; a screening sketch covering these rules is given below.
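  • the screening rules above might be combined as in the following sketch; the threshold values, class names and field names are illustrative assumptions only:

```python
def triage_candidates(candidates, size_a=32.0, conf_b=0.8, conf_c=0.3,
                      complex_classes=("person",)):
    """Split first-level detections into simple results (output directly),
    difficult results (sent to secondary detection), and discarded errors.

    candidates: iterable of dicts with keys 'size', 'confidence', 'category'
    """
    simple, difficult = [], []
    for c in candidates:
        if c["confidence"] < conf_c:
            continue                                   # below threshold c: error, discard
        if (c["size"] < size_a                         # small targets are hard to detect
                or c["confidence"] < conf_b            # confidence in [c, b): uncertain
                or c["category"] in complex_classes):  # complex category, e.g. "person"
            difficult.append(c)                        # goes to secondary detection
        else:
            simple.append(c)                           # e.g. a large, confident "sign"
    return simple, difficult

cands = [{"size": 12.0, "confidence": 0.55, "category": "person"},
         {"size": 80.0, "confidence": 0.95, "category": "sign"}]
simple, difficult = triage_candidates(cands)   # sign -> simple, person -> difficult
```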
  • the selected difficult target detection results are then subjected to secondary detection to finally obtain the target detection results, where the process of secondary detection is the same as that described in FIG. 9 and is not repeated here for brevity.
  • the target detection apparatus described below can perform the steps of the target detection method of the embodiments of the present application; repeated descriptions are appropriately omitted when introducing it.
  • FIG. 12 is a schematic block diagram of a target detection apparatus according to an embodiment of the present application.
  • the apparatus 1200 shown in FIG. 12 includes an acquisition module 1201 and a processing module 1202, which will be introduced separately below.
  • the acquisition module 1201 is used for acquiring the feature image of the input image.
  • the processing module 1202 is configured to detect the target in the input image according to the feature image of the input image, so as to obtain multiple candidate frames and the confidence level of each candidate frame in the multiple candidate frames.
  • the processing module 1202 is further configured to select one or more first candidate frames from multiple candidate frames according to the confidence of each candidate frame.
  • the processing module 1202 is further configured to extract the first feature from the original image corresponding to the input image according to the one or more first candidate frames, and to extract the second feature from the feature image of the input image.
  • the processing module 1202 is further configured to fuse the first feature and the second feature to obtain a fused feature.
  • the processing module 1202 is further configured to detect targets in one or more first candidate boxes according to the fused features.
  • the processing module 1202 is further configured to detect the target in the input image according to the feature image of the input image, so as to obtain the size and category of each candidate frame.
  • the processing module 1202 screens the multiple candidate frames according to the confidence of each candidate frame to obtain the one or more first candidate frames, including: screening candidate frames whose confidence is greater than or equal to a first confidence threshold as first candidate frames, and/or screening candidate frames whose size is smaller than a first size threshold as first candidate frames, and/or screening candidate frames whose category is a first preset category as first candidate frames.
  • the processing module 1202 screens the multiple candidate frames according to the confidence of each candidate frame to obtain the one or more first candidate frames, further including: screening candidate frames whose confidence is greater than or equal to the first confidence threshold and less than a second confidence threshold as first candidate frames, screening candidate frames whose confidence is greater than or equal to the second confidence threshold as second candidate frames, and outputting the targets in the second candidate frames; and/or screening candidate frames whose size is smaller than the first size threshold as first candidate frames, screening candidate frames whose size is greater than or equal to the first size threshold as second candidate frames, and outputting the targets in the second candidate frames; and/or screening candidate frames whose category is the first preset category as first candidate frames, screening candidate frames whose category is a second preset category as second candidate frames, and outputting the targets in the second candidate frames.
  • the processing module 1202 screens the multiple candidate frames according to the confidence of each candidate frame, further including: screening the multiple candidate frames according to a preset number, so as to obtain the preset number of candidate frames. An end-to-end sketch of this module pipeline follows.
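  • putting the modules together, the overall flow of apparatus 1200 could be sketched as the following toy walkthrough; every function here is a hypothetical placeholder standing in for the corresponding module, not an API defined by this application:

```python
import numpy as np

def cnn_backbone(img):                       # acquisition path: feature image of the input
    return img.mean(axis=2, keepdims=True)   # placeholder for the CNN feature extraction

def first_level_detection(feat):             # processing module: candidate boxes + confidences
    return np.array([[8., 8., 24., 24.]]), np.array([0.9])   # (x1, y1, x2, y2), input scale

def secondary_detection(first_feats, second_feats):  # processing module: detect on fusion
    return [("target", 0.95)]                         # placeholder refined result

def detect(original_image, scale=2):
    input_image = original_image[::scale, ::scale]    # downscaling stands in for preprocessing
    feature_image = cnn_backbone(input_image)
    boxes, scores = first_level_detection(feature_image)
    first_boxes = boxes[scores >= 0.5]                # screening -> first candidate frames
    # first feature from the original image (box rescaled), second from the feature image
    f1 = [original_image[int(y1*scale):int(y2*scale), int(x1*scale):int(x2*scale)]
          for x1, y1, x2, y2 in first_boxes]
    f2 = [feature_image[int(y1):int(y2), int(x1):int(x2)] for x1, y1, x2, y2 in first_boxes]
    return secondary_detection(f1, f2)

print(detect(np.random.rand(64, 64, 3)))
```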
  • the target detection apparatus shown in FIG. 12 can be used to execute each step of the methods in FIGS. 4 to 11 above; for details, please refer to the description of FIGS. 4 to 11, which is not repeated here for brevity.
  • FIG. 13 is a schematic diagram of a hardware structure of a target detection apparatus according to an embodiment of the present application.
  • the target detection apparatus 1300 shown in FIG. 13 includes a memory 1301 , a processor 1302 , a communication interface 1303 and a bus 1304 .
  • the memory 1301 , the processor 1302 , and the communication interface 1303 are connected to each other through the bus 1304 for communication.
  • the memory 1301 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 1301 may store a program. When the program stored in the memory 1301 is executed by the processor 1302, the processor 1302 and the communication interface 1303 are used to execute each step of the target detection method of the embodiment of the present application.
  • the processor 1302 can be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is used to execute the relevant programs so as to realize the functions to be performed by the units in the target detection apparatus of the embodiment of the present application, or to execute the target detection method of the embodiment of the present application.
  • the processor 1302 can also be an integrated circuit chip with signal processing capability.
  • each step of the target detection method in the embodiment of the present application may be completed by an integrated logic circuit of hardware in the processor 1302 or an instruction in the form of software.
  • the above-mentioned processor 1302 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of the present application.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 1301, and the processor 1302 reads the information in the memory 1301 and, in combination with its hardware, completes the functions to be performed by the units included in the target detection apparatus of the embodiment of the present application, or executes the target detection method of the embodiment of the present application.
  • the communication interface 1303 implements communication between the apparatus 1300 and other devices or a communication network using a transceiving device such as, but not limited to, a transceiver.
  • the image to be processed can be acquired through the communication interface 1303 .
  • Bus 1304 may include a pathway for communicating information between the various components of the apparatus 1300 (e.g., the memory 1301, the processor 1302 and the communication interface 1303).
  • although the above apparatus 1300 only shows a memory, a processor and a communication interface, in the specific implementation process those skilled in the art should understand that the apparatus 1300 may also include other devices necessary for normal operation. Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 1300 may further include hardware devices implementing other additional functions. In addition, those skilled in the art should understand that the apparatus 1300 may include only the devices necessary for implementing the embodiments of the present application, and need not include all the devices shown in FIG. 13.
  • the present application further provides a target detection apparatus, the apparatus including: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is used to execute the target detection methods in FIGS. 4 to 11.
  • the present application further provides a computer-readable storage medium, where the computer-readable medium stores program code for device execution, the program code including instructions for executing the methods of FIGS. 4 to 11.
  • the present application further provides a chip, the chip includes a processor and a data interface, and the processor reads the instructions stored in the memory through the data interface to execute the methods in FIGS. 4 to 11 .
  • the terms "component", "module" and "system" used in this specification denote computer-related entities, hardware, firmware, combinations of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program and/or a computer.
  • by way of illustration, both an application running on a computing device and the computing device itself can be components.
  • one or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers.
  • these components can execute from various computer readable media having various data structures stored thereon.
  • a component may communicate by way of local and/or remote processes, for example according to a signal having one or more data packets (e.g., data from one component interacting with another component in a local system or a distributed system, and/or interacting with other systems by way of the signal across a network such as the Internet).
  • those of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware; whether these functions are executed by hardware or software depends on the particular application and the design constraints of the technical solution. In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation there may be other division manners, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Abstract

The present application provides a target detection method and apparatus, which can improve the performance of a target detection network for small-target detection under conditions of limited computing power, memory, and the like. The method includes: acquiring a feature image of an input image; detecting a target in the input image according to the feature image of the input image, so as to obtain multiple candidate frames and the confidence of each candidate frame among the multiple candidate frames; selecting one or more first candidate frames from the multiple candidate frames according to the confidence of each candidate frame; extracting a first feature from an original image corresponding to the input image according to the one or more first candidate frames; extracting a second feature from the feature image of the input image according to the one or more first candidate frames; fusing the first feature and the second feature to obtain a fused feature; and detecting targets in the one or more first candidate frames according to the fused feature.


Claims (13)

  1. A target detection method, characterized by comprising:
    acquiring a feature image of an input image;
    detecting a target in the input image according to the feature image of the input image, so as to obtain multiple candidate frames and a confidence of each candidate frame among the multiple candidate frames;
    selecting one or more first candidate frames from the multiple candidate frames according to the confidence of each candidate frame;
    extracting a first feature from an original image corresponding to the input image according to the one or more first candidate frames;
    extracting a second feature from the feature image of the input image according to the one or more first candidate frames;
    fusing the first feature and the second feature to obtain a fused feature; and
    detecting a target in the one or more first candidate frames according to the fused feature.
  2. The method according to claim 1, characterized in that the method further comprises:
    detecting the target in the input image according to the feature image of the input image, so as to obtain a size and a category of each candidate frame.
  3. The method according to claim 2, characterized in that the selecting one or more first candidate frames from the multiple candidate frames according to the confidence of each candidate frame comprises:
    determining a candidate frame whose confidence is greater than or equal to a first confidence threshold as the first candidate frame, and/or
    determining a candidate frame whose size is smaller than a first size threshold as the first candidate frame, and/or
    determining a candidate frame whose category is a first preset category as the first candidate frame.
  4. The method according to claim 2, characterized in that the method further comprises:
    determining a candidate frame whose confidence is greater than or equal to a first confidence threshold and less than a second confidence threshold as the first candidate frame;
    determining a candidate frame whose confidence is greater than or equal to the second confidence threshold as a second candidate frame;
    outputting the target in the second candidate frame; and/or
    determining a candidate frame whose size is smaller than a first size threshold as the first candidate frame;
    determining a candidate frame whose size is greater than or equal to the first size threshold as a second candidate frame;
    outputting the target in the second candidate frame; and/or
    determining a candidate frame whose category is a first preset category as the first candidate frame;
    determining a candidate frame whose category is a second preset category as a second candidate frame;
    outputting the target in the second candidate frame.
  5. The method according to any one of claims 1 to 4, characterized in that the method further comprises:
    screening the multiple candidate frames according to a preset number, so as to obtain the preset number of candidate frames.
  6. A target detection apparatus, characterized by comprising:
    an acquisition module, used for acquiring a feature image of an input image;
    a processing module, used for detecting a target in the input image according to the feature image of the input image, so as to obtain multiple candidate frames and a confidence of each candidate frame among the multiple candidate frames;
    the processing module being further used for selecting one or more first candidate frames from the multiple candidate frames according to the confidence of each candidate frame;
    the processing module being further used for extracting a first feature from an original image corresponding to the input image according to the one or more first candidate frames;
    the processing module being further used for extracting a second feature from the feature image of the input image according to the one or more first candidate frames;
    the processing module being further used for fusing the first feature and the second feature to obtain a fused feature; and
    the processing module being further used for detecting a target in the one or more first candidate frames according to the fused feature.
  7. The apparatus according to claim 6, characterized in that the processing module is further used for:
    detecting the target in the input image according to the feature image of the input image, so as to obtain a size and a category of each candidate frame.
  8. The apparatus according to claim 7, characterized in that the processing module selecting one or more first candidate frames from the multiple candidate frames according to the confidence of each candidate frame comprises:
    determining a candidate frame whose confidence is greater than or equal to a first confidence threshold as the first candidate frame, and/or
    determining a candidate frame whose size is smaller than a first size threshold as the first candidate frame, and/or
    determining a candidate frame whose category is a first preset category as the first candidate frame.
  9. The apparatus according to claim 7, characterized in that the processing module is further used for:
    determining a candidate frame whose confidence is greater than or equal to a first confidence threshold and less than a second confidence threshold as the first candidate frame;
    determining a candidate frame whose confidence is greater than or equal to the second confidence threshold as a second candidate frame;
    outputting the target in the second candidate frame; and/or
    determining a candidate frame whose size is smaller than a first size threshold as the first candidate frame;
    determining a candidate frame whose size is greater than or equal to the first size threshold as a second candidate frame;
    outputting the target in the second candidate frame; and/or
    determining a candidate frame whose category is a first preset category as the first candidate frame;
    determining a candidate frame whose category is a second preset category as a second candidate frame;
    outputting the target in the second candidate frame.
  10. The apparatus according to any one of claims 6 to 9, characterized in that the processing module is further used for:
    screening the multiple candidate frames according to a preset number, so as to obtain the preset number of candidate frames.
  11. A target detection apparatus, characterized by comprising: a processor and a transmission interface,
    the processor being used for executing a program stored in a memory, so as to execute the target detection method according to any one of claims 1 to 5.
  12. A computer-readable storage medium, characterized in that the computer-readable medium stores a program which, when run on a computer or a processor, causes the computer or the processor to execute the method according to any one of claims 1 to 5.
  13. A computer program product, characterized in that the computer program product comprises instructions which, when run on a computer or a processor, cause the computer or the processor to execute the method according to any one of claims 1 to 5.
PCT/CN2021/087584 2021-04-15 2021-04-15 Target detection method and apparatus WO2022217551A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180096547.4A CN117203678A (zh) 2021-04-15 2021-04-15 Target detection method and apparatus
PCT/CN2021/087584 WO2022217551A1 (zh) 2021-04-15 2021-04-15 Target detection method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/087584 WO2022217551A1 (zh) 2021-04-15 2021-04-15 Target detection method and apparatus

Publications (1)

Publication Number Publication Date
WO2022217551A1 true WO2022217551A1 (zh) 2022-10-20

Family

ID=83640001

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/087584 WO2022217551A1 (zh) 2021-04-15 2021-04-15 目标检测方法和装置

Country Status (2)

Country Link
CN (1) CN117203678A (zh)
WO (1) WO2022217551A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268869A (zh) * 2018-02-13 2018-07-10 Beijing Kuangshi Technology Co., Ltd. (Megvii) Target detection method, apparatus and system
CN110348453A (zh) * 2018-04-04 2019-10-18 Shanghai Advanced Research Institute, Chinese Academy of Sciences Cascade-based object detection method and system, storage medium and terminal
CN111291717A (zh) * 2020-02-28 2020-06-16 Shenzhen Qianhai WeBank Co., Ltd. Image-based object detection method, apparatus, device and readable storage medium
CN111666854A (zh) * 2020-05-29 2020-09-15 Wuhan University High-resolution SAR image vehicle target detection method fusing statistical saliency
US20210012127A1 (en) * 2018-09-27 2021-01-14 Beijing Sensetime Technology Development Co., Ltd. Action recognition method and apparatus, driving action analysis method and apparatus, and storage medium

Also Published As

Publication number Publication date
CN117203678A (zh) 2023-12-08

Similar Documents

Publication Publication Date Title
EP3916628A1 Object identification method and device
WO2020253416A1 Object detection method and apparatus, and computer storage medium
WO2021043112A1 Image classification method and apparatus
WO2021218786A1 Data processing system, object detection method and apparatus therefor
US9959468B2 Systems and methods for object tracking and classification
WO2022012158A1 Target determination method and target determination apparatus
CN111401517B Perception network structure search method and apparatus therefor
CN112132156A Image salient object detection method and system with multi-depth feature fusion
JP2016062610A Feature model generation method and feature model generation apparatus
Lyu et al. Small object recognition algorithm of grain pests based on SSD feature fusion
CN111931764A Target detection method, target detection framework and related device
CN111368972A Convolutional layer quantization method and apparatus therefor
WO2022206414A1 Three-dimensional target detection method and apparatus
CN110909656B Pedestrian detection method and system based on radar and camera fusion
CN114972182A Object detection method and apparatus therefor
Nazeer et al. Real time object detection and recognition in machine learning using jetson nano
Chen et al. Pyramid attention object detection network with multi-scale feature fusion
CN111062311B Pedestrian gesture recognition and interaction method based on depthwise separable convolutional network
WO2022217551A1 Target detection method and apparatus
Wu et al. Research on asphalt pavement disease detection based on improved YOLOv5s
EP4296896A1 Perceptual network and data processing method
CN112446292B 2D image salient object detection method and system
CN114627183A Laser point cloud 3D target detection method
CN113420660A Infrared image target detection model construction method, prediction method and system
CN113256556A Image selection method and apparatus

Legal Events

Code  Description
121   EP: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 21936432; country of ref document: EP; kind code of ref document: A1)
WWE   WIPO information: entry into national phase (ref document number: 202180096547.4; country of ref document: CN)
NENP  Non-entry into the national phase (ref country code: DE)
122   EP: PCT application non-entry in European phase (ref document number: 21936432; country of ref document: EP; kind code of ref document: A1)