CN114005017A - Target detection method and device, electronic equipment and storage medium


Info

Publication number
CN114005017A
Authority
CN
China
Prior art keywords
image, detected, scale features, features corresponding, scale
Prior art date
Legal status
Pending
Application number
CN202111101089.7A
Other languages
Chinese (zh)
Inventor
谌强
Current Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd and Beijing Megvii Technology Co Ltd
Priority to CN202111101089.7A
Publication of CN114005017A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the application provides a target detection method and apparatus. The method comprises the following steps: acquiring an image to be detected; performing feature extraction on the image to be detected to obtain a single-scale feature corresponding to the image to be detected; performing dilated convolution processing based on the single-scale feature to obtain multi-scale features corresponding to the image to be detected; and predicting a detection result of the image to be detected based on the multi-scale features, wherein the detection result comprises: the type of an object in the image to be detected and/or the position of the object in the image to be detected. Target detection with higher precision is thereby performed at a higher detection speed.

Description

Target detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of neural networks, and in particular, to a target detection method, apparatus, electronic device, and storage medium.
Background
At present, targets such as objects and human bodies are generally detected using a detection network. Target detection based on a single-scale feature map is fast but has low precision. Compared with a single-scale feature map, a multi-scale feature map contains more features related to an object, so target detection based on multi-scale feature maps achieves higher precision; however, the data volume of multi-scale feature maps is large, which slows detection and makes such methods difficult to apply in scenarios with high real-time requirements. How to perform target detection with higher precision at a higher detection speed has therefore become a problem to be solved.
Disclosure of Invention
Embodiments of the present application provide a target detection method, an apparatus, an electronic device, a storage medium, and a computer program product, so as to implement high-precision target detection at a high detection speed.
The embodiment of the application provides a target detection method, which comprises the following steps:
acquiring an image to be detected;
performing feature extraction on the image to be detected to obtain a single-scale feature corresponding to the image to be detected;
performing dilated convolution processing based on the single-scale feature corresponding to the image to be detected to obtain multi-scale features corresponding to the image to be detected;
predicting a detection result of the image to be detected based on the multi-scale features, wherein the detection result comprises: the type of the object in the image to be detected and/or the position of the object in the image to be detected.
An embodiment of the present application provides a target detection apparatus, including:
an acquisition unit configured to acquire an image to be detected;
an extraction unit configured to perform feature extraction on the image to be detected to obtain a single-scale feature corresponding to the image to be detected;
a processing unit configured to perform dilated convolution processing based on the single-scale feature corresponding to the image to be detected to obtain multi-scale features corresponding to the image to be detected;
a detection unit configured to predict a detection result of the image to be detected based on the multi-scale features, the detection result including: the type of the object in the image to be detected and/or the position of the object in the image to be detected.
An embodiment of the present application provides an electronic device, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the above object detection method.
An embodiment of the present application provides a storage medium storing instructions that, when executed by a processor, implement the above target detection method.
An embodiment of the present application provides a computer program product comprising a computer program that, when executed by a processor, implements the above target detection method.
Embodiments of the present application provide a target detection method, apparatus, electronic device, storage medium, and computer program product. The method performs dilated convolution processing based on the single-scale feature corresponding to the image to be detected to obtain multi-scale features corresponding to the image, and predicts the detection result of the image based on those multi-scale features. Because the multi-scale features are obtained while the dilated convolution enlarges the receptive field of the corresponding feature maps, they are richer, and predicting the detection result from these richer multi-scale features makes the obtained result more accurate. Meanwhile, the method obtains the multi-scale features by directly processing the single-scale feature corresponding to the image to be detected; since the data volume of this single-scale feature is far smaller than that of a multi-scale feature map, the time consumed in obtaining the multi-scale features is reduced, the target detection speed is higher, and the method can be applied in scenarios with high real-time requirements. The target detection method provided by the application can therefore perform target detection with higher precision at a higher detection speed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart illustrating a target detection method provided in an embodiment of the present application;
Fig. 2 is a schematic diagram showing the structure of a residual module in the dilated encoder;
Fig. 3 is a block diagram illustrating the structure of a target detection apparatus provided in an embodiment of the present application;
Fig. 4 is a block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments in the present application, and the features of those embodiments, may be combined with each other in the absence of conflict. The present application will be described in detail below with reference to the embodiments and the accompanying drawings.
In recent years, research into artificial-intelligence technologies such as computer vision, deep learning, machine learning, image processing, and image recognition has made great progress. Artificial Intelligence (AI) is an emerging science and technology that studies and develops theories, methods, techniques, and application systems for simulating and extending human intelligence. AI is a comprehensive discipline involving many technical fields, such as chips, big data, cloud computing, the Internet of Things, distributed storage, deep learning, machine learning, and neural networks. Computer vision, an important branch of AI, enables machines to perceive the world; computer vision technology typically includes face recognition, liveness detection, fingerprint recognition and anti-counterfeiting verification, biometric recognition, face detection, pedestrian detection, target detection, pedestrian re-identification, image processing, image recognition, image semantic understanding, image retrieval, character recognition, video processing, video content recognition, behavior recognition, three-dimensional reconstruction, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), computational photography, and robot navigation and localization. With the research and progress of AI technology, these techniques have been applied in many fields, such as security, city management, traffic management, building management, park management, face-based access, face-based attendance, logistics management, warehouse management, robots, intelligent marketing, computational photography, mobile-phone imaging, cloud services, smart homes, wearable devices, unmanned and autonomous driving, smart healthcare, face-based payment, face unlocking, fingerprint unlocking, identity verification, smart screens, smart televisions, cameras, the mobile Internet, live webcasting, beauty applications, medical aesthetics, and intelligent temperature measurement.
Fig. 1 shows a flowchart of a target detection method provided in an embodiment of the present application. The method may be executed by a terminal device or a server, and includes:
Step 101, acquiring an image to be detected.
In the present application, the image to be detected can be captured by a camera on a terminal device such as a mobile terminal or a vehicle-mounted terminal.
Step 102, performing feature extraction on the image to be detected to obtain the single-scale feature corresponding to the image to be detected.
In the present application, a convolutional neural network for feature extraction can be used to extract the features of the image to be detected and obtain the single-scale feature corresponding to the image to be detected.
For example, the convolutional neural network may be the base network, i.e., the backbone network, of a target detection network such as Faster R-CNN. The image to be detected is input into this convolutional neural network, which outputs the single-scale feature corresponding to the image to be detected.
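For illustration only (the application does not mandate a framework or backbone), a minimal PyTorch sketch of step 102 might truncate a standard ResNet so that it emits one final-stage feature map; the backbone choice and input size here are assumptions:

```python
import torch
import torchvision

# Hypothetical backbone: a torchvision ResNet-50 truncated before its
# global pooling and classification layers, so it emits one single-scale
# feature map (the final stage) for the whole image.
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

image = torch.randn(1, 3, 608, 608)      # stand-in for the image to be detected
single_scale_feature = backbone(image)   # shape: (1, 2048, 19, 19)
```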
Step 103, performing dilated convolution processing based on the single-scale feature corresponding to the image to be detected to obtain the multi-scale features corresponding to the image to be detected.
In the present application, a convolutional layer for dilated convolution (also known as atrous or hole convolution) may be used to process the single-scale feature corresponding to the image to be detected and obtain the multi-scale features corresponding to the image to be detected. The convolutional layer may contain several dilated-convolution kernels, and each kernel may use a different dilation rate, so that performing dilated convolution with kernels of different dilation rates forms several different, enlarged receptive fields; the image to be detected is thereby perceived at multiple scales, forming multi-scale features. Each dilated-convolution kernel convolves each feature map in the single-scale feature of the image to be detected, yielding a corresponding dilated-convolved feature map, and all of the dilated-convolved feature maps obtained in this way together constitute the multi-scale features corresponding to the image to be detected.
Step 104, predicting the detection result of the image to be detected based on the multi-scale features corresponding to the image to be detected.
In the present application, the detection results include: the type of object in the image to be detected and/or the position of the object in the image to be detected.
When the type of the object in the image to be detected is needed, the multi-scale features corresponding to the image can be input into a classification network, which outputs the type of the object in the image. When the position of the object is needed, the multi-scale features can be input into a network for regressing object positions, which outputs the position of the object in the image to be detected.
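For illustration, a sketch of step 104 under assumed sizes (the class count, anchor count, and channel width are not specified by the application): a classification head predicts object types and a regression head predicts box positions from the multi-scale features.

```python
import torch.nn as nn

num_classes, num_anchors, channels = 80, 9, 512   # illustrative values only

# Classification branch: per-anchor class scores at every spatial location.
cls_head = nn.Conv2d(channels, num_anchors * num_classes, kernel_size=3, padding=1)
# Regression branch: per-anchor box offsets (x, y, w, h).
reg_head = nn.Conv2d(channels, num_anchors * 4, kernel_size=3, padding=1)

# class_logits = cls_head(multi_scale_features)
# box_deltas   = reg_head(multi_scale_features)
```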
In the present application, dilated convolution processing is performed based on the single-scale feature corresponding to the image to be detected to obtain the multi-scale features corresponding to the image, and the detection result of the image is predicted based on those multi-scale features. Because the multi-scale features are obtained while the dilated convolution enlarges the receptive field of the corresponding feature maps, they are richer, and predicting the detection result from these richer multi-scale features makes the obtained result more accurate. Meanwhile, the multi-scale features are obtained by directly processing the single-scale feature corresponding to the image to be detected, whose data volume is far smaller than that of a multi-scale feature map; the time consumed in obtaining the multi-scale features is therefore reduced, the target detection speed is higher, and the method can be applied in scenarios with high real-time requirements. The target detection method provided by the application can thus perform target detection with higher precision at a higher detection speed.
In some embodiments, performing dilated convolution processing based on the single-scale feature corresponding to the image to be detected to obtain the multi-scale features corresponding to the image comprises: performing dilated convolution processing multiple times on the single-scale feature to obtain the multi-scale features, wherein the input of the first dilated convolution pass is the single-scale feature corresponding to the image to be detected, and the input of each non-first pass is the result of the immediately preceding pass.
In the present application, multiple dilated convolution passes can thus be performed on the single-scale feature corresponding to the image to be detected to obtain the multi-scale features: the first pass processes the single-scale feature itself, and each subsequent pass processes the result of the pass before it.
The convolutional layer for dilated convolution may contain several dilated-convolution kernels, each of which may use a different dilation rate, so that performing dilated convolution with kernels of different dilation rates forms several different, enlarged receptive fields.
The input of the first pass is the data fed into the dilated convolutional layer when that pass is performed. In the first pass, the single-scale feature corresponding to the image to be detected is input into the dilated convolutional layer; each dilated-convolution kernel convolves each feature map in the single-scale feature to produce the corresponding dilated-convolved feature maps, and all of these feature maps together form the result output by the layer for the first pass.
The input of a non-first pass is the data fed into the dilated convolutional layer for that pass, namely the result of the immediately preceding pass; in other words, when a non-first pass is performed, the result of the preceding pass is input into the dilated convolutional layer. For example, for M ≠ 1, the result output by the layer in the (M-1)-th pass is used as the input of the M-th pass.
In the present application, the result output by the dilated convolutional layer in the last pass can be used as the multi-scale features corresponding to the image to be detected.
For example, if N dilated convolution passes are performed on the single-scale feature corresponding to the image to be detected, the result output by the layer in the N-th pass is used as the multi-scale features.
Performing multiple dilated convolution passes on the single-scale feature enlarges, several times over, the receptive fields of the feature maps produced while obtaining the multi-scale features. This increases the richness of the object-related features contained in the resulting multi-scale features and thus improves the accuracy of the detection result.
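The cascading rule can be summarized in a short sketch (all names are illustrative): the first pass consumes the single-scale feature, each later pass consumes the previous pass's result, and the last pass's output serves as the multi-scale features.

```python
def cascaded_dilated_passes(dilated_layers, single_scale_feature):
    # Pass 1 takes the single-scale feature; pass m takes the result of pass m-1.
    x = single_scale_feature
    for layer in dilated_layers:
        x = layer(x)
    return x  # the N-th pass's output: the multi-scale features
```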
In some embodiments, a dilated encoder performs the dilated convolution processing on the single-scale feature corresponding to the image to be detected, the dilated encoder comprising a plurality of cascaded residual modules. Correspondingly, performing multiple dilated convolution passes on the single-scale feature to obtain the multi-scale features comprises: performing the multiple passes through the plurality of cascaded residual modules, where each residual module performs one dilated convolution pass and the output of one residual module is the input of the next adjacent module.
The residual module in the dilated encoder has a dilated convolution capability and is obtained by modifying the original residual module of a ResNet network. The kernel size in the dilated convolutional layer may be 3x3, i.e., the dilated convolutional layer is a 3x3 convolutional layer; replacing the original 3x3 convolutional layer of the ResNet residual module with this dilated convolutional layer yields the residual module of the present application. The original ResNet residual module further includes a 1x1 convolutional layer before the original 3x3 layer and a 1x1 convolutional layer after it; accordingly, the residual module in the dilated encoder likewise includes a 1x1 convolutional layer before the dilated convolutional layer and a 1x1 convolutional layer after it.
When the plurality of cascaded residual modules perform multiple dilated convolution passes on the single-scale feature corresponding to the image to be detected, the N-th residual module in the dilated encoder performs the N-th pass. The input of the 1st residual module is the single-scale feature corresponding to the image to be detected; this feature is input into the 1st residual module, producing that module's output.
The output of one residual module serves as the input of the next adjacent module: for the N-th residual module in the dilated encoder, the preceding module is the (N-1)-th, and the N-th module is the next module adjacent to the (N-1)-th. For example, for the 2nd residual module, the preceding module is the 1st, and the 2nd is the next module adjacent to the 1st; the output of the 1st residual module is used as the input of the 2nd, and so on.
The output of the last residual module in the dilated encoder is used as the multi-scale features corresponding to the image to be detected: if the encoder contains N residual modules, the output of the N-th module is the multi-scale features.
Each residual module residually connects its input data with the output of its last convolutional layer to obtain the module's output; the residual connection is equivalent to fusing the module's input with the output of its last convolutional layer.
Performing multiple dilated convolution passes on the single-scale feature through the cascaded residual modules therefore enlarges, several times over, the receptive fields of the feature maps produced while obtaining the multi-scale features, improving the richness of the object-related features in the result; moreover, the residual connections prevent the dilated convolutions from losing some of the extracted local features, i.e., part of the features in each residual module's input.
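A sketch of such a residual module, assuming PyTorch (the channel widths are illustrative, not taken from the application): a 1x1 convolution, a 3x3 dilated convolution, a 1x1 convolution, and a residual connection that fuses the module's input with the last layer's output.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Modified ResNet-style bottleneck: the middle 3x3 convolution is dilated,
    and the block's input is added to the output of the last 1x1 layer."""
    def __init__(self, channels: int, mid_channels: int, dilation: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, mid_channels, kernel_size=1)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3,
                               padding=dilation, dilation=dilation)
        self.conv3 = nn.Conv2d(mid_channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        return x + out  # residual connection: fuse input with last layer's output
```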
In some embodiments, the dilation rates of the residual modules in the dilated encoder are different.
In the present application, the dilation rate corresponding to a residual module in the dilated encoder is the dilation rate used by that module's dilated convolutional layer. From the 1st residual module to the last, the dilation rates may increase, i.e., grow gradually.
The dilation rate determines the size of the receptive field, and a given receptive field is suited to extracting features associated with objects of a certain size class.
Because the dilation rates of the residual modules differ, receptive fields of several different sizes are formed. Through dilated convolution, features related to objects of each size class can then be extracted under the receptive field suited to that class, so that reasonably accurate features are obtained for objects of different size classes, e.g., small, medium, and large objects, enabling more accurate detection.
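Building on the DilatedResidualBlock sketched above, the encoder can be expressed as a cascade whose dilation rates grow from the first block to the last; the rates (2, 4, 6, 8) and channel widths here are assumptions for illustration.

```python
import torch.nn as nn

class DilatedEncoder(nn.Module):
    def __init__(self, channels=512, mid_channels=128, dilation_rates=(2, 4, 6, 8)):
        super().__init__()
        # One residual module per dilation rate; each rate yields a different
        # receptive field suited to a different object size class.
        self.blocks = nn.Sequential(*[
            DilatedResidualBlock(channels, mid_channels, rate)
            for rate in dilation_rates
        ])

    def forward(self, single_scale_feature):
        # The last block's output is the multi-scale features.
        return self.blocks(single_scale_feature)
```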
In some embodiments, performing dilated convolution processing based on the single-scale feature corresponding to the image to be detected to obtain the multi-scale features comprises: performing feature mapping on the single-scale feature based on the detection task corresponding to the image to be detected to obtain the mapped single-scale feature, wherein the detection task includes a task of detecting the type of the object in the image and/or a task of detecting the position of the object in the image; and performing dilated convolution processing on the mapped single-scale feature to obtain the multi-scale features corresponding to the image to be detected.
In the present application, feature mapping is performed on the single-scale feature based on the detection task corresponding to the image to be detected, producing the mapped single-scale feature. The mapped single-scale feature is the portion of the single-scale feature that is suited to performing the detection task; accordingly, it comprises single-scale features suited to detecting the type of the object in the image and/or single-scale features suited to detecting the position of the object in the image.
A convolutional layer for feature mapping, for example a 3x3 convolutional layer like the one in the RPN module of the Faster R-CNN network, may perform the task-specific feature mapping on the single-scale feature to obtain the mapped single-scale feature.
When the mapped single-scale feature is subjected to dilated convolution processing to obtain the multi-scale features, the dilated convolutional layer described above may be used; it may contain several dilated-convolution kernels, each with a possibly different dilation rate.
The process of applying the dilated convolutional layer to the mapped single-scale feature is the same as applying it to the unmapped single-scale feature in step 103; the only difference is the input. In step 103 the layer's input is the single-scale feature corresponding to the image to be detected, whereas here it is the mapped single-scale feature.
The data volume of the mapped single-scale feature is smaller than that of the original single-scale feature, so compared with performing dilated convolution on the original single-scale feature in step 103, the computation required is reduced, the multi-scale features are obtained faster, and target detection is accordingly accelerated.
The mapped single-scale feature may also be subjected to multiple dilated convolution passes through the dilated convolutional layer to obtain the multi-scale features. The process is the same as performing multiple passes on the unmapped single-scale feature, except that the input of the first pass is the mapped single-scale feature rather than the original one.
Likewise, the dilated encoder may perform the dilated convolution processing on the mapped single-scale feature to obtain the multi-scale features. The process is the same as the dilated encoder processing the unmapped single-scale feature, except for the data fed into the encoder: when processing the unmapped single-scale feature, the input of the 1st residual module is the single-scale feature corresponding to the image to be detected, whereas when processing the mapped single-scale feature, the input of the 1st residual module is the mapped single-scale feature.
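A sketch of this embodiment (channel counts assumed): a task-oriented 3x3 projection first maps the single-scale feature to a smaller representation, and only then does the dilated encoder run, reducing the cost of the dilated convolutions.

```python
import torch.nn as nn

projection = nn.Conv2d(2048, 512, kernel_size=3, padding=1)  # feature mapping
encoder = DilatedEncoder(channels=512)                       # from the sketch above

# mapped = projection(single_scale_feature)    # smaller data volume
# multi_scale_features = encoder(mapped)
```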
Fig. 2 shows a schematic structural diagram of a residual module in the dilated encoder.
Fig. 2 illustrates the structure of one residual module; every residual module in the dilated encoder has the same structure. The residual module includes a 1x1 convolutional layer 201, a dilated convolutional layer 202, and a 1x1 convolutional layer 203. When the module performs one dilated convolution pass, its input data 205 is processed by the 1x1 convolutional layer 201; the output of layer 201 is the input of the dilated convolutional layer 202, which performs dilated convolution on it to produce its output; the output of layer 202 is the input of the 1x1 convolutional layer 203, which processes it to produce the output 204. Within the module, the output 204 of the 1x1 convolutional layer 203 is residually connected (206) with the module's input data 205 to obtain the module's output.
In some embodiments, the target detection method is performed by a detection network, and before the image to be detected is acquired the method further includes: for each labeling box in a sample image, determining, among all anchor boxes generated for the sample image, a preset number of anchor boxes having the largest Intersection over Union (IoU) with the labeling box; determining these anchor boxes as positive samples corresponding to the labeling box, and determining the anchor boxes other than the positive samples as negative samples; and training the detection network based on the positive samples corresponding to each labeling box and all of the negative samples.
In the present application, the target detection method may be performed by a detection network that includes a module corresponding to each of the above steps, each module being configured to perform its step. Before the image to be detected is acquired, the detection network is trained on sample images. The detection network generates anchor boxes for a sample image in the same way as any existing neural network for target detection.
For each labeling box in the sample image, all anchor boxes generated by the detection network for the sample image can be sorted in descending order of their IoU with the labeling box, and each of the first preset number of anchor boxes in the sorted order is determined as a positive sample corresponding to that labeling box.
Suppose the preset number is k. For each labeling box, the IoU between the labeling box and every anchor box generated for the sample image is computed, all anchor boxes are sorted in descending order of this IoU, and each of the first k anchor boxes after sorting is determined as a positive sample corresponding to the labeling box.
After the positive samples corresponding to every labeling box have been determined, all positive samples are known, and every anchor box that is not a positive sample is determined as a negative sample. The detection network may then be trained using the positive samples corresponding to each labeling box and all of the negative samples.
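The labeling rule can be sketched as follows, assuming an IoU matrix has already been computed (all names are illustrative): for each labeling box, the k anchors with the largest IoU become its positives, and every remaining anchor becomes a negative.

```python
import torch

def assign_topk_positives(ious: torch.Tensor, k: int):
    """ious: (num_gt, num_anchors) IoU matrix between labeling boxes and anchors.
    Returns the per-box top-k positive anchor indices and a negative mask."""
    topk_idx = ious.topk(k, dim=1).indices            # (num_gt, k) positives per box
    positive_mask = torch.zeros(ious.shape[1], dtype=torch.bool)
    positive_mask[topk_idx.flatten()] = True
    negative_mask = ~positive_mask                    # everything else is negative
    return topk_idx, negative_mask
```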
When training a detection network for target detection, the role of a positive sample is to make the network learn the features of the object associated with that positive sample.
Because the number of positive samples for each labeling box in the sample image is a preset number, the positive-sample counts of the labeling boxes are balanced, which avoids the following situations and helps ensure a good training result. If positive samples were determined solely by whether the IoU exceeds a threshold, one labeling box might have far more positive samples than the others, causing the detection network to overfit when learning the features of the object related to those positives; conversely, a labeling box might have too few positive samples relative to the others, making it difficult for the network to learn that object's features sufficiently.
In some embodiments, training the detection network using the positive and negative samples corresponding to each labeling box includes: for each labeling box, determining, among all positive samples corresponding to the labeling box, the target positive samples whose IoU with the labeling box is greater than a first threshold, and determining, among all negative samples, the target negative samples whose IoU with every labeling box is less than a second threshold, the first threshold being greater than the second threshold; and training the detection network based on the target positive samples corresponding to each labeling box and all of the target negative samples.
In the present application, for each labeling box, the positive samples whose IoU with the labeling box exceeds the first threshold may be determined as the target positive samples for that box, so that positive samples with a small IoU with the box do not participate in training. Among all negative samples, those whose IoU with every labeling box is below the second threshold may be determined as target negative samples, so that negative samples with a large IoU with at least one labeling box do not participate in training.
Training the detection network only with the target positive samples corresponding to each labeling box and all of the target negative samples, rather than directly with all positive and negative samples, avoids the situation where positive samples with a small IoU with their labeling box and/or negative samples with a large IoU with some labeling box prevent the network's parameters from converging to a local minimum; it also reduces the number of samples participating in training and thus the time the training consumes.
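Continuing the previous sketch, the refinement might look as follows (the thresholds 0.7 and 0.3 are assumptions, not values from the application): target positives must exceed the first threshold with their own labeling box, and target negatives must stay below the second threshold with every labeling box.

```python
def select_target_samples(ious, topk_idx, negative_mask, t_pos=0.7, t_neg=0.3):
    target_positives = []
    for gt, anchors in enumerate(topk_idx):            # one labeling box at a time
        keep = anchors[ious[gt, anchors] > t_pos]      # IoU with this box > t_pos
        target_positives.append(keep)
    # A negative is kept only if its IoU with every labeling box is below t_neg.
    target_negatives = negative_mask & (ious.max(dim=0).values < t_neg)
    return target_positives, target_negatives
```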
Fig. 3 shows a block diagram of a target detection apparatus provided in an embodiment of the present application. The target detection apparatus includes: an acquisition unit 301, an extraction unit 302, a processing unit 303, and a detection unit 304.
The acquisition unit 301 is configured to acquire an image to be detected;
the extraction unit 302 is configured to perform feature extraction on an image to be detected to obtain a single-scale feature corresponding to the image to be detected;
the processing unit 303 is configured to perform dilated convolution processing based on the single-scale feature corresponding to the image to be detected to obtain the multi-scale features corresponding to the image to be detected;
the detection unit 304 is configured to predict a detection result of the image to be detected based on the multi-scale features, the detection result including: the type of the object in the image to be detected and/or the position of the object in the image to be detected.
The processing unit 303 is further configured to perform multiple dilated convolution passes on the single-scale feature corresponding to the image to be detected to obtain the multi-scale features, where the input of the first pass is the single-scale feature corresponding to the image to be detected and the input of each non-first pass is the result of the immediately preceding pass.
The processing unit 303 is further configured to perform the dilated convolution processing through a dilated encoder comprising a plurality of cascaded residual modules: the multiple dilated convolution passes are performed on the single-scale feature through the cascaded residual modules to obtain the multi-scale features, where each residual module performs one pass and the output of one residual module is the input of the next adjacent module.
In some embodiments, the dilation rates of the respective residual modules are different.
In some embodiments, the processing unit 303 is further configured to perform feature mapping on the single-scale feature corresponding to the image to be detected based on the detection task corresponding to the image, so as to obtain the mapped single-scale feature, where the detection task includes: a task of detecting the type of the object in the image to be detected and/or a task of detecting the position of the object in the image to be detected; and to perform dilated convolution processing on the mapped single-scale feature to obtain the multi-scale features corresponding to the image to be detected.
In some embodiments, the target detection method is performed by a detection network, and the target detection apparatus further includes a training unit configured to: before the image to be detected is acquired, determine, for each labeling box in a sample image, a preset number of anchor boxes having the largest IoU with the labeling box among all anchor boxes generated for the sample image; determine these anchor boxes as positive samples corresponding to the labeling box and the remaining anchor boxes as negative samples; and train the detection network based on the positive and negative samples corresponding to each labeling box.
In some embodiments, the training unit is further configured to: for each labeling box, determine, among all positive samples corresponding to the labeling box, the target positive samples whose IoU with the labeling box is greater than a first threshold, and determine, among all negative samples, the target negative samples whose IoU with every labeling box is less than a second threshold, the first threshold being greater than the second threshold; and train the detection network based on the target positive and target negative samples corresponding to each labeling box.
Any step, and any specific operation within a step, in the embodiments of the target detection method provided by the present application may be performed by the corresponding unit of the target detection apparatus. For the operations performed by the units of the apparatus, refer to the corresponding operations described in the embodiments of the target detection method.
When target detection is performed by the target detection apparatus, dilated convolution processing is performed based on the single-scale feature corresponding to the image to be detected to obtain the multi-scale features corresponding to the image, and the detection result is predicted based on those multi-scale features. Because the multi-scale features are obtained while the dilated convolution enlarges the receptive field of the corresponding feature maps, they are richer, and predicting the detection result from these richer features makes the obtained result more accurate. Meanwhile, the apparatus obtains the multi-scale features by directly processing the single-scale feature, whose data volume is far smaller than that of a multi-scale feature map; the time consumed in obtaining the multi-scale features is reduced, the target detection speed is higher, and the apparatus can be applied in scenarios with high real-time requirements. Target detection with higher precision can thus be performed at a higher detection speed.
Fig. 4 is a block diagram of an electronic device provided in this embodiment. The electronic device includes a processing component 422 that further includes one or more processors, and memory resources, represented by memory 432, for storing instructions, such as application programs, that are executable by the processing component 422. The application programs stored in memory 432 may include one or more modules that each correspond to a set of instructions. Further, the processing component 422 is configured to execute instructions to perform the above-described methods.
The electronic device may also include a power component 426 configured to perform power management of the electronic device, a wired or wireless network interface 450 configured to connect the electronic device to a network, and an input/output (I/O) interface 458. The electronic device may operate based on an operating system stored in the memory 432, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, there is also provided a storage medium comprising instructions, such as a memory comprising instructions, executable by an electronic device to perform the above-described target detection method. Alternatively, the storage medium may be a non-transitory computer-readable storage medium, for example a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, there is also provided a computer program product comprising computer-readable code which, when run on an electronic device, causes the electronic device to perform the above target detection method.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (11)

1. A target detection method, the method comprising:
acquiring an image to be detected;
performing feature extraction on the image to be detected to obtain a single-scale feature corresponding to the image to be detected;
performing dilated convolution processing based on the single-scale feature corresponding to the image to be detected to obtain multi-scale features corresponding to the image to be detected;
predicting a detection result of the image to be detected based on the multi-scale features, wherein the detection result comprises: the type of an object in the image to be detected and/or the position of the object in the image to be detected.
2. The method according to claim 1, wherein performing dilated convolution processing based on the single-scale feature corresponding to the image to be detected to obtain the multi-scale features corresponding to the image to be detected comprises:
performing dilated convolution processing multiple times on the single-scale feature corresponding to the image to be detected to obtain the multi-scale features, wherein the input of the first dilated convolution pass is the single-scale feature corresponding to the image to be detected, and the input of each non-first pass is the result of the immediately preceding pass.
3. The method of claim 2, wherein a dilated encoder performs the dilated convolution processing on the single-scale feature corresponding to the image to be detected, the dilated encoder comprising a plurality of cascaded residual modules;
correspondingly, performing dilated convolution processing multiple times on the single-scale feature corresponding to the image to be detected to obtain the multi-scale features comprises:
performing the multiple dilated convolution passes on the single-scale feature through the plurality of cascaded residual modules to obtain the multi-scale features,
wherein each residual module performs one dilated convolution pass, and the output of one residual module serves as the input of the next adjacent residual module.
4. The method of claim 3, wherein the dilation rates of the residual modules are different.
5. The method according to any one of claims 1 to 4, wherein performing a cavity convolution process based on the single-scale features corresponding to the image to be detected to obtain the multi-scale features corresponding to the image to be detected comprises:
based on the detection task corresponding to the image to be detected, performing feature mapping on the single-scale features corresponding to the image to be detected to obtain the mapped single-scale features corresponding to the image to be detected, wherein the detection task comprises: a task for detecting the type of the object in the image to be detected and/or a task for detecting the position of the object in the image to be detected;
performing dilated convolution processing on the mapped single-scale features corresponding to the image to be detected to obtain the multi-scale features corresponding to the image to be detected.
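Claim 5 leaves the form of the task-dependent mapping open; one plausible reading, sketched here with hypothetical 1x1 convolutions, applies a separate mapping per detection task before the shared dilated-convolution stage.

```python
import torch.nn as nn

class TaskSpecificMapping(nn.Module):
    # Maps the single-scale features once per detection task (object
    # category vs. object position) before dilated convolution processing.
    # The 1x1-convolution form of the mapping is an assumption.
    def __init__(self, channels):
        super().__init__()
        self.category_map = nn.Conv2d(channels, channels, kernel_size=1)
        self.position_map = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, single_scale):
        return self.category_map(single_scale), self.position_map(single_scale)
```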
6. The method according to any one of claims 1 to 5, characterized in that the object detection method is performed by a detection network, and the method further comprises, before acquiring the image to be detected:
for each labeled box in a sample image, determining, among all anchor boxes generated for the sample image, a preset number of anchor boxes having the largest intersection-over-union (IoU) with the labeled box;
determining the preset number of anchor boxes as positive samples corresponding to the labeled box, and determining the anchor boxes other than the positive samples among all the anchor boxes as negative samples;
training the detection network based on the positive samples and negative samples corresponding to each labeled box.
7. The method of claim 6, wherein training the detection network based on the positive samples and negative samples corresponding to each labeled box comprises:
for each labeled box, determining, among all positive samples corresponding to the labeled box, target positive samples whose IoU with the labeled box is greater than a first threshold, and determining, among all negative samples, target negative samples whose IoU with every labeled box is less than a second threshold, wherein the first threshold is greater than the second threshold;
training the detection network based on the target positive samples and target negative samples corresponding to each labeled box.
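The sample-selection rule of claims 6 and 7 can be made concrete as follows; the function name, tensor layout, and threshold handling are assumptions for illustration, with `anchors_iou[g, a]` holding the IoU between labeled box g and anchor a.

```python
import torch

def select_training_samples(anchors_iou, k, pos_thresh, neg_thresh):
    # anchors_iou: (num_labeled_boxes, num_anchors) IoU matrix.
    # Returns boolean masks over anchors for target positives / negatives.
    assert pos_thresh > neg_thresh  # first threshold > second threshold
    num_boxes, num_anchors = anchors_iou.shape
    topk = anchors_iou.topk(k, dim=1)  # per labeled box: k largest IoUs

    pos_mask = torch.zeros(num_anchors, dtype=torch.bool)
    target_pos = torch.zeros(num_anchors, dtype=torch.bool)
    for g in range(num_boxes):
        idx = topk.indices[g]
        pos_mask[idx] = True  # claim 6: top-k IoU anchors are positives
        # claim 7: keep positives whose IoU with their labeled box
        # exceeds the first threshold.
        target_pos[idx[topk.values[g] > pos_thresh]] = True

    # claim 6: all remaining anchors are negatives; claim 7: keep negatives
    # whose IoU with every labeled box is below the second threshold.
    neg_mask = ~pos_mask
    best_iou = anchors_iou.max(dim=0).values
    target_neg = neg_mask & (best_iou < neg_thresh)
    return target_pos, target_neg
```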
8. An object detection apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire an image to be detected;
the extraction unit is configured to perform feature extraction on the image to be detected to obtain a single-scale feature corresponding to the image to be detected;
the processing unit is configured to perform dilated convolution processing based on the single-scale features corresponding to the image to be detected to obtain multi-scale features corresponding to the image to be detected;
a detection unit configured to predict a detection result of the image to be detected based on the multi-scale features, the detection result including: the type of the object in the image to be detected and/or the position of the object in the image to be detected.
9. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 7.
10. A storage medium having instructions stored therein which, when executed by a processor, implement the method of any one of claims 1 to 7.
11. A computer program product, characterized in that the computer program product comprises a computer program which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
CN202111101089.7A 2021-09-18 2021-09-18 Target detection method and device, electronic equipment and storage medium Pending CN114005017A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111101089.7A CN114005017A (en) 2021-09-18 2021-09-18 Target detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111101089.7A CN114005017A (en) 2021-09-18 2021-09-18 Target detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114005017A (en) 2022-02-01

Family

ID=79922160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111101089.7A Pending CN114005017A (en) 2021-09-18 2021-09-18 Target detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114005017A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743023A (en) * 2022-06-14 2022-07-12 安徽大学 Wheat spider image detection method based on RetinaNet model

Similar Documents

Publication Publication Date Title
CN111126258B (en) Image recognition method and related device
CN110909651B (en) Method, device and equipment for identifying video main body characters and readable storage medium
CN114049512A (en) Model distillation method, target detection method and device and electronic equipment
CN112801236B (en) Image recognition model migration method, device, equipment and storage medium
CN113052295B (en) Training method of neural network, object detection method, device and equipment
CN115546576A (en) Method and device for establishing prediction model
CN111444850A (en) Picture detection method and related device
CN111539456A (en) Target identification method and device
CN114359618A (en) Training method of neural network model, electronic equipment and computer program product
CN113870254A (en) Target object detection method and device, electronic equipment and storage medium
CN114005017A (en) Target detection method and device, electronic equipment and storage medium
CN113673505A (en) Example segmentation model training method, device and system and storage medium
CN113673308A (en) Object identification method, device and electronic system
CN116580232A (en) Automatic image labeling method and system and electronic equipment
CN116468895A (en) Similarity matrix guided few-sample semantic segmentation method and system
CN112991280B (en) Visual detection method, visual detection system and electronic equipment
CN114385846A (en) Image classification method, electronic device, storage medium and program product
CN114387496A (en) Target detection method and electronic equipment
CN116777814A (en) Image processing method, apparatus, computer device, storage medium, and program product
CN114387465A (en) Image recognition method and device, electronic equipment and computer readable medium
CN114373071A (en) Target detection method and device and electronic equipment
CN114743043B (en) Image classification method, electronic device, storage medium and program product
CN111753625B (en) Pedestrian detection method, device, equipment and medium
CN115841605A (en) Target detection network training and target detection method, electronic device and storage medium
CN114821150A (en) Image classification method, electronic device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination