CN114387498A - Target detection method and device, electronic equipment and storage medium - Google Patents

Target detection method and device, electronic equipment and storage medium

Info

Publication number
CN114387498A
Authority
CN
China
Prior art keywords
detection frame
detection
frame
target
feature
Prior art date
Legal status
Pending
Application number
CN202111676609.7A
Other languages
Chinese (zh)
Inventor
张宇昂
郑安林
Current Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd, Beijing Megvii Technology Co Ltd filed Critical Beijing Kuangshi Technology Co Ltd
Priority to CN202111676609.7A priority Critical patent/CN114387498A/en
Publication of CN114387498A publication Critical patent/CN114387498A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Abstract

The embodiment of the application provides a target detection method and a target detection device. The method comprises: acquiring an image to be detected; and inputting the image to be detected into a target detection network to obtain a target detection frame. The target detection network is used for: for each first detection frame among a plurality of detection frames of the image to be detected, obtaining a relative position feature of the first detection frame based on the data, for the first detection frame, of each related second detection frame of the first detection frame; for each first detection frame, obtaining an updated feature of the first detection frame based on the relative position feature of the first detection frame and the current feature of the first detection frame, and determining a corrected confidence of the first detection frame based on the updated feature of the first detection frame; and determining the target detection frame from the plurality of detection frames based on the corrected confidence of each first detection frame and the current confidence of each second detection frame.

Description

Target detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of neural networks, and in particular, to a target detection method, apparatus, electronic device, and storage medium.
Background
Currently, when a target detection network that realizes end-to-end target detection is used for target detection, whether a detection frame matches a target is determined directly from the confidence of the detection frame predicted by the target detection network.
However, a detection frame that actually encloses a target may have a low confidence, because global feature interaction between this detection frame and other detection frames during target detection lowers the accuracy of the feature used to predict it. Since the confidence of the detection frame that actually encloses the target is low, the target detection network decides that this detection frame does not match the target and instead takes a detection frame with a high confidence as the detection frame matched with the target, which leads to errors in the detection frames determined by target detection.
Disclosure of Invention
The embodiment of the application provides a target detection method and device, electronic equipment and a storage medium.
The embodiment of the application provides a target detection method, which comprises the following steps:
acquiring an image to be detected;
inputting the image to be detected into a target detection network to obtain a target detection frame, wherein the target detection network is used for:
for each first detection frame among a plurality of detection frames of the image to be detected, obtaining a relative position feature of the first detection frame based on the data, for the first detection frame, of each related second detection frame of the first detection frame, wherein the current confidence of the first detection frame is smaller than or equal to a first confidence threshold, a second detection frame is a detection frame, among the plurality of detection frames, whose current confidence is greater than the first confidence threshold, and the data of a related second detection frame for the first detection frame indicates the association relationship between that related second detection frame and the first detection frame;
for each first detection frame, obtaining an updated feature of the first detection frame based on the relative position feature of the first detection frame and the current feature of the first detection frame, wherein the current feature is extracted from the first detection frame by the target detection network; determining a corrected confidence level of the first detection frame based on the updated features of the first detection frame;
and determining the target detection frame from the plurality of detection frames based on the corrected confidence coefficient of each first detection frame and the current confidence coefficient of each second detection frame.
An embodiment of the present application provides a target detection apparatus, including:
an acquisition unit configured to acquire an image to be detected;
a detection unit configured to input the image to be detected into a target detection network to obtain a target detection frame, wherein the target detection network is configured to: for each first detection frame among a plurality of detection frames of the image to be detected, obtain a relative position feature of the first detection frame based on the data, for the first detection frame, of each related second detection frame of the first detection frame, wherein the current confidence of the first detection frame is smaller than or equal to a first confidence threshold, a second detection frame is a detection frame, among the plurality of detection frames, whose current confidence is greater than the first confidence threshold, and the data of a related second detection frame for the first detection frame indicates the association relationship between that related second detection frame and the first detection frame; for each first detection frame, obtain an updated feature of the first detection frame based on the relative position feature of the first detection frame and the current feature of the first detection frame, wherein the current feature is extracted from the first detection frame by the target detection network; determine a corrected confidence of the first detection frame based on the updated feature of the first detection frame; and determine the target detection frame from the plurality of detection frames based on the corrected confidence of each first detection frame and the current confidence of each second detection frame.
An embodiment of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory, the processor executing the computer program to implement the object detection method described above.
Embodiments of the present application provide a computer-readable storage medium on which a computer program/instructions are stored, which, when executed by a processor, implement the above-mentioned object detection method.
Embodiments of the present application provide a computer program product, which includes a computer program/instruction, and the computer program/instruction, when executed by a processor, implement the above object detection method.
According to the target detection method and device provided by the embodiment of the application, for a first detection frame whose current confidence is smaller than or equal to a first confidence threshold, an updated feature of the first detection frame is obtained from the data, for the first detection frame, of each related second detection frame of the first detection frame together with the current feature of the first detection frame, and a corrected confidence of the first detection frame is determined based on the updated feature, which raises the confidence of a first detection frame that should match a target in the image to be detected; the target detection frame is then determined from the plurality of detection frames based on the corrected confidence of each first detection frame and the current confidence of each second detection frame. This avoids the situation in which whether a detection frame matches a target is decided directly from the confidence predicted by the target detection network and a detection frame with a low confidence causes the determined target detection frame to be wrong, and it improves the accuracy of the target detection frames determined by target detection.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart illustrating a target detection method provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating a process of obtaining an updated feature of a first detection box;
fig. 3 shows a block diagram of a target detection apparatus according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows a flowchart of a target detection method provided in an embodiment of the present application, where the method includes:
step 101, obtaining an image to be detected.
In the present application, an image to be detected is an image that needs to be detected for an object such as a pedestrian or a vehicle included therein.
And 102, inputting the image to be detected into a target detection network to obtain a target detection frame.
In this application, the target detection network is configured to: for each first detection frame among a plurality of detection frames of the image to be detected, obtain a relative position feature of the first detection frame based on the data, for the first detection frame, of each related second detection frame of the first detection frame, wherein the current confidence of the first detection frame is smaller than or equal to a first confidence threshold, a second detection frame is a detection frame, among the plurality of detection frames, whose current confidence is greater than the first confidence threshold, and the data of a related second detection frame of the first detection frame for the first detection frame indicates the association relationship between that related second detection frame and the first detection frame; for each first detection frame, obtain an updated feature of the first detection frame based on the relative position feature of the first detection frame and the current feature of the first detection frame, wherein the current feature of the first detection frame is extracted from the first detection frame by the target detection network; determine a corrected confidence of the first detection frame based on the updated feature of the first detection frame; and determine the target detection frame from the plurality of detection frames based on the corrected confidence of each first detection frame and the current confidence of each second detection frame.
In the present application, the target detection network may be a network that realizes end-to-end target detection, such as the sparse region convolutional neural network (Sparse R-CNN) or Deformable DETR.
In this application, a plurality of detection frames of the image to be detected may be: a plurality of detection frames predicted by the target detection network at the end of a certain stage of target detection of the target detection network for the image to be detected.
For example, if the target detection network is a sparse region convolutional neural network whose target detection of an input image comprises 6 detection stages, the plurality of detection frames predicted at the end of a certain stage, for example the 5th detection stage, of target detection of the image to be detected may be used as the plurality of detection frames of the image to be detected.
For each of the plurality of detection boxes, the current confidence of the detection box may be: the confidence level of the detection frame predicted by the target detection network at the end of a certain phase.
In the present application, the current confidence of the first detection box is less than or equal to the first confidence threshold.
Optionally, the first confidence threshold may be 0.7.
In this application, the second detection frame is a detection frame, among the plurality of detection frames, whose current confidence is greater than the first confidence threshold.
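As an illustrative sketch of this partition (the function name, tensor layout and PyTorch usage are assumptions for illustration, not details fixed by the application), the plurality of detection frames can be split into first and second detection frames by comparing their current confidences with the first confidence threshold:

```python
import torch

def split_detections(boxes, confidences, conf_threshold=0.7):
    """Split detection frames into low-confidence (first) and high-confidence
    (second) groups according to the first confidence threshold."""
    # boxes: (N, 4) tensor of detection frames, confidences: (N,) tensor
    is_second = confidences > conf_threshold   # second detection frames
    first_boxes = boxes[~is_second]            # frames whose confidence will be corrected
    second_boxes = boxes[is_second]            # reference frames
    return first_boxes, second_boxes
```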
In this application, for a first detection frame, the area occupied by the associated second detection frame of the first detection frame overlaps with the area occupied by the first detection frame.
For a first detection frame and a second detection frame, if the area occupied by the second detection frame overlaps with the area occupied by the first detection frame, the area belonging to the area occupied by the first detection frame and belonging to the area occupied by the second detection frame is called the overlapping area of the first detection frame and the second detection frame, and whether the second detection frame can be used as the relevant second detection frame of the first detection frame can be determined according to the area of the overlapping area of the first detection frame and the second detection frame.
In one implementation, for a first detection frame and a second detection frame, if the area of the overlapping region of the first detection frame and the second detection frame is greater than the area threshold, the second detection frame may be used as the relevant second detection frame of the first detection frame.
In the present application, for a first detection frame, data for the first detection frame of an associated second detection frame of the first detection frame indicates an association relationship between the associated second detection frame and the first detection frame.
For example, for an associated second detection box of a first detection box, the data for the first detection box of the associated second detection box may include: the difference value between the position of the related second detection frame and the position of the first detection frame, and the intersection ratio between the related second detection frame and the first detection frame.
For a first detection frame, when the relative position feature of the first detection frame is obtained based on the data of each related second detection frame of the first detection frame, the data of a related second detection frame for the first detection frame may first be encoded to obtain the encoding of that data. During this encoding, an item of the difference type in the data, that is, the difference between a parameter of the related second detection frame and the same type of parameter of the first detection frame (the parameter may be the position, the length, the width, and so on), may be encoded by sine and cosine encoding to obtain a vector of that item. The dimension of the vector of an item of the difference type is a preset dimension, for example, 64 dimensions. For an item of the non-difference type in the data of the related second detection frame for the first detection frame, a vector of that item may be generated; this vector also has a preset dimension, for example, 64 dimensions, and each of its components is the value of the non-difference-type item.
For each related second detection frame of the first detection frame, the encoding of the data of that related second detection frame for the first detection frame comprises: the vector of each difference-type item in the data, and the vector of each non-difference-type item in the data.
For example, for a first detection frame, the data for the first detection frame of a related second detection frame of the first detection frame may include: the difference between the position of the related second detection frame and the position of the first detection frame, and the intersection ratio (intersection-over-union, IoU) between the related second detection frame and the first detection frame. When this data is encoded, the position difference is a difference-type item and the intersection ratio is a non-difference-type item. The position difference is encoded by sine and cosine encoding to obtain a vector of the position difference, and a vector of the intersection ratio is generated in which each component is the intersection ratio between the related second detection frame and the first detection frame. The encoding of the data of the related second detection frame for the first detection frame then comprises: the vector of the position difference and the vector of the intersection ratio.
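The following is a minimal sketch of how such an encoding could look; the sinusoidal formulation follows the common Transformer-style encoding, and the function names, the 64-dimension default and the temperature constant are assumptions rather than details fixed by the application:

```python
import torch

def sincos_encode(value, dim=64, temperature=10000.0):
    """Sine-cosine encoding of a scalar difference-type item into a `dim`-d vector."""
    i = torch.arange(dim // 2, dtype=torch.float32)
    freq = temperature ** (2 * i / dim)
    angles = value / freq
    return torch.cat([angles.sin(), angles.cos()])        # shape: (dim,)

def encode_pair_data(pos_diff, iou, dim=64):
    """Encode the data of one related second detection frame for a first detection frame.

    pos_diff: iterable of difference-type items, e.g. (dx, dy, dw, dh)
    iou:      non-difference-type item (intersection ratio of the two frames)
    """
    diff_vectors = [sincos_encode(torch.tensor(d), dim) for d in pos_diff]
    iou_vector = torch.full((dim,), float(iou))           # every component equals the IoU
    return torch.cat(diff_vectors + [iou_vector])          # concatenated encoding
```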
In this application, for a first detection frame, after the encoding of the data for the first detection frame of each related second detection frame has been obtained, the relative position feature of the first detection frame may be generated. The relative position feature of the first detection frame includes the encoding of the data for the first detection frame of each related second detection frame of the first detection frame. The relative position feature of the first detection frame is a vector, and its dimension is equal to the sum of the dimensions of these encodings.
In this application, for a first detection box, the current feature of the first detection box may be a feature extracted from the first detection box by the target detection network, the current feature of the first detection box is a vector, and a dimension of the current feature of the first detection box may be equal to a dimension of the relative position feature of the first detection box.
In this application, for a first detection frame, after obtaining the relative position feature of the first detection frame, the relative position feature of the first detection frame and the current feature of the first detection frame may be added to obtain the updated feature of the first detection frame.
In this application, for a first detection frame, when the corrected confidence of the first detection frame is determined based on the updated feature of the first detection frame, the updated feature of the first detection frame may be input into the module, in the target detection network, for determining the confidence of a detection frame, and the corrected confidence of the first detection frame output by that module is obtained.
In the application, the target detection frame is determined from the plurality of detection frames based on the corrected confidence of each first detection frame and the current confidence of each second detection frame.
In the present application, a detection frame, among the plurality of detection frames of the image to be detected, that matches a target such as a pedestrian or a vehicle in the image to be detected is referred to as a target detection frame. Each target detection frame matches one target, and the targets matched by different target detection frames are different.
In the application, any manner in which a network realizing end-to-end target detection determines, from the confidences of the detection frames of an image, the detection frames matched to targets may be adopted here: the target detection frames matched to targets in the image to be detected are determined from the plurality of detection frames based on the corrected confidence of each first detection frame and the current confidence of each second detection frame.
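As one hedged example of such a manner (the application does not fix a particular selection rule, so the simple per-frame score threshold and all names below are assumptions):

```python
import torch

def select_target_boxes(first_boxes, corrected_conf, second_boxes, current_conf,
                        score_threshold=0.5):
    """Pick target detection frames from the corrected confidences of the first
    detection frames and the current confidences of the second detection frames."""
    boxes = torch.cat([first_boxes, second_boxes], dim=0)
    scores = torch.cat([corrected_conf, current_conf], dim=0)
    keep = scores > score_threshold          # one possible selection rule
    return boxes[keep], scores[keep]
```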
In the present application, the target detection network is trained in advance before step 101 is performed.
In the present application, each time the target detection network is trained, one training image is utilized. Each training of the target detection network utilizes a different training image.
In one training of the target detection network, a training image is input into the target detection network, and a plurality of detection frames of the training image, the corrected confidence of each third detection frame of the plurality of detection frames, and the confidence of each fourth detection frame of the plurality of detection frames can be obtained.
In the present application, the plurality of detection boxes of the training image may be: a plurality of detection frames predicted by the target detection network at the end of a certain stage of target detection of the target detection network on the training image.
For example, if the target detection network is a sparse region convolutional neural network whose target detection of an input image comprises 6 detection stages, the plurality of detection frames predicted at the end of a certain stage, for example the 5th detection stage, of target detection for the training image may be used as the plurality of detection frames of the training image.
For each of the plurality of detection frames of the training image, the current confidence of the detection frame may be the confidence of the detection frame predicted by the target detection network at the end of that stage of target detection for the training image.
The third detection frame is a detection frame of which the current confidence coefficient is smaller than or equal to a second confidence coefficient threshold value in the plurality of detection frames of the training image, and the fourth detection frame is a detection frame of which the current confidence coefficient is larger than the second confidence coefficient threshold value in the plurality of detection frames of the training image.
In the present application, the second confidence threshold may be equal to the first confidence threshold. Optionally, the second confidence threshold is 0.7.
In this application, for a third detection frame, a relative position feature of the third detection frame may be obtained based on the data, for the third detection frame, of each related fourth detection frame of the third detection frame; an updated feature of the third detection frame may be determined based on the relative position feature of the third detection frame and the current feature of the third detection frame; and the updated feature of the third detection frame may be input into the module, in the target detection network, for determining the confidence of a detection frame, to obtain the corrected confidence of the third detection frame output by that module.
For a third detection frame, the area occupied by a related fourth detection frame of the third detection frame overlaps with the area occupied by the third detection frame. For a third detection frame and a fourth detection frame, if the area occupied by the fourth detection frame overlaps with the area occupied by the third detection frame, the region belonging both to the area occupied by the third detection frame and to the area occupied by the fourth detection frame is called the overlapping region of the third detection frame and the fourth detection frame, and whether the fourth detection frame can be used as a related fourth detection frame of the third detection frame can be determined according to the area of this overlapping region.
In one implementation, for a third detection frame and a fourth detection frame, if the area of the overlapping region of the third detection frame and the fourth detection frame is greater than the area threshold, the fourth detection frame may be used as the relevant fourth detection frame of the third detection frame.
In the present application, the process of obtaining the relative position feature of a third detection frame based on the data, for the third detection frame, of each related fourth detection frame of the third detection frame is the same as the process of obtaining the relative position feature of a first detection frame based on the data, for the first detection frame, of each related second detection frame of the first detection frame.
In this application, for a third detection box, the current feature of the third detection box may be a feature extracted from the third detection box by the target detection network, the current feature of the third detection box is a vector, and a dimension of the current feature of the third detection box may be equal to a dimension of the relative position feature of the third detection box. After the relative position feature of the third detection frame is obtained, the relative position feature of the third detection frame and the current feature of the third detection frame may be added to obtain an updated feature of the third detection frame. For a third detection box, the updated features of the third detection box may be input into a module for determining confidence in the target detection network to obtain a corrected confidence of the third detection box.
In the application, in one training of the target detection network, the plurality of detection frames of the training image are matched with all labeling frames of the training image, based on the corrected confidence of each third detection frame and the confidence of each fourth detection frame, by an existing algorithm for matching detection frames with labeling frames, such as the Hungarian algorithm, so as to determine the detection frames serving as positive samples and the detection frames serving as negative samples.
The labeling frames of the training image are located in the training image; a labeling frame is a ground truth (GT) frame obtained by labeling that encloses an object, such as a pedestrian or a vehicle, in the training image, and each labeling frame encloses one target in the training image. After all positive samples and all negative samples are determined, the target detection network may be trained with the positive samples and the negative samples. A loss can be calculated with a loss function based on the positive samples, the negative samples, the class corresponding to each positive sample (that is, the predicted class of the object enclosed by the positive sample), the class corresponding to each negative sample (that is, the predicted class of the object enclosed by the negative sample), and the class corresponding to each labeling frame (that is, the class of the target enclosed by the labeling frame); back propagation is then performed according to the loss, and the parameter values of the parameters in the target detection network are updated. For example, the loss can be calculated with the following loss function:
L = L_focal + L_l1 + L_giou

where L_focal denotes the focal loss between the predicted class and the true class, L_l1 denotes the regression (L1) loss between a detection frame and the labeling frame of the target it is matched to in the training image, and L_giou denotes the generalized intersection ratio (GIoU) loss between a detection frame and the labeling frame of the target it is matched to in the training image.
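A compact sketch of how such a combined loss could be computed for detection frames already matched to labeling frames is given below; the term weights, the focal-loss defaults and all names are illustrative assumptions, since the application only names the three loss components:

```python
import torch
import torchvision.ops as ops

def detection_loss(pred_logits, pred_boxes, gt_labels, gt_boxes,
                   w_cls=2.0, w_l1=5.0, w_giou=2.0):
    """Combine focal classification loss, L1 box regression loss and GIoU loss.

    pred_logits: (N, num_classes) class scores, pred_boxes: (N, 4) boxes,
    gt_labels: (N,) long tensor of true classes, gt_boxes: (N, 4) labeling boxes.
    """
    # Focal loss between predicted class scores and true classes (one-hot targets).
    targets = torch.nn.functional.one_hot(gt_labels, pred_logits.shape[-1]).float()
    cls_loss = ops.sigmoid_focal_loss(pred_logits, targets, reduction="mean")

    # L1 regression loss between matched detection boxes and labeling boxes.
    l1_loss = torch.nn.functional.l1_loss(pred_boxes, gt_boxes)

    # Generalized IoU loss between matched detection boxes and labeling boxes.
    giou_loss = ops.generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")

    return w_cls * cls_loss + w_l1 * l1_loss + w_giou * giou_loss
```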
In some embodiments, for any one first detection box, the data for that first detection box for its associated second detection box comprises: the difference between the position of the relevant second detection frame and the position of the first detection frame, the difference between the length of the relevant second detection frame and the length of the first detection frame, the difference between the width of the first detection frame and the width of the relevant second detection frame, and the intersection ratio between the relevant second detection frame and the first detection frame.
In some embodiments, for any one first detection box, the associated second detection box of the first detection box is the second detection box whose intersection ratio with the first detection box is greater than the intersection ratio threshold.
Optionally, the intersection ratio threshold is 0.4.
For a second detection frame, if the intersection ratio of the second detection frame and the first detection frame is greater than the intersection ratio threshold, the second detection frame may be used as a related second detection frame of the first detection frame.
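A brief sketch of selecting the related second detection frames of a first detection frame by this intersection ratio criterion (the function name, box format and use of torchvision are assumptions for illustration):

```python
import torch
from torchvision.ops import box_iou

def related_second_boxes(first_box, second_boxes, iou_threshold=0.4):
    """Return the second detection frames whose intersection ratio (IoU) with the
    given first detection frame exceeds the threshold; boxes are (x1, y1, x2, y2)."""
    ious = box_iou(first_box.unsqueeze(0), second_boxes).squeeze(0)   # (num_second,)
    keep = ious > iou_threshold
    return second_boxes[keep], ious[keep]
```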
In some embodiments, for each first detection frame, obtaining the updated feature of the first detection frame based on the relative position feature of the first detection frame and the current feature of the first detection frame includes: obtaining an enhanced feature of the first detection frame based on the relative position feature of the first detection frame and the current feature of the first detection frame; and performing feature fusion based on the enhanced feature and a learnable feature of the first detection frame to obtain the updated feature of the first detection frame, wherein the learnable feature is a feature, learnt by the target detection network when the target detection network is trained, that is used for feature fusion with enhanced features.
In this application, for a first detection frame, when the enhanced feature of the first detection frame is obtained based on the relative position feature of the first detection frame and the current feature of the first detection frame, the relative position feature of the first detection frame and the current feature of the first detection frame may be added to obtain the enhanced feature of the first detection frame.
When feature fusion is performed based on the enhanced feature and the learnable feature of the first detection frame to obtain the updated feature of the first detection frame, the enhanced feature and the learnable feature of the first detection frame may be added to obtain the updated feature of the first detection frame. Alternatively, the enhanced feature of the first detection frame may first be processed by a fully connected layer to obtain a processed enhanced feature, and the processed enhanced feature and the learnable feature may then be added to obtain the updated feature of the first detection frame.
In the application, when the target detection network is trained, the relative position feature of a third detection frame among the plurality of detection frames of the training image and the current feature of the third detection frame are added to obtain the enhanced feature of the third detection frame, and feature fusion is performed based on the enhanced feature of the third detection frame and the feature used for feature fusion with enhanced features, to obtain the updated feature of the third detection frame. During training, the feature used for feature fusion with enhanced features is a variable: when the parameters in the target detection network are updated according to the calculated loss, the updated parameters include this feature. After the training of the target detection network is completed, this feature is taken as the learnable feature.
In the application, since the learnable feature is the feature, learnt by the target detection network during training, that is used for feature fusion with enhanced features, it is well suited to such fusion; performing feature fusion based on the enhanced feature and the learnable feature of the first detection frame therefore further improves the accuracy of the obtained updated feature of the first detection frame.
In some embodiments, for each first detection frame, deriving the relative position feature of the first detection frame based on the data for the first detection frame of each associated second detection frame of the first detection frame comprises: for each relevant second detection frame, encoding the data of the relevant second detection frame aiming at the first detection frame to obtain the encoding of the data of the relevant second detection frame aiming at the first detection frame; generating an encoding of the first detection box, the encoding of the first detection box comprising: each of the first detection boxes is associated with an encoding of data for the first detection box of a second detection box; and inputting the code of the first detection frame into the multilayer perceptron to obtain the output characteristics of the multilayer perceptron, and performing maximum pooling processing on the output characteristics of the multilayer perceptron to obtain the relative position characteristics of the first detection frame.
For the data of a related second detection frame of the first detection frame for the first detection frame, refer to the description in step 102.
In the present application, the multilayer perceptron comprises, in order from input to output: a first fully connected layer, a rectified linear unit (ReLU) connected to the first fully connected layer, and a second fully connected layer connected to the rectified linear unit.
For a first detection frame, the encoding of the first detection frame is input into the multilayer perceptron. The first fully connected layer in the multilayer perceptron receives the encoding of the first detection frame and processes it to obtain the features output by the first fully connected layer; the rectified linear unit processes the features output by the first fully connected layer to obtain the features output by the rectified linear unit; the second fully connected layer processes the features output by the rectified linear unit to obtain the features output by the second fully connected layer; and the features output by the second fully connected layer are taken as the features output by the multilayer perceptron.
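A minimal PyTorch sketch of such a multilayer perceptron, with layer widths chosen only for illustration:

```python
import torch.nn as nn

class BoxMLP(nn.Module):
    """Multilayer perceptron: fully connected layer -> ReLU -> fully connected layer."""
    def __init__(self, in_dim, hidden_dim=256, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # first fully connected layer
            nn.ReLU(),                       # rectified linear unit
            nn.Linear(hidden_dim, out_dim),  # second fully connected layer
        )

    def forward(self, x):
        return self.net(x)
```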
In the present application, if the features output by one multilayer perceptron are max-pooled to obtain the relative position feature of the first detection frame, then, when the enhanced feature of the first detection frame is obtained based on the relative position feature of the first detection frame and the current feature of the first detection frame, the current feature of the first detection frame may be input into another multilayer perceptron having the same structure as the first one, and the features output by this other multilayer perceptron may be added to the relative position feature of the first detection frame to obtain the enhanced feature of the first detection frame.
Please refer to fig. 2, which shows a flowchart of obtaining the updated feature of the first detection box.
The encoding of the first detection frame is input into the multilayer perceptron, and the features output by the multilayer perceptron are max-pooled to obtain the relative position feature of the first detection frame. The current feature of the first detection frame is input into another multilayer perceptron, and the features output by this other multilayer perceptron are added to the relative position feature of the first detection frame to obtain the enhanced feature of the first detection frame. The enhanced feature of the first detection frame may then be processed by a fully connected layer to obtain the processed enhanced feature, and the processed enhanced feature is added to the learnable feature to obtain the updated feature of the first detection frame.
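Putting the steps of Fig. 2 together, a hedged end-to-end sketch of computing the updated feature of one first detection frame might look as follows; the module names, feature dimension and per-pair encoding layout are assumptions, and the max pooling is taken over the related second detection frames:

```python
import torch
import torch.nn as nn

class UpdateFeatureHead(nn.Module):
    """Sketch of the Fig. 2 flow: pair encodings -> MLP -> max pool -> relative
    position feature; current feature -> MLP; add -> enhanced feature;
    fully connected layer + learnable feature -> updated feature."""
    def __init__(self, pair_dim, feat_dim=256):
        super().__init__()
        self.pair_mlp = nn.Sequential(nn.Linear(pair_dim, feat_dim), nn.ReLU(),
                                      nn.Linear(feat_dim, feat_dim))
        self.feat_mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                      nn.Linear(feat_dim, feat_dim))
        self.fc = nn.Linear(feat_dim, feat_dim)
        self.learnable_feature = nn.Parameter(torch.zeros(feat_dim))  # learned during training

    def forward(self, pair_encodings, current_feature):
        # pair_encodings: (num_related, pair_dim), one row per related second frame
        # current_feature: (feat_dim,), extracted by the target detection network
        relative = self.pair_mlp(pair_encodings).max(dim=0).values    # max pooling
        enhanced = self.feat_mlp(current_feature) + relative          # enhanced feature
        updated = self.fc(enhanced) + self.learnable_feature          # updated feature
        return updated
```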
In some embodiments, the method further comprises: before the image to be detected is acquired, inputting a training image into the target detection network to obtain each third detection frame of the training image and the corrected confidence of each third detection frame, and each fourth detection frame of the training image and the current confidence of each fourth detection frame, wherein the current confidence of a third detection frame is smaller than or equal to a second confidence threshold and the current confidence of a fourth detection frame is greater than the second confidence threshold; matching all fourth detection frames of the training image with all labeling frames of the training image to determine the fourth detection frames serving as positive samples and the fourth detection frames serving as negative samples; when target labeling frames exist, matching all third detection frames participating in matching with all target labeling frames to determine the third detection frames serving as positive samples and the third detection frames serving as negative samples, wherein a target labeling frame is a labeling frame that is not matched with any fourth detection frame; and training the target detection network with all the determined positive samples and all the determined negative samples.
In the present application, in one training of the target detection network, one training image is input into the target detection network to obtain a plurality of detection frames of the training image, the corrected confidence of each third detection frame among the plurality of detection frames, and the current confidence of each fourth detection frame among the plurality of detection frames. The target detection of the target detection network for the training image comprises a plurality of stages, and the plurality of detection frames of the training image may be the detection frames predicted by the target detection network at the end of a certain stage of its target detection for the training image. For each of the plurality of detection frames of the training image, the current confidence of the detection frame may be the confidence of the detection frame predicted by the target detection network at the end of that stage. The current confidence of a third detection frame is smaller than or equal to the second confidence threshold, and the current confidence of a fourth detection frame is greater than the second confidence threshold.
In one training of the target detection network, all the fourth detection frames of the training image are matched with all the label frames of the training image to determine the fourth detection frame as a positive sample and the fourth detection frame as a negative sample.
The labeling frame of the training image is positioned in the training image, and the labeling frame of the training image is a frame which is obtained through labeling and surrounds the target in the training image. Each label box encloses a target in the training image.
And matching all fourth detection frames of the training image with all labeling frames of the training image, wherein for each fourth detection frame, the fourth detection frame is matched with at most one labeling frame, if the fourth detection frame is matched with one labeling frame, the fourth detection frame can be used as a positive sample, and if the fourth detection frame is not matched with one labeling frame, the fourth detection frame can be used as a negative sample.
In this application, the target labeling frame is a labeling frame that is not matched with the fourth detection frame, that is, the target labeling frame is not matched with any fourth detection frame.
In the present application, the third detection frame participating in matching is the third detection frame whose confidence after correction is greater than or equal to the lowest threshold.
Optionally, the lowest threshold is 0.05.
In one training of the target detection network, matching all the third detection frames participating in matching with all the target labeling frames to determine the third detection frame which is used as a positive sample in all the third detection frames participating in matching and the third detection frame which is used as a negative sample in all the third detection frames participating in matching.
For each third detection frame participating in matching, the third detection frame participating in matching is matched with at most one target labeling frame, if the third detection frame is matched with one target labeling frame, the third detection frame participating in matching can be used as a positive sample, and if the third detection frame participating in matching is not matched with one target labeling frame, the third detection frame participating in matching can be used as a negative sample.
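A hedged sketch of this two-stage matching, using the Hungarian algorithm as implemented by SciPy, is shown below; the 1 - IoU matching cost and all names are assumptions beyond what the application fixes (it only specifies that fourth detection frames are matched first and that third detection frames whose corrected confidence reaches the lowest threshold participate in the second round):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from torchvision.ops import box_iou

def two_stage_matching(fourth_boxes, third_boxes, third_corrected_conf,
                       gt_boxes, lowest_threshold=0.05):
    """First match the fourth (high-confidence) detection frames to the labeling
    frames, then match the participating third detection frames to the labeling
    frames that remain unmatched (the target labeling frames)."""
    # Stage 1: fourth frames vs. all labeling frames; 1 - IoU is used as the cost here.
    cost4 = 1.0 - box_iou(fourth_boxes, gt_boxes).numpy()
    rows4, cols4 = linear_sum_assignment(cost4)          # one-to-one assignment
    matched_gt = set(cols4.tolist())

    # Labeling frames not matched to any fourth frame become target labeling frames.
    unmatched_gt = [j for j in range(gt_boxes.shape[0]) if j not in matched_gt]

    # Stage 2: only third frames whose corrected confidence reaches the lowest
    # threshold participate in matching.
    keep = third_corrected_conf >= lowest_threshold
    participating = third_boxes[keep]
    if unmatched_gt and participating.shape[0] > 0:
        cost3 = 1.0 - box_iou(participating, gt_boxes[unmatched_gt]).numpy()
        rows3, cols3 = linear_sum_assignment(cost3)      # indices into participating / unmatched_gt
    else:
        rows3, cols3 = np.array([], dtype=int), np.array([], dtype=int)

    # Matched detection frames serve as positive samples; unmatched ones as negatives.
    return (rows4, cols4), (rows3, cols3), unmatched_gt
```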
In one training of the target detection network, the target detection network is trained with all the determined positive samples and all the determined negative samples.
In the application, when the target detection network is trained, all fourth detection frames of the training image are first matched with all labeling frames of the training image to determine the fourth detection frames serving as positive samples and the fourth detection frames serving as negative samples, and, when labeling frames remain that are not matched to any fourth detection frame, all third detection frames participating in matching are matched with all such target labeling frames. Since the current confidence of a fourth detection frame is higher, fourth detection frames and labeling frames are matched preferentially; when a labeling frame is not matched to any fourth detection frame, a third detection frame can, thanks to its corrected confidence, be matched with this unmatched target labeling frame, so that the matching result between detection frames and labeling frames is accurate, and the target detection network is trained accurately with this accurate matching result.
Referring to fig. 3, a block diagram of a target detection apparatus according to an embodiment of the present disclosure is shown. The target detection apparatus includes: an acquisition unit 301 and a detection unit 302.
The acquisition unit 301 is configured to acquire an image to be detected;
The detection unit 302 is configured to input the image to be detected into a target detection network to obtain a target detection frame, the target detection network being configured to: for each first detection frame among a plurality of detection frames of the image to be detected, obtain a relative position feature of the first detection frame based on the data, for the first detection frame, of each related second detection frame of the first detection frame, wherein the current confidence of the first detection frame is smaller than or equal to a first confidence threshold, a second detection frame is a detection frame, among the plurality of detection frames, whose current confidence is greater than the first confidence threshold, and the data of a related second detection frame for the first detection frame indicates the association relationship between that related second detection frame and the first detection frame; for each first detection frame, obtain an updated feature of the first detection frame based on the relative position feature of the first detection frame and the current feature of the first detection frame, wherein the current feature is extracted from the first detection frame by the target detection network; determine a corrected confidence of the first detection frame based on the updated feature of the first detection frame; and determine, from the plurality of detection frames, the target detection frames matched with targets in the image to be detected based on the corrected confidence of each first detection frame and the current confidence of each second detection frame.
In some embodiments, deriving the updated feature of the first detection box based on the relative positional feature of the first detection box and the current feature of the first detection box comprises:
obtaining an enhanced feature of the first detection frame based on the relative position feature of the first detection frame and the current feature of the first detection frame;
and performing feature fusion based on the enhanced features and learnable features of the first detection frame to obtain updated features of the first detection frame, wherein the learnable features are features which are learnt by the target detection network and used for performing feature fusion with the enhanced features when the target detection network is trained.
In some embodiments, deriving the relative position feature of the first detection frame based on the data, for the first detection frame, of each related second detection frame of the first detection frame comprises:
for each related second detection frame, encoding data of the related second detection frame for the first detection frame to obtain an encoding of the data of the related second detection frame for the first detection frame;
generating an encoding of the first detection box, the encoding of the first detection box comprising: encoding of data for the first detection box for each associated second detection box;
and inputting the code of the first detection frame into a multilayer perceptron to obtain the output characteristics of the multilayer perceptron, and performing maximum pooling processing on the output characteristics of the multilayer perceptron to obtain the relative position characteristics of the first detection frame.
In some embodiments, the data for the first detection frame of a related second detection frame of the first detection frame comprises: the difference between the position of the relevant second detection frame and the position of the first detection frame, the difference between the length of the relevant second detection frame and the length of the first detection frame, the difference between the width of the first detection frame and the width of the relevant second detection frame, and the intersection ratio between the relevant second detection frame and the first detection frame.
In some embodiments, a related second detection frame of the first detection frame is a second detection frame whose intersection ratio with the first detection frame is greater than an intersection ratio threshold.
In some embodiments, the object detection apparatus further comprises:
the training unit is configured to input a training image into a target detection network before an image to be detected is acquired, so as to obtain a corrected confidence coefficient of each third detection frame and each third detection frame of the training image, and a current confidence coefficient of each fourth detection frame and each fourth detection frame of the training image, wherein the current confidence coefficient of the third detection frame is smaller than or equal to a second confidence coefficient threshold value, and the current confidence coefficient of the fourth detection frame is larger than the second confidence coefficient threshold value; matching all fourth detection frames of the training image with all labeling frames of the training image to determine a fourth detection frame serving as a positive sample and a fourth detection frame serving as a negative sample; when the target labeling frame exists, matching all the third detection frames participating in matching with all the target labeling frames to determine a third detection frame serving as a positive sample and a third detection frame serving as a negative sample, wherein the target labeling frame is a third detection frame which is not matched with a fourth detection frame; and training the target detection network by using all the determined positive samples and all the determined negative samples.
An embodiment of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory, the processor executing the computer program to implement the object detection method described above.
Embodiments of the present application provide a computer-readable storage medium on which a computer program/instructions are stored, which, when executed by a processor, implement the above-mentioned object detection method.
Embodiments of the present application provide a computer program product, which includes a computer program/instruction, and the computer program/instruction, when executed by a processor, implement the above object detection method.
It should be noted that the computer readable storage medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above description is only a preferred embodiment of the present application and is illustrative of the principles of the technology employed. It will be understood by those skilled in the art that the scope of the invention referred to herein is not limited to technical solutions formed by the specific combination of the above technical features, but also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (9)

1. A method of object detection, the method comprising:
acquiring an image to be detected;
inputting the image to be detected into a target detection network to obtain a target detection frame, wherein the target detection network is used for:
for each first detection frame in a plurality of detection frames of the image to be detected, obtaining a relative position feature of the first detection frame based on the data, for the first detection frame, of each related second detection frame of the first detection frame, wherein the current confidence of a first detection frame is less than or equal to a first confidence threshold, a second detection frame is a detection frame, among the plurality of detection frames, whose current confidence is greater than the first confidence threshold, and the data for the first detection frame indicates the association relationship between the related second detection frame and the first detection frame;
for each first detection frame, obtaining an updated feature of the first detection frame based on the relative position feature of the first detection frame and the current feature of the first detection frame, wherein the current feature is extracted from the first detection frame by the target detection network; and determining a corrected confidence of the first detection frame based on the updated feature of the first detection frame;
and determining the target detection frame from the plurality of detection frames based on the corrected confidence of each first detection frame and the current confidence of each second detection frame.
2. The method of claim 1, wherein obtaining the updated feature of the first detection frame based on the relative position feature of the first detection frame and the current feature of the first detection frame comprises:
obtaining an enhanced feature of the first detection frame based on the relative position feature of the first detection frame and the current feature of the first detection frame;
and performing feature fusion based on the enhanced feature and a learnable feature of the first detection frame to obtain the updated feature of the first detection frame, wherein the learnable feature is a feature learned by the target detection network during training for performing feature fusion with the enhanced feature.
3. The method of claim 1, wherein obtaining the relative position feature of the first detection frame based on the data, for the first detection frame, of each related second detection frame of the first detection frame comprises:
for each related second detection frame, encoding the data of the related second detection frame for the first detection frame to obtain an encoding of the data of the related second detection frame for the first detection frame;
generating an encoding of the first detection frame, the encoding of the first detection frame comprising the encoding of the data for the first detection frame of each related second detection frame;
and inputting the encoding of the first detection frame into a multilayer perceptron to obtain output features of the multilayer perceptron, and performing max pooling on the output features of the multilayer perceptron to obtain the relative position feature of the first detection frame.
4. The method of any of claims 1-3, wherein the data, for the first detection frame, of a related second detection frame of the first detection frame comprises: the difference between the position of the related second detection frame and the position of the first detection frame, the difference between the length of the related second detection frame and the length of the first detection frame, the difference between the width of the first detection frame and the width of the related second detection frame, and the intersection over union (IoU) between the related second detection frame and the first detection frame.
5. The method of any of claims 1-3, wherein the related second detection frame of the first detection frame is a second detection frame whose intersection over union with the first detection frame is greater than an IoU threshold.
6. The method according to any one of claims 1-5, further comprising:
before the image to be detected is acquired, inputting a training image into the target detection network to obtain each third detection frame of the training image and the corrected confidence of each third detection frame, and each fourth detection frame of the training image and the current confidence of each fourth detection frame, wherein the current confidence of a third detection frame is less than or equal to a second confidence threshold, and the current confidence of a fourth detection frame is greater than the second confidence threshold;
matching all fourth detection frames of the training image with all labeling frames of the training image to determine fourth detection frames serving as positive samples and fourth detection frames serving as negative samples;
when a target labeling frame exists, matching all the third detection frames participating in matching with all target labeling frames to determine third detection frames serving as positive samples and third detection frames serving as negative samples, wherein a target labeling frame is a labeling frame that is not matched with any fourth detection frame;
and training the target detection network by using all the determined positive samples and all the determined negative samples.
7. An electronic device, comprising: memory, processor and computer program stored on the memory, characterized in that the processor executes the computer program to implement the method of any of claims 1-6.
8. A computer-readable storage medium, on which a computer program/instructions is stored, characterized in that the computer program/instructions, when executed by a processor, implement the method of any one of claims 1-6.
9. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the method of any of claims 1-6.
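For illustration only, the following Python sketch shows one way the per-pair data of claim 4 (position difference, length difference, width difference, and intersection over union) and the IoU-based association of claim 5 might be computed. The (cx, cy, w, h) box representation, the helper names, and the 0.5 threshold are assumptions made for the sketch and are not specified by the application.

def box_iou(a, b):
    # a, b: boxes as (cx, cy, w, h); convert to corner coordinates and compute IoU
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def pair_data(first, second):
    # data of one related second frame for the first frame (claim 4):
    # position differences, length (height) difference, width difference, and IoU
    return (second[0] - first[0], second[1] - first[1],
            second[3] - first[3], first[2] - second[2],
            box_iou(first, second))

def related_second_frames(first, second_frames, iou_threshold=0.5):
    # claim 5: a second frame is "related" to the first frame when their IoU
    # exceeds a threshold; 0.5 is an arbitrary illustrative value
    return [s for s in second_frames if box_iou(first, s) > iou_threshold]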
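A minimal PyTorch sketch of the feature branch described in claims 2 and 3: the encodings of all related second frames pass through a multilayer perceptron, are max-pooled into the relative position feature, combined with the current feature into an enhanced feature, and fused with a learnable feature to produce the updated feature and a corrected confidence. The layer sizes, the additive form of the enhanced feature, and the sigmoid scoring head are assumptions of this sketch, not details fixed by the application.

import torch
import torch.nn as nn

class RelativePositionBranch(nn.Module):
    # Hypothetical module; dimensions are arbitrary illustrative choices.
    def __init__(self, pair_dim=5, hidden_dim=64, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pair_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim), nn.ReLU(),
        )
        # learnable feature fused with the enhanced feature (claim 2)
        self.learnable_feature = nn.Parameter(torch.zeros(feat_dim))
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, pair_codes, current_feature):
        # pair_codes: (num_related, pair_dim); current_feature: (feat_dim,)
        per_pair = self.mlp(pair_codes)                 # MLP over each encoding (claim 3)
        relative_pos_feat = per_pair.max(dim=0).values  # max pooling over related frames
        enhanced = relative_pos_feat + current_feature  # one possible enhanced feature
        updated = self.fuse(torch.cat([enhanced, self.learnable_feature], dim=-1))
        corrected_confidence = torch.sigmoid(self.score(updated))
        return updated, corrected_confidence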
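Building on the two sketches above (and reusing torch, box_iou, pair_data, related_second_frames and RelativePositionBranch), the following hypothetical routine mirrors the overall flow of claim 1: detections are split at a first confidence threshold into low-confidence first frames and high-confidence second frames, the confidence of each first frame is corrected from its related second frames, and target frames are selected from the corrected and current confidences. Thresholding the pooled scores is only one possible selection rule; the claim merely requires that selection be based on those confidences.

def correct_and_select(frames, confidences, features, branch,
                       first_conf_threshold=0.3, final_threshold=0.5):
    # frames: list of (cx, cy, w, h); confidences: current confidences per frame;
    # features: per-frame current features as tensors of shape (feat_dim,);
    # branch: a RelativePositionBranch instance from the previous sketch
    first_idx = [i for i, c in enumerate(confidences) if c <= first_conf_threshold]
    second_idx = [i for i, c in enumerate(confidences) if c > first_conf_threshold]
    scores = {i: confidences[i] for i in second_idx}        # current confidences kept
    for i in first_idx:
        related = related_second_frames(frames[i], [frames[j] for j in second_idx])
        if not related:
            continue                                        # no related second frame
        codes = torch.tensor([pair_data(frames[i], s) for s in related],
                             dtype=torch.float32)
        _, corrected = branch(codes, features[i])
        scores[i] = float(corrected)                        # corrected confidence
    return [frames[i] for i, s in scores.items() if s > final_threshold]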
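Finally, a hedged sketch of the training-time sample assignment of claim 6, reusing box_iou from the first sketch. Greedy IoU matching is an illustrative choice; the application does not specify the matching rule. Labeled frames are first matched against the high-confidence fourth frames, labeled frames left unmatched become target labeling frames, and the low-confidence third frames are then matched against those to yield additional positive and negative samples.

def greedy_match(detections, labels, iou_threshold=0.5):
    # returns (detection index -> matched label index, unmatched label indices)
    matches, used_labels = {}, set()
    for d, det in enumerate(detections):
        best, best_iou = None, iou_threshold
        for l, lab in enumerate(labels):
            if l in used_labels:
                continue
            v = box_iou(det, lab)
            if v > best_iou:
                best, best_iou = l, v
        if best is not None:
            matches[d] = best
            used_labels.add(best)
    unmatched_labels = [l for l in range(len(labels)) if l not in used_labels]
    return matches, unmatched_labels

def assign_samples(third_frames, fourth_frames, label_frames):
    # fourth (high-confidence) frames are matched against all labeling frames
    fourth_matches, leftover = greedy_match(fourth_frames, label_frames)
    positives = [("fourth", i) for i in fourth_matches]
    negatives = [("fourth", i) for i in range(len(fourth_frames)) if i not in fourth_matches]
    # labeling frames not matched by any fourth frame are the target labeling frames;
    # third (low-confidence) frames are matched against them
    if leftover:
        target_labels = [label_frames[l] for l in leftover]
        third_matches, _ = greedy_match(third_frames, target_labels)
        positives += [("third", i) for i in third_matches]
        negatives += [("third", i) for i in range(len(third_frames)) if i not in third_matches]
    return positives, negatives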
CN202111676609.7A 2021-12-31 2021-12-31 Target detection method and device, electronic equipment and storage medium Pending CN114387498A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111676609.7A CN114387498A (en) 2021-12-31 2021-12-31 Target detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111676609.7A CN114387498A (en) 2021-12-31 2021-12-31 Target detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114387498A (en) 2022-04-22

Family

ID=81199337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111676609.7A Pending CN114387498A (en) 2021-12-31 2021-12-31 Target detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114387498A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116363631A (en) * 2023-05-19 2023-06-30 小米汽车科技有限公司 Three-dimensional target detection method and device and vehicle
CN116363631B (en) * 2023-05-19 2023-09-05 小米汽车科技有限公司 Three-dimensional target detection method and device and vehicle

Similar Documents

Publication Publication Date Title
CN109212530B (en) Method and apparatus for determining velocity of obstacle
CN107481292B (en) Attitude error estimation method and device for vehicle-mounted camera
CN109212532B (en) Method and apparatus for detecting obstacles
KR20180056685A (en) System and method for non-obstacle area detection
CN112307978B (en) Target detection method and device, electronic equipment and readable storage medium
CN111709975A (en) Multi-target tracking method and device, electronic equipment and storage medium
CN115540894B (en) Vehicle trajectory planning method and device, electronic equipment and computer readable medium
CN113793356B (en) Lane line detection method and device
CN111310770A (en) Target detection method and device
CN110853085A (en) Semantic SLAM-based mapping method and device and electronic equipment
CN115690765B (en) License plate recognition method, device, electronic equipment, readable medium and program product
CN114723646A (en) Image data generation method with label, device, storage medium and electronic equipment
CN115393815A (en) Road information generation method and device, electronic equipment and computer readable medium
CN114387498A (en) Target detection method and device, electronic equipment and storage medium
CN108108395B (en) Mutual-aid navigation positioning data credibility evaluation method and system
CN107948721B (en) Method and device for pushing information
CN112464921A (en) Obstacle detection information generation method, apparatus, device and computer readable medium
CN115620264B (en) Vehicle positioning method and device, electronic equipment and computer readable medium
CN115326079B (en) Vehicle lane level positioning method, device, equipment and computer readable medium
CN116580407A (en) Training method of text detection model, text detection method and device
CN116664829A (en) RGB-T semantic segmentation method, system, device and storage medium
CN116152714A (en) Target tracking method and system and electronic equipment
CN115984868A (en) Text processing method, device, medium and equipment
CN110634155A (en) Target detection method and device based on deep learning
CN110634159A (en) Target detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination