WO2022257614A1 - Training method and apparatus for object detection model, and image detection method and apparatus - Google Patents


Info

Publication number
WO2022257614A1
WO2022257614A1, PCT/CN2022/088005, CN2022088005W
Authority
WO
WIPO (PCT)
Prior art keywords
detection model
feature
difference
student
training
Prior art date
Application number
PCT/CN2022/088005
Other languages
French (fr)
Chinese (zh)
Inventor
邹智康 (ZOU Zhikang)
叶晓青 (YE Xiaoqing)
孙昊 (SUN Hao)
Original Assignee
北京百度网讯科技有限公司 (Beijing Baidu Netcom Science and Technology Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司 (Beijing Baidu Netcom Science and Technology Co., Ltd.)
Priority to JP2023515610A (publication JP2023539934A)
Publication of WO2022257614A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Definitions

  • FIG. 7 is a schematic diagram according to a fifth embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram according to a seventh embodiment of the present disclosure.
  • Step 202: Input the training image into the teacher detection model to obtain the first feature map extracted from the training image by the teacher detection model and the first object distance map predicted from the first feature map.
  • Step 406: Determine the second loss term of the loss function according to the difference between the first feature map and the second feature map.
  • The cos (cosine) distance between the feature T output by the teacher detection model and the feature S output by the student detection model can be calculated, and the similarity between the features output by the teacher detection model and the features output by the student detection model can be judged from the cos distance; a similarity loss function is optimized to shorten the distance between the output features of the teacher detection model and the output features of the student detection model.
  • the cos distance can be defined by the following formula:
  • The object position labeled in the training sample can be compared with the object position predicted by the student detection model, and the difference between the labeled object position and the predicted object position can be determined as the third loss term of the loss function, to train the student detection model.
  • FIG. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure.
  • "Electronic device" is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital processing devices, cellular telephones, smart phones, wearable devices, and other similar computing devices.
  • The components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • The device 1000 includes a computing unit 1001, which can execute various appropriate actions and processes according to a computer program loaded into a RAM (Random Access Memory) 1003.
  • Various programs and data necessary for the operation of the device 1000 can also be stored in the RAM 1003.
  • The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other through a bus 1004.
  • An I/O (Input/Output) interface 1005 is also connected to the bus 1004.
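A minimal sketch of the cosine-distance similarity measure mentioned in the snippets above. It assumes the standard definition, cos(T, S) = T·S / (‖T‖ ‖S‖), since the patent's own formula is not reproduced in this excerpt; the function names are illustrative, not from the patent.

```python
import numpy as np

def cos_distance(t, s):
    # Standard cosine distance: 1 - T.S / (||T|| ||S||).
    # (Assumed definition; the patent's exact formula is elided above.)
    t = t.ravel().astype(float)
    s = s.ravel().astype(float)
    cos_sim = np.dot(t, s) / (np.linalg.norm(t) * np.linalg.norm(s))
    return 1.0 - cos_sim

def similarity_loss(t, s):
    # Optimizing this loss pulls the student's output features
    # toward the teacher's output features.
    return cos_distance(t, s)

t = np.array([1.0, 2.0, 3.0])
print(similarity_loss(t, t))  # approximately 0 for identical features
```

Identical features give a distance of (approximately) zero, while orthogonal features give a distance of one, so minimizing this term shortens the gap between the two models' outputs.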

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to the technical field of artificial intelligence, and specifically to the technical fields of computer vision and deep learning. Provided are a training method and apparatus for an object detection model, and an image detection method and apparatus, which can be applied to fields such as autonomous driving and smart robots. The specific implementation involves: training a student detection model according to the positions at which the distance map corresponding to the feature map output by a teacher detection model for a training image differs from the distance map corresponding to the feature map output by the student detection model for the same image, and according to the differences, at those positions, between the features in the corresponding feature maps. In this way, the student detection model's mining of the teacher detection model's detection information can be further improved, and the detection accuracy of the student detection model is improved, such that a simple student detection model can achieve detection accuracy similar to that of a complex teacher detection model, thereby reducing the computing resources occupied and the cost of deployment, and increasing inference speed.

Description

Training Method for Object Detection Model, Image Detection Method, and Apparatus Thereof
Cross-Reference to Related Applications
This disclosure claims priority to Chinese Patent Application No. 202110649762.4, entitled "Training Method for Object Detection Model, Image Detection Method and Apparatus Thereof", filed by Beijing Baidu Netcom Science and Technology Co., Ltd. (北京百度网讯科技有限公司) on June 10, 2021.
Technical Field
The present disclosure relates to the technical field of artificial intelligence, and specifically to the technical fields of computer vision and deep learning. It can be applied to fields such as autonomous driving and intelligent robots, and in particular relates to a training method for an object detection model, an image detection method, and apparatuses thereof.
Background
At present, object detection (for example, 3D detection from monocular images) is mainly performed through deep learning and keypoint estimation. Object detection can provide information with seven degrees of freedom in total: the position of an object, its length, width, and height, and its orientation angle. It can be widely applied in scenarios such as intelligent robots and autonomous driving.
Summary
The present disclosure provides a training method for an object detection model, an image detection method, and apparatuses thereof.
According to one aspect of the present disclosure, a training method for an object detection model is provided, including: acquiring a trained teacher detection model and a student detection model to be trained; inputting a training image into the teacher detection model to obtain a first feature map extracted from the training image by the teacher detection model and a first object distance map predicted from the first feature map; inputting the training image into the student detection model to obtain a second feature map extracted from the training image by the student detection model and a second object distance map predicted from the second feature map; according to the difference positions between the second object distance map and the first object distance map, determining first local features corresponding to the difference positions in the first feature map and second local features corresponding to the difference positions in the second feature map; and training the student detection model according to the difference between the first local features and the second local features.
According to another aspect of the present disclosure, an image detection method is provided, including: acquiring a monocular image; and performing image detection on the monocular image with a trained student detection model to obtain object information of the objects in the monocular image, wherein the student detection model is trained with the training method described in the embodiments of the first aspect of the present disclosure.
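The image detection aspect can be sketched as follows. `detect_objects` and `stub_student` are hypothetical names, and the 7-DoF output format (position, length/width/height, orientation angle) follows the background section's description rather than any interface defined in the patent:

```python
import numpy as np

def detect_objects(image, student_model):
    # The trained (distilled) student model maps a monocular image to
    # per-object 7-DoF information: position (x, y, z), length, width,
    # height, and orientation angle.
    return student_model(image)

def stub_student(image):
    # Stand-in for a trained student detection model: returns one
    # fixed, made-up detection for illustration only.
    return [{"position": (1.0, 0.5, 10.0),
             "size": (4.2, 1.8, 1.5),
             "yaw": 0.1}]

image = np.zeros((375, 1242, 3))  # a monocular RGB frame
objs = detect_objects(image, stub_student)
print(len(objs))  # 1
```

The point of the disclosure is that this small student model, once trained as described below, stands in for the much heavier teacher at inference time.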
According to another aspect of the present disclosure, a training apparatus for an object detection model is provided, including: a first acquisition module configured to acquire a trained teacher detection model and a student detection model to be trained; a first processing module configured to input a training image into the teacher detection model to obtain a first feature map extracted from the training image by the teacher detection model and a first object distance map predicted from the first feature map; a second processing module configured to input the training image into the student detection model to obtain a second feature map extracted from the training image by the student detection model and a second object distance map predicted from the second feature map; a first determination module configured to determine, according to the difference positions between the second object distance map and the first object distance map, first local features corresponding to the difference positions in the first feature map and second local features corresponding to the difference positions in the second feature map; and a training module configured to train the student detection model according to the difference between the first local features and the second local features.
According to another aspect of the present disclosure, an image detection apparatus is provided, including: an acquisition module configured to acquire a monocular image; and a detection module configured to perform image detection on the monocular image with a trained student detection model to obtain object information of the objects in the monocular image, wherein the student detection model is trained by the training apparatus described in the embodiments of the present disclosure.
According to another aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the method described in the embodiments of the first aspect of the present disclosure, or execute the image detection method described in the embodiments of the present disclosure.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to execute the training method for an object detection model described in the embodiments of the first aspect of the present disclosure, or to execute the image detection method described in the embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, including a computer program which, when executed by a processor, implements the training method for an object detection model described in the embodiments of the first aspect of the present disclosure, or executes the image detection method described in the embodiments of the present disclosure.
It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand through the following description.
Brief Description of the Drawings
The accompanying drawings are used for a better understanding of the present solution and do not constitute a limitation of the present disclosure. In the drawings:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram of obtaining the difference between first local features and second local features according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 5 is a schematic flowchart of a training method for an object detection model according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 8 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 9 is a schematic diagram according to a seventh embodiment of the present disclosure;
FIG. 10 is a block diagram of an electronic device used to implement the training method for an object detection model or the image detection method of the embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, including various details of the embodiments of the present disclosure to facilitate understanding; they should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.
At present, object detection (for example, 3D detection from monocular images) is mainly performed through deep learning and keypoint estimation. Object detection can provide information with seven degrees of freedom in total: the position of an object, its length, width, and height, and its orientation angle. It can be widely applied in scenarios such as intelligent robots and autonomous driving.
In the related art, object detection mainly focuses on designing novel modules to increase a network's ability to process input images, thereby improving detection accuracy; another mainstream approach is to introduce depth information to increase the network's ability to represent spatial distance, further improving detection accuracy.
However, in the above techniques, increasing a network's ability to process input images by designing novel modules mainly relies on a powerful and complex backbone network to improve the final detection accuracy. Such a complex network demands very large computing resources and is inconvenient to deploy in a service; in addition, a deep and complex network makes inference too slow.
In view of the above problems, the present disclosure proposes a training method for an object detection model, an image detection method, and apparatuses thereof.
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure. It should be noted that the training method for an object detection model of the embodiments of the present disclosure can be applied to the training apparatus for an object detection model of the embodiments of the present disclosure, and the apparatus can be configured in an electronic device. The electronic device may be a mobile terminal, for example, a mobile phone, a tablet computer, a personal digital assistant, or other hardware devices with various operating systems.
As shown in FIG. 1, the training method for an object detection model may include the following steps.
Step 101: Acquire a trained teacher detection model and a student detection model to be trained.
In the embodiments of the present disclosure, a complex neural network may be trained for object detection; the trained complex neural network is used as the teacher detection model, and an untrained simple neural network is used as the student detection model to be trained.
Step 102: Input a training image into the teacher detection model to obtain a first feature map extracted from the training image by the teacher detection model and a first object distance map predicted from the first feature map.
As a possible implementation of the embodiments of the present disclosure, the training image may be obtained through an image acquisition device and input into the teacher detection model. The teacher detection model may perform feature extraction on the training image to generate a feature map, and may further perform feature extraction on the feature map to generate the information required for object detection (for example, position information and size information). A distance map may be generated from this information; the distance map can represent the distance of objects in the coordinate system of the image acquisition device (such as a camera). The teacher detection model outputs the generated feature map and distance map; the output feature map is taken as the first feature map, and the output distance map is taken as the first object distance map.
Step 103: Input the training image into the student detection model to obtain a second feature map extracted from the training image by the student detection model and a second object distance map predicted from the second feature map.
Next, the training image may be input into the student detection model to be trained. The student detection model may perform feature extraction on the training image to generate a feature map, may further perform feature extraction on this feature map to generate the information required for object detection, and may generate a distance map from this information. The student detection model outputs the generated feature map and distance map; the output feature map is taken as the second feature map, and the output distance map is taken as the second object distance map. It should be noted that step 102 may be performed before step 103, after step 103, or simultaneously with step 103; the present disclosure imposes no specific limitation.
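The data flow of steps 102 and 103 can be followed with toy stand-ins for the two models. The random-projection "backbone" and per-pixel "head" below are illustrative assumptions (the patent's models are real detection networks); only the shapes of the two outputs matter here:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(image, weights):
    # Toy stand-in for a detection model's forward pass: a projection
    # produces a per-pixel "feature map", and a head on top of it
    # produces a per-pixel "object distance map".
    feat = np.tanh(image @ weights["backbone"])    # feature map (H, W, C')
    dist = np.abs(feat @ weights["head"]).sum(-1)  # distance map (H, W)
    return feat, dist

H, W, C = 8, 8, 3
image = rng.random((H, W, C))
teacher_w = {"backbone": rng.random((C, 16)), "head": rng.random((16, 1))}
student_w = {"backbone": rng.random((C, 16)), "head": rng.random((16, 1))}

feat_t, dist_t = forward(image, teacher_w)  # first feature map / distance map
feat_s, dist_s = forward(image, student_w)  # second feature map / distance map
print(feat_t.shape, dist_t.shape)  # (8, 8, 16) (8, 8)
```

Because steps 102 and 103 are independent given the training image, the two forward passes can run in either order or in parallel, as the text notes.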
Step 104: According to the difference positions between the second object distance map and the first object distance map, determine first local features corresponding to the difference positions in the first feature map and second local features corresponding to the difference positions in the second feature map.
It can be understood that the teacher detection model differs from the student detection model to be trained, so the output first object distance map also differs from the second object distance map. In the embodiments of the present disclosure, a distance measurement may be performed between the second object distance map and the first object distance map to obtain the positions at which they differ. Then, the features corresponding to the difference positions in the first feature map are obtained and taken as the first local features; likewise, the features corresponding to the difference positions in the second feature map are taken as the second local features.
Step 105: Train the student detection model according to the difference between the first local features and the second local features.
Further, by comparing the first local features with the second local features, the difference between them can be obtained, and the student detection model is trained according to that difference.
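As a sketch of step 105: the patent specifies only that the student is trained on the difference between the first and second local features, not the exact loss form, so the mean-squared error over the masked positions below is one plausible choice, not the patent's definitive loss:

```python
import numpy as np

def local_distill_loss(feat_t, feat_s, diff_mask):
    # Distillation term over the difference positions only:
    # compare teacher and student features where the two
    # distance maps disagree (assumed MSE form).
    f_t = feat_t[diff_mask]  # first local features
    f_s = feat_s[diff_mask]  # second local features
    return float(np.mean((f_t - f_s) ** 2))

feat_t = np.ones((4, 4, 8))    # teacher feature map
feat_s = np.zeros((4, 4, 8))   # student feature map
mask = np.zeros((4, 4), dtype=bool)
mask[1, 2] = True              # a single difference position
print(local_distill_loss(feat_t, feat_s, mask))  # 1.0
```

Restricting the loss to the difference positions focuses the gradient on exactly the regions where the student's distance prediction disagrees with the teacher's.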
In summary, a trained teacher detection model and a student detection model to be trained are acquired; a training image is input into the teacher detection model to obtain the first feature map extracted from the training image by the teacher detection model and the first object distance map predicted from the first feature map; the training image is input into the student detection model to obtain the second feature map extracted from the training image by the student detection model and the second object distance map predicted from the second feature map; according to the difference positions between the second object distance map and the first object distance map, the first local features corresponding to the difference positions are determined in the first feature map and the second local features corresponding to the difference positions are determined in the second feature map; and the student detection model is trained according to the difference between the first local features and the second local features. By training the student detection model on the positions at which the distance maps corresponding to the feature maps output by the teacher and student detection models differ, and on the differences between the features at those positions in the corresponding feature maps, this method further improves the student detection model's mining of the teacher detection model's detection information and improves the detection accuracy of the student detection model. In this way, a simple student detection model can achieve detection accuracy similar to that of a complex teacher detection model, while reducing the occupation of computing resources and the cost of deployment and increasing inference speed.
To better obtain the first local features and the second local features, FIG. 2 shows a schematic diagram according to a second embodiment of the present disclosure. In the embodiments of the present disclosure, the difference positions between the first object distance map and the second object distance map may be obtained first; the first local features corresponding to the difference positions are then obtained in the first feature map, and the second local features corresponding to the difference positions are obtained in the second feature map. The embodiment shown in FIG. 2 includes the following steps.
Step 201: Acquire a trained teacher detection model and a student detection model to be trained.
Step 202: Input a training image into the teacher detection model to obtain a first feature map extracted from the training image by the teacher detection model and a first object distance map predicted from the first feature map.
Step 203: Input the training image into the student detection model to obtain a second feature map extracted from the training image by the student detection model and a second object distance map predicted from the second feature map.
Step 204: Determine the positions at which the first object distance map output by the head network in the teacher detection model differs from the second object distance map output by the corresponding head network in the student detection model.
To effectively determine the difference positions between the first object distance map and the second object distance map, optionally, the distance values at the same positions in the first object distance map output by the head network in the teacher detection model and in the second object distance map output by the corresponding head network in the student detection model are compared, and the positions at which the difference between the distance values is greater than a threshold are taken as the difference positions.
That is, inputting the features output by the teacher detection model and by the student detection model into the different head networks of the respective models yields different predictions. For example, inputting those features into the category head networks of the teacher and student detection models outputs the corresponding object categories, and inputting them into the 2D-box head networks of the teacher and student detection models outputs the corresponding 2D boxes of the objects. In the embodiments of the present disclosure, the first object distance map output by inputting the teacher detection model's features into the 3D head network of the teacher detection model is compared, position by position, with the second object distance map output by inputting the student detection model's features into the corresponding head network of the student detection model; the difference between the distance values at each common position is obtained, and the positions at which this difference is greater than a preset threshold are taken as the difference positions.
步骤205,在第一特征图中,将从差异位置提取的特征作为第一局部特征。 Step 205, in the first feature map, use the feature extracted from the difference position as the first local feature.
进一步地，根据差异位置，获取该差异位置在第一特征图中的位置，并根据该位置在第一特征图中进行特征提取，将提取的特征作为第一局部特征。Further, according to the difference position, the location of the difference position in the first feature map is obtained, feature extraction is performed at that location in the first feature map, and the extracted feature is used as the first local feature.
步骤206,在第二特征图中,将从差异位置提取的特征作为第二局部特征。 Step 206, in the second feature map, use the features extracted from the difference positions as the second local features.
进一步地，根据差异位置，获取该差异位置在第二特征图中的位置，并根据该位置在第二特征图中进行特征提取，将提取的特征作为第二局部特征。Further, according to the difference position, the location of the difference position in the second feature map is obtained, feature extraction is performed at that location in the second feature map, and the extracted feature is used as the second local feature.
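Steps 205 and 206 above, extracting the first and second local features at the difference positions, can be sketched with boolean-mask indexing over channel-first feature maps. The (C, H, W) shapes, the random features, and the helper name are illustrative assumptions:

```python
import numpy as np

def extract_local_features(feature_map: np.ndarray,
                           diff_mask: np.ndarray) -> np.ndarray:
    """Select, for every channel, only the spatial positions marked in the
    difference mask; the result has shape (C, N) for N difference positions."""
    return feature_map[:, diff_mask]

rng = np.random.default_rng(0)
teacher_feat = rng.standard_normal((8, 4, 4))   # first feature map (C, H, W)
student_feat = rng.standard_normal((8, 4, 4))   # second feature map
diff_mask = np.zeros((4, 4), dtype=bool)
diff_mask[1, 2] = diff_mask[3, 0] = True        # two difference positions

first_local = extract_local_features(teacher_feat, diff_mask)
second_local = extract_local_features(student_feat, diff_mask)
# Both have shape (8, 2): 8 channels at the 2 difference positions.
```

Because the same mask indexes both feature maps, the two (C, N) arrays are position-aligned and can be compared directly when forming the training loss.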
步骤207,根据第一局部特征和第二局部特征之间的差异,对学生检测模型进行训练。 Step 207, train the student detection model according to the difference between the first local feature and the second local feature.
举例而言，如图3所示，训练图片经过教师检测模型可获取教师特征(第一特征图)，训练图片经过学生检测模型可获取学生特征(第二特征图)，教师特征经过教师检测模型中的3D head(头部网络)输出第一物体距离图，学生特征经过学生检测模型中的3D head(头部网络)输出第二物体距离图，将第一物体距离图与第二物体距离图进行距离度量，获取第一物体距离图与第二物体距离图之间的差异位置，确定教师特征中该差异位置对应的第一局部特征，确定学生特征中该差异位置对应的第二局部特征，根据第一局部特征与第二局部特征之间的差异，对学生检测模型进行训练。For example, as shown in FIG. 3, the training picture is passed through the teacher detection model to obtain the teacher feature (the first feature map), and through the student detection model to obtain the student feature (the second feature map). The teacher feature is passed through the 3D head (head network) of the teacher detection model to output the first object distance map, and the student feature is passed through the 3D head (head network) of the student detection model to output the second object distance map. A distance metric is computed between the first object distance map and the second object distance map to obtain the difference positions between them; the first local feature corresponding to a difference position is determined in the teacher feature, the second local feature corresponding to the difference position is determined in the student feature, and the student detection model is trained according to the difference between the first local feature and the second local feature.
在本公开实施例中,步骤201-203可以分别采用本公开的各实施例中的任一种方式实现,本公开实施例并不对此作出限定,也不再赘述。In the embodiment of the present disclosure, steps 201-203 may be implemented in any one of the embodiments of the present disclosure, which is not limited in the embodiment of the present disclosure, and will not be repeated here.
综上，通过获取经过训练的教师检测模型，以及待训练的学生检测模型；将训练图像输入教师检测模型，得到教师检测模型对训练图像提取的第一特征图，以及根据第一特征图预测的第一物体距离图；将训练图像输入学生检测模型，得到学生检测模型对训练图像提取的第二特征图，以及根据第二特征图预测的第二物体距离图；确定教师检测模型中头部网络输出的第一物体距离图，与学生检测模型中对应头部网络输出的第二物体距离图之间存在差异的差异位置；在第一特征图中，将从差异位置提取的特征作为第一局部特征；在第二特征图中，将从差异位置提取的特征作为第二局部特征；根据第一局部特征和第二局部特征之间的差异，对学生检测模型进行训练。该方法根据第一物体距离图与第二物体距离图之间的差异位置，可在第一特征图中获取差异位置对应的第一局部特征，在第二特征图中获取差异位置对应的第二局部特征，根据第一局部特征与第二局部特征之间的差异对学生检测模型进行训练，可进一步提高学生检测模型对教师检测模型的检测信息的挖掘，提高了学生检测模型的检测精度，这样，简单的学生检测模型可以达到跟复杂的教师检测模型相似的检测精度，而减少了对计算资源的占用与部署成本，并且提高了推算速度。To sum up, the trained teacher detection model and the student detection model to be trained are obtained; the training image is input into the teacher detection model to obtain the first feature map extracted from the training image by the teacher detection model and the first object distance map predicted according to the first feature map; the training image is input into the student detection model to obtain the second feature map extracted from the training image by the student detection model and the second object distance map predicted according to the second feature map; the difference positions where the first object distance map output by the head network of the teacher detection model differs from the second object distance map output by the corresponding head network of the student detection model are determined; in the first feature map, the features extracted from the difference positions are used as the first local features; in the second feature map, the features extracted from the difference positions are used as the second local features; and the student detection model is trained according to the difference between the first local features and the second local features. According to the difference positions between the first object distance map and the second object distance map, this method can obtain the first local feature corresponding to a difference position in the first feature map and the second local feature corresponding to the difference position in the second feature map, and train the student detection model according to the difference between the first local feature and the second local feature. This further improves the student detection model's mining of the detection information of the teacher detection model and improves the detection accuracy of the student detection model, so that a simple student detection model can achieve detection accuracy similar to that of a complex teacher detection model while reducing the occupation of computing resources and deployment costs and improving inference speed.
为了提高学生检测模型的检测精度,如图4所示,图4是根据本公开第三实施例的示意图。在本公开实施例中,可根据第一局部特征和第二局部特征之间的差异,对学生检测模型进行训练,图4所示实施例可包括如下步骤:In order to improve the detection accuracy of the student detection model, as shown in FIG. 4 , FIG. 4 is a schematic diagram according to a third embodiment of the present disclosure. In the embodiment of the present disclosure, the student detection model may be trained according to the difference between the first local feature and the second local feature, and the embodiment shown in FIG. 4 may include the following steps:
步骤401,获取经过训练的教师检测模型,以及待训练的学生检测模型。 Step 401, acquire the trained teacher detection model and the student detection model to be trained.
步骤402,将训练图像输入教师检测模型,得到教师检测模型对训练图像提取的第一特征图,以及根据第一特征图预测的第一物体距离图。Step 402: Input the training image into the teacher detection model to obtain the first feature map extracted from the training image by the teacher detection model and the first object distance map predicted according to the first feature map.
步骤403,将训练图像输入学生检测模型,得到学生检测模型对训练图像提取的第二特征图,以及根据第二特征图预测的第二物体距离图。Step 403: Input the training image into the student detection model to obtain a second feature map extracted from the training image by the student detection model and a second object distance map predicted according to the second feature map.
步骤404，根据第二物体距离图与第一物体距离图中的差异位置，在第一特征图中确定差异位置对应的第一局部特征，以及在第二特征图中确定差异位置对应的第二局部特征。 Step 404, according to the difference positions between the second object distance map and the first object distance map, determine the first local feature corresponding to a difference position in the first feature map, and determine the second local feature corresponding to the difference position in the second feature map.
步骤405,根据第一局部特征和第二局部特征之间的差异,确定损失函数的第一损失项。Step 405: Determine the first loss item of the loss function according to the difference between the first local feature and the second local feature.
在本公开实施例中,将第一局部特征与第二局部特征进行比对,可确定第一局部特征与第二局部特征之间的差异,将该差异作为损失函数的第一损失项。In the embodiment of the present disclosure, the first local feature is compared with the second local feature, and the difference between the first local feature and the second local feature can be determined, and the difference is used as the first loss item of the loss function.
步骤406,根据第一特征图和第二特征图之间的差异,确定损失函数的第二损失项。Step 406: Determine the second loss item of the loss function according to the difference between the first feature map and the second feature map.
可选地,将第一特征图与第二特征图进行比对,可确定第一特征图与第二特征图之间的特征差异,将该特征差异作为损失函数的第二损失项。Optionally, by comparing the first feature map with the second feature map, a feature difference between the first feature map and the second feature map can be determined, and the feature difference can be used as a second loss item of the loss function.
作为本公开实施例的一种可能实现方式，教师检测模型和学生检测模型可分别包括对应的多个特征提取层，将教师检测模型各特征提取层输出的第一特征图，分别与学生检测模型中对应的特征提取层输出的第二特征图确定特征差异，根据确定出的特征差异，确定损失函数的第二损失项。As a possible implementation of the embodiment of the present disclosure, the teacher detection model and the student detection model may each include a plurality of corresponding feature extraction layers; the first feature map output by each feature extraction layer of the teacher detection model is compared with the second feature map output by the corresponding feature extraction layer of the student detection model to determine a feature difference, and the second loss item of the loss function is determined according to the determined feature differences.
也就是说，教师检测模型和学生检测模型分别包括对应的多个特征提取层，教师检测模型可根据多个特征提取层提取特征，并输出第一特征图，学生检测模型可根据对应的多个特征提取层提取特征，并输出第二特征图，将教师检测模型输出的第一特征图与学生检测模型输出的第二特征图进行距离计算，可确定教师检测模型的多个特征提取层提取的特征与学生检测模型的对应的多个特征提取层提取的特征之间的特征差异，将该特征差异作为损失函数的第二损失项。That is to say, the teacher detection model and the student detection model each include a plurality of corresponding feature extraction layers. The teacher detection model extracts features with its feature extraction layers and outputs the first feature map, and the student detection model extracts features with its corresponding feature extraction layers and outputs the second feature map. By computing a distance between the first feature map output by the teacher detection model and the second feature map output by the student detection model, the feature difference between the features extracted by the feature extraction layers of the teacher detection model and the features extracted by the corresponding feature extraction layers of the student detection model can be determined, and this feature difference is used as the second loss item of the loss function.
比如，教师检测模型根据多个特征提取层提取的特征为T={t1,t2,t3,t4,t5}，学生检测模型根据对应的多个特征提取层提取的特征为S={s1,s2,s3,s4,s5}，通过教师检测模型输出的特征T与学生检测模型输出的特征S进行cos(余弦)距离计算，并通过cos距离判断教师检测模型输出的特征与学生检测模型输出的特征之间的相似度，算出相似损失函数进行优化，以拉近教师检测模型输出的特征与学生检测模型输出的特征之间的距离。其中，cos距离可通过如下公式进行定义：For example, the features extracted by the teacher detection model with its feature extraction layers are T = {t1, t2, t3, t4, t5}, and the features extracted by the student detection model with its corresponding feature extraction layers are S = {s1, s2, s3, s4, s5}. The cos (cosine) distance between the feature T output by the teacher detection model and the feature S output by the student detection model is computed, the similarity between the two sets of features is judged by the cos distance, and a similarity loss function is computed and optimized so as to pull the features output by the teacher detection model and the features output by the student detection model closer together. The cos distance can be defined by the following formula:
D_i = (t_i · s_i) / (‖t_i‖ × ‖s_i‖)
此外，在教师检测模型输出的特征与学生检测模型输出的特征越相似时，cos距离就越大，因此，相似损失函数可为S_i=1-D_i，在本公开实施例中，可将相似损失函数作为损失函数的第二损失项。In addition, the more similar the features output by the teacher detection model and the features output by the student detection model are, the larger the cos distance D_i is; therefore, the similarity loss function may be S_i = 1 - D_i. In the embodiment of the present disclosure, this similarity loss function can be used as the second loss item of the loss function.
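The per-layer similarity loss S_i = 1 - D_i described above can be sketched as follows. Flattening each feature map before the dot product is an assumption, since the disclosure does not specify how the cos distance is computed over the spatial dimensions:

```python
import numpy as np

def similarity_loss(teacher_feats, student_feats):
    """For each pair of corresponding feature-extraction-layer outputs,
    compute the cosine distance D_i of the flattened features and return
    the per-layer similarity loss S_i = 1 - D_i."""
    losses = []
    for t, s in zip(teacher_feats, student_feats):
        t, s = t.ravel(), s.ravel()
        d = float(np.dot(t, s) / (np.linalg.norm(t) * np.linalg.norm(s)))
        losses.append(1.0 - d)
    return losses

# T = {t1, ..., t5} and S = {s1, ..., s5}; identical features give D_i = 1,
# so every similarity loss term is 0.
t_layers = [np.ones((2, 2)) for _ in range(5)]
s_layers = [np.ones((2, 2)) for _ in range(5)]
loss_items = similarity_loss(t_layers, s_layers)
```

Minimizing each S_i pulls the student layer's output toward the teacher layer's output, which is exactly the "shortening the distance" behavior the similarity loss is meant to produce.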
步骤407,根据损失函数的各损失项,对学生检测模型进行训练。 Step 407, train the student detection model according to each loss item of the loss function.
进一步地,可根据损失函数的第一损失项和第二损失项对学生检测模型进行训练。Further, the student detection model can be trained according to the first loss item and the second loss item of the loss function.
举例而言，如图5所示，将训练图片分别输入教师网络(教师检测模型)和学生网络(学生检测模型)，教师网络可输出教师特征(第一特征图)，学生网络可输出学生特征(第二特征图)，教师特征经过教师检测模型中的头部网络(如3D head)输出第一物体距离图，学生特征经过学生检测模型中的头部网络输出第二物体距离图，将第一物体距离图与第二物体距离图进行距离度量，获取第一物体距离图与第二物体距离图之间的差异位置，确定教师特征中该差异位置对应的第一局部特征，确定学生特征中该差异位置对应的第二局部特征，将第一局部特征与第二局部特征之间的差异作为损失函数的第一损失项，将教师特征和学生特征之间的差异作为损失函数的第二损失项，根据损失函数的第一损失项和第二损失项对学生网络进行训练。For example, as shown in FIG. 5, the training picture is input into the teacher network (teacher detection model) and the student network (student detection model) respectively; the teacher network outputs the teacher feature (the first feature map) and the student network outputs the student feature (the second feature map). The teacher feature is passed through the head network (such as a 3D head) of the teacher detection model to output the first object distance map, and the student feature is passed through the head network of the student detection model to output the second object distance map. A distance metric is computed between the first object distance map and the second object distance map to obtain the difference positions between them; the first local feature corresponding to a difference position is determined in the teacher feature and the second local feature corresponding to the difference position is determined in the student feature; the difference between the first local feature and the second local feature is used as the first loss item of the loss function, the difference between the teacher feature and the student feature is used as the second loss item of the loss function, and the student network is trained according to the first loss item and the second loss item of the loss function.
在本公开实施例中，步骤401-404可以分别采用本公开的各实施例中的任一种方式实现，本公开实施例并不对此作出限定，也不再赘述。In the embodiments of the present disclosure, steps 401-404 may each be implemented in any one of the manners described in the embodiments of the present disclosure, which is not limited in the embodiments of the present disclosure, and will not be repeated here.
综上，通过获取经过训练的教师检测模型，以及待训练的学生检测模型；将训练图像输入教师检测模型，得到教师检测模型对训练图像提取的第一特征图，以及根据第一特征图预测的第一物体距离图；将训练图像输入学生检测模型，得到学生检测模型对训练图像提取的第二特征图，以及根据第二特征图预测的第二物体距离图；根据第二物体距离图与第一物体距离图中的差异位置，在第一特征图中确定差异位置对应的第一局部特征，以及在第二特征图中确定所述差异位置对应的第二局部特征；根据第一局部特征和第二局部特征之间的差异，确定损失函数的第一损失项；根据第一特征图和第二特征图之间的差异，确定损失函数的第二损失项；根据损失函数的各损失项，对学生检测模型进行训练。该方法根据教师检测模型和学生检测模型输出的特征图对应的距离图的差异位置，将该差异位置在对应特征图中的特征之间的差异，作为损失函数的第一损失项，以及特征图之间的差异作为损失函数第二损失项，根据第一损失项和第二损失项对学生检测模型进行训练，可提高学生检测模型的检测精度，这样，简单的学生检测模型可以达到跟复杂的教师检测模型相似的检测精度，而减少了对计算资源的占用与部署成本，并且提高了推算速度。To sum up, the trained teacher detection model and the student detection model to be trained are obtained; the training image is input into the teacher detection model to obtain the first feature map extracted from the training image by the teacher detection model and the first object distance map predicted according to the first feature map; the training image is input into the student detection model to obtain the second feature map extracted from the training image by the student detection model and the second object distance map predicted according to the second feature map; according to the difference positions between the second object distance map and the first object distance map, the first local feature corresponding to a difference position is determined in the first feature map and the second local feature corresponding to the difference position is determined in the second feature map; the first loss item of the loss function is determined according to the difference between the first local feature and the second local feature; the second loss item of the loss function is determined according to the difference between the first feature map and the second feature map; and the student detection model is trained according to each loss item of the loss function. In this method, the difference between the features at the difference positions of the distance maps corresponding to the feature maps output by the teacher detection model and the student detection model is used as the first loss item of the loss function, and the difference between the feature maps is used as the second loss item; training the student detection model according to the first loss item and the second loss item can improve the detection accuracy of the student detection model, so that a simple student detection model can achieve detection accuracy similar to that of a complex teacher detection model while reducing the occupation of computing resources and deployment costs and improving inference speed.
为了进一步提高学生检测模型的检测精度,如图6所示,图6是根据本公开第四实施例的示意图。在本公开实施例中,对学生检测模型进行训练的损失函数还可包括第三损失项,图6所示实施例可包括如下步骤:In order to further improve the detection accuracy of the student detection model, as shown in FIG. 6 , FIG. 6 is a schematic diagram according to a fourth embodiment of the present disclosure. In the embodiment of the present disclosure, the loss function for training the student detection model may also include a third loss item, and the embodiment shown in FIG. 6 may include the following steps:
步骤601,获取经过训练的教师检测模型,以及待训练的学生检测模型。 Step 601, acquire the trained teacher detection model and the student detection model to be trained.
步骤602,将训练图像输入教师检测模型,得到教师检测模型对训练图像提取的第一特征图,以及根据第一特征图预测的第一物体距离图。Step 602: Input the training image into the teacher detection model to obtain the first feature map extracted from the training image by the teacher detection model and the first object distance map predicted according to the first feature map.
步骤603,将训练图像输入学生检测模型,得到学生检测模型对训练图像提取的第二特征图,以及根据第二特征图预测的第二物体距离图。Step 603: Input the training image into the student detection model to obtain the second feature map extracted from the training image by the student detection model and the second object distance map predicted according to the second feature map.
步骤604，根据第二物体距离图与第一物体距离图中的差异位置，在第一特征图中确定差异位置对应的第一局部特征，以及在第二特征图中确定差异位置对应的第二局部特征。 Step 604, according to the difference positions between the second object distance map and the first object distance map, determine the first local feature corresponding to a difference position in the first feature map, and determine the second local feature corresponding to the difference position in the second feature map.
步骤605,根据第一局部特征和第二局部特征之间的差异,确定损失函数的第一损失项。Step 605: Determine the first loss item of the loss function according to the difference between the first local feature and the second local feature.
步骤606,根据第一特征图和第二特征图之间的差异,确定损失函数的第二损失项。Step 606: Determine the second loss item of the loss function according to the difference between the first feature map and the second feature map.
步骤607,获取训练样本的标注。 Step 607, obtaining labels of training samples.
在本公开实施例中,可预先在训练样本上进行物体位置或者物体尺寸的标注。In the embodiment of the present disclosure, the object position or object size may be marked on the training samples in advance.
步骤608,根据训练样本标注的物体位置与学生检测模型预测的物体位置之间的差异,和/或根据训练样本标注的物体尺寸与学生检测模型预测的物体尺寸之间的差异,确定第三损失项。 Step 608, according to the difference between the object position marked by the training sample and the object position predicted by the student detection model, and/or according to the difference between the object size marked by the training sample and the object size predicted by the student detection model, determine the third loss item.
作为一种示例，可将训练样本标注的物体位置与学生检测模型预测的物体位置进行比对，确定训练样本标注的物体位置与学生检测模型预测的物体位置之间的差异，作为损失函数的第三损失项，对学生检测模型进行训练。As an example, the object position labelled in the training sample may be compared with the object position predicted by the student detection model, the difference between the labelled object position and the predicted object position is determined and used as the third loss item of the loss function, and the student detection model is trained accordingly.
作为另一种示例，将训练样本标注的物体尺寸与学生检测模型预测的物体尺寸进行比对，确定训练样本标注的物体尺寸与学生检测模型预测的物体尺寸之间的差异，作为损失函数的第三损失项，对学生检测模型进行训练。As another example, the object size labelled in the training sample is compared with the object size predicted by the student detection model, the difference between the labelled object size and the predicted object size is determined and used as the third loss item of the loss function, and the student detection model is trained accordingly.
作为另一种示例，可将训练样本标注的物体位置与学生检测模型预测的物体位置进行比对，确定训练样本标注的物体位置与学生检测模型预测的物体位置之间的差异，将训练样本标注的物体尺寸与学生检测模型预测的物体尺寸进行比对，确定训练样本标注的物体尺寸与学生检测模型预测的物体尺寸之间的差异，将物体位置之间的差异和物体尺寸之间的差异，作为损失函数的第三损失项，对学生检测模型进行训练。As another example, the object position labelled in the training sample may be compared with the object position predicted by the student detection model to determine the difference between them, and the object size labelled in the training sample may be compared with the object size predicted by the student detection model to determine the difference between them; the position difference and the size difference are together used as the third loss item of the loss function, and the student detection model is trained accordingly.
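The third loss item described in the three examples above can be sketched as a difference between the labelled and predicted object positions and/or sizes. The use of a mean absolute (L1) difference is an assumption for illustration; the disclosure only speaks of "the difference":

```python
import numpy as np

def third_loss_item(pred_pos, gt_pos, pred_size=None, gt_size=None):
    """Mean absolute difference between predicted and labelled object
    positions, optionally plus the same term for object sizes
    (illustrative sketch; L1 is an assumption)."""
    loss = float(np.abs(np.asarray(pred_pos) - np.asarray(gt_pos)).mean())
    if pred_size is not None and gt_size is not None:
        loss += float(np.abs(np.asarray(pred_size) - np.asarray(gt_size)).mean())
    return loss

pred_pos = np.array([1.0, 2.0, 3.0])   # predicted 3D object position
gt_pos   = np.array([1.0, 2.0, 4.0])   # labelled 3D object position
pred_sz  = np.array([2.0, 2.0, 2.0])   # predicted length/width/height
gt_sz    = np.array([2.0, 2.0, 5.0])   # labelled length/width/height
loss = third_loss_item(pred_pos, gt_pos, pred_sz, gt_sz)   # 1/3 + 1 = 4/3
```

Passing only the positions (or only the sizes) covers the "and/or" variants of step 608, while passing both covers the combined variant.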
步骤609,根据损失函数的各损失项,对学生检测模型进行训练。 Step 609, train the student detection model according to each loss item of the loss function.
进一步地,可根据损失函数的第一损失项、第二损失项和第三损失项对学生检测模型进行训练。Further, the student detection model can be trained according to the first loss item, the second loss item and the third loss item of the loss function.
在本公开实施例中,步骤601-606可以分别采用本公开的各实施例中的任一种方式实现,本公开实施例并不对此作出限定,也不再赘述。In the embodiment of the present disclosure, steps 601-606 may be implemented in any one of the embodiments of the present disclosure, which is not limited in the embodiment of the present disclosure, and will not be repeated here.
综上，通过获取经过训练的教师检测模型，以及待训练的学生检测模型；将训练图像输入教师检测模型，得到教师检测模型对训练图像提取的第一特征图，以及根据第一特征图预测的第一物体距离图；将训练图像输入学生检测模型，得到学生检测模型对训练图像提取的第二特征图，以及根据第二特征图预测的第二物体距离图；根据第二物体距离图与第一物体距离图中的差异位置，在第一特征图中确定差异位置对应的第一局部特征，以及在第二特征图中确定所述差异位置对应的第二局部特征；根据第一局部特征和第二局部特征之间的差异，确定损失函数的第一损失项；根据第一特征图和第二特征图之间的差异，确定损失函数的第二损失项；获取训练样本的标注；根据训练样本标注的物体位置与学生检测模型预测的物体位置之间的差异，和/或根据训练样本标注的物体尺寸与学生检测模型预测的物体尺寸之间的差异，确定第三损失项；根据损失函数的各损失项，对学生检测模型进行训练。该方法根据教师检测模型和学生检测模型输出的特征图对应的距离图的差异位置，将该差异位置在对应特征图中的特征之间的差异，作为损失函数的第一损失项，以及特征图之间的差异作为损失函数第二损失项，根据训练样本标注的物体位置与学生检测模型预测的物体位置之间的差异，和/或根据训练样本标注的物体尺寸与学生检测模型预测的物体尺寸之间的差异作为损失函数的第三损失项，可提高学生检测模型的检测精度，这样，简单的学生检测模型可以达到跟复杂的教师检测模型相似的检测精度，而减少了对计算资源的占用与部署成本，并且提高了推算速度。To sum up, the trained teacher detection model and the student detection model to be trained are obtained; the training image is input into the teacher detection model to obtain the first feature map extracted from the training image by the teacher detection model and the first object distance map predicted according to the first feature map; the training image is input into the student detection model to obtain the second feature map extracted from the training image by the student detection model and the second object distance map predicted according to the second feature map; according to the difference positions between the second object distance map and the first object distance map, the first local feature corresponding to a difference position is determined in the first feature map and the second local feature corresponding to the difference position is determined in the second feature map; the first loss item of the loss function is determined according to the difference between the first local feature and the second local feature; the second loss item of the loss function is determined according to the difference between the first feature map and the second feature map; the labels of the training samples are obtained; the third loss item is determined according to the difference between the object position labelled in the training sample and the object position predicted by the student detection model, and/or according to the difference between the object size labelled in the training sample and the object size predicted by the student detection model; and the student detection model is trained according to each loss item of the loss function. In this method, the difference between the features at the difference positions of the distance maps corresponding to the feature maps output by the teacher detection model and the student detection model is used as the first loss item of the loss function, the difference between the feature maps is used as the second loss item, and the difference between the labelled and predicted object positions and/or the difference between the labelled and predicted object sizes is used as the third loss item, which can improve the detection accuracy of the student detection model, so that a simple student detection model can achieve detection accuracy similar to that of a complex teacher detection model while reducing the occupation of computing resources and deployment costs and improving inference speed.
本公开实施例的物体检测模型的训练方法，通过获取经过训练的教师检测模型，以及待训练的学生检测模型；将训练图像输入教师检测模型，得到教师检测模型对训练图像提取的第一特征图，以及根据第一特征图预测的第一物体距离图；将训练图像输入学生检测模型，得到学生检测模型对训练图像提取的第二特征图，以及根据第二特征图预测的第二物体距离图；根据第二物体距离图与第一物体距离图中的差异位置，在第一特征图中确定差异位置对应的第一局部特征，以及在第二特征图中确定差异位置对应的第二局部特征；根据第一局部特征和第二局部特征之间的差异，对学生检测模型进行训练，该方法根据教师检测模型和学生检测模型输出的特征图对应的距离图的差异位置，该差异位置在对应特征图中的特征之间的差异，对学生检测模型进行训练，可进一步提高学生检测模型对教师检测模型的检测信息的挖掘，提高了学生检测模型的检测精度，这样，简单的学生检测模型可以达到跟复杂的教师检测模型相似的检测精度，而减少了对计算资源的占用，减少部署成本，并且提高了推算速度。In the training method for an object detection model of the embodiments of the present disclosure, the trained teacher detection model and the student detection model to be trained are obtained; the training image is input into the teacher detection model to obtain the first feature map extracted from the training image by the teacher detection model and the first object distance map predicted according to the first feature map; the training image is input into the student detection model to obtain the second feature map extracted from the training image by the student detection model and the second object distance map predicted according to the second feature map; according to the difference positions between the second object distance map and the first object distance map, the first local feature corresponding to a difference position is determined in the first feature map and the second local feature corresponding to the difference position is determined in the second feature map; and the student detection model is trained according to the difference between the first local feature and the second local feature. Training the student detection model according to the difference positions of the distance maps corresponding to the feature maps output by the teacher detection model and the student detection model, and the difference between the features at those positions in the corresponding feature maps, further improves the student detection model's mining of the detection information of the teacher detection model and improves the detection accuracy of the student detection model, so that a simple student detection model can achieve detection accuracy similar to that of a complex teacher detection model while reducing the occupation of computing resources, reducing deployment costs, and improving inference speed.
图7是根据本公开第五实施例的示意图,在本公开实施例中,可将经过训练的学生检测模型用于图像检测,基于此,本公开提出一种图像检测方法。本公开实施例的图像检测方法可应用于本公开实施例的图像检测装置,该装置可被配置于电子设备中。其中,该电子设备可以是移动终端,例如,手机、平板电脑、个人数字助理等具有各种操作系统的硬件设备。如图7所示,该图像检测方法包括:FIG. 7 is a schematic diagram according to a fifth embodiment of the present disclosure. In the embodiment of the present disclosure, a trained student detection model can be used for image detection. Based on this, the present disclosure proposes an image detection method. The image detection method of the embodiment of the present disclosure can be applied to the image detection device of the embodiment of the present disclosure, and the device can be configured in an electronic device. Wherein, the electronic device may be a mobile terminal, for example, a mobile phone, a tablet computer, a personal digital assistant, and other hardware devices with various operating systems. As shown in Figure 7, the image detection method includes:
步骤701,获取单目图像。 Step 701, acquire a monocular image.
在本公开实施例中,可通过图像采集设备获取单目图像。In the embodiment of the present disclosure, a monocular image may be acquired by an image acquisition device.
步骤702,采用经过训练的学生检测模型对单目图像进行图像检测,以得到单目图像中物体的物体信息;其中,学生检测模型,是采用图1至图6所述的训练方法训练得到。 Step 702, using the trained student detection model to perform image detection on the monocular image to obtain the object information of the object in the monocular image; wherein, the student detection model is obtained by training using the training methods described in FIGS. 1 to 6 .
可选地，将单目图像输入经过训练的学生检测模型，经过训练的学生检测模型可输出单目图像中物体的物体信息，比如，物体的3D位置信息、物体的长宽高以及物体的朝向角共七个自由度的信息。其中，需要说明的是，学生检测模型，是采用图1至图6所述的训练方法训练得到。Optionally, the monocular image is input into the trained student detection model, and the trained student detection model outputs the object information of the object in the monocular image, for example, the 3D position information of the object, the length, width, and height of the object, and the orientation angle of the object, i.e., seven degrees of freedom in total. It should be noted that the student detection model is trained using the training methods described in FIG. 1 to FIG. 6.
本公开实施例的图像检测方法，通过获取单目图像；采用经过训练的学生检测模型对单目图像进行图像检测，以得到单目图像中物体的物体信息，其中，学生检测模型，是采用图1至图6所述的训练方法训练得到。由此，采用经过训练的学生检测模型对单目图像进行检测，可提高图像的检测精度。In the image detection method of the embodiments of the present disclosure, a monocular image is acquired, and image detection is performed on the monocular image with the trained student detection model to obtain the object information of the object in the monocular image, wherein the student detection model is trained using the training methods described in FIG. 1 to FIG. 6. Therefore, detecting the monocular image with the trained student detection model can improve the detection accuracy of the image.
为了实现上述图1至图6实施例,本公开实施例还提出一种物体检测模型的训练装置。In order to realize the above-mentioned embodiments in FIG. 1 to FIG. 6 , an embodiment of the present disclosure further proposes an object detection model training device.
图8是根据本公开第六实施例的示意图，如图8所示，该物体检测模型的训练装置800包括：第一获取模块810、第一处理模块820、第二处理模块830、第一确定模块840、训练模块850。FIG. 8 is a schematic diagram according to a sixth embodiment of the present disclosure. As shown in FIG. 8, the training device 800 for an object detection model includes: a first acquisition module 810, a first processing module 820, a second processing module 830, a first determination module 840, and a training module 850.
其中，第一获取模块810，用于获取经过训练的教师检测模型，以及待训练的学生检测模型；第一处理模块820，用于将训练图像输入所述教师检测模型，得到教师检测模型对训练图像提取的第一特征图，以及根据第一特征图预测的第一物体距离图；第二处理模块830，用于将训练图像输入学生检测模型，得到学生检测模型对训练图像提取的第二特征图，以及根据第二特征图预测的第二物体距离图；第一确定模块840，用于根据第二物体距离图与第一物体距离图中的差异位置，在第一特征图中确定差异位置对应的第一局部特征，以及在第二特征图中确定差异位置对应的第二局部特征；训练模块850，用于根据第一局部特征和第二局部特征之间的差异，对学生检测模型进行训练。Among them, the first acquisition module 810 is configured to obtain the trained teacher detection model and the student detection model to be trained; the first processing module 820 is configured to input the training image into the teacher detection model to obtain the first feature map extracted from the training image by the teacher detection model and the first object distance map predicted according to the first feature map; the second processing module 830 is configured to input the training image into the student detection model to obtain the second feature map extracted from the training image by the student detection model and the second object distance map predicted according to the second feature map; the first determination module 840 is configured to determine, according to the difference positions between the second object distance map and the first object distance map, the first local feature corresponding to a difference position in the first feature map and the second local feature corresponding to the difference position in the second feature map; the training module 850 is configured to train the student detection model according to the difference between the first local feature and the second local feature.
As a possible implementation of an embodiment of the present disclosure, the first determination module 840 is configured to: determine a difference position at which the first object distance map output by a head network of the teacher detection model differs from the second object distance map output by the corresponding head network of the student detection model; in the first feature map, take the feature extracted at the difference position as the first local feature; and in the second feature map, take the feature extracted at the difference position as the second local feature.
As a possible implementation of an embodiment of the present disclosure, the first determination module 840 is further configured to: compare, at each position, the distance value in the first object distance map output by the head network of the teacher detection model with the distance value at the same position in the second object distance map output by the corresponding head network of the student detection model; and take a position where the difference between the distance values is greater than a threshold as the difference position.
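The thresholded comparison above can be sketched as a single mask computation; the function name and NumPy representation are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def difference_positions(teacher_dist, student_dist, threshold):
    """Boolean mask of positions where the teacher's and student's
    predicted object distance maps disagree by more than `threshold`."""
    teacher_dist = np.asarray(teacher_dist, dtype=float)
    student_dist = np.asarray(student_dist, dtype=float)
    return np.abs(teacher_dist - student_dist) > threshold
```

For example, with a threshold of 0.2, only the positions whose distance values differ by more than 0.2 are marked as difference positions.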
As a possible implementation of an embodiment of the present disclosure, the training module 850 is configured to: determine a first loss term of a loss function according to the difference between the first local feature and the second local feature; determine a second loss term of the loss function according to the difference between the first feature map and the second feature map; and train the student detection model according to the loss terms of the loss function.
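A minimal sketch of the two distillation loss terms follows. The mean-squared-error form and the function name are illustrative assumptions; the disclosure only specifies that the first term is computed from the local features at the difference positions and the second from the full feature maps.

```python
import numpy as np

def distillation_losses(t_feat, s_feat, diff_mask):
    """Return (first loss term, second loss term):
    - first: over the local features gathered at the difference positions,
    - second: over the full feature maps."""
    if diff_mask.any():
        first = float(np.mean((t_feat[..., diff_mask] - s_feat[..., diff_mask]) ** 2))
    else:
        first = 0.0  # no difference positions: the local term vanishes
    second = float(np.mean((t_feat - s_feat) ** 2))
    return first, second
```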
As a possible implementation of an embodiment of the present disclosure, the loss function further includes a third loss term, and the object detection model training apparatus 800 further includes a second acquisition module and a second determination module.
The second acquisition module is configured to acquire annotations of the training samples. The second determination module is configured to determine the third loss term according to the difference between the object position annotated for a training sample and the object position predicted by the student detection model, and/or according to the difference between the object size annotated for the training sample and the object size predicted by the student detection model.
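The third loss term can be sketched as a deviation between annotated and predicted position/size; the L1 form and the function name are illustrative assumptions (the disclosure does not mandate a particular distance).

```python
def third_loss_term(gt_position, pred_position, gt_size, pred_size):
    """Supervised regression term: deviation of the student model's
    predicted object position and size from the sample annotations."""
    position_term = sum(abs(g - p) for g, p in zip(gt_position, pred_position))
    size_term = sum(abs(g - p) for g, p in zip(gt_size, pred_size))
    return position_term + size_term
```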
As a possible implementation of an embodiment of the present disclosure, the teacher detection model and the student detection model each include a plurality of corresponding feature extraction layers, and the training module 850 is further configured to: determine a feature difference between the first feature map output by each feature extraction layer of the teacher detection model and the second feature map output by the corresponding feature extraction layer of the student detection model; and determine the second loss term of the loss function according to the determined feature differences.
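Accumulating the per-layer feature differences into the second loss term can be sketched as below; the per-layer MSE and the summation over layers are illustrative assumptions.

```python
import numpy as np

def layerwise_second_loss(teacher_maps, student_maps):
    """Second loss term accumulated over corresponding feature extraction
    layers: one feature-difference term per (teacher, student) layer pair."""
    return float(sum(np.mean((t - s) ** 2) for t, s in zip(teacher_maps, student_maps)))
```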
With the object detection model training apparatus of the embodiments of the present disclosure, a trained teacher detection model and a student detection model to be trained are acquired; a training image is input into the teacher detection model to obtain a first feature map extracted by the teacher detection model from the training image and a first object distance map predicted from the first feature map; the training image is input into the student detection model to obtain a second feature map extracted by the student detection model from the training image and a second object distance map predicted from the second feature map; according to a difference position between the second object distance map and the first object distance map, a first local feature corresponding to the difference position is determined in the first feature map and a second local feature corresponding to the difference position is determined in the second feature map; and the student detection model is trained according to the difference between the first local feature and the second local feature. The apparatus can thus train the student detection model using the difference between the features, in the corresponding feature maps, at the positions where the distance maps predicted from the teacher's and student's feature maps differ. This deepens the student detection model's mining of the teacher detection model's detection information and improves the detection accuracy of the student detection model, so that a simple student detection model can achieve detection accuracy similar to that of a complex teacher detection model, while reducing the occupation of computing resources and deployment costs and increasing inference speed.
In order to implement the embodiment described in FIG. 7, an embodiment of the present disclosure further provides an image detection apparatus.
FIG. 9 is a schematic diagram according to a seventh embodiment of the present disclosure. As shown in FIG. 9, the image detection apparatus 900 includes an acquisition module 910 and a detection module 920.
The acquisition module 910 is configured to acquire a monocular image. The detection module 920 is configured to perform image detection on the monocular image using a trained student detection model to obtain object information of objects in the monocular image, where the student detection model is trained by the training apparatus described in FIG. 8.
With the image detection apparatus of the embodiments of the present disclosure, a monocular image is acquired, and image detection is performed on the monocular image using a trained student detection model to obtain object information of objects in the monocular image, where the student detection model is trained by the training apparatus described in FIG. 8. Detecting the monocular image with the trained student detection model can thus improve the detection accuracy of the image.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
FIG. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 10, the device 1000 includes a computing unit 1001, which can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 1002 or a computer program loaded from a storage unit 1008 into a RAM (Random Access Memory) 1003. Various programs and data required for the operation of the device 1000 can also be stored in the RAM 1003. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other through a bus 1004. An I/O (Input/Output) interface 1005 is also connected to the bus 1004.
Multiple components in the device 1000 are connected to the I/O interface 1005, including: an input unit 1006, such as a keyboard or a mouse; an output unit 1007, such as various types of displays and speakers; a storage unit 1008, such as a magnetic disk or an optical disc; and a communication unit 1009, such as a network card, a modem, or a wireless communication transceiver. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 1001 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any appropriate processor, controller, microcontroller, or the like. The computing unit 1001 executes the methods and processes described above, such as the object detection model training method or the image detection method. For example, in some embodiments, the object detection model training method or the image detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the object detection model training method described above can be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured in any other appropriate way (for example, by means of firmware) to execute the object detection model training method or the image detection method.
Various implementations of the systems and techniques described herein above may be realized in digital electronic circuit systems, integrated circuit systems, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application-Specific Standard Products), SOCs (Systems on Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (for example, a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to the user; and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, speech input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and blockchain networks.
A computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the drawbacks of difficult management and weak business scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be noted that artificial intelligence is the discipline of making computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing. Artificial intelligence software technologies mainly include several major directions such as computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, big data processing technology, and knowledge graph technology.
It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The specific implementations described above do not limit the protection scope of the present disclosure. It should be apparent to those skilled in the art that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (17)

  1. A method for training an object detection model, comprising:
    acquiring a trained teacher detection model and a student detection model to be trained;
    inputting a training image into the teacher detection model to obtain a first feature map extracted by the teacher detection model from the training image, and a first object distance map predicted from the first feature map;
    inputting the training image into the student detection model to obtain a second feature map extracted by the student detection model from the training image, and a second object distance map predicted from the second feature map;
    determining, according to a difference position between the second object distance map and the first object distance map, a first local feature corresponding to the difference position in the first feature map, and a second local feature corresponding to the difference position in the second feature map; and
    training the student detection model according to the difference between the first local feature and the second local feature.
  2. The training method according to claim 1, wherein determining, according to the difference position between the second object distance map and the first object distance map, the first local feature corresponding to the difference position in the first feature map and the second local feature corresponding to the difference position in the second feature map comprises:
    determining a difference position at which the first object distance map output by a head network of the teacher detection model differs from the second object distance map output by the corresponding head network of the student detection model;
    in the first feature map, taking the feature extracted at the difference position as the first local feature; and
    in the second feature map, taking the feature extracted at the difference position as the second local feature.
  3. The training method according to claim 2, wherein determining the difference position at which the first object distance map output by the head network of the teacher detection model differs from the second object distance map output by the corresponding head network of the student detection model comprises:
    comparing, at each position, the distance value in the first object distance map output by the head network of the teacher detection model with the distance value at the same position in the second object distance map output by the corresponding head network of the student detection model; and
    taking a position where the difference between the distance values is greater than a threshold as the difference position.
  4. The training method according to claim 1, wherein training the student detection model according to the difference between the first local feature and the second local feature comprises:
    determining a first loss term of a loss function according to the difference between the first local feature and the second local feature;
    determining a second loss term of the loss function according to the difference between the first feature map and the second feature map; and
    training the student detection model according to the loss terms of the loss function.
  5. The training method according to claim 4, wherein the loss function further comprises a third loss term, and the method further comprises:
    acquiring annotations of the training samples; and
    determining the third loss term according to the difference between the object position annotated for the training sample and the object position predicted by the student detection model, and/or according to the difference between the object size annotated for the training sample and the object size predicted by the student detection model.
  6. The training method according to claim 4, wherein the teacher detection model and the student detection model each comprise a plurality of corresponding feature extraction layers, and determining the second loss term of the loss function according to the difference between the first feature map and the second feature map comprises:
    determining a feature difference between the first feature map output by each feature extraction layer of the teacher detection model and the second feature map output by the corresponding feature extraction layer of the student detection model; and
    determining the second loss term of the loss function according to the determined feature differences.
  7. An image detection method, comprising:
    acquiring a monocular image; and
    performing image detection on the monocular image using a trained student detection model to obtain object information of objects in the monocular image, wherein the student detection model is trained by the training method according to any one of claims 1-6.
  8. An apparatus for training an object detection model, comprising:
    a first acquisition module configured to acquire a trained teacher detection model and a student detection model to be trained;
    a first processing module configured to input a training image into the teacher detection model to obtain a first feature map extracted by the teacher detection model from the training image, and a first object distance map predicted from the first feature map;
    a second processing module configured to input the training image into the student detection model to obtain a second feature map extracted by the student detection model from the training image, and a second object distance map predicted from the second feature map;
    a first determination module configured to determine, according to a difference position between the second object distance map and the first object distance map, a first local feature corresponding to the difference position in the first feature map, and a second local feature corresponding to the difference position in the second feature map; and
    a training module configured to train the student detection model according to the difference between the first local feature and the second local feature.
  9. The apparatus according to claim 8, wherein the first determination module is configured to:
    determine a difference position at which the first object distance map output by a head network of the teacher detection model differs from the second object distance map output by the corresponding head network of the student detection model;
    in the first feature map, take the feature extracted at the difference position as the first local feature; and
    in the second feature map, take the feature extracted at the difference position as the second local feature.
  10. The apparatus according to claim 9, wherein the first determination module is further configured to:
    compare, at each position, the distance value in the first object distance map output by the head network of the teacher detection model with the distance value at the same position in the second object distance map output by the corresponding head network of the student detection model; and
    take a position where the difference between the distance values is greater than a threshold as the difference position.
  11. The apparatus according to claim 8, wherein the training module is configured to:
    determine a first loss term of a loss function according to the difference between the first local feature and the second local feature;
    determine a second loss term of the loss function according to the difference between the first feature map and the second feature map; and
    train the student detection model according to the loss terms of the loss function.
  12. The apparatus according to claim 11, wherein the loss function further comprises a third loss term, and the apparatus further comprises:
    a second acquisition module configured to acquire annotations of the training samples; and
    a second determination module configured to determine the third loss term according to the difference between the object position annotated for the training sample and the object position predicted by the student detection model, and/or according to the difference between the object size annotated for the training sample and the object size predicted by the student detection model.
  13. The apparatus according to claim 11, wherein the teacher detection model and the student detection model each comprise a plurality of corresponding feature extraction layers, and the training module is further configured to:
    determine feature differences between the first feature map output by each feature extraction layer of the teacher detection model and the second feature map output by the corresponding feature extraction layer of the student detection model; and
    determine the second loss term of the loss function according to the determined feature differences.
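Claim 13's per-layer variant of the second loss term can be sketched as an aggregation over corresponding feature extraction layers. The per-layer MSE and the plain sum are assumed choices; the claim only requires that the term be determined from the per-layer feature differences.

```python
import numpy as np

def second_loss_term(teacher_fmaps, student_fmaps):
    """Aggregate per-layer differences between corresponding teacher and
    student feature maps into one loss term (claim 13 reading)."""
    assert len(teacher_fmaps) == len(student_fmaps)
    return sum(float(np.mean((t - s) ** 2))
               for t, s in zip(teacher_fmaps, student_fmaps))

# Two hypothetical layers: the first disagrees, the second matches.
t_maps = [np.ones((2, 2)), np.zeros((2, 2))]
s_maps = [np.zeros((2, 2)), np.zeros((2, 2))]
loss2 = second_loss_term(t_maps, s_maps)
```

Summing rather than averaging over layers weights every extraction stage equally; a weighted sum would be an equally valid reading of the claim.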
  14. An image detection apparatus, comprising:
    an acquisition module configured to acquire a monocular image; and
    a detection module configured to perform image detection on the monocular image using a trained student detection model to obtain object information of objects in the monocular image, wherein the student detection model is trained by the training apparatus according to any one of claims 8-13.
  15. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-6, or to perform the method according to claim 7.
  16. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform the method according to any one of claims 1-6, or to perform the method according to claim 7.
  17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6, or performs the method according to claim 7.
PCT/CN2022/088005 2021-06-10 2022-04-20 Training method and apparatus for object detection model, and image detection method and apparatus WO2022257614A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2023515610A JP2023539934A (en) 2021-06-10 2022-04-20 Object detection model training method, image detection method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110649762.4A CN113378712B (en) 2021-06-10 2021-06-10 Training method and apparatus for object detection model, and image detection method and apparatus
CN202110649762.4 2021-06-10

Publications (1)

Publication Number Publication Date
WO2022257614A1 true WO2022257614A1 (en) 2022-12-15

Family

ID=77573820

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/088005 WO2022257614A1 (en) 2021-06-10 2022-04-20 Training method and apparatus for object detection model, and image detection method and apparatus

Country Status (3)

Country Link
JP (1) JP2023539934A (en)
CN (1) CN113378712B (en)
WO (1) WO2022257614A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378712B (en) * 2021-06-10 2023-07-04 北京百度网讯科技有限公司 Training method of object detection model, image detection method and device thereof
CN113806387A (en) * 2021-09-17 2021-12-17 北京百度网讯科技有限公司 Model training method, high-precision map change detection method and device and electronic equipment
CN113920307A (en) * 2021-09-29 2022-01-11 北京百度网讯科技有限公司 Model training method, device, equipment, storage medium and image detection method
CN115797736B (en) * 2023-01-19 2023-05-09 北京百度网讯科技有限公司 Training method, device, equipment and medium for target detection model and target detection method, device, equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180268292A1 (en) * 2017-03-17 2018-09-20 Nec Laboratories America, Inc. Learning efficient object detection models with knowledge distillation
CN111639744A (en) * 2020-04-15 2020-09-08 北京迈格威科技有限公司 Student model training method and device and electronic equipment
CN111709409A (en) * 2020-08-20 2020-09-25 腾讯科技(深圳)有限公司 Face living body detection method, device, equipment and medium
CN112257815A (en) * 2020-12-03 2021-01-22 北京沃东天骏信息技术有限公司 Model generation method, target detection method, device, electronic device, and medium
CN112508169A (en) * 2020-11-13 2021-03-16 华为技术有限公司 Knowledge distillation method and system
CN112561059A (en) * 2020-12-15 2021-03-26 北京百度网讯科技有限公司 Method and apparatus for model distillation
CN113378712A (en) * 2021-06-10 2021-09-10 北京百度网讯科技有限公司 Training method of object detection model, image detection method and device thereof

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3367295B2 (en) * 1995-08-14 2003-01-14 ケイディーディーアイ株式会社 Multi-valued neural network learning method
CN111127432B (en) * 2019-12-24 2021-01-12 推想医疗科技股份有限公司 Medical image detection method, device, equipment and storage medium
KR102191351B1 (en) * 2020-04-28 2020-12-15 아주대학교산학협력단 Method for semantic segmentation based on knowledge distillation
CN111967597A (en) * 2020-08-18 2020-11-20 上海商汤临港智能科技有限公司 Neural network training and image classification method, device, storage medium and equipment
CN112801298B (en) * 2021-01-20 2023-09-01 北京百度网讯科技有限公司 Abnormal sample detection method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN113378712B (en) 2023-07-04
JP2023539934A (en) 2023-09-20
CN113378712A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
US20220147822A1 (en) Training method and apparatus for target detection model, device and storage medium
EP4040401A1 (en) Image processing method and apparatus, device and storage medium
WO2022257614A1 (en) Training method and apparatus for object detection model, and image detection method and apparatus
WO2022227769A1 (en) Training method and apparatus for lane line detection model, electronic device and storage medium
CN113033622B (en) Training method, device, equipment and storage medium for cross-modal retrieval model
US11810319B2 (en) Image detection method, device, storage medium and computer program product
US11861919B2 (en) Text recognition method and device, and electronic device
WO2022257487A1 (en) Method and apparatus for training depth estimation model, and electronic device and storage medium
WO2022227768A1 (en) Dynamic gesture recognition method and apparatus, and device and storage medium
CN113780098B (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN113204615B (en) Entity extraction method, device, equipment and storage medium
US20230306081A1 (en) Method for training a point cloud processing model, method for performing instance segmentation on point cloud, and electronic device
CN113205041B (en) Structured information extraction method, device, equipment and storage medium
WO2022227759A1 (en) Image category recognition method and apparatus and electronic device
CN113378770A (en) Gesture recognition method, device, equipment, storage medium and program product
WO2023273344A1 (en) Vehicle line crossing recognition method and apparatus, electronic device, and storage medium
CN115457329B (en) Training method of image classification model, image classification method and device
JP2022185143A (en) Text detection method, and text recognition method and device
CN114972910B (en) Training method and device for image-text recognition model, electronic equipment and storage medium
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
JP2022185144A (en) Object detection method and training method and device of object detection model
US20230245429A1 (en) Method and apparatus for training lane line detection model, electronic device and storage medium
WO2023232031A1 (en) Neural network model training method and apparatus, electronic device and medium
CN114842541A (en) Model training and face recognition method, device, equipment and storage medium
CN113378774A (en) Gesture recognition method, device, equipment, storage medium and program product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22819217

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2023515610

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22819217

Country of ref document: EP

Kind code of ref document: A1