WO2020024584A1 - Method, device and apparatus for training object detection model - Google Patents

Method, device and apparatus for training object detection model

Info

Publication number
WO2020024584A1
Authority
WO
WIPO (PCT)
Prior art keywords
proposal
training image
detected
training
region
Prior art date
Application number
PCT/CN2019/076982
Other languages
French (fr)
Chinese (zh)
Inventor
张长征 (Changzheng ZHANG)
金鑫 (Xin JIN)
涂丹丹 (Dandan TU)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Priority claimed from CN201811070244.1A (CN110796154B)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to EP19808951.8A (EP3633553A4)
Publication of WO2020024584A1
Priority to US17/025,419 (US11423634B2)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present application relates to the field of computer technology, and in particular, to a method for training an object detection model, and a device and a computing device for performing the method.
  • Object detection is an artificial intelligence technology that accurately locates and classifies objects in images and videos. It includes general object detection, face detection, pedestrian detection, and text detection. In recent years, academia and industry have invested heavily in it, and the algorithms have steadily matured. Current deep-learning-based object detection solutions are used in municipal security (pedestrian detection, vehicle detection, license plate detection, etc.), finance (object detection, face registration, etc.), the Internet (identity verification), smart terminals, and other real products.
  • Object detection has been widely used in a variety of scenes of low and medium complexity (such as face detection in access control and checkpoint scenes).
  • In complex scenes, however, unfavorable factors such as large variations in the size of the object to be detected, occlusion, and distortion remain, and improving detection accuracy in their presence is still an open problem.
  • This application provides a method for training an object detection model, which improves the detection accuracy of the trained object detection model.
  • a method for training an object detection model performed by a computing device is provided.
  • the computing device executing the method may be one or more computing devices distributed in the same or different environments.
  • the method includes:
  • a training image is acquired, and a backbone network is established according to the training image.
  • The feature map output by the backbone network is input into a region proposal network.
  • The region proposal network selects a plurality of proposal regions from the feature map output by the backbone network according to the region proposal parameters, and inputs the sub-feature maps corresponding to the plurality of proposal regions to a classifier.
  • the classifier detects an object to be detected in the training image according to a sub-feature map corresponding to the plurality of proposed regions.
  • The region proposal network divides the plurality of proposal regions into at least two proposal region sets, and each proposal region set includes at least one proposal region.
  • The region proposal network inputs the sub-feature maps corresponding to the proposal regions included in each proposal region set into one of the at least two classifiers.
  • Each classifier of the at least two classifiers performs the following actions: detecting the object to be detected in the training image according to the sub-feature maps corresponding to the proposal regions included in the acquired proposal region set; comparing the detection result with the prior result of the object to be detected in the training image; and, according to the comparison result, exciting at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of each classifier.
  • Each of the at least two classifiers excites its own parameters according to the comparison result, and generally does not excite the parameters of the other classifiers in the at least two classifiers according to the comparison result.
  • The method provided above inputs the training image into the object detection model twice, training the object detection model in two stages.
  • In the first stage, the sizes of the objects to be detected are not distinguished, so the trained classifier has a global view.
  • In the second stage, each copied classifier is responsible for detecting the objects to be detected in one proposal region set, that is, for detecting one class of objects to be detected, so each trained classifier is more targeted and more sensitive to objects to be detected of a particular size.
  • The two-stage training therefore improves the detection accuracy of the trained object detection model for objects to be detected of different sizes. A sketch of the replication step is given below.
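  • As an illustration only (not the claimed implementation), the replication step can be sketched in PyTorch as follows; `classifier`, `proposal_sets`, and `P` are hypothetical names.

```python
import copy
import torch.nn as nn

def replicate_classifier(classifier: nn.Module, P: int) -> nn.ModuleList:
    """Copy the stage-1 classifier into P independent classifiers.

    Each copy starts from the stage-1 parameters but is excited (updated)
    independently during the second training stage.
    """
    return nn.ModuleList(copy.deepcopy(classifier) for _ in range(P))

# Hypothetical stage-2 routing: the sub-feature maps of each proposal region
# set are sent to the copy responsible for that size class.
# classifiers = replicate_classifier(classifier, P)
# for p, sub_feature_maps in enumerate(proposal_sets):
#     detections = classifiers[p](sub_feature_maps)
```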
  • Optionally, the method further includes: acquiring system parameters, where the system parameters include at least one of the following: the number of size clusters of the objects to be detected in the training image, and the computing capability available for training; and determining, according to the system parameters, the number of classifiers in the at least two classifiers obtained after copying.
  • the number of copied classifiers can be manually configured or calculated based on the situation of the object to be detected in the training image.
  • An appropriate choice of the number of copied classifiers further improves the detection accuracy of the trained object detection model for objects to be detected of different sizes.
  • When the system parameters include the number of size clusters of the objects to be detected in the training image, acquiring the system parameters includes: clustering the sizes of the objects to be detected in the training image to obtain the number of size clusters of the objects to be detected in the training image (for example with K-means, as sketched below).
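  • A minimal sketch of such clustering, assuming scikit-learn's K-means; the silhouette criterion for choosing the number of clusters is an assumption, since the text only requires clustering the sizes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def count_size_clusters(box_sizes: np.ndarray, max_k: int = 5) -> int:
    """Cluster object sizes (e.g. sqrt(w * h) of the marked boxes) and pick
    the cluster count with the best silhouette score."""
    best_k, best_score = 2, -1.0
    for k in range(2, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(box_sizes)
        score = silhouette_score(box_sizes, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Example sizes of the objects to be detected in a training set.
sizes = np.array([[12.0], [15.0], [14.0], [96.0], [104.0], [99.0]])
print(count_size_clusters(sizes))  # 2: a "small" and a "large" cluster
```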
  • the feature map output by the backbone network includes at least two feature maps.
  • The spans of different convolutional layers of the backbone network may be different, so the sizes of the objects to be detected in the proposal regions of the feature maps of different convolutional layers may also be different. Extracting at least two feature maps from the backbone network enriches the source of proposal regions, which further improves the detection accuracy of the trained object detection model for objects to be detected of different sizes. A sketch of tapping two layers is given below.
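  • A sketch of extracting feature maps from two convolutional layers with PyTorch forward hooks; the toy backbone and the chosen tap points are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy backbone; in practice a template such as ResNet or VGG would be used.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)

feature_maps = {}

def save_output(name):
    def hook(module, inputs, output):
        feature_maps[name] = output
    return hook

# Tap two convolutional layers so the region proposal network receives
# feature maps with different spans (here 4 and 8 relative to the input).
backbone[2].register_forward_hook(save_output("conv2"))
backbone[4].register_forward_hook(save_output("conv3"))

_ = backbone(torch.randn(1, 3, 224, 224))
for name, fmap in feature_maps.items():
    print(name, tuple(fmap.shape))  # conv2 (1, 32, 56, 56), conv3 (1, 64, 28, 28)
```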
  • a second aspect of the present application provides a detection model training device, including an initialization module, an object detection model, and an excitation module.
  • An object detection model is used to: acquire a training image and establish a backbone network based on the training image; select multiple proposal regions from the feature map output by the backbone network according to the region proposal parameters, and input the sub-feature maps corresponding to the multiple proposal regions to a classifier; and detect the object to be detected in the training image according to the sub-feature maps corresponding to the multiple proposal regions.
  • An excitation module is configured to compare the object to be detected that is detected in the training image with the prior result of the training image, and, according to the comparison result, excite at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of the classifier.
  • An initialization module is configured to duplicate the classifier to obtain at least two classifiers.
  • The object detection model is further configured to divide the plurality of proposal regions into at least two proposal region sets, each proposal region set including at least one proposal region, and to input the sub-feature maps corresponding to the proposal regions included in each proposal region set into one of the at least two classifiers. Each of the at least two classifiers performs the following actions: detecting the object to be detected in the training image according to the sub-feature maps corresponding to the proposal regions included in the acquired proposal region set; comparing the detection result with the prior result of the training image; and, according to the comparison result, exciting at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of each classifier.
  • Optionally, the initialization module is further configured to acquire system parameters, where the system parameters include at least one of the following: the number of size clusters of the objects to be detected in the training image, and the computing capability available for training; and to determine, according to the system parameters, the number of classifiers in the at least two classifiers obtained after replication.
  • the initialization module is further configured to cluster the sizes of the objects to be detected in the training image to obtain the number of size clusters of the objects to be detected in the training image.
  • the feature map output by the backbone network includes at least two feature maps.
  • a third aspect of the present application provides a computing device system.
  • the computing device system includes at least one computing device.
  • Each computing device includes a processor and memory.
  • A processor of the at least one computing device is configured to access code in the memory to perform the method provided in the first aspect or any possible implementation of the first aspect.
  • a fourth aspect of the present application provides a non-transitory readable storage medium.
  • When the program stored in the non-transitory readable storage medium is executed by at least one computing device, the at least one computing device executes the method provided in the foregoing first aspect or any possible implementation of the first aspect.
  • the program is stored in the storage medium.
  • The type of the storage medium includes, but is not limited to, volatile memory, such as random access memory, and non-volatile memory, such as flash memory, hard disk drive (HDD), and solid state drive (SSD).
  • a fifth aspect of the present application provides a computing device program product.
  • When the computing device program product is executed by at least one computing device, the at least one computing device executes the method provided in the foregoing first aspect or any possible implementation of the first aspect.
  • The computing device program product may be a software installation package. If the method provided in the foregoing first aspect or any possible implementation of the first aspect needs to be used, the computing device program product may be downloaded to and executed on a computing device.
  • A sixth aspect of the present application provides another method for training an object detection model performed by a computing device, the method including two-stage training, as follows.
  • In the first stage, a feature map of a training image is extracted through a backbone network, proposal regions are selected from the extracted feature map through a region proposal network, and the sub-feature maps corresponding to the proposal regions are input to a classifier.
  • The classifier detects the object to be detected in the training image according to the sub-feature maps corresponding to the proposal regions, the detection result is compared with the prior result of the training image, and at least one of the backbone network, the region proposal network, and the classifier is excited according to the comparison result.
  • In the second stage, at least two replication classifiers are established according to the classifier that has undergone the first-stage training. The region proposal network divides the proposal regions into at least two proposal region sets, each proposal region set includes at least one proposal region, and the sub-feature maps corresponding to the proposal regions included in each proposal region set are input to one replication classifier. Each replication classifier detects the object to be detected in the training image according to the acquired sub-feature maps, the detection result is compared with the prior result of the training image, and at least one of the backbone network, the region proposal network, and the replication classifiers is excited again according to the comparison result.
  • The classifier that has undergone the first-stage training may be duplicated to establish the at least two replication classifiers. It is also possible to adjust the classifier that has undergone the first-stage training before copying it to establish the at least two replication classifiers.
  • Optionally, the method further includes: acquiring system parameters, where the system parameters include at least one of the following: the number of size clusters of the objects to be detected in the training image, and the computing capability available for training; and determining, according to the system parameters, the number of replication classifiers established.
  • When the system parameters include the number of size clusters of the objects to be detected in the training image, acquiring the system parameters includes: clustering the sizes of the objects to be detected in the training image to obtain the number of size clusters of the objects to be detected in the training image.
  • the feature map extracted by the backbone network includes at least two feature maps.
  • a seventh aspect of the present application provides a computing device system.
  • the computing device system includes at least one computing device.
  • Each computing device includes a processor and memory.
  • A processor of the at least one computing device is configured to access code in the memory to perform the method provided in the sixth aspect or any possible implementation of the sixth aspect.
  • An eighth aspect of the present application provides a non-transitory readable storage medium.
  • When the program stored in the non-transitory readable storage medium is executed by at least one computing device, the at least one computing device executes the method provided in the foregoing sixth aspect or any possible implementation of the sixth aspect.
  • the program is stored in the storage medium.
  • the type of the storage medium includes, but is not limited to, volatile memory, such as random access memory, and non-volatile memory, such as flash memory, HDD, and SSD.
  • a ninth aspect of the present application provides a computing device program product.
  • When the computing device program product is executed by at least one computing device, the at least one computing device executes the method provided in the foregoing sixth aspect or any possible implementation of the sixth aspect.
  • The computing device program product may be a software installation package.
  • If the method provided in the foregoing sixth aspect or any possible implementation of the sixth aspect needs to be used, the computing device program product may be downloaded to and executed on a computing device.
  • FIG. 1 is a schematic diagram of a system architecture provided by this application.
  • FIG. 2 is a schematic diagram of another system architecture provided by the present application.
  • FIG. 3 is a working flowchart of a detection model training device provided in the present application in a training state
  • FIG. 6 is a working flowchart of a detection model training device provided in the present application in a training state
  • FIG. 7 is another working flowchart of the detection model training device provided in the present application in a training state
  • FIG. 8 is a working flowchart, provided by the present application, of an object detection model in an inference state.
  • FIG. 9 is a schematic structural diagram of a convolution layer and a convolution kernel provided by the present application.
  • FIG. 10 is a schematic diagram of a receptive field of a convolutional layer provided by the present application.
  • FIG. 11 is a schematic diagram of a receptive field of another convolution layer provided by the present application.
  • FIG. 12 is a working flowchart of a regional proposal network provided by this application.
  • FIG. 15 is a schematic structural diagram of a detection model training device provided by the present application.
  • FIG. 16 is a schematic structural diagram of a computing device provided by the present application.
  • FIG. 17 is a schematic structural diagram of a computing device system provided by the present application.
  • FIG. 18 is a schematic structural diagram of another computing device system provided by the present application.
  • the method for training an object detection model provided by the present application is executed by a detection model training device.
  • the device can run in a cloud environment, specifically one or more computing devices on the cloud environment.
  • the device can also run in an edge environment, specifically one or more computing devices (edge computing devices) in the edge environment.
  • the device can also run in a terminal environment, specifically one or more terminal devices in the terminal environment.
  • Terminal equipment can be mobile phones, notebooks, servers, desktop computers, etc.
  • the edge computing device may be a server.
  • the detection model training device may be composed of multiple parts (modules), so each part of the detection model training device may also be separately deployed in different environments.
  • For example, the three modules of the detection model training device may be deployed separately across the cloud environment, the edge environment, and the terminal environment, either on all three or on any two of them.
  • FIG. 3 to FIG. 5 and FIG. 6 to FIG. 8 respectively illustrate the working flow diagrams of the two detection model training devices.
  • the training state of the detection model training device is divided into two stages.
  • the detection model training device works in the first stage of the training state.
  • the purpose of the training state is to use the training image and the prior results of the training image to train a highly accurate object detection model.
  • the prior result of the training image includes a mark of an object to be detected in the training image.
  • the training image in FIG. 3 includes multiple faces, and each face of the training image is marked with a white frame in the prior result of the training image (as shown in the upper left corner of FIG. 3).
  • the prior results of the training images can generally be provided manually.
  • a K-layer backbone network is established according to the training image.
  • the backbone network includes K convolutional layers, where K is a positive integer greater than 0.
  • the backbone network extracts feature maps from the training images.
  • the feature map extracted from the backbone network is input to a regional proposal network.
  • the regional proposal network selects a proposed region from the feature map and inputs the sub-feature map corresponding to the proposed region to the classifier.
  • Specifically, the region proposal network can directly compare the feature map with the prior result of the training image, and take regions of the feature map that have high coverage of the objects to be detected in the training image as the proposal regions.
  • the region proposal network may first identify the foreground region and the background region from the feature map, and then extract the proposed region from the foreground region.
  • the foreground area is an area containing a high amount of information and a high probability of including an object to be detected
  • the background area is an area containing a low amount of information, having a lot of repeated information, and a low probability of including an object to be detected.
  • Each sub-feature map consists of the part of the feature map that is located within the corresponding proposal region.
  • the classifier determines whether the area of the training image corresponding to the proposed region corresponding to the sub-feature map is an object to be detected according to the sub-feature map. As shown on the right side of Figure 3, the classifier marks the detected face area with a white frame on the training image. By comparing the detection result of the training image with the prior result of the training image, it is possible to know the difference between the object to be detected detected by the detection model training device and the prior result. As shown in FIG. 3, some faces in the prior result are not detected by the detection model training device.
  • The parameters of the object detection model are excited according to the difference, including at least one of the following: the model parameters of the convolution kernels of each convolutional layer of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the classifier.
  • the detection model training device trains an object detection model through a large number of training images and prior results of the training images.
  • the object detection model includes a backbone network, a region proposal network, and a classifier.
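  • In modern frameworks, exciting the parameters according to the comparison difference corresponds to a gradient update. A minimal, self-contained sketch with toy stand-ins for the three components (the module shapes and the loss are assumptions, not the claimed implementation):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three components of the object detection model.
backbone = nn.Conv2d(3, 8, 3, padding=1)  # stands in for the backbone network
rpn = nn.Conv2d(8, 8, 1)                  # stands in for the region proposal network
classifier = nn.Linear(8, 2)              # object to be detected / background

optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(rpn.parameters())
    + list(classifier.parameters()), lr=0.01, momentum=0.9)

def training_step(image, prior_labels):
    feature_map = backbone(image)            # feature extraction
    sub_features = rpn(feature_map)          # per-proposal features
    pooled = sub_features.mean(dim=(2, 3))   # crude stand-in for RoI pooling
    detections = classifier(pooled)
    loss = nn.functional.cross_entropy(detections, prior_labels)
    optimizer.zero_grad()
    loss.backward()    # the comparison difference "excites" ...
    optimizer.step()   # ... the parameters of all three components
    return loss.item()

print(training_step(torch.randn(4, 3, 32, 32), torch.tensor([0, 1, 0, 1])))
```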
  • the object detection model that has undergone the first stage of the training mode enters the second stage of the training mode, as shown in FIG. 4.
  • the classifier that has undergone the first stage in FIG. 3 is first copied into P copies.
  • the training image is input to the backbone network, and the feature map extracted by the backbone network is input to the regional proposal network.
  • the regional proposal network selects proposal regions from the feature map, and aggregates the selected proposal regions into P proposal region sets according to the size of the proposal region. The sizes of the proposed regions in each set of proposed regions are similar.
  • the sub-feature maps corresponding to the P proposed region sets are input to P classifiers, respectively.
  • a proposed region set corresponds to a classifier, and a sub-feature map corresponding to the proposed region in the proposed region set is input into the classifier.
  • Each classifier detects objects of different sizes in the training image according to the received sub-feature maps, and obtains corresponding detection results.
  • the detection result of each classifier is compared with the prior result of the size of the object to be detected corresponding to the sub-feature map received by the classifier in the training image.
  • the difference between the detection result of the object to be detected at each size and the prior result of the object to be detected at that size will stimulate various parameters of the object detection model.
  • Each classifier will thus be trained to be more sensitive to objects to be detected of a particular size range, so the accuracy of the object detection model will be further improved after this second-stage excitation over a large number of training images.
  • Correspondingly, the prior results of the objects to be detected in the training image are divided into P classes according to size and are used for comparison by the P classifiers.
  • Here P is equal to 2, that is, the proposal regions are divided into two proposal region sets by the region proposal network.
  • The proposal regions in one set (corresponding to the upper classifier) are smaller, and the proposal regions in the other set (corresponding to the lower classifier) are larger. Therefore, the sub-feature maps corresponding to the proposal regions in the former set are used to detect smaller objects to be detected in the training image, and the sub-feature maps corresponding to the proposal regions in the latter set are used to detect larger objects to be detected in the training image.
  • The sub-feature maps of the two proposal region sets are input into different classifiers.
  • the upper classifier is used to detect smaller objects to be detected
  • the lower classifier is used to detect larger objects to be detected
  • the detection results output by the two classifiers are compared with the corresponding prior results.
  • Detection result 1 includes the objects to be detected that are detected by the upper classifier according to the sub-feature maps corresponding to the smaller proposal regions, and prior result 1 of the training image includes the prior results (size, coordinates, etc.) of the smaller objects to be detected in the training image.
  • Detection result 1 is compared with prior result 1, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of each convolutional layer of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the upper classifier.
  • Likewise, detection result 2 includes the objects to be detected that are detected by the lower classifier according to the sub-feature maps corresponding to the larger proposal regions, and prior result 2 of the training image includes the prior results (size, coordinates, etc.) of the larger objects to be detected in the training image.
  • Detection result 2 is compared with prior result 2, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of each convolutional layer of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the lower classifier.
  • Preset thresholds can be used to distinguish different proposal region sets. For example, when P proposal region sets need to be distinguished, P-1 thresholds are preset, each threshold corresponding to a proposal region size, and the P-1 thresholds are used to aggregate the proposal regions selected by the region proposal network into P proposal region sets (a sketch follows below). Correspondingly, according to the size of the objects to be detected in the training image, the prior results of the training image are divided into P prior results, and each prior result is compared with the detection result of the corresponding size to excite the object detection model.
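  • A sketch of this threshold-based aggregation with NumPy; the threshold values are assumptions.

```python
import numpy as np

def split_into_sets(proposal_sizes, thresholds):
    """Split proposal regions into P sets using P-1 preset size thresholds;
    returns, for each proposal region, the index of its proposal region set."""
    return np.digitize(proposal_sizes, thresholds).tolist()

# P = 3 sets, so P - 1 = 2 thresholds (example values, in feature units).
sizes = np.array([5.0, 12.0, 40.0, 90.0])
print(split_into_sets(sizes, [16.0, 64.0]))  # [0, 0, 1, 2]
```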
  • the object detection model trained in the second stage can be deployed in the cloud environment, edge environment or terminal environment. Or part of the object detection model can be deployed on three or any two of the cloud environment, edge environment, and terminal environment.
  • the image to be detected is input into the backbone network of the object detection model.
  • After processing by the backbone network, the region proposal network, and the P classifiers, the object detection model outputs the detection result of the image to be detected.
  • the detection result includes information such as the position and number of the detected objects to be detected, such as how many human faces there are and where each human face appears.
  • the region proposal network is similar to the second stage of the training state. The extracted proposal regions are classified according to size, and the sub-feature map corresponding to each proposal region is sent to the classifier corresponding to the proposed region, respectively.
  • Each classifier detects objects of different sizes according to the sub-feature maps of the proposed regions of different sizes, and by integrating the detection results of the P classifiers, the detection results of the images to be detected can be obtained.
  • FIG. 6 to FIG. 8 illustrate another workflow of the detection model training device, covering both the training state and the inference state.
  • the feature maps extracted from at least two convolutional layers of the backbone network are used as the input of the region proposal network.
  • the detection model training device works in the first stage of the training state.
  • a K-layer backbone network is established according to the training image.
  • the backbone network includes K convolutional layers, where K is a positive integer greater than 0.
  • the backbone network extracts p feature maps from the training images.
  • The p feature maps can be the feature maps extracted by the convolution kernels of any p convolutional layers of the backbone network, or any p convolutional layers of the backbone network themselves.
  • the p feature maps extracted by the backbone network are input to a regional proposal network.
  • The region proposal network selects proposal regions from the p feature maps and inputs the sub-feature maps corresponding to the proposal regions into the classifier.
  • Each sub-feature map includes features in which a portion of the feature map is located within the proposed area.
  • the classifier determines whether the area of the training image corresponding to the proposed region corresponding to the sub-feature map is an object to be detected according to the sub-feature map.
  • the classifier marks the detected face area with a white frame on the training image.
  • By comparing the detection result of the training image with the prior result of the training image, it is possible to know the difference between the objects to be detected that are detected by the detection model training device and the prior result. As shown in FIG. 6, some faces in the prior result are not detected by the detection model training device.
  • The parameters of the object detection model are excited according to the difference, including at least one of the following: the model parameters of the convolution kernels of each convolutional layer of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the classifier.
  • The difference between the detection result of each training image and the prior result of that training image excites the above parameters of the object detection model, so after excitation by a large number of training images, the accuracy of the object detection model improves.
  • the detection model training device trains an object detection model through a large number of training images and prior results of the training images.
  • the object detection model includes a backbone network, a region proposal network, and a classifier.
  • the object detection model that has undergone the first stage of the training mode enters the second stage of the training mode, as shown in FIG. 7.
  • the classifier that has undergone the first stage in FIG. 6 is first copied into P copies.
  • the training image is input to the backbone network, and at least one feature map extracted by the backbone network is input to the region proposal network.
  • the regional proposal network selects proposal regions from the feature map, and aggregates the selected proposal regions into P proposal region sets according to the size of the proposal region.
  • Which proposal region set a proposal region belongs to is determined according to the size of the proposal region and the span of the convolutional layer corresponding to the feature map in which the proposal region is located.
  • the sub-feature maps corresponding to the proposed regions in the P proposed region sets are input into the P classifiers, respectively.
  • a proposed region set corresponds to a classifier, and a sub-feature map corresponding to the proposed region in the proposed region set is input into the classifier.
  • Each classifier detects objects of different sizes to be detected according to the received sub-feature map and obtains corresponding detection results.
  • the detection result of each classifier is compared with the prior result of the size of the object to be detected corresponding to the sub-feature map received by the classifier in the training image.
  • the difference between the detection result of the object to be detected at each size and the prior result of the object to be detected at that size will stimulate various parameters of the object detection model.
  • Each classifier will thus be trained to be more sensitive to objects to be detected of a particular size range, so the accuracy of the object detection model will be further improved after this second-stage excitation over a large number of training images.
  • Correspondingly, the prior results of the objects to be detected in the training image are classified into P classes according to size and are used for comparison by the P classifiers.
  • Here P is equal to 2, that is, the region proposal network divides the selected proposal regions into two proposal region sets.
  • For the proposal regions in one set (corresponding to the upper classifier), the product of the size of the proposal region and the span of the convolutional layer corresponding to the feature map in which the proposal region is located is small; for the proposal regions in the other set (corresponding to the lower classifier), this product is large.
  • Therefore, the sub-feature maps corresponding to the proposal regions in the former set are used to detect smaller objects to be detected in the training image, and the sub-feature maps corresponding to the proposal regions in the latter set are used to detect larger objects to be detected in the training image.
  • The sub-feature maps of the two proposal region sets are input into different classifiers.
  • the upper classifier is used to detect smaller objects to be detected
  • the lower classifier is used to detect larger objects to be detected
  • the detection results output by the two classifiers are compared with the corresponding prior results.
  • Detection result 1 includes the objects to be detected that are detected by the upper classifier according to the sub-feature maps corresponding to the smaller proposal regions, and prior result 1 of the training image includes the prior results (size, coordinates, etc.) of the smaller objects to be detected in the training image.
  • Detection result 1 is compared with prior result 1, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of each convolutional layer of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the upper classifier.
  • Detection result 2 includes the objects to be detected that are detected by the lower classifier according to the sub-feature maps corresponding to the larger proposal regions, and prior result 2 of the training image includes the prior results (size, coordinates, etc.) of the larger objects to be detected in the training image.
  • Detection result 2 is compared with prior result 2, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of each convolutional layer of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the lower classifier.
  • Preset thresholds can be used to distinguish different proposal region sets. For example, when P proposal region sets need to be distinguished, P-1 thresholds are preset, each threshold corresponding to a proposal region size, and the P-1 thresholds are used to aggregate the proposal regions selected by the region proposal network into P proposal region sets. Correspondingly, according to the size of the objects to be detected in the training image, the prior results of the training image are divided into P prior results, and each prior result is compared with the detection result of the corresponding size to excite the object detection model.
  • the object detection model trained in the second stage can be deployed in the cloud environment, edge environment or terminal environment. Or part of the object detection model can be deployed on three or any two of the cloud environment, edge environment, and terminal environment.
  • the image to be detected is input into the backbone network of the object detection model.
  • After processing by the backbone network, the region proposal network, and the P classifiers, the object detection model outputs the detection result of the image to be detected.
  • the detection result includes information such as the position and number of the detected objects to be detected, such as how many human faces there are and where each human face appears.
  • the region proposal network is similar to the second stage of the training state.
  • the extracted proposal regions are classified according to size, and the sub-feature maps corresponding to the proposed regions are sent to the corresponding classifiers.
  • Each classifier detects objects of different sizes according to the sub-feature maps of the proposed regions of different sizes, and by integrating the detection results of the P classifiers, the detection results of the images to be detected can be obtained.
  • the backbone network includes a convolutional network, which includes K convolutional layers.
  • the K convolutional layers of a general backbone network constitute multiple convolutional blocks, and each convolutional block includes multiple convolutional layers.
  • the number of convolutional blocks of the backbone network is usually five.
  • the backbone network can also include pooling modules.
  • The backbone network can use templates commonly used in the industry, such as VGG, ResNet, DenseNet, Xception, Inception, MobileNet, etc.; a sketch of building a backbone from one such template follows.
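  • A sketch of using one such template (ResNet-50 via torchvision, assuming torchvision >= 0.13 for the `weights` argument); dropping the final pooling and fully connected layers is a common convention, not something mandated here.

```python
import torch
import torch.nn as nn
import torchvision

# Keep the convolutional blocks of ResNet-50 and drop the final average
# pooling and fully connected layers, leaving a feature-map extractor.
resnet = torchvision.models.resnet50(weights=None)
backbone = nn.Sequential(*list(resnet.children())[:-2])

features = backbone(torch.randn(1, 3, 224, 224))
print(tuple(features.shape))  # (1, 2048, 7, 7): the feature map for the RPN
```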
  • the extracted features of the training image are used as the first convolutional layer of the backbone network.
  • Features extracted from the first convolutional layer of the backbone network by the convolution kernel corresponding to the first convolutional layer form the second convolutional layer of the backbone network.
  • Features extracted from the second convolutional layer of the backbone network by the convolution kernel corresponding to the second convolutional layer of the backbone network form the third convolutional layer of the backbone network.
  • By analogy, the features extracted from the (k-1)-th convolutional layer of the backbone network by the convolution kernel corresponding to the (k-1)-th convolutional layer form the k-th convolutional layer of the backbone network, where k is greater than 1 and less than or equal to K.
  • The feature map extracted by the convolution kernel corresponding to the K-th convolutional layer of the backbone network is used as the input of the region proposal network.
  • the K-th convolutional layer of the backbone network can be directly used as the feature map as the input of the regional proposal network.
  • Optionally, the feature map extracted from the k-th convolutional layer of the backbone network by the convolution kernel corresponding to the k-th convolutional layer can also be used as the input of the region proposal network.
  • the k-th convolution layer of the backbone network can be directly used as the feature map as the input of the regional proposal network.
  • The region proposal network includes L convolutional layers, where L is an integer greater than 0. Similar to the backbone network, the features extracted from the (k'-1)-th convolutional layer of the region proposal network by the convolution kernel corresponding to the (k'-1)-th convolutional layer form the k'-th convolutional layer of the region proposal network, where k' is greater than 1 and less than or equal to L.
  • Both the backbone network and the regional proposal network include at least one convolutional layer.
  • The size of the convolutional layer 101 is X*Y*N1, that is, the convolutional layer 101 includes X*Y*N1 features, where N1 is the number of channels, one channel is a feature dimension, X*Y is the number of features included in each channel, and X, Y, and N1 are all positive integers greater than 0.
  • The convolution kernel 1011 is one of the convolution kernels used for the convolutional layer 101. Since the convolutional layer 102 includes N2 channels, the convolutional layer 101 uses a total of N2 convolution kernels. The sizes and model parameters of the N2 convolution kernels may be the same or different. Taking the convolution kernel 1011 as an example, its size is X1*X1*N1, that is, the convolution kernel 1011 includes X1*X1*N1 model parameters.
  • the initialization model parameters in the convolution kernel can use model parameter templates commonly used in the industry.
  • The model parameters of the convolution kernel 1011 are multiplied with the features of the convolutional layer 101 at the corresponding positions. After the products of each model parameter of the convolution kernel 1011 and the feature of the convolutional layer 101 at the corresponding position are combined, one feature on one channel of the convolutional layer 102 is obtained.
  • The products of the features of the convolutional layer 101 and the convolution kernel 1011 can be used directly as the features of the convolutional layer 102. Alternatively, after the convolution kernel 1011 has slid over the convolutional layer 101 and output all the product results, all the product results can be normalized, and the normalized product results used as the features of the convolutional layer 102.
  • the convolution kernel 1011 slides on the convolution layer 101 to perform convolution, and the convolution result forms a channel of the convolution layer 102.
  • Each convolution kernel used by the convolution layer 101 corresponds to one channel of the convolution layer 102. Therefore, the number of channels of the convolution layer 102 is equal to the number of convolution kernels acting on the convolution layer 101.
  • The design of the model parameters in each convolution kernel reflects the characteristics of the features that the convolution kernel is expected to extract from the convolutional layer. Through the N2 convolution kernels, the features of the N2 channels are extracted from the convolutional layer 101; the sketch below checks this correspondence.
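  • The correspondence between the number of kernels and the number of output channels can be checked directly in PyTorch; N1, N2, and X1 below are example values.

```python
import torch
import torch.nn as nn

N1, N2, X1 = 64, 128, 3  # input channels, number of kernels, kernel size

# One Conv2d layer holds N2 convolution kernels, each of size X1*X1*N1, so
# the number of output channels equals the number of kernels.
conv = nn.Conv2d(in_channels=N1, out_channels=N2, kernel_size=X1, padding=1)
print(conv.weight.shape)  # torch.Size([128, 64, 3, 3]): N2 kernels of X1*X1*N1

layer_101 = torch.randn(1, N1, 56, 56)
layer_102 = conv(layer_101)
print(layer_102.shape)    # torch.Size([1, 128, 56, 56]): N2 channels
```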
  • The convolution kernel 1011 can be split into convolution pieces.
  • The convolution kernel 1011 includes N1 convolution pieces, and each convolution piece includes X1*X1 model parameters (P11 to PX1X1).
  • Each model parameter corresponds to a convolution point.
  • the model parameters corresponding to a convolution point are multiplied with the features in the convolution layer at the corresponding position of the convolution point to obtain the convolution result of the convolution point.
  • The sum of the convolution results of the convolution points of a convolution kernel is the convolution result of this convolution kernel.
  • The sliding span of a convolution kernel is the number of features the convolution kernel moves across the convolutional layer in each slide.
  • Each time it slides, the convolution kernel moves V features from its current position on the convolutional layer, and the model parameters of the convolution kernel are convolved with the features of the convolutional layer at the new position; V is the sliding span of the convolution kernel. The sketch below shows how the sliding span determines the number of output features.
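  • The number of features produced per row follows directly from the sliding span; a small sketch of the standard output-size formula:

```python
def output_features(X: int, X1: int, V: int, padding: int = 0) -> int:
    """Features per row after sliding an X1-wide convolution kernel with
    sliding span (stride) V over a row of X features."""
    return (X + 2 * padding - X1) // V + 1

print(output_features(X=7, X1=3, V=1))  # 5 positions with span 1
print(output_features(X=7, X1=3, V=2))  # 3 positions with span 2
```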
  • The receptive field is the perceptual range, on the input image, of a feature on a convolutional layer. If the pixels within that perceptual range change, the value of the feature changes accordingly.
  • the convolution kernel slides on the input image, and the extracted features constitute the convolution layer 101.
  • the convolution kernel slides on the convolution layer 101, and the extracted features constitute the convolution layer 102.
  • each feature in the convolution layer 101 is extracted from pixels of the input image within the size of the convolution sheet of the convolution kernel sliding on the input image, and this size is also the receptive field of the convolution layer 101. Therefore, the receptive field of the convolution layer 101 is shown in FIG. 10.
  • each feature in the convolutional layer 102 is mapped to a range on the input image (that is, how many pixels are used on the input image), that is, the receptive field of the convolutional layer 102.
  • each feature in the convolution layer 102 is extracted from the pixels of the input image within the size of the convolution piece of the convolution kernel sliding on the convolution layer 101.
  • Each feature on the convolutional layer 101 is extracted from the pixels of the input image within the range of the convolution pieces of the convolution kernel sliding on the input image. Therefore, the receptive field of the convolutional layer 102 is larger than that of the convolutional layer 101. If a backbone network includes multiple convolutional layers, the receptive field of the last convolutional layer is the receptive field of the backbone network. The standard recursion for computing receptive fields is sketched below.
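  • A sketch of the standard receptive-field recursion for a stack of convolutional layers, each described by its kernel size and sliding span (the recursion is textbook material, not taken from this application):

```python
def receptive_field(layers):
    """Receptive field on the input image after a stack of convolutional
    layers, each given as (kernel_size, sliding_span)."""
    rf, jump = 1, 1  # jump: input-pixel distance between neighbouring features
    for kernel, span in layers:
        rf += (kernel - 1) * jump
        jump *= span
    return rf

# Two 3x3 convolutions with span 1: each layer-102 feature sees 5x5 pixels.
print(receptive_field([(3, 1), (3, 1)]))  # 5
# With span 2 in the first layer, the receptive field grows to 7x7.
print(receptive_field([(3, 2), (3, 1)]))  # 7
```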
  • The training computing capability is the computing capability available to the detection model training device in the environment where it is deployed, including at least one of the following: processor frequency, processor occupancy, memory size, memory occupancy, cache utilization, cache size, graphics processor frequency, graphics processor occupancy, and other computing resource parameters.
  • The classifier includes a series of parameters, and the classifier detects information such as the position and number of objects to be detected in the image to be detected according to the input features and these parameters.
  • Common classifiers include the Softmax classifier, the Sigmoid classifier, and so on.
  • the size of the k + 1th convolutional layer of the backbone network is less than or equal to the size of the kth convolutional layer of the backbone network.
  • The span of the k-th convolutional layer of the backbone network is defined relative to the image input to the backbone network.
  • The image input to the backbone network may be a training image or an image to be detected.
  • The span of the k-th convolutional layer of the backbone network is generally determined by the number of pooling layers between the first convolutional layer and the k-th convolutional layer, and by the sliding spans of the convolution kernels used by the convolutional layers between the first convolutional layer and the k-th convolutional layer. The more pooling layers there are between the first convolutional layer and the k-th convolutional layer, and the larger the sliding spans of the convolution kernels used between them, the larger the span of the k-th convolutional layer. A sketch of this computation follows.
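  • A sketch of computing the span of a convolutional layer as the product of the downsampling factors (pooling layers and convolution sliding spans) that precede it; the example factors are assumptions.

```python
def layer_span(factors):
    """Span of a convolutional layer relative to the input image: the product
    of the sliding spans of the preceding convolution kernels and the
    downsampling factors of the preceding pooling layers."""
    span = 1
    for factor in factors:
        span *= factor
    return span

# One span-2 convolution followed by two 2x downsampling pooling layers:
print(layer_span([2, 2, 2]))  # span 8: one feature covers 8 input pixels per axis
```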
  • The region proposal network determines a plurality of proposal regions on the feature map according to the region proposal parameters.
  • The region proposal parameters may include the length and width of the proposal regions. The sizes of different proposal regions are generally different.
  • The region proposal network first obtains a plurality of proposal regions according to the region proposal parameters, and then calculates, through the convolution kernels corresponding to its L convolutional layers, the confidence of each of the plurality of proposal regions, that is, the probability that the region of the training image corresponding to each proposal region includes an object to be detected.
  • After the region proposal network obtains multiple proposal regions, for example proposal regions 1-4, the proposal regions can be aggregated into P proposal region sets according to the size of each proposal region (the number of features covered by the proposal region).
  • the region proposal network inputs the sub-feature map corresponding to the proposed region in a set of proposed regions to a classifier.
  • The size of a proposal region is related to the size of the object to be detected. Therefore, the proposal regions are aggregated into proposal region sets according to their sizes, and different classifiers detect the proposal regions in different proposal region sets and are excited based on their detection results, which makes different classifiers more sensitive to objects to be detected of different sizes.
  • When the region proposal network obtains proposal regions from different feature maps, the size of each proposal region and the span of the convolutional layer corresponding to the feature map in which the proposal region is located are considered together. The proposal regions obtained from different feature maps are then aggregated into P proposal region sets according to the size of each proposal region and the span of the convolutional layer corresponding to the feature map in which the proposal region is located. The region proposal network then inputs the sub-feature maps corresponding to the proposal regions in each proposal region set to one classifier.
  • Optionally, the region proposal network uses the product of the size of each proposal region and the span of the convolutional layer corresponding to the feature map in which the proposal region is located as the aggregation criterion. For example, after T proposal regions are obtained from different feature maps, the product of the size of each proposal region and the span of the corresponding convolutional layer is computed, and the T proposal regions are aggregated into P proposal region sets according to the T products; for example, each of the T products can be compared with P-1 preset thresholds to determine into which proposal region set the corresponding proposal region is divided (a sketch follows below).
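  • A sketch of this size-times-span aggregation with NumPy; the sizes, spans, and threshold below are example values.

```python
import numpy as np

def aggregate_by_size_times_span(sizes, spans, thresholds):
    """Aggregate T proposal regions from different feature maps into P sets
    by comparing size * span against P-1 preset thresholds."""
    products = np.asarray(sizes, dtype=float) * np.asarray(spans, dtype=float)
    return np.digitize(products, thresholds).tolist(), products.tolist()

# T = 4 proposal regions: sizes in features, spans of their feature maps.
set_ids, products = aggregate_by_size_times_span(
    sizes=[6, 10, 6, 12], spans=[4, 4, 16, 16], thresholds=[64.0])
print(products)  # [24.0, 40.0, 96.0, 192.0]: size on the input-image scale
print(set_ids)   # [0, 0, 1, 1]: small-object set vs. large-object set
```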
  • FIG. 13 and FIG. 14 respectively introduce the workflows of the detection model training device corresponding to FIG. 3 to FIG. 5 and to FIG. 6 to FIG. 8.
  • The workflow of the detection model training device corresponding to FIG. 3 to FIG. 5 is introduced first.
  • The number of size clusters of the objects to be detected is the number of sets into which the sizes of the objects to be detected can be clustered. For example, when the number of size clusters is 2, the sizes of the objects to be detected can be divided into two sets.
  • The number of size clusters of the objects to be detected may be obtained by using a clustering algorithm to cluster the sizes of the objects to be detected in the training image.
  • the clustering algorithm can use K-means and so on.
  • the number of size clusters of the object to be detected and the complexity of the object to be detected may also be manually input to the detection model training device.
  • The above system parameters refer to parameters of the training image, of the objects to be detected in the training image, of the backbone network, or of the training environment. Such system parameters can be obtained before the object detection model is established. System parameters are also called hyperparameters. Different system parameters may lead to different replication parameters.
  • the model parameters refer to the parameters corresponding to each convolution point in the convolution kernel. The model parameters are continuously excited and changed during the training process of the object detection model.
  • The above system parameters can be acquired at different times and need not be acquired in the same step. It is also not necessary to acquire all of the above system parameters; which system parameters are acquired is determined by the system parameters used in the subsequent step of determining the replication parameter.
  • the acquisition time of each system parameter can be before the subsequent steps that use the system parameter.
  • S202 can be executed at any time after S201 and before S208.
  • S203: Acquire a training image, establish a backbone network according to the training image, and obtain the feature map output by the backbone network.
  • The feature map output by the backbone network in S204 is the feature map of the K-th convolutional layer of the backbone network, or the features extracted by the convolution kernels of the K-th convolutional layer.
  • In S205, the region proposal network selects proposed regions from the feature map and inputs the sub-feature maps corresponding to the proposed regions into the classifier.
  • In S206, the classifier detects the objects to be detected in the training image according to the sub-feature maps input in S205.
  • Parameters are set in the classifier, and the classifier detects the objects to be detected in the training image according to these parameters and the input features; a sketch of such a classifier head follows below.
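Purely as an illustration of "parameters set in the classifier", the sketch below shows a small fully connected classifier head that maps a pooled sub-feature map to an object/background score. The 128-channel, 7x7 input size and the layer widths are assumptions, not values given in the patent.

```python
import torch.nn as nn

# Maps a pooled sub-feature map to a score for "object to be detected" vs. background.
classifier_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(128 * 7 * 7, 256),  # assumes 128-channel, 7x7 pooled sub-feature maps
    nn.ReLU(),
    nn.Linear(256, 2),            # the trainable "parameters set in the classifier"
)
```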
  • The detection model training device then obtains the next training image, and trains the object detection model according to the next training image and the prior result of the next training image.
  • The excitation process for the next training image is similar to that for the training image obtained in S203. The main differences are: 1. the feature maps of the next training image are extracted by the backbone network using the model parameters of the convolution kernels of its convolutional layers as already excited in S207 (if they were excited in S207); 2. the model parameters of the convolution kernels of the region proposal network and the region proposal parameters of the region proposal network are those excited in S207 (if they were excited in S207); 3. the parameters of the classifier applied to the next training image are those excited in S207 (if they were excited in S207).
  • Each training image thus further excites the object detection model on top of the excitation produced by the previous training image; an illustrative loop is sketched below. After all the training images have been used in turn for the training of the object detection model, the first stage of the training state of the object detection model ends.
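The loop below is an illustrative rendering of this sequential first-stage training; all names (model, optimizer, loss_fn) are placeholders, and the patent does not prescribe any particular framework or update rule.

```python
def train_stage_one(model, optimizer, dataset, loss_fn):
    """Each image excites the parameters left behind by the previous image."""
    for image, prior_result in dataset:          # prior_result: labelled boxes
        detection = model(image)                 # backbone -> RPN -> classifier
        loss = loss_fn(detection, prior_result)  # compare with the prior result
        optimizer.zero_grad()
        loss.backward()                          # excitation of the model parameters,
        optimizer.step()                         # carried over to the next image
```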
  • In S211, the region proposal network selects multiple proposed regions from the feature map, divides the selected proposed regions into P proposal region sets, and inputs the sub-feature maps corresponding to the proposed regions of each proposal region set into the corresponding classifier.
  • In S212, the classifier detects the objects to be detected in the training image according to the sub-feature maps input in S211.
  • In S212 and S213, each classifier detects the objects to be detected in the training image according to the sub-feature maps it has obtained, and is excited according to the comparison between its detection result and the prior result.
  • Each classifier copied in S208 executes S212 and S213, as sketched below.
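Below is a hedged sketch of one such second-stage step. Each duplicated classifier processes only its own proposal region set and is compared against the prior results of the matching size group, so the gradient reaching each classifier comes only from its own comparison. All function and variable names are placeholders.

```python
def train_stage_two_step(backbone, rpn, classifiers, optimizer, image,
                         priors_by_size, loss_fn):
    feature_map = backbone(image)
    proposal_sets = rpn(feature_map)            # P sets, grouped by size
    total_loss = 0.0
    for clf, proposals, priors in zip(classifiers, proposal_sets, priors_by_size):
        detections = clf(proposals)             # one classifier per size class
        total_loss = total_loss + loss_fn(detections, priors)
    optimizer.zero_grad()
    total_loss.backward()   # each classifier is excited only by its own term;
    optimizer.step()        # the shared backbone/RPN receive all P terms
```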
  • The excitation process for the next training image is similar to that for the training image obtained in S209. The main differences are: 1. the feature maps of the next training image are extracted by the backbone network using the model parameters of the convolution kernels of its convolutional layers as excited in S213 (if they were excited in S213); 2. the model parameters of the convolution kernels of the region proposal network and the region proposal parameters of the region proposal network are those excited in S213 (if they were excited in S213); 3. the parameters of the classifiers applied to the next training image are those excited in S213 (if they were excited in S213).
  • Each training image further excites the object detection model on top of the excitation produced by the previous training image.
  • After all the training images have been used in turn, the training process of the object detection model ends.
  • The object detection model can then be used in the inference state.
  • Referring to FIG. 14, another work flow of the detection model training device is introduced. Compared with the work flow shown in FIG. 13, the main difference is that S203 and S209 in the work flow shown in FIG. 13 are replaced by S203' and S209', respectively.
  • In S203' and S209', the backbone network extracts at least two feature maps and inputs them into the region proposal network for the region proposal network to select proposed regions, as sketched below.
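A minimal sketch of a backbone exposing two feature maps, assuming a PyTorch-style module; the layer depths, channel counts, and strides are illustrative assumptions rather than the patent's configuration.

```python
import torch
import torch.nn as nn

class TwoLevelBackbone(nn.Module):
    """Exposes feature maps from two convolutional layers for the RPN."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x: torch.Tensor):
        f1 = self.block1(x)   # finer map, smaller span: suits small objects
        f2 = self.block2(f1)  # coarser map, larger span: suits large objects
        return [f1, f2]       # both maps go to the region proposal network
```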
  • After all the training images have been used, the training process of the object detection model ends. As shown in FIG. 8, this object detection model can then be used in the inference state.
  • This application also provides a detection model training device 400.
  • The detection model training device 400 includes an object detection model 401, an excitation module 405, a storage module 406, and an initialization module 407.
  • The object detection model 401 further includes a backbone network 403, a classifier 404, and a region proposal network 402.
  • The classifier 404 includes one classifier in the first stage of the training state, and includes P classifiers in the second stage of the training state and in the inference state.
  • Each of the above modules may be a software module.
  • In the first stage of the training state, the initialization module 407 is configured to execute S201 and S202 to determine the replication parameter P.
  • The object detection model 401 obtains a training image from the storage module 406, and executes S203 (or S203') and S204 to establish the backbone network 403.
  • The region proposal network 402 executes S205.
  • The classifier 404 is configured to execute S206.
  • The excitation module 405 is configured to execute S207.
  • In the second stage of the training state, the initialization module 407 is used to execute S208, and the object detection model 401 obtains a training image from the storage module 406 and executes S209 (or S209') and S210 to establish the backbone network 403.
  • The region proposal network 402 executes S211.
  • The classifier 404 is used to execute S212.
  • The excitation module 405 is configured to execute S213.
  • The detection model training device 400 may be provided to users as an object detection model training service.
  • For example, the detection model training device 400 (or a part thereof) shown in FIG. 1 is deployed in a cloud environment.
  • The user selects the backbone network type and some of the system parameters, places the training images and the prior results of the training images into the storage module 406, and then starts the training.
  • The detection model training device 400 trains the object detection model 401.
  • The trained object detection model 401 is provided to the user, and the user may run the object detection model 401 in his or her terminal environment, or sell the object detection model 401 directly to a third party for use.
  • This application also provides a computing device 500.
  • The computing device 500 includes a bus 501, a processor 502, a communication interface 503, and a memory 504.
  • The processor 502, the memory 504, and the communication interface 503 communicate through the bus 501.
  • The processor may be a central processing unit (CPU).
  • The memory may include a volatile memory, such as a random access memory (RAM).
  • The memory may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, an HDD, or an SSD.
  • The memory stores executable code, and the processor executes the executable code to perform the foregoing method for training an object detection model.
  • The memory may also include other software modules required for running processes, such as an operating system.
  • The operating system may be LINUX™, UNIX™, WINDOWS™, or the like.
  • The memory of the computing device 500 stores the code corresponding to each module of the detection model training device 400, and the processor 502 executes this code to implement the functions of the modules of the detection model training device 400, that is, to perform the method shown in FIG. 13 or FIG. 14.
  • The computing device 500 may be a computing device in a cloud environment, a computing device in an edge environment, or a computing device in a terminal environment.
  • The parts of the detection model training device 400 may also run on multiple computing devices in different environments. Therefore, this application also proposes a computing device system.
  • The computing device system includes a plurality of computing devices 600.
  • The structure of each computing device 600 is the same as that of the computing device 500 in FIG. 16.
  • Communication paths are established between the computing devices 600 through a communication network.
  • Each computing device 600 runs any one or more of the region proposal network 402, the backbone network 403, the classifier 404, the excitation module 405, the storage module 406, and the initialization module 407.
  • Any computing device 600 may be a computing device in a cloud environment, a computing device in an edge environment, or a computing device in a terminal environment.
  • This application also proposes another computing device system.
  • In this computing device system, the storage module 406 is deployed in a cloud storage service (for example, an object storage service).
  • The user applies for storage space of a certain capacity in the cloud storage service to serve as the storage module 406, and stores the training images and the prior results of the training images in the storage module 406.
  • When a computing device 600 is running, it acquires the required training images and the prior results of the training images from the remote storage module 406 through the communication network.
  • Each computing device 600 runs any one or more of the region proposal network 402, the backbone network 403, the classifier 404, the excitation module 405, and the initialization module 407.
  • Any computing device 600 may be a computing device in a cloud environment, a computing device in an edge environment, or a computing device in a terminal environment.
  • This application also provides a computer program product; the computer program product includes one or more computer instructions.
  • The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
  • The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wire (for example, coaxial cable, optical fiber, or digital subscriber line) or wirelessly (for example, infrared, radio, or microwave).
  • The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or a data center that integrates one or more available media.
  • The available media may be magnetic media (for example, floppy disks, hard disks, or magnetic tapes), optical media (for example, DVDs), or semiconductor media (for example, SSDs).


Abstract

Disclosed is a method for training an object detection model, executable by a computing device. The method comprises: making at least two copies of a classifier that has undergone stage-1 training; in stage-2 training, using each copy of the classifier to detect objects of a different size; and training the object detection model according to the detection results. The method employs a two-stage training mode, such that the resulting object detection model exhibits higher accuracy in detecting the objects to be detected.

Description

Method, Device and Apparatus for Training an Object Detection Model

Technical Field

The present application relates to the field of computer technology, and in particular, to a method for training an object detection model, and to a device and a computing device for performing the method.

Background

Object detection is an artificial intelligence technology that accurately locates and classifies objects in images or videos. It covers many sub-fields, such as general object detection, face detection, pedestrian detection, and text detection. In recent years, academia and industry have invested actively and the algorithms have continued to mature. Deep-learning-based object detection solutions are now used in practical products in municipal security (pedestrian detection, vehicle detection, license plate detection, etc.), finance (object detection, face-scan login, etc.), the Internet (identity verification), and smart terminals.

At present, object detection is already widely applied in a variety of scenes of simple or medium complexity (for example, detecting faces in access control and checkpoint scenarios). In an open environment, how to keep the trained object detection model robust against unfavorable factors such as large variations in the size of the objects to be detected, occlusion, and distortion, and how to improve the detection accuracy, remain open problems.

Summary of the Invention
This application provides a method for training an object detection model, which improves the detection accuracy of the trained object detection model.

According to a first aspect, a method for training an object detection model performed by a computing device is provided. The computing device performing the method may be one or more computing devices distributed in the same environment or in different environments. The method includes:

acquiring a training image, and establishing a backbone network according to the training image;

inputting the feature map output by the backbone network into a region proposal network;

selecting, by the region proposal network, a plurality of proposed regions from the feature map output by the backbone network according to region proposal parameters, and inputting the sub-feature maps corresponding to the plurality of proposed regions into a classifier;

detecting, by the classifier, the objects to be detected in the training image according to the sub-feature maps corresponding to the plurality of proposed regions;

comparing the objects to be detected in the training image detected by the classifier with the prior result of the training image, and exciting, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of the classifier;

duplicating the classifier to obtain at least two classifiers;

dividing, by the region proposal network, the plurality of proposed regions into at least two proposal region sets, each proposal region set including at least one proposed region;

inputting, by the region proposal network, the sub-feature maps corresponding to the proposed regions included in each proposal region set into one of the at least two classifiers; and

performing, by each of the at least two classifiers, the following actions: detecting the objects to be detected in the training image according to the sub-feature maps corresponding to the proposed regions included in the acquired proposal region set; comparing the detected objects to be detected in the training image with the prior result of the training image; and exciting, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of this classifier.

Each of the at least two classifiers excites its own parameters according to its comparison result, and generally does not excite the parameters of the other classifiers among the at least two classifiers according to that comparison result.

In the method provided above, the training images are input into the object detection model twice to train the object detection model. In the first training stage, the sizes of the objects to be detected are not distinguished, so that the trained classifier has a global view. In the second training stage, each duplicated classifier is responsible for detecting the objects to be detected within one proposal region set, that is, for detecting objects to be detected of one size class, so that each trained classifier becomes more sensitive to objects to be detected of a particular size range. The two-stage training improves the detection accuracy of the trained object detection model for objects to be detected of different sizes.
In a possible implementation, the method further includes: acquiring system parameters, where the system parameters include at least one of the following: the number of size clusters of the objects to be detected in the training images, and the training computing capability; and determining, according to the system parameters, the number of classifiers among the at least two classifiers obtained after the duplication.

The number of duplicated classifiers may be configured manually, or may be calculated according to the objects to be detected in the training images. An appropriate choice of the number of duplicated classifiers further improves the detection accuracy of the trained object detection model for objects to be detected of different sizes.

In a possible implementation, in the case where the system parameters include the number of size clusters of the objects to be detected in the training images, acquiring the system parameters includes: clustering the sizes of the objects to be detected in the training images to obtain the number of size clusters of the objects to be detected in the training images.

In a possible implementation, the feature map output by the backbone network includes at least two feature maps.

The spans of different convolutional layers of the backbone network may differ, so the sizes of the objects to be detected within the proposed regions of the feature maps output by different convolutional layers may also differ. Extracting at least two feature maps from the backbone network enriches the sources of the proposed regions, and further improves the detection accuracy of the trained object detection model for objects to be detected of different sizes.
According to a second aspect, this application provides a detection model training device, including an initialization module, an object detection model, and an excitation module.

The object detection model is configured to: acquire a training image, and establish a backbone network according to the training image; select a plurality of proposed regions from the feature map output by the backbone network according to region proposal parameters, and input the sub-feature maps corresponding to the plurality of proposed regions into a classifier; and detect the objects to be detected in the training image according to the sub-feature maps corresponding to the plurality of proposed regions.

The excitation module is configured to compare the detected objects to be detected in the training image with the prior result of the training image, and excite, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of the classifier.

The initialization module is configured to duplicate the classifier to obtain at least two classifiers.

The object detection model is further configured to divide the plurality of proposed regions into at least two proposal region sets, each proposal region set including at least one proposed region, and to input the sub-feature maps corresponding to the proposed regions included in each proposal region set into one of the at least two classifiers. Each of the at least two classifiers performs the following actions: detecting the objects to be detected in the training image according to the sub-feature maps corresponding to the proposed regions included in the acquired proposal region set; comparing the detected objects to be detected in the training image with the prior result of the training image; and exciting, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of this classifier.

In a possible implementation, the initialization module is further configured to: acquire system parameters, where the system parameters include at least one of the following: the number of size clusters of the objects to be detected in the training images, and the training computing capability; and determine, according to the system parameters, the number of classifiers among the at least two classifiers obtained after the duplication.

In a possible implementation, the initialization module is further configured to cluster the sizes of the objects to be detected in the training images to obtain the number of size clusters of the objects to be detected in the training images.

In a possible implementation, the feature map output by the backbone network includes at least two feature maps.
According to a third aspect, this application provides a computing device system. The computing device system includes at least one computing device, and each computing device includes a processor and a memory. The processor of the at least one computing device is configured to access code in the memory to perform the method provided by the first aspect or any possible implementation of the first aspect.

According to a fourth aspect, this application provides a non-transitory readable storage medium. When the non-transitory readable storage medium is executed by at least one computing device, the at least one computing device performs the method provided by the foregoing first aspect or any possible implementation of the first aspect. A program is stored in the storage medium. The types of the storage medium include, but are not limited to, volatile memory (for example, random access memory) and non-volatile memory (for example, flash memory, hard disk drive (HDD), and solid state drive (SSD)).

According to a fifth aspect, this application provides a computing device program product. When the computing device program product is executed by at least one computing device, the at least one computing device performs the method provided by the foregoing first aspect or any possible implementation of the first aspect. The computer program product may be a software installation package: when the method provided by the foregoing first aspect or any possible implementation of the first aspect needs to be used, the computer program product may be downloaded and executed on a computing device.
According to a sixth aspect, this application provides another method for training an object detection model performed by a computing device. The method includes two training stages.

In the first training stage, a feature map of a training image is extracted through a backbone network; a proposed region is selected from the extracted feature map through a region proposal network, and the sub-feature map corresponding to the proposed region is input into a classifier; the classifier detects the objects to be detected in the training image according to the sub-feature map corresponding to the proposed region; the detection result is compared with the prior result of the training image; and at least one of the backbone network, the region proposal network, and the classifier is excited according to the comparison result.

In the second training stage, at least two duplicate classifiers are established according to the classifier that has undergone the first-stage training; the region proposal network divides the proposed regions into at least two proposal region sets, each proposal region set including at least one proposed region; the sub-feature maps corresponding to the proposed regions included in each proposal region set are input into one duplicate classifier; each duplicate classifier detects the objects to be detected in the training image according to the sub-feature maps it has acquired; the detection result is compared with the prior result of the training image; and at least one of the backbone network, the region proposal network, and the classifiers is excited again according to the comparison result.

In the second training stage, the classifier that has undergone the first-stage training may be duplicated to establish the at least two duplicate classifiers. Alternatively, the classifier that has undergone the first-stage training may first be adjusted and then duplicated to establish the at least two duplicate classifiers.
In a possible implementation, the method further includes: acquiring system parameters, where the system parameters include at least one of the following: the number of size clusters of the objects to be detected in the training images, and the training computing capability; and determining, according to the system parameters, the number of the established duplicate classifiers.

In a possible implementation, in the case where the system parameters include the number of size clusters of the objects to be detected in the training images, acquiring the system parameters includes: clustering the sizes of the objects to be detected in the training images to obtain the number of size clusters of the objects to be detected in the training images.

In a possible implementation, the feature maps extracted by the backbone network include at least two feature maps.

According to a seventh aspect, this application provides a computing device system. The computing device system includes at least one computing device, and each computing device includes a processor and a memory. The processor of the at least one computing device is configured to access code in the memory to perform the method provided by the sixth aspect or any possible implementation of the sixth aspect.

According to an eighth aspect, this application provides a non-transitory readable storage medium. When the non-transitory readable storage medium is executed by at least one computing device, the at least one computing device performs the method provided by the foregoing sixth aspect or any possible implementation of the sixth aspect. A program is stored in the storage medium. The types of the storage medium include, but are not limited to, volatile memory (for example, random access memory) and non-volatile memory (for example, flash memory, HDD, and SSD).

According to a ninth aspect, this application provides a computing device program product. When the computing device program product is executed by at least one computing device, the at least one computing device performs the method provided by the foregoing sixth aspect or any possible implementation of the sixth aspect. The computer program product may be a software installation package: when the method provided by the foregoing sixth aspect or any possible implementation of the sixth aspect needs to be used, the computer program product may be downloaded and executed on a computing device.
Brief Description of the Drawings

To describe the technical method of the embodiments of this application more clearly, the accompanying drawings used in the embodiments are briefly introduced below.

FIG. 1 is a schematic diagram of a system architecture provided by this application;

FIG. 2 is a schematic diagram of another system architecture provided by this application;

FIG. 3 is a working flowchart of the detection model training device provided by this application in the training state;

FIG. 4 is another working flowchart of the detection model training device provided by this application in the training state;

FIG. 5 is a working flowchart of the object detection model provided by this application in the inference state;

FIG. 6 is a working flowchart of the detection model training device provided by this application in the training state;

FIG. 7 is another working flowchart of the detection model training device provided by this application in the training state;

FIG. 8 is a working flowchart of the object detection model provided by this application in the inference state;

FIG. 9 is a schematic structural diagram of a convolutional layer and a convolution kernel provided by this application;

FIG. 10 is a schematic diagram of the receptive field of a convolutional layer provided by this application;

FIG. 11 is a schematic diagram of the receptive field of another convolutional layer provided by this application;

FIG. 12 is a working flowchart of the region proposal network provided by this application;

FIG. 13 is a schematic flowchart of a method provided by this application;

FIG. 14 is a schematic flowchart of another method provided by this application;

FIG. 15 is a schematic structural diagram of the detection model training device provided by this application;

FIG. 16 is a schematic structural diagram of a computing device provided by this application;

FIG. 17 is a schematic structural diagram of a computing device system provided by this application;

FIG. 18 is a schematic structural diagram of another computing device system provided by this application.
Detailed Description

The technical methods in the embodiments of this application are described below with reference to the accompanying drawings in the embodiments of this application.

In this application, there is no logical or temporal dependency among "first", "second", and "n-th".

As shown in FIG. 1, the method for training an object detection model provided by this application is performed by a detection model training device. The device may run in a cloud environment, specifically on one or more computing devices in the cloud environment. The device may also run in an edge environment, specifically on one or more computing devices (edge computing devices) in the edge environment. The device may also run in a terminal environment, specifically on one or more terminal devices in the terminal environment. A terminal device may be a mobile phone, a notebook, a server, a desktop computer, or the like. An edge computing device may be a server.

As shown in FIG. 2, the detection model training device may consist of multiple parts (modules), and the parts of the detection model training device may therefore be deployed separately in different environments. For example, some modules of the detection model training device may be deployed in each of the cloud, edge, and terminal environments, or in any two of them.

FIG. 3 to FIG. 5 and FIG. 6 to FIG. 8 respectively illustrate two work flows of the detection model training device. In each work flow, the training state of the detection model training device is divided into two stages.
In FIG. 3, the detection model training device works in the first stage of the training state. The purpose of the training state is to use the training images and the prior results of the training images to train an object detection model with high accuracy. The prior result of a training image includes the labels of the objects to be detected in the training image. Taking the training image in FIG. 3 as an example, the training image includes multiple faces, and in the prior result of the training image each face is marked with a white box (upper left corner of FIG. 3). The prior results of the training images can generally be provided manually.
A K-layer backbone network is established according to the training image, where the backbone network includes K convolutional layers and K is a positive integer greater than 0. The backbone network extracts a feature map from the training image. The feature map extracted by the backbone network is input into a region proposal network; the region proposal network selects proposed regions from the feature map, and the sub-feature maps corresponding to the proposed regions are input into the classifier. In the process of selecting the proposed regions from the feature map, the region proposal network may directly compare the prior result of the training image with the feature map, and take the regions of the feature map with a high coverage of the objects to be detected in the training image as the proposed regions (a sketch of such an overlap test follows below). Alternatively, the region proposal network may first identify foreground regions and background regions in the feature map, and then extract the proposed regions from the foreground regions. A foreground region is a region that contains a large amount of information and has a high probability of containing an object to be detected; a background region is a region that contains little information, contains much repeated information, and has a low probability of containing an object to be detected.
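As a hedged illustration of the overlap comparison just mentioned, the function below computes the intersection-over-union (IoU) between a candidate region and a labelled box from the prior result. The (x1, y1, x2, y2) box format is an assumption; the patent does not specify one.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)  # intersection corners
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

# Candidate regions whose IoU with any labelled box exceeds a chosen
# threshold (an assumption, e.g. 0.5) can be kept as proposed regions.
```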
Each sub-feature map includes the features of the feature map that are located within a proposed region. The classifier determines, according to a sub-feature map, whether the region of the training image corresponding to the proposed region of that sub-feature map contains an object to be detected. As shown on the right side of FIG. 3, the classifier marks the detected face regions with white boxes on the training image. By comparing the detection result of the training image with the prior result of the training image, the difference between the objects to be detected detected by the detection model training device and the prior result can be obtained. As shown in FIG. 3, some faces in the prior result were not detected by the detection model training device. The parameters of the object detection model are excited according to this difference, including at least one of the following: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the classifier. The difference between the detection result of each training image and the prior result of that training image excites the parameters of the object detection model, so after the excitation by a large number of training images, the accuracy of the object detection model improves.

The detection model training device trains the object detection model with a large number of training images and their prior results; the object detection model includes the backbone network, the region proposal network, and the classifier. The object detection model that has undergone the first stage of the training mode enters the second stage of the training mode, as shown in FIG. 4.

In the second stage of the training mode, the classifier that has undergone the first stage in FIG. 3 is first duplicated into P copies. A training image is input into the backbone network, and the feature map extracted by the backbone network is input into the region proposal network. The region proposal network selects proposed regions from the feature map and aggregates the selected proposed regions into P proposal region sets according to the sizes of the proposed regions; the proposed regions within each proposal region set are of similar size. The sub-feature maps corresponding to the P proposal region sets are input into the P classifiers respectively: one proposal region set corresponds to one classifier, and the sub-feature maps corresponding to the proposed regions of that proposal region set are input into that classifier. Each classifier detects objects to be detected of a different size in the training image according to the sub-feature maps it receives, and obtains a corresponding detection result. The detection result of each classifier is compared with the prior result for the objects to be detected of the sizes corresponding to the sub-feature maps received by that classifier. The difference between the detection result for the objects to be detected of each size and the prior result for the objects of that size excites the parameters of the object detection model. In particular, each classifier will be trained to be more sensitive to detected objects of a particular size, so after the second round of excitation by a large number of training images, the accuracy of the object detection model is further improved. In FIG. 4, the prior results of the objects to be detected in the training image are divided into P classes according to size, to be compared with the results of the P classifiers respectively.

As shown in FIG. 4, P equals 2; that is, the region proposal network divides the selected proposed regions into two proposal region sets, where the proposed regions in one proposal region set (corresponding to the upper classifier) are smaller and the proposed regions in the other proposal region set (corresponding to the lower classifier) are larger. Therefore, the sub-feature maps corresponding to the proposed regions in the former proposal region set are used to detect the smaller objects to be detected in the training image, and the sub-feature maps corresponding to the proposed regions in the latter proposal region set are used to detect the larger objects to be detected in the training image. The two proposal region sets are input into different classifiers: the upper classifier is used to detect the smaller objects to be detected, the lower classifier is used to detect the larger objects to be detected, and the detection results output by the two classifiers are compared with the corresponding prior results respectively. For example, detection result 1 includes the objects to be detected detected by the upper classifier according to the sub-feature maps corresponding to the smaller proposed regions, and prior result 1 of the training image includes the prior result (sizes, coordinates, etc.) of the smaller objects to be detected in the training image. Detection result 1 is compared with prior result 1, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the upper classifier. Similarly, detection result 2 includes the objects to be detected detected by the lower classifier according to the sub-feature maps corresponding to the larger proposed regions, and prior result 2 of the training image includes the prior result (sizes, coordinates, etc.) of the larger objects to be detected in the training image. Detection result 2 is compared with prior result 2, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the lower classifier.
It should be noted that the training images used in the first stage and the second stage may be the same, may be different, or may partially overlap. Preset thresholds may be used to distinguish the different proposal region sets: when P proposal region sets need to be distinguished, P-1 thresholds are preset, each threshold corresponding to a proposed-region size, and these P-1 thresholds are used to aggregate the proposed regions selected by the region proposal network into P proposal region sets. Correspondingly, according to the sizes of the objects to be detected in the training image, the objects to be detected in the training image are divided into P prior results, and each prior result is compared with the detection result of the corresponding size to excite the object detection model; a sketch of this split follows below.
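The snippet below illustrates this split of the prior results into P size groups with the same P-1 thresholds; the sqrt-of-area size measure and the (x1, y1, x2, y2) box format are assumptions.

```python
def split_priors_by_size(boxes, thresholds):
    """boxes: labelled (x1, y1, x2, y2) boxes; thresholds: P-1 ascending values."""
    groups = [[] for _ in range(len(thresholds) + 1)]
    for box in boxes:
        size = ((box[2] - box[0]) * (box[3] - box[1])) ** 0.5  # sqrt of area
        groups[sum(size > t for t in thresholds)].append(box)
    return groups  # group i is compared with classifier i's detection result
```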
The object detection model trained in the second stage can be deployed in a cloud environment, an edge environment, or a terminal environment. Alternatively, parts of the object detection model can be deployed in all three of the cloud, edge, and terminal environments, or in any two of them.
As shown in FIG. 5, in the inference state, the image to be detected is input into the backbone network of the object detection model, and after processing by the region proposal network and the P classifiers, the object detection model outputs the detection result of the image to be detected. Typically, the detection result includes information such as the positions and the number of the detected objects, for example how many faces there are and where each face appears. In the inference state, the region proposal network behaves as in the second stage of the training state: it classifies the extracted proposed regions by size, and the sub-feature map corresponding to each proposed region is sent to the classifier corresponding to that proposed region. Each classifier detects objects to be detected of a different size according to the sub-feature maps of the proposed regions of that size, and the detection results of the P classifiers are combined to obtain the detection result of the image to be detected, as sketched below.
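A minimal sketch of this inference path, with all names as placeholders: the P classifiers each handle one size group of proposals, and their detections are merged into the final result.

```python
def detect(image, backbone, rpn, classifiers):
    feature_map = backbone(image)
    proposal_sets = rpn(feature_map)           # P sets of proposals, grouped by size
    detections = []
    for clf, proposals in zip(classifiers, proposal_sets):
        detections.extend(clf(proposals))      # each classifier detects one size class
    return detections                          # combined result for the whole image
```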
FIG. 6 to FIG. 8 illustrate another work flow of the detection model training device. Compared with the detection model training device described in FIG. 3 to FIG. 5, the detection model training device described in FIG. 6 to FIG. 8 takes the feature maps extracted from at least two convolutional layers of the backbone network as the input of the region proposal network, in both the training state and the inference state.

In FIG. 6, the detection model training device works in the first stage of the training state. A K-layer backbone network is established according to the training image; the backbone network includes K convolutional layers, and K is a positive integer greater than 0. The backbone network extracts p feature maps from the training image. The p feature maps may be extracted from any p convolutional layers of the backbone network, or may be any p convolutional layers of the backbone network themselves. The p feature maps extracted by the backbone network are input into the region proposal network, the region proposal network selects proposed regions from the p feature maps, and the sub-feature maps corresponding to the proposed regions are input into the classifier. Each sub-feature map includes the features of a feature map that are located within a proposed region. The classifier determines, according to a sub-feature map, whether the region of the training image corresponding to the proposed region of that sub-feature map contains an object to be detected.

As shown on the right side of FIG. 6, the classifier marks the detected face regions with white boxes on the training image. By comparing the detection result of the training image with the prior result of the training image, the difference between the objects to be detected detected by the detection model training device and the prior result can be obtained. As shown in FIG. 6, some faces in the prior result were not detected by the detection model training device. The parameters of the object detection model are excited according to this difference, including at least one of the following: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the classifier. The difference between the detection result of each training image and the prior result of that training image excites the parameters of the object detection model, so after the excitation by a large number of training images, the accuracy of the object detection model improves.

The detection model training device trains the object detection model with a large number of training images and their prior results; the object detection model includes the backbone network, the region proposal network, and the classifier. The object detection model that has undergone the first stage of the training mode enters the second stage of the training mode, as shown in FIG. 7.

In the second stage of the training mode, the classifier that has undergone the first stage in FIG. 6 is first duplicated into P copies. A training image is input into the backbone network, and at least one feature map extracted by the backbone network is input into the region proposal network. The region proposal network selects proposed regions from the feature maps and aggregates the selected proposed regions into P proposal region sets. The proposal region set to which a proposed region belongs is determined according to the size of the proposed region and the span of the convolutional layer corresponding to the feature map in which the proposed region is located. The sub-feature maps corresponding to the proposed regions in the P proposal region sets are input into the P classifiers respectively: one proposal region set corresponds to one classifier, and the sub-feature maps corresponding to the proposed regions of that proposal region set are input into that classifier. Each classifier detects objects to be detected of a different size according to the sub-feature maps it receives, and obtains a corresponding detection result. The detection result of each classifier is compared with the prior result for the objects to be detected of the sizes corresponding to the sub-feature maps received by that classifier. The difference between the detection result for the objects to be detected of each size and the prior result for the objects of that size excites the parameters of the object detection model. In particular, each classifier will be trained to be more sensitive to detected objects of a particular size, so after the second round of excitation by a large number of training images, the accuracy of the object detection model is further improved. In FIG. 7, the prior results of the objects to be detected in the training image are divided into P classes according to size, to be compared with the results of the P classifiers respectively.
As shown in FIG. 7, P equals 2; that is, the region proposal network divides the selected proposed regions into two proposal region sets. For the proposed regions in one proposal region set (corresponding to the upper classifier), the product of the size of the proposed region and the span of the convolutional layer corresponding to the feature map in which the proposed region is located is smaller; for the proposed regions in the other proposal region set (corresponding to the lower classifier), that product is larger. Therefore, the sub-feature maps corresponding to the proposed regions in the former proposal region set are used to detect the smaller objects to be detected in the training image, and the sub-feature maps corresponding to the proposed regions in the latter proposal region set are used to detect the larger objects to be detected in the training image. The two proposal region sets are input into different classifiers: the upper classifier is used to detect the smaller objects to be detected, the lower classifier is used to detect the larger objects to be detected, and the detection results output by the two classifiers are compared with the corresponding prior results respectively. For example, detection result 1 includes the objects to be detected detected by the upper classifier according to the sub-feature maps corresponding to the smaller proposed regions, and prior result 1 of the training image includes the prior result (sizes, coordinates, etc.) of the smaller objects to be detected in the training image. Detection result 1 is compared with prior result 1, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the upper classifier. Similarly, detection result 2 includes the objects to be detected detected by the lower classifier according to the sub-feature maps corresponding to the larger proposed regions, and prior result 2 of the training image includes the prior result (sizes, coordinates, etc.) of the larger objects to be detected in the training image. Detection result 2 is compared with prior result 2, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the lower classifier.
It should be noted that the training images used in the first stage and the second stage may be the same, may be different, or may partially overlap. Preset thresholds may be used to distinguish the different proposal region sets: for example, when P proposal region sets need to be distinguished, P-1 thresholds are preset, each threshold corresponding to a proposal region size, and these P-1 thresholds are used to group the proposal regions selected by the region proposal network into P proposal region sets. Correspondingly, according to the sizes of the to-be-detected objects in the training image, the to-be-detected objects are divided into P prior results, and each prior result is compared with the detection result of the corresponding size to excite the object detection model.
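For illustration, the threshold-based grouping can be sketched in Python as follows (a minimal sketch; the (width, height) proposal representation and the square-root size measure are assumptions, not the patent's prescribed form):

```python
from bisect import bisect_right

def group_proposals_by_size(proposals, thresholds):
    """Group proposals into P = len(thresholds) + 1 proposal region sets.

    `proposals` is a list of (width, height) pairs measured in features on
    the feature map; `thresholds` is the sorted list of P-1 preset size
    cutoffs.
    """
    sets = [[] for _ in range(len(thresholds) + 1)]
    for w, h in proposals:
        size = (w * h) ** 0.5            # scalar size of the proposal region
        sets[bisect_right(thresholds, size)].append((w, h))
    return sets

# P = 2: one threshold splits the proposals into a "small" and a "large" set.
small, large = group_proposals_by_size([(8, 6), (40, 32)], thresholds=[16.0])
```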
The object detection model trained in the second stage can be deployed in a cloud environment, an edge environment, or a terminal environment. Alternatively, parts of the object detection model can be deployed across all three of, or any two of, the cloud environment, the edge environment, and the terminal environment.
As shown in FIG. 8, in the inference state, the to-be-detected image is input into the backbone network of the object detection model, and after processing by the region proposal network and the P classifiers, the object detection model outputs the detection result of the to-be-detected image. Typically, the detection result includes information such as the positions and number of the detected to-be-detected objects, for example how many human faces there are and where each face appears. In the inference state, the region proposal network works as in the second stage of the training state: the selected proposal regions are classified by size, and the sub-feature maps corresponding to the proposal regions are sent to the corresponding classifiers. Each classifier detects to-be-detected objects of a different size range from the sub-feature maps of the proposal regions of the corresponding size, and the detection result of the to-be-detected image is obtained by combining the detection results of the P classifiers.
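The inference-state flow can be sketched as follows (all components are caller-supplied callables whose interfaces are assumptions made for illustration):

```python
from bisect import bisect_right

def run_inference(image, backbone, rpn, classifiers, thresholds):
    """Sketch of the inference state of FIG. 8.

    `rpn` is assumed to return (sub_feature_map, size) pairs, where `size`
    already reflects the stride of the source feature map; `classifiers`
    holds the P trained classifiers.
    """
    feature_maps = backbone(image)
    detections = []
    for sub_map, size in rpn(feature_maps):
        k = bisect_right(thresholds, size)   # route to the matching classifier
        detections.extend(classifiers[k](sub_map))
    return detections                        # combined result of the P classifiers
```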
The concepts used in this application are described below.
Backbone network
The backbone network includes a convolutional network, and the convolutional network includes K convolutional layers. The K convolutional layers of the backbone network typically form multiple convolutional blocks, each convolutional block including multiple convolutional layers; the number of convolutional blocks in a backbone network is commonly five. In addition to the convolutional network, the backbone network may also include pooling modules. Optionally, the backbone network may adopt a template commonly used in the industry, such as VGG, ResNet, DenseNet, Xception, Inception, or MobileNet.
The features extracted from the training image form the 1st convolutional layer of the backbone network. The features extracted from the 1st convolutional layer of the backbone network by the convolution kernels corresponding to the 1st convolutional layer form the 2nd convolutional layer of the backbone network. The features extracted from the 2nd convolutional layer of the backbone network by the convolution kernels corresponding to the 2nd convolutional layer form the 3rd convolutional layer of the backbone network. By analogy, the features extracted from the (k-1)th convolutional layer of the backbone network by the convolution kernels corresponding to the (k-1)th convolutional layer form the kth convolutional layer of the backbone network, where k is greater than or equal to 1 and less than or equal to K. In the detection model training apparatus corresponding to FIG. 3 to FIG. 5, the feature map extracted from the Kth convolutional layer of the backbone network by the convolution kernels corresponding to the Kth convolutional layer forms the input of the region proposal network; alternatively, the Kth convolutional layer of the backbone network may be used directly as the feature map input into the region proposal network. In the detection model training apparatus corresponding to FIG. 6 to FIG. 8, the feature maps extracted from the kth convolutional layer of the backbone network by the convolution kernels corresponding to the kth convolutional layer form the input of the region proposal network; alternatively, the kth convolutional layer of the backbone network may be used directly as a feature map input into the region proposal network. The region proposal network includes L convolutional layers, where L is an integer greater than 0. Similar to the backbone network, the features extracted from the (k'-1)th convolutional layer of the region proposal network by the convolution kernels corresponding to the (k'-1)th convolutional layer form the k'th convolutional layer of the region proposal network, where k' is greater than or equal to 1 and less than or equal to L-1.
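As a sketch of this layer chain (the `convolve` helper is a caller-supplied stand-in for the per-kernel sliding convolution detailed in the sections below):

```python
def build_backbone_layers(image_features, kernels_per_layer, convolve):
    """The features extracted from layer k-1 by its convolution kernels
    form layer k (minimal sketch; interfaces are assumptions)."""
    layers = [image_features]                         # the 1st convolutional layer
    for kernels in kernels_per_layer:                 # kernels of layer k-1
        layers.append(convolve(layers[-1], kernels))  # features form layer k
    return layers                                     # layers[-1] feeds the RPN
```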
Convolutional layers and convolution kernels
Both the backbone network and the region proposal network include at least one convolutional layer. As shown in FIG. 9, the size of convolutional layer 101 is X*Y*N1; that is, convolutional layer 101 includes X*Y*N1 features, where N1 is the number of channels, one channel is one feature dimension, and X*Y is the number of features included in each channel. X, Y, and N1 are all positive integers greater than 0.
Convolution kernel 1011 is one of the convolution kernels applied to convolutional layer 101. Since convolutional layer 102 includes N2 channels, N2 convolution kernels in total are applied to convolutional layer 101; the sizes and model parameters of these N2 convolution kernels may be the same or different. Taking convolution kernel 1011 as an example, its size is X1*X1*N1; that is, convolution kernel 1011 includes X1*X1*N1 model parameters. The initial model parameters of a convolution kernel may adopt a model parameter template commonly used in the industry. Convolution kernel 1011 slides within convolutional layer 101; when it slides to a certain position of convolutional layer 101, the model parameters of convolution kernel 1011 are multiplied by the features of convolutional layer 101 at the corresponding positions. The products of the individual model parameters of convolution kernel 1011 and the features of convolutional layer 101 at the corresponding positions are combined to obtain one feature on one channel of convolutional layer 102. The product results of the features of convolutional layer 101 and convolution kernel 1011 may be used directly as features of convolutional layer 102. Alternatively, after convolution kernel 1011 has finished sliding over convolutional layer 101 and all product results have been output, all product results may be normalized, and the normalized product results used as the features of convolutional layer 102.
Figuratively speaking, convolution kernel 1011 slides over convolutional layer 101 performing convolution, and the convolution results form one channel of convolutional layer 102. Each convolution kernel applied to convolutional layer 101 corresponds to one channel of convolutional layer 102; therefore, the number of channels of convolutional layer 102 equals the number of convolution kernels applied to convolutional layer 101. The design of the model parameters within each convolution kernel reflects the characteristics of the features that the kernel is intended to extract from the convolutional layer. Through the N2 convolution kernels, features of N2 channels are extracted from convolutional layer 101.
As shown in FIG. 9, convolution kernel 1011 can be split apart: it includes N1 convolution slices, and each convolution slice includes X1*X1 model parameters (P11 to PX1X1). Each model parameter corresponds to one convolution point. The model parameter of a convolution point is multiplied by the feature at the corresponding position in the convolutional layer to obtain the convolution result of that convolution point; the sum of the convolution results of all convolution points of a convolution kernel is the convolution result of that kernel.
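A minimal numeric sketch of one kernel application, assuming illustrative shapes for FIG. 9 (the values are placeholders):

```python
import numpy as np

X, Y, N1 = 8, 8, 3                        # convolutional layer 101: X*Y*N1 features
X1 = 3                                    # kernel 1011: X1*X1*N1 model parameters
layer_101 = np.random.rand(X, Y, N1)
kernel_1011 = np.random.rand(X1, X1, N1)  # N1 convolution slices of X1*X1 parameters

def conv_at(layer, kernel, x, y):
    """One application of the kernel: every model parameter (convolution
    point) is multiplied by the feature at its corresponding position, and
    the results are summed into one feature of the next layer."""
    patch = layer[x:x + kernel.shape[0], y:y + kernel.shape[1], :]
    return float(np.sum(patch * kernel))

feature = conv_at(layer_101, kernel_1011, 0, 0)  # one feature of one channel of layer 102
```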
Convolution kernel sliding span
The sliding span of a convolution kernel is the number of features the kernel crosses in each slide over the convolutional layer. After the convolution kernel finishes the convolution at its current position in the current convolutional layer, forming one feature of the next convolutional layer, the kernel slides V features from its current position and convolves its model parameters with the features of the convolutional layer at the new position; V is the sliding span of the convolution kernel.
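Continuing the previous sketch, the sliding span V determines which positions the kernel visits and therefore the size of the resulting channel:

```python
def conv_channel(layer, kernel, v):
    """Slide the kernel over the layer with sliding span v (no padding);
    the collected results form one channel of the next convolutional layer."""
    xs = range(0, layer.shape[0] - kernel.shape[0] + 1, v)
    ys = range(0, layer.shape[1] - kernel.shape[1] + 1, v)
    return np.array([[conv_at(layer, kernel, x, y) for y in ys] for x in xs])

channel = conv_channel(layer_101, kernel_1011, v=2)  # shape (3, 3) for the shapes above
```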
Receptive field
The receptive field is the perception range, on the input image, of one feature of a convolutional layer; if the pixels within this range change, the value of the feature changes accordingly. As shown in FIG. 10, a convolution kernel slides over the input image, and the extracted features constitute convolutional layer 101. Similarly, a convolution kernel slides over convolutional layer 101, and the extracted features constitute convolutional layer 102. Each feature in convolutional layer 101 is extracted from the pixels of the input image within the size of a convolution slice of the kernel sliding over the input image; this size is the receptive field of convolutional layer 101, as shown in FIG. 10.
Correspondingly, the range on the input image to which each feature of convolutional layer 102 maps (that is, how large a range of pixels of the input image it draws on) is the receptive field of convolutional layer 102. As shown in FIG. 11, each feature in convolutional layer 102 is extracted from the pixels of the input image within the size of a convolution slice of the kernel sliding over convolutional layer 101, and each feature of convolutional layer 101 is in turn extracted from the pixels of the input image within a convolution slice of the kernel sliding over the input image. Therefore, the receptive field of convolutional layer 102 is larger than that of convolutional layer 101. If a backbone network includes multiple convolutional layers, the receptive field of the last of these convolutional layers is the receptive field of the backbone network.
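The growth of the receptive field across stacked layers follows a standard recurrence, sketched below (padding is ignored; the recurrence is textbook material rather than the patent's formula):

```python
def receptive_field(kernel_sizes, slide_spans):
    """Receptive field of the last layer on the input image, using the
    recurrence r_k = r_{k-1} + (f_k - 1) * j_{k-1}, where j is the
    accumulated sliding span."""
    r, jump = 1, 1
    for f, v in zip(kernel_sizes, slide_spans):
        r += (f - 1) * jump
        jump *= v
    return r

# Two stacked 3x3 convolutions with span 1: each feature of layer 102
# "sees" a 5x5 pixel area of the input image, larger than layer 101's 3x3.
assert receptive_field([3, 3], [1, 1]) == 5
```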
Training computing capability
The training computing capability is the computing capability available to the detection model training apparatus in the environment where it is deployed, including at least one of the following: processor frequency, processor occupancy, memory size, memory occupancy, cache utilization, cache size, image processor frequency, image processor occupancy, and other computing resource parameters. When the parts of the detection model training apparatus are deployed in multiple environments, the training computing capability can be obtained by aggregating the computing capabilities available to the apparatus across these environments.
Classifier
A classifier includes functions formed from a series of parameters; based on the input features and these functions, the classifier detects information such as the positions and number of the to-be-detected objects in the to-be-detected image. Common classifiers include the Softmax classifier and the Sigmoid classifier.
Stride
In general, the size of the (k+1)th convolutional layer of the backbone network is less than or equal to the size of the kth convolutional layer. The stride of the kth convolutional layer of the backbone network is the ratio of the size of the image input into the backbone network to the size of the kth convolutional layer; the input image may be a training image or a to-be-detected image. The stride of the kth convolutional layer generally depends on how many pooling layers there are between the 1st and the kth convolutional layers of the backbone network, and on the sliding spans of the convolution kernels of the convolutional layers between the 1st and the kth convolutional layers. The more pooling layers there are between the 1st and the kth convolutional layers, and the larger the sliding spans of the convolution kernels used by those convolutional layers, the larger the stride of the kth convolutional layer.
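A sketch of how the stride of the kth layer accumulates from the preceding sliding spans and pooling layers (the multiplicative accumulation is the standard relationship implied above):

```python
def layer_stride(slide_spans, pool_factors):
    """Stride of the kth convolutional layer: the ratio of input-image size
    to layer size, accumulated from the sliding spans of the preceding
    convolutions and the downsampling factors of the preceding pooling layers."""
    stride = 1
    for v in slide_spans:
        stride *= v
    for p in pool_factors:
        stride *= p
    return stride

# Three span-1 convolutions with two 2x pooling layers between them give
# the kth layer a stride of 4: the input image is 4x the layer's size.
assert layer_stride([1, 1, 1], [2, 2]) == 4
```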
Region proposal network, region proposal parameters, proposal regions, and proposal region sets
As shown in FIG. 12, the region proposal network determines multiple proposal regions on a feature map according to the region proposal parameters. The region proposal parameters may include the length and width of a proposal region; the sizes of different proposal regions generally differ.
In the object detection models corresponding to FIG. 3 and FIG. 6, the region proposal network first obtains multiple proposal regions according to the region proposal parameters, and computes, using the convolution kernels corresponding to the L convolutional layers, the confidence of each of these proposal regions, that is, the likelihood that the region of the training image corresponding to each proposal region includes a to-be-detected object. The sub-feature maps corresponding to the proposal regions whose confidence exceeds a certain threshold, or to a certain number of proposal regions with the highest confidence, are then selected and input into the classifier.
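The confidence-based selection can be sketched as follows (the tuple representation of a proposal region is an assumption):

```python
def select_proposals(proposals, confidences, threshold=None, top_n=None):
    """Keep the proposal regions whose confidence exceeds a threshold, or
    the top_n most confident ones (sketch of the selection described above)."""
    scored = sorted(zip(confidences, proposals), key=lambda cp: cp[0], reverse=True)
    if threshold is not None:
        scored = [(c, p) for c, p in scored if c > threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return [p for _, p in scored]

kept = select_proposals([(0, 0, 4, 4), (2, 2, 6, 6)], [0.9, 0.3], threshold=0.5)
```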
In the object detection models corresponding to FIG. 4 and FIG. 5, after the region proposal network obtains multiple proposal regions, for example proposal regions 1-4, it can group these proposal regions into P proposal region sets according to the size of each proposal region (the number of features the proposal region covers). The region proposal network then inputs the sub-feature maps corresponding to the proposal regions in one proposal region set into one classifier. The size of a proposal region is related to the size of the to-be-detected object; therefore the proposal regions are grouped into proposal region sets by size, different classifiers detect the proposal regions in different proposal region sets, and the classifiers are excited according to the detection results, which makes different classifiers more sensitive to to-be-detected objects of different sizes.
In the object detection models corresponding to FIG. 7 and FIG. 8, different convolutional layers of the backbone network are input into the region proposal network as feature maps, and the strides of different convolutional layers may differ; therefore, proposal regions of the same size on convolutional layers of different strides correspond to to-be-detected objects of different sizes in the training image. For proposal regions of the same size, a proposal region on a convolutional layer with a larger stride indicates a larger to-be-detected object, and a proposal region on a convolutional layer with a smaller stride indicates a smaller to-be-detected object. Therefore, in the object detection models corresponding to FIG. 6 to FIG. 8, after obtaining proposal regions from the different feature maps, the region proposal network jointly considers the size of each proposal region and the stride of the convolutional layer corresponding to the feature map in which the proposal region is located, and groups the proposal regions obtained from the different feature maps into P proposal region sets accordingly. The region proposal network then inputs the sub-feature maps corresponding to the proposal regions in one proposal region set into one classifier. Commonly, the region proposal network uses the product of the size of each proposal region and the stride of the convolutional layer corresponding to the feature map in which the proposal region is located as the grouping criterion: for example, after obtaining T proposal regions from the different feature maps, it computes for each proposal region the product of its size and the stride of the convolutional layer corresponding to the feature map in which it is located. According to these T products, the T proposal regions are grouped into P proposal region sets, for example by comparing each of the T products with P-1 preset thresholds to determine into which proposal region set the corresponding proposal region falls.
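The size-times-stride grouping criterion can be sketched as follows (representing each proposal as a (size, stride) pair is an assumption):

```python
from bisect import bisect_right

def group_by_size_and_stride(proposals, thresholds):
    """Group T proposals taken from different feature maps into P sets by
    the product of proposal size and source-layer stride."""
    sets = [[] for _ in range(len(thresholds) + 1)]
    for size, stride in proposals:
        sets[bisect_right(thresholds, size * stride)].append((size, stride))
    return sets

# A small proposal on a stride-16 layer and a large proposal on a stride-2
# layer both land in the "large object" set: products 64 and 64 versus 16.
sets = group_by_size_and_stride([(4, 16), (32, 2), (8, 2)], thresholds=[48])
```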
FIG. 13 and FIG. 14 describe the workflows of the detection model training apparatuses corresponding to FIG. 3 to FIG. 5 and to FIG. 6 to FIG. 8, respectively.
As shown in FIG. 13, the workflow of the detection model training apparatus is described below.
S201: Obtain at least one of the following system parameters: the number of size clusters of to-be-detected objects; the training computing capability.
The number of size clusters of to-be-detected objects is the number of sets into which the sizes of the to-be-detected objects can be clustered. For example, when the number of size clusters of to-be-detected objects is 2, the sizes of the to-be-detected objects can be divided into two sets.
The number of size clusters of to-be-detected objects can be obtained as the number of clusters produced by clustering the sizes of the to-be-detected objects in the training images with a clustering algorithm, such as K-means. Alternatively, the number of size clusters of to-be-detected objects and the complexity of the to-be-detected objects may be input manually into the detection model training apparatus.
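A sketch of obtaining the number of size clusters with K-means, assuming scalar object sizes and the availability of scikit-learn; selecting k by silhouette score is an illustrative assumption, since the patent leaves the choice of clustering algorithm open:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def size_cluster_count(sizes, k_max=5):
    """Cluster the sizes of the to-be-detected objects and return the
    number of clusters that scores best."""
    x = np.asarray(sizes, dtype=float).reshape(-1, 1)
    best_k, best_score = 2, -1.0
    for k in range(2, min(k_max, len(sizes) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(x)
        score = silhouette_score(x, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```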
The above system parameters refer to parameters of the training images, of the to-be-detected objects in the training images, of the backbone network, or of the training environment; such system parameters can be obtained before the object detection model is built. System parameters are also called hyperparameters, and different system parameters may lead to different replication parameters. Model parameters refer to the parameters corresponding to the convolution points within the convolution kernels; model parameters are continuously excited and changed during the training of the object detection model.
The above system parameters may be obtained in several batches and need not all be obtained in the same step. Nor must all of them be obtained: which system parameters are obtained depends on which are needed in the subsequent step of determining the replication parameter. Each system parameter may be obtained at any time before the step that uses it.
S202: Determine the replication parameter P according to the system parameters obtained in S201.
Specifically, a function P = f(system parameters) for computing the replication parameter P may be preset, where the arguments of f are the system parameters obtained in S201.
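One possible shape for the preset function f, sketched under the assumption that P follows the number of size clusters but is capped by the training computing capability (the cap heuristic is not from the patent):

```python
def replication_parameter(num_size_clusters, compute_budget, cost_per_classifier):
    """One possible preset f: follow the number of size clusters, capped by
    how many classifiers the training computing capability can afford."""
    affordable = max(1, compute_budget // cost_per_classifier)
    return max(1, min(num_size_clusters, affordable))

P = replication_parameter(num_size_clusters=3, compute_budget=8, cost_per_classifier=2)  # P = 3
```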
S202 may be executed at any time after S201 and before S208.
S203: Obtain a training image, build the backbone network according to the training image, and obtain the feature map output by the backbone network.
S204: Input the feature map output by the backbone network into the region proposal network.
The feature map output by the backbone network in S204 consists of the features in the Kth convolutional layer of the backbone network, or the features extracted from the Kth convolutional layer by its convolution kernels.
S205: The region proposal network selects proposal regions from the feature map and inputs the sub-feature maps corresponding to the proposal regions into the classifier.
S206: The classifier detects the to-be-detected objects in the training image according to the sub-feature maps input in S205.
Parameters are set in the classifier; the classifier detects the to-be-detected objects in the training image according to these parameters and the input features.
S207: Compare the to-be-detected objects detected in the training image in S206 with the prior results of the training image, and excite at least one of the following parameters according to the comparison result: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the classifier.
After S207, the excitation of the object detection model by the training image obtained in S203 is complete. The detection model training apparatus obtains the next training image and trains the object detection model according to the next training image and its prior results.
The excitation process for the next training image is similar to that for the training image obtained in S203. The main differences are: 1. when the backbone network extracts the feature map of the next training image, the model parameters of the convolution kernels of the convolutional layers of the backbone network are those excited in S207 (if they were excited in S207); 2. after the backbone network extracts the feature map of the next training image, the model parameters of the convolution kernels of the region proposal network into which the feature map is input, and the region proposal parameters of the region proposal network, are those excited in S207 (if they were excited in S207); 3. the parameters of the classifier through which the next training image passes are those excited in S207 (if they were excited in S207).
By analogy, each training image further excites the object detection model on the basis of the excitation performed by the previous training images. After all training images have been used in turn to train the object detection model, the first stage of the training state of the object detection model ends.
S208: Copy the classifier that has been through the first stage of the training state, obtaining P classifiers.
S209: Obtain a training image, build the backbone network according to the training image, and obtain the feature map output by the backbone network.
S210: Input the feature map output by the backbone network into the region proposal network.
S211: The region proposal network selects multiple proposal regions from the feature map, divides the selected proposal regions into P proposal region sets, and inputs the sub-feature maps corresponding to the proposal regions in each proposal region set into the corresponding classifier.
S212: Each classifier detects the to-be-detected objects in the training image according to the sub-feature maps input in S211.
S213: Compare the to-be-detected objects detected in the training image in S212 with the prior results of the training image, and excite at least one of the following parameters according to the comparison result: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the classifier.
In S212 and S213, each classifier detects the to-be-detected objects in the training image according to the sub-feature maps it receives, and is excited according to the comparison between its detection result and the prior result. Every classifier copied in S208 executes S212 and S213.
The excitation process for the next training image is similar to that for the training image obtained in S209. The main differences are: 1. when the backbone network extracts the feature map of the next training image, the model parameters of the convolution kernels of the convolutional layers of the backbone network are those excited in S213 (if they were excited in S213); 2. after the backbone network extracts the feature map of the next training image, the model parameters of the convolution kernels of the region proposal network into which the feature map is input, and the region proposal parameters of the region proposal network, are those excited in S213 (if they were excited in S213); 3. the parameters of the classifiers through which the next training image passes are those excited in S213 (if they were excited in S213).
By analogy, each training image further excites the object detection model on the basis of the excitation performed by the previous training images. After all training images have been used in turn in the second stage of the training state, the training process of the object detection model ends. As shown in FIG. 5, the object detection model can then be used in the inference state.
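The second-stage loop S208-S213 can be summarized in a sketch (the `model` interface bundling the components as callables is an assumption made for illustration):

```python
import copy

def second_stage_training(images, priors, model, P):
    """Sketch of S208-S213: copy the first-stage classifier P times, then
    let each copy detect and be excited on its own proposal region set."""
    model.classifiers = [copy.deepcopy(model.classifier) for _ in range(P)]  # S208
    for image, prior in zip(images, priors):
        feature_map = model.backbone(image)                 # S209
        proposal_sets = model.rpn(feature_map, P)           # S210-S211
        for k, sub_maps in enumerate(proposal_sets):
            detections = model.classifiers[k](sub_maps)     # S212
            model.excite(detections, prior[k])              # S213
```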
As shown in FIG. 14, another workflow of the detection model training apparatus is described. Compared with the workflow shown in FIG. 13, the main difference is that S203 and S209 in the workflow of FIG. 13 are replaced by S203' and S209', respectively.
Referring to the corresponding parts of FIG. 6 to FIG. 8, in S203' and S209' the backbone network extracts at least two feature maps and inputs them into the region proposal network, for the region proposal network to select proposal regions from. After all training images have been used in turn in the second stage of the training state, the training process of the object detection model ends. As shown in FIG. 8, the object detection model can then be used in the inference state.
This application further provides a detection model training apparatus 400. As shown in FIG. 15, the detection model training apparatus 400 includes an object detection model 401, an excitation module 405, a storage module 406, and an initialization module 407. The object detection model 401 further includes a region proposal network 402, a backbone network 403, and a classifier 404. The classifier 404 comprises one classifier in the first stage of the training state, and P classifiers in the second stage of the training state and in the inference state.
Each of the above modules may be a software module. In the first stage of the training state, the initialization module 407 is configured to execute S201 and S202 to determine the replication parameter P. The object detection model 401 obtains a training image from the storage module 406 and executes S203 or S203', and S204, to build the backbone network 403. The region proposal network 402 executes S205, the classifier 404 executes S206, and the excitation module 405 executes S207. In the second stage of the training state, the initialization module 407 is configured to execute S208; the object detection model 401 obtains a training image from the storage module 406 and executes S209 or S209', and S210, to build the backbone network 403. The region proposal network 402 executes S211, the classifier 404 executes S212, and the excitation module 405 executes S213.
The detection model training apparatus 400 may be provided to users as an object detection model training service. For example, as shown in FIG. 1, the detection model training apparatus 400 (or a part of it) is deployed in a cloud environment; a user selects the backbone network type and some system parameters, places the training images and the prior results of the training images into the storage module 406, and then starts the detection model training apparatus 400 to train the object detection model 401. The trained object detection model 401 is provided to the user, who can run it in their own terminal environment or sell it directly to a third party.
This application further provides a computing device 500. As shown in FIG. 16, the computing device 500 includes a bus 501, a processor 502, a communication interface 503, and a memory 504. The processor 502, the memory 504, and the communication interface 503 communicate over the bus 501.
The processor may be a central processing unit (CPU). The memory may include volatile memory, for example random access memory (RAM). The memory may also include non-volatile memory, for example read-only memory (ROM), flash memory, an HDD, or an SSD. The memory stores executable code, and the processor executes this executable code to perform the foregoing object detection method. The memory may also include other software modules required for running processes, such as an operating system. The operating system may be LINUX™, UNIX™, WINDOWS™, or the like.
The memory of the computing device 500 stores the code corresponding to the modules of the detection model training apparatus 400, and the processor 502 executes this code to implement the functions of those modules, that is, to perform the method shown in FIG. 13 or FIG. 14. The computing device 500 may be a computing device in a cloud environment, an edge environment, or a terminal environment.
As shown in FIG. 2, the parts of the detection model training apparatus 400 may execute on multiple computing devices in different environments. Therefore, this application further proposes a computing device system. As shown in FIG. 17, the computing device system includes multiple computing devices 600; the structure of each computing device 600 is the same as that of the computing device 500 in FIG. 16. The computing devices 600 establish communication paths among themselves over a communication network. Each computing device 600 runs any one or more of the region proposal network 402, the backbone network 403, the classifier 404, the excitation module 405, the storage module 406, and the initialization module 407. Any computing device 600 may be a computing device in a cloud environment, an edge environment, or a terminal environment.
Further, as shown in FIG. 18, because the training images and the prior results of the training images occupy a large amount of space, a computing device 600 may be unable to store all of them by itself; this application therefore proposes another computing device system. The storage module 406 is deployed in a cloud storage service (for example, an object storage service): the user applies for storage space of a certain capacity in the cloud storage service to serve as the storage module 406, and stores the training images and the prior results of the training images in it. While running, a computing device 600 obtains the needed training images and prior results from the remote storage module 406 over the communication network. Each computing device 600 runs any one or more of the region proposal network 402, the backbone network 403, the classifier 404, the excitation module 405, and the initialization module 407. Any computing device 600 may be a computing device in a cloud environment, an edge environment, or a terminal environment.
The descriptions of the flows corresponding to the above drawings each have their own emphasis; for a part not detailed in one flow, refer to the related descriptions of the other flows.
All or part of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, an SSD).

Claims (15)

  1. A method for training an object detection model performed by a computing device, wherein the method comprises:
    obtaining a training image, and building a backbone network according to the training image;
    inputting a feature map output by the backbone network into a region proposal network;
    selecting, by the region proposal network, multiple proposal regions from the feature map output by the backbone network according to region proposal parameters, and inputting sub-feature maps corresponding to the multiple proposal regions into a classifier;
    detecting, by the classifier, to-be-detected objects in the training image according to the sub-feature maps corresponding to the multiple proposal regions;
    comparing the to-be-detected objects in the training image detected by the classifier with prior results of the training image, and exciting, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of the classifier;
    copying the classifier to obtain at least two classifiers;
    dividing, by the region proposal network, the multiple proposal regions into at least two proposal region sets, each proposal region set comprising at least one proposal region;
    inputting, by the region proposal network, the sub-feature maps corresponding to the proposal regions comprised in each proposal region set into one of the at least two classifiers; and
    performing, by each of the at least two classifiers, the following actions:
    detecting to-be-detected objects in the training image according to the sub-feature maps corresponding to the proposal regions comprised in the obtained proposal region set; and
    comparing the detected to-be-detected objects in the training image with the prior results of the training image, and exciting, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of each classifier.
  2. The method according to claim 1, wherein the method further comprises:
    obtaining system parameters, the system parameters comprising at least one of: the number of size clusters of to-be-detected objects in the training image, and the training computing capability; and
    determining, according to the system parameters, the number of classifiers among the at least two classifiers obtained after copying.
  3. The method according to claim 2, wherein, when the system parameters comprise the number of size clusters of to-be-detected objects in the training image, the obtaining system parameters comprises:
    clustering the sizes of the to-be-detected objects in the training image to obtain the number of size clusters of the to-be-detected objects in the training image.
  4. The method according to any one of claims 1 to 3, wherein the feature map output by the backbone network comprises at least two feature maps.
  5. A detection model training apparatus, comprising:
    an object detection model, configured to obtain a training image and build a backbone network according to the training image; select multiple proposal regions from a feature map output by the backbone network according to region proposal parameters, and input sub-feature maps corresponding to the multiple proposal regions into a classifier; and detect to-be-detected objects in the training image according to the sub-feature maps corresponding to the multiple proposal regions;
    an excitation module, configured to compare the detected to-be-detected objects in the training image with prior results of the training image, and excite, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of the classifier; and
    an initialization module, configured to copy the classifier to obtain at least two classifiers;
    wherein the object detection model is further configured to divide the multiple proposal regions into at least two proposal region sets, each proposal region set comprising at least one proposal region, and input the sub-feature maps corresponding to the proposal regions comprised in each proposal region set into one of the at least two classifiers; and each of the at least two classifiers performs the following actions: detecting to-be-detected objects in the training image according to the sub-feature maps corresponding to the proposal regions comprised in the obtained proposal region set; and comparing the detected to-be-detected objects in the training image with the prior results of the training image, and exciting, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of each classifier.
  6. The apparatus according to claim 5, wherein the initialization module is further configured to obtain system parameters, the system parameters comprising at least one of: the number of size clusters of to-be-detected objects in the training image, and the training computing capability; and determine, according to the system parameters, the number of classifiers among the at least two classifiers obtained after copying.
  7. The apparatus according to claim 6, wherein the initialization module is further configured to cluster the sizes of the to-be-detected objects in the training image to obtain the number of size clusters of the to-be-detected objects in the training image.
  8. The apparatus according to any one of claims 5 to 7, wherein the feature map output by the backbone network comprises at least two feature maps.
  9. A computing device system, comprising at least one computing device, each computing device comprising a processor and a memory, the processor of the at least one computing device being configured to perform the method according to any one of claims 1 to 4.
  10. A non-transitory readable storage medium, wherein, when the non-transitory readable storage medium is executed by at least one computing device in a computing device system, the at least one computing device performs the method according to any one of claims 1 to 4.
  11. A computing device program product, wherein, when the computing device program product is executed by at least one computing device in a computing device system, the at least one computing device performs the method according to any one of claims 1 to 4.
  12. A method for training an object detection model performed by a computing device, wherein the method comprises:
    in a first stage of training, extracting a feature map of a training image through a backbone network, selecting proposal regions from the extracted feature map through a region proposal network, inputting the sub-feature maps corresponding to the proposal regions into a classifier, detecting, by the classifier, to-be-detected objects in the training image according to the sub-feature maps corresponding to the proposal regions, comparing the detection result with prior results of the training image, and exciting at least one of the backbone network, the region proposal network, and the classifier according to the comparison result; and
    in a second stage of training, establishing at least two copied classifiers according to the classifier that has been through the first-stage training, dividing, by the region proposal network, the proposal regions into at least two proposal region sets, each proposal region set comprising at least one proposal region, inputting the sub-feature maps corresponding to the proposal regions comprised in each proposal region set into one copied classifier, detecting, by each copied classifier, to-be-detected objects in the training image according to the obtained sub-feature maps, comparing the detection result with the prior results of the training image, and exciting at least one of the backbone network, the region proposal network, and the classifiers again according to the comparison result.
  13. The method according to claim 12, wherein the method further comprises:
    obtaining system parameters, the system parameters comprising at least one of: the number of size clusters of to-be-detected objects in the training image, and the training computing capability; and
    determining, according to the system parameters, the number of copied classifiers established.
  14. The method according to claim 13, wherein, when the system parameters comprise the number of size clusters of to-be-detected objects in the training image, the obtaining system parameters comprises:
    clustering the sizes of the to-be-detected objects in the training image to obtain the number of size clusters of the to-be-detected objects in the training image.
  15. The method according to any one of claims 12 to 14, wherein the feature map extracted by the backbone network comprises at least two feature maps.
PCT/CN2019/076982 2018-08-03 2019-03-05 Method, device and apparatus for training object detection model WO2020024584A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19808951.8A EP3633553A4 (en) 2018-08-03 2019-03-05 Method, device and apparatus for training object detection model
US17/025,419 US11423634B2 (en) 2018-08-03 2020-09-18 Object detection model training method, apparatus, and device

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201810878556 2018-08-03
CN201810878556.9 2018-08-03
CN201811070244.1A CN110796154B (en) 2018-08-03 2018-09-13 Method, device and equipment for training object detection model
CN201811070244.1 2018-09-13

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/025,419 Continuation US11423634B2 (en) 2018-08-03 2020-09-18 Object detection model training method, apparatus, and device

Publications (1)

Publication Number Publication Date
WO2020024584A1 true WO2020024584A1 (en) 2020-02-06

Family

ID=69230505

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/076982 WO2020024584A1 (en) 2018-08-03 2019-03-05 Method, device and apparatus for training object detection model

Country Status (1)

Country Link
WO (1) WO2020024584A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324938A (en) * 2012-03-21 2013-09-25 日电(中国)有限公司 Method for training attitude classifier and object classifier and method and device for detecting objects
CN103942558A (en) * 2013-01-22 2014-07-23 日电(中国)有限公司 Method and apparatus for obtaining object detectors
CN104217216A (en) * 2014-09-01 2014-12-17 华为技术有限公司 Method and device for generating detection model, method and device for detecting target

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3633553A4 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523403A (en) * 2020-04-03 2020-08-11 咪咕文化科技有限公司 Method and device for acquiring target area in picture and computer readable storage medium
CN111523403B (en) * 2020-04-03 2023-10-20 咪咕文化科技有限公司 Method and device for acquiring target area in picture and computer readable storage medium
CN111860509A (en) * 2020-07-28 2020-10-30 湖北九感科技有限公司 Coarse-to-fine two-stage non-constrained license plate region accurate extraction method
CN111860545A (en) * 2020-07-30 2020-10-30 元神科技(杭州)有限公司 Image sensitive content identification method and system based on weak detection mechanism
CN111860545B (en) * 2020-07-30 2023-12-19 元神科技(杭州)有限公司 Image sensitive content identification method and system based on weak detection mechanism
CN113221935A (en) * 2021-02-02 2021-08-06 清华大学 Image identification method and system based on environment perception deep convolutional neural network
CN112861803A (en) * 2021-03-16 2021-05-28 厦门博海中天信息科技有限公司 Image identification method, device, server and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN110796154B (en) Method, device and equipment for training object detection model
WO2020024584A1 (en) Method, device and apparatus for training object detection model
WO2020024585A1 (en) Method and apparatus for training object detection model, and device
US20180189610A1 (en) Active machine learning for training an event classification
US11106944B2 (en) Selecting logo images using machine-learning-logo classifiers
WO2023138300A1 (en) Target detection method, and moving-target tracking method using same
EP3333768A1 (en) Method and apparatus for detecting target
CN110569721A (en) Recognition model training method, image recognition method, device, equipment and medium
WO2019051941A1 (en) Method, apparatus and device for identifying vehicle type, and computer-readable storage medium
US20180181796A1 (en) Image processing method and apparatus
CN111191568A (en) Method, device, equipment and medium for identifying copied image
US20160162757A1 (en) Multi-class object classifying method and system
CN115631112B (en) Building contour correction method and device based on deep learning
TW201942814A (en) Object classification method, apparatus, server, and storage medium
Fathi et al. General rotation-invariant local binary patterns operator with application to blood vessel detection in retinal images
US10614379B2 (en) Robust classification by pre-conditioned lasso and transductive diffusion component analysis
JP2019220014A (en) Image analyzing apparatus, image analyzing method and program
US11810341B2 (en) Method of identifying filters in a neural network, system and storage medium of the same
US20210406568A1 (en) Utilizing multiple stacked machine learning models to detect deepfake content
CN112287905A (en) Vehicle damage identification method, device, equipment and storage medium
JP2016224821A (en) Learning device, control method of learning device, and program
US20170046615A1 (en) Object categorization using statistically-modeled classifier outputs
CN112801045B (en) Text region detection method, electronic equipment and computer storage medium
CN111753723B (en) Fingerprint identification method and device based on density calibration
CN111091022A (en) Machine vision efficiency evaluation method and system

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019808951

Country of ref document: EP

Effective date: 20200102

NENP Non-entry into the national phase

Ref country code: DE