WO2020024584A1 - Method, device and apparatus for training object detection model - Google Patents

Method, device and apparatus for training object detection model

Info

Publication number
WO2020024584A1
Authority
WO
WIPO (PCT)
Prior art keywords
proposal
training image
detected
training
region
Prior art date
Application number
PCT/CN2019/076982
Other languages
French (fr)
Chinese (zh)
Inventor
张长征 (Changzheng ZHANG)
金鑫 (Xin JIN)
涂丹丹 (Dandan TU)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Priority claimed from CN201811070244.1A (CN110796154B)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to EP19808951.8A (EP3633553A4)
Publication of WO2020024584A1
Priority to US17/025,419 (US11423634B2)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present application relates to the field of computer technology, and in particular, to a method for training an object detection model, and a device and a computing device for performing the method.
  • Object detection is an artificial intelligence technology that accurately locates and classifies objects in images and videos. It includes general object detection, face detection, pedestrian detection, and text detection. In recent years, academia and industry have invested heavily in it, and the algorithms have steadily matured. Current deep-learning-based object detection solutions are used in municipal security (pedestrian detection, vehicle detection, license plate detection, etc.), finance (object detection, face registration, etc.), the Internet (identity verification), smart terminals, and other real products.
  • Object detection has been widely used in a variety of scenes of low and medium complexity (such as face detection in access control and checkpoint scenes).
  • In complex scenes, however, unfavorable factors such as large variations in the size of the object to be detected, occlusion, and distortion remain, and improving detection accuracy in their presence is still an open problem.
  • This application provides a method for training an object detection model, which improves the detection accuracy of the trained object detection model.
  • a method for training an object detection model performed by a computing device is provided.
  • the computing device executing the method may be one or more computing devices distributed in the same or different environments.
  • the method includes:
  • a training image is acquired, and a backbone network is established according to the training image.
  • The feature map output by the backbone network is input into a region proposal network.
  • The region proposal network selects a plurality of proposal regions from the feature map output by the backbone network according to the region proposal parameters, and inputs the sub-feature maps corresponding to the plurality of proposal regions to a classifier.
  • the classifier detects an object to be detected in the training image according to a sub-feature map corresponding to the plurality of proposed regions.
  • The region proposal network divides the plurality of proposal regions into at least two proposal region sets, and each proposal region set includes at least one proposal region.
  • The region proposal network inputs the sub-feature maps corresponding to the proposal regions included in each proposal region set into one of the at least two classifiers.
  • Each classifier of the at least two classifiers performs the following actions: detecting the object to be detected in the training image according to the sub-feature maps corresponding to the proposal regions included in the acquired proposal region set; comparing the detection result with the prior result of the object to be detected in the training image; and, according to the comparison result, exciting at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of each classifier.
  • Each of the at least two classifiers excites its own parameters according to the comparison result, and generally does not excite the parameters of the other classifiers in the at least two classifiers according to the comparison result.
  • The method provided above inputs the training image into the object detection model twice, training the object detection model in two stages.
  • In the first stage, the sizes of the objects to be detected are not distinguished, so the trained classifier has a global view.
  • In the second stage, each copied classifier is responsible for detecting the objects to be detected in one proposal region set, that is, for detecting one class of objects to be detected, so each trained classifier is more targeted and more sensitive to objects to be detected of a particular size.
  • The two-stage training therefore improves the detection accuracy of the trained object detection model for objects to be detected of different sizes. A sketch of the replication step is given below.
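  • As an illustration only (not the claimed implementation), the replication step can be sketched in PyTorch as follows; `classifier`, `proposal_sets`, and `P` are hypothetical names.

```python
import copy
import torch.nn as nn

def replicate_classifier(classifier: nn.Module, P: int) -> nn.ModuleList:
    """Copy the stage-1 classifier into P independent classifiers.

    Each copy starts from the stage-1 parameters but is excited (updated)
    independently during the second training stage.
    """
    return nn.ModuleList(copy.deepcopy(classifier) for _ in range(P))

# Hypothetical stage-2 routing: the sub-feature maps of each proposal region
# set are sent to the copy responsible for that size class.
# classifiers = replicate_classifier(classifier, P)
# for p, sub_feature_maps in enumerate(proposal_sets):
#     detections = classifiers[p](sub_feature_maps)
```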
  • Optionally, the method further includes: acquiring system parameters, where the system parameters include at least one of the following: the number of size clusters of the objects to be detected in the training image, and the computing capability available for training; and determining, according to the system parameters, the number of classifiers in the at least two classifiers obtained after copying.
  • the number of copied classifiers can be manually configured or calculated based on the situation of the object to be detected in the training image.
  • An appropriate choice of the number of copied classifiers further improves the detection accuracy of the trained object detection model for objects to be detected of different sizes.
  • When the system parameters include the number of size clusters of the objects to be detected in the training image, acquiring the system parameters includes: clustering the sizes of the objects to be detected in the training image to obtain the number of size clusters of the objects to be detected in the training image (for example with K-means, as sketched below).
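  • A minimal sketch of such clustering, assuming scikit-learn's K-means; the silhouette criterion for choosing the number of clusters is an assumption, since the text only requires clustering the sizes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def count_size_clusters(box_sizes: np.ndarray, max_k: int = 5) -> int:
    """Cluster object sizes (e.g. sqrt(w * h) of the marked boxes) and pick
    the cluster count with the best silhouette score."""
    best_k, best_score = 2, -1.0
    for k in range(2, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(box_sizes)
        score = silhouette_score(box_sizes, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Example sizes of the objects to be detected in a training set.
sizes = np.array([[12.0], [15.0], [14.0], [96.0], [104.0], [99.0]])
print(count_size_clusters(sizes))  # 2: a "small" and a "large" cluster
```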
  • the feature map output by the backbone network includes at least two feature maps.
  • The spans of different convolutional layers of the backbone network may be different, so the sizes of the objects to be detected in the proposal regions of the feature maps of different convolutional layers may also be different. Extracting at least two feature maps from the backbone network enriches the source of proposal regions, which further improves the detection accuracy of the trained object detection model for objects to be detected of different sizes. A sketch of tapping two layers is given below.
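  • A sketch of extracting feature maps from two convolutional layers with PyTorch forward hooks; the toy backbone and the chosen tap points are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy backbone; in practice a template such as ResNet or VGG would be used.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)

feature_maps = {}

def save_output(name):
    def hook(module, inputs, output):
        feature_maps[name] = output
    return hook

# Tap two convolutional layers so the region proposal network receives
# feature maps with different spans (here 4 and 8 relative to the input).
backbone[2].register_forward_hook(save_output("conv2"))
backbone[4].register_forward_hook(save_output("conv3"))

_ = backbone(torch.randn(1, 3, 224, 224))
for name, fmap in feature_maps.items():
    print(name, tuple(fmap.shape))  # conv2 (1, 32, 56, 56), conv3 (1, 64, 28, 28)
```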
  • a second aspect of the present application provides a detection model training device, including an initialization module, an object detection model, and an excitation module.
  • An object detection model is used to: acquire a training image and establish a backbone network based on the training image; select multiple proposal regions from the feature map output by the backbone network according to the region proposal parameters, and input the sub-feature maps corresponding to the multiple proposal regions to a classifier; and detect the object to be detected in the training image according to the sub-feature maps corresponding to the multiple proposal regions.
  • An excitation module is configured to compare the object to be detected that is detected in the training image with the prior result of the training image, and, according to the comparison result, excite at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of the classifier.
  • An initialization module is configured to duplicate the classifier to obtain at least two classifiers.
  • The object detection model is further configured to divide the plurality of proposal regions into at least two proposal region sets, each proposal region set including at least one proposal region, and to input the sub-feature maps corresponding to the proposal regions included in each proposal region set into one of the at least two classifiers. Each of the at least two classifiers performs the following actions: detecting the object to be detected in the training image according to the sub-feature maps corresponding to the proposal regions included in the acquired proposal region set; comparing the detection result with the prior result of the training image; and, according to the comparison result, exciting at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of each classifier.
  • Optionally, the initialization module is further configured to acquire system parameters, where the system parameters include at least one of the following: the number of size clusters of the objects to be detected in the training image, and the computing capability available for training; and to determine, according to the system parameters, the number of classifiers in the at least two classifiers obtained after replication.
  • the initialization module is further configured to cluster the sizes of the objects to be detected in the training image to obtain the number of size clusters of the objects to be detected in the training image.
  • the feature map output by the backbone network includes at least two feature maps.
  • a third aspect of the present application provides a computing device system.
  • the computing device system includes at least one computing device.
  • Each computing device includes a processor and memory.
  • A processor of the at least one computing device is configured to access code in the memory to perform the method provided in the first aspect or any possible implementation of the first aspect.
  • a fourth aspect of the present application provides a non-transitory readable storage medium.
  • When the program stored in the non-transitory readable storage medium is executed by at least one computing device, the at least one computing device executes the method provided in the foregoing first aspect or any possible implementation of the first aspect.
  • the program is stored in the storage medium.
  • The type of the storage medium includes, but is not limited to, volatile memory, such as random access memory, and non-volatile memory, such as flash memory, hard disk drive (HDD), and solid state drive (SSD).
  • a fifth aspect of the present application provides a computing device program product.
  • When the computing device program product is executed by at least one computing device, the at least one computing device executes the method provided in the foregoing first aspect or any possible implementation of the first aspect.
  • The computing device program product may be a software installation package. If the method provided in the foregoing first aspect or any possible implementation of the first aspect needs to be used, the computing device program product may be downloaded to and executed on a computing device.
  • A sixth aspect of the present application provides another method for training an object detection model performed by a computing device, the method including two-stage training, as follows.
  • In the first stage, a feature map of a training image is extracted through a backbone network, proposal regions are selected from the extracted feature map through a region proposal network, and the sub-feature maps corresponding to the proposal regions are input to a classifier.
  • The classifier detects the object to be detected in the training image according to the sub-feature maps corresponding to the proposal regions, the detection result is compared with the prior result of the training image, and at least one of the backbone network, the region proposal network, and the classifier is excited according to the comparison result.
  • In the second stage, at least two replication classifiers are established according to the classifier that has undergone the first-stage training. The region proposal network divides the proposal regions into at least two proposal region sets, each proposal region set includes at least one proposal region, and the sub-feature maps corresponding to the proposal regions included in each proposal region set are input to one replication classifier. Each replication classifier detects the object to be detected in the training image according to the acquired sub-feature maps, the detection result is compared with the prior result of the training image, and at least one of the backbone network, the region proposal network, and the replication classifiers is excited again according to the comparison result.
  • The classifier that has undergone the first-stage training may be duplicated to establish the at least two replication classifiers. It is also possible to adjust the classifier that has undergone the first-stage training before copying it to establish the at least two replication classifiers.
  • Optionally, the method further includes: acquiring system parameters, where the system parameters include at least one of the following: the number of size clusters of the objects to be detected in the training image, and the computing capability available for training; and determining, according to the system parameters, the number of replication classifiers established.
  • When the system parameters include the number of size clusters of the objects to be detected in the training image, acquiring the system parameters includes: clustering the sizes of the objects to be detected in the training image to obtain the number of size clusters of the objects to be detected in the training image.
  • the feature map extracted by the backbone network includes at least two feature maps.
  • a seventh aspect of the present application provides a computing device system.
  • the computing device system includes at least one computing device.
  • Each computing device includes a processor and memory.
  • A processor of the at least one computing device is configured to access code in the memory to perform the method provided in the sixth aspect or any possible implementation of the sixth aspect.
  • An eighth aspect of the present application provides a non-transitory readable storage medium.
  • When the program stored in the non-transitory readable storage medium is executed by at least one computing device, the at least one computing device executes the method provided in the foregoing sixth aspect or any possible implementation of the sixth aspect.
  • the program is stored in the storage medium.
  • the type of the storage medium includes, but is not limited to, volatile memory, such as random access memory, and non-volatile memory, such as flash memory, HDD, and SSD.
  • a ninth aspect of the present application provides a computing device program product.
  • When the computing device program product is executed by at least one computing device, the at least one computing device executes the method provided in the foregoing sixth aspect or any possible implementation of the sixth aspect.
  • The computing device program product may be a software installation package.
  • If the method provided in the foregoing sixth aspect or any possible implementation of the sixth aspect needs to be used, the computing device program product may be downloaded to and executed on a computing device.
  • FIG. 1 is a schematic diagram of a system architecture provided by this application.
  • FIG. 2 is a schematic diagram of another system architecture provided by the present application.
  • FIG. 3 is a working flowchart of a detection model training device provided in the present application in a training state
  • FIG. 6 is a working flowchart of a detection model training device provided in the present application in a training state
  • FIG. 7 is another working flowchart of the detection model training device provided in the present application in a training state
  • FIG. 8 is a working flowchart, provided by the present application, of an object detection model in an inference state.
  • FIG. 9 is a schematic structural diagram of a convolution layer and a convolution kernel provided by the present application.
  • FIG. 10 is a schematic diagram of a receptive field of a convolutional layer provided by the present application.
  • FIG. 11 is a schematic diagram of a receptive field of another convolution layer provided by the present application.
  • FIG. 12 is a working flowchart of a regional proposal network provided by this application.
  • FIG. 15 is a schematic structural diagram of a detection model training device provided by the present application.
  • FIG. 16 is a schematic structural diagram of a computing device provided by the present application.
  • FIG. 17 is a schematic structural diagram of a computing device system provided by the present application.
  • FIG. 18 is a schematic structural diagram of another computing device system provided by the present application.
  • the method for training an object detection model provided by the present application is executed by a detection model training device.
  • the device can run in a cloud environment, specifically one or more computing devices on the cloud environment.
  • the device can also run in an edge environment, specifically one or more computing devices (edge computing devices) in the edge environment.
  • the device can also run in a terminal environment, specifically one or more terminal devices in the terminal environment.
  • Terminal equipment can be mobile phones, notebooks, servers, desktop computers, etc.
  • the edge computing device may be a server.
  • the detection model training device may be composed of multiple parts (modules), so each part of the detection model training device may also be separately deployed in different environments.
  • For example, the three modules of the detection model training device may be deployed separately across the cloud environment, the edge environment, and the terminal environment, either on all three or on any two of them.
  • FIG. 3 to FIG. 5 and FIG. 6 to FIG. 8 respectively illustrate the working flow diagrams of the two detection model training devices.
  • the training state of the detection model training device is divided into two stages.
  • the detection model training device works in the first stage of the training state.
  • the purpose of the training state is to use the training image and the prior results of the training image to train a highly accurate object detection model.
  • the prior result of the training image includes a mark of an object to be detected in the training image.
  • the training image in FIG. 3 includes multiple faces, and each face of the training image is marked with a white frame in the prior result of the training image (as shown in the upper left corner of FIG. 3).
  • the prior results of the training images can generally be provided manually.
  • a K-layer backbone network is established according to the training image.
  • the backbone network includes K convolutional layers, where K is a positive integer greater than 0.
  • the backbone network extracts feature maps from the training images.
  • the feature map extracted from the backbone network is input to a regional proposal network.
  • the regional proposal network selects a proposed region from the feature map and inputs the sub-feature map corresponding to the proposed region to the classifier.
  • Specifically, the region proposal network can directly compare the feature map with the prior result of the training image, and take regions of the feature map that have high coverage of the objects to be detected in the training image as the proposal regions.
  • the region proposal network may first identify the foreground region and the background region from the feature map, and then extract the proposed region from the foreground region.
  • the foreground area is an area containing a high amount of information and a high probability of including an object to be detected
  • the background area is an area containing a low amount of information, having a lot of repeated information, and a low probability of including an object to be detected.
  • Each sub-feature map consists of the part of the feature map that is located within the corresponding proposal region.
  • the classifier determines whether the area of the training image corresponding to the proposed region corresponding to the sub-feature map is an object to be detected according to the sub-feature map. As shown on the right side of Figure 3, the classifier marks the detected face area with a white frame on the training image. By comparing the detection result of the training image with the prior result of the training image, it is possible to know the difference between the object to be detected detected by the detection model training device and the prior result. As shown in FIG. 3, some faces in the prior result are not detected by the detection model training device.
  • The parameters of the object detection model are excited according to the difference, including at least one of the following: the model parameters of the convolution kernels of each convolutional layer of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the classifier.
  • the detection model training device trains an object detection model through a large number of training images and prior results of the training images.
  • the object detection model includes a backbone network, a region proposal network, and a classifier.
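  • In modern frameworks, exciting the parameters according to the comparison difference corresponds to a gradient update. A minimal, self-contained sketch with toy stand-ins for the three components (the module shapes and the loss are assumptions, not the claimed implementation):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three components of the object detection model.
backbone = nn.Conv2d(3, 8, 3, padding=1)  # stands in for the backbone network
rpn = nn.Conv2d(8, 8, 1)                  # stands in for the region proposal network
classifier = nn.Linear(8, 2)              # object to be detected / background

optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(rpn.parameters())
    + list(classifier.parameters()), lr=0.01, momentum=0.9)

def training_step(image, prior_labels):
    feature_map = backbone(image)            # feature extraction
    sub_features = rpn(feature_map)          # per-proposal features
    pooled = sub_features.mean(dim=(2, 3))   # crude stand-in for RoI pooling
    detections = classifier(pooled)
    loss = nn.functional.cross_entropy(detections, prior_labels)
    optimizer.zero_grad()
    loss.backward()    # the comparison difference "excites" ...
    optimizer.step()   # ... the parameters of all three components
    return loss.item()

print(training_step(torch.randn(4, 3, 32, 32), torch.tensor([0, 1, 0, 1])))
```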
  • the object detection model that has undergone the first stage of the training mode enters the second stage of the training mode, as shown in FIG. 4.
  • the classifier that has undergone the first stage in FIG. 3 is first copied into P copies.
  • the training image is input to the backbone network, and the feature map extracted by the backbone network is input to the regional proposal network.
  • the regional proposal network selects proposal regions from the feature map, and aggregates the selected proposal regions into P proposal region sets according to the size of the proposal region. The sizes of the proposed regions in each set of proposed regions are similar.
  • the sub-feature maps corresponding to the P proposed region sets are input to P classifiers, respectively.
  • a proposed region set corresponds to a classifier, and a sub-feature map corresponding to the proposed region in the proposed region set is input into the classifier.
  • Each classifier detects objects of different sizes in the training image according to the received sub-feature maps, and obtains corresponding detection results.
  • the detection result of each classifier is compared with the prior result of the size of the object to be detected corresponding to the sub-feature map received by the classifier in the training image.
  • the difference between the detection result of the object to be detected at each size and the prior result of the object to be detected at that size will stimulate various parameters of the object detection model.
  • Each classifier will thus be trained to be more sensitive to objects to be detected of a particular size range, so the accuracy of the object detection model will be further improved after this second-stage excitation over a large number of training images.
  • Correspondingly, the prior results of the objects to be detected in the training image are divided into P classes according to size and are used for comparison by the P classifiers.
  • Here P is equal to 2, that is, the proposal regions are divided into two proposal region sets by the region proposal network.
  • The proposal regions in one set (corresponding to the upper classifier) are smaller, and the proposal regions in the other set (corresponding to the lower classifier) are larger. Therefore, the sub-feature maps corresponding to the proposal regions in the former set are used to detect smaller objects to be detected in the training image, and the sub-feature maps corresponding to the proposal regions in the latter set are used to detect larger objects to be detected in the training image.
  • The sub-feature maps of the two proposal region sets are input into different classifiers.
  • the upper classifier is used to detect smaller objects to be detected
  • the lower classifier is used to detect larger objects to be detected
  • the detection results output by the two classifiers are compared with the corresponding prior results.
  • Detection result 1 includes the objects to be detected that are detected by the upper classifier according to the sub-feature maps corresponding to the smaller proposal regions, and prior result 1 of the training image includes the prior results (size, coordinates, etc.) of the smaller objects to be detected in the training image.
  • Detection result 1 is compared with prior result 1, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of each convolutional layer of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the upper classifier.
  • Likewise, detection result 2 includes the objects to be detected that are detected by the lower classifier according to the sub-feature maps corresponding to the larger proposal regions, and prior result 2 of the training image includes the prior results (size, coordinates, etc.) of the larger objects to be detected in the training image.
  • Detection result 2 is compared with prior result 2, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of each convolutional layer of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the lower classifier.
  • Preset thresholds can be used to distinguish different proposal region sets. For example, when P proposal region sets need to be distinguished, P-1 thresholds are preset, each threshold corresponding to a proposal region size, and the P-1 thresholds are used to aggregate the proposal regions selected by the region proposal network into P proposal region sets (a sketch follows below). Correspondingly, according to the size of the objects to be detected in the training image, the prior results of the training image are divided into P prior results, and each prior result is compared with the detection result of the corresponding size to excite the object detection model.
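  • A sketch of this threshold-based aggregation with NumPy; the threshold values are assumptions.

```python
import numpy as np

def split_into_sets(proposal_sizes, thresholds):
    """Split proposal regions into P sets using P-1 preset size thresholds;
    returns, for each proposal region, the index of its proposal region set."""
    return np.digitize(proposal_sizes, thresholds).tolist()

# P = 3 sets, so P - 1 = 2 thresholds (example values, in feature units).
sizes = np.array([5.0, 12.0, 40.0, 90.0])
print(split_into_sets(sizes, [16.0, 64.0]))  # [0, 0, 1, 2]
```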
  • the object detection model trained in the second stage can be deployed in the cloud environment, edge environment or terminal environment. Or part of the object detection model can be deployed on three or any two of the cloud environment, edge environment, and terminal environment.
  • the image to be detected is input into the backbone network of the object detection model.
  • After processing by the backbone network, the region proposal network, and the P classifiers, the object detection model outputs the detection result of the image to be detected.
  • the detection result includes information such as the position and number of the detected objects to be detected, such as how many human faces there are and where each human face appears.
  • the region proposal network is similar to the second stage of the training state. The extracted proposal regions are classified according to size, and the sub-feature map corresponding to each proposal region is sent to the classifier corresponding to the proposed region, respectively.
  • Each classifier detects objects of different sizes according to the sub-feature maps of the proposed regions of different sizes, and by integrating the detection results of the P classifiers, the detection results of the images to be detected can be obtained.
  • FIG. 6 to FIG. 8 illustrate another workflow of the detection model training device, covering both the training state and the inference state.
  • the feature maps extracted from at least two convolutional layers of the backbone network are used as the input of the region proposal network.
  • the detection model training device works in the first stage of the training state.
  • a K-layer backbone network is established according to the training image.
  • the backbone network includes K convolutional layers, where K is a positive integer greater than 0.
  • the backbone network extracts p feature maps from the training images.
  • The p feature maps can be the feature maps extracted by the convolution kernels of any p convolutional layers of the backbone network, or any p convolutional layers of the backbone network themselves.
  • the p feature maps extracted by the backbone network are input to a regional proposal network.
  • The region proposal network selects proposal regions from the p feature maps and inputs the sub-feature maps corresponding to the proposal regions into the classifier.
  • Each sub-feature map includes features in which a portion of the feature map is located within the proposed area.
  • the classifier determines whether the area of the training image corresponding to the proposed region corresponding to the sub-feature map is an object to be detected according to the sub-feature map.
  • the classifier marks the detected face area with a white frame on the training image.
  • By comparing the detection result of the training image with the prior result of the training image, it is possible to know the difference between the objects to be detected that are detected by the detection model training device and the prior result. As shown in FIG. 6, some faces in the prior result are not detected by the detection model training device.
  • The parameters of the object detection model are excited according to the difference, including at least one of the following: the model parameters of the convolution kernels of each convolutional layer of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the classifier.
  • The difference between the detection result of each training image and the prior result of that training image excites the above parameters of the object detection model, so after excitation by a large number of training images, the accuracy of the object detection model improves.
  • the detection model training device trains an object detection model through a large number of training images and prior results of the training images.
  • the object detection model includes a backbone network, a region proposal network, and a classifier.
  • the object detection model that has undergone the first stage of the training mode enters the second stage of the training mode, as shown in FIG. 7.
  • the classifier that has undergone the first stage in FIG. 6 is first copied into P copies.
  • the training image is input to the backbone network, and at least one feature map extracted by the backbone network is input to the region proposal network.
  • the regional proposal network selects proposal regions from the feature map, and aggregates the selected proposal regions into P proposal region sets according to the size of the proposal region.
  • Which proposal region set a proposal region belongs to is determined according to the size of the proposal region and the span of the convolutional layer corresponding to the feature map in which the proposal region is located.
  • the sub-feature maps corresponding to the proposed regions in the P proposed region sets are input into the P classifiers, respectively.
  • a proposed region set corresponds to a classifier, and a sub-feature map corresponding to the proposed region in the proposed region set is input into the classifier.
  • Each classifier detects objects of different sizes to be detected according to the received sub-feature map and obtains corresponding detection results.
  • the detection result of each classifier is compared with the prior result of the size of the object to be detected corresponding to the sub-feature map received by the classifier in the training image.
  • the difference between the detection result of the object to be detected at each size and the prior result of the object to be detected at that size will stimulate various parameters of the object detection model.
  • Each classifier will thus be trained to be more sensitive to objects to be detected of a particular size range, so the accuracy of the object detection model will be further improved after this second-stage excitation over a large number of training images.
  • Correspondingly, the prior results of the objects to be detected in the training image are classified into P classes according to size and are used for comparison by the P classifiers.
  • Here P is equal to 2, that is, the region proposal network divides the selected proposal regions into two proposal region sets.
  • For the proposal regions in one set (corresponding to the upper classifier), the product of the size of the proposal region and the span of the convolutional layer corresponding to the feature map in which the proposal region is located is small; for the proposal regions in the other set (corresponding to the lower classifier), this product is large.
  • Therefore, the sub-feature maps corresponding to the proposal regions in the former set are used to detect smaller objects to be detected in the training image, and the sub-feature maps corresponding to the proposal regions in the latter set are used to detect larger objects to be detected in the training image.
  • The sub-feature maps of the two proposal region sets are input into different classifiers.
  • the upper classifier is used to detect smaller objects to be detected
  • the lower classifier is used to detect larger objects to be detected
  • the detection results output by the two classifiers are compared with the corresponding prior results.
  • Detection result 1 includes the objects to be detected that are detected by the upper classifier according to the sub-feature maps corresponding to the smaller proposal regions, and prior result 1 of the training image includes the prior results (size, coordinates, etc.) of the smaller objects to be detected in the training image.
  • Detection result 1 is compared with prior result 1, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of each convolutional layer of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the upper classifier.
  • Detection result 2 includes the objects to be detected that are detected by the lower classifier according to the sub-feature maps corresponding to the larger proposal regions, and prior result 2 of the training image includes the prior results (size, coordinates, etc.) of the larger objects to be detected in the training image.
  • Detection result 2 is compared with prior result 2, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of each convolutional layer of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the lower classifier.
  • Preset thresholds can be used to distinguish different proposal region sets. For example, when P proposal region sets need to be distinguished, P-1 thresholds are preset, each threshold corresponding to a proposal region size, and the P-1 thresholds are used to aggregate the proposal regions selected by the region proposal network into P proposal region sets. Correspondingly, according to the size of the objects to be detected in the training image, the prior results of the training image are divided into P prior results, and each prior result is compared with the detection result of the corresponding size to excite the object detection model.
  • the object detection model trained in the second stage can be deployed in the cloud environment, edge environment or terminal environment. Or part of the object detection model can be deployed on three or any two of the cloud environment, edge environment, and terminal environment.
  • the image to be detected is input into the backbone network of the object detection model.
  • After processing by the backbone network, the region proposal network, and the P classifiers, the object detection model outputs the detection result of the image to be detected.
  • the detection result includes information such as the position and number of the detected objects to be detected, such as how many human faces there are and where each human face appears.
  • the region proposal network is similar to the second stage of the training state.
  • the extracted proposal regions are classified according to size, and the sub-feature maps corresponding to the proposed regions are sent to the corresponding classifiers.
  • Each classifier detects objects of different sizes according to the sub-feature maps of the proposed regions of different sizes, and by integrating the detection results of the P classifiers, the detection results of the images to be detected can be obtained.
  • the backbone network includes a convolutional network, which includes K convolutional layers.
  • the K convolutional layers of a general backbone network constitute multiple convolutional blocks, and each convolutional block includes multiple convolutional layers.
  • the number of convolutional blocks of the backbone network is usually five.
  • the backbone network can also include pooling modules.
  • The backbone network can use templates commonly used in the industry, such as VGG, ResNet, DenseNet, Xception, Inception, MobileNet, etc.; a sketch of building a backbone from one such template follows.
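  • A sketch of using one such template (ResNet-50 via torchvision, assuming torchvision >= 0.13 for the `weights` argument); dropping the final pooling and fully connected layers is a common convention, not something mandated here.

```python
import torch
import torch.nn as nn
import torchvision

# Keep the convolutional blocks of ResNet-50 and drop the final average
# pooling and fully connected layers, leaving a feature-map extractor.
resnet = torchvision.models.resnet50(weights=None)
backbone = nn.Sequential(*list(resnet.children())[:-2])

features = backbone(torch.randn(1, 3, 224, 224))
print(tuple(features.shape))  # (1, 2048, 7, 7): the feature map for the RPN
```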
  • the extracted features of the training image are used as the first convolutional layer of the backbone network.
  • Features extracted from the first convolutional layer of the backbone network by the convolution kernel corresponding to the first convolutional layer form the second convolutional layer of the backbone network.
  • Features extracted from the second convolutional layer of the backbone network by the convolution kernel corresponding to the second convolutional layer of the backbone network form the third convolutional layer of the backbone network.
  • By analogy, the features extracted from the (k-1)-th convolutional layer of the backbone network by the convolution kernel corresponding to the (k-1)-th convolutional layer form the k-th convolutional layer of the backbone network, where k is greater than 1 and less than or equal to K.
  • The feature map extracted by the convolution kernel corresponding to the K-th convolutional layer of the backbone network is used as the input of the region proposal network.
  • the K-th convolutional layer of the backbone network can be directly used as the feature map as the input of the regional proposal network.
  • Optionally, the feature map extracted from the k-th convolutional layer of the backbone network by the convolution kernel corresponding to the k-th convolutional layer can also be used as the input of the region proposal network.
  • the k-th convolution layer of the backbone network can be directly used as the feature map as the input of the regional proposal network.
  • The region proposal network includes L convolutional layers, where L is an integer greater than 0. Similar to the backbone network, the features extracted from the (k'-1)-th convolutional layer of the region proposal network by the convolution kernel corresponding to the (k'-1)-th convolutional layer form the k'-th convolutional layer of the region proposal network, where k' is greater than 1 and less than or equal to L.
  • Both the backbone network and the regional proposal network include at least one convolutional layer.
  • The size of the convolutional layer 101 is X*Y*N1, that is, the convolutional layer 101 includes X*Y*N1 features, where N1 is the number of channels, one channel is a feature dimension, X*Y is the number of features included in each channel, and X, Y, and N1 are all positive integers greater than 0.
  • The convolution kernel 1011 is one of the convolution kernels used for the convolutional layer 101. Since the convolutional layer 102 includes N2 channels, the convolutional layer 101 uses a total of N2 convolution kernels. The sizes and model parameters of the N2 convolution kernels may be the same or different. Taking the convolution kernel 1011 as an example, its size is X1*X1*N1, that is, the convolution kernel 1011 includes X1*X1*N1 model parameters.
  • the initialization model parameters in the convolution kernel can use model parameter templates commonly used in the industry.
  • The model parameters of the convolution kernel 1011 are multiplied with the features of the convolutional layer 101 at the corresponding positions. After the products of each model parameter of the convolution kernel 1011 and the feature of the convolutional layer 101 at the corresponding position are combined, one feature on one channel of the convolutional layer 102 is obtained.
  • The products of the features of the convolutional layer 101 and the convolution kernel 1011 can be used directly as the features of the convolutional layer 102. Alternatively, after the convolution kernel 1011 has slid over the convolutional layer 101 and output all the product results, all the product results can be normalized, and the normalized product results used as the features of the convolutional layer 102.
  • the convolution kernel 1011 slides on the convolution layer 101 to perform convolution, and the convolution result forms a channel of the convolution layer 102.
  • Each convolution kernel used by the convolution layer 101 corresponds to one channel of the convolution layer 102. Therefore, the number of channels of the convolution layer 102 is equal to the number of convolution kernels acting on the convolution layer 101.
  • The design of the model parameters in each convolution kernel reflects the characteristics of the features that the convolution kernel is expected to extract from the convolutional layer. Through the N2 convolution kernels, the features of the N2 channels are extracted from the convolutional layer 101; the sketch below checks this correspondence.
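  • The correspondence between the number of kernels and the number of output channels can be checked directly in PyTorch; N1, N2, and X1 below are example values.

```python
import torch
import torch.nn as nn

N1, N2, X1 = 64, 128, 3  # input channels, number of kernels, kernel size

# One Conv2d layer holds N2 convolution kernels, each of size X1*X1*N1, so
# the number of output channels equals the number of kernels.
conv = nn.Conv2d(in_channels=N1, out_channels=N2, kernel_size=X1, padding=1)
print(conv.weight.shape)  # torch.Size([128, 64, 3, 3]): N2 kernels of X1*X1*N1

layer_101 = torch.randn(1, N1, 56, 56)
layer_102 = conv(layer_101)
print(layer_102.shape)    # torch.Size([1, 128, 56, 56]): N2 channels
```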
  • The convolution kernel 1011 can be split into convolution pieces.
  • The convolution kernel 1011 includes N1 convolution pieces, and each convolution piece includes X1*X1 model parameters (P11 to PX1X1).
  • Each model parameter corresponds to a convolution point.
  • the model parameters corresponding to a convolution point are multiplied with the features in the convolution layer at the corresponding position of the convolution point to obtain the convolution result of the convolution point.
  • The sum of the convolution results of the convolution points of a convolution kernel is the convolution result of this convolution kernel.
  • The sliding span of a convolution kernel is the number of features the convolution kernel moves across the convolutional layer in each slide.
  • Each time it slides, the convolution kernel moves V features from its current position on the convolutional layer, and the model parameters of the convolution kernel are convolved with the features of the convolutional layer at the new position; V is the sliding span of the convolution kernel. The sketch below shows how the sliding span determines the number of output features.
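  • The number of features produced per row follows directly from the sliding span; a small sketch of the standard output-size formula:

```python
def output_features(X: int, X1: int, V: int, padding: int = 0) -> int:
    """Features per row after sliding an X1-wide convolution kernel with
    sliding span (stride) V over a row of X features."""
    return (X + 2 * padding - X1) // V + 1

print(output_features(X=7, X1=3, V=1))  # 5 positions with span 1
print(output_features(X=7, X1=3, V=2))  # 3 positions with span 2
```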
  • The receptive field is the perceptual range, on the input image, of a feature on a convolutional layer. If the pixels within that perceptual range change, the value of the feature changes accordingly.
  • the convolution kernel slides on the input image, and the extracted features constitute the convolution layer 101.
  • the convolution kernel slides on the convolution layer 101, and the extracted features constitute the convolution layer 102.
  • each feature in the convolution layer 101 is extracted from pixels of the input image within the size of the convolution sheet of the convolution kernel sliding on the input image, and this size is also the receptive field of the convolution layer 101. Therefore, the receptive field of the convolution layer 101 is shown in FIG. 10.
  • each feature in the convolutional layer 102 is mapped to a range on the input image (that is, how many pixels are used on the input image), that is, the receptive field of the convolutional layer 102.
  • each feature in the convolution layer 102 is extracted from the pixels of the input image within the size of the convolution piece of the convolution kernel sliding on the convolution layer 101.
  • Each feature on the convolutional layer 101 is extracted from the pixels of the input image within the range of the convolution pieces of the convolution kernel sliding on the input image. Therefore, the receptive field of the convolutional layer 102 is larger than that of the convolutional layer 101. If a backbone network includes multiple convolutional layers, the receptive field of the last convolutional layer is the receptive field of the backbone network. The standard recursion for computing receptive fields is sketched below.
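  • A sketch of the standard receptive-field recursion for a stack of convolutional layers, each described by its kernel size and sliding span (the recursion is textbook material, not taken from this application):

```python
def receptive_field(layers):
    """Receptive field on the input image after a stack of convolutional
    layers, each given as (kernel_size, sliding_span)."""
    rf, jump = 1, 1  # jump: input-pixel distance between neighbouring features
    for kernel, span in layers:
        rf += (kernel - 1) * jump
        jump *= span
    return rf

# Two 3x3 convolutions with span 1: each layer-102 feature sees 5x5 pixels.
print(receptive_field([(3, 1), (3, 1)]))  # 5
# With span 2 in the first layer, the receptive field grows to 7x7.
print(receptive_field([(3, 2), (3, 1)]))  # 7
```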
  • The training computing capability is the computing capability available to the detection model training device in the environment where it is deployed, including at least one of the following: processor frequency, processor occupancy, memory size, memory occupancy, cache utilization, cache size, graphics processor frequency, graphics processor occupancy, and other computing resource parameters.
  • The classifier includes a series of parameters, and the classifier detects information such as the position and number of objects to be detected in the image to be detected according to the input features and these parameters.
  • Common classifiers include the Softmax classifier, the Sigmoid classifier, and so on.
  • the size of the k + 1th convolutional layer of the backbone network is less than or equal to the size of the kth convolutional layer of the backbone network.
  • The span of the k-th convolutional layer of the backbone network is defined relative to the image input to the backbone network.
  • The image input to the backbone network may be a training image or an image to be detected.
  • The span of the k-th convolutional layer of the backbone network is generally determined by the number of pooling layers between the first convolutional layer and the k-th convolutional layer, and by the sliding spans of the convolution kernels used by the convolutional layers between the first convolutional layer and the k-th convolutional layer. The more pooling layers there are between the first convolutional layer and the k-th convolutional layer, and the larger the sliding spans of the convolution kernels used between them, the larger the span of the k-th convolutional layer. A sketch of this computation follows.
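  • A sketch of computing the span of a convolutional layer as the product of the downsampling factors (pooling layers and convolution sliding spans) that precede it; the example factors are assumptions.

```python
def layer_span(factors):
    """Span of a convolutional layer relative to the input image: the product
    of the sliding spans of the preceding convolution kernels and the
    downsampling factors of the preceding pooling layers."""
    span = 1
    for factor in factors:
        span *= factor
    return span

# One span-2 convolution followed by two 2x downsampling pooling layers:
print(layer_span([2, 2, 2]))  # span 8: one feature covers 8 input pixels per axis
```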
  • The region proposal network determines a plurality of proposal regions on the feature map according to the region proposal parameters.
  • The region proposal parameters may include the length and width of the proposal regions. The sizes of different proposal regions are generally different.
  • The region proposal network first obtains a plurality of proposal regions according to the region proposal parameters, and then calculates, through the convolution kernels corresponding to its L convolutional layers, the confidence of each of the plurality of proposal regions, that is, the probability that the region of the training image corresponding to each proposal region includes an object to be detected.
  • After the region proposal network obtains multiple proposal regions, for example proposal regions 1-4, the proposal regions can be aggregated into P proposal region sets according to the size of each proposal region (the number of features covered by the proposal region).
  • the region proposal network inputs the sub-feature map corresponding to the proposed region in a set of proposed regions to a classifier.
  • The size of a proposal region is related to the size of the object to be detected. Therefore, the proposal regions are aggregated into proposal region sets according to their sizes, and different classifiers detect the proposal regions in different proposal region sets and are excited based on their detection results, which makes different classifiers more sensitive to objects to be detected of different sizes.
  • When the region proposal network obtains proposal regions from different feature maps, the size of each proposal region and the span of the convolutional layer corresponding to the feature map in which the proposal region is located are considered together. The proposal regions obtained from different feature maps are then aggregated into P proposal region sets according to the size of each proposal region and the span of the convolutional layer corresponding to the feature map in which the proposal region is located. The region proposal network then inputs the sub-feature maps corresponding to the proposal regions in each proposal region set to one classifier.
  • Optionally, the region proposal network uses the product of the size of each proposal region and the span of the convolutional layer corresponding to the feature map in which the proposal region is located as the aggregation criterion. For example, after T proposal regions are obtained from different feature maps, the product of the size of each proposal region and the span of the corresponding convolutional layer is computed, and the T proposal regions are aggregated into P proposal region sets according to the T products; for example, each of the T products can be compared with P-1 preset thresholds to determine into which proposal region set the corresponding proposal region is divided (a sketch follows below).
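  • A sketch of this size-times-span aggregation with NumPy; the sizes, spans, and threshold below are example values.

```python
import numpy as np

def aggregate_by_size_times_span(sizes, spans, thresholds):
    """Aggregate T proposal regions from different feature maps into P sets
    by comparing size * span against P-1 preset thresholds."""
    products = np.asarray(sizes, dtype=float) * np.asarray(spans, dtype=float)
    return np.digitize(products, thresholds).tolist(), products.tolist()

# T = 4 proposal regions: sizes in features, spans of their feature maps.
set_ids, products = aggregate_by_size_times_span(
    sizes=[6, 10, 6, 12], spans=[4, 4, 16, 16], thresholds=[64.0])
print(products)  # [24.0, 40.0, 96.0, 192.0]: size on the input-image scale
print(set_ids)   # [0, 0, 1, 1]: small-object set vs. large-object set
```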
  • FIG. 13 and FIG. 14 respectively introduce the workflows of the detection model training device corresponding to FIG. 3 to FIG. 5 and to FIG. 6 to FIG. 8.
  • The workflow of the detection model training device corresponding to FIG. 3 to FIG. 5 is introduced first.
  • The number of size clusters of the objects to be detected is the number of sets into which the sizes of the objects to be detected can be clustered. For example, when the number of size clusters is 2, the sizes of the objects to be detected can be divided into two sets.
  • The number of size clusters of the objects to be detected may be obtained by using a clustering algorithm to cluster the sizes of the objects to be detected in the training image.
  • the clustering algorithm can use K-means and so on.
  • the number of size clusters of the object to be detected and the complexity of the object to be detected may also be manually input to the detection model training device.
  • The above system parameters refer to parameters of the training image, of the objects to be detected in the training image, of the backbone network, or of the training environment. Such system parameters can be obtained before the object detection model is established. System parameters are also called hyperparameters. Different system parameters may lead to different replication parameters.
  • the model parameters refer to the parameters corresponding to each convolution point in the convolution kernel. The model parameters are continuously excited and changed during the training process of the object detection model.
  • The above system parameters can be acquired at different times and need not be acquired in the same step. It is also not necessary to acquire all of the above system parameters; which system parameters are acquired is determined by the system parameters used in the subsequent step of determining the replication parameter.
  • the acquisition time of each system parameter can be before the subsequent steps that use the system parameter.
  • S202 can be executed at any time after S201 and before S208.
  • S203: Acquire a training image, establish a backbone network according to the training image, and obtain the feature map output by the backbone network.
  • The feature map output by the backbone network in S204 is the feature map of the K-th convolutional layer of the backbone network, or the features extracted by the convolution kernels of the K-th convolutional layer.
  • In S205, the region proposal network selects proposed regions from the feature map and inputs the sub-feature maps corresponding to the proposed regions into the classifier.
  • In S206, the classifier detects the objects to be detected in the training image according to the sub-feature maps input in S205.
  • Parameters are set in the classifier, and the classifier detects the objects to be detected in the training image according to these parameters and the input features; a sketch of such a classifier head follows below.
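Purely as an illustration of "parameters set in the classifier", the sketch below shows a small fully connected classifier head that maps a pooled sub-feature map to an object/background score. The 128-channel, 7x7 input size and the layer widths are assumptions, not values given in the patent.

```python
import torch.nn as nn

# Maps a pooled sub-feature map to a score for "object to be detected" vs. background.
classifier_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(128 * 7 * 7, 256),  # assumes 128-channel, 7x7 pooled sub-feature maps
    nn.ReLU(),
    nn.Linear(256, 2),            # the trainable "parameters set in the classifier"
)
```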
  • The detection model training device then obtains the next training image, and trains the object detection model according to the next training image and the prior result of the next training image.
  • The excitation process for the next training image is similar to that for the training image obtained in S203. The main differences are: 1. the feature maps of the next training image are extracted by the backbone network using the model parameters of the convolution kernels of its convolutional layers as already excited in S207 (if they were excited in S207); 2. the model parameters of the convolution kernels of the region proposal network and the region proposal parameters of the region proposal network are those excited in S207 (if they were excited in S207); 3. the parameters of the classifier applied to the next training image are those excited in S207 (if they were excited in S207).
  • Each training image thus further excites the object detection model on top of the excitation produced by the previous training image; an illustrative loop is sketched below. After all the training images have been used in turn for the training of the object detection model, the first stage of the training state of the object detection model ends.
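The loop below is an illustrative rendering of this sequential first-stage training; all names (model, optimizer, loss_fn) are placeholders, and the patent does not prescribe any particular framework or update rule.

```python
def train_stage_one(model, optimizer, dataset, loss_fn):
    """Each image excites the parameters left behind by the previous image."""
    for image, prior_result in dataset:          # prior_result: labelled boxes
        detection = model(image)                 # backbone -> RPN -> classifier
        loss = loss_fn(detection, prior_result)  # compare with the prior result
        optimizer.zero_grad()
        loss.backward()                          # excitation of the model parameters,
        optimizer.step()                         # carried over to the next image
```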
  • In S211, the region proposal network selects multiple proposed regions from the feature map, divides the selected proposed regions into P proposal region sets, and inputs the sub-feature maps corresponding to the proposed regions of each proposal region set into the corresponding classifier.
  • In S212, the classifier detects the objects to be detected in the training image according to the sub-feature maps input in S211.
  • In S212 and S213, each classifier detects the objects to be detected in the training image according to the sub-feature maps it has obtained, and is excited according to the comparison between its detection result and the prior result.
  • Each classifier copied in S208 executes S212 and S213, as sketched below.
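Below is a hedged sketch of one such second-stage step. Each duplicated classifier processes only its own proposal region set and is compared against the prior results of the matching size group, so the gradient reaching each classifier comes only from its own comparison. All function and variable names are placeholders.

```python
def train_stage_two_step(backbone, rpn, classifiers, optimizer, image,
                         priors_by_size, loss_fn):
    feature_map = backbone(image)
    proposal_sets = rpn(feature_map)            # P sets, grouped by size
    total_loss = 0.0
    for clf, proposals, priors in zip(classifiers, proposal_sets, priors_by_size):
        detections = clf(proposals)             # one classifier per size class
        total_loss = total_loss + loss_fn(detections, priors)
    optimizer.zero_grad()
    total_loss.backward()   # each classifier is excited only by its own term;
    optimizer.step()        # the shared backbone/RPN receive all P terms
```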
  • The excitation process for the next training image is similar to that for the training image obtained in S209. The main differences are: 1. the feature maps of the next training image are extracted by the backbone network using the model parameters of the convolution kernels of its convolutional layers as excited in S213 (if they were excited in S213); 2. the model parameters of the convolution kernels of the region proposal network and the region proposal parameters of the region proposal network are those excited in S213 (if they were excited in S213); 3. the parameters of the classifiers applied to the next training image are those excited in S213 (if they were excited in S213).
  • Each training image further excites the object detection model on top of the excitation produced by the previous training image.
  • After all the training images have been used in turn, the training process of the object detection model ends.
  • The object detection model can then be used in the inference state.
  • Referring to FIG. 14, another work flow of the detection model training device is introduced. Compared with the work flow shown in FIG. 13, the main difference is that S203 and S209 in the work flow shown in FIG. 13 are replaced by S203' and S209', respectively.
  • In S203' and S209', the backbone network extracts at least two feature maps and inputs them into the region proposal network for the region proposal network to select proposed regions, as sketched below.
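A minimal sketch of a backbone exposing two feature maps, assuming a PyTorch-style module; the layer depths, channel counts, and strides are illustrative assumptions rather than the patent's configuration.

```python
import torch
import torch.nn as nn

class TwoLevelBackbone(nn.Module):
    """Exposes feature maps from two convolutional layers for the RPN."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x: torch.Tensor):
        f1 = self.block1(x)   # finer map, smaller span: suits small objects
        f2 = self.block2(f1)  # coarser map, larger span: suits large objects
        return [f1, f2]       # both maps go to the region proposal network
```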
  • After all the training images have been used, the training process of the object detection model ends. As shown in FIG. 8, this object detection model can then be used in the inference state.
  • This application also provides a detection model training device 400.
  • The detection model training device 400 includes an object detection model 401, an excitation module 405, a storage module 406, and an initialization module 407.
  • The object detection model 401 further includes a backbone network 403, a classifier 404, and a region proposal network 402.
  • The classifier 404 includes one classifier in the first stage of the training state, and includes P classifiers in the second stage of the training state and in the inference state.
  • Each of the above modules may be a software module.
  • In the first stage of the training state, the initialization module 407 is configured to execute S201 and S202 to determine the replication parameter P.
  • The object detection model 401 obtains a training image from the storage module 406, and executes S203 (or S203') and S204 to establish the backbone network 403.
  • The region proposal network 402 executes S205.
  • The classifier 404 is configured to execute S206.
  • The excitation module 405 is configured to execute S207.
  • In the second stage of the training state, the initialization module 407 is used to execute S208, and the object detection model 401 obtains a training image from the storage module 406 and executes S209 (or S209') and S210 to establish the backbone network 403.
  • The region proposal network 402 executes S211.
  • The classifier 404 is used to execute S212.
  • The excitation module 405 is configured to execute S213.
  • The detection model training device 400 may be provided to users as an object detection model training service.
  • For example, the detection model training device 400 (or a part thereof) shown in FIG. 1 is deployed in a cloud environment.
  • The user selects the backbone network type and some of the system parameters, places the training images and the prior results of the training images into the storage module 406, and then starts the training.
  • The detection model training device 400 trains the object detection model 401.
  • The trained object detection model 401 is provided to the user, and the user may run the object detection model 401 in his or her terminal environment, or sell the object detection model 401 directly to a third party for use.
  • This application also provides a computing device 500.
  • The computing device 500 includes a bus 501, a processor 502, a communication interface 503, and a memory 504.
  • The processor 502, the memory 504, and the communication interface 503 communicate through the bus 501.
  • The processor may be a central processing unit (CPU).
  • The memory may include a volatile memory, such as a random access memory (RAM).
  • The memory may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, an HDD, or an SSD.
  • The memory stores executable code, and the processor executes the executable code to perform the foregoing method for training an object detection model.
  • The memory may also include other software modules required for running processes, such as an operating system.
  • The operating system may be LINUX™, UNIX™, WINDOWS™, or the like.
  • The memory of the computing device 500 stores the code corresponding to each module of the detection model training device 400, and the processor 502 executes this code to implement the functions of the modules of the detection model training device 400, that is, to perform the method shown in FIG. 13 or FIG. 14.
  • The computing device 500 may be a computing device in a cloud environment, a computing device in an edge environment, or a computing device in a terminal environment.
  • The parts of the detection model training device 400 may also run on multiple computing devices in different environments. Therefore, this application also proposes a computing device system.
  • The computing device system includes a plurality of computing devices 600.
  • The structure of each computing device 600 is the same as that of the computing device 500 in FIG. 16.
  • Communication paths are established between the computing devices 600 through a communication network.
  • Each computing device 600 runs any one or more of the region proposal network 402, the backbone network 403, the classifier 404, the excitation module 405, the storage module 406, and the initialization module 407.
  • Any computing device 600 may be a computing device in a cloud environment, a computing device in an edge environment, or a computing device in a terminal environment.
  • This application also proposes another computing device system.
  • In this computing device system, the storage module 406 is deployed in a cloud storage service (for example, an object storage service).
  • The user applies for storage space of a certain capacity in the cloud storage service to serve as the storage module 406, and stores the training images and the prior results of the training images in the storage module 406.
  • When a computing device 600 is running, it acquires the required training images and the prior results of the training images from the remote storage module 406 through the communication network.
  • Each computing device 600 runs any one or more of the region proposal network 402, the backbone network 403, the classifier 404, the excitation module 405, and the initialization module 407.
  • Any computing device 600 may be a computing device in a cloud environment, a computing device in an edge environment, or a computing device in a terminal environment.
  • This application also provides a computer program product; the computer program product includes one or more computer instructions.
  • The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
  • The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wire (for example, coaxial cable, optical fiber, or digital subscriber line) or wirelessly (for example, infrared, radio, or microwave).
  • The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or a data center that integrates one or more available media.
  • The available media may be magnetic media (for example, floppy disks, hard disks, or magnetic tapes), optical media (for example, DVDs), or semiconductor media (for example, SSDs).


Abstract

Disclosed is a method for training an object detection model, executable by a computing device. The method comprises: making at least two copies of a classifier that has undergone stage-1 training; in stage-2 training, using each copy of the classifier to detect objects of a different size; and training the object detection model according to the detection results. The method employs a two-stage training mode, such that the resulting object detection model exhibits higher accuracy in detecting the objects to be detected.

Description

Method, Device and Apparatus for Training an Object Detection Model

Technical Field

The present application relates to the field of computer technology, and in particular, to a method for training an object detection model, and to a device and a computing device for performing the method.

Background

Object detection is an artificial intelligence technology that accurately locates and classifies objects in images or videos. It covers many sub-fields, such as general object detection, face detection, pedestrian detection, and text detection. In recent years, academia and industry have invested actively and the algorithms have continued to mature. Deep-learning-based object detection solutions are now used in practical products in municipal security (pedestrian detection, vehicle detection, license plate detection, etc.), finance (object detection, face-scan login, etc.), the Internet (identity verification), and smart terminals.

At present, object detection is already widely applied in a variety of scenes of simple or medium complexity (for example, detecting faces in access control and checkpoint scenarios). In an open environment, how to keep the trained object detection model robust against unfavorable factors such as large variations in the size of the objects to be detected, occlusion, and distortion, and how to improve the detection accuracy, remain open problems.

Summary of the Invention
This application provides a method for training an object detection model, which improves the detection accuracy of the trained object detection model.

According to a first aspect, a method for training an object detection model performed by a computing device is provided. The computing device performing the method may be one or more computing devices distributed in the same environment or in different environments. The method includes:

acquiring a training image, and establishing a backbone network according to the training image;

inputting the feature map output by the backbone network into a region proposal network;

selecting, by the region proposal network, a plurality of proposed regions from the feature map output by the backbone network according to region proposal parameters, and inputting the sub-feature maps corresponding to the plurality of proposed regions into a classifier;

detecting, by the classifier, the objects to be detected in the training image according to the sub-feature maps corresponding to the plurality of proposed regions;

comparing the objects to be detected in the training image detected by the classifier with the prior result of the training image, and exciting, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of the classifier;

duplicating the classifier to obtain at least two classifiers;

dividing, by the region proposal network, the plurality of proposed regions into at least two proposal region sets, each proposal region set including at least one proposed region;

inputting, by the region proposal network, the sub-feature maps corresponding to the proposed regions included in each proposal region set into one of the at least two classifiers; and

performing, by each of the at least two classifiers, the following actions: detecting the objects to be detected in the training image according to the sub-feature maps corresponding to the proposed regions included in the acquired proposal region set; comparing the detected objects to be detected in the training image with the prior result of the training image; and exciting, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of this classifier.

Each of the at least two classifiers excites its own parameters according to its comparison result, and generally does not excite the parameters of the other classifiers among the at least two classifiers according to that comparison result.

In the method provided above, the training images are input into the object detection model twice to train the object detection model. In the first training stage, the sizes of the objects to be detected are not distinguished, so that the trained classifier has a global view. In the second training stage, each duplicated classifier is responsible for detecting the objects to be detected within one proposal region set, that is, for detecting objects to be detected of one size class, so that each trained classifier becomes more sensitive to objects to be detected of a particular size range. The two-stage training improves the detection accuracy of the trained object detection model for objects to be detected of different sizes.
In a possible implementation, the method further includes: acquiring system parameters, where the system parameters include at least one of the following: the number of size clusters of the objects to be detected in the training images, and the training computing capability; and determining, according to the system parameters, the number of classifiers among the at least two classifiers obtained after the duplication.

The number of duplicated classifiers may be configured manually, or may be calculated according to the objects to be detected in the training images. An appropriate choice of the number of duplicated classifiers further improves the detection accuracy of the trained object detection model for objects to be detected of different sizes.

In a possible implementation, in the case where the system parameters include the number of size clusters of the objects to be detected in the training images, acquiring the system parameters includes: clustering the sizes of the objects to be detected in the training images to obtain the number of size clusters of the objects to be detected in the training images.

In a possible implementation, the feature map output by the backbone network includes at least two feature maps.

The spans of different convolutional layers of the backbone network may differ, so the sizes of the objects to be detected within the proposed regions of the feature maps output by different convolutional layers may also differ. Extracting at least two feature maps from the backbone network enriches the sources of the proposed regions, and further improves the detection accuracy of the trained object detection model for objects to be detected of different sizes.
According to a second aspect, this application provides a detection model training device, including an initialization module, an object detection model, and an excitation module.

The object detection model is configured to: acquire a training image, and establish a backbone network according to the training image; select a plurality of proposed regions from the feature map output by the backbone network according to region proposal parameters, and input the sub-feature maps corresponding to the plurality of proposed regions into a classifier; and detect the objects to be detected in the training image according to the sub-feature maps corresponding to the plurality of proposed regions.

The excitation module is configured to compare the detected objects to be detected in the training image with the prior result of the training image, and excite, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of the classifier.

The initialization module is configured to duplicate the classifier to obtain at least two classifiers.

The object detection model is further configured to divide the plurality of proposed regions into at least two proposal region sets, each proposal region set including at least one proposed region, and to input the sub-feature maps corresponding to the proposed regions included in each proposal region set into one of the at least two classifiers. Each of the at least two classifiers performs the following actions: detecting the objects to be detected in the training image according to the sub-feature maps corresponding to the proposed regions included in the acquired proposal region set; comparing the detected objects to be detected in the training image with the prior result of the training image; and exciting, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of this classifier.

In a possible implementation, the initialization module is further configured to: acquire system parameters, where the system parameters include at least one of the following: the number of size clusters of the objects to be detected in the training images, and the training computing capability; and determine, according to the system parameters, the number of classifiers among the at least two classifiers obtained after the duplication.

In a possible implementation, the initialization module is further configured to cluster the sizes of the objects to be detected in the training images to obtain the number of size clusters of the objects to be detected in the training images.

In a possible implementation, the feature map output by the backbone network includes at least two feature maps.
According to a third aspect, this application provides a computing device system. The computing device system includes at least one computing device, and each computing device includes a processor and a memory. The processor of the at least one computing device is configured to access code in the memory to perform the method provided by the first aspect or any possible implementation of the first aspect.

According to a fourth aspect, this application provides a non-transitory readable storage medium. When the non-transitory readable storage medium is executed by at least one computing device, the at least one computing device performs the method provided by the foregoing first aspect or any possible implementation of the first aspect. A program is stored in the storage medium. The types of the storage medium include, but are not limited to, volatile memory (for example, random access memory) and non-volatile memory (for example, flash memory, hard disk drive (HDD), and solid state drive (SSD)).

According to a fifth aspect, this application provides a computing device program product. When the computing device program product is executed by at least one computing device, the at least one computing device performs the method provided by the foregoing first aspect or any possible implementation of the first aspect. The computer program product may be a software installation package: when the method provided by the foregoing first aspect or any possible implementation of the first aspect needs to be used, the computer program product may be downloaded and executed on a computing device.
According to a sixth aspect, this application provides another method for training an object detection model performed by a computing device. The method includes two training stages.

In the first training stage, a feature map of a training image is extracted through a backbone network; a proposed region is selected from the extracted feature map through a region proposal network, and the sub-feature map corresponding to the proposed region is input into a classifier; the classifier detects the objects to be detected in the training image according to the sub-feature map corresponding to the proposed region; the detection result is compared with the prior result of the training image; and at least one of the backbone network, the region proposal network, and the classifier is excited according to the comparison result.

In the second training stage, at least two duplicate classifiers are established according to the classifier that has undergone the first-stage training; the region proposal network divides the proposed regions into at least two proposal region sets, each proposal region set including at least one proposed region; the sub-feature maps corresponding to the proposed regions included in each proposal region set are input into one duplicate classifier; each duplicate classifier detects the objects to be detected in the training image according to the sub-feature maps it has acquired; the detection result is compared with the prior result of the training image; and at least one of the backbone network, the region proposal network, and the classifiers is excited again according to the comparison result.

In the second training stage, the classifier that has undergone the first-stage training may be duplicated to establish the at least two duplicate classifiers. Alternatively, the classifier that has undergone the first-stage training may first be adjusted and then duplicated to establish the at least two duplicate classifiers.
In a possible implementation, the method further includes: acquiring system parameters, where the system parameters include at least one of the following: the number of size clusters of the objects to be detected in the training images, and the training computing capability; and determining, according to the system parameters, the number of the established duplicate classifiers.

In a possible implementation, in the case where the system parameters include the number of size clusters of the objects to be detected in the training images, acquiring the system parameters includes: clustering the sizes of the objects to be detected in the training images to obtain the number of size clusters of the objects to be detected in the training images.

In a possible implementation, the feature maps extracted by the backbone network include at least two feature maps.

According to a seventh aspect, this application provides a computing device system. The computing device system includes at least one computing device, and each computing device includes a processor and a memory. The processor of the at least one computing device is configured to access code in the memory to perform the method provided by the sixth aspect or any possible implementation of the sixth aspect.

According to an eighth aspect, this application provides a non-transitory readable storage medium. When the non-transitory readable storage medium is executed by at least one computing device, the at least one computing device performs the method provided by the foregoing sixth aspect or any possible implementation of the sixth aspect. A program is stored in the storage medium. The types of the storage medium include, but are not limited to, volatile memory (for example, random access memory) and non-volatile memory (for example, flash memory, HDD, and SSD).

According to a ninth aspect, this application provides a computing device program product. When the computing device program product is executed by at least one computing device, the at least one computing device performs the method provided by the foregoing sixth aspect or any possible implementation of the sixth aspect. The computer program product may be a software installation package: when the method provided by the foregoing sixth aspect or any possible implementation of the sixth aspect needs to be used, the computer program product may be downloaded and executed on a computing device.
Brief Description of the Drawings

To describe the technical method of the embodiments of this application more clearly, the accompanying drawings used in the embodiments are briefly introduced below.

FIG. 1 is a schematic diagram of a system architecture provided by this application;

FIG. 2 is a schematic diagram of another system architecture provided by this application;

FIG. 3 is a working flowchart of the detection model training device provided by this application in the training state;

FIG. 4 is another working flowchart of the detection model training device provided by this application in the training state;

FIG. 5 is a working flowchart of the object detection model provided by this application in the inference state;

FIG. 6 is a working flowchart of the detection model training device provided by this application in the training state;

FIG. 7 is another working flowchart of the detection model training device provided by this application in the training state;

FIG. 8 is a working flowchart of the object detection model provided by this application in the inference state;

FIG. 9 is a schematic structural diagram of a convolutional layer and a convolution kernel provided by this application;

FIG. 10 is a schematic diagram of the receptive field of a convolutional layer provided by this application;

FIG. 11 is a schematic diagram of the receptive field of another convolutional layer provided by this application;

FIG. 12 is a working flowchart of the region proposal network provided by this application;

FIG. 13 is a schematic flowchart of a method provided by this application;

FIG. 14 is a schematic flowchart of another method provided by this application;

FIG. 15 is a schematic structural diagram of the detection model training device provided by this application;

FIG. 16 is a schematic structural diagram of a computing device provided by this application;

FIG. 17 is a schematic structural diagram of a computing device system provided by this application;

FIG. 18 is a schematic structural diagram of another computing device system provided by this application.
Detailed Description

The technical methods in the embodiments of this application are described below with reference to the accompanying drawings in the embodiments of this application.

In this application, there is no logical or temporal dependency among "first", "second", and "n-th".

As shown in FIG. 1, the method for training an object detection model provided by this application is performed by a detection model training device. The device may run in a cloud environment, specifically on one or more computing devices in the cloud environment. The device may also run in an edge environment, specifically on one or more computing devices (edge computing devices) in the edge environment. The device may also run in a terminal environment, specifically on one or more terminal devices in the terminal environment. A terminal device may be a mobile phone, a notebook, a server, a desktop computer, or the like. An edge computing device may be a server.

As shown in FIG. 2, the detection model training device may consist of multiple parts (modules), and the parts of the detection model training device may therefore be deployed separately in different environments. For example, some modules of the detection model training device may be deployed in each of the cloud, edge, and terminal environments, or in any two of them.

FIG. 3 to FIG. 5 and FIG. 6 to FIG. 8 respectively illustrate two work flows of the detection model training device. In each work flow, the training state of the detection model training device is divided into two stages.
In FIG. 3, the detection model training device works in the first stage of the training state. The purpose of the training state is to use the training images and the prior results of the training images to train an object detection model with high accuracy. The prior result of a training image includes the labels of the objects to be detected in the training image. Taking the training image in FIG. 3 as an example, the training image includes multiple faces, and in the prior result of the training image each face is marked with a white box (upper left corner of FIG. 3). The prior results of the training images can generally be provided manually.
A K-layer backbone network is established according to the training image, where the backbone network includes K convolutional layers and K is a positive integer greater than 0. The backbone network extracts a feature map from the training image. The feature map extracted by the backbone network is input into a region proposal network; the region proposal network selects proposed regions from the feature map, and the sub-feature maps corresponding to the proposed regions are input into the classifier. In the process of selecting the proposed regions from the feature map, the region proposal network may directly compare the prior result of the training image with the feature map, and take the regions of the feature map with a high coverage of the objects to be detected in the training image as the proposed regions (a sketch of such an overlap test follows below). Alternatively, the region proposal network may first identify foreground regions and background regions in the feature map, and then extract the proposed regions from the foreground regions. A foreground region is a region that contains a large amount of information and has a high probability of containing an object to be detected; a background region is a region that contains little information, contains much repeated information, and has a low probability of containing an object to be detected.
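As a hedged illustration of the overlap comparison just mentioned, the function below computes the intersection-over-union (IoU) between a candidate region and a labelled box from the prior result. The (x1, y1, x2, y2) box format is an assumption; the patent does not specify one.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)  # intersection corners
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

# Candidate regions whose IoU with any labelled box exceeds a chosen
# threshold (an assumption, e.g. 0.5) can be kept as proposed regions.
```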
Each sub-feature map includes the features of the feature map that are located within a proposed region. The classifier determines, according to a sub-feature map, whether the region of the training image corresponding to the proposed region of that sub-feature map contains an object to be detected. As shown on the right side of FIG. 3, the classifier marks the detected face regions with white boxes on the training image. By comparing the detection result of the training image with the prior result of the training image, the difference between the objects to be detected detected by the detection model training device and the prior result can be obtained. As shown in FIG. 3, some faces in the prior result were not detected by the detection model training device. The parameters of the object detection model are excited according to this difference, including at least one of the following: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the classifier. The difference between the detection result of each training image and the prior result of that training image excites the parameters of the object detection model, so after the excitation by a large number of training images, the accuracy of the object detection model improves.

The detection model training device trains the object detection model with a large number of training images and their prior results; the object detection model includes the backbone network, the region proposal network, and the classifier. The object detection model that has undergone the first stage of the training mode enters the second stage of the training mode, as shown in FIG. 4.

In the second stage of the training mode, the classifier that has undergone the first stage in FIG. 3 is first duplicated into P copies. A training image is input into the backbone network, and the feature map extracted by the backbone network is input into the region proposal network. The region proposal network selects proposed regions from the feature map and aggregates the selected proposed regions into P proposal region sets according to the sizes of the proposed regions; the proposed regions within each proposal region set are of similar size. The sub-feature maps corresponding to the P proposal region sets are input into the P classifiers respectively: one proposal region set corresponds to one classifier, and the sub-feature maps corresponding to the proposed regions of that proposal region set are input into that classifier. Each classifier detects objects to be detected of a different size in the training image according to the sub-feature maps it receives, and obtains a corresponding detection result. The detection result of each classifier is compared with the prior result for the objects to be detected of the sizes corresponding to the sub-feature maps received by that classifier. The difference between the detection result for the objects to be detected of each size and the prior result for the objects of that size excites the parameters of the object detection model. In particular, each classifier will be trained to be more sensitive to detected objects of a particular size, so after the second round of excitation by a large number of training images, the accuracy of the object detection model is further improved. In FIG. 4, the prior results of the objects to be detected in the training image are divided into P classes according to size, to be compared with the results of the P classifiers respectively.

As shown in FIG. 4, P equals 2; that is, the region proposal network divides the selected proposed regions into two proposal region sets, where the proposed regions in one proposal region set (corresponding to the upper classifier) are smaller and the proposed regions in the other proposal region set (corresponding to the lower classifier) are larger. Therefore, the sub-feature maps corresponding to the proposed regions in the former proposal region set are used to detect the smaller objects to be detected in the training image, and the sub-feature maps corresponding to the proposed regions in the latter proposal region set are used to detect the larger objects to be detected in the training image. The two proposal region sets are input into different classifiers: the upper classifier is used to detect the smaller objects to be detected, the lower classifier is used to detect the larger objects to be detected, and the detection results output by the two classifiers are compared with the corresponding prior results respectively. For example, detection result 1 includes the objects to be detected detected by the upper classifier according to the sub-feature maps corresponding to the smaller proposed regions, and prior result 1 of the training image includes the prior result (sizes, coordinates, etc.) of the smaller objects to be detected in the training image. Detection result 1 is compared with prior result 1, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the upper classifier. Similarly, detection result 2 includes the objects to be detected detected by the lower classifier according to the sub-feature maps corresponding to the larger proposed regions, and prior result 2 of the training image includes the prior result (sizes, coordinates, etc.) of the larger objects to be detected in the training image. Detection result 2 is compared with prior result 2, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the lower classifier.
It should be noted that the training images used in the first stage and the second stage may be the same, may be different, or may partially overlap. Preset thresholds may be used to distinguish the different proposal region sets: when P proposal region sets need to be distinguished, P-1 thresholds are preset, each threshold corresponding to a proposed-region size, and these P-1 thresholds are used to aggregate the proposed regions selected by the region proposal network into P proposal region sets. Correspondingly, according to the sizes of the objects to be detected in the training image, the objects to be detected in the training image are divided into P prior results, and each prior result is compared with the detection result of the corresponding size to excite the object detection model; a sketch of this split follows below.
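The snippet below illustrates this split of the prior results into P size groups with the same P-1 thresholds; the sqrt-of-area size measure and the (x1, y1, x2, y2) box format are assumptions.

```python
def split_priors_by_size(boxes, thresholds):
    """boxes: labelled (x1, y1, x2, y2) boxes; thresholds: P-1 ascending values."""
    groups = [[] for _ in range(len(thresholds) + 1)]
    for box in boxes:
        size = ((box[2] - box[0]) * (box[3] - box[1])) ** 0.5  # sqrt of area
        groups[sum(size > t for t in thresholds)].append(box)
    return groups  # group i is compared with classifier i's detection result
```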
The object detection model trained in the second stage can be deployed in a cloud environment, an edge environment, or a terminal environment. Alternatively, parts of the object detection model can be deployed in all three of the cloud, edge, and terminal environments, or in any two of them.
As shown in FIG. 5, in the inference state, the image to be detected is input into the backbone network of the object detection model, and after processing by the region proposal network and the P classifiers, the object detection model outputs the detection result of the image to be detected. Typically, the detection result includes information such as the positions and the number of the detected objects, for example how many faces there are and where each face appears. In the inference state, the region proposal network behaves as in the second stage of the training state: it classifies the extracted proposed regions by size, and the sub-feature map corresponding to each proposed region is sent to the classifier corresponding to that proposed region. Each classifier detects objects to be detected of a different size according to the sub-feature maps of the proposed regions of that size, and the detection results of the P classifiers are combined to obtain the detection result of the image to be detected, as sketched below.
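A minimal sketch of this inference path, with all names as placeholders: the P classifiers each handle one size group of proposals, and their detections are merged into the final result.

```python
def detect(image, backbone, rpn, classifiers):
    feature_map = backbone(image)
    proposal_sets = rpn(feature_map)           # P sets of proposals, grouped by size
    detections = []
    for clf, proposals in zip(classifiers, proposal_sets):
        detections.extend(clf(proposals))      # each classifier detects one size class
    return detections                          # combined result for the whole image
```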
FIG. 6 to FIG. 8 illustrate another work flow of the detection model training device. Compared with the detection model training device described in FIG. 3 to FIG. 5, the detection model training device described in FIG. 6 to FIG. 8 takes the feature maps extracted from at least two convolutional layers of the backbone network as the input of the region proposal network, in both the training state and the inference state.

In FIG. 6, the detection model training device works in the first stage of the training state. A K-layer backbone network is established according to the training image; the backbone network includes K convolutional layers, and K is a positive integer greater than 0. The backbone network extracts p feature maps from the training image. The p feature maps may be extracted from any p convolutional layers of the backbone network, or may be any p convolutional layers of the backbone network themselves. The p feature maps extracted by the backbone network are input into the region proposal network, the region proposal network selects proposed regions from the p feature maps, and the sub-feature maps corresponding to the proposed regions are input into the classifier. Each sub-feature map includes the features of a feature map that are located within a proposed region. The classifier determines, according to a sub-feature map, whether the region of the training image corresponding to the proposed region of that sub-feature map contains an object to be detected.

As shown on the right side of FIG. 6, the classifier marks the detected face regions with white boxes on the training image. By comparing the detection result of the training image with the prior result of the training image, the difference between the objects to be detected detected by the detection model training device and the prior result can be obtained. As shown in FIG. 6, some faces in the prior result were not detected by the detection model training device. The parameters of the object detection model are excited according to this difference, including at least one of the following: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the classifier. The difference between the detection result of each training image and the prior result of that training image excites the parameters of the object detection model, so after the excitation by a large number of training images, the accuracy of the object detection model improves.

The detection model training device trains the object detection model with a large number of training images and their prior results; the object detection model includes the backbone network, the region proposal network, and the classifier. The object detection model that has undergone the first stage of the training mode enters the second stage of the training mode, as shown in FIG. 7.

In the second stage of the training mode, the classifier that has undergone the first stage in FIG. 6 is first duplicated into P copies. A training image is input into the backbone network, and at least one feature map extracted by the backbone network is input into the region proposal network. The region proposal network selects proposed regions from the feature maps and aggregates the selected proposed regions into P proposal region sets. The proposal region set to which a proposed region belongs is determined according to the size of the proposed region and the span of the convolutional layer corresponding to the feature map in which the proposed region is located. The sub-feature maps corresponding to the proposed regions in the P proposal region sets are input into the P classifiers respectively: one proposal region set corresponds to one classifier, and the sub-feature maps corresponding to the proposed regions of that proposal region set are input into that classifier. Each classifier detects objects to be detected of a different size according to the sub-feature maps it receives, and obtains a corresponding detection result. The detection result of each classifier is compared with the prior result for the objects to be detected of the sizes corresponding to the sub-feature maps received by that classifier. The difference between the detection result for the objects to be detected of each size and the prior result for the objects of that size excites the parameters of the object detection model. In particular, each classifier will be trained to be more sensitive to detected objects of a particular size, so after the second round of excitation by a large number of training images, the accuracy of the object detection model is further improved. In FIG. 7, the prior results of the objects to be detected in the training image are divided into P classes according to size, to be compared with the results of the P classifiers respectively.
As shown in FIG. 7, P equals 2; that is, the region proposal network divides the selected proposed regions into two proposal region sets. For the proposed regions in one proposal region set (corresponding to the upper classifier), the product of the size of the proposed region and the span of the convolutional layer corresponding to the feature map in which the proposed region is located is smaller; for the proposed regions in the other proposal region set (corresponding to the lower classifier), that product is larger. Therefore, the sub-feature maps corresponding to the proposed regions in the former proposal region set are used to detect the smaller objects to be detected in the training image, and the sub-feature maps corresponding to the proposed regions in the latter proposal region set are used to detect the larger objects to be detected in the training image. The two proposal region sets are input into different classifiers: the upper classifier is used to detect the smaller objects to be detected, the lower classifier is used to detect the larger objects to be detected, and the detection results output by the two classifiers are compared with the corresponding prior results respectively. For example, detection result 1 includes the objects to be detected detected by the upper classifier according to the sub-feature maps corresponding to the smaller proposed regions, and prior result 1 of the training image includes the prior result (sizes, coordinates, etc.) of the smaller objects to be detected in the training image. Detection result 1 is compared with prior result 1, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the upper classifier. Similarly, detection result 2 includes the objects to be detected detected by the lower classifier according to the sub-feature maps corresponding to the larger proposed regions, and prior result 2 of the training image includes the prior result (sizes, coordinates, etc.) of the larger objects to be detected in the training image. Detection result 2 is compared with prior result 2, and the parameters of the object detection model are excited according to the comparison difference, including at least one of the following: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the lower classifier.
It should be noted that the training images used in the first stage and the second stage may be the same, may be different, or may partially overlap. Preset thresholds may be used to distinguish the different proposal region sets: for example, when P proposal region sets need to be distinguished, P-1 thresholds are preset, each threshold corresponding to a proposal region size, and these P-1 thresholds are used to group the proposal regions selected by the region proposal network into P proposal region sets. Correspondingly, according to the sizes of the to-be-detected objects in the training image, the to-be-detected objects are divided into P prior results, and each prior result is compared with the detection result of the corresponding size to excite the object detection model.
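For illustration, the threshold-based grouping can be sketched in Python as follows (a minimal sketch; the (width, height) proposal representation and the square-root size measure are assumptions, not the patent's prescribed form):

```python
from bisect import bisect_right

def group_proposals_by_size(proposals, thresholds):
    """Group proposals into P = len(thresholds) + 1 proposal region sets.

    `proposals` is a list of (width, height) pairs measured in features on
    the feature map; `thresholds` is the sorted list of P-1 preset size
    cutoffs.
    """
    sets = [[] for _ in range(len(thresholds) + 1)]
    for w, h in proposals:
        size = (w * h) ** 0.5            # scalar size of the proposal region
        sets[bisect_right(thresholds, size)].append((w, h))
    return sets

# P = 2: one threshold splits the proposals into a "small" and a "large" set.
small, large = group_proposals_by_size([(8, 6), (40, 32)], thresholds=[16.0])
```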
The object detection model trained in the second stage can be deployed in a cloud environment, an edge environment, or a terminal environment. Alternatively, parts of the object detection model can be deployed across all three of, or any two of, the cloud environment, the edge environment, and the terminal environment.
As shown in FIG. 8, in the inference state, the to-be-detected image is input into the backbone network of the object detection model, and after processing by the region proposal network and the P classifiers, the object detection model outputs the detection result of the to-be-detected image. Typically, the detection result includes information such as the positions and number of the detected to-be-detected objects, for example how many human faces there are and where each face appears. In the inference state, the region proposal network works as in the second stage of the training state: the selected proposal regions are classified by size, and the sub-feature maps corresponding to the proposal regions are sent to the corresponding classifiers. Each classifier detects to-be-detected objects of a different size range from the sub-feature maps of the proposal regions of the corresponding size, and the detection result of the to-be-detected image is obtained by combining the detection results of the P classifiers.
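The inference-state flow can be sketched as follows (all components are caller-supplied callables whose interfaces are assumptions made for illustration):

```python
from bisect import bisect_right

def run_inference(image, backbone, rpn, classifiers, thresholds):
    """Sketch of the inference state of FIG. 8.

    `rpn` is assumed to return (sub_feature_map, size) pairs, where `size`
    already reflects the stride of the source feature map; `classifiers`
    holds the P trained classifiers.
    """
    feature_maps = backbone(image)
    detections = []
    for sub_map, size in rpn(feature_maps):
        k = bisect_right(thresholds, size)   # route to the matching classifier
        detections.extend(classifiers[k](sub_map))
    return detections                        # combined result of the P classifiers
```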
The concepts used in this application are described below.
Backbone network
The backbone network includes a convolutional network, and the convolutional network includes K convolutional layers. The K convolutional layers of the backbone network typically form multiple convolutional blocks, each convolutional block including multiple convolutional layers; the number of convolutional blocks in a backbone network is commonly five. In addition to the convolutional network, the backbone network may also include pooling modules. Optionally, the backbone network may adopt a template commonly used in the industry, such as VGG, ResNet, DenseNet, Xception, Inception, or MobileNet.
The features extracted from the training image form the 1st convolutional layer of the backbone network. The features extracted from the 1st convolutional layer of the backbone network by the convolution kernels corresponding to the 1st convolutional layer form the 2nd convolutional layer of the backbone network. The features extracted from the 2nd convolutional layer of the backbone network by the convolution kernels corresponding to the 2nd convolutional layer form the 3rd convolutional layer of the backbone network. By analogy, the features extracted from the (k-1)th convolutional layer of the backbone network by the convolution kernels corresponding to the (k-1)th convolutional layer form the kth convolutional layer of the backbone network, where k is greater than or equal to 1 and less than or equal to K. In the detection model training apparatus corresponding to FIG. 3 to FIG. 5, the feature map extracted from the Kth convolutional layer of the backbone network by the convolution kernels corresponding to the Kth convolutional layer forms the input of the region proposal network; alternatively, the Kth convolutional layer of the backbone network may be used directly as the feature map input into the region proposal network. In the detection model training apparatus corresponding to FIG. 6 to FIG. 8, the feature maps extracted from the kth convolutional layer of the backbone network by the convolution kernels corresponding to the kth convolutional layer form the input of the region proposal network; alternatively, the kth convolutional layer of the backbone network may be used directly as a feature map input into the region proposal network. The region proposal network includes L convolutional layers, where L is an integer greater than 0. Similar to the backbone network, the features extracted from the (k'-1)th convolutional layer of the region proposal network by the convolution kernels corresponding to the (k'-1)th convolutional layer form the k'th convolutional layer of the region proposal network, where k' is greater than or equal to 1 and less than or equal to L-1.
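As a sketch of this layer chain (the `convolve` helper is a caller-supplied stand-in for the per-kernel sliding convolution detailed in the sections below):

```python
def build_backbone_layers(image_features, kernels_per_layer, convolve):
    """The features extracted from layer k-1 by its convolution kernels
    form layer k (minimal sketch; interfaces are assumptions)."""
    layers = [image_features]                         # the 1st convolutional layer
    for kernels in kernels_per_layer:                 # kernels of layer k-1
        layers.append(convolve(layers[-1], kernels))  # features form layer k
    return layers                                     # layers[-1] feeds the RPN
```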
Convolutional layers and convolution kernels
Both the backbone network and the region proposal network include at least one convolutional layer. As shown in FIG. 9, the size of convolutional layer 101 is X*Y*N1; that is, convolutional layer 101 includes X*Y*N1 features, where N1 is the number of channels, one channel is one feature dimension, and X*Y is the number of features included in each channel. X, Y, and N1 are all positive integers greater than 0.
Convolution kernel 1011 is one of the convolution kernels applied to convolutional layer 101. Since convolutional layer 102 includes N2 channels, N2 convolution kernels in total are applied to convolutional layer 101; the sizes and model parameters of these N2 convolution kernels may be the same or different. Taking convolution kernel 1011 as an example, its size is X1*X1*N1; that is, convolution kernel 1011 includes X1*X1*N1 model parameters. The initial model parameters of a convolution kernel may adopt a model parameter template commonly used in the industry. Convolution kernel 1011 slides within convolutional layer 101; when it slides to a certain position of convolutional layer 101, the model parameters of convolution kernel 1011 are multiplied by the features of convolutional layer 101 at the corresponding positions. The products of the individual model parameters of convolution kernel 1011 and the features of convolutional layer 101 at the corresponding positions are combined to obtain one feature on one channel of convolutional layer 102. The product results of the features of convolutional layer 101 and convolution kernel 1011 may be used directly as features of convolutional layer 102. Alternatively, after convolution kernel 1011 has finished sliding over convolutional layer 101 and all product results have been output, all product results may be normalized, and the normalized product results used as the features of convolutional layer 102.
Figuratively speaking, convolution kernel 1011 slides over convolutional layer 101 performing convolution, and the convolution results form one channel of convolutional layer 102. Each convolution kernel applied to convolutional layer 101 corresponds to one channel of convolutional layer 102; therefore, the number of channels of convolutional layer 102 equals the number of convolution kernels applied to convolutional layer 101. The design of the model parameters within each convolution kernel reflects the characteristics of the features that the kernel is intended to extract from the convolutional layer. Through the N2 convolution kernels, features of N2 channels are extracted from convolutional layer 101.
As shown in FIG. 9, convolution kernel 1011 can be split apart: it includes N1 convolution slices, and each convolution slice includes X1*X1 model parameters (P11 to PX1X1). Each model parameter corresponds to one convolution point. The model parameter of a convolution point is multiplied by the feature at the corresponding position in the convolutional layer to obtain the convolution result of that convolution point; the sum of the convolution results of all convolution points of a convolution kernel is the convolution result of that kernel.
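A minimal numeric sketch of one kernel application, assuming illustrative shapes for FIG. 9 (the values are placeholders):

```python
import numpy as np

X, Y, N1 = 8, 8, 3                        # convolutional layer 101: X*Y*N1 features
X1 = 3                                    # kernel 1011: X1*X1*N1 model parameters
layer_101 = np.random.rand(X, Y, N1)
kernel_1011 = np.random.rand(X1, X1, N1)  # N1 convolution slices of X1*X1 parameters

def conv_at(layer, kernel, x, y):
    """One application of the kernel: every model parameter (convolution
    point) is multiplied by the feature at its corresponding position, and
    the results are summed into one feature of the next layer."""
    patch = layer[x:x + kernel.shape[0], y:y + kernel.shape[1], :]
    return float(np.sum(patch * kernel))

feature = conv_at(layer_101, kernel_1011, 0, 0)  # one feature of one channel of layer 102
```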
Convolution kernel sliding span
The sliding span of a convolution kernel is the number of features the kernel crosses in each slide over the convolutional layer. After the convolution kernel finishes the convolution at its current position in the current convolutional layer, forming one feature of the next convolutional layer, the kernel slides V features from its current position and convolves its model parameters with the features of the convolutional layer at the new position; V is the sliding span of the convolution kernel.
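Continuing the previous sketch, the sliding span V determines which positions the kernel visits and therefore the size of the resulting channel:

```python
def conv_channel(layer, kernel, v):
    """Slide the kernel over the layer with sliding span v (no padding);
    the collected results form one channel of the next convolutional layer."""
    xs = range(0, layer.shape[0] - kernel.shape[0] + 1, v)
    ys = range(0, layer.shape[1] - kernel.shape[1] + 1, v)
    return np.array([[conv_at(layer, kernel, x, y) for y in ys] for x in xs])

channel = conv_channel(layer_101, kernel_1011, v=2)  # shape (3, 3) for the shapes above
```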
Receptive field
The receptive field is the perception range, on the input image, of one feature of a convolutional layer; if the pixels within this range change, the value of the feature changes accordingly. As shown in FIG. 10, a convolution kernel slides over the input image, and the extracted features constitute convolutional layer 101. Similarly, a convolution kernel slides over convolutional layer 101, and the extracted features constitute convolutional layer 102. Each feature in convolutional layer 101 is extracted from the pixels of the input image within the size of a convolution slice of the kernel sliding over the input image; this size is the receptive field of convolutional layer 101, as shown in FIG. 10.
Correspondingly, the range on the input image to which each feature of convolutional layer 102 maps (that is, how large a range of pixels of the input image it draws on) is the receptive field of convolutional layer 102. As shown in FIG. 11, each feature in convolutional layer 102 is extracted from the pixels of the input image within the size of a convolution slice of the kernel sliding over convolutional layer 101, and each feature of convolutional layer 101 is in turn extracted from the pixels of the input image within a convolution slice of the kernel sliding over the input image. Therefore, the receptive field of convolutional layer 102 is larger than that of convolutional layer 101. If a backbone network includes multiple convolutional layers, the receptive field of the last of these convolutional layers is the receptive field of the backbone network.
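The growth of the receptive field across stacked layers follows a standard recurrence, sketched below (padding is ignored; the recurrence is textbook material rather than the patent's formula):

```python
def receptive_field(kernel_sizes, slide_spans):
    """Receptive field of the last layer on the input image, using the
    recurrence r_k = r_{k-1} + (f_k - 1) * j_{k-1}, where j is the
    accumulated sliding span."""
    r, jump = 1, 1
    for f, v in zip(kernel_sizes, slide_spans):
        r += (f - 1) * jump
        jump *= v
    return r

# Two stacked 3x3 convolutions with span 1: each feature of layer 102
# "sees" a 5x5 pixel area of the input image, larger than layer 101's 3x3.
assert receptive_field([3, 3], [1, 1]) == 5
```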
Training computing capability
The training computing capability is the computing capability available to the detection model training apparatus in the environment where it is deployed, including at least one of the following: processor frequency, processor occupancy, memory size, memory occupancy, cache utilization, cache size, image processor frequency, image processor occupancy, and other computing resource parameters. When the parts of the detection model training apparatus are deployed in multiple environments, the training computing capability can be obtained by aggregating the computing capabilities available to the apparatus across these environments.
Classifier
A classifier includes functions formed from a series of parameters; based on the input features and these functions, the classifier detects information such as the positions and number of the to-be-detected objects in the to-be-detected image. Common classifiers include the Softmax classifier and the Sigmoid classifier.
Stride
In general, the size of the (k+1)th convolutional layer of the backbone network is less than or equal to the size of the kth convolutional layer. The stride of the kth convolutional layer of the backbone network is the ratio of the size of the image input into the backbone network to the size of the kth convolutional layer; the input image may be a training image or a to-be-detected image. The stride of the kth convolutional layer generally depends on how many pooling layers there are between the 1st and the kth convolutional layers of the backbone network, and on the sliding spans of the convolution kernels of the convolutional layers between the 1st and the kth convolutional layers. The more pooling layers there are between the 1st and the kth convolutional layers, and the larger the sliding spans of the convolution kernels used by those convolutional layers, the larger the stride of the kth convolutional layer.
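A sketch of how the stride of the kth layer accumulates from the preceding sliding spans and pooling layers (the multiplicative accumulation is the standard relationship implied above):

```python
def layer_stride(slide_spans, pool_factors):
    """Stride of the kth convolutional layer: the ratio of input-image size
    to layer size, accumulated from the sliding spans of the preceding
    convolutions and the downsampling factors of the preceding pooling layers."""
    stride = 1
    for v in slide_spans:
        stride *= v
    for p in pool_factors:
        stride *= p
    return stride

# Three span-1 convolutions with two 2x pooling layers between them give
# the kth layer a stride of 4: the input image is 4x the layer's size.
assert layer_stride([1, 1, 1], [2, 2]) == 4
```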
Region proposal network, region proposal parameters, proposal regions, and proposal region sets
As shown in FIG. 12, the region proposal network determines multiple proposal regions on a feature map according to the region proposal parameters. The region proposal parameters may include the length and width of a proposal region; the sizes of different proposal regions generally differ.
In the object detection models corresponding to FIG. 3 and FIG. 6, the region proposal network first obtains multiple proposal regions according to the region proposal parameters, and computes, using the convolution kernels corresponding to the L convolutional layers, the confidence of each of these proposal regions, that is, the likelihood that the region of the training image corresponding to each proposal region includes a to-be-detected object. The sub-feature maps corresponding to the proposal regions whose confidence exceeds a certain threshold, or to a certain number of proposal regions with the highest confidence, are then selected and input into the classifier.
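The confidence-based selection can be sketched as follows (the tuple representation of a proposal region is an assumption):

```python
def select_proposals(proposals, confidences, threshold=None, top_n=None):
    """Keep the proposal regions whose confidence exceeds a threshold, or
    the top_n most confident ones (sketch of the selection described above)."""
    scored = sorted(zip(confidences, proposals), key=lambda cp: cp[0], reverse=True)
    if threshold is not None:
        scored = [(c, p) for c, p in scored if c > threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return [p for _, p in scored]

kept = select_proposals([(0, 0, 4, 4), (2, 2, 6, 6)], [0.9, 0.3], threshold=0.5)
```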
In the object detection models corresponding to FIG. 4 and FIG. 5, after the region proposal network obtains multiple proposal regions, for example proposal regions 1-4, it can group these proposal regions into P proposal region sets according to the size of each proposal region (the number of features the proposal region covers). The region proposal network then inputs the sub-feature maps corresponding to the proposal regions in one proposal region set into one classifier. The size of a proposal region is related to the size of the to-be-detected object; therefore the proposal regions are grouped into proposal region sets by size, different classifiers detect the proposal regions in different proposal region sets, and the classifiers are excited according to the detection results, which makes different classifiers more sensitive to to-be-detected objects of different sizes.
In the object detection models corresponding to FIG. 7 and FIG. 8, different convolutional layers of the backbone network are input into the region proposal network as feature maps, and the strides of different convolutional layers may differ; therefore, proposal regions of the same size on convolutional layers of different strides correspond to to-be-detected objects of different sizes in the training image. For proposal regions of the same size, a proposal region on a convolutional layer with a larger stride indicates a larger to-be-detected object, and a proposal region on a convolutional layer with a smaller stride indicates a smaller to-be-detected object. Therefore, in the object detection models corresponding to FIG. 6 to FIG. 8, after obtaining proposal regions from the different feature maps, the region proposal network jointly considers the size of each proposal region and the stride of the convolutional layer corresponding to the feature map in which the proposal region is located, and groups the proposal regions obtained from the different feature maps into P proposal region sets accordingly. The region proposal network then inputs the sub-feature maps corresponding to the proposal regions in one proposal region set into one classifier. Commonly, the region proposal network uses the product of the size of each proposal region and the stride of the convolutional layer corresponding to the feature map in which the proposal region is located as the grouping criterion: for example, after obtaining T proposal regions from the different feature maps, it computes for each proposal region the product of its size and the stride of the convolutional layer corresponding to the feature map in which it is located. According to these T products, the T proposal regions are grouped into P proposal region sets, for example by comparing each of the T products with P-1 preset thresholds to determine into which proposal region set the corresponding proposal region falls.
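The size-times-stride grouping criterion can be sketched as follows (representing each proposal as a (size, stride) pair is an assumption):

```python
from bisect import bisect_right

def group_by_size_and_stride(proposals, thresholds):
    """Group T proposals taken from different feature maps into P sets by
    the product of proposal size and source-layer stride."""
    sets = [[] for _ in range(len(thresholds) + 1)]
    for size, stride in proposals:
        sets[bisect_right(thresholds, size * stride)].append((size, stride))
    return sets

# A small proposal on a stride-16 layer and a large proposal on a stride-2
# layer both land in the "large object" set: products 64 and 64 versus 16.
sets = group_by_size_and_stride([(4, 16), (32, 2), (8, 2)], thresholds=[48])
```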
FIG. 13 and FIG. 14 describe the workflows of the detection model training apparatuses corresponding to FIG. 3 to FIG. 5 and to FIG. 6 to FIG. 8, respectively.
As shown in FIG. 13, the workflow of the detection model training apparatus is described below.
S201: Obtain at least one of the following system parameters: the number of size clusters of to-be-detected objects; the training computing capability.
The number of size clusters of to-be-detected objects is the number of sets into which the sizes of the to-be-detected objects can be clustered. For example, when the number of size clusters of to-be-detected objects is 2, the sizes of the to-be-detected objects can be divided into two sets.
The number of size clusters of to-be-detected objects can be obtained as the number of clusters produced by clustering the sizes of the to-be-detected objects in the training images with a clustering algorithm, such as K-means. Alternatively, the number of size clusters of to-be-detected objects and the complexity of the to-be-detected objects may be input manually into the detection model training apparatus.
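A sketch of obtaining the number of size clusters with K-means, assuming scalar object sizes and the availability of scikit-learn; selecting k by silhouette score is an illustrative assumption, since the patent leaves the choice of clustering algorithm open:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def size_cluster_count(sizes, k_max=5):
    """Cluster the sizes of the to-be-detected objects and return the
    number of clusters that scores best."""
    x = np.asarray(sizes, dtype=float).reshape(-1, 1)
    best_k, best_score = 2, -1.0
    for k in range(2, min(k_max, len(sizes) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(x)
        score = silhouette_score(x, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```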
The above system parameters refer to parameters of the training images, of the to-be-detected objects in the training images, of the backbone network, or of the training environment; such system parameters can be obtained before the object detection model is built. System parameters are also called hyperparameters, and different system parameters may lead to different replication parameters. Model parameters refer to the parameters corresponding to the convolution points within the convolution kernels; model parameters are continuously excited and changed during the training of the object detection model.
The above system parameters may be obtained in several batches and need not all be obtained in the same step. Nor must all of them be obtained: which system parameters are obtained depends on which are needed in the subsequent step of determining the replication parameter. Each system parameter may be obtained at any time before the step that uses it.
S202: Determine the replication parameter P according to the system parameters obtained in S201.
Specifically, a function P = f(system parameters) for computing the replication parameter P may be preset, where the arguments of f are the system parameters obtained in S201.
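One possible shape for the preset function f, sketched under the assumption that P follows the number of size clusters but is capped by the training computing capability (the cap heuristic is not from the patent):

```python
def replication_parameter(num_size_clusters, compute_budget, cost_per_classifier):
    """One possible preset f: follow the number of size clusters, capped by
    how many classifiers the training computing capability can afford."""
    affordable = max(1, compute_budget // cost_per_classifier)
    return max(1, min(num_size_clusters, affordable))

P = replication_parameter(num_size_clusters=3, compute_budget=8, cost_per_classifier=2)  # P = 3
```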
S202 may be executed at any time after S201 and before S208.
S203: Obtain a training image, build the backbone network according to the training image, and obtain the feature map output by the backbone network.
S204: Input the feature map output by the backbone network into the region proposal network.
The feature map output by the backbone network in S204 consists of the features in the Kth convolutional layer of the backbone network, or the features extracted from the Kth convolutional layer by its convolution kernels.
S205: The region proposal network selects proposal regions from the feature map and inputs the sub-feature maps corresponding to the proposal regions into the classifier.
S206: The classifier detects the to-be-detected objects in the training image according to the sub-feature maps input in S205.
Parameters are set in the classifier; the classifier detects the to-be-detected objects in the training image according to these parameters and the input features.
S207: Compare the to-be-detected objects detected in the training image in S206 with the prior results of the training image, and excite at least one of the following parameters according to the comparison result: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the classifier.
After S207, the excitation of the object detection model by the training image obtained in S203 is complete. The detection model training apparatus obtains the next training image and trains the object detection model according to the next training image and its prior results.
The excitation process for the next training image is similar to that for the training image obtained in S203. The main differences are: 1. when the backbone network extracts the feature map of the next training image, the model parameters of the convolution kernels of the convolutional layers of the backbone network are those excited in S207 (if they were excited in S207); 2. after the backbone network extracts the feature map of the next training image, the model parameters of the convolution kernels of the region proposal network into which the feature map is input, and the region proposal parameters of the region proposal network, are those excited in S207 (if they were excited in S207); 3. the parameters of the classifier through which the next training image passes are those excited in S207 (if they were excited in S207).
By analogy, each training image further excites the object detection model on the basis of the excitation performed by the previous training images. After all training images have been used in turn to train the object detection model, the first stage of the training state of the object detection model ends.
S208: Copy the classifier that has been through the first stage of the training state, obtaining P classifiers.
S209: Obtain a training image, build the backbone network according to the training image, and obtain the feature map output by the backbone network.
S210: Input the feature map output by the backbone network into the region proposal network.
S211: The region proposal network selects multiple proposal regions from the feature map, divides the selected proposal regions into P proposal region sets, and inputs the sub-feature maps corresponding to the proposal regions in each proposal region set into the corresponding classifier.
S212: Each classifier detects the to-be-detected objects in the training image according to the sub-feature maps input in S211.
S213: Compare the to-be-detected objects detected in the training image in S212 with the prior results of the training image, and excite at least one of the following parameters according to the comparison result: the model parameters of the convolution kernels of the convolutional layers of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters of the region proposal network, and the parameters of the classifier.
In S212 and S213, each classifier detects the to-be-detected objects in the training image according to the sub-feature maps it receives, and is excited according to the comparison between its detection result and the prior result. Every classifier copied in S208 executes S212 and S213.
The excitation process for the next training image is similar to that for the training image obtained in S209. The main differences are: 1. when the backbone network extracts the feature map of the next training image, the model parameters of the convolution kernels of the convolutional layers of the backbone network are those excited in S213 (if they were excited in S213); 2. after the backbone network extracts the feature map of the next training image, the model parameters of the convolution kernels of the region proposal network into which the feature map is input, and the region proposal parameters of the region proposal network, are those excited in S213 (if they were excited in S213); 3. the parameters of the classifiers through which the next training image passes are those excited in S213 (if they were excited in S213).
By analogy, each training image further excites the object detection model on the basis of the excitation performed by the previous training images. After all training images have been used in turn in the second stage of the training state, the training process of the object detection model ends. As shown in FIG. 5, the object detection model can then be used in the inference state.
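The second-stage loop S208-S213 can be summarized in a sketch (the `model` interface bundling the components as callables is an assumption made for illustration):

```python
import copy

def second_stage_training(images, priors, model, P):
    """Sketch of S208-S213: copy the first-stage classifier P times, then
    let each copy detect and be excited on its own proposal region set."""
    model.classifiers = [copy.deepcopy(model.classifier) for _ in range(P)]  # S208
    for image, prior in zip(images, priors):
        feature_map = model.backbone(image)                 # S209
        proposal_sets = model.rpn(feature_map, P)           # S210-S211
        for k, sub_maps in enumerate(proposal_sets):
            detections = model.classifiers[k](sub_maps)     # S212
            model.excite(detections, prior[k])              # S213
```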
As shown in FIG. 14, another workflow of the detection model training apparatus is described. Compared with the workflow shown in FIG. 13, the main difference is that S203 and S209 in the workflow of FIG. 13 are replaced by S203' and S209', respectively.
Referring to the corresponding parts of FIG. 6 to FIG. 8, in S203' and S209' the backbone network extracts at least two feature maps and inputs them into the region proposal network, for the region proposal network to select proposal regions from. After all training images have been used in turn in the second stage of the training state, the training process of the object detection model ends. As shown in FIG. 8, the object detection model can then be used in the inference state.
This application further provides a detection model training apparatus 400. As shown in FIG. 15, the detection model training apparatus 400 includes an object detection model 401, an excitation module 405, a storage module 406, and an initialization module 407. The object detection model 401 further includes a region proposal network 402, a backbone network 403, and a classifier 404. The classifier 404 comprises one classifier in the first stage of the training state, and P classifiers in the second stage of the training state and in the inference state.
Each of the above modules may be a software module. In the first stage of the training state, the initialization module 407 is configured to execute S201 and S202 to determine the replication parameter P. The object detection model 401 obtains a training image from the storage module 406 and executes S203 or S203', and S204, to build the backbone network 403. The region proposal network 402 executes S205, the classifier 404 executes S206, and the excitation module 405 executes S207. In the second stage of the training state, the initialization module 407 is configured to execute S208; the object detection model 401 obtains a training image from the storage module 406 and executes S209 or S209', and S210, to build the backbone network 403. The region proposal network 402 executes S211, the classifier 404 executes S212, and the excitation module 405 executes S213.
The detection model training apparatus 400 may be provided to users as an object detection model training service. For example, as shown in FIG. 1, the detection model training apparatus 400 (or a part of it) is deployed in a cloud environment; a user selects the backbone network type and some system parameters, places the training images and the prior results of the training images into the storage module 406, and then starts the detection model training apparatus 400 to train the object detection model 401. The trained object detection model 401 is provided to the user, who can run it in their own terminal environment or sell it directly to a third party.
This application further provides a computing device 500. As shown in FIG. 16, the computing device 500 includes a bus 501, a processor 502, a communication interface 503, and a memory 504. The processor 502, the memory 504, and the communication interface 503 communicate over the bus 501.
The processor may be a central processing unit (CPU). The memory may include volatile memory, for example random access memory (RAM). The memory may also include non-volatile memory, for example read-only memory (ROM), flash memory, an HDD, or an SSD. The memory stores executable code, and the processor executes this executable code to perform the foregoing object detection method. The memory may also include other software modules required for running processes, such as an operating system. The operating system may be LINUX™, UNIX™, WINDOWS™, or the like.
The memory of the computing device 500 stores the code corresponding to the modules of the detection model training apparatus 400, and the processor 502 executes this code to implement the functions of those modules, that is, to perform the method shown in FIG. 13 or FIG. 14. The computing device 500 may be a computing device in a cloud environment, an edge environment, or a terminal environment.
As shown in FIG. 2, the parts of the detection model training apparatus 400 may execute on multiple computing devices in different environments. Therefore, this application further proposes a computing device system. As shown in FIG. 17, the computing device system includes multiple computing devices 600; the structure of each computing device 600 is the same as that of the computing device 500 in FIG. 16. The computing devices 600 establish communication paths among themselves over a communication network. Each computing device 600 runs any one or more of the region proposal network 402, the backbone network 403, the classifier 404, the excitation module 405, the storage module 406, and the initialization module 407. Any computing device 600 may be a computing device in a cloud environment, an edge environment, or a terminal environment.
Further, as shown in FIG. 18, because the training images and the prior results of the training images occupy a large amount of space, a computing device 600 may be unable to store all of them by itself; this application therefore proposes another computing device system. The storage module 406 is deployed in a cloud storage service (for example, an object storage service): the user applies for storage space of a certain capacity in the cloud storage service to serve as the storage module 406, and stores the training images and the prior results of the training images in it. While running, a computing device 600 obtains the needed training images and prior results from the remote storage module 406 over the communication network. Each computing device 600 runs any one or more of the region proposal network 402, the backbone network 403, the classifier 404, the excitation module 405, and the initialization module 407. Any computing device 600 may be a computing device in a cloud environment, an edge environment, or a terminal environment.
The descriptions of the flows corresponding to the above drawings each have their own emphasis; for a part not detailed in one flow, refer to the related descriptions of the other flows.
All or part of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, an SSD).

Claims (15)

  1. A method for training an object detection model performed by a computing device, wherein the method comprises:
    obtaining a training image, and building a backbone network according to the training image;
    inputting a feature map output by the backbone network into a region proposal network;
    selecting, by the region proposal network, multiple proposal regions from the feature map output by the backbone network according to region proposal parameters, and inputting sub-feature maps corresponding to the multiple proposal regions into a classifier;
    detecting, by the classifier, to-be-detected objects in the training image according to the sub-feature maps corresponding to the multiple proposal regions;
    comparing the to-be-detected objects in the training image detected by the classifier with prior results of the training image, and exciting, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of the classifier;
    copying the classifier to obtain at least two classifiers;
    dividing, by the region proposal network, the multiple proposal regions into at least two proposal region sets, each proposal region set comprising at least one proposal region;
    inputting, by the region proposal network, the sub-feature maps corresponding to the proposal regions comprised in each proposal region set into one of the at least two classifiers; and
    performing, by each of the at least two classifiers, the following actions:
    detecting to-be-detected objects in the training image according to the sub-feature maps corresponding to the proposal regions comprised in the obtained proposal region set; and
    comparing the detected to-be-detected objects in the training image with the prior results of the training image, and exciting, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of each classifier.
  2. The method according to claim 1, wherein the method further comprises:
    obtaining system parameters, the system parameters comprising at least one of: the number of size clusters of to-be-detected objects in the training image, and the training computing capability; and
    determining, according to the system parameters, the number of classifiers among the at least two classifiers obtained after copying.
  3. The method according to claim 2, wherein, when the system parameters comprise the number of size clusters of to-be-detected objects in the training image, the obtaining system parameters comprises:
    clustering the sizes of the to-be-detected objects in the training image to obtain the number of size clusters of the to-be-detected objects in the training image.
  4. The method according to any one of claims 1 to 3, wherein the feature map output by the backbone network comprises at least two feature maps.
  5. A detection model training apparatus, comprising:
    an object detection model, configured to obtain a training image and build a backbone network according to the training image; select multiple proposal regions from a feature map output by the backbone network according to region proposal parameters, and input sub-feature maps corresponding to the multiple proposal regions into a classifier; and detect to-be-detected objects in the training image according to the sub-feature maps corresponding to the multiple proposal regions;
    an excitation module, configured to compare the detected to-be-detected objects in the training image with prior results of the training image, and excite, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of the classifier; and
    an initialization module, configured to copy the classifier to obtain at least two classifiers;
    wherein the object detection model is further configured to divide the multiple proposal regions into at least two proposal region sets, each proposal region set comprising at least one proposal region, and input the sub-feature maps corresponding to the proposal regions comprised in each proposal region set into one of the at least two classifiers; and each of the at least two classifiers performs the following actions: detecting to-be-detected objects in the training image according to the sub-feature maps corresponding to the proposal regions comprised in the obtained proposal region set; and comparing the detected to-be-detected objects in the training image with the prior results of the training image, and exciting, according to the comparison result, at least one of the model parameters of the convolution kernels of the backbone network, the model parameters of the convolution kernels of the region proposal network, the region proposal parameters, and the parameters of each classifier.
  6. The apparatus according to claim 5, wherein the initialization module is further configured to obtain system parameters, the system parameters comprising at least one of: the number of size clusters of to-be-detected objects in the training image, and the training computing capability; and determine, according to the system parameters, the number of classifiers among the at least two classifiers obtained after copying.
  7. The apparatus according to claim 6, wherein the initialization module is further configured to cluster the sizes of the to-be-detected objects in the training image to obtain the number of size clusters of the to-be-detected objects in the training image.
  8. The apparatus according to any one of claims 5 to 7, wherein the feature map output by the backbone network comprises at least two feature maps.
  9. A computing device system, comprising at least one computing device, each computing device comprising a processor and a memory, the processor of the at least one computing device being configured to perform the method according to any one of claims 1 to 4.
  10. A non-transitory readable storage medium, wherein, when the non-transitory readable storage medium is executed by at least one computing device in a computing device system, the at least one computing device performs the method according to any one of claims 1 to 4.
  11. A computing device program product, wherein, when the computing device program product is executed by at least one computing device in a computing device system, the at least one computing device performs the method according to any one of claims 1 to 4.
  12. A method for training an object detection model performed by a computing device, wherein the method comprises:
    in a first stage of training, extracting a feature map of a training image through a backbone network, selecting proposal regions from the extracted feature map through a region proposal network, inputting the sub-feature maps corresponding to the proposal regions into a classifier, detecting, by the classifier, to-be-detected objects in the training image according to the sub-feature maps corresponding to the proposal regions, comparing the detection result with prior results of the training image, and exciting at least one of the backbone network, the region proposal network, and the classifier according to the comparison result; and
    in a second stage of training, establishing at least two copied classifiers according to the classifier that has been through the first-stage training, dividing, by the region proposal network, the proposal regions into at least two proposal region sets, each proposal region set comprising at least one proposal region, inputting the sub-feature maps corresponding to the proposal regions comprised in each proposal region set into one copied classifier, detecting, by each copied classifier, to-be-detected objects in the training image according to the obtained sub-feature maps, comparing the detection result with the prior results of the training image, and exciting at least one of the backbone network, the region proposal network, and the classifiers again according to the comparison result.
  13. The method according to claim 12, wherein the method further comprises:
    obtaining system parameters, the system parameters comprising at least one of: the number of size clusters of to-be-detected objects in the training image, and the training computing capability; and
    determining, according to the system parameters, the number of copied classifiers established.
  14. The method according to claim 13, wherein, when the system parameters comprise the number of size clusters of to-be-detected objects in the training image, the obtaining system parameters comprises:
    clustering the sizes of the to-be-detected objects in the training image to obtain the number of size clusters of the to-be-detected objects in the training image.
  15. The method according to any one of claims 12 to 14, wherein the feature map extracted by the backbone network comprises at least two feature maps.
PCT/CN2019/076982 2018-08-03 2019-03-05 Method, device and apparatus for training object detection model WO2020024584A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19808951.8A EP3633553A4 (en) 2018-08-03 2019-03-05 Method, device and apparatus for training object detection model
US17/025,419 US11423634B2 (en) 2018-08-03 2020-09-18 Object detection model training method, apparatus, and device

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201810878556 2018-08-03
CN201810878556.9 2018-08-03
CN201811070244.1A CN110796154B (en) 2018-08-03 2018-09-13 Method, device and equipment for training object detection model
CN201811070244.1 2018-09-13

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/025,419 Continuation US11423634B2 (en) 2018-08-03 2020-09-18 Object detection model training method, apparatus, and device

Publications (1)

Publication Number Publication Date
WO2020024584A1 true WO2020024584A1 (en) 2020-02-06

Family

ID=69230505

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/076982 WO2020024584A1 (en) 2018-08-03 2019-03-05 Method, device and apparatus for training object detection model

Country Status (1)

Country Link
WO (1) WO2020024584A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324938A (en) * 2012-03-21 2013-09-25 日电(中国)有限公司 Method for training attitude classifier and object classifier and method and device for detecting objects
CN103942558A (en) * 2013-01-22 2014-07-23 日电(中国)有限公司 Method and apparatus for obtaining object detectors
CN104217216A (en) * 2014-09-01 2014-12-17 华为技术有限公司 Method and device for generating detection model, method and device for detecting target

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3633553A4 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523403A (en) * 2020-04-03 2020-08-11 咪咕文化科技有限公司 Method and device for acquiring target area in picture and computer readable storage medium
CN111523403B (en) * 2020-04-03 2023-10-20 咪咕文化科技有限公司 Method and device for acquiring target area in picture and computer readable storage medium
CN111860509A (en) * 2020-07-28 2020-10-30 湖北九感科技有限公司 Coarse-to-fine two-stage non-constrained license plate region accurate extraction method
CN111860545A (en) * 2020-07-30 2020-10-30 元神科技(杭州)有限公司 Image sensitive content identification method and system based on weak detection mechanism
CN111860545B (en) * 2020-07-30 2023-12-19 元神科技(杭州)有限公司 Image sensitive content identification method and system based on weak detection mechanism
CN113221935A (en) * 2021-02-02 2021-08-06 清华大学 Image identification method and system based on environment perception deep convolutional neural network
CN112861803A (en) * 2021-03-16 2021-05-28 厦门博海中天信息科技有限公司 Image identification method, device, server and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN110796154B (en) Method, device and equipment for training object detection model
WO2020024584A1 (en) Method, device and apparatus for training object detection model
WO2020024585A1 (en) Method and apparatus for training object detection model, and device
US20180189610A1 (en) Active machine learning for training an event classification
US11106944B2 (en) Selecting logo images using machine-learning-logo classifiers
WO2023138300A1 (en) Target detection method, and moving-target tracking method using same
EP3333768A1 (en) Method and apparatus for detecting target
CN110569721A (en) Recognition model training method, image recognition method, device, equipment and medium
WO2019051941A1 (en) Method, apparatus and device for identifying vehicle type, and computer-readable storage medium
US20180181796A1 (en) Image processing method and apparatus
CN111191568A (en) Method, device, equipment and medium for identifying copied image
US20160162757A1 (en) Multi-class object classifying method and system
CN115631112B (en) Building contour correction method and device based on deep learning
TW201942814A (en) Object classification method, apparatus, server, and storage medium
Fathi et al. General rotation-invariant local binary patterns operator with application to blood vessel detection in retinal images
US10614379B2 (en) Robust classification by pre-conditioned lasso and transductive diffusion component analysis
JP2019220014A (en) Image analyzing apparatus, image analyzing method and program
US11810341B2 (en) Method of identifying filters in a neural network, system and storage medium of the same
US20210406568A1 (en) Utilizing multiple stacked machine learning models to detect deepfake content
CN112287905A (en) Vehicle damage identification method, device, equipment and storage medium
JP2016224821A (en) Learning device, control method of learning device, and program
US20170046615A1 (en) Object categorization using statistically-modeled classifier outputs
CN112801045B (en) Text region detection method, electronic equipment and computer storage medium
CN111753723B (en) Fingerprint identification method and device based on density calibration
CN111091022A (en) Machine vision efficiency evaluation method and system

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019808951

Country of ref document: EP

Effective date: 20200102

NENP Non-entry into the national phase

Ref country code: DE