CN111709471A - Object detection model training method and object detection method and device

Info

Publication number: CN111709471A (application CN202010535814.0A)
Authority: CN (China)
Prior art keywords: candidate region, target object, detection model, sample image, target
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN111709471B (en)
Inventor: 宋奕兵
Current and original assignee: Tencent Technology (Shenzhen) Co., Ltd.
Legal events: application filed by Tencent Technology (Shenzhen) Co., Ltd. with priority to CN202010535814.0A; publication of CN111709471A; application granted; publication of CN111709471B

Classifications

    • G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/214 — Pattern recognition; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/22 — Image or video recognition or understanding; image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a method and apparatus for training an object detection model, a computer device, and a storage medium. The method includes: acquiring a sample image; obtaining, through an object detection model to be trained, the candidate regions in which a target object is located in the sample image and an initial confidence that the target object in each candidate region belongs to each object class; obtaining an attention response map of the object detection model to be trained for the sample image according to gradient information between those initial confidences and the image data of the sample image; and acquiring the attention response value of each candidate region from the attention response map and adjusting the network parameters of the object detection model to be trained according to the attention response values, the initial confidences, and the object class label until a convergence condition is met, yielding the target object detection model. The method can improve the detection performance of the object detection model.

Description

Object detection model training method and object detection method and device
Technical Field
The present application relates to the field of image processing technologies, and in particular to a method, apparatus, computer device, and storage medium for training an object detection model, and to a method, apparatus, computer device, and storage medium for object detection.
Background
With the development of computer technology, detecting and identifying all target objects of interest in an image through an object detection model, and determining object information such as each target object's region position and object class, has become one of the core problems of computer vision. However, a target object often occupies only a small area of an image while the background occupies a large one, and this large background region can interfere with the model's detection of the target object, reducing the accuracy with which target objects are detected and identified.
Disclosure of Invention
In view of the above, it is necessary to provide a method, apparatus, computer device, and storage medium for training an object detection model, and a method, apparatus, computer device, and storage medium for object detection, that can improve the accuracy of detecting and recognizing target objects in images.
A method of training an object detection model, the method comprising:
obtaining a sample image, the sample image including an object class label of a target object in the sample image; the target object is an object to be detected;
obtaining, through an object detection model to be trained, the candidate regions in which the target object is located in the sample image and an initial confidence that the target object in each candidate region belongs to each object class; the object detection model to be trained is used for detecting the target object in the sample image;
obtaining an attention response map of the object detection model to be trained for the sample image according to gradient information between the initial confidences of the target object in each candidate region for each object class and the image data of the sample image;
and obtaining the attention response value of each candidate region from the attention response map, adjusting the network parameters of the object detection model to be trained according to the attention response value of each candidate region, the initial confidences, and the object class label, and repeating the above steps until a convergence condition is met, to obtain the target object detection model.
An object detection method, comprising:
acquiring an image to be detected;
acquiring, through a pre-constructed target object detection model, the candidate regions in which a target object is located in the image to be detected and an initial confidence that the target object in each candidate region belongs to each object class; the target object detection model is obtained by adjusting the network parameters of an object detection model to be trained according to attention response values of candidate regions in a sample image, the initial confidences of the target object in those candidate regions for each object class, and the object class label of the target object in the sample image; the candidate regions of the sample image and the corresponding initial confidences are obtained through the object detection model to be trained; and each attention response value is obtained from an attention response map derived from gradient information between those initial confidences and the image data of the sample image;
determining the target object class of the target object in each candidate region according to the initial confidence that the target object in that candidate region belongs to each object class;
and outputting the region position information of the candidate regions in which the target object is located in the image to be detected, together with the target object class.
An apparatus for training an object detection model, the apparatus comprising:
a sample image acquisition module for acquiring a sample image, the sample image including an object class label of a target object in the sample image; the target object is an object to be detected;
an image processing module for obtaining, through an object detection model to be trained, the candidate regions in which the target object is located in the sample image and an initial confidence that the target object in each candidate region belongs to each object class; the object detection model to be trained is used for detecting the target object in the sample image;
a response map obtaining module, configured to obtain an attention response map of the object detection model to be trained for the sample image according to gradient information between the initial confidences of the target object in each candidate region for each object class and the image data of the sample image;
and a response value acquisition module, configured to obtain the attention response value of each candidate region from the attention response map and to adjust the network parameters of the object detection model to be trained according to the attention response values, the initial confidences, and the object class label, repeating until a convergence condition is met to obtain the target object detection model.
An object detection apparatus, the apparatus comprising:
the image acquisition module is used for acquiring an image to be detected;
a candidate region acquisition module, configured to acquire, through a pre-constructed target object detection model, the candidate regions in which a target object is located in the image to be detected and an initial confidence that the target object in each candidate region belongs to each object class; the target object detection model is obtained by adjusting the network parameters of an object detection model to be trained according to attention response values of candidate regions in a sample image, the initial confidences of the target object in those candidate regions for each object class, and the object class label of the target object in the sample image; the candidate regions of the sample image and the corresponding initial confidences are obtained through the object detection model to be trained; and each attention response value is obtained from an attention response map derived from gradient information between those initial confidences and the image data of the sample image;
an object class acquisition module, configured to determine the target object class of the target object in each candidate region according to the initial confidence that the target object in that candidate region belongs to each object class;
and an object information output module, configured to output the region position information of the candidate regions in which the target object is located in the image to be detected, together with the target object class.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
obtaining a sample image, the sample image including an object class label of a target object in the sample image; the target object is an object to be detected;
obtaining, through an object detection model to be trained, the candidate regions in which the target object is located in the sample image and an initial confidence that the target object in each candidate region belongs to each object class; the object detection model to be trained is used for detecting the target object in the sample image;
obtaining an attention response map of the object detection model to be trained for the sample image according to gradient information between the initial confidences of the target object in each candidate region for each object class and the image data of the sample image;
and obtaining the attention response value of each candidate region from the attention response map, adjusting the network parameters of the object detection model to be trained according to the attention response value of each candidate region, the initial confidences, and the object class label, and repeating the above steps until a convergence condition is met, to obtain the target object detection model.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
obtaining a sample image, the sample image including an object class label of a target object in the sample image; the target object is an object to be detected;
obtaining, through an object detection model to be trained, the candidate regions in which the target object is located in the sample image and an initial confidence that the target object in each candidate region belongs to each object class; the object detection model to be trained is used for detecting the target object in the sample image;
obtaining an attention response map of the object detection model to be trained for the sample image according to gradient information between the initial confidences of the target object in each candidate region for each object class and the image data of the sample image;
and obtaining the attention response value of each candidate region from the attention response map, adjusting the network parameters of the object detection model to be trained according to the attention response value of each candidate region, the initial confidences, and the object class label, and repeating the above steps until a convergence condition is met, to obtain the target object detection model.
In the training method of the object detection model, a sample image is obtained, the sample image including an object class label of the target object in it; the candidate regions in which the target object is located in the sample image, and an initial confidence that the target object in each candidate region belongs to each object class, are obtained through an object detection model to be trained, which is used for detecting the target object in the sample image; an attention response map of the object detection model to be trained for the sample image is obtained according to gradient information between those initial confidences and the image data of the sample image; and the attention response value of each candidate region is obtained from the attention response map, the network parameters of the object detection model to be trained are adjusted according to the attention response values, the initial confidences, and the object class label, and training continues until the target object detection model is obtained. By obtaining an attention response map of the object detection model to be trained over each pixel of the sample image, and adjusting the network parameters according to that map, the model is made to pay more attention to the target object in the sample image; the adjusted model therefore extracts more feature information from the region of the input image in which the target object is located, improving the accuracy of detecting the target object's region position and object class.
Drawings
FIG. 1 is a diagram of an exemplary environment in which a method for training an object detection model may be implemented;
FIG. 2 is a schematic flow chart diagram illustrating a method for training an object detection model in one embodiment;
FIG. 3 is a flowchart illustrating the steps of adjusting the network parameters of the object detection model to be trained in one embodiment;
FIG. 4 is a flowchart illustrating the steps of obtaining an attention response map of the object detection model to be trained in one embodiment;
FIG. 5 is a flowchart illustrating the steps of obtaining the attention response value of each candidate region from the attention response map in one embodiment;
FIG. 6 is a flowchart illustrating the steps of obtaining, through the object detection model to be trained, the candidate regions in which the target object is located in a sample image and the initial confidences of the target object in each candidate region for each object class, in one embodiment;
FIG. 7a is a schematic diagram illustrating a method for training an object detection model in accordance with one embodiment;
FIG. 7b is a schematic illustration of candidate regions in one embodiment;
FIG. 8 is a flowchart illustrating a method for object detection according to one embodiment;
FIG. 9 is a block diagram showing the structure of an apparatus for training an object detection model according to an embodiment;
FIG. 10 is a block diagram showing the structure of an object detecting apparatus according to an embodiment;
FIG. 11 is a diagram illustrating an internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Artificial intelligence (AI) is the theory, method, technique, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning, and decision making.
Artificial intelligence is a comprehensive discipline that spans a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The training method of the object detection model provided in this application mainly relates to computer vision (CV), the science of studying how to make machines "see": using cameras and computers in place of human eyes to recognize, track, and measure targets, and to further process images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and techniques for building artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.
In particular, the method provided by this application mainly relates to detecting and identifying target objects of interest in an image and determining the region position and object class of each target object, which is one of the core problems of computer vision.
The training method of the object detection model provided in this application can be applied in the environment shown in FIG. 1, in which a terminal 102 communicates with a server 104 over a network. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device, and the server 104 may be implemented as an independent server or as a cluster of servers. Specifically, the terminal 102 acquires an image to be detected and sends it to the server 104; the server 104 inputs the image into the target object detection model, obtains through the model the candidate regions in which target objects are located and the object class of the target object in each candidate region, and returns the region position information of those candidate regions and the corresponding object classes to the terminal 102.
Alternatively, the terminal 102 may directly invoke the target object detection model to obtain the region position information of the candidate regions in which target objects are located in the image to be detected and the corresponding object classes. This application chiefly describes how the object detection model to be trained is trained to obtain the target object detection model.
In one embodiment, as shown in FIG. 2, a method for training an object detection model is provided. Taking as an example its application to the server in FIG. 1, the method includes the following steps:
Step S202: a sample image is obtained; the sample image includes an object class label of the target object in it, the target object being the object to be detected.
The sample image is used to train the object detection model to be trained and contains one or more target objects. A target object is a specific object to be detected and identified in the image, typically a tangible object existing in the real world, and can be set according to the actual application scenario. The object class label indicates the class information of the target objects contained in the sample image.
For example, in one practical application scenario, the sample image may be, but is not limited to, a road image captured by a camera; the target object may be, but is not limited to, an obstacle on the road, such as a person (e.g., a pedestrian), a vehicle, or a tree; and the object class label indicates the class information of the obstacles contained in the sample image, for example a label indicating "person", "vehicle", or "tree".
Step S204: obtaining, through the object detection model to be trained, the candidate regions in which the target object is located in the sample image and the initial confidence that the target object in each candidate region belongs to each object class; the object detection model to be trained is used for detecting the target object in the sample image.
A candidate region is a region of interest that may contain a target object; it is a partial image region of the sample image and may in particular be rectangular. The initial confidence of the target object in a candidate region for each object class expresses how likely the target object in that region is to belong to each object class; the confidence may be represented, for example, as a probability value or a percentile score.
The object detection model to be trained is used to identify all target objects to be detected in an input image and to determine each target object's position in the image and its class information. It may be a neural network model that has been pre-trained or one that is untrained. Specifically, it may include, but is not limited to, a feature extraction network, a region generation network, and a classifier. The feature extraction network extracts feature information, such as semantic and positional features, from the input image to obtain a feature map; the region generation network selects and crops candidate regions on the feature map output by the feature extraction network; and the classifier classifies the candidate regions output by the region generation network to obtain the likelihood that the target object in each candidate region belongs to each object class. The feature extraction network may be a convolutional neural network, such as a ResNet, an FPN (Feature Pyramid Network), or an FPN built on a ResNet backbone; the region generation network may be an RPN (Region Proposal Network).
Specifically, after obtaining a sample image, the server inputs it into the object detection model to be trained and extracts the corresponding feature map through the feature extraction network; the feature map is input into the region generation network, which determines regions that may contain a target object, yielding a number of candidate regions; finally, the classifier classifies the objects contained in each candidate region to obtain the initial confidence that the target object in that region belongs to each object class.
It should be noted that, in addition to the object classes of the target objects to be detected, the object classes include a background object class, i.e., the class of non-target objects (objects that are not to be detected). When the object detection model to be trained assigns its highest initial confidence in a candidate region to the background object class, the candidate region can be considered to contain not a target object but background imagery around the target objects; that is, a candidate region predicted by the object detection model to be trained may contain no target object at all.
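For illustration only, the pipeline of step S204 can be sketched as follows in Python with PyTorch. The backbone choice, the classifier head, the fixed candidate boxes, and all shapes are assumptions made for this sketch, since the patent leaves the concrete networks open; in practice the candidate regions would come from a region generation network such as an RPN rather than being hard-coded.

import torch
import torch.nn as nn
from torchvision.models import resnet18
from torchvision.ops import roi_align

# Minimal sketch of the step S204 pipeline: feature extraction network ->
# candidate regions -> per-region classifier yielding initial confidences.
class ToyDetector(nn.Module):
    def __init__(self, num_classes):                  # includes the background class
        super().__init__()
        trunk = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(trunk.children())[:-2])
        self.classifier = nn.Sequential(nn.Flatten(),
                                        nn.Linear(512 * 7 * 7, num_classes))

    def forward(self, image, boxes):
        feats = self.backbone(image)                   # feature map of the image
        # One 7x7 feature crop per candidate region (boxes in image pixels).
        crops = roi_align(feats, [boxes], output_size=7, spatial_scale=1 / 32)
        return self.classifier(crops).softmax(dim=1)   # initial confidences

image = torch.rand(1, 3, 640, 640, requires_grad=True)  # stands in for a sample image
boxes = torch.tensor([[32., 32., 256., 256.], [300., 300., 480., 480.]])
model = ToyDetector(num_classes=4)                     # background + 3 object classes
confidences = model(image, boxes)                      # shape (2, 4)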
Step S206: obtaining an attention response map of the object detection model to be trained for the sample image according to gradient information between the initial confidences of the target object in each candidate region for each object class and the image data of the sample image.
The attention response map records how strongly the object detection model to be trained responds to each pixel of the sample image while detecting the target object; the higher the response at the pixels of a region, the more strongly that region can be considered related to the target object. Specifically, after obtaining the initial confidence that the target object in each candidate region belongs to each object class, the server can backpropagate those confidences through the object detection model to be trained to obtain gradient information between the initial confidences corresponding to the candidate regions and the image data of the sample image, and from this gradient information derive the model's degree of response to each pixel of the sample image, yielding the attention response map of the sample image. Intuitively, when the gradient between a region's initial confidence and the image data is larger, i.e., the response at that region's pixels is higher, the region can be considered more related to the target object; when the gradient is smaller, i.e., the response is lower, the region can be considered less related.
Further, the image data of the sample image may be the original image data of the sample image, or a feature map of the sample image output by an intermediate layer of the object detection model to be trained, such as the feature map at the input of the region generation network or of the classifier; this is not limited here.
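Under the autograd reading of step S206 described above, a minimal sketch, continuing the toy example, might look like the following; taking the rectified gradient magnitude as the per-pixel response and using the top confidence per region are assumptions of the sketch.

import torch

# Sketch of step S206: backpropagate candidate-region confidences to the image
# data and take the rectified gradient magnitude per pixel. Summing softmax
# outputs over every class would be constant, so the top confidence per region
# is used instead. 'model', 'image', 'boxes' continue the sketch above.
top_conf = model(image, boxes).max(dim=1).values
grads, = torch.autograd.grad(top_conf.sum(), image)

attention_map = grads.abs().sum(dim=1).squeeze(0)              # per-pixel response, (H, W)
attention_map = attention_map / (attention_map.max() + 1e-8)   # normalize for readability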
Step S208: obtaining the attention response value of each candidate region from the attention response map, adjusting the network parameters of the object detection model to be trained according to the attention response values, the initial confidences of the target object in each candidate region for each object class, and the object class label, and repeating the above steps until a convergence condition is met, to obtain the target object detection model.
The attention response value expresses how strongly the object detection model to be trained responds to a region or pixel of the sample image while detecting the target object; the higher the attention response value of a region or pixel, the more strongly it can be considered related to the target object. To obtain each candidate region's attention response value from the attention response map, the server may read the response value of every pixel inside the candidate region and take the average of those values as the region's attention response value; alternatively, it may take their sum.
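A minimal sketch of the mean-pooling readout just described, continuing the toy example and assuming box coordinates in image pixels:

import torch

# Attention response value of a candidate region = mean of the per-pixel
# response values inside its box (the sum variant replaces .mean() with .sum()).
def region_response(att_map: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
    x1, y1, x2, y2 = box.long()
    return att_map[y1:y2, x1:x2].mean()

response_values = torch.stack([region_response(attention_map, b) for b in boxes])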
As noted above, a higher attention response value for a candidate region means that the region is more strongly related to the target object (it is more likely to be the region in which the target object is located), and that the region's image features are more likely to contain information characterizing the target object and its position. When a region truly contains the target object, the more attention the object detection model to be trained pays to it, the more accurately the model learns to locate the candidate region and predict the object class. Therefore, after the attention response map is obtained, the attention response value of each candidate region is read from it, and the loss caused by the difference between each region's initial confidences and the object class label is reweighted by that value: the loss weight of candidate regions that belong to the region where the target object is located is increased, while the loss weight of those that do not is reduced. The adjusted loss then supervises the model's learning on the sample image and drives the adjustment of its network parameters, so that the model focuses more attention on the region where the target object is located, learns its image features more fully, and thereby improves its detection performance.
It should be noted that every sample image trains the object detection model to be trained by the same procedure, namely steps S202 through S208 above, each subsequent sample image continuing training from the state left by the previous one. The convergence condition can be adjusted or set according to actual needs: for example, it may be considered met when the target class prediction loss value reaches its minimum, when the loss no longer changes (or changes by less than a threshold) between successive training iterations, or after the model has been trained on a preset number of sample images. The resulting target object detection model can then be used to identify the positions and object classes of target objects in images, the recognizable classes being the object classes contained in the sample images used for training.
In the training method of the object detection model, a sample image is obtained, the sample image including an object class label of the target object in it; the candidate regions in which the target object is located in the sample image, and the initial confidence that the target object in each candidate region belongs to each object class, are obtained through an object detection model to be trained, which is used for detecting the target object in the sample image; an attention response map of the object detection model to be trained for the sample image is obtained according to gradient information between those initial confidences and the image data of the sample image; and the attention response value of each candidate region is obtained from the attention response map, the network parameters are adjusted according to the attention response values, the initial confidences, and the object class label, and training continues until the target object detection model is obtained. By obtaining an attention response map of the model over every pixel of the sample image and adjusting the network parameters according to it, the model pays more attention to the target object in the sample image; after the adjustment it extracts more feature information from the region of the input image where the target object lies, improving the accuracy of identifying the target object's position and object class.
In an embodiment, as shown in FIG. 3, adjusting the network parameters of the object detection model to be trained according to the attention response value of each candidate region, the initial confidences of the target object in each candidate region for each object class, and the object class label includes:
Step S302: obtaining a class prediction loss value for each candidate region according to the initial confidences of the target object in that region for each object class and the object class label;
Step S304: obtaining a loss weight value for each candidate region according to its attention response value;
Step S306: obtaining the target class prediction loss value of the object detection model to be trained according to the loss weight value and class prediction loss value of each candidate region;
Step S308: adjusting the network parameters of the object detection model to be trained according to its target class prediction loss value.
The class prediction loss value is the loss caused by the difference between the initial confidences output by the object detection model to be trained for the target object in a candidate region and the actual object class of the target object in the sample image; it may be computed, for example, with a cross-entropy loss function. The loss weight value controls the magnitude of the class prediction loss of different candidate regions: when a candidate region's attention response value is larger, the region is more strongly related to the target object, so its loss weight is made larger, letting the model learn the region's image features more fully; when the attention response value is smaller, the region is less related to the target object, so its loss weight is made smaller, letting the model reduce its learning of that region's features.
Specifically, after obtaining the attention response value of each candidate region and the initial confidences of the target object in each candidate region for each object class, the server computes each region's class prediction loss value with a cross-entropy loss function from those confidences and the corresponding object class label, derives each region's loss weight value from its attention response value, combines the weighted losses into the target class prediction loss value, and finally adjusts the network parameters of the object detection model to be trained with that target loss value until a convergence condition is met, yielding the trained target object detection model. The convergence condition can be adjusted or set according to actual needs: for example, it may be considered met when the target class prediction loss value reaches its minimum, when it no longer changes, or after the model has been trained on a preset number of sample images.
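Steps S302 to S308 can be sketched as follows, continuing the toy example. Using the attention response value itself as the loss weight is an assumption of the sketch; the patent only requires the weight to grow with the response value.

import torch
import torch.nn.functional as F

# S302: per-region class prediction losses via cross entropy on the softmax
# confidences; S304: loss weights from the attention response values;
# S306: weighted sum into the target class prediction loss; S308: backprop.
labels = torch.tensor([1, 0])                      # object class labels per region
probs = model(image, boxes)                        # initial confidences
per_region_loss = F.nll_loss(probs.clamp_min(1e-8).log(), labels,
                             reduction='none')     # step S302
loss_weights = response_values.detach()            # step S304
target_loss = (loss_weights * per_region_loss).sum()   # step S306

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
optimizer.zero_grad()
target_loss.backward()                             # step S308
optimizer.step()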
In one embodiment, the object classes include foreground object classes and a background object class, and the object class confidences of the target object in each candidate region accordingly include foreground-class confidences and a background-class confidence. As shown in FIG. 4, step S206, obtaining the attention response map of the object detection model to be trained for the sample image according to the initial confidences of the target object in each candidate region for each object class and gradient information with respect to the image data of the sample image, includes:
step S206a, obtaining the total confidence of the foreground object class and the total confidence of the background object class in the candidate regions according to the initial confidence of the object class corresponding to the target object in each candidate region.
Step S206b, obtaining a attention response map of the foreground object according to the gradient information between the total confidence of the foreground object category and the image data of the sample image.
Step S206c, obtaining a focus response map of the background object according to the gradient information between the total confidence of the background object category and the image data of the sample image.
As described above, besides the object classes of the target objects to be detected, the object classes include a background object class. In object detection, the foreground refers to the target objects to be detected, and the background refers to everything else; correspondingly, a foreground object class is any object class of a target object to be detected, and when the target object in a candidate region belongs to none of those classes, its object class is the background object class.
Specifically, after obtaining the initial confidences of the target object in each candidate region, the server, for the foreground, sums the initial confidences of the foreground object classes within each candidate region to obtain that region's foreground-class confidence, and then sums the foreground-class confidences of all candidate regions to obtain the total foreground-class confidence. Similarly, for the background, the server sums the background-class confidences of all candidate regions to obtain the total background-class confidence.
For example, in a practical application scenario the target objects may include people, vehicles, and trees, so the object classes of the target objects are "person", "vehicle", and "tree". When the target object in a candidate region is any of "person", "vehicle", or "tree", its object class is a foreground object class; otherwise its object class is the background object class. Suppose the object detection model to be trained outputs a first and a second candidate region. In the first, the initial confidence for "person" is 0.5, for "vehicle" 0.3, for "tree" 0.1, and for the background class (none of "person", "vehicle", or "tree") 0.1; in the second, the initial confidence for "person" is 0.3, for "vehicle" 0.2, for "tree" 0.1, and for the background class 0.4. Then the foreground-class confidence of the first region is 0.9 and its background-class confidence 0.1; the foreground-class confidence of the second region is 0.6 and its background-class confidence 0.4; and over all candidate regions the total foreground-class confidence is 1.5 and the total background-class confidence 0.5.
After obtaining the total foreground-class confidence and total background-class confidence over all candidate regions, the server computes the partial derivatives of each total with respect to the image data of the input image, yielding the foreground attention response map and the background attention response map respectively. Intuitively, the foreground attention response map carries foreground position distribution information, and the background attention response map carries background position distribution information.
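A sketch of steps S206a to S206c under the same assumptions as the sketches above, with class index 0 standing in for the background object class:

import torch

# Total the foreground-class and background-class confidences over all
# candidate regions, then backpropagate each total to the image data to
# obtain separate foreground and background attention response maps.
probs = model(image, boxes)                        # (num_regions, num_classes)
fg_total = probs[:, 1:].sum()                      # total foreground-class confidence
bg_total = probs[:, 0].sum()                       # total background-class confidence

fg_grads, = torch.autograd.grad(fg_total, image, retain_graph=True)
bg_grads, = torch.autograd.grad(bg_total, image)

fg_map = fg_grads.abs().sum(dim=1).squeeze(0)      # foreground attention response map
bg_map = bg_grads.abs().sum(dim=1).squeeze(0)      # background attention response map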
In one embodiment, as shown in fig. 5, the step of obtaining the attention response value of each candidate region from the attention response map includes:
step S502, determining the object type having the highest numerical initial confidence in each candidate region as the predicted object type of the target object in each candidate region.
After the initial prediction confidence of each candidate region for each object class is obtained, which object class the object contained in each prediction region is can be determined. Specifically, the server may determine, as the predicted object class of the candidate region, the object class of which the initial prediction confidence is the maximum, among the object classes corresponding to the candidate region.
Step S504, according to the prediction object type of the target object in each candidate area, determining the area type of each candidate area, wherein the area type comprises a foreground candidate area and a background candidate area;
after the prediction object type of the candidate area is determined, the server determines the area type of the candidate area according to the prediction object type. Specifically, when the prediction object class of the target object in the candidate region is the object class of the target object to be detected (i.e., the foreground object class), the candidate region is the foreground candidate region, and when the prediction object class of the target object in the candidate region is not the object class of the target object to be detected (i.e., the background object class), the candidate region is the background candidate region.
In step S506, the attention response value of each candidate region is acquired from the attention response map corresponding to the region type.
After the region type of each candidate region is determined, according to the region type of each candidate region, the attention response value of each candidate region is obtained from the attention response map corresponding to the region type. Specifically, when the candidate region is a foreground candidate region, the attention response value of the candidate region is obtained from the foreground attention response map, and when the candidate region is a background candidate region, the attention response value of the candidate region is obtained from the background attention response map.
For example, in a practical application scenario the foreground object classes may be "person", "vehicle", and "tree". When, for a candidate region output by the object detection model to be trained, the initial confidence for "person" is 0.5, for "vehicle" 0.3, for "tree" 0.1, and for the background class 0.1, the predicted object class of the target object in the region is "person"; the region is therefore a foreground candidate region, and its attention response value is taken from the foreground attention response map. When the initial confidence for "person" is 0.3, for "vehicle" 0.2, for "tree" 0.1, and for the background class 0.4, the predicted object class is the background object class; the region is a background candidate region, and its attention response value is taken from the background attention response map.
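Steps S502 to S506 then amount to an argmax followed by a map lookup; a sketch continuing the example above (reusing probs, boxes, fg_map, bg_map, and region_response):

import torch

# Predicted class per region = argmax of its initial confidences; regions
# predicted as class 0 (assumed background) read their response value from
# the background map, all others from the foreground map.
predicted = probs.argmax(dim=1)
region_values = torch.stack([
    region_response(bg_map if int(cls) == 0 else fg_map, box)
    for cls, box in zip(predicted, boxes)])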
It can be understood that, when subsequently obtaining the target class prediction loss value of the object detection model to be trained, the class prediction loss value of each candidate region may be obtained from the initial confidences of the target object in that region and the object class label; then, for each foreground candidate region its attention response value is taken from the foreground attention response map, and for each background candidate region its attention response value is taken from the background attention response map, after which the target class prediction loss value is obtained from each region's loss weight value and class prediction loss value. In one embodiment, the target class prediction loss value of the object detection model to be trained may be obtained by the following formula (1):
L_cls = -t · λ_f · (A_f)^α · log(p) - (1 - t) · λ_b · (A_b)^β · log(1 - p)    (1)

where t is the object class label (t = 1 for a foreground object class and t = 0 for the background object class); p is the initial confidence of the target object in the candidate region for each object class; λ_f and α are coefficients controlling the magnitude of the class prediction loss value of foreground candidate regions; λ_b and β are coefficients controlling the magnitude of the class prediction loss value of background candidate regions; A_f is the attention response value of the foreground candidate region; and A_b is the attention response value of the background candidate region.
In one embodiment, obtaining the attention response value of each candidate region from the attention response map includes: reading, from the attention response map, the response value of every pixel inside the candidate region, and taking the average of those per-pixel values as the region's attention response value.
During training, besides adjusting the model parameters of the object detection model to be trained with the class prediction loss value, the model can also be trained with a loss caused by the positional difference between the candidate region in which the model places the target object and the region in which the target object actually lies in the sample image. Thus, in one embodiment, the sample image further includes a position information label of the target object, and adjusting the network parameters of the object detection model to be trained according to the attention response value of each candidate region, the initial confidences, and the object class label includes: adjusting the network parameters according to the attention response values, the initial confidences, the object class label, the position information of the candidate regions, and the position information label.
Specifically, the server obtains the target class prediction loss value of the object detection model to be trained according to the attention response value of each candidate region, the initial confidence of each object class corresponding to the target object in each candidate region and the object class label; it also obtains the region where the target object is actually located from the position information label, and obtains a first region position loss value of the object detection model to be trained according to the region position difference between the candidate region output by the object detection model to be trained and the region where the target object is actually located. The network parameters of the object detection model to be trained are then adjusted according to the target class prediction loss value and the first region position loss value.
The network parameters of the object detection model to be trained are adjusted according to the target class prediction loss value and the first region position loss value; specifically, the target class prediction loss value and the first region position loss value are combined either by weighted calculation or by mean calculation to obtain the target loss value of the object detection model to be trained. The network parameters of the object detection model to be trained are then adjusted according to the target loss value until a convergence condition is met, and the trained object detection model is obtained. The convergence condition may be adjusted or set according to actual needs; for example, the convergence condition may be considered satisfied when the target loss value reaches a minimum value, or when the target loss value no longer changes, or after the object detection model to be trained has been trained on a preset number of sample images.
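A minimal sketch of this combination step, assuming scalar loss values and illustrative weights:

```python
def target_loss(cls_loss, reg_loss, w_cls=1.0, w_reg=1.0, use_mean=False):
    """Combine the target class prediction loss and the first region
    position loss into the target loss of the model to be trained."""
    if use_mean:
        return (cls_loss + reg_loss) / 2.0          # mean calculation
    return w_cls * cls_loss + w_reg * reg_loss      # weighted calculation
```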
Further, a second region position loss value of the object detection model to be trained can be obtained according to the difference between each candidate region output by the model being predicted as a foreground candidate region or a background candidate region and that candidate region actually being a foreground region or a background region. Specifically, when the ratio of the area of the overlapping region between the candidate region and the region where the target object is actually located to the area of the region where the target object is actually located is greater than or equal to a preset threshold, the candidate region is actually a foreground region; conversely, if the ratio is smaller than the preset threshold, the candidate region is actually a background region. The network parameters of the object detection model to be trained can then be adjusted according to the target class prediction loss value, the first region position loss value and the second region position loss value.
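This assignment rule can be sketched as follows; note that the criterion here is the ratio of the intersection area to the ground-truth area rather than IoU, and the 0.5 threshold is an illustrative assumption:

```python
def is_actual_foreground(candidate, gt_box, threshold=0.5):
    """True when the overlap between the candidate region and the region
    where the target object is actually located covers at least `threshold`
    of the ground-truth area. Boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(candidate[0], gt_box[0]), max(candidate[1], gt_box[1])
    ix2, iy2 = min(candidate[2], gt_box[2]), min(candidate[3], gt_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    return inter / gt_area >= threshold
```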
In an embodiment, as shown in fig. 6, the step of obtaining, by the object detection model to be trained, a candidate region where the target object is located in the sample image and an initial confidence of each object class corresponding to the target object in each candidate region includes:
step S602, extracting a global feature map of a sample image through an object detection model to be trained;
step S604, predicting a candidate region where the target object is located according to the global feature map;
step S606, according to the candidate regions, obtaining the initial confidence of the target object in each candidate region corresponding to each object type.
The global feature map is image feature information corresponding to the whole sample image, and includes feature information parameters such as object positions and types related to the target object. As described above, the object detection model to be trained may include, but is not limited to, a feature extraction network, a region generation network, and a classifier; acquiring a candidate region where a target object in a sample image is located and initial confidence degrees of the target object in each candidate region corresponding to each object category by using an object detection model to be trained, specifically, inputting the sample image into the object detection model to be trained, and extracting a global feature map of the sample image through a feature extraction network of the object detection model to be trained; then, inputting the global feature map into a region generation network, and predicting a candidate region possibly containing a target object through the region generation network; and finally, predicting the initial confidence of the target object contained in each candidate region belonging to each object class through a classifier.
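As a structural sketch of this three-stage pipeline (the backbone, region generation network and classifier are placeholder modules, not the networks of the patent):

```python
import torch.nn as nn

class ObjectDetector(nn.Module):
    """Feature extraction network -> region generation network -> classifier."""

    def __init__(self, backbone, rpn, classifier):
        super().__init__()
        self.backbone = backbone      # extracts the global feature map
        self.rpn = rpn                # predicts candidate regions
        self.classifier = classifier  # per-region initial confidences

    def forward(self, image):
        global_features = self.backbone(image)
        candidate_regions = self.rpn(global_features)
        confidences = self.classifier(global_features, candidate_regions)
        return candidate_regions, confidences
```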
Further, in an embodiment, the step S606, according to the candidate regions, obtaining initial confidence levels of the target objects in the candidate regions corresponding to the object classes, includes: obtaining a local feature map corresponding to each candidate region from the global feature map; and acquiring initial confidence degrees of the target objects in the candidate regions corresponding to the object categories according to the local feature maps of the candidate regions.
Each candidate region corresponds to a local image feature; since a candidate region is a partial image region of the sample image, the local feature map is the image feature information of the image region corresponding to the candidate region in the sample image. Specifically, after the candidate region where the target object is located is predicted, the local feature map corresponding to the candidate region may be obtained, according to the region position of the candidate region, from the corresponding region position in the global feature map; the local feature map corresponding to each candidate region is then input into the classifier, and the classifier identifies, according to each local feature map, the initial confidence that the target object in the candidate region belongs to each object class.
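For instance, the local feature map of each candidate region could be cropped from the global feature map with an operator such as torchvision's roi_align; the 7x7 output size and the stride-16 spatial scale below are assumptions:

```python
import torch
from torchvision.ops import roi_align

global_features = torch.randn(1, 256, 50, 50)       # (N, C, H, W) global feature map
boxes = torch.tensor([[10.0, 10.0, 100.0, 120.0]])  # (K, 4) candidate regions, image coords
# roi_align expects (K, 5) rows of (batch_index, x1, y1, x2, y2)
rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)
local_features = roi_align(global_features, rois,
                           output_size=(7, 7), spatial_scale=1.0 / 16)
print(local_features.shape)  # torch.Size([1, 256, 7, 7])
```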
In one embodiment, a method of training an object detection model includes:
1. a sample image is acquired that includes an object class label of a target object in the sample image.
2. Obtaining a candidate region where the target object is located in the sample image and an initial confidence coefficient of each object category corresponding to the target object in each candidate region through an object detection model to be trained; the object detection model to be trained is used for detecting a target object in the sample image.
And 2-1, extracting a global feature map of the sample image through the object detection model to be trained.
And 2-2, predicting a candidate region where the target object is located according to the global feature map.
And 2-3, acquiring initial confidence degrees of the target objects in the candidate regions corresponding to the object classes according to the candidate regions.
2-3-1, obtaining a local feature map corresponding to each candidate region from the global feature map.
And 2-3-2, acquiring initial confidence degrees of the target objects in the candidate regions corresponding to the object classes according to the local feature maps of the candidate regions.
3. And acquiring an attention response graph of the object detection model to be trained aiming at the sample image according to the initial confidence of the target object in each candidate region corresponding to each object class and the gradient information between the image data of the sample image.
And 3-1, acquiring the total confidence of the foreground object type and the total confidence of the background object type in the candidate regions according to the initial confidence of the object type corresponding to the target object in each candidate region.
3-2, obtaining an attention response graph of the foreground object according to the gradient information between the total confidence of the foreground object category and the image data of the sample image.
3-3, obtaining an attention response graph of the background object according to the gradient information between the total confidence of the background object category and the image data of the sample image.
4. And acquiring the attention response value of each candidate region from the attention response graph, adjusting the network parameters of the object detection model to be trained according to the attention response value of each candidate region, the initial confidence of each object type corresponding to the target object in each candidate region and the object type label, and repeating the steps until a convergence condition is met to obtain the target object detection model.
And 4-1, respectively determining the object type with the highest numerical value of the initial confidence coefficient in each candidate region as the predicted object type of the target object in each candidate region.
And 4-2, determining the region type of each candidate region according to the prediction object type of the target object in each candidate region, wherein the region type comprises a foreground candidate region and a background candidate region.
And 4-3, acquiring the attention response value of each candidate region from the attention response graph corresponding to the region type. Specifically, when the region category of the candidate region is a background candidate region, obtaining an attention response value of the candidate region from the background attention response map; and when the region category of the candidate region is a foreground candidate region, acquiring the attention response value of the candidate region from the foreground attention response map.
4-3-1, acquiring attention response values corresponding to all pixel points on the candidate region from the attention response map;
4-3-2, determining the average value of the attention response values of the pixel points on the candidate region as the attention response value of the candidate region.
And 4-4, acquiring a class prediction loss value corresponding to each candidate region according to the initial confidence degree of the target object in each candidate region corresponding to each object class and the object class label.
And 4-5, acquiring a loss weight value of each candidate region according to the attention response value of each candidate region.
4-6, acquiring a target class prediction loss value of the object detection model to be trained according to the loss weight value and the class prediction loss value of each candidate region;
and 4-7, adjusting the network parameters of the object detection model to be trained according to the target class prediction loss value of the object detection model to be trained.
Referring to fig. 7a and 7b, fig. 7a is a schematic diagram illustrating a training method of an object detection model in an embodiment. As shown in fig. 7a, an image 702 is an input image of the object detection model to be trained, that is, a sample image, and the image 702 includes the target object and the object class label of the target object. The object detection model to be trained may include a feature extraction network 704 and a network 706, where the network 706 includes a region generation network and a classifier. First, the server may input the image 702 into the feature extraction network 704, and perform feature extraction on the image 702 through the feature extraction network 704 to obtain a global feature map of the image 702. After the global feature map of the image 702 is obtained, the region generation network generates a plurality of candidate regions, where the candidate regions are regions that may contain the target object in the sample image. After obtaining the plurality of candidate regions, the region generation network obtains the local feature map corresponding to each candidate region from the global feature map and inputs it into the classifier, and the classifier predicts, according to the local feature map of each candidate region, the confidence that the target object in each candidate region belongs to each object class, obtaining a model output result 708, which may be a matrix. As shown in fig. 7a, the model output result 708 includes the confidence that the target object in a candidate region belongs to each object class, for example, s_man, s_plane, s_car and s_bg, where s_man represents the confidence that the target object in the candidate region belongs to the class "person", s_plane represents the confidence that it belongs to the class "airplane", s_car represents the confidence that it belongs to the class "vehicle", and s_bg represents the confidence that the target object in the candidate region does not belong to any of the foreground object classes "person", "airplane" and "vehicle". As shown in fig. 7b, after the global feature map of the image 702 is obtained, the region generation network generates three candidate regions, namely a candidate region a, a candidate region b and a candidate region c. The model output result corresponding to the candidate region a is s_man equal to 0.1, s_plane equal to 0.05, s_car equal to 0.55 and s_bg equal to 0.3; the model output result corresponding to the candidate region b is s_man equal to 0.3, s_plane equal to 0.2, s_car equal to 0.1 and s_bg equal to 0.4; and the model output result corresponding to the candidate region c is s_man equal to 0.5, s_plane equal to 0.3, s_car equal to 0.1 and s_bg equal to 0.1. At this time, the predicted object class of the target object in the candidate region a is the "vehicle" class, so the candidate region a is a foreground candidate region; the predicted object class of the target object in the candidate region b is the background object class, so the candidate region b is a background candidate region; and the predicted object class of the target object in the candidate region c is the "person" class, so the candidate region c is a foreground candidate region.
The server may then obtain the total confidence s_fg of the foreground object classes and the total confidence s_bg of the background object classes over all candidate regions based on the model output result 708, and back-propagate s_fg and s_bg through the network respectively: taking the partial derivative of the total confidence s_fg of the foreground object classes with respect to the global feature map yields the gradient information between the total confidence of the foreground object classes and the image data of the sample image, and taking the partial derivative of the total confidence s_bg of the background object classes with respect to the global feature map yields the gradient information between the total confidence of the background object classes and the image data of the sample image, finally obtaining an attention response map 710 of the foreground object and an attention response map 712 of the background object. As shown in fig. 7b, in the candidate region a, the confidence of the foreground object class is equal to 0.7, and the confidence of the background object class is equal to 0.3; in the candidate region b, the confidence of the foreground object class is equal to 0.6, and the confidence of the background object class is equal to 0.4; in the candidate region c, the confidence of the foreground object class is equal to 0.9, and the confidence of the background object class is equal to 0.1. At this time, the total confidence s_fg of the foreground object classes of all the candidate regions is equal to 2.2, and the total confidence s_bg of the background object classes is equal to 0.8.
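These totals follow directly from summing the per-region confidences:

```python
# (foreground confidence, background confidence) per candidate region, fig. 7b
regions = {"a": (0.7, 0.3), "b": (0.6, 0.4), "c": (0.9, 0.1)}
s_fg = sum(fg for fg, _ in regions.values())  # 0.7 + 0.6 + 0.9 = 2.2
s_bg = sum(bg for _, bg in regions.values())  # 0.3 + 0.4 + 0.1 = 0.8
```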
Specifically, the gradient information between the total confidence of the foreground object classes and the image data of the sample image is obtained by taking the partial derivative of the total confidence s_fg with respect to the global feature map; denoting the global feature map by F, this can be written as formula (2) (notation reconstructed from the surrounding definitions):

$$G_{fg} = \frac{\partial s_{fg}}{\partial F} \tag{2}$$

and the gradient information between the total confidence of the background object classes and the image data of the sample image is obtained by taking the partial derivative of the total confidence s_bg with respect to the global feature map, as in formula (3):

$$G_{bg} = \frac{\partial s_{bg}}{\partial F} \tag{3}$$
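In an autograd framework, formulas (2) and (3) each amount to one gradient call; the sketch below reduces the per-channel gradients to a single-channel map with a sum and a ReLU, a choice in the spirit of gradient-based attention methods and not a detail given by the patent:

```python
import torch

def attention_response_map(total_confidence, global_features):
    """Gradient of a total confidence (s_fg or s_bg) with respect to the
    global feature map, reduced to an (H, W) attention response map."""
    grads = torch.autograd.grad(total_confidence, global_features,
                                retain_graph=True)[0]   # (1, C, H, W)
    attn = grads.sum(dim=1).squeeze(0).relu()           # (H, W)
    return attn / (attn.max() + 1e-7)                   # normalize to [0, 1]
```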
further, the server determines the object type having the highest initial confidence value from the model output result 708, determines the object type as the predicted object type of the target object in the candidate region, determines the region type of the candidate region according to the predicted object type of the target object in the candidate region, and acquires the attention response value corresponding to the candidate region from the attention response map corresponding to the region type. As shown in fig. 7b, the candidate region a, the candidate region c, and the candidate region d, the object type of the target object in the candidate region a is determined as the "vehicle" type, the candidate region a is the foreground candidate region, the object type of the target object in the candidate region b is determined as the background object type, the candidate region b is the background candidate region, the object type of the target object in the candidate region c is determined as the "person" type, and the candidate region c is the background candidate region. Therefore, the attention response values for the candidate regions a and c are obtained from the foreground attention response map, and the attention response values for the candidate regions b are obtained from the background attention response map.
After the attention response value corresponding to each candidate region is obtained, the loss weight value of each candidate region is obtained from its attention response value, and the class prediction loss value of each candidate region is adjusted according to that loss weight value: the weight of the loss function is increased for candidate regions belonging to the region where the target object is located, and reduced for candidate regions belonging to the region where the background object is located. The adjusted loss value is then used to supervise the learning of the object detection model to be trained on the sample image and to adjust its network parameters, so that the object detection model to be trained puts more attention on the region where the target object in the sample image is located, learns the image features of that region more fully, and thereby improves its detection performance.
The method can acquire a sample image, wherein the sample image comprises an object class label of a target object in the sample image; obtain, through an object detection model to be trained, a candidate region where the target object is located in the sample image and an initial confidence of each object class corresponding to the target object in each candidate region, the object detection model to be trained being used for detecting the target object in the sample image; acquire an attention response graph of the object detection model to be trained for the sample image according to the initial confidence of the target object in each candidate region corresponding to each object class and the gradient information between the image data of the sample image; and acquire the attention response value of each candidate region from the attention response graph, adjust the network parameters of the object detection model to be trained according to the attention response value of each candidate region, the initial confidence of the target object in each candidate region corresponding to each object class and the object class label, and continue training until the target object detection model is obtained. By obtaining the attention response graph of the object detection model to be trained for the sample image and obtaining the attention response value of each candidate region from the attention response graph, the loss function weights of the foreground object and the background object are dynamically adjusted, the contributions of the foreground object (positive sample) and the background object (negative sample) in the sample image to the loss function are balanced, and the accuracy of identifying the position and the object class of the target object is improved.
In one embodiment, as shown in fig. 8, an object detection method includes:
step S802, an image to be detected is obtained.
Step S804, a candidate area where a target object in an image to be detected is located and initial confidence coefficients of the target object in each candidate area corresponding to each object category are obtained through a target object detection model which is constructed in advance; the target object detection model is obtained by adjusting network parameters of the object detection model to be trained according to the attention response value of the candidate region in the sample image, the initial confidence of each object class corresponding to the target object in the candidate region and the object class label of the target object in the sample image; the candidate region of the sample image and the initial confidence of each object class corresponding to the target object in the candidate region are obtained by using an object detection model to be trained.
Step S806, obtaining the target object type of the target object in the candidate region according to the initial confidence of the target object in each candidate region corresponding to each object type.
Step S808, outputting the region position information of the candidate region where the target object is located in the image to be detected and the target object type.
The image to be detected can be a picture shot by a camera, a picture captured from a video by screen capture, an image uploaded by an application program capable of uploading the image, and the like. The image to be detected comprises a target object, and the target object refers to an object to be detected in the image to be detected.
Specifically, after the image to be detected is obtained, the server inputs the image to be detected to a target object detection model, and obtains a candidate region where the target object is located and an initial confidence of the target object in each candidate region corresponding to each object category through the target object detection model. The target object detection model may be obtained by the training method of the object detection model in any of the above embodiments. After the initial confidence degrees of the target objects in the candidate regions corresponding to the object classes are obtained, the object class with the maximum initial confidence degree is determined as the target object class of the target objects in the candidate regions, and finally, region position information of the candidate regions and the target object class of the target objects in the candidate regions are output.
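A minimal sketch of this inference flow (the model interface and the class-name list are assumptions carried over from the training sketches above):

```python
import torch

def detect(model, image, class_names):
    """Return (region position, target object class) pairs for an image
    to be detected, using a trained target object detection model."""
    model.eval()
    with torch.no_grad():
        regions, confidences = model(image)
    results = []
    for box, conf in zip(regions, confidences):
        class_id = conf.argmax().item()   # object class with max confidence
        results.append((box, class_names[class_id]))
    return results
```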
In one embodiment, the object detection method further comprises: acquiring moving route information according to the region position information of the candidate region where the target object in the image to be detected is located; and sending the moving route information to a driving device, wherein the moving route information is used for instructing the driving device to move according to a moving route corresponding to the moving route information.
The candidate region may be a rectangular region, and the region position information may be coordinate information of 4 corners of the rectangular region in the image to be detected. The driving device refers to a device capable of moving, such as an inspection robot, an automatic driving vehicle, an unmanned aerial vehicle and the like.
Specifically, the object detection model may output area position information of a candidate area where the target object in the image to be detected is located, after the area position information of the candidate area is obtained, the moving route of the driving device is updated according to the area position information, the moving route information is generated and sent to the driving device, and after the driving device obtains the moving route information, the driving device moves according to the moving route corresponding to the route information. It is understood that the movement route information is route information for bypassing the target object, and is used to instruct the traveling apparatus to move and avoid the target object.
For example, the object detection method is applied to an autonomous vehicle provided with an imaging device for imaging an image of a traveling road, and after a road image (i.e., an image to be detected) in real time is acquired by the imaging device, the road image is input to a target object detection model, and a candidate area of an obstacle (i.e., a target object) in the road image is acquired by the target object detection model, and further, an object type of the obstacle in the road image can be acquired. After the area position information of the candidate area of the obstacle in the road image is acquired, the space position of the obstacle on the driving road is calculated according to the area position information, the driving route information (namely the moving route information) is updated according to the space position of the obstacle, and the automatic driving vehicle is instructed to continue driving according to the driving route and avoid the obstacle.
It should be understood that, although the steps in the flowcharts of fig. 2 to 6 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least a portion of the steps in fig. 2 to 6 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided an apparatus for training an object detection model, which may be implemented as all or part of a computer device by using a software module, a hardware module, or a combination of the two. The apparatus specifically includes: a sample image acquisition module, an image processing module, a response map acquisition module and a response value acquisition module, wherein:
a sample image obtaining module 902, configured to obtain a sample image, where the sample image includes an object class label of a target object in the sample image; the target object is an object to be detected;
an image processing module 904, configured to obtain, through the to-be-trained object detection model, a candidate region where the target object is located in the sample image, and an initial confidence of each object class corresponding to the target object in each candidate region; the object detection model to be trained is used for detecting a target object in the sample image;
a response map acquisition module 906, configured to obtain an attention response map of the object detection model to be trained for the sample image according to the gradient information between the initial confidence of the target object in each candidate region corresponding to each object class and the image data of the sample image;
the response value obtaining module 908 is configured to obtain a degree of interest response value of each candidate region from the degree of interest response map, adjust a network parameter of the object detection model to be trained according to the degree of interest response value of each candidate region, the initial confidence level of each object class corresponding to the target object in each candidate region, and the object class label, and repeat the above steps until a convergence condition is satisfied, so as to obtain the target object detection model.
In one embodiment, the object classes include a foreground object class and a background object class; the object class confidences corresponding to the target objects in the candidate regions include a foreground object class confidence and a background object class confidence; and the response map acquisition module is used for acquiring the total confidence of the foreground object classes and the total confidence of the background object classes in the candidate regions according to the initial confidence of each object class corresponding to the target object in each candidate region; acquiring an attention response map of the foreground object according to the gradient information between the total confidence of the foreground object classes and the image data of the sample image; and acquiring an attention response map of the background object according to the gradient information between the total confidence of the background object classes and the image data of the sample image.
In one embodiment, the response value obtaining module is configured to determine, as the predicted object class of the target object in each candidate region, the object type with the highest numerical initial confidence in each candidate region respectively; determining the region type of each candidate region according to the prediction object type of the target object in each candidate region, wherein the region type comprises a foreground candidate region and a background candidate region; and acquiring the attention response value of each candidate area from the attention response map corresponding to the area type.
In one embodiment, the response value obtaining module is specifically configured to obtain, from the attention response map, the attention response values corresponding to the respective pixel points in the candidate region; and determine the average value of the attention response values of the pixel points in the candidate region as the attention response value of the candidate region.
In one embodiment, the response value obtaining module is specifically configured to obtain a category prediction loss value corresponding to each candidate region according to an initial confidence and an object category label of each object category corresponding to a target object in each candidate region; obtaining a loss weight value of each candidate region according to the attention response value of each candidate region; obtaining a target class prediction loss value of the object detection model to be trained according to the loss weight value and the class prediction loss value of each candidate region; and adjusting the network parameters of the object detection model to be trained according to the target class prediction loss value of the object detection model to be trained.
In one embodiment, the sample image further comprises a location information tag of the target object in the sample image; and the response value acquisition module is specifically used for adjusting network parameters of the object detection model to be trained according to the attention response value of each candidate region, the initial confidence of each object type corresponding to the target object in each candidate region, the object type label, the position information of the candidate region and the position information label.
In one embodiment, the image processing module is used for extracting a global feature map of the sample image through the object detection model to be trained; predicting a candidate region where the target object is located according to the global feature map; and acquiring the initial confidence of each object class corresponding to the target object in each candidate region according to the candidate regions.
In an embodiment, the image processing module is specifically configured to obtain a local feature map corresponding to each candidate region from the global feature map; and acquiring initial confidence degrees of the target objects in the candidate regions corresponding to the object categories according to the local feature maps of the candidate regions.
For specific definition of the training apparatus for the object detection model, reference may be made to the above definition of the training method for the object detection model, and details are not described here. The modules in the training device of the object detection model can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, as shown in fig. 10, there is provided an object detection apparatus, which may be implemented as all or part of a computer device by using a software module, a hardware module, or a combination of the two. The apparatus specifically includes: an image acquisition module, a candidate region acquisition module, an object class acquisition module and an object information output module, wherein:
an image obtaining module 1002, configured to obtain an image to be detected;
a candidate region obtaining module 1004, configured to obtain, through a pre-constructed target object detection model, a candidate region where a target object in the image to be detected is located and initial confidence levels of the target object in each candidate region corresponding to each object class; the target object detection model is obtained by adjusting network parameters of an object detection model to be trained according to the attention response value of the candidate region in the sample image, the initial confidence of each object class corresponding to the target object in the candidate region and the object class label of the target object in the sample image; the candidate region of the sample image and the initial confidence of the target object in the candidate region corresponding to each object category are obtained through the object detection model to be trained; the attention response value is obtained from an attention response map obtained according to the initial confidence of the target object in each candidate region corresponding to each object class and the gradient information between the image data of the sample image;
an object class obtaining module 1006, configured to obtain, according to an initial confidence that a target object in each candidate region corresponds to each object class, a target object class of the target object in the candidate region;
and an object information output module 1008, configured to output region position information of a candidate region where the target object in the image to be detected is located and a target object type.
In one embodiment, the object detection apparatus further includes a moving route acquisition module, configured to acquire moving route information according to the region position information of the candidate region where the target object in the image to be detected is located; and send the moving route information to a driving device, wherein the moving route information is used for instructing the driving device to move according to a moving route corresponding to the moving route information.
For the specific definition of the object detection device, reference may be made to the above definition of the object detection method, which is not described herein again. The modules in the object detection device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 11. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as network parameters of the object detection model. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a training method of an object detection model or an object detection method.
Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction between the combinations of these technical features, they should be considered to be within the scope of this specification.
The above-mentioned embodiments only express several embodiments of the present application, and their descriptions are specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (15)

1. A method of training an object detection model, the method comprising:
obtaining a sample image, wherein the sample image comprises an object class label of a target object in the sample image, and the target object is an object to be detected;
obtaining a candidate region where the target object is located in the sample image and an initial confidence coefficient of each object category corresponding to the target object in each candidate region through an object detection model to be trained; the object detection model to be trained is used for detecting a target object in the sample image;
acquiring an attention response graph of the object detection model to be trained aiming at the sample image according to the initial confidence coefficient of the target object in each candidate region corresponding to each object category and gradient information between the image data of the sample image;
and acquiring the attention response value of each candidate region from the attention response graph, adjusting the network parameters of the object detection model to be trained according to the attention response value of each candidate region, the initial confidence of each object type corresponding to the target object in each candidate region and the object type label, and repeating the steps until a convergence condition is met to obtain the target object detection model.
2. The method of claim 1, wherein the object classes comprise a foreground object class and a background object class; the object class confidence degrees corresponding to the target objects in the candidate regions comprise a foreground object class confidence degree and a background object class confidence degree;
the step of obtaining an attention response graph of the object detection model to be trained for the sample image according to the initial confidence of the target object in each candidate region corresponding to each object class and the gradient information between the image data of the sample image includes:
acquiring the total confidence of the foreground object type and the total confidence of the background object type in the candidate regions according to the initial confidence of the object type corresponding to the target object in each candidate region;
obtaining an attention response graph of the foreground object according to gradient information between the total confidence of the foreground object category and the image data of the sample image;
and acquiring an attention response graph of the background object according to the gradient information between the total confidence of the background object category and the image data of the sample image.
3. The method according to claim 1, wherein the step of obtaining the attention response value of each candidate region from the attention response map comprises:
respectively determining the object type with the highest numerical initial confidence coefficient in each candidate region as a predicted object type of the target object in each candidate region;
determining the region type of each candidate region according to the prediction object type of the target object in each candidate region, wherein the region type comprises a foreground candidate region and a background candidate region;
and acquiring the attention response value of each candidate area from the attention response graph corresponding to the area type.
4. The method according to claim 1, wherein the step of obtaining the attention response value of each candidate region from the attention response map comprises:
obtaining attention response values corresponding to all pixel points on the candidate region from the attention response map;
and determining the average value of the attention response values of the pixel points in the candidate region as the attention response value of the candidate region.
5. The method according to claim 1, wherein the step of adjusting the network parameters of the object detection model to be trained according to the attention response value of each candidate region, the initial confidence of each object class corresponding to the target object in each candidate region, and the object class label comprises:
obtaining a class prediction loss value corresponding to each candidate region according to the initial confidence degree of the target object in each candidate region corresponding to each object class and the object class label;
obtaining a loss weight value of each candidate region according to the attention response value of each candidate region;
obtaining a target class prediction loss value of the object detection model to be trained according to the loss weight value and the class prediction loss value of each candidate region;
and adjusting the network parameters of the object detection model to be trained according to the target class prediction loss value of the object detection model to be trained.
6. The method of claim 1, wherein the sample image further comprises a position information label of the target object in the sample image;
adjusting network parameters of the object detection model to be trained according to the attention response value of each candidate region, the initial confidence of each object class corresponding to the target object in each candidate region and the object class label, wherein the step comprises the following steps:
and adjusting network parameters of the object detection model to be trained according to the attention response value of each candidate region, the initial confidence of each object class corresponding to the target object in each candidate region, the object class label, the position information of the candidate region and the position information label.
7. The method according to claim 1, wherein the step of obtaining, by the object detection model to be trained, candidate regions in which the target objects are located in the sample image and initial confidence levels of the target objects in the candidate regions corresponding to the object classes comprises:
extracting a global feature map of the sample image through the object detection model to be trained;
predicting a candidate region where the target object is located according to the global feature map;
and acquiring the initial confidence of the target object in each candidate region corresponding to each object type according to the candidate regions.
8. The method of claim 7, wherein the step of obtaining an initial confidence level of the target object in each of the candidate regions for each object class according to the candidate regions comprises:
obtaining a local feature map corresponding to each candidate region from the global feature map;
and acquiring initial confidence degrees of the target objects in the candidate regions corresponding to the object categories according to the local feature maps of the candidate regions.
9. An object detection method, comprising:
acquiring an image to be detected;
acquiring a candidate region where a target object in the image to be detected is located and an initial confidence coefficient of the target object in each candidate region corresponding to each object category through a pre-constructed target object detection model; the target object detection model is obtained by adjusting network parameters of an object detection model to be trained according to the attention response value of the candidate region in the sample image, the initial confidence of each object class corresponding to the target object in the candidate region and the object class label of the target object in the sample image; the candidate region of the sample image and the initial confidence of the target object in the candidate region corresponding to each object category are obtained through the object detection model to be trained; the attention response value is obtained from an attention response map obtained according to the initial confidence of the target object in each candidate region corresponding to each object class and the gradient information between the image data of the sample image;
acquiring the target object type of the target object in the candidate region according to the initial confidence of the target object in each candidate region corresponding to each object type;
and outputting the region position information of the candidate region where the target object in the image to be detected is located and the target object type.
10. The method of claim 9, further comprising:
acquiring moving route information according to the region position information of the candidate region where the target object in the image to be detected is located;
and sending the moving route information to a driving device, wherein the moving route information is used for instructing the driving device to move according to a moving route corresponding to the moving route information.
11. An apparatus for training an object detection model, the apparatus comprising:
a sample image acquisition module for acquiring a sample image, the sample image including an object class label of a target object in the sample image; the target object is an object to be detected;
the image processing module is used for acquiring a candidate region where the target object is located in the sample image and an initial confidence coefficient of each object type corresponding to the target object in each candidate region through an object detection model to be trained; the object detection model to be trained is used for detecting a target object in the sample image;
a response map obtaining module, configured to obtain an attention response map of the to-be-trained object detection model for the sample image according to gradient information between an initial confidence of a target object in each candidate region corresponding to each object class and image data of the sample image;
and the response value acquisition module is used for acquiring the attention response value of each candidate region from the attention response graph, adjusting the network parameters of the object detection model to be trained according to the attention response value of each candidate region, the initial confidence of each object class corresponding to the target object in each candidate region and the object class label, and repeating the steps until a convergence condition is met to obtain the target object detection model.
12. Training apparatus for an object detection model according to claim 11, wherein the object classes comprise a foreground object class and a background object class; the object class confidence degrees corresponding to the target objects in the candidate regions comprise a foreground object class confidence degree and a background object class confidence degree;
the response map obtaining module is used for acquiring the total confidence of the foreground object class and the total confidence of the background object class according to the initial confidence of the object class corresponding to the target object in each candidate region; obtaining an attention response graph of the foreground object according to gradient information between the total confidence of the foreground object category and the image data of the sample image; and acquiring an attention response graph of the background object according to the gradient information between the total confidence of the background object category and the image data of the sample image.
13. An object detection apparatus, characterized in that the apparatus comprises:
the image acquisition module is used for acquiring an image to be detected;
a candidate region acquisition module, configured to acquire, through a pre-constructed target object detection model, a candidate region where a target object in the image to be detected is located and initial confidence levels of the target object in each candidate region corresponding to each object class; the target object detection model is obtained by adjusting network parameters of an object detection model to be trained according to the attention response value of the candidate region in the sample image, the initial confidence of each object class corresponding to the target object in the candidate region and the object class label of the target object in the sample image; the candidate region of the sample image and the initial confidence of the target object in the candidate region corresponding to each object category are obtained through the object detection model to be trained; the attention response value is obtained from an attention response map obtained according to the initial confidence of the target object in each candidate region corresponding to each object class and the gradient information between the image data of the sample image;
an object class acquisition module, configured to acquire a target object class of a target object in each candidate region according to an initial confidence that the target object in each candidate region corresponds to each object class;
and the object information output module is used for outputting the region position information of the candidate region where the target object in the image to be detected is located and the type of the target object.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 10 when executing the computer program.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 10.
CN202010535814.0A 2020-06-12 2020-06-12 Object detection model training method and object detection method and device Active CN111709471B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010535814.0A CN111709471B (en) 2020-06-12 2020-06-12 Object detection model training method and object detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010535814.0A CN111709471B (en) 2020-06-12 2020-06-12 Object detection model training method and object detection method and device

Publications (2)

Publication Number Publication Date
CN111709471A true CN111709471A (en) 2020-09-25
CN111709471B CN111709471B (en) 2022-09-23

Family

ID=72540524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010535814.0A Active CN111709471B (en) 2020-06-12 2020-06-12 Object detection model training method and object detection method and device

Country Status (1)

Country Link
CN (1) CN111709471B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075290A1 (en) * 2016-09-09 2018-03-15 Microsoft Technology Licensing, Llc Object detection based on joint feature extraction
US20200134313A1 (en) * 2018-10-30 2020-04-30 Fujitsu Limited Detection method and detection device
CN109697460A (en) * 2018-12-05 2019-04-30 华中科技大学 Object detection model training method, target object detection method
CN110503097A (en) * 2019-08-27 2019-11-26 腾讯科技(深圳)有限公司 Training method, device and the storage medium of image processing model
CN111008622A (en) * 2020-03-11 2020-04-14 腾讯科技(深圳)有限公司 Image object detection method and device and computer readable storage medium
CN111046980A (en) * 2020-03-16 2020-04-21 腾讯科技(深圳)有限公司 Image detection method, device, equipment and computer readable storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022193523A1 (en) * 2021-03-16 2022-09-22 腾讯云计算(北京)有限责任公司 Image processing method and apparatus, device, and storage medium
CN113642565A (en) * 2021-10-15 2021-11-12 腾讯科技(深圳)有限公司 Object detection method, device, equipment and computer readable storage medium
CN113706737A (en) * 2021-10-27 2021-11-26 北京主线科技有限公司 Road surface inspection system and method based on automatic driving vehicle
CN113706737B (en) * 2021-10-27 2022-01-07 北京主线科技有限公司 Road surface inspection system and method based on automatic driving vehicle
CN115294505A (en) * 2022-10-09 2022-11-04 平安银行股份有限公司 Risk object detection and model training method and device and electronic equipment

Also Published As

Publication number Publication date
CN111709471B (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN111709471B (en) Object detection model training method and object detection method and device
CN111666921B (en) Vehicle control method, apparatus, computer device, and computer-readable storage medium
CN111738231B (en) Target object detection method and device, computer equipment and storage medium
WO2021043168A1 (en) Person re-identification network training method and person re-identification method and apparatus
CN112232293B (en) Image processing model training method, image processing method and related equipment
Arietta et al. City forensics: Using visual elements to predict non-visual city attributes
CN112734775B (en) Image labeling, image semantic segmentation and model training methods and devices
CN111797983A (en) Neural network construction method and device
CN112446398A (en) Image classification method and device
CN112232355B (en) Image segmentation network processing method, image segmentation device and computer equipment
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
KR20190029083A (en) Apparatus and Method for learning a neural network
CN113807399A (en) Neural network training method, neural network detection method and neural network detection device
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN110222718A (en) The method and device of image procossing
CN112464930A (en) Target detection network construction method, target detection method, device and storage medium
KR20220081261A (en) Method and apparatus for object pose estimation
WO2023125628A1 (en) Neural network model optimization method and apparatus, and computing device
CN114764870A (en) Object positioning model processing method, object positioning device and computer equipment
CN111008622B (en) Image object detection method and device and computer readable storage medium
CN114627397A (en) Behavior recognition model construction method and behavior recognition method
Vishnu Lohit et al. Multiple object detection mechanism using YOLO
CN113255531B (en) Method and device for processing living body detection model, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40028895

Country of ref document: HK

GR01 Patent grant