CN111160379A - Training method and device of image detection model and target detection method and device

Info

Publication number
CN111160379A
Authority
CN
China
Prior art keywords: network, mask, regression, feature, classification
Prior art date
Legal status: Granted
Application number
CN201811320550.6A
Other languages
Chinese (zh)
Other versions
CN111160379B (en)
Inventor
张修宝
田万鑫
沈海峰
Current Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201811320550.6A
Publication of CN111160379A
Application granted
Publication of CN111160379B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems


Abstract

The application provides a training method and device of an image detection model and a target detection method and device, and relates to the technical field of artificial intelligence. The model training method comprises the following steps: inputting a target training image into an initial model; extracting a feature response graph of the target training image through a backbone network; classifying the feature response graph through a classification sub-network, and calculating a classification loss value based on the classification processing result and a class label; performing regression processing on the feature response graph through a regression sub-network, and calculating a regression loss value based on the regression processing result and a regression label; performing mask processing on the feature response graph through a mask sub-network, and calculating a mask loss value based on the mask processing result and a mask label; and training the initial model based on the classification loss value, the regression loss value and the mask loss value to obtain a target detection model. The method improves the model training process and effectively improves the detection performance of the trained target detection model on images.

Description

Training method and device of image detection model and target detection method and device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a training method and device of an image detection model and a target detection method and device.
Background
Target detection techniques for detecting faces or other objects of interest from images are a prerequisite for a large number of advanced visual tasks, and can be applied to many practical tasks, such as intelligent video surveillance content-based image retrieval, robotic navigation, augmented reality, and the like.
Most of the existing target detection methods need to extract features of an image to be detected by using a trained target detection model, and then perform target classification (such as determining whether the image is a human face) and target positioning (such as determining the position of the human face in the image) based on the extracted features. The existing model training method is only based on the self network structure of the target detection model for training, and the target detection model obtained by the training method has poor image detection effect, such as difficulty in detecting a target to be detected which is fuzzy or partially shielded in an image.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a training method and apparatus for an image detection model, and a target detection method and apparatus, so as to improve a model training method and improve an image detection effect of a trained target detection model.
Mainly comprises the following aspects:
in a first aspect, an embodiment of the present application provides a method for training an image detection model, where the method includes: inputting a target training image into an initial model; the initial model comprises a backbone network and a head network which are connected in sequence, wherein the head network comprises a classification sub-network, a regression sub-network and a mask sub-network which are parallel; the target training image carries a class label, a regression label and a mask label; extracting a feature response graph of the target training image through the backbone network; classifying the feature response graph through the classification sub-network, and calculating a classification loss value based on a classification processing result and the class label; performing regression processing on the feature response graph through the regression sub-network, and calculating a regression loss value based on a regression processing result and the regression label; performing mask processing on the feature response graph through the mask sub-network, and calculating a mask loss value based on a mask processing result and the mask label; training the initial model based on the classification loss value, the regression loss value and the mask loss value to obtain a target detection model; the target detection model is used for detecting a target in an image, and comprises the trained backbone network, the trained classification sub-network and the trained regression sub-network.
With reference to the first aspect, an embodiment of the present application provides a first possible implementation manner of the first aspect, where the backbone network includes a residual network and a feature pyramid network; the step of extracting the feature response graph of the target training image through the backbone network comprises: extracting feature maps of multiple scales of the target training image through the residual network; respectively inputting the feature maps of the various scales to a plurality of network layers of the feature pyramid network, wherein each network layer receives a feature map of one corresponding scale; and performing feature fusion processing on the input feature map through each network layer of the feature pyramid network to obtain a corresponding feature response graph, wherein the receptive fields corresponding to the feature response graphs output by different network layers are different.
With reference to the first possible implementation manner of the first aspect, an embodiment of the present application provides a second possible implementation manner of the first aspect, where the feature pyramid network includes at least four network layers.
With reference to the first possible implementation manner of the first aspect, this application provides a third possible implementation manner of the first aspect, where each network layer of the feature pyramid network is respectively connected to the classification subnetwork, the regression subnetwork, and the mask subnetwork in parallel; the step of classifying the feature response graph by the classification sub-network includes: classifying, through the classification sub-network, the feature response graph output by the network layer of the feature pyramid network connected with the classification sub-network; the step of performing regression processing on the feature response graph through the regression subnetwork includes: performing, through the regression subnetwork, regression processing on the feature response graph output by the network layer of the feature pyramid network connected with the regression subnetwork; and the step of performing mask processing on the feature response graph through the mask subnetwork includes: performing, through the mask subnetwork, mask processing on the feature response graph output by the network layer of the feature pyramid network connected with the mask subnetwork.
With reference to the third possible implementation manner of the first aspect, an embodiment of the present application provides a fourth possible implementation manner of the first aspect, where the mask tags are multiple; each mask label corresponds to the feature map of one scale; the step of calculating a mask loss function value based on the mask processing result and the mask label includes: acquiring a mask label corresponding to the mask subnetwork according to the network layer of the feature pyramid network connected with the mask subnetwork; and calculating a mask loss function value based on the mask processing result of the mask sub-network and the obtained mask label.
With reference to the first aspect, an embodiment of the present application provides a fifth possible implementation manner of the first aspect, where the step of calculating the mask loss value based on the mask processing result and the mask label includes: substituting the mask processing result and the mask label into a cross entropy loss function, and calculating the mask loss value.
With reference to the fifth possible implementation manner of the first aspect, an embodiment of the present application provides a sixth possible implementation manner of the first aspect, where the cross entropy loss function is a Sigmoid cross entropy function.
With reference to the first aspect, an embodiment of the present application provides a seventh possible implementation manner of the first aspect, where the step of training the initial model based on the classification loss value, the regression loss value, and the mask loss value includes: and performing joint training on the backbone network, the classification sub-network, the regression sub-network and the mask sub-network by adopting a back propagation algorithm based on the classification loss value, the regression loss value and the mask loss value until the classification loss value converges to a first preset value, the regression loss value converges to a second preset value, and the training is stopped when the mask loss value converges to a third preset value.
With reference to the first aspect, this application provides an eighth possible implementation manner of the first aspect, where the mask subnetwork includes five convolutional layers connected in sequence; the network parameters of the first four convolutional layers are the same, and the network parameters of the fifth convolutional layer are different from those of the first four convolutional layers.
With reference to the eighth possible implementation manner of the first aspect, this application provides a ninth possible implementation manner of the first aspect, where an output dimension of a fifth convolutional layer of the mask subnetwork is equal to the number of classes of the object to be detected.
In a second aspect, an embodiment of the present application provides a target detection method, where the method is applied to a device configured with a detection model; the detection model is a target detection model obtained by training the method according to one of the first aspect to the ninth possible implementation manner of the first aspect; the method comprises the following steps: acquiring an image of a target to be detected; inputting the image into the object detection model; and determining the category and the position of the target to be detected according to the output result of the target detection model.
With reference to the second aspect, the present application provides a first possible implementation manner of the second aspect, where the step of determining the category and the position of the target to be detected according to the output result of the target detection model includes: determining the category of the target to be detected according to the classification processing result output by the target detection model; and determining the position of the target to be detected according to the regression processing result output by the target detection model.
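For illustration only, this post-processing can be sketched in PyTorch-style code as follows; the offset parameterisation, the tensor shapes and all identifiers (decode_detections, score_thresh, and so on) are assumptions of the sketch and are not prescribed by the application:

```python
import torch

def decode_detections(cls_scores, reg_offsets, anchors, score_thresh=0.5):
    """Derive category and position from the two model outputs (illustrative only).

    cls_scores:  (N, K) per-anchor class scores (before sigmoid)
    reg_offsets: (N, 4) per-anchor offsets (dx, dy, dw, dh) relative to the anchor
    anchors:     (N, 4) anchor boxes as (cx, cy, w, h)
    """
    probs = torch.sigmoid(cls_scores)
    conf, labels = probs.max(dim=1)          # category = best-scoring class per anchor
    keep = conf > score_thresh

    # Position: an assumed Faster R-CNN-style offset parameterisation.
    cx = anchors[:, 0] + reg_offsets[:, 0] * anchors[:, 2]
    cy = anchors[:, 1] + reg_offsets[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * torch.exp(reg_offsets[:, 2])
    h = anchors[:, 3] * torch.exp(reg_offsets[:, 3])
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
    return labels[keep], boxes[keep], conf[keep]
```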
With reference to the second aspect, the present application provides a second possible implementation manner of the second aspect, where the object to be detected includes a human face or a vehicle.
In a third aspect, an embodiment of the present application provides an apparatus for training an image detection model, where the apparatus includes: the image input module is used for inputting the target training image into the initial model; the initial model comprises a backbone network and a head network which are connected in sequence, wherein the head network comprises a classification sub-network, a regression sub-network and a mask sub-network which are parallel; the target training image carries a category label, a regression label and a mask label; the extraction module is used for extracting a characteristic response graph of the target training image through the backbone network; the classification module is used for classifying the characteristic response graph through the classification sub-network and calculating a classification loss value based on a classification processing result and the class label; the regression module is used for carrying out regression processing on the characteristic response graph through the regression subnetwork and calculating to obtain a regression loss value based on a regression processing result and the regression label; the mask module is used for performing mask processing on the feature response graph through the mask sub-network and calculating a mask loss value based on a mask processing result and the mask label; the model generation module is used for training the initial model based on the classification loss value, the regression loss value and the mask loss value to obtain a target detection model; the target detection model is used for detecting a target in an image, and comprises the trained backbone network, the trained classification sub-network and the trained regression sub-network.
With reference to the third aspect, an embodiment of the present application provides a first possible implementation manner of the third aspect, where the backbone network includes a residual network and a feature pyramid network; the extraction module is configured to: extract feature maps of multiple scales of the target training image through the residual network; respectively input the feature maps of the various scales to a plurality of network layers of the feature pyramid network, wherein each network layer receives a feature map of one corresponding scale; and perform feature fusion processing on the input feature map through each network layer of the feature pyramid network to obtain a corresponding feature response graph, wherein the receptive fields corresponding to the feature response graphs output by different network layers are different.
With reference to the first possible implementation manner of the third aspect, the present application provides a second possible implementation manner of the third aspect, where the feature pyramid network includes at least four network layers.
With reference to the first possible implementation manner of the third aspect, the present application provides a third possible implementation manner of the third aspect, where each network layer of the feature pyramid network is respectively connected to the classification subnetwork, the regression subnetwork, and the mask subnetwork in parallel; the classification module is further configured to classify, through the classification sub-network, the feature response graph output by the network layer of the feature pyramid network connected with the classification sub-network; the regression module is further configured to perform, through the regression subnetwork, regression processing on the feature response graph output by the network layer of the feature pyramid network connected with the regression subnetwork; and the masking module is further configured to perform, through the mask sub-network, mask processing on the feature response graph output by the network layer of the feature pyramid network connected with the mask sub-network.
With reference to the third possible implementation manner of the third aspect, the present application provides a fourth possible implementation manner of the third aspect, where the mask tags are multiple; each mask label corresponds to the feature map of one scale; the masking module is further to: acquiring a mask label corresponding to the mask subnetwork according to the network layer of the feature pyramid network connected with the mask subnetwork; and calculating a mask loss function value based on the mask processing result of the mask sub-network and the obtained mask label.
With reference to the third aspect, an embodiment of the present application provides a fifth possible implementation manner of the third aspect, where the mask module is further configured to substitute the mask processing result and the mask label into a cross entropy loss function and calculate the mask loss value.
With reference to the fifth possible implementation manner of the third aspect, an embodiment of the present application provides a sixth possible implementation manner of the third aspect, where the cross entropy loss function is a Sigmoid cross entropy function.
With reference to the third aspect, an embodiment of the present application provides a seventh possible implementation manner of the third aspect, where the model generating module is further configured to: and performing joint training on the backbone network, the classification sub-network, the regression sub-network and the mask sub-network by adopting a back propagation algorithm based on the classification loss value, the regression loss value and the mask loss value until the classification loss value converges to a first preset value, the regression loss value converges to a second preset value, and the training is stopped when the mask loss value converges to a third preset value.
With reference to the third aspect, this application provides an eighth possible implementation manner of the third aspect, where the mask subnetwork includes five convolutional layers connected in sequence; the network parameters of the first four convolutional layers are the same, and the network parameters of the fifth convolutional layer are different from those of the first four convolutional layers.
With reference to the eighth possible implementation manner of the third aspect, this application provides a ninth possible implementation manner of the third aspect, where an output dimension of a fifth convolutional layer of the mask subnetwork is equal to the number of classes of the object to be detected.
In a fourth aspect, the present application provides an object detection apparatus, where the apparatus is applied to a device configured with a detection model; the detection model is a target detection model obtained by training according to the first aspect or any one of the first to ninth possible implementation manners of the first aspect; the device comprises: an image acquisition module, configured to acquire an image of a target to be detected; an image input module, configured to input the image into the target detection model; and a determining module, configured to determine the category and the position of the target to be detected according to the output result of the target detection model.
With reference to the fourth aspect, an embodiment of the present application provides a first possible implementation manner of the fourth aspect, where the determining module is further configured to: determining the category of the target to be detected according to the classification processing result output by the target detection model; and determining the position of the target to be detected according to the regression processing result output by the target detection model.
With reference to the fourth aspect, the present application provides a second possible implementation manner of the fourth aspect, where the object to be detected includes a human face or a vehicle.
In a fifth aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via a bus, wherein the machine-readable instructions, when executed by the processor, perform the steps of the method according to the first aspect or any one of its possible implementation manners, or the second aspect or any one of its possible implementation manners.
In a sixth aspect, this application provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, performs the steps of the method according to the first aspect or any one of its possible implementation manners, or the second aspect or any one of its possible implementation manners.
The embodiments of the application provide a training method and apparatus for an image detection model and a target detection method and apparatus. The adopted initial model comprises a backbone network, a classification sub-network, a regression sub-network and a mask sub-network. A feature response graph of the target training image is first extracted through the backbone network; the classification sub-network, the regression sub-network and the mask sub-network then respectively perform classification processing, regression processing and mask processing on the feature response graph; the initial model is trained on the basis of the classification loss value, the regression loss value and the mask loss value; and finally the target detection model comprising the trained backbone network, classification sub-network and regression sub-network is obtained. According to the method, the mask sub-network is introduced in the model training process, and mask processing is performed on the feature response graph through the mask sub-network to assist in training the backbone network, the classification sub-network and the regression sub-network, so that the feature responsiveness of the trained target detection model to the region where the target is located in the image is higher than that to the region where no target is located. The target to be detected in the image can thus be effectively highlighted, and the detection effect of the trained target detection model on the image is effectively improved.
In order to make the aforementioned objects, features and advantages of the embodiments of the present application more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
FIG. 1 is a flowchart illustrating a method for training an image detection model according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating an initial model according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a head network provided in an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating an initial model according to an embodiment of the present disclosure;
FIG. 5 is a flow chart of a target detection method provided by an embodiment of the present application;
FIG. 6 is a block diagram illustrating an exemplary training apparatus for an image inspection model according to an embodiment of the present disclosure;
FIG. 7 is a block diagram illustrating a structure of an object detection apparatus according to an embodiment of the present application;
FIG. 8 shows a block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The following detailed description of the embodiments of the present application is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
The model training method and device, the target detection method and device, the electronic device and the computer storage medium provided by the embodiment of the application can be applied to target detection tasks, such as detection of faces, vehicles and other interested targets. The embodiment of the present application does not limit a specific application scenario, and any scheme for training a model by using the method provided by the embodiment of the present application and performing target detection by using a target detection model obtained by training in the embodiment of the present application is within the scope of the present application. The following describes embodiments of the present application in detail.
Referring to a flowchart of a training method for an image detection model shown in fig. 1, the method can be applied to electronic devices such as computers, tablet computers, and other intelligent terminals, and includes the following steps:
step S102, inputting a target training image into an initial model; the initial model comprises a backbone network and a head network which are connected in sequence, wherein the head network comprises a classification sub-network, a regression sub-network and a mask sub-network which are parallel; the target training image carries a category label, a regression label, and a mask label. The head network may also be referred to as a head structure portion of the initial model, and is used for finally outputting the detection result. For understanding, referring to fig. 2, a schematic diagram of an initial model is shown, wherein a backbone network is connected to a classification sub-network, a regression sub-network and a mask sub-network in parallel.
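For ease of understanding, a minimal sketch of such an initial model is given below, assuming a PyTorch-style implementation; the class and attribute names (InitialModel, cls_subnet, and so on) are illustrative only, and the concrete sub-network structures are described later with reference to fig. 3:

```python
import torch.nn as nn

class InitialModel(nn.Module):
    """Backbone followed by a head with three parallel sub-networks (names illustrative)."""

    def __init__(self, backbone, cls_subnet, reg_subnet, mask_subnet):
        super().__init__()
        self.backbone = backbone        # extracts the feature response map(s)
        self.cls_subnet = cls_subnet    # classification sub-network
        self.reg_subnet = reg_subnet    # regression sub-network
        self.mask_subnet = mask_subnet  # mask sub-network (used only while training)

    def forward(self, image):
        features = self.backbone(image)
        # The three parallel sub-networks all consume the same feature response map.
        return self.cls_subnet(features), self.reg_subnet(features), self.mask_subnet(features)
```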
And step S104, extracting a characteristic response diagram of the target training image through the backbone network.
The backbone network is mainly used for performing feature extraction on an input target training image, generating a feature response graph of the target training image, and transmitting the feature response graph to a classification sub-network, a regression sub-network and a mask sub-network.
And step S106, carrying out classification processing on the characteristic response graph through a classification sub-network, and calculating to obtain a classification loss value based on a classification processing result and a classification label.
A classification subnet (classification subnet) includes a plurality of convolutional layers, which are mainly used for target classification. In particular, the classification sub-network is responsible for determining whether an object belonging to a category of interest (i.e., an object to be detected in the input image) appears in the input image, and outputting the probability that an object belonging to the corresponding category of interest appears in the image. For example, in a face detection task, the classification sub-network may output a classification result of "whether it is a face".
The loss value is a measure of how close the actual output is to the desired output. The smaller the loss value, the closer the actual output is to the desired output. It can be understood that the classification result is the actual output of the classification sub-network, and the class label is the expected output of the classification sub-network, and the closeness of the actual output of the classification sub-network to the expected output can be obtained through the classification result and the class label calculation. During specific calculation, a preset classification loss function can be adopted for realization.
And step S108, performing regression processing on the characteristic response graph through a regression subnetwork, and calculating to obtain a regression loss value based on a regression processing result and a regression label.
A regression subnetwork (regression subnet) includes a plurality of convolutional layers; it is mainly used for target location. The targeting task may also be considered a regression task. In particular, the regression subnetwork determines the location of objects of the category of interest in the input image, typically outputting a rectangular bounding box of the objects. For example, in the face detection task, the regression sub-network may output "coordinates of a regression box of the face", that is, a rectangular surrounding box of the face predicted by the regression sub-network, to represent a specific position where the face is located.
Similarly, the regression processing result is the actual output of the regression subnetwork, the regression label is the expected output of the regression subnetwork, and the closeness of the actual output of the regression subnetwork to the expected output can be obtained through calculation based on the regression processing result and the regression label. During specific calculation, a preset regression loss function can be adopted.
Step S110, mask processing is performed on the feature response map through a mask subnetwork, and a mask loss value is calculated based on a mask processing result and a mask label.
A mask subnet (mask subnet) includes a plurality of convolutional layers; the method is mainly used for masking the non-target area in the image and outputting a mask image. For example, the human image after mask processing only highlights the face region, and the rest regions are all hidden.
Similarly, the mask processing result is the actual output of the mask subnetwork, the mask label is the expected output of the mask subnetwork, and the closeness of the actual output of the mask subnetwork to the expected output can be obtained through calculation based on the mask processing result and the mask label. During specific calculation, a preset mask loss function can be adopted.
Step S112, training an initial model based on the classification loss value, the regression loss value and the mask loss value to obtain a target detection model; the target detection model is used for detecting a target in an image and comprises a trained backbone network, a classification sub-network and a regression sub-network.
In one embodiment, a back propagation algorithm may be used to jointly train the backbone network, the classification sub-network, the regression sub-network, and the mask sub-network based on the classification loss value, the regression loss value, and the mask loss value until the classification loss value converges to a first preset value, the regression loss value converges to a second preset value, and the training is stopped when the mask loss value converges to a third preset value. That is, the network parameters may be adjusted reversely according to the loss value until the loss value reaches a receivable degree, the training is finished, the network at this time is confirmed to meet the requirements, the expected result may be output, and the target detection may be implemented.
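A minimal sketch of this joint training loop, assuming a PyTorch-style implementation with an SGD optimizer and equal weighting of the three losses (both of which are assumptions of the sketch, not requirements of the method), could look as follows:

```python
import torch

def train_initial_model(model, loader, cls_loss_fn, reg_loss_fn, mask_loss_fn,
                        thresholds=(0.1, 0.1, 0.1), lr=1e-3, max_epochs=100):
    """Jointly train all four networks until each loss falls below its preset value."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(max_epochs):
        for images, cls_labels, reg_labels, mask_labels in loader:
            cls_out, reg_out, mask_out = model(images)
            l_cls = cls_loss_fn(cls_out, cls_labels)
            l_reg = reg_loss_fn(reg_out, reg_labels)
            l_mask = mask_loss_fn(mask_out, mask_labels)
            loss = l_cls + l_reg + l_mask            # equal weighting is an assumption
            optimizer.zero_grad()
            loss.backward()                          # back-propagation through all four networks
            optimizer.step()
        # Stop once every loss has converged to its respective preset value.
        if (l_cls.item() < thresholds[0] and l_reg.item() < thresholds[1]
                and l_mask.item() < thresholds[2]):
            break
    return model
```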
It should be noted that the mask subnetwork is only used during training and mainly serves to assist in training the backbone network, the classification subnetwork and the regression subnetwork. Because the network parameters of the backbone network, the classification subnetwork and the regression subnetwork are influenced by the mask subnetwork, the parameters obtained after training can enhance the feature response of the area where the object to be detected is located. Once training is finished, only the target detection model comprising the trained backbone network, classification subnetwork and regression subnetwork is used for target detection in the actual detection task. That is, in actual detection, the mask subnetwork is not constructed, and the detection time is therefore not affected by it.
In the training method for the image detection model provided in the embodiment of the application, the initial model includes a backbone network, a classification sub-network, a regression sub-network, and a mask sub-network, the feature response graph of the target training image is extracted through the backbone network, then the classification sub-network, the regression sub-network, and the mask sub-network are used to perform classification processing, regression processing, and mask processing on the feature response graph respectively, the initial model is trained based on the classification loss value, the regression loss value, and the mask loss value, and finally the target detection model including the trained backbone network, the classification sub-network, and the regression sub-network is obtained. According to the method, the mask subnetwork is introduced in the model training process, the mask processing is performed on the feature response graph through the mask subnetwork to assist in training the backbone network, the classification subnetwork and the regression subnetwork, the feature responsiveness of the trained target detection model to the region where the target is located in the image is higher than the feature responsiveness of the region where the non-target is located, the target to be detected in the image can be effectively highlighted, and the detection effect of the trained target detection model to the image is effectively improved.
In this embodiment, a specific implementation manner of the head network is given. Referring to the schematic structural diagram of the head network shown in fig. 3, the classification subnetwork, the regression subnetwork, and the mask subnetwork each include five convolutional layers connected in sequence, and in one implementation the size of the convolution kernels in the convolutional layers may be 3 × 3. For each subnetwork, the network parameters of the first four convolutional layers are the same, and the network parameters of the fifth convolutional layer are different from those of the first four convolutional layers. Fig. 3 illustrates that the parameters corresponding to the first four convolutional layers of the classification subnetwork, the regression subnetwork, and the mask subnetwork are W × H × 256, where W × H may be understood as the length and width of the feature map processed by the convolutional layer, and 256 may be understood as the output dimension of the convolutional layer, i.e., the number of convolution kernels in the layer. In practical applications, the first four convolutional layers of the classification subnetwork, the regression subnetwork and the mask subnetwork can share parameters to enhance the self-organization of the network, while the output dimension of the fifth convolutional layer (i.e., the last convolutional layer) differs according to the subnetwork type; that is, the classification subnetwork, the regression subnetwork and the mask subnetwork mainly differ in the last convolutional layer.
Specifically, the output dimension of the last convolutional layer of the classification sub-network is related to the number of anchor points (anchors) at each position in the image and the number of categories of the object to be detected in the image, and the output dimension of the last convolutional layer of the regression sub-network is related to the number of anchor points (anchors) at each position in the image, the number of categories of the object to be detected in the image and the number of position correction parameters of each anchor point relative to a GT (real) frame. For ease of understanding, the detailed description is as follows:
the anchor point (anchor) may also be understood as an anchor frame, in particular as an initial frame or a candidate region, and the anchor point parameters include anchor area (scale) and anchor aspect ratio (aspects). An anchor parameter (i.e., a set of anchor area and anchor aspect ratio) may characterize an anchor. For example, 3 areas and 3 aspect ratios may be combined to form 9 anchors, and each position in the image to be detected may be provided with 9 anchors, such as, for a feature map (feature map) with a size of W × H, where W × H positions (which may be understood as W × H pixels) are included in the feature map, W × H9 anchors may be corresponding, that is, W × H9 initial frames. Anchor points may be used in a single-stage detection model to directly predict detection frames containing target objects.
The number of categories of the target to be detected in the image may be predetermined according to the actual application scenario. For example, if the method is only used for face detection, that is, the target to be detected is a face, the number of categories is 1; if it is used for detecting cats and dogs, the number of categories is 2.
The above position correction parameters of each anchor point relative to the GT (real) frame may be understood as offsets of the GT frame relative to the center point (x, y), height h and width w of the anchor point, that is, the position correction parameters are 4 in total. The GT frame can be understood as a regression frame, and represents the correct position of the target to be detected.
On the basis of understanding the anchor point, the number of categories of the target to be detected, and the position correction parameter, the output dimension of the last convolutional layer of the classification sub-network, the regression sub-network, and the mask sub-network proposed in this embodiment is further described:
for the classification sub-network, the output dimension is K × a because the last convolutional layer outputs the confidence of the anchors at each position in the image to be detected, where a is the number of anchors at each position and K is the number of classes. For face detection, since there is only one class (i.e., face), K is 1, and therefore, in the face detection task, the output dimension of the last convolutional layer of the classification sub-network is W × H × a.
In addition, according to the size characteristics of the human face, in one embodiment the aspect ratios of the anchor points set at each position may be set to 1 and 1.5, while only one anchor size (i.e., the aforementioned area) is used, so that the number A of anchor points at each position of the image is 2.
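As an illustration of how such anchors can be laid on a W × H feature map, the following sketch enumerates every position and attaches A = len(scales) × len(ratios) anchors to it; the stride, the single scale value and the interpretation of the aspect ratio are assumptions of the sketch rather than values fixed by the application:

```python
import itertools
import torch

def make_anchors(feat_h, feat_w, stride, scales=(16.0,), ratios=(1.0, 1.5)):
    """Lay A = len(scales) * len(ratios) anchors on every position of a W x H feature map."""
    anchors = []
    for y, x in itertools.product(range(feat_h), range(feat_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # centre of this position on the input image
        for s, r in itertools.product(scales, ratios):
            # r is interpreted here as height/width; the area is fixed by the scale s.
            w, h = s / (r ** 0.5), s * (r ** 0.5)
            anchors.append([cx, cy, w, h])
    return torch.tensor(anchors)                          # shape: (feat_h * feat_w * A, 4)
```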
For the regression subnetwork, the output dimension is 4 × K × A, because the last convolutional layer outputs the position correction parameters of each anchor point relative to the GT frame, and there are 4 position correction parameters. In the face detection task, K is 1, and the output dimension of the last convolutional layer of the regression subnetwork is 4A.
For the mask sub-network, because the last convolutional layer outputs the mask of the whole image and a class-specific calculation mode is used, that is, different classes use different convolutional kernel groups, the output dimension of the last convolutional layer of the mask sub-network is equal to the number of classes of the target to be detected, that is, the output dimension is K. In the face detection task, if K is 1, the output dimension of the last convolutional layer of the mask subnetwork is 1.
The head network shown in fig. 3 is applied to a face detection task, and the number of classes of the target to be detected is 1, that is, K is 1; therefore fig. 3 illustrates that the output dimension of the last convolutional layer of the classification sub-network is A, the output dimension of the last convolutional layer of the regression sub-network is 4A, and the output dimension of the last convolutional layer of the mask sub-network is 1.
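A minimal sketch of the head network of fig. 3 under these settings (K = 1, A = 2) could look as follows in PyTorch-style code; the ReLU activations, the helper name conv_tower and the decision not to actually share parameters between the three towers are simplifications of this sketch, not requirements of the application:

```python
import torch.nn as nn

def conv_tower(in_ch=256, out_ch=256, n=4):
    """The first four 3x3 convolutional layers (W x H x 256); ReLUs are an assumption of this sketch."""
    layers = []
    for _ in range(n):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        in_ch = out_ch
    return nn.Sequential(*layers)

K, A = 1, 2   # face detection: one class, two anchors per position

# The fifth (last) convolutional layer differs per sub-network, as described above.
cls_subnet  = nn.Sequential(conv_tower(), nn.Conv2d(256, K * A, 3, padding=1))      # output dim K*A = A
reg_subnet  = nn.Sequential(conv_tower(), nn.Conv2d(256, 4 * K * A, 3, padding=1))  # output dim 4*K*A = 4A
mask_subnet = nn.Sequential(conv_tower(), nn.Conv2d(256, K, 3, padding=1))          # output dim K = 1
```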
The present embodiment provides an implementation of a backbone network, where the backbone network includes a residual network and a feature pyramid network.
The residual network is a deep convolutional network. It has a deeper network structure, and compared with a traditional network it adds identity mapping (y = x) shortcut layers, whose main function is to keep the network from degrading as the depth increases; the residual network also converges better. It is generally considered that each layer of a neural network extracts feature information of a different level, for example divided into lower, middle and higher levels; the deeper the network, the more levels of information are extracted and the more levels of information are combined. Therefore, the residual network has better image feature extraction capability. In one embodiment, the residual network may be implemented using a ResNet network.
The Feature Pyramid Network (FPN) extends a standard convolutional network with a top-down pathway and lateral connections, so that rich, multi-scale feature pyramids can be effectively extracted from an input image of a single resolution. The FPN comprises a multi-layer structure, and each layer can detect the image to be detected at a different scale and generate a multi-scale feature map. The FPN can greatly improve the multi-scale prediction capability of a fully convolutional network (FCN).
Referring to the structural schematic diagram of an initial model shown in fig. 4, fig. 4 mainly illustrates that a backbone network includes a residual network and a feature pyramid network, both of which are simple schematic diagrams, for example, the feature pyramid network simply illustrates 3 FPN layers, and in practical application, the feature pyramid network may include 4 network layers (from top to bottom, FPN5 to FPN2 in sequence). Wherein, the network layer of the feature pyramid network can be directly called as FPN layer. In order to expand the detection range of the target, the present embodiment may further increase the levels of the FPN layers, such as two new layers of FPN6 and FPN7, that is, two convolution layers may be added to the 4 FPN layers of the existing feature pyramid network. In practical application, the number of FPN layers of the feature pyramid network can be flexibly reduced according to practical requirements.
The initial model shown in fig. 4 is a network structure with size invariance, that is, no matter how the size of the object to be detected changes, the initial model can be detected by the network structure. For example, the larger the number of FPN layers, the larger the variation range of the receptive field, the more faces with different scales can be detected, and the stronger the size invariance. Wherein, the receptive field can be called as the receiving field, and is defined as the region where the convolution neural network can see the input image, and can be characterized as follows: the feature output is affected by the pixels in the receptive field area. In colloquial terms, a point on a feature map is understood to correspond to an area on an input image. The characteristic diagram with small receptive field is helpful for detecting small targets, and the characteristic diagram with large receptive field is helpful for detecting large targets. The embodiment adopts the characteristic pyramid network at least comprising four network layers, has various receptive fields, and can detect the targets to be detected with different scales.
When extracting the feature response graph of the target training image through the backbone network, the method can be realized by referring to the following steps (an illustrative code sketch follows the list):
(1) extracting feature maps of multiple scales of the target training image through the residual network;
(2) respectively inputting the feature maps of the various scales to a plurality of network layers of the feature pyramid network, wherein each network layer receives a feature map of one corresponding scale;
(3) performing feature fusion processing on the input feature map through each network layer of the feature pyramid network to obtain a corresponding feature response graph, wherein the receptive fields corresponding to the feature response graphs output by different network layers are different.
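The sketch announced above is given here; it assumes a PyTorch-style implementation in which the residual network returns its multi-scale feature maps as a list ordered from fine to coarse, and the default channel counts, the 1 × 1 lateral convolutions, the nearest-neighbour upsampling and the 3 × 3 smoothing convolutions are assumptions of the sketch rather than requirements of the application:

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down feature fusion over the multi-scale feature maps of the residual network."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions bring every scale to a uniform dimension (e.g. 256).
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):                      # feats: residual-network maps, fine -> coarse
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # The topmost layer has no higher-level map to fuse; every other layer adds the
        # upsampled feature response map passed down from the layer above it.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        # One feature response map per network layer, with different receptive fields.
        return [sm(p) for sm, p in zip(self.smooth, laterals)]

# Illustrative usage: responses = SimpleFPN()(resnet_multi_scale_feature_maps)
```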
In addition, fig. 4 also illustrates that each network layer of the feature pyramid network is connected to a head network, i.e., to the classification subnetwork, the regression subnetwork, and the mask subnetwork in parallel. It will be appreciated that fig. 4 simply illustrates three network layers of the feature pyramid network, and that each network layer is connected to a head network, so three head networks are drawn. In fact, the network structure and network parameters of the three head networks are the same, so they can essentially be regarded as one head network; fig. 4 draws them separately only for ease of understanding.
When the feature response graphs are classified through the classification sub-network, specifically, the classification sub-network classifies the feature response graph output by the network layer of the feature pyramid network connected with the classification sub-network; when the feature response graphs are processed by the regression sub-network, specifically, the regression sub-network performs regression processing on the feature response graph output by the network layer of the feature pyramid network connected with the regression sub-network; and when the mask sub-network performs mask processing on the feature response graph, specifically, the mask sub-network performs mask processing on the feature response graph output by the network layer of the feature pyramid network connected with the mask sub-network.
In practical applications, the feature response maps of all network layers of the feature pyramid network may also be passed through a 3 × 3 convolutional layer for dimension transformation before being input to the head network, so that the output dimension is a uniform value, such as 256.
The training process of the initial model is further elucidated below in conjunction with fig. 4. The target training image is first input to the residual network to obtain feature maps of multiple scales output by the residual network; the residual network inputs the feature maps of the multiple scales to the feature pyramid network, and the feature pyramid network processes them to generate the feature response maps. Specifically, each FPN layer corresponds to a feature map of one scale and generates a corresponding feature response map through feature fusion processing. It should be noted that for every FPN layer except the topmost one, the feature fusion processing uses the fused feature map passed down from the FPN layer above together with the feature map received by the layer itself; the result of the feature fusion processing of the topmost FPN layer is still the feature map received by that layer, because there is no higher FPN layer to pass a fused feature map down. For example, when n is not the maximum level of the feature pyramid network (e.g., n is less than 4 when the feature pyramid network has four layers, and less than 6 when it has six layers), the feature response map Pn corresponding to the FPNn layer is the fusion result of the feature response map Pn+1 passed down from the layer above and the feature map Pn′ received by the FPNn layer from the residual network at the corresponding scale. When n is the maximum level of the feature pyramid network (e.g., n is 4 when the feature pyramid network has four layers, and 6 when it has six layers), the processing result of the FPNn layer is still the feature map Pn′. Each FPN layer outputs the feature response map processed by that layer to the head network (i.e., the classification sub-network, the regression sub-network and the mask sub-network of the initial model). Each sub-network performs its corresponding processing operation on the received feature response map and obtains a corresponding processing result; the respective loss functions are then calculated to obtain the classification loss value, the regression loss value and the mask loss value; the backbone network, the classification sub-network, the regression sub-network and the mask sub-network are jointly trained based on these loss values by a back-propagation algorithm; and the network parameters of each network are finally determined.
The network structure shown in fig. 4 can be understood as a one-stage (one-stage) detection model. The single-stage (one-stage) detection model is different from the two-stage (two-stage) detection model in that the single-stage detection model directly predicts a target frame by using an initial frame; the two-stage detection model is to predict a candidate frame by adopting an initial frame and then predict a target frame based on the candidate frame.
Based on this, when the model provided in the embodiment of the present application performs target detection on an input image, the standard frames (i.e., initial frames) of SFD (Single Shot Scale-invariant Face Detector) may be used for anchor setting, and standard frames matched with the receptive field size of the feature response map of each network layer are laid on each network layer of the feature pyramid.
Since the model provided by the embodiment of the application is a single-stage detection model, the target frame (such as the finally detected face bounding frame) can be found directly from the initial frames. In a specific implementation, initial frames matched with the receptive fields of the corresponding feature response maps are arranged on the different FPN layers, so that the features corresponding to each layer can be extracted according to the initial frames and the corresponding fused feature maps are generated.
In the initial model of fig. 4, since there are feature maps of various scales, there are also multiple mask labels; each mask label corresponds to the feature map of one scale. A mask label can be understood as a mask image: the receptive fields of different mask labels differ, and so do the sizes of the targets to be detected that they correspond to. Taking a human face as the target to be detected as an example, the mask image shows only the face, and the non-face part is hidden by the mask processing. For example, in the mask image corresponding to an image containing a person, only the face area is bright and the rest is black. The representation of the face region in the mask image depends on the receptive field corresponding to the mask label.
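One simple way to build such per-scale mask labels from the rectangular face bounding boxes is sketched below; how ground-truth boxes are assigned to FPN levels is not spelled out here, so the function only illustrates the basic idea of lighting up the box region at one level's resolution (all names are illustrative):

```python
import torch

def make_mask_label(gt_boxes, feat_h, feat_w, stride):
    """One mask label for one FPN level: 1 inside the (rescaled) face boxes, 0 elsewhere."""
    mask = torch.zeros(feat_h, feat_w)
    for x1, y1, x2, y2 in gt_boxes:                    # boxes given in input-image coordinates
        # Project the rectangular bounding box onto this level's feature-map resolution.
        fx1, fy1 = int(x1 / stride), int(y1 / stride)
        fx2, fy2 = int(x2 / stride) + 1, int(y2 / stride) + 1
        mask[max(fy1, 0):min(fy2, feat_h), max(fx1, 0):min(fx2, feat_w)] = 1.0
    return mask
```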
By adopting a plurality of mask labels with different receptive fields, the supervision information used for training can be layered. Specifically, the supervision information for detection is associated with the different network layers of the feature pyramid, and corresponding supervision information is set for each layer, so that the rectangular bounding box of the target to be detected obtained through each network layer matches, in size, the standard frames laid on the feature map of that network layer.
In the training process, when the mask loss value is calculated from the mask processing result and the mask label, the mask label corresponding to a mask sub-network can be obtained according to the network layer of the feature pyramid network to which that mask sub-network is connected, and the mask loss value is then calculated based on the mask processing result of the mask sub-network and the obtained mask label. In the embodiment of the present application, the mask loss value may be calculated with a cross-entropy loss function; specifically, the mask processing result and the mask label are substituted into the cross-entropy loss function to obtain the mask loss value. In one embodiment, the cross entropy is computed on sigmoid outputs, that is, conventional pixel-level sigmoid cross-entropy can be used.
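The pixel-level sigmoid cross-entropy mentioned above can be sketched as follows (a minimal illustration assuming the mask sub-network outputs per-pixel logits of the same shape as the mask label).

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, mask_label):
    """Pixel-level sigmoid cross-entropy between the mask sub-network output
    (logits, shape [N, C, H, W]) and a binary mask label of the same shape."""
    return F.binary_cross_entropy_with_logits(mask_logits, mask_label)

# Toy usage: a 1-class mask on a 4x4 feature response map.
logits = torch.randn(1, 1, 4, 4)
label = torch.randint(0, 2, (1, 1, 4, 4)).float()
loss = mask_loss(logits, label)
```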
In summary, in the training method for the image detection model provided in this embodiment, the mask sub-network is introduced to assist in training the backbone network, the classification sub-network and the regression sub-network, so that the finally trained target detection model (comprising the backbone network, the classification sub-network and the regression sub-network) responds more strongly to the region of the image where the target is located than to non-target regions; that is, the target to be detected in the image can be effectively highlighted. Strengthening the feature response of the target region helps detect occluded or blurred targets, so this training approach effectively improves the detection performance of the trained target detection model. By contrast, the related art mostly uses attention networks to shape the feature response map of a target detection model so that it responds more strongly in target regions and more weakly in non-target regions. However, because such an attention network is a parallel sub-network, it has little influence during training on the parameters of the backbone portion that precedes the head network, and it ultimately changes the network's feature response abruptly by multiplying it with a weight map, which is an inferior way to modulate the response. The mask sub-network introduced in the embodiment of the application, in contrast, influences the network parameters of the backbone network, the classification sub-network and the regression sub-network as a whole: through the joint training of the backbone network, the mask sub-network, the classification sub-network and the regression sub-network, the influence of the mask sub-network on the overall network parameters during training is increased through shared parameters, thereby increasing the feature response of the target region. In addition, the mask sub-network introduced in the embodiment of the application is used only during model training; once training is finished, the resulting target detection model comprises only the backbone network, the classification sub-network and the regression sub-network and no longer contains the mask sub-network, so no mask sub-network needs to be constructed at detection time, and detection time is unaffected while target detection accuracy is improved.
In addition, the embodiment of the application trains the model with higher-level supervision information (namely the segmentation information of the image), and can achieve a better effect than an attention network. For ease of understanding, take face detection as an example: the supervision information for face detection is the rectangular bounding box of every face in the image, and the particularity of face detection is that the face region is dense, with essentially no gaps, so the face fills most of its rectangular bounding box. From this characteristic it can be concluded that the outline of a face is roughly similar to its rectangular bounding box, which is equivalent to saying that the supervision information used for face detection is similar to the supervision information used for semantic segmentation of the image. The most direct embodiment of this in the present application is to train the network with the rectangular bounding boxes of faces (approximating the face contours) as the segmentation information of the image, which is also a further improvement over existing object detection algorithms.
The embodiment of the application further provides a target detection method, which is applied to a device configured with a detection model; the detection model is the target detection model obtained with the model training method described above. Referring to the flow chart of the target detection method shown in fig. 5, the method includes the following steps:
Step S502, acquiring an image of the target to be detected. The target to be detected may include a human face or a vehicle, and may also be another type of target, which is not detailed here.
Step S504, the image is input into the target detection model.
Step S506, determining the category and the position of the target to be detected according to the output result of the target detection model.
Specifically, the category of the target to be detected is determined according to the classification processing result output by the target detection model; and determining the position of the target to be detected according to the regression processing result output by the target detection model.
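For illustration, a minimal inference sketch is given below; the output format of the model (per-box classification scores plus regressed boxes) and the score threshold are assumptions, not specifics of the application.

```python
import torch

def detect(model, image, score_thresh=0.5):
    """Run the trained detection model (backbone + classification and
    regression sub-networks only) and read off category and position."""
    model.eval()
    with torch.no_grad():
        cls_scores, boxes = model(image.unsqueeze(0))  # assumed output format
    scores, labels = cls_scores.sigmoid().max(dim=-1)
    keep = scores > score_thresh
    # Category from the classification result, position from the regression result.
    return labels[keep], boxes[keep], scores[keep]
```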
According to the target detection method provided by this embodiment, the category and the position of the target to be detected are determined through the target detection model. Because the target detection model is trained with the model training method described above, its feature response to the region of the image where the target is located is higher than its response to non-target regions; that is, the target to be detected in the image can be effectively highlighted, and the target detection accuracy is improved.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific training process of the target detection model in the target detection method described above may refer to the corresponding process in the foregoing embodiment and is not repeated here. No loss function values need to be calculated during the target detection process. The target detection model differs from the initial model of the embodiment of the present application only in that it has no mask sub-network; the rest of the structure is the same. The network parameters of the backbone network, the classification sub-network and the regression sub-network in the target detection model are obtained through training assisted by the mask sub-network.
Corresponding to the aforementioned training method of the image detection model, this embodiment further provides a training apparatus of the image detection model. Referring to the structural block diagram of the model training apparatus shown in fig. 6, the apparatus includes:
an image input module 602, configured to input a target training image into an initial model; the initial model comprises a backbone network and a head network which are connected in sequence, wherein the head network comprises a classification sub-network, a regression sub-network and a mask sub-network which are parallel; the target training image carries a category label, a regression label and a mask label;
an extracting module 604, configured to extract a feature response map of the target training image through a backbone network;
a classification module 606, configured to perform classification processing on the feature response graph through a classification subnetwork, and calculate a classification loss value based on a classification processing result and a category label;
the regression module 608 is configured to perform regression processing on the feature response graph through a regression subnetwork, and calculate a regression loss value based on a regression processing result and a regression label;
the mask module 610 is configured to perform mask processing on the feature response map through a mask subnetwork, and calculate a mask loss value based on a mask processing result and a mask label;
a model generation module 612, configured to train the initial model based on the classification loss value, the regression loss value, and the mask loss value to obtain a target detection model; the target detection model is used for detecting a target in an image and comprises a trained backbone network, a classification sub-network and a regression sub-network.
In the training apparatus for the image detection model provided in the embodiment of the application, the initial model used includes a backbone network, a classification subnetwork, a regression subnetwork and a mask subnetwork, the feature response graph of the target training image is first extracted through the backbone network, then the classification subnetwork, the regression subnetwork and the mask subnetwork are used to perform classification processing, regression processing and mask processing on the feature response graph respectively, the initial model is trained based on the classification loss value, the regression loss value and the mask loss value, and finally the target detection model including the trained backbone network, the classification subnetwork and the regression subnetwork is obtained. According to the method, the mask subnetwork is introduced in the model training process, the mask processing is performed on the feature response graph through the mask subnetwork to assist in training the backbone network, the classification subnetwork and the regression subnetwork, the feature responsiveness of the trained target detection model to the region where the target is located in the image is higher than the feature responsiveness of the region where the non-target is located, the target to be detected in the image can be effectively highlighted, and the detection effect of the trained target detection model to the image is effectively improved.
In one embodiment, the backbone network includes a residual network and a feature pyramid network, and the extracting module is configured to: extract feature maps of multiple scales from the target training image through the residual network; input the feature maps of the respective scales into the network layers of the feature pyramid network, each network layer receiving the feature map of one scale; and perform feature fusion processing on the input feature map through each network layer of the feature pyramid network to obtain the corresponding feature response map, where the feature response maps output by different network layers correspond to different receptive fields.
In one embodiment, the feature pyramid network has at least 4 network layers.
In one embodiment, each network layer of the feature pyramid network is connected with a classification sub-network, a regression sub-network and a mask sub-network in parallel. The classification module is further configured to perform classification processing, through the classification sub-network, on the feature response graph output by the network layer of the feature pyramid network to which the classification sub-network is connected; the regression module is further configured to perform regression processing, through the regression sub-network, on the feature response graph output by the network layer of the feature pyramid network to which the regression sub-network is connected; and the mask module is further configured to perform mask processing, through the mask sub-network, on the feature response graph output by the network layer of the feature pyramid network to which the mask sub-network is connected.
In one embodiment, there are a plurality of mask labels, and each mask label corresponds to the feature map of one scale. The mask module is further configured to: obtain the mask label corresponding to the mask sub-network according to the network layer of the feature pyramid network to which the mask sub-network is connected; and calculate the mask loss value based on the mask processing result of the mask sub-network and the obtained mask label.
In one embodiment, the mask module is further configured to substitute the mask processing result and the mask label into a cross-entropy loss function and calculate the mask loss value. In a specific embodiment, pixel-level sigmoid cross-entropy is used.
In one embodiment, the model generation module is further configured to jointly train the backbone network, the classification sub-network, the regression sub-network and the mask sub-network by a back propagation algorithm based on the classification loss value, the regression loss value and the mask loss value, and to stop training when the classification loss value converges to a first preset value, the regression loss value converges to a second preset value, and the mask loss value converges to a third preset value.
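A minimal sketch of such joint back-propagation training is shown below; the concrete loss functions for the classification and regression branches, the equal loss weighting and the preset convergence thresholds are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def joint_train(model, loader, optimizer,
                thr_cls=0.05, thr_reg=0.05, thr_mask=0.05):
    """Joint back-propagation training of the backbone and the three parallel
    sub-networks; stops once each loss falls below its preset value.
    The loss choices (cross-entropy / smooth L1) and thresholds are assumed."""
    for images, cls_t, reg_t, mask_t in loader:
        cls_out, reg_out, mask_out = model(images)
        loss_cls = F.binary_cross_entropy_with_logits(cls_out, cls_t)
        loss_reg = F.smooth_l1_loss(reg_out, reg_t)
        loss_mask = F.binary_cross_entropy_with_logits(mask_out, mask_t)
        loss = loss_cls + loss_reg + loss_mask   # shared gradients reach the backbone
        optimizer.zero_grad()
        loss.backward()                          # back propagation through all sub-networks
        optimizer.step()
        if loss_cls < thr_cls and loss_reg < thr_reg and loss_mask < thr_mask:
            break                                # all three losses have converged
    return model
```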
In one embodiment, the masking subnetwork includes five convolutional layers connected in sequence; the network parameters of the first four convolutional layers are the same, and the network parameters of the fifth convolutional layer are different from those of the first four convolutional layers. In a specific embodiment, the output dimension of the fifth convolutional layer of the masking subnetwork is equal to the number of classes of the object to be detected.
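Under the assumption that "the same network parameters" here means the same layer configuration (256 channels, 3x3 kernels are assumed values), the mask sub-network described above might be sketched as follows.

```python
import torch.nn as nn

def build_mask_subnetwork(in_channels=256, num_classes=1):
    """Five sequential conv layers: the first four share one configuration
    (assumed 256-channel 3x3 convs), while the fifth maps to one output
    channel per class of target to be detected."""
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_channels = 256
    layers.append(nn.Conv2d(256, num_classes, kernel_size=3, padding=1))
    return nn.Sequential(*layers)
```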
The device provided by the embodiment has the same implementation principle and technical effect as the foregoing embodiment, and for the sake of brief description, reference may be made to the corresponding contents in the foregoing method embodiment for the portion of the embodiment of the device that is not mentioned.
Corresponding to the foregoing target detection method, this embodiment further provides a target detection apparatus (see the structural block diagram shown in fig. 7) that is applied to a device configured with a detection model; the detection model is a target detection model trained by any one of the model training methods above. The apparatus includes:
an image obtaining module 702 is configured to obtain an image of a target to be detected. In specific application, the target to be detected comprises a human face or a vehicle.
An image input module 704, configured to input the image into the target detection model.
A determining module 706, configured to determine the category and the position of the target to be detected according to the output result of the target detection model.
The target detection apparatus provided by this embodiment determines the category and the position of the target to be detected through the target detection model. Because the target detection model is trained with the model training method described above, its feature response to the region of the image where the target is located is higher than its response to non-target regions; that is, the target to be detected in the image can be effectively highlighted, and the target detection accuracy is improved.
In one embodiment, the determining module is further configured to: determining the category of the target to be detected according to the classification processing result output by the target detection model; and determining the position of the target to be detected according to the regression processing result output by the target detection model.
The device provided by the embodiment has the same implementation principle and technical effect as the foregoing embodiment, and for the sake of brief description, reference may be made to the corresponding contents in the foregoing method embodiment for the portion of the embodiment of the device that is not mentioned.
The present embodiment also provides an electronic device, including: the system comprises a processor, a memory and a bus, wherein the memory stores machine readable instructions executable by the processor, the processor and the memory are communicated through the bus, and the machine readable instructions are executed by the processor to execute any one of the model training methods or any one of the target detection methods.
Referring to the schematic structural diagram of an electronic device shown in fig. 8, the electronic device specifically includes a processor 80, a memory 81, a bus 82, and a communication interface 83, where the processor 80, the communication interface 83, and the memory 81 are connected through the bus 82; the processor 80 is arranged to execute executable modules, such as computer programs, stored in the memory 81.
The memory 81 may include a high-speed random access memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between this system's network element and at least one other network element is realized through at least one communication interface 83 (wired or wireless), which may use the Internet, a wide area network, a local area network, a metropolitan area network, etc.
Bus 82 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 8, but that does not indicate only one bus or one type of bus.
The memory 81 is used for storing a program, the processor 80 executes the program after receiving an execution instruction, and the method performed by the apparatus defined by the flow process disclosed in any of the embodiments of the present application may be applied to the processor 80, or implemented by the processor 80.
The processor 80 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 80. The Processor 80 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in a memory 81, and the processor 80 reads the information in the memory 81 and performs the steps of the above method in combination with its hardware.
The training method and/or the target detection method of the image detection model provided in this embodiment may be executed by the electronic device, or the model training apparatus and/or the target detection apparatus provided in this embodiment may be disposed on the electronic device side.
Further, the present embodiment also provides a computer storage medium, in which a computer program is stored, and the computer program is executed by a processor to perform any one of the model training methods or any one of the object detection methods.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (28)

1. A method for training an image detection model, the method comprising:
inputting a target training image into an initial model; the initial model comprises a backbone network and a head network which are connected in sequence, wherein the head network comprises a classification sub-network, a regression sub-network and a mask sub-network which are parallel; the target training image carries a category label, a regression label and a mask label;
extracting a feature response graph of the target training image through the backbone network;
classifying the feature response graph through the classification sub-network, and calculating to obtain a classification loss value based on a classification processing result and the category label;
performing regression processing on the feature response graph through the regression sub-network, and calculating to obtain a regression loss value based on a regression processing result and the regression label;
performing mask processing on the feature response graph through the mask sub-network, and calculating a mask loss value based on a mask processing result and the mask label;
training the initial model based on the classification loss value, the regression loss value and the mask loss value to obtain a target detection model; the target detection model is used for detecting a target in an image, and comprises the trained backbone network, the trained classification sub-network and the trained regression sub-network.
2. The method of claim 1, wherein the backbone network comprises a residual network and a feature pyramid network;
the step of extracting the characteristic response graph of the target training image through a backbone network comprises the following steps:
extracting feature maps of multiple scales of the target training image through the residual error network;
respectively inputting feature maps of various scales to a plurality of network layers of the feature pyramid network; wherein, each network layer correspondingly inputs a feature map with a scale;
performing feature fusion processing on the input feature graph through each network layer of the feature pyramid network to obtain a corresponding feature response graph; wherein, the corresponding receptive fields of the characteristic response graphs output by different network layers are different.
3. The method of claim 2, wherein the network layer of the feature pyramid network is at least 4 layers.
4. The method of claim 2, wherein each network layer of the feature pyramid network is connected to the classification subnetwork, the regression subnetwork, and the mask subnetwork, respectively, in parallel;
the step of classifying the feature response graph by the classification sub-network includes: classifying the feature response graph output by the network layer of the feature pyramid network connected with the classification sub-network through the classification sub-network;
the step of performing regression processing on the feature response graph through the regression sub-network includes: performing regression processing on the feature response graph output by the network layer of the feature pyramid network connected with the regression sub-network through the regression sub-network;
the step of performing masking processing on the feature response graph through the masking subnetwork includes: and performing masking processing on the feature response graph output by the network layer of the feature pyramid network connected with the mask sub-network through the mask sub-network.
5. The method of claim 4, wherein there are a plurality of the mask labels, and each mask label corresponds to the feature map of one scale;
the step of calculating a mask loss function value based on the mask processing result and the mask label includes:
acquiring a mask label corresponding to the mask subnetwork according to the network layer of the feature pyramid network connected with the mask subnetwork;
and calculating a mask loss function value based on the mask processing result of the mask sub-network and the obtained mask label.
6. The method of claim 1, wherein the step of calculating a mask loss value based on the mask processing result and the mask label comprises: substituting the mask processing result and the mask label into a cross entropy loss function, and calculating to obtain the mask loss value.
7. The method of claim 6, wherein the cross-entropy loss function is a Sigmoid function.
8. The method of claim 1, wherein the step of training the initial model based on the classification loss value, the regression loss value, and the mask loss value comprises:
and performing joint training on the backbone network, the classification sub-network, the regression sub-network and the mask sub-network by adopting a back propagation algorithm based on the classification loss value, the regression loss value and the mask loss value until the classification loss value converges to a first preset value, the regression loss value converges to a second preset value, and the training is stopped when the mask loss value converges to a third preset value.
9. The method of claim 1, wherein the masking subnetwork comprises five convolutional layers connected in sequence; the network parameters of the first four convolutional layers are the same, and the network parameters of the fifth convolutional layer are different from those of the first four convolutional layers.
10. The method of claim 9, wherein the output dimension of the fifth convolutional layer of the masking subnetwork is equal to the number of classes of objects to be detected.
11. An object detection method is characterized in that the method is applied to equipment configured with a detection model; the detection model is a target detection model obtained by training according to the method of any one of claims 1 to 10; the method comprises the following steps:
acquiring an image of a target to be detected;
inputting the image into the object detection model;
and determining the category and the position of the target to be detected according to the output result of the target detection model.
12. The method of claim 11, wherein the step of determining the category and the location of the object to be detected based on the output of the object detection model comprises:
determining the category of the target to be detected according to the classification processing result output by the target detection model;
and determining the position of the target to be detected according to the regression processing result output by the target detection model.
13. The method of claim 11, wherein the object to be detected comprises a human face or a vehicle.
14. An apparatus for training an image inspection model, the apparatus comprising:
the image input module is used for inputting the target training image into the initial model; the initial model comprises a backbone network and a head network which are connected in sequence, wherein the head network comprises a classification sub-network, a regression sub-network and a mask sub-network which are parallel; the target training image carries a category label, a regression label and a mask label;
the extraction module is used for extracting a feature response graph of the target training image through the backbone network;
the classification module is used for classifying the feature response graph through the classification sub-network and calculating a classification loss value based on a classification processing result and the category label;
the regression module is used for carrying out regression processing on the feature response graph through the regression sub-network and calculating to obtain a regression loss value based on a regression processing result and the regression label;
the mask module is used for performing mask processing on the feature response graph through the mask sub-network and calculating a mask loss value based on a mask processing result and the mask label;
the model generation module is used for training the initial model based on the classification loss value, the regression loss value and the mask loss value to obtain a target detection model; the target detection model is used for detecting a target in an image, and comprises the trained backbone network, the trained classification sub-network and the trained regression sub-network.
15. The apparatus of claim 14, wherein the backbone network comprises a residual network and a feature pyramid network; the extraction module is used for:
extracting feature maps of multiple scales of the target training image through the residual error network;
respectively inputting feature maps of various scales to a plurality of network layers of the feature pyramid network; wherein, each network layer correspondingly inputs a feature map with a scale;
performing feature fusion processing on the input feature graph through each network layer of the feature pyramid network to obtain a corresponding feature response graph; wherein, the corresponding receptive fields of the characteristic response graphs output by different network layers are different.
16. The apparatus of claim 15, wherein the network layer of the feature pyramid network is at least 4 layers.
17. The apparatus of claim 15, wherein each network layer of the feature pyramid network is connected to the classification subnetwork, the regression subnetwork, and the mask subnetwork, respectively, in parallel;
the classification module is further to: classifying the feature response graph output by the network layer of the feature pyramid network connected with the classification sub-network through the classification sub-network;
the regression module is further to: perform regression processing, through the regression sub-network, on the feature response graph output by the network layer of the feature pyramid network connected with the regression sub-network;
the masking module is further to: perform masking processing, through the mask sub-network, on the feature response graph output by the network layer of the feature pyramid network connected with the mask sub-network.
18. The apparatus of claim 17, wherein there are a plurality of the mask labels, and each mask label corresponds to the feature map of one scale; the masking module is further to:
acquiring a mask label corresponding to the mask subnetwork according to the network layer of the feature pyramid network connected with the mask subnetwork;
and calculating a mask loss function value based on the mask processing result of the mask sub-network and the obtained mask label.
19. The apparatus of claim 14, wherein the masking module is further to: substitute the mask processing result and the mask label into a cross entropy loss function, and calculate to obtain a mask loss value.
20. The apparatus of claim 19, wherein the cross entropy loss function is a Sigmoid function.
21. The apparatus of claim 14, wherein the model generation module is further to:
and performing joint training on the backbone network, the classification sub-network, the regression sub-network and the mask sub-network by adopting a back propagation algorithm based on the classification loss value, the regression loss value and the mask loss value until the classification loss value converges to a first preset value, the regression loss value converges to a second preset value, and the training is stopped when the mask loss value converges to a third preset value.
22. The apparatus of claim 14, wherein the masking subnetwork comprises five convolutional layers connected in sequence; the network parameters of the first four convolutional layers are the same, and the network parameters of the fifth convolutional layer are different from those of the first four convolutional layers.
23. The apparatus of claim 22, wherein the output dimension of the fifth convolutional layer of the masking subnetwork is equal to the number of classes of objects to be detected.
24. An object detection apparatus, characterized in that the apparatus is applied to a device equipped with a detection model; the detection model is a target detection model obtained by training according to the method of any one of claims 1 to 10; the device comprises:
the image acquisition module is used for acquiring an image of a target to be detected;
an image input module for inputting the image into the target detection model;
and the determining module is used for determining the category and the position of the target to be detected according to the output result of the target detection model.
25. The apparatus of claim 24, wherein the determination module is further configured to:
determining the category of the target to be detected according to the classification processing result output by the target detection model;
and determining the position of the target to be detected according to the regression processing result output by the target detection model.
26. The apparatus of claim 24, wherein the object to be detected comprises a human face or a vehicle.
27. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine readable instructions executable by the processor, the processor and the memory communicating over the bus, the machine readable instructions when executed by the processor performing the method of any of claims 1 to 10 or performing the method of any of claims 11 to 13.
28. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, performs the method of any one of claims 1 to 10, or performs the method of any one of claims 11 to 13.
CN201811320550.6A 2018-11-07 2018-11-07 Training method and device of image detection model, and target detection method and device Active CN111160379B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811320550.6A CN111160379B (en) 2018-11-07 2018-11-07 Training method and device of image detection model, and target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811320550.6A CN111160379B (en) 2018-11-07 2018-11-07 Training method and device of image detection model, and target detection method and device

Publications (2)

Publication Number Publication Date
CN111160379A true CN111160379A (en) 2020-05-15
CN111160379B CN111160379B (en) 2023-09-15

Family

ID=70555005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811320550.6A Active CN111160379B (en) 2018-11-07 2018-11-07 Training method and device of image detection model, and target detection method and device

Country Status (1)

Country Link
CN (1) CN111160379B (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667011A (en) * 2020-06-08 2020-09-15 平安科技(深圳)有限公司 Damage detection model training method, damage detection model training device, damage detection method, damage detection device, damage detection equipment and damage detection medium
CN111695609A (en) * 2020-05-26 2020-09-22 平安科技(深圳)有限公司 Target damage degree determination method, target damage degree determination device, electronic device, and storage medium
CN111784611A (en) * 2020-07-03 2020-10-16 厦门美图之家科技有限公司 Portrait whitening method, portrait whitening device, electronic equipment and readable storage medium
CN111814867A (en) * 2020-07-03 2020-10-23 浙江大华技术股份有限公司 Defect detection model training method, defect detection method and related device
CN111860260A (en) * 2020-07-10 2020-10-30 逢亿科技(上海)有限公司 High-precision low-computation target detection network system based on FPGA
CN111931929A (en) * 2020-07-29 2020-11-13 深圳地平线机器人科技有限公司 Training method and device of multi-task model and storage medium
CN112132018A (en) * 2020-09-22 2020-12-25 平安国际智慧城市科技股份有限公司 Traffic police recognition method, traffic police recognition device, traffic police recognition medium and electronic equipment
CN112131925A (en) * 2020-07-22 2020-12-25 浙江元亨通信技术股份有限公司 Construction method of multi-channel characteristic space pyramid
CN112215271A (en) * 2020-09-27 2021-01-12 武汉理工大学 Anti-occlusion target detection method and device based on multi-head attention mechanism
CN112308150A (en) * 2020-11-02 2021-02-02 平安科技(深圳)有限公司 Target detection model training method and device, computer equipment and storage medium
CN112633156A (en) * 2020-12-22 2021-04-09 浙江大华技术股份有限公司 Vehicle detection method, image processing apparatus, and computer-readable storage medium
CN113011309A (en) * 2021-03-15 2021-06-22 北京百度网讯科技有限公司 Image recognition method, apparatus, device, medium, and program product
CN113095288A (en) * 2021-04-30 2021-07-09 浙江吉利控股集团有限公司 Obstacle missing detection repairing method, device, equipment and storage medium
CN113591893A (en) * 2021-01-26 2021-11-02 腾讯医疗健康(深圳)有限公司 Image processing method and device based on artificial intelligence and computer equipment
CN113849088A (en) * 2020-11-16 2021-12-28 阿里巴巴集团控股有限公司 Target picture determining method and device
CN114519378A (en) * 2021-12-24 2022-05-20 浙江大华技术股份有限公司 Training method of feature extraction unit, face recognition method and device
WO2022141858A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Pedestrian detection method and apparatus, electronic device, and storage medium
CN114998575A (en) * 2022-06-29 2022-09-02 支付宝(杭州)信息技术有限公司 Method and apparatus for training and using target detection models
CN115082758A (en) * 2022-08-19 2022-09-20 深圳比特微电子科技有限公司 Training method of target detection model, target detection method, device and medium
WO2022252089A1 (en) * 2021-05-31 2022-12-08 京东方科技集团股份有限公司 Training method for object detection model, and object detection method and device
CN115761453A (en) * 2022-10-20 2023-03-07 浙江大学 Power inspection scene-oriented light-weight single-sample target detection method based on feature matching
CN115861316A (en) * 2023-02-27 2023-03-28 深圳佑驾创新科技有限公司 Pedestrian detection model training method and device and pedestrian detection method
CN115965829A (en) * 2022-11-24 2023-04-14 阿里巴巴(中国)有限公司 Object detection model training method and object detection method
CN115984229A (en) * 2023-01-10 2023-04-18 北京医准智能科技有限公司 Model training method, mammary gland measuring method, device, electronic device and medium
CN116258915A (en) * 2023-05-15 2023-06-13 深圳须弥云图空间科技有限公司 Method and device for jointly detecting multiple target parts
CN116452912A (en) * 2023-03-28 2023-07-18 浙江大学 Training method, target detection method, medium and electronic equipment
CN116563665A (en) * 2023-04-25 2023-08-08 北京百度网讯科技有限公司 Training method of target detection model, target detection method, device and equipment
CN116935102A (en) * 2023-06-30 2023-10-24 上海蜜度信息技术有限公司 Lightweight model training method, device, equipment and medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106780536A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of shape based on object mask network perceives example dividing method
CN108694401A (en) * 2018-05-09 2018-10-23 北京旷视科技有限公司 Object detection method, apparatus and system
CN108764133A (en) * 2018-05-25 2018-11-06 北京旷视科技有限公司 Image-recognizing method, apparatus and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAIMING HE et al.: "Mask R-CNN" *
施泽浩 (SHI Zehao): "Object Detection Algorithm Based on Feature Pyramid Network" *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695609A (en) * 2020-05-26 2020-09-22 平安科技(深圳)有限公司 Target damage degree determination method, target damage degree determination device, electronic device, and storage medium
CN111667011A (en) * 2020-06-08 2020-09-15 平安科技(深圳)有限公司 Damage detection model training method, damage detection model training device, damage detection method, damage detection device, damage detection equipment and damage detection medium
CN111784611B (en) * 2020-07-03 2023-11-03 厦门美图之家科技有限公司 Portrait whitening method, device, electronic equipment and readable storage medium
CN111814867A (en) * 2020-07-03 2020-10-23 浙江大华技术股份有限公司 Defect detection model training method, defect detection method and related device
CN111784611A (en) * 2020-07-03 2020-10-16 厦门美图之家科技有限公司 Portrait whitening method, portrait whitening device, electronic equipment and readable storage medium
CN111860260A (en) * 2020-07-10 2020-10-30 逢亿科技(上海)有限公司 High-precision low-computation target detection network system based on FPGA
CN111860260B (en) * 2020-07-10 2024-01-26 逢亿科技(上海)有限公司 High-precision low-calculation target detection network system based on FPGA
CN112131925A (en) * 2020-07-22 2020-12-25 浙江元亨通信技术股份有限公司 Construction method of multi-channel characteristic space pyramid
CN112131925B (en) * 2020-07-22 2024-06-07 随锐科技集团股份有限公司 Construction method of multichannel feature space pyramid
CN111931929A (en) * 2020-07-29 2020-11-13 深圳地平线机器人科技有限公司 Training method and device of multi-task model and storage medium
CN111931929B (en) * 2020-07-29 2023-06-16 深圳地平线机器人科技有限公司 Training method and device for multitasking model and storage medium
CN112132018A (en) * 2020-09-22 2020-12-25 平安国际智慧城市科技股份有限公司 Traffic police recognition method, traffic police recognition device, traffic police recognition medium and electronic equipment
CN112215271A (en) * 2020-09-27 2021-01-12 武汉理工大学 Anti-occlusion target detection method and device based on multi-head attention mechanism
CN112215271B (en) * 2020-09-27 2023-12-12 武汉理工大学 Anti-occlusion target detection method and equipment based on multi-head attention mechanism
CN112308150B (en) * 2020-11-02 2022-04-15 平安科技(深圳)有限公司 Target detection model training method and device, computer equipment and storage medium
CN112308150A (en) * 2020-11-02 2021-02-02 平安科技(深圳)有限公司 Target detection model training method and device, computer equipment and storage medium
CN113849088A (en) * 2020-11-16 2021-12-28 阿里巴巴集团控股有限公司 Target picture determining method and device
CN112633156B (en) * 2020-12-22 2024-05-31 浙江大华技术股份有限公司 Vehicle detection method, image processing device, and computer-readable storage medium
CN112633156A (en) * 2020-12-22 2021-04-09 浙江大华技术股份有限公司 Vehicle detection method, image processing apparatus, and computer-readable storage medium
WO2022141858A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Pedestrian detection method and apparatus, electronic device, and storage medium
CN113591893A (en) * 2021-01-26 2021-11-02 腾讯医疗健康(深圳)有限公司 Image processing method and device based on artificial intelligence and computer equipment
CN113011309A (en) * 2021-03-15 2021-06-22 北京百度网讯科技有限公司 Image recognition method, apparatus, device, medium, and program product
CN113095288A (en) * 2021-04-30 2021-07-09 浙江吉利控股集团有限公司 Obstacle missing detection repairing method, device, equipment and storage medium
WO2022252089A1 (en) * 2021-05-31 2022-12-08 京东方科技集团股份有限公司 Training method for object detection model, and object detection method and device
CN114519378A (en) * 2021-12-24 2022-05-20 浙江大华技术股份有限公司 Training method of feature extraction unit, face recognition method and device
CN114998575A (en) * 2022-06-29 2022-09-02 支付宝(杭州)信息技术有限公司 Method and apparatus for training and using target detection models
CN115082758A (en) * 2022-08-19 2022-09-20 深圳比特微电子科技有限公司 Training method of target detection model, target detection method, device and medium
CN115761453B (en) * 2022-10-20 2023-08-04 浙江大学 Feature matching-based light single sample target detection method
CN115761453A (en) * 2022-10-20 2023-03-07 浙江大学 Power inspection scene-oriented light-weight single-sample target detection method based on feature matching
CN115965829A (en) * 2022-11-24 2023-04-14 阿里巴巴(中国)有限公司 Object detection model training method and object detection method
CN115984229A (en) * 2023-01-10 2023-04-18 北京医准智能科技有限公司 Model training method, mammary gland measuring method, device, electronic device and medium
CN115984229B (en) * 2023-01-10 2023-09-05 北京医准智能科技有限公司 Model training method, breast measurement device, electronic equipment and medium
CN115861316A (en) * 2023-02-27 2023-03-28 深圳佑驾创新科技有限公司 Pedestrian detection model training method and device and pedestrian detection method
CN115861316B (en) * 2023-02-27 2023-09-29 深圳佑驾创新科技股份有限公司 Training method and device for pedestrian detection model and pedestrian detection method
CN116452912A (en) * 2023-03-28 2023-07-18 浙江大学 Training method, target detection method, medium and electronic equipment
CN116452912B (en) * 2023-03-28 2024-04-05 浙江大学 Training method, target detection method, medium and electronic equipment
CN116563665A (en) * 2023-04-25 2023-08-08 北京百度网讯科技有限公司 Training method of target detection model, target detection method, device and equipment
CN116258915B (en) * 2023-05-15 2023-08-29 深圳须弥云图空间科技有限公司 Method and device for jointly detecting multiple target parts
CN116258915A (en) * 2023-05-15 2023-06-13 深圳须弥云图空间科技有限公司 Method and device for jointly detecting multiple target parts
CN116935102A (en) * 2023-06-30 2023-10-24 上海蜜度信息技术有限公司 Lightweight model training method, device, equipment and medium
CN116935102B (en) * 2023-06-30 2024-02-20 上海蜜度科技股份有限公司 Lightweight model training method, device, equipment and medium

Also Published As

Publication number Publication date
CN111160379B (en) 2023-09-15

Similar Documents

Publication Publication Date Title
CN111160379B (en) Training method and device of image detection model, and target detection method and device
CN107358149B (en) Human body posture detection method and device
US9600746B2 (en) Image processing apparatus and image processing method
Breitenstein et al. Systematization of corner cases for visual perception in automated driving
WO2021051601A1 (en) Method and system for selecting detection box using mask r-cnn, and electronic device and storage medium
CN108009466B (en) Pedestrian detection method and device
CN114022830A (en) Target determination method and target determination device
US20120051638A1 (en) Feature-amount calculation apparatus, feature-amount calculation method, and program
CN109670383B (en) Video shielding area selection method and device, electronic equipment and system
CN111814755A (en) Multi-frame image pedestrian detection method and device for night motion scene
CN114140683A (en) Aerial image target detection method, equipment and medium
CN113343985B (en) License plate recognition method and device
CN114049512A (en) Model distillation method, target detection method and device and electronic equipment
CN112348116A (en) Target detection method and device using spatial context and computer equipment
CN112232140A (en) Crowd counting method and device, electronic equipment and computer storage medium
CN108875500B (en) Pedestrian re-identification method, device and system and storage medium
US20170053172A1 (en) Image processing apparatus, and image processing method
Li et al. Fast forest fire detection and segmentation application for uav-assisted mobile edge computing system
CN108875501B (en) Human body attribute identification method, device, system and storage medium
CN111753729B (en) False face detection method and device, electronic equipment and storage medium
CN111435457B (en) Method for classifying acquisitions acquired by sensors
CN111091089B (en) Face image processing method and device, electronic equipment and storage medium
Singh et al. An enhanced YOLOv5 based on color harmony algorithm for object detection in unmanned aerial vehicle captured images
CN111353349B (en) Human body key point detection method and device, electronic equipment and storage medium
Paramanandam et al. A review on deep learning techniques for saliency detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant