CN111160379B - Training method and device of image detection model, and target detection method and device - Google Patents

Training method and device of image detection model, and target detection method and device

Info

Publication number
CN111160379B
CN111160379B CN201811320550.6A CN201811320550A CN 111160379 B
Authority
CN
China
Prior art keywords
network
mask
sub
regression
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811320550.6A
Other languages
Chinese (zh)
Other versions
CN111160379A (en)
Inventor
张修宝
田万鑫
沈海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201811320550.6A priority Critical patent/CN111160379B/en
Publication of CN111160379A publication Critical patent/CN111160379A/en
Application granted granted Critical
Publication of CN111160379B publication Critical patent/CN111160379B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training method and device for an image detection model and a target detection method and device, relating to the technical field of artificial intelligence. The model training method comprises the following steps: inputting a target training image into an initial model; extracting a feature response map of the target training image through a backbone network; performing classification processing on the feature response map through a classification sub-network, and calculating a classification loss value based on the classification processing result and a category label; performing regression processing on the feature response map through a regression sub-network, and calculating a regression loss value based on the regression processing result and a regression label; performing mask processing on the feature response map through a mask sub-network, and calculating a mask loss value based on the mask processing result and a mask label; and training the initial model based on the classification loss value, the regression loss value and the mask loss value to obtain a target detection model. The application improves the model training method and effectively improves the detection effect of the trained target detection model on images.

Description

Training method and device of image detection model, and target detection method and device
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a training method and apparatus for an image detection model, and a target detection method and apparatus.
Background
Target detection techniques, which detect faces or other objects of interest in images, are a necessary prerequisite for a large number of advanced visual tasks and can be applied to many practical tasks such as intelligent video surveillance, content-based image retrieval, robot navigation, augmented reality, and the like.
In most existing target detection methods, feature extraction is performed on an image to be detected by a trained target detection model, and target classification (such as determining whether a face is present) and target positioning (such as determining the position of the face in the image) are then performed based on the extracted features. Existing model training methods train only on the target detection model's own network structure, and the target detection model obtained in this way detects images poorly; for example, it has difficulty detecting a target that is blurred or partially occluded in the image.
Disclosure of Invention
Accordingly, the embodiments of the present application provide a training method and apparatus for an image detection model, and a target detection method and apparatus, which are used for improving the model training method and enhancing the detection effect of the trained target detection model on the image.
The embodiments mainly comprise the following aspects:
In a first aspect, an embodiment of the present application provides a training method for an image detection model, where the method includes: inputting a target training image into an initial model; the initial model comprises a backbone network and a head network which are connected in sequence, wherein the head network comprises a classification sub-network, a regression sub-network and a mask sub-network in parallel; the target training image carries a category label, a regression label and a mask label; extracting a feature response map of the target training image through the backbone network; performing classification processing on the feature response map through the classification sub-network, and calculating a classification loss value based on the classification processing result and the category label; performing regression processing on the feature response map through the regression sub-network, and calculating a regression loss value based on the regression processing result and the regression label; performing mask processing on the feature response map through the mask sub-network, and calculating a mask loss value based on the mask processing result and the mask label; and training the initial model based on the classification loss value, the regression loss value and the mask loss value to obtain a target detection model; the target detection model is a model for detecting targets in an image, and comprises the trained backbone network, classification sub-network and regression sub-network.
With reference to the first aspect, the embodiment of the present application provides a first possible implementation manner of the first aspect, wherein the backbone network includes a residual network and a feature pyramid network; and the step of extracting a feature response map of the target training image through the backbone network includes: extracting feature maps of multiple scales of the target training image through the residual network; inputting the feature maps of the multiple scales into the network layers of the feature pyramid network, each network layer receiving a feature map of one scale; and performing feature fusion processing on the input feature maps through each network layer of the feature pyramid network to obtain corresponding feature response maps, the feature response maps output by different network layers corresponding to different receptive fields.
With reference to the first possible implementation manner of the first aspect, the embodiment of the present application provides a second possible implementation manner of the first aspect, where the feature pyramid network has at least 4 network layers.
With reference to the first possible implementation manner of the first aspect, the embodiment of the present application provides a third possible implementation manner of the first aspect, wherein each network layer of the feature pyramid network is connected to the parallel classification sub-network, regression sub-network and mask sub-network, respectively; the step of performing classification processing on the feature response map through the classification sub-network comprises: performing classification processing, through the classification sub-network, on the feature response map output by the network layer of the feature pyramid network connected with the classification sub-network; the step of performing regression processing on the feature response map through the regression sub-network comprises: performing regression processing, through the regression sub-network, on the feature response map output by the network layer of the feature pyramid network connected with the regression sub-network; and the step of performing mask processing on the feature response map through the mask sub-network comprises: performing mask processing, through the mask sub-network, on the feature response map output by the network layer of the feature pyramid network connected with the mask sub-network.
With reference to the third possible implementation manner of the first aspect, an embodiment of the present application provides a fourth possible implementation manner of the first aspect, where there are multiple mask labels, each mask label corresponding to a feature map of one scale; and the step of calculating a mask loss value based on the mask processing result and the mask label includes: acquiring the mask label corresponding to the mask sub-network according to the network layer of the feature pyramid network connected with the mask sub-network; and calculating the mask loss value based on the mask processing result of the mask sub-network and the acquired mask label.
With reference to the first aspect, an embodiment of the present application provides a fifth possible implementation manner of the first aspect, where the step of calculating a mask loss value based on the mask processing result and the mask label includes: substituting the mask processing result and the mask label into a cross-entropy loss function to calculate the mask loss value.
With reference to the fifth possible implementation manner of the first aspect, the embodiment of the present application provides a sixth possible implementation manner of the first aspect, wherein the cross-entropy loss function is a Sigmoid cross-entropy loss function.
With reference to the first aspect, an embodiment of the present application provides a seventh possible implementation manner of the first aspect, wherein the step of training the initial model based on the classification loss value, the regression loss value and the mask loss value includes: adopting a back-propagation algorithm to jointly train the backbone network, the classification sub-network, the regression sub-network and the mask sub-network based on the classification loss value, the regression loss value and the mask loss value, and stopping training when the classification loss value converges to a first preset value, the regression loss value converges to a second preset value and the mask loss value converges to a third preset value.
With reference to the first aspect, an embodiment of the present application provides an eighth possible implementation manner of the first aspect, wherein the mask sub-network includes five convolution layers connected in sequence; the network parameters of the first four convolution layers are the same, and the network parameters of the fifth convolution layer are different from the network parameters of the first four convolution layers.
With reference to the eighth possible implementation manner of the first aspect, the embodiment of the present application provides a ninth possible implementation manner of the first aspect, wherein the output dimension of the fifth convolution layer of the mask sub-network is equal to the number of categories of the targets to be detected.
In a second aspect, an embodiment of the present application provides a target detection method, where the method is applied to a device configured with a detection model; the detection model is a target detection model trained by the method of the first aspect or any one of the first to ninth possible implementation manners of the first aspect; and the method comprises the following steps: acquiring an image of a target to be detected; inputting the image into the target detection model; and determining the category and the position of the target to be detected according to the output result of the target detection model.
With reference to the second aspect, an embodiment of the present application provides a first possible implementation manner of the second aspect, where the step of determining the category and the position of the target to be detected according to the output result of the target detection model includes: determining the category of the target to be detected according to the classification processing result output by the target detection model; and determining the position of the target to be detected according to the regression processing result output by the target detection model.
With reference to the second aspect, an embodiment of the present application provides a second possible implementation manner of the second aspect, where the object to be detected includes a face or a vehicle.
In a third aspect, an embodiment of the present application provides a training apparatus for an image detection model, including: an image input module, configured to input a target training image into an initial model; the initial model comprises a backbone network and a head network which are connected in sequence, wherein the head network comprises a classification sub-network, a regression sub-network and a mask sub-network in parallel; the target training image carries a category label, a regression label and a mask label; an extraction module, configured to extract a feature response map of the target training image through the backbone network; a classification module, configured to perform classification processing on the feature response map through the classification sub-network, and calculate a classification loss value based on the classification processing result and the category label; a regression module, configured to perform regression processing on the feature response map through the regression sub-network, and calculate a regression loss value based on the regression processing result and the regression label; a mask module, configured to perform mask processing on the feature response map through the mask sub-network, and calculate a mask loss value based on the mask processing result and the mask label; and a model generation module, configured to train the initial model based on the classification loss value, the regression loss value and the mask loss value to obtain a target detection model; the target detection model is a model for detecting targets in an image, and comprises the trained backbone network, classification sub-network and regression sub-network.
With reference to the third aspect, the present embodiment provides a first possible implementation manner of the third aspect, wherein the backbone network includes a residual network and a feature pyramid network; and the extraction module is configured to: extract feature maps of multiple scales of the target training image through the residual network; input the feature maps of the multiple scales into the network layers of the feature pyramid network, each network layer receiving a feature map of one scale; and perform feature fusion processing on the input feature maps through each network layer of the feature pyramid network to obtain corresponding feature response maps, the feature response maps output by different network layers corresponding to different receptive fields.
With reference to the first possible implementation manner of the third aspect, the embodiment of the present application provides a second possible implementation manner of the third aspect, where the feature pyramid network has at least 4 network layers.
With reference to the first possible implementation manner of the third aspect, the embodiment of the present application provides a third possible implementation manner of the third aspect, wherein each network layer of the feature pyramid network is connected to the parallel classification sub-network, regression sub-network and mask sub-network, respectively; the classification module is further configured to: perform classification processing, through the classification sub-network, on the feature response map output by the network layer of the feature pyramid network connected with the classification sub-network; the regression module is further configured to: perform regression processing, through the regression sub-network, on the feature response map output by the network layer of the feature pyramid network connected with the regression sub-network; and the mask module is further configured to: perform mask processing, through the mask sub-network, on the feature response map output by the network layer of the feature pyramid network connected with the mask sub-network.
With reference to the third possible implementation manner of the third aspect, an embodiment of the present application provides a fourth possible implementation manner of the third aspect, where there are multiple mask labels, each mask label corresponding to a feature map of one scale; and the mask module is further configured to: acquire the mask label corresponding to the mask sub-network according to the network layer of the feature pyramid network connected with the mask sub-network; and calculate the mask loss value based on the mask processing result of the mask sub-network and the acquired mask label.
With reference to the third aspect, an embodiment of the present application provides a fifth possible implementation manner of the third aspect, where the mask module is further configured to: substitute the mask processing result and the mask label into a cross-entropy loss function to calculate the mask loss value.
With reference to the fifth possible implementation manner of the third aspect, the embodiment of the present application provides a sixth possible implementation manner of the third aspect, where the cross-entropy loss function is a Sigmoid cross-entropy loss function.
With reference to the third aspect, an embodiment of the present application provides a seventh possible implementation manner of the third aspect, where the model generation module is further configured to: adopt a back-propagation algorithm to jointly train the backbone network, the classification sub-network, the regression sub-network and the mask sub-network based on the classification loss value, the regression loss value and the mask loss value, and stop training when the classification loss value converges to a first preset value, the regression loss value converges to a second preset value and the mask loss value converges to a third preset value.
With reference to the third aspect, an embodiment of the present application provides an eighth possible implementation manner of the third aspect, where the mask sub-network includes five convolution layers connected in sequence; the network parameters of the first four convolution layers are the same, and the network parameters of the fifth convolution layer are different from the network parameters of the first four convolution layers.
With reference to the eighth possible implementation manner of the third aspect, the embodiment of the present application provides a ninth possible implementation manner of the third aspect, wherein the output dimension of the fifth convolution layer of the mask sub-network is equal to the number of categories of the targets to be detected.
In a fourth aspect, an embodiment of the present application provides a target detection apparatus, which is applied to a device configured with a detection model; the detection model is a target detection model trained by the method of the first aspect or any one of its possible implementation manners; and the apparatus comprises: an image acquisition module, configured to acquire an image of a target to be detected; an image input module, configured to input the image into the target detection model; and a determining module, configured to determine the category and the position of the target to be detected according to the output result of the target detection model.
With reference to the fourth aspect, an embodiment of the present application provides a first possible implementation manner of the fourth aspect, wherein the determining module is further configured to: determining the category of the target to be detected according to the classification processing result output by the target detection model; and determining the position of the target to be detected according to the regression processing result output by the target detection model.
With reference to the fourth aspect, an embodiment of the present application provides a second possible implementation manner of the fourth aspect, where the object to be detected includes a human face or a vehicle.
In a fifth aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over a bus, where the machine-readable instructions, when executed by the processor, perform the steps of the method of the first aspect or any one of its possible implementation manners, or of the second aspect or any one of its possible implementation manners.
In a sixth aspect, the present embodiment further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect or any one of its possible implementation manners, or of the second aspect or any one of its possible implementation manners.
The embodiments of the present application provide a training method and apparatus for an image detection model and a target detection method and apparatus. The initial model comprises a backbone network, a classification sub-network, a regression sub-network and a mask sub-network. A feature response map of the target training image is first extracted through the backbone network; classification processing, regression processing and mask processing are then performed on the feature response map through the classification sub-network, the regression sub-network and the mask sub-network respectively; the initial model is trained based on the classification loss value, the regression loss value and the mask loss value; and finally a target detection model comprising the trained backbone network, classification sub-network and regression sub-network is obtained. The embodiments of the present application introduce a mask sub-network into the model training process, and the mask sub-network performs mask processing on the feature response map to assist in training the backbone network, the classification sub-network and the regression sub-network. This helps the trained target detection model produce a higher feature response for target regions in the image than for non-target regions, i.e., it can effectively highlight the target to be detected in the image, and thus effectively improves the detection effect of the trained target detection model on images.
The foregoing objects, features and advantages of embodiments of the application will be more readily apparent from the following detailed description of the embodiments taken in conjunction with the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that other related drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 shows a flowchart of a training method for an image detection model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an initial model according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a head network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an initial model according to an embodiment of the present application;
FIG. 5 is a flowchart of a target detection method according to an embodiment of the present application;
FIG. 6 is a block diagram of a training device for an image detection model according to an embodiment of the present application;
FIG. 7 is a block diagram showing an object detection apparatus according to an embodiment of the present application;
FIG. 8 shows a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. The following detailed description of embodiments of the application is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.
The model training method and apparatus, the target detection method and apparatus, the electronic device and the computer storage medium provided by the embodiments of the present application can be applied to target detection tasks, such as the detection of faces, vehicles and other objects of interest. The embodiments of the present application do not limit the specific application scenario; any scheme that trains a model with the method provided by the embodiments of the present application, or performs target detection with a target detection model trained in this way, falls within the protection scope of the present application. Embodiments of the present application are described in detail below.
Referring to the flowchart of a training method for an image detection model shown in FIG. 1, the method can be applied to electronic devices such as computers, tablet computers and other intelligent terminals, and comprises the following steps:
Step S102: inputting a target training image into an initial model; the initial model comprises a backbone network and a head network which are connected in sequence, wherein the head network comprises a classification sub-network, a regression sub-network and a mask sub-network in parallel; and the target training image carries a category label, a regression label and a mask label. The head network may also be referred to as the head structure of the initial model and is used to output the final detection results. For ease of understanding, reference may be made to the schematic structure of an initial model shown in FIG. 2, where the backbone network is connected to the parallel classification sub-network, regression sub-network and mask sub-network, respectively.
Step S104: extracting a feature response map of the target training image through the backbone network.
The backbone network is the main network for feature extraction. It may include a plurality of convolution layers, or be implemented with various network structures, and is mainly used to perform feature extraction on the input target training image, generate a feature response map of the target training image, and transmit the feature response map to the classification sub-network, the regression sub-network and the mask sub-network.
Step S106: performing classification processing on the feature response map through the classification sub-network, and calculating a classification loss value based on the classification processing result and the category label.
The classification sub-network (classification subnet) includes a plurality of convolution layers and is primarily used for target classification. Specifically, the classification sub-network is responsible for judging whether an object belonging to a category of interest (i.e., a target to be detected) appears in the input image, and outputs the likelihood that an object of the corresponding category of interest appears in the image. For example, in a face detection task, the classification sub-network may output a classification result of "whether or not it is a face".
The loss value is a value that measures how close the actual output is to the desired output; the smaller the loss value, the closer the actual output is to the desired output. The classification processing result is the actual output of the classification sub-network, and the category label is its desired output, so the closeness of the actual output of the classification sub-network to its desired output can be obtained by a calculation over the classification processing result and the category label. A preset classification loss function can be adopted for this calculation.
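The embodiment does not prescribe a particular classification loss function; as an illustration only, a minimal PyTorch-style sketch of such a calculation might look as follows, where the tensor shapes and the choice of binary cross-entropy are assumptions:

```python
import torch
import torch.nn.functional as F

def classification_loss(cls_logits: torch.Tensor, cls_labels: torch.Tensor) -> torch.Tensor:
    """Measure how close the classification output (actual output) is to the
    category labels (desired output); a smaller value means a closer match.
    cls_logits: (N, K*A, H, W) raw scores from the classification sub-network.
    cls_labels: (N, K*A, H, W) binary labels, 1 where an anchor matches a target.
    Binary cross-entropy is an assumed choice of the preset classification loss."""
    return F.binary_cross_entropy_with_logits(cls_logits, cls_labels.float())
```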
Step S108: performing regression processing on the feature response map through the regression sub-network, and calculating a regression loss value based on the regression processing result and the regression label.
The regression sub-network (regression subnet) includes a plurality of convolution layers; it is mainly used for target positioning. The target localization task may also be considered a regression task. Specifically, the regression sub-network determines the location of the object of the category of interest in the input image, generally outputting a rectangular bounding box of the object. For example, in a face detection task, the regression sub-network may output "coordinates of a regression frame of a face", that is, a rectangular bounding box of the face predicted by the regression sub-network, to characterize a specific location where the face is located.
Similarly, the regression processing result is the actual output of the regression sub-network, and the regression label is its desired output, so the closeness of the actual output of the regression sub-network to its desired output can be obtained by a calculation over the regression processing result and the regression label. A preset regression loss function can be adopted for this calculation.
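Likewise, the embodiment leaves the concrete regression loss open; the sketch below, assuming PyTorch, a smooth-L1 loss and the tensor layout shown in the comments (all illustrative assumptions), shows one way such a calculation could be restricted to anchors that actually contain a target:

```python
import torch
import torch.nn.functional as F

def regression_loss(reg_preds: torch.Tensor, reg_labels: torch.Tensor,
                    positive_mask: torch.Tensor) -> torch.Tensor:
    """Compare the regression output with the regression labels at anchors
    matched to a ground-truth frame (positive_mask == 1).
    reg_preds, reg_labels: (N, 4*A, H, W) predicted / target box offsets.
    positive_mask: (N, A, H, W) indicator of anchors matched to a GT frame.
    Smooth-L1 is an assumed choice of the preset regression loss function."""
    mask = positive_mask.repeat_interleave(4, dim=1).bool()
    if mask.sum() == 0:
        return reg_preds.sum() * 0.0        # no positive anchors in this batch
    return F.smooth_l1_loss(reg_preds[mask], reg_labels[mask])
```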
Step S110: performing mask processing on the feature response map through the mask sub-network, and calculating a mask loss value based on the mask processing result and the mask label.
The mask sub-network (mask subnet) includes a plurality of convolution layers and is mainly used to mask the non-target regions in the image and output a mask map. For example, a person image after mask processing highlights only the face region, while the remaining regions are masked out.
Similarly, the mask processing result is the actual output of the mask sub-network, and the mask label is its desired output, so the closeness of the actual output of the mask sub-network to its desired output can be obtained by a calculation over the mask processing result and the mask label. A preset mask loss function can be adopted for this calculation.
Step S112: training the initial model based on the classification loss value, the regression loss value and the mask loss value to obtain a target detection model; the target detection model is used for detecting targets in an image and comprises the trained backbone network, classification sub-network and regression sub-network.
In one embodiment, a back-propagation algorithm may be employed to jointly train the backbone network, the classification sub-network, the regression sub-network and the mask sub-network based on the classification loss value, the regression loss value and the mask loss value until the classification loss value converges to a first preset value, the regression loss value converges to a second preset value and the mask loss value converges to a third preset value. That is, the network parameters are adjusted backwards according to the loss values until the loss values reach acceptable levels, at which point training ends; the network at that moment is confirmed to meet the requirements and to output the expected results, so that target detection can be realized.
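For illustration, a simplified training loop in this spirit is sketched below; the model interface, the loss callables, the convergence thresholds (eps_*) and the equal weighting of the three losses are all assumptions introduced for the example:

```python
def train_initial_model(model, data_loader, optimizer,
                        criterion_cls, criterion_reg, criterion_mask,
                        eps_cls=0.01, eps_reg=0.01, eps_mask=0.01, max_epochs=100):
    """Jointly train the backbone, classification, regression and mask
    sub-networks with back-propagation, stopping once each loss has
    converged below its preset value."""
    for epoch in range(max_epochs):
        for images, cls_labels, reg_labels, mask_labels in data_loader:
            cls_out, reg_out, mask_out = model(images)        # assumed initial-model interface
            loss_cls = criterion_cls(cls_out, cls_labels)
            loss_reg = criterion_reg(reg_out, reg_labels)
            loss_mask = criterion_mask(mask_out, mask_labels)
            total = loss_cls + loss_reg + loss_mask           # equal weighting is an assumption
            optimizer.zero_grad()
            total.backward()                                  # back-propagation
            optimizer.step()
        if (loss_cls.item() < eps_cls and loss_reg.item() < eps_reg
                and loss_mask.item() < eps_mask):
            break                                             # all three losses converged
    return model
```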
It should be noted that the mask sub-network is only used in the training process and mainly assists in training the backbone network, the classification sub-network and the regression sub-network; because the network parameters of the backbone network, the classification sub-network and the regression sub-network are influenced by the mask sub-network, the parameters obtained after training enhance the feature response of the region where the target to be detected is located. Once training is finished, only the target detection model comprising the trained backbone network, classification sub-network and regression sub-network is used to detect targets in the actual target detection task. That is, during actual detection the mask sub-network is not constructed, and the detection time is therefore not affected by it.
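A minimal sketch of the detection-time model, assuming PyTorch and module names introduced purely for illustration, makes this point concrete: only the trained backbone, classification sub-network and regression sub-network are assembled, and no mask sub-network is built.

```python
import torch.nn as nn

class TargetDetectionModel(nn.Module):
    """Detection-time model: keeps only the trained backbone, classification
    sub-network and regression sub-network; the mask sub-network used during
    training is not constructed here."""
    def __init__(self, trained_backbone, trained_cls_subnet, trained_reg_subnet):
        super().__init__()
        self.backbone = trained_backbone
        self.cls_subnet = trained_cls_subnet
        self.reg_subnet = trained_reg_subnet

    def forward(self, image):
        feature_maps = self.backbone(image)                   # one feature response map per FPN layer
        cls_results = [self.cls_subnet(f) for f in feature_maps]
        reg_results = [self.reg_subnet(f) for f in feature_maps]
        return cls_results, reg_results                       # category and position outputs
```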
According to the training method of the image detection model described above, the initial model comprises a backbone network, a classification sub-network, a regression sub-network and a mask sub-network. A feature response map of the target training image is first extracted through the backbone network; classification processing, regression processing and mask processing are then performed on the feature response map through the classification sub-network, the regression sub-network and the mask sub-network respectively; the initial model is trained based on the classification loss value, the regression loss value and the mask loss value; and finally a target detection model comprising the trained backbone network, classification sub-network and regression sub-network is obtained. The embodiment of the present application introduces a mask sub-network into the model training process, and the mask sub-network performs mask processing on the feature response map to assist in training the backbone network, the classification sub-network and the regression sub-network. This helps the trained target detection model produce a higher feature response for target regions in the image than for non-target regions, i.e., it can effectively highlight the target to be detected in the image, and thus effectively improves the detection effect of the trained target detection model on images.
This embodiment provides a specific implementation of the head network. Referring to the structural schematic diagram of the head network shown in FIG. 3, the classification sub-network, the regression sub-network and the mask sub-network are each illustrated with five sequentially connected convolution layers; in one implementation, the convolution kernels in these layers may be 3×3. For each sub-network, the network parameters of the first four convolution layers are the same, and the fifth convolution layer differs from the first four in its network parameters. In FIG. 3, the parameters corresponding to the first four convolution layers of the classification sub-network, the regression sub-network and the mask sub-network are all W×H×256, where W×H can be understood as the width and height of the feature map processed by the convolution layer and 256 as the output dimension of the convolution layer; the output dimension can be understood as the number of convolution kernels in the convolution layer. In practical applications, the first four convolution layers of the classification sub-network, the regression sub-network and the mask sub-network may share parameters to enhance the self-organization of the network, while the output dimension of the fifth convolution layer (i.e., the last convolution layer) differs from one sub-network type to another; that is, the main difference between the classification sub-network, the regression sub-network and the mask sub-network lies in the last convolution layer.
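A compact sketch of one such head sub-network is given below (PyTorch is assumed; the ReLU activations between layers and the helper name make_head_subnet are illustrative assumptions not stated in the embodiment). The concrete output dimensions used at the end anticipate the face-detection example described below, with K = 1 category and A = 2 anchor points per position.

```python
import torch.nn as nn

def make_head_subnet(last_out_channels: int, in_channels: int = 256) -> nn.Sequential:
    """Four 3x3 convolution layers with 256 output channels followed by a
    fifth 3x3 convolution layer whose output dimension depends on the
    sub-network type (classification, regression or mask)."""
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_channels = 256
    layers.append(nn.Conv2d(256, last_out_channels, kernel_size=3, padding=1))
    return nn.Sequential(*layers)

# For a face-detection head with K = 1 category and A = 2 anchor points per position:
cls_subnet = make_head_subnet(last_out_channels=1 * 2)       # K x A
reg_subnet = make_head_subnet(last_out_channels=4 * 1 * 2)   # 4 x K x A
mask_subnet = make_head_subnet(last_out_channels=1)          # K (training only)
```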
Specifically, the output dimension of the last convolution layer of the classification sub-network is related to the number of anchor points (anchors) at each position in the image and the number of categories of targets to be detected in the image, while the output dimension of the last convolution layer of the regression sub-network is related to the number of anchor points at each position, the number of categories of targets to be detected, and the number of position correction parameters of each anchor point with respect to the GT (ground truth) frame. For ease of understanding, these are described in detail below:
An anchor point (anchor) may also be understood as an anchor frame, and in particular as an initial frame or candidate region. The anchor point parameters include the anchor point area (scale) and the anchor point aspect ratio (aspect); one set of anchor parameters (i.e., one combination of area and aspect ratio) characterizes one anchor point. For example, 3 areas and 3 aspect ratios may be combined into 9 anchor points, and each position in the image to be detected may correspondingly be provided with 9 anchor points. For a feature map of size W×H, which contains W×H positions (which may be understood as W×H pixel points), W×H×9 anchor points, i.e., W×H×9 initial frames, are provided. In a single-stage detection model, anchor points can be used to directly predict the detection frame containing the target object.
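As an illustration of how such anchor points could be laid out, a small sketch follows; the stride and scale values, the interpretation of the scale as a side length and of the aspect ratio as height/width are assumptions made only for this example:

```python
import itertools

def generate_anchors(feature_width, feature_height, stride,
                     scales=(32.0,), aspect_ratios=(1.0, 1.5)):
    """Lay anchor points (initial frames) at every position of a W x H feature
    map; each (scale, aspect ratio) pair defines one anchor shape, so each
    position receives len(scales) * len(aspect_ratios) anchors."""
    anchors = []
    for y in range(feature_height):
        for x in range(feature_width):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride      # position centre in the image
            for scale, ratio in itertools.product(scales, aspect_ratios):
                w = scale / (ratio ** 0.5)                       # ratio interpreted as h / w
                h = scale * (ratio ** 0.5)
                anchors.append((cx, cy, w, h))
    return anchors                                               # W * H * A anchors in total
```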
The number of categories of targets to be detected in the image may be predetermined according to the actual application scenario. For example, if the method is only used for face detection, i.e., the target to be detected is a face, the number of categories is 1; if it is used for cat and dog detection, the number of categories is 2.
The position correction parameters of each anchor point with respect to the GT (ground truth) frame can be understood as the offsets of the GT frame relative to the anchor point's center point (x, y), height h and width w; that is, there are 4 position correction parameters in total. The GT frame may be understood as the regression frame that characterizes the correct position of the target to be detected.
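A small sketch of how the 4 position correction parameters could be computed for one anchor point follows; the embodiment only states that there are 4 such parameters, so the log-space encoding of width and height is an assumed convention borrowed from common detection practice:

```python
import math

def encode_box_offsets(anchor, gt_frame):
    """Offsets of the GT (ground truth) frame relative to an anchor point's
    centre point (x, y), width w and height h -- 4 parameters in total."""
    ax, ay, aw, ah = anchor       # anchor centre (x, y), width, height
    gx, gy, gw, gh = gt_frame     # GT frame centre (x, y), width, height
    dx = (gx - ax) / aw
    dy = (gy - ay) / ah
    dw = math.log(gw / aw)        # assumed log-space encoding
    dh = math.log(gh / ah)
    return dx, dy, dw, dh
```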
Building on this explanation of anchor points, the number of target categories and the position correction parameters, the output dimensions of the last convolution layers of the classification sub-network, the regression sub-network and the mask sub-network provided by this embodiment are further described below:
For the classification sub-network, the last convolution layer outputs a confidence for each of the multiple anchor points at every position in the image to be detected, so its output dimension is K×A, where A is the number of anchor points at each position and K is the number of categories. For face detection there is only one category (i.e., the face), so K is 1; in the face detection task, the output of the last convolution layer of the classification sub-network is therefore of size W×H×A.
In addition, according to the size characteristics of faces, in one embodiment the aspect ratios of the anchor points set at each position may be set to 1 and 1.5 with a single anchor scale (i.e., area), so that the number A of anchor points at each position in the image is 2.
For the regression sub-network, the last convolution layer outputs the position correction parameters of each anchor point with respect to the GT frame, and there are 4 such parameters, so its output dimension is 4×K×A. In the face detection task, K is 1, so the output dimension of the last convolution layer of the regression sub-network is 4A.
For the mask sub-network, since the last convolution layer outputs a mask of the whole image and uses a class-specific calculation mode, that is, different classes use different convolution kernel groups, the output dimension of the last convolution layer of the mask sub-network is equal to the number of classes of the object to be detected, that is, the output dimension is K. In the face detection task, K is 1, and the output dimension of the last convolution layer of the mask sub-network is 1.
The head network shown in FIG. 3 is applied to a face detection task, where the number of categories of targets to be detected is 1, i.e., K is 1; therefore, FIG. 3 shows that the output dimension of the last convolution layer of the classification sub-network is A, the output dimension of the last convolution layer of the regression sub-network is 4A, and the output dimension of the last convolution layer of the mask sub-network is 1.
This embodiment presents one implementation of the backbone network, in which the backbone network comprises a residual network and a feature pyramid network.
The residual network is a deep convolutional network. It has a deeper network structure than a traditional network and adds y = x identity-mapping layers; its main effect is that the network does not degrade as depth increases, and it also converges better. It is generally considered that each layer of a neural network extracts feature information at a different level, for example low-level, mid-level and high-level; the deeper the network, the more levels of information are extracted and the richer the combinations of information across levels. The residual network therefore has better image feature extraction capability. In one embodiment, the residual network may be implemented as a ResNet network.
The feature pyramid network (FPN, Feature Pyramid Network) extends a standard convolutional network with a top-down pathway and lateral connections, so that rich, multi-scale feature pyramids can be effectively extracted from a single-resolution input image. The FPN comprises a multi-layer structure; each layer can detect the image to be detected at a different scale and generate a feature map at that scale. The FPN can significantly improve the multi-scale prediction capability of fully convolutional networks (FCNs).
Referring to the schematic structure of an initial model shown in FIG. 4, FIG. 4 mainly illustrates that the backbone network includes a residual network and a feature pyramid network, both drawn in simplified form; for example, the feature pyramid network is drawn with only 3 FPN layers, whereas in practical applications it may include 4 network layers (FPN5 to FPN2 from top to bottom). A network layer of the feature pyramid network may be referred to directly as an FPN layer. In order to enlarge the detection range of targets, this embodiment may further increase the number of FPN layers, for example by adding two layers FPN6 and FPN7, i.e., adding two convolution layers on top of the 4 FPN layers of the existing feature pyramid network. In practical applications, FPN layers can be flexibly added or removed according to actual requirements.
The initial model shown in FIG. 4 is a scale-invariant network structure, i.e., the target to be detected can be detected regardless of changes in its size. For example, the more FPN layers there are, the larger the range of receptive fields, the more faces of different scales can be detected, and the stronger the scale invariance. The receptive field is defined as the region of the input image that a convolutional neural network feature can "see"; equivalently, the feature output is affected only by the pixels within the receptive field region. Colloquially, a point on the feature map corresponds to an area on the input image. A feature map with a small receptive field helps detect small targets, and a feature map with a large receptive field helps detect large targets. This embodiment adopts a feature pyramid network with at least four network layers, which provides multiple receptive fields and can detect targets of different scales.
When the feature response map of the target training image is extracted through the backbone network, the following steps may be used:
(1) Extracting feature maps of multiple scales of the target training image through the residual network;
(2) Inputting the feature maps of the multiple scales into the network layers of the feature pyramid network, each network layer receiving a feature map of one scale;
(3) Performing feature fusion processing on the input feature maps through each network layer of the feature pyramid network to obtain corresponding feature response maps; the feature response maps output by different network layers correspond to different receptive fields.
In addition, as illustrated in FIG. 4, each network layer of the feature pyramid network is connected to the head network, i.e., to the parallel classification sub-network, regression sub-network and mask sub-network, respectively. Only three network layers of the feature pyramid network are schematically illustrated in FIG. 4, and each network layer is connected to a head network, so three head networks are drawn. In practice, the network structure and network parameters of the three head networks are the same, so they can essentially be regarded as a single head network; FIG. 4 draws them separately only for ease of understanding.
When the classification sub-network performs classification processing on the feature response map, it specifically classifies the feature response map output by the network layer of the feature pyramid network connected to it; when the regression sub-network performs regression processing, it specifically performs regression processing on the feature response map output by the network layer of the feature pyramid network connected to it; and when the mask sub-network performs mask processing, it specifically masks the feature response map output by the network layer of the feature pyramid network connected to it.
In practical applications, the output of each network layer of the feature pyramid network may also be passed through a 3×3 convolution layer for dimension transformation before the feature response map is input into the head network, so that the output dimension is a uniform number such as 256.
The training process of the initial model is further explained below in conjunction with FIG. 4. The target training image is first input into the residual network, which outputs feature maps of multiple scales; the residual network feeds these feature maps of the respective scales into the feature pyramid network, and the feature pyramid network processes them to generate the feature response maps. Specifically, each FPN layer corresponds to a feature map of one scale and generates a corresponding feature response map through feature fusion processing. It should be noted that every FPN layer other than the topmost layer performs fusion using the feature response map passed down from the FPN layer above together with the feature map it receives from the residual network, whereas the topmost FPN layer has no layer above it and its result is simply the feature map it receives. For example, when n is not the highest level of the feature pyramid network (n < 4 for a four-layer feature pyramid network, n < 6 for a six-layer one), the feature response map Pn of the FPNn layer is the fusion of the feature response map Pn+1 passed down from the FPNn+1 layer above and the feature map Pn' of the corresponding scale that the FPNn layer receives from the residual network. When n is the highest level (n = 4 for a four-layer feature pyramid network, n = 6 for a six-layer one), the processing result of the FPNn layer is simply the corresponding feature map Pn'. Each FPN layer outputs the feature response map it has processed to the head network (i.e., to the classification sub-network, the regression sub-network and the mask sub-network of the initial model). Each sub-network performs its own type of processing on the received feature response map, obtains the corresponding processing result, and evaluates its own loss function to obtain the classification loss value, the regression loss value and the mask loss value; a back-propagation algorithm is then used to jointly train the backbone network, the classification sub-network, the regression sub-network and the mask sub-network based on these loss values, and the network parameters of each network are finally determined.
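A minimal sketch of this per-layer fusion rule is shown below, assuming PyTorch; the 1×1 lateral convolutions and nearest-neighbour up-sampling are assumptions in the spirit of a standard FPN and are not prescribed by the embodiment:

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNFusion(nn.Module):
    """Top-down fusion: the topmost layer uses its feature map Pn' directly,
    and every lower layer adds the up-sampled response map passed down from
    the layer above to its own lateral feature map."""
    def __init__(self, in_channels_list, out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels_list)

    def forward(self, feature_maps):                 # ordered from the topmost FPN layer down
        responses, prev = [], None
        for lateral_conv, fmap in zip(self.laterals, feature_maps):
            lat = lateral_conv(fmap)                 # Pn' after channel alignment
            if prev is not None:                     # not the topmost layer: fuse with Pn+1
                lat = lat + F.interpolate(prev, size=lat.shape[-2:], mode="nearest")
            responses.append(lat)                    # feature response map Pn
            prev = lat
        return responses
```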
The network architecture shown in fig. 4 can be understood as a single-stage (one-stage) detection model. The difference between a single-stage (one-stage) detection model and a two-stage (two-stage) detection model is that the single-stage detection model directly predicts a target frame by using an initial frame; the two-stage detection model adopts an initial frame to predict a candidate frame, and then predicts a target frame based on the candidate frame.
Based on this, when the model provided by the embodiment of the present application performs target detection on an input image, the standard-frame (i.e., initial-frame) settings of the scale-invariant SFD (Single Shot Scale-invariant Face Detector) can be used, and each network layer of the feature pyramid is tiled with standard frames matched to the receptive field size of that layer's feature response map.
Because the model provided by the embodiment of the present application is a single-stage detection model, the target frame (such as the finally detected face bounding box) can be found directly from the initial frames. In this implementation, initial frames matched to the receptive fields of the corresponding feature response maps are laid out on the different FPN layers, so that the features of each layer can be extracted according to the initial frames and the corresponding feature fusion maps can be generated.
In the initial model of FIG. 4, since there are feature maps at multiple scales, there are also multiple mask labels, each mask label corresponding to a feature map of one scale. A mask label can be understood as a mask map; different mask labels have different receptive fields and correspond to targets of different sizes. Taking a face as the target to be detected as an example, the mask map shows only the face, and the non-face parts are masked out by the mask processing. For instance, in the mask map corresponding to an image containing a person, only the face region is bright and the remaining regions are black. The form in which the face region appears in the mask map depends on the receptive field corresponding to the mask label.
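The embodiment does not describe exactly how a mask label is produced for each scale; purely as an illustrative assumption, one simple construction would mark the feature-map cells covered by a ground-truth box as 1 and everything else as 0:

```python
import torch

def make_mask_label(gt_boxes, feature_height, feature_width, stride):
    """Illustrative mask label at one feature-map scale: cells inside a
    ground-truth target box stay "bright" (1), all other cells are masked (0).
    gt_boxes: iterable of (x1, y1, x2, y2) boxes in image coordinates."""
    label = torch.zeros(feature_height, feature_width)
    for x1, y1, x2, y2 in gt_boxes:
        label[int(y1 // stride):int(y2 // stride) + 1,
              int(x1 // stride):int(x2 // stride) + 1] = 1.0
    return label
```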
The training supervision information can thus be layered by using mask labels with a plurality of different receptive fields. Specifically, the supervision information used for detection is associated with the different network layers of the feature pyramid, and corresponding supervision information is set for each layer, so that the rectangular bounding box of a target to be detected obtained through each network layer matches the size of the standard frames paved on that layer's feature map.
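A hedged sketch of building one such mask label from face bounding boxes follows: pixels inside a box become foreground, everything else stays masked out. Filtering boxes by size so that each FPN layer only supervises faces in its own size range is an assumption about one possible way to layer the supervision.

    import numpy as np

    def make_mask_label(image_hw, boxes, scale, min_size=0, max_size=float("inf")):
        """boxes: [(x1, y1, x2, y2), ...] in image coordinates; scale: image-to-feature-map ratio."""
        h, w = image_hw[0] // scale, image_hw[1] // scale
        mask = np.zeros((h, w), dtype=np.float32)
        for x1, y1, x2, y2 in boxes:
            side = max(x2 - x1, y2 - y1)
            if not (min_size <= side < max_size):   # this face is supervised by another layer
                continue
            mask[int(y1) // scale:int(y2) // scale,
                 int(x1) // scale:int(x2) // scale] = 1.0   # face region bright, rest black
        return mask

    # One mask label per FPN layer, each layer covering a different face-size range.
    label_p2 = make_mask_label((640, 640), [(100, 120, 140, 170)], scale=4, min_size=0,  max_size=64)
    label_p3 = make_mask_label((640, 640), [(100, 120, 140, 170)], scale=8, min_size=64, max_size=128)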
During training, when the mask loss function value is calculated based on the mask processing result and the mask label, the mask label corresponding to a mask sub-network can be obtained according to the network layer of the feature pyramid network to which that mask sub-network is connected, and the mask loss function value is then calculated based on the mask processing result of that mask sub-network and the acquired mask label. In one embodiment of the present application, a cross entropy loss function may be used to calculate the mask loss function value; specifically, the mask processing result and the mask label are substituted into the cross entropy loss function, and the mask loss value is obtained by calculation. In one embodiment, the cross entropy loss function is a Sigmoid cross entropy, i.e., a conventional pixel-level Sigmoid cross entropy implementation may be employed.
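A minimal sketch of this pixel-level Sigmoid cross entropy (assuming PyTorch) follows: the raw mask logits produced by the mask sub-network are compared per pixel with the binary mask label of the same scale.

    import torch
    import torch.nn.functional as F

    def mask_loss(mask_logits, mask_label):
        # mask_logits: (N, C, H, W) raw outputs of the mask sub-network
        # mask_label:  (N, C, H, W) binary mask label at the matching scale
        return F.binary_cross_entropy_with_logits(mask_logits, mask_label)

    logits = torch.randn(2, 1, 80, 80)
    label = (torch.rand(2, 1, 80, 80) > 0.5).float()
    loss = mask_loss(logits, label)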
In summary, according to the training method of the image detection model provided by this embodiment, the mask sub-network is introduced to assist in training the backbone network, the classification sub-network and the regression sub-network, so that the target detection model obtained by the final training (comprising the backbone network, the classification sub-network and the regression sub-network) responds more strongly to the region where a target is located in the image than to non-target regions, and the target to be detected in the image can be effectively highlighted. By enhancing the feature response of the target region, this training manner helps detect occluded or blurred targets and effectively improves the detection effect of the trained target detection model on images. In contrast, the related art mostly uses an attention network to shape the feature response map of the target detection model so that the response is higher in target regions and lower in non-target regions. However, since such an attention network is a parallel sub-network, during training it has little effect on the parameters of the backbone part of the network before the head network; it only changes the feature response at the end through weight multiplication, so its effect comes down to that single multiplication. The embodiment of the application instead introduces a mask sub-network that can influence the network parameters of the backbone network, the classification sub-network and the regression sub-network as a whole: by jointly training the backbone network, the mask sub-network, the classification sub-network and the regression sub-network with shared parameters, the influence of the mask sub-network on the overall network parameters during training is increased, thereby increasing the feature response of the target region. Moreover, the mask sub-network introduced in the embodiment of the application is used only during model training; once training is finished, the trained target detection model comprising the backbone network, the classification sub-network and the regression sub-network no longer contains the mask sub-network, so no mask sub-network needs to be constructed during actual detection, and the detection time is not affected while the target detection accuracy is improved.
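A hedged sketch of one joint training step follows: the three losses are summed and back-propagated together, so gradients from the mask branch also reach the shared backbone and head parameters. Equal loss weights and the model returning three outputs are assumptions for illustration; the patent does not fix these details.

    import torch

    def train_step(model, optimizer, image, cls_target, reg_target, mask_target, losses):
        cls_out, reg_out, mask_out = model(image)           # initial model with three parallel heads
        loss_cls = losses["cls"](cls_out, cls_target)
        loss_reg = losses["reg"](reg_out, reg_target)
        loss_mask = losses["mask"](mask_out, mask_target)
        total = loss_cls + loss_reg + loss_mask             # joint objective over all three branches
        optimizer.zero_grad()
        total.backward()                                    # back propagation through shared parameters
        optimizer.step()
        return loss_cls.item(), loss_reg.item(), loss_mask.item()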
In addition, the embodiment of the application trains the model with high-level supervision information (namely, the segmentation information of the image), which can achieve a better effect than an attention network. For ease of understanding, this is further explained by taking face detection as an example. The supervision information of face detection is the rectangular bounding box of every face in the image, and face detection has the particular property that faces are dense: there is essentially no gap within the face region, so the face fills most of the area of its rectangular bounding box. From this property it can be concluded that the outline of a face is basically similar to its rectangular bounding box, which is equivalent to saying that the supervision information of a face in target detection approximates the supervision information used in image semantic segmentation. The most direct embodiment of the application is therefore to train the network with the rectangular bounding box of the face (which approximates the face outline) as the segmentation information of the image, and this is a further improvement over existing target detection algorithms.
The embodiment of the application further provides a target detection method, applied to a device configured with a detection model, where the detection model is a target detection model obtained by the model training method described above. Referring to the flowchart of the target detection method shown in fig. 5, the method includes the following steps:
Step S502, acquiring an image of an object to be detected. The target to be detected may include a face or a vehicle, and may be other types of targets, which will not be described herein.
Step S504, the image is input into the target detection model.
Step S506, determining the category and the position of the object to be detected according to the output result of the object detection model.
Specifically, determining the category of the target to be detected according to the classification processing result output by the target detection model; and determining the position of the target to be detected according to the regression processing result output by the target detection model.
According to the target detection method described above, the category and the position of the target to be detected are determined through the target detection model. Because the target detection model is obtained by training with the above model training method, its feature response to the target region in the image is higher than to non-target regions; that is, the target to be detected in the image can be effectively highlighted, which improves the target detection accuracy.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific training process of the target detection model used in the above target detection method may refer to the corresponding process in the foregoing embodiment and is not described in detail here. During target detection, no loss function values need to be calculated. The difference between the target detection model and the initial model in the embodiment of the application is that the target detection model has no mask sub-network, while the remaining structure is the same; the network parameters of the backbone network, the classification sub-network and the regression sub-network in the target detection model are the parameters obtained after the mask sub-network assisted the training.
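Purely as an illustration, the sketch below shows detection with the trained model: the mask sub-network is absent, so only the classification and regression outputs are read. The function name, output shapes and score threshold are assumptions, not details from the patent.

    import torch

    @torch.no_grad()
    def detect(detection_model, image, score_threshold=0.5):
        cls_scores, reg_boxes = detection_model(image)   # no mask output at detection time
        probs = cls_scores.softmax(dim=-1)               # classification result -> category probabilities
        scores, labels = probs.max(dim=-1)
        keep = scores > score_threshold                  # keep confident predictions only
        return labels[keep], reg_boxes[keep], scores[keep]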
Corresponding to the foregoing training method of the image detection model, the embodiment further provides a training device of the image detection model, referring to a structural block diagram of a model training device shown in fig. 6, the device includes:
an image input module 602 for inputting a target training image into the initial model; the initial model comprises a backbone network and a head network which are sequentially connected, wherein the head network comprises a classification sub-network, a regression sub-network and a mask sub-network which are in parallel; the target training image carries a category label, a regression label and a mask label;
the extracting module 604 is configured to extract a feature response graph of the target training image through the backbone network;
the classification module 606 is configured to perform classification processing on the feature response graph through the classification sub-network, and calculate a classification loss value based on a classification processing result and a class label;
the regression module 608 is configured to perform regression processing on the feature response graph through the regression sub-network, and calculate a regression loss value based on the regression processing result and the regression label;
a masking module 610, configured to perform mask processing on the feature response graph through the mask sub-network, and calculate a mask loss value based on the mask processing result and the mask label;
The model generating module 612 is configured to train the initial model based on the classification loss value, the regression loss value and the mask loss value, so as to obtain a target detection model; the target detection model is used for detecting targets in the image and comprises a trained backbone network, a classification sub-network and a regression sub-network.
The training device for the image detection model provided by the embodiment of the application adopts an initial model comprising a backbone network, a classification sub-network, a regression sub-network and a mask sub-network. A feature response graph of the target training image is first extracted through the backbone network; the feature response graph is then subjected to classification processing, regression processing and mask processing through the classification sub-network, the regression sub-network and the mask sub-network respectively; the initial model is trained based on the classification loss value, the regression loss value and the mask loss value; and finally the target detection model comprising the trained backbone network, classification sub-network and regression sub-network is obtained. The embodiment of the application introduces the mask sub-network in the model training process and uses it to mask the feature response graph so as to assist in training the backbone network, the classification sub-network and the regression sub-network. This manner helps make the trained target detection model respond more strongly to the target region in the image than to non-target regions, that is, the target to be detected in the image can be effectively highlighted, which effectively improves the detection effect of the trained target detection model on images.
In one embodiment, the backbone network includes a residual network and a feature pyramid network; the extracting module is configured to: extract feature graphs of multiple scales of the target training image through the residual network; respectively input the feature graphs of the multiple scales into a plurality of network layers of the feature pyramid network, where each network layer correspondingly receives a feature graph of one scale; and perform feature fusion processing on the input feature graphs through each network layer of the feature pyramid network to obtain corresponding feature response graphs, where the receptive fields corresponding to the feature response graphs output by different network layers are different.
In one embodiment, the network layer of the feature pyramid network is at least 4 layers.
In one embodiment, each network layer of the feature pyramid network is connected to a parallel classification sub-network, regression sub-network, and mask sub-network, respectively. The classification module is further configured to classify the feature response graph output by the network layer of the feature pyramid network connected with the classification sub-network; the regression module is further configured to perform regression processing on the feature response graph output by the network layer of the feature pyramid network connected with the regression sub-network; and the masking module is further configured to perform mask processing on the feature response graph output by the network layer of the feature pyramid network connected with the mask sub-network.
In one embodiment, the mask tag is multiple; each mask label corresponds to a feature map of one scale; the mask module is further configured to: acquiring a mask label corresponding to the mask sub-network according to a network layer of the feature pyramid network connected with the mask sub-network; and calculating a mask loss function value based on the mask processing result of the mask subnetwork and the acquired mask label.
In one embodiment, the masking module is further configured to: substituting the mask processing result and the mask label into the cross entropy loss function, and calculating to obtain a mask loss value. In a specific embodiment, the cross entropy loss function is a Sigmoid function.
In one embodiment, the model generating module is further configured to: and adopting a back propagation algorithm to perform joint training on the backbone network, the classified sub-network, the regression sub-network and the mask sub-network based on the classified loss value, the regression loss value and the mask loss value until the classified loss value converges to a first preset value, the regression loss value converges to a second preset value and the mask loss value converges to a third preset value, and stopping training.
In one embodiment, the masking sub-network includes five convolutional layers connected in sequence; the network parameters of the first four convolution layers are the same, and the network parameters of the fifth convolution layer are different from those of the first four convolution layers. In a specific embodiment, the output dimension of the fifth convolution layer of the masking sub-network is equal to the number of categories of the object to be detected.
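A minimal sketch of this mask sub-network structure follows: five convolution layers in sequence, the first four with the same configuration and the fifth producing one output channel per category to be detected. The channel width, kernel size and use of ReLU are assumptions; the patent only fixes the number of layers and the output dimension.

    import torch.nn as nn

    class MaskSubNetwork(nn.Module):
        def __init__(self, in_channels=256, num_classes=1):
            super().__init__()
            layers = []
            for _ in range(4):                               # first four identically configured layers
                layers += [nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1),
                           nn.ReLU(inplace=True)]
            # fifth layer: output dimension equals the number of categories to be detected
            layers.append(nn.Conv2d(in_channels, num_classes, kernel_size=3, padding=1))
            self.layers = nn.Sequential(*layers)

        def forward(self, feature_response_map):
            return self.layers(feature_response_map)         # raw mask logits for the mask loss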
The device provided in this embodiment has the same implementation principle and technical effects as those of the foregoing embodiment, and for brevity, reference may be made to the corresponding content in the foregoing method embodiment for a part of the description of the device embodiment that is not mentioned.
Corresponding to the foregoing object detection method, the present embodiment also provides an object detection apparatus, see a block diagram of an object detection apparatus shown in fig. 7, which is applied to a device configured with a detection model; the detection model is a target detection model obtained by training by any model training method; the device comprises:
an image acquisition module 702 is configured to acquire an image of an object to be detected. In a specific application, the object to be detected comprises a human face or a vehicle.
An image input module 704 for inputting an image into the object detection model.
A determining module 706, configured to determine a category and a location of the object to be detected according to an output result of the object detection model.
According to the target detection device provided by this embodiment, the category and the position of the target to be detected are determined through the target detection model. Because the target detection model is obtained by training with the above model training method, its feature response to the target region in the image is higher than to non-target regions; that is, the target to be detected in the image can be effectively highlighted, which improves the target detection accuracy.
In one embodiment, the determining module is further configured to: determining the category of the target to be detected according to the classification processing result output by the target detection model; and determining the position of the target to be detected according to the regression processing result output by the target detection model.
The device provided in this embodiment has the same implementation principle and technical effects as those of the foregoing embodiment, and for brevity, reference may be made to the corresponding content in the foregoing method embodiment for a part of the description of the device embodiment that is not mentioned.
The embodiment also provides an electronic device, including: the system comprises a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor, and the processor and the memory are communicated through the bus, and the machine-readable instructions are executed by the processor to perform any model training method or any target detection method.
Referring to a schematic structural diagram of an electronic device shown in fig. 8, the electronic device specifically includes a processor 80, a memory 81, a bus 82 and a communication interface 83, where the processor 80, the communication interface 83 and the memory 81 are connected through the bus 82; the processor 80 is arranged to execute executable modules, such as computer programs, stored in the memory 81.
The memory 81 may include a high-speed random access memory (RAM) and may further include a non-volatile memory, such as at least one magnetic disk memory. The communication connection between the system network element and at least one other network element is implemented via at least one communication interface 83 (which may be wired or wireless), and the internet, a wide area network, a local area network, a metropolitan area network, etc. may be used.
Bus 82 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one bi-directional arrow is shown in FIG. 8, but this does not mean that there is only one bus or only one type of bus.
The memory 81 is used for storing a program, and the processor 80 executes the program after receiving an execution instruction. The method performed by the flow-defined apparatus disclosed in any of the foregoing embodiments of the present application may be applied to the processor 80 or implemented by the processor 80.
The processor 80 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware or instructions in software in processor 80. The processor 80 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processor, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field-programmable gate arrays (Field-Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory 81 and the processor 80 reads the information in the memory 81 and in combination with its hardware performs the steps of the method described above.
The training method and/or the target detection method of the image detection model provided in this embodiment may be executed by the electronic device, or the model training apparatus and/or the target detection apparatus provided in this embodiment may be disposed on the electronic device side.
Further, the present embodiment also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, performs any of the model training methods described above, or performs any of the object detection methods described above.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding process in the foregoing method embodiment for the specific working process of the above apparatus, which is not described herein again.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The foregoing is merely a description of specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art could readily conceive of variations or substitutions within the technical scope disclosed by the present application, and such variations or substitutions should be covered within the protection scope of the application. Therefore, the protection scope of the application shall be subject to the protection scope of the claims.

Claims (28)

1. A method of training an image detection model, the method comprising:
inputting a target training image into an initial model; the initial model comprises a backbone network and a head network which are connected in sequence, wherein the head network comprises a classification sub-network, a regression sub-network and a mask sub-network which are parallel; the target training image carries a category label, a regression label and a mask label; parameters of a preset number of convolution layers in front of the classification sub-network, the regression sub-network and the mask sub-network are shared;
extracting a characteristic response diagram of the target training image through the backbone network;
classifying the characteristic response graph through the classifying sub-network, and calculating to obtain a classifying loss value based on a classifying processing result and the class label;
Carrying out regression processing on the characteristic response graph through the regression sub-network, and calculating to obtain a regression loss value based on a regression processing result and the regression label;
performing mask processing on the characteristic response graph through the mask subnetwork, and calculating a mask loss value based on a mask processing result and the mask label; the masking process comprises masking the non-target area in the characteristic response diagram;
training the initial model based on the classification loss value, the regression loss value and the mask loss value to obtain a target detection model; the target detection model is a model for detecting targets in an image, and comprises the backbone network, the classification sub-network and the regression sub-network after training.
2. The method of claim 1, wherein the backbone network comprises a residual network and a feature pyramid network;
extracting a characteristic response diagram of the target training image through a backbone network, wherein the step comprises the following steps:
extracting feature graphs of multiple scales of the target training image through the residual error network;
Respectively inputting the feature graphs with various scales into a plurality of network layers of the feature pyramid network; wherein, each network layer correspondingly inputs a feature map of one scale;
performing feature fusion processing on the input feature graphs through each network layer of the feature pyramid network to obtain corresponding feature response graphs; wherein, the receptive fields corresponding to the characteristic response graphs output by different network layers are different.
3. The method of claim 2, wherein the network layer of the feature pyramid network is at least 4 layers.
4. The method of claim 2, wherein each network layer of the feature pyramid network is connected to the classification subnetwork, the regression subnetwork, and the mask subnetwork, respectively, in parallel;
the step of classifying the characteristic response graph through the classifying sub-network comprises the following steps: classifying the feature response graphs output by the network layer of the feature pyramid network connected with the classifying sub-network through the classifying sub-network;
the step of performing regression processing on the characteristic response graph through the regression sub-network comprises the following steps: carrying out regression processing on the feature response graph output by the network layer of the feature pyramid network connected with the regression sub-network through the regression sub-network;
The step of masking the characteristic response map through the masking sub-network includes: and masking the feature response graph output by the network layer of the feature pyramid network connected with the mask sub-network.
5. The method of claim 4, wherein the mask tag is a plurality of; each mask label corresponds to a scale of the feature map;
a step of calculating a mask loss function value based on a mask processing result and the mask tag, including:
acquiring a mask label corresponding to the mask sub-network according to a network layer of the feature pyramid network connected with the mask sub-network;
and calculating a mask loss function value based on a mask processing result of the mask subnetwork and the obtained mask label.
6. The method of claim 1, wherein the step of calculating a mask penalty value based on a mask processing result and the mask tag comprises: substituting the mask processing result and the mask label into a cross entropy loss function, and calculating to obtain a mask loss value.
7. The method of claim 6, wherein the cross entropy loss function is a Sigmoid function.
8. The method of claim 1, wherein the step of training the initial model based on the classification loss value, the regression loss value, and the mask loss value comprises:
and adopting a back propagation algorithm to perform joint training on the backbone network, the classification sub-network, the regression sub-network and the mask sub-network based on the classification loss value, the regression loss value and the mask loss value until the classification loss value converges to a first preset value, the regression loss value converges to a second preset value and the mask loss value converges to a third preset value, and stopping training.
9. The method of claim 1, wherein the masking sub-network comprises five convolutional layers connected in sequence; the network parameters of the first four convolution layers are the same, and the network parameters of the fifth convolution layer are different from the network parameters of the first four convolution layers.
10. The method of claim 9, wherein an output dimension of a fifth one of the convolutional layers of the masking sub-network is equal to a number of categories of objects to be detected.
11. A method of object detection, characterized in that the method is applied to a device configured with a detection model; the detection model is a target detection model trained by the method of any one of claims 1 to 10; the method comprises the following steps:
Acquiring an image of a target to be detected;
inputting the image into the target detection model;
and determining the category and the position of the target to be detected according to the output result of the target detection model.
12. The method of claim 11, wherein the step of determining the category and location of the object to be detected based on the output of the object detection model comprises:
determining the category of the target to be detected according to the classification processing result output by the target detection model;
and determining the position of the target to be detected according to the regression processing result output by the target detection model.
13. The method of claim 11, wherein the object to be detected comprises a human face or a vehicle.
14. A training device for an image detection model, the device comprising:
the image input module is used for inputting the target training image into the initial model; the initial model comprises a backbone network and a head network which are connected in sequence, wherein the head network comprises a classification sub-network, a regression sub-network and a mask sub-network which are parallel; the target training image carries a category label, a regression label and a mask label; parameters of a preset number of convolution layers in front of the classification sub-network, the regression sub-network and the mask sub-network are shared;
The extraction module is used for extracting the characteristic response graph of the target training image through the backbone network;
the classification module is used for carrying out classification processing on the characteristic response graph through the classification sub-network, and calculating a classification loss value based on a classification processing result and the class label;
the regression module is used for carrying out regression processing on the characteristic response graph through the regression sub-network, and calculating a regression loss value based on a regression processing result and the regression label;
the mask module is used for carrying out mask processing on the characteristic response graph through the mask subnetwork, and calculating a mask loss value based on a mask processing result and the mask label; the masking process comprises masking the non-target area in the characteristic response diagram;
the model generation module is used for training the initial model based on the classification loss value, the regression loss value and the mask loss value to obtain a target detection model; the target detection model is a model for detecting targets in an image, and comprises the backbone network, the classification sub-network and the regression sub-network after training.
15. The apparatus of claim 14, wherein the backbone network comprises a residual network and a feature pyramid network; the characteristic response diagram extracting module is used for:
extracting feature graphs of multiple scales of the target training image through the residual error network;
respectively inputting the feature graphs with various scales into a plurality of network layers of the feature pyramid network; wherein, each network layer correspondingly inputs a feature map of one scale;
performing feature fusion processing on the input feature graphs through each network layer of the feature pyramid network to obtain corresponding feature response graphs; wherein, the receptive fields corresponding to the characteristic response graphs output by different network layers are different.
16. The apparatus of claim 15, wherein the network layer of the feature pyramid network is at least 4 layers.
17. The apparatus of claim 15, wherein each network layer of the feature pyramid network is connected to the classification subnetwork, the regression subnetwork, and the mask subnetwork, respectively, in parallel;
the classification module is further configured to: classifying the feature response graphs output by the network layer of the feature pyramid network connected with the classifying sub-network through the classifying sub-network;
The regression module is further configured to: carry out regression processing, through the regression sub-network, on the feature response graph output by the network layer of the feature pyramid network connected with the regression sub-network;
the masking module is further configured to: the step of masking the characteristic response map through the masking sub-network includes: and masking the feature response graph output by the network layer of the feature pyramid network connected with the mask sub-network.
18. The apparatus of claim 17, wherein the mask tag is a plurality; each mask label corresponds to a scale of the feature map; the masking module is further configured to:
acquiring a mask label corresponding to the mask sub-network according to a network layer of the feature pyramid network connected with the mask sub-network;
and calculating a mask loss function value based on a mask processing result of the mask subnetwork and the obtained mask label.
19. The apparatus of claim 14, wherein the masking module is further to: substituting the mask processing result and the mask label into a cross entropy loss function, and calculating to obtain a mask loss value.
20. The apparatus of claim 19, wherein the cross entropy loss function is a Sigmoid function.
21. The apparatus of claim 14, wherein the model generation module is further to:
and adopting a back propagation algorithm to perform joint training on the backbone network, the classification sub-network, the regression sub-network and the mask sub-network based on the classification loss value, the regression loss value and the mask loss value until the classification loss value converges to a first preset value, the regression loss value converges to a second preset value and the mask loss value converges to a third preset value, and stopping training.
22. The apparatus of claim 14, wherein the masking sub-network comprises five convolutional layers connected in sequence; the network parameters of the first four convolution layers are the same, and the network parameters of the fifth convolution layer are different from the network parameters of the first four convolution layers.
23. The apparatus of claim 22, wherein an output dimension of a fifth one of the convolutional layers of the masking sub-network is equal to a number of categories of objects to be detected.
24. An object detection apparatus, characterized in that the apparatus is applied to a device configured with a detection model; the detection model is a target detection model trained by the method of any one of claims 1 to 10; the device comprises:
The image acquisition module is used for acquiring an image of the object to be detected;
the image input module is used for inputting the image into the target detection model;
and the determining module is used for determining the category and the position of the target to be detected according to the output result of the target detection model.
25. The apparatus of claim 24, wherein the determination module is further for:
determining the category of the target to be detected according to the classification processing result output by the target detection model;
and determining the position of the target to be detected according to the regression processing result output by the target detection model.
26. The apparatus of claim 24, wherein the object to be detected comprises a human face or a vehicle.
27. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor in communication with the memory over the bus, the machine-readable instructions when executed by the processor performing the method of any one of claims 1-10 or the method of any one of claims 11-13.
28. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs the method according to any one of claims 1 to 10 or performs the method according to any one of claims 11 to 13.
CN201811320550.6A 2018-11-07 2018-11-07 Training method and device of image detection model, and target detection method and device Active CN111160379B (en)
